# Problem Statement

Create an AI agent that leverages the capabilities of a large language model. The agent should extract answers from the content of a large PDF document and post the results to Slack. Ideally, use OpenAI LLMs. If you use the LangChain or LlamaIndex framework to implement this agentic functionality, please don’t use pre-built chains for the task; implement the logic yourself. Please write production-grade code rather than scripts, as we will be evaluating your code quality.


---

# GPU


In [1]:
!nvidia-smi

Thu Sep 12 10:49:29 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.58.02              Driver Version: 555.58.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3050 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   52C    P8              8W /   35W |      15MiB /   4096MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
!nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0


# POC Setup


In [3]:
import os
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.docstore.document import Document
from langchain_core.output_parsers import JsonOutputParser
from pydantic import BaseModel, Field
from openai import OpenAI
import json
from typing import List, Dict
from dotenv import load_dotenv

from transformers.agents import Tool, ReactJsonAgent

load_dotenv()


True

In [4]:
api_key = os.getenv("OPENAI_API_KEY")

if api_key:
    print("API Key loaded successfully!")
else:
    print("API Key not found!")

client = OpenAI(api_key=api_key)

API Key loaded successfully!


### Step: 1. Extract Text from the PDF


In [5]:
pdf_folder = "../docs"

In [6]:
def load_pdf_files(pdf_folder):
    """Load PDF files from a folder, return document objects and extracted text."""
    documents = []
    texts = []

    for file in os.listdir(pdf_folder):
        if file.endswith(".pdf"):
            pdf_path = os.path.join(pdf_folder, file)
            try:
                pdf_loader = PyMuPDFLoader(file_path=pdf_path)
                pdf_docs = pdf_loader.load()
                documents.extend(pdf_docs)
                texts.extend([doc.page_content for doc in pdf_docs])
            except Exception as e:
                print(f"Error loading {file}: {e}")

    return documents, texts

In [7]:
pdf_docs, pdf_texts = load_pdf_files(pdf_folder=pdf_folder)

### Step: 2. Split the Text into Chunks


In [8]:
def split_text_into_chunks(text):
    """Split the extracted text into chunks using text splitting."""
    text_splitter = RecursiveCharacterTextSplitter(
        # Try paragraph breaks first, then single newlines.
        chunk_size=1000, chunk_overlap=100, separators=["\n\n", "\n"]
    )
    chunks = text_splitter.split_text(text)
    return chunks

In [9]:
raw_data = " ".join(pdf_texts)
# Number of chunks the splitter produces (a count, not the chunks themselves).
num_chunks = len(split_text_into_chunks(raw_data))
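
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification for illustration only: `RecursiveCharacterTextSplitter` splits on separators rather than at fixed offsets, and the helper name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 100) -> list:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 5  # 50 characters of toy input
pieces = chunk_with_overlap(sample, chunk_size=20, overlap=5)
print([len(p) for p in pieces])  # [20, 20, 20]
```

Each window starts `chunk_size - overlap` characters after the previous one, so the last 5 characters of one chunk reappear at the start of the next.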

In [10]:
def wrap_text_in_documents(texts):
    """Wrap each text chunk in a Document object."""
    documents = [Document(page_content=text) for text in texts]
    return documents

### Initialize ChromaDB

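Create embeddings for each text and insert them into the Chroma vector database. Note that the cell below passes `pdf_texts` (one entry per PDF page) rather than the output of `split_text_into_chunks`, so the collection is indexed per page.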
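
As a rough illustration of what a vector store does at query time, the sketch below ranks toy vectors by cosine similarity. The three-dimensional vectors and the names `cosine_similarity`, `top_k`, and `toy_store` are made up for this example; real OpenAI embeddings have many more dimensions, and the metric and score convention Chroma actually uses depend on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=3):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings": hand-written vectors, not model output.
toy_store = [
    ("vacation policy", [0.9, 0.1, 0.0]),
    ("termination policy", [0.1, 0.9, 0.0]),
    ("closing statement", [0.0, 0.1, 0.9]),
]
results = top_k([1.0, 0.0, 0.0], toy_store, k=2)
print([text for _, text in results])  # ['vacation policy', 'termination policy']
```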


In [11]:
documents = wrap_text_in_documents(pdf_texts)

In [12]:
embeddings = OpenAIEmbeddings()


In [13]:
!rm -rf ../vdb/*

In [14]:
vectordb = Chroma.from_documents(documents, embeddings, persist_directory="../vdb")

In [15]:
vectordb._collection.count()

46

In [16]:
query = "What is the name of the company?"

In [17]:
context_chunks = vectordb.similarity_search_with_score(query, k=3)

In [18]:
print(context_chunks[0][0].page_content)

Closing Statement
Thank you for reading our handbook. We hope it has provided you with an understanding of our mission, history, and
structure as well as our current policies and guidelines. We look forward to working with you to create a successful
Company and a safe, productive, and pleasant workplace.
Shruti Gupta, CEO
Zania, Inc.
45



### LLM


### PromptTemplate


In [19]:
class AnswerSchema(BaseModel):
    question: str = Field(description="The question asked")
    answer: str = Field(description="The answer extracted from the context")

In [20]:
parser = JsonOutputParser(pydantic_object=AnswerSchema)

In [21]:
from langchain.prompts import PromptTemplate

# Define the PromptTemplate
template = """You are a helpful assistant. Use the provided context from the document to answer the question accurately. If the answer is not available, respond with 'Data Not Available'.

Context:
{context}

{format_instructions}

Question:
{query}
"""

# Create the PromptTemplate instance
prompt = PromptTemplate(
    template=template,
    input_variables=["context", "query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

In [22]:
def get_answer_from_context(query, context_chunks):
    # Combine the context into a single string
    context = "\n".join([chunk.page_content for chunk, _ in context_chunks])

    # Use the prompt template to format the prompt
    formatted_prompt = prompt.format(context=context, query=query)

    # Get the answer from the OpenAI API
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": formatted_prompt},
        ],
    )

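    # message.content is already a JSON string (response_format is "json_object").
    # json.dumps encodes it a second time, which is why the recorded outputs in this
    # notebook show escaped quotes; json.loads(...) would return a dict instead.
    answer = json.dumps(response.choices[0].message.content)
    return answer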

In [23]:
answer = get_answer_from_context(query, context_chunks)

In [24]:
answer

'"{\\n  \\"question\\": \\"What is the name of the company?\\",\\n  \\"answer\\": \\"Zania, Inc.\\"\\n}"'
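
The output above is a JSON string wrapped in a second layer of JSON encoding. One possible way to recover a structured answer is sketched below; `ParsedAnswer` and `decode_answer` are hypothetical names, and a plain dataclass stands in for `AnswerSchema` so the example runs without extra dependencies.

```python
import json
from dataclasses import dataclass


@dataclass
class ParsedAnswer:
    question: str
    answer: str


def decode_answer(raw: str) -> ParsedAnswer:
    """Decode a singly or doubly JSON-encoded answer into a ParsedAnswer."""
    value = raw
    # Unwrap at most two layers: the json.dumps wrapper, then the model's JSON text.
    for _ in range(2):
        if isinstance(value, str):
            value = json.loads(value)
    if not isinstance(value, dict):
        raise ValueError("expected a JSON object with 'question' and 'answer'")
    return ParsedAnswer(question=value["question"], answer=value["answer"])


# Rebuild the double-encoded shape shown in the cell output above.
model_text = '{\n  "question": "What is the name of the company?",\n  "answer": "Zania, Inc."\n}'
double_encoded = json.dumps(model_text)
parsed = decode_answer(double_encoded)
print(parsed.answer)  # Zania, Inc.
```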

In [25]:
def get_answers_for_questions(
    questions: List[str], vectordb, k: int = 3
) -> Dict[str, str]:
    """
    This method takes a list of questions, performs a similarity search for each question,
    and returns a dictionary containing the answers for each question.

    Args:
        questions (List[str]): A list of questions to answer.
        vectordb: The vector database used for similarity search.
        k (int): Number of context chunks to retrieve for each question.

    Returns:
        Dict[str, str]: A dictionary where the keys are questions and values are the answers.
    """
    answers = {}

    # Loop through each question
    for question in questions:
        # Perform a similarity search to get context chunks for the current question
        context_chunks = vectordb.similarity_search_with_score(question, k=k)

        # Get the answer for the current question using the retrieved context
        answer = get_answer_from_context(question, context_chunks)

        # Store the answer in the dictionary
        answers[question] = answer

    return answers

In [26]:
questions = [
    "What is the name of the company?",
    "Who is the CEO of the company?",
    "What is their vacation policy?",
    "What is the termination policy?",
]

In [27]:
answers = get_answers_for_questions(questions, vectordb)

# Traditional RAG Output


In [28]:
from pprint import pprint

for question, answer in answers.items():
    print("======================")
    print(f"\nQuestion: {question}")
    pprint(f"Answer: {answer}")


Question: What is the name of the company?
('Answer: "{\\n  \\"question\\": \\"What is the name of the company?\\",\\n  '
 '\\"answer\\": \\"Zania, Inc.\\"\\n}"')

Question: Who is the CEO of the company?
('Answer: "{\\n  \\"question\\": \\"Who is the CEO of the company?\\",\\n  '
 '\\"answer\\": \\"Shruti Gupta\\"\\n}"')

Question: What is their vacation policy?
('Answer: "{\\n  \\"question\\": \\"What is their vacation policy?\\",\\n  '
 '\\"answer\\": \\"Zania, Inc. provides employees with paid vacation. All '
 'full-time regular employees are eligible to receive vacation time '
 'immediately upon hire. Vacation is calculated according to your work '
 'anniversary year, calendar year, or fiscal year. Employees accrue vacation '
 'based on their length of service, with amounts varying each year. Vacation '
 'granted during the first year will be prorated based on the hire date. '
 'Employees must request vacation in advance and must take it in increments of '
 'at least a specified 

---


# Agentic RAG


In [29]:
class RetrieverTool(Tool):
    name = "retriever"
    description = "Using semantic similarity, retrieves some documents from the knowledge base that have the closest embeddings to the input query."
    inputs = {
        "query": {
            "type": "text",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "text"

    def __init__(self, vectordb, **kwargs):
        super().__init__(**kwargs)
        self.vectordb = vectordb

    def forward(self, query: str) -> str:
        assert isinstance(query, str), "Your search query must be a string"

        docs = self.vectordb.similarity_search(
            query,
            k=7,
        )

        return "\nRetrieved documents:\n" + "".join(
            [
                f"===== Document {str(i)} =====\n" + doc.page_content
                for i, doc in enumerate(docs)
            ]
        )


# Create an instance of the RetrieverTool
retriever_tool = RetrieverTool(vectordb)

In [30]:
retriever_tool

<__main__.RetrieverTool at 0x7d2cfac92980>

In [31]:
from transformers.agents.llm_engine import MessageRole, get_clean_message_list


openai_role_conversions = {
    MessageRole.TOOL_RESPONSE: MessageRole.USER,
}


class OpenAIEngine:
    def __init__(self, model_name="gpt-4o-mini"):
        self.model_name = model_name
        self.client = client

    def __call__(self, messages, stop_sequences=None):
        messages = get_clean_message_list(
            messages, role_conversions=openai_role_conversions
        )

        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=messages,
            stop=stop_sequences,
            temperature=0.5,
        )
        return response.choices[0].message.content

In [32]:
llm_engine = OpenAIEngine()

In [33]:
agent = ReactJsonAgent(
    tools=[retriever_tool], llm_engine=llm_engine, max_iterations=5, verbose=2
)

In [34]:
agent

<transformers.agents.agents.ReactJsonAgent at 0x7d2cfac9a890>

### Function to run the agent


In [35]:
def run_agentic_rag(question: str) -> str:
    enhanced_question = f"""
    Using the information contained in your knowledge base, which you can access with the 'retriever' tool, give a comprehensive answer to the question below. Respond only to the question asked; the response should be concise and relevant to the question. If you cannot find information, do not give up; try calling your retriever again with different arguments. Make sure to cover the question completely by calling the retriever tool several times with semantically different queries. Your queries should not be questions but affirmative-form sentences: e.g. rather than "What is the termination policy?" or "What is their vacation policy?", the query should be "termination policy" or "vacation policy".

    Question:
    {question}
"""

    return agent.run(enhanced_question)
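
For intuition about what happens inside `agent.run`, here is a toy sketch of a ReAct-style loop: the model emits a JSON tool call, the tool's output is appended as an observation, and the loop ends on a `final_answer` call or at the iteration limit. This is not the `transformers` implementation; `scripted_llm`, `toy_retriever`, `run_react_loop`, and the `action` / `action_input` keys are assumptions of this sketch, and the scripted function stands in for a real LLM.

```python
import json
from typing import Callable, Dict, List


def toy_retriever(query: str) -> str:
    """Keyword lookup standing in for the vector-store retriever."""
    knowledge = {
        "vacation": "Zania, Inc. provides paid vacation to all eligible employees.",
        "ceo": "Shruti Gupta, CEO",
    }
    hits = [text for key, text in knowledge.items() if key in query.lower()]
    return "\n".join(hits) if hits else "No documents found."


def scripted_llm(history: List[str]) -> str:
    """Stand-in for a model: search first, then answer from the last observation."""
    prefix = "Observation: "
    observations = [h for h in history if h.startswith(prefix)]
    if not observations:
        return json.dumps({"action": "retriever", "action_input": {"query": "paid vacation policy"}})
    return json.dumps({"action": "final_answer", "action_input": {"answer": observations[-1][len(prefix):]}})


def run_react_loop(task: str, llm: Callable[[List[str]], str],
                   tools: Dict[str, Callable[..., str]], max_iterations: int = 5) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_iterations):
        call = json.loads(llm(history))  # the model's JSON tool call
        name, args = call["action"], call["action_input"]
        if name == "final_answer":
            return args["answer"]
        observation = tools[name](**args)  # run the requested tool
        history.append(f"Observation: {observation}")
    return "Stopped: iteration limit reached."


print(run_react_loop("What is their vacation policy?", scripted_llm, {"retriever": toy_retriever}))
```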

In [36]:
# Example usage
question = "What is their vacation policy?"
answer = run_agentic_rag(question)
print(f"Question: {question}")
print(f"Answer: {answer}")

System prompt is as follows:
You are an expert assistant who can solve any task using JSON tool calls. You will be given a task to solve as best you can.
To do so, you have been given access to the following tools: 'retriever', 'final_answer'
The way 

Question: What is their vacation policy?
Answer: Zania, Inc. provides paid vacation to all eligible employees. Vacation granted during the first year of employment is prorated based on the hire date. Employees accrue vacation based on their length of service, with specific amounts designated for each year of employment. Employees are encouraged to use their vacation time and must request it in advance from their Manager. Unused vacation can typically be carried over to the following year, but specific conditions apply depending on the company's policies.


In [37]:
print(f"Question: {question}")
print(f"Answer: {answer}")

Question: What is their vacation policy?
Answer: Zania, Inc. provides paid vacation to all eligible employees. Vacation granted during the first year of employment is prorated based on the hire date. Employees accrue vacation based on their length of service, with specific amounts designated for each year of employment. Employees are encouraged to use their vacation time and must request it in advance from their Manager. Unused vacation can typically be carried over to the following year, but specific conditions apply depending on the company's policies.
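
The problem statement also asks for the results to be posted to Slack, which this notebook does not yet do. A minimal sketch using a Slack incoming webhook and only the standard library might look like the following. The function names and the `SLACK_WEBHOOK_URL` environment variable mentioned afterwards are assumptions, and the example only builds the payload so that it runs without network access.

```python
import json
import urllib.request
from typing import Dict


def build_slack_payload(answers: Dict[str, str]) -> dict:
    """Format question/answer pairs as a single Slack message body."""
    lines = [f"*Q:* {question}\n*A:* {answer}" for question, answer in answers.items()]
    return {"text": "\n\n".join(lines)}


def post_to_slack(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload to an incoming-webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Only the payload is built here; no request is sent.
payload = build_slack_payload({"Who is the CEO of the company?": "Shruti Gupta"})
print(payload["text"])
```

Sending would then be a single call such as `post_to_slack(os.environ["SLACK_WEBHOOK_URL"], payload)`, with error handling and retries added as the deployment requires.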
