# Chapter 10: Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) combines the generative power of LLMs with the ability to fetch external information at query time. Instead of relying only on the data they were trained on, models can pull in relevant external data as they generate responses, which makes their answers more accurate, up to date, and context-aware. This makes them more effective in areas like conversational AI, decision support, and automated research. You may have noticed that when an LLM uses tools, it usually forms its final answer based on the output of the last tool it called, often pulling data from a JSON response. In RAG, the focus shifts to giving the LLM the ability to answer questions using information from documents.

In [None]:
from pprint import pprint
import numpy as np
import pandas as pd
from language_models.proxy_client import ProxyClient
from language_models.agent import Agent, OutputType, PromptingStrategy, Workflow, WorkflowLLMStep
from language_models.models.llm import ChatMessage, ChatMessageRole, OpenAILanguageModel
from language_models.tools.tool import Tool
from language_models.models.embedding import SentenceTransformerEmbeddingModel
from language_models.vector_stores import FAISSVectorStore, DistanceMetric
from language_models.settings import settings
from langchain_core.documents import Document
from pydantic import BaseModel, Field
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
proxy_client = ProxyClient(
    client_id=settings.CLIENT_ID,
    client_secret=settings.CLIENT_SECRET,
    auth_url=settings.AUTH_URL,
    api_base=settings.API_BASE,
)

## Idea

Integrating RAG into AI systems can significantly enhance LLM responses. RAG allows the LLM to reference a knowledge base outside of its training data sources before generating a response. This is useful when dealing with documents or content specific to your business that LLMs may not be familiar with. By using external information to ground the LLM, it can effectively answer questions related to those topics. However, careful implementation is essential.

![rag](./assets/images/rag.png)

In [None]:
system_prompt = "You are an expert in job postings. Respond with the most accurate information about the job."

llm = OpenAILanguageModel(
    proxy_client=proxy_client,
    model="gpt-4",
    max_tokens=250,
    temperature=0.2,
)

agent = Agent.create(
    llm=llm,
    system_prompt=system_prompt,
    prompt="{question}",
    prompt_variables=["question"],
    output_type=OutputType.STRING,
    prompting_strategy=PromptingStrategy.SINGLE_COMPLETION,
)

In this example, we ask about the salary range for a job position, specifically an airport engineer. Since the LLM lacks specific details about our company, it can only rely on the general knowledge contained in its training data to generate an answer.

In [None]:
response = agent.invoke({"question": "What is the salary range of an airport engineer."})

In [None]:
print(response.final_answer)

Here, we provide the relevant job details from our document, allowing the LLM to use this specific information to provide an accurate response to the user.

In [None]:
question = """What is the salary range of an airport engineer.

Context:
AIRPORT ENGINEER
Class Code:       7256
Open Date:  07-06-18
(Exam Open to All, including Current City Employees)

ANNUAL SALARY

$105,005 to $153,509 and $111,854 to $163,532."""

response = agent.invoke({"question": question})

In [None]:
print(response.final_answer)
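
The manual step above, appending the document text to the question, can be wrapped in a small helper. This is a minimal sketch; `build_grounded_prompt` is a hypothetical name and not part of the `language_models` package.

```python
def build_grounded_prompt(question: str, context: str) -> str:
    """Append retrieved context to the user question, as done by hand above."""
    return f"{question}\n\nContext:\n{context}"


toy_context = "AIRPORT ENGINEER\nClass Code:       7256"
prompt = build_grounded_prompt("What is the salary range of an airport engineer.", toy_context)
print(prompt)
```

The resulting string could then be passed to the agent as the `question` variable, exactly like the hand-written prompt.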

## Embeddings

To compare documents and automatically find relevant ones, we create embeddings that capture the semantic meaning of both the documents and the questions posed to the LLM. This involves converting text into numerical representations, or vectors, using an embedding model. Typically, an embedding model from the same provider as the LLM is used for this purpose. However, due to limited access to OpenAI's embedding model, we will use sentence transformers to showcase this process. By comparing these vectors, we can assess the similarity between documents and queries and efficiently identify the most relevant information.

![embedding](./assets/images/embedding.png)

In [None]:
embedding_model = SentenceTransformerEmbeddingModel(model="all-MiniLM-L6-v2")

First, we convert the user query into a vector representation.

In [None]:
query = "What is the salary range of an airport engineer."
embedding1 = embedding_model.embed_query(query)

In [None]:
print(embedding1)

Next, we convert our documents into vectors, allowing us to compare them with the user's question. By embedding both the user's query and the documents, we project the query into the same vector space as the documents. This enables us to identify the closest matches, such as the 5 most similar documents.

![vector-space](./assets/images/vector-space.png)

In [None]:
document = """AIRPORT ENGINEER
Class Code:       7256
Open Date:  07-06-18
(Exam Open to All, including Current City Employees)

ANNUAL SALARY

$105,005 to $153,509 and $111,854 to $163,532."""

embedding2 = embedding_model.embed_query(document)

To compute the similarity between vectors, we can use measures such as Euclidean distance, cosine similarity, or the inner product:
- **Euclidean distance:** Measures the straight-line distance between two points. Smaller values indicate greater similarity.
- **Cosine similarity:** Measures the cosine of the angle between two vectors, ranging from -1 to 1. Values closer to 1 indicate greater similarity.
- **Inner product:** Measures the dot product of two vectors. Higher values typically indicate greater similarity.
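
To make the three measures concrete, here is a minimal sketch that computes each of them on toy two-dimensional vectors using only the standard library. The function names and the toy vectors are illustrative assumptions, not part of the notebook's packages.

```python
import math


def inner_product(a, b):
    """Dot product; higher values typically indicate greater similarity."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance; smaller values indicate greater similarity."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors, in the range [-1, 1]."""
    return inner_product(a, b) / (math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b)))


query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(euclidean_distance(query_vec, same_direction))  # 1.0
print(cosine_similarity(query_vec, same_direction))   # 1.0
print(inner_product(query_vec, orthogonal))           # 0.0
```

Note that cosine similarity ignores vector length: `same_direction` is twice as long as `query_vec` but still scores 1.0. For vectors normalized to unit length, the inner product and cosine similarity coincide.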

In [None]:
cosine_similarity = np.dot(embedding1, embedding2) / (np.linalg.norm(embedding1) * np.linalg.norm(embedding2))
print(f"Cosine similarity: {cosine_similarity:.4f}")
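
Retrieving "the 5 most similar documents" amounts to scoring every document vector against the query vector and keeping the best ones. The sketch below shows this brute-force ranking on hand-made toy vectors; `top_k` and the vectors are hypothetical, and a vector store such as FAISS performs the same job far more efficiently.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, doc_vecs, k=5):
    """Return (doc_id, score) pairs for the k documents most similar to the query."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors chosen by hand; real embeddings have hundreds of dimensions.
toy_doc_vecs = {
    "airport_engineer": [0.9, 0.1],
    "airport_police_captain": [0.6, 0.8],
    "accountant": [0.0, 1.0],
}
toy_query_vec = [1.0, 0.0]

ranking = top_k(toy_query_vec, toy_doc_vecs, k=2)
print([doc_id for doc_id, _ in ranking])  # ['airport_engineer', 'airport_police_captain']
```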

## Asking Questions Related to the Hiring Process

As a company, part of our operations involves hiring individuals for a variety of positions. To facilitate the hiring process and improve efficiency, we've incorporated the use of an LLM to assist us. Our hiring process entails the collaboration of various entities, including recruiters, hiring managers, engineers, applicants, departments, and the specific job roles themselves. For this demonstration, we'll use the LLM to respond to inquiries related to our hiring procedures and overall business operations.

We'll focus on documents containing job descriptions, but we'll also assume that documents related to the other entities exist. These additional documents are hypothetical; we simply assume they are available for reference.

In [None]:
df_jobs = pd.read_csv("./assets/datasets/jobs.csv")
df_jobs.head()

In [None]:
df_resumes = pd.read_csv("./assets/datasets/resumes.csv")
df_resumes.head()

### Naive Implementation

For the simplest implementation, we chunk our available documents into smaller pieces if necessary, embed them, and store them in a vector database. For this demonstration, we'll use FAISS.

In [None]:
documents = [Document(page_content=document) for document in df_jobs.text.tolist() + df_resumes.Resume_str.tolist()]
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", " ", ""], chunk_size=2000, chunk_overlap=100)
documents = text_splitter.split_documents(documents)
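
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window splitter. It is only a sketch of the idea: `split_with_overlap` is a hypothetical helper, and `RecursiveCharacterTextSplitter` behaves differently in that it tries the given separators in order so that paragraphs and lines stay together where possible.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a boundary is not lost entirely in either chunk.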

In [16]:
try:
    # Reuse a previously saved index if one exists.
    vector_store = FAISSVectorStore.load_local("./assets/datasets/", "embeddings")
except Exception:
    # No usable index on disk: build it from the documents and cache it for next time.
    vector_store = FAISSVectorStore.from_documents(
        documents=documents,
        embedding_model=embedding_model,
        distance_metric=DistanceMetric.COSINE_SIMILARITY,
    )
    vector_store.save_local("./assets/datasets/", "embeddings")

Adding document context to every user question increases costs and can even worsen answer quality when the LLM doesn't need external knowledge. Instead, we'll expose the RAG functionality as a tool. This way, the LLM can autonomously decide when to search the documents.

In [None]:
tool_name = "Search Documents"
tool_description = "Use this tool to search for documents relevant to the user's query"

class SearchDocuments(BaseModel):
    user_text: str = Field(description="The user query")
    fetch_k: int = Field(5, description="The number of documents to return")

In our basic retriever, we've implemented everything discussed so far. When a user asks a question, the LLM uses the search tool to find relevant documents and then provides an answer based on the information from those documents.

In [None]:
class BasicRetriever(BaseModel):
    """Class that implements naive RAG."""

    vector_store: FAISSVectorStore

    def get_relevant_documents(self, user_text: str, fetch_k: int = 5) -> str:
        """Gets relevant documents."""
        documents = self.vector_store.similarity_search(user_text, fetch_k)
        documents = [document for document, _ in documents]
        return "\n\n".join(document.page_content for document in documents)

basic_retriever = Tool(
    function=BasicRetriever(vector_store=vector_store).get_relevant_documents,
    name=tool_name,
    description=tool_description,
    args_schema=SearchDocuments,
)

In [None]:
system_prompt = """You are an expert in job postings. Respond with the most accurate information about the job.

Use the search tool to answer the user's question."""

llm = OpenAILanguageModel(
    proxy_client=proxy_client,
    model="gpt-4-32k",
    max_tokens=500,
    temperature=0.2,
)

basic_retriever_agent = Agent.create(
    llm=llm,
    system_prompt=system_prompt,
    prompt="{question}",
    prompt_variables=["question"],
    output_type=OutputType.STRING,
    tools=[basic_retriever],
    prompting_strategy=PromptingStrategy.CHAIN_OF_THOUGHT,
)

In [None]:
output = basic_retriever_agent.invoke({"question": "Give me the job description of an airport engineer."})

In [None]:
print(output.final_answer)

Another popular implementation is contextual compression, which uses two LLMs. The first LLM evaluates the documents retrieved from the vector database and filters out irrelevant ones. Each document is presented to this LLM in turn to determine its relevance to the user's question. The remaining documents are then passed to the second LLM, which interacts directly with the user and provides the final answer.

In [None]:
INSTRUCTIONS = """Given the following question and context, respond with YES if the context is relevant to the question and NO if it isn't.

Question:
{question}

Context:
{context}"""


class ContextualCompressionRetriever(BaseModel):
    """Class that implements a contextual compression retriever."""

    llm: OpenAILanguageModel
    vector_store: FAISSVectorStore

    def _parse_output(self, output: str) -> bool:
        """Parses the LLM output into a relevance decision."""
        # Match whole words so that e.g. "NOT" or "KNOW" is not mistaken for "NO".
        words = {word.strip(".,;:!?\"'") for word in output.upper().split()}
        if "YES" in words and "NO" in words:
            raise ValueError(f"Ambiguous response. Both 'YES' and 'NO' received: {output}.")
        elif "YES" in words:
            return True
        elif "NO" in words:
            return False
        else:
            raise ValueError(f"Expected output value to include either 'YES' or 'NO'. Received {output}.")

    def _compress_documents(self, user_text: str, documents: list[Document]) -> list[Document]:
        """Filters relevant documents."""
        compressed_documents = []
        for document in documents:
            prompt = INSTRUCTIONS.format(question=user_text, context=document.page_content)
            output = self.llm.get_completion([ChatMessage(role=ChatMessageRole.USER, content=prompt)])
            try:
                include_doc = self._parse_output(output)
            except ValueError:
                include_doc = False
            if include_doc:
                compressed_documents.append(document)
        return compressed_documents

    def get_relevant_documents(self, user_text: str, fetch_k: int = 5) -> str:
        """Gets relevant documents."""
        documents = self.vector_store.similarity_search(user_text, fetch_k)
        documents = [document for document, _ in documents]
        compressed_documents = self._compress_documents(user_text, documents)
        return "\n\n".join(document.page_content for document in compressed_documents)

llm = OpenAILanguageModel(
    proxy_client=proxy_client,
    model="gpt-4",
    max_tokens=16,
    temperature=0.2,
)

contextual_compression_retriever = Tool(
    function=ContextualCompressionRetriever(llm=llm, vector_store=vector_store).get_relevant_documents,
    name=tool_name,
    description=tool_description,
    args_schema=SearchDocuments,
)

In [None]:
system_prompt = """You are an expert in job postings. Respond with the most accurate information about the job.

Use the search tool to answer the user's question."""

llm = OpenAILanguageModel(
    proxy_client=proxy_client,
    model="gpt-4-32k",
    max_tokens=500,
    temperature=0.2,
)

contextual_compression_retriever_agent = Agent.create(
    llm=llm,
    system_prompt=system_prompt,
    prompt="{question}",
    prompt_variables=["question"],
    output_type=OutputType.STRING,
    tools=[contextual_compression_retriever],
    iterations=5,
)

In [None]:
output = contextual_compression_retriever_agent.invoke({"question": "Give me the job description of an airport engineer."})

In [None]:
print(output.final_answer)

Every naive implementation of RAG, including those demonstrated previously and others (RAG + keyword search, RAG + re-ranking, etc.), encounters a common challenge.

Naive RAG is adequate for handling highly specific objects. For instance, if the vector space contains only documents related to earthquakes, a particular car model or a particular job, naive RAG is sufficient. However, such simplistic RAG solutions present difficulties in ensuring that the retrieved documents contain the necessary context to address the query effectively. For example, when projecting an embedding into the vector space and examining the closest 5 neighbors or the 5 most similar documents, there's no guarantee that we'll only obtain documents related to the specific object of interest.

![naive-rag-vector-space](./assets/images/naive-rag-vector-space.png)

In our basic demonstration, we've confined our vector space to contain solely documents about the job listings; nonetheless, the LLM's responses remain subpar. After reviewing the logs of the LLM's Chain-of-Thought process, it is evident that the search tool retrieves irrelevant documents, such as those discussing unrelated roles like an airport police captain. While the information about the airport police captain could potentially be used to formulate an answer, it ultimately depends on the LLM. However, we can enhance the process by excluding such irrelevant documents from the beginning.

When employing contextual compression, where documents are initially reviewed and filtered by another LLM to remove irrelevant content, we observed that the LLM effectively eliminated irrelevant documents in this instance. However, this success is not guaranteed every time, so it's important not to overly rely on the LLM's judgement. Additionally, this method has a drawback: the LLM filtered out 2 documents deemed irrelevant, reducing the context from 5 documents to 3. This limited information might be insufficient for the LLM to deliver a high-quality answer.

Now, envision a scenario where our document repository extends beyond job-related materials to encompass various entities integral to our hiring process. Applicants provide documents such as resumes and cover letters; recruiters, engineers, and managers maintain candidate-related notes from interviews; departments house documents outlining team compositions and project details for prospective hires. If all these documents were incorporated into the vector space, it would worsen the issue, leading to further deterioration in the LLM's responses.

This challenge arises from the ambiguity in document usage and accessibility. For instance, when querying about an airport engineer position, applicant documents may mention prior experience in the role, recruiters/engineers may record interactions related to airport engineering roles, and departments may outline their need for such positions. This lack of control over document utilization and access complicates the task and contributes to the degradation of the LLM's responses. For improved outcomes, we can use an ontology/knowledge graph.

Technically, we could add metadata, such as job titles, to achieve similar results to the ontology/knowledge graph implementation. However, as mentioned earlier, we assume there are documents related to other entities as well. This complicates the situation because documents tied to different business objects may contain different metadata. While searching using metadata is still possible, it would require filtering through numerous documents, which is slow. In addition, managing documents with different metadata in the same database is tedious. This approach would also make it more difficult for users to understand.

One might consider standardizing all documents to include all metadata, but this would lead to many documents being filled with irrelevant metadata and dummy values, resulting in significant inefficiency in space usage. Another issue is the potential for metadata fields to have the same name across different business objects, which would prevent exclusive searches for relevant documents. Additionally, some metadata, like document creation dates or filenames, might not be known to users, who could simply look up the documents if they had this information. Therefore, in our demo, we perform a simple search without metadata filtering, as this is a more practical approach.
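
The name-collision problem can be shown with a few toy records. In this sketch, the documents, field names, and `filter_by_metadata` helper are all hypothetical; the point is only that a shared field name such as `name` cannot single out one kind of business object.

```python
def filter_by_metadata(documents, **criteria):
    """Keep documents whose metadata matches every key/value pair in criteria."""
    return [
        doc
        for doc in documents
        if all(doc["metadata"].get(key) == value for key, value in criteria.items())
    ]


# Toy records from three business objects that all use a "name" field.
toy_documents = [
    {"metadata": {"object_type": "job", "name": "AIRPORT ENGINEER"}, "text": "Job posting."},
    {"metadata": {"object_type": "department", "name": "AIRPORT ENGINEER"}, "text": "Department staffing request."},
    {"metadata": {"object_type": "applicant", "name": "Jane Doe"}, "text": "Resume."},
]

matches = filter_by_metadata(toy_documents, name="AIRPORT ENGINEER")
print([doc["metadata"]["object_type"] for doc in matches])  # ['job', 'department']

exclusive = filter_by_metadata(toy_documents, name="AIRPORT ENGINEER", object_type="job")
print(len(exclusive))  # 1
```

Filtering on `name` alone mixes two object types; an exclusive search needs an extra discriminator, which every document would then have to carry.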

### Ontology/Knowledge Graph Implementation

Implementing ontology/knowledge graph-based RAG presents a pragmatic solution. When considering knowledge graphs, we often envision structures similar to the figure below.

![knowledge-graph](./assets/images/knowledge-graph.png)

Nevertheless, we need to rethink knowledge graphs in the context of LLMs. Ideally, we aim to create a digital twin of our business processes, in which nodes represent objects and edges represent the links between them. Since the use case was introduced earlier, we're already familiar with the entities involved in the hiring process. These include applicants, who provide documents such as cover letters and resumes, along with metadata like their name, birthday, degrees, and responses to application questions. The people involved in the process (recruiters, engineers, and hiring managers) may maintain notes about applicants from interviews, alongside metadata such as name, birthday, and department affiliation. Departments also play a role, providing both metadata and documents, and the job postings themselves contain metadata such as job title, salary range, department, and required technical skills. The figure below provides a basic outline of the process. While it may not depict every detail accurately, it conveys the overall concept.

![knowledge-graph-vector-space](./assets/images/knowledge-graph-vector-space.png)

Now that we've established the ontology/knowledge graph, we can utilize traversal methods like depth-first search and breadth-first search. This enables us to initially locate the relevant object or entity related to the question. Moreover, we can leverage the available metadata to refine our search process, effectively reducing the pool of documents to be searched. Essentially, this allows us to exclude entirely irrelevant documents and focus solely on those that are highly relevant. This structure also provides enhanced control and security. We can decide how the ontology/knowledge graph is traversed and specify which data, including metadata, the LLM is permitted to access.
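
As a rough illustration of controlled traversal, the sketch below runs a breadth-first search over a tiny hand-made graph and only follows edge types that have been explicitly allowed. The node names, edge types, and the `reachable` helper are assumptions made for this example, not the notebook's actual data model.

```python
from collections import deque

# Toy graph: each node maps to a list of (edge_type, neighbor) pairs.
toy_graph = {
    "job:AIRPORT ENGINEER": [
        ("posted_by", "department:AIRPORTS"),
        ("has_applicant", "applicant:A-001"),
    ],
    "department:AIRPORTS": [],
    "applicant:A-001": [("has_document", "doc:resume-A-001")],
    "doc:resume-A-001": [],
}


def reachable(graph, start, allowed_edges):
    """Breadth-first traversal that only follows edge types listed in allowed_edges."""
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for edge_type, neighbor in graph.get(node, []):
            if edge_type in allowed_edges and neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


print(reachable(toy_graph, "job:AIRPORT ENGINEER", {"posted_by"}))
# ['job:AIRPORT ENGINEER', 'department:AIRPORTS']
print(reachable(toy_graph, "job:AIRPORT ENGINEER", {"has_applicant", "has_document"}))
# ['job:AIRPORT ENGINEER', 'applicant:A-001', 'doc:resume-A-001']
```

Restricting the allowed edge types is one simple way to express which parts of the graph, and therefore which documents, the LLM may reach for a given question.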

In our simple scenario, which focuses solely on job-related data and retains only the job title as metadata, we'll implement a chain of LLMs. The first LLM identifies the precise job title from the question, and the second LLM uses it to search the relevant documents. It applies the identified job title as a metadata filter to narrow the document pool and then runs a similarity search to pinpoint the most relevant segments, in this case for the role of an airport engineer.

Since we're acknowledging the presence of documents linked to different business objects, we'd typically need an extra step to locate the relevant object first. However, since we're exclusively dealing with job-related documents, we'll skip this step in our chain.

| Inputs |
|------|
| question |

<br/>

| LLM that finds the job title |
|------|
| `Inputs` question |
| `Output` job title |

<br/>

| LLM that answers the user question |
|------|
| `Inputs` job title |
| `Output` content |

<br/>

| Output |
|------|
| content |
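
The chain in the tables above can be sketched with plain functions. Both steps here are deliberately simple stand-ins for the LLM agents (string matching instead of a model), and all names and toy documents are hypothetical; the sketch only shows how the output of the first step feeds the second.

```python
def find_job_title(question: str, available_titles: list[str]) -> str:
    """Stand-in for the first LLM: pick the known title mentioned in the question."""
    lowered = question.lower()
    for title in available_titles:
        if title.lower() in lowered:
            return title
    raise ValueError("No known job title found in the question.")


def answer_from_documents(job_title: str, documents_by_title: dict[str, list[str]]) -> str:
    """Stand-in for the second LLM: return only the documents filed under the title."""
    return "\n\n".join(documents_by_title[job_title])


def run_chain(question: str, documents_by_title: dict[str, list[str]]) -> str:
    """Step 1 resolves the job title; step 2 only sees documents for that title."""
    job_title = find_job_title(question, list(documents_by_title))
    return answer_from_documents(job_title, documents_by_title)


toy_documents_by_title = {
    "AIRPORT ENGINEER": ["AIRPORT ENGINEER\nClass Code: 7256"],
    "AIRPORT POLICE CAPTAIN": ["AIRPORT POLICE CAPTAIN\n(toy placeholder text)"],
}

print(run_chain("Give me the job description of an airport engineer.", toy_documents_by_title))
```

Because the second step never receives the police captain document, it cannot leak into the answer, which is the property the workflow below aims for.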

In [None]:
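def get_available_job_titles() -> list[str]:
    """Returns the unique job titles found in the job postings."""
    # Assumption: df_jobs (loaded above) has a "job_title" column.
    return df_jobs.job_title.unique().tolist()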

get_jobs_tool = Tool(
    function=get_available_job_titles,
    name="Get Available Job Titles",
    description="Use this tool to get the job titles.",
)

In [None]:
job_agent = Agent.create(
    llm=llm,
    system_prompt="",
    prompt="{question} \n\nOnly respond with the job title.",
    prompt_variables=["question"],
    output_type=OutputType.STRING,
    tools=[get_jobs_tool],
    iterations=10,
)

In [None]:
class Search(BaseModel):
    user_text: str = Field(description="The user question/prompt/text.")
    fetch_k: int = Field(5, description="The number of documents to return.")
    job_title: str = Field(description="The job title to filter for. Must be all caps.")

def search(user_text: str, fetch_k: int, job_title: str) -> str:
    """Filters job postings by title, then ranks them by cosine similarity to the query."""

    def calculate_cosine_similarity(user_text_embedding, embedding):
        return np.dot(user_text_embedding, embedding) / (
            np.linalg.norm(user_text_embedding) * np.linalg.norm(embedding)
        )

    user_text_embedding = embedding_model.embed_query(user_text)
    # Assumption: df_jobs (loaded above) has a "job_title" column.
    df = df_jobs.loc[df_jobs["job_title"] == job_title.upper()].copy()
    # Assumption: embed the postings on the fly if no precomputed "embedding" column exists.
    if "embedding" not in df.columns:
        df["embedding"] = df.text.apply(embedding_model.embed_query)
    df["cosine_similarity"] = df.embedding.apply(
        lambda embedding: calculate_cosine_similarity(user_text_embedding, embedding)
    )
    df = df.sort_values(by="cosine_similarity", ascending=False).iloc[:fetch_k]
    documents = "\n\n".join(df.text.tolist())
    return f"Context:\n\n{documents}"


search_tool = Tool(
    function=search,
    name=tool_name,
    description=tool_description,
    args_schema=Search,
)

In [None]:
system_prompt = """You are an expert in job postings. Respond with the most accurate information about the job.

Use the Search tool to find the job description."""

agent = Agent.create(
    llm=llm,
    system_prompt=system_prompt,
    prompt="{job_title}",
    prompt_variables=["job_title"],
    output_type=OutputType.STRING,
    tools=[search_tool],
    iterations=10,
)

In [None]:
class SearchJobPostings(BaseModel):
    question: str = Field(description="The user question")

workflow = Workflow(
    name="Search Job Postings",
    description="Allows you to search for specific jobs",
    inputs=SearchJobPostings,
    output="job",
    steps=[WorkflowLLMStep(name="job_title", agent=job_agent), WorkflowLLMStep(name="job", agent=agent)],
)

In [None]:
output = workflow.invoke({"question": "Give me the job description of an airport engineer."})

In [None]:
pprint(output.output)

As evident from the improved response of the LLM, our approach ensures that the model exclusively accesses documents about an airport engineer, thereby enhancing its performance. Scaling the solution to encompass multiple entities and numerous documents allows us to effectively filter out irrelevant content, concentrating solely on significant documents. Consequently, this boosts the likelihood of the LLM receiving high-quality documents to address the task. 

In conclusion, we ought to separate our business components before enhancing our information retrieval and generation process to ensure the best results. If RAG doesn't improve the LLM's answer quality, you might consider taking an additional step: fine-tuning a model specifically for the domain and optionally integrating RAG for further enhancement.