# Retrieval Augmented Generation (RAG)

## Why RAG?

### 1. Knowledge Cutoff
*LLMs don't know about anything that happened after their training cutoff*

- **Training data freeze**: Models are trained on data up to a specific cutoff date
- **Growing knowledge gap**: Information becomes increasingly outdated as time passes
- **Missing recent events**: No knowledge of current news, updates, releases, or discoveries
- **Example**: Can't answer "Who won the 2024 election?" or "Latest iPhone features"

### 2. No Real-time Data
*Can't access current information*

- **Static knowledge only**: On its own, a model cannot connect to the internet, APIs, or live databases
- **No dynamic information**: Cannot fetch stock prices, weather, traffic, or breaking news
- **Frozen in time**: Information reflects training period, not current state
- **Example**: Cannot provide today's weather or current market conditions

### 3. Hallucination
*May generate plausible but incorrect information*

- **Confident fabrication**: Creates believable but false information when uncertain
- **Pattern-based guessing**: Fills knowledge gaps with plausible-sounding responses
- **No verification mechanism**: Cannot fact-check or validate generated content
- **Example**: May invent fake statistics, URLs, or medical advice

### 4. Domain-Specific Knowledge
*Limited knowledge about your private/company data*

- **Public data only**: Only knows information available during training
- **No private access**: Cannot read internal documents, policies, or proprietary data
- **Generic responses**: Cannot provide company-specific procedures or information
- **Example**: Cannot answer questions about internal APIs, company policies, or customer data

### 5. Memory Limitations
*Can't remember previous conversations*

- **No conversation history**: Each session starts fresh with no memory of past interactions
- **Context window limits**: Forgets earlier parts of long conversations
- **No user preferences**: Cannot learn or adapt to individual user needs over time
- **Example**: User must re-explain context and preferences in every new session

## How RAG Solves These Problems

RAG bridges these gaps by:
- **Retrieving current information** from updated knowledge bases
- **Grounding responses** in verified, sourced content
- **Accessing private data** through custom document collections
- **Maintaining context** through conversation and document history

## Part 0: Introduction, Installations and Environment

**Indexing**

1. Load: First we need to load our data. This is done with [Document Loaders](https://python.langchain.com/docs/concepts/document_loaders/).
2. Split: [Text splitters](https://python.langchain.com/docs/concepts/text_splitters/) break large Documents into smaller chunks. This helps both when indexing data and when passing it to a model, since large chunks are harder to search over and may not fit in a model's finite context window.
3. Store: We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a [VectorStore](https://python.langchain.com/docs/concepts/#vector-stores) and [Embeddings](https://python.langchain.com/docs/concepts/embedding_models/) model.

![](RAG_VectorStore.png)
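
To make the Split step concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification under stated assumptions: `chunk_text` is a hypothetical helper, not LangChain's splitter, and it counts characters rather than tokens.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slide a window of `chunk_size` characters, stepping by size minus overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.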

**Retrieval and Generation**

4. Retrieve: Given a user input, relevant splits are retrieved from storage using a [Retriever](https://python.langchain.com/docs/concepts/retrievers/).
5. Generate: A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data.

![](Retrieve_Generate.png)
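
The Retrieve and Generate steps can be sketched without any external service. This is a toy illustration only: word overlap stands in for embedding similarity, `retrieve` and `build_prompt` are hypothetical helpers, and the final LLM call is left out.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Place the retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question based only on the following context:\n"
        f"{context}\n\nQuestion: {question}\n"
    )


chunks = [
    "Task decomposition breaks a complex task into smaller steps.",
    "Chroma is a vector store.",
    "Embeddings map text to vectors.",
]
question = "What is task decomposition?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The string printed at the end is what would be sent to the chat model in step 5.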

In [None]:
! pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain sentence-transformers

In [None]:
! pip install bs4

In [25]:
import os
from dotenv import load_dotenv
 
load_dotenv()

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

## Part 1: Overview

 
[RAG quickstart](https://python.langchain.com/docs/tutorials/rag/)

In [26]:
import bs4
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load Documents
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

# Split
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

# Embed
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()

# Prompt
prompt = hub.pull("rlm/rag-prompt")

# LLM
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Chain
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Question
rag_chain.invoke("What is Task Decomposition?")

'Task Decomposition is the process of breaking down complex tasks into smaller, manageable steps. Techniques like Chain of Thought (CoT) and Tree of Thoughts facilitate this by prompting models to think step by step and explore multiple reasoning possibilities. It can be achieved through simple prompts, task-specific instructions, or human inputs.'

In [23]:
# Imports
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

# Load the document
loader = TextLoader("shakespeare.txt")
documents = loader.load()

# Split it into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma with a specific collection name
db = Chroma.from_documents(
    docs, 
    embedding_function,
    collection_name="shakespeare"
)

# query it
query = "What is Malcolm?"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)



MALCOLM. Let us seek out some desolate shade and there
    Weep our sad bosoms empty.
  MACDUFF. Let us rather
    Hold fast the mortal sword, and like good men
    Bestride our downfall'n birthdom. Each new morn
    New widows howl, new orphans cry, new sorrows
    Strike heaven on the face, that it resounds
    As if it felt with Scotland and yell'd out
    Like syllable of dolor.
  MALCOLM. What I believe, I'll wall;
    What know, believe; and what I can redress,
    As I shall find the time to friend, I will.
    What you have spoke, it may be so perchance.
    This tyrant, whose sole name blisters our tongues,
    Was once thought honest. You have loved him well;
    He hath not touch'd you yet. I am young, but something
    You may deserve of him through me, and wisdom  
    To offer up a weak, poor, innocent lamb
    To appease an angry god.
  MACDUFF. I am not treacherous.
  MALCOLM. But Macbeth is.
    A good and virtuous nature may recoil


In [27]:
from langchain_core.prompts import PromptTemplate

# Use the top retrieved chunk as context.
# Note: `docs` must still hold the similarity_search results from the previous cell.
context = docs[0].page_content

# Create a prompt for the language model
prompt = PromptTemplate.from_template(
    "Based on the following context, answer the question:\n\n{context}\n\nQuestion: {query}\nAnswer:"
)

# Initialize the language model (e.g., OpenAI)
llm_model="gpt-4o-mini"
llm = ChatOpenAI(temperature=0.0, model=llm_model) # Replace with your model of choice

# Generate an answer using the language model
answer = llm.invoke(prompt.format(context=context, query=query))

# Print the generated answer
print(answer)

content='The provided context does not mention "Malcolm." Therefore, based on the information available, I cannot provide an answer regarding what Malcolm is. If you have a specific context or details about Malcolm, please share them, and I would be happy to help!' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 51, 'prompt_tokens': 9380, 'total_tokens': 9431, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_34a54ae93c', 'id': 'chatcmpl-BdVrxxH9z3N3QYbbl0mAPocSs0cnZ', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None} id='run--ab0d7a05-9a2f-4201-880a-3839af6d5820-0' usage_metadata={'input_tokens': 9380, 'output_tokens': 51, 'total_tokens': 9431, 'input_token_details': {'audio': 0, 'cache_read': 0}, 'outpu

## Part 2: Indexing

Several algorithms can index the embedded splits for fast similarity search: Hierarchical Navigable Small World graphs (HNSW, used by Chroma), Inverted File Index (IVF, used by FAISS and Pinecone), Locality-Sensitive Hashing (LSH), and tree-based indexing.
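
As an illustration of the inverted-file idea, the sketch below groups vectors into cells around centroids and scans only the cells nearest to the query. It is a toy under stated assumptions, not the implementation used by any particular library: `build_ivf` and `search_ivf` are hypothetical names, and the centroids are fixed by hand where a real index would learn them (for example with k-means).

```python
import math


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_ivf(vectors, centroids):
    """Assign each vector id to the cell of its nearest centroid."""
    cells = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
        cells[nearest].append(vid)
    return cells


def search_ivf(query, vectors, centroids, cells, k=1, nprobe=1):
    """Scan only the `nprobe` cells whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in probe for vid in cells[c]]
    return sorted(candidates, key=lambda vid: dist(query, vectors[vid]))[:k]


vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centroids = [(0.0, 0.0), (5.0, 5.0)]  # hand-picked for the example
cells = build_ivf(vectors, centroids)
result = search_ivf((4.9, 5.0), vectors, centroids, cells, k=1)
print(cells, result)
```

Raising `nprobe` scans more cells, trading speed for a lower chance of missing a true nearest neighbour that sits just across a cell boundary.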

In [3]:
# Documents
question = "What kinds of pets do I like?"
document = "My favorite pet is a cat."

In [4]:
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

num_tokens_from_string(question, "cl100k_base")

8

There are different [embedding models](https://python.langchain.com/docs/integrations/text_embedding/) available.

In [5]:
from langchain_openai import OpenAIEmbeddings
embd = OpenAIEmbeddings()
query_result = embd.embed_query(question)
document_result = embd.embed_query(document)

print(f'Length of query: {len(query_result)}')
print(f'Length of document: {len(document_result)}')

Length of query: 1536
Length of document: 1536
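
Both texts map to vectors of the same length, regardless of how long the text is. The toy class below mimics that behaviour with hashed word counts. It is a hypothetical stand-in that only mirrors the `embed_query` / `embed_documents` method names; it carries none of the semantic quality of a trained embedding model.

```python
import hashlib


class ToyHashEmbeddings:
    """Hypothetical stand-in: fixed-size vectors built from hashed word counts."""

    def __init__(self, dim: int = 8):
        self.dim = dim

    def _bucket(self, word: str) -> int:
        # md5 keeps the bucket stable across runs, unlike Python's built-in hash().
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.dim

    def embed_query(self, text: str) -> list[float]:
        vec = [0.0] * self.dim
        for word in text.lower().split():
            vec[self._bucket(word)] += 1.0
        return vec

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [self.embed_query(t) for t in texts]


toy = ToyHashEmbeddings(dim=8)
vectors = toy.embed_documents(
    ["My favorite pet is a cat.", "What kinds of pets do I like?"]
)
print(len(vectors), [len(v) for v in vectors])
```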


In [6]:
import numpy as np

def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    return dot_product / (norm_vec1 * norm_vec2)

similarity = cosine_similarity(query_result, document_result)
print("Cosine Similarity:", similarity)

Cosine Similarity: 0.8806915835035409
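
A small worked example makes the formula easier to check by hand. It also shows that if vectors are already scaled to unit length (an assumption to verify for your embedding model), the plain dot product gives the same value as cosine similarity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


a, b = [3.0, 4.0], [4.0, 3.0]
cos_ab = cosine(a, b)  # dot = 24, both norms = 5, so 24 / 25

# Scale each vector to unit length, then take the plain dot product.
a_unit = [x / norm(a) for x in a]
b_unit = [x / norm(b) for x in b]
dot_unit = dot(a_unit, b_unit)

print(cos_ab, dot_unit)
```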


[Document Loaders](https://python.langchain.com/docs/integrations/document_loaders/)

In [8]:
# Load blog
import bs4
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
blog_docs = loader.load()

There are different [text splitters](https://python.langchain.com/docs/concepts/text_splitters/) available.

`RecursiveCharacterTextSplitter` is the recommended splitter for generic text. It is parameterized by a list of separators and tries to split on them in order until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. This keeps paragraphs (then sentences, then words) together for as long as possible, since those tend to be the most semantically related pieces of text.
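
The separator-by-separator behaviour can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: `recursive_split` and `merge` are hypothetical helpers, lengths are counted in characters, and chunk overlap is left out.

```python
def merge(pieces, sep, chunk_size):
    """Greedily join neighbouring pieces while the result still fits."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator; recurse with finer ones for oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, small = [], []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            if part:
                small.append(part)
        else:
            chunks.extend(merge(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(part, chunk_size, finer))
    chunks.extend(merge(small, sep, chunk_size))
    return chunks


text = "Para one is here.\n\nSecond paragraph is a bit longer than the first.\n\nEnd."
result = recursive_split(text, chunk_size=30)
print(result)
```

The short paragraphs survive intact, while the long one is broken at word boundaries only because no coarser separator could make it fit.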

In [9]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300, 
    chunk_overlap=50)

# Make splits
splits = text_splitter.split_documents(blog_docs)

[Vectorstores](https://python.langchain.com/docs/integrations/vectorstores/)

In [10]:
# Index
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=OpenAIEmbeddings())

retriever = vectorstore.as_retriever()

## Part 3: Retrieval

In [20]:
# Index
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=OpenAIEmbeddings())


retriever = vectorstore.as_retriever(search_kwargs={"k": 3}) # returns the 3 most similar chunks

In [21]:
docs = retriever.invoke("What is Task Decomposition?")

In [22]:
docs

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='Component One: Planning#\nA complicated task usually involves many steps. An agent needs to know what they are and plan ahead.\nTask Decomposition#\nChain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.\nTree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a

In [28]:
len(docs)

1

## Part 4: Generation

In [29]:
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Prompt
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template='Answer the question based only on the following context:\n{context}\n\nQuestion: {question}\n'), additional_kwargs={})])
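
To see what the model receives once the placeholders are filled, the same template can be rendered with plain `str.format`. This is only an approximation for illustration: `ChatPromptTemplate` additionally wraps the rendered text in a human message, and the context and question values here are sample inputs.

```python
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

# Fill the two input variables with sample values.
filled = template.format(
    context="My favorite pet is a cat.",
    question="What kinds of pets do I like?",
)
print(filled)
```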

In [30]:
# LLM
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

In [31]:
# Chain
chain = prompt | llm

In [32]:
# Run
chain.invoke({"context":docs,"question":"What is Task Decomposition?"})

AIMessage(content="Task Decomposition is the process by which an agent breaks down large tasks into smaller, manageable subgoals. This enables the agent to handle complex tasks more efficiently. It can be achieved through various methods, such as prompting the LLM with specific questions about the steps needed to accomplish a task, using task-specific instructions, or incorporating human inputs. Task Decomposition is essential for enhancing the model's performance on complex tasks by transforming them into simpler, more manageable components.", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 93, 'prompt_tokens': 9989, 'total_tokens': 10082, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_34a54ae93c', 'id': 'chatcmpl-BdVt3

In [45]:
from langchain import hub
prompt_hub_rag = hub.pull("rlm/rag-prompt")

In [51]:
prompt_hub_rag

ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"), additional_kwargs={})])

In [52]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt_hub_rag
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is Task Decomposition?")

'Task Decomposition is the process of breaking down a complicated task into smaller, more manageable steps. Techniques like Chain of Thought (CoT) and Tree of Thoughts enhance this process by guiding models to think step by step and explore multiple reasoning possibilities. This can be achieved through simple prompts, task-specific instructions, or human inputs.'
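
This chain passes the retriever output straight into the prompt, whereas the Part 1 chain first pipes it through `format_docs`. The sketch below shows the difference between the two kinds of context string. `Doc` is a hypothetical stand-in for a LangChain `Document`, used so the example runs without any dependencies.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join chunk texts with blank lines, leaving out metadata and repr noise."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [Doc("Chunk one.", {"source": "a"}), Doc("Chunk two.", {"source": "b"})]

raw_context = str(docs)            # roughly what an unformatted list renders as
clean_context = format_docs(docs)  # only the chunk text

print(raw_context)
print(clean_context)
```

Both variants work, but the formatted one spends fewer prompt tokens on metadata and gives the model cleaner text to read.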