# A Gentle Introduction to RAG Applications

This notebook creates a simple RAG (Retrieval-Augmented Generation) system to answer questions from a PDF document using an open-source model.

In [17]:
PDF_FILE = "paul.pdf"

# We'll be using Llama 3 for this example (Ollama publishes Llama 3.1 under the separate `llama3.1` tag).
MODEL = "llama3:latest"

## Loading the PDF document

Let's start by loading the PDF document and breaking it down into separate pages.

<img src='images/documents.png' width="1000">

In [12]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(PDF_FILE)
pages = loader.load()

print(f"Number of pages: {len(pages)}")
print(f"Length of a page: {len(pages[1].page_content)}")
print("Content of a page:", pages[1].page_content)

Number of pages: 9
Length of a page: 3272
Content of a page: 10% a week. And while 110 may not seem much better than 100,
if you keep growing at 10% a week you'll be surprised how big
the numbers get. After a year you'll have 14,000 users, and after
2 years you'll have 2 million.
You'll be doing different things when you're acquiring users a
thousand at a time, and growth has to slow down eventually. But
if the market exists you can usually start by recruiting users
manually and then gradually switch to less manual methods. [3]
Airbnb is a classic example of this technique. Marketplaces are so
hard to get rolling that you should expect to take heroic measures
at first. In Airbnb's case, these consisted of going door to door in
New York, recruiting new users and helping existing ones improve
their listings. When I remember the Airbnbs during YC, I picture
them with rolly bags, because when they showed up for tuesday
dinners they'd always just flown back from somewhere.
Fragile
Airbnb no

## Splitting the pages into chunks

Whole pages are too long to be useful retrieval units, so let's split them into smaller chunks.

<img src='images/splitter.png' width="1000">


In [None]:
# Import the RecursiveCharacterTextSplitter class for splitting text into manageable chunks.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Create an instance of RecursiveCharacterTextSplitter.
# - `chunk_size=1500`: Defines the maximum number of characters in each chunk.
# - `chunk_overlap=100`: Specifies the number of overlapping characters between consecutive chunks, 
#   ensuring context continuity between chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)

# Use the text splitter to divide the input document(s) (`pages`) into smaller chunks.
# `pages` is the list of `Document` objects returned by the loader (one per PDF page).
chunks = splitter.split_documents(pages)

# Print the total number of chunks created after splitting.
print(f"Number of chunks: {len(chunks)}")

# Print the length (number of characters) of the second chunk.
# `chunks[1].page_content` refers to the text content of the second chunk.
print(f"Length of a chunk: {len(chunks[1].page_content)}")

# Print the content of the second chunk to examine its text.
print("Content of a chunk:", chunks[1].page_content)


Number of chunks: 23
Length of a chunk: 1236
Content of a chunk: took better advantage of it than Stripe. At YC we use the term
"Collison installation" for the technique they invented. More
diffident founders ask "Will you try our beta?" and if the answer is
yes, they say "Great, we'll send you a link." But the Collison
brothers weren't going to wait. When anyone agreed to try Stripe
they'd say "Right then, give me your laptop" and set them up on
the spot.
There are two reasons founders resist going out and recruiting
users individually. One is a combination of shyness and laziness.
They'd rather sit at home writing code than go out and talk to a
bunch of strangers and probably be rejected by most of them.
But for a startup to succeed, at least one founder (usually the
CEO) will have to spend a lot of time on sales and marketing. [2]
The other reason founders ignore this path is that the absolute
numbers seem so small at first. This can't be how the big, famous
startups got started, th
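
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` actually uses, since that class also tries to break on separators such as paragraph and line boundaries. The function name `split_with_overlap` and the sample text are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

Each window repeats the last three characters of the previous one, which is the same idea the overlap serves in the real splitter: a sentence cut at a chunk boundary still appears with some of its surrounding context.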

## Storing the chunks in a vector store

We can now generate embeddings for every chunk and store them in a vector store.

<img src='images/vectorstore.png' width="1000">


In [20]:
# Import the FAISS class for creating a vector store, which enables efficient similarity search and retrieval.
from langchain_community.vectorstores import FAISS

# Import the OllamaEmbeddings class, which generates vector embeddings using a model served by Ollama.
# It is imported from langchain_ollama, the same package that provides ChatOllama later in this notebook.
from langchain_ollama import OllamaEmbeddings

# Create an instance of OllamaEmbeddings by specifying the model to use.
# This model will convert text data into numerical vector representations.
embeddings = OllamaEmbeddings(model=MODEL)

# Create a FAISS vector store from the provided document chunks and their embeddings.
# `chunks` is the list of chunk documents produced by the splitter above.
# `embeddings` is used to generate vector representations for these chunks.
vectorstore = FAISS.from_documents(chunks, embeddings)
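
The idea behind a vector store can be sketched without FAISS or Ollama. In the toy example below, a bag-of-words count vector stands in for the model's embeddings and a brute-force cosine similarity scan stands in for the FAISS index. All names (`toy_embed`, `cosine_similarity`, `search`) and the sample texts are hypothetical, and the similarity metric a real vector store uses depends on how it is configured.

```python
import math
from collections import Counter


def toy_embed(text: str, vocabulary: list[str]) -> list[float]:
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


texts = [
    "founders recruit users manually",
    "growth rate of ten percent a week",
    "delighting users scales better than expected",
]

# The "vector store": every text is stored next to its embedding.
vocabulary = sorted({word for text in texts for word in text.split()})
store = [(text, toy_embed(text, vocabulary)) for text in texts]


def search(query: str, k: int = 1) -> list[str]:
    """Return the k stored texts whose embeddings are most similar to the query."""
    query_vector = toy_embed(query, vocabulary)
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


best_match = search("how do founders recruit users")
print(best_match)
```

A real embedding model captures meaning rather than exact word overlap, but the storage and lookup pattern is the same: embed once at indexing time, embed the query at search time, and rank by similarity.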

## Setting up a retriever

We can use a retriever to find chunks in the vector store that are similar to a supplied question.

<img src='images/retriever.png' width="1000">



In [21]:
# Convert the vector store object into a retriever.
# - `vectorstore.as_retriever()` creates an interface to retrieve relevant information from the vector store.
# - The retriever uses the embeddings in the vector store to find the most relevant documents based on the query.
retriever = vectorstore.as_retriever()

# Invoke the retriever with a specific query.
# - The query ("What can you get away with when you only have a small number of users?") is used to find the most relevant
#   information stored in the vector store.
# - The retriever processes the query using the stored embeddings and returns the closest matching content.
retriever.invoke("What can you get away with when you only have a small number of users?")

[Document(id='1d97e3cc-fc0f-4a46-abb0-d53ef3b7fce7', metadata={'source': 'paul.pdf', 'page': 2}, page_content="in charge of their narrow domain of building things, rather than\nrunning the whole show. You can be ornery when you're Scotty,\nbut not when you're Kirk.\nAnother reason founders don't focus enough on individual\ncustomers is that they worry it won't scale. But when founders of\nlarval startups worry about this, I point out that in their current\nstate they have nothing to lose. Maybe if they go out of their way\nto make existing users super happy, they'll one day have too\nmany to do so much for. That would be a great problem to have.\nSee if you can make it happen. And incidentally, when it does,\nyou'll find that delighting customers scales better than you\nexpected. Partly because you can usually find ways to make\nanything scale more than you would have predicted, and partly\nbecause delighting customers will by then have permeated your\nculture.\nI have never once seen 
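
The retriever returns a list of `Document` objects, each with `page_content` and `metadata`. A common pattern is to join the text of those documents into a single string before placing it in a prompt. The sketch below shows the idea with a small stand-in dataclass instead of LangChain's `Document`, so it runs without any dependencies. `FakeDocument` and `format_docs` are hypothetical names and are not used elsewhere in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a retrieved document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs: list[FakeDocument]) -> str:
    """Join the text of each document, labelled with its page number, into one string."""
    return "\n\n".join(
        f"[page {doc.metadata.get('page', '?')}] {doc.page_content}" for doc in docs
    )


retrieved = [
    FakeDocument("First retrieved chunk.", {"source": "paul.pdf", "page": 2}),
    FakeDocument("Second retrieved chunk.", {"source": "paul.pdf", "page": 5}),
]

context = format_docs(retrieved)
print(context)
```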

## Configuring the model

We'll be using Ollama to load the local model in memory. After creating the model, we can invoke it with a question to get the response back.

<img src='images/model.png' width="1000">

In [23]:
# Import the ChatOllama class from the langchain_ollama module.
# - ChatOllama is a wrapper for chat models served locally by Ollama.
from langchain_ollama import ChatOllama

# Initialize the ChatOllama model with specific parameters.
# - `model=MODEL`: Specifies the name or identifier of the Ollama model to be used.
# - `temperature=0`: Controls the randomness of the output.
#    - A temperature of 0 makes responses as deterministic as possible for the same input.
# - `num_predict=2`: Caps the response at two generated tokens, which is why the outputs in this
#   notebook are cut off (note `done_reason: 'length'` below). Remove or raise it to get full answers.
model = ChatOllama(model=MODEL, temperature=0, num_predict=2)

# Invoke the ChatOllama model with a query.
# - The `.invoke()` method sends a query to the model for processing.
# - "Who is the president of the United States?" is the input query that the model processes.
# - The model generates and returns a response based on the knowledge encoded within it.
model.invoke("Who is the president of the United States?")

AIMessage(content='As of', additional_kwargs={}, response_metadata={'model': 'llama3:latest', 'created_at': '2024-12-31T10:42:28.982217Z', 'done': True, 'done_reason': 'length', 'total_duration': 1748837708, 'load_duration': 34473292, 'prompt_eval_count': 19, 'prompt_eval_duration': 1680000000, 'eval_count': 2, 'eval_duration': 33000000, 'message': Message(role='assistant', content='', images=None, tool_calls=None)}, id='run-c99ab300-17f6-4357-8d44-ca15b5b1d53a-0', usage_metadata={'input_tokens': 19, 'output_tokens': 2, 'total_tokens': 21})
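
To illustrate what the `temperature` parameter controls, here is a minimal sketch of a temperature-scaled softmax over made-up token scores. The logits are invented for illustration, and how Ollama treats `temperature=0` internally is not shown here. The sketch only demonstrates that lower temperatures concentrate probability on the highest-scoring token.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [value / temperature for value in logits]
    max_scaled = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for temperature in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature shrinks, almost all of the probability moves to the first token, so sampling picks it nearly every time.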

## Parsing the model's response

The response from the model is an `AIMessage` instance containing the answer. We can extract the text answer by using the appropriate output parser. We can connect the model and the parser using a chain.

<img src='images/parser.png' width="1000">


In [26]:
# Import the StrOutputParser class from the langchain_core.output_parsers module.
# - StrOutputParser is used to process and parse the output from a model into a string format.
from langchain_core.output_parsers import StrOutputParser

# Initialize the StrOutputParser to parse the output from the model.
# - StrOutputParser extracts the text content of the model's message as a plain string.
parser = StrOutputParser()

# Define a 'chain' that combines the model and the output parser.
# - The pipe (`|`) operator is used to create a sequence of steps, where the model's output is passed directly to the parser.
# - This means the model will first generate the response, and then the parser will process it into a string.
chain = model | parser

# Invoke the chain with a query.
# - The `.invoke()` method is used to send the query ("Who is the president of the United States?") through the chain.
# - The model generates a response, which is then parsed by the StrOutputParser, and the final result is printed.
print(chain.invoke("Who is the president of the United States?"))

As of
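
The way `|` chains two steps together can be sketched with Python's `__or__` method. This is a toy illustration of the composition idea only. LangChain's actual implementation is more elaborate, and `Step`, `fake_model`, and `fake_parser` are hypothetical names.

```python
class Step:
    """Wrap a function so that steps can be composed with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that runs a first, then feeds its output to b.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


# A fake "model" that returns a message-like dictionary, and a fake "parser"
# that pulls the text out of it.
fake_model = Step(lambda question: {"content": f"Echo: {question}"})
fake_parser = Step(lambda message: message["content"])

toy_chain = fake_model | fake_parser
result = toy_chain.invoke("Hello?")
print(result)
```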


## Setting up a prompt

In addition to the question we want to ask, we also want to provide the model with the context from the PDF file. We can use a prompt template to define and reuse the prompt we'll use with the model.


<img src='images/prompt.png' width="1000">

In [24]:
from langchain_core.prompts import PromptTemplate

template = """
You are an assistant that provides answers to questions based on
a given context. 

Answer the question based on the context. If you can't answer the
question, reply "I don't know".

Be as concise as possible and go straight to the point.

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
print(prompt.format(context="Here is some context", question="Here is a question"))


You are an assistant that provides answers to questions based on
a given context. 

Answer the question based on the context. If you can't answer the
question, reply "I don't know".

Be as concise as possible and go straight to the point.

Context: Here is some context

Question: Here is a question
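
A prompt template is essentially a string with named placeholders. As a rough sketch of that idea, the example below uses only Python's standard library to discover the placeholders in a template and fill them in. It is not how `PromptTemplate` is implemented, and `template_variables` and `fill` are hypothetical helper names.

```python
from string import Formatter

toy_template = "Context: {context}\n\nQuestion: {question}\n"


def template_variables(template: str) -> list[str]:
    """Return the placeholder names found in the template, in order of appearance."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


def fill(template: str, **values: str) -> str:
    """Fill the template, failing loudly if a placeholder has no value."""
    missing = set(template_variables(template)) - set(values)
    if missing:
        raise KeyError(f"Missing template variables: {sorted(missing)}")
    return template.format(**values)


variables = template_variables(toy_template)
filled = fill(toy_template, context="Here is some context", question="Here is a question")
print(variables)
print(filled)
```

Checking for missing variables before formatting gives a clearer error than a bare `KeyError` from `str.format`, which helps when a chain step forgets to supply one of the inputs.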



## Adding the prompt to the chain

We can now chain the prompt with the model and the parser.

<img src='images/chain1.png' width="1000">

In [27]:
chain = prompt | model | parser

chain.invoke({
    "context": "Anna's sister is Susan", 
    "question": "Who is Susan's sister?"
})


'Anna.'

## Adding the retriever to the chain

Next, we can connect the retriever to the chain to get the context from the vector store.

<img src='images/chain2.png' width="1000">

In [30]:
# Import the itemgetter function from the operator module.
# - itemgetter("question") returns a callable that looks up the "question" key of its input.
from operator import itemgetter

# Define a 'chain' that combines multiple steps into a sequence using the pipe operator (`|`).
# - Each step in the chain processes the data and passes it to the next component.

chain = (
    {
        # Extract the "question" value from the input dictionary and pass it to the retriever,
        # which returns the most similar chunks from the vector store as the context.
        "context": itemgetter("question") | retriever,

        # Extract the "question" value and pass it along unchanged.
        "question": itemgetter("question"),
    }
    # The resulting dictionary, with "context" and "question" keys, flows through the remaining steps.
    | prompt        # Fills the prompt template with the context and the question.
    | model         # Generates a response from the formatted prompt.
    | parser        # Extracts the response text as a plain string.
)
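
The dictionary at the start of the chain can be thought of as a fan-out: every value is applied to the same input, and the results are collected under the matching keys. Below is a minimal sketch of that behaviour in plain Python, assuming a canned stand-in for the retriever. `fake_retriever` and `run_mapping` are hypothetical names, and the real LangChain machinery does more than this.

```python
from operator import itemgetter


def fake_retriever(question: str) -> list[str]:
    """Stand-in for the retriever: returns a canned chunk for any question."""
    return [f"chunk relevant to: {question}"]


def run_mapping(steps: dict, inputs: dict) -> dict:
    """Apply every step to the same inputs and collect the results by key."""
    return {key: step(inputs) for key, step in steps.items()}


steps = {
    # Look up the question, then hand it to the retriever.
    "context": lambda inputs: fake_retriever(itemgetter("question")(inputs)),
    # Look up the question and pass it through unchanged.
    "question": itemgetter("question"),
}

prompt_inputs = run_mapping(steps, {"question": "What is RAG?"})
print(prompt_inputs)
```

The resulting dictionary has exactly the two keys the prompt template expects, `context` and `question`.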

## Using the chain to answer questions

Finally, we can use the chain to ask questions that will be answered using the PDF document. Because the model was created with `num_predict=2`, each answer below is cut off after two tokens.

In [29]:
questions = [
    "What can you get away with when you only have a small number of users?",
    "What's the most common unscalable thing founders have to do at the start?",
    "What's one of the biggest things inexperienced founders and investors get wrong about startups?",
]

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {chain.invoke({'question': question})}")
    print("*************************\n")

Question: What can you get away with when you only have a small number of users?
Answer: You can
*************************

Question: What's the most common unscalable thing founders have to do at the start?
Answer: According to
*************************

Question: What's one of the biggest things inexperienced founders and investors get wrong about startups?
Answer: I don
*************************

