# A Gentle Introduction to RAG Applications

This notebook creates a simple RAG (Retrieval-Augmented Generation) system to answer questions from a PDF document using an open-source model.

In [1]:
PDF_FILE = "Exoplanets.pdf"

# We'll be using Llama 3.1 8B for this example.
MODEL = "llama3.1"

## Loading the PDF document

Let's start by loading the PDF document and breaking it down into separate pages.

<img src='images/documents.png' width="1000">

In [2]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(PDF_FILE)
pages = loader.load()

print(f"Number of pages: {len(pages)}")
print(f"Length of a page: {len(pages[1].page_content)}")
print("Content of a page:", pages[1].page_content)

Number of pages: 54
Length of a page: 1096
Content of a page:   
Exoplanet Travel Bureau Virtual Reality Experience 
https://exoplanets.nasa.gov/alien -worlds/exoplanet -travel -bureau   
  
Eyes on Exoplanets   
https://exoplanets.nasa.gov/eyes -on-exoplanets -web   
  
Eyes on Exoplanets Tutorial 1: The Basics   
https://exoplanets.nasa.gov/resources/1051/eyes -on-exoplanets -tutorial -1-the-basics   
  
Eyes on Exoplanets Tutorial 2: Advanced Tutorial  
https://exoplanets.nasa.gov/resources/1052/eyes -on-exoplanets -tutorial -2-advanced -tutorial   
  
Eyes on Exoplanets Tutorial 3: Tips and Tricks  
https://exoplanets.nasa.gov/resources/1053/eyes -on-exoplanets -tutorial -3-tips-and-tricks   
   Note: Eyes on Exoplanets Tutorials based on desktop version.   
  
  
Habitable Hunt: A 'Pi in the Sky' Math Challenge   
https://www.jpl.nasa.gov/edu/teach/activity/habitable -hunt -a-pi-in-the-sky-math -challenge   
  
Interactive: 5 Ways to Find a Planet   
https://exoplanets.nasa.gov/al

## Splitting the pages into chunks

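Some pages are too long to pass to the model as a single block of context, so let's split the pages into smaller chunks of at most 1,500 characters, letting consecutive chunks from the same page overlap by up to 100 characters.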

<img src='images/splitter.png' width="1000">


In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)

chunks = splitter.split_documents(pages)
print(f"Number of chunks: {len(chunks)}")
print(f"Length of a chunk: {len(chunks[1].page_content)}")
print("Content of a chunk:", chunks[1].page_content)


Number of chunks: 84
Length of a chunk: 1090
Content of a chunk: Exoplanet Travel Bureau Virtual Reality Experience 
https://exoplanets.nasa.gov/alien -worlds/exoplanet -travel -bureau   
  
Eyes on Exoplanets   
https://exoplanets.nasa.gov/eyes -on-exoplanets -web   
  
Eyes on Exoplanets Tutorial 1: The Basics   
https://exoplanets.nasa.gov/resources/1051/eyes -on-exoplanets -tutorial -1-the-basics   
  
Eyes on Exoplanets Tutorial 2: Advanced Tutorial  
https://exoplanets.nasa.gov/resources/1052/eyes -on-exoplanets -tutorial -2-advanced -tutorial   
  
Eyes on Exoplanets Tutorial 3: Tips and Tricks  
https://exoplanets.nasa.gov/resources/1053/eyes -on-exoplanets -tutorial -3-tips-and-tricks   
   Note: Eyes on Exoplanets Tutorials based on desktop version.   
  
  
Habitable Hunt: A 'Pi in the Sky' Math Challenge   
https://www.jpl.nasa.gov/edu/teach/activity/habitable -hunt -a-pi-in-the-sky-math -challenge   
  
Interactive: 5 Ways to Find a Planet   
https://exoplanets.nasa.gov/al
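
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a plain sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter also tries to break on separators such as paragraphs and lines); `split_with_overlap` and `toy_chunks` are hypothetical names used only for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a plain
    sliding window, not the separator-aware logic of the real splitter.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return pieces


toy_chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for piece in toy_chunks:
    print(piece)
```

Each printed piece starts with the last three characters of the previous one, which is the role the overlap plays: text that straddles a chunk boundary still appears whole in at least one chunk.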

## Storing the chunks in a vector store

We can now generate embeddings for every chunk and store them in a vector store.

<img src='images/vectorstore.png' width="1000">


In [4]:
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model=MODEL)
vectorstore = FAISS.from_documents(chunks, embeddings)
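
As a rough intuition for what an embedding is, the sketch below turns text into a fixed-length vector by counting words from a tiny, hypothetical vocabulary, then compares vectors by Euclidean distance. Real embedding models produce dense vectors that capture meaning far better than word counts, and the real store may use a different metric; `VOCABULARY` and `toy_embed` are assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical vocabulary; a real embedding model has no such fixed word list.
VOCABULARY = ["planet", "star", "transit", "telescope", "galaxy"]


def toy_embed(text: str) -> list[float]:
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCABULARY]


chunk_a = toy_embed("a planet can transit its star")
chunk_b = toy_embed("the galaxy contains many a star")
query = toy_embed("how does a planet transit")

# A smaller distance means the chunk is closer to the query.
distance_a = math.dist(query, chunk_a)
distance_b = math.dist(query, chunk_b)
print(distance_a, distance_b)
```

In this toy setup the first chunk ends up closer to the query than the second, because it shares more vocabulary words with it.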

## Setting up a retriever

We can use a retriever to find chunks in the vector store that are similar to a supplied question.

<img src='images/retriever.png' width="1000">



In [5]:
retriever = vectorstore.as_retriever()
retriever.invoke("What are the ways to find a planet?")

[Document(metadata={'source': 'Exoplanets.pdf', 'page': 29}, page_content='potential for finding habitable environments beyond Earth.  \nWhat is an exoplanet candidate?  \nAn exoplanet candidate is a likely planet discovered by a telescope that has not yet been confirmed to \nexist.'),
 Document(metadata={'source': 'Exoplanets.pdf', 'page': 25}, page_content='measurement difficult.  \n \nHow do the sizes and masses of exoplanets relate to their potential compositions?'),
 Document(metadata={'source': 'Exoplanets.pdf', 'page': 33}, page_content='investigation will determine if some of them possess atmospheres, oceans, or other signs of habitability.'),
 Document(metadata={'source': 'Exoplanets.pdf', 'page': 15}, page_content="which the universe is expanding.  \n \nWhat method do scientists use to measure the universe's expansion?")]
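
The call above returned four documents. Conceptually, a retriever embeds the question and returns the `k` stored chunks whose vectors are nearest to it. The sketch below shows that ranking step on hand-written two-dimensional vectors; `toy_index` and `toy_retrieve` are hypothetical, and the default of `k=4` is chosen only to mirror the output above.

```python
import heapq
import math

# Hypothetical pre-computed embeddings: (vector, chunk text).
toy_index = [
    ([1.0, 0.0], "chunk about transits"),
    ([0.9, 0.1], "chunk about radial velocity"),
    ([0.0, 1.0], "chunk about galaxies"),
    ([0.1, 0.9], "chunk about the Milky Way"),
]


def toy_retrieve(query_vector, index, k=4):
    """Return the text of the k entries closest to query_vector."""
    nearest = heapq.nsmallest(
        k, index, key=lambda entry: math.dist(query_vector, entry[0])
    )
    return [text for _, text in nearest]


top_two = toy_retrieve([1.0, 0.05], toy_index, k=2)
print(top_two)
```

Lowering `k` trades recall for a shorter prompt: fewer chunks reach the model, so there is less irrelevant text but also a higher chance of missing the passage that holds the answer.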

## Configuring the model

We'll use Ollama to load the local model into memory. After creating the model, we can invoke it with a question to get a response.

<img src='images/model.png' width="1000">

In [6]:
from langchain_ollama import ChatOllama

model = ChatOllama(model=MODEL, temperature=0)
# model.invoke("What are the ways to find a planet?")

## Parsing the model's response

The response from the model is an `AIMessage` instance containing the answer. We can extract the text answer by using the appropriate output parser. We can connect the model and the parser using a chain.

<img src='images/parser.png' width="1000">


In [7]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

chain = model | parser 
# print(chain.invoke("What are the ways to find a planet?"))
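
To see why `model | parser` produces something that can be invoked, here is a minimal sketch of pipe-style composition in plain Python. It is not LangChain's implementation; `ToyMessage`, `Step`, `toy_model`, and `toy_parser` are hypothetical stand-ins, and the "model" simply echoes the question.

```python
from dataclasses import dataclass


@dataclass
class ToyMessage:
    """Stand-in for a chat message: the answer text lives in .content."""
    content: str


class Step:
    """Wrap a function so that steps can be composed with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # The combined step feeds this step's output into the next step.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)


toy_model = Step(lambda question: ToyMessage(content=f"Echo: {question}"))
toy_parser = Step(lambda message: message.content)

toy_chain = toy_model | toy_parser
print(toy_chain.invoke("What is an exoplanet?"))
```

Invoking `toy_model` alone returns a message object, while invoking `toy_chain` returns a plain string, which is the same difference the output parser makes in the cell above.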

## Setting up a prompt

In addition to the question we want to ask, we also want to provide the model with the context from the PDF file. We can use a prompt template to define and reuse the prompt we'll use with the model.


<img src='images/prompt.png' width="1000">

In [8]:
from langchain_core.prompts import PromptTemplate

# Alternative instructions you could try in the template below:
# - Answer the question based on a given context.
# - You can also provide the information if the context doesn't have it.
# - If you can't answer the question, reply "I don't know" and also provide
#   the links in the context if relevant.

template = """
You are an assistant that provides answers to questions. 

Answer the question and also provide the links in the context if the question is related. 

Be as concise as possible and go straight to the point.

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
print(prompt.format(context="Here is some context", question="Here is a question"))


You are an assistant that provides answers to questions. 

Answer the question and also provide the links in the context if the question is related. 

Be as concise as possible and go straight to the point.

Context: Here is some context

Question: Here is a question
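
The placeholder syntax in the template resembles Python's own `str.format`, so the substitution can be sketched without any library. This is only an analogy under that assumption; `toy_template` is a hypothetical, shortened version of the template above.

```python
toy_template = "Context: {context}\n\nQuestion: {question}"

filled = toy_template.format(
    context="Here is some context",
    question="Here is a question",
)
print(filled)

# Leaving out a placeholder fails loudly instead of producing a partial prompt.
try:
    toy_template.format(context="Here is some context")
except KeyError as missing:
    missing_name = missing.args[0]
    print(f"Missing variable: {missing_name}")
```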



## Adding the prompt to the chain

We can now chain the prompt with the model and the parser.

<img src='images/chain1.png' width="1000">

In [9]:
chain = prompt | model | parser

# chain.invoke({
#     "context": "exoplanet", 
#     "question": "What are the ways to find a planet?"
# })


## Adding the retriever to the chain

Next, we can connect the retriever to the chain so that the context comes from the vector store instead of being passed in by hand.

<img src='images/chain2.png' width="1000">

In [10]:
from operator import itemgetter

chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    | prompt
    | model
    | parser
)
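
The dictionary at the start of the chain fans a single input out into the two variables the prompt expects. The sketch below reproduces that step in plain Python, with a hypothetical `toy_retriever` that returns canned strings. It also joins the chunk texts into one string, an optional extra step that the chain above does not perform.

```python
from operator import itemgetter


def toy_retriever(question: str) -> list[str]:
    """Hypothetical retriever returning canned chunk texts."""
    return [f"chunk A for: {question}", f"chunk B for: {question}"]


def build_prompt_inputs(inputs: dict) -> dict:
    """Turn {'question': ...} into the variables the prompt template needs."""
    question = itemgetter("question")(inputs)
    retrieved = toy_retriever(question)
    return {
        # Join the chunk texts so the prompt receives one readable string.
        "context": "\n\n".join(retrieved),
        "question": question,
    }


prompt_inputs = build_prompt_inputs({"question": "Where are exoplanets?"})
print(prompt_inputs["context"])
```

The question is used twice: once to look up context and once as-is, which is why `itemgetter("question")` appears under both keys in the chain above.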

## Using the chain to answer questions

Finally, we can use the chain to ask questions that will be answered using the PDF document.

In [11]:
questions = [
    "What are the ways to find a planet?",
    "What is an exoplanet?",
    "Is there life on other planets?",
    "How many exoplanets are there?",
    "Where are exoplanets?",
    "Our Milky Way Galaxy: How Big is Space?",
]

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {chain.invoke({'question': question})}")
    print("*************************\n")

Question: What are the ways to find a planet?
Answer: According to the provided documents, there are several methods to find a planet:

1. **Telescope observations**: Telescopes can detect exoplanet candidates by observing their effects on their host stars.
2. **Transit method**: By measuring the decrease in brightness of a star as a planet passes in front of it.
3. **Radial velocity method**: By measuring the star's wobbling motion caused by an orbiting planet.

For more information, you can refer to:

* NASA Exoplanet Archive: [https://exoplanets.nasa.gov/](https://exoplanets.nasa.gov/)
* American Astronomical Society (AAS) Exoplanet Encyclopedia: [https://www.aas.org/publications/aas-exoplanet-encyclopedia](https://www.aas.org/publications/aas-exoplanet-encyclopedia)
* European Space Agency (ESA) Exoplanets page: [https://www.esa.int/Science_Explained/Exoplanets](https://www.esa.int/Science_Explained/Exoplanets)
*************************

Question: What is an exoplanet?
Answer: An e