<a href="https://colab.research.google.com/drive/1K5u1xyfHBG0NmB2OwLz0U-bQqwnwQkB3" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



In [3]:
!pip install python-dotenv -q

# A Gentle Introduction to RAG Applications

This notebook creates a simple RAG (Retrieval-Augmented Generation) system to answer questions from a PDF document using an open-source model.

In [1]:
PDF_FILE = "FAQ_GDG.pdf"

# We'll be using Llama 3.1 8B for this example.
MODEL = "llama3.1"

## Loading the PDF document

We start by loading the PDF document and breaking it down into separate pages.

<img src='images/documents.png' width="500">

In [2]:
%pip install pypdf

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Note: you may need to restart the kernel to use updated packages.


In [2]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(PDF_FILE)
pages = loader.load()

print(f"Number of pages: {len(pages)}")
print(f"Length of a page: {len(pages[1].page_content)}")
print("Content of a page:", pages[1].page_content)

Number of pages: 12
Length of a page: 2037
Content of a page: W h y w a s m y a c c o u n t t u r n e d t o p r i v a t e ? 
 If we r easonably belie v e content in y our pr o ﬁ le violates our content policy , y our account will be switched t o priv ate and the content in y our pr o ﬁ le will be deleted. Y ou won 't be able t o mak e y our account public again for at least 60 da ys. Google also r eser v es the right t o suspend or terminate y our access t o the ser vices or delete y our Google Account, as described in the T aking action in case of pr oblems section of the Google T erms of Ser vice. 
W h a t h a p p e n s w h e n I i n t e g r a t e m y p r o ﬁ l e w i t h a t h i r d - p a r t y a p p o r s e r v i c e ? 
 If y ou authoriz e an application t o access y our Google De v eloper Pr ogr am pr o ﬁ le, that application will be able t o see y our pr o ﬁ le information, e v en if y ou ha v e not made y our pr o ﬁ le public. Learn mor e about how t o manage thir d-par ty apps a

## Splitting the pages in chunks

Whole pages are too long to serve as retrieval units, so we split them into smaller, overlapping chunks.

<img src='images/splitter.png' width="1000">


In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=100)

chunks = splitter.split_documents(pages)
print(f"Number of chunks: {len(chunks)}")
print(f"Length of a chunk: {len(chunks[1].page_content)}")
print("Content of a chunk:", chunks[1].page_content)


Number of chunks: 36
Length of a chunk: 582
Content of a chunk: H o w d o I e d i t m y p r o ﬁ l e ? 
 Y ou can edit y our Google De v eloper Pr ogr am pr o ﬁ le b y going t o de v elopers.google.com/pr o ﬁ le/u/me . 
W h a t h a p p e n s i f I m a k e m y p r o ﬁ l e p u b l i c ? 
 Making y our pr o ﬁ le public mak es it viewable b y any one online. This includes y our name, image, r ole, company or school, bio, badges y ou'v e r eceiv ed, stats, and y our social media links (including GitHub, GitLab, X, Link edIn, and Stack Ov er ﬂ ow). Y our pages sa v ed, pages r ated, and e v ents attended ar e not par t of y our public pr o ﬁ le.
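
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-width sliding window. It is only a sketch: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraph and line breaks, so its chunks will not match this output. The helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width chunks; consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary is less likely to be cut off from its context.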


## Storing the chunks in a vector store

We can now generate embeddings for every chunk and store them in a vector store.
We will use FAISS (Facebook AI Similarity Search).

<img src='images/vectorstore.png' width="1000">


In [5]:
%pip install -qU langchain-community faiss-cpu

Note: you may need to restart the kernel to use updated packages.


In [4]:
import faiss
print(faiss.__version__)

1.9.0


In [None]:
%pip install -qU langchain-huggingface
%pip install -qU langchain-ollama
%pip install einops

Note: you may need to restart the kernel to use updated packages.


In [None]:
from sentence_transformers import SentenceTransformer
#ST = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
ST = SentenceTransformer("all-MiniLM-L6-v2")
query= "hello "
embedded_query = ST.encode(query)
embedded_query

  from tqdm.autonotebook import tqdm, trange
    ImportError: DLL load failed while importing onnx_cpp2py_export: A dynamic link library (DLL) initialization routine failed.
  warn(message, cls)


In [None]:
len(embedded_query)

In [6]:
from langchain_community.vectorstores import FAISS
#from langchain_community.embeddings import OllamaEmbeddings
from langchain_ollama import OllamaEmbeddings # new lib

embed_model = "nomic-embed-text"  # a dedicated embedding model served by Ollama, separate from the llama3.1 chat model
embeddings = OllamaEmbeddings(model=embed_model)
vectorstore = FAISS.from_documents(chunks, embeddings)
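
Conceptually, a vector store keeps one vector per chunk and compares the query vector against them. The sketch below uses cosine similarity on made-up 3-dimensional vectors. Real embeddings have hundreds of dimensions, and FAISS supports other metrics (such as Euclidean distance) and index structures, so this is an illustration of the idea rather than of FAISS itself. All names and numbers are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three chunks (hypothetical values).
toy_store = {
    "chunk about joining a chapter": [0.9, 0.1, 0.0],
    "chunk about profile privacy": [0.1, 0.9, 0.1],
    "chunk about event fees": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # pretend embedding of a question about joining

# Pick the stored chunk whose vector points in the most similar direction.
best_match = max(toy_store, key=lambda name: cosine_similarity(query_vector, toy_store[name]))
print(best_match)
```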

## Setting up a retriever

We can use a retriever to find chunks in the vector store that are similar to a supplied question.

<img src='images/retriever1.png' width="1000">



In [None]:
retriever = vectorstore.as_retriever()
retriever.invoke("what is GDSC ?")
# returns the 4 most similar chunks by default (k=4)

[Document(metadata={'source': 'FAQ_GDG.pdf', 'page': 9}, page_content='GDSC\nFAQ\nThe\npurpose\nof\nthis\ndocument\nis\nto\ncapture\nfrequently\nasked\nquestions\nabout\nthe\nGDSC\nprogram.\nJoin\nGDSC\nWho\nshould\njoin\nGoogle\nDeveloper\nStudent\nClubs?\nCollege\nand\nuniv ersity\nstudents\nar e\nencour aged\nt o\njoin\nGoogle\nDe v eloper\nStudent\nClubs.\nCan\nI\njoin\nmultiple\nchapters?\nY ou\ncan\npar ticipate\nin\ne v ents\nor ganiz ed\nb y\nmultiple\nchapters,\nhowe v er\nif\ny ou\ndecide\nt o\ndedicate\ny ourself\nt o\nbecome\na\nGDSC\nLead\nor\nCor e\nT eam\nMember ,\ny ou\nwill\nbe\noﬃcially\nassigned\nt o\none\nchapter .\nWhat\ndoes\na\nGDSC\nlead\ndo?\nIn\ngener al,\nGDSC\nleaders\nar e\nfocused\non\nthe\nfollowing\nar eas:\n●\nStar t\na\nclub\n-\nW ork\nwith\ny our\nuniv ersity\nor\ncollege\nt o\nstar t\na\nstudent\nclub.\nSelect\na\ncor e\nteam\nand'),
 Document(metadata={'source': 'FAQ_GDG.pdf', 'page': 10}, page_content="Lead\napplication\n.\n●\nW e 'll\nr e view\ny 
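
The retriever's job boils down to ranking chunks by similarity to the question and returning the best `k`. Here is a minimal sketch of that selection step, assuming the similarity scores have already been computed. The chunk labels, scores, and the `top_k` helper are made up for illustration.

```python
import heapq

# Hypothetical (chunk label, similarity score) pairs; higher means more similar.
scored_chunks = [
    ("GDSC FAQ intro", 0.91),
    ("Profile privacy", 0.12),
    ("GDSC lead application", 0.78),
    ("Event fees", 0.33),
    ("Joining a chapter", 0.64),
    ("Third-party apps", 0.08),
]


def top_k(pairs: list[tuple[str, float]], k: int = 4) -> list[str]:
    """Return the k chunk labels with the highest scores, best first."""
    best = heapq.nlargest(k, pairs, key=lambda pair: pair[1])
    return [label for label, _ in best]


print(top_k(scored_chunks))       # four results, mirroring the retriever's default of 4
print(top_k(scored_chunks, k=2))  # a smaller k returns fewer, more focused chunks
```

With the real retriever, `k` can be changed when it is created, for example `vectorstore.as_retriever(search_kwargs={"k": 2})`.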

## Configuring the model

We'll be using Ollama to load the local model in memory. After creating the model, we can invoke it with a question to get the response back.

<img src='images/model1.png' width="1000">

In [9]:
import langchain_community
print(dir(langchain_community))

['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'docstore', 'document_loaders', 'utils', 'vectorstores']


In [None]:
from langchain_ollama import ChatOllama

model = ChatOllama(model=MODEL, temperature=0)
model.invoke("Who is the president of the United States?")  # returns an AIMessage object rather than a plain string

AIMessage(content='As of my last update in April 2023, Joe Biden is the President of the United States. He took office on January 20, 2021, succeeding Donald Trump as the 46th President of the United States. Please note that this information might change over time due to elections or other political developments.', additional_kwargs={}, response_metadata={'model': 'llama3.1', 'created_at': '2024-11-28T13:38:25.9847052Z', 'message': {'role': 'assistant', 'content': ''}, 'done_reason': 'stop', 'done': True, 'total_duration': 30240396900, 'load_duration': 17220478000, 'prompt_eval_count': 19, 'prompt_eval_duration': 2285017000, 'eval_count': 65, 'eval_duration': 10723552000}, id='run-e63e6834-903d-45cc-8a93-aa7bd0f88662-0', usage_metadata={'input_tokens': 19, 'output_tokens': 65, 'total_tokens': 84})

In [25]:
%pip install -qU langchain-groq


Note: you may need to restart the kernel to use updated packages.


In [6]:
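# Initialize a hosted model on Groq (replaces the local Ollama model above)
import os
from dotenv import load_dotenv
from langchain_groq import ChatGroq

load_dotenv()  # loads GROQ_API_KEY from a local .env file, if one exists

model = ChatGroq(
    temperature=0,
    model="llama-3.1-70b-versatile",  # hosted model used instead of MODEL
    api_key=os.environ["GROQ_API_KEY"],  # never hard-code API keys in a notebook
    verbose=True,
    max_retries=3,
)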


## Parsing the model's response

The response from the model is an `AIMessage` instance containing the answer. We can extract the text answer by using the appropriate output parser. We can connect the model and the parser using a chain.

<img src='images/parser1.png' width="1000">


In [8]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

chain = model | parser 
print(chain.invoke("where tunisia is located"))

Tunisia is a country located in North Africa. It is situated in the Maghreb region, bordered by:

1. Algeria to the west
2. Libya to the southeast
3. Mediterranean Sea to the north and east

Tunisia is a relatively small country, with a total area of approximately 163,610 square kilometers (63,170 square miles). Its capital and largest city is Tunis, which is located in the northeastern part of the country.

Geographically, Tunisia is characterized by a diverse landscape, with mountains, deserts, and coastal plains. The country has a long coastline along the Mediterranean Sea, with several important ports, including the Port of Tunis and the Port of Sfax.

Here's a rough outline of Tunisia's location:

* Latitude: 30° - 38° N
* Longitude: 7° - 12° E

Tunisia is a strategic location, connecting Europe, Africa, and the Middle East, making it an important hub for trade, culture, and tourism.
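
The `|` operator may look unusual in Python. The following is a simplified stand-in, not LangChain's actual implementation, that shows how overloading `__or__` lets two steps be composed into a chain. `Step`, `FakeMessage`, `fake_model`, and `to_text` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FakeMessage:
    """Stand-in for a chat message object that wraps the answer text."""
    content: str


class Step:
    """Wraps a function so that steps can be composed with the | operator."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def invoke(self, value: Any) -> Any:
        return self.func(value)

    def __or__(self, other: "Step") -> "Step":
        # Build a new step that runs self first, then feeds the result to other.
        return Step(lambda value: other.invoke(self.invoke(value)))


fake_model = Step(lambda question: FakeMessage(content=f"Echo: {question}"))
to_text = Step(lambda message: message.content)  # plays the role of the output parser

toy_chain = fake_model | to_text
print(toy_chain.invoke("hello"))
```

The composed chain returns a plain string because the last step unwraps the message, which is the role the output parser plays in the real chain.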


## Setting up a prompt

In addition to the question we want to ask, we also want to provide the model with the context from the PDF file. We can use a prompt template to define and reuse the prompt we'll use with the model.


<img src='images/prompt1.png' width="1000">

In [13]:
from langchain_core.prompts import PromptTemplate

template = """
You are an assistant that provides answers to questions based on
a given context. 

Answer the question based on the context. If you can't answer the
question, reply "I don't know".

Be as concise as possible and go straight to the point.

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
print(prompt.format(context="Here is some context", question="Here is a question"))


You are an assistant that provides answers to questions based on
a given context. 

Answer the question based on the context. If you can't answer the
question, reply "I don't know".

Be as concise as possible and go straight to the point.

Context: Here is some context

Question: Here is a question
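
Filling the template is essentially string substitution. A rough equivalent using Python's built-in `str.format` is shown below, only to make the mechanics concrete. The shortened `toy_template` is illustrative, and `PromptTemplate` does more than this (for example, it keeps track of which input variables the template expects).

```python
toy_template = "Context: {context}\n\nQuestion: {question}"

filled = toy_template.format(
    context="Anna's sister is Susan",
    question="Who is Susan's sister?",
)
print(filled)

# Leaving out a placeholder raises an error instead of silently producing a broken prompt.
try:
    toy_template.format(context="only context")
except KeyError as missing:
    print(f"Missing variable: {missing}")
```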



## Adding the prompt to the chain

We can now chain the prompt with the model and the parser.

<img src='images/chain11.png' width="1000">

In [14]:
chain = prompt | model | parser

chain.invoke({
    "context": "Anna's sister is Susan", 
    "question": "Who is Susan's sister?"
})


'Anna.'

## Adding the retriever to the chain

Finally, we can connect the retriever to the chain to get the context from the vector store.

The context is the list of the 4 chunks most similar to the question.

<img src='images/chain22.png' width="1000">

In [15]:
from operator import itemgetter

chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    | prompt
    | model
    | parser
)
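
The dictionary at the start of the chain runs each of its values on the same input and collects the results under the given keys. Here is a plain-Python sketch of that data flow, with a hypothetical `fake_retriever` that returns canned text in place of the real retriever.

```python
from operator import itemgetter


def fake_retriever(question: str) -> list[str]:
    """Stand-in retriever: returns canned chunks regardless of the question."""
    return [
        "GDSC stands for Google Developer Student Clubs.",
        "College and university students are encouraged to join.",
    ]


def build_prompt_inputs(inputs: dict) -> dict:
    """Mimic the mapping step: every value is computed from the same input dict."""
    question = itemgetter("question")(inputs)
    return {
        "context": fake_retriever(question),  # like itemgetter("question") | retriever
        "question": question,                 # like itemgetter("question")
    }


prompt_inputs = build_prompt_inputs({"question": "What is GDSC ?"})
print(prompt_inputs["question"])
print("\n\n".join(prompt_inputs["context"]))  # joining chunks gives a cleaner context string
```

In the chain above, the retriever returns a list of `Document` objects that is inserted into the prompt as-is. Joining their `page_content` first, as in the last line of the sketch, is a common refinement.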

## Using the chain to answer questions

We can now use the chain to ask questions, which will be answered using the PDF document.

In [16]:
questions = [
    "What is GDG ?",
    "What is GDSC ?",
    "What is GDE?",
]

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {chain.invoke({'question': question})}")
    print("*************************\n")

Question: What is GDG ?
Answer: The Google Developer Group offers tools and resources to help you on your development journey.
*************************

Question: What is GDSC ?
Answer: GDSC stands for Google Developer Student Clubs.
*************************

Question: What is GDE?
Answer: GDE stands for Google Developer Experts.
*************************



In [17]:
q= "how many members in GDG carthage ? "

chain.invoke({'question': q})

"I don't know"

In [18]:
q= "how can i join GDG and be a GDE "
chain.invoke({'question': q})

'To join GDG: Visit the members site at https://gdg.community.dev/ and join a chapter or apply to create a new chapter if none exists in your area.\n\nTo become a GDE: You need to have solid expertise in an area featuring Google technology, be passionate about giving back to the community, and meet the eligibility criteria.'

In [19]:
q= "do i need to pay any fees to be a member in GDG and attend workshops ? "

chain.invoke({'question': q})

'No, there is no cost to join a chapter or attend events.'

## What to do next

- Experiment with the chunk size and overlap.
- Try different documents.
- Try multiple documents instead of just one.
- Try a model other than Llama 3.1.
- Try different embedding models.
- Try different vector stores.

