<a href="https://colab.research.google.com/github/NehaParveen03/my_first_report/blob/main/08thMarch_RAG_Dev.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Retrieval Augmented Generation using `gpt-4o-mini`**

### **Mounting the Drive**

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


### **Types of RAG**
* **Basic RAG**
* **Advanced RAG**
  * **Pre-retrieval Query Rewriting**
* **Agentic RAG**

## **Basic RAG using GPT-4o-mini**

### **1. Requirement Phase**

In [None]:
!pip install langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.19-py3-none-any.whl.metadata (2.4 kB)
Collecting langchain-core<1.0.0,>=0.3.41 (from langchain-community)
  Downloading langchain_core-0.3.43-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain<1.0.0,>=0.3.20 (from langchain-community)
  Downloading langchain-0.3.20-py3-none-any.whl.metadata (7.7 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.8.1-py3-none-any.whl.metadata (3.5 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-

In [None]:
!pip install langchain_chroma langchain_openai

Collecting langchain_chroma
  Downloading langchain_chroma-0.2.2-py3-none-any.whl.metadata (1.3 kB)
Collecting langchain_openai
  Downloading langchain_openai-0.3.8-py3-none-any.whl.metadata (2.3 kB)
Collecting chromadb!=0.5.10,!=0.5.11,!=0.5.12,!=0.5.4,!=0.5.5,!=0.5.7,!=0.5.9,<0.7.0,>=0.4.0 (from langchain_chroma)
  Downloading chromadb-0.6.3-py3-none-any.whl.metadata (6.8 kB)
Collecting tiktoken<1,>=0.7 (from langchain_openai)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting build>=1.0.3 (from chromadb!=0.5.10,!=0.5.11,!=0.5.12,!=0.5.4,!=0.5.5,!=0.5.7,!=0.5.9,<0.7.0,>=0.4.0->langchain_chroma)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb!=0.5.10,!=0.5.11,!=0.5.12,!=0.5.4,!=0.5.5,!=0.5.7,!=0.5.9,<0.7.0,>=0.4.0->langchain_chroma)
  Downloading chroma_hnswlib-0.7.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecti

**LangChain packages**
* **langchain**
* **langchain-core**
* **langchain-community**

In [None]:
import os
from langchain_chroma import Chroma                        # Vector database (Chroma) integration
from langchain_core.output_parsers import StrOutputParser  # Parses the model output into a plain string
from langchain_core.prompts import PromptTemplate          # Template with placeholders; once filled in, it is sent to the model as the instruction
from langchain_openai import OpenAIEmbeddings, ChatOpenAI  # OpenAI embedding model and OpenAI chat model
from langchain_text_splitters import CharacterTextSplitter # Splits the text into chunks that fit the context window
from langchain_core.runnables import RunnablePassthrough   # Passes its input through unchanged, so the query can reach the prompt

### **2. Setting up the OpenAI**

In [None]:
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('GPt') # Read the OpenAI API key from the Colab secret named 'GPt' and expose it to the OpenAI client
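
Outside Colab, `userdata` is not available, so the key usually has to come from the environment directly. The sketch below is one hedged way to fail early when a variable is missing. The helper `require_env` and the variable name `DEMO_API_KEY` are hypothetical and exist only for this illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical variable used only for this demo; a real run would check OPENAI_API_KEY.
os.environ.setdefault("DEMO_API_KEY", "placeholder-value")
demo_key = require_env("DEMO_API_KEY")
print("key loaded:", bool(demo_key))
```

Checking the key up front gives a clear error at setup time instead of an authentication failure later, when the first embedding or chat request is made.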

### **3. Models**

* **OPENAI -> PAID**
* **GEMINI -> PAID**
* **HUGGINGFACEHUB -> FREE**
* **COPILOT -> FREE / PAID**
* **MIXTRAL -> FREE using HF**
* **Azure AI -> PAID**


## **Start with the DB setup**

### **1. Start with Embedding Model**

In [None]:
embedding = OpenAIEmbeddings(model = "text-embedding-3-small")

In [None]:
embedding

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x7cffdea0ae90>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x7cffde41d150>, model='text-embedding-3-small', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

### **2. Creation of VectorDB**

In [None]:
vector_store = Chroma(
    collection_name = "dev_08thmarch_rag", # Name of the collection inside the vector database
    embedding_function = embedding         # Embedding model used to embed documents and queries
)

In [None]:
vector_store

<langchain_chroma.vectorstores.Chroma at 0x7cffdea09110>

### **3. Loading the file to be split and stored in the vector store**

In [None]:
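# The speech contains non-ASCII characters (curly quotes, dashes), so the encoding is set explicitly
with open("/content/drive/MyDrive/PPTs/Session Files/Morning - GenAI/RAG Implementation/Text Data Source/2024_state_of_the_union.txt", encoding="utf-8") as f: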
  files = f.read()

In [None]:
print(files)

March 07, 2024
Remarks of President Joe Biden — State of the Union Address As Prepared for Delivery
Home
Briefing Room
Speeches and Remarks
The United States Capitol

###

Good evening. 

Mr. Speaker. Madam Vice President. Members of Congress. My Fellow Americans. 

In January 1941, President Franklin Roosevelt came to this chamber to speak to the nation. 

He said, “I address you at a moment unprecedented in the history of the Union.” 

Hitler was on the march. War was raging in Europe. 

President Roosevelt’s purpose was to wake up the Congress and alert the American people that this was no ordinary moment.   

Freedom and democracy were under assault in the world. 

Tonight I come to the same chamber to address the nation. 

Now it is we who face an unprecedented moment in the history of the Union. 

And yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary moment either. 

Not since President Lincoln and the Civil War have 

### **4. Split the document**
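* **Context Window: the maximum amount of text (measured in tokens) that the LLM can process in a single request. Splitting the document into chunks keeps each retrieved piece well within that limit.**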

In [None]:
text_splitter = CharacterTextSplitter(
    chunk_size = 1000,     # Maximum size of each chunk, measured with length_function (characters here)
    chunk_overlap = 200,   # Characters shared between neighbouring chunks so that context carries across chunk boundaries
    length_function = len
)
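
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunks using a plain fixed-width sliding window. This is a simplification for illustration only: `CharacterTextSplitter` splits on a separator first and then merges the pieces, so its chunk boundaries will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width chunks whose start positions are chunk_size - chunk_overlap apart."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, which is the same idea as the 200-character overlap above: a sentence cut at a chunk boundary still appears whole in at least one chunk.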

In [None]:
text = text_splitter.create_documents([files])

In [None]:
text

[Document(metadata={}, page_content='March 07, 2024\nRemarks of President Joe Biden — State of the Union Address As Prepared for Delivery\nHome\nBriefing Room\nSpeeches and Remarks\nThe United States Capitol\n\n###\n\nGood evening. \n\nMr. Speaker. Madam Vice President. Members of Congress. My Fellow Americans. \n\nIn January 1941, President Franklin Roosevelt came to this chamber to speak to the nation. \n\nHe said, “I address you at a moment unprecedented in the history of the Union.” \n\nHitler was on the march. War was raging in Europe. \n\nPresident Roosevelt’s purpose was to wake up the Congress and alert the American people that this was no ordinary moment.   \n\nFreedom and democracy were under assault in the world. \n\nTonight I come to the same chamber to address the nation. \n\nNow it is we who face an unprecedented moment in the history of the Union. \n\nAnd yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary mome

In [None]:
num_documents = len(text)
print(f"Number of documents: {num_documents}")

Number of documents: 48


### **5. Save the document in vectorstore**

In [None]:
ids = vector_store.add_documents(text)

In [None]:
len(ids)

48

### **6. Semantic Search using VectorDB**

In [None]:
# Query the vector store for the chunks most similar to the question
results = vector_store.similarity_search(
    "Who invaded Ukraine?",
    k = 2 # Number of results to return
)
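
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea with hand-made three-dimensional vectors and cosine similarity. Both are assumptions for illustration: real embeddings have far more dimensions, and the distance metric a vector store actually uses depends on its configuration. The names `cosine_similarity`, `top_k`, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k stored vectors most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy "embeddings"; the values are made up for the demo.
toy_vectors = {
    "ukraine_chunk": [0.9, 0.1, 0.0],
    "economy_chunk": [0.1, 0.9, 0.1],
    "healthcare_chunk": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

nearest = top_k(toy_query, toy_vectors, k=2)
print(nearest)
```

The `k` argument plays the same role as `k = 2` in the call above: it caps how many of the closest chunks are returned.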

In [None]:
results

[Document(id='2049b1bb-02df-4460-a6be-226d81a108bf', metadata={}, page_content='And yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary moment either. \n\nNot since President Lincoln and the Civil War have freedom and democracy been under assault here at home as they are today. \n\nWhat makes our moment rare is that freedom and democracy are under attack, both at home and overseas, at the very same time. \n\nOverseas, Putin of Russia is on the march, invading Ukraine and sowing chaos throughout Europe and beyond. \n\nIf anybody in this room thinks Putin will stop at Ukraine, I assure you, he will not. \n\nBut Ukraine can stop Putin if we stand with Ukraine and provide the weapons it needs to defend itself. That is all Ukraine is asking. They are not asking for American soldiers. \n\nIn fact, there are no American soldiers at war in Ukraine. And I am determined to keep it that way. \n\nBut now assistance for Ukraine is being b

In [None]:
# Print each result with its position and its document id
for count, x in enumerate(results, start=1):
  print(f"Results {count}: \n{x.page_content}[Location: {x.id}]")
  print("------------------------")

Results 1: 
And yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary moment either. 

Not since President Lincoln and the Civil War have freedom and democracy been under assault here at home as they are today. 

What makes our moment rare is that freedom and democracy are under attack, both at home and overseas, at the very same time. 

Overseas, Putin of Russia is on the march, invading Ukraine and sowing chaos throughout Europe and beyond. 

If anybody in this room thinks Putin will stop at Ukraine, I assure you, he will not. 

But Ukraine can stop Putin if we stand with Ukraine and provide the weapons it needs to defend itself. That is all Ukraine is asking. They are not asking for American soldiers. 

In fact, there are no American soldiers at war in Ukraine. And I am determined to keep it that way. 

But now assistance for Ukraine is being blocked by those who want us to walk away from our leadership in the world.[Locatio

## **RAG Pipeline**

### **Converting into strings**

In [None]:
# Join the page content of the retrieved documents into a single string
def format_docs(docs):
  return "\n\n".join(x.page_content for x in docs)
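
As a quick check of what `format_docs` produces, the sketch below runs the same function on two stand-in documents. `FakeDoc` is a hypothetical minimal class that only mimics the `page_content` attribute of a LangChain `Document`.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in exposing only the attribute that format_docs reads."""
    page_content: str


def format_docs(docs):
    # Same logic as the notebook cell: one blank line between chunks
    return "\n\n".join(x.page_content for x in docs)


sample_docs = [FakeDoc("First chunk."), FakeDoc("Second chunk.")]
context = format_docs(sample_docs)
print(context)
```

The blank line between chunks gives the model a visible boundary between separately retrieved passages.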

### **Retriever Function**

In [None]:
# The vector store is the source that the retriever searches
retriver = vector_store.as_retriever()

### **LLM Instance**

In [None]:
llm = ChatOpenAI(model = "gpt-4o-mini")

In [None]:
llm

ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x7cffde1ca910>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x7cffddbcdfd0>, root_client=<openai.OpenAI object at 0x7cffde128d10>, root_async_client=<openai.AsyncOpenAI object at 0x7cffddaaf5d0>, model_name='gpt-4o-mini', model_kwargs={}, openai_api_key=SecretStr('**********'))

### **Building the Prompt**

In [None]:
template = """Use the context provided to answer the user's question. If you don't know the answer based on context provided, tell the user you don't know the answer based on the context and that you are sorry.

context: {context}

question: {query}

answer:
"""

In [None]:
# Create a template
custom_rag_template = PromptTemplate.from_template(template)

In [None]:
custom_rag_template

PromptTemplate(input_variables=['context', 'query'], input_types={}, partial_variables={}, template="Use the context provided to answer the user's question. If you don't know the answer based on context provided, tell the user you don't know the answer based on the context and that you are sorry.\n\ncontext: {context} \n \nquestion: {query}\n\nanswer: \n")
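
The two `input_variables` are simply the placeholders in the template string. As a rough illustration, plain `str.format` can stand in for `PromptTemplate` to show what the model receives once they are filled in. The shortened `demo_template` and the sample context below are assumptions made for this sketch.

```python
# Shortened stand-in for the notebook's template; it uses the same two placeholders.
demo_template = """Use the context provided to answer the user's question.

context: {context}

question: {query}

answer:
"""

filled_prompt = demo_template.format(
    context="Overseas, Putin of Russia is on the march, invading Ukraine.",
    query="Who invaded Ukraine?",
)
print(filled_prompt)
```

Ending the prompt with `answer:` cues the model to continue with the answer itself rather than restating the question.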

### **Chaining everything together**

In [None]:
rag_chain = (
    {"context": retriver | format_docs, "query" : RunnablePassthrough()}
    | custom_rag_template
    | llm
    | StrOutputParser()
)
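
Reading the chain left to right, the query is sent to the retriever (whose results are formatted into `context`) and is also passed through unchanged as `query`; the filled prompt then goes to the model, and the parser extracts the text. The sketch below traces that flow with plain functions. Every function in it is a hypothetical stand-in: nothing calls OpenAI or Chroma, and this is not how LangChain implements the `|` operator.

```python
def fake_retriever(query):
    """Stand-in retriever: returns a hard-coded chunk regardless of the query."""
    return ["Overseas, Putin of Russia is on the march, invading Ukraine."]


def format_chunks(chunks):
    """Join retrieved chunks into one context string."""
    return "\n\n".join(chunks)


def build_prompt(inputs):
    """Fill the prompt from a dict with 'context' and 'query' keys."""
    return f"context: {inputs['context']}\n\nquestion: {inputs['query']}\n\nanswer:"


def fake_llm(prompt):
    """Stand-in model: returns a message-like dict instead of calling an API."""
    return {"content": f"(stub reply to a prompt of {len(prompt)} characters)"}


def parse_output(message):
    """Stand-in for the string output parser: keep only the text of the message."""
    return message["content"]


def run_chain(query):
    # Step 1: build both inputs from the same query (retrieval plus passthrough)
    inputs = {"context": format_chunks(fake_retriever(query)), "query": query}
    # Steps 2 to 4: prompt, then model, then parser
    return parse_output(fake_llm(build_prompt(inputs)))


answer = run_chain("Who invaded Ukraine?")
print(answer)
```

Because each stage only consumes the previous stage's output, any single piece (retriever, prompt, or model) can be swapped without touching the others.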

### **Test the chain**

In [None]:
rag_chain.invoke("Who invaded Ukraine?")

'Putin of Russia invaded Ukraine.'

In [None]:
rag_chain.invoke("Who is Prasanta?")

"I'm sorry, but I don't know the answer to your question about Prasanta based on the context provided."

In [None]:
rag_chain.invoke("Summarize the whole document")

'In his State of the Union address delivered on March 7, 2024, President Joe Biden reflects on the historical significance of the moment, drawing parallels to past challenges the nation has faced, such as World War II. He emphasizes that the current moment is also unprecedented, requiring action and unity among Congress and the American people. Biden proclaims that the state of the union is strong and is improving, advocating for an end to trickle-down economics and calling for better benefits for the middle class while ensuring fairness for all. He addresses the high cost of prescription drugs, asserting that it must change. The president stresses the importance of honesty, decency, and defending democracy, particularly in light of the January 6th insurrection, which he describes as a grave threat to democracy. He calls for bipartisan cooperation to protect democratic values, urging all to remember their oath to defend the nation against both foreign and domestic threats.'