## **Objective**
  * **We will build a conversational AI that answers questions based on a given data source (PDF, text, image, JSON).**

### **RAG Application**
  * **Load the data: Document Loader**
  * **Split the data: Text Splitter**
  * **Embed the data: Embedding Model**
  * **Save the data into a DB: VectorDB (`Chroma` or `Pinecone`)**
<hr>
  * **Set up the LLM: OpenAI chat models (GPT-4o-mini, GPT-4)**
  * **Prompt Engineering (to make sure the model answers as intended)**
  * **Connect all the pieces together: Chain**
  * **Use the LLM: Test**
<hr>
  * **Interface for displaying the results: Gradio**

# **Creation of our RAG Based System**


## **1. Requirement Gathering**

* **Data Source: `plain text file`**
* **Frameworks: `Langchain`**

#### **Installing the dependencies**

In [None]:
!pip install langchain langchain-community langchain_openai langchain_chroma

Collecting numpy<3,>=1.26.2 (from langchain-community)
  Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.2.4
    Uninstalling numpy-2.2.4:
      Successfully uninstalled numpy-2.2.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 2.2.3 which is incompatible.
Successfully installed numpy-1.26.4


In [None]:
import os
from langchain_chroma import Chroma                                 # Vector database
from langchain_core.prompts import PromptTemplate                   # Used to frame the prompt
from langchain_openai import OpenAIEmbeddings, ChatOpenAI           # Embedding model and chat model
from langchain_text_splitters import CharacterTextSplitter          # Splits documents into chunks on a character separator
from langchain_core.runnables import RunnablePassthrough            # Passes the input through the chain unchanged
from langchain_core.output_parsers.string import StrOutputParser    # Converts the model output into a plain string

## **2. Document Processing**

**Documentation Link: https://python.langchain.com/docs/integrations/providers/**

#### **1. Loading a text file**

In [None]:
with open("/content/drive/MyDrive/PPTs/Session Files/Morning - GenAI/RAG Implementation/Text Data Source/2024_state_of_the_union.txt", encoding="utf-8") as f:
  files = f.read()

In [None]:
print(files)

March 07, 2024
Remarks of President Joe Biden — State of the Union Address As Prepared for Delivery
Home
Briefing Room
Speeches and Remarks
The United States Capitol

###

Good evening. 

Mr. Speaker. Madam Vice President. Members of Congress. My Fellow Americans. 

In January 1941, President Franklin Roosevelt came to this chamber to speak to the nation. 

He said, “I address you at a moment unprecedented in the history of the Union.” 

Hitler was on the march. War was raging in Europe. 

President Roosevelt’s purpose was to wake up the Congress and alert the American people that this was no ordinary moment.   

Freedom and democracy were under assault in the world. 

Tonight I come to the same chamber to address the nation. 

Now it is we who face an unprecedented moment in the history of the Union. 

And yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary moment either. 

Not since President Lincoln and the Civil War have 

#### **2. Set up the text splitter**

In [None]:
text_splitter = CharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200,
    length_function = len
)
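
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a small pure-Python sketch. It is a simplification: `CharacterTextSplitter` first splits on a separator (a blank line by default) and then merges the pieces up to `chunk_size`, whereas the hypothetical `sliding_window_chunks` helper below just cuts fixed-width windows. The overlap idea is the same: consecutive chunks share some text so that a sentence cut at a boundary still appears whole in one of them.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

With these toy values, each chunk starts with the last four characters of the previous one.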

#### **3. Split the text into document chunks**

In [None]:
texts = text_splitter.create_documents([files])

##### **Output of Text Split**

In [None]:
len(texts)

48

## **3. Embeddings Formation**

**HuggingFaceEmbedding: https://python.langchain.com/api_reference/huggingface/embeddings/langchain_huggingface.embeddings.huggingface.HuggingFaceEmbeddings.html**

* **OpenAI Embedding Models**

In [None]:
from google.colab import userdata

os.environ["OPENAI_API_KEY"] = userdata.get("GPTKEY")

In [None]:
openai_embedding = OpenAIEmbeddings(model = "text-embedding-3-small")
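
An embedding model maps each chunk to a vector, and chunks with similar meaning end up with vectors that point in similar directions. The sketch below shows one common way to compare two vectors, cosine similarity, using hypothetical three-dimensional vectors (real embedding models return far more dimensions). Which metric a vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", made up for illustration only.
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 0.8]
far_vec = [0.0, 1.0, 0.0]

print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```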

#### **Database Formation**

In [None]:
# Initialize the vector store
vector_storage = Chroma(
    collection_name = "23rdmarch_dev",
    embedding_function = openai_embedding
)

* **HuggingFaceEmbeddings**

In [None]:
!pip install langchain_huggingface



In [None]:
# from langchain_huggingface import HuggingFaceEmbeddings

# model_name = "all-MiniLM-L6-v2"
# hf_embedding_model = HuggingFaceEmbeddings(
#     model_name=model_name,
# )

#### **Load the data into the DB**

In [None]:
storage_id = vector_storage.add_documents(texts)

In [None]:
storage_id

['555e375b-71c2-424f-804f-05ad1c803712',
 '781be9ef-fe19-4698-abcd-872cf5d51486',
 '379ec5d3-e689-47f9-b942-cd689fc316c2',
 '58d6349c-87a8-41f1-a7db-8b57c15722b2',
 '121ae344-cd96-4211-838e-92d758e74372',
 '602d2c4d-6ff7-44a1-aee7-c378e53605ae',
 '545c5c47-5eac-486c-979f-794419ead7ba',
 '07a05170-da99-4c4a-80ef-76aae5b906f0',
 'ca12923c-ad0f-4a46-9434-2cd7f3222368',
 '052b54eb-1709-4e57-b542-e99742304998',
 '46dfcf28-a600-4443-92e9-fa01d1730977',
 '6a84224e-005e-4160-aef8-70dc0664b04b',
 '9122fe99-aa87-4734-a4d0-89d94a4f7bef',
 '3ae76937-db89-4a19-967a-aa9ab91218ef',
 'c331d01d-d760-4e19-a8aa-7edaecb025ab',
 '3700ab68-ec7c-4ecb-873d-21482f14675a',
 '285697fe-017e-4c6b-b6d1-48e647cfb029',
 '7c2b7fe0-d17d-4f60-bf73-505ec0d94864',
 '950461f4-1df9-409a-8bbd-8198aede53a0',
 '7d374281-aeff-42f3-869c-3d15f5297047',
 'b99f2040-f605-4c43-81a9-670a4ce015cb',
 '64299042-6a76-415b-b4be-9835c3e9d1ba',
 'c9c19352-2299-4961-a7da-e5febba28915',
 '30277484-e5e3-470f-8684-6a304fe46589',
 'c43d0537-67c8-

#### **Semantic Search on the Vector Database**

In [None]:
results = vector_storage.similarity_search(
    "Who invaded Ukraine?",
    k = 2 # Return the top 2 results
)
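
Conceptually, a similarity search embeds the query and returns the `k` stored vectors closest to it. The following is a minimal sketch of that ranking step, assuming a toy in-memory dictionary as the store and squared Euclidean distance as the metric. The names `toy_store`, `squared_l2` and `top_k` are hypothetical, and a real vector database uses an approximate index rather than a full scan.

```python
def squared_l2(a, b):
    """Squared Euclidean distance: smaller means closer."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, store, k=2):
    """Return the k (doc_id, distance) pairs closest to the query vector."""
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical two-dimensional vectors standing in for embedded chunks.
toy_store = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.7, 0.7],
    "chunk-c": [0.0, 1.0],
}

hits = top_k([1.0, 0.1], toy_store, k=2)
print([doc_id for doc_id, _ in hits])
```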

In [None]:
results

[Document(id='781be9ef-fe19-4698-abcd-872cf5d51486', metadata={}, page_content='And yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary moment either. \n\nNot since President Lincoln and the Civil War have freedom and democracy been under assault here at home as they are today. \n\nWhat makes our moment rare is that freedom and democracy are under attack, both at home and overseas, at the very same time. \n\nOverseas, Putin of Russia is on the march, invading Ukraine and sowing chaos throughout Europe and beyond. \n\nIf anybody in this room thinks Putin will stop at Ukraine, I assure you, he will not. \n\nBut Ukraine can stop Putin if we stand with Ukraine and provide the weapons it needs to defend itself. That is all Ukraine is asking. They are not asking for American soldiers. \n\nIn fact, there are no American soldiers at war in Ukraine. And I am determined to keep it that way. \n\nBut now assistance for Ukraine is being b

In [None]:
for x in results:
  print(f"* ID: {x.id}\nCONTENT:{x.page_content}\n----------------------------------------")

* ID: 781be9ef-fe19-4698-abcd-872cf5d51486
CONTENT:And yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary moment either. 

Not since President Lincoln and the Civil War have freedom and democracy been under assault here at home as they are today. 

What makes our moment rare is that freedom and democracy are under attack, both at home and overseas, at the very same time. 

Overseas, Putin of Russia is on the march, invading Ukraine and sowing chaos throughout Europe and beyond. 

If anybody in this room thinks Putin will stop at Ukraine, I assure you, he will not. 

But Ukraine can stop Putin if we stand with Ukraine and provide the weapons it needs to defend itself. That is all Ukraine is asking. They are not asking for American soldiers. 

In fact, there are no American soldiers at war in Ukraine. And I am determined to keep it that way. 

But now assistance for Ukraine is being blocked by those who want us to walk away fr

<hr>

## **4. Setting up retrieval**

### **a. Set up a function to format the retrieved documents**

In [None]:
def format_docs(docs):
  # Join the text of every retrieved document into a single context string
  return "\n".join(x.page_content for x in docs)
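
To see what `format_docs` produces without querying the database, the sketch below calls it on a hypothetical `FakeDoc` stand-in that exposes only the `page_content` attribute the function reads.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document; only `page_content` is modeled."""
    page_content: str


def format_docs(docs):
    # Same logic as the notebook's helper: one newline between documents.
    return "\n".join(x.page_content for x in docs)


context = format_docs([FakeDoc("First chunk."), FakeDoc("Second chunk.")])
print(context)
```

Because the chunks themselves contain blank lines, a more visible separator between documents may make chunk boundaries easier to spot. That is a design choice, not something the notebook requires.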

In [None]:
format_docs(results)

'And yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary moment either. \n\nNot since President Lincoln and the Civil War have freedom and democracy been under assault here at home as they are today. \n\nWhat makes our moment rare is that freedom and democracy are under attack, both at home and overseas, at the very same time. \n\nOverseas, Putin of Russia is on the march, invading Ukraine and sowing chaos throughout Europe and beyond. \n\nIf anybody in this room thinks Putin will stop at Ukraine, I assure you, he will not. \n\nBut Ukraine can stop Putin if we stand with Ukraine and provide the weapons it needs to defend itself. That is all Ukraine is asking. They are not asking for American soldiers. \n\nIn fact, there are no American soldiers at war in Ukraine. And I am determined to keep it that way. \n\nBut now assistance for Ukraine is being blocked by those who want us to walk away from our leadership in the world.\nMr.

### **b. Setup a Retriever**

In [None]:
retriever = vector_storage.as_retriever()

### **c. LLM Instance**

#### **OpenAI**

In [None]:
llm = ChatOpenAI(model = "gpt-4o-mini")

#### **HuggingFace**

In [None]:
os.environ["HUGGINGFACEHUB_API_TOKEN"] = userdata.get("HuggingFace")

In [None]:
!pip install --upgrade langchain-huggingface



In [None]:
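# Note: HuggingFaceHub is deprecated in langchain-community; HuggingFaceEndpoint from langchain_huggingface is its documented replacement.
from langchain_community.llms.huggingface_hub import HuggingFaceHub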

In [None]:
llmHF = HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation"
)



In [None]:
llmHF

HuggingFaceHub(client=<InferenceClient(model='HuggingFaceH4/zephyr-7b-beta', timeout=None)>, repo_id='HuggingFaceH4/zephyr-7b-beta', task='text-generation')

### **d. Prompt Instance**

**Prompt template for question answering**

In [None]:
template = """
Use the context provided to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer concise.

context: {context}

question: {query}

answer:
"""

In [None]:
custom_template = PromptTemplate.from_template(template)

In [None]:
custom_template

PromptTemplate(input_variables=['context', 'query'], input_types={}, partial_variables={}, template="\nUse the context provided to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\nUse three sentences maximum and keep the answer concise.\n\ncontext: {context}\n\nquestion: {query}\n\nanswer:\n")
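
The `{context}` and `{query}` placeholders listed in `input_variables` are filled in when the chain runs. As a rough illustration, the sketch below uses plain `str.format` on a shortened copy of the template. This is a stand-in for what the prompt object does, not its actual implementation, and the context sentence is taken from the retrieved chunk shown earlier.

```python
# Shortened copy of the notebook's template, kept small for illustration.
demo_template = "context: {context}\n\nquestion: {query}\n\nanswer:"

filled = demo_template.format(
    context="Overseas, Putin of Russia is on the march, invading Ukraine.",
    query="Who invaded Ukraine?",
)
print(filled)
```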

**We now have a template, a model, and a database.**
  * **Can we connect them?**

In [None]:
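rag_chain = (
    # Format the retrieved documents into one context string; pass the question through unchanged
    {"context": retriever | format_docs, "query": RunnablePassthrough()}
    | custom_template
    | llm
    | StrOutputParser()
)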
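
Read left to right, the chain retrieves context for the question, fills the prompt, calls the model, and converts the reply to a string. The sketch below mimics that flow with ordinary functions. Every function in it (`fake_retriever`, `join_chunks`, `fill_prompt`, `fake_llm`, `parse_output`) is a hypothetical stub, so it runs without an API key and says nothing about how LangChain implements the `|` operator.

```python
def fake_retriever(query: str) -> list[str]:
    """Hypothetical retriever: returns a canned chunk instead of querying a vector DB."""
    return ["Overseas, Putin of Russia is on the march, invading Ukraine."]


def join_chunks(chunks: list[str]) -> str:
    """Merge the retrieved chunks into one context string."""
    return "\n".join(chunks)


def fill_prompt(inputs: dict) -> str:
    """Substitute the context and the question into a prompt string."""
    return f"context: {inputs['context']}\n\nquestion: {inputs['query']}\n\nanswer:"


def fake_llm(prompt: str) -> dict:
    """Hypothetical model: returns a stub message instead of calling an API."""
    return {"content": f"(stub answer for a {len(prompt)}-character prompt)"}


def parse_output(message: dict) -> str:
    """Extract the plain text from the model's message."""
    return message["content"]


def toy_rag_chain(query: str) -> str:
    # The question feeds both the retrieval step and the prompt.
    inputs = {"context": join_chunks(fake_retriever(query)), "query": query}
    return parse_output(fake_llm(fill_prompt(inputs)))


answer = toy_rag_chain("Who invaded Ukraine?")
print(answer)
```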

## **Test**

In [None]:
rag_chain.invoke("who is president Modi?")

"I don't know."

## **Interface**

In [None]:
!pip install --upgrade gradio



In [None]:
import gradio as gr

In [None]:
# Wrapper that sends the question to the RAG chain and returns the answer
def generate_response(question):
  response = rag_chain.invoke(question)
  return response
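
When a function like this backs a web interface, it can help to handle empty input and API failures so the user sees a readable message. The following is one possible sketch, not part of the original notebook: `safe_generate_response` is a hypothetical variant that takes the chain call as a parameter, and `stub_chain` and `failing_chain` are stand-ins used only to exercise it.

```python
def safe_generate_response(question, chain_invoke):
    """Call `chain_invoke(question)`, returning a readable message on empty input or failure."""
    if not question or not question.strip():
        return "Please enter a question."
    try:
        return chain_invoke(question.strip())
    except Exception as exc:  # broad on purpose: any failure is shown in the UI
        return f"Sorry, the request failed: {exc}"


def stub_chain(question):
    """Hypothetical stand-in for a working chain call."""
    return f"stub answer to: {question}"


def failing_chain(question):
    """Hypothetical stand-in for a chain call that raises."""
    raise RuntimeError("simulated API error")


print(safe_generate_response("   ", stub_chain))
print(safe_generate_response("Who invaded Ukraine?", stub_chain))
print(safe_generate_response("Who invaded Ukraine?", failing_chain))
```

In the notebook, the chain's `invoke` method would be passed in place of `stub_chain`.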

In [None]:
iface = gr.Interface(
    fn=generate_response,
    inputs=gr.Textbox(lines=2, placeholder="Enter your question here..."),
    outputs="text",
    title="RAG Based Question Answering",
    description="Ask questions based on the provided document.",
)

In [None]:
iface.launch()