## **Objective**
  * **We will build a conversational AI that answers questions based on a given data source (PDF, text, image, JSON)**

* **`Open Source Models`: DeepSeek, Mixtral, Zephyr, Dolly, Llama, Phi (HuggingFace, Unsloth, Replicate)**

* **`Proprietary Models`: OpenAI, Google Gemini & PaLM, Microsoft**


### **RAG Application**
* **Indexing**
  * **Load the data: Document Loader**
  * **Split the data: Text Splitter**
  * **Embed the data: Embedding Model**
  * **Save the data into a DB: VectorDB (`Chroma` and Pinecone)**
<hr>
* **Retrieval**
  * **Set up the LLM: e.g. ChatGPT (GPT-4o-mini, GPT-4); this notebook uses `google/flan-t5-large`**
  * **Prompt engineering (to make sure the model answers as intended)**
  * **Connect all the pieces together: Chain**
  * **Utilize the LLM: Test**
<hr>
  * **Interface for displaying the results: Gradio**

# **Step 1 - Requirement Phase**

* **Data Source: `plain text file`**
* **Framework: `Langchain`**

In [None]:
!pip install langchain langchain_community langchain_chroma


### **Importing the dependencies**

In [None]:
from langchain_chroma import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_text_splitters import CharacterTextSplitter
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers.string import StrOutputParser

# **Step 2 - Document Processing**

### **1. Taking a plain text file**

**Link: https://drive.google.com/file/d/1z5FTeCvkrfHnMrSfbtlvHH1CpYKJ6udR/view?usp=sharing**

In [None]:
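# Read the whole speech into one string; the file contains curly quotes, so set the encoding explicitly
with open('/content/2024_state_of_the_union.txt', encoding='utf-8') as f:
  files = f.read()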

In [None]:
print(files)

March 07, 2024
Remarks of President Joe Biden — State of the Union Address As Prepared for Delivery
Home
Briefing Room
Speeches and Remarks
The United States Capitol

###

Good evening. 

Mr. Speaker. Madam Vice President. Members of Congress. My Fellow Americans. 

In January 1941, President Franklin Roosevelt came to this chamber to speak to the nation. 

He said, “I address you at a moment unprecedented in the history of the Union.” 

Hitler was on the march. War was raging in Europe. 

President Roosevelt’s purpose was to wake up the Congress and alert the American people that this was no ordinary moment.   

Freedom and democracy were under assault in the world. 

Tonight I come to the same chamber to address the nation. 

Now it is we who face an unprecedented moment in the history of the Union. 

And yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary moment either. 

Not since President Lincoln and the Civil War have 

### **2. Split the data**


In [None]:
text_split=CharacterTextSplitter(
    chunk_size=1000,    # maximum number of characters in a chunk
    chunk_overlap=200,  # characters shared by consecutive chunks (the end of chunk i-1 repeats at the start of chunk i)
    length_function=len # measure chunk length in characters
)
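
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over a string. It is a simplification, not the library's algorithm: `CharacterTextSplitter` first splits on a separator (a blank line by default, as far as I know) and then merges pieces up to `chunk_size`, so its chunks can be shorter and overlap only at separator boundaries. The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters.

    Consecutive windows start (chunk_size - chunk_overlap) characters apart,
    so the tail of one window repeats at the head of the next.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk begins with the last four characters of the previous one, which is the role `chunk_overlap=200` plays in the cell above.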

### **3. Create documents from the split text**

In [None]:
texts=text_split.create_documents([files])

### **Output**

In [None]:
len(texts)

48

In [None]:
texts[0]

Document(metadata={}, page_content='March 07, 2024\nRemarks of President Joe Biden — State of the Union Address As Prepared for Delivery\nHome\nBriefing Room\nSpeeches and Remarks\nThe United States Capitol\n\n###\n\nGood evening. \n\nMr. Speaker. Madam Vice President. Members of Congress. My Fellow Americans. \n\nIn January 1941, President Franklin Roosevelt came to this chamber to speak to the nation. \n\nHe said, “I address you at a moment unprecedented in the history of the Union.” \n\nHitler was on the march. War was raging in Europe. \n\nPresident Roosevelt’s purpose was to wake up the Congress and alert the American people that this was no ordinary moment.   \n\nFreedom and democracy were under assault in the world. \n\nTonight I come to the same chamber to address the nation. \n\nNow it is we who face an unprecedented moment in the history of the Union. \n\nAnd yes, my purpose tonight is to both wake up this Congress, and alert the American people that this is no ordinary momen

# **Step 3 - Embed the data using Embedding Model**

### **Create the embeddings**

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

In [None]:
embedding_model=HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')


### **Database Formation**

In [None]:
vectorDB = Chroma(
    collection_name='Jay',
    embedding_function=embedding_model
)

In [None]:
vectorDB

<langchain_chroma.vectorstores.Chroma at 0x795464fa8890>

### **Load the documents in the DB**

In [None]:
storage_id = vectorDB.add_documents(texts)

In [None]:
len(storage_id)

48

In [None]:
storage_id[0]

'829228f3-10ac-447b-969d-c16137a6aac6'

1. Text

   └── Raw input text data (e.g., document, web page, transcript)

2. Split into Chunks

   └── Divide text into manageable chunks (e.g., by sentences or paragraphs)

3. Embedding Model

   └── Use a model (like OpenAI, Sentence-BERT) to convert text chunks into embeddings

4. Vectors

   └── Embeddings are high-dimensional numeric representations of the text

5. Vector Database

   └── Store these vectors in a database optimized for similarity search (e.g., FAISS, Pinecone, Weaviate)

6. Primary IDs

   └── Assign a unique identifier to each vector entry

7. Ensure Uniqueness

   └── Validate that each ID is distinct to avoid collisions or duplication
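
The seven steps above can be sketched with a tiny in-memory store. This is only an illustration under stated assumptions: the class `TinyVectorStore`, the three-dimensional vectors, and the chunk texts are all made up (a real embedding model such as `all-MiniLM-L6-v2` produces 384-dimensional vectors), and cosine similarity is used here for clarity even though a real vector database may default to a different distance metric.

```python
import math
import uuid


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self._rows = {}  # primary ID -> (vector, text)

    def add(self, vector, text):
        row_id = str(uuid.uuid4())  # random UUID as the primary ID
        self._rows[row_id] = (vector, text)
        return row_id

    def search(self, query_vector, k=1):
        # Rank every stored vector by similarity to the query, best first
        ranked = sorted(
            self._rows.items(),
            key=lambda item: cosine_similarity(query_vector, item[1][0]),
            reverse=True,
        )
        return [(row_id, text) for row_id, (_, text) in ranked[:k]]


store = TinyVectorStore()
ids = [
    store.add([1.0, 0.0, 0.0], "chunk about the economy"),
    store.add([0.0, 1.0, 0.0], "chunk about healthcare"),
    store.add([0.0, 0.0, 1.0], "chunk about foreign policy"),
]

best_id, best_text = store.search([0.1, 0.9, 0.0], k=1)[0]
print(best_text)  # chunk about healthcare
```

The query vector points mostly along the second axis, so the "healthcare" chunk ranks first, and the returned ID matches the one handed back when that chunk was added.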


### **Similarity Search Using the Vector DB**

In [None]:
res=vectorDB.similarity_search(
    query="What did the president say about Ketanji Brown Jackson",
    k=1
)

In [None]:
res

[Document(id='e81e2fd9-1007-41bc-9b7c-3e93a5fcea81', metadata={}, page_content='To take on crimes of domestic violence, I am ramping up federal enforcement of the Violence Against Women Act, that I proudly wrote, so we can finally end the scourge of violence against women in America!  \n\nAnd there’s another kind of violence I want to stop. \n\nWith us tonight is Jasmine, whose 9-year-old sister Jackie was murdered with 21 classmates and teachers at her elementary school in Uvalde, Texas. \n\nSoon after it happened, Jill and I went to Uvalde and spent hours with the families. \n\nWe heard their message, and so should everyone in this chamber do something. \n\nI did do something by establishing the first-ever Office of Gun Violence Prevention in the White House that Vice President Harris is leading. \n\nMeanwhile, my predecessor told the NRA he’s proud he did nothing on guns when he was President. \n\nAfter another school shooting in Iowa he said we should just “get over it.” \n\nI say 

In [None]:
storage_id.index('e81e2fd9-1007-41bc-9b7c-3e93a5fcea81')

36

In [None]:
texts[36]

Document(metadata={}, page_content='To take on crimes of domestic violence, I am ramping up federal enforcement of the Violence Against Women Act, that I proudly wrote, so we can finally end the scourge of violence against women in America!  \n\nAnd there’s another kind of violence I want to stop. \n\nWith us tonight is Jasmine, whose 9-year-old sister Jackie was murdered with 21 classmates and teachers at her elementary school in Uvalde, Texas. \n\nSoon after it happened, Jill and I went to Uvalde and spent hours with the families. \n\nWe heard their message, and so should everyone in this chamber do something. \n\nI did do something by establishing the first-ever Office of Gun Violence Prevention in the White House that Vice President Harris is leading. \n\nMeanwhile, my predecessor told the NRA he’s proud he did nothing on guns when he was President. \n\nAfter another school shooting in Iowa he said we should just “get over it.” \n\nI say we must stop it.')
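
`list.index` scans the ID list from the start on every call. If many results had to be traced back to their chunks, one option is a dictionary built once from the IDs. The sketch below uses made-up placeholder IDs and chunk texts (`example_ids`, `example_chunks`) rather than the notebook's real values.

```python
# Placeholder data standing in for the returned IDs and the split documents
example_ids = ["id-a", "id-b", "id-c"]
example_chunks = ["first chunk", "second chunk", "third chunk"]

# Build the lookup once: ID -> position in the chunk list
index_by_id = {doc_id: position for position, doc_id in enumerate(example_ids)}

hit_id = "id-c"
print(index_by_id[hit_id])                  # 2
print(example_chunks[index_by_id[hit_id]])  # third chunk
```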

# **Step 4 - Setting up the Retrievals**

### **a. Create a retriever**

In [None]:
retriever = vectorDB.as_retriever()

### **b. LLM Instance**

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

In [None]:
tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-large')


In [None]:
if tokenizer.pad_token is None:
  tokenizer.add_special_tokens({'pad_token':'[PAD]'})

In [None]:
model=AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-large')


In [None]:
model.resize_token_embeddings(len(tokenizer))

Embedding(32100, 1024)

In [None]:
model.config.pad_token_id = tokenizer.pad_token_id

In [None]:
generator=pipeline(
    'text2text-generation',
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=150
)

Device set to use cuda:0


In [None]:
from langchain_community.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generator)

#### **Other Examples**

* **HuggingFaceH4/zephyr-7b-beta**
* **Qwen/Qwen3-235B-A22B**

### **c. Design a Prompt**

In [None]:
template= """Use the context provided to answer the question. If you don't know the answer, say you don't know.

Context:
{context}

Question: {question}
Answer:"""
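
To see what the model will eventually receive, the same template text can be filled in with plain `str.format`. This is a sketch of the substitution only; the context and question strings are invented, and `example_template` is a separate copy so the notebook's `template` variable is left alone.

```python
example_template = """Use the context provided to answer the question. If you don't know the answer, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

# Fill both placeholders with made-up values
filled = example_template.format(
    context="Chunk one text.\n\nChunk two text.",
    question="What is in chunk two?",
)
print(filled)
```

The prompt ends with `Answer:` so that the model's generated text reads as the continuation of that line.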

In [None]:
custom_template=PromptTemplate(
    template=template
)

In [None]:
custom_template

PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="Use the context provided to answer the question. If you don't know the answer, say you don't know.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:")

**We now have a template, a model, and a database**

* **Can we connect them?**

In [None]:
rag_chain=(
    {'context':retriever,'question':RunnablePassthrough()}
    |custom_template
    |llm
    |StrOutputParser()
)

In [None]:
# Input question
#      ↓
# Retriever → Format → Context
#      ↓
# {context, question}
#      ↓
# Prompt Template (custom_template)
#      ↓
# LLM (generate answer)
#      ↓
# StrOutputParser (final output string)

# **Step 5 - Test**

In [None]:
query = "What did the President say about Ukraine?"
answer = rag_chain.invoke(query)
answer

'If the United States walks away now, it will put Ukraine at risk'

In [None]:
!pip install gradio



In [None]:
import gradio as gr

In [None]:
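def chat(message, history):
  # Run the RAG chain on the user's message
  bot_message = rag_chain.invoke(message)
  # Assumes pair-based chat history: each turn is one (user, bot) pair,
  # and list.append takes a single item
  history = history or []
  history.append((message, bot_message))
  # Clear the textbox and show the updated conversation
  return "", history

with gr.Blocks() as demo:
  chatbot = gr.Chatbot()
  msg = gr.Textbox()
  clear = gr.Button('clear')
  msg.submit(chat, [msg, chatbot], [msg, chatbot])
  clear.click(lambda: None, None, chatbot, queue=False)
demo.launch()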
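
A chat handler can be exercised without launching the UI by swapping the chain for a stub. The sketch below assumes the pair-based history format, where each turn is a `(user, bot)` pair; newer Gradio versions favor message dictionaries, so treat the shape as an assumption. `make_chat_handler` and the echo stub are hypothetical names for illustration.

```python
def make_chat_handler(invoke):
    """Build a handler around any callable that maps a question to an answer."""
    def chat(message, history):
        history = list(history or [])
        # One turn is a single (user, bot) pair
        history.append((message, invoke(message)))
        # Empty string clears the textbox; history refreshes the chat window
        return "", history
    return chat


# Stub that echoes the question instead of calling a model
chat_stub = make_chat_handler(lambda question: f"echo: {question}")

cleared, history = chat_stub("hello", [])
cleared, history = chat_stub("again", history)
print(history)
# [('hello', 'echo: hello'), ('again', 'echo: again')]
```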



