<a href="https://colab.research.google.com/github/Shubh121102/Gen-AI-RAG-project-on-US-census/blob/main/GEN_AI_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **GEN AI RAG Project**
**Retrieval-Augmented Generation (RAG) project on US census data using LangChain, HuggingFace, the Mistral LLM, and Chroma DB**

**Pip Installations**

In [1]:
!pip install langchain
!pip install langchain-community
!pip install sentence-transformers
!pip install langchain_huggingface
!pip install pypdf
!pip install chromadb



**Importing Modules**

In [2]:
# Document loading and splitting
from langchain_community.document_loaders import PyPDFDirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Embeddings and vector store (imported from the packages installed above)
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
# Prompt, LLM wrapper, and retrieval chain
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFaceHub
from langchain.chains import RetrievalQA

**Loading Documents**

In [3]:
# Load every PDF found in ./us_census; each page becomes one Document
loader = PyPDFDirectoryLoader('./us_census')
docs = loader.load()

**Text Splitter**

In [4]:
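# chunk_size and chunk_overlap are counted in characters (the splitter's default length function is len)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(docs)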
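
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally, since that class also tries to break on separators such as paragraph and line breaks. The function name `sliding_chunks` and the sample text are made up for illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical 2500-character sample made of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
pieces = sliding_chunks(sample)
print([len(p) for p in pieces])  # [1000, 1000, 700]
```

The last 100 characters of one piece are the first 100 characters of the next, which is what keeps a sentence that straddles a boundary visible in both chunks.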

**Accessing API Token from userdata**

In [5]:
from google.colab import userdata
set_key=userdata.get('HUGGINGFACE_API_TOKEN')

import os
os.environ['HUGGINGFACE_API_TOKEN']=set_key

**Creating Embedding Vectors**

In [6]:
embeddings=HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


**Storing Embedding Vectors in Chroma DB**

In [7]:
vectorstore=Chroma.from_documents(chunks,embeddings)

**Retrieving similar vectors from the vectorstore**

For a given query, the retriever returns the 3 chunks whose embeddings are most similar to the query embedding, which is why `search_kwargs` sets `k` to 3.

In [8]:
retriever=vectorstore.as_retriever(search_type='similarity', search_kwargs={'k':3})
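
As an illustration of what "top `k` by similarity" means, the sketch below ranks toy 2-dimensional vectors against a query vector using cosine similarity in plain Python. This is an assumption-laden simplification: a real vector store such as Chroma uses an index rather than a full scan, and its distance metric depends on how the collection is configured. The names `cosine_similarity`, `top_k`, and `toy_chunks` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the indices of the k vectors most similar to query_vec, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings"; real ones from all-MiniLM-L6-v2 have many more dimensions
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], toy_chunks, k=3))  # [0, 4, 2]
```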

**Creating Prompt Template**

In [9]:
p_template="""Using the following context answer the question asked
if you do not know the answer just say you don't know. make sure u strictly answer relevant to the context

{context}
Question:{question}

Answer is:
"""
prompt=PromptTemplate(template=p_template,input_variables=['context','question'])

**Configuring HuggingFaceHub API Token**

In [10]:
from google.colab import userdata
set_key=userdata.get('HUGGINGFACEHUB_API_TOKEN')

import os
os.environ['HUGGINGFACEHUB_API_TOKEN']=set_key

from langchain.llms import HuggingFaceHub
llm=HuggingFaceHub(repo_id="mistralai/Mistral-7B-v0.1",model_kwargs={"temperature":0.5,"max_length":512})



**Creating a stuff chain using RetrievalQA**

In [11]:
retrievalQA=RetrievalQA.from_chain_type(llm=llm,
                        chain_type='stuff',
                        retriever=retriever,
                        return_source_documents=True,
                        chain_type_kwargs={'prompt':prompt})
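
The "stuff" chain type places all retrieved chunks into a single prompt. A rough sketch of that idea in plain Python follows. It is not LangChain's implementation: the helper `stuff_prompt`, the shortened `TOY_TEMPLATE`, and the blank-line separator between chunks are assumptions made for illustration.

```python
# Shortened stand-in for the notebook's prompt, with the same two variables
TOY_TEMPLATE = """Answer using only this context.

{context}
Question:{question}

Answer is:
"""


def stuff_prompt(template, chunk_texts, question, separator="\n\n"):
    """Join all chunk texts into one context string and fill the template with it."""
    context = separator.join(chunk_texts)
    return template.format(context=context, question=question)


filled = stuff_prompt(
    TOY_TEMPLATE,
    ["Chunk one text.", "Chunk two text."],
    "What is covered?",
)
print(filled)
```

Because every chunk goes into one prompt, `k` and `chunk_size` together determine how much text the model receives per question.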

**Running your own query and printing the result**

Set `query` to your own question. The chain retrieves the most relevant chunks from the loaded documents and the model generates its answer from them.
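
The dictionary returned by the chain holds the generated text under `result` and, because the chain was built with `return_source_documents=True`, the retrieved chunks under `source_documents`. The generated text can also repeat the filled-in prompt ahead of the answer, as the output further down does. The sketch below shows one possible way to post-process such a dictionary. The helpers `extract_answer` and `list_sources`, the `StandInDoc` class, the sample values, and the `source` and `page` metadata keys are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StandInDoc:
    """Hypothetical stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def extract_answer(generated_text, marker="Answer is:"):
    """Return the text after the last occurrence of marker, or all of it if the marker is absent."""
    return generated_text.rsplit(marker, 1)[-1].strip()


def list_sources(source_documents):
    """Return one 'source (page N)' string per retrieved chunk."""
    lines = []
    for doc in source_documents:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page", "?")
        lines.append(f"{source} (page {page})")
    return lines


# Made-up result dictionary shaped like the chain's output
fake_result = {
    "result": "...context...\nQuestion:example?\n\nAnswer is:\nAn example answer.",
    "source_documents": [StandInDoc("chunk text", {"source": "example.pdf", "page": 3})],
}
print(extract_answer(fake_result["result"]))
print(list_sources(fake_result["source_documents"]))
```

Splitting on the last marker is only a heuristic: if the model itself writes the marker again inside its answer, part of the answer would be cut off.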

In [12]:
query="""Differences in the uninsured rate by state in 2022"""
result=retrievalQA.invoke({'query':query})
print(result['result'])

Using the following context answer the question asked
if you do not know the answer just say you don't know. make sure u strictly answer relevant to the context

excludes single-service plans, such as accident, disability, dental, vision, or prescription 
medicine plans.The large sample size of the ACS 
allows for an examination of the 
uninsured rate and coverage by 
type for subnational geographies.8
Key Findings
• In 2022, the uninsured rate 
varied from 2.4 percent in 
Massachusetts to 16.6 percent 
in Texas (Figure 1 and Figure 
2). The District of Columbia 
was among the lowest with an 
uninsured rate of 2.9 percent, 
not statistically different from 
Massachusetts.
• Utah and North Dakota reported 
the highest rate of private cov -
erage (78.4 percent) in 2022, 
while New Mexico had the low -
est private coverage rate (54.4 
percent) (Figure 3).9
• Utah had the lowest rate of 
public coverage in 2022 (22.2 
percent), and New Mexico had 
the highest (Figure 4). 
• Twenty-seven st