<a href="https://colab.research.google.com/github/Lucyfer1865/RAG_Model_with_Gemini_and_Langchain/blob/main/RAG_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG with LangChain and Gemini

I am going to build a RAG (Retrieval-Augmented Generation) pipeline using the Google Gemini API (`gemini-1.5-pro`) and LangChain, with ChromaDB as my vector database.

## Setup

First, we install the required libraries.

Then we set up the API key and the Gemini 1.5 Pro model.

In [None]:
%pip install -q --upgrade langchain langchain-community langchain-core google-generativeai langchain-google-genai chromadb PyMuPDF

In [None]:
from IPython.display import display
from IPython.display import Markdown
import textwrap

# Define a function to convert output text to markdown
def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))

In [None]:
from langchain_core.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain.chains.retrieval import create_retrieval_chain

In [None]:
import google.generativeai as genai
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)
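
`google.colab.userdata` is only importable inside Colab. As a minimal sketch, assuming the key may alternatively be exported as an environment variable with the same name, a hypothetical helper (`load_api_key` is not part of the original notebook) could fall back to `os.environ`:

```python
import os


def load_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Return the key from Colab secrets if possible, else from the environment."""
    try:
        # Only importable when running inside Google Colab
        from google.colab import userdata

        key = userdata.get(name)
        if key:
            return key
    except Exception:
        # Not running in Colab, or the secret is missing or not shared
        # with this notebook; fall through to the environment variable.
        pass

    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"{name} was not found in Colab secrets or environment variables"
        )
    return key
```

The result could then be passed wherever the notebook uses `GOOGLE_API_KEY`, which keeps the rest of the cells unchanged when they run outside Colab.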

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI

In [None]:
# Create the model
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro",
    temperature=0.3,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    google_api_key=GOOGLE_API_KEY
)


## Load Data from Files

Load the text from the PDFs. (Upload them to the session storage in Google Colab first.)

In [None]:
# Use fitz to open and read pdfs
import fitz
# tqdm for progress bars
from tqdm.auto import tqdm

pdf_paths = ["/content/Placement_Chronicles_2023-24.pdf",
            "/content/SI_Chronicles_23-24_Sem_I.pdf"] # Enter file paths

# We create a text_formatter function to clean our pdf text.
def text_formatter(text: str) -> str:
    cleaned_text = text.replace("\n", " ")

    return cleaned_text

# Define function to read our pdf data and store it in `texts`
def open_and_read_pdf(pdf_paths: list[str]) -> list[str]:

    texts = []
    for pdf_path in tqdm(pdf_paths):
        pages = 0
        doc = fitz.open(pdf_path)
        print(pdf_path)
        for page in tqdm(doc):
            text = page.get_text()
            text = text_formatter(text=text)
            pages +=1
            texts.append(text)
        print(pages)
    return texts

texts = open_and_read_pdf(pdf_paths = pdf_paths) # Read the text in pdfs
combined_text = ' '.join(texts) # Create a single string of all text in pdfs

print(f"Combined text length: {len(combined_text)}")
print(f"First 500 characters of combined text: {combined_text[:500]}")

/content/Placement_Chronicles_2023-24.pdf
104
/content/SI_Chronicles_23-24_Sem_I.pdf
122
Combined text length: 215716
First 500 characters of combined text: Chronicles Semester 1 2023-2024 Placement Repository Committee  02 Debaayus Swain Pitching Coordinator Facing one of the toughest job markets after the 2009  recession, our students worked hard and proved their  mettle. I hope that these challenging times have  prepared the batch for the dynamic job landscape. We  had the fortune of hosting some exciting new recruiters  apart from our legacy ones. This particular season  established the value of putting your best foot forward in  the summer inte
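
The preview above still contains runs of double spaces, because `text_formatter` replaces each newline with a space without collapsing the whitespace around it. A possible refinement, shown here only as a sketch (`normalize_whitespace` is a hypothetical name, not part of the original notebook), collapses every run of whitespace into a single space:

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse each run of spaces, tabs, and newlines into one space."""
    return re.sub(r"\s+", " ", text).strip()


sample = "Placement Repository Committee  02\nDebaayus Swain"
print(normalize_whitespace(sample))
```

Whether this is worth doing depends on the data: cleaner text gives slightly more content per chunk, but the embeddings are unlikely to change much either way.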


## Split the Data into Chunks

Split the text loaded from the PDFs into chunks so that each chunk can be embedded later.

In [None]:
# Initialize the RecursiveCharacterTextSplitter so that we can split combined_text into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

# Split the combined_text into chunks using splitter
chunks = splitter.split_text(combined_text)
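
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraphs and sentences, so its chunk boundaries and chunk count will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; each window repeats the last
    `chunk_overlap` characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.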

## Create Embeddings

We now initialize the embedding model using `GoogleGenerativeAIEmbeddings` and create a vector database with Chroma to store the embeddings computed from the chunks.

In [None]:
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=GOOGLE_API_KEY) # Initialize embedding model

In [None]:
vectordb = Chroma.from_texts(chunks, embeddings) # Create vector database

Next, we create a retriever object that returns the chunks most relevant to a query.

In [None]:
retriever = vectordb.as_retriever(search_kwargs={"k": 5}) # Create a retriever that returns the 5 most similar chunks
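
Conceptually, the retriever embeds the query and returns the `k` stored chunks whose embeddings are closest to it. The sketch below illustrates that idea with cosine similarity over plain Python lists. This is an assumption made for illustration: the distance metric Chroma actually uses depends on how the collection is configured, and the vector store uses an index rather than a full scan. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 5) -> list[int]:
    """Return the indices of the k chunk vectors most similar to the query."""
    scores = [cosine_similarity(query_vec, vec) for vec in chunk_vecs]
    ranked = sorted(range(len(chunk_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 2-dimensional "embeddings" standing in for real chunk embeddings
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.1], toy_chunks, k=2))
```

A larger `k` gives the LLM more context to draw on but also makes the prompt longer, so `k=5` with 2000-character chunks is a trade-off rather than a fixed rule.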

In [None]:
len(vectordb)

240

In [None]:
print(vectordb)
print(retriever)

<langchain_community.vectorstores.chroma.Chroma object at 0x7fea16ba68f0>
tags=['Chroma', 'GoogleGenerativeAIEmbeddings'] vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x7fea16ba68f0> search_kwargs={'k': 5}


## Retrieve Data and Return Output

Now we combine a prompt template with the `create_stuff_documents_chain` and `create_retrieval_chain` functions.

Then we invoke the chain so that the LLM returns an answer based on the template.

In [None]:
template = """
You are a helpful AI assistant that provides information about the Placement and SI chronicles 2023-24 in BITS Pilani.
Always answer based on the context provided.
context: {context}
input: {input}
answer:
"""

In [None]:
# Create a retrieval chain
prompt = PromptTemplate.from_template(template)
combine_docs_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(retriever, combine_docs_chain)
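
The "stuff" strategy simply places all retrieved chunks into the `{context}` slot of the prompt. The sketch below shows that idea with plain strings. It is a conceptual illustration, not LangChain's implementation: the real chain works with `Document` objects and also calls the model. The helper name `stuff_prompt` and the blank-line separator are assumptions.

```python
def stuff_prompt(template: str, chunks: list[str], question: str,
                 separator: str = "\n\n") -> str:
    """Join the retrieved chunks and fill the template's placeholders."""
    context = separator.join(chunks)
    return template.format(context=context, input=question)


demo_template = "context: {context}\ninput: {input}\nanswer:"
prompt_text = stuff_prompt(
    demo_template,
    ["First retrieved chunk.", "Second retrieved chunk."],
    "How were the placements this year?",
)
print(prompt_text)
```

Because every retrieved chunk is sent in a single prompt, this approach works as long as `k` times the chunk size fits comfortably within the model's context window.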

Run the following cell to use the model:

**Try**: "How were the placements this year?" or "How to get an internship in ___"

In [None]:
# Ask for a query, then invoke the retrieval chain
query = input("Enter your query: ")
response = retrieval_chain.invoke({"input": query})

# Display the answer as Markdown
to_markdown(response["answer"])

Enter your query: How were the placement this year?


> The text states that despite a challenging job market, BITS Pilani students achieved impressive placement results. The job market was compared to the tough market after the 2009 recession. 
> 
> Here are some key takeaways:
> 
> * **Strong Performance:** Students "proved their mettle" and showcased their skills, reinforcing BITS Pilani as a top choice for recruiters.
> * **New Recruiters:**  The university attracted exciting new recruiters along with their legacy recruiters.
> * **Well-Rounded Profiles:** The importance of a well-rounded profile, developed through activities like summer internships, was highlighted.
> 
> The text celebrates the students' perseverance and the university's commitment to placement excellence. 
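
To check what the answer was grounded in, it can help to look at the retrieved chunks alongside the answer. The sketch below assumes the response dictionary has a `"context"` key holding document-like objects with a `page_content` attribute, which is how `create_retrieval_chain` is documented to return its results. `preview_sources` and `FakeDocument` are hypothetical names; `FakeDocument` only stands in for LangChain's `Document` so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document, used only for this example."""
    page_content: str


def preview_sources(response: dict, max_chars: int = 80) -> list[str]:
    """Return a short preview of each retrieved chunk in the response."""
    previews = []
    for doc in response.get("context", []):
        text = doc.page_content
        suffix = "..." if len(text) > max_chars else ""
        previews.append(text[:max_chars] + suffix)
    return previews


fake_response = {
    "input": "How were the placements this year?",
    "context": [FakeDocument("Short chunk."), FakeDocument("x" * 100)],
    "answer": "Example answer.",
}
for i, preview in enumerate(preview_sources(fake_response), start=1):
    print(f"[{i}] {preview}")
```

In the notebook, the same helper could be called on the real `response` to see which five chunks the model was given.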
