**Project Title:** Q&A System with Retrieval-Augmented Generation (RAG) Using Gemini

**Overview:** This notebook builds a Retrieval-Augmented Generation (RAG) system that lets users ask questions over research papers or technical documents. It combines document retrieval with Google’s Gemini language model to return concise, context-aware answers.

In [None]:
# Install and import the required libraries
!pip install -q langchain-community
!pip install pypdf
!pip install langchain_chroma
!pip install langchain_google_genai

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_chroma import Chroma
from langchain_google_genai import ChatGoogleGenerativeAI

import warnings
warnings.filterwarnings('ignore')

Collecting pypdf
  Downloading pypdf-5.4.0-py3-none-any.whl.metadata (7.3 kB)
Downloading pypdf-5.4.0-py3-none-any.whl (302 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.4.0
Collecting langchain_chroma
  Downloading langchain_chroma-0.2.3-py3-none-any

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
os.chdir('/content/drive/My Drive/RAG')

In [None]:
# Data loading: PyPDFLoader returns one Document per page, collected into a single list
pdf_files = ["my_paper.pdf", "Rainfall_Paper.pdf"]

data = []
for pdf_path in pdf_files:
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()
    data.extend(docs)
len(data)

20

In [None]:
# loader1 = PyPDFLoader("my_paper.pdf")
# data1 = loader1.load()
# loader2 = PyPDFLoader("Rainfall_Paper.pdf")
# data2 = loader2.load()
# data = data1 + data2

In [None]:
# Split the PDF pages into chunks of at most 1,000 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
docs = text_splitter.split_documents(data)

print("Total number of documents: ",len(docs))

Total number of documents:  73
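
To make the `chunk_size` parameter concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally (the real splitter first tries to break on separators such as paragraph and line breaks). The function name `chunk_text` and the sample values are illustrative assumptions.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Toy input (hypothetical): the alphabet split into 10-character windows
# that overlap by 2 characters.
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=2)
print(chunks)
```

Overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of storing some text twice.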


In [None]:
docs[0]

Document(metadata={'producer': 'pdfTeX-1.40.21', 'creator': 'LaTeX with hyperref', 'creationdate': '2021-09-14T02:52:38+00:00', 'author': '', 'keywords': '', 'moddate': '2021-09-14T02:52:38+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'subject': '', 'title': '', 'trapped': '/False', 'rgid': 'PB:357213035_AS:1103436619751424@1640091199662', 'source': 'my_paper.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/357213035\nDevelopment of Multiple Combined Regression Methods for Rainfall\nMeasurement Development of Multiple Combined Regression Methods for\nRainfall Measurement\nArticle · December 2021\nCITATIONS\n0\nREADS\n711\n6 authors, including:\nNusrat Jahan Prottasha\nDaffodil International University\n26 PUBLICATIONS\xa0\xa0\xa0299 CITATIONS\xa0\xa0\xa0\nSEE PROFILE\nMd Kowsher\nStevens

In [None]:
os.environ["GOOGLE_API_KEY"] = "API_KEY"  # Placeholder: replace with your own key. To get one: https://ai.google.dev/gemini-api/docs/api-key

In [None]:
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vector = embeddings.embed_query("Hai, world!")
vector[:5]

[0.06513094902038574,
 -0.011213342659175396,
 -0.06175588071346283,
 -0.005176943726837635,
 0.021008262410759926]
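
Embedding vectors are useful because texts with similar meaning map to nearby vectors, which is what the retrieval step relies on. The following is a minimal sketch of cosine similarity on plain Python lists. The three-dimensional vectors are made up for illustration and are not real Gemini embeddings, and the metric a vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not real embeddings).
query_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # same direction, different length
orthogonal = [0.0, 1.0, 0.0]       # shares no component with the query

print(cosine_similarity(query_vec, same_direction))  # close to 1
print(cosine_similarity(query_vec, orthogonal))      # 0
```

Because the score depends only on direction, a long chunk and a short chunk about the same topic can still score as similar.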

In [None]:
# Reuse the embeddings object created above instead of building a second one
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings)

In [None]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 10})

retrieved_docs = retriever.invoke("What is new in Annual Rainfall Classification Using Machine Learning Techniques paper?")

In [None]:
len(retrieved_docs)

10
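
The count matches the `k` value passed in `search_kwargs`: the retriever returns the `k` chunks that score highest against the query. Below is a brute-force sketch of top-k selection, assuming toy two-dimensional vectors scored by dot product. The names `top_k` and `doc_vecs` are hypothetical, and a real vector store typically relies on a nearest-neighbour index rather than scanning every vector.

```python
import heapq


def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs with the highest dot-product score."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scored = ((doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    # nlargest returns the results sorted from highest to lowest score.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy "index" (hypothetical chunk ids and vectors).
doc_vecs = {
    "chunk-a": [0.9, 0.1],
    "chunk-b": [0.2, 0.8],
    "chunk-c": [0.7, 0.3],
}
results = top_k([1.0, 0.0], doc_vecs, k=2)
print([doc_id for doc_id, _ in results])
```

A larger `k` gives the model more context to draw on but also more loosely related text, so it is worth tuning against the questions you expect.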

In [None]:
retrieved_docs

[Document(id='10514176-9eb7-48e8-ab24-42c554e45819', metadata={'author': '', 'creationdate': '2021-09-14T02:52:38+00:00', 'creator': 'LaTeX with hyperref', 'keywords': '', 'moddate': '2021-09-14T02:52:38+00:00', 'page': 0, 'page_label': '1', 'producer': 'pdfTeX-1.40.21', 'ptex.fullbanner': 'This is pdfTeX, Version 3.14159265-2.6-1.40.21 (TeX Live 2020) kpathsea version 6.3.2', 'rgid': 'PB:357213035_AS:1103436619751424@1640091199662', 'source': 'my_paper.pdf', 'subject': '', 'title': '', 'total_pages': 15, 'trapped': '/False'}, page_content='See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/357213035\nDevelopment of Multiple Combined Regression Methods for Rainfall\nMeasurement Development of Multiple Combined Regression Methods for\nRainfall Measurement\nArticle · December 2021\nCITATIONS\n0\nREADS\n711\n6 authors, including:\nNusrat Jahan Prottasha\nDaffodil International University\n26 PUBLICATIONS\xa0\xa0\xa0299 CITATIONS\x

In [None]:
print(retrieved_docs[6].page_content)

Rainfall Prediction 11
forest and Gradient Boosting Regressor have acquired almost the same Accuracy
but if we consider the evaluation metrics of then so, Random forest has a low
error rate compare to Gradient Boosting. So, here we have considered the Ran-
dom forest approach. Overall all of regressors showed a standard and acceptable
performance.
The bar chart is a graph for representing all regressors algorithms with Sta-
tistical measurement. The bar can be vertically or horizontally. Here is the bar
graph of our selective algorithms, down below.
Fig. 4.Selective algorithms
5 Conclusion
In this work, we have presented an initial attempt to determine how much rain
will come when it’s raining time. In the data collection phase, we adopted real
data from Australia from the Kaggle platform. The primary purpose of this
task is to ﬁnd out the best regression technique for the prediction of rain. For
this reason, we have used a variety of regression analysis techniques that can


In [None]:
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0.3, max_tokens=500)

In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [None]:
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
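
Conceptually, a "stuff" chain concatenates the retrieved chunks and substitutes them for `{context}` before the model is called. The sketch below illustrates that idea with plain strings. `build_messages` is a hypothetical helper, not LangChain's implementation, and the template and chunk texts are invented for the example.

```python
def build_messages(system_template: str, chunks: list[str], question: str):
    """Fill the {context} slot with all chunks and pair it with the question."""
    # All chunks are "stuffed" into a single context string.
    context = "\n\n".join(chunks)
    return [
        ("system", system_template.format(context=context)),
        ("human", question),
    ]


# Hypothetical template and chunks, for illustration only.
template = "Answer using only this context:\n\n{context}"
example_chunks = ["First retrieved chunk.", "Second retrieved chunk."]

messages = build_messages(template, example_chunks, "What do the chunks say?")
for role, text in messages:
    print(f"[{role}]\n{text}\n")
```

Since every retrieved chunk ends up in one prompt, the combined size of `k` chunks has to fit within the model's context window.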

In [None]:
response = rag_chain.invoke({"input": "What is new in Annual Rainfall Classification Using Machine Learning Techniques paper?"})
print(response["answer"])

The paper uses machine learning (Decision Trees, Logistic Regression, and Random Forest) to classify annual rainfall patterns in Indian subdivisions from 1901-2017 data.  It compares the algorithms' accuracy, efficiency, and interpretability for this task. The study also emphasizes data visualization to make the classification results more accessible.
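
Besides `"answer"`, the dictionary returned by a chain built with `create_retrieval_chain` also carries the retrieved chunks under `"context"`, which makes it possible to check which files and pages an answer drew on. The sketch below uses a stub class and a hand-written response so that it runs without an API key. `StubDocument`, `list_sources`, and the metadata values are hypothetical stand-ins, not output from this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class StubDocument:
    """Stand-in for a LangChain Document (only the fields used here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def list_sources(context_docs):
    """Return sorted, de-duplicated (source, page) pairs from retrieved chunks."""
    pairs = {
        (doc.metadata.get("source", "unknown"), doc.metadata.get("page", -1))
        for doc in context_docs
    }
    return sorted(pairs)


# Hand-written stand-in for a chain response (not real model output).
fake_response = {
    "answer": "placeholder answer",
    "context": [
        StubDocument("chunk text", {"source": "my_paper.pdf", "page": 10}),
        StubDocument("chunk text", {"source": "Rainfall_Paper.pdf", "page": 0}),
        StubDocument("chunk text", {"source": "my_paper.pdf", "page": 10}),
    ],
}

for source, page in list_sources(fake_response["context"]):
    print(f"{source}, page index {page}")
```

With the real chain, the same helper could be called as `list_sources(response["context"])`. Checking the sources is a quick way to confirm that an answer comes from the paper named in the question rather than from the other PDF.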


**Conclusion:** This RAG system connects unstructured PDF documents to natural-language querying. By combining semantic retrieval from a Chroma vector store with Google’s Gemini model, it lets users extract concise, context-grounded answers from a collection of documents. Because the pipeline is modular (loader, text splitter, embeddings, vector store, retriever, LLM), individual components can be swapped to scale to larger collections or to target domain-specific applications such as legal, research, or medical document Q&A.