# RAG Assignment - PDF Chatbot


## Problem Statement

Build a RAG (Retrieval Augmented Generation) system that takes a PDF as input and answers questions based on it.


## Architecture

1. **Data Source**: PDF File.

2. **Text Chunking**: RecursiveCharacterTextSplitter (Size: 1000, Overlap: 200).

3. **Embeddings**: HuggingFaceEmbeddings (all-MiniLM-L6-v2).

4. **Vector Store**: FAISS.

5. **LLM**: Google Gemini (via `langchain-google-genai`).

In [2]:
# Install dependencies if running in a new environment
# !pip install langchain langchain-classic langchain-community langchain-google-genai pypdf faiss-cpu sentence-transformers python-dotenv
import os

from dotenv import load_dotenv
from langchain_classic.chains import RetrievalQA
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load environment variables (including the API key) from a local .env file
load_dotenv()

if not os.getenv("GOOGLE_API_KEY"):
    print("Warning: GOOGLE_API_KEY not found in environment. Please set it in .env or here.")
    # os.environ["GOOGLE_API_KEY"] = "your_api_key_here"
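
If the key is missing, prompting for it is usually safer than pasting it into a cell. The sketch below is one way to do that; `ensure_api_key` and its parameters are hypothetical names introduced here, not part of any library.

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping


def ensure_api_key(
    name: str = "GOOGLE_API_KEY",
    env: MutableMapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
) -> bool:
    """Return True if `name` is set in `env`, prompting once if it is missing."""
    if env.get(name):
        return True
    # getpass hides the typed value so the key does not end up in cell output
    value = prompt(f"Enter {name}: ").strip()
    if not value:
        return False
    env[name] = value
    return True
```

Calling `ensure_api_key()` right after `load_dotenv()` would leave the key in `os.environ` for the current session only.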

## 1. Data Loading

Load the PDF with `PyPDFLoader`, which returns one `Document` per page.

In [3]:
pdf_path = "../data/sample.pdf"

# Fail early so later cells never run with `documents` undefined
if not os.path.exists(pdf_path):
    raise FileNotFoundError(f"File not found: {pdf_path}. Please check the path.")

loader = PyPDFLoader(pdf_path)
documents = loader.load()
print(f"Loaded {len(documents)} pages.")

Loaded 1 pages.
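
A page count alone does not show whether text was actually extracted; scanned pages without a text layer typically load as empty strings. The sketch below flags such pages. `PageStub` and `empty_pages` are hypothetical names; `PageStub` only mimics the `page_content` and `metadata` attributes of a loaded page so the example runs without a PDF.

```python
from dataclasses import dataclass, field


@dataclass
class PageStub:
    """Hypothetical stand-in for a loaded page (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def empty_pages(documents, min_chars=1):
    """Return the page indices whose extracted text is shorter than min_chars."""
    return [
        doc.metadata.get("page", i)
        for i, doc in enumerate(documents)
        if len(doc.page_content.strip()) < min_chars
    ]


pages = [
    PageStub("Dummy PDF file", {"page": 0}),
    PageStub("   \n", {"page": 1}),  # whitespace only, as a scanned page might yield
]
print(empty_pages(pages))  # [1]
```

In the notebook, `empty_pages(documents)` should work on the loaded pages directly, assuming they expose the same two attributes.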


## 2. Text Chunking

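Split the loaded pages into chunks with `RecursiveCharacterTextSplitter`, which tries to break on paragraph and line boundaries before falling back to smaller separators.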

- **Chunk Size**: 1000 characters.

- **Overlap**: 200 characters to preserve context across boundaries.
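
To make the size and overlap settings concrete, here is a simplified fixed-window splitter. It is not the recursive algorithm the library uses (it ignores separators entirely); `fixed_window_chunks` is a hypothetical helper that only shows how consecutive chunks share `chunk_overlap` characters.

```python
def fixed_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in fixed_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

With a size of 10 and an overlap of 4, each window starts 6 characters after the previous one, so the last 4 characters of one chunk reappear at the start of the next.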

In [4]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks.")
print(f"First chunk content: {chunks[0].page_content[:200]}...")

Created 1 chunks.
First chunk content: Dummy PDF file...


## 3. Embeddings & Vector Store

Generate embeddings using `HuggingFaceEmbeddings` (runs locally) and store them in `FAISS`.

In [None]:
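# Downloads the model on first use, then computes embeddings locally
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Embed every chunk and index the vectors in an in-memory FAISS store
vector_store = FAISS.from_documents(chunks, embeddings)
print("Vector store created.")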
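
Conceptually, the vector store answers a query by finding the stored vectors closest to the query vector. The toy sketch below does this by brute force with Euclidean (L2) distance, which is, to our understanding, the default in LangChain's FAISS wrapper. The three-dimensional vectors and chunk labels are made up for illustration; real `all-MiniLM-L6-v2` embeddings have 384 dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, indexed, k=2):
    """Return the k (distance, text) pairs closest to query_vec, nearest first."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in indexed]
    return sorted(scored)[:k]


# Made-up vectors standing in for real embeddings
toy_index = [
    ("chunk about invoices", [1.0, 0.0, 0.0]),
    ("chunk about shipping", [0.0, 1.0, 0.0]),
    ("chunk about refunds", [0.9, 0.1, 0.0]),
]
for distance, text in top_k([1.0, 0.0, 0.0], toy_index, k=2):
    print(f"{distance:.3f}  {text}")
```

FAISS exists to avoid this full scan on large collections; for a handful of chunks the result should be the same.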

## 4. Retrieval & Generation

Set up the retriever and the QA chain using Google Gemini.

In [None]:
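# Default retriever: similarity search over the FAISS index
retriever = vector_store.as_retriever()

# Note: "gemini-pro" may no longer be served; substitute a model name your API key can access
llm = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0.3)

# "stuff" places all retrieved chunks into a single prompt
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)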
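
The `"stuff"` chain type simply concatenates every retrieved chunk into one prompt and sends it to the LLM in a single call. A minimal sketch of that idea follows; `build_stuff_prompt` is a hypothetical helper and the instruction wording is illustrative, not the template LangChain actually uses.

```python
def build_stuff_prompt(question, chunk_texts):
    """Concatenate all retrieved chunks into one context block followed by the question."""
    context = "\n\n".join(chunk_texts)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What is the main topic of this document?",
    ["Dummy PDF file"],
)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved chunks fit in the model's context window, which is one reason to keep the chunk size moderate.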

## 5. Testing

Ask the system a few questions and print each answer with its source pages.

In [None]:
def ask_question(query):
    """Run one query through the QA chain and print the answer with its source pages."""
    response = qa_chain.invoke({"query": query})
    print(f"Q: {query}")
    print(f"A: {response['result']}")
    print("Sources:")
    for doc in response["source_documents"]:
        # PyPDFLoader stores a 0-based page index in metadata
        print(f"- Page {doc.metadata.get('page', 'N/A')}")
    print("-" * 50)


test_queries = [
    "What is the main topic of this document?",
    "Summarize the key points.",
    "Are there any specific dates mentioned?",
]

for query in test_queries:
    ask_question(query)
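
Several retrieved chunks can come from the same page, and the `page` value in the metadata is 0-based, so the printed source list may repeat entries and be off by one from what a PDF viewer shows. The sketch below is one possible way to tidy that up; `format_sources` is a hypothetical helper that works on plain metadata dicts.

```python
def format_sources(source_metadata):
    """Return sorted, de-duplicated, 1-based page labels from a list of metadata dicts."""
    pages = {
        meta["page"] + 1
        for meta in source_metadata
        if isinstance(meta.get("page"), int)  # skip entries without a page index
    }
    return [f"Page {page}" for page in sorted(pages)]


retrieved = [{"page": 0}, {"page": 2}, {"page": 0}, {}]
print(format_sources(retrieved))  # ['Page 1', 'Page 3']
```

Inside `ask_question`, this could be called as `format_sources([doc.metadata for doc in response["source_documents"]])`.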