# RAG Demo using PDF + Chroma

This notebook demonstrates Retrieval-Augmented Generation (RAG) using a PDF document, a Chroma vector database, and an optional Gemini LLM for generating the final answer.

In [4]:
# Import the PDF loader, embedding wrapper, and Chroma vector store from langchain_community,
# and the text splitter from langchain_text_splitters.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
import os

In [5]:
# Install required Python packages silently (`-q`).
# langchain: Core LangChain library.
# langchain-community: Community contributed LangChain components (loaders, embeddings, vector stores).
# chromadb: The Chroma vector database.
# pypdf: PDF document parsing library.
# sentence-transformers: For generating embeddings.
# google-generativeai: For integrating with Google's Generative AI models.
# langchain-text-splitters: For text splitting functionalities in LangChain.
!pip -q install langchain langchain-community chromadb pypdf sentence-transformers google-generativeai langchain-text-splitters

In [6]:

# Define the path to the PDF document.
PDF_PATH = "/content/Promotional Content Creation.pdf"
# Initialize PyPDFLoader with the PDF path to load the document.
loader = PyPDFLoader(PDF_PATH)
# Load the pages from the PDF, resulting in a list of Document objects.
docs = loader.load()
# Print the number of pages successfully loaded from the PDF.
print("Pages loaded:", len(docs))


Pages loaded: 7


In [7]:

# Initialize a RecursiveCharacterTextSplitter to break down documents into smaller chunks.
# chunk_size=900: Specifies the maximum size of each text chunk.
# chunk_overlap=150: Defines the number of characters that consecutive chunks will overlap,
#                    helping to maintain context across splits.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=900, chunk_overlap=150)
# Split the loaded documents into smaller, manageable chunks.
chunks = text_splitter.split_documents(docs)
# Print the total number of chunks created.
print("Total chunks:", len(chunks))


Total chunks: 25
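
The interaction between `chunk_size` and `chunk_overlap` can be illustrated with a simplified fixed-window splitter. This is only a sketch: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its actual chunk boundaries will differ. The function name `sliding_window_chunks` and the toy input are assumptions made for illustration.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows where each window repeats the
    last `chunk_overlap` characters of the previous one.

    Simplified illustration only, not the LangChain algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Toy input: 25 digits, so the overlap is easy to see by eye.
sample = "".join(str(i % 10) for i in range(25))
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in demo_chunks:
    print(repr(chunk))
```

In this simplified model, `chunk_size=900` with `chunk_overlap=150` would start a new chunk every 750 characters, so a sentence cut at one chunk boundary is likely to appear whole in the neighbouring chunk.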


In [8]:

# Initialize HuggingFaceEmbeddings to convert text into numerical vectors.
# model_name="sentence-transformers/all-MiniLM-L6-v2": Specifies the pre-trained model to use for embeddings.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Define the directory where the Chroma vector database will be persisted.
persist_dir = "/mnt/data/chroma_kncet"
# Create a Chroma vector database from the text chunks and embeddings.
# The database will be stored in the specified `persist_directory`.
vectordb = Chroma.from_documents(chunks, embeddings, persist_directory=persist_dir)
# Persist the vector database to disk. Chroma 0.4 and later persist automatically when
# `persist_directory` is set, so this call is redundant there and emits a deprecation warning.
vectordb.persist()
# Confirm that the vector database has been created.
print("Vector DB created")


Vector DB created
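
To make the role of the embeddings concrete, the sketch below ranks a few hand-written toy vectors against a query vector by cosine similarity and keeps the top `k`. It is an illustration under assumptions: the three-dimensional vectors and the names `cosine_similarity` and `top_k` are made up for this example, real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and the distance metric Chroma uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the names of the k vectors most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical 3-dimensional "embeddings" for three chunks.
toy_vectors = {
    "placements": [0.9, 0.1, 0.0],
    "campus": [0.1, 0.9, 0.1],
    "admissions": [0.2, 0.2, 0.9],
}
toy_query = [0.8, 0.3, 0.1]  # points mostly in the "placements" direction

top_two = top_k(toy_query, toy_vectors, k=2)
print(top_two)
```

A vector store applies the same idea at scale: the query is embedded with the same model as the chunks, and the nearest stored vectors are returned.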


In [18]:
# Initialize the retriever from the Chroma vector database.
# It's configured to search for the top 4 most relevant documents.
retriever = vectordb.as_retriever(search_kwargs={"k":4})

# Define the query string to search for within the documents.
query = "List 3 key reasons to choose KNCET."

# Other example queries to try (commented out):
# "What are the main accreditations and autonomous status details mentioned?"
# "What infrastructure highlights are mentioned (campus size, labs, library, Wi-Fi)?"
# "What placement details are mentioned (highest and average package, recruiters)?"

# Execute the query using the retriever to get the most relevant chunks.
# A separate name is used so the pages loaded earlier into `docs` are not overwritten.
retrieved_docs = retriever.invoke(query)

# Iterate through each retrieved chunk.
for d in retrieved_docs:
    # Remove null characters (\x00), which can appear in text extracted from PDFs.
    cleaned_content = d.page_content.replace('\x00', '')
    # Print the cleaned content as a bullet, truncated to the first 400 characters for brevity.
    print(f"* {cleaned_content[:400]}")

* Why Choose KNCET?
NAAC A+ & NBA accredited programmes – quality engineering education.
Autonomous curriculum with flexibility to introduce modern industry‑relevant courses.
Eco‑friendly campus with smart classrooms, digital library and 100+ labs.
Scholarships & fee concessions for meritorious and sports students from disadvantaged
backgrounds.
Support for research & start‑ups via the Kongunadu I
* From Campus to Corporate – Our Campus‑to‑Corporate centre trains you for success and
connects you with top recruiters like TCS, Wipro & HCL. Begin your journey with us today!
#Placements #CareerReady
Hands‑on Learning – With 100+ laboratories and a state‑of‑the‑art digital library, KNCET
gives you the tools to innovate. Join a college where experiments become in
* Engineering & Technology (KNCET) – 70‑acre green campus, NAAC A+ & NBA accreditations and
strong placements (highest package ₹12 LPA). Admissions open! Reply YES to know more.
Hi! Start your e

Optional: add a Gemini API key to generate final answers from the retrieved chunks.
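
One way the generation step might look is sketched below: the retrieved chunks are joined into a single prompt, and the prompt is sent to Gemini only if a key is available. Several details are assumptions rather than part of this notebook: the helper name `build_prompt`, the prompt wording, the `GOOGLE_API_KEY` environment variable, and the model name `gemini-1.5-flash`. The sample contexts are short excerpts from the retrieval output above; in the notebook they would come from the retrieved documents' `page_content`.

```python
import os


def build_prompt(question, contexts):
    """Join retrieved chunks into a numbered context block followed by the question."""
    numbered = "\n\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Stand-ins for the retrieved chunk texts (excerpts from the output above).
sample_contexts = [
    "NAAC A+ & NBA accredited programmes – quality engineering education.",
    "Our Campus-to-Corporate centre connects you with top recruiters like TCS, Wipro & HCL.",
    "70-acre green campus, strong placements (highest package ₹12 LPA).",
]
prompt = build_prompt("List 3 key reasons to choose KNCET.", sample_contexts)

api_key = os.environ.get("GOOGLE_API_KEY")  # assumed variable name
if api_key:
    # Imported lazily so the cell still runs when no key is configured.
    import google.generativeai as genai

    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name
    response = model.generate_content(prompt)
    print(response.text)
else:
    # Without a key, show the prompt that would have been sent.
    print(prompt)
```

Instructing the model to rely only on the supplied context keeps the answer grounded in the PDF rather than in whatever the model may already know.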