# 🌟 **RAG Essentials: Loading Documents, Chunking, Embeddings & Chroma DB**

This notebook is a beginner-friendly walkthrough of how to build the core components of a **Retrieval-Augmented Generation (RAG)** pipeline using **Gemini**, **LangChain**, and **ChromaDB**.

You’ll go from loading a PDF → splitting it into chunks → creating embeddings → storing them in a vector database → and finally querying using a RetrievalQA chain.

<br>

## 🔍 **What You Will Learn (Short & Friendly)**

* How to load PDFs using LangChain
* How to chunk documents using different strategies
* How to create embeddings using Gemini Embedding models
* How to store & persist embeddings using **ChromaDB**
* How to reload your DB and run a **RAG Retriever + QA chain**

<br>

## 🧠 **Prerequisites**

* Basic Python
* A Google API key (Gemini)
* A PDF file to test (this notebook uses `Ex-policy.pdf`)
* Basic understanding of vector databases (helpful, but optional)

<br>

## 🚀 **Goal of This Notebook**

To help beginners understand the **entire RAG preprocessing and vector search workflow**, from raw documents to a fully functional RetrievalQA system.

By the end, you'll know how to preprocess your own PDFs and query them intelligently using Gemini.

<br>

## 📝 **Note**

This notebook documents my learning journey.
The implementations are practical and based on real issues I encountered while building RAG systems.
They may or may not be production-ready, but they are designed to make each step easy to understand for beginners.



In [None]:
!pip install -q -U google-genai

In [None]:
!pip install langchain_google_genai

In [None]:
from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
import json

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash-lite",api_key=GOOGLE_API_KEY)
response = llm.invoke("Say Hello!")
print(response.content)

Hello!


In [None]:
!pip install langchain_community

In [None]:
!pip install pypdf

In [None]:
from langchain_community.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader("/content/Ex-policy.pdf")
loaded_pdf = pdf_loader.load()
print(type(loaded_pdf))
print(len(loaded_pdf))

<class 'list'>
5
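
The loader returned a list with one entry per page (5 here). A quick sanity check is to look at each page's text length and metadata before chunking. The sketch below uses a hypothetical `FakePage` stand-in so it runs without the PDF; with the real `loaded_pdf` you would pass that list instead. The `page` and `source` keys are the ones visible in the query output later in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class FakePage:
    """Hypothetical stand-in for one loaded page: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_pages(pages):
    """Return (page number, character count, source) for each page."""
    return [
        (p.metadata.get("page"), len(p.page_content), p.metadata.get("source"))
        for p in pages
    ]


pages = [
    FakePage("Leave policy text", {"page": 0, "source": "/content/Ex-policy.pdf"}),
    FakePage("Resignation and notice periods", {"page": 1, "source": "/content/Ex-policy.pdf"}),
]
summary = summarize_pages(pages)
for page_no, n_chars, source in summary:
    print(f"page={page_no} chars={n_chars} source={source}")
```

Pages with a very small character count are often scanned images or blank pages, which is worth knowing before you embed anything.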


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=10,separators=["\n\n","\n"," "])
chunks = chunk_splitter.split_documents(loaded_pdf)
print(type(chunks))
print(len(chunks))

<class 'list'>
31
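
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses (that one also tries to break on the given separators); the function name `chunk_text` is made up for this illustration.

```python
def chunk_text(text, chunk_size=400, chunk_overlap=10):
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
```

Each chunk starts with the last three characters of the previous one, which is what the overlap buys you: a sentence cut at a chunk boundary still appears partly in both neighbours.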


In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="gemini-embedding-exp-03-07",google_api_key=GOOGLE_API_KEY)
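
An embedding is just a list of numbers, and texts with similar meaning should end up with vectors that point in similar directions. One common way to compare two vectors is cosine similarity. The sketch below uses tiny hand-made vectors as hypothetical "embeddings" (real ones have far more dimensions), and the distance metric a vector store actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional vectors, made up for illustration only.
resignation = [0.9, 0.1, 0.0]
notice_period = [0.8, 0.2, 0.1]
cafeteria_menu = [0.0, 0.1, 0.9]

print(cosine_similarity(resignation, notice_period))
print(cosine_similarity(resignation, cafeteria_menu))
```

The first score is much higher than the second, which is the property retrieval relies on.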

In [None]:
!pip install chromadb

---

**⚡ Note on Using Embeddings**



We can generate embeddings in **two ways**:

1. **Directly using the model**

   * Send your text to the embedding model each time you need it.
   * Simple, but every time the session restarts, embeddings are recomputed.
   * Potential issues:

     * Model may be deprecated in the future.
     * Free-tier usage limits may be reached quickly.

2. **Store already created embeddings and reuse them**

   * Compute embeddings **once**, save them in a **vector database** (like Chroma).
   * Later, you can **reload and query** the DB without recomputing.
   * Advantages:

     * Faster queries.
     * Saves API usage and avoids hitting limits.
     * Keeps the stored vectors available across session restarts (queries must still be embedded with the same model that created them).

> ✅ **Best practice:** For large documents or frequent querying, persist embeddings and reuse them instead of regenerating them every time.

---
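
The "compute once, reuse later" idea from the note above can be sketched without any vector database. In this illustration `fake_embed` is a hypothetical stand-in for a real (rate-limited) embedding call, and a JSON file plays the role that Chroma plays in the rest of the notebook.

```python
import json
import tempfile
from pathlib import Path

calls = {"count": 0}


def fake_embed(text):
    """Hypothetical stand-in for an embedding API call; counts how often it runs."""
    calls["count"] += 1
    return [float(len(text)), float(text.count(" "))]


def embed_with_cache(texts, cache_path):
    """Embed each text once; reuse vectors stored in a JSON file on later runs."""
    cache_path = Path(cache_path)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    for text in texts:
        if text not in cache:
            cache[text] = fake_embed(text)
    cache_path.write_text(json.dumps(cache))
    return [cache[text] for text in texts]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.json"
    texts = ["notice period", "leave encashment"]
    first = embed_with_cache(texts, path)   # calls fake_embed for both texts
    second = embed_with_cache(texts, path)  # served entirely from the file
    print(calls["count"])
```

The second call returns the same vectors without touching the embedding function, which is exactly what saves API quota.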

**🧩 Step 1: Create and persist the DB**

```python
# Chroma lives in langchain_community (installed above);
# the older `langchain.vectorstores` path is deprecated.
from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings

# 1️⃣ Initialize embeddings
embeddings = GoogleGenerativeAIEmbeddings(
    model="gemini-embedding-exp-03-07",
    google_api_key=GOOGLE_API_KEY
)

# 2️⃣ Create the Chroma DB and persist it to a folder
persist_directory = "/content/chroma_db"

db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=persist_directory
)

# 3️⃣ Save (commit) to disk
# With Chroma 0.4+ the data is written automatically when persist_directory is set,
# so this call is only needed on older versions and may emit a deprecation warning.
db.persist()
print("✅ Vector DB saved to:", persist_directory)
```

✅ **After running this,** Colab will create a folder `/content/chroma_db` containing all the vectors and metadata.
You can zip & download it with:

```python
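# Zip from inside /content so the archive stores chroma_db/... and later unzips to /content/chroma_db
!cd /content && zip -r chroma_db.zip chroma_db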
```
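
The same archive round trip can be done from Python with the standard library, which makes the folder layout inside the zip explicit. This is a minimal sketch: `archive_dir` is a helper name invented here, and `index.bin` is a made-up file standing in for whatever Chroma actually writes.

```python
import shutil
import tempfile
from pathlib import Path


def archive_dir(folder, archive_base):
    """Zip `folder` so the archive contains only its own name, not the full path."""
    folder = Path(folder)
    return shutil.make_archive(
        str(archive_base), "zip", root_dir=str(folder.parent), base_dir=folder.name
    )


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # "Session 1": a folder that stands in for /content/chroma_db.
    db_dir = tmp / "session1" / "chroma_db"
    db_dir.mkdir(parents=True)
    (db_dir / "index.bin").write_bytes(b"vectors")  # hypothetical file name

    archive = archive_dir(db_dir, tmp / "chroma_db")  # creates <tmp>/chroma_db.zip

    # "Session 2": unpack under a fresh parent folder and check the contents.
    restored_parent = tmp / "session2"
    restored_parent.mkdir()
    shutil.unpack_archive(archive, extract_dir=str(restored_parent))
    restored_ok = (restored_parent / "chroma_db" / "index.bin").read_bytes() == b"vectors"
    print(restored_ok)
```

In Colab the equivalent call would be `archive_dir("/content/chroma_db", "/content/chroma_db")`, followed by `shutil.unpack_archive("chroma_db.zip", extract_dir="/content")` in the later session.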

---

**🔁 Step 2: Reuse it in a later Colab session**

When you open Colab again (or restart the runtime):

1. Upload the `chroma_db.zip`
2. Extract it:

   ```python
   !unzip chroma_db.zip -d /content/
   ```
3. Load it back:

   ```python
   from langchain_community.vectorstores import Chroma
   from langchain_google_genai import GoogleGenerativeAIEmbeddings

   embeddings = GoogleGenerativeAIEmbeddings(
       model="gemini-embedding-exp-03-07",
       google_api_key=GOOGLE_API_KEY
   )

   persist_directory = "/content/chroma_db"

   db = Chroma(
       persist_directory=persist_directory,
       embedding_function=embeddings
   )

   retriever = db.as_retriever(search_kwargs={"k": 8})
   ```

Now you can **directly query** without reprocessing documents or recomputing embeddings.
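
The `k` in `search_kwargs={"k": 8}` is the number of nearest chunks the retriever hands back for each query. Here is a minimal sketch of that top-k selection in plain Python. The two-dimensional vectors are made up, and squared Euclidean distance is used as an assumption for illustration; the metric a real store applies depends on how the collection is configured.

```python
import heapq


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, store, k=8):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = ((text, squared_distance(query_vec, vec)) for text, vec in store.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical store: chunk text mapped to a made-up 2-dimensional vector.
store = {
    "notice period rules": [0.9, 0.1],
    "final settlement": [0.7, 0.3],
    "office parking": [0.1, 0.9],
}
hits = top_k([1.0, 0.0], store, k=2)
print([text for text, _ in hits])
```

A larger `k` gives the LLM more context to work with but also more irrelevant text, so it is worth trying a few values.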

---

**⚙️ Step 3: Use it with your RAG chain again**

```python
from langchain.chains import RetrievalQA

rag = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True
)
```
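
By default `RetrievalQA.from_chain_type` uses the "stuff" chain type: the retrieved chunks are pasted into a single prompt together with the question. The sketch below shows the idea only. The prompt wording and the `build_stuff_prompt` name are invented for this illustration and are not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Join retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Shortened sentences from the example policy, standing in for retrieved chunks.
retrieved = [
    "Resignations should be submitted in writing through the HRIS or email.",
    "Final settlement will include unpaid wages.",
]
prompt = build_stuff_prompt("What is the notice period?", retrieved)
print(prompt)
```

Because every chunk goes into one prompt, `k` times the chunk size has to fit in the model's context window.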


In [None]:
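from langchain_community.vectorstores import Chroma

# Method 1: build the DB from the chunks and persist it to disk
persist_dir = "/content/db"
db = Chroma.from_documents(documents=chunks, embedding=embeddings, persist_directory=persist_dir)

# Method 2: build the DB in memory only (lost when the session ends)
# db = Chroma.from_documents(documents=chunks, embedding=embeddings)

# Method 3: reload a DB that was persisted earlier (no re-embedding)
# persist_directory = "/content/chroma_db"
# db = Chroma(
#     persist_directory=persist_directory,
#     embedding_function=embeddings
# )

# Retriever that returns the 8 most similar chunks per query
db_index = db.as_retriever(search_kwargs={"k": 8})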

In [None]:
db.persist()  # only for Method 1; Chroma 0.4+ persists automatically and may warn here


In [None]:
from langchain.chains import RetrievalQA

rag = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=db_index,
    return_source_documents=True
)

In [None]:
import json

question = "What is the notice period if I want to resign ??"
# .invoke() replaces the deprecated direct call rag({...})
response = rag.invoke({"query": question})

# Convert source_documents to a list of dicts
response_dict = {
    "result": response["result"],
    "source_documents": [
        {"page_content": doc.page_content, "metadata": doc.metadata}
        for doc in response["source_documents"]
    ]
}

print(json.dumps(response_dict, indent=2))


{
  "result": "The provided text states that resignations should include \"applicable notice periods defined by contract and aligned with local law.\" However, it does not specify what that notice period is.",
  "source_documents": [
    {
      "page_content": "Resignations should be submitted in writing through the HRIS or email with applicable notice\nperiods defined by contract and aligned with local law; mutual waivers may be considered\nby management.\u00003\u0000\u00005\u0000\nFinal settlement will include unpaid wages, eligible leave encashment if applicable, statutory\ncontributions, and recovery of company assets per due process.\u00001\u0000\u00003\u0000\nSeparation for cause",
      "metadata": {
        "page": 3,
        "moddate": "2025-10-06T18:54:41+00:00",
        "creationdate": "2025-10-06T18:54:41+00:00",
        "page_label": "4",
        "source": "/content/Ex-policy.pdf",
        "creator": "Chromium",
        "total_pages": 5,
        "producer": "Skia/PDF m127

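The `page_content` in the output above contains sequences such as `\u00003\u0000`, that is, NUL characters wrapped around a digit. A reasonable guess (an assumption, not something the PDF confirms) is that these are footnote or citation markers left over from PDF extraction. A minimal sketch for stripping them before chunking or display, with `clean_chunk` as a made-up helper name:

```python
import re


def clean_chunk(text):
    """Drop NUL-delimited digit markers, then any stray NUL characters."""
    text = re.sub(r"\x00\d+\x00", "", text)
    return text.replace("\x00", "")


# Fragment modelled on the page_content shown in the output above.
raw = "considered\nby management.\x003\x00\x005\x00\nFinal settlement"
cleaned = clean_chunk(raw)
print(repr(cleaned))
```

Cleaning the text before it is embedded keeps this noise out of both the vectors and the answers.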

# ✅ **Summary & Next Steps**

In this notebook, you explored the full preprocessing pipeline required to build a RAG system using **Gemini + LangChain + ChromaDB**.
Here’s what you accomplished:

### 🔹 You Have Done:

* Loaded a PDF using **PyPDFLoader**
* Split the document into chunks using:

  * `RecursiveCharacterTextSplitter`
  * (Optional) different separators and overlaps
* Generated vector embeddings using:

  * `GoogleGenerativeAIEmbeddings` (Gemini embedding model)
* Stored the embeddings in **ChromaDB** using:

  * `from_documents()`
  * `persist_directory` for saving to disk
* Reloaded the persisted Chroma DB in a new session
* Built a **Retriever** with `search_kwargs={"k": 8}`
* Created and used a **RetrievalQA** chain
* Queried your document (“notice period” example) and viewed source chunks

<br>

## 🧱 **What This Notebook Gives You**

You now understand the core workflow behind any RAG application:

* Loading documents
* Chunking intelligently
* Generating embeddings
* Storing & reusing those embeddings
* Querying via a Retriever → LLM chain

<br>

🔮 This prepares you for more advanced notebooks that involve:

*   Building agents that can search, plan, and answer with external tools
*   Creating custom tools for tasks like date retrieval and API calls
*   Using LangChain Agents for multi-step reasoning
*   Using LangGraph to build structured agent workflows


<br>

💬 **Tip:**
Experiment with different PDFs, chunk sizes, and models; even small changes can noticeably affect retrieval quality.

**Keep exploring: this is the heart of building real AI applications. 🚀**
