<a href="https://colab.research.google.com/github/RDGopal/IB9AU-2026/blob/main/RAG3_Text_Based.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Building a RAG Solution

### Step 1 - Install Required Packages

Before we can run any of the code in this notebook, we need to make sure all necessary software libraries are installed. This step uses `pip`, Python's package installer, to download and install several libraries that are crucial for building our RAG (Retrieval Augmented Generation) system:

*   **`llama-index`**: A data framework for LLM (Large Language Model) applications, used for data ingestion, indexing, and querying.
*   **`llama-index-llms-huggingface`**: LlamaIndex integration for Hugging Face Large Language Models.
*   **`llama-index-embeddings-huggingface`**: LlamaIndex integration for Hugging Face embedding models.
*   **`llama-index-vector-stores-chroma`**: LlamaIndex integration for the Chroma vector database.
*   **`chromadb`**: An open-source vector database used to store and search our document embeddings. It can run in memory or persist to disk; this notebook uses the persistent mode.
*   **`pypdf`**: A library to work with PDF files, enabling us to read and extract text from our documents.
*   **`sentence-transformers`**: Provides state-of-the-art pre-trained models for creating text embeddings.
*   **`torch`**: PyTorch, a powerful open-source machine learning framework, essential for running the language models.
*   **`accelerate`**: A Hugging Face library that simplifies using PyTorch models on different hardware (like GPUs) efficiently.
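*   **`bitsandbytes`**: A library for 8-bit and 4-bit quantization of neural networks, which reduces the memory needed by large models. It only takes effect when a quantization configuration is passed at load time; the loading code in Step 4 uses `bfloat16` weights instead.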

The `-q` flag means 'quiet' (less output), and `-U` means 'upgrade' (ensure the latest version is installed). This ensures our environment has all the tools needed for the RAG pipeline.
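Because `-U` pulls whatever release is newest at install time, it can be useful to record which versions were actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is illustrative and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report a few of the packages installed above.
for name, ver in installed_versions(["llama-index", "chromadb", "torch"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this right after the install cell gives a quick record of the environment, which helps when a later run behaves differently.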

In [None]:
# --- STEP 1: INSTALL REQUIRED PACKAGES ---
!pip install -q -U llama-index llama-index-llms-huggingface llama-index-embeddings-huggingface llama-index-vector-stores-chroma chromadb pypdf sentence-transformers torch accelerate bitsandbytes

### Step 2 - Imports

This section imports all the necessary modules and classes from the installed libraries. Each import plays a specific role in building our RAG system:

*   **`os`**: Provides a way to interact with the operating system, for example, to create directories for storing documents.
*   **`torch`**: The PyTorch library, essential for running and managing machine learning models, especially the LLM and embedding models.
*   **`llama_index.core` components**:
    *   **`VectorStoreIndex`**: The core class from LlamaIndex for creating and querying a vector index, which stores our document embeddings.
    *   **`SimpleDirectoryReader`**: Used to easily load documents (like PDFs) from a specified directory.
    *   **`StorageContext`**: Manages the storage backend for the LlamaIndex, telling it where to store the index data (in our case, ChromaDB).
    *   **`Settings`**: A global configuration object for LlamaIndex, allowing us to set default models for embeddings and LLMs, as well as document parsing strategies.
*   **`llama_index.core.node_parser.SentenceSplitter`**: A tool to break down large documents into smaller, manageable chunks (nodes) based on sentences, which is crucial for effective retrieval.
*   **`llama_index.embeddings.huggingface.HuggingFaceEmbedding`**: Allows us to use pre-trained embedding models from Hugging Face to convert text into numerical vector representations.
*   **`llama_index.llms.huggingface.HuggingFaceLLM`**: Enables the integration of Large Language Models (LLMs) available on Hugging Face directly into our LlamaIndex application.
*   **`llama_index.vector_stores.chroma.ChromaVectorStore`**: The LlamaIndex adapter to use ChromaDB as the persistent vector store for our embeddings.
*   **`chromadb`**: The client library for Chroma, used to directly interact with the Chroma vector database (e.g., to create or get collections).

In [None]:
# --- STEP 2: IMPORTS ---
import os
import torch
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, Settings
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

### Step 3 - Global Settings (Embeddings and Chunking)

This step is crucial for configuring how our RAG system processes and understands documents. We set up two main global settings for LlamaIndex:

1.  **Embedding Model (`Settings.embed_model`)**:
    *   `Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")`:
        *   **Purpose**: An embedding model converts text into numerical vectors (embeddings). These vectors capture the semantic meaning of the text, allowing the RAG system to find relevant document chunks based on the similarity of their embeddings to the query's embedding.
        *   **Model Choice**: We're using `"sentence-transformers/all-MiniLM-L6-v2"`. This is a popular, efficient, and effective pre-trained model from Hugging Face designed to create good quality sentence embeddings. It's a good balance of performance and computational cost.

2.  **Node Parser / Chunking Strategy (`Settings.node_parser`)**:
    *   `Settings.node_parser = SentenceSplitter(chunk_size=500, chunk_overlap=50)`:
        *   **Purpose**: Large documents need to be broken down into smaller, manageable pieces (called 'nodes' or 'chunks') before they can be effectively stored and retrieved. This is because LLMs have a limited context window, and providing smaller, relevant chunks improves retrieval accuracy and reduces computational load.
        *   **`SentenceSplitter`**: This specific parser splits documents into chunks based on sentence boundaries, which often leads to more coherent and meaningful chunks compared to splitting purely by character count.
        *   **`chunk_size=500`**: Each chunk is capped at roughly 500 tokens, as counted by the splitter's tokenizer (tokens are words or sub-words, not characters). This is a common size, balancing detail with brevity.
        *   **`chunk_overlap=50`**: Consecutive chunks share up to about 50 tokens. Overlap helps ensure that context isn't lost when a piece of information spans two chunks: the text at the end of one chunk is repeated at the start of the next, so a fact that straddles the boundary is still readable in full in at least one of them.
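To make the effect of overlap concrete, here is a toy sliding-window chunker over a list of tokens. It is only a sketch: it ignores sentence boundaries, so it is not how `SentenceSplitter` works internally, and the function name `chunk_with_overlap` is made up for this illustration.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a fixed-size window over tokens, repeating `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Small numbers so the overlap is easy to see by eye.
tokens = [f"t{i}" for i in range(12)]
chunks = chunk_with_overlap(tokens, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each printed chunk starts with the last two tokens of the previous one, which is the same idea as `chunk_overlap=50` at a much smaller scale.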

In [None]:
# --- STEP 3: GLOBAL SETTINGS (Embeddings and Chunking) ---
# Embedding model used for both document chunks and queries
Settings.embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Chunk size and overlap, both measured in tokens
Settings.node_parser = SentenceSplitter(chunk_size=500, chunk_overlap=50)

### Step 4 - Load Phi-3 Mini (Local LLM)

This step initializes and configures the Large Language Model (LLM) that our RAG system will use to generate answers based on the retrieved documents. We are utilizing a powerful, open-source model from Hugging Face, specifically **Phi-3-mini-4k-instruct**.

*   **`model_id = "microsoft/Phi-3-mini-4k-instruct"`**: This line specifies the exact model we want to load from the Hugging Face model hub. Phi-3-mini is a lightweight yet capable model developed by Microsoft, designed for instruction-following tasks, making it suitable for a RAG system.

*   **`llm = HuggingFaceLLM(...)`**: We instantiate the `HuggingFaceLLM` class from LlamaIndex, which acts as a wrapper to easily integrate Hugging Face models.
    *   **`model_name=model_id`**: Points to the model we defined earlier.
    *   **`model_kwargs={"torch_dtype": torch.bfloat16}`**: This sets the data type for the model's computations to `bfloat16`. Using `bfloat16` (Brain Floating Point) is a common optimization for large models, as it reduces memory usage and speeds up calculations on compatible hardware (like modern GPUs) while maintaining sufficient precision.
    *   **`device_map="auto"`**: This crucial setting tells the `accelerate` library (which `HuggingFaceLLM` uses internally) to automatically distribute the model across available devices (e.g., GPU, CPU) in the most efficient way. For large models, this helps manage memory and ensures the model runs as fast as possible.
    *   **`max_new_tokens=256`**: This parameter limits the maximum number of tokens the LLM will generate in its response. This helps control the length of the answers and manage computational resources.
    *   **`generate_kwargs={}`**: This is a dictionary where you can pass additional arguments directly to the model's `generate` method (e.g., `temperature`, `top_p` for controlling creativity/randomness in generation).
    *   **`tokenizer_name=model_id`**: Specifies that the tokenizer (which converts text into numerical tokens for the model) associated with the `model_id` should be loaded.

*   **`Settings.llm = llm`**: Finally, we set this configured `llm` object as the default Large Language Model for all subsequent operations within our LlamaIndex application. This means any query engine or other LlamaIndex component will automatically use this `Phi-3-mini-4k-instruct` model for generating responses.
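As a rough check on why `bfloat16` matters, the arithmetic below estimates the memory taken by the model weights alone at different precisions. It assumes the commonly reported figure of about 3.8 billion parameters for Phi-3-mini; activations and the generation cache need additional memory on top, so treat the numbers as a lower bound rather than a measurement.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate memory for model weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


PHI3_MINI_PARAMS = 3.8e9  # approximate parameter count (assumption, see lead-in)

for dtype in ("float32", "bfloat16", "int8"):
    print(f"{dtype}: ~{weight_memory_gb(PHI3_MINI_PARAMS, dtype):.1f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is often the difference between fitting on a single Colab GPU and not fitting at all.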

In [None]:
# --- STEP 4: LOAD PHI-3 MINI (Local LLM) ---
print("Loading Phi-3-mini-4k-instruct (this takes ~1-2 minutes)...")
model_id = "microsoft/Phi-3-mini-4k-instruct"

llm = HuggingFaceLLM(
    model_name=model_id,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
    },
    device_map="auto",
    max_new_tokens=256,
    generate_kwargs={},
    tokenizer_name=model_id,
)
Settings.llm = llm

### Step 5 - Ingest PDFs

This crucial step is responsible for loading your company's onboarding documents (in PDF format) into the RAG system. Here's a breakdown:

1.  **Create Document Directory (`os.makedirs`)**:
    *   `os.makedirs("onboarding_docs", exist_ok=True)`: This line uses Python's `os` module to create a directory named `onboarding_docs` in the current working directory. The `exist_ok=True` argument ensures that if the directory already exists, no error is raised, making the script robust to multiple runs.
    *   **Purpose**: This folder will be where you place all the PDF documents you want your RAG assistant to be able to answer questions about.

2.  **Load Documents with `SimpleDirectoryReader`**:
    *   `reader = SimpleDirectoryReader(input_dir="onboarding_docs", required_exts=[".pdf"])`: We initialize `SimpleDirectoryReader` from `llama_index.core`. This class is designed to easily load documents from a specified directory.
        *   `input_dir="onboarding_docs"`: Tells the reader to look for files within the `onboarding_docs` folder.
        *   `required_exts=[".pdf"]`: Specifies that only files with the `.pdf` extension should be considered for loading.
    *   `documents = reader.load_data()`: This method executes the loading process. It reads all specified PDF files in the directory and converts their content into a format that LlamaIndex can process. Each page of a PDF typically becomes a separate 'document' object in the `documents` list.

3.  **Verify Document Loading**:
    *   The subsequent `if not documents:` block checks whether any documents were actually loaded. If the `documents` list is empty, it prints a warning prompting the user to upload PDFs. Otherwise, it reports how many document pages were loaded.

In summary, this step sets up the document storage, reads your PDF content, and prepares it for further processing by the RAG pipeline.
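Before calling the reader, it can help to confirm which PDFs are actually in the folder. The sketch below uses only the standard library; `list_pdfs` is a hypothetical helper, and matching the extension case-insensitively is a choice made here, not a statement about how `SimpleDirectoryReader` filters files.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by name."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


pdfs = list_pdfs("onboarding_docs")
print(f"Found {len(pdfs)} PDF file(s)")
for pdf in pdfs:
    print(f"  {pdf.name} ({pdf.stat().st_size / 1024:.1f} KiB)")
```

An empty listing here means the upload has not happened yet, which is cheaper to discover than an empty `documents` list after the reader runs.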

In [None]:
# --- STEP 5: INGEST PDFs ---
os.makedirs("onboarding_docs", exist_ok=True)
print("üëâ Please upload your PDFs to the 'onboarding_docs' folder now (if not already there).")

reader = SimpleDirectoryReader(input_dir="onboarding_docs", required_exts=[".pdf"])
documents = reader.load_data()

In [None]:
if not documents:
    print("⚠️ No files found! Please upload PDFs to the folder.")
else:
    print(f"Loaded {len(documents)} document pages.")

### Step 6 - Build or Load Chroma Vector Index

This step is where our processed documents are stored and indexed, allowing for efficient retrieval later. We use ChromaDB, a lightweight vector database, to store the numerical embeddings of our document chunks.

1.  **Define Persistence Directory (`persist_dir`)**:
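    *   `persist_dir = "./chroma_db"`: This sets a local directory named `chroma_db` where ChromaDB will store its data. The vector index is saved to disk and can be reused on later runs without re-indexing, as long as that directory still exists. In Colab, local files are lost when the runtime is reset, so the directory needs to live on persistent storage if the index must survive across sessions.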

2.  **Initialize Chroma Client and Collection**:
    *   `chroma_client = chromadb.PersistentClient(path=persist_dir)`: We create a `PersistentClient` for ChromaDB, pointing it to our `persist_dir`. This client will manage the interaction with the database.
    *   `chroma_collection = chroma_client.get_or_create_collection("onboarding_rag")`: We get (if it exists) or create (if it doesn't) a collection named `onboarding_rag`. A collection in ChromaDB is where our document embeddings and their associated metadata are stored.

3.  **Configure LlamaIndex Storage Context**:
    *   `vector_store = ChromaVectorStore(chroma_collection=chroma_collection)`: We create a `ChromaVectorStore` instance, linking it to our `onboarding_rag` collection. This makes ChromaDB compatible with LlamaIndex's vector store interface.
    *   `storage_context = StorageContext.from_defaults(vector_store=vector_store)`: We then create a `StorageContext` for LlamaIndex, telling it to use our configured `ChromaVectorStore` for all storage operations.

4.  **Build or Load Index Logic**:
    *   `if chroma_collection.count() == 0:`: This condition checks if the `onboarding_rag` collection is empty. If it is (meaning no embeddings have been stored yet), we proceed to build a new index.
        *   `index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, show_progress=True)`: If the collection is empty, a new `VectorStoreIndex` is built from the `documents` (loaded in Step 5). LlamaIndex processes these documents, creates embeddings using the `Settings.embed_model` (from Step 3), and stores them in the `chroma_collection` using the `storage_context`.
        *   `print("✅ New index built and persisted.")`: Confirms that a new index was created.
    *   `else:`: If the `chroma_collection` is not empty (i.e., it already contains embeddings from a previous run).
        *   `index = VectorStoreIndex.from_vector_store(vector_store, storage_context=storage_context)`: LlamaIndex loads the existing index directly from our `ChromaVectorStore`. This saves time and computational resources as the embeddings don't need to be re-generated.
        *   `print("✅ Loaded existing index.")`: Confirms that an existing index was loaded.

This step ensures that our document content is efficiently stored and ready for retrieval during the query process.
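The build-or-load decision above follows a general caching pattern, which can be sketched without any vector database. In this toy version a JSON file stands in for the Chroma collection and `expensive_build` stands in for chunking and embedding; both names are illustrative only.

```python
import json
import tempfile
from pathlib import Path


def build_or_load(store_path, build_fn):
    """Load records from `store_path` if it has any; otherwise build and save them."""
    path = Path(store_path)
    if path.exists():
        records = json.loads(path.read_text(encoding="utf-8"))
        if records:
            return records, "loaded"
    records = build_fn()
    path.write_text(json.dumps(records), encoding="utf-8")
    return records, "built"


def expensive_build():
    # Stand-in for chunking and embedding the documents.
    return [{"id": 0, "text": "chunk one"}, {"id": 1, "text": "chunk two"}]


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "toy_store.json"
    _, first = build_or_load(store, expensive_build)
    _, second = build_or_load(store, expensive_build)
    print(first, second)
```

One consequence of checking only for emptiness is that PDFs added after the first run are not indexed, because the non-empty store is simply loaded. With the real pipeline, removing the `chroma_db` directory or calling `chroma_client.delete_collection("onboarding_rag")` before re-running should force a rebuild.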

In [None]:
# --- STEP 6: BUILD OR LOAD CHROMA VECTOR INDEX ---
print("Building Vector Index (this takes a bit of time)...")
persist_dir = "./chroma_db"

chroma_client = chromadb.PersistentClient(path=persist_dir)
chroma_collection = chroma_client.get_or_create_collection("onboarding_rag")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build index if not exists, otherwise load
if chroma_collection.count() == 0:
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True
    )
    print("✅ New index built and persisted.")
else:
    index = VectorStoreIndex.from_vector_store(
        vector_store,
        storage_context=storage_context
    )
    print("✅ Loaded existing index.")

### Step 7 - Create Query Engine (RAG Pipeline)

This step is where we create the **query engine**, which is the core component that allows us to interact with our RAG system. The query engine takes a user's question, retrieves relevant information from our indexed documents, and then uses the LLM to generate a coherent answer.

*   **`query_engine = index.as_query_engine(similarity_top_k=3)`**:
    *   **`index.as_query_engine()`**: This method on our previously built `VectorStoreIndex` (from Step 6) creates a query engine. This engine orchestrates the entire RAG process: it receives a query, converts it into an embedding, uses that embedding to search the `ChromaVectorStore` for the most similar document chunks, and then passes those chunks along with the original query to the LLM for answer generation.
    *   **`similarity_top_k=3`**: This is a crucial parameter for the retrieval part of RAG. It tells the query engine to retrieve the **top 3 most similar document chunks** (or nodes) from our vector store based on the similarity of their embeddings to the query's embedding. A higher `k` means more context is provided to the LLM, potentially leading to more comprehensive answers but also increasing processing time and token usage. A lower `k` focuses on the most relevant bits.

After this step, our RAG pipeline is fully configured and ready to answer questions interactively!
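The retrieval half of the query engine can be pictured as "score every chunk against the query, keep the best `k`". The sketch below does this with cosine similarity over hand-made three-dimensional vectors. The chunk names and numbers are invented for illustration, and the metric Chroma actually applies depends on how the collection is configured, so this shows the idea rather than the exact computation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the k chunk ids most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)
    return [(chunk_id, round(score, 3)) for score, chunk_id in scored[:k]]


# Toy "embeddings": invented values, not produced by a real model.
toy_chunks = {
    "vacation_policy": [0.9, 0.1, 0.0],
    "dental_plan": [0.1, 0.9, 0.1],
    "parking_rules": [0.0, 0.1, 0.9],
    "sick_leave": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of a time-off question

print(top_k(query_vec, toy_chunks, k=2))
```

Raising `k` admits lower-scoring chunks such as `dental_plan`, which mirrors the trade-off described for `similarity_top_k`: more context for the LLM, but also more loosely related text.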

In [None]:
# --- STEP 7: CREATE QUERY ENGINE (RAG Pipeline) ---
# Retrieve the 3 most similar chunks for each query
query_engine = index.as_query_engine(similarity_top_k=3)

print("✅ RAG Pipeline Ready!")


### Step 8 - Test the RAG Pipeline

This final step demonstrates how to use the `query_engine` we built in Step 7 to ask questions and receive answers, along with the source documents that informed the answer. This is where you can interact with your fully functional RAG system.

1.  **Define a Query**:
    *   `query = "What is the deductible for Dental and Vision?"`: We define a natural language question that we want our RAG assistant to answer. This query will be passed to the `query_engine`.

2.  **Execute the Query**:
    *   `response = query_engine.query(query)`: This line sends the `query` to the `query_engine`. The engine then performs the following actions internally:
        *   Converts the query into an embedding.
        *   Uses this embedding to retrieve the most relevant document chunks from the Chroma vector store (based on `similarity_top_k=3` set in Step 7).
        *   Passes these retrieved chunks and the original query to the loaded LLM (Phi-3 Mini from Step 4) to generate a coherent answer.

3.  **Display the Answer**:
    *   `print("ü§ñ Answer:")` and `print(response.response)`: The generated answer from the LLM is extracted from the `response` object and printed. This is the natural language answer to your question.

4.  **Display Source Documents**:
    *   `print("üìÑ Source Documents:")` and the subsequent loop: This section iterates through `response.source_nodes`. Each `node` represents a chunk of text retrieved from your documents that was used by the LLM to formulate its answer.
    *   `file_name = node.metadata.get("file_name", "Unknown file")` and `page_label = node.metadata.get("page_label", "N/A")`: For each source node, we extract metadata such as the original `file_name` and the `page_label` from where the chunk was extracted. This is crucial for transparency and allowing users to verify the information. If the metadata isn't available, default values are used.
    *   This provides a clear trail of where the information came from, enhancing the trustworthiness and interpretability of the RAG system's output.

By running this cell, you can see the RAG pipeline in action, demonstrating its ability to retrieve relevant information and synthesize it into an answer.
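The source-listing logic can be tried out without loading the model by feeding it stand-in objects that only carry a `metadata` dictionary. This is a sketch under that assumption: `format_sources` is a hypothetical helper, and the file name in the fake data is made up.

```python
from types import SimpleNamespace


def format_sources(source_nodes):
    """Return one line per source node, falling back when metadata is missing."""
    if not source_nodes:
        return ["  No source documents found."]
    lines = []
    for i, node in enumerate(source_nodes, start=1):
        file_name = node.metadata.get("file_name", "Unknown file")
        page_label = node.metadata.get("page_label", "N/A")
        lines.append(f"  {i}. File: {file_name}, Page: {page_label}")
    return lines


# Stand-ins for retrieved nodes; the second has no metadata at all.
fake_nodes = [
    SimpleNamespace(metadata={"file_name": "benefits.pdf", "page_label": "4"}),
    SimpleNamespace(metadata={}),
]
print("\n".join(format_sources(fake_nodes)))
```

The second entry shows the fallback values in action, which is what a reader would see for chunks whose loader did not record a file name or page.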

In [None]:
# --- STEP 8: TEST IT ---
query = "What is the deductible for Dental and Vision?"
# Alternative question; uncomment to try it instead.
# query = "How much time off do I get?"

print(f"\n❓ Asking: {query}\n")

response = query_engine.query(query)

print("ü§ñ Answer:")
print("--------------------------------------------------")
print(response.response)
print("--------------------------------------------------")

print("\nüìÑ Source Documents:")
print("--------------------------------------------------")
if response.source_nodes:
    for i, node in enumerate(response.source_nodes):
        file_name = node.metadata.get("file_name", "Unknown file")
        page_label = node.metadata.get("page_label", "N/A")
        print(f"  {i+1}. File: {file_name}, Page: {page_label}")
else:
    print("  No source documents found.")
print("--------------------------------------------------")

## Building a User Interface for RAG

We will use `gradio`, specifically `gradio.Interface`, to build an interface with a text input for questions and a text output that shows the answer together with its source documents.


In [None]:
print("Installing Gradio...")
!pip install -q gradio

### Define the RAG Query Function

We will need to create a Python function that takes a user query, calls the existing `query_engine`, and formats the response (answer and source documents) into a displayable string for the Gradio interface.


In [None]:
def query_rag(question):
    """
    Queries the RAG pipeline with the given question and formats the response.
    """
    response = query_engine.query(question)

    # Extract the answer
    answer = response.response

    # Format source documents
    source_docs_str = "\n\nüìÑ Source Documents:\n--------------------------------------------------"
    if response.source_nodes:
        for i, node in enumerate(response.source_nodes):
            file_name = node.metadata.get("file_name", "Unknown file")
            page_label = node.metadata.get("page_label", "N/A")
            source_docs_str += f"\n  {i+1}. File: {file_name}, Page: {page_label}"
    else:
        source_docs_str += "\n  No source documents found."
    source_docs_str += "\n--------------------------------------------------"

    # Combine answer and source documents
    formatted_output = f"ü§ñ Answer:\n--------------------------------------------------\n{answer}\n--------------------------------------------------{source_docs_str}"

    return formatted_output

print("Defined query_rag function.")

### Build the Gradio interface
This involves importing Gradio, defining the interface with the `query_rag` function, and specifying textboxes for input and output, along with a title.



In [None]:
import gradio as gr

print("Building Gradio interface...")

iface = gr.Interface(
    fn=query_rag,
    inputs=gr.Textbox(lines=2, placeholder="Type your question here..."),
    outputs=gr.Textbox(label="RAG Response", lines=10), # Increased lines for output
    title="Onboarding RAG Assistant"
)

print("✅ Gradio interface built.")

# To launch the interface in a new cell later:
# iface.launch(share=True)

### Launch the Interface
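The Gradio interface has been built; the next step is to launch it to allow interactive querying. Passing `share=True` asks Gradio to create a temporary public link, so anyone with that URL can query your documents while the cell is running.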



In [None]:
print("Launching Gradio interface...")

iface.launch(share=True)