## 📌 Objective
Create a simple system that can:
1. Accept PDF documents from users
2. Understand questions about the document
3. Provide accurate answers using AI
4. Automatically improve its answers by refining questions

## ❓ Problem Statement
Traditional document systems struggle with:
- Handling large documents
- Understanding natural language questions
- Providing precise answers from long texts
- Improving answers automatically

## 💡 Solution
Our system solves these problems using:
1. **Document Processing**: Breaks files into manageable chunks
2. **Smart Storage**: Stores content for quick searching
3. **AI Assistance**: Uses language models to understand questions
4. **Self-Improvement**: Automatically refines questions for better answers

## 🤖 What is Agentic RAG?
**RAG (Retrieval-Augmented Generation)**:
- Combines document search with AI answers
- "Looks up" information before answering

**Agentic RAG** adds:
- Ability to automatically improve questions
- Self-correcting answers
- Better understanding through context

*Example:*
If you ask "What is the name of the customer?", the system rewrites it as "What is the name of the customer referenced in the Master Agreement and Service Order?" for better results.
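
The refine-and-retry idea can be sketched without any external services. The snippet below is a toy illustration only, not this notebook's implementation: `DOCUMENT`, `toy_answer`, `refine_query`, and `agentic_answer` are hypothetical names, and the "document" is a single made-up sentence.

```python
from typing import Optional, Tuple

# Hypothetical stand-in for an indexed document: topic -> sentence.
DOCUMENT = {
    "master agreement": "The customer named in the Master Agreement is Acme Corp.",
}

def toy_answer(question: str) -> Optional[str]:
    """Return a stored sentence if the question mentions a known topic."""
    q = question.lower()
    for topic, sentence in DOCUMENT.items():
        if topic in q:
            return sentence
    return None

def refine_query(question: str) -> str:
    """Add document-specific context to a vague question."""
    return question.rstrip("?") + " referenced in the Master Agreement?"

def agentic_answer(question: str, max_rounds: int = 2) -> Tuple[str, Optional[str]]:
    """Retry with a refined question until an answer is found or rounds run out."""
    current = question
    for _ in range(max_rounds):
        answer = toy_answer(current)
        if answer is not None:
            return current, answer
        current = refine_query(current)
    return current, None

final_question, answer = agentic_answer("What is the name of the customer?")
print(final_question)
print(answer)
```

The first attempt finds nothing because the vague question never mentions the agreement; the refined question does, so the second attempt succeeds.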

---

In [1]:
# Step 1: Install required libraries
!pip install chromadb langchain pypdf2 sentence-transformers pyboxen


## 📦 **Import Statements & Their Purpose**
```python
import ipywidgets as widgets  # For creating interactive UI elements
from IPython.display import display, clear_output  # To control UI output
from langchain.vectorstores import Chroma  # For vector-based search
from langchain.embeddings import HuggingFaceEmbeddings  # Convert text into AI-readable format
from langchain.text_splitter import RecursiveCharacterTextSplitter  # Break text into small pieces
from langchain.llms import OpenAI  # AI-powered language model
from langchain.prompts import PromptTemplate  # Helps structure prompts
from PyPDF2 import PdfReader  # Read and extract text from PDF files
import chromadb  # The ChromaDB client for database interaction
from pyboxen import boxen  # Stylish text boxes for output
import os  # Access environment variables (e.g., API keys)
```

### 🌐 **About LangChain (In Simple Terms)**
LangChain is a framework that helps developers **connect AI models with external data sources** like databases or APIs. In this project, we use LangChain to:
- Embed text for efficient search.
- Retrieve relevant information from the database.
- Generate answers using OpenAI’s model.


In [2]:
!pip install langchain-community


In [3]:
# Step 2: Import necessary modules
import ipywidgets as widgets
from IPython.display import display, clear_output
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from PyPDF2 import PdfReader
import chromadb
from pyboxen import boxen
import os

### 📌 **Step 3: Initializing Components**
```python
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
chroma_client = chromadb.PersistentClient(path="./chroma_db")
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
llm = OpenAI(temperature=0)
```
- **Embeddings:** We use `HuggingFaceEmbeddings` to convert text into numerical vectors.
- **ChromaDB Client:** Creates a persistent database at `./chroma_db` to store text embeddings.
- **OpenAI API:** We set the API key to access OpenAI’s language model. See [Generate Your OpenAI API Key](https://github.com/initmahesh/MLAI-community-labs/tree/main/Class-Labs/Lab-0(Pre-requisites)).
- **LLM Initialization:** We set the temperature to `0` for more deterministic responses.
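
To see why numerical vectors make text searchable, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three vectors are hypothetical, hand-written values chosen for illustration; real `all-MiniLM-L6-v2` embeddings have 384 dimensions and come from the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" for three phrases.
customer_name = [0.9, 0.1, 0.0]
client_name = [0.8, 0.2, 0.1]
payment_terms = [0.0, 0.2, 0.9]

related = cosine_similarity(customer_name, client_name)
unrelated = cosine_similarity(customer_name, payment_terms)
print(related > unrelated)
```

Phrases with similar meaning get vectors that point in similar directions, so their cosine similarity is higher.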

In [20]:
# Step 3: Initialize components
# Initialize embedding model
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Initialize ChromaDB client
chroma_client = chromadb.PersistentClient(path="./chroma_db")

# Initialize LLM (replace the placeholder with your own OpenAI API key).
# Leaving the key empty makes OpenAI() raise a pydantic ValidationError:
# "Did not find openai_api_key"
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
llm = OpenAI(temperature=0)

### 📌 **Step 4: Uploading and Processing the File**

We create a file upload widget so users can upload a PDF. Processing the uploaded file:

- Reads the PDF and extracts its text.
- Splits the text into smaller **chunks**.
- Stores these chunks in **ChromaDB**.

[📄 **Reference Document You Can Use**](https://drive.google.com/file/d/1WWa_TgI49HIAGFuXTNvMLtkFBU6ZduHq/view?usp=sharing)

- When you run the cell below, an **upload button** appears. Click it, upload your document (you can use the reference document provided), and then click **Process File**.
- A confirmation message appears: **"File processed and stored in ChromaDB!"**
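
The chunking step can be illustrated with a simplified sliding window. This is a sketch under stated assumptions, not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper that only shows what `chunk_size` and `chunk_overlap` mean.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]

sample = "".join(str(i % 10) for i in range(2500))  # 2,500 characters of dummy text
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

The overlap means the end of one chunk is repeated at the start of the next, which reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.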


In [5]:
# Step 4: File upload widget
from io import BytesIO
uploader = widgets.FileUpload(accept='.pdf', multiple=False)
display(uploader)

process_btn = widgets.Button(description="Process File")
process_output = widgets.Output()

def process_file(b):
    with process_output:
        clear_output()
        if not uploader.value:
            print("No file uploaded!")
            return

        # uploader.value is a dict keyed by filename in ipywidgets 7 and a
        # tuple of dicts in ipywidgets 8, so handle both shapes.
        # multiple=False guarantees there is only one file.
        files = uploader.value
        file_info = next(iter(files.values())) if isinstance(files, dict) else files[0]
        pdf = PdfReader(BytesIO(bytes(file_info['content'])))  # Read using BytesIO

        # extract_text() can yield nothing for image-only pages, so fall back to ""
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)

        # Split text into chunks
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000, chunk_overlap=100
        )
        chunks = text_splitter.split_text(text)

        # Create Chroma collection
        Chroma.from_texts(
            chunks, embeddings,
            client=chroma_client,
            collection_name="doc_collection"
        )
        print(boxen("File processed and stored in ChromaDB!", title="Success", color="green"))

display(process_btn, process_output)
process_btn.on_click(process_file)


### 📌 **Step 5: Query Processing Functions**

#### **Retrieve Relevant Chunks**
```python
def retrieve_chunks(query, collection_name="doc_collection"):
    collection = chroma_client.get_collection(collection_name)
    results = collection.query(query_texts=[query], n_results=3)
    return results['documents'][0]
```
- Searches for the most relevant text chunks **based on the user’s query**.
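
Conceptually, retrieval ranks every stored chunk by its similarity to the query and keeps the best few. The sketch below imitates that with a toy word-count "embedding" in place of a neural model; `embed`, `similarity`, and `retrieve_top_k` are hypothetical names, and ChromaDB's actual index is more sophisticated than this linear scan.

```python
from collections import Counter
import math

def embed(text):
    """Toy embedding: lowercase word counts (a real system uses a neural model)."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_top_k(query, chunks, n_results=3):
    """Return the n_results chunks most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: similarity(query_vec, embed(c)), reverse=True)
    return ranked[:n_results]

chunks = [
    "the customer name appears on the service order",
    "payment is due within thirty days",
    "the agreement names the customer and the provider",
    "either party may terminate with written notice",
]
top = retrieve_top_k("customer name", chunks, n_results=2)
print(top)
```

Note that the word-count version misses "names" when the query says "name"; neural embeddings are used precisely because they capture such near-matches.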

#### **Generate AI-Powered Answers**
```python
def generate_answer(query, chunks):
    context = "\n\n".join(chunks)
    prompt = f"Answer this query: {query}\nUsing this context:\n{context}"
    return llm.invoke(prompt)  # invoke() replaces the deprecated llm(prompt) call style
```
- Uses OpenAI’s language model to generate an answer based on **retrieved chunks**.

- Query rewriting for improved results is handled separately by the knowledge graph component in Step 5.5.
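
How the prompt is assembled affects answer quality. As one possible variation (an assumption, not the notebook's method), the hypothetical `build_prompt` helper below numbers each chunk and tells the model to stay within the supplied context, which can make answers easier to trace back to their source.

```python
def build_prompt(query, chunks):
    """Assemble a prompt that labels each chunk and restricts the answer to them."""
    if not chunks:
        return (
            f"Answer this query: {query}\n"
            "No context was retrieved; say that the answer is unknown."
        )
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the query using only the context below. "
        "If the context does not contain the answer, say so.\n"
        f"Query: {query}\n"
        f"Context:\n{context}"
    )

prompt = build_prompt(
    "What is the name of the customer?",
    ["  The Master Agreement is between the provider and the customer. ",
     "The Service Order lists the billing contact."],
)
print(prompt)
```

The resulting string could be passed to the language model in place of the simpler f-string prompt.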

---

In [None]:
# Step 5: Query processing functions
def retrieve_chunks(query, collection_name="doc_collection"):
    collection = chroma_client.get_collection(collection_name)
    results = collection.query(query_texts=[query], n_results=3)
    return results['documents'][0]

def generate_answer(query, chunks):
    context = "\n\n".join(chunks)
    prompt = f"Answer this query: {query}\nUsing this context:\n{context}"
    # invoke() replaces the deprecated llm(prompt) call style
    return llm.invoke(prompt)

### 📌 **Step 5.5: Add Knowledge Graph Component**

#### 🧠 **Purpose**
The **Knowledge Graph Component** helps refine queries by mapping key terms to structured business relationships, so that queries use the domain-specific terminology found in the document and retrieve more accurate responses.

#### 🛠️ **Implementation**
The function below is a simplified stand-in: it returns a predefined question about extracting the company name from an agreement.

```python
def knowledge_graph():
    return "What is the name of the Company in the Agreement?"
```
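
A step beyond a fixed question would be to map everyday wording onto the wording the contract uses. The sketch below is one hypothetical way to do that with a plain dictionary; `BUSINESS_TERMS` and `rewrite_with_terms` are illustrative names and entries, not part of this notebook, and a real knowledge graph would hold richer relationships than a flat lookup.

```python
import re

# Hypothetical term map: everyday wording -> the wording the contract uses.
BUSINESS_TERMS = {
    "customer": "Company in the Agreement",
    "bill": "invoice under the Service Order",
}

def rewrite_with_terms(query, terms=BUSINESS_TERMS):
    """Replace whole-word matches of known everyday terms, ignoring case."""
    for plain, formal in terms.items():
        query = re.sub(rf"\b{re.escape(plain)}\b", formal, query, flags=re.IGNORECASE)
    return query

rewritten = rewrite_with_terms("What is the name of the customer?")
print(rewritten)
```

For this input the rewritten query matches the predefined question returned by `knowledge_graph()`, while other questions would be rewritten according to whichever terms they contain.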

In [1]:
# Step 5.5: Add Knowledge Graph Component
def knowledge_graph():
    return "What is the name of the Company in the Agreement?"

### 📌 **Step 6: Query Interface**

- Creates a simple **text input** for user queries.
- Displays responses interactively.
- Retrieves an answer **before and after query optimization**.
- Displays both responses so you can **compare them**.

When you run this cell, an input field appears. Enter your query (e.g., "What is the name of the customer?"), click **Submit**, and wait a few seconds. You will see:
- Your original query and the response it produced.
- The refined query from the knowledge graph step and the more accurate response it produced.

The chunks retrieved for each query are held in `original_chunks` and `kg_chunks` inside `handle_query`, although the cell does not print them.
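
If you want to see exactly how two responses differ rather than reading them side by side, a word-level comparison is one option. This is an optional sketch using Python's standard `difflib`; `changed_words` is a hypothetical helper and the two sample answers are made up for illustration.

```python
import difflib

def changed_words(before, after):
    """Return (removed, added) word lists between two answers."""
    a, b = before.split(), after.split()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    removed, added = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            removed.extend(a[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(b[j1:j2])
    return removed, added

# Made-up answers standing in for the "before" and "after" responses.
removed, added = changed_words(
    "The customer is not stated.",
    "The customer is Acme Corp.",
)
print("removed:", removed)
print("added:", added)
```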

---

In [2]:
# Step 6: Query interface and knowledge graph transformation
query_input = widgets.Text(placeholder="Enter your query")
submit_btn = widgets.Button(description="Submit")
query_output = widgets.Output()

# Answer the query as typed, then again using the knowledge graph question
def handle_query(b):
    with query_output:
        clear_output()
        query = query_input.value

        # Initial processing
        original_chunks = retrieve_chunks(query)
        original_answer = generate_answer(query, original_chunks)


        # Knowledge Graph enhancement
        kg_query = knowledge_graph()
        kg_chunks = retrieve_chunks(kg_query)
        kg_answer = generate_answer(kg_query, kg_chunks)

        # Display results
        print(boxen(
            f"ORIGINAL QUESTION: {query}\nANSWER: {original_answer}",
            title="First Try", color="red"
        ))


        print(boxen(
            f"BUSINESS TERMS QUESTION: {kg_query}\nANSWER: {kg_answer}",
            title="Professional Version", color="green"
        ))

display(query_input, submit_btn, query_output)
submit_btn.on_click(handle_query)
