In [13]:
!pip install -q langchain_community langchain langchain-google-genai faiss-cpu pypdf python-dotenv pandas scikit-learn sentence-transformers mlflow


## Problem Statement

“How do I build an AI assistant that can reliably answer questions over my own private documents (PDFs, manuals, reports, etc.) without exposing that content publicly?”

- Pain Point: Traditional keyword search or “copy‑paste into ChatGPT” is brittle and unindexed, often leaks context, and doesn’t trace answers back to your sources.
- GenAI Solution: Combine an embedding‑based vector store (for retrieval) with a generative LLM (for fluent answers). This lets you:
    1. Index your docs into a FAISS vector database via embeddings
    2. Retrieve the top‑k most relevant passages at query time
    3. Generate a concise, grounded answer and return the source chunks it was based on
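
To make the retrieve step concrete, here is a minimal, dependency-free sketch of top-k similarity search. The `embed` function is a toy bag-of-words stand-in (an assumption for illustration, not the Gemini embedding model), and the passages are made-up examples. In the notebook, the FAISS index plays the same role by ranking chunks by vector distance.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, passages, k=2):
    """Return the k passages most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(query_vec, embed(p)), reverse=True)
    return ranked[:k]

# Hypothetical passages, written only for this example.
passages = [
    "a treaty may be terminated by consent of all parties",
    "no one shall be subjected to torture",
    "withdrawal from a treaty requires twelve months notice",
]
hits = top_k("how can a treaty be terminated", passages, k=2)
for hit in hits:
    print(hit)
```

A real embedding model captures meaning rather than exact word overlap, so it can also match paraphrases that share no words with the query.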

## How This Notebook Solves It (Step‑by‑Step)

1. Environment & MLflow Setup
- Installs/initializes all required libraries (LangChain, Gemini SDK, FAISS, MLflow).
- Starts an MLflow experiment to track chunk‑size, overlap, query results, and future feedback.

2. Document Ingestion & Chunking
- Loads PDFs/TXT/DOCX/CSV with LangChain’s loaders.
- Splits them into overlapping “chunks” of about 1,000 characters (200‑character overlap by default) for granular retrieval.

3. Embedding & Vector Store
- Uses GoogleGenerativeAIEmbeddings to turn each chunk into a vector.
- Builds a FAISS index for sub‑second similarity search.

4. Retrieval‑QA Chain Construction
- Wires a RetrievalQA chain:
    - Retriever: FAISS top‑k lookup
    - LLM: Gemini chat model with a prompt that tells it to use only the retrieved context
- Steers the model to ground its answer in the retrieved snippets and to say so when the context is insufficient.

5. Interactive Q&A Loop
- Drops you into a REPL: type any question (or exit) and get back:
    - A generative answer
    - The exact source snippets that informed the answer
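
As a rough illustration of step 4, the sketch below shows what "stuffing" retrieved chunks into a prompt amounts to: the chunks are concatenated into one context block and substituted into a template. The template text, the `build_stuffed_prompt` helper, the separator, and the sample chunks are all assumptions made for this example; the actual chain handles this internally.

```python
PROMPT_TEMPLATE = (
    "Use only the following context to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)

def build_stuffed_prompt(question, retrieved_chunks, separator="\n\n"):
    """Join the retrieved chunks into one context block and fill the template."""
    context = separator.join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Placeholder chunks standing in for what a retriever would return.
chunks = [
    "Termination may take place in conformity with the treaty's provisions.",
    "Withdrawal requires twelve months' notice.",
]
prompt = build_stuffed_prompt("When can a treaty be terminated?", chunks)
print(prompt)
```

Because every retrieved chunk is pasted into a single prompt, the number of chunks (`k`) and the chunk size together determine how much of the model's context window each question consumes.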

In [15]:
# Import required packages

import os
import tempfile
import pandas as pd
import numpy as np
import uuid
import mlflow
from datetime import datetime
from typing import List, Dict, Any

# Document processing
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import (
    PyPDFLoader, 
    TextLoader, 
    CSVLoader,
    UnstructuredWordDocumentLoader
)

# Gemini and embeddings
import google.generativeai as genai
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from IPython.display import display, Markdown

### 1. Initialize MLflow & Define Core RAG Utilities

In this section:

- **Set up MLflow** for experiment and run tracking (`mlflow.set_experiment`).
- **Load documents** of various formats (`.pdf`, `.txt`, `.docx`, `.csv`) into LangChain.
- **Chunk** each document into overlapping text segments for better retrieval.
- **Embed** these chunks with Gemini embeddings and build a FAISS vector store.
- **Instantiate** a RetrievalQA chain using the Gemini LLM, wired to our vector store, so we can ask context‑grounded questions.
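
The effect of chunk size and overlap can be seen with a simplified, fixed-window splitter. This is only a sketch under stated assumptions: `sliding_chunks` is a hypothetical helper that cuts at exact character offsets, whereas `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line boundaries, so its chunks will differ.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

# Synthetic 2,600-character text made of repeating digits.
text = "".join(str(i % 10) for i in range(2600))
chunks = sliding_chunks(text, chunk_size=1000, chunk_overlap=200)

print(len(chunks))                          # number of windows
print(chunks[0][-200:] == chunks[1][:200])  # neighbouring windows share 200 characters
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of indexing some text twice.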


In [16]:
# Set up MLflow
mlflow.set_experiment("document-qa-gemini")

# Document processing functions
def load_document(file_path):
    """Load document based on file extension"""
    _, file_extension = os.path.splitext(file_path)
    
    if file_extension.lower() == '.pdf':
        loader = PyPDFLoader(file_path)
    elif file_extension.lower() == '.txt':
        loader = TextLoader(file_path)
    elif file_extension.lower() == '.docx':
        loader = UnstructuredWordDocumentLoader(file_path)
    elif file_extension.lower() == '.csv':
        loader = CSVLoader(file_path)
    else:
        raise ValueError(f"Unsupported file format: {file_extension}")
    
    return loader.load()

def chunk_documents(documents, chunk_size=1000, chunk_overlap=200):
    """Split documents into chunks"""
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len
    )
    chunks = text_splitter.split_documents(documents)
    return chunks

def create_vector_store(chunks, api_key):
    """Create a vector store from document chunks"""
    os.environ["GOOGLE_API_KEY"] = api_key
    embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
    vector_store = FAISS.from_documents(chunks, embeddings)
    return vector_store

def get_qa_chain(vector_store, api_key, model_name="gemini-1.5-pro"):
    """Create a question-answering chain with Gemini model"""
    os.environ["GOOGLE_API_KEY"] = api_key
    
    llm = ChatGoogleGenerativeAI(model=model_name, temperature=0.2)
    
    template = """
    You are a helpful AI assistant trained to answer questions based on provided context.
    Use only the following context to answer the question. If you don't know the answer based on 
    the context, say "I don't have enough information to answer this question" - don't make up information.
    
    Context:
    {context}
    
    Question: {question}
    
    Answer:
    """
    
    QA_CHAIN_PROMPT = PromptTemplate(
        input_variables=["context", "question"],
        template=template
    )
    
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_store.as_retriever(search_kwargs={"k": 4}),
        chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
        return_source_documents=True
    )
    
    return qa_chain

### 2. Define the DocumentQASystem Class

Encapsulate the entire RAG workflow into a single, reusable class:

- **`__init__`**  
  - Configures the Gemini API key and generates a session ID used to tag MLflow runs  
  - Initializes internal state (vector store, QA chain, session ID, chunk settings, history)

- **`process_documents(file_paths, chunk_size, chunk_overlap)`**  
  1. Loads each file (PDF, TXT, DOCX, CSV)  
  2. Splits text into overlapping chunks for retrieval  
  3. Creates embeddings & builds a FAISS vector store  
  4. Instantiates a RetrievalQA chain with Gemini  
  5. Logs processing parameters (chunk size, file count, etc.) to MLflow  
  6. Returns the total number of chunks indexed  

- **`ask_question(question, track_metrics=True)`**  
  1. Records the user question in `conversation_history`  
  2. Retrieves the top‐k chunks and generates an answer via the QA chain  
  3. Formats and saves source snippets  
  4. Appends the assistant’s response to history  
  5. Optionally logs the question, answer, and source metadata to MLflow  
  6. Returns a dict containing `answer` and `sources`

Finally, create an instance of `DocumentQASystem`, point it at the law‑book PDF, and kick off document processing.  
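
The source formatting done inside `ask_question` can be tried in isolation. The sketch below mirrors that logic in a standalone function; `format_sources` and the sample strings are assumptions for illustration, and it takes plain strings rather than LangChain document objects.

```python
def format_sources(page_contents, max_chars=200):
    """Label each passage and truncate long ones to max_chars plus an ellipsis."""
    sources = []
    for i, text in enumerate(page_contents, start=1):
        snippet = text[:max_chars] + "..." if len(text) > max_chars else text
        sources.append(f"Source {i}: {snippet}")
    return sources

short_text = "A short passage."
long_text = "x" * 250  # longer than the 200-character limit
formatted = format_sources([short_text, long_text])
for line in formatted:
    print(line)
```

Only the displayed snippet is truncated here; in the class, the full chunk text is what gets logged to MLflow in `sources.json`.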


In [17]:

class DocumentQASystem:
    """Main class to handle document Q&A with evaluation tracking"""
    
    def __init__(self, api_key):
        self.api_key = api_key
        self.vector_store = None
        self.qa_chain = None
        self.session_id = str(uuid.uuid4())
        self.chunk_size = 1000
        self.chunk_overlap = 200
        self.conversation_history = []
        os.environ["GOOGLE_API_KEY"] = api_key
        genai.configure(api_key=api_key)
    
    def process_documents(self, file_paths, chunk_size=1000, chunk_overlap=200):
        """Process multiple documents and create vector store"""
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        
        all_chunks = []
        for file_path in file_paths:
            documents = load_document(file_path)
            chunks = chunk_documents(documents, chunk_size, chunk_overlap)
            all_chunks.extend(chunks)
        
        self.vector_store = create_vector_store(all_chunks, self.api_key)
        self.qa_chain = get_qa_chain(self.vector_store, self.api_key)
        
        # Log document processing in MLflow
        with mlflow.start_run():
            mlflow.log_param("chunk_size", chunk_size)
            mlflow.log_param("chunk_overlap", chunk_overlap)
            mlflow.log_param("document_count", len(file_paths))
            mlflow.log_param("chunk_count", len(all_chunks))
            mlflow.log_param("session_id", self.session_id)
        
        return len(all_chunks)
    
    def ask_question(self, question, track_metrics=True):
        """Ask a question about the loaded documents"""
        if not self.vector_store or not self.qa_chain:
            raise ValueError("Documents haven't been processed yet. Call process_documents first.")
        
        # Add question to conversation history
        self.conversation_history.append({
            "role": "user",
            "content": question,
            "timestamp": datetime.now().isoformat()
        })
        
        # Get answer from QA chain
        response = self.qa_chain.invoke({"query": question})
        answer = response["result"]
        source_docs = response["source_documents"]
        
        # Format sources
        sources = []
        for i, doc in enumerate(source_docs):
            source_text = doc.page_content[:200] + "..." if len(doc.page_content) > 200 else doc.page_content
            sources.append(f"Source {i+1}: {source_text}")
        
        # Add response to conversation history
        self.conversation_history.append({
            "role": "assistant",
            "content": answer,
            "sources": sources,
            "timestamp": datetime.now().isoformat()
        })
        
        # Track metrics in MLflow
        if track_metrics:
            with mlflow.start_run():
                mlflow.log_param("question", question)
                mlflow.log_param("session_id", self.session_id)
                mlflow.log_text(answer, "answer.txt")
                mlflow.log_dict({"sources": [doc.page_content for doc in source_docs]}, "sources.json")
        
        return {
            "answer": answer,
            "sources": sources
        }
    
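# Assumption: the Gemini API key is provided via the GOOGLE_API_KEY environment
# variable; api_key is not defined anywhere in the cells shown here.
api_key = os.environ["GOOGLE_API_KEY"]
qa_system = DocumentQASystem(api_key)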

law_book_path = "/kaggle/input/international-law-handbook/book_1.pdf"
sample_files = [law_book_path]

# Process documents
chunk_count = qa_system.process_documents(sample_files)
print(f"Processed {chunk_count} chunks from {len(sample_files)} documents")

Processed 2833 chunks from 1 documents


### 3. Start the Interactive Q&A Session

Enter a simple REPL loop that:

1. **Prompts** user to type a question about the indexed documents (or `exit` to quit)  
2. **Invokes** `qa_system.ask_question(...)` to retrieve an answer and source snippets  
3. **Prints** the model’s answer under “----- Answer -----”  
4. **Lists** each retrieved source under “----- Sources -----” for full transparency  

This lets you explore the law book in real time: type a question and get a grounded response along with the snippets it was drawn from.  
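
One way to make such a loop testable is to pass the question-answering function and the I/O functions in as parameters. The sketch below is an assumption-laden variant, not the notebook's loop: `run_repl`, `fake_ask`, and the scripted inputs are hypothetical names used only to exercise the control flow without a model or a keyboard.

```python
def run_repl(ask, read=input, write=print):
    """Loop until the user types 'exit'; `ask` maps a question to a result dict."""
    answered = 0
    while True:
        question = read("\nAsk a question (or type 'exit' to quit): ")
        if question.strip().lower() == "exit":
            break
        result = ask(question)
        write("----- Answer -----")
        write(result["answer"])
        write("----- Sources -----")
        for source in result["sources"]:
            write(source)
        answered += 1
    return answered

# Scripted stand-ins: two inputs, the second of which ends the loop.
scripted = iter(["What is a treaty?", "exit"])

def fake_ask(question):
    """Stub that returns the same dict shape as ask_question."""
    return {"answer": f"(stub answer to: {question})", "sources": ["Source 1: stub"]}

printed = []
count = run_repl(fake_ask, read=lambda prompt: next(scripted), write=printed.append)
print(count, printed)
```

In the notebook itself, the equivalent call would be `run_repl(qa_system.ask_question)`, which falls back to the built-in `input` and `print`.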


In [19]:
# Interactive Q&A loop

while True:
    question = input("\nAsk a question about law (or type 'exit' to quit): ")
    if question.strip().lower() == 'exit':
        break
        
    result = qa_system.ask_question(question)
    print("\n your Question is: ",question)
    print("\n--------------------------- Answer ---------------------------------")
    print(result["answer"])
    print("\n----- Sources -----")
    for source in result["sources"]:
        print(source)
        print("---")
    
   


Ask a question about law (or type 'exit' to quit):  Under what conditions can a treaty be terminated?



 your Question is:  Under what conditions can a treaty be terminated?

--------------------------- Answer ---------------------------------
A treaty can be terminated or a party can withdraw in the following ways:

1. In conformity with the treaty's provisions.
2. By consent of all parties after consultation with the contracting States and contracting organizations.
3. If it conflicts with a new peremptory norm of general international law.
4. Due to a fundamental change of circumstances.
5. If the parties intended to admit the possibility of denunciation or withdrawal, or if such a right can be implied by the nature of the treaty, provided twelve months' notice is given.

----- Sources -----
Source 1: States and international organizations: treaties  87
Article 56. Denunciation of or withdrawal from a treaty containing 
no provision regard ing termination, denunciation or withdrawal
1. A treaty whi...
---
Source 2: States and international organizations: treaties  89
4. If, under the


Ask a question about law (or type 'exit' to quit):  what is python



 your Question is:  what is python

--------------------------- Answer ---------------------------------
I don't have enough information to answer this question

----- Sources -----
Source 1: 1. For the purposes of this Convention, the term “torture” means any act by which severe pain 
or suffering, whether physical or mental, is intentionally inflicted on a person for such purposes as 
ob...
---
Source 2: tive and alternative modes, means and formats of communication, including accessible information 
and communication technology;
“Language” includes spoken and signed languages and other forms of non-s...
---
Source 3: governed by international law, whether embodied in a single instrument or in two or more related 
instruments and whatever its particular designation;
(b) “ratification,” “acceptance,” “approval” and ...
---
Source 4: offence was committed.
article 12
No one shall be subjected to arbitrary interference with his privacy, family, home or corre -
spondence, nor to attacks


Ask a question about law (or type 'exit' to quit):  exit
