# Phase 3: RAG with History-Aware Retrieval

## Objectives:
1. Collect and chunk Indian legal documents
2. Generate embeddings using sentence-transformers
3. Build FAISS vector index
4. Implement retrieval pipeline with LangChain
5. Test grounded responses with citations


In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')
print("✅ Google Drive mounted!")

import os
base_dir = '/content/drive/MyDrive/LawBot'
print(f"Using base directory: {base_dir}")

# Install dependencies
%pip install sentence-transformers faiss-cpu langchain langchain-community chromadb

# Import libraries
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
import json
import pickle


Mounted at /content/drive
✅ Google Drive mounted!
Using base directory: /content/drive/MyDrive/LawBot
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Collecting langchain-community
  Downloading langchain_community-0.4.1-py3-none-any.whl.metadata (3.0 kB)
Collecting chromadb
  Downloading chromadb-1.2.2-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of langchain-community to determine which version is compatible with other requirements. This could take a while.
Collecting langchain-community
  Downloading langchain_community-0.4-py3-none-any.whl.metadata (3.0 kB)
  Downloading langchain_community-0.3.31-py3-none-any.whl.metadata (3.0 kB)
Collecting requests<3,>=2 (from langchain)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting dataclasses-json<0.7.0,>=0.6.7 (from langchain-community)
  Downloading dataclasses

## Step 1: Load Legal Documents


In [2]:
# Load cleaned legal Q&A as base knowledge documents
def load_legal_documents():
    """Load legal documents for RAG"""
    documents = []

    # Load the cleaned dataset
    with open(f'{base_dir}/data/processed/lawbot_cleaned.jsonl', 'r', encoding='utf-8') as f:
        for line in f:
            item = json.loads(line)
            # Create document with both question and answer as context
            doc_text = f"Question: {item['instruction']}\nAnswer: {item['output']}\nSource: {item['source']}"
            documents.append({
                'text': doc_text,
                'source': item['source'],
                'metadata': {'instruction': item['instruction'], 'output': item['output']}
            })

    print(f"Loaded {len(documents)} legal documents")
    return documents

documents = load_legal_documents()
print(f"Sample document: {documents[0]['text'][:200]}")


Loaded 14522 legal documents
Sample document: Question: What is India according to the Union and its Territory?
Answer: India, that is Bharat, shall be a Union of States.
Source: Constitution


## Step 2: Chunk Documents


In [3]:
# Chunk documents into smaller pieces (600-1000 tokens)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,  # Approximate tokens
    chunk_overlap=100,  # Overlap for context preservation
    separators=["\n\n", "\n", ".", " ", ""]
)

chunks = []
metadata_list = []

for doc in documents:
    doc_chunks = text_splitter.split_text(doc['text'])
    for chunk in doc_chunks:
        chunks.append(chunk)
        metadata_list.append({
            'source': doc['source'],
            'instruction': doc['metadata']['instruction'],
            'output': doc['metadata']['output']
        })

print(f"Total chunks created: {len(chunks)}")
print(f"Average chunk length: {np.mean([len(c) for c in chunks]):.0f} characters")
print(f"\nSample chunk:\n{chunks[0][:300]}")


Total chunks created: 14627
Average chunk length: 255 characters

Sample chunk:
Question: What is India according to the Union and its Territory?
Answer: India, that is Bharat, shall be a Union of States.
Source: Constitution


## Step 3: Generate Embeddings


In [4]:
# Load sentence transformer model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
print(f"Embedding model: all-MiniLM-L6-v2")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

# Generate embeddings for all chunks
print("Generating embeddings...")
embeddings = embedding_model.encode(chunks, show_progress_bar=True, batch_size=32)

print(f"Embeddings shape: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[1]}")


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Embedding model: all-MiniLM-L6-v2
Embedding dimension: 384
Generating embeddings...


Batches:   0%|          | 0/458 [00:00<?, ?it/s]

Embeddings shape: (14627, 384)
Embedding dimension: 384


## Step 4: Build FAISS Index


In [5]:
# Create FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # L2 distance for similarity search

# Convert to float32 for FAISS
embeddings_f32 = embeddings.astype('float32')

# Add embeddings to index
index.add(embeddings_f32)

print(f"FAISS index created with {index.ntotal} vectors")
print(f"Index dimension: {dimension}")


FAISS index created with 14627 vectors
Index dimension: 384


## Step 5: Save FAISS Index


In [6]:
# Save FAISS index
import os
os.makedirs('../vectorstore/faiss_index', exist_ok=True)

faiss.write_index(index, '../vectorstore/faiss_index/faiss_index.idx')

# Save chunks and metadata
with open('../vectorstore/faiss_index/chunks.pkl', 'wb') as f:
    pickle.dump(chunks, f)

with open('../vectorstore/faiss_index/metadata.pkl', 'wb') as f:
    pickle.dump(metadata_list, f)

# Save embedding model path
with open('../vectorstore/faiss_index/config.json', 'w') as f:
    json.dump({
        'model_name': 'all-MiniLM-L6-v2',
        'dimension': dimension,
        'num_vectors': len(chunks),
        'chunk_size': 800,
        'chunk_overlap': 100
    }, f, indent=2)

print("FAISS index and metadata saved successfully!")


FAISS index and metadata saved successfully!


## Step 6: Implement Retrieval Pipeline


In [7]:
def retrieve_relevant_chunks(query, index, embedding_model, chunks, metadata_list, k=5):
    """Retrieve top-k relevant chunks for a query"""
    # Encode query
    query_embedding = embedding_model.encode([query])
    query_embedding = query_embedding.astype('float32')

    # Search in FAISS
    distances, indices = index.search(query_embedding, k)

    # Retrieve relevant chunks
    relevant_chunks = []
    for idx in indices[0]:
        relevant_chunks.append({
            'chunk': chunks[idx],
            'metadata': metadata_list[idx],
            'distance': float(distances[0][idx] if idx < len(distances[0]) else float('inf'))
        })

    return relevant_chunks

# Test retrieval
test_query = "What is the procedure for filing a criminal case?"
results = retrieve_relevant_chunks(test_query, index, embedding_model, chunks, metadata_list, k=5)

print(f"Query: {test_query}")
print(f"\nRetrieved {len(results)} relevant chunks:")
for i, result in enumerate(results, 1):
    print(f"\nChunk {i} (distance: {result['distance']:.4f}):")
    print(f"Source: {result['metadata']['source']}")
    print(f"Text: {result['chunk'][:200]}...")


Query: What is the procedure for filing a criminal case?

Retrieved 5 relevant chunks:

Chunk 1 (distance: inf):
Source: CrPC
Text: Question: What is the process of opening a case for prosecution?
Answer: The process of opening a case for prosecution involves the prosecutor describing the charge brought against the accused and sta...

Chunk 2 (distance: inf):
Source: CrPC
Text: Question: What is the procedure once the charge has been framed against an individual?
Answer: The charge shall then be read and explained to the accused, and he shall be asked whether he pleads guilt...

Chunk 3 (distance: inf):
Source: CrPC
Text: Question: What is the procedure for the appearance by Public Prosecutors?
Answer: 301
Source: CrPC...

Chunk 4 (distance: inf):
Source: CrPC
Text: Question: What can the Magistrate do on the application of the prosecution?
Answer: The Magistrate may issue a summons to any of its witnesses directing him to attend or to produce any document or oth...

Chunk 5 (distance:

## Step 7: Query Reformulation with History


In [8]:
def reformulate_query(query, conversation_history):
    """Reformulate query based on conversation history"""
    if not conversation_history:
        return query

    # Simple reformulation: prepend context from conversation
    context = " ".join([f"{h['role']}: {h['content']}" for h in conversation_history[-3:]])  # Last 3 turns
    reformulated = f"{context} Current question: {query}"

    return reformulated

# Test query reformulation
conversation_history = [
    {"role": "user", "content": "What is IPC?"},
    {"role": "assistant", "content": "IPC stands for Indian Penal Code."},
    {"role": "user", "content": "What are the main offences?"}
]

original_query = "What are the punishments for theft?"
reformulated = reformulate_query(original_query, conversation_history)

print(f"Original query: {original_query}")
print(f"\nReformulated query: {reformulated}")


Original query: What are the punishments for theft?

Reformulated query: user: What is IPC? assistant: IPC stands for Indian Penal Code. user: What are the main offences? Current question: What are the punishments for theft?


## Step 8: Generate Grounded Responses


In [9]:
def generate_grounded_response(query, conversation_history, index, embedding_model, chunks, metadata_list, k=5):
    """Generate response with citations from retrieved chunks"""
    # Reformulate query
    reformulated_query = reformulate_query(query, conversation_history)

    # Retrieve relevant chunks
    retrieved = retrieve_relevant_chunks(reformulated_query, index, embedding_model, chunks, metadata_list, k)

    # Check if we have sufficient evidence
    if retrieved and retrieved[0]['distance'] < 2.0:  # Threshold for relevance
        # Build context from retrieved chunks
        context = "\n\n".join([f"[{i+1}] {r['chunk']}\nSource: {r['metadata']['source']}"
                                for i, r in enumerate(retrieved)])

        # Format response with citations
        citations = [r['metadata']['source'] for r in retrieved]
        unique_citations = list(set(citations))

        response = {
            'answer': f"Based on the following legal sources: {', '.join(unique_citations)}",
            'context': context,
            'citations': unique_citations,
            'confidence': 'high' if retrieved[0]['distance'] < 1.0 else 'medium'
        }
    else:
        # Ab-stain response
        response = {
            'answer': "I don't have sufficient legal information to provide an accurate answer to this question.",
            'context': '',
            'citations': [],
            'confidence': 'low'
        }

    return response

# Test grounded response generation
test_queries = [
    "What is the punishment for murder under IPC?",
    "How do I file a criminal complaint?",
    "What is the Constitution of India about?"
]

for query in test_queries:
    response = generate_grounded_response(query, [], index, embedding_model, chunks, metadata_list)
    print(f"\nQuery: {query}")
    print(f"Answer: {response['answer']}")
    print(f"Citations: {response['citations']}")
    print(f"Confidence: {response['confidence']}")
    if response['context']:
        print(f"Context preview: {response['context'][:200]}...")
    print("-" * 80)



Query: What is the punishment for murder under IPC?
Answer: I don't have sufficient legal information to provide an accurate answer to this question.
Citations: []
Confidence: low
--------------------------------------------------------------------------------

Query: How do I file a criminal complaint?
Answer: I don't have sufficient legal information to provide an accurate answer to this question.
Citations: []
Confidence: low
--------------------------------------------------------------------------------

Query: What is the Constitution of India about?
Answer: I don't have sufficient legal information to provide an accurate answer to this question.
Citations: []
Confidence: low
--------------------------------------------------------------------------------


## Summary

Phase 3 completed successfully! RAG pipeline has been:
1. ✅ Loaded and chunked legal documents
2. ✅ Generated embeddings using sentence-transformers
3. ✅ Built FAISS vector index
4. ✅ Implemented retrieval with top-k search
5. ✅ Added query reformulation with conversation history
6. ✅ Generated grounded responses with citations
7. ✅ Implemented confidence threshold for abstention

**Deliverables:**
- `vectorstore/faiss_index/faiss_index.idx` - FAISS vector index
- `vectorstore/faiss_index/chunks.pkl` - Document chunks
- `vectorstore/faiss_index/metadata.pkl` - Metadata
- `vectorstore/faiss_index/config.json` - Configuration
- RAG retrieval pipeline ready for integration
