# AI-Powered Legal Research System
## Hackathon Project: Legal RAG (Retrieval-Augmented Generation)

This notebook implements a legal research engine using:
- **PDF Processing**: Extracts text from legal documents
- **Vector Embeddings**: Creates semantic search capabilities
- **FAISS**: Fast similarity search for retrieving relevant legal passages
- **RAG Pipeline**: Generates answers grounded in the retrieved legal passages
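
The retrieval step at the heart of this pipeline can be illustrated without any of the libraries above. The sketch below is a toy stand-in, not the notebook's implementation: it replaces the sentence-transformer embeddings with simple word-count vectors and FAISS with a brute-force L2 scan. The names `embed`, `l2_distance`, `retrieve`, and the sample passages are all hypothetical.

```python
import math
from collections import Counter

def embed(text, vocab):
    """Toy embedding: one word-count dimension per vocabulary term."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]

def l2_distance(a, b):
    """Euclidean distance, the metric a flat L2 index ranks by."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query, passages, top_k=2):
    """Return the top_k passages closest to the query (brute-force scan)."""
    vocab = sorted({w for p in passages + [query] for w in p.lower().split()})
    q_vec = embed(query, vocab)
    ranked = sorted(passages, key=lambda p: l2_distance(embed(p, vocab), q_vec))
    return ranked[:top_k]

# Hypothetical sample passages, for illustration only
passages = [
    "article 15 prohibits discrimination on grounds of religion race caste sex",
    "the court may grant bail pending trial",
    "data protection and privacy are challenges for ai in courts",
]
top = retrieve("privacy challenges for ai", passages, top_k=1)
print(top[0])
```

In the notebook itself, the retrieved passages are then pasted into a prompt and handed to the generator model.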

---
### Setup Instructions
1. Run the installation cell below
2. Run the main RAG system cell
3. Test with legal queries

In [28]:
# Install Required Packages
# Run this cell ONCE at the beginning

%pip install -q sentence-transformers faiss-cpu transformers PyPDF2 torch

print("All packages installed successfully!")

Note: you may need to restart the kernel to use updated packages.
All packages installed successfully!
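
The install cell prints its success message whether or not the installation worked, so an explicit check can be more informative. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the module names listed (for example `faiss` for the `faiss-cpu` package) are the ones this notebook imports later.

```python
import importlib.util

def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

required = ["sentence_transformers", "faiss", "transformers", "PyPDF2", "torch"]
missing = missing_packages(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```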


In [29]:
# LOCAL WINDOWS VERSION - Legal AI RAG System

# 0) Installs (run once)
%pip install -q sentence-transformers faiss-cpu transformers PyPDF2

# 1) Imports
import os, glob, io, pickle
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
from transformers import pipeline
import torch

# 2) Detect PDFs - ADAPTED FOR WINDOWS
# Use the script's directory when run as a file; in a notebook __file__ is undefined, so fall back to the working directory
current_dir = os.path.dirname(os.path.abspath(__file__)) if '__file__' in globals() else os.getcwd()
docs_folder = os.path.join(current_dir, "Docs")

# Fallback: if ./Docs does not exist, use a machine-specific absolute path (edit this for your setup)
if not os.path.exists(docs_folder):
    docs_folder = r"c:\Users\Sanjay\Desktop\SRM VDP HACKATHON\Docs"

print(f"Searching for PDFs in: {docs_folder}")

pdf_paths = []
if os.path.exists(docs_folder):
    for ext in ("*.pdf", "*.PDF"):
        pdf_paths += glob.glob(os.path.join(docs_folder, ext))
pdf_paths = sorted(set(pdf_paths))

print(f"Found {len(pdf_paths)} PDFs:")
for p in pdf_paths:
    print(f"  - {os.path.basename(p)}")

if not pdf_paths:
    raise SystemExit(f"No PDFs found in {docs_folder}. Please check the path.")

# 3) Extract text from PDFs
def extract_pdf_text(path):
    text = ""
    try:
        reader = PdfReader(path)
        for p in reader.pages:
            page_text = p.extract_text()
            if page_text:
                text += page_text + "\n"
    except Exception as e:
        print(f"Error reading {path}: {e}")
    return text

raw_texts = {}
for p in pdf_paths:
    txt = extract_pdf_text(p)
    raw_texts[p] = txt
    print(f"Extracted {len(txt)} chars from {os.path.basename(p)}")

# 4) Chunking with overlap - REDUCED CHUNK SIZE for better model performance
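def chunk_text(text, chunk_size=800, overlap=150):
    """Split text into fixed-size character windows that overlap by `overlap` chars."""
    # Guard against a non-advancing window, which would loop forever
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    if not text:
        return chunks
    step = chunk_size - overlap  # distance between consecutive window starts
    start = 0
    L = len(text)
    while start < L:
        end = min(L, start + chunk_size)
        chunk = text[start:end].strip()
        if chunk:  # skip whitespace-only windows
            chunks.append(chunk)
        start += step
    return chunks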

docs = []
metas = []
for path, txt in raw_texts.items():
    cks = chunk_text(txt, chunk_size=800, overlap=150)
    for i, c in enumerate(cks):
        docs.append(c)
        metas.append({"source": os.path.basename(path), "chunk_id": i})
print(f"Total chunks: {len(docs)}")

# 5) Create embeddings (batch)
embed_model_name = "all-MiniLM-L6-v2"
print("Loading embedder:", embed_model_name)
embedder = SentenceTransformer(embed_model_name)

batch_size = 64
emb_list = []
for i in range(0, len(docs), batch_size):
    batch = docs[i:i+batch_size]
    e = embedder.encode(batch, show_progress_bar=True, convert_to_numpy=True)
    emb_list.append(e)
embeddings = np.vstack(emb_list).astype("float32")
print("Embeddings shape:", embeddings.shape)

# 6) Build FAISS index and persist
d = embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(embeddings)
print("FAISS index size:", index.ntotal)

faiss.write_index(index, "faiss.index")
with open("rag_metas.pkl","wb") as f:
    pickle.dump({"metas": metas, "docs": docs}, f)
print("Saved faiss.index and rag_metas.pkl")

# 7) Generator model (small, CPU-friendly by default)
gen_model_name = "google/flan-t5-small"
device = 0 if torch.cuda.is_available() else -1
print("Loading generator:", gen_model_name, "device:", device)
generator = pipeline("text2text-generation", model=gen_model_name, device=device, max_length=512)

# 8) Retriever + RAG answer function
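def retrieve_topk(query, top_k=4):
    """Return the top_k nearest chunks (by L2 distance) with their metadata."""
    q_emb = embedder.encode([query]).astype("float32")
    D, I = index.search(q_emb, top_k)
    results = []
    for idx in I[0]:
        # FAISS pads results with -1 when the index holds fewer than top_k vectors
        if idx < 0:
            continue
        results.append({"chunk": docs[idx], "meta": metas[idx]})
    return results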

def build_context(retrieved, max_chars=1500):
    """Build context with character limit to avoid token overflow"""
    parts = []
    total_chars = 0
    for r in retrieved:
        # Local name avoids shadowing the chunk_text() function defined above
        text = r['chunk']
        header = f"Source: {r['meta']['source']} (chunk {r['meta']['chunk_id']})"
        # Truncate if adding this chunk would exceed limit
        if total_chars + len(text) > max_chars:
            remaining = max_chars - total_chars
            if remaining > 100:  # Only add if meaningful text remains
                parts.append(f"{header}\n{text[:remaining]}...")
            break
        parts.append(f"{header}\n{text}")
        total_chars += len(text)
    return "\n\n---\n\n".join(parts)

chat_history = []

def answer_query(query, top_k=3):
    """Reduced top_k to 3 for better performance"""
    retrieved = retrieve_topk(query, top_k=top_k)
    context = build_context(retrieved, max_chars=1500)
    prompt = (
        "Answer the question using the context below. Be concise.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION: {query}\n\nANSWER:"
    )
    out = generator(prompt, max_new_tokens=200, do_sample=False)[0]["generated_text"].strip()
    # sometimes models echo the prompt; try to strip if echoed
    if out.startswith(prompt):
        out = out[len(prompt):].strip()
    chat_history.append((query, out))
    return out, retrieved

# 9) System ready notification
print("\n" + "=" * 70)
print("LEGAL AI SYSTEM READY!")
print("=" * 70)
print("System loaded with:")
print(f"   - {len(pdf_paths)} legal documents")
print(f"   - {len(docs)} text chunks indexed")
print(f"   - {embeddings.shape[0]} embeddings created")
print("=" * 70)
print("\nYou can now run the query cells below to test the system!")


Note: you may need to restart the kernel to use updated packages.
Searching for PDFs in: c:\Users\Sanjay\Desktop\SRM VDP HACKATHON\Docs
Found 9 PDFs:
  - 4877+Life.pdf
  - AI_and_India_Justice_CambridgeUPress (1).pdf
  - AI_and_India_Justice_CambridgeUPress.pdf
  - Responsible-AI-22022021.pdf
  - V5I564.pdf
  - legal 2.pdf
  - legal 3.pdf
  - legal 4.pdf
  - legal1.pdf
Extracted 36425 chars from 4877+Life.pdf
Extracted 36525 chars from AI_and_India_Justice_CambridgeUPress (1).pdf
Extracted 36525 chars from AI_and_India_Justice_CambridgeUPress.pdf
Extracted 93016 chars from Responsible-AI-22022021.pdf
Extracted 33153 chars from V5I564.pdf
Extracted 0 chars from legal 2.pdf
Extracted 12212 chars from legal 3.pdf
Extracted 0 chars from legal 4.pdf
Extracted 9823 chars from legal1.pdf
Total chunks: 401
Loading embedder: all-MiniLM-L6-v2



Embeddings shape: (401, 384)
FAISS index size: 401
Saved faiss.index and rag_metas.pkl
Loading generator: google/flan-t5-small device: -1


Device set to use cpu



LEGAL AI SYSTEM READY!
System loaded with:
   - 9 legal documents
   - 401 text chunks indexed
   - 401 embeddings created

You can now run the query cells below to test the system!
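
The chunk total reported above can be sanity-checked from the extracted character counts. With `chunk_size=800` and `overlap=150`, the window advances 650 characters per step, so a document of `n` characters yields at most `ceil(n / 650)` chunks (fewer if a window is whitespace-only and gets dropped). The sketch below applies that bound to the lengths printed in the log; `max_chunks` is a hypothetical helper, not part of the notebook.

```python
import math

def max_chunks(n_chars, chunk_size=800, overlap=150):
    """Upper bound on chunks produced by a sliding window over n_chars characters."""
    step = chunk_size - overlap
    return math.ceil(n_chars / step) if n_chars > 0 else 0

# Character counts copied from the extraction log above
lengths = [36425, 36525, 36525, 93016, 33153, 0, 12212, 0, 9823]
per_doc = [max_chunks(n) for n in lengths]
print(per_doc)       # [57, 57, 57, 144, 52, 0, 19, 0, 16]
print(sum(per_doc))  # 402
```

The bound of 402 is one above the 401 chunks reported, which would be consistent with a single whitespace-only window being dropped, although the log alone does not confirm this. The two files with 0 extracted characters contribute no chunks, so their content is not searchable; they may be scanned PDFs without a text layer.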


In [42]:
# LEGAL & JUSTICE OPTIMIZED RAG ANSWER FUNCTION
# IMPORTANT: Make sure to run cell 3 (Main RAG System) before running this cell!

import re

def format_legal_answer(text):
    """
    Formats the answer into paragraphs of up to three sentences for better readability.
    """
    # Clean up the text
    text = text.strip()

    # Split on whitespace that follows sentence-ending punctuation
    sentences = re.split(r'(?<=[.!?])\s+', text)

    # Group sentences into paragraphs of three; the last paragraph may be shorter
    paragraphs = []
    current_para = []
    for i, sentence in enumerate(sentences):
        current_para.append(sentence)
        if (i + 1) % 3 == 0 or i == len(sentences) - 1:
            paragraphs.append(' '.join(current_para))
            current_para = []

    return '\n\n'.join(paragraphs)

def build_context_optimized(retrieved, max_chars=1800):
    """Build context with character limit to avoid token overflow"""
    parts = []
    total_chars = 0
    for r in retrieved:
        # Local name avoids shadowing the chunk_text() function from the main cell
        text = r['chunk']
        # Truncate if adding this chunk would exceed limit
        if total_chars + len(text) > max_chars:
            remaining = max_chars - total_chars
            if remaining > 150:  # Only add if meaningful text remains
                parts.append(text[:remaining] + "...")
            break
        parts.append(text)
        total_chars += len(text)
    return "\n\n".join(parts)

def answer_query(query, top_k=4):
    """
    Retrieves top_k chunks related to a legal query and generates
    a detailed, factual, law-oriented answer.
    Optimized to avoid token length issues.
    
    NOTE: This overrides the basic answer_query() function from cell 3
    """

    # Retrieve relevant chunks
    retrieved = retrieve_topk(query, top_k=top_k)
    context = build_context_optimized(retrieved, max_chars=1800)

    # Improved prompt for detailed legal analysis
    legal_prompt = f"""You are a legal research assistant. Based on the legal text provided, give a comprehensive answer to the question. Include relevant details, principles, and implications.

LEGAL TEXT:
{context}

QUESTION: {query}

Provide a detailed answer (at least 3-4 sentences):"""

    # Generate the answer with more tokens for detailed response
    out = generator(legal_prompt, max_new_tokens=300, do_sample=False, truncation=True)[0]['generated_text'].strip()

    # Extract just the answer
    if "Provide a detailed answer" in out:
        parts = out.split("Provide a detailed answer")
        if len(parts) > 1:
            out = parts[-1].strip()
            # Remove the instruction suffix
            out = out.replace("(at least 3-4 sentences):", "").strip()
            out = out.lstrip(':').strip()
    
    # Format the answer for better readability
    formatted_answer = format_legal_answer(out)

    # Save in chat history
    chat_history.append((query, formatted_answer))

    return formatted_answer, retrieved

print("Enhanced legal query function loaded!")
print("Configured for detailed legal analysis with improved formatting.")

Enhanced legal query function loaded!
Configured for detailed legal analysis with improved formatting.
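
It may help to see how the character budget interacts with `top_k`. The sketch below reproduces the truncation logic of `build_context_optimized` on plain strings (the notebook version takes dicts with a `chunk` key) and feeds it four synthetic 800-character chunks. The name `build_context_from_strings`, the `min_remaining` parameter, and the synthetic inputs are illustrative assumptions.

```python
def build_context_from_strings(chunks, max_chars=1800, min_remaining=150):
    """Same budget logic as build_context_optimized, but over a list of strings."""
    parts = []
    total_chars = 0
    for text in chunks:
        if total_chars + len(text) > max_chars:
            remaining = max_chars - total_chars
            if remaining > min_remaining:
                # Keep a truncated piece only if enough budget is left
                parts.append(text[:remaining] + "...")
            break
        parts.append(text)
        total_chars += len(text)
    return "\n\n".join(parts)

# Four synthetic chunks at the maximum chunk size used by the notebook
chunks = [letter * 800 for letter in "abcd"]
context = build_context_from_strings(chunks)
pieces = context.split("\n\n")
print(len(pieces))               # 3
print([len(p) for p in pieces])  # [800, 800, 203]
```

With full-size chunks, the fourth retrieved passage never reaches the generator and the third is cut to 200 characters plus the ellipsis. Raising `top_k` beyond 3 therefore has little effect on the answer unless `max_chars` is raised too.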


In [43]:
# Test the Enhanced Legal Query System
print("Testing Legal Query with Structured Output\n")
print("=" * 80)

ans, ret = answer_query("Explain the main legal principle discussed in the document.", top_k=4)

print("\nLEGAL ANALYSIS:")
print("=" * 80)
print(ans)
print("\n" + "=" * 80)
print("\nSOURCES CONSULTED:")
for i, r in enumerate(ret, 1):
    print(f"  {i}. {r['meta']['source']} (chunk {r['meta']['chunk_id']})")

Testing Legal Query with Structured Output


LEGAL ANALYSIS:
The main legal principle discussed in the document is rimination on the basis of religion, race, caste, sex, descent, place of birth or residence in matters of education, employment, access to public spaces, etc. The Constitution prohibits discrimination based on certain markers, it also provides for positive discrimination in the form of affirmative action. Article 15: rimination on the basis of religion, race, caste, sex, descent, place of birth or residence in matters of education, employment, access to public spaces, etc.

The Constitution prohibits discrimination based on certain markers, it also provides for positive discrimination in the form of affirmative action. Article 15: rimination on the basis of religion, race, caste, sex, descent, place of birth or residence in matters of education, employment, access to public spaces, etc. The Constitution prohibits discrimination based on certain markers, it also provides fo
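
The answer above repeats the same sentences several times, which can happen with small generators under greedy decoding. One possible mitigation, not part of this notebook, is to drop repeated sentences before formatting. The helper below is a hypothetical sketch that uses the same sentence-splitting pattern as `format_legal_answer`.

```python
import re

def dedupe_sentences(text):
    """Keep the first occurrence of each sentence, preserving order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    seen = set()
    unique = []
    for sentence in sentences:
        key = sentence.strip().lower()  # case-insensitive comparison
        if key and key not in seen:
            seen.add(key)
            unique.append(sentence.strip())
    return " ".join(unique)

sample = "The law applies. Courts decide. The law applies. Courts decide."
print(dedupe_sentences(sample))  # The law applies. Courts decide.
```

This only removes exact repeats; it does not address the underlying cause, such as a generator that is too small for long legal prompts.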

---
## Interactive Legal Queries
Run the cells below to test different legal questions.
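
The query cells that follow repeat the same print logic. A small wrapper like the one sketched here could remove that duplication. `run_query` and `stub_answer` are hypothetical names; the wrapper takes the answer function as a parameter so it can be tried with a stub before the models are loaded. In the notebook it would be called as `run_query(question, answer_query)`.

```python
def run_query(question, answer_fn, top_k=4, width=80):
    """Run one query and return a printable report with the answer and its sources."""
    answer, sources = answer_fn(question, top_k=top_k)
    lines = [
        f"Question: {question}",
        "",
        "Answer:",
        "-" * width,
        answer,
        "-" * width,
        "",
        "Sources Referenced:",
    ]
    for i, s in enumerate(sources, 1):
        lines.append(f"  {i}. {s['meta']['source']} (chunk {s['meta']['chunk_id']})")
    return "\n".join(lines)

def stub_answer(question, top_k=4):
    """Stand-in for answer_query so the wrapper can be exercised without models."""
    return "stub answer", [{"chunk": "...", "meta": {"source": "example.pdf", "chunk_id": 0}}]

report = run_query("What is Article 15?", stub_answer)
print(report)
```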

In [44]:
# Test Query 1: General Legal Overview
print("\nLEGAL QUERY TEST #1")
print("=" * 80)

query1 = "What are the main legal challenges discussed regarding AI and justice?"
print(f"\nQuestion: {query1}\n")

ans, sources = answer_query(query1, top_k=4)

print("Answer:")
print("-" * 80)
print(ans)
print("-" * 80)

print("\nSources Referenced:")
for i, s in enumerate(sources, 1):
    print(f"  {i}. {s['meta']['source']} (chunk {s['meta']['chunk_id']})")


LEGAL QUERY TEST #1

Question: What are the main legal challenges discussed regarding AI and justice?

Answer:
--------------------------------------------------------------------------------
Privacy and data protection, security v tion to deal with this aspect of AI remains with the High C ourts of respective state and the Supreme Court of India.
--------------------------------------------------------------------------------

Sources Referenced:
  1. 4877+Life.pdf (chunk 11)
  2. 4877+Life.pdf (chunk 8)
  3. 4877+Life.pdf (chunk 23)
  4. legal1.pdf (chunk 13)

In [45]:
# Test Query 2: Specific Legal Topic
print("\nLEGAL QUERY TEST #2")
print("=" * 80)

query2 = "What are the implications of AI in judicial decision making?"
print(f"\nQuestion: {query2}\n")

ans, sources = answer_query(query2, top_k=4)

print("Answer:")
print("-" * 80)
print(ans)
print("-" * 80)

print("\nSources Referenced:")
for i, s in enumerate(sources, 1):
    print(f"  {i}. {s['meta']['source']} (chunk {s['meta']['chunk_id']})")


LEGAL QUERY TEST #2

Question: What are the implications of AI in judicial decision making?

Answer:
--------------------------------------------------------------------------------
Artificial Intelligence in the Indian Criminal Justice System: Advancements, Challenges, and Ethical Implications
--------------------------------------------------------------------------------

Sources Referenced:
  1. 4877+Life.pdf (chunk 48)
  2. 4877+Life.pdf (chunk 8)
  3. Responsible-AI-22022021.pdf (chunk 84)
  4. legal1.pdf (chunk 9)


In [None]:
# Custom Query - Ask Your Own Question!
# Change the question below to test different legal queries

print("\nCUSTOM LEGAL QUERY")
print("=" * 90)

my_question = "What are the ethical considerations for AI in the legal system?"

print(f"\nQuestion: {my_question}\n")

ans, sources = answer_query(my_question, top_k=4)

print("Legal Analysis:")
print("-" * 90)
print(ans)
print("-" * 90)

print("\nRetrieved from these sources:")
for i, s in enumerate(sources, 1):
    print(f"  {i}. {s['meta']['source']} - Chunk {s['meta']['chunk_id']}")

print("\n" + "=" * 90)


CUSTOM LEGAL QUERY

Question: What are the ethical considerations for AI in the legal system?

Legal Analysis:
------------------------------------------------------------------------------------------
Ethical impact assessments (EIA) for AI systems before deployment
------------------------------------------------------------------------------------------

Retrieved from these sources:
  1. V5I564.pdf - Chunk 43 (800 chars)
  2. 4877+Life.pdf - Chunk 11 (800 chars)
  3. 4877+Life.pdf - Chunk 45 (800 chars)


