## Problem Statement

Organizations often need to analyze **large volumes of legal contracts** stored as PDFs, which contain nuanced clauses like **termination**, **non-compete**, and **confidentiality terms**. Purely semantic search (vector-based) or keyword search (BM25) alone may miss context or relevance.

**Objective:**  
To build a more powerful **Hybrid RAG system** over multiple legal contracts that combines:

- **Semantic search** via FAISS for meaning-based retrieval  
- **Keyword search** via BM25 (Whoosh) for exact match and legal precision  
- **Instruction-tuned generation** using FLAN-T5 to answer questions using merged results  

All achieved **without cloud APIs**, using open-source tools only.

---

In [1]:
# Whats new?
# Focused on legal contracts
# Hybrid retrieval (semantic + BM25 keyword search)
# Whoosh (BM25-based inverted index)
# Combines FAISS + Whoosh, removes duplicates
# Stores file-level metadata with chunks
# Allows summarization of each PDF in a legal context
# 

In [2]:
# Install dependencies
!pip install ipywidgets sentence-transformers faiss-cpu transformers PyPDF2 whoosh -q
# whoosh - For full-text keyword search (BM25-style).


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [3]:
# Imports
import os
import PyPDF2
import faiss
import numpy as np
from whoosh.fields import Schema, TEXT, ID
# Imports schema field types used to define the structure of a Whoosh keyword search index.
# TEXT: For full-text searchable fields
# ID: For non-tokenized IDs (e.g., file name)
# i.e. Defines how chunks and metadata (like file name) are stored for BM25-style search.
from whoosh.index import create_in
# Used to create a Whoosh index directory and store your keyword-searchable documents in it.
from whoosh.qparser import QueryParser
# lets you parse a natural-language query string into a Whoosh search query object.
from sentence_transformers import SentenceTransformer
from transformers import pipeline
import tempfile

In [4]:
# Step 1: Load & chunk PDFs from contracts/
def load_contract_chunks(folder_path, chunk_size=300):
    chunks = []
    filenames = os.listdir(folder_path)
    for filename in filenames:
        if filename.endswith(".pdf"):
            path = os.path.join(folder_path, filename)
            reader = PyPDF2.PdfReader(path)
            text = ""
            for page in reader.pages:
                text += page.extract_text()
            text = text.replace("\n", " ")
            for i in range(0, len(text), chunk_size):
                chunk = text[i:i + chunk_size]
                chunks.append((chunk, filename))
    return chunks

chunks_with_meta = load_contract_chunks("contracts", chunk_size=300)
chunks = [c[0] for c in chunks_with_meta] # Extracts just the chunk texts (without filenames) for FAISS indexing.
print(f"✅ Loaded {len(chunks)} chunks from contracts.")

✅ Loaded 466 chunks from contracts.


In [11]:
print(f"chunks_with_meta: {chunks_with_meta[:1]}")

chunks_with_meta: [('  Page 1 Sample Contract    Contract No.___________  PROFESSIONAL SERVICES AGREEMENT      THIS AGREEMENT made and entered into this _______day of                       , 20      by and between the SANTA  CRUZ COUNTY REGIONAL TRANSPORTATION COMMISSION, hereinafter called COMMISSION, and ________     ', '1SampleCo1ntract-Shuttle.pdf')]


In [6]:
print(f"chunks {chunks[:1]}") # Print first chunk for verification

chunks ['  Page 1 Sample Contract    Contract No.___________  PROFESSIONAL SERVICES AGREEMENT      THIS AGREEMENT made and entered into this _______day of                       , 20      by and between the SANTA  CRUZ COUNTY REGIONAL TRANSPORTATION COMMISSION, hereinafter called COMMISSION, and ________     ']


In [9]:
# Step 2: FAISS vector index (semantic search)
embedder = SentenceTransformer("all-MiniLM-L6-v2") # Loads the pretrained bi-encoder model all-MiniLM-L6-v2 from HuggingFace via SentenceTransformers.
embeddings = embedder.encode(chunks, convert_to_tensor=False)
dimension = len(embeddings[0])
print(f"Dimension: {dimension}")
faiss_index = faiss.IndexFlatL2(dimension) # Creates a flat L2 index in FAISS for efficient nearest-neighbor search using Euclidean distance (L2 norm).
faiss_index.add(np.array(embeddings)) # Adds all your chunk embeddings into the FAISS index.
print("✅ FAISS index ready.")

Dimension: 384
✅ FAISS index ready.


In [13]:
print(f"embeddings: {embeddings}")

embeddings: [[-0.02682767  0.07159669  0.00220114 ...  0.0158756  -0.02813633
  -0.05008272]
 [-0.03836435  0.00306136 -0.06537561 ...  0.08085535 -0.02805743
  -0.03017664]
 [-0.04938861 -0.02651223 -0.09722698 ... -0.00675625  0.01230283
  -0.06103821]
 ...
 [-0.0593512   0.08018056  0.02705275 ... -0.00472621  0.09366456
   0.03406013]
 [ 0.01756999  0.07844537 -0.01589446 ... -0.06351048  0.04946419
  -0.03323623]
 [-0.07469863  0.11128418  0.00327795 ... -0.10253063  0.04803063
  -0.0106685 ]]


In [14]:
print(f"searchable faiss_index: {faiss_index}")

searchable faiss_index: <faiss.swigfaiss.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x16898b690> >


In [10]:
# Step 3: BM25 index using Whoosh (keyword search)
# This step builds a BM25-based full-text search index using Whoosh — the keyword retrieval engine of your Hybrid RAG pipeline.

schema = Schema(content=TEXT(stored=True), path=ID(stored=True))
# Defines a schema for the Whoosh index with two fields:
# content: The actual text chunk (full-text searchable using BM25)
# path: The source filename (used as metadata)

index_dir = tempfile.mkdtemp()
ix = create_in(index_dir, schema) # Initializes a new Whoosh index in the temporary folder with your schema.
writer = ix.writer()
for i, (chunk, fname) in enumerate(chunks_with_meta):
    writer.add_document(content=chunk, path=fname)
writer.commit() # Commits the changes to the index, making it searchable.
print("✅ Whoosh BM25 index ready.")

✅ Whoosh BM25 index ready.


In [15]:
# Step 4: Hybrid retrieval
def hybrid_retrieve(query, top_k=3):
    # 1. Vector search (semantic)
    # If the user says "What is the notice period?", FAISS will find chunks that semantically align 
    # even if the words are not an exact match.
    q_vec = embedder.encode([query]) # Converts the query to a dense vector.
    _, indices = faiss_index.search(np.array(q_vec), top_k) # Retrieves top-k closest chunk embeddings.
    semantic_results = [chunks[i] for i in indices[0]] # Contains text chunks most semantically similar to the question
    
    # 2. Keyword search (BM25)
    # FAISS might miss exact matches, while keyword search shines when the query and answer have overlapping 
    # tokens (e.g., legal clauses, names, technical terms).
    with ix.searcher() as searcher:
        parser = QueryParser("content", schema=ix.schema)
        parsed_query = parser.parse(query)
        results = searcher.search(parsed_query, limit=top_k)
        keyword_results = [r['content'] for r in results]

    # Merge and dedupe
    # Combines semantic and keyword results
    hybrid_results = list(dict.fromkeys(semantic_results + keyword_results))
    return hybrid_results[:top_k]

In [16]:
# Step 5: LLM - FLAN-T5
qa = pipeline("text2text-generation", model="google/flan-t5-base", max_length=256)

config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Device set to use mps:0


In [18]:
# Step 6: Answer query with hybrid context
def answer_query(query):
    contexts = hybrid_retrieve(query) # Calls the previous hybrid retriever to get top-k relevant chunks.
    full_context = "\n".join(contexts) # Combines the chunks into a single paragraph for the prompt.
    prompt = f"Context:\n{full_context}\n\nQuestion: {query}\n\nAnswer:" # Constructs an instruction-style prompt with context + question.
    result = qa(prompt)[0]["generated_text"]
    return result.strip()

In [19]:
# Test queries
print("\n🧠 Query 1:")
print(answer_query("What does the termination clause say?"))

print("\n🧠 Query 2:")
print(answer_query("Explain the non-compete obligations."))

print("\n🧠 Query 3:")
print(answer_query("Describe the confidentiality terms."))



🧠 Query 1:
i) Breac

🧠 Query 2:
An agreement of service through which an employee commits not to compete with his employer is not in restraint of trad perform his obligations under a contr act. F) Discharge by impossibility of performance – Impossibility of performance results in the discharge of the contract. An agreement which is impossible is void, because law does not comp or other forms of compensation; a nd selection for training (including apprenticeship), employment, upgrading, demotion, or transfer. The CONSULTANT agrees to post in conspicuous places, available to employees and applicants for employme nt, notice setting forth the provisions of this non-discrim

🧠 Query 3:
a docto r has a duty of confidentiality oses a special duty to act with the utmost good faith i.e., to disclose all material information


In [20]:
print("\n🧠 Query 4:")
print(answer_query("Summarize each PDF"))


🧠 Query 4:
a) a written progress report, in a format to be mutually agreed upon, that is sufficiently detailed for the Contract Manager to determ ine if the CONSULTANT is performing to expectations and is on sche dule; 6. Written progress reports, in a format to be mutually agreed upon, that is sufficiently detailed for the Contract Manager to determ ine if the CONSULTANT is performing to expectations and is on sche dule; provides communi
