### What is Context Re-ranking?

Context re-ranking reorders retrieved documents or chunks by how relevant they are to the user’s query. It runs as a separate step after retrieval, not during retrieval.

##### Why is it needed?

Vector similarity search retrieves chunks based on embedding similarity, which:

+ May include semantically similar but irrelevant chunks
+ Does not always understand user intent deeply

This can lead to:

+ Hallucinations
+ Answers based on weak context
+ Important chunks appearing too late in the prompt

##### How it works internally

1. Retriever fetches top N chunks (e.g., top 20)
2. A re-ranker model (LLM or cross-encoder) scores each chunk against the query
3. Chunks are sorted again by relevance
4. Only the top K high-quality chunks are passed to the LLM
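
The four steps above can be sketched without any model at all. In this minimal sketch, `overlap_score` is a hypothetical stand-in for the cross-encoder or LLM judge (it only counts shared words), and the chunk texts are made up for illustration; a real re-ranker would plug in at `score_fn`.

```python
from typing import Callable, List, Tuple


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that appear in the chunk.

    Hypothetical stand-in for a cross-encoder or LLM judge.
    """
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


def rerank(
    query: str,
    chunks: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every retrieved chunk against the query and keep the best top_k."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in chunks]
    # list.sort() is stable, so chunks with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up chunks in the order a retriever might return them.
retrieved = [
    "GDPR introduction and scope",
    "Data subject rights under GDPR",
    "GDPR penalties and fines",
    "GDPR history",
]
top = rerank("GDPR penalties", retrieved, overlap_score, top_k=3)
```

Because the scorer is passed in as a function, the sorting and truncation logic stays the same whichever relevance model is used.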

Example

Query:

“What are the penalties under GDPR?”

Retrieved chunks:

+ Chunk 1: GDPR introduction
+ Chunk 2: Data subject rights
+ Chunk 3: Penalties and fines
+ Chunk 4: GDPR history

After re-ranking (top 3 kept):

+ Penalties and fines
+ Data subject rights
+ GDPR introduction

Key takeaway: Re-ranking can improve answer accuracy without changing your vector database.

*** 

### What is a Compression Retriever?

A compression retriever reduces the size of retrieved content by extracting only the most relevant sentences or facts before sending it to the LLM.

##### Why is it needed?

Problems with raw retrieval:

+ Retrieved chunks are too long
+ Token limits are exceeded
+ LLM gets distracted by irrelevant text

Compression helps:

+ Lower token usage
+ Faster inference
+ Higher factual precision

##### How it works internally

1. Retrieve chunks normally
2. Pass each chunk through a compression step, such as:
     + Sentence selection
     + Summarization
     + Keyword-based filtering
3. Send the compressed context to the LLM
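
As an illustration of the sentence-selection option, here is a minimal sketch that keeps only the sentences sharing a keyword with the query. The stopword list, the naive sentence splitter, and the sample chunk are all assumptions made for this example; an LLM-based compressor would replace `compress_chunk`.

```python
import re
from typing import List

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "is", "are", "what", "of", "to", "in", "under", "for", "and", "or"}


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def compress_chunk(query: str, chunk: str) -> str:
    """Keep only the sentences that share a non-stopword term with the query."""
    terms = set(re.findall(r"[a-z0-9]+", query.lower())) - STOPWORDS
    kept = [
        s for s in split_sentences(chunk)
        if terms & set(re.findall(r"[a-z0-9]+", s.lower()))
    ]
    return " ".join(kept)


chunk = (
    "GDPR took effect in 2018. "
    "Penalties can reach 20 million euros or 4% of global annual turnover. "
    "It also defines rights such as access and erasure."
)
compressed = compress_chunk("What are the penalties?", chunk)
```

Pure keyword matching is cheap but brittle: a query that says "fines" would not match a sentence that says "penalties", which is one reason to use a model for this step.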

Example

Original chunk (300 words):
“GDPR took effect in 2018... penalties can go up to 20 million euros or 4% of global annual turnover... applies to personal data of people in the EU...”

Compressed chunk:
“GDPR penalties can reach €20 million or 4% of global annual turnover.”

Key takeaway:
Compression retrievers make RAG lighter, cheaper, and more focused


***
### What is a Multi-Query Retriever?

A multi-query retriever generates multiple versions of the user query and retrieves documents for each variation.

##### Why is it needed?

Users often:

+ Ask vague questions
+ Use informal language
+ Miss important keywords

Single-query retrieval may miss relevant documents that are phrased differently from the question.

##### How it works internally

1. User asks one question
2. LLM generates multiple re-phrased queries
3. Each query retrieves documents independently
4. Results are merged and deduplicated
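
Steps 3 and 4 can be sketched as follows. The query variants are assumed to have been generated already, and `FAKE_INDEX` with `fake_search` is a hypothetical stand-in for a real retriever, used only so the example runs on its own.

```python
from typing import Callable, Dict, List


def multi_query_search(
    queries: List[str],
    search_fn: Callable[[str], List[str]],
) -> List[str]:
    """Run every query variant, then merge the results in first-seen order."""
    seen = set()
    merged: List[str] = []
    for query in queries:
        for doc in search_fn(query):
            if doc not in seen:  # skip documents an earlier variant already returned
                seen.add(doc)
                merged.append(doc)
    return merged


# Hypothetical index: each query variant maps to the documents it would retrieve.
FAKE_INDEX: Dict[str, List[str]] = {
    "customer data security best practices": ["encryption guide", "access control policy"],
    "preventing data breaches": ["incident response plan", "encryption guide"],
}


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


merged = multi_query_search(list(FAKE_INDEX), fake_search)
```

The second variant contributes one new document and one duplicate, so the merged list has three entries instead of four.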

Example

User query:
“How to secure customer data?”

Generated queries:
+ “Customer data security best practices”
+ “Data protection techniques for user information”
+ “Preventing data breaches in customer databases”

Each retrieves different but relevant content.

Key takeaway: Multi-query retrievers increase recall and reduce blind spots

***
### Hybrid Retrieval (BM25 + Embeddings)
##### What is Hybrid Retrieval?

Hybrid retrieval combines:
+ Keyword-based search (BM25)
+ Semantic search (embeddings)

##### Why is it needed?

+ Embedding search: can miss exact keywords, IDs, and numbers
+ BM25: matches literal terms, so it misses synonyms and paraphrases

BM25 (Best Matching 25) is a popular ranking function in search engines, commonly used in systems like Elasticsearch and Lucene. It scores documents based on their relevance to a search query and improves on older methods like TF-IDF by better handling term frequency, document length, and term rarity.
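
To make the three ingredients (term frequency, document length, term rarity) concrete, here is a minimal sketch of one common BM25 variant. The defaults `k1=1.5` and `b=0.75` and the non-negative IDF form are assumptions chosen for illustration; libraries differ in these details, so this should not be read as the exact formula any particular engine uses.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str],
    corpus: List[List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score each tokenized document in `corpus` against the tokenized `query`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            # Term rarity: rare terms get a larger weight (this IDF form is never negative).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # k1 makes repeated terms saturate; b controls document-length normalization.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up, pre-tokenized corpus.
corpus = [
    "error code e102 payment gateway signature invalid".split(),
    "gdpr penalties fines turnover".split(),
    "payment gateway setup guide".split(),
]
scores = bm25_scores("error code e102".split(), corpus)
```

Only the first document contains any query term, so it is the only one with a non-zero score.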

Hybrid retrieval gives the best of both worlds.

##### How it works internally

1. The query is sent to both retrievers:
     + BM25 retriever (keyword match)
     + Vector retriever (semantic similarity)
2. Scores from both retrievers are normalized, weighted, and combined
3. The merged list is re-ranked by the combined score
4. Final top chunks are selected
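
The score-combination step can be sketched as a weighted sum of per-retriever normalized scores. This is one possible design, not the only one: the equal weights, the min-max normalization, and the rule that a document missed by one retriever contributes 0 from that retriever are all assumptions of this example, and the scores are made up.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to the 0-1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal (for example a single hit): treat them as equally strong.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(
    bm25: Dict[str, float],
    vector: Dict[str, float],
    w_bm25: float = 0.5,
    w_vec: float = 0.5,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores; a miss in one retriever counts as 0."""
    nb, nv = min_max(bm25), min_max(vector)
    combined = {
        doc_id: w_bm25 * nb.get(doc_id, 0.0) + w_vec * nv.get(doc_id, 0.0)
        for doc_id in set(nb) | set(nv)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores; both retrievers report "higher is better".
bm25_hits = {"D4": 3.4, "D1": 0.2}
vector_hits = {"D4": 0.82, "D3": 0.40, "D2": 0.35}
ranking = fuse(bm25_hits, vector_hits)
```

Each retriever is normalized over its own hits before the lists are joined, so a missing document can never outrank a real hit by default. Keyword hits with a BM25 score of zero are best dropped before fusing, since they carry no signal.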

Example Query:

“Error code E102 in payment gateway”

+ BM25 finds exact error codes
+ Embeddings find semantic explanations

Key takeaway: Hybrid retrieval handles both exact-match and semantic queries, which makes it a strong default for production systems.

*** 
### What is a Self-Query Retriever?

A self-query retriever lets the LLM decide what to search and how to filter, including metadata conditions.

##### Why is it needed?

Users naturally ask:

+ “Show recent policies”
+ “Documents after 2022”
+ “Only finance-related reports”

Traditional retrievers cannot interpret these filters automatically.

##### How it works internally

The LLM analyzes the user query and converts it into:

1. Semantic search query
2. Metadata filters

Finally, the retriever executes the filtered search.
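
Because the filters come from model output, it helps to validate them before running the search. The sketch below assumes the LLM was asked to return a JSON object with `search_query` and `filters` keys; that shape, the `ALLOWED_FIELDS` schema, and the sample output string are hypothetical choices for this example.

```python
import json
from typing import Any, Dict

# Hypothetical schema: the metadata fields (and types) the document store supports.
ALLOWED_FIELDS = {"year": int, "topic": str, "doc_type": str}


def parse_plan(raw: str, fallback_query: str) -> Dict[str, Any]:
    """Parse the LLM's JSON plan and drop any filter the schema does not allow."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        plan = None
    if not isinstance(plan, dict):
        # Unusable output: fall back to a plain, unfiltered search.
        return {"search_query": fallback_query, "filters": {}}

    raw_filters = plan.get("filters")
    if not isinstance(raw_filters, dict):
        raw_filters = {}
    filters = {
        field: value
        for field, value in raw_filters.items()
        if field in ALLOWED_FIELDS and isinstance(value, ALLOWED_FIELDS[field])
    }
    search_query = plan.get("search_query") or fallback_query
    return {"search_query": search_query, "filters": filters}


# Example model output containing one valid and one unsupported filter.
llm_output = '{"search_query": "sustainability energy", "filters": {"year": 2023, "author": "unknown"}}'
plan = parse_plan(llm_output, fallback_query="Show sustainability reports from 2023 related to energy")
```

The unsupported `author` filter is dropped, so the search runs with `year = 2023` only instead of failing or silently matching nothing.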

Example

User:

“Show sustainability reports from 2023 related to energy”

LLM converts to:

+ Query: “sustainability energy”
+ Filter: year = 2023

Key takeaway:
Self-query retrievers enable natural language filtering

*** 

### What is Metadata-based Retrieval?

Documents are stored with extra attributes (metadata) such as date, category, author, department, and region.

Metadata filtering restricts retrieval to relevant subsets.

##### Why is it needed?

Without metadata filtering:

+ Old documents pollute results
+ Wrong departments show up
+ Compliance issues arise

##### How it works internally

+ The vector store keeps metadata alongside each embedding
+ The query includes filter conditions
+ The retriever fetches only matching documents
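
The filter-then-search flow can be sketched in memory. The documents, their metadata, and the word-overlap ranking (a stand-in for vector similarity) are made up for this example; a vector store applies the same idea inside its own query engine.

```python
from typing import Any, Dict, List

Doc = Dict[str, Any]  # {"text": ..., "metadata": {...}}


def matches(metadata: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """True when every filter field is present with exactly the requested value."""
    return all(
        field in metadata and metadata[field] == value
        for field, value in filters.items()
    )


def filtered_search(query: str, docs: List[Doc], filters: Dict[str, Any]) -> List[Doc]:
    """Filter by metadata first, then rank the survivors by naive word overlap."""
    candidates = [d for d in docs if matches(d["metadata"], filters)]
    query_words = set(query.lower().split())
    return sorted(
        candidates,
        key=lambda d: len(query_words & set(d["text"].lower().split())),
        reverse=True,
    )


sample_docs = [
    {"text": "Quarterly finance report", "metadata": {"department": "Finance", "region": "India"}},
    {"text": "Finance travel policy", "metadata": {"department": "Finance", "region": "EU"}},
    {"text": "HR leave policy", "metadata": {"department": "HR", "region": "India"}},
]
results = filtered_search(
    "finance report", sample_docs, {"department": "Finance", "region": "India"}
)
```

Documents that fail the filter are never scored, which is what keeps out-of-scope content away from the LLM regardless of how similar its text is to the query.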

Example

Filter:

+ department = “Finance”
+ region = “India”

Only finance documents from India are searched.

Key takeaway: Metadata filters bring precision and governance to RAG

*** 

### What is RAG with Chat History?

RAG with chat history allows retrieval to consider previous conversation turns, not just the latest question.

##### Why is it needed?

Users ask follow-up questions like:

+ “What about penalties?”
+ “Explain that again”
+ “Does this apply in India?”

Without chat history:

+ Retriever lacks context
+ Answers can become incorrect or off-topic

##### How it works internally

1. Combine the latest user query with the relevant past messages
2. Create a context-aware retrieval query
3. Retrieve documents using that context-aware query
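
A common way to implement steps 1 and 2 is to ask an LLM to rewrite the follow-up as a standalone question. The sketch below only builds that prompt and makes no model call; the prompt wording, the `(role, message)` turn format, and the `max_turns` cutoff are assumptions made for this example.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (role, message), e.g. ("human", "Explain GDPR")


def build_condense_prompt(history: List[Turn], followup: str, max_turns: int = 4) -> str:
    """Build a prompt asking an LLM to rewrite the follow-up as a standalone question."""
    # Keep only the latest turns so the prompt size stays bounded.
    recent = history[-max_turns:] if max_turns > 0 else []
    transcript = "\n".join(f"{role}: {message}" for role, message in recent)
    return (
        "Rewrite the follow-up as a standalone question that can be understood "
        "without the conversation. Return only the rewritten question.\n\n"
        f"Conversation:\n{transcript}\n\n"
        f"Follow-up: {followup}"
    )


history = [
    ("human", "Explain GDPR"),
    ("ai", "GDPR is an EU privacy regulation about personal data processing."),
]
prompt = build_condense_prompt(history, "What about penalties?")
```

The rewritten question returned by the model, rather than the raw follow-up, is then what gets sent to the retriever.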

Example

Conversation:

1. “Explain GDPR”
2. “What about penalties?”

The retriever interprets the follow-up as:
penalties related to GDPR

Key takeaway: Chat-aware RAG keeps follow-up questions grounded in the conversation so far.

## Context Re-ranking

In [8]:
import json
import re
from typing import List, Tuple
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

load_dotenv(override=True)

# LLM (OpenAI) and embeddings
llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)
emb = OpenAIEmbeddings()

# Sample documents (replace with your scraped/split docs)
docs = [
    Document(page_content="GDPR penalties can be up to €20M or 4% of global annual turnover.", metadata={"id": "A"}),
    Document(page_content="GDPR includes rights like access, rectification, and erasure.", metadata={"id": "B"}),
    Document(page_content="SOC 2 is a compliance framework focused on controls for service organizations.", metadata={"id": "C"}),
    Document(page_content="GDPR applies to processing personal data of people in the EU.", metadata={"id": "D"}),
]

# Vector DB + base retriever
vs = Chroma.from_documents(docs, embedding=emb)
base_retriever = vs.as_retriever(search_kwargs={"k": 6})


In [9]:
def extract_json(text: str):
    """
    The model sometimes adds extra text around its JSON. This takes the span from
    the first "[" to the last "]" and parses it as a JSON array.
    """
    m = re.search(r"(\[.*\])", text, flags=re.DOTALL)
    if not m:
        raise ValueError("No JSON array found in model output:\n" + text)
    return json.loads(m.group(1))


In [10]:
# Re-ranking function (LLM scores each retrieved chunk)

def rerank_with_openai(query: str, retrieved_docs: List[Document], llm: ChatOpenAI, top_k: int = 3) -> List[Document]:
    """
    Takes docs from vector retrieval, asks LLM to score each chunk, returns top_k best chunks.
    """
    # Prepare small snippets to keep prompt size manageable
    snippets = []
    for i, d in enumerate(retrieved_docs):
        snippets.append({
            "id": i,
            "text": (d.page_content or "")[:900]
        })

    prompt = f"""
You are a relevance judge for RAG retrieval.

Question:
{query}

Below are retrieved text chunks. Give each chunk a relevance score from 0 to 100.
- 100 = directly answers the question
- 50 = somewhat useful
- 0 = not useful

Return ONLY valid JSON array like:
[
  {{"id": 0, "score": 87, "reason": "short reason"}},
  ...
]

Chunks:
{json.dumps(snippets, ensure_ascii=False)}
""".strip()

    resp = llm.invoke(prompt)
    scores_list = extract_json(resp.content)

    # Map id -> score
    id_to_score = {item["id"]: item.get("score", 0) for item in scores_list if "id" in item}

    # Sort docs by score
    ranked = sorted(
        enumerate(retrieved_docs),
        key=lambda x: id_to_score.get(x[0], 0),
        reverse=True
    )

    # Return top_k docs
    return [doc for _, doc in ranked[:top_k]]


In [11]:
query = "What are GDPR penalties?"

# Step 1: fast retrieval (vector DB)
retrieved = base_retriever.invoke(query)

print("=== Retrieved (Vector DB order) ===")
for d in retrieved:
    print("-", d.metadata, d.page_content)

# Step 2: smart ordering (LLM re-ranking)
reranked = rerank_with_openai(query, retrieved, llm, top_k=3)

print("\n=== Re-ranked (LLM order) ===")
for d in reranked:
    print("-", d.metadata, d.page_content)


Number of requested results 6 is greater than number of elements in index 4, updating n_results = 4


=== Retrieved (Vector DB order) ===
- {'id': 'A'} GDPR penalties can be up to €20M or 4% of global annual turnover.
- {'id': 'D'} GDPR applies to processing personal data of people in the EU.
- {'id': 'B'} GDPR includes rights like access, rectification, and erasure.
- {'id': 'C'} SOC 2 is a compliance framework focused on controls for service organizations.

=== Re-ranked (LLM order) ===
- {'id': 'A'} GDPR penalties can be up to €20M or 4% of global annual turnover.
- {'id': 'D'} GDPR applies to processing personal data of people in the EU.
- {'id': 'B'} GDPR includes rights like access, rectification, and erasure.


## Compression Retriever

In [13]:
from typing import List
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)
emb = OpenAIEmbeddings()


In [14]:
docs = [
    Document(
        page_content="""
GDPR was introduced in 2018. It applies to processing personal data of people in the EU.
Penalties can be up to €20 million or 4% of global annual turnover, whichever is higher.
It also defines rights like access, rectification, and erasure.
""",
        metadata={"source": "gdpr_overview"}
    ),
    Document(
        page_content="""
SOC 2 is a compliance framework for service organizations.
It focuses on controls related to security, availability, processing integrity, confidentiality, and privacy.
""",
        metadata={"source": "soc2"}
    ),
]

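# Note: no collection_name is given, so this appears to reuse the default in-memory
# collection from the previous section. That is why the outputs below also contain the
# earlier documents. Pass a unique collection_name to keep each demo isolated.
vs = Chroma.from_documents(docs, embedding=emb)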
base_retriever = vs.as_retriever(search_kwargs={"k": 4})


In [15]:
def compress_document(query: str, doc: Document, llm: ChatOpenAI, max_chars: int = 600) -> str:
    """
    Uses the LLM to extract only the relevant lines from a document for the given query.
    Returns a short compressed text.
    """
    text = (doc.page_content or "")[:2500]  # limit input

    prompt = f"""
    You are a retrieval compressor.
    Extract ONLY the sentences/phrases that help answer the question.
    - Do NOT add new information.
    - Keep numbers and facts exactly.
    - If nothing relevant, return an empty string.
    
    Question: {query}
    
    Text:
    {text}
    
    Return only the extracted text (no markdown, no bullets unless they already exist).
    """.strip()

    # .strip() removes leading and trailing whitespace (spaces, tabs, newlines),
    # here from both the prompt template and the model's reply.
    resp = llm.invoke(prompt).content.strip()
    return resp[:max_chars]


In [16]:
def compression_retriever(query: str, base_retriever, llm: ChatOpenAI, top_k: int = 3) -> List[Document]:
    """
    1) Retrieves docs using vector similarity
    2) Compresses each doc using LLM extraction
    3) Returns compressed Documents
    """
    retrieved_docs = base_retriever.invoke(query)

    compressed_docs = []
    for d in retrieved_docs:
        compressed_text = compress_document(query, d, llm)
        if compressed_text.strip():
            compressed_docs.append(
                Document(page_content=compressed_text, metadata=d.metadata)
            )

    return compressed_docs[:top_k]

In [17]:
query = "What are GDPR penalties?"

normal_docs = base_retriever.invoke(query)

print("=== NORMAL RETRIEVER (Raw Context) ===")
for d in normal_docs:
    print("----")
    print("meta:", d.metadata)
    print(d.page_content)


=== NORMAL RETRIEVER (Raw Context) ===
----
meta: {'id': 'A'}
GDPR penalties can be up to €20M or 4% of global annual turnover.
----
meta: {'source': 'gdpr_overview'}

GDPR was introduced in 2018. It applies to processing personal data of people in the EU.
Penalties can be up to €20 million or 4% of global annual turnover, whichever is higher.
It also defines rights like access, rectification, and erasure.

----
meta: {'id': 'D'}
GDPR applies to processing personal data of people in the EU.
----
meta: {'id': 'B'}
GDPR includes rights like access, rectification, and erasure.


In [18]:
compressed = compression_retriever(query, base_retriever, llm, top_k=3)

print("=== Compressed Context ===")
for d in compressed:
    print("----")
    print("meta:", d.metadata)
    print(d.page_content)


=== Compressed Context ===
----
meta: {'id': 'A'}
GDPR penalties can be up to €20M or 4% of global annual turnover.
----
meta: {'source': 'gdpr_overview'}
Penalties can be up to €20 million or 4% of global annual turnover, whichever is higher.


In [19]:
def answer_with_compressed_context(query: str, compressed_docs: List[Document], llm: ChatOpenAI) -> str:
    context = "\n\n".join(d.page_content for d in compressed_docs)

    prompt = f"""
Answer the question using ONLY this context.

Context:
{context}

Question: {query}
""".strip()

    return llm.invoke(prompt).content

final_answer = answer_with_compressed_context(query, compressed, llm)
print(final_answer)


GDPR penalties can be up to €20 million or 4% of global annual turnover, whichever is higher.


## Multi-Query Retriever

In [21]:
import re

def generate_query_variants(user_query: str, llm, n: int = 4):
    prompt = f"""
Generate {n} different search queries that mean the same as the user question.
Use different wording and terminology.

User question:
{user_query}

Return each query on a new line.
""".strip()

    resp = llm.invoke(prompt).content

    # Clean output: strip leading bullets or numbering ("-", "•", "1.", "2)") and drop blank lines
    lines = [re.sub(r"^\s*(?:[-•*]+|\d+[.)])\s*", "", q).strip() for q in resp.split("\n")]
    queries = [q for q in lines if q]
    return queries[:n]


In [22]:
def multi_query_retriever(
    user_query: str,
    base_retriever,
    llm,
    per_query_k: int = 3
):
    """
    Manual Multi-Query Retriever compatible with LangChain v1
    """
    # Step 1: generate query variants
    queries = generate_query_variants(user_query, llm)

    print("Generated queries:")
    for q in queries:
        print("-", q)

    # Step 2: retrieve for each query, keeping at most per_query_k hits per variant
    all_docs = []
    for q in queries:
        hits = base_retriever.invoke(q)
        all_docs.extend(hits[:per_query_k])

    # Step 3: deduplicate (by content)
    seen = set()
    unique_docs = []
    for d in all_docs:
        key = d.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique_docs.append(d)

    return unique_docs


In [23]:
query = "How do we secure customer data?"

# Normal retrieval
normal_docs = base_retriever.invoke(query)

print("\n=== NORMAL RETRIEVER ===")
for d in normal_docs:
    print("-", d.page_content)

print("\n=== MULTI-QUERY RETRIEVER ===")

# Multi-query retrieval
multi_docs = multi_query_retriever(query, base_retriever, llm)

for d in multi_docs:
    print("-", d.page_content)



=== NORMAL RETRIEVER ===
- GDPR applies to processing personal data of people in the EU.
- 
SOC 2 is a compliance framework for service organizations.
It focuses on controls related to security, availability, processing integrity, confidentiality, and privacy.

- GDPR includes rights like access, rectification, and erasure.
- 
GDPR was introduced in 2018. It applies to processing personal data of people in the EU.
Penalties can be up to €20 million or 4% of global annual turnover, whichever is higher.
It also defines rights like access, rectification, and erasure.


=== MULTI-QUERY RETRIEVER ===
Generated queries:
- What are the best methods to protect customer information?
- How can we ensure the safety of client data?
- What steps should be taken to safeguard customer records?
- How do we implement security measures for customer data?
- GDPR includes rights like access, rectification, and erasure.
- 
GDPR was introduced in 2018. It applies to processing personal data of people in th

## Hybrid Retrieval (BM25 + Embeddings)

In [24]:
from typing import List, Tuple, Dict
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

from rank_bm25 import BM25Okapi
import re


In [25]:
# 2) Setup: docs + vector store (embeddings) + BM25 index

docs = [
    Document(page_content="GDPR penalties can be up to €20 million or 4% of global annual turnover.", metadata={"id": "D1", "topic": "gdpr"}),
    Document(page_content="GDPR defines data subject rights such as access and erasure.", metadata={"id": "D2", "topic": "gdpr"}),
    Document(page_content="The maximum fine is 4% of turnover for severe violations under GDPR.", metadata={"id": "D3", "topic": "gdpr"}),
    Document(page_content="Error code E102 occurs when the payment gateway signature is invalid.", metadata={"id": "D4", "topic": "payments"}),
    Document(page_content="To secure customer data, use encryption at rest and in transit and apply RBAC.", metadata={"id": "D5", "topic": "security"}),
]

In [26]:
emb = OpenAIEmbeddings()
vs = Chroma.from_documents(docs, embedding=emb)
vector_retriever = vs.as_retriever(search_kwargs={"k": 4})

In [27]:
def tokenize(text: str) -> List[str]:
    # very simple tokenizer for teaching
    return re.findall(r"[A-Za-z0-9%]+", text.lower())

corpus_tokens = [tokenize(d.page_content) for d in docs]
bm25 = BM25Okapi(corpus_tokens)

In [28]:
# Hybrid retrieval function (BM25 + embeddings → merge)

def bm25_search(query: str, top_k: int = 4) -> List[Tuple[Document, float]]:
    q_tokens = tokenize(query)
    scores = bm25.get_scores(q_tokens)  # score for each doc in same order as docs list
    ranked_idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    return [(docs[i], float(scores[i])) for i in ranked_idx]

def vector_search(query: str, top_k: int = 4) -> List[Tuple[Document, float]]:
    # The plain retriever interface does not expose scores, so query the vector
    # store directly with similarity_search_with_score.
    results = vs.similarity_search_with_score(query, k=top_k)
    # Returns a list of (Document, score); what the score means depends on the vector store.
    # It is converted to a "higher is better" score below.
    scored = []
    for d, s in results:
        # Chroma returns distance (lower is better) in many setups,
        # so convert to "higher is better":
        vec_score = -float(s)
        scored.append((d, vec_score))
    return scored

def hybrid_retrieve(
    query: str,
    top_k_final: int = 5,
    top_k_bm25: int = 4,
    top_k_vec: int = 4,
    w_bm25: float = 0.5,
    w_vec: float = 0.5
) -> List[Tuple[Document, float, Dict]]:
    bm25_hits = bm25_search(query, top_k=top_k_bm25)
    vec_hits  = vector_search(query, top_k=top_k_vec)

    # Combine scores by doc identity (use metadata id if present; else content)
    combined: Dict[str, Dict] = {}

    def key_of(doc: Document) -> str:
        return doc.metadata.get("id") or doc.page_content.strip()[:80]

    for d, s in bm25_hits:
        k = key_of(d)
        combined.setdefault(k, {"doc": d, "bm25": 0.0, "vec": 0.0})
        combined[k]["bm25"] = s

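    # Caveat: a document missing from vec_hits keeps vec=0.0. Vector scores here are
    # negated distances (zero or below), so 0.0 is the highest possible value and such
    # a document ends up with the top normalized vector score.
    for d, s in vec_hits:
        k = key_of(d)
        combined.setdefault(k, {"doc": d, "bm25": 0.0, "vec": 0.0})
        combined[k]["vec"] = s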

    # Normalize BM25 to 0-1 range for fair combination (simple min-max)
    bm25_scores = [v["bm25"] for v in combined.values()]
    bm25_min, bm25_max = min(bm25_scores), max(bm25_scores)
    def norm_bm25(x):
        if bm25_max == bm25_min:
            return 0.0
        return (x - bm25_min) / (bm25_max - bm25_min)

    # Normalize vector scores similarly
    vec_scores = [v["vec"] for v in combined.values()]
    vec_min, vec_max = min(vec_scores), max(vec_scores)
    def norm_vec(x):
        if vec_max == vec_min:
            return 0.0
        return (x - vec_min) / (vec_max - vec_min)

    ranked = []
    for v in combined.values():
        nb = norm_bm25(v["bm25"])
        nv = norm_vec(v["vec"])
        final_score = w_bm25 * nb + w_vec * nv
        ranked.append((v["doc"], final_score, {"bm25": v["bm25"], "vec": v["vec"]}))

    ranked.sort(key=lambda x: x[1], reverse=True)
    return ranked[:top_k_final]
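
In the weighted merge above, a document returned by only one retriever keeps a default score of 0.0 for the other, which interacts badly with negated distances. Reciprocal rank fusion (RRF) is a rank-based alternative that avoids comparing raw scores at all. This is a minimal sketch of RRF, not the method used in this notebook; the constant `k=60` is a commonly used default, and the document IDs and rankings are hypothetical.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per document."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # A document absent from a list simply receives no contribution from it.
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


# Hypothetical rankings (best first) from the two retrievers.
bm25_ranking = ["D3", "D1", "D2"]
vector_ranking = ["D3", "D4", "D1"]
merged = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Since only positions matter, no normalization is needed and distance-based and BM25 scores never have to share a scale. The trade-off is that score magnitudes, such as one very strong keyword match, are ignored.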


In [29]:
query = "Error code E102"
print("=== Vector only ===")
for d in vector_retriever.invoke(query):
    print("-", d.metadata, d.page_content)

print("\n=== BM25 only ===")
for d, s in bm25_search(query, top_k=3):
    print("-", d.metadata, f"(bm25={s:.2f})", d.page_content)

print("\n=== HYBRID ===")
for d, score, parts in hybrid_retrieve(query, top_k_final=4):
    print("-", d.metadata, f"(final={score:.3f}, bm25={parts['bm25']:.2f}, vec={parts['vec']:.3f})")
    print(" ", d.page_content)


=== Vector only ===
- {'id': 'D4', 'topic': 'payments'} Error code E102 occurs when the payment gateway signature is invalid.
- {'id': 'D2', 'topic': 'gdpr'} GDPR defines data subject rights such as access and erasure.
- {'id': 'D3', 'topic': 'gdpr'} The maximum fine is 4% of turnover for severe violations under GDPR.
- {'id': 'B'} GDPR includes rights like access, rectification, and erasure.

=== BM25 only ===
- {'id': 'D4', 'topic': 'payments'} (bm25=3.45) Error code E102 occurs when the payment gateway signature is invalid.
- {'id': 'D1', 'topic': 'gdpr'} (bm25=0.00) GDPR penalties can be up to €20 million or 4% of global annual turnover.
- {'id': 'D2', 'topic': 'gdpr'} (bm25=0.00) GDPR defines data subject rights such as access and erasure.

=== HYBRID ===
- {'id': 'D4', 'topic': 'payments'} (final=0.840, bm25=3.45, vec=-0.184)
  Error code E102 occurs when the payment gateway signature is invalid.
- {'id': 'D1', 'topic': 'gdpr'} (final=0.500, bm25=0.00, vec=0.000)
  GDPR penalties

In [30]:
query = "What are GDPR fines?"
print("\n=== HYBRID ===")
for d, score, parts in hybrid_retrieve(query, top_k_final=4):
    print("-", d.metadata, f"(final={score:.3f}, bm25={parts['bm25']:.2f}, vec={parts['vec']:.3f})")
    print(" ", d.page_content)



=== HYBRID ===
- {'id': 'D2', 'topic': 'gdpr'} (final=1.000, bm25=0.26, vec=0.000)
  GDPR defines data subject rights such as access and erasure.
- {'id': 'D3', 'topic': 'gdpr'} (final=0.519, bm25=0.24, vec=-0.235)
  The maximum fine is 4% of turnover for severe violations under GDPR.
- {'id': 'D1', 'topic': 'gdpr'} (final=0.513, bm25=0.22, vec=-0.221)
  GDPR penalties can be up to €20 million or 4% of global annual turnover.
- {'id': 'D4', 'topic': 'payments'} (final=0.500, bm25=0.00, vec=0.000)
  Error code E102 occurs when the payment gateway signature is invalid.


## Self-Query Retriever

In [76]:
import json, re
from typing import Dict, Any, List

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document


In [78]:
llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)
emb = OpenAIEmbeddings()


In [79]:
docs = [
    Document(
        page_content="GDPR penalties can be up to €20M or 4% of global annual turnover.",
        metadata={"year": 2018, "region": "EU", "topic": "gdpr", "doc_type": "law"}
    ),
    Document(
        page_content="GDPR defines rights like access, rectification, and erasure.",
        metadata={"year": 2018, "region": "EU", "topic": "gdpr", "doc_type": "law"}
    ),
    Document(
        page_content="India's DPDP Act introduces obligations for personal data processing.",
        metadata={"year": 2023, "region": "IN", "topic": "privacy", "doc_type": "law"}
    ),
    Document(
        page_content="SOC 2 covers security and availability controls for service organizations.",
        metadata={"year": 2017, "region": "US", "topic": "compliance", "doc_type": "framework"}
    ),
]

In [80]:
vs = Chroma.from_documents(docs, embedding=emb)

In [81]:
def extract_json_obj(text: str) -> Dict[str, Any]:
    m = re.search(r"(\{.*\})", text, flags=re.DOTALL)
    if not m:
        raise ValueError("No JSON object found in model output:\n" + text)
    return json.loads(m.group(1))

In [91]:
def chroma_and_filter(filters: dict | None):
    """
    Convert {"a":1, "b":2} -> {"$and":[{"a":1},{"b":2}]}
    Chroma requires exactly one top-level operator when multiple conditions exist.
    """
    if not filters:
        return None
    if any(k.startswith("$") for k in filters.keys()):
        # already operator-based
        return filters
    if len(filters) == 1:
        return filters
    return {"$and": [{k: v} for k, v in filters.items()]}


In [93]:
def build_self_query_plan(user_query: str, llm: ChatOpenAI) -> Dict[str, Any]:
    """
    Returns:
    {
      "search_query": "...",     # text to embed for similarity search
      "filters": {...},          # metadata filter dict
      "k": 4
    }
    """
    prompt = f"""
    You are a Self-Query Retriever planner.
    
    Your job: convert the user question into:
    1) a concise semantic search query (search_query)
    2) metadata filters (filters) using ONLY these fields:
       - year (integer)
       - region (string: EU, IN, US)
       - topic (string: gdpr, privacy, compliance)
       - doc_type (string: law, framework)
    
    Rules:
    - If the user did not specify a field, omit it from filters.
    - Do NOT guess unknown values.
    - Output ONLY valid JSON object exactly like:
    {{
      "search_query": "...",
      "filters": {{}},
      "k": 4
    }}
    
    User question:
    {user_query}
    """.strip()

    resp = llm.invoke(prompt).content
    return extract_json_obj(resp)


In [97]:
def self_query_retrieve(user_query: str, vs: Chroma, llm: ChatOpenAI, default_k: int = 4) -> List[Document]:
    plan = build_self_query_plan(user_query, llm)

    search_query = plan.get("search_query", user_query)
    filters = plan.get("filters", {})
    k = int(plan.get("k", default_k))

    print("=== Self-Query Plan ===")
    print("search_query:", search_query)
    print("filters:", filters)
    print("k:", k)

    retriever = vs.as_retriever(search_kwargs={"k": k, "filter": chroma_and_filter(filters)})
    return retriever.invoke(search_query)


In [99]:
q1 = "Show privacy law documents from India in 2023"
docs1 = self_query_retrieve(q1, vs, llm)

print("\n=== Results ===")
for d in docs1:
    print("-", d.metadata, d.page_content)


=== Self-Query Plan ===
search_query: privacy law
filters: {'region': 'IN', 'year': 2023, 'doc_type': 'law'}
k: 4

=== Results ===
- {'doc_type': 'law', 'region': 'IN', 'topic': 'privacy', 'year': 2023} India's DPDP Act introduces obligations for personal data processing.
- {'doc_type': 'law', 'region': 'IN', 'topic': 'privacy', 'year': 2023} India's DPDP Act introduces obligations for personal data processing.


In [101]:
q2 = "In EU, in 2018, what were GDPR penalties?"
docs2 = self_query_retrieve(q2, vs, llm)

print("\n=== Results ===")
for d in docs2:
    print("-", d.metadata, d.page_content)

=== Self-Query Plan ===
search_query: GDPR penalties
filters: {'region': 'EU', 'year': 2018, 'topic': 'gdpr'}
k: 4

=== Results ===
- {'doc_type': 'law', 'region': 'EU', 'topic': 'gdpr', 'year': 2018} GDPR penalties can be up to €20M or 4% of global annual turnover.
- {'doc_type': 'law', 'region': 'EU', 'topic': 'gdpr', 'year': 2018} GDPR penalties can be up to €20M or 4% of global annual turnover.
- {'doc_type': 'law', 'region': 'EU', 'topic': 'gdpr', 'year': 2018} GDPR defines rights like access, rectification, and erasure.
- {'doc_type': 'law', 'region': 'EU', 'topic': 'gdpr', 'year': 2018} GDPR defines rights like access, rectification, and erasure.


## Retrieval with Metadata Filters

In [103]:
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

In [105]:
docs = [
    Document(
        page_content="GDPR penalties can be up to €20 million or 4% of global annual turnover.",
        metadata={"id": "EU1", "region": "EU", "year": 2018, "topic": "gdpr"}
    ),
    Document(
        page_content="GDPR defines data subject rights like access and erasure.",
        metadata={"id": "EU2", "region": "EU", "year": 2018, "topic": "gdpr"}
    ),
    Document(
        page_content="India's DPDP Act defines obligations for personal data processing.",
        metadata={"id": "IN1", "region": "IN", "year": 2023, "topic": "privacy"}
    ),
    Document(
        page_content="SOC 2 is a compliance framework used mostly in US organizations.",
        metadata={"id": "US1", "region": "US", "year": 2017, "topic": "compliance"}
    ),
]


In [107]:
emb = OpenAIEmbeddings()
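# Note: without a collection_name, this appears to add to the same default in-memory
# collection used earlier in the notebook, which is why documents from previous sections
# (different metadata, repeated text) show up in the outputs below.
vs = Chroma.from_documents(docs, embedding=emb)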

# Normal retriever (no filtering)
base_retriever = vs.as_retriever(search_kwargs={"k": 3})


In [109]:
query = "What are penalties?"
normal_docs = base_retriever.invoke(query)

print("=== NORMAL RETRIEVAL (No Filters) ===")
for d in normal_docs:
    print("----")
    print("meta:", d.metadata)
    print(d.page_content)


=== NORMAL RETRIEVAL (No Filters) ===
----
meta: {'doc_type': 'law', 'region': 'EU', 'topic': 'gdpr', 'year': 2018}
GDPR penalties can be up to €20M or 4% of global annual turnover.
----
meta: {'doc_type': 'law', 'region': 'EU', 'topic': 'gdpr', 'year': 2018}
GDPR penalties can be up to €20M or 4% of global annual turnover.
----
meta: {'id': 'A'}
GDPR penalties can be up to €20M or 4% of global annual turnover.


In [111]:
filtered_retriever_eu = vs.as_retriever(
    search_kwargs={
        "k": 3,
        "filter": {"region": "EU"}   # ONLY EU docs
    }
)

filtered_docs = filtered_retriever_eu.invoke(query)

print("\n=== FILTERED RETRIEVAL (region=EU) ===")
for d in filtered_docs:
    print("----")
    print("meta:", d.metadata)
    print(d.page_content)



=== FILTERED RETRIEVAL (region=EU) ===
----
meta: {'doc_type': 'law', 'region': 'EU', 'topic': 'gdpr', 'year': 2018}
GDPR penalties can be up to €20M or 4% of global annual turnover.
----
meta: {'doc_type': 'law', 'region': 'EU', 'topic': 'gdpr', 'year': 2018}
GDPR penalties can be up to €20M or 4% of global annual turnover.
----
meta: {'id': 'EU1', 'region': 'EU', 'topic': 'gdpr', 'year': 2018}
GDPR penalties can be up to €20 million or 4% of global annual turnover.


In [115]:
filtered_retriever_eu_2018 = vs.as_retriever(
    search_kwargs={
        "k": 3,
        "filter": {
            "$and": [
                {"region": "EU"},
                {"year": 2018}
            ]
        }
    }
)

docs_eu_2018 = filtered_retriever_eu_2018.invoke("GDPR penalties")

print("\n=== FILTERED RETRIEVAL (region=EU AND year=2018) ===")
for d in docs_eu_2018:
    print("----")
    print("meta:", d.metadata)
    print(d.page_content)



=== FILTERED RETRIEVAL (region=EU AND year=2018) ===
----
meta: {'id': 'EU1', 'region': 'EU', 'topic': 'gdpr', 'year': 2018}
GDPR penalties can be up to €20 million or 4% of global annual turnover.
----
meta: {'doc_type': 'law', 'region': 'EU', 'topic': 'gdpr', 'year': 2018}
GDPR penalties can be up to €20M or 4% of global annual turnover.
----
meta: {'doc_type': 'law', 'region': 'EU', 'topic': 'gdpr', 'year': 2018}
GDPR penalties can be up to €20M or 4% of global annual turnover.
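
To make the filter semantics above concrete, here is a minimal pure-Python sketch of how an equality filter combined with `$and` / `$or` selects metadata records. `matches` is a hypothetical function written for illustration only; it covers a small subset of the filter syntax and is not how Chroma evaluates filters internally.

```python
from typing import Any, Dict


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Return True if `metadata` satisfies a simplified filter.

    Supported forms (illustrative subset): {"key": value} equality,
    {"$and": [...]} and {"$or": [...]}.
    """
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    if "$or" in where:
        return any(matches(metadata, clause) for clause in where["$or"])
    # Plain equality: every listed key must be present with the same value.
    return all(key in metadata and metadata[key] == value for key, value in where.items())


# Metadata records copied from the outputs shown earlier.
records = [
    {"doc_type": "law", "region": "EU", "topic": "gdpr", "year": 2018},
    {"id": "A"},
    {"id": "EU1", "region": "EU", "topic": "gdpr", "year": 2018},
]

eu_2018 = {"$and": [{"region": "EU"}, {"year": 2018}]}
kept = [m for m in records if matches(m, eu_2018)]
print(kept)
```

The record `{'id': 'A'}` has no `region` key, so it is dropped. This mirrors why that document appears in the unfiltered results but in neither filtered result.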


## RAG with Chat History

In [117]:
from typing import List, Tuple
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document


In [119]:
llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)
emb = OpenAIEmbeddings()


In [121]:
docs = [
    Document(page_content="GDPR is an EU privacy regulation governing personal data processing.", metadata={"topic":"gdpr"}),
    Document(page_content="GDPR penalties can be up to €20 million or 4% of global annual turnover.", metadata={"topic":"gdpr"}),
    Document(page_content="SOC2 is a compliance framework focused on controls for service organizations.", metadata={"topic":"soc2"}),
    Document(page_content="Late payment penalties may apply for overdue invoices.", metadata={"topic":"finance"}),
]

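# NOTE: no collection_name is passed, so Chroma reuses its default collection.
# Documents indexed by earlier cells are therefore still searchable, which is
# why the outputs below show metadata that is not defined in `docs` above.
# Pass a unique collection_name=... to isolate this demo.
vs = Chroma.from_documents(docs, embedding=emb)
base_retriever = vs.as_retriever(search_kwargs={"k": 3})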


In [123]:
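def answer_from_docs(question: str, docs: List[Document], llm: ChatOpenAI) -> str:
    """Answer `question` using only the text of `docs` as context."""
    # Join the chunk texts, separated by blank lines, into one context block.
    context = "\n\n".join(d.page_content for d in docs)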
    prompt = f"""
Answer the question using ONLY the context below.

Context:
{context}

Question: {question}
""".strip()
    return llm.invoke(prompt).content


In [125]:
# RAG without history

chat_history = [
    ("human", "Explain GDPR in simple terms."),
    ("ai", "GDPR is an EU privacy regulation about personal data processing.")
]

followup = "What about penalties?"

# Retrieval uses only follow-up (ambiguous)
docs_no_history = base_retriever.invoke(followup)

print("=== Retrieved WITHOUT history (often wrong) ===")
for d in docs_no_history:
    print("-", d.metadata, d.page_content)

print("\n=== Answer WITHOUT history ===")
print(answer_from_docs(followup, docs_no_history, llm))


=== Retrieved WITHOUT history (often wrong) ===
- {'topic': 'finance'} Late payment penalties may apply for overdue invoices.
- {'doc_type': 'law', 'region': 'EU', 'topic': 'gdpr', 'year': 2018} GDPR penalties can be up to €20M or 4% of global annual turnover.
- {'doc_type': 'law', 'region': 'EU', 'topic': 'gdpr', 'year': 2018} GDPR penalties can be up to €20M or 4% of global annual turnover.

=== Answer WITHOUT history ===
Penalties may include late payment penalties for overdue invoices and GDPR penalties, which can be up to €20 million or 4% of global annual turnover.


In [127]:
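def rewrite_with_history(chat_history: List[Tuple[str, str]], user_question: str, llm: ChatOpenAI) -> str:
    """Rewrite a follow-up into a standalone question using the chat history."""
    # Flatten (role, message) pairs into "ROLE: message" lines for the prompt.
    history_text = "\n".join(f"{role.upper()}: {msg}" for role, msg in chat_history)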

    prompt = f"""
You are a query rewriter for a RAG system.

Given the chat history and the latest user question, rewrite the latest question
into a standalone question with all necessary context included.

Chat history:
{history_text}

Latest question:
{user_question}

Return ONLY the rewritten standalone question.
""".strip()

    return llm.invoke(prompt).content.strip()


In [129]:
standalone_query = rewrite_with_history(chat_history, followup, llm)
print("Standalone query:", standalone_query)

docs_with_history = base_retriever.invoke(standalone_query)

print("\n=== Retrieved WITH history (better) ===")
for d in docs_with_history:
    print("-", d.metadata, d.page_content)

print("\n=== Answer WITH history-aware retrieval ===")
# Only retrieval uses the rewritten query; the answer prompt still gets the original follow-up.
print(answer_from_docs(followup, docs_with_history, llm))


Standalone query: What are the penalties under the GDPR for non-compliance?

=== Retrieved WITH history (better) ===
- {'id': 'D1', 'topic': 'gdpr'} GDPR penalties can be up to €20 million or 4% of global annual turnover.
- {'id': 'EU1', 'region': 'EU', 'topic': 'gdpr', 'year': 2018} GDPR penalties can be up to €20 million or 4% of global annual turnover.
- {'topic': 'gdpr'} GDPR penalties can be up to €20 million or 4% of global annual turnover.

=== Answer WITH history-aware retrieval ===
GDPR penalties can be up to €20 million or 4% of global annual turnover.
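
The rewrite, retrieve, and answer steps above can be chained into a single turn handler. The sketch below is one possible arrangement, not the notebook's own code: `history_aware_answer` and its `max_turns` parameter are hypothetical, and the three steps are passed in as plain callables so the example runs without an API key (the `fake_*` functions are stand-ins for the LLM and the retriever).

```python
from typing import Callable, List, Sequence, Tuple

History = List[Tuple[str, str]]  # (role, message) pairs, as in chat_history


def history_aware_answer(
    question: str,
    chat_history: History,
    rewrite: Callable[[History, str], str],
    retrieve: Callable[[str], Sequence],
    answer: Callable[[str, Sequence], str],
    max_turns: int = 6,
) -> Tuple[str, str]:
    """Run one chat turn and return (standalone_query, reply)."""
    # Only the most recent messages are sent to the rewriter to bound prompt size.
    recent = chat_history[-max_turns:] if max_turns > 0 else []
    # With no usable history the question is already standalone.
    standalone = rewrite(recent, question) if recent else question
    docs = retrieve(standalone)
    reply = answer(standalone, docs)
    # Record the turn so the next follow-up can be resolved against it.
    chat_history.append(("human", question))
    chat_history.append(("ai", reply))
    return standalone, reply


# Stand-ins for the LLM rewriter, the retriever, and the answer step.
def fake_rewrite(history, question):
    return f"{question} (in the context of GDPR)"


def fake_retrieve(query):
    return ["GDPR penalties can be up to €20 million or 4% of global annual turnover."]


def fake_answer(query, docs):
    return docs[0]


history = [
    ("human", "Explain GDPR in simple terms."),
    ("ai", "GDPR is an EU privacy regulation about personal data processing."),
]
standalone, reply = history_aware_answer(
    "What about penalties?", history, fake_rewrite, fake_retrieve, fake_answer
)
print(standalone)
print(reply)
```

In the notebook, the callables could be wired to the existing helpers, for example `rewrite=lambda h, q: rewrite_with_history(h, q, llm)`, `retrieve=base_retriever.invoke`, and `answer=lambda q, d: answer_from_docs(q, d, llm)`. Unlike the cell above, this sketch passes the standalone query to the answer step as well; whether that is preferable depends on how much of the original phrasing the answer should reflect.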
