# RAG Failure #3: The Entity Ambiguity Trap (Polysemy)

## The Problem
Vector search ranks documents by semantic similarity, but it struggles with **polysemy**: a single word form that carries several unrelated meanings.
If a user asks about "Jaguar's performance", the vector database might retrieve documents about the **animal** (hunting performance), the **car company** (financial performance), or the **software** (processing performance). The LLM then mixes these unrelated facts into one answer, which reads like a hallucination even though every fact came from a retrieved document.

## The Scenario: Financial Analysis
**Query:** "How did Jaguar perform in Q3?"

**The Ambiguous Data:**
1.  **Doc A (Target):** "Jaguar Land Rover reported a 12% revenue spike in Q3 due to strong SUV sales."
2.  **Doc B (Distractor):** "The jaguar population in the Amazon showed high hunting performance in Q3 due to favorable weather."
3.  **Doc C (Distractor):** "Jaguar (macOS 10.2) system performance benchmarks improved significantly in the latest Q3 patch."

**Naive RAG Failure:** The retriever can return Doc A and Doc B together: both mention "Jaguar" and "Q3", and Doc B's "performance" sits close to the query's "perform". The LLM might then say: *"Jaguar reported a revenue spike due to hunting performance in the Amazon."*
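
One way to see why surface cues cannot separate these documents is to count the tokens each one shares with the query. The sketch below is a crude lexical stand-in for embedding similarity, not what FAISS actually computes; `tokenize` and `shared_tokens` are hypothetical helper names, and the texts are the ones used in the Step 3 code cell.

```python
import re

QUERY = "How did Jaguar perform in Q3?"
DOCS = {
    "Doc A (company)": "Jaguar Land Rover reported a 12% revenue spike in Q3 due to strong SUV sales.",
    "Doc B (animal)": "The jaguar population in the Amazon showed high hunting performance in Q3 due to favorable weather.",
    "Doc C (software)": "Jaguar (macOS 10.2) system performance benchmarks improved significantly in the latest Q3 patch.",
}

def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def shared_tokens(query, doc):
    """Return the tokens that appear in both the query and the document, sorted."""
    return sorted(tokenize(query) & tokenize(doc))

overlaps = {name: shared_tokens(QUERY, text) for name, text in DOCS.items()}
for name, shared in overlaps.items():
    print(f"{name}: {shared}")
```

All three documents share exactly the same tokens with the query (`in`, `jaguar`, `q3`), so nothing on the surface marks the company document as the intended one.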

**KG Solution:** We implement **Entity Disambiguation** during ingestion, creating a distinct node for each sense: `Jaguar (Company)`, `Jaguar (Animal)`, and `Jaguar (Software)`. At query time we keep only the nodes whose type matches the user's intent.
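
Before the full pipeline, the idea can be sketched with plain dictionaries. This is a minimal illustration with hand-written nodes, not the notebook's implementation (which builds the graph with an LLM and `networkx`); `NODES` and `facts_for_type` are hypothetical names.

```python
# Hypothetical, hand-built typed nodes. Each sense of "Jaguar" gets its own entry.
NODES = {
    "Jaguar (Company)": {"type": "Company", "facts": ["reported a 12% revenue spike in Q3"]},
    "Jaguar (Animal)": {"type": "Animal", "facts": ["showed high hunting performance in Q3"]},
    "Jaguar (Software)": {"type": "Software", "facts": ["improved system performance benchmarks"]},
}

def facts_for_type(nodes, wanted_type):
    """Return 'node fact' strings only for nodes whose type matches wanted_type."""
    return [
        f"{name} {fact}"
        for name, attrs in nodes.items()
        if attrs["type"] == wanted_type
        for fact in attrs["facts"]
    ]

company_facts = facts_for_type(NODES, "Company")
print(company_facts)
```

Because the type lives on the node rather than in the text, the animal and software facts never reach the prompt when the intent is `Company`.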

In [None]:
# --- Step 1: Environment Setup ---
!pip install -q langchain langchain-community langchain-huggingface faiss-cpu networkx transformers sentence-transformers accelerate bitsandbytes

In [None]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
import networkx as nx

# --- Step 2: Load Model ---
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"Loading {model_id}...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    max_new_tokens=256, 
    temperature=0.1, 
    do_sample=True
)

llm = HuggingFacePipeline(pipeline=pipe)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("Model loaded. Pipeline ready.")

Loading TinyLlama/TinyLlama-1.1B-Chat-v1.0...
Model loaded. Pipeline ready.


In [None]:
# Document lives in langchain_core; the older langchain.docstore path is deprecated
from langchain_core.documents import Document

# --- Step 3: Simulate Ambiguous Data ---
raw_texts = [
    # Context: Company
    "Jaguar Land Rover reported a 12% revenue spike in Q3 due to strong SUV sales.",
    
    # Context: Animal (Distractor - high lexical overlap with 'performance' and 'Q3')
    "The jaguar population in the Amazon showed high hunting performance in Q3 due to favorable weather.",
    
    # Context: Software (Distractor)
    "Jaguar (macOS 10.2) system performance benchmarks improved significantly in the latest Q3 patch."
]

docs = [Document(page_content=t) for t in raw_texts]
print(f"Created {len(docs)} Ambiguous Documents.")
for i, d in enumerate(docs):
    print(f"Doc {i+1}: {d.page_content}")

Created 3 Ambiguous Documents.
Doc 1: Jaguar Land Rover reported a 12% revenue spike in Q3 due to strong SUV sales.
Doc 2: The jaguar population in the Amazon showed high hunting performance in Q3 due to favorable weather.
Doc 3: Jaguar (macOS 10.2) system performance benchmarks improved significantly in the latest Q3 patch.


In [None]:
# --- Step 4: Naive RAG Implementation ---
from langchain_community.vectorstores import FAISS

print("\n--- NAIVE RAG (The Trap) ---")
query = "How did Jaguar perform in Q3?"
print(f"Query: {query}")

# 1. Indexing
vectorstore = FAISS.from_documents(docs, embeddings)

# 2. Retrieval
# Note: 'hunting performance' (Doc 2) is semantically close to 'perform' (Query)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
retrieved_docs = retriever.invoke(query)

print("\nRetrieved Context (k=2):")
context_str = ""
for i, d in enumerate(retrieved_docs):
    print(f"{i+1}. {d.page_content}")
    context_str += d.page_content + "\n"

# 3. Generation
prompt = f"<|system|>\nAnswer the question based on the context.\n<|user|>\nContext:\n{context_str}\nQuestion:\n{query}\n<|assistant|>"
response = llm.invoke(prompt)
cleaned_response = response.split("<|assistant|>")[-1].strip()

print("\nLLM Answer:")
print(cleaned_response)


--- NAIVE RAG (The Trap) ---
Query: How did Jaguar perform in Q3?

Retrieved Context (k=2):
1. The jaguar population in the Amazon showed high hunting performance in Q3 due to favorable weather.
2. Jaguar Land Rover reported a 12% revenue spike in Q3 due to strong SUV sales.

LLM Answer:
In Q3, the jaguar population in the Amazon showed high hunting performance due to favorable weather, while Jaguar Land Rover reported a 12% revenue spike due to strong SUV sales.

ANALYSIS:
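The LLM merged the facts without checking which sense of "Jaguar" the question meant, treating the animal and the car company as equally relevant. Strictly speaking nothing is invented here, since both statements come from the retrieved context, but for a financial analyst half of this answer is irrelevant noise.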


In [None]:
# --- Step 5: Industry Standard KG Construction (Entity Resolution) ---
# We don't just extract 'Jaguar'. We ask the LLM to DISAMBIGUATE the type.

kg = nx.DiGraph()

def extract_and_disambiguate(text):
    """
    Prompts the LLM to identify 'Jaguar' and assign a specific TYPE.
    Returns: a list of '|'-separated fields, ideally [entity_name, relation, fact],
    or an empty list if the model's output contains no '|'.
    """
    prompt = f"""<|system|>
    You are a Data Engineer. Analyze the text.
    Identify the entity 'Jaguar'. Classify it as one of: [Company, Animal, Software].
    Output format: Jaguar (Type) | Relation | Fact
    Example: "The cat runs." -> Jaguar (Animal) | does | runs
    <|user|>
    Text: {text}
    <|assistant|>"""
    
    raw = llm.invoke(prompt)
    out = raw.split("<|assistant|>")[-1].strip()
    
    parts = []
    if "|" in out:
        parts = [p.strip() for p in out.split("|")]
    return parts

print("\n--- CONTEXT-AWARE ENTITY EXTRACTION ---")

for doc in docs:
    print(f"\nChunk: {doc.page_content}")
    parts = extract_and_disambiguate(doc.page_content)
    
    if len(parts) >= 3:
        entity_node, rel, fact = parts[0], parts[1], parts[2]
        
        # Pull the type out of "Jaguar (Type)"; fall back to "Unknown" if the parentheses are missing
        inferred_type = entity_node.split("(")[-1].replace(")", "").strip() if "(" in entity_node else "Unknown"
        
        print(f"   [Analysis]: Context implies {inferred_type.upper()}.")
        print(f"   [Graph Node]: Created '{entity_node}'")
        print(f"   [Edge]: {entity_node} --[{rel}]--> {fact}")
        
        # Add to graph with explicit 'type' attribute
        kg.add_node(entity_node, type=inferred_type)
        kg.add_edge(entity_node, fact, relation=rel)


--- CONTEXT-AWARE ENTITY EXTRACTION ---

Chunk: Jaguar Land Rover reported a 12% revenue spike in Q3 due to strong SUV sales.
   [Analysis]: Context implies COMPANY.
   [Graph Node]: Created 'Jaguar (Company)'
   [Edge]: Jaguar (Company) --[reported]--> 12% revenue spike

Chunk: The jaguar population in the Amazon showed high hunting performance in Q3 due to favorable weather.
   [Analysis]: Context implies ANIMAL.
   [Graph Node]: Created 'Jaguar (Animal)'
   [Edge]: Jaguar (Animal) --[showed]--> high hunting performance

Chunk: Jaguar (macOS 10.2) system performance benchmarks improved significantly in the latest Q3 patch.
   [Analysis]: Context implies SOFTWARE.
   [Graph Node]: Created 'Jaguar (Software)'
   [Edge]: Jaguar (Software) --[improved]--> benchmarks
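
A 1.1B-parameter model will not always follow the `Jaguar (Type) | Relation | Fact` format, so the extraction step benefits from stricter validation than a bare `split("|")`. The sketch below is one possible approach, assuming the three allowed types from the prompt; `ALLOWED_TYPES`, `TRIPLE_RE`, `parse_triple`, and `first_valid_triple` are hypothetical names that do not appear in the notebook.

```python
import re

# Assumption: the same closed set of types that the extraction prompt lists
ALLOWED_TYPES = {"Company", "Animal", "Software"}

# Matches e.g. "Jaguar (Company) | reported | 12% revenue spike"
TRIPLE_RE = re.compile(
    r"^\s*(?P<name>[^|()]+?)\s*\((?P<type>[^()|]+)\)"
    r"\s*\|\s*(?P<rel>[^|]+?)\s*\|\s*(?P<fact>.+?)\s*$"
)

def parse_triple(line):
    """Parse one line into (node, type, relation, fact), or None if it is malformed."""
    m = TRIPLE_RE.match(line)
    if m is None:
        return None
    entity_type = m.group("type").strip().capitalize()
    if entity_type not in ALLOWED_TYPES:
        return None  # reject types outside the closed set instead of creating stray nodes
    node = f"{m.group('name').strip()} ({entity_type})"
    return node, entity_type, m.group("rel").strip(), m.group("fact").strip()

def first_valid_triple(llm_output):
    """Scan the model output line by line and return the first well-formed triple."""
    for line in llm_output.splitlines():
        parsed = parse_triple(line)
        if parsed is not None:
            return parsed
    return None

sample = "Sure, here you go:\nJaguar (Animal) | showed | high hunting performance\n"
print(first_valid_triple(sample))
```

Rejecting unknown types matters for inputs like Doc 3: if the model echoes `Jaguar (macOS 10.2)` instead of `Jaguar (Software)`, the parser returns `None` and the chunk can be retried or logged rather than silently stored under a wrong type.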


In [None]:
# --- Step 6: The Solution (Filtered Retrieval) ---

print("\n--- GRAPH INTENT FILTERING ---")

def resolve_query_intent(query):
    """
    In a real app, a classifier determines if the question is Financial, Biological, or Tech.
    Here, we simulate a Financial Analyst intent.
    """
    return "Company"

print(f"User Query: '{query}'")
intent_type = resolve_query_intent(query)
print(f"Detected Intent Type: {intent_type.upper()} (Based on keywords 'perform', 'Q3', 'financials' or external classifier)")

print("\nGraph Search:")
print("  Scanning nodes...")
relevant_facts = []

# Iterate over nodes. Only keep the ones whose 'type' attribute matches our intent type.
# Fact nodes were created by add_edge without a type, so they read as 'Unknown' and are skipped.
for node, attrs in kg.nodes(data=True):
    node_type = attrs.get('type', 'Unknown')
    
    # Disambiguation logic: compare the stored type instead of substring-matching the node name
    if node_type.lower() == intent_type.lower():
        print(f"  - Found '{node}' -> MATCH! (Keeping neighbors)")
        # Get connected facts
        neighbors = list(kg.successors(node))
        for n in neighbors:
            rel = kg[node][n]['relation']
            relevant_facts.append(f"{node} {rel} {n}.")
    elif node_type != 'Unknown':
        print(f"  - Found '{node}' -> Ignore (Wrong Type)")

filtered_context = " ".join(relevant_facts)
print(f"\nFiltered Context: {filtered_context}")

print("\nFinal Answer:")
final_prompt = f"<|system|>Answer based on context.<|user|>Context: {filtered_context}\nQuestion: {query}\n<|assistant|>"
final_res = llm.invoke(final_prompt)
print(final_res.split("<|assistant|>")[-1].strip())


--- GRAPH INTENT FILTERING ---
User Query: 'How did Jaguar perform in Q3?'
Detected Intent Type: COMPANY (Based on keywords 'perform', 'Q3', 'financials' or external classifier)

Graph Search:
  Scanning nodes...
  - Found 'Jaguar (Company)' -> MATCH! (Keeping neighbors)
  - Found 'Jaguar (Animal)' -> Ignore (Wrong Type)
  - Found 'Jaguar (Software)' -> Ignore (Wrong Type)

Filtered Context: Jaguar (Company) reported 12% revenue spike.

Final Answer:
Jaguar Land Rover reported a 12% revenue spike in Q3 due to strong SUV sales.
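
`resolve_query_intent` above is hard-coded to return `"Company"`. As a minimal sketch of what a replacement could look like, the function below scores the query against small keyword lists. The keyword sets, `INTENT_KEYWORDS`, and `resolve_query_intent_by_keywords` are illustrative assumptions, not part of the notebook; a production system would more likely use a trained classifier or user-profile metadata.

```python
import re

# Hypothetical keyword lists, chosen only for illustration
INTENT_KEYWORDS = {
    "Company": {"revenue", "sales", "earnings", "profit", "stock", "financial"},
    "Animal": {"habitat", "population", "hunting", "species", "prey"},
    "Software": {"macos", "patch", "benchmark", "install", "version"},
}

def resolve_query_intent_by_keywords(query, default=None):
    """Return the entity type whose keywords best match the query.

    Falls back to `default` when no keyword matches or the top score is tied.
    """
    tokens = set(re.findall(r"[a-z0-9]+", query.lower()))
    scores = {etype: len(tokens & kws) for etype, kws in INTENT_KEYWORDS.items()}
    best_score = max(scores.values())
    winners = [etype for etype, score in scores.items() if score == best_score]
    if best_score == 0 or len(winners) > 1:
        return default  # the query alone does not disambiguate
    return winners[0]

print(resolve_query_intent_by_keywords("What was Jaguar's revenue in Q3?"))
print(resolve_query_intent_by_keywords("How did Jaguar perform in Q3?", default="Company"))
```

Note that the scenario query, "How did Jaguar perform in Q3?", contains none of these keywords, so the sketch falls back to `default`. The query text alone is ambiguous; the `Company` intent has to come from outside it, for example from knowing that the user is a financial analyst.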
