# RAG Failure #5: The Implicit Relationship Hallucination

## The Problem
Vector Search measures **Semantic Closeness**, not **Relational Truth**. 
If Entity A and Entity B appear in the same paragraph (e.g., a news summary listing industry players), that paragraph scores high on similarity for any query that mentions both entities. The LLM, seeing the two names side by side in the context window, frequently hallucinates a direct relationship (like a partnership) that does not exist.
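
To make the co-occurrence effect concrete, here is a minimal sketch that ranks the three scenario documents against the query using plain bag-of-words cosine similarity. This is a toy stand-in, not the embedding model used later in this notebook; the helper names `bow` and `cosine` and the variables `toy_query` and `toy_docs` are illustrative assumptions.

```python
import math
import re
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector as a token -> count mapping."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

toy_query = "Does Tesla have a battery partnership with Toyota?"
toy_docs = {
    "A": "Panasonic signed a verified contract to supply 4680 cells to Tesla.",
    "B": "Toyota is independently developing its own solid-state battery technology.",
    "C": "At the 2024 Global Energy Summit, executives from Tesla, Toyota, and "
         "Samsung gathered to discuss battery regulations.",
}

# Rank documents by similarity to the query, best match first.
q_vec = bow(toy_query)
ranked = sorted(toy_docs, key=lambda k: cosine(q_vec, bow(toy_docs[k])), reverse=True)
for name in ranked:
    print(name, round(cosine(q_vec, bow(toy_docs[name])), 3))
```

Doc C ranks first simply because it shares the most query terms ("Tesla", "Toyota", "battery"), even though it says nothing about a partnership. Dense embeddings are more sophisticated than word counts, but they reward the same kind of topical overlap.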

## The Scenario: Supply Chain Verification
**Query:** "Does **Tesla** have a battery partnership with **Toyota**?"

**The Trap (Co-occurrence Data):**
1.  **Doc A (Truth):** "**Panasonic** signed a verified contract to supply 4680 cells to **Tesla**."
2.  **Doc B (Truth):** "**Toyota** is independently developing its own solid-state battery technology."
3.  **Doc C (The Hallucination Trap):** "At the 2024 Global Energy Summit, executives from **Tesla**, **Toyota**, and **Samsung** gathered to discuss battery regulations."

**Naive RAG Failure:** Naive RAG retrieves Doc C because it contains "Tesla", "Toyota", and "battery". The LLM then infers that the two companies are working together.

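**KG Solution:** We use **Explicit Edge Verification**. We query the graph for one specific edge. In Cypher-style notation (assuming each node carries a `name` property): `MATCH (a {name: "Tesla"})-[:PARTNERS_WITH]-(b {name: "Toyota"}) RETURN a, b`. If the edge doesn't exist, the answer is NO, regardless of how often the two names appear in the same text.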

In [None]:
# --- Step 1: Environment Setup ---
!pip install -q langchain langchain-community langchain-huggingface faiss-cpu networkx transformers sentence-transformers accelerate bitsandbytes

In [None]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
import networkx as nx

# --- Step 2: Load Model ---
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"Loading {model_id}...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    max_new_tokens=256, 
    temperature=0.1, 
    do_sample=True
)

llm = HuggingFacePipeline(pipeline=pipe)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("Model loaded. Pipeline ready.")

Loading TinyLlama/TinyLlama-1.1B-Chat-v1.0...
Model loaded. Pipeline ready.


In [None]:
from langchain_core.documents import Document  # current import path; langchain.docstore is deprecated

# --- Step 3: Simulate The "Trap" Data ---
raw_texts = [
    # Verified Relationship
    "Panasonic Corporation has signed a verified long-term contract to supply 4680 battery cells to Tesla Inc.",
    
    # Independent Entity
    "Toyota Motor Corp is independently developing its own proprietary solid-state battery technology for 2027.",
    
    # The Co-occurrence Trap (Semantic Bleed-over)
    # Uses keywords: Tesla, Toyota, Battery, Summit, Discuss
    "At the 2024 Global Energy Summit, major players like Tesla, Toyota, and Samsung gathered to discuss future battery regulations and market trends."
]

docs = [Document(page_content=t) for t in raw_texts]
print(f"Created {len(docs)} Documents.")
for i, d in enumerate(docs):
    print(f"Doc {i+1}{' (The Trap)' if i==2 else ''}: {d.page_content}")

Created 3 Documents.
Doc 1: Panasonic Corporation has signed a verified long-term contract to supply 4680 battery cells to Tesla Inc.
Doc 2: Toyota Motor Corp is independently developing its own proprietary solid-state battery technology for 2027.
Doc 3 (The Trap): At the 2024 Global Energy Summit, major players like Tesla, Toyota, and Samsung gathered to discuss future battery regulations and market trends.


In [None]:
# --- Step 4: Naive RAG Implementation ---
from langchain_community.vectorstores import FAISS

print("\n--- NAIVE RAG (The Hallucination) ---")
query = "Does Tesla have a battery partnership with Toyota?"
print(f"Query: {query}")

# 1. Indexing
vectorstore = FAISS.from_documents(docs, embeddings)

# 2. Retrieval
# Doc 3 (The Summit) scores highest because it mentions BOTH entities and the topic (Battery).
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
retrieved_docs = retriever.invoke(query)

print("\nRetrieved Context (k=2):")
context_str = ""
for i, d in enumerate(retrieved_docs):
    print(f"{i+1}. {d.page_content}")
    context_str += d.page_content + "\n"

# 3. Generation
prompt = f"<|system|>\nAnswer the question based on the context.\n<|user|>\nContext:\n{context_str}\nQuestion:\n{query}\n<|assistant|>"
response = llm.invoke(prompt)
cleaned_response = response.split("<|assistant|>")[-1].strip()

print("\nLLM Answer:")
print(cleaned_response)


--- NAIVE RAG (The Hallucination) ---
Query: Does Tesla have a battery partnership with Toyota?

Retrieved Context (k=2):
1. At the 2024 Global Energy Summit, major players like Tesla, Toyota, and Samsung gathered to discuss future battery regulations and market trends.
2. Panasonic Corporation has signed a verified long-term contract to supply 4680 battery cells to Tesla Inc.

LLM Answer:
Based on the context, Tesla and Toyota gathered at the 2024 Global Energy Summit to discuss battery regulations, implying they are major players in the market together. While Panasonic supplies Tesla, the context suggests Tesla and Toyota are involved in battery discussions.

ANALYSIS:
The LLM failed to give a hard "NO".
Because the top-ranked retrieved passage (Doc 3, the summit document) puts both companies in the same room discussing batteries, the LLM infers a soft relationship ("implying they are major players in the market together").
For a strict business query, this is a failure. We need a definitive False.


In [None]:
# --- Step 5: Strict Schema Extraction ---
# To fix this, we enforce a SCHEMA.
# The LLM extracts relations, but we only add them to the graph if they match our Business Logic.

kg = nx.DiGraph()

ALLOWED_RELATIONS = ["PARTNERS_WITH", "SUPPLIES", "DEVELOPS", "ACQUIRED"]

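def strict_extract(text):
    """
    Ask the LLM for one relation in the form 'Entity A | RELATION | Entity B'.
    Returns [subject, relation, object], or [] if no such line is found.
    Schema filtering happens afterwards, in plain Python.
    """
    # The prompt deliberately offers weak verbs (DISCUSSED_WITH, ATTENDED) so the
    # model can label co-occurrence honestly; the schema filter then rejects them.
    prompt = f"""<|system|>
    Extract the business relationship.
    Format: Entity A | RELATION | Entity B
    Use concise verbs like: SUPPLIES, PARTNERS_WITH, DEVELOPS, DISCUSSED_WITH, ATTENDED.
    <|user|>
    Text: {text}
    <|assistant|>"""
    
    raw = llm.invoke(prompt)
    out = raw.split("<|assistant|>")[-1].strip()
    
    # Use the first line that looks like a triple, so trailing chatter
    # from the model does not leak into the object name.
    for line in out.splitlines():
        if line.count("|") >= 2:
            return [p.strip() for p in line.split("|")]
    return []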

print("\n--- STRICT SCHEMA KG EXTRACTION ---")
print(f"Allowed Relations: {ALLOWED_RELATIONS}")

for doc in docs:
    print(f"\nParsing: {doc.page_content}")
    parts = strict_extract(doc.page_content)
    
    if len(parts) >= 3:
        subj, rel, obj = parts[0], parts[1], parts[2]
        rel = rel.upper().replace(" ", "_") # Normalize
        
        print(f"   [Raw]: {subj} | {rel} | {obj}")
        
        # --- THE FIX: FILTERING ---
        if rel in ALLOWED_RELATIONS:
            kg.add_edge(subj, obj, relation=rel)
            print(f"   [Action]: Accepted Edge ({subj})-[{rel}]->({obj})")
        else:
            print(f"   [Action]: REJECTED. Relation '{rel}' not in schema.")


--- STRICT SCHEMA KG EXTRACTION ---
Allowed Relations: ['PARTNERS_WITH', 'SUPPLIES', 'DEVELOPS', 'ACQUIRED']

Parsing: Panasonic Corporation has signed a verified long-term contract to supply 4680 battery cells to Tesla Inc.
   [Raw]: Panasonic | SUPPLIES | Tesla
   [Action]: Accepted Edge (Panasonic)-[SUPPLIES]->(Tesla)

Parsing: Toyota Motor Corp is independently developing its own proprietary solid-state battery technology for 2027.
   [Raw]: Toyota | DEVELOPS | Solid-state Battery
   [Action]: Accepted Edge (Toyota)-[DEVELOPS]->(Solid-state Battery)

Parsing: At the 2024 Global Energy Summit, major players like Tesla, Toyota, and Samsung gathered to discuss future battery regulations and market trends.
   [Raw]: Tesla | DISCUSSED_WITH | Toyota
   [Action]: REJECTED. Relation 'DISCUSSED_WITH' not in schema.
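
Small models do not always return a clean single-line triple, so the parsing step is worth hardening separately from the LLM call. The following is a minimal sketch under that assumption: `parse_triple` and `accept` are hypothetical helper names, not part of the notebook above, and the sample strings are made-up model outputs used only to exercise the parser.

```python
import re

ALLOWED = {"PARTNERS_WITH", "SUPPLIES", "DEVELOPS", "ACQUIRED"}

def parse_triple(raw):
    """Return (subject, RELATION, object) from the first well-formed line, else None."""
    for line in raw.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # A valid line has exactly three non-empty fields.
        if len(parts) == 3 and all(parts):
            subj, rel, obj = parts
            # Normalize "discussed with" / "partners-with" to UPPER_SNAKE_CASE.
            rel = re.sub(r"[\s-]+", "_", rel.upper())
            return subj, rel, obj
    return None

def accept(triple, allowed=ALLOWED):
    """A triple enters the graph only if it parsed and its relation is in the schema."""
    return triple is not None and triple[1] in allowed

# Made-up model outputs for illustration.
samples = [
    "Panasonic | supplies | Tesla\nNote: extracted from contract text",
    "Tesla | discussed with | Toyota",
    "No relationship found.",
]
for s in samples:
    t = parse_triple(s)
    print(t, "->", "ACCEPT" if accept(t) else "REJECT")
```

Keeping parsing and schema checking as two separate functions makes the rejection reason visible: a triple can fail because it is malformed or because its relation is outside the schema, and those are different problems.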


In [None]:
# --- Step 6: The Solution (Explicit Verification) ---

print("\n--- EXPLICIT EDGE VERIFICATION ---")
print(f"Query: {query}")

def verify_partnership(entity_a, entity_b):
    # 1. Resolve names to graph nodes (simple substring match, for demo purposes only)
    node_a = next((n for n in kg.nodes() if entity_a in n), None)
    node_b = next((n for n in kg.nodes() if entity_b in n), None)
    
    # 2. Check for a PARTNERS_WITH edge in either direction
    relation_found = False
    
    print(f"Checking Edge: ({entity_a}) -[PARTNERS_WITH]-> ({entity_b})")
    
    if node_a and node_b:
        if kg.has_edge(node_a, node_b):
            rel = kg[node_a][node_b]['relation']
            if rel == "PARTNERS_WITH": relation_found = True
        # Check reverse direction too
        if kg.has_edge(node_b, node_a):
            rel = kg[node_b][node_a]['relation']
            if rel == "PARTNERS_WITH": relation_found = True
             
    print(f"\nVerification Result: {relation_found}")
    
    # 3. Generate Fact-Based Answer
    if relation_found:
        return f"Yes, verified partnership: {entity_a} PARTNERS_WITH {entity_b}."
    
    # Optional: provide alternative context (who DOES have a verified edge into entity_a?)
    print(f"\nSearching for verified partners of '{entity_a}':")
    suppliers = []
    if node_a:
        # Predecessors are nodes with an edge pointing at entity_a (e.g. its suppliers)
        for neighbor in kg.predecessors(node_a):
            rel = kg[neighbor][node_a]['relation']
            print(f"  - Found neighbor: '{neighbor}' (Relation: {rel})")
            if rel == "SUPPLIES":
                suppliers.append(neighbor)
    
    # Build the answer from graph edges instead of hardcoding the supplier name
    answer = f"No, there is no verified partnership between {entity_a} and {entity_b}."
    if suppliers:
        answer += f" {entity_a} is supplied by {', '.join(suppliers)}."
    return answer

final_answer = verify_partnership("Tesla", "Toyota")

print(f"\nFinal Answer (Generated from Graph Truth):\n{final_answer}")


--- EXPLICIT EDGE VERIFICATION ---
Query: Does Tesla have a battery partnership with Toyota?
Checking Edge: (Tesla) -[PARTNERS_WITH]-> (Toyota)

Verification Result: False

Searching for verified partners of 'Tesla':
  - Found neighbor: 'Panasonic' (Relation: SUPPLIES)

Final Answer (Generated from Graph Truth):
No, there is no verified partnership between Tesla and Toyota. Tesla is supplied by Panasonic.
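
The substring lookup in `verify_partnership` is fine for three documents but fragile beyond that: with `entity_a in n`, "Tesla" resolves to whichever node containing that string happens to come first. Below is a minimal sketch of a stricter lookup, assuming names differ only by case, punctuation, and a trailing corporate suffix. `CORPORATE_SUFFIXES`, `canonical_name`, and `find_node` are hypothetical names, the suffix list is illustrative rather than complete, and "Tesla Energy Storage" is an invented node used only to show the ambiguity.

```python
import re

# Illustrative suffix list (an assumption); a real system would need a curated alias table.
CORPORATE_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc"}

def canonical_name(name):
    """Lowercase, strip punctuation, and drop trailing corporate suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    while tokens and tokens[-1] in CORPORATE_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)

def find_node(nodes, entity):
    """Return the single node whose canonical name equals the entity's, else None."""
    target = canonical_name(entity)
    matches = [n for n in nodes if canonical_name(n) == target]
    # Refuse to guess when zero or several nodes match.
    return matches[0] if len(matches) == 1 else None

nodes = ["Tesla Energy Storage", "Tesla Inc.", "Panasonic Corporation", "Toyota Motor Corp"]
print(find_node(nodes, "Tesla"))      # exact canonical match only
print(find_node(nodes, "Panasonic"))
print(find_node(nodes, "Toyota"))     # "toyota motor" != "toyota", so no match
```

Exact canonical matching is deliberately conservative: it returns `None` for "Toyota" against "Toyota Motor Corp" rather than guessing. That fits the theme of this notebook, where a missing match is safer than a wrong one, but it also means aliases have to be registered explicitly.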
