# RAG Failure #6: The Aggregation Blindness

## The Problem
LLMs are not calculators. When you ask "**How many** suppliers do we have?", standard RAG fails for two reasons:
1.  **Limited Context (The 'Top-K' Problem):** If you have 50 suppliers scattered across 50 documents, and your retriever only fetches the top 5 documents, the LLM physically cannot count the other 45.
2.  **Duplicate Counting:** If "Apex Corp" appears in Document A and "Apex Inc" appears in Document B, the LLM often counts them as two distinct companies.
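
The Top-K ceiling from point 1 can be made concrete with a few lines of plain Python. This is a minimal sketch under stated assumptions: the 50-document corpus is synthetic, and `retrieve_top_k` is a hypothetical stand-in that skips ranking entirely, since only the cutoff matters for the argument.

```python
# Synthetic corpus (assumption): 50 documents, each naming one distinct supplier.
corpus = [f"Supplier {i} delivers parts." for i in range(50)]

def retrieve_top_k(documents, k):
    """Hypothetical retriever stand-in: returns the first k documents, no ranking."""
    return documents[:k]

context = retrieve_top_k(corpus, k=5)

# Everything the LLM can possibly count is what appears in the retrieved context.
visible_suppliers = {doc.split(" delivers")[0] for doc in context}

print(len(visible_suppliers))  # 5
```

However good the generator is, the count it produces is bounded by what the retriever hands over, here 5 out of 50.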

## The Scenario: Project Zeus Supply Chain
**Query:** "How many unique Tier-1 suppliers are working on Project Zeus?"

**The Scattered Data:**
-   **Doc 1 (Invoice):** "Payment sent to **Apex Corp** for Project Zeus steel beams."
-   **Doc 2 (Email):** "Re: Project Zeus. **Apex Inc** has delayed delivery."
-   **Doc 3 (Report):** "**Beta-Tech** is supplying chips for Zeus."
-   **Doc 4 (Logistics):** "**Gamma Logistics** is handling Zeus shipping."
-   **Doc 5 (Distractor):** "Project Apollo is using Delta Industries."

**Naive RAG Failure:** With $k=2$, the retriever might return only Doc 1 and Doc 2. The LLM sees "Apex Corp" and "Apex Inc" and answers "2 suppliers": it double-counts one company and misses Beta-Tech and Gamma Logistics entirely.

**KG Solution:** We normalize entity mentions to unique IDs, aggregate edges from *all* documents, and compute the count in code with `count = len(neighbors)`.
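
The idea can be sketched without an LLM or a graph library. In the minimal illustration below, the `mentions` list is a hand-written stand-in for extraction output, `canonical_id` is an assumed helper name, and a plain dict of sets plays the role of the graph.

```python
from collections import defaultdict

# Hand-written (project, supplier) pairs standing in for extracted triples.
mentions = [
    ("Project Zeus", "Apex Corp"),
    ("Project Zeus", "Apex Inc"),
    ("Project Zeus", "Beta-Tech"),
    ("Project Zeus", "Gamma Logistics"),
    ("Project Apollo", "Delta Industries"),
]

SUFFIXES = (" corp", " inc", " ltd", " llc")

def canonical_id(name):
    """Lowercase the name and strip one trailing corporate suffix (hypothetical helper)."""
    name = name.lower().strip()
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name

# Adjacency map: canonical project ID -> set of canonical supplier IDs.
neighbors = defaultdict(set)
for project, supplier in mentions:
    neighbors[canonical_id(project)].add(canonical_id(supplier))

count = len(neighbors["project zeus"])
print(count)  # 3
```

Because suppliers are stored in a set keyed by canonical ID, "Apex Corp" and "Apex Inc" collapse into one entry, and the count is computed by Python rather than generated by the model.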

In [None]:
# --- Step 1: Environment Setup ---
!pip install -q langchain langchain-community langchain-huggingface faiss-cpu networkx transformers sentence-transformers accelerate bitsandbytes

In [None]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
import networkx as nx

# --- Step 2: Load Model ---
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"Loading {model_id}...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    max_new_tokens=256, 
    temperature=0.1, 
    do_sample=True
)

llm = HuggingFacePipeline(pipeline=pipe)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("Model loaded. Pipeline ready.")

Loading TinyLlama/TinyLlama-1.1B-Chat-v1.0...
Model loaded. Pipeline ready.


In [None]:
from langchain_core.documents import Document  # current import path; langchain.docstore.document is deprecated

# --- Step 3: Simulate Scattered & Dirty Data ---
# Notice: "Apex Corp" and "Apex Inc" refer to the same entity.
# Notice: The Zeus data is split across 4 docs (plus 1 distractor). A standard RAG with k=2 will miss at least 2 of them.
raw_texts = [
    "INVOICE #101: Payment of $50k to Apex Corp for Steel Beams allocated to Project Zeus.",
    "EMAIL: Re: Project Zeus delays. We are waiting on Apex Inc to finalize the steel coating.",
    "REPORT: Beta-Tech Semiconductors has been onboarded as a silicon partner for Project Zeus.",
    "LOGISTICS MANIFEST: Gamma Logistics will handle all shipping routes for Project Zeus.",
    "STATUS UPDATE: Project Apollo is moving ahead with Delta Industries as primary contractor."
]

docs = [Document(page_content=t) for t in raw_texts]
print(f"Created {len(docs)} Scattered Documents.")
for i, d in enumerate(docs):
    print(f"Doc {i+1}: {d.page_content}")

Created 5 Scattered Documents.
Doc 1: INVOICE #101: Payment of $50k to Apex Corp for Steel Beams allocated to Project Zeus.
Doc 2: EMAIL: Re: Project Zeus delays. We are waiting on Apex Inc to finalize the steel coating.
Doc 3: REPORT: Beta-Tech Semiconductors has been onboarded as a silicon partner for Project Zeus.
Doc 4: LOGISTICS MANIFEST: Gamma Logistics will handle all shipping routes for Project Zeus.
Doc 5: STATUS UPDATE: Project Apollo is moving ahead with Delta Industries as primary contractor.


In [None]:
# --- Step 4: Naive RAG Implementation ---
from langchain_community.vectorstores import FAISS

print("\n--- NAIVE RAG (The Miscount) ---")
query = "How many unique Tier-1 suppliers are working on Project Zeus?"
print(f"Query: {query}")

# 1. Indexing
vectorstore = FAISS.from_documents(docs, embeddings)

# 2. Retrieval
# We strictly set K=2 to simulate a limited context window in a large database.
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
retrieved_docs = retriever.invoke(query)

print("\nRetrieved Context (k=2):")
context_str = ""
for i, d in enumerate(retrieved_docs):
    print(f"{i+1}. {d.page_content}")
    context_str += d.page_content + "\n"

# 3. Generation
prompt = f"<|system|>\nAnswer the question based on the context.\n<|user|>\nContext:\n{context_str}\nQuestion:\n{query}\n<|assistant|>"
response = llm.invoke(prompt)
cleaned_response = response.split("<|assistant|>")[-1].strip()

print("\nLLM Answer:")
print(cleaned_response)


--- NAIVE RAG (The Miscount) ---
Query: How many unique Tier-1 suppliers are working on Project Zeus?

Retrieved Context (k=2):
1. INVOICE #101: Payment of $50k to Apex Corp for Steel Beams allocated to Project Zeus.
2. EMAIL: Re: Project Zeus delays. We are waiting on Apex Inc to finalize the steel coating.

LLM Answer:
Based on the context, there are two unique Tier-1 suppliers working on Project Zeus: Apex Corp and Apex Inc.

ANALYSIS:
1. Undercounting: It completely missed 'Beta-Tech' and 'Gamma' because the k=2 cutoff excluded their documents.
2. Double Counting: It thinks 'Apex Corp' and 'Apex Inc' are two different companies.
Result: "2 suppliers" (The correct answer is 3: Apex, Beta, Gamma).


In [None]:
# --- Step 5: Entity Resolution Pipeline ---
# We process ALL documents to build the full picture.
# We implement a Normalization Function to merge "Apex Corp" and "Apex Inc".

kg = nx.DiGraph()

def normalize_entity(name):
    """
    Map an entity mention to a canonical ID.
    Simple rule-based implementation for demo.
    """
    name = name.lower().strip()
    # Remove trailing corporate suffixes. Slice instead of str.replace so that
    # only the end of the name changes (replace would also hit mid-name matches).
    suffixes = [" corp", " inc", " ltd", " llc"]
    for s in suffixes:
        if name.endswith(s):
            name = name[:-len(s)]
    return name

def extract_supply_chain(text):
    """
    Extracts Project -> Supplier relationships.
    """
    prompt = f"""<|system|>
    Extract the Project and the Supplier Company.
    Format: Project | HAS_SUPPLIER | Supplier
    <|user|>
    Text: {text}
    <|assistant|>"""
    
    raw = llm.invoke(prompt)
    out = raw.split("<|assistant|>")[-1].strip()
    if "|" in out:
        return [p.strip() for p in out.split("|")]
    return []

print("\n--- KG EXTRACTION WITH DEDUPLICATION ---")

for doc in docs:
    print(f"\nProcessing: {doc.page_content}")
    parts = extract_supply_chain(doc.page_content)
    
    if len(parts) >= 3:
        proj, rel, supplier = parts[0], parts[1], parts[2]
        
        # 1. Normalize (Deduplicate)
        norm_proj = normalize_entity(proj)
        norm_supplier = normalize_entity(supplier)
        
        print(f"   [Raw LLM]: {proj} | {rel} | {supplier}")
        print(f"   [Dedupe]: Normalized '{supplier}' -> Node '{norm_supplier}'")
        
        # 2. Add to Graph
        if not kg.has_edge(norm_proj, norm_supplier):
            kg.add_edge(norm_proj, norm_supplier, relation="HAS_SUPPLIER")
            print(f"   [Action]: Added Edge ({norm_proj}) -> ({norm_supplier})")
        else:
            print(f"   [Action]: Edge already exists. Merging ({norm_proj}) -> ({norm_supplier})")


--- KG EXTRACTION WITH DEDUPLICATION ---

Processing: INVOICE #101: Payment of $50k to Apex Corp for Steel Beams allocated to Project Zeus.
   [Raw LLM]: Project Zeus | HAS_SUPPLIER | Apex Corp
   [Dedupe]: Normalized 'Apex Corp' -> Node 'apex'
   [Action]: Added Edge (project zeus) -> (apex)

Processing: EMAIL: Re: Project Zeus delays. We are waiting on Apex Inc to finalize the steel coating.
   [Raw LLM]: Project Zeus | HAS_SUPPLIER | Apex Inc
   [Dedupe]: Normalized 'Apex Inc' -> Node 'apex'
   [Action]: Edge already exists. Merging (project zeus) -> (apex)

Processing: REPORT: Beta-Tech Semiconductors has been onboarded as a silicon partner for Project Zeus.
   [Raw LLM]: Project Zeus | HAS_SUPPLIER | Beta-Tech Semiconductors
   [Dedupe]: Normalized 'Beta-Tech Semiconductors' -> Node 'beta-tech semiconductors'
   [Action]: Added Edge (project zeus) -> (beta-tech semiconductors)
...
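
The extraction step above accepts any output containing a `|`, which is fragile when a small model wraps the triple in extra chatter. A stricter parser could look like the following sketch. The names `TRIPLE_PATTERN` and `parse_triples` are hypothetical, and the `sample` string is invented for illustration rather than captured from the model.

```python
import re

# Matches one line of the form: <project> | HAS_SUPPLIER | <supplier>
TRIPLE_PATTERN = re.compile(r"^\s*(.+?)\s*\|\s*HAS_SUPPLIER\s*\|\s*(.+?)\s*$")

def parse_triples(llm_output):
    """Return (project, supplier) pairs for every line in the requested format."""
    triples = []
    for line in llm_output.splitlines():
        match = TRIPLE_PATTERN.match(line)
        if match:
            triples.append((match.group(1), match.group(2)))
    return triples

# Invented example of a chatty response (not real model output).
sample = (
    "Sure! Here is the result:\n"
    "Project Zeus | HAS_SUPPLIER | Apex Corp\n"
    "Let me know if you need more."
)
print(parse_triples(sample))  # [('Project Zeus', 'Apex Corp')]
```

Lines that do not match the requested format are skipped instead of being split into arbitrary parts, so malformed output yields no edge rather than a wrong one.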


In [None]:
# --- Step 6: The Solution (Graph Aggregation) ---
# We don't ask the LLM to count. We use Python to count the project's neighbor nodes in the graph.

print("\n--- GRAPH AGGREGATION QUERY ---")
print(f"Query: \"{query}\"")

def count_suppliers(project_name):
    target = normalize_entity(project_name)
    print(f"Target Node: '{target}'")
    
    if target not in kg:
        return "Project not found."
    
    # Get direct neighbors (Suppliers)
    suppliers = list(kg.successors(target))
    
    print(f"\nRetrieving neighbors for '{target}'...")
    for s in suppliers:
        print(f"  - Found: {s}")
        
    # The Math
    count = len(suppliers)
    print(f"\nMathematical Count: {count}")
    
    return f"{project_name} has exactly {count} unique suppliers: {', '.join(suppliers)}."

final_answer = count_suppliers("Project Zeus")
print(f"\nFinal Answer (Generated Programmatically):\n{final_answer}")


--- GRAPH AGGREGATION QUERY ---
Query: "How many unique Tier-1 suppliers are working on Project Zeus?"
Target Node: 'project zeus'

Retrieving neighbors for 'project zeus'...
  - Found: apex
  - Found: beta-tech semiconductors
  - Found: gamma logistics

Mathematical Count: 3

Final Answer (Generated Programmatically):
Project Zeus has exactly 3 unique suppliers: apex, beta-tech semiconductors, gamma logistics.
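
Suffix stripping alone only merges names that differ by a corporate suffix. It would not merge "Beta-Tech" with "Beta-Tech Semiconductors", for example. One common extension is an alias table consulted after suffix stripping. The sketch below is an assumption-laden illustration: `ALIASES`, its entries, and `resolve_entity` are all hypothetical and not part of the pipeline above.

```python
# Hypothetical alias table: every key and value is invented for illustration.
ALIASES = {
    "beta-tech semiconductors": "beta-tech",
    "beta tech": "beta-tech",
}
SUFFIXES = (" corp", " inc", " ltd", " llc")

def resolve_entity(name):
    """Normalize a mention, then map known aliases to a single canonical ID."""
    name = name.lower().strip().rstrip(".")  # also tolerate "Inc." with a trailing dot
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    # Fall back to the normalized name when no alias is registered.
    return ALIASES.get(name, name)

print(resolve_entity("Beta-Tech Semiconductors"))  # beta-tech
print(resolve_entity("Apex Inc."))                 # apex
```

An explicit table keeps merges auditable: two mentions are only unified when someone has recorded that they refer to the same company.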
