# RAG Failure #7: The "Scattered Evidence" Fragment (Low Recall)

## The Problem
Standard RAG caps retrieval with a `top_k` parameter (usually 3 to 5) so that the retrieved context fits into the LLM's context window.
If the user asks for a comprehensive list (e.g., "List **all** safety features of the Model-X"), and these features are mentioned one-by-one across **10 different pages** of a manual, a Vector Retriever with `top_k=3` will fetch only 3 of those pages. The LLM will confidently list 3 features and miss the other 7.
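
The ceiling described above is simple arithmetic. As a rough sketch, assuming each relevant fact lives in its own chunk and the retriever ranks perfectly, the best achievable recall is `min(top_k, n_relevant) / n_relevant`. The helper name `max_recall` is made up for this illustration.

```python
def max_recall(top_k: int, n_relevant: int) -> float:
    """Upper bound on recall when every relevant fact sits in a separate chunk."""
    if n_relevant <= 0:
        raise ValueError("n_relevant must be positive")
    # Even a perfect ranker can return at most top_k of the relevant chunks.
    return min(top_k, n_relevant) / n_relevant


# The 10-page manual from the example above, at a few top_k settings.
for k in (3, 5, 10):
    print(f"top_k={k:>2}: at most {max_recall(k, 10):.0%} of the features can be listed")
```

In practice the real recall is usually lower than this bound, because distractor chunks can take some of the `top_k` slots.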

## The Scenario: Product Safety Compliance
**Query:** "List all safety certifications and features of the **Model-X** Industrial Robot."

**The Scattered Data:**
-   **Doc 1 (Intro):** "The Model-X features a reinforced **Titanium Chassis**."
-   **Doc 2 (Electrical):** "Circuitry in the Model-X includes **Surge Protection**."
-   **Doc 3 (Vision):** "Model-X uses **Lidar Object Avoidance**."
-   **Doc 4 (Emergency):** "Standard **Red-Stop Button** is located on the Model-X rear."
-   **Doc 5 (Compliance):** "Model-X is **ISO-9001 Certified**."
-   **Doc 6 (Distractor):** "The Model-Y features **Voice Control**."

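**Naive RAG Failure:** With `k=3`, the retriever returns at most three of the five Model-X documents, for example Docs 1, 3, and 5, and misses Electrical (Doc 2) and Emergency (Doc 4). With `k=2` it misses at least three. Either way, the answer is incomplete.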

**KG Solution:** We treat 'Model-X' as a central node. During ingestion, we attach every feature found in *any* document to this node. The query simply returns all neighbors of 'Model-X'.
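
The hub-node idea can be previewed without any LLM in the loop. The sketch below uses hand-written triples as a stand-in for an extraction step (an assumption made for illustration; the names `triples` and `sketch_kg` are not part of the notebook's pipeline) and shows that one neighbor lookup returns every Model-X feature while leaving the Model-Y distractor out.

```python
import networkx as nx

# Hand-written triples standing in for the output of an extraction step.
triples = [
    ("Model-X", "HAS_FEATURE", "Titanium Chassis"),
    ("Model-X", "HAS_FEATURE", "Surge Protection"),
    ("Model-X", "HAS_FEATURE", "Lidar Object Avoidance"),
    ("Model-X", "HAS_FEATURE", "Red-Stop Button"),
    ("Model-X", "HAS_FEATURE", "ISO-9001 Certified"),
    ("Model-Y", "HAS_FEATURE", "Voice Control"),  # distractor
]

# Every feature becomes an outgoing edge from its product node.
sketch_kg = nx.DiGraph()
for subject, relation, obj in triples:
    sketch_kg.add_edge(subject, obj, relation=relation)

# One lookup on the hub node returns all attached features, regardless of
# how many source documents they were scattered across.
model_x_features = list(sketch_kg.successors("Model-X"))
print(model_x_features)
```

The lookup has no `k` to tune: the result size is whatever the graph holds for that node.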

In [None]:
# --- Step 1: Environment Setup ---
!pip install -q langchain langchain-community langchain-huggingface faiss-cpu networkx transformers sentence-transformers accelerate bitsandbytes

In [None]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
import networkx as nx

# --- Step 2: Load Model ---
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"Loading {model_id}...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    max_new_tokens=256, 
    temperature=0.1, 
    do_sample=True
)

llm = HuggingFacePipeline(pipeline=pipe)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("Model loaded. Pipeline ready.")

Loading TinyLlama/TinyLlama-1.1B-Chat-v1.0...
Model loaded. Pipeline ready.


In [None]:
# Document lives in langchain_core; the langchain.docstore path is a deprecated alias.
from langchain_core.documents import Document

# --- Step 3: Simulate Scattered Attributes ---
# 5 docs describe Model-X, 1 doc describes Model-Y.
# With 5 relevant docs, a retriever with k=3 must miss at least 2 facts (k=2 misses at least 3).
raw_texts = [
    "[Manual Sec 1] The Model-X features a reinforced Titanium Chassis for impact resistance.",
    "[Manual Sec 2] Electrical circuitry in the Model-X includes active Surge Protection.",
    "[Manual Sec 3] For navigation, the Model-X uses Lidar Object Avoidance technology.",
    "[Manual Sec 4] A physical Red-Stop Button is located on the Model-X rear panel.",
    "[Compliance Cert] The Model-X is fully ISO-9001 Certified for factory use.",
    # Distractor
    "[Ad Brochure] The new Model-Y features Voice Control and AI Chat."
]

docs = [Document(page_content=t) for t in raw_texts]
print(f"Created {len(docs)} Scattered Documents.")
for i, d in enumerate(docs):
    print(f"Doc {i+1}: {d.page_content}")

Created 6 Scattered Documents.
Doc 1: [Manual Sec 1] The Model-X features a reinforced Titanium Chassis for impact resistance.
Doc 2: [Manual Sec 2] Electrical circuitry in the Model-X includes active Surge Protection.
Doc 3: [Manual Sec 3] For navigation, the Model-X uses Lidar Object Avoidance technology.
Doc 4: [Manual Sec 4] A physical Red-Stop Button is located on the Model-X rear panel.
Doc 5: [Compliance Cert] The Model-X is fully ISO-9001 Certified for factory use.
Doc 6: [Ad Brochure] The new Model-Y features Voice Control and AI Chat.


In [None]:
# --- Step 4: Naive RAG Implementation ---
from langchain_community.vectorstores import FAISS

print("\n--- NAIVE RAG (Low Recall) ---")
query = "List all safety features and certifications of the Model-X."
print(f"Query: {query}")

# 1. Indexing
vectorstore = FAISS.from_documents(docs, embeddings)

# 2. Retrieval
# We set k=2 to demonstrate the fragmentation problem vividly.
# Even with k=4, we would miss at least 1 of the 5 Model-X documents.
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
retrieved_docs = retriever.invoke(query)

print("\nRetrieved Context (k=2):")
context_str = ""
for i, d in enumerate(retrieved_docs):
    print(f"{i+1}. {d.page_content}")
    context_str += d.page_content + "\n"

# 3. Generation
prompt = f"<|system|>\nList every feature mentioned.\n<|user|>\nContext:\n{context_str}\nQuestion:\n{query}\n<|assistant|>"
response = llm.invoke(prompt)
cleaned_response = response.split("<|assistant|>")[-1].strip()

print("\nLLM Answer:")
print(cleaned_response)


--- NAIVE RAG (Low Recall) ---
Query: List all safety features and certifications of the Model-X.

Retrieved Context (k=2):
1. [Compliance Cert] The Model-X is fully ISO-9001 Certified for factory use.
2. [Manual Sec 1] The Model-X features a reinforced Titanium Chassis for impact resistance.

LLM Answer:
Based on the context provided, the safety features and certifications of the Model-X are:
1. ISO-9001 Certified
2. Reinforced Titanium Chassis

ANALYSIS:
The LLM missed 3 CRITICAL features (Surge Protection, Lidar, Red-Stop Button). 
Why? Because k=2 limited the input. The LLM cannot list what it cannot see.
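
The gap can be put into a number. The sketch below scores an answer against the five features known from the scenario using a case-insensitive substring match. The helper `feature_recall` and the hard-coded `expected_features` list are assumptions made for this illustration, and substring matching is a deliberately crude proxy for a real evaluation.

```python
def feature_recall(answer: str, expected: list[str]) -> float:
    """Fraction of expected features mentioned (case-insensitive substring match)."""
    answer_lower = answer.lower()
    found = [f for f in expected if f.lower() in answer_lower]
    return len(found) / len(expected)


# Ground truth taken from the five Model-X documents in the scenario.
expected_features = [
    "Titanium Chassis",
    "Surge Protection",
    "Lidar Object Avoidance",
    "Red-Stop Button",
    "ISO-9001 Certified",
]

# The two items from the naive RAG answer shown above.
naive_answer = "1. ISO-9001 Certified\n2. Reinforced Titanium Chassis"
print(f"Recall: {feature_recall(naive_answer, expected_features):.0%}")
```

The same function can be applied to any later answer string to compare approaches on equal terms.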


In [None]:
# --- Step 5: Attribute Accumulation Pipeline ---
# We process ALL documents. The graph acts as the "Global Memory".
# We normalize the Subject ('Model-X') so all features attach to one node.

kg = nx.DiGraph()

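def extract_features(text):
    """
    Ask the LLM for a single 'Product | HAS_FEATURE | Feature' triple.
    Returns [product, relation, feature], or [] if the output cannot be parsed.
    """
    # Build the prompt line by line so no stray indentation ends up inside it.
    prompt = (
        "<|system|>\n"
        "Extract the Product and the specific Feature/Cert mentioned.\n"
        "Format: Product | HAS_FEATURE | Feature\n"
        "<|user|>\n"
        f"Text: {text}\n"
        "<|assistant|>"
    )

    raw = llm.invoke(prompt)
    out = raw.split("<|assistant|>")[-1].strip()
    # Small models often append extra lines; use the first line holding a full triple.
    for line in out.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) >= 3:
            return parts[:3]
    return []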

print("\n--- ATTRIBUTE ACCUMULATION (KG Build) ---")

for doc in docs:
    print(f"\nParsing: {doc.page_content}")
    parts = extract_features(doc.page_content)
    
    if len(parts) >= 3:
        # The relation slot is always HAS_FEATURE here, so it is not used further.
        prod, _, feature = parts[0], parts[1], parts[2]
        
        # Normalize the product name so every feature attaches to a single node
        if "Model-X" in prod:
            prod_id = "Model-X"
        elif "Model-Y" in prod:
            prod_id = "Model-Y"
        else:
            prod_id = prod
        
        print(f"   [Extracted]: {prod_id} | HAS_FEATURE | {feature}")
        kg.add_edge(prod_id, feature, relation="HAS_FEATURE")
        print(f"   [Action]: Attached '{feature}' to '{prod_id}'")


--- ATTRIBUTE ACCUMULATION (KG Build) ---

Parsing: [Manual Sec 1] The Model-X features a reinforced Titanium Chassis for impact resistance.
   [Extracted]: Model-X | HAS_FEATURE | Titanium Chassis
   [Action]: Attached 'Titanium Chassis' to 'Model-X'

Parsing: [Manual Sec 2] Electrical circuitry in the Model-X includes active Surge Protection.
   [Extracted]: Model-X | HAS_FEATURE | Surge Protection
   [Action]: Attached 'Surge Protection' to 'Model-X'

Parsing: [Manual Sec 3] For navigation, the Model-X uses Lidar Object Avoidance technology.
   [Extracted]: Model-X | HAS_FEATURE | Lidar Object Avoidance
   [Action]: Attached 'Lidar Object Avoidance' to 'Model-X'
...
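
The normalization step in the loop above relies on an exact `"Model-X"` substring, so an extraction such as "Model X" or "MODEL-X" would create a separate node and silently split the feature list. One possible hardening is sketched below. `KNOWN_PRODUCTS`, `_squash`, and `normalize_product` are hypothetical names introduced here, and the catalogue of canonical names is assumed to be known in advance.

```python
import re

# Hypothetical catalogue of canonical product names (an assumption for this sketch).
KNOWN_PRODUCTS = ["Model-X", "Model-Y"]


def _squash(name: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def normalize_product(name: str) -> str:
    """Map a raw extracted product string to its canonical node ID, if known."""
    key = _squash(name)
    for canonical in KNOWN_PRODUCTS:
        if _squash(canonical) in key:
            return canonical
    return name.strip()  # unknown products keep their own node


for raw_name in ["the Model X robot", "MODEL-X", "model_y", "Conveyor-7"]:
    print(f"{raw_name!r} -> {normalize_product(raw_name)!r}")
```

This matching is intentionally loose: a hypothetical "Model-X2" would also collapse into `Model-X`, so a real catalogue with overlapping names would need stricter rules.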


In [None]:
# --- Step 6: The Solution (Neighborhood Search) ---
# We retrieve the central node and ALL its connected neighbors.
# This simulates 'Infinite K' for this specific topic.

print("\n--- 1-HOP NEIGHBORHOOD SEARCH ---")
print(f"Query: \"{query}\"")

target = "Model-X"
print(f"Target Entity: '{target}'")

if target in kg:
    # Get all features (successors)
    features = list(kg.successors(target))
    
    print(f"\nGraph Retrieval (All Neighbors):")
    for f in features:
        print(f"  - {f}")
        
    print(f"\nTotal Features Retrieved: {len(features)}")
    
    # Formulate Answer
    print(f"\nFinal Answer (100% Recall):\nThe {target} comes equipped with {len(features)} key features: {', '.join(features)}.")
else:
    print("Entity not found in graph.")


--- 1-HOP NEIGHBORHOOD SEARCH ---
Query: "List all safety features and certifications of the Model-X."
Target Entity: 'Model-X'

Graph Retrieval (All Neighbors):
  - Titanium Chassis
  - Surge Protection
  - Lidar Object Avoidance
  - Red-Stop Button
  - ISO-9001 Certified

Total Features Retrieved: 5

Final Answer (100% Recall):
The Model-X comes equipped with 5 key features: Titanium Chassis, Surge Protection, Lidar Object Avoidance, Red-Stop Button, ISO-9001 Certified.
