# RAG Failure #8: The Structural Hierarchy Failure

## The Problem
Documents often store hierarchical data in a flattened, distributed way.
One page says "System A contains Module B". Another page says "Module B contains Part C".
If a user asks about **Part C**, naive RAG retrieves the page that mentions it ("Module B contains Part C") but not the page about "System A", because "Part C" and "System A" never appear in the same text chunk.

**The LLM loses the "Big Picture" context.**

## The Scenario: Engineering Bill of Materials (BOM)
**Query:** "Trace the full system hierarchy for the **Qubit Lattice**."

**The Nested Data (The "Russian Doll" Effect):**
1.  **Doc 1 (Level 1):** "The **Zeus-X Supercomputer** is composed of the **Core Processing Unit** and the Cooling Array."
2.  **Doc 2 (Level 2):** "Inside the **Core Processing Unit**, you will find the **Quantum Chipset**."
3.  **Doc 3 (Level 3):** "The **Quantum Chipset** houses the delicate **Qubit Lattice** assembly."
4.  **Doc 4 (Distractor):** "The **Zeus-Y** model uses a standard silicon chipset."

**Naive RAG Failure:** Vector search retrieves Doc 3 (the immediate parent) but can miss Doc 1 (the root), since Doc 1 never mentions the Qubit Lattice. Answer: *"The Qubit Lattice is inside the Quantum Chipset."* (Incomplete lineage.)

**KG Solution:** We build a `CONTAINS` graph. We run a **Reverse Traversal** (Lineage Query) to find the path from the leaf node all the way up to the root.
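
As a rough illustration of the reverse traversal described above, the sketch below hand-builds the `CONTAINS` edges from Docs 1-3 and walks them backwards from the leaf. The helper name `trace_lineage` and the single-parent assumption are illustrative choices, not part of the notebook's pipeline, which extracts its edges with an LLM.

```python
import networkx as nx

# Hand-built CONTAINS edges mirroring Docs 1-3 (direction: parent -> child).
graph = nx.DiGraph()
graph.add_edge("Zeus-X Supercomputer", "Core Processing Unit", relation="CONTAINS")
graph.add_edge("Zeus-X Supercomputer", "Cooling Array", relation="CONTAINS")
graph.add_edge("Core Processing Unit", "Quantum Chipset", relation="CONTAINS")
graph.add_edge("Quantum Chipset", "Qubit Lattice", relation="CONTAINS")


def trace_lineage(graph, leaf):
    """Walk CONTAINS edges backwards from `leaf`; return the path leaf -> root."""
    lineage = [leaf]
    seen = {leaf}
    current = leaf
    while True:
        parents = list(graph.predecessors(current))
        if not parents:
            return lineage  # no incoming edge: this node is a root
        # Assumption: each part has a single container; take the first otherwise.
        current = parents[0]
        if current in seen:
            raise ValueError(f"Cycle detected at {current!r}")
        seen.add(current)
        lineage.append(current)


lineage = trace_lineage(graph, "Qubit Lattice")
print(" <- ".join(lineage))
```

Because the walk follows `predecessors`, it only ever touches the ancestors of the target, so sibling branches such as the Cooling Array are never visited.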

In [None]:
# --- Step 1: Environment Setup ---
!pip install -q langchain langchain-community langchain-huggingface faiss-cpu networkx transformers sentence-transformers accelerate bitsandbytes

In [None]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
import networkx as nx

# --- Step 2: Load Model ---
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"Loading {model_id}...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    max_new_tokens=256, 
    temperature=0.1, 
    do_sample=True
)

llm = HuggingFacePipeline(pipeline=pipe)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("Model loaded. Pipeline ready.")

Loading TinyLlama/TinyLlama-1.1B-Chat-v1.0...
Model loaded. Pipeline ready.


In [None]:
# Document lives in langchain_core; the old langchain.docstore path is deprecated.
from langchain_core.documents import Document

# --- Step 3: Simulate Nested Data ---
# Note: No single document contains the full path "Zeus-X -> Qubit Lattice".
# It is broken across 3 separate definitions.
raw_texts = [
    "The Zeus-X Supercomputer is composed of the Core Processing Unit and the Cooling Array.",
    "Inside the Core Processing Unit, you will find the Quantum Chipset housing the logic gates.",
    "The Quantum Chipset houses the delicate Qubit Lattice assembly for calculation.",
    "The older Zeus-Y model uses a standard silicon chipset and does not feature quantum components."
]

docs = [Document(page_content=t) for t in raw_texts]
print(f"Created {len(docs)} Nested Documents.")
for i, d in enumerate(docs):
    print(f"Doc {i+1} ({['Root', 'Level 2', 'Level 3', 'Distractor'][i]}): {d.page_content}")

Created 4 Nested Documents.
Doc 1 (Root): The Zeus-X Supercomputer is composed of the Core Processing Unit and the Cooling Array.
Doc 2 (Level 2): Inside the Core Processing Unit, you will find the Quantum Chipset housing the logic gates.
Doc 3 (Level 3): The Quantum Chipset houses the delicate Qubit Lattice assembly for calculation.
Doc 4 (Distractor): The older Zeus-Y model uses a standard silicon chipset and does not feature quantum components.


In [None]:
# --- Step 4: Naive RAG Implementation ---
from langchain_community.vectorstores import FAISS

print("\n--- NAIVE RAG (The Missing Ancestor) ---")
query = "Trace the full system hierarchy for the Qubit Lattice."
print(f"Query: {query}")

# 1. Indexing
vectorstore = FAISS.from_documents(docs, embeddings)

# 2. Retrieval
# k=2. Doc 1 (Zeus-X) never mentions the Qubit Lattice, so it is semantically far from the query.
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
retrieved_docs = retriever.invoke(query)

print("\nRetrieved Context (k=2):")
context_str = ""
for i, d in enumerate(retrieved_docs):
    print(f"{i+1}. {d.page_content}")
    context_str += d.page_content + "\n"

# 3. Generation
prompt = f"<|system|>\nAnswer the question based on the context.\n<|user|>\nContext:\n{context_str}\nQuestion:\n{query}\n<|assistant|>"
response = llm.invoke(prompt)
cleaned_response = response.split("<|assistant|>")[-1].strip()

print("\nLLM Answer:")
print(cleaned_response)


--- NAIVE RAG (The Missing Ancestor) ---
Query: Trace the full system hierarchy for the Qubit Lattice.

Retrieved Context (k=2):
1. The Quantum Chipset houses the delicate Qubit Lattice assembly for calculation.
2. Inside the Core Processing Unit, you will find the Quantum Chipset housing the logic gates.

LLM Answer:
Based on the context, the Qubit Lattice is housed within the Quantum Chipset, which is found inside the Core Processing Unit.

ANALYSIS:
The answer is technically correct but INCOMPLETE.
It stops at the Core Processing Unit and never reaches the 'Zeus-X Supercomputer' (the root).
Why? 'Zeus-X' appears only in Doc 1, which never mentions the 'Qubit Lattice' and shares no distinctive terms with the query, so it ranked below the top 2 results and never reached the LLM.


In [None]:
# --- Step 5: Hierarchical Extraction ---
# We extract 'Parent -> Child' relationships.

kg = nx.DiGraph()

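def extract_hierarchy(text):
    """
    Asks the LLM for one containment relationship and parses it.
    Returns [parent, relation, child], or [] if no pipe-delimited line is found.
    """
    prompt = f"""<|system|>
    Extract the System (Container) and the Sub-Component (Part).
    Format: Parent | CONTAINS | Child
    <|user|>
    Text: {text}
    <|assistant|>"""
    
    raw = llm.invoke(prompt)
    # Keep only the text generated after the assistant tag.
    out = raw.split("<|assistant|>")[-1].strip()
    # Parse the first pipe-delimited line so trailing chatter
    # does not leak into the child name.
    for line in out.splitlines():
        if "|" in line:
            return [p.strip() for p in line.split("|")]
    return []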

print("\n--- HIERARCHICAL EXTRACTION ---")

for doc in docs:
    print(f"\nParsing: {doc.page_content}")
    parts = extract_hierarchy(doc.page_content)
    
    if len(parts) >= 3:
        parent, rel, child = parts[0], parts[1], parts[2]
        
        # Clean strings
        parent = parent.strip()
        child = child.strip()
        
        print(f"   [Extracted]: {parent} | CONTAINS | {child}")
        kg.add_edge(parent, child, relation="CONTAINS")
        print(f"   [Action]: Added Edge ({parent}) -> ({child})")


--- HIERARCHICAL EXTRACTION ---

Parsing: The Zeus-X Supercomputer is composed of the Core Processing Unit and the Cooling Array.
   [Extracted]: Zeus-X Supercomputer | CONTAINS | Core Processing Unit
   [Action]: Added Edge (Zeus-X Supercomputer) -> (Core Processing Unit)

Parsing: Inside the Core Processing Unit, you will find the Quantum Chipset housing the logic gates.
   [Extracted]: Core Processing Unit | CONTAINS | Quantum Chipset
   [Action]: Added Edge (Core Processing Unit) -> (Quantum Chipset)

Parsing: The Quantum Chipset houses the delicate Qubit Lattice assembly for calculation.
   [Extracted]: Quantum Chipset | CONTAINS | Qubit Lattice
   [Action]: Added Edge (Quantum Chipset) -> (Qubit Lattice)
...
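
Doc 1 names two children (the Core Processing Unit and the Cooling Array), yet only one edge was added for it above, because the parser keeps a single triple per document. A minimal sketch of a stricter parser, assuming the model were prompted to emit one `Parent | CONTAINS | Child` triple per line (the helper `parse_triples` and the sample output are hypothetical):

```python
def parse_triples(raw_output):
    """Return (parent, child) pairs from lines shaped like 'Parent | CONTAINS | Child'."""
    pairs = []
    for line in raw_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Keep only well-formed lines: three fields with CONTAINS in the middle.
        if len(fields) != 3 or fields[1].upper() != "CONTAINS":
            continue
        parent, _, child = fields
        # Skip empty names and self-loops, which would corrupt the hierarchy.
        if parent and child and parent != child:
            pairs.append((parent, child))
    return pairs


# Hypothetical model output for Doc 1, including a chatty line that is skipped.
sample = """Here are the relationships:
Zeus-X Supercomputer | CONTAINS | Core Processing Unit
Zeus-X Supercomputer | CONTAINS | Cooling Array"""

pairs = parse_triples(sample)
print(pairs)
```

Each returned pair could then be passed to `kg.add_edge(parent, child, relation="CONTAINS")`, so a document that lists several parts contributes several edges.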


In [None]:
# --- Step 6: The Solution (Reverse Path Traversal) ---

print("\n--- REVERSE LINEAGE TRACING ---")
print(f"Query: \"{query}\"")

# 1. Identify Target Node
target_node = "Qubit Lattice"
print(f"Target Node: '{target_node}'")

def get_full_lineage(node_name):
    if node_name not in kg:
        return "Node not found."
    
    print("\nTracing Ancestors (Bottom-Up)...")
    
    # nx.ancestors() returns every parent, grandparent, etc., but as an unordered set.
    # To get the ORDER, take the shortest path from a root down to the target.
    # Roots are the nodes with no incoming CONTAINS edge (in_degree 0).
    roots = [n for n, d in kg.in_degree() if d == 0]
    
    for root in roots:
        try:
            path = nx.shortest_path(kg, source=root, target=node_name)
        except nx.NetworkXNoPath:
            continue
        print(f"  Found Path: {path}")
        
        if len(path) == 1:
            return f"The {path[0]} is a top-level system with no parent."
        
        # Walk the path leaf-to-root so the sentence works for any depth,
        # instead of hard-coding exactly four levels.
        leaf, *ancestors = reversed(path)
        return f"The {leaf} is part of the " + ", which is part of the ".join(ancestors) + "."
            
    return "No hierarchy path found."

final_answer = get_full_lineage(target_node)

print(f"\nFinal Answer (Full Context):\n{final_answer}")


--- REVERSE LINEAGE TRACING ---
Query: "Trace the full system hierarchy for the Qubit Lattice."
Target Node: 'Qubit Lattice'

Tracing Ancestors (Bottom-Up)...
  Found Path: ['Zeus-X Supercomputer', 'Core Processing Unit', 'Quantum Chipset', 'Qubit Lattice']

Final Answer (Full Context):
The Qubit Lattice is part of the Quantum Chipset, which is part of the Core Processing Unit, which is part of the Zeus-X Supercomputer.
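
Step 6 hard-codes `target_node`. One simple way to pick it from the query instead is a case-insensitive substring match against the graph's node names. This is a sketch under that assumption (the helper `find_target_nodes` is hypothetical, and real queries may need fuzzy or alias matching):

```python
def find_target_nodes(query, node_names):
    """Return the node names mentioned in the query, longest match first."""
    lowered = query.lower()
    hits = [name for name in node_names if name.lower() in lowered]
    # Prefer longer names so a specific part wins over a shorter, generic one.
    return sorted(hits, key=len, reverse=True)


# Node names as they appear in the extracted graph above.
nodes = ["Zeus-X Supercomputer", "Core Processing Unit", "Quantum Chipset", "Qubit Lattice"]
query = "Trace the full system hierarchy for the Qubit Lattice."

targets = find_target_nodes(query, nodes)
print(targets)
```

In the notebook, `nodes` would be `list(kg.nodes)`, and the first match could be handed to `get_full_lineage`. An empty result signals that the query names nothing in the graph.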
