# RAG Failure #9: The Directionality Reversal

## The Problem
Vector similarity is largely **symmetric**: it scores which words appear, not who does what to whom. "A owns B" and "B owns A" sit almost on top of each other in vector space because they contain exactly the same keywords, and only the word order, which carries the direction of the relationship, differs.
LLMs that rely on this "bag of words" context frequently **flip the relationship**, especially when sentences use passive or indirect phrasing, or dense corporate jargon.
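
To make the symmetry concrete, here is a minimal sketch using a literal bag-of-words vector instead of a learned embedding. This is a simplifying assumption: dense models such as `all-MiniLM-L6-v2` are not strictly order-invariant, but they tend to score such pairs as very similar. The helper names `bag_of_words` and `cosine` are illustrative and not part of the notebook.

```python
from collections import Counter
import math
import re

def bag_of_words(text):
    """Lowercase the text and count word occurrences, discarding word order."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

forward = bag_of_words("Novacorp owns Stratos Global")
reverse = bag_of_words("Stratos Global owns Novacorp")

# Same words, opposite meaning: the two count vectors are identical.
print(forward == reverse)        # True
print(cosine(forward, reverse))  # 1.0
```

Both sentences reduce to the same four tokens, so any representation built only from token counts cannot tell the owner from the owned.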

## The Scenario: Corporate Ownership & Subsidiary Tracing
**Query:** "Does **Stratos Global** own **Novacorp**, or is it the other way around?"

**The Ambiguous Data:**
1.  **Doc 1 (The Fact):** "**Stratos Global**, formerly a standalone entity, now operates under the corporate umbrella of the massive **Novacorp** conglomerate."
2.  **Doc 2 (The Distractor):** "**Stratos Global** has aggressively expanded by acquiring smaller firms like **TinyAI**."
3.  **Doc 3 (The Confusion):** "**Novacorp** and **Stratos Global** share a unified HQ in London."

**Naive RAG Failure:** The LLM sees the keywords "Stratos", "acquiring", "Novacorp", and "HQ" with no reliable signal about who owns whom. It may hallucinate *"Stratos Global acquired Novacorp"* because Doc 2 primes it to treat Stratos as an acquirer.

**KG Solution:** We build a **Directed Graph**. We map "operating under umbrella of" to `Novacorp -> Stratos`. We map "acquiring" to `Stratos -> TinyAI`. Once an edge is stored, its direction is explicit: a lookup cannot silently reverse it the way a paraphrase can.
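
The idea can be sketched without any graph library, assuming the two edges described above have already been extracted. The set `owns` and the helper `ownership_direction` are hypothetical names used only for this illustration; the notebook itself uses `networkx.DiGraph` later on.

```python
# Each edge is an ordered (parent, child) pair; the order carries the meaning.
owns = {
    ("Novacorp", "Stratos Global"),  # "operates under the umbrella of"
    ("Stratos Global", "TinyAI"),    # "acquiring"
}

def ownership_direction(a, b, edges):
    """Return which of the two companies owns the other, if either does."""
    if (a, b) in edges:
        return f"{a} owns {b}"
    if (b, a) in edges:
        return f"{b} owns {a}"
    return f"No direct ownership recorded between {a} and {b}"

# The scenario's query, asked in the "wrong" order on purpose.
print(ownership_direction("Stratos Global", "Novacorp", owns))
# Novacorp owns Stratos Global
```

Because `("Stratos Global", "Novacorp")` is simply not in the set, the order in which the question names the companies has no effect on the answer.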

In [None]:
# --- Step 1: Environment Setup ---
!pip install -q langchain langchain-community langchain-huggingface faiss-cpu networkx transformers sentence-transformers accelerate bitsandbytes

In [None]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
import networkx as nx

# --- Step 2: Load Model ---
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"Loading {model_id}...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    max_new_tokens=256, 
    temperature=0.1, 
    do_sample=True
)

llm = HuggingFacePipeline(pipeline=pipe)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("Model loaded. Pipeline ready.")

Loading TinyLlama/TinyLlama-1.1B-Chat-v1.0...
Model loaded. Pipeline ready.


In [None]:
from langchain_core.documents import Document  # current import path; langchain.docstore is deprecated

# --- Step 3: Simulate Directionally Ambiguous Data ---
# Note the indirect phrasing in Doc 1 ("operates under the umbrella of"): the owner is named last.
# Note the active voice in Doc 2 ("Stratos ... acquiring"): Stratos is the grammatical subject.
raw_texts = [
    "[Press Release] Stratos Global, formerly a standalone entity, now operates under the corporate umbrella of the massive Novacorp conglomerate.",
    "[Expansion News] Stratos Global has aggressively expanded its portfolio by recently acquiring the startup TinyAI.",
    "[Real Estate Report] Novacorp and its key brands, including Stratos Global, share a unified HQ in London."
]

docs = [Document(page_content=t) for t in raw_texts]
print(f"Created {len(docs)} Documents.")
for i, d in enumerate(docs):
    print(f"Doc {i+1}: {d.page_content}")

Created 3 Documents.
Doc 1: [Press Release] Stratos Global, formerly a standalone entity, now operates under the corporate umbrella of the massive Novacorp conglomerate.
Doc 2: [Expansion News] Stratos Global has aggressively expanded its portfolio by recently acquiring the startup TinyAI.
Doc 3: [Real Estate Report] Novacorp and its key brands, including Stratos Global, share a unified HQ in London.


In [None]:
# --- Step 4: Naive RAG Implementation ---
from langchain_community.vectorstores import FAISS

print("\n--- NAIVE RAG (The Flip) ---")
query = "Who owns Stratos Global?"
print(f"Query: {query}")

# 1. Indexing
vectorstore = FAISS.from_documents(docs, embeddings)

# 2. Retrieval
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
retrieved_docs = retriever.invoke(query)

print("\nRetrieved Context (k=2):")
context_str = ""
for i, d in enumerate(retrieved_docs):
    print(f"{i+1}. {d.page_content}")
    context_str += d.page_content + "\n"

# 3. Generation
prompt = f"<|system|>\nAnswer the question based on the context.\n<|user|>\nContext:\n{context_str}\nQuestion:\n{query}\n<|assistant|>"
response = llm.invoke(prompt)
cleaned_response = response.split("<|assistant|>")[-1].strip()

print("\nLLM Answer:")
print(cleaned_response)


--- NAIVE RAG (The Flip) ---
Query: Who owns Stratos Global?

Retrieved Context (k=2):
1. [Expansion News] Stratos Global has aggressively expanded its portfolio by recently acquiring the startup TinyAI.
2. [Press Release] Stratos Global, formerly a standalone entity, now operates under the corporate umbrella of the massive Novacorp conglomerate.

LLM Answer:
Based on the context, Stratos Global has expanded by acquiring TinyAI. It operates under the corporate umbrella of Novacorp, but the context mainly highlights Stratos Global's acquisitions.

ANALYSIS:
The LLM is hedging. It sees Stratos as an 'Acquirer' (Doc 2, the expansion news) and as 'Acquired' (Doc 1, the press release).
In more complex texts it will often conflate the two and say "Stratos owns Novacorp", because semantic similarity does not encode the direction of a relationship.


In [None]:
# --- Step 5: Semantic Role Labeling (Graph Build) ---
# We prompt the LLM to identify the "Parent" and the "Child".
# This enforces Directionality.

kg = nx.DiGraph()

def extract_ownership(text):
    """
    Forces LLM to normalize relationships into Parent | OWNS | Child.
    """
    prompt = f"""<|system|>
    Identify corporate ownership. 
    Convert phrases like "acquired", "subsidiary of", "parent of", "under umbrella of" into the format:
    Parent_Company | OWNS | Child_Company
    <|user|>
    Text: {text}
    <|assistant|>"""
    
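    raw = llm.invoke(prompt)
    out = raw.split("<|assistant|>")[-1].strip()
    # Use the first line that looks like a triple, so that extra commentary
    # from the model does not leak into the entity names.
    for line in out.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and all(parts):
            return parts
    return []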

print("\n--- SEMANTIC ROLE LABELING (Directional Extraction) ---")

for doc in docs:
    print(f"\nParsing: {doc.page_content}")
    parts = extract_ownership(doc.page_content)
    
    if len(parts) >= 3:
        parent, rel, child = parts[0], parts[1], parts[2]
        
        # Clean up: drop a leading article only (str.removeprefix needs Python 3.9+)
        parent = parent.removeprefix("The ").strip()
        child = child.removeprefix("The ").strip()
        
        print(f"   [Analysis]: '{rel}' -> Parent is {parent}, Child is {child}")
        # Add Directed Edge: Parent -> Child
        kg.add_edge(parent, child, relation="OWNS")
        print(f"   [Graph Action]: Added ({parent}) -[OWNS]-> ({child})")


--- SEMANTIC ROLE LABELING (Directional Extraction) ---

Parsing: [Press Release] Stratos Global, formerly a standalone entity, now operates under the corporate umbrella of the massive Novacorp conglomerate.
   [Analysis]: 'operates under umbrella of' -> Parent is Novacorp, Child is Stratos Global
   [Graph Action]: Added (Novacorp) -[OWNS]-> (Stratos Global)

Parsing: [Expansion News] Stratos Global has aggressively expanded its portfolio by recently acquiring the startup TinyAI.
   [Analysis]: 'acquiring' -> Parent is Stratos Global, Child is TinyAI
   [Graph Action]: Added (Stratos Global) -[OWNS]-> (TinyAI)

Parsing: [Real Estate Report] Novacorp and its key brands, including Stratos Global, share a unified HQ in London.
   [Analysis]: 'brands including' -> Parent is Novacorp, Child is Stratos Global
   [Graph Action]: Added (Novacorp) -[OWNS]-> (Stratos Global)
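
The `Parent | OWNS | Child` convention only protects the graph if the model's output is validated before an edge is added. Small models often wrap the triple in extra commentary or drift from the format. Below is a minimal sketch of a stricter check, using a hypothetical helper `parse_triple` that is not part of the notebook. It assumes the relation is always the literal, upper-case string `OWNS`.

```python
import re

# One line of the form "Parent | OWNS | Child"; surrounding whitespace is ignored.
TRIPLE = re.compile(
    r"^\s*(?P<parent>[^|\n]+?)\s*\|\s*OWNS\s*\|\s*(?P<child>[^|\n]+?)\s*$"
)

def parse_triple(llm_output):
    """Return (parent, child) from the first well-formed line, else None."""
    for line in llm_output.splitlines():
        match = TRIPLE.match(line)
        if match:
            return match.group("parent"), match.group("child")
    return None

# Preamble text is skipped; only the well-formed line is used.
print(parse_triple("Sure! Here is the result:\nNovacorp | OWNS | Stratos Global"))
# ('Novacorp', 'Stratos Global')

# Anything that does not follow the format is rejected rather than guessed at.
print(parse_triple("Novacorp and Stratos Global share an HQ."))
# None
```

Returning `None` for malformed output means a bad extraction adds no edge at all, which is usually preferable to adding an edge with the wrong direction.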


In [None]:
# --- Step 6: The Solution (Upstream/Downstream Query) ---

print("\n--- UPSTREAM/DOWNSTREAM ANALYSIS ---")
print(f"Query: '{query}'")

target = "Stratos Global"
print(f"Target Node: '{target}'")

if target not in kg:
    print("Entity not found.")
else:
    # 1. Who owns Stratos? (Predecessors / Incoming Edges)
    parents = list(kg.predecessors(target))
    print(f"\nAnalyzing Upstream (Predecessors) -> i.e., Who owns Stratos?")
    if parents:
        for p in parents:
            print(f"   FOUND: {p}")
    else:
        print("   No owners found (Independent).")
        
    # 2. Who does Stratos own? (Successors / Outgoing Edges)
    children = list(kg.successors(target))
    print(f"\nAnalyzing Downstream (Successors) -> i.e., Who does Stratos own?")
    if children:
        for c in children:
            print(f"   FOUND: {c}")
    
    # 3. Formulate Answer
    print(f"\nFinal Answer (Precise Directionality):")
    answer = f"{target} is owned by {', '.join(parents) if parents else 'no one'}."
    if children:
        answer += f"\nHowever, {target} itself owns {', '.join(children)}."
    print(answer)


--- UPSTREAM/DOWNSTREAM ANALYSIS ---
Query: 'Who owns Stratos Global?'
Target Node: 'Stratos Global'

Analyzing Upstream (Predecessors) -> i.e., Who owns Stratos?
   FOUND: Novacorp

Analyzing Downstream (Successors) -> i.e., Who does Stratos own?
   FOUND: TinyAI

Final Answer (Precise Directionality):
Stratos Global is owned by Novacorp.
However, Stratos Global itself owns TinyAI.
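
The same upstream walk can be repeated to trace a full ownership chain rather than a single hop. The sketch below is library-free and re-declares the edges as a plain dictionary so it runs on its own. The names `owns`, `owned_by`, and `ownership_chain` are illustrative, and it assumes each company has at most one recorded parent. With the notebook's `kg`, `nx.ancestors(kg, target)` would return the same upstream companies as an unordered set.

```python
# Parent -> children adjacency, mirroring the edges added in Step 5.
owns = {
    "Novacorp": ["Stratos Global"],
    "Stratos Global": ["TinyAI"],
}

# Invert once so each child can look up its direct parent(s).
owned_by = {}
for parent, children in owns.items():
    for child in children:
        owned_by.setdefault(child, []).append(parent)

def ownership_chain(company, owned_by):
    """Walk upstream from a company, following the first recorded parent.

    Assumes at most one parent per company; `seen` guards against cycles.
    """
    chain, seen = [company], {company}
    while owned_by.get(chain[-1]):
        parent = owned_by[chain[-1]][0]
        if parent in seen:
            break
        chain.append(parent)
        seen.add(parent)
    return chain

print(" <- ".join(ownership_chain("TinyAI", owned_by)))
# TinyAI <- Stratos Global <- Novacorp
```

Each step follows a stored edge in one direction only, so the chain cannot place Stratos Global above Novacorp regardless of how the source sentences were phrased.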
