# RAG Failure #2: The Causal Synthesis Failure

## The Problem
Standard RAG is great at retrieving facts, but terrible at **applying strict rules** to those facts. 
When a user asks, "Is X compatible with Y?", RAG retrieves descriptions of X and Y. If those documents contain positive sentiment (e.g., "Safe for transport", "High quality"), the LLM often ignores the hidden technical incompatibility and hallucinates a "Safe" verdict.

## The Scenario: Industrial Chemical Safety
**Query:** "Can I store **Titan-X** and **Solvo-Clean** in the same storage cabinet?"

**The Hidden Logic Chain:**
1.  **Doc A:** Titan-X is a high-grade cleaning agent containing **concentrated Hydrogen Peroxide**.
2.  **Doc B:** Solvo-Clean is an organic solvent derived from **Acetone**.
3.  **Doc C (The Rule):** WARNING: Never mix **Peroxides (Oxidizers)** with **Organic Solvents**. Risk of spontaneous combustion.

**The Adversarial Noise:**
-   "Titan-X is FDA approved and safe for food surfaces."
-   "Solvo-Clean is an eco-friendly, biodegradable solvent."
-   "Both products are safe for transport under regulation 99."

Naive RAG sees "Safe", "Approved", and "Eco-friendly" and assumes the two products are compatible. A Knowledge Graph makes the chemical-class conflict explicit, so a deterministic check can catch it.

In [None]:
# --- Step 1: Environment Setup ---
!pip install -q langchain langchain-community langchain-huggingface faiss-cpu networkx transformers sentence-transformers accelerate bitsandbytes

In [None]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
import networkx as nx

# --- Step 2: Load Model (TinyLlama) ---
# Using TinyLlama because it is small enough to run on Colab CPU/Free Tier
# yet capable enough to demonstrate logic failures.
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"Loading {model_id}...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    max_new_tokens=256, 
    do_sample=False  # Greedy decoding: deterministic output without a near-zero temperature workaround
)

llm = HuggingFacePipeline(pipeline=pipe)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("Model loaded successfully.")

Loading TinyLlama/TinyLlama-1.1B-Chat-v1.0...
Model loaded successfully.


In [None]:
from langchain_core.documents import Document  # current import path; langchain.docstore is deprecated

# --- Step 3: Simulate Data ---
# Notice the mix of "Marketing Fluff" (Safe, Eco-friendly) and "Technical Specs" (Peroxide, Acetone).
raw_texts = [
    "Product Specification: Titan-X. This is a premium industrial cleaner. Key active ingredient: 35% Hydrogen Peroxide. Rated safe for food prep surfaces.",
    "Product Specification: Solvo-Clean. A high-performance organic solvent based on Acetone. Biodegradable and eco-friendly.",
    "Logistics Data: Both Titan-X and Solvo-Clean are approved for standard ground shipping (Class 9 non-hazardous transport).",
    "Safety Protocol 101: Organic Solvents (like Acetone, Ethanol) are highly flammable.",
    "Safety Protocol 102: Oxidizers (such as Peroxides, Nitrates) must never be stored with Flammable Solvents. Reaction causes fire."
]

docs = [Document(page_content=t) for t in raw_texts]
print(f"Dataset Created: {len(docs)} Documents.")
for i, d in enumerate(docs):
    print(f"[Doc {i}]: {d.page_content[:150]}...")

Dataset Created: 5 Documents.
[Doc 0]: Product Specification: Titan-X. This is a premium industrial cleaner. Key active ingredient: 35% Hydrogen Peroxide. Rated safe for food prep surfaces.
[Doc 1]: Product Specification: Solvo-Clean. A high-performance organic solvent based on Acetone. Biodegradable and eco-friendly.
[Doc 2]: Logistics Data: Both Titan-X and Solvo-Clean are approved for standard ground shipping (Class 9 non-hazardous transport).
...


In [None]:
# --- Step 4: Naive RAG (The Failure) ---
from langchain_community.vectorstores import FAISS

print("\n=== NAIVE RAG ATTEMPT ===")
query = "Can I safely store Titan-X and Solvo-Clean in the same cabinet?"
print(f"Query: {query}")

# 1. Index
vectorstore = FAISS.from_documents(docs, embeddings)

# 2. Retrieve
# K=3. The full logic chain needs Doc 0 + Doc 1 + Doc 3 + Doc 4, which already exceeds k.
# Worse, Doc 2 (Logistics) mentions BOTH product names, so it scores very high in vector similarity.
# The Safety Protocols (Docs 3 and 4) mention neither name, so they are pushed out of the top-k.
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
retrieved_docs = retriever.invoke(query)

print("\nRetrieved Context:")
context_str = ""
for i, d in enumerate(retrieved_docs):
    print(f"{i+1}. {d.page_content}")
    context_str += d.page_content + "\n"

# 3. Generate
prompt = f"<|system|>\nYou are a safety assistant. Answer based on context.\n<|user|>\nContext:\n{context_str}\nQuestion:\n{query}\n<|assistant|>"
print("\nGenerating Answer...")  # Announce before the (slow) generation call, not after it
response = llm.invoke(prompt)
# The pipeline echoes the prompt, so keep only the text after the assistant tag
cleaned_response = response.split("<|assistant|>")[-1].strip()

print(f"Model Answer:\n{cleaned_response}")


=== NAIVE RAG ATTEMPT ===
Query: Can I safely store Titan-X and Solvo-Clean in the same cabinet?

Retrieved Context:
1. Logistics Data: Both Titan-X and Solvo-Clean are approved for standard ground shipping (Class 9 non-hazardous transport).
2. Product Specification: Titan-X. This is a premium industrial cleaner... Rated safe for food prep surfaces.
3. Product Specification: Solvo-Clean. A high-performance organic solvent based on Acetone. Biodegradable and eco-friendly.

Generating Answer...
Model Answer:
Yes, both Titan-X and Solvo-Clean are approved for standard ground shipping (Class 9 non-hazardous transport) and are considered safe for food prep surfaces.

FAILURE ANALYSIS: 
The retriever ranked the 'Logistics' document first because it contains BOTH product names. 
It missed the 'Safety Protocol' documents because they mention neither 'Titan-X' nor 'Solvo-Clean'.
Result: the model never saw the rule and answered "Yes" from marketing and shipping text. DANGEROUS HALLUCINATION.
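
The ranking problem described above can be reproduced without any model. The sketch below is only a rough illustration: it swaps the embedding similarity for a plain shared-token count (an assumption made for clarity, not what FAISS or MiniLM actually compute), and the helper names `tokens` and `overlap_score` are hypothetical. It shows that the two safety protocols share no tokens with the query, while the logistics document shares the most.

```python
import re

# The five demo documents from Step 3, copied verbatim.
raw_texts = [
    "Product Specification: Titan-X. This is a premium industrial cleaner. Key active ingredient: 35% Hydrogen Peroxide. Rated safe for food prep surfaces.",
    "Product Specification: Solvo-Clean. A high-performance organic solvent based on Acetone. Biodegradable and eco-friendly.",
    "Logistics Data: Both Titan-X and Solvo-Clean are approved for standard ground shipping (Class 9 non-hazardous transport).",
    "Safety Protocol 101: Organic Solvents (like Acetone, Ethanol) are highly flammable.",
    "Safety Protocol 102: Oxidizers (such as Peroxides, Nitrates) must never be stored with Flammable Solvents. Reaction causes fire.",
]
query = "Can I safely store Titan-X and Solvo-Clean in the same cabinet?"


def tokens(text):
    """Lowercase word tokens; hyphens are kept so 'Titan-X' stays one token."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))


def overlap_score(query_text, doc_text):
    """Crude stand-in for similarity: number of tokens shared with the query."""
    return len(tokens(query_text) & tokens(doc_text))


scores = [overlap_score(query, t) for t in raw_texts]
# Highest score first; ties keep their original document order.
ranking = sorted(range(len(raw_texts)), key=lambda i: scores[i], reverse=True)
top_k = ranking[:3]

for i in ranking:
    print(f"Doc {i}: shared tokens = {scores[i]}")
print("Top-3 documents:", top_k)
```

Under this proxy the rule documents can never enter the top-k, because the link between the product names and the rule runs through an intermediate concept (the chemical class) that the query never mentions.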


In [None]:
# --- Step 5: Industry Standard KG Construction ---
# In a real industry setting, we don't just extract random strings.
# We map entities to a FIXED ONTOLOGY (Categories).
# We will define a 'Standardized Class' extraction.

kg = nx.DiGraph()

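def map_to_ontology(text):
    """
    Simulates an Ontology Mapper.
    We prompt the LLM to classify the items in the text into specific Chemical Classes.
    Classes used in this demo prompt: [Peroxide, Organic Solvent]
    Properties used in this demo prompt: [Oxidizer, Flammable]
    (A fuller ontology would add classes such as Acid, Base, Inert and properties such as Corrosive.)
    Returns a list of [subject, relation, object] triplets.
    """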
    
    ontology_prompt = f"""<|system|>
    You are a Chemical Safety Engineer.
    Analyze the text. Map entities to these classes: [Peroxide, Organic Solvent].
    Map classes to properties: [Oxidizer, Flammable].
    Output format: Entity | Relation | Class_or_Property
    <|user|>
    Text: {text}
    <|assistant|>"""
    
    # We invoke the LLM to structure the data
    raw = llm.invoke(ontology_prompt)
    output = raw.split("<|assistant|>")[-1].strip()
    
    # Heuristic parsing of the LLM output (Simulating an ETL Parser)
    triplets = []
    lines = output.split("\n")
    for line in lines:
        if "|" in line:
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 3:
                triplets.append(parts)
    return triplets

print("\n=== PROFESSIONAL KG EXTRACTION (Ontology Mapping) ===")

# We only process relevant docs to build the chemical registry
# In reality, this runs over the whole corpus.
relevant_indices = [0, 1, 3, 4] 

for i in relevant_indices:
    doc = docs[i]
    print(f"\nProcessing: {doc.page_content}")
    extracted = map_to_ontology(doc.page_content)
    
    for subj, rel, obj in extracted:
        # In a real system, we would validate against a schema here
        print(f"   -> Mapped: {subj} {rel} {obj}")
        kg.add_edge(subj, obj, relation=rel)

print(f"\nGraph Built: {kg.number_of_nodes()} Nodes.")


=== PROFESSIONAL KG EXTRACTION (Ontology Mapping) ===

Processing: Product Specification: Titan-X. This is a premium industrial cleaner. Key active ingredient: 35% Hydrogen Peroxide. Rated safe for food prep surfaces.
   -> Mapped: Titan-X is_a Peroxide
Processing: Product Specification: Solvo-Clean. A high-performance organic solvent based on Acetone. Biodegradable and eco-friendly.
   -> Mapped: Solvo-Clean is_a Organic Solvent
Processing: Safety Protocol 101: Organic Solvents (like Acetone, Ethanol) are highly flammable.
   -> Mapped: Organic Solvent has_property Flammable
Processing: Safety Protocol 102: Oxidizers (such as Peroxides, Nitrates) must never be stored with Flammable Solvents. Reaction causes fire.
   -> Mapped: Peroxide has_property Oxidizer

Graph Built: 6 Nodes.
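
The extraction loop notes that a real system would validate triplets against a schema before adding them to the graph. A minimal sketch of such a check follows. The constant names and `validate_triplet` are hypothetical, and the allowed sets simply mirror the classes and properties named in the demo prompt; the last two candidate rows are invented examples of bad LLM output, not results from an actual run.

```python
# Fixed ontology for this demo (mirrors the classes/properties in the prompt).
ALLOWED_RELATIONS = {"is_a", "has_property"}
ALLOWED_CLASSES = {"Peroxide", "Organic Solvent"}
ALLOWED_PROPERTIES = {"Oxidizer", "Flammable"}


def validate_triplet(triplet):
    """Return True only if the triplet fits the fixed ontology."""
    if len(triplet) != 3:
        return False
    subj, rel, obj = (part.strip() for part in triplet)
    if not subj or rel not in ALLOWED_RELATIONS:
        return False
    if rel == "is_a":
        # Any product name may be a subject, but the class must be known.
        return obj in ALLOWED_CLASSES
    # has_property: subject must be a known class, object a known property.
    return subj in ALLOWED_CLASSES and obj in ALLOWED_PROPERTIES


candidates = [
    ["Titan-X", "is_a", "Peroxide"],
    ["Solvo-Clean", "is_a", "Organic Solvent"],
    ["Organic Solvent", "has_property", "Flammable"],
    ["Peroxide", "has_property", "Oxidizer"],
    ["Titan-X", "is_a", "Safe"],                  # invented: class outside the ontology
    ["Entity", "Relation", "Class_or_Property"],  # invented: model echoes the format header
]

accepted = [t for t in candidates if validate_triplet(t)]
rejected = [t for t in candidates if not validate_triplet(t)]

print(f"Accepted {len(accepted)} triplets, rejected {len(rejected)}:")
for t in rejected:
    print("   rejected ->", " | ".join(t))
```

In the notebook's loop, a call like `validate_triplet([subj, rel, obj])` could gate `kg.add_edge`, so that marketing words such as "Safe" never become graph nodes.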


In [None]:
# --- Step 6: The Solution (Graph Logic) ---
# We do NOT ask the LLM to guess. We run code.

print("\n=== RUNNING LOGIC ENGINE (Safety Check) ===")

def get_properties(node):
    """Two-hop lookup: instance -> its classes -> properties of those classes."""
    props = set()
    if node not in kg:
        # Unknown chemical: return empty results instead of raising NetworkXError
        return [], props
    # Find Class (e.g., Titan-X -> Peroxide)
    classes = list(kg.successors(node))
    for cls in classes:
        # Find Properties of that Class (e.g., Peroxide -> Oxidizer)
        props.update(kg.successors(cls))
    return classes, props

def check_safety(chem_a, chem_b):
    print(f"Checking Compatibility: {chem_a}  VS  {chem_b}")
    
    # 1. Graph Traversal to find properties
    class_a, props_a = get_properties(chem_a)
    class_b, props_b = get_properties(chem_b)
    
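    # Join the collections so the log shows plain names rather than Python list/set reprs
    print(f"\nAnalyzing '{chem_a}':")
    print(f"  - Is instance of: {', '.join(class_a)}")
    print(f"  - Inherited Property: {', '.join(sorted(props_a))}")
    
    print(f"\nAnalyzing '{chem_b}':")
    print(f"  - Is instance of: {', '.join(class_b)}")
    print(f"  - Inherited Property: {', '.join(sorted(props_b))}")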
    
    # 2. Deterministic Rule Check
    # In a real app, these rules are stored in a database.
    print("\nCONFLICT CHECK:")
    
    # Rule: Oxidizer + Flammable = Danger
    print("  Checking Rule: (Oxidizer + Flammable)...")
    if ('Oxidizer' in props_a and 'Flammable' in props_b) or \
       ('Flammable' in props_a and 'Oxidizer' in props_b):
        print("  !!! VIOLATION FOUND !!!")
        print("  Rule: Oxidizers cannot be stored with Flammable material.")
        return "UNSAFE"
    
    # No rule fired. Caveat: a chemical with no known properties also lands here.
    return "SAFE"

# Run the logic
verdict = check_safety("Titan-X", "Solvo-Clean")
print(f"\nFinal Verdict: {verdict}")


=== RUNNING LOGIC ENGINE (Safety Check) ===
Checking Compatibility: Titan-X  VS  Solvo-Clean

Analyzing 'Titan-X':
  - Is instance of: Peroxide
  - Inherited Property: Oxidizer

Analyzing 'Solvo-Clean':
  - Is instance of: Organic Solvent
  - Inherited Property: Flammable

CONFLICT CHECK:
  Checking Rule: (Oxidizer + Flammable)...
  !!! VIOLATION FOUND !!!
  Rule: Oxidizers cannot be stored with Flammable material.

Final Verdict: UNSAFE
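
The logic engine hard-codes its single rule and mentions that a real app would keep rules in a database. The sketch below shows one way that could look, using an in-memory SQLite table. The table name `incompatible`, the function `check_rules`, and the extra "UNKNOWN" verdict are all assumptions for illustration and are not part of the notebook. The "UNKNOWN" branch addresses a gap in the demo: a chemical with no extracted properties would otherwise be reported as "SAFE".

```python
import sqlite3

# In-memory rule store; a real deployment would point at a persistent database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incompatible (prop_a TEXT, prop_b TEXT, reason TEXT)")
conn.execute(
    "INSERT INTO incompatible VALUES (?, ?, ?)",
    ("Oxidizer", "Flammable", "Oxidizers cannot be stored with Flammable material."),
)
conn.commit()


def check_rules(props_a, props_b):
    """Return (verdict, reasons) for two sets of inherited properties."""
    # Fail closed: missing properties mean we cannot vouch for safety.
    if not props_a or not props_b:
        return "UNKNOWN", []
    violations = []
    rows = conn.execute("SELECT prop_a, prop_b, reason FROM incompatible")
    for prop_a, prop_b, reason in rows:
        # Rules are symmetric, so test both orderings.
        forward = prop_a in props_a and prop_b in props_b
        backward = prop_b in props_a and prop_a in props_b
        if forward or backward:
            violations.append(reason)
    return ("UNSAFE" if violations else "SAFE"), violations


verdict, reasons = check_rules({"Oxidizer"}, {"Flammable"})
print("Titan-X vs Solvo-Clean:", verdict, reasons)

verdict_unknown, _ = check_rules(set(), {"Flammable"})
print("Unmapped product vs Solvo-Clean:", verdict_unknown)
```

With this layout, adding a new incompatibility is a row insert rather than a code change, and the property sets returned by `get_properties` can be passed in directly.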
