# RAG Failure #11: The Negation & Absence Failure

## The Problem
Vector search is fundamentally **additive**: it finds documents in which something *is* present, and it struggles to find documents from which something is *absent*.
If a user asks, **"Which products do NOT contain peanuts?"**, the term "peanuts" dominates the query embedding, so the retriever fetches documents that mention peanuts. The LLM then sees a context full of peanut products and often fails to identify the safe options, which likely never mention peanuts at all and therefore receive low similarity scores.

## The Scenario: Allergen Screening
**Query:** "Which granola bars are safe for a user with a Peanut Allergy?"

**The Hazardous Data:**
1.  **Doc 1 (Nutty-Crunch):** "The **Nutty-Crunch** bar is packed with protein. Ingredients: Oats, **Peanuts**, Honey."
2.  **Doc 2 (Cocoa-Delight):** "**Cocoa-Delight** is a sweet treat. Ingredients: Rolled Oats, Cocoa Butter, Almonds, Sugar."
3.  **Doc 3 (Berry-Blast):** "**Berry-Blast** is made with real fruit. Ingredients: Dried Cranberries, Wheat, Soy Lecithin."
4.  **Doc 4 (Warning):** "**Berry-Blast** is manufactured in a facility that processes **Peanuts**."

**Naive RAG Failure:** 
1.  The retriever returns Doc 1 and Doc 4 because both contain the word "Peanuts".
2.  It **misses** Doc 2 (Cocoa-Delight), which never mentions "Peanuts" and therefore scores low on similarity.
3.  Result: the user asks for safe bars, but the system only talks about the unsafe ones.
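
The ranking described above can be reproduced with a toy model. The sketch below is not the embedding model used later in this notebook: it swaps in plain bag-of-words cosine similarity, with a hypothetical stopword list and a crude plural-stripping rule, purely to show why a document that never mentions the allergen cannot score well against a query that does.

```python
import math
import re
from collections import Counter

# Toy stand-in for an embedding model: bag-of-words vectors, NOT a dense encoder.
# The stopword list is an assumption made for this illustration.
STOPWORDS = {"a", "the", "is", "with", "in", "that", "which", "are", "for"}

def tokenize(text):
    words = re.findall(r"[a-z]+", text.lower())
    # Crude singularization so "Peanuts" and "Peanut" compare equal.
    return [w.rstrip("s") for w in words if w not in STOPWORDS]

def cosine(a_tokens, b_tokens):
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = {
    "Doc 1 (Nutty-Crunch)": "The Nutty-Crunch bar is packed with protein. Ingredients: Oats, Peanuts, Honey.",
    "Doc 2 (Cocoa-Delight)": "Cocoa-Delight is a sweet treat. Ingredients: Rolled Oats, Cocoa Butter, Almonds, Sugar.",
    "Doc 3 (Berry-Blast)": "Berry-Blast is made with real fruit. Ingredients: Dried Cranberries, Wheat, Soy Lecithin.",
    "Doc 4 (Warning)": "Berry-Blast is manufactured in a facility that processes Peanuts.",
}

query = "Which granola bars are safe for a user with a Peanut Allergy?"
scores = {name: cosine(tokenize(query), tokenize(text)) for name, text in docs.items()}

# Rank documents by similarity and keep the top 2, as a k=2 retriever would.
ranked = sorted(scores, key=scores.get, reverse=True)
top_k = ranked[:2]
print(top_k)  # ['Doc 1 (Nutty-Crunch)', 'Doc 4 (Warning)']
print(scores["Doc 2 (Cocoa-Delight)"])  # 0.0: no shared terms with the query
```

A dense encoder will not produce exact zeros, but the direction is the same: the safe product shares nothing with the query that the similarity score can reward.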

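**KG Solution:**
We use a **set difference**: take every product in the graph and subtract those linked to the allergen, whether as a listed ingredient or as a facility trace.
`Safe_Products = All_Products - (Products_With_Peanuts ∪ Products_With_Traces)`.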
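
As a minimal sketch, the formula maps directly onto Python's built-in `set` type. The three sets below are filled in by hand from the scenario documents; in the notebook they are derived from the graph instead.

```python
# Hand-written sets taken from the scenario above (illustrative only).
all_products = {"Nutty-Crunch", "Cocoa-Delight", "Berry-Blast"}
products_with_peanuts = {"Nutty-Crunch"}   # peanuts listed as an ingredient
products_with_traces = {"Berry-Blast"}     # facility warning

# Union the two unsafe groups, then subtract them from the full catalog.
unsafe_products = products_with_peanuts | products_with_traces
safe_products = all_products - unsafe_products

print(safe_products)  # {'Cocoa-Delight'}
```

The safe product is found without it ever having to match the query text, which is the property vector retrieval lacks.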

In [None]:
# --- Step 1: Environment Setup ---
!pip install -q langchain langchain-community langchain-huggingface faiss-cpu networkx transformers sentence-transformers accelerate bitsandbytes

In [None]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
import networkx as nx

# --- Step 2: Load Model ---
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"Loading {model_id}...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    max_new_tokens=256, 
    temperature=0.1, 
    do_sample=True
)

llm = HuggingFacePipeline(pipeline=pipe)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("Model loaded. Pipeline ready.")

Loading TinyLlama-1.1B-Chat-v1.0...
Model loaded. Pipeline ready.


In [None]:
from langchain_core.documents import Document  # current import path; langchain.docstore.document is deprecated

# --- Step 3: Simulate Product Labels ---
raw_texts = [
    "[Product Label] Name: Nutty-Crunch. Description: High protein energy bar. Ingredients List: Oats, Peanuts, Honey, Whey Protein.",
    "[Product Label] Name: Cocoa-Delight. Description: A chocolate lover's dream. Ingredients List: Rolled Oats, Cocoa Butter, Almonds, Cane Sugar.",
    "[Product Label] Name: Berry-Blast. Description: Real fruit antioxidant bar. Ingredients List: Dried Cranberries, Wheat Flour, Soy Lecithin.",
    "[Factory Audit] Warning: The Berry-Blast production line is located in a facility that processes Peanuts and Tree Nuts."
]

docs = [Document(page_content=t) for t in raw_texts]
print(f"Created {len(docs)} Label Documents.")
for i, d in enumerate(docs):
    print(f"Doc {i+1}: {d.page_content}")

Created 4 Label Documents.
Doc 1: [Product Label] Name: Nutty-Crunch. Description: High protein energy bar. Ingredients List: Oats, Peanuts, Honey, Whey Protein.
Doc 2: [Product Label] Name: Cocoa-Delight. Description: A chocolate lover's dream. Ingredients List: Rolled Oats, Cocoa Butter, Almonds, Cane Sugar.
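Doc 3: [Product Label] Name: Berry-Blast. Description: Real fruit antioxidant bar. Ingredients List: Dried Cranberries, Wheat Flour, Soy Lecithin.
Doc 4: [Factory Audit] Warning: The Berry-Blast production line is located in a facility that processes Peanuts and Tree Nuts.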


In [None]:
# --- Step 4: Naive RAG Implementation ---
from langchain_community.vectorstores import FAISS

print("\n--- NAIVE RAG (Negation Blindness) ---")
query = "Which granola bars are safe for someone with a Peanut Allergy?"
print(f"Query: {query}")

# 1. Indexing
vectorstore = FAISS.from_documents(docs, embeddings)

# 2. Retrieval
# The query mentions "Peanut Allergy", so the nearest neighbors are the documents about peanuts:
# Nutty-Crunch (Doc 1) and the Factory Audit (Doc 4).
# Cocoa-Delight (Doc 2) is NOT returned: it never mentions peanuts, so its similarity to the query is low.
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
retrieved_docs = retriever.invoke(query)

print("\nRetrieved Context (k=2):")
context_str = ""
for i, d in enumerate(retrieved_docs):
    print(f"{i+1}. {d.page_content}")
    context_str += d.page_content + "\n"

# 3. Generation
prompt = f"<|system|>\nAnswer the question based on the context. List safe options.\n<|user|>\nContext:\n{context_str}\nQuestion:\n{query}\n<|assistant|>"
response = llm.invoke(prompt)
cleaned_response = response.split("<|assistant|>")[-1].strip()

print("\nLLM Answer:")
print(cleaned_response)


--- NAIVE RAG (Negation Blindness) ---
Query: Which granola bars are safe for someone with a Peanut Allergy?

Retrieved Context (k=2):
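1. [Factory Audit] Warning: The Berry-Blast production line is located in a facility that processes Peanuts and Tree Nuts.
2. [Product Label] Name: Nutty-Crunch. Description: High protein energy bar. Ingredients List: Oats, Peanuts, Honey, Whey Protein.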

LLM Answer:
Based on the context, the Berry-Blast production line processes Peanuts, and the Nutty-Crunch bar contains Peanuts. Therefore, neither of these bars is safe for someone with a Peanut Allergy.

ANALYSIS:
Failure Mode: 'Omission'.
The retriever fetched the DANGEROUS items because they matched the word 'Peanut'. 
It completely missed 'Cocoa-Delight' because that document does not contain the word 'Peanut'.
The LLM correctly identified the retrieved items as unsafe, but failed to provide the user with the SAFE alternative.
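
One way to make this omission visible is to compare the product names in the retrieved context against the full catalog. The sketch below assumes a known product list (`catalog`) and uses a hypothetical helper, `products_missing_from_context`; the two context strings are the documents the LLM answer above refers to.

```python
# Assumed: the full list of product names is known up front.
catalog = ["Nutty-Crunch", "Cocoa-Delight", "Berry-Blast"]

# The two documents that the LLM answer above was based on.
retrieved_context = [
    "[Factory Audit] Warning: The Berry-Blast production line is located in a facility that processes Peanuts and Tree Nuts.",
    "[Product Label] Name: Nutty-Crunch. Description: High protein energy bar. Ingredients List: Oats, Peanuts, Honey, Whey Protein.",
]

def products_missing_from_context(catalog, context_docs):
    """Return catalog products that no retrieved document mentions by name."""
    joined = " ".join(context_docs)
    return [name for name in catalog if name not in joined]

missing = products_missing_from_context(catalog, retrieved_context)
print(missing)  # ['Cocoa-Delight']
```

Any product in `missing` is one the LLM could not have recommended, however well it reasoned over the context it was given.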


In [None]:
# --- Step 5: Ingredient Graph Construction ---
# We extract specific edge types: CONTAINS vs TRACES_OF

kg = nx.DiGraph()

def extract_ingredients(text):
    """
    Ask the LLM to turn one label or audit warning into
    `Product | RELATION | Ingredient` lines; return each line as a list of parts.
    """
    prompt = f"""<|system|>
    Extract the Product and the Ingredient.
    If it's an ingredient list, use relation: CONTAINS
    If it's a facility warning, use relation: TRACES_OF
    Format: Product | RELATION | Ingredient
    <|user|>
    Text: {text}
    <|assistant|>"""
    
    raw = llm.invoke(prompt)
    out = raw.split("<|assistant|>")[-1].strip()
    if "|" in out:
        # Handle multi-line extractions (LLMs sometimes output multiple lines)
        lines = out.split("\n")
        results = []
        for line in lines:
            if "|" in line:
                results.append([p.strip() for p in line.split("|")])
        return results
    return []

print("\n--- INGREDIENT PARSING PIPELINE ---")

for doc in docs:
    print(f"\nProcessing: {doc.page_content}")
    triplets = extract_ingredients(doc.page_content)
    
    for parts in triplets:
        if len(parts) >= 3:
            prod, rel, ing = parts[0], parts[1], parts[2]
            
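            # Normalize: drop the article and " production line" from product names;
            # strip periods and a stray leading "and" from ingredient names
            prod = prod.replace("The ", "").replace(" production line", "").strip()
            ing = ing.replace(".", "").strip()
            if ing.lower().startswith("and "):
                ing = ing[4:].strip()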
            
            print(f"   [Extracted]: {prod} | {rel} | {ing}")
            kg.add_edge(prod, ing, relation=rel)
            
            # Tag the product node so we can find 'All Products' later
            kg.add_node(prod, type="Product")


--- INGREDIENT PARSING PIPELINE ---

Processing: [Product Label] Name: Nutty-Crunch. Description: High protein energy bar. Ingredients List: Oats, Peanuts, Honey, Whey Protein.
   [Extracted]: Nutty-Crunch | CONTAINS | Oats
   [Extracted]: Nutty-Crunch | CONTAINS | Peanuts
   [Extracted]: Nutty-Crunch | CONTAINS | Honey

Processing: [Product Label] Name: Cocoa-Delight. Description: A chocolate lover's dream. Ingredients List: Rolled Oats, Cocoa Butter, Almonds, Cane Sugar.
   [Extracted]: Cocoa-Delight | CONTAINS | Rolled Oats
   [Extracted]: Cocoa-Delight | CONTAINS | Almonds

   [Extracted]: Berry-Blast | TRACES_OF | Peanuts
   [Extracted]: Berry-Blast | TRACES_OF | Tree Nuts
...
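
The pipeline above depends on the model emitting well-formed `Product | RELATION | Ingredient` lines. A stricter parser is one possible safeguard. The function below is a hypothetical sketch, not part of the notebook's code: it keeps only lines with exactly three non-empty parts and a relation from the two expected labels, so chatter or an echoed format line is dropped rather than added to the graph.

```python
# Hypothetical stricter parser for the `Product | RELATION | Ingredient` format.
ALLOWED_RELATIONS = {"CONTAINS", "TRACES_OF"}

def parse_triplets(llm_output):
    """Return (product, relation, ingredient) tuples for well-formed lines only."""
    triplets = []
    for line in llm_output.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3:
            continue  # not a triplet line (e.g. free-text chatter)
        product, relation, ingredient = parts
        relation = relation.upper()
        if product and ingredient and relation in ALLOWED_RELATIONS:
            triplets.append((product, relation, ingredient))
    return triplets

# Made-up model output containing two kinds of noise.
sample = """Nutty-Crunch | CONTAINS | Oats
Nutty-Crunch | CONTAINS | Peanuts
Here are the results:
Berry-Blast | traces_of | Peanuts
Product | RELATION | Ingredient"""

for triplet in parse_triplets(sample):
    print(triplet)
```

With this input the chatter line and the echoed format line are both rejected, and the lowercase `traces_of` is normalized to `TRACES_OF`.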


In [None]:
# --- Step 6: The Solution (Set Difference) ---

print("\n--- GRAPH SET DIFFERENCE (Safety Check) ---")
print(f"Query: \"{query}\"")

allergen = "Peanuts"
print(f"Target Allergen: '{allergen}'")

def check_allergen_safety(allergen_name):
    # 1. Get Set of All Products
    all_products = {n for n, d in kg.nodes(data=True) if d.get('type') == 'Product'}
    print(f"\n1. Identifying ALL Products:\n   {all_products}")
    
    # 2. Get Set of Unsafe Products (Predecessors of the Allergen)
    # We look for nodes pointing TO the allergen
    unsafe_products = set()
    
    # Simple case-insensitive substring match for the demo
    needle = allergen_name.lower()
    target_nodes = [n for n in kg.nodes() if needle in n.lower()]
    
    print(f"\n2. Identifying UNSAFE Products (Connected to '{allergen_name}'):")
    for target in target_nodes:
        predecessors = list(kg.predecessors(target))
        for p in predecessors:
            rel = kg[p][target]['relation']
            unsafe_products.add(p)
            print(f"   - {p} (Relation: {rel})")
            
    # 3. Perform Set Difference
    safe_products = all_products - unsafe_products
    
    print(f"\n3. Calculating SAFE Products (All - Unsafe):\n   {safe_products}")
    
    if safe_products:
        # Sort so the answer is deterministic (set iteration order is not)
        return f"The following products are safe (No {allergen_name} or Traces detected): {', '.join(sorted(safe_products))}."
    else:
        return "No safe products found."

final_answer = check_allergen_safety(allergen)

print(f"\nFinal Answer (Generated from Logic):\n{final_answer}")


--- GRAPH SET DIFFERENCE (Safety Check) ---
Query: "Which granola bars are safe for someone with a Peanut Allergy?"
Target Allergen: 'Peanuts'

1. Identifying ALL Products:
   {'Nutty-Crunch', 'Cocoa-Delight', 'Berry-Blast'}

2. Identifying UNSAFE Products (Connected to 'Peanuts'):
   - Nutty-Crunch (Relation: CONTAINS)
   - Berry-Blast (Relation: TRACES_OF)

3. Calculating SAFE Products (All - Unsafe):
   {'Cocoa-Delight'}

Final Answer (Generated from Logic):
The following products are safe (No Peanuts or Traces detected): Cocoa-Delight.
