# RAG Failure #4: The Contradictory Information Failure

## The Problem
Knowledge bases evolve. Old documents (2021 policies) often coexist with new documents (2024 policies) in the vector store. 
When a user asks **"What is the policy?"**, vector search retrieves *both* the old and new versions because they are semantically near-identical: an embedding captures what a passage is about, not when it was written. The LLM, unable to distinguish "truth" from "history", often merges them into a hallucinated mess or hedges its answer.

## The Scenario: HR Policy "Remote Work" Saga
**Query:** "How many days can I work from home?"

**The Conflicting Data:**
1.  **Doc A (2021 Handbook):** "Under the 'Flex-21' initiative, all employees are entitled to **5 days** of remote work per week."
2.  **Doc B (2023 Update):** "Due to RTO mandates, the remote allowance is reduced to **2 days** (hybrid model)."
3.  **Doc C (2024 Memo):** "Effective immediately, full-time remote work is revoked. Maximum allowance is **0 days** (Strict On-Site)."

**Naive RAG Failure:** It retrieves all three. The LLM says: *"You can work 5 days, but also 2 days, and currently 0 days."*

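**KG Solution:** We build a **Temporal Graph** in which every fact carries a timestamp. In the demo below, the dated facts are stored as a `history` list on the topic node rather than as separate timestamped edges. The query engine sorts the facts by date and returns only the latest state.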
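
The "latest state wins" rule can be sketched without any graph library. The minimal example below assumes each fact has already been reduced to a (topic, value, effective date) record. The `Fact` class and `latest_state` function are hypothetical names used only for illustration; they are not part of the notebook's pipeline.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fact:
    """One dated statement about a topic (hypothetical structure)."""
    topic: str
    value: str
    valid_from: date


def latest_state(facts, topic):
    """Return the most recent fact for `topic`, or None if there is none."""
    candidates = [f for f in facts if f.topic == topic]
    if not candidates:
        return None
    return max(candidates, key=lambda f: f.valid_from)


facts = [
    Fact("Remote Work Policy", "5 days", date(2021, 1, 1)),
    Fact("Remote Work Policy", "2 days", date(2023, 6, 15)),
    Fact("Remote Work Policy", "0 days", date(2024, 2, 10)),
]

current = latest_state(facts, "Remote Work Policy")
print(current.value, current.valid_from)
```

Older facts are kept rather than deleted, so the same records can still answer historical questions.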

In [None]:
# --- Step 1: Environment Setup ---
!pip install -q langchain langchain-community langchain-huggingface faiss-cpu networkx transformers sentence-transformers accelerate bitsandbytes dateparser

In [None]:
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from langchain_huggingface import HuggingFacePipeline, HuggingFaceEmbeddings
import networkx as nx
import dateparser # For parsing "Jan 2021" into datetime objects

# --- Step 2: Load Model ---
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"Loading {model_id}...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    max_new_tokens=256, 
    temperature=0.1, 
    do_sample=True
)

llm = HuggingFacePipeline(pipeline=pipe)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
print("Model loaded. Pipeline ready.")

Loading TinyLlama/TinyLlama-1.1B-Chat-v1.0...
Model loaded. Pipeline ready.


In [None]:
# langchain.docstore.document is a deprecated path; langchain_core ships with langchain.
from langchain_core.documents import Document

# --- Step 3: Simulate Conflicting Data ---
# Note: The dates are embedded in the text, which is common in real PDF parsing.
raw_texts = [
    "[Dated Jan 1, 2021] Under the 'Flex-21' initiative, all employees are entitled to 5 days of remote work per week.",
    "[Dated June 15, 2023] Update: Due to RTO mandates, the remote allowance is reduced to 2 days (hybrid model).",
    "[Dated Feb 10, 2024] EXECUTIVE MEMO: Effective immediately, full-time remote work is revoked. Maximum allowance is 0 days (Strict On-Site)."
]

docs = [Document(page_content=t) for t in raw_texts]
print(f"Created {len(docs)} Conflicting Documents.")
for i, d in enumerate(docs):
    print(f"Doc {i+1} ({['2021', '2023', '2024'][i]}): {d.page_content}")

Created 3 Conflicting Documents.
Doc 1 (2021): [Dated Jan 1, 2021] Under the 'Flex-21' initiative, all employees are entitled to 5 days of remote work per week.
Doc 2 (2023): [Dated June 15, 2023] Update: Due to RTO mandates, the remote allowance is reduced to 2 days (hybrid model).
Doc 3 (2024): [Dated Feb 10, 2024] EXECUTIVE MEMO: Effective immediately, full-time remote work is revoked. Maximum allowance is 0 days (Strict On-Site).


In [None]:
# --- Step 4: Naive RAG Implementation ---
from langchain_community.vectorstores import FAISS

print("\n--- NAIVE RAG (The Confusion) ---")
query = "How many days can I work from home?"
print(f"Query: {query}")

# 1. Indexing
vectorstore = FAISS.from_documents(docs, embeddings)

# 2. Retrieval
# With k=3 and only three documents indexed, every version comes back.
# Even in a larger store, all three would rank highly for "days work from home".
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
retrieved_docs = retriever.invoke(query)

print("\nRetrieved Context (k=3):")
context_str = ""
for i, d in enumerate(retrieved_docs):
    print(f"{i+1}. {d.page_content}")
    context_str += d.page_content + "\n"

# 3. Generation
prompt = f"<|system|>\nAnswer the question based on the context.\n<|user|>\nContext:\n{context_str}\nQuestion:\n{query}\n<|assistant|>"
response = llm.invoke(prompt)
cleaned_response = response.split("<|assistant|>")[-1].strip()

print("\nLLM Answer:")
print(cleaned_response)


--- NAIVE RAG (The Confusion) ---
Query: How many days can I work from home?

Retrieved Context (k=3):
1. [Dated Feb 10, 2024] EXECUTIVE MEMO: Effective immediately, full-time remote work is revoked. Maximum allowance is 0 days (Strict On-Site).
2. [Dated Jan 1, 2021] Under the 'Flex-21' initiative, all employees are entitled to 5 days of remote work per week.
3. [Dated June 15, 2023] Update: Due to RTO mandates, the remote allowance is reduced to 2 days (hybrid model).

LLM Answer:
Based on the provided documents, there are conflicting policies regarding remote work. 
One document states 5 days, another says 2 days, and a recent memo says 0 days.
It is unclear which policy applies to you.

ANALYSIS:
The LLM acts as a summary engine, not a logic engine. It sees 3 facts and reports 3 facts. 
It fails to definitively say: "The answer is 0 days because 2024 > 2023 > 2021."
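
The comparison the LLM fails to make is trivial once the dates are data rather than prose. As a rough sketch, assuming every chunk starts with a `[Dated <Month> <day>, <year>]` prefix like the sample documents, the date can be pulled out with a regular expression and the newest chunk selected before anything reaches the model. `parse_dated_prefix` and `newest_chunk` are illustrative names, and real documents will rarely be this uniform.

```python
import re
from datetime import datetime

# Matches a leading tag such as "[Dated Jan 1, 2021]".
DATED_PREFIX = re.compile(r"^\[Dated ([A-Za-z]+ \d{1,2}, \d{4})\]")


def parse_dated_prefix(text):
    """Return the datetime in a leading "[Dated ...]" tag, or None."""
    match = DATED_PREFIX.match(text)
    if match is None:
        return None
    # Try abbreviated ("Jan") and full ("June") month names in turn.
    for fmt in ("%b %d, %Y", "%B %d, %Y"):
        try:
            return datetime.strptime(match.group(1), fmt)
        except ValueError:
            continue
    return None


def newest_chunk(chunks):
    """Return the chunk with the latest parseable date, or None."""
    dated = [(parse_dated_prefix(c), c) for c in chunks]
    dated = [(d, c) for d, c in dated if d is not None]
    if not dated:
        return None
    return max(dated, key=lambda pair: pair[0])[1]


chunks = [
    "[Dated Feb 10, 2024] Maximum allowance is 0 days.",
    "[Dated Jan 1, 2021] All employees are entitled to 5 days.",
    "[Dated June 15, 2023] The remote allowance is reduced to 2 days.",
]
print(newest_chunk(chunks))
```

This only works when the date format is known in advance, which is why the notebook goes on to use LLM extraction plus `dateparser` for less regular text.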


In [None]:
# --- Step 5: Temporal Graph Construction ---
# We construct a property graph in which every stored fact carries a timestamp.
# We ask the LLM to extract: Topic | Value | Effective_Date

kg = nx.DiGraph()

def extract_temporal_fact(text):
    """
    Extracts the fact AND the date associated with it.
    """
    prompt = f"""<|system|>
    You are a Compliance Officer. Extract the Policy Topic, the Allowed Value, and the Effective Date.
    Format: Topic | Value | Date
    Example: "[2022] Sick leave is 10 days." -> Sick Leave | 10 days | 2022
    <|user|>
    Text: {text}
    <|assistant|>"""
    
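    raw = llm.invoke(prompt)
    out = raw.split("<|assistant|>")[-1].strip()
    
    # Use only the first line with the Topic | Value | Date shape;
    # small models often keep generating text after the answer.
    for line in out.splitlines():
        if line.count("|") >= 2:
            return [p.strip() for p in line.split("|")]
    return []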

print("\n--- TEMPORAL KG EXTRACTION ---")

for doc in docs:
    print(f"\nParsing Chunk: {doc.page_content}")
    parts = extract_temporal_fact(doc.page_content)
    
    if len(parts) >= 3:
        topic, value, date_str = parts[0], parts[1], parts[2]
        
        # Parse the date string into a datetime object so records can be sorted.
        # dateparser.parse returns None when it cannot read the string.
        dt_obj = dateparser.parse(date_str)
        if dt_obj is None:
            print(f"   [Skipped]: could not parse date '{date_str}'")
            continue
        
        print(f"   [Raw LLM Output]: {topic} | {value} | {date_str}")
        print(f"   [Graph Action]: Edge ({topic}) -> [allowance: {value}] (valid_from: {dt_obj.date()})")
        
        # Store the fact with its metadata. To keep the demo simple, each topic
        # is a node whose 'history' attribute holds the list of dated states.
        # A MultiDiGraph with one timestamped edge per state is the alternative.
        
        if topic not in kg:
            kg.add_node(topic, history=[])
        
        kg.nodes[topic]['history'].append({
            "value": value,
            "date": dt_obj,
            "source": doc.page_content[:20]
        })



--- TEMPORAL KG EXTRACTION ---

Parsing Chunk: [Dated Jan 1, 2021] Under the 'Flex-21' initiative, all employees are entitled to 5 days of remote work per week.
   [Raw LLM Output]: Remote Work Policy | 5 days | Jan 1, 2021
   [Graph Action]: Edge (Remote Work Policy) -> [allowance: 5 days] (valid_from: 2021-01-01)

Parsing Chunk: [Dated June 15, 2023] Update: Due to RTO mandates, the remote allowance is reduced to 2 days (hybrid model).
   [Raw LLM Output]: Remote Work Policy | 2 days | June 15, 2023
   [Graph Action]: Edge (Remote Work Policy) -> [allowance: 2 days] (valid_from: 2023-06-15)

Parsing Chunk: [Dated Feb 10, 2024] EXECUTIVE MEMO: Effective immediately, full-time remote work is revoked. Maximum allowance is 0 days (Strict On-Site).
   [Raw LLM Output]: Remote Work Policy | 0 days | Feb 10, 2024
   [Graph Action]: Edge (Remote Work Policy) -> [allowance: 0 days] (valid_from: 2024-02-10)
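
All three extractions above produced the identical topic string, which is what lets their records land on the same node. A small model will not always be that consistent, and a variant such as `remote work policy` would silently start a second history. One possible guard, shown here as an assumption rather than as part of the notebook, is to normalize the topic before using it as a node key. `normalize_topic` is a hypothetical helper.

```python
import re


def normalize_topic(raw_topic):
    """Collapse case, punctuation, and spacing so variants share one key."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", raw_topic.lower())
    return " ".join(cleaned.split())


variants = ["Remote Work Policy", "remote work  policy", "Remote-Work Policy."]
keys = {normalize_topic(v) for v in variants}
print(keys)
```

Normalization of this kind only merges surface variants. Synonyms (for example "WFH Policy") would still need a lookup table or an embedding-based match.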


In [None]:
# --- Step 6: The Solution (Deterministic Time Resolution) ---

print("\n--- CONFLICT RESOLUTION (Time Travel) ---")

def resolve_latest_truth(topic_query):
    """
    Resolves conflicts by sorting metadata timestamps.
    """
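    # 1. Topic Mapping (simple case-insensitive keyword match for demo)
    target_node = None
    query_lc = topic_query.lower()
    asks_about_remote = "remote" in query_lc or "work from home" in query_lc
    for node in kg.nodes():
        # Require both sides to match: the query asks about remote work
        # AND this node is the remote work topic.
        if asks_about_remote and "remote work" in node.lower():
            target_node = node
            break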
    
    print(f"Query: '{topic_query}'")
    print(f"Identified Topic: '{target_node}'")
    
    if not target_node:
        return "No policy found."
    
    # 2. Retrieve History
    history = kg.nodes[target_node]['history']
    print(f"\nFound {len(history)} conflicting records. Resolving...")
    
    # 3. Sort by Date Descending
    sorted_history = sorted(history, key=lambda x: x['date'], reverse=True)
    
    for idx, record in enumerate(sorted_history):
        status = "(Latest)" if idx == 0 else ""
        print(f"  {idx+1}. Value: {record['value']} | Date: {record['date'].date()} {status}")
        
    # 4. Return Top Result
    latest = sorted_history[0]
    print(f"\nSelected Truth: {latest['value']} (Source: {latest['source']}...)")
    
    return latest

latest_fact = resolve_latest_truth(query)

print("\nFinal Answer:")
if isinstance(latest_fact, dict):
    # We construct the answer programmatically to ensure accuracy
    print(f"The current remote work allowance is {latest_fact['value']}, effective as of {latest_fact['date'].strftime('%b %d, %Y')}.")
else:
    print(latest_fact)


--- CONFLICT RESOLUTION (Time Travel) ---
Query: 'How many days can I work from home?'
Identified Topic: 'Remote Work Policy'

Found 3 conflicting records. Resolving...
  1. Value: 0 days | Date: 2024-02-10 (Latest)
  2. Value: 2 days | Date: 2023-06-15
  3. Value: 5 days | Date: 2021-01-01

Selected Truth: 0 days (Source: [Dated Feb 10, 2024]...)

Final Answer:
The current remote work allowance is 0 days, effective as of Feb 10, 2024.
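
Keeping the full history, rather than overwriting old values, also allows point-in-time questions such as "what was the allowance in 2022?". The sketch below is one way to answer them with the standard library, assuming a list shaped like the `history` attribute built in Step 5 (dicts with `value` and `date` keys). `state_as_of` is a hypothetical helper, not something the notebook defines.

```python
from bisect import bisect_right
from datetime import datetime


def state_as_of(history, when):
    """Return the record in force at `when`, or None if none applied yet."""
    ordered = sorted(history, key=lambda r: r["date"])
    dates = [r["date"] for r in ordered]
    # bisect_right counts the records whose date is <= `when`.
    idx = bisect_right(dates, when)
    return ordered[idx - 1] if idx else None


history = [
    {"value": "5 days", "date": datetime(2021, 1, 1)},
    {"value": "2 days", "date": datetime(2023, 6, 15)},
    {"value": "0 days", "date": datetime(2024, 2, 10)},
]

print(state_as_of(history, datetime(2022, 7, 1))["value"])
print(state_as_of(history, datetime(2020, 1, 1)))
```

Passing the current time to the same function returns the latest record, so "latest truth" is just a special case of the as-of lookup.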
