# Information Verifier System

This notebook implements an information verification system using:
- **LangGraph**: Workflow orchestration
- **LangChain**: LLM integration and tools
- **Hugging Face**: Classification models

## Architecture

1. User input → Query enhancement
2. Information retrieval (web search + domain APIs)
3. Evidence extraction and analysis
4. Classification (real/fake/doubtful)
5. Explanation generation with sources
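
As a rough illustration of the data flow listed above, the sketch below chains five stub stages over a plain dict. It only shows the contract each step follows (take the state, return an updated copy). The stage bodies are placeholders that make no LLM or search calls, and the helper name `run_pipeline` is hypothetical; the real nodes are defined in the cells that follow.

```python
from typing import Callable, Dict, List

State = Dict[str, object]

def enhance(state: State) -> State:
    # Placeholder: the real node asks an LLM to rewrite the input as a search query
    return {**state, "enhanced_query": str(state["user_input"]).strip()}

def retrieve(state: State) -> State:
    # Placeholder: the real node calls a web search tool
    return {**state, "search_results": []}

def extract(state: State) -> State:
    # Placeholder: the real node pulls evidence snippets out of the search results
    return {**state, "evidence": []}

def classify(state: State) -> State:
    # Placeholder: with no evidence the stub can only answer "doubtful"
    return {**state, "classification": "doubtful"}

def explain(state: State) -> State:
    # Placeholder: the real node writes an explanation that cites sources
    n = len(state["evidence"])
    return {**state, "explanation": f"Classified as {state['classification']} from {n} evidence item(s)."}

def run_pipeline(user_input: str, stages: List[Callable[[State], State]]) -> State:
    """Apply each stage in order, passing the growing state along (hypothetical helper)."""
    state: State = {"user_input": user_input}
    for stage in stages:
        state = stage(state)
    return state

STAGES = [enhance, retrieve, extract, classify, explain]
result = run_pipeline("  The Eiffel Tower is in Paris  ", STAGES)
print(result["classification"], "-", result["explanation"])
```

Returning a new dict from every stage keeps each step easy to test in isolation, which is the same shape the LangGraph nodes below use.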


In [54]:
%pip install langchain langchain-openai langgraph langchain-community langchain-tavily transformers torch


Defaulting to user installation because normal site-packages is not writeable

[notice] A new release of pip is available: 24.3.1 -> 25.3
[notice] To update, run: /Applications/Xcode.app/Contents/Developer/usr/bin/python3 -m pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [None]:
# Suppress warnings (optional - these are harmless but can be noisy)
import warnings
warnings.filterwarnings('ignore', category=UserWarning)
warnings.filterwarnings('ignore', message='.*urllib3.*')
warnings.filterwarnings('ignore', message='.*tqdm.*')
warnings.filterwarnings('ignore', message='.*IProgress.*')

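# Suppress urllib3 OpenSSL warning specifically
# (NotOpenSSLWarning only exists in urllib3 2.x, so look it up defensively)
import urllib3
_not_openssl_warning = getattr(urllib3.exceptions, "NotOpenSSLWarning", None)
if _not_openssl_warning is not None:
    urllib3.disable_warnings(_not_openssl_warning)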


In [None]:
import os
from typing import TypedDict, Literal, List, Annotated
from pydantic import BaseModel
from enum import Enum

# Compatible StrEnum for Python < 3.11
from enum import auto  # enum.auto is available on every supported Python version
try:
    from enum import StrEnum
except ImportError:
    # Fallback for Python < 3.11: auto() yields the lower-cased member name,
    # matching the behavior of the built-in StrEnum
    class StrEnum(str, Enum):
        @staticmethod
        def _generate_next_value_(name, start, count, last_values):
            return name.lower()

# LangChain & LangGraph
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage
from langchain_core.documents import Document
from langgraph.graph import StateGraph, START, END, MessagesState
from langgraph.checkpoint.memory import MemorySaver
from langgraph.prebuilt import ToolNode, tools_condition

# Try to import Tavily (new package), fallback to DuckDuckGo if not available
TAVILY_SEARCH_CLASS = None
try:
    # New package - preferred
    from langchain_tavily import TavilySearch
    TAVILY_SEARCH_CLASS = TavilySearch
    TAVILY_AVAILABLE = True
    print("✅ Using langchain-tavily package (new)")
except ImportError:
    # Fallback to old package if new one not available
    try:
        from langchain_community.tools.tavily_search import TavilySearchResults
        TAVILY_SEARCH_CLASS = TavilySearchResults
        TAVILY_AVAILABLE = True
        print("⚠️  Using deprecated langchain-community.tools.tavily_search - consider upgrading to langchain-tavily")
    except ImportError:
        try:
            from langchain_community.tools import DuckDuckGoSearchRun
            TAVILY_AVAILABLE = False
            print("Tavily not available, will use DuckDuckGo as fallback")
        except ImportError:
            TAVILY_AVAILABLE = False
            print("No search tool available. Please install langchain-tavily or ensure langchain-community is installed.")

# Hugging Face
from transformers import pipeline

# For domain-specific APIs (example: sports)
import requests


✅ Using langchain-tavily package (new)


In [None]:
# =============================================================================
# SETUP API KEYS - RUN THIS CELL FIRST!
# =============================================================================

# ⚠️ REQUIRED: OpenAI API Key (you must set this!)
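# Option 1: Set it directly (quick, but never commit a real key with the notebook).
# Uncomment the line below and paste your key from https://platform.openai.com/api-keys
# A placeholder assigned here takes precedence over the value in .env, because
# load_dotenv() does not override variables that are already set.
# os.environ["OPENAI_API_KEY"] = "your-openai-api-key-here"  # starts with "sk-"

# Optional: Tavily API Key (search falls back to another tool if it is not set)
# os.environ["TAVILY_API_KEY"] = "your-tavily-api-key-here"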

# Option 2: Load from environment variable (RECOMMENDED)
# Set it in your terminal before running Jupyter:
# export OPENAI_API_KEY="your-key-here"
# export TAVILY_API_KEY="your-key-here"  # Optional

# Option 3: Load from a .env file (RECOMMENDED for local development)
# First install: pip install python-dotenv
# Then create a .env file with:
# OPENAI_API_KEY=your-key-here
# TAVILY_API_KEY=your-key-here
try:
    from dotenv import load_dotenv
    load_dotenv()
    print("Loaded environment variables from .env file")
except ImportError:
    print("python-dotenv not installed. Install with: pip install python-dotenv")
    print("Or set environment variables manually.")

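# Check if API key is set (an untouched "your-..." placeholder counts as not set)
openai_key = os.environ.get("OPENAI_API_KEY", "")
if not openai_key or openai_key.startswith("your-"):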
    print("⚠️  WARNING: OPENAI_API_KEY not set!")
    print("Please set it using one of the methods above.")
    print("\nTo get an OpenAI API key:")
    print("1. Go to https://platform.openai.com/api-keys")
    print("2. Sign up or log in")
    print("3. Create a new API key")
    print("4. Copy it and set it using one of the methods above")
else:
    print("✅ OPENAI_API_KEY is set")

# Tavily is optional - will fallback to DuckDuckGo if not set
if os.environ.get("TAVILY_API_KEY"):
    print("✅ TAVILY_API_KEY is set")
else:
    print("ℹ️  TAVILY_API_KEY not set - will use DuckDuckGo for search (free, no API key needed)")


Loaded environment variables from .env file
✅ OPENAI_API_KEY is set
✅ TAVILY_API_KEY is set


## State Definition


In [None]:
# Verify API key is set before proceeding
if not os.environ.get("OPENAI_API_KEY"):
    raise ValueError(
        "❌ ERROR: OPENAI_API_KEY is not set!\n\n"
        "Please set it in the 'Setup API keys' cell above using one of these methods:\n\n"
        "Method 1 - Direct (quick):\n"
        "  os.environ['OPENAI_API_KEY'] = 'sk-your-key-here'\n\n"
        "Method 2 - Environment variable:\n"
        "  export OPENAI_API_KEY='sk-your-key-here'\n\n"
        "Method 3 - .env file:\n"
        "  Create a .env file with: OPENAI_API_KEY=sk-your-key-here\n\n"
        "Get your API key from: https://platform.openai.com/api-keys"
    )
else:
    print("✅ API key verified - ready to initialize model")


✅ API key verified - ready to initialize model


In [None]:
class ClassificationResult(StrEnum):
    REAL = auto()
    FAKE = auto()
    DOUBTFUL = auto()

class Source(BaseModel):
    url: str
    title: str
    snippet: str
    credibility_score: float = 0.5

class Evidence(BaseModel):
    claim: str
    supporting_text: str
    source: Source
    relevance_score: float

class VerificationState(TypedDict):
    user_input: str
    enhanced_query: str
    search_results: List[Document]
    evidence: List[Evidence]
    classification: ClassificationResult
    confidence: float
    explanation: str
    sources: List[Source]


## Node 1: Query Enhancement


In [None]:
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def enhance_query_node(state: VerificationState) -> VerificationState:
    """Enhance user query for better search results"""
    prompt = f"""Given the following user query or claim, create an optimized search query 
    that will help verify the information. Extract key entities, dates, and facts.

    User input: {state['user_input']}
    
    Return only the enhanced search query, nothing else."""
    
    response = model.invoke([HumanMessage(content=prompt)])
    enhanced = response.content.strip()
    
    return {
        **state,
        "enhanced_query": enhanced
    }


## Node 2: Information Retrieval


In [None]:
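# Initialize search tool
# Note: You need TAVILY_API_KEY for this to work
# Alternative: Use DuckDuckGoSearchRun from langchain_community.tools

try:
    # Reuse whichever Tavily class the import cell found
    # (TavilySearch from langchain-tavily, or the older TavilySearchResults)
    search_tool = TAVILY_SEARCH_CLASS(max_results=5)
    USE_TAVILY = True
except Exception as exc:
    print(f"Tavily not available ({exc}). Using alternative search.")
    USE_TAVILY = False

def retrieve_information_node(state: VerificationState) -> VerificationState:
    """Retrieve information from web search"""
    # Fall back to the raw input when the enhanced query is missing or empty
    query = state.get('enhanced_query') or state['user_input']
    
    if USE_TAVILY:
        results = search_tool.invoke({"query": query})
        # TavilySearch returns a dict holding a "results" list;
        # TavilySearchResults returns the list itself
        if isinstance(results, dict):
            results = results.get("results", [])
        documents = [
            Document(
                page_content=result.get('content', ''),
                metadata={
                    'url': result.get('url', ''),
                    'title': result.get('title', '')
                }
            )
            for result in results
            if isinstance(result, dict)  # skip anything that is not a result record
        ]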
    else:
        # Fallback: Mock documents for demonstration
        documents = [
            Document(
                page_content="Sample search result content. This would come from web search.",
                metadata={'url': 'https://example.com', 'title': 'Example Source'}
            )
        ]
    
    return {
        **state,
        "search_results": documents
    }


## Node 3: Evidence Extraction


In [None]:
def extract_evidence_node(state: VerificationState) -> VerificationState:
    """Extract relevant evidence from search results"""
    user_claim = state['user_input']
    documents = state['search_results']
    
    evidence_list = []
    
    for doc in documents:
        # Use LLM to extract relevant evidence (source text is limited to 1,000 characters)
        prompt = f"""Given the following claim and a source document, extract the most relevant 
        evidence that supports or contradicts the claim. Return only the relevant text snippet.

        Claim: {user_claim}
        
        Source: {doc.page_content[:1000]}
        
        Extract relevant evidence:"""
        
        response = model.invoke([HumanMessage(content=prompt)])
        evidence_text = response.content.strip()
        
        if evidence_text and len(evidence_text) > 20:  # Filter out empty/too short
            source = Source(
                url=doc.metadata.get('url', ''),
                title=doc.metadata.get('title', 'Unknown'),
                snippet=evidence_text[:200],
                credibility_score=0.7  # Could be enhanced with domain-specific scoring
            )
            
            evidence = Evidence(
                claim=user_claim,
                supporting_text=evidence_text,
                source=source,
                relevance_score=0.8  # Could use embeddings similarity
            )
            evidence_list.append(evidence)
    
    return {
        **state,
        "evidence": evidence_list
    }
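
The hard-coded `relevance_score=0.8` could be replaced by an embedding similarity, as the inline comment suggests. The sketch below computes cosine similarity in plain Python. The vectors are toy values; in the notebook they would presumably come from an embedding model, for example `OpenAIEmbeddings().embed_query(...)`. The helper `relevance_from_embeddings` is a hypothetical name, and clamping negative similarities to zero is an assumption about how the score should behave.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

def relevance_from_embeddings(claim_vec: Sequence[float], evidence_vec: Sequence[float]) -> float:
    """Hypothetical helper: map cosine similarity to a 0..1 relevance score."""
    return max(0.0, cosine_similarity(claim_vec, evidence_vec))

# Toy vectors standing in for real embeddings
claim_vec = [1.0, 2.0, 0.0]
close_vec = [2.0, 4.0, 0.0]     # same direction as the claim
unrelated_vec = [0.0, 0.0, 3.0]  # orthogonal to the claim

print(relevance_from_embeddings(claim_vec, close_vec))
print(relevance_from_embeddings(claim_vec, unrelated_vec))
```

A score like this could also be used to sort `evidence_list` before the classifier takes the top three items.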


## Node 4: Classification (Hugging Face + LLM)


In [None]:
# Initialize Hugging Face classifier
# Using zero-shot classification model
# Note: classify_node below relies on the LLM only and never calls this classifier
classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

def classify_node(state: VerificationState) -> VerificationState:
    """Classify claim as real, fake, or doubtful"""
    user_claim = state['user_input']
    evidence_texts = [e.supporting_text for e in state['evidence']]
    
    if not evidence_texts:
        return {
            **state,
            "classification": ClassificationResult.DOUBTFUL,
            "confidence": 0.3,
            "explanation": "Insufficient evidence found to verify the claim."
        }
    
    # Combine evidence
    combined_evidence = "\n\n".join(evidence_texts[:3])  # Use top 3
    
    # Use LLM for classification (more nuanced than simple classifier)
    prompt = f"""You are a fact-checker. Analyze the following claim against the provided evidence 
    and classify it as one of: REAL, FAKE, or DOUBTFUL.
    
    Claim: {user_claim}
    
    Evidence:
    {combined_evidence}
    
    Respond in this exact format:
    CLASSIFICATION: [REAL/FAKE/DOUBTFUL]
    CONFIDENCE: [0.0-1.0]
    REASONING: [brief explanation]"""
    
    response = model.invoke([HumanMessage(content=prompt)])
    result_text = response.content
    
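    # Parse response
    import re
    classification = ClassificationResult.DOUBTFUL
    confidence = 0.5
    reasoning = result_text
    
    # Read the label from the CLASSIFICATION line only: a substring test on the
    # whole response would also match "real" inside the reasoning text
    label_match = re.search(r'CLASSIFICATION:\s*\[?(REAL|FAKE|DOUBTFUL)\b', result_text, re.IGNORECASE)
    if label_match:
        classification = ClassificationResult[label_match.group(1).upper()]
    
    # Extract confidence if mentioned, clamped to the 0.0-1.0 range
    conf_match = re.search(r'CONFIDENCE:\s*\[?([0-9]*\.?[0-9]+)', result_text, re.IGNORECASE)
    if conf_match:
        confidence = min(1.0, max(0.0, float(conf_match.group(1))))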
    
    return {
        **state,
        "classification": classification,
        "confidence": confidence,
        "explanation": reasoning,
        "sources": [e.source for e in state['evidence']]
    }


Device set to use mps:0
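
The `classifier` pipeline is loaded in the cell above, but `classify_node` does not call it. One way it might contribute is as a second opinion next to the LLM verdict. The sketch below turns a zero-shot result into one of the three classes. It assumes the usual output shape of the zero-shot pipeline (a dict with parallel `labels` and `scores` lists). The label phrasings, the `min_score` threshold, and the helper `verdict_from_zero_shot` are all hypothetical, and the sample result is hand-written rather than real model output, so the block runs without downloading the model.

```python
from typing import Dict, List, Tuple

# Hypothetical candidate labels mapped to the notebook's three classes
LABEL_MAP = {
    "supported by the evidence": "real",
    "contradicted by the evidence": "fake",
    "not enough information": "doubtful",
}

def verdict_from_zero_shot(result: Dict[str, List], min_score: float = 0.5) -> Tuple[str, float]:
    """Pick the highest-scoring label; answer 'doubtful' when that score is weak."""
    pairs = sorted(zip(result["scores"], result["labels"]), reverse=True)
    top_score, top_label = pairs[0]
    if top_score < min_score:
        return "doubtful", top_score
    return LABEL_MAP.get(top_label, "doubtful"), top_score

# In the notebook, the input would come from a call along these lines:
#   result = classifier(f"Claim: {claim}\nEvidence: {evidence}",
#                       candidate_labels=list(LABEL_MAP))
# Hand-written stand-in with the same shape (not real model output):
sample_result = {
    "sequence": "Claim: ... Evidence: ...",
    "labels": ["contradicted by the evidence", "not enough information", "supported by the evidence"],
    "scores": [0.81, 0.12, 0.07],
}

label, score = verdict_from_zero_shot(sample_result)
print(label, score)
```

When this verdict and the LLM verdict disagree, falling back to `DOUBTFUL` would be a conservative choice.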


## Node 5: Explanation Generation


In [None]:
def generate_explanation_node(state: VerificationState) -> VerificationState:
    """Generate human-readable explanation with source citations"""
    claim = state['user_input']
    classification = state['classification']
    confidence = state['confidence']
    sources = state['sources']
    evidence = state['evidence']
    
    sources_text = "\n".join([
        f"- {s.title} ({s.url})" for s in sources[:5]
    ])
    
    prompt = f"""Generate a clear, concise explanation for the fact-checking result. 
    Include specific evidence and cite sources.

    Claim: {claim}
    Classification: {classification.value}
    Confidence: {confidence:.2f}
    
    Sources:
    {sources_text}
    
    Evidence snippets:
    {chr(10).join([e.supporting_text[:200] for e in evidence[:3]])}
    
    Write a 2-3 sentence explanation that:
    1. States the classification clearly
    2. Provides key evidence
    3. Cites the sources
    4. Explains the confidence level"""
    
    response = model.invoke([HumanMessage(content=prompt)])
    
    return {
        **state,
        "explanation": response.content
    }


## Conditional Routing


In [None]:
def should_retry_search(state: VerificationState) -> Literal["retrieve_information", "classify"]:
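    """Decide if we need more sources.

    Caveat: this router runs before classify_node, so 'confidence' still holds its
    initial value here, and retrieve_information repeats the same query on every
    retry. Nothing in the state counts attempts, so a claim that keeps yielding
    too little evidence loops until LangGraph's recursion limit is reached.
    """
    # If we have no evidence, or low confidence with few evidence items, retry search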
    if len(state.get('evidence', [])) == 0:
        return "retrieve_information"
    if state.get('confidence', 0) < 0.4 and len(state.get('evidence', [])) < 3:
        return "retrieve_information"
    return "classify"
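
Because the router runs before `classify_node` and the retrieval node repeats the same query, a retry loop like the one above needs a stopping rule. One common approach is an attempt counter carried in the state. The sketch below assumes a hypothetical `search_attempts` field that the retrieval node would increment, and a hypothetical `MAX_SEARCH_ATTEMPTS` limit; neither exists in the notebook's `VerificationState`, and the evidence threshold of three is borrowed from the router above.

```python
from typing import List, Literal, TypedDict

MAX_SEARCH_ATTEMPTS = 2  # hypothetical limit on how often retrieval may run

class RetryState(TypedDict, total=False):
    evidence: List[str]
    search_attempts: int  # hypothetical field, incremented by the retrieval node

def count_search_attempt(state: RetryState) -> RetryState:
    """What the retrieval node would add to the state it returns."""
    return {**state, "search_attempts": state.get("search_attempts", 0) + 1}

def should_retry_search_bounded(state: RetryState) -> Literal["retrieve_information", "classify"]:
    """Retry only while evidence is thin and the attempt budget is not used up."""
    attempts = state.get("search_attempts", 0)
    if len(state.get("evidence", [])) < 3 and attempts < MAX_SEARCH_ATTEMPTS:
        return "retrieve_information"
    return "classify"

# Simulate a claim for which no evidence is ever found
state: RetryState = {"evidence": []}
route = "retrieve_information"
while route == "retrieve_information":
    state = count_search_attempt(state)
    route = should_retry_search_bounded(state)

print(state["search_attempts"], route)
```

Once the budget is spent the graph proceeds to `classify`, which already handles the empty-evidence case by returning a doubtful verdict.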


## Build Graph


In [None]:
builder = StateGraph(VerificationState)

# Add nodes
builder.add_node("enhance_query", enhance_query_node)
builder.add_node("retrieve_information", retrieve_information_node)
builder.add_node("extract_evidence", extract_evidence_node)
builder.add_node("classify", classify_node)
builder.add_node("generate_explanation", generate_explanation_node)

# Add edges
builder.add_edge(START, "enhance_query")
builder.add_edge("enhance_query", "retrieve_information")
builder.add_edge("retrieve_information", "extract_evidence")
builder.add_conditional_edges(
    "extract_evidence",
    should_retry_search,
    {
        "retrieve_information": "retrieve_information",
        "classify": "classify"
    }
)
builder.add_edge("classify", "generate_explanation")
builder.add_edge("generate_explanation", END)

# Compile
memory = MemorySaver()
graph = builder.compile(checkpointer=memory)


## Usage Example


In [None]:
# Example usage
config = {"configurable": {"thread_id": "1"}}

initial_state = VerificationState(
    user_input="LeBron James scored 50 points in the 2024 NBA Finals Game 7",
    enhanced_query="",
    search_results=[],
    evidence=[],
    classification=ClassificationResult.DOUBTFUL,
    confidence=0.0,
    explanation="",
    sources=[]
)

result = graph.invoke(initial_state, config)

print("=" * 60)
print("VERIFICATION RESULT")
print("=" * 60)
print(f"\nClaim: {result['user_input']}")
print(f"\nClassification: {result['classification'].value}")
print(f"Confidence: {result['confidence']:.2%}")
print(f"\nExplanation:\n{result['explanation']}")
print(f"\nSources ({len(result['sources'])}):")
for i, source in enumerate(result['sources'], 1):
    print(f"  {i}. {source.title}")
    print(f"     {source.url}")


AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: your-ope************here. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

## Domain-Specific Enhancements (Example: Sports)

For sports domain, you can add:
1. Official API integration (NBA Stats API, etc.)
2. Domain-specific source whitelist
3. Structured data validation (scores, dates, stats)


In [None]:
# Example: Domain-specific source scoring
def score_source_credibility(source: Source, domain: str = "sports") -> float:
    """Score source credibility based on domain"""
    trusted_domains = {
        "sports": [
            "nba.com", "espn.com", "nfl.com", "mlb.com",
            "basketball-reference.com", "pro-football-reference.com"
        ]
    }
    
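    # Compare against the hostname rather than the full URL, so that a trusted
    # name appearing in a path or inside another domain does not count
    from urllib.parse import urlparse
    host = (urlparse(source.url).hostname or "").lower()
    base_score = 0.5
    
    # Boost score for trusted domains and their subdomains (e.g. www.nba.com)
    for trusted in trusted_domains.get(domain, []):
        if host == trusted or host.endswith("." + trusted):
            base_score = 0.9
            break
    
    return base_score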

# Example: Structured data validation for sports claims
def validate_sports_claim(claim: str) -> dict:
    """Extract structured information from sports claim"""
    # Use LLM to extract: player, team, stat, date, game
    prompt = f"""Extract structured information from this sports claim:
    {claim}
    
    Return JSON with: player, team, stat_type, stat_value, date, game_context"""
    
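    response = model.invoke([HumanMessage(content=prompt)])
    # Parse JSON response (the model may wrap the object in extra text or a code fence)
    import json
    import re
    match = re.search(r'\{.*\}', response.content, re.DOTALL)
    try:
        extracted = json.loads(match.group(0)) if match else {}
    except json.JSONDecodeError:
        extracted = {}
    # Validation against an official API is not implemented in this example
    
    return {"extracted": extracted}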
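
Once a claim has been broken into fields, the "validate against official API" step could reduce to a field-by-field comparison. The sketch below shows that comparison in isolation. The helper `compare_with_record` is hypothetical, the field names follow the extraction prompt above, and both dictionaries contain made-up placeholder values rather than real statistics or a real API response.

```python
from typing import Any, Dict, List

def _normalize(value: Any) -> Any:
    """Make '50', 50 and 50.0 compare equal, and ignore case and padding in text."""
    if isinstance(value, (int, float)):
        return float(value)
    text = str(value).strip().lower()
    try:
        return float(text)
    except ValueError:
        return text

def compare_with_record(extracted: Dict[str, Any], record: Dict[str, Any], fields: List[str]) -> Dict[str, Any]:
    """Hypothetical helper: report which claimed fields disagree with the reference record."""
    mismatches: Dict[str, Dict[str, Any]] = {}
    missing: List[str] = []
    for field in fields:
        if field not in extracted or field not in record:
            missing.append(field)
        elif _normalize(extracted[field]) != _normalize(record[field]):
            mismatches[field] = {"claimed": extracted[field], "official": record[field]}
    return {
        "consistent": not mismatches and not missing,
        "mismatches": mismatches,
        "missing": missing,
    }

# Placeholder values only: not real statistics
extracted = {"player": "Player A", "stat_type": "points", "stat_value": "50", "date": "2024-01-01"}
official = {"player": "player a", "stat_type": "points", "stat_value": 31, "date": "2024-01-01"}

report = compare_with_record(extracted, official, ["player", "stat_type", "stat_value", "date"])
print(report)
```

Fields that cannot be checked are reported separately from fields that disagree, so a missing record does not look like a refuted claim.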


## Next Steps & Improvements

1. **Fine-tune Classification Model**: Train on fact-checking datasets (FEVER, PolitiFact)
2. **Multi-query Strategy**: Generate multiple search queries for better coverage
3. **Temporal Verification**: Check if information is outdated
4. **Claim Decomposition**: Break complex claims into verifiable sub-claims
5. **Source Aggregation**: Weighted voting from multiple sources
6. **Confidence Calibration**: Improve confidence score accuracy
7. **Domain APIs**: Integrate official APIs for structured data validation
8. **Caching**: Cache verification results for repeated queries
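
As an illustration of item 5, weighted voting can be sketched in a few lines: each source casts a vote for a label, weighted by its credibility score, and the winning label's share of the total weight serves as a rough confidence. The helper `aggregate_votes` is hypothetical, the labels follow the values of `ClassificationResult`, and the weights in the example are the two scores that `score_source_credibility` can return.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def aggregate_votes(votes: Iterable[Tuple[str, float]]) -> Tuple[str, float]:
    """Hypothetical helper: return (winning label, its share of the total weight)."""
    totals: Dict[str, float] = defaultdict(float)
    for label, weight in votes:
        if weight < 0:
            raise ValueError("weights must be non-negative")
        totals[label] += weight
    total_weight = sum(totals.values())
    if total_weight == 0:
        return "doubtful", 0.0  # no usable votes, so no verdict
    winner = max(totals, key=lambda label: totals[label])
    return winner, totals[winner] / total_weight

# Two trusted sources contradict the claim, one untrusted source supports it
votes = [("fake", 0.9), ("fake", 0.9), ("real", 0.5)]
label, share = aggregate_votes(votes)
print(label, round(share, 2))
```

The per-source labels would still have to come from somewhere, for example from classifying each evidence item separately before aggregating.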
