# NLP and LLM Tools Introduction

This notebook demonstrates basic usage of the NLP and LLM tools used in the Persuasion-Aware MUSE project:

1. **spaCy** - Industrial-strength NLP library for entity recognition and text processing
2. **OpenRouter API** - LLM gateway for accessing Gemini and other models

**Model Used**: `google/gemini-2.5-flash-lite` via OpenRouter

---

## Setup

In [1]:
# !pip install spacy openai python-dotenv
# !python -m spacy download en_core_web_sm

In [2]:
import spacy
import json
import os
from typing import List, Dict, Optional

# Load the small English spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
    print("spaCy model loaded successfully")
except OSError:
    # Without the model, `nlp` stays undefined and every later cell would fail
    print("Please run: python -m spacy download en_core_web_sm")
    raise

spaCy model loaded successfully


---

## 1. spaCy for Named Entity Recognition (NER)

spaCy provides pre-trained models for extracting named entities from text.

### 1.1 Basic NER

In [3]:
# Sample social media post
sample_post = """
BREAKING: The European Union is forcing Germany to accept 500,000 migrants! 
Angela Merkel knew this would happen. COVID-19 was just a distraction created by WHO.
"""

# Process the text
doc = nlp(sample_post)

# Extract named entities
print("Named Entities Found:")
print("=" * 60)
for ent in doc.ents:
    print(f"Text: {ent.text:30} | Label: {ent.label_:15} | Description: {spacy.explain(ent.label_)}")

Named Entities Found:
Text: The European Union             | Label: ORG             | Description: Companies, agencies, institutions, etc.
Text: Germany                        | Label: GPE             | Description: Countries, cities, states
Text: 500,000 migrants               | Label: QUANTITY        | Description: Measurements, as of weight or distance
Text: Angela Merkel                  | Label: PERSON          | Description: People, including fictional
Text: WHO                            | Label: ORG             | Description: Companies, agencies, institutions, etc.


### 1.2 Entity Classification for Knowledge Graph

In [4]:
def extract_entities_for_kg(text: str) -> List[Dict]:
    """
    Extract named entities from text and map to knowledge graph entity types.
    """
    doc = nlp(text)
    
    # Map spaCy labels to our ontology types
    label_mapping = {
        "PERSON": "Person",
        "ORG": "Organization",
        "GPE": "Location",
        "LOC": "Location",
        "NORP": "Group",
        "EVENT": "Event",
        "DATE": "Date",
    }
    
    entities = []
    seen = set()
    
    for ent in doc.ents:
        if ent.text.lower() not in seen:
            seen.add(ent.text.lower())
            entity_type = label_mapping.get(ent.label_, "Other")
            entities.append({
                "name": ent.text,
                "type": entity_type,
                "original_label": ent.label_
            })
    
    return entities

# Test the function
entities = extract_entities_for_kg(sample_post)
print("Extracted Entities for Knowledge Graph:")
print("=" * 60)
for ent in entities:
    print(f"Name: {ent['name']:25} | Type: {ent['type']:15}")

Extracted Entities for Knowledge Graph:
Name: The European Union        | Type: Organization   
Name: Germany                   | Type: Location       
Name: 500,000 migrants          | Type: Other          
Name: Angela Merkel             | Type: Person         
Name: WHO                       | Type: Organization   
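
In the output above, `500,000 migrants` falls through to the `Other` type because spaCy's `QUANTITY` label has no entry in `label_mapping`. If such entities are not wanted as knowledge graph nodes, one option is to filter them out after extraction. The sketch below assumes the dictionary shape returned by `extract_entities_for_kg`; the helper name `filter_kg_entities` and its `allowed_types` parameter are hypothetical and not part of the project code.

```python
from typing import Dict, Iterable, List, Optional, Set

# Assumed default: every ontology type from label_mapping except "Other".
DEFAULT_ALLOWED_TYPES: Set[str] = {
    "Person", "Organization", "Location", "Group", "Event", "Date",
}


def filter_kg_entities(
    entities: Iterable[Dict[str, str]],
    allowed_types: Optional[Set[str]] = None,
) -> List[Dict[str, str]]:
    """Keep only entities whose mapped type is in allowed_types."""
    allowed = DEFAULT_ALLOWED_TYPES if allowed_types is None else allowed_types
    return [ent for ent in entities if ent.get("type") in allowed]


# Entities in the same shape as the output shown above.
example_entities = [
    {"name": "Germany", "type": "Location", "original_label": "GPE"},
    {"name": "500,000 migrants", "type": "Other", "original_label": "QUANTITY"},
    {"name": "WHO", "type": "Organization", "original_label": "ORG"},
]

kept = filter_kg_entities(example_entities)
print([ent["name"] for ent in kept])
```

Whether to drop `Other` entities or extend `label_mapping` instead (for example with a dedicated type for quantities) depends on the ontology; the filter is only one possible choice.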


---

## 2. OpenRouter API for LLM-based Analysis

We use OpenRouter to access various LLMs. For this project, we use **Google Gemini 2.5 Flash Lite** for:
- Claim extraction
- Persuasion technique detection
- Fact verification assistance

OpenRouter provides an OpenAI-compatible API, making it easy to switch between models.
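
Because the API is OpenAI-compatible, switching models comes down to changing the `model` string in an otherwise identical request. The sketch below builds the keyword arguments for a chat completion call as a plain dictionary. The helper name `build_chat_request` is hypothetical, and `provider/other-model` is a placeholder rather than a real model ID.

```python
from typing import Any, Dict

# Model identifier used in this notebook.
MODEL_NAME = "google/gemini-2.5-flash-lite"


def build_chat_request(
    user_prompt: str,
    system_prompt: str,
    model: str = MODEL_NAME,
    temperature: float = 0.2,
) -> Dict[str, Any]:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": temperature,
    }


request = build_chat_request(
    user_prompt="Extract the factual claims from this post: ...",
    system_prompt="You are an expert fact-checker.",
)
# Switching models only changes the "model" field; the rest is identical.
# "provider/other-model" is a placeholder, not a real model ID.
other_request = build_chat_request(
    user_prompt="Extract the factual claims from this post: ...",
    system_prompt="You are an expert fact-checker.",
    model="provider/other-model",
)
print(request["model"], "->", other_request["model"])
```

With an initialized client, the request would be sent as `client.chat.completions.create(**request)`.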

### 2.1 Setup OpenRouter Client

In [5]:
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv("../.env")

# OpenRouter configuration
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"
MODEL_NAME = "google/gemini-2.5-flash-lite"

# Initialize client with OpenRouter
api_key = os.getenv("OPENROUTER_API_KEY")
if api_key:
    client = OpenAI(
        base_url=OPENROUTER_BASE_URL,
        api_key=api_key
    )
    print("OpenRouter client initialized successfully")
    print(f"Using model: {MODEL_NAME}")
else:
    print("Warning: OPENROUTER_API_KEY not found in environment")
    print("Please set your API key in a .env file:")
    print("  OPENROUTER_API_KEY=your-key-here")
    client = None

OpenRouter client initialized successfully
Using model: google/gemini-2.5-flash-lite


### 2.2 Claim Extraction with LLM

In [6]:
def extract_claims_llm(text: str, client) -> List[Dict]:
    """
    Extract factual claims from text using LLM via OpenRouter.
    """
    if client is None:
        return [{"error": "OpenRouter client not initialized"}]
    
    prompt = f"""Analyze the following social media post and extract all factual claims that can be verified.
For each claim, provide:
1. The exact text of the claim
2. A brief description
3. Whether it's verifiable (true/false statements about facts)

Post: {text}

Return ONLY valid JSON in this exact format (no markdown, no extra text):
{{
  "claims": [
    {{
      "claim_id": "1",
      "text": "extracted claim",
      "description": "brief description",
      "is_verifiable": true
    }}
  ]
}}"""
    
    try:
        response = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": "You are an expert fact-checker. Always respond with valid JSON only."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.2
        )
        
        content = response.choices[0].message.content.strip()
        # Remove markdown code blocks if present
        if content.startswith("```"):
            content = content.split("```")[1]
            if content.startswith("json"):
                content = content[4:]
        content = content.strip()
        
        result = json.loads(content)
        return result.get("claims", [])
    
    except Exception as e:
        return [{"error": str(e)}]

# Test claim extraction
if client:
    claims = extract_claims_llm(sample_post, client)
    print("Extracted Claims:")
    print("=" * 60)
    for claim in claims:
        if "error" not in claim:
            print(f"\nClaim {claim.get('claim_id', 'N/A')}:")
            print(f"  Text: {claim.get('text', 'N/A')}")
            print(f"  Verifiable: {claim.get('is_verifiable', 'N/A')}")
        else:
            print(f"Error: {claim['error']}")
else:
    print("Skipping LLM test - no API key configured")

Extracted Claims:

Claim 1:
  Text: The European Union is forcing Germany to accept 500,000 migrants!
  Verifiable: True

Claim 2:
  Text: Angela Merkel knew this would happen.
  Verifiable: True

Claim 3:
  Text: COVID-19 was just a distraction created by WHO.
  Verifiable: True
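
The fence-stripping logic in `extract_claims_llm` handles a reply that begins with a Markdown code fence, but not one where the model adds prose before or after the JSON. A slightly more tolerant parser could look like the sketch below. The helper name `parse_json_response` is hypothetical, and the fallback to the outermost braces is a heuristic, not a guarantee that arbitrary replies will parse.

```python
import json
import re
from typing import Any, Dict

FENCE = "`" * 3  # Markdown code fence marker, built to avoid nesting fences here
# Matches an optional fence (with or without a "json" tag) around the payload.
_FENCE_RE = re.compile(
    FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE
)


def parse_json_response(content: str) -> Dict[str, Any]:
    """Parse a JSON object from an LLM reply, tolerating fences and stray text."""
    text = content.strip()
    fenced = _FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1).strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span if extra prose surrounds it.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise
        parsed = json.loads(text[start:end + 1])
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object at the top level")
    return parsed


reply = (
    "Here is the result:\n" + FENCE + "json\n"
    + '{"claims": [{"claim_id": "1"}]}' + "\n" + FENCE
)
print(parse_json_response(reply)["claims"])
```

The same helper could replace the duplicated fence handling in both LLM functions, since each one only needs the parsed dictionary.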


### 2.3 Persuasion Technique Detection

In [7]:
PERSUASION_TAXONOMY = {
    "FearAppeal": "Using fear or threats to influence behavior or beliefs",
    "LoadedLanguage": "Using emotionally charged words to influence without evidence",
    "AppealToAuthority": "Citing authority figures without proper evidence",
    "Scapegoating": "Unfairly blaming a person or group for problems",
    "Exaggeration": "Overstating or understating facts for effect",
}

def detect_persuasion_techniques(claim_text: str, post_context: str, client) -> List[Dict]:
    """
    Detect persuasion techniques in a claim using LLM via OpenRouter.
    """
    if client is None:
        return [{"error": "OpenRouter client not initialized"}]
    
    taxonomy_str = "\n".join([f"- {k}: {v}" for k, v in PERSUASION_TAXONOMY.items()])
    
    prompt = f"""Analyze this claim for persuasion techniques.

Claim: {claim_text}
Full Post Context: {post_context}

Available techniques:
{taxonomy_str}

Return ONLY valid JSON:
{{
  "techniques": [
    {{
      "type": "TechniqueName",
      "confidence": 0.85,
      "explanation": "Why this technique applies"
    }}
  ]
}}"""
    
    try:
        response = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": "You are an expert in rhetoric and propaganda analysis. Always respond with valid JSON only."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1
        )
        
        content = response.choices[0].message.content.strip()
        if content.startswith("```"):
            content = content.split("```")[1]
            if content.startswith("json"):
                content = content[4:]
        content = content.strip()
        
        result = json.loads(content)
        return result.get("techniques", [])
    
    except Exception as e:
        return [{"error": str(e)}]

# Test persuasion detection
if client:
    test_claim = "The European Union is forcing Germany to accept 500,000 migrants"
    techniques = detect_persuasion_techniques(test_claim, sample_post, client)
    
    print("Detected Persuasion Techniques:")
    print("=" * 60)
    for tech in techniques:
        if "error" not in tech:
            print(f"\nTechnique: {tech.get('type', 'N/A')}")
            print(f"  Confidence: {tech.get('confidence', 'N/A')}")
            print(f"  Explanation: {tech.get('explanation', 'N/A')}")
        else:
            print(f"Error: {tech['error']}")
else:
    print("Skipping LLM test - no API key configured")

Detected Persuasion Techniques:

Technique: FearAppeal
  Confidence: 0.9
  Explanation: The claim uses the word 'forcing' and the large number '500,000 migrants' to evoke a sense of threat and overwhelm, suggesting a negative and potentially dangerous situation for Germany.

Technique: LoadedLanguage
  Confidence: 0.85
  Explanation: The phrase 'forcing Germany' is emotionally charged and implies coercion and a lack of agency, aiming to provoke a negative emotional response towards the European Union.

Technique: Scapegoating
  Confidence: 0.7
  Explanation: The claim implicitly blames the European Union for a problem (migrant acceptance) and also introduces a conspiracy theory about COVID-19 and the WHO, diverting attention and blaming external entities for perceived issues.

Technique: Exaggeration
  Confidence: 0.75
  Explanation: The specific number '500,000 migrants' presented as a definitive and immediate mandate from the EU could be an exaggeration or misrepresentation of actual
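
LLM output is not guaranteed to stay inside the taxonomy or to return well-formed confidence values, so it can help to validate the returned techniques before using them downstream. The sketch below keeps only entries whose `type` appears in the taxonomy and whose confidence meets a threshold. The helper name `validate_techniques` and the `0.8` cutoff are assumptions chosen for illustration, and the `Bandwagon` entry is made up to show the filtering.

```python
from typing import Any, Dict, List

# Technique names from PERSUASION_TAXONOMY in this notebook.
KNOWN_TECHNIQUES = {
    "FearAppeal", "LoadedLanguage", "AppealToAuthority",
    "Scapegoating", "Exaggeration",
}


def validate_techniques(
    techniques: List[Dict[str, Any]],
    min_confidence: float = 0.8,  # assumed cutoff, not a project setting
) -> List[Dict[str, Any]]:
    """Keep techniques with a known type and a sufficiently high confidence."""
    valid = []
    for tech in techniques:
        if tech.get("type") not in KNOWN_TECHNIQUES:
            continue  # Drop labels outside the taxonomy (and error records).
        try:
            confidence = float(tech.get("confidence"))
        except (TypeError, ValueError):
            continue  # Drop entries with a missing or non-numeric confidence.
        if min_confidence <= confidence <= 1.0:
            valid.append({**tech, "confidence": confidence})
    return valid


# Types and confidences copied from the output above, plus one made-up
# out-of-taxonomy entry.
raw = [
    {"type": "FearAppeal", "confidence": 0.9},
    {"type": "LoadedLanguage", "confidence": 0.85},
    {"type": "Scapegoating", "confidence": 0.7},
    {"type": "Exaggeration", "confidence": 0.75},
    {"type": "Bandwagon", "confidence": 0.95},
]
print([t["type"] for t in validate_techniques(raw)])
```

Self-reported confidence from an LLM is not calibrated, so any threshold should be treated as a rough filter rather than a probability.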

---

## Summary

This notebook demonstrated:

1. **spaCy NER**: Fast entity extraction with type classification
2. **OpenRouter API**: LLM-based claim extraction and persuasion detection using Gemini

### Tool Roles in Our Pipeline

| Task | Tool | Reason |
|------|------|--------|
| Entity Recognition | spaCy | Fast, consistent, good for standard entities |
| Claim Extraction | Gemini (via OpenRouter) | Requires understanding of semantics and context |
| Persuasion Detection | Gemini (via OpenRouter) | Requires nuanced understanding of rhetoric |
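
One way the three roles in the table could be chained is sketched below. The function `analyze_post` and the order of the steps are assumptions for illustration, not the project's actual pipeline; the three steps are passed in as callables so the sketch runs with stand-in functions and needs neither spaCy nor an API key.

```python
from typing import Any, Callable, Dict, List

Records = List[Dict[str, Any]]


def analyze_post(
    text: str,
    extract_entities: Callable[[str], Records],
    extract_claims: Callable[[str], Records],
    detect_techniques: Callable[[str, str], Records],
) -> Dict[str, Any]:
    """Run entity extraction, claim extraction, and per-claim technique detection."""
    entities = extract_entities(text)
    claims = []
    for claim in extract_claims(text):
        if "error" in claim:
            continue  # Skip error records returned by a failed LLM call.
        # Each claim is analyzed with the full post as context.
        techniques = detect_techniques(claim.get("text", ""), text)
        claims.append({**claim, "techniques": techniques})
    return {"entities": entities, "claims": claims}


# Stand-in functions so the sketch runs without spaCy or an API key.
def fake_entities(text: str) -> Records:
    return [{"name": "Germany", "type": "Location"}]


def fake_claims(text: str) -> Records:
    return [{"claim_id": "1", "text": text.strip()}, {"error": "timeout"}]


def fake_techniques(claim_text: str, post_context: str) -> Records:
    return [{"type": "LoadedLanguage", "confidence": 0.85}]


result = analyze_post(" Example post. ", fake_entities, fake_claims, fake_techniques)
print(len(result["claims"]), result["claims"][0]["techniques"][0]["type"])
```

With the notebook's own functions, the wiring could look like `analyze_post(sample_post, extract_entities_for_kg, lambda t: extract_claims_llm(t, client), lambda c, ctx: detect_persuasion_techniques(c, ctx, client))`.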

### OpenRouter Configuration

To use this notebook, set your OpenRouter API key in a `.env` file one directory above the notebook's working directory (the setup cell loads `../.env`):
```
OPENROUTER_API_KEY=your-key-here
```

Get your API key at: https://openrouter.ai/keys