# FALCON Dataset Preprocessing

This notebook preprocesses the **FALCON** (Fallacies in COVID-19 Network-based) dataset for use in our semantic web pipeline.

## Dataset Overview

FALCON is a multi-label dataset of COVID-19-related tweets annotated for six fallacy types:
- Ad Hominem
- Appeal to Fear
- Appeal to Ridicule
- False Dilemma
- Hasty Generalization
- Loaded Language

**Source files:** `df_train.csv`, `df_val.csv`, `df_test.csv` (2,916 tweets total)

## 1. Setup and Imports

In [88]:
import pandas as pd
import numpy as np
import re
import json
from pathlib import Path
from typing import List, Dict

# Project paths
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "data" / "input"
FALCON_DIR = DATA_DIR / "unprocessed" / "falcon_dataset"
OUTPUT_DIR = DATA_DIR / "processed"

print(f"Project root: {PROJECT_ROOT}")
print(f"FALCON data: {FALCON_DIR}")
print(f"Output dir: {OUTPUT_DIR}")

Project root: /Users/luncenok/Studia/sem7/SWSN/semantic_web_project
FALCON data: /Users/luncenok/Studia/sem7/SWSN/semantic_web_project/data/input/unprocessed/falcon_dataset
Output dir: /Users/luncenok/Studia/sem7/SWSN/semantic_web_project/data/input/processed


## 2. Load FALCON Dataset

In [89]:
# Load all splits
train_df = pd.read_csv(FALCON_DIR / "df_train.csv")
val_df = pd.read_csv(FALCON_DIR / "df_val.csv")
test_df = pd.read_csv(FALCON_DIR / "df_test.csv")

# Add split indicator
train_df['split'] = 'train'
val_df['split'] = 'val'
test_df['split'] = 'test'

# Combine
falcon_df = pd.concat([train_df, val_df, test_df], ignore_index=True)

print(f"Train: {len(train_df)} rows")
print(f"Val: {len(val_df)} rows")
print(f"Test: {len(test_df)} rows")
print(f"Total: {len(falcon_df)} rows")

Train: 1811 rows
Val: 550 rows
Test: 555 rows
Total: 2916 rows


In [90]:
# Check columns
print(f"Total columns: {len(falcon_df.columns)}")
print(f"\nKey columns: {falcon_df.columns[:15].tolist()}")

Total columns: 133

Key columns: ['new_id', 'component_id', 'main_tweet', 'previous_context', 'posterior_context', 'Ad Hominem', 'Appeal to Fear', 'Appeal to Ridicule', 'False Dilemma', 'Hasty Generalization', 'Loaded Language', 'None of the above', 'created_at', 'followers', 'tweet_count']
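
Because later cells sum the label columns directly, it can help to confirm first that every expected label column exists and holds only 0/1 values. The following is a minimal sketch of such a check; `check_label_columns` is a hypothetical helper (not part of the notebook), and the frame it runs on is toy data rather than FALCON rows.

```python
from typing import List

import pandas as pd

# Label columns as they appear in the column preview above.
LABEL_COLUMNS = [
    "Ad Hominem", "Appeal to Fear", "Appeal to Ridicule",
    "False Dilemma", "Hasty Generalization", "Loaded Language",
]


def check_label_columns(df: pd.DataFrame, columns: List[str]) -> List[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    for col in columns:
        if col not in df.columns:
            problems.append(f"missing column: {col}")
            continue
        # Anything other than 0 or 1 (including NaN) counts as invalid.
        bad = ~df[col].isin([0, 1])
        if bad.any():
            problems.append(f"{col}: {int(bad.sum())} non-binary value(s)")
    return problems


# Toy frame for illustration only (not FALCON data).
toy = pd.DataFrame({c: [0, 1, 0] for c in LABEL_COLUMNS})
toy.loc[2, "Loaded Language"] = 2  # inject one invalid label
print(check_label_columns(toy, LABEL_COLUMNS))
```

In the notebook this would be called as `check_label_columns(falcon_df, FALLACY_COLUMNS)` once `FALLACY_COLUMNS` is defined.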


## 3. Define Fallacy Columns and Ontology Mapping

In [91]:
# Fallacy columns in FALCON
FALLACY_COLUMNS = [
    'Ad Hominem',
    'Appeal to Fear',
    'Appeal to Ridicule',
    'False Dilemma',
    'Hasty Generalization',
    'Loaded Language'
]

# Mapping to our ontology classes
FALLACY_TO_ONTOLOGY = {
    'Ad Hominem': 'AdHominem',
    'Appeal to Fear': 'FearAppeal',
    'Appeal to Ridicule': 'AppealToRidicule',
    'False Dilemma': 'FalseDilemma',
    'Hasty Generalization': 'HastyGeneralization',
    'Loaded Language': 'LoadedLanguage'
}

print("Ontology Mapping:")
for falcon_name, onto_name in FALLACY_TO_ONTOLOGY.items():
    print(f"  {falcon_name:25} -> {onto_name}")

Ontology Mapping:
  Ad Hominem                -> AdHominem
  Appeal to Fear            -> FearAppeal
  Appeal to Ridicule        -> AppealToRidicule
  False Dilemma             -> FalseDilemma
  Hasty Generalization      -> HastyGeneralization
  Loaded Language           -> LoadedLanguage


## 4. Explore Fallacy Distribution

In [92]:
# Fallacy counts
print("Fallacy Distribution:")
print("=" * 50)
for col in FALLACY_COLUMNS:
    count = falcon_df[col].sum()
    pct = count / len(falcon_df) * 100
    print(f"{col:25} {count:5} ({pct:5.1f}%)")

# Tweets with at least one fallacy
falcon_df['fallacy_count'] = falcon_df[FALLACY_COLUMNS].sum(axis=1)
with_fallacy = (falcon_df['fallacy_count'] > 0).sum()
print(f"\nTweets with ≥1 fallacy: {with_fallacy} ({with_fallacy/len(falcon_df)*100:.1f}%)")

Fallacy Distribution:
Ad Hominem                  259 (  8.9%)
Appeal to Fear              157 (  5.4%)
Appeal to Ridicule          238 (  8.2%)
False Dilemma               168 (  5.8%)
Hasty Generalization         91 (  3.1%)
Loaded Language             457 ( 15.7%)

Tweets with ≥1 fallacy: 1009 (34.6%)


In [93]:
# Distribution of fallacy count per tweet
print("\nFallacies per tweet:")
print(falcon_df['fallacy_count'].value_counts().sort_index())


Fallacies per tweet:
fallacy_count
0    1907
1     708
2     250
3      42
4       9
Name: count, dtype: int64
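
The counts above are computed over the combined frame. Since the `split` column is available, the same labels can also be compared across train/val/test to spot imbalance between splits. A minimal sketch, assuming a hypothetical helper `label_rates_by_split` and a toy frame (the numbers it prints are not FALCON statistics):

```python
import pandas as pd


def label_rates_by_split(df, label_columns, split_column="split"):
    """Percentage of rows in each split that carry each label."""
    # The mean of a 0/1 column is the share of positive rows.
    return df.groupby(split_column)[label_columns].mean().mul(100).round(1)


# Toy frame for illustration only (not FALCON data).
toy = pd.DataFrame({
    "split": ["train", "train", "val", "test"],
    "Ad Hominem": [1, 0, 0, 1],
    "Loaded Language": [1, 1, 0, 0],
})
rates = label_rates_by_split(toy, ["Ad Hominem", "Loaded Language"])
print(rates)
```

In the notebook the call would be `label_rates_by_split(falcon_df, FALLACY_COLUMNS)`.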


## 5. Text Cleaning Functions

In [94]:
def clean_tweet_text(text: str) -> str:
    """
    Clean tweet text for NLP processing.
    
    Normalizations, in the order applied:
    - Replace URLs with [URL]
    - Remove anonymized @user mentions (no entity info)
    - Keep real @mentions (remove @ symbol only)
    - Remove hashtag symbols (keep text)
    - Collapse repeated punctuation (an ellipsis "..." becomes a single ".")
    - Normalize whitespace
    """
    if pd.isna(text):
        return ""
    
    text = str(text)
    
    # Replace URLs
    text = re.sub(r'https?://\S+', '[URL]', text)
    
    # Remove anonymized @user mentions entirely (no entity info)
    text = re.sub(r'@user\b', '', text, flags=re.IGNORECASE)
    
    # For real mentions, remove @ symbol but keep username (for NER)
    text = re.sub(r'@(\w+)', r'\1', text)
    
    # Remove hashtag symbols but keep text
    text = re.sub(r'#(\w+)', r'\1', text)
    
    # Collapse repeated punctuation
    text = re.sub(r'([!?.]){2,}', r'\1', text)
    
    # Normalize whitespace
    text = ' '.join(text.split())
    
    return text.strip()

# Test
test_text = "OMG!!! @realDonaldTrump is DESTROYING America!!! 🇺🇸 #MAGA https://t.co/xyz123"
print(f"Original: {test_text}")
print(f"Cleaned:  {clean_tweet_text(test_text)}")

# Test with anonymized mentions
test_falcon = "[user79987]: @user @user ... @user The unintelligent thing"
print(f"\nFALCON original: {test_falcon}")
print(f"FALCON cleaned:  {clean_tweet_text(test_falcon)}")

Original: OMG!!! @realDonaldTrump is DESTROYING America!!! 🇺🇸 #MAGA https://t.co/xyz123
Cleaned:  OMG! realDonaldTrump is DESTROYING America! 🇺🇸 MAGA [URL]

FALCON original: [user79987]: @user @user ... @user The unintelligent thing
FALCON cleaned:  [user79987]: . The unintelligent thing
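
The FALCON example shows two side effects of the cleaning rules: the `@user @user ... @user` reply prefix leaves a stray `.` behind, and mid-sentence ellipses are collapsed to a single period. If that matters downstream, one possible variant is sketched below. It is an assumption about what a stricter cleaner could look like, not the function used in this notebook; `clean_tweet_text_v2`, `THREAD_PREFIX`, and the `keep_speaker` parameter are hypothetical names.

```python
import re

# Speaker tag followed by any run of anonymized mentions and "..." fillers.
THREAD_PREFIX = re.compile(
    r'^\s*(\[user\d+\]:)\s*(?:@user\b\s*|\.{3}\s*)*', re.IGNORECASE
)


def clean_tweet_text_v2(text, keep_speaker: bool = True) -> str:
    """Variant cleaner: drops the reply prefix and preserves ellipses."""
    # Treat None and float NaN (NaN != NaN) as empty text.
    if text is None or text != text:
        return ""
    text = str(text)
    # Remove the anonymized reply prefix, optionally keeping "[userNNN]:".
    text = THREAD_PREFIX.sub(
        lambda m: f"{m.group(1)} " if keep_speaker else "", text
    )
    text = re.sub(r'https?://\S+', '[URL]', text)              # URLs
    text = re.sub(r'@user\b', '', text, flags=re.IGNORECASE)   # anonymized mentions
    text = re.sub(r'@(\w+)', r'\1', text)                      # real mentions
    text = re.sub(r'#(\w+)', r'\1', text)                      # hashtags
    text = re.sub(r'\.{2,}', '...', text)                      # normalize ellipses
    text = re.sub(r'([!?])[!?]+', r'\1', text)                 # collapse !!! and ???
    return ' '.join(text.split())


sample = "[user79987]: @user @user ... @user The unintelligent thing"
print(clean_tweet_text_v2(sample))
print(clean_tweet_text_v2(sample, keep_speaker=False))
```

Whether to keep the `[userNNN]:` speaker tag depends on whether later stages use it; the sketch leaves that as a switch.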


## 6. Process Dataset

In [95]:
def process_falcon(df: pd.DataFrame) -> pd.DataFrame:
    """
    Process FALCON dataset for pipeline use.
    
    Returns DataFrame with:
    - post_id: unique identifier
    - text: original tweet text
    - text_clean: normalized text
    - techniques: list of ontology class names
    - has_persuasion: 1 if at least one fallacy label is set, else 0
    - split: train/val/test
    - source: dataset marker ('FALCON')
    """
    processed = pd.DataFrame()
    
    # Generate unique IDs
    processed['post_id'] = [f"falcon_{i}" for i in range(len(df))]
    
    # Text columns
    processed['text'] = df['main_tweet'].values
    processed['text_clean'] = df['main_tweet'].apply(clean_tweet_text)
    
    # Extract techniques as list of ontology classes
    def get_techniques(row):
        techniques = []
        for col in FALLACY_COLUMNS:
            if row[col] == 1:
                techniques.append(FALLACY_TO_ONTOLOGY[col])
        return techniques
    
    processed['techniques'] = df.apply(get_techniques, axis=1)
    processed['has_persuasion'] = (df[FALLACY_COLUMNS].sum(axis=1) > 0).astype(int)
    
    # Keep split info
    processed['split'] = df['split'].values
    
    # Source dataset marker
    processed['source'] = 'FALCON'
    
    return processed

# Process
processed_df = process_falcon(falcon_df)

print(f"Processed dataset: {len(processed_df)} rows")
print(f"Columns: {processed_df.columns.tolist()}")

Processed dataset: 2916 rows
Columns: ['post_id', 'text', 'text_clean', 'techniques', 'has_persuasion', 'split', 'source']
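
The `techniques` column is built with a row-wise `df.apply(..., axis=1)`, which is fine at this dataset size. For larger inputs, the same lists can be derived from the label matrix without a per-row pandas call. The sketch below is one way to do that; `techniques_from_flags` is a hypothetical helper and the frame is toy data.

```python
import pandas as pd


def techniques_from_flags(df: pd.DataFrame, mapping: dict) -> pd.Series:
    """Build one list of ontology class names per row from 0/1 label columns."""
    columns = list(mapping)
    names = [mapping[c] for c in columns]
    # Boolean matrix: True where the label equals 1.
    flags = df[columns].eq(1).to_numpy()
    return pd.Series(
        [[name for name, flag in zip(names, row) if flag] for row in flags],
        index=df.index,
    )


# Toy frame for illustration only (not FALCON data).
toy = pd.DataFrame({"Ad Hominem": [0, 1, 1], "Loaded Language": [0, 0, 1]})
toy_mapping = {"Ad Hominem": "AdHominem", "Loaded Language": "LoadedLanguage"}
print(techniques_from_flags(toy, toy_mapping).tolist())
```

Class names come out in the order of the mapping's keys, which matches the order `get_techniques` produces when the mapping follows `FALLACY_COLUMNS`.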


In [103]:
falcon_df[['main_tweet']].head(10)

| | main_tweet |
|---|---|
| 0 | [user104337]: @user @user ... @user Kyrie Irv... |
| 1 | [user79987]: @user @user ... @user Totally di... |
| 2 | [user104337]: @user @user ... @user That's so... |
| 3 | [user79987]: @user @user ... @user The uninte... |
| 4 | [user47446]: @user @user ... @user It's been ... |
| 5 | [user1779]: @user @user ... @user this shit's... |
| 6 | [user47446]: @user @user ... @user Companies ... |
| 7 | [user47446]: @user @user ... @user Plus how d... |
| 8 | [user1779]: @user @user ... @user exactly. bo... |
| 9 | [user47446]: @user @user ... @user Why must w... |


In [96]:
# Preview processed data
processed_df[['post_id', 'text_clean', 'techniques', 'has_persuasion', 'split']].head(10)

| | post_id | text_clean | techniques | has_persuasion | split |
|---|---|---|---|---|---|
| 0 | falcon_0 | [user104337]: . Kyrie Irving just doesn't get ... | [] | 0 | train |
| 1 | falcon_1 | [user79987]: . Totally disagree here. You hit ... | [] | 0 | train |
| 2 | falcon_2 | [user104337]: . That's so unintelligent. You c... | [] | 0 | train |
| 3 | falcon_3 | [user79987]: . The unintelligent thing to do w... | [AdHominem, LoadedLanguage] | 1 | train |
| 4 | falcon_4 | [user47446]: . It's been what.2, 3 weeks and w... | [] | 0 | train |
| 5 | falcon_5 | [user1779]: . this shit's not gonna be easy😂no... | [LoadedLanguage] | 1 | train |
| 6 | falcon_6 | [user47446]: . Companies across the world are ... | [HastyGeneralization] | 1 | train |
| 7 | falcon_7 | [user47446]: . Plus how do you know they'll do... | [] | 0 | train |
| 8 | falcon_8 | [user1779]: . exactly. boycotting will stop a ... | [FalseDilemma] | 1 | train |
| 9 | falcon_9 | [user47446]: . Why must we always be combative... | [FalseDilemma] | 1 | train |


## 7. Summary Statistics

In [97]:
print("=" * 60)
print("FALCON PREPROCESSING SUMMARY")
print("=" * 60)

print(f"\nTotal tweets: {len(processed_df)}")
print(f"  - Train: {(processed_df['split'] == 'train').sum()}")
print(f"  - Val: {(processed_df['split'] == 'val').sum()}")
print(f"  - Test: {(processed_df['split'] == 'test').sum()}")

print(f"\nWith persuasion techniques: {processed_df['has_persuasion'].sum()} ({processed_df['has_persuasion'].mean()*100:.1f}%)")

print(f"\nAvg text length: {processed_df['text_clean'].str.len().mean():.1f} chars")

print("\nTechnique counts:")
all_techniques = [t for techniques in processed_df['techniques'] for t in techniques]
from collections import Counter
for technique, count in Counter(all_techniques).most_common():
    print(f"  {technique:25} {count:5}")

print("\n" + "=" * 60)

FALCON PREPROCESSING SUMMARY

Total tweets: 2916
  - Train: 1811
  - Val: 550
  - Test: 555

With persuasion techniques: 1009 (34.6%)

Avg text length: 182.0 chars

Technique counts:
  LoadedLanguage              457
  AdHominem                   259
  AppealToRidicule            238
  FalseDilemma                168
  FearAppeal                  157
  HastyGeneralization          91



## 8. Export Processed Data

In [98]:
# Create output directory
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Export as CSV
csv_path = OUTPUT_DIR / "falcon_processed.csv"
processed_df.to_csv(csv_path, index=False)
print(f"Exported CSV: {csv_path}")

# Export as JSON (better for list fields)
json_path = OUTPUT_DIR / "falcon_processed.json"
processed_df.to_json(json_path, orient='records', indent=2)
print(f"Exported JSON: {json_path}")

Exported CSV: /Users/luncenok/Studia/sem7/SWSN/semantic_web_project/data/input/processed/falcon_processed.csv
Exported JSON: /Users/luncenok/Studia/sem7/SWSN/semantic_web_project/data/input/processed/falcon_processed.json
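
One caveat with the CSV export: `to_csv` writes each `techniques` list as its string representation, so a plain `pd.read_csv` returns strings rather than lists. A minimal sketch of restoring the lists on load, using an in-memory buffer and toy rows instead of the real file:

```python
import ast
import io

import pandas as pd

# Toy rows for illustration only; the real frame has more columns.
toy = pd.DataFrame({
    "post_id": ["falcon_0", "falcon_3"],
    "techniques": [[], ["AdHominem", "LoadedLanguage"]],
})

# Round-trip through CSV in memory instead of touching the file system.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# ast.literal_eval safely parses the "['A', 'B']" strings back into lists.
restored = pd.read_csv(buffer, converters={"techniques": ast.literal_eval})
print(restored["techniques"].tolist())
```

For the exported file, the same `converters` argument can be passed when reading `falcon_processed.csv`.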


In [99]:
# Verify JSON export
with open(json_path, encoding="utf-8") as f:
    data = json.load(f)

print(f"JSON contains {len(data)} records")
print(f"\nSample record:")
print(json.dumps(data[3], indent=2))

JSON contains 2916 records

Sample record:
{
  "post_id": "falcon_3",
  "text": " [user79987]: @user @user ... @user The unintelligent thing to do would be to go on living like nothing happened. If you have leverage use it. See how quickly a rich ass owner who doesn't care will hold on to something that doesn't generate profits. Contracts don't mean a damn thing anymore...haven't in years. \n ",
  "text_clean": "[user79987]: . The unintelligent thing to do would be to go on living like nothing happened. If you have leverage use it. See how quickly a rich ass owner who doesn't care will hold on to something that doesn't generate profits. Contracts don't mean a damn thing anymore.haven't in years.",
  "techniques": [
    "AdHominem",
    "LoadedLanguage"
  ],
  "has_persuasion": 1,
  "split": "train",
  "source": "FALCON"
}


## 9. Next Steps

The processed FALCON dataset is now ready for:

1. **NLP Analysis** - Entity extraction with spaCy
2. **LLM Processing** - Claim extraction and technique detection with Gemini
3. **RDF Generation** - Creating semantic triples for the knowledge graph

Output files:
- `data/input/processed/falcon_processed.csv` - CSV format
- `data/input/processed/falcon_processed.json` - JSON format (preserves list fields)
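
To make the RDF generation step more concrete, the sketch below turns one exported JSON record into N-Triples lines using only the standard library. Everything about the vocabulary is an assumption for illustration: the `http://example.org/...` namespaces and the `hasText` and `usesTechnique` predicates are placeholders, not the project's actual ontology IRIs, and `record_to_ntriples` is a hypothetical helper.

```python
from typing import Dict, List

# Placeholder namespaces (assumptions, not the project's real IRIs).
BASE = "http://example.org/falcon/"
ONTO = "http://example.org/ontology#"


def nt_literal(value: str) -> str:
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def record_to_ntriples(record: Dict, base: str = BASE, onto: str = ONTO) -> List[str]:
    """Emit one text triple plus one triple per technique for a record."""
    subject = f"<{base}{record['post_id']}>"
    lines = [f"{subject} <{onto}hasText> {nt_literal(record['text_clean'])} ."]
    for technique in record["techniques"]:
        # Technique names are the ontology class names from the mapping step.
        lines.append(f"{subject} <{onto}usesTechnique> <{onto}{technique}> .")
    return lines


# Shortened, made-up record in the exported JSON shape.
record = {
    "post_id": "falcon_3",
    "text_clean": 'He said "no"',
    "techniques": ["AdHominem", "LoadedLanguage"],
}
for line in record_to_ntriples(record):
    print(line)
```

A dedicated RDF library would normally handle serialization and escaping; the hand-rolled version here only illustrates how the `techniques` field maps onto triples.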