# Data Exploration: TruthSeeker Dataset

This notebook takes a first look at the components of the TruthSeeker dataset, in preparation for analyzing how misinformation spreads through bot accounts compared with human accounts.

## Objectives
1. Load and inspect dataset components (FakeNewsNet, CoAID, TwiBot-22)
2. Assess data quality and completeness
3. Understand data structure and relationships
4. Identify integration opportunities and challenges
5. Generate preliminary statistics and visualizations

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
import warnings

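# Silence only FutureWarning noise; other warnings (e.g. pandas DtypeWarning) stay visible
warnings.filterwarnings('ignore', category=FutureWarning)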

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Define paths
DATA_RAW = Path('../data/raw')
DATA_PROCESSED = Path('../data/processed')
DATA_EXTERNAL = Path('../data/external')
RESULTS_FIGURES = Path('../results/figures')

print("Environment setup complete!")

## 1. FakeNewsNet Dataset Exploration

In [None]:
# Load FakeNewsNet data
# NOTE: Update paths based on actual data structure after download

fakenewsnet_path = DATA_EXTERNAL / 'FakeNewsNet'

if fakenewsnet_path.exists():
    print(f"FakeNewsNet directory found at: {fakenewsnet_path}")
    
    # List subdirectories
    subdirs = [d for d in fakenewsnet_path.iterdir() if d.is_dir()]
    print(f"Subdirectories: {[d.name for d in subdirs]}")
    
    # TODO: Load actual data files once downloaded
    # Example (file names are placeholders; adjust to the downloaded layout):
    # politifact_news = pd.read_csv(fakenewsnet_path / 'politifact' / 'news.csv')
    # gossipcop_news = pd.read_csv(fakenewsnet_path / 'gossipcop' / 'news.csv')
else:
    print("FakeNewsNet data not found. Please download using instructions in docs/data_acquisition.md")
    print(f"Expected path: {fakenewsnet_path}")

### FakeNewsNet Data Structure

Expected components:
- News articles (fake/real labels)
- Social media posts (tweets sharing articles)
- User engagement (retweets, likes, replies)
- Temporal information

In [None]:
# Placeholder for FakeNewsNet analysis
# Once data is loaded:
# - Display basic statistics (number of articles, posts, users)
# - Show label distribution (fake vs real)
# - Temporal coverage
# - Sample records
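
The label-distribution step listed above could look roughly like the sketch below. It runs on a small in-memory table rather than the real files, and the column names (`news_id`, `source`, `label`) and the helper `label_summary` are assumptions made for illustration, not the actual FakeNewsNet schema.

```python
import pandas as pd

# Toy stand-in for a FakeNewsNet article table.
# Column names are hypothetical; adjust them to the downloaded files.
news = pd.DataFrame({
    "news_id": ["politifact1", "politifact2", "gossipcop1", "gossipcop2", "gossipcop3"],
    "source": ["politifact", "politifact", "gossipcop", "gossipcop", "gossipcop"],
    "label": ["fake", "real", "fake", "real", "real"],
})


def label_summary(df, label_col="label", group_col="source"):
    """Return the count and within-group share of each label."""
    counts = (
        df.groupby([group_col, label_col])
        .size()
        .rename("count")
        .reset_index()
    )
    # Share of each label relative to its own group total
    group_totals = counts.groupby(group_col)["count"].transform("sum")
    counts["share"] = counts["count"] / group_totals
    return counts


fnn_summary = label_summary(news)
print(fnn_summary)
```

Reporting shares per source rather than overall keeps a large source from masking class imbalance in a smaller one.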

## 2. CoAID Dataset Exploration

In [None]:
# Load CoAID data
coaid_path = DATA_EXTERNAL / 'CoAID'

if coaid_path.exists():
    print(f"CoAID directory found at: {coaid_path}")
    
    # List subdirectories
    subdirs = [d for d in coaid_path.iterdir() if d.is_dir()]
    print(f"Subdirectories: {[d.name for d in subdirs]}")
    
    # TODO: Load actual data files
else:
    print("CoAID data not found. Please download using instructions in docs/data_acquisition.md")
    print(f"Expected path: {coaid_path}")

### CoAID Data Structure

Expected components:
- COVID-19 related news and claims
- Social media posts
- Fact-checking labels
- Multi-modal content

In [None]:
# Placeholder for CoAID analysis
# Once data is loaded:
# - Dataset size and structure
# - Label distribution
# - Temporal coverage
# - Sample records
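
For the temporal coverage item, one possible approach is sketched below on toy data. The column names `tweet_id` and `created_at` and the helper `temporal_coverage` are assumptions; the real CoAID files may name or format timestamps differently.

```python
import pandas as pd

# Toy posts table; column names are hypothetical.
posts = pd.DataFrame({
    "tweet_id": ["1", "2", "3", "4"],
    "created_at": [
        "2020-03-01 10:00:00",
        "2020-03-15 12:30:00",
        "2020-04-02 08:00:00",
        "not a date",  # deliberately malformed to exercise the error path
    ],
})


def temporal_coverage(df, time_col="created_at"):
    """Summarize the time span and monthly volume of a timestamp column."""
    # Unparseable values become NaT instead of raising
    ts = pd.to_datetime(df[time_col], errors="coerce", utc=True)
    valid = ts.dropna()
    monthly = valid.dt.strftime("%Y-%m").value_counts().sort_index()
    return {
        "start": valid.min(),
        "end": valid.max(),
        "unparseable": int(ts.isna().sum()),
        "monthly": monthly,
    }


coverage = temporal_coverage(posts)
print(coverage["start"], coverage["end"], coverage["unparseable"])
print(coverage["monthly"])
```

Counting unparseable timestamps separately makes it visible when a date format assumption is wrong, instead of silently shrinking the dataset.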

## 3. TwiBot-22 Dataset Exploration

In [None]:
# Load TwiBot-22 data
twibot_path = DATA_EXTERNAL / 'TwiBot-22'

if twibot_path.exists():
    print(f"TwiBot-22 directory found at: {twibot_path}")
    
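    # List the first 10 entries (sorted for stable output)
    items = sorted(twibot_path.iterdir())
    suffix = '...' if len(items) > 10 else ''  # ellipsis only when the listing is truncated
    print(f"Contents: {[item.name for item in items[:10]]}{suffix}")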
    
    # TODO: Load actual data files
    # Example:
    # user_labels = pd.read_csv(twibot_path / 'labels.csv')
    # user_profiles = pd.read_json(twibot_path / 'user.json', lines=True)
else:
    print("TwiBot-22 data not found. Please download using instructions in docs/data_acquisition.md")
    print(f"Expected path: {twibot_path}")

### TwiBot-22 Data Structure

Expected components:
- User profiles (bots and humans)
- Bot/human labels
- Tweet content
- Network relationships

In [None]:
# Placeholder for TwiBot-22 analysis
# Once data is loaded:
# - Number of bot vs human accounts
# - Label distribution and confidence
# - Account characteristics
# - Sample records
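
A minimal sketch of the bot vs human count follows, assuming a label table with one row per account. The column names `id` and `label`, the label values, and the helper `bot_human_breakdown` are assumptions for illustration and should be checked against the downloaded TwiBot-22 files.

```python
import pandas as pd

# Toy label table; the repeated "u4" row mimics an accidental duplicate.
labels = pd.DataFrame({
    "id": ["u1", "u2", "u3", "u4", "u4"],
    "label": ["human", "bot", "human", "human", "human"],
})


def bot_human_breakdown(df, id_col="id", label_col="label"):
    """Count accounts per label after dropping repeated account IDs."""
    deduped = df.drop_duplicates(subset=id_col)
    counts = deduped[label_col].value_counts()
    shares = (counts / counts.sum()).round(3)
    return pd.DataFrame({"count": counts, "share": shares})


breakdown = bot_human_breakdown(labels)
print(breakdown)
```

Deduplicating on the account ID first keeps repeated rows from inflating either class.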

## 4. Data Integration Planning

In [None]:
# Identify common fields for integration
# - User IDs (Twitter user IDs)
# - Tweet IDs
# - Timestamps

# Integration strategy:
# 1. Extract user IDs from FakeNewsNet/CoAID posts
# 2. Match with TwiBot-22 user labels
# 3. Create unified dataset with bot/human labels
# 4. Preserve temporal and network information
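
The integration strategy above can be sketched as a left join from posts to account labels. Everything here is illustrative: the tables are toy data, and the column names (`tweet_id`, `user_id`, `created_at`, `label`) and the helper `attach_account_labels` are assumptions. IDs are cast to strings because Twitter IDs are 64-bit integers that can lose precision if pandas reads them as floats.

```python
import pandas as pd

# Toy posts (as extracted from FakeNewsNet/CoAID) and account labels.
posts = pd.DataFrame({
    "tweet_id": ["101", "102", "103", "104"],
    "user_id": ["1", "2", "2", "9"],
    "created_at": pd.to_datetime(
        ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"]
    ),
})
user_labels = pd.DataFrame({
    "user_id": ["1", "2", "3"],
    "label": ["human", "bot", "human"],
})


def attach_account_labels(posts, user_labels, key="user_id"):
    """Left-join account labels onto posts and flag which posts matched."""
    posts = posts.astype({key: "string"})
    user_labels = user_labels.astype({key: "string"})
    merged = posts.merge(
        user_labels,
        on=key,
        how="left",               # keep every post, labeled or not
        validate="many_to_one",   # fail loudly if an account has two label rows
        indicator=True,
    )
    merged["has_label"] = merged["_merge"] == "both"
    return merged.drop(columns="_merge")


labeled = attach_account_labels(posts, user_labels)
print(labeled)
print(f"Label coverage: {labeled['has_label'].mean():.0%}")
```

The coverage figure is worth checking early: if only a small share of posting accounts appear in the label table, the integration may not support the planned comparisons.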

## 5. Data Quality Assessment

In [None]:
# Template for data quality checks
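def assess_data_quality(df, dataset_name):
    """
    Print a data quality report for a dataframe.
    
    Args:
        df: pandas DataFrame
        dataset_name: str, name of dataset for reporting
    
    Returns:
        pandas DataFrame of missing-value counts and percentages per column,
        sorted by percentage in descending order.
    """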
    print(f"\n{'='*60}")
    print(f"Data Quality Assessment: {dataset_name}")
    print(f"{'='*60}\n")
    
    # Basic info
    n_rows = len(df)
    print(f"Shape: {df.shape}")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB\n")
    
    # Missing values (max(n_rows, 1) avoids dividing by zero on an empty frame)
    missing = df.isnull().sum()
    missing_pct = (missing / max(n_rows, 1) * 100).round(2)
    missing_df = pd.DataFrame({
        'Missing Count': missing,
        'Percentage': missing_pct
    }).sort_values('Percentage', ascending=False)
    print("Missing Values:")
    has_missing = missing_df[missing_df['Missing Count'] > 0]
    print(has_missing if not has_missing.empty else "  None")
    
    # Duplicates (raises TypeError if a column holds unhashable values such as lists)
    duplicates = df.duplicated().sum()
    print(f"\nDuplicate rows: {duplicates} ({duplicates / max(n_rows, 1) * 100:.2f}%)")
    
    # Data types
    print("\nData types:")
    print(df.dtypes.value_counts())
    
    return missing_df

# Will be applied to each dataset once loaded

## 6. Preliminary Visualizations

In [None]:
# Placeholder for visualizations
# Once data is loaded, create:
# 1. Distribution of fake vs real news
# 2. Bot vs human label distribution
# 3. Temporal distribution of posts
# 4. User activity distributions
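
The first two plots in the list above could start from something like this sketch. It uses toy data, and the column names (`news_label`, `account_type`) and the helper `plot_category_counts` are assumptions. The `Agg` backend line is only there so the sketch runs as a plain script; omit it inside Jupyter.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for script use; omit inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Toy integrated records; column names are hypothetical.
records = pd.DataFrame({
    "news_label": ["fake", "real", "real", "fake", "real"],
    "account_type": ["bot", "human", "human", "human", "bot"],
})


def plot_category_counts(series, title, ax):
    """Draw a bar chart of category counts on the given axes."""
    counts = series.value_counts().sort_index()
    ax.bar(counts.index.astype(str), counts.values)
    ax.set_title(title)
    ax.set_ylabel("Count")
    return counts


fig, axes = plt.subplots(1, 2, figsize=(12, 4))
news_counts = plot_category_counts(records["news_label"], "Fake vs real news", axes[0])
acct_counts = plot_category_counts(records["account_type"], "Bot vs human accounts", axes[1])
fig.tight_layout()
```

Returning the counts alongside the plot makes it easy to report the same numbers in text without recomputing them.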

## 7. Summary and Next Steps

In [None]:
# Summary of exploration findings
summary = {
    'datasets_available': [],
    'total_records': 0,
    'integration_feasibility': 'TBD',
    'data_quality_issues': [],
    'next_steps': [
        'Download all dataset components',
        'Complete data quality assessment',
        'Develop data integration pipeline',
        'Create analysis-ready dataset',
        'Proceed to RQ-specific analyses'
    ]
}

print("\n" + "="*60)
print("EXPLORATION SUMMARY")
print("="*60)
for key, value in summary.items():
    print(f"\n{key.replace('_', ' ').title()}:")
    if isinstance(value, list):
        for item in value:
            print(f"  - {item}")
    else:
        print(f"  {value}")