# Graph Neural Network for Twitter Bot Detection

This notebook implements a Graph Neural Network (GNN) to detect bot accounts on Twitter using the TwiBot-22 dataset.

## What is a Graph Neural Network?

**Graph Neural Networks (GNNs)** are a class of deep learning models designed to work with graph-structured data. Unlike traditional neural networks that work with grid-like data (images, sequences), GNNs can handle irregular structures like social networks.

### Key Concepts:

1. **Graph Structure**: 
   - **Nodes**: Represent entities (users in our case)
   - **Edges**: Represent relationships (retweets, replies, mentions)
   - **Features**: Each node has features (follower count, tweet metrics, text embeddings)

2. **Message Passing**:
   - Each node aggregates information from its neighbors
   - Information flows through edges in multiple layers
   - Each layer refines the node representations

3. **Why GNNs for Bot Detection?**:
   - Bots often have distinct interaction patterns (who they follow, retweet patterns)
   - GNNs can learn from both node features AND network structure
   - Accounts with similar behavior cluster together in the graph
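
To make the message-passing idea from item 2 concrete, here is a minimal NumPy sketch of one round of mean aggregation on a toy 4-node path graph. The graph, the feature values, and the `mean_aggregate` helper are illustrative assumptions. Real layers such as `GCNConv` or `SAGEConv` also apply learned weight matrices and nonlinearities, which this sketch leaves out.

```python
import numpy as np

# Toy undirected path graph 0-1-2-3; each column is one directed edge (source, target).
edge_index = np.array([
    [0, 1, 1, 2, 2, 3],   # source nodes
    [1, 0, 2, 1, 3, 2],   # target nodes
])

# One scalar feature per node (hypothetical values).
x = np.array([[1.0], [2.0], [3.0], [4.0]])

def mean_aggregate(x, edge_index):
    """One round of message passing: each node averages its neighbors' features."""
    src, dst = edge_index
    summed = np.zeros_like(x)
    np.add.at(summed, dst, x[src])                # sum incoming messages per target node
    deg = np.bincount(dst, minlength=len(x))      # number of incoming messages per node
    return summed / np.maximum(deg, 1)[:, None]   # guard against isolated nodes

h1 = mean_aggregate(x, edge_index)    # 1 layer: each node sees its 1-hop neighbors
h2 = mean_aggregate(h1, edge_index)   # 2 layers: information from up to 2 hops away
print(h1.ravel())
print(h2.ravel())
```

Stacking the call twice mirrors how stacking GNN layers widens each node's receptive field by one hop per layer.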

### Our Implementation:

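We'll build a homogeneous graph (one node type, one edge type) where:
- **Nodes** = Twitter users
- **Edges** = Interactions between users, approximated in this simplified version by two users posting in the same conversation
- **Node Features** = Per-user averages of tweet metrics + averaged TF-IDF vectors of tweet text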
- **Task** = Binary classification (bot vs human)

## 1. Import Required Libraries

In [None]:
# Install required packages (run once)
# %pip install torch torchvision torchaudio
# %pip install torch-geometric
# %pip install transformers
# %pip install scikit-learn
# %pip install imbalanced-learn
# %pip install pandas pyarrow python-dotenv matplotlib seaborn

import warnings
warnings.filterwarnings('ignore')

# Core libraries
import pandas as pd
import numpy as np
import os
from dotenv import load_dotenv

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

# PyTorch Geometric
import torch_geometric
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, SAGEConv, global_mean_pool
from torch_geometric.utils import degree

# Scikit-learn
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced learning
from imblearn.over_sampling import SMOTE

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print(f"PyTorch version: {torch.__version__}")
print(f"PyTorch Geometric version: {torch_geometric.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## 2. Load Preprocessed Data

In [None]:
# Load environment variables
load_dotenv()
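DATASET_DIR = os.getenv("DATASET_DIR")
if DATASET_DIR is None:
    raise RuntimeError("DATASET_DIR is not set; define it in the environment or in a .env file")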

# Load the preprocessed tweets with labels
data_path = os.path.join(DATASET_DIR, "tweets_with_labels.parquet")
df = pd.read_parquet(data_path)

print(f"Loaded dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nLabel distribution:")
print(df['label'].value_counts())
print(f"\nSplit distribution:")
print(df['split'].value_counts() if 'split' in df.columns else "No split column")

# Display sample
df.head()

## 3. Construct Graph Structure

We create a graph where:
- **Nodes** = Unique users (authors of tweets)
- **Edges** = Interactions between users, approximated here by co-participation: two users are linked if they both posted in the same conversation (same `conversation_id`)
- This captures the social network structure for the GNN
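
As a sketch of this construction, the toy example below links every pair of users who appear in the same conversation. It stores the pairs in a set, so users who share several conversations are connected only once. The `conversations` dictionary and the user names are made-up data, and deduplicating with a set is an assumption of this sketch rather than part of the notebook's own cell.

```python
from itertools import combinations

# Hypothetical toy data: conversation_id -> authors who posted in it.
conversations = {
    "c1": ["alice", "bob", "carol"],
    "c2": ["alice", "bob"],   # alice and bob co-occur a second time
    "c3": ["dave"],           # a single-author conversation yields no edges
}

users = sorted({u for authors in conversations.values() for u in authors})
user_to_idx = {user: idx for idx, user in enumerate(users)}

edges = set()
for authors in conversations.values():
    # Every unordered pair of distinct authors in the conversation
    for a, b in combinations(sorted(set(authors)), 2):
        i, j = user_to_idx[a], user_to_idx[b]
        edges.add((i, j))
        edges.add((j, i))   # store both directions to represent an undirected edge

edge_list = sorted(edges)
print(edge_list)
```

Note that all-pairs linking grows quadratically with the number of participants, so very large conversations can dominate the edge count.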

In [None]:
# Create user mapping: author_id_str -> node index
unique_users = df['author_id_str'].unique()
user_to_idx = {user: idx for idx, user in enumerate(unique_users)}
idx_to_user = {idx: user for user, idx in user_to_idx.items()}

num_nodes = len(unique_users)
print(f"Number of nodes (unique users): {num_nodes}")

# Build edges from retweet/reply/quote interactions
# For this simplified version, we create edges based on same conversation_id
edge_list = []

# Group by conversation to find interactions
for conv_id, group in df.groupby('conversation_id'):
    authors = group['author_id_str'].unique()
    # Create edges between all users in the same conversation
    for i, user_a in enumerate(authors):
        for user_b in authors[i+1:]:
            if user_a in user_to_idx and user_b in user_to_idx:
                edge_list.append([user_to_idx[user_a], user_to_idx[user_b]])
                edge_list.append([user_to_idx[user_b], user_to_idx[user_a]])  # Undirected

# reshape(-1, 2) keeps the result at shape [2, 0] even when no edges were found
edge_index = torch.tensor(edge_list, dtype=torch.long).reshape(-1, 2).t().contiguous()
print(f"Number of edges: {edge_index.shape[1]}")
print(f"Edge index shape: {edge_index.shape}")

## 4. Feature Engineering and Text Processing

**Text Handling Strategy**: Use TF-IDF to convert tweet text into numerical vectors
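- Produces a sparse bag-of-words vector per tweet, weighting each term by how distinctive it is across the corpus
- Keeps the feature space small by capping the vocabulary (`max_features=100` below)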
- More efficient than large transformer models for this scale
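
Because the model classifies users rather than tweets, the per-tweet vectors have to be pooled into one vector per user. Here is a minimal sketch of that mean pooling, using made-up 3-dimensional vectors in place of real TF-IDF output. All array names are illustrative.

```python
import numpy as np

# Hypothetical per-tweet vectors (stand-ins for rows of a TF-IDF matrix).
tweet_vectors = np.array([
    [1.0, 0.0, 0.0],   # tweet 0, by user 0
    [0.0, 1.0, 0.0],   # tweet 1, by user 0
    [0.0, 0.0, 2.0],   # tweet 2, by user 1
])
tweet_user_idx = np.array([0, 0, 1])   # node index of each tweet's author
num_users = 2

# Sum the tweet vectors of each user, then divide by that user's tweet count.
user_vectors = np.zeros((num_users, tweet_vectors.shape[1]))
np.add.at(user_vectors, tweet_user_idx, tweet_vectors)
tweet_counts = np.bincount(tweet_user_idx, minlength=num_users)
user_vectors /= tweet_counts[:, None]

print(user_vectors)
```

Indexing by position (`tweet_user_idx`) rather than by DataFrame index label keeps the pooling correct even when the DataFrame index is not a clean 0..N-1 range.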

In [None]:
# 1. Text features using TF-IDF (dimensionality reduction)
tfidf = TfidfVectorizer(max_features=100, stop_words='english', max_df=0.8, min_df=2)
text_features = tfidf.fit_transform(df['text'].fillna('')).toarray()

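# Aggregate text features per user (mean of all their tweets).
# Map each tweet to its author's node index by position, so the result does not
# depend on the DataFrame index, and size the array from the fitted vocabulary,
# which can hold fewer than max_features terms.
tweet_user_idx = df['author_id_str'].map(user_to_idx).to_numpy()
user_text_features = np.zeros((num_nodes, text_features.shape[1]))
np.add.at(user_text_features, tweet_user_idx, text_features)

# Average the features (every user has at least one tweet)
tweet_counts = np.bincount(tweet_user_idx, minlength=num_nodes)
user_text_features /= tweet_counts[:, None]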

print(f"Text features shape: {user_text_features.shape}")

# 2. Numerical user features
numerical_features = ['retweet_count', 'like_count', 'reply_count', 'quote_count', 'text_length']
# Per-user mean of each tweet-level metric, aligned to node order
user_numerical_features = (
    df.groupby('author_id_str')[numerical_features]
    .mean()
    .reindex(unique_users)
    .to_numpy()
)

print(f"Numerical features shape: {user_numerical_features.shape}")

# 3. Combine all features
node_features = np.concatenate([user_text_features, user_numerical_features], axis=1)
print(f"Combined node features shape: {node_features.shape}")

## 5. Feature Scaling and Normalization

Normalize features to have zero mean and unit variance for better training stability.
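
Standardization rescales each feature column as `z = (x - mean) / std`. The sketch below computes the mean and standard deviation from training nodes only and then applies them to every node, which keeps validation and test statistics out of the preprocessing. The `train_mask` array and the feature values are hypothetical. Restricting the fit to the training split is a suggestion of this sketch, not what the notebook's own cell does.

```python
import numpy as np

# Hypothetical node features (4 nodes, 2 features) and a training mask.
features = np.array([
    [1.0, 10.0],
    [3.0, 30.0],
    [5.0, 50.0],
    [100.0, 0.0],   # held-out node with an extreme value
])
train_mask = np.array([True, True, True, False])

# Fit: per-column mean and population standard deviation from training nodes only.
mean = features[train_mask].mean(axis=0)
std = features[train_mask].std(axis=0)
std[std == 0] = 1.0   # leave constant columns unscaled instead of dividing by zero

# Transform: apply the training statistics to every node.
scaled = (features - mean) / std

print(scaled[train_mask].mean(axis=0))   # approximately 0 for each column
print(scaled[train_mask].std(axis=0))    # approximately 1 for each column
```

With `StandardScaler`, the same effect could be obtained by calling `fit` on the training rows and `transform` on all rows, assuming a split mask is available for the nodes.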

In [None]:
# Normalize features using StandardScaler
scaler = StandardScaler()
node_features_scaled = scaler.fit_transform(node_features)

# Convert to PyTorch tensors
x = torch.tensor(node_features_scaled, dtype=torch.float)

print(f"Scaled features shape: {x.shape}")
print(f"Feature statistics after scaling:")
print(f"  Mean: {x.mean(dim=0)[:5]}...")  # Show first 5
print(f"  Std: {x.std(dim=0)[:5]}...")

# Extract labels for each node
# One label per node: the label on the user's first tweet in df, aligned to node order
first_labels = df.drop_duplicates('author_id_str').set_index('author_id_str')['label']
node_labels = first_labels.reindex(unique_users).to_numpy(dtype=np.int64)

y = torch.tensor(node_labels, dtype=torch.long)

print(f"\nLabel distribution in graph:")
unique, counts = torch.unique(y, return_counts=True)
for label, count in zip(unique.tolist(), counts.tolist()):  # plain ints print cleanly
    print(f"  Label {label}: {count} ({100*count/num_nodes:.2f}%)")