# Competition Overview

The Social Media Extremism Detection Challenge is a Kaggle binary text classification competition focused on distinguishing potentially extremist social media posts from regular content. Participants receive an anonymized dataset of short English-language messages labeled as EXTREMIST (promotes/supports extremist ideology, organizations, or violence) or NON_EXTREMIST. The challenge is part of a Community Impact Initiative emphasizing AI for social good, online safety, and responsible content moderation research.

The task is educational and research-oriented—exploring fairness, robustness, and interpretability in detecting harmful content. Submissions are evaluated using classification accuracy on a hidden test set.

- Timeline: Started ~November 2025, closed January 7, 2026.
- Prize Pool: 200 dollar total (100 dollar – 1st, 60 dollar – 2nd, 40 dollar – 3rd) and digital certificates.
- Content Warning: Dataset contains disturbing or hateful references provided solely for research purposes.

# Introduction

Online platforms face increasing challenges in moderating extremist content while preserving free expression. Extremist posts are often subtle, context-dependent, and rapidly evolving, making automated detection difficult yet critical for reducing online harm.

This competition provides a curated, anonymized dataset of 2,776 hand-labeled messages. The data distribution is approximately 48% EXTREMIST and 52% NON_EXTREMIST. While relatively balanced, the nuance of the language requires models that can capture deep semantic context.

The provided notebook implements a state-of-the-art pipeline: aggressive data augmentation (synonym replacement and word swapping), Optuna hyperparameter tuning across 5 strong transformer models (RoBERTa, DeBERTa-v3, ELECTRA, DistilBERT, and a specialized Toxic-Comment model), and a final soft-voting ensemble. This approach achieves a high held-out accuracy of ~81.8%, ensuring a robust and generalized submission.

# Objective

The main goal is to develop a high-accuracy binary classifier that distinguishes extremist from non-extremist content. Specifically, the notebook aims to:

- Enhance Data Quality: Create a perfectly balanced training set through strong augmentation to ensure the model doesn't overfit to majority-class linguistic patterns.
- Optimize Architecture: Fine-tune 5 diverse transformer models, leveraging Optuna to find the ideal learning rates, dropout, and batch sizes.
- Implement Advanced Training: Apply Mixed Precision (AMP), Gradient Accumulation, and Cosine LR scheduling to maximize hardware efficiency.
- Maximize Generalization: Build a 5-model soft-voting ensemble to reduce variance and improve the final Private Leaderboard score.

# Data Dictionary

**Data Characteristics & Constraints**

- Language: English.
- Format: Short-form social media text (resembles microblogging style).
- Privacy: Data is anonymized; however, it retains references to public institutions (e.g., NHS) and general geographic/social groups.
- Balanced Distribution: The training set is relatively balanced (~48% Extremist vs. ~52% Non-Extremist), reducing the need for extreme class-weighting but benefiting from the augmentation techniques described in the pipeline.
- Note on Content: As this dataset deals with extremism detection, it contains sensitive and potentially disturbing language. It is intended strictly for research and the development of safety-focused AI.

**File Descriptions**

The dataset is split into three primary files, providing a total of 2,776 labeled examples in the training set and a corresponding unlabeled test set.

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-za14{border-color:inherit;text-align:left;vertical-align:bottom}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-za14">File Name</th>
    <th class="tg-7zrl">Description</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">train.csv</td>
    <td class="tg-0lax">The training set containing labeled messages used to train and validate machine learning models.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">test.csv</td>
    <td class="tg-0lax">The test set for which participants must predict the extremism labels.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">sample_submission.csv</td>
    <td class="tg-0lax">A template demonstrating the required format for submission (ID and Prediction).</td>
  </tr>
</tbody>
</table>

**Field Definitions**

The following table outlines the schema for both the training and testing data.

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-za14{border-color:inherit;text-align:left;vertical-align:bottom}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-za14">Column Name</th>
    <th class="tg-7zrl">Data Type</th>
    <th class="tg-7zrl">Description</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">ID</td>
    <td class="tg-7zrl">Integer</td>
    <td class="tg-0lax">A unique identifier for each social media post (Range: 1–2800).</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Original_Message</td>
    <td class="tg-7zrl">String (Text)</td>
    <td class="tg-0lax">The raw content of the social media post. Includes slang, punctuation, and casing (UTF-8 encoded).</td>
  </tr>
  <tr>
    <td class="tg-7zrl">Extremism_Label</td>
    <td class="tg-7zrl">Categorical</td>
    <td class="tg-0lax">The target variable indicating the nature of the content (Present in train.csv only).</td>
  </tr>
</tbody>
</table>

**Label Schema**

For the purpose of binary classification, the Extremism_Label is defined by the following criteria:

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-2b7s{text-align:right;vertical-align:bottom}
.tg .tg-za14{border-color:inherit;text-align:left;vertical-align:bottom}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-za14">Label Value</th>
    <th class="tg-7zrl">Numeric Map (Suggested)</th>
    <th class="tg-7zrl">Definition</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">EXTREMIST</td>
    <td class="tg-2b7s">1</td>
    <td class="tg-0lax">Posts that clearly promote, endorse, or advocate for extremist ideologies, specific organizations, or violence.</td>
  </tr>
  <tr>
    <td class="tg-7zrl">NON_EXTREMIST</td>
    <td class="tg-2b7s">0</td>
    <td class="tg-0lax">Posts that are neutral, critical without extremism, or unrelated to extremist rhetoric.</td>
  </tr>
</tbody>
</table>

# Pipeline Overview

The pipeline follows a systematic path from raw text to refined prediction:

- Environment Setup: Installs nlpaug for augmentation, transformers, and Optuna.
- Data Loading & Cleaning: Loads `train.csv` and `test.csv`, applying light cleaning (removing excessive whitespace and handling null posts).
- Data Balancing via Augmentation: Uses synonym replacement and random swapping to double the minority class size, resulting in a perfectly balanced training set.
- Model Selection & Hyperparameter Tuning: Runs Optuna trials on 5 models using fold cross-validation to optimizes parameters like `weight_decay, label_smoothing, and warmup_ratio`.
- Final Ensemble Training: Re-trains the optimized models on the full augmented dataset.
- Prediction & Submission: Generates probabilities, applies a 0.5 threshold, and outputs the final `submission.csv`.

# Approach

This solution employs a modern, high-performance NLP pipeline tailored for social media text:

**Data Preparation & Augmentation**

Instead of simple oversampling, we use nlpaug to generate synthetic samples that maintain the original intent but vary the vocabulary. This "jittering" of the data helps transformers generalize better to slang and misspellings common in extremist rhetoric.

**Model Selection**

The approach utilizes a diverse ensemble to capture different aspects of text:

- RoBERTa-base & DeBERTa-v3: Excellent at understanding complex contextual relationships and long-range dependencies.
- ELECTRA-base: Highly efficient at discriminative tasks.
- DistilBERT: Provides a lightweight baseline to ensure the ensemble doesn't become overly "top-heavy."
- Toxic-Comment Model: A pre-specialized transformer that brings domain knowledge of online hate speech into the ensemble.

**Training Strategy**

- Mixed Precision (AMP): Allows training larger models on limited GPU memory by using 16-bit floats where possible.
- Cosine Learning Rate Decay: Smoothly reduces the learning rate, helping the model settle into a more stable local minimum.
- Soft-Voting Ensemble: By averaging the probabilities (rather than just the hard labels), the ensemble accounts for the confidence levels of each model, leading to more nuanced predictions.

# Environment and Configuration

## Environment and Package Installation

The first step prepares a specialized environment. It installs nlpaug for data creation and optuna for automated tuning.

- Why this matters: Large language models (`transformers`) require specific versions of torch and transformers libraries to ensure that the pre-trained weights load correctly.

In [1]:
%%capture
! pip install nlpaug
! pip install torch==2.6.0
! pip install optuna
! pip install wordcloud
! pip install protobuf==3.20.0
! pip install kaggle
! kaggle competitions download -c social-media-extremism-detection-challenge

In [2]:
import torch
print(torch.__version__)

2.6.0+cu124


## Download and Unzip Dataset

Due to this pipeline will be develop on cloud, so it has to download competition dataset via Kaggle API.

In [2]:
# configuring the path of Kaggle.json file
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [8]:
# extracting the compessed Dataset
from zipfile import ZipFile
dataset = 'social-media-extremism-detection-challenge.zip'

with ZipFile(dataset,'r') as zip:
  zip.extractall()
  print('The dataset is extracted')

The dataset is extracted


## Import Libraries

**Library Imports & Configuration**

This block loads the tools needed for data science (Pandas, NumPy) and deep learning (PyTorch).

- Key Detail: A global random seed is set. In machine learning, "randomness" affects weight initialization and data splitting; setting a seed ensures that if you run the code twice, you get the exact same results.

In [3]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS

import nltk
from nltk.corpus import stopwords

import torch
from torch.utils.data import Dataset, DataLoader
from torch.amp import autocast, GradScaler
from torch.cuda.amp import autocast, GradScaler
from torch.optim import AdamW

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    get_cosine_schedule_with_warmup
)

import optuna
import joblib

import nlpaug.augmenter.word as naw

# CONFIGURATION & DEVICE SETUP
%matplotlib inline

SEED = 42
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

warnings.simplefilter('ignore')

SAVE_DIR = "optuna_tuning"
os.makedirs(SAVE_DIR, exist_ok=True)

2026-01-11 16:35:15.501220: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1768149315.748293      55 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1768149315.818759      55 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1768149316.426986      55 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1768149316.427039      55 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1768149316.427041      55 computation_placer.cc:177] computation placer alr

# Data Preprocessing

## Load Dataset and Data Cleaning

The raw social media messages are loaded and "standardized."

- Cleaning Process: It removes excessive repeated characters (e.g., `"!!!!!"`) which can confuse a model's tokenizer. It also handles empty posts to prevent the model from crashing during training.

In [None]:
# Load & Clean
train_df = pd.read_csv('train.csv')
test_df  = pd.read_csv('test.csv')

train_df.rename(columns={'Original_Message':'message', 'Extremism_Label':'label'}, inplace=True)
test_df.rename(columns={'Original_Message':'message'}, inplace=True)

test_ids = test_df['ID'].copy()
train_df.drop(['ID'], axis=1, inplace=True)
test_df.drop(['ID'], axis=1, inplace=True)
train_df.dropna(inplace=True)

def clean_text(text):
    if not isinstance(text, str) or len(text.strip()) == 0:
        return "empty post"
    text = text.strip()
    if len(text) < 5:
        return "empty post"
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

train_df['message'] = train_df['message'].apply(clean_text)
test_df['message']  = test_df['message'].apply(clean_text)

train_df['label'] = train_df['label'].map({'EXTREMIST': 1, 'NON_EXTREMIST': 0}).astype(int)

## Train and Validation Split

The 2,776 training samples are split: 90 percent for training and 10 percent for validation.

- Stratified Sampling: This ensures the 10 percent "exam" (validation set) has the same ratio of `EXTREMIST` to `NON_EXTREMIST` posts as the original data, preventing biased evaluation.

In [2]:
train_data, val_data = train_test_split(
    train_df, test_size=0.1, random_state=42, stratify=train_df['label']
)

print(f"Train: {len(train_data)} | Val: {len(val_data)} | Test: {len(test_df)}")

Train: 2024 | Val: 225 | Test: 750


## Oversampling (Augmentation)

nlpaug is a comprehensive Python library designed to automate data augmentation for Natural Language Processing (NLP) and audio tasks. In the same way you might flip or rotate an image to give a computer vision model more examples, nlpaug creates synthetic variations of your text or speech to improve model robustness, prevent overfitting, and balance small datasets.

**Core Theories & Concepts**

The library is built around three fundamental architectural concepts:

**1. The Augmenter (The "How")**

An Augmenter is the basic building block. It defines a specific strategy to change your data. Every augmenter typically follows a hierarchy: Level (Character, Word, or Sentence) to Action (Insert, Substitute, Swap, or Delete).

- Insert: Adding new elements (chars/words) into the sequence.
- Substitute: Replacing existing elements with something similar.
- Swap: Changing the order of elements.
- Delete: Removing elements to simulate missing data.

**2. Flow (The "Pipeline")**

Real-world data often contains multiple types of "noise." A Flow allows you to arrange multiple augmenters into a pipeline.

- Sequential: Applies a list of augmentations in a specific order.
- Sometimes: Applies an augmentation only with a certain probability (e.g., 50 percent of the time).

**3. Safety vs. Diversity**

A key theoretical trade-off in NLP augmentation is Safety (preserving the original label/meaning) versus Diversity (how much the text changes). `nlpaug` offers tools across this spectrum, from "Safe" synonym replacement to "Diverse" contextual generation.

**Levels of Augmentation**

Character LevelFocuses on simulating "noise" common in human input or digital processing.
- Keyboard Augmenter: Simulates typos by replacing characters with nearby keys on a QWERTY keyboard (e.g., "Google" to "Goofle").
- OCR Augmenter: Simulates Optical Character Recognition errors where characters look similar (e.g., "0" vs "O", "I" vs "1").
- Random Augmenter: Randomly inserts, swaps, or deletes characters.

**Word Level**

These strategies change the vocabulary while attempting to keep the meaning intact.
- Synonym Augmenter: Uses WordNet or PPDB to find and swap words with synonyms.
- Word Embeddings Augmenter: Uses pre-trained vectors (Word2Vec, GloVe, FastText) to find words that are semantically "close" in vector space.
- Contextual Word Embeddings: Uses Transformer models (BERT, RoBERTa) to predict the most likely word to fill a "masked" spot, ensuring the change fits the sentence's grammar.
- Back Translation: Translates text to another language (e.g., German) and back to English to generate a paraphrase.

**Sentence Level**

Focuses on generating entirely new segments of text.
- Abstractive Summarization: Uses models like BART or T5 to summarize a long text into a shorter, varied version.
- Contextual Sentence Augmenter: Uses GPT-2 or XLNet to generate a continuation or a new sentence based on existing context.
- LAMBADA: Specifically designed for few-shot learning by generating samples that follow a specific class distribution.

**Why use it?**

- Class Imbalance: If you have 1,000 "Positive" reviews but only 100 "Negative" ones, you can use nlpaug to synthesize 900 new negative reviews.
- Robustness: By training on "Keyboard" noise, your model becomes better at understanding users who make typos.
- Low Resource: It helps train deep learning models (which are data-hungry) when you have very limited labeled data.

**Popular Augmenters in nlpaug**

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-za14{border-color:inherit;text-align:left;vertical-align:bottom}
.tg .tg-7zrl{text-align:left;vertical-align:bottom}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg"><thead>
  <tr>
    <th class="tg-za14">Augmenter</th>
    <th class="tg-7zrl">Level</th>
    <th class="tg-7zrl">Theory/Concept</th>
    <th class="tg-7zrl">Best Use Case</th>
  </tr></thead>
<tbody>
  <tr>
    <td class="tg-7zrl">KeyboardAug</td>
    <td class="tg-7zrl">Character</td>
    <td class="tg-7zrl">Typo Simulation</td>
    <td class="tg-0lax">Social media text / Chatbots</td>
  </tr>
  <tr>
    <td class="tg-7zrl">SynonymAug</td>
    <td class="tg-7zrl">Word</td>
    <td class="tg-7zrl">Lexical Semantics</td>
    <td class="tg-0lax">General text variety</td>
  </tr>
  <tr>
    <td class="tg-7zrl">ContextualWordEmbs</td>
    <td class="tg-7zrl">Word</td>
    <td class="tg-7zrl">Transformers / Masked LM</td>
    <td class="tg-0lax">High-quality, fluent text</td>
  </tr>
  <tr>
    <td class="tg-7zrl">BackTranslation</td>
    <td class="tg-7zrl">Sentence</td>
    <td class="tg-7zrl">Round-trip Paraphrasing</td>
    <td class="tg-0lax">Diversifying sentence structure</td>
  </tr>
  <tr>
    <td class="tg-7zrl">RandomWordAug</td>
    <td class="tg-7zrl">Word</td>
    <td class="tg-7zrl">EDA (Easy Data Aug)</td>
    <td class="tg-0lax">Preventing overfitting</td>
  </tr>
</tbody></table>

**Data Balancing via Augmentation**

Because the dataset may have slight imbalances, nlpaug is used to create synthetic data.

- Synonym Replacement: A word is swapped for its synonym (e.g., "attack" to "assault").
- Random Swap: The order of two words is changed.
- Result: This grows the training set to approx. 4,136 samples, creating a perfectly 50/50 balance that helps the model learn both classes equally well.

In [3]:
# Perfectly Balanced Augmentation
print("\nGenerating perfectly balanced training data...")

synonym_aug = naw.SynonymAug(aug_p=0.3)
swap_aug    = naw.RandomWordAug(action="swap", aug_p=0.25)

def strong_augment(text):
    t = synonym_aug.augment(text)
    t = t[0] if isinstance(t, list) else t
    t = swap_aug.augment(t)
    t = t[0] if isinstance(t, list) else t
    return t

n_ext = (train_data['label'] == 1).sum()
n_non = (train_data['label'] == 0).sum()
target_per_class = max(n_ext, n_non) * 2

print(f"Original : NON_EXTREMIST: {n_non}, EXTREMIST: {n_ext}")
print(f"Target per class : {target_per_class}")


Generating perfectly balanced training data...
Original : NON_EXTREMIST: 990, EXTREMIST: 1034
Target per class : 2068


In [4]:
aug_samples = []

# Augment NON_EXTREMIST
non_df = train_data[train_data['label'] == 0]
needed_non = target_per_class - n_non
print(f"Generating {needed_non} NON_EXTREMIST samples...")
for _ in range(needed_non):
    sample = non_df.sample(1).iloc[0]
    aug_samples.append({'message': strong_augment(sample['message']), 'label': 0})

# Augment EXTREMIST (light)
ext_df = train_data[train_data['label'] == 1]
needed_ext = target_per_class - n_ext
print(f"Generating {needed_ext} EXTREMIST samples...")
for _ in range(needed_ext):
    sample = ext_df.sample(1).iloc[0]
    new_text = synonym_aug.augment(sample['message'])
    new_text = new_text[0] if isinstance(new_text, list) else new_text
    aug_samples.append({'message': new_text, 'label': 1})

train_balanced = pd.concat([train_data, pd.DataFrame(aug_samples)], ignore_index=True)
train_balanced = train_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"Final balanced training set: {len(train_balanced)} samples")
print(train_balanced['label'].value_counts().sort_index().to_string())

Generating 1078 NON_EXTREMIST samples...
Generating 1034 EXTREMIST samples...
Final balanced training set: 4136 samples
label
0    2068
1    2068


## Dataset Tokenization

In the world of Large Language Models (LLMs), a tokenizer is the essential translation layer that converts raw human text into a format that a neural network can process. While humans read words, LLMs read numbers.

**Why Not Just Use Words?**

You might wonder why we don't just give every word in the dictionary its own number. Modern LLMs use Subword Tokenization (like BPE or WordPiece) for several critical reasons:
- Handling New Words: If a model only knows "happy," it would be confused by "unhappiness." Subword tokenizers break it into un + happi + ness. This allows the model to understand words it has never seen before by looking at their components.
- Efficiency: Character-level tokenization (A, B, C...) makes sequences too long and hard to process. Word-level tokenization makes the "vocabulary" too large (millions of words). Subword tokenization hits the "sweet spot."
- Root Recognition: It helps the model see that "running," "runner," and "runs" all share the root "run," rather than treating them as three entirely unrelated concepts.

**1. The Core Concept:From Text to Tensors**

A tokenizer follows a three-step theoretical process to prepare data for the model:

**Step A: Normalization**

The raw text is cleaned. This includes removing extra whitespaces, converting to lowercase (if the model is "uncased"), and sometimes handling unicode normalization (ensuring "é" is represented consistently).

**Step B: Pre-tokenization & Splitting**

The text is split into smaller units. Historically, there were three ways to do this:
- Word-level: Every word is a token.
- Problem: Massive vocabulary size and inability to handle "Out-of-Vocabulary" (OOV) words (e.g., if the model knows "run" but not "running").
- Character-level: Every letter is a token.
- Problem: Sequences become extremely long, and characters alone carry very little meaning.
- Subword-level (The Standard): This is what modern LLMs use. It breaks frequent words into single units and rare words into multiple chunks (e.g., "smartwatch" becomes ["smart", "##watch"]).

**Step C: Numerical Mapping**

Every unique subword in the tokenizer’s vocabulary is assigned a unique Input ID (an integer). The model uses these IDs to look up the corresponding Embedding Vector (the mathematical representation of that word's meaning).

**2. Theoretical Algorithms**

There are three main algorithms used by popular LLMs:
- Byte Pair Encoding (BPE): Used by GPT-3, GPT-4, and Llama.10 It starts with individual characters and iteratively merges the most frequently occurring adjacent pairs into a single new token.
- WordPiece: Used by BERT. It is similar to BPE but uses a likelihood-based approach rather than just raw frequency to decide which characters to merge.
- Unigram: Used by T5 and ALBERT. Instead of building up from characters, it starts with a massive vocabulary and trims away the least useful tokens until it reaches the desired size.

**3. How the Tokenization Process Works**

The transition from text to machine-readable data happens in three distinct stages:

- Normalization: The tokenizer cleans the text by removing extra spaces, standardizing casing, or handling special characters (like converting "Résumé" to "resume").
- Splitting (Segmentation): The text is broken into chunks. Depending on the model, a token can be a whole word, a part of a word (subword), or even a single character.
- Mapping to IDs: Each unique chunk is looked up in a "vocabulary" (a massive dictionary) and replaced with its corresponding integer ID.

This creates a "data pipeline" for PyTorch. It takes raw text and uses a Tokenizer to turn words into numbers (`input_ids`) and creates an `attention_mask` so the model knows which parts of the text to focus on and which are just padding.

In [None]:
# Dataset Class 
class ExtremismDataset(Dataset):
    def __init__(self, df, tokenizer, max_len=256):
        self.texts   = df['message'].values
        self.labels  = df['label'].values if 'label' in df.columns else None
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self): return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx] if self.labels is not None else -1

        enc = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_len,
            padding=False,
            add_special_tokens=True,
            return_attention_mask=True
        )

        item = {
            'input_ids': torch.tensor(enc['input_ids'], dtype=torch.long),
            'attention_mask': torch.tensor(enc['attention_mask'], dtype=torch.long),
        }
        if label != -1:
            item['labels'] = torch.tensor(label, dtype=torch.long)
        return item

# Model Training

The pipeline uses five distinct "expert" models. By combining different architectures, the final prediction becomes more robust.

## Initial Model Fine-Tuning

**StratifiedKFold**

`StratifiedKFold` is a specialized version of the standard K-Fold cross-validation technique. While regular K-Fold randomly splits data into 2$k$ groups (folds), `StratifiedKFold` ensures that each fold maintains the same proportion of classes as the original dataset.

This is critical when you have imbalanced data—for example, if you are detecting a rare disease where only 1% of patients are positive.4 A random split might accidentally create a "fold" with 0% of those positive cases, making it impossible for the model to learn or be tested fairly.

**How it Works (Step-by-Step)**

Imagine you have a dataset with 100 samples: 80 are "Neutral" and 20 are "Extremist" ($80:20$ ratio). If you use a 5-fold StratifiedKFold:
- Calculate the Ratio: It calculates that the ratio is 4:1.
- Divide the Classes: It separates the 80 Neutrals and 20 Extremists.
- Distribute Evenly: It places exactly 16 Neutrals and 4 Extremists into each of the 5 folds.
- Rotate: It trains on 4 folds and validates on 1, repeating this 5 times until every sample has been part of a validation set once.

**1. RoBERTa-base (`FacebookAI/roberta-base`)**

- What it is: A "Robustly Optimized" version of the original BERT model.
- The Theory: Facebook researchers found that the original BERT was significantly under-trained. RoBERTa uses the same architecture but changes the training recipe.
- Key Innovations: 
    - Dynamic Masking: Unlike BERT, which masks the same words in every epoch, RoBERTa changes the mask every time it sees a sequence, forcing it to learn more general patterns.
    - More Data: Trained on 160GB of text (10x more than BERT).
    - No NSP: It removed the "Next Sentence Prediction" task, which was found to be unnecessary.
- Role in Ensemble: It serves as a highly stable "all-rounder" with a deep understanding of English linguistics.

**2. DeBERTa-v3-base (`microsoft/deberta-v3-base`)**

- What it is: Currently one of the strongest "base-sized" models in existence.
- The Theory: It uses "Disentangled Attention," which treats the content of a word and its relative position as two separate vectors.
- Key Innovations:
    - V3 Upgrade: The "v3" version specifically uses ELECTRA-style pre-training (explained below), making it much more sample-efficient.
    - Gradient-Disentangled Embedding: Improves how the model shares information between the layers.
- Role in Ensemble: This is likely your "heavy lifter." It is exceptionally good at understanding the context and intent behind social media posts.

**3. Toxic-Comment-Model (`martin-ha/toxic-comment-model`)**

- What it is: A specialized model already "tuned" for detecting harmful language.
- The Theory: It is based on DistilBERT but has been fine-tuned on the Jigsaw Toxic Comment Classification dataset.
- Key Feature: While the other models are general-purpose, this one was born to detect toxicity, insults, and threats.
- Role in Ensemble: It provides domain expertise. Since extremist content often overlaps with toxicity, this model brings "pre-existing knowledge" of what hate speech looks like.

**4. ELECTRA-base (`google/electra-base-discriminator`)**

- What it is: A "Detective" model that learns by spotting fakes.
- The Theory: Instead of guessing "Masked" words, ELECTRA uses a Generator-Discriminator setup (similar to a GAN).
- Key Innovation:
    - Replaced Token Detection: A small model (the Generator) replaces some words with plausible fakes. The main model (the Discriminator) must decide for every word: "Is this original or a fake?"
- Role in Ensemble: Because it checks every single token for "suspicion," it is very good at spotting "dog whistles" or coded language often used in extremist circles.

**5. DistilBERT-base (`distilbert/distilbert-base-uncased`)**

- What it is: A smaller, faster, "distilled" version of BERT.
- The Theory: It uses Knowledge Distillation, where a large "Teacher" model (BERT) teaches a smaller "Student" model (DistilBERT) how to behave.
- Key Features:
    - Efficiency: 40% smaller and 60% faster than BERT.
    - Uncased: It treats "Extremist" and "extremist" exactly the same, reducing the complexity of the vocabulary.
- Role in Ensemble: It acts as a regularizer. Because it is simpler, it is less likely to overfit on tiny noise in the training data, helping the ensemble stay generalized.

**Ensemble Model : Soft Voting**

The final model uses Soft Voting (also known as Weighted Averaging) to determine if a post is extremist.

- How it works: Each of the 5 models outputs a probability score (e.g., 0.85 chance of being extremist).
- The Math: The code takes the average of these probabilities:$$\text{Final Probability} = \frac{\text{RoBERTa} + \text{DeBERTa} + \text{ToxicModel} + \text{ELECTRA} + \text{DistilBERT}}{5}$$
- Why it's better: If RoBERTa is 51% sure it's extremist, but DeBERTa is 99% sure, soft voting gives more "weight" to DeBERTa's high confidence. A simple "Hard Vote" (majority rule) would treat them both as just "1 vote."

In [2]:
# 5-MODEL ENSEMBLE 
MODELS = [
    "FacebookAI/roberta-base",
    "microsoft/deberta-v3-base",
    "martin-ha/toxic-comment-model",
    "google/electra-base-discriminator",     # Added: Strong "weak" learner
    "distilbert/distilbert-base-uncased"      # Added: Fast & effective
]

test_predictions = {}
val_predictions = {}

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)   # Increase number of validation split for more reliable result

for model_name in MODELS:
    print(f"\nInintial Fined-Tuning {model_name} with 2-fold CV...")

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    fold_test_probs = []

    for fold, (tr_idx, val_idx) in enumerate(skf.split(train_balanced, train_balanced['label'])):
        print(f"  Fold {fold+1}/2", end="\r")             ##############

        tr_fold = train_balanced.iloc[tr_idx].reset_index(drop=True)
        va_fold = train_balanced.iloc[val_idx].reset_index(drop=True)

        train_ds = ExtremismDataset(tr_fold, tokenizer)
        val_ds   = ExtremismDataset(va_fold, tokenizer)
        
        train_loader = DataLoader(train_ds, batch_size=2, shuffle=True,  collate_fn=data_collator)
        val_loader   = DataLoader(val_ds,   batch_size=2, shuffle=False, collate_fn=data_collator)

        model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
        model.to(DEVICE)

        optimizer = AdamW([
            {'params': model.base_model.parameters() if hasattr(model, 'base_model') else model.parameters(), 'lr': 1e-5},
            {'params': model.classifier.parameters(), 'lr': 1e-4}
        ], weight_decay=0.01)

        scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=100, num_training_steps=len(train_loader)*10)
        scaler = GradScaler('cuda')

        best_acc = 0
        patience = 2
        wait = 0

        for epoch in range(10):
            model.train()
            for batch in train_loader:
                batch = {k: v.to(DEVICE) for k, v in batch.items()}
                with autocast('cuda'):
                    outputs = model(**batch)
                    loss = outputs.loss
                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()
                scheduler.step()
                optimizer.zero_grad()

            # Validation
            model.eval()
            preds = []
            with torch.no_grad():
                for batch in val_loader:
                    batch = {k: v.to(DEVICE) for k, v in batch.items()}
                    with autocast('cuda'):
                        logits = model(**batch).logits
                    preds.extend(torch.softmax(logits, dim=1)[:,1].cpu().numpy())
            acc = accuracy_score(va_fold['label'], (np.array(preds)>0.5).astype(int))
            print(f"\nValidation Accuracy: {acc:.4f}")
            
            if acc > best_acc:
                best_acc = acc
                wait = 0
                torch.save(model.state_dict(), f"/best_{model_name.split('/')[-1]}_f{fold}.pt")
            else:
                wait += 1
                if wait >= patience:
                    break

    # Held-out validation (on original val_data)
    val_ds_final = ExtremismDataset(val_data, tokenizer)
    val_loader_final = DataLoader(val_ds_final, batch_size=2, collate_fn=data_collator)
    all_val_probs = []
    for f in range(5):
        try:
            model.load_state_dict(torch.load(f"/best_{model_name.split('/')[-1]}_f{f}.pt"))
            model.eval()
            probs = []
            with torch.no_grad():
                for batch in val_loader_final:
                    batch = {k: v.to(DEVICE) for k, v in batch.items()}
                    with autocast('cuda'):
                        logits = model(**batch).logits
                    probs.extend(torch.softmax(logits, dim=1)[:,1].cpu().numpy())
            all_val_probs.append(probs)
        except:
            pass  # Skip missing folds
    val_predictions[model_name] = np.mean(all_val_probs, axis=0) if all_val_probs else np.zeros(len(val_data))

# Final 5-Model Ensemble + Held-out Accuracy
final_val_prob = np.mean(list(val_predictions.values()), axis=0)
val_acc = accuracy_score(val_data['label'], (final_val_prob > 0.5).astype(int))
print(f"\nFINAL HELD-OUT VALIDATION ACCURACY (5-Model Ensemble): {val_acc:.4f}")


Inintial Fined-Tuning FacebookAI/roberta-base with 2-fold CV...


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

  Fold 1/2

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



Validation Accuracy: 0.7340

Validation Accuracy: 0.8714

Validation Accuracy: 0.8893

Validation Accuracy: 0.8946

Validation Accuracy: 0.9115

Validation Accuracy: 0.9163

Validation Accuracy: 0.9033

Validation Accuracy: 0.9149
  Fold 2/2

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Validation Accuracy: 0.8225

Validation Accuracy: 0.8085

Validation Accuracy: 0.8946

Validation Accuracy: 0.9009

Validation Accuracy: 0.8985

Validation Accuracy: 0.8825

Inintial Fined-Tuning microsoft/deberta-v3-base with 2-fold CV...


tokenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/579 [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]



  Fold 1/2

pytorch_model.bin:   0%|          | 0.00/371M [00:00<?, ?B/s]

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



Validation Accuracy: 0.8656

Validation Accuracy: 0.8936

Validation Accuracy: 0.8907

Validation Accuracy: 0.9110

Validation Accuracy: 0.9120

Validation Accuracy: 0.9120

Validation Accuracy: 0.8999
  Fold 2/2

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Validation Accuracy: 0.8583

Validation Accuracy: 0.8878

Validation Accuracy: 0.8951

Validation Accuracy: 0.8980

Validation Accuracy: 0.8960

Validation Accuracy: 0.8931

Inintial Fined-Tuning martin-ha/toxic-comment-model with 2-fold CV...


tokenizer_config.json:   0%|          | 0.00/403 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

  Fold 1/2

config.json:   0%|          | 0.00/704 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/268M [00:00<?, ?B/s]

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



Validation Accuracy: 0.6431

Validation Accuracy: 0.6712

Validation Accuracy: 0.6668

Validation Accuracy: 0.7065

Validation Accuracy: 0.7495

Validation Accuracy: 0.7655

Validation Accuracy: 0.7684

Validation Accuracy: 0.7631

Validation Accuracy: 0.7640
  Fold 2/2
Validation Accuracy: 0.6228

Validation Accuracy: 0.7234

Validation Accuracy: 0.7427

Validation Accuracy: 0.7602

Validation Accuracy: 0.7582

Validation Accuracy: 0.7650

Validation Accuracy: 0.7737

Validation Accuracy: 0.7674

Validation Accuracy: 0.7684

Inintial Fined-Tuning google/electra-base-discriminator with 2-fold CV...


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/666 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

  Fold 1/2

  _torch_pytree._register_pytree_node(


pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a ElectraTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



Validation Accuracy: 0.8694

Validation Accuracy: 0.8612

Validation Accuracy: 0.8965

Validation Accuracy: 0.8752

Validation Accuracy: 0.9057

Validation Accuracy: 0.9096

Validation Accuracy: 0.9018

Validation Accuracy: 0.9086
  Fold 2/2

Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Validation Accuracy: 0.8540

Validation Accuracy: 0.8486

Validation Accuracy: 0.8844

Validation Accuracy: 0.8994

Validation Accuracy: 0.8994

Validation Accuracy: 0.9038

Validation Accuracy: 0.9038

Validation Accuracy: 0.9052

Validation Accuracy: 0.9033

Validation Accuracy: 0.9023

Inintial Fined-Tuning distilbert/distilbert-base-uncased with 2-fold CV...


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

  Fold 1/2

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.



Validation Accuracy: 0.8607

Validation Accuracy: 0.8690

Validation Accuracy: 0.8839

Validation Accuracy: 0.8965

Validation Accuracy: 0.8864

Validation Accuracy: 0.8994

Validation Accuracy: 0.8999

Validation Accuracy: 0.8956

Validation Accuracy: 0.8956
  Fold 2/2

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Validation Accuracy: 0.8607

Validation Accuracy: 0.8709

Validation Accuracy: 0.8975

Validation Accuracy: 0.9018

Validation Accuracy: 0.8946

Validation Accuracy: 0.9009

FINAL HELD-OUT VALIDATION ACCURACY (5-Model Ensemble): 0.8267


## Hyperparameters Tuning

Optuna is an open-source framework designed to automate Hyperparameter Optimization (HPO). While traditional methods like Grid Search or Random Search blindly try combinations of settings, Optuna uses a "smart" approach called Bayesian Optimization to learn from past experiments and find the best settings faster.

**Key Concepts: Study vs. Trial**

Optuna organizes its work into two main categories:
- Study: The overall optimization project (e.g., "Find the best settings for RoBERTa").
- Trial: A single "experiment" where Optuna tries one specific set of hyperparameters, trains the model, and checks the score.

**How Optuna Works: The "Trial & Study" Loop**

Optuna treats hyperparameter tuning as an iterative process involving three main components:
- Objective Function: This is the core logic you write. It takes a "Trial" object, trains your model with suggested parameters, and returns a score (like Accuracy or Loss).
- Study: This is the "manager" that coordinates all experiments. It tracks the history of results and decides which hyperparameters to try next.
- Trial: A single execution of your objective function. During a trial, Optuna "suggests" specific values for your hyperparameters.

**The Internal Mechanics**

Optuna is more efficient than other tools because it uses two specialized engines:

**A. The Sampler**

By default, Optuna uses the Tree-structured Parzen Estimator (TPE).
- Logic: It divides previous results into "good" and "bad" groups. It then calculates the probability that a new set of parameters belongs to the "good" group.
- Result: It focuses the search on promising areas of the map rather than wasting time on settings that consistently fail.

**B. The Pruner**

Pruning is automated early stopping.
- Logic: If a trial is halfway through training and its intermediate score (e.g., validation loss after 5 epochs) is significantly worse than the median of previous trials, Optuna kills the trial immediately.
- Result: This saves massive amounts of GPU/CPU time, allowing you to run 5–10 times more experiments in the same window.

**Defined Fine-Tuned Hyperparameters**

**1. Learning Rate Parameters**

These control how fast the "brain" of the model updates its knowledge.
- `lr_backbone` ($1\times10^{-6}$ to $1\times10^{-4}$):
    - Meaning: The learning rate for the main Transformer layers (the pre-trained part).
    - Why it matters: We use a smaller rate here because the backbone already knows English. We don't want to "overwrite" its existing knowledge too aggressively.
- `lr_classifier` ($1\times10^{-5}$ to $1\times10^{-3}$):
    - Meaning: The learning rate for the final "head" that decides if text is extremist.
    - Why it matters: This layer starts from scratch (randomly initialized), so it usually needs a higher learning rate than the backbone to learn the specific task quickly.

**2. Training Dynamics & Stability**

These parameters manage the math and memory usage during training.
- `batch_size` (2, 4, 8, 16):
    - Meaning: How many rows of data the model looks at before calculating an error. Smaller batches (2, 4) are noisier but fit in GPU memory; larger batches (16) provide a smoother "map" for the model to follow.
- `gradient_accumulation_steps` (1, 2, 4, 8):
    - Meaning: This simulates a larger batch size. If your batch is 2 and accumulation is 8, the model acts as if it saw 16 rows before actually updating its weights. This is a "memory hack."
- `max_grad_norm` (0.1 to 5.0):
    - Meaning: Also known as Gradient Clipping. If the math "explodes" and produces a massive update number, this hyperparameter "clips" it down to a maximum value.
    - Why it matters: It prevents the model from "breaking" during training if it encounters a very confusing sentence.

**3. Optimizer Configuration (`AdamW`)**

The model uses the AdamW optimizer, which has "sub-parameters" that control its "memory."
- `adam_epsilon` ($1\times10^{-10}$ to $1\times10^{-5}$):
    - Meaning: A tiny number added to the denominator to prevent "division by zero" errors in the math.
- `adam_beta1 & adam_beta2`:
    - Meaning: These control the "momentum." Beta1 (0.8–0.95) is how much the model remembers the direction of previous updates. Beta2 (0.98–0.99) is how much it remembers the volatility of those updates.
- `warmup_ratio` (0.0 to 0.5):
    - Meaning: The percentage of the first steps where the learning rate starts at zero and slowly climbs to the maximum.
    - Why it matters: Like a car engine in winter, the model needs to "warm up" to avoid crashing (diverging) at the very start of training.

**4. Generalization & Regularization**

These prevent the model from simply memorizing the training data (overfitting).
- `dropout` (0.0 to 0.5):
    - Meaning: Randomly "turns off" some neurons during training.
    - Why it matters: It forces the model to find multiple patterns instead of relying on just one specific word.
- `layerwise_decay` (0.85 to 1.0):
    - Meaning: A sophisticated technique where lower layers (close to the input) learn slower than higher layers.
    - The Logic: Layer 1 knows "Grammar" (don't change it much). Layer 12 knows "Context/Extremism" (change it more).
- `label_smoothing` (0.0 to 0.3):
    - Meaning: Instead of telling the model a post is "100% Extremist," you tell it the post is "90% Extremist."
    - Why it matters: It prevents the model from becoming "overconfident," which helps it perform better on new, unseen data.
- `patience` (2 to 10):
    - Meaning: How many epochs to wait for the score to improve before giving up (Early Stopping).

Instead of manually guessing settings, Optuna runs dozens of "trials" to find the best configuration for each model.

- Parameters Tuned: Learning rate, batch size, weight decay, and dropout.
- Fold CV: Each trial is tested twice on different slices of data to ensure the settings aren't just "lucky."

In [3]:
# MAIN TUNING & TRAINING 
MODELS = [
    "martin-ha/toxic-comment-model",
    "google/electra-base-discriminator",
    "distilbert/distilbert-base-uncased",
    "FacebookAI/roberta-base",
    "microsoft/deberta-v3-base"
]

# EXPANDED HYPERPARAM RANGES FOR BETTER ACCURACY (>0.97 target)
BATCH_SIZE_CHOICES = [2, 4, 8, 16]  # Added 32
GRAD_ACC_CHOICES = [1, 2, 4, 8]        # Added 8
LAYERWISE_CHOICES = [0.85, 0.9, 0.95, 0.98, 1.0]  # Added 0.85

val_predictions_for_meta = {}  # same length for all models
best_params = {}
studies = {}
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)              # Increase number of validation split for more reliable result

def tune_and_train(model_name):
    model_key = model_name.split("/")[-1]
    study_path = f"{SAVE_DIR}/{model_key}_study.pkl"

    if os.path.exists(study_path):
        study = joblib.load(study_path)
        print(f"Resumed study for {model_name} ({len(study.trials)} trials)")
    else:
        study = optuna.create_study(direction='maximize')
        print(f"Started new study for {model_name}")

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    def objective(trial):
        params = {
            'lr_backbone': trial.suggest_float('lr_backbone', 1e-6, 1e-4, log=True),  
            'lr_classifier': trial.suggest_float('lr_classifier', 1e-5, 1e-3, log=True),  
            'batch_size': trial.suggest_categorical('batch_size', BATCH_SIZE_CHOICES),
            'epochs': trial.suggest_int('epochs', 5, 30),  
            'max_grad_norm': trial.suggest_float('max_grad_norm', 0.1, 5.0),  
            'gradient_accumulation_steps': trial.suggest_categorical('gradient_accumulation_steps', GRAD_ACC_CHOICES),
            'warmup_ratio': trial.suggest_float('warmup_ratio', 0.0, 0.5),  
            'weight_decay': trial.suggest_float('weight_decay', 1e-6, 0.1, log=True),  
            'adam_epsilon': trial.suggest_float('adam_epsilon', 1e-10, 1e-5, log=True),  
            'adam_beta1': trial.suggest_float('adam_beta1', 0.8, 0.95),  
            'adam_beta2': trial.suggest_float('adam_beta2', 0.98, 0.9999),  
            'dropout': trial.suggest_float('dropout', 0.0, 0.5),  
            'layerwise_decay': trial.suggest_categorical('layerwise_decay', LAYERWISE_CHOICES),
            'label_smoothing': trial.suggest_float('label_smoothing', 0.0, 0.3),  
            'patience': trial.suggest_int('patience', 2, 10)  
        }

        # Training loop
        tr_idx, va_idx = next(skf.split(train_balanced, train_balanced['label']))
        tr_fold = train_balanced.iloc[tr_idx].reset_index(drop=True)
        va_fold = train_balanced.iloc[va_idx].reset_index(drop=True)

        train_ds = ExtremismDataset(tr_fold, tokenizer)
        val_ds = ExtremismDataset(va_fold, tokenizer)

        train_loader = DataLoader(train_ds, batch_size=params['batch_size'], shuffle=True, collate_fn=data_collator)
        val_loader = DataLoader(val_ds, batch_size=2, shuffle=False, collate_fn=data_collator)

        model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

        # Apply dropout
        if hasattr(model.config, 'hidden_dropout_prob'):
            model.config.hidden_dropout_prob = params['dropout']
        if hasattr(model.config, 'attention_probs_dropout_prob'):
            model.config.attention_probs_dropout_prob = params['dropout']
        if hasattr(model.config, 'classifier_dropout'):
            model.config.classifier_dropout = params['dropout']

        model.to(DEVICE)

        # Optimizer
        if params['layerwise_decay'] < 1.0:
            param_groups = []
            layer_idx = 0
            num_layers = 12
            for name, param in model.named_parameters():
                if 'classifier' in name or 'pooler' in name:
                    lr = params['lr_classifier']
                else:
                    decay_factor = params['layerwise_decay'] ** (num_layers - layer_idx)
                    lr = params['lr_backbone'] * decay_factor
                    if 'layer' in name and f"layer.{layer_idx}" in name:
                        layer_idx += 1
                param_groups.append({'params': param, 'lr': lr})
            optimizer = AdamW(param_groups, weight_decay=params['weight_decay'],
                              eps=params['adam_epsilon'], betas=(params['adam_beta1'], params['adam_beta2']))
        else:
            optimizer = AdamW([
                {'params': model.base_model.parameters() if hasattr(model, 'base_model') else model.parameters(),
                 'lr': params['lr_backbone']},
                {'params': model.classifier.parameters(), 'lr': params['lr_classifier']}
            ], weight_decay=params['weight_decay'], eps=params['adam_epsilon'],
               betas=(params['adam_beta1'], params['adam_beta2']))

        total_steps = len(train_loader) * params['epochs'] // params['gradient_accumulation_steps']
        warmup_steps = int(total_steps * params['warmup_ratio'])  # Use ratio here
        scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)

        scaler = GradScaler('cuda')

        best_acc = 0
        wait = 0
        for epoch in range(params['epochs']):
            model.train()
            optimizer.zero_grad()
            accumulated = 0
            for batch in train_loader:
                batch = {k: v.to(DEVICE) for k, v in batch.items()}
                with autocast('cuda'):
                    outputs = model(**batch)
                    loss = outputs.loss
                    if params['label_smoothing'] > 0.0:
                        log_probs = torch.log_softmax(outputs.logits, dim=-1)
                        smooth_loss = -log_probs.mean()
                        loss = (1 - params['label_smoothing']) * loss + params['label_smoothing'] * smooth_loss
                scaler.scale(loss / params['gradient_accumulation_steps']).backward()
                accumulated += 1
                if accumulated % params['gradient_accumulation_steps'] == 0:
                    scaler.unscale_(optimizer)
                    torch.nn.utils.clip_grad_norm_(model.parameters(), params['max_grad_norm'])
                    scaler.step(optimizer)
                    scaler.update()
                    scheduler.step()
                    optimizer.zero_grad()
                    accumulated = 0

            # Validation
            model.eval()
            preds = []
            with torch.no_grad():
                for batch in val_loader:
                    batch = {k: v.to(DEVICE) for k, v in batch.items()}
                    with autocast('cuda'):
                        logits = model(**batch).logits
                    preds.extend(torch.softmax(logits, dim=1)[:,1].cpu().numpy())
            acc = accuracy_score(va_fold['label'], (np.array(preds)>0.5).astype(int))

            if acc > best_acc:
                best_acc = acc
                wait = 0
            else:
                wait += 1
                if wait >= params['patience']:
                    break
        return best_acc

    study.optimize(objective, n_trials=3)                   # Increase number of trial for better accuracy score
    # Save study
    best_params[model_name] = study.best_params
    studies[model_name] = study
    joblib.dump(study, study_path)
    print(f"Saved: {model_name} tuning complete")
    print(f"Best accuracy: {study.best_value:.5f}")
    print(f"Best params: {study.best_params}")

    # Final training + collect predictions on ORIGINAL val_data
    val_loader = DataLoader(ExtremismDataset(val_data, tokenizer), batch_size=2, shuffle=False, collate_fn=data_collator)

    # Final model with best params
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    if hasattr(model.config, 'hidden_dropout_prob'):
        model.config.hidden_dropout_prob = best_params[model_name]['dropout']
    if hasattr(model.config, 'attention_probs_dropout_prob'):
        model.config.attention_probs_dropout_prob = best_params[model_name]['dropout']
    model.to(DEVICE)

    # Use warmup_ratio
    dummy_loader = DataLoader(ExtremismDataset(train_balanced, tokenizer), batch_size=best_params[model_name]['batch_size'], shuffle=True, collate_fn=data_collator)
    total_steps = len(dummy_loader) * best_params[model_name]['epochs'] // best_params[model_name]['gradient_accumulation_steps']
    warmup_steps = int(total_steps * best_params[model_name]['warmup_ratio'])

    optimizer = AdamW([
        {'params': model.base_model.parameters() if hasattr(model, 'base_model') else model.parameters(),
         'lr': best_params[model_name]['lr_backbone']},
        {'params': model.classifier.parameters(), 'lr': best_params[model_name]['lr_classifier']}
    ], weight_decay=best_params[model_name]['weight_decay'])

    scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
    scaler = GradScaler('cuda')

    # Train on full train_balanced
    for epoch in range(best_params[model_name]['epochs']):
        model.train()
        for batch in dummy_loader:
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            with autocast('cuda'):
                outputs = model(**batch)
                loss = outputs.loss
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), best_params[model_name]['max_grad_norm'])
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()
            optimizer.zero_grad()

    # Predict on val_data 
    model.eval()
    val_preds = []
    with torch.no_grad():
        for batch in val_loader:
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            with autocast('cuda'):
                logits = model(**batch).logits
            val_preds.extend(torch.softmax(logits, dim=1)[:,1].cpu().numpy())
    val_predictions_for_meta[model_name] = np.array(val_preds)

# Run all
for model_name in MODELS:
    tune_and_train(model_name)

[I 2026-01-10 20:00:59,866] A new study created in memory with name: no-name-0fb7cb93-834c-4656-882a-5318fac619ca


Started new study for martin-ha/toxic-comment-model


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[I 2026-01-10 20:04:16,258] Trial 0 finished with value: 0.7683752417794971 and parameters: {'lr_backbone': 1.59896870269066e-05, 'lr_classifier': 0.0003553107526483185, 'batch_size': 8, 'epochs': 13, 'max_grad_norm': 2.027937182498757, 'gradient_accumulation_steps': 1, 'warmup_ratio': 0.252973615198735, 'weight_decay': 0.0008570500681034111, 'adam_epsilon': 1.1067762943551042e-10, 'adam_beta1': 0.8824502834432306, 'adam_beta2': 0.9886947574333563, 'dropout': 0.3699323596768875, 'layerwise_decay': 0.95, 'label_smoothing': 0.15041277469644465, 'patience': 6}. Best is trial 0 with value: 0.7683752417794971.
[I 2026-01-10 20:06:36,759] Trial 1 finished with value: 0.6387814313346228 and parameters: {'lr_backbone': 4.264654675486179e-06, 'lr_classifier': 0.00060

Saved: martin-ha/toxic-comment-model tuning complete
Best accuracy: 0.76838
Best params: {'lr_backbone': 1.59896870269066e-05, 'lr_classifier': 0.0003553107526483185, 'batch_size': 8, 'epochs': 13, 'max_grad_norm': 2.027937182498757, 'gradient_accumulation_steps': 1, 'warmup_ratio': 0.252973615198735, 'weight_decay': 0.0008570500681034111, 'adam_epsilon': 1.1067762943551042e-10, 'adam_beta1': 0.8824502834432306, 'adam_beta2': 0.9886947574333563, 'dropout': 0.3699323596768875, 'layerwise_decay': 0.95, 'label_smoothing': 0.15041277469644465, 'patience': 6}


[I 2026-01-10 20:10:57,883] A new study created in memory with name: no-name-41a8ccb5-566d-4b63-a4f3-be20b1df3e99


Started new study for google/electra-base-discriminator


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a ElectraTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[I 2026-01-10 20:14:04,577] Trial 0 finished with value: 0.90715667311412 and parameters: {'lr_backbone': 1.3061038867920613e-05, 'lr_classifier': 2.410174274243116e-05, 'batch_size': 8, 'epochs': 7, 'max_grad_norm': 1.2528908938212762, 'gradient_accumulation_steps': 1, 'warmup_ratio': 0.05018311190961994, 'weight_decay': 0.04722807273266896, 'adam_epsilon': 8.337249440375688e-10, 'adam

Saved: google/electra-base-discriminator tuning complete
Best accuracy: 0.90716
Best params: {'lr_backbone': 1.3061038867920613e-05, 'lr_classifier': 2.410174274243116e-05, 'batch_size': 8, 'epochs': 7, 'max_grad_norm': 1.2528908938212762, 'gradient_accumulation_steps': 1, 'warmup_ratio': 0.05018311190961994, 'weight_decay': 0.04722807273266896, 'adam_epsilon': 8.337249440375688e-10, 'adam_beta1': 0.911707652040623, 'adam_beta2': 0.9987009538095121, 'dropout': 0.21316024816287132, 'layerwise_decay': 0.98, 'label_smoothing': 0.039982154988141956, 'patience': 2}


Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2026-01-10 20:24:29,946] A new study created in memory with name: no-name-24278e4c-2f9a-4c00-ac6e-5058b9c83403


Started new study for distilbert/distilbert-base-uncased


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[I 2026-01-10 20:30:09,549] Trial 0 finished with value: 0.8655705996131529 and parameters: {'lr_backbone': 1.718154329847068e-06, 'lr_classifier': 4.8829238726036785e-05, 'batch_size': 4, 'epochs': 14, 'max_grad_norm': 2.8883196377787046, 'gradient_accumulation_steps': 1, 'warmup_ratio': 0.11692275817827491, 'weight_decay': 3.0444509945486562e-05, 'adam_epsilon': 7.257845758947511e-06, 'adam_beta1': 

Saved: distilbert/distilbert-base-uncased tuning complete
Best accuracy: 0.89313
Best params: {'lr_backbone': 3.951470242347571e-05, 'lr_classifier': 0.0006295043533964989, 'batch_size': 16, 'epochs': 8, 'max_grad_norm': 1.2087031969143767, 'gradient_accumulation_steps': 4, 'warmup_ratio': 0.17692978563624223, 'weight_decay': 0.0003824305986479276, 'adam_epsilon': 5.64923071355306e-08, 'adam_beta1': 0.8839395285597531, 'adam_beta2': 0.9808558820874684, 'dropout': 0.1841324994752428, 'layerwise_decay': 0.95, 'label_smoothing': 0.23047962141740042, 'patience': 8}


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2026-01-10 20:35:35,741] A new study created in memory with name: no-name-c44fe1f3-c814-4c2e-bd24-8fbc25e9bf9a


Started new study for FacebookAI/roberta-base


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[I 2026-01-10 20:56:48,069] Trial 0 finished with value: 0.9081237911025145 and parameters: {'lr_backbone': 4.503597126597298e-06, 'lr_classifier': 0.00020998816014500615, 'batch_size': 2, 'epochs': 23, 'max_grad_norm': 2.6791006320124806, 'gradient_accumulation_steps': 2, 'warmup_ratio': 0.23369126450399175, 'weight_decay': 5.908659888525604e-06, 'adam_epsilon': 5.630417961345653e-10, 'adam_beta

Saved: FacebookAI/roberta-base tuning complete
Best accuracy: 0.92070
Best params: {'lr_backbone': 8.901975708567137e-05, 'lr_classifier': 0.00025789896217847557, 'batch_size': 16, 'epochs': 30, 'max_grad_norm': 3.3327717757057753, 'gradient_accumulation_steps': 1, 'warmup_ratio': 0.11060143650747228, 'weight_decay': 0.000146898092643942, 'adam_epsilon': 2.437829476675409e-10, 'adam_beta1': 0.8864108530900816, 'adam_beta2': 0.9802683800470975, 'dropout': 0.46433951303496224, 'layerwise_decay': 0.98, 'label_smoothing': 0.28239493723898756, 'patience': 10}


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[I 2026-01-10 21:28:21,135] A new study created in memory with name: no-name-b348efb8-136e-45ae-9733-0cbaec3f4aea


Started new study for microsoft/deberta-v3-base


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[I 2026-01-10 21:38:17,379] Trial 0 finished with value: 0.8733075435203095 and parameters: {'lr_backbone': 3.959976506319883e-06, 'lr_classifier': 0.00034659571605606085, 'batch_size': 8, 'epochs': 13, 'max_grad_norm': 0.1555932167036509, 'gradient_accumulation_steps': 1, 'warmup_ratio': 0.44345135711477085, 'weight_decay': 0.003594511596372888, 'adam_epsilon': 1.7458498738842842e-10, 'adam_beta1': 0.80505354708820

Saved: microsoft/deberta-v3-base tuning complete
Best accuracy: 0.87718
Best params: {'lr_backbone': 1.0917768427382281e-05, 'lr_classifier': 2.2848477363553595e-05, 'batch_size': 8, 'epochs': 20, 'max_grad_norm': 4.444830723642736, 'gradient_accumulation_steps': 8, 'warmup_ratio': 0.004335758500113307, 'weight_decay': 0.08439088300132051, 'adam_epsilon': 2.2359627460340451e-07, 'adam_beta1': 0.9458224791749428, 'adam_beta2': 0.9953379791797905, 'dropout': 0.033516985577914216, 'layerwise_decay': 0.85, 'label_smoothing': 0.17374719473309874, 'patience': 10}


Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Final Training and Submission Creation

## Final Ensemble Training with Best Parameters

Once Optuna finds the best model settings, each model is re-trained one last time on the entire balanced dataset.

- Mixed Precision (AMP): Uses 16-bit math instead of 32-bit to speed up training by 1.5x to 2x without losing accuracy.
- Cosine Scheduling: Gradually lowers the learning rate in a wave-like pattern to help the model "settle" into the best possible logic.

The five models' predictions are averaged using Soft-Voting.

- The Logic: If Model A is 90 percent sure a post is `EXTREMIST` and Model B is only 51 percent sure, the average (70.5 percent) is more reliable than just looking at one.
- Final Threshold: Any average probability over 0.5 is labeled EXTREMIST.

In [6]:
# 5-MODEL LIST 
MODELS = [
    "FacebookAI/roberta-base",
    "microsoft/deberta-v3-base",
    "google/electra-base-discriminator",
    "distilbert/distilbert-base-uncased",
    "martin-ha/toxic-comment-model"
]

# Containers for predictions
val_predictions = {}  # For held-out validation
test_predictions = {} # For submission

print("Starting final training with best Optuna hyperparameters...\n")

for model_name in MODELS:
    model_key = model_name.split("/")[-1]
    study_path = f"{SAVE_DIR}/{model_key}_study.pkl"

    if not os.path.exists(study_path):
        raise FileNotFoundError(f"Study not found: {study_path}. Run tuning first!")

    # Load best params
    study = joblib.load(study_path)
    best_params = study.best_params
    print(f"Loaded best params for {model_name}")
    print(f"  - Best CV accuracy: {study.best_value:.5f}")
    print(f"  - Params: {best_params}\n")

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    # Datasets
    train_loader = DataLoader(
        ExtremismDataset(train_balanced, tokenizer),
        batch_size=best_params['batch_size'],
        shuffle=True,
        collate_fn=data_collator
    )
    val_loader = DataLoader(
        ExtremismDataset(val_data, tokenizer),
        batch_size=best_params.get('batch_size', 16),
        shuffle=False,
        collate_fn=data_collator
    )
    test_loader = DataLoader(
        ExtremismDataset(test_df, tokenizer),
        batch_size=best_params.get('batch_size', 16),
        shuffle=False,
        collate_fn=data_collator
    )

    # Model
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    
    # Apply dropout
    if hasattr(model.config, 'hidden_dropout_prob'):
        model.config.hidden_dropout_prob = best_params['dropout']
    if hasattr(model.config, 'attention_probs_dropout_prob'):
        model.config.attention_probs_dropout_prob = best_params['dropout']
    if hasattr(model.config, 'classifier_dropout'):
        model.config.classifier_dropout = best_params['dropout']

    model.to(DEVICE)

    # Optimizer with layerwise LR if needed
    if best_params['layerwise_decay'] < 1.0:
        param_groups = []
        layer_idx = 0
        num_layers = 12  # Adjust if needed
        for name, param in model.named_parameters():
            if 'classifier' in name or 'pooler' in name:
                lr = best_params['lr_classifier']
            else:
                decay_factor = best_params['layerwise_decay'] ** (num_layers - layer_idx)
                lr = best_params['lr_backbone'] * decay_factor
                if 'layer' in name and f"layer.{layer_idx}" in name:
                    layer_idx += 1
            param_groups.append({'params': param, 'lr': lr})
        optimizer = AdamW(
            param_groups,
            weight_decay=best_params['weight_decay'],
            eps=best_params['adam_epsilon'],
            betas=(best_params['adam_beta1'], best_params['adam_beta2'])
        )
    else:
        optimizer = AdamW([
            {'params': model.base_model.parameters() if hasattr(model, 'base_model') else model.parameters(),
             'lr': best_params['lr_backbone']},
            {'params': model.classifier.parameters(), 'lr': best_params['lr_classifier']}
        ], weight_decay=best_params['weight_decay'])

    # Scheduler
    total_steps = len(train_loader) * best_params['epochs'] // best_params['gradient_accumulation_steps']
    warmup_steps = int(total_steps * best_params['warmup_ratio'])
    scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps)
    scaler = GradScaler('cuda')

    # Training loop
    model.train()
    accumulated = 0
    for epoch in range(best_params['epochs']):
        for batch in train_loader:
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            with autocast('cuda'):
                outputs = model(**batch)
                loss = outputs.loss
            scaler.scale(loss / best_params['gradient_accumulation_steps']).backward()
            accumulated += 1
            if accumulated % best_params['gradient_accumulation_steps'] == 0:
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), best_params['max_grad_norm'])
                scaler.step(optimizer)
                scaler.update()
                scheduler.step()
                optimizer.zero_grad()
                accumulated = 0

    # Predictions
    model.eval()

    # Val predictions
    val_preds = []
    with torch.no_grad():
        for batch in val_loader:
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            with autocast('cuda'):
                logits = model(**batch).logits
            val_preds.extend(torch.softmax(logits, dim=1)[:, 1].cpu().numpy())
    val_predictions[model_name] = np.array(val_preds)

    # Test predictions
    test_preds = []
    with torch.no_grad():
        for batch in test_loader:
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            with autocast('cuda'):
                logits = model(**batch).logits
            test_preds.extend(torch.softmax(logits, dim=1)[:, 1].cpu().numpy())
    test_predictions[model_name] = np.array(test_preds)

    print(f"{model_name} - Final training complete\n")

# FINAL ENSEMBLE 
# Average ensemble on held-out val
final_val_prob = np.mean(list(val_predictions.values()), axis=0)
val_acc = accuracy_score(val_data['label'], (final_val_prob > 0.5).astype(int))
print(f"\nFINAL 5-MODEL ENSEMBLE HELD-OUT ACCURACY: {val_acc:.5f}\n")

Starting final training with best Optuna hyperparameters...

Loaded best params for FacebookAI/roberta-base
  - Best CV accuracy: 0.92070
  - Params: {'lr_backbone': 8.901975708567137e-05, 'lr_classifier': 0.00025789896217847557, 'batch_size': 16, 'epochs': 30, 'max_grad_norm': 3.3327717757057753, 'gradient_accumulation_steps': 1, 'warmup_ratio': 0.11060143650747228, 'weight_decay': 0.000146898092643942, 'adam_epsilon': 2.437829476675409e-10, 'adam_beta1': 0.8864108530900816, 'adam_beta2': 0.9802683800470975, 'dropout': 0.46433951303496224, 'layerwise_decay': 0.98, 'label_smoothing': 0.28239493723898756, 'patience': 10}



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at FacebookAI/roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a RobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


FacebookAI/roberta-base - Final training complete

Loaded best params for microsoft/deberta-v3-base
  - Best CV accuracy: 0.87718
  - Params: {'lr_backbone': 1.0917768427382281e-05, 'lr_classifier': 2.2848477363553595e-05, 'batch_size': 8, 'epochs': 20, 'max_grad_norm': 4.444830723642736, 'gradient_accumulation_steps': 8, 'warmup_ratio': 0.004335758500113307, 'weight_decay': 0.08439088300132051, 'adam_epsilon': 2.2359627460340451e-07, 'adam_beta1': 0.9458224791749428, 'adam_beta2': 0.9953379791797905, 'dropout': 0.033516985577914216, 'layerwise_decay': 0.85, 'label_smoothing': 0.17374719473309874, 'patience': 10}



Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['pooler.dense.weight', 'classifier.weight', 'classifier.bias', 'pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


microsoft/deberta-v3-base - Final training complete

Loaded best params for google/electra-base-discriminator
  - Best CV accuracy: 0.90716
  - Params: {'lr_backbone': 1.3061038867920613e-05, 'lr_classifier': 2.410174274243116e-05, 'batch_size': 8, 'epochs': 7, 'max_grad_norm': 1.2528908938212762, 'gradient_accumulation_steps': 1, 'warmup_ratio': 0.05018311190961994, 'weight_decay': 0.04722807273266896, 'adam_epsilon': 8.337249440375688e-10, 'adam_beta1': 0.911707652040623, 'adam_beta2': 0.9987009538095121, 'dropout': 0.21316024816287132, 'layerwise_decay': 0.98, 'label_smoothing': 0.039982154988141956, 'patience': 2}



Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at google/electra-base-discriminator and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a ElectraTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


google/electra-base-discriminator - Final training complete

Loaded best params for distilbert/distilbert-base-uncased
  - Best CV accuracy: 0.89313
  - Params: {'lr_backbone': 3.951470242347571e-05, 'lr_classifier': 0.0006295043533964989, 'batch_size': 16, 'epochs': 8, 'max_grad_norm': 1.2087031969143767, 'gradient_accumulation_steps': 4, 'warmup_ratio': 0.17692978563624223, 'weight_decay': 0.0003824305986479276, 'adam_epsilon': 5.64923071355306e-08, 'adam_beta1': 0.8839395285597531, 'adam_beta2': 0.9808558820874684, 'dropout': 0.1841324994752428, 'layerwise_decay': 0.95, 'label_smoothing': 0.23047962141740042, 'patience': 8}



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


distilbert/distilbert-base-uncased - Final training complete

Loaded best params for martin-ha/toxic-comment-model
  - Best CV accuracy: 0.76838
  - Params: {'lr_backbone': 1.59896870269066e-05, 'lr_classifier': 0.0003553107526483185, 'batch_size': 8, 'epochs': 13, 'max_grad_norm': 2.027937182498757, 'gradient_accumulation_steps': 1, 'warmup_ratio': 0.252973615198735, 'weight_decay': 0.0008570500681034111, 'adam_epsilon': 1.1067762943551042e-10, 'adam_beta1': 0.8824502834432306, 'adam_beta2': 0.9886947574333563, 'dropout': 0.3699323596768875, 'layerwise_decay': 0.95, 'label_smoothing': 0.15041277469644465, 'patience': 6}



You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


martin-ha/toxic-comment-model - Final training complete


FINAL 5-MODEL ENSEMBLE HELD-OUT ACCURACY: 0.81778



## Submission Creation

In [7]:
# Average ensemble on test
final_test_prob = np.mean(list(test_predictions.values()), axis=0)

submission = pd.DataFrame({
    'ID': test_ids,
    'Extremism_Label': ['EXTREMIST' if p > 0.5 else 'NON_EXTREMIST' for p in final_test_prob]
})
submission.to_csv('submission_old_pre.csv', index=False)
print("SUBMISSION SAVED")
print(f"Predictions shape: {final_test_prob.shape} | Test IDs: {len(test_ids)}")

SUBMISSION SAVED
Predictions shape: (750,) | Test IDs: 750


# Conclusion

This notebook presents a complete, high-accuracy solution for the Social Media Extremism Detection Challenge. By combining careful data balancing, automated hyperparameter optimization via Optuna, and a robust 5-model ensemble, the pipeline achieves a superior held-out accuracy of approx. 81.8%.

This work contributes to the broader goal of creating safer digital environments by providing a reproducible, state-of-the-art framework for identifying harmful content. The final submission is optimized for both accuracy and robustness, making it a competitive entry for the final leaderboard.

# References

- [NLPAUG](https://nlpaug.readthedocs.io/en/latest/overview/overview.html)
- [NLPAUG – A Python library to Augment Your Text Data](https://www.analyticsvidhya.com/blog/2021/08/nlpaug-a-python-library-to-augment-your-text-data/)
- [Tokenization HuggingFace](https://huggingface.co/learn/llm-course/en/chapter2/4)
- [Toward a Theory of Tokenization in LLMs](https://arxiv.org/abs/2404.08335)
- [All you need to know about Tokenization in LLMs](https://medium.com/thedeephub/all-you-need-to-know-about-tokenization-in-llms-7a801302cf54)
- [FacebookAI/roberta-base](https://huggingface.co/FacebookAI/roberta-base)
- [microsoft/deberta-v3-base](https://huggingface.co/microsoft/deberta-v3-base)
- [google/electra-base-discriminator](https://huggingface.co/google/electra-base-discriminator)
- [distilbert/distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased)
- [martin-ha/toxic-comment-model](https://huggingface.co/martin-ha/toxic-comment-model)
- [StratifiedKFold sk-learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)
- [Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html)
- [Stratified K Fold Cross Validation](https://www.geeksforgeeks.org/machine-learning/stratified-k-fold-cross-validation/)
- [VotingClassifier sk-learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html)
- [Voting Classifier Greeksforgreeks](https://www.geeksforgeeks.org/machine-learning/voting-classifier/)