# DeBERTa V2 Training with Focal Loss

**Purpose**: Fine-tune DeBERTa V2 XLarge using PyTorch for logical fallacy detection

**Dataset**: FLICC (Fallacy detection dataset with train/val/test splits)

**Method**: Full fine-tuning with Focal Loss (Gamma=4)

---

## Configuration Overview

Based on the paper's findings:
- **Model**: microsoft/deberta-v2-xlarge
- **Learning Rate**: 1e-5 (paper-optimal)
- **Focal Loss Gamma**: 4.0 (critical for performance)
- **Weight Decay**: 0.01
- **Epochs**: 15
- **Batch Size**: 1 (with gradient accumulation of 16 = effective batch 16)
- **Device**: Apple Silicon MPS (Metal Performance Shaders)

## Step 1: Import Required Libraries

In [2]:
# Import libraries
import os
os.environ['PYTORCH_MPS_HIGH_WATERMARK_RATIO'] = '0.0'

# Core PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F

# HuggingFace Transformers
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
    DataCollatorWithPadding,
    EarlyStoppingCallback
)

# Data handling
from datasets import Dataset, DatasetDict, load_dataset
import pandas as pd
import numpy as np
from transformers import DebertaV2Tokenizer

# Metrics
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay
)
import matplotlib.pyplot as plt

# Utilities
import json
from pathlib import Path
from dataclasses import dataclass
from typing import Dict, List, Optional
import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")

# Check device availability
if torch.backends.mps.is_available():
    device = torch.device("mps")
    print(f"Using Apple Silicon GPU (MPS)")
elif torch.cuda.is_available():
    device = torch.device("cuda")
    print(f"Using CUDA GPU: {torch.cuda.get_device_name(0)}")
else:
    device = torch.device("cpu")
    print("Using CPU")

print(f"PyTorch version: {torch.__version__}")

All libraries imported successfully!
Using Apple Silicon GPU (MPS)
PyTorch version: 2.9.1


## Step 2: Training Configuration

### Paper-Validated Parameters

Based on research findings:
- **Focal Loss Gamma 4.0**: Critical for handling class imbalance
- **Learning Rate 1e-5**: Validated as optimal
- **15 Epochs**: Full convergence duration
- **Weight Decay 0.01**: Standard regularization

In [None]:
@dataclass
class TrainingConfig:
    """
    Training configuration based on paper's best parameters.
    """
    
    # Model Configuration
    model_name: str = "microsoft/deberta-v2-xlarge"
    
    # Training Hyperparameters (Paper's Best)
    learning_rate: float = 1.0e-5        # Paper-validated optimal
    weight_decay: float = 0.01           # Standard regularization
    num_epochs: int = 15                 # Paper value (full convergence); reduce to 5 on a Mac M1 Max
    batch_size: int = 1                  # Small for stability
    gradient_accumulation_steps: int = 16 # Effective batch = 16
    
    # Focal Loss Configuration
    focal_gamma: float = 4.0             # Critical parameter (paper finding)
    
    # Data Paths
    train_data_path: str = "Data/fallacy_train.csv"
    val_data_path: str = "Data/fallacy_val.csv"
    test_data_path: str = "Data/fallacy_test.csv"
    
    # Output Configuration
    output_dir: str = "./output/deberta_flicc"
    
    # Training Options
    max_seq_length: int = 512            # Maximum sequence length
    warmup_ratio: float = 0.1            # 10% warmup
    seed: int = 42                       # Random seed
    logging_steps: int = 10              # Log every N steps
    save_strategy: str = "epoch"         # Save after each epoch
    evaluation_strategy: str = "epoch"   # Evaluate after each epoch
    
    # Early Stopping
    early_stopping_patience: int = 3     # Stop if no improvement for 3 epochs
    metric_for_best_model: str = "f1"    # Use F1 score for best model
    

# Initialize configuration
config = TrainingConfig()

# Display configuration
print("="*70)
print("TRAINING CONFIGURATION (Paper Parameters)")
print("="*70)
print(f"\nModel: {config.model_name}")

print(f"\nTraining Hyperparameters:")
print(f"  Learning Rate: {config.learning_rate} (paper-optimal)")
print(f"  Weight Decay: {config.weight_decay}")
print(f"  Epochs: {config.num_epochs}")
print(f"  Batch Size: {config.batch_size}")
print(f"  Gradient Accumulation: {config.gradient_accumulation_steps}")
print(f"  Effective Batch Size: {config.batch_size * config.gradient_accumulation_steps}")

print(f"\nFocal Loss Configuration:")
print(f"  Gamma: {config.focal_gamma} (critical for performance)")

print(f"\nData Splits:")
print(f"  Training: {config.train_data_path}")
print(f"  Validation: {config.val_data_path}")
print(f"  Test: {config.test_data_path}")

print(f"\nOutput Directory: {config.output_dir}")
print("="*70)

TRAINING CONFIGURATION (Paper Parameters)

Model: microsoft/deberta-v2-xlarge

Training Hyperparameters:
  Learning Rate: 1e-05 (paper-optimal)
  Weight Decay: 0.01
  Epochs: 15
  Batch Size: 1
  Gradient Accumulation: 16
  Effective Batch Size: 16

Focal Loss Configuration:
  Gamma: 4 (critical for performance)

Data Splits:
  Training: Data/fallacy_train.csv
  Validation: Data/fallacy_val.csv
  Test: Data/fallacy_test.csv

Output Directory: ./output/deberta_flicc
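
The effective batch size printed above is the product of the per-device batch size and the number of accumulation steps. A minimal sketch of that arithmetic follows. The training-set size of 1,796 is taken from Step 3, and the per-epoch update count is only an estimate, since the handling of the final partial accumulation window can differ between `transformers` versions.

```python
import math

# Values from TrainingConfig
batch_size = 1
gradient_accumulation_steps = 16
num_epochs = 15
train_samples = 1796  # size of the training split reported in Step 3

# One optimizer update consumes this many examples
effective_batch_size = batch_size * gradient_accumulation_steps

# Forward/backward passes per epoch (one per micro-batch)
micro_batches_per_epoch = math.ceil(train_samples / batch_size)

# Estimated optimizer updates per epoch, counting a final partial window as one update
# (assumption: the exact count depends on the Trainer version)
updates_per_epoch = math.ceil(micro_batches_per_epoch / gradient_accumulation_steps)

print(f"Effective batch size: {effective_batch_size}")
print(f"Micro-batches per epoch: {micro_batches_per_epoch}")
print(f"Estimated updates per epoch: {updates_per_epoch}")
print(f"Estimated total updates: {updates_per_epoch * num_epochs}")
```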


## Step 3: Load Data

Load all three data splits and prepare label mappings.

In [3]:
# Load datasets using memory-efficient method (zero-copy)
print("="*70)
print("LOADING DATA (Memory-Efficient Mode)")
print("="*70)

data_files = {
    "train": config.train_data_path,
    "validation": config.val_data_path,
    "test": config.test_data_path
}

print("\nLoading datasets from CSV files...")
dataset_dict = load_dataset("csv", data_files=data_files)

# Get unique labels from training set
unique_labels = sorted(list(set(dataset_dict["train"]["label"])))
label2id = {label: idx for idx, label in enumerate(unique_labels)}
id2label = {idx: label for label, idx in label2id.items()}

# Map labels to IDs
def map_labels(example):
    example["labels"] = label2id[example["label"]]
    return example

print("Mapping labels...")
dataset_dict = dataset_dict.map(map_labels)

# Display info
print(f"\n" + "="*70)
print("DATA LOADING SUMMARY")
print("="*70)
print(f"\nTraining samples:   {len(dataset_dict['train']):,}")
print(f"Validation samples: {len(dataset_dict['validation']):,}")
print(f"Test samples:       {len(dataset_dict['test']):,}")
print(f"\nNumber of classes: {len(unique_labels)}")
print(f"\nLabel to ID mapping:")
for label, idx in sorted(label2id.items(), key=lambda x: x[1]):
    print(f"  {idx}: {label}")
print("="*70)

LOADING DATA (Memory-Efficient Mode)

Loading datasets from CSV files...
Mapping labels...

DATA LOADING SUMMARY

Training samples:   1,796
Validation samples: 457
Test samples:       256

Number of classes: 12

Label to ID mapping:
  0: ad hominem
  1: anecdote
  2: cherry picking
  3: conspiracy theory
  4: fake experts
  5: false choice
  6: false equivalence
  7: impossible expectations
  8: misrepresentation
  9: oversimplification
  10: single cause
  11: slothful induction
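
Because `label2id` is built from the training split only, `map_labels` would raise a `KeyError` if the validation or test split contained a label that never appears in training. A minimal sketch of a pre-check follows, using small hypothetical label lists in place of the real CSV columns; `build_label_maps` and `unseen_labels` are illustrative names, not part of the notebook.

```python
def build_label_maps(train_labels):
    """Build sorted label <-> id mappings, mirroring the cell above."""
    unique = sorted(set(train_labels))
    label2id = {label: idx for idx, label in enumerate(unique)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


def unseen_labels(split_labels, label2id):
    """Return labels in a split that have no ID in the mapping."""
    return sorted(set(split_labels) - set(label2id))


# Hypothetical data for illustration only
train = ["anecdote", "ad hominem", "single cause", "anecdote"]
val = ["ad hominem", "false choice"]

label2id, id2label = build_label_maps(train)
print(label2id)
print(unseen_labels(val, label2id))
```

An empty list from `unseen_labels` for both the validation and test splits means the mapping step is safe to run.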


## Step 5: Load Tokenizer and Tokenize Data

In [4]:
print(f"\nLoading tokenizer: {config.model_name}")
tokenizer = DebertaV2Tokenizer.from_pretrained(config.model_name)
print("Tokenizer loaded!")


def tokenize_function(examples):
    """
    Tokenize text examples.
    
    Args:
        examples: Batch of examples with 'text' field
        
    Returns:
        Tokenized examples
    """
    return tokenizer(
        examples['text'],
        truncation=True,
        max_length=config.max_seq_length,
        padding=False  # Dynamic padding handled by data collator
    )


print("\nTokenizing datasets...")
tokenized_datasets = dataset_dict.map(
    tokenize_function,
    batched=True,
    desc="Tokenizing"
)

# Remove columns that are not needed for training (keeps only model inputs + labels)
columns_to_remove = ['text', 'label', 'Claim', 'Source']
tokenized_datasets = tokenized_datasets.remove_columns(columns_to_remove)

print("\nTokenization complete!")
print("\nColumns kept for training:", tokenized_datasets['train'].column_names)
print(tokenized_datasets)


Loading tokenizer: microsoft/deberta-v2-xlarge
Tokenizer loaded!

Tokenizing datasets...




Tokenization complete!

Columns kept for training: ['labels', 'input_ids', 'token_type_ids', 'attention_mask']
DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1796
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 457
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 256
    })
})
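
With `padding=False` at tokenization time, sequences keep their own lengths and are padded only when a batch is assembled. The sketch below imitates that step in plain Python to show the idea. It is not the `DataCollatorWithPadding` implementation, and `pad_batch` and the token IDs are hypothetical.

```python
from typing import Dict, List


def pad_batch(batch: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in this batch only."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs
print(pad_batch([[1, 57, 92, 2], [1, 8, 2]]))  # second sequence gets one pad token
print(pad_batch([[1, 8, 2]]))                  # a batch of one needs no padding
```

With a per-device batch size of 1, as configured here, each batch holds a single sequence, so no padding tokens are added during training.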


## Step 6: Load Model

In [5]:
print(f"\nLoading model: {config.model_name}")
print("This may take several minutes...\n")

model = AutoModelForSequenceClassification.from_pretrained(
    config.model_name,
    num_labels=len(unique_labels),
    id2label=id2label,
    label2id=label2id,
    problem_type="single_label_classification"
)

# Move model to device
model = model.to(device)
# Disable gradient checkpointing
model.gradient_checkpointing_disable()

print("Model loaded successfully!")
print(f"\nModel parameters: {model.num_parameters():,}")
print(f"Device: {device}")


Loading model: microsoft/deberta-v2-xlarge
This may take several minutes...



Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v2-xlarge and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model loaded successfully!

Model parameters: 886,972,428
Device: mps


## Step 7: Define Focal Loss Trainer

Custom trainer implementing Focal Loss with Gamma=4.

**Focal Loss**: Addresses class imbalance by down-weighting easy examples.

Formula: `FL(pt) = -(1 - pt)^γ * log(pt)`

Where:
- `pt` is the probability of the correct class
- `γ` (gamma) controls the down-weighting factor
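
To get a feel for how strongly γ = 4 down-weights easy examples, the modulating factor `(1 - pt)^γ` can be evaluated for a few probabilities. This is a small illustrative sketch; the probabilities are hypothetical, not values from the FLICC data, and `focal_term` is an illustrative name.

```python
import math


def focal_term(pt: float, gamma: float) -> float:
    """Focal loss for one example: -(1 - pt)^gamma * log(pt)."""
    return -((1.0 - pt) ** gamma) * math.log(pt)


gamma = 4.0
for pt in (0.9, 0.5, 0.1):  # hypothetical true-class probabilities
    weight = (1.0 - pt) ** gamma  # modulating factor
    ce = -math.log(pt)            # plain cross-entropy
    print(f"pt={pt:.1f}  weight={weight:.4f}  CE={ce:.4f}  focal={focal_term(pt, gamma):.4f}")
```

With γ = 4, an example already classified at `pt = 0.9` is scaled by a factor of 0.0001, while a hard example at `pt = 0.1` keeps a factor of about 0.66.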

In [6]:
class FocalLossTrainer(Trainer):
    """
    Custom Trainer with Focal Loss.
    
    Implements focal loss to handle class imbalance as per paper's findings.
    """
    
    def __init__(self, focal_gamma: float = 4.0, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.focal_gamma = focal_gamma
        print(f"\nUsing Focal Loss with Gamma = {self.focal_gamma}")
    
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        """
        Compute focal loss instead of standard cross-entropy.
        
        Args:
            model: The model being trained
            inputs: Input batch
            return_outputs: Whether to return model outputs
            num_items_in_batch: Passed by recent Trainer versions; unused here
            
        Returns:
            Loss value (and outputs if requested)
        """
        labels = inputs.pop("labels")
        
        # Forward pass
        outputs = model(**inputs)
        logits = outputs.logits
        
        # Compute focal loss
        ce_loss = F.cross_entropy(logits, labels, reduction='none')
        pt = torch.exp(-ce_loss)  # Probability of true class
        focal_loss = ((1 - pt) ** self.focal_gamma * ce_loss).mean()
        
        return (focal_loss, outputs) if return_outputs else focal_loss


print("Focal Loss Trainer defined!")
print(f"Gamma parameter: {config.focal_gamma}")

Focal Loss Trainer defined!
Gamma parameter: 4
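
The trainer recovers `pt` from the per-example cross-entropy via `pt = exp(-CE)`, which holds because cross-entropy is `-log(pt)`. A plain-Python sketch of the same computation on hypothetical logits can help when checking that identity by hand. `softmax` and `focal_from_logits` are illustrative helpers, not part of the notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_from_logits(logits, label, gamma=4.0):
    """Focal loss for one example, following the steps in compute_loss."""
    ce = -math.log(softmax(logits)[label])  # per-example cross-entropy
    pt = math.exp(-ce)                      # probability of the true class
    return (1.0 - pt) ** gamma * ce


logits = [2.0, 0.5, -1.0]  # hypothetical 3-class logits
print(focal_from_logits(logits, label=0))             # true class has the top score
print(focal_from_logits(logits, label=2))             # true class has the lowest score
print(focal_from_logits(logits, label=0, gamma=0.0))  # gamma = 0 reduces to cross-entropy
```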


## Step 8: Define Metrics

In [7]:
def compute_metrics(eval_pred):
    """
    Compute evaluation metrics.
    
    Args:
        eval_pred: Tuple of (predictions, labels)
        
    Returns:
        Dictionary of metrics
    """
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    # Calculate metrics
    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, 
        predictions, 
        average='weighted',
        zero_division=0
    )
    
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }


print("Metrics function defined!")

Metrics function defined!
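
`average='weighted'` weights each class's score by its support, so frequent classes dominate the reported F1. The sketch below computes per-class F1 by hand on hypothetical labels and compares the weighted and macro averages. It is plain Python for illustration, not a replacement for `precision_recall_fscore_support`, and `per_class_f1` is an illustrative name.

```python
def per_class_f1(y_true, y_pred):
    """Return {class: (f1, support)} computed from raw label lists."""
    scores = {}
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = (f1, tp + fn)
    return scores


# Hypothetical imbalanced labels: class 0 is frequent, class 1 is rare
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

scores = per_class_f1(y_true, y_pred)
total = sum(support for _, support in scores.values())
weighted_f1 = sum(f1 * support for f1, support in scores.values()) / total
macro_f1 = sum(f1 for f1, _ in scores.values()) / len(scores)
print(f"weighted F1: {weighted_f1:.4f}, macro F1: {macro_f1:.4f}")
```

On imbalanced data the weighted average can sit well above the macro average, so reporting both gives a fuller picture of how the rare fallacy classes are handled.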


## Step 9: Setup Training Arguments

In [8]:
# Create output directory
Path(config.output_dir).mkdir(parents=True, exist_ok=True)

# Define training arguments
training_args = TrainingArguments(
    # Output
    output_dir=config.output_dir,
    
    # Training hyperparameters
    learning_rate=config.learning_rate,
    weight_decay=config.weight_decay,
    num_train_epochs=config.num_epochs,
    per_device_train_batch_size=config.batch_size,
    per_device_eval_batch_size=config.batch_size,
    gradient_accumulation_steps=config.gradient_accumulation_steps,
    max_grad_norm=1.0,
    
    # Learning rate schedule
    warmup_ratio=config.warmup_ratio,
    lr_scheduler_type="cosine",
    
    # Evaluation and saving
    eval_strategy=config.evaluation_strategy,  # `eval_strategy` replaces the older `evaluation_strategy` argument name
    save_strategy=config.save_strategy,
    load_best_model_at_end=True,
    metric_for_best_model=config.metric_for_best_model,
    greater_is_better=True,
    
    # Logging
    logging_dir=f"{config.output_dir}/logs",
    logging_steps=config.logging_steps,
    report_to=["tensorboard"],
    
    # Device configuration
    use_mps_device=(device.type == "mps"),
    
    # Reproducibility
    seed=config.seed,
    
    # Memory optimization
    fp16=False,  # MPS works better with FP32
    gradient_checkpointing=False,  # Can enable if OOM
    
    # Save settings
    save_total_limit=3,  # Keep at most 3 checkpoints on disk
)

# Display training configuration
print("\n" + "="*70)
print("TRAINING ARGUMENTS")
print("="*70)
print(f"\nOutput directory: {training_args.output_dir}")
print(f"\nTraining:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size (per device): {training_args.per_device_train_batch_size}")
print(f"  Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Weight decay: {training_args.weight_decay}")
print(f"  Warmup ratio: {training_args.warmup_ratio}")
print(f"\nEvaluation:")
print(f"  Strategy: {training_args.eval_strategy}")
print(f"  Metric for best model: {training_args.metric_for_best_model}")
print(f"\nDevice: {device}")
print("="*70)


TRAINING ARGUMENTS

Output directory: ./output/deberta_flicc

Training:
  Epochs: 15
  Batch size (per device): 1
  Gradient accumulation: 16
  Effective batch size: 16
  Learning rate: 1e-05
  Weight decay: 0.01
  Warmup ratio: 0.1

Evaluation:
  Strategy: IntervalStrategy.EPOCH
  Metric for best model: f1

Device: mps
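
`warmup_ratio=0.1` combined with `lr_scheduler_type="cosine"` ramps the learning rate linearly from zero to its peak over the first 10% of updates, then decays it along a half cosine. A minimal sketch of that shape follows, assuming a total of 1,695 updates (an estimate of 113 per epoch over 15 epochs, not a value reported by the Trainer) and a hypothetical helper `lr_at`.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup followed by half-cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1695                           # assumed: 113 updates x 15 epochs
warmup_steps = math.ceil(0.1 * total_steps)  # 10% warmup
peak_lr = 1e-5

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {lr_at(step, total_steps, warmup_steps, peak_lr):.2e}")
```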


## Step 10: Initialize Trainer and Start Training

**Expected Duration**: Full fine-tuning of DeBERTa V2 XLarge for 15 epochs will take many hours (potentially 10-20 hours on Apple Silicon).

The model will be evaluated after each epoch and the best model will be saved based on F1 score.

In [9]:
# Data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Initialize trainer
trainer = FocalLossTrainer(
    focal_gamma=config.focal_gamma,
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=config.early_stopping_patience)]
)

print("\n" + "="*70)
print("STARTING TRAINING")
print("="*70)
print(f"\nModel: {config.model_name}")
print(f"Training samples: {len(tokenized_datasets['train']):,}")
print(f"Validation samples: {len(tokenized_datasets['validation']):,}")
print(f"\nThis will take many hours. Monitor the progress below.")
print(f"\nTensorBoard logs: {config.output_dir}/logs")
print(f"To view: tensorboard --logdir {config.output_dir}/logs")
print("\n" + "="*70 + "\n")

# Clear MPS cache if available
if torch.backends.mps.is_available():
    torch.mps.empty_cache()
# Start training
train_result = trainer.train()

# Save the final model
print("\n" + "="*70)
print("TRAINING COMPLETE")
print("="*70)
print(f"\nSaving final model to: {config.output_dir}/final_model")
trainer.save_model(f"{config.output_dir}/final_model")
tokenizer.save_pretrained(f"{config.output_dir}/final_model")
print("Model saved!")

# Display training metrics
print("\nTraining Metrics:")
print(f"  Training Loss: {train_result.training_loss:.4f}")
print(f"  Training Steps: {train_result.global_step}")
print("="*70)

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'eos_token_id': 2, 'bos_token_id': 1}.



Using Focal Loss with Gamma = 4

STARTING TRAINING

Model: microsoft/deberta-v2-xlarge
Training samples: 1,796
Validation samples: 457

This will take many hours. Monitor the progress below.

TensorBoard logs: ./output/deberta_flicc/logs
To view: tensorboard --logdir ./output/deberta_flicc/logs




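| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|-------|---------------|-----------------|----------|-----------|--------|----|
| 1 | 1.1773 | 1.104082 | 0.448578 | 0.438608 | 0.448578 | 0.390099 |
| 2 | 0.5536 | 0.511075 | 0.691466 | 0.723315 | 0.691466 | 0.691748 |
| 3 | 0.2761 | 0.450313 | 0.726477 | 0.754069 | 0.726477 | 0.72531 |
| 4 | 0.1543 | 0.535573 | 0.715536 | 0.75987 | 0.715536 | 0.720238 |
| 5 | 0.0339 | 0.526338 | 0.726477 | 0.747029 | 0.726477 | 0.727389 |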


KeyboardInterrupt: 
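
The logged metrics can be used to see which epoch each selection rule would pick and whether early stopping would have fired. The sketch below uses the five logged epochs above. `patience_counter` is a hypothetical helper that mimics patience-based stopping on F1 in simplified form; it is not the `EarlyStoppingCallback` implementation.

```python
# (epoch, validation loss, F1) copied from the training log above
history = [
    (1, 1.104082, 0.390099),
    (2, 0.511075, 0.691748),
    (3, 0.450313, 0.725310),
    (4, 0.535573, 0.720238),
    (5, 0.526338, 0.727389),
]

best_by_f1 = max(history, key=lambda row: row[2])
best_by_loss = min(history, key=lambda row: row[1])
print(f"Best F1:   epoch {best_by_f1[0]} ({best_by_f1[2]:.4f})")
print(f"Best loss: epoch {best_by_loss[0]} ({best_by_loss[1]:.4f})")


def patience_counter(f1_scores, patience=3):
    """Return the epoch at which training would stop, or None if it never does."""
    best, bad_epochs = float("-inf"), 0
    for epoch, f1 in enumerate(f1_scores, start=1):
        if f1 > best:
            best, bad_epochs = f1, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


print(patience_counter([row[2] for row in history]))
```

On these numbers, F1 peaks at epoch 5 while validation loss is lowest at epoch 3, and the patience of 3 is never exhausted.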

## Step 10.5: Export Best Model

In [10]:
# 1. Point to the checkpoint to export (Epoch 3 had the lowest validation loss)
# Replace 'checkpoint-452' with the actual folder name for that epoch
checkpoint_path = "./output/deberta_flicc/checkpoint-452"

# 2. Load it
print(f"Loading best model from {checkpoint_path}...")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint_path)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, use_fast=False)

# 3. Save it to a permanent location
final_path = "./best_model"
print(f"Saving to {final_path}...")
model.save_pretrained(final_path)
tokenizer.save_pretrained(final_path)

print("Success! You can now safely delete the 'output' folder.")

Loading best model from ./output/deberta_flicc/checkpoint-452...
Saving to ./best_model...
Success! You can now safely delete the 'output' folder.


## Step 11: Validation Set Evaluation

In [12]:
print("\n" + "="*70)
print("VALIDATION SET EVALUATION")
print("="*70)

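# Evaluate on validation set
# Note: `trainer` still holds the in-memory model from the interrupted run,
# not the checkpoint reloaded into `model` in Step 10.5
val_results = trainer.evaluate(eval_dataset=tokenized_datasets['validation'])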

print("\nValidation Results:")
for key, value in val_results.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")

# Get predictions for detailed analysis
val_predictions = trainer.predict(tokenized_datasets['validation'])
val_preds = np.argmax(val_predictions.predictions, axis=1)
val_labels = val_predictions.label_ids

# Print classification report
print("\nDetailed Classification Report (Validation):")
print(classification_report(
    val_labels,
    val_preds,
    target_names=[id2label[i] for i in range(len(id2label))],
    zero_division=0
))

print("="*70)


VALIDATION SET EVALUATION

Validation Results:
  eval_loss: 0.5263
  eval_accuracy: 0.7265
  eval_precision: 0.7470
  eval_recall: 0.7265
  eval_f1: 0.7274

Detailed Classification Report (Validation):
                         precision    recall  f1-score   support

             ad hominem       0.70      0.78      0.74        67
               anecdote       0.88      0.88      0.88        43
         cherry picking       0.59      0.79      0.68        56
      conspiracy theory       0.88      0.72      0.79        39
           fake experts       0.83      0.83      0.83        12
           false choice       0.43      0.77      0.56        13
      false equivalence       0.57      0.29      0.38        14
impossible expectations       0.63      0.73      0.68        37
      misrepresentation       0.75      0.63      0.69        38
     oversimplification       0.96      0.61      0.75        36
           single cause       0.81      0.77      0.79        57
     slothful in

## Training Complete!

### Summary

DeBERTa V2 XLarge was fine-tuned with Focal Loss (Gamma=4) for fallacy detection. The run shown here was interrupted manually after 5 of the 15 configured epochs.

### Output Files

Located in `./output/deberta_flicc/` unless noted otherwise:
- `./best_model/` - Fine-tuned model and tokenizer (exported in Step 10.5)
- `training_report.json` - Complete training report
- `logs/` - TensorBoard logs
- `checkpoint-*` directories - Model checkpoints

### Performance Metrics

- **Validation Results**: See Step 11
- **Test Results**: See `evaluate_model.ipynb`

### Next Steps

1. Run the `demo/` artefact in a terminal: `python fallacy_detector_tui.py`