# MARBERT v2 10-Fold Cross-Validation Ensemble Training

## Overview
This notebook trains 10 MARBERT v2 models using 10-fold cross-validation and combines them into an ensemble classifier for final predictions on the dev set.

## Ensemble Strategy
- **Models:** 10 MARBERT v2 models (one per fold)
- **Ensemble Method:** Soft voting (averaging predicted probabilities)
- **Training:** Each model trained on 9/10 of the data
- **Validation:** Each model evaluated on held-out 1/10 fold
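
The soft-voting rule listed above can be sketched on a toy example. The probabilities below are hypothetical and chosen only to show that soft voting and majority (hard) voting can disagree; hard voting is included for contrast and is not used in this notebook.

```python
import numpy as np

# Hypothetical class probabilities, shape (n_models, n_samples, n_classes).
probs = np.array([
    [[0.45, 0.55], [0.90, 0.10]],   # model 1
    [[0.40, 0.60], [0.80, 0.20]],   # model 2
    [[0.95, 0.05], [0.70, 0.30]],   # model 3
])

# Soft voting: average the probabilities over models, then take the argmax.
soft_vote = probs.mean(axis=0).argmax(axis=1)

# Hard voting, for contrast: each model votes with its own argmax.
votes = probs.argmax(axis=2)                         # shape (n_models, n_samples)
hard_vote = (votes.mean(axis=0) > 0.5).astype(int)   # majority in the binary case

print("soft:", soft_vote)
print("hard:", hard_vote)
```

On the first sample two models lean weakly toward class 1 while one is strongly confident in class 0, so the averaged probability favours class 0 even though the majority vote picks class 1.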

## Configuration
- **Model:** MARBERT v2 (UBC-NLP/MARBERTv2)
- **Preprocessing:** Basic (character normalization, diacritic removal, tatweel removal)
- **Folds:** 10
- **Epochs per model:** 4
- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Warmup Steps:** 500
- **Weight Decay:** 0.01
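
Because the warmup is given as an absolute step count, its share of training depends on the dataset size. The sketch below estimates the optimizer steps per fold for the settings above, assuming a single device and no gradient accumulation. The sample count is hypothetical and `total_optimizer_steps` is an illustrative helper, not part of the notebook.

```python
import math

def total_optimizer_steps(n_samples, n_folds=10, batch_size=16, epochs=4):
    """Approximate optimizer steps per fold (single device, no gradient accumulation)."""
    train_size = n_samples - n_samples // n_folds   # roughly 9/10 of the data
    steps_per_epoch = math.ceil(train_size / batch_size)
    return steps_per_epoch * epochs

n_samples = 4000  # hypothetical; use len(train_df) in the notebook
steps = total_optimizer_steps(n_samples)
warmup_steps = 500
print(f"total steps: {steps}, warmup share: {warmup_steps / steps:.0%}")
```

If the warmup share turns out to be large, or the warmup exceeds the total step count on a small dataset, the learning rate spends little or no time at its peak value, which may be worth checking before a long run.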

## Memory Management (16GB VRAM Safe!)
- **Strategy:** Train one model at a time, save to disk, clear GPU memory
- **Peak VRAM:** ~4-5GB per model during training (well within 16GB limit)
- **Inference:** Load one model at a time for predictions (~2-3GB)
- **Disk Space:** ~2GB per model × 10 = ~20GB for the final models, plus extra space for the per-epoch checkpoints that `Trainer` writes to `./results_fold_*`

## Expected Benefits
- Reduced variance through model averaging
- More robust predictions
- Better generalization to unseen data
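
How much variance averaging removes depends on how correlated the models are. For estimators with equal variance and equal pairwise correlation, the variance of their mean has a standard closed form, sketched below. The function name and the correlation values are illustrative assumptions; the fold models here share most of their training data, so their errors are likely to be correlated.

```python
def ensemble_variance(sigma2, rho, n_models):
    """Variance of the mean of n_models estimators that each have variance
    sigma2 and share the same pairwise correlation rho."""
    return sigma2 * (rho + (1.0 - rho) / n_models)

# Illustrative correlations: independent, moderately and highly correlated models.
for rho in (0.0, 0.5, 0.9):
    print(f"rho={rho}: variance of 10-model average = {ensemble_variance(1.0, rho, 10):.3f}")
```

With uncorrelated models the variance drops by a factor of 10, while at high correlation most of the single-model variance remains.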

## 1. Setup & Imports

In [None]:
!pip install transformers datasets torch scikit-learn -q

In [None]:
import pandas as pd
import numpy as np
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from datasets import Dataset
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")

## 2. Load Preprocessed Data

**Training data:** `../train/arb_clean_basic.csv` with columns: `id`, `text`, `polarization`  
**Dev data:** `../dev/arb_clean.csv` with columns: `id`, `text_clean`

In [None]:
# Load full training data
train_df = pd.read_csv('../train/arb_clean_basic.csv')

print(f"Training set size: {len(train_df)}")
print(f"Columns: {train_df.columns.tolist()}")
print(f"\nClass distribution:")
print(train_df['polarization'].value_counts())
print(f"\nClass balance:")
print(train_df['polarization'].value_counts(normalize=True))
print(f"\nSample training data:")
print(train_df.head(3))

In [None]:
# Load preprocessed dev data
dev_df = pd.read_csv('../dev/arb_clean.csv')

print(f"Dev set size: {len(dev_df)}")
print(f"Columns: {dev_df.columns.tolist()}")
print(f"\nSample dev data:")
print(dev_df.head(3))

## 3. Setup Tokenizer

In [None]:
# Load tokenizer
model_name = "UBC-NLP/MARBERTv2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Tokenizer loaded: {model_name}")
print(f"Vocab size: {tokenizer.vocab_size}")

## 4. Tokenization Functions

In [None]:
# Tokenization function for training data
def tokenize_function_train(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128
    )

# Tokenization function for dev data
def tokenize_function_dev(examples):
    return tokenizer(
        examples['text_clean'],
        padding='max_length',
        truncation=True,
        max_length=128
    )

print("✓ Tokenization functions defined")

## 5. Prepare Dev Dataset (tokenize once)

In [None]:
# Prepare dev dataset
dev_dataset = Dataset.from_pandas(dev_df[['text_clean']])

print("Tokenizing dev data...")
dev_dataset_tokenized = dev_dataset.map(tokenize_function_dev, batched=True)
dev_dataset_tokenized.set_format('torch', columns=['input_ids', 'attention_mask'])

print(f"✓ Dev dataset tokenized: {len(dev_dataset_tokenized)} samples")

## 6. Setup 10-Fold Cross-Validation

In [None]:
# Setup stratified k-fold
RANDOM_SEED = 42
N_FOLDS = 10

skf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=RANDOM_SEED)

print(f"✓ 10-Fold Cross-Validation configured")
print(f"  Random Seed: {RANDOM_SEED}")
print(f"  Number of Folds: {N_FOLDS}")
print(f"  Stratification: Enabled")
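
To illustrate what stratification buys, here is a simplified NumPy stand-in that deals the shuffled indices of each class round-robin into folds. It is not scikit-learn's `StratifiedKFold` implementation; `stratified_fold_ids` and the 70/30 toy labels are assumptions made for the example.

```python
import numpy as np

def stratified_fold_ids(labels, n_folds, seed=42):
    """Assign a fold id to every sample so each fold keeps the class ratio."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)   # positions of this class
        rng.shuffle(idx)                      # shuffle within the class
        fold_ids[idx] = np.arange(len(idx)) % n_folds
    return fold_ids

labels = np.array([0] * 70 + [1] * 30)        # toy data with a 70/30 class split
fold_ids = stratified_fold_ids(labels, n_folds=10)
for k in range(10):
    val = labels[fold_ids == k]
    print(f"fold {k}: size={len(val)}, share of class 1={val.mean():.2f}")
```

Every validation fold keeps the 70/30 class split of the full toy set, which is the property the per-fold class distributions printed during training let you verify.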

## 7. Training Configuration

In [None]:
# Training hyperparameters (best configuration from finetuning)
training_config = {
    'num_train_epochs': 4,
    'per_device_train_batch_size': 16,
    'per_device_eval_batch_size': 32,
    'learning_rate': 2e-5,
    'weight_decay': 0.01,
    'warmup_steps': 500,
    'logging_steps': 50,
    'save_strategy': 'epoch',
    'fp16': torch.cuda.is_available(),
    'seed': RANDOM_SEED
}

print("Training Configuration:")
for key, value in training_config.items():
    print(f"  {key}: {value}")

## 8. Helper Functions

In [None]:
def compute_metrics(eval_pred):
    """Compute metrics for evaluation"""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    accuracy = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions, average='weighted')
    
    return {
        'accuracy': accuracy,
        'f1': f1
    }

def plot_confusion_matrix(y_true, y_pred, title="Confusion Matrix"):
    """Plot confusion matrix"""
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
    plt.title(title)
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    plt.show()

print("✓ Helper functions defined")
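
`compute_metrics` reports the support-weighted F1, which can look healthy on imbalanced data even when the minority class is missed entirely. The toy comparison below computes macro and weighted F1 by hand in NumPy; the labels are hypothetical and `f1_per_class` is an illustrative helper rather than a scikit-learn function.

```python
import numpy as np

def f1_per_class(y_true, y_pred, cls):
    """F1 for one class, computed as 2*TP / (2*TP + FP + FN)."""
    tp = np.sum((y_true == cls) & (y_pred == cls))
    fp = np.sum((y_true != cls) & (y_pred == cls))
    fn = np.sum((y_true == cls) & (y_pred != cls))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.zeros(10, dtype=int)          # predicts the majority class only

classes = np.unique(y_true)
f1s = np.array([f1_per_class(y_true, y_pred, c) for c in classes])
support = np.array([(y_true == c).sum() for c in classes])

macro_f1 = f1s.mean()                             # every class counts equally
weighted_f1 = np.average(f1s, weights=support)    # classes weighted by frequency
print(f"macro: {macro_f1:.3f}, weighted: {weighted_f1:.3f}")
```

If the task is scored with macro F1, it may be worth logging that metric alongside the weighted one when selecting the best checkpoint.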

## 9. Train 10-Fold Models

This will train 10 models, one for each fold. Each model is trained on 90% of the data and validated on the remaining 10%.

**Memory Management:** Models are saved to disk and cleared from GPU memory after training to avoid OOM errors with 16GB VRAM.
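
The cleanup step can be wrapped in a small helper. This is a sketch under the assumption that all references to the model and trainer have already been deleted by the caller; `free_gpu_memory` is a hypothetical name, and the helper falls back to plain garbage collection when PyTorch is not installed.

```python
import gc

def free_gpu_memory():
    """Collect unreachable objects first, then release cached CUDA blocks.

    Returns the number of objects found by the garbage collector.
    """
    collected = gc.collect()
    try:
        import torch
    except ImportError:
        return collected          # nothing GPU-related to release
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # hand cached, now-unused blocks back to the driver
    return collected

# Typical use after a fold: `del model, trainer`, then call the helper.
print(f"objects collected: {free_gpu_memory()}")
```

Running the collector before `torch.cuda.empty_cache()` matters because the cache can only release blocks whose tensors have already been freed.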

In [None]:
import os
import gc

# Storage for model paths and results (NOT storing models in memory)
model_paths = []
fold_results = []

# Create directory for saved models
os.makedirs('./saved_models', exist_ok=True)

# Data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Get indices for stratified k-fold
X = train_df['text'].values
y = train_df['polarization'].values

print("="*80)
print("STARTING 10-FOLD CROSS-VALIDATION TRAINING")
print("="*80)

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), 1):
    print(f"\n{'='*80}")
    print(f"FOLD {fold}/{N_FOLDS}")
    print(f"{'='*80}")
    
    # Split data
    fold_train_df = train_df.iloc[train_idx].reset_index(drop=True)
    fold_val_df = train_df.iloc[val_idx].reset_index(drop=True)
    
    print(f"Training samples: {len(fold_train_df)}")
    print(f"Validation samples: {len(fold_val_df)}")
    print(f"Train class distribution: {fold_train_df['polarization'].value_counts().to_dict()}")
    print(f"Val class distribution: {fold_val_df['polarization'].value_counts().to_dict()}")
    
    # Create datasets
    fold_train_dataset = Dataset.from_pandas(fold_train_df[['text', 'polarization']])
    fold_val_dataset = Dataset.from_pandas(fold_val_df[['text', 'polarization']])
    
    # Tokenize
    fold_train_dataset = fold_train_dataset.map(tokenize_function_train, batched=True)
    fold_train_dataset = fold_train_dataset.rename_column('polarization', 'labels')
    fold_train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
    
    fold_val_dataset = fold_val_dataset.map(tokenize_function_train, batched=True)
    fold_val_dataset = fold_val_dataset.rename_column('polarization', 'labels')
    fold_val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
    
    # Load fresh model for this fold
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=2
    )
    
    # Training arguments for this fold
    training_args = TrainingArguments(
        output_dir=f'./results_fold_{fold}',
        num_train_epochs=training_config['num_train_epochs'],
        per_device_train_batch_size=training_config['per_device_train_batch_size'],
        per_device_eval_batch_size=training_config['per_device_eval_batch_size'],
        learning_rate=training_config['learning_rate'],
        weight_decay=training_config['weight_decay'],
        warmup_steps=training_config['warmup_steps'],
        logging_dir=f'./logs_fold_{fold}',
        logging_steps=training_config['logging_steps'],
        eval_strategy="epoch",
        save_strategy=training_config['save_strategy'],
        load_best_model_at_end=True,
        metric_for_best_model="f1",
        seed=training_config['seed'],
        fp16=training_config['fp16'],
        report_to='none'
    )
    
    # Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=fold_train_dataset,
        eval_dataset=fold_val_dataset,
        compute_metrics=compute_metrics,
        data_collator=data_collator
    )
    
    # Train
    print(f"\nTraining Fold {fold}...")
    trainer.train()
    
    # Evaluate on validation fold
    print(f"\nEvaluating Fold {fold}...")
    eval_results = trainer.evaluate()
    
    print(f"\nFold {fold} Results:")
    print(f"  Validation Accuracy: {eval_results['eval_accuracy']:.4f}")
    print(f"  Validation F1 Score: {eval_results['eval_f1']:.4f}")
    
    # Save model to disk (important for memory management!)
    model_save_path = f'./saved_models/fold_{fold}_model'
    trainer.save_model(model_save_path)
    print(f"  Model saved to: {model_save_path}")
    
    # Store path and results
    model_paths.append(model_save_path)
    fold_results.append({
        'fold': fold,
        'accuracy': eval_results['eval_accuracy'],
        'f1': eval_results['eval_f1']
    })
    
    # Get detailed predictions for this fold
    predictions = trainer.predict(fold_val_dataset)
    preds = np.argmax(predictions.predictions, axis=1)
    labels = predictions.label_ids
    
    print(f"\nDetailed Classification Report (Fold {fold}):")
    print(classification_report(labels, preds, target_names=['Class 0', 'Class 1']))
    
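    # CRITICAL: Clear GPU memory after each fold.
    # Drop the references and run the garbage collector first, so that
    # empty_cache() can actually release the freed blocks.
    del model, trainer, fold_train_dataset, fold_val_dataset, predictions
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    print(f"  GPU memory cleared")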


print("\n" + "="*80)
print("10-FOLD CROSS-VALIDATION COMPLETE")
print("="*80)

## 10. Cross-Validation Results Summary

In [None]:
# Convert results to DataFrame
results_df = pd.DataFrame(fold_results)

print("="*80)
print("10-FOLD CROSS-VALIDATION RESULTS")
print("="*80)
print("\nPer-Fold Results:")
print(results_df.to_string(index=False))

print(f"\n{'='*80}")
print("SUMMARY STATISTICS")
print(f"{'='*80}")
print(f"Mean Accuracy: {results_df['accuracy'].mean():.4f} ± {results_df['accuracy'].std():.4f}")
print(f"Mean F1 Score: {results_df['f1'].mean():.4f} ± {results_df['f1'].std():.4f}")
print(f"\nMin F1 Score: {results_df['f1'].min():.4f} (Fold {results_df.loc[results_df['f1'].idxmin(), 'fold']})")
print(f"Max F1 Score: {results_df['f1'].max():.4f} (Fold {results_df.loc[results_df['f1'].idxmax(), 'fold']})")

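# Approximate 95% confidence interval (normal approximation over the fold scores;
# with only 10 folds, a t-based interval would be somewhat wider)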
f1_mean = results_df['f1'].mean()
f1_std = results_df['f1'].std()
f1_ci_lower = f1_mean - 1.96 * f1_std / np.sqrt(N_FOLDS)
f1_ci_upper = f1_mean + 1.96 * f1_std / np.sqrt(N_FOLDS)
print(f"\n95% Confidence Interval: [{f1_ci_lower:.4f}, {f1_ci_upper:.4f}]")

# Plot results
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.bar(results_df['fold'], results_df['accuracy'])
plt.axhline(y=results_df['accuracy'].mean(), color='r', linestyle='--', label='Mean')
plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.title('Accuracy per Fold')
plt.legend()
plt.ylim([0.7, 1.0])

plt.subplot(1, 2, 2)
plt.bar(results_df['fold'], results_df['f1'])
plt.axhline(y=results_df['f1'].mean(), color='r', linestyle='--', label='Mean')
plt.xlabel('Fold')
plt.ylabel('F1 Score')
plt.title('F1 Score per Fold')
plt.legend()
plt.ylim([0.7, 1.0])

plt.tight_layout()
plt.show()

print(f"\n✓ Trained and saved {len(model_paths)} models successfully")
print(f"✓ Models saved to disk to conserve GPU memory")
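
The interval in the summary uses the normal critical value 1.96. With 10 fold scores, a Student-t critical value (about 2.262 for 9 degrees of freedom) gives a somewhat wider interval. The sketch below compares the two on hypothetical F1 scores. Because the folds share most of their training data, the scores are not independent, so either interval is only a rough guide.

```python
import math
import statistics

# Hypothetical per-fold F1 scores; in the notebook use results_df['f1'].tolist().
fold_f1 = [0.81, 0.83, 0.80, 0.84, 0.82, 0.79, 0.83, 0.82, 0.81, 0.85]

n = len(fold_f1)
mean = statistics.mean(fold_f1)
sem = statistics.stdev(fold_f1) / math.sqrt(n)   # sample std (ddof=1), as in pandas

Z_975 = 1.96        # normal critical value
T_975_DF9 = 2.262   # Student-t critical value for 9 degrees of freedom

z_half = Z_975 * sem
t_half = T_975_DF9 * sem
print(f"normal:   [{mean - z_half:.4f}, {mean + z_half:.4f}]")
print(f"t (df=9): [{mean - t_half:.4f}, {mean + t_half:.4f}]")
```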

## 11. Ensemble Predictions on Dev Set

Use soft voting: average the predicted probabilities from all 10 models, then take the argmax for final prediction.

**Memory-Efficient Loading:** Load one model at a time, get predictions, then unload to save GPU memory.
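
The logits-to-probabilities step can be sketched in plain NumPy. The logits below are hypothetical, and `softmax` is a local helper written for this example; the notebook itself uses `torch.softmax`.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits, shape (n_models, n_samples, n_classes).
logits = np.array([
    [[2.0, 0.0], [0.0, 0.0]],     # model 1
    [[1.0, 1.0], [-1.0, 3.0]],    # model 2
])

probs = softmax(logits)                  # same shape as logits
ensemble_probs = probs.mean(axis=0)      # shape (n_samples, n_classes)
ensemble_pred = ensemble_probs.argmax(axis=1)
print(ensemble_probs.round(3), ensemble_pred)
```

The softmax is applied per model before averaging, so each model contributes on the same 0 to 1 scale regardless of how large its raw logits are.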

In [None]:
print("="*80)
print("GENERATING ENSEMBLE PREDICTIONS ON DEV SET")
print("="*80)

# Storage for all predictions
all_predictions = []

# Get predictions from each model (load one at a time to save memory)
for fold, model_path in enumerate(model_paths, 1):
    print(f"Loading and evaluating Fold {fold} model...")
    
    # Load model from disk
    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    model.eval()
    model.to('cuda' if torch.cuda.is_available() else 'cpu')
    
    # Create a temporary trainer just for prediction
    temp_trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir='./temp',
            per_device_eval_batch_size=32,
            fp16=torch.cuda.is_available(),
            report_to='none'
        ),
        data_collator=data_collator
    )
    
    # Get predictions (logits)
    predictions = temp_trainer.predict(dev_dataset_tokenized)
    
    # Convert logits to probabilities using softmax
    probs = torch.softmax(torch.tensor(predictions.predictions), dim=1).numpy()
    all_predictions.append(probs)
    
    print(f"  Shape: {probs.shape}, Class 1 prob range: [{probs[:, 1].min():.3f}, {probs[:, 1].max():.3f}]")
    
    # CRITICAL: Clear GPU memory after each model
    # (collect garbage first so empty_cache() can release the freed blocks)
    del model, temp_trainer, predictions
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

# Convert to numpy array for easier manipulation
all_predictions = np.array(all_predictions)  # Shape: (10, n_samples, 2)

print(f"\n✓ Collected predictions from all {len(model_paths)} models")
print(f"  Prediction array shape: {all_predictions.shape}")

In [None]:
# Ensemble: Average probabilities across all models
ensemble_probs = all_predictions.mean(axis=0)  # Shape: (n_samples, 2)

# Get final predictions
ensemble_predictions = np.argmax(ensemble_probs, axis=1)

print("="*80)
print("ENSEMBLE PREDICTIONS COMPLETE")
print("="*80)
print(f"Total predictions: {len(ensemble_predictions)}")
print(f"\nPrediction distribution:")
unique, counts = np.unique(ensemble_predictions, return_counts=True)
for label, count in zip(unique, counts):
    print(f"  Class {label}: {count} ({count/len(ensemble_predictions)*100:.1f}%)")

print(f"\nEnsemble probability statistics:")
print(f"  Class 0 prob - Mean: {ensemble_probs[:, 0].mean():.3f}, Std: {ensemble_probs[:, 0].std():.3f}")
print(f"  Class 1 prob - Mean: {ensemble_probs[:, 1].mean():.3f}, Std: {ensemble_probs[:, 1].std():.3f}")

# Confidence analysis
confidence = ensemble_probs.max(axis=1)
print(f"\nPrediction confidence:")
print(f"  Mean: {confidence.mean():.3f}")
print(f"  Min: {confidence.min():.3f}")
print(f"  Max: {confidence.max():.3f}")
print(f"  Median: {np.median(confidence):.3f}")

# High confidence predictions
high_conf_threshold = 0.9
high_conf_count = (confidence > high_conf_threshold).sum()
print(f"\nHigh confidence predictions (>{high_conf_threshold}): {high_conf_count} ({high_conf_count/len(confidence)*100:.1f}%)")

## 12. Create Submission File

Following the submission guidelines:
- File format: CSV with columns `id` and `polarization`
- File name: `pred_arb.csv` (the cell below writes `pred_arb_ensemble.csv`, so rename the file before submitting)
- Values: 0 or 1 for polarization labels

In [None]:
# Create submission dataframe
submission_df = pd.DataFrame({
    'id': dev_df['id'],
    'polarization': ensemble_predictions
})

# Save to CSV
output_file = 'pred_arb_ensemble.csv'
submission_df.to_csv(output_file, index=False)

print(f"✓ Submission file created: {output_file}")
print(f"\nFile preview:")
print(submission_df.head(10))
print(f"\nFile preview (tail):")
print(submission_df.tail(10))
print(f"\nTotal predictions: {len(submission_df)}")

## 13. Validation Check

In [None]:
# Verify submission file format
print("="*80)
print("VERIFYING SUBMISSION FILE")
print("="*80)

# Read the file back
verify_df = pd.read_csv(output_file)

# Check columns
expected_columns = ['id', 'polarization']
if list(verify_df.columns) == expected_columns:
    print("✓ Columns are correct: ['id', 'polarization']")
else:
    print(f"✗ Column mismatch! Expected {expected_columns}, got {list(verify_df.columns)}")

# Check for missing values
missing = verify_df.isnull().sum()
if missing.sum() == 0:
    print("✓ No missing values")
else:
    print(f"✗ Missing values found:\n{missing}")

# Check polarization values
unique_values = verify_df['polarization'].unique()
if set(unique_values).issubset({0, 1}):
    print(f"✓ Polarization values are valid: {sorted(unique_values)}")
else:
    print(f"✗ Invalid polarization values: {unique_values}")

# Check number of predictions
if len(verify_df) == len(dev_df):
    print(f"✓ Number of predictions matches dev set: {len(verify_df)}")
else:
    print(f"✗ Prediction count mismatch! Expected {len(dev_df)}, got {len(verify_df)}")

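# Check IDs match (guard on length first so a size mismatch cannot raise an error)
if len(verify_df) == len(dev_df) and (verify_df['id'].values == dev_df['id'].values).all():
    print("✓ All IDs match the dev set")
else:
    print("✗ ID mismatch detected!")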

print("\n" + "="*80)
print("SUBMISSION FILE READY")
print("="*80)
print(f"\n📄 File: {output_file}")
print(f"📋 Format: CSV with columns 'id' and 'polarization'")
print(f"📊 Predictions: {len(verify_df)}")
print(f"\n🎯 Training Performance (10-Fold CV):")
print(f"   Mean F1 Score: {results_df['f1'].mean():.4f} ± {results_df['f1'].std():.4f}")
print(f"   95% CI: [{f1_ci_lower:.4f}, {f1_ci_upper:.4f}]")
print(f"\n💡 Ensemble Method: Soft voting (averaged probabilities from 10 models)")

## 14. Comparison with Individual Models

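The dev set loaded here has no labels, so individual models cannot be scored on it. Instead, this section compares each model's predicted class distribution with the ensemble's and measures how often the models agree with the ensemble.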

In [None]:
print("="*80)
print("INDIVIDUAL MODEL PREDICTIONS ON DEV SET")
print("="*80)

individual_predictions = []

for fold in range(N_FOLDS):
    # Get predictions from this model (argmax of probabilities)
    fold_preds = np.argmax(all_predictions[fold], axis=1)
    individual_predictions.append(fold_preds)
    
    # Distribution for this fold
    unique, counts = np.unique(fold_preds, return_counts=True)
    dist_str = ", ".join([f"Class {label}: {count}" for label, count in zip(unique, counts)])
    print(f"Fold {fold+1}: {dist_str}")

print(f"\nEnsemble: ", end="")
unique, counts = np.unique(ensemble_predictions, return_counts=True)
dist_str = ", ".join([f"Class {label}: {count}" for label, count in zip(unique, counts)])
print(dist_str)

# Calculate agreement between models
individual_predictions = np.array(individual_predictions)  # Shape: (10, n_samples)

# For each sample, count how many models agree with the ensemble
agreement_counts = (individual_predictions == ensemble_predictions).sum(axis=0)
print(f"\n{'='*80}")
print("MODEL AGREEMENT ANALYSIS")
print(f"{'='*80}")
print(f"Mean agreement: {agreement_counts.mean():.2f} / 10 models")
print(f"Min agreement: {agreement_counts.min()} models")
print(f"Max agreement: {agreement_counts.max()} models")
print(f"\nAgreement distribution:")
for i in range(N_FOLDS+1):
    count = (agreement_counts == i).sum()
    if count > 0:
        print(f"  {i} models agree: {count} samples ({count/len(agreement_counts)*100:.1f}%)")

# Unanimous predictions
unanimous = (agreement_counts == N_FOLDS).sum()
print(f"\nUnanimous predictions (all 10 models agree): {unanimous} ({unanimous/len(agreement_counts)*100:.1f}%)")

## Summary

### Model Configuration
- **Base Model:** MARBERT v2 (UBC-NLP/MARBERTv2)
- **Preprocessing:** Basic (character normalization, diacritic removal, tatweel removal)
- **Training Strategy:** 10-Fold Stratified Cross-Validation
- **Number of Models:** 10 (one per fold)
- **Ensemble Method:** Soft Voting (averaged predicted probabilities)

### Hyperparameters
- **Epochs:** 4
- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Warmup Steps:** 500
- **Weight Decay:** 0.01
- **Random Seed:** 42

### Training Data
- **Total Samples:** Full training set
- **Folds:** 10
- **Train per fold:** 90% of data
- **Validation per fold:** 10% of data
- **Stratification:** Enabled (maintains class balance)

### Performance
- **Cross-Validation Mean F1:** Reported above
- **Cross-Validation Std:** Reported above
- **95% Confidence Interval:** Reported above

### Output
- **File:** `pred_arb_ensemble.csv`
- **Format:** Two columns (`id`, `polarization`)
- **Language:** Arabic (arb)
- **Method:** Ensemble of 10 models with soft voting
- **Ready for submission to Codabench Subtask 1**

### Advantages of 10-Fold Ensemble
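1. **Reduced Variance:** Averaging the probabilities of 10 models lowers the variance of the final prediction and dampens the effect of any single model's overfitting
2. **Better Generalization:** Each model is trained on a different 90% subset, so the ensemble is less tied to one particular train/validation split
3. **Robust Predictions:** A single poorly trained fold has limited influence on the averaged output
4. **Full Data Utilization:** Every sample is used for training in nine folds and for validation in one
5. **Confidence Estimation:** Agreement between models gives a rough indication of prediction confidence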