# MARBERT v2 Final Training & Submission

## Overview
This notebook trains MARBERT v2 on the **full training set** (no split) and generates predictions for the dev set.

## Multi-Seed Validation Results (Round 3)
- **Mean F1 Score:** 83.93% ± 2.07%
- **95% CI:** [82.11%, 85.75%]
- **Variability:** Moderate (coefficient of variation ≈ 2.47%)

## Best Configuration
- **Preprocessing:** Basic (character normalization)
- **Epochs:** 4
- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Warmup Steps:** 500
- **Weight Decay:** 0.01
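
To put the warmup setting in context, here is a rough calculation of how many optimizer steps this configuration produces. It uses the 3,380-row training set loaded later in this notebook and assumes a single device, no gradient accumulation, and that the final partial batch counts as a step. Treat it as a sanity check, not an exact account of what `Trainer` does internally.

```python
import math

# Values from this notebook's configuration and training-set size.
num_samples = 3380
batch_size = 16
num_epochs = 4
warmup_steps = 500

# Assumption: single device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_fraction = warmup_steps / total_steps

print(f"Steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Warmup share of training: {warmup_fraction:.1%}")
```

Under these assumptions, 500 warmup steps cover more than half of the run, which is worth keeping in mind when comparing this configuration with others.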

## Submission Format
Output file: `pred_arb.csv` with columns: `id`, `polarization`

## 1. Setup & Imports

In [1]:
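# %pip installs into the environment of the running kernel; accelerate is needed by recent Trainer versions
%pip install transformers datasets torch scikit-learn accelerate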



In [1]:
import pandas as pd
import numpy as np
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from datasets import Dataset
from sklearn.metrics import classification_report, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

  from .autonotebook import tqdm as notebook_tqdm


ModuleNotFoundError: No module named 'datasets'

## 2. Load Preprocessed Data

**Note on Column Names:**
- Training file (`../train/arb_clean_basic.csv`): columns are `id`, `text`, `polarization`
- Dev file (`../dev/arb_clean.csv`): columns are `id`, `text_clean`

In [None]:
# Load full training data (already preprocessed with basic preprocessing)
# Columns: id, text, polarization
train_df = pd.read_csv('../train/arb_clean_basic.csv')

print(f"Training set size: {len(train_df)}")
print(f"Columns: {train_df.columns.tolist()}")
print(f"\nClass distribution:")
print(train_df['polarization'].value_counts())
print(f"\nSample training data:")
print(train_df.head())

Training set size: 3380
Columns: ['id', 'text', 'polarization']

Class distribution:
polarization
0    1868
1    1512
Name: count, dtype: int64

Sample training data:
                                     id  \
0  arb_a2a60c8b4af3389e842d8ec31afb0eea   
1  arb_6723e56a672674a6c1d9b28b213c4a05   
2  arb_b0365d606edeee38ae6c025b1ca33e96   
3  arb_858c0ee684049ba6f416a6cecb0b0761   
4  arb_bdafc73afd0bc2cd2badae2a089446b9   

                                                text  polarization  
0  ÿßÿ≠ŸÑÿßŸÖ ÿßŸÜÿ™Ÿä ŸàŸÜÿπÿßŸÑŸä ŸàŸÖŸÜŸà ÿßŸÜÿ™Ÿä ÿ≠ÿ™Ÿä ÿ™ŸÇŸäŸÖŸäŸÜ ÿßŸÑŸÅŸÜÿßŸÜŸä...             1  
1  Ÿàÿ±Ÿá ÿßŸÑŸÉŸàÿßŸÑŸäÿ≥ ÿ™ŸÜŸäÿ¨ÿ¨ ŸÖŸÜ Ÿàÿ±Ÿá ÿ®ÿπŸäÿ± ÿµÿ∑ŸÜÿßÿπŸä ÿπŸÑŸä ŸÅŸÉÿ±ÿ©...             1  
2  .ÿÆÿÆÿÆÿÆ ÿßŸÑŸÖŸÑŸÉŸá ÿßÿ≠ŸÑÿßŸÖ ŸÅŸäŸáÿß ÿ¥ÿ∞Ÿàÿ∞ ÿ¥ŸÜŸà ŸáŸÑ ÿ®Ÿàÿ≥ ŸàÿßŸÑÿØŸÑÿπ...             1  
3  ÿßŸÑŸÑŸá ŸäÿÆÿ≤Ÿä ÿßÿ≠ŸÑÿßŸÖ ŸáŸä ŸàÿßŸÑÿ®ÿ±ŸÜÿßŸÖÿ¨ ÿßŸÑÿÆÿßŸäÿ≥ ÿßŸÑŸä ŸÉŸÑŸá ŸÖÿµÿÆÿ±Ÿá             1  
4  ŸÉÿ≥ ÿßŸÖ ÿßÿ≠ŸÑÿßŸÖ ÿßŸÑŸä ŸÖÿßÿ±ÿ®ÿ™Ÿáÿß Ÿàÿ¥ ŸÖŸÑŸÉŸá ŸáŸáŸáŸá ŸÖÿ™ÿ

In [None]:
# Load preprocessed dev data
# Columns: id, text_clean
dev_df = pd.read_csv('../dev/arb_clean.csv')

print(f"Dev set size: {len(dev_df)}")
print(f"Columns: {dev_df.columns.tolist()}")
print(f"\nSample dev data:")
print(dev_df.head())

Dev set size: 169
Columns: ['id', 'text_clean']

Sample dev data:
                                     id  \
0  arb_67be47e5216d7bee41e17484e619f4e6   
1  arb_272322e5b265e177613d685e5619e402   
2  arb_d1ec38dd0ec5d7a4fe28ef8317fc96c1   
3  arb_fad75310b17c124d98ebc514189ec033   
4  arb_95caf70cec5bf00c94c35cf7af2a0ab5   

                                          text_clean  
0  ÿ≠ŸäŸÑ ÿ®Ÿäÿ¨ ŸäÿßŸÜÿ∞ŸÑŸá ÿ™ÿ≠ÿ¨ŸäŸÜ ÿπ ÿßŸÑÿπÿ±ÿßŸÇŸäÿßÿ™ ÿ®ÿ≥ ÿßÿ≠ŸÜŸá ÿßŸÑÿπÿ±ÿß...  
1  ÿπŸÑŸä ÿ≤ÿ®Ÿä\nŸäÿß ŸÑŸäÿ™Ÿáÿß ÿ™ÿ¨Ÿä ŸÖÿπŸä ÿßŸÑÿ®ÿ± ÿßÿÆŸÑŸäŸáÿß ÿ™ÿ≥ŸàŸÇ ÿØÿ®ÿß...  
2  ŸÉŸÑ ÿßŸÑŸÖÿ∫ŸÜŸäŸÜ ŸàŸÑŸä ŸäÿØÿÆŸÑŸàŸÜ  ÿßŸÑŸÖÿ≥ŸäŸÇŸá  ŸÅŸä  ÿßÿ∫ÿßŸÜŸäŸáŸÖ  ŸÜ...  
3  ÿßŸÑŸÑŸá ŸäÿÆŸÑŸÇ ŸàŸÜÿ≠ŸÜÿß ŸÜÿ®ÿ™ŸÑŸä ÿ®ŸÉŸÑ ŸÖÿß ÿ™ÿπŸÜŸäŸá ÿßŸÑŸÉŸÑÿßŸÖÿßÿ™ ŸÖŸÜ...  
4        ÿ±ÿ°Ÿäÿ≥ ÿßŸÑÿØŸàŸÑÿ© ŸÉÿßŸÅÿ± ŸàÿßŸÑÿ¥ÿπÿ® ÿ≥ÿßŸÉÿ™ ÿÆÿßÿ∑ÿ±Ÿà ÿ¥ÿπÿ® ÿ∑ÿ≠ÿßŸÜ  


## 3. Prepare Datasets

In [None]:
# Prepare training dataset
# Training file has column 'text' and 'polarization'
train_dataset = Dataset.from_pandas(train_df[['text', 'polarization']])

# Prepare dev dataset
# Dev file has column 'text_clean'
dev_dataset = Dataset.from_pandas(dev_df[['text_clean']])

print(f"Training dataset: {len(train_dataset)} samples")
print(f"Dev dataset: {len(dev_dataset)} samples")
print(f"\nTraining sample:")
print(train_dataset[0])
print(f"\nDev sample:")
print(dev_dataset[0])

Training dataset: 3380 samples
Dev dataset: 169 samples

Training sample:
{'text': 'ÿßÿ≠ŸÑÿßŸÖ ÿßŸÜÿ™Ÿä ŸàŸÜÿπÿßŸÑŸä ŸàŸÖŸÜŸà ÿßŸÜÿ™Ÿä ÿ≠ÿ™Ÿä ÿ™ŸÇŸäŸÖŸäŸÜ ÿßŸÑŸÅŸÜÿßŸÜŸäŸÜ ÿßŸÑŸÖŸÑŸÉŸá ÿßÿ≠ŸÑÿßŸÖ ŸáŸáŸáŸáŸáŸáŸáŸá ÿßŸÑÿ®ŸÇÿ±Ÿá ÿßÿ≠ŸÑÿßŸÖ ÿ®ÿßÿ®ÿß ÿπŸàŸÅŸä ÿßŸÑŸÅŸÜ ŸÑÿßÿßŸáŸÑ ÿßŸÑŸÅŸÜ', 'polarization': 1}

Dev sample:
{'text_clean': 'ÿ≠ŸäŸÑ ÿ®Ÿäÿ¨ ŸäÿßŸÜÿ∞ŸÑŸá ÿ™ÿ≠ÿ¨ŸäŸÜ ÿπ ÿßŸÑÿπÿ±ÿßŸÇŸäÿßÿ™ ÿ®ÿ≥ ÿßÿ≠ŸÜŸá ÿßŸÑÿπÿ±ÿßŸÇŸäÿßÿ™ ÿßŸÜÿ¥ÿ±ŸÅÿ¨ ŸàÿßŸÜÿ¥ÿ±ŸÅ ÿπÿ¥ÿ±Ÿá ŸÖÿ´ŸÑÿ¨'}


## 4. Tokenization

In [8]:
# Load tokenizer
model_name = "UBC-NLP/MARBERTv2"
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"Tokenizer loaded: {model_name}")
print(f"Vocab size: {tokenizer.vocab_size}")

Tokenizer loaded: UBC-NLP/MARBERTv2
Vocab size: 100000


In [None]:
# Tokenization function for training data (uses 'text' column)
def tokenize_function_train(examples):
    return tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=128
    )

# Tokenization function for dev data (uses 'text_clean' column)
def tokenize_function_dev(examples):
    return tokenizer(
        examples['text_clean'],
        padding='max_length',
        truncation=True,
        max_length=128
    )

# Tokenize training dataset
print("Tokenizing training data...")
train_dataset = train_dataset.map(tokenize_function_train, batched=True)
# Rename 'polarization' to 'labels' for training
train_dataset = train_dataset.rename_column('polarization', 'labels')
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

# Tokenize dev dataset
print("Tokenizing dev data...")
dev_dataset = dev_dataset.map(tokenize_function_dev, batched=True)
dev_dataset.set_format('torch', columns=['input_ids', 'attention_mask'])

print("\nTokenization complete!")
print(f"Training dataset: {train_dataset}")
print(f"Dev dataset: {dev_dataset}")

Tokenizing training data...



ValueError: Original column name label not in the dataset. Current columns in the dataset: ['text', 'polarization', 'input_ids', 'token_type_ids', 'attention_mask']
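
The two tokenization functions in the cell above differ only in the column they read (`text` for training, `text_clean` for dev). One way to remove the duplication is a small factory that takes the column name. The sketch below is illustrative: `make_tokenize_fn` and `toy_tokenizer` are hypothetical names, and the toy tokenizer is only a stand-in so the example runs without downloading a model.

```python
def make_tokenize_fn(tokenizer, text_column, max_length=128):
    """Build a batched tokenization function that reads `text_column`."""
    def tokenize(examples):
        return tokenizer(
            examples[text_column],
            padding='max_length',
            truncation=True,
            max_length=max_length,
        )
    return tokenize


def toy_tokenizer(texts, padding, truncation, max_length):
    """Stand-in for a real tokenizer: one fake id per whitespace token."""
    input_ids, attention_mask = [], []
    for text in texts:
        ids = [len(tok) for tok in text.split()][:max_length]  # truncate
        pad = max_length - len(ids)                             # pad to fixed length
        input_ids.append(ids + [0] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


# Dry run on the dev column name with a tiny max_length.
tokenize_dev = make_tokenize_fn(toy_tokenizer, 'text_clean', max_length=4)
batch = tokenize_dev({'text_clean': ['a bb', 'a bb ccc dddd eeeee']})
print(batch['attention_mask'])
```

With the real tokenizer, the same factory could be used as `train_dataset.map(make_tokenize_fn(tokenizer, 'text'), batched=True)` and `dev_dataset.map(make_tokenize_fn(tokenizer, 'text_clean'), batched=True)`.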

## 5. Model Setup

In [None]:
# Load model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

print(f"Model loaded: {model_name}")
print(f"Number of parameters: {model.num_parameters():,}")

## 6. Training Configuration

In [None]:
# Set random seed for reproducibility
RANDOM_SEED = 45

# Best configuration from finetuning rounds
training_args = TrainingArguments(
    output_dir='./results_final',
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=500,
    logging_dir='./logs',
    logging_steps=50,
    save_strategy='epoch',
    seed=RANDOM_SEED,
    fp16=torch.cuda.is_available(),
    report_to='none'
)

print("Training Configuration:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Weight decay: {training_args.weight_decay}")
print(f"  Warmup steps: {training_args.warmup_steps}")
print(f"  Random seed: {RANDOM_SEED}")
print(f"  FP16: {training_args.fp16}")

## 7. Train Model on Full Training Set

In [None]:
# Initialize data collator
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator
)

print("Starting training on full training set...")
print(f"Total samples: {len(train_dataset)}")
print(f"Estimated training time: ~{(len(train_dataset) / 16) * 4 / 60:.1f} minutes")

# Train the model
trainer.train()

print("\n✓ Training complete!")

## 8. Generate Predictions for Dev Set

In [None]:
print("Generating predictions for dev set...")

# Get predictions
predictions = trainer.predict(dev_dataset)
pred_labels = np.argmax(predictions.predictions, axis=1)

print(f"\nPredictions generated: {len(pred_labels)}")
print(f"\nPrediction distribution:")
unique, counts = np.unique(pred_labels, return_counts=True)
for label, count in zip(unique, counts):
    print(f"  Class {label}: {count} ({count/len(pred_labels)*100:.1f}%)")
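
The prediction cell keeps only the arg-max class per row. If a confidence score per prediction is also useful, for example to inspect borderline cases, the raw scores can be passed through a softmax first. The sketch below assumes the rows are two-class logits, which is what a sequence classification head normally returns, and uses made-up numbers in place of real model output. `predict_with_confidence` is a hypothetical helper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(logit_rows):
    """Return (label, probability of that label) for each row."""
    results = []
    for row in logit_rows:
        probs = softmax(row)
        # Ties resolve to the lowest index, matching np.argmax.
        label = max(range(len(probs)), key=probs.__getitem__)
        results.append((label, probs[label]))
    return results


# Made-up logits standing in for predictions.predictions.
example_logits = [[2.0, -1.0], [-0.3, 0.4], [0.1, 0.1]]
for label, confidence in predict_with_confidence(example_logits):
    print(f"Class {label} (p = {confidence:.3f})")
```

Softmax does not change which class wins, so the labels agree with the arg-max used above.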

## 9. Create Submission File

Following the submission guidelines:
- File format: CSV with columns `id` and `polarization`
- File name: `pred_arb.csv` (pred_[lang_code].csv)
- Values: 0 or 1 for polarization labels

In [None]:
# Create submission dataframe
submission_df = pd.DataFrame({
    'id': dev_df['id'],
    'polarization': pred_labels
})

# Save to CSV
output_file = 'pred_arb.csv'
submission_df.to_csv(output_file, index=False)

print(f"✓ Submission file created: {output_file}")
print(f"\nFile preview:")
print(submission_df.head(10))
print(f"\nTotal predictions: {len(submission_df)}")
print(f"\nFile saved successfully!")

## 10. Validation Check

In [None]:
# Verify submission file format
print("Verifying submission file...\n")

# Read the file back
verify_df = pd.read_csv(output_file)

# Check columns
expected_columns = ['id', 'polarization']
if list(verify_df.columns) == expected_columns:
    print("✓ Columns are correct: ['id', 'polarization']")
else:
    print(f"✗ Column mismatch! Expected {expected_columns}, got {list(verify_df.columns)}")

# Check for missing values
missing = verify_df.isnull().sum()
if missing.sum() == 0:
    print("✓ No missing values")
else:
    print(f"✗ Missing values found: {missing}")

# Check polarization values
unique_values = verify_df['polarization'].unique()
if set(unique_values).issubset({0, 1}):
    print(f"✓ Polarization values are valid: {sorted(unique_values)}")
else:
    print(f"✗ Invalid polarization values: {unique_values}")

# Check number of predictions
if len(verify_df) == len(dev_df):
    print(f"✓ Number of predictions matches dev set: {len(verify_df)}")
else:
    print(f"✗ Prediction count mismatch! Expected {len(dev_df)}, got {len(verify_df)}")

# Check IDs match
# Compare as lists so a length mismatch reports a failure instead of raising
if verify_df['id'].tolist() == dev_df['id'].tolist():
    print("✓ All IDs match the dev set")
else:
    print("✗ ID mismatch detected!")

print("\n" + "="*80)
print("SUBMISSION FILE READY FOR UPLOAD TO CODABENCH")
print("="*80)
print(f"\n📄 File: {output_file}")
print(f"📋 Format: CSV with columns 'id' and 'polarization'")
print(f"📊 Predictions: {len(verify_df)}")
print(f"\n🎯 Expected Performance: 83.93% ± 2.07% F1 Score")
print(f"   95% Confidence Interval: [82.11%, 85.75%]")
print(f"\n" + "="*80)
print("SUBMISSION INSTRUCTIONS")
print("="*80)
print("1. Download the file: pred_arb.csv")
print("2. Go to the Codabench subtask 1 page")
print("3. Upload pred_arb.csv for Arabic language predictions")
print("4. File naming format: pred_[lang_code].csv")
print("   - For Arabic: pred_arb.csv ✓")
print("5. Each file must have columns: id, polarization")
print("6. Polarization values must be 0 or 1")
print("="*80)

## Summary

### Model Details
- **Model:** MARBERT v2 (UBC-NLP/MARBERTv2)
- **Preprocessing:** Basic (character normalization)
  - Character normalization (Alef/Hamza/Ya variants)
  - Diacritic removal
  - Tatweel removal
- **Training Data:** Full training set (3,380 samples, no validation split)
- **Dev Data:** 169 samples
- **Hyperparameters:**
  - Epochs: 4
  - Learning Rate: 2e-5
  - Batch Size: 16
  - Warmup Steps: 500
  - Weight Decay: 0.01
  - Random Seed: 45
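
The basic preprocessing listed above (character normalization, diacritic removal, tatweel removal) can be sketched roughly as follows. The notebook loads already-preprocessed files and does not show the exact rules, so the specific mappings here are assumptions based on common Arabic normalization practice. `normalize_basic` is a hypothetical name.

```python
import re

# Assumed mappings; the exact rules used to build the CSV files are not shown.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # alef with madda / hamza above / hamza below
DIACRITICS = re.compile("[\u064B-\u0652]")          # fathatan through sukun
ALEF_MAKSURA = "\u0649"
YA = "\u064A"
BARE_ALEF = "\u0627"
TATWEEL = "\u0640"


def normalize_basic(text):
    """Apply alef/ya normalization, then strip diacritics and tatweel."""
    text = ALEF_VARIANTS.sub(BARE_ALEF, text)
    text = text.replace(ALEF_MAKSURA, YA)
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    return text


# Alef-with-hamza, a tatweel, and a fatha are all normalized away.
sample = "\u0623\u062D\u0640\u0644\u064E\u0627\u0645"
print(normalize_basic(sample))
```

Whatever rules were actually used, the same function should be applied to both the training and dev text so the model sees consistent input.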

### Expected Performance
Based on 5-seed cross-validation (Round 3):
- **F1 Score:** 83.93% ± 2.07%
- **95% CI:** [82.11%, 85.75%]
- **Variability:** Moderate (coefficient of variation ≈ 2.47%)
- **Individual Seeds:**
  - Seed 42: 82.27% F1
  - Seed 43: 85.22% F1
  - Seed 44: 83.74% F1
  - Seed 45: 86.72% F1
  - Seed 46: 81.71% F1
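
The summary statistics can be checked against the per-seed scores listed above. The sketch below assumes the reported spread is the sample standard deviation and that the 95% interval uses a normal approximation (z = 1.96). Those assumptions reproduce the reported figures to within rounding, and the 2.47% figure matches the coefficient of variation (standard deviation divided by mean).

```python
import math
import statistics

# Per-seed F1 scores (percent) as listed in this summary.
seed_f1 = {42: 82.27, 43: 85.22, 44: 83.74, 45: 86.72, 46: 81.71}
scores = list(seed_f1.values())

mean_f1 = statistics.mean(scores)
std_f1 = statistics.stdev(scores)  # sample standard deviation (n - 1)

# Assumption: normal-approximation interval, mean ± 1.96 * std / sqrt(n).
half_width = 1.96 * std_f1 / math.sqrt(len(scores))
ci_low, ci_high = mean_f1 - half_width, mean_f1 + half_width

coeff_var = std_f1 / mean_f1

print(f"Mean F1: {mean_f1:.2f}% ± {std_f1:.2f}%")
print(f"95% CI (normal approx.): [{ci_low:.2f}%, {ci_high:.2f}%]")
print(f"Coefficient of variation: {coeff_var:.2%}")
```

With only five seeds, a Student-t interval would be noticeably wider than the normal approximation, so the reported interval is best read as a rough indication.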

### Output
- **File:** `pred_arb.csv`
- **Format:** Two columns (`id`, `polarization`)
- **Language:** Arabic (arb)
- **Ready for submission to Codabench Subtask 1**

### Submission Format
```
id,polarization
arb_123abc,1
arb_456def,1
arb_789ghi,0
```

**File Naming Convention:** pred_[lang_code].csv
- ✓ Arabic: `pred_arb.csv`
- For other languages: pred_eng.csv, pred_spa.csv, etc.
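
As a lightweight complement to the pandas-based checks in Section 10, the format rules above (header `id,polarization`, labels 0 or 1) can also be checked with the standard library alone. This is a minimal sketch: `validate_submission` is a hypothetical helper, and the requirement that IDs appear in the expected order is an assumption carried over from the notebook's own check, not something the format description confirms.

```python
import csv
import io


def validate_submission(csv_text, expected_ids):
    """Return a list of problems; an empty list means the file looks valid."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ['id', 'polarization']:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems

    rows = list(reader)
    # Assumption: IDs must match the expected list in the same order.
    if [row['id'] for row in rows] != list(expected_ids):
        problems.append("ids do not match the expected ids (or their order)")

    bad = [row['id'] for row in rows if row['polarization'] not in {'0', '1'}]
    if bad:
        problems.append(f"invalid polarization values for ids: {bad}")
    return problems


# The example file from this section, using its placeholder ids.
example = "id,polarization\narb_123abc,1\narb_456def,1\narb_789ghi,0\n"
print(validate_submission(example, ['arb_123abc', 'arb_456def', 'arb_789ghi']))
```

To check the real file, its contents can be read with `open('pred_arb.csv', encoding='utf-8').read()` and passed in together with the dev set's `id` column.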