# ModernAraBERT Benchmarking Examples

**Evaluate ModernAraBERT on Sentiment Analysis and Named Entity Recognition tasks**

This notebook demonstrates:
1. Running sentiment analysis benchmarks (HARD, AJGT, LABR, ASTD)
2. Running NER benchmarks (ANERCorp)
3. Visualizing and comparing results
4. Reproducing paper results

---

## 📦 Setup


In [None]:
import sys
import os
import json
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt

# Add repository root to path (step up one level when launched from notebooks/)
REPO_ROOT = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
sys.path.insert(0, str(REPO_ROOT))

print(f"Repository root: {REPO_ROOT}")
print("✅ Environment ready")


## 😊 Sentiment Analysis Benchmarks

This section covers four Arabic sentiment datasets. The paper results shown later in this notebook report three of them (AJGT, HARD, LABR).


In [None]:
print("😊 Sentiment Analysis Datasets\n")
print("1. HARD - Hotel Arabic Reviews Dataset")
print("   - 93,700 hotel reviews")
print("   - Classes: Positive, Negative")
print("   - Split: 60/20/20\n")

print("2. AJGT - Arabic Jordanian General Tweets")
print("   - 1,800 tweets")
print("   - Classes: Positive, Negative")
print("   - Split: 60/20/20\n")

print("3. LABR - Large-Scale Arabic Book Reviews")
print("   - 63,000 book reviews")
print("   - Classes: 1-5 stars")
print("   - Split: Predefined train/test\n")

print("4. ASTD - Arabic Sentiment Tweets Dataset")
print("   - 10,000 tweets")
print("   - Classes: Positive, Negative, Neutral, Mixed")
print("   - Split: Predefined train/test")


### Running SA Benchmarks

Use the provided script to run all SA benchmarks:


```bash
# Run all SA benchmarks
./scripts/benchmarking/run_sa_benchmark.sh gizadatateam/ModernAraBERT all ./results/sa

# Or run specific dataset
./scripts/benchmarking/run_sa_benchmark.sh gizadatateam/ModernAraBERT HARD ./results/sa_hard
```


### SA Benchmark Results (Paper)

Macro-F1 scores (%) reported in the paper for AJGT, HARD, and LABR:


In [None]:
# Sentiment Analysis results from paper
sa_results = {
    'Dataset': ['AJGT', 'HARD', 'LABR'],
    'AraBERT v1': [58.0, 72.7, 45.5],
    'mBERT': [61.5, 71.7, 45.5],
    'ModernAraBERT': [70.5, 89.4, 56.5]
}

df_sa = pd.DataFrame(sa_results)
print("📊 Sentiment Analysis Results (Macro-F1 %)\n")
print(df_sa.to_string(index=False))
print("\n✅ Improvements over AraBERT v1:")
# Absolute gains are in percentage points; relative gains are in percent
for _, row in df_sa.iterrows():
    gain = row['ModernAraBERT'] - row['AraBERT v1']
    print(f"  {row['Dataset']}: +{gain:.1f} pts (+{gain / row['AraBERT v1'] * 100:.1f}% relative)")
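
For reference, macro-F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a frequent one. The sketch below is a minimal pure-Python version run on made-up toy labels. Whether the benchmark scripts compute the metric exactly this way is an assumption; in practice `sklearn.metrics.f1_score(..., average='macro')` is the usual choice.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy, made-up labels purely for illustration (not benchmark data)
y_true = ["pos", "pos", "pos", "neg"]
y_pred = ["pos", "pos", "neg", "neg"]
score = macro_f1(y_true, y_pred)
print(f"Macro-F1: {score:.4f}")
```

Here the per-class F1 values are 0.8 for `pos` and about 0.667 for `neg`, and the macro score is their plain average.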


### Visualize SA Results


In [None]:
# Plot SA results
fig, ax = plt.subplots(figsize=(10, 6))

x = range(len(df_sa['Dataset']))
width = 0.25

bars1 = ax.bar([i - width for i in x], df_sa['AraBERT v1'], width, label='AraBERT v1', alpha=0.8)
bars2 = ax.bar([i for i in x], df_sa['mBERT'], width, label='mBERT', alpha=0.8)
bars3 = ax.bar([i + width for i in x], df_sa['ModernAraBERT'], width, label='ModernAraBERT', alpha=0.8)

ax.set_ylabel('Macro-F1 Score (%)', fontsize=12)
ax.set_title('Sentiment Analysis Benchmark Results', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(df_sa['Dataset'])
ax.legend()
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("✅ ModernAraBERT achieves the best Macro-F1 on all three reported SA datasets (AJGT, HARD, LABR)!")


## 🏷️ Named Entity Recognition Benchmark

NER is evaluated on the ANERCorp dataset using the IOB2 tagging scheme.
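
IOB2 tags are assigned per word, while the model sees subword tokens, so the labels have to be aligned. The configuration listed in this section names a first-subtoken strategy. The sketch below illustrates that idea under stated assumptions: `toy_subword_split` is a hypothetical stand-in for a real tokenizer, the tag names (`B-PER`, `B-LOC`, ...) are illustrative, and `-100` is assumed as the masked-label value because it is the default `ignore_index` of PyTorch's cross-entropy loss. The repository's actual implementation may differ.

```python
IGNORE_INDEX = -100  # assumed "skip this position" label value

def toy_subword_split(word, piece_len=3):
    """Hypothetical stand-in for a real subword tokenizer: fixed-size chunks."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]

def align_first_subtoken(words, tags, tag2id):
    """Give each word's tag to its first subtoken and mask the remaining ones."""
    subtokens, label_ids = [], []
    for word, tag in zip(words, tags):
        pieces = toy_subword_split(word)
        subtokens.extend(pieces)
        label_ids.append(tag2id[tag])                         # first subtoken
        label_ids.extend([IGNORE_INDEX] * (len(pieces) - 1))  # the rest
    return subtokens, label_ids

# Illustrative tag set and sentence (not taken from ANERCorp)
tag2id = {"O": 0, "B-PER": 1, "I-PER": 2, "B-LOC": 3, "I-LOC": 4}
words = ["Mahmoud", "Darwish", "visited", "Cairo"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]

subtokens, label_ids = align_first_subtoken(words, tags, tag2id)
for piece, label in zip(subtokens, label_ids):
    print(f"{piece:<5} {label}")
```

Masked positions contribute nothing to the loss or the metric, so each word is scored exactly once regardless of how many pieces it splits into.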


In [None]:
print("🏷️ ANERCorp Dataset\n")
print("Description:")
print("  - 150,000 tokens manually annotated")
print("  - Entity types: PERSON, LOCATION, ORGANIZATION, MISC")
print("  - Tagging scheme: IOB2")
print("  - Labeling strategy: First-subtoken only")
print("\nTraining Configuration:")
print("  - Epochs: 10")
print("  - Batch size: 8")
print("  - Learning rate: 2e-5")
print("  - Loss function: Focal Loss (α=0.25, γ=3.0)")
print("  - Class weights: Computed for imbalanced classes")
print("  - Early stopping: Patience=8")
print("\nRun benchmark:")
print("  ./scripts/benchmarking/run_ner_benchmark.sh gizadatateam/ModernAraBERT ./results/ner")
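
The focal loss named in the training configuration down-weights tokens the model already classifies confidently, which matters in NER because most tokens carry the `O` tag. A minimal pure-Python sketch for a single token follows, using the listed α=0.25 and γ=3.0. Treating α as one scalar is an assumption, as is leaving out the class weights; how the repository combines the two is not stated here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Standard cross-entropy for one token: -log(p_t)."""
    return -math.log(softmax(logits)[target])

def focal_loss(logits, target, alpha=0.25, gamma=3.0):
    """FL = -alpha * (1 - p_t)^gamma * log(p_t), with a scalar alpha (assumed)."""
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# Made-up logits for a 3-class problem; the true class is index 0 in both cases
easy = [4.0, 0.0, 0.0]  # confident and correct
hard = [0.0, 1.0, 0.0]  # leaning toward the wrong class

for name, logits in [("easy", easy), ("hard", hard)]:
    ce = cross_entropy(logits, 0)
    fl = focal_loss(logits, 0)
    print(f"{name}: cross-entropy={ce:.4f}  focal={fl:.6f}  ratio={fl / ce:.6f}")
```

The focal-to-cross-entropy ratio is far smaller for the easy token than for the hard one, which is the intended effect. Setting `gamma=0` and `alpha=1` recovers plain cross-entropy.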


### NER Benchmark Results (Paper)


In [None]:
# NER results from paper
ner_results = {
    'Model': ['AraBERT v1', 'mBERT', 'ModernAraBERT'],
    'Micro-F1 (%)': [78.9, 90.7, 82.1]
}

df_ner = pd.DataFrame(ner_results)
print("📊 NER Results on ANERCorp (Micro-F1 %)\n")
print(df_ner.to_string(index=False))
print("\n📝 Note: mBERT achieves the highest score due to its multilingual pretraining.")
print("ModernAraBERT offers a good trade-off between performance and efficiency.")


In [None]:
print("📊 ModernAraBERT - Complete Benchmark Results\n")
print("="*60)
print("SENTIMENT ANALYSIS (Macro-F1 %)")
print("="*60)
print(f"{'Dataset':<15} {'AraBERT v1':<12} {'mBERT':<12} {'ModernAraBERT':<15}")
print("-"*60)
print(f"{'AJGT':<15} {58.0:<12.1f} {61.5:<12.1f} {70.5:<15.1f} ⭐")
print(f"{'HARD':<15} {72.7:<12.1f} {71.7:<12.1f} {89.4:<15.1f} ⭐")
print(f"{'LABR':<15} {45.5:<12.1f} {45.5:<12.1f} {56.5:<15.1f} ⭐")
print()
print("="*60)
print("NAMED ENTITY RECOGNITION (Micro-F1 %)")
print("="*60)
print(f"{'ANERCorp':<15} {78.9:<12.1f} {90.7:<12.1f} {82.1:<15.1f}")
print()
print("="*60)
print("QUESTION ANSWERING (ARCD Test)")
print("="*60)
print(f"{'Metric':<15} {'AraBERT v1':<12} {'mBERT':<12} {'ModernAraBERT':<15}")
print("-"*60)
print(f"{'Exact Match':<15} {13.26:<12.2f} {15.27:<12.2f} {18.73:<15.2f} ⭐")
print(f"{'F1 Score':<15} {40.82:<12.2f} {46.12:<12.2f} {47.18:<15.2f} ⭐")
print(f"{'Sentence Match':<15} {71.47:<12.2f} {63.11:<12.2f} {76.66:<15.2f} ⭐")
print("="*60)
print("\n⭐ = Best performance")
print("\n✅ ModernAraBERT achieves best results on SA and QA tasks!")
print("✅ Competitive NER performance with efficient architecture!")


## 🔍 Model Comparison

Compare multiple models on the same dataset:


```bash
# Compare multiple models on SA
for model in gizadatateam/ModernAraBERT aubmindlab/bert-base-arabertv2 bert-base-multilingual-cased; do
    model_name=$(basename "$model")
    ./scripts/benchmarking/run_sa_benchmark.sh "$model" all "./results/sa_${model_name}"
done

# Compare multiple models on NER
for model in gizadatateam/ModernAraBERT aubmindlab/bert-base-arabertv2 bert-base-multilingual-cased; do
    model_name=$(basename "$model")
    ./scripts/benchmarking/run_ner_benchmark.sh "$model" "./results/ner_${model_name}"
done
```
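
If you prefer to drive the comparison from a notebook cell, the same loops can be sketched in Python. The version below only builds the argument lists (a dry run) and executes nothing. The script paths, model IDs, and output layout are copied from the shell loops above; the helper name `build_commands` is hypothetical.

```python
from pathlib import PurePosixPath

MODELS = [
    "gizadatateam/ModernAraBERT",
    "aubmindlab/bert-base-arabertv2",
    "bert-base-multilingual-cased",
]

def build_commands(models, results_root="./results"):
    """Return one (sa_cmd, ner_cmd) pair of argument lists per model; nothing runs."""
    commands = []
    for model in models:
        name = PurePosixPath(model).name  # same result as `basename` for these IDs
        sa_cmd = [
            "./scripts/benchmarking/run_sa_benchmark.sh",
            model, "all", f"{results_root}/sa_{name}",
        ]
        ner_cmd = [
            "./scripts/benchmarking/run_ner_benchmark.sh",
            model, f"{results_root}/ner_{name}",
        ]
        commands.append((sa_cmd, ner_cmd))
    return commands

# Dry run: print the commands instead of executing them
for sa_cmd, ner_cmd in build_commands(MODELS):
    print(" ".join(sa_cmd))
    print(" ".join(ner_cmd))
```

To actually launch a run, each list could be passed to `subprocess.run(cmd, check=True)` from the repository root.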


## 📖 Additional Resources

- **Detailed Benchmarking Guide**: [docs/BENCHMARKING.md](../docs/BENCHMARKING.md)
- **Dataset Documentation**: [docs/DATASETS.md](../docs/DATASETS.md)
- **Paper Results**: [results/README.md](../results/README.md)
- **Configuration Files**: [configs/](../configs/)

---

**For production benchmarking, always use the scripts in `scripts/benchmarking/` for consistent and reproducible results!**
