# ModernAraBERT Pretraining Walkthrough

**Complete guide to pretraining ModernAraBERT from scratch**

This notebook covers:
1. Data collection and preprocessing
2. Tokenizer vocabulary extension
3. Model training with MLM objective
4. Checkpointing and evaluation

**Note**: This is a walkthrough tutorial. For production pretraining, use the provided scripts in `scripts/pretraining/`.

---

## 📋 Prerequisites

Ensure you have the required dependencies:


In [None]:
# Install required packages
# !pip install transformers torch datasets accelerate tokenizers farasa gdown rarfile PyYAML psutil

import sys
import os
from pathlib import Path

# Add repository root to path
# Step up one level only when the notebook is launched from the notebooks/ directory itself
REPO_ROOT = Path.cwd().parent if Path.cwd().name == "notebooks" else Path.cwd()
sys.path.insert(0, str(REPO_ROOT))

print(f"Repository root: {REPO_ROOT}")
print("✅ Environment setup complete")
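Before moving on, it can help to confirm that the dependencies are actually importable. The check below is a minimal sketch: `find_missing_modules` is a hypothetical helper, and the import names are assumptions derived from the pip command above (PyYAML is imported as `yaml`; the rest are assumed to match their package names).

```python
from importlib.util import find_spec

# Import names assumed for the packages in the pip command above.
# PyYAML installs the module `yaml`; the others are assumed to match their pip names.
REQUIRED_MODULES = [
    "transformers", "torch", "datasets", "accelerate", "tokenizers",
    "farasa", "gdown", "rarfile", "yaml", "psutil",
]


def find_missing_modules(module_names):
    """Return the module names that cannot be located in this environment."""
    return [name for name in module_names if find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print(f"Missing modules: {', '.join(missing)}")
else:
    print("All required modules are importable")
```

`find_spec` only locates each module without importing it, so the check stays fast even for heavy packages such as `torch`.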


## 📥 Step 1: Data Collection

Download pretraining datasets from the configured sources.


In [None]:
from src.pretraining.data_collection import download_and_extract_all_datasets

# Configure paths
links_json = REPO_ROOT / "data" / "links.json"
raw_data_dir = REPO_ROOT / "data" / "raw"

print("📥 Data Collection")
print(f"Links file: {links_json}")
print(f"Output directory: {raw_data_dir}")
print("\nNote: This may take a while depending on your internet connection...")
print("For production, use: python scripts/pretraining/run_data_collection.py")

# Uncomment to actually download (can take hours):
# download_and_extract_all_datasets(str(links_json), str(raw_data_dir))


## 🔧 Step 2: Data Preprocessing

Process the raw Arabic text data with:
- Diacritics removal
- Tatweel (elongation) removal
- English word filtering
- Word count filtering (100-8000 words per document)
- Farasa morphological segmentation


In [None]:
from src.pretraining.data_preprocessing import (
    normalize_arabic_text,
    process_text_files_parallel,
    segment_text_files_farasa,
    split_data
)

# Example: Normalize Arabic text
sample_text = "الحَمْدُ لِلَّهِ رَبِّ الْعَالَمِيـــــنَ"  # With diacritics and tatweel
normalized = normalize_arabic_text(sample_text)

print("🔧 Text Normalization Example:")
print(f"Original:   {sample_text}")
print(f"Normalized: {normalized}")
print("\n✅ Preprocessing removes:")
print("  - Diacritics (الحَمْدُ → الحمد)")
print("  - Tatweel/elongation (الْعَالَمِيـــــنَ → العالمين)")
print("  - English words")
print("  - Documents with <100 or >8000 words")
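To make the normalization steps concrete, here is a minimal standalone sketch of diacritics removal, tatweel removal, English word filtering, and the word-count filter. It is not the repository's `normalize_arabic_text`; the function names `normalize_sketch` and `within_word_limits` are hypothetical, and the exact character ranges the project strips are an assumption.

```python
import re

# Arabic diacritics (tashkeel) U+064B..U+0652, plus the superscript alef U+0670.
# The exact set removed by the repository's implementation is an assumption.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = re.compile(r"\u0640")          # elongation character
ENGLISH_WORD = re.compile(r"[A-Za-z]+")  # runs of Latin letters


def normalize_sketch(text):
    """Strip diacritics, tatweel, and Latin-letter words, then collapse whitespace."""
    text = DIACRITICS.sub("", text)
    text = TATWEEL.sub("", text)
    text = ENGLISH_WORD.sub(" ", text)
    return " ".join(text.split())


def within_word_limits(text, min_words=100, max_words=8000):
    """Keep only documents whose whitespace word count lies in [min_words, max_words]."""
    return min_words <= len(text.split()) <= max_words


sample = "\u0643\u064E\u062A\u064E\u0640\u0640\u0628\u064E hello"  # diacritized word with tatweel + English
print(normalize_sketch(sample))
print(within_word_limits(normalize_sketch(sample)))
```

Farasa segmentation is deliberately left out of this sketch, since it depends on the external Farasa toolkit.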


## 📚 Step 3: Tokenizer Extension

Extend ModernBERT's vocabulary with 80,000 Arabic-specific tokens learned from the preprocessed corpus.


In [None]:
from transformers import AutoTokenizer, AutoModel

# Load base ModernBERT
base_model_name = "answerdotai/ModernBERT-base"
base_tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModel.from_pretrained(base_model_name)

print("📚 Tokenizer Extension")
print(f"Base Model: {base_model_name}")
print(f"Base Vocabulary Size: {len(base_tokenizer):,} tokens")
print(f"Base Model Parameters: {sum(p.numel() for p in base_model.parameters()):,}")
print("\n✅ Extension Process:")
print("  1. Analyze vocabulary frequency in Arabic corpus")
print("  2. Select top 80,000 most frequent Arabic tokens")
print("  3. Add tokens to vocabulary (handling Farasa '+' segmentation markers)")
print("  4. Resize model embeddings to accommodate new tokens")
print("  5. New vocabulary size: ~130,000 tokens (~50,000 base + up to 80,000 new)")
print("\nFor production, use: python scripts/pretraining/run_tokenizer_extension.py")
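The extension process listed above can be sketched roughly as follows. This is an illustration under assumptions, not the logic of `run_tokenizer_extension.py`: `select_new_tokens` and `extend_tokenizer_and_model` are hypothetical helpers, candidate tokens are taken to be whitespace-delimited units of the segmented corpus, and handling of the `+` segmentation markers is omitted. The two Hugging Face calls used, `tokenizer.add_tokens` and `model.resize_token_embeddings`, are the standard way to grow a vocabulary.

```python
from collections import Counter


def select_new_tokens(corpus_lines, existing_vocab, max_new_tokens=80_000):
    """Pick the most frequent whitespace-delimited tokens not already in the vocabulary."""
    counts = Counter()
    for line in corpus_lines:
        counts.update(line.split())
    candidates = [tok for tok, _ in counts.most_common() if tok not in existing_vocab]
    return candidates[:max_new_tokens]


def extend_tokenizer_and_model(tokenizer, model, new_tokens):
    """Add tokens to a Hugging Face tokenizer and resize the model's embedding matrix."""
    num_added = tokenizer.add_tokens(new_tokens)
    model.resize_token_embeddings(len(tokenizer))
    return num_added


# Usage with the objects loaded above (commented out because it needs the corpus on disk):
# with open("path/to/segmented_corpus.txt", encoding="utf-8") as f:  # hypothetical path
#     new_tokens = select_new_tokens(f, base_tokenizer.get_vocab())
# extend_tokenizer_and_model(base_tokenizer, base_model, new_tokens)
```

The embedding rows created for the added tokens are newly initialized, which is one reason the extended model needs further MLM pretraining before it is useful.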


## 🎯 Step 4: Pretraining with MLM

Train the extended model using Masked Language Modeling (MLM) objective.
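As a rough illustration of the MLM objective, the sketch below masks about 15% of the tokens in a sequence and builds the matching labels. It is deliberately simplified and is not the project's data collator: `mask_tokens` is a hypothetical helper, every selected token is replaced by the mask token (the original BERT recipe also leaves some selected tokens unchanged or swaps in random ones), and the token IDs in the usage lines are made up.

```python
import random

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_tokens(token_ids, mask_token_id, special_ids=frozenset(), mask_prob=0.15, seed=None):
    """Return (inputs, labels): masked positions keep the original ID as their label,
    all other positions get IGNORE_INDEX so they do not contribute to the loss."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(token_ids):
        if tok in special_ids:
            continue  # never mask special tokens such as [CLS] or [SEP]
        if rng.random() < mask_prob:
            labels[i] = tok
            inputs[i] = mask_token_id
    return inputs, labels


# Hypothetical IDs: 1 and 2 stand in for special tokens, 4 for the mask token.
ids = [1, 101, 102, 103, 104, 105, 106, 107, 2]
masked_inputs, labels = mask_tokens(ids, mask_token_id=4, special_ids={1, 2}, seed=0)
print(masked_inputs)
print(labels)
```

The model is trained to predict the original ID at each masked position, and the loss is averaged over those positions only.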


In [None]:
print("🎯 MLM Pretraining Configuration")
print("\nTraining Setup:")
print("  - Objective: Masked Language Modeling (15% masking)")
print("  - Epochs: 3 (2 @ 128 tokens, 1 @ 512 tokens)")
print("  - Optimizer: AdamW (lr=5e-5)")
print("  - Scheduler: Cosine with warmup")
print("  - Batch Size: 32 per device")
print("  - Gradient Accumulation: 4 steps")
print("  - Mixed Precision: FP16")
print("  - Hardware: NVIDIA A100 40GB")
print("\nAdvanced Features:")
print("  ✅ Distributed training with Accelerate")
print("  ✅ Automatic checkpointing")
print("  ✅ Memory profiling")
print("  ✅ torch.compile optimization")
print("\nFor production training:")
print("  python scripts/pretraining/run_pretraining.py --config configs/pretraining_config.yaml")
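Two items in the configuration above lend themselves to a quick worked example: the effective batch size and the learning-rate schedule. The sketch below assumes linear warmup followed by cosine decay to zero; `cosine_lr_with_warmup` is a hypothetical helper, and the step counts are illustrative because the walkthrough does not state the warmup length or the total number of steps.

```python
import math


def cosine_lr_with_warmup(step, total_steps, warmup_steps, peak_lr=5e-5):
    """Linear warmup from 0 to peak_lr, then cosine decay from peak_lr to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Effective batch size per device: 32 sequences x 4 accumulation steps.
per_device_batch = 32
grad_accum_steps = 4
effective_batch_per_device = per_device_batch * grad_accum_steps
print(f"Sequences per optimizer step (per device): {effective_batch_per_device}")

# Illustrative schedule: 10,000 total steps with 1,000 warmup steps (assumed values).
for step in (0, 500, 1_000, 5_500, 10_000):
    print(f"step {step:>6}: lr = {cosine_lr_with_warmup(step, 10_000, 1_000):.2e}")
```

With multiple GPUs, the global batch size is this per-device figure multiplied by the number of devices.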


## 📊 Step 5: Monitoring and Evaluation

Track training progress and evaluate the model.


In [None]:
print("📊 Training Monitoring")
print("\nMetrics Tracked:")
print("  - MLM Loss (training and validation)")
print("  - MLM Accuracy (% of correct predictions)")
print("  - Perplexity")
print("  - Learning rate schedule")
print("  - Memory usage (RAM + VRAM)")
print("  - Training throughput (samples/second)")
print("\nCheckpointing:")
print("  - Auto-save every 1000 steps")
print("  - Keep best 3 checkpoints")
print("  - Resume from any checkpoint")
print("\nExpected Results (3 epochs on ~17GB data):")
print("  - Training time: ~24-48 hours on A100")
print("  - Final MLM loss: ~2.5-3.0")
print("  - Final perplexity: ~12-20")
print("  - Memory usage: ~35GB VRAM")
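The loss and perplexity figures above are two views of the same quantity: perplexity is the exponential of the mean cross-entropy loss, assuming the loss is measured in nats as PyTorch reports it. The sketch below shows that relationship along with a simple MLM accuracy computation; `perplexity` and `mlm_accuracy` are hypothetical helper names, not functions from the repository.

```python
import math


def perplexity(mean_loss):
    """Perplexity of a model whose mean cross-entropy loss (in nats) is mean_loss."""
    return math.exp(mean_loss)


def mlm_accuracy(predictions, labels, ignore_index=-100):
    """Fraction of masked positions predicted correctly; unmasked positions are skipped."""
    scored = [(p, y) for p, y in zip(predictions, labels) if y != ignore_index]
    if not scored:
        return 0.0
    return sum(p == y for p, y in scored) / len(scored)


for loss in (2.5, 3.0):
    print(f"loss={loss:.1f} -> perplexity={perplexity(loss):.1f}")
```

A loss of 2.5 corresponds to a perplexity of about 12.2 and a loss of 3.0 to about 20.1, which is consistent with the expected ranges listed above.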


## 🚀 Production Workflow

Complete command sequence for pretraining:


```bash
# 1. Download datasets
python scripts/pretraining/run_data_collection.py \
    --links-json data/links.json \
    --output-dir data/raw

# 2. Preprocess data (full pipeline)
python scripts/pretraining/run_data_preprocessing.py \
    --input-dir data/raw \
    --output-dir data/processed \
    --all

# 3. Extend tokenizer
python scripts/pretraining/run_tokenizer_extension.py \
    --model-name answerdotai/ModernBERT-base \
    --input-dir data/processed/segmented \
    --output-dir models/modernarabert_extended \
    --max-vocab-size 80000

# 4. Run pretraining (single GPU)
python scripts/pretraining/run_pretraining.py \
    --config configs/pretraining_config.yaml

# 4b. Run pretraining (multi-GPU with Accelerate)
accelerate launch scripts/pretraining/run_pretraining.py \
    --config configs/pretraining_config.yaml
```

---

## 📖 Additional Resources

- **Detailed Guide**: [docs/PRETRAINING.md](../docs/PRETRAINING.md)
- **Configuration**: [configs/pretraining_config.yaml](../configs/pretraining_config.yaml)
- **Source Code**: [src/pretraining/](../src/pretraining/)

---

**Next**: Check out `03_benchmarking_examples.ipynb` to evaluate your trained model!
