# Arabic Question Answering with AraBERT and AraELECTRA

This notebook implements fine-tuning for three Arabic NLP models on a Question Answering (QA) task:
1. **AraBERTv2-base**
2. **AraBERTv0.2-large**
3. **AraELECTRA-base-discriminator**

## Specifications

| Model | Architecture | Layers | Hidden | Heads | Params | Preprocessing |
|-------|--------------|--------|--------|-------|--------|---------------|
| **AraBERTv2-base** | BERT-base | 12 | 768 | 12 | ~136M | Farasa Segmentation |
| **AraBERTv0.2-large** | BERT-large | 24 | 1024 | 16 | ~371M | None |
| **AraELECTRA-base** | ELECTRA-base | 12 | 768 | 12 | ~136M | None |

## Metrics
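- **F1 Score**: Harmonic mean of token-level precision and recall between the predicted answer and the gold answer.
- **Exact Match (EM)**: 1 if the predicted answer string equals the gold answer after normalization, otherwise 0.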
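
As a minimal sketch of how these two metrics behave on a single prediction, assuming whitespace tokenization and `strip()` as the only normalization (the names `token_f1` and `exact_match` are illustrative, not taken from a library):

```python
from collections import Counter

def token_f1(gold: str, pred: str) -> float:
    """Token-level F1 between a gold answer and a predicted answer."""
    gold_toks, pred_toks = gold.strip().split(), pred.strip().split()
    if not gold_toks or not pred_toks:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(gold_toks == pred_toks)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(gold_toks) & Counter(pred_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def exact_match(gold: str, pred: str) -> int:
    """1 if the stripped strings are identical, else 0."""
    return int(gold.strip() == pred.strip())

# Toy answer pair: the prediction misses the first gold token.
gold = "في مدينة القاهرة"
pred = "مدينة القاهرة"
print(token_f1(gold, pred))
print(exact_match(gold, pred))
```

Here the prediction shares two of the three gold tokens, so precision is 1, recall is 2/3, and F1 is about 0.8, while EM is 0. F1 rewards partial overlap that EM ignores, which is why the two are usually reported together.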

In [1]:
# Install dependencies
!pip install transformers datasets torch scikit-learn polars accelerate arabert
!pip install farasapy

In [2]:
import os
import json
import numpy as np
import polars as pl
import torch
from transformers import (
    AutoTokenizer, 
    AutoModelForQuestionAnswering, 
    TrainingArguments, 
    Trainer, 
    DefaultDataCollator,
    pipeline
)
from datasets import Dataset, DatasetDict
from arabert.preprocess import ArabertPreprocessor
from sklearn.metrics import f1_score
import collections

# Set seed for reproducibility
torch.manual_seed(42)

## 1. Data Loading and Preparation
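We load the merged dataset from `dataset/master_data.csv`. The dataset provides only the answer text, so we compute each answer's start character position in its context with `str.find`, which returns the first occurrence. Rows whose answer does not appear verbatim in the context are dropped.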

In [None]:
# Load Data
data_path = "dataset/master_data.csv"
df = pl.read_csv(data_path)

# Filter out invalid rows (missing question, context, or answer)
df = df.filter(
    pl.col("question").is_not_null() & 
    pl.col("context").is_not_null() & 
    pl.col("answer").is_not_null()
)

print(f"Total samples: {len(df)}")

# Convert to Python list of dicts for easier processing
data = df.to_dicts()

# Find Answer Start Positions
valid_data = []
for item in data:
    context = item['context']
    answer = item['answer']
    
    # Find start index
    start_idx = context.find(answer)
    
    if start_idx != -1:
        # Create SQuAD-like dictionary structure
        valid_data.append({
            'id': str(item['id']),
            'context': context,
            'question': item['question'],
            'answers': {
                'text': [answer],
                'answer_start': [start_idx]
            }
        })

print(f"Valid samples after finding start positions: {len(valid_data)}")

# Convert to Hugging Face Dataset
full_dataset = Dataset.from_list(valid_data)

# Split into Train (80%) and Test (20%)
dataset_dict = full_dataset.train_test_split(test_size=0.2, seed=42)
print(dataset_dict)
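
A quick sanity check on the records built above is to confirm that slicing the context at `answer_start` reproduces the answer text. The sketch below assumes the same SQuAD-like layout; the helper name `span_is_consistent` and the toy record are illustrative only.

```python
def span_is_consistent(example: dict) -> bool:
    """True if slicing the context at answer_start reproduces the answer text."""
    text = example["answers"]["text"][0]
    start = example["answers"]["answer_start"][0]
    return example["context"][start:start + len(text)] == text

# Toy record in the same SQuAD-like layout (illustrative values, not from the dataset).
context = "تقع القاهرة على نهر النيل"
answer = "نهر النيل"
example = {
    "id": "0",
    "context": context,
    "question": "على أي نهر تقع القاهرة؟",
    "answers": {"text": [answer], "answer_start": [context.find(answer)]},
}
print(span_is_consistent(example))

# A shifted offset is detected as inconsistent.
broken = dict(example, answers={"text": [answer], "answer_start": [0]})
print(span_is_consistent(broken))
```

Running such a check over every record before tokenization catches offset errors early, since a wrong `answer_start` otherwise only shows up as poor training labels.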

## 2. Preprocessing Functions

The three models do not share the same input preparation: **AraBERTv2-base** expects text that has been pre-segmented with Farasa, while the other two take raw text.
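
Pre-segmentation rewrites the text, so character offsets computed on the raw context generally stop pointing at the answer afterwards. The sketch below illustrates the effect with a hypothetical `toy_segment` function that only splits the prefix `ال` off each word. It is a stand-in for illustration, not Farasa and not part of the `arabert` API.

```python
def toy_segment(text: str) -> str:
    """Hypothetical stand-in for a segmenter: splits the prefix 'ال' off each word.

    This is NOT Farasa; it only mimics the effect of inserting segment markers.
    """
    out = []
    for word in text.split():
        if word.startswith("ال") and len(word) > 2:
            out.append("ال+ " + word[2:])
        else:
            out.append(word)
    return " ".join(out)

raw_context = "تقع القاهرة على نهر النيل"
raw_answer = "نهر النيل"
raw_start = raw_context.find(raw_answer)

seg_context = toy_segment(raw_context)
seg_answer = toy_segment(raw_answer)

# The raw offset no longer lines up with the answer in the segmented text.
stale = seg_context[raw_start:raw_start + len(seg_answer)]

# Recomputing the offset on the segmented text restores the alignment.
new_start = seg_context.find(seg_answer)
fresh = seg_context[new_start:new_start + len(seg_answer)]

print(stale == seg_answer, fresh == seg_answer)
```

The practical consequence is that, for a model that needs segmentation, the answer text should go through the same preprocessing as the context and its start offset should be located again in the preprocessed context.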

In [None]:
def prepare_train_features(examples, tokenizer, max_length=512, doc_stride=128):
    # Tokenize the examples with truncation and padding, keeping the overflow via a stride.
    # A long context can therefore produce several features for one example; each feature
    # carries an "overflow_to_sample_mapping" entry that maps it back to the original example.
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",  # Truncate to max_length. The context is the second sequence.
        max_length=max_length,
        stride=doc_stride,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    sample_mapping = tokenized_examples.pop("overflow_to_sample_mapping")
    offset_mapping = tokenized_examples.pop("offset_mapping")

    tokenized_examples["start_positions"] = []
    tokenized_examples["end_positions"] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples["input_ids"][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        sequence_ids = tokenized_examples.sequence_ids(i)

        sample_index = sample_mapping[i]
        answers = examples["answers"][sample_index]

        # If no answers are given, set the cls_index as answer.
        if len(answers["answer_start"]) == 0:
            tokenized_examples["start_positions"].append(cls_index)
            tokenized_examples["end_positions"].append(cls_index)
        else:
            # Start/end character index of the answer in the text.
            start_char = answers["answer_start"][0]
            end_char = start_char + len(answers["text"][0])

            # Start token index of the current span in the text.
            token_start_index = 0
            while sequence_ids[token_start_index] != 1:
                token_start_index += 1

            # End token index of the current span in the text.
            token_end_index = len(input_ids) - 1
            while sequence_ids[token_end_index] != 1:
                token_end_index -= 1

            # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index).
            if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char):
                tokenized_examples["start_positions"].append(cls_index)
                tokenized_examples["end_positions"].append(cls_index)
            else:
                # Otherwise move the token_start_index and token_end_index to the two ends of the answer.
                while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
                    token_start_index += 1
                tokenized_examples["start_positions"].append(token_start_index - 1)

                while token_end_index >= 0 and offsets[token_end_index][1] >= end_char:
                    token_end_index -= 1
                tokenized_examples["end_positions"].append(token_end_index + 1)

    return tokenized_examples
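
To make the windowing and labeling logic above easier to follow, here is a simplified sketch that uses whitespace tokens instead of a real tokenizer. The helper names (`whitespace_offsets`, `make_windows`, `label_window`) are hypothetical, and `stride` is taken to mean the number of tokens shared by consecutive windows.

```python
import re

def whitespace_offsets(text):
    """(start, end) character offsets of each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def make_windows(offsets, max_tokens, stride):
    """Split token offsets into overlapping windows; `stride` tokens overlap.

    Assumes max_tokens > stride, otherwise the window would never advance.
    """
    windows, start = [], 0
    while True:
        windows.append(offsets[start:start + max_tokens])
        if start + max_tokens >= len(offsets):
            break
        start += max_tokens - stride
    return windows

def label_window(window, start_char, end_char):
    """Return (start_token, end_token) within the window, or None if the
    answer is not fully contained in it."""
    if not (window[0][0] <= start_char and window[-1][1] >= end_char):
        return None
    # Last token starting at or before the answer start.
    start_tok = max(i for i, (s, _) in enumerate(window) if s <= start_char)
    # First token ending at or after the answer end.
    end_tok = min(i for i, (_, e) in enumerate(window) if e >= end_char)
    return start_tok, end_tok

context = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
answer = "w6 w7"
start_char = context.find(answer)
end_char = start_char + len(answer)

offsets = whitespace_offsets(context)
labels = [label_window(w, start_char, end_char)
          for w in make_windows(offsets, max_tokens=4, stride=2)]
print(labels)  # [None, None, (2, 3), (0, 1)]
```

The first two windows do not contain the answer, which is the case the real function labels with the CLS index. The last two windows both contain it, at different token positions, so one example yields two positive training features.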

## 3. Metrics
Token-level start/end accuracy for logging during training, plus text-level F1 and Exact Match (EM) helpers.

In [None]:
# Uses `np` and `collections`, both imported in the setup cell above.

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    start_logits, end_logits = predictions
    
    # Select the token with highest score for start and end
    start_preds = np.argmax(start_logits, axis=-1)
    end_preds = np.argmax(end_logits, axis=-1)
    
    # Token-level accuracy, intended for lightweight logging during training.
    # A full QA evaluation maps predicted spans back to text and scores them with F1/EM.
    # Here a prediction counts as correct only when the predicted index equals the label index.
    
    total = len(start_preds)
    correct_start = np.sum(start_preds == labels[0])
    correct_end = np.sum(end_preds == labels[1])
    
    return {
        'start_accuracy': correct_start / total,
        'end_accuracy': correct_end / total
    }

# Helper for Post-Processing to get F1/EM on text
def normalize_text(text):
    return text.strip()

def compute_f1(a_gold, a_pred):
    gold_toks = normalize_text(a_gold).split()
    pred_toks = normalize_text(a_pred).split()
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if len(gold_toks) == 0 or len(pred_toks) == 0:
        return int(gold_toks == pred_toks)
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(pred_toks)
    recall = 1.0 * num_same / len(gold_toks)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

def compute_exact(a_gold, a_pred):
    return int(normalize_text(a_gold) == normalize_text(a_pred))
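
The helpers above score answer strings, so the model's start and end logits first have to be turned into text. A minimal sketch of that step follows, assuming one feature per example and an offsets list in which non-context tokens are marked `None`. The function name `best_span` and its parameters are hypothetical, and the logits are toy values.

```python
def best_span(start_logits, end_logits, offsets, context,
              n_best=5, max_answer_tokens=30):
    """Pick the highest-scoring valid (start, end) token pair and return its text."""
    top_starts = sorted(range(len(start_logits)),
                        key=lambda i: start_logits[i], reverse=True)[:n_best]
    top_ends = sorted(range(len(end_logits)),
                      key=lambda i: end_logits[i], reverse=True)[:n_best]
    best_score, best_text = float("-inf"), ""
    for s in top_starts:
        for e in top_ends:
            # Skip special/question tokens, which carry no context offsets.
            if offsets[s] is None or offsets[e] is None:
                continue
            # Skip spans that end before they start or are implausibly long.
            if e < s or e - s + 1 > max_answer_tokens:
                continue
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score = score
                best_text = context[offsets[s][0]:offsets[e][1]]
    return best_text

context = "تقع القاهرة على نهر النيل"
# Toy feature: [CLS], five context tokens, [SEP]. Values are illustrative only.
offsets = [None, (0, 3), (4, 11), (12, 15), (16, 19), (20, 25), None]
start_logits = [0.1, 0.2, 0.3, 0.1, 2.5, 0.4, 0.0]
end_logits = [0.1, 0.0, 0.2, 0.3, 0.5, 3.0, 0.0]

answer_text = best_span(start_logits, end_logits, offsets, context)
print(answer_text)
```

Scoring pairs jointly, rather than taking two independent `argmax` calls, avoids predictions whose end precedes their start. The resulting string can then be passed to `compute_f1` and `compute_exact` together with the gold answer.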

## 4. Models Configuration
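For each of the three models we define the checkpoint, whether Farasa segmentation is required, the batch size, and the learning rate.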

In [None]:
MODELS_CONFIG = {
    "AraBERTv2-base": {
        "checkpoint": "aubmindlab/bert-base-arabertv2",
        "use_farasa": True,
        "batch_size": 16,
        "lr": 3e-5
    },
    "AraBERTv0.2-large": {
        "checkpoint": "aubmindlab/bert-large-arabertv02",
        "use_farasa": False,
        "batch_size": 8, # Smaller batch for large model
        "lr": 2e-5
    },
    "AraELECTRA-base": {
        "checkpoint": "aubmindlab/araelectra-base-discriminator",
        "use_farasa": False,
        "batch_size": 16,
        "lr": 5e-5
    }
}

## 5. Training Loop
We define a function to train a specific model.

In [None]:
def train_model(model_name, config):
    print(f"\n=== Starting Training for {model_name} ===")
    print(f"Checkpoint: {config['checkpoint']}")
    
    # 1. Initialize Tokenizer & Preprocessor
    if config['use_farasa']:
        # Initialize ArabertPreprocessor which uses Farasa
        # Note: This might take time to initialize the JVM for Farasa
        arabert_prep = ArabertPreprocessor(model_name=config['checkpoint'])
        
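        # Custom function to apply preprocessing before tokenization
        def preprocess_data(examples):
            examples['question'] = [arabert_prep.preprocess(q) for q in examples['question']]
            examples['context'] = [arabert_prep.preprocess(c) for c in examples['context']]
            # Segmentation changes the text, so the raw answer offsets are stale:
            # preprocess each answer and locate it again in the segmented context.
            realigned = []
            for ctx, ans in zip(examples['context'], examples['answers']):
                text = arabert_prep.preprocess(ans['text'][0])
                start = ctx.find(text)
                if start == -1:
                    # Not found after segmentation: treat as unanswerable (labeled with CLS later)
                    realigned.append({'text': [], 'answer_start': []})
                else:
                    realigned.append({'text': [text], 'answer_start': [start]})
            examples['answers'] = realigned
            return examples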
            
        print("Applying Farasa Segmentation...")
        processed_dataset = dataset_dict.map(preprocess_data, batched=True)
    else:
        processed_dataset = dataset_dict

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(config['checkpoint'])
    
    # 2. Tokenize Dataset
    print("Tokenizing dataset...")
    tokenized_datasets = processed_dataset.map(
        lambda x: prepare_train_features(x, tokenizer),
        batched=True,
        remove_columns=processed_dataset["train"].column_names,
    )
    
    # 3. Model
    model = AutoModelForQuestionAnswering.from_pretrained(config['checkpoint'])
    
    # 4. Training Arguments
    args = TrainingArguments(
        output_dir=f"./results/{model_name}",
        # Note: recent transformers releases rename this argument to `eval_strategy`
        evaluation_strategy="epoch",
        learning_rate=config['lr'],
        per_device_train_batch_size=config['batch_size'],
        per_device_eval_batch_size=config['batch_size'],
        num_train_epochs=3,
        weight_decay=0.01,
        save_strategy="epoch",
        logging_steps=50,
        push_to_hub=False,
    )
    
    # 5. Trainer
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["test"],
        tokenizer=tokenizer,
        data_collator=DefaultDataCollator(),
    )
    
    # Train
    trainer.train()
    
    # Save
    trainer.save_model(f"./models/{model_name}_finetuned")
    print(f"Training complete. Model saved to ./models/{model_name}_finetuned")
    
    return trainer, tokenized_datasets['test']

## 6. Execution
Run the training for each model.

In [None]:
# Train Model 1: AraBERTv2-base
# Note: This requires Farasa, and segmentation might take time
trainer_v2, test_v2 = train_model("AraBERTv2-base", MODELS_CONFIG["AraBERTv2-base"])

In [None]:
# Train Model 2: AraBERTv0.2-large
# Run only when a GPU with more than roughly 12 GB of memory is available
if torch.cuda.is_available() and torch.cuda.get_device_properties(0).total_memory > 12_000_000_000:
    trainer_large, test_large = train_model("AraBERTv0.2-large", MODELS_CONFIG["AraBERTv0.2-large"])
else:
    print("Skipping large model training in this run due to potential memory constraints. Uncomment the line below to force it.")
    # trainer_large, test_large = train_model("AraBERTv0.2-large", MODELS_CONFIG["AraBERTv0.2-large"])

In [None]:
# Train Model 3: AraELECTRA-base
trainer_electra, test_electra = train_model("AraELECTRA-base", MODELS_CONFIG["AraELECTRA-base"])