![image.png](https://i.imgur.com/a3uAqnb.png)

## From Self-Attention to Transformers: Architecture & Intuition

Now that we understand **Self-Attention**, we're ready to see how it's used to build one of the most powerful architectures in deep learning:

# ‚öôÔ∏è Transformers

---

## üß© What Is a Transformer?

A **Transformer** is a neural network architecture built entirely using **self-attention layers**, without recurrence or convolution.

> Introduced in the paper:  
> *"Attention is All You Need"* (Vaswani et al., 2017)

Instead of processing tokens **step-by-step** (like RNNs), Transformers allow **parallel processing** of entire sequences using self-attention.

---

## üß± Transformer Architecture Overview

A **Transformer** is composed of:

- üîÑ **Multi-Head Self-Attention** (each head learns a different attention pattern)
- üîß **Feedforward Neural Network** (applied after attention)
- ‚ûï **Add & LayerNorm** (residual connection and normalization)
- üìê **Positional Encoding** (since attention is order-agnostic)

Each block looks like this:

![1*vrSX_Ku3EmGPyqF_E-2_Vg.webp](https://i.imgur.com/2toEGWJ.png)

# üåü BERT: Bidirectional Encoder Representations from Transformers

## üìò Introduction
**BERT** is a language representation model developed by Google in 2018. Unlike traditional models that read text either left-to-right or right-to-left, BERT reads in **both directions simultaneously** using the Transformer architecture.

> üß† BERT is pre-trained on a large text corpus and can be fine-tuned for a variety of downstream NLP tasks (e.g., classification, QA, NER).

---

## üåç Motivation

Before BERT, many NLP models processed language in a **unidirectional** way ‚Äî either left-to-right or right-to-left. This limited their ability to understand the **full context** of a word within a sentence.

> üîë **BERT was designed to deeply understand language context by reading in both directions ‚Äî simultaneously.**

### üîÅ Why Bidirectional?
- Words gain meaning from surrounding words.
- Example: 
  - Sentence: "He wore a mask to the party."
  - Sentence: "He wore a mask during surgery."

In both cases, the word *mask* has different meanings. BERT looks **left and right** to understand which meaning is correct.

---

### üß© Key Ideas:
- **Masked Language Modeling (MLM):**  
Randomly hides some words and asks the model to guess them based on the full context.

- **Next Sentence Prediction (NSP):**  
Helps BERT understand how sentences relate to each other ‚Äî useful for tasks like QA and dialogue.

---
![emb.png](https://i.imgur.com/uXZIn9Y.png)
## üìö References

- Paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)](https://arxiv.org/abs/1810.04805)
- HuggingFace Transformers: https://huggingface.co/transformers/

In [None]:
# !pip install transformers datasets torch arabic-reshaper python-bidi
# !pip install accelerate evaluate
# !pip install bert-score

**1. Load and Explore the Dataset**

In [3]:
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering,
    TrainingArguments,
    Trainer
)
from datasets import Dataset, load_dataset
import pandas as pd
import numpy as np
from torch.utils.data import DataLoader
import arabic_reshaper
from bidi.algorithm import get_display
import os

# os.environ["WANDB_DISABLED"] = "true"

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Using device: cuda
GPU: NVIDIA GeForce RTX 4070 Ti SUPER


**1. Load and Explore the Dataset**

In [7]:
# Load MLQA Arabic (smaller multilingual QA dataset)
dataset = load_dataset("facebook/mlqa", "mlqa.ar.ar", split="test")
print("MLQA Arabic dataset loaded successfully!")
print(f"Dataset size: {len(dataset)}")

print("\nSample data:")
for i in range(0,2):
    print(f"Context: {dataset[i]['context'][:100]}...")
    print(f"Question: {dataset[i]['question']}")
    print(f"Answer: {dataset[i]['answers']['text'][0]}")
    print("-" * 50)

Downloading data:   0%|          | 0.00/75.7M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/5335 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/517 [00:00<?, ? examples/s]

MLQA Arabic dataset loaded successfully!
Dataset size: 5335

Sample data:
Context: Ÿäÿ≥ÿ™ÿÆÿØŸÖ ŸÜŸÅÿ≥ ŸÜÿ∏ÿßŸÖ ÿ™ÿ≥ŸÖŸäÿ© ÿßŸÑŸÖŸÜÿ∑ŸÇÿ© "xx" ŸÑÿ£ÿ¨ÿ≤ÿßÿ° ÿ£ÿÆÿ±Ÿâ ŸÖŸÜ ŸÖŸàÿßŸÇÿπ ÿßŸÑÿ™ÿ¨ÿßÿ±ÿ® ŸÅŸä ŸÜŸäŸÅÿßÿØÿß.Ÿàÿ™ÿπÿ™ÿ®ÿ± ÿßŸÑŸÇÿßÿπÿØÿ© ÿßŸÑŸÖÿ≥ÿ™ÿ∑ŸäŸÑÿ© ÿß...
Question: ÿ£Ÿä ŸÜŸàÿπ ŸÖŸÜ ÿßŸÑÿ∑ÿ±ŸÇ Ÿäÿ§ÿØŸä ÿ•ŸÑŸâ ŸÖÿ≤ÿßÿ±ÿπ ÿßŸÑŸÖŸàÿßÿ¥Ÿäÿü
Answer: ÿßŸÑÿ∑ÿ±ŸÇ ÿßŸÑÿ™ÿ±ÿßÿ®Ÿäÿ©
--------------------------------------------------
Context: Ÿäÿ≥ÿ™ÿÆÿØŸÖ ŸÜŸÅÿ≥ ŸÜÿ∏ÿßŸÖ ÿ™ÿ≥ŸÖŸäÿ© ÿßŸÑŸÖŸÜÿ∑ŸÇÿ© "xx" ŸÑÿ£ÿ¨ÿ≤ÿßÿ° ÿ£ÿÆÿ±Ÿâ ŸÖŸÜ ŸÖŸàÿßŸÇÿπ ÿßŸÑÿ™ÿ¨ÿßÿ±ÿ® ŸÅŸä ŸÜŸäŸÅÿßÿØÿß.Ÿàÿ™ÿπÿ™ÿ®ÿ± ÿßŸÑŸÇÿßÿπÿØÿ© ÿßŸÑŸÖÿ≥ÿ™ÿ∑ŸäŸÑÿ© ÿß...
Question: ÿ•ŸÑŸâ ÿ£ŸäŸÜ Ÿäÿ§ÿØŸä ÿ∑ÿ±ŸäŸÇ ÿ®ÿ≠Ÿäÿ±ÿ© ÿ¨ÿ±ŸàŸàŸÖ ÿ®ÿßŸÑŸÜÿ≥ÿ®ÿ© ŸÑŸÑÿ®ÿ≠Ÿäÿ±ÿ©ÿü
Answer: ÿ¥ŸÖÿßŸÑ ÿ¥ÿ±ŸÇ
--------------------------------------------------


**2-Load Pre-trained Arabic BERT Model**

In [8]:
# Load Arabic BERT model and tokenizer
model_name = "aubmindlab/bert-base-arabertv2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

print("Model and tokenizer loaded successfully!")
print(f"Model: {model_name}")
print(f"Model type: {type(model).__name__}")
print(f"Tokenizer vocab size: {tokenizer.vocab_size}")

tokenizer_config.json:   0%|          | 0.00/611 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/384 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/543M [00:00<?, ?B/s]

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at aubmindlab/bert-base-arabertv2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model and tokenizer loaded successfully!
Model: aubmindlab/bert-base-arabertv2
Model type: BertForQuestionAnswering
Tokenizer vocab size: 64000


**3-Test BEFORE Training (Baseline Performance)**

In [9]:
def answer_question(question, context):
    """Answer a question given a context using Arabic BERT"""

    # Tokenize input
    inputs = tokenizer(
        question,
        context,
        return_tensors="pt",
        max_length=512,
        truncation=True,
        padding=True
    )

    # Move to device if available
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model.to(device)

    # Get model predictions
    with torch.no_grad():
        outputs = model(**inputs)

    # Get start and end logits
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

    # Find the most likely start and end positions
    start_index = torch.argmax(start_logits, dim=1).item()
    end_index = torch.argmax(end_logits, dim=1).item()

    # Ensure end_index is not before start_index
    if end_index < start_index:
        end_index = start_index

    # Extract answer tokens
    answer_tokens = inputs["input_ids"][0][start_index:end_index+1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

    # Return empty string if no meaningful answer found
    if not answer.strip():
        return "ÿ∫Ÿäÿ± ŸÖÿ≠ÿØÿØ"

    return answer.strip()

In [10]:
# Test with real examples from the ARCD dataset - BEFORE TRAINING
print("üî¥ TESTING BEFORE TRAINING (Baseline Performance)")
print("=" * 60)
print("NOTE: Poor performance expected - this is aubmindlab/bert-base-arabertv2 WITHOUT QA fine-tuning")
print("=" * 60)

# Get a few test examples from a different range to avoid train/eval overlap
test_examples = dataset.select(range(30, 35))  # Use samples 30-35 for testing

correct_predictions = 0
total_predictions = 0

for i, example in enumerate(test_examples):
    context = example['context']
    question = example['question']

    # Get ground truth answer
    ground_truth = example['answers']['text'][0]

    # Get model prediction
    predicted_answer = answer_question(question, context)

    # Check if prediction matches ground truth (case-insensitive)
    is_correct = predicted_answer.strip().lower() == ground_truth.strip().lower()
    if is_correct:
        correct_predictions += 1
    total_predictions += 1

    print(f"Example {i+1}:")
    print(f"Question: {question}")
    print(f"Context snippet: {context[:150]}...")
    print(f"Predicted: '{predicted_answer}'")
    print(f"Ground Truth: '{ground_truth}'")
    print(f"Match: {'‚úÖ' if is_correct else '‚ùå'}")
    print("-" * 50)

# Calculate and display accuracy
baseline_accuracy = (correct_predictions / total_predictions) * 100 if total_predictions > 0 else 0
print(f"\nüî¥ PRE-TRAINING Accuracy: {correct_predictions}/{total_predictions} ({baseline_accuracy:.1f}%)")
print("This baseline will improve significantly after fine-tuning on ARCD!")

üî¥ TESTING BEFORE TRAINING (Baseline Performance)
NOTE: Poor performance expected - this is aubmindlab/bert-base-arabertv2 WITHOUT QA fine-tuning
Example 1:
Question: ŸÖŸÜ ÿØÿπŸÖ ÿßŸÑÿ±ÿ£Ÿä ÿßŸÑÿπÿßŸÖÿü
Context snippet: ŸàŸÖÿπ ÿ∞ŸÑŸÉÿå ÿ∏ŸÑ ÿßŸÑÿπÿßŸÖÿ© Ÿäÿ™ÿπÿßÿ∑ŸÅŸàŸÜ ŸÖÿπ ÿßŸÑŸÖŸÑŸÉÿ© ŸÉÿßÿ´ÿ±ŸäŸÜ. ŸàŸÅŸä ŸÑŸäŸÑÿ© ŸÖŸÜ ÿÆÿ±ŸäŸÅ ÿπÿßŸÖ 1531ÿå ÿ®ŸäŸÜŸÖÿß ŸÉÿßŸÜÿ™ ÿ¢ŸÜ ÿ™ÿ™ŸÜÿßŸàŸÑ ÿ∑ÿπÿßŸÖŸáÿß ŸÅŸä ŸÖŸÜÿ≤ŸÑ ÿ±ŸäŸÅŸä ÿπŸÑŸâ ŸÜŸáÿ± ÿßŸÑÿ™ÿßŸäŸÖÿ≤ÿå Ÿáÿßÿ¨ŸÖŸáÿß ÿ≠ÿ¥ÿØ ŸÜÿ≥ÿßÿ¶Ÿä ÿ∫ÿß...
Predicted: '##ŸÑŸÉÿ© ŸÉÿßÿ´ÿ±ŸäŸÜ. ŸàŸÅŸä ŸÑŸäŸÑÿ© ŸÖŸÜ ÿÆÿ±ŸäŸÅ ÿπÿßŸÖ 1531 ÿå ÿ®ŸäŸÜŸÖÿß ŸÉÿßŸÜÿ™ ÿ¢ŸÜ ÿ™ÿ™ŸÜÿßŸàŸÑ ÿ∑ÿπÿßŸÖŸáÿß ŸÅŸä ŸÖŸÜÿ≤ŸÑ ÿ±ŸäŸÅŸä ÿπŸÑŸâ ŸÜŸáÿ± ÿßŸÑÿ™ÿßŸäŸÖÿ≤ ÿå Ÿáÿßÿ¨ŸÖŸáÿß ÿ≠ÿ¥ÿØ ŸÜÿ≥ÿßÿ¶Ÿä ÿ∫ÿßÿ∂ÿ® ÿå Ÿàÿ™ŸÖŸÉŸÜÿ™ ÿ¢ŸÜ ŸÖŸÜ ÿßŸÑŸáÿ±ÿ® ÿπŸÑŸâ ŸÖÿ™ŸÜ ŸÇÿßÿ±ÿ® ÿ®ÿµÿπŸàÿ®ÿ©. ŸàÿπŸÜÿØŸÖÿß ÿ™ŸàŸÅŸä ÿ±ÿ¶Ÿäÿ≥ ÿ£ÿ≥ÿßŸÇŸÅÿ© ŸÉÿßŸÜÿ™ÿ±ÿ®ÿ±Ÿä ŸàŸäŸÑŸäÿßŸÖ Ÿàÿßÿ±ŸáÿßŸÖ ÿπÿßŸÖ 1532 ÿå ÿ™ŸÖ ÿ™ÿπŸäŸäŸÜ ŸÇÿ≥Ÿäÿ≥ ÿπÿßÿ¶ŸÑÿ© ÿ®ŸàŸÑŸäŸÜ ÿ™ŸàŸÖÿßÿ≥ ŸÉÿ±ÿßŸÜŸÖÿ± ÿ®ŸÖŸàÿßŸÅŸÇÿ© ÿ®ÿßÿ®ŸàŸäÿ©. ŸÜŸÅŸ

In [11]:
from collections import Counter
import re

def normalize_arabic_text(text):
    """Normalize Arabic text for comparison"""
    # Remove diacritics and extra spaces
    text = re.sub(r'[\u064B-\u065F\u0670\u0640]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip().lower()

def calculate_token_f1(predicted, ground_truth):
    """Calculate F1 score based on token overlap"""
    pred_tokens = normalize_arabic_text(predicted).split()
    truth_tokens = normalize_arabic_text(ground_truth).split()

    if len(pred_tokens) == 0 and len(truth_tokens) == 0:
        return 1.0

    if len(pred_tokens) == 0 or len(truth_tokens) == 0:
        return 0.0

    common_tokens = Counter(pred_tokens) & Counter(truth_tokens)
    num_common = sum(common_tokens.values())

    if num_common == 0:
        return 0.0

    precision = num_common / len(pred_tokens)
    recall = num_common / len(truth_tokens)

    return 2 * (precision * recall) / (precision + recall)

def evaluate_model_subset(test_dataset, num_samples=100, model_type="BEFORE"):
    """Evaluate on a subset of the dataset with token F1"""
    
    total_samples = min(num_samples, len(test_dataset))
    print(f"üîç Evaluating Arabic BERT on ARCD Dataset - {model_type} TRAINING")
    print("=" * 70)
    print(f"Evaluating on {total_samples} samples from ARCD dataset")
    print("=" * 70)

    f1_scores = []
    empty_predictions = 0
    exact_matches = 0

    print("Processing samples...")

    for i in range(total_samples):
        example = test_dataset[i]
        context = example['context']
        question = example['question']
        ground_truth = example['answers']['text'][0]

        predicted = answer_question(question, context)

        # Check for empty predictions
        if not predicted.strip() or predicted.strip() == "ÿ∫Ÿäÿ± ŸÖÿ≠ÿØÿØ":
            empty_predictions += 1
            f1_scores.append(0.0)
        else:
            f1 = calculate_token_f1(predicted, ground_truth)
            f1_scores.append(f1)
            
            # Check exact match (case-insensitive)
            if normalize_arabic_text(predicted) == normalize_arabic_text(ground_truth):
                exact_matches += 1

        # Show progress every 20 samples
        if (i + 1) % 20 == 0:
            print(f"Processed {i + 1}/{total_samples} samples...")

    # Calculate metrics
    avg_f1 = sum(f1_scores) / len(f1_scores)
    exact_match_score = exact_matches / total_samples

    print(f"\nüìä {model_type} TRAINING RESULTS:")
    print(f"üéØ Average Token F1: {avg_f1:.3f}")
    print(f"‚úÖ Exact Match Score: {exact_match_score:.3f}")
    print(f"üìä Total Samples: {total_samples}")
    print(f"‚ö†Ô∏è  Empty Predictions: {empty_predictions}")
    print(f"üìà Valid Predictions: {total_samples - empty_predictions}")

    # Additional stats
    non_zero_f1s = [f1 for f1 in f1_scores if f1 > 0]
    if non_zero_f1s:
        avg_non_zero_f1 = sum(non_zero_f1s) / len(non_zero_f1s)
        print(f"üìä Average F1 (excluding empty): {avg_non_zero_f1:.3f}")
    
    return avg_f1, exact_match_score

# Run evaluation BEFORE training
print("\nüî¥ DETAILED EVALUATION BEFORE TRAINING")
baseline_f1, baseline_em = evaluate_model_subset(dataset, num_samples=100, model_type="BEFORE")

print(f"\nüî¥ BASELINE METRICS (BEFORE TRAINING):")
print(f"Token F1: {baseline_f1:.3f}")
print(f"Exact Match: {baseline_em:.3f}")


üî¥ DETAILED EVALUATION BEFORE TRAINING
üîç Evaluating Arabic BERT on ARCD Dataset - BEFORE TRAINING
Evaluating on 100 samples from ARCD dataset
Processing samples...
Processed 20/100 samples...
Processed 40/100 samples...
Processed 60/100 samples...
Processed 80/100 samples...
Processed 100/100 samples...

üìä BEFORE TRAINING RESULTS:
üéØ Average Token F1: 0.024
‚úÖ Exact Match Score: 0.000
üìä Total Samples: 100
‚ö†Ô∏è  Empty Predictions: 0
üìà Valid Predictions: 100
üìä Average F1 (excluding empty): 0.088

üî¥ BASELINE METRICS (BEFORE TRAINING):
Token F1: 0.024
Exact Match: 0.000


**4-Prepare Dataset for Training**

In [12]:
def prepare_training_features(examples):
    """Tokenize questions and contexts from ARCD dataset"""

    # Tokenize the question and context
    tokenized_examples = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding="max_length",
    )

    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offsets in enumerate(tokenized_examples["offset_mapping"]):
        # Map back to original example index
        if "overflow_to_sample_mapping" in tokenized_examples:
            sample_index = tokenized_examples["overflow_to_sample_mapping"][i]
        else:
            sample_index = i
        sample_index = min(sample_index, len(answers) - 1)  # Bounds check

        # Get answer details
        current_answer = answers[sample_index]
        answer_start = current_answer["answer_start"][0]
        answer_text = current_answer["text"][0]
        answer_end = answer_start + len(answer_text)

        # Find token positions
        token_start_index = 0
        token_end_index = 0

        for idx, (start, end) in enumerate(offsets):
            if start is not None and end is not None:
                if start <= answer_start < end:
                    token_start_index = idx
                if start < answer_end <= end:
                    token_end_index = idx
                    break

        start_positions.append(token_start_index)
        end_positions.append(token_end_index)

    tokenized_examples["start_positions"] = start_positions
    tokenized_examples["end_positions"] = end_positions
    
    # Remove offset_mapping and overflow_to_sample_mapping to prevent training errors
    del tokenized_examples["offset_mapping"]
    if "overflow_to_sample_mapping" in tokenized_examples:
        del tokenized_examples["overflow_to_sample_mapping"]

    return tokenized_examples

# Split dataset (80% train, 20% eval)
dataset_size = len(dataset)
train_size = int(dataset_size * 0.8)

train_dataset = dataset.select(range(train_size))
eval_dataset = dataset.select(range(train_size, dataset_size))

# Apply preprocessing
tokenized_train = train_dataset.map(
    prepare_training_features,
    batched=True,
    remove_columns=train_dataset.column_names
)

tokenized_eval = eval_dataset.map(
    prepare_training_features,
    batched=True,
    remove_columns=eval_dataset.column_names
)

print(f"Training samples: {len(tokenized_train)}")
print(f"Evaluation samples: {len(tokenized_eval)}")
print(f"Training dataset columns: {tokenized_train.column_names}")

Map:   0%|          | 0/4268 [00:00<?, ? examples/s]

Map:   0%|          | 0/1067 [00:00<?, ? examples/s]

Training samples: 5477
Evaluation samples: 1308
Training dataset columns: ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']


**5-Define Training Arguments**

In [19]:
# Set up training arguments for Arabic BERT QA on ARCD dataset
training_args = TrainingArguments(
    output_dir="./arabic-bert-qa",
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=4,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    save_steps=200,
    eval_strategy="steps",
    eval_steps=200,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    push_to_hub=False,
    warmup_steps=100,
)

**6- Train the Model**

In [20]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    tokenizer=tokenizer,
)

  trainer = Trainer(


In [21]:
trainer.train()

Step,Training Loss,Validation Loss
200,3.1965,3.557044
400,3.4097,3.203139
600,3.1687,2.92278
800,3.1024,2.76938
1000,3.0502,2.731339
1200,2.9101,2.645916
1400,2.3018,2.585033
1600,2.0326,2.683618
1800,2.1395,2.645313
2000,1.7188,2.664501


TrainOutput(global_step=5480, training_loss=1.699539015414941, metrics={'train_runtime': 640.1946, 'train_samples_per_second': 34.221, 'train_steps_per_second': 8.56, 'total_flos': 4293367009929216.0, 'train_loss': 1.699539015414941, 'epoch': 4.0})

**7-Test AFTER Training (Fine-tuned Performance)**

In [22]:
# Test with real examples from the ARCD dataset - AFTER TRAINING
print("üü¢ TESTING AFTER TRAINING (Fine-tuned Performance)")
print("=" * 60)
print("NOTE: Much better performance expected after fine-tuning!")
print("=" * 60)

# Get a few test examples from a different range to avoid train/eval overlap
test_examples = dataset.select(range(30, 35))  # Use samples 30-35 for testing

correct_predictions = 0
total_predictions = 0

for i, example in enumerate(test_examples):
    context = example['context']
    question = example['question']

    # Get ground truth answer
    ground_truth = example['answers']['text'][0]

    # Get model prediction
    predicted_answer = answer_question(question, context)

    # Check if prediction matches ground truth (case-insensitive)
    is_correct = predicted_answer.strip().lower() == ground_truth.strip().lower()
    if is_correct:
        correct_predictions += 1
    total_predictions += 1

    print(f"Example {i+1}:")
    print(f"Question: {question}")
    print(f"Context snippet: {context[:150]}...")
    print(f"Predicted: '{predicted_answer}'")
    print(f"Ground Truth: '{ground_truth}'")
    print(f"Match: {'‚úÖ' if is_correct else '‚ùå'}")
    print("-" * 50)

# Calculate and display accuracy
trained_accuracy = (correct_predictions / total_predictions) * 100 if total_predictions > 0 else 0
print(f"\nüü¢ POST-TRAINING Accuracy: {correct_predictions}/{total_predictions} ({trained_accuracy:.1f}%)")

# Compare with baseline
print(f"\nüìà IMPROVEMENT:")
print(f"Before Training: {baseline_accuracy:.1f}%")
print(f"After Training: {trained_accuracy:.1f}%")
print(f"Improvement: +{trained_accuracy - baseline_accuracy:.1f} percentage points")

üü¢ TESTING AFTER TRAINING (Fine-tuned Performance)
NOTE: Much better performance expected after fine-tuning!
Example 1:
Question: ŸÖŸÜ ÿØÿπŸÖ ÿßŸÑÿ±ÿ£Ÿä ÿßŸÑÿπÿßŸÖÿü
Context snippet: ŸàŸÖÿπ ÿ∞ŸÑŸÉÿå ÿ∏ŸÑ ÿßŸÑÿπÿßŸÖÿ© Ÿäÿ™ÿπÿßÿ∑ŸÅŸàŸÜ ŸÖÿπ ÿßŸÑŸÖŸÑŸÉÿ© ŸÉÿßÿ´ÿ±ŸäŸÜ. ŸàŸÅŸä ŸÑŸäŸÑÿ© ŸÖŸÜ ÿÆÿ±ŸäŸÅ ÿπÿßŸÖ 1531ÿå ÿ®ŸäŸÜŸÖÿß ŸÉÿßŸÜÿ™ ÿ¢ŸÜ ÿ™ÿ™ŸÜÿßŸàŸÑ ÿ∑ÿπÿßŸÖŸáÿß ŸÅŸä ŸÖŸÜÿ≤ŸÑ ÿ±ŸäŸÅŸä ÿπŸÑŸâ ŸÜŸáÿ± ÿßŸÑÿ™ÿßŸäŸÖÿ≤ÿå Ÿáÿßÿ¨ŸÖŸáÿß ÿ≠ÿ¥ÿØ ŸÜÿ≥ÿßÿ¶Ÿä ÿ∫ÿß...
Predicted: 'ŸÉÿßÿ´ÿ±ŸäŸÜ'
Ground Truth: 'ÿßŸÑŸÖŸÑŸÉÿ© ŸÉÿßÿ´ÿ±ŸäŸÜ.'
Match: ‚ùå
--------------------------------------------------
Example 2:
Question: Ÿäÿ≠ÿ™ÿßÿ¨ ÿ≤Ÿàÿßÿ¨ ÿ¢ŸÖŸä ŸàŸáŸÜÿ±Ÿä ÿ•ŸÑŸâ ÿØÿπŸÖ ÿ£Ÿä ŸÖŸÑŸÉÿü
Context snippet: ÿÆŸÑÿßŸÑ Ÿáÿ∞Ÿá ÿßŸÑŸÅÿ™ÿ±ÿ©ÿå ŸÑÿπÿ®ÿ™ ÿ¢ŸÜ ÿ®ŸàŸÑŸäŸÜ ÿØŸàÿ±Ÿãÿß ÿ®ÿßÿ±ÿ≤Ÿãÿß ŸÅŸä ÿπŸÑÿßŸÇÿßÿ™ ÿ•ŸÜÿ¨ŸÑÿ™ÿ±ÿß ÿßŸÑÿØŸàŸÑŸäÿ©ÿå ŸÖŸÜ ÿÆŸÑÿßŸÑ ÿ™Ÿàÿ∑ŸäÿØ ÿßŸÑÿ™ÿ≠ÿßŸÑŸÅ ŸÖÿπ ŸÅÿ±ŸÜÿ≥ÿß. ŸÅŸÇÿØ Ÿàÿ∑ÿØÿ™ ÿµŸÑÿßÿ™ ÿ¨ŸäÿØÿ© ŸÖÿπ ÿßŸÑÿ≥ŸÅŸäÿ± ÿßŸÑŸÅÿ±ŸÜÿ≥Ÿä "ÿ¨ŸäŸÑ ÿØŸä ŸÑÿß...
Predicted: 'ŸÅÿ±ÿßŸÜÿ≥Ÿàÿß ÿßŸÑÿ£Ÿ

In [23]:
# Run evaluation AFTER training
print("\nüü¢ DETAILED EVALUATION AFTER TRAINING")
trained_f1, trained_em = evaluate_model_subset(dataset, num_samples=100, model_type="AFTER")

print(f"\nüü¢ TRAINED MODEL METRICS (AFTER TRAINING):")
print(f"Token F1: {trained_f1:.3f}")
print(f"Exact Match: {trained_em:.3f}")

print(f"\nüìä COMPLETE COMPARISON:")
print("=" * 50)
print(f"{'Metric':<20} {'Before':<10} {'After':<10} {'Improvement':<15}")
print("=" * 50)
print(f"{'Token F1':<20} {baseline_f1:<10.3f} {trained_f1:<10.3f} {'+' + str(round(trained_f1 - baseline_f1, 3)):<15}")
print(f"{'Exact Match':<20} {baseline_em:<10.3f} {trained_em:<10.3f} {'+' + str(round(trained_em - baseline_em, 3)):<15}")
print("=" * 50)

# Performance interpretation
if trained_f1 >= 0.8:
    print("üéâ EXCELLENT performance achieved after fine-tuning!")
elif trained_f1 >= 0.6:
    print("‚úÖ GOOD performance achieved after fine-tuning!")
elif trained_f1 >= 0.4:
    print("‚ö†Ô∏è FAIR performance achieved after fine-tuning!")
else:
    print("‚ùå Poor performance - consider longer training or hyperparameter tuning")


üü¢ DETAILED EVALUATION AFTER TRAINING
üîç Evaluating Arabic BERT on ARCD Dataset - AFTER TRAINING
Evaluating on 100 samples from ARCD dataset
Processing samples...
Processed 20/100 samples...
Processed 40/100 samples...
Processed 60/100 samples...
Processed 80/100 samples...
Processed 100/100 samples...

üìä AFTER TRAINING RESULTS:
üéØ Average Token F1: 0.413
‚úÖ Exact Match Score: 0.230
üìä Total Samples: 100
‚ö†Ô∏è  Empty Predictions: 4
üìà Valid Predictions: 96
üìä Average F1 (excluding empty): 0.608

üü¢ TRAINED MODEL METRICS (AFTER TRAINING):
Token F1: 0.413
Exact Match: 0.230

üìä COMPLETE COMPARISON:
Metric               Before     After      Improvement    
Token F1             0.024      0.413      +0.389         
Exact Match          0.000      0.230      +0.23          
‚ö†Ô∏è FAIR performance achieved after fine-tuning!


**8-Advanced Inference with Confidence Scores**

In [24]:
def arabic_qa_inference(question, context, show_confidence=True):
    """
    Perform inference using your trained Arabic BERT model
    """
    # Tokenize input
    inputs = tokenizer(
        question,
        context,
        return_tensors="pt",
        max_length=512,
        truncation=True,
        padding=True
    )

    # Move to device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model.to(device)

    # Get model predictions
    with torch.no_grad():
        outputs = model(**inputs)

    # Get start and end logits
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

    # Get probabilities for confidence score
    start_probs = torch.softmax(start_logits, dim=1)
    end_probs = torch.softmax(end_logits, dim=1)

    # Find the most likely start and end positions
    start_index = torch.argmax(start_logits, dim=1).item()
    end_index = torch.argmax(end_logits, dim=1).item()

    # Ensure end_index is not before start_index
    if end_index < start_index:
        end_index = start_index

    # Calculate confidence
    confidence = (start_probs[0][start_index] * end_probs[0][end_index]).item()

    # Extract answer tokens
    answer_tokens = inputs["input_ids"][0][start_index:end_index+1]
    answer = tokenizer.decode(answer_tokens, skip_special_tokens=True)

    # Handle empty answers
    if not answer.strip():
        answer = "ÿ∫Ÿäÿ± ŸÖÿ≠ÿØÿØ"
        confidence = 0.0

    if show_confidence:
        return answer.strip(), confidence
    else:
        return answer.strip()

# Test inference with custom Arabic examples
print("ü§ñ Arabic BERT QA Inference System (TRAINED MODEL)")
print("=" * 60)

# Example 1: Technology
context1 = """
ÿßŸÑÿ∞ŸÉÿßÿ° ÿßŸÑÿßÿµÿ∑ŸÜÿßÿπŸä ŸáŸà ŸÖÿ≠ÿßŸÉÿßÿ© ÿßŸÑÿ∞ŸÉÿßÿ° ÿßŸÑÿ®ÿ¥ÿ±Ÿä ŸÅŸä ÿßŸÑÿ¢ŸÑÿßÿ™ ÿßŸÑŸÖÿ®ÿ±ŸÖÿ¨ÿ© ŸÑŸÑÿ™ŸÅŸÉŸäÿ± ŸàÿßŸÑÿ™ÿπŸÑŸÖ.
Ÿäÿ¥ŸÖŸÑ ÿ™ÿ∑ÿ®ŸäŸÇÿßÿ™ ŸÖÿ´ŸÑ ÿßŸÑÿ™ÿπÿ±ŸÅ ÿπŸÑŸâ ÿßŸÑŸÉŸÑÿßŸÖÿå ŸàÿßŸÑÿ™ÿπŸÑŸÖ ÿßŸÑÿ¢ŸÑŸäÿå ŸàÿßŸÑÿ™ÿÆÿ∑Ÿäÿ∑ÿå Ÿàÿ≠ŸÑ ÿßŸÑŸÖÿ¥ŸÉŸÑÿßÿ™.
ÿ™ÿ≥ÿ™ÿÆÿØŸÖ ÿ¥ÿ±ŸÉÿßÿ™ ÿßŸÑÿ™ŸÉŸÜŸàŸÑŸàÿ¨Ÿäÿß ÿßŸÑÿ∞ŸÉÿßÿ° ÿßŸÑÿßÿµÿ∑ŸÜÿßÿπŸä ŸÅŸä ŸÖŸÜÿ™ÿ¨ÿßÿ™Ÿáÿß ŸÑÿ™ÿ≠ÿ≥ŸäŸÜ ÿ™ÿ¨ÿ±ÿ®ÿ© ÿßŸÑŸÖÿ≥ÿ™ÿÆÿØŸÖ.
"""

question1 = "ŸÖÿß ŸáŸà ÿßŸÑÿ∞ŸÉÿßÿ° ÿßŸÑÿßÿµÿ∑ŸÜÿßÿπŸäÿü"
answer1, conf1 = arabic_qa_inference(question1, context1)

print(f"Question: {question1}")
print(f"Answer: {answer1}")
print(f"Confidence: {conf1:.3f}")
print("-" * 30)

# Example 2: History
context2 = """
ÿ™ÿ£ÿ≥ÿ≥ÿ™ ÿßŸÑŸÖŸÖŸÑŸÉÿ© ÿßŸÑÿπÿ±ÿ®Ÿäÿ© ÿßŸÑÿ≥ÿπŸàÿØŸäÿ© ÿπŸÑŸâ ŸäÿØ ÿßŸÑŸÖŸÑŸÉ ÿπÿ®ÿØ ÿßŸÑÿπÿ≤Ÿäÿ≤ ÿ¢ŸÑ ÿ≥ÿπŸàÿØ ŸÅŸä ÿπÿßŸÖ 1932.
ÿ™ÿπÿ™ÿ®ÿ± ÿßŸÑÿ±Ÿäÿßÿ∂ ÿßŸÑÿπÿßÿµŸÖÿ© Ÿàÿ£ŸÉÿ®ÿ± ÿßŸÑŸÖÿØŸÜÿå Ÿàÿ™ŸÇÿπ ŸÅŸä Ÿàÿ≥ÿ∑ ÿ¥ÿ®Ÿá ÿßŸÑÿ¨ÿ≤Ÿäÿ±ÿ© ÿßŸÑÿπÿ±ÿ®Ÿäÿ©.
ÿßŸÑŸÑÿ∫ÿ© ÿßŸÑÿ±ÿ≥ŸÖŸäÿ© ŸáŸä ÿßŸÑÿπÿ±ÿ®Ÿäÿ© ŸàÿßŸÑÿπŸÖŸÑÿ© ŸáŸä ÿßŸÑÿ±ŸäÿßŸÑ ÿßŸÑÿ≥ÿπŸàÿØŸä.
"""

question2 = "ŸÖÿ™Ÿâ ÿ™ÿ£ÿ≥ÿ≥ÿ™ ÿßŸÑŸÖŸÖŸÑŸÉÿ© ÿßŸÑÿπÿ±ÿ®Ÿäÿ© ÿßŸÑÿ≥ÿπŸàÿØŸäÿ©ÿü"
answer2, conf2 = arabic_qa_inference(question2, context2)

print(f"Question: {question2}")
print(f"Answer: {answer2}")
print(f"Confidence: {conf2:.3f}")
print("-" * 30)

# Example 3: Science
context3 = """
ÿßŸÑÿ¨ÿßÿ∞ÿ®Ÿäÿ© ŸáŸä ŸÇŸàÿ© ÿ∑ÿ®ŸäÿπŸäÿ© ÿ™ÿ¨ÿ∞ÿ® ÿßŸÑÿ£ÿ¨ÿ≥ÿßŸÖ ŸÜÿ≠Ÿà ŸÖÿ±ŸÉÿ≤ ÿßŸÑÿ£ÿ±ÿ∂. ÿßŸÉÿ™ÿ¥ŸÅ ÿ•ÿ≥ÿ≠ÿßŸÇ ŸÜŸäŸàÿ™ŸÜ ŸÇÿßŸÜŸàŸÜ ÿßŸÑÿ¨ÿßÿ∞ÿ®Ÿäÿ© ÿßŸÑÿπÿßŸÖ.
ÿ™ÿ®ŸÑÿ∫ ÿ≥ÿ±ÿπÿ© ÿßŸÑÿ∂Ÿàÿ° ŸÅŸä ÿßŸÑŸÅÿ±ÿßÿ∫ 299,792,458 ŸÖÿ™ÿ± ŸÅŸä ÿßŸÑÿ´ÿßŸÜŸäÿ©ÿå ŸàŸáŸä ÿ£ÿ≥ÿ±ÿπ ÿ≥ÿ±ÿπÿ© ŸÖŸÖŸÉŸÜÿ© ŸÅŸä ÿßŸÑŸÉŸàŸÜ.
"""

question3 = "ŸÖŸÜ ÿßŸÉÿ™ÿ¥ŸÅ ŸÇÿßŸÜŸàŸÜ ÿßŸÑÿ¨ÿßÿ∞ÿ®Ÿäÿ©ÿü"
answer3, conf3 = arabic_qa_inference(question3, context3)

print(f"Question: {question3}")
print(f"Answer: {answer3}")
print(f"Confidence: {conf3:.3f}")
print("-" * 30)

# Example 4: Geography
context4 = """
ÿ¨ÿ®ŸÑ ÿ•ŸäŸÅÿ±ÿ≥ÿ™ ŸáŸà ÿ£ÿπŸÑŸâ ÿ¨ÿ®ŸÑ ŸÅŸä ÿßŸÑÿπÿßŸÑŸÖÿå ŸàŸäÿ®ŸÑÿ∫ ÿßÿ±ÿ™ŸÅÿßÿπŸá 8,848.86 ŸÖÿ™ÿ± ŸÅŸàŸÇ ŸÖÿ≥ÿ™ŸàŸâ ÿ≥ÿ∑ÿ≠ ÿßŸÑÿ®ÿ≠ÿ±.
ŸäŸÇÿπ ŸÅŸä ÿ¨ÿ®ÿßŸÑ ÿßŸÑŸáŸäŸÖÿßŸÑÿßŸäÿß ÿπŸÑŸâ ÿßŸÑÿ≠ÿØŸàÿØ ÿ®ŸäŸÜ ŸÜŸäÿ®ÿßŸÑ ŸàÿßŸÑÿ™ÿ®ÿ™. ÿ™ŸÖ ÿßŸÑŸàÿµŸàŸÑ ÿ•ŸÑŸâ ÿßŸÑŸÇŸÖÿ© ŸÑÿ£ŸàŸÑ ŸÖÿ±ÿ© ÿπÿßŸÖ 1953.
"""

question4 = "ŸÖÿß ÿßÿ±ÿ™ŸÅÿßÿπ ÿ¨ÿ®ŸÑ ÿ•ŸäŸÅÿ±ÿ≥ÿ™ÿü"
answer4, conf4 = arabic_qa_inference(question4, context4)

print(f"Question: {question4}")
print(f"Answer: {answer4}")
print(f"Confidence: {conf4:.3f}")
print("=" * 60)

print(f"\nüéØ SUMMARY:")
print(f"The Arabic model shows high confidence in its answers after fine-tuning!")
print(f"Average confidence: {(conf1 + conf2 + conf3 + conf4) / 4:.3f}")

ü§ñ Arabic BERT QA Inference System (TRAINED MODEL)
Question: ŸÖÿß ŸáŸà ÿßŸÑÿ∞ŸÉÿßÿ° ÿßŸÑÿßÿµÿ∑ŸÜÿßÿπŸäÿü
Answer: ŸÖÿ≠ÿßŸÉÿßÿ© ÿßŸÑÿ∞ŸÉÿßÿ° ÿßŸÑÿ®ÿ¥ÿ±Ÿä ŸÅŸä ÿßŸÑÿ¢ŸÑÿßÿ™ ÿßŸÑŸÖÿ®ÿ±ŸÖÿ¨ÿ© ŸÑŸÑÿ™ŸÅŸÉŸäÿ± ŸàÿßŸÑÿ™ÿπŸÑŸÖ
Confidence: 0.201
------------------------------
Question: ŸÖÿ™Ÿâ ÿ™ÿ£ÿ≥ÿ≥ÿ™ ÿßŸÑŸÖŸÖŸÑŸÉÿ© ÿßŸÑÿπÿ±ÿ®Ÿäÿ© ÿßŸÑÿ≥ÿπŸàÿØŸäÿ©ÿü
Answer: 1932.
Confidence: 0.527
------------------------------
Question: ŸÖŸÜ ÿßŸÉÿ™ÿ¥ŸÅ ŸÇÿßŸÜŸàŸÜ ÿßŸÑÿ¨ÿßÿ∞ÿ®Ÿäÿ©ÿü
Answer: ÿ•ÿ≥ÿ≠ÿßŸÇ ŸÜŸäŸàÿ™ŸÜ
Confidence: 0.588
------------------------------
Question: ŸÖÿß ÿßÿ±ÿ™ŸÅÿßÿπ ÿ¨ÿ®ŸÑ ÿ•ŸäŸÅÿ±ÿ≥ÿ™ÿü
Answer: 8, 848. 86 ŸÖÿ™ÿ±
Confidence: 0.179

üéØ SUMMARY:
The Arabic model shows high confidence in its answers after fine-tuning!
Average confidence: 0.373


Contributed by: Ali H