# Arabic Named-Entity Recognition (NER) — Assignment

This notebook guides you through building an Arabic NER model using the ANERCorp dataset (`asas-ai/ANERCorp`). Fill in the TODO cells to complete the exercise.

- **Objective:** Train a token-classification model (NER) that labels tokens with entity tags (e.g., people, locations, organizations).
- **Dataset:** `asas-ai/ANERCorp` — contains tokenized Arabic text and tag sequences.
- **Typical Labels:** `B-PER`, `I-PER` (person), `B-LOC`, `I-LOC` (location), `B-ORG`, `I-ORG` (organization), and `O` (outside/no entity). Your code should extract the exact label set from the dataset and build `label_list`, `id2label`, and `label2id` mappings.
- **Key Steps (what you will implement):**
  1. Load the dataset and inspect samples.
  2. Convert the provided words into sentence groupings (use `.` `?` `!` as sentence delimiters) before tokenization so sentence boundaries are preserved.
  3. Tokenize with a pretrained Arabic tokenizer and align tokenized sub-words with original labels (use `-100` for tokens to ignore in loss).
  4. Prepare `tokenized_datasets` and data collator for dynamic padding.
  5. Configure and run model training using `AutoModelForTokenClassification` and `Trainer`.
  6. Evaluate using `seqeval` (report precision, recall, F1, and accuracy) and run inference with a pipeline.

- **Evaluation:** Use the `seqeval` metric (entity-level precision, recall, F1). When aligning predictions and labels, filter out `-100` entries so only real token labels are compared.

- **Deliverables:** Completed notebook with working cells for data loading, tokenization/label alignment, training, evaluation, and an inference example. Add short comments explaining choices (e.g., sentence-splitting strategy, tokenizer settings).

Good luck — implement each TODO in order and run the cells to verify output.

In [25]:
# TODO: Install the required packages for Arabic NER with transformers
# Required packages: transformers, datasets, seqeval, evaluate, accelerate
# Use pip install with -q flag to suppress output

!pip install transformers datasets seqeval evaluate accelerate -q

In [26]:
# TODO: List the files in the current directory to explore the workspace
# Hint: Use a simple command to display directory contents

!ls

arabert-ner  sample_data


In [None]:
# TODO: Load the ANERCorp dataset and extract label mappings
# Steps:
# 1. Import required libraries (datasets, numpy)
# 2. Load the "asas-ai/ANERCorp" dataset using load_dataset()
# 3. Inspect the dataset structure - print the splits and a sample entry
# 4. Extract unique tags from the training split
# 5. Create label_list (sorted), id2label, and label2id mappings

# YOUR CODE HERE
# TODO: Load the ANERCorp dataset and extract label mappings
import datasets
import numpy as np


dataset = datasets.load_dataset("asas-ai/ANERCorp")


print(f"Dataset Split: {dataset}")
print(f"Sample Entry: {dataset['train'][0]}")


raw_tags = dataset["train"]["tag"]

unique_tag_ids = sorted(list(set([t for sublist in raw_tags for t in sublist])))


label_list = [str(t) for t in unique_tag_ids] 


id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}

print(f"\nLabel List: {label_list}")

Dataset Split: DatasetDict({
    train: Dataset({
        features: ['word', 'tag'],
        num_rows: 125102
    })
    test: Dataset({
        features: ['word', 'tag'],
        num_rows: 25008
    })
})
Sample Entry: {'word': 'فرانكفورت', 'tag': 'B-LOC'}

Label List: ['-', 'B', 'C', 'E', 'G', 'I', 'L', 'M', 'O', 'P', 'R', 'S']


In [None]:
# TODO: Verify the dataset was loaded correctly
import pandas as pd  


print("Dataset Summary:")
print(dataset)


df_sample = pd.DataFrame(dataset["train"][:5])
print("\nFirst 5 entries (Sentence lists):")
display(df_sample)


print("\nDataset Features:")
print(dataset["train"].features)

Dataset Summary:
DatasetDict({
    train: Dataset({
        features: ['word', 'tag'],
        num_rows: 125102
    })
    test: Dataset({
        features: ['word', 'tag'],
        num_rows: 25008
    })
})

First 5 entries (Sentence lists):


Unnamed: 0,word,tag
0,فرانكفورت,B-LOC
1,(د,O
2,ب,O
3,أ),O
4,أعلن,O



Dataset Features:
{'word': Value('string'), 'tag': Value('string')}


In [None]:
# TODO: Load tokenizer and create tokenization function
# Steps:
# 1. Import AutoTokenizer from transformers
# 2. Set model_checkpoint to "aubmindlab/bert-base-arabertv02"
# 3. Load the tokenizer using AutoTokenizer.from_pretrained()
# 4. Create tokenize_and_align_labels function that:
#    - Tokenizes the input text (is_split_into_words=True)
#    - Maps tokens to their original words
#    - Handles special tokens by setting them to -100
#    - Aligns labels with sub-word tokens
#    - Returns tokenized inputs with labels
# 5. Important: Convert words to sentences using punctuation marks ".?!" as sentence delimiters
#    - This helps the model understand sentence boundaries
#    - Hint (suggested approach): group `examples['word']` into sentence lists using ".?!" as end markers, e.g.:
#        sentences = []
#        current = []
#        for w in examples['word']:
#            current.append(w)
#            if w in ['.', '?', '!'] or (len(w) > 0 and w[-1] in '.?!'):
#                sentences.append(current)
#                current = []
#        if current:
#            sentences.append(current)
#      Then align `examples['tag']` accordingly to these sentence groups before tokenization.
# 6. Apply the function to the entire dataset using dataset.map()

from transformers import AutoTokenizer

# YOUR CODE HERE
def group_into_sentences(examples):
    all_sentences = []
    all_tags = []
    current_sentence = []
    current_tags = []
    
    for words, tags in zip(examples["word"], examples["tag"]):
        for w, t in zip(words, tags):
            current_sentence.append(w)

            if isinstance(t, str):
                current_tags.append(label2id[t])
            else:
                current_tags.append(int(t))
                
            if w in ['.', '?', '!'] or (len(w) > 0 and w[-1] in '.?!'):
                all_sentences.append(current_sentence)
                all_tags.append(current_tags)
                current_sentence = []
                current_tags = []
    
    if current_sentence:
        all_sentences.append(current_sentence)
        all_tags.append(current_tags)
        
    return {"tokens": all_sentences, "ner_tags": all_tags}


prepared_dataset = dataset.map(
    group_into_sentences, 
    batched=True, 
    remove_columns=dataset["train"].column_names
)


tokenized_datasets = prepared_dataset.map(tokenize_and_align_labels, batched=True)


print("Successfully tokenized!")
print(f"Sample labels: {tokenized_datasets['train'][0]['labels'][:10]}")

Map:   0%|          | 0/4317 [00:00<?, ? examples/s]

Map:   0%|          | 0/987 [00:00<?, ? examples/s]

Successfully tokenized!
Sample labels: [-100, 1, 0, 6, 8, 2, 8, 8, 8, 8]


In [None]:
# TODO: Define the compute_metrics function for model evaluation
# Steps:
# 1. Import evaluate and load "seqeval" metric
# 2. Create compute_metrics function that:
#    - Extracts predictions from model outputs using argmax
#    - Filters out -100 labels (special tokens and sub-words)
#    - Converts prediction and label IDs back to label names
#    - Computes seqeval metrics (precision, recall, f1, accuracy)
#    - Returns results as a dictionary

import evaluate
import numpy as np

# YOUR CODE HERE
metric = evaluate.load("seqeval")

def compute_metrics(p):
    predictions, labels = p
    

    predictions = np.argmax(predictions, axis=2)


    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]


    results = metric.compute(predictions=true_predictions, references=true_labels)
    

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [None]:
# TODO: Load the model and configure training
# Steps:
# 1. Import AutoModelForTokenClassification, TrainingArguments, Trainer, and DataCollatorForTokenClassification
# 2. Load the model using AutoModelForTokenClassification.from_pretrained() with:
#    - model_checkpoint
#    - num_labels based on label_list length
#    - id2label and label2id mappings
# 3. Create TrainingArguments with:
#    - output directory "arabert-ner"
#    - evaluation_strategy="epoch"
#    - learning_rate=2e-5
#    - batch_size=16 (both train and eval)
#    - num_train_epochs=3
#    - weight_decay=0.01
# 4. Create a DataCollatorForTokenClassification for dynamic padding
# 5. Initialize the Trainer with model, args, datasets, data_collator, tokenizer, and compute_metrics
# 6. Call trainer.train() to start training

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification

# YOUR CODE HERE


model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint, 
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

)
args = TrainingArguments(
    "arabert-ner",
    eval_strategy="epoch",
    learning_rate=2e-5,               
    per_device_train_batch_size=8,    
    per_device_eval_batch_size=8,     
    gradient_accumulation_steps=2,   
    num_train_epochs=3,               
    weight_decay=0.01,                
    logging_steps=100,                
    push_to_hub=False,
    report_to="none",
    fp16=True                         
)


data_collator = DataCollatorForTokenClassification(tokenizer)


trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"], 
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


trainer.train()

Some weights of BertForTokenClassification were not initialized from the model checkpoint at aubmindlab/bert-base-arabertv02 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  trainer = Trainer(


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.7772,0.820312,0.25974,0.162602,0.2,0.725248
2,0.5886,0.763315,0.323308,0.227236,0.266889,0.751629
3,0.5052,0.7258,0.34601,0.262602,0.29859,0.757721




TrainOutput(global_step=810, training_loss=0.6542675618772154, metrics={'train_runtime': 144.3377, 'train_samples_per_second': 89.727, 'train_steps_per_second': 5.612, 'total_flos': 627941135138184.0, 'train_loss': 0.6542675618772154, 'epoch': 3.0})

In [37]:
# TODO: Test the trained model with inference
# Steps:
# 1. Import pipeline from transformers
# 2. Create an NER pipeline using the trained model and tokenizer
# 3. Use aggregation_strategy="simple" to merge sub-tokens back into words
# 4. Test the pipeline with an Arabic text sample
# 5. Pretty print the results showing entity, label, and confidence score

from transformers import pipeline

# YOUR CODE HERE
ner_pipeline = pipeline(
    "ner", 
    model=model, 
    tokenizer=tokenizer, 
    aggregation_strategy="simple"
)

text = "أعلن المدير التنفيذي لشركة أبل تيم كوك عن افتتاح فرع جديد في الرياض."
results = ner_pipeline(text)

print(f"Input Text: {text}\n")
print("-" * 30)
for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")

Device set to use cuda:0


Input Text: أعلن المدير التنفيذي لشركة أبل تيم كوك عن افتتاح فرع جديد في الرياض.

------------------------------
Entity: لشركة, Label: , Score: 0.26
Entity: تيم, Label: , Score: 0.19
