# Arabic Named-Entity Recognition (NER) — Assignment

This notebook guides you through building an Arabic NER model using the ANERCorp dataset (`asas-ai/ANERCorp`). Fill in the TODO cells to complete the exercise.

- **Objective:** Train a token-classification model (NER) that labels tokens with entity tags (e.g., people, locations, organizations).
- **Dataset:** `asas-ai/ANERCorp` — contains tokenized Arabic text and tag sequences.
- **Typical Labels:** `B-PERS`, `I-PERS` (person), `B-LOC`, `I-LOC` (location), `B-ORG`, `I-ORG` (organization), `B-MISC`, `I-MISC` (miscellaneous), and `O` (outside/no entity). Your code should extract the exact label set from the dataset and build `label_list`, `id2label`, and `label2id` mappings.
- **Key Steps (what you will implement):**
  1. Load the dataset and inspect samples.
  2. Group the word-level rows into sentences (treat `.`, `?`, and `!` as sentence delimiters) before tokenization so that sentence boundaries are preserved.
  3. Tokenize with a pretrained Arabic tokenizer and align the tokenized sub-words with the original labels (use `-100` for tokens the loss should ignore).
  4. Prepare `tokenized_datasets` and a data collator for dynamic padding.
  5. Configure and run model training using `AutoModelForTokenClassification` and `Trainer`.
  6. Evaluate using `seqeval` (report precision, recall, F1, and accuracy) and run inference with a pipeline.

- **Evaluation:** Use the `seqeval` metric (entity-level precision, recall, F1). When aligning predictions and labels, filter out `-100` entries so only real token labels are compared.

- **Deliverables:** Completed notebook with working cells for data loading, tokenization/label alignment, training, evaluation, and an inference example. Add short comments explaining choices (e.g., sentence-splitting strategy, tokenizer settings).

Good luck — implement each TODO in order and run the cells to verify output.
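
Step 3 in the list above (sub-word label alignment) can be illustrated without loading a tokenizer. The sketch below is a minimal example, assuming a hypothetical `word_ids` list of the kind a fast tokenizer returns; the helper name `align_labels` and the toy label ids are illustrative and not part of the assignment.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips this value by default


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Label the first sub-word of each word; mask special tokens and continuations."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous:
            aligned.append(IGNORE_INDEX)  # special token or non-first sub-word
        else:
            aligned.append(word_labels[word_idx])  # first sub-word of a new word
        previous = word_idx
    return aligned


# Hypothetical word_ids for a 3-word sentence:
# [CLS], word 0 (2 pieces), word 1 (1 piece), word 2 (3 pieces), [SEP]
toy_word_ids = [None, 0, 0, 1, 2, 2, 2, None]
toy_word_labels = [3, 8, 0]  # illustrative label ids, one per word
toy_aligned = align_labels(toy_word_ids, toy_word_labels)
print(toy_aligned)
```

Exactly one position per word keeps a real label, so the number of scored tokens equals the number of words in the sentence.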

In [None]:
!pip install -q transformers datasets seqeval evaluate accelerate



In [None]:
from datasets import load_dataset
import numpy as np

dataset = load_dataset("asas-ai/ANERCorp")

print(f"Dataset Splits: {dataset}")
print(f"\nSample Entry (train[0]):\n{dataset['train'][0]}")

# Build the label set from the data instead of hard-coding it.
unique_tags = set(dataset["train"].unique("tag"))

# Sort so that the label-to-id assignment is the same on every run.
label_list = sorted(unique_tags)
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}

print("Label List:", label_list)





Dataset Splits: DatasetDict({
    train: Dataset({
        features: ['word', 'tag'],
        num_rows: 125102
    })
    test: Dataset({
        features: ['word', 'tag'],
        num_rows: 25008
    })
})

Sample Entry (train[0]):
{'word': 'فرانكفورت', 'tag': 'B-LOC'}
Label List: ['B-LOC', 'B-MISC', 'B-ORG', 'B-PERS', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PERS', 'O']
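
The label list printed above uses `PERS` (not `PER`) and includes a `MISC` type, which is why the mappings are derived from the data. As a small sanity check, the sketch below rebuilds the mappings on a toy tag list and verifies two properties worth checking on the real `label_list` as well. The `toy_*` names are illustrative only.

```python
# Toy stand-in for the `tag` column; the notebook reads the real one from the dataset.
toy_tags = ["B-LOC", "O", "O", "B-PERS", "I-PERS", "O", "B-ORG", "B-MISC"]

toy_label_list = sorted(set(toy_tags))
toy_id2label = dict(enumerate(toy_label_list))
toy_label2id = {label: i for i, label in toy_id2label.items()}

# The two mappings must be exact inverses of each other.
round_trip_ok = all(toy_id2label[toy_label2id[label]] == label for label in toy_label_list)

# In a well-formed BIO label set, every entity type has a B- tag.
entity_types = {label[2:] for label in toy_label_list if label != "O"}
missing_begin = sorted(t for t in entity_types if f"B-{t}" not in toy_label_list)

print(toy_label_list)
print(round_trip_ok, missing_begin)
```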


In [None]:
dataset["train"].to_pandas().head()

index,word,tag
0,فرانكفورت,B-LOC
1,(د,O
2,ب,O
3,أ),O
4,أعلن,O


In [None]:
from transformers import AutoTokenizer

model_checkpoint = "aubmindlab/bert-base-arabertv02"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def _is_sentence_end(w: str) -> bool:
    # A word closes a sentence if it is, or ends with, one of the delimiters . ? !
    return len(w) > 0 and w[-1] in ".?!"

def tokenize_and_align_labels(examples):
    words = examples["word"]
    tags = examples["tag"]

    sentences_words = []
    sentences_tags = []
    cur_w, cur_t = [], []

    for w, t in zip(words, tags):
        cur_w.append(w)
        cur_t.append(t)

        if _is_sentence_end(w):
            sentences_words.append(cur_w)
            sentences_tags.append(cur_t)
            cur_w, cur_t = [], []

    if cur_w:  # flush a final sentence that has no closing delimiter
        sentences_words.append(cur_w)
        sentences_tags.append(cur_t)

    sentences_tag_ids = [[label2id[x] for x in sent_tags] for sent_tags in sentences_tags]

    tokenized_inputs = tokenizer(
        sentences_words,
        is_split_into_words=True,
        truncation=True,
        padding=False,  # pad per batch in the data collator instead
    )

    labels = []
    for i, word_labels in enumerate(sentences_tag_ids):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []

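        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # special tokens ([CLS], [SEP]) are ignored by the loss
            elif word_idx != previous_word_idx:
                label_ids.append(word_labels[word_idx])  # first sub-word carries the word's label
            else:
                label_ids.append(-100)  # later sub-words of the same word are ignored
            previous_word_idx = word_idx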

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

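# Note: with batched=True, map() passes 1,000 rows per call by default, so a
# sentence that straddles a batch boundary is split into two examples.
tokenized_datasets = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names,  # output rows are sentences, not words
)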

tokenized_datasets



DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 4384
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 990
    })
})
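
One detail of the sentence grouping deserves a closer look: `Dataset.map(..., batched=True)` calls the function on chunks of rows (1,000 by default), and each chunk is grouped independently. The sketch below reproduces that effect on a toy word list with a batch size of 4. The helper names `split_sentences` and `split_in_batches` are illustrative and only mimic the behaviour; they are not part of the `datasets` API.

```python
from typing import Iterable, List


def split_sentences(words: Iterable[str]) -> List[List[str]]:
    """Group words into sentences, closing a sentence after '.', '?' or '!'."""
    sentences, current = [], []
    for w in words:
        current.append(w)
        if w and w[-1] in ".?!":
            sentences.append(current)
            current = []
    if current:  # flush a trailing sentence that has no closing delimiter
        sentences.append(current)
    return sentences


def split_in_batches(words: List[str], batch_size: int) -> List[List[str]]:
    """Mimic a batched map: every batch is grouped independently of its neighbours."""
    sentences = []
    for start in range(0, len(words), batch_size):
        sentences.extend(split_sentences(words[start:start + batch_size]))
    return sentences


toy_words = ["a", "b", ".", "c", "d", "e", ".", "f", "?"]
whole = split_sentences(toy_words)
batched = split_in_batches(toy_words, batch_size=4)
print(len(whole), len(batched))
```

The batched variant yields more, shorter "sentences" because the leftover words at the end of each batch are flushed early. If that matters for your results, passing `batch_size=None` to `map` should hand each split to the function as a single batch, at the cost of more memory; note that this would change the row counts shown above.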

In [None]:
import evaluate
import numpy as np

seqeval = evaluate.load("seqeval")

def compute_metrics(p):
    logits, labels = p
    predictions = np.argmax(logits, axis=-1)

    true_predictions = []
    true_labels = []

    for pred_seq, label_seq in zip(predictions, labels):
        sent_preds = []
        sent_labels = []
        for pred_id, label_id in zip(pred_seq, label_seq):
            if label_id == -100:
                continue  # skip special tokens and non-first sub-words
            sent_preds.append(id2label[int(pred_id)])
            sent_labels.append(id2label[int(label_id)])
        true_predictions.append(sent_preds)
        true_labels.append(sent_labels)

    results = seqeval.compute(predictions=true_predictions, references=true_labels)

    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
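
To see why entity-level F1 and token-level accuracy can diverge, the following sketch scores one toy sentence by hand. It is a simplified illustration of span-based scoring, not `seqeval`'s implementation (which handles more tagging schemes and edge cases); `extract_spans` and the toy sequences are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, inclusive end index)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        prefix, _, etype = tag.partition("-")
        continues = prefix == "I" and etype == current
        if current is not None and not continues:
            spans.add((current, start, i - 1))  # close the open entity
            start, current = None, None
        if prefix in ("B", "I") and not continues:
            start, current = i, etype  # open a new entity
    return spans


gold = ["B-PERS", "I-PERS", "O", "B-LOC"]
pred = ["B-PERS", "O", "O", "B-LOC"]  # second token of the person name is missed

gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
correct = len(gold_spans & pred_spans)  # a span counts only if type and boundaries match
precision = correct / len(pred_spans)
recall = correct / len(gold_spans)
f1 = 2 * precision * recall / (precision + recall)
token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(precision, recall, f1, token_accuracy)
```

A single wrong token costs the whole person entity, so F1 is 0.5 while token accuracy is 0.75. Because most tokens are `O`, accuracy tends to look much better than F1 on NER data.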



In [None]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)

args = TrainingArguments(
    output_dir="arabert-ner",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=50,
    save_strategy="epoch",
    report_to="none",  # disable external logging integrations
)

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,  # newer transformers releases rename this argument to `processing_class`
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()


Some weights of BertForTokenClassification were not initialized from the model checkpoint at aubmindlab/bert-base-arabertv02 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.0728,0.134563,0.81513,0.796937,0.805931,0.966771
2,0.0384,0.14557,0.843454,0.778118,0.80947,0.96765
3,0.0341,0.140875,0.836125,0.808315,0.821985,0.96989


TrainOutput(global_step=822, training_loss=0.07973394653512904, metrics={'train_runtime': 248.1589, 'train_samples_per_second': 52.998, 'train_steps_per_second': 3.312, 'total_flos': 598472770007808.0, 'train_loss': 0.07973394653512904, 'epoch': 3.0})
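
The `global_step` and throughput figures in the training output can be reproduced from the dataset size and the training arguments. The arithmetic below is a quick check, assuming a single device and no gradient accumulation (neither is set in the cell above).

```python
import math

train_rows = 4384      # sentences in the tokenized train split
batch_size = 16        # per_device_train_batch_size
epochs = 3             # num_train_epochs
runtime_s = 248.1589   # train_runtime reported by the Trainer

# Assumption: one device, gradient_accumulation_steps left at 1.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_rows * epochs / runtime_s
steps_per_second = total_steps / runtime_s

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The results match the reported `global_step=822`, `train_samples_per_second` and `train_steps_per_second`, which is a useful check that the tokenized dataset has the size you expect.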

In [None]:
from transformers import pipeline

ner_pipeline = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

text = "أعلن المدير التنفيذي لشركة أبل تيم كوك عن افتتاح فرع جديد في الرياض"

results = ner_pipeline(text)

for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")


Device set to use cuda:0


Entity: أبل, Label: ORG, Score: 0.93
Entity: تيم كوك, Label: PERS, Score: 0.99
Entity: الرياض, Label: LOC, Score: 0.99
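
The pipeline output groups "تيم كوك" into one `PERS` entity because `aggregation_strategy="simple"` merges adjacent token predictions. The sketch below is a simplified illustration of that idea on hand-written token predictions. It is not the library's code: the real pipeline also reassembles word pieces with the tokenizer, and the `group_tokens` helper and toy scores here are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def group_tokens(tokens: List[Dict]) -> List[Dict]:
    """Merge consecutive token predictions of one entity type into a single group."""
    groups: List[Dict] = []
    open_type = None  # entity type of the group currently being extended
    for tok in tokens:
        prefix, _, etype = tok["entity"].partition("-")
        if prefix == "O":
            open_type = None  # a non-entity token closes any open group
            continue
        if prefix == "I" and etype == open_type:
            groups[-1]["words"].append(tok["word"])
            groups[-1]["scores"].append(tok["score"])
        else:  # a B- tag, or an I- tag of a different type, starts a new group
            groups.append({"entity_group": etype, "words": [tok["word"]], "scores": [tok["score"]]})
            open_type = etype
    return [
        {"entity_group": g["entity_group"], "word": " ".join(g["words"]), "score": mean(g["scores"])}
        for g in groups
    ]


toy_tokens = [
    {"word": "Apple", "entity": "B-ORG", "score": 0.93},
    {"word": "Tim", "entity": "B-PERS", "score": 0.99},
    {"word": "Cook", "entity": "I-PERS", "score": 0.97},
    {"word": "opened", "entity": "O", "score": 0.99},
    {"word": "Riyadh", "entity": "B-LOC", "score": 0.98},
]
toy_groups = group_tokens(toy_tokens)
for g in toy_groups:
    print(g["entity_group"], g["word"], round(g["score"], 2))
```

Each group's score is the mean of its token scores, so a long entity with one uncertain token gets a lower group score than its best token.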
