# Arabic Named-Entity Recognition (NER) — Assignment

This notebook guides you through building an Arabic NER model using the ANERCorp dataset (`asas-ai/ANERCorp`). Fill in the TODO cells to complete the exercise.

- **Objective:** Train a token-classification model (NER) that labels tokens with entity tags (e.g., people, locations, organizations).
- **Dataset:** `asas-ai/ANERCorp` — contains tokenized Arabic text and tag sequences.
- **Typical Labels:** `B-PER`, `I-PER` (person), `B-LOC`, `I-LOC` (location), `B-ORG`, `I-ORG` (organization), and `O` (outside/no entity). Your code should extract the exact label set from the dataset and build `label_list`, `id2label`, and `label2id` mappings.
- **Key Steps (what you will implement):**
  1. Load the dataset and inspect samples.
  2. Convert the provided words into sentence groupings (use `.` `?` `!` as sentence delimiters) before tokenization so sentence boundaries are preserved.
  3. Tokenize with a pretrained Arabic tokenizer and align tokenized sub-words with original labels (use `-100` for tokens to ignore in loss).
  4. Prepare `tokenized_datasets` and data collator for dynamic padding.
  5. Configure and run model training using `AutoModelForTokenClassification` and `Trainer`.
  6. Evaluate using `seqeval` (report precision, recall, F1, and accuracy) and run inference with a pipeline.

- **Evaluation:** Use the `seqeval` metric (entity-level precision, recall, F1). When aligning predictions and labels, filter out `-100` entries so only real token labels are compared.

- **Deliverables:** Completed notebook with working cells for data loading, tokenization/label alignment, training, evaluation, and an inference example. Add short comments explaining choices (e.g., sentence-splitting strategy, tokenizer settings).
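
The label-alignment rule in step 3 above can be tried on a toy example before touching the real tokenizer. The sketch below is only an illustration: the `word_ids` list is written by hand to mimic what a fast tokenizer's `word_ids()` returns (`None` for special tokens, repeated indices for sub-word pieces), and the `align_labels` helper, the tags, and the `label2id` mapping are made up for this example rather than taken from the dataset.

```python
# Toy label alignment: label the first sub-word of each word, mask the rest.
# All values below are hypothetical; no tokenizer or dataset is loaded.

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids, word_tags, label2id):
    """Return one label id per token, using IGNORE_INDEX where no label applies."""
    aligned = []
    previous = None
    for idx in word_ids:
        if idx is None or idx == previous:
            # Special token, or a continuation piece of the previous word
            aligned.append(IGNORE_INDEX)
        else:
            # First piece of a new word: carry over that word's tag
            aligned.append(label2id[word_tags[idx]])
        previous = idx
    return aligned


label2id = {"B-LOC": 0, "B-PER": 1, "I-PER": 2, "O": 3}
word_tags = ["B-PER", "I-PER", "O", "B-LOC"]   # one tag per original word
word_ids = [None, 0, 0, 1, 2, 3, 3, None]      # words 0 and 3 split into two pieces

aligned = align_labels(word_ids, word_tags, label2id)
print(aligned)  # [-100, 1, -100, 2, 3, 0, -100, -100]
```

Labeling only the first piece of each word is one common convention; it keeps the number of scored positions equal to the number of original words.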

Good luck — implement each TODO in order and run the cells to verify output.
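
As a quick check of the evaluation rule described above (drop `-100` positions before scoring), the following sketch converts predicted and reference ids into the lists of label strings that `seqeval` expects. It assumes argmax has already been applied to the model outputs; the helper name `decode_for_seqeval` and all ids are hypothetical values chosen for illustration.

```python
# Toy filtering step: keep only positions whose reference label is not -100,
# then map ids back to label names. Values are made up for this example.

IGNORE_INDEX = -100


def decode_for_seqeval(pred_ids, label_ids, id2label):
    """Return (predictions, references) as lists of label-name lists."""
    true_predictions, true_labels = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        # The reference label decides which positions count, not the prediction
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != IGNORE_INDEX]
        true_predictions.append([id2label[p] for p, _ in kept])
        true_labels.append([id2label[l] for _, l in kept])
    return true_predictions, true_labels


id2label = {0: "B-LOC", 1: "B-PER", 2: "I-PER", 3: "O"}
label_ids = [[-100, 1, -100, 2, 3, 0, -100, -100]]
pred_ids = [[3, 1, 1, 2, 3, 3, 0, 3]]

preds, refs = decode_for_seqeval(pred_ids, label_ids, id2label)
print(preds)  # [['B-PER', 'I-PER', 'O', 'O']]
print(refs)   # [['B-PER', 'I-PER', 'O', 'B-LOC']]
```

Whatever the model predicts at masked positions is discarded, so both lists always have the same length per sentence.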

In [None]:
# TODO: Install the required packages for Arabic NER with transformers
# Required packages: transformers, datasets, seqeval, evaluate, accelerate
# Use pip install with -q flag to suppress output

#!pip install transformers datasets seqeval evaluate accelerate -q

In [None]:
!pip install seqeval

In [None]:
!pip install --upgrade transformers datasets evaluate --quiet


In [None]:
# TODO: List the files in the current directory to explore the workspace
# Hint: Use a simple command to display directory contents
!ls -la


In [None]:
# TODO: Load the ANERCorp dataset and extract label mappings
# Steps:
# 1. Import required libraries (datasets, numpy)
# 2. Load the "asas-ai/ANERCorp" dataset using load_dataset()
# 3. Inspect the dataset structure - print the splits and a sample entry
# 4. Extract unique tags from the training split
# 5. Create label_list (sorted), id2label, and label2id mappings
import numpy as np

# YOUR CODE HERE
from datasets import load_dataset

dataset = load_dataset("asas-ai/ANERCorp")

print(f"Dataset Split: {dataset}")
print(f"Sample Entry: {dataset['train'][0]}")

# Collect unique tag strings (each row holds a single tag)
unique_tags = set(dataset["train"]["tag"])

# Create label mappings
label_list = sorted(unique_tags)
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}

print("\nLabel List:")
print(label_list)

print("\nlabel2id:")
print(label2id)



In [None]:
# TODO: Verify the dataset was loaded correctly
# Print the dataframe or dataset summary to inspect the data structure
print(dataset)


In [None]:
# TODO: Load tokenizer and create tokenization function
# Steps:
# 1. Import AutoTokenizer from transformers
# 2. Set model_checkpoint to "aubmindlab/bert-base-arabertv02"
# 3. Load the tokenizer using AutoTokenizer.from_pretrained()
# 4. Create tokenize_and_align_labels function that:
#    - Tokenizes the input text (is_split_into_words=True)
#    - Maps tokens to their original words
#    - Handles special tokens by setting them to -100
#    - Aligns labels with sub-word tokens
#    - Returns tokenized inputs with labels
# 5. Important: Convert words to sentences using punctuation marks ".?!" as sentence delimiters
#    - This helps the model understand sentence boundaries
#    - Hint (suggested approach): group `examples['word']` into sentence lists using ".?!" as end markers, e.g.:
#        sentences = []
#        current = []
#        for w in examples['word']:
#            current.append(w)
#            if w in ['.', '?', '!'] or (len(w) > 0 and w[-1] in '.?!'):
#                sentences.append(current)
#                current = []
#        if current:
#            sentences.append(current)
#      Then align `examples['tag']` accordingly to these sentence groups before tokenization.
# 6. Apply the function to the entire dataset using dataset.map()


# def tokenize_and_align_labels(examples):
#     # TODO: Implement tokenization and label alignment
#     # Hint: Use tokenizer with is_split_into_words=True
#     # Handle -100 for special tokens and sub-words
#     # Note: Consider punctuation marks ".?!" when processing sentence boundaries
#     pass


from transformers import AutoTokenizer

model_checkpoint = "aubmindlab/bert-base-arabertv02"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_and_align_labels(examples):
    """Group a batch of word/tag rows into sentences, tokenize, and align labels."""
    # Each dataset row holds one word and one tag, so with batched=True these
    # are flat lists covering the whole batch.
    words = examples["word"]
    tags = examples["tag"]

    sentences = []
    sentence_tags = []
    current_sentence = []
    current_sentence_tags = []

    # Close a sentence whenever a word is, or ends with, one of ".?!"
    for word, tag in zip(words, tags):
        current_sentence.append(word)
        current_sentence_tags.append(tag)

        if word in [".", "?", "!"] or (len(word) > 0 and word[-1] in ".?!"):
            sentences.append(current_sentence)
            sentence_tags.append(current_sentence_tags)
            current_sentence = []
            current_sentence_tags = []

    # Keep trailing words that never reached a delimiter
    # (for example, a sentence cut off at the batch boundary)
    if current_sentence:
        sentences.append(current_sentence)
        sentence_tags.append(current_sentence_tags)

    tokenized_inputs = {"input_ids": [], "attention_mask": [], "labels": []}

    for sentence, tags_list in zip(sentences, sentence_tags):
        tokenized = tokenizer(
            sentence,
            is_split_into_words=True,
            truncation=True,
            max_length=512
        )

        # word_ids() maps each token back to the index of its source word
        word_ids = tokenized.word_ids()
        label_ids = []
        previous_word_idx = None

        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens such as [CLS] and [SEP]
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First sub-word of a word carries the word's label
                label_ids.append(label2id[tags_list[word_idx]])
            else:
                # Remaining sub-words are ignored in the loss
                label_ids.append(-100)
            previous_word_idx = word_idx

        tokenized_inputs["input_ids"].append(tokenized["input_ids"])
        tokenized_inputs["attention_mask"].append(tokenized["attention_mask"])
        tokenized_inputs["labels"].append(label_ids)

    return tokenized_inputs


tokenized_datasets = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names
)

print("Tokenization complete!")
print(f"Tokenized train samples: {len(tokenized_datasets['train'])}")
print(f"Tokenized test samples: {len(tokenized_datasets['test'])}")

# Verify
print("\nSample tokenized entry:")
sample = tokenized_datasets["train"][0]
print(f"Input IDs length: {len(sample['input_ids'])}")
print(f"Labels length: {len(sample['labels'])}")
print(f"First few labels: {sample['labels'][:20]}")

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "aubmindlab/bert-base-arabertv02"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def tokenize_and_align_labels(examples):
    # Each dataset row holds a single word and its tag, so with batched=True
    # examples["word"] and examples["tag"] are flat lists. Iterate over them
    # directly; zipping inside each row would walk over characters instead.
    tokenized_inputs = {"input_ids": [], "attention_mask": [], "labels": []}

    # Group the batch into sentences, closing one at each ".", "?" or "!"
    sentences = []
    sentence_tags = []
    current_words = []
    current_tags = []
    for w, t in zip(examples["word"], examples["tag"]):
        current_words.append(w)
        current_tags.append(t)
        if w in [".", "?", "!"] or (len(w) > 0 and w[-1] in ".?!"):
            sentences.append(current_words)
            sentence_tags.append(current_tags)
            current_words = []
            current_tags = []
    if current_words:
        sentences.append(current_words)
        sentence_tags.append(current_tags)

    for sent_words, sent_tags in zip(sentences, sentence_tags):
        tokenized = tokenizer(
            sent_words, is_split_into_words=True, truncation=True
        )
        word_ids = tokenized.word_ids()
        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First sub-word of a word carries the word's label
                label_ids.append(label2id[sent_tags[word_idx]])
            else:
                # Remaining sub-words are ignored in the loss
                label_ids.append(-100)
            previous_word_idx = word_idx

        tokenized_inputs["input_ids"].append(tokenized["input_ids"])
        tokenized_inputs["attention_mask"].append(tokenized["attention_mask"])
        tokenized_inputs["labels"].append(label_ids)

    return tokenized_inputs


In [None]:
tokenized_datasets = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names
)

In [None]:
# TODO: Define the compute_metrics function for model evaluation
# Steps:
# 1. Import evaluate and load "seqeval" metric
# 2. Create compute_metrics function that:
#    - Extracts predictions from model outputs using argmax
#    - Filters out -100 labels (special tokens and sub-words)
#    - Converts prediction and label IDs back to label names
#    - Computes seqeval metrics (precision, recall, f1, accuracy)
#    - Returns results as a dictionary

import evaluate
import numpy as np

# YOUR CODE HERE

seqeval = evaluate.load("seqeval")


def compute_metrics(p):
    """Compute seqeval scores from an EvalPrediction, skipping -100 positions."""
    predictions, labels = p.predictions, p.label_ids

    # Pick the highest-scoring label id for every token
    preds = np.argmax(predictions, axis=-1)

    true_labels = []
    true_predictions = []

    for pred_row, label_row in zip(preds, labels):
        row_labels = []
        row_preds = []
        for p_id, l_id in zip(pred_row, label_row):
            if l_id != -100:  # skip special tokens, non-first sub-words, and padding
                row_labels.append(id2label[l_id])
                row_preds.append(id2label[p_id])
        true_labels.append(row_labels)
        true_predictions.append(row_preds)

    # Compute seqeval metrics; the result includes overall_precision,
    # overall_recall, overall_f1, and overall_accuracy
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return results


In [None]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification
import os

# Turn off Weights & Biases logging for this run
os.environ["WANDB_DISABLED"] = "true"

# Load the pretrained encoder with a fresh token-classification head
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id
)

training_args = TrainingArguments(
    output_dir="arabert-ner",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_strategy="steps",
    logging_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="overall_f1"  # key returned by compute_metrics
)

# Pads input_ids and labels to the longest sequence in each batch
data_collator = DataCollatorForTokenClassification(tokenizer)

# Hold out 10% of the training sentences for validation
train_val_split = tokenized_datasets["train"].train_test_split(test_size=0.1, seed=42)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_val_split["train"],
    eval_dataset=train_val_split["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()



In [None]:
# TODO: Test the trained model with inference
# Steps:
# 1. Import pipeline from transformers
# 2. Create an NER pipeline using the trained model and tokenizer
# 3. Use aggregation_strategy="simple" to merge sub-tokens back into words
# 4. Test the pipeline with an Arabic text sample
# 5. Pretty print the results showing entity, label, and confidence score

from transformers import pipeline


ner_pipeline = pipeline(
    "ner",
    model=trainer.model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

text = "أعلن المدير التنفيذي لشركة أبل تيم كوك عن افتتاح فرع جديد في الرياض."
results = ner_pipeline(text)

for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")