# Arabic Named-Entity Recognition (NER) — Assignment

This notebook guides you through building an Arabic NER model using the ANERCorp dataset (`asas-ai/ANERCorp`). Fill in the TODO cells to complete the exercise.

- **Objective:** Train a token-classification model (NER) that labels tokens with entity tags (e.g., people, locations, organizations).
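- **Dataset:** `asas-ai/ANERCorp`, which provides Arabic words (the `word` column) together with their entity tags (the `tag` column). The text comes as words rather than ready-made sentences, which is why step 2 below regroups them.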
- **Typical Labels:** `B-PER`, `I-PER` (person), `B-LOC`, `I-LOC` (location), `B-ORG`, `I-ORG` (organization), and `O` (outside/no entity). The tag strings in the dataset may not match these exactly, so your code should extract the label set from the data and build `label_list`, `id2label`, and `label2id` from it.
- **Key Steps (what you will implement):**
  1. Load the dataset and inspect samples.
  2. Convert the provided words into sentence groupings (use `.` `?` `!` as sentence delimiters) before tokenization so sentence boundaries are preserved.
  3. Tokenize with a pretrained Arabic tokenizer and align tokenized sub-words with original labels (use `-100` for tokens to ignore in loss).
  4. Prepare `tokenized_datasets` and data collator for dynamic padding.
  5. Configure and run model training using `AutoModelForTokenClassification` and `Trainer`.
  6. Evaluate using `seqeval` (report precision, recall, F1, and accuracy) and run inference with a pipeline.

- **Evaluation:** Use the `seqeval` metric, which scores precision, recall, and F1 at the entity level and accuracy at the token level. Before scoring, drop every position whose label is `-100` from both predictions and labels so only real token labels are compared.

- **Deliverables:** Completed notebook with working cells for data loading, tokenization/label alignment, training, evaluation, and an inference example. Add short comments explaining choices (e.g., sentence-splitting strategy, tokenizer settings).
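
Key step 2 (grouping words into sentences) can be sketched on toy data as follows. This is a minimal illustration, not the required solution: the helper name `group_into_sentences` and the sample words and tags are invented here, and the only rule applied is the `.` `?` `!` delimiter rule stated above.

```python
def group_into_sentences(words, tags, end_marks=".?!"):
    """Group parallel word/tag lists into sentences.

    A sentence is closed whenever a word ends with one of the end marks.
    """
    sentences, sentence_tags = [], []
    current_words, current_tags = [], []
    for word, tag in zip(words, tags):
        current_words.append(word)
        current_tags.append(tag)
        if word and word[-1] in end_marks:
            sentences.append(current_words)
            sentence_tags.append(current_tags)
            current_words, current_tags = [], []
    if current_words:  # keep a trailing sentence that has no end mark
        sentences.append(current_words)
        sentence_tags.append(current_tags)
    return sentences, sentence_tags


# Toy input (hypothetical, not taken from ANERCorp)
words = ["زار", "أحمد", "القاهرة", ".", "عاد", "أمس"]
tags = ["O", "B-PER", "B-LOC", "O", "O", "O"]

sentences, sentence_tags = group_into_sentences(words, tags)
print([len(s) for s in sentences])  # [4, 2]
print(sentence_tags[0])             # ['O', 'B-PER', 'B-LOC', 'O']
```

Grouping words and tags in the same loop keeps the two lists the same length per sentence, which the label-alignment step relies on.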

Good luck — implement each TODO in order and run the cells to verify output.

In [None]:
# TODO: Install the required packages for Arabic NER with transformers
# Required packages: transformers, datasets, seqeval, evaluate, accelerate
# Use pip install with -q flag to suppress output

!pip install transformers datasets seqeval evaluate accelerate -q

In [None]:
# TODO: List the files in the current directory to explore the workspace
# Hint: Use a simple command to display directory contents

# YOUR CODE HERE

In [None]:
# TODO: Load the ANERCorp dataset and extract label mappings
# Steps:
# 1. Import required libraries (datasets, numpy)
# 2. Load the "asas-ai/ANERCorp" dataset using load_dataset()
# 3. Inspect the dataset structure - print the splits and a sample entry
# 4. Extract unique tags from the training split
# 5. Create label_list (sorted), id2label, and label2id mappings

# YOUR CODE HERE
# dataset = 
# print(f"Dataset Split: {dataset}")
# print(f"Sample Entry: {dataset['train'][0]}")
# unique_tags = 
# label_list = 
# id2label = 
# label2id = 
# print(f"\nLabel List: {label_list}")
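
The label-mapping step above can be tried on a toy tag list first. The tag values below are hypothetical; in the notebook they would come from the `tag` column of the training split.

```python
# Toy tag column (hypothetical values; the real set comes from the dataset)
tags = ["O", "B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG"]

# Sorting makes the id assignment reproducible between runs
label_list = sorted(set(tags))
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

print(label_list)     # ['B-LOC', 'B-ORG', 'B-PER', 'I-ORG', 'I-PER', 'O']
print(label2id["O"])  # 5
```

Building `label2id` by inverting `id2label` guarantees the two mappings stay consistent with each other.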

In [None]:
# TODO: Verify the dataset was loaded correctly
# Print the dataset object (or a split converted to a pandas DataFrame) to inspect its splits, columns, and row counts

# YOUR CODE HERE

In [None]:
# TODO: Load tokenizer and create tokenization function
# Steps:
# 1. Import AutoTokenizer from transformers
# 2. Set model_checkpoint to "aubmindlab/bert-base-arabertv02"
# 3. Load the tokenizer using AutoTokenizer.from_pretrained()
# 4. Create tokenize_and_align_labels function that:
#    - Tokenizes the input text (is_split_into_words=True)
#    - Maps tokens to their original words
#    - Handles special tokens by setting them to -100
#    - Aligns labels with sub-word tokens
#    - Returns tokenized inputs with labels
# 5. Important: before calling the tokenizer, group the words into sentences using ".", "?" and "!" as sentence delimiters
#    - The dataset stores words, so this step is what preserves sentence boundaries for the model
#    - Hint (suggested approach): build sentence lists from `examples['word']`, closing a sentence at each end marker, e.g.:
#        sentences = []
#        current = []
#        for w in examples['word']:
#            current.append(w)
#            if w and w[-1] in '.?!':  # also matches a bare '.', '?' or '!'
#                sentences.append(current)
#                current = []
#        if current:  # keep a trailing sentence that has no end marker
#            sentences.append(current)
#      Group `examples['tag']` the same way so each sentence keeps exactly one tag per word.
# 6. Apply the function to the entire dataset using dataset.map() with batched=True
#    - Grouping changes the number of rows, so also pass remove_columns with the original column names

from transformers import AutoTokenizer

# YOUR CODE HERE
# model_checkpoint = 
# tokenizer = 

# def tokenize_and_align_labels(examples):
#     # TODO: Implement tokenization and label alignment
#     # Hint: Use tokenizer with is_split_into_words=True
#     # Handle -100 for special tokens and sub-words
#     # Note: Consider punctuation marks ".?!" when processing sentence boundaries
#     pass

# tokenized_datasets = 
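
The label-alignment rule described in the steps above can be checked without a tokenizer. The sketch below assumes a hand-written `word_ids` list of the kind a fast tokenizer returns from `word_ids(batch_index=i)`; the helper name `align_labels` and the sample ids are illustrative only. It labels the first sub-word of each word and masks everything else with `-100`, which is one common choice rather than the only valid one.

```python
def align_labels(word_ids, word_label_ids, ignore_index=-100):
    """Spread word-level label ids over sub-word tokens.

    Special tokens (word id None) and non-first sub-words are masked
    with ignore_index so the loss skips them.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)             # [CLS], [SEP], padding
        elif word_id != previous:
            aligned.append(word_label_ids[word_id])  # first piece of a word
        else:
            aligned.append(ignore_index)             # later pieces of the same word
        previous = word_id
    return aligned


# Hypothetical tokenizer output: [CLS], word 0, word 1 split in two pieces, word 2, [SEP]
word_ids = [None, 0, 1, 1, 2, None]
word_label_ids = [5, 2, 0]  # one label id per original word

print(align_labels(word_ids, word_label_ids))  # [-100, 5, 2, -100, 0, -100]
```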

In [None]:
# TODO: Define the compute_metrics function for model evaluation
# Steps:
# 1. Import evaluate and load "seqeval" metric
# 2. Create compute_metrics function that:
#    - Extracts predictions from model outputs using argmax
#    - Filters out -100 labels (special tokens and sub-words)
#    - Converts prediction and label IDs back to label names
#    - Computes seqeval metrics (returned under the keys overall_precision, overall_recall, overall_f1, overall_accuracy)
#    - Returns results as a dictionary

import evaluate
import numpy as np

# YOUR CODE HERE
# seqeval = 

# def compute_metrics(p):
#     # TODO: Implement metric computation
#     # Hint: Use np.argmax, filter -100 labels, use seqeval.compute()
#     pass
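
The filtering step inside `compute_metrics` can be sketched on plain lists. The helper name `strip_ignored` and the sample ids are invented for illustration; in the notebook the prediction ids would come from `np.argmax` over the model logits.

```python
def strip_ignored(pred_ids, label_ids, id2label, ignore_index=-100):
    """Drop positions labelled ignore_index and map the remaining ids to tag names."""
    true_preds, true_labels = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != ignore_index]
        true_preds.append([id2label[p] for p, _ in kept])
        true_labels.append([id2label[l] for _, l in kept])
    return true_preds, true_labels


# Toy batch with one sequence (hypothetical ids)
id2label = {0: "B-LOC", 1: "B-PER", 2: "O"}
pred_ids = [[2, 1, 1, 0, 2]]
label_ids = [[-100, 1, -100, 0, -100]]

true_preds, true_labels = strip_ignored(pred_ids, label_ids, id2label)
print(true_preds)   # [['B-PER', 'B-LOC']]
print(true_labels)  # [['B-PER', 'B-LOC']]
```

The two nested lists of tag names have the shape that `seqeval.compute(predictions=..., references=...)` expects.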

In [None]:
# TODO: Load the model and configure training
# Steps:
# 1. Import AutoModelForTokenClassification, TrainingArguments, Trainer, and DataCollatorForTokenClassification
# 2. Load the model using AutoModelForTokenClassification.from_pretrained() with:
#    - model_checkpoint
#    - num_labels based on label_list length
#    - id2label and label2id mappings
# 3. Create TrainingArguments with:
#    - output directory "arabert-ner"
#    - evaluation_strategy="epoch" (this argument is named eval_strategy in recent transformers releases)
#    - learning_rate=2e-5
#    - per_device_train_batch_size=16 and per_device_eval_batch_size=16
#    - num_train_epochs=3
#    - weight_decay=0.01
# 4. Create a DataCollatorForTokenClassification for dynamic padding
# 5. Initialize the Trainer with model, args, train_dataset, eval_dataset, data_collator, tokenizer (passed as processing_class in recent transformers releases), and compute_metrics
# 6. Call trainer.train() to start training

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification

# YOUR CODE HERE
# model = 
# args = 
# data_collator = 
# trainer = 
# trainer.train()

In [None]:
# TODO: Test the trained model with inference
# Steps:
# 1. Import pipeline from transformers
# 2. Create an NER pipeline using the trained model and tokenizer
# 3. Use aggregation_strategy="simple" to group the tokens of each entity into a single span
# 4. Test the pipeline with an Arabic text sample
# 5. Pretty print the results showing entity, label, and confidence score

from transformers import pipeline

# YOUR CODE HERE
# ner_pipeline = 
# 
# text = "أعلن المدير التنفيذي لشركة أبل تيم كوك عن افتتاح فرع جديد في الرياض."
# results = 
#
# # Pretty print results
# for entity in results:
#     print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.2f}")