## Loading and Preprocessing the dataset

## Code Explanation

The cells below use the Hugging Face Datasets library to load and explore a Named Entity Recognition (NER) dataset for Arabic text. Here's a breakdown of the main lines:

1. **Import Library:**
   - `from datasets import load_dataset`: This line imports the `load_dataset` function from the `datasets` library. This function is used to load datasets from the Hugging Face Hub.

2. **Load Dataset:**
   - `raw_datasets = load_dataset("e-hossam96/conllpp-ner-ar")`: This line loads the Arabic NER dataset named "conllpp-ner-ar" created by user "e-hossam96" from the Hugging Face Hub. The loaded dataset is stored in the `raw_datasets` variable.

3. **Access Sample Data:**
   - `raw_datasets["train"][7]["tokens"], raw_datasets["train"][7]["ner_tags"]`: This expression retrieves two fields from the 8th sample (index 7) in the "train" split of the dataset:
      - `raw_datasets["train"][7]["tokens"]`: This extracts the list of tokens (words) from the sample.
      - `raw_datasets["train"][7]["ner_tags"]`: This extracts the list of NER tags, one integer ID per token. Each ID encodes an entity label (e.g., the beginning of a person or organization name), and 0 means the token is not part of any entity.

4. **Get Feature Information:**
   - `ner_feature = raw_datasets["train"].features["ner_tags"]`: This line retrieves the feature definition for the "ner_tags" column from the "train" split. This feature definition describes the format and content of the NER tags.

5. **Extract Label Names:**
   - `label_names = ner_feature.feature.names`: This line extracts the list of possible NER tag names (e.g., "B-PER", "I-LOC") from the feature definition. These names represent the different entity types the model can identify.

6. **Print Output:**
   - `label_names, ner_feature`: As the last expression in a Jupyter Notebook cell, this tuple is displayed as the cell's output. It shows both the extracted label names and the complete feature definition for the "ner_tags" column.



In [14]:
from datasets import load_dataset

raw_datasets = load_dataset("e-hossam96/conllpp-ner-ar")

In [15]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 10250
    })
    validation: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 2383
    })
    test: Dataset({
        features: ['tokens', 'ner_tags'],
        num_rows: 2572
    })
})

In [29]:
raw_datasets["train"][7]["tokens"], raw_datasets["train"][7]["ner_tags"]

(['وقال',
  'إن',
  'الاقتراح',
  'الذي',
  'قدمه',
  'الشهر',
  'الماضي',
  'مفوض',
  'المزرعة',
  'الاتحاد',
  'الأوروبي',
  'فرانز',
  'فيشلر',
  'بحظر',
  'أدمغة',
  'الأغنام',
  'والطحال',
  'والنخاع',
  'الشوكي',
  'من',
  'السلسلة',
  'الغذائية',
  'البشرية',
  'والحيوانية',
  'كان',
  'بمثابة',
  'خطوة',
  'احترازية',
  'ومحددة',
  'للغاية',
  'لحماية',
  'صحة',
  'الإنسان.'],
 [0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  3,
  4,
  1,
  2,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0])

In [23]:
ner_feature = raw_datasets["train"].features["ner_tags"]
label_names = ner_feature.feature.names
label_names, ner_feature

(['O',
  'B-PER',
  'I-PER',
  'B-ORG',
  'I-ORG',
  'B-LOC',
  'I-LOC',
  'B-MISC',
  'I-MISC'],
 Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None))
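
To make the integer tags concrete, the sketch below decodes the four non-`O` tags of sample 7 shown earlier, using the label list printed above. The tokens and tag IDs are copied from the outputs; the variable names `entity_tokens` and `entity_tags` are used only for illustration.

```python
# Label names as printed above; the position in the list is the tag ID.
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
               "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# Tokens at indices 9-12 of training sample 7 and their tags (copied from the output).
entity_tokens = ["الاتحاد", "الأوروبي", "فرانز", "فيشلر"]
entity_tags = [3, 4, 1, 2]

# Look up each integer tag in the label list.
decoded = [(token, label_names[tag]) for token, tag in zip(entity_tokens, entity_tags)]
for token, name in decoded:
    print(f"{token}\t{name}")
```

The first two tokens decode to `B-ORG`, `I-ORG` (an organization) and the last two to `B-PER`, `I-PER` (a person).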

In [24]:
words = raw_datasets["train"][0]["tokens"]
labels = raw_datasets["train"][0]["ner_tags"]
line1 = ""
line2 = ""
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

print(line1)
print(line2)

الاتحاد الأوروبي يرفض الدعوة الألمانية لمقاطعة لحم الضأن البريطاني . 
B-ORG   I-ORG    O    O      B-MISC    O       O   O     B-MISC    O 


## Importing and implementing the alignment function

Define a function `align_labels_with_tokens` that aligns NER tags with tokens after Arabic BERT-based tokenization.

- **Imports:** `AutoTokenizer` from `transformers` for tokenization.
- **Tokenizer:** Loads tokenizer for the Arabic BERT model (`aubmindlab/bert-base-arabertv02`).
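- **Function:** `align_labels_with_tokens` takes the word-level NER tags (`labels`) and `word_ids`, the list that maps each token to the index of the word it came from (`None` for special tokens).
- **Alignment:** Iterates through `word_ids`:
    - Start of new word (`word_id` differs from the previous one): Uses that word's label from `labels`.
    - Special token (`word_id` is `None`): Adds the placeholder label -100, which the loss function ignores.
    - Same word as previous token: Reuses the word's label, changing a B- tag (odd ID) to the matching I- tag (the next ID) so that only the first token of an entity carries B-.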
- **Returns:** The aligned NER tags (`new_labels`) matching the tokenization.

This function ensures each token has a corresponding NER label after tokenization.


In [25]:
from transformers import AutoTokenizer

model_checkpoint = "aubmindlab/bert-base-arabertv02"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)



In [26]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels
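
A small worked example may help here. The `labels` and `word_ids` values below are made up for illustration (they do not come from the dataset), and the function is repeated from the cell above only so that the snippet runs on its own.

```python
def align_labels_with_tokens(labels, word_ids):
    # Same logic as the cell above, repeated to keep this snippet self-contained.
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            current_word = word_id
            new_labels.append(-100 if word_id is None else labels[word_id])
        elif word_id is None:
            new_labels.append(-100)
        else:
            label = labels[word_id]
            new_labels.append(label + 1 if label % 2 == 1 else label)
    return new_labels


# Three words tagged O, B-LOC, O; the second word is split into two subword tokens.
toy_labels = [0, 5, 0]
# None marks the special tokens at the start and end of the sequence.
toy_word_ids = [None, 0, 1, 1, 2, None]

aligned = align_labels_with_tokens(toy_labels, toy_word_ids)
print(aligned)  # [-100, 0, 5, 6, 0, -100]
```

The second subword of the `B-LOC` word receives `I-LOC` (6), and both special tokens receive -100.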

This code preprocesses a dataset for Arabic NER with a transformer model:

1. **Function:** `tokenize_and_align_labels` tokenizes text and aligns labels with tokens.
2. **Tokenization:** Tokenizes the pre-split words (`is_split_into_words=True`) into subword tokens, truncating inputs that are too long.
3. **Label Alignment:** Aligns original labels with tokens using `align_labels_with_tokens`.
4. **Batch Processing:** Applies `tokenize_and_align_labels` to every split with `map(batched=True)` and drops the original columns.
5. **Data Collator:** Creates a `DataCollatorForTokenClassification`, which pads inputs and labels to the longest example in a batch.
6. **Batch Creation:** Builds a sample batch from the first 10 training examples to inspect the padding.

This prepares the data for training a transformer-based NER model on Arabic text.


In [28]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

In [29]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
batch = data_collator([tokenized_datasets["train"][i] for i in range(10)])
batch['input_ids'][0],batch['labels'][0],batch['attention_mask'][0]

(tensor([    2,   948,  2934,  5999,  4508,  4205, 37995, 12786,   792,   460,
          4704,    20,     3,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0]),
 tensor([-100,    3,    4,    0,    0,    7,    0,    0,    0,    0,    7,    0,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100]),
 tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
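
In the output above, the first example has 13 real tokens and the rest is padding: `input_ids` is padded with 0, `labels` with -100, and `attention_mask` with 0. The sketch below illustrates that idea in plain Python. `pad_batch` is a hypothetical helper written for this explanation, not the collator's actual implementation; the pad ID 0 is taken from the printed output, and the toy token IDs are made up.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every feature dict to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        padded["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to.
        padded["attention_mask"].append([1] * n_real + [0] * n_pad)
        # -100 keeps padded positions out of the loss.
        padded["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return padded


# Two toy examples of different lengths.
toy_features = [
    {"input_ids": [2, 11, 12, 3], "labels": [-100, 3, 4, -100]},
    {"input_ids": [2, 21, 3], "labels": [-100, 0, -100]},
]
toy_batch = pad_batch(toy_features)
print(toy_batch["input_ids"][1])       # [2, 21, 3, 0]
print(toy_batch["labels"][1])          # [-100, 0, -100, -100]
print(toy_batch["attention_mask"][1])  # [1, 1, 1, 0]
```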

In [31]:
# importing evaluate library to use it in model evaluation
import evaluate
metric = evaluate.load("seqeval")

In [32]:
labels = raw_datasets["train"][0]["ner_tags"]
labels = [label_names[i] for i in labels]
labels

['B-ORG', 'I-ORG', 'O', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O']

In [33]:
predictions = labels.copy()
predictions[2] = "I-ORG" # Changing the prediction to test our eval
metric.compute(predictions=[predictions], references=[labels])

{'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'ORG': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1},
 'overall_precision': 0.6666666666666666,
 'overall_recall': 0.6666666666666666,
 'overall_f1': 0.6666666666666666,
 'overall_accuracy': 0.9}
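
Precision, recall, and F1 above are computed per entity, not per token: changing one token to `I-ORG` stretches the predicted ORG span, so it no longer matches the reference span exactly, while token accuracy stays at 9 out of 10. The sketch below reproduces that arithmetic in plain Python. `get_spans` is a simplified, hypothetical helper that assumes well-formed BIO tags; it is not seqeval's implementation.

```python
def get_spans(tags):
    """Return the set of (entity_type, start, end) spans in a BIO tag list."""
    spans, start, ent_type = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel closes an open span
        continues = tag.startswith("I-") and tag[2:] == ent_type
        if start is not None and not continues:
            spans.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans


references = ["B-ORG", "I-ORG", "O", "O", "B-MISC", "O", "O", "O", "B-MISC", "O"]
predictions = references.copy()
predictions[2] = "I-ORG"  # same change as in the cell above

ref_spans, pred_spans = get_spans(references), get_spans(predictions)
correct = len(ref_spans & pred_spans)  # spans must match exactly
precision = correct / len(pred_spans)
recall = correct / len(ref_spans)
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(precision, recall, f1, accuracy)
```

Two of the three predicted spans (both MISC entities) match the references, which gives 2/3 for precision, recall, and F1, in line with the seqeval output.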

We define the evaluation function for the trained NER model and load the model with its label mappings:

**1. Evaluation Function (`compute_metrics`):**
  - Takes model predictions (`logits`) and ground truth labels.
  - Converts logits to predicted labels (argmax).
  - Removes ignored labels (`-100`) from both predictions and ground truth.
  - Uses the `seqeval` metric loaded earlier (`metric`) to compute precision, recall, F1, and accuracy.
  - Returns a dictionary containing these metrics.

**2. Label Mapping:**
  - Creates dictionaries `id2label` and `label2id` to map between label IDs and their actual names.
  - `id2label`: Maps numerical label ID to its corresponding NER tag name.
  - `label2id`: Maps the NER tag name to its corresponding numerical ID.

**3. Model Loading:**
  - Loads the Arabic BERT-based model (`AutoModelForTokenClassification`) from the specified checkpoint (`model_checkpoint`).
  - Provides the label mapping dictionaries (`id2label` and `label2id`) during model loading.

This code prepares for model evaluation by defining how to calculate performance metrics and ensures the model can interpret labels correctly. The `metric` object used in `compute_metrics` is the `seqeval` metric loaded earlier with `evaluate.load("seqeval")`.


In [34]:
import numpy as np


def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
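
To see what the filtering step does, the sketch below runs the same list comprehensions on a toy batch. The logits and labels are invented for illustration, and the `argmax` helper is a plain-Python stand-in for `np.argmax(logits, axis=-1)` so that the snippet has no dependencies.

```python
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
               "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def argmax(scores):
    """Index of the largest score in one row of logits."""
    return max(range(len(scores)), key=lambda i: scores[i])


def toy_logits(index, size=9):
    """Made-up logits with a single high score at `index`."""
    return [5.0 if i == index else 0.0 for i in range(size)]


# One sequence of 4 tokens: special token, two word tokens, special token.
logits = [[toy_logits(0), toy_logits(5), toy_logits(0), toy_logits(3)]]
labels = [[-100, 5, 6, -100]]

predictions = [[argmax(tok) for tok in seq] for seq in logits]

# Positions labeled -100 are dropped from both references and predictions.
true_labels = [[label_names[l] for l in seq if l != -100] for seq in labels]
true_predictions = [
    [label_names[p] for p, l in zip(pred_seq, label_seq) if l != -100]
    for pred_seq, label_seq in zip(predictions, labels)
]
print(true_labels)       # [['B-LOC', 'I-LOC']]
print(true_predictions)  # [['B-LOC', 'O']]
```

Whatever the model predicts at the special-token positions (here `O` and `B-ORG`) never reaches the metric.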

In [35]:
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

In [36]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at aubmindlab/bert-base-arabertv02 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The cells below train and evaluate an NER model for Arabic text using the Transformers library:

**1. Training Arguments:**
  - Defines training hyperparameters using `TrainingArguments`:
      - `output_dir`: The directory where checkpoints are saved (`aubmindlab/bert-base-arabert`, passed as the first positional argument).
      - `evaluation_strategy`: Evaluates the model after each epoch.
      - `save_strategy`: Saves the model checkpoint after each epoch.
      - `learning_rate`: Sets the learning rate to 2e-5.
      - `num_train_epochs`: Trains for 3 epochs.
      - `weight_decay`: Applies weight decay (0.01).
      - `push_to_hub`: Disables pushing the model to the Hugging Face Hub (set to `False`).

**2. Trainer Setup:**
  - Creates a `Trainer` object to manage the training and evaluation process.
  - Provides the following arguments to the trainer:
      - `model`: The loaded Arabic BERT-based model.
      - `args`: The defined training arguments.
      - `train_dataset`: The preprocessed training dataset (`tokenized_datasets["train"]`).
      - `eval_dataset`: The preprocessed validation dataset (`tokenized_datasets["validation"]`).
      - `data_collator`: The data collator for batching and padding (`data_collator`).
      - `compute_metrics`: The function to calculate evaluation metrics (`compute_metrics`).
      - `tokenizer`: The Arabic BERT tokenizer (`tokenizer`).

**3. Training and Evaluation:**
  - Calls `trainer.train()` to train the model on the provided training dataset.
  - Calls `trainer.evaluate(eval_dataset=tokenized_datasets["test"])` to evaluate the model's performance on the test dataset after training.

In [37]:
from transformers import TrainingArguments
import accelerate

args = TrainingArguments(
    "aubmindlab/bert-base-arabert",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

In [38]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,0.2195,0.186373,0.813286,0.835622,0.824303,0.944874
2,0.1217,0.163757,0.845081,0.869144,0.856944,0.952393
3,0.0852,0.172738,0.854658,0.863288,0.858951,0.954198


TrainOutput(global_step=3846, training_loss=0.16325811117314076, metrics={'train_runtime': 484.4064, 'train_samples_per_second': 63.48, 'train_steps_per_second': 7.94, 'total_flos': 619559527872024.0, 'train_loss': 0.16325811117314076, 'epoch': 3.0})
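
The step count in `TrainOutput` can be cross-checked against the dataset size. The sketch below assumes a per-device batch size of 8 (no batch size is set in the `TrainingArguments` above, so this relies on the library default), a single device, and no gradient accumulation. These are assumptions about the run, not values stated in the notebook.

```python
import math

num_train_rows = 10250    # from the DatasetDict printed earlier
batch_size = 8            # assumed: default per-device batch size, one device
num_epochs = 3
train_runtime = 484.4064  # seconds, from TrainOutput

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(num_train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 1282 3846

# Throughput figures derived from the same numbers.
samples_per_second = round(num_train_rows * num_epochs / train_runtime, 2)
steps_per_second = round(total_steps / train_runtime, 2)
print(samples_per_second, steps_per_second)  # 63.48 7.94
```

Under these assumptions the result matches `global_step=3846` and the reported throughput values.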

In [39]:
trainer.evaluate(eval_dataset=tokenized_datasets["test"])

{'eval_loss': 0.2298029363155365,
 'eval_precision': 0.8308346839546191,
 'eval_recall': 0.8525987525987526,
 'eval_f1': 0.8415760311922841,
 'eval_accuracy': 0.9438562171414598,
 'eval_runtime': 6.894,
 'eval_samples_per_second': 373.077,
 'eval_steps_per_second': 46.707,
 'epoch': 3.0}

Loading the trained model and making predictions with the Hugging Face pipeline:

**1. Model Loading:**
- Loads our fine-tuned NER model from the saved checkpoint (`model_checkpoint`).
- Uses the `pipeline` function from Transformers to create a ready-to-use NER pipeline.
- Sets `aggregation_strategy` to "simple", which merges consecutive tokens predicted as the same entity into a single group, so the output reports whole entities (under the `entity_group` key) rather than individual tokens.

**2. NER Prediction:**
- Stores the input Arabic sentence in the variable `comp_sent`.
- Calls the loaded NER pipeline (`token_classifier`) on the input sentence (`comp_sent`).
- Saves the resulting predictions in the variable `preds`.

**3. Output:**
- Iterates through the predictions (`preds`):
    - For each prediction:
        - Prints the word (`i['word']`).
        - Prints the predicted entity label (`i['entity_group']`).

In [4]:
from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "aubmindlab/bert-base-arabert/checkpoint-3846/"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)


In [None]:
comp_sent = "اتهمت الصين يوم الخميس تايبيه بإفساد الأجواء لاستئناف المحادثات عبر مضيق تايوان بزيارة نائب الرئيس التايواني ليان تشان إلى أوكرانيا هذا الأسبوع والتي أثارت غضب بكين"

In [82]:
preds = token_classifier(comp_sent)

In [83]:
for i in preds:
    print(f"the word {i['word']}\n is labeled as {i['entity_group']}")

the word الصين
 is labeled as LOC
the word تايبيه
 is labeled as LOC
the word تايوان
 is labeled as LOC
the word ليان تشان
 is labeled as PER
the word أوكرانيا
 is labeled as LOC
the word بكين
 is labeled as LOC
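
Because the pipeline returns one dictionary per grouped entity, post-processing is straightforward. The sketch below collects the entity words by label. `sample_preds` is a hand-written stand-in that contains only the `word` and `entity_group` values printed above; real pipeline results carry further fields (such as a confidence score and character offsets) that are omitted here.

```python
from collections import defaultdict

# Stand-in for `preds`, limited to the two fields shown in the output above.
sample_preds = [
    {"word": "الصين", "entity_group": "LOC"},
    {"word": "تايبيه", "entity_group": "LOC"},
    {"word": "تايوان", "entity_group": "LOC"},
    {"word": "ليان تشان", "entity_group": "PER"},
    {"word": "أوكرانيا", "entity_group": "LOC"},
    {"word": "بكين", "entity_group": "LOC"},
]

# Group the entity words under their predicted label.
by_label = defaultdict(list)
for pred in sample_preds:
    by_label[pred["entity_group"]].append(pred["word"])

for label, words in by_label.items():
    print(label, words)
```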
