# Model Fine-Tuning for EthioMart Amharic NER System

This notebook fine-tunes a DistilBERT model for Named Entity Recognition (NER) on Amharic text, using the labeled dataset built from Telegram messages.


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer, TrainingArguments, Trainer
from evaluate import load  # For evaluation metrics
import shap  # For model interpretability

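# Load the labeled dataset (CoNLL format).
# Note: the "text" loader yields one example per line of the file, with a single "text" column;
# it does not split token/label pairs or group lines into sentences.
data_files = {"train": "../data/labels/ner_labels.txt"}
dataset = load_dataset("text", data_files=data_files, split="train")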


Generating train split: 5046 examples [00:00, 99359.45 examples/s]
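
The `text` loader returns one row per line of the file, whereas token classification works on whole sentences with parallel lists of words and tags. Here is a minimal sketch of that conversion, assuming the usual CoNLL layout (one `token tag` pair per line, a blank line between sentences). The function name `parse_conll` and the sample tags are hypothetical and not taken from the EthioMart data.

```python
def parse_conll(lines):
    """Group CoNLL-style lines into sentences of parallel tokens and tags."""
    sentences, tokens, tags = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence
            if tokens:
                sentences.append({"tokens": tokens, "tags": tags})
                tokens, tags = [], []
            continue
        # Assumption: the token is the first field and the tag is the last
        parts = line.split()
        tokens.append(parts[0])
        tags.append(parts[-1])
    if tokens:  # the file may not end with a blank line
        sentences.append({"tokens": tokens, "tags": tags})
    return sentences


# Hypothetical sample; real tag names depend on the labeling scheme
sample = [
    "ዋጋ O",
    "500 B-PRICE",
    "ብር I-PRICE",
    "",
    "አዲስ B-LOC",
    "አበባ I-LOC",
]
sentences = parse_conll(sample)

# Derive the tag inventory and the integer mappings from the data
label_list = sorted({tag for s in sentences for tag in s["tags"]})
label2id = {tag: i for i, tag in enumerate(label_list)}
id2label = {i: tag for tag, i in label2id.items()}
```

The length of `label_list` gives the value for `num_labels`, and `id2label` / `label2id` can be passed as keyword arguments to `AutoModelForTokenClassification.from_pretrained` so that the saved model carries readable tag names.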


## Tokenization

We will initialize the tokenizer and model, and create a tokenization function to prepare our data for training.


In [2]:
# Tokenizer and model
model_name = "distilbert-base-multilingual-cased"  # You can also use "bert-tiny-amharic"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=4)  # Adjust based on your entities

# If True, every subword of a word gets that word's label; if False, only the first
# subword is labeled and the remaining subwords are set to -100 (ignored by the loss)
label_all_tokens = False

# Tokenization function with label alignment.
# Expects examples['text'] to hold lists of words and examples['labels'] lists of integer label ids.
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples['text'],
        padding='max_length',
        truncation=True,
        max_length=512,
        is_split_into_words=True  # Input is pre-split into words, so word_ids() can map tokens back to words
    )

    labels = []
    for i, label in enumerate(examples['labels']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens back to their word
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens and padding get a label of -100 (ignored by the loss)
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # The first subword of a word takes the word's label
                label_ids.append(label[word_idx])
            else:
                # Remaining subwords copy the word's label only if label_all_tokens is set
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels  # Add labels to tokenized inputs
    return tokenized_inputs

# Tokenize the dataset and align labels. The Trainer computes the loss from the "labels" column,
# so the mapped dataset must contain input_ids, attention_mask, and labels.
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

# The model must be a token-classification model whose num_labels matches the number of classes
# in the dataset. It is already loaded above with num_labels=4. With a list of tag names
# (label_list, which is not defined in this notebook), the equivalent call would be:
# model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_list))


Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 5046/5046 [00:01<00:00, 4957.75 examples/s]
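
To make the alignment rule concrete, here is a small sketch that applies the same logic to a hand-written `word_ids` list instead of a real tokenizer output. The helper name `align_word_labels` and the toy values are illustrative assumptions.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_word_labels(word_ids, word_labels, label_all_tokens=False):
    """Expand word-level labels to token level using the token-to-word map."""
    aligned, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:
            # Special token or padding: excluded from the loss
            aligned.append(IGNORE_INDEX)
        elif word_idx != previous:
            # First subword of a word carries the word's label
            aligned.append(word_labels[word_idx])
        else:
            # Later subwords are labeled only when label_all_tokens is True
            aligned.append(word_labels[word_idx] if label_all_tokens else IGNORE_INDEX)
        previous = word_idx
    return aligned


# Toy case: [CLS], word 0 split into two subwords, word 1 as one token, [SEP]
word_ids = [None, 0, 0, 1, None]
word_labels = [1, 0]

first_only = align_word_labels(word_ids, word_labels)
all_tokens = align_word_labels(word_ids, word_labels, label_all_tokens=True)
print(first_only)  # [-100, 1, -100, 0, -100]
print(all_tokens)  # [-100, 1, 1, 0, -100]
```

Labeling only the first subword keeps one prediction per word, which matches how word-level tags are scored later. Labeling every subword gives the loss more supervised positions at the cost of counting long words more heavily.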


## Training the Model

Set up the training arguments and initialize the Trainer to fine-tune the model.


In [None]:
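# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # Evaluate after every epoch (renamed to eval_strategy in newer transformers releases)
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    # Per-epoch evaluation requires an eval_dataset. No held-out split exists in this notebook,
    # so the training data is reused here; the reported loss therefore measures fit, not generalization.
    eval_dataset=tokenized_dataset,
)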

# Fine-tune the model
trainer.train()

# Save the fine-tuned model
trainer.save_model("../models/fine-tuned/")
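
Per-epoch evaluation and the final metrics are most informative on examples held out from training. Here is a minimal sketch of a reproducible hold-out split over example indices. The 10% fraction, the seed, and the helper name `split_indices` are assumptions; `Dataset.train_test_split` in the `datasets` library does the equivalent directly on a `Dataset`.

```python
import random


def split_indices(n_examples, eval_fraction=0.1, seed=42):
    """Return (train_indices, eval_indices) as disjoint, sorted lists."""
    indices = list(range(n_examples))
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state
    random.Random(seed).shuffle(indices)
    n_eval = max(1, int(n_examples * eval_fraction))
    return sorted(indices[n_eval:]), sorted(indices[:n_eval])


# 5046 is the example count reported when the dataset was loaded
train_idx, eval_idx = split_indices(5046)
print(len(train_idx), len(eval_idx))
```

The two index lists can then be turned into datasets with `tokenized_dataset.select(train_idx)` and `tokenized_dataset.select(eval_idx)` and passed to the `Trainer` as `train_dataset` and `eval_dataset`.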


## Model Evaluation

After fine-tuning, we will evaluate the model's performance using F1-score, precision, and recall metrics.


In [None]:
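# Load evaluation metrics
metric = load("seqeval")

# Function to align predictions with actual labels
def align_predictions(predictions, label_ids):
    preds = np.argmax(predictions, axis=-1)

    # id2label is not defined elsewhere in this notebook, so use the mapping stored in the model config.
    # Unless real tag names were supplied when the model was loaded, these are generic "LABEL_i" names,
    # which seqeval cannot interpret as entity tags.
    id2label = model.config.id2label

    # Convert indices to tag names, skipping positions labeled -100 (special tokens, padding, masked subwords)
    pred_labels = [
        [id2label[int(p)] for p, l in zip(pred, label) if l != -100]
        for pred, label in zip(preds, label_ids)
    ]
    true_labels = [
        [id2label[int(l)] for l in label if l != -100]
        for label in label_ids
    ]

    return pred_labels, true_labels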

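# Evaluate the fine-tuned model.
# Note: tokenized_dataset is the training data, so these scores measure fit rather than generalization.
predictions, label_ids, metrics = trainer.predict(tokenized_dataset)
preds, labels = align_predictions(predictions, label_ids)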

# Compute evaluation metrics (F1-score, precision, recall)
results = metric.compute(predictions=preds, references=labels)
print("Evaluation Results:", results)

# Save evaluation metrics
with open("../models/fine-tuned/evaluation_results.txt", "w") as f:
    f.write(str(results))
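
seqeval scores whole entity spans rather than individual tokens: a predicted entity counts as correct only if both its boundaries and its type match the reference. The sketch below reproduces that idea for IOB2 tags in plain Python. It is a simplified illustration, not seqeval's implementation (for example, it ignores an `I-` tag that does not continue an open span), and the helper names and sample tags are assumptions.

```python
def extract_spans(tags):
    """Return the set of (type, start, end) entity spans in one IOB2 tag sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            # The open span ends at the previous position
            spans.add((current, start, i - 1))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return spans


def entity_prf(true_seqs, pred_seqs):
    """Entity-level precision, recall, and F1 over a list of sentences."""
    true_spans, pred_spans = set(), set()
    for idx, (t, p) in enumerate(zip(true_seqs, pred_seqs)):
        # Prefix each span with the sentence index so spans stay distinct across sentences
        true_spans |= {(idx, *s) for s in extract_spans(t)}
        pred_spans |= {(idx, *s) for s in extract_spans(p)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


true = [["B-PRICE", "I-PRICE", "O", "B-LOC"]]
pred = [["B-PRICE", "O", "O", "B-LOC"]]
precision, recall, f1 = entity_prf(true, pred)
print(precision, recall, f1)  # 0.5 0.5 0.5
```

In this toy case three of the four tokens are tagged correctly, yet only one of the two entities is recovered exactly, which is why entity-level scores are usually lower than token accuracy.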


## Model Interpretability with SHAP

Finally, we will use SHAP to interpret the model's predictions and understand which tokens influenced its decisions.


In [None]:
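# SHAP interpretability: shap.Explainer needs a text-in, scores-out callable plus a text masker.
# As a simplifying assumption, each text is scored by its highest per-class probability over non-padding tokens.
import torch

def predict_class_scores(texts):
    enc = tokenizer(list(texts), padding=True, truncation=True, max_length=512, return_tensors="pt").to(model.device)
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1) * enc["attention_mask"].unsqueeze(-1)
    return probs.max(dim=1).values.cpu().numpy()  # shape: (n_texts, num_labels)

explainer = shap.Explainer(predict_class_scores, shap.maskers.Text(tokenizer))
# Compute SHAP values for a few raw-text examples
shap_values = explainer(dataset["text"][:10])

# Plot SHAP values
shap.plots.text(shap_values)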
