# Task 4: Model Comparison & Selection

In this task, we compare three multilingual transformer models for Amharic NER: XLM-Roberta, mBERT, and DistilBERT. Each model is fine-tuned on our labeled dataset and evaluated on the same held-out split, and the best performer is selected for EthioMart's business needs.

## Model Selection

We will compare the following models:
- XLM-Roberta (`xlm-roberta-base`)
- Multilingual BERT (`bert-base-multilingual-cased`)
- DistilBERT, multilingual variant (`distilbert-base-multilingual-cased`)

Each model will be fine-tuned and evaluated using the same data and metrics.

In [None]:
model_names = {
    "XLM-Roberta": "xlm-roberta-base",
    "mBERT": "bert-base-multilingual-cased",
    "DistilBERT": "distilbert-base-multilingual-cased"
}

## Data Preparation

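We will reuse the labeled dataset and preprocessing pipeline from Task 3. The data is loaded from a CoNLL file and split 80/20 into training and validation sets with a fixed seed. Tokenization happens later, once per model, because each checkpoint ships its own tokenizer.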
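Before encoding, it can help to check that every tag in the file belongs to the expected label set, since an unexpected tag would raise a `KeyError` during label encoding. The following is a minimal sketch, not part of the Task 3 pipeline: `label_stats`, `KNOWN_LABELS`, and the placeholder tokens (`w1`, `w2`, ...) are hypothetical names used only for illustration.

```python
from collections import Counter

# Label set assumed to match the one used in this notebook.
KNOWN_LABELS = {"O", "B-Product", "I-Product", "B-LOC", "I-LOC", "B-PRICE", "I-PRICE"}


def label_stats(sentences):
    """Count tag frequencies and list tags that are not in KNOWN_LABELS."""
    counts = Counter(tag for sent in sentences for tag in sent["ner_tags"])
    unknown = sorted(set(counts) - KNOWN_LABELS)
    return counts, unknown


# Placeholder sentences in the {"tokens": [...], "ner_tags": [...]} layout.
sample = [
    {"tokens": ["w1", "w2", "w3"], "ner_tags": ["B-Product", "I-Product", "O"]},
    {"tokens": ["w4", "w5"], "ner_tags": ["B-PRICE", "B-Price"]},  # deliberate typo
]

counts, unknown = label_stats(sample)
print(counts)
print("Unknown tags:", unknown)
```

Running the same check on the real sentences would surface casing or spelling inconsistencies in the annotations before any training time is spent.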

In [None]:
conll_path = "../data/labeled/ner_labeled_sample.conll"
label_list = ["O", "B-Product", "I-Product", "B-LOC", "I-LOC", "B-PRICE", "I-PRICE"]
label_to_id = {l: i for i, l in enumerate(label_list)}
id_to_label = {i: l for l, i in label_to_id.items()}

def read_conll(path):
    tokens, labels, sentences = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if tokens:
                    sentences.append({"tokens": tokens, "ner_tags": labels})
                    tokens, labels = [], []
            else:
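                splits = line.split()
                if len(splits) < 2:
                    continue  # skip malformed lines that lack a label column
                tokens.append(splits[0])   # first column: token
                labels.append(splits[-1])  # last column: NER tag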
        if tokens:
            sentences.append({"tokens": tokens, "ner_tags": labels})
    return sentences

def encode_labels(example):
    example["labels"] = [label_to_id[l] for l in example["ner_tags"]]
    return example

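def tokenize_and_align_labels(example, tokenizer):
    tokenized_inputs = tokenizer(example["tokens"], truncation=True, is_split_into_words=True)
    word_ids = tokenized_inputs.word_ids()
    labels = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous_word_idx:
            # Special tokens and non-initial sub-tokens get -100 so the loss
            # ignores them and each word is scored once at evaluation time.
            labels.append(-100)
        else:
            # First sub-token of a word carries the word's label.
            labels.append(example["labels"][word_idx])
        previous_word_idx = word_idx
    tokenized_inputs["labels"] = labels
    return tokenized_inputs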

from datasets import Dataset

data = read_conll(conll_path)
dataset = Dataset.from_list(data)
dataset = dataset.map(encode_labels)
split = dataset.train_test_split(test_size=0.2, seed=42)
raw_train_dataset = split["train"]
raw_eval_dataset = split["test"]
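
One common alignment convention keeps a word's label on its first sub-token only and masks special tokens and continuation pieces with `-100`, so that entity-level metrics count each word once. The sketch below illustrates that convention on a hand-written `word_ids` list, so no tokenizer download is needed. The function name `align_first_subword` and the example values are assumptions made for illustration.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # label id skipped by the token-classification loss


def align_first_subword(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Keep each word's label on its first sub-token and mask everything else."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous:
            # Special token, or a continuation piece of the previous word.
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_idx])
        previous = word_idx
    return aligned


# Hypothetical word_ids: three words, the second split into three pieces,
# wrapped in two special tokens (None).
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [1, 0, 5]  # B-Product, O, B-PRICE under this notebook's label_list

aligned = align_first_subword(word_ids, word_labels)
print(aligned)  # [-100, 1, 0, -100, -100, 5, -100]
```

The alternative of copying the label onto every sub-token is also used in practice, but it repeats `B-` tags inside a single word, which can split one entity into several when predictions are scored.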

## Model Training and Evaluation

We will fine-tune and evaluate each model using the same training and validation data. For each model, we will:
- Load the pre-trained model and tokenizer
- Tokenize and align the data
- Fine-tune the model
- Evaluate its performance (F1-score, precision, recall)
- Record the results for comparison
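
To make the metrics in the list above concrete, here is a minimal pure-Python sketch of entity-level precision, recall, and F1 over BIO tags. It is a simplified illustration, not the `seqeval` implementation: the helper names `extract_entities` and `entity_prf` are hypothetical, stray `I-` tags without a preceding `B-` are ignored, and the handling of malformed sequences may differ from `seqeval`.

```python
from typing import List, Set, Tuple


def extract_entities(tags: List[str]) -> Set[Tuple[str, int, int]]:
    """Return (type, start, end) spans from a BIO sequence; end is exclusive."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            entities.add((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return entities


def entity_prf(true_seqs: List[List[str]], pred_seqs: List[List[str]]):
    """Micro-averaged precision, recall, F1 over exactly matching entity spans."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_ents = extract_entities(true_tags)
        pred_ents = extract_entities(pred_tags)
        tp += len(true_ents & pred_ents)
        n_pred += len(pred_ents)
        n_true += len(true_ents)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example: the PRICE span is predicted one token too short, so it does not count.
y_true = [["B-Product", "I-Product", "O", "B-PRICE", "I-PRICE"]]
y_pred = [["B-Product", "I-Product", "O", "B-PRICE", "O"]]
precision, recall, f1 = entity_prf(y_true, y_pred)
print(precision, recall, f1)  # 0.5 0.5 0.5
```

An entity counts as correct only when both its type and its exact boundaries match, which is why a partially recovered price span contributes nothing to the score.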

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
import numpy as np
from seqeval.metrics import classification_report, f1_score

results = {}

for model_name, model_checkpoint in model_names.items():
    print(f"\n=== Training {model_name} ===")
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
    
    # Tokenize datasets
    train_dataset = raw_train_dataset.map(lambda x: tokenize_and_align_labels(x, tokenizer), batched=False)
    eval_dataset = raw_eval_dataset.map(lambda x: tokenize_and_align_labels(x, tokenizer), batched=False)
    
    model = AutoModelForTokenClassification.from_pretrained(
        model_checkpoint, num_labels=len(label_list), id2label=id_to_label, label2id=label_to_id
    )
    
    training_args = TrainingArguments(
        output_dir=f"./results/{model_name}",
        evaluation_strategy="epoch",  # named `eval_strategy` in newer transformers releases
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=3,
        weight_decay=0.01,
        logging_dir=f"./logs/{model_name}",
        logging_steps=10,
        save_total_limit=2,
        report_to="none"
    )
    
    data_collator = DataCollatorForTokenClassification(tokenizer)
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=data_collator,
    )
    
    trainer.train()
    
    predictions, labels, _ = trainer.predict(eval_dataset)
    preds = np.argmax(predictions, axis=2)
    true_labels = [[id_to_label[l] for l in label if l != -100] for label in labels]
    pred_labels = [[id_to_label[p] for (p, l) in zip(pred, label) if l != -100] for pred, label in zip(preds, labels)]
    
    report = classification_report(true_labels, pred_labels, output_dict=True)
    results[model_name] = {
        "f1": f1_score(true_labels, pred_labels),
        "report": report
    }
    print(f"\n{model_name} F1-score: {results[model_name]['f1']:.4f}")
    print(classification_report(true_labels, pred_labels))

## Results Summary

We collect each model's scores in a single table to compare performance and select the best model for our NER task.

In [None]:
import pandas as pd

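# Micro-averaged precision and recall are read from the stored seqeval report.
summary = {
    model: {"F1-score": res["f1"],
            "Precision": res["report"]["micro avg"]["precision"],
            "Recall": res["report"]["micro avg"]["recall"]}
    for model, res in results.items()
}
df = pd.DataFrame(summary).T
display(df)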
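
One possible way to turn the table into a selection is sketched below, assuming a `results`-style dictionary keyed by model name. The rule shown (prefer a cheaper model when its F1 is within a small tolerance of the best) is an assumption about what "business needs" might mean, not a rule stated in this task. The function `select_model`, the `preference` list, and all scores are placeholders, not measured results.

```python
def select_model(results, preference, tolerance=0.01):
    """Pick the first model in `preference` whose F1 is within `tolerance` of the best.

    `preference` lists model names from most to least preferred (for example,
    cheapest to serve first). Falls back to the highest-F1 model.
    """
    best_f1 = max(res["f1"] for res in results.values())
    for name in preference:
        if name in results and best_f1 - results[name]["f1"] <= tolerance:
            return name
    return max(results, key=lambda name: results[name]["f1"])


# Placeholder scores for illustration only; these are NOT experimental results.
placeholder_results = {
    "model_a": {"f1": 0.80},
    "model_b": {"f1": 0.795},
    "model_c": {"f1": 0.70},
}
preference = ["model_c", "model_b", "model_a"]  # hypothetical cheapest-first order

chosen = select_model(placeholder_results, preference, tolerance=0.01)
strict = select_model(placeholder_results, preference, tolerance=0.0)
print(chosen, strict)  # model_b model_a
```

With `tolerance=0.0` the rule reduces to picking the highest F1, so the tolerance makes the accuracy-versus-cost trade-off explicit and easy to adjust.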