# Task 4: Model Comparison & Selection

This task compares several multilingual NER models (e.g., XLM-RoBERTa, DistilBERT, mBERT) on Amharic entity extraction. Each model is fine-tuned, then scored on the validation set by entity-level F1 and wall-clock run time, and the best performer is selected for production use.

## Steps
1. Fine-tune multiple NER models (XLM-RoBERTa, DistilBERT, mBERT, etc.)
2. Evaluate each model on the validation set
3. Compare models on F1 score and run time
4. Select the best-performing model for production

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification

# List of models to compare
model_names = [
    'xlm-roberta-base',
    'distilbert-base-multilingual-cased',
    'bert-base-multilingual-cased',
    # Add more models as needed
]

def load_model_and_tokenizer(model_name, num_labels, id2label, label2id):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(
        model_name, num_labels=num_labels, id2label=id2label, label2id=label2id
    )
    return tokenizer, model
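
`load_model_and_tokenizer` expects label maps that are assumed to come from earlier steps. As a minimal sketch, assuming a BIO tag set (the tags below are hypothetical placeholders, not necessarily the project's actual labels), the maps could be derived from `label_list` like this:

```python
# Hypothetical BIO tag set, for illustration only; substitute the project's real labels.
label_list = ['O', 'B-PRODUCT', 'I-PRODUCT', 'B-LOC', 'I-LOC', 'B-PRICE', 'I-PRICE']

# Map integer class ids to tag strings, and tag strings back to ids.
id2label = dict(enumerate(label_list))
label2id = {label: i for i, label in id2label.items()}

# Size of the classification head.
num_labels = len(label_list)
```

With these in place, a call such as `load_model_and_tokenizer(model_names[0], num_labels, id2label, label2id)` would return a tokenizer and a model whose classification head matches the tag set.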

In [None]:
import time

import numpy as np  # needed for np.argmax in the metric computation
from transformers import Trainer, TrainingArguments
from seqeval.metrics import f1_score

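# Assumes train_dataset, val_dataset, label_list, id2label, and label2id are defined in previous steps.
# Note: checkpoints can use different vocabularies (XLM-RoBERTa's differs from mBERT's), so the
# datasets must be tokenized with the tokenizer that matches the model being trained.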
results = []

for model_name in model_names:
    print(f"\nEvaluating {model_name}...")
    tokenizer, model = load_model_and_tokenizer(model_name, len(label_list), id2label, label2id)
    training_args = TrainingArguments(
        output_dir=f'./results_{model_name}',
        evaluation_strategy='epoch',
        save_strategy='epoch',  # must match evaluation_strategy when load_best_model_at_end=True
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=3,
        logging_dir=f'./logs_{model_name}',
        logging_steps=10,
        save_total_limit=1,
        load_best_model_at_end=True,
    )
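    def compute_metrics(p):
        # Convert logits to predicted class ids, then drop positions labelled -100
        # (ignored tokens) before computing entity-level F1 with seqeval.
        pred_ids = np.argmax(p.predictions, axis=2)
        true_tags = [[id2label[l] for l in label if l != -100] for label in p.label_ids]
        pred_tags = [
            [id2label[pred] for pred, l in zip(prediction, label) if l != -100]
            for prediction, label in zip(pred_ids, p.label_ids)
        ]
        return {'f1': f1_score(true_tags, pred_tags)}

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )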
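    # Time training and evaluation separately so that 'eval_time_sec'
    # reflects evaluation only, not fine-tuning.
    train_start = time.time()
    trainer.train()
    train_time = time.time() - train_start

    eval_start = time.time()
    eval_result = trainer.evaluate()
    eval_time = time.time() - eval_start

    results.append({
        'model': model_name,
        'f1': eval_result['eval_f1'],
        'train_time_sec': train_time,
        'eval_time_sec': eval_time,
    })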
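
The metric computation above drops every position labelled `-100` (the value conventionally used to mask special tokens and sub-word continuations out of the loss) before passing tag sequences to seqeval. The sketch below reproduces that alignment in plain Python on a toy input. The tag set and logits are made up for illustration, and `align_predictions` is a hypothetical helper name, not part of any library.

```python
IGNORE_INDEX = -100  # label value that marks positions to skip

def align_predictions(logits, label_ids, id2label):
    """Turn per-token logits and gold ids into tag sequences, skipping ignored positions."""
    true_tags, pred_tags = [], []
    for sent_logits, sent_labels in zip(logits, label_ids):
        sent_true, sent_pred = [], []
        for token_logits, gold in zip(sent_logits, sent_labels):
            if gold == IGNORE_INDEX:
                continue
            # Argmax over the class dimension.
            pred = max(range(len(token_logits)), key=token_logits.__getitem__)
            sent_true.append(id2label[gold])
            sent_pred.append(id2label[pred])
        true_tags.append(sent_true)
        pred_tags.append(sent_pred)
    return true_tags, pred_tags

# Toy input: one sentence with four positions; the first and last are masked.
toy_id2label = {0: 'O', 1: 'B-LOC', 2: 'I-LOC'}
toy_logits = [[[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7], [0.9, 0.05, 0.05]]]
toy_label_ids = [[-100, 1, 0, -100]]

true_tags, pred_tags = align_predictions(toy_logits, toy_label_ids, toy_id2label)
print(true_tags)  # [['B-LOC', 'O']]
print(pred_tags)  # [['B-LOC', 'I-LOC']]
```

Only the two unmasked positions survive, so the tag lists passed to `f1_score` have the same length for gold and predicted sequences.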

In [None]:
# Display comparison results
import pandas as pd
results_df = pd.DataFrame(results)
results_df = results_df.sort_values(by='f1', ascending=False)
display(results_df)

best_model = results_df.iloc[0]['model']
print(f"Best model: {best_model}")
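
Sorting by F1 alone ignores speed. When two models score almost the same, the faster one may be the better production choice. The sketch below shows one possible rule, not the method used in this notebook: keep every model within a tolerance of the top F1, then pick the fastest. The rows, the `f1_tolerance` value, and the `select_model` helper are illustrative assumptions, not measured results.

```python
def select_model(results, f1_tolerance=0.01, time_key='eval_time_sec'):
    """Pick the fastest model whose F1 is within f1_tolerance of the best F1."""
    if not results:
        raise ValueError('results is empty')
    best_f1 = max(r['f1'] for r in results)
    # Models close enough to the top score to count as equivalent in quality.
    candidates = [r for r in results if best_f1 - r['f1'] <= f1_tolerance]
    return min(candidates, key=lambda r: r[time_key])['model']

# Placeholder rows with made-up numbers, shaped like the `results` list above.
example_results = [
    {'model': 'model-a', 'f1': 0.80, 'eval_time_sec': 120.0},
    {'model': 'model-b', 'f1': 0.795, 'eval_time_sec': 60.0},
    {'model': 'model-c', 'f1': 0.70, 'eval_time_sec': 30.0},
]
print(select_model(example_results))  # model-b
```

Setting `f1_tolerance=0.0` falls back to choosing purely by F1, which matches the behaviour of the sort above.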

## Summary

- Fine-tuned and evaluated multiple NER models (XLM-RoBERTa, DistilBERT, mBERT, etc.)
- Compared models based on F1 score and evaluation time
- Selected the best-performing model for Amharic entity extraction