# Task 4: Model Comparison & Selection
**Objective**
Compare different models and select the best-performing one for the entity extraction task.

**Steps**

1. Fine-tune multiple models, such as:
   - XLM-RoBERTa: a large multilingual model for NER tasks.
   - DistilBERT: a smaller, lighter model for more efficient NER.
   - mBERT (Multilingual BERT): a multilingual version of BERT, suitable for Amharic.
   - Other comparable multilingual models.
2. Evaluate each fine-tuned model on the validation set to check performance.
3. Compare models based on accuracy, speed, and robustness in handling multi-modal data.
4. Select the best-performing model for production based on evaluation metrics.
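
The comparison and selection steps can be sketched as a small ranking helper. This is only an illustration: `ModelResult` and `select_best` are hypothetical names, the speed threshold is an assumed requirement, and the figures are the macro-average F1 and `eval_samples_per_second` values printed in this notebook's run output.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    macro_f1: float            # macro-average F1 from the seqeval report
    samples_per_second: float  # eval_samples_per_second from trainer.evaluate()


def select_best(results, min_speed=0.0):
    """Pick the highest macro F1 among models that meet the speed requirement."""
    eligible = [r for r in results if r.samples_per_second >= min_speed]
    if not eligible:
        raise ValueError("no model meets the speed requirement")
    # Ties on F1 are broken in favour of the faster model.
    return max(eligible, key=lambda r: (r.macro_f1, r.samples_per_second))


results = [
    ModelResult("xlm-roberta-base", 1.00, 38.566),
    ModelResult("distilbert-base-multilingual-cased", 0.28, 69.043),
    ModelResult("bert-base-multilingual-cased", 0.33, 66.932),
]

best = select_best(results)                       # accuracy only
fast_best = select_best(results, min_speed=60.0)  # accuracy under a speed floor
print(best.name, fast_best.name)
```

Ranking on a single score hides per-entity differences, so the full classification report is still worth reading before committing to a model.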

# Task 5: Model Interpretability
**Objective**
Use model interpretability tools to explain how the NER model identifies entities, ensuring transparency and trust in the system.

**Steps**
1. Implement SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) to interpret the model’s predictions.
2. Analyze difficult cases where the model might struggle to identify entities correctly (e.g., ambiguous text, overlapping entities).
3. Generate reports on how the model makes decisions and identify areas for improvement.
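
The difficult-case analysis in step 2 can start from a plain comparison of gold and predicted tags. The sketch below is a minimal illustration: `find_difficult_cases` is a hypothetical helper, and the sentence and tags are made-up toy data, not results from the model.

```python
from collections import Counter


def find_difficult_cases(sentences, true_labels, predicted_labels):
    """Collect every token whose predicted tag differs from the gold tag."""
    cases = []
    for sent_idx, (tokens, gold, pred) in enumerate(
        zip(sentences, true_labels, predicted_labels)
    ):
        for pos, (token, g, p) in enumerate(zip(tokens, gold, pred)):
            if g != p:
                cases.append({
                    "sentence": sent_idx,
                    "position": pos,
                    "token": token,
                    "gold": g,
                    "predicted": p,
                })
    return cases


# Toy example (illustrative only)
sentences = [["አዲስ", "አበባ", "አይፎን"]]
true_labels = [["B-LOC", "I-LOC", "B-PRODUCT"]]
predicted_labels = [["B-LOC", "O", "B-PRODUCT"]]

cases = find_difficult_cases(sentences, true_labels, predicted_labels)
# Count which (gold, predicted) confusions occur most often
confusions = Counter((c["gold"], c["predicted"]) for c in cases)
print(cases)
print(confusions.most_common())
```

The most frequent confusion pairs are a reasonable place to pick sentences for SHAP or LIME explanations.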


In [17]:
pip install transformers datasets
pip install seqeval
pip install shap
pip install lime

Collecting shap
  Downloading shap-0.46.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (24 kB)
Collecting slicer==0.0.8 (from shap)
  Downloading slicer-0.0.8-py3-none-any.whl.metadata (4.0 kB)
Downloading shap-0.46.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (540 kB)
Downloading slicer-0.0.8-py3-none-any.whl (15 kB)
Installing collected packages: slicer, shap
Successfully installed shap-0.46.0 slicer-0.0.8
Collecting lime
  Downloading lime-0.2.0.1.tar.gz (275 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: lime
  Building wheel for lime (setup.py) ...

In [18]:
from transformers import AutoModelForTokenClassification, AutoTokenizer, Trainer, TrainingArguments
import shap
from lime.lime_text import LimeTextExplainer
from google.colab import drive
from sklearn.model_selection import train_test_split
from datasets import Dataset, Features, Sequence, ClassLabel, Value
from seqeval.metrics import classification_report as seq_classification_report
# from sklearn.metrics import classification_report
import numpy as np



In [3]:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

In [4]:
drive.mount('/content/drive')


file_path = '/content/drive/My Drive/tokens_labels.conll'

# read as UTF-8 so the Amharic text is decoded the same way on every platform
with open(file_path, 'r', encoding='utf-8') as file:
    contents = file.readlines()

Mounted at /content/drive


In [5]:
# extract tokens and labels from the dataset
def extract_tokens_labels(text):
  words = []
  labels = []
  # True only if the word is non-empty and every character is Amharic
  def is_amharic(word):
      # Amharic characters are in the Unicode range: 1200-137F (hex)
      return bool(word) and all(0x1200 <= ord(char) <= 0x137F for char in word)

  # split tokens and labels
  # NOTE: this loop reads the global `content`, not the `text` argument,
  # so every call returns the same tokens and labels
  for con in content:
    con = con.strip().replace('[', '').replace(']', '').replace(',', '').replace("'", "").split(' ')
    # keep Amharic tokens only; blank lines and non-Amharic tokens are skipped
    if is_amharic(con[0]):
      words.append(con[0])
      labels.append(con[-1])

  return words, labels
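
A variant that works only on the lines it is given, rather than on a notebook-level variable, could look like the sketch below. It assumes each line is a plain whitespace-separated `token label` pair, which may not match the actual file layout; `extract_tokens_labels_from` and the sample lines are hypothetical.

```python
def is_amharic(word):
    """True if the word is non-empty and every character is in U+1200..U+137F."""
    return bool(word) and all(0x1200 <= ord(ch) <= 0x137F for ch in word)


def extract_tokens_labels_from(lines):
    """Return (tokens, labels) for the Amharic tokens found in `lines`."""
    words, labels = [], []
    for line in lines:
        parts = line.strip().split()
        if len(parts) < 2:
            continue  # blank line or a token without a label
        token, label = parts[0], parts[-1]
        if is_amharic(token):
            words.append(token)
            labels.append(label)
    return words, labels


# Illustrative input, assuming "token label" lines
sample = ["አዲስ B-LOC", "አበባ I-LOC", "iPhone B-PRODUCT", "", "አይፎን B-PRODUCT"]
words, labels = extract_tokens_labels_from(sample)
print(words, labels)
```

Because the function depends only on its argument, different subsets of lines produce different examples, which is what a train/validation split needs.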


In [6]:
# align tokens and labels
def align_token_label(text, tokenizer):
  # labels to id number
  label_to_id = {
      "O": 0,
      "B-LOC": 1,
      "I-LOC": 2,
      "B-PRODUCT": 3,
      "I-PRODUCT": 4,
      "B-PRICE": 5,
      "I-PRICE": 6
  }
  tokens, labels = extract_tokens_labels(text)
  tokenized_inputs = tokenizer(tokens, truncation = True, padding = True, is_split_into_words = True)

  word_ids = tokenized_inputs.word_ids()
  aligned_labels = []

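  previous_id = None
  for word_id in word_ids:
    if word_id is None:
      # special tokens have no source word and are ignored by the loss
      aligned_labels.append(-100)

    elif word_id != previous_id:
      # the first sub-token of a word carries the word's label
      aligned_labels.append(label_to_id[labels[word_id]])

    else:
      # remaining sub-tokens of the same word are ignored
      aligned_labels.append(-100)

    previous_id = word_id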
  tokenized_inputs['labels'] = aligned_labels
  # print(aligned_labels)
  return tokenized_inputs
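
The alignment rule can be checked without a tokenizer by feeding it a hand-written `word_ids` list. The sketch below isolates that rule; `align_labels` and the sample values are hypothetical, with `None` standing in for special tokens as in the tokenizer's `word_ids()` output.

```python
def align_labels(word_ids, word_labels, label_to_id, ignore_index=-100):
    """Label the first sub-token of each word; mask special and continuation tokens."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(label_to_id[word_labels[word_id]])
        previous = word_id
    return aligned


label_to_id = {"O": 0, "B-LOC": 1, "I-LOC": 2}
# Three words; words 0 and 2 are each split into two sub-tokens
word_ids = [None, 0, 0, 1, 2, 2, None]
word_labels = ["B-LOC", "I-LOC", "O"]

aligned = align_labels(word_ids, word_labels, label_to_id)
print(aligned)  # [-100, 1, -100, 2, 0, -100, -100]
```

The output has one entry per sub-token, so it always matches the length of `input_ids` for the same example.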

In [22]:
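# SHAP and LIME both expect one score vector per text, while a token
# classifier returns one vector per token. As a simplifying assumption,
# the helper below averages the per-token class probabilities per sentence.
import torch

CLASS_NAMES = ["O", "B-LOC", "I-LOC", "B-PRODUCT", "I-PRODUCT", "B-PRICE", "I-PRICE"]

def sentence_probabilities(model, tokenizer, texts):
    model.eval()
    inputs = tokenizer([str(t) for t in texts], return_tensors="pt",
                       padding=True, truncation=True).to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits                 # (batch, seq_len, num_labels)
    probs = torch.softmax(logits.float(), dim=-1)
    mask = inputs["attention_mask"].unsqueeze(-1)       # exclude padding tokens
    return ((probs * mask).sum(dim=1) / mask.sum(dim=1)).cpu().numpy()

# SHAP report generation
def generate_shap_report(model, tokenizer, validation_data):
    # KernelExplainer needs numeric features, so this uses shap.Explainer
    # with a text masker instead
    explainer = shap.Explainer(lambda texts: sentence_probabilities(model, tokenizer, texts),
                               shap.maskers.Text(tokenizer), output_names=CLASS_NAMES)

    # Example sentence for SHAP explanation
    test_sentence = "አፕል አዲሱን አይፎን በኒውዮርክ ከተማ በ4000ዋጋ እየለቀቀች ነው።"
    shap_values = explainer([test_sentence])

    # Visualize SHAP results
    shap.plots.text(shap_values)

# LIME report generation
def generate_lime_report(model, tokenizer, validation_data):
    explainer = LimeTextExplainer(class_names=CLASS_NAMES)

    def lime_predict(texts):
        # LIME expects an array of shape (num_texts, num_classes)
        return sentence_probabilities(model, tokenizer, texts)

    # Example sentence for LIME explanation
    test_sentence = "አዲሱ ጋላክሲ ስልኩን ሳምሰንግ በአዲስ አበባ በ3000ዋጋ አስመረቀች።"
    exp = explainer.explain_instance(test_sentence, lime_predict, num_features=10)

    # Show LIME explanation
    exp.show_in_notebook(text=True)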
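
LIME's text explainer expects its prediction function to return one row of class probabilities per text, whereas a token-classification model produces one row of logits per token. One possible way to bridge the two shapes, shown here as a dependency-free sketch, is to average the per-token probabilities over the unmasked tokens. Mean pooling is an assumption, not the only option, and `pool_token_probabilities` and the sample logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pool_token_probabilities(token_logits, attention_mask):
    """Average per-token class probabilities over the unmasked tokens of a sentence."""
    kept = [softmax(row) for row, keep in zip(token_logits, attention_mask) if keep]
    if not kept:
        raise ValueError("sentence has no unmasked tokens")
    num_classes = len(kept[0])
    return [sum(row[c] for row in kept) / len(kept) for c in range(num_classes)]


# Three tokens, three classes; the last token is padding and is ignored
token_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 5.0, 5.0]]
attention_mask = [1, 1, 0]

sentence_probs = pool_token_probabilities(token_logits, attention_mask)
print(sentence_probs)
```

The pooled vector still sums to 1, so it can be passed to an explainer that expects probabilities. Pooling with a maximum instead of a mean would emphasize the single most confident token but would no longer sum to 1 without renormalizing.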

In [7]:
# take a subset
content = contents[0:1000]

# split validation and train sets
train_data, validation_data = train_test_split(content, test_size=0.2, random_state=42)

In [25]:
def different_model(model,train_data,validation_data):
  # initialize the tokenizer and model
  tokenizer = AutoTokenizer.from_pretrained(model)
  model = AutoModelForTokenClassification.from_pretrained(model,num_labels=7)

  # Create dictionaries to hold the tokenized datasets
  tokenized_datasets = {'train': [], 'validation': []}

  batch_size = 4
  # batch for faster computation
  def batch_data(data, batch_size):
    for i in range(0, len(data), batch_size):
      yield data[i:i + batch_size]

  # align_token_label for train dataset
  for batch in batch_data(train_data,batch_size):
    tokenized_batch = [align_token_label(con, tokenizer) for con in batch]
    tokenized_datasets['train'].extend(tokenized_batch)

  # align_token_label for validation dataset
  for batch in batch_data(validation_data,batch_size):
      tokenized_batch = [align_token_label(con, tokenizer) for con in batch]
      tokenized_datasets['validation'].extend(tokenized_batch)

  # Convert lists to Hugging Face Dataset objects
  tokenized_datasets['train'] = Dataset.from_list(tokenized_datasets['train'])
  tokenized_datasets['validation'] = Dataset.from_list(tokenized_datasets['validation'])
  # fine-tune the model
  training_args = TrainingArguments(
      output_dir = '/content/drive/My Drive/results',
      evaluation_strategy = 'epoch',
      learning_rate = 2e-5,
      per_device_train_batch_size = 4,
      per_device_eval_batch_size = 4,
      gradient_accumulation_steps = 4,
      num_train_epochs = 3,
      weight_decay = 0.01,
      fp16 = True # Enable mixed precision training
      #  no_cuda=True  # Force CPU training
  )

  trainer = Trainer(
      model = model,
      args = training_args,
      train_dataset = tokenized_datasets['train'],
      eval_dataset = tokenized_datasets['validation'],
  )
  # train the model
  trainer.train()

  # evaluate the model
  print(trainer.evaluate())

  # Get predictions from the model
  predictions, labels, _ = trainer.predict(tokenized_datasets['validation'])

  # Convert predictions to label ids
  predicted_label_ids = np.argmax(predictions, axis=-1)

  label_names = {
      0: "O",
      1: "B-LOC",
      2: "I-LOC",
      3: "B-PRODUCT",
      4: "I-PRODUCT",
      5: "B-PRICE",
      6: "I-PRICE"
  }

  # Convert the label ids to actual label names
  true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
  predicted_labels = [[label_names[p] for p, l in zip(pred, label) if l != -100]
                      for pred, label in zip(predicted_label_ids, labels)]

  # Use seqeval to compute the entity-level classification report
  print(seq_classification_report(true_labels, predicted_labels))

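  # explain only the model selected after comparing the results;
  # `model` is a model object by this point, so compare its checkpoint name
  if model.config.name_or_path == 'bert-base-multilingual-cased':
    generate_shap_report(model, tokenizer, tokenized_datasets['validation'])
    generate_lime_report(model, tokenizer, tokenized_datasets['validation'])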
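
The training arguments above imply a specific number of optimizer steps, which is worth working out because it likely explains the `No log` entries in the training-loss column. The sketch below does the arithmetic, assuming a single GPU, the 800-example training split produced by the 80/20 split of 1,000 lines, and that `logging_steps` was left at its default of 500. `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    epochs, num_devices=1):
    """Return (effective batch size, optimizer steps per epoch, total optimizer steps)."""
    effective_batch = per_device_batch_size * num_devices * grad_accum_steps
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs


effective_batch, steps_per_epoch, total_steps = optimizer_steps(
    num_examples=800, per_device_batch_size=4, grad_accum_steps=4, epochs=3
)
print(effective_batch, steps_per_epoch, total_steps)  # 16 50 150
```

With 150 total steps the run ends before the first 500-step logging interval, so no training loss is recorded. Lowering `logging_steps` (for example to 10) in `TrainingArguments` would be one way to see it.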



In [26]:
model_name = ['xlm-roberta-base','distilbert-base-multilingual-cased','bert-base-multilingual-cased']
for model in model_name:
  different_model(model,train_data,validation_data)


Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch,Training Loss,Validation Loss
1,No log,0.020477
2,No log,0.001157
3,No log,0.000778


{'eval_loss': 0.0007777191931381822, 'eval_runtime': 5.186, 'eval_samples_per_second': 38.566, 'eval_steps_per_second': 9.641, 'epoch': 3.0}
              precision    recall  f1-score   support

         LOC       1.00      1.00      1.00      1600
       PRICE       1.00      1.00      1.00       200
     PRODUCT       1.00      1.00      1.00      1200

   micro avg       1.00      1.00      1.00      3000
   macro avg       1.00      1.00      1.00      3000
weighted avg       1.00      1.00      1.00      3000



Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch,Training Loss,Validation Loss
1,No log,0.384748
2,No log,0.192141
3,No log,0.10724


{'eval_loss': 0.10723968595266342, 'eval_runtime': 2.8967, 'eval_samples_per_second': 69.043, 'eval_steps_per_second': 17.261, 'epoch': 3.0}




              precision    recall  f1-score   support

         LOC       0.79      0.68      0.73      4400
       PRICE       0.00      0.00      0.00       800
     PRODUCT       0.20      0.08      0.11      2600

   micro avg       0.67      0.41      0.51      7800
   macro avg       0.33      0.25      0.28      7800
weighted avg       0.51      0.41      0.45      7800



Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Epoch,Training Loss,Validation Loss
1,No log,0.3251
2,No log,0.196404
3,No log,0.141046


{'eval_loss': 0.14104604721069336, 'eval_runtime': 2.9881, 'eval_samples_per_second': 66.932, 'eval_steps_per_second': 16.733, 'epoch': 3.0}




              precision    recall  f1-score   support

         LOC       0.65      0.50      0.56      4400
       PRICE       0.00      0.00      0.00       800
     PRODUCT       0.67      0.31      0.42      2600

   micro avg       0.65      0.38      0.48      7800
   macro avg       0.44      0.27      0.33      7800
weighted avg       0.59      0.38      0.46      7800



In [None]:
# # for cleaning up memory
# import torch
# torch.cuda.empty_cache()

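Model **bert-base-multilingual-cased** is a better option than Model **distilbert-base-multilingual-cased**: it has a higher macro-average F1 (0.33 vs. 0.28) and much stronger performance on PRODUCT (F1 0.42 vs. 0.11), although DistilBERT scores higher on LOC (F1 0.73 vs. 0.56). Model **distilbert-base-multilingual-cased** struggles significantly with PRODUCT, and neither model identifies any PRICE entities (F1 0.00).

Model **xlm-roberta-base** shows perfect scores (1.00) in all metrics, but this might indicate overfitting, or overlap between the training and validation data, rather than real generalization.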
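
One quick way to probe whether a perfect score reflects memorized data is to count how many validation examples also occur in the training set. The sketch below is a minimal check on toy data; `count_overlap` is a hypothetical helper and the token lists are illustrative only.

```python
def count_overlap(train_examples, validation_examples):
    """Count validation examples whose token sequence also appears in training."""
    seen = {tuple(example) for example in train_examples}
    return sum(1 for example in validation_examples if tuple(example) in seen)


# Toy data (illustrative only)
train = [["አዲስ", "አበባ"], ["አይፎን"]]
validation = [["አዲስ", "አበባ"], ["ጋላክሲ"]]

overlap = count_overlap(train, validation)
print(f"{overlap} of {len(validation)} validation examples also appear in training")
```

In this notebook the same check could be run on the `input_ids` columns of the tokenized train and validation datasets inside `different_model`. A high overlap would mean the validation scores say little about unseen text.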