## NER Fine-tuning with XLM-Roberta, DistilBERT, and mBERT
### This notebook demonstrates how to fine-tune various pre-trained models for Named Entity Recognition (NER) on a custom dataset formatted in CoNLL style.


#### Installing required Libraries

In [3]:
!pip install pyarrow==10.0.1 datasets==2.4.0 seqeval transformers==4.20.0



#### Importing libraries

In [4]:
import pandas as pd
from datasets import Dataset, Features, Sequence, Value
from transformers import XLMRobertaTokenizerFast, XLMRobertaForTokenClassification
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer

#### Loading CoNLL formatted Data

In [5]:
from google.colab import files
uploaded = files.upload()

Saving first_labeled_ner_data.conll to first_labeled_ner_data.conll


#### Function to load CoNLL formatted data

In [6]:
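def load_conll(file_path):
    sentences, labels = [], []
    sentence, label = [], []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:  # Token line: first column is the token, last column is the label
                sentence.append(parts[0])
                label.append(parts[-1])
            elif not parts and sentence:  # Blank line closes the current sentence
                sentences.append(sentence)
                labels.append(label)
                sentence, label = [], []
            # Lines with a single column carry no label and are skipped
    if sentence:  # Keep the last sentence when the file does not end with a blank line
        sentences.append(sentence)
        labels.append(label)
    return pd.DataFrame({'tokens': sentences, 'labels': labels})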
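
To make the expected file layout concrete, the sketch below parses a tiny in-memory sample with one `token label` pair per line and a blank line between sentences. `parse_conll_lines` and the English sample are illustrative and not part of the notebook. The sample deliberately has no trailing blank line, to show that the last sentence still has to be flushed after the loop.

```python
def parse_conll_lines(lines):
    """Group 'token label' lines into (tokens, labels) pairs, one per sentence."""
    sentences, tokens, labels = [], [], []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            # First column is the token, last column is the label.
            tokens.append(parts[0])
            labels.append(parts[-1])
        elif not parts and tokens:
            # A blank line closes the current sentence.
            sentences.append((tokens, labels))
            tokens, labels = [], []
    if tokens:
        # Flush the final sentence when there is no trailing blank line.
        sentences.append((tokens, labels))
    return sentences


sample = [
    "Baby B-Product",
    "carrier I-Product",
    "",
    "4100 B-Price",
    "birr I-Price",
]
parsed = parse_conll_lines(sample)
print(parsed)
```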

In [7]:
df = load_conll('first_labeled_ner_data.conll')
df.head()

index,tokens,labels
0,"[ለኮንዶሚኒየም, ለጠባብ, ቤቶች, ገላግሌ, የሆነ, ከንፁህ, የሲልከን, ...","[O, B-Product, O, B-Product, O, B-Product, I-P..."
1,"[ከላዩ, ፈር, ውስጡ, ኮተን, የሆነ, 2000, 0909003864, 090...","[O, B-Product, I-Product, I-Product, O, O, O, ..."
2,"[5, 1, ኦሪጅናል, ማቴሪያል, በሳይዙ, ትልቅ, 3200, ብር, 0909...","[O, O, B-Product, I-Product, I-Product, I-Prod..."
3,"[ምቹ, ጠንካራ, የልጆች, ማዘያ, በተለይ, ለወንድ, ልጆች, ፍሬያቸው, ...","[B-Product, O, B-Product, O, O, O, O, O, O, O,..."
4,"[4100, ብር, 1.80*2, 0909003864, 0905707448, እቃ,...","[B-Price, I-Price, O, O, O, B-Product, O, O, O..."


#### Defining Unique Labels

In [8]:
# Sort the distinct labels so that each label gets the same id on every run
unique_labels = sorted({label for sublist in df['labels'] for label in sublist})
label2id = {label: i for i, label in enumerate(unique_labels)}
id2label = {i: label for label, i in label2id.items()}
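
Because a Python `set` of strings has no stable order across interpreter runs, enumerating it directly can give the same label a different id each time, which makes saved checkpoints hard to interpret. A minimal sketch of a reproducible mapping follows, assuming a small made-up label list (`sample_labels`). Dictionaries of this shape can also be handed to `from_pretrained` through its `id2label` and `label2id` configuration arguments, so the model config records real tag names instead of `LABEL_0`, `LABEL_1`, and so on.

```python
sample_labels = [["O", "B-Product", "I-Product"], ["B-Price", "I-Price", "O"]]

# Sort the distinct labels so the id assignment is identical on every run.
label_list = sorted({label for sentence in sample_labels for label in sentence})
label2id_demo = {label: i for i, label in enumerate(label_list)}
id2label_demo = {i: label for label, i in label2id_demo.items()}

print(label2id_demo)
```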


#### Mapping Labels to IDs

In [9]:
df['labels'] = df['labels'].apply(lambda x: [label2id[label] for label in x])


#### Convert DataFrame to Hugging Face Dataset


In [10]:
features = Features({
    'tokens': Sequence(Value('string')),  # List of strings for tokens
    'labels': Sequence(Value('int32'))    # List of integers for labels
})

dataset = Dataset.from_pandas(df[['tokens', 'labels']], features=features)


### Tokenization and Label Alignment
#### We will tokenize the dataset for XLM-Roberta, DistilBERT, and mBERT.

### Tokenizer for XLM-Roberta


In [11]:
tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base", clean_up_tokenization_spaces=True)



### Tokenizer for DistilBERT

In [12]:
tokenizer_distilbert = AutoTokenizer.from_pretrained('distilbert-base-multilingual-cased', clean_up_tokenization_spaces=True)



### Tokenizer for mBERT


In [13]:
tokenizer_mbert = AutoTokenizer.from_pretrained('bert-base-multilingual-cased', clean_up_tokenization_spaces=True)



## Tokenization and Alignment Function

In [14]:
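def tokenize_and_align_labels(examples, tok=None):
    # Defaults to the XLM-Roberta tokenizer. For the other models, pass their own tokenizer,
    # e.g. dataset.map(tokenize_and_align_labels, batched=True, fn_kwargs={'tok': tokenizer_mbert})
    tok = tokenizer if tok is None else tok
    tokenized_inputs = tok(examples['tokens'], truncation=True, is_split_into_words=True, padding="max_length", max_length=128)
    labels = []

    for i, word_labels in enumerate(examples['labels']):
        # word_ids() maps each sub-word position to the index of its source word
        # (None for special and padding tokens)
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        tokenized_label = []
        for word_idx in word_ids:
            if word_idx is None or word_idx == previous_word_idx:
                tokenized_label.append(-100)  # Ignored by the loss
            else:
                tokenized_label.append(word_labels[word_idx])  # Label only the first sub-word
            previous_word_idx = word_idx
        labels.append(tokenized_label)

    tokenized_inputs['labels'] = labels
    return tokenized_inputs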
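
Sub-word alignment is commonly driven by the word indices that a fast tokenizer reports through `word_ids()`, rather than by decoding individual token ids and searching for them in the word list. The sketch below works on hand-written word indices so it runs without a tokenizer; `align_labels` and the toy inputs are illustrative. Only the first sub-word of each word keeps the word's label, and special or padding positions (index `None`) receive `-100`, the value the loss ignores.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Spread word-level label ids over sub-word positions."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None or word_idx == previous:
            # Special/padding token, or a continuation of the previous word.
            aligned.append(ignore_index)
        else:
            # First sub-word of a new word carries that word's label.
            aligned.append(word_labels[word_idx])
        previous = word_idx
    return aligned


# Toy case: three words, the second split into two sub-words,
# wrapped in special tokens and followed by one padding position.
toy_word_ids = [None, 0, 1, 1, 2, None, None]
toy_word_labels = [3, 5, 0]
print(align_labels(toy_word_ids, toy_word_labels))  # [-100, 3, 5, -100, 0, -100, -100]
```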

### Tokenize the Dataset
### Apply the tokenization and alignment function to the dataset.

In [15]:
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)



### Splitting the Dataset into Train and Test Data
### We split the tokenized dataset into training and testing datasets.


In [16]:
train_test_split = tokenized_dataset.train_test_split(test_size=0.1)


## Setting Up Training Arguments

In [17]:
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=7,
    weight_decay=0.01,
    max_grad_norm=1.0,
    logging_dir='./logs',
    logging_strategy="steps",
    logging_steps=50,
    save_strategy="epoch",
    report_to="none"
)
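
The step count that the training log reports further down follows from these arguments and the dataset size: with 1232 training examples and a per-device batch size of 16, each epoch takes ceil(1232 / 16) = 77 steps, and 7 epochs give 539. Checkpoints saved once per epoch therefore land at steps 77, 154, and so on. The same arithmetic as a quick check (`total_steps` is an illustrative helper; a single device and no gradient accumulation are assumed, as in the log):

```python
import math


def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a single-device run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


print(total_steps(1232, 16, 7))  # 539
```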

## Fine Tuning the Model
## We initialize models and set up trainers for fine-tuning.

### XLM-Roberta Model and Trainer


In [18]:
model_xlmr = XLMRobertaForTokenClassification.from_pretrained("xlm-roberta-base", num_labels=len(unique_labels))
trainer_xlmr = Trainer(
    model=model_xlmr,
    args=training_args,
    train_dataset=train_test_split['train'],
    eval_dataset=train_test_split['test']
)

Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaForTokenClassification: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLMRobertaForTokenClassification were not initialized from the model checkpoint at xlm-roberta-base and are newly initialized: ['classifier.weight', 'classifier

### DistilBERT Model and Trainer

In [19]:
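# NOTE: train_test_split was tokenized with the XLM-Roberta tokenizer. Its token ids can exceed
# DistilBERT's 119,547-entry vocabulary, which is the likely cause of the CUDA assert raised during
# training below. This model needs a dataset tokenized with tokenizer_distilbert.
model_distilbert = AutoModelForTokenClassification.from_pretrained('distilbert-base-multilingual-cased', num_labels=len(unique_labels))
trainer_distilbert = Trainer(
    model=model_distilbert,
    args=training_args,
    train_dataset=train_test_split['train'],
    eval_dataset=train_test_split['test']
)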

loading configuration file https://huggingface.co/distilbert-base-multilingual-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/cf37a9dc282a679f121734d06f003625d14cfdaf55c14358c4c0b8e7e2b89ac9.7a727bd85e40715bec919a39cdd6f0aba27a8cd488f2d4e0f512448dcd02bf0f
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-multilingual-cased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "

Some weights of the model checkpoint at distilbert-base-multilingual-cased were not used when initializing DistilBertForTokenClassification: ['vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.b

### mBERT Model and Trainer

In [20]:
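# NOTE: train_test_split was tokenized with the XLM-Roberta tokenizer. Its token ids can exceed
# mBERT's 119,547-entry vocabulary, which is the likely cause of the CUDA assert raised during
# training below. This model needs a dataset tokenized with tokenizer_mbert.
model_mbert = AutoModelForTokenClassification.from_pretrained('bert-base-multilingual-cased', num_labels=len(unique_labels))
trainer_mbert = Trainer(
    model=model_mbert,
    args=training_args,
    train_dataset=train_test_split['train'],
    eval_dataset=train_test_split['test']
)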

loading configuration file https://huggingface.co/bert-base-multilingual-cased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/6c4a5d81a58c9791cdf76a09bce1b5abfb9cf958aebada51200f4515403e5d08.0fe59f3f4f1335dadeb4bce8b8146199d9083512b50d07323c1c319f96df450c
Model config BertConfig {
  "_name_or_path": "bert-base-multilingual-cased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "layer_norm_eps": 1e-12,
  "max_position_embedding

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForTokenClassification: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cl

## Fine-tuning and Evaluating Each Model
### We fine-tune and evaluate each model.

### XLM-Roberta Fine-tuning and Evaluation

In [21]:
trainer_xlmr.train()
trainer_xlmr.evaluate()

The following columns in the training set don't have a corresponding argument in `XLMRobertaForTokenClassification.forward` and have been ignored: tokens. If tokens are not expected by `XLMRobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1232
  Num Epochs = 7
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 539


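| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.6842 | 0.094207 |
| 2 | 0.0736 | 0.075142 |
| 3 | 0.068 | 0.061738 |
| 4 | 0.044 | 0.041745 |
| 5 | 0.0307 | 0.032864 |
| 6 | 0.0261 | 0.03077 |
| 7 | 0.0269 | 0.028463 |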


The following columns in the evaluation set don't have a corresponding argument in `XLMRobertaForTokenClassification.forward` and have been ignored: tokens. If tokens are not expected by `XLMRobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 137
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-77
Configuration saved in ./results/checkpoint-77/config.json
Model weights saved in ./results/checkpoint-77/pytorch_model.bin
The following columns in the evaluation set don't have a corresponding argument in `XLMRobertaForTokenClassification.forward` and have been ignored: tokens. If tokens are not expected by `XLMRobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 137
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-154
Configuration saved in ./results/checkpoint-154/config.json
Model weights saved in ./results/chec

{'eval_loss': 0.02846306748688221,
 'eval_runtime': 0.987,
 'eval_samples_per_second': 138.804,
 'eval_steps_per_second': 9.119,
 'epoch': 7.0}
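
`evaluate()` reports only the loss here because the `Trainer` was created without a `compute_metrics` function, even though `seqeval` was installed at the top of the notebook. Entity-level scores first require turning logits and padded label ids back into tag sequences. A minimal sketch of that conversion in plain Python follows; `to_tag_sequences`, the toy logits, and the three-label map are illustrative assumptions. The two returned lists have the nested-list shape that `seqeval.metrics.classification_report(y_true, y_pred)` expects.

```python
def to_tag_sequences(logits, label_ids, id2label, ignore_index=-100):
    """Convert per-token scores and gold ids into lists of tag strings."""
    true_tags, pred_tags = [], []
    for sent_logits, sent_labels in zip(logits, label_ids):
        sent_true, sent_pred = [], []
        for scores, gold in zip(sent_logits, sent_labels):
            if gold == ignore_index:
                continue  # Skip special, padding, and continuation positions.
            best = max(range(len(scores)), key=scores.__getitem__)  # argmax
            sent_true.append(id2label[gold])
            sent_pred.append(id2label[best])
        true_tags.append(sent_true)
        pred_tags.append(sent_pred)
    return true_tags, pred_tags


toy_id2label = {0: "O", 1: "B-Product", 2: "I-Product"}
toy_logits = [[[0.1, 0.2, 0.0], [0.1, 2.0, 0.3], [0.2, 0.1, 1.5], [3.0, 0.1, 0.2]]]
toy_labels = [[-100, 1, 2, 2]]
y_true, y_pred = to_tag_sequences(toy_logits, toy_labels, toy_id2label)
print(y_true, y_pred)
```

Inside a `compute_metrics` function, the logits and label ids would come from the evaluation prediction object that the `Trainer` passes in, converted to nested lists or indexed as arrays.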

### DistilBERT Fine-tuning and Evaluation

In [22]:
trainer_distilbert.train()
trainer_distilbert.evaluate()

The following columns in the training set don't have a corresponding argument in `DistilBertForTokenClassification.forward` and have been ignored: tokens. If tokens are not expected by `DistilBertForTokenClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1232
  Num Epochs = 7
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 539


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


### mBERT Fine-tuning and Evaluation

In [23]:
trainer_mbert.train()
trainer_mbert.evaluate()

The following columns in the training set don't have a corresponding argument in `BertForTokenClassification.forward` and have been ignored: tokens. If tokens are not expected by `BertForTokenClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1232
  Num Epochs = 7
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 539


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


## Saving the Fine-Tuned Model

In [24]:
model_xlmr.save_pretrained("./fine_tuned_xlmr_model")
tokenizer.save_pretrained("./fine_tuned_xlmr_model")

Configuration saved in ./fine_tuned_xlmr_model/config.json


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
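
Saving fails here even though XLM-Roberta trained successfully, most likely because a device-side assert leaves the CUDA context of the process unusable: every later GPU call in the same session raises the same error, including copying weights back to the CPU. Restarting the runtime and saving before any failing cell runs is the usual way out. For a traceback that points at the failing operation, the variable named in the error message can be set before anything initializes CUDA. A small sketch, where placement at the very top of the notebook is the assumption that matters:

```python
import os

# Must run before torch touches the GPU. It makes CUDA kernel launches
# synchronous, so the Python traceback points at the call that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```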


## Evaluating the Model

### After fine-tuning, we evaluate the model to check its performance.

In [25]:
eval_results = trainer_xlmr.evaluate()
print(eval_results)

The following columns in the evaluation set don't have a corresponding argument in `XLMRobertaForTokenClassification.forward` and have been ignored: tokens. If tokens are not expected by `XLMRobertaForTokenClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 137
  Batch size = 16


RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
