<a href="https://colab.research.google.com/github/ClintonBeyene/EthioMart/blob/task3-finetune-DistilBERT/notebooks/Fine_tuned_DistillBERT_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DistilBERT Model
This notebook fine-tunes DistilBERT for Named Entity Recognition (NER) on a dataset loaded from a CoNLL file and converted into a pandas DataFrame. The dataset is split into training and validation sets, tokenized with DistilBERT's tokenizer, and encoded for token classification. A custom `NERDataset` class holds the input tokens and labels. The model is fine-tuned with Hugging Face's `Trainer` class, using training arguments that set the number of epochs, batch size, gradient accumulation, and evaluation strategy. After training, performance is measured with accuracy and weighted F1. The evaluation code filters out positions labeled -100, but `encode_data` pads label sequences with the `O` class rather than -100, so padded positions still count toward the reported metrics.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


We start by loading the .conll file with the function below and converting it into a pandas DataFrame so it is easy to manipulate.

In [2]:
import pandas as pd

def load_conll_to_dataframe(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if line:  # If the line is not empty
                tokens = line.split()  # Split by whitespace
                if len(tokens) >= 2:  # Ensure there are at least 2 columns (token and label)
                    token, label = tokens[0], tokens[1]  # First token is the word, second is the label
                    data.append((token, label))  # Append as a tuple

    # Create a DataFrame with appropriate columns
    df = pd.DataFrame(data, columns=['Token', 'Label'])
    return df

In [3]:
# Usage
conll_file_path = '/content/drive/MyDrive/@mertteka_labeled_data.conll'
df = load_conll_to_dataframe(conll_file_path)

df.head()

Unnamed: 0,Token,Label
0,ይሄንን,O
1,ተጭነው,O
2,ያድርጉ፣,O
3,ቤተሰብ,O
4,ይሁኑ,O
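
The loader above skips the blank lines in the file, so the resulting DataFrame is a flat list of tokens. Assuming the file follows the usual CoNLL convention of separating sentences with blank lines, a variant along these lines could keep the sentence boundaries for later steps. This is a minimal sketch; `read_conll_sentences` and the sample tags are hypothetical and not part of the original notebook.

```python
def read_conll_sentences(lines):
    """Group (token, label) pairs into sentences, using blank lines as separators."""
    sentences, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if there is one
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        if len(parts) >= 2:
            current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical two-sentence sample in "token label" format
sample = ["ቤተሰብ O", "ይሁኑ O", "", "አዲስ B-LOC", "አበባ I-LOC"]
sentences = read_conll_sentences(sample)
print(len(sentences))
```

In the notebook, the same function could be fed an open file handle, since iterating over a file yields its lines.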


Installing the necessary libraries for our model.

In [4]:
!pip install transformers datasets torch
!pip install seqeval

Collecting seqeval
  Using cached seqeval-1.2.2-py3-none-any.whl
Installing collected packages: seqeval
Successfully installed seqeval-1.2.2


In [5]:
from sklearn.model_selection import train_test_split

# Split the token rows (80% for training, 20% for validation).
# train_test_split shuffles rows by default, so tokens are assigned individually, not by sentence.
train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"Training data size: {train_df.shape[0]}")
print(f"Validation data size: {val_df.shape[0]}")

Training data size: 135407
Validation data size: 33852
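
Because `train_test_split` shuffles rows by default, the split above works at the token level. A sentence-level split is one alternative that keeps each word in its original context. The sketch below assumes the data has already been grouped into sentences; the `sentences` list and the `split_sentences` helper are hypothetical, and only the standard library is used.

```python
import random


def split_sentences(sentences, val_fraction=0.2, seed=42):
    """Shuffle sentence indices and hold out a fraction of whole sentences."""
    indices = list(range(len(sentences)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(sentences) * val_fraction))
    val_idx = set(indices[:n_val])
    train = [s for i, s in enumerate(sentences) if i not in val_idx]
    val = [s for i, s in enumerate(sentences) if i in val_idx]
    return train, val


# Ten hypothetical one-token sentences
sentences = [[("w%d" % i, "O")] for i in range(10)]
train_sents, val_sents = split_sentences(sentences)
print(len(train_sents), len(val_sents))
```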


In [6]:
from transformers import DistilBertTokenizerFast, DistilBertForTokenClassification

# Load the tokenizer and model for DistilBERT
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-multilingual-cased")
model = DistilBertForTokenClassification.from_pretrained("distilbert-base-multilingual-cased", num_labels=len(df['Label'].unique()))


Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


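The above code loads a pre-trained DistilBERT tokenizer and a DistilBERT model for token classification from the "distilbert-base-multilingual-cased" checkpoint. The `num_labels` parameter is set to the number of unique labels in the 'Label' column of the `df` DataFrame, which sizes the classification head. Note that the training cell further down reloads both the tokenizer and the model from `distilbert-base-uncased`, so this multilingual checkpoint is not the one that is fine-tuned.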
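
Since `num_labels` is derived from the data, it can help to build explicit label-to-ID maps at the same time so that predictions can be decoded back into tags. A minimal sketch with a hypothetical tag list follows. Sorting the labels is a choice made here so the mapping is reproducible; it is not what the notebook does.

```python
def build_label_maps(labels):
    """Return (label2id, id2label) built from the distinct labels, in sorted order."""
    unique = sorted(set(labels))
    label2id = {label: idx for idx, label in enumerate(unique)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


labels = ["O", "B-LOC", "O", "I-LOC", "B-LOC"]  # hypothetical tags
label2id, id2label = build_label_maps(labels)
num_labels = len(label2id)
print(label2id)  # {'B-LOC': 0, 'I-LOC': 1, 'O': 2}
```

In the notebook, `labels` would be `df['Label']`. Dictionaries like these can typically be passed to `from_pretrained` through its `id2label` and `label2id` keyword arguments so they are stored in the model configuration.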


# Converting DataFrames to Datasets and Setting Up Data Collator for Token Classification

In [7]:
from datasets import Dataset
from transformers import DataCollatorForTokenClassification # import DataCollatorForTokenClassification

# Convert pandas DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
# Set up data collator
data_collator = DataCollatorForTokenClassification(tokenizer)

The above code converts the pandas DataFrames (`train_df` and `val_df`) into Hugging Face Datasets. It then sets up a `DataCollatorForTokenClassification`, which pads the inputs in each batch dynamically during training and evaluation. Note that `train_dataset` and `val_dataset` are reassigned to `NERDataset` instances in the training cell below, so these Hugging Face Datasets are not the ones passed to the `Trainer`.
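
To make dynamic padding concrete, the sketch below pads a batch to the length of its longest sequence and fills the padded label positions with -100, the default label padding value in `DataCollatorForTokenClassification`. It is a simplified pure-Python illustration, not the collator's actual implementation; `pad_batch`, `pad_token_id=0`, and the sample IDs are assumptions.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Pad every sequence in the batch to the longest one in that batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * pad)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * pad)
        # Padded label positions get -100 so the loss ignores them
        batch["labels"].append(f["labels"] + [label_pad_id] * pad)
    return batch


# Two hypothetical encoded examples of different lengths
features = [
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 1, 0, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 2, -100]},
]
batch = pad_batch(features)
print(batch["labels"][1])  # [-100, 2, -100, -100]
```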


# Setting Up a DistilBERT NER Pipeline for Token Classification and Model Training

The code below sets up a Named Entity Recognition (NER) pipeline using DistilBERT for token classification. It first builds a label map that converts text labels into numeric IDs. The `encode_data` function then groups consecutive rows that share the same label, tokenizes each group to the maximum length, and pads the label sequence with the `O` class. A custom `NERDataset` class wraps the tokenized inputs and labels for PyTorch. The tokenizer and model are reloaded here from `distilbert-base-uncased`, replacing the multilingual checkpoint loaded earlier, with the classification head sized to the number of unique labels. Training is managed by the Hugging Face `Trainer` class with the training arguments shown, and the model is evaluated on the validation data after each epoch.


In [8]:
import pandas as pd
from transformers import DistilBertTokenizerFast, DistilBertForTokenClassification, Trainer, TrainingArguments
import torch
from datasets import Dataset
from transformers import DataCollatorForTokenClassification

# Step 1: Map labels to IDs using combined labels from both train and val
def create_label_map(train_df, val_df):
    label_list = pd.concat([train_df['Label'], val_df['Label']]).unique().tolist()
    label_map = {label: idx for idx, label in enumerate(label_list)}
    return label_map, len(label_map)

label_map, num_labels = create_label_map(train_df, val_df)

# Step 2: Tokenization and Encoding Function
def encode_data(df, tokenizer, label_map):
    tokens = []
    labels = []

    # Each group is a run of consecutive rows that share the same label (not a sentence)
    for _, group in df.groupby((df['Label'] != df['Label'].shift()).cumsum()):
        tokenized_input = tokenizer(
            list(group['Token']),
            is_split_into_words=True,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        tokens.append(tokenized_input)

        # One label ID per word; these are not aligned to subword or special tokens
        label_ids = [label_map.get(label, label_map['O']) for label in group['Label']]
        # Pad labels with the 'O' class (not -100) up to the tokenized length
        label_ids += [label_map['O']] * (tokenized_input['input_ids'].shape[1] - len(label_ids))
        labels.append(torch.tensor(label_ids))

    return tokens, labels

# Step 3: Initialize DistilBERT Tokenizer (English uncased checkpoint; replaces the multilingual tokenizer loaded earlier)
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Step 4: Tokenize and Encode Train and Validation Data
train_tokens, train_labels = encode_data(train_df, tokenizer, label_map)
val_tokens, val_labels = encode_data(val_df, tokenizer, label_map)

# Step 5: Create a Dataset Class for Token Classification
class NERDataset(torch.utils.data.Dataset):
    def __init__(self, tokens, labels):
        self.tokens = tokens
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {
            'input_ids': self.tokens[idx]['input_ids'].squeeze(),
            'attention_mask': self.tokens[idx]['attention_mask'].squeeze(),
            'labels': self.labels[idx]
        }

# Step 6: Create Datasets using the NERDataset Class
train_dataset = NERDataset(train_tokens, train_labels)
val_dataset = NERDataset(val_tokens, val_labels)

# Step 7: Initialize the DistilBERT Model for Token Classification
model = DistilBertForTokenClassification.from_pretrained('distilbert-base-uncased', num_labels=num_labels)

# Step 8: Set up Data Collator for Token Classification
data_collator = DataCollatorForTokenClassification(tokenizer)

# Step 9: Define Training Arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,  # Adjust as needed
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,  # Accumulate gradients over 2 batches
    warmup_steps=500,
    weight_decay=0.01,
    fp16=True,
    logging_dir='./logs',
    eval_strategy="epoch",  # Evaluate after each epoch
    remove_unused_columns=False,
)

# Step 10: Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer
)

# Step 11: Train the Model
trainer.train()



Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
0,0.0019,0.001862
2,0.0017,0.001748


TrainOutput(global_step=7386, training_loss=0.01591669197751357, metrics={'train_runtime': 2484.1229, 'train_samples_per_second': 47.581, 'train_steps_per_second': 2.973, 'total_flos': 1.5441753636513792e+16, 'train_loss': 0.01591669197751357, 'epoch': 2.9993908629441624})
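
The figures in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single device, so the effective batch size is the per-device batch size times the gradient accumulation steps. The numbers are copied from the training arguments and the output above.

```python
per_device_batch = 8
grad_accum = 2
global_step = 7386
epochs_completed = 2.9993908629441624
samples_per_second = 47.581
runtime_seconds = 2484.1229

# Sequences consumed per optimizer step (single-device assumption)
effective_batch = per_device_batch * grad_accum
# Optimizer steps per epoch, roughly 2,460
steps_per_epoch = global_step / epochs_completed
# Encoded sequences seen per epoch, roughly 39,400
sequences_per_epoch = steps_per_epoch * effective_batch
# Sequences processed over the whole run, from the reported throughput
total_sequences = samples_per_second * runtime_seconds

print(effective_batch, int(steps_per_epoch), int(sequences_per_epoch))
```

The estimate of roughly 39,400 sequences per epoch is well below the 135,407 token rows in the training split, which is consistent with `encode_data` producing one sequence per run of identical labels rather than one per token.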

### Conclusion:
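The logged training loss fell from 0.0019 to 0.0017 and the validation loss from 0.001862 to 0.001748, with validation loss tracking training loss closely, so these figures show no sign of overfitting. The `training_loss` of 0.0159 in `TrainOutput` is the average over all 7,386 steps rather than the final value. Training took about 2,484 seconds (roughly 41 minutes) at about 47.6 samples per second. One caveat: `encode_data` pads every label sequence to the maximum length with the `O` class, so most labeled positions are padding, and the very low losses may largely reflect the model predicting `O` there rather than how well it tags entities.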
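
A common way to prepare labels for a subword tokenizer is to align them with the tokenizer's word IDs and mark special tokens, continuation subwords, and padding with -100, so that the loss ignores them and they can be filtered out of the metrics. The sketch below shows that alignment on a hand-written `word_ids` list of the kind a fast tokenizer's `word_ids()` method returns. `align_labels` is a hypothetical helper, and this is not what `encode_data` in the notebook does.

```python
def align_labels(word_ids, word_label_ids, ignore_index=-100):
    """Give each word's label to its first subword; ignore everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)             # special token or padding
        elif word_id != previous:
            aligned.append(word_label_ids[word_id])  # first subword of a word
        else:
            aligned.append(ignore_index)             # continuation subword
        previous = word_id
    return aligned


# [CLS], word 0 split into two subwords, word 1, [SEP], then two padding positions
word_ids = [None, 0, 0, 1, None, None, None]
word_label_ids = [2, 0]  # hypothetical label IDs for the two words
print(align_labels(word_ids, word_label_ids))  # [-100, 2, -100, 0, -100, -100, -100]
```

Labeling only the first subword is one convention; another is to repeat the word's label on every subword. Either way, padding and special tokens receive -100.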

In [9]:
import pickle
import os

# Save the trained model as a .pkl file
model_dir = os.path.join('Ethiomart', 'Models')
os.makedirs(model_dir, exist_ok=True)  # Create directory if it doesn't exist

model_path = os.path.join(model_dir, 'DistillBERT_model.pkl')
with open(model_path, 'wb') as f:
    pickle.dump(model, f)

print(f"Model saved to {model_path}")

Model saved to Ethiomart/Models/DistillBERT_model.pkl
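
Pickling the whole model object works, but Hugging Face models and tokenizers also provide `save_pretrained`, which writes the weights and configuration to a directory that `from_pretrained` can load later. A minimal sketch follows; `save_for_reuse` and the `DistilBERT_ner` directory name are assumptions made for illustration.

```python
import os


def save_for_reuse(model, tokenizer, out_dir):
    """Save a model and its tokenizer side by side in one directory."""
    os.makedirs(out_dir, exist_ok=True)
    model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
    return out_dir


# In the notebook this could be called as:
# save_for_reuse(model, tokenizer, os.path.join('Ethiomart', 'Models', 'DistilBERT_ner'))
# and the model reloaded later with
# DistilBertForTokenClassification.from_pretrained(<the same directory>)
```

Saving the tokenizer alongside the model keeps the vocabulary and the weights together, which avoids mismatches when the model is loaded in another session.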


In [10]:
from sklearn.metrics import accuracy_score, f1_score

# Step 1: Encode the evaluation data (this reuses the validation split, val_df)
test_tokens, test_labels = encode_data(val_df, tokenizer, label_map)
test_dataset = NERDataset(test_tokens, test_labels)

# Step 2: Run predictions on the test set
predictions, label_ids, metrics = trainer.predict(test_dataset)

# Step 3: Post-process predictions
predictions = predictions.argmax(axis=2)

# Flatten predictions and labels for metric calculation
true_labels = [label for label_batch in label_ids for label in label_batch]
true_predictions = [pred for pred_batch in predictions for pred in pred_batch]

# Remove positions labeled -100 from labels and predictions in a single pass so both lists stay aligned
filtered = [(label, pred) for label, pred in zip(true_labels, true_predictions) if label != -100]
true_labels = [label for label, _ in filtered]
true_predictions = [pred for _, pred in filtered]

# Step 4: Calculate Accuracy and F1 Score
accuracy = accuracy_score(true_labels, true_predictions)
f1 = f1_score(true_labels, true_predictions, average='weighted')

print(f"Test Accuracy: {accuracy}")
print(f"Test F1 Score: {f1}")

Test Accuracy: 0.9993360644551087
Test F1 Score: 0.9992220569563852
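
`seqeval` is installed earlier in the notebook, but the scores above are token-level metrics from scikit-learn. For entity-level scores, `seqeval` expects one list of tag strings per sequence, so the integer arrays need to be decoded first. The sketch below does that decoding while skipping positions labeled -100; `to_tag_sequences` and the sample `id2label` map are hypothetical.

```python
def to_tag_sequences(label_ids, pred_ids, id2label, ignore_index=-100):
    """Decode integer labels and predictions into per-sequence tag lists."""
    true_tags, pred_tags = [], []
    for labels, preds in zip(label_ids, pred_ids):
        # Keep label/prediction pairs together so the two lists stay aligned
        kept = [(l, p) for l, p in zip(labels, preds) if l != ignore_index]
        true_tags.append([id2label[int(l)] for l, _ in kept])
        pred_tags.append([id2label[int(p)] for _, p in kept])
    return true_tags, pred_tags


id2label = {0: "O", 1: "B-LOC", 2: "I-LOC"}  # hypothetical tag set
label_ids = [[-100, 1, 2, 0, -100]]
pred_ids = [[0, 1, 0, 0, 0]]
true_tags, pred_tags = to_tag_sequences(label_ids, pred_ids, id2label)
print(true_tags, pred_tags)

# With seqeval installed, the result could be passed to
# seqeval.metrics.classification_report(true_tags, pred_tags)
```

In the notebook, `id2label` could be built by inverting `label_map`, for example `{idx: label for label, idx in label_map.items()}`.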


### LIME evaluation of the DistilBERT model is presented below:

In [38]:
!pip install lime
import numpy as np
from lime.lime_text import LimeTextExplainer
import torch

# Assuming you have your trained model, tokenizer, and label_map available

# Function to predict probabilities for a given sentence
def predict_proba(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    # Move the inputs to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    outputs = model(**inputs).logits
    probabilities = torch.softmax(outputs, dim=2).detach().cpu().numpy()

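    # Note: this returns the predicted class index for each token, shape (n_texts, seq_len).
    # LIME expects class probabilities of shape (n_texts, n_classes), which may explain the all-zero weights in the output.
    return probabilities.argmax(axis=2)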

# Build a sample text from the first 20 validation tokens (rows were shuffled by the split, so this is not a natural sentence)
sample_sentence = " ".join(val_df['Token'].tolist()[:20])

# Initialize the LIME explainer
explainer = LimeTextExplainer(class_names=list(label_map.keys()))

# Explain the prediction for the sample sentence
# Reduce the number of samples used for the explanation
explanation = explainer.explain_instance(sample_sentence, predict_proba, num_features=10, num_samples=50)  # Reduced from default (5000) to 50


# Print the explanation in a readable format
print("LIME Explanation:")
print(explanation.as_list())

# You can also visualize the explanation
# explanation.show_in_notebook(text=sample_sentence)


# Create a markdown string for the results
results_md = """
### LIME Evaluation Results

**Sample Sentence:**  {}

**Explanation:**

{}
""".format(sample_sentence, explanation.as_list())

print(results_md)

LIME Explanation:
[('ጆግ', 0.0), ('ወደ', 0.0), ('ጥሬ', 0.0), ('የፈለጉትን', 0.0), ('የራሱ', 0.0), ('ቁስ', 0.0), ('0911928738', 0.0), ('መገናኛ', 0.0), ('ማግ', 0.0), ('4350', 0.0)]

### LIME Evaluation Results

**Sample Sentence:**  ጆግ ወደ ጥሬ የፈለጉትን የራሱ ቁስ 0911928738 መገናኛ ማግ 4350 አይነት ሞል ዘመናዊ ሆነው ማስተላለፍ ዕቃ የቅመማ ያድርጉ፣ ከፈለጉ ጥሬ

**Explanation:**

[('ጆግ', 0.0), ('ወደ', 0.0), ('ጥሬ', 0.0), ('የፈለጉትን', 0.0), ('የራሱ', 0.0), ('ቁስ', 0.0), ('0911928738', 0.0), ('መገናኛ', 0.0), ('ማግ', 0.0), ('4350', 0.0)]
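
LIME's text explainer expects the prediction function to return one row of class probabilities per input text, that is, an array of shape `(n_texts, n_classes)`. A token classifier produces probabilities per token, so some aggregation is needed first. One option, sketched below with NumPy on made-up numbers, is to average the token probabilities over the non-padding positions. The helper name and the choice of mean pooling are assumptions, not part of the notebook.

```python
import numpy as np


def pool_token_probabilities(token_probs, attention_mask):
    """Average per-token class probabilities over real (non-padding) tokens."""
    token_probs = np.asarray(token_probs, dtype=float)         # (n_texts, seq_len, n_classes)
    mask = np.asarray(attention_mask, dtype=float)[..., None]  # (n_texts, seq_len, 1)
    summed = (token_probs * mask).sum(axis=1)                  # (n_texts, n_classes)
    counts = np.clip(mask.sum(axis=1), 1.0, None)              # avoid division by zero
    return summed / counts


# One hypothetical text with three token positions, the last one padding
token_probs = [[[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]]
attention_mask = [[1, 1, 0]]
pooled = pool_token_probabilities(token_probs, attention_mask)
print(pooled.shape)  # (1, 2)
```

Inside `predict_proba`, the softmax output and `inputs['attention_mask']` (both moved to NumPy) could be passed to a helper like this in place of the `argmax` call. Mean pooling blurs token-level detail, so explaining one token position at a time is another option worth considering.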

