# Fine-Tune DistilBERT for Multi-Class Text Classification Using TensorFlow and Keras

In this project, we explore the capabilities of DistilBERT for multi-class text classification by comparing three approaches: no fine-tuning, standard fine-tuning, and LoRA fine-tuning.

Table of Contents:
    1. Data Pre-Processing
    2. Metrics We Will Measure
    3. Classification with No Fine-Tuning
    4. Classification with Standard Fine-Tuning
    5. Classification with LoRA Fine-Tuning
    

In [136]:
## Import required packages
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
import tf_keras
import tensorflow as tf
import pandas as pd
import time
import numpy as np
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
from plotly.offline import iplot


## 1. Preprocess Data

### 1.1. Split into Train, Validation, Test using Stratified Sampling

In [117]:
# Import Data
root_path = 'data/full_dataset.csv'
df = pd.read_csv(root_path)
df.head()

# Encode the 'category' column into numerical labels
df['encoded_text'] = df['category'].astype('category').cat.codes

# Separate columns for splitting
data_texts = df['request'].to_list()  # 'request' is the text data
data_labels = df['encoded_text'].to_list()  # Encoded class labels
stratify_values = df['stratify_col'].to_list()  # Stratification column

# Split the data into Train/Validation sets with stratification
train_texts, val_texts, train_labels, val_labels, train_stratify, val_stratify = train_test_split(
    data_texts, data_labels, stratify_values, 
    test_size=0.2, stratify=stratify_values, random_state=0
)

# Split the Train set further into Train/Test with stratification
train_texts, test_texts, train_labels, test_labels = train_test_split(
    train_texts, train_labels, 
    test_size=0.1, stratify=train_stratify, random_state=0
)
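
Because the test set is carved out of the remaining 80% rather than out of the full dataset, `test_size=0.1` corresponds to roughly 8% of all rows, not 10%. The sketch below reproduces that arithmetic in plain Python. It assumes that scikit-learn rounds a fractional `test_size` up to the next whole sample, and it uses 4,996 rows, which is the sum of the split sizes printed in section 1.2. `two_stage_split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math

def two_stage_split_sizes(n_rows, val_fraction, test_fraction):
    """Return (train, validation, test) sizes for a two-stage hold-out split.

    Assumption: each held-out size is rounded up, as scikit-learn's
    train_test_split is understood to do for a float test_size.
    """
    n_val = math.ceil(n_rows * val_fraction)
    remaining = n_rows - n_val
    n_test = math.ceil(remaining * test_fraction)
    n_train = remaining - n_test
    return n_train, n_val, n_test

n_train, n_val, n_test = two_stage_split_sizes(4996, val_fraction=0.2, test_fraction=0.1)
print(n_train, n_val, n_test)      # 3596 1000 400
print(round(n_test / 4996, 3))     # 0.08
```

If a true 10% test share were wanted, the second `test_size` would have to be 0.125 (0.1 / 0.8) under the same assumptions.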

### 1.2. View test/train/validation Splits

In [118]:
# Map numerical labels back to category names
label_mapping = dict(enumerate(df['category'].astype('category').cat.categories))
print("\nLabel Mapping (Encoded -> Category):")

for encoded, category in label_mapping.items():
    print(f"{encoded}: {category}")

# Output dataset information
print("\nFinal dataset information:")
print(f"Train set size: {len(train_texts)}")
print(f"Validation set size: {len(val_texts)}")
print(f"Test set size: {len(test_texts)}")

print(f"Example train_texts: {train_texts[:3]}") 
print(f"Example train_labels: {train_labels[:3]}")
print(f"Example val_texts: {val_texts[:3]}") 
print(f"Example val_labels: {val_labels[:3]}")
print(f"Example test_texts: {test_texts[:3]}") 
print(f"Example test_labels: {test_labels[:3]}")


Label Mapping (Encoded -> Category):
0: Facilities Management
1: Finance
2: HR
3: IT Support
4: Marketing

Final dataset information:
Train set size: 3596
Validation set size: 1000
Test set size: 400
Example train_texts: ['I’m gathering details about rewards for long-term employees and was hoping you could provide some insight. Let me know if you need further specifics from me.', 'Do you have the latest version of the diversity and inclusion policies handbook? I need it for a new hire orientation.', 'Could you share detailed insights on the performance metrics for keyword research for PPC campaigns? I’d like to use this data for our planning.']
Example train_labels: [2, 2, 4]
Example val_texts: ['Could you outline the steps to optimize our launching retargeting ads approach? Any case studies or examples would be helpful.', 'Need access to server maintenance.', 'Insights on customer retention strategies performance needed.']
Example val_labels: [4, 3, 4]
Example test_texts: ['I’m exper

### 1.3. Optional: Save test/train/val to CSV

In [119]:
# Create DataFrames for each split
train_df = pd.DataFrame({
    'request': train_texts,
    'label': train_labels
})

val_df = pd.DataFrame({
    'request': val_texts,
    'label': val_labels
})

test_df = pd.DataFrame({
    'request': test_texts,
    'label': test_labels
})

# Save DataFrames to CSV files
train_df.to_csv("data/train.csv", index=False)
val_df.to_csv("data/validation.csv", index=False)
test_df.to_csv("data/test.csv", index=False)


### 1.4. Tokenize Data for DistilBERT Model

In [120]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation = True, padding = True  )

val_encodings = tokenizer(val_texts, truncation = True, padding = True )

test_encodings = tokenizer(test_texts, truncation = True, padding = True )
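
With `padding=True` the tokenizer pads every sequence to the longest one in the call, and `truncation=True` cuts sequences that exceed the model's maximum length. Since the three splits are tokenized in separate calls, their padded lengths may differ. The following is a library-free sketch of that behaviour, using made-up token ids and a hypothetical helper `pad_and_truncate`. It is a simplification: a real tokenizer also keeps the closing special token when it truncates.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate to max_length, then pad to the longest remaining sequence.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding.
    """
    truncated = [seq[:max_length] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target_len - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (target_len - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# Toy token ids for illustration only.
ids, mask = pad_and_truncate([[101, 7, 8, 102], [101, 9, 102]], max_length=512)
print(ids)   # [[101, 7, 8, 102], [101, 9, 102, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```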

In [121]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(32)


val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(32)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
)).batch(32)

## 2. Model Evaluation

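For each of our three models, we will collect the following metrics:
    1.    Accuracy
    2.    Precision (weighted average across classes)
    3.    Recall (weighted average across classes)
    4.    F1 score (weighted average across classes)
    5.    Training time (seconds)
    6.    Number of trainable parameters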

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Define evaluation function
def evaluate_model(y_true, y_pred):
    accuracy = accuracy_score(y_true, y_pred)
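    # zero_division=0 keeps the default result (0.0) for classes that are never predicted, without the warning
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)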
    recall = recall_score(y_true, y_pred, average='weighted')
    f1 = f1_score(y_true, y_pred, average='weighted')
    return accuracy, precision, recall, f1
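
To make the `average='weighted'` setting concrete, the sketch below computes support-weighted precision and recall by hand for a classifier that always predicts the same class. It is a plain-Python illustration with a hypothetical helper `weighted_precision_recall`, not a replacement for scikit-learn. It also shows that weighted recall always equals accuracy, which is why the Accuracy and Recall columns in the results tables of this notebook coincide.

```python
from collections import Counter

def weighted_precision_recall(y_true, y_pred):
    """Support-weighted precision and recall, with 0.0 for undefined precision."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    precision = recall = 0.0
    for label, count in support.items():
        weight = count / n
        # Precision is undefined when the label is never predicted; use 0.0.
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        precision += weight * p
        recall += weight * r
    return precision, recall

# A classifier that always predicts class 0 on a balanced three-class set.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0]
precision, recall = weighted_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(precision, 3), round(recall, 3), round(accuracy, 3))  # 0.111 0.333 0.333
```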


## 3. Predict Categories Without Fine-Tuning

### 3.1 Setup Model

In [112]:
# Model setup
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=5)
opt = tf_keras.optimizers.legacy.Adam(learning_rate=5e-5)
loss = tf_keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # Raw logits expected
model.compile(optimizer=opt, loss=loss, metrics=['accuracy'])


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

### 3.2 Evaluate Performance of DistilBERT With No Fine-Tuning

In [134]:
# Perform prediction on the test dataset
y_true = []
y_pred = []

# Iterate over test dataset to collect true labels and predictions
for batch in test_dataset:
    input_data, labels = batch
    y_true.extend(labels.numpy())  # Collect true labels
    logits = model.predict(input_data).logits  # Predict logits
    predictions = logits.argmax(axis=-1)  # Convert logits to predicted labels
    y_pred.extend(predictions)


# Evaluate the model
no_finetuning_accuracy, no_finetuning_precision, no_finetuning_recall, no_finetuning_f1 = evaluate_model(y_true, y_pred)
no_finetuning_training_time = 0  # No training is involved because the model is not fine-tuned

# Find number of trainable parameters
trainable_params = 0
for variable in model.trainable_weights:
    # Use the 'count_params' method to get the number of parameters in each variable
    param_count = tf.keras.backend.count_params(variable)
    trainable_params += param_count
no_finetuning_trainable_params = trainable_params

model_metrics = [
    {
        'Model': 'No Fine-Tuning',
        'Accuracy': no_finetuning_accuracy,
        'Precision': no_finetuning_precision,
        'Recall': no_finetuning_recall,
        'F1 Score': no_finetuning_f1,
        'Training Time (s)': 0,
        'Trainable Parameters': trainable_params
    },

]
# Convert the list of dictionaries into a DataFrame
df_metrics = pd.DataFrame(model_metrics)

# Display the DataFrame
display(df_metrics)




Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.



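|   | Model | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Trainable Parameters |
|---|-------|----------|-----------|--------|----------|-------------------|----------------------|
| 0 | No Fine-Tuning | 0.2225 | 0.098889 | 0.2225 | 0.084762 | 0 | 66957317 |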
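
The trainable-parameter count can be cross-checked by hand. A dense layer with `n_in` inputs and `n_out` outputs holds `n_in * n_out` weights plus `n_out` biases. The sketch below applies that to the two classification-head layers and adds the 66,362,880 encoder parameters reported in the model summary printed later in this notebook. The hidden size of 768 is the DistilBERT base configuration, and `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out, use_bias=True):
    """Number of parameters in a fully connected layer."""
    return n_in * n_out + (n_out if use_bias else 0)

HIDDEN_SIZE = 768             # DistilBERT base hidden size
NUM_LABELS = 5                # categories in this dataset
ENCODER_PARAMS = 66_362_880   # 'distilbert' row of the model summary

pre_classifier = dense_params(HIDDEN_SIZE, HIDDEN_SIZE)
classifier = dense_params(HIDDEN_SIZE, NUM_LABELS)
total = ENCODER_PARAMS + pre_classifier + classifier
print(pre_classifier, classifier, total)  # 590592 3845 66957317
```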


## 4. Predict Categories With Standard Fine-Tuning

### 4.1. Define Model

In [137]:
# Model setup
std_ft_model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=5)
opt = tf_keras.optimizers.legacy.Adam(learning_rate=5e-5)
loss = tf_keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # Raw logits expected
std_ft_model.compile(optimizer=opt, loss=loss, metrics=['accuracy'])


### 4.2. Train Model

In [138]:
std_ft_start_time = time.time()
std_ft_model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=3,
    verbose=1
) 
std_ft_end_time = time.time()


Epoch 1/3
Epoch 2/3
Epoch 3/3
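
The wall-clock measurement above can be wrapped in a small context manager so that every model is timed the same way. This is only a sketch: `Timer` is a hypothetical helper, and `time.perf_counter` is used in place of `time.time` because it is a monotonic clock intended for measuring intervals.

```python
import time

class Timer:
    """Context manager that records elapsed seconds in .elapsed."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self._start
        return False  # do not swallow exceptions

with Timer() as timer:
    time.sleep(0.05)  # stand-in for model.fit(...)

print(f"Training time: {timer.elapsed:.2f} s")
```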


### 4.3. Evaluate Model

In [139]:
# Perform prediction on the test dataset
y_true = []
y_pred = []

# Iterate over test dataset to collect true labels and predictions
for batch in test_dataset:
    input_data, labels = batch
    y_true.extend(labels.numpy())  # Collect true labels
    logits = std_ft_model.predict(input_data).logits  # Predict logits
    predictions = logits.argmax(axis=-1)  # Convert logits to predicted labels
    y_pred.extend(predictions)


# Evaluate the model
std_ft_accuracy, std_ft_precision, std_ft_recall, std_ft_f1 = evaluate_model(y_true, y_pred)
std_ft_training_time = std_ft_end_time - std_ft_start_time

# Find number of trainable parameters
trainable_params = 0
for variable in std_ft_model.trainable_weights:
    # Use the 'count_params' method to get the number of parameters in each variable
    param_count = tf.keras.backend.count_params(variable)
    trainable_params += param_count
std_ft_trainable_params = trainable_params



model_metrics = [
    {
        'Model': 'No Fine-Tuning',
        'Accuracy': no_finetuning_accuracy,
        'Precision': no_finetuning_precision,
        'Recall': no_finetuning_recall,
        'F1 Score': no_finetuning_f1,
        'Training Time (s)': 0,
        'Trainable Parameters': no_finetuning_trainable_params
    },
    {
        'Model': 'Standard Fine-Tuning',
        'Accuracy': std_ft_accuracy,
        'Precision': std_ft_precision,
        'Recall': std_ft_recall,
        'F1 Score': std_ft_f1,
        'Training Time (s)': std_ft_training_time,
        'Trainable Parameters': trainable_params
    },

]
# Convert the list of dictionaries into a DataFrame
df_metrics = pd.DataFrame(model_metrics)

# Display the DataFrame
display(df_metrics)



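|   | Model | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Trainable Parameters |
|---|-------|----------|-----------|--------|----------|-------------------|----------------------|
| 0 | No Fine-Tuning | 0.2225 | 0.098889 | 0.2225 | 0.084762 | 0.0 | 66957317 |
| 1 | Standard Fine-Tuning | 1.0 | 1.0 | 1.0 | 1.0 | 278.466017 | 66957317 |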
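
The same prediction loop appears for each of the three models. One way to avoid repeating it is a helper that takes the batches and a prediction callable. The sketch below is library-free so that it can run on its own; in the notebook, `predict_logits` would wrap something like `model.predict(inputs).logits`. The names `collect_predictions`, `argmax`, and `fake_logits` are hypothetical.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])

def collect_predictions(batches, predict_logits):
    """Return (y_true, y_pred) for an iterable of (inputs, labels) batches.

    predict_logits is any callable mapping a batch of inputs to one row of
    logits per example.
    """
    y_true, y_pred = [], []
    for inputs, labels in batches:
        y_true.extend(labels)
        y_pred.extend(argmax(row) for row in predict_logits(inputs))
    return y_true, y_pred

# Stub "model": scores class 1 highest for even inputs, class 2 otherwise.
def fake_logits(inputs):
    return [[0.0, 2.0, 1.0] if x % 2 == 0 else [0.0, 1.0, 2.0] for x in inputs]

batches = [([2, 3], [1, 2]), ([4, 5], [1, 0])]
print(collect_predictions(batches, fake_logits))  # ([1, 2, 1, 0], [1, 2, 1, 2])
```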


## 5. Predict Categories With LoRA Fine-Tuning

In [141]:
import math
from transformers import TFDistilBertForSequenceClassification
from typing import List

class LoraLayer(tf.keras.layers.Layer):

    def __init__(
        self,
        original_layer,
        rank: int = 8,
        alpha: int = 32,
        dim: int = 768,
        dropout: float = 0.05,
        **kwargs,
    ):
        # We want to keep the name of this layer the same as the original
        # dense layer.
        original_layer_config = original_layer.get_config()
        name = original_layer_config["name"]

        kwargs.pop("name", None)

        super().__init__(name=name, **kwargs)

        self.rank = rank
        self.alpha = alpha
        self._scale = alpha / rank
        self.dim = dim  # dim of DistilBert hidden states.
        self.dropout = dropout

        # Layers.

        # Original dense layer.
        self.original_layer = original_layer
        # No matter whether we are training the model or are in inference mode,
        # this layer should be frozen.
        self.original_layer.trainable = False

        # LoRA dense layers.
        self.A = tf.keras.layers.Dense(
            units=rank,
            use_bias=False,
            # Note: the original paper mentions that normal distribution was
            # used for initialization. However, the official LoRA implementation
            # uses "Kaiming/He Initialization".
            kernel_initializer=tf.keras.initializers.VarianceScaling(scale=math.sqrt(5), mode="fan_in", distribution="uniform"),
            name="lora_A",
        )

        self.B = tf.keras.layers.Dense(
            units=self.dim,
            use_bias=False,
            kernel_initializer="zeros",
            name="lora_B",
        )

        self.dropout_layer = tf.keras.layers.Dropout(self.dropout)

    def call(self, inputs: tf.Tensor) -> tf.Tensor:
        original_output = self.original_layer(inputs)

        x = self.A(inputs)
        x = self.dropout_layer(x)
        lora_output = self.B(x) * self._scale

        return original_output + lora_output

DISTILBERT_LINEAR_MODULES_DICT = {
    "q_lin": {"parent_layer": "attention", "input_dim": 768, "dim": 768},
    "v_lin": {"parent_layer": "attention", "input_dim": 768, "dim": 768},
    "k_lin": {"parent_layer": "attention", "input_dim": 768, "dim": 768},
    "out_lin": {"parent_layer": "attention", "input_dim": 768, "dim": 768},
    "lin1": {"parent_layer": "ffn", "input_dim": 768, "dim": 3072},
    "lin2": {"parent_layer": "ffn", "input_dim": 3072, "dim": 768},
}

LORA_PARAMETERS = {
    "rank": 8,
    "alpha": 8,
    "target_modules": ["q_lin", "v_lin", "k_lin", "out_lin", "lin1", "lin2"],
    "dropout": 0.05,
}

def apply_lora(
    model,
    rank: int,
    alpha: int,
    target_modules: List[str],
    dropout: float = 0.05
):
    for i in range(model.distilbert.transformer.n_layers):
        for target_module in target_modules:
            parent_layer_name = DISTILBERT_LINEAR_MODULES_DICT[target_module]["parent_layer"]
            parent_layer = getattr(
                model.distilbert.transformer.layer[i],
                parent_layer_name,
            )

            original_target_layer = getattr(parent_layer, target_module)
            original_target_layer_dim = DISTILBERT_LINEAR_MODULES_DICT[target_module]["dim"]

            lora_layer = LoraLayer(
                original_layer=original_target_layer,
                rank=rank,
                alpha=alpha,
                trainable=True,
                dim=original_target_layer_dim,
                dropout=dropout,
            )
            setattr(parent_layer, target_module, lora_layer)

            input_dim = DISTILBERT_LINEAR_MODULES_DICT[target_module]["input_dim"]
            getattr(parent_layer, target_module).A.build(input_dim)
            getattr(parent_layer, target_module).B.build(rank)

    # Set all distilbert linear layers to trainable=False except the LoRA layers
    model.distilbert.embeddings.trainable = False
    for layer in model.distilbert._flatten_layers():
        lst_of_sublayers = list(layer._flatten_layers())

        if len(lst_of_sublayers) == 1:  # "leaves of the model"
            if layer.name in ["lora_A", "lora_B"]:
                layer.trainable = True
            else:
                layer.trainable = False

    return model

def merge_lora_weights(
    model,
    rank: int,
    alpha: int,
    target_modules: List[str]
):

    scale = alpha / rank
    for i in range(model.distilbert.transformer.n_layers):
        for target_module in target_modules:
            parent_layer_name = DISTILBERT_LINEAR_MODULES_DICT[target_module]["parent_layer"]
            parent_layer = getattr(
                model.distilbert.transformer.layer[i],
                parent_layer_name,
            )

            target_layer = getattr(parent_layer, target_module)
            target_layer_input_dim = DISTILBERT_LINEAR_MODULES_DICT[target_module]["input_dim"]

            A_layer = getattr(target_layer, "A")
            B_layer = getattr(target_layer, "B")
            original_dense_layer = getattr(target_layer, "original_layer")

            lora_weights = tf.linalg.matmul(A_layer.kernel, B_layer.kernel)
            original_dense_layer_weights = original_dense_layer.kernel

            merged_layer_weights = original_dense_layer_weights + lora_weights * scale
            merged_layer_bias = original_dense_layer.bias

            merged_layer = tf.keras.layers.Dense(
                units=original_dense_layer.units,
                kernel_initializer=tf.constant_initializer(merged_layer_weights.numpy()),
                bias_initializer=tf.constant_initializer(merged_layer_bias.numpy()),
                name=target_module,
            )
            merged_layer.build(target_layer_input_dim)

            setattr(parent_layer, target_module, merged_layer)

    return model

batch_size = 32
num_epochs = 3
batches_per_epoch = len(train_dataset)
total_train_steps = batches_per_epoch * num_epochs

TRAINING_PARAMETERS = {
    "batch_size": batch_size,
    "num_epochs": num_epochs,
    "batches_per_epoch": batches_per_epoch,
    "total_train_steps": total_train_steps,
    "learning_rate": 2e-5,
    "num_warmup_steps": 0,
}

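def train(train_dataset, validation_dataset, model, training_parameters, lora_parameters):
    """Fine-tune the model (with LoRA if lora_parameters is given) and return (model, training_time_seconds)."""
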
    if lora_parameters:
        print("Model summary before applying LoRA layers:")
        print(model.summary())
        print()
        model = apply_lora(model, **lora_parameters)
        print("Model summary after applying LoRA layers:")
        print(model.summary())
        print()

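    # Note: the learning rate is hard-coded to 5e-5, the same value used for standard
    # fine-tuning; training_parameters["learning_rate"] is not read here.
    optimizer = tf_keras.optimizers.legacy.Adam(learning_rate=5e-5)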
    model.compile(optimizer=optimizer)

    # metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=validation_dataset)

    loRA_ft_start_time = time.time()
    model.fit(
        x=train_dataset,
        validation_data=validation_dataset,
        epochs=training_parameters["num_epochs"],
        # callbacks=[metric_callback]
    )
    loRA_ft_end_time = time.time()

    loRA_ft_training_time = loRA_ft_end_time - loRA_ft_start_time

    if lora_parameters:
        model = merge_lora_weights(
            model,
            rank=lora_parameters["rank"],
            alpha=lora_parameters["alpha"],
            target_modules=lora_parameters["target_modules"]
        )
        print()
        print("Model summary after merging LoRA weights:")
        print(model.summary())
        print()
        evaluation = model.evaluate(validation_dataset, return_dict=True)
        print("Evaluation after merging weights:", evaluation)

    return model, loRA_ft_training_time
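
Merging works because matrix multiplication distributes over addition: `x·W + s·(x·A)·B` equals `x·(W + s·A·B)`, with `s = alpha / rank`. The sketch below checks this on tiny made-up matrices in plain Python. Biases are left out because the merge does not change them, and dropout is treated as disabled, as it is at inference time. `matmul` and `add` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

alpha, rank = 8, 2
scale = alpha / rank

x = [[1.0, 2.0, 3.0]]                      # one input row, input_dim = 3
W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # frozen kernel, 3 x 2
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # lora_A kernel, 3 x rank
B = [[0.5, -0.5], [0.25, 0.75]]            # lora_B kernel, rank x 2

# Forward pass with a separate LoRA branch.
separate = add(matmul(x, W), matmul(matmul(x, A), B), scale)

# Forward pass with the update folded into a single kernel.
merged_kernel = add(W, matmul(A, B), scale)
merged = matmul(x, merged_kernel)

print(separate)
print(merged)
```

The two printed rows agree up to floating-point rounding, so the merged model needs no extra computation at inference time.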


In [142]:
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=5)


loRA_ft_model, loRA_ft_training_time = train(
    train_dataset=train_dataset,
    validation_dataset=val_dataset,
    model=model,
    training_parameters=TRAINING_PARAMETERS,
    lora_parameters=LORA_PARAMETERS,
)



Model summary before applying LoRA layers:
Model: "tf_distil_bert_for_sequence_classification_22"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  3845      
                                                                 
 dropout_531 (Dropout)       multiple                  0 (unused)
                                                                 
Total params: 66957317 (255.42 MB)
Trainable params: 66957317 (255.42 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
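
For comparison with the 66,957,317 trainable parameters above, the number of parameters added by the LoRA adapters can be estimated from `DISTILBERT_LINEAR_MODULES_DICT` and `LORA_PARAMETERS`. Each wrapped layer gains a `lora_A` kernel of shape `input_dim x rank` and a `lora_B` kernel of shape `rank x dim`, both without bias. The sketch assumes the six transformer layers of `distilbert-base-uncased` and counts only the adapters; any other layers that remain trainable, such as the classification head, would add to this figure. `lora_adapter_params` is a hypothetical helper.

```python
# (input_dim, dim) for each target module, copied from DISTILBERT_LINEAR_MODULES_DICT.
MODULE_SHAPES = {
    "q_lin": (768, 768),
    "v_lin": (768, 768),
    "k_lin": (768, 768),
    "out_lin": (768, 768),
    "lin1": (768, 3072),
    "lin2": (3072, 768),
}
RANK = 8       # LORA_PARAMETERS["rank"]
N_LAYERS = 6   # assumption: six transformer layers in distilbert-base-uncased

def lora_adapter_params(module_shapes, rank, n_layers):
    """Parameters in all lora_A and lora_B kernels across the encoder."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in module_shapes.values())
    return per_layer * n_layers

adapter_params = lora_adapter_params(MODULE_SHAPES, RANK, N_LAYERS)
print(adapter_params)                          # 663552
print(f"{adapter_params / 66_957_317:.2%}")    # 0.99%
```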


In [None]:
# Perform prediction on the test dataset
y_true = []
y_pred = []

# Iterate over test dataset to collect true labels and predictions
for batch in test_dataset:
    input_data, labels = batch
    y_true.extend(labels.numpy())  # Collect true labels
    logits = loRA_ft_model.predict(input_data).logits  # Predict logits
    predictions = logits.argmax(axis=-1)  # Convert logits to predicted labels
    y_pred.extend(predictions)


# Evaluate the model
loRA_ft_accuracy, loRA_ft_precision, loRA_ft_recall, loRA_ft_f1 = evaluate_model(y_true, y_pred)


# Find number of trainable parameters.
# Note: this is counted after merge_lora_weights has replaced the LoRA layers with
# plain Dense layers, so it describes the merged model rather than the adapters
# that were trained during fine-tuning.
loRA_ft_trainable_params = 0
for variable in loRA_ft_model.trainable_weights:
    # Use the 'count_params' method to get the number of parameters in each variable
    param_count = tf.keras.backend.count_params(variable)
    loRA_ft_trainable_params += param_count



model_metrics = [
    {
        'Model': 'No Fine-Tuning',
        'Accuracy': no_finetuning_accuracy,
        'Precision': no_finetuning_precision,
        'Recall': no_finetuning_recall,
        'F1 Score': no_finetuning_f1,
        'Training Time (s)': 0,
        'Trainable Parameters': no_finetuning_trainable_params
    },
    {
        'Model': 'Standard Fine-Tuning',
        'Accuracy': std_ft_accuracy,
        'Precision': std_ft_precision,
        'Recall': std_ft_recall,
        'F1 Score': std_ft_f1,
        'Training Time (s)': std_ft_training_time,
        'Trainable Parameters': trainable_params
    },
    {
        'Model': 'LoRA Fine-Tuning',
        'Accuracy': loRA_ft_accuracy,
        'Precision': loRA_ft_precision,
        'Recall': loRA_ft_recall,
        'F1 Score': loRA_ft_f1,
        'Training Time (s)': loRA_ft_training_time,
        'Trainable Parameters': loRA_ft_trainable_params
    },

]
# Convert the list of dictionaries into a DataFrame
df_metrics = pd.DataFrame(model_metrics)

# Display the DataFrame
display(df_metrics)

## Save Model

In [198]:
save_directory = "./saved_models" 

# Save the LoRA fine-tuned model returned by train() (the same object as `model`)
loRA_ft_model.save_pretrained(save_directory)

tokenizer.save_pretrained(save_directory)

('./saved_models/tokenizer_config.json',
 './saved_models/special_tokens_map.json',
 './saved_models/vocab.txt',
 './saved_models/added_tokens.json')