# Fine-Tune DistilBERT For Multi-Class Text Classification Using Tensorflow and Keras

In this project, we explore the capabilities of DistilBERT for multi-class text classification by comparing three approaches: no fine-tuning, standard fine-tuning, and LoRA fine-tuning.

Table of Contents:

    1. Data Pre-Processing
        1.1. Split into Train, Validation, Test using Stratified Sampling
        1.2. View test/train/validation Splits
        1.3. (Optional) Save test/train/val to CSV
        1.4. Tokenize Data For DistilBERT Model
    
    2. Model Evaluation
        2.1. Model Parameters
        2.2. Model Metrics

    3. DistilBERT with No Fine-Tuning
        3.1. Model Setup
        3.2. Model Evaluation

    4. DistilBERT with Standard Fine-Tuning
        4.1. Model Setup
        4.2. Model Training
        4.3. Model Evaluation

    5. DistilBERT with LoRA Fine-Tuning
        5.1. Implementing the 'LoRALayer' class
        5.2. Applying LoRA
        5.3. Model Setup
        5.4. Model Training
        5.5. Model Evaluation

    6. Saving the Model
        6.1. (Optional) Merging LoRA Weights
        6.2. Saving the Model
    

In [77]:
## Import required packages
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
import tf_keras
import tensorflow as tf
import pandas as pd
import time
import numpy as np
from sklearn.model_selection import train_test_split

## 1. Preprocess Data

### 1.1. Split into Train, Validation, Test using Stratified Sampling

In [78]:
# Import Data
root_path = 'data/full_dataset.csv'
df = pd.read_csv(root_path)
df.head()

# Encode the 'category' column into numerical labels
df['encoded_text'] = df['category'].astype('category').cat.codes

# Separate columns for splitting
data_texts = df['request'].to_list()  # 'request' is the text data
data_labels = df['encoded_text'].to_list()  # Encoded class labels
stratify_values = df['stratify_col'].to_list()  # Stratification column

# Split the data into Train/Validation sets with stratification
train_texts, val_texts, train_labels, val_labels, train_stratify, val_stratify = train_test_split(
    data_texts, data_labels, stratify_values, 
    test_size=0.2, stratify=stratify_values, random_state=0
)

# Split the Train set further into Train/Test with stratification
train_texts, test_texts, train_labels, test_labels = train_test_split(
    train_texts, train_labels, 
    test_size=0.1, stratify=train_stratify, random_state=0
)

### 1.2. View test/train/validation Splits

In [79]:
# Map numerical labels back to category names
label_mapping = dict(enumerate(df['category'].astype('category').cat.categories))
print("\nLabel Mapping (Encoded -> Category):")

for encoded, category in label_mapping.items():
    print(f"{encoded}: {category}")

# Output dataset information
print("\nFinal dataset information:")
print(f"Train set size: {len(train_texts)}")
print(f"Validation set size: {len(val_texts)}")
print(f"Test set size: {len(test_texts)}")

print(f"Example train_texts: {train_texts[:3]}") 
print(f"Example train_labels: {train_labels[:3]}")
print(f"Example val_texts: {val_texts[:3]}") 
print(f"Example val_labels: {val_labels[:3]}")
print(f"Example test_texts: {test_texts[:3]}") 
print(f"Example test_labels: {test_labels[:3]}")


Label Mapping (Encoded -> Category):
0: Facilities Management
1: Finance
2: HR
3: IT Support
4: Marketing

Final dataset information:
Train set size: 3596
Validation set size: 1000
Test set size: 400
Example train_texts: ['I’m gathering details about rewards for long-term employees and was hoping you could provide some insight. Let me know if you need further specifics from me.', 'Do you have the latest version of the diversity and inclusion policies handbook? I need it for a new hire orientation.', 'Could you share detailed insights on the performance metrics for keyword research for PPC campaigns? I’d like to use this data for our planning.']
Example train_labels: [2, 2, 4]
Example val_texts: ['Could you outline the steps to optimize our launching retargeting ads approach? Any case studies or examples would be helpful.', 'Need access to server maintenance.', 'Insights on customer retention strategies performance needed.']
Example val_labels: [4, 3, 4]
Example test_texts: ['I’m exper
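
The split sizes printed above can be reproduced by hand. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to a whole number of samples and gives the remainder to the training side. It also assumes that the full dataset has 4,996 rows, which is the sum of the three reported sizes; the notebook does not print this number directly. `two_stage_split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def two_stage_split_sizes(n_samples, val_fraction=0.2, test_fraction=0.1):
    """Mirror the two calls to train_test_split used in Section 1.1."""
    # First split: hold out the validation set from the full dataset
    n_val = math.ceil(val_fraction * n_samples)
    n_remaining = n_samples - n_val
    # Second split: hold out the test set from what is left
    n_test = math.ceil(test_fraction * n_remaining)
    n_train = n_remaining - n_test
    return n_train, n_val, n_test

sizes = two_stage_split_sizes(4996)  # 4996 is an assumed total (3596 + 1000 + 400)
print(sizes)  # (3596, 1000, 400)
```

Because the test set is carved out of the remaining 80%, the effective proportions are roughly 72% train, 20% validation, and 8% test.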

### 1.3. (Optional) Save test/train/val to CSV

In [80]:
# Create DataFrames for each split
train_df = pd.DataFrame({
    'request': train_texts,
    'label': train_labels
})

val_df = pd.DataFrame({
    'request': val_texts,
    'label': val_labels
})

test_df = pd.DataFrame({
    'request': test_texts,
    'label': test_labels
})

# Save DataFrames to CSV files
train_df.to_csv("data/train.csv", index=False)
val_df.to_csv("data/validation.csv", index=False)
test_df.to_csv("data/test.csv", index=False)


### 1.4. Tokenize Data For DistilBERT Model

In [81]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)

val_encodings = tokenizer(val_texts, truncation=True, padding=True)

test_encodings = tokenizer(test_texts, truncation=True, padding=True)

In [82]:
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(32)


val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
)).batch(32)

test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
)).batch(32)

## 2. Model Evaluation

### 2.1 Model Parameters

For all three models (no fine-tuning, standard fine-tuning, and LoRA fine-tuning), we use DistilBERT with the Adam optimizer, a SparseCategoricalCrossentropy loss, a batch size of 32, and 3 epochs.
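
With these settings, the number of optimizer steps follows directly from the training-set size. A small sketch, assuming the 3,596 training examples reported in Section 1.2 and that the final, partially filled batch is kept (the default behaviour of `tf.data.Dataset.batch`):

```python
import math

train_examples = 3596  # training-set size reported in Section 1.2
batch_size = 32
num_epochs = 3

# The last batch of each epoch is smaller than batch_size
batches_per_epoch = math.ceil(train_examples / batch_size)
total_train_steps = batches_per_epoch * num_epochs
last_batch_size = train_examples - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch, total_train_steps, last_batch_size)  # 113 339 12
```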

In [83]:
# Initial Model setup
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=5)
opt = tf_keras.optimizers.legacy.Adam(learning_rate=5e-5)
loss = tf_keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # Raw logits expected

batch_size = 32
num_epochs = 3
batches_per_epoch = len(train_dataset)
total_train_steps = batches_per_epoch * num_epochs

TRAINING_PARAMETERS = {
    "batch_size": batch_size,
    "num_epochs": num_epochs,
    "batches_per_epoch": batches_per_epoch,
    "total_train_steps": total_train_steps,
    "learning_rate": 5e-5,  # Same rate that is passed to the Adam optimizer above
    "num_warmup_steps": 0,
}

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

### 2.2 Model Metrics

For each of our three models, we will collect the following metrics:
    1.    accuracy
    2.    precision
    3.    recall
    4.    f1
    5.    Training Time
    6.    Trainable Parameters

In [90]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Define evaluation function
def evaluate_model(model):
    # Perform prediction on the test dataset
    y_true = []
    y_pred = []

    # Iterate over test dataset to collect true labels and predictions
    for batch in test_dataset:
        input_data, labels = batch
        y_true.extend(labels.numpy())  # Collect true labels
        logits = model.predict(input_data).logits  # Predict logits
        predictions = logits.argmax(axis=-1)  # Convert logits to predicted labels
        y_pred.extend(predictions)
    
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred, average='weighted')
    recall = recall_score(y_true, y_pred, average='weighted')
    f1 = f1_score(y_true, y_pred, average='weighted')
    trainable_params = np.sum([np.prod(v.get_shape().as_list()) for v in model.trainable_variables])
    return accuracy, precision, recall, f1, trainable_params
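
To make `average='weighted'` concrete, here is a plain-Python illustration of the same idea: per-class scores are averaged with weights proportional to each class's share of the true labels. This is a simplified sketch, not scikit-learn's implementation; `weighted_scores` is a hypothetical helper, and the toy labels are made up for the example. A class that is never predicted gets a precision of 0, which is the situation the `zero_division` warning refers to.

```python
from collections import Counter

def weighted_scores(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_predicted = sum(1 for p in y_pred if p == label)
        # Precision is undefined when a class is never predicted; use 0.0
        p_c = tp / n_predicted if n_predicted else 0.0
        r_c = tp / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / n
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    return accuracy, precision, recall, f1

# Toy example: a classifier that always predicts class 0
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0]
accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Support-weighted recall always equals accuracy, because the per-class true positives sum to the total number of correct predictions. This is why the Accuracy and Recall columns in the results tables carry the same value.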



We will use the following function to record and display our models' metrics:

In [85]:
# Initialize the metrics list
model_metrics = []

# Define a function to add metrics
def add_model_metrics(model_name, accuracy, precision, recall, f1, training_time, trainable_params):
    model_metrics.append({
        'Model': model_name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f1,
        'Training Time (s)': training_time,
        'Trainable Parameters': trainable_params
    })
    # Convert the list of dictionaries into a DataFrame
    df_metrics = pd.DataFrame(model_metrics)
    display(df_metrics)

## 3. DistilBERT With No Fine-Tuning

### 3.1 Model Setup

For the model without fine-tuning, we simply compile the original pre-trained model:

In [86]:
no_ft_model = model
no_ft_model.compile(optimizer=opt, loss=loss, metrics=['accuracy'])

### 3.2 Model Evaluation

In [87]:

# Evaluate the model
no_ft_accuracy, no_ft_precision, no_ft_recall, no_ft_f1, no_ft_trainable_params = evaluate_model(no_ft_model)

# Add metrics for "No Fine-Tuning"
no_ft_training_time = 0  # No training involved, since the model is not fine-tuned
add_model_metrics(
    'No Fine-Tuning',
    no_ft_accuracy, no_ft_precision, no_ft_recall, no_ft_f1,
    no_ft_training_time, no_ft_trainable_params
)




Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.



| | Model | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Trainable Parameters |
|---|---|---|---|---|---|---|---|
| 0 | No Fine-Tuning | 0.2425 | 0.114557 | 0.2425 | 0.142696 | 0 | 66957317 |


## 4. DistilBERT With Standard Fine-Tuning

### 4.1. Model Setup

In [94]:
std_ft_model = model
std_ft_model.compile(optimizer=opt, loss=loss, metrics=['accuracy'])

### 4.2 Model Training

In [95]:
std_ft_start_time = time.time()
std_ft_model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=TRAINING_PARAMETERS["num_epochs"],
    verbose=1,
) 
std_ft_end_time = time.time()

Epoch 1/3
Epoch 2/3
Epoch 3/3


### 4.3. Model Evaluation

In [96]:
# Evaluate the model
std_ft_accuracy, std_ft_precision, std_ft_recall, std_ft_f1, std_ft_trainable_params = evaluate_model(std_ft_model)

# Add metrics for "Standard Fine-Tuning"
std_ft_training_time = std_ft_end_time - std_ft_start_time  # Calculate training time
add_model_metrics(
    'Standard Fine-Tuning',
    std_ft_accuracy, std_ft_precision, std_ft_recall, std_ft_f1,
    std_ft_training_time, std_ft_trainable_params
)



| | Model | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Trainable Parameters |
|---|---|---|---|---|---|---|---|
| 0 | No Fine-Tuning | 0.2425 | 0.114557 | 0.2425 | 0.142696 | 0.0 | 66957317 |
| 1 | Standard Fine-Tuning | 1.0 | 1.0 | 1.0 | 1.0 | 292.727587 | 66957317 |


## 5. DistilBERT With LoRA Fine-Tuning

### 5.1 Implementing the LoRALayer class

First, we define the LoRA layer, which wraps an existing dense layer and adds a trainable low-rank update to its output:

In [97]:
import math
from typing import List

class LoraLayer(tf.keras.layers.Layer):

    def __init__(
        self,
        original_layer,
        rank: int = 8,
        alpha: int = 32,
        dim: int = 768,
        dropout: float = 0.05,
        **kwargs,
    ):
        # We want to keep the name of this layer the same as the original
        # dense layer.
        original_layer_config = original_layer.get_config()
        name = original_layer_config["name"]

        kwargs.pop("name", None)

        super().__init__(name=name, **kwargs)

        self.rank = rank
        self.alpha = alpha
        self._scale = alpha / rank
        self.dim = dim  # dim of DistilBert hidden states.
        self.dropout = dropout

        # Layers.

        # Original dense layer.
        self.original_layer = original_layer
        # No matter whether we are training the model or are in inference mode,
        # this layer should be frozen.
        self.original_layer.trainable = False

        # LoRA dense layers.
        self.A = tf.keras.layers.Dense(
            units=rank,
            use_bias=False,
            # Note: the original paper mentions that normal distribution was
            # used for initialization. However, the official LoRA implementation
            # uses "Kaiming/He Initialization".
            kernel_initializer=tf.keras.initializers.VarianceScaling(scale=math.sqrt(5), mode="fan_in", distribution="uniform"),
            name="lora_A",
        )

        self.B = tf.keras.layers.Dense(
            units=self.dim,
            use_bias=False,
            kernel_initializer="zeros",
            name="lora_B",
        )

        self.dropout_layer = tf.keras.layers.Dropout(self.dropout)

    def call(self, inputs: tf.Tensor) -> tf.Tensor:
        original_output = self.original_layer(inputs)

        x = self.A(inputs)
        x = self.dropout_layer(x)
        lora_output = self.B(x) * self._scale

        return original_output + lora_output
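
The computation in `call` can be illustrated without Keras. The NumPy sketch below is an illustration under stated assumptions, not the layer itself: the weights are random stand-ins, dropout is left out (as it would be at inference time), and `lora_forward` is a hypothetical helper. It shows that, because `lora_B` starts at zero, the wrapped layer initially produces exactly the output of the original dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 768, 768, 8, 32
scale = alpha / rank

W = rng.normal(size=(d_in, d_out))  # frozen kernel of the original dense layer
b = rng.normal(size=(d_out,))       # frozen bias of the original dense layer
A = rng.normal(size=(d_in, rank))   # stand-in for the randomly initialized lora_A
B = np.zeros((rank, d_out))         # lora_B is initialized to zeros

def lora_forward(x, W, b, A, B, scale):
    """Original dense output plus the scaled low-rank update."""
    return x @ W + b + (x @ A) @ B * scale

x = rng.normal(size=(4, d_in))
same_as_original = np.allclose(lora_forward(x, W, b, A, B, scale), x @ W + b)

dense_params = W.size + b.size  # parameters frozen in the original layer
lora_params = A.size + B.size   # parameters that LoRA actually trains
print(same_as_original, dense_params, lora_params)  # True 590592 12288
```

For a 768-to-768 projection at rank 8, the two LoRA matrices hold 12,288 values compared with 590,592 in the frozen dense layer.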



Next, we write the function that applies LoRA to the target layers.

### 5.2 Applying LoRA

In [98]:

DISTILBERT_LINEAR_MODULES_DICT = {
    "q_lin": {"parent_layer": "attention", "input_dim": 768, "dim": 768},
    "v_lin": {"parent_layer": "attention", "input_dim": 768, "dim": 768},
    "k_lin": {"parent_layer": "attention", "input_dim": 768, "dim": 768},
    "out_lin": {"parent_layer": "attention", "input_dim": 768, "dim": 768},
    "lin1": {"parent_layer": "ffn", "input_dim": 768, "dim": 3072},
    "lin2": {"parent_layer": "ffn", "input_dim": 3072, "dim": 768},
}

LORA_PARAMETERS = {
    "rank": 8,
    "alpha": 8,
    "target_modules": ["q_lin", "v_lin", "k_lin", "out_lin", "lin1", "lin2"],
    "dropout": 0.05,
}

def apply_lora(
    model,
    rank: int,
    alpha: int,
    target_modules: List[str],
    dropout: float = 0.05
):
    for i in range(model.distilbert.transformer.n_layers):
        for target_module in target_modules:
            parent_layer_name = DISTILBERT_LINEAR_MODULES_DICT[target_module]["parent_layer"]
            parent_layer = getattr(
                model.distilbert.transformer.layer[i],
                parent_layer_name,
            )

            original_target_layer = getattr(parent_layer, target_module)
            original_target_layer_dim = DISTILBERT_LINEAR_MODULES_DICT[target_module]["dim"]

            lora_layer = LoraLayer(
                original_layer=original_target_layer,
                rank=rank,
                alpha=alpha,
                trainable=True,
                dim=original_target_layer_dim,
                dropout=dropout,
            )
            setattr(parent_layer, target_module, lora_layer)

            input_dim = DISTILBERT_LINEAR_MODULES_DICT[target_module]["input_dim"]
            getattr(parent_layer, target_module).A.build(input_dim)
            getattr(parent_layer, target_module).B.build(rank)

    # Freeze every leaf layer inside model.distilbert except the LoRA layers
    model.distilbert.embeddings.trainable = False
    for layer in model.distilbert._flatten_layers():
        lst_of_sublayers = list(layer._flatten_layers())

        if len(lst_of_sublayers) == 1:  # "leaves of the model"
            layer.trainable = layer.name in ["lora_A", "lora_B"]

    return model

### 5.3. Model Setup

In [99]:
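# Note: `model` is the same object that was fine-tuned in Section 4 (`std_ft_model = model`),
# so LoRA starts from the already fine-tuned weights here. To compare both methods from the
# same starting point, reload the checkpoint with `from_pretrained` before calling `apply_lora`.
loRA_ft_model = apply_lora(model, **LORA_PARAMETERS)
# Compile with the same loss and metric as the other runs
loRA_ft_model.compile(optimizer=opt, loss=loss, metrics=['accuracy'])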

### 5.4 Model Training

In [100]:
loRA_ft_start_time = time.time()
loRA_ft_model.fit(
    x=train_dataset,
    validation_data=val_dataset,
    epochs=TRAINING_PARAMETERS["num_epochs"],
)
loRA_ft_end_time = time.time()

Epoch 1/3
Epoch 2/3
Epoch 3/3


### 5.5 Model Evaluation

In [102]:
# Evaluate the model
loRA_ft_accuracy, loRA_ft_precision, loRA_ft_recall, loRA_ft_f1, loRA_ft_trainable_params = evaluate_model(loRA_ft_model)
loRA_ft_training_time = loRA_ft_end_time - loRA_ft_start_time # Find the training time

# Add metrics for "LoRA Fine-Tuning"
add_model_metrics(
    'LoRA Fine-Tuning',
    loRA_ft_accuracy, loRA_ft_precision, loRA_ft_recall, loRA_ft_f1,
    loRA_ft_training_time, loRA_ft_trainable_params
)



| | Model | Accuracy | Precision | Recall | F1 Score | Training Time (s) | Trainable Parameters |
|---|---|---|---|---|---|---|---|
| 0 | No Fine-Tuning | 0.2425 | 0.114557 | 0.2425 | 0.142696 | 0.0 | 66957317 |
| 1 | Standard Fine-Tuning | 1.0 | 1.0 | 1.0 | 1.0 | 292.727587 | 66957317 |
| 2 | LoRA Fine-Tuning | 1.0 | 1.0 | 1.0 | 1.0 | 240.234911 | 1257989 |
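
The trainable-parameter count reported for LoRA can be reproduced with a little arithmetic. The sketch below assumes the six transformer blocks of `distilbert-base-uncased`, and it assumes that the classification head (`pre_classifier` and `classifier`, both with biases) stays trainable, since `apply_lora` only freezes layers inside `model.distilbert`. The layer shapes are taken from `DISTILBERT_LINEAR_MODULES_DICT`.

```python
rank = 8
n_blocks = 6    # transformer blocks in distilbert-base-uncased
hidden = 768
num_labels = 5

# (input_dim, output_dim) of every dense layer wrapped by LoRA in one block
target_modules = {
    "q_lin": (768, 768),
    "k_lin": (768, 768),
    "v_lin": (768, 768),
    "out_lin": (768, 768),
    "lin1": (768, 3072),
    "lin2": (3072, 768),
}

# Each LoRA layer adds A (input_dim x rank) and B (rank x output_dim), with no biases
lora_per_block = sum(i * rank + rank * o for i, o in target_modules.values())
lora_total = n_blocks * lora_per_block

# Classification head: pre_classifier (768 -> 768) and classifier (768 -> 5)
head_params = (hidden * hidden + hidden) + (hidden * num_labels + num_labels)

trainable_params = lora_total + head_params
print(lora_total, head_params, trainable_params)  # 663552 594437 1257989
```

This matches the 1,257,989 trainable parameters in the table, roughly 1.9% of the 66,957,317 parameters updated by standard fine-tuning.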


## 6. Save Model

### 6.1. (Optional) Merging LoRA Weights

LoRA is an excellent approach for managing multiple fine-tuned models efficiently, since each task only needs its own small set of adapter weights. However, in this case, we only need to handle a single downstream task: classifying texts into categories. To optimize inference performance during deployment, we merge the LoRA layers into the base model and save the resulting model. The saved model then contains the full set of DistilBERT weights rather than only the adapter weights, but this is irrelevant here since the model will be used exclusively for inference, not training. Merging removes the extra LoRA matrix multiplications from each forward pass, which reduces inference latency and makes this approach the most suitable for our use case.

In [103]:
def merge_lora_weights(
    model,
    rank: int,
    alpha: int,
    target_modules: List[str]
):

    scale = alpha / rank
    for i in range(model.distilbert.transformer.n_layers):
        for target_module in target_modules:
            parent_layer_name = DISTILBERT_LINEAR_MODULES_DICT[target_module]["parent_layer"]
            parent_layer = getattr(
                model.distilbert.transformer.layer[i],
                parent_layer_name,
            )

            target_layer = getattr(parent_layer, target_module)
            target_layer_input_dim = DISTILBERT_LINEAR_MODULES_DICT[target_module]["input_dim"]

            A_layer = getattr(target_layer, "A")
            B_layer = getattr(target_layer, "B")
            original_dense_layer = getattr(target_layer, "original_layer")

            lora_weights = tf.linalg.matmul(A_layer.kernel, B_layer.kernel)
            original_dense_layer_weights = original_dense_layer.kernel

            merged_layer_weights = original_dense_layer_weights + lora_weights * scale
            merged_layer_bias = original_dense_layer.bias

            merged_layer = tf.keras.layers.Dense(
                units=original_dense_layer.units,
                kernel_initializer=tf.constant_initializer(merged_layer_weights.numpy()),
                bias_initializer=tf.constant_initializer(merged_layer_bias.numpy()),
                name=target_module,
            )
            merged_layer.build(target_layer_input_dim)

            setattr(parent_layer, target_module, merged_layer)

    return model
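
The merge relies on the identity `x @ W + (x @ A) @ B * scale == x @ (W + A @ B * scale)`. The NumPy sketch below checks this on small random matrices; the dimensions and values are arbitrary stand-ins chosen for the example, and dropout is omitted because it is inactive at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 16, 12, 4, 8
scale = alpha / rank

W = rng.normal(size=(d_in, d_out))  # kernel of the original dense layer
b = rng.normal(size=(d_out,))       # bias of the original dense layer
A = rng.normal(size=(d_in, rank))   # stand-in for a trained lora_A kernel
B = rng.normal(size=(rank, d_out))  # stand-in for a trained lora_B kernel
x = rng.normal(size=(5, d_in))

# Unmerged: the dense output plus the separate low-rank path
unmerged = x @ W + b + (x @ A) @ B * scale

# Merged: fold the low-rank update into a single kernel of the original shape
W_merged = W + (A @ B) * scale
merged = x @ W_merged + b

outputs_match = np.allclose(unmerged, merged)
print(outputs_match, W_merged.shape == W.shape)  # True True
```

The merged kernel has the same shape as the original one, so the merged layer costs a single matrix multiplication per forward pass.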

In [104]:

merged_model = merge_lora_weights(
    model,
    rank=LORA_PARAMETERS["rank"],
    alpha=LORA_PARAMETERS["alpha"],
    target_modules=LORA_PARAMETERS["target_modules"]
)

### 6.2 Saving the Model

In [198]:
save_directory = "./saved_models" 

merged_model.save_pretrained(save_directory)

tokenizer.save_pretrained(save_directory)

('./saved_models/tokenizer_config.json',
 './saved_models/special_tokens_map.json',
 './saved_models/vocab.txt',
 './saved_models/added_tokens.json')