🛠️ The overall problem is to develop a text classification system using a pre-trained DistilBERT model that can accurately predict categories based on textual input.

📝 The input feature, denoted as 'X', consists of raw text strings sourced from a dataset, which the model processes to predict categorical labels.

🎯 The target variable, referred to as 'label', represents the actual categories of the text, which are used to train the model and evaluate its accuracy.

📊 The model's performance is assessed through metrics such as the confusion matrix and accuracy, comparing the predicted labels against the actual labels.

🔧 The project involves not only fine-tuning a pre-trained language model on a specific dataset but also validating its effectiveness on both a smaller sample and a larger subset to ensure robustness and scalability of the predictions.

In [1]:
!pip -q install accelerate -U
!pip -q install transformers[torch]
!pip -q install datasets
!pip install --upgrade pyarrow
# Restart the runtime after installing

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m362.1/362.1 kB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m363.4/363.4 MB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.8/13.8 MB[0m [31m39.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m24.6/24.6 MB[0m [31m32.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m883.7/883.7 kB[0m [31m30.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.5/211.5 MB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.3/56.3 MB[0m [31m11.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import torch
from transformers import (
    pipeline,
    DistilBertTokenizer,
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset, DatasetDict, ClassLabel, Dataset

## Import Emotions Data

In [3]:
!wget https://raw.githubusercontent.com/venkatareddykonasani/Datasets/master/Final_Emotion_Data/five_emotions_data.csv
emotions_data=pd.read_csv("five_emotions_data.csv")
print(emotions_data.shape)
print(emotions_data.head())
print(emotions_data["label"].value_counts())

--2025-05-22 15:51:30--  https://raw.githubusercontent.com/venkatareddykonasani/Datasets/master/Final_Emotion_Data/five_emotions_data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4092906 (3.9M) [text/plain]
Saving to: ‘five_emotions_data.csv’


2025-05-22 15:51:30 (66.9 MB/s) - ‘five_emotions_data.csv’ saved [4092906/4092906]

(42645, 4)
   Id    Emotion                                               Text  label
0   1    sadness                            i didnt feel humiliated      3
1   2    sadness  i can go from feeling so hopeless to so damned...      3
2   4       love  i am ever feeling nostalgic about the fireplac...      1
3   6    sadness  ive been feeling a little burdened lately wasn...      3
4   9  happiness  i have been with petronas fo

## Use the DistilBERT model without fine-tuning

In [None]:
# DistilBERT base checkpoint; its classification head is untrained at this point
from transformers import pipeline
distilbert_model = pipeline(task="text-classification",
                            model="distilbert-base-uncased",
                            device="cuda",
                            )


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
sample_data=emotions_data.sample(10000, random_state=42)
# Keep only the first 100 words of each text
sample_data["Text"]=sample_data["Text"].apply(lambda x: " ".join(x.split()[:100]))
# The pipeline returns labels such as "LABEL_0"; store the label string and its numeric id
sample_data["bert_predicted"] = sample_data["Text"].apply(lambda x: distilbert_model(x)[0]["label"])
sample_data["bert_predicted_num"]=sample_data["bert_predicted"].apply(lambda x: int(x.split("_")[-1]))
sample_data.head()

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


          Id  Emotion                                               Text  label bert_predicted  bert_predicted_num
1900    2690  sadness  i social and dreaming about things that make y...      3        LABEL_0                   0
20627  27544    worry  is missing training tonight, the lurgy is on m...      4        LABEL_0                   0
12481  17415    worry               my HD is full. need to cleanup a lot      4        LABEL_0                   0
30267  39810  sadness  i'm watching missing pieces, just coz the them...      3        LABEL_0                   0
14420  19838  sadness          rain got so big weather so cold right now      3        LABEL_0                   0
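The warning above appears because the pipeline is called once per row. A minimal sketch of grouping texts into batches is shown here; `chunks`, `predict_in_batches` and `fake_classifier` are hypothetical names introduced for illustration, and the stand-in classifier only mimics the list-of-dicts output shape of a text-classification pipeline so that the sketch runs offline.

```python
from typing import Callable, Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_batches(classifier: Callable[[List[str]], List[dict]],
                       texts: Sequence[str],
                       batch_size: int = 32) -> List[str]:
    """Call `classifier` once per batch and collect the predicted label strings."""
    labels: List[str] = []
    for batch in chunks(texts, batch_size):
        results = classifier(list(batch))  # one call per batch instead of per row
        labels.extend(result["label"] for result in results)
    return labels


# Hypothetical stand-in for the Hugging Face pipeline (assumption, not the real model).
def fake_classifier(batch: List[str]) -> List[dict]:
    return [{"label": "LABEL_0", "score": 0.5} for _ in batch]


predicted = predict_in_batches(fake_classifier, ["a", "b", "c"], batch_size=2)
print(predicted)
```

In the notebook, something like `predict_in_batches(distilbert_model, sample_data["Text"].tolist())` should work the same way. The pipeline also accepts a list of texts together with a `batch_size` argument, which may be the simpler route.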


### Accuracy of the model without fine-tuning

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(sample_data["label"], sample_data["bert_predicted_num"])
print(cm)
accuracy=cm.diagonal().sum()/cm.sum()
print(accuracy)

[[2522    0    0    0    0]
 [1255    0    0    0    0]
 [1941    3    0    0    0]
 [2250    2    0    0    0]
 [2027    0    0    0    0]]
0.2522
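The 0.2522 figure is easier to interpret next to the class distribution. The sketch below re-enters the printed matrix by hand (plain Python lists, no extra libraries) and compares the accuracy with the share of true class 0. It suggests the untuned model behaves like a constant predictor of label 0, which is consistent with the warning that its classifier weights are newly initialized.

```python
# Confusion matrix printed above (rows: true label, columns: predicted label).
cm = [
    [2522, 0, 0, 0, 0],
    [1255, 0, 0, 0, 0],
    [1941, 3, 0, 0, 0],
    [2250, 2, 0, 0, 0],
    [2027, 0, 0, 0, 0],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

# Share of each true class in the sample (row sums divided by the total).
class_share = [sum(row) / total for row in cm]

# How often each label was predicted (column sums).
predicted_counts = [sum(row[j] for row in cm) for j in range(len(cm))]

print("accuracy:", accuracy)
print("share of true class 0:", class_share[0])
print("predictions per label:", predicted_counts)
```

Because nearly every prediction is label 0, the accuracy simply equals the fraction of rows whose true label is 0.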


## Fine-tuning the model with our data


In [None]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict, ClassLabel, Dataset
import pandas as pd
import torch

In [None]:
hf_dataset = Dataset.from_pandas(sample_data)
# Split the dataset into training and testing sets
split = hf_dataset.train_test_split(test_size=0.2)  # 80% training, 20% testing
dataset = DatasetDict({
    'train': split['train'],
    'test': split['test']
})
dataset

DatasetDict({
    train: Dataset({
        features: ['Id', 'Emotion', 'Text', 'label', 'bert_predicted', 'bert_predicted_num', '__index_level_0__'],
        num_rows: 8000
    })
    test: Dataset({
        features: ['Id', 'Emotion', 'Text', 'label', 'bert_predicted', 'bert_predicted_num', '__index_level_0__'],
        num_rows: 2000
    })
})

### Load the tokenizer

In [None]:
# Load the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Padding: the DistilBERT tokenizer already defines '[PAD]' as its pad token and has
# no EOS token, so no pad-token overrides are needed here.

def tokenize_function(examples):
    return tokenizer(examples["Text"], padding="max_length", truncation=True, max_length=100)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
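To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy illustration in plain Python. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are illustrative, and a real tokenizer also keeps the closing special token when it truncates, which this sketch does not.

```python
from typing import Dict, List


def pad_or_truncate(token_ids: List[int], max_length: int, pad_id: int = 0) -> Dict[str, List[int]]:
    """Cut `token_ids` to `max_length`, or right-pad with `pad_id`, and build the attention mask."""
    kept = token_ids[:max_length]
    padding = [pad_id] * (max_length - len(kept))
    return {
        "input_ids": kept + padding,
        # 1 marks a real token, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(kept) + [0] * len(padding),
    }


short = pad_or_truncate([101, 1045, 2514, 102], max_length=6)  # shorter than max_length: padded
long = pad_or_truncate(list(range(1, 11)), max_length=6)       # longer than max_length: cut

print(short)
print(long)
```

With `max_length=100` as in the cell above, every example ends up with exactly 100 input ids, so batches can be stacked into fixed-size tensors.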


### Load and Train the model

In [None]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',
                                                            num_labels=5,  # five emotion classes (labels 0-4)
                                                            pad_token_id=tokenizer.pad_token_id)

training_args = TrainingArguments(
    output_dir="./results_bert_custom",
    num_train_epochs=5,
    logging_dir="./logs_bert_custom",
    # Evaluate on the test split after every epoch.
    # Note: recent transformers releases name this argument `eval_strategy`.
    evaluation_strategy="epoch"
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
)

# Start training
trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss
1,1.058,1.038556
2,0.8724,1.036763
3,0.5976,1.236661
4,0.3152,1.79817
5,0.1816,2.093273


TrainOutput(global_step=5000, training_loss=0.6239358413696289, metrics={'train_runtime': 293.3728, 'train_samples_per_second': 136.345, 'train_steps_per_second': 17.043, 'total_flos': 1034956920000000.0, 'train_loss': 0.6239358413696289, 'epoch': 5.0})
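The log above shows training loss falling every epoch while validation loss bottoms out early and then climbs, which usually indicates overfitting. A small sketch that picks the epoch with the lowest validation loss from the logged values follows; the numbers are copied from the table and the variable names are illustrative.

```python
# (epoch, training loss, validation loss) copied from the training log above.
history = [
    (1, 1.058, 1.038556),
    (2, 0.8724, 1.036763),
    (3, 0.5976, 1.236661),
    (4, 0.3152, 1.79817),
    (5, 0.1816, 2.093273),
]

# Epoch whose validation loss is smallest.
best_epoch, _, best_val_loss = min(history, key=lambda row: row[2])

# Gap between validation and training loss; a growing gap suggests overfitting.
gaps = {epoch: round(val - train, 4) for epoch, train, val in history}

print("best epoch:", best_epoch, "validation loss:", best_val_loss)
print("validation minus training loss:", gaps)
```

The notebook keeps the weights from the final epoch. As an option it does not use, `TrainingArguments` offers `load_best_model_at_end`, and `transformers` provides an `EarlyStoppingCallback`, either of which could stop at or restore an earlier checkpoint.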

In [None]:
# Define the directory where you want to save your model and tokenizer
model_dir = "./distilbert_finetuned"

# Save the model
model.save_pretrained(model_dir)

# Save the tokenizer
tokenizer.save_pretrained(model_dir)

# Also save the model through the Trainer (writes a second copy to its own directory)
trainer.save_model('Distilbert_CustomModel_10K')

#!zip -r distilbert_finetuned_10k.zip ./distilbert_finetuned


In [None]:
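def make_prediction(text):
  # Tokenize one text; truncation guards against inputs longer than the model limit
  inputs=tokenizer(text, return_tensors="pt", truncation=True)
  inputs = inputs.to(model.device)
  model.eval()  # disable dropout for inference
  with torch.no_grad():  # no gradients needed when predicting
    outputs=model(**inputs)
  predictions=outputs.logits.argmax(-1)
  return predictions.cpu().numpy()

# Note: sample_data contains the 8,000 rows used for training, so accuracy
# measured on it will be higher than on unseen text.
sample_data["finetuned_predicted"]=sample_data["Text"].apply(lambda x: make_prediction(str(x))[0])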
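`make_prediction` returns only the argmax of the logits. If a confidence score is also wanted, the logits can be passed through a softmax first. The sketch below does this in plain Python; `softmax` and `predict_with_confidence` are hypothetical helpers, and the logits are made-up numbers rather than model output.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def predict_with_confidence(logits: List[float]) -> Tuple[int, float]:
    """Return the index of the largest logit and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


example_logits = [0.2, -1.0, 3.1, 0.0, 0.5]  # made-up values for five classes
label, confidence = predict_with_confidence(example_logits)
print(label, round(confidence, 3))
```

In the notebook, `outputs.logits[0].tolist()` could be passed to `predict_with_confidence` to get the same label as the argmax plus a probability.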

In [None]:
from sklearn.metrics import confusion_matrix
# Create the confusion matrix
cm1 = confusion_matrix(sample_data["label"], sample_data["finetuned_predicted"])
print(cm1)
accuracy1=cm1.diagonal().sum()/cm1.sum()
print(accuracy1)

[[2330   65   81   12   34]
 [  79 1097   46    6   27]
 [  57   36 1696   29  126]
 [  21   15   54 2052  110]
 [  43   50  103   70 1761]]
0.8936
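The 0.8936 above is computed over all 10,000 sampled rows, 8,000 of which were used for training. A fairer estimate restricts the comparison to the held-out rows. The following is a minimal sketch of that filtering on toy data; `held_out_accuracy` is a hypothetical helper and the labels, predictions and ids are invented for illustration.

```python
from typing import Iterable, Sequence


def held_out_accuracy(labels: Sequence[int],
                      predictions: Sequence[int],
                      row_ids: Sequence[int],
                      held_out_ids: Iterable[int]) -> float:
    """Accuracy computed only over rows whose id is in `held_out_ids`."""
    held_out = set(held_out_ids)
    pairs = [(y, p) for y, p, rid in zip(labels, predictions, row_ids) if rid in held_out]
    if not pairs:
        raise ValueError("no rows matched the held-out ids")
    return sum(y == p for y, p in pairs) / len(pairs)


# Toy data (invented): six rows, the last three were not seen during training.
labels = [3, 1, 4, 3, 0, 2]
predictions = [3, 1, 4, 0, 0, 1]
row_ids = [10, 11, 12, 13, 14, 15]
test_ids = [13, 14, 15]

overall = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
unseen = held_out_accuracy(labels, predictions, row_ids, test_ids)
print(overall, unseen)
```

In the notebook, the `__index_level_0__` column of `dataset["test"]` holds the original DataFrame index of the held-out rows, so it could serve as `held_out_ids` with `sample_data.index` as `row_ids`.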


### Loading a pre-built model and making predictions

In [None]:
# Download the fine-tuned DistilBERT model
!gdown --id 12rYkcG7AHkZMDIlzJ4P5JkVCJwnXJvaU -O distilbert_finetuned_10k.zip
!unzip -o -j distilbert_finetuned_10k.zip -d distilbert_finetuned_V1

Downloading...
From (original): https://drive.google.com/uc?id=12rYkcG7AHkZMDIlzJ4P5JkVCJwnXJvaU
From (redirected): https://drive.google.com/uc?id=12rYkcG7AHkZMDIlzJ4P5JkVCJwnXJvaU&confirm=t&uuid=d02926db-4486-42c8-9055-61f808c55a65
To: /content/distilbert_finetuned_10k.zip
100% 247M/247M [00:02<00:00, 122MB/s]
Archive:  distilbert_finetuned_10k.zip
  inflating: distilbert_finetuned_V1/special_tokens_map.json  
  inflating: distilbert_finetuned_V1/config.json  
  inflating: distilbert_finetuned_V1/model.safetensors  
  inflating: distilbert_finetuned_V1/tokenizer_config.json  
  inflating: distilbert_finetuned_V1/vocab.txt  


In [None]:
model_v1 = DistilBertForSequenceClassification.from_pretrained('/content/distilbert_finetuned_V1')
model_v1.to("cuda:0")

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin

In [None]:
def make_prediction(text):
  # Same helper as before, but using the downloaded model_v1
  inputs=tokenizer(text, return_tensors="pt", truncation=True)
  inputs = inputs.to(model_v1.device)
  with torch.no_grad():  # no gradients needed when predicting
    outputs=model_v1(**inputs)
  predictions=outputs.logits.argmax(-1)
  return predictions.cpu().numpy()


In [None]:
sample_data_large=emotions_data.sample(n=40000, random_state=55)
sample_data_large["finetuned_predicted"]=sample_data_large["Text"].apply(lambda x: make_prediction(str(x))[0])

In [None]:
from sklearn.metrics import confusion_matrix
# Create the confusion matrix
cm1 = confusion_matrix(sample_data_large["label"], sample_data_large["finetuned_predicted"])
print(cm1)
accuracy1=cm1.diagonal().sum()/cm1.sum()
print(accuracy1)

[[7353  823  904  265  564]
 [ 970 2900  418  205  350]
 [ 822  401 4571  536 1742]
 [ 318  172  610 6386 1759]
 [ 483  298 1257 1152 4741]]
0.648775
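Overall accuracy hides which classes the model struggles with. The sketch below derives per-class recall and precision from the matrix printed above and finds the most frequent confusion. The matrix is re-entered by hand as plain Python lists; rows are true labels and columns are predictions.

```python
# Confusion matrix printed above for the 40,000-row sample.
cm = [
    [7353, 823, 904, 265, 564],
    [970, 2900, 418, 205, 350],
    [822, 401, 4571, 536, 1742],
    [318, 172, 610, 6386, 1759],
    [483, 298, 1257, 1152, 4741],
]
n = len(cm)

row_totals = [sum(row) for row in cm]                            # true count per class
col_totals = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predicted count per class

accuracy = sum(cm[k][k] for k in range(n)) / sum(row_totals)
recall = [cm[k][k] / row_totals[k] for k in range(n)]
precision = [cm[k][k] / col_totals[k] for k in range(n)]

# Largest off-diagonal cell: (true label, predicted label, count).
worst = max(((i, j, cm[i][j]) for i in range(n) for j in range(n) if i != j),
            key=lambda cell: cell[2])

print("accuracy:", accuracy)
print("recall:", [round(r, 3) for r in recall])
print("precision:", [round(p, 3) for p in precision])
print("most frequent confusion:", worst)
```

Going by the data preview earlier in the notebook, label 3 is sadness and label 4 is worry, so the largest confusion is sadness predicted as worry.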
