## **Fine-tuning the DistilBERT model on a custom dataset for text classification into 3 classes: social isolation, social media addiction, and cyberbullying** ##
***

**Description:** In this notebook, we fine-tune the DistilBERT model (distilbert-base-uncased) to classify text into 3 classes: social isolation, cyberbullying, and social media addiction.

**Dataset:** The dataset used for training is cleanedv3.csv.

**Contributor:** N Priyanka

Credits: https://huggingface.co/docs/transformers/en/tasks/sequence_classification

This notebook also draws on **Narayan Singh Adhikari's** notebook, which he used for training the BERT model.

Link to trained model on Hugging Face: https://huggingface.co/PriyankaDS/distilbert-base-uncased-finetuned-mental_social

***

# Installing the libraries required

In [1]:
!pip install transformers[torch] datasets accelerate -U

Collecting transformers[torch]
  Downloading transformers-4.39.0-py3-none-any.whl (8.8 MB)
Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting accelerate
  Downloading accelerate-0.28.0-py3-none-any.whl (290 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194 kB)

In [28]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting responses<0.19 (from evaluate)
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Installing collected packages: responses, evaluate
Successfully installed evaluate-0.4.1 responses-0.18.0


Importing the necessary libraries


In [2]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from datasets import load_dataset

Reading cleanedv3.csv, the final dataset

In [3]:
df = pd.read_csv('/content/drive/MyDrive/Mental_health_prediction/cleanedv3.csv')

In [4]:
df.head()

Unnamed: 0,prompt,chosen,rejected,category,data src,Response Source
0,I feel really alone lately.,"Loneliness can be tough, but there are ways to...","Everyone feels lonely sometimes, just get out ...",social isolation,jaswanthi(reddit data),
1,I keep feeling like nobody understands me.,It's important to have people who understand y...,The internet is full of people to talk to.,social isolation,jaswanthi(reddit data),
2,"I just want someone to talk to, but I don't kn...",That's a brave step to want to connect with so...,"Maybe if you weren't so negative, people would...",social isolation,jaswanthi(reddit data),
3,Feeling really down and alone.,Loneliness and feeling down can go hand in han...,"Just suck it up, everyone feels lonely sometimes.",social isolation,jaswanthi(reddit data),
4,I'm bored and lonely waiting for something to ...,Feeling bored and lonely can be a drag. Have y...,College is a great place to meet new people. Y...,social isolation,jaswanthi(reddit data),


Taking only the columns prompt and category

In [5]:
df = df[['prompt', 'category']].copy()  # copy so the in-place rename below acts on an independent frame

Renaming the column prompt to text

In [6]:
df.rename(columns = {'prompt':'text'},inplace=True)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2365 entries, 0 to 2364
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   text      2365 non-null   object
 1   category  2365 non-null   object
dtypes: object(2)
memory usage: 37.1+ KB


Checking for null values

In [8]:
df.isnull().sum()

text        0
category    0
dtype: int64

Checking the category counts

In [9]:
df['category'].value_counts()

social isolation          829
social media addiction    775
cyberbullying             761
Name: category, dtype: int64

Encoding the category column using LabelEncoder from scikit-learn

In [10]:
le = LabelEncoder()

In [11]:
df['label'] = le.fit_transform(df['category'])

In [12]:
df.tail()

Unnamed: 0,text,category,label
2360,"On social media platforms, I've been targeted ...",cyberbullying,0
2361,"While engaging on social media platforms, I've...",cyberbullying,0
2362,"On social media platforms, I've been the targe...",cyberbullying,0
2363,"While using social media, I've faced relentles...",cyberbullying,0
2364,"On social media platforms, I've been targeted ...",cyberbullying,0


In [13]:
df['label'].value_counts()

1    829
2    775
0    761
Name: label, dtype: int64

In [14]:
le.classes_

array(['cyberbullying', 'social isolation', 'social media addiction'],
      dtype=object)

In [15]:
df_t = df[['text','label']]

In [16]:
df_t

Unnamed: 0,text,label
0,I feel really alone lately.,1
1,I keep feeling like nobody understands me.,1
2,"I just want someone to talk to, but I don't kn...",1
3,Feeling really down and alone.,1
4,I'm bored and lonely waiting for something to ...,1
...,...,...
2360,"On social media platforms, I've been targeted ...",0
2361,"While engaging on social media platforms, I've...",0
2362,"On social media platforms, I've been the targe...",0
2363,"While using social media, I've faced relentles...",0


In [17]:
#shuffling the data
df_t = df_t.sample(frac=1)

In [18]:
# storing it as a csv file
df_t.to_csv('train.csv',index=False)

In [19]:
data = load_dataset("csv", data_files="train.csv")


In [20]:
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 2365
    })
})

Splitting the dataset into train and test datasets.

In [21]:
train_data = data['train'].train_test_split(test_size=0.2)

In [22]:
train_data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 1892
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 473
    })
})

Loading a DistilBERT tokenizer to preprocess the text column

In [23]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")


Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT's maximum input length.

In [24]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

Applying the preprocessing function over the entire dataset using the map function.
Setting batched=True speeds up map by processing multiple elements of the dataset at once.


In [25]:
tokenized_data = train_data.map(preprocess_function, batched=True)


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



In [26]:
tokenized_data

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 1892
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 473
    })
})
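
Note the warning logged during the map step ("Asking to truncate to max_length but no maximum length is provided ... Default to no truncation."). It suggests that truncation=True had no effect in this run, because the tokenizer was loaded without a predefined maximum length. Below is a minimal sketch of one way to enforce the limit explicitly. It assumes DistilBERT's limit of 512 positions; the `MAX_LENGTH` constant and the extra `tokenizer` and `max_length` parameters are additions that are not in the original notebook.

```python
MAX_LENGTH = 512  # assumption: distilbert-base-uncased supports 512 positions


def preprocess_function(examples, tokenizer, max_length=MAX_LENGTH):
    """Tokenize a batch of examples, truncating each text to max_length tokens."""
    # Passing max_length explicitly makes truncation independent of
    # whatever model_max_length the tokenizer config happens to carry.
    return tokenizer(examples["text"], truncation=True, max_length=max_length)
```

With the tokenizer loaded earlier, this could be applied as `train_data.map(preprocess_function, batched=True, fn_kwargs={"tokenizer": tokenizer})`.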

Creating a batch of examples using DataCollatorWithPadding. It is more efficient to dynamically pad the sentences to the longest length in each batch during collation than to pad the whole dataset to the maximum length.

In [27]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
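
To illustrate what dynamic padding means, here is a simplified, pure-Python sketch. It is not the DataCollatorWithPadding implementation; the `pad_batch` helper, the pad id of 0, and the token ids are illustrative assumptions.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length in this batch only."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two batches with different longest lengths (token ids are made up).
short_batch = pad_batch([[101, 7, 102], [101, 7, 8, 9, 102]])
long_batch = pad_batch([[101, 102], [101] + [7] * 10 + [102]])

print(len(short_batch["input_ids"][0]))  # 5
print(len(long_batch["input_ids"][0]))   # 12
```

Each batch is only as wide as its own longest sequence, so batches of short texts waste no computation on padding.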

Including a metric during training for evaluating the model's performance

In [29]:
import evaluate
accuracy = evaluate.load("accuracy")


Creating a function that takes the model's predictions and labels, converts the logits to class ids, and calculates the accuracy and the macro-averaged precision, recall, and F1 score.

In [30]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    accuracy = accuracy_score(labels, predictions)
    precision = precision_score(labels, predictions, average='macro')
    recall = recall_score(labels, predictions, average='macro')
    f1 = f1_score(labels, predictions, average='macro')

    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
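
To make average='macro' concrete, the sketch below computes macro precision by hand on a tiny made-up example. The `macro_precision` helper and the toy labels are illustrative assumptions; the notebook itself relies on scikit-learn for these numbers.

```python
def macro_precision(labels, predictions, num_classes):
    """Average the per-class precision, giving each class equal weight."""
    per_class = []
    for c in range(num_classes):
        predicted_c = sum(1 for p in predictions if p == c)
        correct_c = sum(1 for y, p in zip(labels, predictions) if p == c and y == c)
        # Define precision as 0 when the class was never predicted.
        per_class.append(correct_c / predicted_c if predicted_c else 0.0)
    return sum(per_class) / num_classes


labels      = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]

# Per-class precision is 1/2, 2/3 and 1/1, so the macro average is their mean.
print(macro_precision(labels, predictions, num_classes=3))
```

Because every class counts equally, macro averaging does not let the largest class dominate the score, which suits a dataset whose three classes are of similar but not identical size.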

Creating a map of the expected ids to their labels with id2label and label2id:

In [31]:
num_labels = 3
id2label = {
    "0": "cyberbullying",
    "1": "social isolation",
    "2": "social media addiction",
}
label2id = {
    "cyberbullying": 0,
    "social isolation": 1,
    "social media addiction": 2,

}
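
The ids above must match the ones LabelEncoder assigned earlier, which follow the sorted order of the class names. As a hedged sketch, the two mappings could be derived from the class list rather than typed by hand, so they cannot drift apart. The `categories` list is written out here for self-containment, and integer keys are used for id2label as in the Hugging Face tutorial (the notebook uses string keys).

```python
categories = ["social isolation", "social media addiction", "cyberbullying"]

# LabelEncoder assigns ids in sorted order of the class names.
classes = sorted(set(categories))

id2label = {i: name for i, name in enumerate(classes)}
label2id = {name: i for i, name in enumerate(classes)}

print(id2label)
print(label2id)
```

In the notebook itself, `classes` could be taken from `le.classes_` instead of being rebuilt.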

Loading DistilBERT with AutoModelForSequenceClassification along with the number of expected labels, and the label mappings

In [32]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=num_labels, id2label=id2label, label2id=label2id
)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Defining the training hyperparameters in TrainingArguments

In [33]:
base_model = "distilbert-base-uncased"

batch_size = 16
logging_steps = len(tokenized_data["train"]) // batch_size

model_name = f"{base_model}-finetuned-mental_social"

training_args = TrainingArguments(
    output_dir= model_name,
    learning_rate=2e-5,
    per_device_train_batch_size= batch_size,
    per_device_eval_batch_size= batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    logging_steps=logging_steps,
    log_level="error"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.5871,0.318167,0.894292,0.899026,0.892058,0.894765
2,0.2633,0.25087,0.902748,0.903573,0.905564,0.903253
3,0.1719,0.236013,0.921776,0.92154,0.922478,0.921788


TrainOutput(global_step=357, training_loss=0.33862333571543546, metrics={'train_runtime': 77.4351, 'train_samples_per_second': 73.3, 'train_steps_per_second': 4.61, 'total_flos': 178998729663672.0, 'train_loss': 0.33862333571543546, 'epoch': 3.0})
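
The global_step=357 reported above can be checked with a little arithmetic. This sketch assumes training on a single device without gradient accumulation, which is consistent with the reported numbers but is not stated in the notebook.

```python
import math

train_rows = 1892   # size of the train split
batch_size = 16
epochs = 3

# The last, partially filled batch of each epoch still counts as one step.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs

# The notebook uses floor division for logging_steps.
logging_steps = train_rows // batch_size

print(steps_per_epoch, total_steps, logging_steps)  # 119 357 118
```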

In [34]:
trainer.evaluate()

{'eval_loss': 0.23601293563842773,
 'eval_accuracy': 0.9217758985200846,
 'eval_precision': 0.9215403950887823,
 'eval_recall': 0.9224784339747844,
 'eval_f1': 0.921788088454755,
 'eval_runtime': 1.8106,
 'eval_samples_per_second': 261.235,
 'eval_steps_per_second': 16.569,
 'epoch': 3.0}

Pushing the model to the Hub with the push_to_hub() method.

In [35]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/PriyankaDS/distilbert-base-uncased-finetuned-mental_social/commit/16c2bebb3b2d97e5634720030ecde555a9a9ef4d', commit_message='End of training', commit_description='', oid='16c2bebb3b2d97e5634720030ecde555a9a9ef4d', pr_url=None, pr_revision=None, pr_num=None)

# Inference

The simplest way to try out the fine-tuned model for inference is to use it in a pipeline(). Instantiate a pipeline for text classification with the model, and pass the text to it:

In [36]:
from transformers import pipeline

classifier = pipeline("text-classification", model="PriyankaDS/distilbert-base-uncased-finetuned-mental_social")


In [37]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

In [38]:
classifier(text)

[{'label': 'social isolation', 'score': 0.954781174659729}]
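
The text above is a movie review, unrelated to any of the three classes, yet it receives a confident 'social isolation' label. A likely explanation is that a single-label classifier reports a softmax over its three classes: the scores must sum to 1, so one class always wins, and there is no "none of these" option. A minimal sketch with hypothetical logits (the values are not taken from the model):

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


labels = ["cyberbullying", "social isolation", "social media addiction"]
logits = [-1.2, 2.4, -0.9]  # hypothetical values for illustration only

scores = softmax(logits)
best = max(range(len(scores)), key=scores.__getitem__)
print(labels[best], round(scores[best], 3))
```

To inspect all three scores from the real pipeline rather than only the top one, calling `classifier(text, top_k=None)` should return a score for every label.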

In [39]:
text1 = "I get so much anxiety, and I don’t know why. I feel like I can’t do anything by myself because I’m scared of the outcomes."

In [40]:
classifier(text1)

[{'label': 'social isolation', 'score': 0.9709294438362122}]

In [41]:
classifier("I've noticed that I feel anxious and stressed out after spending time on social media, but I can't seem to break the habit.")

[{'label': 'social media addiction', 'score': 0.9689680337905884}]

In [42]:
classifier("I find myself comparing my body and appearance to others on social media, and it's making me feel self-conscious and unhappy with myself")

[{'label': 'social media addiction', 'score': 0.9716455936431885}]

In [46]:
classifier("I feel like I'm constantly bombarded with images and messages on social media that make me feel inadequate and insecure.")

[{'label': 'social media addiction', 'score': 0.9705734252929688}]

In [47]:
classifier("I find myself constantly comparing my life to what I see on social media, and it's making me feel like I'm not good enough.")

[{'label': 'social media addiction', 'score': 0.9714193344116211}]

In [48]:
classifier("I can't take it anymore. Every time I log onto social media, I'm bombarded with hateful comments and messages from anonymous trolls. They call me names, spread rumors about me, and tell me to kill myself. It's relentless and it's destroying my mental health. I used to love going online to connect with friends and share my thoughts, but now I dread it. I feel anxious and depressed all the time, and I'm starting to believe the awful things they say about me. I don't know how to make it stop. I just want to feel safe again.")

[{'label': 'cyberbullying', 'score': 0.9835460186004639}]

In [49]:
classifier("Whenever I see someone post about their vacation or new purchase on social media, I feel envious and dissatisfied with my own life.")

[{'label': 'social media addiction', 'score': 0.9702302813529968}]

**Inference from model saved locally**

Saving the model locally.

In [None]:
model_path = "finetuned_distilbert_social_media_mental_health"
trainer.save_model(model_path)
tokenizer.save_pretrained(model_path)

Loading the model from the local path

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_path = "finetuned_distilbert_social_media_mental_health"

# Reload the fine-tuned weights and tokenizer saved above
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
nlp = pipeline("text-classification", model=model, tokenizer=tokenizer)

In [None]:
nlp(text)