# Fine-tuning BERT (and friends) for multi-label text classification

In this notebook, we are going to fine-tune BERT to predict one or more labels for a given piece of text. Note that this notebook illustrates how to fine-tune a bert-base-uncased model, but you can also fine-tune a RoBERTa, DeBERTa, DistilBERT, CANINE, ... checkpoint in the same way.

All of those work in the same way: they add a linear layer on top of the base model, which is used to produce a tensor of shape (batch_size, num_labels), indicating the unnormalized scores for a number of labels for every example in the batch.
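
To make the shape concrete, here is a minimal sketch of what such a linear layer computes, written in plain Python rather than PyTorch. The toy sizes, the random values, and the `linear_head` helper are assumptions for illustration only; the real head is created by Transformers and operates on the base model's pooled output.

```python
import random

def linear_head(pooled, weight, bias):
    """Apply a linear layer: one unnormalized score (logit) per label, per example."""
    return [
        [sum(w * x for w, x in zip(row, example)) + b for row, b in zip(weight, bias)]
        for example in pooled
    ]

random.seed(0)
# Toy sizes, assumed for illustration (BERT-base uses hidden_size=768)
batch_size, hidden_size, num_labels = 2, 4, 3

# Stand-in for the base model's pooled output: (batch_size, hidden_size)
pooled = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(batch_size)]
# Classifier parameters: weight is (num_labels, hidden_size), bias is (num_labels,)
weight = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(num_labels)]
bias = [0.0] * num_labels

logits = linear_head(pooled, weight, bias)
print(len(logits), len(logits[0]))  # batch_size, num_labels
```

Each row of `logits` holds one score per label. The scores are independent of each other, which is what allows several labels to be active for the same text.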



## Set-up environment

We'll use pandas together with HuggingFace Transformers and Datasets (Datasets is installed further down). We start by loading the GoEmotions CSV into a pandas DataFrame.

In [2]:
#load dataset
import pandas as pd
df=pd.read_csv("/content/go_emotions_dataset.csv")


In [3]:
df.head()

Unnamed: 0,id,text,example_very_unclear,admiration,amusement,anger,annoyance,approval,caring,confusion,...,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,eew5j0j,That game hurt.,False,0,0,0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,eemcysk,>sexuality shouldn’t be a grouping category I...,True,0,0,0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,ed2mah1,"You do right, if you don't care then fuck 'em!",False,0,0,0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,eeibobj,Man I love reddit.,False,0,0,0,0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,eda6yn6,"[NAME] was nowhere near them, he was by the Fa...",False,0,0,0,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


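As we can see, the CSV is a single table with no predefined splits: each row holds an `id`, the `text`, an `example_very_unclear` flag, and one 0/1 column per emotion. We create the training and evaluation splits ourselves later on.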

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66717 entries, 0 to 66716
Data columns (total 31 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    66717 non-null  object 
 1   text                  66717 non-null  object 
 2   example_very_unclear  66717 non-null  bool   
 3   admiration            66717 non-null  int64  
 4   amusement             66717 non-null  int64  
 5   anger                 66717 non-null  int64  
 6   annoyance             66717 non-null  int64  
 7   approval              66716 non-null  float64
 8   caring                66716 non-null  float64
 9   confusion             66716 non-null  float64
 10  curiosity             66716 non-null  float64
 11  desire                66716 non-null  float64
 12  disappointment        66716 non-null  float64
 13  disapproval           66716 non-null  float64
 14  disgust               66716 non-null  float64
 15  embarrassment      

In [5]:
# Check for missing values
print(df.isnull().sum())



id                      0
text                    0
example_very_unclear    0
admiration              0
amusement               0
anger                   0
annoyance               0
approval                1
caring                  1
confusion               1
curiosity               1
desire                  1
disappointment          1
disapproval             1
disgust                 1
embarrassment           1
excitement              1
fear                    1
gratitude               1
grief                   1
joy                     1
love                    1
nervousness             1
optimism                1
pride                   1
realization             1
relief                  1
remorse                 1
sadness                 1
surprise                1
neutral                 1
dtype: int64


In [6]:
# drop null values
df.dropna(inplace=True)
print(df.isnull().sum())

id                      0
text                    0
example_very_unclear    0
admiration              0
amusement               0
anger                   0
annoyance               0
approval                0
caring                  0
confusion               0
curiosity               0
desire                  0
disappointment          0
disapproval             0
disgust                 0
embarrassment           0
excitement              0
fear                    0
gratitude               0
grief                   0
joy                     0
love                    0
nervousness             0
optimism                0
pride                   0
realization             0
relief                  0
remorse                 0
sadness                 0
surprise                0
neutral                 0
dtype: int64


In [7]:
# Convert the label columns to integers
# (columns that contained a missing value were read as float64)
for col in df.columns[3:]:
  df[col] = df[col].astype(int)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 66716 entries, 0 to 66715
Data columns (total 31 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    66716 non-null  object
 1   text                  66716 non-null  object
 2   example_very_unclear  66716 non-null  bool  
 3   admiration            66716 non-null  int64 
 4   amusement             66716 non-null  int64 
 5   anger                 66716 non-null  int64 
 6   annoyance             66716 non-null  int64 
 7   approval              66716 non-null  int64 
 8   caring                66716 non-null  int64 
 9   confusion             66716 non-null  int64 
 10  curiosity             66716 non-null  int64 
 11  desire                66716 non-null  int64 
 12  disappointment        66716 non-null  int64 
 13  disapproval           66716 non-null  int64 
 14  disgust               66716 non-null  int64 
 15  embarrassment         66716 non-null  int

In [8]:
row=df.iloc[0]
row

Unnamed: 0,0
id,eew5j0j
text,That game hurt.
example_very_unclear,False
admiration,0
amusement,0
anger,0
annoyance,0
approval,0
caring,0
confusion,0


In [9]:
df.head()

Unnamed: 0,id,text,example_very_unclear,admiration,amusement,anger,annoyance,approval,caring,confusion,...,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,eew5j0j,That game hurt.,False,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,eemcysk,>sexuality shouldn’t be a grouping category I...,True,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ed2mah1,"You do right, if you don't care then fuck 'em!",False,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,eeibobj,Man I love reddit.,False,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,eda6yn6,"[NAME] was nowhere near them, he was by the Fa...",False,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [10]:
# Drop the 'example_very_unclear' flag; it is not an emotion label

df.drop('example_very_unclear', axis=1, inplace=True)


In [11]:
# Every column except 'id' and 'text' is an emotion label.
# 'example_very_unclear' was already dropped above; it is excluded here as a safeguard.
labels = [label for label in df.columns if label not in ['id', 'text', 'example_very_unclear']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels


['admiration',
 'amusement',
 'anger',
 'annoyance',
 'approval',
 'caring',
 'confusion',
 'curiosity',
 'desire',
 'disappointment',
 'disapproval',
 'disgust',
 'embarrassment',
 'excitement',
 'fear',
 'gratitude',
 'grief',
 'joy',
 'love',
 'nervousness',
 'optimism',
 'pride',
 'realization',
 'relief',
 'remorse',
 'sadness',
 'surprise',
 'neutral']

In [12]:
df

Unnamed: 0,id,text,admiration,amusement,anger,annoyance,approval,caring,confusion,curiosity,...,love,nervousness,optimism,pride,realization,relief,remorse,sadness,surprise,neutral
0,eew5j0j,That game hurt.,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,eemcysk,>sexuality shouldn’t be a grouping category I...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,ed2mah1,"You do right, if you don't care then fuck 'em!",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,eeibobj,Man I love reddit.,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,eda6yn6,"[NAME] was nowhere near them, he was by the Fa...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66711,eczjsko,Haven't been able to watch tonight and just sa...,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
66712,ef7o72y,I've always wanted one of those but like worry...,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
66713,edfw98v,I wish someone would just run [NAME] or [NAME]...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
66714,eeb79vk,OMG - you're my freakin' hero. And your last p...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [13]:
# Tokenize the text and attach a multi-hot label vector to each example

import numpy as np  # needed by preprocess_data below
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_data(examples):
  # take a batch of texts
  text = examples["text"]
  # encode them
  encoding = tokenizer(text, padding="max_length", truncation=True, max_length=128)
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text), len(labels)))
  # fill numpy array
  for ix, label in enumerate(labels):
    labels_matrix[:, ix] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()

  return encoding

# Apply this function with the `map` method; the tokenizer settings already live inside the function:
# tokenized_dataset = dataset.map(preprocess_data, batched=True)


In [14]:
# Install and import Hugging Face Datasets
!pip install datasets
from datasets import Dataset
import numpy as np  # used inside preprocess_data

# Convert Pandas DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(df)

# Now you can apply the preprocess_data function
tokenized_dataset = dataset.map(preprocess_data, batched=True)






In [15]:
tokenized_dataset


Dataset({
    features: ['id', 'text', 'admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral', '__index_level_0__', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 66716
})

In [16]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           problem_type="multi_label_classification",
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
# Training configuration

from transformers import TrainingArguments, Trainer
batch_size = 8
metric_name = "f1"

args = TrainingArguments(
    "bert-finetuned-sem_eval-english",  # output directory (name carried over from the original SemEval notebook)
    # `eval_strategy` is the newer name for `evaluation_strategy`;
    # older transformers releases only accept the old name.
    eval_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

In [18]:
# Metrics for multi-label evaluation

import numpy as np
import torch
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction


# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    # note: ROC AUC is computed on the thresholded predictions here, not on the probabilities
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions,
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds,
        labels=p.label_ids)
    return result


In [19]:
# sanity check: run a forward pass on one sentence and inspect the shape of the logits
text = "I am so happy, and excited!"
encoding = tokenizer(text, return_tensors="pt")
encoding = {k: v.to(model.device) for k,v in encoding.items()}

outputs = model(**encoding)

outputs.logits.shape

torch.Size([1, 28])

In [20]:
# Split the tokenized dataset into training and evaluation sets
train_eval_dataset = tokenized_dataset.train_test_split(test_size=0.2) # Adjust test_size as needed
train_dataset = train_eval_dataset['train']
eval_dataset = train_eval_dataset['test']

In [21]:
# Train on the full dataset (interrupted below because it took too long; the next cell uses a small sample instead)

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    processing_class=tokenizer, # Use processing_class instead of tokenizer
    compute_metrics=compute_metrics
)

trainer.train()






Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 

In [22]:
# Training on the full dataset takes too long, so retry on a small sample

# Use a smaller subset of the data for faster training and evaluation
df_small = df.sample(n=1000, random_state=42) # Adjust the number of samples as needed

# Convert the smaller Pandas DataFrame to Hugging Face Dataset
dataset_small = Dataset.from_pandas(df_small)

# Now you can apply the preprocess_data function to the smaller dataset
tokenized_dataset_small = dataset_small.map(preprocess_data, batched=True)

# Split the smaller tokenized dataset into training and evaluation sets
train_eval_dataset_small = tokenized_dataset_small.train_test_split(test_size=0.2, seed=42) # Adjust test_size and add seed for reproducibility
train_dataset_small = train_eval_dataset_small['train']
eval_dataset_small = train_eval_dataset_small['test']

# Create a new Trainer instance with the smaller datasets
trainer_small = Trainer(
    model,
    args,
    train_dataset=train_dataset_small,
    eval_dataset=eval_dataset_small,
    processing_class=tokenizer,
    compute_metrics=compute_metrics
)

# Train the model on the smaller dataset
trainer_small.train()


Epoch,Training Loss,Validation Loss,F1,Roc Auc,Accuracy
1,No log,0.206842,0.0,0.5,0.015
2,No log,0.17602,0.0,0.5,0.015
3,No log,0.169612,0.0,0.5,0.015
4,No log,0.167521,0.0,0.5,0.015
5,0.199000,0.167084,0.0,0.5,0.015


TrainOutput(global_step=500, training_loss=0.19895394897460938, metrics={'train_runtime': 6873.1194, 'train_samples_per_second': 0.582, 'train_steps_per_second': 0.073, 'total_flos': 263172476928000.0, 'train_loss': 0.19895394897460938, 'epoch': 5.0})

In [23]:
# evaluate model

trainer_small.evaluate()


{'eval_loss': 0.2068420797586441,
 'eval_f1': 0.0,
 'eval_roc_auc': 0.5,
 'eval_accuracy': 0.015,
 'eval_runtime': 114.3815,
 'eval_samples_per_second': 1.749,
 'eval_steps_per_second': 0.219,
 'epoch': 5.0}
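
One reading consistent with these numbers (not verified in this notebook) is that no sigmoid probability reached the 0.5 threshold, so every prediction was all zeros. In that case micro-F1 is 0, ROC AUC computed on the constant thresholded predictions is 0.5, and subset accuracy equals the share of evaluation rows that have no active label at all. The sketch below checks the F1 and accuracy parts on made-up labels; `micro_f1`, `subset_accuracy`, and the toy data are illustrative helpers, not the scikit-learn functions used above.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (example, label) cells."""
    pairs = [(t, p) for row_t, row_p in zip(y_true, y_pred) for t, p in zip(row_t, row_p)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def subset_accuracy(y_true, y_pred):
    """Fraction of examples whose full label vector is predicted exactly."""
    return sum(row_t == row_p for row_t, row_p in zip(y_true, y_pred)) / len(y_true)

# Made-up ground truth: three rows with one label each, one row with no label
y_true = [[0, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 1]]
# A model that never crosses the threshold predicts all zeros
y_pred = [[0, 0, 0] for _ in y_true]

f1 = micro_f1(y_true, y_pred)
acc = subset_accuracy(y_true, y_pred)
print(f1, acc)
```

Only the row without any label is matched exactly, so the accuracy here is 1 out of 4 while F1 stays at zero. Lowering the threshold or training longer on more data are the usual things to try when this pattern shows up.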

# Detailed Report

## Dataset Preprocessing Steps

The dataset used is the GoEmotions dataset, which is in CSV format. The preprocessing steps involved:

1.  **Loading the data:** The dataset was loaded into a pandas DataFrame.
2.  **Handling Missing Values:** Missing values were checked using `df.isnull().sum()` and dropped using `df.dropna(inplace=True)`.
3.  **Data Type Conversion:** Columns representing emotion labels (which were initially loaded as floats) were converted to integers using `df[col] = df[col].astype(int)`.
4.  **Dropping Irrelevant Columns:** The `example_very_unclear` column was dropped as it was not needed for the classification task.
5.  **Defining Labels:** The emotion label columns were identified and separated from the text and id columns. Dictionaries for mapping label names to IDs (`label2id`) and IDs to label names (`id2label`) were created.
6.  **Tokenization and Encoding:** The `bert-base-uncased` tokenizer was used to tokenize the text data. A `preprocess_data` function was defined to:
    *   Take a batch of text and tokenize it using `tokenizer`.
    *   Pad and truncate the sequences to a maximum length of 128.
    *   Create a numpy array representing the labels for each example, converting the multi-label format into a binary matrix where 1 indicates the presence of a label and 0 indicates its absence.
    *   Add this label matrix to the encoding dictionary.
7.  **Converting to Hugging Face Dataset:** The pandas DataFrame was converted into a Hugging Face `Dataset` object.
8.  **Applying Preprocessing:** The `preprocess_data` function was applied to the dataset using the `map` method to generate the tokenized and encoded dataset with labels.
9.  **Splitting the Dataset:** The tokenized dataset was split into training and evaluation sets using `train_test_split` (initially attempted on the full dataset, then switched to a smaller sample due to training time).
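
The label-matrix part of step 6 can be sketched without NumPy or the tokenizer. The `build_label_matrix` helper and the three-label subset are assumptions for illustration; the two example rows and their labels are taken from the `df.head()` output above. With `batched=True`, `map` passes each column as a list, which is the layout the sketch assumes.

```python
def build_label_matrix(batch, label_names):
    """Turn per-label columns into one multi-hot float vector per example."""
    num_rows = len(batch["text"])
    return [[float(batch[name][i]) for name in label_names] for i in range(num_rows)]

# Subset of the label columns, chosen for illustration
label_names = ["sadness", "love", "neutral"]

# A batch in column-oriented form: one list per column
batch = {
    "text": ["That game hurt.", "Man I love reddit."],
    "sadness": [1, 0],
    "love": [0, 1],
    "neutral": [0, 0],
}

matrix = build_label_matrix(batch, label_names)
print(matrix)  # shape: (batch_size, num_labels)
```

The values are stored as floats because the multi-label loss used by the model compares the logits against float targets.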

## Model Selection and Rationale

**Model:** `bert-base-uncased` with a sequence classification head configured for multi-label classification.

**Rationale:**

*   **BERT:** BERT (Bidirectional Encoder Representations from Transformers) is a powerful pre-trained language model that has demonstrated strong performance on various downstream NLP tasks, including text classification. Its pre-training on a large corpus allows it to capture rich contextual information.
*   **`bert-base-uncased`:** This is a standard, widely used BERT checkpoint and a reasonable starting point for text classification because it balances size and performance. 'Uncased' means the text is lowercased before tokenization, so the model does not distinguish between uppercase and lowercase letters. This is generally suitable for this task unless casing turns out to matter for emotion detection.
*   **Sequence Classification Head for Multi-label:** The `AutoModelForSequenceClassification` class from the Hugging Face transformers library is designed for classification tasks on sequences. By setting `problem_type="multi_label_classification"` and `num_labels` to the number of emotion categories, the model is configured with an output layer (a linear layer) that produces scores for each label independently, which is necessary for multi-label problems where a single piece of text can have multiple emotions. The sigmoid activation is typically applied to the output logits to obtain probabilities for each label.
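
As a small sketch of that last step, the snippet below turns one row of logits into label names by applying a sigmoid to each score and keeping the labels at or above a threshold. The logit values and the `decode` helper are made up for illustration; the three label names are the first three from the `labels` list above, and 0.5 is the threshold used in `multi_label_metrics`.

```python
import math

def sigmoid(x):
    """Map an unnormalized score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def decode(logits, id2label, threshold=0.5):
    """Return the names of all labels whose probability reaches the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [id2label[i] for i, p in enumerate(probs) if p >= threshold]

id2label = {0: "admiration", 1: "amusement", 2: "anger"}
example_logits = [2.0, -1.5, 0.3]  # made-up scores for one text

predicted = decode(example_logits, id2label)
print(predicted)
```

Because each label is thresholded separately, the result can contain zero, one, or several labels, unlike a softmax followed by argmax, which always picks exactly one.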

## Challenges Faced and Solutions

**Challenge:** Training time was excessively long on the full dataset.

**Solution:** Training on the full dataset in the Colaboratory environment was too slow, so the model was trained on a random sample of 1000 rows selected with `df.sample(n=1000, random_state=42)`. This shortened training enough to complete five epochs and exercise the whole fine-tuning workflow. Training on a larger dataset would likely yield better performance: on this sample the model reached a micro-F1 of 0.0 on the evaluation split, so the run demonstrates the workflow rather than a usable classifier.
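
A back-of-the-envelope estimate shows the scale of the problem. The sketch assumes a single device, no gradient accumulation, an 80/20 split, and that the throughput reported for the small run (`train_steps_per_second` of 0.073, which includes evaluation time) would carry over to the full dataset. Treat the result as a rough extrapolation, not a measurement.

```python
import math

def total_steps(num_train_examples, batch_size, epochs):
    """Optimizer steps for a run: batches per epoch times number of epochs."""
    return math.ceil(num_train_examples / batch_size) * epochs

batch_size, epochs = 8, 5
steps_per_second = 0.073  # reported by the small run above

# Small run: 1000 sampled rows, 80% used for training
small_steps = total_steps(800, batch_size, epochs)

# Full run: 66716 rows, roughly 80% used for training
full_train_rows = int(66716 * 0.8)
full_steps = total_steps(full_train_rows, batch_size, epochs)

estimated_hours = full_steps / steps_per_second / 3600
print(small_steps, full_steps, round(estimated_hours))
```

The step count for the small run matches the `global_step=500` reported by the trainer. Under the same assumptions the full run needs tens of thousands of steps, which is why a GPU runtime or fewer epochs would be needed to make it practical.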

