# Fine-Tuning Transformer Models for Classification of Digital Behavioural Data
## Classifying political stance

By Indira Sen

This notebook demonstrates how to fine-tune transformer models like BERT for classification. We show two options for using transformer models in Python:
- Simple Transformers
- HuggingFace `transformers`

We will fine-tune transformer models like BERT, DistilBERT, and RoBERTa for a task common in Computational Social Science: political stance detection.

Finally, we will also label data with zero-shot classification, using a pretrained Natural Language Inference (NLI) model.


<br><br>

## **Import necessary Python libraries and modules**

First, we will import necessary Python libraries and modules. These include scikit-learn (`sklearn`) and PyTorch (`torch`), for various machine learning tools.

In [1]:
# Basic Python modules
from collections import defaultdict

import os
from daacs.infrastructure.bootstrap import Bootstrap
b = Bootstrap() 

# For data manipulation and analysis
import pandas as pd
import numpy as np

# For deep learning
# https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html
import torch

We first download the datasets we need for finetuning our models. This is a **supervised** classification task, therefore, we will need labeled data. We download the following datasets:

1. PStance [https://github.com/chuchun8/PStance]

In [2]:
data = pd.read_csv(f'{b.DATA_DIR}/pol/raw_train_biden.csv')
data.head()

Unnamed: 0,Tweet,Target,Stance
0,Joe Biden is looking to gather votes from unsu...,Joe Biden,AGAINST
1,Check out the latest podcast conversation betw...,Joe Biden,FAVOR
2,Thank you Secretary Clinton for your endorseme...,Joe Biden,FAVOR
3,Happening now: @JoeBiden kicking off #Hispanic...,Joe Biden,FAVOR
4,Thank you Mayor @KeishaBottoms for opening our...,Joe Biden,FAVOR


We first use the [`simpletransformers`](https://simpletransformers.ai/) package, which is more beginner-friendly.

In [3]:
! pip3 install simpletransformers



The basic steps for fine-tuning a classifier using simpletransformers are:
- Initialize a model based on a specific architecture (BERT, DistilBERT, etc.)
- Train the model with train_model()
- Evaluate the model with eval_model()
- Make predictions on (unlabelled) data with predict()

In [3]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import logging

In [4]:
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

We need to preprocess the data first before we start the finetuning process. In this step, we split the dataset into **train** and **test** sets to have a fully held-out test set that can be used to evaluate our classifier.

We can also create a **validation** set that is used during fine-tuning for hyperparameter tuning, but that is not mandatory.
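As a rough sketch of how a validation set could be carved out, one option is to call `train_test_split` twice: once to hold out the test set, and once more to split the remainder into train and validation. The toy data, the split sizes, and the `random_state` below are assumptions for illustration, not values used in this notebook.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the tweets and their stance labels (hypothetical data)
toy_texts = [f"tweet {i}" for i in range(100)]
toy_labels = ["FAVOR" if i % 2 == 0 else "AGAINST" for i in range(100)]

# Step 1: hold out 20% of the data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    toy_texts, toy_labels, stratify=toy_labels, test_size=0.2, random_state=42
)

# Step 2: hold out 12.5% of the remainder (10% of the full data) for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, stratify=y_rest, test_size=0.125, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 70 10 20
```

Stratifying both splits keeps the share of each stance roughly equal across the three sets.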

In [5]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(data, stratify=data['Stance'], test_size=0.2)

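We now convert the dataframes into a format that can be read by simpletransformers. This is a dataframe with the columns 'text' and 'labels'. (If these column names are missing, simpletransformers falls back to treating the first column as the text and the second as the label, which is why we later keep just 'Tweet' and 'labels', in that order.) The 'labels' column should be numerical, so we use **label encoding** (one integer per class) to transform our string stance labels into numerical ones.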

In [6]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train_df['Stance'])
train_df['labels'] = le.transform(train_df['Stance'])
test_df['labels'] = le.transform(test_df['Stance'])
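To make the encoding step concrete, here is a minimal pure-Python sketch of what `LabelEncoder` does: it sorts the distinct classes and assigns each an integer index. A one-hot version is shown next to it only for contrast. The toy `stances` list is an assumption for illustration.

```python
# Toy stance labels (same classes as in the notebook)
stances = ["FAVOR", "AGAINST", "AGAINST", "FAVOR"]

# LabelEncoder sorts the distinct classes and numbers them from 0
classes = sorted(set(stances))
class_to_id = {name: idx for idx, name in enumerate(classes)}

# Integer (label) encoding: one number per example
label_ids = [class_to_id[s] for s in stances]

# One-hot encoding, for comparison: one 0/1 vector per example
one_hot = [
    [1 if idx == class_to_id[s] else 0 for idx in range(len(classes))]
    for s in stances
]

print(class_to_id)  # {'AGAINST': 0, 'FAVOR': 1}
print(label_ids)    # [1, 0, 0, 1]
print(one_hot)      # [[0, 1], [1, 0], [1, 0], [0, 1]]
```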

In [7]:
# to see which number was mapped to which class:
list(le.inverse_transform([0,1]))

['AGAINST', 'FAVOR']

So, 0 is 'against' and 1 is 'favor'. We now have the appropriate data structure. The next step is setting the training parameters and loading the classification model, in this case, DistilBERT, a lightweight model that can be trained relatively quickly compared to other transformer variants like BERT and RoBERTa.

For training parameters, we have many to choose from such as the learning rate, whether we want to stop early or not, where we should save the model, and more. You can find all of them here: https://simpletransformers.ai/docs/usage/

As a minimal setup, we will just set the number of **epochs**, i.e., the number of passes the model does over the full training set. For recent transformer models, epochs are usually set to 2 or 3, after which overfitting may happen.
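As a small worked example, the number of epochs translates into optimizer steps as follows. The batch size of 8 is an assumption here; it is consistent with the 581 iterations per epoch shown in the training progress bars of this notebook.

```python
import math

n_train = 4644   # number of training tweets (len(train_df) in this notebook)
batch_size = 8   # assumed batch size
num_epochs = 3

# The last, partially filled batch still counts as one step
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch)  # 581
print(total_steps)      # 1743
```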

In [8]:

# Optional model configuration
model_args = ClassificationArgs(num_train_epochs=3, output_dir='output_st', overwrite_output_dir=True, )

# Create a ClassificationModel
model = ClassificationModel(
    "distilbert", "distilbert-base-uncased", args=model_args, use_cuda=False
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We are now finally ready to begin training! This might take a while, especially when we're not using a GPU.

In [9]:
train_df = train_df[['Tweet', 'labels']]
test_df = test_df[['Tweet', 'labels']]

In [10]:
len(train_df)

4644

In [None]:
# Train the model
model.train_model(train_df)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/9 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 1 of 3:   0%|          | 0/581 [00:00<?, ?it/s]

Running Epoch 2 of 3:   0%|          | 0/581 [00:00<?, ?it/s]

Running Epoch 3 of 3:   0%|          | 0/581 [00:00<?, ?it/s]

After training our model, we can use it to make predictions for unlabeled datapoints to classify the stance of the tweet towards the predefined target.

In [12]:
anti_biden_tweet = "Ugh, this was true yesterday and it's also true now: Biden is an idiot"
predictions, raw_outputs = model.predict([anti_biden_tweet])
le.inverse_transform(predictions)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


0it [00:00, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  0%|          | 0/1 [00:00<?, ?it/s]

array(['AGAINST'], dtype=object)

We can also use the held-out test set to quantitatively evaluate our model.

In [13]:
len(test_df) / 8

145.25

In [14]:
# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(test_df)
result

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/2 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_distilbert_128_2_2


Running Evaluation:   0%|          | 0/12 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model:{'mcc': 0.5972925230028748, 'tp': 389, 'tn': 543, 'fp': 108, 'fn': 122, 'auroc': 0.8716531243518176, 'auprc': 0.8605532048346441, 'eval_loss': 0.7476535215973854}


{'mcc': 0.5972925230028748,
 'tp': 389,
 'tn': 543,
 'fp': 108,
 'fn': 122,
 'auroc': 0.8716531243518176,
 'auprc': 0.8605532048346441,
 'eval_loss': 0.7476535215973854}
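The confusion-matrix counts in this result are enough to derive the more familiar metrics by hand. The sketch below recomputes accuracy, precision, recall, F1, and the Matthews correlation coefficient (MCC) from the reported `tp`, `tn`, `fp`, and `fn`, treating class 1 (FAVOR) as the positive class.

```python
import math

# Confusion-matrix counts reported by eval_model() above
tp, tn, fp, fn = 389, 543, 108, 122

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of tweets predicted FAVOR, how many are FAVOR
recall = tp / (tp + fn)      # of FAVOR tweets, how many were found
f1 = 2 * precision * recall / (precision + recall)

# Matthews correlation coefficient
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} mcc={mcc:.3f}")
# accuracy=0.802 precision=0.783 recall=0.761 f1=0.772 mcc=0.597
```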

In [15]:
# you can also use sklearn's neat classification report to get more metrics
from sklearn.metrics import classification_report

In [16]:
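# predict() returns the predicted class ids and the raw model outputs (logits);
# despite its name, `probs` does not hold probabilities
preds, probs = model.predict(list(test_df['Tweet'].values))
# preds = le.inverse_transform(preds)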

print(classification_report(test_df['labels'], preds))

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/2 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


  0%|          | 0/12 [00:00<?, ?it/s]

              precision    recall  f1-score   support

           0       0.82      0.83      0.83       651
           1       0.78      0.76      0.77       511

    accuracy                           0.80      1162
   macro avg       0.80      0.80      0.80      1162
weighted avg       0.80      0.80      0.80      1162



In [17]:
probs

array([[-2.2826643 ,  3.03902411],
       [ 2.00688696, -2.58880019],
       [-2.23345685,  2.95897388],
       ...,
       [ 1.60807133, -2.18697786],
       [ 1.38311386, -1.7749964 ],
       [ 1.31926453, -1.7898016 ]])
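The values above are raw model outputs (logits), not probabilities, which is why some of them are negative. One way to turn them into class probabilities is a softmax over each row, sketched here with NumPy on the first two rows of the output.

```python
import numpy as np

# First two rows of the raw model outputs shown above
logits = np.array([[-2.2826643, 3.03902411],
                   [2.00688696, -2.58880019]])

# Numerically stable softmax: subtract the row maximum before exponentiating
shifted = logits - logits.max(axis=1, keepdims=True)
exp = np.exp(shifted)
probabilities = exp / exp.sum(axis=1, keepdims=True)

print(probabilities.round(3))
print(probabilities.argmax(axis=1))  # [1 0]
```

Each row now sums to 1, and the most probable class is the same one the raw outputs favour.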

We now repeat the same process with the HuggingFace [`transformers` Python library](https://huggingface.co/transformers/installation.html).

In [19]:
# !pip3 install transformers

In [20]:
! pip install -U accelerate
# ! pip install -U transformers



We will again use DistilBERT.

In [21]:
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments

We first set a few configuration values.

In [22]:
model_name = 'distilbert-base-uncased'
device_name = 'cuda' if torch.cuda.is_available() else 'cpu'  # fall back to the CPU when no GPU is available

# This is the maximum number of tokens in any document; the rest will be truncated.
max_length = 512

# This is the name of the directory where we'll save our model. You can name it whatever you want.
cached_model_directory_name = 'output_hf'

We will reuse the train-test splits we created for simpletransformers, but change the data structure slightly.

In [23]:
train_texts = train_df['Tweet']#.values
train_labels = train_df['labels']#.values

test_texts = test_df['Tweet']#.values
test_labels = test_df['labels']#.values

Compared to simpletransformers, HuggingFace gives us a closer look at what happens 'under the hood'. In particular, we see more of how the text is transformed: each tweet is truncated if it is longer than 512 tokens, or padded if it is shorter.

The text is split into "word pieces" using the transformers tokenizers (`DistilBertTokenizerFast` in this case, to match the DistilBERT model). Some special tokens are also added, such as **CLS** (the start token of every tweet) and **SEP** (which marks the end of each input sequence and separates the two segments when a pair of texts is encoded):
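To illustrate the truncation, padding, and special tokens described above without loading a tokenizer, here is a toy sketch. The ids 101, 102, and 0 for `[CLS]`, `[SEP]`, and `[PAD]` are assumed to match the BERT-style uncased vocabulary; the word-piece ids and the helper `encode` are hypothetical, and the real tokenizer does all of this for you.

```python
# Special-token ids, assumed to match the BERT-style uncased vocabulary
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def encode(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, then pad with [PAD]."""
    # Leave room for the two special tokens when truncating
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)   # 1 = real token
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad           # 0 = padding, ignored by the model
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical word-piece ids for a short and a long tweet
short_tweet = encode([7592, 2088], max_length=8)
long_tweet = encode(list(range(1000, 1020)), max_length=8)

print(short_tweet["input_ids"])       # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(short_tweet["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0]
print(long_tweet["input_ids"])        # [101, 1000, 1001, 1002, 1003, 1004, 1005, 102]
```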

In [24]:
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)

We now encode our texts using the tokenizer.

In [25]:
from datasets import Dataset

train_df = Dataset.from_pandas(train_df)
test_df = Dataset.from_pandas(test_df)

def tokenize_function(examples):
  # Pad or truncate every tweet to exactly max_length tokens
  return tokenizer(examples["Tweet"], padding="max_length", truncation=True, max_length=max_length)


tokenized_train_df = train_df.map(tokenize_function, batched=True)
tokenized_test_df = test_df.map(tokenize_function, batched=True)

Map:   0%|          | 0/4644 [00:00<?, ? examples/s]

Map:   0%|          | 0/1162 [00:00<?, ? examples/s]

We now load the DistilBERT model and move it to the device chosen above (`device_name`).

In [26]:
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=len(le.classes_)).to(device_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


As we did with simpletransformers, we now set the training parameters, i.e., the number of epochs.

In [27]:
training_args = TrainingArguments(
    num_train_epochs=3,              # total number of training epochs
    output_dir='./results',          # output directory
    report_to='none'
)

<br><br>

## **Fine-tune the DistilBERT model**

First, we define a custom evaluation function that returns the accuracy. You could modify this function to return precision, recall, F1, and/or other metrics.

In [28]:
from sklearn.metrics import accuracy_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
  }
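As one possible extension of the function above, the sketch below also returns macro-averaged precision, recall, and F1 using scikit-learn's `precision_recall_fscore_support`. The name `compute_metrics_extended`, the choice of macro averaging, and the toy logits are assumptions for illustration; the notebook itself only uses accuracy.

```python
from types import SimpleNamespace

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics_extended(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # Macro averaging weights both classes equally, regardless of their size
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="macro", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy check with hypothetical logits for four examples
toy_pred = SimpleNamespace(
    predictions=np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]]),
    label_ids=np.array([0, 1, 1, 1]),
)
toy_metrics = compute_metrics_extended(toy_pred)
print(toy_metrics)
```

Such a function could be passed to the `Trainer` through its `compute_metrics` argument in place of the accuracy-only version.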

Then we create a HuggingFace `Trainer` object using the `TrainingArguments` object that we created above. We also pass our `compute_metrics` function and the tokenized training dataset to the `Trainer`; the test set is only used later, for prediction.

In [29]:
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=tokenized_train_df,         # training dataset
    compute_metrics=compute_metrics      # our custom evaluation function
)

In [30]:
tokenized_train_df

Dataset({
    features: ['Tweet', 'labels', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 4644
})

Time to finally fine-tune!

In [31]:
trainer.train()

Step,Training Loss
500,0.5042
1000,0.3422
1500,0.1926


TrainOutput(global_step=1743, training_loss=0.3222933652684118, metrics={'train_runtime': 694.1214, 'train_samples_per_second': 20.071, 'train_steps_per_second': 2.511, 'total_flos': 1845535798075392.0, 'train_loss': 0.3222933652684118, 'epoch': 3.0})

<br><br>

## **Save fine-tuned model**

The following cell will save the model and its configuration files to a directory in Colab. To preserve this model for future use, you should download the model to your computer.

In [32]:
trainer.save_model(cached_model_directory_name)

(Optional) If you've already fine-tuned and saved the model, you can reload it using the following lines. You don't have to run fine-tuning every time you want to evaluate.

In [33]:
# model = DistilBertForSequenceClassification.from_pretrained(cached_model_directory_name).to(device_name)
# trainer = Trainer(model=model, args=training_args, compute_metrics=compute_metrics)

We can now evaluate the model by predicting the labels for the test set.

In [34]:
predicted_results = trainer.predict(tokenized_test_df)

In [35]:
predicted_labels = predicted_results.predictions.argmax(-1) # Get the highest probability prediction
predicted_labels = predicted_labels.flatten().tolist()      # Flatten the predictions into a 1D list
predicted_labels[0:5]

[0, 0, 0, 0, 1]

In [36]:
print(classification_report(tokenized_test_df['labels'],
                            predicted_labels))

              precision    recall  f1-score   support

           0       0.81      0.84      0.82       651
           1       0.79      0.75      0.77       511

    accuracy                           0.80      1162
   macro avg       0.80      0.79      0.80      1162
weighted avg       0.80      0.80      0.80      1162



## 2. Zero-shot NLI-based Labeling

In [37]:
### zero-shot NLI classification
from transformers import pipeline

classifier = pipeline("zero-shot-classification",model='facebook/bart-large-mnli')

config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [38]:
classifier('i love going to school', ['positive', 'negative', 'neutral', 'none']) # for sentiment

{'sequence': 'i love going to school',
 'labels': ['positive', 'neutral', 'negative', 'none'],
 'scores': [0.9397299289703369,
  0.034338414669036865,
  0.015356624498963356,
  0.010575151070952415]}

In [39]:
classifier([anti_biden_tweet], ['pro Biden', 'anti Biden', 'none']) # for stance

[{'sequence': "Ugh, this was true yesterday and it's also true now: Biden is an idiot",
  'labels': ['anti Biden', 'pro Biden', 'none'],
  'scores': [0.9814677834510803, 0.011077933944761753, 0.00745432311668992]}]
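Under the hood, the zero-shot pipeline turns each candidate label into an NLI hypothesis and asks the model whether the text entails it. The sketch below illustrates the idea with the pipeline's default hypothesis template and made-up entailment logits; the logits are hypothetical and no model is called.

```python
import math

premise = "Ugh, this was true yesterday and it's also true now: Biden is an idiot"
candidate_labels = ["pro Biden", "anti Biden", "none"]
template = "This example is {}."  # default hypothesis template of the pipeline

# Each label becomes one (premise, hypothesis) pair for the NLI model
pairs = [(premise, template.format(label)) for label in candidate_labels]

# Hypothetical entailment logits, one per pair (a real model would produce these)
entailment_logits = [-1.2, 3.3, -1.6]

# Softmax over the entailment logits gives one score per label
max_logit = max(entailment_logits)
exps = [math.exp(x - max_logit) for x in entailment_logits]
scores = [e / sum(exps) for e in exps]

ranked = sorted(zip(candidate_labels, scores), key=lambda p: p[1], reverse=True)
print(pairs[1][1])   # This example is anti Biden.
print(ranked[0][0])  # anti Biden
```

Because the labels are plain text, their wording matters: "anti Biden" and "against Joe Biden" can produce different scores for the same tweet.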

In [40]:
test_df = test_df.to_pandas()  # convert the HuggingFace Dataset back to a pandas dataframe

In [None]:
test_df_labels = []

from tqdm import tqdm # to help you keep track of how many instances have been labeled
test_df_sample = test_df.head(200).copy() # we take a subset because this can take some time to run on the full data; .copy() lets us add a column later without a pandas warning
for _, row in tqdm(test_df_sample.iterrows(), total=test_df_sample.shape[0]):
  test_df_labels.append(classifier(row['Tweet'], ['pro Biden', 'anti Biden', 'none']))

 92%|█████████▎| 185/200 [07:49<00:34,  2.30s/it]

In [None]:
# get the labels
labels = [i['labels'][0] for i in test_df_labels]
labels[:5], len(labels)

In [None]:
# let's inspect the raw classifier outputs
test_df_labels[:5], len(test_df_labels)

In [None]:
# let's see how well this matches the manually labeled data
# recall that 0 is against Biden and 1 is in favor
# note: every label other than 'anti Biden' (including 'none') is mapped to 1 here
test_df_sample['NLI_labels'] = [0 if i == 'anti Biden' else 1 for i in labels]
test_df_sample.head()

In [None]:
print(classification_report(test_df_sample['labels'],
                            test_df_sample['NLI_labels']))
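Note that the mapping used above counts every label other than 'anti Biden', including 'none', as FAVOR. A possible alternative, sketched below with made-up labels, maps the labels explicitly and leaves 'none' out of the comparison. The `label_to_id` dictionary and the toy lists are assumptions for illustration.

```python
# Hypothetical zero-shot outputs and gold labels (0 = AGAINST, 1 = FAVOR)
nli_labels = ["anti Biden", "pro Biden", "none", "anti Biden", "pro Biden"]
gold = [0, 1, 1, 0, 0]

# Explicit mapping; 'none' has no counterpart in the binary gold labels
label_to_id = {"anti Biden": 0, "pro Biden": 1}

# Keep only the examples where the zero-shot model committed to a stance
kept = [(g, label_to_id[l]) for g, l in zip(gold, nli_labels) if l in label_to_id]
pred_kept = [p for _, p in kept]

accuracy = sum(g == p for g, p in kept) / len(kept)
coverage = len(kept) / len(gold)   # share of examples that received a stance

print(pred_kept)           # [0, 1, 0, 1]
print(accuracy, coverage)  # 0.75 0.8
```

Reporting coverage next to accuracy makes it visible how many tweets were skipped.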

## Optional: 3. Multiclass classification

First, let's stack the PStance datasets for the different targets into one dataset. The target of each tweet will be the label we predict.

In [None]:
df_trump = pd.read_csv('raw_train_trump.csv')
df_biden = pd.read_csv('raw_train_biden.csv')
df_sanders = pd.read_csv('raw_train_bernie.csv')
all_data = pd.concat([df_trump,df_biden,df_sanders],ignore_index=True)
all_data.head()

In [None]:
all_data.groupby('Target').size()

Repeat the same steps as earlier to split this into train and test sets, but this time stratify on 'Target', not 'Stance'.

In [None]:
multi_train_df, multi_test_df = train_test_split(all_data, stratify=all_data['Target'], test_size=0.2)

Let's encode the labels again.

In [None]:
le = LabelEncoder()
le.fit(multi_train_df['Target'])  # fit on the multiclass training split, which still has the 'Target' column
multi_train_df['labels'] = le.transform(multi_train_df['Target'])
multi_test_df['labels'] = le.transform(multi_test_df['Target'])

In [None]:
# to see which number was mapped to which class:
list(le.inverse_transform([0,1,2]))

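The rest is almost exactly the same. The one difference is that we now have three classes, so we pass `num_labels=3` when creating the model (simpletransformers assumes two labels by default).

In [None]:

# Optional model configuration
model_args = ClassificationArgs(num_train_epochs=3, output_dir='output_st', overwrite_output_dir=True)

# Create a ClassificationModel with one output per target
model = ClassificationModel(
    "distilbert", "distilbert-base-uncased", num_labels=3, args=model_args
)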

In [None]:
multi_train_df = multi_train_df[['Tweet', 'labels']]
multi_test_df = multi_test_df[['Tweet', 'labels']]

In [None]:
len(multi_train_df)

In [None]:
# Train the model
model.train_model(multi_train_df)

After training our model, we can use it on unlabeled datapoints to predict which target a tweet is about.

In [None]:
predictions, raw_outputs = model.predict([anti_biden_tweet])
le.inverse_transform(predictions)

We can also use the held-out test set to quantitatively evaluate our model.

In [None]:
# Evaluate the model
result, model_outputs, wrong_predictions = model.eval_model(multi_test_df)
result

In [None]:
preds, probs = model.predict(list(multi_test_df['Tweet'].values))
# preds = le.inverse_transform(preds)

print(classification_report(multi_test_df['labels'], preds))