# BERT Tutorial: Using BertForSequenceClassification
### _The "B-2" of Deep Learning Natural Language Processing Models_

### **<ins>Version:</ins>**
@March 2023 / Quek Jing Hao


### **<ins>Objective:</ins>**

Be familiar with using the model BertForSequenceClassification.

### **<ins>Introduction:</ins>**

As we have seen in the other tutorials involving BertModel and machine learning classifiers, the workflow usually involves two steps: using a pretrained BERT model to extract the sentence vectors (_"last hidden state"_), and then sending the resulting dataframe of feature vectors to a classifier.

In this tutorial, we introduce a powerful model inside the transformers library called BertForSequenceClassification. This model consists of a BERT model and a classifier, integrated into one. Furthermore, the model uses the PyTorch API, which may or may not be an advantage depending on your exposure to TensorFlow or PyTorch.
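To make the "BERT plus classifier in one model" idea concrete, here is a shape-only sketch in NumPy. It assumes the classifier part is a single linear layer applied to one 768-dimensional sentence vector per review, which is my understanding of how BertForSequenceClassification is built. The random numbers merely stand in for real BERT outputs, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_SIZE = 768   # hidden size of bert-base-uncased
NUM_LABELS = 2      # negative / positive
BATCH_SIZE = 4      # illustrative batch of four reviews

# Stand-in for the sentence vectors that the BERT encoder would produce.
pooled_output = rng.normal(size=(BATCH_SIZE, HIDDEN_SIZE))

# Stand-in for the classification head: one linear layer, hidden size -> labels.
head_weight = rng.normal(scale=0.02, size=(HIDDEN_SIZE, NUM_LABELS))
head_bias = np.zeros(NUM_LABELS)

# One logit per class for every review in the batch.
logits = pooled_output @ head_weight + head_bias
predicted_class = logits.argmax(axis=1)

print(logits.shape)           # (4, 2)
print(predicted_class.shape)  # (4,)
```

During fine-tuning, both the encoder weights and this small head are updated together, which is what distinguishes the integrated model from the two-step workflow.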

In this tutorial, we will train a deep learning BERT model on an Amazon phone reviews dataset.

At the same time, because this is a proper deep learning model, one should be familiar - or at least be comfortable - with the PyTorch package. I recommend taking a look at https://pytorch.org/tutorials/beginner/basics/intro.html to get a very good high-level view of the library.

Colab does not ship with several of the libraries we need, so we have to install them first:

In [1]:
%%capture
!pip install sentence_transformers
!pip install datasets
!pip install evaluate
!pip install fastBPE sacremoses subword_nmt sentencepiece
!pip install --upgrade gensim
!pip install accelerate -U

In [2]:
# import modules and dependencies
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as mpp
import re
import nltk
import time
import sys

from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from torch.optim import AdamW

from transformers.optimization import Adafactor, AdafactorSchedule
from datasets.dataset_dict import DatasetDict, Dataset

from wordcloud import WordCloud
from matplotlib.pyplot import figure

from sklearn.model_selection import train_test_split
from transformers import EarlyStoppingCallback
from transformers import AutoTokenizer, DataCollatorWithPadding, DataCollatorForLanguageModeling
from transformers import BertModel, BertTokenizer, BertForSequenceClassification
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)

### Model Specification and Environment Configuration

First, we want to ensure that we have GPU computing available.

In [3]:
detected_device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Fall back to 'CPU' when no GPU is available
device_name = torch.cuda.get_device_name() if torch.cuda.is_available() else 'CPU'

print(detected_device, f'\nName of device: {device_name}')

cuda 
Name of device: Tesla T4


Next, we specify the BERT model that we are interested in using, as well as the tokenizer.

In [4]:
NO_CLASSES = 2

In [5]:
model_name = 'bert-base-uncased'

bert_tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels = NO_CLASSES)

# push the model to GPU
model.to(detected_device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

### Read Dataset

Here, we will read the Amazon reviews dataset.

This sample dataset is hosted on my GitHub account.

In [6]:
!git clone https://github.com/QuekJingHao/amazon-reviews-sample-dataset.git

Cloning into 'amazon-reviews-sample-dataset'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 6 (delta 0), reused 3 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), 5.70 MiB | 8.94 MiB/s, done.


In [7]:
df = pd.read_csv('/content/amazon-reviews-sample-dataset/amazon_sample.csv')

print(f'Length of dataframe: {len(df)}')
df.head(2)

Length of dataframe: 49661


| Unnamed: 0.1 | Unnamed: 0 | Product Name | Brand Name | Price | Rating | Reviews | Review Votes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 198019 | HTC Radar C110E Unlocked GSM Phone - White/Silver | HTC | 74.99 | 2 | Had to return this phone as it was password locked by previous user and I couldnot do much about it.Also there were couple of scratches on the display. Returned it. | 0.0 |
| 1 | 106928 | BLU Advance 4.0L Unlocked Smartphone -US GSM - White | BLU | 47.97 | 5 | Good | 0.0 |


We split this dataframe into two: one part, which we call train_df, will go into the train-test split, while the other, test_df, will be used to evaluate how well the model does on unseen data.

In [21]:
df = df.sample(frac = 1.0, random_state = 90)

train_df = df.iloc[:48000]
test_df = df.iloc[48000:]

### Data Aggregation

As you can see, the dataframe has several columns that we do not need, so let us perform some data manipulation to get it into a workable form. We pick out only the relevant columns, in this case the Reviews and Rating columns. We also need to decide the sentiment of each review. Let's set the rule that a rating below 3 is Negative and a rating above 3 is Positive, and represent Positive as 1 and Negative as 0. We remove reviews with a rating of 3, as these are neutral.

In [9]:
subdf = train_df.copy()[['Reviews', 'Rating']]

subdf = subdf[subdf['Rating'] != 3].reset_index(drop = True)

subdf.loc[subdf['Rating'] < 3, 'Rating'] = 0
subdf.loc[subdf['Rating'] > 3, 'Rating'] = 1

subdf = subdf[subdf['Reviews'].notna()].reset_index(drop = True)

# Lastly, we take a sample of the subdf
subdf = subdf.sample(n = 20000).reset_index(drop = True)

In [10]:
subdf['Rating'].value_counts()

1    14898
0     5102
Name: Rating, dtype: int64

### Data Cleaning

Typically, text data is not always clean, so one has to decide how to clean it to remove any unwanted characters. The good news is that for deep learning language models, one does not need to perform very heavy cleaning on the text data; the presence of 'dirty' text usually does not impact the performance of the model significantly.

Nonetheless, let us perform some basic cleaning. We will:

1. Lowercase all words in the reviews and remove all leading and trailing whitespace

In [11]:
def text_processing(text):
    return text.lower().strip()

# Apply function
subdf['Clean_Reviews'] = subdf['Reviews'].apply(text_processing)

subdf.head(2)

| Unnamed: 0 | Reviews | Rating | Clean_Reviews |
| --- | --- | --- | --- |
| 0 | The good news is that it was an unlocked phone. And I was able to go to any carrier to start it up.The bad news is that it does not have a good battery life. The phone loses its charge very quickly! | 0 | the good news is that it was an unlocked phone. and i was able to go to any carrier to start it up.the bad news is that it does not have a good battery life. the phone loses its charge very quickly! |
| 1 | This phone worked perfectly for 1 month and then had no service. After going to T-Mobile for help, they discovered this phone was never paid for by the original owner and shut off. This is fraud!! | 0 | this phone worked perfectly for 1 month and then had no service. after going to t-mobile for help, they discovered this phone was never paid for by the original owner and shut off. this is fraud!! |


### Data Processing for BERT

To use BERT properly, the model expects the input to have a certain format. Our goal is to turn the two columns into the Dataset data structure from the Hugging Face datasets library. Once this is done, the Trainer will be able to read the data and fine-tune the BERT model for classification.

#### Step 1: Transform the two Pandas Series into Lists

This step is easy. Once we have these two lists, we will send them into our classic train-test-split function to break them up.

In [12]:
clean_reviews = subdf['Clean_Reviews'].to_list()
target = subdf['Rating'].to_list()

X_train, X_test, y_train, y_test = train_test_split(clean_reviews,
                                                    target,
                                                    random_state = 98)

#### Step 2: Use BERT Tokenizer to tokenize the Text Data

As with other NLP models, we need to convert the text into numbers by tokenizing it.

In [13]:
%%time
X_train_tokenized = bert_tokenizer(X_train, padding = True, truncation = True, max_length = 512)
X_test_tokenized = bert_tokenizer(X_test, padding = True, truncation = True, max_length = 512)

CPU times: user 17.7 s, sys: 52.3 ms, total: 17.8 s
Wall time: 19.2 s


But the interesting thing is that this tokenizer function returns a dictionary with 3 key-value pairs! Let's look at X_train_tokenized.

In [14]:
X_train_tokenized.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

The most important of these three is input_ids: this is the data that will be fed into the BERT model in the training loop. The attention_mask marks which positions hold real tokens (1) and which are padding (0).
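To see what these three keys contain, here is a toy illustration in plain Python. It is only a sketch: the vocabulary, the token ids and the `toy_encode` helper are made up for this example, and it splits on whitespace rather than using BERT's real WordPiece tokenizer.

```python
# Toy illustration only: a made-up vocabulary and whitespace splitting,
# not BERT's real WordPiece tokenizer.
TOY_VOCAB = {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3,
             'great': 4, 'phone': 5, 'battery': 6, 'died': 7, 'fast': 8}


def toy_encode(texts):
    """Encode texts into padded input_ids, token_type_ids and attention_mask."""
    encoded = []
    for text in texts:
        tokens = ['[CLS]'] + text.lower().split() + ['[SEP]']
        encoded.append([TOY_VOCAB.get(tok, TOY_VOCAB['[UNK]']) for tok in tokens])

    longest = max(len(ids) for ids in encoded)
    batch = {'input_ids': [], 'token_type_ids': [], 'attention_mask': []}
    for ids in encoded:
        n_pad = longest - len(ids)
        batch['input_ids'].append(ids + [TOY_VOCAB['[PAD]']] * n_pad)
        batch['token_type_ids'].append([0] * longest)                  # single-sentence input
        batch['attention_mask'].append([1] * len(ids) + [0] * n_pad)   # 1 = real token
    return batch


batch = toy_encode(['Great phone', 'Battery died fast'])
print(batch['input_ids'])       # [[2, 4, 5, 3, 0], [2, 6, 7, 8, 3]]
print(batch['attention_mask'])  # [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]
```

The shorter review is padded to the length of the longer one, and its attention mask ends in 0 so that the padding position can be ignored.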

#### Step 3: Convert Series into Dataset Dict Structure

This is the last step. We now transfer the input_ids and the labels into a Dataset. Please note that the keys of the dictionary _**must**_ be label (or labels) and input_ids, since these are the names the Trainer and the model look for.

In [15]:
train_dict = {'label' : y_train,
              'input_ids' : X_train_tokenized['input_ids']}

test_dict = {'label' : y_test,
             'input_ids' : X_test_tokenized['input_ids']}

# Transfer these dictionaries into Dataset Dict
train_dataset = Dataset.from_dict(train_dict)
test_dataset = Dataset.from_dict(test_dict)

print(f'Training dataset dict: {train_dataset}\nTesting dataset dict: {test_dataset}')

Training dataset dict: Dataset({
    features: ['label', 'input_ids'],
    num_rows: 15000
})
Testing dataset dict: Dataset({
    features: ['label', 'input_ids'],
    num_rows: 5000
})


### Model Parameters Configurations to Fine-Tune Pretrained Model

Now for the fun part! We first need to specify the hyperparameters for fine-tuning. The following block serves as a kind of 'remote control' for the behavior of the whole notebook and the training loop. Most of these parameters are self-explanatory, although the list is not exhaustive.

In [16]:
Number_of_Epochs = 1
Evaluation_Steps = 100
Train_Batch_Size = 8
Test_Batch_Size = 8
weight_decay = 1e-4
lr = 1e-5
warmup_steps = 100
Seed = 45

#### Define Trainer Parameters and arguments

We need to declare a function that computes the evaluation metrics, which the Trainer will report at each evaluation step.

In [17]:
def compute_metrics(p):

    logits, labels = p
    pred = np.argmax(logits, axis = 1)

    auc_score = roc_auc_score(y_true = labels, y_score = pred)
    recall    = recall_score(y_true = labels, y_pred = pred)
    precision = precision_score(y_true = labels, y_pred = pred)
    f1        = f1_score(y_true = labels, y_pred = pred)
    accuracy  = accuracy_score(y_true = labels, y_pred = pred)
    #confusion = confusion_matrix(y_true=labels, y_pred=pred) # cannot JSON confusion matrix

    return {'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'f1': f1}


# Define Trainer
args = TrainingArguments(
    output_dir = '/content/',
    overwrite_output_dir = True,
    evaluation_strategy = 'steps',
    save_strategy = 'steps',
    eval_steps = Evaluation_Steps,
    per_device_train_batch_size = Train_Batch_Size,
    per_device_eval_batch_size = Test_Batch_Size,
    num_train_epochs = Number_of_Epochs,
    weight_decay = weight_decay,
    seed = Seed,
    save_steps = Evaluation_Steps,
    learning_rate = lr,
    load_best_model_at_end = True,
    gradient_accumulation_steps = 8,
    #warmup_steps = warmup_steps
)
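The compute_metrics function receives raw logits and turns them into hard 0/1 predictions with argmax. The sketch below shows that step on made-up logits, and also how a softmax would give a continuous positive-class score, which is one option if a threshold-independent AUC is wanted. The numbers and variable names are illustrative assumptions, not output from the model.

```python
import numpy as np

# Made-up logits for three reviews: columns are [negative, positive].
logits = np.array([[ 2.0, -1.0],
                   [-0.5,  1.5],
                   [ 0.2,  0.1]])

# Hard class predictions, as done in compute_metrics.
pred = np.argmax(logits, axis=1)

# Softmax turns each row of logits into probabilities that sum to 1.
shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
positive_prob = probs[:, 1]   # continuous score, e.g. for roc_auc_score(y_score=...)

print(pred)   # [0 1 0]
print(positive_prob.round(3))
```

Note that the third review is predicted as negative even though the two logits are nearly equal; the probability score keeps that uncertainty, while the hard label discards it.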

#### Define Trainer Class

Here is where you put all of the ingredients into one class: your datasets, training arguments and parameters.

In [18]:
data_collator = DataCollatorWithPadding(tokenizer = bert_tokenizer)

trainer = Trainer(
    model = model,
    args = args,
    train_dataset = train_dataset,
    eval_dataset = test_dataset,
    compute_metrics = compute_metrics,
    callbacks = [EarlyStoppingCallback(early_stopping_patience = 3)],
    data_collator = data_collator,
)
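The EarlyStoppingCallback with early_stopping_patience = 3 stops training once the monitored metric has failed to improve for three evaluations in a row. The plain-Python sketch below illustrates that idea under the assumption that the monitored metric is the validation loss (lower is better); `early_stop_index` is a hypothetical helper for illustration, not part of the transformers library, and the losses are made up.

```python
def early_stop_index(eval_losses, patience=3):
    """Return the index of the evaluation at which training would stop, or None.

    Simplified sketch: an evaluation counts as an improvement only if its loss
    is strictly lower than the best loss seen so far.
    """
    best_loss = float('inf')
    evals_without_improvement = 0
    for index, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
        if evals_without_improvement >= patience:
            return index
    return None


# Made-up validation losses, one per evaluation step.
print(early_stop_index([0.40, 0.30, 0.31, 0.32, 0.33, 0.20]))  # 4
print(early_stop_index([0.23, 0.18]))                           # None
```

A run with fewer than three evaluations, such as the second call, can never reach a patience of 3, so early stopping only matters for longer training runs.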

#### Train the Model!

In [19]:
trainer.train()



| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| 100 | No log | 0.231898 | 0.924 | 0.948649 | 0.948649 | 0.948649 |
| 200 | No log | 0.184068 | 0.9388 | 0.956428 | 0.961081 | 0.958749 |


TrainOutput(global_step=234, training_loss=0.2930770287146935, metrics={'train_runtime': 1698.2733, 'train_samples_per_second': 8.833, 'train_steps_per_second': 0.138, 'total_flos': 3940351165071360.0, 'train_loss': 0.2930770287146935, 'epoch': 1.0})
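The global_step of 234 follows from the hyperparameters. The arithmetic below is a sketch that assumes a single GPU and that one optimizer update happens after every gradient_accumulation_steps batches; the variable names are illustrative.

```python
import math

train_rows = 15000             # num_rows of train_dataset
per_device_batch_size = 8      # Train_Batch_Size
gradient_accumulation = 8      # gradient_accumulation_steps
num_epochs = 1                 # Number_of_Epochs

# Batches the dataloader yields per epoch (single device assumed).
batches_per_epoch = math.ceil(train_rows / per_device_batch_size)

# One optimizer update happens after every `gradient_accumulation` batches.
updates_per_epoch = max(batches_per_epoch // gradient_accumulation, 1)
total_updates = updates_per_epoch * num_epochs
effective_batch_size = per_device_batch_size * gradient_accumulation

print(batches_per_epoch)      # 1875
print(total_updates)          # 234
print(effective_batch_size)   # 64
```

With only 234 update steps and an evaluation every 100 steps, the model is evaluated twice, at steps 100 and 200. The "No log" entries for the training loss are most likely because the Trainer's default logging interval (500 steps, to my knowledge) is longer than the whole run.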

### Model Inference

Now that we have trained the model, we can use the held-out dataframe that the model has not seen before and see how the model performs.

We have to perform the same cleaning steps on the test_df with the cleaning function defined earlier

In [32]:
eval_df = test_df[['Reviews', 'Rating']].copy()

# Drop missing reviews first so that text_processing only receives strings
eval_df = eval_df[eval_df['Reviews'].notna()].reset_index(drop = True)

eval_df['Clean_Reviews'] = eval_df['Reviews'].apply(text_processing)

eval_df = eval_df[eval_df['Rating'] != 3].reset_index(drop = True)

eval_df.loc[eval_df['Rating'] < 3, 'Rating'] = 0
eval_df.loc[eval_df['Rating'] > 3, 'Rating'] = 1

print(len(eval_df))
eval_df.head(2)

1542


| Unnamed: 0 | Reviews | Rating | Clean_Reviews |
| --- | --- | --- | --- |
| 0 | Does not work as an unlocked smartphone with any US Carriers. BEWARE!!! | 0 | does not work as an unlocked smartphone with any us carriers. beware!!! |
| 1 | Just read and work immediately with the new nano chip | 1 | just read and work immediately with the new nano chip |


We also need to put the text into the same Dataset structure as before.

In [33]:
reviews_eval = eval_df['Clean_Reviews'].to_list()

X_eval_tokenized = bert_tokenizer(reviews_eval, padding = True, truncation = True, max_length = 512)

X_eval_dataset = Dataset.from_dict({'input_ids' : X_eval_tokenized['input_ids']})

X_eval_dataset

Dataset({
    features: ['input_ids'],
    num_rows: 1542
})

Next, we load the trained BERT model. You should edit the model_path accordingly.

In [35]:
model_path = '/content/checkpoint-200'
model = BertForSequenceClassification.from_pretrained(model_path, num_labels = NO_CLASSES)

test_trainer = Trainer(model)

Make predictions!

In [36]:
%%time
raw_pred, _, _ = test_trainer.predict(X_eval_dataset)

CPU times: user 44.2 s, sys: 151 ms, total: 44.4 s
Wall time: 46.1 s


In [37]:
eval_df['Predicted_Rating'] = np.argmax(raw_pred, axis = 1)
eval_df.head(3)

| Unnamed: 0 | Reviews | Rating | Clean_Reviews | Predicted_Rating |
| --- | --- | --- | --- | --- |
| 0 | Does not work as an unlocked smartphone with any US Carriers. BEWARE!!! | 0 | does not work as an unlocked smartphone with any us carriers. beware!!! | 0 |
| 1 | Just read and work immediately with the new nano chip | 1 | just read and work immediately with the new nano chip | 1 |
| 2 | Great device | 1 | great device | 1 |


Similarly, we can compute all of the evaluation metrics here.

In [39]:
wrong_ratings = eval_df[eval_df['Rating'] != eval_df['Predicted_Rating']]
no_wrong_ratings = len(wrong_ratings)

accuracy = 1 - no_wrong_ratings / len(eval_df)

print(f'Final accuracy of model by predicting unseen reviews: {accuracy}')

target_true = eval_df['Rating']
target_predict = eval_df['Predicted_Rating']

auc_score = roc_auc_score(y_true = target_true, y_score = target_predict)
recall    = recall_score(y_true = target_true, y_pred = target_predict)
precision = precision_score(y_true = target_true, y_pred = target_predict)
f1        = f1_score(y_true = target_true, y_pred = target_predict)
accuracy  = accuracy_score(y_true = target_true, y_pred = target_predict)
confusion = confusion_matrix(y_true = target_true, y_pred = target_predict)

print('Confusion matrix:\n', confusion, '\n')

print(f'AUC Score : {auc_score}\n Recall : {recall}\n Precision : {precision}\n F1 Score : {f1}\n ')

Final accuracy of model by predicting unseen reviews: 0.9455252918287937
Confusion matrix:
 [[ 338   46]
 [  38 1120]] 

AUC Score : 0.923696567357513
 Recall : 0.9671848013816926
 Precision : 0.9605488850771869
 F1 Score : 0.963855421686747
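As a cross-check, the reported metrics can be recomputed by hand from the confusion matrix printed above. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
# Confusion matrix printed above, in scikit-learn's layout:
# rows are true labels, columns are predicted labels.
tn, fp = 338, 46
fn, tp = 38, 1120

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)          # true positive rate
precision = tp / (tp + fp)
specificity = tn / (tn + fp)     # true negative rate
f1 = 2 * precision * recall / (precision + recall)

# With hard 0/1 predictions as the score, ROC AUC reduces to the
# mean of the true positive rate and the true negative rate.
auc_from_hard_labels = (recall + specificity) / 2

print(round(accuracy, 4))               # 0.9455
print(round(recall, 4))                 # 0.9672
print(round(precision, 4))              # 0.9605
print(round(f1, 4))                     # 0.9639
print(round(auc_from_hard_labels, 4))   # 0.9237
```

Because the AUC here is computed from hard predictions rather than probabilities, it is effectively the balanced accuracy; it is lower than the plain accuracy because the model does worse on the smaller negative class (338 of 384 correct).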
 


### Conclusion

It is _this_ simple! We have used the