Polarity detection with BERT

In [1]:
! pip install transformers==3.0.2
! pip install nltk
! pip install scikit-learn
! pip install torchmetrics


In [2]:
import torch
import transformers
print(torch.__version__)
print(transformers.__version__)

1.12.1+cu113
3.0.2


In [3]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive/')

Mounted at /content/drive/


In [4]:
# Importing the libraries needed
import pandas as pd
import numpy as np
import torch
import transformers
import json
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from transformers import BertModel, BertTokenizer
import logging
logging.basicConfig(level=logging.ERROR)
import time
from torchmetrics.functional import precision_recall, f1_score

In [5]:
# Setting up the device for GPU usage

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

### Loading and preparing the data

#### Dataset
- The dataset containing subjective reviews obtained from the `subjectivity_detection.ipynb` notebook will be used as the training dataset.
- Due to class imbalance (with more positive reviews than negative reviews), we will balance the training dataset with random sampling.
- The test dataset is used to evaluate the performance of the fine-tuned BERT model.
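The balancing step can be sketched in plain Python as follows. This is a minimal illustration that assumes the labels are integers in a list. The helper name `undersample_majority` is hypothetical and not part of this notebook, which does the equivalent with pandas. The class counts are the ones reported by `value_counts()` on the training data.

```python
import random
from collections import Counter


def undersample_majority(labels, seed=42):
    """Return sorted indices of a class-balanced subset (hypothetical helper).

    Every class is randomly reduced to the size of the rarest class.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Size of the smallest class: every class is cut down to this many rows.
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    return sorted(kept)


# Class counts of the training data in this notebook: 9593 positive, 3617 negative.
labels = [1] * 9593 + [0] * 3617
kept = undersample_majority(labels)
balanced_counts = Counter(labels[i] for i in kept)
print(balanced_counts[1], balanced_counts[0], len(kept))
```

Undersampling discards positive reviews rather than duplicating negative ones, so the balanced set is smaller but contains no repeated rows.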

#### SentimentData Dataset Class
- This class accepts a DataFrame as input and generates the tokenized output that the BERT model consumes during training.
- The BERT tokenizer tokenizes the `concat_review` column of the dataframe, which joins `reviewTitle` and `reviewDescription`.
- The tokenizer's `encode_plus` method performs the tokenization and returns the necessary outputs: `ids`, `attention_mask`, and `token_type_ids`.
- `targets` is the encoded category: 1 for positive, 0 for negative.
- The `SentimentData` class is used to create two datasets, one for training and one for validation.
- The training dataset is used to fine-tune the model on the balanced set of subjective reviews from the subjectivity detection task.
- The validation dataset is used to evaluate the model on data it has not seen during training.
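To make the tokenizer outputs concrete, here is a toy sketch of how the ids and the attention mask relate once a sequence is truncated or padded to a fixed length. It assumes a hypothetical whitespace tokenizer (`toy_encode`) and a made-up four-entry vocabulary. Real BERT uses WordPiece with its own vocabulary, so only the shapes and the masking rule carry over. The special-token ids are the ones used by `bert-base-uncased`.

```python
# Special-token ids used by bert-base-uncased.
PAD_ID, UNK_ID, CLS_ID, SEP_ID = 0, 100, 101, 102
# Made-up vocabulary, for illustration only.
TOY_VOCAB = {"good": 1000, "reading": 1001, "value": 1002, ".": 1003}


def toy_encode(text, max_len):
    """Hypothetical stand-in for encode_plus on a single sentence."""
    tokens = text.lower().replace(".", " . ").split()
    body = [TOY_VOCAB.get(tok, UNK_ID) for tok in tokens]
    body = body[: max_len - 2]  # truncate, leaving room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    padding = max_len - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * padding,
        # 1 marks real tokens, 0 marks padding the model should ignore.
        "attention_mask": [1] * len(ids) + [0] * padding,
        # A single-sentence input belongs entirely to segment 0.
        "token_type_ids": [0] * max_len,
    }


encoded = toy_encode("Good reading. Good value", max_len=10)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Every encoded example has exactly `max_len` entries, which is what allows the dataloader to stack examples into rectangular batch tensors.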

#### Dataloader
- A `DataLoader` is created for each dataset (training and validation) to feed data to the neural network in batches. This is needed because the whole dataset cannot be loaded into memory at once, so the amount of data loaded and passed to the network at each step has to be controlled.
- This control is achieved with parameters such as `batch_size` (examples per step) and `max_len` (tokens per example, applied by the tokenizer).
- The training and validation dataloaders are used in the training and validation parts of the flow, respectively.
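As a quick worked check of the batching arithmetic: the number of iterations per pass is the dataset size divided by the batch size, rounded up, because the last batch may be smaller. The sketch below uses the dataset sizes and batch sizes from this notebook and assumes incomplete batches are kept, which is the `DataLoader` default. `num_batches` and `last_batch_size` are hypothetical helpers, not PyTorch functions.

```python
import math


def num_batches(n_examples, batch_size):
    """Iterations per pass when the final, smaller batch is kept."""
    return math.ceil(n_examples / batch_size)


def last_batch_size(n_examples, batch_size):
    """Size of the final batch (equals batch_size when it divides evenly)."""
    remainder = n_examples % batch_size
    return remainder if remainder else batch_size


# 7234 balanced training reviews in batches of 8.
train_steps = num_batches(7234, 8)
# 2227 test reviews in batches of 4.
test_steps = num_batches(2227, 4)
print(train_steps, last_batch_size(7234, 8))
print(test_steps, last_batch_size(2227, 4))
```

These step counts agree with the `905it` and `557it` progress counters printed during training and validation.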

In [6]:
path_to_folder = "/content/drive/My Drive/data/cz4045/"
train = pd.read_csv(path_to_folder+'subjective_reviews.csv')
test_raw1 = pd.read_csv(path_to_folder + 'test_df_Bryson.csv')
test_raw2 = pd.read_csv(path_to_folder + 'test_df_Gx.csv')
test_raw3 = pd.read_csv(path_to_folder + 'test_df_Kelvin.csv')
df_list = [test_raw1, test_raw2, test_raw3]
test = pd.concat(df_list, ignore_index=True)

In [7]:
train['polarity'].value_counts()

1    9593
0    3617
Name: polarity, dtype: int64

In [8]:
# Drop the excess positive reviews randomly
# Credits to GX's notebook
differences = train["polarity"].value_counts()[1]-train["polarity"].value_counts()[0]
train_balanced = train.drop(train[train["polarity"] == 1].sample(differences,random_state=42).index)
train_balanced

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,productAsin,ratingScore,reviewTitle,reviewReaction,reviewDescription,isVerified,category,languages,concat_review,polarity,cleaned_text,tb_subjectivity,tb_polarity,pos_tags,senti_score,swn_subjectivity
1,3,5021,399226907,5,Add this book to your collection,,Cute and educational book to teach counting an...,True,children,Language.ENGLISH,Add this book to your collection. Cute and edu...,1,Add this book to your collection. Cute and edu...,0.580000,0.355000,"[('Add', 'VB'), ('this', 'DT'), ('book', 'NN')...",1.000,1
2,4,21354,125030170X,2,Just okay.,,This is one of those books you can read in a c...,False,children,Language.ENGLISH,Just okay.. This is one of those books you can...,0,Just okay.. This is one of those books you can...,0.500000,0.500000,"[('Just', 'RB'), ('okay', 'RB'), ('..', 'VB'),...",0.500,1
3,5,23286,63215381,1,The paperback’s quality sucks,1,I hate this paperback. Terrible quality! The p...,True,children,Language.ENGLISH,The paperback’s quality sucks. I hate this p...,0,The paperback’s quality sucks. I hate this p...,0.697778,-0.490000,"[('The', 'DT'), ('paperback’s', 'NN'), ('qua...",-1.375,1
6,12,4101,B096MWJLNW,4,Good job will,,I loved his honesty. It eas an informative read.,True,humor_entertainment,Language.ENGLISH,Good job will. I loved his honesty. It eas an...,1,Good job will. I loved his honesty. It eas an...,0.700000,0.700000,"[('Good', 'JJ'), ('job', 'NN'), ('will', 'MD')...",1.250,1
7,13,1094,B01IW9TM5O,5,"Nice, easy read",,"Nice story, good ending, good to tead",True,humor_entertainment,Language.ENGLISH,"Nice, easy read. Nice story, good ending, good...",1,"Nice, easy read. Nice story, good ending, good...",0.806667,0.606667,"[('Nice', 'NNP'), (',', ','), ('easy', 'JJ'), ...",2.125,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13196,20942,5589,670062510,5,Good reading,,Good value,True,children,Language.ENGLISH,Good reading. Good value,1,Good reading. Good value,0.600000,0.700000,"[('Good', 'JJ'), ('reading', 'NN'), ('.', '.')...",1.500,1
13197,20943,2542,1982185821,5,"Witty, genuine, and overall well-written",,I loved the book. It breaks my heart to read s...,True,humor_entertainment,Language.ENGLISH,"Witty, genuine, and overall well-written. I lo...",1,"Witty, genuine, and overall well-written. I lo...",0.640909,0.238636,"[('Witty', 'NNP'), (',', ','), ('genuine', 'NN...",0.875,1
13203,20949,14990,B07GX3BR7P,5,It's a Great Read,,I was curious from beginning to the end. It h...,True,mystery,Language.ENGLISH,It's a Great Read. I was curious from beginnin...,1,It's a Great Read. I was curious from beginnin...,0.783333,0.466667,"[('It', 'PRP'), (""'s"", 'VBZ'), ('a', 'DT'), ('...",0.750,1
13204,20950,22935,1501161938,1,"So many great reviews, I just wanted it to end",19,I feel hoodwinked on this book. So many great ...,True,children,Language.ENGLISH,"So many great reviews, I just wanted it to end...",0,"So many great reviews, I just wanted it to end...",0.700000,0.420000,"[('So', 'RB'), ('many', 'JJ'), ('great', 'JJ')...",-0.375,1


In [9]:
train_balanced['polarity'].value_counts()

1    3617
0    3617
Name: polarity, dtype: int64

In [10]:
train_new = train_balanced[['concat_review', 'polarity']]
train_new = train_new.reset_index().drop(columns=['index'])
train_new

Unnamed: 0,concat_review,polarity
0,Add this book to your collection. Cute and edu...,1
1,Just okay.. This is one of those books you can...,0
2,The paperback’s quality sucks. I hate this p...,0
3,Good job will. I loved his honesty. It eas an...,1
4,"Nice, easy read. Nice story, good ending, good...",1
...,...,...
7229,Good reading. Good value,1
7230,"Witty, genuine, and overall well-written. I lo...",1
7231,It's a Great Read. I was curious from beginnin...,1
7232,"So many great reviews, I just wanted it to end...",0


In [11]:
test = test.drop(columns=['Unnamed: 0', 'Unnamed: 0.1', 'Unnamed: 0.1.1'])
test['concat_review'] = test['reviewTitle'] + '. ' + test['reviewDescription']
test['polarity'] = test['Annotator_1']
# Polarity detection is binary, so we drop all neutral reviews and re-encode the labels:
# positive (1) -> 1
# negative (-1) -> 0
test = test.loc[test.polarity != 0].copy()  # copy so the assignment below does not act on a view
test.loc[test['polarity'] == -1, 'polarity'] = 0
print(len(test))
test.head()

2227


Unnamed: 0,productAsin,ratingScore,reviewTitle,reviewReaction,reviewDescription,isVerified,category,languages,Annotator_1,Annotator_2,concat_review,polarity
0,1982137452,1,The content is all messed up,,I started this book this week for my book club...,True,children,Language.ENGLISH,-1,-1,The content is all messed up. I started this b...,0
1,125030170X,1,Duplicate copy.Damaged book.,,Pages missing.,True,children,Language.ENGLISH,-1,-1,Duplicate copy.Damaged book.. Pages missing.,0
2,63215381,1,Awful,,I gave up after 38% of my Kindle. Yes we were ...,True,children,Language.ENGLISH,-1,-1,Awful. I gave up after 38% of my Kindle. Yes w...,0
3,60935464,1,Syrupy Overload,3.0,The book is an example of leading the witness.,True,children,Language.ENGLISH,-1,-1,Syrupy Overload. The book is an example of lea...,0
4,1501161938,1,Couldn’t read it; type too small!,1.0,"Beware, the type is TINY, I mean TINY. I am 60...",True,children,Language.ENGLISH,-1,-1,"Couldn’t read it; type too small!. Beware, t...",0


In [12]:
test.polarity.value_counts()

1    1579
0     648
Name: polarity, dtype: int64

In [13]:
test_new = test[['concat_review', 'polarity']]
test_new = test_new.reset_index().drop(columns=['index'])
test_new

Unnamed: 0,concat_review,polarity
0,The content is all messed up. I started this b...,0
1,Duplicate copy.Damaged book.. Pages missing.,0
2,Awful. I gave up after 38% of my Kindle. Yes w...,0
3,Syrupy Overload. The book is an example of lea...,0
4,"Couldn’t read it; type too small!. Beware, t...",0
...,...,...
2222,No. Just awful!,0
2223,Bored. I was so bored reading this book. I swi...,0
2224,"Ugh!. Ugh! Too wordy, predictable and shallow....",0
2225,A story that made me cry. I have so many fond ...,1


In [14]:
# Defining some key variables that will be used later on in the training
MAX_LEN = 256
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
# EPOCHS = 1
LEARNING_RATE = 1e-05
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', truncation=True, do_lower_case=True)


In [15]:
class SentimentData(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.concat_review
        self.targets = self.data.polarity
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

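        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,        # cut reviews longer than max_len
            padding='max_length',   # pad shorter reviews (replaces the deprecated pad_to_max_length)
            return_token_type_ids=True
        )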
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]


        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }

In [16]:
training_set = SentimentData(train_new, tokenizer, MAX_LEN)
testing_set = SentimentData(test_new, tokenizer, MAX_LEN)

In [17]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': False,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

### Creating the Neural Network for Fine Tuning

#### Neural Network

- The neural network is defined by the `BertClass` module, which consists of the BERT language model followed by a `Linear` pre-classifier layer with ReLU activation, a `dropout` layer, and a final `Linear` layer that produces the two class outputs.
- The data is fed to the BERT language model in the form produced by the dataset class.
- The final layer's outputs are compared to `polarity` to compute evaluation metrics such as accuracy.
- We instantiate the network as `model`. This instance is used for training and is then saved for future inference.

#### Loss Function and Optimizer
- The loss function measures the difference between the output produced by the model and the actual label. We use cross-entropy loss.
- The optimizer updates the weights of the neural network to reduce that loss. We use Adam.
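For a two-class problem, the cross-entropy loss of a single example is the negative log of the softmax probability assigned to the correct class. The sketch below computes it in plain Python so the numbers can be checked by hand. `cross_entropy` here is a hypothetical re-implementation for illustration; the notebook itself relies on `torch.nn.CrossEntropyLoss`, which additionally averages over the batch.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class (illustrative helper)."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


undecided = cross_entropy([0.0, 0.0], target=1)         # equal logits: loss is ln 2
confident_right = cross_entropy([-3.0, 3.0], target=1)  # strong, correct prediction
confident_wrong = cross_entropy([-3.0, 3.0], target=0)  # strong, wrong prediction
print(round(undecided, 4), round(confident_right, 4), round(confident_wrong, 4))
```

A freshly initialised classification head is close to the undecided case, which is consistent with the first training loss printed in this notebook (about 0.687) sitting near ln 2 ≈ 0.693.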

In [18]:
class BertClass(torch.nn.Module):
    def __init__(self):
        super(BertClass, self).__init__()
        self.l1 = BertModel.from_pretrained("bert-base-uncased")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.2)
        self.classifier = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

In [19]:
model = BertClass()
model.to(device)


BertClass(
  (l1): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    

### Fine Tuning the Model
Here we define a training function, `train_model`, that trains the model on the training dataset. It is called once per epoch, and `EPOCHS` sets the number of passes over the data.

The following steps happen in this function to fine-tune the neural network:

- The dataloader passes data to the model in batches of the configured batch size.
- The model's output and the actual category are compared to calculate the loss.
- The loss value is used to optimize the weights of the neurons in the network.
- Every 5000 steps, the running loss and accuracy are printed to the console.

In [20]:
# Creating the loss function and optimizer
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

In [21]:
def calculate_accuracy(preds, targets):
    n_correct = (preds==targets).sum().item()
    return n_correct

In [22]:
# Defining the training function that fine-tunes the BERT model on the balanced training set

def train_model(epoch):
    tr_loss = 0
    n_correct = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.train()
    for step, data in tqdm(enumerate(training_loader, 0)):
        ids = data['ids'].to(device, dtype=torch.long)
        mask = data['mask'].to(device, dtype=torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
        targets = data['targets'].to(device, dtype=torch.long)

        # Forward pass: compute the logits and the cross-entropy loss
        outputs = model(ids, mask, token_type_ids)
        loss = loss_function(outputs, targets)
        tr_loss += loss.item()
        # The predicted class is the index of the largest logit
        big_val, big_idx = torch.max(outputs.data, dim=1)
        n_correct += calculate_accuracy(big_idx, targets)

        nb_tr_steps += 1
        nb_tr_examples += targets.size(0)

        # Report the running averages every 5000 steps
        if step % 5000 == 0:
            loss_step = tr_loss / nb_tr_steps
            accu_step = (n_correct * 100) / nb_tr_examples
            print(f"Training Loss per 5000 steps: {loss_step}")
            print(f"Training Accuracy per 5000 steps: {accu_step}")

        # Backward pass: clear old gradients, backpropagate, update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}')
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Training Loss Epoch: {epoch_loss}")
    print(f"Training Accuracy Epoch: {epoch_accu}")

    return 

In [23]:
EPOCHS = 1
for epoch in range(EPOCHS):
    train_model(epoch)



Training Loss per 5000 steps: 0.6872203350067139
Training Accuracy per 5000 steps: 50.0


905it [05:41,  2.65it/s]

The Total Accuracy for Epoch 0: 94.41526126624274
Training Loss Epoch: 0.15452820398273637
Training Accuracy Epoch: 94.41526126624274





In [24]:
output_model_file = path_to_folder+'pytorch_bert_sentiment.bin'
output_vocab_file = path_to_folder+'./'

model_to_save = model
torch.save(model_to_save, output_model_file)
tokenizer.save_vocabulary(output_vocab_file)

print('All files saved')

All files saved


### Validating the Model
During the validation stage we pass the unseen data (the testing dataset) to the model using the `valid` function. Validation determines how well the model performs on unseen data. The evaluation metrics we are interested in are accuracy, precision, recall, and F1 score. We also record the total prediction time and the number of reviews classified per second.
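For reference, precision, recall and F1 for the positive class can be computed from confusion counts as sketched below. The helper `binary_metrics` is hypothetical and the labels are made up for illustration. It is not what `valid` calls, which relies on torchmetrics.

```python
def binary_metrics(preds, targets, positive=1):
    """Accuracy plus precision, recall and F1 for one class (illustrative helper)."""
    pairs = list(zip(preds, targets))
    tp = sum(p == positive and t == positive for p, t in pairs)  # true positives
    fp = sum(p == positive and t != positive for p, t in pairs)  # false positives
    fn = sum(p != positive and t == positive for p, t in pairs)  # false negatives
    correct = sum(p == t for p, t in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up predictions and labels, for illustration only.
targets = [1, 1, 1, 1, 0, 0, 1, 0]
preds = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(preds, targets)
print(metrics)
```

One caveat, stated as our understanding rather than a verified fact: the torchmetrics functions used in this notebook appear to default to micro averaging, under which precision, recall and F1 all reduce to accuracy for single-label predictions. On an imbalanced test set, per-class values like the ones above may therefore be more informative.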

In [25]:
def valid(model, testing_loader):
    model.eval()
    n_correct = 0
    tr_loss = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    precision_ = 0
    recall_ = 0
    f1_ = 0
    with torch.no_grad():
        for _, data in tqdm(enumerate(testing_loader)):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
            targets = data['targets'].to(device, dtype = torch.long)
            outputs = model(ids, mask, token_type_ids)
            loss = loss_function(outputs, targets)
            tr_loss += loss.item()
            big_val, big_idx = torch.max(outputs.data, dim=1)
            n_correct += calculate_accuracy(big_idx, targets)
            prec, rec = precision_recall(big_idx, targets)
            precision_ += prec
            recall_ += rec
            f1_ += f1_score(big_idx, targets)

            nb_tr_steps += 1
            nb_tr_examples+=targets.size(0)
            
            if _%5000==0:
                loss_step = tr_loss/nb_tr_steps
                accu_step = (n_correct*100)/nb_tr_examples
                precision_step = (precision_*100)/nb_tr_steps
                recall_step = (recall_*100)/nb_tr_steps
                f1_step = (f1_*100)/nb_tr_steps
                print(f"Validation Loss per 100 steps: {loss_step}")
                print(f"Validation Accuracy per 100 steps: {accu_step}")
                print(f"Validation Precision per 100 steps: {precision_step}")
                print(f"Validation Recall per 100 steps: {recall_step}")
                print(f"Validation F-measure per 100 steps: {f1_step}")
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct)/nb_tr_examples
    epoch_prec = (precision_)/nb_tr_steps
    epoch_rec = (recall_)/nb_tr_steps
    epoch_f1 = (f1_)/nb_tr_steps
    print(f"Validation Loss Epoch: {epoch_loss}")
    print(f"Validation Accuracy Epoch: {epoch_accu}")
    print(f"Validation Precision Epoch: {epoch_prec}")
    print(f"Validation Recall Epoch: {epoch_rec}")
    print(f"Validation F-measure Epoch: {epoch_f1}")
    
    return epoch_accu, epoch_prec, epoch_rec, epoch_f1


In [26]:
# model = torch.load(path_to_folder+'pytorch_bert_sentiment.bin')
# device = 'cuda' if cuda.is_available() else 'cpu'
# model.to(device)

In [28]:
start_time = time.time()
acc, precision, recall, f1 = valid(model, testing_loader)
time_taken = time.time() - start_time
rec_classified = len(test_new)/time_taken
print("Predictions took ", time_taken, " seconds")
print("Number of reviews classified per second: ", rec_classified)
print("Accuracy on test data = %0.3f" % acc)
print("Precision on test data = %0.3f" % precision)
print("Recall on test data = %0.3f" % recall)
print("F-measure on test data = %0.3f" % f1)



Validation Loss per 100 steps: 0.027231140062212944
Validation Accuracy per 100 steps: 100.0
Validation Precision per 100 steps: 100.0
Validation Recall per 100 steps: 100.0
Validation F-measure per 100 steps: 100.0


557it [00:48, 11.58it/s]

Validation Loss Epoch: 0.1080587828762602
Validation Accuracy Epoch: 0.9667714414009879
Validation Precision Epoch: 0.9667863845825195
Validation Recall Epoch: 0.9667863845825195
Validation F-measure Epoch: 0.9667863845825195
Predictions took  48.10567307472229  seconds
Number of reviews classified per second:  46.29391624020752
Accuracy on test data = 0.967
Precision on test data = 0.967
Recall on test data = 0.967
F-measure on test data = 0.967





In [33]:
results = {'Model':['BERT'], 'Test Accuracy':[acc], 'Test Precision':[precision.item()], 'Test Recall':[recall.item()], 'Test F1':[f1.item()], 'Time for Predictions':[time_taken], 'No. reviews classified per second':[rec_classified]}
results_df = pd.DataFrame.from_dict(results)
results_df.to_csv(path_to_folder+'bert_results.csv', index=False)