# AUC Text Mining, Group Project: Training the sentiment analysis model
### By Sarah de Jong, Tom Klein Tijssink and Lukas Busch

- This notebook was used to train a BERT model for sentiment analysis of song lyrics. Training was done on Google Colab, which provides free access to a GPU (or TPU) and so speeds up training.

- This notebook largely follows the steps of this TowardsDataScience tutorial:
https://towardsdatascience.com/transformers-for-multilabel-classification-71a1a0daf5e1
The tutorial trains BERT for a multi-label problem, which means it applies a threshold to each class score to decide how many labels are selected. We kept this setup for training, because switching to a loss function suggested for multi-class problems did not make a noticeable difference. When predicting the sentiment of songs, however, the label is chosen by taking the position of the maximum value in the prediction vector.

- Our task is to predict the sentiment of a poetic verse. The dataset comes from:
https://arxiv.org/abs/2011.02686

- The possible outcomes are "negative", "neutral", "positive" and "other", encoded as {-1, 0, 1, 2}.  
Our trained model reached an F1 score of roughly 85%, which we deemed good enough to predict the sentiment of songs. The code for that step is in the 'predicting_sentiments.ipynb' file.
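
The difference between the thresholded (multi-label) reading and the maximum-value (single-label) reading can be sketched in plain Python. The logits below are made up, and the order of the four classes is an assumption for illustration; in the notebook the order follows the label list built from the training data.

```python
import math

# Assumed class order, for illustration only.
LABELS = ["negative", "neutral", "positive", "other"]
# Hypothetical model outputs (logits) for a single verse.
logits = [-1.2, -0.1, -0.4, -2.5]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

probs = [sigmoid(z) for z in logits]

# Multi-label reading: every class whose score exceeds the threshold is
# selected, so zero, one, or several labels may come out.
threshold = 0.5
multi_label = [label for label, p in zip(LABELS, probs) if p > threshold]

# Single-label reading: always exactly one class, the highest-scoring one.
best_index = max(range(len(probs)), key=probs.__getitem__)
single_label = LABELS[best_index]

print("thresholded:", multi_label)
print("maximum:", single_label)
```

With these particular scores no class passes the threshold, while the maximum-value rule still returns one label, which is why the latter suits a task where every song needs exactly one sentiment.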



In [None]:
# Install and import the libraries needed

!pip install transformers
!pip install sentencepiece
import pandas as pd
import numpy as np
import tensorflow as tf
import torch
from torch.nn import BCEWithLogitsLoss, BCELoss
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import classification_report, confusion_matrix, multilabel_confusion_matrix, f1_score, accuracy_score
import pickle
from transformers import *
from tqdm import tqdm, trange
from ast import literal_eval



In [None]:
# Setting up the Google GPU

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

Found GPU at: /device:GPU:0


'Tesla T4'

In [None]:
# Reading the respective datasets (dev, test, train)
DEVPATH = '/content/dev.tsv'
TESTPATH = '/content/test.tsv'
TRAINPATH = '/content/train.tsv'
COLNAMES = ['verse','sentiment'] #added column names
# Not all rows are clean, so malformed lines are skipped with a warning (see the output below).
# on_bad_lines requires pandas >= 1.3; older versions used error_bad_lines=False.
dev = pd.read_csv(DEVPATH, delimiter="\t", engine='python', encoding='utf-8', on_bad_lines='warn', names=COLNAMES)
test = pd.read_csv(TESTPATH, delimiter="\t", engine='python', encoding='utf-8', on_bad_lines='warn', names=COLNAMES)
train = pd.read_csv(TRAINPATH, delimiter="\t", engine='python', encoding='utf-8', on_bad_lines='warn', names=COLNAMES)

Skipping line 90: unexpected end of data
Skipping line 48: '	' expected after '"'
Skipping line 58: '	' expected after '"'
Skipping line 186: '	' expected after '"'
Skipping line 257: '	' expected after '"'
Skipping line 391: '	' expected after '"'
Skipping line 415: '	' expected after '"'
Skipping line 426: '	' expected after '"'
Skipping line 468: '	' expected after '"'
Skipping line 525: '	' expected after '"'
Skipping line 535: '	' expected after '"'
Skipping line 638: '	' expected after '"'
Skipping line 702: '	' expected after '"'
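
The skipped lines above are rows that do not parse as `verse<TAB>sentiment`. As a rough illustration of that filtering, here is a minimal standard-library sketch. It is not how pandas parses the file (in particular it treats quote characters literally), and the sample rows and the helper name `read_verses` are made up.

```python
import csv
import io

# Made-up sample in the same two-column layout (verse <TAB> sentiment).
raw = "with pale blue berries\t1\nbroken line without a label\nit flows so long\t0\n"

def read_verses(handle):
    """Return (rows, skipped): well-formed (verse, sentiment) pairs and the
    1-based line numbers of rows that do not have exactly two fields."""
    rows, skipped = [], []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, fields in enumerate(reader, start=1):
        if len(fields) != 2:
            skipped.append(line_no)
            continue
        verse, sentiment = fields
        rows.append((verse, int(sentiment)))
    return rows, skipped

rows, skipped = read_verses(io.StringIO(raw))
print(len(rows), "rows kept; skipped lines:", skipped)
```

Keeping the skipped line numbers makes it easy to check how much of each split is lost to malformed rows.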


In [None]:
print("{} number of dev samples".format(len(dev)))
print("{} number of test samples".format(len(test)))
print("{} number of train samples".format(len(train)))

89 number of dev samples
96 number of test samples
738 number of train samples


In [None]:
def create_onehots(labels, unique_labels):
  """function that takes a list of labels and a list of all the unique labels and 
  creates the corresponding one-hot-vectors"""
  label_dict = {}
  for i,un_label in enumerate(unique_labels):
    label_dict[un_label] = i

  one_hot_labels = []
  n = len(unique_labels)
  for label in labels:
    oh = [0]* n
    index = label_dict[label]
    oh[index] = 1
    one_hot_labels.append(oh)
  
  return one_hot_labels

In [None]:
UNIQUE_LABELS = train.sentiment.unique() #All unique labels {-1,0,1,2}
#Creating the onehot vectors for the respective datasets
dev_oh = create_onehots(dev.sentiment,UNIQUE_LABELS)
test_oh = create_onehots(test.sentiment,UNIQUE_LABELS)
train_oh = create_onehots(train.sentiment,UNIQUE_LABELS)

#get the texts that correspond to the onehot labels
dev_texts = dev.verse.to_list()
test_texts = test.verse.to_list()
train_texts = train.verse.to_list()
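
Note that `unique()` returns labels in order of first appearance, so the column that each sentiment occupies in the one-hot vectors depends on the order of the training file. One possible way to make the mapping reproducible, and to decode predictions back into labels later, is to sort the labels and keep the inverse mapping. The helper names below are hypothetical and the sentiment column is made up.

```python
def build_label_maps(labels):
    """Map each distinct label to a column index (sorted order), and back."""
    label_to_index = {label: i for i, label in enumerate(sorted(set(labels)))}
    index_to_label = {i: label for label, i in label_to_index.items()}
    return label_to_index, index_to_label

def one_hot(label, label_to_index):
    """Return a one-hot list with a 1 in the column assigned to `label`."""
    vector = [0] * len(label_to_index)
    vector[label_to_index[label]] = 1
    return vector

# Made-up sentiment column using the coding {-1, 0, 1, 2}.
sentiments = [0, 1, -1, 0, 2]
label_to_index, index_to_label = build_label_maps(sentiments)

encoded = [one_hot(s, label_to_index) for s in sentiments]
print(encoded[2])         # one-hot vector for the label -1
print(index_to_label[3])  # label stored in the last column
```

Saving `index_to_label` next to the model would let the prediction notebook translate a column index back into a sentiment without re-reading the training data.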



In [None]:
def data_to_dataloader(textlist, labels, max_length,batchsize, tokenizer):
  """Function that takes: a list of texts, a list of one-hot labels, a maximum token length,
    a batch size and a tokenizer object loaded from Hugging Face's hub.
    Returns a PyTorch DataLoader object, which we will use for training the BERT model."""
  # Pad or truncate every text to exactly max_length tokens
  encodings = tokenizer.batch_encode_plus(textlist,max_length=max_length,padding='max_length', truncation=True)
  input_ids = torch.tensor(encodings['input_ids']) # tokenized and encoded sentences
  token_type_ids = torch.tensor(encodings['token_type_ids']) # token type ids
  attention_masks = torch.tensor(encodings['attention_mask']) # attention masks
  labels = torch.tensor(labels) #labels to torch

  data = TensorDataset(input_ids, attention_masks, labels, token_type_ids)
  sampler = RandomSampler(data)
  return DataLoader(data, sampler=sampler, batch_size=batchsize) #create the dataloader
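
To make the roles of `input_ids` and `attention_mask` concrete, here is a minimal pure-Python sketch of padding and truncating to a fixed length. It only illustrates the idea and is not the tokenizer's implementation: the real tokenizer keeps the closing `[SEP]` token when truncating, which this sketch does not. The token ids are hypothetical, apart from 101, 102 and 0, which `bert-base-uncased` uses for `[CLS]`, `[SEP]` and `[PAD]`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.
    The mask is 1 for real tokens and 0 for padding positions."""
    ids = token_ids[:max_length]
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

# Hypothetical token ids for a short and a long verse.
short_verse = [101, 2007, 5122, 102]
long_verse = [101] + list(range(2000, 2010)) + [102]

short_ids, short_mask = pad_or_truncate(short_verse, max_length=6)
long_ids, long_mask = pad_or_truncate(long_verse, max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every sequence ends up the same length, the encoded texts can be stacked into a single tensor, and the mask tells the model which positions to ignore.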

In [None]:
# Tokenize with BERT tokenizer
MODEL_NAME = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME, do_lower_case=True) # tokenizer





In [None]:
#Creating the dataloader objects
dev_dataloader = data_to_dataloader(dev_texts,dev_oh, 128, 32, tokenizer)
test_dataloader = data_to_dataloader(test_texts,test_oh, 128, 32, tokenizer)
train_dataloader = data_to_dataloader(train_texts,train_oh, 128, 32, tokenizer)
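
As a quick sanity check on these settings, the number of batches per epoch follows directly from the sample counts printed earlier and the batch size of 32. The sketch assumes the DataLoader default of keeping the last, smaller batch; the helper name is hypothetical.

```python
import math

def batches_per_epoch(n_samples, batch_size):
    """Number of batches when the final, smaller batch is kept
    (the DataLoader default, drop_last=False)."""
    return math.ceil(n_samples / batch_size)

split_sizes = {"train": 738, "dev": 89, "test": 96}
counts = {name: batches_per_epoch(n, 32) for name, n in split_sizes.items()}

for name, n_batches in counts.items():
    print(name, n_batches)
```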



In [None]:
# Instantiate our model (the warning it prints is expected)
# Note that MODEL_NAME was defined earlier as the standard bert-base model
nb_labels = 4
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=nb_labels)
model.cuda()

In [None]:
#Create our optimizer
param_optimizer = list(model.named_parameters())
# Biases and LayerNorm weights are excluded from weight decay
no_decay = ['bias', 'LayerNorm.weight']
# AdamW reads the per-group key 'weight_decay'; any other key is ignored
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]
]

optimizer = AdamW(optimizer_grouped_parameters,lr=1e-5,correct_bias=True)
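
The grouping above works by substring matching on parameter names, so the outcome depends on how the model names its parameters. The sketch below applies the same rule to a handful of names written in the style Hugging Face's BERT uses (treat the exact names as illustrative) and compares two candidate `no_decay` lists. The helper `split_by_decay` is hypothetical.

```python
def split_by_decay(param_names, no_decay):
    """Split names into (decayed, not_decayed) by substring match."""
    decayed, not_decayed = [], []
    for name in param_names:
        if any(nd in name for nd in no_decay):
            not_decayed.append(name)
        else:
            decayed.append(name)
    return decayed, not_decayed

# Illustrative parameter names; a real model has many more.
names = [
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.0.attention.self.query.bias",
    "bert.encoder.layer.0.output.LayerNorm.weight",
    "bert.encoder.layer.0.output.LayerNorm.bias",
    "classifier.weight",
]

for no_decay in (["bias", "gamma", "beta"], ["bias", "LayerNorm.weight"]):
    decayed, not_decayed = split_by_decay(names, no_decay)
    print(no_decay, "-> excluded from decay:", not_decayed)
```

With names like these, `gamma` and `beta` never match, so only a list that mentions `LayerNorm.weight` keeps the layer-norm scale out of the decayed group.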


In [None]:
# Store the per-step training loss for plotting
train_loss_set = []

# Number of training epochs 
epochs = 5

# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):

  # Training
  
  # Set our model to training mode (as opposed to evaluation mode)
  model.train()

  # Tracking variables
  tr_loss = 0 #running loss
  nb_tr_examples, nb_tr_steps = 0, 0
  
  # Train the data for one epoch
  for step, batch in enumerate(train_dataloader):
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels, b_token_types = batch
    # Clear out the gradients (by default they accumulate)
    optimizer.zero_grad()


    # Forward pass for multilabel classification
    outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    logits = outputs[0]
    loss_func = BCEWithLogitsLoss()  # applies the sigmoid internally; numerically more stable than sigmoid + BCELoss
    loss = loss_func(logits.view(-1, nb_labels), b_labels.type_as(logits).view(-1, nb_labels))  # convert labels to float for the loss
    train_loss_set.append(loss.item())       

    # Backward pass
    loss.backward()
    # Update parameters and take a step using the computed gradient
    optimizer.step()
    # scheduler.step()
    # Update tracking variables
    tr_loss += loss.item()
    nb_tr_examples += b_input_ids.size(0)
    nb_tr_steps += 1

  print("Train loss: {}".format(tr_loss/nb_tr_steps))

###############################################################################

  # Validation

  # Put model in evaluation mode to evaluate loss on the validation set
  model.eval()

  # Variables to gather full output
  logit_preds,true_labels,pred_labels,tokenized_texts = [],[],[],[]

  # Predict
  for i, batch in enumerate(dev_dataloader):
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels, b_token_types = batch
    with torch.no_grad():
      # Forward pass
      outs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
      b_logit_pred = outs[0]
      pred_label = torch.sigmoid(b_logit_pred)

      b_logit_pred = b_logit_pred.detach().cpu().numpy()
      pred_label = pred_label.to('cpu').numpy()
      b_labels = b_labels.to('cpu').numpy()

    tokenized_texts.append(b_input_ids)
    logit_preds.append(b_logit_pred)
    true_labels.append(b_labels)
    pred_labels.append(pred_label)

  # Flatten outputs
  pred_labels = [item for sublist in pred_labels for item in sublist]
  true_labels = [item for sublist in true_labels for item in sublist]

  # Calculate Accuracy
  threshold = 0.5
  pred_bools = [pl>threshold for pl in pred_labels] 
  true_bools = [tl==1 for tl in true_labels]
  val_f1_accuracy = f1_score(true_bools,pred_bools,average='micro')*100
  val_flat_accuracy = accuracy_score(true_bools, pred_bools)*100

  print('F1 Validation Accuracy: ', val_f1_accuracy)
  print('Flat Validation Accuracy: ', val_flat_accuracy)





Epoch:   0%|          | 0/5 [00:00<?, ?it/s]
Train loss: 0.11374265126263101
Epoch:  20%|██        | 1/5 [00:42<02:48, 42.18s/it]
F1 Validation Accuracy:  83.72093023255815
Flat Validation Accuracy:  80.89887640449437
Train loss: 0.10427362083767851
Epoch:  40%|████      | 2/5 [01:24<02:06, 42.10s/it]
F1 Validation Accuracy:  86.36363636363636
Flat Validation Accuracy:  85.39325842696628
Train loss: 0.09503170428797603
Epoch:  60%|██████    | 3/5 [02:06<01:24, 42.05s/it]
F1 Validation Accuracy:  86.36363636363636
Flat Validation Accuracy:  85.39325842696628
Train loss: 0.08612038164089124
Epoch:  80%|████████  | 4/5 [02:47<00:41, 42.00s/it]
F1 Validation Accuracy:  85.54913294797689
Flat Validation Accuracy:  83.14606741573034
Train loss: 0.07735171355307102
Epoch: 100%|██████████| 5/5 [03:29<00:00, 41.97s/it]

F1 Validation Accuracy:  87.35632183908046
Flat Validation Accuracy:  85.39325842696628
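
The training loop above scores each of the four classes independently with a sigmoid and averages the binary cross-entropy over all batch and class entries. A minimal pure-Python sketch of that loss for a single example follows, written once naively and once in the overflow-safe form that works directly on logits. The logits are made up and the function names are hypothetical.

```python
import math

def bce_naive(logits, targets):
    """Mean binary cross-entropy computed from sigmoid probabilities."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

def bce_with_logits(logits, targets):
    """Same loss computed from raw logits with the identity
    max(z, 0) - z*y + log(1 + exp(-|z|)), which avoids overflow."""
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Hypothetical logits for one verse whose true class is the first column.
logits = [2.0, -1.0, -3.0, -2.0]
targets = [1, 0, 0, 0]

naive_loss = bce_naive(logits, targets)
stable_loss = bce_with_logits(logits, targets)
print(naive_loss, stable_loss)
```

Both versions agree for moderate logits; the second form stays finite for very large positive or negative logits, where the naive version can hit `log(0)`.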





In [None]:
model.save_pretrained('poem_sentiment') #saving our model

In [None]:
# Test

# Put model in evaluation mode to evaluate on the test set
model.eval()

#track variables
logit_preds,true_labels,pred_labels,tokenized_texts = [],[],[],[]

# Predict
for i, batch in enumerate(test_dataloader):
  batch = tuple(t.to(device) for t in batch)
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels, b_token_types = batch
  with torch.no_grad():
    # Forward pass
    outs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    b_logit_pred = outs[0]
    pred_label = torch.sigmoid(b_logit_pred)

    b_logit_pred = b_logit_pred.detach().cpu().numpy()
    pred_label = pred_label.to('cpu').numpy()
    b_labels = b_labels.to('cpu').numpy()

  tokenized_texts.append(b_input_ids)
  logit_preds.append(b_logit_pred)
  true_labels.append(b_labels)
  pred_labels.append(pred_label)

# Flatten outputs
tokenized_texts = [item for sublist in tokenized_texts for item in sublist]
pred_labels = [item for sublist in pred_labels for item in sublist]
true_labels = [item for sublist in true_labels for item in sublist]
# Converting flattened binary values to boolean values
true_bools = [tl==1 for tl in true_labels]

In [None]:
pred_bools = [pl>0.5 for pl in pred_labels] #boolean output after thresholding

# Print and save classification report
print('Test F1 Accuracy: ', f1_score(true_bools, pred_bools,average='micro'))
print('Test Flat Accuracy: ', accuracy_score(true_bools, pred_bools),'\n')
# ,target_names=label_cols
clf_report = classification_report(true_bools,pred_bools)
#pickle.dump(clf_report, open('classification_report.txt','wb')) #save report
print(clf_report)

Test F1 Accuracy:  0.8631578947368421
Test Flat Accuracy:  0.8541666666666666 

              precision    recall  f1-score   support

           0       0.75      0.60      0.67        15
           1       0.91      0.90      0.90        67
           2       0.81      0.93      0.87        14
           3       0.00      0.00      0.00         0

   micro avg       0.87      0.85      0.86        96
   macro avg       0.62      0.61      0.61        96
weighted avg       0.87      0.85      0.86        96
 samples avg       0.85      0.85      0.85        96
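
For readers unfamiliar with the two headline numbers: on one-hot style inputs, the micro-averaged F1 pools true positives, false positives and false negatives over all classes, while the flat accuracy counts a sample as correct only if its whole predicted row matches the true row. The pure-Python sketch below is meant to mirror those definitions on made-up data; it is not the scikit-learn implementation, and the function names are hypothetical.

```python
def micro_f1(true_rows, pred_rows):
    """F1 from true/false positives and false negatives pooled over all classes."""
    tp = fp = fn = 0
    for t_row, p_row in zip(true_rows, pred_rows):
        for t, p in zip(t_row, p_row):
            if t and p:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def subset_accuracy(true_rows, pred_rows):
    """Fraction of samples whose full predicted row equals the true row."""
    matches = sum(1 for t, p in zip(true_rows, pred_rows) if list(t) == list(p))
    return matches / len(true_rows)

# Made-up example: the third sample gets no label after thresholding,
# the fourth gets one label too many.
y_true = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 1, 1, 0]]

f1 = micro_f1(y_true, y_pred)
acc = subset_accuracy(y_true, y_pred)
print(f1, acc)
```

The example shows why the two numbers can differ under thresholding: a row with zero or two predicted labels counts as fully wrong for the flat accuracy but only partly wrong for the micro F1.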



