<a href="https://colab.research.google.com/github/DAbbott93/Dean-Abbott--Dissertation/blob/main/Model_3_GRU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GRU model using BERT embeddings for Sentiment analysis

In this notebook I use the pretrained BERT transformer model (from the transformers library) as the embedding layer for a GRU model. BERT is frozen, and only the remainder of the model is trained; it learns from the representations produced by the transformer.

## Data preparation

In [None]:
!pip install torch==1.6.0 torchvision==0.7.0 torchtext==0.7.0


Set the seed to achieve reproducibility.

In [None]:
import torch

import random
import numpy as np

SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
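As a quick illustration of why the seed matters, the sketch below uses only the standard-library `random` module (not the full setup above) to show that reseeding with the same value replays the same sequence. The helper name `draw` is hypothetical.

```python
import random

SEED = 1234

def draw(n, seed):
    """Seed the generator, then return n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(3, SEED)
second = draw(3, SEED)
different = draw(3, SEED + 1)

print(first == second)     # same seed, same sequence
print(first == different)  # a different seed gives a different sequence
```

The same idea applies to NumPy and PyTorch, which keep their own generators and therefore need their own seed calls.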

In [None]:
# check whether cuda is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


Tokenise the data into the required format for BERT. Use the BertTokenizer from the transformers library.

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.9.2-py3-none-any.whl (2.6 MB)
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)


In [None]:
from transformers import BertTokenizer
# Use bert base uncased tokeniser
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Mark the beginning of each sentence with [CLS] and the end of each sentence with [SEP]. Also define the padding and unknown tokens. This is the required format for BERT inputs.

In [None]:
init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, eos_token, pad_token, unk_token)

Get indexes for the special tokens from the tokenizer

In [None]:
init_token_idx = tokenizer.cls_token_id
eos_token_idx = tokenizer.sep_token_id
pad_token_idx = tokenizer.pad_token_id
unk_token_idx = tokenizer.unk_token_id

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

BERT was trained with a fixed maximum sequence length, so set our maximum input length to that value.

In [None]:
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']

print(max_input_length)

Define a function that tokenizes a sentence and truncates it so that the two special tokens still fit within the maximum length.




In [None]:
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence) 
    tokens = tokens[:max_input_length-2]
    return tokens
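The `-2` in `tokenize_and_cut` leaves room for the `[CLS]` and `[SEP]` tokens that are added later. Here is a minimal sketch of that arithmetic. It uses a whitespace split as a stand-in for the real WordPiece tokenizer and assumes a maximum length of 512, the value reported for `bert-base-uncased`. The name `toy_tokenize_and_cut` is hypothetical.

```python
MAX_INPUT_LENGTH = 512  # assumed: the limit reported for bert-base-uncased

def toy_tokenize_and_cut(sentence, max_len=MAX_INPUT_LENGTH):
    """Whitespace stand-in for the tokenizer; keeps at most max_len - 2 tokens."""
    tokens = sentence.split()
    return tokens[:max_len - 2]

# A review that is longer than the model can accept
long_review = " ".join(["word"] * 600)

tokens = toy_tokenize_and_cut(long_review)
model_input = ["[CLS]"] + tokens + ["[SEP]"]

print(len(tokens), len(model_input))  # 510 512
```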

Define the fields for how the data should be processed.  Use the TEXT field to define how the review should be processed, and the LABEL field to process the sentiment.

In [None]:
!pip install torchtext

In [None]:
from torchtext import data

In [None]:
TEXT = data.Field(batch_first = True,
                  use_vocab = False,
                  tokenize = tokenize_and_cut,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = init_token_idx,
                  eos_token = eos_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

LABEL = data.LabelField(dtype = torch.float)

Load Data

In [None]:
!pip install datasets
from datasets import load_dataset
dataset= load_dataset("hope_edi", "english")
print(dataset)

Convert to dataframe

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
from pandas import DataFrame

# Training dataset
df_train=DataFrame({'text':dataset['train']['text'], 'label': dataset['train']['label']})
print(df_train.shape)

df_train['label'] = df_train['label'].replace([1], "negative")
df_train['label'] = df_train['label'].replace([0], "positive")
df_train.to_csv('/content/drive/MyDrive/Hope_Dataset/train.tsv', sep="\t",index=False)


#Validation dataset
df_val=DataFrame({'text':dataset['validation']['text'], 'label': dataset['validation']['label']}) 
x = df_val['label'].value_counts()
df_val['label'] = df_val['label'].replace([1], "negative")
df_val['label'] = df_val['label'].replace([0], "positive")
df_val.to_csv('/content/drive/MyDrive/Hope_Dataset/test.tsv', sep="\t", index=False)

df_val.head()
df_train.tail()

In [None]:
print(x)

In [None]:
df_val['label'].value_counts()

In [None]:
df_train.sample(20)

In [None]:
fields = [('text', TEXT), ('label', LABEL)]
#loading custom dataset
training_data=data.TabularDataset(path ="/content/drive/MyDrive/Hope_Dataset/train.tsv",format = 'tsv',fields = fields,skip_header = True)


train_data, valid_data = training_data.split(split_ratio=0.8, random_state = random.seed(SEED))   #, random_state = random.seed(SEED)


print(len(train_data))
print(len(valid_data))

LABEL.build_vocab(train_data)

# No. of unique tokens in label
print("Size of LABEL vocabulary:", len(LABEL.vocab))


Create Iterators

In [None]:
BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data), 
    batch_size = BATCH_SIZE, 
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
    device = device)
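`BucketIterator` groups examples of similar length so that each batch needs little padding. The sketch below is not torchtext's implementation. It only compares the padding required when batches are formed in arrival order with the padding required after sorting by length. The sequence lengths are made up and `padding_needed` is a hypothetical helper.

```python
def padding_needed(lengths, batch_size):
    """Total pad tokens when every batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total

# Made-up token counts for eight reviews: a mix of short and long texts
lengths = [5, 40, 7, 38, 6, 41, 8, 39]

unsorted_pad = padding_needed(lengths, batch_size=4)
sorted_pad = padding_needed(sorted(lengths), batch_size=4)

print(unsorted_pad, sorted_pad)  # 140 12
```

Less padding means fewer wasted positions passed through BERT and the GRU for each batch.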

# Build the model

In [None]:
from transformers import BertTokenizer, BertModel

bert = BertModel.from_pretrained('bert-base-uncased') # it is important to use the same model and tokenizer

Define the model.
The pretrained BERT model serves as the embedding layer. The embeddings are fed into the GRU, whose final hidden state is passed through a linear layer to produce a prediction for the sentiment.

In [None]:
import torch.nn as nn

class BERTGRUSentiment(nn.Module):
    def __init__(self,
                 bert,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 dropout):
        
        super().__init__()
        
        self.bert = bert
        
        embedding_dim = bert.config.to_dict()['hidden_size']
        
        self.rnn = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers = n_layers,
                          bidirectional = bidirectional,
                          batch_first = True,
                          dropout = 0 if n_layers < 2 else dropout)
        
        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):    
                
        with torch.no_grad():
            embedded = self.bert(text)[0]            
        
        _, hidden = self.rnn(embedded)        
        
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])                
        
        output = self.out(hidden)      
        
        return output

Define hyperparameters

In [None]:
HIDDEN_DIM = 256
OUTPUT_DIM = 1
BIDIRECTIONAL = True
DROPOUT = 0.25
N_LAYERS = 2

model = BERTGRUSentiment(bert,
                         HIDDEN_DIM,
                         OUTPUT_DIM,
                         N_LAYERS,
                         BIDIRECTIONAL,
                         DROPOUT)

Define a function that counts the model's trainable parameters. Before freezing, most of the parameters belong to BERT.

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

Set requires_grad attribute to False to freeze parameters for BERT


In [None]:
for name, param in model.named_parameters():                
    if name.startswith('bert'):
        param.requires_grad = False

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')
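Once BERT is frozen, the remaining trainable parameters belong to the GRU and the linear output layer, and their number can be estimated by hand. The sketch below assumes the hyperparameter values defined earlier (hidden size 256, two layers, bidirectional, one output) are passed to the constructor in the order it expects, and that the BERT hidden size is 768. The helper `gru_param_count` is hypothetical; it follows the standard GRU layout of three gates, each with input weights, hidden weights and two bias vectors.

```python
def gru_param_count(input_dim, hidden_dim, n_layers, bidirectional):
    """Count the weights and biases of a stacked (optionally bidirectional) GRU."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers after the first receive the concatenated outputs of both directions
        layer_input = input_dim if layer == 0 else hidden_dim * directions
        weights = 3 * hidden_dim * (layer_input + hidden_dim)
        biases = 2 * 3 * hidden_dim
        total += (weights + biases) * directions
    return total

EMBEDDING_DIM = 768  # assumed: hidden size of bert-base-uncased
HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL = 256, 1, 2, True

gru_params = gru_param_count(EMBEDDING_DIM, HIDDEN_DIM, N_LAYERS, BIDIRECTIONAL)
linear_in = HIDDEN_DIM * 2 if BIDIRECTIONAL else HIDDEN_DIM
linear_params = linear_in * OUTPUT_DIM + OUTPUT_DIM  # weights plus bias

print(f"{gru_params + linear_params:,}")  # 2,759,169
```

If the printed count from `count_parameters` differs, the hyperparameters reaching the model are probably not the ones assumed here.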

Print the names of the trainable parameters

In [None]:
for name, param in model.named_parameters():                
    if param.requires_grad:
        print(name)

# Train the model

Define optimizer and loss function

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

Place the model and criterion onto the GPU 

In [None]:
# check whether cuda is available
if torch.cuda.is_available():    
    # If a GPU is available tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    # Print that a GPU is available and its name
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If a GPU is not available print the following statement
else:
    print('No GPU available, using the CPU instead.')

In [None]:
model = model.to(device)
criterion = criterion.to(device)

Define method for calculating accuracy

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
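The accuracy function works on raw logits: the sigmoid maps each logit to a probability, and rounding turns it into a 0/1 prediction, so a positive logit is read as class 1. A pure-Python sketch of the same computation, with made-up logits and labels and the hypothetical name `toy_binary_accuracy`:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def toy_binary_accuracy(logits, labels):
    """Fraction of examples whose rounded sigmoid output matches the label."""
    rounded = [round(sigmoid(z)) for z in logits]
    correct = [float(p == y) for p, y in zip(rounded, labels)]
    return sum(correct) / len(correct)

logits = [2.0, -1.0, 0.3, -0.2]  # made-up model outputs
labels = [1, 0, 0, 0]            # made-up targets

acc = toy_binary_accuracy(logits, labels)
print(acc)  # 0.75
```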

Define a function that performs one training epoch.

In [None]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Define functions for computing the F1 score, performing an evaluation epoch, and calculating how long a training/evaluation epoch takes.

In [None]:
def f1_loss(y_pred:torch.Tensor, y_true:torch.Tensor, is_training=False):
    '''Calculate F1 score. Can work with gpu tensors'''
   
    assert y_true.ndim == 1
    assert y_pred.ndim == 1 or y_pred.ndim == 2
    
    if y_pred.ndim == 2:
        y_pred = y_pred.argmax(dim=1)
    
    y_pred = torch.round(torch.sigmoid(y_pred))
   
   

    
    tp = (y_true * y_pred).sum().to(torch.float32)
    tn = ((1 - y_true) * (1 - y_pred)).sum().to(torch.float32)
    fp = ((1 - y_true) * y_pred).sum().to(torch.float32)
    fn = (y_true * (1 - y_pred)).sum().to(torch.float32)
    
    epsilon = 1e-7
    
    precision = tp / (tp + fp + epsilon)
    recall = tp / (tp + fn + epsilon)
    
    f1 = 2* (precision*recall) / (precision + recall + epsilon)
    f1.requires_grad = is_training
    return f1, precision, recall, tp, tn, fp, fn
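To make the arithmetic in `f1_loss` concrete, here is a small worked example in plain Python. It mirrors the same products of labels and predictions, but on made-up lists rather than tensors. The names `confusion_counts` and `f1_from_counts` are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels and predictions."""
    tp = sum(t * p for t, p in zip(y_true, y_pred))
    tn = sum((1 - t) * (1 - p) for t, p in zip(y_true, y_pred))
    fp = sum((1 - t) * p for t, p in zip(y_true, y_pred))
    fn = sum(t * (1 - p) for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def f1_from_counts(tp, fp, fn, epsilon=1e-7):
    """Precision, recall and F1; epsilon guards against division by zero."""
    precision = tp / (tp + fp + epsilon)
    recall = tp / (tp + fn + epsilon)
    f1 = 2 * precision * recall / (precision + recall + epsilon)
    return f1, precision, recall

y_true = [1, 1, 0, 0, 1]  # made-up labels
y_pred = [1, 0, 1, 0, 1]  # made-up predictions

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
f1, precision, recall = f1_from_counts(tp, fp, fn)

print(tp, tn, fp, fn)  # 2 1 1 1
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```

Summing the four counts over all batches and computing F1 once at the end generally gives a different result from averaging per-batch F1 scores, which is why the counts are returned alongside the score.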



In [None]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    # running confusion-matrix totals across all batches
    tp_total = 0
    tn_total = 0
    fp_total = 0
    fn_total = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            f1, precision, recall, tp, tn, fp, fn = f1_loss(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
            # the per-batch counts are tensors; accumulate them as Python numbers
            tp_total += tp.item()
            fp_total += fp.item()
            tn_total += tn.item()
            fn_total += fn.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator), tp_total, tn_total, fp_total, fn_total

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Train the model 

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc, tp, tn, fp, fn = evaluate(model, valid_iterator, criterion)
        
    end_time = time.time()
        
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
        
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut6-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
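The loop saves the model only when the validation loss improves, so the file on disk always holds the best epoch seen so far rather than the last one. A minimal sketch of that bookkeeping, with made-up validation losses and a hypothetical helper `epochs_that_checkpoint` standing in for the `torch.save` call:

```python
def epochs_that_checkpoint(valid_losses):
    """Return the 1-based epochs at which a new best loss would trigger a save."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            best = loss
            saved.append(epoch)  # the real loop would call torch.save here
    return saved, best

valid_losses = [0.62, 0.55, 0.58, 0.51, 0.53]  # made-up values for five epochs

saved, best = epochs_that_checkpoint(valid_losses)
print(saved, best)  # [1, 2, 4] 0.51
```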


Test set

In [None]:
unseen_data = data.TabularDataset(path="/content/drive/MyDrive/Hope_Dataset/test.tsv",format='tsv', fields= [('text', TEXT), ('label', LABEL)], skip_header=True)


  # loading custom dataset
unseen_train_data, unseen_data = unseen_data.split(split_ratio=0.99,
                                                      random_state=random.seed(
                                                          SEED))  

print(len(unseen_train_data))
print(len(unseen_data))
 
LABEL.build_vocab(unseen_train_data)

  

unseen_train_data_iter, unseen_data_iter = data.BucketIterator.splits((unseen_train_data, unseen_data),
                                                                        batch_size=256,
                                                                        sort_key=lambda x: len(x.text),
                                                                        sort_within_batch=True, device=device)
                                                                  

# load the best checkpoint saved by the training loop above
model.load_state_dict(torch.load('tut6-model.pt'))


def evaluate_testset(model, unseen_train_data_iter, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    tp_total = 0
    tn_total = 0 
    fp_total = 0
    fn_total =0
    total_no_inputs = 0
    model.eval()
    
    with torch.no_grad():
    
        for batch in unseen_train_data_iter:
            total_no_inputs += 256        
            predictions = model(batch.text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            f1, precision, recall, tp, tn, fp, fn= f1_loss(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
            tp_total +=tp.item()
            fp_total +=fp.item()
            tn_total +=tn.item()
            fn_total +=fn.item()
           

    return epoch_loss / len(unseen_train_data_iter), epoch_acc / len(unseen_train_data_iter), tp_total, tn_total, fp_total, fn_total 


unseen_test_loss, unseen_test_acc, tp_total, tn_total, fp_total, fn_total = evaluate_testset(model, unseen_train_data_iter, criterion)
print(f'Unseen Test Loss: {unseen_test_loss:.3f} | Unseen Test Acc: {unseen_test_acc * 100:.2f}%')
print(  'tp:', tp_total, 'tn:', tn_total, 'fp:', fp_total, 'fn:', fn_total)
prec = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
print('Recall', recall)
print('Prec', prec )
F1 = 2*prec*recall/ (prec+recall)
print('F1:', F1)
print("finished")   



Test the sentiment of random sentences

In [None]:
def predict_sentiment(model, tokenizer, sentence):
    model.eval()
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]
    indexed = [init_token_idx] + tokenizer.convert_tokens_to_ids(tokens) + [eos_token_idx]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(0)
    prediction = torch.sigmoid(model(tensor))
    return prediction.item()

In [None]:
predict_sentiment(model, tokenizer, "i hate indians")

In [None]:
predict_sentiment(model, tokenizer, "omg how can someone be that amazing")

# Transfer learning
## Airline dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd

In [None]:
df2 = pd.read_csv("/content/drive/MyDrive/Tweets.csv",encoding='latin1')
df2.head()

In [None]:


# Merge neutral tweets into the negative class and keep only the text and label columns

df2["airline_sentiment"] = df2["airline_sentiment"].replace("neutral", "negative")
df2=df2[["text","airline_sentiment"]]
df2.columns = ['text', 'label']
df2.sample(10)




In [None]:
df2.to_csv('/content/drive/MyDrive/AirlineTweet.tsv', sep="\t",index=False)

In [None]:
df2.head()

In [None]:
unseen_data2 = data.TabularDataset(path="/content/drive/MyDrive/AirlineTweet.tsv",format='tsv', fields = [('text', TEXT), ('label', LABEL)], skip_header=True)

In [None]:
unseen_data2

In [None]:

unseen_train_data, unseen_data = unseen_data2.split(split_ratio=0.99,
                                                      random_state=random.seed(
                                                          SEED))


print(len(unseen_train_data))
print(len(unseen_data))

LABEL.build_vocab(unseen_train_data)

 

unseen_train_data_iter, unseen_data_iter = data.BucketIterator.splits((unseen_train_data, unseen_data),
                                                                        batch_size=256,
                                                                        sort_key=lambda x: len(x.text),
                                                                        sort_within_batch=True, device=device)
                                                                  

model.load_state_dict(torch.load('tut6-model.pt'))


def evaluate_testset(model, unseen_train_data_iter, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    tp_total = 0
    tn_total = 0 
    fp_total = 0
    fn_total =0
    total_no_inputs = 0
    model.eval()
    
    with torch.no_grad():
    
        for batch in unseen_train_data_iter:
           
            total_no_inputs += 256
           
            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
           
            acc = binary_accuracy(predictions, batch.label)
            f1, precision, recall, tp, tn, fp, fn= f1_loss(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
            tp_total +=tp.item()
            fp_total +=fp.item()
            tn_total +=tn.item()
            fn_total +=fn.item()
           

    return epoch_loss / len(unseen_train_data_iter), epoch_acc / len(unseen_train_data_iter), tp_total, tn_total, fp_total, fn_total 


unseen_test_loss, unseen_test_acc, tp_total, tn_total, fp_total, fn_total = evaluate_testset(model, unseen_train_data_iter, criterion)
print(f'Unseen Test Loss: {unseen_test_loss:.3f} | Unseen Test Acc: {unseen_test_acc * 100:.2f}%')
print(  'tp:', tp_total, 'tn:', tn_total, 'fp:', fp_total, 'fn:', fn_total)
prec = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
print('Recall', recall)
print('Prec', prec )
F1 = 2*prec*recall/ (prec+recall)
print('F1:', F1)
print("finished")   

## Covid dataset

In [None]:
df3 = pd.read_csv("/content/drive/MyDrive/coviddataset.csv",encoding='latin1')
df3.dropna(subset = ["Sentiment"], inplace=True)
df3.head()

In [None]:


df3['Sentiment'] = df3['Sentiment'].map({'Positive':'negative', 'Extremely Positive':"positive",'Negative':"negative",'Extremely Negative':"negative",'Neutral':"negative"})
df3=df3[["OriginalTweet","Sentiment"]]
df3.columns = ['text', 'label']
df3.sample(10)



In [None]:
df3.to_csv('/content/drive/MyDrive/covid_test.tsv', sep="\t",index=False)

In [None]:
df3.head()

In [None]:
unseen_data3 = data.TabularDataset(path="/content/drive/MyDrive/covid_test.tsv",format='tsv', fields=[('text', TEXT), ('label', LABEL)], skip_header=True)

In [None]:
unseen_data3

In [None]:
unseen_train_data, unseen_data = unseen_data3.split(split_ratio=0.99,
                                                      random_state=random.seed(
                                                          SEED))  


print(len(unseen_train_data))
print(len(unseen_data))

LABEL.build_vocab(unseen_train_data)

 

unseen_train_data_iter, unseen_data_iter = data.BucketIterator.splits((unseen_train_data, unseen_data),
                                                                        batch_size=256,
                                                                        sort_key=lambda x: len(x.text),
                                                                        sort_within_batch=True, device=device)
                                                                  
model.load_state_dict(torch.load('tut6-model.pt'))


def evaluate_testset(model, unseen_train_data_iter, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    tp_total = 0
    tn_total = 0 
    fp_total = 0
    fn_total =0
    total_no_inputs = 0
    model.eval()
    
    with torch.no_grad():
    
        for batch in unseen_train_data_iter:
            
            total_no_inputs += 256
            
            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)
            f1, precision, recall, tp, tn, fp, fn= f1_loss(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
            tp_total +=tp.item()
            fp_total +=fp.item()
            tn_total +=tn.item()
            fn_total +=fn.item()
           

    return epoch_loss / len(unseen_train_data_iter), epoch_acc / len(unseen_train_data_iter), tp_total, tn_total, fp_total, fn_total 


unseen_test_loss, unseen_test_acc, tp_total, tn_total, fp_total, fn_total = evaluate_testset(model, unseen_train_data_iter, criterion)
print(f'Unseen Test Loss: {unseen_test_loss:.3f} | Unseen Test Acc: {unseen_test_acc * 100:.2f}%')
print(  'tp:', tp_total, 'tn:', tn_total, 'fp:', fp_total, 'fn:', fn_total)
prec = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
print('Recall', recall)
print('Prec', prec )
F1 = 2*prec*recall/ (prec+recall)
print('F1:', F1)
print("finished")   

In [None]:
df3['label'].value_counts()