<a href="https://colab.research.google.com/github/DAbbott93/Dean-Abbott--Dissertation/blob/main/model%202%3A%20LSTM_BERT_EMBED.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LSTM model using BERT embeddings for hope speech detection and cross-domain testing

This notebook uses the pretrained BERT transformer model (from the transformers library) as the embedding layer for our LSTM model. It freezes BERT and trains only the remainder of the model, which learns from the representations produced by the transformer.

## Data preparation

In [None]:
!pip install torch==1.6.0 torchvision==0.7.0 torchtext==0.7.0

Set the seed to achieve reproducibility (you should test experiments across different seed values).

In [None]:
import torch

import random
import numpy as np

SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
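
The advice to test across several seed values can be sketched as follows. This is only an illustration: `run_experiment` is a hypothetical stand-in for a full train-and-evaluate run (here it just returns a seeded pseudo-random number), and the seed values are arbitrary.

```python
import random
import statistics

def run_experiment(seed):
    # Hypothetical placeholder: a real version would seed random, numpy and
    # torch, train the model, and return a validation metric such as F1.
    rng = random.Random(seed)
    return 0.5 + 0.1 * rng.random()  # stand-in metric in [0.5, 0.6)

seeds = [1234, 2021, 42]  # arbitrary example seeds
scores = [run_experiment(s) for s in seeds]

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)  # sample standard deviation across seeds
print(f"mean={mean_score:.3f} std={std_score:.3f} over {len(seeds)} seeds")
```

Reporting the mean and spread over seeds, rather than a single run, makes it easier to tell a real improvement from seed-to-seed noise.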

In [None]:
# check whether cuda is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


We must tokenise our data into the required format for BERT. We will use the BertTokenizer from the transformers library.

In [None]:
!pip install transformers

In [None]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Mark the beginning of each sentence with [CLS] and the end of each sentence with [SEP]. We also need padding and unknown tokens. This is the required format for BERT inputs.

In [None]:
init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, eos_token, pad_token, unk_token)

[CLS] [SEP] [PAD] [UNK]


We can get indexes for the special tokens from the tokenizer

In [None]:
init_token_idx = tokenizer.cls_token_id
eos_token_idx = tokenizer.sep_token_id
pad_token_idx = tokenizer.pad_token_id
unk_token_idx = tokenizer.unk_token_id

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

101 102 0 100


BERT was trained with a fixed maximum sequence length, so we set our maximum input length to the same value.

In [None]:
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']

print(max_input_length)

512


Define a function for tokenization. Note that our maximum length is 2 less than the actual maximum length. This is because we need to add two tokens to each sequence, one at the start and one at the end.



In [None]:
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence) 
    tokens = tokens[:max_input_length-2]
    return tokens
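
The length arithmetic can be checked with a small sketch that works on plain lists of ids. It assumes the special-token ids and maximum length printed earlier (101, 102 and 512); `cut_and_wrap` and the made-up input ids are hypothetical names used only for illustration. In the notebook itself the two special tokens are added by the TorchText field, not by hand.

```python
CLS_ID, SEP_ID = 101, 102   # ids printed earlier for [CLS] and [SEP]
MAX_INPUT_LENGTH = 512      # maximum length printed earlier

def cut_and_wrap(token_ids, max_len=MAX_INPUT_LENGTH):
    """Truncate to max_len - 2 ids, then add [CLS] at the start and [SEP] at the end."""
    body = token_ids[:max_len - 2]
    return [CLS_ID] + body + [SEP_ID]

long_ids = list(range(1000, 1600))   # 600 made-up token ids (hypothetical input)
wrapped = cut_and_wrap(long_ids)
print(len(wrapped))                  # 512

short = cut_and_wrap([7, 8, 9])
print(short)                         # [101, 7, 8, 9, 102]
```

Cutting to 510 tokens first guarantees that the sequence still fits in 512 positions once both special tokens are attached.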

Define the fields that specify how the data should be processed. Fields are a core concept of TorchText.
We use the TEXT field to define how the comment text should be processed, and the LABEL field to process the hope speech label.

In [None]:
!pip install torchtext



In [None]:
from torchtext import data

In [None]:
TEXT = data.Field(batch_first = True,
                  use_vocab = False,
                  tokenize = tokenize_and_cut,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = init_token_idx,
                  eos_token = eos_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

LABEL = data.LabelField(dtype = torch.float)



Load Data

In [None]:
!pip install datasets
from datasets import load_dataset
dataset= load_dataset("hope_edi", "english")
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 22762
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2843
    })
})


Mount Google Drive, then convert the dataset to dataframes and save them as TSV files

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from pandas import DataFrame

# Training dataset
df_train=DataFrame({'text':dataset['train']['text'], 'label': dataset['train']['label']})
print(df_train.shape)
# Map label ids to strings: 1 -> "negative" (non-hope speech), 0 -> "positive" (hope speech)
df_train['label'] = df_train['label'].replace([1], "negative")
df_train['label'] = df_train['label'].replace([0], "positive")
df_train.to_csv('/content/drive/MyDrive/Hope_Dataset/train.tsv', sep="\t",index=False)


# Validation dataset (saved as test.tsv and used later as the held-out test set)
df_val=DataFrame({'text':dataset['validation']['text'], 'label': dataset['validation']['label']}) 
# Relabel from df_val's own labels (not df_train's) so each label stays aligned with its text
df_val['label'] = df_val['label'].replace([1], "negative")
df_val['label'] = df_val['label'].replace([0], "positive")
df_val.to_csv('/content/drive/MyDrive/Hope_Dataset/test.tsv', sep="\t", index=False)

df_val.head()
df_train.tail()

(22762, 2)


Unnamed: 0,text,label
22757,It's a load of bollocks every life matters sim...,negative
22758,no say it because all lives matter! deku would...,negative
22759,God says her life matters,negative
22760,This video is just shit. A bunch of whiny ass ...,negative
22761,Mc Fortnut2821 she did 4 months ago in west ch...,negative


In [None]:
df_train.sample(20)

Unnamed: 0,text,label
11550,Fallon Daughtry How so?,negative
14824,fluff ball ok calm down it’s the youtube comme...,negative
15462,Noelle Kay How can it be shallow when its inf...,negative
19821,She looks good here but these new videos she’s...,negative
9886,George had a previous conviction for house rob...,negative
3872,sheriff for president,negative
5046,Yeah this channel has slowly crashed over the ...,negative
20105,Madonna has been an advocate from day dot. She...,positive
9695,@American Patriot! I'll pay. I would use my en...,negative
20157,Worthy cause until I heard the typical feminis...,negative


In [None]:
fields = [('text', TEXT), ('label', LABEL)]
#loading custom dataset
training_data=data.TabularDataset(path ="/content/drive/MyDrive/Hope_Dataset/train.tsv",format = 'tsv',fields = fields,skip_header = True)



# random.seed() returns None, so seed first and pass the resulting state explicitly
random.seed(SEED)
train_data, valid_data = training_data.split(split_ratio=0.8, random_state=random.getstate())



print(len(train_data))
print(len(valid_data))

LABEL.build_vocab(train_data)

# No. of unique tokens in label
print("Size of LABEL vocabulary:", len(LABEL.vocab))




18210
4552
Size of LABEL vocabulary: 3


Create Iterators

In [None]:
BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)
train_iterator, valid_iterator = data.BucketIterator.splits(
    (train_data, valid_data), 
    batch_size = BATCH_SIZE, 
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
    device = device)

# Build the model

In [None]:
from transformers import BertTokenizer, BertModel

bert = BertModel.from_pretrained('bert-base-uncased') # it is important to use the same model and tokenizer

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Define the model.
The pretrained BERT model serves as the embedding layer. The embeddings are fed into the LSTM, which produces a prediction for the label of the input sentence.

The embedding dimension size (hidden_size) comes from the transformer via its config attribute. The rest of the initialization is standard.

Within the forward pass, we wrap the transformer call in torch.no_grad() so that no gradients are calculated over this part of the model. The transformer returns the embeddings for the whole sequence as well as a pooled output. The documentation states that the pooled output is "usually not a good summary of the semantic content of the input, you're often better with averaging or pooling the sequence of hidden-states for the whole input sequence", hence we do not use it. The rest of the forward pass is a standard recurrent model: we take the hidden state from the final time-step and pass it through a linear layer to get our predictions.
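
Which rows of the LSTM's final hidden state belong to the last layer can be worked out with a short sketch. It relies on the layout PyTorch documents for `h_n`, namely `[n layers * n directions, batch size, hid dim]` ordered layer by layer with the forward direction before the backward one; `final_hidden_indices` is a hypothetical helper written only for this illustration.

```python
def final_hidden_indices(n_layers, bidirectional):
    """Return the row indices of h_n that hold the last layer's final states."""
    num_directions = 2 if bidirectional else 1
    last_layer = n_layers - 1
    # Rows are ordered layer by layer, with forward before backward in a layer
    return [last_layer * num_directions + d for d in range(num_directions)]

print(final_hidden_indices(n_layers=2, bidirectional=True))    # [2, 3]
print(final_hidden_indices(n_layers=1, bidirectional=False))   # [0]
```

For a two-layer bidirectional LSTM the result corresponds to `hidden[-2]` and `hidden[-1]`, which is why a bidirectional model concatenates those two rows before the linear layer.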


In [None]:
import torch.nn as nn

class BERTLSTMSentiment(nn.Module):
    def __init__(self,
                 bert,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 dropout):
        
        super().__init__()
        
        self.bert = bert
        
        embedding_dim = bert.config.to_dict()['hidden_size']
        
        self.rnn = nn.LSTM(embedding_dim,
                          hidden_dim,
                          num_layers = n_layers,
                          bidirectional = bidirectional,
                          batch_first = True,
                          dropout = 0 if n_layers < 2 else dropout)
        
        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        
        #text = [batch size, sent len]
                
        with torch.no_grad():
            # BERT is frozen: use its last hidden states as the embeddings
            embedded = self.bert(text)[0]
                
        #embedded = [batch size, sent len, emb dim]
        
        # nn.LSTM returns (output, (hidden, cell)); keep only the final hidden state
        _, (hidden, _) = self.rnn(embedded)
        
        #hidden = [n layers * n directions, batch size, hid dim]
        
        if self.rnn.bidirectional:
            # concatenate the last layer's forward and backward hidden states
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])
                
        #hidden = [batch size, hid dim * n directions]
        
        output = self.out(hidden)
           
        return output
            

In [None]:
import torch.nn as nn

class BERTLSTMSentiment(nn.Module):
    def __init__(self,
                 bert,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 dropout,
                 batch_size):

        super().__init__()

        self.hidden_dim = hidden_dim
        self.bert = bert

        embedding_dim = bert.config.to_dict()['hidden_size']

        # n_layers and bidirectional are accepted but not passed on,
        # so this is a single-layer, unidirectional LSTM
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        self.label = nn.Linear(hidden_dim, output_dim)

        # defined but not applied in forward
        self.dropout = nn.Dropout(dropout)
        self.batch_size = batch_size

    def forward(self, input_sentence, batch_size=None):

        # input_sentence = [batch size, sent len]
        with torch.no_grad():
            embedded = self.bert(input_sentence)[0]

        # nn.LSTM defaults to batch_first=False, so move the sequence dimension first
        # lstm_input = [sent len, batch size, emb dim]
        lstm_input = embedded.permute(1, 0, 2)

        output, (final_hidden_state, final_cell_state) = self.lstm(lstm_input)
        # final_hidden_state = [1, batch size, hidden dim]; final_output = [batch size, output dim]
        final_output = self.label(final_hidden_state[-1])

        return final_output

Define hyperparameters

In [None]:
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25
BATCH_SIZE = 128

model = BERTLSTMSentiment(bert,
                         HIDDEN_DIM,
                         OUTPUT_DIM,
                         N_LAYERS,
                         BIDIRECTIONAL,
                         DROPOUT,
                         BATCH_SIZE)


Define a function to count how many trainable parameters the model has. Keep in mind that roughly 110 million of these parameters come from the transformer model.

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

Freeze the BERT parameters (so they are not trained) by looping through all of the named_parameters in the model and setting requires_grad = False for those that are part of the bert transformer model.


In [None]:
for name, param in model.named_parameters():                
    if name.startswith('bert'):
        param.requires_grad = False

We can now see that our model has about 1.05M trainable parameters, making it almost comparable to the FastText model. However, the text still has to propagate through the transformer, which makes training take considerably longer.

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 1,050,881 trainable parameters
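
The printed count can be reproduced by hand. The sketch below assumes a single-layer, unidirectional LSTM followed by one linear layer, which is what `nn.LSTM(embedding_dim, hidden_dim)` and `nn.Linear(hidden_dim, output_dim)` create, and an embedding size of 768 (the hidden size of bert-base-uncased). The variable names are illustrative only.

```python
EMBEDDING_DIM = 768   # hidden_size of bert-base-uncased
HIDDEN_DIM = 256
OUTPUT_DIM = 1

# Each of the 4 LSTM gates has an input-to-hidden and a hidden-to-hidden
# weight matrix, plus two bias vectors (bias_ih and bias_hh).
lstm_weights = 4 * HIDDEN_DIM * (EMBEDDING_DIM + HIDDEN_DIM)
lstm_biases = 2 * 4 * HIDDEN_DIM
linear = HIDDEN_DIM * OUTPUT_DIM + OUTPUT_DIM

total = lstm_weights + lstm_biases + linear
print(f"{total:,}")   # 1,050,881
```

The match with the reported figure suggests that the N_LAYERS and BIDIRECTIONAL settings do not affect the trained model: a two-layer bidirectional LSTM would have several times as many parameters.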


Print the names of the trainable parameters

In [None]:
for name, param in model.named_parameters():                
    if param.requires_grad:
        print(name)

lstm.weight_ih_l0
lstm.weight_hh_l0
lstm.bias_ih_l0
lstm.bias_hh_l0
label.weight
label.bias


# Train the model

Define optimizer and loss function

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

Place the model and criterion onto the GPU (if available)

In [None]:
# check whether cuda is available
if torch.cuda.is_available():    
    # If a GPU is available tell PyTorch to use the GPU.    
    device = torch.device("cuda")
    # Print that a GPU is available and its name
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# If a GPU is not available, fall back to the CPU
else:
    device = torch.device("cpu")
    print('No GPU available, using the CPU instead.')

There are 1 GPU(s) available.
We will use the GPU: Tesla K80


In [None]:
model = model.to(device)
criterion = criterion.to(device)

Define method for calculating accuracy

In [None]:
def binary_accuracy(preds, y):
    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc
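
What the rounding step does to a single logit can be shown without tensors. This is a plain-Python sketch with made-up logits and labels; `predict_by_rounding` and `predict_by_sign` are hypothetical helpers. It relies on the fact that both Python's `round` and `torch.round` round halves to the nearest even number, so a logit of exactly 0 maps to class 0.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_by_rounding(logit):
    # Mirrors torch.round(torch.sigmoid(logit)) for a single value
    return round(sigmoid(logit))

def predict_by_sign(logit):
    # Equivalent shortcut: a positive logit means class 1
    return 1 if logit > 0 else 0

logits = [-3.2, -0.1, 0.0, 0.4, 2.5]   # made-up example logits
labels = [0, 0, 1, 1, 1]               # made-up example labels
preds = [predict_by_rounding(z) for z in logits]
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(preds, accuracy)                 # [0, 0, 0, 1, 1] 0.8
```

Rounding the sigmoid is therefore the same as thresholding the raw logit at 0, which corresponds to a probability threshold of 0.5.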

Define a function for performing a training epoch

In [None]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

Define a function for performing an evaluation epoch, and a helper for calculating how long a training/evaluation epoch takes.

In [None]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Train the model 

In [None]:
N_EPOCHS = 20

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
        
    end_time = time.time()
        
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
        
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'LSTM.PT')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')




Epoch: 01 | Epoch Time: 1m 12s
	Train Loss: 0.261 | Train Acc: 90.53%
	 Val. Loss: 0.222 |  Val. Acc: 91.84%
Epoch: 02 | Epoch Time: 1m 11s
	Train Loss: 0.211 | Train Acc: 92.02%
	 Val. Loss: 0.206 |  Val. Acc: 92.53%
Epoch: 03 | Epoch Time: 1m 11s
	Train Loss: 0.207 | Train Acc: 92.21%
	 Val. Loss: 0.211 |  Val. Acc: 91.92%
Epoch: 04 | Epoch Time: 1m 11s
	Train Loss: 0.195 | Train Acc: 92.62%
	 Val. Loss: 0.204 |  Val. Acc: 92.89%
Epoch: 05 | Epoch Time: 1m 11s
	Train Loss: 0.186 | Train Acc: 92.86%
	 Val. Loss: 0.195 |  Val. Acc: 92.98%
Epoch: 06 | Epoch Time: 1m 11s
	Train Loss: 0.176 | Train Acc: 93.28%
	 Val. Loss: 0.196 |  Val. Acc: 93.00%
Epoch: 07 | Epoch Time: 1m 11s
	Train Loss: 0.166 | Train Acc: 93.39%
	 Val. Loss: 0.192 |  Val. Acc: 92.89%
Epoch: 08 | Epoch Time: 1m 11s
	Train Loss: 0.157 | Train Acc: 93.85%
	 Val. Loss: 0.205 |  Val. Acc: 92.30%
Epoch: 09 | Epoch Time: 1m 11s
	Train Loss: 0.149 | Train Acc: 93.95%
	 Val. Loss: 0.200 |  Val. Acc: 92.96%
Epoch: 10 | Epoch T

Test set

In [None]:
def f1_loss(y_pred:torch.Tensor, y_true:torch.Tensor, is_training=False):
    '''Calculate F1 score. Can work with gpu tensors'''
   
    assert y_true.ndim == 1
    assert y_pred.ndim == 1 or y_pred.ndim == 2
    
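    if y_pred.ndim == 2:
        # multi-class logits: take the most likely class
        y_pred = y_pred.argmax(dim=1)
    else:
        # binary logits: threshold the sigmoid at 0.5
        y_pred = torch.round(torch.sigmoid(y_pred))
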
    tp = (y_true * y_pred).sum().to(torch.float32)
    tn = ((1 - y_true) * (1 - y_pred)).sum().to(torch.float32)
    fp = ((1 - y_true) * y_pred).sum().to(torch.float32)
    fn = (y_true * (1 - y_pred)).sum().to(torch.float32)
    
    epsilon = 1e-7
    
    precision = tp / (tp + fp + epsilon)
    recall = tp / (tp + fn + epsilon)
    
    f1 = 2* (precision*recall) / (precision + recall + epsilon)
    f1.requires_grad = is_training
    return f1, precision, recall, tp, tn, fp, fn


In [None]:
unseen_data = data.TabularDataset(path="/content/drive/MyDrive/Hope_Dataset/test.tsv",format='tsv', fields= [('text', TEXT), ('label', LABEL)], skip_header=True)


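# random.seed() returns None, so seed first and pass the resulting state explicitly
random.seed(SEED)
unseen_train_data, unseen_data = unseen_data.split(split_ratio=0.99,
                                                   random_state=random.getstate())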

print(len(unseen_train_data))
print(len(unseen_data))

LABEL.build_vocab(unseen_train_data)


unseen_train_data_iter, unseen_data_iter = data.BucketIterator.splits((unseen_train_data, unseen_data),
                                                                        batch_size=256,
                                                                        sort_key=lambda x: len(x.text),
                                                                        sort_within_batch=True, device=device)
                                                                
def evaluate_testset(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    tp_total = 0
    tn_total = 0 
    fp_total = 0
    fn_total = 0
    model.eval()
    
    with torch.no_grad():
    
        # iterate over the iterator passed in rather than a global variable
        for batch in iterator:
            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            f1, precision, recall, tp, tn, fp, fn = f1_loss(predictions, batch.label)

            # loss and accuracy are averaged per batch; confusion counts are summed per example
            epoch_loss += loss.item()
            epoch_acc += acc.item()
            tp_total += tp.item()
            fp_total += fp.item()
            tn_total += tn.item()
            fn_total += fn.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator), tp_total, tn_total, fp_total, fn_total 


unseen_test_loss, unseen_test_acc, tp_total, tn_total, fp_total, fn_total = evaluate_testset(model, unseen_train_data_iter, criterion)
print(f'Unseen Test Loss: {unseen_test_loss:.3f} | Unseen Test Acc: {unseen_test_acc * 100:.2f}%')
print(  'tp:', tp_total, 'tn:', tn_total, 'fp:', fp_total, 'fn:', fn_total)
prec = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
print('Recall', recall)
print('Prec', prec )
F1 = 2*prec*recall/ (prec+recall)
print('F1:', F1)
print("finished")   



2815
28




Unseen Test Loss: 1.073 | Unseen Test Acc: 82.80%
tp: 23.0 tn: 2305.0 fp: 272.0 fn: 215.0
Recall 0.09663865546218488
Prec 0.07796610169491526
F1: 0.08630393996247655
finished
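
The reported metrics can be rechecked directly from the confusion counts printed above. The sketch also adds one comparison that is not computed in the notebook: the accuracy of always predicting the majority (negative) class, which is a useful reference point when classes are imbalanced.

```python
# Confusion counts printed above for the Hope Speech test file
tp, tn, fp, fn = 23, 2305, 272, 215

total = tp + tn + fp + fn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority (negative) class
majority_baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} baseline={majority_baseline:.4f}")
```

The accuracy computed from the counts is close to, but not identical to, the reported figure, presumably because the notebook averages per-batch accuracies and the last batch is smaller. The majority-class baseline comes out higher than the model's accuracy, so on this run accuracy alone says little, and the low F1 is the more informative number.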




Cross-domain testing on the airline tweets dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

In [None]:
df2 = pd.read_csv("/content/drive/MyDrive/Tweets.csv",encoding='latin1')
df2.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [None]:
# Airline tweets: treat neutral as negative and keep only the text and label columns

df2["airline_sentiment"] = df2["airline_sentiment"].replace("neutral", "negative")
df2=df2[["text","airline_sentiment"]]
df2.columns = ['text', 'label']
df2.sample(10)


Unnamed: 0,text,label
14481,@AmericanAir appreciate update. Have also appr...,positive
7282,@JetBlue But not reddit? I work for the site a...,negative
12068,@AmericanAir r u serious?? 304min #delay with ...,negative
10565,@USAirways Flight 1815 (N747UW) arrives at @Fl...,negative
3599,@united are you trying to break a world record...,negative
2314,@united give me an email address and I'll send...,negative
13680,@AmericanAir-everyone: its been weeks&amp;thos...,negative
3512,@united ...she said she would need to get a su...,negative
2565,@united yes to more food! Add some gluten free...,positive
12334,"@AmericanAir depends on the terminal, what' th...",negative


In [None]:
df2.to_csv('/content/drive/MyDrive/AirlineTweet.tsv', sep="\t",index=False)

In [None]:
unseen_data2 = data.TabularDataset(path="/content/drive/MyDrive/AirlineTweet.tsv",format='tsv', fields = [('text', TEXT), ('label', LABEL)], skip_header=True)



In [None]:
print(df2['label'].value_counts())

negative    12277
positive     2363
Name: label, dtype: int64


In [None]:
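# random.seed() returns None, so seed first and pass the resulting state explicitly
random.seed(SEED)
unseen_train_data, unseen_data = unseen_data2.split(split_ratio=0.99,
                                                    random_state=random.getstate())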


print(len(unseen_train_data))
print(len(unseen_data))

LABEL.build_vocab(unseen_train_data)

unseen_train_data_iter, unseen_data_iter = data.BucketIterator.splits((unseen_train_data, unseen_data),
                                                                        batch_size=256,
                                                                        sort_key=lambda x: len(x.text),
                                                                        sort_within_batch=True, device=device)
                                                                  

unseen_test_loss, unseen_test_acc, tp_total, tn_total, fp_total, fn_total = evaluate_testset(model, unseen_train_data_iter, criterion)
print(f'Unseen Test Loss: {unseen_test_loss:.3f} | Unseen Test Acc: {unseen_test_acc * 100:.2f}%')
print(  'tp:', tp_total, 'tn:', tn_total, 'fp:', fp_total, 'fn:', fn_total)
prec = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
print('Recall', recall)
print('Prec', prec )
F1 = 2*prec*recall/ (prec+recall)
print('F1:', F1)
print("finished")   

14494
146




Unseen Test Loss: 1.268 | Unseen Test Acc: 84.17%
tp: 168.0 tn: 12037.0 fp: 128.0 fn: 2161.0
Recall 0.07213396307428081
Prec 0.5675675675675675
F1: 0.128
finished


In [None]:
df3 = pd.read_csv("/content/drive/MyDrive/coviddataset.csv",encoding='latin1')
df3.dropna(subset = ["Sentiment"], inplace=True)
df3.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16/03/2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16/03/2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16/03/2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16/03/2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16/03/2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative


In [None]:
df3['Sentiment'] = df3['Sentiment'].map({'Positive':'negative', 'Extremely Positive':"positive",'Negative':"negative",'Extremely Negative':"negative",'Neutral':"negative"})
df3=df3[["OriginalTweet","Sentiment"]]
df3.columns = ['text', 'label']
df3.sample(10)


Unnamed: 0,text,label
7650,@SkyUK why would you put your prices up now fo...,negative
7023,#Oklahoma state &amp;local governments financi...,negative
5304,Oil prices seem to be fighting a three-headed ...,negative
3433,".@Alex_Stafford, a Conservative MP in the UK p...",negative
720,Malaysia announced restricted movement re COVI...,negative
7745,"Hey, People of Twitter  Is there a place we c...",positive
2061,Sure Betfred won't be the last business to ask...,positive
3175,Could this #coronavirus crisis be a tipping po...,negative
6828,How about at 9pm tonight everyone in London ap...,positive
6830,I somehow think this pandemic will turn social...,negative


In [None]:
df3.to_csv('/content/drive/MyDrive/covid_test.tsv', sep="\t",index=False)

In [None]:
df3['label'].value_counts()

negative    6842
positive    1212
Name: label, dtype: int64

In [None]:
unseen_data3 = data.TabularDataset(path="/content/drive/MyDrive/covid_test.tsv",format='tsv', fields=[('text', TEXT), ('label', LABEL)], skip_header=True)



In [None]:
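# random.seed() returns None, so seed first and pass the resulting state explicitly
random.seed(SEED)
unseen_train_data, unseen_data = unseen_data3.split(split_ratio=0.99,
                                                    random_state=random.getstate())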

print(len(unseen_train_data))
print(len(unseen_data))

LABEL.build_vocab(unseen_train_data)

unseen_train_data_iter, unseen_data_iter = data.BucketIterator.splits((unseen_train_data, unseen_data),
                                                                        batch_size=256,
                                                                        sort_key=lambda x: len(x.text),
                                                                        sort_within_batch=True, device=device)
                                                                  



unseen_test_loss, unseen_test_acc, tp_total, tn_total, fp_total, fn_total = evaluate_testset(model, unseen_train_data_iter, criterion)
print(f'Unseen Test Loss: {unseen_test_loss:.3f} | Unseen Test Acc: {unseen_test_acc * 100:.2f}%')
print(  'tp:', tp_total, 'tn:', tn_total, 'fp:', fp_total, 'fn:', fn_total)
prec = tp_total/(tp_total + fp_total)
recall = tp_total/(tp_total + fn_total)
print('Recall', recall)
print('Prec', prec )
F1 = 2*prec*recall/ (prec+recall)
print('F1:', F1)
print("finished")   

7973
81




Unseen Test Loss: 1.151 | Unseen Test Acc: 82.61%
tp: 101.0 tn: 6548.0 fp: 231.0 fn: 1093.0
Recall 0.08458961474036851
Prec 0.3042168674698795
F1: 0.13237221494102228
finished
