<a href="https://colab.research.google.com/github/GusMalija/Master-Thesis-Project-Augustine-Malija/blob/main/deep_learning_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT Transformers for Climate Science stance detection

After loading the required libraries, the device is prepared for CUDA execution. This configuration is needed to make use of the GPU when one is available.

In [None]:
# Importing the libraries needed
import pandas as pd
import numpy as np
import re
import torch
!pip install transformers
import transformers
from torch.utils.data import Dataset, DataLoader
from transformers import DistilBertModel, DistilBertTokenizer

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)

In [None]:
# Setting up the device for GPU usage

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

In [None]:
#retrieving the data
url = "https://raw.githubusercontent.com/GusMalija/Master-Thesis-Project-Augustine-Malija/main/Data/predicted_labels.csv"

data = pd.read_csv(url)

#selecting random samples
#data = data.sample(n = 40000)
data.head()

Unnamed: 0.1,Unnamed: 0,tweet__id,tweet__text,created_at,classes,date,month,clean_tweets
0,0,1204651507636826112,Whilst we are asking our pollies to sort out c...,2019-12-11 06:37:55+00:00,1,2019-12-11 06:37:55+00:00,2019-12,whilst we are asking our pollies to sort out c...
1,1,1217712196882288640,@JustinTrudeau polar bears are greatly impacte...,2020-01-16 07:36:26+00:00,1,2020-01-16 07:36:26+00:00,2020-01,polar bears are greatly impacted by climate ch...
2,2,1201250688820486151,@WandaIsWhite @LadyRedWave These Ex Politician...,2019-12-01 21:24:17+00:00,1,2019-12-01 21:24:17+00:00,2019-12,these ex politicians will do anything to hang ...
3,3,1217460486888939520,Climate change is causing 'eco-anxiety' ― here...,2020-01-15 14:56:14+00:00,3,2020-01-15 14:56:14+00:00,2020-01,climate change is causing eco anxiety here wha...
4,4,1136037411526467584,@Femi_Sorry I got my dad to eventually see tha...,2019-06-04 22:30:00+00:00,1,2019-06-04 22:30:00+00:00,2019-06,sorry got my dad to eventually see that was ri...


In [None]:
def preprocess_text(sen):
    sen = str(sen)
    # Removing html tags
    #sentence = remove_tags(sen)
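    # Removing @mentions, URLs, hashtags, and any character that is not a letter, digit, space, or tab
    # (raw string so the regex escapes are not treated as Python string escapes)
    sen = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+|(\s([@#][\w_-]+)))", " ", sen).split())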
    # Remove punctuations and numbers
    sen = re.sub('[^a-zA-Z]', ' ', sen)
    # Single character removal
    sen = re.sub(r"\s+[a-zA-Z]\s+", ' ', sen)
    # Removing multiple spaces
    sen = re.sub(r'\s+', ' ', sen)
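    # Removing a single character at the start of the string
    # ('^' is the start-of-string anchor; the escaped '\^' would only match a literal caret)
    sen = re.sub(r'^[a-zA-Z]\s+', ' ', sen)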
    #substituting multiple spaces with single space
    sen = re.sub(r'\s+', ' ', sen, flags=re.I)
    #removing prefixes
    sen = re.sub(r'^b\s+', '', sen)
    #removing numbers
    sen = re.sub(r'[0-9]+', '', sen)
    #removing urls
    sen = re.sub(r"http\S+", "", sen)
    # Remove http:// links
    sen = re.sub(r'http:\/\/.*', '', sen)
    # Remove https:// links
    sen = re.sub(r'https:\/\/.*', '', sen)
    # Remove emojis
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
    
    sen = emoji_pattern.sub(r'', sen)
    #converting to lowercase
    sen = sen.lower()

    return sen

In [None]:
#applying the preprocessing function and adding a new column of processed tweets
data["clean_tweets"] = data['tweet__text'].apply(preprocess_text)
#dropping unnecessary columns
processed = data.drop(columns=["tweet__text", "Unnamed: 0","created_at", "date", "month"])

#encoding the class labels as consecutive integers (0, 1, 2), the format CrossEntropyLoss expects
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
processed["classes"] = label_encoder.fit_transform(processed["classes"])
processed.head()

Unnamed: 0,tweet__id,classes,clean_tweets
0,1204651507636826112,0,whilst we are asking our pollies to sort out c...
1,1217712196882288640,0,polar bears are greatly impacted by climate ch...
2,1201250688820486151,0,these ex politicians will do anything to hang ...
3,1217460486888939520,2,climate change is causing eco anxiety here wha...
4,1136037411526467584,0,sorry got my dad to eventually see that was ri...


<a id='section03'></a>
### Preparing the Dataset and Dataloader

This section defines a few key variables used during the training/fine-tuning stage. The `Dataset` class defines how the text is pre-processed before it is sent to the neural network, and the `DataLoader` feeds that data to the network in batches. Both are PyTorch constructs for defining and controlling data pre-processing and its passage to the neural network.

#### *Triage* Dataset Class
- This class accepts the DataFrame as input and generates tokenized output that the DistilBERT model uses for training.
- The DistilBERT tokenizer tokenizes the text in the `clean_tweets` column of the DataFrame.
- The tokenizer's `encode_plus` method performs the tokenization and generates the necessary outputs, namely `ids` (the token ids) and `mask` (the attention mask).
- `targets` represents the labeled classes of climate change tweet stances.
- The *Triage* class is used to create two datasets, one for training and one for validation.
- *Training Dataset* is used to fine-tune the model: **80% of the original data**.
- *Validation Dataset* is used to evaluate the performance of the model: the remaining 20%, which the model has not seen during training.
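
As a rough illustration of what a fixed `max_length` does to a tokenized tweet, the sketch below pads or truncates a list of token ids and builds the matching attention mask. It is plain Python, not the Hugging Face tokenizer: the function name `pad_or_truncate`, the pad id `0`, and the example ids are assumptions made for illustration, and the real tokenizer also handles special tokens during truncation.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (ids, mask), each of length max_len.

    ids  : token_ids cut to max_len, then right-padded with pad_id
    mask : 1 for real tokens, 0 for padding positions
    """
    ids = list(token_ids[:max_len])
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    ids += [pad_id] * n_pad
    mask += [0] * n_pad
    return ids, mask


# Hypothetical token ids for a short tweet (the values are made up)
short_ids, short_mask = pad_or_truncate([101, 4785, 2689, 102], max_len=6)
# A sequence longer than max_len is cut down to max_len
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_len=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `MAX_LEN = 512`, most positions of a tweet-length input are padding; the mask is what lets the model ignore them.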

#### Dataloader
- The DataLoader is used to create the training and validation loaders that feed data to the neural network in a defined manner. This is needed because the whole dataset cannot be loaded into memory at once, so the amount of data loaded into memory and then passed to the neural network needs to be controlled.
- This control is achieved using parameters such as `batch_size` and `max_len`.
- The training and validation dataloaders are used in the training and validation parts of the flow, respectively.
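
The batching idea can be sketched without PyTorch. The helper below only yields lists of dataset indices, `batch_size` at a time, optionally shuffled; `iterate_batches` and its parameters are hypothetical names for this illustration and do not reproduce the `DataLoader` API.

```python
import random


def iterate_batches(n_items, batch_size, shuffle=True, seed=None):
    """Yield lists of dataset indices, batch_size at a time."""
    indices = list(range(n_items))
    if shuffle:
        # Shuffle a private copy of the indices so every item is visited once
        random.Random(seed).shuffle(indices)
    for start in range(0, n_items, batch_size):
        # The last batch may be smaller than batch_size
        yield indices[start:start + batch_size]


batches = list(iterate_batches(n_items=10, batch_size=4, shuffle=False))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# Number of steps per epoch for 40,000 rows with a batch size of 4
steps_per_epoch = len(list(iterate_batches(40000, 4, shuffle=False)))
print(steps_per_epoch)  # 10000
```

Only one batch has to be tokenized and moved to the device at a time, which is what keeps memory use bounded.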

In [None]:
# Defining some key variables that will be used later on in the training
MAX_LEN = 512
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 2
EPOCHS = 1
LEARNING_RATE = 1e-05
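# The tokenizer must match the checkpoint used for the model (distilbert-base-uncased);
# the tweets are lowercased during preprocessing, so the uncased vocabulary fits
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')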

In [None]:
class Triage(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        tweets = str(self.data.clean_tweets[index])
        tweets = " ".join(tweets.split())
        inputs = self.tokenizer.encode_plus(
            tweets,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_token_type_ids=True,
            truncation=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'targets': torch.tensor(self.data.classes[index], dtype=torch.long)
        }
    
    def __len__(self):
        return self.len

In [None]:
# Creating the dataset and dataloader for the neural network

train_size = 0.8
train_dataset=processed.sample(frac=train_size,random_state=200)
test_dataset=processed.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)


print("FULL Dataset: {}".format(processed.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = Triage(train_dataset, tokenizer, MAX_LEN)
testing_set = Triage(test_dataset, tokenizer, MAX_LEN)

FULL Dataset: (50000, 3)
TRAIN Dataset: (40000, 3)
TEST Dataset: (10000, 3)


In [None]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

<a id='section04'></a>
### Creating the Neural Network for Fine Tuning

#### Neural Network
 - The neural network is created with the `DistillBERTClass`.
 - This network consists of the DistilBERT language model followed by a `Linear` pre-classifier layer with ReLU activation, a `dropout` layer, and a final `Linear` layer that produces the outputs (one score per stance class).
 - The data is fed to the DistilBERT language model as defined in the dataset.
 - The final layer's outputs are compared to the encoded class labels to determine the accuracy of the model's predictions.
 - An instance of the network called `model` is created. This instance is used for training and can then be saved for future inference.
 
#### Loss Function and Optimizer
 - The `Loss Function` and `Optimizer` are defined below, after the model is created.
 - The `Loss Function` is used to calculate the difference between the output produced by the model and the actual label.
 - The `Optimizer` is used to update the weights of the neural network to improve its performance.
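
To make the loss concrete, here is a minimal plain-Python sketch of cross-entropy for a single example with three classes. The function name and the example scores are assumptions for illustration. PyTorch's `CrossEntropyLoss` applies the same formula to each row of the model output and, by default, averages over the batch.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Equal scores for all three classes: the loss equals ln(3)
uniform = cross_entropy([0.0, 0.0, 0.0], target=1)
# A confident score on the correct class gives a small loss
confident = cross_entropy([0.0, 5.0, 0.0], target=1)
# The same confident score on a wrong class gives a large loss
wrong = cross_entropy([0.0, 5.0, 0.0], target=0)

print(uniform, confident, wrong)
```

A three-class model that has not learned anything yet tends to start near ln(3) ≈ 1.10, which is a useful reference point when reading the first logged training losses.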

In [None]:
# Creating the customized model, by adding a drop out and a dense layer on top of distil bert to get the final output for the model. 

class DistillBERTClass(torch.nn.Module):
    def __init__(self):
        super(DistillBERTClass, self).__init__()
        self.l1 = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.3)
        self.classifier = torch.nn.Linear(768, 3)

    def forward(self, input_ids, attention_mask):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

In [None]:
model = DistillBERTClass()
model.to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


DistillBERTClass(
  (l1): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_feat

In [None]:
# Creating the loss function and optimizer
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

<a id='section05'></a>
### Fine Tuning the Model

Here a training function is defined that trains the model on the training dataset created above for a specified number of epochs (`EPOCHS`). One epoch is one complete pass of the training data through the network.

The following events happen in this function to fine tune the neural network:
- The dataloader passes data to the model in batches of the configured batch size.
- The output of the model is compared with the actual category to calculate the loss.
- The loss value is used to optimize the weights of the neurons in the network.
- Every 30 steps the running average of the loss and the running accuracy are printed to the console.
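
The logged numbers are cumulative averages over all steps so far, not the loss of the current batch. The sketch below shows the difference on made-up data; `log_running_average` and the linearly decreasing losses are assumptions for illustration only.

```python
def log_running_average(batch_losses, log_every=30):
    """Return (step, running_mean) pairs, one per logging step."""
    total = 0.0
    logged = []
    for step, loss in enumerate(batch_losses):
        total += loss
        if step % log_every == 0:
            # Mean over every batch seen so far, including this one
            logged.append((step, total / (step + 1)))
    return logged


# Made-up per-batch losses that fall linearly from 1.0 to 0.4
losses = [1.0 - 0.01 * i for i in range(61)]
history = log_running_average(losses, log_every=30)

for step, mean_loss in history:
    print(f"step {step}: running mean {mean_loss:.2f}, batch loss {losses[step]:.2f}")
```

Because early, high-loss batches stay in the average, the running mean lags behind the current batch loss when training is improving.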

In [None]:
# Function to count the correct predictions in a batch (used to compute the accuracy of the model)

def calcuate_accu(big_idx, targets):
    n_correct = (big_idx==targets).sum().item()
    return n_correct

In [None]:
# Defining the training function on the 80% of the dataset for tuning the distilbert model

def train(epoch):
    tr_loss = 0
    n_correct = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.train()
    for _,data in enumerate(training_loader, 0):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)

        outputs = model(ids, mask)
        loss = loss_function(outputs, targets)
        tr_loss += loss.item()
        big_val, big_idx = torch.max(outputs.data, dim=1)
        n_correct += calcuate_accu(big_idx, targets)

        nb_tr_steps += 1
        nb_tr_examples+=targets.size(0)
        
        if _%30==0:
            loss_step = tr_loss/nb_tr_steps
            accu_step = (n_correct*100)/nb_tr_examples 
            print(f"Training Loss per 30 steps: {loss_step}")
            print(f"Training Accuracy per 30 steps: {accu_step}")

        # Reset the gradients, backpropagate the loss, and update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}')
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Training Loss Epoch: {epoch_loss}")
    print(f"Training Accuracy Epoch: {epoch_accu}")

    return 

In [None]:
for epoch in range(EPOCHS):
    train(epoch)



Training Loss per 30 steps: 1.1250410079956055
Training Accuracy per 30 steps: 25.0
Training Loss per 30 steps: 1.0153899423537716
Training Accuracy per 30 steps: 55.645161290322584
Training Loss per 30 steps: 0.9565299798230655
Training Accuracy per 30 steps: 61.47540983606557
Training Loss per 30 steps: 0.9394036387349223
Training Accuracy per 30 steps: 61.26373626373626
Training Loss per 30 steps: 0.932294722184662
Training Accuracy per 30 steps: 61.77685950413223
Training Loss per 30 steps: 0.9184977212883779
Training Accuracy per 30 steps: 62.913907284768214
Training Loss per 30 steps: 0.9022999254379483
Training Accuracy per 30 steps: 64.3646408839779
Training Loss per 30 steps: 0.8930556858885345
Training Accuracy per 30 steps: 64.45497630331754
Training Loss per 30 steps: 0.8841350418650757
Training Accuracy per 30 steps: 64.73029045643153
Training Loss per 30 steps: 0.8955182646473395
Training Accuracy per 30 steps: 63.92988929889299
Training Loss per 30 steps: 0.8870505785625

<a id='section06'></a>
### Validating the Model

During the validation stage the unseen data (the testing dataset) is passed to the model. This step determines how well the model performs on unseen data.

During the validation stage the weights of the model are not updated. Only the final output is compared to the actual value. This comparison is then used to calculate the accuracy of the model.
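
A minimal sketch of this comparison, in plain Python with made-up scores: the predicted class is the index of the highest score, and the counts of correct predictions and of examples are accumulated across batches. The helper `count_correct` and the two example batches are hypothetical.

```python
def count_correct(batch_logits, batch_targets):
    """Count rows whose highest-scoring class equals the target label."""
    correct = 0
    for logits, target in zip(batch_logits, batch_targets):
        predicted = max(range(len(logits)), key=lambda k: logits[k])
        correct += int(predicted == target)
    return correct


# Two made-up batches of 2 examples each, 3 classes per example
batches = [
    ([[2.0, 0.1, 0.3], [0.2, 0.1, 1.5]], [0, 1]),
    ([[0.0, 3.0, 0.5], [1.2, 0.4, 0.3]], [1, 0]),
]

# Both counters are created once, before the loop, and only ever grow
n_correct = 0
n_examples = 0
for logits, targets in batches:
    n_correct += count_correct(logits, targets)
    n_examples += len(targets)

accuracy = 100.0 * n_correct / n_examples
print(f"{n_correct}/{n_examples} correct, accuracy = {accuracy:.1f}%")
```

If the example counter were reset inside the loop while the correct count kept growing, the ratio would exceed 100%, so both accumulators need the same scope.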


In [None]:
def valid(model, testing_loader):
    model.eval()
    # Accumulators are initialized once, before the loop, so the reported
    # loss and accuracy are averages over all batches seen so far
    n_correct = 0
    tr_loss = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.long)
            outputs = model(ids, mask)
            loss = loss_function(outputs, targets)
            tr_loss += loss.item()
            # Predicted class = index of the highest score in each row
            big_val, big_idx = torch.max(outputs.data, dim=1)
            n_correct += calcuate_accu(big_idx, targets)

            nb_tr_steps += 1
            nb_tr_examples+=targets.size(0)
            
            if _%30==0:
                loss_step = tr_loss/nb_tr_steps
                accu_step = (n_correct*100)/nb_tr_examples
                print(f"Validation Loss per 30 steps: {loss_step}")
                print(f"Validation Accuracy per 30 steps: {accu_step}")
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Validation Loss Epoch: {epoch_loss}")
    print(f"Validation Accuracy Epoch: {epoch_accu}")
    
    return epoch_accu


In [None]:
print('This is the validation section to print the accuracy and see how the model performs')
print('Here we are leveraging the dataloader created for the validation dataset; the approach relies mostly on PyTorch')

acc = valid(model, testing_loader)
print("Accuracy on test data = %0.2f%%" % acc)

This is the validation section to print the accuracy and see how the model performs
Here we are leveraging the dataloader created for the validation dataset; the approach relies mostly on PyTorch




Validation Loss per 30 steps: 1.0753190517425537
Validation Accuracy per 30 steps: 50.0
Validation Loss per 30 steps: 1.2043761014938354
Validation Accuracy per 30 steps: 1550.0
Validation Loss per 30 steps: 1.0624420642852783
Validation Accuracy per 30 steps: 3050.0
Validation Loss per 30 steps: 0.9247564077377319
Validation Accuracy per 30 steps: 4400.0
Validation Loss per 30 steps: 1.095865249633789
Validation Accuracy per 30 steps: 6100.0
Validation Loss per 30 steps: 0.9801040887832642
Validation Accuracy per 30 steps: 7700.0
Validation Loss per 30 steps: 1.0455188751220703
Validation Accuracy per 30 steps: 9050.0
Validation Loss per 30 steps: 0.9449798464775085
Validation Accuracy per 30 steps: 10600.0
Validation Loss per 30 steps: 1.0476458072662354
Validation Accuracy per 30 steps: 12400.0
Validation Loss per 30 steps: 1.082362174987793
Validation Accuracy per 30 steps: 13550.0
Validation Loss per 30 steps: 0.9653517007827759
Validation Accuracy per 30 steps: 15350.0
Validation