# Lab 6: BERT for hate speech detection

In this lab, we will take you through a practical use of Transformers. This notebook shows you how to use [Hugging Face](https://huggingface.co/)'s ``transformers`` package to import and train pretrained models for the tasks of hate speech classification and machine translation.

We first show you all the necessary components for using the ``transformers`` package before asking you to implement some code in the later sections.


**Note:** The training of models will take quite some time so make sure to run this session with the GPU enabled. 


## Setting up the Environment

First, we need to install Hugging Face [transformers](https://huggingface.co/transformers/index.html) and the [SentencePiece](https://github.com/google/sentencepiece) tokenizer library with the following commands:

In [None]:
#! pip install torch

In [None]:
!pip install transformers
!pip install sentencepiece
!pip install ipywidgets
!jupyter nbextension enable --py widgetsnbextension

In [None]:
import torch
import transformers
import torch.nn as nn
from torch.utils.data import DataLoader
from tqdm.notebook import tqdm

from transformers import Trainer, TrainingArguments
from transformers import BertTokenizer
from transformers import BertPreTrainedModel, BertModel

import pandas as pd
import numpy as np
import os

from sklearn.metrics import classification_report

if not torch.cuda.is_available():
  print('WARNING: You may want to change the runtime to GPU for faster training!')
  DEVICE = 'cpu'
else:
  DEVICE = 'cuda:0'

If you work in Colab, mount your Google Drive to save models and training checkpoints. Run the following code to connect your Google Drive to Colab: click on the link, then copy and paste the code you are given into the input box.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

%cd '/content/drive/MyDrive/Colab Notebooks/'
%mkdir './Lab 6'
%cd './Lab 6' 

## Hate Speech Classification

### Downloading the dataset

For the task of hate speech classification, we will work with the [Offensive Language Identification Dataset - OLID](https://scholar.harvard.edu/malmasi/olid). It is a dataset of tweets hierarchically annotated on three levels:

* Level A: Offensive Language Detection
* Level B: Categorization of Offensive Language
* Level C: Offensive Language Target Identification


Let's download it first.

In [None]:
%mkdir ./data
%cd ./data

if not os.path.isfile('pretrain.txt'): 
  !wget -O pretrain.txt https://www.dropbox.com/s/bavjtyx0ndty7xt/pretrain.txt?dl=0

if not os.path.isfile('OLIDv1.0.zip'): 
  !wget -O OLIDv1.0.zip https://sites.google.com/site/offensevalsharedtask/olid/OLIDv1.0.zip
  ! unzip OLIDv1.0.zip
  
%cd ..


Let's have a look at the data we downloaded.

As mentioned above, the ``OLID`` dataset has been labeled for three subtasks; we therefore have three different label sets per tweet:
* Task A: Not Offensive (``NOT``) and Offensive (``OFF``).
* Task B: Targeted Insult (``TIN``), Untargeted (``UNT``) and ``NULL`` for not offensive tweets.
* Task C: Individual (``IND``), Group (``GRP``), Other (``OTH``) and ``NULL`` for not offensive and non targeted tweets.
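
As a minimal sketch of how these label sets become integer targets, the snippet below maps one tab-separated row onto class indices, using the position in each list as the class id. The sample row and the helper name `encode_labels` are illustrative assumptions and are not part of the dataset tooling.

```python
# Label vocabularies for the three OLID levels; the list position is the class id.
SET_A = ['NOT', 'OFF']
SET_B = ['NULL', 'TIN', 'UNT']
SET_C = ['NULL', 'IND', 'GRP', 'OTH']

def encode_labels(row):
    """Turn one tab-separated row (id, tweet, A, B, C) into (text, [a, b, c])."""
    items = row.rstrip('\n').split('\t')
    text = items[1]
    label_ids = [SET_A.index(items[2]),
                 SET_B.index(items[3]),
                 SET_C.index(items[4])]
    return text, label_ids

# A made-up row in the same column layout as the training file.
sample_row = "12345\t@USER this is a made-up example\tOFF\tTIN\tIND\n"
text, label_ids = encode_labels(sample_row)
print(text, label_ids)
```

A tweet that is not offensive carries ``NULL`` on levels B and C, so it would be encoded with index 0 on all three levels.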

In [None]:
df = pd.read_csv('./data/olid-training-v1.0.tsv',delimiter="\t")

print(f'Number of training samples: {len(df)}')

df.head()

### Loading and preprocessing the corpus 


Let's define ``reader_train`` and ``reader_test`` that will prepare our data corpus and labels for both train and test set.

In [None]:
def reader_train(file_name):
    texts = []
    labels = []
    set_a = ['NOT' , 'OFF']
    set_b = ['NULL', 'TIN', 'UNT']
    set_c = ['NULL', 'IND', 'GRP', 'OTH']
    # the context manager closes the file once all lines have been read
    with open(file_name, encoding='utf-8') as fin:
        fin.readline()  # skip the header row
        for line in fin:
            items = line.split('\t')
            text = items[1]
            # map each label string to its index in the corresponding label set
            label_a = set_a.index(items[2].strip())
            label_b = set_b.index(items[3].strip())
            label_c = set_c.index(items[4].strip())

            if len(text) > 0:
                texts.append(text)
                labels.append([label_a, label_b, label_c])

    return {'texts':texts, 'labels':labels}

In [None]:
def reader_test(test_textlist, test_labellist):
    texts = []
    labels = []
    text_dict = {}
    
    # build text_dict
    for file_text in test_textlist:
        fin = open(file_text)
        title = fin.readline()
        while True:
            line = fin.readline()
            if not line:
                break
            items = line.split('\t')
            if items[0] not in text_dict:
                text_dict[items[0]] = items[1]
        fin.close()
    label_dict_list = []
    
    # build label_dict
    for i, file_label in enumerate(test_labellist):
        label_dict_list.append({})
        fin = open(file_label)
        title = fin.readline()
        while True:
            line = fin.readline()
            if not line:
                break
            items = line.split(',')
            label_dict_list[i][items[0]] = items[1]
        fin.close()    
    
    set_a = ['NOT' , 'OFF']
    set_b = ['NULL', 'TIN', 'UNT']
    set_c = ['NULL', 'IND', 'GRP', 'OTH']
    
    for idx, text in text_dict.items():
        if len(text) > 0:
            texts.append(text)
            if idx in label_dict_list[0]:
                label_a = label_dict_list[0][idx]
            else:
                label_a = 'OFF'
            if idx in label_dict_list[1]:
                label_b = label_dict_list[1][idx]
            else:
                label_b = 'NULL'
            if idx in label_dict_list[2]:
                label_c = label_dict_list[2][idx]
            else:
                label_c = 'NULL'
            
            label_a = set_a.index(label_a.strip())
            label_b = set_b.index(label_b.strip())
            label_c = set_c.index(label_c.strip())
        
            labels.append([label_a, label_b, label_c])
            
    return {'texts':texts, 'labels':labels}            


We also define our custom ``OlidDataset`` class which allows us to control how we handle the iteration and batches.

At each iteration over the dataset object, the function ``__getitem__`` is called and returns a dictionary with a tweet and its 3 labels. 
Then, the ``collate_fn`` function will process the list of samples into their encodings and return a batch when called by the iterator during training.
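
To make the batching step concrete, here is a toy sketch of what a collate function does, with a whitespace "tokenizer" standing in for the real one. The names `toy_encode` and `toy_collate`, the growing vocabulary, and the pad id `0` are assumptions for illustration only; the real ``collate_fn`` delegates this work to the Hugging Face tokenizer.

```python
PAD_ID = 0  # assumed padding id for this toy example

def toy_encode(text, vocab):
    """Map each whitespace-separated word to an integer id (growing the vocab)."""
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]

def toy_collate(batch, vocab, max_length=8):
    """Turn a list of samples into padded id lists, attention masks and labels."""
    # encode every text and truncate it to max_length
    ids = [toy_encode(sample['text'], vocab)[:max_length] for sample in batch]
    longest = max(len(seq) for seq in ids)
    # pad every sequence to the length of the longest one in the batch
    input_ids = [seq + [PAD_ID] * (longest - len(seq)) for seq in ids]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in ids]
    return {'input_ids': input_ids,
            'attention_mask': attention_mask,
            'label_a': [sample['label_a'] for sample in batch]}

vocab = {}
batch = [{'text': 'a short tweet', 'label_a': 0},
         {'text': 'a slightly longer made-up tweet', 'label_a': 1}]
out = toy_collate(batch, vocab)
print(out['input_ids'])
print(out['attention_mask'])
```

The attention mask is what lets the model ignore the padded positions, which is why it travels with the input ids in the batch.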

In [None]:
class OlidDataset(torch.utils.data.Dataset):

    def __init__(self, tokenizer, input_set):

        self.tokenizer = tokenizer
        self.texts = input_set['texts']
        self.labels = input_set['labels']
        
    def collate_fn(self, batch):

        texts = []
        labels_a = []
        labels_b = []
        labels_c = []
        for b in batch:
            texts.append(b['text'])
            labels_a.append(b['label_a'])
            labels_b.append(b['label_b'])
            labels_c.append(b['label_c'])

        # The maximum sequence length for BERT is 512, but here the tokenizer truncates sentences longer than 128 tokens.
        # With padding=True, shorter sentences are padded to the length of the longest sentence in the batch.
        encodings = self.tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=128)
        labels = {}
        encodings['label_a'] =  torch.tensor(labels_a)
        encodings['label_b'] =  torch.tensor(labels_b)
        encodings['label_c'] =  torch.tensor(labels_c)
        
        return encodings
    
    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
       
        item = {'text': self.texts[idx],
                'label_a': self.labels[idx][0],
                'label_b': self.labels[idx][1],
                'label_c': self.labels[idx][2]}
        return item


Now let's put it all together and load our data. We use the tokenizer that matches our BERT model. Here we pick the pre-trained model ``bert-base-cased``; other BERT models of various sizes (base, large) are available.

**Note:** ``bert-base-cased`` is case-sensitive: it differentiates between English and english. A non-case-sensitive variant is ``bert-base-uncased``.

You can always use another [tokenizer](https://huggingface.co/transformers/main_classes/tokenizer.html), but we will get better results using the same tokenizer as the one used to pre-train the model.


In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# we can check the parameters of this tokenizer
tokenizer

In [None]:
trainset = reader_train('./data/olid-training-v1.0.tsv')
testset = reader_test(['./data/testset-levela.tsv','./data/testset-levelb.tsv','./data/testset-levelc.tsv'], 
                      ['./data/labels-levela.csv','./data/labels-levelb.csv','./data/labels-levelc.csv'])

train_dataset = OlidDataset(tokenizer, trainset)
test_dataset = OlidDataset(tokenizer, testset)

The following code lets you play around with our ``train_dataset`` object.

In [None]:
# returns the first item as a dictionary
#print(train_dataset[0])

# put all train set into one batch for the collate_fn function
batch = [sample for sample in train_dataset]

encodings = train_dataset.collate_fn(batch[:10])

for key, value in encodings.items():
  print(f"{key}: {value.numpy().tolist()}")



### Finetuning a pre-trained BERT model


As you may recall from the lecture, BERT is pre-trained on Masked Language Modeling (MLM) and Next Sentence Prediction (NSP); it is not trained to do sentence classification. We therefore need to adapt it for hate speech classification and finetune the pre-trained model on our dataset.




Let's have a look at the summary of ``bert-base-cased``.

In [None]:
model = BertModel.from_pretrained("bert-base-cased")

# roughly 108M parameters for bert-base-cased
print(f"Model size: {model.num_parameters()}")

#model summary
model

Note that the model has only encoder layers.

#### BERT Model

To define our model, we will build on top of a Hugging Face pre-trained model and adapt it to our task. We will use ``BertModel`` to extract embeddings and add a ``Linear`` layer to classify samples. The Hugging Face implementation of BERT can handle different variations of the model, whose parameter values are defined and passed via ``config``.


The code below defines a model adapted to classify tweets on Level A, Offensive Language Detection. We will implement Task B and C later.



In [None]:
class BERT_hate_speech(BertPreTrainedModel):

    def __init__(self, config):
        super().__init__(config)

        # BERT Model
        self.bert = BertModel(config)
        
        # Task A
        self.projection_a = torch.nn.Sequential(torch.nn.Dropout(0.2),
                                                torch.nn.Linear(config.hidden_size, 2))
        
        # Task B
        # TBA
        
        # Task C
        # TBA
        
        self.init_weights()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None):
 
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        # Logits A
        logits_a = self.projection_a(outputs[1])
        
        return logits_a


#### Finetuning

Finally, we should define our training loop. Fortunately, the ``transformers`` package provides us with a [``Trainer``](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer) class which takes care of training transformer models.


We build our custom ``Trainer`` class to incorporate our own ``compute_loss`` function. It receives all three labels but, for now, computes the loss on Task A only.
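
For intuition about the loss used for Task A, the sketch below computes the cross-entropy of a single example by hand from a pair of raw logits, that is, `-log(softmax(logits)[target])`. It is a plain-Python illustration only; the function name `cross_entropy` and the example logits are assumptions, and ``nn.CrossEntropyLoss`` additionally averages over the batch by default.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one example: -log(softmax(logits)[target])."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]

# Hypothetical logits for the classes [NOT, OFF].
logits = [2.0, 0.5]
loss_if_not = cross_entropy(logits, target=0)  # the model agrees with the label
loss_if_off = cross_entropy(logits, target=1)  # the model disagrees with the label
print(loss_if_not, loss_if_off)
```

The loss is small when the largest logit belongs to the gold class and grows as the model puts more weight on the wrong class.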

In [None]:

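class Trainer_hate_speech(Trainer):
    # return_outputs and **kwargs keep the signature compatible with the
    # arguments that different transformers versions pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # take the labels out of the inputs: the model's forward does not expect them
        labels = {}
        labels['label_a'] = inputs.pop('label_a')
        labels['label_b'] = inputs.pop('label_b')
        labels['label_c'] = inputs.pop('label_c')

        outputs = model(**inputs)

        # TASK A
        loss_task_a = nn.CrossEntropyLoss()
        labels_a = labels['label_a']
        loss_a = loss_task_a(outputs.view(-1, 2), labels_a.view(-1))

        loss = loss_a
        
        return (loss, outputs) if return_outputs else loss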


Now let's finetune the pretrained model on our ``OlidDataset``.

In our function ``main_hate_speech`` we define the arguments for the ``Trainer`` object and launch the training with ``trainer.train``. 


In [None]:
def main_hate_speech():

    #call our custom BERT model and pass as parameter the name of an available pretrained model
    model = BERT_hate_speech.from_pretrained("bert-base-cased")
    
    training_args = TrainingArguments(
        output_dir='./experiment/hate_speech',
        learning_rate = 0.0001,
        logging_steps= 100,
        per_device_train_batch_size=32,
        num_train_epochs = 3,
        remove_unused_columns=False # keep every field of our samples: by default the Trainer drops fields that the model's forward() does not accept
    )
    trainer = Trainer_hate_speech(
        model=model,                         
        args=training_args,                 
        train_dataset=train_dataset,                   
        data_collator=train_dataset.collate_fn,
    )

    trainer.train()

    trainer.save_model('./models/ht_bert_finetuned/')



Let's run it.

In [None]:
main_hate_speech()

#### Evaluation
Once we have trained our model, we can evaluate it on our test set.

Let's define a helper function ``predict_hatespeech`` that will extract the predicted label.
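
Note that the maximum taken over the model output is a raw logit, not a probability, so the ``confidence`` value is only comparable within one prediction. As a hedged illustration, the plain-Python sketch below shows how logits could be turned into probabilities with a softmax; the helper name `softmax` and the sample logits are assumptions, and this helper is not used by the lab code.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the classes [NOT, OFF].
logits = [1.2, -0.3]
probs = softmax(logits)
prediction = probs.index(max(probs))  # same argmax as on the raw logits
print(prediction, probs[prediction])
```

Because the softmax preserves the ordering of the logits, the predicted class is unchanged; only the reported score becomes interpretable as a probability.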

In [None]:
def predict_hatespeech(input, tokenizer, model): 
  model.eval()
  encodings = tokenizer(input, return_tensors='pt', padding=True, truncation=True, max_length=128)
  
  output = model(**encodings)
  preds = torch.max(output, 1)

  return {'prediction':preds[1], 'confidence':preds[0]}

Now let's define a function that will evaluate our model on the test set we prepared.
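
The report contains per-class precision, recall and F1. As a rough sketch of what those numbers mean, the snippet below computes them by hand for one class on made-up labels; the function name `precision_recall_f1` and the toy data are assumptions, and the lab itself relies on ``classification_report`` from scikit-learn.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, computed from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up gold labels and predictions (0 = not offensive, 1 = offensive).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(precision, recall, f1)
```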

In [None]:
def evaluate(model, tokenizer, data_loader):

  total_count = 0
  correct_count = 0 

  preds = []
  tot_labels = []

  with torch.no_grad():
    for data in tqdm(data_loader): 

      labels = {}
      labels['label_a'] = data['label_a']

      tweets = data['text']

      pred = predict_hatespeech(tweets, tokenizer, model)

      # extend (rather than append) keeps both lists flat for any batch size
      preds.extend(pred['prediction'].tolist())
      tot_labels.extend(labels['label_a'].tolist())

  # with the saved predictions and labels we can compute accuracy, precision, recall and f1-score
  report = classification_report(tot_labels, preds, target_names=["Not offensive","Offensive"], output_dict= True)

  return report

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

#your saved model name here
model_name = './models/ht_bert_finetuned/' 
model = BERT_hate_speech.from_pretrained(model_name)

# we don't batch our test set unless it's too big
test_loader = DataLoader(test_dataset)

report = evaluate(model, tokenizer, test_loader)

print(report)

print(report['accuracy'])
print(report['Not offensive']['f1-score'])
print(report['Offensive']['f1-score'])

Let's test our model on a few sentences to get an intuition. Feel free to play around.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BERT_hate_speech.from_pretrained('./models/ht_bert_finetuned/')

print(predict_hatespeech("I go see penguins at the zoo.", tokenizer, model))
print(predict_hatespeech("Bananas are stupid", tokenizer, model))

### Pre-training and finetuning BERT

In this section, we will implement our own masked language modeling (MLM) training.

#### Pre-training

**Question 1: Add MLM head for pretraining**

Your task is to fill in the following classes and functions to implement MLM training: 

* ``PretrainDataset()``
* ``Trainer_MLM()``
* ``BERT_pretrain()``
* ``main_pretrain()``

To train our model with the MLM objective, we need to adjust our ``Dataset`` class. We want to train BERT to predict a fraction of the tokens (15% in the original paper): of the selected tokens, 80% are replaced by a ``[MASK]`` token, 10% are replaced by a random token, and 10% are left unchanged.

We introduce the function ``mask_tokens`` that will take care of that.
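
The 80/10/10 rule can be sketched in plain Python on a list of token strings. This only illustrates the sampling logic under stated assumptions: the function name `toy_mask_tokens`, the toy vocabulary, and the use of `None` for positions without a loss are made up here, and special or padding tokens (which the tensor version excludes) are not modelled.

```python
import random

def toy_mask_tokens(tokens, vocab, mlm_probability=0.15, rng=random):
    """Return (masked_tokens, labels); labels are None where no loss is computed."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mlm_probability:
            continue                       # token not selected for prediction
        labels[i] = token                  # the model must recover the original token
        draw = rng.random()
        if draw < 0.8:
            masked[i] = '[MASK]'           # 80%: replace with the mask token
        elif draw < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels

rng = random.Random(0)  # seeded only to make the demo repeatable
tokens = 'the quick brown fox jumps over the lazy dog'.split()
# a high probability is used here so that the short demo sentence gets some masks
masked, labels = toy_mask_tokens(tokens, vocab=tokens, mlm_probability=0.5, rng=rng)
print(masked)
print(labels)
```

Keeping some selected tokens unchanged or swapping them for random ones means the model cannot rely on seeing ``[MASK]`` to know which positions it will be asked about.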

In [None]:
class PretrainDataset(torch.utils.data.Dataset):

    def __init__(self, tokenizer, input_file):

        self.tokenizer = tokenizer

        self.texts = self.read_text(input_file)

        self.mlm_probability = 0.15
        
    def read_text(self, input_file):

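        ## Question 1 ##

        # read one sample per line, skip empty lines, and close the file afterwards
        with open(input_file, encoding='utf-8') as fin:
            return [line.strip() for line in fin if line.strip()]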
        
    def collate_fn(self, batch):
       
        ## Question 1 ##

        encodings = self.tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=128)

        inputs, labels = self.mask_tokens(encodings["input_ids"])
        # pass the attention mask along so that padding tokens are not attended to
        return {"input_ids": inputs, "attention_mask": encodings["attention_mask"], "labels": labels}
    
    def mask_tokens(self, inputs):
        """
        Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original.
        """
        if self.tokenizer.mask_token is None:
            raise ValueError(
                "This tokenizer does not have a mask token which is necessary for masked language modeling. Remove the --mlm flag if you want to use this tokenizer."
            )
        labels = inputs.clone()

        # We sample a few tokens in each sequence for masked-LM training (with probability args.mlm_probability defaults to 0.15 in Bert/RoBERTa)
        probability_matrix = torch.full(labels.shape, self.mlm_probability)
        special_tokens_mask = [
            self.tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
        ]
        probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
        
        if self.tokenizer.pad_token is not None:
            padding_mask = labels.eq(self.tokenizer.pad_token_id)
            probability_matrix.masked_fill_(padding_mask, value=0.0)
        masked_indices = torch.bernoulli(probability_matrix).bool()
        labels[~masked_indices] = -100  # We only compute loss on masked tokens

        # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
        indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
        inputs[indices_replaced] = self.tokenizer.convert_tokens_to_ids(self.tokenizer.mask_token)

        # 10% of the time, we replace masked input tokens with random word
        indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
        random_words = torch.randint(len(self.tokenizer), labels.shape, dtype=torch.long)
        inputs[indices_random] = random_words[indices_random]

        # The rest of the time (10% of the time) we keep the masked input tokens unchanged
        return inputs, labels
    
    def __len__(self):
        
        ## Question 1 ##

        return len(self.texts)

    def __getitem__(self, idx):

        ## Question 1 ##
 
        text = self.texts[idx]
        return text

The next step is to add an MLM head to our model. 
Use the ``BertOnlyMLMHead`` to add an MLM classifier to BERT.

In [None]:
from transformers.models.bert.modeling_bert import BertOnlyMLMHead

class BERT_pretrain(BertPreTrainedModel):

    def __init__(self, config):
        super().__init__(config)
        self.config = config

        ## Question 1 ##
        # BERT Model
        self.bert = BertModel(config)
        
        
        ## Question 1 ##
        # MLM head
        self.cls = BertOnlyMLMHead(config)
        
        

        self.init_weights()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None):

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        ## Question 1 ##

        # MLM output
        prediction_scores = self.cls(outputs[0])
        
        return prediction_scores

We will define a new Trainer class for pre-training. 

**Note:** We could also use the standard ``Trainer`` class to train our model. In that case, ``BERT_pretrain`` would need to return the ``loss`` together with the BERT ``outputs`` as a tuple ``(loss, outputs)``.




In [None]:
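class Trainer_MLM(Trainer):
    # return_outputs and **kwargs keep the signature compatible with the
    # arguments that different transformers versions pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        
        labels = inputs['labels']

        outputs = model(**inputs)

        # MLM loss (positions labelled -100 are ignored by CrossEntropyLoss)
        lm_loss = nn.CrossEntropyLoss()

        loss_mlm = lm_loss(outputs.view(-1, model.config.vocab_size), labels.view(-1))
        
        loss = loss_mlm
        
        return (loss, outputs) if return_outputs else loss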

Finally, put everything together in the ``main_pretrain()`` function. 

In the cell below, write code to pre-train your custom MLM model on the ``pretrain.txt`` file found in the ``data`` folder.





In [None]:
def main_pretrain():
    
    ## Question 1 ##

    model = BERT_pretrain.from_pretrained("bert-base-cased")
    
    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
    pretrain_dataset = PretrainDataset(tokenizer, 'data/pretrain.txt')
    
    training_args = TrainingArguments(
        output_dir='./experiment/pretrain',
        learning_rate = 0.00005,
        num_train_epochs =1,
        save_steps = 10000,  #saves a checkpoint file every 10000 iterations
        per_device_train_batch_size=64,
        remove_unused_columns=False
    )
    trainer = Trainer_MLM(
        model=model,                         
        args=training_args,                 
        train_dataset=pretrain_dataset,                    
        data_collator=pretrain_dataset.collate_fn
    )

    trainer.train()
    
    trainer.save_model('./models/ht_bert_pretrained/')
    

Running the pre-training for one epoch will take about 2 hours.

In [None]:
main_pretrain()

#### Finetuning

**Question 2: Load the pretrained model for finetuning**

In the code below, modify the ``main_hate_speech`` function from earlier to load the model we just pre-trained, and finetune it on the training set of our ``OlidDataset``.

**Note**: The intermediate checkpoints of your pre-trained model are saved in your ``output_dir`` folder, and the final model is saved in ``./models/ht_bert_pretrained/``.
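
``Trainer`` writes intermediate checkpoints into sub-folders named ``checkpoint-<step>`` inside ``output_dir``. As a small sketch, the hypothetical helper below (`latest_checkpoint` is made up for this example and is not part of ``transformers``) picks the sub-folder with the highest step number; the demo builds a throwaway directory rather than touching your real experiment folder.

```python
import os
import re
import tempfile

def latest_checkpoint(output_dir):
    """Return the path of the checkpoint-<step> folder with the largest step, or None."""
    pattern = re.compile(r'^checkpoint-(\d+)$')
    steps = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            steps.append((int(match.group(1)), name))
    if not steps:
        return None
    # compare by the numeric step, not by the folder name as a string
    return os.path.join(output_dir, max(steps)[1])

# Demo on a temporary directory with two made-up checkpoint folders.
demo_dir = tempfile.mkdtemp()
for step in (500, 10000):
    os.makedirs(os.path.join(demo_dir, f'checkpoint-{step}'))
print(latest_checkpoint(demo_dir))
```

If training was interrupted before the final model was saved, the returned path could be passed to ``from_pretrained`` instead of the final model folder.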

In [None]:
def main_hate_speech():

    ## Question 2 ##

    model = BERT_hate_speech.from_pretrained("./models/ht_bert_pretrained/")
    
    training_args = TrainingArguments(
        output_dir='./experiment/hate_speech',
        learning_rate = 0.0001,
        logging_steps= 500,
        per_device_train_batch_size=32,
        num_train_epochs = 1,
        remove_unused_columns=False
    )
    trainer = Trainer_hate_speech(
        model=model,                         
        args=training_args,                 
        train_dataset=train_dataset,        
        eval_dataset=test_dataset,             
        data_collator=train_dataset.collate_fn
    )

    trainer.train()

    trainer.save_model('./models/ht_bert_pretrained_finetuned/')


In [None]:
main_hate_speech()

#### Evaluation

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

#your saved model name here
model_name = './models/ht_bert_pretrained_finetuned/' 
model = BERT_hate_speech.from_pretrained(model_name)

test_loader = DataLoader(test_dataset)

report = evaluate(model, tokenizer, test_loader)
print(report)
print(report['accuracy'])
print(report['Not offensive']['f1-score'])
print(report['Offensive']['f1-score'])

## Multi-task Hate Speech Classification

It's time to add the two other tasks to our implementation of ``BERT_hate_speech()``.

**Question 3: Add multiple heads (Task B, Task C) for multi-task hate speech classification**

Fill in the missing code from the following classes:

* ``BERT_hate_speech_multitask()``
* ``Trainer_hate_speech_multitask()``
* ``main_hate_speech_multitask()``

### Multi-task Model

In [None]:

class BERT_hate_speech_multitask(BertPreTrainedModel):

    def __init__(self, config):
        super().__init__(config)
        
        # BERT Model
        self.bert = BertModel(config)
        
        # Task A
        self.projection_a = torch.nn.Sequential(torch.nn.Dropout(0.2),
                                                torch.nn.Linear(config.hidden_size, 2))
        
        ##  Question 3 ##

        # Task B
        self.projection_b = torch.nn.Sequential(torch.nn.Dropout(0.2),
                                                torch.nn.Linear(config.hidden_size, 3))

        # Task C
        self.projection_c = torch.nn.Sequential(torch.nn.Dropout(0.2),
                                                torch.nn.Linear(config.hidden_size, 4))
        
        self.init_weights()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None):

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        # Task A
        logits_a = self.projection_a(outputs[1])
        
        ##  Question 3 ##
        
        # Task B
        logits_b = self.projection_b(outputs[1])
      
        # Task C 
        logits_c = self.projection_c(outputs[1])

        return (logits_a, logits_b, logits_c)

In [None]:
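class Trainer_hate_speech_multitask(Trainer):
    # return_outputs and **kwargs keep the signature compatible with the
    # arguments that different transformers versions pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # take the labels out of the inputs: the model's forward does not expect them
        labels = {}
        labels['label_a'] = inputs.pop('label_a')
        labels['label_b'] = inputs.pop('label_b')
        labels['label_c'] = inputs.pop('label_c')

        outputs = model(**inputs)
        (out_a, out_b, out_c) = outputs

        # LOSS A
        loss_task_a = nn.CrossEntropyLoss()
        labels_a = labels['label_a']
        loss_a = loss_task_a(out_a.view(-1, 2), labels_a.view(-1))

        ## QUESTION 3 ##        
        # LOSS B
        loss_task_b = nn.CrossEntropyLoss()
        labels_b = labels['label_b']
        loss_b = loss_task_b(out_b.view(-1, 3), labels_b.view(-1))

        # LOSS C
        loss_task_c = nn.CrossEntropyLoss()
        labels_c = labels['label_c']
        loss_c = loss_task_c(out_c.view(-1, 4), labels_c.view(-1))

        # the joint loss is the unweighted sum of the three task losses
        loss = loss_a + loss_b + loss_c
        
        return (loss, outputs) if return_outputs else loss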

Just as in the finetuning task, instantiate a ``BERT_hate_speech_multitask`` model from a pre-trained model and finetune it on our ``train_dataset``.

In [None]:
def main_hate_speech_multitask():
    ##  Question 3 ##

    model = BERT_hate_speech_multitask.from_pretrained("bert-base-cased")
    
    training_args = TrainingArguments(
        output_dir='./experiment/hate_speech_multitask',
        learning_rate = 0.0001,
        logging_steps= 100,
        num_train_epochs = 3,
        per_device_train_batch_size=64,
        remove_unused_columns=False
    )
    trainer = Trainer_hate_speech_multitask(
        model=model,                         
        args=training_args,                 
        train_dataset=train_dataset,                 
        data_collator=train_dataset.collate_fn
    )
    trainer.train()

    trainer.save_model('./models/ht_bert_multi_finetuned/')

Running the code below should take ~10 min for 3 epochs.

In [None]:
main_hate_speech_multitask()

### Evaluation

In [None]:
def predict_hatespeech_multitask(input, tokenizer, model): 
  model.eval()
  encodings = tokenizer(input, return_tensors='pt', padding=True, truncation=True, max_length=128)
  
  (out1, out2, out3) = model(**encodings)
  
  preds_a = torch.max(out1, 1)
  preds_b = torch.max(out2, 1)
  preds_c = torch.max(out3, 1)

  preds = (preds_a[1], preds_b[1], preds_c[1])
  scores = (preds_a[0], preds_b[0], preds_c[0])

  return {'predictions':preds, 'confidences':scores}

In [None]:
def evaluate_multitask(model, tokenizer, data_loader):

  task_num = 3
  total_count = 0
  correct_count = [0] * task_num  
  accuracies = [0] * task_num

  batch_size = data_loader.batch_size

  with torch.no_grad():
    for data in tqdm(data_loader): 

      labels = {}
      labels['label_a'] = data['label_a']
      labels['label_b'] = data['label_b']
      labels['label_c'] = data['label_c']

      tweets = data['text']

      pred = predict_hatespeech_multitask(tweets, tokenizer, model)

      preds = pred['predictions'] 

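      for i, label in enumerate(labels):
        # count the correct predictions of this batch (a sum, so any batch size works)
        correct_count[i] += torch.sum(preds[i] == labels[label]).item()

      # count the samples actually seen; the last batch can be smaller than batch_size
      total_count += len(tweets)

    for i in range(task_num):
      accuracies[i] = correct_count[i] / total_count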

 
  return accuracies

In [None]:

model = BERT_hate_speech_multitask.from_pretrained("./models/ht_bert_multi_finetuned/")
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

test_loader = DataLoader(test_dataset)

accuracies = evaluate_multitask(model, tokenizer, test_loader)


In [None]:
for i in range(3):
    print('Task %d accuracy: %2.2f %%' % (i, 100.0*accuracies[i]))
    

In [None]:
print(predict_hatespeech_multitask("I go see penguins at the zoo.", tokenizer, model)['predictions'])
print(predict_hatespeech_multitask("Bananas are so stupid ", tokenizer, model)['predictions'])