## Description

This notebook contains the ML work on classifying hate speech comments.

***Objective:*** The main objective of this work is not to improve a metric (cost, accuracy). The dataset authors already tried several models (you can find more info here: https://arxiv.org/pdf/1809.04444.pdf). I expect to get better numbers simply because I will use a more powerful model: Bert (an attention model). My main goal is to compare two Bert approaches: fine-tuning vs. sentence embeddings.

In [3]:
import pkg_resources
import subprocess
import pickle
import shutil
import time
import sys
import os

from enum import Enum
from functools import partial

REQUIRED = {
  'spacy', 'scikit-learn', 'numpy', 'pandas', 'torch', 'transformers'
}

installed = {pkg.key for pkg in pkg_resources.working_set}
missing = REQUIRED - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)

from typing import List, Dict, Tuple
import glob

from dataclasses import dataclass

from sklearn.metrics import accuracy_score

import torch
from torch.utils.data import Dataset
from torch import nn

import numpy as np
import pandas as pd
from pandas import DataFrame

from transformers import PreTrainedTokenizer, PreTrainedModel
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import DistilBertModel, DistilBertPreTrainedModel
from transformers import Trainer, TrainingArguments
from transformers import EvalPrediction

### Read Data

In [4]:
# dataset git repo
if not os.path.exists("hate-speech-dataset"):
  !git clone https://github.com/Vicomtech/hate-speech-dataset.git

In [5]:
def readData(paths: List[str], group: str):
  pairs = []
  for p in paths:
    
    with open(p) as f:
      file_id = p.split('/')[-1].split('.')[0]
      pairs.append((file_id, f.read(), group))
  
  return pd.DataFrame(pairs, columns=["file_id", "text", "gSet"])

idLabels = pd.read_csv('hate-speech-dataset/annotations_metadata.csv')[["file_id", "label"]]

trainDf = readData(glob.glob('./hate-speech-dataset/sampled_train/*.txt'), 'train')
testDf = readData(glob.glob('./hate-speech-dataset/sampled_test/*.txt'), 'test')

trainDf = trainDf.join(idLabels.set_index('file_id'), on='file_id')
testDf = testDf.join(idLabels.set_index('file_id'), on='file_id')

print(trainDf.info(), '\n')
print(testDf.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1914 entries, 0 to 1913
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   file_id  1914 non-null   object
 1   text     1914 non-null   object
 2   gSet     1914 non-null   object
 3   label    1914 non-null   object
dtypes: object(4)
memory usage: 59.9+ KB
None 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   file_id  478 non-null    object
 1   text     478 non-null    object
 2   gSet     478 non-null    object
 3   label    478 non-null    object
dtypes: object(4)
memory usage: 15.1+ KB
None


# A) Tuning Pre-trained Distil Bert

The first approach is to fine-tune a pre-trained Bert model (Distil Bert to make it faster). I will use the transformers library to get the model. The setup is a simple binary classification model to distinguish between sentences with and without hate speech. 

In [6]:
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
# save the original set of model weights. They will be reloaded at the start of each training run
torch.save(model.state_dict(), "./tempBertState.pt")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

### Utility functions

In [7]:
@dataclass
class Triple:
  """
  Class for keeping track of single tokenized inputs 
  (for Pytorch Dataset __getitem__).
  """
  inputs: List[int]
  masks: List[int]
  label: int
  
  def triple(self):
    return (self.inputs, self.masks, self.label)

  
class HateDataset(Dataset):
  """PyTorch Dataset class. Required as Trainer input."""
  def __init__(self, 
    corpus: List[str], labels: List[int], 
    tokenizer: PreTrainedTokenizer=tokenizer,
    truncate=True,
    padding=True
  ):
    self.tokenizer = tokenizer
    self.trunc = truncate
    # corpus is entirely tokenized during instantiation
    self.corpusTokenized = tokenizer(corpus, padding=padding)
    self.labels = labels
    
  
  def __getitem__(self, i):
    inputs = self.corpusTokenized['input_ids'][i]
    masks = self.corpusTokenized['attention_mask'][i]
    
    return Triple(
      self._truncate(inputs) if self.trunc else inputs,
      self._truncate(masks, is_mask=True) if self.trunc else masks,
      self.labels[i]
    )
  
  def __len__(self):
    return len(self.labels)
  
  def _truncate(self, vector: List[int], is_mask=False, sep_token_num=102):
    """
    Truncate vector length to model max input size (the default transformers 
    library implementation does not seem to work with batches!).
    """
    limit = self.tokenizer.max_len
    
    if len(vector) <= limit:
      return vector
    elif is_mask:
      return vector[:limit]
    else:
      last = sep_token_num if vector[-1] not in (0, sep_token_num) else 0
      return vector[:limit - 1] + [last]
  
  
def collator(items: List[Triple]) -> Dict:
  """
  Collects list of items into a single dictionary; this dictionary's keys/values match 
  the parameter names and format of the model selected. This function is used by the 
  Pytorch Dataloader inside the Trainer class.
  """
  # keys for DistilBertForSequenceClassification
  input_k = 'input_ids'
  mask_k = 'attention_mask'
  label_k = 'labels'
  
  # zips all input_ids, attention_mask and labels together
  zipped =  list(map(list, zip(*[t.triple() for t in items])))
  # convert lists to pytorch tensors
  inputs, masks, labels = list(map(lambda ls: torch.tensor(ls), zipped))
  
  return {input_k: inputs, mask_k: masks, label_k: labels}


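def prepareDf(df: DataFrame, textCol='text', labelsCol='label'):
  """Extract relevant columns to be fed to a Dataset object."""
  return {
    "corpus": df[textCol].tolist(),
    # map string labels to integers and return a plain list, so indexing
    # by position does not depend on the DataFrame index
    "labels": df[labelsCol].map({'hate': 1, 'noHate': 0}).tolist()
  }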


def hateMetrics(pred: EvalPrediction):
  """
  Compute 3 accuracies: all labels, hate, noHate. These are the baseline metrics 
  used in the paper to compare models (https://arxiv.org/pdf/1809.04444.pdf, page 7).
  Function assumes 'hate' group uses label 1, 'noHate' uses label 0.
  
  For reference, results from paper:

      Accu | hate | noHate | all
      ---------------------------
      SVM  | 0.69 | 0.73   | 0.71
      CNN  | 0.55 | 0.79   | 0.66
      LSTM | 0.71 | 0.75   | 0.73
  """
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)

  # boolean masks select the predictions that belong to each true class
  hatePreds = preds[labels == 1]
  noHatePreds = preds[labels == 0]

  return {
      'accHate': accuracy_score(hatePreds, np.ones_like(hatePreds)),
      'accNoHate': accuracy_score(noHatePreds, np.zeros_like(noHatePreds)),
      'accAll': accuracy_score(labels, preds),
  }
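
To make the three metrics concrete, here is a small self-contained sketch in plain Python (no transformers or NumPy needed). The function name `per_class_accuracy` and the toy labels are illustrative assumptions, not part of the notebook. The logic mirrors `hateMetrics`: accuracy over the examples whose true label is 1, over those whose true label is 0, and over all examples.

```python
from typing import Dict, List


def per_class_accuracy(labels: List[int], preds: List[int]) -> Dict[str, float]:
    """Accuracy restricted to each true class, plus overall accuracy.

    Assumes binary labels (1 = 'hate', 0 = 'noHate') and at least one
    example of each class.
    """
    def accuracy(pairs):
        pairs = list(pairs)
        return sum(1 for y, p in pairs if y == p) / len(pairs)

    all_pairs = list(zip(labels, preds))
    return {
        "accHate": accuracy((y, p) for y, p in all_pairs if y == 1),
        "accNoHate": accuracy((y, p) for y, p in all_pairs if y == 0),
        "accAll": accuracy(all_pairs),
    }


# Toy data (hypothetical): 4 'hate' examples followed by 4 'noHate' examples.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_preds = [1, 1, 1, 0, 0, 0, 1, 1]
toy_metrics = per_class_accuracy(toy_labels, toy_preds)
print(toy_metrics)
```

When both classes have the same number of examples, as in this toy data, `accAll` is simply the mean of the two per-class accuracies.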

## Training

In [8]:
train_dataset = HateDataset(**prepareDf(trainDf))
test_dataset = HateDataset(**prepareDf(testDf))

# for testing purposes
train_ds_mock = HateDataset(**prepareDf(trainDf[:20]))
test_ds_mock = HateDataset(**prepareDf(testDf[:20]))

***Note:*** The Bert authors recommend the following ranges for fine-tuning:  
- Batch size: 16, 32
- Learning rate (Adam): 5e-5, 3e-5, 2e-5 
- Number of epochs: 2, 3, 4

(Appendix A.3 Fine-tuning Procedure - https://arxiv.org/pdf/1810.04805.pdf)

I am not interested in exhaustively optimizing a metric, so I won't test all possible combinations. I will, however, test different numbers of epochs to see if longer training yields better results. The paper (and several other sources, blogs and articles) mentions that fine-tuning for 2 or 3 epochs is enough for most tasks. I find this astounding, given that conventional NNs trained from scratch need tens or hundreds of epochs to reach optimal results. Let's see how Bert behaves as we increase the number of epochs.

In [9]:
# Trainer and TrainingArguments are both classes from transformers to facilitate training.

def trainingArgs(
  epochs: int, 
  trainDir: str, 
  batchSizeTrain=16, 
  batchSizeEval=32,
  training_set_len=len(train_dataset)
):
  """Return a TrainingArguments instance to be passed to Trainer class."""
  # calculate total training steps 
  totalSteps = int((training_set_len / batchSizeTrain) * epochs)
  # use 5% of all training steps as warmup
  warmup = int(totalSteps * 0.05)

  return TrainingArguments(
    output_dir=f"./{trainDir}/results", 
    logging_dir=f"./{trainDir}/logs",
    
    overwrite_output_dir=True,
    # trains faster without evaluation
    evaluate_during_training=False,
    
    per_device_train_batch_size=batchSizeTrain,   
    per_device_eval_batch_size=batchSizeEval, 

    num_train_epochs=epochs,
    warmup_steps=warmup,   
    
    # I won't be logging or checkpointing since 
    # training occurs fairly quickly
    logging_steps=9999,
    save_steps=9999,
    save_total_limit=1,
    
    # standard arguments
    learning_rate=5e-5,
    weight_decay=1e-2,
  )
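
As a quick check of the arithmetic inside `trainingArgs`, the sketch below reproduces the step and warmup computation in plain Python for the 1914 training examples reported above. The helper name `warmup_steps` is an assumption made for illustration. Note that the function divides without rounding up, while a data loader that keeps the last partial batch runs ceil(1914 / 16) = 120 iterations per epoch, so the estimate can be slightly below the real step count.

```python
import math
from typing import Tuple


def warmup_steps(
    n_examples: int, batch_size: int, epochs: int, fraction: float = 0.05
) -> Tuple[int, int]:
    """Mirror the estimate in trainingArgs: total steps, then a fraction as warmup."""
    total = int((n_examples / batch_size) * epochs)
    return total, int(total * fraction)


for epochs in (2, 3, 4):
    total, warmup = warmup_steps(1914, 16, epochs)
    # Steps actually run if the last partial batch is kept (an assumption).
    with_partial = math.ceil(1914 / 16) * epochs
    print(f"epochs={epochs}: estimated={total}, warmup={warmup}, "
          f"with partial batch={with_partial}")
```

One consequence worth keeping in mind: the fine-tuning loop in this notebook builds the arguments once with `trainingArgs(2, trainDir)` and then only overwrites `num_train_epochs`, so the 3-epoch and 4-epoch runs reuse the warmup value computed for 2 epochs.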

In [9]:
# training arguments 
trainDir = "training"
saveModelDir = "tuned-bert"
epochsList = [2, 3, 4]

embArgs= trainingArgs(2, trainDir)
trainDs = train_dataset
testDs = test_dataset

# uncomment these to pass arguments for testing purposes
# epochsList = [2]
# embArgs = trainingArgs(2, trainDir, 2, 4, len(train_ds_mock))
# trainDs = train_ds_mock
# testDs = test_ds_mock

In [10]:
finalMetrics = list()

for epoch in epochsList:
  # start each iteration with the original set of weights
  model.load_state_dict(torch.load("./tempBertState.pt"))
  
  embArgs.num_train_epochs = epoch
  
  trainer = Trainer(
    model=model,                         
    args=embArgs,   
    data_collator=collator,
    
    train_dataset=trainDs,         
    eval_dataset=testDs,
    compute_metrics=hateMetrics
  )

  trainer.train()
  evaMetrics: Dict = trainer.evaluate()
  trainLoss: float = trainer.evaluate(trainDs)['eval_loss']
  
  finalMetrics.append(
    {"epoch": epoch, "eval_train_loss": trainLoss, **evaMetrics}
  )

  trainer.save_model(f"./{saveModelDir}/epoch-{epoch}")
  
  # clean logs
  shutil.rmtree(f"./{trainDir}")

Tesla T4 with CUDA capability sm_75 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the Tesla T4 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/






In [20]:
metricsTunedDf = pd.DataFrame(finalMetrics)
metricsTunedDf

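|   | epoch | eval_train_loss | eval_loss | eval_accHate | eval_accNoHate | eval_accAll |
|---|-------|-----------------|-----------|--------------|----------------|-------------|
| 0 | 2.0 | 0.150157 | 0.438262 | 0.807531 | 0.83682 | 0.822176 |
| 1 | 3.0 | 0.048003 | 0.556381 | 0.799163 | 0.853556 | 0.82636 |
| 2 | 4.0 | 0.013872 | 0.701802 | 0.861925 | 0.820084 | 0.841004 |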


***Results:*** The evaluation loss is lowest after 2 epochs (0.438). Overall accuracy is highest after 4 epochs (0.841), but that model is clearly overfitting, given the sharp drop in eval_train_loss (0.150 to 0.014) together with the rise in eval_loss (0.438 to 0.702). I will select 2 epochs for further testing.

(**note:** loss should always be preferred over accuracy when selecting a model since label-based metrics are just superficial metrics for final assessment, whereas loss quantifies how correctly/wrongly confident the model is about its predictions).
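
A tiny numeric illustration of the note above, using only the standard library. The probabilities are made up for the example: two hypothetical models make the same label decisions (so their accuracy is identical), yet the one that is confidently wrong on its single mistake gets a much higher cross-entropy.

```python
import math
from typing import List


def cross_entropy(labels: List[int], p_hate: List[float]) -> float:
    """Mean negative log-likelihood for binary labels given P(label == 1)."""
    losses = [-math.log(p if y == 1 else 1.0 - p) for y, p in zip(labels, p_hate)]
    return sum(losses) / len(losses)


def accuracy(labels: List[int], p_hate: List[float]) -> float:
    """Fraction of correct decisions when thresholding P(label == 1) at 0.5."""
    return sum(int(p > 0.5) == y for y, p in zip(labels, p_hate)) / len(labels)


labels = [1, 1, 0, 0]
# Hypothetical predicted probabilities of the 'hate' class.
cautious = [0.80, 0.45, 0.20, 0.30]        # wrong on the 2nd example, but only mildly
overconfident = [0.99, 0.01, 0.01, 0.02]   # wrong on the 2nd example, and very sure

for name, probs in [("cautious", cautious), ("overconfident", overconfident)]:
    print(name, accuracy(labels, probs), round(cross_entropy(labels, probs), 3))
```

This is one way eval_loss can keep rising across epochs while accuracy stays flat or even improves: the decisions barely change, but the model becomes more confident in the ones it gets wrong.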

# B) Using Distil Bert embeddings + Distil Bert head

Now it is time to use the pre-trained Bert model to extract sentence embeddings. This part is really interesting since there are multiple strategies for building a sentence embedding from the hidden states. Can any of these strategies beat the previous fine-tuned model?

You can find some of the strategies used in the ***yieldEmbeds*** function.  

In [10]:
# we'll use this model to get the embeddings from the hidden states
rawModel = DistilBertModel.from_pretrained("distilbert-base-uncased", output_hidden_states=True)
rawModel.eval()

DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0): TransformerBlock(
        (attention): MultiHeadSelfAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): Linear(i

In [11]:
class FakeBertModel(DistilBertModel):
  """
  Class to mock Bert model to generate random embeddings (for testing purposes).
  """
  def __init__(self):
    self.name = 'FakeBertModel'
    
  def __call__(self, input_ids, attention_mask):
    batchSize = input_ids.shape[0]
    seqLen = input_ids.shape[1]
    embLen = 768
    attLayers = 7
    
    fakeModelOutput = tuple([
      torch.rand([batchSize, seqLen, embLen]),
      tuple([torch.rand([batchSize, seqLen, embLen]) for _ in range(attLayers)])
    ])

    return fakeModelOutput
    
  def eval(self):
    return

fakeModel = FakeBertModel()

### Utility functions

In [12]:
EmbeddingStrategy = Enum(
  'EmbeddingStrategy',
  'CLS LAST_MEAN LAST_MAX LAST2_MEAN LAST2_MAX'
)


def yieldEmbeds(allHiddenStates, strategy: EmbeddingStrategy):
  """
  Generate sentence embeddings from word embeddings (Bert hidden states).
  
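  Parameters:
  ----------
  - allHiddenStates: sequence of attention encoder hidden states, one tensor per layer 
    (at least the last two). Each tensor has shape (batch_size, seq_len, emb_len).
  - strategy: member of EmbeddingStrategy selecting how token vectors are pooled.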
  
  Strategies:
    - CLS: embedding for [CLS] token.
    - LAST_MEAN: mean of all word embeddings from last layer.
    - LAST_MAX: max pooling of all word embeddings from last layer (max at each dimension).
    - LAST2_MEAN: mean of all word embeddings from second to last layer.
    - LAST2_MAX: max pooling of all word embeddings from second to last layer.
    
  Return:
  -------
  Pytorch tensor with shape (batch_size, emb_len).
  """
  
  lastLayer = allHiddenStates[-1]
  lastLastLayer = allHiddenStates[-2]
  embeddings = None
  
  if strategy == EmbeddingStrategy.CLS:
    embeddings = lastLayer[:, 0]
  elif strategy == EmbeddingStrategy.LAST_MEAN:
    embeddings = lastLayer[:, 1:].mean(1)
  elif strategy == EmbeddingStrategy.LAST_MAX:
    embeddings = lastLayer[:, 1:].max(1)[0]
  elif strategy == EmbeddingStrategy.LAST2_MEAN:
    embeddings = lastLastLayer[:, 1:].mean(1)
  elif strategy == EmbeddingStrategy.LAST2_MAX:
    embeddings = lastLastLayer[:, 1:].max(1)[0]
  
  return embeddings

    
def collator2(items: List[Triple], model: DistilBertModel, strategy: EmbeddingStrategy) -> Dict:
  """
  Used for Pytorch Dataloader to collate a list of items. Embeddings 
  from input tokens are generated here.
  """
  # keys for DistilBert Head model
  input_emb, label_k = 'inputs_embeds', 'labels'
  inputDc = collator(items)
  inputDc = dict((k,v.to("cuda")) for k, v in inputDc.items())

  labels = inputDc['labels']
  inputDc.pop('labels')
  
  with torch.no_grad():
    # original output: Tuple[last_hidden_layer, all_hidden_layers]
    # slicing [1][-2:]: keep only the last 2 hidden layers
    hLayers = model(**inputDc)[1][-2:]
    embeds = yieldEmbeds(hLayers, strategy)
                                                
  return {input_emb: embeds, label_k: labels}
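
The pooling strategies can be sketched on a toy array with NumPy. This is an illustration under assumptions, not the notebook's code: `pool` is a hypothetical helper, the hidden states are random numbers, and the optional `mask` argument shows a masked mean that ignores padding positions, which `yieldEmbeds` does not do (it averages over every position after [CLS], padding included).

```python
from typing import Optional

import numpy as np


def pool(hidden: np.ndarray, how: str, mask: Optional[np.ndarray] = None) -> np.ndarray:
    """Reduce (batch, seq_len, dim) token vectors to (batch, dim) sentence vectors."""
    if how == "cls":
        return hidden[:, 0]                  # vector at position 0 ([CLS])
    tokens = hidden[:, 1:]                   # drop [CLS], as yieldEmbeds does
    if how == "max":
        return tokens.max(axis=1)            # max over positions, per dimension
    if how == "mean" and mask is None:
        return tokens.mean(axis=1)           # padding included, like the notebook
    if how == "mean":
        m = mask[:, 1:, None].astype(hidden.dtype)   # (batch, seq_len - 1, 1)
        # sum only real tokens, divide by their count (at least 1)
        return (tokens * m).sum(axis=1) / np.clip(m.sum(axis=1), 1.0, None)
    raise ValueError(f"unknown strategy: {how}")


rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 4))                    # batch=2, seq_len=5, dim=4
mask = np.array([[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]])    # second sentence is padded

for how in ("cls", "mean", "max"):
    print(how, pool(hidden, how).shape)
print("masked mean", pool(hidden, "mean", mask).shape)
```

Whether masking changes the ranking of the strategies on this dataset is not something the notebook measures; the sketch only shows where the two variants differ.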

## Training

The class below (**DisBertHeadHacked**) is simply a modification of the ***DistilBertForSequenceClassification*** class (original code: https://huggingface.co/transformers/_modules/transformers/modeling_distilbert.html#DistilBertForSequenceClassification). I have removed all computations related to the transformer itself (since we are not tuning the entire model), so only the classification head remains; this is the part we want to train.

In [44]:
class DisBertHeadHacked(DistilBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        
        self.num_labels = config.num_labels
        self.distilbert = DistilBertModel(config)
        self.pre_classifier = nn.Linear(config.dim, config.dim)
        self.classifier = nn.Linear(config.dim, config.num_labels)
        self.dropout = nn.Dropout(config.seq_classif_dropout)

        self.init_weights()

    def forward(
      self, 
      # sentence embeds. Shape: (bs, dim=768)
      inputs_embeds=None, 
      labels=None
    ):
      pooled_output = self.pre_classifier(inputs_embeds)  # (bs, dim)
      pooled_output = nn.ReLU()(pooled_output)  # (bs, dim)
      pooled_output = self.dropout(pooled_output)  # (bs, dim)
      logits = self.classifier(pooled_output)  # (bs, num_labels)

      outputs = (logits,) 
      if labels is not None:
          if self.num_labels == 1:
              loss_fct = nn.MSELoss()
              loss = loss_fct(logits.view(-1), labels.view(-1))
          else:
              loss_fct = nn.CrossEntropyLoss()
              loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
          outputs = (loss,) + outputs

      return outputs  # (loss), logits
    
# we are still instantiating the model from pretrained, but this is only for compatibility 
# reasons (Trainer class will complain if it finds the model does not extend a Pretrained one).
# In reality we are not using the pretrained model since forward() only runs the head defined above.
modelHead = DisBertHeadHacked.from_pretrained("distilbert-base-uncased", num_labels=2)
# save the original state
torch.save(modelHead.state_dict(), "./tempBertState.pt")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DisBertHeadHacked: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DisBertHeadHacked from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing DisBertHeadHacked from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DisBertHeadHacked were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on

In [45]:
# training arguments 
trainDir = "training"
saveModelDir = "bert-embeds"
epoch = 2

embModel = rawModel.to("cuda")
embArgs= trainingArgs(epoch, trainDir)
trainDs = train_dataset
testDs = test_dataset

# uncomment these to pass arguments for testing purposes
# embModel = fakeModel
# embArgs = trainingArgs(epoch, trainDir, 2, 4, len(train_ds_mock))
# trainDs = train_ds_mock
# testDs = test_ds_mock

In [53]:
finalMetrics = list()

# iterate over each embedding strategy
for embStrategy in EmbeddingStrategy:

  # load the original model state otherwise you would start
  # with weights learned from previous iteration
  modelHead.load_state_dict(torch.load("./tempBertState.pt"))
  
  # partially apply the collator
  collator2_ = partial(collator2, model=embModel, strategy=embStrategy)

  trainer = Trainer(
    model=modelHead,                         
    args=embArgs,   
    data_collator=collator2_,
    
    train_dataset=trainDs,         
    eval_dataset=testDs,
    compute_metrics=hateMetrics
  )

  trainer.train()
  evaMetrics: Dict = trainer.evaluate()
  trainLoss: float = trainer.evaluate(trainDs)['eval_loss']
  
  finalMetrics.append(
    {"emb-strategy": embStrategy.name, "eval_train_loss": trainLoss, **evaMetrics}
  )

  trainer.save_model(f"./{saveModelDir}/str-{embStrategy.name}")
  
  # clean logs
  shutil.rmtree(f"./{trainDir}")





In [54]:
metricsDf = pd.DataFrame(finalMetrics)
metricsDf

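|   | emb-strategy | eval_train_loss | eval_loss | eval_accHate | eval_accNoHate | eval_accAll | epoch |
|---|--------------|-----------------|-----------|--------------|----------------|-------------|-------|
| 0 | CLS | 0.549668 | 0.574593 | 0.786611 | 0.67364 | 0.730126 | 2.0 |
| 1 | LAST_MEAN | 0.540307 | 0.559487 | 0.83682 | 0.698745 | 0.767782 | 2.0 |
| 2 | LAST_MAX | 0.601751 | 0.612578 | 0.828452 | 0.682008 | 0.75523 | 2.0 |
| 3 | LAST2_MEAN | 0.511449 | 0.542598 | 0.811715 | 0.686192 | 0.748954 | 2.0 |
| 4 | LAST2_MAX | 0.585415 | 0.59983 | 0.824268 | 0.686192 | 0.75523 | 2.0 |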


***Results:*** The best strategy by eval_loss is LAST2_MEAN (0.543). This is the same default pooling used by the library **bert-as-service** (https://github.com/hanxiao/bert-as-service). None of these embeddings beats the fine-tuned model above (eval_loss of 0.438), so for this task fine-tuning works better than simply extracting embeddings. This makes sense: when we fine-tune, we are also tuning the embeddings themselves (the CLS token). It is worth noting that the CLS token is the input to the classification head, so after fine-tuning the CLS token can serve as the best representation (embedding) of the entire sentence.