# BERT NUSWhispers

This notebook documents our explorations of BERT for the NUSWhispers sentiment analysis task. Note that it is still in a fairly raw form and has not been tidied up.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Supply a GitHub personal access token through the GITHUB_TOKEN variable; never hard-code it in the notebook
!git clone https://$GITHUB_TOKEN@github.com/CindyTsai1/CS4248-Team23.git

Cloning into 'CS4248-Team23'...
remote: Enumerating objects: 266, done.
remote: Counting objects: 100% (266/266), done.
remote: Compressing objects: 100% (176/176), done.
remote: Total 266 (delta 132), reused 216 (delta 84), pack-reused 0
Receiving objects: 100% (266/266), 45.50 MiB | 19.03 MiB/s, done.
Resolving deltas: 100% (132/132), done.
Checking out files: 100% (151/151), done.


In [None]:
!pip install datasets transformers

Collecting datasets
  Downloading https://files.pythonhosted.org/packages/54/90/43b396481a8298c6010afb93b3c1e71d4ba6f8c10797a7da8eb005e45081/datasets-1.5.0-py3-none-any.whl (192kB)
Collecting transformers
  Downloading https://files.pythonhosted.org/packages/81/91/61d69d58a1af1bd81d9ca9d62c90a6de3ab80d77f27c5df65d9a2c1f5626/transformers-4.5.0-py3-none-any.whl (2.1MB)
Collecting fsspec
  Downloading https://files.pythonhosted.org/packages/62/11/f7689b996f85e45f718745c899f6747ee5edb4878cadac0a41ab146828fa/fsspec-0.9.0-py3-none-any.whl (107kB)
Collecting huggingface-hub<0.1.0
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting xxhash
  Downloading https://files.py

In [None]:
# check successful installation of transformers
!python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('we love you'))"

2021-04-12 13:24:52.410243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Downloading: 100% 629/629 [00:00<00:00, 627kB/s]
Downloading: 100% 268M/268M [00:03<00:00, 71.7MB/s]
Downloading: 100% 232k/232k [00:00<00:00, 921kB/s]
Downloading: 100% 48.0/48.0 [00:00<00:00, 39.5kB/s]
[{'label': 'POSITIVE', 'score': 0.9998704791069031}]


In [None]:
model_name = 'bert-base-cased'
num_labels = 5

# 1. Extracting Contextual Embeddings of Pre-Trained BERT

As we also have other non-textual features, we could extract the contextual embeddings of the NUSWhispers posts from the pre-trained BERT model. 

These embeddings (of size 768) can then be combined with the other features and fed into another classification model, e.g. Logistic Regression or a simple neural network (this is done in our repository's [main.py](https://github.com/CindyTsai1/CS4248-Team23/blob/main/main.py)).
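
As a rough illustration of the combination step described above, the sketch below concatenates toy 768-dimensional embeddings with a few non-textual features. The feature count, the random stand-in data and the standardisation step are assumptions made for this example; they are not taken from main.py.

```python
import numpy as np

rng = np.random.default_rng(0)

n_posts = 4      # toy number of posts
bert_dim = 768   # size of a bert-base [CLS] embedding
n_other = 3      # hypothetical number of non-textual features

# Random stand-ins for real data: one embedding row and one feature row per post.
bert_embeddings = rng.normal(size=(n_posts, bert_dim))
other_features = rng.normal(size=(n_posts, n_other))

# Standardise the extra features so their scale is comparable to the embeddings.
mean = other_features.mean(axis=0)
std = other_features.std(axis=0)
std[std == 0] = 1.0  # guard against constant columns
other_scaled = (other_features - mean) / std

# One combined row per post, ready for a downstream classifier.
combined = np.hstack([bert_embeddings, other_scaled])
print(combined.shape)
```

The resulting matrix has one row per post and can be passed to any classifier that accepts dense features.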

In [None]:
import pandas as pd
import torch
import transformers

# Load pretrained model/tokenizer
model_class, tokenizer_class, pretrained_weights = (transformers.BertModel, 
                                                    transformers.BertTokenizer, 
                                                    'bert-base-cased')
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

In [None]:
data_file = '/content/CS4248-Team23/data/v6_remove_punctuation_remove_non_english_correct_spelling_replace_short_form_slang.csv'
df = pd.read_csv(data_file)

def generate_embedding(x):
  encoding = tokenizer.encode(x, truncation=True, padding=True, max_length=512)
  input_ids = torch.tensor(encoding).unsqueeze(0)
  with torch.no_grad():
    output = model(input_ids)
    last_hidden_state = output[0]
  # Get [CLS] embedding
  features = last_hidden_state[:,0,:].numpy()
  return features

embeddings = df['text'].apply(generate_embedding)

In [None]:
embeddings.to_csv('drive/MyDrive/pt_bert_embeddings.csv')
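
Because `generate_embedding` returns one `(1, 768)` array per post, the Series above holds array objects, and `to_csv` stores their string form. A possible alternative, sketched here with toy data and an in-memory buffer instead of a Drive path, is to stack the arrays into a single numeric matrix before saving. The variable names are illustrative only.

```python
import io
import numpy as np

# Toy stand-in for the Series produced above: one (1, 768) array per post.
per_post = [np.full((1, 768), fill_value=i, dtype=np.float32) for i in range(3)]

# Stack into a single (n_posts, 768) matrix.
matrix = np.vstack(per_post)

# Write and read back through an in-memory buffer (a file path works the same way).
buffer = io.BytesIO()
np.save(buffer, matrix)
buffer.seek(0)
restored = np.load(buffer)

print(matrix.shape, np.array_equal(matrix, restored))
```

Saved this way, the embeddings reload as a float matrix with no string parsing.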

# 2. Finetuning BERT on NUSWhispers sentiment analysis task

In the previous section, the pre-trained BERT model was used only to extract embeddings for the NUSWhispers posts.

In this section, we fine-tune the pre-trained BERT on the NUSWhispers posts for our text classification (sentiment analysis) task.

References:
- https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

In [None]:
# load NUSWhispers dataset
from copy import deepcopy
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizerFast
from datasets import Dataset, load_metric

data_file = '/content/CS4248-Team23/data/v6_remove_punctuation_remove_non_english_correct_spelling_replace_short_form_slang.csv'

old_train = pd.read_csv(data_file)
train = deepcopy(old_train)

# 80/10/10 split: hold out 20% of the data, then halve the holdout into test and validation sets
train_dataset, holdout_dataset = train_test_split(train, test_size=0.2, random_state=10)
test_dataset, val_dataset = train_test_split(holdout_dataset, test_size=0.5, random_state=10)

train_dataset_text = deepcopy(train_dataset[['text','label']])
val_dataset_text = deepcopy(val_dataset[['text','label']])
test_dataset_text = deepcopy(test_dataset[['text','label']])

train_dataset = Dataset.from_pandas(train_dataset_text)
val_dataset = Dataset.from_pandas(val_dataset_text)
test_dataset = Dataset.from_pandas(test_dataset_text)

tokenizer = BertTokenizerFast.from_pretrained(model_name)
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

train_dataset = train_dataset.map(preprocess_function, batched=True, load_from_cache_file=False)
val_dataset = val_dataset.map(preprocess_function, batched=True, load_from_cache_file=False)
test_dataset = test_dataset.map(preprocess_function, batched=True, load_from_cache_file=False)

columns_to_return = ['input_ids', 'label', 'attention_mask']
train_dataset.set_format(type='torch', columns=columns_to_return)
val_dataset.set_format(type='torch', columns=columns_to_return)
test_dataset.set_format(type='torch', columns=columns_to_return)

In [None]:
import torch
import torch.nn as nn
from torch.nn import GELU
from transformers import BertModel, BertForSequenceClassification, BertForPreTraining,\
                         BertTokenizerFast, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def pt_bert(model_name, num_labels, train_mode=True):
  model = BertForSequenceClassification.from_pretrained(model_name, 
                                                        num_labels=num_labels)
  if train_mode:
    model.train()
  return model

# def pt_bert_extended(model_name, num_labels, train_mode=True):
#   class ExtendedBert(nn.Module):
#       def __init__(self):
#           super().__init__()

#           self.bert = BertModel.from_pretrained(model_name)
#           self.linear = nn.Linear(1024, 1024)
#           self.act = GELU()
#           self.classifier = nn.Linear(1024, num_labels)

#       def forward(self, encoded, other_feats):
#           # get the hidden state of the last layer
#           last_hidden = self.bert(**encoded)[0]
#           # concatenate with the other given features
#           cat = torch.cat([last_hidden, other_feats], dim=-1)
#           # pass through linear layer
#           output = self.linear(cat)
#           # pass through non-linear activation and final classifier layer
#           return self.classifier(self.act(output))
#   model = ExtendedBert()
#   if train_mode:
#     model.train()
#   return model

model = pt_bert(model_name, num_labels, True)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [None]:
training_args = TrainingArguments(
    output_dir='./nuswhispersbert/results',          # output directory
    learning_rate=2e-5,
    num_train_epochs=4.0,            # total # of training epochs
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    warmup_ratio=0.1,                # fraction of total training steps used for learning rate warmup
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./nuswhispersbert/logs',            # directory for storing logs
    load_best_model_at_end=True,
    metric_for_best_model='f1',
)

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

Step,Training Loss


TrainOutput(global_step=444, training_loss=1.1452488254856419, metrics={'train_runtime': 9585.6873, 'train_samples_per_second': 0.046, 'total_flos': 4691647640678400.0, 'epoch': 4.0, 'init_mem_cpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 3247280128, 'train_mem_cpu_peaked_delta': 1634304})
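
The `global_step=444` reported above can be reproduced with a little arithmetic. The sketch below assumes the dataset has 4407 rows (the length shown for the embeddings later in this notebook), that `train_test_split` rounds the held-out share up, and that the Trainer takes one optimizer step per batch on a single device. The helper names are illustrative only.

```python
import math

n_posts = 4407  # row count shown in the embeddings output of this notebook


def split_sizes(n, test_size):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n * test_size)
    return n - n_test, n_test


def total_steps(n_examples, batch_size, epochs):
    """Optimizer steps when the last partial batch of each epoch is kept."""
    return math.ceil(n_examples / batch_size) * epochs


n_train, n_holdout = split_sizes(n_posts, 0.2)
n_test, n_val = split_sizes(n_holdout, 0.5)

steps_bs32 = total_steps(n_train, batch_size=32, epochs=4)
steps_bs8 = total_steps(n_train, batch_size=8, epochs=4)

print(n_train, n_val, n_test)
print(steps_bs32, steps_bs8)
```

Under these assumptions the batch-size-32 run gives 444 steps, and the same training set with batch size 8 (used in section 3b) gives 1764 steps, matching the step counts logged in both runs.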

In [None]:
trainer.evaluate()

{'epoch': 4.0,
 'eval_accuracy': 0.5578231292517006,
 'eval_f1': 0.4473423393402012,
 'eval_loss': 1.1610209941864014,
 'eval_mem_cpu_alloc_delta': -6082560,
 'eval_mem_cpu_peaked_delta': 6082560,
 'eval_precision': 0.5359161508989095,
 'eval_recall': 0.44302641072377913,
 'eval_runtime': 73.9135,
 'eval_samples_per_second': 5.966}
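
Accuracy and F1 differ noticeably here because `compute_metrics` uses `average='macro'`, which weights every class equally regardless of how many posts it has. The pure-Python sketch below illustrates the idea on made-up labels. It is a simplified re-implementation for illustration, not the scikit-learn code, and it treats undefined precision or recall as 0.

```python
def macro_scores(labels, preds):
    """Unweighted mean of per-class precision, recall and F1."""
    classes = sorted(set(labels) | set(preds))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy example: the majority class is always right, one minority class is always missed.
toy_labels = [0, 0, 0, 0, 1, 2]
toy_preds = [0, 0, 0, 0, 0, 2]

accuracy = sum(y == p for y, p in zip(toy_labels, toy_preds)) / len(toy_labels)
macro_p, macro_r, macro_f1 = macro_scores(toy_labels, toy_preds)
print(accuracy, macro_f1)
```

In the toy data a single missed minority class pulls macro-F1 well below accuracy, which is the same pattern seen in the evaluation output above.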

The fine-tuned BERT gave us a test macro F1 score of 0.417:

In [None]:
import json 

predictions = trainer.predict(test_dataset)
print(predictions.metrics)

with open('results.json', 'w+') as f:
  f.write(json.dumps(predictions.metrics))

{'test_loss': 1.1826553344726562, 'test_accuracy': 0.5260770975056689, 'test_f1': 0.4172975243387377, 'test_precision': 0.47979861309014915, 'test_recall': 0.4160155344745687, 'test_runtime': 72.9537, 'test_samples_per_second': 6.045, 'test_mem_cpu_alloc_delta': -8663040, 'test_mem_cpu_peaked_delta': 9326592}


In [None]:
# save model
save_directory = '/content/drive/MyDrive/nuswhispers_bert/'
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

# 2b. Extracting Contextual Embeddings of BERT finetuned on NUSWhispers

In the previous section, we fine-tuned BERT directly on NUSWhispers, but did not use the other non-textual features that we have. Here we extract the contextual embeddings from the fine-tuned model, just as we did for the pre-trained model in the first section, so that they can be combined with the other features (in main.py).

In [None]:
from transformers import BertTokenizerFast, BertForSequenceClassification

save_directory = '/content/drive/MyDrive/nuswhispers_bert/'
tokenizer = BertTokenizerFast.from_pretrained(save_directory)
model = BertForSequenceClassification.from_pretrained(save_directory,
                                                      output_hidden_states=True)

In [None]:
import numpy as np
import pandas as pd
import torch
import transformers
import warnings
warnings.filterwarnings('ignore')

In [None]:
# load NUSWhispers dataset
from copy import deepcopy
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset, load_metric

data_file = '/content/CS4248-Team23/data/v6_remove_punctuation_remove_non_english_correct_spelling_replace_short_form_slang.csv'

old_train = pd.read_csv(data_file)
train = deepcopy(old_train)

train_dataset_text = deepcopy(train[['text','label']])

train_dataset = Dataset.from_pandas(train_dataset_text)

tokenizer = BertTokenizerFast.from_pretrained(model_name)
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

train_dataset = train_dataset.map(preprocess_function, batched=True, load_from_cache_file=False)

columns_to_return = ['input_ids', 'label', 'attention_mask']
train_dataset.set_format(type='torch', columns=columns_to_return)

Here we extract the embeddings produced by the final hidden layer (before the classification head), using the embedding of each post's [CLS] token (a special token prepended to every text by the BERT tokenizer). Other strategies also exist, e.g. average or max pooling over all tokens' embeddings, taking the second-to-last hidden layer's embeddings instead of the last, or pooling the last four hidden layers.
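
The pooling strategies mentioned above can be sketched with NumPy on random stand-in tensors. The shapes mimic a bert-base model run with `output_hidden_states=True` (embedding output plus 12 encoder layers). Concatenating the [CLS] vectors of the last four layers is only one possible reading of "pooling the last 4 hidden layers", and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 6, 768

# Stand-in for the hidden_states tuple: embedding output + 12 encoder layers.
hidden_states = [rng.normal(size=(1, seq_len, hidden)) for _ in range(13)]
attention_mask = np.array([[1, 1, 1, 1, 0, 0]])  # last two positions are padding

last = hidden_states[-1]                          # final encoder layer
mask = attention_mask[:, :, None].astype(bool)    # shape (1, seq_len, 1)

# [CLS] embedding: first token of the final layer.
cls_embedding = last[:, 0, :]

# Mean pooling over real (non-padding) tokens only.
mean_pooled = (last * mask).sum(axis=1) / mask.sum(axis=1)

# Max pooling, with padding positions excluded via -inf.
max_pooled = np.where(mask, last, -np.inf).max(axis=1)

# [CLS] from the second-to-last layer.
second_to_last_cls = hidden_states[-2][:, 0, :]

# Concatenated [CLS] vectors from the last four layers.
last_four_cls = np.concatenate([h[:, 0, :] for h in hidden_states[-4:]], axis=-1)

print(cls_embedding.shape, mean_pooled.shape, max_pooled.shape, last_four_cls.shape)
```

Masking matters for the mean and max variants, since padded positions would otherwise leak into the pooled vector.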

In [None]:
def generate_embedding(x):
  inputs = {
    "input_ids": torch.tensor(x['input_ids']).unsqueeze(0),
    "attention_mask": torch.tensor(x['attention_mask']).unsqueeze(0),
  }

  with torch.no_grad():
    output = model(**inputs)
    logits = output[0]
    hidden_states = output[1]  # tuple: embedding output, then one tensor per encoder layer
    last_hidden_state = hidden_states[-1] # final encoder layer, right before the classification head
  # Get [CLS] embedding
  features = last_hidden_state[:,0,:].numpy()

  return features

df = train_dataset.to_pandas()
embeddings = df.apply(generate_embedding, axis=1)

In [None]:
# sanity check
# b = generate_embedding(df.iloc[0])
# c = generate_embedding(df.iloc[100])
# c[1][1][:,0,:] - b[1][1][:,0,:]

In [None]:
embeddings

0       [[0.3830672, -0.009595338, -0.04466594, 0.2164...
1       [[0.31250125, -0.025139237, -0.07906971, 0.145...
2       [[0.4044993, 0.11017956, -0.09175598, 0.163171...
3       [[0.42715767, 0.12297561, -0.088395834, 0.1763...
4       [[0.33505115, 0.028211728, -0.09869314, 0.1309...
                              ...                        
4402    [[0.35374606, 0.05952185, -0.110185266, 0.1314...
4403    [[0.42239362, 0.0930569, -0.07726694, 0.185199...
4404    [[0.3414257, 0.00013566887, -0.057890713, 0.18...
4405    [[0.37562442, 0.0037782686, -0.104104154, 0.18...
4406    [[0.36251536, 0.009120925, -0.013022443, 0.200...
Length: 4407, dtype: object

In [None]:
embeddings.to_csv('drive/MyDrive/nw_bert_embeddings.csv')

# 3. Further Pre-Training BERT on GoEmotions

In this section, we explore further pre-training the original pre-trained BERT model. As the original model was pre-trained on English Wikipedia and BooksCorpus, it may not capture the distributional statistics of our target domain, i.e. social media (Facebook posts), which tends to use more informal language.

Hence, we further pre-train on the masked language modelling task using the GoEmotions dataset, which contains a fairly large set (about 200k examples) of Reddit comments.

We hypothesize that by tuning the language model to better fit the target domain, the performance of the downstream task (sentiment analysis on NUSWhispers) could be improved.
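
For intuition about the masked language modelling objective, here is a minimal sketch of BERT-style token masking in pure Python. It assumes the commonly used settings of selecting 15% of tokens and, for the selected ones, replacing 80% with `[MASK]`, 10% with a random token and leaving 10% unchanged. It works on whitespace tokens rather than WordPiece ids, and the function name and toy vocabulary are made up for this example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=0):
    """Return (masked_inputs, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mlm_probability:
            continue  # token not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # most selected tokens become [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # some become a random token
        # otherwise the original token is kept as-is
    return inputs, labels


toy_tokens = "this prof is honestly the best lecturer i have ever had".split()
toy_vocab = ["module", "exam", "hall", "friend", "bus"]
masked_inputs, mlm_labels = mask_tokens(toy_tokens, toy_vocab, mlm_probability=0.3)
print(masked_inputs)
print(mlm_labels)
```

The loss is computed only at positions with a label, so the model learns to reconstruct domain-specific wording from its context.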

References:
- https://github.com/huggingface/transformers/tree/master/examples
- https://huggingface.co/blog/pytorch-xla

To speed up the pre-training process, please switch the Colab runtime to TPU.

In [None]:
import os
# .get() avoids a bare KeyError so the helpful message below is shown when no TPU is attached
assert os.environ.get('COLAB_TPU_ADDR'), 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'

!pip install -U git+https://github.com/huggingface/transformers

!pip install datasets

# Install Colab TPU compatible PyTorch/TPU wheels and dependencies
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.8-cp37-cp37m-linux_x86_64.whl

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-agt3w64v
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-agt3w64v
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: transformers
  Building wheel for transformers (PEP 517) ... done
  Created wheel for transformers: filename=transformers-4.6.0.dev0-cp37-none-any.whl size=2090005 sha256=9edf51e6b43ac951e22d519800bde168860f3ac1e7391c8a362122330e2a3f60
  Stored in directory: /tmp/pip-ephem-wheel-cache-o4rvvnlc/wheels/70/d3/52/b3fa4f8b8ef04167ac62e5bb2accb62ae764db2a378247490e
Successfully built transformers
Installing collected packages: transformers
  Found existing installation: transformers 4.5.0
    Uninstalling transformers-4.5.0:
      Successf

In [None]:
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_mlm.py
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/xla_spawn.py
!wget -P goemotions_data/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_1.csv
!wget -P goemotions_data/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_2.csv
!wget -P goemotions_data/ https://storage.googleapis.com/gresearch/goemotions/data/full_dataset/goemotions_3.csv

--2021-04-12 05:20:21--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_mlm.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20875 (20K) [text/plain]
Saving to: ‘run_mlm.py’


2021-04-12 05:20:21 (16.6 MB/s) - ‘run_mlm.py’ saved [20875/20875]

--2021-04-12 05:20:21--  https://raw.githubusercontent.com/huggingface/transformers/master/examples/xla_spawn.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2519 (2.5K) [text/plain]
Saving to: ‘xla_spawn.py’


2021-04-12 05:20:22 (51.5 

In [None]:
# from transformers import BertTokenizerFast
from datasets import load_dataset

train_dataset = load_dataset('csv', data_files=[
                                     'goemotions_data/goemotions_1.csv',
                                     'goemotions_data/goemotions_2.csv',
                                     'goemotions_data/goemotions_3.csv'])
with open('goemotions_corpus.txt','w+') as f:
  corpus='\n'.join(train_dataset['train']['text'])
  f.write(corpus)

Using custom data configuration default-4a82ec2353cca12b


Downloading and preparing dataset csv/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/csv/default-4a82ec2353cca12b/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-4a82ec2353cca12b/0.0.0/2dc6629a9ff6b5697d82c25b73731dd440507a69cbce8b425db50b751e8fcfd0. Subsequent calls will reuse this data.
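
Since the corpus file is later consumed with `--line_by_line`, each line is treated as a separate training sequence, so a comment containing an embedded newline would be split in two. A small, hypothetical clean-up helper (not part of the original notebook) could flatten each comment to a single line and drop empty or missing entries before joining:

```python
def to_corpus_lines(texts):
    """Collapse internal whitespace so that each text occupies exactly one line."""
    lines = []
    for text in texts:
        if not isinstance(text, str):
            continue  # skip missing values such as None
        flat = " ".join(text.split())
        if flat:
            lines.append(flat)
    return lines


sample_texts = ["first comment", "spans\ntwo lines", "   ", None]
corpus_lines = to_corpus_lines(sample_texts)
print("\n".join(corpus_lines))
```

The joined string can then be written to the corpus file in the same way as above.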


In [None]:
# set up TPU
import tensorflow as tf
import os

# Note that the `tpu` argument is for Colab-only
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])

tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))

# strategy = tf.distribute.TPUStrategy(resolver)

INFO:absl:Entering into master device scope: /job:worker/replica:0/task:0/device:CPU:0


INFO:tensorflow:Initializing the TPU system: grpc://10.75.235.170:8470


INFO:tensorflow:Clearing out eager caches


INFO:tensorflow:Finished initializing TPU system.


All devices:  [LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:7', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:6', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:5', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:4', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:3', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:0', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:1', device_type='TPU'), LogicalDevice(name='/job:worker/replica:0/task:0/device:TPU:2', device_type='TPU')]


In [None]:
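import os

# `!export` runs in a throwaway subshell, so set the variables on the notebook process instead
os.environ['TPU_IP_ADDRESS'] = '10.75.235.170'  # ex. 10.0.0.2
os.environ['XRT_TPU_CONFIG'] = 'tpu_worker;0;' + os.environ['TPU_IP_ADDRESS'] + ':8470'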

# !python run_mlm.py \
#     --model_name_or_path bert-base-cased \
#     --train_file goemotions_corpus.txt \
#     --do_train \
#     --line_by_line \
#     --output_dir goemo-mlm

!python xla_spawn.py \
  run_mlm.py \
  --model_name_or_path bert-base-cased \
  --train_file goemotions_corpus.txt \
  --do_train \
  --line_by_line \
  --pad_to_max_length \
  --output_dir /content/drive/MyDrive/goemo-mlm \
  --cache_dir cache_dir \
  --overwrite_cache \
  --tpu_metrics_debug \
  --save_steps 20000
  # --overwrite_output_dir \
  # --num_train_epochs 3 \
  # --per_device_train_batch_size 8 \
  # --per_device_eval_batch_size 8 \
  

INFO:run_mlm:Training/evaluation parameters TrainingArguments(output_dir=/content/drive/MyDrive/goemo-mlm, overwrite_output_dir=False, do_train=True, do_eval=None, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0, warmup_steps=0, logging_dir=runs/Apr12_05-21-51_f9bc37ee942f, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=20000, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=-1, tpu_num_cores=1, tpu_metrics_debug=True, debug=False, dataloader_drop_last=False, e

# 3b. Finetuning GoEmotions Pre-Trained BERT on NUSWhispers

 *Remember to change runtime back to GPU*

In [None]:
model_name = '/content/drive/MyDrive/goemo-mlm'
num_labels = 5

In [None]:
# Fine-tune the GoEmotions pre-trained BERT on NUSWhispers
# load NUSWhispers dataset
from copy import deepcopy
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizerFast
from datasets import Dataset, load_metric

data_file = '/content/CS4248-Team23/data/v6_remove_punctuation_remove_non_english_correct_spelling_replace_short_form_slang.csv'

old_train = pd.read_csv(data_file)
train = deepcopy(old_train)

# 80/10/10 split: hold out 20% of the data, then halve the holdout into test and validation sets
train_dataset, holdout_dataset = train_test_split(train, test_size=0.2, random_state=10)
test_dataset, val_dataset = train_test_split(holdout_dataset, test_size=0.5, random_state=10)

train_dataset_text = deepcopy(train_dataset[['text','label']])
val_dataset_text = deepcopy(val_dataset[['text','label']])
test_dataset_text = deepcopy(test_dataset[['text','label']])

train_dataset = Dataset.from_pandas(train_dataset_text)
val_dataset = Dataset.from_pandas(val_dataset_text)
test_dataset = Dataset.from_pandas(test_dataset_text)

tokenizer = BertTokenizerFast.from_pretrained(model_name)
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

train_dataset = train_dataset.map(preprocess_function, batched=True, load_from_cache_file=False)
val_dataset = val_dataset.map(preprocess_function, batched=True, load_from_cache_file=False)
test_dataset = test_dataset.map(preprocess_function, batched=True, load_from_cache_file=False)

columns_to_return = ['input_ids', 'label', 'attention_mask']
train_dataset.set_format(type='torch', columns=columns_to_return)
val_dataset.set_format(type='torch', columns=columns_to_return)
test_dataset.set_format(type='torch', columns=columns_to_return)

In [None]:
import torch
import torch.nn as nn
from torch.nn import GELU
from transformers import BertModel, BertForSequenceClassification, BertForPreTraining,\
                         BertTokenizerFast, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def pt_bert(model_name, num_labels, train_mode=True):
  model = BertForSequenceClassification.from_pretrained(model_name, 
                                                        num_labels=num_labels,)
  if train_mode:
    model.train()
  return model

model = pt_bert(model_name, num_labels, True)

Some weights of the model checkpoint at /content/drive/MyDrive/goemo-mlm were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /content/dri

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir='./genwbert/results',          # output directory
    learning_rate=2e-5,
    num_train_epochs=4.0,            # total # of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=8,   # batch size for evaluation
    warmup_ratio=0.1,                # fraction of total training steps used for learning rate warmup
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./genwbert/logs',            # directory for storing logs
    load_best_model_at_end=True,
    metric_for_best_model='f1',
)

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

trainer.train()

Step,Training Loss
500,1.2775
1000,0.943
1500,0.6672


TrainOutput(global_step=1764, training_loss=0.894660015495456, metrics={'train_runtime': 1515.9162, 'train_samples_per_second': 1.164, 'total_flos': 4691647640678400.0, 'epoch': 4.0, 'init_mem_cpu_alloc_delta': 0, 'init_mem_gpu_alloc_delta': 0, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 1060864, 'train_mem_gpu_alloc_delta': 911140864, 'train_mem_cpu_peaked_delta': 4096, 'train_mem_gpu_peaked_delta': 6556587520})
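
To make the effect of `warmup_ratio=0.1` concrete, the sketch below computes the learning rate at a few points of this 1764-step run. It assumes the Trainer's default linear schedule (linear warmup from 0, then linear decay to 0) and that the number of warmup steps is rounded up. The function is a simplified stand-in written for illustration, not the library implementation.

```python
import math


def linear_schedule_lr(step, total_steps, warmup_ratio, base_lr):
    """Learning rate under linear warmup followed by linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


total_steps = 1764   # global_step reported above
base_lr = 2e-5       # learning_rate from TrainingArguments
warmup_steps = math.ceil(total_steps * 0.1)

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, total_steps, 0.1, base_lr))
```

Under these assumptions the rate peaks at 2e-5 once warmup ends and then falls linearly to zero by the final step.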

In [None]:
trainer.evaluate()

{'epoch': 4.0,
 'eval_accuracy': 0.5374149659863946,
 'eval_f1': 0.40829767690940344,
 'eval_loss': 1.4052960872650146,
 'eval_mem_cpu_alloc_delta': -40960,
 'eval_mem_cpu_peaked_delta': 40960,
 'eval_mem_gpu_alloc_delta': 0,
 'eval_mem_gpu_peaked_delta': 390193152,
 'eval_precision': 0.43933094384707294,
 'eval_recall': 0.4113299109516214,
 'eval_runtime': 8.3014,
 'eval_samples_per_second': 53.123}

In [None]:
import json 

predictions = trainer.predict(test_dataset)
print(predictions.metrics)

with open('results.json', 'w+') as f:
  f.write(json.dumps(predictions.metrics))

{'test_loss': 1.3832087516784668, 'test_accuracy': 0.5351473922902494, 'test_f1': 0.44770590825570683, 'test_precision': 0.5446675640695873, 'test_recall': 0.43751330519547305, 'test_runtime': 8.2763, 'test_samples_per_second': 53.285, 'test_mem_cpu_alloc_delta': -45056, 'test_mem_gpu_alloc_delta': 0, 'test_mem_cpu_peaked_delta': 45056, 'test_mem_gpu_peaked_delta': 390197760}
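
For a quick side-by-side with the section 2 run, the snippet below copies the test metrics printed in this notebook and computes the differences. Keep in mind that the two runs also differ in batch size (32 vs 8) and each used a single split, so this is an informal comparison rather than a controlled one. The dictionary names are illustrative.

```python
# Test metrics copied from the outputs of sections 2 and 3b in this notebook.
bert_base_run = {
    "accuracy": 0.5260770975056689,
    "f1": 0.4172975243387377,
    "precision": 0.47979861309014915,
    "recall": 0.4160155344745687,
}
goemo_run = {
    "accuracy": 0.5351473922902494,
    "f1": 0.44770590825570683,
    "precision": 0.5446675640695873,
    "recall": 0.43751330519547305,
}

deltas = {name: goemo_run[name] - bert_base_run[name] for name in bert_base_run}
for name, delta in deltas.items():
    print(f"{name:>9}: {bert_base_run[name]:.3f} -> {goemo_run[name]:.3f} ({delta:+.3f})")
```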


In [None]:
# save model
save_directory = '/content/drive/MyDrive/ge_nw_bert/'
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

# 3c. Extracting Contextual Embeddings of GoEmotions Pre-Trained & NUSWhispers Fine-Tuned BERT


In [None]:
from transformers import BertTokenizerFast, BertForSequenceClassification

save_directory = '/content/drive/MyDrive/ge_nw_bert/'
tokenizer = BertTokenizerFast.from_pretrained(save_directory)
model = BertForSequenceClassification.from_pretrained(save_directory,
                                                      output_hidden_states=True)

In [None]:
import numpy as np
import pandas as pd
import torch
import transformers
import warnings
warnings.filterwarnings('ignore')

In [None]:
# load NUSWhispers dataset
from copy import deepcopy
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset, load_metric

data_file = '/content/CS4248-Team23/data/v6_remove_punctuation_remove_non_english_correct_spelling_replace_short_form_slang.csv'

old_train = pd.read_csv(data_file)
train = deepcopy(old_train)

train_dataset_text = deepcopy(train[['text','label']])

train_dataset = Dataset.from_pandas(train_dataset_text)

tokenizer = BertTokenizerFast.from_pretrained(model_name)
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True)

train_dataset = train_dataset.map(preprocess_function, batched=True, load_from_cache_file=False)

columns_to_return = ['input_ids', 'label', 'attention_mask']
train_dataset.set_format(type='torch', columns=columns_to_return)

Here we extract the embeddings produced by the final hidden layer (before the classification head), using the embedding of each post's [CLS] token (a special token prepended to every text by the BERT tokenizer). Other strategies also exist, e.g. average or max pooling over all tokens' embeddings, taking the second-to-last hidden layer's embeddings instead of the last, or pooling the last four hidden layers.

In [None]:
def generate_embedding(x):
  inputs = {
    "input_ids": torch.tensor(x['input_ids']).unsqueeze(0),
    "attention_mask": torch.tensor(x['attention_mask']).unsqueeze(0),
  }

  with torch.no_grad():
    output = model(**inputs)
    logits = output[0]
    hidden_states = output[1]  # tuple: embedding output, then one tensor per encoder layer
    last_hidden_state = hidden_states[-1] # final encoder layer, right before the classification head
  # Get [CLS] embedding
  features = last_hidden_state[:,0,:].numpy()

  return features

df = train_dataset.to_pandas()
embeddings = df.apply(generate_embedding, axis=1)

In [None]:
# sanity check
# b = generate_embedding(df.iloc[0])
# c = generate_embedding(df.iloc[100])
# c[1][1][:,0,:] - b[1][1][:,0,:]

In [None]:
embeddings

0       [[0.27817735, -0.08260021, -0.008533872, 0.228...
1       [[0.22157842, -0.02904309, -0.03032327, 0.1960...
2       [[0.2006335, 0.090700924, -0.10094186, 0.15031...
3       [[0.2374434, 0.107041694, -0.122428514, 0.1505...
4       [[0.18441899, 0.03906344, -0.064785875, 0.1099...
                              ...                        
4402    [[0.21776429, 0.049682744, -0.09176226, 0.1267...
4403    [[0.23981403, 0.07714494, -0.111232854, 0.1523...
4404    [[0.25274277, -0.037078705, -0.01557099, 0.189...
4405    [[0.2536971, -0.06730342, -0.064660765, 0.1913...
4406    [[0.27818578, -0.08658751, -0.022273915, 0.224...
Length: 4407, dtype: object

In [None]:
embeddings.to_csv('drive/MyDrive/ge_nw_bert_embeddings.csv')