**In this Notebook, we present an approach using pretrained BERT word embeddings combined with CNN and FFNN to predict the funniness score of a news headline from the humicroedit dataset**

**1. Introduction**

In Natural language Processing, humour detection is a challenging task. The [SemEval-2020 Task 7](https://arxiv.org/pdf/2008.00304.pdf)  aims to detect humour in English news headlines from micro-edits. The humicroedit dataset (Hossain et al., 2020) contains 9653 training, 2420 development and 3025 testing examples. In task 1 the goal is to predict the the funniness score of an edited headline in the ranges of 0 to 3, where 0 means not funny and 3 means very funny. 

The main idea is that we want to use a context-sensitive embedding as the first layer in building a deep learning regression model. Whereas static word embeddings like Word2Vec or GLoVE give the same meaning to a word in different contexts, a dynamic (contextualized) word embedding trained on a general language model task with a large corpus is able to represent polysemy, capture long- term dependencies in language and help the model learn sentiments. We care about contextualized embeddings because humor is often conveyed through puns, quirky expressions and parodies, which involve using word combinations in unusual ways.

We chose BERT because it is both task-agnostic and deeply bidirectional. As opposed to a bidirectional RNN model, which is not truly bidirectional as states from the two directions do not interact with each other, BERT representations are jointly conditioned on both left and right context in all layers.

For the experiment, we use the PyTorch implementation of the smallest base variant (’bert-base-uncased’) of the pre-trained BERT models by HuggingFace.

**2. Import and Downloads**

In [3]:
#@title Download and Install
# Check GPU
!nvidia-smi
# installing transformers
!pip -q install transformers
!pip -q install nltk

/bin/sh: nvidia-smi: command not found


In [4]:
#@title Imports
# Library imports
import torch
import torch.nn as nn
from torch.utils.data import random_split
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from transformers import BertTokenizer, BertModel

from preprocessing.preprocessor import (create_edited_sentences, dataset_question_full_processing,
                                        create_custom_vocab, get_stop_words)
from dataloader.data_loaders import (Task1Dataset, collate_fn_padd, get_input_bert,
                                      get_dataloaders, get_dataloaders_no_random_split)
from models.BERT_FFNN import FFNN
from models.BERT_CNN import CNN
from trainer.BERT_trainer import bert_eval, bert_train
from utils.plot import (plot_sentence_length_stopwords, plot_mean_grade_distribution,
                        plot_number_characters, plot_number_words, plot_top_ngrams, 
                        plot_loss_vs_epochs)
from utils.ngrams import get_top_ngram
from utils.vocab import create_vocab, get_word2idx

**3. Settings and Parameters**

In [5]:
#@title Torch Settings
# Setting random seed and device
SEED = 1

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

# Number of epochs
epochs_bl = 10 
epochs = 100

# Proportion of training data for train compared to dev
train_proportion = 0.8

**4. Data Loading**

In [9]:
# Load data
train_df = pd.read_csv('data/task-1/train.csv')
dev_df = pd.read_csv('data/task-1/dev.csv')
test_df = pd.read_csv('data/task-1/test.csv')

# Convert them to full edited sentences
modified_train_df = create_edited_sentences(train_df)
modified_valid_df = create_edited_sentences(dev_df)
modified_test_df = create_edited_sentences(test_df)

**5. Preprocessing**

As we are interested in a regression task that gives an absolute score for humor instead of comparing two headlines, we directly replaced the word in </> in the original headline with the word given for micro- edits. In the first preprocessing approach (function ```question_sentence_preprocessing``` in the notebook), we took the edited headlines, converted them to all lower cases, converted *’t* was converted to *not*, and removed a subset of punctuation (kept question mark) and trailing white spaces. The second preprocessing approach
(```full_sentence_preprocessing``` in the note- book) is more straightforward, in which we converted all words into lower cases, removed all special char- acters and trailing white spaces. We used these two preprocessing approaches because the BERT model was trained with special characters, and we wanted to examine if different levels of text preprocessing would make a difference in the performance.

***Stop Words***

In [10]:
# nltk stopwords english list
nltk.download('stopwords')
nltk_stopwords = list(stopwords.words('english'))

# import custom stopword list
simple_stopwords, custom_stopwords = get_stop_words()

all_stopwords = list(set(custom_stopwords + nltk_stopwords))

stopwords_lists = [[],simple_stopwords, custom_stopwords,nltk_stopwords,\
                   all_stopwords]
edited_modes = ['question_edited','full_edited']

# OK Applying both question and full version to any dataframe and dropping useless values
modified_train_df = dataset_question_full_processing(modified_train_df)
modified_valid_df = dataset_question_full_processing(modified_valid_df)
modified_test_df = dataset_question_full_processing(modified_test_df)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nasmadasser/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


***BERT Embeddings***

In [12]:
# Bert download
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')

**6. BERT + Feed Forward Neural Network**

***a. FFNN question-sentence preprocessing***

In [14]:
## FFNN question-sentence preprocessing
BATCH_SIZE = 64

bert_ffnn = FFNN()
print("Model initialised.")
bert_ffnn.to(device)

# Get inputs for BERT and data loaders
input_id_mask_train = get_input_bert(df=modified_train_df, tokenizer = tokenizer, col="question_edited")

input_id_mask_valid = get_input_bert(df= modified_valid_df, tokenizer=tokenizer, col ="question_edited") 

train_loader, dev_loader = get_dataloaders_no_random_split(input_data_train=input_id_mask_train, 
                                                            targets_train=modified_train_df['meanGrade'],
                                                            input_data_valid=input_id_mask_valid,
                                                            targets_valid=modified_valid_df['meanGrade'], 
                                                            batch_size = BATCH_SIZE) 

print("Dataloaders created.")
# calculate loss
loss_fn = nn.MSELoss()
loss_fn = loss_fn.to(device)

optimizer_ff = torch.optim.Adam(bert_ffnn.parameters(), lr=1e-5)

bert_model = bert_model.to(device=device)

Model initialised.
Dataloaders created.


***Train BERT + FFNN***

In [None]:
question_train_losses_ff, question_valid_losses_ff = bert_train(
     optimizer=optimizer_ff, 
     train_iter=train_loader, 
     dev_iter=dev_loader, 
     model=bert_ffnn,
     loss_fn=loss_fn,
     device= device, 
     number_epoch = 10,     
     model_name='task1_question_ffnn.pt', 
     patience=5, )

***b. FFNN Full-sentence preprocessing***

In [15]:
## FFNN Full-sentence preprocessing

BATCH_SIZE = 64

bert_ffnn_full = FFNN()
print("Model initialised.")

bert_ffnn_full.to(device)

# Get inputs for BERT and data loaders
input_id_mask_train = get_input_bert(df= modified_train_df, tokenizer= tokenizer, col= "full_edited")
input_id_mask_valid = get_input_bert(df= modified_valid_df, tokenizer= tokenizer, col= "full_edited") 
train_loader, dev_loader = get_dataloaders_no_random_split(
    input_data_train=input_id_mask_train, 
    targets_train=modified_train_df['meanGrade'],
    input_data_valid=input_id_mask_valid,
    targets_valid=modified_valid_df['meanGrade'], 
    batch_size = BATCH_SIZE) 
print("Dataloaders created.")
# calculate loss
loss_fn = nn.MSELoss()
loss_fn = loss_fn.to(device)

optimizer_ff_full = torch.optim.Adam(bert_ffnn_full.parameters(), lr=1e-5)

bert_model = bert_model.to(device=device)

Model initialised.
Dataloaders created.


***Train BERT + FFNN***

In [None]:
full_train_losses_ff, full_valid_losses_ff = bert_train(
     optimizer=optimizer_ff_full, 
     train_iter=train_loader, 
     dev_iter=dev_loader, 
    model=bert_ffnn_full,
     loss_fn=loss_fn,
    device = device, 
     number_epoch =10,   
     model_name='task1_full_ffnn.pt', 
    patience=5,)

***Plot Loss***

In [None]:
plot_loss_vs_epochs(question_train_losses_ff, question_valid_losses_ff, 
                        full_train_losses_ff,full_valid_losses_ff, title = "Loss Curves: BERT with FFNN" )

**7. BERT + Convolutional Neural Network**

***a. CNN with question-sentence preprocessing***

In [18]:
## CNN question-sentence preprocessing
BATCH_SIZE = 64
output_channel = [128, 64]

bert_cnn = CNN(out_channels=output_channel)
print("Model initialised.")

bert_cnn.to(device)

# Get inputs for BERT and data loaders
input_id_mask_train = get_input_bert(df=modified_train_df, tokenizer= tokenizer, col="question_edited")
input_id_mask_valid = get_input_bert(df=modified_valid_df, tokenizer= tokenizer, col="question_edited") 
train_loader, dev_loader = get_dataloaders_no_random_split(
    input_data_train=input_id_mask_train, 
    targets_train=modified_train_df['meanGrade'],
    input_data_valid=input_id_mask_valid,
    targets_valid=modified_valid_df['meanGrade'], 
    batch_size = BATCH_SIZE) 
print("Dataloaders created.")

loss_fn = nn.MSELoss()
loss_fn = loss_fn.to(device)

optimizer_cnn = torch.optim.Adam(bert_cnn.parameters(), lr=1e-5)

bert_model = bert_model.to(device=device)

Model initialised.
Dataloaders created.


***Train BERT + CNN***

In [None]:
question_train_losses, question_valid_losses = bert_train(
    optimizer=optimizer_cnn, 
    train_iter=train_loader, 
    dev_iter=dev_loader,
    loss_fn=loss_fn,  
    model=bert_cnn,
    device = device,
    number_epoch = 10,
    patience=5, 
    model_name='task1_question_cnn.pt')

***b. CNN with full-sentence preprocessing***

In [19]:
## CNN full-sentence preprocessing
BATCH_SIZE = 64
output_channel = [128, 64]

bert_cnn = CNN(out_channels=output_channel)
print("Model initialised.")

bert_cnn.to(device)

# Get inputs for BERT and data loaders
input_id_mask_train = get_input_bert(df=modified_train_df, tokenizer= tokenizer, col="full_edited")
input_id_mask_valid = get_input_bert(df=modified_valid_df, tokenizer= tokenizer, col="full_edited") 
train_loader, dev_loader = get_dataloaders_no_random_split(
    input_data_train=input_id_mask_train, 
    targets_train=modified_train_df['meanGrade'],
    input_data_valid=input_id_mask_valid,
    targets_valid=modified_valid_df['meanGrade'], 
    batch_size = BATCH_SIZE) 
print("Dataloaders created.")

loss_fn = nn.MSELoss()
loss_fn = loss_fn.to(device)

optimizer_cnn = torch.optim.Adam(bert_cnn.parameters(), lr=1e-5)

bert_model = bert_model.to(device=device)

Model initialised.
Dataloaders created.


***Train BERT + CNN***

In [None]:
full_train_losses, full_valid_losses = bert_train(
    optimizer=optimizer_cnn, 
    train_iter=train_loader, 
    dev_iter=dev_loader,
    loss_fn=loss_fn,  
    device=device,
    model=bert_cnn, 
    number_epoch = 10,
    patience=5, 
    model_name='task1_full_cnn.pt')

***Plot Loss***

In [None]:
plot_loss_vs_epochs(question_train_losses, question_valid_losses, 
                        full_train_losses,full_valid_losses, title = "Loss Curves: BERT with CNN" )

**8. Testing with BERT+CNN**

In [22]:
import warnings
warnings.filterwarnings('ignore')


test_id_mask = get_input_bert(df=modified_test_df,tokenizer=tokenizer, col="full_edited") 
test_loader = torch.utils.data.DataLoader(test_id_mask, 
                                          shuffle=False, 
                                          batch_size=64)

bert_cnn = CNN(out_channels=[128, 64])
bert_cnn.load_state_dict(torch.load("task1_question_cnn.pt")) 
print("Model (Task 1) loaded.")
bert_cnn.to(device=device)
bert_model = bert_model.to(device=device)
# evaluate model
bert_cnn.eval()
cnn_pred = []

with torch.no_grad():
  print("Start testing ...")
  for (test_id, test_mask) in test_loader:
    test_id = test_id.to(device=device, dtype=torch.long)
    test_mask = test_mask.to(device=device, dtype=torch.long)
    pred = bert_cnn(test_id, test_mask)
    cnn_pred.append(pred.cpu().numpy())

cnn_pred = np.concatenate(cnn_pred)
test_df = modified_test_df[['id']]
test_df.loc[:, 'pred'] = cnn_pred.flatten()

# save to csv
task1_truth = pd.read_csv("data/task-1/truth.csv")
assert(sorted(task1_truth.id) == sorted(test_df.id)),"ID mismatch between ground truth and prediction!"
data = pd.merge(task1_truth, test_df)
rmse = np.sqrt(np.mean((data['meanGrade'] - data['pred'])**2))
print("RMSE = %.3f" % rmse)

RuntimeError: Error(s) in loading state_dict for CNN:
	Missing key(s) in state_dict: "conv1.weight", "conv1.bias", "conv2.weight", "conv2.bias". 
	Unexpected key(s) in state_dict: "ffnn.0.0.weight", "ffnn.0.0.bias", "ffnn.1.0.weight", "ffnn.1.0.bias", "ffnn.2.0.weight", "ffnn.2.0.bias". 
	size mismatch for output.0.weight: copying a param with shape torch.Size([1, 32]) from checkpoint, the shape in current model is torch.Size([1, 64]).