<a href="https://colab.research.google.com/github/Nasmasim/humour-detection/blob/main/run_GloVe_biLSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**In this notebook, we present a baseline approach that combines GloVe word embeddings with a biLSTM to predict the funniness score of a news headline from the Humicroedit dataset.**

**1. Introduction**

In natural language processing, humour detection is a challenging task. [SemEval-2020 Task 7](https://arxiv.org/pdf/2008.00304.pdf) aims to assess humour in English news headlines that have been modified by micro-edits. The Humicroedit dataset (Hossain et al., 2020) contains 9653 training, 2420 development and 3025 test examples. In Task 1, the goal is to predict the funniness score of an edited headline on a scale from 0 to 3, where 0 means not funny and 3 means very funny.
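
Because the target is a continuous score, predictions can be compared with the gold scores through an error measure such as the root-mean-square error (RMSE), which the training log later in this notebook reports next to the MSE. The snippet below is a minimal sketch of that calculation in plain Python; the function name `rmse` and the example scores are illustrative assumptions, not part of the project code.

```python
import math

def rmse(predictions, targets):
    """Root-mean-square error between two equal-length sequences of scores."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical funniness scores on the 0 to 3 scale (not taken from the dataset)
predicted = [1.0, 0.5, 2.0]
actual = [1.2, 0.0, 2.6]
score = rmse(predicted, actual)
```

RMSE is the square root of MSE, so a validation MSE of 0.34 corresponds to an RMSE of about 0.58.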

**2. Imports and Downloads**

In [1]:
# mount project to drive
from google.colab import drive
import sys
import os
drive.mount('/content/drive')
# set the project location and add it to the import path
py_file_location = "/content/drive/MyDrive/humour-detection"
sys.path.append(os.path.abspath(py_file_location))

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
#@title Download and Install
# Check GPU
!nvidia-smi
# Baseline data collection: uncomment for the full GloVe download
#!wget -nc http://nlp.stanford.edu/data/glove.6B.zip -O glove.6B.zip
#!unzip -n glove.6B.zip
# installing nltk
!pip -q install nltk

Mon May 24 08:26:16 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0    57W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
#@title Imports
# Library imports
import torch
import torch.nn as nn
from torch.utils.data import random_split
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords

from preprocessing.preprocessor import (create_edited_sentences, dataset_question_full_processing,
                                        create_custom_vocab, get_stop_words)
from dataloader.data_loaders import (Task1Dataset, collate_fn_padd, get_input_bert,
                                      get_dataloaders, get_dataloaders_no_random_split)
from models.biLSTM import BiLSTM
from trainer.biLSTM_trainer import biLSTM_train, biLSTM_eval
from utils.plot import (plot_sentence_length_stopwords, plot_mean_grade_distribution,
                        plot_number_characters, plot_number_words, plot_top_ngrams, 
                        plot_loss_vs_epochs)
from utils.ngrams import get_top_ngram
from utils.vocab import create_vocab, get_word2idx

**3. Settings and Parameters**

In [4]:
#@title Torch Settings
# Setting random seed and device
SEED = 1

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

# Number of epochs (epochs_bl is used for the biLSTM baseline below)
epochs_bl = 10
epochs = 100

# Proportion of training data for train compared to dev
train_proportion = 0.8

**4. Data Loading**

In [5]:
# Load data
train_df = pd.read_csv(py_file_location+'/data/task-1/train.csv')
dev_df = pd.read_csv(py_file_location+'/data/task-1/dev.csv')
test_df = pd.read_csv(py_file_location+'/data/task-1/test.csv')

# Convert them to full edited sentences
modified_train_df = create_edited_sentences(train_df)
modified_valid_df = create_edited_sentences(dev_df)
modified_test_df = create_edited_sentences(test_df)
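
`create_edited_sentences` is a project helper whose implementation is not shown in this notebook. As a rough sketch of the idea, assuming the Humicroedit convention that the `original` column marks the word to replace as `<word/>` and the `edit` column holds the replacement, the edited headline could be built as follows. The names `apply_micro_edit` and `strip_edit_tag` and the example headline are hypothetical.

```python
import re

# Matches the marked span, e.g. "<planet/>"; assumes one tag per headline
EDIT_TAG = re.compile(r"<(.+?)/>")

def apply_micro_edit(original, edit):
    """Return the headline with the tagged word replaced by `edit`."""
    return EDIT_TAG.sub(lambda _match: edit, original, count=1)

def strip_edit_tag(original):
    """Return the unedited headline with the markup removed."""
    return EDIT_TAG.sub(lambda match: match.group(1), original, count=1)

# Hypothetical headline, not taken from the dataset
headline = "Scientists discover new <planet/> near the sun"
edited = apply_micro_edit(headline, "restaurant")
plain = strip_edit_tag(headline)
```

The replacement is passed as a function rather than a string so that backslashes in an edit word are never interpreted as regex escapes.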

**5. Preprocessing**

***Stop Words***

In [6]:
# nltk stopwords english list
nltk.download('stopwords')
nltk_stopwords = list(stopwords.words('english'))

# import custom stopword list
simple_stopwords, custom_stopwords = get_stop_words()

all_stopwords = list(set(custom_stopwords + nltk_stopwords))

stopwords_lists = [[], simple_stopwords, custom_stopwords, nltk_stopwords,
                   all_stopwords]
edited_modes = ['question_edited','full_edited']

# Apply both the question and full versions to each dataframe and drop unused values
modified_train_df = dataset_question_full_processing(modified_train_df)
modified_valid_df = dataset_question_full_processing(modified_valid_df)
modified_test_df = dataset_question_full_processing(modified_test_df)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
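
The stop word lists assembled above are candidates for filtering tokens out of the headlines. How the project helpers apply them is not shown here, so the following is only a sketch, assuming whitespace tokenization and case-insensitive matching. The function `remove_stopwords` and the tiny inline list are hypothetical stand-ins for the real lists.

```python
def remove_stopwords(sentence, stopword_list):
    """Drop tokens that appear in `stopword_list` (case-insensitive)."""
    stopword_set = {word.lower() for word in stopword_list}
    kept = [tok for tok in sentence.split() if tok.lower() not in stopword_set]
    return " ".join(kept)

# Tiny illustrative list; the notebook uses the NLTK and custom lists instead
tiny_stopwords = ["the", "a", "of", "to"]
filtered = remove_stopwords("The president goes to a meeting of clowns", tiny_stopwords)
unchanged = remove_stopwords("The president goes to a meeting", [])
```

Passing an empty list leaves the sentence untouched, which mirrors the `[]` entry at the start of `stopwords_lists`.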


***Word2idx***

In [7]:
#@title Word2idx call on the original sentences
## Approach 1 code, using the imported helper functions:

# We set our training data and held-out data (the dev set serves as test data here)
training_data = train_df['original']
test_data = dev_df['original']

# Creating word vectors
training_vocab, training_tokenized_corpus = create_vocab(training_data)
test_vocab, test_tokenized_corpus = create_vocab(test_data)
# Creating joint vocab from test and train:
joint_vocab, joint_tokenized_corpus = create_vocab(pd.concat([training_data, test_data]))
print("Vocab created.")

Vocab created.
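
`create_vocab` returns a vocabulary together with the tokenized corpus, but its implementation lives in `utils.vocab` and is not shown. The sketch below illustrates one plausible way to produce both outputs, assuming lowercasing and a simple regex tokenizer; `build_vocab` and the example headlines are hypothetical.

```python
import re

def build_vocab(corpus):
    """Tokenize each sentence and collect the distinct tokens."""
    tokenized_corpus = [re.findall(r"\w+", sentence.lower()) for sentence in corpus]
    vocab = sorted({token for sentence in tokenized_corpus for token in sentence})
    return vocab, tokenized_corpus

# Hypothetical headlines, not taken from the dataset
toy_vocab, toy_tokenized = build_vocab(["Mayor meets the press", "The press meets nobody"])
```

Building the joint vocabulary from the concatenated training and held-out data, as done above, means that tokens seen only in the held-out set still receive an index.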


In [8]:
file_path = py_file_location + '/embeddings/glove.6B.100d.txt'
wvecs, word2idx, idx2word = get_word2idx(file_path, joint_vocab)

# Map each token to its embedding index, skipping tokens without a GloVe vector
vectorized_seqs = [[word2idx[tok] for tok in seq if tok in word2idx] for seq in training_tokenized_corpus]

# To avoid any sentences being empty (if no words match to our word embeddings)
vectorized_seqs = [x if len(x) > 0 else [0] for x in vectorized_seqs]
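
`get_word2idx` is a project helper, so its exact behaviour is not visible here. GloVe text files store one word per line followed by its space-separated vector components, and the `[0]` fallback above suggests that index 0 acts as a padding entry. Under those assumptions, a simplified loader could look like the sketch below. `load_glove_subset` is a hypothetical name, the three-dimensional vectors are toy data, and plain lists stand in for the NumPy array the notebook uses.

```python
import io

def load_glove_subset(lines, vocab, dim):
    """Keep only vectors for words in `vocab`; index 0 is a zero padding vector."""
    wanted = set(vocab)
    vectors = [[0.0] * dim]            # row 0: padding (assumed convention)
    word_to_idx, idx_to_word = {}, [""]  # idx_to_word[0] is the padding placeholder
    for line in lines:
        word, *values = line.rstrip().split(" ")
        if word in wanted and len(values) == dim:
            word_to_idx[word] = len(vectors)
            idx_to_word.append(word)
            vectors.append([float(v) for v in values])
    return vectors, word_to_idx, idx_to_word

# Toy 3-dimensional "embedding file" standing in for glove.6B.100d.txt
toy_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\ndog 0.7 0.8 0.9\n")
toy_vectors, toy_word2idx, toy_idx2word = load_glove_subset(toy_file, ["the", "dog", "zebra"], dim=3)

# Out-of-vocabulary tokens are skipped; an empty result falls back to [0]
toy_ids = [toy_word2idx[t] for t in ["the", "zebra", "dog"] if t in toy_word2idx] or [0]
```

Restricting the matrix to words in the joint vocabulary keeps the embedding layer small compared with loading every vector in the file.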

**6. Train GloVe + biLSTM**

In [9]:
#@title Train BiLSTM on original sentences
INPUT_DIM = len(word2idx)
EMBEDDING_DIM = 100
BATCH_SIZE = 32

model = BiLSTM(EMBEDDING_DIM, 50, INPUT_DIM, BATCH_SIZE, device)
print("Model initialised.")

model.to(device)
# We provide the model with our embeddings
model.embedding.weight.data.copy_(torch.from_numpy(wvecs))

feature = vectorized_seqs

# 'feature' is a list of lists, each containing embedding IDs for word tokens
train_and_dev = Task1Dataset(feature, train_df['meanGrade'])

train_examples = round(len(train_and_dev)*train_proportion)
dev_examples = len(train_and_dev) - train_examples

train_dataset, dev_dataset = random_split(train_and_dev,
                                           (train_examples,
                                            dev_examples))

train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)
dev_loader = torch.utils.data.DataLoader(dev_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)

print("Dataloaders created.")

loss_fn = nn.MSELoss()
loss_fn = loss_fn.to(device)

optimizer = torch.optim.Adam(model.parameters())

Model initialised.
Dataloaders created.
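
Headlines differ in length, so each batch has to be padded to a common length before it can be stacked; this is the job handed to `collate_fn_padd`, whose code is not shown in this notebook. The sketch below shows the general idea in plain Python, assuming each dataset item is a `(token_ids, score)` pair and that 0 is the padding index. `pad_batch` and the example batch are hypothetical.

```python
def pad_batch(batch, pad_id=0):
    """Pad (token_ids, score) pairs to the longest sequence in the batch."""
    lengths = [len(token_ids) for token_ids, _ in batch]
    max_len = max(lengths)
    padded = [token_ids + [pad_id] * (max_len - len(token_ids)) for token_ids, _ in batch]
    scores = [score for _, score in batch]
    return padded, scores, lengths

# Hypothetical batch of three headlines with their funniness scores
toy_batch = [([4, 9, 2], 1.2), ([7], 0.4), ([3, 5], 2.0)]
padded, scores, lengths = pad_batch(toy_batch)
```

In a PyTorch pipeline the padded lists would then be converted to tensors before being passed to the model. Padding per batch rather than to a global maximum keeps short batches small.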


In [10]:
biLSTM_train(train_loader, dev_loader, model, epochs_bl, device, optimizer, loss_fn)

Training model.
| Epoch: 01 | Train Loss: 0.36 | Train MSE: 0.36 | Train RMSE: 0.60 |         Val. Loss: 0.34 | Val. MSE: 0.34 |  Val. RMSE: 0.58 |
| Epoch: 02 | Train Loss: 0.35 | Train MSE: 0.35 | Train RMSE: 0.59 |         Val. Loss: 0.34 | Val. MSE: 0.34 |  Val. RMSE: 0.58 |
| Epoch: 03 | Train Loss: 0.34 | Train MSE: 0.34 | Train RMSE: 0.59 |         Val. Loss: 0.34 | Val. MSE: 0.34 |  Val. RMSE: 0.58 |
| Epoch: 04 | Train Loss: 0.34 | Train MSE: 0.34 | Train RMSE: 0.58 |         Val. Loss: 0.34 | Val. MSE: 0.34 |  Val. RMSE: 0.58 |
| Epoch: 05 | Train Loss: 0.33 | Train MSE: 0.33 | Train RMSE: 0.58 |         Val. Loss: 0.34 | Val. MSE: 0.34 |  Val. RMSE: 0.58 |
| Epoch: 06 | Train Loss: 0.29 | Train MSE: 0.29 | Train RMSE: 0.54 |         Val. Loss: 0.35 | Val. MSE: 0.35 |  Val. RMSE: 0.59 |
| Epoch: 07 | Train Loss: 0.26 | Train MSE: 0.26 | Train RMSE: 0.51 |         Val. Loss: 0.36 | Val. MSE: 0.36 |  Val. RMSE: 0.60 |
| Epoch: 08 | Train Loss: 0.24 | Train MSE: 0.24 | Train RMS