___

Project: `Automatic Humour Detection (AHD)`

Programmer: `@crispengari`

Date: `2022-04-26`

Abstract: _`Automatic Humour Detection (AHD) is a useful topic in modern technologies. In this notebook we are going to create an Artificial Neural Network model using Deep Learning to detect humour in short texts. AHD is useful in modern technologies such as virtual assistants and chatbots, where it helps them detect whether to take a conversation seriously or not`._

Research Paper: [`2004.12765`](https://arxiv.org/abs/2004.12765)

Keywords: `pytorch`, `embedding`, `torchtext`, `fast-text`, `CNN`, `dataset`, `accuracy`, `binary-classification`, `loss`

Programming Language: `python`

Dataset: [`kaggle`](https://www.kaggle.com/datasets/deepcontractor/200k-short-texts-for-humor-detection)
___

The dataset that we are going to use consists of `3` files stored in Google Drive:

1. train.csv
2. val.csv
3. test.csv

We are going to use `torchtext` and pytorch to create this model. We are going to create a `CNN` model that will perform a binary classification.


### Mounting the Drive
We are mounting the drive because we are going to load the files from our Google Drive. In the following code cell we mount the drive as follows:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Imports
In the following code cell, we are going to import the basic packages that we are going to use throughout this notebook.

In [2]:
import os
import time
import torch
import random
import math

import numpy as np
import pandas as pd
import torch.nn.functional as F

from prettytable import PrettyTable
from torch import nn
from matplotlib import pyplot as plt
from torchtext import data
from collections import Counter
from torchtext.vocab import vocab as V, Vocab, vectors


torch.__version__

'1.11.0+cu113'

### Seeds
Setting the seeds makes the results reproducible.

In [4]:
SEED = 42

np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### Device
We make use of `gpu` acceleration if it is available.

In [5]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Paths to data
In the following code cells we are going to define the path where our `csv` files are located.

In [6]:
splits_folder = "/content/drive/My Drive/NLP Data/Automatic Humor Detection/splits"
assert os.path.exists(splits_folder)


### Reading the data

We are going to read the data from our 3 files as dataframes.

In [7]:
train_df = pd.read_csv(os.path.join(splits_folder, 'train.csv'))
test_df = pd.read_csv(os.path.join(splits_folder, 'test.csv'))
val_df = pd.read_csv(os.path.join(splits_folder, 'val.csv'))

### Features and Labels
Next we are going to extract features and labels from our dataframes for all three sets.

In [8]:
# train
train_texts = train_df.text.values
train_labels = train_df.label.values

# test
test_texts = test_df.text.values
test_labels = test_df.label.values

# val
val_texts = val_df.text.values
val_labels = val_df.label.values

### Zipping features and labels

We are then going to zip the features and labels of the train data.

In [9]:
train_iter = zip(train_labels, train_texts)

### Tokenizer and Word Index
We are then going to create an instance of a tokenizer. This tokenizer just splits a sentence into a list of words. For the tokenizer we are going to use `spacy` with the `en` language model.

After that we are going to create a `Counter` instance from `collections` and count the number of occurrences of each word in `train_iter` so that we will be able to build a vocabulary based on these counts.

In [10]:
tokenizer = data.utils.get_tokenizer('spacy', 'en')
counter = Counter()

for (label, line) in train_iter:
    counter.update(tokenizer(line))

### Creating the vocabulary

To create the vocabulary we call `V` and pass in the `counter` as the first argument. We also pass `min_freq=2` so that words that appear fewer than 2 times are left out of the vocabulary; the text pipeline defined later maps such words to `<unk>`.

In [11]:
vocab = V(counter, 
          min_freq=2,
          specials=('<pad>', '<unk>', '<sos>', '<eos>')
)

In [12]:
print(len(vocab))
print(vocab.get_itos())
print(vocab.get_stoi())

38858
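The effect of `min_freq` and `specials` can be illustrated without `torchtext`. The sketch below is a plain-Python stand-in and not torchtext's implementation: `build_stoi` is a hypothetical helper, and `str.split` replaces the `spacy` tokenizer only to keep the example self-contained.

```python
from collections import Counter

def build_stoi(counter, min_freq=1, specials=()):
    """Hypothetical helper: map tokens to indices, with the specials first."""
    stoi = {token: idx for idx, token in enumerate(specials)}
    for token, freq in counter.items():
        # keep only tokens that are frequent enough and not already a special
        if freq >= min_freq and token not in stoi:
            stoi[token] = len(stoi)
    return stoi

# toy corpus; str.split stands in for the spaCy tokenizer
toy_sentences = ["the cat sat", "the dog sat", "a bird flew"]
toy_counter = Counter()
for sentence in toy_sentences:
    toy_counter.update(sentence.split())

toy_stoi = build_stoi(toy_counter, min_freq=2, specials=("<pad>", "<unk>"))
print(toy_stoi)  # {'<pad>': 0, '<unk>': 1, 'the': 2, 'sat': 3}
```

Only `the` and `sat` occur at least twice, so they are the only regular words that receive an index.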


### Text Pipeline and Label Pipeline

These helper functions allow us to preprocess text and labels during the creation of the data loaders.

1. text_pipeline
* This function takes in an argument `x`, which is a sentence, and tokenizes it into a list of words. It then looks up each word in the vocabulary and returns a list of token indices.

2. label_pipeline
* This function takes in an argument `x` and converts the label to a number: `1` for `humour` and `0` otherwise.

In [13]:
def text_pipeline(x):
  values = list()
  tokens = tokenizer(x)
  for token in tokens:
    try:
      v = vocab[token]
    except RuntimeError as e:
      v = vocab['<unk>']
    values.append(v)
  return values

label_pipeline = lambda x: 1 if x == 'humour' else 0
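The two pipelines can be tried on a toy mapping. This is a minimal sketch under assumptions: `toy_vocab`, `to_indices` and `to_label` are hypothetical names, the index values are made up, and `dict.get` with a default replaces the `try`/`except` fallback to `<unk>`.

```python
# toy mapping standing in for the real vocabulary (index values are made up)
toy_vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "joke": 3, "is": 4}

def to_indices(sentence, stoi, unk_token="<unk>"):
    """Map each whitespace token to its index, or to the <unk> index."""
    unk_idx = stoi[unk_token]
    return [stoi.get(token, unk_idx) for token in sentence.split()]

def to_label(label):
    """Same rule as label_pipeline: 1 for 'humour', 0 for anything else."""
    return 1 if label == "humour" else 0

print(to_indices("the joke is hilarious", toy_vocab))  # [2, 3, 4, 1]
print(to_label("humour"), to_label("not-humour"))      # 1 0
```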

### Humour Dataset

We are going to create a class that inherits from `torch.utils.data.Dataset` for our humour dataset.

In [14]:
class HumourDataset(torch.utils.data.Dataset):
  def __init__(self, labels, text):
    super(HumourDataset, self).__init__()
    self.labels = labels
    self.text = text
      
  def __getitem__(self, index):
    return self.labels[index], self.text[index]
  
  def __len__(self):
    return len(self.labels)

### Collate function
This function is passed to the `DataLoader` class to preprocess our features and labels. We are going to create `tokenize_batch` as our collate function in the following code cell:

In [15]:
def tokenize_batch(batch, max_len=50, padding="pre"):
  assert padding=="pre" or padding=="post", "the padding can be either pre or post"
  labels_list, text_list = [], []
  for _label, _text in batch:
    labels_list.append(label_pipeline(_label))
    text_holder = torch.zeros(max_len, dtype=torch.int32) # fixed size tensor of max_len with <pad> = 0
    processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int32)
    pos = min(max_len, len(processed_text))
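    if padding == "pre":
      # "pre" keeps the first `pos` tokens at the start; the zeros (<pad>) fill the end
      text_holder[:pos] = processed_text[:pos]
    else:
      # "post" keeps the last `pos` tokens at the end; the zeros (<pad>) fill the start
      text_holder[-pos:] = processed_text[-pos:]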
    text_list.append(text_holder.unsqueeze(dim=0))
  return torch.FloatTensor(labels_list), torch.cat(text_list, dim=0)
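The fixed-length step of the collate function can be sketched with plain lists to see what each `padding` value does. `pad_or_truncate` is a hypothetical helper written for this illustration; the early return for an empty sentence is an addition of the sketch and is not part of `tokenize_batch`.

```python
def pad_or_truncate(indices, max_len=50, padding="pre", pad_idx=0):
    """Plain-list version of the fixed-length step in tokenize_batch."""
    assert padding in ("pre", "post"), "the padding can be either pre or post"
    holder = [pad_idx] * max_len
    pos = min(max_len, len(indices))
    if pos == 0:
        return holder  # nothing to copy for an empty sentence
    if padding == "pre":
        holder[:pos] = indices[:pos]    # first tokens kept, pads at the end
    else:
        holder[-pos:] = indices[-pos:]  # last tokens kept, pads at the start
    return holder

print(pad_or_truncate([5, 6, 7], max_len=5, padding="pre"))            # [5, 6, 7, 0, 0]
print(pad_or_truncate([5, 6, 7], max_len=5, padding="post"))           # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4, padding="pre"))   # [1, 2, 3, 4]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4, padding="post"))  # [3, 4, 5, 6]
```

With the default `padding="pre"` the tokens come first and the zeros follow, which matches the `txt[0]` example printed further down.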


### Instances of 3 sets of data

We are then going to create instances of the three sets of data (train, test and validation) by passing the labels and texts to `HumourDataset`.

In [16]:
train_dataset = HumourDataset(train_labels, train_texts)
test_dataset = HumourDataset(test_labels, test_texts)
val_dataset = HumourDataset(val_labels, val_texts)

### DataLoaders and Batching

We are going to create data loaders for each set with a batch size of `128` for all 3 sets. We set the argument `shuffle` to `True` only in the train loader. Our collate function `tokenize_batch` must be passed so that pytorch applies the preprocessing that we defined in it.


In [17]:
BATCH_SIZE = 128

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=tokenize_batch)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=tokenize_batch)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=tokenize_batch)
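What batching and shuffling amount to can be sketched in plain Python. This is only an illustration and not how `DataLoader` is implemented: `iterate_batches` is a hypothetical helper and the dataset of 300 items is made up.

```python
import math
import random

def iterate_batches(items, batch_size, shuffle=False, seed=42):
    """Yield consecutive slices of `items`; the last one may be smaller."""
    order = list(range(len(items)))
    if shuffle:
        random.Random(seed).shuffle(order)  # visit the examples in a random order
    for start in range(0, len(order), batch_size):
        yield [items[i] for i in order[start:start + batch_size]]

toy_items = list(range(300))  # hypothetical dataset of 300 examples
toy_batches = list(iterate_batches(toy_items, batch_size=128))
print([len(batch) for batch in toy_batches])                 # [128, 128, 44]
print(len(toy_batches) == math.ceil(len(toy_items) / 128))   # True
```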

### Checking examples

In [18]:
lbl, txt = next(iter(train_loader))

In [19]:
lbl

tensor([0., 0., 1., 0., 1., 1., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 1., 1.,
        1., 1., 0., 1., 1., 1., 0., 0., 0., 1., 1., 1., 1., 1., 1., 0., 0., 0.,
        0., 0., 1., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 1., 1., 0., 1.,
        0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0.,
        0., 0., 1., 1., 1., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 1., 1., 0.,
        0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 1., 1.,
        0., 1., 1., 0., 1., 0., 0., 1., 0., 1., 1., 1., 1., 0., 0., 0., 0., 1.,
        0., 1.])

In [20]:
txt[0]

tensor([11473,    93,  3209,   202,    47,   225,   118,   735,    37,  1082,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
       dtype=torch.int32)

### Model

We are going to create a model called `AHDCNN` based on [this notebook](https://github.com/CrispenGari/pytorch-python/blob/main/09_NLP/02_Sentiment_Analyisis_Series/04_CNN_Sentiment_Analyisis.ipynb). We are going to use `CNN` to create a model for Automatic Humour Detection on short text:

In [21]:
class AHDCNN(nn.Module):
  def __init__(self, vocab_size, embedding_size, n_filters, filter_sizes, output_size, 
            dropout, pad_idx):
    super().__init__()
    self.embedding = nn.Embedding(vocab_size, embedding_size, padding_idx = pad_idx)
    self.convs = nn.ModuleList([
                                nn.Conv2d(in_channels = 1, 
                                          out_channels = n_filters, 
                                          kernel_size = (fs, embedding_size)) 
                                for fs in filter_sizes
                                ])
    self.fc = nn.Linear(len(filter_sizes) * n_filters, output_size)
    self.dropout = nn.Dropout(dropout)

  def forward(self, text):  
    # text = [batch size, sent len]
    embedded = self.embedding(text)    
    # embedded = [batch size, sent len, emb dim]
    embedded = embedded.unsqueeze(1)
    # embedded = [batch size, 1, sent len, emb dim]

    conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
    #conved_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]

    pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
    #pooled_n = [batch size, n_filters]
    cat = self.dropout(torch.cat(pooled, dim = 1))
    #cat = [batch size, n_filters * len(filter_sizes)]  
    return self.fc(cat)
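The shape comments in `forward` can be checked with a little arithmetic. The sketch below assumes the sentence length of `50` used by `tokenize_batch` and the filter sizes `3`, `4`, `5` with `100` filters each that are used for the model instance; `conv_output_len` is a hypothetical helper.

```python
def conv_output_len(sent_len, filter_size):
    """Length left after sliding a filter of this height with stride 1 and no padding."""
    return sent_len - filter_size + 1

sent_len = 50              # max_len used by tokenize_batch
filter_sizes = [3, 4, 5]   # same values as FILTER_SIZES
n_filters = 100            # same value as N_FILTERS

conv_lens = [conv_output_len(sent_len, fs) for fs in filter_sizes]
fc_in_features = len(filter_sizes) * n_filters  # one max-pooled value per filter

print(conv_lens)        # [48, 47, 46]
print(fc_in_features)   # 300
```

Max pooling reduces each of these lengths to a single value per filter, so the linear layer always receives `300` features regardless of the sentence length.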

### Model Instance

In [22]:
EMBEDDING_DIM = 100
N_FILTERS = 100
FILTER_SIZES = [3, 4, 5]
OUTPUT_DIM = 1
DROPOUT = 0.5
INPUT_DIM = len(vocab) 
PAD_IDX = vocab['<pad>']

In [23]:
ahd_model = AHDCNN(INPUT_DIM, EMBEDDING_DIM, N_FILTERS, 
                FILTER_SIZES, OUTPUT_DIM, DROPOUT, PAD_IDX).to(device)
ahd_model

AHDCNN(
  (embedding): Embedding(38858, 100, padding_idx=0)
  (convs): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
    (1): Conv2d(1, 100, kernel_size=(4, 100), stride=(1, 1))
    (2): Conv2d(1, 100, kernel_size=(5, 100), stride=(1, 1))
  )
  (fc): Linear(in_features=300, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Counting model parameters

In the next code cell we are going to count model parameters.

In [24]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(ahd_model)
print(f"Total number of parameters: {n_params:,}\nTotal trainable parameters: {trainable_params:,}")

Total number of parameters: 4,006,401
Total trainable parameters: 4,006,401
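The total can be reproduced by hand from the layer shapes in the model summary. This is a sketch of the arithmetic only; the variable names are chosen for the illustration.

```python
vocab_size, emb_dim = 38858, 100          # Embedding(38858, 100)
n_filters, filter_sizes = 100, [3, 4, 5]  # three Conv2d layers with 100 filters each
output_size = 1                           # Linear(300, 1)

embedding_params = vocab_size * emb_dim
# each Conv2d has in_channels * out_channels * kernel_h * kernel_w weights plus one bias per filter
conv_params = sum(1 * n_filters * fs * emb_dim + n_filters for fs in filter_sizes)
# the linear layer has one weight per input feature and output, plus one bias per output
fc_params = len(filter_sizes) * n_filters * output_size + output_size

total_params = embedding_params + conv_params + fc_params
print(f"{total_params:,}")  # 4,006,401
```

Almost all of the parameters (3,885,800) sit in the embedding layer.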


### Loading Pretrained vectors

Next we are going to load pretrained vectors. We are going to make use of `glove_6B_100D`, which was trained on a corpus of about 6 billion tokens and provides word vectors with `100` dimensions.

In [25]:
pretrained_embeddings = vectors.GloVe('6B', 100)

.vector_cache/glove.6B.zip: 862MB [02:40, 5.37MB/s]                           
100%|█████████▉| 399999/400000 [00:20<00:00, 19418.35it/s]


### Checking the index of a single word.

In [26]:
print(pretrained_embeddings.stoi.get("the"))

0


In [27]:
vocab.get_stoi().items()



### Updating the embedding matrix

The pretrained GloVe vectors cover `400,000` words, but our vocabulary contains only `38,858` entries. So we want to create an embedding matrix that suits our data, with one row per word in our vocabulary.

In [28]:
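embedding_matrix = torch.zeros((INPUT_DIM, EMBEDDING_DIM))

for word, index in vocab.get_stoi().items():
  # look up the loop variable `word`, not the literal string 'word'
  vector_idx = pretrained_embeddings.stoi.get(word)
  if vector_idx is not None:
    embedding_matrix[index] = pretrained_embeddings.vectors[vector_idx]
  # words that are missing from GloVe keep their all-zero row

### Checking our embedding matrix

The outputs recorded in the next cells were produced by a run that looked up the literal string `'word'` for every entry, which is why all rows show the same vector. With the lookup on the loop variable, each row holds the vector of its own word and words missing from GloVe stay at zero.

In [29]:
embedding_matrix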

tensor([[ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        ...,
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563]])
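The intended lookup (one row per vocabulary word, zeros for words without a pretrained vector) can be illustrated on a toy example with plain lists. All names and values below are hypothetical stand-ins for `pretrained_embeddings.stoi`, `pretrained_embeddings.vectors` and `vocab.get_stoi()`.

```python
EMB_DIM = 3  # toy dimension; the notebook uses 100

# hypothetical stand-ins for pretrained_embeddings.stoi and .vectors
pretrained_stoi = {"the": 0, "joke": 1}
pretrained_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# hypothetical stand-in for vocab.get_stoi()
toy_vocab_stoi = {"<pad>": 0, "<unk>": 1, "the": 2, "joke": 3, "duane": 4}

toy_matrix = [[0.0] * EMB_DIM for _ in toy_vocab_stoi]
for word, index in toy_vocab_stoi.items():
    vector_idx = pretrained_stoi.get(word)
    if vector_idx is not None:
        toy_matrix[index] = list(pretrained_vectors[vector_idx])
    # words missing from the pretrained table keep their zero row

for row in toy_matrix:
    print(row)
```

Rows `2` and `3` receive the pretrained vectors of `the` and `joke`, while the other rows stay at zero.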

### Adding our embedding matrix to the embedding layer

We are going to add the embedding weights based on our data to the embedding layer of our model as follows:

In [30]:
ahd_model.embedding.weight.data.copy_(embedding_matrix)

tensor([[ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        ...,
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563]],
       device='cuda:0')

### Checking the embedding weights.

In [31]:
ahd_model.embedding.weight.data

tensor([[ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        ...,
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563],
        [ 0.1233,  0.5574,  0.7420,  ..., -0.1561,  0.0142,  0.6563]],
       device='cuda:0')

### Optimizer and Criterion

We are going to use the `Adam` optimizer with the default parameters, and for the loss function we are going to make use of `BCEWithLogitsLoss()` and move it to the `device`.

In [32]:
optimizer = torch.optim.Adam(ahd_model.parameters())
criterion = nn.BCEWithLogitsLoss().to(device)

### Accuracy function

Since this is a binary classification task we need to calculate the binary accuracy between the target labels and predicted labels.

In [33]:
def binary_accuracy(y_preds, y_true):
  rounded_preds = torch.round(torch.sigmoid(y_preds))
  correct = (rounded_preds == y_true).float()
  return correct.sum() / len(correct)
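The same computation can be followed on plain Python lists. This is a sketch for illustration: `binary_accuracy_lists` is a hypothetical name and the logits and targets are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_accuracy_lists(logits, targets):
    """Fraction of examples whose rounded sigmoid output equals the target."""
    rounded = [round(sigmoid(z)) for z in logits]
    correct = [float(p == t) for p, t in zip(rounded, targets)]
    return sum(correct) / len(correct)

toy_logits = [2.0, -1.0, 0.3, -0.2]   # raw model outputs (made up)
toy_targets = [1, 0, 0, 0]
print(binary_accuracy_lists(toy_logits, toy_targets))  # 0.75
```

The third example has a positive logit, so it is rounded to `1` and counted as wrong; the other three are correct.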

### The `train` and `evaluate` functions

We are then going to create the `train` and `evaluate` functions. Each takes the model, an iterator and the criterion (`train` also takes the optimizer) and returns the `loss` and `accuracy` averaged over the batches.

In [34]:
def train(model, iterator, optimizer, criterion):
  epoch_loss,epoch_acc = 0, 0
  model.train()
  for batch in iterator:
    y, X = batch
    X = X.to(device)
    y = y.to(device)
    optimizer.zero_grad()
    predictions = model(X).squeeze(1)
    loss = criterion(predictions, y)
    acc = binary_accuracy(predictions, y)
    loss.backward()
    optimizer.step()
    epoch_loss += loss.item()
    epoch_acc += acc.item()
  return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
  epoch_loss,epoch_acc = 0, 0
  model.eval()
  with torch.no_grad():
    for batch in iterator:
      y, X = batch
      X = X.to(device)
      y = y.to(device)
      predictions = model(X).squeeze(1)
      loss = criterion(predictions, y)
      acc = binary_accuracy(predictions, y)
      epoch_loss += loss.item()
      epoch_acc += acc.item()
  return epoch_loss / len(iterator), epoch_acc / len(iterator)
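Both functions divide the summed per-batch metrics by `len(iterator)`, so every batch counts equally even when the last batch is smaller. The sketch below compares this with an average weighted by batch size; the batch sizes and accuracies are made up for illustration.

```python
# (batch_size, batch_accuracy) pairs; the numbers are made up for illustration
toy_batch_stats = [(128, 0.95), (128, 0.97), (44, 0.75)]

# what train/evaluate return: every batch counts equally
mean_of_batch_means = sum(acc for _, acc in toy_batch_stats) / len(toy_batch_stats)

# alternative: weight each batch by the number of examples it holds
n_examples = sum(size for size, _ in toy_batch_stats)
example_weighted = sum(size * acc for size, acc in toy_batch_stats) / n_examples

print(round(mean_of_batch_means, 4))
print(round(example_weighted, 4))
```

The two values differ only when the batches have different sizes, and with a batch size of `128` on a large dataset the difference is expected to be small.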

### Helper functions

The following two helper functions help us visualize our training loop so that we can see what is going on.

In [35]:
def hms_string(sec_elapsed):
  h = int(sec_elapsed / (60 * 60))
  m = int((sec_elapsed % (60 * 60)) / 60)
  s = sec_elapsed % 60
  return "{}:{:>02}:{:>05.2f}".format(h, m, s)
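A quick check of `hms_string` on two durations (the function is repeated here so that the example runs on its own):

```python
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))         # whole hours
    m = int((sec_elapsed % (60 * 60)) / 60)  # whole minutes left after the hours
    s = sec_elapsed % 60                     # remaining seconds, with decimals
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

print(hms_string(119.81))   # 0:01:59.81
print(hms_string(3725.5))   # 1:02:05.50
```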

In [36]:
def visualize_training(start, end, train_loss, train_accuracy, val_loss, val_accuracy, title):
  data = [
       ["Training", f'{train_loss:.3f}', f'{train_accuracy:.3f}', f"{hms_string(end - start)}" ],
       ["Validation", f'{val_loss:.3f}', f'{val_accuracy:.3f}', "" ],       
  ]
  table = PrettyTable(["CATEGORY", "LOSS", "ACCURACY", "ETA"])
  table.align["CATEGORY"] = 'l'
  table.align["LOSS"] = 'r'
  table.align["ACCURACY"] = 'r'
  table.align["ETA"] = 'r'
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)

### Training loop

In the training loop we are going to save the model whenever the loss on the validation set decreases.

In [37]:
N_EPOCHS = 10
MODEL_NAME = 'ahd-cnn-torch.pt'


best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
  start = time.time()
  train_loss, train_acc = train(ahd_model, train_loader, optimizer, criterion)
  valid_loss, valid_acc = evaluate(ahd_model, val_loader, criterion)
  title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss < best_valid_loss else 'not saving...'}"
  if valid_loss < best_valid_loss:
      best_valid_loss = valid_loss
      torch.save(ahd_model.state_dict(), MODEL_NAME)
  end = time.time()
  visualize_training(start, end, train_loss, train_acc, valid_loss, valid_acc, title)

+--------------------------------------------+
|     EPOCH: 01/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.180 |    0.927 | 0:01:59.81 |
| Validation | 0.100 |    0.963 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 02/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.085 |    0.971 | 0:01:59.03 |
| Validation | 0.088 |    0.968 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|         EPOCH: 03/10 not saving...         |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   
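The saving rule of the training loop can be isolated in a few lines. In this sketch `epochs_that_save` is a hypothetical helper; the first two validation losses are taken from the log above and the remaining ones are made up for illustration.

```python
def epochs_that_save(valid_losses):
    """Return the 1-based epochs at which a new best validation loss is reached."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:  # strict improvement, as in the training loop
            best = loss
            saved.append(epoch)
    return saved

# 0.100 and 0.088 come from the log above; 0.091 and 0.086 are made up
toy_valid_losses = [0.100, 0.088, 0.091, 0.086]
print(epochs_that_save(toy_valid_losses))  # [1, 2, 4]
```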

### Evaluating the best model

In the following code cell we are going to evaluate the best model using the `test_loader`.

In [38]:
def tabulate(column_names, data, title):
  table = PrettyTable(column_names)
  table.title= title
  table.align[column_names[0]] = 'l'
  table.align[column_names[1]] = 'r'
  table.align[column_names[2]] = 'r'
  table.align[column_names[3]] = 'r'
  for row in data:
    table.add_row(row)
  print(table)

In [39]:
ahd_model.load_state_dict(torch.load(MODEL_NAME))

column_names = ["Set", "Loss", "Accuracy", "ETA (time)"]
test_loss, test_acc = evaluate(ahd_model, test_loader, criterion)
title = "Model Evaluation Summary"
data_rows = [["Test", f'{test_loss:.3f}', f'{test_acc * 100:.2f}%', ""]]

tabulate(column_names, data_rows, title)

+--------------------------------------+
|       Model Evaluation Summary       |
+------+-------+----------+------------+
| Set  |  Loss | Accuracy | ETA (time) |
+------+-------+----------+------------+
| Test | 0.080 |   97.22% |            |
+------+-------+----------+------------+


### Making predictions.



In [40]:
print(train_df.text[1], train_df.label[1])

The richest black man in nyc has got to be duane reade. humour


In [41]:
def preprocess_text(text, max_len=50, padding="pre"):
  assert padding=="pre" or padding=="post", "the padding can be either pre or post"
  text_holder = torch.zeros(max_len, dtype=torch.int32) # fixed size tensor of max_len with <pad> = 0
  processed_text = torch.tensor(text_pipeline(text), dtype=torch.int32)
 
  pos = min(max_len, len(processed_text))
  if padding == "pre":
    text_holder[:pos] = processed_text[:pos]
  else:
    text_holder[-pos:] = processed_text[-pos:]
  text_list= text_holder.unsqueeze(dim=0)
  return text_list

def predict_homour(sent: str, model):
  classes =["NOT HOMOUR", "HUMOUR"]
  model.eval()
  tensor = preprocess_text(sent)
  pred = torch.sigmoid(model(tensor.to(device))).item()
  
  label = 1 if pred >=0.5 else 0
  probability = float(round(pred, 3)) if pred >= 0.5 else float(round(1 - pred, 3))

  pred_obj ={
      "label": label,
      "probability": probability,
      "class": classes[label]
  }
  return pred_obj
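How a single raw model output becomes the returned label and probability can be followed without the model. This is a sketch under assumptions: `describe_logit` is a hypothetical helper and the logits `3.0` and `-3.0` are made up.

```python
import math

def describe_logit(logit, classes=("NOT HOMOUR", "HUMOUR")):
    """Turn one raw model output into a label and the probability of that label."""
    pred = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    label = 1 if pred >= 0.5 else 0
    # report the probability of the predicted class, not always of class 1
    probability = round(pred, 3) if label == 1 else round(1 - pred, 3)
    return {"label": label, "probability": probability, "class": classes[label]}

print(describe_logit(3.0))
print(describe_logit(-3.0))
```

Both calls report a probability of about `0.953`, once for class `1` and once for class `0`, because the reported value always belongs to the predicted class.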

In [54]:
vocab["the"]

41

### Humour

In [47]:
train_df.head(5)

| Unnamed: 0.1 | Unnamed: 0 | text | label |
|---|---|---|---|
| 0 | 38762 | 10 brands that will disappear in 2014: 24/7 wa... | not-humour |
| 1 | 76883 | The richest black man in nyc has got to be dua... | humour |
| 2 | 2018 | What do you get if king kong sits on your pian... | humour |
| 3 | 133899 | If the opposite of pro is con, then what is th... | humour |
| 4 | 170373 | My friend's body temperature is currently -273... | humour |


In [48]:
train_texts[:5], train_labels[:5]

(array(['10 brands that will disappear in 2014: 24/7 wall st.',
        'The richest black man in nyc has got to be duane reade.',
        'What do you get if king kong sits on your piano? a flat note.',
        'If the opposite of pro is con, then what is the opposite of progress',
        "My friend's body temperature is currently -273.15 c don't worry though, he's 0k."],
       dtype=object),
 array(['not-humour', 'humour', 'humour', 'humour', 'humour'], dtype=object))

In [50]:
predict_homour(train_texts[1], ahd_model)

{'class': 'HUMOUR', 'label': 1, 'probability': 0.824}

In [51]:
predict_homour(train_texts[2], ahd_model)

{'class': 'HUMOUR', 'label': 1, 'probability': 1.0}

In [52]:
predict_homour(train_texts[3], ahd_model)

{'class': 'HUMOUR', 'label': 1, 'probability': 0.999}

In [53]:
predict_homour(train_texts[4], ahd_model)

{'class': 'HUMOUR', 'label': 1, 'probability': 1.0}

### Non-Humour

In [49]:
predict_homour(train_texts[0], ahd_model)

{'class': 'NOT HOMOUR', 'label': 0, 'probability': 1.0}

### Downloading the Model

Now we can download the model so that it can be saved as a static file in the following code cell.

In [None]:
from google.colab import files
files.download(MODEL_NAME)

### Saving and Downloading the vocabulary

Next we are going to save our vocabulary's `stoi` mapping to a JSON file and download it.

In [None]:
vocab.get_stoi()['the'] == vocab['the']

True

In [None]:
import json

In [None]:
with open('vocab-pt.json', 'w') as f:
  json.dump(vocab.get_stoi(), f)

files.download('vocab-pt.json')
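The saved mapping can later be read back and used for lookups without `torchtext`. The sketch below is a minimal round trip under assumptions: `saved_stoi` is a made-up stand-in for `vocab.get_stoi()`, the file is written to a temporary folder, and `lookup` is a hypothetical helper.

```python
import json
import os
import tempfile

saved_stoi = {"<pad>": 0, "<unk>": 1, "the": 2}  # stand-in for vocab.get_stoi()

# write the mapping to a JSON file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), "vocab-pt.json")
with open(path, "w") as f:
    json.dump(saved_stoi, f)

# read it back as a plain dict
with open(path) as f:
    loaded_stoi = json.load(f)

def lookup(token):
    """Return the index of a token, or the <unk> index for unknown words."""
    return loaded_stoi.get(token, loaded_stoi["<unk>"])

print(lookup("the"), lookup("zebra"))  # 2 1
```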
