___

Project: `Automatic Humour Detection (AHD)`

Programmer: `@crispengari`

Date: `2022-04-26`

Abstract: _`Automatic Humour Detection (AHD) is a useful topic in modern technologies. In this notebook we are going to create an Artificial Neural Network model using Deep Learning to detect humour in short texts. AHD is useful in modern technologies such as virtual assistants and chatbots, because it helps them decide whether or not to take a conversation seriously`._

Research Paper: [`2004.12765`](https://arxiv.org/abs/2004.12765)

Keywords: `pytorch`, `embedding`, `torchtext`, `fast-text`, `CNN`, `dataset`, `accuracy`, `binary-classification`, `loss`

Programming Language: `python`

Dataset: [`kaggle`](https://www.kaggle.com/datasets/deepcontractor/200k-short-texts-for-humor-detection)
___

The dataset that we are going to use consists of `3` files stored in Google Drive:

1. train.csv
2. val.csv
3. test.csv

We are going to use `torchtext` and PyTorch to create this model. We are going to create a bidirectional LSTM model (`AHDBiLSTM`) that will perform binary classification.


### Mounting the Drive
We mount the drive because we are going to load the files from our Google Drive. The following code cell mounts it:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Imports
In the following code cell, we are going to import the basic packages that we are going to use throughout this notebook.

In [2]:
import os
import time
import torch
import random
import math

import numpy as np
import pandas as pd
import torch.nn.functional as F

from prettytable import PrettyTable
from torch import nn
from matplotlib import pyplot as plt
from torchtext import data
from collections import Counter
from torchtext.vocab import vocab as V, Vocab, vectors


torch.__version__

'1.12.1+cu113'

### Seeds
Setting the seed helps with reproducibility.

In [3]:
SEED = 42

np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
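
As a quick illustration of why seeding matters, the sketch below uses only the standard library. `sample_numbers` is a hypothetical helper, not part of this notebook; it re-seeds Python's `random` module and checks that the same sequence comes back.

```python
import random

def sample_numbers(seed, n=5):
    """Return n pseudo-random floats drawn right after seeding the generator."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first_run = sample_numbers(42)
second_run = sample_numbers(42)
other_seed = sample_numbers(7)

print(first_run == second_run)  # same seed, same sequence
print(first_run == other_seed)  # a different seed gives a different sequence
```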

### Device
We make use of `gpu` acceleration if it is available.

In [4]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Paths to data
In the following code cells we are going to define the path where our `csv` files are located.

In [5]:
splits_folder = "/content/drive/My Drive/NLP Data/Automatic Humor Detection/splits"
assert os.path.exists(splits_folder), "splits folder not found"


### Reading the data

We are going to read the data from our 3 files as dataframes.

In [6]:
train_df = pd.read_csv(os.path.join(splits_folder, 'train.csv'))
test_df = pd.read_csv(os.path.join(splits_folder, 'test.csv'))
val_df = pd.read_csv(os.path.join(splits_folder, 'val.csv'))

### Features and Labels
Next we are going to extract features and labels from our dataframes for all three sets.

In [7]:
# train
train_texts = train_df.text.values
train_labels = train_df.label.values

# test
test_texts = test_df.text.values
test_labels = test_df.label.values

# val
val_texts = val_df.text.values
val_labels = val_df.label.values

### Zipping features and labels

We are then going to zip the features and labels of the train data.

In [8]:
train_iter = zip(train_labels, train_texts)

### Tokenizer and Word Index
We are then going to create an instance of a tokenizer. This tokenizer just splits a sentence into a list of words. For the tokenizer we are going to use `spacy` and the language will be `en`.

After that we are going to create a `Counter` instance from `collections` and count the number of occurrences of each word in `train_iter`, so that we can build a vocabulary from these counts.

In [9]:
tokenizer = data.utils.get_tokenizer('spacy', 'en')
counter = Counter()

for (label, line) in train_iter:
    counter.update(tokenizer(line))

  f'Spacy model "{language}" could not be loaded, trying "{OLD_MODEL_SHORTCUTS[language]}" instead'
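
The counting step can be sketched without spaCy. In the snippet below, `simple_tokenizer` is a hypothetical whitespace tokenizer standing in for the spaCy one, and `toy_pairs` is made-up data, so the counts are only illustrative.

```python
from collections import Counter

def simple_tokenizer(sentence):
    """Hypothetical stand-in for the spaCy tokenizer: lowercase and split on whitespace."""
    return sentence.lower().split()

# Made-up (label, text) pairs in the same order as train_iter.
toy_pairs = [
    ("humour", "why did the chicken cross the road"),
    ("not-humour", "the council approved the road budget"),
]

toy_counter = Counter()
for _label, line in toy_pairs:
    toy_counter.update(simple_tokenizer(line))

print(toy_counter.most_common(2))  # [('the', 4), ('road', 2)]
```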


### Creating the vocabulary

To create the vocabulary we call `V` and pass in the `counter` as the first argument. We also pass `min_freq=2` so that words that appear fewer than 2 times are left out of the vocabulary; the text pipeline later maps them to `<unk>`.

In [10]:
vocab = V(counter, 
          min_freq=2,
          specials=('<pad>', '<unk>', '<sos>', '<eos>')
)

In [11]:
print(len(vocab))
print(vocab.get_itos())
print(vocab.get_stoi())

38753
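
To make the effect of `min_freq` and `specials` concrete, here is a minimal pure-Python sketch. `build_vocab` is a hypothetical helper and not the torchtext implementation, which may order tokens differently; it only shows the idea of placing special tokens first and dropping rare words.

```python
from collections import Counter

def build_vocab(counter, min_freq=2, specials=("<pad>", "<unk>", "<sos>", "<eos>")):
    """Map tokens to indices: specials first, then tokens seen at least min_freq times."""
    stoi = {token: index for index, token in enumerate(specials)}
    for token, freq in counter.items():
        if freq >= min_freq and token not in stoi:
            stoi[token] = len(stoi)
    return stoi

toy_counter = Counter("the cat sat on the mat the end".split())
toy_stoi = build_vocab(toy_counter)
unk_index = toy_stoi["<unk>"]

print(toy_stoi)                        # only "the" survives min_freq=2
print(toy_stoi.get("cat", unk_index))  # "cat" appears once, so it falls back to <unk>
```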


### Text Pipeline and Label Pipeline

These helper functions allow us to preprocess text and labels when creating the data loaders.

1. text_pipeline
* This function takes in an argument `x`, which is a sentence, and tokenizes it into a list of words. It then looks each word up in the vocabulary and returns a list of indices.

2. label_pipeline
* This function takes in an argument `x` and converts the label to a number: `1` for `humour` and `0` otherwise.

In [12]:
def text_pipeline(x):
  values = list()
  tokens = tokenizer(x)
  for token in tokens:
    try:
      v = vocab[token]
    except RuntimeError as e:
      v = vocab['<unk>']
    values.append(v)
  return values
label_pipeline = lambda x: 1 if x == 'humour' else 0

### Humour Dataset

We are going to create a class that will inherit from `torch.utils.data.Dataset` for our humour dataset.

In [13]:
class HumourDataset(torch.utils.data.Dataset):
  def __init__(self, labels, text):
    super(HumourDataset, self).__init__()
    self.labels = labels
    self.text = text
      
  def __getitem__(self, index):
    return self.labels[index], self.text[index]
  
  def __len__(self):
    return len(self.labels)

### Collate function
This function is passed to the `DataLoader` class to preprocess our features and labels. We are going to create `tokenize_batch` as our collate function in the following code cell:

In [14]:
def tokenize_batch(batch, max_len=50, padding="pre"):
  assert padding=="pre" or padding=="post", "the padding can be either pre or post"
  labels_list, text_list = [], []
  for _label, _text in batch:
    labels_list.append(label_pipeline(_label))
    text_holder = torch.zeros(max_len, dtype=torch.int32) # fixed size tensor of max_len with <pad> = 0
    processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int32)
    pos = min(max_len, len(processed_text))
    if padding == "pre":
      text_holder[:pos] = processed_text[:pos]
    else:
      text_holder[-pos:] = processed_text[-pos:]
    text_list.append(text_holder.unsqueeze(dim=0))
  return torch.FloatTensor(labels_list), torch.cat(text_list, dim=0)
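
The slicing logic is easier to see on plain lists. The sketch below mirrors it with a hypothetical helper, `pad_or_truncate`, which is not part of the notebook. Note that with `padding="pre"` the tokens come first and the padding after, which is the opposite of the naming used by Keras' `pad_sequences`. The explicit empty-input check is an addition of this sketch.

```python
def pad_or_truncate(indices, max_len=50, padding="pre", pad_index=0):
    """Return a list of exactly max_len indices, mirroring the slicing in tokenize_batch."""
    assert padding in ("pre", "post"), "the padding can be either pre or post"
    holder = [pad_index] * max_len
    pos = min(max_len, len(indices))
    if pos == 0:
        return holder  # nothing to copy; avoids the ambiguous [-0:] slice
    if padding == "pre":
        holder[:pos] = indices[:pos]    # tokens first, padding after
    else:
        holder[-pos:] = indices[-pos:]  # padding first, tokens last
    return holder

print(pad_or_truncate([5, 6, 7], max_len=5))                  # [5, 6, 7, 0, 0]
print(pad_or_truncate([5, 6, 7], max_len=5, padding="post"))  # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4))         # [1, 2, 3, 4]
```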


### Instances of 3 sets of data

We are then going to create instances of the three sets of data (train, test and validation) and pass the labels and texts to `HumourDataset`.

In [15]:
train_dataset = HumourDataset(train_labels, train_texts)
test_dataset = HumourDataset(test_labels, test_texts)
val_dataset = HumourDataset(val_labels, val_texts)

### DataLoaders and Batching

We are going to create dataloaders for each set with a batch size of `128` for all 3 sets. We set the `shuffle` argument to `True` only in the train loader. Our collate function `tokenize_batch` must be passed so that PyTorch applies the preprocessing that we defined in it.


In [16]:
BATCH_SIZE = 128

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=tokenize_batch)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=tokenize_batch)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=tokenize_batch)

### Checking examples

In [17]:
lbl, txt = next(iter(train_loader))

In [18]:
lbl

tensor([0., 0., 1., 0., 1., 1., 1., 0., 0., 1., 1., 1., 1., 0., 0., 1., 1., 1.,
        1., 1., 0., 1., 1., 1., 0., 0., 0., 1., 1., 1., 1., 1., 1., 0., 0., 0.,
        0., 0., 1., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 1., 1., 1., 0., 1.,
        0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0.,
        0., 0., 1., 1., 1., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 1., 1., 0.,
        0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 1., 1.,
        0., 1., 1., 0., 1., 0., 0., 1., 0., 1., 1., 1., 1., 0., 0., 0., 0., 1.,
        0., 1.])

In [19]:
txt[0]

tensor([11461,    93,  3211,   202,    47,   225,   118,   735,    37,  1082,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0],
       dtype=torch.int32)

### Model

We are going to create a model called `AHDBiLSTM`. This model is a bidirectional LSTM with an embedding layer in front of it and a linear layer on top.

In [20]:
class AHDBiLSTM(nn.Module):
  def __init__(self, vocab_size, embedding_size, hidden_size, output_size, num_layers
               , bidirectional, dropout, pad_idx):
    super(AHDBiLSTM, self).__init__()

    self.embedding = nn.Embedding(vocab_size, embedding_dim=embedding_size, padding_idx=pad_idx)
    self.lstm = nn.LSTM(embedding_size, hidden_size=hidden_size, 
                        bidirectional=bidirectional, num_layers=num_layers,
                        dropout=dropout)
    self.fc = nn.Linear(hidden_size * 2, out_features=output_size)
    self.dropout = nn.Dropout(dropout)

  def forward(self, text, text_lengths):
    embedded = self.dropout(self.embedding(text))
    # batch_first=True because the input shape is (batch_size, seq_len); text_lengths must be on the CPU.
    packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'), enforce_sorted=False, batch_first=True)
    packed_output, (h_0, c_0) = self.lstm(packed_embedded)
    output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
    h_0 = self.dropout(torch.cat((h_0[-2,:,:], h_0[-1,:,:]), dim = 1))
    return self.fc(h_0)
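
The indexing `h_0[-2]` and `h_0[-1]` in `forward` relies on how a stacked bidirectional LSTM lays out its final hidden state. The following sketch uses small made-up sizes and random inputs, so only the shapes are meaningful.

```python
import torch
from torch import nn

torch.manual_seed(0)

batch_size, seq_len, emb_dim, hidden = 4, 7, 10, 16
lstm = nn.LSTM(emb_dim, hidden_size=hidden, num_layers=2, bidirectional=True)

embedded = torch.randn(batch_size, seq_len, emb_dim)
lengths = torch.tensor([7, 5, 3, 2])  # made-up token counts, kept on the CPU

packed = nn.utils.rnn.pack_padded_sequence(
    embedded, lengths, batch_first=True, enforce_sorted=False
)
_, (h_n, _) = lstm(packed)

# h_n is (num_layers * num_directions, batch, hidden); the last two rows
# are the forward and backward states of the top layer.
sentence_vector = torch.cat((h_n[-2], h_n[-1]), dim=1)

print(h_n.shape)              # torch.Size([4, 4, 16])
print(sentence_vector.shape)  # torch.Size([4, 32])
```

The concatenated vector has `hidden * 2` features, which is why the linear layer is declared with `hidden_size * 2` inputs.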

### Model Instance

In [21]:
INPUT_DIM = len(vocab) 
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = vocab['<pad>']
ahd_model = AHDBiLSTM(INPUT_DIM, 
              EMBEDDING_DIM, 
              HIDDEN_DIM, 
              OUTPUT_DIM, 
              N_LAYERS, 
              BIDIRECTIONAL, 
              DROPOUT, 
              PAD_IDX
            ).to(device)
ahd_model

AHDBiLSTM(
  (embedding): Embedding(38753, 100, padding_idx=0)
  (lstm): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Counting model parameters

In the next code cell we are going to count model parameters.

In [22]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(ahd_model)
print(f"Total number of parameters: {n_params:,}\nTotal trainable parameters: {trainable_params:,}")

Total number of parameters: 6,185,957
Total trainable parameters: 6,185,957
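
The reported total can be cross-checked by hand. The sketch below assumes the usual LSTM layout of four gates, each with input weights, hidden weights and two bias vectors; `lstm_direction_params` is a hypothetical helper written for this check.

```python
def lstm_direction_params(input_size, hidden_size):
    """Parameters of one LSTM direction: 4 gates with input/hidden weights and two biases."""
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

vocab_size, emb_dim, hidden, out_dim = 38753, 100, 256, 1

embedding = vocab_size * emb_dim
layer_1 = 2 * lstm_direction_params(emb_dim, hidden)     # two directions
layer_2 = 2 * lstm_direction_params(2 * hidden, hidden)  # input is the concatenated layer-1 output
linear = 2 * hidden * out_dim + out_dim                  # weights plus bias

total = embedding + layer_1 + layer_2 + linear
print(f"{total:,}")  # 6,185,957
```

Most of the parameters (3,875,300) sit in the embedding layer.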


### Optimizer and Criterion

We are going to use the `Adam` optimizer with the default parameters. For the loss function we are going to use `BCEWithLogitsLoss()` and move it to the `device`.

In [23]:
optimizer = torch.optim.Adam(ahd_model.parameters())
criterion = nn.BCEWithLogitsLoss().to(device)
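
`BCEWithLogitsLoss` combines a sigmoid with binary cross-entropy, which is why the model outputs raw logits. The sketch below is a plain-Python illustration of a common numerically stable form for a single example. It is not PyTorch's actual implementation, and both function names are hypothetical.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy computed directly from a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def bce_from_probability(logit, target):
    """Reference version: apply the sigmoid first, then the usual cross-entropy."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# Both versions agree on moderate logits.
for logit, target in [(2.0, 1.0), (2.0, 0.0), (-1.5, 1.0)]:
    print(round(bce_with_logits(logit, target), 6),
          round(bce_from_probability(logit, target), 6))
```

The stable form also works for very large logits, where the sigmoid would round to exactly `1.0` and the reference version would fail on `log(0)`.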

### Accuracy function

Since this is a binary classification task we need to calculate the binary accuracy between the target labels and predicted labels.

In [24]:
def binary_accuracy(y_preds, y_true):
  rounded_preds = torch.round(torch.sigmoid(y_preds))
  correct = (rounded_preds == y_true).float()
  return correct.sum() / len(correct)
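
The same computation on plain Python lists may make the steps easier to follow. `binary_accuracy_plain` is a hypothetical helper and the logits are made up.

```python
import math

def binary_accuracy_plain(logits, targets):
    """Fraction of examples whose rounded sigmoid output matches the target label."""
    correct = 0
    for logit, target in zip(logits, targets):
        probability = 1.0 / (1.0 + math.exp(-logit))
        predicted = float(round(probability))
        correct += predicted == target
    return correct / len(targets)

toy_logits = [2.3, -0.7, 0.4, -3.1]    # predicted classes: 1, 0, 1, 0
toy_targets = [1.0, 0.0, 0.0, 0.0]
print(binary_accuracy_plain(toy_logits, toy_targets))  # 0.75
```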

### The `train` and `evaluate` functions

We are then going to create the `train` and `evaluate` functions, which receive different arguments and return the `loss` and `accuracy`. In both functions we need to pass `text_lengths` to our model. Note that `len(i)` is taken on rows that `tokenize_batch` has already padded, so every length equals `max_len` (50) rather than the number of real tokens in the sentence.

In [25]:
def train(model, iterator, optimizer, criterion):
  epoch_loss,epoch_acc = 0, 0
  model.train()
  for batch in iterator:
    y, X = batch
    X = X.to(device)
    y = y.to(device)
    lengths = torch.tensor([len(i) for i in X])
    optimizer.zero_grad()

    predictions = model(X, lengths).squeeze(1)
    loss = criterion(predictions, y)
    acc = binary_accuracy(predictions, y)
    loss.backward()
    optimizer.step()
    epoch_loss += loss.item()
    epoch_acc += acc.item()
  return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
  epoch_loss,epoch_acc = 0, 0
  model.eval()
  with torch.no_grad():
    for batch in iterator:
      y, X = batch
      X = X.to(device)
      y = y.to(device)
      lengths = torch.tensor([len(i) for i in X])
      predictions = model(X, lengths).squeeze(1)
      loss = criterion(predictions, y)
      acc = binary_accuracy(predictions, y)
      epoch_loss += loss.item()
      epoch_acc += acc.item()
  return epoch_loss / len(iterator), epoch_acc / len(iterator)
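
If the real number of tokens per sentence is wanted instead of the padded length, one possible approach is to count the non-padding entries. This is only a sketch under two assumptions: `<pad>` has index 0, and the tokens come before the padding (the default behaviour of `tokenize_batch`). `real_lengths` is a hypothetical helper, and using it would change the training results reported in this notebook.

```python
import torch

PAD_INDEX = 0  # assumption: <pad> is index 0

def real_lengths(batch, pad_index=PAD_INDEX):
    """Count non-padding tokens per row; clamp to 1 because packing rejects zero lengths."""
    return (batch != pad_index).sum(dim=1).clamp(min=1).cpu()

toy_batch = torch.tensor([
    [11, 93, 32, 0, 0],
    [7, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],   # an empty sentence still gets length 1
], dtype=torch.int32)

print(real_lengths(toy_batch))  # tensor([3, 1, 1])
```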

### Helper functions

The following two helper functions help us visualize our training loop so that we can see what is going on.

In [26]:
def hms_string(sec_elapsed):
  h = int(sec_elapsed / (60 * 60))
  m = int((sec_elapsed % (60 * 60)) / 60)
  s = sec_elapsed % 60
  return "{}:{:>02}:{:>05.2f}".format(h, m, s)

In [27]:
def visualize_training(start, end, train_loss, train_accuracy, val_loss, val_accuracy, title):
  data = [
       ["Training", f'{train_loss:.3f}', f'{train_accuracy:.3f}', f"{hms_string(end - start)}" ],
       ["Validation", f'{val_loss:.3f}', f'{val_accuracy:.3f}', "" ],       
  ]
  table = PrettyTable(["CATEGORY", "LOSS", "ACCURACY", "ETA"])
  table.align["CATEGORY"] = 'l'
  table.align["LOSS"] = 'r'
  table.align["ACCURACY"] = 'r'
  table.align["ETA"] = 'r'
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)
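
For reference, the time formatting can also be written with `divmod`. The function `hms` below is a hypothetical alternative to `hms_string`, shown only to make the `H:MM:SS.ss` layout explicit.

```python
def hms(sec_elapsed):
    """Format a duration in seconds as H:MM:SS.ss using divmod."""
    minutes, seconds = divmod(sec_elapsed, 60)
    hours, minutes = divmod(int(minutes), 60)
    return f"{hours}:{minutes:02d}:{seconds:05.2f}"

print(hms(107.76))  # 0:01:47.76
print(hms(3725.5))  # 1:02:05.50
```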

### Training loop

In the training loop we are going to save the model whenever the loss on the validation set decreases.

In [28]:
N_EPOCHS = 10
MODEL_NAME = 'ahd-lstm-torch.pt'

best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
  start = time.time()
  train_loss, train_acc = train(ahd_model, train_loader, optimizer, criterion)
  valid_loss, valid_acc = evaluate(ahd_model, val_loader, criterion)
  title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss < best_valid_loss else 'not saving...'}"
  if valid_loss < best_valid_loss:
      best_valid_loss = valid_loss
      torch.save(ahd_model.state_dict(), MODEL_NAME)
  end = time.time()
  visualize_training(start, end, train_loss, train_acc, valid_loss, valid_acc, title)

+--------------------------------------------+
|     EPOCH: 01/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.165 |    0.939 | 0:01:47.76 |
| Validation | 0.121 |    0.956 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 02/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.111 |    0.960 | 0:01:44.96 |
| Validation | 0.111 |    0.965 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 03/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   
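
The checkpointing rule in the loop above can be isolated in a few lines. The sketch below uses hypothetical validation losses (not taken from this run), and `epochs_that_save` is a made-up helper that reports at which epochs a save would be triggered.

```python
def epochs_that_save(valid_losses):
    """Return the 1-based epochs at which a strictly lower validation loss is seen."""
    best_valid_loss = float("inf")
    saved_epochs = []
    for epoch, valid_loss in enumerate(valid_losses, start=1):
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            saved_epochs.append(epoch)  # the notebook calls torch.save at this point
    return saved_epochs, best_valid_loss

# Hypothetical validation losses for five epochs.
toy_losses = [0.30, 0.25, 0.27, 0.22, 0.22]
print(epochs_that_save(toy_losses))  # ([1, 2, 4], 0.22)
```

Because the comparison is strict, an epoch that only ties the best loss does not overwrite the saved file.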

### Saving/Exporting to TorchScript

Scripting the model with `torch.jit.script` lets us save the entire model (architecture and weights) in a single file.

In [29]:
model_scripted = torch.jit.script(ahd_model) # Export to TorchScript
model_scripted.save('model_scripted.pt')

### Loading a scripted model

In [30]:
model = torch.jit.load('model_scripted.pt')
model.eval()

RecursiveScriptModule(
  original_name=AHDBiLSTM
  (embedding): RecursiveScriptModule(original_name=Embedding)
  (lstm): RecursiveScriptModule(original_name=LSTM)
  (fc): RecursiveScriptModule(original_name=Linear)
  (dropout): RecursiveScriptModule(original_name=Dropout)
)

### Downloading the model

In [31]:
from google.colab import files
files.download('model_scripted.pt')


### Evaluating the best model

In the following code cell we are going to evaluate the best model using the `test_loader`

In [32]:
def tabulate(column_names, data, title):
  table = PrettyTable(column_names)
  table.title= title
  table.align[column_names[0]] = 'l'
  table.align[column_names[1]] = 'r'
  table.align[column_names[2]] = 'r'
  table.align[column_names[3]] = 'r'
  for row in data:
    table.add_row(row)
  print(table)

In [33]:
ahd_model.load_state_dict(torch.load(MODEL_NAME))

column_names = ["Set", "Loss", "Accuracy", "ETA (time)"]
test_loss, test_acc = evaluate(ahd_model, test_loader, criterion)
title = "Model Evaluation Summary"
data_rows = [["Test", f'{test_loss:.3f}', f'{test_acc * 100:.2f}%', ""]]

tabulate(column_names, data_rows, title)

+--------------------------------------+
|       Model Evaluation Summary       |
+------+-------+----------+------------+
| Set  |  Loss | Accuracy | ETA (time) |
+------+-------+----------+------------+
| Test | 0.068 |   97.77% |            |
+------+-------+----------+------------+


### Making predictions.



In [34]:
print(train_df.text[1], train_df.label[1])

The richest black man in nyc has got to be duane reade. humour


In [35]:
def preprocess_text(text, max_len=50, padding="pre"):
  assert padding=="pre" or padding=="post", "the padding can be either pre or post"
  text_holder = torch.zeros(max_len, dtype=torch.int32) # fixed size tensor of max_len with <pad> = 0
  processed_text = torch.tensor(text_pipeline(text), dtype=torch.int32)
 
  pos = min(max_len, len(processed_text))
  if padding == "pre":
    text_holder[:pos] = processed_text[:pos]
  else:
    text_holder[-pos:] = processed_text[-pos:]
  text_list= text_holder.unsqueeze(dim=0)
  return text_list

def predict_homour(sent: str, model):
  classes =["NOT HOMOUR", "HUMOUR"]
  model.eval()
  tensor = preprocess_text(sent).to(device)
  length = torch.tensor([len(t) for t in tensor])
  pred = torch.sigmoid(model(tensor, length)).item()
  label = 1 if pred >=0.5 else 0
  probability = float(round(pred, 3)) if pred >= 0.5 else float(round(1 - pred, 3))

  pred_obj ={
      "label": label,
      "probability": probability,
      "class": classes[label]
  }
  return pred_obj

predict_homour('The richest black man in nyc has got to be duane reade.', ahd_model)

{'label': 1, 'probability': 1.0, 'class': 'HUMOUR'}
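
The post-processing inside the prediction function can be looked at separately from the model. In this sketch, `describe_prediction` is a hypothetical helper that takes a raw logit. It shows why the reported probability is always at least `0.5`: it is the confidence of the predicted class, not the probability of humour.

```python
import math

def describe_prediction(logit, classes=("NOT HUMOUR", "HUMOUR")):
    """Turn a raw logit into the label, class name and confidence of the predicted class."""
    pred = 1.0 / (1.0 + math.exp(-logit))
    label = 1 if pred >= 0.5 else 0
    confidence = pred if label == 1 else 1.0 - pred
    return {"label": label, "probability": round(confidence, 3), "class": classes[label]}

print(describe_prediction(1.8))   # a positive logit maps to the HUMOUR class
print(describe_prediction(-2.0))  # a negative logit maps to the NOT HUMOUR class
```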

### Humour

In [36]:
test_df.head(5)

| Unnamed: 0.1 | Unnamed: 0 | text | label |
|---|---|---|---|
| 0 | 194672 | Beyoncé announces $100,000 in scholarships for... | not-humour |
| 1 | 115452 | Mary alice stephenson's glam4good was inspired... | not-humour |
| 2 | 101601 | I'm not allowed in the vietnamese sandwich sho... | humour |
| 3 | 6517 | My penis is nicknamed the titanic... because i... | humour |
| 4 | 74203 | Life is like a dick pic... sometimes you get t... | humour |


In [37]:
test_texts[:5], test_labels[:5]

(array(['Beyoncé announces $100,000 in scholarships for hbcu students',
        "Mary alice stephenson's glam4good was inspired by oprah (video)",
        "I'm not allowed in the vietnamese sandwich shop anymore. they decided to banh mi for life.",
        "My penis is nicknamed the titanic... because it's so big? no,because it is a tragedy.",
        "Life is like a dick pic... sometimes you get things you don't ask for."],
       dtype=object),
 array(['not-humour', 'not-humour', 'humour', 'humour', 'humour'],
       dtype=object))

In [38]:
predict_homour(test_texts[1], ahd_model)

{'label': 0, 'probability': 1.0, 'class': 'NOT HOMOUR'}

In [39]:
predict_homour(test_texts[2], ahd_model)

{'label': 1, 'probability': 1.0, 'class': 'HUMOUR'}

In [40]:
predict_homour(train_texts[3], ahd_model)

{'label': 1, 'probability': 0.986, 'class': 'HUMOUR'}

In [41]:
predict_homour(train_texts[4], ahd_model)

{'label': 1, 'probability': 1.0, 'class': 'HUMOUR'}

### Non-humour

In [42]:
predict_homour(train_texts[0], ahd_model)

{'label': 0, 'probability': 1.0, 'class': 'NOT HOMOUR'}

### Downloading the Model

Now we can download the model so that it can be saved as a static file in the following code cell.

In [43]:
from google.colab import files
files.download(MODEL_NAME)


### Saving and Downloading the vocabulary

Next we are going to save our vocabulary `stoi` mapping as a JSON file and download it.

In [44]:
vocab.get_stoi()['the'] == vocab['the']

True

In [45]:
import json

In [46]:
with open('vocab-pt.json', 'w') as f:
  json.dump(vocab.get_stoi(), f)

files.download('vocab-pt.json')
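
Once exported, the mapping can be used without torchtext. The sketch below round-trips a made-up `toy_stoi` dictionary through a temporary JSON file and encodes a token list with it. The file name and the `encode` helper are hypothetical.

```python
import json
import os
import tempfile

toy_stoi = {"<pad>": 0, "<unk>": 1, "<sos>": 2, "<eos>": 3, "what": 4, "a": 5, "note": 6}

# Round-trip the mapping through a JSON file, similar to what is done with vocab-pt.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-vocab.json")
    with open(path, "w") as f:
        json.dump(toy_stoi, f)
    with open(path) as f:
        loaded_stoi = json.load(f)

def encode(tokens, stoi):
    """Look up each token, falling back to the <unk> index for unseen words."""
    return [stoi.get(token, stoi["<unk>"]) for token in tokens]

print(encode(["what", "a", "flat", "note"], loaded_stoi))  # [4, 5, 1, 6]
```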


In [47]:
len(vocab)

38753

In [48]:
ahd_model.load_state_dict(torch.load(MODEL_NAME, map_location=torch.device('cpu')))

<All keys matched successfully>

In [49]:
preprocess_text("what do you get if king kong sits on your piano? a flat note.")

tensor([[49, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 15,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0]],
       dtype=torch.int32)

In [50]:
def predict_humour(sent: str, model):
    model.eval()
    tensor = preprocess_text(sent)
    classes = ["NOT HOMOUR", "HUMOUR"]
    # NOTE: len(tensor) is the batch size (1), not the padded sequence length (50)
    # that predict_homour passes, so only the first token reaches the LSTM here.
    pred = torch.sigmoid(model(tensor.to(device), torch.tensor([len(tensor)]))).item()

    print(pred)
    label = 1 if pred >= 0.5 else 0
    probability = float(round(pred, 3)) if pred >= 0.5 else float(round(1 - pred, 3))
    return {
        "label": label,
        "probability": probability,
        "class": classes[label]
    }

In [51]:
predict_humour("what do you get if king kong sits on your piano? a flat note.", ahd_model)

0.8602973222732544


{'label': 1, 'probability': 0.86, 'class': 'HUMOUR'}