___

Project: `Automatic Humour Detection (AHD)`

Programmer: `@crispengari`

Date: `2022-04-26`

Abstract: _`Automatic Humour Detection (AHD) is a very useful topic in modern technologies. In this notebook we are going to create an Artificial Neural Network model using Deep Learning to detect humour in short texts. AHD is very useful in modern technologies such as virtual assistants and chatbots, where it helps the assistant or bot decide whether to take the conversation seriously or not`._

Research Paper: [`2004.12765`](https://arxiv.org/abs/2004.12765)

Keywords: `tensorflow`, `embedding`, `keras`, `pandas`, `CNN`, `dataset`, `accuracy`, `nltk`, `loss`

Programming Language: `python`

Dataset: [`kaggle`](https://www.kaggle.com/datasets/deepcontractor/200k-short-texts-for-humor-detection)
___

The dataset that we are going to use is split into `3` files, which are stored in Google Drive:

1. train.csv
2. val.csv
3. test.csv

We are going to use `tensorflow` and `keras` to create this model. The model is a recurrent network (bidirectional, GRU and LSTM layers) that performs binary classification.


### Mounting the Drive
We mount the drive because we are going to load the files from Google Drive. The following code cell mounts it:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Imports
In the following code cell, we are going to import the basic packages that we are going to use throughout this notebook.

In [2]:
import os
import time
import random
import math

import numpy as np
import tensorflow as tf
import pandas as pd

from prettytable import PrettyTable
from matplotlib import pyplot as plt
from collections import Counter
from tensorflow import keras
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder

tf.__version__, keras.__version__

('2.8.0', '2.8.0')

### Seeds
Setting the seeds makes the results reproducible.

In [3]:
SEED = 42

np.random.seed(SEED)
random.seed(SEED)
tf.random.set_seed(SEED)

### Device
We should make use of `gpu` acceleration if possible.

In [4]:
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    print(e)
else:
  print("No GPU's")

1 Physical GPUs, 1 Logical GPUs


### Paths to data
In the following code cells we are going to define the path where our `csv` files are located.

In [5]:
splits_folder = "/content/drive/My Drive/NLP Data/Automatic Humor Detection/splits"
assert os.path.exists(splits_folder), "The splits folder does not exist"


### Reading the data

We are going to read the data from our 3 files as dataframes.

In [6]:
train_df = pd.read_csv(os.path.join(splits_folder, 'train.csv'))
test_df = pd.read_csv(os.path.join(splits_folder, 'test.csv'))
val_df = pd.read_csv(os.path.join(splits_folder, 'val.csv'))

### Features and Labels
Next we are going to extract the features and labels from our dataframes for all three sets.

In [7]:
# train
train_texts = train_df.text.values
train_labels = train_df.label.values

# test
test_texts = test_df.text.values
test_labels = test_df.label.values

# val
val_texts = val_df.text.values
val_labels = val_df.label.values

In [8]:
train_labels[:2]

array(['not-humour', 'humour'], dtype=object)

### Label encoding

As we can see, our labels are strings and we want to convert them to `numbers`, so we are going to use the `sklearn` `LabelEncoder` class to encode them as integers.

In [9]:
encoder = LabelEncoder()
encoder.fit(train_labels)

LabelEncoder()

### Transforming Labels
To transform the labels we need to call `encoder.transform()` on our three sets of labels.

In [10]:
train_labels = encoder.transform(train_labels)
test_labels = encoder.transform(test_labels)
val_label = encoder.transform(val_labels)

In [11]:
train_labels[:2]

array([1, 0])
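
The mapping is worth spelling out, because it decides what the model output means later on. `LabelEncoder` sorts the distinct class names and numbers them in that order, so `'humour'` becomes `0` and `'not-humour'` becomes `1`. The plain-Python sketch below reproduces that mapping without scikit-learn; the `toy_labels` list is made up for illustration.

```python
# Plain-Python sketch of the mapping LabelEncoder produces:
# distinct class names are sorted, then numbered from 0.
toy_labels = ["not-humour", "humour", "humour", "not-humour"]  # illustrative data

classes = sorted(set(toy_labels))  # ['humour', 'not-humour']
class_to_index = {name: i for i, name in enumerate(classes)}
encoded = [class_to_index[name] for name in toy_labels]

print(class_to_index)  # {'humour': 0, 'not-humour': 1}
print(encoded)         # [1, 0, 0, 1]
```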

### Processing the text
We are going to create a `vocabulary` based on our train data using the `Counter` from the `collections` module. First we download the `punkt` data that `word_tokenize` needs, then we count the tokens.

In [12]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [13]:
counter = Counter()

for sent in train_texts:
  counter.update(word_tokenize(sent))

### Checking the most common `9` words

In [14]:
counter.most_common(9)

[('.', 67724),
 ('the', 67301),
 ('a', 66128),
 ('?', 54788),
 ('to', 48363),
 (',', 37160),
 ("'s", 35868),
 ('you', 34041),
 ('of', 29679)]

### Vocabulary size

In [15]:
vocab_size = len(counter)

In [16]:
vocab_size

86227

### Creating word vectors

We fit a keras `Tokenizer` on the train texts so that each word is mapped to an integer index.

In [17]:
tokenizer = keras.preprocessing.text.Tokenizer(
    num_words = vocab_size
)
tokenizer.fit_on_texts(train_texts)

In [18]:
word_indices = tokenizer.word_index
word_indices_reversed = dict([(v, k) for (k, v) in word_indices.items()])

In [19]:
len(word_indices)

73528
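
The two vocabulary counts differ (`86227` from the `Counter` versus `73528` from the keras tokenizer). A likely reason is that `word_tokenize` is case-sensitive and keeps punctuation as separate tokens, while the keras `Tokenizer` lowercases the text and filters punctuation by default. The sketch below imitates both behaviours with the standard library only; the two helper functions are simplified stand-ins, not the real tokenizers, and `toy_texts` is made up.

```python
import re
import string
from collections import Counter

toy_texts = ["The cat sat.", "the Cat sat?", "A cat!"]  # illustrative data

def case_sensitive_tokens(text):
    # Rough stand-in for word_tokenize: keeps case, splits off punctuation.
    return re.findall(r"\w+|[^\w\s]", text)

def lowercase_filtered_tokens(text):
    # Rough stand-in for the keras Tokenizer defaults: lowercase, drop punctuation.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

raw_vocab = Counter(t for s in toy_texts for t in case_sensitive_tokens(s))
clean_vocab = Counter(t for s in toy_texts for t in lowercase_filtered_tokens(s))

print(len(raw_vocab), len(clean_vocab))  # 9 4
```

Because `vocab_size` is the larger of the two counts, every index produced by the tokenizer still fits into a table with `vocab_size` rows; the extra rows simply stay unused.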

### Helper functions

1. `sequence_to_text`

This helper function will convert a sequence of integers back to text.

2. `text_to_sequence`

This helper function will convert a sentence to a sequence of integers. Words that are not in the vocabulary are mapped to `0`.


In [20]:
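def sequence_to_text(sequences):
    # Map each integer index back to its word.
    return " ".join(word_indices_reversed[i] for i in sequences)

def text_to_sequence(sent):
  # Tokenize the lowercased sentence and look each word up in the vocabulary.
  words = word_tokenize(sent.lower())
  sequences = []
  for word in words:
    # Words that are not in the vocabulary map to 0 (the padding index).
    sequences.append(word_indices.get(word, 0))
  return sequences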
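
To see how the two helpers fit together, here is a small self-contained round trip on a made-up vocabulary. It assumes index `0` is reserved for padding and unknown words, as in `text_to_sequence`, and it uses `str.split` instead of `word_tokenize` to avoid the NLTK dependency. The names `toy_word_indices`, `toy_text_to_sequence` and `toy_sequence_to_text` are hypothetical.

```python
# Toy vocabulary; index 0 is reserved for padding / unknown words.
toy_word_indices = {"the": 1, "cat": 2, "sat": 3}
toy_indices_reversed = {v: k for k, v in toy_word_indices.items()}

def toy_text_to_sequence(sent):
    # Unknown words map to 0 instead of raising KeyError.
    return [toy_word_indices.get(w, 0) for w in sent.lower().split()]

def toy_sequence_to_text(sequence):
    # Skip index 0 so padded or unknown positions do not raise KeyError.
    return " ".join(toy_indices_reversed[i] for i in sequence if i != 0)

seq = toy_text_to_sequence("The dog sat")
print(seq)                        # [1, 0, 3]
print(toy_sequence_to_text(seq))  # the sat
```

Skipping index `0` in the reverse direction is a design choice of this sketch; it lets the helper decode padded sequences as well.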

### Loading pretrained vectors
We are going to load the `glove.6B.100d` word vectors, which were uploaded to Google Drive as a `txt` file.


In [21]:
embedding_path = "/content/drive/MyDrive/NLP Data/glove.6B/glove.6B.100d.txt"

assert os.path.exists(embedding_path), "The path does not exist"

In [22]:
embeddings_dictionary = dict()
with open(embedding_path, encoding='utf8') as glove_file:
  for line in glove_file:
    records = line.split()
    word  = records[0]
    vectors = np.asarray(records[1:], dtype='float32')
    embeddings_dictionary[word] = vectors

> Creating embedding matrix that suits our data.

In [23]:
embedding_matrix = np.zeros((vocab_size, 100))
for word, index in tokenizer.word_index.items():
  vector = embeddings_dictionary.get(word)
  # Words without a GloVe vector keep an all-zero row.
  if vector is not None and index < vocab_size:
    embedding_matrix[index] = vector
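
The same two steps (parsing the GloVe text format, then filling one row per word index) can be tried on a tiny in-memory example. This is only a sketch: the two 4-dimensional vectors and the `toy_word_index` dictionary are made up, and the real file holds 100-dimensional vectors.

```python
import io
import numpy as np

# Two made-up 4-dimensional vectors in the GloVe text layout: word v1 v2 ...
toy_glove = io.StringIO("the 0.1 0.2 0.3 0.4\ncat 0.5 0.6 0.7 0.8\n")
toy_word_index = {"the": 1, "cat": 2, "glam4good": 3}  # hypothetical tokenizer index

vectors = {}
for line in toy_glove:
    word, *values = line.split()
    vectors[word] = np.asarray(values, dtype="float32")

# Row 0 is left for padding; there is one row per word index.
matrix = np.zeros((len(toy_word_index) + 1, 4), dtype="float32")
for word, index in toy_word_index.items():
    vector = vectors.get(word)
    if vector is not None:
        matrix[index] = vector

# Count the rows that received a pretrained vector.
covered = int(matrix.any(axis=1).sum())
print(matrix.shape, covered)  # (4, 4) 2
```

Counting the non-zero rows this way gives a quick estimate of how much of the vocabulary is covered by the pretrained vectors.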

### Creating sequences

In the following code cell we are going to convert the texts of our 3 sets into sequences of integers.

In [24]:
train_sequence_tokens = tokenizer.texts_to_sequences(train_texts)
test_sequence_tokens = tokenizer.texts_to_sequences(test_texts)
val_sequence_tokens = tokenizer.texts_to_sequences(val_texts)

In [25]:
sequence_to_text(train_sequence_tokens[1])

'the richest black man in nyc has got to be duane reade'

In [26]:
sequence_to_text(test_sequence_tokens[1])

'mary alice glam4good was inspired by oprah video'

In [27]:
sequence_to_text(val_sequence_tokens[1])

"darth vader showed up to luke's party uninvited talk about a foe pa"

### Padding the `sequences`
Next we want to pad the sequences so that they all have the same length. Because the sentences have different lengths, we pad the short ones with `0` and truncate the long ones to `max_words` tokens.

In [28]:
max_words = 100
train_tokens_sequence_padded = keras.preprocessing.sequence.pad_sequences(
                                       train_sequence_tokens,
                                       maxlen=max_words,
                                       padding="post", 
                                       truncating="post"
                                       )
test_tokens_sequence_padded = keras.preprocessing.sequence.pad_sequences(
                                       test_sequence_tokens,
                                       maxlen=max_words,
                                       padding="post", 
                                       truncating="post"
                                       )
val_tokens_sequence_padded = keras.preprocessing.sequence.pad_sequences(
                                       val_sequence_tokens,
                                       maxlen=max_words,
                                       padding="post", 
                                       truncating="post"
                                       )
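
For intuition, the effect of `padding="post"` and `truncating="post"` can be written in a few lines of plain Python. This is a simplified sketch of the behaviour, not the keras implementation, and `pad_post` is a hypothetical helper name.

```python
def pad_post(sequences, maxlen, value=0):
    # Truncate long sequences at the end, then right-pad short ones.
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

print(pad_post([[5, 3], [7, 8, 9, 4, 2]], maxlen=4))
# [[5, 3, 0, 0], [7, 8, 9, 4]]
```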

### Model

We are going to create a model based on [this notebook](https://github.com/CrispenGari/nlp-tensorflow/blob/main/03_Emotions/01_Emotion_Prediction_From_Text.ipynb). The model is built with the keras functional API. The model architecture is as follows:

```
                [ Embedding Layer]
                        |
                        |
[ LSTM ] <---- [Bidirectional Layer] ----> [GRU] (forward_layer)
 (backward_layer)       |
                        |
        [  Gated Recurrent Unit  (GRU)  ]
                        |
                        |
        [ Long Short Term Memory (LSTM) ]
                        |
                        |
                [ Flatten Layer]
                        |
                        |
                 [Dense Layer 1]
                        |
                        | 
                   [ Dropout ]
                        |
                        |   
                 [Dense Layer 2]
                        |
                        |
                 [Dense Layer 3] (output [binary])
```

In [29]:
forward_layer = keras.layers.GRU(128, return_sequences=True, dropout=.25 )
backward_layer = keras.layers.LSTM(128, activation='tanh', return_sequences=True,
                       go_backwards=True, dropout=.25)

input_layer = keras.layers.Input(shape=(100, ), name="input_layer")

embedding_layer = keras.layers.Embedding(
      vocab_size, 
      100, 
      input_length=max_words,
      weights=[embedding_matrix], 
      trainable=True,
      name = "embedding_layer"
)(input_layer)

bidirectional_layer = keras.layers.Bidirectional(
    forward_layer,
    backward_layer = backward_layer,
    name= "bidirectional_layer"
)(embedding_layer)

gru_layer = keras.layers.GRU(
    512, return_sequences=True,
   dropout=.5,
    name= "gru_layer"
)(bidirectional_layer)

lstm_layer = keras.layers.LSTM(
    512, return_sequences=True,
   dropout=.5,
    name="lstm_layer"
)(gru_layer)
flatten_layer = keras.layers.Flatten(name="flatten_layer")(lstm_layer)
fc_1 = keras.layers.Dense(64, activation='relu', name="dense_1")(flatten_layer)
dropout_layer = keras.layers.Dropout(rate=0.5, name="dropout_layer")(fc_1)
fc_2 = keras.layers.Dense(512, activation='relu', name="dense_2")(dropout_layer)
output_layer = keras.layers.Dense(1, activation='sigmoid')(fc_2)
adh_model = keras.Model(inputs=input_layer, outputs=output_layer, name="emotional_model")
adh_model.summary()

Model: "emotional_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_layer (InputLayer)    [(None, 100)]             0         
                                                                 
 embedding_layer (Embedding)  (None, 100, 100)         8622700   
                                                                 
 bidirectional_layer (Bidire  (None, 100, 256)         205568    
 ctional)                                                        
                                                                 
 gru_layer (GRU)             (None, 100, 512)          1182720   
                                                                 
 lstm_layer (LSTM)           (None, 100, 512)          2099200   
                                                                 
 flatten_layer (Flatten)     (None, 51200)             0         
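
The parameter counts shown in the summary can be checked by hand. The sketch below uses the usual per-layer formulas and assumes the keras defaults, in particular `reset_after=True` for GRU layers, which gives each gate two bias vectors. The helper names are hypothetical.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary row.
    return vocab * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and one bias.
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units):
    # 3 gates; with reset_after=True each gate has two bias vectors.
    return 3 * (units * (input_dim + units) + 2 * units)

print(embedding_params(86227, 100))                  # 8622700
print(gru_params(100, 128) + lstm_params(100, 128))  # 205568 (bidirectional layer)
print(gru_params(256, 512))                          # 1182720
print(lstm_params(512, 512))                         # 2099200
```

The bidirectional layer concatenates a 128-unit GRU and a 128-unit LSTM, which is why its output width is `256` and why the next GRU sees an input dimension of `256`.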
                                                   

### Training the model
First we need to compile the model. We are going to use the `EarlyStopping` callback so that training stops once `val_loss` has not improved for `5` consecutive epochs (`patience=5`).


In [30]:
early_stoping = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    min_delta=0,
    patience=5,
    verbose=1,
    mode='auto',
    baseline=None,
    restore_best_weights=False,
)

adh_model.compile(
    loss = keras.losses.BinaryCrossentropy(from_logits=False),
    optimizer = keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.5),
    metrics = ['accuracy']
)

In [31]:
BATCH_SIZE = 128

adh_model.fit(
    train_tokens_sequence_padded,
    train_labels,
    validation_data=(
        val_tokens_sequence_padded,
        val_label
    ),
    epochs = 10,
    verbose = 1,
    shuffle=True,
    batch_size= BATCH_SIZE,
    validation_batch_size = BATCH_SIZE,
    callbacks = [early_stoping]
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 9: early stopping


<keras.callbacks.History at 0x7f1674b260d0>
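
The stopping rule can be illustrated with a short simulation. This is a simplified sketch of patience-based early stopping, not the keras source, and the validation losses below are made up because the per-epoch values of this run are not shown.

```python
def early_stop_epoch(val_losses, patience, min_delta=0.0):
    # Returns the 1-based epoch at which training would stop, or None.
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, wait = loss, 0  # improvement: reset the counter
        else:
            wait += 1             # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

toy_losses = [0.30, 0.20, 0.15, 0.14, 0.16, 0.17, 0.15, 0.18, 0.19, 0.20]  # made up
print(early_stop_epoch(toy_losses, patience=5))  # 9
```

In this toy run the best loss occurs at epoch 4 and training stops five epochs later. Note that with `restore_best_weights=False` the model keeps the weights from the last epoch, not from the best one.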

### Evaluating the model

We are going to evaluate the model on the test data.

In [32]:
adh_model.evaluate(test_tokens_sequence_padded, test_labels,
                       verbose=1, batch_size=BATCH_SIZE)



[0.1475924551486969, 0.9543333053588867]

### Model Inference
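In the following code cell we are going to make predictions using our model. The sigmoid output is the probability of label `1` (`not-humour`), so values below `0.5` are classified as humour.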

In [47]:
def predict_homour(sent: str, model):
  classes =["HUMOUR", "NOT HOMOUR"]
  tokens = text_to_sequence(sent)
  padded_tokens = keras.preprocessing.sequence.pad_sequences([tokens],
                                maxlen=max_words,
                                padding="post", 
                                truncating="post"
                                )
  
  pred = model.predict(padded_tokens)
  pred = tf.squeeze(pred).numpy()
  print(pred)
  label = 1 if pred >=0.5 else 0
  probability = float(round(pred, 3)) if pred >= 0.5 else float(round(1 - pred, 3))

  pred_obj ={
      "label": label,
      "probability": probability,
      "class": classes[label]
  }
  return pred_obj


predict_homour("The richest black man in nyc has got to be duane reade.", adh_model)

0.06213177


{'class': 'HUMOUR', 'label': 0, 'probability': 0.938}
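
The decoding step can be isolated from the model. The sketch below takes a raw sigmoid output and returns the label, the class name and the confidence of the predicted class. It assumes the encoding shown earlier (`0` = humour, `1` = not humour); `decode_prediction` is a hypothetical helper.

```python
def decode_prediction(pred, threshold=0.5):
    # Class names are ordered by encoded label: 0 = humour, 1 = not humour.
    classes = ["HUMOUR", "NOT HUMOUR"]
    label = 1 if pred >= threshold else 0
    # Report the confidence of the predicted class, not the raw sigmoid output.
    confidence = pred if label == 1 else 1 - pred
    return {
        "label": label,
        "probability": round(float(confidence), 3),
        "class": classes[label],
    }

print(decode_prediction(0.06213177))
# {'label': 0, 'probability': 0.938, 'class': 'HUMOUR'}
```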

In [48]:
train_df.head(3)

Unnamed: 0.1,Unnamed: 0,text,label
0,38762,10 brands that will disappear in 2014: 24/7 wa...,not-humour
1,76883,The richest black man in nyc has got to be dua...,humour
2,2018,What do you get if king kong sits on your pian...,humour


In [49]:
train_texts[:3], train_labels[:3]

(array(['10 brands that will disappear in 2014: 24/7 wall st.',
        'The richest black man in nyc has got to be duane reade.',
        'What do you get if king kong sits on your piano? a flat note.'],
       dtype=object), array([1, 0, 0]))

In [50]:
predict_homour(train_texts[0], adh_model)

1.0


{'class': 'NOT HOMOUR', 'label': 1, 'probability': 1.0}

In [51]:
predict_homour(train_texts[1], adh_model)

0.06213177


{'class': 'HUMOUR', 'label': 0, 'probability': 0.938}

In [52]:
predict_homour(train_texts[2], adh_model)

2.9394454e-05


{'class': 'HUMOUR', 'label': 0, 'probability': 1.0}

### Saving and downloading

Now we can save the model as a static file and download it in the following code cells.

In [33]:
MODEL_NAME = 'adh-tf.h5'
adh_model.save(MODEL_NAME)

In [34]:
from google.colab import files
files.download(MODEL_NAME)


### Saving and Downloading the vocabulary

Next we are going to save our vocabulary `stoi` (string to index) as a `json` file and download it.

In [35]:
word_indices['the']

1

In [36]:
import json

In [37]:
with open('vocab-tf.json', 'w') as f:
  json.dump(word_indices, f)

files.download('vocab-tf.json')
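
When the vocabulary is needed again, for example next to the saved model, the file can be read back with the standard library. The sketch below is a self-contained round trip on a made-up two-word vocabulary written to a temporary folder; `toy_vocab` stands in for `word_indices`.

```python
import json
import os
import tempfile

toy_vocab = {"the": 1, "cat": 2}  # stand-in for word_indices

path = os.path.join(tempfile.mkdtemp(), "vocab-tf.json")
with open(path, "w") as f:
    json.dump(toy_vocab, f)

with open(path) as f:
    stoi = json.load(f)

# JSON keeps the word -> index direction; invert it for index -> word lookups.
itos = {index: word for word, index in stoi.items()}
print(stoi["the"], itos[2])  # 1 cat
```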
