___

Project: `Automatic Humour Detection (AHD)`

Programmer: `@crispengari`

Date: `2022-04-26`

Abstract: _`Automatic Humour Detection (AHD) is a very useful topic in modern technologies. In this notebook we are going to prepare the data for an AHD PyTorch deep learning model using TorchText. AHD is useful in modern technologies such as virtual assistants and chatbots, because it helps them detect whether to take a conversation seriously or not`._

Research Paper: [`2004.12765`](https://arxiv.org/abs/2004.12765)

Keywords: `pytorch`, `embedding`, `torchtext`, `fast-text`, `LSTM`, `RNN`

Programming Language: `python`

Dataset: [`kaggle`](https://www.kaggle.com/datasets/deepcontractor/200k-short-texts-for-humor-detection)
___

In this notebook we are going to prepare the data splits from the dataset that was obtained from [kaggle](https://www.kaggle.com/datasets/deepcontractor/200k-short-texts-for-humor-detection). This dataset was downloaded and uploaded to my Google Drive so that it can be easily loaded in this notebook. At the end of this notebook we will have three files:

1. train.csv
2. val.csv
3. test.csv

The dataset consists of `200K` lines of text labeled `humour` or `not-humour`. The ratio between the two labels is `1:1`.
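The `1:1` ratio can be checked directly with pandas. The snippet below is a minimal sketch on a toy dataframe (`toy_df` and its rows are made up for illustration; only the `text` and `humor` column names come from the dataset), not a cell from the original notebook:

```python
import pandas as pd

# Toy stand-in for the real dataset, which has 200K rows (hypothetical rows).
toy_df = pd.DataFrame({
    "text": ["joke a", "headline b", "joke c", "headline d"],
    "humor": [True, False, True, False],
})

# Count the rows per class; a 1:1 ratio means both counts are equal.
counts = toy_df["humor"].value_counts().to_dict()
print(counts)
```

Running the same `value_counts()` call on the full dataframe should show equal counts for both classes if the ratio is really `1:1`.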


### Mounting the Drive
In the following code cell we are going to mount the drive as follows:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Imports
In the following code cell we are going to import the basic packages that we are going to use in this notebook.

In [2]:
import os
import numpy as np
from sklearn.model_selection import train_test_split
import pandas as pd

### Defining file paths
In the following code cell we are going to define the file paths and check that both of them exist.

In [3]:
main_file = "/content/drive/My Drive/NLP Data/Automatic Humor Detection/dataset.csv"
splits_folder = "/content/drive/My Drive/NLP Data/Automatic Humor Detection/splits"

# Both the dataset file and the output folder must already exist on the drive.
assert os.path.exists(main_file) and os.path.exists(splits_folder)

### Dataframe
We are going to make use of the pandas module to load the dataset, which we will later divide into 3 splits:

* train
* test
* val

In [10]:
df = pd.read_csv(main_file)
df.head(10)

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False
5,"Martha stewart tweets hideous food photo, twit...",False
6,What is a pokemon master's favorite kind of pa...,True
7,Why do native americans hate it when it rains ...,True
8,"Obama's climate change legacy is impressive, i...",False
9,"My family tree is a cactus, we're all pricks.",True


### Renaming the Columns
The next thing that we will do is to rename the `humor` column of our dataset to `label`. The `text` column keeps its name.

In [11]:
df.rename(columns={"humor": "label"}, inplace=True)
df.head(2)

Unnamed: 0,text,label
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False


### Changing the labels
We are going to have 2 labels which are:

1. `humour`
2. `not-humour`

We are going to apply a lambda function to the `label` column of our `df`: where the value is `False` we change it to `not-humour`, and where it is `True` we change it to `humour`.

In [12]:
df['label'] = df['label'].apply(lambda x: "humour" if x else "not-humour")
df.head(10)

Unnamed: 0,text,label
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",not-humour
1,Watch: darvish gave hitter whiplash with slow ...,not-humour
2,What do you call a turtle without its shell? d...,humour
3,5 reasons the 2016 election feels so personal,not-humour
4,"Pasco police shot mexican migrant from behind,...",not-humour
5,"Martha stewart tweets hideous food photo, twit...",not-humour
6,What is a pokemon master's favorite kind of pa...,humour
7,Why do native americans hate it when it rains ...,humour
8,"Obama's climate change legacy is impressive, i...",not-humour
9,"My family tree is a cactus, we're all pricks.",humour
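As an alternative to the lambda used above, the same relabeling can be written with `Series.map` and a dictionary. This is only a sketch on a toy dataframe (`toy_df` and `label_names` are names chosen here for illustration), not the notebook's own cell:

```python
import pandas as pd

# Toy dataframe with boolean labels, mirroring the renamed `label` column.
toy_df = pd.DataFrame({
    "text": ["a joke", "a headline"],
    "label": [True, False],
})

# Explicit lookup table from the boolean flag to the string label.
label_names = {True: "humour", False: "not-humour"}
toy_df["label"] = toy_df["label"].map(label_names)

print(toy_df["label"].tolist())
```

One difference worth knowing: with `map`, a value that is missing from the dictionary becomes `NaN` instead of silently falling into the `not-humour` branch.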


### Splitting sets
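Now we can use `train_test_split` from `sklearn` to split our dataframe into 3 sets. We first hold out 10% of the rows, then split the held-out rows again so that 10% of them become the validation set: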

1. train
2. test
3. valid

In [13]:
SEED = 42

In [17]:
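# Hold out 10% of the rows; the remaining 90% form the training set.
train_df, test_df = train_test_split(df, random_state=SEED, test_size=.1)
# Split the held-out rows again: 90% stay in the test set, 10% become the validation set.
test_df, val_df = train_test_split(test_df, random_state=SEED, test_size=.1)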
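The two-stage split above yields roughly 90% training, 9% testing and 1% validation rows. The sketch below reproduces that arithmetic on a toy dataframe and additionally passes `stratify`, which is an assumption of this example and not something the notebook does, to keep the label ratio the same in every split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42

# Toy balanced dataset with 1000 rows (hypothetical data).
toy_df = pd.DataFrame({
    "text": [f"example {i}" for i in range(1000)],
    "label": ["humour", "not-humour"] * 500,
})

# Stage 1: hold out 10% of all rows.
train_toy, holdout = train_test_split(
    toy_df, random_state=SEED, test_size=0.1, stratify=toy_df["label"]
)
# Stage 2: 10% of the held-out rows become the validation set.
test_toy, val_toy = train_test_split(
    holdout, random_state=SEED, test_size=0.1, stratify=holdout["label"]
)

print(len(train_toy), len(test_toy), len(val_toy))
```

Without `stratify` the splits are purely random, so the label counts in each split can drift slightly away from `1:1`.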

### Checking examples

In [18]:
train_df.head(2)

Unnamed: 0,text,label
38762,10 brands that will disappear in 2014: 24/7 wa...,not-humour
76883,The richest black man in nyc has got to be dua...,humour


In [19]:
test_df.head(2)

Unnamed: 0,text,label
194672,"Beyoncé announces $100,000 in scholarships for...",not-humour
115452,Mary alice stephenson's glam4good was inspired...,not-humour


In [20]:
val_df.head(2)

Unnamed: 0,text,label
45782,I know a few people who are the human version ...,humour
16247,Darth vader showed up to luke's party uninvite...,humour


### Counting examples

In the following code cells we are going to count examples for each set and display them nicely using a `PrettyTable`.

In [21]:
from prettytable import PrettyTable

In [22]:
def tabulate(column_names, data, title):
    table = PrettyTable(column_names)
    table.title = title
    # Left-align the first column and right-align the numeric columns.
    table.align[column_names[0]] = 'l'
    for name in column_names[1:]:
        table.align[name] = 'r'
    for row in data:
        table.add_row(row)
    print(table)

In [29]:
tabulate([
    "Set", "Total", "Humour", "Not Humour"
], [
    ("training", len(train_df), list(train_df.label).count("humour"), list(train_df.label).count("not-humour")),
    ("testing", len(test_df), list(test_df.label).count("humour"), list(test_df.label).count("not-humour")),
    ("validation", len(val_df), list(val_df.label).count("humour"), list(val_df.label).count("not-humour")),
], "Counting examples in the dataset.")

+-------------------------------------------+
|     Counting examples in the dataset.     |
+------------+--------+--------+------------+
| Set        |  Total | Humour | Not Humour |
+------------+--------+--------+------------+
| training   | 180000 |  90054 |      89946 |
| testing    |  18000 |   8984 |       9016 |
| validation |   2000 |    962 |       1038 |
+------------+--------+--------+------------+
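The per-label counts in the table can also be computed with `value_counts`, which avoids converting the column to a list several times. The helper below is a hypothetical sketch (`label_counts` is not part of the notebook) shown on a toy dataframe:

```python
import pandas as pd

def label_counts(frame, labels=("humour", "not-humour")):
    """Return the number of rows per label, in the order given by `labels`."""
    counts = frame["label"].value_counts()
    # Series.get returns 0 for a label that does not occur in the frame.
    return tuple(int(counts.get(name, 0)) for name in labels)

# Toy dataframe standing in for one of the splits.
toy = pd.DataFrame({"label": ["humour", "not-humour", "humour"]})
print(label_counts(toy))  # (2, 1)
```

A table row for a split could then be assembled as `("training", len(train_df), *label_counts(train_df))`.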


### Saving files

Now we can save the three dataframes as `.csv` files in the splits folder on Google Drive as follows:

In [30]:
train_df.to_csv(os.path.join(splits_folder, "train.csv"))
test_df.to_csv(os.path.join(splits_folder, "test.csv"))
val_df.to_csv(os.path.join(splits_folder, "val.csv"))

print("Done")

Done
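Note that `to_csv` writes the dataframe index as an extra unnamed first column by default, and pandas labels that column `Unnamed: 0` when the file is read back. If the original row numbers are not needed later, passing `index=False` is one option. The sketch below uses a temporary folder and a toy dataframe (both chosen for illustration) rather than the real splits:

```python
import os
import tempfile

import pandas as pd

# Toy split with a shuffled index, similar to what train_test_split returns.
toy = pd.DataFrame(
    {"text": ["a joke", "a headline"], "label": ["humour", "not-humour"]},
    index=[7, 3],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    # index=False leaves the row index out of the file.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))  # ['text', 'label']
```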


We are done creating three files for our dataset.