# Joke classification

In this notebook, we detail the steps for classifying jokes by category: text preprocessing, Word2Vec embeddings, and an LSTM classifier.

In [1]:
%load_ext autoreload
%autoreload 2

In [19]:
import numpy as np
import pandas as pd
import re, string

#nltk
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

#model evaluation
from sklearn.model_selection import train_test_split

#gensim
from gensim.models import Word2Vec

#pytorch
import torch
import torch.nn as nn

#metrics
from sklearn.metrics import classification_report

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# Load Data

In [10]:
#load the balanced dataframe saved in the Data Analysis notebook
df = pd.read_csv('created_dataframes/df_balanced.csv')
df.head(3)

| Unnamed: 0 | id | body | category | lengths |
|---|---|---|---|---|
| 0 | 3013 | Markin' around The Christmas Tree\nWhat a dogg... | Animal | 1022 |
| 1 | 12808 | Yo mama so fat when jumps up in the air she ge... | Animal | 55 |
| 2 | 11887 | Laws of Feline Physics II\r\n\r\nLaw of Dinner... | Animal | 1098 |


# Text preprocessing

In [11]:
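def preprocess(text):
    text = text.lower().strip()  #lowercase and trim surrounding whitespace
    text = re.sub(r'<.*?>', '', text)  #remove html markup
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  #replace punctuation and special characters with spaces
    text = re.sub(r'\s+', ' ', text)  #collapse repeated whitespace, including '\n' and '\r' characters
    text = re.sub(r'\[[0-9]*\]', ' ', text)  #remove bracketed numbers such as [12] (a no-op once punctuation is gone, kept for safety)
    text = re.sub(r'[^\w\s]', '', text.strip())  #drop any remaining non-word characters
    text = re.sub(r'\d', ' ', text)  #remove digits
    text = re.sub(r'\s+', ' ', text)  #collapse the whitespace introduced by the previous steps
    return text

#build the stopword set once instead of reloading the list for every word
STOPWORDS = set(stopwords.words('english'))

def stopword(text):
    '''Remove English stopwords from a whitespace-separated string'''
    return ' '.join(word for word in text.split() if word not in STOPWORDS)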


#print result on an example
ex = df['body'][10]
print('Joke:\n', ex)
print('-----------------------')
print('After removing noisy characters: ', preprocess(ex))
print('After removing stopwords: ', stopword(preprocess(ex)))
print('Tokenized:', word_tokenize(stopword(preprocess(ex))))

Joke:
 Please answer yes or no to this question.

Is your answer "no"? 

Hint: This is under trick, remember. 

Answer: Yes or no.
-----------------------
After removing noisy characters:  please answer yes or no to this question is your answer no hint this is under trick remember answer yes or no
After removing stopwords:  please answer yes question answer hint trick remember answer yes
Tokenized: ['please', 'answer', 'yes', 'question', 'answer', 'hint', 'trick', 'remember', 'answer', 'yes']


In [12]:
df['clean_body'] = df['body'].apply(lambda x : stopword(preprocess(x)))
df.head(3)

| Unnamed: 0 | id | body | category | lengths | clean_body |
|---|---|---|---|---|---|
| 0 | 3013 | Markin' around The Christmas Tree\nWhat a dogg... | Animal | 1022 | markin around christmas tree doggie holiday do... |
| 1 | 12808 | Yo mama so fat when jumps up in the air she ge... | Animal | 55 | yo mama fat jumps air gets stuck |
| 2 | 11887 | Laws of Feline Physics II\r\n\r\nLaw of Dinner... | Animal | 1098 | laws feline physics ii law dinner table attend... |


In [13]:
#save dataframe
df.to_csv('created_dataframes/df_preprocessed.csv', index=False)

# Word2Vec model

We represent the cleaned jokes with Word2Vec features. Word2Vec learns a real-valued vector for each word so that words used in similar contexts, and therefore often similar in meaning, end up with similar vectors.
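
To make "similar vectors" concrete: similarity between word embeddings is commonly measured with cosine similarity. The sketch below uses hand-written 3-dimensional toy vectors (hypothetical values, not taken from the trained model) purely to illustrate the computation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, made up for illustration only
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "lawyer": [0.0, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors["dog"], toy_vectors["puppy"])
sim_far = cosine_similarity(toy_vectors["dog"], toy_vectors["lawyer"])
print(f"dog vs puppy:  {sim_close:.3f}")
print(f"dog vs lawyer: {sim_far:.3f}")
```

With a trained gensim model, `word2vec_model.wv.similarity(word_a, word_b)` returns the same kind of score for two words in the vocabulary.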

In [3]:
df = pd.read_csv('created_dataframes/df_preprocessed.csv')

In [7]:
# take the clean text of the jokes
X = df['clean_body'].values

#create document = list of all words in our data
document = []
for joke in X:
    document.extend(nltk.word_tokenize(joke))

document = [document] #Word2Vec expects a list of token lists, so wrap the words in an outer list

#word2vec model
#(argument names follow gensim < 4.0; gensim >= 4.0 renames size to vector_size and iter to epochs)
SIZE=30 #size of embedding space
word2vec_model = Word2Vec(document, min_count=1, size=SIZE, window=2, sg=1, iter=500)

#save the model
word2vec_model.save("word2vec_size30.model")
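
One point worth checking in the cell above: wrapping the whole corpus in a single list makes Word2Vec see one very long sentence. The gensim documentation notes that its optimized training routines truncate texts longer than 10,000 words, so a common alternative is to pass one token list per joke. A minimal sketch of that corpus layout, assuming whitespace tokenization is enough for the already-cleaned text (`build_sentences` and the sample jokes are hypothetical names and data introduced here):

```python
def build_sentences(cleaned_jokes):
    """Return one token list per joke, skipping jokes that are empty after cleaning."""
    return [joke.split() for joke in cleaned_jokes if joke.strip()]

# Hypothetical cleaned jokes standing in for df['clean_body'].values
sample_jokes = [
    "yo mama fat jumps air gets stuck",
    "please answer yes question",
    "",  # empty after cleaning, so it is skipped
]

sentences = build_sentences(sample_jokes)
print(len(sentences), "sentences")
print(sentences[0])

# With gensim installed, this corpus would then be passed in place of `document`, e.g.:
# word2vec_model = Word2Vec(sentences, min_count=1, size=30, window=2, sg=1, iter=500)
```

With this layout, context windows no longer span joke boundaries, so the resulting vectors would differ from the ones shown in this notebook.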

In [12]:
#Tokenize cleaned jokes
cleaned_jokes = df['clean_body'].values 
jokes_tok = [nltk.word_tokenize(i) for i in cleaned_jokes]
print('second joke tokenized:', jokes_tok[1]) #joke at index 1, i.e. the second joke

#print word2vec embeddings
word2vec_model = Word2Vec.load("word2vec_size30.model")
print('shape of the embedding:', word2vec_model.wv[jokes_tok[1]].shape)
print('embedding:', word2vec_model.wv[jokes_tok[1]])

second joke tokenized: ['yo', 'mama', 'fat', 'jumps', 'air', 'gets', 'stuck']
shape of the embedding: (7, 30)
embedding: [[ 1.06863678e+00 -1.05477965e+00 -2.65132487e-01  1.07242644e+00
  -1.26709461e-01 -2.04000306e+00  1.55308709e-01 -8.59857559e-01
   1.18675566e+00  6.73972070e-01  5.51422477e-01  2.23011225e-01
  -6.57179654e-01 -3.48080933e-01 -3.92193526e-01  1.19578242e+00
   4.27507132e-01 -1.50509775e+00  9.92367804e-01 -8.62774551e-01
   2.76712120e-01 -1.12649727e+00 -1.62747502e+00  8.53952348e-01
  -4.48997617e-01  1.16722906e+00 -1.13196731e+00  1.76026309e+00
   2.74212646e+00 -1.12593627e+00]
 [ 5.80063045e-01 -1.52785826e+00 -5.15684128e-01  1.10004830e+00
   2.14492023e-01 -1.61776423e+00  2.43962914e-01 -4.28375334e-01
   3.42261679e-02  1.24423921e+00  2.09099159e-01  2.76257694e-01
  -7.22553432e-01 -5.57646602e-02 -1.04177184e-01  4.22811061e-01
   3.30781460e-01 -1.61578417e+00  8.20073426e-01 -4.21348423e-01
   3.08500350e-01 -7.21056640e-01 -1.32048702e+00  2

The Word2Vec model maps each word to a vector whose length is the size of the embedding space (specified when creating the model, here 30), so the 7-word joke above yields an array of shape (7, 30).

# Create X and Y

## Create X

In [9]:
def create_embeddings(cleaned_jokes, size_embed=30):
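    '''
    Create embeddings of jokes
    Args:
        -cleaned_jokes: list of cleaned jokes
        -size_embed: size of the embedding space of the saved Word2Vec model to load
    Returns:
        -embeddings: array with one row per joke, holding the concatenated (zero-padded) word embeddings of that joke
    '''
    #load the existing Word2Vec model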
    word2vec_model = Word2Vec.load("word2vec_size"+str(size_embed)+".model")
    
    MAX_NB_WORDS = 200 #maximum number of words considered for each joke
    embeddings = []
    
    #tokenize the jokes
    jokes_tok = [nltk.word_tokenize(joke) for joke in cleaned_jokes]
    
    #compute the embedding of each joke
    for joke in jokes_tok:
        embedding_joke = np.array([])
        
        #compute the embedding of each word and concatenate the results into a single 1-D array representing the joke
        count_words=0
        for word in joke:
            #if the word2vec model never encountered the word, skip it
            #(wv.vocab is the gensim < 4.0 API; gensim >= 4.0 exposes wv.key_to_index instead)
            if word not in word2vec_model.wv.vocab:
                continue
                
            if count_words==MAX_NB_WORDS: #if we exceeded the total number of words, we stop
                break
            if embedding_joke.shape[0] == 0: #First word in the joke: embedding_joke is empty
                embedding_joke = word2vec_model.wv[word]
            else:
                embedding_joke = np.concatenate([embedding_joke, word2vec_model.wv[word]], axis=0)
            count_words+=1
                
        #zero-pad jokes that contain fewer than MAX_NB_WORDS known words
        if embedding_joke.shape[0] < MAX_NB_WORDS*size_embed: #each word contributes size_embed values
            nb_of_values_to_add = MAX_NB_WORDS*size_embed - embedding_joke.shape[0] #number of values needed to reach the common length
            embedding_joke = np.pad(embedding_joke, (0, nb_of_values_to_add), constant_values=0.)
        
        #add to array of embeddings
        embeddings.append(embedding_joke)
        
        
    return np.array(embeddings)

X = create_embeddings(df['clean_body'].values)
print(X.shape)

(6900, 6000)
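
The shape (6900, 6000) follows from the layout: each joke is represented by at most 200 word vectors of 30 values each, concatenated and zero-padded to 200 × 30 = 6000 values. A toy version of the same truncate-and-pad step, using plain lists and deliberately small sizes (`pad_or_truncate`, `max_words=3` and `size_embed=2` are illustrative choices, not the notebook's values):

```python
def pad_or_truncate(word_vectors, max_words, size_embed):
    """Concatenate up to max_words vectors and zero-pad to max_words * size_embed values."""
    flat = []
    for vector in word_vectors[:max_words]:
        flat.extend(vector)
    flat.extend([0.0] * (max_words * size_embed - len(flat)))
    return flat

short_joke = [[0.1, 0.2], [0.3, 0.4]]                          # 2 words: padded
long_joke = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]   # 4 words: truncated

padded = pad_or_truncate(short_joke, max_words=3, size_embed=2)
truncated = pad_or_truncate(long_joke, max_words=3, size_embed=2)
print(padded)
print(truncated)
```

Both results have the same length, which is what lets the jokes be stacked into a single 2-D array.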


## Create Y

In this section, we convert the category attribute (a string) into a one-hot vector. `pd.get_dummies` orders the columns alphabetically by category name, so for example:

- 'Animal' is encoded as [1, 0, 0, 0, ..., 0]
- 'Bar' is encoded as [0, 1, 0, 0, ..., 0]
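
As a minimal illustration of the encoding, the sketch below builds one-hot vectors by hand, assuming (as `pd.get_dummies` does) that positions follow the alphabetical order of the category names. The helper `one_hot` is hypothetical, and only three categories are used to keep the output short.

```python
def one_hot(category, categories):
    """Return a 0/1 list with a single 1 at the position of `category` in the sorted category list."""
    ordered = sorted(set(categories))
    return [1 if name == category else 0 for name in ordered]

toy_categories = ["Sports", "Animal", "Bar"]  # small subset used for illustration

for name in sorted(toy_categories):
    print(name, one_hot(name, toy_categories))
```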

In [14]:
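y = pd.get_dummies(df['category'], dtype=np.uint8).values #explicit dtype keeps 0/1 integers (recent pandas versions default to booleans)
sample_idx = np.random.randint(0, len(df), size=5) #5 random row indices
print(sample_idx)
print('Random 5 categories:\n', df.iloc[sample_idx]['category'].values)
print('Corresponding 5 one-hot encoded categories:\n', y[sample_idx])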

[2238 1129 1561 2162 1160]
Random 5 categories:
 ['Men / Women' 'One Liners' 'Sports' 'Men / Women' 'One Liners']
Corresponding 5 one-hot encoded categories:
 [[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]]


In [15]:
print(y.shape)

(6900, 23)


# Classification

In [17]:
#split train/test
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)

(5520, 6000) (5520, 23)
(1380, 6000) (1380, 23)


In [26]:
#training
from model_embed import LSTM_embed

input_size = X_train.shape[1]
HIDDEN_DIM = 128
output_size = y.shape[1] #number of categories

model = LSTM_embed(input_size, HIDDEN_DIM, output_size)
loss_function = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
num_epochs = 10

model.learn(X_train, Y_train, loss_function, optimizer, num_epochs)

Epoch : 1/10, Iteration : 500/5520 , Loss : 0.041615374087690116
Epoch : 1/10, Iteration : 1000/5520 , Loss : 0.04111515875450356
Epoch : 1/10, Iteration : 1500/5520 , Loss : 0.04009425029596491
Epoch : 1/10, Iteration : 2000/5520 , Loss : 0.039275629965230104
Epoch : 1/10, Iteration : 2500/5520 , Loss : 0.03816463646472185
Epoch : 1/10, Iteration : 3000/5520 , Loss : 0.03642680443838385
Epoch : 1/10, Iteration : 3500/5520 , Loss : 0.036217531269897504
Epoch : 1/10, Iteration : 4000/5520 , Loss : 0.03722091781337859
Epoch : 1/10, Iteration : 4500/5520 , Loss : 0.035098847669987096
Epoch : 1/10, Iteration : 5000/5520 , Loss : 0.03536662477919511
Epoch : 1/10, Iteration : 5500/5520 , Loss : 0.033409736459906705
Epoch : 2/10, Iteration : 500/5520 , Loss : 0.031311861837943315
Epoch : 2/10, Iteration : 1000/5520 , Loss : 0.0307268901730926
Epoch : 2/10, Iteration : 1500/5520 , Loss : 0.031022331491526813
Epoch : 2/10, Iteration : 2000/5520 , Loss : 0.029508511836517967
Epoch : 2/10, Iterat
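
For scale, it can help to compare these values with two trivial baselines. Assuming the logged value is a mean-reduced squared error over the 23 outputs (the default behaviour of `nn.MSELoss`; the logging inside `model.learn` is not shown in this notebook), a model that outputs all zeros scores 1/23 ≈ 0.0435 against any one-hot target, and one that outputs 1/23 everywhere scores about 0.0416, close to the first logged loss. The arithmetic in plain Python:

```python
NUM_CLASSES = 23  # number of joke categories

def mse(prediction, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)

target = [1.0] + [0.0] * (NUM_CLASSES - 1)   # one-hot target (the position of the 1 does not matter)
all_zeros = [0.0] * NUM_CLASSES
uniform = [1.0 / NUM_CLASSES] * NUM_CLASSES

mse_zeros = mse(all_zeros, target)
mse_uniform = mse(uniform, target)
print(f"all-zeros baseline: {mse_zeros:.4f}")
print(f"uniform baseline:   {mse_uniform:.4f}")
```

Losses below these baselines indicate that the model has started to separate the categories rather than predicting the same output for every joke.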

In [27]:
predictions = model.compute_predictions(X_test)

labels = np.sort(df['category'].unique())
print(classification_report(Y_test, predictions, target_names=labels))

               precision    recall  f1-score   support

       Animal       0.38      0.39      0.39        56
          Bar       0.70      0.81      0.75        52
       Blonde       0.53      0.34      0.42        58
     Business       0.59      0.43      0.49        56
     Children       0.54      0.53      0.53        59
      College       0.80      0.80      0.80        55
        Gross       0.73      0.57      0.64        63
      Insults       0.49      0.42      0.45        55
  Knock-Knock       1.00      0.95      0.98        64
      Lawyers       0.55      0.69      0.61        58
    Lightbulb       0.94      0.96      0.95        68
      Medical       0.57      0.57      0.57        63
  Men / Women       0.25      0.28      0.26        53
Miscellaneous       0.30      0.34      0.32        62
   One Liners       0.42      0.42      0.42        64
 Other / Misc       0.11      0.17      0.13        47
    Political       0.60      0.51      0.55        69
         

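We see that some categories are easier to identify than others. For example, the 'Yo Mama' jokes are well classified, as are the 'Knock-Knock' (f1-score 0.98) and 'Lightbulb' (f1-score 0.95) jokes, whereas the 'Men / Women' jokes (f1-score 0.26) are not.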
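
The f1-score column is the harmonic mean of precision and recall. As a quick check against two rows of the report (the inputs are the rounded values printed above, so other rows may differ in the last digit):

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as printed in the classification report
report_rows = {
    "Bar": (0.70, 0.81),
    "Men / Women": (0.25, 0.28),
}

for category, (precision, recall) in report_rows.items():
    print(f"{category}: f1 = {f1_score_from(precision, recall):.2f}")
```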

In [28]:
def one_hot_to_category(encoded_y):
    '''
    Converts one-hot predictions (a single vector or a 2-D array of vectors) into a list of string categories
    '''
    df = pd.read_csv('created_dataframes/df_preprocessed.csv')
    categories = np.sort(df['category'].unique())
    encoded_y = np.atleast_2d(encoded_y) #treat a single vector as a batch of one
    return [categories[index] for index in np.argmax(encoded_y, axis=1)]

def accuracy(predictions, true_values):
    '''Computes accuracy: the fraction of samples whose predicted class (argmax) matches the true class'''
    #compare class indices directly; this avoids re-reading the CSV for every sample
    predicted_classes = np.argmax(np.asarray(predictions), axis=1)
    true_classes = np.argmax(np.asarray(true_values), axis=1)
    return np.mean(predicted_classes == true_classes)

print('Global Accuracy: {:.2f}\n'.format(accuracy(predictions, Y_test)))

Global Accuracy: 0.56
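
Accuracy here is the fraction of samples whose highest-scoring predicted category matches the true one. A small self-contained sketch on made-up toy data (three classes, four samples; all values are hypothetical):

```python
def argmax(values):
    """Index of the largest value in a list (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy_from_one_hot(predictions, true_values):
    """Fraction of rows where the predicted class matches the true class."""
    correct = sum(argmax(p) == argmax(t) for p, t in zip(predictions, true_values))
    return correct / len(true_values)

toy_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
toy_pred = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0]]  # the third sample is misclassified

toy_accuracy = accuracy_from_one_hot(toy_pred, toy_true)
print(f"toy accuracy: {toy_accuracy:.2f}")
```

For reference, if the 23 categories are equally represented, as the balanced dataframe suggests, uniform random guessing would reach an accuracy of about 1/23 ≈ 0.04.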



A global accuracy of 0.56 over 23 categories leaves clear room for improvement. One way to raise it would be to improve the model, for example by using state-of-the-art text classification models.