In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer


[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Data Summary

The dataset consists of short texts, and the task is to build a model that separates the humorous ones from the rest. My objective is to write a simple notebook that gives a head start to anyone who wants to work on this dataset. Let's investigate a little more.

In [2]:
dataset = pd.read_csv("../input/200k-short-texts-for-humor-detection/dataset.csv")
dataset.head()

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False


The texts contain punctuation, digits, and a mix of uppercase and lowercase letters. These will be normalized in the data preprocessing part.

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    200000 non-null  object
 1   humor   200000 non-null  bool  
dtypes: bool(1), object(1)
memory usage: 1.7+ MB


There are no null values in the dataset.

# Data Preprocessing

Luckily, pandas already provides a string method for lowercasing. For punctuation and digits, we will use regular expressions to match and remove them.

In [4]:
# Lowercase all letters
dataset["text"] = dataset["text"].str.lower()
# Remove punctuation (anything that is neither a word character nor whitespace)
dataset["text"] = dataset["text"].str.replace(r"[^\w\s]", "", regex=True)
# Remove digits
dataset["text"] = dataset["text"].str.replace(r"\d", "", regex=True)
dataset.head()

Unnamed: 0,text,humor
0,joe biden rules out bid guys im not running,False
1,watch darvish gave hitter whiplash with slow p...,False
2,what do you call a turtle without its shell dead,True
3,reasons the election feels so personal,False
4,pasco police shot mexican migrant from behind ...,False
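
To make the effect of the two patterns concrete, here is a minimal sketch that applies the same three steps to a single string with Python's built-in `re` module. The helper name `clean_text` is only illustrative and is not part of the notebook.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, then strip punctuation and digits (illustrative helper)."""
    text = text.lower()
    # [^\w\s] matches any character that is neither a word character nor whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # \d matches any digit
    text = re.sub(r"\d", "", text)
    return text

sample = "Joe Biden rules out 2020 bid: 'guys, I'm not running'"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Removing the digits leaves a double space where `2020` used to be. This is harmless here because the later steps split the text on whitespace.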


Stop words generally carry little information, so we will remove them as well. To do that, we use the English stop word list provided by the nltk library.

In [5]:
# Remove stop words from dataset (a set makes the membership test fast)
stop_words = set(stopwords.words('english'))
dataset['text'] = dataset['text'].apply(lambda x: " ".join(w for w in str(x).split() if w not in stop_words))
dataset.head()

Unnamed: 0,text,humor
0,joe biden rules bid guys im running,False
1,watch darvish gave hitter whiplash slow pitch,False
2,call turtle without shell dead,True
3,reasons election feels personal,False
4,pasco police shot mexican migrant behind new a...,False
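
The filtering logic can be tried on one sentence without downloading anything. The sketch below assumes a tiny hand-picked stop list (`toy_stop_words`, hypothetical) in place of the full NLTK list.

```python
# A tiny hand-picked stop list, assumed here only for illustration;
# the notebook uses the full NLTK English list instead.
toy_stop_words = {"what", "do", "you", "a", "its"}

def remove_stop_words(text: str, stop_words: set) -> str:
    # split() also collapses any repeated whitespace left by earlier cleaning
    return " ".join(word for word in text.split() if word not in stop_words)

result = remove_stop_words(
    "what do you call a turtle without its shell  dead", toy_stop_words
)
print(result)
```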


We should also eliminate the rare words in the dataset, since they are unlikely to carry much information about our problem, yet every distinct word becomes another column that the machine learning model has to process.

In [6]:
# Rare words are eliminated
# Rare word definition: the word's count is at or below the 25th percentile of all word counts
freq_words = pd.Series(" ".join(dataset['text']).split()).value_counts()
rare_words = set(freq_words[freq_words <= freq_words.quantile(.25)].index)
dataset["text"] = dataset["text"].apply(lambda x: " ".join(w for w in x.split() if w not in rare_words))
dataset.head()

Unnamed: 0,text,humor
0,joe biden rules bid guys im running,False
1,watch gave hitter whiplash slow pitch,False
2,call turtle without shell dead,True
3,reasons election feels personal,False
4,police shot mexican migrant behind new autopsy...,False
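
The quantile-based rule is easier to follow on a small example. The sketch below uses a three-sentence toy corpus (invented for illustration) and the same definition of a rare word.

```python
import pandas as pd

texts = pd.Series(["knock knock joke", "knock joke funny", "rare word"])

# Count how often each word occurs across the whole toy corpus
word_counts = pd.Series(" ".join(texts).split()).value_counts()

# 25th percentile of the counts; words at or below it are treated as rare
threshold = word_counts.quantile(0.25)
rare_words = set(word_counts[word_counts <= threshold].index)

filtered = texts.apply(
    lambda t: " ".join(w for w in t.split() if w not in rare_words)
)
print(threshold, sorted(rare_words))
print(filtered.tolist())
```

In this toy corpus the last text ends up empty because all of its words are rare. The same can happen to rows of a real dataset, so it may be worth checking for empty texts after this step.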


Lastly, the boolean labels are converted to integers so that the model and the metric functions work with plain 0/1 targets. We also need to separate the dataset into train and test sets to see how the model performs on data it has never seen.

In [7]:
dataset["humor"] = dataset["humor"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(dataset["text"], dataset["humor"], test_size=0.2, random_state=5)

# Count Vectorization

I use the count vectorizer with its default parameters.

The code below splits the texts into words. Every word in the training data becomes a column of the new dataset, and for each observation (row) the column holds the number of times that word occurs in it.

In [8]:
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Default Logistic Regression

I chose logistic regression with default parameters, except for the maximum number of iterations, which is raised to give the solver enough time to converge to the minimum of the loss function.

In [9]:
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_vectorized, y_train)

LogisticRegression(max_iter=1000)

Train performance of the model is given below.

In [10]:
preds = lr.predict(X_train_vectorized)
print(f"Train Accuracy: {accuracy_score(y_train, preds)}")
print(f"Train ROC-AUC Score: {roc_auc_score(y_train, preds)}")
print(f"Train F1 Score: {f1_score(y_train, preds)}")

Train Accuracy: 0.94150625
Train ROC-AUC Score: 0.9415036776839915
Train F1 Score: 0.9416953756253154


Test performance of the model is given below.

In [11]:
preds = lr.predict(X_test_vectorized)
print(f"Test Accuracy: {accuracy_score(y_test, preds)}")
print(f"Test ROC-AUC Score: {roc_auc_score(y_test, preds)}")
print(f"Test F1 Score: {f1_score(y_test, preds)}")


Test Accuracy: 0.9113
Test ROC-AUC Score: 0.911302281684157
Test F1 Score: 0.9109393041819368
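
The ROC-AUC values printed above are computed from the hard 0/1 output of `lr.predict`. In that case ROC-AUC reduces to balanced accuracy, whereas the metric is usually computed from scores or probabilities. The sketch below shows the difference on toy labels and scores that are invented for illustration.

```python
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Toy labels and scores, invented for illustration only
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.6, 0.9, 0.7, 0.4, 0.2]  # stand-in for predict_proba(...)[:, 1]
hard = [int(s >= 0.5) for s in scores]    # stand-in for predict(...)

auc_from_labels = roc_auc_score(y_true, hard)
auc_from_scores = roc_auc_score(y_true, scores)
balanced_acc = balanced_accuracy_score(y_true, hard)
print(auc_from_labels, balanced_acc, auc_from_scores)
```

In this notebook, the probability-based variant would presumably look like `roc_auc_score(y_test, lr.predict_proba(X_test_vectorized)[:, 1])`. Its value is not reported here because it was not part of the original run.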


The model seems to fit the data pretty well. All three metrics are close to each other; note that ROC-AUC is computed here from hard 0/1 predictions rather than probabilities, so it equals balanced accuracy and naturally tracks accuracy closely. Training performance is a little higher than test performance, as expected. However, the gap of about three percentage points may be a sign of some overfitting and is worth checking. Cross-validation with hyperparameter tuning should be added to the code. However, I will finish the modelling part here since I want this notebook to be a simple NLP guide.
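
For readers who want to take that next step, here is a minimal sketch of cross-validated tuning with `GridSearchCV`. The toy texts, the grid values for `C`, and `cv=2` are all assumptions chosen so that the example runs quickly; on the real data one would pass `X_train` and `y_train` and likely use more folds.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in for X_train / y_train, invented for illustration
toy_texts = [
    "knock knock whos there", "call turtle without shell dead",
    "dyslexic man walks bar", "yo mama joke",
    "police shot migrant autopsy", "reasons election feels personal",
    "experts reveal health recipes", "watch video rescued dog",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline refits it on each training fold
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# C is the inverse regularization strength; these grid values are arbitrary
param_grid = {"classifier__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(toy_texts, toy_labels)
print(search.best_params_, search.best_score_)
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned only from the training part of each fold, so the validation folds do not leak into the features.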

We can also extract the words that matter most to the model to see what kind of words are distinctive for humor. To do that, we sort the model's coefficients by absolute value and look at which words the largest ones correspond to.

Note that an important word does not necessarily make a sentence humorous. Some of these words instead tell the model that the sentence is not humorous. They are important because they are distinctive one way or the other.

Let's see the 50 most important words in the dataset.

In [12]:
# Rank the words by the absolute value of their learned coefficient
important_feature_indexes = [i[0] for i in sorted(enumerate(np.abs(lr.coef_[0])), key=lambda x: x[1], reverse=True)]
important_features = vectorizer.get_feature_names_out()[important_feature_indexes]
important_features[:50]

array(['favourite', 'fuck', 'photos', 'allegedly', 'call', 'huffpost',
       'shit', 'norris', 'recipes', 'fucking', 'reportedly', 'experts',
       'video', 'yo', 'rescued', 'heres', 'photo', 'ways', 'cuff',
       'dyslexic', 'walks', 'snl', 'tiers', 'joke', 'opposite', 'reasons',
       'reveals', 'viagra', 'knock', 'mexicans', 'cross', 'infographic',
       'toupee', 'lightbulb', 'erection', 'til', 'redneck', 'obamacare',
       'rjokes', 'midgets', 'samantha', 'queer', 'cows', 'recently',
       'health', 'alleged', 'proves', 'diarrhea', 'feds', 'festival'],
      dtype=object)

We see that curse words are among the most distinctive words for deciding whether a sentence is humorous.

Seeing knock among the important features seems right, since knock-knock jokes are pretty common.

Seeing mexicans here suggests that there may be jokes about ethnic groups in the dataset.

Health, on the other hand, may signal to the model that the sentence is not humorous.

We could also collect the sentences in which these words occur and check whether they are mostly humorous or not. However, I would like to end my analysis here.
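
As a rough sketch of that follow-up, the share of humorous sentences among those containing a given word could be computed as below. The toy frame and the helper `humor_rate` are hypothetical; only the column names `text` and `humor` come from the dataset.

```python
import pandas as pd

# Toy frame with the same column names as the notebook's dataset
toy = pd.DataFrame({
    "text": ["knock knock whos there", "knock knock joke",
             "police knock door", "health experts reveal study"],
    "humor": [1, 1, 0, 0],
})

def humor_rate(frame: pd.DataFrame, word: str) -> float:
    """Share of humorous rows among the rows whose text contains `word`."""
    contains = frame["text"].apply(lambda t: word in t.split())
    return frame.loc[contains, "humor"].mean()

print(humor_rate(toy, "knock"))
print(humor_rate(toy, "health"))
```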

Thanks for reading.


## Fatih Özgür Ardıç