# Case Study: Jokes

In this case study we find out if we can make ourselves funnier by analysing jokes from a database.

The case study is divided into several parts:
- Goals
- Parsing
- Preparation (cleaning)
- Processing
- Exploration
- Visualization
- Conclusion

## Goals

In this section we define the questions that will guide us throughout the case study:

- What jokes are funny?
- Can we find types of jokes?
- Would a joke recommender work?

We'll (try to) keep these questions in mind while performing the case study.

## Parsing

We start out by importing all the necessary libraries.

In [1]:
import os
import json
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats
import matplotlib.pyplot as plt
from matplotlib_inline.backend_inline import set_matplotlib_formats  # the IPython.display version is deprecated
%matplotlib inline
set_matplotlib_formats('svg')


To download datasets from Kaggle we need an API key to access their API, so we create the credentials file here.

In [2]:
# Use the current user's home directory; this resolves to /root/.kaggle when running as root.
kaggle_dir = os.path.expanduser("~/.kaggle")
os.makedirs(kaggle_dir, exist_ok=True)

with open(os.path.join(kaggle_dir, 'kaggle.json'), 'w') as f:
    json.dump(
        {
            "username":"lorenzf",
            "key":"7a44a9e99b27e796177d793a3d85b8cf"
        }
        , f)
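
A hard-coded `/root/.kaggle` path only works when the notebook runs as root. As a minimal sketch, the helper below writes `kaggle.json` into the current user's home directory and restricts the file permissions. The name `write_kaggle_credentials` and its `config_dir` parameter are hypothetical and not part of the kaggle package. The demo call uses placeholder credentials in a temporary folder, so nothing real is touched.

```python
import json
import os
import tempfile
from pathlib import Path


def write_kaggle_credentials(username, key, config_dir=None):
    """Write kaggle.json and return its path (hypothetical helper)."""
    # Default to ~/.kaggle, where the kaggle client looks for credentials.
    target_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    credentials_path = target_dir / "kaggle.json"
    with open(credentials_path, "w") as f:
        json.dump({"username": username, "key": key}, f)
    # Make the file readable by its owner only, since it holds a secret.
    os.chmod(credentials_path, 0o600)
    return credentials_path


# Demo with placeholder values in a throwaway directory.
demo_dir = tempfile.mkdtemp()
demo_path = write_kaggle_credentials("some-user", "some-key", config_dir=demo_dir)
with open(demo_path) as f:
    demo_content = json.load(f)

print(demo_path.name, demo_content)
```

In a shared notebook the key would ideally come from an environment variable or a prompt rather than being written into the code.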

Now we can import kaggle too and download the dataset.

In [3]:
import kaggle
kaggle.api.dataset_download_files(dataset='pavellexyr/one-million-reddit-jokes', path='./data', unzip=True)



The CSV files are now in the './data' folder, so we can read them using pandas. Here is the list of all CSV files in our folder:

In [4]:
os.listdir('./data')

['one-million-reddit-jokes.csv']

With only one file in the dataset, we import it.

In [5]:
reddit_jokes_df = pd.read_csv('./data/one-million-reddit-jokes.csv')
print('shape: ' + str(reddit_jokes_df.shape))
reddit_jokes_df.head()

shape: (1000000, 12)


Unnamed: 0,type,id,subreddit.id,subreddit.name,subreddit.nsfw,created_utc,permalink,domain,url,selftext,title,score
0,post,ftbp1i,2qh72,jokes,False,1585785543,https://old.reddit.com/r/Jokes/comments/ftbp1i...,self.jokes,,My corona is covered with foreskin so it is no...,I am soooo glad I'm not circumcised!,2
1,post,ftboup,2qh72,jokes,False,1585785522,https://old.reddit.com/r/Jokes/comments/ftboup...,self.jokes,,It's called Google Sheets.,Did you know Google now has a platform for rec...,9
2,post,ftbopj,2qh72,jokes,False,1585785508,https://old.reddit.com/r/Jokes/comments/ftbopj...,self.jokes,,The vacuum doesn't snore after sex.\n\n&amp;#x...,What is the difference between my wife and my ...,15
3,post,ftbnxh,2qh72,jokes,False,1585785428,https://old.reddit.com/r/Jokes/comments/ftbnxh...,self.jokes,,[removed],My last joke for now.,9
4,post,ftbjpg,2qh72,jokes,False,1585785009,https://old.reddit.com/r/Jokes/comments/ftbjpg...,self.jokes,,[removed],The Nintendo 64 turns 18 this week...,134


Already we can see a lot of unnecessary information, so cleanup is important. It seems each joke is split into a title and a selftext, with the punchline often in the selftext.

## Preparation

Here we perform tasks to bring the data into a more convenient format.

### Cleanup

The first thing I would like to do is see which columns are useless, by printing the number of unique values per column.

In [6]:
for col in reddit_jokes_df.columns:
  print(col)
  print(reddit_jokes_df[col].nunique())
  print()

type
1

id
1000000

subreddit.id
1

subreddit.name
1

subreddit.nsfw
1

created_utc
996373

permalink
1000000

domain
364

url
4410

selftext
520567

title
861254

score
8913



A few columns only have 1 unique value, and the identifiers and links are not important for our case, so we drop them all.

In [7]:
reddit_jokes_df = reddit_jokes_df.drop(columns=['type', 'id', 'subreddit.id', 'subreddit.name', 'subreddit.nsfw', 'permalink', 'url'])
reddit_jokes_df.head()

Unnamed: 0,created_utc,domain,selftext,title,score
0,1585785543,self.jokes,My corona is covered with foreskin so it is no...,I am soooo glad I'm not circumcised!,2
1,1585785522,self.jokes,It's called Google Sheets.,Did you know Google now has a platform for rec...,9
2,1585785508,self.jokes,The vacuum doesn't snore after sex.\n\n&amp;#x...,What is the difference between my wife and my ...,15
3,1585785428,self.jokes,[removed],My last joke for now.,9
4,1585785009,self.jokes,[removed],The Nintendo 64 turns 18 this week...,134


Much cleaner already!
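
The same check can be automated. Here is a small sketch on a toy dataframe (the data and variable names are made up for illustration) that finds and drops every column with a single unique value:

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "type": ["post", "post", "post"],
        "subreddit.nsfw": [False, False, False],
        "score": [2, 9, 15],
    }
)

# A column with one unique value carries no information for the analysis.
constant_columns = [col for col in toy_df.columns if toy_df[col].nunique(dropna=False) == 1]
trimmed_df = toy_df.drop(columns=constant_columns)

print(constant_columns)
print(list(trimmed_df.columns))
```

Columns such as the id and the links are unique per row rather than constant, so they would still have to be dropped by name.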

### Data Types

Before we do anything with our data, it is good to check that our data types are in order.

In [8]:
reddit_jokes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   created_utc  1000000 non-null  int64 
 1   domain       1000000 non-null  object
 2   selftext     995525 non-null   object
 3   title        1000000 non-null  object
 4   score        1000000 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 38.1+ MB


The created_utc feature is encoded as a Unix timestamp (seconds since 1970-01-01 UTC); it is more useful to convert it to a datetime.

In [9]:
reddit_jokes_df['created'] = pd.to_datetime(reddit_jokes_df['created_utc'], unit='s')
del reddit_jokes_df['created_utc']
reddit_jokes_df.head()

Unnamed: 0,domain,selftext,title,score,created
0,self.jokes,My corona is covered with foreskin so it is no...,I am soooo glad I'm not circumcised!,2,2020-04-01 23:59:03
1,self.jokes,It's called Google Sheets.,Did you know Google now has a platform for rec...,9,2020-04-01 23:58:42
2,self.jokes,The vacuum doesn't snore after sex.\n\n&amp;#x...,What is the difference between my wife and my ...,15,2020-04-01 23:58:28
3,self.jokes,[removed],My last joke for now.,9,2020-04-01 23:57:08
4,self.jokes,[removed],The Nintendo 64 turns 18 this week...,134,2020-04-01 23:50:09
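
As a quick check of the conversion, here is a sketch that converts the `created_utc` values of the first and last rows shown earlier. The `utc=True` variant is an optional extra that the case study does not use; it keeps the timezone explicit.

```python
import pandas as pd

created_utc = pd.Series([1585785543, 1585785009])

# unit="s" tells pandas the integers are seconds since 1970-01-01 UTC.
created = pd.to_datetime(created_utc, unit="s")
created_aware = pd.to_datetime(created_utc, unit="s", utc=True)

print(created.iloc[0])
print(created.dt.weekday.iloc[0])  # Monday is 0
print(created_aware.dt.tz)
```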


### Missing values

We apply a few checks to the dataframe to assess the quality of the data, starting with the percentage of missing values per column.

In [10]:
print(100*reddit_jokes_df.isna().sum()/reddit_jokes_df.shape[0])

domain      0.0000
selftext    0.4475
title       0.0000
score       0.0000
created     0.0000
dtype: float64


It looks like some jokes are missing the selftext field; we show a few here, sorted by score.

In [11]:
reddit_jokes_df[reddit_jokes_df.selftext.isna()].sort_values(by='score', ascending=False)

Unnamed: 0,domain,selftext,title,score,created
625315,imgur.com,,The funniest /r/jokes has ever been,67950,2017-05-20 15:41:28
971313,self.jokes,,Ellen Pao's career,36918,2015-07-03 15:41:05
942471,self.jokes,,"If a woman sleeps with a bunch of guys, she's ...",17486,2015-10-05 16:09:09
926550,self.jokes,,One in every 2 and a half men is HIV positive.,17456,2015-11-18 04:29:54
919422,self.jokes,,"Accordion to a recent survey, replacing words ...",12580,2015-12-07 18:55:27
...,...,...,...,...,...
929807,self.jokes,,9gag,0,2015-11-09 03:33:22
959394,self.jokes,,Like flaming globes of Sigmund,0,2015-08-14 13:40:21
929809,self.jokes,,"On a scale of 10 to 10, how good am I at givin...",0,2015-11-09 03:26:55
959338,self.jokes,,Who is Julius Caesar's favorite singer? Mark A...,0,2015-08-14 17:03:55


As far as I can see, these jokes are so short that the title holds the whole joke, so we can fill in the missing values with an empty string.

In [12]:
reddit_jokes_df.selftext = reddit_jokes_df.selftext.fillna('')

This does not mean we are done. Earlier I noticed the values [removed] and [deleted] in the selftext feature, indicating the joke text was removed or deleted. These are missing values too!

In [13]:
reddit_jokes_df[reddit_jokes_df.selftext.isin(['[removed]', '[deleted]'])].head()

Unnamed: 0,domain,selftext,title,score,created
3,self.jokes,[removed],My last joke for now.,9,2020-04-01 23:57:08
4,self.jokes,[removed],The Nintendo 64 turns 18 this week...,134,2020-04-01 23:50:09
5,self.jokes,[removed],Sex with teacher.,1,2020-04-01 23:49:55
6,self.jokes,[removed],Another long one.,8,2020-04-01 23:44:11
8,self.jokes,[removed],A Priest takes a walk down to the docks one day,88,2020-04-01 23:39:27


I am going to remove these jokes as they are no longer complete. They may have been removed because they had already been posted before.

In [14]:
reddit_jokes_df = reddit_jokes_df[~reddit_jokes_df.selftext.isin(['[removed]', '[deleted]'])]
reddit_jokes_df.shape

(578637, 5)

It seems we have kept about 578k jokes, not bad!
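
Another way to handle the placeholders, sketched here on made-up rows, is to turn them into real missing values first, so that the missing-value percentage reflects them before anything is dropped. This is an alternative to the `isin` filter used above, not what the case study does.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "title": ["joke a", "joke b", "joke c", "joke d"],
        "selftext": ["tooth hurty!", "[removed]", "[deleted]", ""],
    }
)

placeholders = ["[removed]", "[deleted]"]

# mask() replaces the values where the condition holds with NaN.
selftext = toy_df["selftext"].mask(toy_df["selftext"].isin(placeholders))
missing_share = selftext.isna().mean()

print(missing_share)
```

The empty string of a title-only joke is left alone, so only the placeholder rows count as missing.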

### Duplicates

As the formatting of the text might differ between reposts, I'm not expecting a lot of duplicates. Let's see what we can find.

In [15]:
reddit_jokes_df[reddit_jokes_df.duplicated(subset=['title', 'selftext'])]

Unnamed: 0,domain,selftext,title,score,created
211,self.jokes,An academia nut..,What do you call a nut that gets good grades?,5,2020-04-01 18:54:06
4452,self.jokes,Repossssssssst,If a snake who is on reddit has to comment a r...,0,2020-03-27 09:16:20
6349,self.jokes,"“To Japan,” replies her husband. \n\n“Oh my! T...",A woman asks her husband where he’s taking the...,4,2020-03-25 00:48:09
6881,self.jokes,"Fortunately, I belong to the 1% of intelligent...",99.9% of people are idiots.,45135,2020-03-24 09:40:14
8299,self.jokes,You tell it a shitty joke.,How do you get a toilet to laugh?,0,2020-03-22 07:49:45
...,...,...,...,...,...
999779,self.jokes,Dam.,What did the fish say when he hit the wall?,25,2015-03-27 10:33:12
999851,self.jokes,He tractor down.,How did the farmer find his wife?,58,2015-03-27 02:42:29
999882,self.jokes,,women's rights,0,2015-03-27 00:48:36
999936,self.jokes,"Don't be stupid, feminists can't change anything",How many feminists does it take to change a li...,24,2015-03-26 22:00:06


A fair number of jokes are reposted, so for each repost we keep only the copy with the highest score.

In [16]:
reddit_jokes_df = reddit_jokes_df.sort_values('score').drop_duplicates(subset=['title', 'selftext'], keep='last').reset_index()
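
To see why sorting first and then using `keep='last'` retains the highest-scoring copy, here is a small sketch on made-up rows. It uses `reset_index(drop=True)`, whereas the cell above calls `reset_index()` without it, which is why the old row numbers show up as an `index` column in the later tables.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "title": ["how did the farmer find his wife?"] * 2 + ["what did the fish say?"],
        "selftext": ["he tractor down.", "he tractor down.", "dam."],
        "score": [5, 45, 3],
    }
)

deduped = (
    toy_df.sort_values("score")  # lowest score first, highest last
    .drop_duplicates(subset=["title", "selftext"], keep="last")  # keep the highest-scoring copy
    .reset_index(drop=True)  # drop=True avoids an extra 'index' column
)

print(deduped["score"].tolist())
```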

### Text formatting

Before we can analyze the text in the jokes we have to format it. We do this by removing every character that is not a letter, a space or one of `,.!?` (so digits, apostrophes and newlines disappear too) and changing everything to lowercase.

In [17]:
for col in ['selftext', 'title']:
  reddit_jokes_df[col] = reddit_jokes_df[col].replace(to_replace="[^a-zA-Z,.!? ]", value="", regex=True).str.lower()

reddit_jokes_df.head()

Unnamed: 0,index,domain,selftext,title,score,created
0,630580,self.jokes,"those who need closure,",there are two kinds of people in the world.,0,2017-05-12 17:01:44
1,187066,self.jokes,so when someone asks you can say its .,set your wifi password to,0,2019-05-28 00:30:46
2,437464,self.jokes,tooth hurty!,at what time do you see your dentist?,0,2018-03-28 10:17:26
3,714598,self.jokes,where did you get a phone that works in spaini...,john and juan are on lunch break when juans ph...,0,2017-01-13 02:37:59
4,187072,self.jokes,me how many am i allowed?guy only one me well ...,a guy is handing out free fake mustaches on th...,0,2019-05-28 00:20:01
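
To make the effect of the pattern concrete, here is a small sketch with the standard `re` module applied to two made-up strings. The helper name `clean_text` is hypothetical; the pattern is the one used in the cell above.

```python
import re

# Everything that is NOT a letter, comma, period, '!', '?' or space gets removed.
UNWANTED = re.compile(r"[^a-zA-Z,.!? ]")


def clean_text(text):
    """Strip unwanted characters and lowercase the result."""
    return UNWANTED.sub("", text).lower()


cleaned_time = clean_text("Tooth hurt-y at 2:30!\n")
cleaned_name = clean_text("Juan's phone rings...")

print(repr(cleaned_time))
print(repr(cleaned_name))
```

Because newlines are removed without leaving a space behind, words on either side of a line break get glued together, which may explain tokens such as `allowed?guy` in the table above.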


Next we create a single joke text by combining the title and selftext, which makes it easier to work with.

In [18]:
reddit_jokes_df['joke'] = reddit_jokes_df.title + ' ' + reddit_jokes_df.selftext
reddit_jokes_df = reddit_jokes_df.drop(columns=['title', 'selftext'])
reddit_jokes_df.head()

Unnamed: 0,index,domain,score,created,joke
0,630580,self.jokes,0,2017-05-12 17:01:44,there are two kinds of people in the world. th...
1,187066,self.jokes,0,2019-05-28 00:30:46,set your wifi password to so when someone ask...
2,437464,self.jokes,0,2018-03-28 10:17:26,at what time do you see your dentist? tooth hu...
3,714598,self.jokes,0,2017-01-13 02:37:59,john and juan are on lunch break when juans ph...
4,187072,self.jokes,0,2019-05-28 00:20:01,a guy is handing out free fake mustaches on th...


## Processing

### Timing of joke

I would like to know if the timing of a joke has an impact on how funny it is (measured by its score), so I grouped the jokes by weekday (0 is Monday) as well as by hour of day (in UTC).

In [19]:
reddit_jokes_weekday = reddit_jokes_df.groupby(reddit_jokes_df.created.dt.weekday).score.agg(['mean', 'count'])
reddit_jokes_weekday

created,mean,count
0,226.871773,79866
1,228.808886,82940
2,222.802165,84793
3,215.771594,84932
4,222.888666,82634
5,232.752534,75089
6,241.322581,75516


In [20]:
reddit_jokes_hour = reddit_jokes_df.groupby(reddit_jokes_df.created.dt.hour).score.agg(['mean', 'count'])
reddit_jokes_hour

created,mean,count
0,189.177767,25646
1,189.383726,25440
2,172.406772,25368
3,140.741126,23637
4,144.06696,21162
5,137.355467,19006
6,168.542319,16671
7,214.903014,15198
8,271.710558,14217
9,398.431366,14009


### Bag of words
To be able to work with the words in our jokes, we create a bag of words dataframe, where for each word and joke combination a count is kept of how many times the word is present in that joke. Note that stopwords are removed.

First we split each joke up into words.

In [21]:
joke_words = reddit_jokes_df.joke.str.split(' ')
joke_words.head()

0    [there, are, two, kinds, of, people, in, the, ...
1    [set, your, wifi, password, to, , so, when, so...
2    [at, what, time, do, you, see, your, dentist?,...
3    [john, and, juan, are, on, lunch, break, when,...
4    [a, guy, is, handing, out, free, fake, mustach...
Name: joke, dtype: object
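
The empty token in the second row comes from two consecutive spaces, left behind where characters were stripped during formatting. A small sketch in plain Python shows the difference between splitting on a single space and splitting on any whitespace (the sample sentence is adapted from that row):

```python
joke = "set your wifi password to  so when someone asks"

tokens_single_space = joke.split(" ")  # keeps an empty token between the two spaces
tokens_whitespace = joke.split()       # splits on any run of whitespace

print(tokens_single_space)
print(tokens_whitespace)
```

In pandas, calling `.str.split()` without an argument behaves like the second variant.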

Next we use the NLTK toolkit to get a list of English stopwords.

In [22]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stopwords.words('english')[:5]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['i', 'me', 'my', 'myself', 'we']

We remove all the stopwords from the jokes, which leaves them with broken grammar. As a demonstration this is applied to the first five jokes only.

In [23]:
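# Build the stopword set once; calling stopwords.words() for every word is very slow.
english_stopwords = set(stopwords.words('english'))
joke_words = joke_words.head().apply(lambda words: [word for word in words if word not in english_stopwords])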
joke_words.head()

0         [two, kinds, people, world., need, closure,]
1       [set, wifi, password, , someone, asks, say, .]
2                 [time, see, dentist?, tooth, hurty!]
3    [john, juan, lunch, break, juans, phone, rings...
4    [guy, handing, free, fake, mustaches, street, ...
Name: joke, dtype: object

Finally we use sklearn's CountVectorizer to create the BoW matrix. This is a sparse matrix, as most words do not appear in most jokes.
This means we cannot visualise the full matrix: a dense int64 version would need roughly 90 GB of memory, and our computer would explode.

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

cnt_vect = CountVectorizer(analyzer="word", stop_words=stopwords.words('english'), max_features=20000) 

bow_jokes = cnt_vect.fit_transform(reddit_jokes_df.joke.values)

In [25]:
bow_jokes

<565770x20000 sparse matrix of type '<class 'numpy.int64'>'
	with 9101120 stored elements in Compressed Sparse Row format>
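
A back-of-the-envelope sketch of why the matrix has to stay sparse, using the shape and the number of stored elements reported above. The 4-byte index size is an assumption about how scipy stores this particular matrix.

```python
n_rows, n_cols = 565_770, 20_000  # shape reported above
n_stored = 9_101_120              # non-zero entries reported above

# Dense: one 8-byte int64 per cell.
dense_bytes = n_rows * n_cols * 8

# CSR: values (int64) + column indices + row pointers.
# Assumption: both index arrays use 4-byte int32.
csr_bytes = n_stored * 8 + n_stored * 4 + (n_rows + 1) * 4

density = n_stored / (n_rows * n_cols)

print(f"dense:   {dense_bytes / 1e9:.1f} GB")
print(f"sparse:  {csr_bytes / 1e6:.1f} MB")
print(f"density: {density:.4%}")
```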

We can, however, fetch the vocabulary of our bag. It is sorted alphabetically and starts with a lot of weird words, indicating we might have chosen too many features.

In [26]:
cnt_vect.get_feature_names_out()[:10]

array(['aa', 'aaa', 'aaah', 'aah', 'aardvark', 'aaron', 'ab', 'aback',
       'abacus', 'abandon'], dtype=object)
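
To make the bag-of-words idea concrete without sklearn, here is a minimal sketch using only the standard library on two made-up jokes. The tiny stopword set is an assumption for illustration, not the NLTK list.

```python
from collections import Counter

jokes = ["what did the fish say", "dam dam said the fish"]
toy_stopwords = {"the", "what", "did"}  # illustrative, not the NLTK list

# One Counter per joke: word -> number of occurrences, stopwords excluded.
counts = [Counter(word for word in joke.split() if word not in toy_stopwords) for joke in jokes]

# The vocabulary is the sorted set of all remaining words.
vocabulary = sorted(set().union(*counts))

# One row per joke, one column per vocabulary word.
matrix = [[joke_counts[word] for word in vocabulary] for joke_counts in counts]

print(vocabulary)
print(matrix)
```

Each row is one joke and each column one word, which is the same layout as the sparse matrix above, only small enough to print.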

### Term Frequency - Inverse Document Frequency
Another interesting method is the tf-idf matrix, where each occurrence is weighted by the inverse of how common that word is across all jokes. If a word is used often over all jokes it won't be as important, but if a word is used infrequently it is more important.

Again we use sklearn to vectorize our jokes.

In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()
tfidf_jokes = tfidf_vect.fit_transform(reddit_jokes_df.joke.values)
tfidf_jokes

<565770x196601 sparse matrix of type '<class 'numpy.float64'>'
	with 15153814 stored elements in Compressed Sparse Row format>

We can create a quick dataframe to interpret the result: for each word in our dataset we retrieve the inverse document frequency. A high idf means a rare word.

In [28]:
idf = pd.DataFrame(
    {
      'term': tfidf_vect.get_feature_names_out(),
      'idf': tfidf_vect.idf_,
    }
)
idf.head()

Unnamed: 0,term,idf
0,aa,10.026437
1,aaa,10.275653
2,aaaa,12.454185
3,aaaaa,13.147332
4,aaaaaa,13.552798


When we sort them by idf we find the rarest words. They all share the same maximum idf and appear to be mostly typos, puns and usernames, so this doesn't seem to be useful at the moment.

In [29]:
idf.sort_values(by='idf', ascending=False).head(10)

Unnamed: 0,term,idf
196600,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzthe,13.552798
110080,misterunderstanding,13.552798
110074,misterectomy,13.552798
110075,misterious,13.552798
110076,misterjmyers,13.552798
110077,misterlee,13.552798
110078,misterogyny,13.552798
110079,misters,13.552798
110081,misterunderstood,13.552798
110072,misterapproximate,13.552798
