# **Exploration of a Reddit r/jokes dataset (~37,200 jokes)**
Extracted and simplified from a [Kaggle](https://www.kaggle.com/datasets/bwandowando/reddit-rjokes-dataset) dataset.
I kept only the subset of columns that are relevant to my analysis.
Here is the list of columns that I kept:
- *thread_id*: unique id of the thread containing the joke (Object)
- *thread_title*: Title of the thread (Object) *sometimes the title contains the start of the joke*
- *thread_selftext*: Text of the thread, which includes the joke (Object)
- *thread_score*: Score of the thread (Object) *supposed to be upvotes minus downvotes, but I can't see any negative values, so I still don't know exactly what it is.*
- *thread_num_comments*: Number of comments in the thread (float64)
- *thread_created_utc*: Time of the thread creation in UTC (Object)
- *thread_upvote_ratio*: Ratio of upvotes to total votes (float64)
- *thread_over_18*: Whether the thread is marked as over-18 (adult content) or not (Object)
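
Keeping only a subset of columns like the list above can be done while reading the file. The following is a minimal sketch on a toy in-memory CSV, not the actual extraction script: the sample rows and the extra `thread_author` column are made up for illustration.

```python
import io
import pandas as pd

# Toy stand-in for the full export; "thread_author" is a hypothetical
# extra column, used only to show that unlisted columns are skipped.
raw_csv = io.StringIO(
    "thread_id,thread_title,thread_author,thread_score\n"
    "abc123,A title,someone,10\n"
    "def456,Another title,someone_else,25\n"
)

keep_columns = ["thread_id", "thread_title", "thread_score"]

# usecols makes read_csv load only the listed columns.
df_subset = pd.read_csv(raw_csv, usecols=keep_columns)
print(df_subset.columns.tolist())
```

Filtering at read time avoids loading columns that are dropped immediately afterwards.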


In [91]:
import pandas as pd
import matplotlib.pyplot as plt
import datetime

# pd cosmetics
pd.set_option('display.max_colwidth', 3000)
pd.set_option('display.max_rows', 3000)
pd.set_option('display.max_columns', 3000)
pd.set_option('display.width', 1000)


# Three files are in the *data* folder
# 1) reddit_jokes_slim.csv: All ~37200 jokes (clean and adult)
# 2) reddit_jokes_slim_clean.csv : Only clean jokes
# 3) reddit_jokes_slim_plus18.csv: only adult jokes
df_jokes_slim = pd.read_csv('./data/reddit_jokes_slim.csv')
df_jokes_slim.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37222 entries, 0 to 37221
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   thread_id            37222 non-null  object 
 1   thread_title         37222 non-null  object 
 2   thread_selftext      37216 non-null  object 
 3   thread_score         37221 non-null  object 
 4   thread_num_comments  37221 non-null  float64
 5   thread_created_utc   37221 non-null  object 
 6   thread_upvote_ratio  37220 non-null  float64
 7   thread_over_18       37220 non-null  object 
dtypes: float64(2), object(6)
memory usage: 2.3+ MB
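
The non-null counts above show that only a handful of cells are missing. One way to quantify this before dropping anything is to count missing values per column and the share of rows affected. This is a sketch on a made-up toy frame (`toy` and its values are assumptions, not data from the file).

```python
import pandas as pd

# Toy frame with the same kind of gaps as the real data (values are made up).
toy = pd.DataFrame({
    "thread_selftext": ["joke a", None, "joke c", "joke d"],
    "thread_score": ["10", "25", None, "7"],
})

# Missing cells per column.
missing_per_column = toy.isna().sum()

# Rows with at least one missing cell: these are the rows that
# dropna(how='any') would remove.
rows_with_missing = int(toy.isna().any(axis=1).sum())
share_dropped = rows_with_missing / len(toy)

print(missing_per_column)
print(f"{rows_with_missing} of {len(toy)} rows ({share_dropped:.0%}) would be dropped")
```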


In [92]:
# A few columns have a small number of empty cells. We drop those rows, since they are a very small percentage of all rows.
# thread_created_utc is a string; convert it to a timezone-aware timestamp.
# thread_score is a string; convert it to float.
df_jokes_slim.dropna(how='any',inplace=True)
df_jokes_slim.reset_index(inplace=True)
df_jokes_slim['thread_created_utc'] = pd.to_datetime(df_jokes_slim['thread_created_utc'], format='mixed', utc=True)
df_jokes_slim['thread_score'] = df_jokes_slim['thread_score'].astype(float)
# Note: drop() returns a new DataFrame and the result is not assigned here,
# so the 'index' column added by reset_index() stays in df_jokes_slim.
df_jokes_slim.drop('index',axis=1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37215 entries, 0 to 37214
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype              
---  ------               --------------  -----              
 0   index                37215 non-null  int64              
 1   thread_id            37215 non-null  object             
 2   thread_title         37215 non-null  object             
 3   thread_selftext      37215 non-null  object             
 4   thread_score         37215 non-null  float64            
 5   thread_num_comments  37215 non-null  float64            
 6   thread_created_utc   37215 non-null  datetime64[ns, UTC]
 7   thread_upvote_ratio  37215 non-null  float64            
 8   thread_over_18       37215 non-null  object             
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(4)
memory usage: 2.6+ MB
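
The `index` column in the output above is the old row index that `reset_index()` moves into the frame. If that column is not wanted, one option is to pass `drop=True`. Here is a minimal sketch on a toy frame (the values are made up), using method chaining instead of `inplace=True`.

```python
import pandas as pd

# Toy frame with one missing score (values are made up).
toy = pd.DataFrame({"thread_score": ["10", None, "7"]})

# drop=True discards the old index instead of adding it as an 'index' column.
cleaned = toy.dropna(how="any").reset_index(drop=True)
cleaned["thread_score"] = cleaned["thread_score"].astype(float)

print(cleaned.columns.tolist())
print(cleaned.index.tolist())
```

Reassigning the result, rather than calling `drop()` without assignment, keeps the stored frame and the displayed frame in sync.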


In [103]:
df_jokes_slim.head(5)

| | index | thread_id | thread_title | thread_selftext | thread_score | thread_num_comments | thread_created_utc | thread_upvote_ratio | thread_over_18 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 164tcap | The Turks invented sex | and then the Greeks improved it by removing th... | 1125.0 | 104.0 | 2023-08-29 20:25:52+00:00 | 0.87 | True |
| 1 | 1 | 164un9h | A guy goes into a store and tells the clerk, "... | \n\nThe guy, clearly offended, says, “"Well, y... | 3507.0 | 564.0 | 2023-08-29 21:14:51+00:00 | 0.87 | False |
| 2 | 2 | 16533xo | A man seeks cover in a cabin from the winter s... | Very NSFW!\n\nAllright, so this is a story fro... | 614.0 | 40.0 | 2023-08-30 03:07:13+00:00 | 0.92 | True |
| 3 | 3 | 1654dk9 | A man comes home late from the bar | He knows his wife don't like it when he drinks... | 433.0 | 16.0 | 2023-08-30 04:09:11+00:00 | 0.95 | False |
| 4 | 4 | 1655h6k | A man is sitting at a bar at closing time, com... | How is it I always get in trouble with my wife... | 648.0 | 43.0 | 2023-08-30 05:07:24+00:00 | 0.94 | False |


# Can we use thread_score and thread_upvote_ratio as a measure of joke quality?
I don't know how Reddit calculates thread_score, so to gain some insight into what it represents, I calculate the correlation of thread_score with thread_num_comments and with thread_upvote_ratio.

In [106]:

correlation1 = df_jokes_slim['thread_score'].corr(df_jokes_slim['thread_num_comments'])
correlation2 = df_jokes_slim['thread_score'].corr(df_jokes_slim['thread_upvote_ratio'])

print("Correlation between thread_score and thread_num_comments:", correlation1)
print("Correlation between thread_score and thread_upvote_ratio:", correlation2)


Correlation between thread_score and thread_num_comments: 0.7229783655802435
Correlation between thread_score and thread_upvote_ratio: 0.19715549518702158
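
The two pairwise values above can also be read off a full correlation matrix computed in one call with `DataFrame.corr`. The sketch below uses a made-up toy frame rather than the real data. The Spearman (rank) variant is an optional cross-check that is not part of the original analysis; it is included because Pearson correlation can be dominated by a few very large values.

```python
import pandas as pd

# Toy values, made up for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "thread_score": [10.0, 200.0, 35.0, 1500.0, 80.0],
    "thread_num_comments": [2.0, 40.0, 5.0, 300.0, 12.0],
    "thread_upvote_ratio": [0.80, 0.95, 0.85, 0.90, 0.88],
})

# Pearson is the default and is what Series.corr() computes above.
pearson = toy.corr(method="pearson")

# Spearman works on ranks, so it is less sensitive to extreme scores.
spearman = toy.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```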


As you can see above, there is only a weak correlation between thread_score and thread_upvote_ratio (about 0.20), but a strong correlation between thread_score and thread_num_comments (about 0.72). This suggests that the number of comments matters more than the upvote ratio for thread_score. That makes sense from Reddit's perspective: Reddit wants to surface high-engagement threads, and a comment naturally signals more engagement than an up/down vote.

However, that is not as useful for our purpose, which is to study and measure the funniness of the jokes. The number of comments, and with it the thread score, measures the engagement of the thread, which can be the result of many factors other than the quality of the joke. In summary, we can't use thread_score alone as a measure of how funny a joke is.
Let's look at 