# **Exploration of ~37,200 jokes from the Reddit r/Jokes dataset**
Extracted and simplified from the [Kaggle](https://www.kaggle.com/datasets/bwandowando/reddit-rjokes-dataset) dataset.
I kept only the subset of columns that I was interested in for my analysis.
Here is the list of columns that I kept:
- *thread_id*: unique id of the thread containing the joke (Object)
- *thread_title*: title of the thread (Object) *sometimes this title contains start of the joke*
- *thread_selftext*: Text of the thread which includes the joke (Object)
- *thread_score*: Score of the thread (Object) *Upvotes - downvotes*
- *thread_num_comments*: Number of comments in the thread (float64)
- *thread_created_utc*: Time of the thread creation in UTC (Object)
- *thread_upvote_ratio*: Share of all votes that are upvotes, between 0 and 1 (float64)
- *thread_over_18*: Whether the thread is marked as over 18 (adult content) or not (Object)


In [91]:
import pandas as pd
import matplotlib.pyplot as plt
import datetime

# pd cosmetics
pd.set_option('display.max_colwidth', 3000)
pd.set_option('display.max_rows', 3000)
pd.set_option('display.max_columns', 3000)
pd.set_option('display.width', 1000)


# Three files are in the *data* folder
# 1) reddit_jokes_slim.csv: All ~37200 jokes (clean and adult)
# 2) reddit_jokes_slim_clean.csv : Only clean jokes
# 3) reddit_jokes_slim_plus18.csv: only adult jokes
df_jokes_slim = pd.read_csv('./data/reddit_jokes_slim.csv')
df_jokes_slim.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37222 entries, 0 to 37221
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   thread_id            37222 non-null  object 
 1   thread_title         37222 non-null  object 
 2   thread_selftext      37216 non-null  object 
 3   thread_score         37221 non-null  object 
 4   thread_num_comments  37221 non-null  float64
 5   thread_created_utc   37221 non-null  object 
 6   thread_upvote_ratio  37220 non-null  float64
 7   thread_over_18       37220 non-null  object 
dtypes: float64(2), object(6)
memory usage: 2.3+ MB
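The non-null counts above show that a few columns have a handful of empty cells. A minimal sketch of how to count them directly, assuming a small made-up frame (`toy`) in place of `df_jokes_slim`; the rows and values are hypothetical and only illustrate the calls:

```python
import pandas as pd

# Hypothetical toy frame standing in for df_jokes_slim (values are made up)
toy = pd.DataFrame({
    "thread_id": ["a1", "b2", "c3", "d4"],
    "thread_selftext": ["joke one", None, "joke three", "joke four"],
    "thread_score": ["12", "0", None, "5"],
})

# Number of missing cells in each column
missing_per_column = toy.isna().sum()

# Number of rows that dropna(how='any') would remove
rows_with_missing = int(toy.isna().any(axis=1).sum())
share_missing = rows_with_missing / len(toy)

print(missing_per_column)
print(f"{rows_with_missing} of {len(toy)} rows have a missing value ({share_missing:.1%})")
```

Running the same three lines on `df_jokes_slim` before dropping anything should confirm how small the affected share is.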


In [92]:
# A few columns have a small number of empty cells. We can drop those rows because they are a tiny fraction of the data.
# thread_created_utc is a string, so I convert it to a timestamp.
# thread_score is a string, so I convert it to float.
df_jokes_slim.dropna(how='any', inplace=True)
df_jokes_slim.reset_index(inplace=True)
df_jokes_slim['thread_created_utc'] = pd.to_datetime(df_jokes_slim['thread_created_utc'], format='mixed', utc=True)
df_jokes_slim['thread_score'] = df_jokes_slim['thread_score'].astype(float)
# Note: drop() returns a new DataFrame and the result is not assigned here,
# so the 'index' column created by reset_index() stays in df_jokes_slim.
df_jokes_slim.drop('index', axis=1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37215 entries, 0 to 37214
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype              
---  ------               --------------  -----              
 0   index                37215 non-null  int64              
 1   thread_id            37215 non-null  object             
 2   thread_title         37215 non-null  object             
 3   thread_selftext      37215 non-null  object             
 4   thread_score         37215 non-null  float64            
 5   thread_num_comments  37215 non-null  float64            
 6   thread_created_utc   37215 non-null  datetime64[ns, UTC]
 7   thread_upvote_ratio  37215 non-null  float64            
 8   thread_over_18       37215 non-null  object             
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(4)
memory usage: 2.6+ MB
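The output above still lists an `index` column: `reset_index()` keeps the old index as a column, and the result of `drop()` is not assigned back. One possible way to avoid the extra column is `reset_index(drop=True)`. The sketch below assumes a tiny made-up frame (`raw`) with consistently formatted timestamps, so it does not need `format='mixed'`; it is an illustration, not the cleaning step actually used in this notebook:

```python
import pandas as pd

# Hypothetical toy rows standing in for reddit_jokes_slim.csv (values are made up)
raw = pd.DataFrame({
    "thread_id": ["a1", "b2", "c3"],
    "thread_score": ["12", None, "5"],
    "thread_created_utc": [
        "2023-01-01 10:00:00",
        "2023-01-02 11:30:00",
        "2023-01-03 09:15:00",
    ],
})

# Drop incomplete rows and renumber from 0 without keeping the old index
clean = raw.dropna(how="any").reset_index(drop=True)

# Convert the string columns to proper dtypes
clean["thread_created_utc"] = pd.to_datetime(clean["thread_created_utc"], utc=True)
clean["thread_score"] = clean["thread_score"].astype(float)

clean.info()
```

With this variant `describe()` would no longer report statistics for a meaningless `index` column.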


In [99]:
df_jokes_slim.describe()

              index  thread_score  thread_num_comments  thread_upvote_ratio
count       37215.0       37215.0              37215.0              37215.0
mean   18610.802741    404.407013            22.674271             0.686012
std    10744.863794    1953.16311            99.501791             0.226151
min             0.0           0.0                  0.0                 0.03
25%          9306.5           0.0                  2.0                  0.5
50%         18611.0           5.0                  4.0                 0.74
75%         27914.5          42.0                  9.0                 0.88
max         37221.0       53635.0               8011.0                  1.0
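In the summary above, the mean `thread_score` (about 404) is far larger than the median (5) and even the 75th percentile (42), which suggests a heavily right-skewed distribution where a few very popular threads pull the mean up. A small sketch of that effect, using made-up scores rather than the real column:

```python
import pandas as pd

# Made-up scores: mostly small values plus one viral thread
scores = pd.Series([0, 0, 1, 5, 5, 8, 42, 50000], name="thread_score")

mean_score = scores.mean()
median_score = scores.median()

# Fraction of threads that score below the mean
share_below_mean = (scores < mean_score).mean()

print(f"mean={mean_score}, median={median_score}, below mean={share_below_mean:.1%}")
```

For data shaped like this, the median and quantiles describe a typical thread better than the mean does.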

