# **Exploration of ~37200 Reddit r/jokes dataset.**
Extracted and simplified from [Kaggle](https://www.kaggle.com/datasets/bwandowando/reddit-rjokes-dataset) dataset.
I chose subset of columns that I was interested in my analysis.
Here are the list of columns that I kept:
- *thread_id*: unique id of the thread containing the joke (Object)
- *thread_title*: title of the thread (Object) *sometimes this title contains start of the joke*
- *thread_selftext*: Text of the thread which includes the joke (Object)
- *thread_score*: This score between 0 and 1 (Object) *I am not sure yet how it is calculated*
- *thread_num_comments*: Number of comments in the thread (float64)
- *thread_created_utc*: Time of the thread creation in UTC (Object)
- *thread_upvote_ratio*: Ratio of upvotes to downvotes (float64)
- *thread_over_18*: Whether the thread is over 18 or not (Object)


In [69]:
import pandas as pd
import matplotlib.pyplot as plt
import datetime


# Three files are in the *data* folder
# 1) reddit_jokes_slim.csv: All ~37200 jokes (clean and adult)
# 2) reddit_jokes_slim_clean.csv : Only clean jokes
# 3) reddit_jokes_slim_plus18.csv: only adult jokes
df_jokes_slim = pd.read_csv('./data/reddit_jokes_slim.csv')
df_jokes_slim.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37222 entries, 0 to 37221
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   thread_id            37222 non-null  object 
 1   thread_title         37222 non-null  object 
 2   thread_selftext      37216 non-null  object 
 3   thread_score         37221 non-null  object 
 4   thread_num_comments  37221 non-null  float64
 5   thread_created_utc   37221 non-null  object 
 6   thread_upvote_ratio  37220 non-null  float64
 7   thread_over_18       37220 non-null  object 
dtypes: float64(2), object(6)
memory usage: 2.3+ MB


In [73]:
# some columns have small number of empty cells. We can drop them as they are very small percentage of all rows.
# thread_created_utc is string and I change it to timestamp
# thread_score is string, I change it to float.
df_jokes_slim.dropna(how='any')
df_jokes_slim.reset_index(inplace=True)

df_jokes_slim['thread_created_utc'] = pd.to_datetime(df_jokes_slim['thread_created_utc'], format='mixed', utc=True)
df_jokes_slim['thread_score'] = df_jokes_slim['thread_score'].astype(float)
df_jokes_slim.set_index('thread_id',inplace=True)
df_jokes_slim.info()

<class 'pandas.core.frame.DataFrame'>
Index: 37215 entries, 164tcap to 10pzucr
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype              
---  ------               --------------  -----              
 0   index                37215 non-null  int64              
 1   thread_title         37215 non-null  object             
 2   thread_selftext      37215 non-null  object             
 3   thread_score         37215 non-null  float64            
 4   thread_num_comments  37215 non-null  float64            
 5   thread_created_utc   37215 non-null  datetime64[ns, UTC]
 6   thread_upvote_ratio  37215 non-null  float64            
 7   thread_over_18       37215 non-null  object             
dtypes: datetime64[ns, UTC](1), float64(3), int64(1), object(3)
memory usage: 2.6+ MB


In [71]:
df_jokes_slim.head()


Unnamed: 0_level_0,index,thread_title,thread_selftext,thread_score,thread_num_comments,thread_created_utc,thread_upvote_ratio,thread_over_18
thread_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
164tcap,0,The Turks invented sex,and then the Greeks improved it by removing th...,1125.0,104.0,2023-08-29 20:25:52+00:00,0.87,True
164un9h,1,"A guy goes into a store and tells the clerk, ""...","\n\nThe guy, clearly offended, says, “""Well, y...",3507.0,564.0,2023-08-29 21:14:51+00:00,0.87,False
16533xo,2,A man seeks cover in a cabin from the winter s...,"Very NSFW!\n\nAllright, so this is a story fro...",614.0,40.0,2023-08-30 03:07:13+00:00,0.92,True
1654dk9,3,A man comes home late from the bar,He knows his wife don't like it when he drinks...,433.0,16.0,2023-08-30 04:09:11+00:00,0.95,False
1655h6k,4,"A man is sitting at a bar at closing time, com...",How is it I always get in trouble with my wife...,648.0,43.0,2023-08-30 05:07:24+00:00,0.94,False


In [72]:
df_jokes_slim.drop('index', axis=1, inplace=True)
df_jokes_slim.set_index('thread_id', inplace=True)
df_jokes_slim.info()


thread_id
164tcap    1
164un9h    1
16533xo    2
1654dk9    2
1655h6k    2
          ..
10pworf    1
10py2jx    1
10py87c    1
10pynwm    1
10pzucr    1
Name: thread_created_utc, Length: 37215, dtype: int32