## Exploratory Data Analysis

After collecting a sample of ~1000 posts in JSON format, we have enough data to identify what we want to analyse from Reddit posts. The aim of this step is to identify which columns are interesting to analyze, and the criterion to follow is simple:

### "Feature selection based on the number of unique values"


Columns where all values are the same are not interesting for any kind of analysis, although they are helpful to Reddit itself.
The purpose of this step is not yet to create a production transformation pipeline, but just to give you an idea of what is interesting to analyze so that we can select these columns in our production environment in Spark.
Besides, I'll be considering specific columns for my analysis, but feel free to choose any other columns for your project.

A sample of the extracted posts is available in this repo as a [raw JSON file](reddit-posts.json) so you can run this script and explore other variables.
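
To make the selection criterion concrete before running it on the real sample, here is a minimal sketch on a toy DataFrame. The values are invented for illustration, and `toy_df`, `unique_counts` and `interesting_columns` are hypothetical names that are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values) mimicking three Reddit post fields
toy_df = pd.DataFrame(
    {
        "subreddit_type": ["public", "public", "public"],  # constant: nothing to analyse
        "num_comments": [0, 1, 5],                          # varies: worth keeping
        "title": ["a", "b", "c"],                           # varies: worth keeping
    }
)

# Count distinct values per column; dropna=False counts NaN as a value of its own
unique_counts = toy_df.nunique(dropna=False)

# Keep the columns with more than one distinct value
interesting_columns = unique_counts[unique_counts > 1].index.tolist()
print(interesting_columns)  # ['num_comments', 'title']
```

The notebook below applies the same idea column by column with `Series.unique()`.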

In [2]:
# libraries
import pandas as pd

In [3]:
# functions

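def check_unique_values_in_column(column):
    '''Return True if the column has more than one unique value, False otherwise.

    Columns holding unhashable values (nested lists or dicts) make unique()
    raise an exception; they are reported as False and are therefore filtered out.'''
    try:
        unique_values = column.unique()
        return len(unique_values) > 1
    except Exception:
        return False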
    
def drop_same_columns(df):
    '''Drop, in place, columns that hold the same data as an earlier column under a different name'''
    columns = df.columns.tolist()
    columns_to_drop = []
    # Iterate through each column by index
    for i in range(len(columns)):
        col1 = columns[i]
        for j in range(i + 1, len(columns)):  # Compare with columns after the current one
            col2 = columns[j]
            # Skip columns already marked, so each duplicate is listed only once
            if col2 not in columns_to_drop and df[col1].equals(df[col2]):
                columns_to_drop.append(col2)

    # drop(..., inplace=True) returns None, so modify df first and then return it
    df.drop(columns=columns_to_drop, inplace=True)
    return df
            

In [4]:
# Read JSON file as a pandas DataFrame
df = pd.read_json("reddit-posts.json")

In [5]:
# Quick exploration
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
df.head(n=20)

Unnamed: 0,comment_limit,comment_sort,approved_at_utc,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,is_created_from_ads_ui,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,author_is_blocked,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,_fetched,_additional_fetch_params,_comments_by_id,link_flair_template_id,author_cakeday
0,2048,confidence,,,t2_dp4v5ppv4,False,,0,False,What is 70 Years too young for ?,[],r/AskReddit,False,6,,0,,True,t3_16ai7q2,False,dark,1.0,,public,1,0,{},,False,[],,False,False,,{},,False,1,,False,False,,0,,[],{},,True,,1693901670,text,6,,,text,self.AskReddit,False,,,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qh1i,False,,,,16ai7q2,True,,,0,True,all_ads,False,[],False,,/r/AskReddit/comments/16ai7q2/what_is_70_years_too_young_for/,all_ads,False,https://www.reddit.com/r/AskReddit/comments/16ai7q2/what_is_70_years_too_young_for/,42866556,1693901670,0,,False,False,{},{},,
1,2048,confidence,,,t2_buyarvi24,False,,0,False,What was being a teen in the 80s like?,[],r/AskReddit,False,6,,0,,True,t3_16ai6wk,False,dark,1.0,,public,1,0,{},,False,[],,False,False,,{},,False,1,,False,False,,0,,[],{},,True,,1693901584,text,6,,,text,self.AskReddit,False,,,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qh1i,False,,,,16ai6wk,True,,,0,True,all_ads,False,[],False,,/r/AskReddit/comments/16ai6wk/what_was_being_a_teen_in_the_80s_like/,all_ads,False,https://www.reddit.com/r/AskReddit/comments/16ai6wk/what_was_being_a_teen_in_the_80s_like/,42866556,1693901584,0,,False,False,{},{},,
2,2048,confidence,,,t2_50j8n3pr,False,,0,False,How do you know mermaids are real?,[],r/AskReddit,False,6,,0,,True,t3_16ai6r2,False,dark,1.0,,public,1,0,{},,False,[],,False,False,,{},,False,1,,False,False,,0,,[],{},,True,,1693901569,text,6,,,text,self.AskReddit,False,,,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qh1i,False,,,,16ai6r2,True,,,1,True,all_ads,False,[],False,,/r/AskReddit/comments/16ai6r2/how_do_you_know_mermaids_are_real/,all_ads,False,https://www.reddit.com/r/AskReddit/comments/16ai6r2/how_do_you_know_mermaids_are_real/,42866556,1693901569,0,,False,False,{},{},,
3,2048,confidence,,,t2_6e362qnc,False,,0,False,What's the biggest challenge you face when trying to improve your mental health and find happiness? What has hindered you or held you back?,[],r/AskReddit,False,6,,0,,True,t3_16ai6he,False,dark,1.0,,public,1,0,{},,False,[],,False,False,,{},,False,1,,False,False,,0,,[],{},,True,,1693901540,text,6,,,text,self.AskReddit,False,,,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qh1i,False,,,,16ai6he,True,,,0,True,all_ads,False,[],False,,/r/AskReddit/comments/16ai6he/whats_the_biggest_challenge_you_face_when_trying/,all_ads,False,https://www.reddit.com/r/AskReddit/comments/16ai6he/whats_the_biggest_challenge_you_face_when_trying/,42866556,1693901540,0,,False,False,{},{},,
4,2048,confidence,,,t2_hwl3f6bt9,False,,0,False,What is the most unexpected thing you've found in an old book or hidden in the walls of a house?,[],r/AskReddit,False,6,,0,,True,t3_16ai61f,False,dark,1.0,,public,1,0,{},,False,[],,False,False,,{},,False,1,,False,False,,0,,[],{},,True,,1693901494,text,6,,,text,self.AskReddit,False,,,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qh1i,False,,,,16ai61f,True,,,1,True,all_ads,False,[],False,,/r/AskReddit/comments/16ai61f/what_is_the_most_unexpected_thing_youve_found_in/,all_ads,False,https://www.reddit.com/r/AskReddit/comments/16ai61f/what_is_the_most_unexpected_thing_youve_found_in/,42866556,1693901494,0,,False,False,{},{},,
5,2048,confidence,,,t2_j29khopik,False,,0,False,What age should people be allowed to drink alcohol in your opinion?,[],r/AskReddit,False,6,,0,,True,t3_16ai5tl,False,dark,1.0,,public,1,0,{},,False,[],,False,False,,{},,False,1,,False,False,,0,,[],{},,True,,1693901473,text,6,,,text,self.AskReddit,False,,,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qh1i,False,,,,16ai5tl,True,,,5,True,all_ads,False,[],False,,/r/AskReddit/comments/16ai5tl/what_age_should_people_be_allowed_to_drink/,all_ads,False,https://www.reddit.com/r/AskReddit/comments/16ai5tl/what_age_should_people_be_allowed_to_drink/,42866556,1693901473,0,,False,False,{},{},,
6,2048,confidence,,,t2_oesovuo2,False,,0,False,"teachers of reddit, have you ever had sexual/romantic thoughts about a student? if so, what did you do?",[],r/AskReddit,False,6,,0,,True,t3_16ai5lh,False,dark,0.5,,public,0,0,{},,False,[],,False,False,,{},,False,0,,False,False,,0,,[],{},,True,,1693901451,text,6,,,text,self.AskReddit,False,,,,,,False,True,False,False,True,[],[],False,False,False,False,,[],False,,,,t5_2qh1i,False,,,,16ai5lh,True,,,0,True,all_ads,False,[],False,,/r/AskReddit/comments/16ai5lh/teachers_of_reddit_have_you_ever_had/,all_ads,False,https://www.reddit.com/r/AskReddit/comments/16ai5lh/teachers_of_reddit_have_you_ever_had/,42866556,1693901451,0,,False,False,{},{},,
7,2048,confidence,,,t2_6e362qnc,False,,0,False,[Serious] What's the biggest challenge you face when trying to improve your mental health and find happiness? What has hindered you or held you back?,"[{'e': 'text', 't': 'Serious Replies Only'}]",r/AskReddit,False,6,,0,,True,t3_16ai5kn,False,light,1.0,,public,1,0,{},,False,[],,False,False,,{},Serious Replies Only,False,1,,False,False,,0,,[],{},,True,,1693901449,richtext,6,,,text,self.AskReddit,False,,,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qh1i,False,,,#99c160,16ai5kn,True,,,1,True,all_ads,False,[],False,,/r/AskReddit/comments/16ai5kn/serious_whats_the_biggest_challenge_you_face_when/,all_ads,False,https://www.reddit.com/r/AskReddit/comments/16ai5kn/serious_whats_the_biggest_challenge_you_face_when/,42866556,1693901449,0,,False,False,{},{},54ea6bda-dcf0-11e2-9548-12313b0c8c59,
8,2048,confidence,,,t2_i4n0j8jlt,False,,0,False,"Do you have a secret fetish, and what is it?",[],r/AskReddit,False,6,,0,,True,t3_16ai5hz,False,dark,1.0,,public,1,0,{},,False,[],,False,False,,{},,False,1,,False,False,,0,,[],{},,True,,1693901442,text,6,,,text,self.AskReddit,False,,,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qh1i,False,,,,16ai5hz,True,,,2,True,all_ads,False,[],False,,/r/AskReddit/comments/16ai5hz/do_you_have_a_secret_fetish_and_what_is_it/,all_ads,False,https://www.reddit.com/r/AskReddit/comments/16ai5hz/do_you_have_a_secret_fetish_and_what_is_it/,42866556,1693901442,0,,False,False,{},{},,
9,2048,confidence,,,t2_lzq5kmwx,False,,0,False,What's a movie series that started off great and then hit rock bottom?,[],r/AskReddit,False,6,,0,,True,t3_16ai5g3,False,dark,1.0,,public,1,0,{},,False,[],,False,False,,{},,False,1,,False,False,,0,,[],{},,True,,1693901437,text,6,,,text,self.AskReddit,False,,,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qh1i,False,,,,16ai5g3,True,,,2,True,all_ads,False,[],False,,/r/AskReddit/comments/16ai5g3/whats_a_movie_series_that_started_off_great_and/,all_ads,False,https://www.reddit.com/r/AskReddit/comments/16ai5g3/whats_a_movie_series_that_started_off_great_and/,42866556,1693901437,0,,False,False,{},{},,


As expected, a quick exploration of the dataset already shows that we have:
  1. Columns that have the same content under different names, e.g. `created` vs `created_utc`
  2. Columns that seem to hold the same value for every post ID

Let's now identify which columns are interesting for our analysis.

In [6]:
# Generate a list of columns that have more than 1 unique value.
# Columns with unhashable values (nested lists or dicts) make unique() fail, so they are excluded as well.
columns_to_keep = [column for column in df.columns if check_unique_values_in_column(df[column]) is True]

# Filter the dataset with only interesting columns to be analyzed:
reddit_df_filtered = df[columns_to_keep]
reddit_df_filtered.head()

Unnamed: 0,author_fullname,title,hide_score,name,link_flair_text_color,upvote_ratio,ups,total_awards_received,link_flair_text,score,author_premium,edited,created,link_flair_type,no_follow,over_18,link_flair_background_color,id,num_comments,send_replies,permalink,url,subreddit_subscribers,created_utc,link_flair_template_id,author_cakeday
0,t2_dp4v5ppv4,What is 70 Years too young for ?,True,t3_16ai7q2,dark,1.0,1,0,,1,False,0,1693901670,text,True,False,,16ai7q2,0,True,/r/AskReddit/comments/16ai7q2/what_is_70_years_too_young_for/,https://www.reddit.com/r/AskReddit/comments/16ai7q2/what_is_70_years_too_young_for/,42866556,1693901670,,
1,t2_buyarvi24,What was being a teen in the 80s like?,True,t3_16ai6wk,dark,1.0,1,0,,1,False,0,1693901584,text,True,False,,16ai6wk,0,True,/r/AskReddit/comments/16ai6wk/what_was_being_a_teen_in_the_80s_like/,https://www.reddit.com/r/AskReddit/comments/16ai6wk/what_was_being_a_teen_in_the_80s_like/,42866556,1693901584,,
2,t2_50j8n3pr,How do you know mermaids are real?,True,t3_16ai6r2,dark,1.0,1,0,,1,False,0,1693901569,text,True,False,,16ai6r2,1,True,/r/AskReddit/comments/16ai6r2/how_do_you_know_mermaids_are_real/,https://www.reddit.com/r/AskReddit/comments/16ai6r2/how_do_you_know_mermaids_are_real/,42866556,1693901569,,
3,t2_6e362qnc,What's the biggest challenge you face when trying to improve your mental health and find happiness? What has hindered you or held you back?,True,t3_16ai6he,dark,1.0,1,0,,1,False,0,1693901540,text,True,False,,16ai6he,0,True,/r/AskReddit/comments/16ai6he/whats_the_biggest_challenge_you_face_when_trying/,https://www.reddit.com/r/AskReddit/comments/16ai6he/whats_the_biggest_challenge_you_face_when_trying/,42866556,1693901540,,
4,t2_hwl3f6bt9,What is the most unexpected thing you've found in an old book or hidden in the walls of a house?,True,t3_16ai61f,dark,1.0,1,0,,1,False,0,1693901494,text,True,False,,16ai61f,1,True,/r/AskReddit/comments/16ai61f/what_is_the_most_unexpected_thing_youve_found_in/,https://www.reddit.com/r/AskReddit/comments/16ai61f/what_is_the_most_unexpected_thing_youve_found_in/,42866556,1693901494,,
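
The filter above leaves out columns whose values are nested lists or dicts, such as `link_flair_richtext`, because `unique()` cannot hash them. If you also want to evaluate those columns, one possible approach (an assumption, not part of the original notebook) is to compare their string representation instead. `has_multiple_values` is a hypothetical helper and the rows are toy data modelled on the sample.

```python
import pandas as pd

def has_multiple_values(column: pd.Series) -> bool:
    """Hypothetical helper: True if the column holds more than one distinct value."""
    try:
        return column.nunique(dropna=False) > 1
    except TypeError:
        # Lists and dicts are unhashable, so compare their string form instead
        return column.map(repr).nunique(dropna=False) > 1

# Toy rows: one nested column that varies and one that is constant
toy_df = pd.DataFrame(
    {
        "link_flair_richtext": [[], [{"e": "text", "t": "Serious Replies Only"}]],
        "media_embed": [{}, {}],
    }
)

nested_columns_to_keep = [c for c in toy_df.columns if has_multiple_values(toy_df[c])]
print(nested_columns_to_keep)  # ['link_flair_richtext']
```

Comparing `repr` strings is a blunt instrument (two dicts with the same items in a different order count as different), but it is usually good enough for a first look.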


Finally, we check whether some columns hold the same data under different names. We iterate over each column and check whether any of the following columns contains identical values.

In [7]:
# Work on an explicit copy so the in-place drop does not trigger a SettingWithCopyWarning
reddit_df_filtered = reddit_df_filtered.copy()
drop_same_columns(reddit_df_filtered)
reddit_df_filtered.head()



Unnamed: 0,author_fullname,title,hide_score,name,link_flair_text_color,upvote_ratio,ups,total_awards_received,link_flair_text,author_premium,edited,created,link_flair_type,no_follow,over_18,link_flair_background_color,id,num_comments,send_replies,permalink,url,subreddit_subscribers,link_flair_template_id,author_cakeday
0,t2_dp4v5ppv4,What is 70 Years too young for ?,True,t3_16ai7q2,dark,1.0,1,0,,False,0,1693901670,text,True,False,,16ai7q2,0,True,/r/AskReddit/comments/16ai7q2/what_is_70_years_too_young_for/,https://www.reddit.com/r/AskReddit/comments/16ai7q2/what_is_70_years_too_young_for/,42866556,,
1,t2_buyarvi24,What was being a teen in the 80s like?,True,t3_16ai6wk,dark,1.0,1,0,,False,0,1693901584,text,True,False,,16ai6wk,0,True,/r/AskReddit/comments/16ai6wk/what_was_being_a_teen_in_the_80s_like/,https://www.reddit.com/r/AskReddit/comments/16ai6wk/what_was_being_a_teen_in_the_80s_like/,42866556,,
2,t2_50j8n3pr,How do you know mermaids are real?,True,t3_16ai6r2,dark,1.0,1,0,,False,0,1693901569,text,True,False,,16ai6r2,1,True,/r/AskReddit/comments/16ai6r2/how_do_you_know_mermaids_are_real/,https://www.reddit.com/r/AskReddit/comments/16ai6r2/how_do_you_know_mermaids_are_real/,42866556,,
3,t2_6e362qnc,What's the biggest challenge you face when trying to improve your mental health and find happiness? What has hindered you or held you back?,True,t3_16ai6he,dark,1.0,1,0,,False,0,1693901540,text,True,False,,16ai6he,0,True,/r/AskReddit/comments/16ai6he/whats_the_biggest_challenge_you_face_when_trying/,https://www.reddit.com/r/AskReddit/comments/16ai6he/whats_the_biggest_challenge_you_face_when_trying/,42866556,,
4,t2_hwl3f6bt9,What is the most unexpected thing you've found in an old book or hidden in the walls of a house?,True,t3_16ai61f,dark,1.0,1,0,,False,0,1693901494,text,True,False,,16ai61f,1,True,/r/AskReddit/comments/16ai61f/what_is_the_most_unexpected_thing_youve_found_in/,https://www.reddit.com/r/AskReddit/comments/16ai61f/what_is_the_most_unexpected_thing_youve_found_in/,42866556,,
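
As an alternative to the nested loop, duplicated columns can also be found by transposing the DataFrame and asking pandas which rows repeat. This is only a sketch on toy data: the `created` values come from the sample, the `ups` and `score` values are invented, and `is_duplicate` and `deduplicated` are hypothetical names.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "created": [1693901670, 1693901584],
        "ups": [1, 0],
        "created_utc": [1693901670, 1693901584],  # same data as "created"
        "score": [1, 0],                           # same data as "ups"
    }
)

# Transpose so that columns become rows, then flag rows that repeat an earlier one
is_duplicate = toy_df.T.duplicated()

# Keep the first occurrence of each distinct column; this returns a new DataFrame
deduplicated = toy_df.loc[:, ~is_duplicate.to_numpy()]
print(deduplicated.columns.tolist())  # ['created', 'ups']
```

Note that transposing a DataFrame with mixed dtypes converts the values to a common type, so this shortcut may not always agree with `Series.equals`, which also compares dtypes. For a ~1000-row sample either approach is fast enough.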


In [9]:
reddit_df_filtered.columns.to_list()

['author_fullname',
 'title',
 'hide_score',
 'name',
 'link_flair_text_color',
 'upvote_ratio',
 'ups',
 'total_awards_received',
 'link_flair_text',
 'author_premium',
 'edited',
 'created',
 'link_flair_type',
 'no_follow',
 'over_18',
 'link_flair_background_color',
 'id',
 'num_comments',
 'send_replies',
 'permalink',
 'url',
 'subreddit_subscribers',
 'link_flair_template_id',
 'author_cakeday']

Thanks to this little EDA, we can now properly address Data Modelling questions such as:
  1. What are we going to measure?
  2. What should our dimensions be?
  3. What would the relationships be?
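
To illustrate how these three questions map onto the columns kept above, here is a small sketch that splits a toy set of posts into one dimension and one fact table. The split itself is an assumption made for illustration and not the final model of this project; the rows are invented, and `dim_author`, `fact_post` and `comments_by_premium` are hypothetical names.

```python
import pandas as pd

# Toy rows (hypothetical values) using column names from the filtered list above
posts = pd.DataFrame(
    {
        "id": ["p1", "p2", "p3"],
        "author_fullname": ["t2_a", "t2_b", "t2_a"],
        "author_premium": [False, True, False],
        "ups": [1, 4, 0],
        "num_comments": [0, 2, 5],
    }
)

# Dimension: one row per author, holding descriptive attributes
dim_author = (
    posts[["author_fullname", "author_premium"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Fact: one row per post, holding the measures plus the key that links to the dimension
fact_post = posts[["id", "author_fullname", "ups", "num_comments"]]

# The relationship lets us aggregate a measure by a dimension attribute
comments_by_premium = (
    fact_post.merge(dim_author, on="author_fullname")
    .groupby("author_premium")["num_comments"]
    .sum()
)
print(comments_by_premium.to_dict())  # {False: 5, True: 2}
```

Here `ups` and `num_comments` play the role of measures, the author attributes form a dimension, and `author_fullname` is the key that relates both tables.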

## Conclusion

Keep in mind that this notebook is not production code; its only aim is to give us a better understanding of the data we are going to process. We have seen which filters could be applied to the dataset in order to keep only the columns that are interesting for our project.