# Data Wrangling

## Problem Statements

> 1. Develop a classification model that predicts whether a Reddit post belongs to r/LifeProTips or r/Lifehacks, based on the content of the post
>    - Furthermore, optimize accuracy and precision

> 2. Identify the top 15 keywords that distinguish r/LifeProTips from r/Lifehacks

> 3. Determine the most frequent content posted in each subreddit, and provide a recommendation appropriate for new Reddit users

> 4. Provide insights on what makes a post most popular, or unpopular, in each subreddit
>    - Based on self-texts only

## Imports

In [155]:
# Run the helper notebook so that api_call is defined in this namespace
%run 00_Workflow_Functions.ipynb

In [150]:
import pandas as pd
import requests
from collections import defaultdict

## Preliminaries

In [4]:
# lifehacks
lhs_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=lifehacks&metadata=true&size=0"
lht_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=lifehacks&metadata=true&size=0&is_self=true"

# lifeprotips
lpts_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=LifeProTips&metadata=true&size=0"
lptt_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=LifeProTips&metadata=true&size=0&is_self=true"
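The four URLs above differ only in the subreddit name and the optional `is_self` flag, so they could also be generated from a parameter dictionary. A minimal sketch, assuming a hypothetical helper named `build_count_url` (not part of the original notebook) and the same endpoint and query parameters as the hard-coded strings:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"

def build_count_url(subreddit, self_only=False):
    """Hypothetical helper: build a URL that returns only count metadata."""
    # size=0 asks for no posts; metadata=true asks for the total_results count
    params = {"subreddit": subreddit, "metadata": "true", "size": 0}
    if self_only:
        # Restrict the count to self-text posts
        params["is_self"] = "true"
    return f"{BASE_URL}?{urlencode(params)}"

lhs_url = build_count_url("lifehacks")
lht_url = build_count_url("lifehacks", self_only=True)
lpts_url = build_count_url("LifeProTips")
lptt_url = build_count_url("LifeProTips", self_only=True)
```

Building the query string from a dictionary keeps the parameter names in one place, which may help if more filters are added later.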

In [5]:
res_lhs = requests.get(lhs_url)
res_lht = requests.get(lht_url)
res_lpts = requests.get(lpts_url)
res_lptt = requests.get(lptt_url)

print(res_lhs.status_code, res_lht.status_code)
print(res_lpts.status_code, res_lptt.status_code)

200 200
200 200


All four requests succeeded (HTTP 200).

In [6]:
lhs_count = res_lhs.json()['metadata']['total_results']
lht_count = res_lht.json()['metadata']['total_results']

lpts_count = res_lpts.json()['metadata']['total_results']
lptt_count = res_lptt.json()['metadata']['total_results']

print(f"LifeHacks Total Submissions: {lhs_count}\nLifeHacks Total Self-Text Posts: {lht_count}")
print(f"LifeProTips Total Submissions: {lpts_count}\nLifeProTips Total Self-Text Posts: {lptt_count}")

LifeHacks Total Submissions: 81787
LifeHacks Total Self-Text Posts: 23266
LifeProTips Total Submissions: 555477
LifeProTips Total Self-Text Posts: 534475
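Because the fourth problem statement relies on self-texts only, it may help to see how much of each subreddit is self-text. The sketch below simply divides the counts printed above; the dictionary layout and variable names are illustrative and not taken from the notebook.

```python
# Counts copied from the output above
counts = {
    "lifehacks": {"submissions": 81787, "self_text": 23266},
    "LifeProTips": {"submissions": 555477, "self_text": 534475},
}

# Fraction of each subreddit's submissions that are self-text posts
shares = {
    name: c["self_text"] / c["submissions"] for name, c in counts.items()
}

for name, share in shares.items():
    print(f"{name}: {share:.1%} of submissions are self-text posts")
```

With these counts, roughly 28% of r/lifehacks submissions are self-text posts, against roughly 96% for r/LifeProTips, so the two subreddits are imbalanced both in size and in post type.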


## Wrangling Data - User Posts

In [159]:
lf_call = api_call('lifehacks')

In [160]:
len(lf_call)

25

In [161]:
# data keys
lf_call[0].keys()

dict_keys(['all_awardings', 'allow_live_comments', 'author', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_text', 'author_flair_type', 'author_fullname', 'author_is_blocked', 'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post', 'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id', 'is_created_from_ads_ui', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'link_flair_background_color', 'link_flair_richtext', 'link_flair_text_color', 'link_flair_type', 'locked', 'media', 'media_embed', 'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'post_hint', 'preview', 'pwls', 'removed_by_category', 'retrieved_on', 'score', 'secure_media', 'secure_media_embed', 'selftext', 'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'thum

In [126]:
# metadata keys
# lh_sub_json['metadata'].keys()

In [127]:
# lh_sub_json['data'][63]

In [39]:
#lh_sub_json['metadata']

In [59]:
#lh_sub_json['data'][0]['author']

In [128]:
# Content we care about:
keys = ['author', 'author_fullname', 'created_utc', 'selftext', 'title', 'subreddit', 'is_video', 'num_comments', 'score', 'upvote_ratio']

In [133]:
lh_data = defaultdict(list)

In [134]:
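# lh_sub_json is assumed to hold the parsed JSON of a submission search response
for i, post in enumerate(lh_sub_json['data']):
    for key in keys:
        try:
            lh_data[key].append(post[key])
        except KeyError:
            # Record a placeholder so every column keeps the same length
            print(f"Error on index: {i}\nkey \"{key}\" not found.")
            lh_data[key].append(None)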
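The same column-wise collection can be written with `dict.get`, which returns `None` for a missing key and so avoids the `try`/`except`. This is a sketch on made-up records, not the notebook's data; `collect_fields`, `sample_posts`, and `wanted_keys` are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# Illustrative records only; real posts carry many more fields
sample_posts = [
    {"author": "user_a", "title": "First tip", "score": 1, "upvote_ratio": 1.0},
    {"author": "user_b", "title": "Second tip", "score": 3},  # no upvote_ratio
]
wanted_keys = ["author", "title", "score", "upvote_ratio"]

def collect_fields(posts, keys):
    """Gather the selected keys into equal-length column lists."""
    columns = defaultdict(list)
    for post in posts:
        for key in keys:
            # .get returns None when the key is absent, keeping columns aligned
            columns[key].append(post.get(key))
    return dict(columns)

columns = collect_fields(sample_posts, wanted_keys)
print(columns["upvote_ratio"])  # [1.0, None]
```

The resulting dictionary of lists can be passed to `pd.DataFrame` in the same way as `lh_data`. The trade-off is that missing keys are no longer reported, so the explicit `try`/`except` version remains useful while exploring the data.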

In [135]:
pd.DataFrame(lh_data).tail()

| | author | author_fullname | created_utc | selftext | title | subreddit | is_video | num_comments | score | upvote_ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| 95 | MustacheMufasa | t2_m6py7kop | 1650821565 | About 2/3 of ovens made in the past ten years ... | Hack your oven to use it as an air fryer | lifehacks | False | 0 | 1 | 1.0 |
| 96 | cableguysmith | t2_b3xrj | 1650820039 | | Store mushrooms in a paper bag in the fridge -... | lifehacks | False | 0 | 1 | 1.0 |
| 97 | Nail6085 | t2_mab45wyg | 1650816720 | [removed] | Do you know that you can make up to $10,600 da... | lifehacks | False | 0 | 1 | 1.0 |
| 98 | ractacsac | t2_3ejy81pe | 1650816225 | | just a tip for y'all | lifehacks | False | 0 | 1 | 1.0 |
| 99 | JaneStudioTV1 | t2_jy6lfsg7 | 1650813489 | | 🎯 ICE MAKER FOR SODA COLA MACHINE STREAM SPARK... | lifehacks | False | 0 | 1 | 1.0 |


In [139]:
data = pd.DataFrame(lh_data)
data.tail()

| | author | author_fullname | created_utc | selftext | title | subreddit | is_video | num_comments | score | upvote_ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| 95 | MustacheMufasa | t2_m6py7kop | 1650821565 | About 2/3 of ovens made in the past ten years ... | Hack your oven to use it as an air fryer | lifehacks | False | 0 | 1 | 1.0 |
| 96 | cableguysmith | t2_b3xrj | 1650820039 | | Store mushrooms in a paper bag in the fridge -... | lifehacks | False | 0 | 1 | 1.0 |
| 97 | Nail6085 | t2_mab45wyg | 1650816720 | [removed] | Do you know that you can make up to $10,600 da... | lifehacks | False | 0 | 1 | 1.0 |
| 98 | ractacsac | t2_3ejy81pe | 1650816225 | | just a tip for y'all | lifehacks | False | 0 | 1 | 1.0 |
| 99 | JaneStudioTV1 | t2_jy6lfsg7 | 1650813489 | | 🎯 ICE MAKER FOR SODA COLA MACHINE STREAM SPARK... | lifehacks | False | 0 | 1 | 1.0 |


In [144]:
# Timestamp of the last (oldest) post in this batch
last_utc = data['created_utc'].iloc[-1]
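A timestamp like `last_utc` is typically used as a cursor: the next request asks only for posts older than the last one already collected. The sketch below shows that loop under stated assumptions. `collect_posts` and `fetch_page` are hypothetical names, and `fetch_page(before)` stands in for whatever request wrapper is used (for example, one that passes the value as a `before` query parameter). An in-memory fake replaces the network call so the example runs offline.

```python
def collect_posts(fetch_page, n_pages, before=None):
    """Walk backwards in time, requesting one page per iteration.

    fetch_page(before) is a hypothetical callable that returns a list of
    post dicts strictly older than `before`, newest first.
    """
    posts = []
    for _ in range(n_pages):
        page = fetch_page(before)
        if not page:
            break  # nothing older is left
        posts.extend(page)
        # The oldest post on this page becomes the cursor for the next call
        before = page[-1]["created_utc"]
    return posts

# Offline stand-in for the API: ten fake posts, newest first
fake_archive = [{"id": i, "created_utc": 1000 - i} for i in range(10)]

def fake_fetch(before, size=4):
    older = [p for p in fake_archive
             if before is None or p["created_utc"] < before]
    return older[:size]

posts = collect_posts(fake_fetch, n_pages=5)
print(len(posts))  # 10
```

Using a strict "older than" comparison avoids collecting the boundary post twice, although it can skip posts that share the exact same timestamp as the cursor.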