# Data Wrangling

## Problem Statements

> 1. Develop a classification model that predicts whether a Reddit post belongs to r/LifeProTips or r/Lifehacks, based on the content of the post
>    - Optimize the model for accuracy and precision

> 2. Identify the top 15 keywords that distinguish r/LifeProTips and r/Lifehacks

> 3. Determine the most frequent content posted in each subreddit, and provide a recommendation suited to new Reddit users

> 4. Provide insights into what makes a post popular or unpopular in each subreddit
>    - Based on self-texts only

## Imports

In [46]:
# runs the helper notebook, which defines na_only, api_call and data_wrangling
%run 00_Workflow_Functions.ipynb

In [2]:
import numpy as np  # used below for np.unique
import pandas as pd
import requests
from collections import defaultdict

## Preliminaries

In [3]:
# lifehacks
lhs_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=lifehacks&metadata=true&size=0"
lht_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=lifehacks&metadata=true&size=0&is_self=true"

# lifeprotips
lpts_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=LifeProTips&metadata=true&size=0"
lptt_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=LifeProTips&metadata=true&size=0&is_self=true"
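
The four URLs above differ only in the subreddit name and the `is_self` flag, so they could also be generated from a parameter dictionary. The following is a minimal sketch using only the standard library; the helper name `count_url` is hypothetical and not part of the original workflow.

```python
from urllib.parse import urlencode

# Endpoint copied from the URLs above.
BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def count_url(subreddit, self_only=False):
    """Build a metadata query URL with size=0, as in the URLs above."""
    params = {"subreddit": subreddit, "metadata": "true", "size": 0}
    if self_only:
        params["is_self"] = "true"  # restrict the count to self-text posts
    return f"{BASE_URL}?{urlencode(params)}"


lhs_url = count_url("lifehacks")
lht_url = count_url("lifehacks", self_only=True)
lpts_url = count_url("LifeProTips")
lptt_url = count_url("LifeProTips", self_only=True)
```

Building the query string this way keeps the parameter names in one place and avoids copy-paste differences between the four requests.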

In [4]:
res_lhs = requests.get(lhs_url)
res_lht = requests.get(lht_url)
res_lpts = requests.get(lpts_url)
res_lptt = requests.get(lptt_url)

print(res_lhs.status_code, res_lht.status_code)
print(res_lpts.status_code, res_lptt.status_code)

200 200
200 200


All requests successful.

In [5]:
lhs_count = res_lhs.json()['metadata']['total_results']
lht_count = res_lht.json()['metadata']['total_results']

lpts_count = res_lpts.json()['metadata']['total_results']
lptt_count = res_lptt.json()['metadata']['total_results']

print(f"LifeHacks Total Submissions: {lhs_count}\nLifeHacks Total Self-Text Posts: {lht_count}")
print(f"LifeProTips Total Submissions: {lpts_count}\nLifeProTips Total Self-Text Posts: {lptt_count}")

LifeHacks Total Submissions: 81826
LifeHacks Total Self-Text Posts: 23280
LifeProTips Total Submissions: 555648
LifeProTips Total Self-Text Posts: 534646
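
To put these totals in perspective, the share of self-text posts in each subreddit can be computed directly from the printed counts. This is a small worked example, not part of the original notebook; the dictionary layout is only for illustration.

```python
# Totals copied from the output above.
counts = {
    "lifehacks": {"submissions": 81826, "self_text": 23280},
    "LifeProTips": {"submissions": 555648, "self_text": 534646},
}

# Fraction of all submissions that are self-text posts, per subreddit.
self_text_share = {
    name: c["self_text"] / c["submissions"] for name, c in counts.items()
}

for name, share in self_text_share.items():
    print(f"r/{name}: {share:.1%} of submissions are self-text posts")
```

Roughly 28% of r/lifehacks submissions are self-text posts, compared with about 96% for r/LifeProTips, which matters for any analysis that relies on `selftext`.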


## Data Wrangling - r/Lifehacks

In [6]:
# Content we care about:
keys = ['author', 'author_fullname', 'created_utc', 'selftext', 'title', 'subreddit', 'is_video', 'num_comments', 'score', 'upvote_ratio']

# instantiate new dict to capture api data
lh_data = defaultdict(list)

In [7]:
# making api call
lh_call = api_call('lifehacks', 100, '1648771200')

Note that `1648771200` is the Unix timestamp for `Friday, April 1, 2022 12:00:00 AM` (UTC). We use this fixed cutoff so that the data we obtain are consistent across runs.
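
The correspondence between the timestamp and the calendar date can be checked with the standard library. This is a minimal sketch; the variable names are illustrative.

```python
from datetime import datetime, timezone

cutoff = 1648771200  # Unix timestamp passed to api_call above

# Interpret the timestamp explicitly as UTC rather than local time.
cutoff_dt = datetime.fromtimestamp(cutoff, tz=timezone.utc)
print(cutoff_dt.isoformat())  # 2022-04-01T00:00:00+00:00

# Going the other way: build the timestamp from a calendar date.
rebuilt = int(datetime(2022, 4, 1, tzinfo=timezone.utc).timestamp())
```

Passing `tz=timezone.utc` matters: without it, `fromtimestamp` converts to the machine's local time zone and the date may appear shifted.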

In [8]:
len(lh_call) # we could only request data 100 submissions at a time

100
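
The helper `api_call` is defined in `00_Workflow_Functions.ipynb` and is not shown here. The following is a hypothetical stand-in that illustrates what such a helper might do, assuming it queries the same endpoint as the Preliminaries with the `subreddit`, `size` and `before` parameters and returns the list under the response's `data` key. It uses `urllib` instead of `requests` only to stay dependency-free, and the `fetch` argument is an addition for testing.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def fetch_json(url):
    """Default fetcher: GET the URL and decode the JSON body."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def api_call(subreddit, size, before, fetch=fetch_json):
    """Hypothetical sketch, not the notebook's actual helper.

    Requests up to `size` submissions created before the Unix time `before`.
    """
    params = {"subreddit": subreddit, "size": size, "before": before}
    payload = fetch(f"{BASE_URL}?{urlencode(params)}")
    return payload["data"]  # assumed response layout: {"data": [...]}
```

Injecting `fetch` makes it possible to exercise the function with canned responses instead of live requests.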

In [9]:
# wrangling api call into a dictionary that will be used on a dataframe
data = data_wrangling(lh_data, keys, lh_call)

In [10]:
# checking if any data was not captured in the api call
data['error_log']

[]
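
`data_wrangling` also lives in the helper notebook. As a rough illustration of the pattern, the sketch below appends the selected fields of each submission to a `defaultdict(list)` and returns it together with an error log. This is an assumption about its behavior, inferred only from how it is called here; in particular, whether the real helper logs missing fields this way is not shown.

```python
from collections import defaultdict


def data_wrangling(store, keys, submissions):
    """Hypothetical sketch: collect the fields in `keys` column by column.

    `store` maps each key to a list. A field missing from a submission is
    stored as None and noted in the error log.
    """
    error_log = []
    for submission in submissions:
        for key in keys:
            if key not in submission:
                error_log.append((submission.get("id"), key))
            store[key].append(submission.get(key))
    return {"data": store, "error_log": error_log}


store = defaultdict(list)
page = [
    {"id": "a1", "title": "first", "score": 3},
    {"id": "a2", "title": "second"},  # no score field
]
result = data_wrangling(store, ["title", "score"], page)
```

Because every key receives exactly one value per submission, the columns stay the same length and the dictionary can be passed straight to `pd.DataFrame`.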

In [11]:
# api data dictionary to dataframe
df_lh = pd.DataFrame(data['data'])
df_lh.tail()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
95,Agent_Exile,t2_jsijnho7,1648637199,,true lifehack,lifehacks,False,0,1,1.0
96,satrianovian20,t2_l18xyp0p,1648636580,,useful education that will bring you closer to...,lifehacks,False,0,1,1.0
97,Scvoopy,t2_dj6kvqr3,1648631857,,Ottocast Coupon Code | 30% OFF Discount Code 2022,lifehacks,False,0,1,1.0
98,FederalBlacksmith663,t2_imeh5kjl,1648629755,[removed],Should I send a gift to my ex?,lifehacks,False,0,1,1.0
99,satrianovian20,t2_l18xyp0p,1648624604,,useful information that will bring you closer ...,lifehacks,False,0,1,1.0


To page through older posts, we take the submission time of the last (oldest) post collected so far and request the posts that predate it. Each new batch is appended to the data, and we repeat until we have all the rows we need. The process is shown below.

In [12]:
# last collected submission
last_utc = df_lh.loc[len(df_lh) - 1, 'created_utc']
last_utc

1648624604

Here we continue making API calls, each one requesting posts that predate the oldest post collected so far. We will collect at least 5000 rows of data.
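
The paging idea can be sketched independently of the API. In the example below, `fetch_page` is a hypothetical callable standing in for the API helper, and the toy data source exists only to make the sketch runnable.

```python
def collect(fetch_page, start_before, target):
    """Page backwards in time until `target` records are collected.

    `fetch_page(before)` is a hypothetical callable returning a list of dicts,
    newest first, each with a 'created_utc' older than `before`.
    """
    records = []
    before = start_before
    while len(records) < target:
        page = fetch_page(before)
        if not page:  # source exhausted: stop instead of looping forever
            break
        records.extend(page)
        before = page[-1]["created_utc"]  # oldest record becomes the new cursor
    return records


# Toy data source: one post every 10 seconds, served 3 at a time.
posts = [{"created_utc": t} for t in range(100, 0, -10)]


def fake_fetch(before, size=3):
    return [p for p in posts if p["created_utc"] < before][:size]


collected = collect(fake_fetch, start_before=101, target=7)
```

Since whole pages are appended, the result can overshoot the target (here 9 records for a target of 7), which is consistent with ending up with slightly more than 5000 rows.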

In [13]:
# continue wrangling data until a certain size is met
while len(df_lh) < 5000:
    try:
        lh_call = api_call('lifehacks', 100, last_utc)
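    except Exception as err:  # a bare except would also swallow KeyboardInterrupt
        print(f"Data wrangling failed: {err}")
        break
    if len(lh_call) == 0:  # no older posts left, so stop instead of looping forever
        break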
    
    data = data_wrangling(lh_data, keys, lh_call)
    df_lh = pd.DataFrame(data['data'])
    last_utc = df_lh.loc[len(df_lh) - 1, 'created_utc']

In [14]:
# verifying data was collected
df_lh.shape

(5020, 10)

In [15]:
df_lh.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5096 entries, 0 to 5095
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   author           5096 non-null   object 
 1   author_fullname  5054 non-null   object 
 2   created_utc      5096 non-null   int64  
 3   selftext         5088 non-null   object 
 4   title            5096 non-null   object 
 5   subreddit        5096 non-null   object 
 6   is_video         5096 non-null   bool   
 7   num_comments     5096 non-null   int64  
 8   score            5096 non-null   int64  
 9   upvote_ratio     5096 non-null   float64
dtypes: bool(1), float64(1), int64(3), object(5)
memory usage: 363.4+ KB


In [16]:
na_only(df_lh)

author_fullname    42
selftext            8
dtype: int64
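
`na_only` comes from the helper notebook. A function with similar output could look like the sketch below, assuming it simply reports null counts for the columns that have any. This is a guess at its behavior, not its actual definition.

```python
import pandas as pd


def na_only(df):
    """Hypothetical sketch: null counts for columns with at least one null."""
    counts = df.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


demo = pd.DataFrame(
    {"a": [1, None, 3], "b": ["x", "y", "z"], "c": [None, None, 1.0]}
)
missing = na_only(demo)
```

Filtering out the fully populated columns keeps the report short when only a few of many columns have gaps.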

Looks like we have some missing data. Since it's a very small amount, we will drop it now.

In [17]:
# drop NAs
df_lh = df_lh.dropna()

In [18]:
len(df_lh)  # number of rows

5054

In [19]:
# checking if our data is unique based on submission times
len(np.unique(df_lh['created_utc']))

5051

All but a few rows have distinct submission times (5051 unique timestamps across 5054 rows), which strongly suggests that our rows are unique submissions.
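
A shared timestamp alone does not prove that two rows are the same post. One way to inspect the remaining cases, sketched here on toy data with column names taken from `keys`, is to list the rows that share a timestamp and to drop only those that also match on author and title.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "author": ["ann", "bob", "ann", "cat"],
        "title": ["tip one", "tip two", "tip one", "tip three"],
        "created_utc": [100, 100, 100, 250],
    }
)

# Rows that share a timestamp with at least one other row.
same_time = demo[demo.duplicated("created_utc", keep=False)]

# Treat a row as a true duplicate only if author, title and time all match.
deduped = demo.drop_duplicates(subset=["author", "title", "created_utc"])
```

In the toy frame, three rows share a timestamp but only one of them is an exact repeat, so `deduped` keeps three of the four rows.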

In [20]:
df_lh.head()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
0,anonymousbrowzer,t2_14k10v,1648770303,,"When the smoke detector goes off from cooking,...",lifehacks,False,0,1,1.0
1,PlantBasedRedditor,t2_g4e0rfz,1648766714,,Use Goo Gone on scissors and blades to reduce ...,lifehacks,False,0,1,1.0
2,CryptographerFar5073,t2_ldjcr311,1648763992,,Bingo Bash,lifehacks,False,0,1,1.0
3,Giant_weiner_not_dog,t2_konlr4kt,1648762832,,How to troll someone,lifehacks,False,0,1,1.0
4,Giant_weiner_not_dog,t2_konlr4kt,1648762309,,what a nice way to have your meal( credit to u...,lifehacks,False,0,1,1.0


In [21]:
df_lh.tail()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
5091,burcbuluklu,t2_ion98,1638550411,,Viral School TikTok Hacks for 2021,lifehacks,False,1,1,1.0
5092,mibuikus,t2_7hga8htf,1638547006,,Like reading in bed? Clip one of these onto th...,lifehacks,False,56,1,1.0
5093,Erenetanenoron,t2_d44wvw5c,1638545953,[removed],Ensure your kids won't bother you by telling t...,lifehacks,False,0,1,1.0
5094,vensucksatlife,t2_3mmhy6o9,1638539180,Hey I have been looking for months now for a s...,Bypass campus throttling Down/Up speed on router,lifehacks,False,1,1,1.0
5095,bimboselene,t2_e8ew28q7,1638534817,,Jewish life hack! Many of us don't clean the w...,lifehacks,False,25,1,1.0


## Data Wrangling - r/LifeProTips

We use exactly the same methodology to wrangle data for `r/LifeProTips` as we did for `r/lifehacks`.

In [22]:
# instantiate new dict to capture api data
lpt_data = defaultdict(list)

In [23]:
# making api call
lpt_call = api_call('LifeProTips', 100, '1648771200')

In [24]:
len(lpt_call) # we could only request data 100 submissions at a time

100

In [25]:
# wrangling api call into a dictionary that will be used on a dataframe
data = data_wrangling(lpt_data, keys, lpt_call)

Note we are using the same keys as `r/lifehacks`.

In [26]:
# checking if any data was not captured in the api call
data['error_log']

[]

In [27]:
# api data dictionary to dataframe
df_lpt = pd.DataFrame(data['data'])
df_lpt.tail()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
95,PrinceLelouch,t2_gdxku,1648735730,[removed],Save space with your Jigsaw Puzzles by putting...,LifeProTips,False,1,1,1.0
96,bmbustamante,t2_kb04e,1648735471,[removed],LPT Request: How to learn how to re-string my ...,LifeProTips,False,1,1,1.0
97,AGirlNamedPanini,t2_shu2lou,1648735345,[removed],You can have a nice bathroom to use anywhere y...,LifeProTips,False,1,1,1.0
98,cosmicucumber,t2_942tzpp,1648734733,I finally got around to painting my nails blac...,LPT: Want more compliments as a guy? Paint you...,LifeProTips,False,1,1,1.0
99,MarkGibneyNSC,t2_cspu1pkh,1648733196,,LPT: If you’ve naturally clammy hands and you’...,LifeProTips,False,1,1,1.0


In [28]:
# last collected submission
last_utc = df_lpt.loc[len(df_lpt) - 1, 'created_utc']
last_utc

1648733196

Here we continue making API calls, each one requesting posts that predate the oldest post collected so far. We will collect at least 5000 rows of data.

In [29]:
# continue wrangling data until a certain size is met
while len(df_lpt) < 5000:
    try:
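        lpt_call = api_call('LifeProTips', 100, last_utc)
    except Exception as err:  # a bare except would also swallow KeyboardInterrupt
        print(f"Data wrangling failed: {err}")
        break
    if len(lpt_call) == 0:  # no older posts left, so stop instead of looping forever
        break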
    
    data = data_wrangling(lpt_data, keys, lpt_call)
    df_lpt = pd.DataFrame(data['data'])
    last_utc = df_lpt.loc[len(df_lpt) - 1, 'created_utc']

In [30]:
# verifying data was collected
df_lpt.shape

(5015, 10)

In [31]:
df_lpt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5015 entries, 0 to 5014
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   author           4693 non-null   object 
 1   author_fullname  4663 non-null   object 
 2   created_utc      4693 non-null   float64
 3   selftext         4672 non-null   object 
 4   title            4693 non-null   object 
 5   subreddit        4693 non-null   object 
 6   is_video         4693 non-null   object 
 7   num_comments     4693 non-null   float64
 8   score            4693 non-null   float64
 9   upvote_ratio     4693 non-null   float64
dtypes: float64(4), object(6)
memory usage: 391.9+ KB


In [32]:
na_only(df_lpt)

author_fullname    352
selftext           343
author             322
created_utc        322
title              322
subreddit          322
is_video           322
num_comments       322
score              322
upvote_ratio       322
dtype: int64

Looks like we have some missing data. Dropping the incomplete rows removes 352 of the 5015 rows (about 7%), which is a relatively small amount, so we will drop them now.

In [33]:
# drop NAs
df_lpt = df_lpt.dropna()

In [34]:
len(df_lpt)  # number of rows

4663
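
The `df_lpt.info()` output above shows `created_utc`, `num_comments` and `score` as `float64` and `is_video` as `object`, which is what pandas does when integer or boolean columns contain missing values. Once the incomplete rows are dropped, the original dtypes can be restored. The sketch below demonstrates this on toy data and is an optional step, not part of the original workflow.

```python
import pandas as pd

# Integer and boolean columns are upcast as soon as they hold a missing value.
demo = pd.DataFrame(
    {
        "created_utc": [1648771000, None, 1646172000],
        "score": [1, None, 4],
        "is_video": [False, None, True],
    }
)
print(demo.dtypes)  # created_utc and score: float64, is_video: object

# After dropping the incomplete rows, the narrower dtypes can be restored.
clean = demo.dropna().astype(
    {"created_utc": "int64", "score": "int64", "is_video": "bool"}
)
```

Restoring integer timestamps also avoids the rounded scientific-style display of `created_utc` and keeps the two subreddit frames consistent before they are merged.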

In [35]:
# checking if our data is unique based on submission times
len(np.unique(df_lpt['created_utc']))

4659

All but a few rows have distinct submission times (4659 unique timestamps across 4663 rows), which strongly suggests that our rows are unique submissions.

In [36]:
df_lpt.head()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
0,No-Software5654,t2_d3qolst6,1648771000.0,Be very careful who you are surrounding yourse...,"LPT: In a world like this, don't trust anybody.",LifeProTips,False,1.0,1.0,1.0
1,A-RareEntity,t2_4puu3g6g,1648770000.0,[removed],LPT: When you have a long drive ahead of you t...,LifeProTips,False,1.0,1.0,1.0
2,thegreatparanoia,t2_1j3hi83u,1648769000.0,[removed],"LPT: Take 2 seconds to ask ""Can you hear me"" b...",LifeProTips,False,1.0,1.0,1.0
3,PreppingKangaroo,t2_h4nwbg1s,1648769000.0,"Keep in mind, these prices are based on where ...",LPT: Always take advantage of sales on non-per...,LifeProTips,False,1.0,1.0,1.0
4,photomancottrell,t2_6e6siq1u,1648768000.0,Focus your work on the areas of your house tha...,LPT: When short on time and your house needs t...,LifeProTips,False,1.0,1.0,1.0


In [37]:
df_lpt.tail()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
4688,sscorpio77,t2_jugpp8rg,1646173000.0,[removed],LPT: You won’t have to constantly brake to avo...,LifeProTips,False,2.0,1.0,1.0
4689,CreatorVilla,t2_f1xkvrju,1646173000.0,,"LPT: If you want someone to trust you, approac...",LifeProTips,False,1.0,1.0,1.0
4690,TheNative93,t2_52ulcx1x,1646173000.0,"If they have a number on their website, or on ...",LPT Whenever you submit a resume don’t wait fo...,LifeProTips,False,1.0,1.0,1.0
4691,duskymk,t2_22umtf1d,1646173000.0,[removed],LPT: Always bring your phone with you in a pub...,LifeProTips,False,1.0,1.0,1.0
4692,shwarma_heaven,t2_ddvb4,1646172000.0,,LPT: Just because a car stops before a parking...,LifeProTips,False,1.0,1.0,1.0


-----

### Merging Data

In [38]:
# "stacking" both dataframes by row using concatenate method
df_all = pd.concat([df_lh, df_lpt], axis=0)

In [39]:
# must reset repeated indices after concatenation
df_all.reset_index(drop=True, inplace=True)
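
The concatenate-then-reset pattern can also be written as a single call, since `pd.concat` accepts `ignore_index=True`. A minimal sketch on toy frames:

```python
import pandas as pd

left = pd.DataFrame({"subreddit": ["lifehacks"] * 2, "score": [1, 2]})
right = pd.DataFrame({"subreddit": ["LifeProTips"] * 2, "score": [3, 4]})

# ignore_index=True discards the original row labels and renumbers from 0.
stacked = pd.concat([left, right], axis=0, ignore_index=True)
```

Both approaches give the same result; the single call just avoids the intermediate frame with repeated index labels.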

In [40]:
df_all.shape

(9717, 10)

In [41]:
df_all.head()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
0,anonymousbrowzer,t2_14k10v,1648770000.0,,"When the smoke detector goes off from cooking,...",lifehacks,False,0.0,1.0,1.0
1,PlantBasedRedditor,t2_g4e0rfz,1648767000.0,,Use Goo Gone on scissors and blades to reduce ...,lifehacks,False,0.0,1.0,1.0
2,CryptographerFar5073,t2_ldjcr311,1648764000.0,,Bingo Bash,lifehacks,False,0.0,1.0,1.0
3,Giant_weiner_not_dog,t2_konlr4kt,1648763000.0,,How to troll someone,lifehacks,False,0.0,1.0,1.0
4,Giant_weiner_not_dog,t2_konlr4kt,1648762000.0,,what a nice way to have your meal( credit to u...,lifehacks,False,0.0,1.0,1.0


In [42]:
df_all.tail()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
9712,sscorpio77,t2_jugpp8rg,1646173000.0,[removed],LPT: You won’t have to constantly brake to avo...,LifeProTips,False,2.0,1.0,1.0
9713,CreatorVilla,t2_f1xkvrju,1646173000.0,,"LPT: If you want someone to trust you, approac...",LifeProTips,False,1.0,1.0,1.0
9714,TheNative93,t2_52ulcx1x,1646173000.0,"If they have a number on their website, or on ...",LPT Whenever you submit a resume don’t wait fo...,LifeProTips,False,1.0,1.0,1.0
9715,duskymk,t2_22umtf1d,1646173000.0,[removed],LPT: Always bring your phone with you in a pub...,LifeProTips,False,1.0,1.0,1.0
9716,shwarma_heaven,t2_ddvb4,1646172000.0,,LPT: Just because a car stops before a parking...,LifeProTips,False,1.0,1.0,1.0


In [43]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9717 entries, 0 to 9716
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   author           9717 non-null   object 
 1   author_fullname  9717 non-null   object 
 2   created_utc      9717 non-null   float64
 3   selftext         9717 non-null   object 
 4   title            9717 non-null   object 
 5   subreddit        9717 non-null   object 
 6   is_video         9717 non-null   object 
 7   num_comments     9717 non-null   float64
 8   score            9717 non-null   float64
 9   upvote_ratio     9717 non-null   float64
dtypes: float64(4), object(6)
memory usage: 759.3+ KB


In [44]:
na_only(df_all)

0

Everything looks good! We can now export.

## Data Exporting

In [45]:
#df_all.to_csv('../datasets/submissions_data.csv', index=False)

In [46]:
# TODO: wrangle data for top posts only