# Data Wrangling

## Problem Statements

> 1. Develop a classification model that predicts whether a Reddit post belongs to r/LifeProTips or r/Lifehacks, based on the content of the post
>    - Optimize the model for accuracy and precision

> 2. Identify the top 15 keywords that distinguish r/LifeProTips and r/Lifehacks

> 3. Determine the most frequent content posted in each subreddit, and provide a recommendation appropriate for new Reddit users

> 4. Identify what makes a post popular or unpopular in each subreddit
>    - Based on self-texts only

## Imports

In [1]:
# run the helper notebook, which defines api_call and data_wrangling
%run 00_Workflow_Functions.ipynb

In [2]:
import numpy as np  # needed for np.unique in the uniqueness checks
import pandas as pd
import requests
from collections import defaultdict

## Preliminaries

In [3]:
# lifehacks
lhs_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=lifehacks&metadata=true&size=0"
lht_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=lifehacks&metadata=true&size=0&is_self=true"

# lifeprotips
lpts_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=LifeProTips&metadata=true&size=0"
lptt_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=LifeProTips&metadata=true&size=0&is_self=true"

In [4]:
res_lhs = requests.get(lhs_url)
res_lht = requests.get(lht_url)
res_lpts = requests.get(lpts_url)
res_lptt = requests.get(lptt_url)

print(res_lhs.status_code, res_lht.status_code)
print(res_lpts.status_code, res_lptt.status_code)

200 200
200 200


All four requests returned HTTP status 200, so they were successful.

In [5]:
lhs_count = res_lhs.json()['metadata']['total_results']
lht_count = res_lht.json()['metadata']['total_results']

lpts_count = res_lpts.json()['metadata']['total_results']
lptt_count = res_lptt.json()['metadata']['total_results']

print(f"LifeHacks Total Submissions: {lhs_count}\nLifeHacks Total Self-Text Posts: {lht_count}")
print(f"LifeProTips Total Submissions: {lpts_count}\nLifeProTips Total Self-Text Posts: {lptt_count}")

LifeHacks Total Submissions: 81814
LifeHacks Total Self-Text Posts: 23275
LifeProTips Total Submissions: 555600
LifeProTips Total Self-Text Posts: 534598
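The four counts above can be turned into a self-text share per subreddit, which shows how differently the two communities post. The sketch below is plain arithmetic on the printed numbers; the helper name `self_text_share` is ours and is not part of the workflow notebook.

```python
# Counts copied from the printed output above.
counts = {
    "lifehacks": {"total": 81814, "self_text": 23275},
    "LifeProTips": {"total": 555600, "self_text": 534598},
}


def self_text_share(total, self_text):
    """Return the fraction of submissions that are self-text posts."""
    return self_text / total


shares = {name: self_text_share(**c) for name, c in counts.items()}

for name, share in shares.items():
    print(f"r/{name}: {share:.1%} self-text")
```

Roughly 28% of r/lifehacks submissions are self-text posts, compared with roughly 96% for r/LifeProTips.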


## Data Wrangling - r/Lifehacks

In [6]:
# Content we care about:
keys = ['author', 'author_fullname', 'created_utc', 'selftext', 'title', 'subreddit', 'is_video', 'num_comments', 'score', 'upvote_ratio']

# instantiate new dict to capture api data
lh_data = defaultdict(list)

In [7]:
# making api call
lh_call = api_call('lifehacks', 100)

In [8]:
len(lh_call)  # each request returns at most 100 submissions

100

In [9]:
# wrangling the api call into a dictionary that will be used to build a dataframe
data = data_wrangling(lh_data, keys, lh_call)

In [10]:
# checking if any data was not captured in the api call
data['error_log']

[]
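The `data_wrangling` helper lives in `00_Workflow_Functions.ipynb` and is not shown in this notebook. As a rough illustration only, here is one way such a helper could work, assuming it appends the selected fields of each submission to a shared `defaultdict(list)` and returns that store together with an error log. The name `data_wrangling_sketch` and every detail of its body are hypothetical, not the actual implementation.

```python
from collections import defaultdict


def data_wrangling_sketch(store, keys, submissions):
    """Hypothetical stand-in for data_wrangling.

    Appends the chosen fields of each submission to `store`, one list per key.
    A missing field is stored as None so that all lists stay the same length.
    """
    error_log = []
    for position, post in enumerate(submissions):
        if not isinstance(post, dict):
            # Record anything that is not a submission dict and move on.
            error_log.append((position, repr(post)))
            continue
        for key in keys:
            store[key].append(post.get(key))
    return {"data": store, "error_log": error_log}


store = defaultdict(list)
batch = [
    {"title": "first", "score": 1, "extra": "ignored"},
    {"title": "second"},  # no "score" field, stored as None
]
result = data_wrangling_sketch(store, ["title", "score"], batch)
print(dict(result["data"]), result["error_log"])
```

Because the same `store` is passed in on every call, the lists keep growing across batches, which matches how the dataframe is rebuilt from `data['data']` after each request.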

In [11]:
# api data dictionary to dataframe
df_lh = pd.DataFrame(data['data'])
df_lh.tail()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
95,YoMammasAHo420,t2_fq0c3sa0,1650895181,,The Ultimate Life Hack From Norm Macdonald,lifehacks,False,0,1,1.0
96,Kylanv,t2_fckvhxug,1650891971,,20 Hilarious Comics By ‘Bits &amp; Pieces’ Wit...,lifehacks,False,0,1,1.0
97,incutech,t2_9zs2u,1650891393,,when your mom refuses to quit driving after he...,lifehacks,False,0,1,1.0
98,Skorobogatiji,t2_y98z7,1650887003,,Do you know that dogs can smell serious diseas...,lifehacks,False,0,1,1.0
99,Apprehensive_Toe8550,t2_4gjyet18,1650885527,[removed],For every upvote and comment I will give a lif...,lifehacks,False,0,1,1.0


Because each request returns at most 100 submissions, we page backwards through time. We take the submission time of the oldest post collected so far, request the posts that predate it, and add them to the dataframe. We repeat this until we have all the data we need. The process is shown below.

In [12]:
# last collected submission
last_utc = df_lh.loc[len(df_lh) - 1, 'created_utc']
last_utc

1650885527
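`created_utc` is a Unix timestamp (seconds since 1970-01-01 UTC). Converting it to a readable date is a quick sanity check on how far back the first batch reaches. A minimal sketch using only the standard library:

```python
from datetime import datetime, timezone

last_utc = 1650885527  # value printed above

# Interpret the epoch seconds as a timezone-aware UTC datetime.
last_seen = datetime.fromtimestamp(last_utc, tz=timezone.utc)
print(last_seen.isoformat())  # 2022-04-25T11:18:47+00:00
```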

We keep making API calls, each one requesting submissions older than the last one collected so far, until we have at least 1000 rows of data.

In [13]:
# continue wrangling data until a certain size is met
while len(df_lh) < 1000:
    try:
        lh_call = api_call('lifehacks', 100, last_utc)
    except Exception as err:  # a bare except would also swallow KeyboardInterrupt
        print(f"Data wrangling failed: {err}")
        break

    data = data_wrangling(lh_data, keys, lh_call)
    df_lh = pd.DataFrame(data['data'])
    last_utc = df_lh.loc[len(df_lh) - 1, 'created_utc']
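One caveat with a loop of this shape: if the API ever returns an empty page (for example, when the subreddit has no older posts), the row count never grows and the loop would not terminate. The sketch below shows the same backwards-paging idea with an explicit guard. `collect_until` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the real `api_call` so the example runs without network access.

```python
def collect_until(fetch_page, target, page_size=100):
    """Page backwards in time until `target` rows are collected or data runs out."""
    rows = []
    before = None  # None means "start from the newest submission"
    while len(rows) < target:
        page = fetch_page(page_size, before)
        if not page:
            # Nothing older is available, so stop instead of looping forever.
            break
        rows.extend(page)
        before = page[-1]["created_utc"]  # oldest timestamp seen so far
    return rows


def fake_fetch(size, before, newest=1000, oldest=750):
    """Stand-in for the API: one post per second from `newest` down to `oldest`."""
    start = newest if before is None else before - 1
    stop = max(start - size, oldest - 1)
    return [{"created_utc": t} for t in range(start, stop, -1)]


print(len(collect_until(fake_fetch, target=150)))   # stops once the target is passed
print(len(collect_until(fake_fetch, target=1000)))  # stops when the data runs out
```

As in the notebook, the loop can overshoot the target because it always adds a whole page at a time.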

In [14]:
# verifying data was collected
df_lh.shape

(1099, 10)

In [15]:
df_lh.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1099 entries, 0 to 1098
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   author           1099 non-null   object 
 1   author_fullname  1095 non-null   object 
 2   created_utc      1099 non-null   int64  
 3   selftext         1097 non-null   object 
 4   title            1099 non-null   object 
 5   subreddit        1099 non-null   object 
 6   is_video         1099 non-null   bool   
 7   num_comments     1099 non-null   int64  
 8   score            1099 non-null   int64  
 9   upvote_ratio     1099 non-null   float64
dtypes: bool(1), float64(1), int64(3), object(5)
memory usage: 78.5+ KB


In [17]:
df_lh.isna().sum()

author             0
author_fullname    4
created_utc        0
selftext           2
title              0
subreddit          0
is_video           0
num_comments       0
score              0
upvote_ratio       0
dtype: int64

A few rows are missing `author_fullname` or `selftext`. Since they make up well under 1% of the data, we drop them now.

In [19]:
# drop NAs
df_lh = df_lh.dropna()

In [20]:
len(df_lh)  # number of rows

1095
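Note that the per-column missing counts (4 and 2) add up to 6, yet only 4 rows were removed (1099 to 1095). `dropna()` removes a row once, however many of its columns are missing, so the two rows without `selftext` must also have lacked `author_fullname`. A small illustration on made-up data:

```python
import pandas as pd

# Toy data: row 1 is missing both fields, row 2 is missing one.
df = pd.DataFrame({
    "author_fullname": ["t2_a", None, None, "t2_d"],
    "selftext": ["text", None, "", "[removed]"],
})

per_column = df.isna().sum()                         # missing values per column
rows_with_any_na = int(df.isna().any(axis=1).sum())  # rows that dropna() will remove
cleaned = df.dropna()

print(per_column.to_dict(), rows_with_any_na, len(cleaned))
```

The empty string in `selftext` is not counted as missing, which is why link posts with no body text survive `dropna()`.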

In [21]:
# checking if our data is unique based on submission times
len(np.unique(df_lh['created_utc']))

1095

Every row has a distinct submission time, which strongly suggests that each row is a unique submission.
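Distinct timestamps are good evidence but not proof, since two different posts could share the same second. An alternative check, sketched here on made-up data, is to look for rows that repeat across several identifying columns at once. The column subset is our choice for illustration.

```python
import pandas as pd

# Toy data with one repeated submission (rows 1 and 2).
df = pd.DataFrame({
    "author": ["a", "b", "b", "c"],
    "title": ["tip one", "tip two", "tip two", "tip three"],
    "created_utc": [100, 200, 200, 300],
})

# True only if no timestamp appears more than once.
timestamps_unique = df["created_utc"].is_unique

# Flag every repeat of an (author, title, created_utc) combination after its first occurrence.
repeat_mask = df.duplicated(subset=["author", "title", "created_utc"], keep="first")
deduplicated = df.loc[~repeat_mask]

print(timestamps_unique, int(repeat_mask.sum()), len(deduplicated))
```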

In [22]:
df_lh.head()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
0,Cutesifer_101,t2_7zsy6pvt,1651176966,The damage wasn’t that bad other than the fact...,I spilled monster energy,lifehacks,False,0,1,1.0
1,Subtotalpoet,t2_dcvoz53,1651176676,,"Wife forgot ur favorite ice cream? Improvise, ...",lifehacks,False,0,1,1.0
2,amintowords,t2_racie,1651175651,Set an alarm for an hour after you're meant to...,How to remember to take tablets on time,lifehacks,False,0,1,1.0
3,rokokslot87,t2_kv1oregh,1651174857,,SLOT ONLINE MENANG BESAR | SLOT DEPOSIT PULSA,lifehacks,False,0,1,1.0
4,Distinct_Expert_7648,t2_mczxpj82,1651174438,[removed],Infertility Clinic in Pune,lifehacks,False,0,1,1.0


In [23]:
df_lh.tail()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
1094,Benjamin-Info,t2_6qa849ep,1648141834,,STORY TO ILLUSTRATE THE WARNING: LOOK BEFORE Y...,lifehacks,False,0,1,1.0
1095,Universal-theories,t2_l2v5b5rh,1648139884,,Who is the Lord,lifehacks,False,0,1,1.0
1096,Comfortable_College9,t2_9xvv1oxg,1648134172,,How to sleep alone without fear 🌼😁😁,lifehacks,False,0,1,1.0
1097,zareth_merus,t2_evdznttz,1648129366,,Use heavy duty paper clips to organize your fr...,lifehacks,False,0,1,1.0
1098,Serious-Ice3998,t2_evi4rvxq,1648129038,,You're welcome. The right way to untangle jewe...,lifehacks,False,0,1,1.0


## Data Wrangling - r/LifeProTips

We use the same methodology to wrangle data for `r/LifeProTips` as we did for `r/lifehacks`.

In [24]:
# instantiate new dict to capture api data
lpt_data = defaultdict(list)

In [25]:
# making api call
lpt_call = api_call('LifeProTips', 100)

In [26]:
len(lpt_call)  # each request returns at most 100 submissions

100

In [27]:
# wrangling the api call into a dictionary that will be used to build a dataframe
data = data_wrangling(lpt_data, keys, lpt_call)

Note we are using the same keys as `r/lifehacks`.

In [28]:
# checking if any data was not captured in the api call
data['error_log']

[]

In [29]:
# api data dictionary to dataframe
df_lpt = pd.DataFrame(data['data'])
df_lpt.tail()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
95,Old-Maintenance24923,t2_j2e8k26h,1651130044,[removed],LPT: Any LPT suggesting some jobs actually req...,LifeProTips,False,0,1,1.0
96,cascadegaming,t2_15yrfc,1651128770,I use this strategy every time I cook or bake ...,"LPT: When planning a recipe, instead of making...",LifeProTips,False,1,1,1.0
97,perfectlysaneboy,t2_a0ez8oa4,1651127996,,"LPT: When given a raise, be ambitious and alwa...",LifeProTips,False,1,1,1.0
98,goretsky,t2_3tihk,1651127544,"Hello,\n\nIf you are taking a trip that involv...",LPT: Taking a trip? Add your flight to your ca...,LifeProTips,False,1,1,1.0
99,Playful_Impression_8,t2_835oyy5h,1651127055,[removed],HP LaserJet Enterprise M612dn A blank sheet of...,LifeProTips,False,1,1,1.0


In [30]:
# last collected submission
last_utc = df_lpt.loc[len(df_lpt) - 1, 'created_utc']
last_utc

1651127055

As before, we keep making API calls, each one requesting submissions older than the last one collected so far, until we have at least 1000 rows of data.

In [31]:
# continue wrangling data until a certain size is met
while len(df_lpt) < 1000:
    try:
        # same capitalization as the first LifeProTips call, for consistency
        lpt_call = api_call('LifeProTips', 100, last_utc)
    except Exception as err:  # a bare except would also swallow KeyboardInterrupt
        print(f"Data wrangling failed: {err}")
        break

    data = data_wrangling(lpt_data, keys, lpt_call)
    df_lpt = pd.DataFrame(data['data'])
    last_utc = df_lpt.loc[len(df_lpt) - 1, 'created_utc']

In [32]:
# verifying data was collected
df_lpt.shape

(1099, 10)

In [34]:
df_lpt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1099 entries, 0 to 1098
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   author           1099 non-null   object 
 1   author_fullname  1083 non-null   object 
 2   created_utc      1099 non-null   int64  
 3   selftext         1089 non-null   object 
 4   title            1099 non-null   object 
 5   subreddit        1099 non-null   object 
 6   is_video         1099 non-null   bool   
 7   num_comments     1099 non-null   int64  
 8   score            1099 non-null   int64  
 9   upvote_ratio     1099 non-null   float64
dtypes: bool(1), float64(1), int64(3), object(5)
memory usage: 78.5+ KB


In [35]:
df_lpt.isna().sum()

author              0
author_fullname    16
created_utc         0
selftext           10
title               0
subreddit           0
is_video            0
num_comments        0
score               0
upvote_ratio        0
dtype: int64

Again, a few rows are missing `author_fullname` or `selftext`. Since they make up under 2% of the data, we drop them now.

In [36]:
# drop NAs
df_lpt = df_lpt.dropna()

In [38]:
len(df_lpt)  # number of rows

1083

In [39]:
# checking if our data is unique based on submission times
len(np.unique(df_lpt['created_utc']))

1083

Every row has a distinct submission time, which strongly suggests that each row is a unique submission.

In [40]:
df_lpt.head()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
0,Relative-Kesuki9802,t2_izoub7ix,1651180668,[removed],Student Research Survey,LifeProTips,False,1,1,1.0
1,johnkoetsier,t2_iy7f,1651180441,[removed],When someone comes to build/install/renovate s...,LifeProTips,False,1,1,1.0
2,Fair-Pattern-1745,t2_mhzb1dp4,1651178681,[removed],i removed clothes,LifeProTips,False,1,1,1.0
3,BigOlBlimp,t2_53c1t8bg,1651178589,Lots of people think blower hand dryers aren't...,LPT: Air hand dryers will dry your hands quick...,LifeProTips,False,1,1,1.0
4,JonusTJonnerson,t2_2qhrakuw,1651178192,,LPT Tell your jokes at work with a straight (a...,LifeProTips,False,1,1,1.0


In [41]:
df_lpt.tail()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
1094,Bloodstone2012,t2_jt1jo,1650646779,,LPT: it is healthier to skip a meal and just d...,LifeProTips,False,1,1,1.0
1095,cross_peach,t2_30cjlc75,1650646116,,LPT: You can access several online courses and...,LifeProTips,False,1,1,1.0
1096,jmincorporated,t2_16tlrj,1650645631,,LPT - when you receive a medical bill always c...,LifeProTips,False,1,1,1.0
1097,the_women_era,t2_m8qtr3lc,1650645336,[removed],People Will Always Judge Your Actions. No Matt...,LifeProTips,False,1,1,1.0
1098,patchaclus,t2_9411lb63,1650645171,,LPT: you can set rules on your email to redire...,LifeProTips,False,1,1,1.0


-----

### Merging Data

In [43]:
# "stacking" both dataframes by row using concatenate method
df_all = pd.concat([df_lh, df_lpt], axis=0)

In [48]:
# must reset repeated indices after concatenation
df_all.reset_index(drop=True, inplace=True)

In [45]:
df_all.shape

(2178, 10)
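After `dropna()`, each dataframe keeps its original row labels, so stacking them produces gaps and repeated index values. Resetting the index fixes that. An equivalent one-step option is `ignore_index=True` in `pd.concat`, sketched here on two tiny made-up frames:

```python
import pandas as pd

# Two small frames whose row labels overlap, as happens after dropna().
top = pd.DataFrame({"subreddit": ["lifehacks", "lifehacks"]}, index=[0, 2])
bottom = pd.DataFrame({"subreddit": ["LifeProTips", "LifeProTips"]}, index=[0, 1])

stacked = pd.concat([top, bottom], axis=0)                        # keeps the old labels
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)  # fresh 0..n-1 labels

print(stacked.index.tolist())     # [0, 2, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```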

In [49]:
df_all.head()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
0,Cutesifer_101,t2_7zsy6pvt,1651176966,The damage wasn’t that bad other than the fact...,I spilled monster energy,lifehacks,False,0,1,1.0
1,Subtotalpoet,t2_dcvoz53,1651176676,,"Wife forgot ur favorite ice cream? Improvise, ...",lifehacks,False,0,1,1.0
2,amintowords,t2_racie,1651175651,Set an alarm for an hour after you're meant to...,How to remember to take tablets on time,lifehacks,False,0,1,1.0
3,rokokslot87,t2_kv1oregh,1651174857,,SLOT ONLINE MENANG BESAR | SLOT DEPOSIT PULSA,lifehacks,False,0,1,1.0
4,Distinct_Expert_7648,t2_mczxpj82,1651174438,[removed],Infertility Clinic in Pune,lifehacks,False,0,1,1.0


In [50]:
df_all.tail()

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
2173,Bloodstone2012,t2_jt1jo,1650646779,,LPT: it is healthier to skip a meal and just d...,LifeProTips,False,1,1,1.0
2174,cross_peach,t2_30cjlc75,1650646116,,LPT: You can access several online courses and...,LifeProTips,False,1,1,1.0
2175,jmincorporated,t2_16tlrj,1650645631,,LPT - when you receive a medical bill always c...,LifeProTips,False,1,1,1.0
2176,the_women_era,t2_m8qtr3lc,1650645336,[removed],People Will Always Judge Your Actions. No Matt...,LifeProTips,False,1,1,1.0
2177,patchaclus,t2_9411lb63,1650645171,,LPT: you can set rules on your email to redire...,LifeProTips,False,1,1,1.0


In [55]:
df_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2178 entries, 0 to 2177
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   author           2178 non-null   object 
 1   author_fullname  2178 non-null   object 
 2   created_utc      2178 non-null   int64  
 3   selftext         2178 non-null   object 
 4   title            2178 non-null   object 
 5   subreddit        2178 non-null   object 
 6   is_video         2178 non-null   bool   
 7   num_comments     2178 non-null   int64  
 8   score            2178 non-null   int64  
 9   upvote_ratio     2178 non-null   float64
dtypes: bool(1), float64(1), int64(3), object(5)
memory usage: 155.4+ KB


In [56]:
na_only(df_all)

0

Everything looks good! We can now export.

## Data Exporting

In [54]:
#df_all.to_csv('../datasets/submissions_data.csv', index=False)
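One thing to keep in mind when the exported file is read back: many posts have an empty `selftext`, and by default `pd.read_csv` parses empty fields as `NaN`. The sketch below demonstrates this on a made-up two-row frame, using an in-memory buffer instead of a file, and shows `keep_default_na=False` as one way to preserve empty strings.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "title": ["a tip", "a link"],
    "selftext": ["body text", ""],  # the second post has no body text
})

csv_text = df.to_csv(index=False)

# Default parsing turns the empty field into NaN.
default_read = pd.read_csv(io.StringIO(csv_text))

# Disabling the default NA markers keeps it as an empty string.
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(int(default_read["selftext"].isna().sum()), literal_read["selftext"].tolist())
```

Note that `keep_default_na=False` also stops strings such as `NA` or `null` from being read as missing, so filling with `fillna('')` after loading is a reasonable alternative.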

In [1]:
# TODO: wrangle data for top posts only