# Data Wrangling

## Problem Statements

> 1. Develop a classification model that predicts whether a Reddit post belongs to r/LifeProTips or r/Lifehacks, based on the content of the post
>    - Optimize the model for accuracy and precision

> 2. Identify the top 15 keywords that distinguish r/LifeProTips and r/Lifehacks

> 3. Determine the most frequent content posted in each subreddit, and recommend the subreddit that is most appropriate for new Reddit users

> 4. Identify what makes a post most popular, and unpopular, in each subreddit
>    - Based on self-texts only

## Imports

In [165]:
%run 00_Workflow_Functions.ipynb import api_call, data_wrangling

In [150]:
import numpy as np  # needed for np.unique in the uniqueness check further down
import pandas as pd
import requests
from collections import defaultdict

## Preliminaries

In [4]:
# lifehacks
lhs_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=lifehacks&metadata=true&size=0"
lht_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=lifehacks&metadata=true&size=0&is_self=true"

# lifeprotips
lpts_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=LifeProTips&metadata=true&size=0"
lptt_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=LifeProTips&metadata=true&size=0&is_self=true"

In [5]:
res_lhs = requests.get(lhs_url)
res_lht = requests.get(lht_url)
res_lpts = requests.get(lpts_url)
res_lptt = requests.get(lptt_url)

print(res_lhs.status_code, res_lht.status_code)
print(res_lpts.status_code, res_lptt.status_code)

200 200
200 200


All four requests returned status code 200, so they were successful.

In [6]:
lhs_count = res_lhs.json()['metadata']['total_results']
lht_count = res_lht.json()['metadata']['total_results']

lpts_count = res_lpts.json()['metadata']['total_results']
lptt_count = res_lptt.json()['metadata']['total_results']

print(f"LifeHacks Total Submissions: {lhs_count}\nLifeHacks Total Self-Text Posts: {lht_count}")
print(f"LifeProTips Total Submissions: {lpts_count}\nLifeProTips Total Self-Text Posts: {lptt_count}")

LifeHacks Total Submissions: 81787
LifeHacks Total Self-Text Posts: 23266
LifeProTips Total Submissions: 555477
LifeProTips Total Self-Text Posts: 534475


## Data Wrangling - r/Lifehacks

In [182]:
# Content we care about:
keys = ['author', 'author_fullname', 'created_utc', 'selftext', 'title', 'subreddit', 'is_video', 'num_comments', 'score', 'upvote_ratio']

# instantiate new dict to capture api data
lh_data = defaultdict(list)

In [163]:
# making api call
lf_call = api_call('lifehacks', 100)

In [164]:
len(lf_call) # we could only request data 100 submissions at a time

100

In [179]:
# wrangling api call into a dictionary that will be loaded into a dataframe
data = data_wrangling(lh_data, keys, lf_call)

In [184]:
# checking if any data was not captured in the api call
data['error_log']

[]
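
The `data_wrangling` helper lives in a separate notebook and is not shown here. As a rough illustration only, the sketch below shows one way such a helper could behave, given how it is called above: it appends the selected keys of each post to a shared dictionary and returns that dictionary together with an error log. The function name `wrangle_sketch`, the shape of the log entries, and the choice to fill absent fields with `None` are all assumptions, not the author's implementation.

```python
from collections import defaultdict


def wrangle_sketch(store, keys, posts):
    """Hypothetical stand-in for data_wrangling: copy selected fields into `store`."""
    error_log = []
    for post in posts:
        # Record posts that lack one or more of the requested fields.
        missing = [key for key in keys if key not in post]
        if missing:
            error_log.append({"post": post, "missing": missing})
        for key in keys:
            # A missing field becomes None so every column keeps the same length.
            store[key].append(post.get(key))
    return {"data": store, "error_log": error_log}


# Made-up posts for illustration; the second one has no 'score' field.
store = defaultdict(list)
posts = [
    {"author": "ann", "title": "tip one", "score": 3},
    {"author": "bob", "title": "tip two"},
]
result = wrangle_sketch(store, ["author", "title", "score"], posts)
print(result["data"]["score"])
print(len(result["error_log"]))
```

Because `store` is mutated in place, calling the function again with a new batch of posts keeps extending the same columns, which matches how the dataframe grows across repeated calls in this notebook.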

In [180]:
# api data dictionary to dataframe
df = pd.DataFrame(data['data'])
df.tail()

| | author | author_fullname | created_utc | selftext | title | subreddit | is_video | num_comments | score | upvote_ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| 95 | Panzerschwein45 | t2_72ju8ai7 | 1650824490 | | Ad-Hoc Smoked Tuna! | lifehacks | False | 0 | 1 | 1.0 |
| 96 | Not_your_hoe_ | t2_g37s4cyh | 1650823542 | I have an important upcoming examination in 2 ... | how can i make an efficient study routine? | lifehacks | False | 0 | 1 | 0.99 |
| 97 | AlternativeAgile747 | t2_m57ssorg | 1650823516 | [removed] | Top 50 Law Schools In The USA #Best Law School... | lifehacks | False | 0 | 1 | 1.0 |
| 98 | jamesgang007 | t2_bus45 | 1650821923 | | Yeti cocktail life hack | lifehacks | False | 0 | 2 | 1.0 |
| 99 | MustacheMufasa | t2_m6py7kop | 1650821565 | About 2/3 of ovens made in the past ten years ... | Hack your oven to use it as an air fryer | lifehacks | False | 0 | 1 | 1.0 |


Each API call returns at most 100 submissions, so we use a small trick to collect more. We take the submission time of the last (oldest) post collected so far and request data that predates it. We then append that data to the dataframe and repeat until we have all the data we need. The process is shown below.
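
The idea of paging backwards in time can be sketched independently of the real API. In the minimal example below, `collect_posts`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names introduced for illustration; `fake_fetch` stands in for a call such as `api_call`, whose actual implementation is not shown in this notebook. The sketch assumes each post carries a `created_utc` timestamp and that the source returns posts older than a given cutoff.

```python
from typing import Callable, Optional


def collect_posts(fetch_page: Callable[[Optional[int]], list], target: int) -> list:
    """Page backwards in time until `target` posts are collected or none remain."""
    posts = []
    before = None  # None means "start from the newest submission"
    while len(posts) < target:
        page = fetch_page(before)
        if not page:
            # Nothing older is available, so stop instead of looping forever.
            break
        posts.extend(page)
        # The oldest timestamp on this page becomes the cutoff for the next request.
        before = min(post["created_utc"] for post in page)
    return posts


# Stand-in for the real API: 250 made-up posts, newest first, served 100 at a time.
FAKE_POSTS = [{"created_utc": 1_650_000_000 - i, "title": f"post {i}"} for i in range(250)]


def fake_fetch(before=None, size=100):
    older = [p for p in FAKE_POSTS if before is None or p["created_utc"] < before]
    return older[:size]


collected = collect_posts(fake_fetch, target=1000)
print(len(collected))
```

Because whole pages are appended, the result can overshoot the target by up to one page, and the loop ends early when the source has no older posts left.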

In [186]:
# last collected submission
last_utc = df.loc[len(df) - 1, 'created_utc']
last_utc

1650821565

----

Here we keep making API calls, each time requesting posts that predate the oldest post collected so far, until we have at least 1,000 rows of data.

In [188]:
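# continue wrangling data until a certain size is met
while len(df) < 1000:
    try:
        lf_call = api_call('lifehacks', 100, last_utc)
    except Exception as exc:
        # report the cause of the failure instead of hiding it
        print(f"Data wrangling failed: {exc}")
        break

    # stop if the api returned nothing older, otherwise the loop never ends
    if len(lf_call) == 0:
        break

    data = data_wrangling(lh_data, keys, lf_call)
    df = pd.DataFrame(data['data'])
    # oldest submission collected so far; the next call requests posts before it
    last_utc = df.loc[len(df) - 1, 'created_utc']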

In [189]:
# verifying data was collected
df.shape

(1099, 10)

In [190]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1099 entries, 0 to 1098
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   author           1099 non-null   object 
 1   author_fullname  1095 non-null   object 
 2   created_utc      1099 non-null   int64  
 3   selftext         1097 non-null   object 
 4   title            1099 non-null   object 
 5   subreddit        1099 non-null   object 
 6   is_video         1099 non-null   bool   
 7   num_comments     1099 non-null   int64  
 8   score            1099 non-null   int64  
 9   upvote_ratio     1099 non-null   float64
dtypes: bool(1), float64(1), int64(3), object(5)
memory usage: 78.5+ KB


In [191]:
df.isna().sum()

author             0
author_fullname    4
created_utc        0
selftext           2
title              0
subreddit          0
is_video           0
num_comments       0
score              0
upvote_ratio       0
dtype: int64

We have some missing data: 4 values in author_fullname and 2 in selftext. Since that is a very small share of the 1,099 rows, we will drop those rows.
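
A minimal sketch of that drop, using a small made-up dataframe rather than the collected data (the name `toy` and its values are illustrative only):

```python
import pandas as pd

# Made-up rows: one is missing author_fullname, another is missing selftext.
toy = pd.DataFrame(
    {
        "author_fullname": ["t2_a", None, "t2_c", "t2_d"],
        "selftext": ["first tip", "second tip", None, "fourth tip"],
        "title": ["a", "b", "c", "d"],
    }
)

# Drop rows with a missing value in either column, then renumber the index.
cleaned = toy.dropna(subset=["author_fullname", "selftext"]).reset_index(drop=True)
print(len(toy), len(cleaned))
```

Note that `dropna` only removes true missing values. Empty strings and placeholder text such as `[removed]` are not treated as missing and would need separate handling.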

In [192]:
# checking if our data is unique based on submission times
len(np.unique(df['created_utc']))

1099

Every row has a different submission time, which strongly suggests that all rows are unique submissions.
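
Unique timestamps are suggestive rather than conclusive, since two different posts can share a second. As a hedged complement, the sketch below checks for duplicates on a combination of columns using a small made-up dataframe; the column combination is an assumption about what identifies a submission here.

```python
import pandas as pd

# Made-up rows: the third row repeats the first one exactly.
toy = pd.DataFrame(
    {
        "author": ["ann", "bob", "ann", "cat"],
        "title": ["tip one", "tip two", "tip one", "tip three"],
        "created_utc": [1650821565, 1650821923, 1650821565, 1650823516],
    }
)

id_columns = ["author", "title", "created_utc"]

# True only when no timestamp repeats.
timestamps_unique = toy["created_utc"].is_unique

# Count rows that repeat an earlier row on all identifying columns.
duplicate_rows = int(toy.duplicated(subset=id_columns).sum())

# Keep the first occurrence of each submission.
deduped = toy.drop_duplicates(subset=id_columns).reset_index(drop=True)
print(timestamps_unique, duplicate_rows, len(deduped))
```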