# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png)  Project 3: Web APIs & Classification
Reddit's API:  Data Wrangling, Natural Language Processing, and Classification Modeling


This project covers three of the biggest concepts in Data Science:
- Data Wrangling/Acquisition
- Natural Language Processing
- Classification Modeling

---
## Technical Report:   *Reddit API Data Collection*
This notebook, one component of the overall project, covers the collection of posts from two subreddits (subject-specific Reddit user communities) of my choosing.

Part 1 of the project focuses on **Data wrangling/gathering/acquisition**. 
The expectation is that not all acquired data will be clean or in a structured format (like a single .csv file or SQL table). While an API request for data is ideal, some scraping may be required if the website of interest does not have an API (or the API is poorly documented).

Within this notebook, scraped data is saved to .csv datasets which can be referenced here:
- `RAW_subreddit_NUTRITION.csv`:  [subreddit: Nutrition](../data/RAW_subreddit_NUTRITION.csv)
- `RAW_subreddit_MEDICINE.csv`:  [subreddit: Medicine](../data/RAW_subreddit_MEDICINE.csv)

The concatenated data can be referenced here:
- `SUBREDDITS.csv`: [concatenated: Nutrition & Medicine](../data/SUBREDDITS.csv)

Ultimately, this data will be used with NLP to train a classifier that predicts which subreddit a given post came from. **This is a binary classification problem**.


---
## Reddit API Data Collection

#### About the API

Reddit's API is fairly straightforward. For example, posts from the [`/r/boardgames`](https://www.reddit.com/r/boardgames) subreddit can be obtained by adding `.json` to the end of the URL: https://www.reddit.com/r/boardgames.json. Data is returned in JSON format, which maps closely onto Python dictionaries and lists. The scraper in this notebook queries the Pushshift API, a third-party archive of Reddit data, rather than these `.json` endpoints.
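
As a rough illustration of that structure, the sketch below parses a listing the way it comes back from a `.json` URL. The sample payload is hand-written and heavily trimmed (real responses carry many more fields), and `fetch_listing` and `extract_titles` are hypothetical helper names that do not appear elsewhere in this notebook. It uses only the standard library; setting a descriptive `User-Agent` header is an assumption about what Reddit expects from scripted clients.

```python
import json
import urllib.request


def fetch_listing(url, user_agent="subreddit-classifier-demo/0.1"):
    """Download a Reddit listing (a URL ending in .json) and return it as a dict."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_titles(listing):
    """Pull post titles out of the nested listing structure."""
    return [child["data"]["title"] for child in listing["data"]["children"]]


# Hand-written, trimmed stand-in for a real response (placeholder values).
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": None,
        "children": [
            {"kind": "t3", "data": {"title": "First post", "selftext": "Hello"}},
            {"kind": "t3", "data": {"title": "Second post", "selftext": ""}},
        ],
    },
}

titles = extract_titles(sample_listing)
print(titles)

# Live request (needs network access):
# listing = fetch_listing("https://www.reddit.com/r/boardgames.json")
```

Each post sits under `data -> children -> data`, which is why the helper indexes twice into `"data"`.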

## Subreddit Selection

Generally I found that subreddits with larger user communities generate more posts, which are the source of data for this project. I chose to scrape two closely related subreddits: Nutrition and Medicine.
 - `Nutrition` subreddit: 456K members / 8K+ posts scraped
 - `Medicine` subreddit:  241K members / 5K+ posts scraped



## Import Required Python Libraries

In [1]:
# required to make API requests
import requests
# required to throttle your scraping loop... 
import time
# required for get_date function...
import datetime as dt
# required to ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Python libraries used for data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

### Build a scraper to grab posts

In [3]:
# function code from GA alumni Josh Robin and Brian Collins' lecture
# uses pushshift API
def query_pushshift(subreddit, kind='submission', skip=30, times=18, 
                    subfield = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments', 'score', 'is_self'],
                    comfields = ['body', 'score', 'created_utc']):
    """Collect posts from `subreddit` via Pushshift, stepping back `skip` days per request.

    Note: range(1, times) makes times - 1 requests, and `comfields` is not used below.
    """
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size=500".format(kind, subreddit)
    mylist = []

    for x in range(1, times):
        # size=500 asks for up to 500 results; `after` moves back skip * x days
        URL = "{}&after={}d".format(stem, skip * x)
        print(URL)
        response = requests.get(URL)
        # stop immediately if the API did not return success
        assert response.status_code == 200
        mine = response.json()['data']
        df = pd.DataFrame.from_dict(mine)
        mylist.append(df)
        # pause between requests to throttle the loop
        time.sleep(2)

    full = pd.concat(mylist, sort=False)

    if kind == "submission":
        # keep the fields of interest, drop exact duplicates, keep text (self) posts only
        full = full[subfield]
        full = full.drop_duplicates()
        full = full.loc[full['is_self'] == True]

    def get_date(created):
        # converts using the machine's local timezone, not UTC
        return dt.date.fromtimestamp(created)

    full['timestamp'] = full["created_utc"].apply(get_date)

    print(full.shape)

    return full
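
To see which requests the function will make without calling the API, the URL construction can be pulled out on its own. This is a sketch: `build_urls` is a hypothetical helper that mirrors the string formatting above, and the `size` parameter is added only for illustration. Because `range(1, times)` stops before `times`, the default `times=18` yields 17 requests.

```python
def build_urls(subreddit, kind="submission", skip=30, times=18, size=500):
    """Return the request URLs in the order the scraper loop would build them."""
    stem = "https://api.pushshift.io/reddit/search/{}/?subreddit={}&size={}".format(
        kind, subreddit, size
    )
    return ["{}&after={}d".format(stem, skip * x) for x in range(1, times)]


urls = build_urls("nutrition")
print(len(urls))
print(urls[0])
print(urls[-1])
```

Checking the list first makes it easy to confirm the time windows before spending two seconds per request on the real loop.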

## Scrape data: Nutrition  *(8,457 - 2,270 "removed" = 6,187)*

In [4]:
# invoke scraper function
nutrition_posts = query_pushshift('nutrition')

https://api.pushshift.io/reddit/search/submission/?subreddit=nutrition&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=nutrition&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=nutrition&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=nutrition&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=nutrition&size=500&after=150d
https://api.pushshift.io/reddit/search/submission/?subreddit=nutrition&size=500&after=180d
https://api.pushshift.io/reddit/search/submission/?subreddit=nutrition&size=500&after=210d
https://api.pushshift.io/reddit/search/submission/?subreddit=nutrition&size=500&after=240d
https://api.pushshift.io/reddit/search/submission/?subreddit=nutrition&size=500&after=270d
https://api.pushshift.io/reddit/search/submission/?subreddit=nutrition&size=500&after=300d
https://api.pushshift.io/reddit/search/submission/?subreddit=nutrition&size=500&after=330d
ht

### Create a backup of the raw data: Nutrition

In [5]:
# create dataframe of subreddit
raw_df_nutr = pd.DataFrame(nutrition_posts)

In [6]:
# profile raw dataframe
raw_df_nutr.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
0,What's the best way to gradually stop eating j...,,nutrition,1560311690,MortelleTSpears,11,5,True,2019-06-11
1,The 2 best veggies,Imagine you have access to all kinds of vegeta...,nutrition,1560313777,brumate,10,5,True,2019-06-12
2,Reintroducing Dairy,Any sage advice on how one might go about this?,nutrition,1560314043,eroticmarshmellows,4,0,True,2019-06-12
3,Eating a meal without meat makes my stomach fe...,I've been considering and wavering on becoming...,nutrition,1560315599,predoucheous,28,5,True,2019-06-12
4,Does hyrdrogenated palm oil stunt your growth,I know it's a stupid question but I was just c...,nutrition,1560315998,QuestionNoire,11,2,True,2019-06-12
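
The `timestamp` column is produced by `dt.date.fromtimestamp`, which converts using the local timezone of the machine running the notebook, so dates near midnight can differ from the UTC date. A sketch of a timezone-explicit alternative follows (`get_date_utc` is a hypothetical name), applied to the first `created_utc` value in the table above:

```python
import datetime as dt


def get_date_utc(created):
    """Convert a Unix timestamp (seconds) to a calendar date in UTC."""
    return dt.datetime.fromtimestamp(created, tz=dt.timezone.utc).date()


first_post_date = get_date_utc(1560311690)
print(first_post_date)
```

For that value the UTC date is 2019-06-12, whereas the table shows 2019-06-11, which suggests the notebook ran in a timezone behind UTC.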


In [7]:
!pwd

/Users/ngms/GA-DSI/SUBMISSIONS/Project_3/code


In [8]:
# export raw dataframe
# raw_df_nutr.to_csv('../data/RAW_subreddit_NUTRITION.csv', index=False)

In [11]:
# verify file creation
!ls -ltr ../data/RAW_subreddit_NUTRITION.csv

-rw-r--r--  1 ngms  staff  4038482 Jul 10 22:26 ../data/RAW_subreddit_NUTRITION.csv
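
The export path is relative to the working directory, which `!pwd` shows is the `code` folder, so the `data` folder sits one level up. A sketch of exporting and verifying from Python rather than the shell follows. It writes a made-up frame to a temporary directory so it can run anywhere, and `export_and_verify` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_and_verify(df, path):
    """Write df to CSV (creating parent folders) and return the row count read back."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return len(pd.read_csv(path))


# Made-up frame standing in for the scraped posts.
demo_df = pd.DataFrame({"title": ["a", "b", "c"], "score": [1, 2, 3]})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "RAW_subreddit_DEMO.csv"
    rows_written = export_and_verify(demo_df, target)
    file_existed = target.exists()

print(rows_written, file_existed)
```

In the notebook itself the target would be `../data/RAW_subreddit_NUTRITION.csv`.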


In [15]:
raw_df_nutr.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8457 entries, 0 to 499
Data columns (total 9 columns):
title           8457 non-null object
selftext        8358 non-null object
subreddit       8457 non-null object
created_utc     8457 non-null int64
author          8457 non-null object
num_comments    8457 non-null int64
score           8457 non-null int64
is_self         8457 non-null bool
timestamp       8457 non-null object
dtypes: bool(1), int64(3), object(5)
memory usage: 602.9+ KB


In [21]:
# count "removed" posts: every row has is_self == True, so the is_self total is the number of removed posts
raw_df_nutr[raw_df_nutr['selftext'] == '[removed]'].sum()

title           Simple Tips for Fitness SuccessRecovering drug...
selftext        [removed][removed][removed][removed][removed][...
subreddit       nutritionnutritionnutritionnutritionnutritionn...
created_utc                                         3495859089937
author          VarunMishra31A2n0nTabzzzmeisterBaston5Spurlaut...
num_comments                                                 4205
score                                                        3088
is_self                                                      2270
dtype: object
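
Summing the filtered frame works because every remaining row has `is_self == True`, so the `is_self` total equals the number of removed posts. A more direct sketch, on a small made-up frame, counts the boolean mask itself and keeps the surviving rows:

```python
import pandas as pd

# Made-up rows standing in for the scraped posts.
posts = pd.DataFrame(
    {
        "title": ["Post A", "Post B", "Post C", "Post D"],
        "selftext": ["[removed]", "Some text", None, "[removed]"],
    }
)

# True where the post body is the literal string "[removed]";
# missing values compare as False.
is_removed = posts["selftext"] == "[removed]"

removed_count = int(is_removed.sum())
kept = posts.loc[~is_removed]

print(removed_count)
print(kept.shape)
```

The same mask can be reused later to drop removed posts before modeling, should their text be needed.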

## Scrape data: Medicine  *(5,304 - 3,272 "removed" = 2,032)*

In [22]:
# invoke scraper function
health_posts = query_pushshift('medicine')

https://api.pushshift.io/reddit/search/submission/?subreddit=medicine&size=500&after=30d
https://api.pushshift.io/reddit/search/submission/?subreddit=medicine&size=500&after=60d
https://api.pushshift.io/reddit/search/submission/?subreddit=medicine&size=500&after=90d
https://api.pushshift.io/reddit/search/submission/?subreddit=medicine&size=500&after=120d
https://api.pushshift.io/reddit/search/submission/?subreddit=medicine&size=500&after=150d
https://api.pushshift.io/reddit/search/submission/?subreddit=medicine&size=500&after=180d
https://api.pushshift.io/reddit/search/submission/?subreddit=medicine&size=500&after=210d
https://api.pushshift.io/reddit/search/submission/?subreddit=medicine&size=500&after=240d
https://api.pushshift.io/reddit/search/submission/?subreddit=medicine&size=500&after=270d
https://api.pushshift.io/reddit/search/submission/?subreddit=medicine&size=500&after=300d
https://api.pushshift.io/reddit/search/submission/?subreddit=medicine&size=500&after=330d
https://api.p

In [23]:
# create dataframe of subreddit
raw_df_med = pd.DataFrame(health_posts)

In [24]:
# profile raw dataframe
raw_df_med.head()

Unnamed: 0,title,selftext,subreddit,created_utc,author,num_comments,score,is_self,timestamp
1,Buy Tramadol Online,[removed],medicine,1560319774,buytramadolonlinecod,0,1,True,2019-06-12
4,EMA restricts the use of fluoroquinolones in L...,Source: [https://www.ema.europa.eu/en/medicine...,medicine,1560334619,KokoskoZippo,8,1,True,2019-06-12
5,Online Anticancer Medicine in India | Generic ...,[removed],medicine,1560336980,Oddway_international,0,1,True,2019-06-12
7,Question for orthopaedic surgeons re: surgical...,[removed],medicine,1560348433,beaniegirl2,0,1,True,2019-06-12
11,Question on Philips Wearable Biosensor,[removed],medicine,1560351479,CFStark77,0,1,True,2019-06-12


In [25]:
# export raw dataframe
# raw_df_med.to_csv('../data/RAW_subreddit_MEDICINE.csv', index=False)

In [26]:
!pwd

/Users/ngms/GA-DSI/SUBMISSIONS/Project_3/code


In [27]:
# verify file creation
!ls -ltr ../data/RAW_subreddit_MEDICINE.csv

-rw-r--r--  1 ngms  staff  2043087 Jul 10 22:18 ../data/RAW_subreddit_MEDICINE.csv


In [28]:
raw_df_med.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5304 entries, 1 to 497
Data columns (total 9 columns):
title           5304 non-null object
selftext        5217 non-null object
subreddit       5304 non-null object
created_utc     5304 non-null int64
author          5304 non-null object
num_comments    5304 non-null int64
score           5304 non-null int64
is_self         5304 non-null bool
timestamp       5304 non-null object
dtypes: bool(1), int64(3), object(5)
memory usage: 378.1+ KB


In [30]:
# count "removed" posts: every row has is_self == True, so the is_self total is the number of removed posts
raw_df_med[raw_df_med['selftext'] == '[removed]'].sum()

title           Buy Tramadol OnlineOnline Anticancer Medicine ...
selftext        [removed][removed][removed][removed][removed][...
subreddit       medicinemedicinemedicinemedicinemedicinemedici...
created_utc                                         5042146960336
author          buytramadolonlinecodOddway_internationalbeanie...
num_comments                                                 5257
score                                                        7513
is_self                                                      3272
dtype: object

---
## Combine datasets

#### For each subreddit, create dataframe with two columns: 'title' and 'selftext'
Then label the classes for this binary classification in a single column called `nutrition`: 1 for Nutrition (the positive class) and 0 for Medicine (the negative class).
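
The per-subreddit steps (select columns, add the label, drop duplicates) can also be folded into one helper. This is only a sketch: `label_titles` is a hypothetical function, and the demo rows reuse two titles from the data with placeholder body text. Using `assign` returns a new frame rather than modifying a slice of the raw data, which sidesteps pandas' `SettingWithCopyWarning`.

```python
import pandas as pd


def label_titles(raw_df, label, target="nutrition"):
    """Keep unique titles and attach the class label (1 or 0) in a new column."""
    return (
        raw_df[["title"]]
        .drop_duplicates()
        .assign(**{target: label})
        .reset_index(drop=True)
    )


# Placeholder rows: two titles from the data, one repeated.
raw_demo = pd.DataFrame(
    {
        "title": ["Reintroducing Dairy", "The 2 best veggies", "The 2 best veggies"],
        "selftext": ["text", "more text", "more text"],
    }
)

labeled = label_titles(raw_demo, label=1)
print(labeled)
```

Calling it once with `label=1` and once with `label=0` would give two frames with identical columns, ready for `pd.concat`.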

## Nutrition

In [31]:
# nutrition is the positive class (1)
# create new dataframe with only two columns (copy to avoid modifying a slice of the raw data)
nutrition_df = raw_df_nutr[['title', 'selftext']].copy()
# add a classification column and set the positive class
nutrition_df['nutrition'] = 1

In [32]:
# profile dataframe
print(nutrition_df.shape)
nutrition_df.head()

(8457, 3)


Unnamed: 0,title,selftext,nutrition
0,What's the best way to gradually stop eating j...,,1
1,The 2 best veggies,Imagine you have access to all kinds of vegeta...,1
2,Reintroducing Dairy,Any sage advice on how one might go about this?,1
3,Eating a meal without meat makes my stomach fe...,I've been considering and wavering on becoming...,1
4,Does hyrdrogenated palm oil stunt your growth,I know it's a stupid question but I was just c...,1


In [33]:
# check for nulls
nutrition_df.isnull().sum()

title         0
selftext     99
nutrition     0
dtype: int64

In [34]:
# drop 'selftext' column before joining dataframes...
nutrition_df.drop(columns='selftext', inplace=True)
nutrition_df.head()

Unnamed: 0,title,nutrition
0,What's the best way to gradually stop eating j...,1
1,The 2 best veggies,1
2,Reintroducing Dairy,1
3,Eating a meal without meat makes my stomach fe...,1
4,Does hyrdrogenated palm oil stunt your growth,1


In [35]:
# remove duplicates
nutrition_df.drop_duplicates(inplace=True)

In [36]:
# profile dataframe
print(nutrition_df.shape)

(8368, 2)


## Medicine

In [37]:
# medicine is the negative class (0)
# create new dataframe with only two columns (copy to avoid modifying a slice of the raw data)
medicine_df = raw_df_med[['title', 'selftext']].copy()
# add a classification column and set the negative class
medicine_df['nutrition'] = 0

In [38]:
# profile dataframe
print(medicine_df.shape)
medicine_df.head()

(5304, 3)


Unnamed: 0,title,selftext,nutrition
1,Buy Tramadol Online,[removed],0
4,EMA restricts the use of fluoroquinolones in L...,Source: [https://www.ema.europa.eu/en/medicine...,0
5,Online Anticancer Medicine in India | Generic ...,[removed],0
7,Question for orthopaedic surgeons re: surgical...,[removed],0
11,Question on Philips Wearable Biosensor,[removed],0


In [39]:
# check for nulls
medicine_df.isnull().sum()

title         0
selftext     87
nutrition     0
dtype: int64

In [40]:
# drop 'selftext' column before joining dataframes...
medicine_df.drop(columns='selftext', inplace=True)
medicine_df.head()

Unnamed: 0,title,nutrition
1,Buy Tramadol Online,0
4,EMA restricts the use of fluoroquinolones in L...,0
5,Online Anticancer Medicine in India | Generic ...,0
7,Question for orthopaedic surgeons re: surgical...,0
11,Question on Philips Wearable Biosensor,0


In [41]:
# remove duplicates
medicine_df.drop_duplicates(inplace=True)

In [42]:
# profile dataframe
print(medicine_df.shape)

(5085, 2)


## Concatenate the two dataframes into one dataframe

In [43]:
# combine text for the two subreddits based on class values
subreddits = pd.concat([nutrition_df, medicine_df], ignore_index=True)
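
Because duplicates were removed within each subreddit separately, the same title could still appear under both labels after concatenation. A sketch of a quick check, using made-up frames rather than the scraped data:

```python
import pandas as pd

# Made-up frames; "Post B" deliberately appears in both classes.
nutrition_demo = pd.DataFrame({"title": ["Post A", "Post B"], "nutrition": 1})
medicine_demo = pd.DataFrame({"title": ["Post B", "Post C"], "nutrition": 0})

# ignore_index=True replaces the per-frame indexes with a fresh 0..n-1 range.
combined = pd.concat([nutrition_demo, medicine_demo], ignore_index=True)

# Titles that occur under more than one class label.
labels_per_title = combined.groupby("title")["nutrition"].nunique()
ambiguous_titles = labels_per_title[labels_per_title > 1].index.tolist()

print(combined.index.tolist())
print(ambiguous_titles)
```

An empty list on the real data would confirm that no title carries both labels.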

In [44]:
# export concatenated dataframe
# subreddits.to_csv('../data/SUBREDDITS.csv', index=False)

In [45]:
# look at positive class (nutrition)
subreddits.head()

Unnamed: 0,title,nutrition
0,What's the best way to gradually stop eating j...,1
1,The 2 best veggies,1
2,Reintroducing Dairy,1
3,Eating a meal without meat makes my stomach fe...,1
4,Does hyrdrogenated palm oil stunt your growth,1


In [46]:
# look at negative class (medicine)
subreddits.tail()

Unnamed: 0,title,nutrition
13448,Question about nutrition in pediatrics,0
13449,"family member diagnosed with RS3PE ""Remitting ...",0
13450,Contest Time! Funniest comeback reply to this ...,0
13451,Contest Time! Funniest comeback reply to this ...,0
13452,How to handle alternative practitioners,0


In [47]:
subreddits.isnull().sum()

title        0
nutrition    0
dtype: int64

In [48]:
subreddits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13453 entries, 0 to 13452
Data columns (total 2 columns):
title        13453 non-null object
nutrition    13453 non-null int64
dtypes: int64(1), object(1)
memory usage: 210.3+ KB


In [49]:
# count number of observations in each class
# note: classes are moderately imbalanced (about 62% nutrition vs. 38% medicine)
subreddits['nutrition'].value_counts()

1    8368
0    5085
Name: nutrition, dtype: int64
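
The counts translate into a majority-class baseline: a model that always predicts `nutrition` would be right about 62% of the time, which is the figure a trained classifier needs to beat. A sketch that rebuilds the label column from the counts above and computes the class shares:

```python
import pandas as pd

# Rebuild the label column from the counts reported above.
labels = pd.Series([1] * 8368 + [0] * 5085, name="nutrition")

# Share of each class, and the accuracy of always guessing the larger one.
class_share = labels.value_counts(normalize=True)
baseline_accuracy = float(class_share.max())

print(class_share.round(3))
print(round(baseline_accuracy, 3))
```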

In [50]:
subreddits.groupby('nutrition').describe()

nutrition,title count,title unique,title top,title freq
0,5085,5085,Advice on writing a medical report,1
1,8368,8368,Pesticide chemicals in urine can't be good,1


## Reddit API Data Collection is complete

Proceed to the next notebook:
- [SubReddits_NLP_Random_Forest](02_SubReddits_NLP_Random_Forest.ipynb)
