# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png)  Project 3: Web APIs & Classification
Reddit's API:  Data Wrangling, Natural Language Processing, and Classification Modeling


This project covers three of the biggest concepts in Data Science:
- Data Wrangling/Acquisition
- Natural Language Processing
- Classification Modeling


---
## Technical Report:   *Reddit API Data Collection*
This notebook, just one component of the overall project, covers the collection and import (and possibly some cleaning) of posts from two subreddits of my choosing.

Part 1 of the project focuses on **Data wrangling/gathering/acquisition**.
The expectation is that not all acquired data will be clean or in a structured/organized format (like a single .csv file or SQL table). While an API request for data is ideal, some scraping may be required if the website of interest does not have an API (or its API is poorly documented).

At the end of this notebook, the scraped (and possibly cleaned) data is saved to .csv datasets which can be referenced here:
- `subreddit_NUTR.csv`:  [subreddit: Nutrition](../data/subreddit_NUTRITION.csv)
- `subreddit_MED.csv`:  [subreddit: Medicine](../data/subreddit_MED.csv)

Ultimately, this data will be used with NLP to train a classifier that predicts which subreddit a given post came from. **This is a binary classification problem**.


**Data Collection**
- Was enough data gathered to generate a significant result?
- Was data collected that was useful and relevant to the project?
- Was data collection and storage optimized through *custom functions, pipelines, and/or automation*?
- Was thought given to the server receiving the requests such as considering number of requests per second?



---
## Reddit API Data Collection

#### About the API

Reddit's API is fairly straightforward. For example, if I want the posts from [`/r/boardgames`](https://www.reddit.com/r/boardgames), all I have to do is add `.json` to the end of the url: https://www.reddit.com/r/boardgames.json
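
As a minimal sketch of the URL pattern just described, the helper below builds the `.json` listing URL for a subreddit. The name `listing_url` and its `category` argument are illustrative only, not part of Reddit's API. The actual request is left commented out so the snippet runs offline.

```python
def listing_url(subreddit, category=None):
    """Build the JSON listing URL for a subreddit (hypothetical helper)."""
    base = "https://www.reddit.com/r/" + subreddit
    if category:
        # e.g. "new" or "top"; leaving it out requests the default listing
        base += "/" + category
    return base + ".json"


url = listing_url("boardgames")
print(url)

# To fetch the listing (needs network access and the requests library):
# import requests
# res = requests.get(url, headers={"User-agent": "my user agent 0.1"})
# posts = res.json()["data"]["children"]
```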

To help you get started, we have a primer video on how to use Reddit's API: https://www.youtube.com/watch?v=5Y3ZE26Ciuk

---

### Requirements

- Gather and prepare your data using the `requests` library.
- **Create and compare two models**. One of these must be a Bayes classifier; the other can be a classifier of your choosing: logistic regression, KNN, SVM, etc.
- A Jupyter Notebook with your analysis for a peer audience of data scientists.
- An executive summary of the results you found.
- A short presentation outlining your process and findings for a semi-technical audience.

**Pro Tip 1:** You can find a good example executive summary [here](https://www.proposify.biz/blog/executive-summary).

**Pro Tip 2:** Reddit will give you 25 posts **per request**. To get enough data, you'll need to hit Reddit's API **repeatedly** (most likely in a `for` loop). _Be sure to use the `time.sleep()` function at the end of your loop to allow for a break in between requests. **THIS IS CRUCIAL**_

**Pro Tip 3:** The API will cap you at 1,000 posts for each subreddit (assuming the subreddit has that many posts).

**Pro Tip 4:** At the end of each loop, be sure to save the results from your scrape as a `csv`: JSON from Reddit > Pandas DataFrame > CSV. That way, if something goes wrong in your loop, you won't lose all your data.
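
The checkpointing idea in the last tip could look something like the sketch below. It uses the standard library's `csv` module rather than pandas so that it runs without extra packages. `save_checkpoint`, the file name, and the sample rows are all hypothetical.

```python
import csv
import os
import tempfile


def save_checkpoint(rows, path, fieldnames):
    """Rewrite the CSV at `path` from `rows` (a list of dicts)."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    # swap the finished file into place so a crash never leaves a half-written CSV
    os.replace(tmp_path, path)
    return len(rows)


# Demo with made-up pages of posts instead of real API responses
demo_path = os.path.join(tempfile.mkdtemp(), "subreddit_demo.csv")
fake_pages = [
    [{"name": "t3_a", "title": "first"}],
    [{"name": "t3_b", "title": "second"}],
]
collected = []
for page in fake_pages:
    collected.extend(page)
    # save after every page, so at most one page is lost if the loop fails
    save_checkpoint(collected, demo_path, fieldnames=["name", "title"])

with open(demo_path, newline="", encoding="utf-8") as handle:
    saved_rows = list(csv.DictReader(handle))
print(len(saved_rows))
```

In a notebook that already has the posts in a DataFrame, calling `to_csv(path, index=False)` on that DataFrame inside the loop plays the same role.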



In [3]:
# required to make API requests
import requests
# required to throttle your scraping loop... 
import time

In [4]:
# Python libraries used for data analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Riley's YouTube video

Primer video on using Reddit's API: https://www.youtube.com/watch?v=5Y3ZE26Ciuk

See also:  https://git.generalassemb.ly/DSI-US-4/project_3


# Riley's pro tips

- **Pro Tip 2**: Reddit will give you 25 posts per request. To get enough data, you'll need to hit Reddit's API repeatedly (most likely in a `for` loop). Be sure to use the `time.sleep()` function at the end of your loop to allow for a break in between requests. THIS IS CRUCIAL

    If you want more than the top 25 posts, you'll need to:
    1. get the name of the last post:  `data['data']['after']`
    2. use that name to hit the following URL:  `https://www.reddit.com/hot.json?after=THE_AFTER_FROM_STEP_1`
    3. create a loop to repeat steps 1 and 2 until you have a sufficient number of posts

    (NOTE: the "key=value" portion after the '?' is called the "query string"; multiple key=value pairs can be separated with an '&'.)
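
The three steps above can be sketched as a small loop. To keep the example runnable offline, the page fetcher is passed in as a function. `collect_posts`, `fake_fetch`, and the two fake pages are assumptions made for illustration, not part of the original notebook.

```python
import time


def collect_posts(fetch_page, max_pages, pause_seconds=3):
    """Follow the 'after' token from page to page (hypothetical helper)."""
    posts = []
    after = None
    for page_number in range(max_pages):
        listing = fetch_page(after)                # step 2: request the next page
        posts.extend(listing["data"]["children"])
        after = listing["data"]["after"]           # step 1: name of the last post
        if after is None:
            break                                  # no further pages to request
        if page_number < max_pages - 1:
            time.sleep(pause_seconds)              # pause between requests
    return posts, after


# Two fake pages standing in for Reddit's responses
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"name": "t3_a"}},
                                 {"data": {"name": "t3_b"}}],
                    "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"name": "t3_c"}}],
                      "after": None}},
}


def fake_fetch(after):
    return FAKE_PAGES[after]


posts, last_after = collect_posts(fake_fetch, max_pages=5, pause_seconds=0)
print([p["data"]["name"] for p in posts], last_after)
```

With real requests, `fetch_page` could wrap a `requests.get(...)` call that passes `{'after': after}` as `params` and returns `res.json()`. Stopping when `after` comes back as `None` avoids restarting from the first page and collecting the same posts again.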


## Additional pro tips:

- Anytime you're investigating a new JSON dictionary:
    - check out what's in the JSON data:
        - `the_json = res.json()`
    - check out the keys to the JSON data:
        - `the_json.keys()`
    - sort the keys:
        - `sorted(the_json.keys())`
    - investigate what's associated with a particular key (ex: `['data', 'kind']`):
        - `the_json['data']`
    - investigate keys associated with `['data']`:
        - `sorted(the_json['data'].keys())`
    - count the number of posts...
        - `len(the_json['data']['children'])`
    - create a dataframe to see what's going on...
        - `pd.DataFrame(the_json['data']['children'])`
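
A short walk-through of those checks, run against a tiny hand-made dictionary shaped like a Reddit listing. The two sample posts are invented; the key names mirror the ones that appear in this notebook's output.

```python
# A made-up listing with the same shape as res.json()
the_json = {
    "kind": "Listing",
    "data": {
        "modhash": "",
        "dist": 2,
        "children": [
            {"kind": "t3", "data": {"name": "t3_aaa", "title": "first post"}},
            {"kind": "t3", "data": {"name": "t3_bbb", "title": "second post"}},
        ],
        "after": "t3_bbb",
        "before": None,
    },
}

top_keys = sorted(the_json.keys())                   # top-level keys
data_keys = sorted(the_json["data"].keys())          # keys under 'data'
post_count = len(the_json["data"]["children"])       # posts on this page
first_post = the_json["data"]["children"][0]["data"]  # one post's fields

print(top_keys)
print(data_keys)
print(post_count, first_post["title"])
```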

We're looking for four pieces of content about each thread:
1. title of the thread
2. the subreddit that the thread corresponds to
3. the length of time it has been up on Reddit
4. the number of comments on the thread
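
A sketch of pulling those four items out of one post. It assumes each post's `data` dictionary carries `created_utc` (seconds since the epoch) and `num_comments`; those two field names do not appear in the truncated output shown in this notebook, so treat them as assumptions. `summarize_post` and the sample post are hypothetical.

```python
import time


def summarize_post(child, now=None):
    """Return the four items of interest for one post (hypothetical helper)."""
    post = child["data"]
    if now is None:
        now = time.time()
    return {
        "title": post["title"],
        "subreddit": post["subreddit"],
        # time on Reddit, in hours, from the assumed 'created_utc' timestamp
        "age_hours": round((now - post["created_utc"]) / 3600, 1),
        "num_comments": post["num_comments"],
    }


# Made-up post, created exactly two hours before the reference time
sample_child = {
    "kind": "t3",
    "data": {
        "title": "Example thread",
        "subreddit": "medicine",
        "created_utc": 1562600000.0,
        "num_comments": 12,
    },
}
record = summarize_post(sample_child, now=1562600000.0 + 7200)
print(record)
```

Applied to every element of the scraped list, this yields a list of flat dictionaries that `pd.DataFrame(...)` can turn into one column per field.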


## collecting for subreddit:  Health...

In [289]:
# ###################################
# Run this initialization on the first run only: re-running it
# discards the posts collected so far and restarts pagination.

# create empty list
# posts = []
# reset parameter (NOTE: "after" is the name of the last post in your list of posts)
# after = None
# ###################################

# NOTE: with no listing category in the path, this URL returns the subreddit's default listing;
# a category such as "hot" or "new" can be added to the path (e.g. /r/Health/new.json)
url = 'https://www.reddit.com/r/Health.json'
# set a custom user agent (requests sent with the library's default agent tend to be rejected by Reddit)
headers = {'User-agent': 'my user agent 0.1'}

# loop through... ("4" specifies number of loops; expect a total of 100 posts b/c 25 are returned per request)
for i in range(4):
    print(i)
    # first request: no "after" value yet, so send no parameters;
    # otherwise pass the name of the last post scraped so that the next page is returned
    params = {} if after is None else {'after': after}
    # make your API request; (NOTE: "res" is short for "response"...)
    res = requests.get(url, params=params, headers=headers)

    # if API request is successful (i.e. status 200)
    if res.status_code == 200:
        # get the json data from the response
        the_json = res.json()
        # extend (not append!) the existing "posts" list with the new posts
        # NOTE: extend adds each new post to the list (whereas append would add the whole page as one nested list)
        posts.extend(the_json['data']['children'])
        # "after" is the name (ID) of the last post on this page
        after = the_json['data']['after']
    else:
        # if API request is NOT successful (i.e. status 404, 429, etc.)
        # show the status
        print(res.status_code)
        # break the loop (i.e. stop processing)
        break
    # pause between requests so the loop does not overload the server or get your IP blocked...
    time.sleep(3)


# create dataframe
# NOTE: posts is a list of dicts of dicts, so fields such as the title sit inside the "data" column;
# pull them out later with a list comprehension (or for loop)
import_posts = pd.DataFrame(posts)


0
1
2
3


In [290]:
# check dataframe:
print('shape: ', import_posts.shape)
# 1st run:  85 rows
# 2nd run:  170 rows
# 3rd run:  255 rows
# 4th run:  340 rows
# 5th run:  425 rows
# 6th run:  510 rows
# 7th run:  595 rows
# 8th run:  680 rows
# 9th run:  765 rows
# 10th run: 850 rows
# 935 rows


import_posts.info()

shape:  (2071, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2071 entries, 0 to 2070
Data columns (total 2 columns):
data    2071 non-null object
kind    2071 non-null object
dtypes: object(2)
memory usage: 32.4+ KB
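
Given the 1,000-post cap mentioned in the pro tips, a total of 2,071 rows suggests that some posts may have been collected more than once across runs. A sketch of one way to check for and drop repeats, keyed on each post's `name`. `dedupe_posts` is a hypothetical helper and the sample posts are made up.

```python
def dedupe_posts(posts):
    """Keep the first occurrence of each post, keyed on data['name']."""
    seen = set()
    unique = []
    for child in posts:
        name = child["data"]["name"]
        if name not in seen:
            seen.add(name)
            unique.append(child)
    return unique


# Made-up scrape in which one post appears twice
sample_posts = [
    {"kind": "t3", "data": {"name": "t3_a"}},
    {"kind": "t3", "data": {"name": "t3_b"}},
    {"kind": "t3", "data": {"name": "t3_a"}},
]
unique_posts = dedupe_posts(sample_posts)
print(len(sample_posts), len(unique_posts))
```

Comparing the two lengths on the real `posts` list would show how many of the collected rows are repeats.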


In [19]:
# profile the data
print(import_posts.iloc[0])
import_posts.head()

data    {'approved_at_utc': None, 'subreddit': 'Health...
kind                                                   t3
Name: 0, dtype: object


|   | data | kind |
|---|------|------|
| 0 | {'approved_at_utc': None, 'subreddit': 'Health... | t3 |
| 1 | {'approved_at_utc': None, 'subreddit': 'Health... | t3 |
| 2 | {'approved_at_utc': None, 'subreddit': 'Health... | t3 |
| 3 | {'approved_at_utc': None, 'subreddit': 'Health... | t3 |
| 4 | {'approved_at_utc': None, 'subreddit': 'Health... | t3 |


In [291]:
# keep the export in its own cell so that re-running the scrape does not overwrite the file by accident

# export dataframe
import_posts.to_csv('../data/subreddit_Health.csv', index=False)


In [292]:
# verify file creation

!ls -ltr ../data/subreddit_Health.csv

-rw-r--r--@ 1 ngms  staff  8295215 Jul  8 15:02 ../data/subreddit_Health.csv
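
One thing to watch for when this file is read back: the `data` column holds dictionaries, and the preview earlier in this section suggests they end up in the CSV as their Python string form (single quotes, `None`). Assuming that is the case, `json.loads` would reject those strings, while `ast.literal_eval` from the standard library can parse them. A small sketch with a made-up cell value:

```python
import ast
import json

# What one cell of the 'data' column would look like after a CSV round trip
cell_text = str({"approved_at_utc": None, "subreddit": "Health", "saved": False})

# json.loads expects double quotes and null/false, so it fails on this text
try:
    json.loads(cell_text)
    parsed_as_json = True
except json.JSONDecodeError:
    parsed_as_json = False

# ast.literal_eval safely evaluates Python literals (dicts, strings, None, booleans)
post = ast.literal_eval(cell_text)
print(parsed_as_json, post["subreddit"])
```

After `pd.read_csv`, the same conversion can be applied to the whole column with `df['data'].apply(ast.literal_eval)`.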


In [293]:
# populate empty dictionary with value for "after"; i.e. key id from last entry you scraped...
# params = {'after': after}

# what's in our params dictionary (the last 'after' value sent)?
params


# t3_c6yfqe

{'after': 't3_c6yfqe'}

## collecting for subreddit:  Today I Learned (TIL)...

In [168]:
# ###################################
# Run this initialization on the first run only: re-running it
# discards the posts collected so far and restarts pagination.

# create empty list
# til_posts = []
# reset parameter (NOTE: "after" is the name of the last post in your list of posts)
# til_after = None
# ###################################

# NOTE: with no listing category in the path, this URL returns the subreddit's default listing;
# a category such as "hot" or "new" can be added to the path (e.g. /r/todayilearned/new.json)
url = 'https://www.reddit.com/r/todayilearned.json'
# set a custom user agent (requests sent with the library's default agent tend to be rejected by Reddit)
headers = {'User-agent': 'my user agent 0.1'}

# loop through... ("4" specifies number of loops; expect a total of 100 posts b/c 25 are returned per request)
for i in range(4):
    print(i)
    # first request: no "after" value yet, so send no parameters;
    # otherwise pass the name of the last post scraped so that the next page is returned
    params = {} if til_after is None else {'after': til_after}
    # make your API request; (NOTE: "res" is short for "response"...)
    res = requests.get(url, params=params, headers=headers)

    # if API request is successful (i.e. status 200)
    if res.status_code == 200:
        # get the json data from the response
        the_json = res.json()
        # extend (not append!) the existing "til_posts" list with the new posts
        # NOTE: extend adds each new post to the list (whereas append would add the whole page as one nested list)
        til_posts.extend(the_json['data']['children'])
        # "after" is the name (ID) of the last post on this page
        til_after = the_json['data']['after']
    else:
        # if API request is NOT successful (i.e. status 404, 429, etc.)
        # show the status
        print(res.status_code)
        # break the loop (i.e. stop processing)
        break
    # pause between requests so the loop does not overload the server or get your IP blocked...
    time.sleep(3)


# create dataframe
# NOTE: til_posts is a list of dicts of dicts, so fields such as the title sit inside the "data" column;
# pull them out later with a list comprehension (or for loop)
import_til_posts = pd.DataFrame(til_posts)



0
1
2
3


In [169]:
# check dataframe:
print('shape: ', import_til_posts.shape)
# 1st run:   100 rows
# 2nd run:   200 rows
# 3rd run:   300 rows
# 4th run:   400 rows
# 5th run:   500 rows
# 6th run:   600 rows
# 7th run:   683 rows
# 8th run:   783 rows
# 9th run:   883 rows
# 10th run:  983 rows

# 1083
# 1183
# 1283
# 1370
# 1470
# 1570
# 1670
# 1770
# 1870
# 1970
# 2054

import_til_posts.info()

shape:  (2054, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2054 entries, 0 to 2053
Data columns (total 2 columns):
data    2054 non-null object
kind    2054 non-null object
dtypes: object(2)
memory usage: 32.2+ KB


In [200]:
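# check unique values
# NOTE: iterating over a DataFrame yields its column labels, so this counts
# the columns ('data', 'kind'), not the unique posts
len(set(import_til_posts))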

2

In [32]:
# profile the data
print(import_til_posts.iloc[0])
import_til_posts.head()

data    {'approved_at_utc': None, 'subreddit': 'todayi...
kind                                                   t3
Name: 0, dtype: object


|   | data | kind |
|---|------|------|
| 0 | {'approved_at_utc': None, 'subreddit': 'todayi... | t3 |
| 1 | {'approved_at_utc': None, 'subreddit': 'todayi... | t3 |
| 2 | {'approved_at_utc': None, 'subreddit': 'todayi... | t3 |
| 3 | {'approved_at_utc': None, 'subreddit': 'todayi... | t3 |
| 4 | {'approved_at_utc': None, 'subreddit': 'todayi... | t3 |


In [170]:
# keep the export in its own cell so that re-running the scrape does not overwrite the file by accident

# export dataframe
import_til_posts.to_csv('../data/subreddit_TIL.csv', index=False)


In [171]:
# verify file creation

!ls -ltr ../data/subreddit_TIL.csv

-rw-r--r--  1 ngms  staff  7645236 Jul  7 18:45 ../data/subreddit_TIL.csv


In [172]:
# populate empty dictionary with value for "after"; i.e. key id from last entry you scraped...
# params = {'after': til_after}

# what's in our params dictionary (the last 'after' value sent)?
params

# t3_ca8vah

{'after': 't3_ca8vah'}

In [78]:
# import_til_posts

## collecting for subreddit:  Medicine...

In [154]:
# ###################################
# Run this initialization on the first run only: re-running it
# discards the posts collected so far and restarts pagination.

# create empty list
# med_posts = []
# reset parameter (NOTE: "after" is the name of the last post in your list of posts)
# med_after = None
# ###################################

# NOTE: with no listing category in the path, this URL returns the subreddit's default listing;
# a category such as "hot" or "new" can be added to the path (e.g. /r/medicine/new.json)
url = 'https://www.reddit.com/r/medicine.json'
# set a custom user agent (requests sent with the library's default agent tend to be rejected by Reddit)
headers = {'User-agent': 'my user agent 0.1'}

# loop through... ("4" specifies number of loops; expect a total of 100 posts b/c 25 are returned per request)
for i in range(4):
    print(i)
    # first request: no "after" value yet, so send no parameters;
    # otherwise pass the name of the last post scraped so that the next page is returned
    params = {} if med_after is None else {'after': med_after}
    # make your API request; (NOTE: "res" is short for "response"...)
    res = requests.get(url, params=params, headers=headers)

    # if API request is successful (i.e. status 200)
    if res.status_code == 200:
        # get the json data from the response
        the_json = res.json()
        # extend (not append!) the existing "med_posts" list with the new posts
        # NOTE: extend adds each new post to the list (whereas append would add the whole page as one nested list)
        med_posts.extend(the_json['data']['children'])
        # "after" is the name (ID) of the last post on this page
        med_after = the_json['data']['after']
    else:
        # if API request is NOT successful (i.e. status 404, 429, etc.)
        # show the status
        print(res.status_code)
        # break the loop (i.e. stop processing)
        break
    # pause between requests so the loop does not overload the server or get your IP blocked...
    time.sleep(3)


# create dataframe
# NOTE: med_posts is a list of dicts of dicts, so fields such as the title sit inside the "data" column;
# pull them out later with a list comprehension (or for loop)
import_med_posts = pd.DataFrame(med_posts)


0
1
2
3


In [155]:
# check dataframe:
print('shape: ', import_med_posts.shape)

# import_med_posts.info()

shape:  (2072, 2)


In [156]:
# keep the export in its own cell so that re-running the scrape does not overwrite the file by accident

# export dataframe
import_med_posts.to_csv('../data/subreddit_MED.csv', index=False)


In [157]:
# verify file creation

!ls -ltr ../data/subreddit_MED.csv

-rw-r--r--  1 ngms  staff  7609485 Jul  8 23:15 ../data/subreddit_MED.csv


In [158]:
# populate empty dictionary with value for "after"; i.e. key id from last entry you scraped...
# params = {'after': med_after}

# what's in our params dictionary (the last 'after' value sent)?
params

# t3_c6p0x8


{'after': 't3_c6p0x8'}

In [None]:
# profile the data
# print(import_med_posts.iloc[0])
# import_med_posts.head()

In [95]:
print(len(the_json['data']['after']))
print(the_json['data']['after'])


9
t3_bbujir
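
The `after` value is a Reddit "fullname": a type prefix, an underscore, and the item's ID. Here the prefix `t3` matches the `kind` of `t3` shown on each post. A sketch of splitting one apart follows; `split_fullname` is a hypothetical helper, and the prefix table reflects my understanding of Reddit's type prefixes rather than anything stated in this notebook.

```python
# Type prefixes as commonly documented for Reddit's API (assumption)
KIND_LABELS = {
    "t1": "comment",
    "t2": "account",
    "t3": "link (post)",
    "t4": "message",
    "t5": "subreddit",
    "t6": "award",
}


def split_fullname(fullname):
    """Split a fullname such as 't3_bbujir' into prefix, ID, and label."""
    prefix, _, item_id = fullname.partition("_")
    return prefix, item_id, KIND_LABELS.get(prefix, "unknown")


print(split_fullname("t3_bbujir"))
```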


In [None]:
# count the number of posts in the most recent response (i.e. one page)...
len(the_json['data']['children'])


In [99]:
# count the number of posts...
#     len(the_json['data']['children'])
len(import_med_posts)


1587

In [96]:
# sort the keys to the json data:
sorted(the_json.keys())


['data', 'kind']

In [97]:
import_med_posts['data'].head()

0    {'approved_at_utc': None, 'subreddit': 'medici...
1    {'approved_at_utc': None, 'subreddit': 'medici...
2    {'approved_at_utc': None, 'subreddit': 'medici...
3    {'approved_at_utc': None, 'subreddit': 'medici...
4    {'approved_at_utc': None, 'subreddit': 'medici...
Name: data, dtype: object

In [None]:
# import_med_posts.iloc[0]

In [109]:
# looks like the 'data' key has more info, lets look closer...
# investigate ("sub")keys associated with the ['data'] key
sorted(the_json['data'].keys())


['after', 'before', 'children', 'dist', 'modhash']

In [160]:
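# idea: combine each post's title and body text, e.g.
# [' '.join([post['data']['title'], post['data']['selftext']]) for post in med_posts]

# NOTE: import_med_posts has no 'title' column; the title sits inside each 'data' dict
# import_med_posts['title']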


In [107]:
# investigate what's associated with a particular key: (ex:  ['data', 'kind'] )
print('json key: kind -  \n', the_json['kind'])
print('json key: data -  \n', the_json['data'])


json key: kind -  
 Listing
json key: data -  
 {'modhash': '', 'dist': 25, 'children': [{'kind': 't3', 'data': {'approved_at_utc': None, 'subreddit': 'medicine', 'selftext': 'Just curious if anyone here has a story where they had to open a patient up so emergently they couldn’t wait to transport them to the OR. If so, what was the indication? What was the outcome?', 'author_fullname': 't2_f51q0', 'saved': False, 'mod_reason_title': None, 'gilded': 0, 'clicked': False, 'title': 'Any surgeons here ever have to open a patient at the bedside?', 'link_flair_richtext': [], 'subreddit_name_prefixed': 'r/medicine', 'hidden': False, 'pwls': 6, 'link_flair_css_class': None, 'downs': 0, 'hide_score': False, 'name': 't3_bcux79', 'quarantine': False, 'link_flair_text_color': 'dark', 'author_flair_background_color': '', 'subreddit_type': 'public', 'ups': 369, 'total_awards_received': 0, 'media_embed': {}, 'author_flair_template_id': None, 'is_original_content': False, 'user_reports': [], 'secure_me

In [132]:
# investigate what's associated with a particular ("sub")key from the list above:
# print('json key: after -  \n', the_json['data']['after'])
# print('json key: before -  \n', the_json['data']['before'])
print('json key: children -  \n', the_json['data']['children'])
# print('json key: dist -  \n', the_json['data']['dist'])
# print('json key: modhash -  \n', the_json['data']['modhash'])

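# NOTE: the keys below belong to each individual post, i.e. the_json['data']['children'][i]['data']
# print('json key: title -  \n', the_json['data']['children'][0]['data']['title'])
# print('json key: permalink -  \n', the_json['data']['children'][0]['data']['permalink'])
# print('json key: url -  \n', the_json['data']['children'][0]['data']['url'])
# print('json key: domain -  \n', the_json['data']['children'][0]['data']['domain'])
# print('json key: content_categories -  \n', the_json['data']['children'][0]['data']['content_categories'])
# print('json key: author_flair_text -  \n', the_json['data']['children'][0]['data']['author_flair_text'])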



# title
# url
# permalink

# approved_at_utc
# subreddit
# subreddit_name_prefixed
# subreddit_type
# subreddit_id
# selftext
# author_fullname
# saved
# is_original_content
# content_categories
# domain
# author_flair_text


json key: children -  
 [{'kind': 't3', 'data': {'approved_at_utc': None, 'subreddit': 'medicine', 'selftext': 'Just curious if anyone here has a story where they had to open a patient up so emergently they couldn’t wait to transport them to the OR. If so, what was the indication? What was the outcome?', 'author_fullname': 't2_f51q0', 'saved': False, 'mod_reason_title': None, 'gilded': 0, 'clicked': False, 'title': 'Any surgeons here ever have to open a patient at the bedside?', 'link_flair_richtext': [], 'subreddit_name_prefixed': 'r/medicine', 'hidden': False, 'pwls': 6, 'link_flair_css_class': None, 'downs': 0, 'hide_score': False, 'name': 't3_bcux79', 'quarantine': False, 'link_flair_text_color': 'dark', 'author_flair_background_color': '', 'subreddit_type': 'public', 'ups': 369, 'total_awards_received': 0, 'media_embed': {}, 'author_flair_template_id': None, 'is_original_content': False, 'user_reports': [], 'secure_media': None, 'is_reddit_media_domain': False, 'is_meta': False, '

In [None]:
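# count the number of unique posts collected so far...

# 1st run = 100
# 2nd run = 200
# NOTE: iterate over the 'data' column; iterating over the DataFrame itself yields only its column labels
len(set(p['name'] for p in import_med_posts['data']))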



## Reddit API Data Collection is complete

Proceed to the next notebook:
- [SubReddit_NLP_Data_Cleaning](02_SubReddit_NLP_Data_Cleaning.ipynb)


---

### The Data Science Process

**Problem Statement**
- Is it clear what the goal of the project is?
- What type of model will be developed?
- How will success be evaluated?
- Is the scope of the project appropriate?
- Is it clear who cares about this or why this is important to investigate?
- Does the student consider the audience and the primary and secondary stakeholders?

**Data Collection**
- Was enough data gathered to generate a significant result?
- Was data collected that was useful and relevant to the project?
- Was data collection and storage optimized through custom functions, pipelines, and/or automation?
- Was thought given to the server receiving the requests such as considering number of requests per second?

**Data Cleaning and EDA**
- Are missing values imputed/handled appropriately?
- Are distributions examined and described?
- Are outliers identified and addressed?
- Are appropriate summary statistics provided?
- Are steps taken during data cleaning and EDA framed appropriately?
- Does the student address whether or not they are likely to be able to answer their problem statement with the provided data given what they've discovered during EDA?

**Preprocessing and Modeling**
- Is text data successfully converted to a matrix representation?
- Are methods such as stop words, stemming, and lemmatization explored?
- Does the student properly split and/or sample the data for validation/training purposes?
- Does the student test and evaluate a variety of models to identify a production algorithm (**AT MINIMUM:** Bayes and one other model)?
- Does the student defend their choice of production model relevant to the data at hand and the problem?
- Does the student explain how the model works and evaluate its performance successes/downfalls?

**Evaluation and Conceptual Understanding**
- Does the student accurately identify and explain the baseline score?
- Does the student select and use metrics relevant to the problem objective?
- Does the student interpret the results of their model for purposes of inference?
- Is domain knowledge demonstrated when interpreting results?
- Does the student provide appropriate interpretation with regards to descriptive and inferential statistics?

**Conclusion and Recommendations**
- Does the student provide appropriate context to connect individual steps back to the overall project?
- Is it clear how the final recommendations were reached?
- Are the conclusions/recommendations clearly stated?
- Does the conclusion answer the original problem statement?
- Does the student address how findings of this research can be applied for the benefit of stakeholders?
- Are future steps to move the project forward identified?

