# Using Reddit's API to Predict Which Subreddit a Post Comes From
## Notebook 1: Scraping Posts from Reddit's API

In this project, I will demonstrate two major skills: collecting data via API requests and building a binary classifier.

Starting a data science problem has two components: stating the problem and acquiring the data.

In this project, my problem statement will be: _Can I tell which Subreddit a particular post comes from based on a simple analysis of the text used by the author?_

I will acquire my data by scraping the titles and text of posts from two distinct subreddits on [Reddit](https://www.reddit.com/): datascience and genetics. Each post is uniquely identified by its id. If it becomes necessary to supplement my "text only" approach to determining the subreddit, I will also scrape the time each post was first made and the number of comments it received. The posts will be exported as a JSON file for use in my second notebook.
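
As a rough illustration of the fields described above, the sketch below pulls them out of a single post. It assumes the usual shape of an entry in a Reddit listing, where each post is a dictionary whose `data` key holds `id`, `subreddit`, `title`, `selftext`, `created_utc`, and `num_comments`. The helper name `extract_fields` and the sample post are hypothetical and only serve the example.

```python
def extract_fields(post):
    """Return the fields of interest from one raw post dictionary."""
    data = post.get('data', {})
    return {
        'id': data.get('id'),
        'subreddit': data.get('subreddit'),
        'title': data.get('title', ''),
        'text': data.get('selftext', ''),
        'created_utc': data.get('created_utc'),
        'num_comments': data.get('num_comments'),
    }


# Made-up post, shaped like one entry of a listing's 'children' list.
sample_post = {
    'kind': 't3',
    'data': {
        'id': 'abc123',
        'subreddit': 'datascience',
        'title': 'An example title',
        'selftext': 'An example body.',
        'created_utc': 1500000000.0,
        'num_comments': 4,
    },
}

row = extract_fields(sample_post)
```

Using `.get` with defaults keeps the helper from raising if a post lacks one of the optional fields.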

Posts will be further processed in my second notebook and readied to be run through my models in the third notebook (see below).

In [1]:
import requests
import time
import pandas as pd
import json

In [2]:
headers = {'User-agent': 'yukihadeishi .1'}
res = requests.get('https://reddit.com/hot.json', headers=headers)

In [3]:
curr_json = res.json()

In [4]:
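def get_posts(sub='all', num_pages=100):
    # Start a fresh list on every call so that posts from different
    # subreddits do not end up in one shared global list.
    posts = []
    after = None
    for _ in range(num_pages):
        # 'after' is the token Reddit returns to mark the next page.
        params = {} if after is None else {'after': after}
        res = requests.get(f'https://reddit.com/r/{sub}/.json', params=params,
                           headers=headers)
        # Check the status before parsing: an error response may not be JSON.
        if res.status_code != 200:
            print(f'Request failed with status {res.status_code}. Try again.')
            return None
        curr_json = res.json()
        posts.extend(curr_json['data'].get('children', []))
        after = curr_json['data']['after']
        if after is None:
            # No more pages; stop rather than restart from the first page.
            break
        time.sleep(1)
    return posts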
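
The pagination logic can be looked at in isolation, without touching the network. In this minimal sketch, `collect_pages`, `fake_fetch`, and the two-page `fake_pages` data are all hypothetical stand-ins: `fake_fetch` plays the role of the `requests.get(...).json()` call and returns listings shaped like Reddit's, with a `children` list and an `after` token.

```python
def collect_pages(fetch_page, num_pages):
    """Follow 'after' tokens for at most num_pages pages."""
    collected = []
    after = None
    for _ in range(num_pages):
        listing = fetch_page(after)
        collected.extend(listing['data']['children'])
        after = listing['data']['after']
        if after is None:
            # The last page carries no token, so there is nothing left to fetch.
            break
    return collected


# Made-up two-page listing keyed by the 'after' token that requests it.
fake_pages = {
    None: {'data': {'children': [{'data': {'id': 'a'}},
                                 {'data': {'id': 'b'}}],
                    'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'id': 'c'}}],
                      'after': None}},
}


def fake_fetch(after):
    """Stand-in for the HTTP request: look the page up in fake_pages."""
    return fake_pages[after]


pages = collect_pages(fake_fetch, num_pages=100)
ids = [post['data']['id'] for post in pages]
```

Passing the fetch step in as a function is only a convenience for testing; the loop itself follows the same token-chasing idea as `get_posts`.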

In [5]:
datascience_posts = get_posts(sub='datascience', num_pages=100)

In [6]:
genetics_posts = get_posts(sub='genetics', num_pages=100)

In [9]:
reddit_posts = datascience_posts + genetics_posts
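
Since each post is identified by its id, one possible safeguard before exporting is to drop any repeated ids from the combined list, as listings may return the same post more than once. The sketch below assumes each post stores its id under `post['data']['id']`; the helper `dedupe_by_id` and the sample data are hypothetical.

```python
def dedupe_by_id(posts):
    """Keep the first occurrence of each post id, preserving order."""
    seen = set()
    unique = []
    for post in posts:
        post_id = post['data']['id']
        if post_id not in seen:
            seen.add(post_id)
            unique.append(post)
    return unique


# Made-up posts in which the id 'a' appears twice.
sample = [
    {'data': {'id': 'a'}},
    {'data': {'id': 'b'}},
    {'data': {'id': 'a'}},
]

unique_posts = dedupe_by_id(sample)
unique_ids = [post['data']['id'] for post in unique_posts]
```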

In [11]:
with open('../data/test_dump3.json', 'w+') as f:
    json.dump(reddit_posts, f)

#### Reddit posts exported as a JSON file for further processing.
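
A quick way to confirm that an export like this worked is to load the file back and compare it with what was written. The sketch below uses a temporary file and made-up posts rather than the actual `../data/test_dump3.json`, so the file name `sample_dump.json` and the contents of `sample_posts` are assumptions for illustration only.

```python
import json
import os
import tempfile

# Made-up posts standing in for the scraped list.
sample_posts = [
    {'kind': 't3', 'data': {'id': 'abc123', 'title': 'Example one'}},
    {'kind': 't3', 'data': {'id': 'def456', 'title': 'Example two'}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_dump.json')

    # Write the posts out the same way the notebook does.
    with open(path, 'w') as f:
        json.dump(sample_posts, f)

    # Read them back to check that nothing was lost in the round trip.
    with open(path) as f:
        loaded = json.load(f)
```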