# Project 3: Web APIs & NLP

**For project 3, your goal is two-fold:**

1. Using Pushshift's API, you'll collect posts from two subreddits of your choosing.
2. You'll then use NLP to train a classifier that predicts which subreddit a given post came from. This is a binary classification problem.

### Requirements
- Gather and prepare your data using the requests library.
- Create and compare two models. One of these must be a Bayes classifier; the other can be any classifier of your choosing, such as logistic regression, KNN, or SVM.
- Write an executive summary of your results.

#### Steps
1. Data Collection
2. EDA and data cleaning
3. Feature Engineering
4. Modeling
5. Results

### 1. Data Collection

In [1]:
# Import the libraries
import pandas as pd
import requests
import time
import datetime as dt

In [2]:
# Columns kept for submissions. SUBFIELDS was not defined in this cell; this list
# is an assumption inferred from the columns shown in the `combined.head(2)` output.
SUBFIELDS = ['title', 'selftext', 'subreddit', 'created_utc',
             'author', 'num_comments', 'score', 'is_self']

def query_pushshift(subreddit, kind='submission', day_window=30, n=5):
    # API endpoint for the requested kind of content
    base_url = f"https://api.pushshift.io/reddit/search/{kind}"
    size = 1000
    # temporary storage: one DataFrame per request
    posts = []
    # make n requests, moving the `after` cutoff back by `day_window` days each time
    for i in range(1, n + 1):
        response = requests.get(base_url,
                                params={
                                    'subreddit': subreddit,
                                    'size': size,
                                    'after': f'{day_window*i}d'
                                })
        print(f'Getting data from subreddit {subreddit} after {day_window*i} days')
        # raise on HTTP error responses (4xx/5xx) instead of continuing silently
        response.raise_for_status()
        data = response.json()['data']
        posts.append(pd.DataFrame(data))
        # pause between requests to avoid hammering the API
        time.sleep(2)
    # combine the per-request DataFrames
    full = pd.concat(posts, sort=False)
    if kind == "submission":
        # keep the desired columns and drop posts returned by more than one request
        full = full[SUBFIELDS].drop_duplicates()
        # keep self (text) posts only; .copy() avoids a SettingWithCopyWarning below
        full = full.loc[full['is_self'] == True].copy()
    # create `timestamp` column
    # note: date.fromtimestamp converts using the machine's local time zone
    full['timestamp'] = full["created_utc"].map(dt.date.fromtimestamp)
    print("Query Complete!")
    return full
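
To make the request schedule explicit, here is a minimal sketch of the parameters the loop above sends on each pass. `build_query_params` is a hypothetical helper, not part of the notebook or of Pushshift; it only mirrors the `subreddit`, `size`, and `after` values used in `query_pushshift`.

```python
def build_query_params(subreddit, day_window=30, n=5, size=1000):
    """Return one params dict per request, mirroring the loop in query_pushshift."""
    return [
        {'subreddit': subreddit, 'size': size, 'after': f'{day_window * i}d'}
        for i in range(1, n + 1)
    ]


params = build_query_params('Bitcoin')
offsets = [p['after'] for p in params]
print(offsets)
```

With the defaults this gives five requests with `after` offsets from 30 to 150 days, matching the progress messages the function prints. The resulting windows may overlap, which is presumably why the function drops duplicates afterwards.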

In [3]:
bitcoin_df = query_pushshift("Bitcoin")

Getting data from subreddit Bitcoin after 30 days
Getting data from subreddit Bitcoin after 60 days
Getting data from subreddit Bitcoin after 90 days
Getting data from subreddit Bitcoin after 120 days
Getting data from subreddit Bitcoin after 150 days
Query Complete!


In [4]:
ethereum_df = query_pushshift('ethereum')

Getting data from subreddit ethereum after 30 days
Getting data from subreddit ethereum after 60 days
Getting data from subreddit ethereum after 90 days
Getting data from subreddit ethereum after 120 days
Getting data from subreddit ethereum after 150 days
Query Complete!


In [5]:
print(bitcoin_df.shape)
ethereum_df.shape

(2471, 9)


(1674, 9)

In [6]:
combined = pd.concat([bitcoin_df, ethereum_df], axis = 0).reset_index(drop = True)
combined.shape

(4145, 9)
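
The two classes are not the same size, which gives a useful reference point for the modeling step. As a worked example, this sketch computes the majority-class baseline, i.e. the accuracy of always predicting the larger subreddit, from the row counts printed above. The variable names are illustrative.

```python
# Row counts taken from the shapes printed above
n_bitcoin = 2471
n_ethereum = 1674
n_total = n_bitcoin + n_ethereum  # 4145, matching combined.shape[0]

# Accuracy of a model that always predicts the larger class
baseline_accuracy = max(n_bitcoin, n_ethereum) / n_total
print(f'Majority-class baseline: {baseline_accuracy:.3f}')
```

A classifier is only informative on this data if it beats an accuracy of roughly 0.596.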

In [7]:
combined.head(2)

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | With all this stimulus- why is bitcoin acting ... | | Bitcoin | 1585371522 | Somebodykilledmybro | 6 | 1 | True | 2020-03-27 |
| 1 | Abra Wallet | Is abra wallet a good place to store bitcoin i... | Bitcoin | 1585373310 | JayR111 | 4 | 1 | True | 2020-03-27 |
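
`created_utc` is a Unix epoch in UTC, while `dt.date.fromtimestamp` converts using the local time zone of the machine running the notebook, so the `timestamp` values above depend on where the code was run. Below is a minimal sketch of a conversion that does not depend on the local time zone, applied to the first row's `created_utc`; `to_utc_date` is a hypothetical helper name.

```python
import datetime as dt


def to_utc_date(epoch_seconds):
    """Convert a Unix epoch (seconds) to a calendar date in UTC."""
    return dt.datetime.fromtimestamp(epoch_seconds, tz=dt.timezone.utc).date()


print(to_utc_date(1585371522))  # 2020-03-28
```

In UTC that post falls on 2020-03-28, whereas the table shows 2020-03-27, which is consistent with the notebook having been run in a time zone behind UTC. If UTC dates are preferred, the helper could be applied with `full['created_utc'].map(to_utc_date)`.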


In [8]:
combined['selftext'][2]

'&amp;#x200B;\n\nhttps://preview.redd.it/6dr71yvvncp41.jpg?width=1410&amp;format=pjpg&amp;auto=webp&amp;s=c3eab9d646265c291355d280e18368f105053e45'
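
The post above shows the kind of noise the cleaning step has to handle: double-escaped HTML entities (`&amp;#x200B;` is an escaped zero-width space) and a bare image URL. Here is a minimal sketch of one possible cleanup using only the standard library. `clean_selftext` is a hypothetical helper, and both the double unescape and the choice to drop URLs entirely are assumptions rather than the notebook's method.

```python
import html
import re

URL_PATTERN = re.compile(r'https?://\S+')


def clean_selftext(text):
    """Decode HTML entities, drop URLs and zero-width spaces, collapse whitespace."""
    # Unescape twice: '&amp;#x200B;' -> '&#x200B;' -> zero-width space
    text = html.unescape(html.unescape(text))
    # Replace URLs with a space so neighbouring words stay separated
    text = URL_PATTERN.sub(' ', text)
    # The zero-width space is not treated as whitespace by str.split, so remove it explicitly
    text = text.replace('\u200b', '')
    return ' '.join(text.split())


sample = ('&amp;#x200B;\n\nhttps://preview.redd.it/6dr71yvvncp41.jpg'
          '?width=1410&amp;format=pjpg&amp;auto=webp'
          '&amp;s=c3eab9d646265c291355d280e18368f105053e45')
cleaned = clean_selftext(sample)
print(repr(cleaned))  # ''
```

This particular post cleans down to an empty string, which suggests that image-only posts may need to be dropped or represented by their `title` alone before vectorizing.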

In [9]:
%store ethereum_df
%store bitcoin_df
%store combined

Stored 'ethereum_df' (DataFrame)
Stored 'bitcoin_df' (DataFrame)
Stored 'combined' (DataFrame)
