# Project 3: Web APIs & NLP

**For project 3, your goal is two-fold:**

1. Using Pushshift's API, collect posts from two subreddits of your choosing.
2. Use NLP to train a classifier that predicts which subreddit a given post came from. This is a binary classification problem.

### Requirements
- Gather and prepare your data using the requests library.
- Create and compare two models. One of these must be a Bayes classifier; the other can be any classifier of your choosing (logistic regression, KNN, SVM, etc.).
- Write an executive summary of your results.
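
The two-model requirement can be sketched as follows. This is a minimal illustration rather than the notebook's actual modeling code: it assumes scikit-learn is installed, takes the "Bayes classifier" to be a multinomial Naive Bayes model, picks logistic regression as the second model, and uses a handful of made-up toy posts in place of the collected data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy posts; in the project these would be the collected post texts.
texts = [
    "bitcoin halving and block reward",
    "lightning network payment channel",
    "bitcoin mining difficulty rises",
    "cold storage for my bitcoin wallet",
    "ethereum smart contract gas fees",
    "solidity contract deployed on ethereum",
    "ethereum staking and validators",
    "gas price for token contract",
]
labels = ["Bitcoin"] * 4 + ["ethereum"] * 4

# Hold out a stratified test set so both classes appear in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Each pipeline turns raw text into word counts, then fits a classifier.
models = {
    "naive_bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    "logistic_regression": make_pipeline(
        CountVectorizer(), LogisticRegression(max_iter=1000)
    ),
}

# Fit both models on the same split and record test accuracy for comparison.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

print(scores)
```

With only eight toy documents the scores carry no meaning; the point is that both models share the same split and the same vectorizer settings, so their accuracies are comparable.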

#### Steps
1. Data Collection
2. EDA and data cleaning
3. Feature Engineering
4. Modeling
5. Results

### 1. Data Collection

In [24]:
# Import the libraries
import pandas as pd
import requests
import time
import datetime as dt

In [25]:
def query_pushshift(subreddit, kind='submission', day_window=30, n=5):
    """Collect posts from `subreddit` using `n` look-back windows of `day_window` days each."""
    SUBFIELDS = ['title', 'selftext', 'subreddit', 'created_utc', 'author', 'num_comments', 'score', 'is_self']
    # API endpoint for the requested kind of content
    BASE_URL = f"https://api.pushshift.io/reddit/search/{kind}"
    size = 1000  # number of results requested per call
    # temporary storage: one DataFrame per request
    posts = []
    for i in range(1, n + 1):
        response = requests.get(BASE_URL,
                                params={
                                    'subreddit': subreddit,
                                    'size': size,
                                    'after': f'{day_window*i}d'
                                })
        print(f'Getting data from subreddit {subreddit} after {day_window*i} days')
        # raise an HTTPError on any 4xx/5xx response instead of relying on assert
        response.raise_for_status()
        data = response.json()['data']
        posts.append(pd.DataFrame(data))
        # pause between requests to avoid overloading the API
        time.sleep(2)
    # combine the per-request DataFrames
    full = pd.concat(posts, sort=False)
    if kind == "submission":
        # keep the desired columns and drop rows returned by more than one request
        full = full[SUBFIELDS].drop_duplicates()
        # keep self (text) posts only; .copy() avoids a SettingWithCopyWarning below
        full = full.loc[full['is_self'] == True].copy()
    # convert the epoch seconds in `created_utc` to a date (machine's local time zone)
    full['timestamp'] = full["created_utc"].map(dt.date.fromtimestamp)
    print("Query Complete!")
    return full
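
To make the loop above concrete, the sketch below builds only the query parameters that each iteration sends, without touching the network. `build_window_params` is a hypothetical helper introduced for illustration; it is not part of the notebook or of the Pushshift API.

```python
def build_window_params(subreddit, day_window=30, n=5, size=1000):
    """Return the params dict for each of the n requests made by the loop."""
    return [
        {
            "subreddit": subreddit,
            "size": size,
            # e.g. "30d", "60d", ...: posts created after that many days ago
            "after": f"{day_window * i}d",
        }
        for i in range(1, n + 1)
    ]


for params in build_window_params("Bitcoin"):
    print(params["after"])
```

Every window extends to the present, so the windows are nested and the same post may be returned by more than one request. That is presumably why the function drops duplicates after concatenating.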

In [26]:
bitcoin_df = query_pushshift("Bitcoin")

Getting data from subreddit Bitcoin after 30 days
Getting data from subreddit Bitcoin after 60 days
Getting data from subreddit Bitcoin after 90 days
Getting data from subreddit Bitcoin after 120 days
Getting data from subreddit Bitcoin after 150 days
Query Complete!


In [27]:
ethereum_df = query_pushshift('ethereum')

Getting data from subreddit ethereum after 30 days
Getting data from subreddit ethereum after 60 days
Getting data from subreddit ethereum after 90 days
Getting data from subreddit ethereum after 120 days
Getting data from subreddit ethereum after 150 days
Query Complete!


In [41]:
ripple_df = query_pushshift('Ripple')

Getting data from subreddit Ripple after 30 days
Getting data from subreddit Ripple after 60 days
Getting data from subreddit Ripple after 90 days
Getting data from subreddit Ripple after 120 days
Getting data from subreddit Ripple after 150 days
Query Complete!


In [39]:
print(bitcoin_df.shape)
ethereum_df.shape

(2405, 9)


(1679, 9)

In [47]:
ripple_df.shape

(646, 9)

In [48]:
combined = pd.concat([bitcoin_df, ethereum_df], axis = 0).reset_index(drop = True)
combined.shape

(4084, 9)
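
The two classes are not balanced, which matters when judging a classifier later. As a small worked calculation using only the row counts printed above, the accuracy of always guessing the larger class is:

```python
# Row counts taken from the .shape outputs above.
class_counts = {"Bitcoin": 2405, "ethereum": 1679}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print(f"{majority_class}: {baseline_accuracy:.3f}")  # Bitcoin: 0.589
```

Any model worth keeping should beat roughly 58.9% accuracy. On the DataFrame itself, `combined['subreddit'].value_counts(normalize=True)` should give the same proportions.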

In [51]:
combined.head(2)

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | timestamp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 🏦Remitano is Rewarding Every Delayed Transacti... | Starting from ⏱February 10, 2020, Remitano wil... | Bitcoin | 1585025915 | remimay | 0 | 1 | True | 2020-03-23 |
| 1 | Veteran Trader: Bitcoin Should be Viewed as “C... | [removed] | Bitcoin | 1585026571 | ProfessionalUnit4 | 8 | 1 | True | 2020-03-23 |
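
The `timestamp` column is built with `dt.date.fromtimestamp`, which converts using the local time zone of the machine running the notebook, so the dates can differ from one machine to another. A minimal sketch of a UTC-based conversion follows; `utc_date` is a hypothetical helper name, not something defined in the notebook.

```python
from datetime import datetime, timezone


def utc_date(created_utc):
    """Convert epoch seconds to a calendar date in UTC."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).date()


# created_utc of the first row shown above
print(utc_date(1585025915))  # 2020-03-24
```

For the first row this gives 2020-03-24 rather than the 2020-03-23 shown above, which suggests the notebook ran in a time zone behind UTC. If whole-column conversion is preferred, `pd.to_datetime(combined['created_utc'], unit='s', utc=True)` is one option.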


In [53]:
combined['selftext'][2]

"The Bitcoin Seminars that were previously going to be held in March, 2020 are postponed to June 2020 due to the COVID-19 Crisis. But the event is still on.\n\n# [Crowdfunding link to make this event a huge success.](https://tallyco.in/s/7zhcb1/)\n\n**Some Details:** \n\nCryptocurrency ban has been lifted in India &amp; buying, selling &amp; holding is completely Legalized. We are organizing Bitcoin Seminars across the best Tech institutes in India.\n\nIndia is #2 most populated country in the world and Bitcoin has already faced a hard time staying in people's radar here. We are a team of 27 Tech-savvy individuals who are planning several seminars on Bitcoin &amp; Blockchain technology across the Engineering &amp; Technology Colleges in India. The seminars will be held starting June 25, 2020.\n\nWe are in need of funds to cover our Travel, food, lodging expenses as well as Books, pamphlets, brochures printing material. So any kind donation would be greatly appreciated.\n\n**The college

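The raw `selftext` above still contains HTML entities such as `&amp;` and literal newlines, which would otherwise leak into the tokens during the cleaning step. One possible first pass, using only the standard library, is sketched below; `clean_text` is a hypothetical helper, and it deliberately leaves Markdown markup and URLs untouched.

```python
import html


def clean_text(text):
    """Decode HTML entities and collapse runs of whitespace into single spaces."""
    decoded = html.unescape(text)      # "&amp;" becomes "&"
    return " ".join(decoded.split())   # newlines and repeated spaces become one space


# Fragment copied from the selftext shown above.
sample = (
    "**Some Details:** \n\nCryptocurrency ban has been lifted in India "
    "&amp; buying, selling &amp; holding is completely Legalized."
)
print(clean_text(sample))
```

Applied to the collected data, this could be mapped over the column, for example `combined['selftext'].map(clean_text)`, assuming missing values and `[removed]` placeholders are handled first.
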
In [50]:
%store ethereum_df
%store bitcoin_df
%store combined
%store ripple_df

Stored 'ethereum_df' (DataFrame)
Stored 'bitcoin_df' (DataFrame)
Stored 'combined' (DataFrame)
Stored 'ripple_df' (DataFrame)
