In this project, we will practice two major skills: collecting data via an API request and building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

### Scraping Thread Info from Reddit.com

#### Set up a request (using requests) to the URL below. 

*NOTE*: Reddit will throw a [429 error](https://httpstatuses.com/429) when using the following code:
```python
res = requests.get(URL)
```

This is because Reddit throttles Python's default user agent. You'll need to set a custom `User-agent` header to get your request to work.
```python
res = requests.get(URL, headers={'User-agent': 'YOUR NAME Bot 0.1'})
```

In [1]:
#imports 
import requests
import json
import pandas as pd
import time

In [2]:
url = "https://www.reddit.com/r/Todayilearned/.json"

In [3]:
## YOUR CODE HERE
res = requests.get(url,headers = {'User-agent':'OAUTH 2'})

In [4]:
res.status_code

200

#### Use `res.json()` to convert the response into a dictionary format and set this to a variable. 

```python
data = res.json()
```

In [5]:
data = res.json()

#### Getting more results

By default, Reddit will give you the top 25 posts:

```python
print(len(data['data']['children']))
```

If you want more, you'll need to do three things:
1. Get the name of the last post: `data['data']['after']`
2. Use that name to hit the following url: `http://www.reddit.com/r/boardgames.json?after=THE_AFTER_FROM_STEP_1`
3. Create a loop to repeat steps 1 and 2 until you have a sufficient number of posts. 

*NOTE*: Reddit will limit the number of requests per second you're allowed to make. When you create your loop, be sure to add the following after each iteration.

```python
time.sleep(3) # sleeps 3 seconds before continuing
```

This will throttle your loop and keep you within Reddit's guidelines. You'll need to import the `time` library for this to work!
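The three steps above can be sketched as a single loop. This is a minimal illustration rather than a reference implementation: `fetch_listing` and `collect_posts` are hypothetical helper names, and the `fetch` argument exists only so the paging logic can be exercised without hitting Reddit.

```python
import time

def fetch_listing(url, after=None):
    """Fetch one page of a subreddit's .json listing (hypothetical helper)."""
    import requests  # imported here so the paging logic can be tested offline
    params = {'after': after} if after else {}
    res = requests.get(url, params=params,
                       headers={'User-agent': 'YOUR NAME Bot 0.1'})
    res.raise_for_status()
    return res.json()

def collect_posts(url, max_pages=4, pause=3, fetch=fetch_listing):
    """Follow the 'after' token from page to page, pausing between requests."""
    posts, after = [], None
    for _ in range(max_pages):
        data = fetch(url, after)
        # each child wraps one post; keep the inner 'data' dictionary
        posts.extend(child['data'] for child in data['data']['children'])
        after = data['data']['after']
        if after is None:  # no further pages
            break
        time.sleep(pause)  # stay within the request limit
    return posts
```

Calling `collect_posts("https://www.reddit.com/r/Todayilearned/.json")` would then request up to four pages, sleeping three seconds between them.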

In [6]:
print(len(data['data']['children']))

25


In [7]:
## YOUR CODE HERE
after = data['data']['after']
after

't3_9e4lz9'

### testing getting next page

In [8]:
nextpage = url + '?after='+after

res = requests.get(nextpage,headers = {'User-agent':'OAUTH 2'})
res.status_code

200

In [9]:
nextpage_data = res.json()
nextpage_data['data']['after']


't3_9e5v2h'

### testing functions within functions

In [10]:
url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=FoodPorn&size=500'

In [11]:
#testing access to website
def test_page(url):
    try:
        res = requests.get(url,headers = {'User-agent':'OAUTH 2'})
    except requests.exceptions.RequestException:
        return('not valid URL')
    #only parse the body when the request succeeded
    if res.status_code < 400:
        return(res.json())
    return('status code error',res.status_code)
     

In [12]:
#testing function
data = test_page(url)

In [13]:
#looking at what's inside data
data.keys()

dict_keys(['data'])

In [14]:
#empty list
posts = []


In [15]:
#scraping website page
def get_page(data,post_list):
    #each entry in data['data'] is one submission; keep only its title
    for post in data['data']:
        post_list.append({'posts': post['title']})
    return(post_list)

In [16]:
get_page(data,posts)
len(posts)

500

In [17]:
#gets next page url
def next_page(url,page_num):
    #drop any existing '&after=' parameter before appending the new one
    if '&after=' in url:
        url = url[:url.index('&after=')]
    url = url + '&after='+ str(page_num) + 'd'
    print(url)
    return(url)

In [18]:
next_page(url,10)

https://api.pushshift.io/reddit/search/submission/?subreddit=FoodPorn&size=500&after=10d


'https://api.pushshift.io/reddit/search/submission/?subreddit=FoodPorn&size=500&after=10d'

In [19]:
TIL_url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=todayilearned&size=500'
food_url ='https://api.pushshift.io/reddit/search/submission/?subreddit=FoodPorn&size=500'

In [20]:
TIL_posts = []
food_posts = []

In [21]:
def start_scrape(this_url,post_list,page_start,page_end):
    for i in range(page_start,page_end):
        data = test_page(this_url)
        get_page(data,post_list)
        this_url = next_page(this_url,i)
        time.sleep(3) 
    return (post_list)

In [24]:
#start_scrape(food_url,food_posts,1,3)

In [25]:
len(food_posts)

1766

In [27]:
len(posts)

500

In [30]:
df = pd.DataFrame(food_posts)
#checking to make sure everything works correctly
#print(df.iloc[[0,499,500,1000,1499,1500,2499]])


### Save your results as a CSV
You may do this regularly while scraping data as well, so that if your scraper stops or your computer crashes, you don't lose all your data.

In [31]:
#takes post list and makes them into csv
def posts_to_csv(name):
    if len(food_posts) != 0:
        food_name = 'Food_' + name
        food_df = pd.DataFrame(food_posts)
        food_df.to_csv(path_or_buf = food_name)
        
    else:
        print('food posts is empty')
        
    if len(TIL_posts) != 0:
        #setting up name
        TIL_name = 'TIL_' + name
        #making posts into a df and then csv
        TIL_df = pd.DataFrame(TIL_posts)
        TIL_df.to_csv(TIL_name)
    else:
        print('TIL posts is empty')
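As a variation on the function above, a checkpoint can be written with `index=False` so that reloading the file does not produce an extra `Unnamed: 0` column. This is only a sketch: `save_checkpoint`, `demo_posts`, and the file name are hypothetical and not part of the original notebook.

```python
import os
import tempfile
import pandas as pd

def save_checkpoint(post_list, path):
    """Write the posts collected so far to `path`, leaving out the index."""
    pd.DataFrame(post_list).to_csv(path, index=False)

# Toy data standing in for a scraped list
demo_posts = [{'posts': 'Argentine dinner'},
              {'posts': 'Charcuterie and Negroni [OC]'}]

# A temporary directory keeps the demo from touching real data files
demo_path = os.path.join(tempfile.mkdtemp(), 'Food_checkpoint.csv')
save_checkpoint(demo_posts, demo_path)

reloaded = pd.read_csv(demo_path)
print(list(reloaded.columns))
```

Calling a helper like this every few pages inside the scraping loop would limit how much data is lost if the run is interrupted.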

In [109]:
start_scrape('https://api.pushshift.io/reddit/search/submission/?subreddit=FoodPorn&size=3',food_posts,1,3)

https://api.pushshift.io/reddit/search/submission/?subreddit=FoodPorn&size=3&after=1d
https://api.pushshift.io/reddit/search/submission/?subreddit=FoodPorn&size=3&after=2d


[{'posts': 'Beef, black bean and sweet potato enchiladas wrapped in chipotle tortillas, smothered in hatch chili sauce'},
 {'posts': 'This weekends Smoked Tri-Tip- Damn! Nice defined smoke ring.'},
 {'posts': 'Charcuterie and Negroni [OC]'},
 {'posts': 'A Cup of This Tea Is The Best Remedy for Stress'},
 {'posts': 'I (made) pastrami!'},
 {'posts': 'Arugala, Baby Spinach, Black Beans, Bean Sprouts, Red and Yellow Peppers, Mozzeralla, Sharp Cheddar, Feta with a Rasberry Vinaigrette with Eggplant Fries on the side'},
 {'posts': 'Chicago Deep Dish (4 minute mark)'},
 {'posts': '[oc] mushroom w/ parmesan cream and a pepperoni from Peel in Edwardsville IL'},
 {'posts': 'Pasta Salad with homemade vinaigrette [OC]'},
 {'posts': '[OC] Cheeseboard for friends.'},
 {'posts': '(I tried to make) Katz pastrami tonight'},
 {'posts': '[OC] Korean style of beef tartare with live octopus'},
 {'posts': "Couldn't find a good shrimp po boy near where in from, so i made one"},
 {'posts': "Poor Man's Burnt E

In [33]:
# Export to csv
posts_to_csv('test')

TIL posts is empty


## collecting data

In [None]:
#reset lists
TIL_posts = []
food_posts = []

In [None]:
#check whats in lists
print('TIL posts',len(TIL_posts))
print('Food posts',len(food_posts))

### day 1

In [None]:
# TIL scraping
start_scrape(TIL_url,TIL_posts,1,90)

# Food scraping
start_scrape(food_url,food_posts,1,90)

In [None]:
#turns the posts into csv
posts_to_csv('4k')

### day 2

In [None]:
# TIL scraping
start_scrape(TIL_url,TIL_posts,91,120)

# Food scraping
start_scrape(food_url,food_posts,91,120)

In [None]:
posts_to_csv('day2_4.5')

### day 3

In [None]:
# TIL scraping
start_scrape(TIL_url,TIL_posts,181,200)

# Food scraping
start_scrape(food_url,food_posts,181,200)

In [None]:
posts_to_csv('day3_9.5k')

## Cleaning

In [255]:
#importing food csv
food = pd.read_csv('./Data/Food_day3_9.5k')
food_df = pd.DataFrame(food)

#import TIL csv 
TIL = pd.read_csv('./Data/TIL_day3_9.5k')
TIL_df = pd.DataFrame(TIL)

In [256]:
#looking at food dataframe
food_df.head()
#food_df.shape

Unnamed: 0.1,Unnamed: 0,posts
0,0,Fried potatos with sour cream and butter
1,1,Argentine dinner
2,2,Prepping Some Budae Jjigae [2560x1440] OC
3,3,Creamy Garlic Butter Tuscan Shrimp
4,4,Pesto cavatappi with Parmesan chicken topped w...


In [257]:
#looking at TIL dataframe
TIL_df.head()
#TIL_df.shape

Unnamed: 0.1,Unnamed: 0,posts
0,0,TIL The anonymous informant who provided the W...
1,1,TIL TIL went full circle.
2,2,"The Keto Diet Isn't That Great, And Other Facts"
3,3,The Species with Amnesia Series - Episode #8 -...
4,4,"TIL about Freddie Oversteegen. She, along with..."


In [258]:
#drop Unnamed columns
food_df.drop(columns=['Unnamed: 0'],inplace = True)
TIL_df.drop(columns=['Unnamed: 0'],inplace = True)

In [259]:
#replace unnecessary characters
food_df = food_df.applymap(lambda cell:cell.lower().replace('.','')
                 .replace(',','').replace('?','').replace('-','').replace('[','')
                .replace(']','').replace('(','').replace(')','').replace('&','').replace('$','')
                 .replace(';','').replace(':','').replace('!',''))

TIL_df = TIL_df.applymap(lambda cell:cell.lower().replace('.','')
                 .replace(',','').replace('?','').replace('-','').replace('[','')
                .replace(']','').replace('(','').replace(')','').replace('&','').replace('$','')
                 .replace(';','').replace(':','').replace('!',''))
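The chained `.replace()` calls above can also be expressed with a translation table, which may be easier to maintain. A minimal sketch, assuming the same set of characters should be removed; `PUNCTUATION`, `STRIP_TABLE`, and `clean_title` are hypothetical names.

```python
# The characters removed by the chained .replace() calls
PUNCTUATION = ".,?-[]()&$;:!"

# str.maketrans with a third argument maps each listed character to None
STRIP_TABLE = str.maketrans('', '', PUNCTUATION)

def clean_title(title):
    """Lowercase a title and delete the listed punctuation characters."""
    return title.lower().translate(STRIP_TABLE)

print(clean_title('Charcuterie and Negroni [OC]'))  # charcuterie and negroni oc
```

On a DataFrame, the same function could be applied to the text column with `food_df['posts'].map(clean_title)`.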

In [260]:
food_df.head()

Unnamed: 0,posts
0,fried potatos with sour cream and butter
1,argentine dinner
2,prepping some budae jjigae 2560x1440 oc
3,creamy garlic butter tuscan shrimp
4,pesto cavatappi with parmesan chicken topped w...


In [261]:
TIL_df.head()

Unnamed: 0,posts
0,til the anonymous informant who provided the w...
1,til til went full circle
2,the keto diet isn't that great and other facts
3,the species with amnesia series episode #8 a...
4,til about freddie oversteegen she along with h...


##### setting target


In [304]:
len(df)

19000

In [263]:
#creating new binary column food = 0
food_df['subreddit'] = 0
TIL_df['subreddit'] = 1

In [264]:
#combining the dataframes
df = pd.concat([food_df,TIL_df])

In [265]:
df.head()

Unnamed: 0,posts,subreddit
0,fried potatos with sour cream and butter,0
1,argentine dinner,0
2,prepping some budae jjigae 2560x1440 oc,0
3,creamy garlic butter tuscan shrimp,0
4,pesto cavatappi with parmesan chicken topped w...,0


In [267]:
X = df['posts']
y = df['subreddit']

In [268]:
X.head()

0             fried potatos with sour cream and butter
1                                     argentine dinner
2              prepping some budae jjigae 2560x1440 oc
3                   creamy garlic butter tuscan shrimp
4    pesto cavatappi with parmesan chicken topped w...
Name: posts, dtype: object

In [269]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: subreddit, dtype: int64

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
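To make the difference between count, binary, and TF-IDF features concrete, here is a small sketch on two made-up titles. The documents are illustrative only and are not taken from the scraped data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two toy titles; 'butter' is repeated on purpose
docs = ['garlic butter butter shrimp', 'til the moon is far away']

# Raw counts: a repeated word contributes its number of occurrences
count_matrix = CountVectorizer().fit_transform(docs)

# Binary features: a word is either present (1) or absent (0)
binary_matrix = CountVectorizer(binary=True).fit_transform(docs)

# TF-IDF: counts reweighted by how rare each word is across documents
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

print(count_matrix.max(), binary_matrix.max())
print(tfidf_matrix.shape)
```

All three matrices share the same columns, so they can be swapped into the same model to compare performance.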

In [270]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split,cross_val_score

In [271]:
#train/test split
x_train,x_test,y_train,y_test = train_test_split(X,y)

In [272]:
## YOUR CODE HERE
#count vectorizer
countVec = CountVectorizer(stop_words='english',ngram_range=(1,4))
countVec.fit(x_train)
len(countVec.vocabulary_)

256738

In [273]:
countVec.get_feature_names()

['00',
 '00 skips',
 '00 skips leap',
 '00 skips leap day',
 '00 song',
 '00 song everytime',
 '00 song everytime touch',
 '000',
 '000 currently',
 '000 currently free',
 '000 currently free amazon',
 '000 people',
 '000 people today',
 '000 people today barely',
 '000 pieces',
 '000 pieces chewing',
 '000 pieces chewing gum',
 '000000000',
 '000000000 elementary',
 '000000000 elementary articles',
 '000000000 elementary articles observable',
 '000000000000000000000000000000000000',
 '000000000000000000000000000000000000 000000000',
 '000000000000000000000000000000000000 000000000 elementary',
 '000000000000000000000000000000000000 000000000 elementary articles',
 '000126',
 '000126 314',
 '000126 314 differs',
 '000126 314 differs 000159',
 '000159',
 '000251',
 '000251 boiling',
 '000251 boiling point',
 '000251 boiling point 999743',
 '0025',
 '0025 german',
 '0025 german citizens',
 '0025 german citizens 20646',
 '003',
 '003 27448985600',
 '003 27448985600 different',
 '003 27448

In [274]:
#converting x_train and x_test to matrix 
x_train_matrix = countVec.transform(x_train)
x_test_matrix = countVec.transform(x_test)

## Predicting subreddit using Random Forests + Another Classifier

In [275]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [276]:

rf = RandomForestClassifier()
rf.fit(x_train_matrix,y_train)

#### We want to predict a binary variable - class `0` for one of your subreddits and `1` for the other.

In [290]:

rf.predict(x_train_matrix)

array([0, 0, 1, ..., 0, 1, 1])

#### Thought experiment: What is the baseline accuracy for this model?

In [291]:

cross_val_score(rf,x_train_matrix,y_train,cv=5).mean()

0.9664561403508772
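One common reading of "baseline accuracy" is the score obtained by always predicting the most frequent class, which a model's cross-validated score can then be compared against. A minimal sketch with toy labels; `baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always guessing the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy labels: three posts from class 0, one from class 1
toy_labels = [0, 0, 0, 1]
print(baseline_accuracy(toy_labels))  # 0.75
```

Passing `y_train` to the same function would give the majority-class baseline for the actual data.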

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [293]:
from sklearn.model_selection import GridSearchCV

In [298]:

#grid search to find the best parameters
grid_params = {
   'min_samples_leaf':[1,3,5],
    'n_estimators':[10,15,25],
    'criterion':['gini','entropy'],
    'max_features':['auto',5,10]
}

gs = GridSearchCV(
    RandomForestClassifier(),
    grid_params,
    verbose = 1,
    cv = 5,
    n_jobs = 4
)

gs_results = gs.fit(x_train_matrix,y_train)

Fitting 5 folds for each of 54 candidates, totalling 270 fits


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:  3.3min
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed: 11.1min
[Parallel(n_jobs=4)]: Done 270 out of 270 | elapsed: 13.1min finished


In [299]:
#looking through the gridsearch
print('best params',gs.best_params_)
print('best score',gs.best_score_)

best params {'criterion': 'entropy', 'max_features': 'auto', 'min_samples_leaf': 3, 'n_estimators': 25}
best score 0.9738245614035088
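An alternative setup, sketched here under the assumption that refitting the vectorizer inside each cross-validation fold is desirable, wraps the vectorizer and the classifier in a `Pipeline` and searches over both. The toy titles, step names, and grid values are illustrative only.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy stand-ins for cleaned titles: 0 = food, 1 = TIL
toy_titles = [
    'garlic butter shrimp', 'smoked beef brisket',
    'garlic beef tacos', 'creamy butter pasta',
    'til the moon is far', 'til about a famous spy',
    'til the ocean is deep', 'til about old maps',
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The vocabulary is learned from the training part of each fold only
pipe = Pipeline([
    ('vec', CountVectorizer()),
    ('rf', RandomForestClassifier(random_state=0)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    'vec__ngram_range': [(1, 1), (1, 2)],
    'rf__n_estimators': [10, 25],
}

toy_search = GridSearchCV(pipe, param_grid, cv=2)
toy_search.fit(toy_titles, toy_labels)
print(toy_search.best_params_)
```

With this layout the raw text is passed to `fit`, so the n-gram range can be tuned alongside the forest's own parameters.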


In [300]:
#fitting random forest with best params
rf = RandomForestClassifier(criterion='entropy',min_samples_leaf=3,n_estimators=25)
rf.fit(x_train_matrix,y_train)

#getting prediction
pred = rf.predict(x_test_matrix)

#printing the accuracy score
accuracy_score(y_test,pred)

0.9728421052631578

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc)

In [301]:
#training logistic regression model on word vec
log = LogisticRegression()
log.fit(x_train_matrix,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [302]:
log_pred = log.predict(x_test_matrix)

In [303]:
accuracy_score(y_test,log_pred)

0.9652631578947368
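One way to approach the question of which text features matter most is to inspect the logistic regression coefficients: large positive values push a post toward class `1`, large negative values toward class `0`. The sketch below uses toy titles and hypothetical variable names; it is not the notebook's fitted model.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for cleaned titles: 0 = food, 1 = TIL
toy_titles = ['garlic butter shrimp', 'smoked beef brisket', 'garlic beef tacos',
              'til the moon is far', 'til about a famous spy', 'til the ocean is deep']
toy_labels = [0, 0, 0, 1, 1, 1]

toy_vec = CountVectorizer()
toy_matrix = toy_vec.fit_transform(toy_titles)
toy_log = LogisticRegression().fit(toy_matrix, toy_labels)

# Token list in column order, built from vocabulary_ so it does not
# depend on which feature-name method the installed version provides
tokens = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

# Indices of the coefficients from most negative to most positive
order = np.argsort(toy_log.coef_[0])
top_food = [tokens[i] for i in order[:3]]
top_til = [tokens[i] for i in order[-3:][::-1]]

print('pushes toward class 0:', top_food)
print('pushes toward class 1:', top_til)
```

Applied to `log` and `countVec` from the cells above, the same indexing would list the n-grams that most strongly separate the two subreddits.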