# Using Reddit's API for Predicting Comments

In this project, we will practice two major skills: collecting data via an API request and then building a binary predictor.

As we discussed in week 2, and earlier today, there are two components to starting a data science problem: the problem statement, and acquiring the data.

For this article, your problem statement will be: _What characteristics of a post on Reddit contribute most to what subreddit it belongs to?_

Your method for acquiring the data will be scraping threads from at least two subreddits. 

Once you've got the data, you will build a classification model that, using Natural Language Processing and any other relevant features, predicts which subreddit a given post belongs to.

## Getting Data

I used extractor.py contained in this folder to extract data, using the starter code instructions as a guide.
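The contents of extractor.py are not reproduced here. As a rough sketch only, and not a copy of that script, this is one way to turn a page of Reddit's public JSON listing into rows with the same `subreddit` and `data` columns used below. The function names, the User-Agent string, and the choice to join title and description are assumptions for illustration.

```python
import json
import urllib.request


def parse_listing(listing):
    """Return (rows, after) from one page of a Reddit JSON listing."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        rows.append({
            "subreddit": post["subreddit"],
            # Not every thread has a description, so selftext may be empty.
            "data": (post["title"] + " " + post.get("selftext", "")).strip(),
        })
    # "after" is the token used to request the next page (None on the last page).
    return rows, listing["data"].get("after")


def fetch_page(subreddit, after=None):
    """Download one listing page (performs a network request)."""
    url = "https://www.reddit.com/r/" + subreddit + ".json"
    if after:
        url += "?after=" + after
    # Hypothetical User-Agent string, chosen only for this example.
    request = urllib.request.Request(url, headers={"User-Agent": "subreddit-classifier-demo"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Offline demonstration with a hand-made payload in the same shape.
sample = {"data": {"after": "t3_abc", "children": [
    {"data": {"subreddit": "AskWomen", "title": "Favorite podcast?", "selftext": ""}},
]}}
rows, after = parse_listing(sample)
print(rows, after)
```

Calling `fetch_page` repeatedly with the returned `after` token, with a pause between requests, would walk through successive pages.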

## Begin Analysis

In [43]:
import pandas as pd
import numpy as np
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression

## Bring in Data, Last Cleaning Details

In [12]:
data = pd.read_csv('./data/data_cleaned.csv')
data.drop('Unnamed: 0', inplace = True, axis =1)
data.head()

,subreddit,data
0,AskWomen,Whats your favorite podcast/what podcasts are ...
1,AskWomen,How did you stop feeling rushed into finding l...
2,AskWomen,What’s romanticized in modern culture but real...
3,AskWomen,What is your opinion on people referring to ad...
4,AskWomen,"Introverted moms with extroverted, super talka..."


In [13]:
data['human'] = np.where(data['subreddit'] == 'AskWomen', 1, 0)
data['text'] = data['data']
data.drop(['data', 'subreddit'], inplace = True, axis =1)

In [14]:
X = list(data['text'])
y = data['human']

In [15]:
y.mean()
#Not overly unbalanced at .39586

0.39586485123550175
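Since about 39.6% of the posts carry the positive label, always predicting the majority class already gives a reference point for the accuracies reported later. A minimal sketch of that arithmetic, using the mean printed above:

```python
# Share of posts labelled 1 (AskWomen), copied from the output above.
positive_rate = 0.39586485123550175

# A classifier that always predicts the majority class is right this often.
baseline_accuracy = max(positive_rate, 1 - positive_rate)
print(round(baseline_accuracy, 4))  # 0.6041
```

Any model scoring close to .60 is therefore doing little better than guessing the larger class.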

In [22]:
# replace every non-alphanumeric character with a space
import re
for i in range(len(X)):
    X[i] = re.sub(r'[^a-zA-Z0-9]', " ", X[i])
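To see what this substitution does to a single title, here is a small standalone example. The sample strings are made up for illustration.

```python
import re

title = "Whats your favorite podcast/what podcasts do you listen to?"
cleaned = re.sub(r'[^a-zA-Z0-9]', " ", title)
print(repr(cleaned))

# Apostrophes are replaced too, so contractions are split in two.
contraction = re.sub(r'[^a-zA-Z0-9]', " ", "What's new")
print(repr(contraction))
```

Each removed character becomes one space, so the slash separates `podcast` and `what` instead of gluing them together, and the trailing question mark leaves a trailing space.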

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y)
print(y_train.mean(), y_test.mean())
#Good split based on means (makes sense since stratified)

0.3960995292535306 0.3951612903225806


## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
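As a small illustration of the difference between count and binary features, the sketch below vectorizes two made-up documents both ways. `binary=True` is a standard `CountVectorizer` option; the documents are invented for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good good podcast", "good advice"]

# Count features record how many times each term occurs.
count_features = CountVectorizer().fit_transform(docs).toarray()

# Binary features only record whether the term occurs at all.
binary_features = CountVectorizer(binary=True).fit_transform(docs).toarray()

# Columns are in alphabetical order: advice, good, podcast
print(count_features.tolist())   # [[0, 2, 1], [1, 1, 0]]
print(binary_features.tolist())  # [[0, 1, 1], [1, 1, 0]]
```

For short texts such as thread titles, most terms appear once, so the two representations may differ very little.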

### Start with Count Vectorizer

In [98]:
def cvec_model(X_train, y_train, m, n):
    # NOTE: relies on the global X_test and y_test for scoring
    print('n-gram range:', m, 'to', n)

    # initiate vectorizer and fit it on the training text
    cvec = CountVectorizer(analyzer = 'word', ngram_range = (m, n))
    cvec.fit(X_train, y_train)
    
    # transform data
    X_train_cvec = cvec.transform(X_train)
    X_test_cvec = cvec.transform(X_test)

    # define regression (liblinear supports the l1 penalty; newer
    # scikit-learn defaults to lbfgs, which does not)
    log_reg_cvec = LogisticRegression(penalty='l1', solver='liblinear')
    print('Cross Val Score:', cross_val_score(log_reg_cvec, X_train_cvec, y_train).mean())

    # set up GridSearchCV to optimize log reg
    # NOTE: best_log is never fitted, so the scores below come from
    # log_reg_cvec with its default C, not from a tuned model
    parameters = {'C': np.logspace(0, 3, 10)}
    best_log = GridSearchCV(log_reg_cvec, parameters)
    
    # fit and score regression
    log_reg_cvec.fit(X_train_cvec, y_train)
    print('Model Accuracy:', log_reg_cvec.score(X_test_cvec, y_test), '\n')
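To make the `ngram_range = (m, n)` argument concrete, here is a simplified pure-Python sketch of what a word n-gram range of m to n produces. It is an illustration only: the helper name `word_ngrams` is made up, and `CountVectorizer` uses its own tokenizer (by default it drops single-character tokens), so its output can differ.

```python
def word_ngrams(text, m, n):
    """Return all word n-grams of length m through n, shortest first."""
    tokens = text.lower().split()
    grams = []
    for size in range(m, n + 1):
        # Slide a window of `size` tokens across the text.
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams


unigrams_and_bigrams = word_ngrams("what is your opinion", 1, 2)
bigrams_only = word_ngrams("what is your opinion", 2, 2)
print(unigrams_and_bigrams)
print(bigrams_only)
```

A range starting at 2 throws away every single word, which matters for short titles where individual words carry most of the signal.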

Check range of n-grams

In [99]:
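for m in range(1, 3):
    for n in range(1, 5):
        # NOTE: m=2, n=1 is not a valid range (lower bound above upper bound).
        # In the run below it gives the same scores as 1 to 1; newer
        # scikit-learn releases reject it with an error.
        cvec_model(X_train, y_train, m, n)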

n-gram range: 1 to 1
Cross Val Score: 0.7760746295957563
Model Accuracy: 0.7782258064516129 

n-gram range: 1 to 2
Cross Val Score: 0.7706928439322805
Model Accuracy: 0.7862903225806451 

n-gram range: 1 to 3
Cross Val Score: 0.7706928439322805
Model Accuracy: 0.7862903225806451 

n-gram range: 1 to 4
Cross Val Score: 0.7706928439322805
Model Accuracy: 0.7862903225806451 

n-gram range: 2 to 1
Cross Val Score: 0.7760746295957563
Model Accuracy: 0.7782258064516129 

n-gram range: 2 to 2
Cross Val Score: 0.6139828872223239
Model Accuracy: 0.6229838709677419 

n-gram range: 2 to 3
Cross Val Score: 0.6159976695187963
Model Accuracy: 0.6229838709677419 

n-gram range: 2 to 4
Cross Val Score: 0.6173417610037328
Model Accuracy: 0.6270161290322581 



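Model score stays around the same value whenever unigrams are included (about .78 on the test set in the run shown) and drops to about .62 once the range starts at bigrams.  
NB: The values shown are from the version with the GridSearchCV step and lasso penalty. Scores were higher on both with the other set of results contained in the PowerPoint information, where the ceiling seemed to be around .87 or .88.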
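The NB above mentions GridSearchCV. As a general reminder, a grid search only selects a value of `C` once `fit` is called on the search object, after which `best_params_` and `best_score_` become available. The sketch below shows that pattern on a made-up toy corpus. The pipeline step names `cvec` and `logreg`, the toy texts, and `cv=2` are assumptions for illustration, not the settings behind the scores above.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy corpus: label 1 mentions "women", label 0 mentions "men".
texts = ["women what podcast", "women how work", "women favorite book",
         "women best advice", "men what podcast", "men how work",
         "men favorite book", "men best advice"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("cvec", CountVectorizer(ngram_range=(1, 2))),
    # liblinear is one of the solvers that supports the l1 (lasso) penalty
    ("logreg", LogisticRegression(penalty="l1", solver="liblinear")),
])
params = {"logreg__C": np.logspace(0, 3, 10)}

search = GridSearchCV(pipe, params, cv=2)
search.fit(texts, labels)  # the search only runs when fit is called

print(search.best_params_)
print(search.best_score_)
```

Putting the vectorizer inside the pipeline also means the vocabulary is rebuilt within each cross-validation fold, so the held-out fold does not influence the features.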

### Now try TF-IDF Vectorizer

In [100]:
def tvec_model(X_train, y_train, m, n):
    # NOTE: relies on the global X_test and y_test for scoring
    print('n-gram range:', m, 'to', n)

    # initiate vectorizer (English stop words removed) and fit on training text
    tvec = TfidfVectorizer(analyzer = 'word', ngram_range= (m, n),
                          stop_words = 'english')
    tvec.fit(X_train, y_train)
    
    # transform data
    X_train_tvec = tvec.transform(X_train)
    X_test_tvec = tvec.transform(X_test)
    
    # define regression (liblinear supports the l1 penalty; newer
    # scikit-learn defaults to lbfgs, which does not)
    log_reg_tvec = LogisticRegression(penalty='l1', solver='liblinear')
    print('Cross Val Score:', cross_val_score(log_reg_tvec, X_train_tvec, y_train).mean())

    # set up GridSearchCV to optimize log reg
    # NOTE: best_log is never fitted, so the scores below come from
    # log_reg_tvec with its default C, not from a tuned model
    parameters = {'C': np.logspace(0, 3, 10)}
    best_log = GridSearchCV(log_reg_tvec, parameters)
    
    # fit and score regression
    log_reg_tvec.fit(X_train_tvec, y_train)
    print("Model Accuracy:", log_reg_tvec.score(X_test_tvec, y_test), '\n')
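For intuition about what the TF-IDF weights represent, here is a pure-Python sketch of the weighting that, to my understanding, matches the default `TfidfVectorizer` settings: smoothed idf, raw term counts, and L2 normalization. The three tiny documents and the helper names are made up for illustration.

```python
import math

docs = [["podcast", "advice"], ["podcast", "work"], ["podcast", "podcast", "book"]]
n_docs = len(docs)


def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf(doc):
    """Term counts times idf, scaled to unit Euclidean length."""
    weights = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


weights = tfidf(docs[0])
print(weights)
```

A term that appears in every document ("podcast" here) gets the minimum idf of 1, so the rarer term "advice" ends up with the larger weight in the first document.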

Check range of n-gram values

In [101]:
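for m in range(1, 3):
    for n in range(1, 6):
        # NOTE: m=2, n=1 is not a valid range (lower bound above upper bound).
        # In the run below it gives the same scores as 1 to 1; newer
        # scikit-learn releases reject it with an error.
        tvec_model(X_train, y_train, m, n)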

n-gram range: 1 to 1
Cross Val Score: 0.7329905358074372
Model Accuracy: 0.7681451612903226 

n-gram range: 1 to 2
Cross Val Score: 0.665123400334668
Model Accuracy: 0.6713709677419355 

n-gram range: 1 to 3
Cross Val Score: 0.6644472897994024
Model Accuracy: 0.6733870967741935 

n-gram range: 1 to 4
Cross Val Score: 0.659073633721521
Model Accuracy: 0.6754032258064516 

n-gram range: 1 to 5
Cross Val Score: 0.6577322520984493
Model Accuracy: 0.6713709677419355 

n-gram range: 2 to 1
Cross Val Score: 0.7329905358074372
Model Accuracy: 0.7681451612903226 

n-gram range: 2 to 2
Cross Val Score: 0.6119708147877162
Model Accuracy: 0.6088709677419355 

n-gram range: 2 to 3
Cross Val Score: 0.6052476475011687
Model Accuracy: 0.6088709677419355 

n-gram range: 2 to 4
Cross Val Score: 0.604574246827768
Model Accuracy: 0.6088709677419355 

n-gram range: 2 to 5
Cross Val Score: 0.6039008461543673
Model Accuracy: 0.6088709677419355 



Similar pattern again: unigram-only TF-IDF does best (about .77 on the test set), adding longer n-grams lowers the score, and no model breaks .9.

## Trying Decision Trees and Random Forest

### Decision Tree w/ Count Vectorizer

In [141]:
# build and fit CV model to use data in others.
cvec_final = CountVectorizer(ngram_range=(1,3))
cv_model = cvec_final.fit(X_train, y_train)

#transform X data
cv_train = cvec_final.transform(X_train)
cv_test = cvec_final.transform(X_test)

In [142]:
from sklearn.tree import DecisionTreeClassifier

# build and initial test
dt_tree = DecisionTreeClassifier()

# fit baseline model
dt_tree.fit(cv_train, y_train)
print('Train Model Accuracy:', dt_tree.score(cv_train, y_train))
print('Test Model Accuracy:', dt_tree.score(cv_test, y_test))

Train Model Accuracy: 1.0
Test Model Accuracy: 0.8689516129032258


In [143]:
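# use grid search for parameters check
# NOTE: a tree with at most 8 leaves can be at most 7 levels deep, so the
# max_depth values searched here never constrain it
params = {'max_depth': range(60, 90, 3),
          'max_leaf_nodes': range(2, 10, 2)}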
dt = GridSearchCV(dt_tree, params)
best = dt.fit(cv_train, y_train)

In [144]:
print(best.best_params_)
print(best.best_score_)

{'max_depth': 60, 'max_leaf_nodes': 8}
0.8412911903160726


In [145]:
# run with GridSearch results
dt_opt = DecisionTreeClassifier(max_depth=60,
                                max_leaf_nodes = 8)
dt_opt.fit(cv_train, y_train)

print('Train Model Accuracy:', dt_opt.score(cv_train, y_train))
print('Test Model Accuracy:', dt_opt.score(cv_test, y_test))

Train Model Accuracy: 0.8480161398789509
Test Model Accuracy: 0.8608870967741935
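A fitted tree can also help with the question of which text features matter most. The sketch below ranks terms by `feature_importances_` on a made-up toy corpus; the texts, labels, and variable names are invented for illustration and are not results from the Reddit data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy corpus in which only the word "ladies" separates the classes cleanly.
texts = ["ladies what podcast", "ladies how work", "ladies what work",
         "guys what podcast", "men how work", "what podcast how"]
labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(texts)
tree = DecisionTreeClassifier(random_state=0).fit(features, labels)

# vocabulary_ maps each term to its column index; invert it to label columns.
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}
ranked = sorted(enumerate(tree.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for index, importance in ranked[:3]:
    print(index_to_term[index], round(importance, 3))
```

The same ranking could be applied to `dt_opt` and `cvec_final` from the cells above, keeping in mind that importances from a single tree can be unstable.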


### Random Forest

In [146]:
## Random Forest Classifier Model
rf = RandomForestClassifier(random_state=42)
print("Cross Val Score:", cross_val_score(rf, cv_train, y_train).mean())

Cross Val Score: 0.8601196404013306


In [147]:
# fit and run baseline model
rf.fit(cv_train, y_train)
print("Train Model Accuracy:", rf.score(cv_train, y_train))
print("Test Model Accuracy:", rf.score(cv_test, y_test))

Train Model Accuracy: 0.995965030262273
Test Model Accuracy: 0.8528225806451613


In [150]:
# use grid search for parameters check
params = {'n_estimators': range(10, 60, 5),
          'max_depth': range(10, 20, 2),
          'max_leaf_nodes': range(10, 15, 1)}

rf_best = GridSearchCV(rf, params)
rf_best.fit(cv_train, y_train)

#report results
print(rf_best.best_params_)
print(rf_best.best_score_)

{'max_depth': 12, 'max_leaf_nodes': 14, 'n_estimators': 40}
0.7935440484196369


In [181]:
# run model with grid search params
rf_opt = RandomForestClassifier(n_estimators=40,
                                max_leaf_nodes=14,
                                max_depth = 12)

rf_opt.fit(cv_train, y_train)

print("Train Model Accuracy:", rf_opt.score(cv_train, y_train))
print("Test Model Accuracy:", rf_opt.score(cv_test, y_test))

Train Model Accuracy: 0.8298587760591796
Test Model Accuracy: 0.8084677419354839


## Other Models

In [193]:
# Multinomial NB, using the Count Vec with n-gram (1, 3) and stop words INCLUDED
from sklearn.naive_bayes import MultinomialNB

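# Baseline model
# initiate MNB model and feed transformed data
# NOTE: alpha=0 turns off additive smoothing; the scikit-learn version used
# here warns about this and substitutes a very small alpha instead
reddit_classify_model = MultinomialNB(alpha=0)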
reddit_classify_model.fit(cv_train, y_train)


print("Score on Train Data:", reddit_classify_model.score(cv_train, y_train))
print("Score on Test Data:", reddit_classify_model.score(cv_test, y_test))

Score on Train Data: 1.0
Score on Test Data: 0.7681451612903226
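The gap between the train and test scores may be related to the lack of smoothing. In multinomial naive Bayes the likelihood of a word given a class is estimated as (count + alpha) / (total + alpha * vocabulary size), so with alpha at or near zero any n-gram not seen in a class during training gets a likelihood of essentially zero. A minimal sketch of that formula, with hypothetical counts:

```python
def word_likelihood(count_in_class, total_in_class, vocab_size, alpha):
    """Smoothed estimate of P(word | class) for multinomial naive Bayes."""
    return (count_in_class + alpha) / (total_in_class + alpha * vocab_size)


# Hypothetical numbers: a word never seen in this class, 1,000 word
# occurrences in the class overall, and a 500-term vocabulary.
unsmoothed = word_likelihood(0, 1000, 500, alpha=0)
smoothed = word_likelihood(0, 1000, 500, alpha=1)
print(unsmoothed)  # 0.0
print(smoothed)    # 1/1500, about 0.00067
```

Trying a few larger alpha values (for example through a grid search) would be a natural next step before drawing conclusions from this baseline.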



In [189]:
from sklearn.metrics import confusion_matrix
y_hat = reddit_classify_model.predict(cv_test)
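# NOTE: scikit-learn's signature is confusion_matrix(y_true, y_pred); with the
# arguments in this order, rows are predicted labels and columns are actual labels
print(confusion_matrix(y_hat, y_test))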

[[269  55]
 [ 31 141]]
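The matrix can be turned into accuracy, precision, and recall by hand. Because the call above passes `(y_hat, y_test)`, rows are predicted labels and columns are actual labels, which is the transpose of scikit-learn's usual layout. A small sketch using the printed values:

```python
# Matrix copied from the output above: rows are predicted, columns are actual.
matrix = [[269, 55],
          [31, 141]]

true_neg, false_neg = matrix[0]   # posts predicted as class 0
false_pos, true_pos = matrix[1]   # posts predicted as class 1

total = true_neg + false_neg + false_pos + true_pos
accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
# 0.8266 0.8198 0.7194
```

The accuracy implied by this matrix does not match the test score printed in the previous cell, which suggests the two cells were run against different versions of the model or split.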


# Executive Summary
---
Put your executive summary in a Markdown cell below.