
# Project 3.03 - Modeling

## Table of Contents

- [Functions](#Functions)

- [Imports](#Imports)

- [Null Hypothesis](#Null-Hypothesis) 

- [Model Pre-processing](#Model-Pre_Processing)

- [Vector Selection](#Vector-Selection)

- [Logistic Regression](#Logistic-Regression)

- [Multinomial Naive Bayes](#Multinomial-Naive-Bayes)

- [Random Forest](#Random-Forest)

- [K Nearest Neighbors](#K-Nearest-Neighbors)

## Imports

In [84]:
# Imports
#Import Cleaning and Viz packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
%matplotlib inline

#Import SK Learn Modeling Libraries
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
#from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import confusion_matrix, accuracy_score, plot_confusion_matrix

In [85]:
#Import dataframe
df1 = pd.read_csv('./Data/Reddit_Posts.csv')

df1.head(2) 

| Unnamed: 0 | author | date created (epoch time) | post_id | body | upvote score | number of comments | subreddit | length | word_count | tokenized_words | lemmatized_tokenized_words | stemmatized_tokenized_words | post_reddit_finance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kmuinnovation | 1631255155 | plfj7f | Schweizer Kredit Rangliste des Monats Septembe... | 1 | 0 | Finance | 52 | 7 | ['schweizer', 'kredit', 'rangliste', 'des', 'm... | ['schweizer', 'kredit', 'rangliste', 'de', 'mo... | ['schweizer', 'kredit', 'ranglist', 'de', 'mon... | 1 |
| 1 | sillychillly | 1631255546 | plfm8v | Mastercard acquires CipherTrace to enhance cry... | 1 | 0 | Finance | 62 | 7 | ['mastercard', 'acquires', 'ciphertrace', 'to'... | ['mastercard', 'acquires', 'ciphertrace', 'to'... | ['mastercard', 'acquir', 'ciphertrac', 'to', '... | 1 |


**The stopwords added in our prior EDA work are reused here so that they factor into our modeling.**

In [86]:
custom_stop_word=['the', 'to','in','of', 'and', 'for', 
                  'is','on', 'with', 'com', 'https', 'www', 
                  'content', 'tinyurl', 'wp', 'uploads', 'jpg', '2020', 
                  '2021', 'pdf', 'amp', '10', '19'] 

#Create stopwords variable
stopwords_1 = stopwords.words('english')
stopwords_1 = custom_stop_word + stopwords_1

## Null Hypothesis




**Our null hypothesis (baseline) for this classification problem is an accuracy of 0.5, the chance of guessing correctly at random, because we are classifying posts between two subreddits that contribute nearly equal numbers of posts. This baseline will change accordingly if the subreddit data sources are changed.**

In [87]:
#Develop Null Hypothesis
df1['post_reddit_finance'].value_counts(normalize=True)

0    0.500025
1    0.499975
Name: post_reddit_finance, dtype: float64
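
The baseline above can also be read as the accuracy of always predicting the most common class. The sketch below is a minimal illustration of that idea. The class counts (10,001 and 10,000) are an assumption, inferred from the proportions printed above and the 20,001 rows implied by the train/test split sizes reported later; `majority_baseline` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Assumed counts, inferred from the proportions above (not read from the CSV)
labels = [0] * 10001 + [1] * 10000

baseline_label, baseline_acc = majority_baseline(labels)
print(baseline_label, round(baseline_acc, 6))
```

Any model worth keeping should score clearly above this value on unseen data.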

## Models Represented

**The models we will be running are as follows:**


- Logistic Regression (count vect. and a tfidf vect.)
- Multinomial Naive Bayes (tfidf vect. and a Gridsearch)
- Random Forest (tfidf vect. and a Gridsearch)
- K Nearest Neighbors (tfidf vect. and a Gridsearch)

## Vector Selection

**Our target variable will be whether a post belongs to reddit finance or not.**

In [88]:
# Define X and y
X = df1['body']
y = df1['post_reddit_finance']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=.30, 
                                                    random_state=42, stratify = y)

#Check shape to make sure it matches
print(len(X_train))
print(len(y_train))
print(len(X_test))
print(len(y_test))

14000
14000
6001
6001


#### Count Vectorizer

**Here we instantiate our vectorizer with our custom stop word list and an ngram_range of (1, 2). The stop word list filters out common, uninformative words, while the n-gram range pushes the analyzer to capture expressions formed by more than one word. (So in addition to "not" and "good", the analyzer also recognizes "not good".)**

In [89]:
#Create Count Vectorizer
cvec = CountVectorizer(stop_words = stopwords_1, ngram_range=(1,2))
cvec.fit(X_train)

#Transform vector
X_train_cv = cvec.transform(X_train)
X_test_cv = cvec.transform(X_test)

X_train_cv

<14000x106326 sparse matrix of type '<class 'numpy.int64'>'
	with 235722 stored elements in Compressed Sparse Row format>
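
To make the effect of `ngram_range` concrete, here is a small sketch on a single invented sentence (not taken from the Reddit data). It compares the vocabulary learned with unigrams only against unigrams plus bigrams. No stop word list is passed here on purpose: NLTK's English list contains "not", and scikit-learn removes stop words before it builds n-grams, so the "not good" bigram would not survive with that list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy document, invented for illustration only
docs = ["the movie was not good"]

# Unigrams only versus unigrams plus bigrams
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
uni_bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(sorted(unigrams.vocabulary_))
print(sorted(uni_bi.vocabulary_))
```

The second vocabulary contains every single word plus each adjacent word pair, which is why the feature count grows so quickly on the real data.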

#### TFIDF Vectorizer

In [90]:

#Instantiate Vectorizer
tvec = TfidfVectorizer(stop_words= stopwords_1, ngram_range=(1,2))

#Transform Vector
X_train_TF = tvec.fit_transform(X_train)
X_test_TF = tvec.transform(X_test)


## Logistic Regression

**We run Logistic Regressions to test the effectiveness of our vectorizers. The regression on count-vectorized features scored 75.8% accuracy on unseen data, while the regression on TF-IDF features scored 76.9%.**

**While the test scores are similar, the count-vectorized regression is more overfit (97.3% training accuracy versus 92.1% for TF-IDF). This was expected: raw counts favor the most frequent words, whereas TF-IDF down-weights terms that appear in many posts. We will use our TF-IDF vectorizer for the bulk of our modeling.**
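
The down-weighting of common terms can be sketched with the inverse document frequency alone. The example below assumes the smoothed formula that scikit-learn documents for its default `smooth_idf=True` setting, `idf = ln((1 + n) / (1 + df)) + 1`; the corpus and the `smoothed_idf` helper are invented for illustration and leave out the term-frequency and normalization steps.

```python
import math

# Toy tokenized corpus, invented for illustration only
docs = [
    ["stock", "market", "stock"],
    ["stock", "loan"],
    ["stock", "crypto"],
]

def smoothed_idf(term, docs):
    """idf = ln((1 + n_docs) / (1 + doc_freq)) + 1 (smoothed variant)."""
    doc_freq = sum(term in doc for doc in docs)
    return math.log((1 + len(docs)) / (1 + doc_freq)) + 1

# "stock" appears in every document, "loan" in only one
for term in ["stock", "loan"]:
    print(term, round(smoothed_idf(term, docs), 3))
```

A term found in every document gets the minimum weight of 1, while a rarer term gets a larger weight, so frequent words contribute less to the model than they would under raw counts.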


In [91]:
#Logistic Regression with Count Vectorizer
lr_model = LogisticRegression(solver='lbfgs', max_iter=5000)
lr_model.fit(X_train_cv, y_train)


score = lr_model.score(X_train_cv, y_train)
score_test = lr_model.score(X_test_cv, y_test)

print('Training score:',score)
print('Testing score:',score_test)

importance = lr_model.coef_
print(importance)

Training score: 0.9725714285714285
Testing score: 0.7582069655057491
[[-0.11347765  0.02551134  0.00638595 ...  0.06666518  0.06666518
   0.06666518]]


In [92]:
#Logistic Regression with TF-IDF Vectorizer
lr = LogisticRegression(solver='lbfgs', max_iter=5000)
lr.fit(X_train_TF, y_train)

# Score model on train and test data
score = lr.score(X_train_TF, y_train)
test_score = lr.score(X_test_TF, y_test)


print('Training Score:',score)
print('Testing Score:',test_score)

# Coefficients of the TF-IDF model (lr), not the count-vectorizer model (lr_model)
importance = lr.coef_
print(importance)

Training Score: 0.921
Testing Score: 0.7688718546908848


## Multinomial Naive Bayes

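**We run a Multinomial Naive Bayes model because it tends to perform well on word-count-style text features. It is designed for discrete counts, but in practice it also works with fractional TF-IDF weights such as ours.**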



In [95]:
nb = MultinomialNB(alpha = 1)

nb.fit(X_train_TF, y_train)

y_pred = nb.predict(X_test_TF)

print(accuracy_score(y_test, y_pred))


tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)


# calculate Accuracy (Total correct)
acc = (tn + tp) / (tn + fp + fn + tp)
print("Accuracy: %s" %acc)

# Calculate the specificity (True negative)
spec = tn/(tn + fp)
print("Specificity: %s" %spec)

# calculate precision
prec = tp / (tp + fp)
print("Precision: %s" %prec)

# calculate recall 
rec = tp / (tp + fn)
print("Recall: %s" %rec)


0.7657057157140477
True Negatives: 2448
False Positives: 553
False Negatives: 853
True Positives: 2147
Accuracy: 0.7657057157140477
Specificity: 0.8157280906364545
Precision: 0.7951851851851852
Recall: 0.7156666666666667
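
Because the same four metrics are recomputed for every model in this notebook, they can be gathered into one small function. The sketch below is one possible way to do that; `classification_metrics` is a hypothetical helper, and the F1 score is an extra metric added here for illustration that the original cells do not compute. The counts are the ones printed above for Multinomial Naive Bayes.

```python
def classification_metrics(tn, fp, fn, tp):
    """Derive common metrics from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": precision,
        "recall": recall,               # true positive rate
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts taken from the Multinomial Naive Bayes output above
metrics = classification_metrics(tn=2448, fp=553, fn=853, tp=2147)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The same call works for any later model by passing in the values unpacked from `confusion_matrix(...).ravel()`.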


**Our standard, unoptimized version reaches a test accuracy of about .77 (0.766).**

In [None]:
# Set up pipeline
mb_pipe = Pipeline([('tvec', TfidfVectorizer(stop_words=stopwords_1, max_features=500, ngram_range=(1,2))),
    ('mnb', MultinomialNB())
])


mb_params = {
    'mnb__alpha': [0.01, 0.1, 0.5, 1.0, 10.0], 
    'mnb__fit_prior': [True]
}


# Set up a gridsearch
mb_TF = GridSearchCV(mb_pipe, mb_params, cv=10, verbose=1)

# Fit the gridsearch
mb_TF.fit(X_train, y_train);

In [None]:
#What is the best score
print("Best Score: %s" % mb_TF.best_score_)

# What are the best hyperparameters?
print("Best Params: %s" % mb_TF.best_params_)

# Score model on training and testing sets
print("Training Score: %s"% mb_TF.score(X_train, y_train))
print("Test Score: %s" % mb_TF.score(X_test, y_test))

preds = mb_TF.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

**Our grid-searched multinomial model scored relatively well, with a training score of .74 and a test score of .73. The small gap between the two suggests a good degree of fit.**

In [None]:
#Calculate metrics

# calculate Accuracy (Total correct)
acc = (tn + tp) / (tn + fp + fn + tp)
print("Accuracy: %s" %acc)

# Calculate the specificity (True negative)
spec = tn/(tn + fp)
print("Specificity: %s" %spec)

# calculate precision
prec = tp / (tp + fp)
print("Precision: %s" %prec)

# calculate recall 
rec = tp / (tp + fn)
print("Recall: %s" %rec)


In [1]:
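# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(mb_TF, X_test, y_test, cmap='Blues', values_format='d');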

## Random Forest

**As a tree-based ensemble, a Random Forest trains each tree on a random sample of the data and considers only a random subset of features at each split, unlike a regular decision tree. This prevents one or a few features that are very strong predictors of the response variable (target y) from dominating every tree, which would leave the trees highly correlated. This randomness results in a wide diversity of trees that generally results in a better model.**

**In this case, our unoptimized random forest, scored with 5-fold cross-validation, ran slightly higher than our grid-searched model.**
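
The two sources of randomness in a random forest can be sketched in a few lines of plain Python. This is only an illustration under simplified assumptions, not scikit-learn's implementation: `bootstrap_sample` and `random_feature_subset` are hypothetical helpers, and the row and feature counts are invented toy sizes.

```python
import math
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one tree's training rows)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices for one split."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(42)       # fixed seed so the sketch is repeatable
n_rows, n_features = 10, 16   # toy sizes, invented for illustration

for tree in range(3):
    rows = bootstrap_sample(n_rows, rng)
    features = random_feature_subset(n_features, rng)
    print(f"tree {tree}: rows={rows} features={features}")
```

Each tree sees a different mix of rows and candidate features, so a single dominant feature cannot shape every tree in the same way.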

Fitting 5 folds for each of 3 candidates, totalling 15 fits
0.7515714285714286

#Best features
{'max_depth': None, 'max_features': 'sqrt', 'n_estimators': 200}

In [None]:
rf = RandomForestClassifier()

In [None]:
cross_val_score(rf, X_train_TF, y_train, cv=5).mean()

In [None]:

rf_params = {
    'n_estimators': [100],      # other candidates: 200, 300, 400; number of trees built before taking averages of predictions
    'max_depth': [None],        # other candidates: 1, 2, 3, 4, 5
    'max_features': ['sqrt'],   # other candidate: .5; number of features considered when splitting a node
}

rfs = GridSearchCV(rf, rf_params, cv=5, n_jobs=-1, verbose=1) # 5-fold cross-validation
rfs.fit(X_train_TF, y_train)
print(rfs.best_score_)
rfs.best_params_

In [None]:
rfs.best_estimator_.feature_importances_

In [None]:
print('Best Estimator Score Train: ', rfs.best_estimator_.score(X_train_TF, y_train))
print('Best Estimator Score Test: ', rfs.best_estimator_.score(X_test_TF, y_test))

#Get our matrix results that show our accuracy
tn, fp, fn, tp = confusion_matrix(y_test, rfs.predict(X_test_TF)).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

In [None]:
# calculate Accuracy (Total correct)
acc = (tn + tp) / (tn + fp + fn + tp)
print("Accuracy: %s" %acc)

# Calculate the specificity (True negative)
spec = tn/(tn + fp)
print("Specificity: %s" %spec)

# calculate precision
prec = tp / (tp + fp)
print("Precision: %s" %prec)

# calculate recall 
rec = tp / (tp + fn)
print("Recall: %s" %rec)

In [None]:
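# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(rfs, X_test_TF, y_test, cmap='Blues', values_format='d');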

## K Nearest Neighbors

**Running unoptimized on our Count Vectorizer features, the K Nearest Neighbors model was severely overfit: it reached 75% accuracy on the training data but only 58% on unseen data.**

**On our TF-IDF features, the unoptimized model improves to 81% training accuracy and has a greater degree of fit with the test data (71% accuracy).**

In [None]:
knn = KNeighborsClassifier().fit(X_train_TF, y_train)

# Score on the same TF-IDF features the model was fit on
print('Training set score: ' + str(knn.score(X_train_TF,y_train)))
print('Test set score: ' + str(knn.score(X_test_TF,y_test)))

**Our grid search showed that the best K Nearest Neighbors configuration reached a training score of 75.7%, with a test score of 74%. The optimized model settled on these parameters: n_neighbors = 71 and metric = euclidean.**

In [None]:
knn_params = {
    'n_neighbors': range(1, 100),
    'metric': ['euclidean', 'manhattan']
}

# Instantiate our GridSearch
gs_knn = GridSearchCV(
    KNeighborsClassifier(), # what model to fit
    knn_params, # dictionary of parameters to search
    cv=10, # number of folds (default is 5)
    n_jobs=-1,
    verbose=1
)

gs_knn.fit(X_train_TF, y_train)

print('Best Params: ',gs_knn.best_params_)
print('Best Estimator Score Train: ', gs_knn.best_estimator_.score(X_train_TF, y_train))
print('Best Estimator Score Test: ', gs_knn.best_estimator_.score(X_test_TF, y_test))
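
For readers unfamiliar with what `n_neighbors` and the Euclidean metric control, the following is a minimal pure-Python sketch of the prediction rule: find the k closest training points and take a majority vote. The 2-D points and the `knn_predict` helper are invented for illustration; the notebook itself relies on `KNeighborsClassifier` over sparse TF-IDF vectors.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict the majority label among the k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)  # Euclidean distance to the query
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy 2-D points, invented for illustration only
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_labels = [0, 0, 0, 1, 1]

print(knn_predict(train_points, train_labels, query=(0.15, 0.15), k=3))
print(knn_predict(train_points, train_labels, query=(0.95, 1.0), k=3))
```

A larger k averages over more neighbors, which smooths the decision boundary and tends to narrow the gap between training and test scores.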


In [None]:

tn, fp, fn, tp = confusion_matrix(y_test, gs_knn.predict(X_test_TF)).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

# calculate Accuracy (Total correct)
acc = (tn + tp) / (tn + fp + fn + tp)
print("Accuracy: %s" %acc)

# Calculate the specificity (True negative)
spec = tn/(tn + fp)
print("Specificity: %s" %spec)

# calculate precision
prec = tp / (tp + fp)
print("Precision: %s" %prec)

# calculate recall 
rec = tp / (tp + fn)
print("Recall: %s" %rec)

#plot_confusion_matrix(gs, X_test, y_test, cmap='Blues', values_format='d');

## Model Choice - K-Nearest Neighbors with Gridsearch

In the end, we chose our K-Nearest Neighbors model, with grid search parameters of n_neighbors = 72 and the Euclidean metric.
This is because the grid-searched model performed well, having the best overall fit of all the models and a reasonable degree of accuracy in comparison to the others.