## Notebook 5. Improving Models and Conclusions

##### For my final modeling:
- Following on from my previous notebook, I wanted to explore the Random Forest model further using pipelines and GridSearchCV. Here, I'm going to see whether I can improve the score and what this model can tell me. I'm also going to remove the words 'read' and 'book' from the text by adding them to the stop words, to challenge the model and possibly reduce the variance.

##### For Evaluation:
* Instantiate and display the confusion matrix
* Get the true negatives, false positives, false negatives, and true positives
* Create and display predictions for each post, and save the predictions as a separate CSV file

##### Make Final Conclusion

### 1. Loading Data

In [1]:
# Import libraries

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Loading dataset

df = pd.read_csv('../data/cleaned_df.csv')
df.head(2)

Unnamed: 0,text,subreddit_Parenting,subreddit_books
0,why do you like james joyce james joyce from m...,0,1
1,we yevgeny zamyatin spoilers just finished thi...,0,1


### 2. Adding 'read' and 'book' to stop words list

In [3]:
# Recover stop words
%store -r sw

np.array(sw)[0:5]

sw = sw + ['read', 'book']
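
One thing worth keeping in mind is that stop words are matched against whole tokens, so adding 'read' and 'book' does not remove variants such as 'reading' or 'books'. The minimal sketch below illustrates this on two made-up posts. It assumes scikit-learn's built-in `ENGLISH_STOP_WORDS` as a stand-in for the stored `sw` list, whose full contents are not shown in this notebook.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Assumption: a stand-in for the stored `sw` list
toy_sw = list(ENGLISH_STOP_WORDS) + ['read', 'book']

# Hypothetical posts, for illustration only
toy_posts = ['i read this book last week', 'reading books with my kids']

vec = TfidfVectorizer(stop_words=toy_sw)
vec.fit(toy_posts)

# 'read' and 'book' are dropped, but 'reading' and 'books' survive
print(sorted(vec.vocabulary_))
```

If the variants should be removed as well, they would have to be added to the list explicitly or the text would need stemming or lemmatizing first.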

### 3. Setting the target and performing the train-test split

Again, I set the target (`subreddit_books`) and the predictor (`text`), then perform the train-test split.

 - `train_test_split` divides the dataset into four parts: `X_train`, `X_test`, `y_train`, and `y_test`. With the default settings, the model trains on 75% of the data and is tested on the remaining 25%.
 - `random_state=42` ensures that the split is reproducible.
 - `stratify=y` ensures that the target classes have the same proportions in both the training and the test data.

In [4]:
# Identify target and features
# Test train split

X = df.text
y = df.subreddit_books

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,
                                                           stratify=y)
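
To check what `stratify=y` and the default 75/25 split actually do, a small sketch on a hypothetical target can compare class shares before and after the split. The toy arrays below are made up and only stand in for `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 40% in the positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 40 + [0] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=42, stratify=y_toy)

# Default test_size is 0.25, so 75 rows train and 25 rows test
print('train size:', len(y_tr), 'test size:', len(y_te))

# With stratify, the positive share stays close to the overall share
print('overall:', y_toy.mean(), 'train:', y_tr.mean(), 'test:', y_te.mean())
```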

### 4. Instantiating the Pipeline (TF-IDF and Random Forest) and Setting Parameters for the Grid Search

In [51]:
# Instantiate TfidfVectorizer + RandomForestClassifier pipeline

pipe = Pipeline([  ('tfidf', TfidfVectorizer()), ('rf', RandomForestClassifier()) ])

In [52]:
# Define a dictionary of hyperparameters
# My parameter choices are based on recommendations from the GA lessons

params = {
    'tfidf__stop_words': [sw],
    'tfidf__max_df' :    [.75, .8, .9, .95],
    'tfidf__min_df':    [2, 4, 8],
    'tfidf__max_features': [10, 100, 500, 1000, 3000],
    'tfidf__norm' : ['l1', 'l2'],
    'rf__n_estimators': [10, 50, 100],
    'rf__max_depth': [None, 3, 5]}
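
The size of this grid explains the long fit time: the number of candidates is the product of the list lengths, and each candidate is fit once per fold. One way to count the candidates before fitting is `ParameterGrid`, sketched here with a short placeholder stop-word list (an assumption, since only the number of options for that entry matters).

```python
from sklearn.model_selection import ParameterGrid

# Same grid shape as above; the stop-word list is a short placeholder
grid = {
    'tfidf__stop_words': [['read', 'book']],
    'tfidf__max_df': [.75, .8, .9, .95],
    'tfidf__min_df': [2, 4, 8],
    'tfidf__max_features': [10, 100, 500, 1000, 3000],
    'tfidf__norm': ['l1', 'l2'],
    'rf__n_estimators': [10, 50, 100],
    'rf__max_depth': [None, 3, 5],
}

n_candidates = len(ParameterGrid(grid))  # 1 * 4 * 3 * 5 * 2 * 3 * 3
n_folds = 5
print(n_candidates, 'candidates,', n_candidates * n_folds, 'fits')
```

Passing `n_jobs=-1` to `GridSearchCV` is one option for spreading these fits across CPU cores.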

### 5. Fitting the Grid Search and Evaluating scores

In [53]:
#Instantiate GridSearchCV object

gs = GridSearchCV(pipe, 
                  param_grid=params, 
                  cv=5,
                  verbose=1)


gs.fit(X_train, y_train) 


Fitting 5 folds for each of 1080 candidates, totalling 5400 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                       ('rf', RandomForestClassifier())]),
             param_grid={'rf__max_depth': [None, 3, 5],
                         'rf__n_estimators': [10, 50, 100],
                         'tfidf__max_df': [0.75, 0.8, 0.9, 0.95],
                         'tfidf__max_features': [10, 100, 500, 1000, 3000],
                         'tfidf__min_df': [2, 4, 8],
                         'tfidf__norm': ['l1', 'l2'],
                         'tfidf__stop_words': [['i', 'me', 'my', 'myself', 'we',
                                                'our', 'ours', 'ourselves',
                                                'you', "you're", "you've",
                                                "you'll", "you'd", 'your',
                                                'yours', 'yourself',
                                                'yourselves', 'he', 'him',
                     

In [54]:
print('gs best score:', gs.best_score_)
print('gs best parameters:', gs.best_params_)
print('gs best estimator:', gs.best_estimator_)

gs best score: 0.9745415770250663
gs best parameters: {'rf__max_depth': None, 'rf__n_estimators': 100, 'tfidf__max_df': 0.75, 'tfidf__max_features': 3000, 'tfidf__min_df': 8, 'tfidf__norm': 'l2', 'tfidf__stop_words': ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'ag

In [55]:
print('train score:', gs.score(X_train,y_train))
print('test score:',  gs.score(X_test,y_test))

train score: 1.0
test score: 0.9743833017077799
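
For reading these numbers, it helps to know that `best_score_` is the mean score over the held-out folds for the best parameter combination, while `gs.score` uses the refit best estimator on whatever data it is given. A minimal sketch on a hypothetical toy corpus (not the Reddit data) shows where `best_score_` comes from in `cv_results_`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = books, 0 = parenting
docs = ['great novel', 'loved the author', 'a classic story', 'new chapter today',
        'my toddler cried', 'daycare is expensive', 'kids bedtime routine',
        'baby will not sleep']
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([('tfidf', TfidfVectorizer()),
                     ('rf', RandomForestClassifier(random_state=42))])
toy_gs = GridSearchCV(toy_pipe, param_grid={'rf__n_estimators': [5, 10]}, cv=2)
toy_gs.fit(docs, labels)

# best_score_ is the mean held-out-fold score of the best candidate
best_mean = toy_gs.cv_results_['mean_test_score'][toy_gs.best_index_]
print(toy_gs.best_score_ == best_mean)
```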


### After exploring the scores, I can make the following observations:

- Using a pipeline with grid search definitely helps a lot when working with large datasets and speeds up the tuning process. Too bad I didn't use it in my previous notebook, but I will definitely use it in the future.
- The model worked well even after I removed the two most popular words from the books data.
- `gs.best_score_` is the mean cross-validated score of the best parameter combination (0.9745), and it is very close to the test score (0.9744).
- The train score is still 1.0, so the model overfits somewhat. If I had more time, I would apply a standard scaler to check whether it helps, although tree-based models are generally insensitive to feature scaling.
- The grid search picked my lowest `max_df` and highest `max_features`. I believe it did so to reduce variance, because the text still contains many related word forms such as 'reading' and 'books'.
- The number of estimators is 100, which is what I expected.
- The L2 norm was chosen. In `TfidfVectorizer`, `norm` is not a regularization penalty: it rescales each document vector to unit length (Euclidean length for `'l2'`), so by itself it does not guard against overfitting.
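
The `tfidf__norm` setting controls how each document's TF-IDF vector is rescaled. A minimal sketch on two made-up posts shows the difference between the two options in the grid: with `'l1'` the values in each row sum to 1, and with `'l2'` each row has Euclidean length 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical posts, for illustration only
toy_posts = ['great novel by a great author', 'my toddler will not sleep']

X_l1 = TfidfVectorizer(norm='l1').fit_transform(toy_posts).toarray()
X_l2 = TfidfVectorizer(norm='l2').fit_transform(toy_posts).toarray()

# Row sums under L1 normalization
print('l1 row sums:', np.abs(X_l1).sum(axis=1))

# Euclidean row lengths under L2 normalization
print('l2 row lengths:', np.sqrt((X_l2 ** 2).sum(axis=1)))
```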

### 6. Predictions and Model Evaluation using Confusion Matrix

In [63]:
#Predictions
# Ideas how to build and plot the confusion matrix came from the General Assembly lessons

pred = gs.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
misclassification = 1 - accuracy

print('Accuracy rate:', accuracy)
print('Misclassification rate:', misclassification)

Accuracy rate: 0.9743833017077799
Misclassification rate: 0.02561669829222013


In [65]:
# Building the confusion matrix as a DataFrame
# confusion_matrix orders rows and columns by sorted label:
# 0 = parenting, 1 = books (y is subreddit_books)

cm = confusion_matrix(y_test, pred)
cm_df = pd.DataFrame(data=cm, columns=['predicted parenting', 
                                       'predicted books'], 
                     index=['Actual parenting', 'Actual books'])
cm_df

Unnamed: 0,predicted parenting,predicted books
Actual parenting,1019,26
Actual books,28,1035


 - From the table above we can see that the model did a good job of predicting posts. It correctly predicted 1019 of the 1045 posts that belong to parenting and 1035 of the 1063 posts that belong to books.
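
The counts in the confusion matrix above are enough to work out a few more metrics by hand. The sketch below copies the four counts and relies on scikit-learn's sorted label order, so the first row and column are label 0 and the second are label 1 (`subreddit_books == 1`). Precision, recall, and specificity are additions for illustration; only accuracy is reported in this notebook.

```python
import numpy as np

# Counts copied from the confusion matrix above
cm_counts = np.array([[1019, 26],
                      [28, 1035]])
tn, fp, fn, tp = cm_counts.ravel()

accuracy = (tp + tn) / cm_counts.sum()
precision = tp / (tp + fp)    # of posts predicted as label 1, share that are label 1
recall = tp / (tp + fn)       # of true label 1 posts, share predicted as label 1
specificity = tn / (tn + fp)  # of true label 0 posts, share predicted as label 0

print('accuracy:', round(accuracy, 4))
print('precision:', round(precision, 4))
print('recall:', round(recall, 4))
print('specificity:', round(specificity, 4))
```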
 
 

For plotting a ROC curve in the future, this link may be useful:
https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python
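
As a starting point for that future work, the points of a ROC curve and the area under it can be computed with `roc_curve` and `roc_auc_score`. This is only a sketch on four hypothetical labels and scores, not results from this model.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print('AUC:', auc)

# With the fitted grid search, the same calls would take y_test and
# gs.predict_proba(X_test)[:, 1]; plotting fpr against tpr gives the curve
```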


### 7. Looking at the posts...

In [70]:
# Getting the predicted probabilities (column 1 is the books class)

probs = gs.predict_proba(X_test)
preds = probs[:,1]

In [77]:
# Converting all probabilities to a data frame
# This idea came from former GA student Annet Kerr and
# https://stackoverflow.com/questions/25009284/how-to-plot-roc-curve-in-python

pred_df = pd.DataFrame({'posts': X_test,
                        'y_test_score': y_test,
                        'prediction': preds,
                        'proba_parenting': probs[:,0],
                        'proba_books': probs[:,1]})


In [79]:
pred_df.head()

Unnamed: 0,posts,y_test_score,prediction,proba_parenting,proba_books
822,help book depository keeps refunding i am so s...,1,0.59,0.41,0.59
5617,feeling guilty getting a kid kicked out of day...,0,0.08,0.92,0.08
6585,can i live a full life as a parent i need help...,0,0.06,0.94,0.06
2934,oz series guide i want to read all the books i...,1,0.98,0.02,0.98
6035,my f son m has an eating problem and my boyfri...,0,0.09,0.91,0.09


In [83]:
pred_df.to_csv('../data/predictions.csv', index = False)
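
Once the probabilities are in a data frame, it is easy to look for the posts the model was least sure about. The sketch below rebuilds the five rows displayed above (without the post text) and assumes a 0.5 cut-off on `proba_books` as a stand-in for `gs.predict`. The columns `predicted_label`, `correct`, and `confidence` are hypothetical names added for illustration.

```python
import pandas as pd

# First five rows of pred_df as displayed above (post text omitted)
toy_pred_df = pd.DataFrame(
    {'y_test_score': [1, 0, 0, 1, 0],
     'proba_books': [0.59, 0.08, 0.06, 0.98, 0.09]},
    index=[822, 5617, 6585, 2934, 6035])

# Assumption: a 0.5 cut-off on proba_books stands in for gs.predict
toy_pred_df['predicted_label'] = (toy_pred_df['proba_books'] > 0.5).astype(int)
toy_pred_df['correct'] = toy_pred_df['predicted_label'] == toy_pred_df['y_test_score']

# Distance from 0.5 as a rough confidence measure
toy_pred_df['confidence'] = (toy_pred_df['proba_books'] - 0.5).abs()
least_confident = toy_pred_df['confidence'].idxmin()

print('all correct:', toy_pred_df['correct'].all())
print('least confident post index:', least_confident)
```

Filtering the full `pred_df` on `correct == False` in the same way would list the misclassified posts for a closer read.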

### 8. Conclusion

**Super great news! The model successfully classifies the posts by looking at the words, and we can clearly see it in the predictions dataframe. Logistic regression is the winning model, but the random forest with some adjustments also predicts subreddits very well. The accuracy is about 97-98%.**
