# Random Forest Model

These are our imports:

In [13]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report

Bring in the cleaned data set:

In [2]:
df = pd.read_csv('../data/depression_bipolar_cleaned.csv')
df

| Unnamed: 0 | selftext | subreddit | tokenized | sentences | word_count |
|---|---|---|---|---|---|
| 0 | I am now homeless and my phone service will tu... | 0 | ['homeless', 'phone', 'service', 'turn', 'real... | i am now homeless and my phone service will tu... | 234 |
| 1 | People always describe their depression as con... | 0 | ['people', 'always', 'describe', 'depression',... | people always describe their depression a cons... | 155 |
| 2 | I have been struggling really hard with this. ... | 0 | ['struggling', 'really', 'hard', 'graduated', ... | i have been struggling really hard with this w... | 394 |
| 3 | So Its been 1 year I had my future secure and ... | 0 | ['1', 'year', 'future', 'secure', 'well', 'wa'... | so it been 1 year i had my future secure and w... | 105 |
| 4 | I am 15 yrs old and I just need some help.\n\n... | 0 | ['15', 'yr', 'old', 'need', 'help', 'thing', '... | i am 15 yr old and i just need some help thing... | 154 |
| ... | ... | ... | ... | ... | ... |
| 9995 | I honestly love the flavour of them!\n\nI star... | 1 | ['honestly', 'love', 'flavour', 'started', 're... | i honestly love the flavour of them i started ... | 130 |
| 9996 | Three days on the trot I have been unable to s... | 1 | ['three', 'day', 'trot', 'unable', 'sleep', 'e... | three day on the trot i have been unable to sl... | 132 |
| 9997 | Just wanted to say that I have been feeling go... | 1 | ['wanted', 'say', 'feeling', 'good', 'last', '... | just wanted to say that i have been feeling go... | 107 |
| 9998 | Hey guys. I just wanted to ask you if you have... | 1 | ['hey', 'guy', 'wanted', 'ask', 'ever', 'probl... | hey guy i just wanted to ask you if you have e... | 339 |


### Grid Search
Create a grid of model parameters so that many combinations can be tested in a single run. This should help us refine our model quickly.

Define the model variables. `X` is our independent variable (the text, which will be vectorized) and `y` is our dependent variable (the subreddit label):

In [3]:
X = df['sentences']
y = df['subreddit']

Split the data into a training and testing set:

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.4,
                                                    stratify=y,
                                                    random_state=24)
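
To illustrate what `stratify=y` does in the split above, here is a minimal sketch on made-up labels. The toy arrays `X_toy` and `y_toy` are assumptions for illustration only, not the notebook's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, evenly split between two classes.
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0] * 5 + [1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, stratify=y_toy, random_state=24
)

# With stratify, both splits keep the original 50/50 class balance.
train_balance = y_tr.mean()
test_balance = y_te.mean()
print(train_balance, test_balance)
```

The same idea should carry over to the real data: each class makes up the same share of the training and test sets as it does of the full data set.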

Establish a pipeline for the modeling process. The text is first vectorized with TF-IDF, and the resulting matrix is then passed to our random forest model:

In [5]:
pipe_rf = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])
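
As a rough illustration of how such a pipeline behaves, the sketch below fits the same two steps on a handful of invented sentences. The texts, labels, and the small `n_estimators` value are assumptions chosen to keep the example fast:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy corpus (not from the cleaned data set).
toy_texts = ["i feel sad and tired", "mood swings again this week",
             "so tired and sad today", "manic mood then a crash"]
toy_labels = [0, 1, 0, 1]

toy_pipe = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('rf', RandomForestClassifier(n_estimators=10, random_state=0)),
])

# fit() vectorizes the raw text and trains the forest in one call.
toy_pipe.fit(toy_texts, toy_labels)

# predict() reuses the fitted vocabulary on new raw text.
preds = toy_pipe.predict(["sad and tired"])
n_terms = len(toy_pipe.named_steps['tvec'].vocabulary_)
print(preds, n_terms)
```

Because the vectorizer lives inside the pipeline, it is refit on each training fold during cross-validation, so the held-out fold never influences the vocabulary.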

This is the grid that our GridSearch will go through. Every combination of these values is fit and cross-validated to find the best model parameters:

In [6]:
pipe_params = {
    'tvec__max_features': [8_000],
    'tvec__max_df': [600, 700, 800],
    'tvec__min_df': [20, 25, 30],
    'tvec__stop_words': ['english'],
    'tvec__ngram_range': [(1, 2)],
    'rf__min_samples_split': [20, 25, 30],
    'rf__min_samples_leaf': [1, 2, 3],
    'rf__max_features': [90, 100, 110],
    'rf__n_estimators': [100],
    'rf__max_depth': [90, 100, 110],
}
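
The size of this grid can be checked before fitting. A minimal sketch using scikit-learn's `ParameterGrid`, assuming the 3-fold cross-validation used in the next step:

```python
from sklearn.model_selection import ParameterGrid

pipe_params = {
    'tvec__max_features': [8_000],
    'tvec__max_df': [600, 700, 800],
    'tvec__min_df': [20, 25, 30],
    'tvec__stop_words': ['english'],
    'tvec__ngram_range': [(1, 2)],
    'rf__min_samples_split': [20, 25, 30],
    'rf__min_samples_leaf': [1, 2, 3],
    'rf__max_features': [90, 100, 110],
    'rf__n_estimators': [100],
    'rf__max_depth': [90, 100, 110],
}

# Six parameters have three options each, the rest have one: 3**6 combinations.
n_candidates = len(ParameterGrid(pipe_params))

# Each candidate is fit once per fold (cv=3), plus one final refit on all training data.
n_fits = n_candidates * 3
print(n_candidates, n_fits)  # 729 2187
```

This goes some way toward explaining the long runtime: each of those fits trains a 100-tree forest.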

Instantiate and fit the GridSearchCV:

In [7]:
gs = GridSearchCV(pipe_rf, param_grid=pipe_params, cv=3, n_jobs = 4)
gs.fit(X_train, y_train)

GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('rf', RandomForestClassifier())]),
             n_jobs=4,
             param_grid={'rf__max_depth': [90, 100, 110],
                         'rf__max_features': [90, 100, 110],
                         'rf__min_samples_leaf': [1, 2, 3],
                         'rf__min_samples_split': [20, 25, 30],
                         'rf__n_estimators': [100],
                         'tvec__max_df': [600, 700, 800],
                         'tvec__max_features': [8000],
                         'tvec__min_df': [20, 25, 30],
                         'tvec__ngram_range': [(1, 2)],
                         'tvec__stop_words': ['english']})

Get the training and test accuracy of the best model found by the grid search, along with its parameters:

In [8]:
gs.score(X_train, y_train)

0.974

In [9]:
gs.score(X_test, y_test)

0.84375

In [10]:
gs.best_estimator_

Pipeline(steps=[('tvec',
                 TfidfVectorizer(max_df=700, max_features=8000, min_df=20,
                                 ngram_range=(1, 2), stop_words='english')),
                ('rf',
                 RandomForestClassifier(max_depth=90, max_features=90,
                                        min_samples_split=25))])

In [13]:
gs.best_params_

{'rf__max_depth': 90,
 'rf__max_features': 90,
 'rf__min_samples_leaf': 1,
 'rf__min_samples_split': 25,
 'rf__n_estimators': 100,
 'tvec__max_df': 700,
 'tvec__max_features': 8000,
 'tvec__min_df': 20,
 'tvec__ngram_range': (1, 2),
 'tvec__stop_words': 'english'}
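
Note that `gs.score` evaluates the refit best estimator on whatever data it is given, so the training score above is not the cross-validated score the search used to choose these parameters. That score is stored in `best_score_`. A minimal sketch on synthetic data (the data set, the small grid, and the `toy_gs` name are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical synthetic data standing in for the vectorized text.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, stratify=y_toy, random_state=24
)

toy_gs = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={'max_depth': [2, 5]},
    cv=3,
)
toy_gs.fit(X_tr, y_tr)

cv_score = toy_gs.best_score_            # mean held-out accuracy across the 3 folds
train_score = toy_gs.score(X_tr, y_tr)   # accuracy on data the model has already seen
test_score = toy_gs.score(X_te, y_te)    # accuracy on the untouched test set
print(toy_gs.best_params_, cv_score, train_score, test_score)
```

Comparing `best_score_` with the test score is usually a fairer check than comparing the training score with the test score.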

These accuracy scores are interesting. The model is heavily overfit (0.974 training accuracy versus about 0.844 on the test set), but the test accuracy is almost as good as the logistic regression's.
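
One simple way to quantify the overfitting is the difference between training and test accuracy. A small sketch using the two scores reported by the grid search above:

```python
# Scores copied from the grid search output above.
train_acc = 0.974
test_acc = 0.84375

# Generalization gap: how much accuracy is lost on unseen data.
gap = train_acc - test_acc
print(round(gap, 3))
```

A gap of roughly 13 percentage points suggests the trees are memorizing part of the training set.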

#### Create an individual model with our parameters from the GridSearch

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.4,
                                                    stratify=y,
                                                    random_state=24)

In [5]:
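# Vectorizer with the best TF-IDF settings found by the grid search
tvec1 = TfidfVectorizer(max_df=700,
                        max_features=8000,
                        min_df=20,
                        ngram_range=(1, 2),
                        stop_words='english')

# Random forest with the best settings; n_estimators is left at its default of 100
rf1 = RandomForestClassifier(max_depth=90,
                             max_features=90,
                             min_samples_leaf=1,
                             min_samples_split=25)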

In [6]:
X_train = tvec1.fit_transform(X_train)

In [7]:
X_test = tvec1.transform(X_test)

In [8]:
rf1.fit(X_train, y_train)

RandomForestClassifier(max_depth=90, max_features=90, min_samples_split=25)

In [10]:
rf1.score(X_train, y_train)

0.9758333333333333

In [11]:
rf1.score(X_test, y_test)

0.844

Get predictions so we can check all of our scores:

In [9]:
y_pred = rf1.predict(X_test)

In [15]:
print(classification_report(y_test, y_pred, target_names = ['depression', 'bipolar'], digits = 3))

              precision    recall  f1-score   support

  depression      0.817     0.886     0.850      2000
     bipolar      0.876     0.802     0.837      2000

    accuracy                          0.844      4000
   macro avg      0.846     0.844     0.844      4000
weighted avg      0.846     0.844     0.844      4000
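
To show how the numbers in this report relate to a confusion matrix, here is a sketch that rebuilds one. The counts are an assumption: they are back-calculated from the rounded recall values and the support of 2000 per class, not taken from the actual predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# ASSUMPTION: counts reconstructed from the rounded report
# (recall 0.886 for depression, 0.802 for bipolar, 2000 posts per class).
y_true = np.array([0] * 2000 + [1] * 2000)
y_hat = np.array([0] * 1772 + [1] * 228 + [0] * 396 + [1] * 1604)

# Rows are true classes, columns are predicted classes (0 = depression, 1 = bipolar).
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()  # bipolar (1) treated as the positive class

recall_bipolar = tp / (tp + fn)
precision_bipolar = tp / (tp + fp)
recall_depression = tn / (tn + fp)
precision_depression = tn / (tn + fn)
accuracy = (tp + tn) / cm.sum()

print(cm)
print(round(recall_bipolar, 3), round(precision_bipolar, 3), round(accuracy, 3))
```

Under these assumed counts, the model misses more bipolar posts (396 labeled as depression) than depression posts (228 labeled as bipolar), which matches the lower bipolar recall in the report.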



### Overall Analysis

The random forest is fascinating. It took significantly longer to run than logistic regression and dwarfed naive Bayes in runtime. Its test accuracy (0.844) is almost as high as logistic regression's, but at the cost of heavy overfitting: training accuracy is about 0.976. Most interestingly, it is the only model with a recall score over 80%. Based purely on performance, this is our best model. Let's take a look back at all the models and draw our conclusions:

[Conclusions](./Conclusions.ipynb)  
  
[Return to Read Me](../README.md)