# Naive Bayes Model

# 
These are our imports:

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, classification_report

# 
Bring in the cleaned data set:

In [3]:
df = pd.read_csv('../data/depression_bipolar_cleaned.csv')
df

Unnamed: 0,selftext,subreddit,tokenized,sentences,word_count
0,I am now homeless and my phone service will tu...,0,"['homeless', 'phone', 'service', 'turn', 'real...",i am now homeless and my phone service will tu...,234
1,People always describe their depression as con...,0,"['people', 'always', 'describe', 'depression',...",people always describe their depression a cons...,155
2,I have been struggling really hard with this. ...,0,"['struggling', 'really', 'hard', 'graduated', ...",i have been struggling really hard with this w...,394
3,So Its been 1 year I had my future secure and ...,0,"['1', 'year', 'future', 'secure', 'well', 'wa'...",so it been 1 year i had my future secure and w...,105
4,I am 15 yrs old and I just need some help.\n\n...,0,"['15', 'yr', 'old', 'need', 'help', 'thing', '...",i am 15 yr old and i just need some help thing...,154
...,...,...,...,...,...
9995,I honestly love the flavour of them!\n\nI star...,1,"['honestly', 'love', 'flavour', 'started', 're...",i honestly love the flavour of them i started ...,130
9996,Three days on the trot I have been unable to s...,1,"['three', 'day', 'trot', 'unable', 'sleep', 'e...",three day on the trot i have been unable to sl...,132
9997,Just wanted to say that I have been feeling go...,1,"['wanted', 'say', 'feeling', 'good', 'last', '...",just wanted to say that i have been feeling go...,107
9998,Hey guys. I just wanted to ask you if you have...,1,"['hey', 'guy', 'wanted', 'ask', 'ever', 'probl...",hey guy i just wanted to ask you if you have e...,339


# 
### Gridsearch 
Create a grid of different model paramaters to test out all at once. This should help us refine our model quickly.

#### 
Instantiate the Model Variables. X is our independent variable (which will become vectorized variables) and y is our dependent variable:

In [4]:
X = df['sentences']
y = df['subreddit']

# 
Split the data into a training and testing set:

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.4,
                                                    stratify=y,
                                                    random_state=24)

# 
Establish a baseline score by getting the percentage of our dominant prediction (in this case it's 50/50):

In [5]:
y_test.value_counts(normalize = True)

0    0.5
1    0.5
Name: subreddit, dtype: float64

# 
Establish a pipeline for our data to go through in the modeling process. First it will go through vectorization with TF-IDF and then the resulting matrix is run through our Naive Bayes model:

In [6]:
pipe_tvec = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

# 
This is the grid that our GridSearch will go through. Every different combination will be used to find the optimal model parameters.

In [7]:
pipe_tvec_params = {
    'tvec__max_features': [6_000, 7_000, 8_000],
    'tvec__max_df': [400, 500, 600],
    'tvec__min_df': [10, 20, 25],
    'tvec__stop_words': ['english'],
    'tvec__ngram_range': [(1,2)]
}

# 
Instantiate and fit the GridSearchCV:

In [8]:
gs_tvec = GridSearchCV(pipe_tvec,
                        param_grid = pipe_tvec_params, 
                        cv=3, n_jobs = 4)

In [9]:
gs_tvec.fit(X_train, y_train)

GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('nb', MultinomialNB())]),
             n_jobs=4,
             param_grid={'tvec__max_df': [400, 500, 600],
                         'tvec__max_features': [6000, 7000, 8000],
                         'tvec__min_df': [10, 20, 25],
                         'tvec__ngram_range': [(1, 2)],
                         'tvec__stop_words': ['english']})

# 
Get the optimal scores for the gridsearch as well as the optimal parameters from our grid:

In [10]:
gs_tvec.score(X_train, y_train)

0.8753333333333333

In [11]:
gs_tvec.score(X_test, y_test)

0.82225

In [12]:
gs_tvec.best_estimator_

Pipeline(steps=[('tvec',
                 TfidfVectorizer(max_df=500, max_features=6000, min_df=10,
                                 ngram_range=(1, 2), stop_words='english')),
                ('nb', MultinomialNB())])

In [13]:
gs_tvec.best_params_

{'tvec__max_df': 500,
 'tvec__max_features': 6000,
 'tvec__min_df': 10,
 'tvec__ngram_range': (1, 2),
 'tvec__stop_words': 'english'}

#### 
These accuracy scores are good. It's a little overfit, but not terribly. Considering all of the different models we ran, this is probably the most realistic variance
# 

# 
#### Create an individual model with our parameters from the GridSearch

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.4,
                                                    stratify=y,
                                                    random_state=24)

In [6]:
y_test.value_counts(normalize = True)

0    0.5
1    0.5
Name: subreddit, dtype: float64

In [7]:
tvec1 = TfidfVectorizer(max_df=500, 
                        max_features=6000, 
                        min_df=10,
                        ngram_range=(1, 2), 
                        stop_words=['english']
                      )

In [8]:
X_train = tvec1.fit_transform(X_train)

In [9]:
X_test = tvec1.transform(X_test)

In [10]:
nb1 = MultinomialNB()

In [11]:
nb1.fit(X_train, y_train)

MultinomialNB()

In [13]:
nb1.score(X_train, y_train)

0.86

In [14]:
nb1.score(X_test, y_test)

0.8215

# 
Get predictions so we can check all of our scores:

In [12]:
y_pred = nb1.predict(X_test)

In [15]:
print(classification_report(y_test, y_pred, target_names = ['depression', 'bipolar'], digits = 3))

              precision    recall  f1-score   support

  depression      0.800     0.858     0.828      2000
     bipolar      0.847     0.785     0.815      2000

    accuracy                          0.822      4000
   macro avg      0.823     0.822     0.821      4000
weighted avg      0.823     0.822     0.821      4000



# 
### Overall Analysis

Naive Bayes did much better than the baseline model. We're off to a really good start with our modeling. The accuracy is over 80% and the recall(sensitivity) is just below 80%. Let's see if logistic regression performs any better.

# 
**Up Next:**  
[Logistic Regression Model](./Logistic_Regression_Model.ipynb)  
  
[Return to Read Me](../README.md)