# Will They Seek Treatment for a Mental Illness?

__By Lotus Baumgarner__

This is part two of two notebooks. This notebook covers the Models, Hyperparameter Tuning, Final Model Selection and Conclusion/Next Steps. The part one notebook covers the Data/Problem understanding, Data Cleaning, Visualizations and Initial Feature Selection.


In [None]:
# Basic Data Manipulation
import pandas as pd
import numpy as np

# Visualization and Statistics
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import scipy.stats as stats

# Preprocessing and Models
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from xgboost import XGBClassifier

# Other Imports
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import os  
import pickle

In [None]:
df = pd.read_csv("C:\\Users\\lotus\\Documents\\Flatiron\\Projects\\Phase5-CapstoneProject\\Data\\Treatment2_data.csv")

df.head()

In [None]:
df.info()

## 3. Train-Test Split & Basic Pipeline Set-up:
I used a __Train-Test Split__ with a holdout set, plus __Cross-Validation__. An 80/20 split separates the train_validation set from the holdout set, and a 75/25 split then divides the train_validation set into separate training and validation sets (60/20/20 of the full data overall). I also mapped my target variable (y) so that Yes = 1 and No = 0.
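
As a quick sanity check on the proportions, here is a minimal sketch of the same two-step split on toy data. The `X_toy` and `y_toy` arrays are made up for illustration and are not the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows standing in for the real features and target.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.arange(100) % 2

# Step 1: 80/20 split into train_validation and holdout.
X_tv, X_hold, y_tv, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Step 2: 75/25 split of the train_validation portion.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.25, random_state=42
)

# Share of the full dataset that ends up in each set.
fractions = {
    name: len(part) / len(X_toy)
    for name, part in [("train", X_tr), ("validation", X_va), ("holdout", X_hold)]
}
print(fractions)
```

Because 75% of 80% is 60%, the validation and holdout sets end up the same size.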

__I am labeling my TP, TN, FP, and FN as follows:__

•	__True Positives (TP):__ The number of individuals who were correctly predicted to seek treatment for a mental illness (i.e., the model predicted "Yes" for treatment, and the actual value was also "Yes").

•	__True Negatives (TN):__ The number of individuals who were correctly predicted not to seek treatment for a mental illness (i.e., the model predicted "No" for treatment, and the actual value was also "No").

•	__False Positives (FP):__ The number of individuals who were incorrectly predicted to seek treatment for a mental illness (i.e., the model predicted "Yes" for treatment, but the actual value was "No").

•	__False Negatives (FN):__ The number of individuals who were incorrectly predicted not to seek treatment for a mental illness (i.e., the model predicted "No" for treatment, but the actual value was "Yes").

I will be focusing on __Precision__ as my metric.
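
To make the metric concrete, here is a small sketch that computes Precision (and Recall, for contrast) directly from TP, FP, and FN counts and checks the result against scikit-learn. The labels below are invented for illustration only.

```python
from sklearn.metrics import precision_score, recall_score

# Invented labels: 1 = "Yes" (seeks treatment), 0 = "No".
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted Yes, actually Yes
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted Yes, actually No
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted No, actually Yes

# Precision: of everyone predicted "Yes", what share was really "Yes"?
precision_manual = tp / (tp + fp)
# Recall: of everyone who was really "Yes", what share did the model find?
recall_manual = tp / (tp + fn)

print("Manual precision: ", precision_manual)
print("sklearn precision:", precision_score(y_true, y_pred))
print("Manual recall:    ", recall_manual)
print("sklearn recall:   ", recall_score(y_true, y_pred))
```

Focusing on Precision means that False Positives are the error I most want to reduce.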

In [None]:
# Selected only the features with significant association. (See EDAs-Feature Selection Notebook)
significant_features = ['Gender', 'Family_History', 'Mental_Health_Interview', 'Care_Options', 
                        'Self_Employed', 'Coping_Struggles', 'Growing_Stress']

# Defined X and y to be split. And mapped y to numeric format.
X = df[significant_features]
y = df['Treatment'].map({'Yes': 1, 'No': 0})

# First split the dataset into train_validation and holdout sets (80/20)
X_train_val, X_holdout, y_train_val, y_holdout = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split the train_validation set into separate training and validation sets (75/25)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)


In [None]:
# Printed shape of all training, validation, and holdout sets.
print("The X Train_Validation set shape is ", X_train_val.shape)
print("The X Train set shape is            ", X_train.shape)
print("The X Validation set shape is       ", X_val.shape)
print("The X Holdout set shape is          ", X_holdout.shape)
print()
print("The y Train_Validation set shape is ", y_train_val.shape)
print("The y Train set shape is            ", y_train.shape)
print("The y Validation set shape is       ", y_val.shape)
print("The y Holdout set shape is          ", y_holdout.shape)

In [None]:
# Created a preprocessing pipeline with OneHotEncoder for categorical features
categorical_features = significant_features

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ])

# Defined a basic pipeline with a placeholder in the model slot
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', 'Model')
])

pipeline

## 4A. Model 1: Baseline - Logistic Regression:

In [None]:
# Created the Baseline model using Logistic Regression
LogReg = LogisticRegression(max_iter=1000, random_state=42)

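# Added LogReg to the model slot on a copy of the pipeline.
# set_params changes a pipeline in place, so cloning keeps each model's pipeline independent.
from sklearn.base import clone
logreg_pipeline = clone(pipeline).set_params(model=LogReg)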

logreg_pipeline

In [None]:
# Trained the model
logreg_pipeline.fit(X_train, y_train)

# Predictions & Evaluation
y_pred1 = logreg_pipeline.predict(X_val)
accuracy1 = accuracy_score(y_val, y_pred1)
classification_report1 = classification_report(y_val, y_pred1)

# Cross-validated on precision, the metric this project focuses on
cv_scores1 = cross_val_score(logreg_pipeline, X_train_val, y_train_val, cv=10, scoring='precision')

print("\033[1mThe Accuracy score on the validation set is:\033[0m ", accuracy1)
print("\033[1mClassification Report:\033[0m\n", classification_report1)
print("\033[1mCross-validation precision scores:\033[0m\n", cv_scores1)

##### FINDINGS:  Validation Set
Keeping in mind that Yes = 1 and No = 0, the Precision score of 0.6866 means that about 69% of the individuals the model predicted would seek treatment on their own actually did.

In [None]:
# Created the confusion matrix for visualization
cm1 = confusion_matrix(y_val, y_pred1, labels=logreg_pipeline.named_steps['model'].classes_)

plt.figure(figsize=(8, 6))
sns.heatmap(cm1, annot=True, fmt='d', cmap='Purples')

plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')

plt.xticks(ticks=[0.5, 1.5], labels=['No', 'Yes'])
plt.yticks(ticks=[0.5, 1.5], labels=['No', 'Yes'])

plt.show()

##### FINDINGS: Validation Set
__TP:__ The model correctly predicted __7,668__ individuals who would seek treatment for a mental illness.

__TN:__ The model correctly predicted __6,485__ individuals who would not seek treatment for a mental illness.

__FP:__ The model incorrectly predicted __3,499__ individuals as seeking treatment on their own when they will not.

__FN:__ The model incorrectly predicted __2,348__ individuals as not seeking treatment on their own when they will.
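
For reference, this is how the four counts map onto the matrix that `confusion_matrix` returns. The sketch uses invented labels; with `labels=[0, 1]`, rows are the actual class and columns are the predicted class.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 1 = "Yes", 0 = "No".
y_true = [0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Layout with labels=[0, 1]:
#   cm[0, 0] = TN   cm[0, 1] = FP
#   cm[1, 0] = FN   cm[1, 1] = TP
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

In the heatmaps in this notebook, that puts TN in the top-left cell and TP in the bottom-right cell.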

In [None]:
# Used get_feature_names_out to list and count the columns created by OneHotEncoder
ohe_feature_names = logreg_pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out()

total_new_columns = len(ohe_feature_names)

print("\033[1mThe list of new columns from OHE is:\033[0m\n", ohe_feature_names)
print()
print("\033[1mThe total number of new columns is:\033[0m ", total_new_columns)

#### Tested on the Holdout set:

In [None]:
# Predictions & Evaluation on the holdout set
y_holdout_pred1 = logreg_pipeline.predict(X_holdout)

accuracy_holdout1 = accuracy_score(y_holdout, y_holdout_pred1)
classification_report_holdout1 = classification_report(y_holdout, y_holdout_pred1)

# Cross-validated on precision. Note: this runs on the full dataset (X, y), which includes the holdout rows.
cv_scores_holdout1 = cross_val_score(logreg_pipeline, X, y, cv=10, scoring='precision')

print("\033[1mThe Accuracy score on the holdout set is:\033[0m ", accuracy_holdout1)
print("\033[1mClassification Report:\033[0m\n", classification_report_holdout1)
print("\033[1mCross-validation precision scores:\033[0m\n", cv_scores_holdout1)

In [None]:
# Created the confusion matrix for visualization
cm_holdout1 = confusion_matrix(y_holdout, y_holdout_pred1, labels=logreg_pipeline.named_steps['model'].classes_)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_holdout1, annot=True, fmt='d', cmap='Purples')

plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')

plt.xticks(ticks=[0.5, 1.5], labels=['No', 'Yes'])
plt.yticks(ticks=[0.5, 1.5], labels=['No', 'Yes'])

plt.show()

##### FINDINGS: Holdout Set
Overall, the model's Precision on the holdout set is close to what was observed on the validation set (0.6866 validation vs. 0.6777 holdout), indicating that the model generalizes well to unseen data. The TP, TN, FP, and FN counts are also close to each other.

__TP:__ Validation Set: __7,668__ -------- Holdout Set: __7,578__ 

__TN:__ Validation Set: __6,485__ -------- Holdout Set: __6,475__

__FP:__ Validation Set: __3,499__ -------- Holdout Set: __3,603__

__FN:__ Validation Set: __2,348__ -------- Holdout Set: __2,344__
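
The Precision scores quoted above can be reproduced from these counts. The short sketch below is only arithmetic on the reported numbers; the `counts` and `results` dictionaries are names I chose for illustration.

```python
# Confusion-matrix counts reported above for the Logistic Regression baseline.
counts = {
    "validation": {"TP": 7668, "TN": 6485, "FP": 3499, "FN": 2348},
    "holdout": {"TP": 7578, "TN": 6475, "FP": 3603, "FN": 2344},
}

results = {}
for name, c in counts.items():
    total = sum(c.values())
    results[name] = {
        # Share of predicted "Yes" that was actually "Yes".
        "precision": c["TP"] / (c["TP"] + c["FP"]),
        # Share of all predictions that were correct.
        "accuracy": (c["TP"] + c["TN"]) / total,
        "n": total,
    }
    print(f"{name}: n={total}, "
          f"precision={results[name]['precision']:.4f}, "
          f"accuracy={results[name]['accuracy']:.4f}")
```

Both sets contain the same number of rows, which makes the raw counts directly comparable.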

## 4B. Model 2:  Random Forest Classifier

In [None]:
# Defined Random Forest Classifier as my 2nd model
RandForest = RandomForestClassifier(random_state=42)

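# Added RandForest to the model slot on a copy of the pipeline.
# set_params changes a pipeline in place, so cloning keeps each model's pipeline independent.
from sklearn.base import clone
RF_pipeline = clone(pipeline).set_params(model=RandForest)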

RF_pipeline

In [None]:
# Trained the model
RF_pipeline.fit(X_train, y_train)

# Predictions & Evaluation
y_pred2 = RF_pipeline.predict(X_val)
accuracy2 = accuracy_score(y_val, y_pred2)
classification_report2 = classification_report(y_val, y_pred2)

# Cross-validated on precision, the metric this project focuses on
cv_scores2 = cross_val_score(RF_pipeline, X_train_val, y_train_val, cv=10, scoring='precision')

print("\033[1mThe Accuracy score on the validation set is:\033[0m ", accuracy2)
print("\033[1mClassification Report:\033[0m\n", classification_report2)
print("\033[1mCross-validation precision scores:\033[0m\n", cv_scores2)

##### FINDINGS:  Validation Set
Keeping in mind that Yes = 1 and No = 0, the Precision score of 0.6893 means that about 69% of the individuals predicted to seek treatment on their own actually did, essentially unchanged from the baseline. For the "No" class, precision rose to 78%: of the individuals predicted not to seek treatment on their own, 78% actually did not.

In [None]:
# Created the confusion matrix for visualization
cm2 = confusion_matrix(y_val, y_pred2, labels=RF_pipeline.named_steps['model'].classes_)

plt.figure(figsize=(8, 6))
sns.heatmap(cm2, annot=True, fmt='d', cmap='Purples')

plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')

plt.xticks(ticks=[0.5, 1.5], labels=['No', 'Yes'])
plt.yticks(ticks=[0.5, 1.5], labels=['No', 'Yes'])

plt.show()

##### FINDINGS: Validation Set
__TP:__ The model correctly predicted __8,286__ individuals who would seek treatment for a mental illness.

__TN:__ The model correctly predicted __6,250__ individuals who would not seek treatment for a mental illness.

__FP:__ The model incorrectly predicted __3,734__ individuals as seeking treatment on their own when they will not.

__FN:__ The model incorrectly predicted __1,730__ individuals as not seeking treatment on their own when they will.

While Random Forest did increase the number of True Positives (7,668 --> 8,286), it also increased the number of False Positives (3,499 --> 3,734).  
And while it did decrease the number of False Negatives (2,348 --> 1,730), it also decreased the number of True Negatives (6,485 --> 6,250).  
I'm hoping that by using __Hyperparameter Tuning__ I'll be able to decrease my False Positives and increase my overall Precision Score.
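
To put numbers on that trade-off, this sketch computes Precision and Recall for the "Yes" class from the two validation confusion matrices reported above. It is only arithmetic on those counts; the dictionary names are my own.

```python
# Validation-set counts reported above for each model.
models = {
    "Logistic Regression": {"TP": 7668, "TN": 6485, "FP": 3499, "FN": 2348},
    "Random Forest": {"TP": 8286, "TN": 6250, "FP": 3734, "FN": 1730},
}

summary = {}
for name, c in models.items():
    summary[name] = {
        # Of those predicted "Yes", the share that was actually "Yes".
        "precision": c["TP"] / (c["TP"] + c["FP"]),
        # Of those actually "Yes", the share the model caught.
        "recall": c["TP"] / (c["TP"] + c["FN"]),
    }
    print(f"{name}: precision={summary[name]['precision']:.4f}, "
          f"recall={summary[name]['recall']:.4f}")
```

On these counts, Recall for the "Yes" class rises from about 0.77 to about 0.83, while Precision changes by well under one percentage point. Random Forest mainly gained by finding more of the actual "Yes" cases, not by cutting False Positives.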


#### GridSearchCV:
Used GridSearchCV to find the best hyperparameters and model.  Then applied that model to the pipeline and retested against the validation and holdout sets.
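
As a small illustration of how the `model__` prefix routes each hyperparameter to the `model` step of a pipeline, here is a minimal sketch on synthetic data. The toy features, the tiny grid, and the choice of `scoring='precision'` are assumptions for this example, not the project's actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic categorical data (hypothetical, for illustration only).
rng = np.random.default_rng(42)
n = 200
family_history = rng.choice(["Yes", "No"], size=n)
care_options = rng.choice(["Yes", "No", "Not sure"], size=n)
X_toy = np.column_stack([family_history, care_options])

# Target mostly follows family_history, with about 20% of labels flipped.
noise = rng.random(n) < 0.2
y_toy = ((family_history == "Yes") ^ noise).astype(int)

toy_pipeline = Pipeline(steps=[
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("model", RandomForestClassifier(random_state=42)),
])

# "<step name>__<parameter>" targets a parameter of that pipeline step.
toy_grid = {
    "model__n_estimators": [10, 25],
    "model__max_depth": [2, None],
}

search = GridSearchCV(toy_pipeline, toy_grid, cv=3, scoring="precision")
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best mean CV precision:", search.best_score_)
```

`best_score_` is the mean cross-validated score of the best parameter combination, measured with whatever `scoring` was passed in.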

In [None]:
# Set up the parameter grid for hyperparameter tuning
RF_param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [None, 10, 20],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4],
}

# Set up GridSearchCV
grid_search_RF = GridSearchCV(RF_pipeline, RF_param_grid, cv=5, scoring='accuracy')

# Fit GridSearchCV to the train_val set to find the best parameters
grid_search_RF.fit(X_train_val, y_train_val)

# Best hyperparameters, the corresponding best model, and its mean cross-validated accuracy
best_RF_params = grid_search_RF.best_params_
best_RF_model = grid_search_RF.best_estimator_
best_RF_score = grid_search_RF.best_score_


print("\033[1mThe Best Parameters are:\033[0m\n", best_RF_params)
print()
print("\033[1mThe Best Model is:\033[0m\n", best_RF_model)
print()
print("\033[1mThe Best Accuracy Score is: \033[0m", best_RF_score)

In [None]:
# Evaluate accuracy on the validation set
# (best_RF_model was refit on X_train_val, which contains X_val, so this score is optimistic)
val_accuracy = best_RF_model.score(X_val, y_val)

# Evaluate accuracy on the holdout set
holdout_accuracy = best_RF_model.score(X_holdout, y_holdout)

print("The Validation Set Accuracy Score is: ", val_accuracy)
print("The Holdout Set Accuracy Score is:    ", holdout_accuracy)

In [None]:
# Created the confusion matrix for visualization


##### FINDINGS:


## 4C. Model 3:  XGB Classifier

In [None]:
# Defined XGBClassifier as my 3rd model
XGBClass = XGBClassifier(eval_metric='logloss', random_state=42)

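# Added XGBClass to the model slot on a copy of the pipeline.
# set_params changes a pipeline in place, so cloning keeps each model's pipeline independent.
from sklearn.base import clone
XGBClass_pipeline = clone(pipeline).set_params(model=XGBClass)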

XGBClass_pipeline

In [None]:
# Trained the model
XGBClass_pipeline.fit(X_train, y_train)

# Predictions & Evaluation
y_pred3 = XGBClass_pipeline.predict(X_val)
precision3 = precision_score(y_val, y_pred3)
classification_report3 = classification_report(y_val, y_pred3)

# Cross-validated
cv_scores3 = cross_val_score(XGBClass_pipeline, X_train_val, y_train_val, cv=5, scoring='precision')

print("The Precision score on the validation set is: ", precision3)
print("Classification Report:\n", classification_report3)
print("Cross-validation precision scores:\n", cv_scores3)

##### FINDINGS: Validation Set
Keeping in mind that Yes = 1 and No = 0, 

In [None]:
# Created the confusion matrix for visualization
cm3 = confusion_matrix(y_val, y_pred3, labels=XGBClass_pipeline.named_steps['model'].classes_)

plt.figure(figsize=(8, 6))
sns.heatmap(cm3, annot=True, fmt='d', cmap='Purples')

plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')

plt.xticks(ticks=[0.5, 1.5], labels=['No', 'Yes'])
plt.yticks(ticks=[0.5, 1.5], labels=['No', 'Yes'])

plt.show()

##### FINDINGS: Validation Set

#### GridSearchCV:
Used GridSearchCV again to find the best hyperparameters and model.  Then applied that model to the pipeline and retested against the validation and holdout sets.
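
Before running a grid search, it helps to know how many model fits it will trigger. This sketch counts them for the two grids used in this notebook with scikit-learn's `ParameterGrid`; the `fit_counts` dictionary is just a name for this example.

```python
from sklearn.model_selection import ParameterGrid

# The two grids used in this notebook.
RF_param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [None, 10, 20],
    'model__min_samples_split': [2, 5, 10],
    'model__min_samples_leaf': [1, 2, 4],
}
param_grid_XGBClass = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [3, 4, 5],
    'model__learning_rate': [0.01, 0.1, 0.2],
}

cv_folds = 5  # matches cv=5 in both GridSearchCV calls

fit_counts = {}
for name, grid in [("Random Forest", RF_param_grid), ("XGBoost", param_grid_XGBClass)]:
    n_candidates = len(ParameterGrid(grid))  # number of parameter combinations
    # One fit per combination per fold (GridSearchCV adds one final refit on top).
    fit_counts[name] = n_candidates * cv_folds
    print(f"{name}: {n_candidates} combinations x {cv_folds} folds = {fit_counts[name]} fits")
```

The XGBoost grid has one fewer hyperparameter than the Random Forest grid, so it needs a third as many fits.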

In [None]:
# Set up the parameter grid for hyperparameter tuning
param_grid_XGBClass = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [3, 4, 5],
    'model__learning_rate': [0.01, 0.1, 0.2],
}


# Set up GridSearchCV
grid_search_XGBClass = GridSearchCV(XGBClass_pipeline, param_grid_XGBClass, cv=5, scoring='precision')

# Fit GridSearchCV to the train_val set to find the best parameters
grid_search_XGBClass.fit(X_train_val, y_train_val)

# Best hyperparameters and the corresponding best model
best_XGBClass_params = grid_search_XGBClass.best_params_
best_XGBClass_model = grid_search_XGBClass.best_estimator_
best_XGBClass_score = grid_search_XGBClass.best_score_

print("\033[1mThe Best Parameters are:\033[0m\n", best_XGBClass_params)
print()
print("\033[1mThe Best Model is:\033[0m\n", best_XGBClass_model)
print()
print("\033[1mThe Best Precision Score is: \033[0m", best_XGBClass_score)