In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split


# 07. Findings
___


- [ ] Look into why the recall score is low
- [ ] Build the Test_with_Predict dataframe, then filter for false negatives
- [ ] See which rows have a strong coef for 1 and a strong coef for 0


### Introduction


In this notebook, I will look at the scores of each classification model and compare their performance. I also want to investigate where my model was not able to classify correctly.
___


In [18]:
score= pd.read_csv('../data/cleaned_data/scores.csv', index_col=0)

score_pca= pd.read_csv('../data/cleaned_data/scores_pca.csv', index_col=0)

score_rf= pd.read_csv('../data/cleaned_data/scores_rf.csv', index_col=0)


In [19]:
pd.concat([score, score_pca, score_rf])

Unnamed: 0,F1 score,Recall score,Precision score,Accuracy
Basline LogReg,37.96,27.92,59.3,94.45
Best LogReg,37.96,27.78,59.91,94.48
Best DT,39.59,29.79,59.01,94.47
Best SMOTE LogReg,27.54,57.67,18.09,81.56
Best SMOTE DT,26.13,32.16,22.0,88.94
Best LogReg (PCA),36.88,26.71,59.54,93.76
Best DT (PCA),39.84,31.36,54.6,92.36
Base RF,32.99,22.7,60.36,94.39
Best RF,29.89,19.56,63.31,94.42
Best SMOTE RF,30.45,66.17,19.77,81.62


Here, I can compare my models' scores side by side.

Every model has a relatively low F1 score because recall and precision trade off against each other: the models with high precision have low recall, and vice versa. Most of the models behave similarly despite tuning different hyperparameters and resampling the data.
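F1 is the harmonic mean of precision and recall, so it is pulled towards the lower of the two. The sketch below recomputes F1 from the rounded percentages in the table above. The helper name `f1_from_percentages` is illustrative and is not part of the notebook.

```python
def f1_from_percentages(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as percentages."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the score table above.
print(round(f1_from_percentages(59.3, 27.92), 2))   # Baseline LogReg
print(round(f1_from_percentages(19.77, 66.17), 2))  # Best SMOTE RF
```

Because the inputs are already rounded, the results can differ from the table in the last decimal place.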


Surprisingly, the PCA decision tree performs better than my decision tree with SMOTE: the F1 score rises from 26.13% to 39.84%. With PCA, the model is more precise (54.60% vs 22.00%) and more accurate (92.36% vs 88.94%), and it performs similarly to the best decision tree that uses neither SMOTE nor PCA (F1 of 39.59%). I expected Best SMOTE DT to outperform the other two DT models, but SMOTE only raised the decision tree's recall marginally (to 32.16%) while its precision fell sharply.

Unfortunately, none of my models reach a recall above 70%, which would have been ideal. The models with the highest recall were the Logistic Regression with SMOTE (57.67%) and the Random Forest with SMOTE (66.17%). So, unlike for the decision tree, SMOTE clearly improved recall for these two models, although at the cost of precision (18.09% and 19.77%) and accuracy.
Something to look into is combining SMOTE, to inflate the minority class, with random undersampling of the majority class.
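As a rough sketch of the undersampling half of that idea, the helper below randomly drops majority-class rows until a chosen ratio is reached. The function name, the `majority_per_minority` parameter and the toy data are all assumptions for illustration. The SMOTE step itself is not shown, and any resampling should only be applied to the training split, never to the test set.

```python
import pandas as pd


def undersample_majority(df: pd.DataFrame, target: str,
                         majority_per_minority: float = 2.0,
                         random_state: int = 25) -> pd.DataFrame:
    """Randomly drop majority-class rows (binary target assumed) until there
    are at most `majority_per_minority` majority rows per minority row."""
    counts = df[target].value_counts()
    majority_label = counts.idxmax()
    n_keep = min(counts.max(), int(counts.min() * majority_per_minority))

    majority = df[df[target] == majority_label].sample(n=n_keep, random_state=random_state)
    minority = df[df[target] != majority_label]
    # Shuffle so the two classes are not grouped together.
    return pd.concat([majority, minority]).sample(frac=1, random_state=random_state)


# Toy training data: 90 negatives and 10 positives (not the real dataset).
toy = pd.DataFrame({"feature": range(100), "HadAngina": [0] * 90 + [1] * 10})
balanced = undersample_majority(toy, "HadAngina", majority_per_minority=2.0)
print(balanced["HadAngina"].value_counts())
```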


---
### Investigation

We can load our best model by recall, the RandomForestClassifier that was trained on a SMOTE-balanced dataset. As a reminder, we can check what the best estimator is.

In [21]:
model = joblib.load('../model/fitted_RF_sm.pkl')

In [22]:
model.best_estimator_

Pipeline(memory='/var/folders/r3/bz5mjtds4dvdw0hskxwvs9vc0000gp/T/tmpyvthk1ol',
         steps=[('scaler', StandardScaler()),
                ('model',
                 RandomForestClassifier(max_depth=12, n_estimators=128))])

We will load in our data again and input our test data into our model.

In [23]:
heart22 = pd.read_csv('~/Desktop/capstone-project-Tasnimacj/data/cleaned_data/heart22_preprocessed.csv', index_col=0)
y = heart22['HadAngina'] # Target Variable
X = heart22.drop('HadAngina', axis=1) 

In [24]:
X_rem, X_test, y_rem, y_test = train_test_split(X, y, test_size=0.2, random_state=25, stratify=y)

print(f'The remainder set has {len(X_rem)} data points.')
print(f'The test set has {len(X_test)} data points.')

The remainder set has 196810 data points.
The test set has 49203 data points.


We can use our model to make our predictions again.

In [25]:
y_predicts = model.predict(X_test)

In [28]:
# Work on an explicit copy so that adding columns does not raise SettingWithCopyWarning.
X_test = X_test.copy()
X_test['preds'] = y_predicts
X_test['actual'] = y_test


In [30]:
X_test.head()

Unnamed: 0,Female,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,HadStroke,...,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,RaceEthnicity_Black only,RaceEthnicity_Hispanic,RaceEthnicity_Multiracial,RaceEthnicity_Other race only,preds,actual
157820,1,4,0.0,0.0,1,0,12.0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
116930,1,3,0.0,0.0,1,0,8.0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
24338,0,1,0.0,0.0,1,1,7.0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
39845,0,1,0.0,0.0,1,1,7.0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
112477,0,3,0.0,0.0,1,1,7.0,0,0,0,...,1,1,0,1,0,0,0,0,1,0


I add the predicted values and the actual y values to X_test as two new columns. Using this combined table, we can look more closely at which data points the model got wrong.

In [32]:
X_test.loc[(X_test['preds'] == 0) & (X_test['actual'] == 1)]

Unnamed: 0,Female,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,HadStroke,...,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,RaceEthnicity_Black only,RaceEthnicity_Hispanic,RaceEthnicity_Multiracial,RaceEthnicity_Other race only,preds,actual
167573,1,4,10.0,0.0,1,1,6.0,2,0,0,...,1,1,0,1,0,0,0,0,0,1
227127,0,1,0.0,0.0,1,1,7.0,0,0,0,...,1,1,0,0,0,0,0,0,0,1
234244,1,5,15.0,30.0,1,1,6.0,3,0,0,...,0,1,0,0,0,0,0,0,0,1
80028,1,3,0.0,30.0,1,1,8.0,3,0,1,...,1,1,0,1,0,0,0,0,0,1
174634,0,2,0.0,0.0,1,1,6.0,1,0,0,...,0,0,0,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
184435,1,3,3.0,2.0,1,1,5.0,1,0,0,...,0,1,0,1,0,0,0,1,0,1
111039,1,3,0.0,1.0,1,1,8.0,1,0,0,...,1,1,0,0,0,0,0,0,0,1
162031,1,3,4.0,0.0,1,0,7.0,0,0,0,...,1,0,0,1,0,0,0,0,0,1
175003,0,3,0.0,0.0,1,0,8.0,0,0,0,...,1,1,0,0,0,0,0,0,0,1


There were 1012 false negatives: rows that were predicted as 0 but were actually 1.
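The same count can be read off a confusion matrix, which also gives the recall directly. This is a small sketch on made-up data. The `results` frame stands in for the table with `preds` and `actual` columns and does not reproduce the real numbers.

```python
import pandas as pd

# Toy stand-in for the table with 'preds' and 'actual' columns.
results = pd.DataFrame({
    "actual": [1, 1, 1, 0, 0, 0, 0, 1],
    "preds":  [0, 1, 0, 0, 1, 0, 0, 1],
})

# Rows are the true class, columns are the predicted class.
confusion = pd.crosstab(results["actual"], results["preds"])
print(confusion)

false_negatives = confusion.loc[1, 0]  # actual 1, predicted 0
true_positives = confusion.loc[1, 1]   # actual 1, predicted 1
recall = true_positives / (true_positives + false_negatives)
print(f"False negatives: {false_negatives}, recall: {recall:.2%}")
```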

In [33]:

# Pair each feature name with its importance from the fitted random forest.
feat_importances = pd.DataFrame({
    'Feature': X_rem.columns,
    'Importance': model.best_estimator_.named_steps['model'].feature_importances_,
})
feat_importances = feat_importances.sort_values(by='Importance', ascending=False)

If we look at the model's feature importances, we can check whether the data points we misclassified follow a pattern in the most important features.

In [35]:
feat_importances.head()

Unnamed: 0,Feature,Importance
26,AgeCategory,0.173836
8,HadHeartAttack,0.105537
30,AlcoholDrinkers,0.089224
0,Female,0.067024
4,LastCheckupTime,0.066795


In [39]:
X_test['AgeCategory'].loc[(X_test['preds'] == 0) & (X_test['actual'] == 1)].value_counts()

AgeCategory
11    165
9     159
13    152
10    142
12    126
8      88
7      68
6      39
5      31
4      16
3      13
2       7
1       6
Name: count, dtype: int64

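It seems that the model struggled with data points where the person is 60 or over: age categories 9 to 13 account for 744 of the 1012 false negatives (about 74%). These are raw counts, so they do not account for how many actual positives each age category contains.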
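One way to check this is to divide the misses in each age category by the number of actual positives in that category, which gives a miss rate rather than a raw count. The sketch below uses made-up values. Only the column names `AgeCategory`, `preds` and `actual` come from the notebook.

```python
import pandas as pd

# Toy stand-in for the test table; the values are made up.
results = pd.DataFrame({
    "AgeCategory": [3, 3, 9, 9, 9, 9, 13, 13, 13, 13],
    "actual":      [1, 0, 1, 1, 1, 0, 1, 1, 1, 1],
    "preds":       [0, 0, 0, 1, 1, 0, 0, 1, 1, 1],
})

# Keep only the true cases, then flag the ones the model missed.
positives = results[results["actual"] == 1]
summary = (
    positives.assign(missed=positives["preds"] == 0)
    .groupby("AgeCategory")["missed"]
    .agg(actual_positives="count", false_negatives="sum", miss_rate="mean")
)
print(summary)
```

In this toy data every category has one false negative, but the miss rate differs because the number of actual positives differs.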

If we look more closely at which features the model deemed less important, perhaps we can see if the misclassified datapoints had those features.

In [40]:
feat_importances.tail()

Unnamed: 0,Feature,Importance
21,DifficultyDressingBathing,0.001332
9,HadStroke,0.001042
12,HadCOPD,0.000983
39,RaceEthnicity_Multiracial,0.000817
14,HadKidneyDisease,0.000629


In [52]:

# Rows the model predicted as 0 that were actually 1.
false_negatives = (X_test['preds'] == 0) & (X_test['actual'] == 1)

least_important = ['DifficultyDressingBathing', 'HadStroke', 'HadCOPD',
                   'RaceEthnicity_Multiracial', 'HadKidneyDisease']

for col in least_important:
    print(X_test.loc[false_negatives, col].value_counts())

DifficultyDressingBathing
0    907
1    105
Name: count, dtype: int64
HadStroke
0    873
1    139
Name: count, dtype: int64
HadCOPD
0    785
1    227
Name: count, dtype: int64
RaceEthnicity_Multiracial
0    997
1     15
Name: count, dtype: int64
HadKidneyDisease
0    848
1    164
Name: count, dtype: int64


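Most of the false negatives have a value of 0 for each of the 5 least important features, so the model may have had difficulty with rows where these conditions are absent. However, 0 is probably also the most common value in these columns overall, so this pattern alone does not show that these features caused the errors.

The DifficultyDressingBathing column in the original dataset is quite imbalanced, which may be why the model gave it so little importance.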
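To see whether these features are actually associated with the errors, the false negatives could be compared against the true positives: if a flag is set about as often in both groups, it is unlikely to explain the misses. A minimal sketch on made-up data, using two of the column names from above:

```python
import pandas as pd

# Toy stand-in for the test table; the values are made up.
results = pd.DataFrame({
    "HadStroke": [0, 0, 1, 1, 0, 1, 0, 0],
    "HadCOPD":   [0, 1, 1, 1, 0, 0, 0, 1],
    "actual":    [1, 1, 1, 1, 1, 1, 0, 0],
    "preds":     [0, 0, 0, 1, 1, 1, 0, 1],
})

# Among the true cases, label each row by whether the model caught it.
positives = results[results["actual"] == 1]
outcome = positives["preds"].map({0: "false_negative", 1: "true_positive"})

# Share of rows with the flag set (value 1) in each group.
prevalence = positives.groupby(outcome)[["HadStroke", "HadCOPD"]].mean().T
print(prevalence)
```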

___


### Conclusion


My model seems to have trouble with data points that depend on features of low importance. There could also be other values in certain columns, or combinations of values, that make it difficult to predict correctly.

As a next step, I want to try combining upsampling and downsampling of my dataset to see if it improves my recall score without sacrificing my precision score. Another thing to try is resampling on different features, so that the features as well as the target variable are more balanced, and seeing whether the model then gives more importance to other features.