## Optimization Attempt 3
To try to optimize the random forest (RF) model, this attempt drops three features with low importances: diabetes, BPMeds, and prevalentStroke.
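
For context, importances of this kind are usually read from a fitted forest's `feature_importances_` attribute. The sketch below is a minimal illustration on synthetic data, not the analysis that produced the list above. The dataset, the feature names `f0` to `f5`, and the 0.05 cutoff are all assumptions chosen for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 6 features, only 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    random_state=1,
)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=78)
forest.fit(X_demo, y_demo)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)

# Features below an illustrative cutoff are candidates for removal
cutoff = 0.05  # assumption, not a value taken from this notebook
drop_candidates = importances[importances < cutoff].index.tolist()
print("Candidates to drop:", drop_candidates)
```

In this notebook the same pattern would apply to a forest trained on the full feature set, with `X.columns` as the index.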

In [1]:
# Importing dependencies
import pandas as pd
from pathlib import Path
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
%matplotlib inline


In [2]:
# Reading in the data
file_path = Path('../Resources/cleaned_data.csv')
df = pd.read_csv(file_path)
df.head()

,sex,age,education,smokingStatus,cigsPerDay,BPMeds,prevalentStroke,prevalentHyp,diabetes,totChol,sysBP,diaBP,BMI,heartRate,glucose,CHDRisk
0,1.0,39,4,0.0,0,0,0,0,0,195,106.0,70.0,26.97,80,77,0
1,0.0,46,2,0.0,0,0,0,0,0,250,121.0,81.0,28.73,95,76,0
2,1.0,48,1,1.0,20,0,0,0,0,245,127.5,80.0,25.34,75,70,0
3,0.0,61,3,1.0,30,0,0,1,0,225,150.0,95.0,28.58,65,103,1
4,0.0,46,3,1.0,23,0,0,0,0,285,130.0,84.0,23.1,85,85,0


In [3]:
# Drop Columns
df.drop(columns=['diabetes', 'BPMeds', 'prevalentStroke'], inplace=True)
df.head()

,sex,age,education,smokingStatus,cigsPerDay,prevalentHyp,totChol,sysBP,diaBP,BMI,heartRate,glucose,CHDRisk
0,1.0,39,4,0.0,0,0,195,106.0,70.0,26.97,80,77,0
1,0.0,46,2,0.0,0,0,250,121.0,81.0,28.73,95,76,0
2,1.0,48,1,1.0,20,0,245,127.5,80.0,25.34,75,70,0
3,0.0,61,3,1.0,30,1,225,150.0,95.0,28.58,65,103,1
4,0.0,46,3,1.0,23,0,285,130.0,84.0,23.1,85,85,0


In [4]:
# Separating target variables and features
y = df['CHDRisk']
X = df.drop(columns='CHDRisk')

In [5]:
# Count Unique Values
y.value_counts()

CHDRisk
0    3084
1     553
Name: count, dtype: int64

In [6]:
# Splitting the data into testing and training data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)

In [7]:
# Count Unique Values
y_train.value_counts()

CHDRisk
0    2312
1     415
Name: count, dtype: int64

In [8]:
# Using synthetic minority over-sampling technique to balance the target variable conditions in the training data
smote = SMOTE(random_state=1)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [9]:
y_train_resampled.value_counts()

CHDRisk
0    2312
1    2312
Name: count, dtype: int64
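
The counts above imply that SMOTE generated 2312 - 415 = 1897 synthetic minority rows. As a rough illustration of the idea, the sketch below interpolates between a minority sample and one of its nearest minority neighbours. It is a simplified NumPy version written for this explanation, not imblearn's implementation; the function name `smote_like_oversample` and the toy data are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # cannot have more neighbours than other samples

    # Pairwise Euclidean distances; a sample is never its own neighbour
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample, one of its k neighbours, and a random gap in [0, 1)
    base = rng.integers(0, n, size=n_new)
    partner = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[partner] - X_minority[base])

# Toy minority class (assumption): 10 samples with 2 features
X_min = np.random.default_rng(1).normal(size=(10, 2))
synthetic = smote_like_oversample(X_min, n_new=7)
print(synthetic.shape)
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies.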

In [10]:
X_train_resampled.shape

(4624, 12)

In [11]:
y_train_resampled.shape

(4624,)

In [12]:
# Scaling the feature variables
scaler = StandardScaler()
X_scaler = scaler.fit(X_train_resampled)

X_train_scaled = X_scaler.transform(X_train_resampled)
X_test_scaled = X_scaler.transform(X_test)

In [13]:
# Instantiating the model
rf_model = RandomForestClassifier(n_estimators=500, random_state=78)

In [14]:
# Training the model
rf_model = rf_model.fit(X_train_scaled, y_train_resampled)

In [15]:
# Making predictions with the testing data
test_predictions = rf_model.predict(X_test_scaled)

## Testing Data Results

In [17]:
# Creating the confusion matrix
cm = confusion_matrix(y_test, test_predictions)
cm_df = pd.DataFrame(
    cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"]
)

# Calculating the accuracy score
acc_score = accuracy_score(y_test, test_predictions)

In [18]:
# Printing the results
print("Confusion Matrix")
display(cm_df)
print(f"Accuracy Score : {acc_score}")
print("Classification Report")
print(classification_report(y_test, test_predictions))

Confusion Matrix


,Predicted 0,Predicted 1
Actual 0,710,62
Actual 1,102,36


Accuracy Score : 0.8197802197802198
Classification Report
              precision    recall  f1-score   support

           0       0.87      0.92      0.90       772
           1       0.37      0.26      0.31       138

    accuracy                           0.82       910
   macro avg       0.62      0.59      0.60       910
weighted avg       0.80      0.82      0.81       910
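
As a cross-check, the headline numbers in the report can be recomputed by hand from the confusion matrix. The sketch below uses only the four counts printed above; the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix from the results above: rows = actual, columns = predicted
cm = np.array([[710, 62],
               [102, 36]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # 746 / 910
recall_1 = tp / (tp + fn)         # share of actual positives that were caught
precision_1 = tp / (tp + fp)      # share of predicted positives that were correct
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy    = {accuracy:.4f}")     # 0.8198
print(f"recall_1    = {recall_1:.4f}")     # 0.2609
print(f"precision_1 = {precision_1:.4f}")  # 0.3673
print(f"f1_1        = {f1_1:.4f}")         # 0.3051
```

The recall figure is the important one here: only 36 of the 138 actual positive cases in the test set were flagged.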



In [29]:
# Saving the model in two formats (joblib and pickle)
import pickle

import joblib

joblib_file = "rf_model.joblib"
joblib.dump(rf_model, joblib_file)

with open("rf_model.pkl", "wb") as file:
    pickle.dump(rf_model, file)
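
Because the model was trained on scaled features, anything that reloads it will need the same fitted scaler to prepare new data. One possible approach, sketched below on toy data, is to save the scaler and the model together in a single joblib file. The toy arrays, the `toy_` names, and the file name `rf_bundle.joblib` are assumptions for illustration; this is not a step the notebook performs.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Toy stand-ins (assumption) for the notebook's fitted scaler and model
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

toy_scaler = StandardScaler().fit(X_toy)
toy_model = RandomForestClassifier(n_estimators=50, random_state=78)
toy_model.fit(toy_scaler.transform(X_toy), y_toy)

# Store both objects in one file so they stay paired
bundle_path = os.path.join(tempfile.mkdtemp(), "rf_bundle.joblib")  # hypothetical name
joblib.dump({"scaler": toy_scaler, "model": toy_model}, bundle_path)

# Later: reload and apply the same scaling before predicting
bundle = joblib.load(bundle_path)
X_new = rng.normal(size=(5, 4))
reloaded_preds = bundle["model"].predict(bundle["scaler"].transform(X_new))
original_preds = toy_model.predict(toy_scaler.transform(X_new))
print(reloaded_preds)
```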

## Conclusion
Removing the low-importance features (diabetes, BPMeds, and prevalentStroke) produced a model that performed just as well as our baseline while using fewer features, which makes this the best model. However, recall for the minority class (CHDRisk = 1) is still low at 0.26, which is a concern in health data. To address this, we'll lower the decision threshold to 0.45 and see whether recall for the minority class improves.

In [27]:
# Predict probabilities on the test data
y_proba = rf_model.predict_proba(X_test_scaled)[:, 1]

# Define a new threshold
threshold = 0.45

# Predict classes based on the new threshold
y_pred_threshold = (y_proba >= threshold).astype(int)
# Calculating the accuracy score
acc_score = accuracy_score(y_test, y_pred_threshold)

In [28]:
# Evaluate the model with the new threshold
print("Classification Report with Adjusted Threshold:")
print(classification_report(y_test, y_pred_threshold))
print(f"Accuracy Score : {acc_score}")

Classification Report with Adjusted Threshold:
              precision    recall  f1-score   support

           0       0.88      0.86      0.87       772
           1       0.30      0.33      0.31       138

    accuracy                           0.78       910
   macro avg       0.59      0.60      0.59       910
weighted avg       0.79      0.78      0.79       910

Accuracy Score : 0.7824175824175824


## Conclusion
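Lowering the threshold to 0.45 raised recall for the minority class from 0.26 to 0.33, but precision for that class fell from 0.37 to 0.30 and overall accuracy dropped from 0.82 to 0.78. The minority-class F1 score stayed at 0.31, so the lower threshold trades false negatives for false positives.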
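
To see how this trade-off changes across several thresholds at once, a small sweep can help. The sketch below is one way to do it in plain NumPy; the helper `threshold_report` and the toy labels and probabilities are hypothetical and are not the notebook's predictions.

```python
import numpy as np

def threshold_report(y_true, y_proba, thresholds):
    """Return (threshold, recall, precision, accuracy) for the positive class."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    rows = []
    for t in thresholds:
        y_pred = (y_proba >= t).astype(int)
        tp = int(np.sum((y_pred == 1) & (y_true == 1)))
        fp = int(np.sum((y_pred == 1) & (y_true == 0)))
        fn = int(np.sum((y_pred == 0) & (y_true == 1)))
        tn = int(np.sum((y_pred == 0) & (y_true == 0)))
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        accuracy = (tp + tn) / len(y_true)
        rows.append((t, recall, precision, accuracy))
    return rows

# Toy data (assumption): 6 negatives followed by 4 positives
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_proba_demo = [0.10, 0.20, 0.30, 0.42, 0.47, 0.60, 0.35, 0.46, 0.55, 0.80]

rows = threshold_report(y_true_demo, y_proba_demo, [0.5, 0.45, 0.4])
for t, rec, prec, acc in rows:
    print(f"threshold={t:.2f}  recall={rec:.2f}  precision={prec:.2f}  accuracy={acc:.2f}")
```

In this notebook the function could be called with `y_test` and `y_proba` to compare candidate thresholds before settling on one.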