## Running K-fold Cross-Validation 

This check runs in a separate notebook to determine whether the saved Random Forest model is overfit.

In [22]:
# Importing dependencies
import joblib
import pandas as pd
from pathlib import Path
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.metrics import accuracy_score, recall_score, classification_report

In [23]:
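# Loading the saved model (cross_validate clones and refits it on every fold,
# so only its hyperparameters are reused below)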
rf_model = joblib.load("rf_model.joblib")

In [24]:
# Loading dataset
file_path = Path('Resources/cleaned_data.csv')
df = pd.read_csv(file_path)
df.head()

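|   | sex | age | education | smokingStatus | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | CHDRisk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 39 | 4 | 0.0 | 0 | 0 | 0 | 0 | 0 | 195 | 106.0 | 70.0 | 26.97 | 80 | 77 | 0 |
| 1 | 0.0 | 46 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | 250 | 121.0 | 81.0 | 28.73 | 95 | 76 | 0 |
| 2 | 1.0 | 48 | 1 | 1.0 | 20 | 0 | 0 | 0 | 0 | 245 | 127.5 | 80.0 | 25.34 | 75 | 70 | 0 |
| 3 | 0.0 | 61 | 3 | 1.0 | 30 | 0 | 0 | 1 | 0 | 225 | 150.0 | 95.0 | 28.58 | 65 | 103 | 1 |
| 4 | 0.0 | 46 | 3 | 1.0 | 23 | 0 | 0 | 0 | 0 | 285 | 130.0 | 84.0 | 23.1 | 85 | 85 | 0 |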


In [25]:
# Dropping features with low importances
df.drop(columns=['diabetes', 'BPMeds', 'prevalentStroke'], inplace=True)
df.head()

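|   | sex | age | education | smokingStatus | cigsPerDay | prevalentHyp | totChol | sysBP | diaBP | BMI | heartRate | glucose | CHDRisk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 39 | 4 | 0.0 | 0 | 0 | 195 | 106.0 | 70.0 | 26.97 | 80 | 77 | 0 |
| 1 | 0.0 | 46 | 2 | 0.0 | 0 | 0 | 250 | 121.0 | 81.0 | 28.73 | 95 | 76 | 0 |
| 2 | 1.0 | 48 | 1 | 1.0 | 20 | 0 | 245 | 127.5 | 80.0 | 25.34 | 75 | 70 | 0 |
| 3 | 0.0 | 61 | 3 | 1.0 | 30 | 1 | 225 | 150.0 | 95.0 | 28.58 | 65 | 103 | 1 |
| 4 | 0.0 | 46 | 3 | 1.0 | 23 | 0 | 285 | 130.0 | 84.0 | 23.1 | 85 | 85 | 0 |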


In [26]:
# Preparing the data
y = df['CHDRisk']
X = df.drop(columns='CHDRisk')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

In [27]:
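# Oversampling the minority class with SMOTE (training split only, so the test set stays untouched)
smote = SMOTE(random_state=1)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)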

In [28]:
# Standardizing the data
scaler = StandardScaler()
X_resampled_scaled = scaler.fit_transform(X_train_resampled)
X_test_scaled = scaler.transform(X_test)

In [29]:
# Setting up k-fold cross-validation
k = 5
kf = StratifiedKFold(n_splits=k, shuffle=True, random_state=1)
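
As a small illustration of what `StratifiedKFold` does, the sketch below checks the share of positive labels in each validation fold. The labels are a made-up imbalanced array (85 negatives, 15 positives), not the CHD data, and the names `y_demo`, `X_demo`, and `fold_positive_rates` are placeholders introduced here.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels (not the CHD data): 85 negatives, 15 positives
y_demo = np.array([0] * 85 + [1] * 15)
X_demo = np.zeros((len(y_demo), 1))  # feature values do not affect the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Share of positive labels in each validation fold
fold_positive_rates = [
    float(y_demo[val_idx].mean()) for _, val_idx in skf.split(X_demo, y_demo)
]
print(fold_positive_rates)
```

Every fold should keep roughly the same class ratio as the full label array, which is what makes per-fold recall comparable from one fold to the next.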

In [30]:
# Metrics to compute on each fold
scoring = ['accuracy', 'recall']
# Performing cross-validation on the resampled, scaled training data
cv_results = cross_validate(rf_model, X_resampled_scaled, y_train_resampled, cv=kf, scoring=scoring)


In [31]:
# Print cross-validation results
print("Cross-validation results (accuracy):", cv_results['test_accuracy'])
print("Mean cross-validation accuracy:", cv_results['test_accuracy'].mean())
print("Standard deviation of cross-validation accuracy:", cv_results['test_accuracy'].std())

print("Cross-validation results (recall):", cv_results['test_recall'])
print("Mean cross-validation recall:", cv_results['test_recall'].mean())
print("Standard deviation of cross-validation recall:", cv_results['test_recall'].std())

Cross-validation results (accuracy): [0.91286727 0.90273556 0.90070922 0.90678825 0.91176471]
Mean cross-validation accuracy: 0.9069730019667442
Standard deviation of cross-validation accuracy: 0.004794327302056221
Cross-validation results (recall): [0.9127789  0.87829615 0.89676113 0.88461538 0.9148073 ]
Mean cross-validation recall: 0.89745177423196
Standard deviation of cross-validation recall: 0.01461727234180685
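
The fold scores above describe validation performance only. One way to look at overfitting more directly is to ask `cross_validate` for training scores as well, via `return_train_score=True`, and compare the two. The sketch below uses synthetic data and a Random Forest with placeholder hyperparameters, so its numbers say nothing about the CHD model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data and placeholder hyperparameters (assumptions)
X_demo, y_demo = make_classification(n_samples=500, n_features=12, random_state=1)
model = RandomForestClassifier(n_estimators=50, random_state=1)
kf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# return_train_score=True adds train_<metric> arrays next to test_<metric>
gap_results = cross_validate(
    model, X_demo, y_demo, cv=kf_demo,
    scoring=["accuracy", "recall"], return_train_score=True,
)

for metric in ("accuracy", "recall"):
    train_mean = gap_results[f"train_{metric}"].mean()
    val_mean = gap_results[f"test_{metric}"].mean()
    print(f"{metric}: train={train_mean:.3f}, validation={val_mean:.3f}, "
          f"gap={train_mean - val_mean:.3f}")
```

A large gap between training and validation scores points toward overfitting, while a small gap with low fold-to-fold variance supports the opposite reading.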


## Conclusion
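Accuracy (mean 0.907, standard deviation 0.005) and recall (mean 0.897, standard deviation 0.015) vary little across the five folds, so the model's performance does not depend on a particular split of the training data. This is consistent with a model that is not overfit. Because the folds were drawn from SMOTE-resampled data, scoring the held-out test set remains the more direct check.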
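
The notebook builds `X_test_scaled` and `y_test` but does not score them. A minimal sketch of a held-out check follows, using synthetic data and a freshly fitted Random Forest with placeholder hyperparameters. SMOTE is left out to keep the example short. In this notebook the equivalent step would presumably be predicting on `X_test_scaled` with `rf_model`, assuming the saved model was trained on the same columns and scaling.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the CHD dataset (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, weights=[0.85, 0.15], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# Fit the scaler on the training split only, then apply it to the test split
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

# Placeholder hyperparameters, not those of the saved rf_model
model = RandomForestClassifier(n_estimators=50, random_state=1)
model.fit(X_tr_scaled, y_tr)

train_accuracy = accuracy_score(y_tr, model.predict(X_tr_scaled))
test_predictions = model.predict(X_te_scaled)
test_accuracy = accuracy_score(y_te, test_predictions)
test_recall = recall_score(y_te, test_predictions)

print(f"train accuracy={train_accuracy:.3f}, "
      f"test accuracy={test_accuracy:.3f}, test recall={test_recall:.3f}")
```

If test accuracy and recall land close to the cross-validation means, that supports the conclusion. A sharp drop on the test set would suggest the cross-validation scores were inflated.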