## Optimization Attempt 1
To try to improve the random forest model, this trial drops the features with low PCA loadings (sex, BPMeds, prevalentStroke, diabetes).
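
The notebook does not show how the PCA loadings were obtained. As a rough sketch of one common approach, the cell below computes loadings on synthetic stand-in data (not the CHD dataset), assuming "loading" means an eigenvector of the correlation matrix scaled by the square root of its eigenvalue. The feature names `a` to `d` are hypothetical.

```python
import numpy as np

# Synthetic stand-in data (NOT the CHD dataset): "a" and "b" move together,
# while "c" and "d" are independent noise.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
X = np.column_stack([
    a,
    a + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])
feature_names = ["a", "b", "c", "d"]

# PCA on standardized features is an eigendecomposition of the correlation matrix.
corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)

# eigh returns eigenvalues in ascending order; put the largest component first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings (assumed definition): eigenvectors scaled by sqrt(eigenvalue).
loadings = eigvecs * np.sqrt(eigvals)

# Absolute loading of each feature on the first principal component.
pc1_strength = dict(zip(feature_names, np.abs(loadings[:, 0])))
for name, value in pc1_strength.items():
    print(f"{name}: {value:.2f}")
```

Features with small absolute loadings on the leading components contribute little to the directions of greatest variance. Low variance contribution does not necessarily mean low predictive value for the target, which may be one reason dropping such features does not always help a classifier.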

In [1]:
# Importing dependencies
import pandas as pd
from pathlib import Path
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
%matplotlib inline


In [2]:
# Reading in the data
file_path = Path('../Resources/cleaned_data.csv')
df = pd.read_csv(file_path)
df.head()

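| | sex | age | education | smokingStatus | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | CHDRisk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 39 | 4 | 0.0 | 0 | 0 | 0 | 0 | 0 | 195 | 106.0 | 70.0 | 26.97 | 80 | 77 | 0 |
| 1 | 0.0 | 46 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | 250 | 121.0 | 81.0 | 28.73 | 95 | 76 | 0 |
| 2 | 1.0 | 48 | 1 | 1.0 | 20 | 0 | 0 | 0 | 0 | 245 | 127.5 | 80.0 | 25.34 | 75 | 70 | 0 |
| 3 | 0.0 | 61 | 3 | 1.0 | 30 | 0 | 0 | 1 | 0 | 225 | 150.0 | 95.0 | 28.58 | 65 | 103 | 1 |
| 4 | 0.0 | 46 | 3 | 1.0 | 23 | 0 | 0 | 0 | 0 | 285 | 130.0 | 84.0 | 23.1 | 85 | 85 | 0 |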


In [3]:
# Drop Columns
df.drop(columns=['sex', 'BPMeds', 'prevalentStroke', 'diabetes'], inplace=True)
df.head()

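| | age | education | smokingStatus | cigsPerDay | prevalentHyp | totChol | sysBP | diaBP | BMI | heartRate | glucose | CHDRisk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 4 | 0.0 | 0 | 0 | 195 | 106.0 | 70.0 | 26.97 | 80 | 77 | 0 |
| 1 | 46 | 2 | 0.0 | 0 | 0 | 250 | 121.0 | 81.0 | 28.73 | 95 | 76 | 0 |
| 2 | 48 | 1 | 1.0 | 20 | 0 | 245 | 127.5 | 80.0 | 25.34 | 75 | 70 | 0 |
| 3 | 61 | 3 | 1.0 | 30 | 1 | 225 | 150.0 | 95.0 | 28.58 | 65 | 103 | 1 |
| 4 | 46 | 3 | 1.0 | 23 | 0 | 285 | 130.0 | 84.0 | 23.1 | 85 | 85 | 0 |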


In [4]:
# Separating target variables and features
y = df['CHDRisk']
X = df.drop(columns='CHDRisk')

In [5]:
# Count Unique Values
y.value_counts()

CHDRisk
0    3084
1     553
Name: count, dtype: int64

In [6]:
# Splitting the data into testing and training data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, stratify=y)

In [7]:
# Count Unique Values
y_train.value_counts()

CHDRisk
0    2312
1     415
Name: count, dtype: int64
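
As a quick sanity check on the stratified split, the class counts printed above can be compared by hand. The sketch below uses only those printed counts and assumes `train_test_split` used its default 25% test fraction, rounded up to a whole number of rows; `positive_share` is a helper name introduced here for illustration.

```python
import math

# Class counts copied from the value_counts() outputs above.
full_counts = {0: 3084, 1: 553}    # y.value_counts()
train_counts = {0: 2312, 1: 415}   # y_train.value_counts()

n_total = sum(full_counts.values())

# Assumption: default test fraction of 0.25, rounded up to whole rows.
n_test = math.ceil(0.25 * n_total)

# Whatever is not in the training set ends up in the test set.
test_counts = {label: full_counts[label] - train_counts[label] for label in full_counts}


def positive_share(counts):
    """Fraction of rows labelled CHDRisk == 1."""
    return counts[1] / sum(counts.values())


for name, counts in [("full", full_counts), ("train", train_counts), ("test", test_counts)]:
    print(f"{name:>5}: n={sum(counts.values())}, positive share={positive_share(counts):.4f}")
```

The positive share stays at roughly 15% in every subset, which is what `stratify=y` is meant to guarantee, and the derived test counts (772 and 138) match the support column of the classification report further down.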

In [8]:
# Using synthetic minority over-sampling technique to balance the target variable conditions in the training data
smote = SMOTE(random_state=1)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

In [9]:
y_train_resampled.value_counts()

CHDRisk
0    2312
1    2312
Name: count, dtype: int64
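
To make the resampling step less opaque, here is a simplified sketch of the idea behind SMOTE: a synthetic minority sample is placed at a random position on the line segment between an existing minority sample and one of its nearest minority neighbours. This is an illustration only, not imblearn's implementation (which, among other things, chooses among several nearest neighbours). The helper names and the three example points are hypothetical.

```python
import random


def nearest_neighbor(point, others):
    """Return the point in `others` with the smallest squared Euclidean distance."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))


def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between `point` and `neighbor`."""
    gap = rng.random()  # random fraction in [0, 1)
    return [p + gap * (q - p) for p, q in zip(point, neighbor)]


rng = random.Random(1)

# Hypothetical minority-class rows with two features (for example age and sysBP).
minority = [[50.0, 130.0], [52.0, 135.0], [61.0, 150.0]]

base = minority[0]
neighbor = nearest_neighbor(base, [m for m in minority if m is not base])
synthetic = smote_like_sample(base, neighbor, rng)
print(synthetic)
```

In this notebook's training data, balancing the classes means generating 2312 - 415 = 1897 synthetic positive rows. Only the training split is resampled; the test split keeps its original class balance.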

In [10]:
X_train_resampled.shape

(4624, 11)

In [11]:
y_train_resampled.shape

(4624,)

In [12]:
# Scaling the feature variables
scaler = StandardScaler()
X_scaler = scaler.fit(X_train_resampled)

X_train_scaled = X_scaler.transform(X_train_resampled)
X_test_scaled = X_scaler.transform(X_test)
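
For reference, the transformation that `StandardScaler` applies to each column is z = (x - mean) / std, with the mean and the population standard deviation taken from the data passed to `fit`. The sketch below reproduces that arithmetic in plain Python for a single column. The training values are the five `sysBP` entries shown in the `df.head()` output above, used purely for illustration, and the test value is hypothetical.

```python
from statistics import fmean, pstdev

# Illustrative values: the five sysBP entries from df.head() above.
train_sysbp = [106.0, 121.0, 127.5, 150.0, 130.0]
# Hypothetical unseen value standing in for a test row.
test_sysbp = [140.0]

# "fit": learn the mean and population standard deviation from training data only.
mu = fmean(train_sysbp)
sigma = pstdev(train_sysbp)

# "transform": apply the same training statistics to both splits.
train_scaled = [(x - mu) / sigma for x in train_sysbp]
test_scaled = [(x - mu) / sigma for x in test_sysbp]

print(f"mean={mu:.2f}, std={sigma:.2f}")
print([round(z, 3) for z in train_scaled])
print([round(z, 3) for z in test_scaled])
```

Reusing the training statistics for the test data, as the cell above does with `X_scaler.transform(X_test)`, keeps information about the test set out of the preprocessing step.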

In [13]:
# Instantiating the model
rf_model = RandomForestClassifier(n_estimators=500, random_state=78)

In [14]:
# Training the model
rf_model = rf_model.fit(X_train_scaled, y_train_resampled)

In [15]:
# Making predictions with the testing data
test_predictions = rf_model.predict(X_test_scaled)

## Testing Data Results

In [17]:
# Creating the confusion matrix
cm = confusion_matrix(y_test, test_predictions)
cm_df = pd.DataFrame(
    cm, index=["Actual 0", "Actual 1"], columns=["Predicted 0", "Predicted 1"]
)

# Calculating the accuracy score
acc_score = accuracy_score(y_test, test_predictions)

In [18]:
# Printing the results
print("Confusion Matrix")
display(cm_df)
print(f"Accuracy Score : {acc_score}")
print("Classification Report")
print(classification_report(y_test, test_predictions))

Confusion Matrix


| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 681 | 91 |
| Actual 1 | 101 | 37 |


Accuracy Score : 0.789010989010989
Classification Report
              precision    recall  f1-score   support

           0       0.87      0.88      0.88       772
           1       0.29      0.27      0.28       138

    accuracy                           0.79       910
   macro avg       0.58      0.58      0.58       910
weighted avg       0.78      0.79      0.79       910
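
The headline numbers in the report can be recomputed from the four confusion matrix cells, which shows where the gap between accuracy and positive-class recall comes from. The sketch below is plain arithmetic on the counts printed above; the variable names are chosen here for illustration.

```python
# Cells of the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 681, 91    # actual 0: predicted 0, predicted 1
fn, tp = 101, 37    # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp

# Accuracy: share of all test rows that were classified correctly.
accuracy = (tp + tn) / total

# Metrics for the positive class (CHDRisk == 1).
precision_1 = tp / (tp + fp)   # of the predicted positives, how many were real
recall_1 = tp / (tp + fn)      # of the real positives, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy    = {accuracy:.3f}")
print(f"precision_1 = {precision_1:.3f}")
print(f"recall_1    = {recall_1:.3f}")
print(f"f1_1        = {f1_1:.3f}")
```

Rounded to two decimals these agree with the report (0.79, 0.29, 0.27, 0.28). Accuracy is dominated by the 772 negative cases, so it stays near 0.79 even though the model finds only 37 of the 138 positive cases.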



## Conclusion
Removing the features with low PCA loadings (sex, BPMeds, prevalentStroke, diabetes) did not improve accuracy or overall model performance. Test accuracy was 0.79, and recall for the positive class (CHDRisk = 1) was only 0.27.