# GSB 545: Advanced Machine Learning for Business Analytics
## Name: Ruojia Kuang
## Date: 4/22/24

### Goal
We want to predict whether a patient has had a heart attack, using the 2022 version of the dataset with missing values removed (`heart_2022_no_nans.csv`).

### Data
https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease

### Constraints
1. You need to use at least one boosting model in your work to answer the questions above, but you should explore at least two other models in order to answer the above questions as best you can.
2. The Kaggle page indicates that the classes in this dataset are extremely unbalanced and need to be adjusted for accordingly.

In [2]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [3]:
df = pd.read_csv("/Users/ruojiakuang/Desktop/GSB 545 Advanced ML/Heart Disease/2022/heart_2022_no_nans.csv")
df.head()

| | State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Female | Very good | 4.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 | None of them | No | ... | 1.6 | 71.67 | 27.99 | No | No | Yes | Yes | Yes, received Tdap | No | No |
| 1 | Alabama | Male | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 6.0 | None of them | No | ... | 1.78 | 95.25 | 30.13 | No | No | Yes | Yes | Yes, received tetanus shot but not sure what type | No | No |
| 2 | Alabama | Male | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | No | 8.0 | 6 or more, but not all | No | ... | 1.85 | 108.86 | 31.66 | Yes | No | No | Yes | No, did not receive any tetanus shot in the pa... | No | Yes |
| 3 | Alabama | Female | Fair | 5.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 | None of them | No | ... | 1.7 | 90.72 | 31.32 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | Yes |
| 4 | Alabama | Female | Good | 3.0 | 15.0 | Within past year (anytime less than 12 months ... | Yes | 5.0 | 1 to 5 | No | ... | 1.55 | 79.38 | 33.07 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | No |


In [4]:
print(df.dtypes)

State                         object
Sex                           object
GeneralHealth                 object
PhysicalHealthDays           float64
MentalHealthDays             float64
LastCheckupTime               object
PhysicalActivities            object
SleepHours                   float64
RemovedTeeth                  object
HadHeartAttack                object
HadAngina                     object
HadStroke                     object
HadAsthma                     object
HadSkinCancer                 object
HadCOPD                       object
HadDepressiveDisorder         object
HadKidneyDisease              object
HadArthritis                  object
HadDiabetes                   object
DeafOrHardOfHearing           object
BlindOrVisionDifficulty       object
DifficultyConcentrating       object
DifficultyWalking             object
DifficultyDressingBathing     object
DifficultyErrands             object
SmokerStatus                  object
ECigaretteUsage               object
C

In [5]:
# Check the unique values in the column to ensure they are 'No' and 'Yes'
print(df['HadHeartAttack'].unique())

['No' 'Yes']


In [6]:
# Count the values again
heart_attack_counts = df['HadHeartAttack'].value_counts()

# Print the counts
print("Count of people who had a heart attack (Yes):", heart_attack_counts['Yes'])
print("Count of people who did not have a heart attack (No):", heart_attack_counts['No'])

Count of people who had a heart attack (Yes): 13435
Count of people who did not have a heart attack (No): 232587


Because the classes are extremely unbalanced, accuracy alone is misleading: a model can score highly by almost always predicting "No". We therefore evaluate the models with metrics such as precision, recall, F1-score, or ROC-AUC.
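To make the point about accuracy concrete, here is a minimal sketch that uses only the class counts printed above. It computes the accuracy of a trivial classifier that always answers "No"; the variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
n_yes = 13435
n_no = 232587
n_total = n_yes + n_no

# Share of respondents who reported a heart attack.
positive_rate = n_yes / n_total

# A classifier that always predicts "No" is correct for every "No" row.
baseline_accuracy = n_no / n_total

print(f"Share of heart-attack cases: {positive_rate:.3f}")               # 0.055
print(f"Accuracy of always predicting 'No': {baseline_accuracy:.3f}")    # 0.945
```

Any model should be judged against this roughly 0.945 baseline rather than against 0.5.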

In [7]:
# Convert categorical variables to numerical format using LabelEncoder
label_encoders = {}
for column in df.select_dtypes(include=['object']).columns:
    label_encoders[column] = LabelEncoder()
    df[column] = label_encoders[column].fit_transform(df[column])

# Normalize the numerical features
scaler = StandardScaler()
numerical_columns = df.select_dtypes(include=['float64']).columns
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

# Split the data into training and testing sets
X = df.drop('HadHeartAttack', axis=1)  # Features
y = df['HadHeartAttack']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
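
In the cell above, the encoders and the scaler are fitted on the full dataset before the split, so the test rows influence the scaling statistics, and the split is not stratified. The sketch below shows one possible alternative, assuming a stratified split and a scikit-learn `Pipeline` that fits the preprocessing on the training rows only. It runs on a small synthetic table (hypothetical values, a handful of column names borrowed from the dataset) and swaps `LabelEncoder` for `OneHotEncoder`, which is a different technique from the one used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic stand-in for the survey data (hypothetical values).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "Sex": rng.choice(["Female", "Male"], size=n),
    "GeneralHealth": rng.choice(["Fair", "Good", "Very good"], size=n),
    "SleepHours": rng.normal(7.0, 1.5, size=n),
    "BMI": rng.normal(28.0, 5.0, size=n),
    "HadHeartAttack": rng.choice(["No", "Yes"], size=n, p=[0.9, 0.1]),
})

X_toy = toy.drop(columns="HadHeartAttack")
y_toy = (toy["HadHeartAttack"] == "Yes").astype(int)

# stratify=y_toy keeps the Yes/No ratio (almost) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "GeneralHealth"]),
    ("num", StandardScaler(), ["SleepHours", "BMI"]),
])

# fit() learns the encoder categories and scaler statistics from X_tr only.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_tr, y_tr)

print("Train positive rate:", round(float(y_tr.mean()), 3))
print("Test positive rate:", round(float(y_te.mean()), 3))
```

One-hot encoding avoids giving nominal columns such as `State` an arbitrary numeric order, which matters most for linear models.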

# Modeling

## Initial Model

## AdaBoost

In [8]:
from sklearn.model_selection import train_test_split, cross_val_score, RepeatedKFold
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import numpy as np
import matplotlib.pyplot as plt

In [9]:
# Define the model
ada = AdaBoostClassifier(n_estimators=50)

# Evaluate the model using cross-validation
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(ada, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)

# Fit the model on the full training set
ada.fit(X_train, y_train)
y_pred = ada.predict(X_test)
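
The cross-validation above scores on accuracy, which is the metric this notebook set aside because of the class imbalance. As a hedged sketch, the same kind of evaluation can report precision, recall, and F1 per fold with `cross_validate` and stratified folds. The data here is synthetic (`make_classification` with about 5% positives) and the names `X_demo`, `y_demo`, and `cv_demo` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Synthetic imbalanced data standing in for X_train, y_train.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# Stratified folds keep the class ratio similar in every fold.
cv_demo = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)

metric_names = ["accuracy", "precision", "recall", "f1"]
results = cross_validate(
    AdaBoostClassifier(n_estimators=50, random_state=0),
    X_demo,
    y_demo,
    scoring=metric_names,
    cv=cv_demo,
)

# cross_validate returns one array of fold scores per metric, keyed "test_<name>".
for name in metric_names:
    fold_scores = results[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.3f} ({fold_scores.std():.3f})")
```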

In [10]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("AdaBoost Results:")
print(f"Cross-validated Accuracy: {np.mean(scores):.3f} ({np.std(scores):.3f})")
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1-score: {f1:.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("")

AdaBoost Results:
Cross-validated Accuracy: 0.948 (0.002)
Test Accuracy: 0.948
Test Precision: 0.532
Test Recall: 0.272
Test F1-score: 0.360
Confusion Matrix:
[[45943   630]
 [ 1915   717]]
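
The reported metrics can be recomputed by hand from the confusion matrix, which makes clear where the low recall comes from. A small sketch using the four counts printed above (scikit-learn lays the matrix out with true labels as rows and predictions as columns):

```python
# Counts from the AdaBoost confusion matrix above, label order [No, Yes].
tn, fp = 45943, 630   # true "No" rows: predicted No, predicted Yes
fn, tp = 1915, 717    # true "Yes" rows: predicted No, predicted Yes

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted heart attacks, how many were real
recall = tp / (tp + fn)      # of the real heart attacks, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.948
print(f"Precision: {precision:.3f}")  # 0.532
print(f"Recall:    {recall:.3f}")     # 0.272
print(f"F1-score:  {f1:.3f}")         # 0.360
```

The model misses 1,915 of the 2,632 heart-attack cases in the test set, yet accuracy barely moves because those cases are such a small share of all rows.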



## XGBoost

In [9]:
from xgboost import XGBClassifier

# Define the model
xgb = XGBClassifier(n_estimators=50)

# Evaluate the model using cross-validation
scores = cross_val_score(xgb, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)

# Fit the model on the full training set
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)

In [10]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("XGBoost Results:")
print(f"Cross-validated Accuracy: {np.mean(scores):.3f} ({np.std(scores):.3f})")
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1-score: {f1:.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("")

XGBoost Results:
Cross-validated Accuracy: 0.948 (0.002)
Test Accuracy: 0.949
Test Precision: 0.556
Test Recall: 0.242
Test F1-score: 0.337
Confusion Matrix:
[[46066   507]
 [ 1996   636]]



## Logistic Regression

In [11]:
# Define the model
lr = LogisticRegression()

# Evaluate the model using cross-validation
scores = cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)

# Fit the model on the full training set
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [12]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("Logistic Regression Results:")
print(f"Cross-validated Accuracy: {np.mean(scores):.3f} ({np.std(scores):.3f})")
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1-score: {f1:.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("")

Logistic Regression Results:
Cross-validated Accuracy: 0.948 (0.002)
Test Accuracy: 0.949
Test Precision: 0.544
Test Recall: 0.242
Test F1-score: 0.335
Confusion Matrix:
[[46039   534]
 [ 1996   636]]



Accuracy is nearly identical across the three models (about 0.95), so we compare them on F1-score, which balances precision and recall for the minority class. By that measure AdaBoost is the best of the three, with the highest F1-score (0.360).
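
Recall stays below 0.3 for all three models. One common adjustment for the class imbalance noted in the constraints is to weight the minority class more heavily. The sketch below only derives two such weights from the full-dataset counts printed earlier (in practice they would be computed from `y_train`). `scale_pos_weight` is a parameter of `XGBClassifier` and `class_weight` is a parameter of `LogisticRegression`; whether either improves F1 on this dataset is not tested here.

```python
# Class counts from the value_counts() output (full dataset, before the split).
n_yes = 13435
n_no = 232587
n_total = n_yes + n_no

# Ratio of negatives to positives, the usual starting value for
# XGBClassifier(scale_pos_weight=...).
scale_pos_weight = n_no / n_yes

# Weights that class_weight="balanced" would assign:
# n_samples / (n_classes * count_of_that_class).
class_weight = {
    "No": n_total / (2 * n_no),
    "Yes": n_total / (2 * n_yes),
}

print(f"scale_pos_weight: {scale_pos_weight:.2f}")
print(f"balanced weights: No={class_weight['No']:.2f}, Yes={class_weight['Yes']:.2f}")
```

Weighting typically trades precision for recall, so the comparison on F1 would need to be rerun after applying it.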

### Adjusting the hyper-parameters for AdaBoost

In [11]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

In [12]:
# Define the parameter grid to search
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1.0],
    'estimator': [DecisionTreeClassifier(max_depth=1)],
    'algorithm': ['SAMME']
}

# Perform grid search
grid_search = GridSearchCV(ada, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get the best model from the grid search
best_model = grid_search.best_estimator_

# Print the best parameters found
print("Best Hyperparameters:", grid_search.best_params_)

Best Hyperparameters: {'algorithm': 'SAMME', 'estimator': DecisionTreeClassifier(max_depth=1), 'learning_rate': 0.1, 'n_estimators': 200}
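
The grid search above does not pass `scoring`, so `GridSearchCV` ranks the candidates by the estimator's default score, which is accuracy for classifiers. If F1 is the metric that matters, it can be requested explicitly. A minimal sketch on synthetic data, with a deliberately small, illustrative grid (`param_grid_demo`, `X_demo`, and `y_demo` are not from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for X_train, y_train.
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Small illustrative grid; the notebook's grid is larger.
param_grid_demo = {
    "n_estimators": [25, 50],
    "learning_rate": [0.1, 1.0],
}

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    param_grid_demo,
    scoring="f1",  # rank candidates by F1 on the positive class
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print("Best parameters by F1:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```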


In [13]:
# Evaluate the best model on the test set
y_pred = best_model.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("Tuned AdaBoost Results:")
# Note: `scores` still holds the cross-validation results of an earlier baseline model,
# not of the tuned AdaBoost model.
print(f"Cross-validated Accuracy: {np.mean(scores):.3f} ({np.std(scores):.3f})")
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1-score: {f1:.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("")

Tuned AdaBoost Results:
Cross-validated Accuracy: 0.948 (0.002)
Test Accuracy: 0.948
Test Precision: 0.551
Test Recall: 0.157
Test F1-score: 0.244
Confusion Matrix:
[[46236   337]
 [ 2219   413]]



### Adjusting the hyper-parameters for XGBoost

In [13]:
# Define the parameter grid
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'n_estimators': [100, 200, 500]
}

# Instantiate the GridSearchCV object
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)

# Perform the grid search
grid_search.fit(X_train, y_train)

# Get the best model from the grid search
best_model = grid_search.best_estimator_

# Evaluate the best model on the test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid_search.best_params_)
print("Best Model Test Accuracy:", accuracy)

Fitting 3 folds for each of 27 candidates, totalling 81 fits
Best Parameters: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 500}
Best Model Test Accuracy: 0.9496595874403008


In [14]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("Tuned XGBoost Results:")
# Note: `scores` still holds the cross-validation results of an earlier baseline model,
# not of the tuned XGBoost model.
print(f"Cross-validated Accuracy: {np.mean(scores):.3f} ({np.std(scores):.3f})")
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1-score: {f1:.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("")

Tuned XGBoost Results:
Cross-validated Accuracy: 0.948 (0.002)
Test Accuracy: 0.950
Test Precision: 0.577
Test Recall: 0.220
Test F1-score: 0.319
Confusion Matrix:
[[46148   425]
 [ 2052   580]]


### Optimal hyperparameters for logistic regression

In [26]:
# Define the parameter grid
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  # Regularization parameter
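    # Penalty norm. Note: the default 'lbfgs' solver supports only 'l2', so the 'l1'
    # candidates fail to fit unless a solver such as 'liblinear' or 'saga' is set.
    'penalty': ['l1', 'l2'],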
    'max_iter': [100, 500, 1000]  # Maximum number of iterations
}

# Create the grid search object
grid_search = GridSearchCV(LogisticRegression(), param_grid, cv=5, n_jobs=-1)

# Perform the grid search
grid_search.fit(X_train, y_train)

# Get the best model from the grid search
best_model = grid_search.best_estimator_

# Evaluate the best model on the test set
y_pred = best_model.predict(X_test)

# Print the best hyperparameters found
print("Best Hyperparameters:", grid_search.best_params_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best Hyperparameters: {'C': 0.01, 'max_iter': 500, 'penalty': 'l2'}


In [27]:
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print results
print("Logistic Regression Results:")
print(f"Cross-validated Accuracy: {np.mean(scores):.3f} ({np.std(scores):.3f})")
print(f"Test Accuracy: {accuracy:.3f}")
print(f"Test Precision: {precision:.3f}")
print(f"Test Recall: {recall:.3f}")
print(f"Test F1-score: {f1:.3f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Logistic Regression Results:
Cross-validated Accuracy: 0.948 (0.002)
Test Accuracy: 0.948
Test Precision: 0.545
Test Recall: 0.217
Test F1-score: 0.310
Confusion Matrix:
[[46098   475]
 [ 2062   570]]


In [30]:
from sklearn.metrics import classification_report

# Evaluate the model
print(classification_report(y_test, y_pred))

# Analyze coefficients
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': best_model.coef_[0]})
print(coefficients)

              precision    recall  f1-score   support

           0       0.96      0.99      0.97     46573
           1       0.55      0.22      0.31      2632

    accuracy                           0.95     49205
   macro avg       0.75      0.60      0.64     49205
weighted avg       0.94      0.95      0.94     49205

                      Feature  Coefficient
0                       State    -0.000352
1                         Sex     0.663457
2               GeneralHealth    -0.030629
3          PhysicalHealthDays     0.068067
4            MentalHealthDays     0.033243
5             LastCheckupTime     0.122392
6          PhysicalActivities    -0.105562
7                  SleepHours    -0.039636
8                RemovedTeeth    -0.028715
9                   HadAngina     2.343470
10                  HadStroke     0.815777
11                  HadAsthma     0.021856
12              HadSkinCancer    -0.122162
13                    HadCOPD     0.150360
14      HadDepressiveDisorde

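Having had angina is strongly and positively associated with having had a heart attack. "HadAngina" has the largest coefficient among those shown, 2.343470. Logistic regression coefficients are on the log-odds scale, so this corresponds to an odds ratio of exp(2.343470) ≈ 10.4: with the other features held fixed, the odds of a heart attack are roughly ten times higher for respondents who reported angina.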
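
Converting a coefficient to an odds ratio is a one-line calculation. The sketch below applies it to three of the coefficients printed above; the dictionary is typed in by hand for illustration. For the Yes/No columns, `LabelEncoder` sorts the labels alphabetically, so "No" becomes 0 and "Yes" becomes 1, and the odds ratio describes a change from "No" to "Yes".

```python
import math

# Coefficients copied from the logistic regression output above.
coefficients = {
    "HadAngina": 2.343470,
    "HadStroke": 0.815777,
    "Sex": 0.663457,
}

# exp(coefficient) is the multiplicative change in the odds of a heart attack
# for a one-unit increase in the feature, other features held fixed.
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

An odds ratio is not the same as a ratio of probabilities; for a rare outcome the two are close, but they diverge as the outcome becomes more common.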