# Heart Attack Classifier

## About the Dataset

This dataset provides a detailed health profile of individuals in Indonesia, focusing on heart attack prediction. It includes key demographic, clinical, lifestyle, and environmental factors associated with cardiovascular risks. The dataset reflects real-world health trends in Indonesia, considering factors such as hypertension, diabetes, obesity, smoking, and pollution exposure.

Indonesia has seen a rising trend in cardiovascular diseases, making early prediction and prevention crucial. The dataset is structured to support machine learning models for predicting heart attack risks, public health research, and epidemiological studies.

### Variable Definitions

1. Demographics
* age (int): Age of the individual (25-90 years)
* gender (str): Gender of the individual (Male, Female)
* region (str): Living area (Urban, Rural)
* income_level (str): Socioeconomic status (Low, Middle, High)

2. Clinical Risk Factors
* hypertension (int): High blood pressure (1 = Yes, 0 = No)
* diabetes (int): Diagnosed diabetes (1 = Yes, 0 = No)
* cholesterol_level (int): Total cholesterol level (mg/dL)
* obesity (int): BMI > 30 (1 = Yes, 0 = No)
* waist_circumference (int): Waist circumference in cm
* family_history (int): Family history of heart disease (1 = Yes, 0 = No)

3. Lifestyle & Behavioral Factors
* smoking_status (str): Smoking habit (Never, Past, Current)
* alcohol_consumption (str): Alcohol intake (None, Moderate, High)
* physical_activity (str): Physical activity level (Low, Moderate, High)
* dietary_habits (str): Diet quality (Healthy, Unhealthy)

4. Environmental & Social Factors
* air_pollution_exposure (str): Pollution exposure (Low, Moderate, High)
* stress_level (str): Stress level (Low, Moderate, High)
* sleep_hours (float): Average sleep hours per night (3-9 hours)

5. Medical Screening & Health System Factors
* blood_pressure_systolic (int): Systolic BP (mmHg)
* blood_pressure_diastolic (int): Diastolic BP (mmHg)
* fasting_blood_sugar (int): Blood sugar level (mg/dL)
* cholesterol_hdl (int): HDL cholesterol level (mg/dL)
* cholesterol_ldl (int): LDL cholesterol level (mg/dL)
* triglycerides (int): Triglyceride level (mg/dL)
* EKG_results (str): Electrocardiogram result (Normal, Abnormal)
* previous_heart_disease (int): Prior heart disease (1 = Yes, 0 = No)
* medication_usage (int): Currently taking heart-related medications (1 = Yes, 0 = No)
* participated_in_free_screening (int): Attended Indonesia’s free health screening program (1 = Yes, 0 = No)

6. Target Variable
* heart_attack (int): Heart attack occurrence (1 = Yes, 0 = No)
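The definitions above can double as a quick sanity check on the loaded data. The following is a minimal sketch, not part of the original notebook: the helper `find_schema_violations` and the three lookup tables are hypothetical names, filled in from the value lists documented above, and the toy frame stands in for the real CSV.

```python
import pandas as pd

# Hypothetical lookup tables, copied from the variable definitions above.
BINARY_COLUMNS = [
    "hypertension", "diabetes", "obesity", "family_history",
    "previous_heart_disease", "medication_usage",
    "participated_in_free_screening", "heart_attack",
]
CATEGORY_LEVELS = {
    "gender": ["Male", "Female"],
    "region": ["Urban", "Rural"],
    "income_level": ["Low", "Middle", "High"],
    "smoking_status": ["Never", "Past", "Current"],
    "alcohol_consumption": ["None", "Moderate", "High"],
    "physical_activity": ["Low", "Moderate", "High"],
    "dietary_habits": ["Healthy", "Unhealthy"],
    "air_pollution_exposure": ["Low", "Moderate", "High"],
    "stress_level": ["Low", "Moderate", "High"],
    "EKG_results": ["Normal", "Abnormal"],
}
NUMERIC_RANGES = {"age": (25, 90), "sleep_hours": (3, 9)}


def find_schema_violations(df):
    """Return {column: number of rows that fall outside the documented values}."""
    masks = {}
    for col in BINARY_COLUMNS:
        if col in df:
            masks[col] = df[col].isin([0, 1])
    for col, levels in CATEGORY_LEVELS.items():
        if col in df:
            masks[col] = df[col].isin(levels)
    for col, (low, high) in NUMERIC_RANGES.items():
        if col in df:
            masks[col] = df[col].between(low, high)  # inclusive on both ends
    return {col: int((~ok).sum()) for col, ok in masks.items() if not ok.all()}


# Toy frame with one deliberate violation per column.
toy = pd.DataFrame({
    "age": [30, 95],
    "gender": ["Male", "Other"],
    "hypertension": [0, 2],
})
violations = find_schema_violations(toy)
print(violations)
```

An empty dictionary would mean every checked column matches its documented values; columns missing from the frame are skipped rather than reported.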

## Import the Libraries

In [1]:
import pandas as pd
import numpy as np
"""
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
""";


## Load and Inspect the Data

### Reading the Dataset

In [2]:
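# keep_default_na=False keeps category strings such as "None" (alcohol_consumption)
# as literal text instead of letting pandas parse them as missing values.
data = pd.read_csv("data/heart_attack_prediction_indonesia.csv", keep_default_na=False)
data.head()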

Unnamed: 0,age,gender,region,income_level,hypertension,diabetes,cholesterol_level,obesity,waist_circumference,family_history,...,blood_pressure_diastolic,fasting_blood_sugar,cholesterol_hdl,cholesterol_ldl,triglycerides,EKG_results,previous_heart_disease,medication_usage,participated_in_free_screening,heart_attack
0,60,Male,Rural,Middle,0,1,211,0,83,0,...,62,173,48,121,101,Normal,0,0,0,0
1,53,Female,Urban,Low,0,0,208,0,106,1,...,76,70,58,83,138,Normal,1,0,1,0
2,62,Female,Urban,Low,0,0,231,1,112,1,...,74,118,69,130,171,Abnormal,0,1,0,1
3,73,Male,Urban,Low,1,0,202,0,82,1,...,65,98,52,85,146,Normal,0,1,1,0
4,52,Male,Urban,Middle,1,0,232,0,89,0,...,75,104,59,127,139,Normal,1,0,1,1
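The `keep_default_na=False` argument matters here because `alcohol_consumption` uses the string `None` as a category. A minimal sketch of the difference, using a small in-memory CSV rather than the real file (the CSV text and the variable names are made up for illustration):

```python
import io

import pandas as pd

# Tiny stand-in for the real file; "None" is a legitimate category here.
csv_text = "alcohol_consumption,heart_attack\nNone,0\nModerate,1\nHigh,0\n"

# Default parsing: depending on the pandas version, the literal string "None"
# may be treated as a missing-value marker.
default_df = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps every string exactly as written in the file.
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print("Missing with default parsing:      ", default_df["alcohol_consumption"].isna().sum())
print("Missing with keep_default_na=False:", literal_df["alcohol_consumption"].isna().sum())
print(literal_df["alcohol_consumption"].tolist())
```

If the first count is non-zero on your pandas version, the default parser would have silently turned the "no alcohol" category into missing data.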


### Check for Missing Values

In [3]:
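# Because the file was read with keep_default_na=False, empty cells in text columns
# would be read as empty strings, which isna() does not count.
data.isna().sum()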

age                               0
gender                            0
region                            0
income_level                      0
hypertension                      0
diabetes                          0
cholesterol_level                 0
obesity                           0
waist_circumference               0
family_history                    0
smoking_status                    0
alcohol_consumption               0
physical_activity                 0
dietary_habits                    0
air_pollution_exposure            0
stress_level                      0
sleep_hours                       0
blood_pressure_systolic           0
blood_pressure_diastolic          0
fasting_blood_sugar               0
cholesterol_hdl                   0
cholesterol_ldl                   0
triglycerides                     0
EKG_results                       0
previous_heart_disease            0
medication_usage                  0
participated_in_free_screening    0
heart_attack                

## Preprocessing

### Shuffle and Reduce the Data

In [4]:
data_shuffled = data.sample(frac=1, random_state=42).reset_index(drop=True)

data_reduced = data_shuffled.head(10000)

print(data_reduced.shape)

(10000, 28)


### Split the Data into Features and Target (X & y)

In [5]:
# Features: every column except the target
X = data_reduced.drop(columns='heart_attack')
X.head()

Unnamed: 0,age,gender,region,income_level,hypertension,diabetes,cholesterol_level,obesity,waist_circumference,family_history,...,blood_pressure_systolic,blood_pressure_diastolic,fasting_blood_sugar,cholesterol_hdl,cholesterol_ldl,triglycerides,EKG_results,previous_heart_disease,medication_usage,participated_in_free_screening
0,52,Female,Urban,Middle,0,1,217,0,84,0,...,154,67,70,58,125,57,Abnormal,0,1,0
1,69,Male,Urban,Low,1,0,233,0,114,0,...,129,68,90,52,128,144,Normal,1,0,1
2,74,Male,Urban,Middle,0,0,176,0,57,0,...,152,90,137,31,133,150,Normal,0,0,1
3,48,Female,Urban,Middle,1,0,143,0,96,1,...,101,79,81,57,127,159,Normal,0,0,0
4,38,Female,Urban,Low,1,0,176,0,89,1,...,142,79,136,42,160,214,Normal,1,1,1


In [6]:
y = data_reduced['heart_attack']
y.head()

0    1
1    1
2    0
3    0
4    0
Name: heart_attack, dtype: int64

### Check the Data Types

In [7]:
data.dtypes

age                                 int64
gender                             object
region                             object
income_level                       object
hypertension                        int64
diabetes                            int64
cholesterol_level                   int64
obesity                             int64
waist_circumference                 int64
family_history                      int64
smoking_status                     object
alcohol_consumption                object
physical_activity                  object
dietary_habits                     object
air_pollution_exposure             object
stress_level                       object
sleep_hours                       float64
blood_pressure_systolic             int64
blood_pressure_diastolic            int64
fasting_blood_sugar                 int64
cholesterol_hdl                     int64
cholesterol_ldl                     int64
triglycerides                       int64
EKG_results                       

In [8]:
data.select_dtypes(include="object").shape[1]

10

### Transform the Data

In [9]:
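from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# The ten object-typed columns counted above
categorical_features = ['gender', 'region', 'income_level', 'smoking_status', 'alcohol_consumption', 'physical_activity', 'dietary_habits',
                        'air_pollution_exposure', 'stress_level', 'EKG_results']

# One-hot encode the categorical columns; numeric columns pass through unchanged
transformer = ColumnTransformer(
            [('one_hot', OneHotEncoder(), categorical_features)],
            remainder="passthrough")

X_encoded = transformer.fit_transform(X)

# Standardize every column, including the one-hot indicators.
# Note: the encoder and scaler are fit on all of X, before the train/test split,
# so test-set statistics influence the transformation.
scaler = StandardScaler()
X_transformed = scaler.fit_transform(X_encoded)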
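Because the encoder and scaler above are fit on the full feature matrix, one alternative is to wrap preprocessing and model in a single `Pipeline` so that both are fit on the training split only. The sketch below is an illustration under stated assumptions, not the notebook's method: it uses a small synthetic frame (`toy_X`, `toy_y`) instead of the real data, scales only the numeric columns rather than the one-hot indicators, and picks a Random Forest arbitrarily as the final step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption: column names borrowed from the dataset).
rng = np.random.default_rng(42)
n = 200
toy_X = pd.DataFrame({
    "age": rng.integers(25, 91, size=n),
    "cholesterol_level": rng.integers(120, 300, size=n),
    "gender": rng.choice(["Male", "Female"], size=n),
    "smoking_status": rng.choice(["Never", "Past", "Current"], size=n),
})
toy_y = pd.Series(rng.integers(0, 2, size=n), name="heart_attack")

toy_categorical = ["gender", "smoking_status"]
toy_numeric = ["age", "cholesterol_level"]

# Encode categorical columns and scale numeric ones inside one transformer.
preprocess = ColumnTransformer([
    ("one_hot", OneHotEncoder(handle_unknown="ignore"), toy_categorical),
    ("scale", StandardScaler(), toy_numeric),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Split first, then fit: the scaler only ever sees the training rows.
Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)
pipeline.fit(Xtr, ytr)
test_accuracy = pipeline.score(Xte, yte)
print(f"Toy test accuracy: {test_accuracy:.3f}")
```

A pipeline like this can also be passed directly to `cross_val_score` or `RandomizedSearchCV`, in which case the preprocessing is refit inside each fold.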

In [10]:
pd.DataFrame(X_transformed)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,33,34,35,36,37,38,39,40,41,42
0,1.051737,-1.051737,-0.733961,0.733961,-0.420413,-0.810556,1.098201,-0.591517,1.023885,-0.581353,...,-0.763800,1.638787,-1.240105,-1.468279,0.850427,-0.116989,-1.888558,-0.505931,1.005817,-1.225511
1,-0.950808,0.950808,-0.733961,0.733961,-0.420413,1.233721,-0.910580,-0.591517,1.023885,-0.581353,...,1.702391,-0.038360,-1.139773,-0.744325,0.243918,-0.031381,-0.112694,1.976555,-0.994217,0.815986
2,-0.950808,0.950808,-0.733961,0.733961,-0.420413,-0.810556,1.098201,-0.591517,-0.976672,1.720124,...,0.008779,1.504615,1.067553,0.956965,-1.878865,0.111299,0.009780,-0.505931,-0.994217,0.815986
3,1.051737,-1.051737,-0.733961,0.733961,-0.420413,-0.810556,1.098201,-0.591517,1.023885,-0.581353,...,0.636783,-1.916764,-0.036110,-1.070104,0.749342,-0.059917,0.193490,-0.505931,-0.994217,-1.225511
4,1.051737,-1.051737,-0.733961,0.733961,-0.420413,1.233721,-0.910580,-0.591517,1.023885,-0.581353,...,1.773263,0.833757,-0.036110,0.920767,-0.766931,0.881768,1.316163,1.976555,1.005817,0.815986
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,-0.950808,0.950808,1.362471,-1.362471,-0.420413,1.233721,-0.910580,1.690569,-0.976672,-0.581353,...,-0.285237,-0.038360,-0.537775,1.789511,0.041748,-0.516492,-1.010832,-0.505931,1.005817,-1.225511
9996,1.051737,-1.051737,-0.733961,0.733961,-0.420413,1.233721,-0.910580,-0.591517,1.023885,-0.581353,...,1.324115,0.766671,-0.939107,-1.468279,0.951512,0.482266,0.826269,1.976555,1.005817,0.815986
9997,-0.950808,0.950808,1.362471,-1.362471,-0.420413,-0.810556,1.098201,1.690569,-0.976672,-0.581353,...,-1.381542,-0.239617,0.365222,-0.599535,0.446088,0.910304,-0.990420,-0.505931,-0.994217,0.815986
9998,-0.950808,0.950808,1.362471,-1.362471,-0.420413,-0.810556,1.098201,-0.591517,1.023885,-0.581353,...,0.085048,0.229984,0.465555,1.536128,1.052597,-0.887458,-0.031045,-0.505931,-0.994217,0.815986


In [11]:
# Inspect the one-hot encoded columns by name using pandas
dummies = pd.get_dummies(
    data_reduced[['gender', 'region', 'income_level', 'smoking_status', 'alcohol_consumption', 'physical_activity', 'dietary_habits', 'air_pollution_exposure',
          'stress_level', 'EKG_results']])
dummies

Unnamed: 0,gender_Female,gender_Male,region_Rural,region_Urban,income_level_High,income_level_Low,income_level_Middle,smoking_status_Current,smoking_status_Never,smoking_status_Past,...,dietary_habits_Healthy,dietary_habits_Unhealthy,air_pollution_exposure_High,air_pollution_exposure_Low,air_pollution_exposure_Moderate,stress_level_High,stress_level_Low,stress_level_Moderate,EKG_results_Abnormal,EKG_results_Normal
0,True,False,False,True,False,False,True,False,True,False,...,False,True,False,True,False,False,True,False,True,False
1,False,True,False,True,False,True,False,False,True,False,...,True,False,False,False,True,False,False,True,False,True
2,False,True,False,True,False,False,True,False,False,True,...,False,True,True,False,False,False,True,False,False,True
3,True,False,False,True,False,False,True,False,True,False,...,False,True,False,False,True,False,False,True,False,True
4,True,False,False,True,False,True,False,False,True,False,...,True,False,False,False,True,False,False,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,False,True,True,False,False,True,False,True,False,False,...,False,True,False,False,True,False,True,False,False,True
9996,True,False,False,True,False,True,False,False,True,False,...,False,True,False,False,True,False,False,True,False,True
9997,False,True,True,False,False,False,True,True,False,False,...,False,True,False,True,False,False,False,True,False,True
9998,False,True,True,False,False,False,True,False,True,False,...,False,True,True,False,False,False,True,False,False,True


In [12]:
#data_reduced.to_csv('data/data_reduced.csv', index=False)

## Split the Data into Training and Test Sets

In [13]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.2, random_state=42)
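The split above is purely random. If the classes are imbalanced, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both splits. A minimal sketch on made-up labels (the 60/40 balance and all `toy_` names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 600 negatives and 400 positives.
toy_labels = np.array([0] * 600 + [1] * 400)
toy_features = np.arange(1000).reshape(-1, 1)

# stratify=toy_labels preserves the 60/40 ratio in both splits.
_, _, y_tr, y_te = train_test_split(
    toy_features, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(f"Positive rate in train: {y_tr.mean():.3f}")
print(f"Positive rate in test:  {y_te.mean():.3f}")
```

Without `stratify`, the two rates would usually differ slightly from run to run, which adds noise when comparing per-class metrics on the test set.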

## Fit the Models

In [19]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

model_rf = RandomForestClassifier(random_state=42)
model_svc = SVC(probability=True, random_state=42)
model_gb = GradientBoostingClassifier(random_state=42)
model_xgb = XGBClassifier(eval_metric='logloss', random_state=42)

models = {
    "Random Forest": model_rf,
    "SVM": model_svc,
    "Gradient Boosting": model_gb,
    "XGBoost": model_xgb}

for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print(f"\n{name} Cross-Validation Accuracy:")
    print(f"Scores: {scores}")
    print(f"Mean Accuracy: {scores.mean()*100:.2f}%")


Random Forest Cross-Validation Accuracy:
Scores: [0.73875  0.7225   0.725    0.709375 0.72125 ]
Mean Accuracy: 72.34%

SVM Cross-Validation Accuracy:
Scores: [0.72625  0.7275   0.720625 0.715    0.713125]
Mean Accuracy: 72.05%

Gradient Boosting Cross-Validation Accuracy:
Scores: [0.75625  0.73375  0.738125 0.7325   0.72125 ]
Mean Accuracy: 73.64%

XGBoost Cross-Validation Accuracy:
Scores: [0.72375  0.725    0.6975   0.70625  0.703125]
Mean Accuracy: 71.11%
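The four mean accuracies sit within a few percentage points of each other, so it helps to look at the fold-to-fold spread before ranking the models. The sketch below only re-uses the fold scores printed above; the choice of the sample standard deviation (`ddof=1`) as the spread measure is an assumption for illustration.

```python
import numpy as np

# Fold scores copied from the cross-validation output above.
reported_scores = {
    "Random Forest": [0.73875, 0.7225, 0.725, 0.709375, 0.72125],
    "SVM": [0.72625, 0.7275, 0.720625, 0.715, 0.713125],
    "Gradient Boosting": [0.75625, 0.73375, 0.738125, 0.7325, 0.72125],
    "XGBoost": [0.72375, 0.725, 0.6975, 0.70625, 0.703125],
}

# Mean and sample standard deviation across the five folds.
summary = {
    name: (float(np.mean(scores)), float(np.std(scores, ddof=1)))
    for name, scores in reported_scores.items()
}

for name, (mean, std) in summary.items():
    print(f"{name}: {mean * 100:.2f}% ± {std * 100:.2f}%")

best_model = max(summary, key=lambda name: summary[name][0])
print("Highest mean accuracy:", best_model)
```

When the gap between two means is smaller than the spread of either model, the ranking between them may not be stable across different splits.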


## Hyperparameter Tuning

In [20]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
from sklearn.metrics import classification_report

param_dist_rf = {
    'n_estimators': [50, 100],
    'max_depth': [10, 20]
}

param_dist_svc = {
    'C': [0.1, 1, 10],
    'kernel': ['rbf', 'linear'],
    'gamma': ['scale', 'auto']
}

param_dist_gb = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
    'subsample': [0.8, 1.0]
}

param_dist_xgb = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

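def tune_model(name, model, param_grid):
    """Run a small randomized search and print test-set metrics for the best estimator."""
    print(f"\n Tuning {name}...")
    # n_iter=4 samples four parameter combinations; each is scored with 3-fold CV
    # using the estimator's default scorer (accuracy for classifiers).
    search = RandomizedSearchCV(
        model, param_distributions=param_grid,
        n_iter=4, cv=3, random_state=42, n_jobs=2)
    search.fit(X_train, y_train)
    print(f"✅ Best Parameters for {name}:", search.best_params_)
    # With the default refit=True, predict() uses the best estimator refit on all of X_train.
    y_pred = search.predict(X_test)
    print(f"📊 Classification Report for {name}:\n", classification_report(y_test, y_pred))
    return search.best_estimator_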

best_rf = tune_model("Random Forest", model_rf, param_dist_rf)
best_svc = tune_model("SVM", model_svc, param_dist_svc)
best_gb = tune_model("Gradient Boosting", model_gb, param_dist_gb)
best_xgb = tune_model("XGBoost", model_xgb, param_dist_xgb)


 Tuning Random Forest...
✅ Best Parameters for Random Forest: {'n_estimators': 100, 'max_depth': 10}
📊 Classification Report for Random Forest:
               precision    recall  f1-score   support

           0       0.73      0.84      0.79      1195
           1       0.70      0.55      0.61       805

    accuracy                           0.72      2000
   macro avg       0.72      0.70      0.70      2000
weighted avg       0.72      0.72      0.72      2000


 Tuning SVM...
✅ Best Parameters for SVM: {'kernel': 'linear', 'gamma': 'scale', 'C': 10}
📊 Classification Report for SVM:
               precision    recall  f1-score   support

           0       0.75      0.82      0.78      1195
           1       0.69      0.59      0.63       805

    accuracy                           0.73      2000
   macro avg       0.72      0.70      0.71      2000
weighted avg       0.72      0.73      0.72      2000


 Tuning Gradient Boosting...
✅ Best Parameters for Gradient Boosting: {'su



✅ Best Parameters for XGBoost: {'subsample': 0.8, 'n_estimators': 50, 'max_depth': 3, 'learning_rate': 0.1, 'colsample_bytree': 1.0}
📊 Classification Report for XGBoost:
               precision    recall  f1-score   support

           0       0.76      0.82      0.79      1195
           1       0.70      0.60      0.65       805

    accuracy                           0.73      2000
   macro avg       0.73      0.71      0.72      2000
weighted avg       0.73      0.73      0.73      2000
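With `n_iter=4`, how much of each search space gets explored depends on its size. The sketch below counts the combinations using scikit-learn's `ParameterGrid`; the dictionaries repeat the search spaces defined in the tuning cell, and `search_spaces` and `grid_sizes` are names introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Search spaces repeated from the tuning cell above.
search_spaces = {
    "Random Forest": {"n_estimators": [50, 100], "max_depth": [10, 20]},
    "SVM": {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"], "gamma": ["scale", "auto"]},
    "Gradient Boosting": {
        "n_estimators": [50, 100], "learning_rate": [0.01, 0.1],
        "max_depth": [3, 5], "subsample": [0.8, 1.0],
    },
    "XGBoost": {
        "n_estimators": [50, 100], "learning_rate": [0.01, 0.1],
        "max_depth": [3, 5], "subsample": [0.8, 1.0],
        "colsample_bytree": [0.8, 1.0],
    },
}

n_iter = 4  # same value as in tune_model
grid_sizes = {name: len(ParameterGrid(space)) for name, space in search_spaces.items()}

for name, size in grid_sizes.items():
    sampled = min(n_iter, size)
    print(f"{name}: {size} combinations, {sampled} sampled ({sampled / size:.0%})")
```

Since every parameter is given as a list, `RandomizedSearchCV` samples without replacement, so the Random Forest search is effectively exhaustive while the XGBoost search covers only a small part of its space.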



## Feature Importance

In [25]:
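# Column order after the ColumnTransformer: one-hot columns first,
# then the passthrough (numeric) columns in their original order.
feature_names = transformer.transformers_[0][1].get_feature_names_out(categorical_features)
feature_names = list(feature_names) + [col for col in X.columns if col not in categorical_features]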

def print_feature_importance(model, model_name):
    if hasattr(model, 'feature_importances_'):
        importances = model.feature_importances_
        feature_importance = dict(zip(feature_names, importances))
        sorted_importance = dict(sorted(feature_importance.items(), key=lambda item: item[1], reverse=True))
        print(f"\nFeature Importances for {model_name}:")
        for feature, importance in sorted_importance.items():
            print(f"{feature}: {importance:.4f}")

    elif hasattr(model, 'coef_'):
        coefs = model.coef_[0]
        feature_importance = dict(zip(feature_names, coefs))
        sorted_feature_importance = dict(sorted(feature_importance.items(), key=lambda item: abs(item[1]), reverse=True))
        print(f"\nFeature Importances (coefficients) for {model_name}:")
        for feature, importance in sorted_feature_importance.items():
            print(f"{feature}: {importance:.4f}")
    else:
        print(f"\n⚠️ Model {model_name} exposes neither feature_importances_ nor coef_.")

print_feature_importance(best_rf, "Random Forest")
print_feature_importance(best_svc, "SVM")
print_feature_importance(best_gb, "Gradient Boosting")
print_feature_importance(best_xgb, "XGBoost")


Feature Importances for Random Forest:
previous_heart_disease: 0.1314
hypertension: 0.1220
age: 0.0616
cholesterol_level: 0.0609
diabetes: 0.0574
fasting_blood_sugar: 0.0535
smoking_status_Current: 0.0523
obesity: 0.0467
waist_circumference: 0.0402
sleep_hours: 0.0394
triglycerides: 0.0394
cholesterol_ldl: 0.0374
blood_pressure_diastolic: 0.0358
blood_pressure_systolic: 0.0345
cholesterol_hdl: 0.0331
smoking_status_Never: 0.0138
smoking_status_Past: 0.0083
participated_in_free_screening: 0.0058
alcohol_consumption_None: 0.0057
medication_usage: 0.0057
income_level_Low: 0.0056
dietary_habits_Unhealthy: 0.0055
stress_level_Moderate: 0.0055
gender_Male: 0.0054
income_level_Middle: 0.0054
gender_Female: 0.0053
dietary_habits_Healthy: 0.0053
air_pollution_exposure_Moderate: 0.0053
alcohol_consumption_Moderate: 0.0052
stress_level_High: 0.0051
region_Urban: 0.0050
air_pollution_exposure_Low: 0.0050
region_Rural: 0.0050
air_pollution_exposure_High: 0.0050
physical_activity_Low: 0.0050
family