# Heart Disease Prediction based on symptoms

## Step 1: Define the Problem & Collect Data

Heart disease risk factors can be divided into two categories - those that can't be changed and those that can be modified. Factors that can't be changed include increasing age, male gender, and heredity. Thalassemia, a hereditary condition, is also a risk factor for heart disease. Factors that can be modified to reduce the risk of heart disease include smoking, high cholesterol, high blood pressure, physical inactivity, being overweight, and having diabetes. Other factors that may contribute to heart disease risk include stress, alcohol consumption, and poor diet/nutrition.

The Aim is to classication different heart diseases based on features, and features parameters like Sex, Age and Blood pressure etc. The input data for this project comes from a UCI open source [dataset](https://archive.ics.uci.edu/ml/datasets/Heart+Disease) that contains these features and their reference heart disease  from 0 (no presence) to 4. The dataset includes a total of 303 lines of data. The project will use this data to train a model that can accurately classify heart diseases level.

This database contains 14 attributes, The "target" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).

## Dataset Sources:
* 1st Part (303 rows): https://archive.ics.uci.edu/ml/datasets/Heart+Disease
* 2nd Part (1025 rows): https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset

In [118]:
#loading libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


Here we loaded the Python libararies:

The code involves importing several Python libraries used in machine learning and data analysis
The first line imports "numpy" for array manipulation and math functions
The second line imports "pandas" for structured data management using DataFrames
The third line imports "StandardScaler" from "scikit-learn" for data standardization with mean 0 and standard deviation 1
The fourth line imports "train_test_split" from the same library for splitting dataset into training and testing sets for model development and evaluation.

In [119]:
# Loading kaggle data csv
dataset_url = "/content/heart_disease.csv"
dataset = pd.read_csv(dataset_url)
dataset

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52.0,1.0,0.0,125.0,212.0,0.0,1.0,168.0,0.0,1.0,2.0,2.000000,3.0,0.0
1,53.0,1.0,0.0,140.0,203.0,1.0,0.0,155.0,1.0,3.1,0.0,0.000000,3.0,0.0
2,70.0,1.0,0.0,145.0,174.0,0.0,1.0,125.0,1.0,2.6,0.0,0.000000,3.0,0.0
3,61.0,1.0,0.0,148.0,203.0,0.0,1.0,161.0,0.0,0.0,2.0,1.000000,3.0,0.0
4,62.0,0.0,0.0,138.0,294.0,1.0,1.0,106.0,0.0,1.9,1.0,3.000000,2.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1323,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.000000,7.0,1.0
1324,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.000000,7.0,2.0
1325,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.000000,7.0,3.0
1326,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.000000,3.0,1.0


In [120]:
print("Data rows and columns:", dataset.shape)

Data rows and columns: (1328, 14)


In [121]:
print(dataset.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1328 entries, 0 to 1327
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1328 non-null   float64
 1   sex       1328 non-null   float64
 2   cp        1328 non-null   float64
 3   trestbps  1328 non-null   float64
 4   chol      1328 non-null   float64
 5   fbs       1328 non-null   float64
 6   restecg   1328 non-null   float64
 7   thalach   1328 non-null   float64
 8   exang     1328 non-null   float64
 9   oldpeak   1328 non-null   float64
 10  slope     1328 non-null   float64
 11  ca        1328 non-null   float64
 12  thal      1328 non-null   float64
 13  target    1328 non-null   float64
dtypes: float64(14)
memory usage: 145.4 KB
None


In [29]:
# lst=['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
# dataset[lst] = dataset[lst].astype(object)

#### Dataset Analysis:

The given code performs dataset analysis on a dataset that contains information about diseases and their corresponding features. Here are the key points

In [122]:
dataset.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0,1328.0
mean,54.435241,0.692018,1.448042,131.629518,246.158133,0.149096,0.634789,149.226657,0.334337,1.064232,1.434488,0.735459,2.873847,0.60994
std,9.061227,0.461833,1.375974,17.529085,51.615928,0.356318,0.691139,22.968291,0.471936,1.171519,0.623758,1.009187,1.473678,0.753764
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,48.0,0.0,0.0,120.0,211.0,0.0,0.0,132.0,0.0,0.0,1.0,0.0,2.0,0.0
50%,56.0,1.0,1.0,130.0,240.0,0.0,1.0,152.0,0.0,0.8,1.0,0.0,3.0,1.0
75%,61.0,1.0,2.0,140.0,275.0,0.0,1.0,166.0,1.0,1.8,2.0,1.0,3.0,1.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,4.0,7.0,4.0


1. age: The person's age in years

2. sex: The person's sex (1 = male, 0 = female)

3. cp: The chest pain experienced (Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic)

4. trestbps: The person's resting blood pressure (mm Hg on admission to the hospital)

5. chol: The person's cholesterol measurement in mg/dl

6. fbs: The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)

7. restecg: Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)

8. thalach: The person's maximum heart rate achieved

9. exang: Exercise induced angina (1 = yes; 0 = no)

10. oldpeak: ST depression induced by exercise relative to rest ('ST' relates to positions on the ECG plot. See more here)

11. slope: the slope of the peak exercise ST segment (Value 1: upsloping, Value 2: flat, Value 3: downsloping)

12. ca: The number of major vessels (0-3)

13. thal: A blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversable defect)

14. target: Heart disease (0 = no, 1-4 = heart rate scale)

Calculating the number of samples of individual category of heart disease. 

In [123]:
disease_count = pd.DataFrame(dataset['target'].value_counts()).reset_index()
disease_count

Unnamed: 0,index,target
0,0.0,663
1,1.0,581
2,2.0,36
3,3.0,35
4,4.0,13


In above code:

Here we creates a new DataFrame called "disease_count" which counts the number of occurrences of each value in the "target" column of an existing DataFrame called "dataset". It uses the "value_counts()" method to count the occurrences of each unique value in the "target" column and the "reset_index()" method to reset the index of the new DataFrame. The result is a DataFrame called "disease_count" that shows how many times each unique value appears in the "target" column of the original DataFrame.

#### Correlation:

In [124]:
# Calculating correlation between target of heart disease and other features.
print(dataset.corr()["target"].abs().sort_values(ascending=False))

target      1.000000
cp          0.431651
slope       0.347910
thal        0.336933
restecg     0.216721
exang       0.080104
thalach     0.065657
sex         0.061770
oldpeak     0.042522
age         0.035036
ca          0.030759
chol        0.023438
trestbps    0.011801
fbs         0.000651
Name: target, dtype: float64


In [50]:
print(f"Percentage of patience without heart problems: {round(np.count_nonzero(dataset['target'])/len(dataset['target']),2)}")
print(f"Percentage of patience with heart problems: {round(np.count_nonzero(dataset['target']==0)/len(dataset['target']),2)}")

Percentage of patience without heart problems: 0.5
Percentage of patience with heart problems: 0.5


#### Training dataset preparation:

- Load the dataset and split it into features and target.
- Normalize the numerical columns using StandardScaler.
- Spliting the dataset into training and testing.

In [125]:
# Split the data into features and target
X = dataset.drop(['target'], axis=1)
y = dataset['target']

In [126]:
# Initialize the model
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import lightgbm as lgb

# RF model
rf_model = RandomForestClassifier()
rf_model.fit(X, y)

# XGB model
xg_model = XGBClassifier()
xg_model.fit(X, y)

# Train the model
lgb_model = lgb.LGBMRegressor()
lgb_model.fit(X, y)

# Get feature importances
rf_imp = rf_model.feature_importances_
xg_imp = xg_model.feature_importances_
lgb_imp = lgb_model.feature_importances_

# Get Column names
feature_names = X.columns

# Create dataframe of feature importances
feature_importances = pd.DataFrame({'Feature': feature_names,
                                    'Random Forest - Imp': rf_imp,
                                    'XGBoost - Imp': xg_imp,
                                    'LightGBM - Imp': lgb_imp
                                    })

# Sort the dataframe by feature importance
feature_importances = feature_importances.sort_values(by='LightGBM - Imp', ascending=False)

# Print the feature importances
print(feature_importances)

     Feature  Random Forest - Imp  XGBoost - Imp  LightGBM - Imp
7    thalach             0.104653       0.040230             466
4       chol             0.087961       0.031622             422
9    oldpeak             0.089131       0.045945             374
0        age             0.086003       0.045654             350
3   trestbps             0.077429       0.033297             311
2         cp             0.118846       0.155996             302
12      thal             0.166877       0.224779             209
6    restecg             0.041847       0.041338             162
11        ca             0.086815       0.106502             149
10     slope             0.063180       0.107899             109
1        sex             0.028217       0.093055              69
8      exang             0.035286       0.056078              51
5        fbs             0.013755       0.017606              26


In [127]:
X = X[["age","cp","ca","chol","thal","thalach","oldpeak"]]

In [128]:
X.head()

Unnamed: 0,age,cp,ca,chol,thal,thalach,oldpeak
0,52.0,0.0,2.0,212.0,3.0,168.0,1.0
1,53.0,0.0,0.0,203.0,3.0,155.0,3.1
2,70.0,0.0,0.0,174.0,3.0,125.0,2.6
3,61.0,0.0,1.0,203.0,3.0,161.0,0.0
4,62.0,0.0,3.0,294.0,2.0,106.0,1.9


In [133]:
# Normalize the numerical columns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

preprocessor = ColumnTransformer(
    [
        ("scaler", StandardScaler(), list(X.columns)),
    ],
    verbose_feature_names_out=False,
).set_output(transform="pandas")

X = preprocessor.fit_transform(X)

classes_n = len(np.unique(y))

## Step 2: Model evaluation method:

- **Train/Test Split:** The dataset split into training and testing sets, with the model being trained on the training set and evaluated on the testing set. This will be done using the train_test_split function from scikit-learn library.

- **Confusion Matrix:** A confusion matrix can be used to evaluate the performance of a classification model. It shows the number of true positives, true negatives, false positives, and false negatives, and can be used to calculate various performance metrics such as precision, recall, and F1-score.

In [135]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [39]:
!pip install flaml[notebook] -q

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m224.2/224.2 KB[0m [31m5.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m159.0/159.0 KB[0m [31m15.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m120.9/120.9 KB[0m [31m13.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m83.6/83.6 KB[0m [31m10.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.6/1.6 MB[0m [31m36.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for openml (setup.py) ... [?25l[?25hdone
  Building wheel for liac-arff (setup.py) ... [?25l[?25hdone


In [66]:
from flaml import AutoML
automl_settings = {
    "time_budget": 60,  # Seconds
    "metric": 'accuracy', # Evaluation Metric
    "task": 'classification' # Supervised ML Task
}
autoML = AutoML()
autoML.fit(X_train, y_train, **automl_settings)

print(f"BEST MODEL:\n{autoML.model.estimator}")
print(f"ACCURACY SCORE: {autoML.score(X_test, y_test)}")

[flaml.automl.automl: 03-22 19:55:58] {2726} INFO - task = classification
[flaml.automl.automl: 03-22 19:55:58] {2728} INFO - Data split method: stratified
[flaml.automl.automl: 03-22 19:55:58] {2731} INFO - Evaluation method: cv
[flaml.automl.automl: 03-22 19:55:58] {1316} INFO - class 4 augmented from 11 to 22
[flaml.automl.automl: 03-22 19:55:58] {2858} INFO - Minimizing error metric: 1-accuracy
[flaml.automl.automl: 03-22 19:55:58] {3004} INFO - List of ML learners in AutoML Run: ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
[flaml.automl.automl: 03-22 19:55:58] {3334} INFO - iteration 0, current learner lgbm
[flaml.automl.automl: 03-22 19:55:58] {3472} INFO - Estimated sufficient time budget=576s. Estimated necessary time budget=13s.
[flaml.automl.automl: 03-22 19:55:58] {3519} INFO -  at 0.1s,	estimator lgbm's best error=0.2591,	best estimator lgbm's best error=0.2591
[flaml.automl.automl: 03-22 19:55:58] {3334} INFO - iteration 1, current learner lgbm
[flaml.



[flaml.automl.automl: 03-22 19:55:58] {3519} INFO -  at 0.4s,	estimator lgbm's best error=0.2190,	best estimator lgbm's best error=0.2190
[flaml.automl.automl: 03-22 19:55:58] {3334} INFO - iteration 6, current learner lgbm
[flaml.automl.automl: 03-22 19:55:58] {3519} INFO -  at 0.5s,	estimator lgbm's best error=0.2190,	best estimator lgbm's best error=0.2190
[flaml.automl.automl: 03-22 19:55:58] {3334} INFO - iteration 7, current learner lgbm
[flaml.automl.automl: 03-22 19:55:59] {3519} INFO -  at 0.6s,	estimator lgbm's best error=0.2190,	best estimator lgbm's best error=0.2190
[flaml.automl.automl: 03-22 19:55:59] {3334} INFO - iteration 8, current learner lgbm
[flaml.automl.automl: 03-22 19:55:59] {3519} INFO -  at 0.7s,	estimator lgbm's best error=0.2190,	best estimator lgbm's best error=0.2190
[flaml.automl.automl: 03-22 19:55:59] {3334} INFO - iteration 9, current learner xgboost
[flaml.automl.automl: 03-22 19:55:59] {3519} INFO -  at 0.8s,	estimator xgboost's best error=0.2442,	



[flaml.automl.automl: 03-22 19:55:59] {3519} INFO -  at 1.0s,	estimator extra_tree's best error=0.3495,	best estimator lgbm's best error=0.2190
[flaml.automl.automl: 03-22 19:55:59] {3334} INFO - iteration 12, current learner extra_tree
[flaml.automl.automl: 03-22 19:55:59] {3519} INFO -  at 1.1s,	estimator extra_tree's best error=0.3365,	best estimator lgbm's best error=0.2190
[flaml.automl.automl: 03-22 19:55:59] {3334} INFO - iteration 13, current learner rf
[flaml.automl.automl: 03-22 19:55:59] {3519} INFO -  at 1.3s,	estimator rf's best error=0.2983,	best estimator lgbm's best error=0.2190
[flaml.automl.automl: 03-22 19:55:59] {3334} INFO - iteration 14, current learner rf
[flaml.automl.automl: 03-22 19:55:59] {3519} INFO -  at 1.4s,	estimator rf's best error=0.2834,	best estimator lgbm's best error=0.2190
[flaml.automl.automl: 03-22 19:55:59] {3334} INFO - iteration 15, current learner xgboost
[flaml.automl.automl: 03-22 19:55:59] {3519} INFO -  at 1.6s,	estimator xgboost's best 



[flaml.automl.automl: 03-22 19:56:00] {3519} INFO -  at 1.7s,	estimator lgbm's best error=0.2181,	best estimator lgbm's best error=0.2181
[flaml.automl.automl: 03-22 19:56:00] {3334} INFO - iteration 17, current learner xgboost
[flaml.automl.automl: 03-22 19:56:00] {3519} INFO -  at 1.8s,	estimator xgboost's best error=0.2433,	best estimator lgbm's best error=0.2181
[flaml.automl.automl: 03-22 19:56:00] {3334} INFO - iteration 18, current learner lgbm
[flaml.automl.automl: 03-22 19:56:00] {3519} INFO -  at 1.9s,	estimator lgbm's best error=0.2181,	best estimator lgbm's best error=0.2181
[flaml.automl.automl: 03-22 19:56:00] {3334} INFO - iteration 19, current learner rf




[flaml.automl.automl: 03-22 19:56:00] {3519} INFO -  at 2.1s,	estimator rf's best error=0.2834,	best estimator lgbm's best error=0.2181
[flaml.automl.automl: 03-22 19:56:00] {3334} INFO - iteration 20, current learner lgbm
[flaml.automl.automl: 03-22 19:56:00] {3519} INFO -  at 2.5s,	estimator lgbm's best error=0.1528,	best estimator lgbm's best error=0.1528
[flaml.automl.automl: 03-22 19:56:00] {3334} INFO - iteration 21, current learner lgbm
[flaml.automl.automl: 03-22 19:56:01] {3519} INFO -  at 2.8s,	estimator lgbm's best error=0.1528,	best estimator lgbm's best error=0.1528
[flaml.automl.automl: 03-22 19:56:01] {3334} INFO - iteration 22, current learner lgbm
[flaml.automl.automl: 03-22 19:56:01] {3519} INFO -  at 3.0s,	estimator lgbm's best error=0.1528,	best estimator lgbm's best error=0.1528
[flaml.automl.automl: 03-22 19:56:01] {3334} INFO - iteration 23, current learner lgbm
[flaml.automl.automl: 03-22 19:56:01] {3519} INFO -  at 3.3s,	estimator lgbm's best error=0.1528,	best



[flaml.automl.automl: 03-22 19:56:14] {3519} INFO -  at 16.2s,	estimator xgb_limitdepth's best error=0.1500,	best estimator lgbm's best error=0.1239
[flaml.automl.automl: 03-22 19:56:14] {3334} INFO - iteration 60, current learner rf




[flaml.automl.automl: 03-22 19:56:14] {3519} INFO -  at 16.4s,	estimator rf's best error=0.2666,	best estimator lgbm's best error=0.1239
[flaml.automl.automl: 03-22 19:56:14] {3334} INFO - iteration 61, current learner xgb_limitdepth
[flaml.automl.automl: 03-22 19:56:14] {3519} INFO -  at 16.6s,	estimator xgb_limitdepth's best error=0.1500,	best estimator lgbm's best error=0.1239
[flaml.automl.automl: 03-22 19:56:14] {3334} INFO - iteration 62, current learner xgb_limitdepth




[flaml.automl.automl: 03-22 19:56:15] {3519} INFO -  at 16.8s,	estimator xgb_limitdepth's best error=0.1296,	best estimator lgbm's best error=0.1239
[flaml.automl.automl: 03-22 19:56:15] {3334} INFO - iteration 63, current learner xgb_limitdepth




[flaml.automl.automl: 03-22 19:56:15] {3519} INFO -  at 17.1s,	estimator xgb_limitdepth's best error=0.1296,	best estimator lgbm's best error=0.1239
[flaml.automl.automl: 03-22 19:56:15] {3334} INFO - iteration 64, current learner xgb_limitdepth




[flaml.automl.automl: 03-22 19:56:15] {3519} INFO -  at 17.4s,	estimator xgb_limitdepth's best error=0.1296,	best estimator lgbm's best error=0.1239
[flaml.automl.automl: 03-22 19:56:15] {3334} INFO - iteration 65, current learner xgboost
[flaml.automl.automl: 03-22 19:56:15] {3519} INFO -  at 17.5s,	estimator xgboost's best error=0.2433,	best estimator lgbm's best error=0.1239
[flaml.automl.automl: 03-22 19:56:15] {3334} INFO - iteration 66, current learner xgb_limitdepth




[flaml.automl.automl: 03-22 19:56:16] {3519} INFO -  at 17.6s,	estimator xgb_limitdepth's best error=0.1296,	best estimator lgbm's best error=0.1239
[flaml.automl.automl: 03-22 19:56:16] {3334} INFO - iteration 67, current learner xgb_limitdepth




[flaml.automl.automl: 03-22 19:56:17] {3519} INFO -  at 19.6s,	estimator xgb_limitdepth's best error=0.1295,	best estimator lgbm's best error=0.1239
[flaml.automl.automl: 03-22 19:56:17] {3334} INFO - iteration 68, current learner rf
[flaml.automl.automl: 03-22 19:56:18] {3519} INFO -  at 19.8s,	estimator rf's best error=0.2666,	best estimator lgbm's best error=0.1239
[flaml.automl.automl: 03-22 19:56:18] {3334} INFO - iteration 69, current learner lgbm
[flaml.automl.automl: 03-22 19:56:19] {3519} INFO -  at 20.8s,	estimator lgbm's best error=0.1239,	best estimator lgbm's best error=0.1239
[flaml.automl.automl: 03-22 19:56:19] {3334} INFO - iteration 70, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:19] {3519} INFO -  at 20.9s,	estimator extra_tree's best error=0.1985,	best estimator lgbm's best error=0.1239
[flaml.automl.automl: 03-22 19:56:19] {3334} INFO - iteration 71, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:19] {3519} INFO -  at 21.2s,	estimator



[flaml.automl.automl: 03-22 19:56:23] {3519} INFO -  at 25.0s,	estimator extra_tree's best error=0.1165,	best estimator extra_tree's best error=0.1165
[flaml.automl.automl: 03-22 19:56:23] {3334} INFO - iteration 86, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:23] {3519} INFO -  at 25.3s,	estimator extra_tree's best error=0.1165,	best estimator extra_tree's best error=0.1165
[flaml.automl.automl: 03-22 19:56:23] {3334} INFO - iteration 87, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:23] {3519} INFO -  at 25.5s,	estimator extra_tree's best error=0.1165,	best estimator extra_tree's best error=0.1165
[flaml.automl.automl: 03-22 19:56:23] {3334} INFO - iteration 88, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:24] {3519} INFO -  at 25.7s,	estimator extra_tree's best error=0.1165,	best estimator extra_tree's best error=0.1165
[flaml.automl.automl: 03-22 19:56:24] {3334} INFO - iteration 89, current learner extra_tree
[flaml.automl.automl: 03



[flaml.automl.automl: 03-22 19:56:41] {3519} INFO -  at 43.1s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:41] {3334} INFO - iteration 180, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:41] {3519} INFO -  at 43.2s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:41] {3334} INFO - iteration 181, current learner lrl1


INFO:flaml.tune.searcher.blendsearch:No low-cost partial config given to the search algorithm. For cost-frugal search, consider providing low-cost values for cost-related hps via 'low_cost_partial_config'. More info can be found at https://microsoft.github.io/FLAML/docs/FAQ#about-low_cost_partial_config-in-tune


[flaml.automl.automl: 03-22 19:56:41] {3519} INFO -  at 43.5s,	estimator lrl1's best error=0.3169,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:41] {3334} INFO - iteration 182, current learner lrl1




[flaml.automl.automl: 03-22 19:56:42] {3519} INFO -  at 43.9s,	estimator lrl1's best error=0.3169,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:42] {3334} INFO - iteration 183, current learner lrl1




[flaml.automl.automl: 03-22 19:56:42] {3519} INFO -  at 44.4s,	estimator lrl1's best error=0.3160,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:42] {3334} INFO - iteration 184, current learner lgbm




[flaml.automl.automl: 03-22 19:56:44] {3519} INFO -  at 45.8s,	estimator lgbm's best error=0.1183,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:44] {3334} INFO - iteration 185, current learner xgboost




[flaml.automl.automl: 03-22 19:56:44] {3519} INFO -  at 46.0s,	estimator xgboost's best error=0.2414,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:44] {3334} INFO - iteration 186, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:44] {3519} INFO -  at 46.3s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:44] {3334} INFO - iteration 187, current learner rf
[flaml.automl.automl: 03-22 19:56:44] {3519} INFO -  at 46.6s,	estimator rf's best error=0.2088,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:44] {3334} INFO - iteration 188, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:45] {3519} INFO -  at 46.7s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:45] {3334} INFO - iteration 189, current learner rf
[flaml.automl.automl: 03-22 19:56:45] {3519} IN



[flaml.automl.automl: 03-22 19:56:49] {3519} INFO -  at 51.1s,	estimator xgboost's best error=0.2191,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:49] {3334} INFO - iteration 212, current learner xgboost
[flaml.automl.automl: 03-22 19:56:49] {3519} INFO -  at 51.2s,	estimator xgboost's best error=0.2191,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:49] {3334} INFO - iteration 213, current learner xgboost




[flaml.automl.automl: 03-22 19:56:49] {3519} INFO -  at 51.4s,	estimator xgboost's best error=0.2097,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:49] {3334} INFO - iteration 214, current learner extra_tree




[flaml.automl.automl: 03-22 19:56:49] {3519} INFO -  at 51.5s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:49] {3334} INFO - iteration 215, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:50] {3519} INFO -  at 51.7s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:50] {3334} INFO - iteration 216, current learner xgboost
[flaml.automl.automl: 03-22 19:56:50] {3519} INFO -  at 51.9s,	estimator xgboost's best error=0.1883,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:50] {3334} INFO - iteration 217, current learner extra_tree




[flaml.automl.automl: 03-22 19:56:50] {3519} INFO -  at 52.0s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:50] {3334} INFO - iteration 218, current learner xgboost
[flaml.automl.automl: 03-22 19:56:50] {3519} INFO -  at 52.2s,	estimator xgboost's best error=0.1883,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:50] {3334} INFO - iteration 219, current learner xgboost




[flaml.automl.automl: 03-22 19:56:50] {3519} INFO -  at 52.3s,	estimator xgboost's best error=0.1883,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:50] {3334} INFO - iteration 220, current learner xgboost




[flaml.automl.automl: 03-22 19:56:50] {3519} INFO -  at 52.5s,	estimator xgboost's best error=0.1706,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:50] {3334} INFO - iteration 221, current learner xgboost
[flaml.automl.automl: 03-22 19:56:51] {3519} INFO -  at 52.7s,	estimator xgboost's best error=0.1706,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:51] {3334} INFO - iteration 222, current learner lgbm




[flaml.automl.automl: 03-22 19:56:51] {3519} INFO -  at 52.8s,	estimator lgbm's best error=0.1183,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:51] {3334} INFO - iteration 223, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:51] {3519} INFO -  at 53.1s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:51] {3334} INFO - iteration 224, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:51] {3519} INFO -  at 53.2s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:51] {3334} INFO - iteration 225, current learner lgbm
[flaml.automl.automl: 03-22 19:56:51] {3519} INFO -  at 53.3s,	estimator lgbm's best error=0.1183,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:51] {3334} INFO - iteration 226, current learner xgboost




[flaml.automl.automl: 03-22 19:56:52] {3519} INFO -  at 53.7s,	estimator xgboost's best error=0.1473,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:52] {3334} INFO - iteration 227, current learner xgboost




[flaml.automl.automl: 03-22 19:56:52] {3519} INFO -  at 53.9s,	estimator xgboost's best error=0.1473,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:52] {3334} INFO - iteration 228, current learner xgboost




[flaml.automl.automl: 03-22 19:56:52] {3519} INFO -  at 54.5s,	estimator xgboost's best error=0.1407,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:52] {3334} INFO - iteration 229, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:53] {3519} INFO -  at 54.7s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:53] {3334} INFO - iteration 230, current learner xgboost




[flaml.automl.automl: 03-22 19:56:53] {3519} INFO -  at 55.0s,	estimator xgboost's best error=0.1407,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:53] {3334} INFO - iteration 231, current learner extra_tree




[flaml.automl.automl: 03-22 19:56:53] {3519} INFO -  at 55.2s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:53] {3334} INFO - iteration 232, current learner xgboost




[flaml.automl.automl: 03-22 19:56:54] {3519} INFO -  at 55.7s,	estimator xgboost's best error=0.1407,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:54] {3334} INFO - iteration 233, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:54] {3519} INFO -  at 55.9s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:54] {3334} INFO - iteration 234, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:54] {3519} INFO -  at 56.0s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:54] {3334} INFO - iteration 235, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:54] {3519} INFO -  at 56.2s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:54] {3334} INFO - iteration 236, current learner xgboost




[flaml.automl.automl: 03-22 19:56:56] {3519} INFO -  at 57.9s,	estimator xgboost's best error=0.1314,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:56] {3334} INFO - iteration 237, current learner xgboost




[flaml.automl.automl: 03-22 19:56:57] {3519} INFO -  at 59.1s,	estimator xgboost's best error=0.1314,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:57] {3334} INFO - iteration 238, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:57] {3519} INFO -  at 59.3s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:57] {3334} INFO - iteration 239, current learner rf
[flaml.automl.automl: 03-22 19:56:57] {3519} INFO -  at 59.5s,	estimator rf's best error=0.2088,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:57] {3334} INFO - iteration 240, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:58] {3519} INFO -  at 59.7s,	estimator extra_tree's best error=0.0969,	best estimator extra_tree's best error=0.0969
[flaml.automl.automl: 03-22 19:56:58] {3334} INFO - iteration 241, current learner extra_tree
[flaml.automl.automl: 03-22 19:56:58] {

In [136]:
from sklearn.ensemble import ExtraTreesClassifier

parameters = {
    "criterion":'entropy',
    "max_features":0.9,
    "max_leaf_nodes":150, 
    "n_estimators":50,
    "min_samples_split":3
}

et_model = ExtraTreesClassifier(**parameters, n_jobs=-1,random_state=12)
et_model.fit(X_train, y_train)

# print(f"BEST MODEL:\n{autoML.model.estimator}")
print(f"ACCURACY SCORE: {et_model.score(X_test, y_test)}")

ACCURACY SCORE: 0.8872180451127819


In [138]:
feature_importances = dict(zip(X.columns, et_model.feature_importances_))
feature_importances

{'age': 0.09518934194778592,
 'cp': 0.17854495743353105,
 'ca': 0.14466583772246042,
 'chol': 0.08126926101501233,
 'thal': 0.2544417076143235,
 'thalach': 0.1072482380392525,
 'oldpeak': 0.13864065622763425}

In [139]:
import joblib

# Save the trained model
joblib.dump(et_model, 'heart_disease.joblib')

['heart_disease.joblib']

In [140]:
# Load the model back
heart_model = joblib.load('/content/heart_disease.joblib')

In [148]:
heart_model.predict([[57,4,1,131,7,115,1.2]])



array([2.])

In [145]:
dataset.tail()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
1323,45.0,1.0,1.0,110.0,264.0,0.0,0.0,132.0,0.0,1.2,2.0,0.0,7.0,1.0
1324,68.0,1.0,4.0,144.0,193.0,1.0,0.0,141.0,0.0,3.4,2.0,2.0,7.0,2.0
1325,57.0,1.0,4.0,130.0,131.0,0.0,0.0,115.0,1.0,1.2,2.0,1.0,7.0,3.0
1326,57.0,0.0,2.0,130.0,236.0,0.0,2.0,174.0,0.0,0.0,2.0,1.0,3.0,1.0
1327,38.0,1.0,3.0,138.0,175.0,0.0,0.0,173.0,0.0,0.0,1.0,0.672241,3.0,0.0


## Step 3: Develop the First Model
- Import required libraries such as tensorflow and keras.
- Create a sequential model with three dense layers using relu and softmax activation functions.
- Compile the model with 'adam' optimizer, 'sparse_categorical_crossentropy' loss function, and 'accuracy' metric.

## Step 4: Develop a Model with Dropout

In the preceding step, a notable distinction between training and testing accuracy was observed. To address this issue, the introduction of dropout into the model architecture is proposed. Dropout regularization can be implemented within the model to randomly remove inputs during training, effectively reducing overfitting and promoting generalization.

## Step 5: Regularize the Model and Tune the Hyperparameters

This code performs hyperparameter tuning using grid search with cross-validation to find the best combination of hyperparameters for a Keras Sequential model. It also include L2 regularization that help to cater overfitting issue. Overall, this code is an important step in the machine learning workflow that helps to optimize the model's hyperparameters for better performance.

In [59]:
!pip install scikeras[tensorflow] -q

In [65]:
import tensorflow as tf
from keras.layers import Dropout
from keras import regularizers
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from scikeras.wrappers import KerasClassifier

# Define the model creation function with hyperparameters
def create_model(units=64, learning_rate=0.01, dropout_rate=0.2):
    model = Sequential([
        Dense(units, activation='relu', input_shape=(X_train.shape[1],), kernel_regularizer=regularizers.l2(0.001)),
        Dropout(dropout_rate),
        Dense(units, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
        Dropout(dropout_rate),
        Dense(classes_n, activation='softmax')
    ])
    optimizer = tf.keras.optimizers.Adamax(learning_rate=learning_rate)
    model.compile(optimizer=optimizer,
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                  metrics=['accuracy'])
    return model

# Create the model
model = create_model()

# Fit the model on the training dataset with 200 epochs, batch size of 32, and a validation split of 0.1.
model.fit(X_train, y_train, epochs=1000, batch_size=128, validation_split=0.2, verbose=1)

# Evaluate the model on the test dataset
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print('Test accuracy:', test_acc)


Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000
Epoch 12/1000
Epoch 13/1000
Epoch 14/1000
Epoch 15/1000
Epoch 16/1000
Epoch 17/1000
Epoch 18/1000
Epoch 19/1000
Epoch 20/1000
Epoch 21/1000
Epoch 22/1000
Epoch 23/1000
Epoch 24/1000
Epoch 25/1000
Epoch 26/1000
Epoch 27/1000
Epoch 28/1000
Epoch 29/1000
Epoch 30/1000
Epoch 31/1000
Epoch 32/1000
Epoch 33/1000
Epoch 34/1000
Epoch 35/1000
Epoch 36/1000
Epoch 37/1000
Epoch 38/1000
Epoch 39/1000
Epoch 40/1000
Epoch 41/1000
Epoch 42/1000
Epoch 43/1000
Epoch 44/1000
Epoch 45/1000
Epoch 46/1000
Epoch 47/1000
Epoch 48/1000
Epoch 49/1000
Epoch 50/1000
Epoch 51/1000
Epoch 52/1000
Epoch 53/1000
Epoch 54/1000
Epoch 55/1000
Epoch 56/1000
Epoch 57/1000
Epoch 58/1000
Epoch 59/1000
Epoch 60/1000
Epoch 61/1000
Epoch 62/1000
Epoch 63/1000
Epoch 64/1000
Epoch 65/1000
Epoch 66/1000
Epoch 67/1000
Epoch 68/1000
Epoch 69/1000
Epoch 70/1000
Epoch 71/1000
Epoch 72/1000
E

In [64]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define the model creation function with hyperparameters
class Net(nn.Module):
    def __init__(self, input_size, output_size, units, dropout_rate):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_size, units)
        self.fc2 = nn.Linear(units, units)
        self.fc3 = nn.Linear(units, output_size)
        self.dropout = nn.Dropout(p=dropout_rate)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = torch.relu(self.fc2(x))
        x = self.dropout(x)
        x = torch.softmax(self.fc3(x), dim=-1)
        return x

def train_model(X_train, y_train, units=64, learning_rate=0.01, dropout_rate=0.3, epochs=200, batch_size=32):
    input_size = X_train.shape[1]
    output_size = len(torch.Tensor(y_train).unique())
    model = Net(input_size, output_size, units, dropout_rate)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(epochs):
        running_loss = 0.0
        for i in range(0, len(X_train), batch_size):
            inputs = torch.tensor(X_train[i:i+batch_size], dtype=torch.float32)
            labels = torch.tensor(y_train[i:i+batch_size].values, dtype=torch.long)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * inputs.size(0)
        epoch_loss = running_loss / len(X_train)
        if epoch % 10 == 9: # Print every 10 epochs
            print(f"Epoch {epoch+1}/{epochs} Loss: {epoch_loss:.4f}")
    return model

# Train the model with given hyperparameters
model = train_model(X_train, y_train, units=64, learning_rate=0.01, dropout_rate=0.3, epochs=1000, batch_size=32)

# Evaluate the model on the test dataset
with torch.no_grad():
    inputs = torch.tensor(X_test, dtype=torch.float32)
    labels = torch.tensor(y_test.values, dtype=torch.long)
    outputs = model(inputs)
    loss = nn.CrossEntropyLoss()(outputs, labels)
    _, predicted = torch.max(outputs, 1)
    accuracy = (predicted == labels).sum().item() / len(labels)
    print(f"Test loss: {loss:.4f} Test accuracy: {accuracy:.4f}")


Epoch 10/1000 Loss: 1.1515
Epoch 20/1000 Loss: 1.1347
Epoch 30/1000 Loss: 1.1339
Epoch 40/1000 Loss: 1.1282
Epoch 50/1000 Loss: 1.1263
Epoch 60/1000 Loss: 1.1255
Epoch 70/1000 Loss: 1.1302
Epoch 80/1000 Loss: 1.1286
Epoch 90/1000 Loss: 1.1415
Epoch 100/1000 Loss: 1.1248
Epoch 110/1000 Loss: 1.1264
Epoch 120/1000 Loss: 1.1247
Epoch 130/1000 Loss: 1.1345
Epoch 140/1000 Loss: 1.1357
Epoch 150/1000 Loss: 1.1161
Epoch 160/1000 Loss: 1.1359
Epoch 170/1000 Loss: 1.1274
Epoch 180/1000 Loss: 1.1121
Epoch 190/1000 Loss: 1.1209
Epoch 200/1000 Loss: 1.1203
Epoch 210/1000 Loss: 1.1139
Epoch 220/1000 Loss: 1.1189
Epoch 230/1000 Loss: 1.1177
Epoch 240/1000 Loss: 1.1201
Epoch 250/1000 Loss: 1.1179
Epoch 260/1000 Loss: 1.1197
Epoch 270/1000 Loss: 1.1271
Epoch 280/1000 Loss: 1.1073
Epoch 290/1000 Loss: 1.1315
Epoch 300/1000 Loss: 1.1299
Epoch 310/1000 Loss: 1.1177
Epoch 320/1000 Loss: 1.1223
Epoch 330/1000 Loss: 1.1258
Epoch 340/1000 Loss: 1.1280
Epoch 350/1000 Loss: 1.1243
Epoch 360/1000 Loss: 1.1231
E