# Personal Key Indicators of Heart Disease

`Goal:` Explore the dataset and build the best possible machine learning model for classifying people with heart disease.

The data was taken from Kaggle: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease

In [2]:
# Importing the most necessary libraries for exploratory data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Import the dataset
heart_data = pd.read_csv('heart_2020_cleaned.csv')

The `heart_2020_cleaned.csv` file contains information on different aspects of a person's health. The columns in the dataset include:

- **HeartDisease** - Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)
- **BMI** - Body Mass Index (BMI)
- **Smoking** - Have you smoked at least 100 cigarettes in your entire life?
- **AlcoholDrinking** - Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week)
- **Stroke** - (Ever told) (you had) a stroke?
- **PhysicalHealth** - Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
- **MentalHealth** - Thinking about your mental health, for how many days during the past 30 days was your mental health not good?
- **DiffWalking** - Do you have serious difficulty walking or climbing stairs?
- **Sex** - Are you male or female?
- **AgeCategory** - Age category (thirteen levels appear in this dataset, from 18-24 to 80 or older)
- **Race** - Imputed race/ethnicity value
- **Diabetic** - (Ever told) (you had) diabetes?
- **PhysicalActivity** - Adults who reported doing physical activity or exercise during the past 30 days other than their regular job
- **GenHealth** - Would you say that in general your health is...
- **SleepTime** - On average, how many hours of sleep do you get in a 24-hour period?
- **Asthma** - (Ever told) (you had) asthma?
- **KidneyDisease** - Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease?
- **SkinCancer** - (Ever told) (you had) skin cancer?

In [4]:
# See first 5 rows
heart_data.head()

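| | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No | 16.6 | Yes | No | No | 3.0 | 30.0 | No | Female | 55-59 | White | Yes | Yes | Very good | 5.0 | Yes | No | Yes |
| 1 | No | 20.34 | No | No | Yes | 0.0 | 0.0 | No | Female | 80 or older | White | No | Yes | Very good | 7.0 | No | No | No |
| 2 | No | 26.58 | Yes | No | No | 20.0 | 30.0 | No | Male | 65-69 | White | Yes | Yes | Fair | 8.0 | Yes | No | No |
| 3 | No | 24.21 | No | No | No | 0.0 | 0.0 | No | Female | 75-79 | White | No | No | Good | 6.0 | No | No | Yes |
| 4 | No | 23.71 | No | No | No | 28.0 | 0.0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8.0 | No | No | No |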


In [5]:
# Check the datatypes of dataset's columns
heart_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   HeartDisease      319795 non-null  object 
 1   BMI               319795 non-null  float64
 2   Smoking           319795 non-null  object 
 3   AlcoholDrinking   319795 non-null  object 
 4   Stroke            319795 non-null  object 
 5   PhysicalHealth    319795 non-null  float64
 6   MentalHealth      319795 non-null  float64
 7   DiffWalking       319795 non-null  object 
 8   Sex               319795 non-null  object 
 9   AgeCategory       319795 non-null  object 
 10  Race              319795 non-null  object 
 11  Diabetic          319795 non-null  object 
 12  PhysicalActivity  319795 non-null  object 
 13  GenHealth         319795 non-null  object 
 14  SleepTime         319795 non-null  float64
 15  Asthma            319795 non-null  object 
 16  KidneyDisease     31

In [6]:
# Unique values of Race column
heart_data.Race.unique()

array(['White', 'Black', 'Asian', 'American Indian/Alaskan Native',
       'Other', 'Hispanic'], dtype=object)

In [7]:
# Unique values of the GenHealth and Diabetic columns
# (only the printed Diabetic values are shown in the output below)
heart_data.GenHealth.unique()
print(heart_data.Diabetic.unique())

['Yes' 'No' 'No, borderline diabetes' 'Yes (during pregnancy)']


In [8]:
# Unique values of AgeCategory column
heart_data.AgeCategory.unique()

array(['55-59', '80 or older', '65-69', '75-79', '40-44', '70-74',
       '60-64', '50-54', '45-49', '18-24', '35-39', '30-34', '25-29'],
      dtype=object)

In [9]:
# Check for null values
heart_data.isna().sum()

HeartDisease        0
BMI                 0
Smoking             0
AlcoholDrinking     0
Stroke              0
PhysicalHealth      0
MentalHealth        0
DiffWalking         0
Sex                 0
AgeCategory         0
Race                0
Diabetic            0
PhysicalActivity    0
GenHealth           0
SleepTime           0
Asthma              0
KidneyDisease       0
SkinCancer          0
dtype: int64
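
Because the target is binary and the model is later judged by accuracy, it can help to look at how the two classes are distributed before encoding anything. The sketch below shows one way to do that with `value_counts(normalize=True)`. It uses a small made-up frame named `toy` in place of `heart_data`, so the printed shares are illustrative and say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for heart_data (made-up values, not taken from the dataset)
toy = pd.DataFrame({
    'HeartDisease': ['No', 'No', 'No', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No']
})

# Share of each class in the target column
class_share = toy['HeartDisease'].value_counts(normalize=True)
print(class_share)
```

On the real data the same call would be `heart_data['HeartDisease'].value_counts(normalize=True)`.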

In [10]:
# Prepare the data for use in a machine learning model:
# encode the categorical columns as numbers and drop the Race column
heart_data['HeartDisease'] = heart_data.HeartDisease.map({'Yes':1, 'No':0})
heart_data['Smoking'] = heart_data.Smoking.map({'Yes':1, 'No':0})
heart_data['AlcoholDrinking'] = heart_data.AlcoholDrinking.map({'Yes':1, 'No':0})
heart_data['Stroke'] = heart_data.Stroke.map({'Yes':1, 'No':0})
heart_data['DiffWalking'] = heart_data.DiffWalking.map({'Yes':1, 'No':0})
heart_data['Sex'] = heart_data.Sex.map({'Male':1, 'Female':0})
heart_data['AgeCategory'] = heart_data.AgeCategory.map({'18-24':1, '25-29':2, '30-34':3, '35-39':4, '40-44':5, '45-49':6, '50-54':7, '55-59':8, '60-64':9, '65-69':10, '70-74':11, '75-79':12, '80 or older':13})
heart_data = heart_data.drop('Race', axis=1)
heart_data['Diabetic'] = heart_data.Diabetic.map({'Yes':1, 'No':0, 'No, borderline diabetes':0, 'Yes (during pregnancy)':1})
heart_data['PhysicalActivity'] = heart_data.PhysicalActivity.map({'Yes':1, 'No':0})
heart_data['GenHealth'] = heart_data.GenHealth.map({'Excellent':3, 'Very good':2, 'Good':1, 'Fair':0, 'Poor':-1})
heart_data['Asthma'] = heart_data.Asthma.map({'Yes':1, 'No':0})
heart_data['KidneyDisease'] = heart_data.KidneyDisease.map({'Yes':1, 'No':0})
heart_data['SkinCancer'] = heart_data.SkinCancer.map({'Yes':1, 'No':0})
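
The Yes/No columns in the cell above all share one mapping, so the same encoding could be written as a loop. The sketch below is one possible arrangement, assuming a small made-up frame `toy` in place of `heart_data`. The names `yes_no`, `binary_columns`, and `gen_health` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in with the same style of values as the dataset (made-up rows)
toy = pd.DataFrame({
    'Smoking': ['Yes', 'No', 'Yes'],
    'Stroke': ['No', 'No', 'Yes'],
    'GenHealth': ['Very good', 'Poor', 'Excellent'],
})

yes_no = {'Yes': 1, 'No': 0}
binary_columns = ['Smoking', 'Stroke']  # would be extended with the other Yes/No columns

# Apply the shared Yes/No mapping to every binary column
for column in binary_columns:
    toy[column] = toy[column].map(yes_no)

# Ordinal mapping copied from the cell above
gen_health = {'Excellent': 3, 'Very good': 2, 'Good': 1, 'Fair': 0, 'Poor': -1}
toy['GenHealth'] = toy['GenHealth'].map(gen_health)

# map() turns any value missing from the dictionary into NaN,
# so a null count after encoding reveals typos in the mapping keys
missing_after_encoding = int(toy.isna().sum().sum())
print(toy)
print('Missing values after encoding:', missing_after_encoding)
```

Re-running a null check after encoding is a cheap safeguard, since `map()` does not raise an error for unmapped values.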

In [11]:
# The data now looks ready for machine learning
heart_data.head()

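| | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 16.6 | 1 | 0 | 0 | 3.0 | 30.0 | 0 | 0 | 8 | 1 | 1 | 2 | 5.0 | 1 | 0 | 1 |
| 1 | 0 | 20.34 | 0 | 0 | 1 | 0.0 | 0.0 | 0 | 0 | 13 | 0 | 1 | 2 | 7.0 | 0 | 0 | 0 |
| 2 | 0 | 26.58 | 1 | 0 | 0 | 20.0 | 30.0 | 0 | 1 | 10 | 1 | 1 | 0 | 8.0 | 1 | 0 | 0 |
| 3 | 0 | 24.21 | 0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 12 | 0 | 0 | 1 | 6.0 | 0 | 0 | 1 |
| 4 | 0 | 23.71 | 0 | 0 | 0 | 28.0 | 0.0 | 1 | 0 | 5 | 0 | 1 | 2 | 8.0 | 0 | 0 | 0 |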


In [12]:
# Import everything for creating a machine learning model and feature selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

In [12]:
# For this task I chose the Logistic Regression algorithm
# Split the data into training and test parts
x = heart_data.drop('HeartDisease', axis=1)
y = heart_data['HeartDisease']
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2, random_state=10)

lr = LogisticRegression(max_iter=1000)

In [13]:
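# Identify the number of features that gives the best performance using
# Sequential Backward Floating Selection (forward=False, floating=True).
# Note: cv=0 disables cross-validation, so avg_score is the accuracy on the training data.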
features = []
scores = []

for i in range(1, 17):
    sbs = SFS(
        lr,
        k_features = i,
        forward = False,
        floating = True,
        scoring = 'accuracy',
        cv = 0
    )
    sbs.fit(x_train, y_train)
    features.append(sbs.subsets_[i]['feature_names'])
    scores.append(sbs.subsets_[i]['avg_score'])
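
The loop above scores each subset on the training data itself, because `cv = 0` turns cross-validation off. For comparison, here is a minimal sketch of backward selection scored with 5-fold cross-validation. It swaps in scikit-learn's own `SequentialFeatureSelector`, which is a different implementation from the mlxtend class used in this notebook and has no floating variant. It also runs on synthetic data from `make_classification` rather than on `x_train` and `y_train`. The names `X_demo`, `y_demo`, and `selector` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for x_train / y_train
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=10
)

# Backward selection: start from all 6 features and remove them one at a time,
# scoring each candidate subset by 5-fold cross-validated accuracy
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction='backward',
    scoring='accuracy',
    cv=5,
)
selector.fit(X_demo, y_demo)

# Boolean mask marking the features that were kept
selected_mask = selector.get_support()
print(selected_mask)
```

Cross-validated scores are usually a fairer basis for comparing subsets than training accuracy, at the cost of fitting the model several times per candidate.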

In [14]:
# Find the best score and the features associated with it
print(scores, '\n')
best_index = scores.index(max(scores))  # position of the best score in the list
print('The best score {} our model got when had {} features'.format(max(scores), best_index+1), '\n')
print(features[best_index])
# ('BMI', 'Smoking', 'Stroke', 'DiffWalking', 'Sex', 'AgeCategory', 'Diabetic', 'GenHealth', 'Asthma', 'KidneyDisease')

[0.9143005675510875, 0.9150236870495161, 0.9149650557388327, 0.9151683109492018, 0.9151448584249284, 0.91546928501071, 0.9156608139589425, 0.915692083991307, 0.9156490876968058, 0.9157741678262638, 0.9157311715317625, 0.9157468065479448, 0.9156764489751247, 0.9156881752372614, 0.9156490876968058, 0.9155279163213934] 

The best score 0.9157741678262638 our model got when had 10 features 

('BMI', 'Smoking', 'Stroke', 'DiffWalking', 'Sex', 'AgeCategory', 'Diabetic', 'GenHealth', 'Asthma', 'KidneyDisease')


In [13]:
# Split the data into training and test parts using the features selected above
X = heart_data[['BMI', 'Smoking', 'Stroke', 'DiffWalking', 'Sex', 'AgeCategory', 'Diabetic', 'GenHealth', 'Asthma', 'KidneyDisease']]
Y = heart_data['HeartDisease']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, test_size=0.2, random_state=10)

lr_modified = LogisticRegression(max_iter=1000)
lr_modified.fit(X_train.values, Y_train.values)
print('Score from running the model on test data', lr_modified.score(X_test.values, Y_test.values))

Score from running the model on test data 0.9161025031660908
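
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below compares a logistic regression with a classifier that always predicts the most frequent class, using scikit-learn's `DummyClassifier`. It runs on synthetic imbalanced data from `make_classification`, so the numbers it prints are not results for the heart disease dataset. The names `X_demo`, `baseline`, and `model` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with roughly a 90/10 class split (an assumption made for illustration)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=10
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = model.score(X_te, y_te)
print('Baseline accuracy:', baseline_acc)
print('Model accuracy:', model_acc)
```

With the notebook's variables, the same comparison would fit the baseline on `X_train` and `Y_train` and score it on `X_test` and `Y_test`.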


In [14]:
# Create an interface where anyone can estimate their probability of having heart disease
def BMI(body_mass, height):
    return round(body_mass/(height**2), 2)

values = []
body_mass, height = (input('1. Body Mass in Kg, Height in m (like 75, 1.75)')).split(',')
values.append(BMI(float(body_mass), float(height)))
values.append(int(input('2. Have you smoked at least 100 cigarettes in your entire life? If Yes input 1, No input 0')))
values.append(int(input('3. (Ever told) (you had) a stroke? If Yes input 1, No input 0')))
values.append(int(input('4. Do you have serious difficulty walking or climbing stairs? If Yes input 1, No input 0')))
values.append(int(input('5. Are you male or female? If you are male input 1, female input 0')))
values.append(int(input('6. In what age category are you? 18-24:1, 25-29:2, 30-34:3, 35-39:4, 40-44:5, 45-49:6, 50-54:7, 55-59:8, 60-64:9, 65-69:10, 70-74:11, 75-79:12, 80 or older:13')))
values.append(int(input('7. (Ever told) (you had) diabetes? If Yes input 1, No input 0')))
values.append(int(input('8. Would you say that in general your health is Excellent:3, Very good:2, Good:1, Fair:0, Poor:-1')))
values.append(int(input('9. (Ever told) (you had) asthma? If Yes input 1, No input 0')))
values.append(int(input('10. Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease? If Yes input 1, No input 0')))

values = np.array(values).reshape(1, -1)
if lr_modified.predict(values)[0] == 1:
    print('There is a high chance that you have heart disease')
else:
    print("You most likely don't have heart disease")
print('The probability of you having a heart disease is', round(lr_modified.predict_proba(values)[0][1]*100, 2), 'percent')

1. Body Mass in Kg, Height in m (like 75, 1.75)67.8, 1.73
2. Have you smoked at least 100 cigarettes in your entire life? If Yes input 1, No input 00
3. (Ever told) (you had) a stroke? If Yes input 1, No input 00
4. Do you have serious difficulty walking or climbing stairs? If Yes input 1, No input 00
5. Are you male or female? If you are male input 1, female input 01
6. In what age category are you? 18-24:1, 25-29:2, 30-34:3, 35-39:4, 40-44:5, 45-49:6, 50-54:7, 55-59:8, 60-64:9, 65-69:10, 70-74:11, 75-79:12, 80 or older:131
7. (Ever told) (you had) diabetes? If Yes input 1, No input 00
8. Would you say that in general your health is Excellent:3, Very good:2, Good:1, Fair:0, Poor:-12
9. (Ever told) (you had) asthma? If Yes input 1, No input 00
10. Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease? If Yes input 1, No input 00
You most likely don't have heart disease
The probability of you having a heart disease is 0.57 percent
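
The interactive cell above can also be expressed as a plain function, which makes the BMI arithmetic and the feature order easier to check without typing answers by hand. This is a sketch under assumptions: `build_feature_row` is a hypothetical helper, not part of the original notebook, and it only assembles the input row without calling the model. The answers passed in are the ones from the recorded session above.

```python
import numpy as np

def bmi(body_mass_kg, height_m):
    """Body mass index: mass in kilograms divided by height in metres squared."""
    return round(body_mass_kg / height_m ** 2, 2)

def build_feature_row(body_mass_kg, height_m, answers):
    """Hypothetical helper: return a (1, 10) array of BMI plus nine coded answers.

    The answers must follow the order used by the model: Smoking, Stroke,
    DiffWalking, Sex, AgeCategory, Diabetic, GenHealth, Asthma, KidneyDisease.
    """
    if height_m <= 0:
        raise ValueError('height must be positive')
    if len(answers) != 9:
        raise ValueError('expected nine coded answers')
    return np.array([bmi(body_mass_kg, height_m), *answers], dtype=float).reshape(1, -1)

# Same inputs as in the recorded session: 67.8 kg, 1.73 m
row = build_feature_row(67.8, 1.73, [0, 0, 0, 1, 1, 0, 2, 0, 0])
print(row)
```

A row built this way has the shape that `lr_modified.predict` and `lr_modified.predict_proba` expect, so it could be passed to them directly.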


## `Conclusion`

I explored the dataset, prepared it for machine learning, and built a Logistic Regression model that classifies whether a person has heart disease with 91.61% accuracy on the test data.