# Heart Attack Prediction Using Machine Learning

In this notebook we perform exploratory data analysis (EDA) and train several machine learning models to predict whether a patient has heart disease from the values of a set of clinical features. We will use Bokeh, and a little Seaborn, to plot the graphs.

First, we import the necessary libraries.

In [1]:
import numpy as np 
import pandas as pd 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
import warnings
warnings.filterwarnings('ignore')

# About the Datasets

This notebook uses four databases concerning heart disease diagnosis. All attributes are numeric-valued. The data was collected at the following four locations:

                1. Cleveland Clinic Foundation (cleveland.csv)
                2. Hungarian Institute of Cardiology, Budapest (hungarian.csv)
                3. V.A. Medical Center, Long Beach, CA (long-beach-va.csv)
                4. University Hospital, Zurich, Switzerland (switzerland.csv)


Each database has the same instance format. Although each database contains 76 raw attributes, all published experiments use a subset of only 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient. It is integer-valued from 0 (no presence) to 4.
Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).  
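
The reduction from the five-level "goal" field to presence versus absence can be sketched in pandas. This is a minimal illustration on made-up values, assuming the diagnosis is stored in a column named `target`; `binarize_target` is a hypothetical helper and not part of the original notebook.

```python
import pandas as pd


def binarize_target(goal: pd.Series) -> pd.Series:
    """Map the 0-4 diagnosis field to 0 (absence) or 1 (presence)."""
    # Any value above 0 counts as "disease present".
    return (goal > 0).astype(int)


# Toy values (not real patient data) covering every level of the field.
goal = pd.Series([0, 1, 2, 3, 4, 0], name="target")
binary = binarize_target(goal)
print(binary.tolist())
```

Collapsing the levels this way turns the task into binary classification, which is the setting the published Cleveland experiments concentrate on.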


Number of Instances: 
             
              Database:               Number of instances:

              Cleveland:                      303
              Hungarian:                      294
              Switzerland:                    123
              Long Beach VA:                  200


Number of Attributes: 76 (including the predicted attribute)


The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigators responsible for the data collection at each institution. They are:

        a. Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
        b. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
        c. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
        d. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation:
	     Robert Detrano, M.D., Ph.D.

The Cleveland dataset consists of 303 patient records. There are 14 columns in the dataset, which are described below:

        1. Age: displays the age of the individual.

        2. Sex: displays the gender of the individual using the following format:
                    1 = male
                    0 = female

        3. Cp (Chest-pain type): displays the type of chest-pain experienced by the
           individual using the following format:
                    1 = typical angina
                    2 = atypical angina
                    3 = non-anginal pain
                    4 = asymptomatic

        4. TrestBPS (Resting Blood Pressure): Displays the resting blood pressure value of
           an individual in mmHg (unit). It can take continuous values from 94 to 200.

        5. Chol (Serum Cholesterol): Displays the serum cholesterol in mg/dl (unit)

        6. Fbs (Fasting Blood Sugar): Compares the fasting blood sugar value of an
           individual with 120 mg/dl:
                    1 (true) = Fasting blood sugar > 120 mg/dl
                    0 (false) = Fasting blood sugar <= 120 mg/dl

        7. RestECG (Resting ECG): displays resting electrocardiographic results
                    0 = normal
                    1 = having ST-T wave abnormality
                    2 = left ventricular hypertrophy

        8. Thalach (Max heart rate achieved): displays the max heart rate achieved by an
           individual. It can take continuous values from 71 to 202.

        9. Exang (Exercise induced angina): Angina is a type of chest pain caused by reduced
           blood flow to the heart.
                    1 = yes
                    0 = no

        10.	OldPeak (ST depression induced by exercise relative to rest): displays the value
            as an integer or float.

        11.	Slope (Peak exercise ST segment):
                    1 = upsloping
                    2 = flat
                    3 = down sloping

        12.	Ca (Number of major vessels (0–3) colored by fluoroscopy): displays the value as
            integer or float.

        13.	Thal: displays the thalassemia:
                    3 = normal
                    6 = fixed defect
                    7 = reversible defect

        14.	Target (Diagnosis of heart disease): Displays whether the individual is
            suffering from heart disease or not:
                    0 = absence
                    1, 2, 3, 4 = present.
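
Because several of these columns store integer codes, it can help to decode them into readable labels while exploring the data. The sketch below does this for three of the coded columns using the mappings listed above. The `*_LABELS` dictionaries and the two-row `sample` frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Code-to-label mappings taken from the column descriptions above.
SEX_LABELS = {1: "male", 0: "female"}
CP_LABELS = {
    1: "typical angina",
    2: "atypical angina",
    3: "non-anginal pain",
    4: "asymptomatic",
}
RESTECG_LABELS = {
    0: "normal",
    1: "ST-T wave abnormality",
    2: "left ventricular hypertrophy",
}

# Two made-up rows, only for demonstration.
sample = pd.DataFrame({"sex": [1, 0], "cp": [1, 4], "restecg": [2, 0]})

# assign() returns a new frame, so the numeric codes in `sample` stay intact.
decoded = sample.assign(
    sex=sample["sex"].map(SEX_LABELS),
    cp=sample["cp"].map(CP_LABELS),
    restecg=sample["restecg"].map(RESTECG_LABELS),
)
print(decoded)
```

Keeping the numeric codes in the working frame and decoding only for display avoids having to re-encode the columns before modelling.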


In [2]:
df0 = pd.read_csv('cleveland.csv')
df1 = pd.read_csv('hungarian.csv')
df2 = pd.read_csv('switzerland.csv')
df3 = pd.read_csv('va.csv')

In [3]:
df0.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,1
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0


In [4]:
df1.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,28,1,2,130.0,132.0,0.0,2.0,185.0,0.0,0.0,,,,0
1,29,1,2,120.0,243.0,0.0,0.0,160.0,0.0,0.0,,,,0
2,29,1,2,140.0,,0.0,0.0,170.0,0.0,0.0,,,,0
3,30,0,1,170.0,237.0,0.0,1.0,170.0,0.0,0.0,,,6.0,0
4,31,0,2,100.0,219.0,0.0,1.0,150.0,0.0,0.0,,,,0


In [5]:
df2.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,32,1,1,95.0,0,,0.0,127.0,0.0,0.7,1.0,,,1
1,34,1,4,115.0,0,,,154.0,0.0,0.2,1.0,,,1
2,35,1,4,,0,,0.0,130.0,1.0,,,,7.0,1
3,36,1,4,110.0,0,,0.0,125.0,1.0,1.0,2.0,,6.0,1
4,38,0,4,105.0,0,,0.0,166.0,0.0,2.8,1.0,,,1


In [6]:
df3.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,4,140.0,260.0,0.0,1,112.0,1.0,3.0,2.0,,,1
1,44,1,4,130.0,209.0,0.0,1,127.0,0.0,0.0,,,,0
2,60,1,4,132.0,218.0,0.0,1,140.0,1.0,1.5,3.0,,,1
3,55,1,4,142.0,228.0,0.0,1,149.0,1.0,2.5,1.0,,,1
4,66,1,3,110.0,213.0,1.0,2,99.0,1.0,1.3,2.0,,,0


In [7]:
df0_missing = df0.isna()
df0_missing.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [8]:
df1_missing = df1.isna()
df1_missing.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,False,False,False,False,False,False,False,False,False,False,True,True,True,False
1,False,False,False,False,False,False,False,False,False,False,True,True,True,False
2,False,False,False,False,True,False,False,False,False,False,True,True,True,False
3,False,False,False,False,False,False,False,False,False,False,True,True,False,False
4,False,False,False,False,False,False,False,False,False,False,True,True,True,False


In [9]:
df2_missing = df2.isna()
df2_missing.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,False,False,False,False,False,True,False,False,False,False,False,True,True,False
1,False,False,False,False,False,True,True,False,False,False,False,True,True,False
2,False,False,False,True,False,True,False,False,False,True,True,True,False,False
3,False,False,False,False,False,True,False,False,False,False,False,True,False,False
4,False,False,False,False,False,True,False,False,False,False,False,True,True,False


In [10]:
df3_missing = df3.isna()
df3_missing.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,False,False,False,False,False,False,False,False,False,False,False,True,True,False
1,False,False,False,False,False,False,False,False,False,False,True,True,True,False
2,False,False,False,False,False,False,False,False,False,False,False,True,True,False
3,False,False,False,False,False,False,False,False,False,False,False,True,True,False
4,False,False,False,False,False,False,False,False,False,False,False,True,True,False


In [11]:
df0_num_missing = df0_missing.sum()
df0_num_missing

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          4
thal        2
target      0
dtype: int64

In [12]:
df1_num_missing = df1_missing.sum()
df1_num_missing

age           0
sex           0
cp            0
trestbps      1
chol         23
fbs           8
restecg       1
thalach       1
exang         1
oldpeak       0
slope       190
ca          291
thal        266
target        0
dtype: int64

In [13]:
df2_num_missing = df2_missing.sum()
df2_num_missing

age           0
sex           0
cp            0
trestbps      2
chol          0
fbs          75
restecg       1
thalach       1
exang         1
oldpeak       6
slope        17
ca          118
thal         52
target        0
dtype: int64

In [14]:
df3_num_missing = df3_missing.sum()
df3_num_missing

age           0
sex           0
cp            0
trestbps     57
chol         56
fbs           7
restecg       0
thalach      53
exang        53
oldpeak      56
slope       102
ca          198
thal        166
target        0
dtype: int64

In [15]:
df0.isna().mean().round(4) * 100

age         0.00
sex         0.00
cp          0.00
trestbps    0.00
chol        0.00
fbs         0.00
restecg     0.00
thalach     0.00
exang       0.00
oldpeak     0.00
slope       0.00
ca          1.32
thal        0.66
target      0.00
dtype: float64

In [16]:
df1.isna().mean().round(4) * 100

age          0.00
sex          0.00
cp           0.00
trestbps     0.34
chol         7.82
fbs          2.72
restecg      0.34
thalach      0.34
exang        0.34
oldpeak      0.00
slope       64.63
ca          98.98
thal        90.48
target       0.00
dtype: float64

In [17]:
df2.isna().mean().round(4) * 100

age          0.00
sex          0.00
cp           0.00
trestbps     1.63
chol         0.00
fbs         60.98
restecg      0.81
thalach      0.81
exang        0.81
oldpeak      4.88
slope       13.82
ca          95.93
thal        42.28
target       0.00
dtype: float64

In [18]:
df3.isna().mean().round(4) * 100

age          0.0
sex          0.0
cp           0.0
trestbps    28.5
chol        28.0
fbs          3.5
restecg      0.0
thalach     26.5
exang       26.5
oldpeak     28.0
slope       51.0
ca          99.0
thal        83.0
target       0.0
dtype: float64
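
The per-dataset percentages above are easier to compare when placed side by side. The following is a minimal sketch of one way to do that, shown on small made-up frames rather than the real CSV files. `missing_summary` and the `toy_a`/`toy_b` names are hypothetical.

```python
import numpy as np
import pandas as pd


def missing_summary(frames: dict) -> pd.DataFrame:
    """Percentage of missing values per column, one column per dataset."""
    return pd.DataFrame(
        {name: frame.isna().mean() * 100 for name, frame in frames.items()}
    ).round(2)


# Toy frames standing in for the four datasets (values are made up).
toy = {
    "toy_a": pd.DataFrame({"age": [63, 67, 41, 50], "ca": [0.0, 3.0, np.nan, 1.0]}),
    "toy_b": pd.DataFrame({"age": [28, 29, 30, 31], "ca": [np.nan, np.nan, np.nan, 2.0]}),
}
summary = missing_summary(toy)
print(summary)
```

With the real data, the call would presumably look like `missing_summary({"cleveland": df0, "hungarian": df1, "switzerland": df2, "va": df3})`, which makes columns that are mostly empty in several datasets stand out in a single table.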

In [19]:
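# slope, ca and thal are largely missing in the other three datasets,
# so they are dropped here too (presumably to keep the columns consistent).
df0.drop(['slope', 'ca', 'thal'], axis=1, inplace=True)
# thresh=3 keeps every row with at least 3 non-null values,
# so no rows are actually removed here.
df0.dropna(inplace=True, thresh=3)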
df0.info()
df0.shape

<class 'pandas.core.frame.DataFrame'>
Int64Index: 303 entries, 0 to 302
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  target    303 non-null    int64  
dtypes: float64(1), int64(10)
memory usage: 28.4 KB


(303, 11)

In [20]:
df1.drop(['slope', 'ca', 'thal'], axis=1, inplace=True)
df1.dropna(inplace=True, thresh=3)
df1.info()
df1.shape

<class 'pandas.core.frame.DataFrame'>
Int64Index: 294 entries, 0 to 293
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       294 non-null    int64  
 1   sex       294 non-null    int64  
 2   cp        294 non-null    int64  
 3   trestbps  293 non-null    float64
 4   chol      271 non-null    float64
 5   fbs       286 non-null    float64
 6   restecg   293 non-null    float64
 7   thalach   293 non-null    float64
 8   exang     293 non-null    float64
 9   oldpeak   294 non-null    float64
 10  target    294 non-null    int64  
dtypes: float64(7), int64(4)
memory usage: 27.6 KB


(294, 11)

In [21]:
df2.drop(['slope', 'ca', 'thal'], axis=1, inplace=True)
df2.dropna(inplace=True, thresh=3)
df2.info()
df2.shape

<class 'pandas.core.frame.DataFrame'>
Int64Index: 123 entries, 0 to 122
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       123 non-null    int64  
 1   sex       123 non-null    int64  
 2   cp        123 non-null    int64  
 3   trestbps  121 non-null    float64
 4   chol      123 non-null    int64  
 5   fbs       48 non-null     float64
 6   restecg   122 non-null    float64
 7   thalach   122 non-null    float64
 8   exang     122 non-null    float64
 9   oldpeak   117 non-null    float64
 10  target    123 non-null    int64  
dtypes: float64(6), int64(5)
memory usage: 11.5 KB


(123, 11)

In [22]:
df3.drop(['slope', 'ca', 'thal'], axis=1, inplace=True)
df3.dropna(inplace=True, thresh=3)
df3.info()
df3.shape

<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 0 to 199
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       200 non-null    int64  
 1   sex       200 non-null    int64  
 2   cp        200 non-null    int64  
 3   trestbps  143 non-null    float64
 4   chol      144 non-null    float64
 5   fbs       193 non-null    float64
 6   restecg   200 non-null    int64  
 7   thalach   147 non-null    float64
 8   exang     147 non-null    float64
 9   oldpeak   144 non-null    float64
 10  target    200 non-null    int64  
dtypes: float64(6), int64(5)
memory usage: 18.8 KB


(200, 11)
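
The row counts above are unchanged after `dropna(thresh=3)`, and the `info()` output still reports null values. This follows from how `thresh` works: a row is kept as long as it has at least that many non-null values. The sketch below contrasts `thresh=3` with the default `dropna()` on a small made-up frame, which is an assumption for illustration and not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 is complete, row 1 has one gap, row 2 has two gaps.
toy = pd.DataFrame({
    "age": [63, 67, 41],
    "chol": [233, np.nan, np.nan],
    "fbs": [1.0, 0.0, np.nan],
    "target": [0, 1, 1],
})

# Keep rows with at least 3 non-null values: rows 0 and 1 survive.
kept_thresh = toy.dropna(thresh=3)

# Default behaviour (how='any'): drop every row containing any null value.
kept_any = toy.dropna()

print(len(kept_thresh), len(kept_any))
```

With 11 columns per row, a threshold of 3 is very permissive, so any remaining gaps would still need to be imputed or dropped before fitting models that cannot handle missing values.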

In [23]:
df0.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.438944,0.679868,3.158416,131.689769,246.693069,0.148515,0.990099,149.607261,0.326733,1.039604,0.458746
std,9.038662,0.467299,0.960126,17.599748,51.776918,0.356198,0.994971,22.875003,0.469794,1.161075,0.49912
min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0
25%,48.0,0.0,3.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,0.0
50%,56.0,1.0,3.0,130.0,241.0,0.0,1.0,153.0,0.0,0.8,0.0
75%,61.0,1.0,4.0,140.0,275.0,0.0,2.0,166.0,1.0,1.6,1.0
max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,1.0


In [24]:
df1.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,target
count,294.0,294.0,294.0,293.0,271.0,286.0,293.0,293.0,293.0,294.0,294.0
mean,47.826531,0.72449,2.982993,132.583618,250.848708,0.06993,0.21843,139.129693,0.303754,0.586054,0.360544
std,7.811812,0.447533,0.965117,17.626568,67.657711,0.255476,0.460868,23.589749,0.460665,0.908648,0.480977
min,28.0,0.0,1.0,92.0,85.0,0.0,0.0,82.0,0.0,0.0,0.0
25%,42.0,0.0,2.0,120.0,209.0,0.0,0.0,122.0,0.0,0.0,0.0
50%,49.0,1.0,3.0,130.0,243.0,0.0,0.0,140.0,0.0,0.0,0.0
75%,54.0,1.0,4.0,140.0,282.5,0.0,0.0,155.0,1.0,1.0,1.0
max,66.0,1.0,4.0,200.0,603.0,1.0,2.0,190.0,1.0,5.0,1.0


In [25]:
df2.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,target
count,123.0,123.0,123.0,121.0,123.0,48.0,122.0,122.0,122.0,117.0,123.0
mean,55.317073,0.918699,3.699187,130.206612,0.0,0.104167,0.360656,121.557377,0.442623,0.653846,0.934959
std,9.032108,0.274414,0.688726,22.559151,0.0,0.308709,0.590077,25.977438,0.498745,1.056061,0.247606
min,32.0,0.0,1.0,80.0,0.0,0.0,0.0,60.0,0.0,-2.6,0.0
25%,51.0,1.0,4.0,115.0,0.0,0.0,0.0,104.25,0.0,0.0,1.0
50%,56.0,1.0,4.0,125.0,0.0,0.0,0.0,121.0,0.0,0.3,1.0
75%,61.5,1.0,4.0,145.0,0.0,0.0,1.0,140.0,1.0,1.5,1.0
max,74.0,1.0,4.0,200.0,0.0,1.0,2.0,182.0,1.0,3.7,1.0


In [26]:
df3.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,target
count,200.0,200.0,200.0,143.0,144.0,193.0,200.0,147.0,147.0,144.0,200.0
mean,59.35,0.97,3.505,134.699301,239.569444,0.352332,0.735,122.795918,0.646259,1.320833,0.745
std,7.811697,0.171015,0.795701,18.445976,52.788753,0.478939,0.683455,21.990328,0.479765,1.106236,0.436955
min,35.0,0.0,1.0,96.0,100.0,0.0,0.0,69.0,0.0,-0.5,0.0
25%,55.0,1.0,3.0,120.0,208.0,0.0,0.0,109.0,0.0,0.0,0.0
50%,60.0,1.0,4.0,130.0,228.0,0.0,1.0,120.0,1.0,1.5,1.0
75%,64.0,1.0,4.0,148.0,271.25,1.0,1.0,140.0,1.0,2.0,1.0
max,77.0,1.0,4.0,190.0,458.0,1.0,2.0,180.0,1.0,4.0,1.0
