# Heart Disease Data Wrangling

#### Objective: Heart disease is one of the leading causes of death in America, and detecting it early gives patients and physicians more time to act.  In this project, we will try to detect heart disease using supervised machine learning.  The model is not intended to replace doctors; it is intended as a tool to assist physicians in their decision making.

#### Data Source: https://www.kaggle.com/fedesoriano/heart-failure-prediction

##### Kaggle Project overview:
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs and this dataset contains 11 features that can be used to predict a possible heart disease.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.


In [30]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math 

In [3]:
heart_data=pd.read_csv("heart.csv.xls")

In [4]:
heart_data

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
...,...,...,...,...,...,...,...,...,...,...,...,...
913,45,M,TA,110,264,0,Normal,132,N,1.2,Flat,1
914,68,M,ASY,144,193,1,Normal,141,N,3.4,Flat,1
915,57,M,ASY,130,131,0,Normal,115,Y,1.2,Flat,1
916,57,F,ATA,130,236,0,LVH,174,N,0.0,Flat,1


In [6]:
print("Number of patients in the analysis: " + str(len(heart_data.index)))

Number of patients in the analysis: 918


### Replacing missing cholesterol values with the median.  The Cholesterol column contains many missing values, which are recorded as 0.

In [7]:
heart_data['Cholesterol']=heart_data['Cholesterol'].replace(0,heart_data['Cholesterol'].median())
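
One thing to keep in mind with the replacement above: the median is computed while the 0 placeholders are still in the column, which pulls it downward. A minimal sketch of an alternative, using a small made-up series rather than the real data, takes the median from the non-zero readings only. The helper name `impute_zero_with_median` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def impute_zero_with_median(values: pd.Series) -> pd.Series:
    """Replace 0 placeholders with the median of the non-zero entries."""
    observed_median = values[values != 0].median()
    return values.replace(0, observed_median)


# Toy cholesterol readings (illustrative only, not taken from heart.csv).
toy = pd.Series([289, 0, 180, 0, 283, 214])

median_with_zeros = toy.median()        # zeros are counted as real readings
imputed = impute_zero_with_median(toy)  # zeros are excluded from the median

print("Median including zeros:", median_with_zeros)
print("Imputed values:", imputed.tolist())
```

Whether this matters for the real dataset depends on how many zeros the column holds; the more placeholders there are, the further the two medians drift apart.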

In [8]:
heart_data3=heart_data

In [9]:
RestingECG=heart_data3['RestingECG'].tolist()

##### In later parts of this project, I will need numerical values for all of the data, so I will now create 'dummy variables' for the categorical columns.  Creating dummy variables converts categorical data to 0's and 1's, adding one column per category.  For example, the column ExerciseAngina has the values Y and N.  Instead of keeping that column, I will add two columns, Y and N, and populate each cell with 1 if the row belongs to that category and 0 otherwise.

In [10]:
exercise_angina=pd.get_dummies(heart_data["ExerciseAngina"])

In [12]:
exercise_angina

Unnamed: 0,N,Y
0,1,0
1,1,0
2,1,0
3,0,1
4,1,0
...,...,...
913,1,0
914,1,0
915,0,1
916,1,0
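
To make the encoding concrete, here is a small sketch on a made-up four-row column. It passes `dtype=int` because recent pandas versions return True/False columns by default, and it also shows the optional `prefix` argument, which keeps the new column names traceable to their source column. The toy data and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy column with the same two categories as ExerciseAngina.
toy = pd.DataFrame({"ExerciseAngina": ["N", "N", "Y", "N"]})

# One column per category; dtype=int requests 0/1 instead of booleans.
dummies = pd.get_dummies(toy["ExerciseAngina"], dtype=int)

# Same encoding, but with the source column name as a prefix.
prefixed = pd.get_dummies(toy["ExerciseAngina"], prefix="ExerciseAngina", dtype=int)

print(dummies)
print(prefixed.columns.tolist())
```

Each row has exactly one 1 across the dummy columns, since every patient falls into exactly one category.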


In [13]:
ST=pd.get_dummies(heart_data["ST_Slope"])

In [14]:
Chest_Pain_Type=pd.get_dummies(heart_data["ChestPainType"])

In [15]:
Resting_ECG=pd.get_dummies(heart_data["RestingECG"])

In [16]:
Sex_type=pd.get_dummies(heart_data["Sex"])

##### Now that I have created the new columns to add to the dataframe, I need to remove the original columns that hold the categorical information.

In [17]:
heart_data=heart_data.drop(['Sex'], axis=1)

In [18]:
heart_data=heart_data.drop(['ChestPainType'], axis=1)

In [19]:
heart_data=heart_data.drop(['RestingECG'], axis=1)

In [20]:
heart_data=heart_data.drop(['ExerciseAngina'], axis=1)

In [21]:
heart_data=heart_data.drop(['ST_Slope'], axis=1)

In [22]:
heart_data=pd.concat([ST, heart_data, exercise_angina, Chest_Pain_Type, Resting_ECG, Sex_type], axis=1)
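
The drop-and-concat steps above can also be expressed in a single call by passing `columns=` to `pd.get_dummies`. The sketch below uses a made-up three-row frame. Note that this variant prefixes each new column with its source column name, so the resulting names differ from the ones produced in this notebook; it is an alternative, not the method used here.

```python
import pandas as pd

# Toy frame with two categorical columns (illustrative values only).
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Up"],
    "HeartDisease": [0, 1, 0],
})

# Encodes the listed columns, drops the originals, and keeps the rest.
encoded = pd.get_dummies(toy, columns=["Sex", "ST_Slope"], dtype=int)

print(encoded.columns.tolist())
```

The prefixed names also avoid ambiguity between similarly named categories, such as the ST category of RestingECG and the ST_Slope column.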

In [23]:
heart_data

Unnamed: 0,Down,Flat,Up,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,...,Y,ASY,ATA,NAP,TA,LVH,Normal,ST,F,M
0,0,0,1,40,140,289,0,172,0.0,0,...,0,0,1,0,0,0,1,0,0,1
1,0,1,0,49,160,180,0,156,1.0,1,...,0,0,0,1,0,0,1,0,1,0
2,0,0,1,37,130,283,0,98,0.0,0,...,0,0,1,0,0,0,0,1,0,1
3,0,1,0,48,138,214,0,108,1.5,1,...,1,1,0,0,0,0,1,0,1,0
4,0,0,1,54,150,195,0,122,0.0,0,...,0,0,0,1,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
913,0,1,0,45,110,264,0,132,1.2,1,...,0,0,0,0,1,0,1,0,0,1
914,0,1,0,68,144,193,1,141,3.4,1,...,0,1,0,0,0,0,1,0,0,1
915,0,1,0,57,130,131,0,115,1.2,1,...,1,1,0,0,0,0,1,0,0,1
916,0,1,0,57,130,236,0,174,0.0,1,...,0,0,1,0,0,1,0,0,1,0


##### The table above now contains only numerical values.

### Next, I will check whether other values are missing from the table.  As with cholesterol, a missing value would show up as 0, so I will count the zeros in the following columns:
#### RestingBP, MaxHR, Age

In [27]:
(heart_data['MaxHR']==0).sum()

0

In [28]:
(heart_data['RestingBP']==0).sum()

1

In [29]:
(heart_data['Age']==0).sum()

0
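
The three checks above can be combined into one expression by comparing several columns to 0 at once. This is a minimal sketch on a made-up frame; the values are assumptions chosen so that one column contains a single zero.

```python
import pandas as pd

# Toy frame in which RestingBP has one 0 placeholder.
toy = pd.DataFrame({
    "RestingBP": [140, 0, 130],
    "MaxHR": [172, 156, 98],
    "Age": [40, 49, 37],
})

# Boolean frame of "is zero", summed per column to count the zeros.
zero_counts = (toy[["RestingBP", "MaxHR", "Age"]] == 0).sum()

print(zero_counts)
```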

#### Only one missing value was found above, in the RestingBP column.  I will replace it with the median resting blood pressure.

In [31]:
heart_data['RestingBP']=heart_data['RestingBP'].replace(0,heart_data['RestingBP'].median())

In [32]:
heart_data

Unnamed: 0,Down,Flat,Up,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,...,Y,ASY,ATA,NAP,TA,LVH,Normal,ST,F,M
0,0,0,1,40,140,289,0,172,0.0,0,...,0,0,1,0,0,0,1,0,0,1
1,0,1,0,49,160,180,0,156,1.0,1,...,0,0,0,1,0,0,1,0,1,0
2,0,0,1,37,130,283,0,98,0.0,0,...,0,0,1,0,0,0,0,1,0,1
3,0,1,0,48,138,214,0,108,1.5,1,...,1,1,0,0,0,0,1,0,1,0
4,0,0,1,54,150,195,0,122,0.0,0,...,0,0,0,1,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
913,0,1,0,45,110,264,0,132,1.2,1,...,0,0,0,0,1,0,1,0,0,1
914,0,1,0,68,144,193,1,141,3.4,1,...,0,1,0,0,0,0,1,0,0,1
915,0,1,0,57,130,131,0,115,1.2,1,...,1,1,0,0,0,0,1,0,0,1
916,0,1,0,57,130,236,0,174,0.0,1,...,0,0,1,0,0,1,0,0,1,0


In [33]:
(heart_data['RestingBP']==0).sum()

0

# Next steps
## In the next section of this project, I will carry out an exploratory data analysis to identify the important features, which will help me build and tune a model that predicts whether a patient has heart disease.  See the next project in the repository, labeled "Heart Disease Exploratory Data Analysis."