# Heart Disease Prediction

## Abstract:
Heart disease is easier to treat when it is detected early, and machine learning techniques can make that detection more efficient. Predicting heart disease is a central problem in medicine because it is among the most common conditions linked to an unhealthy lifestyle. An early prediction therefore helps with both treatment and prevention.  

## Problem Statement:
Analyze the heart disease dataset to explore the machine learning algorithms and build a decision tree model to predict the disease.  

## Dataset Information:
Each attribute in the heart disease dataset is a medical risk factor.  

## Variable Description:
* <u>age:</u>	Age of the patient
* <u>gender:</u> Gender of the patient - (1=Male, 0=Female), following the encoding of the UCI heart disease data that these rows appear to come from
* <u>chest_pain:</u> It refers to the chest pain experienced by the patient -(0,1,2,3)
* <u>rest_bps:</u> Resting blood pressure of the patient (in mm Hg)
* <u>cholesterol:</u>	Patient's cholesterol level (in mg/dl); the dataset column is spelled `cholestrol`
* <u>fasting_blood_sugar:</u>	Whether the patient's fasting blood sugar exceeds 120 mg/dl - (1=Yes, 0=No)
* <u>rest_ecg:</u> Resting electrocardiographic (ECG) results - (0,1,2)
* <u>thalach:</u>	The patient’s maximum heart rate
* <u>exer_angina:</u>	It refers to exercise-induced angina - (1=Yes, 0=No)
* <u>old_peak:</u> ST depression induced by exercise relative to rest (the ST segment is a portion of the ECG trace)
* <u>slope:</u> Slope of the peak exercise ST segment - (0,1,2)
* <u>ca:</u> Number of major vessels - (0,1,2,3,4)
* <u>thalassemia:</u>	It refers to thalassemia which is a blood disorder - (0,1,2,3)
* <u>target:</u> The patient has heart disease or not - (1=Yes, 0=No)

## Scope:
* Understand data by performing exploratory data analysis
* Train a Decision Tree classifier to predict whether a patient has heart disease
* Understand feature importances and improve the model
* Understand various model performance metrics and measure the performance of each model

In [1]:
# Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn import tree

In [2]:
df = pd.read_csv(r'HeartDisease.csv', header=0)
df.head()

Unnamed: 0,age,gender,chest_pain,rest_bps,cholestrol,fasting_blood_sugar,rest_ecg,thalach,exer_angina,old_peak,slope,ca,thalassemia,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [3]:
# Basic Data Exploration.

print("Data types: \n", df.dtypes)
print("\n\n\n The shape of the data: \n", df.shape)
print("\n\n\n Data description: \n\n", df.describe(),"\n\n")

Data types: 
 age                      int64
gender                   int64
chest_pain               int64
rest_bps                 int64
cholestrol               int64
fasting_blood_sugar      int64
rest_ecg                 int64
thalach                  int64
exer_angina              int64
old_peak               float64
slope                    int64
ca                       int64
thalassemia              int64
target                   int64
dtype: object



 The shape of the data: 
 (303, 14)



 Data description: 

               age      gender  chest_pain    rest_bps  cholestrol  \
count  303.000000  303.000000  303.000000  303.000000  303.000000   
mean    54.366337    0.683168    0.966997  131.623762  246.264026   
std      9.082101    0.466011    1.032052   17.538143   51.830751   
min     29.000000    0.000000    0.000000   94.000000  126.000000   
25%     47.500000    0.000000    0.000000  120.000000  211.000000   
50%     55.000000    1.000000    1.000000  130.000000  240.0

In [4]:
print("Data Information: \n")
df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()

Data Information: 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   age                  303 non-null    int64  
 1   gender               303 non-null    int64  
 2   chest_pain           303 non-null    int64  
 3   rest_bps             303 non-null    int64  
 4   cholestrol           303 non-null    int64  
 5   fasting_blood_sugar  303 non-null    int64  
 6   rest_ecg             303 non-null    int64  
 7   thalach              303 non-null    int64  
 8   exer_angina          303 non-null    int64  
 9   old_peak             303 non-null    float64
 10  slope                303 non-null    int64  
 11  ca                   303 non-null    int64  
 12  thalassemia          303 non-null    int64  
 13  target               303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [5]:
# Checking column names

df.columns

Index(['age', 'gender', 'chest_pain', 'rest_bps', 'cholestrol',
       'fasting_blood_sugar', 'rest_ecg', 'thalach', 'exer_angina', 'old_peak',
       'slope', 'ca', 'thalassemia', 'target'],
      dtype='object')

#### Checking if there are missing values. If yes, we will handle them.

In [6]:
df.isnull().sum()
# No null values

age                    0
gender                 0
chest_pain             0
rest_bps               0
cholestrol             0
fasting_blood_sugar    0
rest_ecg               0
thalach                0
exer_angina            0
old_peak               0
slope                  0
ca                     0
thalassemia            0
target                 0
dtype: int64

* The dataset has 303 rows and 14 columns.
* <u>'old_peak'</u> is a float column; all other <u>features</u> are integers.
* There are no NaN values.
* No column has the `object` dtype, so no column contains stray text or special characters in place of numeric values.
* Decision trees are largely unaffected by outliers, so we do not treat them here.

In [7]:
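# Splitting the full dataset into X (independent variables) and Y (dependent variable).
# df.values returns one float array, which is why the class labels print as 0.0 and 1.0 later.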

X = df.values[:,0:-1]
Y = df.values[:,-1]

In [8]:
print("Shape of X:", X.shape)
print("Shape of Y:", Y.shape)

Shape of X: (303, 13)
Shape of Y: (303,)


The shapes confirm the split: X holds the 13 features and Y holds the target for all 303 rows.

#### Splitting the data into training and test sets

In [9]:
# Split the data into test and train
# We hold out 20% of the rows for testing (80%/20% split) because the dataset is small

#from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=10)
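
The split above does not stratify on the target, so the class ratio in the 61-row test set can drift from the ratio in the full data. A minimal sketch of a stratified split follows. It assumes synthetic stand-in data from `make_classification` so that it runs without `HeartDisease.csv`; the names `X_demo`, `y_demo`, `X_tr`, and so on are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease data: 303 rows, 13 features (an assumption).
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=10)

# stratify=y_demo keeps the class proportions nearly equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo
)

print("Train shape:", X_tr.shape, "Test shape:", X_te.shape)
print("Positive rate (full, train, test):",
      round(float(y_demo.mean()), 3),
      round(float(y_tr.mean()), 3),
      round(float(y_te.mean()), 3))
```

With real data the same call would take `X` and `Y` from the cell above and pass `stratify=Y`.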

### Running a Decision Tree model

In [10]:
# Creating a DecisionTree model object.

#from sklearn.tree import DecisionTreeClassifier
model_DecisionTree = DecisionTreeClassifier(criterion="gini", random_state=10)

# Fit the model on the training data and predict on the test data
model_DecisionTree.fit(X_train, Y_train)
Y_pred = model_DecisionTree.predict(X_test)
# print(Y_pred)
# print(list(zip(Y_test, Y_pred)))

In [11]:
# Generating accuracy scores and metrics.

#from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
print("Confusion Matrix: \n\n", confusion_matrix(Y_test, Y_pred))
print("\n Accuracy of the model: ", accuracy_score(Y_test, Y_pred))
print("\n\n Classification Report: \n\n", classification_report(Y_test, Y_pred))

Confusion Matrix: 

 [[24 11]
 [ 5 21]]

 Accuracy of the model:  0.7377049180327869


 Classification Report: 

               precision    recall  f1-score   support

         0.0       0.83      0.69      0.75        35
         1.0       0.66      0.81      0.72        26

    accuracy                           0.74        61
   macro avg       0.74      0.75      0.74        61
weighted avg       0.75      0.74      0.74        61
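
The figures in the classification report follow directly from the confusion matrix. As a worked check, using scikit-learn's convention that rows are true classes and columns are predicted classes, the class 1 metrics can be recomputed by hand:

```python
# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = [[24, 11],
      [5, 21]]

tn, fp = cm[0]   # true class 0: correctly rejected, falsely flagged
fn, tp = cm[1]   # true class 1: missed, correctly flagged

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision_1 = tp / (tp + fp)   # of all predicted "disease", the share that was right
recall_1 = tp / (tp + fn)      # of all true "disease", the share that was found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy    = {accuracy:.4f}")     # 0.7377
print(f"precision_1 = {precision_1:.2f}")  # 0.66
print(f"recall_1    = {recall_1:.2f}")     # 0.81
print(f"f1_1        = {f1_1:.2f}")         # 0.72
```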



In [12]:
model_DecisionTree.score(X_train, Y_train)

1.0
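
A training accuracy of 1.0 next to a test accuracy of about 0.74 suggests that the unconstrained tree has memorized the training set. The sketch below shows one way to inspect that gap by comparing an unconstrained tree with a depth-limited one. It runs on synthetic stand-in data, and all names in it are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y randomly reassigns a fraction of the labels,
# so a fully grown tree has noise to memorize.
X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, flip_y=0.1, random_state=10
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

gap_scores = {}
for depth in (None, 3):  # None = grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=10).fit(X_tr, y_tr)
    gap_scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"max_depth={depth}: train={gap_scores[depth][0]:.3f}, test={gap_scores[depth][1]:.3f}")
```

A large difference between the two numbers for `max_depth=None` is the pattern that motivates the pruning section below.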

In [13]:
print(list(zip(df.columns[0:-1],model_DecisionTree.feature_importances_)))

[('age', 0.04531587587888406), ('gender', 0.033898478491911924), ('chest_pain', 0.29176503457428227), ('rest_bps', 0.10160480431912085), ('cholestrol', 0.0773729378814243), ('fasting_blood_sugar', 0.0), ('rest_ecg', 0.0), ('thalach', 0.05844875340491687), ('exer_angina', 0.05994924448790483), ('old_peak', 0.09456106029755625), ('slope', 0.06516631481632049), ('ca', 0.16346600461348118), ('thalassemia', 0.008451491234197112)]


In [14]:
# We can build a DF to view the feature importance and sort for a better visual.

feature_imp = pd.DataFrame()
feature_imp["Feature"] = df.columns[:-1]
feature_imp["Importance"] = model_DecisionTree.feature_importances_ 

feature_imp.sort_values("Importance", ascending = False)

Unnamed: 0,Feature,Importance
2,chest_pain,0.291765
11,ca,0.163466
3,rest_bps,0.101605
9,old_peak,0.094561
4,cholestrol,0.077373
10,slope,0.065166
8,exer_angina,0.059949
7,thalach,0.058449
0,age,0.045316
1,gender,0.033898
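
Impurity-based importances such as `feature_importances_` are computed from the training data and can favor features with many distinct values. A hedged alternative is permutation importance measured on held-out data. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic stand-in data; the column names `f0` to `f12` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical feature names.
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=10)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

clf = DecisionTreeClassifier(max_depth=5, random_state=10).fit(X_tr, y_tr)

# Shuffle each test-set column n_repeats times and record the mean drop in accuracy.
result = permutation_importance(clf, X_te, y_te, n_repeats=20, random_state=10)

perm_imp = pd.DataFrame({
    "Feature": feature_names,
    "Impurity": clf.feature_importances_,
    "Permutation": result.importances_mean,
}).sort_values("Permutation", ascending=False)
print(perm_imp)
```

Features that rank high in both columns are the more trustworthy candidates when deciding what to keep.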


## Pruning

In [15]:
# Constrain tree growth with hyperparameters (pre-pruning) and compare accuracy with the base model.

pruned_DecisionTree = DecisionTreeClassifier(criterion="gini", 
                                            random_state=10,
                                           splitter="best",
                                           min_samples_leaf=4,
                                           min_samples_split=5,
                                           max_depth=5,
                                           #max_leaf_nodes=100,
                                           max_features=10
                                           ) 

# Fit the model on the training data and predict on the test data
pruned_DecisionTree.fit(X_train, Y_train)
Y_pred = pruned_DecisionTree.predict(X_test)

print("Confusion Matrix: \n\n", confusion_matrix(Y_test, Y_pred))
print("\n Accuracy of the model: ", accuracy_score(Y_test, Y_pred))
print("\n\n Classification Report: \n\n", classification_report(Y_test, Y_pred))

Confusion Matrix: 

 [[28  7]
 [ 3 23]]

 Accuracy of the model:  0.8360655737704918


 Classification Report: 

               precision    recall  f1-score   support

         0.0       0.90      0.80      0.85        35
         1.0       0.77      0.88      0.82        26

    accuracy                           0.84        61
   macro avg       0.83      0.84      0.83        61
weighted avg       0.85      0.84      0.84        61
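
The cell above limits tree growth up front (pre-pruning). scikit-learn also supports post-pruning through minimal cost-complexity pruning, controlled by `ccp_alpha`. The sketch below is one possible way to choose `ccp_alpha`, using cross-validation on the training split only. It assumes synthetic stand-in data, and every name in it is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=303, n_features=13, flip_y=0.1, random_state=10
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=10)

# Candidate alphas: the values at which the fully grown tree loses a subtree.
path = DecisionTreeClassifier(random_state=10).cost_complexity_pruning_path(X_tr, y_tr)
alphas = np.unique(path.ccp_alphas.clip(min=0.0))  # guard against tiny negative round-off

# Mean 5-fold cross-validated accuracy for each candidate alpha.
cv_means = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=10), X_tr, y_tr, cv=5
    ).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_means))]

post_pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=10).fit(X_tr, y_tr)
print("best ccp_alpha:", best_alpha)
print("leaves:", post_pruned.get_n_leaves(), "test accuracy:", post_pruned.score(X_te, y_te))
```

Because the test split is not used to pick `ccp_alpha`, the final test accuracy remains an honest estimate.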



In [16]:
# Using Entropy instead of gini

model_DecisionTree = DecisionTreeClassifier(criterion="entropy", 
                                            random_state=10,
                                           splitter="best",
                                           min_samples_leaf=4,
                                           min_samples_split=5,
                                           max_depth=5,
                                           #max_leaf_nodes=100,
                                           max_features=10
                                           ) 

# Fit the model on the training data and predict on the test data
model_DecisionTree.fit(X_train, Y_train)
Y_pred = model_DecisionTree.predict(X_test)

print("Confusion Matrix: \n\n", confusion_matrix(Y_test, Y_pred))
print("\n Accuracy of the model: ", accuracy_score(Y_test, Y_pred))
print("\n\n Classification Report: \n\n", classification_report(Y_test, Y_pred))

Confusion Matrix: 

 [[27  8]
 [ 5 21]]

 Accuracy of the model:  0.7868852459016393


 Classification Report: 

               precision    recall  f1-score   support

         0.0       0.84      0.77      0.81        35
         1.0       0.72      0.81      0.76        26

    accuracy                           0.79        61
   macro avg       0.78      0.79      0.78        61
weighted avg       0.79      0.79      0.79        61



In [17]:
# Predicting using the Random_Forest_Classifier

#from sklearn.ensemble import RandomForestClassifier

model_RandomForest = RandomForestClassifier(n_estimators=100,
                                           random_state=10,
                                           bootstrap=True)

# Fit the model on the training data and predict on the test data
model_RandomForest.fit(X_train, Y_train)

Y_pred = model_RandomForest.predict(X_test)

print("Confusion Matrix: \n\n", confusion_matrix(Y_test, Y_pred))
print("\n Accuracy of the model: ", accuracy_score(Y_test, Y_pred))
print("\n\n Classification Report: \n\n", classification_report(Y_test, Y_pred))

Confusion Matrix: 

 [[28  7]
 [ 5 21]]

 Accuracy of the model:  0.8032786885245902


 Classification Report: 

               precision    recall  f1-score   support

         0.0       0.85      0.80      0.82        35
         1.0       0.75      0.81      0.78        26

    accuracy                           0.80        61
   macro avg       0.80      0.80      0.80        61
weighted avg       0.81      0.80      0.80        61
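
Because the forest above is built with `bootstrap=True`, each tree is trained on a resample that leaves out part of the training rows. Those left-out rows can provide an out-of-bag accuracy estimate without touching the test set, which helps on a dataset this small. A minimal sketch, assuming synthetic stand-in data and hypothetical names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=10)

# oob_score=True scores each row using only the trees that did not see it in their bootstrap sample.
rf = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=10
)
rf.fit(X_demo, y_demo)

print("Out-of-bag accuracy estimate:", round(rf.oob_score_, 3))
```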



In [18]:
# Predicting using the Extra_Trees_Classifier

#from sklearn.ensemble import ExtraTreesClassifier

model_EXT = ExtraTreesClassifier(n_estimators=100,
                                 random_state=10,
                                 bootstrap=True)   # bootstrap=True is optional; ExtraTreesClassifier defaults to bootstrap=False

# Fit the model on the training data and predict on the test data
model_EXT.fit(X_train, Y_train)

Y_pred = model_EXT.predict(X_test)

print("Confusion Matrix: \n\n", confusion_matrix(Y_test, Y_pred))
print("\n Accuracy of the model: ", accuracy_score(Y_test, Y_pred))
print("\n\n Classification Report: \n\n", classification_report(Y_test, Y_pred))

Confusion Matrix: 

 [[29  6]
 [ 4 22]]

 Accuracy of the model:  0.8360655737704918


 Classification Report: 

               precision    recall  f1-score   support

         0.0       0.88      0.83      0.85        35
         1.0       0.79      0.85      0.81        26

    accuracy                           0.84        61
   macro avg       0.83      0.84      0.83        61
weighted avg       0.84      0.84      0.84        61



<head>
	<title>Metrics Summary</title> 
	<style>
		table td {
			text-align:center;
		}
	</style>
</head>
<body>
	<table>
		<thead>
			<tr>
                <th><u>Metrics</u></th>
                <th><u>Classes</u></th>
				<th>Base DecisionTree</th>
				<th>After Pruning</th>
				<th>Using Entropy</th>
				<th>RandomForest</th>
				<th>ExtraTrees</th>
			</tr>
		</thead>
		<tbody>
			<tr>
				<td>Accuracy</td>
				<td>-</td>
				<td>73.77%</td>
				<td>83.61%</td>
				<td>78.69%</td>
				<td>80.33%</td>
				<td>83.61%</td>
			</tr>
			<tr>
				<td>Precision</td>
				<td>class 0</td>
				<td>0.83</td>
				<td>0.90</td>
				<td>0.84</td>
				<td>0.85</td>
				<td>0.88</td>
			</tr>
			<tr>
				<td>-</td>
				<td>class 1</td>
				<td>0.66</td>
				<td>0.77</td>
				<td>0.72</td>
				<td>0.75</td>
				<td>0.79</td>
			</tr>
			<tr>
				<td>Recall</td>
				<td>class 0</td>
				<td>0.69</td>
				<td>0.80</td>
				<td>0.77</td>
				<td>0.80</td>
				<td>0.83</td>
			</tr>
			<tr>
				<td>-</td>
				<td>class 1</td>
				<td>0.81</td>
				<td>0.88</td>
				<td>0.81</td>
				<td>0.81</td>
				<td>0.85</td>
			</tr>
			<tr>
				<td>F1-Score</td>
				<td>class 0</td>
				<td>0.75</td>
				<td>0.85</td>
				<td>0.81</td>
				<td>0.82</td>
				<td>0.85</td>
			</tr>
			<tr>
				<td>-</td>
				<td>class 1</td>
				<td>0.72</td>
				<td>0.82</td>
				<td>0.76</td>
				<td>0.78</td>
				<td>0.81</td>
			</tr>
		</tbody>
	</table>
</body>

* Our pruned DecisionTree model and the ExtraTrees model share the highest accuracy (83.6%).
* The F1-score for class 1 is slightly higher for the pruned DecisionTree model (0.82 vs. 0.81), as is its class 1 recall (0.88 vs. 0.85).
* On these metrics the pruned model performs at least as well as the alternatives, so we use it for the predictions below.
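
The test set has only 61 rows, so one misclassified patient shifts accuracy by about 1.6 percentage points, and the ranking above depends on the particular split. The sketch below shows a more stable comparison using stratified 5-fold cross-validation. The data is a synthetic stand-in, the variable names are hypothetical, and the model settings mirror the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=10)

models = {
    "Base DecisionTree": DecisionTreeClassifier(random_state=10),
    "Pruned DecisionTree": DecisionTreeClassifier(
        min_samples_leaf=4, min_samples_split=5, max_depth=5,
        max_features=10, random_state=10,
    ),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=10),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=100, bootstrap=True, random_state=10),
}

# Every model is scored on the same five stratified folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=10)
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name:20s} mean={scores.mean():.3f}  std={scores.std():.3f}")
```

If the standard deviations overlap the gaps between means, the models cannot be separated on accuracy alone.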

In [19]:
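# Predict on the full dataset. X includes the 242 rows the model was trained on,
# so the metrics below are optimistic and do not estimate performance on unseen patients.
Y_pred = pruned_DecisionTree.predict(X)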

print("Confusion Matrix: \n\n", confusion_matrix(Y, Y_pred))
print("\n Accuracy of the model: ", accuracy_score(Y, Y_pred))
print("\n\n Classification Report: \n\n", classification_report(Y, Y_pred))

Confusion Matrix: 

 [[114  24]
 [ 17 148]]

 Accuracy of the model:  0.8646864686468647


 Classification Report: 

               precision    recall  f1-score   support

         0.0       0.87      0.83      0.85       138
         1.0       0.86      0.90      0.88       165

    accuracy                           0.86       303
   macro avg       0.87      0.86      0.86       303
weighted avg       0.86      0.86      0.86       303



In [20]:
#from sklearn import tree
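# Export the pruned tree in Graphviz DOT format.
# export_graphviz writes to out_file and returns None, so its result is not assigned.
with open(r"model_DecisionTree.txt", "w") as f:
    tree.export_graphviz(pruned_DecisionTree, feature_names=list(df.columns[:-1]), out_file=f)

# Paste the contents of the generated file into webgraphviz.com to plot the decision tree.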
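
If uploading the DOT file to an external site is not convenient, the tree's rules can also be printed as plain text with `sklearn.tree.export_text`. A minimal sketch on synthetic stand-in data, with hypothetical feature names:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=10)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]

clf = DecisionTreeClassifier(max_depth=3, random_state=10).fit(X_demo, y_demo)

# export_text returns the decision rules as an indented string, one line per node.
rules = export_text(clf, feature_names=feature_names)
print(rules)
```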

We will now create a dataframe that holds the dataset together with its predicted values and write it to an Excel file.

In [21]:
final_df = pd.read_csv(r'HeartDisease.csv', header=0)
Y_pred = Y_pred.astype(int)
final_df['Pred'] = Y_pred

final_df.head()

Unnamed: 0,age,gender,chest_pain,rest_bps,cholestrol,fasting_blood_sugar,rest_ecg,thalach,exer_angina,old_peak,slope,ca,thalassemia,target,Pred
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1,1


In [22]:
final_df.to_excel("HeartDisease Predicted.xlsx", header = True)

# End of Project