**TEAMID PNT2022TMID30384**

***Visualizing and predicting Heart disease with interactive dashboard***

***Heart Disease Prediction Using Machine Learning Approach***

Heart disease (HD) is a major cause of mortality in modern society. Medical diagnosis is an extremely important but complicated task that should be performed accurately and efficiently.

Cardiovascular disease is difficult to detect due to several risk factors, including high blood pressure, cholesterol, and an abnormal pulse rate.
In this machine learning project, we use a dataset collected from Kaggle (https://www.kaggle.com/search?q=heart+disease+prediction) and apply Machine Learning to predict whether a person is suffering from Heart Disease or not.

Problem Statement

* Complete analysis of the Heart Disease Kaggle dataset
* Predict whether a person has heart disease or not based on various biological and physical parameters

Machine Learning Algorithms

* Random Forest Classifier
* K-Nearest Neighbors Classifier
* Decision Tree Classifier
* Naive Bayes Classifier

**Import libraries**

Let's first import all the necessary libraries. We will use numpy and pandas to start with. For visualization, we will use the pyplot subpackage of matplotlib, rcParams to style the plots, rainbow for colors, and seaborn. For implementing the Machine Learning models and processing the data, we will use the sklearn library.



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib.cm import rainbow
import seaborn as sns
%matplotlib inline

For processing the data, we'll import a few libraries. To split the available dataset for testing and training, we'll use the train_test_split method.

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
from warnings import filterwarnings
filterwarnings("ignore")
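
As a rough illustration of how train_test_split and StandardScaler are commonly combined, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo`, the 80/20 split and the `random_state` values are assumptions made for this example, not settings taken from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart dataset (assumption): 100 rows, 3 numeric features.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[50, 130, 240], scale=[9, 17, 50], size=(100, 3))
y_demo = rng.integers(1, 3, size=100)  # labels 1 and 2

# Hold out 20% of the rows for testing; stratify keeps the class ratio similar.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training split only, then apply it to both splits,
# so no information from the test rows leaks into the scaling parameters.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training rows alone is a common precaution; whether scaling helps depends on the model (distance-based models such as K-Nearest Neighbors usually benefit, tree-based models usually do not need it).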

In [4]:
#model validation
from sklearn.metrics import log_loss,roc_auc_score,precision_score,f1_score,recall_score,roc_curve,auc,plot_roc_curve
from sklearn.metrics import classification_report, confusion_matrix,accuracy_score,fbeta_score,matthews_corrcoef
from sklearn import metrics
from mlxtend.plotting import plot_confusion_matrix

For model validation, we'll import a few libraries
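
To show what a few of these metrics report, here is a small sketch on hand-made labels. The lists `y_true` and `y_pred` are made up for illustration and use the codes 1 and 2 for the two classes.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and predictions (1 and 2 are the two class codes).
y_true = [1, 1, 2, 2, 2, 1, 2, 1]
y_pred = [1, 2, 2, 2, 1, 1, 2, 1]

# Fraction of predictions that match the true label.
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2])

print("accuracy:", acc)
print(cm)
print(classification_report(y_true, y_pred, labels=[1, 2]))
```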

In [5]:
#extra
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectFwe, f_regression

Next, we will import all the Machine Learning algorithms

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
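
The sketch below shows one way the four classifiers could be fitted and compared. It runs on synthetic data from `make_classification`, and the hyperparameters (`n_estimators`, `n_neighbors`, `random_state`) are illustrative assumptions, not values from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with the same rough size as the heart dataset (assumption).
X_demo, y_demo = make_classification(
    n_samples=270, n_features=13, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Illustrative settings only; tune them on the real data.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the training split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```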

**Import Dataset**

In [8]:
dataset = pd.read_csv('HeartDataset.csv',sep=',',encoding="utf-8")


**Data Preparation and Exploration**

In [9]:
type(dataset)


In [None]:
dataset.shape

In [None]:
dataset.info()

In [None]:
dataset.columns

In [None]:
dataset.describe()

In [None]:
dataset

In [None]:
dataset.isnull().sum()

In [None]:
dataset.apply(lambda x:len(x.unique()))

In [None]:
print('Chest pain type',dataset['Chest pain type'].unique())
print('FBS over 120',dataset['FBS over 120'].unique())
print('EKG results ',dataset['EKG results'].unique())
print('Exercise angina',dataset['Exercise angina'].unique())
print('Slope of ST',dataset['Slope of ST'].unique())
print('Number of vessels fluro',dataset['Number of vessels fluro'].unique())
print(' Thallium',dataset['Thallium'].unique())

**Dataset Description**

This dataset consists of 13 features and a target variable, HeartDisease. A detailed description of each column follows:
1. Age: Patient's age in years (Numeric)
2. Sex: Gender of the patient (Male - 1, Female - 0) (Nominal)
3. Chest pain type: Type of chest pain experienced by the patient, categorized into: (Nominal)
* Value 1: Typical angina
* Value 2: Atypical angina
* Value 3: Non-anginal pain
* Value 4: Asymptomatic
(Angina: Angina is caused when there is not enough oxygen-rich blood flowing to a certain part of the heart. The arteries of the heart become narrow due to fatty deposits in the artery walls. The narrowing of arteries means that blood supply to the heart is reduced, causing angina.)
4. BP: Resting blood pressure in mm Hg (Numeric)
5. Cholesterol: Serum cholesterol in mg/dl (Numeric)
(Cholesterol is a fatty substance carried in the blood; high levels can build up in the walls of blood vessels and restrict the blood supply.)
6. FBS over 120: Fasting blood sugar > 120 mg/dl, encoded as 1 if true and 0 if false (Nominal)
(Fasting blood sugar is measured after a long gap between a meal and the test. Typically, it's taken before any meal in the morning.)
7. EKG results: Resting electrocardiographic results (Nominal)
* Value 0: Normal
* Value 1: Having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
* Value 2: Showing probable or definite left ventricular hypertrophy by Estes' criteria.
8. Max HR: Maximum heart rate achieved (Numeric)
9. Exercise angina: Exercise-induced angina (1 = yes; 0 = no) (Nominal)
(Chest pain while exercising or doing any physical activity.)
10. ST depression: Exercise-induced ST depression in comparison with the state of rest (Numeric)
(ST depression is the difference between the ECG value at rest and after exercise. An electrocardiogram records the electrical signals in your heart. It's a common and painless test used to quickly detect heart problems and monitor your heart's health. Electrocardiograms, also called ECGs or EKGs, are often done in a doctor's office, a clinic or a hospital room. ECG machines are standard equipment in operating rooms and ambulances. Some personal devices, such as smartwatches, can also record an ECG.)
11. Slope of ST: Slope of the ST segment during peak exercise (Nominal)
* Value 1: Upsloping
* Value 2: Flat
* Value 3: Downsloping
12. Number of vessels fluro: Number of major blood vessels (0-3) colored by fluoroscopy (Numeric)
(Fluoroscopy is an imaging technique that uses X-rays to obtain real-time moving images of the interior of an object. In its primary application of medical imaging, a fluoroscope allows a physician to see the internal structure and function of a patient, so that the pumping action of the heart or the motion of swallowing, for example, can be watched.)
13. Thallium: Result of the thallium stress test (Nominal)
* Value 3: normal
* Value 6: fixed defect
* Value 7: reversible defect
14. HeartDisease: The target variable that we have to predict. 2 means the patient is suffering from heart disease and 1 means the patient is normal.
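
The coded values described above can be turned into readable labels with a dictionary per column. The following is a minimal sketch on a three-row toy frame; the `demo` rows are invented for illustration, while the column names and codes follow the description in this section.

```python
import pandas as pd

# Code-to-label dictionaries taken from the feature description.
code_labels = {
    "Sex": {0: "Female", 1: "Male"},
    "Chest pain type": {
        1: "Typical angina",
        2: "Atypical angina",
        3: "Non-anginal pain",
        4: "Asymptomatic",
    },
    "Slope of ST": {1: "Upsloping", 2: "Flat", 3: "Downsloping"},
    "Thallium": {3: "Normal", 6: "Fixed defect", 7: "Reversible defect"},
}

# Toy rows (hypothetical patients) using the same column names as the dataset.
demo = pd.DataFrame({
    "Sex": [1, 0, 1],
    "Chest pain type": [4, 3, 2],
    "Slope of ST": [2, 1, 3],
    "Thallium": [3, 7, 6],
})

# Replace each code with its label, leaving the original frame untouched.
readable = demo.copy()
for column, mapping in code_labels.items():
    readable[column] = readable[column].map(mapping)

print(readable)
```

Keeping the numeric codes in `dataset` and mapping only for display avoids changing the inputs that the models are trained on.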

Data Visualization

Now let's look at various visual representations of the data to understand more about the relationships between the features.

Distribution of Heart disease

It's good practice to work with a dataset where the target classes are of approximately equal size, so let's check for that.

In [None]:
fig, ax1 = plt.subplots(nrows=1, ncols=1, figsize=(14, 6))

# Labels follow the value_counts() order (most frequent class first)
dataset['HeartDisease'].value_counts().plot.pie(
    autopct="%1.0f%%", labels=["Heart Disease", "Normal"], startangle=60, ax=ax1)
ax1.set(title='Percentage of Heart disease patients in Dataset')
plt.show()

In [None]:
y = dataset["HeartDisease"]

In [None]:
rcParams['figure.figsize'] = 8,6
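# Sort by class label so that each bar is paired with its own count
class_counts = dataset['HeartDisease'].value_counts().sort_index()
plt.bar(class_counts.index, class_counts.values, color = ['blue', 'green'])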
plt.xticks([1, 2])
plt.xlabel('Target Classes (1 =no disease; 2 = disease)')
plt.ylabel('Samples')
plt.title('Count of each Target Class')
HeartDisease_temp = dataset.HeartDisease.value_counts()
print(HeartDisease_temp)
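
The class shares can also be computed numerically. This sketch rebuilds a toy target column from the counts quoted in this section (270 patients, 150 of them in class 2, which leaves 120 in class 1); the Series `target` is a stand-in for `dataset['HeartDisease']`, not the real column.

```python
import pandas as pd

# Toy target column reconstructed from the quoted counts (assumption: 150 + 120 = 270).
target = pd.Series([2] * 150 + [1] * 120, name="HeartDisease")

# Count per class, ordered by class label, and each class's share in percent.
counts = target.value_counts().sort_index()
shares = (counts / counts.sum() * 100).round(1)

summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```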

From the total dataset of 270 patients, 150 (56%) have heart disease (target=2).
Next, we'll take a look at the histograms for each variable.

In [None]:
dataset.hist(edgecolor='black',layout = (7, 2),
            figsize = (10, 30),
            color=['purple'])

Exploratory Data Analysis (EDA)

Gender distribution based on heart disease

In [None]:
dataset["Sex"].unique()

In [None]:
# Number of males and females
F = dataset[dataset["Sex"] == 0].count()["HeartDisease"]
M = dataset[dataset["Sex"] == 1].count()["HeartDisease"]

# Create a plot
figure, ax = plt.subplots(figsize = (6, 4))
ax.bar(x = ['Female', 'Male'], height = [F, M])
plt.xlabel('Gender')
plt.title('Number of Males and Females in the dataset')
plt.show()

**Heart Disease frequency for gender**

In [None]:
pd.crosstab(dataset.Sex,dataset.HeartDisease).plot(kind="bar",figsize=(20,10),color=['blue','#AA1111' ])
plt.title('Heart Disease Frequency for Sex')
plt.xlabel('Sex (0 = Female, 1 = Male)')
plt.xticks(rotation=0)
plt.legend(["Don't have Disease", "Have Disease"])
plt.ylabel('Frequency')
plt.show()
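
Raw frequencies can be hard to compare when the groups differ in size. As a sketch, `pd.crosstab` with `normalize="index"` gives the share of each target class within each sex. The ten rows in `demo` are invented for illustration.

```python
import pandas as pd

# Hypothetical rows: Sex (0 = Female, 1 = Male), HeartDisease (1 = no disease, 2 = disease).
demo = pd.DataFrame({
    "Sex":          [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "HeartDisease": [1, 1, 1, 2, 1, 2, 2, 2, 1, 2],
})

# Absolute counts per (Sex, HeartDisease) pair.
counts = pd.crosstab(demo["Sex"], demo["HeartDisease"])

# Row-wise proportions: each row sums to 1.
rates = pd.crosstab(demo["Sex"], demo["HeartDisease"], normalize="index")

print(counts)
print(rates.round(2))
```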

In [None]:
countFemale = len(dataset[dataset.Sex == 0])
countMale = len(dataset[dataset.Sex == 1])
print("Percentage of Female Patients:{:.2f}%".format((countFemale)/(len(dataset.Sex))*100))
print("Percentage of Male Patients:{:.2f}%".format((countMale)/(len(dataset.Sex))*100))

Age distribution based on heart disease

In [None]:
# Display age distribution based on heart disease
sns.distplot(dataset[dataset['HeartDisease'] == 1]['Age'], label='Do not have heart disease')
sns.distplot(dataset[dataset['HeartDisease'] == 2]['Age'], label = 'Have heart disease')
plt.xlabel('Age')
plt.ylabel('Density')
plt.title('Age Distribution based on Heart Disease')
plt.legend()
plt.show()

In [None]:
# HeartDisease is coded 1 = no disease, 2 = disease (same coding as the plot above)
no_disease_age = dataset[dataset['HeartDisease'] == 1]['Age']
print('Min age of people who do not have heart disease: ', no_disease_age.min())
print('Max age of people who do not have heart disease: ', no_disease_age.max())
print('Average age of people who do not have heart disease: ', no_disease_age.mean())

In [None]:
disease_age = dataset[dataset['HeartDisease'] == 2]['Age']
print('Min age of people who have heart disease: ', disease_age.min())
print('Max age of people who have heart disease: ', disease_age.max())
print('Average age of people who have heart disease: ', disease_age.mean())
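
The same per-class minimum, maximum and mean can be obtained in one step with `groupby` and `agg`. A small sketch on invented ages (the `demo` frame is hypothetical):

```python
import pandas as pd

# Hypothetical ages; HeartDisease uses the codes 1 (no disease) and 2 (disease).
demo = pd.DataFrame({
    "Age":          [45, 52, 61, 58, 67, 49],
    "HeartDisease": [1, 1, 1, 2, 2, 2],
})

# One row per target class, one column per statistic.
age_stats = demo.groupby("HeartDisease")["Age"].agg(["min", "max", "mean"])
print(age_stats)
```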

**Heart Disease Frequency for ages**

In [None]:
pd.crosstab(dataset.Age,dataset.HeartDisease).plot(kind="bar",figsize=(20,6))
plt.title('Heart Disease Frequency for Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.savefig('heartDiseaseAndAges.png')
plt.show()

In [None]:
plt.figure(figsize=(12, 10))
dataset.Age.hist(bins=80)

In [None]:
print(f"The mean age of the patients is: {dataset.Age.mean()}")

**Distribution of Categorical features**

In [None]:
# Labels are listed in ascending order of the coded values (0, 1, ...),
# which is the order in which they are applied to the legend entries
categorial = [('Sex', ['female', 'male']), 
              ('Chest pain type', ['typical angina', 'atypical angina', 'non-anginal pain', 'asymptomatic']), 
              ('FBS over 120', ['fbs <= 120 mg/dl', 'fbs > 120 mg/dl']), 
              ('EKG results', ['normal', 'ST-T wave', 'left ventricular']), 
              ('Exercise angina', ['no', 'yes']), 
              ('Slope of ST', ['upsloping', 'flat', 'downsloping']), 
              ('Thallium', ['normal', 'fixed defect', 'reversible defect'])]

In [None]:
def plotGrid(isCategorial):
    # Draw one row of plots per attribute in the selected list
    if isCategorial:
        for i, x in enumerate(categorial):
            plotCategorial(x[0], x[1], i)
    else:
        for i, x in enumerate(continuous):
            plotContinuous(x[0], x[1], i)

In [None]:
def plotCategorial(attribute, labels, ax_index):
    sns.countplot(x=attribute, data=dataset, ax=axes[ax_index][0])
    sns.countplot(x='HeartDisease', hue=attribute, data=dataset, ax=axes[ax_index][1])
    avg = dataset[[attribute, 'HeartDisease']].groupby([attribute], as_index=False).mean()
    sns.barplot(x=attribute, y='HeartDisease', hue=attribute, data=avg, ax=axes[ax_index][2])
    
    for t, l in zip(axes[ax_index][1].get_legend().texts, labels):
        t.set_text(l)
    for t, l in zip(axes[ax_index][2].get_legend().texts, labels):
        t.set_text(l)

In [None]:
fig_categorial, axes = plt.subplots(nrows=len(categorial), ncols=3, figsize=(15, 30))

plotGrid(True)

**Distribution of Continuous features**

In [None]:
continuous = [('BP', 'blood pressure in mm Hg'), 
              ('Cholesterol', 'serum cholesterol in mg/dl'), 
              ('Max HR', 'maximum heart rate achieved'), 
              ('ST depression', 'ST depression by exercise relative to rest'), 
              ('Number of vessels fluro', '# major vessels: (0-3) colored by fluoroscopy')]

In [None]:
def plotContinuous(attribute, xlabel, ax_index):
    sns.distplot(dataset[[attribute]], ax=axes[ax_index][0])
    axes[ax_index][0].set(xlabel=xlabel, ylabel='density')
    sns.violinplot(x='HeartDisease', y=attribute, data=dataset, ax=axes[ax_index][1])

In [None]:
fig_continuous, axes = plt.subplots(nrows=len(continuous), ncols=2, figsize=(15, 22))

plotGrid(isCategorial=False)

**PiePlots**

In [None]:
fig, ax = plt.subplots(4,2, figsize = (14,14))
((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8)) = ax

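# Derive labels from the value_counts() index so each slice gets its own name
sex_counts = dataset['Sex'].value_counts()
labels = sex_counts.index.map({0: "Female", 1: "Male"}).tolist()
values = sex_counts.tolist()
ax1.pie(x=values, labels=labels, autopct="%1.1f%%",colors=['#AAb3ff','#CC80FF'],shadow=True, startangle=45,explode=[0.1, 0.1])
ax1.set_title("Sex", fontdict={'fontsize': 12},fontweight ='bold')

cp_names = {1: "Typical angina", 2: "Atypical angina", 3: "Non-anginal pain", 4: "Asymptomatic"}
cp_counts = dataset['Chest pain type'].value_counts()
labels = cp_counts.index.map(cp_names).tolist()
values = cp_counts.tolist()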
ax2.pie(x=values, labels=labels, autopct="%1.1f%%",colors=['#AAb3ff','#CC80FF','#DD00AA','#FF0099'],shadow=True,startangle=45,explode=[0.1, 0.1, 0.1, 0.2])
ax2.set_title("Chest Pain", fontdict={'fontsize': 12},fontweight ='bold')

labels = dataset['FBS over 120'].value_counts().index.tolist()[:2]
values = dataset['FBS over 120'].value_counts().tolist()
ax3.pie(x=values, labels=labels, autopct="%1.1f%%",colors=['#AAb3ff','#CC80FF'],shadow=True, startangle=45,explode=[0.1, 0.15])
ax3.set_title("Fasting Blood Sugar", fontdict={'fontsize': 12},fontweight ='bold')

labels = dataset['EKG results'].value_counts().index.tolist()[:3]
values = dataset['EKG results'].value_counts().tolist()
ax4.pie(x=values, labels=labels, autopct="%1.1f%%", colors=['#AAb3ff','#CC80FF','#DD00AA'],shadow=True,startangle=45,explode=[ 0.05, 0.05, 0.05])
ax4.set_title("Resting ECG Results", fontdict={'fontsize': 12},fontweight ='bold')

labels = dataset['Exercise angina'].value_counts().index.tolist()[:2]
values = dataset['Exercise angina'].value_counts().tolist()
ax5.pie(x=values, labels=labels, autopct="%1.1f%%", colors=['#AAb3ff','#CC80FF'],shadow=True, startangle=45,explode=[0.1, 0.1])
ax5.set_title("Exercise induced Angina", fontdict={'fontsize': 12},fontweight ='bold')
labels = dataset['Slope of ST'].value_counts().index.tolist()[:3]
values = dataset['Slope of ST'].value_counts().tolist()
ax6.pie(x=values, labels=labels, autopct="%1.1f%%", colors=['#AAb3ff','#CC80FF','#DD00AA'],shadow=True,startangle=45,explode=[  0.1, 0.1, 0.1])
ax6.set_title("Peak exercise ST_segment Slope", fontdict={'fontsize': 12},fontweight ='bold')

labels = dataset['Number of vessels fluro'].value_counts().index.tolist()[:4]
values = dataset['Number of vessels fluro'].value_counts().tolist()
ax7.pie(x=values, labels=labels, autopct="%1.1f%%", shadow=True, startangle=45,explode=[0.05, 0.07, 0.1, 0.1],colors=['#AAb3ff','#CC80FF','#DD00AA','#FF0099'])
ax7.set_title("Major vessels", fontdict={'fontsize': 12},fontweight ='bold')


labels = dataset['Thallium'].value_counts().index.tolist()[:3]
values = dataset['Thallium'].value_counts().tolist()
ax8.pie(x=values, labels=labels, autopct="%1.1f%%", shadow=True, startangle=45,explode=[0.1, 0.1, 0.1],colors=['#AAb3ff','#CC80FF','#DD00AA'])
ax8.set_title("Thallium Stress Test", fontdict={'fontsize': 12},fontweight ='bold')

plt.tight_layout()
# Save before plt.show(); saving afterwards can write a blank image
plt.savefig("PiePlots.png")
plt.show()

**Target Correlation**

In [None]:
plt.figure(figsize=(10,10))
# Correlation of every feature with the target, sorted, with the target's own row removed
target_corr = dataset.corr()[['HeartDisease']].drop('HeartDisease').sort_values(by='HeartDisease')
sns.heatmap(target_corr, annot=True, cmap='twilight')
plt.savefig("HeartDiseaseCorrelations.png")
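
What the heatmap displays is the Pearson correlation of each feature with the target. A minimal numeric sketch on invented values (the `demo` frame is hypothetical and only borrows the column names):

```python
import pandas as pd

# Hypothetical values: older patients and lower maximum heart rates in class 2.
demo = pd.DataFrame({
    "Age":          [40, 45, 50, 55, 60, 65],
    "Max HR":       [180, 170, 165, 150, 140, 130],
    "HeartDisease": [1, 1, 1, 2, 2, 2],
})

# Correlation of every feature with the target, without the target itself,
# sorted from most negative to most positive.
target_corr = demo.corr()["HeartDisease"].drop("HeartDisease").sort_values()
print(target_corr)
```

Correlation only captures linear association, and for coded nominal features (such as chest pain type) the sign and size depend on how the codes happen to be ordered, so these values are best read as a rough screen.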

**Feature Importance**

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

X = dataset.drop('HeartDisease',axis=1)
Y = dataset['HeartDisease']

# chi2 scores each (non-negative) feature against the target; k='all' keeps every feature
fs = SelectKBest(score_func=chi2, k='all')
fs.fit(X, Y)
# Each feature's share of the total chi2 score, in percent
per = (fs.scores_ / fs.scores_.sum() * 100).round(3)

features_data = pd.DataFrame({'Feature':X.columns,'Scores':fs.scores_,'Importance (%)':per}).sort_values(by=['Scores'],ascending=False)

plt.figure(figsize=(9,4))
sns.barplot(x='Importance (%)', y='Feature', orient='h', data=features_data, palette='twilight_shifted_r')
plt.savefig("FeatureImportance.png")

insignificant = features_data.loc[features_data['Importance (%)']<0.005]['Feature'].unique()
features_data = features_data.set_index('Feature')
features_data
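
To make the scoring step concrete, here is a small sketch of chi2-based ranking on synthetic data. The two columns, `informative` and `noise`, are invented for this example: the first is built from the label and the second is random.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(1)
n = 200
y_demo = rng.integers(1, 3, size=n)  # labels 1 and 2

# Hypothetical features; chi2 requires all values to be non-negative.
X_demo = pd.DataFrame({
    "informative": y_demo * 3 + rng.integers(0, 2, size=n),  # tracks the label
    "noise": rng.integers(0, 5, size=n),                     # unrelated to the label
})

# Score every feature against the target; k="all" keeps all of them.
selector = SelectKBest(score_func=chi2, k="all").fit(X_demo, y_demo)

# Express each score as a share of the total, highest first.
share = selector.scores_ / selector.scores_.sum() * 100
ranking = pd.Series(share, index=X_demo.columns).sort_values(ascending=False)
print(ranking.round(2))
```

The chi2 score depends on the scale of each feature, so a column with large values (for example cholesterol) can receive a large score partly because of its magnitude. The percentages are therefore a relative ranking, not an absolute measure of importance.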

Analysing Fasting Blood sugar (FBS)

Heart disease according to Fasting Blood sugar

In [None]:
# Display fasting blood sugar in bar chart
dataset.groupby(dataset['FBS over 120']).count()['HeartDisease'].plot(kind = 'bar', title = 'Fasting Blood Sugar', figsize = (8, 6))
plt.xticks(np.arange(2), ('fbs < 120 mg/dl', 'fbs > 120 mg/dl'), rotation = 0)
plt.show()

**Display fasting blood sugar based on the target**

In [None]:
pd.crosstab(dataset['FBS over 120'], dataset['HeartDisease']).plot(kind = "bar", figsize = (8, 6))
plt.title('Heart Disease Frequency According to Fasting Blood Sugar')
plt.xlabel('Fasting Blood Sugar')
plt.xticks(np.arange(2), ('fbs < 120 mg/dl', 'fbs > 120 mg/dl'), rotation = 0)
plt.ylabel('Frequency')
plt.show()

**Analysing the Chest Pain (cp) (4 types of chest pain)**

[Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic]

In [None]:
dataset["Chest pain type"].unique()

In [None]:
plt.figure(figsize=(26, 10))
sns.barplot(x=dataset["Chest pain type"], y=y)

In [None]:
pd.crosstab(dataset['Chest pain type'],dataset['HeartDisease']).plot(kind = "bar", figsize = (8, 6))
plt.title('Heart Disease Frequency According to Chest Pain Type')
plt.xlabel('Chest Pain Type')
plt.xticks(np.arange(4), ('typical angina', 'atypical angina', 'non-anginal pain', 'asymptomatic'), rotation = 0)
plt.ylabel('Frequency')
plt.show()

Analysing Resting Blood Pressure

In [None]:
dataset["BP"].unique()

In [None]:
plt.figure(figsize=(26, 10))
sns.barplot(x=dataset["BP"], y=y)

In [None]:
fig, (axis1, axis2) = plt.subplots(1, 2,figsize=(25, 5))
# HeartDisease is coded 1 = no disease, 2 = disease
ax = sns.distplot(dataset[dataset['HeartDisease'] == 1]['BP'], label='Do not have heart disease', ax = axis1)
ax.set(xlabel='People Do Not Have Heart Disease')
ax = sns.distplot(dataset[dataset['HeartDisease'] == 2]['BP'], label = 'Have heart disease', ax = axis2)
ax.set(xlabel='People Have Heart Disease')
plt.show()

In [None]:
# Get min, max and average blood pressure of the people who do not have heart disease (HeartDisease == 1)
no_disease_bp = dataset[dataset['HeartDisease'] == 1]['BP']
print('Min blood pressure of people who do not have heart disease: ', no_disease_bp.min())
print('Max blood pressure of people who do not have heart disease: ', no_disease_bp.max())
print('Average blood pressure of people who do not have heart disease: ', no_disease_bp.mean())

In [None]:
# Get min, max and average blood pressure of the people who have heart disease (HeartDisease == 2)
disease_bp = dataset[dataset['HeartDisease'] == 2]['BP']
print('Min blood pressure of people who have heart disease: ', disease_bp.min())
print('Max blood pressure of people who have heart disease: ', disease_bp.max())
print('Average blood pressure of people who have heart disease: ', disease_bp.mean())

**Analysing the Resting Electrocardiographic Measurement [restecg]**

(0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)

In [None]:
dataset["EKG results"].unique()

In [None]:
# Display electrocardiographic results in bar chart
dataset.groupby(dataset['EKG results']).count()['HeartDisease'].plot(kind = 'bar', title = 'Resting Electrocardiographic Results', figsize = (8, 6))
plt.xticks(np.arange(3), ('normal', 'ST-T wave abnormality', 'probable or left ventricular hypertrophy'))
plt.show()

In [None]:
# Display resting electrocardiographic results based on the target
pd.crosstab(dataset['EKG results'],dataset['HeartDisease']).plot(kind = "bar", figsize = (8, 6))
plt.title('Heart Disease Frequency According to Resting Electrocardiographic Results')
plt.xticks(np.arange(3), ('normal', 'ST-T wave abnormality', 'probable or left ventricular hypertrophy'))
plt.xlabel('Resting Electrocardiographic Results')
plt.ylabel('Frequency')
plt.show()

People who do not have heart disease usually have normal electrocardiographic results, whereas people who have heart disease more often show probable or definite left ventricular hypertrophy.

Analysing Exercise Induced Angina [exang]

(1 = yes; 0 = no)

In [None]:
dataset["Exercise angina"].unique()

In [None]:
# Display exercise induced angina in bar chart
dataset.groupby(dataset['Exercise angina']).count()['HeartDisease'].plot(kind = 'bar', title = 'Exercise Induced Angina',  figsize = (8, 6))
plt.xticks(np.arange(2), ('No', 'Yes'), rotation = 0)
plt.show()

**Display exercise induced angina based on the target**

In [None]:
pd.crosstab(dataset['Exercise angina'],dataset['HeartDisease']).plot(kind = "bar", figsize = (8, 6))
plt.title('Heart Disease Frequency According to Exercise Induced Angina')
plt.xlabel('Exercise Induced Angina')
plt.xticks(np.arange(2), ('No', 'Yes'), rotation = 0)
plt.ylabel('Frequency')
plt.show()

People who suffer from exercise-induced angina are more likely to have heart disease.

Analysing the Slope of the peak exercise ST segment [slope]

(Value 1: upsloping, Value 2: flat, Value 3: downsloping)

In [None]:
dataset["Slope of ST"].unique()

In [None]:
# Display slope of the peak exercise ST segment in bar chart
dataset.groupby(dataset['Slope of ST']).count()['HeartDisease'].plot(kind = 'bar', title = 'Slope of the Peak Exercise ST Segment', figsize = (8, 6))
plt.xticks(np.arange(3), ('upsloping', 'flat', 'downsloping'), rotation = 0)
plt.show()