TEAMID : PNT2022TMID30384

Visualizing and predicting heart disease with an interactive dashboard

**Heart Disease Prediction Using Machine Learning Approach**

Heart disease (HD) is a major cause of mortality in modern society. Medical diagnosis is an extremely important but complicated task that should be performed accurately and efficiently.

Cardiovascular disease is difficult to detect due to several risk factors, including high blood pressure, cholesterol, and an abnormal pulse rate.
In this machine learning project, we use a dataset collected from Kaggle (https://www.kaggle.com/search?q=heart+disease+prediction) to predict whether a person is suffering from heart disease.

Problem Statement

* Complete analysis of the Heart Disease Kaggle dataset
* Predict whether a person has heart disease based on various biological and physical parameters

Machine Learning Algorithms

* Random Forest Classifier
* K-Nearest Neighbors Classifier
* Decision Tree Classifier
* Naive Bayes Classifier

**Import libraries**

Let's first import all the necessary libraries. We will use numpy and pandas to start with. For visualization, we will use the pyplot subpackage of matplotlib, rcParams to style the plots, rainbow for colors, and seaborn. For implementing Machine Learning models and processing the data, we will use the sklearn library.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib.cm import rainbow
import seaborn as sns
%matplotlib inline

For processing the data, we'll import a few libraries. To split the available dataset for testing and training, we'll use the train_test_split method.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
from warnings import filterwarnings
filterwarnings("ignore")
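
As a rough illustration of how train_test_split is typically called, the sketch below splits a small made-up frame into training and test sets. The toy values, the 80/20 split, the random_state, and the use of stratify are assumptions for illustration, not settings taken from this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [63, 45, 54, 61, 39, 58, 50, 67, 42, 56],
    "BP": [130, 120, 140, 135, 118, 150, 125, 160, 110, 145],
    "HeartDisease": [2, 1, 2, 2, 1, 2, 1, 2, 1, 1],
})

features = toy.drop(columns="HeartDisease")  # input columns
target = toy["HeartDisease"]                 # column to predict

# stratify=target keeps the class ratio the same in both splits
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)
print(features_train.shape, features_test.shape)
```

Stratifying is one reasonable choice for a small dataset like this one, since it prevents a split in which one class is barely represented in the test set.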

For model validation, we'll import a few libraries.

In [None]:
#model validation
# Note: plot_roc_curve was removed in scikit-learn 1.2; newer versions provide RocCurveDisplay instead
from sklearn.metrics import log_loss,roc_auc_score,precision_score,f1_score,recall_score,roc_curve,auc,plot_roc_curve
from sklearn.metrics import classification_report, confusion_matrix,accuracy_score,fbeta_score,matthews_corrcoef
from sklearn import metrics
from mlxtend.plotting import plot_confusion_matrix

In [None]:
#extra
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from sklearn.feature_selection import SelectFwe, f_regression

Next, we will import all the Machine Learning algorithms

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

Import Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
dataset = pd.read_csv('/content/Heart_Disease_Prediction.csv')

Data Preparation and Exploration

In [None]:
type(dataset)

In [None]:
dataset.shape

In [None]:
dataset.info()

In [None]:
dataset.columns

In [None]:
dataset.describe()

In [None]:
dataset

In [None]:
dataset.head()

In [None]:
dataset.isnull().sum()

In [None]:
dataset.apply(lambda x:len(x.unique()))

In [None]:
print('Chest pain type',dataset['Chest pain type'].unique())
print('FBS over 120',dataset['FBS over 120'].unique())
print('EKG results ',dataset['EKG results'].unique())
print('Exercise angina',dataset['Exercise angina'].unique())
print('Slope of ST',dataset['Slope of ST'].unique())
print('Number of vessels fluro',dataset['Number of vessels fluro'].unique())
print(' Thallium',dataset['Thallium'].unique())

Chest pain type [4 3 2 1]

FBS over 120 [0 1]

EKG results  [2 0 1]

Exercise angina [0 1]

Slope of ST [2 1 3]

Number of vessels fluro [3 0 1 2]

 Thallium [3 7 6]

**Dataset Description**

This dataset consists of 13 features and a target variable, HeartDisease. A detailed description of each column follows:
1. Age: Patient's age in years (Numeric)
2. Sex: Gender of the patient (Male - 1, Female - 0) (Nominal)
3. Chest Pain Type: Type of chest pain experienced by the patient, categorized into: (Nominal)
* Value 1: Typical angina
* Value 2: Atypical angina
* Value 3: Non-anginal pain
* Value 4: Asymptomatic
(Angina is chest pain caused when not enough oxygen-rich blood flows to a part of the heart. The arteries of the heart become narrow due to fatty deposits in the artery walls. The narrowing of the arteries reduces the blood supply to the heart, causing angina.)
4. BP: Resting blood pressure in mm Hg (Numeric)
5. Cholesterol: Serum cholesterol in mg/dl (Numeric)
(Cholesterol is a fatty substance in the blood; high levels can build up in the vessel walls and restrict the blood supply.)
6. FBS over 120: Whether fasting blood sugar is > 120 mg/dl (1 = true; 0 = false) (Nominal)
(Fasting blood sugar is measured after a long gap between a meal and the test. Typically, it's taken before any meal in the morning.)
7. EKG results: Resting electrocardiogram results (Nominal)
* Value 0: Normal
* Value 1: Having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
* Value 2: Showing probable or definite left ventricular hypertrophy by Estes' criteria.
8. Max HR: Maximum heart rate achieved (Numeric)

9. Exercise angina: Exercise-induced angina (1 = yes; 0 = no) (Nominal)
(Chest pain that occurs while exercising or doing any physical activity.)

10. ST depression: Exercise-induced ST depression relative to the resting state (Numeric)
(ST depression is the difference between the ECG value at rest and after exercise. An electrocardiogram records the electrical signals in the heart. It is a common and painless test used to quickly detect heart problems and monitor heart health. Electrocardiograms, also called ECGs or EKGs, are often done in a doctor's office, a clinic, or a hospital room. ECG machines are standard equipment in operating rooms and ambulances, and some personal devices, such as smartwatches, can also record ECGs.)
11. Slope of ST: Slope of the ST segment during peak exercise (Nominal)
* Value 1: Upsloping
* Value 2: Flat
* Value 3: Downsloping
12. Number of vessels fluro: Number of major blood vessels (0-3) colored by fluoroscopy (Numeric)
(Fluoroscopy is an imaging technique that uses X-rays to obtain real-time moving images of the interior of an object. In its primary application of medical imaging, a fluoroscope allows a physician to see the internal structure and function of a patient, so that the pumping action of the heart or the motion of swallowing, for example, can be watched.)
13. Thallium: Thallium stress test result (Nominal)
* Value 3: Normal
* Value 6: Fixed defect
* Value 7: Reversible defect

14. HeartDisease: The target variable that we have to predict. 2 means the patient has heart disease and 1 means the patient is normal.
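
To make the coded columns easier to read in tables and plots, the codes can be swapped for the labels listed above. The snippet below is a minimal sketch on made-up rows; the dictionary names and the new "label" columns are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Code-to-label lookups taken from the feature descriptions above
chest_pain_labels = {
    1: "Typical angina",
    2: "Atypical angina",
    3: "Non-anginal pain",
    4: "Asymptomatic",
}
thallium_labels = {3: "Normal", 6: "Fixed defect", 7: "Reversible defect"}

# Toy rows (made up) that reuse the dataset's column names
toy = pd.DataFrame({"Chest pain type": [4, 3, 2, 1], "Thallium": [3, 7, 6, 3]})

# Series.map replaces each code with its label; codes missing from the dict become NaN
toy["Chest pain label"] = toy["Chest pain type"].map(chest_pain_labels)
toy["Thallium label"] = toy["Thallium"].map(thallium_labels)
print(toy)
```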

**Data Visualization**

Now let's look at several visual representations of the data to understand more about the relationships between the various features.

**Distribution of Heart disease**

It's always a good practice to work with a dataset where the target classes are of approximately equal size. Thus, let's check for the same.

In [None]:
fig, ax1 = plt.subplots(nrows=1, ncols=1, figsize=(14,6))

# Build the labels from the class codes (1 = normal, 2 = heart disease)
# so that each slice is labelled correctly whatever order value_counts() returns
counts = dataset['HeartDisease'].value_counts()
counts.plot.pie(autopct = "%1.0f%%", labels=counts.index.map({1: "Normal", 2: "Heart Disease"}),
                startangle = 60, ax=ax1)
ax1.set(title = 'Percentage of Heart disease patients in Dataset')
plt.show()

In [None]:
y = dataset["HeartDisease"]

In [None]:
rcParams['figure.figsize'] = 8,6
# Take the x positions and the bar heights from the same value_counts() result so they stay aligned
HeartDisease_temp = dataset.HeartDisease.value_counts()
plt.bar(HeartDisease_temp.index, HeartDisease_temp.values, color = ['blue', 'green'])
plt.xticks([1, 2])
plt.xlabel('Target Classes (1 = no disease; 2 = disease)')
plt.ylabel('Samples')
plt.title('Count of each Target Class')
print(HeartDisease_temp)

From the total dataset of 270 patients, 150 (56%) have a heart disease (target=2)
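
The class balance can also be checked numerically. The sketch below rebuilds a toy target column from the counts quoted above rather than loading the real file, so the series itself is an assumption; on the real data the same call would be applied to dataset['HeartDisease'].

```python
import pandas as pd

# Toy target column rebuilt from the counts quoted above:
# 270 patients, 150 of them in class 2 and the remaining 120 in class 1
toy_target = pd.Series([2] * 150 + [1] * 120, name="HeartDisease")

# normalize=True returns the share of each class instead of the raw count
shares = toy_target.value_counts(normalize=True)
print((shares * 100).round(1))
```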

Next, we'll take a look at the histograms for each variable.

In [None]:
dataset.hist(edgecolor='black',layout = (7, 2),
            figsize = (10, 30),
            color=['purple'])

Exploratory Data Analysis (EDA)

Gender distribution based on heart disease

In [None]:
dataset["Sex"].unique()

In [None]:
# Number of males and females
F = dataset[dataset["Sex"] == 0].count()["HeartDisease"]
M = dataset[dataset["Sex"] == 1].count()["HeartDisease"]

# Create a plot
figure, ax = plt.subplots(figsize = (6, 4))
ax.bar(x = ['Female', 'Male'], height = [F, M])
plt.xlabel('Gender')
plt.title('Number of Males and Females in the dataset')
plt.show()

Heart Disease frequency for gender

In [None]:
pd.crosstab(dataset.Sex,dataset.HeartDisease).plot(kind="bar",figsize=(20,10),color=['blue','#AA1111' ])
plt.title('Heart Disease Frequency for Sex')
plt.xlabel('Sex (0 = Female, 1 = Male)')
plt.xticks(rotation=0)
plt.legend(["Don't have Disease", "Have Disease"])
plt.ylabel('Frequency')
plt.show()

In [None]:
countFemale = len(dataset[dataset.Sex == 0])
countMale = len(dataset[dataset.Sex == 1])
print("Percentage of Female Patients:{:.2f}%".format((countFemale)/(len(dataset.Sex))*100))
print("Percentage of Male Patients:{:.2f}%".format((countMale)/(len(dataset.Sex))*100))

Percentage of Female Patients:32.22%

Percentage of Male Patients:67.78%
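
Because the dataset contains more men than women, raw counts per sex are hard to compare directly. A row-normalized crosstab shows the share of each class within each sex instead. The rows below are made up for illustration and only reuse the column names and the 1/2 target coding of the dataset.

```python
import pandas as pd

# Toy rows (made up): Sex 0 = female, 1 = male; HeartDisease 1 = normal, 2 = disease
toy = pd.DataFrame({
    "Sex":          [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "HeartDisease": [1, 1, 1, 2, 1, 1, 2, 2, 2, 2],
})

# normalize="index" divides each row by its total, so every sex sums to 1
rates = pd.crosstab(toy["Sex"], toy["HeartDisease"], normalize="index")
print(rates)
```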

**Age distribution based on heart disease**

In [None]:
# Display age distribution based on heart disease
sns.distplot(dataset[dataset['HeartDisease'] == 1]['Age'], label='Do not have heart disease')
sns.distplot(dataset[dataset['HeartDisease'] == 2]['Age'], label = 'Have heart disease')
# Age is on the x-axis; distplot draws a density on the y-axis
plt.xlabel('Age')
plt.ylabel('Density')
plt.title('Age Distribution based on Heart Disease')
plt.legend()
plt.show()

In [None]:
# The target is coded 1 = no disease, 2 = disease, matching the plots above
print('Min age of people who do not have heart disease: ', min(dataset[dataset['HeartDisease'] == 1]['Age']))
print('Max age of people who do not have heart disease: ', max(dataset[dataset['HeartDisease'] == 1]['Age']))
print('Average age of people who do not have heart disease: ', dataset[dataset['HeartDisease'] == 1]['Age'].mean())

In [None]:
print('Min age of people who have heart disease: ', min(dataset[dataset['HeartDisease'] == 2]['Age']))
print('Max age of people who have heart disease: ', max(dataset[dataset['HeartDisease'] == 2]['Age']))
print('Average age of people who have heart disease: ', dataset[dataset['HeartDisease'] == 2]['Age'].mean())
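
The same minimum, maximum, and average can be obtained for both classes in a single call with groupby. This is a minimal sketch on made-up ages, assuming the 1 = normal, 2 = disease coding from the dataset description.

```python
import pandas as pd

# Toy rows (made up) that reuse the dataset's column names
toy = pd.DataFrame({
    "Age":          [41, 52, 63, 48, 57, 66, 70, 59],
    "HeartDisease": [1, 1, 1, 1, 2, 2, 2, 2],
})

# One row per class, one column per statistic
age_summary = toy.groupby("HeartDisease")["Age"].agg(["min", "max", "mean"])
print(age_summary)
```

Grouping this way avoids repeating the filter for every statistic and keeps working unchanged if the class codes differ.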

**Heart Disease Frequency for ages**

In [None]:
pd.crosstab(dataset.Age,dataset.HeartDisease).plot(kind="bar",figsize=(20,6))
plt.title('Heart Disease Frequency for Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.savefig('heartDiseaseAndAges.png')
plt.show()
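
With one bar pair per individual age, the chart above can be hard to read. One possible alternative is to group ages into bands before counting. The sketch below uses made-up rows, and the band edges are an arbitrary assumption chosen only for illustration.

```python
import pandas as pd

# Toy rows (made up); HeartDisease 1 = normal, 2 = disease
toy = pd.DataFrame({
    "Age":          [29, 35, 44, 47, 52, 58, 61, 66, 71, 77],
    "HeartDisease": [1, 1, 1, 2, 1, 2, 2, 2, 1, 2],
})

# Hypothetical band edges; right=False makes each band closed on the left, e.g. [40, 50)
bins = [20, 40, 50, 60, 70, 80]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=bins, right=False)

# Count patients of each class per age band
by_group = pd.crosstab(toy["AgeGroup"], toy["HeartDisease"])
print(by_group)
```

The resulting table can be passed to .plot(kind="bar") in the same way as the per-age crosstab.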