# Heart Disease Analysis and Prediction

In this notebook we try to predict whether a patient should be diagnosed with heart disease. This is a binary outcome:<br>
1. **Positive (1):** patient diagnosed with Heart Disease
1. **Negative (0):** patient not diagnosed with Heart Disease<br>

Multiple machine learning models will be applied to see which yields the greatest accuracy.

# Dataset Features:
The output (positive or negative diagnosis of heart disease) is predicted from 13 features:
1. **age:** age of the patient
2. **sex:** 1 = male, 0 = female (binary)
3. **cp:** chest pain type (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)
4. **trtbps:** resting blood pressure (mm Hg)
5. **chol:** serum cholesterol in mg/dl
6. **fbs:** fasting blood sugar > 120 mg/dl (binary) (1 = true; 0 = false)
7. **restecg:** resting electrocardiography results (values 0, 1, 2)
8. **thalachh:** maximum heart rate achieved
9. **exng:** exercise induced angina (binary) (1 = yes, 0 = no)
10. **oldpeak:** ST depression induced by exercise relative to rest
11. **slp:** slope of the peak exercise ST segment (0 = up sloping, 1 = flat, 2 = down sloping)
12. **caa:** number of major vessels (values: 0–3)
13. **thall:** thalassemia, a blood disorder (0 = no-data, 1 = normal, 2 = fixed defect, 3 = reversible defect)

In [None]:
# import all the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix 

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
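# set style, context and palette in one call; a separate sns.set(style = ...) afterwards
# would reset the context and palette to their defaults
sns.set(style = 'darkgrid', context = 'talk', palette = 'Paired')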

# Data Exploration

In [None]:
# load the dataset
data = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')

In [None]:
# display first few records
data.head()

In [None]:
# show the number of records and the number of features
data.shape

In [None]:
# get a basic understanding of the dataset
data.info()

In [None]:
# summarise the count, mean, standard deviation, min and max for numeric features
data.describe()

# Data Cleaning

In [None]:
# check for null values
data.isnull().sum()

As there are no null values in the data, we move on to checking for duplicates.

In [None]:
# show duplicate rows in the dataset
data[data.duplicated(keep = False)]

In [None]:
# drop the duplicated row
data.drop_duplicates(keep = 'first', inplace = True)
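The two `keep` settings used above behave differently, which is easy to miss. The following is a small sketch on an invented three-row frame (the values are illustrative and not taken from the dataset): `keep = False` marks every member of a duplicate group for inspection, while `keep = 'first'` retains one copy when dropping.

```python
import pandas as pd

# Toy frame with one repeated row (values are made up for illustration)
toy = pd.DataFrame({
    'age':    [38, 52, 38],
    'chol':   [175, 212, 175],
    'output': [1, 1, 1],
})

# keep=False flags every row that belongs to a duplicate group
dup_mask = toy.duplicated(keep=False)
print(toy[dup_mask])

# keep='first' retains the first occurrence and drops the later copies
deduped = toy.drop_duplicates(keep='first')
print(deduped.shape)
```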

In [None]:
# check correlations between all variables
data.corr()

Visualise the correlation matrix to see whether the features are positively or negatively correlated with the target (output).

In [None]:
# plot the correlation matrix as a heatmap
plt.figure(figsize = (13, 7))
ax = sns.heatmap(data.corr(), vmin = -1, vmax = 1, center = 0, cmap = sns.diverging_palette(20, 220, n = 200), annot = True)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, horizontalalignment = 'right')
plt.show()

So there is a positive correlation between chest pain (cp) and the target. On the other hand, there is a negative correlation between exercise induced angina (exng) and the target.
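Reading individual cells off a heatmap can be error-prone, so it may help to rank the features by their correlation with the target. The sketch below uses a small invented frame as a stand-in for `data`; only the column names `cp`, `exng`, `chol` and `output` come from the dataset, the values do not.

```python
import pandas as pd

# Toy stand-in for the heart dataset (values are made up for illustration)
toy = pd.DataFrame({
    'cp':     [0, 0, 1, 2, 2, 3, 0, 1],
    'exng':   [1, 1, 0, 0, 0, 0, 1, 0],
    'chol':   [250, 240, 230, 245, 235, 260, 255, 225],
    'output': [0, 0, 1, 1, 1, 1, 0, 1],
})

# correlation of every feature with the target, most positive first
target_corr = toy.corr()['output'].drop('output').sort_values(ascending=False)
print(target_corr)
```

On the real dataset the same expression with `data` in place of `toy` would give the ranking that the heatmap shows in its last row.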

# Data Visualisation

In [None]:
plt.figure(figsize = (15, 8))
sns.countplot(x = 'age', hue = 'output', data = data).set_title('Heart Disease Frequency for Ages')
plt.legend(title = 'Output', loc = 'upper right', labels = ['No Heart Disease', 'Have Heart Disease'])
plt.show()

In [None]:
plt.figure(figsize = (15, 8))
ax = sns.countplot(x = 'sex', hue = 'output', data = data)
ax.set_xticklabels(['Female', 'Male'])
ax.set_title('Heart Disease Frequency for Gender')
plt.legend(title = 'Output', loc = 'upper left', labels = ['No Heart Disease', 'Have Heart Disease'])
plt.xlabel('Gender')
plt.show()

In [None]:
# plt.figure(figsize = (55, 10))
plt.figure(figsize = (15, 8))
ax = sns.countplot(x = 'cp', hue = 'output', data = data)
ax.set_xticklabels(['Typical Angina', 'Atypical Angina', 'Non-Anginal Pain', 'Asymptomatic'])
ax.set_title('Heart Disease Frequency According to Chest Pain Type')
plt.legend(title = 'Output', loc = 'upper right', labels = ['No Heart Disease', 'Have Heart Disease'])
plt.xlabel('Chest Pain Type')
plt.show()

Most of the heart disease patients are found to have asymptomatic chest pain. This group of people might show atypical symptoms such as indigestion, flu or a strained chest muscle.
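A count plot shows absolute counts, so a statement about which chest pain type is most associated with a diagnosis is easier to check with per-category proportions. The following sketch uses invented values in a frame named `toy`; on the real data the same `pd.crosstab` call could be applied to `data['cp']` and `data['output']`.

```python
import pandas as pd

# Toy stand-in (values are made up for illustration)
toy = pd.DataFrame({
    'cp':     [0, 0, 0, 1, 1, 2, 2, 2, 3, 3],
    'output': [0, 0, 1, 1, 1, 1, 1, 0, 1, 0],
})

# rows: chest pain type, columns: diagnosis; normalize='index' makes each row sum to 1
cp_rates = pd.crosstab(toy['cp'], toy['output'], normalize='index')
print(cp_rates)
```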

In [None]:
plt.figure(figsize = (15, 8))
ax = sns.countplot(x = 'fbs', hue = 'output', data = data)
ax.set_xticklabels(['False', 'True'])
ax.set_title('Heart Disease Frequency According to Fasting Blood Sugar')
plt.legend(title = 'Output', loc = 'upper right', labels = ['No Heart Disease', 'Have Heart Disease'])
plt.xlabel('Fasting Blood Sugar > 120 mg/dl')
plt.show()

In [None]:
plt.figure(figsize = (15, 8))
ax = sns.countplot(x = 'slp', hue = 'output', data = data)
ax.set_xticklabels(['Up', 'Flat', 'Down'])
ax.set_title('Heart Disease Frequency According to ST Segment Slope')
plt.legend(title = 'Output', loc = 'upper left', labels = ['No Heart Disease', 'Have Heart Disease'])
plt.xlabel('Slope of the Peak Exercise ST Segment')
plt.show()

In [None]:
plt.figure(figsize = (15, 8))
ax = sns.countplot(x = 'thall', hue = 'sex', data = data)
ax.set_xticklabels(['No Info', 'Fixed Defect', 'Normal', 'Reversible Defect'])
ax.set_title('Blood Disorder Frequency by Gender')
plt.legend(title = 'Gender', loc = 'upper left', labels = ['Female', 'Male'])
plt.xlabel('Blood Disorder (thall)')
plt.show()

# Data Preprocessing

Before providing the dataset to any model, it is essential to check for outliers and to standardise the continuous features so that each has a mean of 0 and a standard deviation of 1.

### Outliers Detection

In [None]:
fig, axes = plt.subplots(4, 3, figsize = (17, 15))
fig.suptitle('Outliers Detection')
sns.boxplot(ax = axes[0,0], x = data['age'])
sns.boxplot(ax = axes[0,1], x = data['cp'])
sns.boxplot(ax = axes[0,2], x = data['trtbps'])
sns.boxplot(ax = axes[1,0], x = data['chol'])
sns.boxplot(ax = axes[1,1], x = data['fbs'])
sns.boxplot(ax = axes[1,2], x = data['restecg'])
sns.boxplot(ax = axes[2,0], x = data['thalachh'])
sns.boxplot(ax = axes[2,1], x = data['oldpeak'])
sns.boxplot(ax = axes[2,2], x = data['slp'])
sns.boxplot(ax = axes[3,0], x = data['caa'])
sns.boxplot(ax = axes[3,1], x = data['thall'])
axes[3,2].axis('off')  # hide the unused twelfth panel
plt.show()

From the box plots above, outliers are present in trtbps, chol, thalachh, oldpeak, caa and thall. I am not going to remove them, however: in medical data an extreme value may be a genuine clinical measurement rather than an error, and excluding outliers can have a dramatic impact on the type I error.
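To put a number on what the box plots show, the outliers can be counted with the same 1.5 × IQR whisker rule that box plots use by default. The helper `count_iqr_outliers` and the sample values below are hypothetical and serve only as an illustration; on the real data it could be applied column by column.

```python
import pandas as pd

def count_iqr_outliers(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# made-up cholesterol-like values with one extreme reading
toy_chol = pd.Series([200, 210, 220, 230, 240, 250, 260, 564])
print(count_iqr_outliers(toy_chol))
```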

### Normalisation

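Many machine learning algorithms, for example KNN and SVM with the default RBF kernel, use the Euclidean distance between two data points in their computations. Features measured on a large scale (such as chol) then dominate features measured on a small scale (such as oldpeak). To suppress this effect, we need to bring all features to the same order of magnitude. This can be achieved by a method called feature scaling.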
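To illustrate why feature magnitudes matter for distance computations, the sketch below compares Euclidean distances before and after standardisation. The three rows are invented `[chol, oldpeak]` pairs and the helper `dist` is introduced only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy patients: columns are [chol, oldpeak]; values are made up
X_toy = np.array([
    [200.0, 0.0],
    [205.0, 4.0],
    [260.0, 0.2],
])

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(a - b))

# raw features: the distance is driven almost entirely by chol
raw_01, raw_02 = dist(X_toy[0], X_toy[1]), dist(X_toy[0], X_toy[2])
print(raw_01, raw_02)

# standardised features: both columns now contribute on a comparable scale
X_scaled = StandardScaler().fit_transform(X_toy)
scaled_01, scaled_02 = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])
print(scaled_01, scaled_02)
```

Before scaling, the large oldpeak difference between the first two patients barely registers next to the chol difference; after scaling, the two distances are of similar size.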

In [None]:
# create a new dataframe for normalised dataset
normalised_data = data.copy()

In [None]:
columns_to_scale = ['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']

In [None]:
ss = StandardScaler()
normalised_data[columns_to_scale] = ss.fit_transform(normalised_data[columns_to_scale])

# Modelling

In this notebook 7 different machine learning algorithms will be evaluated on the dataset for prediction analysis:

1. Logistic Regression
1. Naive Bayes
1. XGBoost
1. Random Forest
1. Decision Trees (Classification and Regression Trees, or CART)
1. k-Nearest Neighbors (KNN)
1. Support Vector Machines (SVM)

Each algorithm will be evaluated using a classification report (precision, recall, F1-score and accuracy) and a confusion matrix. The first step in the data modelling is to split the dataset into X (matrix of independent variables) and y (vector of the dependent variable). Each model is then instantiated and fitted on the training set, and its predictions on the test set are used to produce the classification report.

In [None]:
# label data into feature data and target data
X = normalised_data.iloc[:, :-1]
y = normalised_data.iloc[:, -1]

In [None]:
# split dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
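This notebook scales the whole dataset before splitting it. A common variation is to fit the scaler on the training rows only, so that the test set does not influence the scaling statistics. The sketch below shows one way to do that with a scikit-learn `Pipeline` and `ColumnTransformer`. It runs on a randomly generated stand-in frame (`toy`, with invented values and only three columns), not on the real dataset, and it is not the approach used in the rest of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset (random values, illustration only)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    'age':  rng.integers(29, 78, n),
    'chol': rng.integers(126, 565, n),
    'sex':  rng.integers(0, 2, n),
})
toy['output'] = (toy['age'] + rng.normal(0, 10, n) > 54).astype(int)

X_toy, y_toy = toy.drop(columns='output'), toy['output']
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# scale only the continuous columns; the binary column passes through unchanged
preprocess = ColumnTransformer(
    [('scale', StandardScaler(), ['age', 'chol'])],
    remainder='passthrough',
)
pipe = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(random_state=42)),
])

# fit() learns the scaling statistics from the training rows only
pipe.fit(X_tr, y_tr)
print(pipe.score(X_te, y_te))
```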

## Logistic Regression

In [None]:
# create instance of model
lr = LogisticRegression(random_state = 42) 

In [None]:
# train the model 
lr.fit(X_train, y_train)

In [None]:
# get y predictions
y_pred = lr.predict(X_test)

In [None]:
# show performance metrics
print(classification_report(y_test, y_pred))

In [None]:
# print the confusion matrix
print(confusion_matrix(y_test, y_pred))

## Naive Bayes

In [None]:
# create instance of model
nb = GaussianNB()

In [None]:
# train the model
nb.fit(X_train, y_train)

In [None]:
# get y predictions
y_pred = nb.predict(X_test)

In [None]:
# print performance report
print(classification_report(y_test, y_pred))

In [None]:
# print the confusion matrix
print(confusion_matrix(y_test, y_pred))

## XGBoost

In [None]:
# create an instance
xgb = XGBClassifier(random_state = 42)

In [None]:
# train the model
xgb.fit(X_train, y_train)

In [None]:
# get y predictions
y_pred = xgb.predict(X_test)

In [None]:
# print performance report
print(classification_report(y_test, y_pred))

In [None]:
# print the confusion matrix
print(confusion_matrix(y_test, y_pred))

## Random Forest

In [None]:
# create an instance of model
rf = RandomForestClassifier(random_state=42, n_estimators=500)

In [None]:
# fit the model
rf.fit(X_train, y_train)

In [None]:
# get y predictions
y_pred = rf.predict(X_test)

In [None]:
# show performance report
print(classification_report(y_test, y_pred))

In [None]:
# print the confusion matrix
print(confusion_matrix(y_test, y_pred))

## Decision Trees

In [None]:
# create an instance
dt = DecisionTreeClassifier(random_state = 42)

In [None]:
# train model 
dt.fit(X_train, y_train)

In [None]:
# get y predictions
y_pred = dt.predict(X_test)

In [None]:
# print performance report
print(classification_report(y_test, y_pred))

In [None]:
# print the confusion matrix
print(confusion_matrix(y_test, y_pred))

## KNN

In [None]:
# create instance of model
knn = KNeighborsClassifier()

In [None]:
# train model 
knn.fit(X_train, y_train)

In [None]:
# get y predictions
y_pred = knn.predict(X_test)

In [None]:
# print performance report
print(classification_report(y_test, y_pred))

In [None]:
# print the confusion matrix
print(confusion_matrix(y_test, y_pred))

## SVM

In [None]:
# get instance of the model
svm = SVC(random_state = 42)

In [None]:
# train the model 
svm.fit(X_train, y_train)

In [None]:
# get y predictions
y_pred = svm.predict(X_test)

In [None]:
# show performance report
print(classification_report(y_test, y_pred))

In [None]:
# print the confusion matrix
print(confusion_matrix(y_test, y_pred))
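The introduction sets out to find which model yields the greatest accuracy, which is easier to judge when the scores are collected in one place. A side-by-side comparison could be sketched as below. To stay self-contained it runs on synthetic data from `make_classification` rather than on the heart dataset, and it leaves out XGBoost so that only scikit-learn is needed; the names `X_demo`, `y_demo`, `models` and `scores` are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 13 features (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Naive Bayes': GaussianNB(),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(random_state=42),
}

# fit each model on the training split and record its test accuracy
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

# report the models from highest to lowest accuracy
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {acc:.3f}')
```

With the notebook's own `X_train`, `X_test`, `y_train` and `y_test` substituted for the synthetic split, the same loop would rank the models trained above.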

___