# Heart disease prediction
This notebook presents several machine learning models for predicting heart disease. The dataset used is the [Heart Failure Prediction](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction) dataset from Kaggle.
## Importing libraries

In [None]:
from os import stat
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
import seaborn as sns
import numpy as np
%matplotlib inline

## Importing dataset

In [None]:
# Path to the dataset
filename = 'data/heart.csv'

# Size of the file, in kB with one decimal
print(f'File size: {stat(filename).st_size / 1024:.1f} kB.')

# Read the data
df = pd.read_csv(filename)

## Exploratory data analysis

In [None]:
display(df.head(10))
df.info()  # prints its summary directly and returns None, so no display() is needed
display(df.describe())

We want to predict the `HeartDisease` column, which is binary. The mean of `HeartDisease` in the summary above is the share of patients with heart disease; it shows that the dataset is roughly balanced, with about half of the patients having heart disease and about half not.
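The class balance can also be checked directly by counting the values of the target column. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in for `df`), so the proportions it prints are not those of the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for df; in the notebook this would be df['HeartDisease'].
toy = pd.DataFrame({'HeartDisease': [1, 0, 1, 1, 0, 0, 1, 0]})

# normalize=True returns proportions instead of raw counts.
balance = toy['HeartDisease'].value_counts(normalize=True)
print(balance)
```

In the notebook itself, the same call on `df['HeartDisease']` would give the actual proportions.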

Let's first look at the age distribution.

In [None]:
from scipy.stats import norm

# Points at which the fitted normal density is evaluated
x = np.linspace(df['Age'].min(), df['Age'].max(), 100)

plt.figure(figsize=(10, 6))
plt.hist(df['Age'], bins=df['Age'].nunique(), color='red', alpha=0.5, edgecolor='black', linewidth=1.2, density=True, label='Age')
# Normal density with the sample mean and standard deviation
plt.plot(x, norm.pdf(x, df['Age'].mean(), df['Age'].std()), color='blue', label='Normal fit')
plt.title('Age distribution')
plt.xlabel('Age')
plt.legend()
plt.show()

In [None]:
# Points at which the fitted normal density is evaluated (norm was imported above)
x = np.linspace(df['RestingBP'].min(), df['RestingBP'].max(), 100)

plt.figure(figsize=(10, 6))
plt.hist(df['RestingBP'], bins=df['RestingBP'].nunique(), color='red', alpha=0.5, edgecolor='black', linewidth=1.2, density=True, label='RestingBP')
# Normal density with the sample mean and standard deviation
plt.plot(x, norm.pdf(x, df['RestingBP'].mean(), df['RestingBP'].std()), color='blue', label='Normal fit')
plt.title('RestingBP distribution')
plt.xlabel('RestingBP')
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df['Age'], df['HeartDisease'])
plt.title('Heart Disease by Age')
plt.xlabel('Age')
plt.ylabel('HeartDisease')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df['Cholesterol'], df['HeartDisease'])
plt.title('Heart Disease by Cholesterol level')
plt.xlabel('Cholesterol')
plt.ylabel('HeartDisease')
plt.show()

Let's see the correlation between the numerical features and the target variable.

In [None]:
# The categorical columns are still strings here, so restrict the correlation to numeric columns
corr = df.corr(numeric_only=True)
display(corr)
plt.figure(figsize=(13, 10))
sns.heatmap(corr, cmap=plt.cm.RdYlBu_r, vmin=-0.25, vmax=0.6, annot=True)
plt.title('Correlation heatmap of the dataset')
plt.show()

## Data preprocessing
The data must be preprocessed before it is fed to the machine learning models.
We will replace each categorical feature with integer codes.

In [None]:
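# Map each category of every text column to an integer code, in order of first appearance
for col in df.columns:
    if df[col].dtype == 'object':
        codes = {value: code for code, value in enumerate(df[col].unique())}
        df[col] = df[col].map(codes)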

display(df.head(10))
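To make the encoding concrete, here is a minimal sketch on a made-up column (`s` and its values are hypothetical, not rows from the dataset). It builds the same kind of mapping as the loop above and checks that `pd.factorize` produces identical codes.

```python
import pandas as pd

# Hypothetical sample of a categorical column, not taken from the dataset.
s = pd.Series(['ATA', 'NAP', 'ATA', 'ASY', 'NAP'])

# Manual mapping, as in the loop above: codes follow the order of first appearance.
mapping = {value: code for code, value in enumerate(s.unique())}
manual = s.map(mapping)

# pd.factorize returns the same codes, plus the array of unique values.
codes, uniques = pd.factorize(s)

print(mapping)
print(list(codes))
```

The integer codes carry no meaning beyond identifying the category, so the order they imply is arbitrary. For columns without a natural order, one-hot encoding with `pd.get_dummies` is a common alternative.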

Now that every column is numeric, let's look at the correlations again, this time including the encoded categorical features.

In [None]:
display(df.corr(numeric_only=True))
plt.figure(figsize=(13,10))
sns.heatmap(df.corr(), cmap = plt.cm.RdYlBu_r, vmin = -0.25, annot = True, vmax = 0.6)
plt.title('Correlation heatmap of the dataset')
plt.show()

This analysis shows that the `HeartDisease` column is correlated with the `Age`, `ChestPainType`, `ExerciseAngina`, `Oldpeak` and `ST_Slope` columns.

We also notice that the last three variables are correlated with each other.
Since `Oldpeak` is correlated with `ST_Slope`, we will keep only `ST_Slope`, which is more strongly correlated with the target variable.
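Reading the strongest correlations off a heatmap can be error-prone, so it may help to rank them numerically. The sketch below does this on synthetic data; the frame `toy` and the column names `strong`, `weak`, `noise` and `target` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: three features with different links to a binary target.
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    'strong': target + rng.normal(0, 0.3, size=n),  # closely tied to the target
    'weak': target + rng.normal(0, 3.0, size=n),    # mostly noise
    'noise': rng.normal(0, 1.0, size=n),            # unrelated to the target
    'target': target,
})

# Absolute correlation of every feature with the target, strongest first.
ranking = (
    toy.corr()['target']
    .drop('target')
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

In the notebook, the equivalent would be to replace `toy` with `df` and `'target'` with `'HeartDisease'`.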

## Feature selection

We will now use the `SelectKBest` class from `sklearn.feature_selection` to select the best features, scored with the chi-squared statistic (`chi2`), which requires non-negative feature values.

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

# Candidate features: everything except the target and the discarded Oldpeak column
features = df.drop(['Oldpeak', 'HeartDisease'], axis=1)

# Keep the 5 features with the highest chi-squared score, returned as a DataFrame
X_new = SelectKBest(chi2, k=5).set_output(transform="pandas").fit_transform(features, df['HeartDisease'])

print(f'Original shape: {features.shape}')
print(f'New shape: {X_new.shape}')

display(X_new.head(10))
X_new.info()  # prints its summary directly and returns None
display(X_new.describe())
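To see why a feature was kept, the fitted selector can be inspected through its `scores_` attribute and its `get_support()` method. The following is a small sketch on synthetic data; the columns `informative`, `noise_a` and `noise_b` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic, non-negative features (chi2 requires non-negative values).
rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    'informative': y * 5 + rng.integers(0, 3, size=n),  # depends on the class
    'noise_a': rng.integers(0, 10, size=n),             # independent of the class
    'noise_b': rng.integers(0, 10, size=n),             # independent of the class
})

selector = SelectKBest(chi2, k=1).fit(X, y)

# Chi-squared score per feature, highest first.
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
# get_support() returns a boolean mask over the input columns.
selected = list(X.columns[selector.get_support()])

print(scores)
print(selected)
```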

Now that the features have been selected, we are going to train our classifiers on the selected features.

## Training the models
### Splitting the data

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_new, df['HeartDisease'], test_size=0.2, random_state=42)

print(f'X_train shape: {X_train.shape}')
print(f'Y_train shape: {Y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'Y_test shape: {Y_test.shape}')
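The split above is purely random. If the class proportions should be identical in the training and test sets, `train_test_split` accepts a `stratify` argument. A minimal sketch, using made-up labels with a 70/30 ratio rather than the notebook's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 70 negatives and 30 positives.
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, both parts keep the 70/30 ratio of the full set.
print(y_tr.mean(), y_te.mean())
```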

### Logistic regression
#### Training the model
We will start by training a logistic regression model.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

reg = LogisticRegression()
reg.fit(X_train, Y_train)

print(f'Accuracy: {reg.score(X_test, Y_test)}')

scores = cross_val_score(reg, X_new, df['HeartDisease'], cv=5)
print(f'Cross validation scores: {scores}')
print(f'Cross validation mean score: {scores.mean()}')
print(f'Cross validation standard deviation: {scores.std()}')

The accuracy is already good, but we will try to improve it by changing the set of features.
#### Changing the feature set
Since the previous model was trained on the selected features only, we will now try a model that uses all the features.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(df.drop('HeartDisease', axis = 1), df['HeartDisease'], test_size=0.2, random_state=42)

print(f'X_train shape: {X_train.shape}')
print(f'Y_train shape: {Y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'Y_test shape: {Y_test.shape}')


reg = LogisticRegression()
reg.fit(X_train, Y_train)

print(f'Accuracy: {reg.score(X_test, Y_test)}')

scores = cross_val_score(reg, df.drop('HeartDisease', axis = 1), df['HeartDisease'], cv=5)
print(f'Cross validation scores: {scores}')
print(f'Cross validation mean score: {scores.mean()}')
print(f'Cross validation standard deviation: {scores.std()}')

This gives a better result, with better accuracy and better recall.
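The `score` method and `cross_val_score` report accuracy by default, so recall has to be computed separately. The sketch below shows both metrics on made-up labels and predictions (`y_true` and `y_pred` are hypothetical, not results from this notebook).

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels and predictions, only to illustrate the two metrics.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 1])

accuracy = accuracy_score(y_true, y_pred)  # share of all predictions that are correct
recall = recall_score(y_true, y_pred)      # share of actual positives that are found

print(f'Accuracy: {accuracy}')
print(f'Recall: {recall}')
```

In the notebook, the equivalent call would be `recall_score(Y_test, reg.predict(X_test))`, and `cross_val_score` accepts `scoring='recall'` for a cross-validated estimate.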

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(df.drop(['HeartDisease', 'Oldpeak'], axis = 1), df['HeartDisease'], test_size=0.2, random_state=42)

print(f'X_train shape: {X_train.shape}')
print(f'Y_train shape: {Y_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'Y_test shape: {Y_test.shape}')


reg = LogisticRegression()
reg.fit(X_train, Y_train)

print(f'Accuracy: {reg.score(X_test, Y_test)}')

scores = cross_val_score(reg, df.drop(['HeartDisease', 'Oldpeak'], axis = 1), df['HeartDisease'], cv=5)
print(f'Cross validation scores: {scores}')
print(f'Cross validation mean score: {scores.mean()}')
print(f'Cross validation standard deviation: {scores.std()}')

Dropping the `Oldpeak` column was a good idea: the scores are even higher, above 0.8.

However, the solver does not converge for this model, so we will try another one.
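A common cause of non-convergence in `LogisticRegression` is features on very different scales combined with the default iteration limit (`max_iter=100`). Assuming that is what happens here, one possible remedy is to standardize the features in a pipeline and raise the limit. The sketch below uses synthetic data from `make_classification` as a stand-in for the heart data; it is an illustration, not the result of running it on this notebook's dataset.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data, with one feature on a much larger scale.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X[:, 0] *= 1000

# Standardize the features, then fit with a higher iteration limit.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

with warnings.catch_warnings():
    # Turn a convergence warning into an error so that it cannot go unnoticed.
    warnings.simplefilter('error', ConvergenceWarning)
    model.fit(X, y)

print(f'Training accuracy: {model.score(X, y):.3f}')
```

Because the scaler is part of the pipeline, passing `model` to `cross_val_score` refits the scaling on each training fold only.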

### Decision tree
#### Training the model

We will first train it on the full dataset, then try dropping the `Oldpeak` column.

In [None]:
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, Y_train, Y_test = train_test_split(df.drop('HeartDisease', axis = 1), df['HeartDisease'], test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(random_state=42)  # fixed seed so that the results are reproducible
tree.fit(X_train, Y_train)

print(f'Accuracy: {tree.score(X_test, Y_test)}')

scores = cross_val_score(tree, df.drop('HeartDisease', axis = 1), df['HeartDisease'], cv=5)
print(f'Cross validation scores: {scores}')
print(f'Cross validation mean score: {scores.mean()}')
print(f'Cross validation standard deviation: {scores.std()}')

Let's now drop the `Oldpeak` column.

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(df.drop(['HeartDisease', 'Oldpeak'], axis = 1), df['HeartDisease'], test_size=0.2, random_state=42)

tree = DecisionTreeClassifier(random_state=42)  # fixed seed so that the results are reproducible
tree.fit(X_train, Y_train)

print(f'Accuracy: {tree.score(X_test, Y_test)}')

scores = cross_val_score(tree, df.drop(['HeartDisease', 'Oldpeak'], axis = 1), df['HeartDisease'], cv=5)
print(f'Cross validation scores: {scores}')
print(f'Cross validation mean score: {scores.mean()}')
print(f'Cross validation standard deviation: {scores.std()}')

It looks like the tree model is less accurate than the logistic regression model.
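Comparing models is easier when they are all evaluated by the same procedure. The helper below (`cv_summary`, a hypothetical name) wraps the cross-validation steps repeated in the cells above. It is shown on synthetic data, so the printed scores say nothing about the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def cv_summary(model, X, y, cv=5):
    """Return the mean and standard deviation of the cross-validated accuracy."""
    scores = cross_val_score(model, X, y, cv=cv)
    return scores.mean(), scores.std()


# Synthetic stand-in for the features and the target.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

models = {
    'Logistic regression': LogisticRegression(max_iter=1000),
    'Decision tree': DecisionTreeClassifier(random_state=42),
}

# Evaluate every model with the same folds and the same metric.
results = {name: cv_summary(model, X, y) for name, model in models.items()}
for name, (mean, std) in results.items():
    print(f'{name}: {mean:.3f} +/- {std:.3f}')
```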