## Problem statement
The goal is to build a classification model that predicts whether a patient has heart disease (1) or not (0) based on the provided features.

## Features
- Age: age of the patient [years]
- Sex: sex of the patient [M: Male, F: Female]
- ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: resting blood pressure [mm Hg]
- Cholesterol: serum cholesterol [mg/dl]
- FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
- ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
- Oldpeak: ST depression induced by exercise relative to rest [Numeric value]
- ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: output class [1: heart disease, 0: Normal]

## Data collection
Dataset source - (https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction/data)

## 1. Importing Packages

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline
import warnings
pd.options.display.max_rows = 999
warnings.filterwarnings('ignore')


Import the CSV data as a pandas DataFrame.

In [2]:
df = pd.read_csv('data/heart.csv')

## 2. Data insight

In [None]:
df.head()

In [None]:
df.shape

The data doesn't have any NaN values.

In [None]:
df.isna().sum()

The data doesn't have any duplicates.

In [None]:
df.duplicated().sum()

In [None]:
df.info()

In [None]:
df.nunique()

In [None]:
df.describe()

The feature means differ widely, which tells us that the data needs to be standardized.
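
As a minimal sketch of what standardization does, the snippet below rescales each numeric column to zero mean and unit variance using plain pandas. The small `toy` frame and its values are made up for illustration; in the notebook the same expression would apply to the numeric columns of `df`.

```python
import pandas as pd

# Toy stand-in for a few numeric columns (values are illustrative, not from the dataset).
toy = pd.DataFrame({
    "Age": [40, 49, 37, 54, 60],
    "MaxHR": [172, 156, 98, 122, 140],
    "Oldpeak": [0.0, 1.0, 0.0, 1.5, 2.0],
})

# z-score: subtract the column mean and divide by the column standard deviation.
# ddof=0 gives the population standard deviation, the convention used by
# scikit-learn's StandardScaler.
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized.round(2))
```

In a modelling pipeline the mean and standard deviation are normally computed on the training split only and then reused for the test split.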

In [None]:
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

print('We have {0} numerical features: {1}'.format(len(numeric_features),numeric_features))
print('We have {0} categorical features: {1}'.format(len(categorical_features),categorical_features))


## 3. EDA with visualization

In [None]:
plt.figure(figsize=(15,10))
plt.subplot(1,3,1)
sns.histplot(data=df,x='RestingBP',kde=True)
plt.subplot(1,3,2)
sns.histplot(data=df,x='Cholesterol',kde=True)
plt.subplot(1,3,3)
sns.histplot(data=df,x='FastingBS',kde=True)


Conclusions:
- RestingBP is approximately normally distributed.
- Cholesterol is also approximately normally distributed once the unrealistic 0 values are disregarded.

RestingBP contains a value of 0 and Cholesterol contains many values of 0, which is physiologically impossible. These values need to be handled.

In [None]:
# This record contains invalid data (RestingBP of 0). The 0 values are going to be replaced.
# df[df['RestingBP']==0]

In [13]:
# df.loc[449,['RestingBP','Cholesterol']] = np.nan

In [14]:
# df.loc[df['Cholesterol']==0, 'Cholesterol'] = np.nan
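
The commented-out cells above aim to mark the impossible zeros as missing. A minimal sketch of that step on made-up data, using `DataFrame.mask`, which replaces every entry where the condition holds with NaN. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy data (illustrative values only); 0 stands for an unrecorded measurement.
toy = pd.DataFrame({
    "RestingBP": [140, 0, 130, 120],
    "Cholesterol": [289, 0, 0, 214],
})

cols = ["RestingBP", "Cholesterol"]

# How many zeros does each column contain?
zero_counts = (toy[cols] == 0).sum()
print(zero_counts)

# Replace the zeros with NaN so that pandas treats them as missing.
# The columns are cast to float first because NaN cannot be stored in an integer column.
toy[cols] = toy[cols].astype(float).mask(toy[cols] == 0)

print(toy.isna().sum())
```

Assigning the result back to `toy[cols]` modifies the frame itself, which avoids the chained-indexing pattern `df[mask]['col'] = ...` that only writes to a temporary copy.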

In [None]:
plt.subplot(3,1,1)
sns.boxplot(data=df,x = 'Cholesterol')
plt.subplot(3,1,2)
sns.boxplot(data=df,x = 'RestingBP')

There are many outliers in the RestingBP and Cholesterol columns, so we are going to replace the NaN values with the median, which is less sensitive to outliers than the mean.

In [16]:
# df['Cholesterol'] = df['Cholesterol'].fillna(df['Cholesterol'].median())
# df['RestingBP'] = df['RestingBP'].fillna(df['RestingBP'].median())
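
To illustrate why the median is a reasonable choice when outliers are present, the sketch below fills missing values in a made-up column that contains one extreme value. The mean is pulled towards the outlier while the median is not. The series values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy cholesterol values (illustrative only) with two missing entries and one outlier (603).
chol = pd.Series([180.0, 200.0, 220.0, 240.0, 603.0, np.nan, np.nan], name="Cholesterol")

mean_value = chol.mean()      # NaN is skipped by default
median_value = chol.median()  # NaN is skipped by default

# Fill the gaps with the median, which the single outlier does not distort.
chol_filled = chol.fillna(median_value)

print(f"mean = {mean_value:.1f}, median = {median_value:.1f}")
print(chol_filled.tolist())
```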

In [None]:
# df.isna().sum()

In [None]:
plt.figure(figsize=(15,10))
plt.subplot(1,2,1)
sns.histplot(data=df,x='MaxHR',kde=True)
plt.subplot(1,2,2)
sns.histplot(data=df,x='Oldpeak',kde=True)

- Oldpeak is right-skewed
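
The visual impression of skew can be backed by a number. The sketch below calls `Series.skew()` on made-up values shaped like a right-skewed variable (many small values, a few large ones); a positive result indicates a longer right tail. On the real data the call would be `df['Oldpeak'].skew()`.

```python
import pandas as pd

# Illustrative values only: mostly small, with a few large ones in the right tail.
oldpeak_like = pd.Series([0.0, 0.0, 0.0, 0.2, 0.5, 1.0, 1.0, 1.5, 2.0, 4.0, 6.2])

skewness = oldpeak_like.skew()  # sample skewness; > 0 means right-skewed
print(f"skewness = {skewness:.2f}")
```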

In [None]:
plt.figure(figsize=(25,50))
for i,feature in enumerate(categorical_features):
    plt.subplot(5,1,i+1)
    sns.countplot(data=df,x=feature,hue='HeartDisease')

Conclusions:
- Far more males than females were tested.
- Most cases have a Flat or Up ST slope.
- The resting electrocardiogram results were normal for most of the patients.
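
The counts behind these plots can also be read as a table. A small sketch with `pd.crosstab` on made-up rows (the column names match the dataset, the values do not come from it):

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Sex": ["M", "M", "F", "M", "F", "M"],
    "HeartDisease": [1, 0, 0, 1, 1, 1],
})

# Rows: category of the feature; columns: target class; cells: number of patients.
counts = pd.crosstab(toy["Sex"], toy["HeartDisease"])
print(counts)

# Share of each target class within every category (each row sums to 1).
shares = pd.crosstab(toy["Sex"], toy["HeartDisease"], normalize="index")
print(shares.round(2))
```

The normalized table makes categories of different sizes comparable, which raw count plots do not.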

In [20]:
# df['ExerciseAngina'] = df['ExerciseAngina'].map({'N':0,'Y':1})
# df['Sex'] = df['Sex'].map({'M':1,'F':0})

In [24]:
# from sklearn.preprocessing import TargetEncoder
# te = TargetEncoder()
# df['ST_Slope'] = te.fit_transform(df[['ST_Slope']],df['HeartDisease'])
# df['ChestPainType'] = te.fit_transform(df[['ChestPainType']],df['HeartDisease'])

In [26]:
# freq = df["RestingECG"].value_counts(normalize=True)
# df["RestingECG"] = df["RestingECG"].map(freq)

In [None]:

plt.subplot(1, 2, 2)
ax = sns.countplot(df, x="HeartDisease", edgecolor="black", linewidth=0.8)
plt.title("Heart Disease cases")
plt.tight_layout(pad=3)

There are slightly more cases with heart disease than without, so the dataset is roughly balanced.
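
The class balance can be checked numerically as well. A minimal sketch on a made-up target column; on the real data the same call would be `df['HeartDisease'].value_counts(normalize=True)`.

```python
import pandas as pd

# Made-up target values for illustration only.
target = pd.Series([1, 0, 1, 1, 0, 1, 0, 1, 0, 1], name="HeartDisease")

# Fraction of each class; values near 0.5 indicate a balanced binary target.
class_share = target.value_counts(normalize=True)
print(class_share)
```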