# 0. Outline
1. Introduction
2. Preparations
    2.1 Load Libraries
    2.2 Load Data
3. Overview of Data: File structure & content
4. Initial Exploration
5. Individual features
6. Feature relations
7. Feature Engineering & Modelling

# 1. Introduction
This notebook presents an extensive EDA and compares how multiple models perform on the task of heart-attack prediction.

["One person dies every 36 seconds in the United States from cardiovascular disease. About 655,000 Americans die from heart disease each year—that's 1 in every 4 deaths."](https://www.cdc.gov/heartdisease/facts.htm)

Thus, heart-attack analysis should not be limited to prediction; it should also identify the factors that influence risk and their relative importance in estimating it. Further, instead of predicting a binary value of 1 or 0, a risk probability would be a more useful metric.

# 2. Preparations

## 2.1 Load libraries

In [None]:
!pip install autoviml

In [None]:
import os

import numpy as np                         # linear algebra
import pandas as pd                        # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt            # plotting
import seaborn as sns                      # plotting

import warnings

## 2.2 Load Data

In [None]:
o_sat = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/o2Saturation.csv')
df    = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')

# 3. Overview of Data: File structure & content

The data is provided in two files:
* heart.csv
* o2Saturation.csv

'heart.csv' contains the medical variables of individual patients in the first 13 columns, followed by a binary target column in which 0 and 1 indicate a lower and a higher chance of heart attack, respectively. It is imported as the 'df' dataframe.

In [None]:
df.head()

The columns of df dataframe are described as:
* age: The age of patient. With increased age, it can be hypothesized that the probability of heart attack will increase.

* sex: This is provided as a binary feature with values '0' and '1'. However, it is not clear which value denotes male and which female. Based on [medical observations](https://www.health.harvard.edu/heart-health/throughout-life-heart-attacks-are-twice-as-common-in-men-than-women), we can make an educated guess about which value is which. However, if the dataset is not randomly sampled or has been conditioned on sex, the guess may not hold.

* cp: Chest pain type. Chest pain can happen due to multiple reasons, including those [not related to cardiac issues](https://www.healthline.com/health/chest-pain#TOC_TITLE_HDR_1). This feature classifies the chest pain into those related or not related to heart issues.

* trtbps: Resting blood pressure (in mm Hg). High blood pressure has been linked to an [increased probability of heart attack](https://www.heart.org/en/health-topics/high-blood-pressure/health-threats-from-high-blood-pressure/how-high-blood-pressure-can-lead-to-a-heart-attack).

* chol: Cholesterol in mg/dl, fetched via BMI sensor. This dataset doesn't specify the kind of cholesterol. There are two main types of cholesterol: low-density lipoprotein (LDL), or "bad" cholesterol, and high-density lipoprotein (HDL), or "good" cholesterol. [A high level of LDL leads to a build-up of plaque inside the blood vessels, which might lead to a heart attack.](https://medlineplus.gov/lab-tests/cholesterol-levels/) Conversely, HDL helps get rid of "bad" LDL cholesterol. We may hypothesize that the reported cholesterol is LDL, and that heart-attack risk should increase with chol levels.

* fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false). [Higher blood sugar might lead to increased risk of heart attacks](https://www.webmd.com/diabetes/news/20160108/high-blood-sugar-may-increase-heart-attack-complications-study)

* rest_ecg (column `restecg`): Resting electrocardiographic results. It has 3 values: 0 is normal, 1 indicates an [ST-T wave abnormality](http://hqmeded-ecg.blogspot.com/2019/02/st-depression-and-t-wave-inversion-in.html) (the linked case discusses right ventricular hypertrophy as the cause of ST depression and T-wave inversion), and 2 shows probable or definite left ventricular hypertrophy by Estes' criteria. Right ventricular hypertrophy (abnormal enlargement of the cardiac muscle surrounding the right ventricle) [usually occurs due to chronic lung disease or structural defects in the heart](https://en.wikipedia.org/wiki/Right_ventricular_hypertrophy). It can be benign or can lead to heart complications. Left ventricular hypertrophy (abnormal enlargement of the cardiac muscle surrounding the left ventricle) happens as a [result of other heart problems](https://www.heart.org/en/health-topics/heart-valve-problems-and-disease/heart-valve-problems-and-causes/what-is-left-ventricular-hypertrophy-lvh) and can lead to heart failure.

* thalach (column `thalachh`): Maximum heart rate achieved. The maximum heart rate for a healthy person can be estimated using a [simple formula](https://www.cdc.gov/physicalactivity/basics/measuring/heartrate.htm). It can be used to understand whether the person is doing well in terms of aerobic health, which is supposed to be related to heart health.

* exang (column `exng`): Exercise-induced angina (1 = yes; 0 = no). Chest pain during exercise can be read in line with cp (chest pain type). Chest pain can happen for multiple reasons, and heart attack is one of them.

* oldpeak: The description of this feature has not been provided.

* slp: Possibly a slope. The description of this feature has not been provided.

* thall: By analogy with thalach, it can be assumed to be something related to heart rate, maybe resting heart rate.

* output: 0= less chance of heart attack 1= more chance of heart attack
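
The "simple formula" referenced for thalach is, as the linked CDC page describes it, an age-predicted maximum heart rate of 220 minus age. A minimal sketch of how the recorded value could be compared with that estimate; the helper name `estimated_max_heart_rate` and the toy rows are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

def estimated_max_heart_rate(age):
    """Age-predicted maximum heart rate in beats per minute: 220 - age."""
    return 220 - age

# Toy rows standing in for df[['age', 'thalachh']]; the values are made up.
toy = pd.DataFrame({"age": [29, 54, 77], "thalachh": [202, 150, 120]})
toy["est_max_hr"] = estimated_max_heart_rate(toy["age"])
# Fraction of the age-predicted maximum that was actually reached.
toy["pct_of_est_max"] = toy["thalachh"] / toy["est_max_hr"]
print(toy)
```

On the real data the same two lines could be applied to `df` directly to create a candidate feature.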

o2Saturation.csv provides the oxygen saturation. However, the number of rows is different from heart.csv and there is no mapping file provided.

In [None]:
o_sat.head()

# 4. Initial Exploration

In [None]:
df.shape

In [None]:
df.describe()

The mean age in the dataset is a little over 54 years, with a minimum of 29 and a maximum of 77, so the dataset leans towards a comparatively older population. The mean of the output is 0.545, i.e. the dataset is roughly balanced in terms of the proportion of high- and low-risk individuals. The mean of the sex feature is 0.68, which means that one sex is represented more than the other.

## Missing values

In [None]:
print(df.isnull().sum())
print(df.info())

There are no missing values.

## Baseline model based on output proportion

In [None]:
df.output.value_counts(normalize=True) * 100

As 54.46% of the people in the dataset have a high risk of heart attack, a baseline model can predict that everyone is at high risk and be 54.46% accurate. Let's see how much we can gain over this baseline.
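
The baseline above can be computed rather than read off by eye. A minimal sketch, assuming a hypothetical helper `majority_baseline_accuracy` and a toy target in place of `df['output']`:

```python
import pandas as pd

def majority_baseline_accuracy(y):
    """Return the most frequent class and the accuracy (in %) of always predicting it."""
    shares = pd.Series(y).value_counts(normalize=True)
    return shares.idxmax(), shares.max() * 100

# Toy target standing in for df['output']; the values are made up.
toy_output = pd.Series([1, 1, 1, 0, 0])
label, acc = majority_baseline_accuracy(toy_output)
print(f"Always predicting {label} gives {acc:.2f}% accuracy")
```

Calling the helper on `df['output']` should reproduce the share printed by `value_counts(normalize=True)`.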

# 5. Individual features

We look at two specific aspects of individual features:
* The individual distribution
* The distribution conditioned on the output

In [None]:
plt.figure(figsize=(15,8))
sns.set_theme(style="darkgrid")
#sns.figsize = [12,10]

## Age

In [None]:
plt.figure(figsize=(12,10))
sns.countplot(x='age', data=df,);

Age ranges from 29 to 77, with a peak around 58 years. The distribution is left-skewed. Let's check the distribution conditional on the output.
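
The skew can also be checked numerically: a negative sample skewness indicates a longer tail on the left. A small sketch on made-up ages (a stand-in for `df['age']`, chosen to have a tail toward younger patients):

```python
import pandas as pd

# Toy ages standing in for df['age']; the values are made up.
toy_age = pd.Series([30, 45, 52, 55, 57, 58, 58, 59, 60, 61])

# Series.skew() returns the sample skewness; below zero means left-skewed.
skewness = toy_age.skew()
print(f"skewness = {skewness:.2f}")
```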

In [None]:
warnings.filterwarnings(action="ignore")

fig = plt.figure(figsize = (12,10))

plt.subplot(221)
sns.histplot(x='age', hue='output', bins=range(25, 81, 1), data=df,kde=True)
plt.subplot(222)
sns.swarmplot(y='age', x='output', data=df)
plt.subplot(223)
sns.boxplot(y='age', x='output', data=df)
plt.subplot(224)
sns.violinplot(y='age', x='output', data=df,)
plt.tight_layout()
plt.show()

It seems that there is a weak inverse relation between age and high risk of heart attack, which is counter-intuitive. It suggests that some confounding factor may explain this relation.


## Sex

In [None]:
fig, axes = plt.subplots(figsize=(12, 10) , nrows = 2, ncols = 1)
fig.suptitle('Sex')

pd.crosstab(df.sex, df.output).plot.bar(stacked=True, figsize = (12,10), ax=axes[0]);
axes[0].set_ylabel('Sum of output')
axes[0].get_xaxis().get_label().set_visible(False)
#axes[0].set_title("Sex vs output")

pd.crosstab(df.sex, df.output).apply(lambda r: r/r.sum()*100, axis=1).plot.\
bar(stacked=True, figsize = (12,10), ax=axes[1]);
plt.ylabel('Percent Distribution of output')

for rec in axes[1].patches:
    height = rec.get_height()
    axes[1].text(rec.get_x() + rec.get_width() / 2, 
              rec.get_y() + height / 2,
              "{:.0f}%".format(height),
              ha='center', 
              va='bottom')

As discussed before, the data is unbalanced in favour of sex category '1'. However, as is clearly visible, the proportion of high-risk persons is higher in sex category 0. Since heart attacks are reported to be more common in men, 0 might represent males and 1 females.
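
The row percentages shown in the lower bar chart can also be obtained without the `apply(lambda ...)` step, since `pd.crosstab` accepts `normalize='index'`. A minimal sketch on toy rows (made-up values standing in for `df[['sex', 'output']]`):

```python
import pandas as pd

# Toy rows standing in for df[['sex', 'output']]; the values are made up.
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "output": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# normalize='index' divides each row by its row total, giving per-category shares.
pct = pd.crosstab(toy["sex"], toy["output"], normalize="index") * 100
print(pct.round(1))
```

The same call works for the other categorical features (cp, fbs, restecg, exng, slp, thall) by swapping the first argument.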

## Cp: Chest pain type

In [None]:
fig, axes = plt.subplots(figsize=(12, 10) , nrows = 2, ncols = 1)
fig.suptitle('Cp: Chest pain type')

pd.crosstab(df.cp, df.output).plot.bar(stacked=True, figsize = (12,10), ax=axes[0]);
axes[0].set_ylabel('Sum of output')
axes[0].get_xaxis().get_label().set_visible(False)
#axes[0].set_title("cp vs output")

pd.crosstab(df.cp, df.output).apply(lambda r: r/r.sum()*100, axis=1).plot.\
bar(stacked=True, figsize = (12,10), ax=axes[1]);
plt.ylabel('Percent Distribution of output')

for rec in axes[1].patches:
    height = rec.get_height()
    axes[1].text(rec.get_x() + rec.get_width() / 2, 
              rec.get_y() + height / 2,
              "{:.0f}%".format(height),
              ha='center', 
              va='bottom')

The chest pain types are as follows:
* Value 1: typical angina
* Value 2: atypical angina
* Value 3: non-anginal pain
* Value 4: asymptomatic

Typical angina is chest pain that has the characteristics of a heart attack. In the current case we see that, surprisingly, it has a low share of high-risk cases. It suggests that this angina, by showing symptoms, warned patients of an impending fatality, which could have led to medical help and/or a change in lifestyle.

The most dangerous type is atypical angina, which does not have all the characteristics of a heart attack. So it might not have been taken as seriously, or as a warning of an increasing risk of heart attack.

Non-anginal pain is pain that is not related to the heart but could be due to something else. It is interesting that it has almost 80% prevalence of risky cases.

## trtbps: Resting blood pressure (in mm Hg)

In [None]:
warnings.filterwarnings(action="ignore")

fig = plt.figure(figsize = (12,10))
fig.suptitle('trtbs: Resting blood pressure (in mm Hg)')

plt.subplot(221)
sns.histplot(x='trtbps', hue='output', bins=range(90, 210, 1), data=df,kde=True)
plt.subplot(222)
sns.swarmplot(y='trtbps', x='output', data=df)
plt.subplot(223)
sns.boxplot(y='trtbps', x='output', data=df)
plt.subplot(224)
sns.violinplot(y='trtbps', x='output', data=df,)
plt.tight_layout()
plt.show()

Based on the plots, there is no significant difference in the distribution of resting blood pressure between the two groups.
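
A visual impression like this can be backed by a two-sample test. The sketch below uses the Mann-Whitney U test from SciPy, which is a choice made here for illustration and not something the notebook itself runs; the two arrays are synthetic stand-ins for the low- and high-risk groups.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Synthetic resting blood pressures for two groups; these are not real data.
low_risk = rng.normal(loc=130, scale=17, size=140)
high_risk = rng.normal(loc=130, scale=17, size=160)

# Two-sided test of whether one group tends to have larger values than the other.
stat, p_value = mannwhitneyu(low_risk, high_risk, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```

On the real data the two samples would be `df.loc[df.output == 0, 'trtbps']` and `df.loc[df.output == 1, 'trtbps']`.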

## chol: cholesterol

In [None]:
warnings.filterwarnings(action="ignore")

fig = plt.figure(figsize = (12,10))
fig.suptitle('chol: cholestoral')

plt.subplot(221)
sns.histplot(x='chol', hue='output', bins=range(110, 600, 1), data=df,kde=True)
plt.subplot(222)
sns.swarmplot(y='chol', x='output', data=df)
plt.subplot(223)
sns.boxplot(y='chol', x='output', data=df)
plt.subplot(224)
sns.violinplot(y='chol', x='output', data=df,)
plt.tight_layout()
plt.show()

The cholesterol levels do not seem to differ significantly between the low- and high-risk groups. However, there are outliers in the high-risk group with very high cholesterol levels. Such extremely high cholesterol levels are absent in the low-risk group.
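
The outliers drawn by the boxplot follow the usual 1.5 × IQR rule, so they can be listed per group. A minimal sketch, assuming a hypothetical helper `iqr_outliers` and toy rows in place of `df[['chol', 'output']]`:

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = pd.Series(values)
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]

# Toy rows standing in for df[['chol', 'output']]; the values are made up.
toy = pd.DataFrame({
    "chol":   [200, 210, 220, 230, 240, 250, 205, 215, 225, 235, 245, 560],
    "output": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})
for label, group in toy.groupby("output"):
    print(label, iqr_outliers(group["chol"]).tolist())
```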

## fbs : (fasting blood sugar)

In [None]:
fig, axes = plt.subplots(figsize=(12, 10) , nrows = 2, ncols = 1)
fig.suptitle('fbs: fasting blood sugar')

pd.crosstab(df.fbs, df.output).plot.bar(stacked=True, figsize = (12,10), ax=axes[0]);
axes[0].set_ylabel('Sum of output')
axes[0].get_xaxis().get_label().set_visible(False)
#axes[0].set_title("fbs vs output")

pd.crosstab(df.fbs, df.output).apply(lambda r: r/r.sum()*100, axis=1).plot.\
bar(stacked=True, figsize = (12,10), ax=axes[1]);
plt.ylabel('Percent Distribution of output')

for rec in axes[1].patches:
    height = rec.get_height()
    axes[1].text(rec.get_x() + rec.get_width() / 2, 
              rec.get_y() + height / 2,
              "{:.0f}%".format(height),
              ha='center', 
              va='bottom')

Less than 20% of the total cases have high fasting blood sugar. However, the distribution of high and low risk is almost same for both high and low fbs groups.

## rest_ecg : resting electrocardiographic results

In [None]:
fig, axes = plt.subplots(figsize=(12, 10) , nrows = 2, ncols = 1)
fig.suptitle('rest_ecg : resting electrocardiographic results')

pd.crosstab(df.restecg, df.output).plot.bar(stacked=True, figsize = (12,10), ax=axes[0]);
axes[0].set_ylabel('Sum of output')
axes[0].get_xaxis().get_label().set_visible(False)
#axes[0].set_title("rest_ecg vs output")

pd.crosstab(df.restecg, df.output).apply(lambda r: r/r.sum()*100, axis=1).plot.\
bar(stacked=True, figsize = (12,10), ax=axes[1]);
plt.ylabel('Percent Distribution of output')

for rec in axes[1].patches:
    height = rec.get_height()
    axes[1].text(rec.get_x() + rec.get_width() / 2, 
              rec.get_y() + height / 2,
              "{:.0f}%".format(height),
              ha='center', 
              va='bottom')

Value 1, i.e. "having ST-T wave abnormality", has the highest share of high-risk cases; this is the category associated above with hypertrophy of the right ventricle. Surprisingly, the left ventricular hypertrophy category (value 2) has a lower share of high-risk patients than even the normal resting ECG category.

## thalach : maximum heart rate achieved

In [None]:
warnings.filterwarnings(action="ignore")

fig = plt.figure(figsize = (12,10))
fig.suptitle('thalach : maximum heart rate achieved')

plt.subplot(221)
sns.histplot(x='thalachh', hue='output', bins=range(60, 220, 1), data=df,kde=True)
plt.subplot(222)
sns.swarmplot(y='thalachh', x='output', data=df)
plt.subplot(223)
sns.boxplot(y='thalachh', x='output', data=df)
plt.subplot(224)
sns.violinplot(y='thalachh', x='output', data=df,)
plt.tight_layout()
plt.show()

Patients who achieve a higher maximum heart rate have a higher risk of heart attack. However, there is significant overlap between the two groups.

## exang: exercise induced angina (1 = yes; 0 = no)

In [None]:
fig, axes = plt.subplots(figsize=(12, 10) , nrows = 2, ncols = 1)
fig.suptitle('exng : exercise induced angina')

pd.crosstab(df.exng, df.output).plot.bar(stacked=True, figsize = (12,10), ax=axes[0]);
axes[0].set_ylabel('Sum of output')
axes[0].get_xaxis().get_label().set_visible(False)
#axes[0].set_title("exng vs output")

pd.crosstab(df.exng, df.output).apply(lambda r: r/r.sum()*100, axis=1).plot.\
bar(stacked=True, figsize = (12,10), ax=axes[1]);
plt.ylabel('Percent Distribution of output')

for rec in axes[1].patches:
    height = rec.get_height()
    axes[1].text(rec.get_x() + rec.get_width() / 2, 
              rec.get_y() + height / 2,
              "{:.0f}%".format(height),
              ha='center', 
              va='bottom')

The group in which angina was induced by exercise has a lower prevalence of high-risk cases. The angina may have been caused by the exertion itself, or it may have served as a warning about the condition of the heart, leading the patient to take proper measures.

The descriptions of 'oldpeak', 'thall' & 'slp' have not been provided. However, we can have a look at their respective distributions.

In [None]:
df[['oldpeak', 'slp', 'thall']].describe()

In [None]:
df[['oldpeak', 'slp', 'thall']].nunique()

## Oldpeak

In [None]:
warnings.filterwarnings(action="ignore")

fig = plt.figure(figsize = (12,10))
fig.suptitle('oldpeak')

plt.subplot(221)
sns.histplot(x='oldpeak', hue='output', bins=range(0, 7, 1), data=df,kde=True)
plt.subplot(222)
sns.swarmplot(y='oldpeak', x='output', data=df)
plt.subplot(223)
sns.boxplot(y='oldpeak', x='output', data=df)
plt.subplot(224)
sns.violinplot(y='oldpeak', x='output', data=df,)
plt.tight_layout()
plt.show()

The distribution differs between the high- and low-risk cases, with high-risk cases having a lower value of 'oldpeak' on average. The high-risk cases are heavily concentrated near 0.

## slp

In [None]:
fig, axes = plt.subplots(figsize=(12, 10) , nrows = 2, ncols = 1)
fig.suptitle('slp')

pd.crosstab(df.slp, df.output).plot.bar(stacked=True, figsize = (12,10), ax=axes[0]);
axes[0].set_ylabel('Sum of output')
axes[0].get_xaxis().get_label().set_visible(False)
#axes[0].set_title("slp vs output")

pd.crosstab(df.slp, df.output).apply(lambda r: r/r.sum()*100, axis=1).plot.\
bar(stacked=True, figsize = (12,10), ax=axes[1]);
plt.ylabel('Percent Distribution of output')

for rec in axes[1].patches:
    height = rec.get_height()
    axes[1].text(rec.get_x() + rec.get_width() / 2, 
              rec.get_y() + height / 2,
              "{:.0f}%".format(height),
              ha='center', 
              va='bottom')

For this feature too, most cases have a value of either 1 or 2. Value 2 has the highest prevalence of high-risk cases.

## thall

In [None]:
fig, axes = plt.subplots(figsize=(12, 10) , nrows = 2, ncols = 1)
fig.suptitle('thall')

pd.crosstab(df.thall, df.output).plot.bar(stacked=True, figsize = (12,10), ax=axes[0]);
axes[0].set_ylabel('Sum of output')
axes[0].get_xaxis().get_label().set_visible(False)
#axes[0].set_title("thall vs output")

pd.crosstab(df.thall, df.output).apply(lambda r: r/r.sum()*100, axis=1).plot.\
bar(stacked=True, figsize = (12,10), ax=axes[1]);
plt.ylabel('Percent Distribution of output')

for rec in axes[1].patches:
    height = rec.get_height()
    axes[1].text(rec.get_x() + rec.get_width() / 2, 
              rec.get_y() + height / 2,
              "{:.0f}%".format(height),
              ha='center', 
              va='bottom')

Code '0' has hardly any cases attached to it. Code 2 has the highest number of cases, and it has the highest prevalence of high-risk patients too.

In [None]:
df.head()

# 6. Feature relations

## Pairplot

Let's create a pairplot of the numerical (continuous) features:

In [None]:
cols = ['age','trtbps','chol','thalachh','oldpeak']
g = sns.pairplot(data=df, vars=cols, height=1.7,  # 'size' was renamed to 'height' in seaborn
                 hue='output', palette=['green','red'],
                plot_kws={'alpha':0.5}, )
g.set(xticklabels=[])

On visual inspection, there does not seem to be a very strong correlation between any two features. However, the following relations seem interesting to explore:
* Age vs sex
* Chol vs sex vs age
* Chol vs sex vs output

There are many more possible combinations. Let's focus on these for now.

## Multiple feature relation

In [None]:
fig, axes = plt.subplots(figsize=(12, 10) , nrows = 2, ncols = 2)

sns.pointplot(y='age', x='sex', hue='output' ,data=df, ax=axes[0,0]);
sns.pointplot(y='age', x='cp', hue='output' ,data=df, ax=axes[0,1]);
sns.pointplot(y='chol', x='cp', hue='output' ,data=df, ax=axes[1,0]);
sns.swarmplot(x = 'sex', y='chol', hue='output', data=df, ax=axes[1,1])

Some interesting observations:
1. Even though we suspected that '1' represents female, looking at it through age suggests otherwise. Males are more at risk of heart attack, and at an earlier age, than females. So '1' seems to be male. The data might have been deliberately sampled to increase the number of high-risk cases.
2. Comparing the plots conditioned on cholesterol and on age, chest pain shows an inverse relation for the asymptomatic case. For asymptomatic pain, the high-risk cases belong to a higher age group. Asymptomatic pain, even with low cholesterol, can be fatal.
3. The cholesterol distribution has a wider range for sex code 0.


I will add further relations between features.

## Correlation plot between continuous features

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(df[['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']].corr());

A correlation (positive or negative) means that there is a linear relationship between two features; its direction is given by the sign of the correlation and its strength by the magnitude.
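
To make the definition concrete, the Pearson coefficient is the covariance of two features divided by the product of their standard deviations. The sketch below computes it by hand and compares it with pandas; the helper `pearson_r` and the toy rows are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# Toy rows standing in for df[['age', 'thalachh']]; the values are made up.
toy = pd.DataFrame({"age": [35, 45, 55, 65, 75], "thalachh": [185, 170, 160, 150, 130]})
r_manual = pearson_r(toy["age"], toy["thalachh"])
r_pandas = toy["age"].corr(toy["thalachh"])  # pandas uses Pearson by default
print(round(r_manual, 3), round(r_pandas, 3))
```

Passing `annot=True` to `sns.heatmap` would print these coefficients inside the cells of the plot.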

**We observe**:

* There is a slight negative correlation between the 'maximum heart rate achieved' and 'age'.
* There is a slight negative correlation between the 'maximum heart rate achieved' and 'oldpeak'.

However, no pair of features is highly correlated, so we can ignore correlation for now.

# 7. Feature Engineering & Modelling

We will use the autoviml package for automatic feature generation and selection of the best features.

In [None]:
from autoviml.Auto_ViML import Auto_ViML

In [None]:
df.nunique()

We convert the continuous features from int to float.

In [None]:
#df.dtypes
numeric_col = ['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']

for i in numeric_col:
    df[i] = df[i].astype('float')
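
The same conversion can be written without a loop by passing a column-to-dtype mapping to `DataFrame.astype`. A minimal sketch on a toy frame (made-up values standing in for `df`):

```python
import pandas as pd

# Toy frame standing in for df; the values are made up.
toy = pd.DataFrame({"age": [63, 37], "chol": [233, 250], "sex": [1, 0]})
numeric_col = ["age", "chol"]

# Cast all listed columns at once; columns that are not listed keep their dtype.
toy = toy.astype({col: "float64" for col in numeric_col})
print(toy.dtypes)
```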

In [None]:
model, features, trainm, testm = Auto_ViML(
    train = df,
    target = 'output',
    #"",
    #"",
    hyper_param="GS",
    feature_reduction=False,
    scoring_parameter="balanced_accuracy",
    KMeans_Featurizer=False,
    Boosting_Flag=False,
    Binning_Flag=False,
    Add_Poly=1,
    Stacking_Flag=False,
    Imbalanced_Flag=False,
    verbose=2,
)
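
The introduction argues for a risk probability instead of a binary label. Independently of Auto_ViML, the idea can be sketched with scikit-learn's `LogisticRegression`, whose `predict_proba` returns class probabilities. The model choice and the synthetic data below are assumptions made for illustration, not the notebook's fitted model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic features and binary target; stand-ins for the heart data, not real values.
X = rng.normal(size=(300, 5))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
clf = LogisticRegression().fit(X_train, y_train)

# Column 1 of predict_proba is the estimated probability of class 1 ("high risk").
risk = clf.predict_proba(X_test)[:, 1]
labels = clf.predict(X_test)
print(risk[:5].round(3), labels[:5])
```

The fitted coefficients in `clf.coef_` give a first, rough view of how strongly each feature pushes the predicted risk up or down.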