## 1. Dataset import

### Fetch heart disease dataset

In [1]:
from ucimlrepo import fetch_ucirepo

heart_disease = fetch_ucirepo(id=45)

### Dataset metadata

In [2]:
print("{:=^50}".format("Metadata"))
for k, v in heart_disease.metadata.items():
    print(f"{k}:\t{v}")

uci_id:	45
name:	Heart Disease
repository_url:	https://archive.ics.uci.edu/dataset/45/heart+disease
data_url:	https://archive.ics.uci.edu/static/public/45/data.csv
abstract:	4 databases: Cleveland, Hungary, Switzerland, and the VA Long Beach
area:	Health and Medicine
tasks:	['Classification']
characteristics:	['Multivariate']
num_instances:	303
num_features:	13
feature_types:	['Categorical', 'Integer', 'Real']
demographics:	['Age', 'Sex']
target_col:	['num']
index_col:	None
has_missing_values:	yes
missing_values_symbol:	NaN
year_of_dataset_creation:	1989
last_updated:	Fri Nov 03 2023
dataset_doi:	10.24432/C52P4X
creators:	['Andras Janosi', 'William Steinbrunn', 'Matthias Pfisterer', 'Robert Detrano']
intro_paper:	{'ID': 231, 'type': 'NATIVE', 'title': 'International application of a new probability algorithm for the diagnosis of coronary artery disease.', 'authors': 'R. Detrano, A. Jánosi, W. Steinbrunn, M. Pfisterer, J. Schmid, S. Sandhu, K. Guppy, S. Lee, V. Froelicher', 'venue': 'Am

### Variables info

In [3]:
print("\n{:=^50}".format("Variables"))
important_cols = ["name", "role", "type", "units", "missing_values"]
print(heart_disease.variables[important_cols])


        name     role         type  units missing_values
0        age  Feature      Integer  years             no
1        sex  Feature  Categorical   None             no
2         cp  Feature  Categorical   None             no
3   trestbps  Feature      Integer  mm Hg             no
4       chol  Feature      Integer  mg/dl             no
5        fbs  Feature  Categorical   None             no
6    restecg  Feature  Categorical   None             no
7    thalach  Feature      Integer   None             no
8      exang  Feature  Categorical   None             no
9    oldpeak  Feature      Integer   None             no
10     slope  Feature  Categorical   None             no
11        ca  Feature      Integer   None            yes
12      thal  Feature  Categorical   None            yes
13       num   Target      Integer   None             no


## 2. Data profiling

### Profile generation

In [76]:
from pathlib import Path

from ydata_profiling import ProfileReport

data_df = heart_disease.data.original
dataset_profile = ProfileReport(data_df, title="Heart disease report")

report_path = Path.cwd() / "dataset_report.html"
dataset_profile.to_file(report_path, silent=True)

Summarize dataset: 100%|██████████| 49/49 [00:03<00:00, 15.47it/s, Completed]                 
Generate report structure: 100%|██████████| 1/1 [00:03<00:00,  3.36s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  1.26it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 237.70it/s]


### Profile overview

##### Features

- `age` [_numerical_]:
    - min. 29, max. 77, mean 54.44, median 56, std. dev. 9
    - most patients are aged between 40 (5th percentile) and 68 (95th percentile)
- `sex` [_categorical_]:
    - $values \in \{0, 1\}$
    - not evenly distributed (206 male, 97 female)
- `cp` (chest pain type) [_categorical_]:
    - $values \in \{1, 2, 3, 4\}$ = {typical angina, atypical angina, non-anginal pain, asymptomatic}
    - not evenly distributed (respectively: 23, 50, 86, 144)
- `trestbps` (resting blood pressure) [_integer_]:
    - min. 94, max. 200, mean 131.69, median 130, std. dev. 17.6
- `chol` (serum cholesterol) [_integer_]:
    - min. 126, max. 564, mean 246.69, std. dev. 51.8
- `fbs` (fasting blood sugar > 120 mg/dl) [_categorical_]:
    - $values \in \{0, 1\}$ (1 = true; 0 = false)
    - not evenly distributed (258 false, 45 true)
- `restecg` (resting electrocardiographic results) [_categorical_]:
    - $values \in \{0, 1, 2\}$, where:
        - 0: normal
        - 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
        - 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
    - not evenly distributed (0 - 151, 1 - 4, 2 - 148)
- `thalach` (maximum heart rate achieved) [_integer_]:
    - min. 71, max. 202, mean 149.6, std. dev. 22.9
- `exang` (exercise induced angina) [_categorical_]:
    - $values \in \{0, 1\}$ (1 = yes; 0 = no)
    - not evenly distributed (204 patients without exercise-induced angina, only 99 with)
- `oldpeak` (ST depression induced by exercise relative to rest) [_numerical_]:
    - min. 0, max. 6.2, mean 1.04, median 0.8, std. dev. 1.16
    - clearly not normally distributed: the values are concentrated near 0, with a long right tail
- `slope` (the slope of the peak exercise ST segment) [_categorical_]:
    - $values \in \{1, 2, 3\}$ (1 = upsloping; 2 = flat; 3 = downsloping)
    - not evenly distributed (1 - 142, 2 - 140, 3 - 21)
- `ca` (number of major vessels (0-3) colored by fluoroscopy) [_categorical_]:
    - $values \in \{0, 1, 2, 3\}$
    - not evenly distributed (0 - 176, 1 - 65, 2 - 38, 3 - 20)
    - 4 values missing
- `thal` [_categorical_]:
    - $values \in \{3, 6, 7\}$ (3 = normal; 6 = fixed defect; 7 = reversible defect)
    - not evenly distributed (3 - 166, 6 - 18, 7 - 117)
    - 2 values missing
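
The figures in the list above come from the profiling report. The same kind of summary can be cross-checked directly with pandas. The sketch below uses a small made-up frame (`toy_df` and its values are hypothetical, not taken from the dataset); on the real data the same calls would be applied to `data_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `data_df`; values are illustrative only
toy_df = pd.DataFrame(
    {
        "age": [63, 67, 67, 37, 41, 56],
        "sex": [1, 1, 1, 1, 0, 1],
        "ca": [0.0, 3.0, 2.0, 0.0, np.nan, 0.0],
    }
)

# Summary statistics for a numerical feature (min, max, mean, median, std. dev.)
age_stats = toy_df["age"].agg(["min", "max", "mean", "median", "std"])

# Category counts for categorical features (missing values are excluded by default)
sex_counts = toy_df["sex"].value_counts()
ca_counts = toy_df["ca"].value_counts()

# Number of missing values per column
missing_counts = toy_df.isna().sum()

print(age_stats)
print(sex_counts)
print(missing_counts)
```

Because `value_counts()` skips missing values by default, the category counts of a column with gaps add up to fewer rows than the dataset has, which is how the missing entries in `ca` and `thal` show up in the counts above.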

##### Correlation matrix


- Noticeable negative correlation between `age` and `thalach` (maximum heart rate achieved), also between `oldpeak` (ST depression induced by exercise relative to rest) and `thalach`.
- Positive correlation between `cp` and `exang` (categorical features).  
<img src="./assets/correlation_matrix.png" alt="Correlation Matrix" width="500"/>
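
As a rough cross-check of the heatmap above, pairwise correlations can also be computed with pandas. The sketch below uses synthetic data (the column names are borrowed from the dataset, while the values and the `toy_df` name are made up). It uses Pearson correlation, which may differ from the measure the profiling report applies to each pair of columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: a heart-rate-like value that decreases with age, plus noise
# (illustrative only, not the real data)
age = rng.uniform(29, 77, size=200)
thalach = 220 - age + rng.normal(0, 10, size=200)
toy_df = pd.DataFrame({"age": age, "thalach": thalach})

# Pearson correlation matrix of the numerical columns
corr_matrix = toy_df.corr(method="pearson")
print(corr_matrix.round(2))
```

On the real data, the same call on `data_df[num_features]` would give the numerical part of the matrix. The categorical pairs such as `cp` and `exang` call for a different association measure.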

### Numerical features distributions

##### Code

In [41]:
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats


def plot_distributions(data_df: pd.DataFrame, features: list[str], filename: str, n_cols: int = 2):
    """Plot a histogram of each feature with a fitted normal PDF and save the figure to `filename`."""
    n_features = len(features)
    n_rows = (n_features + n_cols - 1) // n_cols  # ceiling division

    # squeeze=False keeps `axes` two-dimensional even for a single row or a single column
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 3 * n_rows), squeeze=False)
    fig.suptitle("Distribution of Features vs Normal Distribution", fontsize=16)

    for i, feature in enumerate(features):
        ax = axes[i // n_cols, i % n_cols]
        # Drop missing values and sort so the fitted curve is drawn from left to right
        data = sorted(data_df[feature].dropna())

        # Fit normal distribution
        mean, std = stats.norm.fit(data)
        pdf_norm = stats.norm.pdf(data, mean, std)

        # Plot histogram and fitted normal distribution
        ax.hist(data, bins="auto", density=True, alpha=0.7)
        ax.plot(data, pdf_norm, "red", label="Normal")

        ax.set_title(feature)
        ax.set_xlabel("Values")
        ax.set_ylabel("Density")
        ax.legend()

    # Remove unused plots
    for i in range(n_features, n_rows * n_cols):
        fig.delaxes(axes[i // n_cols, i % n_cols])

    plt.tight_layout()
    plt.subplots_adjust(top=0.9)
    plt.savefig(filename)
    plt.close(fig)

In [68]:
from pathlib import Path

from sklearn.preprocessing import MinMaxScaler

num_features = ["age", "trestbps", "chol", "thalach", "oldpeak"]

original_df = heart_disease.data.features

scaler = MinMaxScaler()
scaled_num_data = scaler.fit_transform(original_df[num_features])
scaled_num_df = pd.DataFrame(scaled_num_data, columns=num_features)

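# Make sure the output directory exists before saving the figures
assets_dir = Path.cwd() / "assets"
assets_dir.mkdir(exist_ok=True)

plot_distributions(original_df, num_features, filename=assets_dir / "original_dist.png")
plot_distributions(scaled_num_df, num_features, filename=assets_dir / "scaled_dist.png")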

##### Comparison

The charts below (original data on the left, data scaled by `MinMaxScaler` on the right) lead to the following observations:
- `age`: roughly normal distribution, with a slight left skew.
- `trestbps`: roughly normal distribution, but with multiple peaks.
- `chol`: slightly right-skewed, but not far from a normal bell curve.
- `thalach`: left-skewed and somewhat bimodal. Deviates considerably from a normal distribution.
- `oldpeak`: highly skewed distribution with a large peak at 0 and a long right tail. Very far from a normal distribution.  

Based on these observations, I chose `MinMaxScaler` for further data processing. A neural network needs numerical inputs scaled to a fixed range, and since several features are clearly not normally distributed, the scaling should also keep the original shape of each distribution. `MinMaxScaler` maps each feature linearly to a fixed range ([0, 1] by default), so it fits this use case well.


<p align="center">
    <img src="./assets/original_dist.png" alt="Original features distribution" width="500"/>
    <img src="./assets/scaled_dist.png" alt="Scaled features distribution" width="500"/>
</p>
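
To make the claim about preserving the distribution concrete: min-max scaling applies the linear map x' = (x - min) / (max - min) to each feature, so the range becomes [0, 1] while shape statistics such as skewness stay the same. The following is a minimal sketch with NumPy and pandas on synthetic right-skewed data. The sample and the variable names are made up for illustration, and `MinMaxScaler` with its default `feature_range` is expected to produce the same scaled values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed sample, loosely imitating a feature like `oldpeak`
x = pd.Series(rng.exponential(scale=1.0, size=500))

# Min-max scaling: x' = (x - min) / (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())

print(f"range before: [{x.min():.2f}, {x.max():.2f}]")
print(f"range after:  [{x_scaled.min():.2f}, {x_scaled.max():.2f}]")

# Skewness is invariant under a positive linear transform
print(f"skewness before: {x.skew():.4f}")
print(f"skewness after:  {x_scaled.skew():.4f}")
```

This is also why the left and right charts look identical apart from the values on the horizontal axis.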

### Summary

1. The dataset is not well balanced. This is especially noticeable in the following features: `sex`, `fbs`, `exang`.
2. The distributions of some numerical features can be considered close to normal (`trestbps`, `chol`, `age`), but the distribution of `oldpeak` is definitely not normal.
3. None of the categorical features is evenly distributed. In each of them some categories are much more common than the others.
4. Missing values are present in `ca` (4) and `thal` (2) features. We can utilize _mode imputation_ (_most frequent_ imputation) to replace them (since they are categorical).
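
The mode imputation mentioned in the last point can be illustrated on a toy column. This is a minimal pandas sketch with made-up values (`toy_df` and the `thal_imputed` column are hypothetical names). It replaces each missing entry with the most frequent category, which is the same idea as the _most frequent_ strategy of an imputer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with missing entries (values are illustrative only)
toy_df = pd.DataFrame({"thal": [3.0, 7.0, 3.0, np.nan, 6.0, 3.0, np.nan]})

# Mode imputation: replace missing values with the most frequent category
most_frequent = toy_df["thal"].mode().iloc[0]
toy_df["thal_imputed"] = toy_df["thal"].fillna(most_frequent)

print(most_frequent)
print(toy_df)
```

With only a handful of missing values out of 303 rows, this kind of imputation changes the category proportions very little.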

## 3. Feature matrix converter code

In [75]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

cat_features = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]
num_features = ["age", "trestbps", "chol", "thalach", "oldpeak"]

# Work on a copy so that the fetched dataset is not modified in place
original_df = heart_disease.data.features.copy()

# Prepare categorical data: fill missing values (`ca`, `thal`) with the most frequent category
imputer = SimpleImputer(strategy="most_frequent")
original_df[cat_features] = imputer.fit_transform(original_df[cat_features])

# One-hot encode the categorical features
encoder = OneHotEncoder(categories="auto", sparse_output=False)
encoded_cat_data = encoder.fit_transform(original_df[cat_features])
encoded_features = encoder.get_feature_names_out(cat_features)
encoded_df = pd.DataFrame(encoded_cat_data, columns=encoded_features, index=original_df.index)

# Prepare numerical data: scale each feature to the [0, 1] range
scaler = MinMaxScaler()
original_df[num_features] = scaler.fit_transform(original_df[num_features])

# Convert DataFrame to feature matrix: scaled numerical columns followed by one-hot columns
original_df = pd.concat([original_df.drop(columns=cat_features), encoded_df], axis=1)
feature_matrix = original_df.to_numpy()
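
As a sanity check on the result, the expected width of `feature_matrix` can be worked out from the category counts listed in the profile overview: each categorical feature contributes one column per distinct value, and each numerical feature contributes one column. The sketch below only does this arithmetic. The `n_categories` dictionary is transcribed by hand from the overview, and it assumes that every listed category actually occurs in the data after imputation.

```python
# Number of distinct categories per categorical feature, as listed in the profile overview
n_categories = {
    "sex": 2, "cp": 4, "fbs": 2, "restecg": 3,
    "exang": 2, "slope": 3, "ca": 4, "thal": 3,
}
num_features = ["age", "trestbps", "chol", "thalach", "oldpeak"]

# One column per category for one-hot encoded features, one column per numerical feature
n_one_hot_columns = sum(n_categories.values())
expected_width = len(num_features) + n_one_hot_columns
print(n_one_hot_columns, expected_width)
```

On the real data, the result can be compared with `feature_matrix.shape[1]`; a mismatch would point to an unexpected category or a column that was not dropped.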
