# Heart Failure: Exploratory Data Analysis

Citation: Davide Chicco, Giuseppe Jurman: Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020). (https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5)

# Initial Thoughts

1. All features are either numeric or binary, with no string-based categorical variables.
2. With the exception of sex, all other binary features use 0 to represent the absence of the characteristic (e.g., no anaemia, no diabetes).
3. None of the features initially stands out as messy or low-signal.
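
The first two observations can be checked programmatically. The sketch below is only an illustration: `find_binary_columns` is a hypothetical helper, not part of the notebook, and the small frame copies two rows (and a few columns) from the dataset preview so the block runs on its own.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def find_binary_columns(frame: pd.DataFrame) -> list[str]:
    """Return columns whose non-null values are a subset of {0, 1}."""
    return [
        col for col in frame.columns
        if set(frame[col].dropna().unique()) <= {0, 1}
    ]


# Two rows copied from the dataset preview, trimmed to a few columns.
sample = pd.DataFrame(
    {
        "age": [75.0, 65.0],
        "anaemia": [0, 1],
        "diabetes": [0, 1],
        "sex": [1, 0],
        "serum_creatinine": [1.9, 2.7],
        "DEATH_EVENT": [1, 1],
    }
)

# True when no column holds strings or other non-numeric types.
all_numeric = all(is_numeric_dtype(sample[col]) for col in sample.columns)
binary_columns = find_binary_columns(sample)
print(all_numeric, binary_columns)
```

On the full dataset, the same two checks would be run against `df` instead of `sample`.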

In [18]:
import pandas as pd

url = "https://raw.githubusercontent.com/CarsonShively/Heart-Failure/refs/heads/main/data/heart_failure.csv"
df = pd.read_csv(url)

df.head()

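| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |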


# Class imbalance

There is a slight class imbalance (roughly 68% of patients survived and 32% died). I can apply class weights to handle this.

In [19]:
df['DEATH_EVENT'].value_counts(normalize=True)

| DEATH_EVENT | proportion |
|---|---|
| 0 | 0.67893 |
| 1 | 0.32107 |
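
Given these proportions, class weights can be derived with the common "balanced" heuristic, where each class is weighted inversely to its frequency. This is a minimal sketch under that assumption, not necessarily the weighting the final model uses; the variable names are illustrative.

```python
# Class proportions copied from the value_counts output above.
proportions = {0: 0.67893, 1: 0.32107}

# "Balanced" heuristic: weight_c = 1 / (n_classes * proportion_c),
# which is equivalent to n_samples / (n_classes * count_c).
n_classes = len(proportions)
class_weights = {
    label: 1.0 / (n_classes * share) for label, share in proportions.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight = {weight:.3f}")
```

The minority class (death) receives roughly twice the weight of the majority class. A dictionary like this can typically be passed as the `class_weight` argument of scikit-learn classifiers, and `class_weight="balanced"` applies the same formula automatically.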


# Skew

Significant skew:
*   serum_creatinine
*   creatinine_phosphokinase
*   platelets

All three are right-skewed. Applying a log transformation and then clipping outliers at the 0.01 and 0.99 quantiles reduced the skew of each considerably, although serum_creatinine remains moderately skewed (about 1.99).

Minor skew:
*  serum_sodium
*  ejection_fraction

Simply clipping their outliers at the 0.01 and 0.99 quantiles was enough to reduce their skew.
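
To show why log-then-clip tames a right-skewed feature, here is a small sketch on synthetic data (not the heart failure dataset). `sample_skew` and `log_then_clip` are hypothetical helpers written for this illustration; `sample_skew` uses the biased estimator $m_3 / m_2^{3/2}$, which is the default behaviour of `scipy.stats.skew`.

```python
import numpy as np


def sample_skew(values: np.ndarray) -> float:
    """Biased sample skewness: third central moment over variance ** 1.5."""
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return float(m3 / m2 ** 1.5)


def log_then_clip(
    values: np.ndarray, lower_q: float = 0.01, upper_q: float = 0.99
) -> np.ndarray:
    """Apply log1p, then clip to the given quantiles of the transformed data."""
    logged = np.log1p(values)
    lower, upper = np.quantile(logged, [lower_q, upper_q])
    return np.clip(logged, lower, upper)


# Synthetic right-skewed sample, used only for illustration.
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=1_000)
transformed = log_then_clip(raw)

skew_before = sample_skew(raw)
skew_after = sample_skew(transformed)
print(f"before: {skew_before:.3f}, after: {skew_after:.3f}")
```

The log compresses the long right tail, and clipping then limits the influence of the few remaining extreme values.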

In [20]:
from scipy.stats import skew
import numpy as np

numeric_features = [
    'age',
    'ejection_fraction',
    'serum_creatinine',
    'serum_sodium',
    'creatinine_phosphokinase',
    'platelets',
    'time'
]

print("========== Raw Skew ==========")
for col in numeric_features:
    print(f"{col:30}: skew = {skew(df[col].dropna()):.3f}")

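# Log-transform the three strongly right-skewed features.
# Note: this modifies df in place, so later cells see the transformed values.
df['serum_creatinine'] = np.log1p(df['serum_creatinine'])
df['creatinine_phosphokinase'] = np.log1p(df['creatinine_phosphokinase'])
df['platelets'] = np.log1p(df['platelets'])

# Clip the log-transformed and the mildly skewed features at the 0.01 and 0.99 quantiles.
for col in ['serum_creatinine', 'creatinine_phosphokinase', 'platelets', 'serum_sodium', 'ejection_fraction']:
    lower = df[col].quantile(0.01)
    upper = df[col].quantile(0.99)
    df[col] = df[col].clip(lower=lower, upper=upper)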

print("========== Post-Transformation ==========")
for col in numeric_features:
    print(f"{col:30}: skew = {skew(df[col].dropna()):.3f}")

========== Raw Skew ==========
age                           : skew = 0.421
ejection_fraction             : skew = 0.553
serum_creatinine              : skew = 4.434
serum_sodium                  : skew = -1.043
creatinine_phosphokinase      : skew = 4.441
platelets                     : skew = 1.455
time                          : skew = 0.127
========== Post-Transformation ==========
age                           : skew = 0.421
ejection_fraction             : skew = 0.454
serum_creatinine              : skew = 1.986
serum_sodium                  : skew = -0.444
creatinine_phosphokinase      : skew = 0.409
platelets                     : skew = -0.965
time                          : skew = 0.127


# Linearity

Strong non-linear relationships:
*   ejection_fraction
*   time

Weak non-linear relationship:
*   serum_creatinine

For linear models, I can add squared features to help them capture these non-linear relationships.
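
One way to add squared terms is sketched below. `add_squared_features` is a hypothetical helper, and the toy values are for illustration only; in this notebook it would be applied to `df` with the features flagged above.

```python
import pandas as pd


def add_squared_features(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of `frame` with a `<col>_squared` column per listed column."""
    out = frame.copy()
    for col in columns:
        out[f"{col}_squared"] = out[col] ** 2
    return out


# Toy values only; not taken from the full dataset.
toy = pd.DataFrame({"ejection_fraction": [20, 38, 60], "time": [4, 6, 8]})
expanded = add_squared_features(toy, ["ejection_fraction", "time"])
print(expanded)
```

Returning a copy keeps the original frame untouched, which avoids the in-place side effects that make notebook cells order-dependent. Centering a feature before squaring it is a common way to reduce the correlation between the linear and squared terms.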

In [21]:
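import statsmodels.api as sm
import warnings
# Silence all warnings (e.g., optimizer convergence warnings) to keep the output readable.
warnings.filterwarnings('ignore')

print("========== Linearity ==========")
for feature in numeric_features:
    # df already holds the log-transformed and clipped values from the previous cell.
    X = df[[feature]].dropna().copy()
    y = df.loc[X.index, 'DEATH_EVENT']

    # Add a squared term; a small p-value for it suggests a non-linear effect on the log-odds.
    X[f'{feature}_squared'] = X[feature] ** 2
    X = sm.add_constant(X)

    model = sm.Logit(y, X).fit(disp=False, method='bfgs')
    p_value = model.pvalues[f'{feature}_squared']
    print(f"{feature:30}: p-value for {feature}² = {p_value:.4f}")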

age                           : p-value for age² = 0.1076
ejection_fraction             : p-value for ejection_fraction² = 0.0001
serum_creatinine              : p-value for serum_creatinine² = 0.0048
serum_sodium                  : p-value for serum_sodium² = 0.1409
creatinine_phosphokinase      : p-value for creatinine_phosphokinase² = 0.6327
platelets                     : p-value for platelets² = 0.1932
time                          : p-value for time² = 0.0010
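
For reference, the cell above fits a separate quadratic logistic model for each feature $x$ and reports the p-value of the squared coefficient. As a sketch of what is being tested (the symbols $\beta_0, \beta_1, \beta_2$ are notation introduced here, not names from the code):

```latex
\operatorname{logit} P(\text{DEATH\_EVENT} = 1 \mid x) = \beta_0 + \beta_1 x + \beta_2 x^2,
\qquad H_0 : \beta_2 = 0
```

A small p-value is evidence against $H_0$, that is, evidence of curvature on the log-odds scale. These are single-feature fits, so they do not account for the other features.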
