🩺
# Heart Failure Dataset: Advanced EDA, Survival, and Modeling Notebook
**Author:** Cholpon Zhakshylykova  
**Data:** heart_failure.csv  
**Goal:** Understand, visualize, and engineer features prior to modeling.


In [21]:
# 1. Imports & Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from scipy import stats
import math
import os

sns.set_palette("husl")
plt.style.use('seaborn-v0_8')
os.makedirs('plots', exist_ok=True)


## 2. Data Loading & Initial Checks
- Shape, info, missing values, duplicates
- Quick view of data

In [None]:
# 2. Load Data & Initial Checks
df = pd.read_csv('heart_failure.csv')
print(f"Shape: {df.shape}")
df.info()
display(df.head())

# Check for missing data
if df.isnull().values.any():
    print("Missing data detected.")
else:
    print("No missing data found.")

# Check for duplicates
dup_count = df.duplicated().sum()
print(f"Duplicate rows: {dup_count}")
if dup_count > 0:
    display(df[df.duplicated()])
else:
    print("No exact duplicates found.")


## 3. Feature Descriptions

**Clinical Meaning Reference**  
(Feel free to delete this cell after EDA if you want a tighter report!)


In [None]:
feature_descriptions = {
    'age': 'Age of the patient (years)',
    'anaemia': 'Decrease of red blood cells or hemoglobin (boolean)',
    'creatinine_phosphokinase': 'Level of CPK enzyme in blood (mcg/L)',
    'diabetes': 'If the patient has diabetes (boolean)',
    'ejection_fraction': 'Percentage of blood leaving the heart at each contraction (%)',
    'high_blood_pressure': 'If patient has hypertension (boolean)',
    'platelets': 'Platelets in blood (kiloplatelets/mL)',
    'serum_creatinine': 'Level of serum creatinine in the blood (mg/dL)',
    'serum_sodium': 'Level of serum sodium in the blood (mEq/L)',
    'sex': '1 = Male, 0 = Female',
    'smoking': 'If the patient smokes (boolean)',
    'time': 'Follow-up period (days)',
    'DEATH_EVENT': 'If patient died during follow-up (boolean, target)'
}
for k,v in feature_descriptions.items():
    print(f"{k}: {v}")
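
The descriptions above mark several features as boolean. A minimal sketch of how that assumption could be checked follows; the helper `non_binary_columns` and the toy frame are hypothetical and only stand in for `df`.

```python
import pandas as pd

def non_binary_columns(frame, columns):
    """Return the columns whose non-null values are not all 0 or 1."""
    return [c for c in columns if not frame[c].dropna().isin([0, 1]).all()]

# Toy frame standing in for df; the values are made up for illustration.
toy_flags = pd.DataFrame({
    "anaemia": [0, 1, 1, 0],
    "smoking": [0, 1, 2, 0],  # 2 is a deliberately invalid code
})

bad_columns = non_binary_columns(toy_flags, ["anaemia", "smoking"])
print(bad_columns)
```

On the real data the call would presumably be `non_binary_columns(df, ['anaemia', 'diabetes', 'high_blood_pressure', 'sex', 'smoking', 'DEATH_EVENT'])`, where an empty list means every flag is coded as 0/1.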


## 4. Descriptive Statistics & Class Balance



In [None]:
# Numeric summary
display(df[["age", "creatinine_phosphokinase", "ejection_fraction", "platelets", "serum_creatinine", "serum_sodium", "time"]].describe())

# Class balance
print("\nTarget class balance:")
print(df['DEATH_EVENT'].value_counts())
print('Minority class %: {:.1f}%'.format(100 * df['DEATH_EVENT'].value_counts(normalize=True)[1]))
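# Assign hue explicitly; passing palette without hue is deprecated in recent seaborn versions
sns.countplot(x='DEATH_EVENT', data=df, hue='DEATH_EVENT', palette=['lightgreen','lightcoral'], legend=False)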
plt.title('Survival (0) vs Death (1)')
plt.xticks([0,1],['Survived','Died'])
plt.ylabel('Count')
plt.show()


- The classes are imbalanced: survivors outnumber patients who died.
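
To put a number on the imbalance, the sketch below tabulates class shares and the inverse-frequency weights `n_samples / (n_classes * count)`, the same heuristic scikit-learn applies for `class_weight='balanced'`. The helper `class_balance` and the toy target are hypothetical; the counts are not taken from `heart_failure.csv`.

```python
import pandas as pd

def class_balance(target):
    """Summarize counts, shares, and inverse-frequency weights per class."""
    counts = target.value_counts().sort_index()
    summary = pd.DataFrame({"count": counts, "share": counts / counts.sum()})
    # Inverse-frequency weight: n_samples / (n_classes * class_count)
    summary["balanced_weight"] = counts.sum() / (len(counts) * counts)
    return summary

# Toy target with a 70/30 split, made up for illustration.
toy_target = pd.Series([0] * 7 + [1] * 3, name="DEATH_EVENT")
balance = class_balance(toy_target)
print(balance)
```

Whether such weights are actually needed is a modeling decision; here they only quantify how far the classes are from a 50/50 split.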

In [None]:
# Histograms for all numerics
numeric_cols = ["age", "creatinine_phosphokinase", "ejection_fraction", "platelets", "serum_creatinine", "serum_sodium", "time"]
df[numeric_cols].hist(bins=16, figsize=(16, 10), layout=(3, 4), color='skyblue', edgecolor='black')
plt.suptitle("Histograms of Numeric Features", fontsize=18)
plt.tight_layout(rect=[0,0,1,0.97])
plt.show()

# KDE by target
fig, axes = plt.subplots(math.ceil(len(numeric_cols)/4), 4, figsize=(18, 8))
axes = axes.flatten()

for ax, col in zip(axes, numeric_cols):
    # One KDE curve per outcome class
    for event, c in zip([0, 1], ['blue', 'red']):
        sns.kdeplot(df[df['DEATH_EVENT'] == event][col], fill=True, ax=ax, label=event, color=c)
    ax.set_title(col)
    ax.legend(title='DEATH_EVENT')
    ax.set_xlabel("")
    ax.set_ylabel("")

# Hide unused axes
for ax in axes[len(numeric_cols):]:
    ax.axis('off')

fig.suptitle("KDE of Numeric Features by Survival", y=1.02, fontsize=18)
plt.tight_layout()
plt.show()



In [None]:
# Perform normality test
numeric_cols = ["age", "creatinine_phosphokinase", "ejection_fraction", "platelets", "serum_creatinine", "serum_sodium", "time"]

# Create a table summarizing the Shapiro-Wilk test results
normality_results = []
for col in numeric_cols:
    stat, p = stats.shapiro(df[col])
    normality_results.append({'Feature': col, 'Statistic': stat, 'p-value': p})
normality_df = pd.DataFrame(normality_results)
# Use general format for p-values so very small values are not displayed as 0.000
display(normality_df.style.format({'Statistic': '{:.3f}', 'p-value': '{:.3g}'}).background_gradient(cmap="coolwarm", subset=['p-value']))

- The Shapiro-Wilk results indicate that the features are not normally distributed.
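
Right skew is one common reason a feature fails a normality test, and a log transform is a candidate remedy. The sketch below shows the effect on a synthetic log-normal sample; the data are simulated and not taken from `heart_failure.csv`, so the numbers say nothing about the actual features.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed sample (log-normal), used only for illustration.
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=300)
logged = np.log1p(skewed)  # log(1 + x), defined for zero values as well

skew_before, skew_after = stats.skew(skewed), stats.skew(logged)
w_before, _ = stats.shapiro(skewed)
w_after, _ = stats.shapiro(logged)

print(f"Skewness:       {skew_before:.2f} -> {skew_after:.2f}")
print(f"Shapiro-Wilk W: {w_before:.3f} -> {w_after:.3f}")
```

A W statistic closer to 1 after the transform suggests the shape is closer to normal, although the test may still reject normality.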

## 5. Outlier Checks


In [None]:

# --- Boxplots by target, all in one figure ---
ncols, nrows = 4, math.ceil(len(numeric_cols) / 4)

fig, axes = plt.subplots(nrows, ncols, figsize=(18, 2.8 * nrows))
axes = axes.flatten()

for ax, col in zip(axes, numeric_cols):
    sns.boxplot(x='DEATH_EVENT', y=col, data=df, hue='DEATH_EVENT', palette=['blue', 'red'], ax=ax, legend=False)
    ax.set_title(col)
    ax.set_xlabel("")
    ax.set_ylabel("")
    ax.set_xticks([0, 1])  # Explicitly set tick positions
    ax.set_xticklabels(['Survived', 'Died'])  # Set tick labels
for ax in axes[len(numeric_cols):]:
    ax.axis('off')
fig.suptitle("Boxplots of Numeric Features by Survival", y=1.02, fontsize=18)
plt.tight_layout()
plt.show()

- Serum creatinine looks like an interesting feature, since it is higher among patients who died.
- Ejection fraction (the percentage of blood pumped out of the heart with each beat) is lower among patients who died.
- Serum sodium is, interestingly, lower among patients who died.
- Follow-up time is shorter for patients who died. This is partly expected, since death ends follow-up; regular medical intervention or check-ups may also play a role in survival for heart failure patients.
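
The boxplots flag outliers visually using the 1.5 × IQR whisker rule. A minimal sketch of counting such points per feature follows; the helper `iqr_outlier_counts` and the toy values are hypothetical.

```python
import pandas as pd

def iqr_outlier_counts(frame, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    rows = []
    for col in columns:
        q1, q3 = frame[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        n_out = int(((frame[col] < lower) | (frame[col] > upper)).sum())
        rows.append({"Feature": col, "Lower": lower, "Upper": upper, "Outliers": n_out})
    return pd.DataFrame(rows)

# Toy values, made up for illustration; 9.4 lies far above the rest.
toy_labs = pd.DataFrame({"serum_creatinine": [0.9, 1.0, 1.1, 1.2, 1.3, 9.4]})
outliers = iqr_outlier_counts(toy_labs, ["serum_creatinine"])
print(outliers)
```

With the notebook's data this would presumably be called as `iqr_outlier_counts(df, numeric_cols)`. In clinical data, extreme values can be genuine measurements, so a high count is a prompt to inspect rather than a reason to drop rows.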

## 6. Correlation Analysis


In [None]:
# Pearson heatmap
# Pearson correlation measures linear relationships between features: the covariance normalized by the product of their standard deviations.
plt.figure(figsize=(12,9))
sns.heatmap(df[numeric_cols].corr(), annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, square=True)
plt.title("Pearson Correlation Matrix")
plt.tight_layout()
plt.show()

# Spearman heatmap
# Spearman correlation measures monotonic relationships between features using ranks, which makes it less sensitive to outliers and able to capture non-linear monotonic trends.
plt.figure(figsize=(12,9))
sns.heatmap(df[numeric_cols].corr(method='spearman'), annot=True, fmt=".2f", cmap="vlag", vmin=-1, vmax=1, square=True)
plt.title("Spearman Correlation Matrix")
plt.tight_layout()
plt.show()

# Highly correlated feature pairs (Pearson, abs > 0.7)
corrmat = df[numeric_cols].corr().abs()
high_corrs = (
    corrmat.where(np.triu(np.ones(corrmat.shape), k=1).astype(bool))
    .stack()
    .reset_index()
    .rename(columns={'level_0': 'Feature 1', 'level_1': 'Feature 2', 0: 'Correlation'})
)

values = high_corrs[high_corrs['Correlation'] > 0.7].sort_values(by='Correlation', ascending=False)

if not values.empty:  # Check if the DataFrame is not empty
    print("Feature pairs with absolute Pearson correlation > 0.7:")
    display(values)
else:
    print("No feature pairs with absolute Pearson correlation > 0.7 found.")

# Correlation with DEATH_EVENT
corr_target = df[numeric_cols + ['DEATH_EVENT']].corr()['DEATH_EVENT'].sort_values(ascending=False)
print("\nPearson correlation with DEATH_EVENT:")
print(corr_target)
print("\nTop 5 absolute correlations with DEATH_EVENT:")
print(corr_target.drop('DEATH_EVENT').abs().sort_values(ascending=False).head())


- Neither method shows a strong correlation among the numeric features.
- The absence of strong linear relationships between features (low multicollinearity) is favorable for linear models.
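
Pairwise correlations can miss collinearity that involves more than two features. One complementary check is the variance inflation factor (VIF), which equals the corresponding diagonal element of the inverse correlation matrix. The sketch below uses synthetic columns; the helper `vif_from_correlation` is hypothetical.

```python
import numpy as np
import pandas as pd

def vif_from_correlation(frame):
    """Compute VIF per column as the diagonal of the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=frame.columns, name="VIF")

rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = rng.normal(size=500)

# Synthetic data: 'a_plus_noise' is nearly a copy of 'a', 'b' is independent.
toy_features = pd.DataFrame({
    "a": a,
    "b": b,
    "a_plus_noise": a + 0.1 * rng.normal(size=500),
})
vif = vif_from_correlation(toy_features)
print(vif.round(2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity; for the notebook's data the call would presumably be `vif_from_correlation(df[numeric_cols])`.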

## 7. Pairplot and Violin Plots


In [None]:

# Pairplot of key features by target
# Shows the distribution of each numeric feature and the pairwise relationships between them,
# colored by survival status. It can reveal correlations, clusters, and potential outliers.
sns.pairplot(df[["age", "creatinine_phosphokinase", "ejection_fraction", "platelets", "serum_creatinine", "serum_sodium", "time", "DEATH_EVENT"]],
             hue='DEATH_EVENT', palette=['blue', 'red'])
plt.suptitle('Pairplot of Key Features by Survival', y=1.02)
plt.show()

# Violin plots for numeric features by target
numeric_cols = ["age", "creatinine_phosphokinase", "ejection_fraction", "platelets", "serum_creatinine", "serum_sodium", "time"]
fig, axes = plt.subplots(2, 4, figsize=(18, 10))
axes = axes.flatten()

for ax, feat in zip(axes, numeric_cols):
    sns.violinplot(x='DEATH_EVENT', y=feat, data=df, hue='DEATH_EVENT', palette=['blue', 'red'], ax=ax, legend=False)
    ax.set_title(f'{feat.replace("_", " ").title()} by Survival Status')
    ax.set_xlabel("")
    ax.set_ylabel("")
    ax.set_xticks([0, 1])  # Explicitly set tick positions
    ax.set_xticklabels(['Survived', 'Died'])  # Set tick labels

# Hide unused subplot
for ax in axes[len(numeric_cols):]:
    ax.axis('off')

fig.suptitle("Violin Plots of Numeric Features by Survival", y=1.02, fontsize=18)
plt.tight_layout()
plt.show()

## 8. Statistical Tests (t-test, Mann-Whitney U, chi-square, effect size)



In [None]:
import pandas as pd
from scipy import stats
import numpy as np


# Cohen's d is a measure of effect size: the standardized difference between two means.
# It is calculated as the difference between the two group means divided by the pooled standard deviation,
# and it quantifies the magnitude of a difference rather than just its statistical significance.
# Both the t-test and the Mann-Whitney U test are used to compare each numeric feature between the two target classes (DEATH_EVENT).
# Because the features are not normally distributed, the t-test may not be appropriate; the Mann-Whitney U test does not assume normality.


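def cohens_d(x, y):
    # Pooled standard deviation: each group's variance is weighted by its degrees of freedom,
    # which matters here because the two classes have different sizes.
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2))
    return (x.mean() - y.mean()) / pooled_sd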

numeric_cols = ["age", "creatinine_phosphokinase", "ejection_fraction","platelets", "serum_creatinine", "serum_sodium", "time"]

num_stats = []
for feat in numeric_cols:
    x0 = df[df['DEATH_EVENT']==0][feat]
    x1 = df[df['DEATH_EVENT']==1][feat]
    t, p = stats.ttest_ind(x0, x1)
    u, p_u = stats.mannwhitneyu(x0, x1)
    d = cohens_d(x0, x1)
    num_stats.append({
        'Feature': feat,
        't-test p': p,
        'Mann-Whitney p': p_u,
        "Cohen's d": d
    })

num_stats_df = pd.DataFrame(num_stats)
display(num_stats_df.style.format({
    't-test p': '{:.4f}',
    'Mann-Whitney p': '{:.4f}',
    "Cohen's d": '{:.2f}'
}).background_gradient(cmap="Blues", subset=["Cohen's d"]))

# --- Categorical tests ---
cat_feats = ['anaemia','diabetes','high_blood_pressure','sex','smoking']
cat_stats = []
for feat in cat_feats:
    crosstab = pd.crosstab(df[feat], df['DEATH_EVENT'])
    chi2, p, dof, _ = stats.chi2_contingency(crosstab)
    cat_stats.append({
        'Feature': feat,
        'Chi2': chi2,
        'p-value': p
    })
cat_stats_df = pd.DataFrame(cat_stats)
display(cat_stats_df.style.format({'Chi2': '{:.2f}', 'p-value': '{:.4f}'}).background_gradient(cmap="Oranges", subset=['Chi2']))
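
The numeric tests report an effect size, but the chi-square tests do not. One common choice for contingency tables is Cramér's V, sketched below on toy data. The helper `cramers_v` is hypothetical, and it turns off Yates' continuity correction (unlike the default call above) so that V spans the full range from 0 to 1.

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(x, y):
    """Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    table = pd.crosstab(x, y)
    # correction=False disables Yates' continuity correction for 2x2 tables
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy series, made up for illustration.
flag = pd.Series([0, 0, 1, 1], name="flag")
same = pd.Series([0, 0, 1, 1], name="outcome")       # perfectly associated
unrelated = pd.Series([0, 1, 0, 1], name="outcome")  # no association

v_perfect = cramers_v(flag, same)
v_none = cramers_v(flag, unrelated)
print(v_perfect, v_none)
```

For the notebook's data the call would presumably look like `cramers_v(df['anaemia'], df['DEATH_EVENT'])`, repeated for each entry in `cat_feats`.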

## 9. Key EDA Insights

- No missing values or duplicate rows.
- Mortality rate is ~32% (data is imbalanced).
- Some numeric features are skewed (log transforms are worth considering before modeling).
- *Statistical tests suggest age, ejection fraction, serum creatinine, and comorbidity are strong predictors.*

*Proceed to the Modeling notebook for further analysis and prediction!*
