# Heart Failure Survival Prediction Project

## Project Overview
This project focuses on predicting the survival outcomes of heart failure patients using clinical and demographic attributes. Heart disease is one of the leading causes of death worldwide, and understanding the factors that influence survival rates is critical for early intervention and improved patient outcomes.

The goal of this project is to apply **machine learning techniques** to build and compare predictive models that forecast whether a heart failure patient will survive. Beyond predictive accuracy, the project aims to identify the clinical and demographic features that contribute most to survival predictions.

## Dataset Information
The dataset used for this project comes from the **[UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records)**. It contains 299 records of patients who experienced heart failure, with 12 clinical and demographic features plus the target variable `DEATH_EVENT` (13 columns in total), which indicates whether the patient survived (0) or died (1).

### Key Features
- **Age**: Age of the patient
- **Anaemia**: Decrease of red blood cells or hemoglobin (Boolean)
- **Creatinine Phosphokinase (CPK)**: Level of CPK enzyme in blood (mcg/L)
- **Diabetes**: Whether the patient has diabetes (Boolean)
- **Ejection Fraction**: Percentage of blood leaving the heart at each contraction
- **High Blood Pressure**: Whether the patient has hypertension (Boolean)
- **Platelets**: Platelet count (kiloplatelets/mL)
- **Serum Creatinine**: Level of serum creatinine (mg/dL)
- **Serum Sodium**: Level of serum sodium (mEq/L)
- **Sex**: Male (1) or Female (0)
- **Smoking**: Whether the patient is a smoker (Boolean)
- **Time**: Follow-up period (days)

### Target Variable
- **DEATH_EVENT**: 
    - 1 = Patient died
    - 0 = Patient survived

---

## Project Objectives
This project will cover:
- **Exploratory Data Analysis (EDA)** to understand data distributions, trends, and relationships between features and patient outcomes.
- **Data Preprocessing & Feature Engineering** to clean, transform, and prepare the data for modeling.
- **Machine Learning Model Development** where multiple models will be trained and evaluated, including:
    - Logistic Regression
    - Decision Tree
    - Random Forest
    - Support Vector Machine (SVM)
    - Artificial Neural Network (ANN)
- **Model Comparison & Performance Evaluation** to select the best-performing model.
- **Feature Importance Analysis** to identify which features contribute most to survival predictions.
- **Conclusion & Key Insights** to summarize findings and offer actionable insights for healthcare professionals.

---

## Tools & Libraries
The following tools and libraries will be used in this project:
- Python
- Pandas
- NumPy
- Matplotlib & Seaborn
- Scikit-learn
- TensorFlow/Keras (for ANN)

---


## Feature Dictionary

| Feature | Description |
|---|---|
| **age** | Patient's age in years. Older patients are typically at higher risk. |
| **anaemia** | Whether the patient has low red blood cells (1 = Yes, 0 = No). Anaemia can worsen heart failure outcomes. |
| **creatinine_phosphokinase** | Level of CPK enzyme in the blood (mcg/L). High levels may indicate heart muscle damage. |
| **diabetes** | Whether the patient has diabetes (1 = Yes, 0 = No). Diabetes increases heart failure risk. |
| **ejection_fraction** | Percentage of blood leaving the heart at each contraction. Low values indicate poor heart function. |
| **high_blood_pressure** | Whether the patient has hypertension (1 = Yes, 0 = No). High BP adds stress to the heart. |
| **platelets** | Platelet count (kiloplatelets/mL). May indicate blood clotting ability or potential abnormalities. |
| **serum_creatinine** | Level of creatinine in the blood (mg/dL). High levels indicate potential kidney problems. |
| **serum_sodium** | Level of sodium in the blood (mEq/L). Low levels can accompany fluid retention and more severe heart failure. |
| **sex** | Patient's sex (1 = Male, 0 = Female). Survival rates may differ by sex. |
| **smoking** | Whether the patient smokes (1 = Yes, 0 = No). Smoking worsens heart and blood vessel health. |
| **time** | Follow-up period (days). Shorter times may indicate early death. |
| **DEATH_EVENT** | Target variable (1 = Died, 0 = Survived). |
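
The dictionary above can double as a lightweight schema check once the CSV is loaded. The following is a minimal sketch of that idea, not part of the original notebook: the helper name `validate_schema` and the constants `EXPECTED_COLUMNS` and `BINARY_COLUMNS` are hypothetical, and the two sample rows are the first two records of the dataset, used only as a smoke test.

```python
import pandas as pd

# Column names taken from the feature dictionary; the binary subset follows
# the "1 = ..., 0 = ..." descriptions. Both constants are illustrative names.
EXPECTED_COLUMNS = [
    "age", "anaemia", "creatinine_phosphokinase", "diabetes",
    "ejection_fraction", "high_blood_pressure", "platelets",
    "serum_creatinine", "serum_sodium", "sex", "smoking", "time",
    "DEATH_EVENT",
]
BINARY_COLUMNS = [
    "anaemia", "diabetes", "high_blood_pressure", "sex", "smoking",
    "DEATH_EVENT",
]


def validate_schema(frame: pd.DataFrame) -> list:
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col in BINARY_COLUMNS:
        if col in frame.columns and not frame[col].isin([0, 1]).all():
            problems.append(f"{col} contains values other than 0/1")
    return problems


# First two records of the dataset, used only to exercise the helper.
sample = pd.DataFrame(
    [
        [75.0, 0, 582, 0, 20, 1, 265000.0, 1.9, 130, 1, 0, 4, 1],
        [55.0, 0, 7861, 0, 38, 0, 263358.03, 1.1, 136, 1, 0, 6, 1],
    ],
    columns=EXPECTED_COLUMNS,
)
print(validate_schema(sample))  # an empty list means no problems were found
```

Running a check like this right after loading would surface renamed columns or unexpected codes before any modeling starts.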


## Importing Libraries

In this section, we import the libraries needed for data loading, exploration, visualization, modeling, and evaluation.


```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
```

## Data Loading & Initial Exploration

In this section, we load the dataset and perform basic data checks to understand its structure, spot potential issues (like missing values), and plan the next steps for data cleaning and preprocessing.


```python
# Load the dataset (local path; adjust to wherever the CSV is stored)
csv_path = r"C:\Users\EWURA\Desktop\UCI xHeart\heart_failure_clinical_records_dataset.csv"
df = pd.read_csv(csv_path)

# First look at the data
print("First 5 Rows of the Dataset:")
display(df.head())

# Basic info: column names, non-null counts, and dtypes
print("\n Basic Information About the Dataset:")
df.info()

# Check for missing values
print("\n Missing Values Per Column:")
print(df.isnull().sum())

# Summary statistics
print("\n Summary Statistics:")
display(df.describe())

# Check shape
print(f"\n Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")

# Check unique values in each column (useful for spotting categorical features)
print("\n Unique Values Per Column:")
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")

# Check class distribution (how many survived vs died), as counts and percentages
print("\n Survival Distribution (Target - DEATH_EVENT):")
print(df['DEATH_EVENT'].value_counts())
print(df['DEATH_EVENT'].value_counts(normalize=True) * 100)
```


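First 5 Rows of the Dataset:

| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |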



```text
 Basic Information About the Dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
```

Summary Statistics:

| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 | 299.0 |
| mean | 60.833893 | 0.431438 | 581.839465 | 0.41806 | 38.083612 | 0.351171 | 263358.029264 | 1.39388 | 136.625418 | 0.648829 | 0.32107 | 130.26087 | 0.32107 |
| std | 11.894809 | 0.496107 | 970.287881 | 0.494067 | 11.834841 | 0.478136 | 97804.236869 | 1.03451 | 4.412477 | 0.478136 | 0.46767 | 77.614208 | 0.46767 |
| min | 40.0 | 0.0 | 23.0 | 0.0 | 14.0 | 0.0 | 25100.0 | 0.5 | 113.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 25% | 51.0 | 0.0 | 116.5 | 0.0 | 30.0 | 0.0 | 212500.0 | 0.9 | 134.0 | 0.0 | 0.0 | 73.0 | 0.0 |
| 50% | 60.0 | 0.0 | 250.0 | 0.0 | 38.0 | 0.0 | 262000.0 | 1.1 | 137.0 | 1.0 | 0.0 | 115.0 | 0.0 |
| 75% | 70.0 | 1.0 | 582.0 | 1.0 | 45.0 | 1.0 | 303500.0 | 1.4 | 140.0 | 1.0 | 1.0 | 203.0 | 1.0 |
| max | 95.0 | 1.0 | 7861.0 | 1.0 | 80.0 | 1.0 | 850000.0 | 9.4 | 148.0 | 1.0 | 1.0 | 285.0 | 1.0 |



```text
 Dataset contains 299 rows and 13 columns.

 Unique Values Per Column:
age: 47 unique values
anaemia: 2 unique values
creatinine_phosphokinase: 208 unique values
diabetes: 2 unique values
ejection_fraction: 17 unique values
high_blood_pressure: 2 unique values
platelets: 176 unique values
serum_creatinine: 40 unique values
serum_sodium: 27 unique values
sex: 2 unique values
smoking: 2 unique values
time: 148 unique values
DEATH_EVENT: 2 unique values

 Survival Distribution (Target - DEATH_EVENT):
0    203
1     96
Name: DEATH_EVENT, dtype: int64
0    67.892977
1    32.107023
Name: DEATH_EVENT, dtype: float64
```


---

## Data Summary and Initial Observations

### Dataset Overview
The dataset contains **299 records** and **13 columns**, representing the clinical and demographic characteristics of heart failure patients. Each row corresponds to a patient, with the target variable `DEATH_EVENT` indicating whether the patient survived (`0`) or died (`1`).

### Column Types and Structure
- The dataset has a mix of **numerical** and **categorical/binary** columns.
- There are no missing values across any of the 13 columns, indicating the dataset is complete.

### Target Variable Distribution
- Out of **299 patients**, **96 patients (32.1%) died**, while **203 patients (67.9%) survived**.
- This indicates a **moderate class imbalance**, which will be taken into account during model training and evaluation.
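
One common way to account for this imbalance when splitting the data is a stratified train/test split, which keeps the death rate roughly equal in both partitions. The sketch below is an illustration under stated assumptions rather than the notebook's actual split: the labels are synthetic and only reproduce the observed class counts (203 and 96), the feature matrix is a placeholder, and `test_size=0.2` and `random_state=42` are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels that reproduce the observed class counts:
# 203 survivors (0) and 96 deaths (1).
y = np.array([0] * 203 + [1] * 96)
# Placeholder feature matrix; only its row count matters for this sketch.
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y asks scikit-learn to preserve the class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"overall death rate: {y.mean():.3f}")
print(f"train death rate:   {y_train.mean():.3f}")
print(f"test death rate:    {y_test.mean():.3f}")
```

Another option, which can be combined with stratification, is the `class_weight="balanced"` argument accepted by scikit-learn estimators such as `LogisticRegression`.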

### Key Summary Statistics
The following highlights are derived from the summary statistics:

- **Age** ranges from **40 to 95 years**, with a median of **60 years**.
- **Ejection Fraction** (a key heart performance metric) ranges from **14% to 80%**, with a median of **38%**.
- **Serum Creatinine**, which reflects kidney function, ranges from **0.5 to 9.4 mg/dL**, with a median of **1.1 mg/dL**. 
- **Creatinine Phosphokinase (CPK)**, an enzyme linked to muscle damage, varies widely (from **23 to 7861 mcg/L**). The maximum is more than 13 times the 75th percentile (582 mcg/L), which points to a strongly right-skewed distribution with extreme outliers.
- **Platelet counts** show considerable variation, from **25,100 to 850,000 kiloplatelets/mL**.

These wide ranges for some clinical attributes (such as **CPK** and **platelets**) highlight the importance of exploring **outliers** during the data cleaning and analysis phases.
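
A simple way to make "outlier" concrete is Tukey's IQR rule, which flags values more than 1.5 interquartile ranges beyond the quartiles. The sketch below is one possible approach, not necessarily the one used later in this project: the helper name `iqr_outlier_mask` is hypothetical, and the sample contains only the CPK values of the first five records, so the quartiles it computes are those of the sample, not of the full dataset.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# CPK values of the first five records; a toy sample for illustration only.
cpk = pd.Series([582, 7861, 146, 111, 160], name="creatinine_phosphokinase")
mask = iqr_outlier_mask(cpk)
print(cpk[mask])  # values flagged as outliers within this small sample
```

Applied to the full-data quartiles reported in the summary statistics (116.5 and 582 mcg/L), the same rule gives an upper fence of 582 + 1.5 × 465.5 = 1280.25 mcg/L, far below the observed maximum of 7861 mcg/L.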

### Unique Values Per Column
- **Binary Features** such as `anaemia`, `diabetes`, `high_blood_pressure`, `sex`, and `smoking` each have exactly **2 unique values** (0 or 1), confirming they are categorical.
- Some features, like `ejection_fraction` and `serum_sodium`, have relatively few unique values, which may make them **candidates for grouping or binning** during feature engineering.
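
Binning of this kind can be sketched with `pandas.cut`. The example below is illustrative only: the cut points (40 and 49) and the group labels are assumptions chosen for the demonstration and do not come from the dataset or from this project's later feature engineering, and the sample values are simply picked from within the observed 14% to 80% range.

```python
import pandas as pd

# Illustrative ejection fraction values within the observed 14-80 range.
ef = pd.Series([20, 38, 20, 45, 60, 80], name="ejection_fraction")

# Hypothetical cut points and labels, chosen only for this sketch.
bins = [0, 40, 49, 100]
labels = ["reduced", "mildly_reduced", "preserved"]

# pd.cut uses right-inclusive intervals by default: (0, 40], (40, 49], (49, 100].
ef_group = pd.cut(ef, bins=bins, labels=labels)
print(ef_group.value_counts().sort_index())
```

Whether grouping helps depends on the model: tree-based models can find their own thresholds, while linear models may benefit from explicit groups.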

### Data Completeness
- **No missing values detected across all columns.**

---

### Key Takeaways for Next Steps

- **We have a complete dataset with 299 patient records and 13 columns (12 features plus the target).**
- **There’s a moderate class imbalance, so evaluation metrics like precision, recall, and F1-score will be important.**
- **Several numerical features (CPK, platelets, serum creatinine) show large variability, suggesting potential outliers or skewed distributions that will require special attention.**
- **Binary categorical features (anaemia, diabetes, high blood pressure, sex, smoking) are clean and ready for analysis.**
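
The point about evaluation metrics can be made concrete with a trivial baseline. The sketch below uses synthetic labels that only reproduce the observed class counts (203 survivors, 96 deaths) and a hypothetical "model" that predicts survival for every patient; it is an illustration, not a result from this project.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Synthetic labels with the dataset's class counts: 203 survivors, 96 deaths.
y_true = [0] * 203 + [1] * 96
# A trivial baseline that predicts "survived" (0) for every patient.
y_pred = [0] * 299

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive predictions exist.
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The baseline reaches an accuracy of 203/299 (about 68%) while identifying none of the patients who died, which is why recall and F1-score for the positive class deserve attention alongside accuracy.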

---

## Exploratory Data Analysis (EDA)

The goal of this exploratory data analysis is to better understand the characteristics of heart failure patients in this dataset, explore how key clinical and demographic factors relate to survival outcomes, and identify potential patterns or trends that may help inform predictive modeling. 

Through a combination of visualizations and feature-specific analysis, we aim to uncover key insights that will guide the feature engineering and model building processes in the later stages of this project.
