# Heart Failure Prediction Using Machine Learning

### Life Cycle of a Machine Learning Project

- Understanding the Problem Statement
- Data Collection
- Exploratory data analysis
- Data Cleaning
- Data Pre-Processing
- Model Training
- Choose best model

### Introduction

Heart failure is a severe condition where the heart cannot pump blood efficiently, leading to various health complications and even death. Early prediction of heart failure can significantly improve patient outcomes by enabling timely medical interventions. This project leverages machine learning techniques to predict heart failure based on a dataset obtained from Kaggle, which includes various patient health metrics.

### 1) Problem Statement

Heart failure can have severe and potentially fatal consequences. Despite advances in medical technology, predicting it accurately remains difficult: many patient health metrics interact in complex ways, and traditional predictive methods have limited capacity to model those interactions.

Current prediction systems rely heavily on clinical judgment and statistical methods, which may not fully capture the intricate patterns and relationships in the data. These methods can be time-consuming, require significant manual effort, and often lack the accuracy needed for early and reliable prediction of heart failure.

The primary problem is the need for an automated, accurate, and efficient predictive system that can analyze large volumes of patient data, identify subtle patterns, and provide consistent, unbiased predictions. Such a system would support healthcare providers in making timely and informed decisions, ultimately improving patient outcomes and reducing healthcare costs.

This project aims to address this problem by developing a machine learning-based model for heart failure prediction. The model will be trained on a comprehensive dataset of patient health metrics and will leverage advanced machine learning algorithms to provide high-accuracy predictions. The system will be designed for integration into existing healthcare infrastructures, offering a user-friendly interface for real-time prediction and decision support.

### 2) Data Collection

The dataset combines five heart disease datasets over 11 common features, making it, per the dataset description, the largest heart disease dataset available so far for research purposes. The five datasets used for its curation are:

    - Cleveland: 303 observations
    - Hungarian: 294 observations
    - Switzerland: 123 observations
    - Long Beach VA: 200 observations
    - Statlog (Heart) Data Set: 270 observations

**Total: 1190 observations
<br>Duplicated: 272 observations
<br>Final dataset: 918 observations** 
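
The totals above follow directly from the per-source counts. As a quick arithmetic check (the variable names are only illustrative):

```python
# Observation counts per source dataset, as listed above.
source_counts = {
    "Cleveland": 303,
    "Hungarian": 294,
    "Switzerland": 123,
    "Long Beach VA": 200,
    "Statlog (Heart)": 270,
}
duplicates = 272

total = sum(source_counts.values())
final = total - duplicates
print(total, final)  # 1190 918
```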

**Dataset Attributes**

- Age : age of the patient [years]
- Sex : sex of the patient [M: Male, F: Female]
- ChestPainType : chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP : resting blood pressure [mm Hg]
- Cholesterol : serum cholesterol [mg/dl]
- FastingBS : fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG : resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR : maximum heart rate achieved [Numeric value between 60 and 202]
- ExerciseAngina : exercise-induced angina [Y: Yes, N: No]
- Oldpeak : ST depression induced by exercise relative to rest [numeric value]
- ST_Slope : the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartFailure : output class [1: heart disease, 0: Normal]
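
The category codes listed above can double as a simple sanity check on the loaded data. The following is a minimal sketch, assuming the column names and codes exactly as documented; `ALLOWED_VALUES`, `find_unexpected_values` and the sample rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Allowed codes for each categorical column, taken from the attribute list above.
ALLOWED_VALUES = {
    "Sex": {"M", "F"},
    "ChestPainType": {"TA", "ATA", "NAP", "ASY"},
    "RestingECG": {"Normal", "ST", "LVH"},
    "ExerciseAngina": {"Y", "N"},
    "ST_Slope": {"Up", "Flat", "Down"},
}

def find_unexpected_values(frame: pd.DataFrame) -> dict:
    """Return {column: unexpected values} for columns containing unknown codes."""
    problems = {}
    for column, allowed in ALLOWED_VALUES.items():
        unexpected = set(frame[column].unique()) - allowed
        if unexpected:
            problems[column] = unexpected
    return problems

# Tiny illustrative frame (hypothetical rows, one deliberately invalid code).
sample = pd.DataFrame({
    "Sex": ["M", "F", "X"],
    "ChestPainType": ["ATA", "NAP", "ASY"],
    "RestingECG": ["Normal", "ST", "LVH"],
    "ExerciseAngina": ["N", "Y", "N"],
    "ST_Slope": ["Up", "Flat", "Down"],
})
problems = find_unexpected_values(sample)
print(problems)  # {'Sex': {'X'}}
```

In the notebook, the same function could be called on `df` once the CSV has been loaded.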

#### 2.1 Import Data and Required Packages

**Importing Pandas, NumPy, Matplotlib, Seaborn, Plotly and the warnings library.**

In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

**Import the CSV Data as Pandas Dataframe**

In [44]:
df = pd.read_csv("heart.csv")

**Show top 5 Records**

In [45]:
df.head()

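| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartFailure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |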


**Shape of the dataset**

In [46]:
df.shape

(918, 12)

**Summary of the dataset**

In [47]:
# Display summary statistics for a dataframe
df.describe()

| | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartFailure |
|---|---|---|---|---|---|---|---|
| count | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 |
| mean | 53.510893 | 132.396514 | 198.799564 | 0.233115 | 136.809368 | 0.887364 | 0.553377 |
| std | 9.432617 | 18.514154 | 109.384145 | 0.423046 | 25.460334 | 1.06657 | 0.497414 |
| min | 28.0 | 0.0 | 0.0 | 0.0 | 60.0 | -2.6 | 0.0 |
| 25% | 47.0 | 120.0 | 173.25 | 0.0 | 120.0 | 0.0 | 0.0 |
| 50% | 54.0 | 130.0 | 223.0 | 0.0 | 138.0 | 0.6 | 1.0 |
| 75% | 60.0 | 140.0 | 267.0 | 0.0 | 156.0 | 1.5 | 1.0 |
| max | 77.0 | 200.0 | 603.0 | 1.0 | 202.0 | 6.2 | 1.0 |
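
The summary reports a minimum of 0 for both RestingBP and Cholesterol. Whether these zeros stand for missing measurements is an assumption here, since the source does not say. A minimal sketch of how they could be counted and, optionally, masked, using hypothetical rows in place of `df`:

```python
import pandas as pd

# Hypothetical rows for illustration; in the notebook this would be `df`.
sample = pd.DataFrame({
    "RestingBP": [140, 0, 130, 138],
    "Cholesterol": [289, 180, 0, 0],
})

# Count exact zeros per column.
zero_counts = (sample == 0).sum()
print(zero_counts)

# One option (an assumption, not the notebook's method): treat zeros as missing.
cleaned = sample.mask(sample == 0)
print(cleaned.isna().sum())
```

The notebook itself trains on the values as they are, so this check is only a suggestion for further inspection.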


**Check Datatypes in the dataset**

In [48]:
# Check Null and Dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartFailure    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


### 3) Exploring Data

In [49]:
# define numerical & categorical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 7 numerical features : ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak', 'HeartFailure']

We have 5 categorical features : ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']


In [50]:
# proportion of count data on categorical columns
for col in categorical_features:
    print(df[col].value_counts(normalize=True) * 100)
    print('---------------------------')

Sex
M    78.976035
F    21.023965
Name: proportion, dtype: float64
---------------------------
ChestPainType
ASY    54.030501
NAP    22.113290
ATA    18.845316
TA      5.010893
Name: proportion, dtype: float64
---------------------------
RestingECG
Normal    60.130719
LVH       20.479303
ST        19.389978
Name: proportion, dtype: float64
---------------------------
ExerciseAngina
N    59.586057
Y    40.413943
Name: proportion, dtype: float64
---------------------------
ST_Slope
Flat    50.108932
Up      43.028322
Down     6.862745
Name: proportion, dtype: float64
---------------------------


**Insights**

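- The Sex column is heavily skewed towards male patients (about 79% of records), so males are over-represented in this dataset. These proportions alone do not show whether heart failure is more frequent among males than females; that requires comparing the target rate within each group.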
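
To compare heart failure frequency between the sexes, the target can be averaged per group. This is a sketch on hypothetical rows; in the notebook the same `groupby` call would be applied to `df`.

```python
import pandas as pd

# Hypothetical rows for illustration; in the notebook this would be `df`.
sample = pd.DataFrame({
    "Sex": ["M", "M", "M", "M", "F", "F"],
    "HeartFailure": [1, 1, 0, 1, 0, 1],
})

# The mean of a 0/1 target within a group is the positive rate of that group.
rate_by_sex = sample.groupby("Sex")["HeartFailure"].mean()
print(rate_by_sex)
```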

In [51]:
df.head()

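| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartFailure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |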


**Insights**

- There is no multicollinearity between any variables
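
No output backing this statement is shown in this section. One common way to check it is to look at pairwise Pearson correlations among the numerical features. The sketch below uses synthetic stand-in data with one deliberately correlated pair, and the 0.8 threshold is an arbitrary choice for illustration; in the notebook the input would be `df[numeric_features]`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data; in the notebook this would be df[numeric_features].
age = rng.normal(54, 9, size=200)
sample = pd.DataFrame({
    "Age": age,
    "MaxHR": 200 - 0.9 * age + rng.normal(0, 3, size=200),  # deliberately correlated
    "Oldpeak": rng.normal(0.9, 1.0, size=200),
})

corr = sample.corr()

# List feature pairs whose absolute correlation exceeds the threshold.
threshold = 0.8  # arbitrary cut-off chosen for this sketch
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

An empty `high_pairs` list on the real data would be consistent with the insight above, at least for pairwise linear relationships.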

In [52]:
#Checking for Null Values
df.isnull().sum()

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartFailure      0
dtype: int64

### 4) Data Cleaning and Model Training

#### 4.1 Data Cleaning
- Handling missing values
- Handling duplicates
- Checking data types

In [53]:
df.isnull().sum()

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartFailure      0
dtype: int64

In [54]:
df.duplicated().sum()

np.int64(0)

In [55]:

df.head()

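| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartFailure |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |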


In [56]:
categorical = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
final_df = pd.get_dummies(df, columns=categorical)
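
`pd.get_dummies` expands each listed categorical column into one indicator column per category and leaves the other columns untouched. A small illustration on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows for illustration only.
sample = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Up"],
})

encoded = pd.get_dummies(sample, columns=["Sex", "ST_Slope"])
print(list(encoded.columns))
# ['Age', 'Sex_F', 'Sex_M', 'ST_Slope_Flat', 'ST_Slope_Up']
```

Only categories present in the data get a column, which is why `ST_Slope_Down` is absent from this small example.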

In [57]:
X = final_df.drop(['HeartFailure'],axis=1)
y = final_df['HeartFailure']

In [58]:
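from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; this split (random_state=1) is the one
# used to fit and score the final model later on
X_train, X_test, Y_train, Y_test = train_test_split(X,y,test_size=0.2,random_state=1)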

In [59]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, classification_report, ConfusionMatrixDisplay,
                             precision_score, recall_score, f1_score, roc_auc_score, roc_curve)

In [60]:
def evaluate_clf(true, predicted):
    """Return accuracy, F1, precision, recall and ROC AUC for binary predictions."""
    acc = accuracy_score(true, predicted) # Calculate Accuracy
    f1 = f1_score(true, predicted) # Calculate F1-score
    precision = precision_score(true, predicted) # Calculate Precision
    recall = recall_score(true, predicted)  # Calculate Recall
    # Calculate ROC AUC (from hard 0/1 predictions, not predicted probabilities)
    roc_auc = roc_auc_score(true, predicted)
    return acc, f1, precision, recall, roc_auc
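
`evaluate_clf` receives hard 0/1 predictions, so its ROC AUC is based on a single operating point; for binary labels that value equals the average of sensitivity and specificity. ROC AUC is more commonly computed from continuous scores such as predicted probabilities. The sketch below contrasts the two on synthetic stand-in data; the data and the names `X_demo`, `y_demo` and `clf` are illustrative, not from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook would use X and y instead.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_clusters_per_class=1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions (what evaluate_clf receives).
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# AUC from the predicted probability of the positive class.
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"labels: {auc_from_labels:.4f}  scores: {auc_from_scores:.4f}")
```

Note that `SVC()` with default settings does not provide `predict_proba`; its `decision_function` output can serve as the score instead.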

In [61]:
models = {
    "Random Forest": RandomForestClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Logistic Regression": LogisticRegression(),
    "K-Neighbors Classifier": KNeighborsClassifier(),
    "Support Vector Classifier": SVC()
}

In [62]:
# Create a function which can evaluate models and return a report
def evaluate_models(X, y, models):
    '''
    Takes X, y and a dictionary of models as input.
    Splits the data into train and test sets, fits each model,
    and prints its metrics on both sets.
    Returns: DataFrame of model names and test accuracy, sorted best first
    '''
    # separate dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

    models_list = []
    accuracy_list = []

    for name, model in models.items():
        model.fit(X_train, y_train) # Train model

        # Make predictions
        y_train_pred = model.predict(X_train)
        y_test_pred = model.predict(X_test)

        # Training set performance
        train_acc, train_f1, train_precision, train_recall, train_auc = evaluate_clf(y_train, y_train_pred)

        # Test set performance
        test_acc, test_f1, test_precision, test_recall, test_auc = evaluate_clf(y_test, y_test_pred)

        print(name)
        models_list.append(name)
        accuracy_list.append(test_acc)

        print('Model performance for Training set')
        print('- Accuracy: {:.4f}'.format(train_acc))
        print('- F1 score: {:.4f}'.format(train_f1))
        print('- Precision: {:.4f}'.format(train_precision))
        print('- Recall: {:.4f}'.format(train_recall))
        print('- Roc Auc Score: {:.4f}'.format(train_auc))

        print('----------------------------------')

        print('Model performance for Test set')
        print('- Accuracy: {:.4f}'.format(test_acc))
        print('- F1 score: {:.4f}'.format(test_f1))
        print('- Precision: {:.4f}'.format(test_precision))
        print('- Recall: {:.4f}'.format(test_recall))
        print('- Roc Auc Score: {:.4f}'.format(test_auc))
        print('='*35)
        print('\n')

    report = pd.DataFrame(list(zip(models_list, accuracy_list)), columns=['Model Name', 'Accuracy']).sort_values(by=['Accuracy'], ascending=False)

    return report

In [63]:
base_model_report =evaluate_models(X=X, y=y, models=models)

Random Forest
Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- Roc Auc Score: 1.0000
----------------------------------
Model performance for Test set
- Accuracy: 0.8750
- F1 score: 0.8900
- Precision: 0.9118
- Recall: 0.8692
- Roc Auc Score: 0.8761


Decision Tree
Model performance for Training set
- Accuracy: 1.0000
- F1 score: 1.0000
- Precision: 1.0000
- Recall: 1.0000
- Roc Auc Score: 1.0000
----------------------------------
Model performance for Test set
- Accuracy: 0.7935
- F1 score: 0.8119
- Precision: 0.8632
- Recall: 0.7664
- Roc Auc Score: 0.7988


Logistic Regression
Model performance for Training set
- Accuracy: 0.8692
- F1 score: 0.8821
- Precision: 0.8692
- Recall: 0.8953
- Roc Auc Score: 0.8665
----------------------------------
Model performance for Test set
- Accuracy: 0.8533
- F1 score: 0.8696
- Precision: 0.9000
- Recall: 0.8411
- Roc Auc Score: 0.8556


K-Neighbors Classifier
Model performance for Trai

In [64]:
base_model_report

| | Model Name | Accuracy |
|---|---|---|
| 0 | Random Forest | 0.875 |
| 2 | Logistic Regression | 0.853261 |
| 1 | Decision Tree | 0.793478 |
| 3 | K-Neighbors Classifier | 0.706522 |
| 4 | Support Vector Classifier | 0.690217 |
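
K-Neighbors and the Support Vector Classifier rank last in this report. Both rely on distances between samples, and the features here sit on very different scales (Cholesterol reaches 603 while Oldpeak peaks at 6.2), so unscaled inputs may be part of the reason. The sketch below wraps both models in a pipeline with `StandardScaler`. It is a suggestion only, run here on synthetic stand-in data, and its effect on this dataset was not measured.

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Each pipeline standardizes the features before the distance-based model sees them.
scaled_models = {
    "K-Neighbors Classifier (scaled)": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Support Vector Classifier (scaled)": make_pipeline(StandardScaler(), SVC()),
}

# Synthetic stand-in data; the notebook would use X and y instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
for name, model in scaled_models.items():
    model.fit(X_demo, y_demo)
    print(name, model.score(X_demo, y_demo))
```

Because the scaler is fitted inside the pipeline, a dictionary like `scaled_models` could be passed to `evaluate_models` in place of `models` without leaking test-set statistics into training.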


In [65]:
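# Refit a Random Forest on the earlier split (train_test_split with random_state=1),
# which differs from the split used inside evaluate_models (random_state=42)
best_model = RandomForestClassifier()
best_model = best_model.fit(X_train,Y_train)
y_pred = best_model.predict(X_test)
score = accuracy_score(Y_test,y_pred)
cr = classification_report(Y_test,y_pred)

print("FINAL MODEL 'Random Forest'")
print("Accuracy Score value: {:.4f}".format(score))
print(cr)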

FINAL MODEL 'Random Forest'
Accuracy Score value: 0.9022
              precision    recall  f1-score   support

           0       0.89      0.86      0.88        74
           1       0.91      0.93      0.92       110

    accuracy                           0.90       184
   macro avg       0.90      0.90      0.90       184
weighted avg       0.90      0.90      0.90       184
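
The model comparison scored Random Forest at 0.8750 on a split made with `random_state=42`, while the final model scores 0.9022 on a split made with `random_state=1`, so the two figures are not directly comparable. Cross-validation gives an estimate that depends less on a single split. A minimal sketch on synthetic stand-in data, with the fold count and seeds chosen arbitrarily:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; the notebook would use X and y instead.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Five stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(random_state=0), X_demo, y_demo, cv=cv, scoring="accuracy"
)
print(scores.mean(), scores.std())
```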



In [66]:
import pickle

# Open the file where the model will be stored and dump the model into it;
# the with-block closes the file afterwards
with open('randomforest.pkl', 'wb') as file:
    pickle.dump(best_model, file)
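
A saved model is only useful if it can be read back. The following round-trip sketch uses a small stand-in model and a temporary directory; the file name `randomforest_demo.pkl` is hypothetical. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model; in the notebook this would be best_model.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "randomforest_demo.pkl"  # hypothetical file name
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```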

In [67]:
column_names = X.columns

# Print the column names
print(column_names)

Index(['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak',
       'Sex_F', 'Sex_M', 'ChestPainType_ASY', 'ChestPainType_ATA',
       'ChestPainType_NAP', 'ChestPainType_TA', 'RestingECG_LVH',
       'RestingECG_Normal', 'RestingECG_ST', 'ExerciseAngina_N',
       'ExerciseAngina_Y', 'ST_Slope_Down', 'ST_Slope_Flat', 'ST_Slope_Up'],
      dtype='object')


In [68]:
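# One patient's feature values, in the same order as X.columns above
input_data = [[40,140,200,0,170,1,0,1,1,0,0,0,1,0,0,1,0,0,1,0]]

# Convert to a 2-D array of shape (1, n_features), as predict() expects
input_data_numpy_array = np.asarray(input_data)
reshape_input_data = input_data_numpy_array.reshape(1,-1)

prediction = best_model.predict(reshape_input_data)

print(prediction)

# 0 = Normal, 1 = heart disease (see the HeartFailure attribute)
if prediction[0] == 0:
    print('Less')
else:
    print('More')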

[1]
More
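
A positional list like the one above is easy to get wrong, because every value must line up with the one-hot column order. One alternative is to start from a raw patient record and align it to the training columns. This is a sketch only: `FEATURE_COLUMNS` is copied from the `X.columns` output above, while `build_input_row` and the `patient` dictionary are hypothetical helpers, not part of the notebook.

```python
import pandas as pd

# Column order copied from the X.columns output above.
FEATURE_COLUMNS = [
    "Age", "RestingBP", "Cholesterol", "FastingBS", "MaxHR", "Oldpeak",
    "Sex_F", "Sex_M", "ChestPainType_ASY", "ChestPainType_ATA",
    "ChestPainType_NAP", "ChestPainType_TA", "RestingECG_LVH",
    "RestingECG_Normal", "RestingECG_ST", "ExerciseAngina_N",
    "ExerciseAngina_Y", "ST_Slope_Down", "ST_Slope_Flat", "ST_Slope_Up",
]
CATEGORICAL = ["Sex", "ChestPainType", "RestingECG", "ExerciseAngina", "ST_Slope"]

def build_input_row(patient: dict) -> pd.DataFrame:
    """One-hot encode one raw patient record and align it to FEATURE_COLUMNS."""
    encoded = pd.get_dummies(pd.DataFrame([patient]), columns=CATEGORICAL)
    # Categories absent from this single row become all-zero columns.
    return encoded.reindex(columns=FEATURE_COLUMNS, fill_value=0).astype(float)

# The same patient as the positional list above, written with named fields.
patient = {
    "Age": 40, "RestingBP": 140, "Cholesterol": 200, "FastingBS": 0,
    "MaxHR": 170, "Oldpeak": 1.0, "Sex": "M", "ChestPainType": "ASY",
    "RestingECG": "LVH", "ExerciseAngina": "N", "ST_Slope": "Flat",
}
row = build_input_row(patient)
print(row.iloc[0].tolist())
```

The resulting one-row DataFrame can be passed to `best_model.predict`. One caveat of this sketch is that an unknown category code is silently dropped by `reindex`, so inputs should be validated first.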


In [69]:
%pip install joblib

Note: you may need to restart the kernel to use updated packages.


In [70]:
from joblib import dump 
dump(best_model,'randomforest5.pkl')

['randomforest5.pkl']