The World Health Organization (WHO) estimates that cardiovascular diseases (CVDs) claim the lives of 17.9 million people annually. This figure represents 31 percent of the total number of fatalities that occur across the globe.
This dataset comprises 12 features that can be used to predict mortality from heart failure, a common consequence of CVDs.


The majority of cardiovascular illnesses are preventable if behavioral risk factors including smoking, poor diet and obesity, lack of physical exercise, and problematic alcohol consumption are addressed through population-wide interventions.



People who have cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia, or already established disease) require early detection and management, which is where a machine learning model can be of great assistance.

In [40]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pandas_profiling import ProfileReport

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import plot_confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score

import warnings
warnings.filterwarnings('ignore')

In [41]:
data = pd.read_csv('heart_failure_clinical_records_dataset.csv')

In [3]:
data

,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.00,1.90,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.10,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.00,1.30,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.00,1.90,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.00,2.70,116,0,0,8,1
5,90.0,1,47,0,40,1,204000.00,2.10,132,1,1,8,1
6,75.0,1,246,0,15,0,127000.00,1.20,137,1,0,10,1
7,60.0,1,315,1,60,0,454000.00,1.10,131,1,1,10,1
8,65.0,0,157,0,65,0,263358.03,1.50,138,0,0,10,1
9,80.0,1,123,0,35,1,388000.00,9.40,133,1,1,10,1


dataframe.isna().sum()

isna() returns a boolean object of the same shape, indicating whether each value is NA.

NA values, such as None or numpy.NaN, are mapped to True.

Everything else is mapped to False.

sum() then counts the True values in each column, giving the number of missing values per column.
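
Because the heart failure dataset has no missing values, the effect of isna().sum() is easier to see on a small made-up frame. The following is a minimal sketch; the `toy` DataFrame and its values are hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two missing values in 'age', one in 'sex'
toy = pd.DataFrame({
    "age": [75.0, np.nan, 65.0, np.nan],
    "sex": [1, 0, None, 1],
})

mask = toy.isna()                # boolean DataFrame: True where a value is missing
missing_per_column = mask.sum()  # True counts as 1, so this is the missing count per column

print(missing_per_column)
```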

In [4]:
data.isna().sum()

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

While analyzing this dataset, we wanted to see the number of unique values in
each column, which can be done using the pandas nunique() function.

In [5]:
for column in data.columns:
    print(f"{column}: Number of unique values {data[column].nunique()}")

age: Number of unique values 47
anaemia: Number of unique values 2
creatinine_phosphokinase: Number of unique values 208
diabetes: Number of unique values 2
ejection_fraction: Number of unique values 17
high_blood_pressure: Number of unique values 2
platelets: Number of unique values 176
serum_creatinine: Number of unique values 40
serum_sodium: Number of unique values 27
sex: Number of unique values 2
smoking: Number of unique values 2
time: Number of unique values 148
DEATH_EVENT: Number of unique values 2


data.info()

Print a concise summary of the DataFrame loaded from the CSV file.

This method prints information about a DataFrame, including the index dtype and columns, non-null counts, and memory usage.

The dataset has 299 entries.

All columns are numeric (int64 or float64). The categorical variables (anaemia, diabetes, high_blood_pressure, sex, smoking, DEATH_EVENT) are already encoded in binary form, so no further encoding is needed.

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


In [7]:
data.describe()

,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
count,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0
mean,60.833893,0.431438,581.839465,0.41806,38.083612,0.351171,263358.029264,1.39388,136.625418,0.648829,0.32107,130.26087,0.32107
std,11.894809,0.496107,970.287881,0.494067,11.834841,0.478136,97804.236869,1.03451,4.412477,0.478136,0.46767,77.614208,0.46767
min,40.0,0.0,23.0,0.0,14.0,0.0,25100.0,0.5,113.0,0.0,0.0,4.0,0.0
25%,51.0,0.0,116.5,0.0,30.0,0.0,212500.0,0.9,134.0,0.0,0.0,73.0,0.0
50%,60.0,0.0,250.0,0.0,38.0,0.0,262000.0,1.1,137.0,1.0,0.0,115.0,0.0
75%,70.0,1.0,582.0,1.0,45.0,1.0,303500.0,1.4,140.0,1.0,1.0,203.0,1.0
max,95.0,1.0,7861.0,1.0,80.0,1.0,850000.0,9.4,148.0,1.0,1.0,285.0,1.0


ProfileReport (from pandas_profiling) generates an interactive HTML report. For each column, the following statistics are presented where relevant for the column type:

- Type inference: detect the types of columns in a dataframe.
- Essentials: type, unique values, missing values.
- Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range.
- Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness.
- Most frequent values.
- Histogram.
- Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices.
- Missing values: matrix, count, heatmap and dendrogram of missing values.
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
- File and Image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information.
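
To make the difference between the correlation measures concrete, the sketch below computes Pearson and Spearman coefficients with plain pandas. This is a stand-in for illustration, not the report's internal computation; the `toy` data is hypothetical, and Kendall is left out only to keep the example short.

```python
import pandas as pd

# Hypothetical toy data: y grows monotonically, but not linearly, with x
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],
})

pearson = toy.corr(method="pearson")    # measures linear association
spearman = toy.corr(method="spearman")  # measures rank (monotonic) association

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
```

Because the relationship is perfectly monotonic but curved, the Spearman coefficient reaches 1 while the Pearson coefficient stays slightly below it.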

In [8]:
profilerep = ProfileReport(data)

In [9]:
profilerep


In [10]:
profilerep.to_notebook_iframe()

time is highly correlated with DEATH_EVENT	High correlation

DEATH_EVENT is highly correlated with time	High correlation

ejection_fraction is highly correlated with serum_creatinine and DEATH_EVENT High correlation

serum_creatinine is highly correlated with ejection_fraction	High correlation

sex is highly correlated with smoking	High correlation

smoking is highly correlated with sex	High correlation

DEATH_EVENT is highly correlated with ejection_fraction and time	High correlation

Since we are trying to predict DEATH_EVENT, we use it as the target y and the remaining columns as the features x.

In [11]:
y = data['DEATH_EVENT']
x = data.drop('DEATH_EVENT', axis=1)

In [12]:
y

0      1
1      1
2      1
3      1
4      1
      ..
294    0
295    0
296    0
297    0
298    0
Name: DEATH_EVENT, Length: 299, dtype: int64

In [13]:
x

,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time
0,75.0,0,582,0,20,1,265000.00,1.90,130,1,0,4
1,55.0,0,7861,0,38,0,263358.03,1.10,136,1,0,6
2,65.0,0,146,0,20,0,162000.00,1.30,129,1,1,7
3,50.0,1,111,0,20,0,210000.00,1.90,137,1,0,7
4,65.0,1,160,1,20,0,327000.00,2.70,116,0,0,8
5,90.0,1,47,0,40,1,204000.00,2.10,132,1,1,8
6,75.0,1,246,0,15,0,127000.00,1.20,137,1,0,10
7,60.0,1,315,1,60,0,454000.00,1.10,131,1,1,10
8,65.0,0,157,0,65,0,263358.03,1.50,138,0,0,10
9,80.0,1,123,0,35,1,388000.00,9.40,133,1,1,10


MinMaxScaler transforms features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

Here it rescales every column of x to values between 0 and 1.
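
The scaling rule is (value - min) / (max - min), applied per column. The sketch below checks it by hand against MinMaxScaler on a few ages; the array is a small illustrative sample built from the first rows plus the dataset's minimum (40) and maximum (95) age, not the full column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative sample: three ages from the first rows, plus the dataset min and max
ages = np.array([[75.0], [55.0], [65.0], [40.0], [95.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

# The same computation through scikit-learn
scaled = MinMaxScaler().fit_transform(ages)

print(np.round(scaled.ravel(), 6))
```

An age of 75 maps to (75 - 40) / (95 - 40), roughly 0.636, which matches the first scaled row. In a stricter workflow one might fit the scaler on the training split only and call transform on the test split, so that test-set statistics do not influence the preprocessing.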

In [14]:
scaler = MinMaxScaler()
x = scaler.fit_transform(x)

In [15]:
pd.DataFrame(x)

,0,1,2,3,4,5,6,7,8,9,10,11
0,0.636364,0.0,0.071319,0.0,0.090909,1.0,0.290823,0.157303,0.485714,1.0,0.0,0.000000
1,0.272727,0.0,1.000000,0.0,0.363636,0.0,0.288833,0.067416,0.657143,1.0,0.0,0.007117
2,0.454545,0.0,0.015693,0.0,0.090909,0.0,0.165960,0.089888,0.457143,1.0,1.0,0.010676
3,0.181818,1.0,0.011227,0.0,0.090909,0.0,0.224148,0.157303,0.685714,1.0,0.0,0.010676
4,0.454545,1.0,0.017479,1.0,0.090909,0.0,0.365984,0.247191,0.085714,0.0,0.0,0.014235
5,0.909091,1.0,0.003062,0.0,0.393939,1.0,0.216875,0.179775,0.542857,1.0,1.0,0.014235
6,0.636364,1.0,0.028451,0.0,0.015152,0.0,0.123530,0.078652,0.685714,1.0,0.0,0.021352
7,0.363636,1.0,0.037254,1.0,0.696970,0.0,0.519942,0.067416,0.514286,1.0,1.0,0.021352
8,0.454545,0.0,0.017096,0.0,0.772727,0.0,0.288833,0.112360,0.714286,0.0,0.0,0.021352
9,0.727273,1.0,0.012758,0.0,0.318182,1.0,0.439932,1.000000,0.571429,1.0,1.0,0.021352


train_test_split splits arrays or matrices into random train and test subsets. Here 70% of the rows are used for training and the remaining 30% for testing.
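
A random split changes on every run, and with an imbalanced target the class proportions can drift between the two subsets. As a hedged variant, the sketch below adds `random_state` and `stratify`, neither of which is used in this notebook's own call; the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 30% positives (imbalanced, like DEATH_EVENT)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    train_size=0.7,
    stratify=y_toy,    # keep the 70/30 class ratio in both subsets
    random_state=42,   # make the split reproducible
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```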

In [16]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)

In [17]:
x_test

array([[0.36363636, 1.        , 0.28808369, ..., 0.        , 0.        ,
        0.65124555],
       [0.36363636, 0.        , 0.11150804, ..., 1.        , 0.        ,
        0.27046263],
       [0.54545455, 0.        , 0.00829293, ..., 0.        , 0.        ,
        0.75088968],
       ...,
       [0.41818182, 0.        , 0.1164838 , ..., 1.        , 1.        ,
        0.29893238],
       [0.27272727, 0.        , 0.03993366, ..., 0.        , 0.        ,
        0.24911032],
       [0.58181818, 0.        , 0.043506  , ..., 1.        , 1.        ,
        0.19572954]])

In [18]:
x_train

array([[0.52727273, 0.        , 0.17810666, ..., 1.        , 1.        ,
        0.5088968 ],
       [0.37576364, 1.        , 0.0163307 , ..., 0.        , 0.        ,
        0.59786477],
       [0.09090909, 0.        , 0.07131921, ..., 1.        , 0.        ,
        0.03558719],
       ...,
       [0.54545455, 1.        , 0.00586884, ..., 0.        , 0.        ,
        0.14234875],
       [0.36363636, 0.        , 0.00382751, ..., 0.        , 0.        ,
        0.29537367],
       [0.54545455, 0.        , 0.00893085, ..., 1.        , 1.        ,
        0.72597865]])

In [19]:
y_test

208    0
91     0
253    0
9      1
142    0
      ..
130    0
57     0
108    0
79     0
59     1
Name: DEATH_EVENT, Length: 90, dtype: int64

In [20]:
y_train

176    0
188    0
17     1
264    0
24     1
      ..
246    1
232    0
53     1
104    0
231    0
Name: DEATH_EVENT, Length: 209, dtype: int64

In [22]:
#from sklearn.linear_model import LogisticRegression

log_reg_model = LogisticRegression()
log_reg_model.fit(x_train, y_train)
y_pred = log_reg_model.predict(x_test)

log_acc = log_reg_model.score(x_test, y_test)
log_f1 = f1_score(y_test, y_pred)

In [23]:
#from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(x_train, y_train)
y_pred = svc_model.predict(x_test)

svm_acc = svc_model.score(x_test, y_test)
svm_f1 = f1_score(y_test, y_pred)

In [24]:
#from sklearn.neural_network import MLPClassifier

nn_model = MLPClassifier(hidden_layer_sizes=(128, 128))
nn_model.fit(x_train, y_train)
y_pred = nn_model.predict(x_test)

nn_acc = nn_model.score(x_test, y_test)
nn_f1 = f1_score(y_test, y_pred)



In [25]:
#from sklearn.ensemble import RandomForestClassifier
rfc_model = RandomForestClassifier()
rfc_model.fit(x_train, y_train)
y_pred = rfc_model.predict(x_test)

rfc_acc = rfc_model.score(x_test, y_test)
rfc_f1 = f1_score(y_test, y_pred)

In [26]:
#from sklearn.naive_bayes import GaussianNB
gnb_model = GaussianNB()
gnb_model.fit(x_train, y_train)
y_pred = gnb_model.predict(x_test)

gnb_acc = gnb_model.score(x_test, y_test)
gnb_f1 = f1_score(y_test, y_pred)

In [27]:
#from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier()
knn_model.fit(x_train, y_train)
y_pred = knn_model.predict(x_test)

knn_acc = knn_model.score(x_test, y_test)
knn_f1 = f1_score(y_test, y_pred)

In [28]:
print(f"Logistic Regression:\nAccuracy: {log_acc}\nF1 Score: {log_f1}\n")
print(f"Support Vector Machine:\nAccuracy: {svm_acc}\nF1 Score: {svm_f1}\n")
print(f"Neural Network:\nAccuracy: {nn_acc}\nF1 Score: {nn_f1}\n")
print(f"Random Forest Classifier:\nAccuracy: {rfc_acc}\nF1 Score: {rfc_f1}\n")
print(f"Gaussian NB:\nAccuracy: {gnb_acc}\nF1 Score: {gnb_f1}\n")
print(f"KNeighbors Classifier:\nAccuracy: {knn_acc}\nF1 Score: {knn_f1}\n")

Logistic Regression:
Accuracy: 0.8444444444444444
F1 Score: 0.7307692307692308

Support Vector Machine:
Accuracy: 0.8
F1 Score: 0.608695652173913

Neural Network:
Accuracy: 0.8333333333333334
F1 Score: 0.7457627118644068

Random Forest Classifier:
Accuracy: 0.8888888888888888
F1 Score: 0.8214285714285715

Gaussian NB:
Accuracy: 0.8
F1 Score: 0.6538461538461539

KNeighbors Classifier:
Accuracy: 0.7111111111111111
F1 Score: 0.4347826086956522



Logistic regression is a classification technique that outputs the probability of a categorical dependent variable given a set of independent variables.

Logistic regression uses the logistic function to calculate the probability.

For binary classification, we choose a probability threshold: outputs above it are classified as 1, and outputs below it are classified as 0.
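
The two steps described above, mapping a score to a probability and then applying a threshold, can be sketched with NumPy alone. The scores and the 0.5 cut-off are assumptions chosen for illustration; they are not taken from the fitted model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w.x + b) for four patients
scores = np.array([-2.0, -0.1, 0.4, 3.0])
probabilities = sigmoid(scores)

threshold = 0.5  # assumed cut-off for this sketch
labels = (probabilities > threshold).astype(int)

print(np.round(probabilities, 3))
print(labels)
```

Lowering the threshold would classify more patients as positive, which trades precision for recall.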

In [83]:
log_reg_model = LogisticRegression()
log_reg_model.fit(x_train, y_train)
prediction_y = log_reg_model.predict(x_test)

print(classification_report(y_test, prediction_y))

              precision    recall  f1-score   support

           0       0.85      0.93      0.89        61
           1       0.83      0.66      0.73        29

    accuracy                           0.84        90
   macro avg       0.84      0.79      0.81        90
weighted avg       0.84      0.84      0.84        90



In [84]:
plot_confusion_matrix(log_reg_model, x_test, y_test, cmap='cool')
plt.title('Logistical_Regression')
plt.savefig("Logistical_Regression.png")  # save before showing so the file contains the plot
plt.show()

The Support Vector Machine (SVM) algorithm is a supervised machine learning algorithm used for classification and regression problems.

SVM uses the extreme data points (vectors) closest to the class boundary to construct a hyperplane; these points are called support vectors. The primary objective of the SVM algorithm is to find the optimal hyperplane, the one with the maximum margin, that separates an n-dimensional space into distinct classes.
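
To see which points end up as support vectors, the sketch below fits an SVM on six made-up 2-D points. It assumes a linear kernel so that the hyperplane is easy to reason about; the notebook's own SVC() uses scikit-learn's default RBF kernel, and the toy data is hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable 2-D points
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear", C=1.0)  # linear kernel is an assumption for this sketch
svm.fit(X_toy, y_toy)

predictions = svm.predict([[0.5, 0.5], [3.5, 3.5]])

print(svm.support_vectors_)  # the training points that define the margin
print(svm.n_support_)        # number of support vectors per class
print(predictions)
```

Only the points nearest the boundary appear in `support_vectors_`; the remaining training points could be removed without changing the fitted hyperplane.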

In [85]:
svc_model = SVC()
svc_model.fit(x_train, y_train)
prediction_y = svc_model.predict(x_test)

print(classification_report(y_test, prediction_y))


              precision    recall  f1-score   support

           0       0.79      0.95      0.87        61
           1       0.82      0.48      0.61        29

    accuracy                           0.80        90
   macro avg       0.81      0.72      0.74        90
weighted avg       0.80      0.80      0.78        90



In [86]:
plot_confusion_matrix(svc_model, x_test, y_test, cmap='cool')
plt.title('Support Vector Machine')
plt.savefig("Support Vector Machine.png")  # save before showing so the file contains the plot
plt.show()

Suppose we have two predictor variables and want to do binary classification with a neural network classifier (MLPClassifier).

hidden_layer_sizes: this parameter specifies the number of hidden layers and the number of nodes in each of them.

The ith element of the tuple is the number of nodes in the ith hidden layer.

Thus, the length of the tuple is the total number of hidden layers in the neural network. When the parameter is omitted, scikit-learn uses its default of (100,), i.e. a single hidden layer with 100 nodes.
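
How the tuple translates into a network shape can be checked on a fitted model. The sketch below uses made-up random data with 12 features and an arbitrary (8, 4) architecture; both are assumptions for illustration and differ from the settings used in this notebook.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_toy = rng.random((40, 12))             # hypothetical: 40 samples, 12 features
y_toy = (X_toy[:, 0] > 0.5).astype(int)  # hypothetical binary target

# Two hidden layers: 8 nodes in the first, 4 in the second
mlp = MLPClassifier(hidden_layer_sizes=(8, 4), max_iter=500, random_state=0)
mlp.fit(X_toy, y_toy)

weight_shapes = [w.shape for w in mlp.coefs_]

print(mlp.n_layers_)   # counts the input and output layers as well as the hidden ones
print(weight_shapes)   # one weight matrix between each pair of consecutive layers
```

The weight matrices connect 12 inputs to 8 nodes, 8 nodes to 4 nodes, and 4 nodes to a single output unit for the binary target.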

In [87]:
nn_model = MLPClassifier()
nn_model.fit(x_train, y_train)
prediction_y = nn_model.predict(x_test)

print(classification_report(y_test, prediction_y))

              precision    recall  f1-score   support

           0       0.89      0.92      0.90        61
           1       0.81      0.76      0.79        29

    accuracy                           0.87        90
   macro avg       0.85      0.84      0.84        90
weighted avg       0.87      0.87      0.87        90



In [88]:
plot_confusion_matrix(nn_model, x_test, y_test, cmap='cool')
plt.title('MLPClassifier')
plt.savefig("MLPClassifier.png")  # save before showing so the file contains the plot
plt.show()

The random forest classifier builds a set of decision trees, each from a randomly selected subset of the training set.

It then collects the votes from the individual decision trees to decide the final prediction.
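
In scikit-learn, the "vote" is implemented as an average of each tree's predicted class probabilities rather than a count of hard votes. The sketch below verifies this on synthetic data; the dataset from make_classification and the choice of 25 trees are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 samples, 12 features
X_toy, y_toy = make_classification(n_samples=200, n_features=12, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_toy, y_toy)

# Each fitted tree is available in forest.estimators_
tree_probas = np.stack(
    [tree.predict_proba(X_toy[:5]) for tree in forest.estimators_]
)

# Average the per-tree class probabilities (a "soft" vote)
averaged = tree_probas.mean(axis=0)

print(averaged)
print(forest.predict_proba(X_toy[:5]))
```

The two printed arrays agree, and the predicted class is the one with the highest averaged probability.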

In [89]:
rfc_model = RandomForestClassifier()
rfc_model.fit(x_train, y_train)
prediction_y = rfc_model.predict(x_test)

print(classification_report(y_test, prediction_y))

              precision    recall  f1-score   support

           0       0.90      0.89      0.89        61
           1       0.77      0.79      0.78        29

    accuracy                           0.86        90
   macro avg       0.83      0.84      0.84        90
weighted avg       0.86      0.86      0.86        90



In [79]:
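plot_confusion_matrix(rfc_model, x_test, y_test, cmap='cool')
plt.title('RandomForestClassifier')  # the title must be a string, not the class object
plt.savefig('RandomForestClassifier.png')  # save before showing so the file contains the plot
plt.show()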

The Naive Bayes algorithm assumes that the predictors contribute independently and equally to determining the output class.

The assumption that all predictors are independent of each other rarely holds in real-world scenarios, but it still gives good results in many cases.

Naive Bayes is commonly used for text classification, where data dimensionality is often quite high.
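
For the Gaussian variant used in this notebook, each feature is modelled per class by its own normal distribution, independently of the other features. The sketch below fits GaussianNB on six made-up points and inspects the learned class priors and per-class feature means; the toy data is hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: two features, two well-separated classes
X_toy = np.array([[1.0, 20.0], [1.2, 22.0], [0.9, 19.0],
                  [3.0, 40.0], [3.2, 41.0], [2.9, 39.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

gnb = GaussianNB().fit(X_toy, y_toy)
preds = gnb.predict([[1.1, 21.0], [3.1, 40.0]])

print(gnb.class_prior_)  # estimated probability of each class
print(gnb.theta_)        # per-class mean of each feature
print(preds)
```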

In [60]:
gnb_model = GaussianNB()
gnb_model.fit(x_train, y_train)
prediction_y = gnb_model.predict(x_test)

print(classification_report(y_test, prediction_y))

              precision    recall  f1-score   support

           0       0.82      0.90      0.86        61
           1       0.74      0.59      0.65        29

    accuracy                           0.80        90
   macro avg       0.78      0.74      0.76        90
weighted avg       0.79      0.80      0.79        90



In [90]:
plot_confusion_matrix(gnb_model, x_test, y_test, cmap="cool")
plt.title('Gaussian_NB')
plt.savefig("Gaussian_NB.png")  # save before showing so the file contains the plot
plt.show()

K-Nearest Neighbors is one of the most basic yet essential classification algorithms in machine learning.

It belongs to the supervised learning domain and is widely applied in pattern recognition, data mining and intrusion detection.
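
As a rough illustration of how K-Nearest Neighbors classifies a point, the sketch below looks up the nearest training points for a query and takes their majority label. The 1-D toy data and the choice of k=3 are assumptions; KNeighborsClassifier() in this notebook uses scikit-learn's default of 5 neighbors.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D toy data: low values are class 0, high values are class 1
X_toy = np.array([[0.0], [0.1], [0.2], [0.8], [0.9], [1.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

query = np.array([[0.47]])
distances, indices = knn.kneighbors(query)  # the 3 closest training points
neighbor_labels = y_toy[indices[0]]
prediction = knn.predict(query)

print(indices[0], neighbor_labels)
print(prediction)
```

Two of the three nearest neighbors belong to class 0, so the query is assigned class 0 even though one neighbor belongs to class 1.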

In [91]:
knn_model = KNeighborsClassifier()
knn_model.fit(x_train, y_train)
prediction_y = knn_model.predict(x_test)

print(classification_report(y_test, prediction_y))

              precision    recall  f1-score   support

           0       0.74      0.89      0.81        61
           1       0.59      0.34      0.43        29

    accuracy                           0.71        90
   macro avg       0.66      0.62      0.62        90
weighted avg       0.69      0.71      0.69        90



In [1]:
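# Needs the earlier cells (imports, train/test split, knn_model) to have run in the same session
from sklearn.metrics import plot_confusion_matrix

plot_confusion_matrix(knn_model, x_test, y_test, cmap="cool")
plt.title('KNeighborsClassifier')
plt.savefig("KNeighborsClassifier.png")  # save before showing so the file contains the plot
plt.show()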

| Model | Accuracy | F1 Score |
|---|---|---|
| Logistic Regression | 0.8444 | 0.7308 |
| Support Vector Machine | 0.8000 | 0.6087 |
| Neural Network | 0.8333 | 0.7458 |
| Random Forest Classifier | 0.8889 | 0.8214 |
| Gaussian NB | 0.8000 | 0.6538 |
| KNeighbors Classifier | 0.7111 | 0.4348 |


The random forest algorithm builds numerous decision trees and combines their predictions, which typically yields a more accurate and reliable prediction than a single tree. In this comparison it attained the highest accuracy (0.889) and F1 score (0.821) of the six models on the 90-sample test set. Each tree forms its decision boundaries by recursively splitting the data set into more manageable subsets.

With heart disease becoming more prevalent, it is critical to recognize heart failure at an early stage. This study demonstrates the value of machine learning in healthcare for supporting decision-making and lowering diagnostic costs.

The work's key contribution is the development of a new model for forecasting heart failure in its early stages, based on patients' symptoms and previously collected datasets, together with an emphasis on exploratory data analysis and data processing.