## Heart Failure Prediction 

![](https://viewmedica.com/images/thumbslarge/heartfailure_1280.jpg)

source : https://www.wkhs.com/heart/conditions-treated/congestive-heart-failure

### Hi there!😄 I am new to data science and this is my attempt at the Heart Failure Prediction dataset. Feel free to comment if you have any questions, insights, or advice on this notebook or on data science in general :) Upvote if you find my work useful! Thank you!

# **<span style="color:#6daa9f;">IMPORTING LIBRARIES</span>**


In [None]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# split into train and test sets
from sklearn.model_selection import train_test_split

# sklearn models
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# sklearn metrics
from sklearn import metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


# **<span style="color:#6daa9f;">LOAD DATA</span>**


In [None]:
data = pd.read_csv('../input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')

data.head()

# **<span style="color:#6daa9f;">EDA</span>**


In [None]:
#summary of variables in data
print(data.describe())

#identify column names,column data type and shape of the dataset
print(np.shape(data))

print(data.columns.tolist())

data.dtypes

From this output, we can see that the dataset contains 299 rows and 13 columns.
Every column is numeric.

Details about the data:

- age: age of the patient
- anaemia: whether the patient has a low red blood cell or hemoglobin level (0 or 1)
- creatinine_phosphokinase: level of the CPK enzyme in the blood (mcg/L); a high CPK level indicates injury or stress to muscle tissue, the heart, or the brain
- diabetes: whether the patient has diabetes (a patient with type 2 diabetes is 4 times more likely to develop heart failure than someone without)
- ejection_fraction: percentage of blood pumped out of the heart at each contraction
- high_blood_pressure: whether the patient has hypertension
- platelets: level of platelets in the blood
- serum_creatinine: level of serum creatinine in the patient's blood (mg/dL); gives an estimate of how well the kidneys filter
- serum_sodium: level of serum sodium in the blood
- sex: sex of the patient (0 = female, 1 = male)
- smoking: whether the patient smokes
- time: follow-up period of the patient
- DEATH_EVENT: whether the patient died during the follow-up period


In [None]:
#identify missing values in dataset
data.isnull().sum()


This gives the number of missing values in every column. In this case, there are no missing values in any column.

In [None]:
corrMatrix = data.corr()

# draw the annotated heatmap on a single, large figure
plt.subplots(figsize=(20,20))
sns.heatmap(corrMatrix, annot=True)
plt.show()

Correlation of the features with DEATH_EVENT:
- the patient's follow-up time has the strongest correlation with DEATH_EVENT
- followed by ejection_fraction
- age is linearly correlated with DEATH_EVENT (the higher the age, the more likely the patient is to die)
- the serum_creatinine level is linearly correlated with DEATH_EVENT

From the matrix, I decided to use age, time, ejection_fraction, serum_creatinine and serum_sodium.
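Reading the strongest correlations off a large heatmap is easy to get wrong, so here is a minimal sketch of ranking features by their absolute correlation with the target. The `toy` DataFrame is synthetic and only stands in for the real data (an assumption for illustration); with the notebook's `data`, the same two lines at the end should apply.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the clinical data (assumption: NOT the real dataset).
rng = np.random.default_rng(0)
n = 300
time = rng.integers(4, 286, size=n)
age = rng.integers(40, 96, size=n)
noise = rng.normal(size=n)

# Toy target that depends strongly on time and weakly on age.
score = -0.02 * time + 0.03 * age + noise
toy = pd.DataFrame({
    "time": time,
    "age": age,
    "platelets": rng.normal(260000, 90000, size=n),
    "DEATH_EVENT": (score > np.median(score)).astype(int),
})

# Correlation of each feature with the target, ranked by absolute value.
target_corr = toy.corr()["DEATH_EVENT"].drop("DEATH_EVENT")
ranked = target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value matters because a strong negative correlation is just as informative as a strong positive one.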


## Variable Analysis
Reference : https://towardsdatascience.com/data-exploration-and-analysis-using-python-e564473d7607

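Univariate analysis is used to highlight missing and outlier values. Several of our variables (anaemia, diabetes, high_blood_pressure, sex, smoking and DEATH_EVENT) are binary categorical variables encoded as 0 or 1; the rest are continuous.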

Handling outliers
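One common way to flag outliers in a single continuous variable is the 1.5 × IQR rule. The sketch below uses made-up serum creatinine values (hypothetical numbers, not taken from the dataset) and only reports the flagged points without dropping them.

```python
import pandas as pd

# Hypothetical serum creatinine values (mg/dL), made up for illustration.
values = pd.Series([0.9, 1.0, 1.1, 1.2, 1.3, 1.0, 1.1, 9.4], name="serum_creatinine")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag points outside the fences; they are reported, not removed.
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```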


### Data Analysis & Visualization

Visualize the data.

Age
- the number of death events across the age distribution




In [None]:
#check the data provided according to our target variable(death_event)

print(data['DEATH_EVENT'].value_counts())
data['DEATH_EVENT'].value_counts().plot(kind='bar')

The plot above clearly shows that our data is imbalanced: there are 203 samples for '0' and 96 samples for '1'.
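To put the imbalance in perspective, the short calculation below works out the share of each class and the accuracy a model would get by always predicting the majority class. It uses only the two counts reported above; any model accuracy is worth comparing against this baseline.

```python
# Class counts reported above for DEATH_EVENT.
n_survived, n_died = 203, 96
total = n_survived + n_died

# Share of deaths and the accuracy of always predicting the majority class.
share_died = n_died / total
baseline_accuracy = max(n_survived, n_died) / total

print(f"share of deaths: {share_died:.1%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.1%}")
```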

In [None]:
sns.histplot(data,x='age',hue='DEATH_EVENT',multiple='stack')


In [None]:
sns.histplot(data,x='age',hue='sex',multiple="stack")



- I will not remove outliers, as these values are medical data and are medically meaningful.




## Model Building

**Split train and test data**



In [None]:
col = ['time','ejection_fraction','serum_creatinine','age','serum_sodium']
predictors = data[col]
target = data["DEATH_EVENT"]

X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size = 0.22, random_state = 0)



**Standardize our data**

Since our numerical columns have very different ranges of values, I decided to bring them to a common scale using standardization: each feature is rescaled to zero mean and unit variance, with the scaler fitted on the training set only.
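As a small check of what `StandardScaler` does, the sketch below compares it with the z-score formula z = (x - mean) / std computed by hand. The arrays are hypothetical toy values, not rows from the dataset; the point is that the test set is transformed with the mean and standard deviation of the training set.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test values for two features on different scales.
X_train_toy = np.array([[50.0, 130.0], [60.0, 137.0], [70.0, 140.0], [80.0, 145.0]])
X_test_toy = np.array([[65.0, 138.0]])

# Fit on the training data only, as in the notebook.
scaler = StandardScaler().fit(X_train_toy)

# Manual z-score using the training mean and population standard deviation.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # ddof=0, the convention StandardScaler uses
manual_test = (X_test_toy - mean) / std

matches = np.allclose(scaler.transform(X_test_toy), manual_test)
print(matches)
```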

In [None]:
# Feature Scaling

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

## 5 Models
 
* Naive Bayes
* Random Forest
* Logistic Regression
* SVM
* Decision Trees
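With only 299 rows, a single train/test split can give noisy accuracy figures. As an optional complement (not what this notebook does), the sketch below compares the same five model families with 5-fold cross-validation. It runs on synthetic data from `make_classification`, which is an assumption made so the example is self-contained; the hyperparameters mirror the ones used in the cells that follow, and `random_state` on the decision tree is added only for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 299 rows, 5 features, roughly 2:1 class imbalance.
X_toy, y_toy = make_classification(
    n_samples=299, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.68, 0.32], random_state=0,
)

models = {
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1),
    "Logistic Regression": LogisticRegression(random_state=22),
    "SVM": SVC(kernel="linear"),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

cv_scores = {}
for name, estimator in models.items():
    # Scaling lives inside the pipeline so it is refitted on each training fold.
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipeline, X_toy, y_toy, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```

Putting the scaler inside the pipeline keeps each validation fold out of the scaling statistics, which is the same idea as fitting `StandardScaler` on `X_train` only.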


In [None]:
#gaussian naive bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
  
# making predictions on the testing set
y_pred = gnb.predict(X_test)
  
# comparing actual response values (y_test) with predicted response values (y_pred)
acc_naivebayes = metrics.accuracy_score(y_test, y_pred)*100
print("Gaussian Naive Bayes model accuracy(in %):",acc_naivebayes )
print(classification_report(y_test, y_pred))

In [None]:
#logistic regression 

classifier = LogisticRegression(random_state = 22)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print(cm)
sns.heatmap(cm, annot=True)

acc_logisticregression = round(accuracy_score(y_test, y_pred) * 100, 2)
print ("Logistic Regression model Accuracy : ", acc_logisticregression) 
print(classification_report(y_test, y_pred))

In [None]:
#random forest
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

acc_randomforest = round(accuracy_score(y_test, y_pred) * 100, 2)
print("Random Forest Model Accuracy : ",acc_randomforest)
print(classification_report(y_test, y_pred))

In [None]:
#SVM Classifier

from sklearn.svm import SVC  
clf = SVC(kernel='linear') 
  
# fitting x samples and y classes 
clf.fit(X_train, y_train) 
y_pred = clf.predict(X_test)

acc_SVM = round(accuracy_score(y_test, y_pred) * 100, 2)
print("SVM Model Accuracy : ",acc_SVM)
print(classification_report(y_test, y_pred))

In [None]:
#decision tree
from sklearn.tree import DecisionTreeClassifier

modeldt = DecisionTreeClassifier()

# fit the model with the training data
modeldt.fit(X_train,y_train)

# depth of the decision tree
print('Depth of the Decision Tree :', modeldt.get_depth())

# predict the target on the train dataset
predict_train = modeldt.predict(X_train)

# accuracy score on train dataset (in %)
accuracy_train = accuracy_score(y_train,predict_train)
print('accuracy_score on train dataset : ', accuracy_train*100)

# predict the target on the test dataset with the decision tree
predict_test = modeldt.predict(X_test)

# accuracy score on test dataset (in %)
accuracy_test = accuracy_score(y_test,predict_test)
print('accuracy_score on test dataset : ', accuracy_test*100)

# classification report for the decision tree's own predictions
print(classification_report(y_test, predict_test))