<h1 style = "background:yellow;color:Blue;border:0;border-radius:3px;font-family:verdana" > Project: Predicting mortality due to Heart Failure </h1>

# Problem Statement:
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs, and this dataset contains 12 features that can be used to predict mortality by heart failure.

We will examine these features closely and remove anything that could adversely affect our model. Then we will train the five models listed below and compare their results.

<h2 style = "background:yellow;color:blue;border:0;border-radius:3px;font-family:verdana" >Import Library </h2>

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import RobustScaler, StandardScaler
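from sklearn.metrics import mean_squared_error, confusion_matrix, accuracy_score, ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; keep the old name as an alias
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator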
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

#warning
import warnings
warnings.filterwarnings('ignore')


# Citation
Davide Chicco, Giuseppe Jurman: Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020). (Link: https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5#Sec2)

<a id = "1"></a>
<h2 style = "background:yellow;color:blue;border:0;border-radius:3px;font-family:verdana">Load and Check Data </h2>

In [None]:
data = pd.read_csv("Heart_Failure_DataSet.csv")

In [None]:
data.info() 

In [None]:
data.dropna(inplace=True)

In [None]:
data.info()

In [None]:
data['smoking']=data['smoking'].map({'YES':1,'NO':0})

In [None]:
data.head()

In [None]:
data['sex']=data['sex'].map({'MAN':1,'WOMAN':0})

In [None]:
data.head()
data.isna().sum()
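
A note on the two `map` calls above: `Series.map` with a dictionary returns `NaN` for any value that is not a key, so a label with different casing or a trailing space would silently turn into a missing value. The sketch below uses a made-up column (the toy values are assumptions, not rows from the dataset) to show one way to detect and avoid this:

```python
import pandas as pd

# Toy column; the values are illustrative, not taken from the dataset.
smoking = pd.Series(["YES", "NO", "yes", "NO "])

# Normalise case and whitespace before mapping so variants still match.
cleaned = smoking.str.strip().str.upper()

raw_mapped = smoking.map({"YES": 1, "NO": 0})
clean_mapped = cleaned.map({"YES": 1, "NO": 0})

# Values absent from the dictionary become NaN.
unmapped_raw = int(raw_mapped.isna().sum())
unmapped_clean = int(clean_mapped.isna().sum())
print(unmapped_raw, unmapped_clean)
```

Checking `isna().sum()` right after each mapping, as this notebook does, catches the problem before it reaches the models.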

In [None]:
#The Columns
print("Data Columns --> ",data.columns)

In [None]:
data.head()

In [None]:
desc = data.describe()
desc

In [None]:
print(data.isna().sum())

In [None]:
data['time']=data['time'].astype('int64')

In [None]:
data.info()

<ul>
    <li style = "color:red"> <p style = "color:black;font-weight:bold"> We have checked the columns in the data. After dropping incomplete rows and encoding the text columns, no null values remain. </p>  </li>
</ul>

<a id = "2"></a>
<h2 style = "background:yellow;color:blue;border:0;border-radius:3px;font-family:verdana">Variable Description </h2>

<ol>    
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> age : </strong> the age of the person with heart failure </p> </li>  
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> anaemia : </strong> Decrease of red blood cells or hemoglobin (boolean) </p> </li>
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> creatinine_phosphokinase : </strong> Level of the CPK enzyme in the blood (mcg/L) </p> </li>  
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> diabetes : </strong> If the patient has diabetes (boolean) </p> </li>
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> ejection_fraction : </strong> Percentage of blood leaving the heart at each contraction (percentage) </p> </li>  
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> high_blood_pressure  : </strong> If the patient has hypertension (boolean) </p> </li>
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> platelets : </strong> Platelets in the blood (kiloplatelets/mL) </p> </li>  
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> serum_creatinine : </strong> Level of serum creatinine in the blood (mg/dL) </p> </li>
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> serum_sodium : </strong>Level of serum sodium in the blood (mEq/L) </p> </li>
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> sex : </strong> Woman or man (binary) </p> </li>
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> smoking : </strong> If the patient smokes or not (boolean) </p> </li>
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> time : </strong> Follow-up period (days) </p> </li>
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> DEATH_EVENT : </strong> If the patient died during the follow-up period (boolean) </p> </li>
</ol>

In [None]:
data.info()

<ul>    
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> float64 : </strong> age, platelets, serum_creatinine</p> </li> 
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> int64 : </strong> We see that all our remaining columns are int.</p> </li>
</ul>

<a id = "3"></a>
<h2 style = "background:yellow;color:blue;border:0;border-radius:3px;font-family:verdana">Univariate Variable Analysis </h2>

<ul>    
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> Our data consists of float and int columns, but several of the int columns hold only 0/1 values and are really categorical. We will examine the two groups separately.  </strong>  </p>
        <ul>
            <li style = "color:gray"> <p style = "color:black"> Numerical Variable </p> </li>
            <li style = "color:gray"> <p style = "color:black"> Categorical Variable </p> </li>
        </ul>
    </li> 
</ul>

<a id = "4"></a>
<h3 style = "background:blue;color:white;border:0;border-radius:3px;font-family:verdana">Numerical Variable </h3>

In [None]:
def plot_hist(variable):
    print("min {} : {} ".format(variable, min(data[variable])))
    print("max {} : {}".format(variable, max(data[variable])))
    
    plt.figure(figsize=(9,3))
    plt.hist(data[variable], color="darkred")
    plt.xlabel(variable)
    plt.ylabel("Occurrences")
    plt.title("{} distribution with histogram ".format(variable))
    plt.show()

In [None]:
numericVar = ["age","creatinine_phosphokinase","ejection_fraction","platelets","serum_creatinine","serum_sodium","time"]
for n in numericVar:
    plot_hist(n)

<a id = "5"></a>
<h3 style = "background:blue;color:white;border:0;border-radius:3px;font-family:verdana">Categorical Variable </h3>

In [None]:
def bar_plot(variable):
    
    # get feature
    var = data[variable]
    #count number of categorical variable (value/sample)
    varValue = var.value_counts()

    #visualize
    plt.figure(figsize=(9,3))
    plt.bar(varValue.index, varValue,color = "lightgreen", edgecolor = "black", linewidth = 2)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Number of Patients")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable,varValue))

In [None]:
category = ["anaemia","diabetes","high_blood_pressure","sex","smoking"]
for c in category:
    bar_plot(c)

<a id = "6"></a>
<h2 style = "background:yellow;color:blue;border:0;border-radius:3px;font-family:verdana">Exploratory Data Analysis (EDA) </h2>

In [None]:
data.head()

In [None]:
data1 = data.loc[:,data.columns!='DEATH_EVENT']
data1.head()
data1.info()

In [None]:
corr_matrix = data1.corr()
correl = data.corr()
sns.clustermap(corr_matrix, annot = True, fmt = ".2f")
plt.title("Correlation between features")
plt.show()

<ul>    
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> We have drawn the correlation matrix to examine the relationships between features.  </strong>  </p> </li>
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> A correlation of 1 means a perfect positive linear relationship, -1 means a perfect negative (inverse) linear relationship, and values near 0 mean little or no linear relationship.  </strong>  </p> </li> 
</ul>
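
To make the meaning of these values concrete, here is a minimal sketch on hand-made arrays (illustrative numbers only, unrelated to the dataset). It uses `np.corrcoef`, which computes the Pearson coefficient, the same default measure that `DataFrame.corr` uses:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# A positive linear function of x correlates perfectly with x.
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]
# A negative linear function gives a correlation of -1.
r_neg = np.corrcoef(x, -3 * x + 10)[0, 1]
# A symmetric, non-linear relation can have zero linear correlation.
r_zero = np.corrcoef(x - 3, (x - 3) ** 2)[0, 1]

print(round(r_pos, 2), round(r_neg, 2), round(r_zero, 2))
```

The last case is a reminder that a coefficient near 0 rules out a linear relationship only; the two variables can still depend on each other.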

In [None]:
threshold = 0.2 
filtre = np.abs(correl["DEATH_EVENT"]) > threshold
corr_features = correl.columns[filtre].tolist()
sns.clustermap(data[corr_features].corr(), annot = True, fmt = ".2f")
plt.title("Correlation Between Features with |Corr| to DEATH_EVENT > 0.2")
plt.show()

In [None]:
#pair plot
sns.pairplot(data[corr_features], diag_kind = "kde", markers = "+", hue = "DEATH_EVENT")
plt.show()

<a id = "10"></a>
<h2 style = "background:yellow;color:blue;border:0;border-radius:3px;font-family:verdana">Modeling </h2>

<p  style = "color:black;font-weight:500" > We will train the models listed below and then compare their results. </p>
<ul>
    <li style = "color:darkred;font-weight:bold" >Logistic Regression Model</li>
    <li style = "color:darkred;font-weight:bold" >Decision Tree Model</li>    
    <li style = "color:darkred;font-weight:bold" >RandomForest Model</li>
    <li style = "color:darkred;font-weight:bold" >SVM Model</li>
    <li style = "color:darkred;font-weight:bold" >XGBoost Model</li>          
</ul>

<h3 style = "background:yellow;color:blue;border:0;border-radius:3px;font-family:verdana">Train - Test Split </h3>

In [None]:
X = data.drop("DEATH_EVENT", axis = 1)
y = data.DEATH_EVENT
X.head()

In [None]:
test_size = 0.2
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = test_size, random_state = 42)
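
The split above is purely random. When one class is rarer than the other, a stratified split keeps the class ratio similar in the training and test sets. The following is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are assumptions made for illustration, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 100 rows, 30% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the class ratio (nearly) identical in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

In this notebook the equivalent change would be passing `stratify = y` to `train_test_split`; whether it alters the reported accuracies has not been checked here.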

In [None]:
print("X_train shape {}, len {}.".format(X_train.shape,len(X_train)))
print("X_test shape {}, len {}.".format(X_test.shape,len(X_test)))
print("Y_train shape {}, len {}.".format(Y_train.shape,len(Y_train)))
print("Y_test shape {}, len {}.".format(Y_test.shape,len(Y_test)))

In [None]:
# list to keep our results
result_acc = []

<a id = "13"></a>
<h3 style = "background:yellow;color:blue;border:0;border-radius:3px;font-family:verdana">Logistic Regression Model</h3>

Input:- Training Data i.e. X_train, Y_train

Output:- Accuracy

Tool:- sklearn.linear_model

In [None]:
model_log_reg = LogisticRegression(solver='lbfgs', max_iter=1000)
model_log_reg.fit(X_train, Y_train)
importance = model_log_reg.coef_[0]

plt.bar([x for x in range(len(importance))], importance, color = "orange")
plt.show()

In [None]:
x_train_log_reg = X_train  #renaming training and testing set as per model
x_test_log_reg = X_test    #renaming training and testing set as per model

In [None]:
log_reg = LogisticRegression()
log_reg.fit(x_train_log_reg, Y_train)
y_pred_log = log_reg.predict(x_test_log_reg)
cm_log_reg = confusion_matrix(Y_test, y_pred_log)  # (y_true, y_pred): rows are true classes
acc_log_reg = accuracy_score(Y_test, y_pred_log)
result_acc.append(round(acc_log_reg*100,2))
print("RESULT")
print("Logistic Regression Model Acc : ",acc_log_reg)
print("Logistic Regression Model Cm : ",cm_log_reg)
plot_confusion_matrix(log_reg,x_test_log_reg,Y_test)
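
Accuracy alone can hide how well the positive class (death events) is detected, so it can help to read precision and recall off the confusion matrix as well. A minimal sketch with made-up labels (the `y_true` and `y_pred` lists are illustrative assumptions, not model output):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels (1 = death event); illustrative values only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# scikit-learn expects (y_true, y_pred): rows are true classes, columns are predictions.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted deaths that were real
recall = tp / (tp + fn)     # share of real deaths that were detected

print(tn, fp, fn, tp, precision, recall)
```

The same numbers are available directly from `precision_score` and `recall_score`; computing them by hand here only shows where they sit in the matrix.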

<a id = "14"></a>
<h3 style = "background:yellow;color:blue;border:0;border-radius:3px;font-family:verdana">DecisionTree Model</h3>

Input:- Training Data i.e. X_train, Y_train

Output:- Accuracy

Tool:- sklearn.tree

In [None]:
model_decision_tree = DecisionTreeClassifier()
model_decision_tree.fit(X_train, Y_train)
importance = model_decision_tree.feature_importances_

plt.bar([x for x in range(len(importance))], importance, color = "blue")
plt.show()

In [None]:
x_train_dec = X_train  #renaming training and testing set as per model
x_test_dec = X_test    #renaming training and testing set as per model

In [None]:
dt_param_grid = {"min_samples_split" : range(10,500,20),
                "max_depth": range(1,20)}

decision_tree = DecisionTreeClassifier()
clf = GridSearchCV(decision_tree, param_grid=dt_param_grid, cv = StratifiedKFold(n_splits = 10), scoring = "accuracy", n_jobs = -1,verbose = 1)
clf.fit(x_train_dec,Y_train)

y_pred_decision_tree = clf.predict(x_test_dec)
cm_y_pred_decision_tree = confusion_matrix(Y_test, y_pred_decision_tree)
acc_y_pred_decision_tree = accuracy_score(Y_test, y_pred_decision_tree)
result_acc.append(round(acc_y_pred_decision_tree*100,2))
print("RESULT")
print("Decision Tree Model Acc : ",acc_y_pred_decision_tree)
print("Decision Tree Model Cm : ",cm_y_pred_decision_tree)
plot_confusion_matrix(clf,x_test_dec,Y_test)
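
After a grid search it is often useful to see which parameter combination won and what its cross-validated score was. The sketch below runs a much smaller grid on synthetic data (the data and the grid values are assumptions chosen so the example finishes quickly) and reads the `best_params_` and `best_score_` attributes:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the heart-failure features.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# A deliberately small grid so the sketch runs quickly.
grid = {"max_depth": [1, 2, 3], "min_samples_split": [10, 30]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=grid,
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

# best_params_ is the winning combination; best_score_ is its mean CV accuracy.
print(search.best_params_, round(search.best_score_, 3))
```

For the search in this notebook the same attributes would be `clf.best_params_` and `clf.best_score_`.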

<a id = "12"></a>
<h3 style = "background:yellow;color:blue;border:0;border-radius:3px;font-family:verdana">Random Forest Model</h3>

Input:- Training Data i.e. X_train, Y_train

Output:- Accuracy

Tool:- sklearn.ensemble

<ul>
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> First, we fit the model on all features and plot its feature importances. </strong> </p> </li>
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> We do the same for the other models where importances or coefficients are available. </strong> </p> </li>
</ul>

In [None]:
model_rnd = RandomForestClassifier()
model_rnd.fit(X_train, Y_train)
importance = model_rnd.feature_importances_

# plot feature importance
plt.bar([x for x in range(len(importance))], importance, color = "red")
plt.show()

In [None]:
x_train_random_forest = X_train  #renaming training and testing set as per model
x_test_random_forest = X_test    #renaming training and testing set as per model

In [None]:
random_forest_model = RandomForestClassifier(max_depth=7, random_state=25)
random_forest_model.fit(x_train_random_forest, Y_train)
y_pred_random_forest = random_forest_model.predict(x_test_random_forest)
cm_random_forest = confusion_matrix(Y_test, y_pred_random_forest)  # (y_true, y_pred): rows are true classes
acc_random_forest = accuracy_score(Y_test, y_pred_random_forest)
result_acc.append(round(acc_random_forest*100,2))
print("RESULT")
print("Random Forest Model Acc : ",acc_random_forest)
print("Random Forest Model Cm : ",cm_random_forest)
plot_confusion_matrix(random_forest_model,x_test_random_forest,Y_test)

<a id = "15"></a>
<h3 style = "background:yellow;color:blue;border:0;border-radius:3px;font-family:verdana">SVM Model</h3>

Input:- Training Data i.e. X_train, Y_train

Output:- Accuracy

Tool:- sklearn.svm

In [None]:
sc = StandardScaler()
X_training = sc.fit_transform(X_train)
model_svm = SVC(kernel="linear")
model_svm.fit(X_training, Y_train)
importance = model_svm.coef_[0]

plt.bar([x for x in range(len(importance))], importance, color = "brown")
plt.show()

In [None]:
x_train_svm = X_train  #renaming training and testing set as per model
x_test_svm = X_test    #renaming training and testing set as per model

In [None]:
svm = SVC()
svm.fit(x_train_svm, Y_train)
y_pred_svm = svm.predict(x_test_svm)
cm_svm = confusion_matrix(Y_test, y_pred_svm)  # (y_true, y_pred): rows are true classes
acc_svm = accuracy_score(Y_test, y_pred_svm)
result_acc.append(round(acc_svm*100,2))
print("RESULT")
print("SVM Model Acc : ",acc_svm)
print("SVM Model Cm : ",cm_svm)
plot_confusion_matrix(svm,x_test_svm,Y_test)
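
The SVM above is trained on unscaled features, while the coefficient plot used `StandardScaler`. SVMs are sensitive to feature scale, so one common option is to put the scaler and the classifier in a single pipeline. This is a sketch on synthetic data (the data and the names `X_demo`, `y_demo`, `svm_scaled` are assumptions), not a claim about what the notebook's accuracy would become:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; one feature is inflated to mimic columns such as platelets.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_demo[:, 0] *= 100_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training data only, then reuses it on the test data.
svm_scaled = make_pipeline(StandardScaler(), SVC())
svm_scaled.fit(X_tr, y_tr)

print("scaled SVM accuracy:", svm_scaled.score(X_te, y_te))
```

Keeping the scaler inside the pipeline also avoids leaking test-set statistics into training.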

<a id = "11"></a>
<h3 style = "background:yellow;color:blue;border:0;border-radius:3px;font-family:verdana">XGBoost model </h3>

Input:- Training Data i.e. X_train, Y_train

Output:- Accuracy

Tool:- xgboost

In [None]:
XGB = XGBClassifier(max_depth = 1,eval_metric='logloss')  # binary target, so logloss rather than the multiclass mlogloss
XGB.fit(X_train, Y_train)
y_pred_xgb = XGB.predict(X_test)
cm_xgb = confusion_matrix(Y_test, y_pred_xgb)  # (y_true, y_pred): rows are true classes
acc_xgb = accuracy_score(Y_test, y_pred_xgb) 
result_acc.append(round(acc_xgb*100,2))
print("RESULT:",result_acc)
print("XGBoost Model Acc : ",acc_xgb)
print("XGBoost Model Cm : ",cm_xgb)
plot_confusion_matrix(XGB,X_test,Y_test)

<a id = "17"></a>
<h2 style = "background:yellow;color:blue;border:0;border-radius:3px;font-family:verdana">Model Result</h2>

In [None]:
results = pd.DataFrame({"Model Result":result_acc, 
                        "Models":["LogisticRegression",
                                  "DecisionTree",
                                  "RandomForest",
                                  "SVM",
                                  "XGBoost"]})

In [None]:
results

In [None]:
g = sns.barplot(x = "Model Result", y = "Models", data = results)  # keyword args: positional x/y are not accepted by recent seaborn
g.set_xlabel("Accuracy")
g.set_title("Models Result", color = "darkred")
plt.show()

<ul>
    <li style = "color:darkred;font-weight:bold" > <p style = "color:black;font-weight:400" > <strong> We have reached the end. As you can see, the decision tree gave our best result, with an accuracy of 95.2%. </strong> </p> </li>
</ul>
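
A single 80/20 split leaves a small test set, so the ranking of models can shift with the random seed. Cross-validation gives a steadier comparison. The sketch below is a hedged illustration on synthetic data with only two of the models (the data, the fold count and the hyperparameters are assumptions, not the notebook's settings):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 12 features; size and content are assumptions.
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Mean and spread over folds give a steadier comparison than one split.
cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(name, round(scores.mean(), 3), round(scores.std(), 3))
```

Applied to this notebook, the same loop could run over all five models with `X` and `y` in place of the synthetic arrays.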