# Analysis Report on Heart Failure Dataset

## Introduction
Heart failure (HF) occurs when the heart cannot pump enough blood to meet the needs of the body. It is a critical and widespread medical condition that affects millions of individuals worldwide. I chose this project with the goal of making a positive contribution to public health and to advancements in medical science.

This report uses supervised machine learning (regression and classification) to predict patients’ survival from their data and to identify the most important features among those included in their medical records. It is based on the clinical records of 299 heart failure patients (Chicco & Jurman, 2020), obtained from https://archive.ics.uci.edu.

### The clinical variables in the dataset are:
- age: age of the patient (years)
- anaemia: decrease of red blood cells or hemoglobin (boolean)
- creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L)
- diabetes: if the patient has diabetes (boolean)
- ejection fraction: percentage of blood leaving the heart at each contraction (percentage)
- high blood pressure: if the patient has hypertension (boolean)
- platelets: platelets in the blood (kiloplatelets/mL)
- sex: woman or man (binary)
- serum creatinine: level of serum creatinine in the blood (mg/dL)
- serum sodium: level of serum sodium in the blood (mEq/L)
- smoking: if the patient smokes or not (boolean)
- time: follow-up period (days)
- [target] death event: if the patient died during the follow-up period (boolean)

### For Regression Analysis
response variable (y) = creatinine phosphokinase

explanatory variables (x) = age, ejection fraction, platelets, and serum creatinine

### For Classification
response variable (y) = death event

explanatory variables (x) = age, ejection fraction, platelets, and serum creatinine

In [4]:
# Dependencies
import pandas as pd
import numpy as np
import statistics as stats
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import statsmodels.formula.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression

# PART A: Regression Analysis
In this section, I perform regression analysis on the data (HeartFailure.csv) to estimate the relationship between the dependent variable, creatinine phosphokinase, and the independent variables: age, ejection fraction, platelets, and serum creatinine.

In [5]:
# reading and converting csv file to dataframe
data = pd.read_csv('HeartFailure.csv')
data

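| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.00 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.00 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.00 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.00 | 2.7 | 116 | 0 | 0 | 8 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 294 | 62.0 | 0 | 61 | 1 | 38 | 1 | 155000.00 | 1.1 | 143 | 1 | 1 | 270 | 0 |
| 295 | 55.0 | 0 | 1820 | 0 | 38 | 0 | 270000.00 | 1.2 | 139 | 0 | 0 | 271 | 0 |
| 296 | 45.0 | 0 | 2060 | 1 | 60 | 0 | 742000.00 | 0.8 | 138 | 0 | 0 | 278 | 0 |
| 297 | 45.0 | 0 | 2413 | 0 | 38 | 0 | 140000.00 | 1.4 | 140 | 1 | 1 | 280 | 0 |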


In [6]:
# assigning the variables I will use to shorter names
cp_data = data['creatinine_phosphokinase']
age_data = data['age']
ef_data = data['ejection_fraction']
pl_data = data['platelets']
sc_data = data['serum_creatinine']

### Finding the correlation between each pair of variables

In [7]:
# selecting the variables needed for correlation
var = data[['creatinine_phosphokinase', 'age', 'ejection_fraction', 'platelets', 'serum_creatinine']]

# calculating the correlation matrix
corr_matrix = var.corr()

# displaying the correlation matrix
print("Correlation matrix:")
print(corr_matrix)

# correlation interpretation
for var1 in var.columns:
    for var2 in var.columns:
        if var1 != var2:
            # get the correlation value between the two variables and print
            corr_value = corr_matrix.loc[var1, var2]
            print(f"\nCorrelation between {var1} and {var2}: {corr_value:.2f}")

            # find the strength of the correlation
            if abs(corr_value) >= 0.5:
                print("There is a strong correlation between the variables.")
            else:
                print("There is a weak correlation between the variables.")
            
            # determine if the relationship is positive or negative
            if corr_value > 0:
                print("Positive linear correlation.")
            elif corr_value < 0:
                print("Negative linear correlation.")
            else:
                print("No linear correlation between the variables.")
            print()

Correlation matrix:
                          creatinine_phosphokinase       age  \
creatinine_phosphokinase                  1.000000 -0.081584   
age                                      -0.081584  1.000000   
ejection_fraction                        -0.044080  0.060098   
platelets                                 0.024463 -0.052354   
serum_creatinine                         -0.016408  0.159187   

                          ejection_fraction  platelets  serum_creatinine  
creatinine_phosphokinase          -0.044080   0.024463         -0.016408  
age                                0.060098  -0.052354          0.159187  
ejection_fraction                  1.000000   0.072177         -0.011302  
platelets                          0.072177   1.000000         -0.041198  
serum_creatinine                  -0.011302  -0.041198          1.000000  

Correlation between creatinine_phosphokinase and age: -0.08
There is a weak correlation between the variables.
Negative linear correlation.


...

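From the above, we can conclude that every pair of variables is only weakly correlated (the largest absolute value is 0.16, between age and serum_creatinine), and that more of the pairs have a negative linear correlation (6 of 10) than a positive one.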
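The interpretation loop above prints every pair twice (once as A vs. B and once as B vs. A). As a minimal sketch of an alternative, the snippet below computes the Pearson coefficient by hand and visits each unordered pair once with `itertools.combinations`. The helper name `pearson` and the dictionary `columns` are illustrative assumptions, and the values are only the first five rows shown in the data preview, so the coefficients will not match the full-dataset matrix.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Assumption: only the first five rows of the preview, for illustration
columns = {
    "age": [75.0, 55.0, 65.0, 50.0, 65.0],
    "ejection_fraction": [20, 38, 20, 20, 20],
    "serum_creatinine": [1.9, 1.1, 1.3, 1.9, 2.7],
}

# combinations() yields each unordered pair exactly once
pairs = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in combinations(columns, 2)
}

for (a, b), r in pairs.items():
    # same 0.5 cutoff for "strong" that the report's loop uses
    strength = "strong" if abs(r) >= 0.5 else "weak"
    direction = "positive" if r > 0 else "negative" if r < 0 else "no"
    print(f"{a} vs {b}: r = {r:.2f} ({strength}, {direction} linear correlation)")
```

With five columns, this approach would report 10 pairs instead of 20 printed blocks.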

### Scatter Matrix of the variables

In [8]:
# Plotting the scatter matrix
dat = data[['age', 'ejection_fraction', 'platelets', 'serum_creatinine', 'creatinine_phosphokinase']]
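# compatibility shim: pandas 2.x removed DataFrame.iteritems, which older plotly releases may still call
pd.DataFrame.iteritems = pd.DataFrame.items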
fig = px.scatter_matrix(dat, width=900, height=600)
fig.show()

Interpretation: The last row of the scatter matrix plot above shows no apparent linear relationship between the response variable creatinine_phosphokinase and any of the explanatory variables (age, ejection_fraction, platelets, and serum_creatinine).

### Multiple Linear Regression Analysis of the variables

In [9]:
# Linear regression analysis
model = sm.ols('creatinine_phosphokinase ~ age + ejection_fraction + platelets + serum_creatinine',data)
result = model.fit()
print(result.summary2())

                      Results: Ordinary least squares
Model:              OLS                      Adj. R-squared:     -0.005    
Dependent Variable: creatinine_phosphokinase AIC:                4967.6964 
Date:               2024-05-22 13:49         BIC:                4986.1986 
No. Observations:   299                      Log-Likelihood:     -2478.8   
Df Model:           4                        F-statistic:        0.6488    
Df Residuals:       294                      Prob (F-statistic): 0.628     
R-squared:          0.009                    Scale:              9.4592e+05
---------------------------------------------------------------------------
                        Coef.   Std.Err.    t    P>|t|    [0.025    0.975] 
---------------------------------------------------------------------------
Intercept             1038.0204 372.4637  2.7869 0.0057  304.9874 1771.0534
age                     -6.3072   4.8135 -1.3103 0.1911  -15.7804    3.1661
ejection_fraction       -3.3738   

### Model Analysis and Interpretation

- Intercept: The intercept of 1038.0204 is the predicted value of creatinine_phosphokinase when all independent variables are zero. Since a patient cannot have an age or ejection fraction of zero, this value is an extrapolation rather than a clinically meaningful level.

- age: The negative coefficient (-6.3072) suggests that, holding other variables constant, each additional year of age is associated with a decrease of about 6.31 in predicted creatinine_phosphokinase. With a p-value of 0.1911, this association is not statistically significant at the 0.05 level.

- ejection_fraction: The negative coefficient (-3.3738) suggests that, holding other variables constant, each additional percentage point of ejection_fraction is associated with a decrease of about 3.37 in predicted creatinine_phosphokinase.

- platelets: The positive coefficient (0.0002) is small and the p-value is high, indicating that platelets may not be a statistically significant predictor.

- serum_creatinine: The negative coefficient (-3.3840) suggests that, holding other variables constant, each additional mg/dL of serum_creatinine is associated with a decrease of about 3.38 in predicted creatinine_phosphokinase.
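To make "holding other variables constant" concrete, here is a small sketch that uses the coefficients quoted in the list above. In a linear model, changing one variable by some amount shifts the prediction by the coefficient times that amount, whatever the values of the other variables. The names `coef` and `change_in_prediction` are illustrative assumptions, and platelets is left out because only its rounded coefficient (0.0002) is quoted.

```python
# Coefficients as quoted in the interpretation list above
coef = {
    "age": -6.3072,
    "ejection_fraction": -3.3738,
    "serum_creatinine": -3.3840,
}


def change_in_prediction(variable, delta):
    """Shift in predicted CPK when `variable` changes by `delta`
    and every other explanatory variable stays fixed."""
    return coef[variable] * delta


# Ten more years of age, everything else unchanged
age_effect = change_in_prediction("age", 10)

# Five more percentage points of ejection fraction, everything else unchanged
ef_effect = change_in_prediction("ejection_fraction", 5)

print(f"+10 years of age:           {age_effect:+.3f}")
print(f"+5 points ejection fraction: {ef_effect:+.3f}")
```

These shifts describe what the fitted equation does, not an established effect, because the coefficients are estimated with wide confidence intervals.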


### Making Prediction

In [10]:
# setting of new independent variables for prediction
age = [30, 70, 80, 45, 60]
ejection_fraction = [19, 25, 45, 38, 50]
platelets = [150000, 278000, 508000, 134000, 155000]
serum_creatinine = [1.2, 2.5, 3.0, 2.0, 2.8]

prediction = pd.DataFrame({'age': age, 'ejection_fraction': ejection_fraction,
                           'platelets': platelets, 'serum_creatinine': serum_creatinine})
prediction['Predicted CPK level'] = result.predict(prediction)
prediction

| | age | ejection_fraction | platelets | serum_creatinine | Predicted CPK level |
|---|---|---|---|---|---|
| 0 | 30 | 19 | 150000 | 1.2 | 815.221865 |
| 1 | 70 | 25 | 278000 | 2.5 | 567.800673 |
| 2 | 80 | 45 | 508000 | 3.0 | 488.58264 |
| 3 | 45 | 38 | 134000 | 2.0 | 650.117094 |
| 4 | 60 | 50 | 155000 | 2.8 | 517.158189 |


Observation: The R-squared value (0.009) is very low, indicating that the model explains less than 1% of the variance in creatinine_phosphokinase. The adjusted R-squared is slightly negative (-0.005), and the overall F-test is not significant (p = 0.628). Additionally, the p-values for the coefficients suggest that none of the independent variables is statistically significant in predicting the dependent variable. The predicted CPK levels above should therefore be treated with caution, and other variables should be explored for future predictions.
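The fit statistics in the OLS summary are linked by standard formulas, which the sketch below uses as a consistency check. It recovers R-squared from the reported F-statistic and degrees of freedom, then applies the adjusted R-squared penalty for the number of predictors. The function names are illustrative assumptions; the input numbers come from the summary output above.

```python
def r_squared_from_f(f_stat, df_model, df_resid):
    """Invert F = (R2 / df_model) / ((1 - R2) / df_resid) to get R2."""
    return (f_stat * df_model) / (f_stat * df_model + df_resid)


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values reported in the OLS summary
f_stat = 0.6488
df_model = 4
df_resid = 294
n_obs = 299

r2 = r_squared_from_f(f_stat, df_model, df_resid)
adj_r2 = adjusted_r_squared(r2, n_obs, df_model)

print(f"R-squared:          {r2:.3f}")
print(f"Adjusted R-squared: {adj_r2:.3f}")
```

A negative adjusted R-squared means that, after the penalty for four predictors, the model does no better than predicting the mean CPK level for every patient.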

# PART B: Classification
In this section, I perform classification on the data (HeartFailure.csv) to predict the label of the dependent variable, death event, using the same independent variables as before: age, ejection fraction, platelets, and serum creatinine.

### Frequency Table for DEATH_EVENT

In [11]:
# making a frequency table for DEATH_EVENT
freq_de = data['DEATH_EVENT'].value_counts()
freq_de = pd.DataFrame({'DEATH_EVENT': freq_de.keys(), 'frequency': freq_de.values})
freq_de = freq_de.sort_values(by = 'DEATH_EVENT')
freq_de['relative frequency'] = freq_de['frequency']/freq_de['frequency'].sum()
freq_de

| | DEATH_EVENT | frequency | relative frequency |
|---|---|---|---|
| 0 | 0 | 203 | 0.67893 |
| 1 | 1 | 96 | 0.32107 |


From the frequency table above, 203 patients (relative frequency 0.67893) did not die during the follow-up period (DEATH_EVENT = 0), while 96 patients (relative frequency 0.32107) died during the follow-up period (DEATH_EVENT = 1). The two classes are therefore imbalanced, with roughly two survivors for every death.
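Because the classes are unequal in size, it helps to know what accuracy a trivial classifier would reach by always predicting the more frequent class. The sketch below computes that majority-class baseline from the counts in the frequency table. The variable names are illustrative assumptions, and the baseline covers the full dataset, not the test split used later.

```python
# Counts taken from the frequency table above
counts = {0: 203, 1: 96}

total = sum(counts.values())
relative = {label: n / total for label, n in counts.items()}

# The label a "always predict the most common class" rule would output
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"Total patients:    {total}")
print(f"Majority class:    {majority_label}")
print(f"Baseline accuracy: {baseline_accuracy:.5f}")
```

A fitted classifier is only useful if it clearly beats this baseline, or if it trades some accuracy for better detection of the minority class.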

### Training and Testing the data


In [12]:
# response variable(y) and explanatory variables(x)
y = data[['DEATH_EVENT']]
x = data[['age', 'ejection_fraction', 'platelets', 'serum_creatinine']]

# splitting data into 70% train and 30% test data
(x_train, x_test, y_train, y_test) = train_test_split(x,y,test_size=0.3, random_state=42)

train_data = pd.concat([x_train, y_train], axis=1, join='inner')
test_data = pd.concat([x_test, y_test], axis=1, join='inner')
print('Train Data \n', train_data, '\n')
print('Test Data \n', test_data)

Train Data 
         age  ejection_fraction  platelets  serum_creatinine  DEATH_EVENT
224  58.000                 25  504000.00               1.0            0
68   70.000                 25  244000.00               1.2            1
222  42.000                 35  365000.00               1.1            0
37   82.000                 50  321000.00               1.0            1
16   87.000                 38  262000.00               0.9            1
..      ...                ...        ...               ...          ...
188  60.667                 40  201000.00               1.0            0
71   58.000                 35  122000.00               0.9            0
106  55.000                 45  263000.00               1.3            0
270  44.000                 30  263358.03               1.6            0
102  80.000                 25  149000.00               1.1            0

[209 rows x 5 columns] 

Test Data 
       age  ejection_fraction  platelets  serum_creatinine  DEATH_EVENT
28

### Multiple Logistic Regression Analysis based on Train data

In [13]:
# Logistic regression model using statsmodel formula API
model = sm.logit(formula='DEATH_EVENT ~ age + ejection_fraction + platelets + serum_creatinine', data=train_data)
result2 = model.fit()

# print the summary of the logistic regression model
print(result2.summary2())

Optimization terminated successfully.
         Current function value: 0.456735
         Iterations 6
                         Results: Logit
Model:              Logit            Method:           MLE       
Dependent Variable: DEATH_EVENT      Pseudo R-squared: 0.233     
Date:               2024-05-22 13:52 AIC:              200.9153  
No. Observations:   209              BIC:              217.6270  
Df Model:           4                Log-Likelihood:   -95.458   
Df Residuals:       204              LL-Null:          -124.38   
Converged:          1.0000           LLR p-value:      8.2430e-12
No. Iterations:     6.0000           Scale:            1.0000    
-----------------------------------------------------------------
                   Coef.  Std.Err.    z    P>|z|   [0.025  0.975]
-----------------------------------------------------------------
Intercept         -3.3970   1.2090 -2.8097 0.0050 -5.7666 -1.0274
age                0.0642   0.0163  3.9293 0.0001  0.0322  0.0962


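The logistic regression summary above shows that the model as a whole is statistically significant (LLR p-value 8.2430e-12, pseudo R-squared 0.233). Among the predictors, age has a positive and significant coefficient (0.0642, p = 0.0001), meaning that older patients have higher estimated odds of dying during the follow-up period when the other predictors are held constant.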
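Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this for the age coefficient from the summary (0.0642) and shows the logistic function that maps log-odds to a probability. The names `sigmoid`, `odds_ratio_per_year`, and `odds_ratio_per_decade` are illustrative assumptions.

```python
from math import exp


def sigmoid(log_odds):
    """Logistic function: converts log-odds into a probability in (0, 1)."""
    return 1 / (1 + exp(-log_odds))


# Age coefficient reported in the Logit summary
age_coef = 0.0642

# Multiplicative change in the odds of death for one extra year of age
odds_ratio_per_year = exp(age_coef)

# The same for ten extra years (coefficients add on the log-odds scale)
odds_ratio_per_decade = exp(10 * age_coef)

print(f"Odds ratio per year of age:   {odds_ratio_per_year:.3f}")
print(f"Odds ratio per decade of age: {odds_ratio_per_decade:.2f}")
print(f"Probability at log-odds 0:    {sigmoid(0):.2f}")
```

Read this way, each additional decade of age is associated with roughly a 1.9-fold increase in the estimated odds of death during follow-up, other predictors held fixed.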

### Finding the Predictions (classes) for the Test data and Comparing them to the True values.

In [14]:
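# Make predictions on death event
predictions = result2.predict(test_data)
# NOTE: the cutoff below is 0.05, not the conventional 0.5, so any patient with a
# predicted death probability above 5% is labelled 1; the outputs that follow reflect this
predicted_labels = (predictions > 0.05).astype(int)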
true_labels = test_data['DEATH_EVENT']

pred = pd.DataFrame({'True Death Status': true_labels,
                    'Predicted Probability': predictions,
                    'Predicted Death Status': predicted_labels})
print(pred)
print('\nTrue Status \n', true_labels.value_counts(), '\n')
print('Predicted Status \n', predicted_labels.value_counts())

     True Death Status  Predicted Probability  Predicted Death Status
281                  0               0.596109                       1
265                  0               0.108076                       1
164                  1               0.131410                       1
9                    1               0.999001                       1
77                   0               0.062411                       1
..                 ...                    ...                     ...
132                  0               0.077532                       1
72                   1               0.562412                       1
15                   1               0.321737                       1
10                   1               0.865367                       1
157                  0               0.220349                       1

[90 rows x 3 columns]

True Status 
 DEATH_EVENT
0    53
1    37
Name: count, dtype: int64 

Predicted Status 
 1    83
0     7
Name: count, dtype: int64


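From the above we observe that the test set truly contains 37 patients who died during the follow-up period and 53 who did not, whereas the model predicts 83 deaths and only 7 survivors. This heavy over-prediction of deaths follows from the probability cutoff of 0.05 used in the code: almost every patient has a predicted death probability above 5%.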
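The effect of the probability cutoff can be illustrated with the ten test rows printed above. This is a minimal sketch: it uses only those ten displayed rows, not the full 90-row test set, so the accuracies are illustrative and do not replace the reported results. The names `rows` and `accuracy_at` are assumptions.

```python
# (true death status, predicted probability) for the ten rows printed above
rows = [
    (0, 0.596109), (0, 0.108076), (1, 0.131410), (1, 0.999001), (0, 0.062411),
    (0, 0.077532), (1, 0.562412), (1, 0.321737), (1, 0.865367), (0, 0.220349),
]


def accuracy_at(cutoff):
    """Share of rows classified correctly when probabilities above
    `cutoff` are labelled 1 (death) and the rest 0."""
    correct = sum(int(prob > cutoff) == truth for truth, prob in rows)
    return correct / len(rows)


for cutoff in (0.05, 0.5):
    predicted_deaths = sum(prob > cutoff for _, prob in rows)
    print(f"cutoff {cutoff}: {predicted_deaths} predicted deaths, "
          f"accuracy {accuracy_at(cutoff):.1f}")
```

On these ten rows, a cutoff of 0.05 labels every patient as a death, while 0.5 labels four. Any cutoff chosen this way should be validated on data not used to pick it.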

### Confusion Matrix and Accuracy

In [15]:
# Evaluate the model's performance:
# finding the confusion matrix
true_labels = test_data['DEATH_EVENT']
conf_matrix = confusion_matrix(true_labels, predicted_labels)
print('Confusion Matrix:\n', conf_matrix)

# finding the test accuracy
test_acc = accuracy_score(true_labels, predicted_labels)
print('The Accuracy for the Test Set is {}'.format(test_acc*100))
print('The Test Error Rate is {}'.format(100-test_acc*100))

Confusion Matrix:
 [[ 5 48]
 [ 2 35]]
The Accuracy for the Test Set is 44.44444444444444
The Test Error Rate is 55.55555555555556


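The accuracy of the fitted model is 44.44% and the test error rate is 55.56%. Reading the confusion matrix (rows are true labels, columns are predicted labels), the model correctly identifies 35 of the 37 deaths but only 5 of the 53 survivors, misclassifying the other 48 survivors as deaths. Its accuracy is lower than the 58.89% (53 of 90) that would be obtained by predicting survival for every test patient.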
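Accuracy alone hides how the errors are distributed, so the sketch below derives sensitivity, specificity, and precision from the confusion matrix printed above. It assumes scikit-learn's layout for binary labels, `[[TN, FP], [FN, TP]]`, with class 1 (death) as the positive class. The variable names are illustrative.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted
conf_matrix = [[5, 48],
               [2, 35]]

(tn, fp), (fn, tp) = conf_matrix
total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all patients classified correctly
sensitivity = tp / (tp + fn)      # share of true deaths that were detected
specificity = tn / (tn + fp)      # share of true survivors that were detected
precision = tp / (tp + fp)        # share of predicted deaths that were real

print(f"Accuracy:    {accuracy:.4f}")
print(f"Sensitivity: {sensitivity:.4f}")
print(f"Specificity: {specificity:.4f}")
print(f"Precision:   {precision:.4f}")
```

The very high sensitivity combined with very low specificity is the signature of a cutoff that is set too low for balanced classification.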

## Conclusion
Based on the analysis done in PART A (Regression Analysis), I found no linear relationship between the response and explanatory variables, and only weak, mostly negative, correlations between the variables. This leads me to conclude that age, ejection fraction, platelets, and serum creatinine do not accurately predict creatinine phosphokinase levels, and that other variables or more data are required for a better prediction.

Based on PART B (Classification Analysis), I found that most patients (203 of 299) did not die during their follow-up period, and that the fitted classifier reached only 44.44% accuracy on the test set. For future analysis, I would suggest recording the patients' health progress, that is, the changes in their blood measurements during the follow-up period. With such a record, more information would be available, better predictions of patients' survival could be made, and deaths could potentially be reduced.


## Reference
Chicco, D., & Jurman, G. (2020). Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making, 20, 16. https://doi.org/10.1186/s12911-020-1023-5