# Heart Failure Clinical Records
This notebook performs essential exploratory data analysis (EDA) and cleaning on the input data.  The cleaned data is then exported for use in the model.

## Introduction

The data comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records#).  It contains the medical records of 299 patients who had heart failure and consists of 13 clinical features.  There is also an [academic paper](https://doi.org/10.1186/s12911-020-1023-5) associated with this data, which in turn builds on an [older paper](https://doi.org/10.1371/journal.pone.0181001).  Both papers describe in detail what the clinical features mean.

In this case, the challenge is to predict the `DEATH_EVENT` of a patient as a boolean (in the form of `0` or `1`).  The complete features present in the data are:

* `age` — Age of the patient (years).
* `anaemia` — Decrease of red blood cells or hemoglobin (boolean).
* `high_blood_pressure` — If the patient has hypertension (boolean).
* `creatinine_phosphokinase` — Level of the CPK enzyme in the blood (mcg/L).
* `diabetes` — If the patient has diabetes (boolean).
* `ejection_fraction` — Percentage of blood leaving the heart at each contraction (percentage).
* `platelets` — Platelets in the blood (kiloplatelets/mL).
* `sex` — Woman or man (binary).
* `serum_creatinine` — Level of serum creatinine in the blood (mg/dL).
* `serum_sodium` — Level of serum sodium in the blood (mEq/L).
* `smoking` — If the patient smokes or not (boolean).
* `time` — Follow-up period (days).
* `DEATH_EVENT` — If the patient died during the follow-up period (boolean).

The newer of the two papers suggests that using only `ejection_fraction` and `serum_creatinine` can result in an accuracy of ~0.838 (among other metrics).

## Preparation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

all_data = pd.read_csv('./data/heart_failure_clinical_records_dataset.csv')

## TODO
* Check and fix data types.
* Check ranges; do they make sense?
* Investigate any outliers.
* Balance of data for target variable.
* Correlation investigation for potential feature selection.
* Skewness of variables of interest (and any possible transformations).
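
Several of these checks map onto single pandas calls. The following is only a rough sketch of what they might look like, run on a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset), and not the analysis itself:

```python
import pandas as pd

# Hypothetical toy frame; the values are invented purely for illustration.
toy = pd.DataFrame({
    "serum_creatinine": [1.0, 1.1, 1.2, 1.3, 9.0],
    "ejection_fraction": [60, 45, 38, 30, 20],
    "DEATH_EVENT": [0, 0, 0, 1, 1],
})

# Balance of the target: share of rows in each class.
balance = toy["DEATH_EVENT"].value_counts(normalize=True)

# Pearson correlation of every feature with the target.
corr_with_target = toy.corr()["DEATH_EVENT"].drop("DEATH_EVENT")

# Sample skewness; a positive value indicates a long right tail.
skewness = toy["serum_creatinine"].skew()

print(balance)
print(corr_with_target)
print(skewness)
```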

## Data Overview
Let's first browse the data to see what is present.

In [2]:
all_data.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


## Sanity Checks

Before we continue or make any changes, let's check for any duplicate rows.  Luckily, there are none!

In [3]:
all_data[all_data.duplicated()]

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
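
As a side note on how `duplicated()` behaves, here is a minimal sketch on a made-up frame (`toy` is hypothetical and contains one deliberately repeated row). By default only the repeats after the first occurrence are flagged, while `keep=False` flags every copy:

```python
import pandas as pd

# Hypothetical toy frame with one repeated row (values are invented).
toy = pd.DataFrame({
    "age": [60, 60, 75],
    "serum_sodium": [136, 136, 130],
})

# Default keep='first': the first occurrence is not counted as a duplicate.
n_repeats = int(toy.duplicated().sum())

# keep=False marks all copies, which helps when inspecting them side by side.
all_copies = toy[toy.duplicated(keep=False)]

print(n_repeats)
print(all_copies)
```

An empty result from the default call, as seen above for the real data, therefore means no row is an exact repeat of an earlier one.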


Let's also check for any missing values that would need to be investigated and then either imputed or removed.  We're lucky again that there are none.

In [4]:
print(f'Total rows with a null element: {all_data.isnull().any(axis=1).sum()}')

Total rows with a null element: 0
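
Had the count been non-zero, a per-column breakdown would be the natural next step. A minimal sketch, assuming a made-up frame (`toy` is hypothetical, with gaps inserted on purpose):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with deliberately missing entries.
toy = pd.DataFrame({
    "age": [60.0, np.nan, 75.0],
    "platelets": [265000.0, 162000.0, np.nan],
    "time": [4, 6, 7],
})

# Number of missing values in each column.
nulls_per_column = toy.isnull().sum()

# Number of rows containing at least one missing value.
rows_with_null = int(toy.isnull().any(axis=1).sum())

print(nulls_per_column)
print(rows_with_null)
```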


## Data Types
So the data has no duplicates or missing values, which is great.  Let's now check the data types for each column based on the accompanying paper's explanation and fix any that come up.

Pandas, by default, uses 64-bit values (`int64`, `float64`, etc.).  We could go about reducing the size of these to save memory, but realistically given the small size of this dataset, it's really not worth doing.
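
For reference, this is roughly what such a reduction could look like. It is only a sketch on a made-up frame (`toy` is hypothetical) and is not applied to the data in this notebook; `pd.to_numeric` with `downcast="integer"` picks the smallest signed integer type that can hold each column's values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame using pandas' default 64-bit integers.
toy = pd.DataFrame({
    "anaemia": np.array([0, 1, 0, 1], dtype=np.int64),
    "serum_sodium": np.array([130, 136, 129, 137], dtype=np.int64),
})

before = toy.memory_usage(index=False).sum()

# Downcast each column to the smallest integer type that fits its values.
small = toy.copy()
for col in small.columns:
    small[col] = pd.to_numeric(small[col], downcast="integer")

after = small.memory_usage(index=False).sum()

print(small.dtypes)
print(before, after)
```

On 299 rows the saving would amount to a few kilobytes at most.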

Of the column types in the output below, the two that concern me are `age` and `platelets`.  The former should definitely be an integer, so something is wrong there.  The latter, judging by the first few rows printed earlier, appears to hold mostly whole numbers with a few exceptions; this is worth looking into.

In [6]:
all_data.dtypes

age                         float64
anaemia                       int64
creatinine_phosphokinase      int64
diabetes                      int64
ejection_fraction             int64
high_blood_pressure           int64
platelets                   float64
serum_creatinine            float64
serum_sodium                  int64
sex                           int64
smoking                       int64
time                          int64
DEATH_EVENT                   int64
dtype: object

### Age
Starting with `age`, we can check which elements are not whole numbers to see how many are off.

In [7]:
all_data[all_data['age'].map(lambda x: not x.is_integer())]

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
185,60.667,1,104,1,30,0,389000.0,1.5,136,1,0,171,1
188,60.667,1,151,1,40,1,201000.0,1.0,136,0,0,172,0


It appears that only two values have a fractional part, which is likely an input error.  Given that they are so few in number, it's worth just rounding them to the nearest whole number and converting the column to `int64`.

In [12]:
all_data['age'] = all_data['age'].round(0).astype(np.int64)

In [18]:
print(f'The age column is now: {all_data.age.dtype}')

# Check the above two values that were floats before.
print(f'Entry 185 is now {all_data.age.iloc[185]}')
print(f'Entry 188 is now {all_data.age.iloc[188]}')

The age column is now: int64
Entry 185 is now 61
Entry 188 is now 61
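
One detail worth knowing about the rounding step: `Series.round` rounds to the nearest value and sends exact halves to the nearest even number, so it is not the same as always rounding up. A small sketch with a made-up series (`ages` is hypothetical; only 60.667 matches a value seen in the data):

```python
import numpy as np
import pandas as pd

# 60.667 appears in the data; 60.5 and 61.5 are invented to show tie handling.
ages = pd.Series([75.0, 60.667, 60.5, 61.5])

rounded = ages.round(0)

# Guard: make sure nothing fractional is left before casting to integers.
assert (rounded % 1 == 0).all()

as_int = rounded.astype(np.int64)
print(as_int.tolist())  # [75, 61, 60, 62]
```

Neither of the two affected rows here is an exact half, so both become 61 either way.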


### Platelets
Now let's deal with the `platelets` column.

In [None]:
# TODO: Continue the fixing from the other doc but here.