# Heart Failure Clinical Records
This notebook performs essential exploratory data analysis (EDA) and cleaning of the input data.  The cleaned data is then exported for use in the model.

## Introduction

The data comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records#).  It contains the medical records of 299 patients who had heart failure and consists of 13 clinical features.  There is also an [academic paper](https://doi.org/10.1186/s12911-020-1023-5) associated with this data, which in turn builds on an [older paper](https://doi.org/10.1371/journal.pone.0181001).  Both papers describe in detail what the clinical features mean.

In this case, the challenge is to predict the `DEATH_EVENT` of a patient as a boolean (in the form of `0` or `1`).  The complete features present in the data are:

* `age` — Age of the patient (years).
* `anaemia` — Decrease of red blood cells or hemoglobin (boolean).
* `high_blood_pressure` — If the patient has hypertension (boolean).
* `creatinine_phosphokinase` — Level of the CPK enzyme in the blood (mcg/L).
* `diabetes` — If the patient has diabetes (boolean).
* `ejection_fraction` — Percentage of blood leaving the heart at each contraction (percentage).
* `platelets` — Platelets in the blood (kiloplatelets/mL).
* `sex` — Woman or man (binary).
* `serum_creatinine` — Level of serum creatinine in the blood (mg/dL).
* `serum_sodium` — Level of serum sodium in the blood (mEq/L).
* `smoking` — If the patient smokes or not (boolean).
* `time` — Follow-up period (days).
* `DEATH_EVENT` — If the patient died during the follow-up period (boolean).
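
The feature list above can double as a quick schema check once the data is loaded.  The following is a minimal sketch only: `EXPECTED_COLUMNS` and `check_columns` are hypothetical names introduced for illustration, and the column names are copied from the list above.

```python
# Column names copied from the feature list above.
# EXPECTED_COLUMNS and check_columns are illustrative names, not part of the dataset.
EXPECTED_COLUMNS = [
    "age", "anaemia", "high_blood_pressure", "creatinine_phosphokinase",
    "diabetes", "ejection_fraction", "platelets", "sex",
    "serum_creatinine", "serum_sodium", "smoking", "time", "DEATH_EVENT",
]


def check_columns(columns):
    """Return (missing, unexpected) column names, each sorted alphabetically."""
    found = set(columns)
    expected = set(EXPECTED_COLUMNS)
    return sorted(expected - found), sorted(found - expected)
```

With a loaded DataFrame this could be called as `check_columns(all_data.columns)`; two empty lists would mean the columns match the list exactly.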

The newer of the two papers suggests that a model using only `ejection_fraction` and `serum_creatinine` can reach an accuracy of ~0.838 (among other metrics).

## Preparation

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

all_data = pd.read_csv('./data/heart_failure_clinical_records_dataset.csv')
```

## TODO
1. A quick browse of what's available.
2. Any nulls?
3. Check ranges; do they make sense?
4. Check and fix data types.
5. Check for duplicates.
6. Investigate any outliers.
7. Balance of data for target variable.
8. Correlation investigation for potential feature selection.
9. Skewness of variables of interest (and any possible transformations).
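
Several of the checks above can be sketched with plain pandas calls.  This is one possible approach rather than the method the notebook settles on: `quick_checks` is a hypothetical helper, the 1.5 × IQR outlier rule is an assumed choice, and the `toy` frame holds made-up values purely to show the call.  The sketch also assumes the target is still encoded as `0`/`1`, as in the raw CSV.

```python
import pandas as pd


def quick_checks(df, target="DEATH_EVENT"):
    """Summarise several checklist items for a DataFrame (illustrative helper)."""
    numeric = df.select_dtypes(include="number")

    # Assumed outlier rule: flag values more than 1.5 * IQR outside the quartiles.
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)

    return {
        "nulls": df.isnull().sum(),                      # nulls per column
        "ranges": numeric.agg(["min", "max"]),           # value ranges to eyeball
        "dtypes": df.dtypes,                             # current data types
        "duplicates": int(df.duplicated().sum()),        # fully duplicated rows
        "outliers": is_outlier.sum(),                    # flagged values per column
        "target_balance": df[target].value_counts(normalize=True),
        "target_correlation": numeric.corr()[target].drop(target),
        "skewness": numeric.skew(),
    }


# Made-up values for illustration only; these are NOT rows from the dataset.
toy = pd.DataFrame({
    "age": [60.0, 75.0, 50.0, 60.0, 65.0],
    "serum_creatinine": [1.1, 1.9, 1.0, 1.1, 9.0],
    "DEATH_EVENT": [0, 1, 0, 0, 1],
})
report = quick_checks(toy)
```

On the real data the call would be `quick_checks(all_data)`.  Fixing data types (for example casting the `0`/`1` flags with `astype(bool)`) and applying transformations are left out of the helper because they change the data rather than inspect it.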