<p align="center" style="font-size:50px">Heart Failure Prediction</p>
<p align="center" style="font-size:30px">Exploratory Data Analysis</p>
<p align="center">In this step, we perform an exploratory data analysis (EDA) to better understand the dataset and to inform the modeling step that follows.</p>

### About Dataset
This dataset contains the medical records of 299 patients who had heart failure, collected during their follow-up period, where each patient profile has 13 clinical features.

### Structure of Dataset
- `age`: Age of the patient
- `anaemia`: Decrease of red blood cells or hemoglobin
- `creatinine_phosphokinase`: Level of the CPK enzyme in the blood (mcg/L)
- `diabetes`: If the patient has diabetes
- `ejection_fraction`: Percentage of blood leaving the heart at each contraction
- `high_blood_pressure`: If the patient has hypertension
- `platelets`: Platelets in the blood (kiloplatelets/mL)
- `serum_creatinine`: Level of serum creatinine in the blood (mg/dL)
- `serum_sodium`: Level of serum sodium in the blood (mEq/L)
- `sex`: Sex of the patient (0 = woman, 1 = man)
- `smoking`: Whether the patient smokes
- `time`: Follow-up period (days)
- `DEATH_EVENT`: Whether the patient died during the follow-up period

In [1]:
# import libraries
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

In [2]:
# load dataset
df = pd.read_csv('../data/heart_failure_clinical_records_dataset.csv')

### Understanding Data

In [3]:
print(f'This dataset contains:\n{df.shape[0]} rows and {df.shape[1]} columns.')

This dataset contains:
299 rows and 13 columns.


In [4]:
# First five rows
df.head()

,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1


In [5]:
# columns of dataset
df.columns

Index(['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',
       'ejection_fraction', 'high_blood_pressure', 'platelets',
       'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time',
       'DEATH_EVENT'],
      dtype='object')

In [6]:
# check null value counts and datatype of columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


- Apart from the binary flags (`anaemia`, `diabetes`, `high_blood_pressure`, `smoking`, `DEATH_EVENT`), the only categorical feature is `sex`, which is already encoded as 0 (woman) and 1 (man). These numeric codes are not meaningful on their own, so the encoding will be changed later in the modeling step.
- There are no null values.
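
The two observations above can also be checked programmatically. The following is a minimal sketch, not part of the original notebook: `sample` is a hypothetical toy frame built from a few columns of the rows printed by `df.head()`, and `binary_columns` is an illustrative name.

```python
import pandas as pd

# Toy frame (assumption): a few columns from the first five rows shown by df.head()
sample = pd.DataFrame({
    "age": [75.0, 55.0, 65.0, 50.0, 65.0],
    "anaemia": [0, 0, 0, 1, 1],
    "sex": [1, 1, 1, 1, 0],
    "DEATH_EVENT": [1, 1, 1, 1, 1],
})

# Number of missing values in each column
null_counts = sample.isnull().sum()

# Columns whose values are all 0 or 1 behave like binary categories
binary_columns = [col for col in sample.columns if sample[col].isin([0, 1]).all()]

print(null_counts)
print(binary_columns)
```

On the full dataset, the same two expressions could be applied to `df` instead of `sample`.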

In [7]:
# store sex as object dtype so it is treated as a label rather than a number
df['sex'] = df['sex'].astype('object')

In [8]:
# work on a copy and replace the numeric sex codes with readable labels
df_cleaned = df.copy()
df_cleaned['sex'] = np.where(df_cleaned['sex']==0,'woman','man')

In [9]:
df_cleaned.head()

,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,man,0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,man,0,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,man,1,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,man,0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,woman,0,8,1
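
The relabeling with `np.where` assigns `'man'` to every value that is not 0, so an unexpected code would be silently mislabeled. As an alternative sketch (a swapped-in technique, not the notebook's method), an explicit dictionary passed to `Series.map` turns unknown codes into `NaN`, which makes them easy to spot. The names `sex_labels`, `codes`, and `unexpected` are illustrative.

```python
import pandas as pd

# Encoding stated in the notes above: 0 = woman, 1 = man
sex_labels = {0: "woman", 1: "man"}

# Toy series (assumption): the sex codes of the first five rows shown above
codes = pd.Series([1, 1, 1, 1, 0], name="sex")

# Codes missing from the dictionary are mapped to NaN instead of a default label
labels = codes.map(sex_labels)

# Count codes that could not be mapped
unexpected = int(labels.isna().sum())

print(labels.tolist())
print(unexpected)
```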


### Univariate Analysis

In [10]:
# plot the distribution of each numerical variable
numerical_variables = df[['age','creatinine_phosphokinase','ejection_fraction','platelets','serum_creatinine','serum_sodium']]

fig  = make_subplots(rows = 3, cols=2)
rows = [1,1,2,2,3,3]
cols = [1,2,1,2,1,2]

def create_trace(df: pd.DataFrame):
    list_traces = list()
    for column in df.columns:
        trace = ff.create_distplot([df[column]],group_labels=[column],show_rug=False)
        list_traces.append(trace.data[0])
    return list_traces

traces = create_trace(numerical_variables)

fig.add_traces(traces,rows=rows,cols=cols)
fig.show()
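
The histograms can be complemented with a numeric summary. The sketch below is an optional addition under stated assumptions: `sample` is a hypothetical toy frame holding two of the numerical columns from the rows printed earlier, and the `skew` column is added only for illustration. A positive skew indicates a long right tail.

```python
import pandas as pd

# Toy frame (assumption): two numerical columns from the first five rows shown by df.head()
sample = pd.DataFrame({
    "creatinine_phosphokinase": [582, 7861, 146, 111, 160],
    "serum_creatinine": [1.9, 1.1, 1.3, 1.9, 2.7],
})

# One row per variable: count, mean, std, min, quartiles, max
summary = sample.describe().T

# Sample skewness per variable, aligned on the variable names
summary["skew"] = sample.skew()

print(summary[["mean", "50%", "max", "skew"]])
```

On the full dataset, the same calls could be applied to `numerical_variables`.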
            