## Objective
Train a _Random Forest_ on a _real-world_ dataset for _heart failure prediction_.

## Submission Guidelines
Submit your finished _Jupyter Notebook_, both as `.ipynb` and as an exported `.pdf`.

# Background

![](images/heart-failure.jpg)

From the _World Health Organization_:

* Cardiovascular diseases (CVDs) are the leading cause of death globally.
* An estimated 17.9 million people died from CVDs in 2019, representing 32% of all global deaths. Of these deaths, 85% were due to heart attack and stroke.
* Over three quarters of CVD deaths take place in low- and middle-income countries.
* Out of the 17 million premature deaths (under the age of 70) due to noncommunicable diseases in 2019, 38% were caused by CVDs.
* Most cardiovascular diseases can be prevented by addressing behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol.
* It is important to detect cardiovascular disease as early as possible so that management with counselling and medicines can begin.

Apart from `Age` and `Sex`, you are given the following attributes to predict whether a person has suffered (or is likely to suffer) from a `HeartDisease` ($1$) or not ($0$):

* `ChestPainType`: type of chest pain.
    * _TA_: Typical Angina
    * _ATA_: Atypical Angina
    * _NAP_: Non-Anginal Pain
    * _ASY_: Asymptomatic
* `RestingBP`: resting blood pressure in _mmHg_.
* `Cholesterol`: serum cholesterol in _mg/dl_.
* `FastingBS`: $1$ if the fasting blood sugar is above 120 _mg/dl_, otherwise $0$.
* `RestingECG`: resting electrocardiogram results.
    * _Normal_: Normal
    * _ST_: having _ST-T_ wave abnormality (_T wave inversions_ and/or _ST elevation_ or depression higher than 0.05 _mV_)
    * _LVH_: showing probable or definite _left ventricular hypertrophy_ by _Estes' criteria_
* `MaxHR`: maximum heart rate achieved.
* `ExerciseAngina`: _Yes_, if the person suffers from exercise-induced angina.
* `Oldpeak`: measure of the depression occurring in the _ST_ segment (_mm_).
* `ST_Slope`: the slope of the peak exercise _ST_ segment.
    * _Up_: upsloping
    * _Flat_: flat
    * _Down_: downsloping

# Task

1. Load the data from the provided `.csv` **files** into `X_train`, `y_train` and `X_test`.
    * This time you don't have access to the _test labels_! 
1. Perform a quick _EDA_ (_Exploratory Data Analysis_).
1. Create 2-3 visualizations to illustrate aspects of the data you think are interesting!
1. Encode the _categorical variables_ as you deem suitable.
    * As always, only `fit` the _encoder_ on the training data, not the test data!
    * Hint: You'll need both _ordinal_ and _one-hot_ encoding!
1. Feel free to apply any further preprocessing steps.
1. Train and evaluate a _random forest_ using _cross-validation_ on `X_train`.
1. Train a model using the best combination of hyperparameters and preprocessing steps on all of `X_train` and make predictions on `X_test`.
1. You'll get `y_test` in the next lesson, to see how well your model performs on unseen data!
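
The encoding step in the list above (fit on the training data only, combining _ordinal_ and _one-hot_ encoding) could look roughly like the following sketch. The toy rows are made up for illustration, and treating `ST_Slope` as ordered _Down_ < _Flat_ < _Up_ is an assumption, not something the task prescribes. Which columns you treat as ordinal is your call.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy rows standing in for X_train / X_test (values are made up for illustration).
X_train_toy = pd.DataFrame({
    "Age": [40, 49, 37, 54],
    "Sex": ["M", "F", "M", "M"],
    "ChestPainType": ["ATA", "NAP", "ASY", "TA"],
    "ST_Slope": ["Up", "Flat", "Down", "Up"],
})
X_test_toy = pd.DataFrame({
    "Age": [58, 45],
    "Sex": ["F", "M"],
    "ChestPainType": ["ASY", "ATA"],
    "ST_Slope": ["Flat", "Up"],
})

encoder = ColumnTransformer(
    transformers=[
        # Assumption: ST_Slope is treated as ordered Down < Flat < Up.
        ("ordinal", OrdinalEncoder(categories=[["Down", "Flat", "Up"]]), ["ST_Slope"]),
        # Nominal columns get one indicator column per category.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Sex", "ChestPainType"]),
    ],
    remainder="passthrough",  # numeric columns such as Age pass through unchanged
    sparse_threshold=0.0,     # always return a dense array
)

X_train_enc = encoder.fit_transform(X_train_toy)  # fit on the training data only
X_test_enc = encoder.transform(X_test_toy)        # reuse the fitted encoder
print(X_train_enc.shape, X_test_enc.shape)
```

Because the encoder is fitted once and then only applied to the test rows, both outputs have the same columns in the same order.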

**Use the _random state_ $12$ wherever possible so we can compare our results across the class!**
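
As a rough illustration of where a random state typically goes, the sketch below passes the same seed to both the forest and the cross-validation splitter. The data is synthetic (`make_classification`) and stands in for the encoded training set. The choice of five stratified folds, 50 trees and accuracy as the metric are assumptions for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

my_random_state = 12

# Synthetic stand-in for the encoded training data (not the heart dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, random_state=my_random_state
)

# Seed both sources of randomness: tree construction and fold shuffling.
model = RandomForestClassifier(n_estimators=50, random_state=my_random_state)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=my_random_state)

scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
scores_again = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print(scores.mean())
print((scores == scores_again).all())  # same seeds, same folds, same scores
```

With both seeds fixed, repeated runs produce identical fold scores, which is what makes results comparable across the class.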

In [1]:
my_random_state = 12

# 1. Loading The Data

In [3]:
import pandas as pd
X_train = pd.read_csv('input/X_train.csv', index_col=0)
y_train = pd.read_csv('input/y_train.csv', index_col=0)['HeartDisease']
X_test = pd.read_csv('input/X_test.csv', index_col=0)

# 2. EDA

# 3. Visualization

# 4. Encoding

# 5. Further data preprocessing

# 6. Train and evaluate RF using CV on X_train  

# 7. Train a model using the best combination of hyperparameters and preprocessing 
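
One possible way to approach this step is a small grid search that refits the best setting on the full training set and then predicts on held-out rows. The sketch below uses synthetic data and a deliberately tiny, hypothetical parameter grid. Neither the grid values nor the three-fold split are recommendations for the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

my_random_state = 12

# Synthetic stand-in data, split to mimic a train set and an unlabeled test set.
X_all, y_all = make_classification(
    n_samples=300, n_features=8, random_state=my_random_state
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=my_random_state
)

# Hypothetical grid, kept tiny so the example runs quickly.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=my_random_state),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)  # refit=True (the default) retrains the best setting on all of X_tr

y_pred = search.predict(X_te)  # predictions from the refitted best model
print(search.best_params_)
```

Since `refit=True` is the default, `search.predict` already uses a model trained on the complete training set with the best parameters found.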

# 8. Finally, check how well your model performs on unseen data using y_test