# Hacktiv8 Phase 1: Graded Challenge 3

---

This Graded Challenge evaluates what was learned in the Hacktiv8 Data Science Fulltime Program, specifically the concept of Ensemble learning.

## Introduction

By [Rifky Aliffa](https://github.com/Penzragon)

### Dataset

The dataset used in this project contains records of patients with heart failure. It has 299 rows and 13 columns, including age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction, and others. The dataset is available on [Kaggle](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data).

The columns in this dataset are described below:

| Feature                  | Description                                                            |
| ------------------------ | ---------------------------------------------------------------------- |
| age                      | Age                                                                    |
| anaemia                  | Decrease of red blood cells or hemoglobin (boolean)                    |
| creatinine_phosphokinase | Level of the CPK enzyme in the blood (mcg/L)                           |
| diabetes                 | If the patient has diabetes (boolean)                                  |
| ejection_fraction        | Percentage of blood leaving the heart at each contraction (percentage) |
| high_blood_pressure      | If the patient has hypertension (boolean)                              |
| platelets                | Platelets in the blood (kiloplatelets/mL)                              |
| serum_creatinine         | Level of serum creatinine in the blood (mg/dL)                         |
| serum_sodium             | Level of serum sodium in the blood (mEq/L)                             |
| sex                      | Woman or man (binary)                                                  |
| smoking                  | If the patient smokes or not (boolean)                                 |
| time                     | Follow-up period (days)                                                |
| DEATH_EVENT              | If the patient died during the follow-up period (boolean)              |

### Objectives

**Graded Challenge 3** evaluates the following Ensemble Learning concepts:

- Understand the concepts of Ensemble and Boosting.
- Prepare data for use in Random Forest and Boosting models.
- Implement Random Forest and Boosting to make predictions.

## Import Libraries

Use `patch_sklearn()` from the Intel® Extension for Scikit-learn to speed up model training.

In [1]:
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
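
The patch only affects scikit-learn modules imported after `patch_sklearn()` runs, which is why this cell comes before the other imports. Below is a minimal sketch of an optional-dependency guard, assuming the notebook should still run on machines where `scikit-learn-intelex` is not installed. The helper name `enable_sklearn_patch` is hypothetical and not part of the original notebook.

```python
def enable_sklearn_patch() -> bool:
    """Try to enable Intel's accelerated scikit-learn; return True on success."""
    try:
        from sklearnex import patch_sklearn
    except ImportError:
        # scikit-learn-intelex is not installed: keep using stock scikit-learn.
        return False
    patch_sklearn()
    return True


patched = enable_sklearn_patch()
print("Intel patch enabled:", patched)
```

Import `sklearn` estimators only after this call so that the accelerated versions are picked up when available.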


In [2]:
import warnings
warnings.filterwarnings("ignore") 
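
A blanket `"ignore"` filter also hides warnings that may matter, such as convergence or deprecation notices. A narrower alternative, sketched here as an assumption rather than what the notebook does, is to silence a single category. `FutureWarning` is chosen purely for illustration.

```python
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start from "report everything"
    warnings.filterwarnings("ignore", category=FutureWarning)  # mute one category
    warnings.warn("will change in a future release", FutureWarning)
    warnings.warn("something needs attention", UserWarning)

# Only the UserWarning was recorded; the FutureWarning was filtered out.
caught_categories = [w.category.__name__ for w in caught]
print(caught_categories)  # ['UserWarning']
```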

The libraries used in this project are **Pandas**, **NumPy**, **Matplotlib**, **Seaborn**, **Scikit-Learn**, and **XGBoost**.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

## Data Loading

Create a dataframe named `heart` from the file `heart_failure_clinical_records_dataset.csv`.

In [4]:
heart = pd.read_csv('heart_failure_clinical_records_dataset.csv')

In [5]:
heart.head()

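|     | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |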
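
A quick schema check right after loading can catch a wrong or truncated file early. The sketch below rebuilds the five rows shown above from an in-memory string so that it runs without the CSV file. The `EXPECTED_COLUMNS` list and the `check_columns` helper are hypothetical names introduced for illustration, not part of the original notebook.

```python
import io

import pandas as pd

# Column names taken from the feature table in the Dataset section.
EXPECTED_COLUMNS = [
    "age", "anaemia", "creatinine_phosphokinase", "diabetes",
    "ejection_fraction", "high_blood_pressure", "platelets",
    "serum_creatinine", "serum_sodium", "sex", "smoking", "time",
    "DEATH_EVENT",
]

# First five rows of the dataset, copied from the heart.head() output.
SAMPLE_CSV = """age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
75.0,0,582,0,20,1,265000.0,1.9,130,1,0,4,1
55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
65.0,0,146,0,20,0,162000.0,1.3,129,1,1,7,1
50.0,1,111,0,20,0,210000.0,1.9,137,1,0,7,1
65.0,1,160,1,20,0,327000.0,2.7,116,0,0,8,1
"""


def check_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are missing from df."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
missing = check_columns(sample)
print(sample.shape, missing)
```

On the real file, the same helper could be called as `check_columns(heart)`; an empty list means every expected column is present.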


### Data Characteristics

In [6]:
heart.shape

(299, 13)

This dataset contains **299 rows** and **13 columns**.

In [7]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   age                       299 non-null    float64
 1   anaemia                   299 non-null    int64  
 2   creatinine_phosphokinase  299 non-null    int64  
 3   diabetes                  299 non-null    int64  
 4   ejection_fraction         299 non-null    int64  
 5   high_blood_pressure       299 non-null    int64  
 6   platelets                 299 non-null    float64
 7   serum_creatinine          299 non-null    float64
 8   serum_sodium              299 non-null    int64  
 9   sex                       299 non-null    int64  
 10  smoking                   299 non-null    int64  
 11  time                      299 non-null    int64  
 12  DEATH_EVENT               299 non-null    int64  
dtypes: float64(3), int64(10)
memory usage: 30.5 KB


From this basic info, the dataframe consists of:
- 3 columns of type **float**
- 10 columns of type **integer**

There are also no missing values in the dataset.

In [8]:
heart.describe().T

|     | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| age | 299.0 | 60.833893 | 11.894809 | 40.0 | 51.0 | 60.0 | 70.0 | 95.0 |
| anaemia | 299.0 | 0.431438 | 0.496107 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| creatinine_phosphokinase | 299.0 | 581.839465 | 970.287881 | 23.0 | 116.5 | 250.0 | 582.0 | 7861.0 |
| diabetes | 299.0 | 0.41806 | 0.494067 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| ejection_fraction | 299.0 | 38.083612 | 11.834841 | 14.0 | 30.0 | 38.0 | 45.0 | 80.0 |
| high_blood_pressure | 299.0 | 0.351171 | 0.478136 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| platelets | 299.0 | 263358.029264 | 97804.236869 | 25100.0 | 212500.0 | 262000.0 | 303500.0 | 850000.0 |
| serum_creatinine | 299.0 | 1.39388 | 1.03451 | 0.5 | 0.9 | 1.1 | 1.4 | 9.4 |
| serum_sodium | 299.0 | 136.625418 | 4.412477 | 113.0 | 134.0 | 137.0 | 140.0 | 148.0 |
| sex | 299.0 | 0.648829 | 0.478136 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |


Judging from the summary statistics above, `creatinine_phosphokinase` and `platelets` appear to contain anomalies, because their minimum and maximum values lie far from the first and third quartiles (for example, the `creatinine_phosphokinase` maximum is 7861 while its third quartile is 582). These columns are explored further in the **EDA** section.
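
One common way to make this observation concrete is Tukey's 1.5 × IQR rule, which flags values outside the range [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR]. The sketch below applies it to the quartiles printed in the summary statistics. The rule is an assumption chosen for illustration, not necessarily the method used later in the notebook, and `iqr_fences` is a hypothetical helper.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Minimum, quartiles, and maximum copied from the heart.describe().T output.
summary = {
    "creatinine_phosphokinase": {"min": 23.0, "q1": 116.5, "q3": 582.0, "max": 7861.0},
    "platelets": {"min": 25100.0, "q1": 212500.0, "q3": 303500.0, "max": 850000.0},
}

flags = {}
for name, s in summary.items():
    lower, upper = iqr_fences(s["q1"], s["q3"])
    flags[name] = {
        "lower": lower,
        "upper": upper,
        "min_is_outlier": s["min"] < lower,
        "max_is_outlier": s["max"] > upper,
    }
    print(name, flags[name])
```

With these inputs, the `creatinine_phosphokinase` maximum of 7861 lies far above its upper fence of 1280.25, and both the `platelets` minimum (25100) and maximum (850000) fall outside the fences of 76000 and 440000.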

## Data Cleaning

### Missing Value

In [9]:
heart.isna().sum()

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

There are no missing values in this dataset.

### Duplicated Data

In [10]:
heart.duplicated().sum()

0

The dataset also contains no duplicated rows.
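
Both checks can be bundled into a small reusable report. This is a minimal sketch on a toy frame with one deliberately missing value and one duplicated row. The `quality_report` helper and the toy data are hypothetical and only illustrate what `isna()` and `duplicated()` count.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Count missing cells and fully duplicated rows in df."""
    return {
        "missing_cells": int(df.isna().sum().sum()),
        # duplicated() marks the second and later copies of a row by default.
        "duplicated_rows": int(df.duplicated().sum()),
    }


# Toy data: row 1 repeats row 0, and row 2 has a missing age.
toy = pd.DataFrame(
    {
        "age": [60.0, 60.0, np.nan, 45.0],
        "smoking": [1, 1, 0, 0],
    }
)

report = quality_report(toy)
print(report)  # {'missing_cells': 1, 'duplicated_rows': 1}
```

Applied to `heart`, both counts would be expected to be 0, matching the outputs of the two cells in this section.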