# Ensemble Modeling

## B. Importing Libraries & Dataset

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np  

In [2]:
df = pd.read_csv(r"/Users/steffipoliwoda/Desktop/heartFailure.csv")

## C. Feature Selection

In [3]:
feature_columns = ['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes','ejection_fraction','high_blood_pressure','platelets', 'serum_creatinine', 'serum_sodium', 'sex', 'smoking', 'time']

In [4]:
#split dataset in features and target variable
x = df[feature_columns]
y = df['DEATH_EVENT']

## D. Splitting Dataset

We will divide the dataset into a training set and a test set using the function train_test_split(), passing the following parameters:
- features
- target
- test_size (the fraction of the data held out for testing)
- Optional: random_state (a seed that makes the split reproducible)

In [5]:
# split x and y into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=11)

The dataset is split in an 80:20 ratio: 80% of the data is used for model training and 20% for model testing.
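The 80:20 ratio can be checked from the shapes of the returned arrays. The sketch below is illustrative only: it uses hypothetical stand-in data (`X_demo`, `y_demo`) rather than the heart failure dataset, and it adds the optional `stratify` argument, which the notebook above does not use, to keep the class proportions equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features, imbalanced binary target
rng = np.random.default_rng(11)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo is an optional variation (an assumption, not part of the notebook)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=11, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # (80, 3) (20, 3)
print(y_tr.mean(), y_te.mean())  # share of positive labels in each part
```

Stratifying may be worth considering when the target classes are imbalanced, since a purely random split can leave the small test set with a different class mix than the training set.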

## E. Model Development & Hyperparameters

### 1. DecisionTree Classifier

We will first specify the hyperparameters of the base estimator (a decision tree classifier), using entropy (information gain) as the splitting criterion. Later, we will define the hyperparameters for the bagging classifier and pass this base estimator object to it as a hyperparameter.

In [6]:
# Importing the base classifiers 
from sklearn.tree import DecisionTreeClassifier

In [7]:
# Specify the hyperparameters for the base estimator (decision tree) and initialise the model
# entropy or information gain as splitting criterion
dt_params = { 'criterion': 'entropy', 'random_state': 11}

In [8]:
dt = DecisionTreeClassifier(**dt_params)

We will fit the decision tree model to the training data and then compare its prediction accuracy on the training and test sets.

In [9]:
# fit the decision tree model to the training data
dt.fit(x_train, y_train)

DecisionTreeClassifier(criterion='entropy', random_state=11)

In [10]:
# Predict class labels for the training data
dt_preds_train = dt.predict(x_train)

In [11]:
# Predict class labels for the test data
dt_preds_test = dt.predict(x_test)

In [12]:
print('Decision Tree:\n> Accuracy on training data = {:.4f}\n> Accuracy on test data = {:.4f}'.format(accuracy_score(y_true=y_train, y_pred=dt_preds_train), accuracy_score(y_true=y_test, y_pred=dt_preds_test)))

Decision Tree:
> Accuracy on training data = 1.0000
> Accuracy on test data = 0.7333


- The decision tree predicts all the class labels of the training examples correctly.
- The lower test accuracy indicates high variance of the model.
- High variance means that the decision tree is overfitting to the training data. 
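One way to probe this overfitting is to compare a fully grown tree with a depth-limited one under k-fold cross-validation, using the KFold class imported at the top of the notebook. The sketch below is a hedged illustration on hypothetical synthetic data (`X_demo`, `y_demo`); the cap `max_depth=3` is an assumed, untuned value, and the resulting scores say nothing about the heart failure dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data (not the heart failure dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, flip_y=0.1, random_state=11
)

cv = KFold(n_splits=5, shuffle=True, random_state=11)
cv_summary = {}
for depth in (None, 3):  # None = grow the tree fully; 3 = assumed, untuned cap
    tree = DecisionTreeClassifier(criterion='entropy', max_depth=depth, random_state=11)
    # Accuracy on 5 held-out folds (cross_val_score fits clones of the estimator)
    scores = cross_val_score(tree, X_demo, y_demo, cv=cv, scoring='accuracy')
    # Accuracy on the same data the tree was fitted on
    train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)
    cv_summary[depth] = (train_acc, scores.mean(), scores.std())
    print('max_depth={}: train = {:.4f}, CV mean = {:.4f}, CV std = {:.4f}'.format(
        depth, train_acc, scores.mean(), scores.std()))
```

A large gap between the training accuracy and the cross-validated mean is the same symptom seen above. Averaging over several folds also gives a steadier estimate than a single 80:20 split.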

### 2. Bagging Classifier

In [None]:
# Importing the bagging classifiers 
from sklearn.ensemble import BaggingClassifier

In [13]:
# Specify the hyperparameters for the BaggingClassifier
# Note: scikit-learn 1.2 renamed 'base_estimator' to 'estimator'; the old name was removed in 1.4
bc_params = { 'base_estimator': dt, 'n_estimators': 50, 'max_samples': 0.5, 'random_state': 11, 'n_jobs': -1}

In [14]:
# pass the base estimator object to the classifier as a hyperparameter
bc = BaggingClassifier(**bc_params)

We will fit the bagging classifier model to the training data and calculate the prediction accuracy.

In [15]:
# fit the bagging classifier model to the training data
bc.fit(x_train, y_train)

BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion='entropy',
                                                        random_state=11),
                  max_samples=0.5, n_estimators=50, n_jobs=-1, random_state=11)

In [16]:
# Predict class labels for the training data
bc_preds_train = bc.predict(x_train)

In [17]:
# Predict class labels for the test data
bc_preds_test = bc.predict(x_test)

In [18]:
print('Bagging Classifier:\n> Accuracy on training data = {:.4f}\n> Accuracy on test data = {:.4f}'.format(accuracy_score(y_true=y_train, y_pred=bc_preds_train), accuracy_score(y_true=y_test, y_pred=bc_preds_test)))

Bagging Classifier:
> Accuracy on training data = 0.9498
> Accuracy on test data = 0.8500


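- The Decision Tree Classifier has a higher accuracy on the training data (1.0000) than the Bagging Classifier (0.9498).
- The Bagging Classifier has a higher accuracy on the test data (0.8500) than the Decision Tree Classifier (0.7333).
- The smaller gap between training and test accuracy suggests that bagging reduces the variance of the single decision tree.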
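To make the mechanism behind these numbers concrete, the sketch below builds a bagging ensemble by hand: each tree is trained on a bootstrap sample (drawn with replacement) of half the training rows, mirroring max_samples=0.5, and the test predictions are combined by majority vote. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: BaggingClassifier averages predicted class probabilities when the base estimator provides them, and the data here (`X_demo`, `y_demo`) is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data (not the heart failure dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=11
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=11
)

rng = np.random.default_rng(11)
n_estimators = 50
n_draw = int(0.5 * len(X_tr))  # mirrors max_samples=0.5

votes = np.zeros((n_estimators, len(X_te)), dtype=int)
for i in range(n_estimators):
    # Bootstrap sample: row indices drawn with replacement
    idx = rng.integers(0, len(X_tr), size=n_draw)
    tree = DecisionTreeClassifier(criterion='entropy', random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    votes[i] = tree.predict(X_te)

# Majority vote over the 50 trees (labels are 0/1; ties go to class 1)
bagged_preds = (votes.mean(axis=0) >= 0.5).astype(int)
bagged_acc = (bagged_preds == y_te).mean()
print('Hand-rolled bagging, test accuracy = {:.4f}'.format(bagged_acc))
```

Because each tree sees a different resample, the individual trees make different errors, and voting tends to cancel some of them out.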

### 3. RandomForestClassifier

In [19]:
from sklearn.ensemble import RandomForestClassifier

We will specify the hyperparameters and initialize the model. We will use entropy as the splitting criterion for the decision trees in a forest comprising 100 trees. 

In [20]:
rf_params = { 'n_estimators': 100, 'criterion': 'entropy', 'max_features': 0.5, 'min_samples_leaf': 10, 'random_state': 11, 'n_jobs': -1}

In [21]:
rf = RandomForestClassifier(**rf_params)

We will fit the Random Forest classifier model to the training data and calculate the prediction accuracy.

In [22]:
rf.fit(x_train, y_train)

RandomForestClassifier(criterion='entropy', max_features=0.5,
                       min_samples_leaf=10, n_jobs=-1, random_state=11)

In [23]:
rf_preds_train = rf.predict(x_train)

In [24]:
rf_preds_test = rf.predict(x_test)

In [25]:
print('Random Forest:\n> Accuracy on training data = {:.4f}\n> Accuracy on test data = {:.4f}'.format(accuracy_score(y_true=y_train, y_pred=rf_preds_train), accuracy_score(y_true=y_test, y_pred=rf_preds_test)))

Random Forest:
> Accuracy on training data = 0.8745
> Accuracy on test data = 0.9000


- The accuracy of the Random Forest on the test set (0.9000) is comparable to, and slightly higher than, that of the Bagging Classifier (0.8500).
- This holds even though the Bagging Classifier has a higher accuracy on the training data (0.9498 versus 0.8745).
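Differences of a few points on a single small test split can come down to a handful of samples, so a cross-validated comparison may give a steadier picture. The sketch below reuses the hyperparameters from this notebook but runs on hypothetical synthetic data (`X_demo`, `y_demo`), so its scores are not results for the heart failure dataset. The base estimator is passed positionally so that the code does not depend on whether the installed scikit-learn version names that parameter base_estimator or estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data (not the heart failure dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=11
)

base_tree = DecisionTreeClassifier(criterion='entropy', random_state=11)
models = {
    'Decision Tree': base_tree,
    # Base estimator passed positionally to stay version-agnostic
    'Bagging': BaggingClassifier(base_tree, n_estimators=50, max_samples=0.5,
                                 random_state=11),
    'Random Forest': RandomForestClassifier(n_estimators=100, criterion='entropy',
                                            max_features=0.5, min_samples_leaf=10,
                                            random_state=11),
}

cv = KFold(n_splits=5, shuffle=True, random_state=11)
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
    cv_results[name] = (scores.mean(), scores.std())
    print('{}: CV accuracy = {:.4f} +/- {:.4f}'.format(name, scores.mean(), scores.std()))
```

Running the same loop with x and y from this notebook in place of the stand-in data would show whether the ranking of the three models holds across folds.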