# Exercise - Boosting for classification

1. Use the **load_breast_cancer** data (remember to split your data into a train, validation, and test data). Using your training and validation data, optimize the parameters of your GradientBoostingClassifier. How well does your optimized model perform on the test data?
1. Implement an RF model. How well do they perform on the test data? Try to "vote" using boosting and RF and select the class with the most votes. How well does your ensemble of all three models perform?

**See slides for more details!**

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import ensemble
import pandas as pd
import numpy as np

X, y = load_breast_cancer(return_X_y=True)

# Use `train_test_split` to split your data into a train and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Use `train_test_split` to split your train data into a train and a validation  set.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape)

(364, 30) (91, 30) (114, 30) (364,) (91,) (114,)


# Exercise 1

Use the **load_breast_cancer** data (remember to split your data into a train, validation, and test data). Using your training and validation data, optimize the parameters of your GradientBoostingClassifier. How well does your optimized model perform on the test data?

Let us start by ensuring we can just run an GBT without any optimization.

In [2]:
gbt_current = ensemble.GradientBoostingClassifier()
gbt_current.fit(X_train, y_train)
y_val_hat = gbt_current.predict(X_val)
acc = accuracy_score(y_val, y_val_hat)

print(f'Boosting with default settings has validation accuracy of {round(acc * 100, 2)}%.')

Boosting with default settings has validation accuracy of 95.6%.


In [6]:
# Remember you can try other stuff than these specific parameters.
# Just here to get you started!

n_estimators_list = [50,200,500]
min_samples_split_list = [15,20,25]
min_samples_leaf_list = [5,10,15]

results = []

for n_estimators in n_estimators_list:
    for min_samples_split in min_samples_split_list:
        for min_samples_leaf in min_samples_leaf_list:
            gbt_current = ensemble.GradientBoostingClassifier(
                n_estimators=n_estimators,
                min_samples_split=min_samples_split,
                min_samples_leaf=min_samples_leaf,
                )
            gbt_current.fit(X_train, y_train)
            y_val_hat = gbt_current.predict(X_val)
            acc = accuracy_score(y_val, y_val_hat)

            results.append([acc, n_estimators, min_samples_split, min_samples_leaf])

results = pd.DataFrame(results)
results.columns = ['Accuracy', 'n_estimators', 'min_samples_split', 'min_samples_leaf']
print(results)

    Accuracy  n_estimators  min_samples_split  min_samples_leaf
0   0.989011            50                 15                 5
1   1.000000            50                 15                10
2   1.000000            50                 15                15
3   1.000000            50                 20                 5
4   1.000000            50                 20                10
5   1.000000            50                 20                15
6   1.000000            50                 25                 5
7   1.000000            50                 25                10
8   1.000000            50                 25                15
9   1.000000           200                 15                 5
10  1.000000           200                 15                10
11  1.000000           200                 15                15
12  1.000000           200                 20                 5
13  1.000000           200                 20                10
14  1.000000           200              

In [7]:
# Extract best parameters.
n_estimators_optimal = results.loc[results['Accuracy'].idxmax()]['n_estimators'].astype(int)
min_samples_split_optimal = results.loc[results['Accuracy'].idxmax()]['min_samples_split'].astype(int)
min_samples_leaf_optimal = results.loc[results['Accuracy'].idxmax()]['min_samples_leaf'].astype(int)

In [8]:
# Initialize your final model
gbt_optimal = ensemble.GradientBoostingClassifier(
                n_estimators=n_estimators_optimal,
                min_samples_split=min_samples_split_optimal,
                min_samples_leaf=min_samples_leaf_optimal,
                )            

# Use both training and validation data to fit it using np.concatenate (np.concatenate "stacks" the array like rbind in R)
X_train = np.concatenate((X_train,X_val))
y_train = np.concatenate((y_train,y_val))
#y_train
gbt_optimal.fit(X_train, y_train)

# Predict on test data
y_test_hat = gbt_current.predict(X_test)

# Obtain and check accuracy on test data
acc = accuracy_score(y_test, y_test_hat)
print(f'Boosting with optimal settings has a test accuracy of {round(acc * 100, 2)}%.')

Boosting with optimal settings has a test accuracy of 97.37%.
