# Exercise - Boosting for classification

1. Use the **load_breast_cancer** data (remember to split your data into a train, validation, and test data). Using your training and validation data, optimize the parameters of your GradientBoostingClassifier. How well does your optimized model perform on the test data?
1. Implement an RF and a SVM and use these as well (**note**: you may want to perform standardization for the SVM). How well do they perform on the test data? Try to "vote" using all three models (boosting, RF, and SVM) and select the class with the most votes. How well does your ensemble of all three models perform?

**See slides for more details!**

In [24]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn import ensemble
import pandas as pd
import numpy as np

X, y = load_breast_cancer(return_X_y=True)

# Use `train_test_split` to split your data into a train and a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Use `train_test_split` to split your train data into a train and a validation  set.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape)

(364, 30) (91, 30) (114, 30) (364,) (91,) (114,)


# Exercise 1

Use the **load_breast_cancer** data (remember to split your data into a train, validation, and test data). Using your training and validation data, optimize the parameters of your GradientBoostingClassifier. How well does your optimized model perform on the test data?

Let us start by ensuring we can just run an GBT without any optimization.

In [25]:
gbt_current = ensemble.GradientBoostingClassifier()
gbt_current.fit(X_train, y_train)
y_val_hat = gbt_current.predict(X_val)
acc = accuracy_score(y_val, y_val_hat)

print(f'Boosting with default settings has validation accuracy of {round(acc * 100, 2)}%.')

Boosting with default settings has validation accuracy of 95.6%.


In [26]:
# Remember you can try other stuff than these specific parameters.
# Just here to get you started!

n_estimators_list = [5, 20, 50]
min_samples_split_list = [15, 20, 25]
min_samples_leaf_list = [5, 10, 20]

results = []

for n_estimators in n_estimators_list:
    for min_samples_split in min_samples_split_list:
        for min_samples_leaf in min_samples_leaf_list:
            gbt_current = ensemble.GradientBoostingClassifier(
                n_estimators=n_estimators,
                min_samples_split=min_samples_split,
                min_samples_leaf=min_samples_leaf,
                )
            gbt_current.fit(X_train, y_train)
            y_val_hat = gbt_current.predict(X_val)
            acc = accuracy_score(y_val, y_val_hat)

            results.append([acc, n_estimators, min_samples_split, min_samples_leaf])

results = pd.DataFrame(results)
results.columns = ['Accuracy', 'n_estimators', 'min_samples_split', 'min_samples_leaf']
print(results)

    Accuracy  n_estimators  min_samples_split  min_samples_leaf
0   0.912088             5                 15                 5
1   0.923077             5                 15                10
2   0.923077             5                 15                20
3   0.912088             5                 20                 5
4   0.934066             5                 20                10
5   0.923077             5                 20                20
6   0.912088             5                 25                 5
7   0.901099             5                 25                10
8   0.923077             5                 25                20
9   0.945055            20                 15                 5
10  0.934066            20                 15                10
11  0.934066            20                 15                20
12  0.945055            20                 20                 5
13  0.934066            20                 20                10
14  0.934066            20              

In [27]:
# Extract best parameters.
# results[results['Accuracy'] == results['Accuracy'].max()]
n_estimators_optimal = results.loc[results['Accuracy'].idxmax()]['n_estimators'].astype(int)
min_samples_split_optimal = results.loc[results['Accuracy'].idxmax()]['min_samples_split'].astype(int)
min_samples_leaf_optimal = results.loc[results['Accuracy'].idxmax()]['min_samples_leaf'].astype(int)


In [28]:
# Initialize your final model
gbt_optimal = ensemble.GradientBoostingClassifier(
                n_estimators=n_estimators_optimal,
                min_samples_split=min_samples_split_optimal,
                min_samples_leaf=min_samples_leaf_optimal,
                )            

# Use both training and validation data to fit it using np.concatenate (np.concatenate "stacks" the array like rbind in R)
X_train = np.concatenate((X_train,X_val))
y_train = np.concatenate((y_train,y_val))
#y_train
gbt_optimal.fit(X_train, y_train)

# Predict on test data
y_test_hat_gbt = gbt_current.predict(X_test)

# Obtain and check accuracy on test data
acc = accuracy_score(y_test_hat_gbt, y_test)
print(f'Boosting with optimal settings has a test accuracy of {round(acc * 100, 2)}%.')

Boosting with optimal settings has a test accuracy of 93.86%.


# Exercise 2

Implement an RF and a SVM and use these as well (**note**: you may want to perform standardization for the SVM). How well do they perform on the test data? Try to "vote" using all three models (boosting, RF, and SVM) and select the class with the most votes. How well does your ensemble of all three models perform?

In [29]:
from sklearn.preprocessing import StandardScaler
from sklearn import svm

# Scale your data
scaler = StandardScaler()
Z_train = scaler.fit_transform(X_train)
Z_val = scaler.transform(X_val)
Z_test = scaler.transform(X_test)

In [30]:
# You may want to optimize the settings if you want.
# Then, you can do it here.
# You can/may want to do this both for the RF and the SVM.

results[results['Accuracy'] == results['Accuracy'].max()]
# print(X_train.shape, X_val.shape)

Unnamed: 0,Accuracy,n_estimators,min_samples_split,min_samples_leaf
21,0.967033,50,20,5
24,0.967033,50,25,5


In [31]:
from sklearn.ensemble import RandomForestClassifier

# Initialize your final models
gb_final = RandomForestClassifier(
    n_estimators = 5,
    random_state=42,
    # min_samples_split = 5, 
    # min_samples_leaf = 10
)
# Use both training and validation data to fit them using np.concatenate (np.concatenate "stacks" the array like rbind in R)
gb_final.fit(np.concatenate([X_train, X_val]), np.concatenate([y_train, y_val]))

# Predict on test data
y_test_hat_rf = gb_final.predict(X_test)

# Obtain and check mse on test data. Is it lower or higher than the RF?
acc_rf = accuracy_score(y_test, y_test_hat_rf)
print(f'Random Forest achieved test accuracy of {round(acc_rf * 100, 2)}%.')

svm_model = svm.SVC() # initialize SVM

#  fit on both train. and val. data
svm_model.fit(np.concatenate([Z_train, Z_val]), np.concatenate([y_train, y_val]))

# predict on test data
y_test_hat_svm = svm_model.predict(Z_test)

acc_svm = accuracy_score(y_test, y_test_hat_svm)
print(f'SVM achieved test accuracy of {round(acc_svm * 100, 2)}%.')

# print(f'Optimized GB achieved MSE = {round(gb_optimized*100, 2)}')

Random Forest achieved test accuracy of 95.61%.
SVM achieved test accuracy of 98.25%.


In [32]:
# Finally combine your predictions
# (you do not have to change the code here, but you may want to try to improve beyond this method)

# My notes
# y_test_hat_gbt = ensemble.GradientBoostingClassifier()
# y_test_hat_rf = ensemble.GradientBoostingClassifier()
# y_test_hat_svm 

# WARNING: The below code for voting is only valid for 2 classes - DO NOT USE IT FOR CASES WITH MORE THAN 2 CLASSES
y_test_hat_combined = np.c_[y_test_hat_gbt, y_test_hat_rf, y_test_hat_svm]
y_test_hat_combined = np.round(np.sum(y_test_hat_combined, axis=1) / y_test_hat_combined.shape[1]).astype(int)

acc = accuracy_score(y_test, y_test_hat_combined)

print(f'Ensemble of boosting, RF, and SVM achieved test accuracy of {round(acc * 100, 2)}%.')

Ensemble of boosting, RF, and SVM achieved test accuracy of 96.49%.
