In [1]:
import os 

os.makedirs("../../datasets", exist_ok=True)

In [2]:
%%bash

wget -qO "../../datasets/blood_transfusion.csv" "https://github.com/INRIA/scikit-learn-mooc/raw/master/datasets/blood_transfusion.csv"

Open the dataset `blood_transfusion.csv`.

In [3]:
import pandas as pd

blood_transfusion = pd.read_csv("../../datasets/blood_transfusion.csv")
data = blood_transfusion.drop(columns='Class')
target = blood_transfusion['Class']
blood_transfusion.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Recency    748 non-null    int64 
 1   Frequency  748 non-null    int64 
 2   Monetary   748 non-null    int64 
 3   Time       748 non-null    int64 
 4   Class      748 non-null    object
dtypes: int64(4), object(1)
memory usage: 29.3+ KB


We can display an interactive diagram with the following command:

In [4]:
from sklearn import set_config
set_config(display='diagram')

In this dataset, the column `"Class"` is the target vector containing the labels that our model should predict.

For all the questions below, make a cross-validation evaluation using a 10-fold cross-validation strategy.
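
As a side note on what `cv=10` means in the cells below: when an integer is passed together with a classifier, scikit-learn builds a stratified, unshuffled k-fold splitter. The sketch below uses `sklearn.model_selection.check_cv` to show this. The label list is a hypothetical stand-in (570 and 178 placeholder labels), not the real `target` column.

```python
from sklearn.model_selection import StratifiedKFold, check_cv

# Hypothetical labels standing in for the real target (assumed class counts).
y = ["not donated"] * 570 + ["donated"] * 178

# Resolve an integer `cv` the way cross-validation helpers do for classifiers.
cv = check_cv(cv=10, y=y, classifier=True)

print(type(cv).__name__, cv.get_n_splits(), cv.shuffle)
```

Because the folds are not shuffled by default, scores obtained with `cv=10` are not directly comparable with scores obtained from a shuffled `KFold`.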

Evaluate the performance of a `sklearn.dummy.DummyClassifier` that always predicts the most frequent class seen during training. Be aware that you can pass a list of scores to compute in `sklearn.model_selection.cross_validate` by setting the parameter `scoring`.

The code to compute the scores on this dataset is shown below:

In [5]:
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='most_frequent', 
    random_state=0)
dummy

What are the accuracy and balanced accuracy of this dummy classifier?

In [6]:
%%time
from sklearn.model_selection import cross_validate

cv_results = cross_validate(dummy, data, target,
    cv=10, scoring=['accuracy', 'balanced_accuracy'])
print(f"Average accuracy: "
    f"{cv_results['test_accuracy'].mean():.3f} +/- "
    f"{cv_results['test_accuracy'].std():.3f}")
print(f"Average balanced accuracy: "
    f"{cv_results['test_balanced_accuracy'].mean():.3f} +/- "
    f"{cv_results['test_balanced_accuracy'].std():.3f}")

Average accuracy: 0.762 +/- 0.004
Average balanced accuracy: 0.500 +/- 0.000
CPU times: user 54.8 ms, sys: 3.8 ms, total: 58.6 ms
Wall time: 57.9 ms


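We get an average accuracy of ~0.76 and an average balanced accuracy of ~0.5. Because the classes are imbalanced, always predicting the most frequent class is correct about 76% of the time, while the balanced accuracy, which averages the recall over the classes, stays at the chance level of 0.5.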
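
The two numbers can be reproduced by hand. The sketch below is a minimal, pure-Python illustration: the class counts (570 and 178) and the label names are assumptions chosen to match the ~0.76 majority share, and `recall_per_class` is a hypothetical helper, not a scikit-learn function.

```python
def recall_per_class(y_true, y_pred):
    """Return the fraction of correctly predicted samples for each true class."""
    recalls = {}
    for label in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == label]
        hits = sum(1 for i in indices if y_pred[i] == label)
        recalls[label] = hits / len(indices)
    return recalls


# Assumed class counts and placeholder label names.
y_true = ["not donated"] * 570 + ["donated"] * 178
# A "most frequent" classifier predicts the majority class everywhere.
y_pred = ["not donated"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = recall_per_class(y_true, y_pred)
# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = sum(recalls.values()) / len(recalls)

print(f"accuracy: {accuracy:.3f}")
print(f"per-class recall: {recalls}")
print(f"balanced accuracy: {balanced_accuracy:.3f}")
```

The majority class has a recall of 1 and the minority class a recall of 0, so their mean is 0.5 whatever the class proportions are.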

Replace the `DummyClassifier` by a `sklearn.tree.DecisionTreeClassifier` and check the statistical performance.

In [7]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=0)
tree

Is a single decision tree classifier better than a dummy classifier, with an increase of at least 0.04 in balanced accuracy?

In [8]:
%%time
from sklearn.model_selection import cross_val_score

scores_tree = cross_val_score(tree, data, target,
    cv=10, scoring='balanced_accuracy', n_jobs=2)
print(f"Average balanced accuracy: "
    f"{scores_tree.mean():.3f} +/- "
    f"{scores_tree.std():.3f}")

Average balanced accuracy: 0.516 +/- 0.100
CPU times: user 30.8 ms, sys: 24.5 ms, total: 55.3 ms
Wall time: 816 ms


Evaluate the performance of a `sklearn.ensemble.RandomForestClassifier` using 300 trees.

In [9]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300, n_jobs=2, random_state=0)
rf

Is a random forest better than a dummy classifier, with an increase of at least 0.04 in balanced accuracy?

We can evaluate the random forest classifier with the following code.

In [10]:
%%time

scores_rf = cross_val_score(rf, data, target,
    cv=10, scoring='balanced_accuracy', n_jobs=2)
print(f"Average balanced accuracy: "
    f"{scores_rf.mean():.3f} +/- "
    f"{scores_rf.std():.3f}")

Average balanced accuracy: 0.530 +/- 0.076
CPU times: user 31 ms, sys: 0 ns, total: 31 ms
Wall time: 3.21 s
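
The comparison against the 0.04 threshold is simple arithmetic on the means printed above. The sketch below only restates those printed means; it does not rerun any model, and it ignores the standard deviations, which are larger than the gains themselves.

```python
# Mean balanced accuracies copied from the outputs printed above.
dummy_score = 0.500
candidates = {"decision tree": 0.516, "random forest": 0.530}
threshold = 0.04

# Gain of each model over the dummy baseline.
gains = {name: score - dummy_score for name, score in candidates.items()}

for name, gain in gains.items():
    verdict = "meets" if gain >= threshold else "does not meet"
    print(f"{name}: gain of {gain:.3f} {verdict} the {threshold} threshold")
```

With these particular runs, neither gain reaches 0.04. A different scikit-learn version or cross-validation split may give different means.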


Compare a `sklearn.ensemble.GradientBoostingClassifier` and a `sklearn.ensemble.RandomForestClassifier`, both with 300 trees. To do so, repeat a 10-fold cross-validation 10 times, using the balanced accuracy as the metric. For each of the 10 tries, compute the average cross-validation score of both models. Count how many times one model is better than the other.

In [11]:
from sklearn.ensemble import GradientBoostingClassifier

gbdt = GradientBoostingClassifier(
    n_estimators=300, random_state=0)
gbdt

In [12]:
%%time

scores_gbdt = cross_val_score(gbdt, data, target,
    cv=10, scoring='balanced_accuracy', n_jobs=2)
print(f"Average balanced accuracy: "
    f"{scores_gbdt.mean():.3f} +/- "
    f"{scores_gbdt.std():.3f}")

Average balanced accuracy: 0.537 +/- 0.073
CPU times: user 27.6 ms, sys: 0 ns, total: 27.6 ms
Wall time: 1.1 s


On average, is the gradient boosting better than the random forest?

The code below repeats the 10-fold cross-validation of a random forest and a gradient boosting model 10 times. Then, we check how many times the average balanced accuracy of the gradient boosting is better than that of the random forest.

In [13]:
%%time
from sklearn.model_selection import KFold

n_try = 10
scores_rf, scores_gbdt = [], []

for seed in range(n_try):
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    
    scores = cross_val_score(rf, data, target,
        cv=cv, scoring='balanced_accuracy', n_jobs=2)
    scores_rf.append(scores.mean())
    
    scores = cross_val_score(gbdt, data, target,
        cv=cv, scoring='balanced_accuracy', n_jobs=2)
    scores_gbdt.append(scores.mean())

print(f"10 average balanced accuracy as follows: ")
print(f"Random forest: {scores_rf}")
print(f"Gradient boosting: {scores_gbdt}")

compare = [s_gbdt > s_rf for s_gbdt, s_rf in zip(scores_gbdt, scores_rf)]
print(f"Number of the average balanced accuracy of a gradient boosting "
    f"is better than a random forest = {sum(compare)}")

10 average balanced accuracy as follows: 
Random forest: [0.5958735063849163, 0.6071082350508421, 0.5861206151327055, 0.6084622334638855, 0.5977931013771813, 0.596029270770206, 0.586762777491701, 0.5818846039396067, 0.6090675092755465, 0.5944661809753218]
Gradient boosting: [0.612448150763859, 0.6082180772939754, 0.596592281401919, 0.6166012516291297, 0.5939191388989137, 0.5987164058221208, 0.5981547417134458, 0.596231772758258, 0.6203521245148931, 0.6016079662441642]
Number of the average balanced accuracy of a gradient boosting is better than a random forest = 9
CPU times: user 536 ms, sys: 38.9 ms, total: 575 ms
Wall time: 42.1 s
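
Besides counting wins, it can help to look at the size of the paired differences. The sketch below reuses the ten averages printed above, copied verbatim, and computes the mean gap between the two models. It is only a descriptive summary, not a statistical test.

```python
from statistics import mean

# Averages copied from the output printed above (one value per repetition).
scores_rf = [0.5958735063849163, 0.6071082350508421, 0.5861206151327055,
             0.6084622334638855, 0.5977931013771813, 0.596029270770206,
             0.586762777491701, 0.5818846039396067, 0.6090675092755465,
             0.5944661809753218]
scores_gbdt = [0.612448150763859, 0.6082180772939754, 0.596592281401919,
               0.6166012516291297, 0.5939191388989137, 0.5987164058221208,
               0.5981547417134458, 0.596231772758258, 0.6203521245148931,
               0.6016079662441642]

# Paired difference for each repetition: both models saw the same folds.
differences = [gbdt - rf for gbdt, rf in zip(scores_gbdt, scores_rf)]
wins = sum(diff > 0 for diff in differences)
mean_gain = mean(differences)

print(f"Gradient boosting wins {wins} out of {len(differences)} repetitions")
print(f"Mean gain in balanced accuracy: {mean_gain:.4f}")
```

The pairing is meaningful because each repetition evaluates both models on the same `KFold` splits.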


Evaluate the performance of a `sklearn.ensemble.HistGradientBoostingClassifier`. Enable early stopping and add as many trees as needed.

**Note**: In scikit-learn versions before 1.0, `HistGradientBoostingClassifier` is experimental and requires a specific import before it can be imported from `sklearn.ensemble`:

In [14]:
# explicitly require this experimental feature
from sklearn.experimental import enable_hist_gradient_boosting
# now you can import normally from ensemble
from sklearn.ensemble import HistGradientBoostingClassifier

hist_gbdt = HistGradientBoostingClassifier(
    max_iter=1000, early_stopping=True, random_state=0)
hist_gbdt

Is histogram gradient boosting a better classifier considering the mean of the cross-validation test_score?

We can evaluate the `HistGradientBoostingClassifier` with the following snippet.

In [15]:
%%time

cv_results = cross_validate(hist_gbdt, data, target, 
    cv=10, scoring='balanced_accuracy', n_jobs=2, 
    return_estimator=True)
print(f"Average balanced accuracy: "
    f"{cv_results['test_score'].mean():.3f} +/- "
    f"{cv_results['test_score'].std():.3f}")

Average balanced accuracy: 0.579 +/- 0.110
CPU times: user 30.9 ms, sys: 277 µs, total: 31.1 ms
Wall time: 603 ms


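This classifier reaches a balanced accuracy of ~0.58, which is higher than the scores obtained by the previous models with the same `cv=10` evaluation (0.516 to 0.537), although the standard deviation of 0.110 is large.

With early stopping activated, how many trees on average did the `HistGradientBoostingClassifier` need to converge?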

We can inspect the fitted estimator resulting from the cross-validation.

In [16]:
import numpy as np

num_tree = np.mean([estimator.n_iter_ 
    for estimator in cv_results['estimator']])
print(f"On average, {num_tree} iterations were required.")

On average, 33.3 iterations were required.


`Imbalanced-learn` is an open-source library that relies on scikit-learn and provides methods to deal with classification problems with imbalanced classes.

Here, we will be using the class `imblearn.ensemble.BalancedBaggingClassifier` to alleviate the issue of class imbalance.

Use the `BalancedBaggingClassifier` and pass a `HistGradientBoostingClassifier` as the `base_estimator`. Fix the hyperparameter `n_estimators` to 50.

`BalancedBaggingClassifier` is equivalent to a `sklearn.ensemble.BaggingClassifier` in which each bootstrap sample is resampled to achieve a desired class balance, for example so that it contains as many samples from each class.
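
The resampling idea can be sketched in a few lines of pure Python. This is an illustration of the principle only, under the assumption that the majority class is randomly under-sampled; it is not imbalanced-learn's actual implementation. `balanced_bootstrap`, the class counts, and the label names are hypothetical.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Draw a bootstrap sample, then under-sample it to equal class counts."""
    n_samples = len(labels)
    # Step 1: bootstrap, i.e. draw indices with replacement.
    bootstrap = [rng.randrange(n_samples) for _ in range(n_samples)]

    # Step 2: group the drawn indices by class.
    by_class = {}
    for index in bootstrap:
        by_class.setdefault(labels[index], []).append(index)

    # Step 3: keep as many indices per class as the rarest class has.
    n_keep = min(len(indices) for indices in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(rng.sample(indices, n_keep))
    return balanced


rng = random.Random(0)
labels = ["not donated"] * 570 + ["donated"] * 178  # assumed class counts
indices = balanced_bootstrap(labels, rng)
counts = Counter(labels[i] for i in indices)
print(counts)
```

Each estimator of the ensemble would then be trained on one such balanced sample, so no single model is dominated by the majority class.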

In [17]:
from imblearn.ensemble import BalancedBaggingClassifier

bbc = BalancedBaggingClassifier(
    base_estimator=hist_gbdt,
    n_estimators=50, 
    n_jobs=2, random_state=0)
bbc

Compared to the balanced accuracy of a `HistGradientBoostingClassifier` alone, the balanced accuracy of the `BalancedBaggingClassifier` is **Better**.

The following code snippet evaluates the `BalancedBaggingClassifier`.

In [18]:
%%time

scores_bbc = cross_val_score(bbc, data, target, 
    cv=10, scoring='balanced_accuracy', n_jobs=2)
print(f"Average balanced accuracy: "
    f"{scores_bbc.mean():.3f} +/- "
    f"{scores_bbc.std():.3f}")

Average balanced accuracy: 0.601 +/- 0.077
CPU times: user 21.9 ms, sys: 0 ns, total: 21.9 ms
Wall time: 41.3 s


The balanced accuracy of the `BalancedBaggingClassifier` is ~0.6. It is better (i.e. higher) than the average balanced accuracy of the `HistGradientBoostingClassifier` alone, which was ~0.58.