In [1]:
import os 

os.makedirs("../../datasets", exist_ok=True)

In [2]:
%%bash

wget -qO "../../datasets/blood_transfusion.csv" "https://github.com/INRIA/scikit-learn-mooc/raw/master/datasets/blood_transfusion.csv"

Open the dataset `blood_transfusion.csv`

In [3]:
import pandas as pd

blood_transfusion = pd.read_csv("../../datasets/blood_transfusion.csv")
data = blood_transfusion.drop(columns='Class')
target = blood_transfusion['Class']
blood_transfusion.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Recency    748 non-null    int64 
 1   Frequency  748 non-null    int64 
 2   Monetary   748 non-null    int64 
 3   Time       748 non-null    int64 
 4   Class      748 non-null    object
dtypes: int64(4), object(1)
memory usage: 29.3+ KB


We can display estimators as interactive diagrams in the notebook with the following command:

In [4]:
from sklearn import set_config
set_config(display='diagram')

In this dataset, the column `"Class"` is the target vector containing the labels that our model should predict.

For all the questions below, make a cross-validation evaluation using a 10-fold cross-validation strategy.
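
As a side note, when `cv` is an integer and the estimator is a classifier, scikit-learn resolves it to a stratified K-fold splitter rather than a plain K-fold. The snippet below is a minimal sketch that checks this with `sklearn.model_selection.check_cv`. The labels in `toy_target` are made up for illustration and are not the blood transfusion data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, check_cv

# Hypothetical imbalanced binary labels (not the real dataset).
toy_target = np.array(["donated"] * 30 + ["not donated"] * 70)

# Resolve cv=10 the same way the cross-validation helpers do for a classifier.
splitter = check_cv(cv=10, y=toy_target, classifier=True)
print(type(splitter).__name__, splitter.get_n_splits())
```

Stratification keeps the class proportions roughly equal in every fold, which matters here because the target is imbalanced.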

Evaluate the performance of a `sklearn.dummy.DummyClassifier` that always predicts the most frequent class seen during training. Be aware that you can pass a list of scores to compute in `sklearn.model_selection.cross_validate` by setting the parameter `scoring`.

The code to create this classifier is shown below:

In [5]:
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='most_frequent')
dummy

What are the accuracy and balanced accuracy of this dummy classifier?

In [14]:
%%time
from sklearn.model_selection import cross_validate

cv_results = cross_validate(dummy, data, target,
    cv=10, scoring=['accuracy', 'balanced_accuracy'])
print(f"Average accuracy: "
    f"{cv_results['test_accuracy'].mean():.3f} +/- "
    f"{cv_results['test_accuracy'].std():.3f}")
print(f"Average balanced accuracy: "
    f"{cv_results['test_balanced_accuracy'].mean():.3f} +/- "
    f"{cv_results['test_balanced_accuracy'].std():.3f}")

Average accuracy: 0.762 +/- 0.004
Average balanced accuracy: 0.500 +/- 0.000
CPU times: user 37.2 ms, sys: 0 ns, total: 37.2 ms
Wall time: 35.6 ms


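We get an average accuracy of ~0.76 and an average balanced accuracy of ~0.5. The gap comes from class imbalance: always predicting the majority class is correct about 76% of the time, but it yields a recall of 1 on the majority class and 0 on the minority class, and balanced accuracy is the mean of these per-class recalls.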
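
To see where these two numbers come from, here is a minimal sketch on made-up labels. The 76/24 split is an assumption chosen to mimic the accuracy reported above; it is not read from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: 76 samples of the majority class (0), 24 of the minority (1).
y_true = np.array([0] * 76 + [1] * 24)

# A "most frequent" classifier predicts the majority class for every sample.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
# Balanced accuracy averages the recall of each class: (1.0 + 0.0) / 2.
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(f"Accuracy: {acc:.2f}")
print(f"Balanced accuracy: {bal_acc:.2f}")
```

Accuracy rewards the classifier for the majority class alone, while balanced accuracy exposes that the minority class is never detected.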

Replace the `DummyClassifier` with a `sklearn.tree.DecisionTreeClassifier` and check its generalization performance.

In [10]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=0)
tree

Is a single decision tree classifier better than the dummy classifier, with an increase of at least 0.04 in balanced accuracy?

In [15]:
%%time
from sklearn.model_selection import cross_val_score

scores_tree = cross_val_score(tree, data, target,
    cv=10, scoring='balanced_accuracy', n_jobs=2)
print(f"Average balanced accuracy: "
    f"{scores_tree.mean():.3f} +/- "
    f"{scores_tree.std():.3f}")

Average balanced accuracy: 0.516 +/- 0.100
CPU times: user 22.8 ms, sys: 25.1 ms, total: 47.9 ms
Wall time: 810 ms


Evaluate the performance of a `sklearn.ensemble.RandomForestClassifier` using 300 trees.

In [18]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300, n_jobs=2, random_state=0)
rf

Is the random forest better than the dummy classifier, with an increase of at least 0.04 in balanced accuracy?

We can evaluate the random forest classifier with the following code.

In [17]:
%%time

scores_rf = cross_val_score(rf, data, target,
    cv=10, scoring='balanced_accuracy', n_jobs=2)
print(f"Average balanced accuracy: "
    f"{scores_rf.mean():.3f} +/- "
    f"{scores_rf.std():.3f}")

Average balanced accuracy: 0.521 +/- 0.069
CPU times: user 31.4 ms, sys: 14.6 ms, total: 46 ms
Wall time: 1.24 s
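
The two questions about the 0.04 threshold can be checked with simple arithmetic on the averages printed in this notebook. This is only a sketch: the values are copied by hand from the outputs above, and the variable names are illustrative.

```python
# Mean balanced accuracies copied from the cell outputs in this notebook.
baseline = 0.500  # dummy classifier
candidates = {"decision tree": 0.516, "random forest": 0.521}
threshold = 0.04

# Gain of each model over the dummy baseline.
gains = {name: score - baseline for name, score in candidates.items()}

for name, gain in gains.items():
    verdict = "yes" if gain >= threshold else "no"
    print(f"{name}: gain = {gain:.3f}, at least {threshold}? {verdict}")
```

Both gains are below 0.04, and they are also smaller than the standard deviations reported across folds (0.100 and 0.069), so neither model is clearly better than the dummy baseline on this metric.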


Compare a `sklearn.ensemble.GradientBoostingClassifier` and a `sklearn.ensemble.RandomForestClassifier`, each with 300 trees. To do so, repeat a 10-fold cross-validation 10 times, using balanced accuracy as the metric. For each of the 10 tries, compute the average cross-validation score of both models. Count how many times one model is better than the other.

On average, is the gradient boosting better than the random forest?
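
One possible way to set up this comparison is sketched below. The helper name `count_gbdt_wins`, the use of a shuffled `StratifiedKFold` reseeded at each try, and the synthetic smoke-test data are all assumptions made for illustration; they are not prescribed by the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def count_gbdt_wins(data, target, n_estimators=300, n_trials=10, n_splits=10):
    """Return the number of trials where gradient boosting beats the forest."""
    gbdt = GradientBoostingClassifier(n_estimators=n_estimators, random_state=0)
    forest = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    n_wins = 0
    for seed in range(n_trials):
        # Reshuffle at each trial; both models are scored on the same folds.
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        score_gbdt = cross_val_score(
            gbdt, data, target, cv=cv, scoring="balanced_accuracy"
        ).mean()
        score_forest = cross_val_score(
            forest, data, target, cv=cv, scoring="balanced_accuracy"
        ).mean()
        if score_gbdt > score_forest:
            n_wins += 1
    return n_wins


# Quick smoke run on synthetic data with deliberately small settings.
X_toy, y_toy = make_classification(
    n_samples=200, weights=[0.76, 0.24], random_state=0
)
toy_wins = count_gbdt_wins(X_toy, y_toy, n_estimators=10, n_trials=2, n_splits=3)
print(f"Gradient boosting wins {toy_wins} of 2 toy trials")
```

In the notebook, calling `count_gbdt_wins(data, target)` with the default arguments would run the 10 tries of 10-fold cross-validation with 300 trees per model. Expect this to take noticeably longer than the earlier cells.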