In [1]:
import os

os.makedirs("../../datasets", exist_ok=True)

In [2]:
%%bash

wget -qO "../../datasets/penguins.csv" "https://github.com/INRIA/scikit-learn-mooc/raw/master/datasets/penguins.csv"

Load the dataset file named `penguins.csv` with the following command:

In [3]:
import pandas as pd

penguins = pd.read_csv("../../datasets/penguins.csv")

columns = ['Body Mass (g)', 'Flipper Length (mm)', 'Culmen Length (mm)']
target_name = 'Species'

# Remove rows with missing values in the columns of interest
penguins_non_missing = penguins[columns + [target_name]].dropna()

data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]

`penguins` is a pandas dataframe. The column "Species" contains the target variable. We extract three numerical columns that quantify various attributes of the animals, and our goal is to predict the species of each animal based on those attributes, which are stored in the dataframe named `data`.

We can have a look at the target variable:

In [4]:
target.value_counts(normalize=True)

Adelie Penguin (Pygoscelis adeliae)          0.441520
Gentoo penguin (Pygoscelis papua)            0.359649
Chinstrap penguin (Pygoscelis antarctica)    0.198830
Name: Species, dtype: float64

We observe that there are 3 classes and that there are more than twice as many Adelie Penguins as there are Chinstrap penguins in this dataset.

We can have a look at the scale of the input features with:

In [5]:
data.describe()

| | Body Mass (g) | Flipper Length (mm) | Culmen Length (mm) |
|---|---|---|---|
| count | 342.0 | 342.0 | 342.0 |
| mean | 4201.754386 | 200.915205 | 43.92193 |
| std | 801.954536 | 14.061714 | 5.459584 |
| min | 2700.0 | 172.0 | 32.1 |
| 25% | 3550.0 | 190.0 | 39.225 |
| 50% | 4050.0 | 197.0 | 44.45 |
| 75% | 4750.0 | 213.0 | 48.5 |
| max | 6300.0 | 231.0 | 59.6 |


We observe that the body mass varies between 2700 g and 6300 g with a standard deviation of about 802 g, while the length of the culmen varies between 32.1 mm and 59.6 mm with a standard deviation of about 5.5 mm. Therefore, if we use the default units, the features do not have the same dynamic range at all.
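
To make the consequence concrete for a distance-based model, here is a minimal sketch that compares how much each feature contributes to a squared Euclidean distance before and after standardization. The two penguins are hypothetical values chosen for illustration, not rows of the dataset. The means and standard deviations are copied from the summary table above, which reports the sample standard deviation, so the result is only approximately what a fitted `StandardScaler` would produce.

```python
# Feature order: body mass (g), flipper length (mm), culmen length (mm).
# Means and standard deviations are copied from the summary table above.
means = [4201.754386, 200.915205, 43.92193]
stds = [801.954536, 14.061714, 5.459584]

# Two hypothetical penguins (illustrative values, not rows from the dataset).
penguin_a = [3500.0, 190.0, 39.0]
penguin_b = [4500.0, 215.0, 48.0]


def squared_differences(x, y):
    """Per-feature squared differences; their sum is the squared Euclidean distance."""
    return [(xi - yi) ** 2 for xi, yi in zip(x, y)]


def standardize(x):
    """Subtract the mean and divide by the standard deviation, feature by feature."""
    return [(xi - m) / s for xi, m, s in zip(x, means, stds)]


raw = squared_differences(penguin_a, penguin_b)
scaled = squared_differences(standardize(penguin_a), standardize(penguin_b))

# Share of the squared distance contributed by each feature.
raw_share = [d / sum(raw) for d in raw]
scaled_share = [d / sum(scaled) for d in scaled]

print("raw units:   ", [round(s, 3) for s in raw_share])
print("standardized:", [round(s, 3) for s in scaled_share])
```

In raw units the body mass accounts for almost all of the distance, so the two length features are effectively ignored when looking for nearest neighbors. After standardization the three features contribute on a comparable scale.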

We can display an interactive diagram with the following command:

In [6]:
from sklearn import set_config

set_config(display='diagram')

In [7]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", KNeighborsClassifier(n_neighbors=5))
])
model

Evaluate the pipeline with 10-fold cross-validation and the balanced accuracy scoring metric: use `sklearn.model_selection.cross_validate` with `scoring='balanced_accuracy'`.

The cross-validated scores can be computed with:

In [8]:
%%time
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target, cv=10,
    scoring='balanced_accuracy')
scores = cv_results["test_score"]
print(f"The average cross-validation scores:\n"
    f"{scores.mean():.3f} +/- {scores.std():.3f}")

The average cross-validation scores:
0.952 +/- 0.040
CPU times: user 79 ms, sys: 0 ns, total: 79 ms
Wall time: 77.3 ms


This gives per-fold scores between about 0.88 and 1.0, with an average close to 0.95.
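
Balanced accuracy is the average of the per-class recalls, which matters here because the classes are not equally frequent. The sketch below re-implements that definition in plain Python on hypothetical labels (the helper name `balanced_accuracy` and the toy labels are assumptions for illustration; in practice we rely on the scikit-learn scorer).

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Average of the per-class recalls, computed over the classes in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical labels: 8 Adelie, 2 Chinstrap; the classifier always answers Adelie.
y_true = ["Adelie"] * 8 + ["Chinstrap"] * 2
y_pred = ["Adelie"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

A classifier that always predicts the majority class looks decent on plain accuracy but is penalized by balanced accuracy, since its recall on the minority class is zero.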

Use `model.get_params()` to list the parameters of the pipeline and use `model.set_params(param_name=param_value)` to update them.
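
Parameters of a pipeline step are addressed as `<step name>__<parameter name>`, for example `classifier__n_neighbors` for the pipeline defined above. The helper below is a hypothetical illustration of that naming convention only; it is not scikit-learn's own implementation.

```python
def split_param_name(name):
    """Split 'step__param' into (step, param); param is None for top-level names."""
    step, delimiter, param = name.partition("__")
    return (step, param) if delimiter else (step, None)


for name in ["verbose", "classifier", "classifier__n_neighbors",
             "preprocessor__with_mean"]:
    print(name, "->", split_param_name(name))
```

Names without a double underscore, such as `verbose` or `classifier`, refer to the pipeline itself or to a whole step, which is why a step can be replaced in one call to `set_params`.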

In [9]:
for parameter in model.get_params():
    print(parameter)

memory
steps
verbose
preprocessor
classifier
preprocessor__copy
preprocessor__with_mean
preprocessor__with_std
classifier__algorithm
classifier__leaf_size
classifier__metric
classifier__metric_params
classifier__n_jobs
classifier__n_neighbors
classifier__p
classifier__weights


We can change the pipeline parameters and rerun the cross-validation with:

In [10]:
%%time
for k in [5, 51]:
    model.set_params(classifier__n_neighbors=k)
    cv_results = cross_validate(model, data, target, cv=10,
        scoring='balanced_accuracy')
    scores = cv_results["test_score"]
    print(f"The average cross-validation scores with n_neighbors={k}:\n"
        f"{scores.mean():.3f} +/- {scores.std():.3f}")

The average cross-validation scores with n_neighbors=5:
0.952 +/- 0.040
The average cross-validation scores with n_neighbors=51:
0.942 +/- 0.039
CPU times: user 152 ms, sys: 127 µs, total: 152 ms
Wall time: 151 ms


Using `n_neighbors=51` gives slightly worse test scores, but the difference is not necessarily significant: the two score ranges overlap a lot.

We can disable the preprocessor by setting the `preprocessor` parameter to `None` (while resetting the number of neighbors to 5) as follows:

In [11]:
%%time
model.set_params(preprocessor=None, classifier__n_neighbors=5)
cv_results = cross_validate(model, data, target, cv=10,
    scoring='balanced_accuracy')
scores = cv_results["test_score"]
print(f"The average cross-validation scores:\n"
    f"{scores.mean():.3f} +/- {scores.std():.3f}")

The average cross-validation scores:
0.740 +/- 0.087
CPU times: user 64.4 ms, sys: 191 µs, total: 64.6 ms
Wall time: 63.1 ms


We will now study the impact of different preprocessors defined in the list below:

In [12]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer

all_preprocessors = [
    None,
    StandardScaler(),
    MinMaxScaler(),
    QuantileTransformer(n_quantiles=100),
    PowerTransformer(method='box-cox')
]

The [Box-Cox method](https://en.wikipedia.org/wiki/Power_transform#Box%E2%80%93Cox_transformation) is a common preprocessing strategy for strictly positive values. The other preprocessors work for any kind of numerical feature. If you are curious about the details of those methods, feel free to read about them in the [preprocessing chapter](https://scikit-learn.org/stable/modules/preprocessing.html) of the scikit-learn user guide.
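
As a rough illustration of why Box-Cox needs positive inputs, here is a minimal sketch of the transform for a fixed exponent. Fixing `lmbda` by hand is an assumption made for the example: `PowerTransformer` estimates it from the data by maximum likelihood and, by default, also standardizes the transformed values.

```python
import math


def box_cox(x, lmbda):
    """Box-Cox transform of a strictly positive value for a fixed lambda."""
    if x <= 0:
        raise ValueError("Box-Cox is only defined for strictly positive values")
    if lmbda == 0:
        return math.log(x)
    return (x ** lmbda - 1) / lmbda


# Body mass values (g) taken from the min, median and max rows of the summary table.
for mass in [2700.0, 4050.0, 6300.0]:
    print(mass, round(box_cox(mass, 0.0), 3), round(box_cox(mass, 0.5), 3))
```

The logarithm and the fractional power are what restrict the method to positive values, which is fine here because masses and lengths are always positive.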

Use `sklearn.model_selection.GridSearchCV` to study the impact of the choice of the preprocessor and the number of neighbors on the 10-fold cross-validated `balanced_accuracy` metric. We want to study the `n_neighbors` in the range `[5, 51, 101]` and `preprocessor` in the range `all_preprocessors`.
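
A grid search evaluates every combination of the listed values. The sketch below only counts those combinations, using string stand-ins for the entries of `all_preprocessors` (the stand-in names are an assumption for illustration; no model is fitted).

```python
from itertools import product

# String stand-ins for the entries of `all_preprocessors` defined above.
preprocessor_names = ["None", "StandardScaler", "MinMaxScaler",
                      "QuantileTransformer", "PowerTransformer"]
n_neighbors_values = [5, 51, 101]
n_folds = 10

# One candidate per (preprocessor, n_neighbors) pair.
candidates = [
    {"preprocessor": name, "classifier__n_neighbors": k}
    for name, k in product(preprocessor_names, n_neighbors_values)
]

print(len(candidates))            # 15 parameter combinations
print(len(candidates) * n_folds)  # 150 cross-validation fits
```

With the default `refit=True`, `GridSearchCV` additionally refits the best combination once on the full dataset.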

Let us consider that a model is significantly better than another if the mean test score is better than the mean test score of the alternative by more than the standard deviation of its test score.
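
This rule of thumb can be written as a one-line check. The sketch below reads "its test score" as the standard deviation of the better model, which is one possible interpretation; the function name is hypothetical and the numbers are the scores reported earlier in this notebook.

```python
def is_significantly_better(mean_a, std_a, mean_b):
    """True if model A beats model B by more than A's own standard deviation."""
    return (mean_a - mean_b) > std_a


# Scaled pipeline with n_neighbors=5 (0.952 +/- 0.040) against n_neighbors=51 (0.942).
print(is_significantly_better(0.952, 0.040, 0.942))  # False
# Same pipeline against the variant without a preprocessor (0.740).
print(is_significantly_better(0.952, 0.040, 0.740))  # True
```

This is a heuristic for reading the result tables, not a statistical test.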

In [13]:
%%time
from sklearn.model_selection import GridSearchCV

param_grid = {
    'preprocessor': all_preprocessors,
    'classifier__n_neighbors': [5, 51, 101]
}
grid_search = GridSearchCV(
    model, param_grid=param_grid, scoring='balanced_accuracy', cv=10
).fit(data, target)

CPU times: user 1.13 s, sys: 13.4 ms, total: 1.15 s
Wall time: 1.14 s


We can sort the results and focus on the columns of interest with:

In [14]:
results = pd.DataFrame(grid_search.cv_results_)\
    .sort_values('mean_test_score', ascending=False)
results.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_classifier__n_neighbors | param_preprocessor | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | split5_test_score | split6_test_score | split7_test_score | split8_test_score | split9_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.00332 | 0.000114 | 0.002996 | 0.000119 | 5 | StandardScaler() | {'classifier__n_neighbors': 5, 'preprocessor':... | 1.0 | 1.0 | 1.0 | 0.918803 | 0.88254 | 0.952381 | 0.977778 | 0.930159 | 0.907937 | 0.952381 | 0.952198 | 0.039902 | 1 |
| 2 | 0.003158 | 0.000184 | 0.002944 | 2.2e-05 | 5 | MinMaxScaler() | {'classifier__n_neighbors': 5, 'preprocessor':... | 1.0 | 0.952381 | 1.0 | 0.944444 | 0.88254 | 0.930159 | 0.955556 | 0.952381 | 0.907937 | 0.952381 | 0.947778 | 0.034268 | 2 |
| 3 | 0.004007 | 0.000144 | 0.003268 | 0.000229 | 5 | QuantileTransformer(n_quantiles=100) | {'classifier__n_neighbors': 5, 'preprocessor':... | 0.952381 | 0.92674 | 1.0 | 0.918803 | 0.904762 | 1.0 | 0.977778 | 0.930159 | 0.907937 | 0.952381 | 0.947094 | 0.033797 | 3 |
| 4 | 0.006139 | 0.00042 | 0.003093 | 4.4e-05 | 5 | PowerTransformer(method='box-cox') | {'classifier__n_neighbors': 5, 'preprocessor':... | 1.0 | 0.977778 | 1.0 | 0.863248 | 0.88254 | 0.952381 | 0.955556 | 0.930159 | 0.907937 | 1.0 | 0.94696 | 0.047387 | 4 |
| 6 | 0.003255 | 8.5e-05 | 0.003108 | 4.9e-05 | 51 | StandardScaler() | {'classifier__n_neighbors': 51, 'preprocessor'... | 0.952381 | 0.977778 | 1.0 | 0.863248 | 0.88254 | 0.952381 | 0.955556 | 0.952381 | 0.930159 | 0.952381 | 0.94188 | 0.038905 | 5 |


In [15]:
results = results[
    [c for c in results.columns if c.startswith("param_")]
    + ['mean_test_score', 'std_test_score']
]
results

| | param_classifier__n_neighbors | param_preprocessor | mean_test_score | std_test_score |
|---|---|---|---|---|
| 1 | 5 | StandardScaler() | 0.952198 | 0.039902 |
| 2 | 5 | MinMaxScaler() | 0.947778 | 0.034268 |
| 3 | 5 | QuantileTransformer(n_quantiles=100) | 0.947094 | 0.033797 |
| 4 | 5 | PowerTransformer(method='box-cox') | 0.94696 | 0.047387 |
| 6 | 51 | StandardScaler() | 0.94188 | 0.038905 |
| 8 | 51 | QuantileTransformer(n_quantiles=100) | 0.927277 | 0.043759 |
| 9 | 51 | PowerTransformer(method='box-cox') | 0.922833 | 0.047883 |
| 7 | 51 | MinMaxScaler() | 0.920293 | 0.045516 |
| 11 | 101 | StandardScaler() | 0.876642 | 0.041618 |
| 12 | 101 | MinMaxScaler() | 0.862357 | 0.046244 |


We can observe that models with any scaler and `n_neighbors=5` typically perform best, but not necessarily significantly better than with `n_neighbors=51`. For all those models, the mean test `balanced_accuracy` is above 0.92, while the best model is around 0.95 +/- 0.04.

The models with no preprocessor (`preprocessor=None`) are all below 0.75, even for `n_neighbors=5`.

Models with any preprocessor and `n_neighbors=101` are in the range 0.80 to 0.88. They are significantly better than models without a preprocessor but also significantly worse than models with lower values of `n_neighbors`.

The main reason why removing the preprocessor leads to bad performance is that the input features have very different dynamic ranges when using the default units (grams and millimeters).

As usual, setting too large a value for `n_neighbors` causes under-fitting. Here the data is well structured and contains little noise: low values of `n_neighbors` are as good as or better than intermediate values, as there is not much over-fitting possible.
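
The under-fitting effect of a large `n_neighbors` can be reproduced on a toy problem. The sketch below is a minimal pure-Python nearest-neighbors vote on hypothetical one-dimensional data (the helper `knn_predict` and the data are assumptions for illustration, not the penguins dataset or the scikit-learn implementation).

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query` (1-D feature)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical, well-separated 1-D data: six points of class "a", three of class "b".
train_x = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 9.0, 9.1, 9.2]
train_y = ["a"] * 6 + ["b"] * 3

print(knn_predict(train_x, train_y, query=9.05, k=3))  # b
print(knn_predict(train_x, train_y, query=9.05, k=9))  # a
```

With `k=3` the query in the middle of the "b" cluster is classified correctly. With `k=9` every training point votes, so the prediction is always the majority class regardless of the query.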