In [1]:
%%bash

wget -qO "../../datasets/penguins.csv" "https://github.com/INRIA/scikit-learn-mooc/raw/master/datasets/penguins.csv"

Load the dataset file named `penguins.csv` with the following command:

In [3]:
import pandas as pd

penguins = pd.read_csv("../../datasets/penguins.csv")

columns = ['Body Mass (g)', 'Flipper Length (mm)', 'Culmen Length (mm)']
target_name = 'Species'

# Remove rows with missing values in the columns of interest
penguins_non_missing = penguins[columns + [target_name]].dropna()

data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]

`penguins` is a pandas dataframe. The column "Species" contains the target variable. We extract three numerical columns that quantify physical attributes of the animals and store them in the dataframe named `data`. Our goal is to predict the species of each animal from those attributes.

We can have a look at the target variable:

In [6]:
target.value_counts(normalize=True)

Adelie Penguin (Pygoscelis adeliae)          0.441520
Gentoo penguin (Pygoscelis papua)            0.359649
Chinstrap penguin (Pygoscelis antarctica)    0.198830
Name: Species, dtype: float64

We observe that there are 3 classes and that there are more than twice as many Adelie Penguins as there are Chinstrap penguins in this dataset.
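As a quick check of the imbalance described above, the following sketch recomputes the ratio between the most and least frequent classes. The proportions are copied by hand from the `value_counts` output; the variable names are illustrative and not part of the original notebook.

```python
# Class proportions copied from the value_counts(normalize=True) output above.
proportions = {
    "Adelie Penguin (Pygoscelis adeliae)": 0.441520,
    "Gentoo penguin (Pygoscelis papua)": 0.359649,
    "Chinstrap penguin (Pygoscelis antarctica)": 0.198830,
}

# Identify the most and least frequent classes and compare their shares.
most_frequent = max(proportions, key=proportions.get)
least_frequent = min(proportions, key=proportions.get)
imbalance_ratio = proportions[most_frequent] / proportions[least_frequent]

print(f"{most_frequent} / {least_frequent}: {imbalance_ratio:.2f}")
```

A ratio above 2 confirms the statement about Adelie and Chinstrap penguins.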

We can have a look at the scale of the input features with:

In [5]:
data.describe()

       Body Mass (g)  Flipper Length (mm)  Culmen Length (mm)
count     342.000000           342.000000          342.000000
mean     4201.754386           200.915205           43.921930
std       801.954536            14.061714            5.459584
min      2700.000000           172.000000           32.100000
25%      3550.000000           190.000000           39.225000
50%      4050.000000           197.000000           44.450000
75%      4750.000000           213.000000           48.500000
max      6300.000000           231.000000           59.600000


We observe that the body mass varies between 2700 g and 6300 g with a standard deviation of about 802 g, while the length of the culmen varies between 32.1 mm and 59.6 mm with a standard deviation of about 5.5 mm. Therefore, in their original units, the features have very different dynamic ranges.
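To make the difference in scale concrete, here is a minimal pure-Python sketch of standardization (subtracting the mean and dividing by the population standard deviation, which is what `StandardScaler` computes by default). It uses only the five quantiles from the `describe()` output as a tiny stand-in sample, not the full dataset, and the helper `standardize` is a hypothetical name introduced for illustration.

```python
from statistics import fmean, pstdev

# Quantiles (min, 25%, 50%, 75%, max) copied from the describe() output above.
# They serve as a small illustrative sample, not as the real data.
body_mass_g = [2700.0, 3550.0, 4050.0, 4750.0, 6300.0]
culmen_length_mm = [32.1, 39.225, 44.45, 48.5, 59.6]


def standardize(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    return [(value - mean) / std for value in values]


def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


scaled_mass = standardize(body_mass_g)
scaled_culmen = standardize(culmen_length_mm)

# Raw ranges differ by two orders of magnitude; scaled ranges are comparable.
print("raw ranges:   ", value_range(body_mass_g), value_range(culmen_length_mm))
print("scaled ranges:", value_range(scaled_mass), value_range(scaled_culmen))
```

After standardization each feature has zero mean and unit standard deviation, so neither dominates a distance computation merely because of its unit.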

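We can build a pipeline that standardizes the features before a k-nearest neighbors classifier, and display it as an interactive diagram: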

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Standardize each feature, then classify with the 5 nearest neighbors
model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", KNeighborsClassifier(n_neighbors=5)),
])
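The sketch below shows how a pipeline of this shape can be fitted and used for prediction. It does not use the penguins data: `X_toy` and `y_toy` are hypothetical synthetic arrays, generated only so that the example is self-contained, with two columns on very different scales to mimic body mass and culmen length.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two well-separated groups whose first column is in
# the thousands and whose second column is in the tens.
rng = np.random.default_rng(0)
group_a = np.column_stack([rng.normal(3500, 100, 20), rng.normal(38, 1, 20)])
group_b = np.column_stack([rng.normal(5500, 100, 20), rng.normal(48, 1, 20)])
X_toy = np.vstack([group_a, group_b])
y_toy = np.array(["a"] * 20 + ["b"] * 20)

toy_model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", KNeighborsClassifier(n_neighbors=5)),
])

# fit() learns the scaling statistics, then fits the classifier on scaled data.
toy_model.fit(X_toy, y_toy)

# predict() applies the same scaling before querying the nearest neighbors.
predictions = toy_model.predict([[3500, 38], [5500, 48]])
scaler = toy_model.named_steps["preprocessor"]
print(predictions, scaler.mean_)
```

Because the scaler is part of the pipeline, the means and standard deviations learned during `fit` are reused automatically at prediction time.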
