<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Table-of-Contents" data-toc-modified-id="Table-of-Contents-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Table of Contents</a></span></li><li><span><a href="#Introduction" data-toc-modified-id="Introduction-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Load-packages-and-related-objects" data-toc-modified-id="Load-packages-and-related-objects-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Load packages and related objects</a></span></li></ul></li><li><span><a href="#Data-preparation" data-toc-modified-id="Data-preparation-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data preparation</a></span><ul class="toc-item"><li><span><a href="#Load-data" data-toc-modified-id="Load-data-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Load data</a></span></li><li><span><a href="#Feature-encoding" data-toc-modified-id="Feature-encoding-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Feature encoding</a></span></li><li><span><a href="#Split-dataset-$\mapsto$-train/test" data-toc-modified-id="Split-dataset-$\mapsto$-train/test-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Split dataset $\mapsto$ train/test</a></span></li></ul></li><li><span><a href="#Inspection-of-the-training-set" data-toc-modified-id="Inspection-of-the-training-set-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Inspection of the training set</a></span><ul class="toc-item"><li><span><a href="#Short-summary-of-training-set" data-toc-modified-id="Short-summary-of-training-set-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Short summary of training set</a></span></li><li><span><a href="#Visualization" data-toc-modified-id="Visualization-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Visualization</a></span></li></ul></li><li><span><a href="#Classification-with-KNN" data-toc-modified-id="Classification-with-KNN-5"><span 
class="toc-item-num">5&nbsp;&nbsp;</span>Classification with KNN</a></span><ul class="toc-item"><li><span><a href="#Classification-using-all-the-features" data-toc-modified-id="Classification-using-all-the-features-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Classification using all the features</a></span><ul class="toc-item"><li><span><a href="#Classification-errors" data-toc-modified-id="Classification-errors-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span>Classification errors</a></span></li><li><span><a href="#Confusion-matrix" data-toc-modified-id="Confusion-matrix-5.1.2"><span class="toc-item-num">5.1.2&nbsp;&nbsp;</span>Confusion matrix</a></span></li></ul></li><li><span><a href="#Classification-using-only-2-features" data-toc-modified-id="Classification-using-only-2-features-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Classification using only 2 features</a></span><ul class="toc-item"><li><span><a href="#Decision-boundary" data-toc-modified-id="Decision-boundary-5.2.1"><span class="toc-item-num">5.2.1&nbsp;&nbsp;</span>Decision boundary</a></span></li><li><span><a href="#Tune-the-$k$-parameter" data-toc-modified-id="Tune-the-$k$-parameter-5.2.2"><span class="toc-item-num">5.2.2&nbsp;&nbsp;</span>Tune the $k$ parameter</a></span><ul class="toc-item"><li><span><a href="#Manually" data-toc-modified-id="Manually-5.2.2.1"><span class="toc-item-num">5.2.2.1&nbsp;&nbsp;</span>Manually</a></span></li><li><span><a href="#Bonus:-with-sklearn-GridSearchCV" data-toc-modified-id="Bonus:-with-sklearn-GridSearchCV-5.2.2.2"><span class="toc-item-num">5.2.2.2&nbsp;&nbsp;</span>Bonus: with <code>sklearn</code> <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html" target="_blank"><code>GridSearchCV</code></a></a></span></li></ul></li></ul></li></ul></li></ul></div>

# Introduction

The goal of this first TP (lab session) is twofold:

- familiarize yourself with the Python libraries `pandas`, `seaborn` and `sklearn`
- practice the data analysis workflow
    - with the $k$-NN classifier on a toy example: the [`iris` flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set)
    - display a decision boundary
    - apply a simple recipe to tune the $k$ parameter


**You are expected to answer every question, and to comment on and justify everything you do.**

## Load packages and related objects

In [1]:
import numpy as np
import matplotlib.pyplot as plt

# use pandas to play with dataset
import pandas as pd

# use seaborn to display data
import seaborn as sns

# use sklearn to practice ML
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# Algorithm of the day
from sklearn.neighbors import KNeighborsClassifier as KNN

# Data preparation

In this first phase you make first contact with the data:

- size
- name of features (potentially rename some of them)
- missing values (there are no missing values with `iris`)
- type of features
- number of classes

and then split it into train/test sets

## Load data

In [2]:
iris = load_iris() # load iris dataset from scikit-learn

What are the attributes of the `iris` object?

In [None]:
help(iris)
iris.feature_names

In [None]:
iris.target_names

Create `DataFrame`s:

- `X_data` from `iris.data`
- `y_data` from `iris.target`
    * 2 columns: `'label'` (0, 1, 2) and `'specie'` (`'setosa'`, `'versicolor'`, `'virginica'`)
    * set the type of the `'specie'` column to `'category'`

In [12]:
X_data = pd.DataFrame(iris.data, columns=iris.feature_names)
y_data = pd.DataFrame(iris.target, columns=['label'])

y_data['specie'] = y_data.label.map(dict(enumerate(iris.target_names)))
y_data['specie'] = y_data['specie'].astype('category')

What is the size of the dataset?

Hint: use `.shape`

Display the first 10 rows of the dataset

Hint: use `.head()`

What is the type of each feature and what do they correspond to?

What is the proportion of each class?
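
The inspection questions above can be approached along these lines. This is only a sketch: it rebuilds `X_data` and `y_data` so that it runs on its own, and the names `n_rows`, `n_cols`, `first_rows`, `feature_types` and `class_props` are illustrative, not part of the TP.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Rebuild the DataFrames so that this sketch is self-contained
iris = load_iris()
X_data = pd.DataFrame(iris.data, columns=iris.feature_names)
y_data = pd.DataFrame(iris.target, columns=['label'])
y_data['specie'] = y_data.label.map(dict(enumerate(iris.target_names)))
y_data['specie'] = y_data['specie'].astype('category')

# Size of the dataset: (number of samples, number of features)
n_rows, n_cols = X_data.shape

# First 10 rows (head() returns 5 rows by default)
first_rows = X_data.head(10)

# Type of each feature
feature_types = X_data.dtypes

# Proportion of each class (normalize=True gives fractions instead of counts)
class_props = y_data['specie'].value_counts(normalize=True)

print(n_rows, n_cols)
print(feature_types)
print(class_props)
```

Interpreting the output (what each feature measures, whether the classes are balanced) is still up to you.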

## Feature encoding

Rename the features: remove the ` (cm)` suffix for simpler calls and display

In [None]:
X_data.rename(lambda x: '_'.join(x.split(' ')[:-1]),
              axis='columns', inplace=True)

X_data.head()

## Split dataset $\mapsto$ train/test

Hint:
- use `train_test_split`
- think about the `shuffle` and `stratify` arguments!

After the split the test set is **only** used to assess the performance of your classifier on unseen data.

In [None]:
test_frac = 1/3 # Fraction of the data set to consider as test set

X_train, X_test,\
y_train, y_test = train_test_split(X_data, y_data.label,
                                   test_size=test_frac,
                                   shuffle=True,
                                   stratify=y_data.label)#,
#                                    random_state=123)

Comment on the impact of the `shuffle` and `stratify` arguments on the proportion of classes

What could the `random_state` argument be used for? 
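
One way to explore these two questions is to compare the class counts of the test set under different settings. The sketch below is a small experiment, not the expected answer. The seed `123` and the variable names are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same random_state -> the shuffled split is reproduced exactly
_, _, _, y_test_a = train_test_split(X, y, test_size=1/3,
                                     stratify=y, random_state=123)
_, _, _, y_test_b = train_test_split(X, y, test_size=1/3,
                                     stratify=y, random_state=123)
same_split = np.array_equal(y_test_a, y_test_b)

# With stratify=y the class counts of the test set stay balanced
strat_counts = np.bincount(y_test_a, minlength=3)

# Without shuffling, the test set is simply the tail of the dataset;
# iris is ordered by class, so the tail contains a single class
_, _, _, y_test_ns = train_test_split(X, y, test_size=1/3, shuffle=False)
noshuffle_counts = np.bincount(y_test_ns, minlength=3)

print(same_split)
print(strat_counts)
print(noshuffle_counts)
```

Note that `stratify` cannot be combined with `shuffle=False`, which is why the last call omits it.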

___

Create a `DataFrame` named `data_train` as the concatenation of `X_train` and the corresponding labels.

That will be convenient to use in the visualization part

In [None]:
data_train = pd.concat([X_train, y_data.specie[y_train.index]], axis=1)
data_train.head()

# Inspection of the training set

This is the most important part!

You must carefully study the distribution of your data.

For this purpose you are free to compute and display as many statistical properties of the data as you like, and you should comment on what you observe.

## Short summary of training set

Hint: you can use the `describe` method

Compute the correlation matrix
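
As a possible starting point, `describe` and `corr` can be called directly on the training features. The sketch below recreates a training set with an arbitrary seed so that it runs on its own; `summary` and `corr` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X_data = pd.DataFrame(iris.data, columns=cols)

# Arbitrary seed, used only to make this sketch reproducible
X_train, _, y_train, _ = train_test_split(X_data, iris.target,
                                          test_size=1/3,
                                          stratify=iris.target,
                                          random_state=123)

# count, mean, std, min, quartiles and max of each feature
summary = X_train.describe()

# Pearson correlation between each pair of features
corr = X_train.corr()

print(summary)
print(corr)
```

A heatmap of `corr` (for instance with `sns.heatmap(corr, annot=True)`) often makes strongly correlated pairs easier to spot.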

## Visualization

Enjoy [`seaborn`](https://seaborn.pydata.org/index.html) displays

- `boxplot`
- `violinplot`
- `pairplot`
- `pie` (available through `matplotlib` or `pandas`, since `seaborn` has no pie chart)

# Classification with KNN

## Classification using all the features

The `sklearn` classifier [`KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier) works as a regular Python object! 

In particular it has 

- attributes:
    - `n_neighbors`$=k$
    - [`metric`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html)
    - `weights`
- methods:
    - `.fit` to train the model
    - `.predict` to predict the labels (species) of new samples
    - `.score` to provide the classification score (mean accuracy)

Create a baseline $k$-NN classifier with $k=3$ and train the model on `X_train` and `y_train`

### Classification errors

Compute the classification error on the training set `X_train` in 2 ways:

- use `predict` method
- use `score` method

Now evaluate your model on the test set and give the corresponding classification error, `err_test`
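
A minimal sketch of the steps above, assuming a stratified split with an arbitrary seed. The names `knn`, `err_train_predict` and `err_train_score` are illustrative; only `err_test` is required by the rest of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier as KNN

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3,
                                                    stratify=y,
                                                    random_state=123)

# Baseline classifier with k = 3
knn = KNN(n_neighbors=3)
knn.fit(X_train, y_train)

# Way 1: predict, then count the fraction of misclassified samples
y_pred_train = knn.predict(X_train)
err_train_predict = np.mean(y_pred_train != y_train)

# Way 2: score returns the mean accuracy, so the error is 1 - accuracy
err_train_score = 1 - knn.score(X_train, y_train)

# Error on unseen data
err_test = 1 - knn.score(X_test, y_test)

print(err_train_predict, err_train_score, err_test)
```

Both ways should give the same training error, since `score` computes the accuracy of `predict` internally.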

### Confusion matrix

Use the function `confusion_matrix` to compute the confusion matrix, called `conf_mat`, of the classifier on the test set.
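
For instance, assuming a baseline classifier fitted as described earlier (rebuilt here with an arbitrary seed so that the sketch runs on its own):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier as KNN

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3,
                                                    stratify=y,
                                                    random_state=123)
knn = KNN(n_neighbors=3).fit(X_train, y_train)

# Rows are true classes, columns are predicted classes
y_pred_test = knn.predict(X_test)
conf_mat = confusion_matrix(y_test, y_pred_test)

print(conf_mat)
```

The diagonal counts the correctly classified samples; each off-diagonal entry `conf_mat[i, j]` counts samples of class `i` predicted as class `j`.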

To display a nice confusion matrix you can use the following

In [None]:
plt.title('Confusion matrix on test set, with classification error {:.2f}%'.format(100*err_test))

ax = sns.heatmap(conf_mat, annot=True, linewidths=.5)

ax.xaxis.set_ticks_position('top')
plt.xticks(0.5+np.arange(3), iris.target_names)
plt.yticks(0.5+np.arange(3), iris.target_names, verticalalignment='center')

plt.show()

Any comment?

## Classification using only 2 features

Based on the visualization part, which 2 features seem to be the most discriminative?

Create `X_train_2D` the corresponding 2D training set.

In [None]:
features = ['petal_length', 'petal_width'] # petal_width, sepal_length

X_train_2D = X_train[features]

Train a baseline $k$-NN classifier based only on these 2 features, with $k=3$

### Decision boundary

To display the decision boundary, you can mesh the input space and predict the class of each point of the mesh.

To construct the mesh you can use

In [None]:
x_min, x_max = X_data[features[0]].min() - 1.0, X_data[features[0]].max() + 1.0
y_min, y_max = X_data[features[1]].min() - 1.0, X_data[features[1]].max() + 1.0

# Create mesh of the input space
nx, ny = 100, 100 # number of nodes along x (resp. y) axis
mesh_x, mesh_y = np.linspace(x_min, x_max, nx), np.linspace(y_min, y_max, ny)
xx, yy = np.meshgrid(mesh_x, mesh_y)
X_mesh = np.column_stack([xx.ravel(), yy.ravel()])

Predict labels of the mesh's nodes
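
A possible sketch of this step, assuming a 2D classifier called `knn_2D` (a hypothetical name) fitted on `petal_length` and `petal_width` with an arbitrary seed. The mesh is wrapped in a `DataFrame` with the same column names as the training data so that `predict` receives consistent feature names.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier as KNN

iris = load_iris()
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X_data = pd.DataFrame(iris.data, columns=cols)
features = ['petal_length', 'petal_width']

X_train, X_test, y_train, y_test = train_test_split(
    X_data, iris.target, test_size=1/3,
    stratify=iris.target, random_state=123)
knn_2D = KNN(n_neighbors=3).fit(X_train[features], y_train)

# Mesh of the 2D input space, built as in the cell above
x_min, x_max = X_data[features[0]].min() - 1.0, X_data[features[0]].max() + 1.0
y_min, y_max = X_data[features[1]].min() - 1.0, X_data[features[1]].max() + 1.0
nx, ny = 100, 100
xx, yy = np.meshgrid(np.linspace(x_min, x_max, nx),
                     np.linspace(y_min, y_max, ny))
X_mesh = pd.DataFrame(np.column_stack([xx.ravel(), yy.ravel()]),
                      columns=features)

# One predicted label per node of the mesh
y_pred_mesh = knn_2D.predict(X_mesh)
```

`y_pred_mesh` is a flat array of `nx * ny` labels; reshaping it to `xx.shape` gives the grid expected by `contourf`.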

To display the decision boundary you can use `contourf` in the following way

In [None]:
fig, ax = plt.subplots()

ax.contourf(xx, yy, y_pred_mesh.reshape(xx.shape),
            cmap=plt.cm.coolwarm, alpha=0.7)

plt.show()

Now **you** can display both the decision boundary and train/test examples

### Tune the $k$ parameter

#### Manually

Split the 2D training set into a smaller training set and a validation set.

The validation set is used as an indicator of the predictive power of the $k$-NN classifier on the test set for different values of $k$.

In other words

- `X_train_2D`, `y_train` $\mapsto$ `X_train_small` $\cup$ `X_valid`, `y_train_small` $\cup$ `y_valid`
- Fit the model for different values of $k$ and predict on the validation set
- Pick the value of $k$ that yields the smallest validation error
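
The three steps above can be sketched as follows. The validation fraction (30%), the range of $k$ (1 to 20) and the seeds are arbitrary assumptions, and the names `k_values`, `valid_errors` and `best_k` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier as KNN

iris = load_iris()
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X_data = pd.DataFrame(iris.data, columns=cols)
features = ['petal_length', 'petal_width']

X_train, X_test, y_train, y_test = train_test_split(
    X_data, iris.target, test_size=1/3,
    stratify=iris.target, random_state=123)
X_train_2D = X_train[features]

# Step 1: split the training set into a smaller training set and a validation set
X_train_small, X_valid, y_train_small, y_valid = train_test_split(
    X_train_2D, y_train, test_size=0.3, stratify=y_train, random_state=0)

# Step 2: fit one model per value of k and record its validation error
k_values = list(range(1, 21))
valid_errors = []
for k in k_values:
    knn_k = KNN(n_neighbors=k).fit(X_train_small, y_train_small)
    valid_errors.append(1 - knn_k.score(X_valid, y_valid))

# Step 3: keep the k with the smallest validation error
# (np.argmin returns the first minimum, i.e. the smallest such k)
best_k = k_values[int(np.argmin(valid_errors))]

print(best_k, min(valid_errors))
```

With a validation set this small, several values of $k$ often tie, and the selected value can change with the split. Plotting `valid_errors` against `k_values` helps judge how stable the choice is.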

#### Bonus: with `sklearn` [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

In [None]:
from sklearn.model_selection import GridSearchCV
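
`GridSearchCV` automates the manual procedure by replacing the single validation set with cross-validation. A minimal sketch, assuming 5 folds, a grid of $k$ from 1 to 20 and an arbitrary seed for the train/test split:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier as KNN

iris = load_iris()
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X_data = pd.DataFrame(iris.data, columns=cols)
features = ['petal_length', 'petal_width']

X_train, X_test, y_train, y_test = train_test_split(
    X_data, iris.target, test_size=1/3,
    stratify=iris.target, random_state=123)
X_train_2D = X_train[features]

# Candidate values of k, evaluated by 5-fold cross-validation
param_grid = {'n_neighbors': list(range(1, 21))}
search = GridSearchCV(KNN(), param_grid, cv=5, scoring='accuracy')
search.fit(X_train_2D, y_train)

best_k = search.best_params_['n_neighbors']
cv_error = 1 - search.best_score_   # mean cross-validated error of the best k

# By default the best model is refit on the whole training set,
# so it can be evaluated on the test set directly
err_test = 1 - search.score(X_test[features], y_test)

print(best_k, cv_error, err_test)
```

Because every training sample is used for validation in turn, the selected $k$ is usually less sensitive to the split than with a single validation set.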