<img align="center" style="max-width: 1000px" src="banner.png">

<img align="right" style="max-width: 200px; height: auto" src="hsg_logo.png">

##  Lab 03 - "Supervised Machine Learning: k Nearest-Neighbors" Assignments

EMBA 58/59 - W8/3 - "AI Coding for Executives", University of St. Gallen

In the last lab, we saw an application of **supervised machine learning**: we used the **k Nearest-Neighbor (kNN) classifier** to classify features derived from delicious real-world **Wine samples**. You learned how to train a model and how to evaluate and interpret its results. In this lab, we build on that knowledge by applying it to a set of related self-coding assignments. But before we do so, let's start with a motivational video by OpenAI:

In [None]:
from IPython.display import YouTubeVideo
# OpenAI: "Solving Rubik's Cube with a Robot Hand"
YouTubeVideo('x4O8pojMF0w', width=1000, height=500)

As always, please don't hesitate to ask your questions during the lab, post them in our CANVAS (StudyNet) forum (https://learning.unisg.ch), or send us an email (using the course email).

## 1. Assignment Objectives:

Similar to today's lab session, after completing today's self-coding assignments you should be able to:

> 1. Know how to set up a **notebook or "pipeline"** that solves a simple supervised classification task.
> 2. Recognize the **data elements** needed to train and evaluate a supervised machine learning classifier. 
> 3. Understand how a discriminative **k Nearest-Neighbor (kNN)** classifier can be trained and evaluated.
> 4. Know how to use Python's sklearn library to **train** and **evaluate** arbitrary classifiers.
> 5. Understand how to **evaluate** and **interpret** the classification results.

## 2. Setup of the Jupyter Notebook Environment

As in the previous labs, we need to import a couple of Python libraries that allow for data analysis and data visualization. In this lab, we will use the `Pandas`, `Numpy`, `Scikit-Learn`, `Matplotlib` and `Seaborn` libraries. Let's import the libraries by executing the statements below:

In [None]:
# import the numpy, scipy and pandas data science library
import pandas as pd
import numpy as np
from scipy.stats import norm

# import sklearn data and data pre-processing libraries
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# import k-nearest neighbor classifier library
from sklearn.neighbors import KNeighborsClassifier

# import sklearn classification evaluation library
from sklearn import metrics
from sklearn.metrics import confusion_matrix 

# import matplotlib data visualization library
import matplotlib.pyplot as plt
import seaborn as sns

Enable inline Jupyter notebook plotting:

In [None]:
%matplotlib inline

Use the `Seaborn` plotting style in all subsequent visualizations:

In [None]:
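# matplotlib 3.6 renamed the 'seaborn' style to 'seaborn-v0_8'; use whichever name is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')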

## 3. k Nearest-Neighbors (kNN) Classification Assignments

### 3.1 Wine Dataset Download

Let's download the delicious **Wine Dataset** that we will use for the following assignments. It is a classic and straightforward multi-class classification dataset.

<img align="center" style="max-width: 600px; height: auto" src="https://github.com/GitiHubi/courseAIML/blob/master/lab_03/wine_dataset.jpg?raw=1">

(Source: https://www.empirewine.com)

The data is the result of a chemical analysis of wines grown in the same region in Italy by three different cultivators (types). The dataset consists of **178 wines** in total, together with **13 different measurements** taken for different constituents found in the three types of wine. Please find below the list of the individual measurements (features):

>- `Alcohol`
>- `Malic acid`
>- `Ash`
>- `Alcalinity of ash`
>- `Magnesium`
>- `Total phenols`
>- `Flavanoids`
>- `Nonflavanoid phenols`
>- `Proanthocyanins`
>- `Color intensity`
>- `Hue`
>- `OD280/OD315 of diluted wines`
>- `Proline`

Further details on the dataset can be obtained from the following publication: *Forina, M. et al., PARVUS - "An Extendible Package for Data Exploration, Classification and Correlation.", Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.*

Let's load the dataset and conduct a preliminary data assessment: 

In [None]:
wine = datasets.load_wine()

Print and inspect feature names of the dataset:

In [None]:
wine.feature_names

Print and inspect the class names of the dataset:

In [None]:
wine.target_names

Print and inspect the top 10 feature rows of the dataset:

In [None]:
pd.DataFrame(wine.data, columns=wine.feature_names).head(10)

Print and inspect the top 10 labels of the dataset:

In [None]:
pd.DataFrame(wine.target).head(10)

Determine and print the feature dimensionality of the dataset:

In [None]:
wine.data.shape

Determine and print the label dimensionality of the dataset:

In [None]:
wine.target.shape
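
Beyond the dimensionalities, it can help to know how many wines belong to each class, since a strongly imbalanced dataset would make accuracy harder to interpret. The following is a minimal sketch that counts the samples per class; the variable name `class_counts` is our own choice for illustration:

```python
import numpy as np
from sklearn import datasets

# load the wine dataset
wine = datasets.load_wine()

# count how many samples carry each class label (0, 1, 2)
class_counts = np.bincount(wine.target)

# print the number of samples per class name
for name, count in zip(wine.target_names, class_counts):
    print(f"{name}: {count} samples")
```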

Plot the data distributions of the distinct features:

In [None]:
# init the plot
plt.figure(figsize=(10, 10))

# prepare the dataset to be plottable using seaborn

# convert to a pandas DataFrame
wine_plot = pd.DataFrame(wine.data, columns=wine.feature_names)

# add class labels to the DataFrame
wine_plot['class'] = wine.target

# plot a pairplot of the distinct feature distributions
sns.pairplot(wine_plot, diag_kind='hist', hue='class');

### 3.2 Dataset Pre-Processing

#### 3.2.1 Feature Re-Scaling

Let's re-scale the distinct feature values of the **Wine Dataset** by applying **Min-Max Normalization** with the `MinMaxScaler` class of the `sklearn` library:

In [None]:
# init the min-max scaler
scaler = MinMaxScaler(feature_range=(0, 1), copy=True)

# min-max normalize the distinct feature values
wine_data_scaled = scaler.fit_transform(wine.data)
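
To see what the scaler does, the sketch below recomputes the normalization by hand using the usual min-max formula $x' = (x - x_{min}) / (x_{max} - x_{min})$, applied to each feature column separately, and compares the result with sklearn's output. The names `feature_min`, `feature_max` and `wine_data_manual` are illustrative and not part of the lab code:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler

# load the wine dataset
wine = datasets.load_wine()

# per-feature (column-wise) minimum and maximum
feature_min = wine.data.min(axis=0)
feature_max = wine.data.max(axis=0)

# min-max normalization by hand: x' = (x - min) / (max - min)
wine_data_manual = (wine.data - feature_min) / (feature_max - feature_min)

# the same normalization using sklearn's MinMaxScaler
wine_data_sklearn = MinMaxScaler(feature_range=(0, 1)).fit_transform(wine.data)

# check that both approaches agree up to floating point precision
print(np.allclose(wine_data_manual, wine_data_sklearn))
```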

Print and inspect the top 10 feature rows of the normalized dataset:

In [None]:
pd.DataFrame(wine_data_scaled, columns=wine.feature_names).head(10)

Now that all feature values are scaled to the range $[0,1]$, let's visualize and inspect the resulting feature value distributions:

In [None]:
# init the plot
plt.figure(figsize=(10, 10))

# prepare the dataset to be plottable using seaborn

# convert to a pandas DataFrame
wine_plot = pd.DataFrame(wine_data_scaled, columns=wine.feature_names)

# add class labels to the DataFrame
wine_plot['class'] = wine.target

# plot a pairplot of the distinct feature distributions
sns.pairplot(wine_plot, diag_kind='hist', hue='class');

Excellent, the shapes of the distinct feature value distributions remained unchanged; only their value ranges were mapped to $[0,1]$.
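
The reason is that min-max normalization shifts and stretches each feature linearly, so the ordering and relative spacing of the values within a feature are preserved. One way to check this numerically, as a small sketch, is to confirm that every scaled feature is perfectly correlated with its original counterpart (the list name `correlations` is our own):

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler

# load and min-max normalize the wine dataset
wine = datasets.load_wine()
wine_data_scaled = MinMaxScaler().fit_transform(wine.data)

# Pearson correlation between each original feature and its scaled version
correlations = [
    np.corrcoef(wine.data[:, j], wine_data_scaled[:, j])[0, 1]
    for j in range(wine.data.shape[1])
]

# a linear re-scaling should yield a correlation of 1 for every feature
print(np.round(correlations, 6))
```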

#### 3.2.2 Extraction of Training- and Evaluation-Dataset

We set the fraction of evaluation records to **30%** of the original dataset:

In [None]:
eval_fraction = 0.3

Furthermore, let's set a random seed to ensure reproducibility of the train-test split in potential future runs of the notebook:

In [None]:
seed = 42

Randomly split the **Wine Dataset** into a training set and an evaluation set using sklearn's `train_test_split` function:

In [None]:
# 70% training and 30% evaluation
X_train_scaled, X_eval_scaled, y_train_scaled, y_eval_scaled = train_test_split(wine_data_scaled, wine.target, test_size=eval_fraction, random_state=seed)

Evaluate the training set dimensionality:

In [None]:
X_train_scaled.shape, y_train_scaled.shape

Evaluate the evaluation set dimensionality:

In [None]:
X_eval_scaled.shape, y_eval_scaled.shape
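
As a sanity check on the split, the sketch below verifies that both subsets together cover all 178 wines and prints how many samples of each class ended up in each subset. It repeats the pre-processing steps so that it runs on its own, and uses shortened variable names (`X_train`, `y_train`, ...) purely for illustration:

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# load and min-max normalize the wine dataset
wine = datasets.load_wine()
wine_data_scaled = MinMaxScaler().fit_transform(wine.data)

# 70% training and 30% evaluation, same seed as above
X_train, X_eval, y_train, y_eval = train_test_split(
    wine_data_scaled, wine.target, test_size=0.3, random_state=42
)

# both subsets together should cover all samples of the dataset
print("total samples:", len(y_train) + len(y_eval))

# number of samples per class in each subset
print("train:", np.bincount(y_train, minlength=3))
print("eval: ", np.bincount(y_eval, minlength=3))
```

If the class proportions of the two subsets differ noticeably, `train_test_split` also accepts a `stratify` argument (e.g. `stratify=wine.target`) that keeps the class proportions similar in both subsets.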

### 3.3 k Nearest-Neighbor (kNN) Model Training and Evaluation

<img align="center" style="max-width: 700px; height: auto" src="hsg_knn.png">

(Courtesy: Intro to AI & ML lecture, Prof. Dr. Borth, University of St. Gallen)
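
As a reminder of the idea behind the classifier, here is a minimal from-scratch sketch of a kNN prediction: measure the distance from a query point to all training points, pick the k closest ones, and let them vote on the label. The function `knn_predict` and the toy data are made up for illustration; sklearn's `KNeighborsClassifier` is more efficient and may break ties differently:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote over the labels of the nearest neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# toy data (made up for illustration): two small clusters in 2D
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# a query close to the first cluster and one close to the second cluster
label_a = knn_predict(X_toy, y_toy, np.array([0.15, 0.15]), k=3)
label_b = knn_predict(X_toy, y_toy, np.array([0.95, 1.00]), k=3)
print(label_a, label_b)
```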

We recommend that you try the following exercises as part of the self-coding session:

**Exercise 1: Train and evaluate the prediction accuracy of the k=1,...,40 Nearest Neighbor models.**

> Write a Python loop that trains and evaluates the prediction accuracy of all k-Nearest Neighbor parameterizations ranging from k=1,...,40 using the **Manhattan** instead of the **Euclidean** distance. Collect and print the prediction accuracy of each model respectively and compare the results.
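
As a possible starting point, the sketch below trains and evaluates a single model with k=3 and the Manhattan distance, which `KNeighborsClassifier` selects via `metric='manhattan'`. It repeats the pre-processing steps so that it runs on its own; the shortened variable names are illustrative, and extending it to a loop over k=1,...,40 is left to you:

```python
from sklearn import datasets
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load, min-max normalize and split the wine dataset as above
wine = datasets.load_wine()
wine_data_scaled = MinMaxScaler().fit_transform(wine.data)
X_train, X_eval, y_train, y_eval = train_test_split(
    wine_data_scaled, wine.target, test_size=0.3, random_state=42
)

# init a single kNN model with k=3 and the Manhattan (L1) distance
knn = KNeighborsClassifier(n_neighbors=3, metric='manhattan')

# train the model on the training set
knn.fit(X_train, y_train)

# predict the evaluation set and determine the prediction accuracy
y_pred = knn.predict(X_eval)
accuracy = metrics.accuracy_score(y_eval, y_pred)
print(f"k=3, Manhattan distance, accuracy: {accuracy:.4f}")
```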

In [None]:
# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

**Exercise 2: Visualize the model prediction accuracy for the distinct values of k=1,...,40.**

> Plot the prediction accuracy collected for each model above. The plot should display the **distinct values of k on the x-axis** and the corresponding **model prediction accuracy on the y-axis**. What kind of behaviour in terms of prediction accuracy can be observed with increasing k?
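
One way to structure such a plot is sketched below. The helper `plot_accuracy_over_k` is a hypothetical function of our own, and the example only evaluates a handful of k values rather than the full range asked for in the exercise:

```python
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def plot_accuracy_over_k(k_values, accuracies):
    """Plot the prediction accuracy (y-axis) against the number of neighbors k (x-axis)."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(k_values, accuracies, marker='o')
    ax.set_xlabel('number of neighbors k')
    ax.set_ylabel('prediction accuracy')
    ax.set_title('kNN prediction accuracy for distinct values of k')
    return ax

# load, min-max normalize and split the wine dataset as above
wine = datasets.load_wine()
wine_data_scaled = MinMaxScaler().fit_transform(wine.data)
X_train, X_eval, y_train, y_eval = train_test_split(
    wine_data_scaled, wine.target, test_size=0.3, random_state=42
)

# evaluate only a few values of k for illustration
k_values = [1, 5, 15, 25]
accuracies = [
    KNeighborsClassifier(n_neighbors=k, metric='manhattan')
    .fit(X_train, y_train)
    .score(X_eval, y_eval)  # score() returns the mean accuracy
    for k in k_values
]

# plot the collected accuracies
ax = plot_accuracy_over_k(k_values, accuracies)
```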

In [None]:
# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

**Exercise 3: Train, evaluate and plot the prediction accuracy of the Nearest Neighbor models without feature scaling.**

> Similar to the exercises above, write a Python loop that trains and evaluates the prediction accuracy of all k-Nearest Neighbor parameterizations ranging from k=1,...,40 using the **original (non-scaled) wine dataset**. Collect and print the prediction accuracy of each model respectively and compare the results (similar to exercise 1). Plot the prediction accuracy collected for each model above. The plot should display the distinct values of k on the x-axis and the corresponding model prediction accuracy on the y-axis (similar to exercise 2). What do you observe when comparing the results obtained for the non-scaled features with the results obtained for the scaled features?
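
When interpreting your results, it may help to look at the value ranges of the unscaled features, since a distance-based classifier such as kNN is influenced most by the features with the largest numeric spread. The sketch below prints the range (maximum minus minimum) of each original feature; `feature_ranges` and `widest` are illustrative names:

```python
import numpy as np
from sklearn import datasets

# load the original (non-scaled) wine dataset
wine = datasets.load_wine()

# value range (max - min) of every unscaled feature
feature_ranges = np.ptp(wine.data, axis=0)

# print the value range of each feature
for name, value_range in zip(wine.feature_names, feature_ranges):
    print(f"{name:30s} range: {value_range:10.2f}")

# feature with the widest range, which contributes most to unscaled distances
widest = wine.feature_names[int(np.argmax(feature_ranges))]
print("widest range:", widest)
```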

In [None]:
# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************