<img align="center" style="max-width: 1000px" src="banner.png">

<img align="right" style="max-width: 200px; height: auto" src="hsg_logo.png">

##  Lab 03 - "Supervised Machine Learning" Assignments

GSERM'21 course "Deep Learning: Fundamentals and Applications", University of St. Gallen

The lab environment of the "Deep Learning: Fundamentals and Applications" GSERM course at the University of St. Gallen (HSG) is based on Jupyter Notebooks (https://jupyter.org), which allow you to perform a variety of statistical evaluations and data analyses.

In the last lab, we learned how to implement, train, and apply our first **Machine Learning** models, namely the Gaussian **Naive-Bayes (NB)** and the **Support Vector Machine (SVM)** classifiers. In this lab, we aim to leverage that knowledge by applying it to a set of self-coding assignments.

As always, please don't hesitate to ask your questions during the lab, post them in our CANVAS (StudyNet) forum (https://learning.unisg.ch), or send us an email (using the course email).

## 1. Assignment Objectives:

Similar to today's lab session, after completing the self-coding assignments you should be able to:

> 1. Know how to set up a **notebook or "pipeline"** that solves a simple supervised classification task.
> 2. Recognize the **data elements** needed to train and evaluate a supervised machine learning classifier. 
> 3. Understand how a Gaussian **Naive-Bayes (NB)** classifier can be trained and evaluated.
> 4. Understand how a **Support Vector Machine (SVM)** classifier can be trained and evaluated.
> 5. Train and evaluate **machine learning models** using Python's `scikit-learn` library.
> 6. Understand how to **evaluate** and **interpret** the classification results.

Before we start let's watch a motivational video:

In [None]:
from IPython.display import YouTubeVideo
# OpenAI: "Solving Rubik's Cube with a Robot Hand"
# YouTubeVideo('x4O8pojMF0w', width=800, height=600)

## 2. Setup of the Analysis Environment

Similarly to the previous labs, we need to import a couple of Python libraries that allow for data analysis and data visualization. In this lab, we will use the `Pandas`, `NumPy`, `Scikit-Learn`, `Matplotlib` and `Seaborn` libraries. Let's import the libraries by executing the statements below:

In [None]:
# import the numpy, scipy and pandas data science library
import pandas as pd
import numpy as np
import scipy as sp
from scipy.stats import norm

# import sklearn data and data pre-processing libraries
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# import sklearn naive bayes classifier library
from sklearn.naive_bayes import GaussianNB

# import sklearn support vector classifier (svc) library
from sklearn.svm import SVC

# import sklearn classification evaluation library
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix

# import matplotlib data visualization library
import matplotlib.pyplot as plt
import seaborn as sns

Enable inline Jupyter notebook plotting:

In [None]:
%matplotlib inline

Ignore potential library warnings:

In [None]:
import warnings
warnings.filterwarnings('ignore')

Use the 'Seaborn' plotting style in all subsequent visualizations:

In [None]:
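# newer Matplotlib releases renamed the 'seaborn' style to 'seaborn-v0_8'
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')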

Set the random seed of all our experiments. This ensures reproducibility:

In [None]:
random_seed = 42

## 3. Data Download, Assessment and Pre-processing

### 3.1 Dataset Download and Data Assessment

The **Iris Dataset** is a classic and straightforward dataset often used as a "Hello World" example in multi-class classification. This dataset consists of measurements taken from three different types of iris flowers (referred to as **Classes**), namely the Iris Setosa, the Iris Versicolour and the Iris Virginica, and their respective measured petal and sepal lengths and widths (referred to as **Features**).

<img align="center" style="max-width: 700px; height: auto" src="iris_dataset.png">

(Source: http://www.lac.inpe.br/~rafael.santos/Docs/R/CAP394/WholeStory-Iris.html)

In total, the dataset consists of **150 samples** (50 samples per class), each described by **4 different measurements**. The individual measurements are listed below:

>- `Sepal length (cm)`
>- `Sepal width (cm)`
>- `Petal length (cm)`
>- `Petal width (cm)`

Further details of the dataset can be obtained from the following publication: *Fisher, R.A. "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).*

Let's load the dataset and conduct a preliminary data assessment: 

In [None]:
iris = datasets.load_iris()

Print and inspect the names of the four features contained in the dataset:

In [None]:
iris.feature_names

Determine and print the feature dimensionality of the dataset:

In [None]:
iris.data.shape

Determine and print the class label dimensionality of the dataset:

In [None]:
iris.target.shape

Print and inspect the names of the three classes contained in the dataset:

In [None]:
iris.target_names

Let's briefly envision how the feature information of the dataset is collected and presented in the data:

<img align="center" style="max-width: 900px; height: auto" src="feature_collection.png">

Let's inspect the top five feature rows of the Iris Dataset:

In [None]:
pd.DataFrame(iris.data, columns=iris.feature_names).head(5)

Let's also inspect the top five class labels of the Iris Dataset:

In [None]:
pd.DataFrame(iris.target, columns=["class"]).head(5)

Let's now conduct a more in-depth data assessment. To this end, we plot the feature distributions of the Iris dataset according to their respective class memberships as well as the features' pairwise relationships.

Please note that we use Python's **Seaborn** library to create such a plot, referred to as a **Pairplot**. Seaborn is a powerful data visualization library based on Matplotlib. It provides a great interface for drawing informative statistical graphics (https://seaborn.pydata.org).

In [None]:
# load the dataset also available in seaborn (downloaded on first use)
iris_plot = sns.load_dataset("iris")

# plot a pairplot of the distinct feature distributions
# (pairplot creates its own figure, so no separate plt.figure() call is needed)
sns.pairplot(iris_plot, diag_kind='hist', hue='species');

It can be observed from the created Pairplot that most of the feature measurements corresponding to the flower class "setosa" exhibit a clear **linear separability** from the feature measurements of the remaining flower classes. In contrast, the measurements of the flower classes "versicolor" and "virginica" are commingled and exhibit only a **non-linear separability** across the measured feature distributions of the Iris Dataset.
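
The separability observation can also be probed numerically. The following is a minimal sketch and not part of the original lab: it fits a linear-kernel `SVC` to two binary problems (setosa vs. the rest, and versicolor vs. virginica) and reports the accuracy on the very data it was fitted on. The helper name `linear_training_accuracy` and the choice of `C=1.0` are assumptions made for illustration. A training accuracy of 1.0 indicates that the linear model separated the two groups on these data.

```python
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

def linear_training_accuracy(features, binary_labels, C=1.0):
    """Fit a linear SVM and return its accuracy on the data it was fitted on."""
    model = SVC(kernel='linear', C=C)
    model.fit(features, binary_labels)
    return model.score(features, binary_labels)

# setosa (class 0) vs. the two remaining classes
acc_setosa_vs_rest = linear_training_accuracy(X, (y == 0).astype(int))

# versicolor (class 1) vs. virginica (class 2), setosa samples removed
mask = y != 0
acc_versicolor_vs_virginica = linear_training_accuracy(
    X[mask], (y[mask] == 2).astype(int))

print('setosa vs. rest:', acc_setosa_vs_rest)
print('versicolor vs. virginica:', acc_versicolor_vs_virginica)
```

Note that this check only measures how well a linear boundary fits the available samples; it says nothing about generalization to unseen data.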

### 3.2 Dataset Pre-processing

To understand and evaluate the performance of any trained **supervised machine learning** model, it is good practice to divide the dataset into a **training set** (the fraction of data records solely used for training purposes) and an **evaluation set** (the fraction of data records solely used for evaluation purposes). Please note that the **evaluation set** will never be shown to the model as part of the training process.

<img align="center" style="max-width: 500px; height: auto" src="train_eval_dataset.png">

We set the fraction of evaluation records to **30%** of the original dataset:

In [None]:
eval_fraction = 0.3

Randomly split the dataset into training set and evaluation set using sklearn's `train_test_split` function:

In [None]:
# 70% training and 30% evaluation
x_train, x_eval, y_train, y_eval = train_test_split(iris.data, iris.target, test_size=eval_fraction, random_state=random_seed, stratify=None)

Evaluate the dimensionality of the training dataset $x^{train}$:

In [None]:
x_train.shape, y_train.shape

Evaluate the dimensionality of the evaluation dataset $x^{eval}$:

In [None]:
x_eval.shape, y_eval.shape
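
Because the split is drawn with `stratify=None`, the class proportions in the two subsets are not guaranteed to match those of the full dataset. The following sketch (an illustration, not part of the original lab) counts the samples per class in each subset and contrasts them with a stratified split obtained by passing `stratify=iris.target`. The variable names carrying the `_s` suffix are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# same split settings as used in this lab (no stratification)
x_train, x_eval, y_train, y_eval = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=None)

# stratified variant: class proportions are preserved in both subsets
x_train_s, x_eval_s, y_train_s, y_eval_s = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42,
    stratify=iris.target)

# number of samples per class (setosa, versicolor, virginica)
print('train, unstratified:', np.bincount(y_train, minlength=3))
print('eval,  unstratified:', np.bincount(y_eval, minlength=3))
print('train, stratified:  ', np.bincount(y_train_s, minlength=3))
print('eval,  stratified:  ', np.bincount(y_eval_s, minlength=3))
```

Stratification becomes more relevant the smaller the training fraction gets, since a small random sample can easily under-represent one of the classes.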

## 4. Gaussian "Naive-Bayes" (NB) Classification Assignments

We recommend that you try the following exercises as part of the lab:

**1. Train and evaluate the prediction accuracy of different train- vs. eval-data ratios.**

> Change the ratio of training data vs. evaluation data to 30%/70% (currently 70%/30%), fit your model and calculate the new classification accuracy. Subsequently, repeat the experiment a second time using a 10%/90% fraction of training data/evaluation data. What can be observed in both experiments in terms of classification accuracy? 

In [None]:
# ***************************************************
# INSERT YOUR CODE HERE
# ***************************************************
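
As a possible starting point for this exercise, the sketch below trains a Gaussian NB classifier for several evaluation fractions and collects the resulting accuracies. It is only one way to structure the experiment: the helper name `nb_accuracy` and the dictionary `accuracies` are hypothetical, and the seed of 42 is assumed to match the `random_seed` set earlier in this notebook.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
random_seed = 42

def nb_accuracy(eval_fraction):
    """Train a Gaussian NB on (1 - eval_fraction) of the data, score on the rest."""
    x_train, x_eval, y_train, y_eval = train_test_split(
        iris.data, iris.target, test_size=eval_fraction,
        random_state=random_seed)
    model = GaussianNB()
    model.fit(x_train, y_train)
    return accuracy_score(y_eval, model.predict(x_eval))

# 70%/30% (current setup), 30%/70% and 10%/90% training/evaluation ratios
accuracies = {fraction: nb_accuracy(fraction) for fraction in (0.3, 0.7, 0.9)}

for fraction, accuracy in accuracies.items():
    print(f'evaluation fraction {fraction:.0%}: accuracy {accuracy:.4f}')
```

Since each accuracy depends on a single random split, repeating the experiment with several seeds gives a more reliable picture of the trend.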

**2. Calculate the true-positive as well as the false-positive rate of the Iris classes versicolor and virginica.**

> Calculate the true-positive rate as well as false-positive rate of (1) the experiment exhibiting a 30%/70% ratio of training data vs. evaluation data and (2) the experiment exhibiting a 10%/90% ratio of training data vs. evaluation data.

In [None]:
# ***************************************************
# INSERT YOUR CODE HERE
# ***************************************************
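
One way to approach this exercise is to derive both rates from the confusion matrix in a one-vs-rest fashion, i.e. treating one class as "positive" and the other two as "negative". The sketch below is an assumption about how the task could be set up, not a reference solution; the helper name `per_class_rates` is hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()

def per_class_rates(y_true, y_pred, n_classes=3):
    """Return one-vs-rest true-positive and false-positive rates per class."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp      # class samples predicted as another class
    fp = cm.sum(axis=0) - tp      # other samples predicted as this class
    tn = cm.sum() - tp - fn - fp  # everything else
    return tp / (tp + fn), fp / (fp + tn)

# 30%/70% and 10%/90% training/evaluation ratios
for eval_fraction in (0.7, 0.9):
    x_train, x_eval, y_train, y_eval = train_test_split(
        iris.data, iris.target, test_size=eval_fraction, random_state=42)
    y_pred = GaussianNB().fit(x_train, y_train).predict(x_eval)
    tpr, fpr = per_class_rates(y_eval, y_pred)
    for label in (1, 2):  # 1 = versicolor, 2 = virginica
        print(f'eval {eval_fraction:.0%} {iris.target_names[label]}: '
              f'TPR {tpr[label]:.3f}, FPR {fpr[label]:.3f}')
```

Here the true-positive rate is TP / (TP + FN) and the false-positive rate is FP / (FP + TN), each computed per class.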

## 5. Support Vector Machine (SVM) Classification Assignments

We recommend that you try the following exercises as part of the lab:

**1. Train and evaluate the prediction accuracy of SVM models trained with different hyperparameters.**

> Change the kernel function $\phi$ of the SVM to a polynomial kernel, fit your model and calculate the new classification accuracy on the Iris dataset. Subsequently, repeat similar experiments with different SVM hyperparameter setups by changing the value of $C$, $\gamma$ and the kernel function $\phi$. What pattern can be observed across the distinct hyperparameter setups in terms of classification accuracy?

In [None]:
# ***************************************************
# INSERT YOUR CODE HERE
# ***************************************************
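
A possible way to organize these experiments is to loop over a list of hyperparameter setups and record the evaluation accuracy of each. The sketch below assumes the same 70%/30% split and seed as used earlier in this notebook; the concrete values in `setups` are hypothetical and chosen only for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
x_train, x_eval, y_train, y_eval = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# hypothetical hyperparameter setups, chosen for illustration only
setups = [
    {'kernel': 'linear', 'C': 1.0},
    {'kernel': 'poly', 'C': 1.0, 'degree': 3, 'gamma': 'scale'},
    {'kernel': 'rbf', 'C': 0.1, 'gamma': 0.01},
    {'kernel': 'rbf', 'C': 10.0, 'gamma': 1.0},
]

results = []
for params in setups:
    model = SVC(random_state=42, **params)
    model.fit(x_train, y_train)
    # mean accuracy on the held-out evaluation set
    results.append((params, model.score(x_eval, y_eval)))

for params, accuracy in results:
    print(f'{params}: accuracy {accuracy:.4f}')
```

Keeping the split fixed while varying one hyperparameter at a time makes it easier to attribute a change in accuracy to that hyperparameter.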

**2. Train and evaluate the prediction accuracy of SVM models using different or additional features.**

> Fix the hyperparameters of the SVM and evaluate the classification accuracy on the FashionMNIST dataset using different features. For example, evaluate the prediction accuracy that can be achieved based on a set of Scale-Invariant Feature Transform (SIFT) features, or based on the combination of HOG and SIFT features. Will the consideration of additional features improve your classification accuracy?

For more information on the FashionMNIST dataset, visit Zalando Research's [GitHub page](https://github.com/zalandoresearch/fashion-mnist).

In [None]:
# ***************************************************
# INSERT YOUR CODE HERE
# ***************************************************