<img align="center" style="max-width: 1000px" src="banner.png">

<img align="right" style="max-width: 200px; height: auto" src="hsg_logo.png">

## Lab 03 - "Supervised Machine Learning: Naive Bayes" Assignments

EMBA 58/59 - W8/3 - "AI Coding for Executives", University of St. Gallen

In the last lab, we saw an application of **supervised machine learning** by using the **Gaussian Naive-Bayes (GNB) classifier** to classify features derived from real-world **Iris flowers**. You learned how to train a model and how to evaluate and interpret its results. In this lab, we build on that knowledge by applying it to a set of self-coding assignments. But before we do so, let's start with a motivational video by OpenAI:

In [None]:
from IPython.display import YouTubeVideo
# OpenAI: "Solving Rubik's Cube with a Robot Hand"
YouTubeVideo('x4O8pojMF0w', width=1000, height=500)

As always, please don't hesitate to ask your questions during the lab, post them in our CANVAS (StudyNet) forum (https://learning.unisg.ch), or send us an email (using the course email).

## 1. Assignment Objectives:

Similar to today's lab session, after completing the self-coding assignments you should be able to:

> 1. Know how to set up a **notebook or "pipeline"** that solves a simple supervised classification task.
> 2. Recognize the **data elements** needed to train and evaluate a supervised machine learning classifier. 
> 3. Understand how a generative Gaussian **Naive-Bayes (NB)** classifier can be trained and evaluated.
> 4. Know how to use Python's sklearn library to **train** and **evaluate** arbitrary classifiers.
> 5. Understand how to **evaluate** and **interpret** the classification results.

## 2. Setup of the Jupyter Notebook Environment

As in the previous labs, we need to import a couple of Python libraries that allow for data analysis and data visualization. In this lab, we will use the `Pandas`, `Numpy`, `Scikit-Learn`, `Matplotlib` and `Seaborn` libraries. Let's import the libraries by executing the statements below:

In [None]:
# import the pandas and numpy data science libraries
import pandas as pd
import numpy as np

# import sklearn data and data pre-processing libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split

# import sklearn Gaussian naive Bayes classifier
from sklearn.naive_bayes import GaussianNB

# import sklearn classification evaluation library
from sklearn import metrics
from sklearn.metrics import confusion_matrix 

# import matplotlib data visualization library
import matplotlib.pyplot as plt
import seaborn as sns

Enable inline Jupyter notebook plotting:

In [None]:
%matplotlib inline

Use the `Seaborn` plotting style in all subsequent visualizations:

In [None]:
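# the 'seaborn' style was renamed to 'seaborn-v0_8' in Matplotlib 3.6; use whichever name is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')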

## 3. Gaussian Naive-Bayes (NB) Classification Assignments

### 3.1 Iris Dataset Download and Pre-Processing

The **Iris Dataset** is a classic and straightforward dataset often used as a "Hello World" example in multi-class classification. This dataset consists of measurements taken from three different types of iris flowers (referred to as **Classes**), namely Iris Setosa, Iris Versicolor and Iris Virginica, and their respective measured petal and sepal lengths and widths (referred to as **Features**).

<img align="center" style="max-width: 700px; height: auto" src="iris_dataset.png">

(Source: http://www.lac.inpe.br/~rafael.santos/Docs/R/CAP394/WholeStory-Iris.html)

In total, the dataset consists of **150 samples** (50 samples per class), with **4 different measurements** taken for each sample. The individual measurements are:

>- `Sepal length (cm)`
>- `Sepal width (cm)`
>- `Petal length (cm)`
>- `Petal width (cm)`

Further details of the dataset can be obtained from the following publication: *Fisher, R.A. "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).*

Let's load the dataset and conduct a preliminary data assessment: 

In [None]:
iris = datasets.load_iris()

Print and inspect the names of the four features contained in the dataset:

In [None]:
iris.feature_names

Determine and print the feature dimensionality of the dataset:

In [None]:
iris.data.shape

Determine and print the class label dimensionality of the dataset:

In [None]:
iris.target.shape

Print and inspect the names of the three classes contained in the dataset:

In [None]:
iris.target_names
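
The statement that each of the three classes contributes 50 samples can be checked directly. The following is a minimal sketch, assuming the integer labels in `iris.target` index into `iris.target_names` (as they do in `Scikit-Learn`'s Iris loader); the variable name `class_counts` is chosen here for illustration only.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# count how many samples carry each integer class label (0, 1, 2)
class_counts = np.bincount(iris.target)

# pair each class name with its sample count
for name, count in zip(iris.target_names, class_counts):
    print(f"{name}: {count} samples")
```

A balanced dataset like this one makes plain classification accuracy a reasonable first evaluation metric.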

Let's briefly envision how the feature information of the dataset is collected and presented in the data:

<img align="center" style="max-width: 900px; height: auto" src="feature_collection.png">

Let's inspect the top five feature rows of the Iris Dataset:

In [None]:
pd.DataFrame(iris.data, columns=iris.feature_names).head(5)

Let's also inspect the top five class labels of the Iris Dataset:

In [None]:
pd.DataFrame(iris.target, columns=["class"]).head(5)

Let's now conduct a more in-depth data assessment. To do so, we plot the feature distributions of the Iris dataset according to their respective class memberships as well as the features' pairwise relationships.

Please note that we use Python's **Seaborn** library to create such a plot, referred to as a **Pairplot**. Seaborn is a powerful data visualization library built on top of Matplotlib. It provides a high-level interface for drawing informative statistical graphics (https://seaborn.pydata.org).

In [None]:
# load the dataset also available in seaborn (downloads the data on first use)
iris_plot = sns.load_dataset("iris")

# plot a pairplot of the distinct feature distributions;
# pairplot creates its own figure, so its size is set via height (inches per panel)
sns.pairplot(iris_plot, diag_kind='hist', hue='species', height=2.5);
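
As a numeric complement to the pairplot, the per-class feature means can be tabulated. This is a minimal sketch, assuming the `Scikit-Learn` copy of the dataset is used (so no download is needed); the names `iris_frame` and `class_means` are illustrative and not part of the lab.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# combine the four features and the class names into a single table
iris_frame = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_frame["species"] = iris.target_names[iris.target]

# average value of each feature within each class
class_means = iris_frame.groupby("species").mean()
print(class_means.round(2))
```

Features whose class means lie far apart relative to their spread are the ones a classifier can separate most easily.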

### 3.2 Gaussian Naive Bayes Model Training and Evaluation

<img align="center" style="max-width: 400px; height: auto" src="bayes_theorem.png">
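
To make the link between Bayes' theorem and the classifier concrete, the sketch below computes Gaussian Naive Bayes predictions by hand: a class prior, plus one independent Gaussian per feature and class. It is an illustration under stated assumptions, not the library's implementation: the helper `log_posterior_scores` is hypothetical, no variance smoothing is applied, and the model is fitted and applied to the same data purely to compare the two sets of predictions.

```python
import numpy as np
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()
X, y = iris.data, iris.target
classes = np.unique(y)

# per-class prior, feature means and feature variances estimated from the data
priors = np.array([np.mean(y == c) for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
variances = np.array([X[y == c].var(axis=0) for c in classes])

def log_posterior_scores(samples):
    """Unnormalised log posterior per class: log prior + sum of log Gaussian densities."""
    scores = []
    for prior, mean, var in zip(priors, means, variances):
        log_density = -0.5 * (np.log(2.0 * np.pi * var) + (samples - mean) ** 2 / var)
        scores.append(np.log(prior) + log_density.sum(axis=1))
    return np.column_stack(scores)

# predict the class with the highest score for every sample
manual_predictions = classes[np.argmax(log_posterior_scores(X), axis=1)]

# reference: scikit-learn's GaussianNB fitted on the same data
sklearn_predictions = GaussianNB().fit(X, y).predict(X)
agreement = np.mean(manual_predictions == sklearn_predictions)
print(f"agreement with GaussianNB: {agreement:.3f}")
```

Working in log space avoids multiplying many small probabilities, which is why the per-feature densities are summed rather than multiplied.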

We recommend that you try the following exercises as part of the self-coding session:

**Exercise 1: Train and evaluate the prediction accuracy of different train- vs. eval-data ratios.**

> Change the ratio of the Iris training data vs. test data split to **30%/70%** (currently 70%/30%), fit a Gaussian Naive Bayes model using the `Scikit-Learn` (https://scikit-learn.org) library and compute the classification accuracy of the trained model on the held-out test data. 
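
As a starting point, the 70%/30% baseline referred to above might be sketched as follows. The seed `random_state=42` and the variable names are assumptions for illustration and may differ from the guided lab; adapting `test_size` is left as the exercise.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()

# 70% training / 30% evaluation split; random_state=42 is an assumed seed
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=42
)

# fit the Gaussian Naive Bayes model on the training data only
gnb = GaussianNB()
gnb.fit(x_train, y_train)

# predict the held-out samples and compare against their true labels
y_pred = gnb.predict(x_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"test accuracy: {accuracy:.3f}")
```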

In [None]:
# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

**Exercise 2: Train and evaluate the prediction accuracy of different train- vs. eval-data ratios.**

> Now, repeat the experiment a second time using a **10%/90%** Iris train vs. test data split fraction. Next, fit again a Gaussian Naive Bayes model using the `Scikit-Learn` library and also compute the classification accuracy of the trained model on the held-out test data. What can be observed when comparing the results of exercise 1 and 2 in terms of classification accuracy? 
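
When comparing split ratios, keep in mind that a single random split can be lucky or unlucky, especially with very few training samples. One way to obtain a steadier comparison is to average over several seeds, as in this sketch. The helper `mean_test_accuracy`, the choice of 20 seeds and the use of `stratify` (which keeps the class proportions equal in both parts) are assumptions made here, not requirements of the exercise.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()

def mean_test_accuracy(train_fraction, seeds=range(20)):
    """Mean and standard deviation of the test accuracy over several random splits."""
    accuracies = []
    for seed in seeds:
        x_train, x_test, y_train, y_test = train_test_split(
            iris.data, iris.target,
            train_size=train_fraction,
            stratify=iris.target,   # assumed: keep class proportions in both parts
            random_state=seed,
        )
        model = GaussianNB().fit(x_train, y_train)
        accuracies.append(metrics.accuracy_score(y_test, model.predict(x_test)))
    return np.mean(accuracies), np.std(accuracies)

# compare the three training fractions discussed in the lab and the exercises
for fraction in (0.7, 0.3, 0.1):
    mean_acc, std_acc = mean_test_accuracy(fraction)
    print(f"train fraction {fraction:.0%}: accuracy {mean_acc:.3f} +/- {std_acc:.3f}")
```

The standard deviation across seeds indicates how much of an observed difference between two ratios could be due to the split alone.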

In [None]:
# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

**Exercise 3: Calculate the true-positive as well as the false-positive rate for the Iris versicolor vs. virginica classes.**

> Compute and visualize the confusion matrices of (1) the experiment with a 30%/70% ratio of training data vs. evaluation data and (2) the experiment with a 10%/90% ratio of training data vs. evaluation data. As in the guided lab session, you might want to use the `confusion_matrix` function that comes with the `Scikit-Learn` library.

<img align="center" style="max-width: 300px; height: auto" src="https://github.com/GitiHubi/courseAIML/blob/master/lab_03/confusion_matrix.png?raw=1">

(Source: https://en.wikipedia.org/wiki/Confusion_matrix)
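
The true-positive and false-positive rates can be read off a confusion matrix class by class (one class vs. the rest). The sketch below shows the arithmetic on a small set of made-up labels, not on Iris results; `y_true` and `y_pred` are hypothetical. In `Scikit-Learn`'s convention, rows are true classes and columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical true and predicted labels for three classes (0, 1, 2)
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2, 2, 1, 2]

# rows: true class, columns: predicted class
cm = confusion_matrix(y_true, y_pred)

# one-vs-rest counts for every class
true_pos = np.diag(cm)
false_neg = cm.sum(axis=1) - true_pos   # samples of the class predicted as another class
false_pos = cm.sum(axis=0) - true_pos   # samples of other classes predicted as the class
true_neg = cm.sum() - true_pos - false_neg - false_pos

tpr = true_pos / (true_pos + false_neg)  # true-positive rate (recall)
fpr = false_pos / (false_pos + true_neg) # false-positive rate

print(cm)
print("TPR per class:", tpr.round(3))    # [1.    0.667 0.75 ]
print("FPR per class:", fpr.round(3))    # [0.    0.143 0.167]
```

For the visualization, a call such as `sns.heatmap(cm, annot=True)` renders the matrix as an annotated heat map.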

In [None]:
# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************

In [None]:
# ***************************************************
# INSERT YOUR SOLUTION/CODE HERE
# ***************************************************