# SIN INTELLIGENT SYSTEMS - Lab assignment 2

## Supervised learning: Perceptron algorithm and logistic regression

In this second lab assignment on machine learning, we will develop pattern recognition systems based on the perceptron algorithm and logistic regression, and apply them to several classification tasks, including flower species classification and handwritten digit recognition.

Lab assignment 2 is divided into four lab sessions, followed by an additional session for lab exam 2.

- Session 1: Study of the Iris and Digits standard datasets. Creation of the MyDigits dataset.
- Session 2: Application of the Perceptron algorithm to several classification tasks.
- Session 3: Application of Logistic Regression to several classification tasks.
- Session 4: Additional exercises to prepare for lab exam 2.
- Session 5: Lab exam 2.

<p style="page-break-after:always;"></p>

# **Session 1**

In this first session we will familiarize ourselves with the working environment and some datasets, starting with *Iris* and continuing with *Digits*. Finally, you will create your own *MyDigits* dataset.

You may need to run this code if this is the first time you are running this notebook.

In [None]:
!pip install seaborn scikit-learn pandas pillow gradio matplotlib

<p style="page-break-after:always;"></p>

# The Iris dataset

The Iris dataset has been widely used to introduce basic machine learning concepts and methods. It consists of $N=150$ samples, $50$ for each of $C=3$ classes, represented by vectors of $D=4$ homogeneous real features. One of the classes is linearly separable from the other two, but those two are not linearly separable from each other. Although today it is considered a "toy" dataset, it remains a convenient first example.

First we import some standard and sklearn libraries:

In [None]:
import warnings; warnings.filterwarnings("ignore");
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

Reading the Iris dataset:

In [None]:
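iris = load_iris()
print(dir(iris))         # attributes available in the returned object
X = iris.data            # feature matrix, one row per sample
y = iris.target          # integer class labels
fn = iris.feature_names  # names of the features
cn = iris.target_names   # names of the classes
print(iris.DESCR)        # full text description of the dataset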

<p style="page-break-after:always;"></p>

We convert the corpus into a pandas dataframe to facilitate its description:

In [None]:
data = pd.DataFrame(data=X, columns=fn)
data['species'] = pd.Series(iris.target_names[y], dtype='category')
data

<p style="page-break-after:always;"></p>

Let's look at some basic statistics:

In [None]:
data.describe()

We check that we have $50$ samples of each class:

In [None]:
data.groupby('species', observed=False).size()

<p style="page-break-after:always;"></p>

Since we have few features, it's a good idea to make a scatter matrix plot:

In [None]:
sns.pairplot(data, hue="species", height = 1.4, palette = 'colorblind');

**Question:** Which class is linearly separable from the other two?
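
One way to check an answer to the question above empirically is to train, for each class, a binary linear classifier that separates that class from the other two and to look at its accuracy on the training data itself. The sketch below uses scikit-learn's `Perceptron` for this; the helper name `one_vs_rest_training_accuracy` and the values chosen for `max_iter` and `random_state` are assumptions made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

def one_vs_rest_training_accuracy(X, y, class_names, max_iter=1000, seed=0):
    """Train one binary Perceptron per class (that class vs. the rest)
    and return its accuracy on the same data it was trained on."""
    accuracies = {}
    for c, name in enumerate(class_names):
        y_binary = (y == c).astype(int)  # 1 for class c, 0 for the rest
        # tol=None disables early stopping, so all max_iter epochs are run
        clf = Perceptron(max_iter=max_iter, tol=None, random_state=seed)
        clf.fit(X, y_binary)
        accuracies[str(name)] = clf.score(X, y_binary)
    return accuracies

iris = load_iris()
acc = one_vs_rest_training_accuracy(iris.data, iris.target, iris.target_names)
for name, a in acc.items():
    print(f"{name:>12s} vs rest: training accuracy = {a:.3f}")
```

A training accuracy of 1.0 means that a separating hyperplane was found for that class. A lower value after many epochs suggests, but does not prove, that the class is not linearly separable from the rest.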

# The Digits dataset

Like Iris, Digits can be considered a "toy" dataset. However, compared to Iris, Digits represents a jump in complexity due to the greater number of classes, $C=10$, samples, $N=1797$, and feature vector dimension, $D=64$. In addition, Digits addresses one of the main perceptual tasks of machine learning: optical character recognition (OCR) and, more specifically, handwritten digit recognition. Although handwritten digit recognition has been considered a "solved" task since the 1990s, image classification in general remains a complex task of great academic and commercial interest. The relative simplicity of Digits therefore makes it a convenient introductory task for image classification.

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits

In [None]:
digits = load_digits()
print(digits.DESCR)
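
The figures quoted above ($N=1797$, $D=64$, $C=10$) can be checked directly on the loaded data. The following minimal sketch also shows that each feature vector is simply an $8\times 8$ image flattened row by row; the variable names `X_digits` and `y_digits` are chosen here only to avoid overwriting variables used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X_digits = digits.data    # shape (N, D): one flattened 8x8 image per row
y_digits = digits.target  # shape (N,): integer class labels

N, D = X_digits.shape
classes, counts = np.unique(y_digits, return_counts=True)
print(f"N = {N}, D = {D}, C = {len(classes)}")
print("samples per class:", dict(zip(classes.tolist(), counts.tolist())))

# Reshaping a feature vector recovers the corresponding 8x8 image
same = np.array_equal(X_digits[0].reshape(8, 8), digits.images[0])
print("row 0 reshaped equals image 0:", same)
```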

<p style="page-break-after:always;"></p>

Let's see some images:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
# Flatten each 8x8 image into a 64-dimensional vector and scale pixel values to [0, 1]
X = digits.images.astype(np.float16).reshape(-1, 8*8); X /= np.max(X)
# Class labels as a column vector of unsigned integers
y = digits.target.astype(np.uint).reshape(-1, 1)
nrows = 6; ncols = 15
_, axs = plt.subplots(nrows=nrows, ncols=ncols, figsize=(12, 12*nrows/ncols), constrained_layout=True)
# Show the first nrows*ncols images, each titled with its class label
for ax, xn, yn in zip(axs.flat, X, y):
  ax.set_axis_off(); image = xn.reshape(8, 8); ax.set_title(yn[0])
  ax.imshow(image, cmap=plt.cm.gray_r, interpolation="none")

<p style="page-break-after:always;"></p>

# Create MyDigits dataset

The following simple application allows you to create your own Digits dataset. When you run this application, it shows a basic graphical interface containing a panel on which you can draw your own handwritten digits.

Before you can draw a digit, you need to click on the *pen* located on the vertical toolbar on the left. Then you can draw on the panel. If you need to erase what you have drawn, just click on *Undo* in the top menu.

The usual way to build your Digits dataset is to draw a digit, enter its class label and click on "Save image". You should collect at least 10 samples for each digit (0 to 9), that is, at least 100 samples in total. Once you have finished, save your dataset by providing a filename for the images and a filename for the labels and clicking on "Save dataset".

In addition to saving data, it is also possible to load data using the "Load dataset" button. The file names for the images and labels must be provided using the same input text boxes used in saving. **Warning!** Loading data deletes the current data in memory.

Finally, there is also a "Merge dataset" button for merging two datasets. It works exactly the same way as "Load dataset", except that instead of deleting the current data in memory, it merges it with the loaded data.
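
The difference between loading and merging can be pictured with plain NumPy arrays. This is only a sketch of the idea, not the code of `DigitCaptureGradioApp`, whose internals are not shown here. The helper names `load_into` and `merge_into`, the toy values and the array shapes (flattened $8\times 8$ images) are assumptions.

```python
import numpy as np

def load_into(current, loaded):
    """'Load' semantics: the data in memory is discarded and replaced."""
    X_new, y_new = loaded
    return X_new.copy(), y_new.copy()

def merge_into(current, loaded):
    """'Merge' semantics: loaded samples are appended to those in memory."""
    X_cur, y_cur = current
    X_new, y_new = loaded
    return np.concatenate([X_cur, X_new]), np.concatenate([y_cur, y_new])

# Toy data (made-up values): 3 samples in memory, 2 samples on disk
in_memory = (np.zeros((3, 64)), np.array([0, 1, 2]))
on_disk = (np.ones((2, 64)), np.array([3, 4]))

X_l, y_l = load_into(in_memory, on_disk)
X_m, y_m = merge_into(in_memory, on_disk)
print(X_l.shape, y_l)  # (2, 64) [3 4]
print(X_m.shape, y_m)  # (5, 64) [0 1 2 3 4]
```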

In [None]:
# Execute this cell only when running in Google Colab 
# You need to upload DigitCaptureGradioApp.py
from google.colab import files
uploaded = files.upload()

In [None]:
import warnings; warnings.filterwarnings("ignore");
import numpy as np
import DigitCaptureGradioApp as dca

demo = dca.DigitCaptureApp()
demo.launch()

<p style="page-break-after:always;"></p>

In [None]:
# Execute this cell only when running in Google Colab 
# You need to download your dataset: images.npy labels.npy
files.download('images.npy')
files.download('labels.npy')

**Check the images and labels of your dataset:**

In [None]:
# Load the images and labels saved by the capture application
with open('images.npy', 'rb') as fd:
  X = np.load(fd)

with open('labels.npy', 'rb') as fd:
  y = np.load(fd).astype(int)

In [None]:
import matplotlib.pyplot as plt
nrows = 10; ncols = 10
_, axs = plt.subplots(nrows=nrows, ncols=ncols, figsize=(7, 7*nrows/ncols), constrained_layout=True)
for ax, xn, yn in zip(axs.flat, X, y):
  ax.set_axis_off(); image = xn.reshape(8, 8); ax.set_title(yn)
  ax.imshow(image, cmap=plt.cm.gray_r, interpolation="none")
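
Besides inspecting the images visually, it may help to verify that the dataset contains at least 10 samples of each digit, as requested earlier. The sketch below counts the samples per class with `np.bincount`. The helper names `samples_per_digit` and `missing_samples` are hypothetical, and the toy labels stand in for the `y` array loaded from `labels.npy`.

```python
import numpy as np

def samples_per_digit(labels, n_classes=10):
    """Return an array whose d-th entry is the number of samples of digit d."""
    labels = np.asarray(labels, dtype=int).ravel()
    return np.bincount(labels, minlength=n_classes)

def missing_samples(labels, min_per_class=10, n_classes=10):
    """Return {digit: samples still needed} for under-represented digits."""
    counts = samples_per_digit(labels, n_classes)
    return {d: int(min_per_class - c) for d, c in enumerate(counts) if c < min_per_class}

# Toy labels (made-up): ten 0s, seven 1s, twelve 2s and no other digits
toy_labels = np.array([0] * 10 + [1] * 7 + [2] * 12)
print(samples_per_digit(toy_labels))  # [10  7 12  0  0  0  0  0  0  0]
print(missing_samples(toy_labels))
```

With your own data, calling `missing_samples(y)` should return an empty dictionary once every digit has enough samples.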