# GTI771 - Apprentissage machine avancé

### Created: Thiago M. Paixão <br> Revised: Alessandro L. Koerich <br> Ver 1.0 <br> December 2020

## NB1 - Template Matching Dataset Simpsons

In this notebook, we address the classification of characters from the TV series "The Simpsons" using a naive template matching technique. The notebook is divided into four parts:

- Setup
- Train-test partitioning
- Template matching-based classification
- Performance evaluation

## Setup

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import sys
sys.path.append('../')

import os
import glob
import csv
import random
import numpy as np
import matplotlib.pyplot as plt

from collections import defaultdict

from skimage import io
from skimage.transform import resize

from utils import show, show_collection

In [None]:
# change this to the path where the dataset is located in your filesystem
# DATASET_PATH = '/mnt/data/datasets/Simpsons-Train-Valid' 
DATASET_PATH = 'Simpsons-Train-Valid' 

## Train-test partitioning

Here, we split the entire collection into a train set and a test set. Each set has the form

$$ \mathcal{X} = \{({\bf x}^t, r^t)\}_{t=1}^n, $$

where ${\bf x}^t$ denotes the $t$-th image and $r^t$ its respective character name (i.e., the class label). To accomplish this, we first create a dictionary ``map_character_filenames`` that maps characters to file names in the dataset.

In [None]:
# list all valid filenames
filenames = glob.glob(os.path.join(DATASET_PATH, 'Train', '*'))
filenames = [filename for filename in filenames if 'Thumbs.db' not in filename] # remove Thumbs.db

# create the mapping
map_character_filenames = defaultdict(list)
for filename in filenames:
    basename = os.path.splitext(os.path.basename(filename))[0]
    character = basename[: -3] # remove the ending digits
    map_character_filenames[character].append(filename)
                                    
# check how many samples (images) are available for each class (character)
# (a separate loop variable avoids overwriting the `filenames` list above)
for character, character_filenames in map_character_filenames.items():
    print('{} = {} samples'.format(character, len(character_filenames)))
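
The slice `basename[: -3]` assumes that every file name ends in exactly three digits. As a minimal sketch of a more tolerant alternative, the hypothetical helper `character_from_filename` below strips any run of trailing digits with a regular expression. The file names used in the example (`bart001.bmp`, `homer12.bmp`) are assumptions for illustration only; the actual naming scheme of the dataset may differ.

```python
import os
import re


def character_from_filename(filename):
    """Return the class label encoded in a file name (hypothetical helper).

    Assumes names of the form <character><digits>.<ext>, e.g. 'bart001.bmp'.
    """
    basename = os.path.splitext(os.path.basename(filename))[0]
    # drop every trailing digit, however many there are
    return re.sub(r'\d+$', '', basename)


print(character_from_filename('Train/bart001.bmp'))  # bart
print(character_from_filename('Train/homer12.bmp'))  # homer
```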


The test set is constructed by randomly selecting one exemplar of each class. The training set, in turn, comprises all the pairs $({\bf x}^t, r^t)$ of the dataset except those in the test set.

In [None]:
train_set = []
test_set = []
for character in map_character_filenames:
    # randomly select one sample for the test set
    filename_chosen = random.choice(map_character_filenames[character])
    image = io.imread(filename_chosen)
    test_set.append((image, character))
    
    # the rest of the samples are assigned to the train set
    for filename in map_character_filenames[character]:
        if filename != filename_chosen:
            image = io.imread(filename)
            train_set.append((image, character))

# show images
titles = [character for _, character in test_set]
images = [image for image, _ in test_set]
show_collection(images, titles, scale=0.5)
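
Because `random.choice` is not seeded, each run of the cell above produces a different test set, so the metrics computed later vary from run to run. The following is a minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for your experiments. `split_one_per_class` and the seed value are hypothetical and not part of the original notebook. The function works on file names only, so the images can still be loaded afterwards with `io.imread`.

```python
import random


def split_one_per_class(mapping, seed=0):
    """Hold out one file name per class; return (train, test) lists of
    (filename, character) pairs. Hypothetical helper for illustration."""
    rng = random.Random(seed)  # private generator: global random state is untouched
    train, test = [], []
    for character in sorted(mapping):  # fixed class order
        filenames = sorted(mapping[character])  # fixed file order
        chosen = rng.choice(filenames)
        test.append((chosen, character))
        train.extend((f, character) for f in filenames if f != chosen)
    return train, test


# toy mapping with made-up file names
toy = {'bart': ['bart001', 'bart002', 'bart003'],
       'homer': ['homer001', 'homer002']}
train, test = split_one_per_class(toy, seed=42)
print(len(train), len(test))  # 3 2
```

Calling the function twice with the same seed returns the same partition, which makes results comparable across runs.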

Since our template matching compares pixels directly, we must resize the images so that they all have the same dimensions (shape). For this example, we chose the shape ``(256, 256)``.

**Note:** *The ``resize`` function of the scikit-image library has a side effect of converting images to float in the range $[0,1]$.*

In [None]:
def resize_dataset(dataset, output_shape=(256, 256)):
    dataset_resized = []
    for image, character in dataset:
        image_resized = resize(image, output_shape)
        dataset_resized.append((image_resized, character))
    return dataset_resized

train_set_resized = resize_dataset(train_set, output_shape=(256, 256))
test_set_resized = resize_dataset(test_set, output_shape=(256, 256))

titles = [character for _, character in test_set_resized]
images = [image for image, _ in test_set_resized]
show_collection(images, titles, scale=0.5)

## Template matching-based classification

To proceed with the classification, we randomly select a query image ($Q$) from the test set and search for the most similar image ($S$) in the training set as follows:

$$S = {\bf x}^{t^*}, \quad t^* = \operatorname{argmin}_{t} \|Q - {\bf x}^t\|_2.$$

Ideally, the class/character of $S$ should match $Q$'s.
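
The code below compares squared distances rather than distances. This does not change the result: for any candidate image ${\bf x}$,

$$ \|Q - {\bf x}\|_2^2 = \sum_{p} (Q_p - x_p)^2, $$

where $p$ runs over all pixel (and channel) positions. Since the square root is monotonically increasing on non-negative values, the candidate that minimizes the squared distance also minimizes the distance itself.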

In [None]:
def classify(image_query, train_set):
    min_cost = float('inf')
    label_result = ''
    image_result = None
    
    for image_candidate, label_candidate in train_set:
        cost = ((image_query - image_candidate) ** 2).sum()
        if cost < min_cost:
            min_cost = cost
            label_result = label_candidate
            image_result = image_candidate
    return image_result, label_result
    
image_query, label_query = random.choice(test_set_resized)
image_result, label_result = classify(image_query, train_set_resized)

titles = ['query = {}'.format(label_query), 'result = {}'.format(label_result)]
images = [image_query, image_result]
show_collection(images, titles, scale=0.5)
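
The loop in `classify` visits one training image at a time. As a sketch of an equivalent vectorized formulation, assuming all images share the same shape and the whole training set fits in memory, the candidates can be stacked into a single NumPy array and compared in one pass. `classify_vectorized` is a hypothetical name, and the tiny arrays below stand in for real images.

```python
import numpy as np


def classify_vectorized(image_query, train_images, train_labels):
    """Nearest-template search over a stacked array of training images.

    train_images: array of shape (n, ...) holding n images of equal shape.
    train_labels: sequence of n labels.
    """
    diff = train_images - image_query  # broadcasts the query over axis 0
    # sum of squared differences per training image
    costs = (diff ** 2).reshape(len(train_images), -1).sum(axis=1)
    best = int(np.argmin(costs))  # first index with the minimal cost
    return train_images[best], train_labels[best]


# toy 2x2 grayscale "images" in [0, 1]
train_images = np.array([[[0.0, 0.0], [0.0, 0.0]],
                         [[1.0, 1.0], [1.0, 1.0]]])
train_labels = ['dark', 'bright']
query = np.full((2, 2), 0.9)

_, label = classify_vectorized(query, train_images, train_labels)
print(label)  # bright
```

With the notebook's data, the stacked array could be built with `np.stack([image for image, _ in train_set_resized])`, at the cost of holding every resized image in one contiguous block.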

## Performance evaluation

Compute some performance metrics on the test set.

In [None]:
# Evaluation metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

To use the mentioned ``sklearn`` metrics, we first encode the labels ('bart', 'homer', ...) to numeric values ranging from $0$ to $n_{classes} - 1$:

In [None]:
from sklearn import preprocessing

# building encoder
labels_train = [label for _, label in train_set_resized]
label_encoder = preprocessing.LabelEncoder()
label_encoder.fit(labels_train)

# encoding the labels of the test set
labels_test = [label for _, label in test_set_resized]
y_true = label_encoder.transform(labels_test)

print('True labels')
for label, y in zip(labels_test, y_true):
    print('{} -> {}'.format(label, y))
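
`LabelEncoder` assigns the integer codes according to the sorted order of the distinct labels seen during `fit`. The pure-Python sketch below mimics that mapping to make the encoding explicit. `fit_label_codes` is a hypothetical helper, and the label list is made up for illustration.

```python
def fit_label_codes(labels):
    """Map each distinct label to an integer in 0..n_classes-1,
    following the sorted order of the labels (hypothetical helper)."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}


codes = fit_label_codes(['homer', 'bart', 'lisa', 'bart'])
print(codes)  # {'bart': 0, 'homer': 1, 'lisa': 2}
print([codes[label] for label in ['lisa', 'bart']])  # [2, 0]
```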

Predicting the labels of the test set (remember that the test set has a single exemplar of each class):

In [None]:
labels = []
for image_query, _ in test_set_resized:
    _, label = classify(image_query, train_set_resized)
    labels.append(label)
y_pred = label_encoder.transform(labels)

print('Predicted labels')
for label, y in zip(labels, y_pred):
    print('{} -> {}'.format(label, y))

Now, we can compute the metrics:

In [None]:
acc = accuracy_score(y_true, y_pred)
print('Correct classification rate for the test dataset = {:.2f}%'.format(100 * acc))

In [None]:
confusion_matrix(y_true, y_pred)

In [None]:
report = classification_report(y_true, y_pred)
print(report)

Observe the warning above. It is caused by a division by zero: a class that is never predicted has no predicted samples, so its precision is undefined, and the F1 score is likewise undefined when precision and recall are simultaneously zero.
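
To see where the division by zero comes from, the sketch below computes per-class precision, recall and F1 by hand on made-up labels. `per_class_scores` is a hypothetical helper, not a scikit-learn function; it returns `0.0` wherever a ratio is undefined.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one class, with 0.0 for undefined ratios."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0  # class never predicted
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0     # class never present
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# one exemplar per class, as in the test set; class 2 is never predicted
y_true = [0, 1, 2]
y_pred = [0, 0, 1]
for label in (0, 1, 2):
    print(label, per_class_scores(y_true, y_pred, label))
```

Class 2 has no predicted samples, so `tp + fp` is zero and its precision cannot be computed; class 1 is predicted once but wrongly, so its precision and recall are both zero and the F1 ratio has a zero denominator.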

## Suggested activity

Although performance evaluation should be carried out on the test set, evaluating on the training set is a useful sanity check of the implementation. In our case, template matching-based classification is expected to fully recover the queries if they are from the training partition, since each query then has zero distance to itself.

Based on this fact, we suggest that you verify this yourself by running the above metrics on the training partition.
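
A minimal sketch of such a check on toy data, assuming the training images are pairwise distinct (two identical images with different labels would break the guarantee). `nearest_label` is a hypothetical stand-in for the notebook's `classify`, and the random arrays replace real images.

```python
import numpy as np


def nearest_label(image_query, train_set):
    """Label of the training image with the smallest sum of squared differences."""
    costs = [((image_query - image) ** 2).sum() for image, _ in train_set]
    return train_set[int(np.argmin(costs))][1]


rng = np.random.default_rng(0)
# six random 8x8 "images" with made-up labels
toy_train = [(rng.random((8, 8)), 'class_{}'.format(i % 3)) for i in range(6)]

# every query is itself a training image, so its cost to itself is exactly 0
predictions = [nearest_label(image, toy_train) for image, _ in toy_train]
truth = [label for _, label in toy_train]
accuracy = np.mean([p == t for p, t in zip(predictions, truth)])
print(accuracy)  # 1.0
```

An accuracy below 100% on the real training partition would point to a bug in the pipeline or to duplicated images with conflicting labels.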