# Machine Learning SoSe21 Practice Class

Dr. Timo Baumann, Dr. Özge Alaçam, Björn Sygo <br>
Email: baumann@informatik.uni-hamburg.de, alacam@informatik.uni-hamburg.de, 6sygo@informatik.uni-hamburg.de


## Exercise 3
**Description:** Implement gaussian discriminant analysis on the provided images <br>
**Deadline:** Saturday, 15. May 2021, 23:59 <br>
**Working together:** You can work in pairs or triples but no larger teams are allowed. <br>
&emsp;&emsp;&emsp; &emsp; &emsp; &emsp; &emsp; Please adhere to the honor code discussed in class. <br>
&emsp;&emsp;&emsp; &emsp; &emsp; &emsp; &emsp; All members of the team must get involved in understanding and coding the solution.

## Submission: 
**Christoph Brauer, Linus Geewe, Moritz Lahann**

*Also put high-level comments that should be read before looking at your code and results.*

### Goal

1. The goal of this exercise is to build a classifier for real-life data (images).
2. You will derive features of the images and then use GDA for classification, i.e., you compute the probability of each image being sampled from one of the classes $j$ (in our case $j \in \{\textrm{no_mask}, \textrm{mask}\}$). This probability can be calculated with <br>
$p(y=j|x)= \frac{p(x|y=j)p(y=j)}{p(x|y=j)p(y=j)+p(x|y \neq j)p(y \neq j)}$ <br>
where <br>
$p(x|y=j)=\frac{1}{(2 \pi)^{\frac{n}{2}}|\Sigma|^{\frac{1}{2}}}e^{-\frac{1}{2}(x-\mu_j)^T \Sigma^{-1} (x-\mu_j)}$ <br>
and the prior probability $p(y=j)=1-\Phi$ (depending on the dataset).
3. For later classification, you compute for each class the probability that the image is a sample of it and choose the class with the highest probability.

### Load the images

**Task 1** (15%): Load the images and represent them so that you can work with them.

The dataset contains identically sized images of people with and without facemasks and the goal is to classify if an image contains a person wearing a facemask or not. Note that some images have 3 colors, some also have an alpha channel which you should probably ignore.

When the images are loaded, they are represented as a 32x32x3 matrix. One of the 3 layers of the matrix is for the red (R) value, one for the 
green (G) value and one for the blue (B) value. The image itself is created by taking the values of each of theses 3 matrices at (i,j) to create the 
pixel of the image at spot (i,j). Each of the values can be in range of 0-255.

First, you should load in the images (see zip files). There are two versions of the dataset, a small subsample which contains 40 images for each of the two classes, and the large full dataset (with imbalanced classes). There are multiple libraries for image representation. Try PIL or google around.

In [1]:
from PIL import Image as im
import glob
import math
import numpy as np

def load_imgs_numpy(path):
    imgs = []
    for index, imgpath in enumerate(glob.iglob(path + "*")):
        img = im.open(imgpath)
        imgs.append(np.array(img)[:, :, :3])
    return np.array(imgs)    

mask_imgs = load_imgs_numpy("subset/mask/")
nomask_imgs = load_imgs_numpy("subset/no_mask/")

print(mask_imgs.shape)
im.fromarray(mask_imgs[0]).show()

(40, 32, 32, 3)


### Obtain feature vectors

**Task 2** (15%): Build a feature vector and represent each image by its corresponding feature vector.

For Gaussian Discriminant Analysis, the images should be represented as feature vectors. A feature vector consists of different features of the image, for example mean or variance of pixel values, of color channels, number of "blue" pixels, etc. Be creative. The more discriminative your features, the better your classifier will perform.

Your feature vector should contain at least 5 different features. (Note: your code below should also work with more or fewer features.)

In [2]:
def mean_numpy(image):
    return np.mean(image, (0, 1))

def std_numpy(image):
    return np.std(image, (0, 1))

In [3]:
# maskfilter mean
# disregards pixels outside of where we expect a mask to be
maskFilter = [
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],
[0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
[0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0],
[0, 0.5, 0.5, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0.5, 0.5, 0],
[0, 0.5, 0.5, 0.5, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0.5, 0.5, 0.5, 0],
[0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

def masked_mean_numpy(image, mask):
    mask = np.array([np.array(row) for row in mask])
    masked_image = image * np.stack((mask, mask, mask), 2)

    return mean_numpy(masked_image)

blue_min = [0, 0, 100]
blue_max = [180, 180, 255]


white_min = [128,128,128]
white_max = [255,255,255]

def countColorValues(image, color_min, color_max):
    count = 0
    for row in image:
        for pixel in row:
            is_in_range = True
            for index in range(len(pixel)):
                if not color_min[index] <= pixel[index] <= color_max[index]:
                    is_in_range = False
            count += is_in_range
    return count

def fv_numpy(image):
    blue_value = countColorValues(image, blue_min, blue_max)
    white_value = countColorValues(image, white_min, white_max)
    mean = mean_numpy(image)
    sigma = std_numpy(image)
    mask_filter = masked_mean_numpy(image, maskFilter)
    return np.concatenate((np.array([blue_value, white_value]), mean, sigma, mask_filter)).flatten()

print(fv_numpy(mask_imgs[0]))

[188.         455.         137.10058594 124.89746094 120.07910156
  53.05155437  54.79423721  54.67546825  99.76337891  93.5296875
  90.86425781]


### Initialize your parameters

**Task 3** (5%):  Initialize your parameters for the GDA algorithm.

For the discriminant analysis, you will need to estimate your parameters $\Phi, \mu_0, \mu_1, \Sigma$.

For this, you will need to initialize them first. You can just initalize them with 0, but you should consider their dimensions.

In [4]:
test_fv = fv_numpy(nomask_imgs[0])

# our mean vectors have the same dimension as our feature vector (we don't initialize them separately here)
mu_initial = np.zeros_like(test_fv)
print(mu_initial)

# our covariance matrix is a 2D matrix with dimensions n x n (where n: length of mu)
mu_for_sigma = mu_initial.reshape(1, mu_initial.shape[0])
sigma_initial = mu_for_sigma * mu_for_sigma.T
print(sigma_initial)

# our phi is a scalar value denoting the probability of a test sample having one of two classes 
phi = 0
print(phi)


[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
0


### Implement Gaussian Discriminant analysis

**Task 4** (35%):  Implement the Gaussian Discriminant Analysis algorithm.

Now you can use the GDA algorithm to find an estimation for the correct parameters to classify the images later.

In [5]:
def calc_phi(mask_images, nomask_images):
    phi = len(mask_imgs)/(len(mask_imgs)+(len(nomask_imgs)))
    return phi

def calc_mu_numpy(images):
    fvs = []
    for image in images:
        fvs.append(fv_numpy(image))
    
    return np.mean(np.array(fvs), 0)

def calc_sigma_numpy(mask_images, nomask_images, mu_mask, mu_nomask):
    return calc_difference_numpy(mask_images, mu_mask) + calc_difference_numpy(nomask_images, mu_nomask) / (mask_imgs.shape[0] + nomask_imgs.shape[0])

def calc_difference_numpy(images, mu):
    sigmas = []
    for image in images:
        fv = fv_numpy(image)
        difference_mask = np.reshape(fv - mu, (1, mu.shape[0]))
        sigmas.append(difference_mask * difference_mask.T)
    return np.mean(np.array(sigmas), 0)

mu_mask = calc_mu_numpy(mask_imgs)
mu_nomask = calc_mu_numpy(nomask_imgs)
sigma = calc_sigma_numpy(mask_imgs, nomask_imgs, mu_mask, mu_nomask)
print(sigma)
# confirm that sigma is symmetric
print(np.allclose(sigma, sigma.T, rtol=1e-05, atol=1e-08))

[[ 1.60295561e+04 -1.21211945e+04 -2.47075010e+03 -1.65095995e+03
  -1.36891592e+03 -7.24752731e+02 -6.55641386e+02 -5.93169748e+02
  -1.96274078e+03 -1.37345387e+03 -1.16558093e+03]
 [-1.21211945e+04  2.41673342e+04  3.80209400e+03  3.32456954e+03
   3.08275036e+03 -1.12327598e+02 -9.80440498e+01 -1.80767328e+02
   2.62930952e+03  2.24270160e+03  2.04797841e+03]
 [-2.47075010e+03  3.80209400e+03  7.74366219e+02  5.98267789e+02
   5.16087192e+02 -1.25553767e+01 -7.61181978e+00 -1.05765456e+01
   5.39302447e+02  4.06764202e+02  3.46011779e+02]
 [-1.65095995e+03  3.32456954e+03  5.98267789e+02  5.39322997e+02
   4.91100784e+02 -5.73171187e+01 -3.79920695e+01 -3.71261862e+01
   4.05891597e+02  3.61790645e+02  3.26271419e+02]
 [-1.36891592e+03  3.08275036e+03  5.16087192e+02  4.91100784e+02
   4.84883669e+02 -3.83392540e+01 -2.38497445e+01 -2.71499760e+01
   3.52806426e+02  3.32766351e+02  3.25757627e+02]
 [-7.24752731e+02 -1.12327598e+02 -1.25553767e+01 -5.73171187e+01
  -3.83392540e+01  

### Classify with the Bayes rule
**Task 5** (10%): Use the Bayes rule to check how many images were correctly classified.

Now that you have estimated your parameters, you can use the Bayes rule to classify the images. You then can evaluate how many of the images were correctly classified and try again with different features if the results weren't good enough.

In [6]:
# Multivariate Gaussian PDF
def pdf(x, mu, sigma):
    d = mu.shape[0]
    first_term = 1 / (((2 * math.pi) ** (d / 2)) * (np.linalg.det(sigma) ** 0.5))
    difference_vector = x - mu
    exponent = -0.5 * (np.linalg.inv(sigma).dot(difference_vector.T).dot(difference_vector))
    second_term = math.exp(exponent)
    return first_term * second_term

# BAYES RULE
# p(y = 1|x) = (p(x|y = 1) * p(y = 1) / p(x))
# where:    p(y = 1) is our prior phi
#           p(x|y = 1) is pdf(x, mu for y = 1, sigma)
#           p(x) is constant and equal for both classes, so we can disregard it (since we don't care about the actual probability)

# returns 1 if mask, 0 if no mask
def bayes_rule(x, mu_mask, mu_nomask, sigma, phi):
    prob_mask = pdf(x, mu_mask, sigma) * phi
    prob_nomask = pdf(x, mu_nomask, sigma) * phi
    return prob_mask > prob_nomask

# expects samples to be in format [[[...image...], class], ...]
def accuracy(samples, mu_mask, mu_nomask, sigma, phi):
    correct = 0
    for sample in samples:
        fv = fv_numpy(sample[0])
        correct += bayes_rule(fv, mu_mask, mu_nomask, sigma, phi) == sample[1]
    return correct, correct / samples.shape[0]
    
# puts all samples into one list next to their class labels
def prepare_samples(mask_images, nomask_images):
    samples = []
    for image in mask_images:
        samples.append(np.array([image, 1], dtype=object))
    for image in nomask_images:
        samples.append(np.array([image, 0], dtype=object))
    return np.array(samples)


In [7]:
# Load images
mask_imgs = load_imgs_numpy("subset/mask/")
nomask_imgs = load_imgs_numpy("subset/no_mask/")

# Get ML params
mu_mask = calc_mu_numpy(mask_imgs)
mu_nomask = calc_mu_numpy(nomask_imgs)
sigma = calc_sigma_numpy(mask_imgs, nomask_imgs, mu_mask, mu_nomask)
phi = calc_phi(mask_imgs, nomask_imgs)

# Predict
concat_samples = prepare_samples(mask_imgs, nomask_imgs)
nr_correct, acc = accuracy(concat_samples, mu_mask, mu_nomask, sigma, phi)
print(f"Correctly Predicted: {nr_correct}/{concat_samples.shape[0]}")
print(f"Accuracy: {acc * 100}%")

Correctly Predicted: 77/80
Accuracy: 96.25%


### Cross-validation

**Task 6** (10%): Implement 10-fold cross-validation and report the quality of your results in terms of accuracy and f-score.

You have so far trained and tested your classifier on the same data. This does not tell us much about the true performance on unseen data. 
Instead, you should now randomly split your data into _k_ folds of equal size. you then train your model _k_ times, using all but the _k_'th fold for training and the _k_'th fold for testing.

In [8]:
def k_fold_cross_validation(samples, k):
    counts = []
    accs = []
    individual_folds = np.array_split(samples, k)
    for index, fold in enumerate(individual_folds):
        # remove validation fold from training and concatenate training folds to single array
        train = np.delete(individual_folds, index, 0)
        train = np.concatenate(train, 0)
        val = fold

        # split into mask and non mask images
        mask_imgs = np.array([sample[0] for sample in train if sample[1] == 1])
        nomask_imgs = np.array([sample[0] for sample in train if sample[1] == 0])

        # calculate mu, sigma, ... for train
        mu_mask = calc_mu_numpy(mask_imgs)
        mu_nomask = calc_mu_numpy(nomask_imgs)
        sigma = calc_sigma_numpy(mask_imgs, nomask_imgs, mu_mask, mu_nomask)
        phi = calc_phi(mask_imgs, nomask_imgs)

        # test on validation fold
        results = accuracy(val, mu_mask, mu_nomask, sigma, phi)
        counts.append(results[0])
        accs.append(results[1])
        
    mean_count = np.mean(np.array(counts))
    total_count = sum(counts)
    mean_acc = np.mean(np.array(accs))
    return mean_count, mean_acc, total_count


In [9]:
np.random.seed(4505918)

shuffled_samples = prepare_samples(mask_imgs, nomask_imgs)
np.random.shuffle(shuffled_samples)
print(shuffled_samples.shape)
# im.fromarray(shuffled_samples[0, 0]).show()

k = 10
mean_count, mean_acc, total_count = k_fold_cross_validation(shuffled_samples, k)
print(f"Total Correctly Predicted: {total_count}/{shuffled_samples.shape[0]}")
print(f"Avg. Correctly Predicted: {mean_count}/{shuffled_samples.shape[0] / k}")
print(f"Accuracy: {mean_acc * 100}%")

(80, 2)
Total Correctly Predicted: 77/80
Avg. Correctly Predicted: 7.7/8.0
Accuracy: 96.25%


There is some variance here due to randomness from the shuffling, though not much - one less correct sometimes.

### Feature importance

**Task 7** (10%): Experiment with the features: how well does the classifier perform with individual features, what is the additional value of the second best feature in addition to the best?

In [10]:
# make feature vector changeable
class GDA:
    def __init__(self, feature_vector_func):
        self.fv = feature_vector_func

    def pdf(self, x, mu, sigma):
        d = mu.shape[0]
        first_term = 1 / (((2 * math.pi) ** (d / 2)) * (np.linalg.det(sigma) ** 0.5))
        difference_vector = x - mu
        exponent = -0.5 * (np.linalg.inv(sigma).dot(difference_vector.T).dot(difference_vector))
        second_term = math.exp(exponent)
        return first_term * second_term

    def bayes_rule(self, x, mu_mask, mu_nomask, sigma, phi):
        prob_mask = self.pdf(x, mu_mask, sigma) * phi
        prob_nomask = self.pdf(x, mu_nomask, sigma) * phi
        return prob_mask > prob_nomask

    def accuracy(self, samples, mu_mask, mu_nomask, sigma, phi):
        correct = 0
        for sample in samples:
            fv = self.fv(sample[0])
            correct += self.bayes_rule(fv, mu_mask, mu_nomask, sigma, phi) == sample[1]
        return correct, correct / samples.shape[0]

    def calc_phi(self, mask_images, nomask_images):
        phi = len(mask_imgs)/(len(mask_imgs)+(len(nomask_imgs)))
        return phi

    def calc_mu_numpy(self, images):
        fvs = []
        for image in images:
            fvs.append(self.fv(image))
        
        return np.mean(np.array(fvs), 0)

    def calc_sigma_numpy(self, mask_images, nomask_images, mu_mask, mu_nomask):
        return self.calc_difference_numpy(mask_images, mu_mask) + self.calc_difference_numpy(nomask_images, mu_nomask) / (mask_imgs.shape[0] + nomask_imgs.shape[0])

    def calc_difference_numpy(self, images, mu):
        sigmas = []
        for image in images:
            fv = self.fv(image)
            difference_mask = np.reshape(fv - mu, (1, mu.shape[0]))
            sigmas.append(difference_mask * difference_mask.T)
        return np.mean(np.array(sigmas), 0)

    def k_fold_cross_validation(self, samples, k):
        counts = []
        accs = []
        individual_folds = np.array_split(samples, k)
        for index, fold in enumerate(individual_folds):
            train = np.delete(individual_folds, index, 0)
            train = np.concatenate(train, 0)
            val = fold

            mask_imgs = np.array([sample[0] for sample in train if sample[1] == 1])
            nomask_imgs = np.array([sample[0] for sample in train if sample[1] == 0])

            mu_mask = self.calc_mu_numpy(mask_imgs)
            mu_nomask = self.calc_mu_numpy(nomask_imgs)
            sigma = self.calc_sigma_numpy(mask_imgs, nomask_imgs, mu_mask, mu_nomask)
            phi = self.calc_phi(mask_imgs, nomask_imgs)

            results = self.accuracy(val, mu_mask, mu_nomask, sigma, phi)
            counts.append(results[0])
            accs.append(results[1])
            
        mean_count = np.mean(np.array(counts))
        total_count = sum(counts)
        mean_acc = np.mean(np.array(accs))
        return mean_count, mean_acc, total_count
    

In [11]:
# overwrite feature vector method with different single features
# call k_fold_cross_validation for each feature
# compare
# overwrite fv with best & second best feature, validate, compare to only single best

# shuffle samples once
shuffled_samples = prepare_samples(mask_imgs, nomask_imgs)
np.random.shuffle(shuffled_samples)
k = 10

# per channel mean
def mean_fv(image):
    return mean_numpy(image)

# per channel stddev
def std_fv(image):
    return std_numpy(image)

# nr of "blue" pixels in image
blue_min = [0, 0, 100]
blue_max = [180, 180, 255]
def blue_fv(image):
    return np.array([countColorValues(image, blue_min, blue_max)])

# nr of "white" pixels in image
white_min = [128,128,128]
white_max = [255,255,255]
def white_fv(image):
    return np.array([countColorValues(image, white_min, white_max)])

# per channel mean of masked image
def masked_fv(image):
    return masked_mean_numpy(image, maskFilter)

print(GDA(mean_fv).k_fold_cross_validation(shuffled_samples, k))

print(GDA(std_fv).k_fold_cross_validation(shuffled_samples, k))

print(GDA(blue_fv).k_fold_cross_validation(shuffled_samples, k))

print(GDA(white_fv).k_fold_cross_validation(shuffled_samples, k))

print(GDA(masked_fv).k_fold_cross_validation(shuffled_samples, k))


(7.6, 0.95, 76)
(7.0, 0.875, 70)
(4.5, 0.5625, 45)
(6.5, 0.8125, 65)
(7.9, 0.9875, 79)


In [12]:
def top_two_fv(image):
    mean = mean_numpy(image)
    masked_mean = masked_mean_numpy(image, maskFilter)
    return np.concatenate([mean, masked_mean])

print(GDA(top_two_fv).k_fold_cross_validation(shuffled_samples, k))

(8.0, 1.0, 80)


# Feature importance

Our top two features were 
 - mean per channel on masked image with 98.75% accuracy
 - mean per channel with 96.25% accuracy 

The mask used to mask the image disregarded pixels outside of the lower center of the image (where the mask was commonly placed in the dataset), so there was a lower influence of backgrounds on the per-channel means (e.g. some images had white backgrounds).

Blue pixels performed the worst, probably because a larger number of images with face masks were actually of white face masks.

Suprisingly, even just the channel means alone give quite a respectable result (on a simple dataset like this).

Using the top two features increased accuracy from 98.75% to 100%. This means some features in our full feature vector used above were actually detrimental to performance. Features are not weighted, so lower confidence features have the same impact as high confidence ones, worsening performance.

### Report Submission

Prepare a report of your solution as a commented Jupyter notebook (using markdown for your results and comments); include figures and results.
If you must, you can also upload a PDF document with the report annexed with your Python code.

Upload your report file to the Machine Learning Moodle Course page. Please make sure that your submission team corresponds to the team's Moodle group that you're in.