# Homework 2: mask detection for the next pandemic
You heard rumors from the biology department at EPFL about a new coronavirus variant expected to spread next year, already called COVID-25. The virus is so dangerous that you are asked to develop a mask detection system to be deployed in all public places. You are given a dataset of images of people with and without masks. Your task is to develop a machine learning model that can detect whether a person is wearing a mask.

**IMPORTANT NOTE:** This homework will be evaluated on a hidden dataset. Your notebook must run without errors; otherwise we will not be able to evaluate some questions, and those parts will receive a grade of 0. Double-check your code before submitting the notebook. We provide a simple `evaluate` function that you can use to make sure that all the necessary functions are implemented correctly; the function will warn you if it detects that a good prediction cannot be made. `evaluate` takes just one argument: the `scikit-learn` model that you have trained. It expects the model to have a `predict` method that works on a numpy array.

In [2]:
from evaluation import evaluate

## Data loading and preprocessing
The dataset consists of $n$ images of 128x128 pixels. Each pixel is represented by 3 values (RGB). The dataset is divided into two classes: 0 for people wearing a mask and 1 for people without a mask.

In [None]:
import numpy as np
import sklearn

def load_file(file):
    with np.load('data/'+file+'.npz') as data:
        X = data['X']
        y = data['y']
    return X, y

X, y = load_file('train')
print(X.shape, y.shape)

**Question 1.** Plot a few images and check the distribution of the labels.
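
As a possible starting point for the label check, the sketch below computes the fraction of samples in each class. It runs on a synthetic label vector (`y_demo` is an assumption, standing in for the real `y`), and `label_distribution` is a hypothetical helper name, not part of the provided code. The images themselves could be displayed with `matplotlib.pyplot.imshow`.

```python
import numpy as np

# Synthetic stand-in for the real labels (assumption: y is a 1-D array of 0/1).
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)

def label_distribution(y):
    """Return a dict mapping each label to its fraction of the dataset."""
    labels, counts = np.unique(y, return_counts=True)
    return {int(label): float(count) / len(y) for label, count in zip(labels, counts)}

dist = label_distribution(y_demo)
print(dist)
```

A strongly uneven distribution would suggest looking at metrics other than plain accuracy.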

**Question 2.** Reshape each image in the dataset to a flat vector. Split the dataset into training and test sets.
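
One way to approach this is sketched below on a small synthetic array (`X_demo` and `y_demo` are assumptions with tiny 8x8 images so the example runs quickly). The scaling to $[0, 1]$, the 25% test fraction, and the stratified split are illustrative choices, not requirements of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 40 RGB images of 8x8 pixels with balanced labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(40, 8, 8, 3), dtype=np.uint8)
y_demo = np.array([0, 1] * 20)

# Flatten each image to one row and scale pixel values to [0, 1].
X_flat = X_demo.reshape(len(X_demo), -1).astype(np.float32) / 255.0

# Hold out 25% of the samples; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_flat, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)
print(X_train.shape, X_test.shape)
```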

## Logistic regression

**Question 3.** Train a logistic regression model on the training data, comparing the accuracies of the $\ell_1$ and $\ell_2$ penalties. What are the train and test accuracies of each model?
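
The comparison could look roughly like the sketch below. It uses synthetic data from `make_classification` in place of the flattened images (an assumption), and the `liblinear` solver because it accepts both penalties; `C=1.0` is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flattened image data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

results = {}
for penalty in ("l1", "l2"):
    # liblinear supports both the l1 and the l2 penalty.
    clf = LogisticRegression(penalty=penalty, solver="liblinear", C=1.0, max_iter=1000)
    clf.fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for this penalty.
    results[penalty] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for penalty, (train_acc, test_acc) in results.items():
    print(f"{penalty}: train={train_acc:.3f}, test={test_acc:.3f}")
```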

In [None]:
# Evaluate the model
evaluate(model)
# Make sure the line above runs without errors before submitting your solution!

**Question 4.** Run a more extensive search over logistic regression models. Cross-validate the following parameters:
- penalty: `['l1', 'l2']`
- $C$: range from $10^{-3}$ to $10^3$.
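
A search over these two parameters can be sketched with `GridSearchCV`, as below. The data is synthetic (an assumption), the seven log-spaced values of $C$ and the 5-fold split are illustrative choices, and `logistic_demo` is a hypothetical name for the resulting model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the flattened image data.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Both penalties, and C on a logarithmic grid from 1e-3 to 1e3.
param_grid = {"penalty": ["l1", "l2"], "C": np.logspace(-3, 3, 7)}

grid_demo = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid,
    cv=5,
    scoring="accuracy",
)
grid_demo.fit(X_demo, y_demo)

# best_estimator_ is refit on all the data passed to fit().
logistic_demo = grid_demo.best_estimator_
print(grid_demo.best_params_, grid_demo.best_score_)
```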


In [None]:
# Official evaluation!
evaluate(logistic_model)
# Make sure the line above runs without errors before submitting your solution!

## SVM
**Question 4.** Train a linear SVM model on the training data and cross-validate the parameter $C$.
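
A minimal sketch of such a search is given below, assuming `LinearSVC` as the linear SVM and synthetic data in place of the images. The `StandardScaler` step and the grid of $C$ values are illustrative additions, not something the assignment prescribes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the flattened image data.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Scaling inside the pipeline means it is refit on each training fold only.
pipe = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000))

# make_pipeline names the step after the lowercased class name.
param_grid = {"linearsvc__C": np.logspace(-3, 2, 6)}

svm_grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
svm_grid.fit(X_demo, y_demo)
print(svm_grid.best_params_, svm_grid.best_score_)
```

The fitted `svm_grid.best_estimator_` is a full pipeline with a `predict` method, so it can be used wherever a plain model is expected.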

In [None]:
# Official evaluation!
evaluate(svm_model)
# Make sure the line above runs without errors before submitting your solution!

## Random Forest

**Question 5.** Train a Random Forest model on the training data and cross-validate the parameters:
- `n_estimators`
- `max_depth`
- `criterion`
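
The search over these parameters could be set up as in the sketch below. Note that scikit-learn's `RandomForestClassifier` names the split-quality parameter `criterion`. The data is synthetic and the candidate values are deliberately small illustrative choices so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the flattened image data.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [10, 30],
    "max_depth": [3, None],          # None lets trees grow until leaves are pure
    "criterion": ["gini", "entropy"],
}

rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
rf_grid.fit(X_demo, y_demo)
print(rf_grid.best_params_, rf_grid.best_score_)
```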



In [None]:
# Official evaluation!
evaluate(rf_model)
# Make sure the line above runs without errors before submitting your solution!

## Extra data
Digging through their archives, the EPFL health authorities found another dataset of images with the same format. This dataset is called `extra`. Use this dataset to improve the performance of your models.

In [37]:
X_extra, y_extra = load_file('extra')

**Question 6.** Is this new dataset good? What problem does it have compared to the original dataset, if any?

**Question 7.** Evaluate the performance of the cross-validated models you found in the previous questions on the new dataset. Is the performance better or worse? Why?

**Question 8.** Train the most promising model using the new dataset, and cross-validate. Is the performance better or worse? Why?

In [None]:
# Official evaluation!
evaluate(newrf_model)
# Make sure the line above runs without errors before submitting your solution!

**Question 9.** Join the `train` and `extra` datasets and repeat the previous point. What is the best model and its accuracy on the combined dataset?
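
Joining two datasets that share the same feature layout can be sketched as follows. The arrays `X_a`, `y_a`, `X_b`, `y_b` are synthetic assumptions standing in for the flattened `train` and `extra` data; the shuffle is an optional extra so that samples from the two sources are interleaved.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two flattened datasets with the same number of features.
X_a, y_a = rng.random((30, 12)), rng.integers(0, 2, size=30)
X_b, y_b = rng.random((50, 12)), rng.integers(0, 2, size=50)

# Stack samples along the first axis; labels are concatenated in the same order.
X_joined = np.concatenate([X_a, X_b], axis=0)
y_joined = np.concatenate([y_a, y_b])

# Apply one shared permutation so each row keeps its own label.
perm = rng.permutation(len(y_joined))
X_joined, y_joined = X_joined[perm], y_joined[perm]
print(X_joined.shape, y_joined.shape)
```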

In [None]:
# Official evaluation!
evaluate(grid.best_estimator_)
# Make sure the line above runs without errors before submitting your solution!

**Question 10.** Resample the joined dataset to address the "problem" you identified in it. Train a Random Forest with cross-validation and discuss the results.
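
Assuming the "problem" is class imbalance, one resampling option is to oversample the minority class with replacement, sketched below with `sklearn.utils.resample`. The data is synthetic and `oversample_minority` is a hypothetical helper name, not part of the provided code.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic imbalanced stand-in: 80 samples of class 0, 20 of class 1.
rng = np.random.default_rng(0)
X_demo = rng.random((100, 5))
y_demo = np.array([0] * 80 + [1] * 20)

def oversample_minority(X, y, random_state=0):
    """Resample every smaller class with replacement up to the largest class size."""
    labels, counts = np.unique(y, return_counts=True)
    n_max = int(counts.max())
    X_parts, y_parts = [], []
    for label, count in zip(labels, counts):
        X_c = X[y == label]
        if count < n_max:
            X_c = resample(X_c, replace=True, n_samples=n_max, random_state=random_state)
        X_parts.append(X_c)
        y_parts.append(np.full(len(X_c), label))
    return np.concatenate(X_parts), np.concatenate(y_parts)

X_bal, y_bal = oversample_minority(X_demo, y_demo)
print(np.bincount(y_bal))
```

Oversampling before splitting puts copies of the same sample in both training and validation folds, which inflates the scores, so it is safer to resample only the training portion.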

In [None]:
# Official evaluation!
evaluate(oversampled_rf_model)
# Make sure the line above runs without errors before submitting your solution!

**Question 11.** Repeat the previous question, but this time use `LogisticRegression` as the model.

**Question 12.** Train the best model you can given the data available. Describe your technique and justify your choices.

**Bonus question.** You have been told that it is much more important to detect people without masks than people with masks. How would you change your model to take this into account?
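
Two common options are to weight class 1 (no mask) more heavily during training and to lower the decision threshold for that class. The sketch below combines both on synthetic data; the weight of 5, the threshold of 0.3, and the helper `recall_at` are illustrative assumptions, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: about 30% of samples belong to class 1 (no mask).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Errors on class 1 cost five times more than errors on class 0 during training.
clf = LogisticRegression(class_weight={0: 1, 1: 5}, max_iter=1000)
clf.fit(X_train, y_train)

# Column 1 of predict_proba is the estimated probability of class 1.
proba_no_mask = clf.predict_proba(X_test)[:, 1]

def recall_at(threshold):
    """Recall on class 1 when predicting 'no mask' at or above the threshold."""
    pred = (proba_no_mask >= threshold).astype(int)
    return recall_score(y_test, pred, pos_label=1)

recall_default = recall_at(0.5)
recall_low = recall_at(0.3)
print(recall_default, recall_low)
```

Lowering the threshold can only keep or increase recall on class 1, at the cost of more false alarms on people who are wearing a mask.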