## Face detection

ProCam s.p.a. plans to launch a new compact, affordable digital camera aimed at young budding photographers.

You are hired as a Data Scientist to build the system that identifies faces in images; this will then allow the photography technicians to optimize the settings for a selfie with one or more people.

This is a computer vision problem, more precisely Face Detection.

You must deliver a scikit-learn pipeline that takes an image as input and returns a list with the coordinates of the bounding boxes where faces are present; if the image contains no faces, the list will of course be empty.

- No dataset is provided: it is up to you to find one online or, in the worst case, to build it. For simplicity we will not consider the implications of commercial-use licenses, since this is still an educational project.
- You cannot use pre-trained models: you must train the model yourself using scikit-learn.
- You are working on a system with limited computing power, so the model must require few computational resources.
- Naturally, no implementation guidance is provided: carry out a thorough literature search to find the best solution to adopt. The notebook you submit must be well documented: explain which solutions you adopted and why, and cite every external resource (papers, blog posts, GitHub code...) you used.
- The project is fairly complex; remember that, should you need it, you can always ask your coaches for help in the Machine Learning Virtual Classroom on Discord.

# Notebook 1
## Model construction, optimization and training

In this notebook the model is built, optimized and trained.

First, the image dataset for training, validation and testing is assembled from datasets found on Kaggle and in the scikit-learn library. Then a Pipeline is built to perform image preprocessing, feature extraction, dimensionality reduction and finally prediction with an SVM model.
Hyperparameter tuning is performed with randomized search.

At the end of this notebook the optimized pipeline is saved into a joblib file, for further usage.
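
The pipeline built in this notebook classifies a single fixed-size image as face or non-face, while the brief asks for bounding boxes. One common way to bridge the two is a sliding window; the sketch below illustrates the idea only and is not necessarily how the detector is implemented later. The names `sliding_windows`, `detect_faces` and `BrightPatchClassifier` are hypothetical, and the toy classifier merely stands in for the trained pipeline.

```python
import numpy as np

def sliding_windows(image, window=(64, 64), step=16):
    """Yield (row, col, patch) for every window position that fits in the image."""
    win_h, win_w = window
    img_h, img_w = image.shape[:2]
    for row in range(0, img_h - win_h + 1, step):
        for col in range(0, img_w - win_w + 1, step):
            yield row, col, image[row:row + win_h, col:col + win_w]

def detect_faces(image, classifier, window=(64, 64), step=16):
    """Return [(x, y, w, h), ...] for windows the classifier labels as 1 (face)."""
    coords, patches = [], []
    for row, col, patch in sliding_windows(image, window, step):
        coords.append((col, row, window[1], window[0]))
        patches.append(patch)
    if not patches:
        return []  # the image is smaller than the window
    labels = classifier.predict(np.array(patches))
    return [box for box, label in zip(coords, labels) if label == 1]

class BrightPatchClassifier:
    """Hypothetical stand-in for the trained pipeline: a 'face' is a bright patch."""
    def predict(self, patches):
        return (patches.mean(axis=(1, 2)) > 0.5).astype(int)

# Toy image: dark background with one bright 64x64 square at row 32, col 64.
toy = np.zeros((128, 192))
toy[32:96, 64:128] = 1.0
boxes = detect_faces(toy, BrightPatchClassifier())
print(boxes)
```

Several overlapping windows fire around the same object, so a real detector would typically merge them (for example with non-maximum suppression) and repeat the scan at several image scales.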

### Bash section

Install the kaggle package (if not already installed) and download the non-face images from Kaggle.

In [1]:
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.


As negative class images (i.e. non-face images) the [African Wildlife dataset](https://www.kaggle.com/datasets/biancaferreira/african-wildlife) is used.

It is a collection of images of 4 African animals (buffalo, elephant, rhino and zebra), taken from Google Images.

Inspecting the images, one can see that they also contain natural background, which could help the model learn not only animal features but also those of natural objects (such as trees and rocks).

The dataset is downloaded through the Kaggle API and extracted into a local directory.

In [2]:
!kaggle datasets download -p wildlife biancaferreira/african-wildlife
!tar -xf wildlife/african-wildlife.zip -C wildlife

Dataset URL: https://www.kaggle.com/datasets/biancaferreira/african-wildlife
License(s): unknown
african-wildlife.zip: Skipping, found more recently modified local copy (use --force to force download)


### Python section

In [None]:
#Import some modules
from tqdm import tqdm
import os
import random
import numpy as np
from skimage.transform import resize
from skimage.io import imread
from sklearn.datasets import fetch_olivetti_faces
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from ImageResizer import ImageResizer
from HOGFeatureExtractor import HOGFeatureExtractor
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


Set a random seed to get reproducible results

In [4]:
RANDOM_SEED=200
random.seed(RANDOM_SEED)
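
Note that `random.seed` only controls Python's built-in `random` module (used below by `random.sample`); NumPy has its own generator, and scikit-learn estimators are made reproducible through their `random_state` argument. A small sketch of the difference, where the draws from `range(1000)` are purely illustrative:

```python
import random
import numpy as np

RANDOM_SEED = 200

random.seed(RANDOM_SEED)
first_draw = random.sample(range(1000), 5)

random.seed(RANDOM_SEED)  # re-seeding replays the same sequence
second_draw = random.sample(range(1000), 5)
print(first_draw == second_draw)  # True

# NumPy keeps a separate generator: seed it explicitly if it is used.
rng = np.random.default_rng(RANDOM_SEED)
print(rng.integers(0, 1000, size=5))
```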

#### Dataset preparation

As positive class images (i.e. face images) the [Olivetti faces dataset](https://scikit-learn.org/stable/datasets/real_world.html#the-olivetti-faces-dataset) is used.

This dataset contains 10 photos of each of 40 distinct subjects (400 samples in total).

It is well suited to learning face detection: the images are clean, there are no distracting elements (such as background), and the faces are shown in a frontal position with varying facial expressions (open/closed eyes, smiling/not smiling).

Such a small, clean and uniform set of images is a good match for an SVM classifier, which can work well with only a few hundred training samples.


The dataset is downloaded with the [scikit-learn utility](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_olivetti_faces.html)

In [5]:
olivetti_faces=fetch_olivetti_faces()

images_positive = olivetti_faces.images

#helper variable
resize_shape = (64, 64) #as row x columns (h x w)


Store positive images in a temporary variable and generate positive labels array.

In [6]:
X_positive = images_positive
y_positive = np.ones(X_positive.shape[0])

#check
print(X_positive.shape) #3 dimensions: 1 = records, 2-3 = image as matrix 
print(y_positive.shape)

(400, 64, 64)
(400,)


In [7]:
#check
print(X_positive[1,:,:].shape)

(64, 64)


Store negative images in a temporary variable and generate negative labels array.

Note: not all images in the Kaggle dataset are used. From each subdirectory (i.e. the image set for a given animal), 100 images are randomly sampled, giving 400 negative images in the final dataset.

This keeps the dataset balanced between the positive and negative classes.

In [8]:
#process negative images V2
negative_images = []

buffalo_img_dir = "wildlife/buffalo"
elephant_img_dir = "wildlife/elephant"
rhino_img_dir = "wildlife/rhino"
zebra_img_dir = "wildlife/zebra"

negative_img_dirs = [buffalo_img_dir,
                     elephant_img_dir,
                     rhino_img_dir,
                     zebra_img_dir                 
                    ]

for directory in negative_img_dirs:
    filenames = os.listdir(directory)
    #sort before sampling: os.listdir returns files in arbitrary order,
    #so without sorting the seeded sample would not be reproducible
    jpg_files = sorted(f for f in filenames if f.endswith('.jpg'))
    #100 images x 4 directories = 400 negative images, matching the 400 positive images
    sampled_files = random.sample(jpg_files, 100)

    for filename in tqdm(sampled_files, desc=f"Processing negative images ({directory})",
                         unit="item"):
        img_path = os.path.join(directory, filename)
        img = imread(img_path, as_gray=True)
        #resize to the shape of the positive images, so that everything
        #can be stacked into a single numpy array
        img = resize(img, X_positive.shape[1:])
        negative_images.append(img)

X_negative = np.array(negative_images)
y_negative = np.zeros(X_negative.shape[0])

#check
print(X_negative.shape) #3 dimensions: 1 = records, 2-3 = image as matrix 
print(y_negative.shape)

Processing negative images (wildlife/buffalo): 100%|██████████| 100/100 [00:10<00:00,  9.48item/s]
Processing negative images (wildlife/elephant): 100%|██████████| 100/100 [00:03<00:00, 26.61item/s]
Processing negative images (wildlife/rhino): 100%|██████████| 100/100 [00:05<00:00, 18.31item/s]
Processing negative images (wildlife/zebra): 100%|██████████| 100/100 [00:03<00:00, 27.47item/s]

(400, 64, 64)
(400,)





Store the whole dataset in a single array, with a matching labels array. Here we follow the convention of calling _X_ the observation matrix and _y_ the labels array.

The idea is to pass the dataset to a scikit-learn Pipeline in the conventional way.

In [9]:
X = np.vstack((X_positive, X_negative))
y = np.concatenate([y_positive,y_negative])

#check
print(X.shape)
print(y.shape)

(800, 64, 64)
(800,)


#### Pipeline definition

All the preprocessing, feature extraction and classification steps are integrated into a single Pipeline, producing one object that can be easily shared and reused across devices or notebooks.

Pipeline steps are:
- ImageResizer
    
    A custom transformer, used to resize input images
- HOGFeatureExtractor
    
    A custom transformer, used to perform feature extraction using the [HOG method](https://www.analyticsvidhya.com/blog/2019/09/feature-engineering-images-introduction-hog-feature-descriptor/)
- StandardScaler

    Standard scikit-learn implementation of data standardization. This is crucial for the subsequent PCA step, since PCA is sensitive to the scale of the features.
- PCA
    
    Once the images are transformed into numeric features, dimensionality reduction is performed: the HOG method generates a large number of features, and the model must stay light because of the limited computational resources.
- SVC
    
    Scikit-learn implementation of SVM model. This last step performs classification.

The pipeline is created with caching enabled (see the `memory` parameter in the [Pipeline documentation](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)), so that fitted transformers are cached on disk and identical preprocessing steps are not recomputed for every candidate during optimization.

In [10]:
pipeline = Pipeline(steps=[
    ('resizer', ImageResizer(resize_shape)),
    ('hog', HOGFeatureExtractor()),
    ('scaler', StandardScaler()),
    ('pca', PCA(random_state=RANDOM_SEED)), 
    ('svc', SVC(max_iter=5000))
], 
memory="pipe_cache")

#check
pipeline
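
The two custom transformers are imported from local modules (`ImageResizer`, `HOGFeatureExtractor`) whose source is not shown in this notebook. As an illustration only, a minimal sketch of how such transformers could be written is given below. The class names `SketchImageResizer` and `SketchHOGExtractor`, and the HOG settings (9 orientations, 8x8 pixel cells, 2x2 cell blocks), are assumptions and may differ from the actual implementation.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.base import BaseEstimator, TransformerMixin

class SketchImageResizer(BaseEstimator, TransformerMixin):
    """Hypothetical resizer: maps (n_samples, h, w) images to a common shape."""
    def __init__(self, shape=(64, 64)):
        self.shape = shape

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return np.array([resize(img, self.shape) for img in X])

class SketchHOGExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical HOG extractor: one flat descriptor per grayscale image."""
    def __init__(self, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2)):
        self.orientations = orientations
        self.pixels_per_cell = pixels_per_cell
        self.cells_per_block = cells_per_block

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        return np.array([
            hog(img,
                orientations=self.orientations,
                pixels_per_cell=self.pixels_per_cell,
                cells_per_block=self.cells_per_block)
            for img in X
        ])

# Random images of arbitrary size, just to check the output shapes.
demo = np.random.default_rng(0).random((3, 80, 100))
resized = SketchImageResizer((64, 64)).fit_transform(demo)
features = SketchHOGExtractor().fit_transform(resized)
print(resized.shape, features.shape)
```

With these assumed settings a 64x64 image yields 7 x 7 blocks of 2 x 2 cells with 9 orientation bins each, i.e. 1764 features per image, which gives an idea of why a PCA step follows.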

#### Pipeline optimization

Hyperparameter optimization is done with [randomized search](https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-search).

The parameters that are optimized are:
- `n_components` of the PCA step
- `C` of the SVC step

The kernel and gamma of the SVM are kept fixed: the kernel is RBF and gamma is left at the scikit-learn default.
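
For reference, in scikit-learn 0.22 and later the default is `gamma="scale"`, which evaluates to `1 / (n_features * X.var())` on the data that reaches the SVC (here, the PCA output). The sketch below checks this equivalence on a synthetic dataset; `X_demo` and `y_demo` are made-up data, not the face dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data, only used to illustrate the default gamma.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Value that gamma="scale" resolves to for this data.
gamma_scale = 1.0 / (X_demo.shape[1] * X_demo.var())

default_svc = SVC(kernel="rbf").fit(X_demo, y_demo)
explicit_svc = SVC(kernel="rbf", gamma=gamma_scale).fit(X_demo, y_demo)

same = np.allclose(default_svc.decision_function(X_demo),
                   explicit_svc.decision_function(X_demo))
print(gamma_scale, same)
```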

In [11]:
#uniform distribution, such that all values are equally probable
C_range = uniform(loc=0.001, scale=100.0)
n_components_range=[100,150,200,250]

grid = [{
    "pca__n_components":n_components_range,
    "svc__kernel" : ["rbf"],
    "svc__C" : C_range
}]

search = RandomizedSearchCV(estimator=pipeline,
                            param_distributions=grid,
                            n_iter=20,
                            cv=5,
                            scoring="accuracy",
                            verbose=4,
                            random_state=RANDOM_SEED,
                            n_jobs=1)
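
With `uniform(loc=0.001, scale=100.0)`, `C` is drawn uniformly between 0.001 and about 100, so roughly 99% of the draws are larger than 1 and small values are almost never tried. Since `C` is usually explored across orders of magnitude, a log-uniform distribution is a common alternative. The sketch below only compares the two samplers; it is a possible variation, not what was used for the results reported here.

```python
from scipy.stats import loguniform, uniform

SEED = 200

# Draw 1000 candidate C values from each distribution.
uniform_C = uniform(loc=0.001, scale=100.0).rvs(size=1000, random_state=SEED)
log_C = loguniform(1e-3, 1e2).rvs(size=1000, random_state=SEED)

# Fraction of candidates below 1, i.e. in the strongly regularized regime.
share_uniform = (uniform_C < 1.0).mean()
share_log = (log_C < 1.0).mean()
print(f"uniform:    {share_uniform:.1%} of draws below 1")
print(f"loguniform: {share_log:.1%} of draws below 1")
```

To try it, `loguniform(1e-3, 1e2)` could be passed as the `"svc__C"` entry of `param_distributions` in place of `C_range`.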


In [12]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=RANDOM_SEED,stratify=y)

#check
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

search.fit(X_train, y_train)

(640, 64, 64)
(160, 64, 64)
(640,)
(160,)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV 1/5] END pca__n_components=200, svc__C=49.41536292718904, svc__kernel=rbf;, score=1.000 total time=   0.2s
[CV 2/5] END pca__n_components=200, svc__C=49.41536292718904, svc__kernel=rbf;, score=1.000 total time=   0.2s
[CV 3/5] END pca__n_components=200, svc__C=49.41536292718904, svc__kernel=rbf;, score=1.000 total time=   0.2s
[CV 4/5] END pca__n_components=200, svc__C=49.41536292718904, svc__kernel=rbf;, score=1.000 total time=   0.2s
[CV 5/5] END pca__n_components=200, svc__C=49.41536292718904, svc__kernel=rbf;, score=1.000 total time=   0.2s
[CV 1/5] END pca__n_components=100, svc__C=59.443014439113256, svc__kernel=rbf;, score=1.000 total time=   0.1s
[CV 2/5] END pca__n_components=100, svc__C=59.443014439113256, svc__kernel=rbf;, score=1.000 total time=   0.2s
[CV 3/5] END pca__n_components=100, svc__C=59.443014439113256, svc__kernel=rbf;, score=1.000 total time=   0.1s
[CV 4

In [13]:
print(f"Best parameters: {search.best_params_}")
print(f"Best accuracy: {search.best_score_}")

Best parameters: {'pca__n_components': 200, 'svc__C': 49.41536292718904, 'svc__kernel': 'rbf'}
Best accuracy: 1.0
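
When several candidates reach the same score, `best_params_` reports only one of them, so it can be useful to look at the whole `cv_results_` table before drawing conclusions about the hyperparameters. The sketch below shows one way to do that on a small synthetic problem; `demo_search` and its data are stand-ins, and the same `pd.DataFrame(search.cv_results_)` call should apply to the fitted `search` object above.

```python
import pandas as pd
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in problem, so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=200, n_features=30, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(random_state=0)),
    ("svc", SVC()),
])
demo_search = RandomizedSearchCV(
    demo_pipeline,
    {"pca__n_components": [5, 10, 20], "svc__C": uniform(0.001, 100.0)},
    n_iter=6, cv=3, scoring="accuracy", random_state=0,
)
demo_search.fit(X_demo, y_demo)

# One row per sampled candidate, best-ranked first.
results = pd.DataFrame(demo_search.cv_results_)
columns = ["param_pca__n_components", "param_svc__C",
           "mean_test_score", "std_test_score", "rank_test_score", "mean_fit_time"]
summary = results[columns].sort_values("rank_test_score")
print(summary)

# Number of candidates sharing the top rank.
n_tied = int((results["rank_test_score"] == 1).sum())
print(f"{n_tied} candidate(s) tied for the best score")
```

If many candidates are tied at the top, choosing the one with the smallest `n_components` would be a reasonable tie-breaker given the requirement for a light model.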


In [14]:
#check
search.best_estimator_

#### Model evaluation

In [17]:
y_pred = search.predict(X_test)
print(classification_report(y_test, y_pred,digits=4))

print("Confusion matrix on test set")
print(confusion_matrix(y_test,y_pred))

              precision    recall  f1-score   support

         0.0     1.0000    1.0000    1.0000        80
         1.0     1.0000    1.0000    1.0000        80

    accuracy                         1.0000       160
   macro avg     1.0000    1.0000    1.0000       160
weighted avg     1.0000    1.0000    1.0000       160

Confusion matrix on test set
[[80  0]
 [ 0 80]]


The optimization produced a very good model for the dataset that was used: the confusion matrix on the test set shows no prediction errors.

However, even though the model reaches an accuracy of $100\%$ on the test set, this does not mean it will never fail (i.e. produce false positives or false negatives) in actual usage.
Keep in mind that the dataset does not contain images of all possible non-face objects, and the faces in the Olivetti dataset are always well exposed; errors are therefore possible on real images where faces have high contrast or are over- or underexposed.

These aspects are investigated in Notebook 2.
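
One simple way to probe exposure sensitivity, not necessarily the approach taken in Notebook 2, is to apply a gamma correction to the test images and score the model again. The helpers `simulate_exposure` and `accuracy_by_exposure` below are hypothetical, and the gamma values are arbitrary choices.

```python
import numpy as np

def simulate_exposure(images, gamma):
    """Gamma-correct images in [0, 1]: gamma < 1 brightens, gamma > 1 darkens."""
    return np.clip(images, 0.0, 1.0) ** gamma

def accuracy_by_exposure(model, images, labels, gammas=(0.4, 1.0, 2.5)):
    """Score `model` (anything with a scikit-learn style .score) for each gamma."""
    return {g: model.score(simulate_exposure(images, g), labels) for g in gammas}

# Random images, only to show the effect of the transformation on brightness.
rng = np.random.default_rng(0)
demo_images = rng.random((4, 64, 64))
brighter = simulate_exposure(demo_images, 0.4)
darker = simulate_exposure(demo_images, 2.5)
print(brighter.mean(), demo_images.mean(), darker.mean())
```

With the objects defined in this notebook, `accuracy_by_exposure(search.best_estimator_, X_test, y_test)` would return one accuracy value per simulated exposure level.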

#### Pipeline export

The best model found is exported with the joblib module.

In [16]:
import joblib
joblib.dump(search.best_estimator_,"model.joblib")

['model.joblib']
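
The saved file can later be restored with `joblib.load`. The sketch below shows the round trip on a small stand-in pipeline trained on synthetic data; `demo_model` and the temporary file name are placeholders.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small stand-in model, so the example does not depend on model.joblib.
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=0)
demo_model = Pipeline([("scaler", StandardScaler()), ("svc", SVC())]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(demo_model, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
identical = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print(identical)
```

For the real `model.joblib`, the `ImageResizer` and `HOGFeatureExtractor` modules must be importable in the environment that loads the file, because the pickled pipeline only stores references to those classes. Loading with the same scikit-learn version used for training is also advisable.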