## Dimensionality Reduction: Exercise Faces

### Setup

This project requires Python 3.7 or above:

In [None]:
import sys

assert sys.version_info >= (3, 7)

It also requires Scikit-Learn ≥ 1.0.1:

In [None]:
from packaging import version
import sklearn

assert version.parse(sklearn.__version__) >= version.parse("1.0.1")

### The Olivetti Faces Dataset

*The classic Olivetti faces dataset contains 400 grayscale 64 × 64–pixel images of faces. Each image is flattened to a 1D vector of size 4,096. 40 different people were photographed (10 times each), and the usual task is to train a model that can predict which person is represented in each picture. Load the dataset using the `sklearn.datasets.fetch_olivetti_faces()` function.*

In [None]:
from sklearn.datasets import fetch_olivetti_faces

olivetti = fetch_olivetti_faces()

In [None]:
print(olivetti.DESCR)

In [None]:
olivetti.target

We show the first 100 images together with their labels.

In [None]:
import matplotlib.pyplot as plt

plt.rc('font', size=8)
plt.rc('axes', labelsize=8, titlesize=8)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)

def plot_faces(faces, labels, n_cols=10):
    faces = faces.reshape(-1, 64, 64)
    n_rows = (len(faces) - 1) // n_cols + 1
    plt.figure(figsize=(n_cols, n_rows * 1.1))
    for index, (face, label) in enumerate(zip(faces, labels)):
        plt.subplot(n_rows, n_cols, index + 1)
        plt.imshow(face, cmap="gray")
        plt.axis("off")
        plt.title(label)
    plt.show()

plot_faces(olivetti.data[:100, :], olivetti.target[:100])

Split the dataset into a training set and a test set (note that the pixel values are already scaled between 0 and 1).
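One possible way to do the split is sketched below. Because each person has only 10 images, the sketch stratifies on the label so that every person appears in both sets. The helper name `split_faces`, the 80/20 ratio, and the `random_state` value are assumptions, not part of the exercise.

```python
from sklearn.model_selection import train_test_split


def split_faces(X, y, test_size=0.2, random_state=42):
    """Stratified split (hypothetical helper): keeps the per-person
    proportions identical in the training and test sets."""
    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )


# Usage in this notebook (assumes `olivetti` was loaded as shown earlier):
# X_train, X_test, y_train, y_test = split_faces(olivetti.data, olivetti.target)
```

With 400 images and `test_size=0.2`, this leaves 320 images for training and 80 for testing, two per person in the test set.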

Train a `RandomForestClassifier` with 150 trees to classify the images.
- Check the duration of the training phase.
- What's the accuracy on the test set?
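A minimal sketch of how the training time and the test accuracy could be measured together. The helper `fit_and_score` is a hypothetical name, and `X_train`, `X_test`, `y_train`, `y_test` are assumed to come from your own split.

```python
import time

from sklearn.ensemble import RandomForestClassifier


def fit_and_score(model, X_train, y_train, X_test, y_test):
    """Fit `model`, returning (training time in seconds, test accuracy)."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    return elapsed, model.score(X_test, y_test)


# 150 trees as requested; random_state is an assumption for reproducibility.
rnd_clf = RandomForestClassifier(n_estimators=150, random_state=42)

# Usage (assumes the split variables exist):
# elapsed, accuracy = fit_and_score(rnd_clf, X_train, y_train, X_test, y_test)
# print(f"Training took {elapsed:.2f}s, test accuracy: {accuracy:.3f}")
```

`time.perf_counter()` measures wall-clock time, so results will vary between runs and machines; in a notebook, the `%time` magic is an alternative.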

Now apply PCA, keeping 99% of the variance, and train the classifier again on the reduced data.
- What percentage of the original dimensions is left?
- Check the duration of the training phase.
- What's the accuracy on the test set?
- Does this dataset benefit from PCA?
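The PCA step could look roughly like the sketch below. Passing a float between 0 and 1 as `n_components` tells Scikit-Learn's `PCA` to keep the smallest number of components that explains at least that fraction of the variance. The helper name `reduce_with_pca` is an assumption; note that PCA is fitted on the training set only and then applied to the test set.

```python
from sklearn.decomposition import PCA


def reduce_with_pca(X_train, X_test, variance=0.99):
    """Fit PCA on the training set only, then project both sets.

    Returns the fitted PCA, the reduced training and test sets, and the
    fraction of the original dimensions that is kept.
    """
    pca = PCA(n_components=variance)
    X_train_reduced = pca.fit_transform(X_train)
    X_test_reduced = pca.transform(X_test)
    kept_ratio = pca.n_components_ / X_train.shape[1]
    return pca, X_train_reduced, X_test_reduced, kept_ratio


# Usage (assumes the split variables exist):
# pca, X_train_red, X_test_red, kept = reduce_with_pca(X_train, X_test)
# print(f"{pca.n_components_} components kept ({kept:.1%} of the dimensions)")
```

The reduced arrays can then be passed to a fresh `RandomForestClassifier` to compare training time and accuracy with the unreduced run.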

Find the best parameter combination using grid search.
- Find the optimal number of PCA components.
- Find the optimal number of trees in the range 100 to 300 (in steps of 25).
- What accuracy does the best combination achieve?
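One way to search both hyperparameters at once is to chain PCA and the forest in a `Pipeline` and hand it to `GridSearchCV`, as sketched below. The helper name `build_grid_search`, the candidate PCA values, and `cv=3` are assumptions; only the tree range comes from the exercise.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def build_grid_search(pca_components, tree_counts, cv=3, n_jobs=None):
    """Grid search over PCA size and forest size (hypothetical helper)."""
    pipeline = Pipeline([
        ("pca", PCA(random_state=42)),
        ("forest", RandomForestClassifier(random_state=42)),
    ])
    # Step names prefix the parameter names: "<step>__<parameter>".
    param_grid = {
        "pca__n_components": list(pca_components),
        "forest__n_estimators": list(tree_counts),
    }
    return GridSearchCV(
        pipeline, param_grid, cv=cv, scoring="accuracy", n_jobs=n_jobs
    )


# 100, 125, ..., 300 trees, as stated in the exercise.
tree_counts = range(100, 301, 25)
# Candidate PCA settings are an assumption; floats are variance ratios.
pca_components = [0.90, 0.95, 0.99]

# Usage (assumes the split variables exist; n_jobs=-1 uses all CPU cores):
# grid_search = build_grid_search(pca_components, tree_counts, n_jobs=-1)
# grid_search.fit(X_train, y_train)
# print(grid_search.best_params_, grid_search.best_score_)
```

Putting PCA inside the pipeline means it is refitted on each cross-validation training fold, so the validation folds do not leak into the projection. `best_score_` is the mean cross-validated accuracy; the accuracy on the held-out test set can be obtained with `grid_search.score(X_test, y_test)`.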

Retrain with the best parameters found.
- Determine the training time and compare it with the original time without dimensionality reduction.
- What use is this information, given how much time it takes to find the optimal parameter combination?
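To time a retraining run fairly, the sketch below clones the estimator first, so the timed fit starts from an unfitted copy with the same hyperparameters. The helper name `timed_refit` is an assumption, as is the `grid_search` variable in the usage comment.

```python
import time

from sklearn.base import clone


def timed_refit(estimator, X_train, y_train):
    """Fit an unfitted copy of `estimator`; return (model, seconds)."""
    model = clone(estimator)  # same hyperparameters, no learned state
    start = time.perf_counter()
    model.fit(X_train, y_train)
    return model, time.perf_counter() - start


# Usage (assumes a fitted GridSearchCV named `grid_search` and the
# training time of the plain forest stored in `baseline_time`):
# best_model, best_time = timed_refit(
#     grid_search.best_estimator_, X_train, y_train
# )
# print(f"Best pipeline: {best_time:.2f}s vs. baseline: {baseline_time:.2f}s")
```

If the best estimator is a pipeline, the measured time includes fitting PCA as well as the forest, which is the relevant figure when comparing against the run without dimensionality reduction.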

...