# Classification: Advanced SVM with Face Recognition

Now let's try something more challenging: facial recognition.

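SVMs choose the decision boundary that maximises the margin, that is, the distance between the boundary and the closest training points of each class. Those closest points are the support vectors.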
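
To make the idea of a margin concrete, here is a minimal sketch on a made-up, linearly separable toy dataset. The points, the names `X_toy`, `y_toy` and `clf`, and the very large `C` (used to approximate a hard margin) are all assumptions for illustration and are not part of the exercise.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (made up for illustration)
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

# Only the points closest to the boundary end up as support vectors
print("support vectors:")
print(clf.support_vectors_)

# For a linear kernel, the margin width is 2 / ||w||
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)
print("margin width: %.3f" % margin_width)
```

Points that sit far from the boundary do not appear among the support vectors, so removing them would leave the fitted boundary unchanged.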

References:
Faces recognition example using eigenfaces and SVMs: https://scikit-learn.org/stable/auto_examples/applications/plot_face_recognition.html

## Installation

In [None]:
%pip install numpy
%pip install matplotlib
%pip install scikit-learn

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from scipy.stats import loguniform

## Loading the LFW dataset

The Labeled Faces in the Wild (LFW) dataset is a well-known benchmark for facial recognition algorithms. Faces are easy for humans to identify but notoriously difficult for machines. The LFW dataset provides a fantastic mix of celebrities' faces in all sorts of lighting conditions, angles, and facial expressions.

In [None]:
# Load the LFW dataset for classes with at least 50 samples


# introspect the images arrays to find the shapes (for plotting)


# for machine learning we use the flattened pixel data directly (as relative
# pixel position info is ignored by this model)



# the label to predict is the id of the person


print("Total dataset size:")
print("n_samples: %d" % n_samples)
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)

<details><summary>Click to cheat</summary>

```python
# Load the LFW dataset for classes with at least 50 samples
lfw_people = fetch_lfw_people(min_faces_per_person=50, resize=0.4)

# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape

# for machine learning we use the flattened pixel data directly (as relative
# pixel position info is ignored by this model)
X = lfw_people.data
n_features = X.shape[1]

# the label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]

print("Total dataset size:")
print("n_samples: %d" % n_samples)
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)
```
</details>

Let's show some examples of what these photos look like.

In [None]:
def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits"""
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=0.01, right=0.99, top=0.90, hspace=0.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())

def title(y, target_names, i):
    return target_names[y[i]].rsplit(" ", 1)[-1]

prediction_titles = [
    title(y, target_names, i) for i in range(y.shape[0])
]

plot_gallery(X, prediction_titles, h, w)

## Reducing the dataset's complexity

Notice that there are more features than samples? This is a warning sign: with so many dimensions relative to the amount of training data, a model can easily overfit.

So we'll need to do some preprocessing to reduce the dataset's complexity.

In [None]:
# Split the data into training and testing sets


# Let's use z-scaling to ensure no features dominate others


<details><summary>Click to cheat</summary>

```python
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Let's use z-scaling to ensure no features dominate others
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
</details>
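
Note that the scaler is fitted on the training set only and then reused on the test set, so no information from the test data leaks into preprocessing. A minimal sketch of what that does, using made-up numbers (the arrays and the names `train_demo`, `test_demo` and `scaler_demo` are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
train_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
test_demo = np.array([[2.0, 700.0]])

scaler_demo = StandardScaler()
# fit_transform learns each feature's mean and standard deviation from the training data
train_scaled = scaler_demo.fit_transform(train_demo)
# transform reuses those training statistics; it does not refit on the test data
test_scaled = scaler_demo.transform(test_demo)

print(train_scaled.mean(axis=0))  # approximately 0 for each feature
print(train_scaled.std(axis=0))   # approximately 1 for each feature
print(test_scaled)
```

The scaled test values need not have zero mean or unit variance, because they are expressed relative to the training distribution.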

Even with z-scaling, there are still too many features to work with comfortably.

We need to reduce the number of features, which raises a question: *which features do we drop?*

Thankfully, we don't have to drop any outright. scikit-learn provides an algorithm that combines the original features into a smaller set of new ones: *Principal Component Analysis* (PCA).
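
One common way to judge how many components are enough is to look at the cumulative explained variance. The sketch below uses synthetic data rather than the face images. The data, the 5 hidden factors, and the names `X_demo` and `pca_demo` are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples with 50 features, driven by only 5 hidden factors
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

pca_demo = PCA(n_components=10).fit(X_demo)

# Cumulative share of the total variance captured by the first k components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(np.round(cumulative, 3))

# Each new feature is a weighted combination of all 50 original features
X_reduced = pca_demo.transform(X_demo)
print(X_demo.shape, "->", X_reduced.shape)
```

In this synthetic case the curve flattens after 5 components, matching the number of hidden factors. Real image data rarely has such a sharp cut-off, so the choice of `n_components` is more of a trade-off.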

In [None]:
# Choose a number of components
# It should be high enough to avoid losing too much information
# but not so high that we run into the curse of dimensionality
# n_components = 

# Create the PCA analyser and train it


# Transform the training and testing data


<details><summary>Click to cheat</summary>

```python
# Choose a number of components
# It should be high enough to avoid losing too much information
# but not so high that we run into the curse of dimensionality
n_components = 150

# Create the PCA analyser and train it
pca = PCA(n_components=n_components).fit(X_train)

# Transform the training and testing data
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
```
</details>

The principal components that PCA found can be reshaped back into images, which are known as eigenfaces. Let's see what they look like.

In [None]:
eigenfaces = pca.components_.reshape((n_components, h, w))
eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)

plt.show()

## Creating our classifier

Because we still have a large number of features, we'll use an SVM.

However, we don't know what our hyperparameters should be. Furthermore, there are **way** too many possible combinations of hyperparameters to comb through exhaustively. Instead, we'll do a randomized search, which tries only a fixed number of sampled combinations, to speed this process up.
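
As a rough sketch of how such a search behaves, the example below runs `RandomizedSearchCV` on a small synthetic problem. Rather than trying every combination, it draws `n_iter` candidates from the given distributions. The dataset, the parameter ranges, and the names `X_demo`, `y_demo`, `param_distributions` and `search` are assumptions chosen for illustration, not recommended settings for the faces data.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Small synthetic classification problem (stand-in for the real data)
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)

# Log-uniform distributions spread the samples evenly across orders of magnitude
param_distributions = {
    "C": loguniform(1e-1, 1e3),
    "gamma": loguniform(1e-4, 1e0),
}

# n_iter controls how many random combinations are evaluated with cross-validation
search = RandomizedSearchCV(
    SVC(kernel="rbf"), param_distributions, n_iter=5, cv=3, random_state=0
)
search.fit(X_demo, y_demo)

print("candidates tried:", len(search.cv_results_["params"]))
print("best parameters:", search.best_params_)
```

Raising `n_iter` explores more of the space at the cost of more model fits, so it is the main knob for trading search quality against run time.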

In [None]:
# Define the distributions to sample C and gamma values from


# Create the model using SVC and RandomizedSearchCV
# Use a RBF kernel and balanced class weights


# Train the model using the PCA data


# Display the best model found
print("Best estimator found by randomized search:")
print(model.best_estimator_)

<details><summary>Click to cheat</summary>

```python
# Define the distributions to sample C and gamma values from
param_grid = {
    "C": loguniform(1e3, 1e5),
    "gamma": loguniform(1e-4, 1e-1),
}

# Create the model using SVC and RandomizedSearchCV
# Use a RBF kernel and balanced class weights
model = RandomizedSearchCV(
    SVC(kernel='rbf', class_weight='balanced'),
    param_grid, n_iter=10
)

# Train the model using the PCA data
model.fit(X_train_pca, y_train)

# Display the best model found
print("Best estimator found by randomized search:")
print(model.best_estimator_)
```
</details>

### Confusion matrix of our model

With our best model known, let's see how well it performs.

In [None]:
# Get the model's predictions with the PCA test data


# Plot the confusion matrix
ConfusionMatrixDisplay.from_estimator(
    model, X_test_pca, y_test, display_labels=target_names, xticks_rotation="vertical"
)
plt.tight_layout()
plt.show()

<details><summary>Click to cheat</summary>

```python
# Get the model's predictions with the PCA test data
y_pred = model.predict(X_test_pca)

# Plot the confusion matrix
ConfusionMatrixDisplay.from_estimator(
    model, X_test_pca, y_test, display_labels=target_names, xticks_rotation="vertical"
)
plt.tight_layout()
plt.show()
```
</details>

Finally, let's look at our per-class performance scores.

In [None]:
print(classification_report(y_test, y_pred, target_names=target_names))
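
To see where the precision and recall columns of such a report come from, the sketch below derives them by hand from a confusion matrix. The labels are made up, and the names `y_true_demo` and `y_pred_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true and predicted labels for a 3-class problem
y_true_demo = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 0, 1, 1, 1, 2, 0, 2])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Precision: of everything predicted as class k, how much really was class k
precision = np.diag(cm) / cm.sum(axis=0)
# Recall: of everything that really was class k, how much was predicted as class k
recall = np.diag(cm) / cm.sum(axis=1)

print("precision:", np.round(precision, 3))
print("recall:   ", np.round(recall, 3))

# The hand-computed values agree with scikit-learn's per-class scores
print(np.allclose(precision, precision_score(y_true_demo, y_pred_demo, average=None)))
print(np.allclose(recall, recall_score(y_true_demo, y_pred_demo, average=None)))
```

Reading the report alongside the confusion matrix shows which people the model tends to confuse with each other, not just how often it is wrong.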