### Building an SVM Model for the Face Recognition Problem

We will use a well-known dataset, Labeled Faces in the Wild (LFW), a collection of face images of famous people
that is available at http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz.

However, note that it can be easily imported via scikit-learn from the sklearn.datasets module.
With the loading call used below (min_faces_per_person=60), the subset contains 1348 images of 8 people, and each image consists of 2914 features (one per pixel): we could proceed by simply using each of them in the model.



Fitting an SVM to non-linear data using the kernel trick produces non-linear decision boundaries.
In particular, we seek to:
* Build an SVM model with a radial basis function (RBF) kernel
* Use grid search cross-validation to explore combinations of parameters.
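
As a rough illustration of what the RBF kernel computes, the sketch below evaluates K(x, x') = exp(-γ ||x - x'||²) by hand on a few made-up points and compares the result with scikit-learn's rbf_kernel. The helper name rbf_kernel_manual, the random data, and gamma=0.1 are assumptions chosen only for this example.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def rbf_kernel_manual(A, B, gamma):
    """Return the matrix K with K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    # Squared Euclidean distance between every row of A and every row of B
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 3))  # 5 made-up samples with 3 features each

K_manual = rbf_kernel_manual(X_demo, X_demo, gamma=0.1)
K_sklearn = rbf_kernel(X_demo, X_demo, gamma=0.1)

# The two matrices should agree up to floating-point error
print(np.allclose(K_manual, K_sklearn))
```

A larger gamma makes the kernel value fall off faster with distance, so each training point influences a smaller neighbourhood.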

### Steps to do:

1. Load the data from sklearn.datasets:

In [None]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)

2. Since the data is accessed through the sklearn.datasets module, explore the dataset before modelling.
    - (Referring to the first 6 steps in Lab_1 could help you.)

a- Print the field names (that is, the keys to the dictionary) (1 point)

In [None]:
# What fields are in the dictionary?
print(faces.keys())

dict_keys(['data', 'images', 'target', 'target_names', 'DESCR'])
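
The object returned by the loader is a scikit-learn Bunch, which behaves like a dictionary whose keys can also be read as attributes. The sketch below shows this on a small hand-made Bunch; the name toy and its contents are invented for illustration and are not the face data.

```python
import numpy as np
from sklearn.utils import Bunch

# A tiny stand-in Bunch with two made-up fields
toy = Bunch(data=np.zeros((4, 6)), target=np.array([0, 1, 0, 1]))

print(toy.keys())                # dictionary-style listing of the fields
print(toy["data"] is toy.data)   # key access and attribute access return the same object
```

This is why both faces.keys() and faces.DESCR work on the same object.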


b- Print the dataset description contained in the DESCR field (2 points)

In [None]:
print(faces.DESCR)

.. _labeled_faces_in_the_wild_dataset:

The Labeled Faces in the Wild face recognition dataset
------------------------------------------------------

This dataset is a collection of JPEG pictures of famous people collected
over the internet, all details are available on the official website:

    http://vis-www.cs.umass.edu/lfw/

Each picture is centered on a single face. The typical task is called
Face Verification: given a pair of two pictures, a binary classifier
must predict whether the two images are from the same person.

An alternative task, Face Recognition or Face Identification is:
given the picture of the face of an unknown person, identify the name
of the person by referring to a gallery of previously seen pictures of
identified persons.

Both Face Verification and Face Recognition are tasks that are typically
performed on the output of a model trained to perform Face Detection. The
most popular model for Face Detection is called Viola-Jones and is
implemented in the OpenC

3. Print the data, its shape, and the target names. (3 points)

In [None]:
# What does the data look like?
print(faces.data)

[[0.53464055 0.5254902  0.49673203 ... 0.00653595 0.00653595 0.00261438]
 [0.28627452 0.20784314 0.2522876  ... 0.96993464 0.9490196  0.9346406 ]
 [0.31895426 0.39215687 0.275817   ... 0.4261438  0.7908497  0.9555555 ]
 ...
 [0.11633987 0.11111111 0.10196079 ... 0.5686274  0.5803922  0.5542484 ]
 [0.19346406 0.21176471 0.2901961  ... 0.6862745  0.654902   0.5908497 ]
 [0.12287582 0.09803922 0.10980392 ... 0.12941177 0.1633987  0.29150328]]


In [None]:
# What is the shape of the data?
print(faces.data.shape)

(1348, 2914)
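
Each of the 2914 columns is one pixel of a 62 × 47 image that has been flattened into a single row. The sketch below uses a stand-in vector rather than the real data to show how a row maps back to its 2-D form; the names h, w, flat and image are chosen for this example only.

```python
import numpy as np

h, w = 62, 47  # image height and width; 62 * 47 = 2914

# Stand-in for one row of faces.data (the real row holds pixel intensities)
flat = np.arange(h * w, dtype=np.float32)

# Reshape the flat feature vector back into a 2-D image
image = flat.reshape(h, w)
print(image.shape)

# Flattening the image again recovers the original row
print(np.array_equal(image.ravel(), flat))
```

With the real dataset, the images field already holds this 2-D form for every sample, so a call such as matplotlib's imshow on one of its entries should display the corresponding face.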


In [None]:
# What are the target names?
print(faces.target_names)

['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush'
 'Gerhard Schroeder' 'Hugo Chavez' 'Junichiro Koizumi' 'Tony Blair']


4. Divide the data into features (X) using the faces.data and target (y) using faces.target (2 points)

In [None]:
X = faces.data
y = faces.target

5. Split the data into training and testing sets. (2 points)

We train the model with 70% of the samples and test with the remaining 30%.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the sizes of the training and test sets to verify that the split worked as expected.
print("Training set size:")
print("- X_train (features):", X_train.shape)
print("- y_train (target):", y_train.shape)

print("\nTest set size:")
print("- X_test (features):", X_test.shape)
print("- y_test (target):", y_test.shape)

Training set size:
- X_train (features): (943, 2914)
- y_train (target): (943,)

Test set size:
- X_test (features): (405, 2914)
- y_test (target): (405,)
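
A plain random split does not guarantee that every person appears in the same proportion in both sets. As an optional variation that is not what the cell above does, train_test_split accepts a stratify argument that preserves class proportions. The sketch below demonstrates it on made-up imbalanced labels; X_demo and y_demo are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 samples of class 0 and 20 of class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```

For the face data this would correspond to passing stratify=y in the call above, which changes which samples land in each set and therefore the reported scores.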


6. Declare SVM model with kernel='rbf', class_weight='balanced' (2 points)

In [None]:
from sklearn.svm import SVC


svm_model = SVC(kernel='rbf', class_weight='balanced')
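
The class_weight='balanced' option weights each class inversely to its frequency, using n_samples / (n_classes * count_per_class). The sketch below reproduces that formula on made-up labels and checks it against scikit-learn's compute_class_weight helper; y_demo and the class sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels with three classes of very different sizes
y_demo = np.array([0] * 10 + [1] * 30 + [2] * 60)
classes = np.unique(y_demo)

# Formula behind class_weight='balanced'
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

# The same weights computed by scikit-learn
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

print(manual)
print(weights)
```

Rare classes receive larger weights, so errors on people with few images cost more during training.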

7. Use grid search cross-validation with 10 folds (cv=10) to explore every combination of parameters in the grid. (3 points)
    - we will adjust C, which controls the trade-off between a wide margin and misclassified training points
    - and gamma (γ), which controls the width of the radial basis function kernel, and determine the best model.

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [1, 5, 10, 50], 'gamma': [0.001, 0.0005, 0.01, 0.1]}


grid_search = GridSearchCV(svm_model, param_grid, cv=10)
grid_search.fit(X_train, y_train)
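
The grid above has 4 × 4 = 16 parameter combinations, so with 10 folds the search fits 160 models and then, by default, refits the best combination on the whole training set. The sketch below shows how the outcome of such a search can be inspected. It runs on a small synthetic dataset with a reduced grid so that it finishes quickly; the demo_ names, the synthetic data, and the smaller grid are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic 3-class problem standing in for the face data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Reduced grid: 2 x 2 = 4 combinations
demo_grid = {"C": [1, 10], "gamma": [0.001, 0.01]}
demo_search = GridSearchCV(SVC(kernel="rbf", class_weight="balanced"), demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

print("best parameters:", demo_search.best_params_)
print("best mean CV accuracy:", demo_search.best_score_)
print("combinations tried:", len(demo_search.cv_results_["params"]))
```

On the fitted grid_search object from the cell above, the same attributes (best_params_, best_score_, cv_results_) report which values of C and gamma were selected.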

8. Predict on the test set using the best model from the step above (best_estimator_) (5 points)

In [None]:
y_pred = grid_search.best_estimator_.predict(X_test)
print("Predictions:", y_pred)

Predictions: [3 4 3 6 6 1 3 3 3 3 3 3 4 3 3 1 7 2 3 2 7 7 5 5 0 3 6 7 3 3 0 6 3 3 2 3 2
 3 3 2 3 3 7 1 3 3 0 2 1 2 7 3 4 6 7 3 7 1 7 0 4 2 7 2 5 4 7 3 4 3 1 3 4 1
 3 4 0 4 3 3 1 3 1 0 7 2 3 2 7 0 1 1 2 3 3 1 7 3 3 1 3 7 1 4 3 3 0 3 7 3 3
 1 0 1 3 1 3 2 3 4 7 7 5 4 3 3 3 3 2 2 3 3 0 3 4 3 4 1 2 1 7 6 5 3 3 1 0 3
 4 4 3 2 7 1 7 1 3 7 1 4 6 1 2 3 2 3 1 7 2 2 1 7 3 3 1 1 1 3 3 1 0 0 1 1 7
 1 1 5 3 4 3 3 7 5 6 3 7 7 3 2 2 3 2 3 3 6 3 3 1 7 3 6 1 3 3 1 1 7 6 3 1 3
 1 7 7 2 7 7 5 7 1 3 3 2 3 4 7 2 3 1 3 7 7 1 4 3 1 1 5 4 2 3 4 1 1 1 2 2 3
 7 3 7 3 7 3 1 3 1 3 1 1 1 3 3 1 4 4 3 1 4 1 0 0 3 2 0 2 5 1 3 3 6 2 1 3 6
 3 1 3 5 1 1 1 3 3 3 3 3 3 5 0 1 1 3 3 1 1 7 1 1 6 3 1 1 5 7 3 2 1 3 1 3 3
 7 3 1 0 3 6 3 7 3 3 0 5 3 1 6 3 0 1 1 0 1 0 4 1 3 2 2 3 1 4 4 2 3 1 3 1 1
 3 1 3 2 3 2 7 1 7 3 1 2 6 4 1 7 3 3 3 3 3 2 3 3 5 1 1 2 5 6 2 5 7 3 3]
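
The predictions are integer class indices, which are easier to read once mapped back to names. The sketch below does this for the first six predictions printed above, using the target names listed earlier; the variable names are chosen for this example.

```python
import numpy as np

# Target names in index order, as printed in step 3
target_names = np.array([
    "Ariel Sharon", "Colin Powell", "Donald Rumsfeld", "George W Bush",
    "Gerhard Schroeder", "Hugo Chavez", "Junichiro Koizumi", "Tony Blair",
])

# First six predicted labels from the output above
y_pred_demo = np.array([3, 4, 3, 6, 6, 1])

# Indexing the name array with the label array maps each index to a name
predicted_names = target_names[y_pred_demo]
print(predicted_names)
```

In the notebook itself, the equivalent expression would be faces.target_names[y_pred].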


9. Model performance:
Run the following code to print the model evaluation metrics.

In [None]:
from sklearn.metrics import classification_report
labels = list(faces.target_names)
print(classification_report(y_test,y_pred,target_names=labels))

                   precision    recall  f1-score   support

     Ariel Sharon       0.59      0.76      0.67        17
     Colin Powell       0.80      0.83      0.81        84
  Donald Rumsfeld       0.67      0.81      0.73        36
    George W Bush       0.90      0.86      0.88       146
Gerhard Schroeder       0.69      0.71      0.70        28
      Hugo Chavez       1.00      0.63      0.77        27
Junichiro Koizumi       0.89      1.00      0.94        16
       Tony Blair       0.82      0.78      0.80        51

         accuracy                           0.81       405
        macro avg       0.79      0.80      0.79       405
     weighted avg       0.83      0.81      0.82       405
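
The precision and recall columns in a report like this are derived from a confusion matrix. The sketch below builds one for a small made-up 3-class example and recovers per-class recall and precision from it; the toy labels are assumptions and are unrelated to the face data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true and predicted labels for three classes
y_true_demo = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 2, 2, 2, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Recall: correct predictions divided by the number of true samples (row sums)
recall = np.diag(cm) / cm.sum(axis=1)
# Precision: correct predictions divided by the number of predictions (column sums)
precision = np.diag(cm) / cm.sum(axis=0)

print("recall:   ", recall)
print("precision:", precision)
```

Calling confusion_matrix(y_test, y_pred) in the notebook would show which people the model confuses with one another.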



10. Observations on model performance (5 points)

- __Performance Variation:__ The model's performance varies across classes: it is strongest on "Junichiro Koizumi" (f1-score 0.94) and weakest on "Ariel Sharon" (f1-score 0.67).

- __High Precision for Some:__ "Hugo Chavez" has a precision of 1.00, so every image predicted as him was correct, but his recall of 0.63 means that more than a third of his test images were assigned to other people.

- __Balanced Metrics:__ "Colin Powell" (precision 0.80, recall 0.83) and "Tony Blair" (precision 0.82, recall 0.78) have similar precision and recall, so the model neither over-predicts nor under-predicts these classes by much.

- __Overall Accuracy:__ The model achieves an overall accuracy of 0.81 on the 405 test images, well above the 0.36 that always predicting the largest class ("George W Bush", 146 of 405 images) would give.

- __Averages Insight:__ The macro-average f1-score (0.79) is slightly lower than the weighted average (0.82), hinting at better performance on classes with more samples, such as "George W Bush" (f1-score 0.88, support 146).
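
The macro and weighted averages in the report can be reproduced from the per-class rows. The sketch below uses the rounded f1-scores and supports printed in the report, so the results agree with the reported averages only up to rounding.

```python
import numpy as np

# Per-class f1-scores and supports, copied from the classification report
f1 = np.array([0.67, 0.81, 0.73, 0.88, 0.70, 0.77, 0.94, 0.80])
support = np.array([17, 84, 36, 146, 28, 27, 16, 51])

# Macro average: every class counts equally
macro_f1 = f1.mean()

# Weighted average: each class counts in proportion to its support
weighted_f1 = np.average(f1, weights=support)

print(f"macro f1:    {macro_f1:.4f}")
print(f"weighted f1: {weighted_f1:.4f}")
```

The two values come out at about 0.79 and 0.82. The weighted figure is higher because the largest class, "George W Bush", also has one of the highest f1-scores.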


