# Validation Notes

For the last few weeks we have split the dataset into three sets (training, validation and test), as described in lecture 2, slides 21-24. I did it this way to emphasise how the test set should be used. I also wanted you to see confusion_matrix and classification_report, which are easier to demonstrate this way.

However, this is not always the best way of doing things, particularly if we have limited data. A better way of doing things is to use a cross-validation score to make decisions instead.

So here's the idea:
1. Split the data into training and test data (2 sets)
2. With the training data, try to come up with a final model
3. Score every model you build using a cross-validation score, using KFold, cross_val_score or GridSearchCV
4. Your final model is the one whose hyperparameters give the highest cross-validation score
5. Evaluate the final chosen model on the test set (classification_report as well as everything else)
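
The five steps above can be sketched roughly as follows. This is only an illustration: it uses scikit-learn's small built-in digits dataset as a stand-in (an assumption, so that it runs quickly), and the two candidate models and the names `X_demo`, `candidates` and `cv_means` are made up for the example.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Stand-in data (an assumption): the built-in digits set
X_demo, y_demo = load_digits(return_X_y=True)

# Step 1: two sets only, training and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Steps 2-3: score each candidate with cross-validation on the training data only
candidates = {
    "svc_linear": SVC(kernel="linear", C=2.0),
    "svc_rbf": SVC(kernel="rbf", C=2.0),
}
cv_means = {
    name: cross_val_score(model, X_tr, y_tr, cv=5).mean()
    for name, model in candidates.items()
}

# Step 4: keep the candidate with the highest mean cross-validation score
best_name = max(cv_means, key=cv_means.get)

# Step 5: refit on all the training data, then touch the test set once
final_model = candidates[best_name].fit(X_tr, y_tr)
report = classification_report(y_te, final_model.predict(X_te))

print(cv_means)
print("chosen:", best_name)
print(report)
```

Note that the test set is not used anywhere in steps 2 to 4; it only appears in the final evaluation.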

With cross_val_score you don't have to rely only on accuracy; that is just the default for classifiers. Look at the manual, https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html, or run cross_val_score? in a notebook cell. For example, you can get different F1 scores:

| F1 Type     | Best for                          | Handles Imbalance? | Formula Basis              |
|------------|----------------------------------|------------------|----------------------------|
| **Macro**  | Equal importance to all classes  | ❌ No            | Mean of per-class F1       |
| **Weighted** | Reflects dataset distribution  | ✅ Yes           | Weighted mean of per-class F1 |
| **Micro**  | Overall accuracy-based evaluation | ✅ Yes           | Global precision & recall  |

To get these, set the scoring parameter of cross_val_score to f1_macro, f1_weighted or f1_micro. GridSearchCV takes the same scoring parameter.
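
As a small sketch of the scoring parameter in use, the example below runs cross_val_score with each F1 variant and then passes the same string to GridSearchCV. The imbalanced dataset comes from make_classification and is purely synthetic (an assumption for illustration), as are the variable names and the tiny C grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic, deliberately imbalanced 3-class data (illustration only)
X_imb, y_imb = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0,
)

# Same model, same folds, different scoring strings
scores_macro = cross_val_score(SVC(), X_imb, y_imb, cv=5, scoring="f1_macro")
scores_weighted = cross_val_score(SVC(), X_imb, y_imb, cv=5, scoring="f1_weighted")
scores_micro = cross_val_score(SVC(), X_imb, y_imb, cv=5, scoring="f1_micro")
scores_acc = cross_val_score(SVC(), X_imb, y_imb, cv=5, scoring="accuracy")

print("macro   :", scores_macro.mean())
print("weighted:", scores_weighted.mean())
print("micro   :", scores_micro.mean())
print("accuracy:", scores_acc.mean())

# GridSearchCV accepts the same scoring strings
search_f1 = GridSearchCV(SVC(), {"C": [0.5, 2.0]}, cv=5, scoring="f1_macro")
search_f1.fit(X_imb, y_imb)
print(search_f1.best_params_, search_f1.best_score_)
```

For a single-label multiclass problem like this one, the micro F1 values match plain accuracy fold for fold, which is why the table describes micro as accuracy-based.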

# Face Recognition

In this lab we are going to build a facial recognition model using Support Vector Machines. The state of the art for image processing is to use Convolutional Neural Networks, but we'll try SVMs here to see how they do.

fetch_lfw_people can get us these images. We will keep everyone in the dataset who has at least 60 images. This may take a while to run, as it downloads the images.

In [2]:
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60, resize=1)
print(faces.target_names)
print(faces.images.shape)

['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush'
 'Gerhard Schroeder' 'Hugo Chavez' 'Junichiro Koizumi' 'Tony Blair']
(1348, 125, 94)


So we have 8 different people to train our model on and 1348 images in total. The image resolution is 125x94.

In [3]:
faces.data.shape

(1348, 11750)

Why does this have shape 1348, 11750?

In [4]:
125*94

11750
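
To make the link between the two shapes concrete, here is a minimal sketch using random arrays as a stand-in for the images (an assumption, so nothing needs downloading). It flattens each 125x94 array into one row of 11750 values and then reshapes a row back again.

```python
import numpy as np

# Hypothetical stand-in: 3 random "images" with the same 125x94 resolution
rng = np.random.default_rng(0)
images_demo = rng.random((3, 125, 94), dtype=np.float32)

# Flatten every image into a single row of pixels
flat_demo = images_demo.reshape(len(images_demo), -1)
print(flat_demo.shape)  # (3, 11750)

# A row can be turned back into a 2D image for plotting
restored = flat_demo[0].reshape(125, 94)
print(np.array_equal(restored, images_demo[0]))  # True
```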

Let's check how many we have of each person

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Count images per person
unique_people, counts = np.unique(faces.target, return_counts=True)

# Get corresponding names
person_names = [faces.target_names[i] for i in unique_people]

# Plot histogram
plt.figure(figsize=(12, 6))
plt.barh(person_names, counts, color='skyblue')
plt.xlabel("Number of Images")
plt.ylabel("Person")
plt.title("Histogram of Images per Person in LFW Dataset")
plt.gca().invert_yaxis()  # Invert y-axis for better readability
plt.show()

Is this a balanced set?

Let's set up our X and y. faces.data and faces.target seem like good places to go!

In [6]:
# your code here


In [16]:
X[0]

array([0.5254902 , 0.5176471 , 0.5058824 , ..., 0.00653595, 0.00261438,
       0.        ], dtype=float32)

This shows the images are greyscale, with values that look to be between 0 and 1.

Do a train/test split as usual, and set a random_state.

In [7]:
# your code here


In [23]:
# your code here


Do a quick test with things we know: build and test a linear SVM with C=2, an rbf SVM with C=2, a LogisticRegression model and a k-nearest neighbours (kNN) model with n_neighbors=5.

In [10]:
# import the model types



In [None]:
model = SVC(kernel='linear', C=2.0)
%time cross_val_score(model, X_train, y_train)
%time model.fit(X_train, y_train)

In [None]:
# knn


In [None]:
# logisticregression


In [None]:
# rbf SVC


### My results

When I ran it, LogisticRegression got the best test score (but took ages to run: two and a half minutes!), then SVC with a linear kernel, then SVC with an rbf kernel (all relatively close scores), and finally kNN, which was a good bit behind.

SVC likes its inputs to be normalised. These images already have values between 0 and 1, so scaling may not do much, and with images maybe it should not even be attempted!

In [21]:
from sklearn.pipeline import make_pipeline 
from sklearn.preprocessing import StandardScaler 
from sklearn.preprocessing import MinMaxScaler

In [None]:
# your code here 


A small improvement with StandardScaler
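
For reference, one way a scaler and an SVC can be combined is with make_pipeline, so that the scaler is refitted on the training part of each cross-validation fold. The sketch below uses the built-in digits data as a stand-in (an assumption), so its numbers say nothing about the faces results; `scaled_scores` is just an illustrative name.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Stand-in data (an assumption): built-in digits, not the faces
X_dig, y_dig = load_digits(return_X_y=True)

scaled_scores = {}
for name, scaler in [("standard", StandardScaler()), ("minmax", MinMaxScaler())]:
    # The pipeline fits the scaler inside each fold, so no validation
    # data leaks into the scaling step
    pipe = make_pipeline(scaler, SVC(kernel="rbf", C=2.0))
    scaled_scores[name] = cross_val_score(pipe, X_dig, y_dig, cv=5).mean()

print(scaled_scores)
```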

## Cross Validation
Now let's concentrate on SVC and pick the best one we can for this dataset. Import GridSearchCV and use that.

In [24]:
# your code here


You're going to want to try different values of C, different gammas and different kernels (linear, rbf, poly)

While you can do them all in one param_grid, it might take longer, as it will go through hyperparameters that have no effect on some kernels, wasting time.

Look at the breast cancer example for how I approached it there
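
To illustrate the point about wasted combinations, GridSearchCV also accepts a list of dictionaries, one per kernel, so each kernel is only paired with the hyperparameters it uses. The sketch below is not the breast cancer example itself: the grid values are arbitrary placeholders and the data is a slice of the built-in digits set (both assumptions).

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

# One dictionary per kernel (all values here are placeholders)
param_grid = [
    {"kernel": ["linear"], "C": [0.5, 2.0, 10.0]},
    {"kernel": ["rbf"], "C": [0.5, 2.0, 10.0], "gamma": ["scale", 0.001]},
    {"kernel": ["poly"], "C": [0.5, 2.0], "degree": [2, 3]},
]

# The same values crammed into a single dictionary
single_grid = {
    "kernel": ["linear", "rbf", "poly"],
    "C": [0.5, 2.0, 10.0],
    "gamma": ["scale", 0.001],
    "degree": [2, 3],
}

n_split = len(ParameterGrid(param_grid))    # 3 + 6 + 4 = 13 candidates
n_single = len(ParameterGrid(single_grid))  # 3 * 3 * 2 * 2 = 36 candidates
print(n_split, n_single)

# Small stand-in dataset so the search finishes quickly
X_small, y_small = load_digits(return_X_y=True)
X_small, y_small = X_small[:500], y_small[:500]

search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_small, y_small)
print(search.best_params_, search.best_score_)
```

Many of the 36 single-dictionary candidates differ only in a setting the kernel ignores (degree for linear and rbf, gamma for linear), so they repeat the same fit.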

In [25]:
# your code goes after here


Record the best score you have had with all your cross validation

You could try it all using a StandardScaler but I think it might take a lot longer to run

When you are done, take the best overall model parameters. Fit the model and time it with %time, as I did earlier. Do the same when scoring the model on the test set.

Then do a classification report and confusion matrix
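
As a reminder of how the two reports are read, here is a tiny hand-made example. The labels, predictions and class names are invented for illustration; in the lab they would come from the test set and the fitted model.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Invented labels and predictions for three hypothetical classes
y_true_demo = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2])
names_demo = ["A", "B", "C"]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Per-class precision, recall and F1, plus macro and weighted averages
print(classification_report(y_true_demo, y_pred_demo, target_names=names_demo))
```

The diagonal of the matrix holds the correct predictions (7 of 9 here); an off-diagonal entry such as the 1 in row A, column B is an A image predicted as B.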

Tell a story about your results, explain what you think is going on and interpret the above reports

Now try the whole thing all over again with

                    faces = fetch_lfw_people(min_faces_per_person=60, resize=0.5)

This will make the images smaller. Do you get the same or better results? Training should be a lot faster, as there are far fewer features.

When doing train_test_split, use the same random_state as you did previously. Does it give you the same split?
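
One way to check this rather than guess is to split an array of row indices alongside the features and compare which rows end up in the test set. The sketch below does that on toy arrays (an assumption standing in for the two image sizes); the names `X_big`, `X_half` and `idx` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 20
idx = np.arange(n_samples)

# Toy features: same rows, but the second array has half the columns
X_big = np.random.default_rng(0).random((n_samples, 8))
X_half = X_big[:, ::2]

# Splitting the index array alongside X records which rows went where
_, _, idx_train_big, idx_test_big = train_test_split(X_big, idx, random_state=42)
_, _, idx_train_half, idx_test_half = train_test_split(X_half, idx, random_state=42)

print(idx_test_big)
print(idx_test_half)
print(np.array_equal(idx_test_big, idx_test_half))
```

The same comparison can be run on the real data by passing np.arange(len(X)) as an extra array to each train_test_split call.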

Go through the same cross-validation to make your choices, and tell the same story with the classification report/confusion matrix and the timing of fitting/scoring. This model should be faster; I don't know about the accuracy.