Task 1 [50 points]: The goal of this task is to train and evaluate a convolutional neural network (CNN) model that can classify a given cropped input image as a face or not a face.

Create a Python function named
model, history = train_face_model(faces, nonfaces)

that accepts a dataset of face images and a dataset of non-face images as matrices and trains a CNN model on them.
The given face and non-face matrices are of shape (n_instances, height, width). Reshape the matrices as needed to pass them to your neural network for training. You also need to create appropriate labels for training.
The function returns two objects: the trained model and the training history.
We will use a dataset of 1000 faces and 1000 non-faces to train and evaluate the model. The first 700 images of each class will be passed to your function for training, and the remaining 300 images of each class will be used for testing. Your model is expected to achieve a very high classification accuracy. My reference solution achieves >0.99 accuracy on the test set. To get full credit, your solution should achieve at least 0.97.
Note: Before using the training data to train the model, you should scale their pixel values to the range [0, 1].
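
The data preparation described above (stacking the two classes, scaling, reshaping, and labeling) can be sketched with NumPy alone. This is a minimal sketch, not the reference solution: `prepare_training_data` is a hypothetical helper name, the divisor assumes 8-bit pixel values in 0-255, and the 31x25 image size of the synthetic arrays is borrowed from the face size used as an example in Task 2. Adding the channel axis to the data is one way to do the reshaping; a reshape step inside the model is an alternative.

```python
import numpy as np

def prepare_training_data(faces, nonfaces):
    """Stack both classes, scale pixels to [0, 1], and build 0/1 labels.

    Assumes 8-bit pixel values (0-255); adjust the divisor otherwise.
    """
    images = np.concatenate([faces, nonfaces], axis=0).astype("float32") / 255.0
    # Add a trailing channel axis: (n, height, width) -> (n, height, width, 1)
    images = images[..., np.newaxis]
    # Label 1 for faces, 0 for non-faces, in the same order as the stacked images
    labels = np.concatenate([np.ones(len(faces)), np.zeros(len(nonfaces))])
    return images, labels

# Synthetic stand-ins for the real datasets (random pixels, only the shapes matter)
rng = np.random.default_rng(0)
fake_faces = rng.integers(0, 256, size=(700, 31, 25), dtype=np.uint8)
fake_nonfaces = rng.integers(0, 256, size=(700, 31, 25), dtype=np.uint8)

x, y = prepare_training_data(fake_faces, fake_nonfaces)
print(x.shape, y.shape)  # (1400, 31, 25, 1) (1400,)
```

The images and labels still need to be shuffled together before training, since the stacked array holds all faces first and all non-faces second.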

In [2]:
import numpy as np
from sklearn.utils import shuffle
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.optimizers import Adam

def train_face_model(faces, nonfaces):
    # Define labels for the datasets (1 for faces, 0 for non-faces)
    num_faces = faces.shape[0]
    num_nonfaces = nonfaces.shape[0]
    labels = np.concatenate([np.ones(num_faces), np.zeros(num_nonfaces)])
    
    # Normalize pixel values to the range [0, 1]
    faces = faces / 255.0
    nonfaces = nonfaces / 255.0
    
    # Concatenate face and non-face images
    all_images = np.concatenate([faces, nonfaces], axis=0)
    
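    # Create a Sequential model. The inputs are grayscale arrays of shape
    # (height, width), so a Reshape layer adds the single channel axis
    # that Conv2D expects.
    height, width = faces.shape[1], faces.shape[2]
    model = Sequential()
    model.add(keras.Input(shape=(height, width)))
    model.add(layers.Reshape((height, width, 1)))
    model.add(Conv2D(32, (3, 3), activation='relu'))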
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Conv2D(128, (3, 3), activation='relu'))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    
    # Compile the model
    model.compile(optimizer=Adam(learning_rate=0.0001), loss='binary_crossentropy', metrics=['accuracy'])
    
    # Shuffle the data
    all_images, labels = shuffle(all_images, labels, random_state=42)
    
    # Split the shuffled data into training and validation sets.
    # Hold out 20% for validation so that most of the data is used for training.
    split_point = int(0.8 * len(all_images))
    x_train = all_images[:split_point]
    y_train = labels[:split_point]
    x_val = all_images[split_point:]
    y_val = labels[split_point:]
    
    # Train the model
    history = model.fit(x_train, y_train, epochs=10, batch_size=32, validation_data=(x_val, y_val))
    return model, history

In [21]:
import numpy as np
from train_face_model import train_face_model  # Assuming you have saved the function in a separate file

# Load your face and non-face datasets (make sure they are properly preprocessed and loaded as NumPy arrays)
# Example:
faces = np.load('data/faces1000.npy')
nonfaces = np.load('data/nonfaces1000.npy')

first_image = faces[0]
print(f'First image shape: {first_image.shape}')
print(f'First image values:\n{first_image}')

# Split the data as the task specifies: the first 700 images of each class
# for training, the remaining 300 for testing (no random shuffling here)
faces_train, faces_test = faces[:700], faces[700:]
nonfaces_train, nonfaces_test = nonfaces[:700], nonfaces[700:]

# Call the training function
model, history = train_face_model(faces_train, nonfaces_train)

# Evaluate the model on the test set
test_data = np.concatenate([faces_test, nonfaces_test], axis=0)
test_labels = np.concatenate([np.ones(300), np.zeros(300)])

# Normalize pixel values to the range [0, 1]
test_data = test_data / 255.0

test_loss, test_accuracy = model.evaluate(test_data, test_labels)
print(f'Test Accuracy: {test_accuracy}')

Faces dataset has the wrong number of dimensions. It should be a 3D NumPy array (height, width, channels).


IndexError: tuple index out of range

Task 2 [50 points]: Create a function
result = cnn_face_search(image, scale, model, face_size, result_number)
that can detect faces in photos at a given scale by using the face classification CNN model that we trained in task 1.

The function returns a tuple containing two elements:
A list of tuples, where each tuple contains the confidence score, center row and column coordinates of the detected face, and the top, bottom, left, and right coordinates of the bounding box around the face.
A numpy array containing the confidence scores for each pixel in the input image.
E.g.
results, scores = cnn_face_search(img, scale, model, face_size, result_number)

# Draw bounding boxes
for result in results:
    (max_val, best_row, best_col, top, bottom, left, right) = result
    cv2.rectangle(img, (left, top), (right, bottom), 255, 2)
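
The result tuples unpacked above could be assembled from a per-pixel score map roughly as follows. This is a minimal sketch under stated assumptions: `top_detections` is a hypothetical helper, `scores[r, c]` is assumed to hold the face probability of the window centered at `(r, c)`, and the bounding box is assumed to extend about half a face size on each side of that center. It does not suppress overlapping detections; whether the assignment expects that is not stated here.

```python
import numpy as np

def top_detections(scores, face_size, result_number):
    """Return the result_number highest-scoring locations as result tuples.

    scores: 2D array; scores[r, c] is the score of the window centered at (r, c).
    face_size: (rows, cols) of the face window in image coordinates.
    """
    face_rows, face_cols = face_size
    # Flat indices of the highest scores, best first
    order = np.argsort(scores, axis=None)[::-1][:result_number]
    results = []
    for flat_index in order:
        row, col = np.unravel_index(flat_index, scores.shape)
        row, col = int(row), int(col)
        # Box corners derived from the window center
        top = row - face_rows // 2
        left = col - face_cols // 2
        bottom = top + face_rows - 1
        right = left + face_cols - 1
        results.append((float(scores[row, col]), row, col, top, bottom, left, right))
    return results

# Toy score map with two clear peaks
scores = np.zeros((100, 100))
scores[40, 50] = 0.9
scores[70, 20] = 0.8
results = top_detections(scores, (31, 25), 2)
print(results[0])  # (0.9, 40, 50, 25, 55, 38, 62)
```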
Notes:
The solution to this task is very similar to the chamfer_search that you implemented in assignment 6, except instead of computing a distance from a template to each window (patch) in the image, you are computing the probability that the window contains a face using the trained CNN model from task 1.
As you iterate over all possible window positions in the input image, calling the model.predict() function thousands of times inside the loop is very slow. Instead, you should extract all windows and store them in a NumPy array of dimensions (n_windows, height, width). Then you can call model.predict(windows) once and get all the prediction probabilities. Each window is associated with a particular location in the original image, so you need to figure out a way to associate the top window candidates with the location they were extracted from in the image.
The scale is measured with respect to the original image face size. For example, if the original image face size is (31, 25) and the scale is 2.0, then the scaled face size will be (62, 50). Since you cannot scale the size of the trained model, you can instead resize the input image by a factor of 1.0/scale.
Ensure the input image pixel values are scaled to [0,1] before attempting to apply the prediction model.
My solution almost always detects the faces in faces.bmp at scale 2 correctly, as the top two results. However, for the faces in vjm.bmp, at scale 1, sometimes it detects all three of them as the top three results and sometimes it doesn't. That depends on how the model that I trained converged.
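
The notes on batching windows and on scale can be sketched as follows. This is one possible approach, not the reference solution: `extract_windows` and `to_original_coords` are hypothetical helper names, the image is assumed to be a 2D grayscale array already resized by 1.0/scale and scaled to [0, 1], and `sliding_window_view` requires NumPy 1.20 or newer. Because row `i` of `windows` and row `i` of `positions` describe the same window, the index of a prediction leads directly back to its location in the image.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def extract_windows(image, face_size):
    """Return all face_size windows of a 2D image and their top-left corners.

    windows has shape (n_windows, height, width); positions[i] is the
    (row, col) of the top-left corner of windows[i] in `image`.
    """
    height, width = face_size
    # view has shape (n_rows, n_cols, height, width) and shares memory with image
    view = sliding_window_view(image, (height, width))
    n_rows, n_cols = view.shape[:2]
    # Flattening the first two axes copies the data into one batch array
    windows = view.reshape(n_rows * n_cols, height, width)
    rows, cols = np.meshgrid(np.arange(n_rows), np.arange(n_cols), indexing="ij")
    positions = np.stack([rows.ravel(), cols.ravel()], axis=1)
    return windows, positions

def to_original_coords(row, col, face_size, scale):
    """Map a window's top-left corner in the resized image to a box
    (top, bottom, left, right) in the original image."""
    height, width = face_size
    top = int(round(row * scale))
    left = int(round(col * scale))
    bottom = top + int(round(height * scale)) - 1
    right = left + int(round(width * scale)) - 1
    return top, bottom, left, right

# Small synthetic image standing in for the resized photo
image = np.arange(40 * 30, dtype=np.float32).reshape(40, 30)
windows, positions = extract_windows(image, (31, 25))
print(windows.shape)  # (60, 31, 25)

# With a trained model, a single call would then score every window:
# probabilities = model.predict(windows).ravel()
# best_row, best_col = positions[np.argmax(probabilities)]
```

The batch array holds one copy of every window, so for large photos its memory footprint can become substantial; extracting and predicting in chunks of rows is one way to bound it.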