<a href="https://colab.research.google.com/github/DrAlexSanz/Faces/blob/master/Face_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to the first assignment of week 4! Here you will build a face recognition system. Many of the ideas presented here are from FaceNet. In lecture, we also talked about DeepFace.

**Face recognition** problems commonly fall into two categories:

* **Face Verification** - "is this the claimed person?". For example, at some airports, you can pass through customs by letting a system scan your passport and then verifying that you (the person carrying the passport) are the correct person. A mobile phone that unlocks using your face is also using face verification. This is a 1:1 matching problem.
* **Face Recognition** - "who is this person?". For example, the video lecture showed a face recognition [video](https://www.youtube.com/watch?v=wr4rx0Spihs) of Baidu employees entering the office without needing to otherwise identify themselves. This is a 1:K matching problem.

FaceNet learns a neural network that encodes a face image into a vector of 128 numbers. By comparing two such vectors, you can then determine if two pictures are of the same person.

In this assignment, you will:

* Implement the triplet loss function
* Use a pretrained model to map face images into 128-dimensional encodings
* Use these encodings to perform face verification and face recognition

In this exercise, we will be using a pre-trained model which represents ConvNet activations using a "channels first" convention, as opposed to the "channels last" convention used in lecture and previous programming assignments. In other words, a batch of images will be of shape $(m, n_C, n_H, n_W)$ instead of $(m, n_H, n_W, n_C)$. Both of these conventions have a reasonable amount of traction among open-source implementations; there isn't a uniform standard yet within the deep learning community.
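As a small illustration of the two layouts, here is a NumPy sketch that converts a "channels last" batch into a "channels first" one. The array name `batch_last`, the batch size and the marker value are made up for the example.

```python
import numpy as np

# Hypothetical batch of m = 4 RGB images of 96x96 pixels, "channels last".
batch_last = np.zeros((4, 96, 96, 3), dtype=np.float32)

# Mark one pixel: image 1, row 2, column 3, channel 0.
batch_last[1, 2, 3, 0] = 7.0

# Move the channel axis from the last position to position 1.
batch_first = np.transpose(batch_last, (0, 3, 1, 2))

print(batch_last.shape)   # (4, 96, 96, 3)
print(batch_first.shape)  # (4, 3, 96, 96)

# The same pixel is now addressed as (image, channel, row, column).
print(batch_first[1, 0, 2, 3])  # 7.0
```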

Let's load the required packages.

In [17]:
import numpy as np
import keras
import tensorflow as tf

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [18, 12]
import h5py

from PIL import Image

from keras import layers, optimizers
from keras.layers import Input, Dense, Conv2D, Activation, ZeroPadding2D, BatchNormalization, Flatten, Add
from keras.layers import AveragePooling2D, MaxPooling2D, Dropout, GlobalMaxPooling2D, GlobalAveragePooling2D

from keras.layers.merge import Concatenate
from keras.layers.core import Lambda, Flatten, Dense
from keras.initializers import glorot_uniform
from keras.engine.topology import Layer
from keras import backend as K
K.set_image_data_format('channels_first')

from keras.models import Model, Sequential

from keras.preprocessing import image

from keras.utils import layer_utils, plot_model, to_categorical

from keras.callbacks import History, ModelCheckpoint


%matplotlib inline

print("Everything imported correctly")

Everything imported correctly


In [18]:
!rm -rf Faces

!git clone https://github.com/DrAlexSanz/Faces.git

Cloning into 'Faces'...
remote: Enumerating objects: 85, done.

In [19]:
%cd "/content/Faces"

/content/Faces


In [0]:
from fr_utils import *
from inception_blocks_v2 import *

In Face Verification, you're given two images and you have to tell if they are of the same person. The simplest way to do this is to compare the two images pixel-by-pixel. If the distance between the raw images is less than a chosen threshold, it may be the same person!
Of course, this algorithm performs really poorly, since the pixel values change dramatically due to variations in lighting, orientation of the person's face, even minor changes in head position, and so on.

You'll see that rather than using the raw image, you can learn an encoding $f(img)$ so that element-wise comparisons of this encoding give more accurate judgements as to whether two pictures are of the same person.
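To make the weakness of the pixel-by-pixel comparison concrete, here is a toy sketch. A random array stands in for a face picture (an assumption purely for illustration; real photos are smoother, so the effect is less extreme but goes in the same direction), and `pixel_distance` is a hypothetical helper, not part of `fr_utils`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 96x96 grayscale "face" with values in [0, 1].
img = rng.random((96, 96))

# The same picture shifted 2 pixels to the right (wrapping at the border),
# standing in for a small change in head position.
shifted = np.roll(img, shift=2, axis=1)

# A brighter copy of the same picture, standing in for a lighting change.
brighter = np.clip(img + 0.2, 0.0, 1.0)

def pixel_distance(a, b):
    """L2 distance between two images compared pixel by pixel."""
    return np.linalg.norm(a - b)

print(pixel_distance(img, img))       # 0.0
print(pixel_distance(img, shifted))   # large, although it is the same "face"
print(pixel_distance(img, brighter))  # also clearly non-zero
```

Any fixed threshold on these raw distances would have to tolerate shifts and lighting changes while still separating different people, which is why a learned encoding is used instead.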

#1 - Encoding face images into a 128-dimensional vector


##1.1 - Using a ConvNet to compute encodings
The FaceNet model takes a lot of data and a long time to train. So, following common practice in applied deep learning, let's just load weights that someone else has already trained. The network architecture follows the Inception model from Szegedy et al. An Inception network implementation is provided; you can look in the file inception_blocks_v2.py (the module imported above) to see how it is implemented.

The key things you need to know are:

* This network uses 96x96 RGB images as its input. Specifically, it takes a face image (or a batch of $m$ face images) as a tensor of shape $(m, n_C, n_H, n_W) = (m, 3, 96, 96)$.
* It outputs a matrix of shape $(m, 128)$ that encodes each input face image into a 128-dimensional vector.

Run the cell below to create the model for face images.

In [0]:
FRmodel = faceRecoModel(input_shape=(3, 96, 96))

Let's see how many parameters I have now.

In [22]:
print("Total Params:", FRmodel.count_params())

Total Params: 3743280


Not bad. So the idea is this: the Inception network produces a 128-dimensional vector for each picture I have. Encodings of pictures of the same person in different situations should be closer than a reasonable threshold, and encodings of pictures of different people should be farther apart.

To explain it, I pass 2 pictures through the network. Then I compare the two outputs (distance, subtraction, or whatever), and with this I decide if they are the same person or not.

So say I have a picture A in my database. Then I get another one of the same person, A', and one of a different person, B. Training should push the encoding to minimize the distance between A and A' and maximize the distance between A and B. Careful: the distance is measured in the 128-dimensional encoding space. A is usually called the anchor, A' the positive and B the negative, and the loss built on these three is called the triplet loss. It's in the paper and the notation is confusing to me, but writing it down helps.
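Written out, with $P$ playing the role of A' (the positive) and $N$ the role of B (the negative), the quantity that the `triplet_loss` function computes over a batch of $m$ triplets is, as far as I can tell from the code:

$$\mathcal{J} = \sum_{i=1}^{m} \max\left( \lVert f(A^{(i)}) - f(P^{(i)}) \rVert_2^2 - \lVert f(A^{(i)}) - f(N^{(i)}) \rVert_2^2 + \alpha,\ 0 \right)$$

Here $f$ is the encoding, $\alpha$ is the margin, and a triplet contributes zero as soon as the negative is farther from the anchor than the positive by at least $\alpha$ (in squared distance).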

In [0]:
def triplet_loss(y_true, y_pred, alpha = 0.2):
    """
    Implements the triplet loss.
    
    y_true are the true labels. They are not used here, but Keras requires the argument.
    y_pred is a list with three objects
        --- anchor: the encoding of the anchor images, shape = (None, 128)
        --- positive: encodings of the positive images, shape = (None, 128)
        --- negative: encodings of the negative images, shape = (None, 128)
    
    alpha is a margin parameter. Manually chosen.
    
    returns the loss value (real number)
    
    """
    
    anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
    
    # Part 1: Squared distance from anchor to positive (sum over the last dimension, look at the shapes!)
    
    pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, positive)), axis = -1)
    
    # Part 2: Squared distance from anchor to negative
    
    neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, negative)), axis = -1)
    
    # Subtract these two distances and add alpha. Plain + and - work because TensorFlow overloads the operators for tensors.
    
    basic_loss = pos_dist - neg_dist + alpha
    
    # Now take the maximum between 0 and basic_loss. If it's 0 I can't reduce more.
    # Use 0.0 because 0 would be an int, not a float, and basic_loss is a float
    loss = tf.reduce_sum(tf.maximum(0.0, basic_loss)) # Sum the per-triplet losses over the batch axis
    
    return loss
    
    

In [24]:
with tf.Session() as test:
    
    tf.set_random_seed(1)
    y_true = (None, None, None) # I don't need this in this cell
    y_pred = (tf.random_normal([3, 128], mean=6, stddev=0.1, seed = 1),
              tf.random_normal([3, 128], mean=1, stddev=1, seed = 1),
              tf.random_normal([3, 128], mean=3, stddev=4, seed = 1))
    
    loss = triplet_loss(y_true, y_pred)
    print("Loss = " + str(loss.eval()))

Loss = 528.1427
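The random test above is hard to check by hand, so here is a NumPy sketch of the same computation on numbers small enough to verify mentally. `triplet_loss_np` is a hypothetical stand-in written for this note, not the TensorFlow function used for training, and the 2-dimensional encodings are made up.

```python
import numpy as np

def triplet_loss_np(anchor, positive, negative, alpha=0.2):
    """NumPy version of the triplet loss, for checking values by hand."""
    pos_dist = np.sum(np.square(anchor - positive), axis=-1)
    neg_dist = np.sum(np.square(anchor - negative), axis=-1)
    return np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0))

# Two triplets with made-up 2-dimensional encodings.
anchor   = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[1.0, 0.0], [0.1, 0.0]])
negative = np.array([[0.0, 1.0], [0.0, 2.0]])

# Triplet 1: pos_dist = 1.0,  neg_dist = 1.0 -> 1.0 - 1.0 + 0.2 = 0.2
# Triplet 2: pos_dist = 0.01, neg_dist = 4.0 -> negative value, clipped to 0
print(triplet_loss_np(anchor, positive, negative))  # approximately 0.2
```

The second triplet is already "good enough" (the negative is much farther than the positive), so only the first one contributes to the loss.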


Training this model takes a lot of time and resources, even in Colab. So I will compile the model and load pretrained weights instead (the weights take a couple of minutes to read).

In [0]:
FRmodel.compile(optimizer = "adam", loss = triplet_loss, metrics = ["accuracy"])

load_weights_from_FaceNet(FRmodel)

Now let's apply the model to some pictures. The following code takes each picture and produces its encoding; I just need to provide a dictionary to be filled, keyed by the person's name.

In [0]:
database = {}
database["danielle"] = img_to_encoding("images/danielle.png", FRmodel)
database["younes"] = img_to_encoding("images/younes.jpg", FRmodel)
database["tian"] = img_to_encoding("images/tian.jpg", FRmodel)
database["andrew"] = img_to_encoding("images/andrew.jpg", FRmodel)
database["kian"] = img_to_encoding("images/kian.jpg", FRmodel)
database["dan"] = img_to_encoding("images/dan.jpg", FRmodel)
database["sebastiano"] = img_to_encoding("images/sebastiano.jpg", FRmodel)
database["bertrand"] = img_to_encoding("images/bertrand.jpg", FRmodel)
database["kevin"] = img_to_encoding("images/kevin.jpg", FRmodel)
database["felix"] = img_to_encoding("images/felix.jpg", FRmodel)
database["benoit"] = img_to_encoding("images/benoit.jpg", FRmodel)
database["arnaud"] = img_to_encoding("images/arnaud.jpg", FRmodel)

Now, if all these people are the members, I can check anyone who shows up by taking their picture and comparing its encoding with the stored one. This is done with the next function. I assume the person swiped their card or entered a code or whatever, so I'm checking them against their own picture. This is not face recognition but ID verification (a 1:1 problem, not 1:K).

In [0]:
def verify(image_path, identity, model):
    
    """
    image_path: path to the picture of the person at the door.
    identity: a string with the name of someone, contained in database.
    model: the Keras model that computes the encodings, FRmodel.
    
    output
    dist: L2 distance between the encoding of image_path and the stored encoding of identity. Should be lower than the threshold.
    door_open: True or False.
    
    """
    
    # First compute the encoding of the person trying to enter
    encoding = img_to_encoding(image_path, model)
    
    # Now compute the distance (L2). identity is a variable holding the name, so the database lookup changes with each call
    dist = np.linalg.norm(encoding - database[identity])
    
    # If dist is less than 0.7, open, otherwise keep the door closed
    
    if dist < 0.7:
        door_open = True
    else:
        door_open = False
    
    return dist, door_open

In [28]:
!pwd

/content/Faces


Now I will verify someone. The first call should return True because I picked a picture that matches the claimed identity. In the second cell I run the same code with a different identity, which should be rejected.

In [30]:
verify("/content/Faces/images/camera_0.jpg", "younes", FRmodel)

(0.67100644, True)

In [33]:
verify("/content/Faces/images/camera_0.jpg", "danielle", FRmodel)

(1.2086712, False)

Now I want to do face recognition, not ID verification, to avoid needing the ID card or code or whatever. The idea is that someone comes, I take their picture and with this witchcraft I decide whether to open or not. In practice:


*   Encode picture
*   Calculate the distance by looping over my database. Save the minimum and if it's lower than the threshold I open.

Simple, right? But notice that I don't input an identity; I input the whole database to compare the picture to everyone.



In [0]:
def who_is_it(image_path, database, model):
    """
    image path is the picture
    database is the database of encodings + names
    model is the keras model that encodes, as before
    
    output:
    min_dist, the minimum distance
    identity a string, the prediction of the person's name
    
    """
    # Again I take the picture and encoding, as before
    
    encoding_who = img_to_encoding(image_path, model)
    
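    # I initialize min_dist at a huge level, 1e6 is ok, and identity at None
    # so the function still returns cleanly if the database is empty
    
    min_dist = 1e6
    identity = None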
    
    # Loop over the database dictionary
    
    for (name, db_encod) in database.items():
        
        #compute distance
        dist = np.linalg.norm(encoding_who - db_encod)
        
        # See if it's minimum or not.
        if dist < min_dist:
            min_dist = dist
            identity = name
    # And as before, I see if I open once I have found the minimum distance
    
    if min_dist < 0.7:
        print("It's ", identity)
    else:
        print("I don't know you, go somewhere else")
    
    
    return min_dist, identity
    

Now I check with the camera_0 picture. If I input a picture of someone outside the database it should reject it, but I won't do it because resizing the image is a hassle.

In [37]:
who_is_it("/content/Faces/images/camera_0.jpg", database, FRmodel)

It's  younes


(0.67100644, 'younes')
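The loop in `who_is_it` can also be written as a single vectorized lookup. The sketch below is an alternative under a few assumptions: `closest_identity` is a hypothetical name, the database is non-empty, and the toy encodings are 3-dimensional with made-up names instead of the real 128-dimensional ones. Unlike `who_is_it`, it returns `None` as the identity when nobody is close enough.

```python
import numpy as np

def closest_identity(query, database, threshold=0.7):
    """Return (min_dist, name), with name set to None if no entry is within threshold."""
    names = list(database.keys())
    # Flatten each stored encoding and stack them into a (K, d) matrix.
    encodings = np.stack([np.ravel(database[n]) for n in names])
    # One L2 distance per database entry.
    dists = np.linalg.norm(encodings - np.ravel(query), axis=1)
    best = int(np.argmin(dists))
    identity = names[best] if dists[best] < threshold else None
    return float(dists[best]), identity

# Toy database with made-up names and 3-dimensional encodings.
toy_db = {
    "alice": np.array([[1.0, 0.0, 0.0]]),
    "bob":   np.array([[0.0, 1.0, 0.0]]),
}

print(closest_identity(np.array([[0.9, 0.1, 0.0]]), toy_db))  # close to "alice"
print(closest_identity(np.array([[0.0, 0.0, 1.0]]), toy_db))  # far from everyone
```

For a dozen people the plain loop is perfectly fine; the vectorized form mainly pays off when K gets large.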

This was really useful; I understand how it works much better now. Some ways to improve the model:

* Put more images of each person (under different lighting conditions, taken on different days, etc.) into the database. Then given a new image, compare the new face to multiple pictures of the person. This would increase accuracy.
* Crop the images to just contain the face, and less of the "border" region around the face. This preprocessing removes some of the irrelevant pixels around the face, and also makes the algorithm more robust.

And the key takeaways:

* Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
* The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
* The same encoding can be used for verification and recognition. Measuring distances between two images' encodings allows you to determine whether they are pictures of the same person.
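The idea of storing several pictures per person can be sketched as follows. This is only one possible design: `verify_multi` is a hypothetical helper, the 2-dimensional encodings are made up, and taking the minimum distance is an assumption (averaging the distances would be another option, and the 0.7 threshold would probably need re-tuning either way).

```python
import numpy as np

def verify_multi(query, reference_encodings, threshold=0.7):
    """Accept if the query is close to at least one stored encoding of the person."""
    refs = np.stack([np.ravel(r) for r in reference_encodings])
    dists = np.linalg.norm(refs - np.ravel(query), axis=1)
    min_dist = float(dists.min())
    return min_dist, min_dist < threshold

# Two made-up encodings of the same person, e.g. under different lighting.
ref_a = np.array([1.0, 0.0])
ref_b = np.array([0.6, 0.6])

# A new picture that resembles the second reference much more than the first.
query = np.array([0.5, 0.7])

print(verify_multi(query, [ref_a]))         # rejected with a single reference
print(verify_multi(query, [ref_a, ref_b]))  # accepted thanks to the second one
```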