# Face Recognition Using Convolutional Neural Network

This portfolio is about building a CNN model for face recognition using Keras.

##I. Import Libraries
You can run this notebook on Databricks after installing the libraries *keras*, *opencv-python*, and *tensorflow*.
Alternatively, you can run it locally; Anaconda is a good choice.

In [4]:
import os
import sys
import numpy as np
import cv2
import random
import keras
from sklearn.model_selection import train_test_split
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import SGD
from keras.utils import np_utils
from keras.models import load_model
from keras import optimizers
from keras import backend as K
import h5py 
from keras.models import model_from_json 
import matplotlib.pyplot as plt
import pandas as pd

##II. Image Processing
A CNN expects inputs of a fixed shape, so before building the model we need to process the image dataset. We read the images one by one, normalize them, and label each one according to the person it shows, for example ‘0’ for Bill and ‘1’ for Tom.

To adjust the images we use OpenCV, whose image processing module includes linear and non-linear image filtering, geometrical image transformations (resize, affine and perspective warping, generic table-based remapping), color space conversion, histograms, and so on.

Suppose an image is x pixels wide and y pixels high. OpenCV reads the image file and converts it to a NumPy array of shape height × width × 3, where 3 is the number of color channels (OpenCV stores them in blue, green, red order). To give every input to the neural network the same shape, we compare the width and height of each image and add black borders to the shorter dimension until the image is square. Finally, we resize the picture to 64 × 64 pixels, the value of IMAGE_SIZE.

In [7]:
IMAGE_SIZE = 64

def resize_image(image, height_output = IMAGE_SIZE, width_output = IMAGE_SIZE):
    # initialize the border widths; all four default to 0
    top, bottom, left, right = (0, 0, 0, 0)
    
    # get the image size; OpenCV arrays are (height, width, channels)
    height, width, _ = image.shape
    
    # find the longer side of the image
    longer_side = max(height, width)    
    
    # calculate how much padding the shorter side needs
    if height < longer_side:
        dh = longer_side - height
        top = dh // 2
        bottom = dh - top
    elif width < longer_side:
        dw = longer_side - width
        left = dw // 2
        right = dw - left
    
    # border color (black is the same in RGB and BGR)
    BLACK = [0, 0, 0]

    # pad to a square; cv2.BORDER_CONSTANT draws a solid border whose color is given by "value"
    constant = cv2.copyMakeBorder(image, top, bottom, left, right, cv2.BORDER_CONSTANT, value = BLACK)
    
    # cv2.resize expects the target size as (width, height)
    return cv2.resize(constant, (width_output, height_output))
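
The padding arithmetic can be checked without OpenCV. The following is a minimal sketch using NumPy only; the helper name `pad_to_square` and the synthetic 40 × 100 image are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def pad_to_square(image):
    """Pad a (height, width, channels) array with black pixels to a square."""
    height, width = image.shape[:2]
    longer_side = max(height, width)
    # Split the missing rows/columns as evenly as possible between both sides.
    pad_rows = longer_side - height
    pad_cols = longer_side - width
    top, left = pad_rows // 2, pad_cols // 2
    bottom, right = pad_rows - top, pad_cols - left
    return np.pad(image, ((top, bottom), (left, right), (0, 0)),
                  mode='constant', constant_values=0)

# A synthetic white image with 40 rows and 100 columns stands in for a photo.
photo = np.full((40, 100, 3), 255, dtype=np.uint8)
square = pad_to_square(photo)
print(square.shape)  # (100, 100, 3)
```

Here the 60 missing rows are split into 30 black rows above and 30 below the original content, which is the same result `cv2.copyMakeBorder` would give with a black constant border.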

The next problem is how to read hundreds of labeled images into the project. We have many images of two people, for example Bill Gates and Tom Cruise, stored in two separate folders, one per person. Reading each folder and recording which folder every image came from gives us a list of images and a list of matching labels.

In [9]:
#read data into ram
images = []
labels = []
def read_path(path_url):    
    for dir_item in os.listdir(path_url):
        full_path = os.path.abspath(os.path.join(path_url, dir_item))
        
        if os.path.isdir(full_path):    # if it is a folder, recurse into it
            read_path(full_path)
        else: 
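            if dir_item.endswith('.jpg'):
                image = cv2.imread(full_path)
                if image is None:    # skip files OpenCV cannot decode
                    continue
                image = resize_image(image, IMAGE_SIZE, IMAGE_SIZE)
                images.append(image)
                # label each image with the path of its parent folder
                labels.append(path_url)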
                    
    return images,labels
    
#main function, read data
def load_dataset(path_url):
    images,labels = read_path(path_url)
  
    # convert the list of images to a 4-dimensional array: (count, height, width, channels)
    images = np.array(images)
#     print(images.shape,"total ")
    
    labels = np.array([0 if label.endswith('johnny_depp') else 1 for label in labels])    
    return images, labels
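
The loader above hard-codes two classes through the `johnny_depp` suffix check. As a hedged sketch of how the folder-to-label idea could generalize, the snippet below derives integer labels from subfolder names using only the standard library. The helpers `build_label_map` and `list_labeled_files`, and the folder name `natalie_portman`, are hypothetical and chosen for illustration.

```python
import os
import tempfile

def build_label_map(root_dir):
    """Map each immediate subfolder name to an integer label, in sorted order."""
    names = sorted(
        entry for entry in os.listdir(root_dir)
        if os.path.isdir(os.path.join(root_dir, entry))
    )
    return {name: index for index, name in enumerate(names)}

def list_labeled_files(root_dir, extension='.jpg'):
    """Return (file_path, label) pairs for every matching file under root_dir."""
    label_map = build_label_map(root_dir)
    pairs = []
    for name, label in label_map.items():
        for dirpath, _, filenames in os.walk(os.path.join(root_dir, name)):
            for filename in sorted(filenames):
                if filename.lower().endswith(extension):
                    pairs.append((os.path.join(dirpath, filename), label))
    return pairs

# Demonstrate on a throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    for person, count in (('johnny_depp', 2), ('natalie_portman', 1)):
        os.makedirs(os.path.join(root, person))
        for i in range(count):
            open(os.path.join(root, person, f'{i}.jpg'), 'w').close()
    label_map = build_label_map(root)
    file_labels = [label for _, label in list_labeled_files(root)]

print(label_map)    # {'johnny_depp': 0, 'natalie_portman': 1}
print(file_labels)  # [0, 0, 1]
```

Returning the pairs from a function, rather than appending to module-level lists, also avoids accumulating duplicates if the loader is called twice.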

We validate the model during training, so we split the dataset into three parts: training, validation, and test sets.
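
For a test score to be trustworthy, the three sets should not share images. A minimal sketch of one way to draw three disjoint index sets with NumPy follows; the function name `three_way_split` and the 15% fractions are illustrative assumptions, not values taken from the notebook.

```python
import numpy as np

def three_way_split(n_samples, valid_fraction=0.15, test_fraction=0.15, seed=0):
    """Return disjoint index arrays for the training, validation and test sets."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)  # shuffle all sample indices once
    n_valid = int(n_samples * valid_fraction)
    n_test = int(n_samples * test_fraction)
    valid_idx = order[:n_valid]
    test_idx = order[n_valid:n_valid + n_test]
    train_idx = order[n_valid + n_test:]
    return train_idx, valid_idx, test_idx

# 1902 = 1062 + 840, the image counts mentioned later in this notebook.
train_idx, valid_idx, test_idx = three_way_split(1902)
print(len(train_idx), len(valid_idx), len(test_idx))  # 1332 285 285
```

Because every index comes from a single permutation, no image can land in more than one set.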

Because this is a classification task, the labels also need one-hot encoding. The encoded label array is two-dimensional, with one row per sample and one column per class, as set by *nb_classes*.

Next, we convert the pixel data to float32 and scale the RGB values to the range 0-1, which helps the network converge faster, shortens training, and lowers the training error.
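
The label encoding and pixel scaling described above can be sketched with NumPy alone. The helper `one_hot` below is a hypothetical stand-in for what a utility such as `to_categorical` produces, shown on a tiny made-up example.

```python
import numpy as np

def one_hot(labels, nb_classes):
    """Turn integer labels of shape (n,) into an (n, nb_classes) float array."""
    encoded = np.zeros((len(labels), nb_classes), dtype='float32')
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

labels = np.array([0, 1, 1])
encoded = one_hot(labels, 2)
print(encoded.shape)  # (3, 2)

# Pixel values come out of OpenCV as unsigned 8-bit integers (0 to 255).
pixels = np.array([[0, 128, 255]], dtype=np.uint8)
scaled = pixels.astype('float32') / 255
print(scaled.min(), scaled.max())  # 0.0 1.0
```

Converting to float32 before dividing matters: an in-place `/=` on a uint8 array would raise a casting error.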

This completes the data preparation needed before building the CNN.

In [11]:
class Dataset:
    def __init__(self, path_url):
        #train data
        self.train_images = None
        self.train_labels = None
        
        #validate data
        self.valid_images = None
        self.valid_labels = None
        
        #test data
        self.test_images  = None            
        self.test_labels  = None
        
        #path of dataset
        self.path_url = path_url
        
        # TensorFlow uses (rows, cols, channels); Theano uses (channels, rows, cols)
        self.input_shape = None
        
    # load the dataset, then split it into training, validation, and test sets
    def load(self, nb_classes = 2):
        #load dataset to ram
        images, labels = load_dataset(self.path_url)        
        
        # split into three non-overlapping sets at random: 70% train, 15% validation, 15% test
        train_images, rest_images, train_labels, rest_labels = train_test_split(images, labels, test_size = 0.3, random_state = random.randint(0, 100))

        valid_images, test_images, valid_labels, test_labels = train_test_split(rest_images, rest_labels, test_size = 0.5, random_state = random.randint(0, 100))

        self.input_shape = (IMAGE_SIZE, IMAGE_SIZE, 3)            
            
        #the amount of three data parts
        print(train_images.shape[0], 'train samples')
        print(valid_images.shape[0], 'valid samples')
        print(test_images.shape[0], 'test samples')
        
        # label ===(one-hot encoding)===> 2 dimensions data
        train_labels = np_utils.to_categorical(train_labels, nb_classes)                        
        valid_labels = np_utils.to_categorical(valid_labels, nb_classes)            
        test_labels = np_utils.to_categorical(test_labels, nb_classes)
        
        #image to float
        train_images = train_images.astype('float32')            
        valid_images = valid_images.astype('float32')
        test_images = test_images.astype('float32')
            
        # normalize the value of RGB between 0-1
        train_images /= 255
        valid_images /= 255
        test_images /= 255            
        
        self.train_images = train_images
        self.valid_images = valid_images
        self.test_images  = test_images
        self.train_labels = train_labels
        self.valid_labels = valid_labels
        self.test_labels  = test_labels

Next, we load the data from storage. We uploaded the source images to DBFS (1062 JPGs of Johnny Depp and 840 JPGs of Natalie Portman) and read them here.

Creating a Dataset object and calling load() fills in train_images, valid_images, test_images, train_labels, valid_labels, and test_labels.

In [13]:
dataset = Dataset('/dbfs/FileStore/tables/data/')    
dataset.load()

##III. Build CNN model
Keras provides plenty of APIs for building neural network models, so we can build a sequential convolutional neural network easily. For more details, see the [layers section of the official Keras documentation](https://keras.io/layers/about-keras-layers/).

In [15]:
def build_model(dataset, nb_classes = 2):
    model = Sequential() 

    model.add(Conv2D(32, (3, 3), padding='same', 
                                 input_shape = dataset.input_shape))

    model.add(Activation('relu'))
    model.add(Conv2D(32, (3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))        
    # dropout: helps prevent overfitting
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(256))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(nb_classes))
    model.add(Activation('softmax'))
    model.summary()
    
    return model

The structure of our CNN model:

In [17]:
MODEL = build_model(dataset)
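
The shapes and parameter counts reported by `model.summary()` can be worked out by hand. The sketch below does the arithmetic in plain Python, assuming a 64 × 64 × 3 input, stride-1 3 × 3 convolutions (the first with 'same' padding, the second with the default 'valid' padding), and 2 × 2 pooling as in `build_model`. The helper names are illustrative.

```python
def conv_output(size, kernel, padding):
    """Spatial size after a stride-1 convolution."""
    return size if padding == 'same' else size - kernel + 1

def conv_params(kernel, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units

size, channels = 64, 3
params = []

size = conv_output(size, 3, 'same')           # stays 64
params.append(conv_params(3, channels, 32))
channels = 32
size = conv_output(size, 3, 'valid')          # shrinks to 62
params.append(conv_params(3, channels, 32))
size = size // 2                              # 31 after 2x2 max pooling
flat = size * size * channels                 # 30752 values after Flatten
params.append(dense_params(flat, 256))
params.append(dense_params(256, 2))

print(params, sum(params))
# [896, 9248, 7872768, 514] 7883426
```

Nearly all of the parameters sit in the first dense layer, which is why the pooling and dropout layers ahead of it matter for keeping the model manageable.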

##IV. Training
Next, we train the model on the dataset we prepared and validate it after every epoch. We set the loss function to ‘categorical_crossentropy’, a standard choice for classification with one-hot labels. The optimizer is SGD (stochastic gradient descent); the Keras implementation also supports momentum, learning rate decay, and Nesterov momentum, so we can combine their advantages. With plain SGD, the descent direction depends entirely on the current batch. Adding momentum retains part of the previous descent direction, which makes optimization both faster and more stable. We train for 100 epochs and finally plot the training history using a Spark DataFrame.
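
To make the role of momentum concrete, here is a minimal NumPy sketch of an SGD step with Nesterov momentum and time-based learning rate decay. It follows the commonly used update rule and is meant to illustrate the `lr`, `decay`, `momentum`, and `nesterov` parameters, not to reproduce the Keras implementation line by line. The toy objective and the learning rate of 0.1 are assumptions chosen so the example converges quickly.

```python
import numpy as np

def decayed_lr(lr0, decay, iteration):
    """Time-based decay: the learning rate shrinks slowly as iterations grow."""
    return lr0 / (1.0 + decay * iteration)

def sgd_step(w, v, grad, lr, momentum=0.9, nesterov=True):
    """One SGD update; v is the velocity carried over from previous steps."""
    v = momentum * v - lr * grad
    if nesterov:
        # Look ahead along the updated velocity before applying the gradient.
        w = w + momentum * v - lr * grad
    else:
        w = w + v
    return w, v

# Toy problem: minimise f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for iteration in range(200):
    lr = decayed_lr(0.1, 1e-6, iteration)
    w, v = sgd_step(w, v, grad=w, lr=lr)

final_distance = np.linalg.norm(w)
print(final_distance)  # very close to 0
```

The velocity `v` is what keeps part of the earlier descent direction: each step blends 90% of the previous velocity with the new gradient.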

In [19]:
def train(dataset, model, batch_size = 40, nb_epoch = 100):
    # SGD: the optimizer; lr: learning rate
    sgd = SGD(lr = 0.001, decay = 1e-6, momentum = 0.9, nesterov = True) 
    
    model.compile(loss='categorical_crossentropy',
                  optimizer=sgd,
                  metrics=['accuracy'])   

    history = model.fit(dataset.train_images, dataset.train_labels,
                   batch_size = batch_size,
                   epochs = nb_epoch,
                   validation_data = (dataset.valid_images, dataset.valid_labels),
                   shuffle = True)

    res = history.history
    return res

In [20]:
res = train(dataset, MODEL)

In [21]:
# add a 1-based epoch column to the training history
# ('acc' is the key used by older Keras versions; newer ones use 'accuracy')
res['epoch'] = list(range(1, len(res['acc']) + 1))

df = pd.DataFrame(res)
spark_df = spark.createDataFrame(df)
display(spark_df)

In [22]:
display(spark_df)

##V. Test
We use the test dataset generated earlier to evaluate the training result. The test set is sampled at random from the same source images as the training and validation sets, so the score reflects performance on this image collection rather than on photos from a new source.

In [24]:
def evaluate(model, dataset):
    score = model.evaluate(dataset.test_images, dataset.test_labels, verbose = 1)
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])
#         print(type(score))
    return score


In [25]:
evaluate(MODEL,dataset)
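
The two numbers returned by the evaluation are the mean categorical cross-entropy and the accuracy. As a hedged illustration of what they measure, the sketch below computes both with NumPy from one-hot labels and softmax-style predictions. The four predictions are made-up values, not results from this model.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class matches the label."""
    return float(np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1)))

y_true = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype='float32')
y_pred = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]], dtype='float32')

test_loss = categorical_crossentropy(y_true, y_pred)
test_accuracy = accuracy(y_true, y_pred)
print(test_accuracy)  # 0.75
```

The third sample is misclassified, so accuracy is 3 out of 4, while the loss also penalizes how confident each prediction was.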

In [26]:
class Model:
    def __init__(self):
        self.model = None 
        
    def build_model(self, dataset, nb_classes = 2):
        self.model = Sequential() 
        
        self.model.add(Conv2D(32, (3, 3), padding='same', 
                              input_shape = dataset.input_shape))

        self.model.add(MaxPooling2D(pool_size=(2, 2)))
        self.model.add(Conv2D(32, (3, 3)))
        self.model.add(MaxPooling2D(pool_size=(2, 2)))        
        # dropout: helps prevent overfitting
        self.model.add(Dropout(0.25))
        self.model.add(Flatten())
        self.model.add(Dense(512, activation='relu')) 
        self.model.add(Dense(nb_classes, activation='softmax'))
        self.model.summary()
    # train the model
    def train(self, dataset, loss, batch_size = 40, nb_epoch = 10):
        # SGD: the optimizer; lr: learning rate
        sgd = SGD(lr = 0.01, decay = 1e-6, momentum = 0.9, nesterov = True) 
        
        self.model.compile(loss=loss,
                           optimizer=sgd,
                           metrics=['accuracy'])   
        
        # epochs: how many passes to make over the training data
        # batch_size: how many samples are used for each gradient update
        history = self.model.fit(dataset.train_images, dataset.train_labels,
                       batch_size = batch_size,
                       epochs = nb_epoch,
                       validation_data = (dataset.valid_images, dataset.valid_labels),
                       shuffle = True)
#         print('keys: ',history.history.keys())
#         print(type(history.history))
        res = history.history
        return res
        
        
    def evaluate(self, dataset):
        score = self.model.evaluate(dataset.test_images, dataset.test_labels, verbose = 1)
        print('Test loss:', score[0])
        print('Test accuracy:', score[1])
#         print(type(score))
        return score