# Assignment 2 Summary

## Introduction 

The goal of InspiRED Robotics is to bring to life entertainment robots that can interact with people intelligently. We would like to use cutting-edge deep learning methods that help the robot see, understand, and react. As shown in Figure 1, the InspiRED humanoid robot is equipped with a camera and a single-board computer. Because the onboard computer has limited computing power, we need to build an integrated vision system that balances accuracy and speed.

![tennis1](../Content/Pics/humanoid.png)

<center>Figure 1 InspiRED humanoid robot</center>

# Context

Recognizing people's faces is a crucial step in enabling robots to interact with people effectively. To validate our idea of applying deep learning methods to robots, we chose to apply face recognition to the robots in this assignment. Face recognition has been an active research topic in recent years, and there are many open-source projects for it. Instead of simply using existing APIs, we would like to integrate our own implementation into existing methods so that we can optimize the computer vision pipeline for the mobile computing platforms on our robots.

We started by training our face recognition model on the CASIA dataset [10], which contains 10,575 people and 494,414 images. We tested several network architectures. Finally, we integrated detection, tracking, and recognition into one pipeline that runs in real time.

The implementation is based on David Sandberg's [5] TensorFlow implementation of FaceNet [1] and the OpenFace [6] project.

## Result 

The results for the whole system are as follows:

1. Face recognition accuracy:
    - Google Inception Net V1: 99% on the LFW dataset
    - Lightened network: 93% on the LFW dataset
2. Face classification accuracy: ~100% on a dataset of 10 people
3. Face recognition time on CPU (i7-4790): 50~100 ms for one person, 100~150 ms for two people, 150~200 ms for three people
4. Face recognition time on GPU (GTX 980): roughly 0

# Training the Face Recognition Model

We started from David Sandberg's [5] pre-trained model, which is based on Inception-V1 and trained on the CASIA dataset [7]. This model reaches ~99% accuracy on the LFW dataset, so it is already good enough for us to use. However, given the running time and memory usage of such a large network, we had to consider a smaller network that improves speed. Since the robot usually only needs to distinguish among about 10 people, it does not need a very strong model, but it does need an efficient one.

Wu et al. [3] introduced a lightened network A for face recognition whose performance is still acceptable. In this project we therefore chose to implement that network and use it in our final demo system. The network contains around 1,300K parameters.

Training a face recognition model is not easy, and the choice of loss function strongly affects the final face representation. FaceNet [1] uses the triplet loss, which tries to maximize the distance between embeddings of different people while minimizing the distance between embeddings of the same person. During training, however, hard-negative example mining is needed to get good performance, so training takes a long time and convergence is poor.
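To make the role of hard negatives concrete, here is a minimal NumPy sketch of a triplet loss on toy embeddings. It is not the FaceNet training code; the function name `triplet_loss`, the margin value of 0.2, and the 2-D vectors are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean triplet loss over a batch of embeddings (margin is an assumed value)."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)  # same person
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)  # different person
    # A triplet only contributes when the negative is not yet `margin` farther away
    return float(np.mean(np.maximum(pos_dist - neg_dist + margin, 0.0)))

# Toy 2-D embeddings (hypothetical data)
anchor = np.array([[1.0, 0.0]])
positive = np.array([[1.0, 0.0]])
easy_negative = np.array([[-1.0, 0.0]])  # far from the anchor
hard_negative = np.array([[1.0, 0.0]])   # indistinguishable from the anchor

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
print('easy triplet loss:', easy)
print('hard triplet loss:', hard)
```

The easy triplet already satisfies the margin and yields zero loss, so it gives the network nothing to learn from. Only hard triplets produce a training signal, which is why mining them matters and why it makes training slow.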

Parkhi et al. [8] found that pre-training the network with a classifier loss, instead of starting with the triplet loss, helps the network converge well. Wen et al. [2] recently proposed the neat idea of a center loss. The problem with a plain classifier loss is that the variance within one class stays large, since the classifier only tries to find decision boundaries that separate the classes accurately. The center loss adds a penalty on the within-class variance, which forces the features of the same person to stay close together. In our implementation we chose to train a classifier with the center loss.
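The combination of classifier loss and center loss can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the training code used in this project: the helper names and toy data are hypothetical, and the center update is written as a simple moving average. The values 0.9 and 1e-4 mirror the `--center_loss_alfa` and `--center_loss_factor` flags of the training command in this notebook.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def center_loss(features, labels, centers, alfa):
    """Return (loss, updated_centers) for one batch (hypothetical helper)."""
    centers_batch = centers[labels]  # center of each sample's class
    loss = float(np.mean(np.sum((features - centers_batch) ** 2, axis=1)))
    updated = centers.copy()
    # Move each class center a fraction (1 - alfa) towards its samples;
    # subtract.at accumulates correctly when a label appears more than once.
    np.subtract.at(updated, labels, (1.0 - alfa) * (centers_batch - features))
    return loss, updated

# Toy batch: two samples of person 0 and one sample of person 1
features = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
logits = np.array([[2.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 0, 1])
centers = np.zeros((2, 2))

c_loss, centers = center_loss(features, labels, centers, alfa=0.9)
total_loss = softmax_cross_entropy(logits, labels) + 1e-4 * c_loss
print('center loss:', c_loss)
print('updated centers:', centers)
print('total loss:', total_loss)
```

The classifier term separates the classes, while the weighted center term pulls each feature towards the running center of its own class, which is the within-class variance penalty described above.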

To train face recognition, we first need to crop a thumbnail of each face. We initially used Zhang's [4] method to detect faces; it detects faces with a cascade of three small networks, and its performance is pretty good. In our final demo system we used the dlib library instead, because Zhang's [4] method is too slow.


In [24]:
# The large datasets cannot be included here, so the steps to train the model are only listed below
# 1. Download the face recognition datasets LFW (http://vis-www.cs.umass.edu/lfw/) and Casia (http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html)
# 2. Preprocess the LFW and Casia datasets and crop out people's faces.
#     "python src/align/align_dataset_mtcnn.py ~/datasets/lfw/raw ~/remote/datasets/lfw/lfw_mtcnnpy_128 --image_size 128 --margin 24 --random_order"
#     "python src/align/align_dataset_mtcnn.py ~/dataset/CASIA-WebFace ~/remote/datasets/lfw/casia_mtcnnpy_144 --image_size 144 --margin 24 --random_order"
# 3. Train the classifier: facenet_train_classifier.py
#     python src/facenet_train_classifier.py --logs_base_dir ~/remote/logs/facenet/lightened --models_base_dir ~/remote/models/facenet/lightened/ --data_dir ~/remote/datasets/casia/casia_maxpy_mtcnnpy_144 --image_size 128 --model_def models.lightened --lfw_dir /home/xca64/remote/datasets/lfw/lfw_mtcnnpy_128 --optimizer RMSPROP --learning_rate 0.05 --max_nrof_epochs 80 --keep_probability 0.8 --random_crop --random_flip --learning_rate_schedule_file data/learning_rate_schedule_classifier_casia.txt --weight_decay 5e-5 --center_loss_factor 1e-4 --center_loss_alfa 0.9 --batch_size 45 --pretrained_model /home/xca64/remote/models/facenet/lightened/20170321-134041
# 4. Network definitions (lightened and Inception_V1): lightened.py, inception_resnet_v1.py

![tennis1](../Content/Pics/Recognition.png)

<center>Figure 2 Face Recognition System Design</center> 

# Extract People's Features Using the Trained Model

After training the face recognition model, we can use it to extract the features for each person and then train an SVM classifier. All the required data is inside the "Content" folder. To extract features for a new person, create a folder named after that person under the "People" folder and place the new photos in it. The program will automatically crop the photos and use the trained neural network to extract the features for all people.
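The folder convention described above (one sub-folder per person, named after that person) can be turned into labelled samples roughly as follows. This is a sketch rather than the project's loader: the helper `list_people`, the accepted file extensions, and the demo names are hypothetical.

```python
import os
import tempfile

# Assumed set of image extensions; the project's scripts may accept others
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')

def list_people(people_dir):
    """Return sorted (image_path, person_name) pairs from people_dir/<name>/<image>."""
    samples = []
    for name in sorted(os.listdir(people_dir)):
        person_dir = os.path.join(people_dir, name)
        if not os.path.isdir(person_dir):
            continue  # ignore stray files next to the person folders
        for filename in sorted(os.listdir(person_dir)):
            if filename.lower().endswith(IMAGE_EXTENSIONS):
                samples.append((os.path.join(person_dir, filename), name))
    return samples

# Demo on a throwaway directory that mimics the "People" layout
with tempfile.TemporaryDirectory() as root:
    for name, count in (('Alice', 2), ('Bob', 1)):
        os.makedirs(os.path.join(root, name))
        for i in range(count):
            open(os.path.join(root, name, '%d.jpg' % i), 'w').close()
    samples = list_people(root)
    labels = [name for _, name in samples]

print(labels)
```

Using the folder name as the label means that adding a person requires no code change, only a new folder of photos.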


In [2]:
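# Make the modules under the src folder importable from this notebook
import os
import sys
import numpy as np

# The src folder sits next to the notebook's folder, so start from the parent of the working directory
current_folder_path, _ = os.path.split(os.getcwd())
sys.path.insert(0, os.path.join(current_folder_path, 'src'))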


This step may take a long time, because it first crops the faces and then extracts the features for each person. Make sure PYTHONPATH includes the src folder, as set up in the previous step.

In [5]:
%run ../src/align/train_person_recognition_with_dlib.py --input_dir ../Content/People --output_dir ../Content/People_Cropped --image_size 128 --margin 24 --model_dir ../Content/NetworkModel/20170321-222346 --feature_dir ../Content/PeopleFeature --feature_name emb_array 

Creating networks and loading parameters
Model directory: ../Content/NetworkModel/20170321-222346
Loading network used to extract features
Metagraph file: model-20170321-222346.meta
Checkpoint file: model-20170321-222346.ckpt-80000
Start to extracting features, it usually take a long time.
But in this repo, the images have all been cropped into the People_Cropped folder
Total number of images: 360
Number of successfully aligned images: 0


Starting to extract features
Total number of images: 340
Runnning forward pass on images
Processing Patch 0/7
Processing Patch 1/7
Processing Patch 2/7
Processing Patch 3/7
Processing Patch 4/7
Processing Patch 5/7
Processing Patch 6/7


# Train the SVM classifier 

In [1]:
# The extracted features are stored in the PeopleFeature directory; to inspect them,
# uncomment the following code

# npfile=np.load('../Content/PeopleFeature/emb_array.npz')
# print('People Feature Example ', npfile['emb_array'][0])
# print('People Label', npfile['label'][0])

In [17]:
# Run the SVM training; first we test the system's accuracy with ten-fold cross-validation repeated 25 times
%run ../util/train_svm.py --npz_file_dir ../Content/PeopleFeature/emb_array.npz --C 4 --svm_model_dir ../Content/SVMModel --gamma 1 --test_acc

accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0
accuracy 1.0


In [19]:
# Run the SVM training and store the trained SVM model
%run ../util/train_svm.py --npz_file_dir ../Content/PeopleFeature/emb_array.npz --C 4 --svm_model_dir ../Content/SVMModel --gamma 1

Trainning and Storing SVM model
accuracy 1.0
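The script `train_svm.py` is not shown in this notebook, so the following is only a minimal sketch of what such a step might look like, assuming scikit-learn is available. Only `C=4` and `gamma=1` come from the command above; the RBF kernel, the 128-D embedding size, and the synthetic embeddings that stand in for `emb_array.npz` are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Synthetic stand-in for the extracted embeddings: 3 people, 20 images each
n_people, n_images, dim = 3, 20, 128
person_centers = rng.randn(n_people, dim)
person_centers /= np.linalg.norm(person_centers, axis=1, keepdims=True)

emb_array = np.repeat(person_centers, n_images, axis=0)
emb_array = emb_array + 0.02 * rng.randn(n_people * n_images, dim)
emb_array /= np.linalg.norm(emb_array, axis=1, keepdims=True)  # unit-length embeddings
label = np.repeat(np.arange(n_people), n_images)

# C and gamma follow the command line; the RBF kernel is an assumption
classifier = SVC(C=4, gamma=1, kernel='rbf')

# Ten-fold cross-validation, as in the accuracy test
scores = cross_val_score(classifier, emb_array, label, cv=10)
print('mean cross-validation accuracy:', scores.mean())

# Fit on all embeddings to obtain the model that would be stored
classifier.fit(emb_array, label)
predicted = classifier.predict(emb_array[:1])
```

Because the neural network does the hard work of mapping faces to well-separated embeddings, a small SVM on top is cheap to train, and adding a new person only requires retraining this last stage.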


# Integrate the Face Recognition Model

In [20]:
# The test photos are located in the test folder, and the predicted label should be Xiaochuan
# This step may take a long time, because it loads the TensorFlow model
# It may crash the notebook, which seems to be a problem with the dlib library; if so, copy the command to a terminal and run it there
%run ../src/align/recognize_person_with_dlib.py --input_dir ../Content/Test/photos --image_size 128 --margin 24 --model_dir ../Content/NetworkModel/20170321-222346 --svm_model_dir ../Content/SVMModel

Creating networks and loading parameters
Model directory: ../Content/NetworkModel/20170321-222346
Loading network used to extract features
Metagraph file: model-20170321-222346.meta
Checkpoint file: model-20170321-222346.ckpt-80000
../Content/Test/photos/Xiaochuan/0.jpg
../Content/Test/photos/Xiaochuan/1.jpg
../Content/Test/photos/Xiaochuan/2.jpg
../Content/Test/photos/Xiaochuan/3.jpg
../Content/Test/photos/Xiaochuan/4.jpg
../Content/Test/photos/Xiaochuan/5.jpg
../Content/Test/photos/Xiaochuan/6.jpg
../Content/Test/photos/Xiaochuan/7.jpg
Total number of images: 8
Number of successfully aligned images: 8


Starting to extract features
Crop pics spend 0.887 seconds
Extract feature spend 2.409 seconds
Classifier spend 0.002 seconds
Predicted Persons
['Xiaochuan' 'Xiaochuan' 'Xiaochuan' 'Xiaochuan' 'Xiaochuan' 'Xiaochuan'
 'Xiaochuan' 'Xiaochuan']


# Real-Time Face Detection and Recognition

![tennis1](../Content/Pics/Pipeline.png)

<center>Figure 3 </center>

To detect people in real time, we did not use the model proposed in [4]. In our experiments it took around 1 second per detection, which is not acceptable. We instead used the dlib face detection method, which takes around 50 ms, and ran it in a separate process. Once the detector finds a new person in the video, it invokes the face recognition part to classify that person. In the main loop we used the dlib tracking method, which achieves real-time tracking.
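The text does not say how the system decides that a detection belongs to a new person rather than to a face that is already being tracked. One common way to make that decision is to compare bounding boxes by intersection over union (IoU). The sketch below assumes that approach; the helper names, the 0.3 threshold, and the box coordinates are hypothetical and not taken from `detect_tracking_recognize.py`.

```python
def iou(a, b):
    """Intersection over union of two (left, top, right, bottom) boxes."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, right - left) * max(0, bottom - top)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / float(union) if union > 0 else 0.0

def find_new_faces(detections, tracked_boxes, threshold=0.3):
    """Return the detections that overlap no tracked box (threshold is assumed)."""
    return [d for d in detections
            if all(iou(d, t) < threshold for t in tracked_boxes)]

# One face is already tracked and labelled; the detector reports two boxes
tracked = {'Xiaochuan': (100, 100, 200, 200)}
detections = [(105, 98, 205, 198), (300, 120, 380, 200)]

new_faces = find_new_faces(detections, list(tracked.values()))
print(new_faces)  # only these boxes would be sent to the recognition step
```

Under this scheme the expensive recognition step runs once per new face, while the tracker keeps the label attached to the face between detections.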

In all, we can achieve real-time face recognition for multiple people. 

In [None]:
# Note: this may fail with the error ['module' object has no attribute 'image_window'], which means
# this dlib build has no display window support. The problem may occur on Mac but does not occur on Ubuntu
%run ../src/align/detect_tracking_recognize.py --input_video ../Content/Test/Two.mov --image_size 128 --margin 24 --model_dir ../Content/NetworkModel/20170321-222346 --svm_model_dir ../Content/SVMModel

In [None]:
%run ../src/align/detect_tracking_recognize.py --input_video ../Content/Test/Three.mov --image_size 128 --margin 24 --model_dir ../Content/NetworkModel/20170321-222346 --svm_model_dir ../Content/SVMModel

![tennis1](../Content/Pics/tracking.png)

<center>Figure 4 Multiple faces recognition test</center>

# Perspectives

In this project, we gained experience in designing neural networks and running real-time experiments. The next step is to migrate the whole vision pipeline to embedded computing platforms, such as the single-board computer [Odroid XU4](http://www.hardkernel.com/main/products/prdt_info.php?g_code=G143452239825).

The long-term goal is to run all vision computation on board so that our robots can respond in real time. In the future, we may use Hinton's method [9] to distill accurate models into small models for use on the robot. Training a multi-task network also seems to be a suitable way to run deep learning on a robot, because it can handle multiple tasks in a single forward pass.
 

## References

[1] Florian Schroff, Dmitry Kalenichenko, James Philbin. "FaceNet: A Unified Embedding for Face Recognition and Clustering." In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815-823.

[2] Wen, Yandong, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. "A discriminative feature learning approach for deep face recognition." In European Conference on Computer Vision, pp. 499-515. Springer International Publishing, 2016.

[3] Wu, Xiang, Ran He, and Zhenan Sun. "A lightened cnn for deep face representation." In 2015 IEEE Conference on IEEE Computer Vision and Pattern Recognition (CVPR). 2015.

[4] Zhang, Kaipeng, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. "Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks." IEEE Signal Processing Letters 23, no. 10 (2016): 1499-1503.

[5] https://github.com/davidsandberg/facenet

[6] https://cmusatyalab.github.io/openface/

[7] Dong Yi, Zhen Lei, Shengcai Liao and Stan Z. Li, “Learning Face Representation from Scratch”. arXiv preprint arXiv:1411.7923. 2014.

[8] Parkhi, Omkar M., Andrea Vedaldi, and Andrew Zisserman. "Deep Face Recognition." In BMVC, vol. 1, no. 3, p. 6. 2015.

[9] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).

[10] Dong Yi, Zhen Lei, Shengcai Liao and Stan Z. Li, “Learning Face Representation from Scratch”. arXiv preprint arXiv:1411.7923. 2014.