This article is about a project I worked on as an undergraduate.
Since the aim of the activity was to learn machine learning techniques, there is nothing novel here.
My goal was to detect eye contact from a single face image.
By that I mean deciding whether the person in the image is looking at the camera or not.
Some people in the lab where I worked were doing research on interactive robot systems.
So I hoped my eye-contact detection work would contribute to their interactive system.
Face images vary widely, for example in size and position.
For that reason, I first detect the face with the OpenFace API and then detect eye contact.
I used the API for the following two purposes:
- Get the positions (coordinates) of the face landmarks, shown as black points in the image below.
- Crop the face as a rectangular box.
As you can see in the following image, 64 face landmarks are detected.
The face area is then cropped using the points around the eyes.
You can also get the coordinates of each landmark.
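A crop around the eyes from landmark coordinates can be sketched like this. This is a minimal numpy sketch, not the original code: the landmark indices follow the common 68-point dlib convention (which OpenFace builds on), and the margin value is a tuning assumption, so adjust both to whatever your landmark detector actually returns.

```python
import numpy as np

# Eye landmark indices in the 68-point dlib scheme (an assumption;
# adjust to the indexing your OpenFace version returns).
LEFT_EYE = list(range(36, 42))
RIGHT_EYE = list(range(42, 48))

def eye_region_box(landmarks, margin=0.2):
    """Return (x0, y0, x1, y1) of a box around both eyes.

    landmarks: (N, 2) array of (x, y) landmark coordinates.
    margin: fraction of the tight box added on every side.
    """
    pts = landmarks[LEFT_EYE + RIGHT_EYE]
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    mx, my = margin * (x1 - x0), margin * (y1 - y0)
    return (int(x0 - mx), int(y0 - my), int(x1 + mx), int(y1 + my))
```

The returned box can be used directly to slice the image array, e.g. `img[y0:y1, x0:x1]`.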
I tested the following three models for eye-contact detection:
- SVM (Support Vector Machine) with raw pixels as input
- SVM with SIFT features as input
- CNN (Convolutional Neural Network) with raw pixels as input
I describe each model below.
As preprocessing, I converted the face images cropped by OpenFace to grayscale and applied histogram equalization.
Then I trained an SVM with the top third of each cropped image as input.
An example is shown below.
Both the grayscale conversion and the use of only the top third serve dimensionality reduction.
Histogram equalization adds robustness to variation in contrast.
Finally, I trained the SVM on the processed vectors.
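The preprocessing for model 1 can be sketched as follows. This is a minimal numpy/scikit-learn sketch, not the original code: the 96x96 crop size, the numpy-only equalization, the alternating toy labels, and the RBF kernel are all my assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def equalize_hist(gray):
    """Histogram equalization on a uint8 grayscale image (numpy-only sketch)."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    lut = np.round((cdf - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255).astype(np.uint8)
    return lut[gray]

def to_feature(face_gray):
    """Equalize, keep the top third (the eye region), flatten to a vector."""
    eq = equalize_hist(face_gray)
    top = eq[: eq.shape[0] // 3]
    return top.ravel().astype(np.float32) / 255.0

# Toy usage: random arrays stand in for the cropped face images.
rng = np.random.default_rng(0)
faces = rng.integers(0, 256, size=(20, 96, 96), dtype=np.uint8)
labels = np.tile([0, 1], 10)  # toy alternating labels
X = np.stack([to_feature(f) for f in faces])
clf = SVC(kernel="rbf").fit(X, labels)
```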
SIFT features are often used for image recognition.
In my case, since I knew the coordinates of the face landmarks around the eyes, I computed SIFT descriptors at those points.
I trained the SVM on the features obtained this way.
The input is the top third of the processed face image, the same as in model 1 (SVM with raw pixels).
The network architecture is shown below.
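A small CNN of this kind could be defined in Keras like this. This is a hypothetical sketch rather than my exact architecture: the layer counts, filter sizes, and the 32x96 input shape (top third of a 96x96 grayscale crop) are all assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Input: top third of a 96x96 equalized grayscale crop (shape is an assumption).
model = models.Sequential([
    layers.Input(shape=(32, 96, 1)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # P(looking at camera)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```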
The dataset I used for the experiments contains 1,300 positive images (looking at the camera) and 600 negative images (not looking at the camera).
I evaluated with a hold-out split, dividing the images in a ratio of 9 to 1.
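A stratified 9:1 hold-out split like this can be done with scikit-learn's `train_test_split`; stratifying keeps the 1,300/600 class balance in both splits. The placeholder features and random seed here are only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels matching the class counts in the text.
y = np.array([1] * 1300 + [0] * 600)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)
```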
| Model | Accuracy [%] |
|---|---|
| 1. SVM (input: raw pixels) | 81.3 |
| 2. SVM (input: SIFT features) | 82.3 |
| 3. CNN | 88.7 |
The CNN outperformed the SVMs while still being easy to deploy.
I also applied the trained models to video captured by a web camera.
A blue box means the model predicts the person is looking at the camera; a red box means otherwise.
My impression was that the CNN model is more stable than the others.
As I expected, prediction seems to be difficult when the person is not facing the camera directly.
This is probably because we did not have a sufficient number of such images for training.
In addition, when applying these models to video, we should use techniques such as smoothing or sequential modeling.
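For example, one simple form of smoothing is a majority vote over a sliding window of per-frame predictions. This is a sketch; the window length is a tuning assumption.

```python
import numpy as np

def smooth_predictions(preds, window=5):
    """Majority vote over a trailing window of per-frame 0/1 predictions.

    Suppresses single-frame flicker in video output; the window size
    is a tuning assumption.
    """
    preds = np.asarray(preds)
    out = np.empty_like(preds)
    for i in range(len(preds)):
        lo = max(0, i - window + 1)
        out[i] = 1 if preds[lo : i + 1].mean() >= 0.5 else 0
    return out
```

A longer window gives more stable boxes at the cost of slower reaction when the person actually looks away.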
I'm going to clean up my code and upload it to GitHub.
Thank you for reading!