This is a module for face detection with convolutional neural networks (CNNs). It uses a small CNN as a binary classifier to distinguish faces from non-faces. A simple sliding window, applied at several window sizes, is used to localize the faces in the image.
The module requires:

1. TensorFlow
2. OpenCV for Python
The network consists of 3 convolutional layers followed by a fully connected layer and a softmax layer.

Input: 32x32 black and white image

1. Layer 1: 5x5 convolutions, 4 feature maps
2. Layer 2: 3x3 convolutions, 16 feature maps
3. Layer 3: 3x3 convolutions, 32 feature maps (output: 32 feature maps of size 4x4)
4. Layer 4: fully connected layer, 600 units
5. Layer 5: softmax layer, 2 units
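The layer sizes above can be sanity-checked with a little arithmetic. The sketch below assumes "same" padding, a 2x2 pooling step after each convolution layer, and one bias per output unit. None of these is stated in this README; they are simply one configuration that reproduces the 4x4 output of Layer 3.

```python
# Shape and parameter bookkeeping for the network described above.
# ASSUMPTIONS (not stated in the README): 'same' padding, 2x2 pooling
# after every convolution layer, one bias per feature map / unit.

conv_layers = [(5, 4), (3, 16), (3, 32)]  # (kernel size, feature maps)

size, channels, params = 32, 1, 0  # 32x32 single-channel input
for kernel, maps in conv_layers:
    params += kernel * kernel * channels * maps + maps  # weights + biases
    size //= 2          # assumed 2x2 pooling halves each spatial dimension
    channels = maps

flat_units = size * size * channels      # inputs to the fully connected layer
params += flat_units * 600 + 600         # Layer 4: 600 units
params += 600 * 2 + 2                    # Layer 5: 2-unit softmax

print(size, flat_units, params)
```

Under these assumptions almost all of the weights sit in the fully connected layer, which is typical for a network of this shape.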
The network was trained with TensorFlow's AdamOptimizer using the following settings:

lrate: 1e-4
epsilon: 1e-16
mini-batch size: 100
number of epochs: 8
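For readers unfamiliar with Adam, the following is a minimal pure-Python sketch of a single update for one scalar parameter, using the learning rate and epsilon listed above. The `beta1` and `beta2` values are TensorFlow's defaults and are an assumption here, since this README does not state them. `adam_step` is a hypothetical helper for illustration, not part of this module.

```python
import math

def adam_step(param, grad, m, v, t,
              lrate=1e-4, beta1=0.9, beta2=0.999, epsilon=1e-16):
    """One Adam update for a scalar parameter at step t (t starts at 1).

    beta1/beta2 are assumed defaults; lrate and epsilon come from the README.
    """
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    # Bias-corrected step size
    lr_t = lrate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    param = param - lr_t * m / (math.sqrt(v) + epsilon)
    return param, m, v

# First step from zero-initialized moments with a gradient of 1.0
param, m, v = adam_step(param=0.0, grad=1.0, m=0.0, v=0.0, t=1)
```

On the first step the update magnitude is approximately the learning rate itself, regardless of the gradient's scale.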
The validation accuracy was 98.762% and the final accuracy on the test set was 98.554%.
Positive samples (images of faces) for the classification were taken from 2 sources:
1. Cropped labelled faces in the wild (http://conradsanderson.id.au/lfwcrop/)
2. MIT CBCL face recognition database (http://cbcl.mit.edu/software-datasets/heisele/facerecognition-database.html)
Horizontally mirrored copies of these images were also included in the dataset.
Negative samples (non-faces) were taken from 4 sources:
1. Fifteen scene categories (http://www-cvr.ai.uiuc.edu/ponce_grp/data/)
2. Texture database (http://www-cvr.ai.uiuc.edu/ponce_grp/data/)
3. Caltech cars (Rear) background dataset (http://www.robots.ox.ac.uk/~vgg/data3.html)
4. Caltech houses dataset (http://www.robots.ox.ac.uk/~vgg/data3.html)
Random snapshots from these images were generated by taking sub-images of random size at random positions in the images. These snapshots were mixed into the dataset along with the complete images.
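One way to pick such snapshots is sketched below. It assumes square crops and a minimum side of 32 pixels (the network's input size). Both are illustrative choices, and `random_snapshot_box` is a hypothetical helper, not a function from this repo.

```python
import random

def random_snapshot_box(width, height, min_side=32, rng=random):
    """Return (x, y, side) for a random square sub-image of a width x height image.

    ASSUMPTIONS: crops are square and at least min_side pixels wide.
    """
    max_side = min(width, height)
    if max_side < min_side:
        raise ValueError("image is smaller than the minimum snapshot size")
    side = rng.randint(min_side, max_side)   # inclusive on both ends
    x = rng.randint(0, width - side)         # left edge keeps the crop inside
    y = rng.randint(0, height - side)        # top edge keeps the crop inside
    return x, y, side

rng = random.Random(0)
boxes = [random_snapshot_box(640, 480, rng=rng) for _ in range(5)]
```

With an image loaded as a NumPy array (as OpenCV does), each box would correspond to the slice `image[y:y + side, x:x + side]`.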
The final dataset consists of 32,000 images from each class (positive and negative). Training, validation and test sets were generated from it with a 0.6 split for training, 0.2 for validation and 0.2 for test. Each of these sets has 50% positive and 50% negative samples.
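A split with these proportions and a 50/50 class balance in every set can be obtained by splitting each class separately. The sketch below is one assumed way to do it. `stratified_split` is a hypothetical helper, and integer ranges stand in for the actual images.

```python
import random

def stratified_split(positives, negatives, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split two classes into train/validation/test sets with equal class balance.

    Returns three lists of (sample, label) pairs; label 1 = face, 0 = non-face.
    """
    rng = random.Random(seed)
    splits = [[], [], []]
    for label, samples in ((1, positives), (0, negatives)):
        samples = list(samples)
        rng.shuffle(samples)
        n_train = round(fractions[0] * len(samples))
        n_val = round(fractions[1] * len(samples))
        bounds = [0, n_train, n_train + n_val, len(samples)]
        for i in range(3):
            chunk = samples[bounds[i]:bounds[i + 1]]
            splits[i].extend((s, label) for s in chunk)
    for part in splits:
        rng.shuffle(part)  # interleave positives and negatives
    return splits

# 32,000 placeholder samples per class, as in the dataset described above
train, val, test = stratified_split(range(32000), range(32000))
print(len(train), len(val), len(test))
```

With 32,000 images per class this gives 38,400 training, 12,800 validation and 12,800 test samples, that is, 384 mini-batches of 100 per training epoch.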
The module uses a simple sliding window localizer. The input image is resized to (300,300). Square windows of side lengths 40, 50, ..., 100 are slid along the image. Each sub-image seen through the window is resized to (32,32) and fed to the network. If the network classifies the sub-image as a face with a confidence of at least 0.99, the window is marked in the mask. After all window sizes have been run on the image, the final mask is blurred with a 50x50 Gaussian filter and binarized. This binarized mask is used to extract only the faces from the image. The localizer returns two images: an image with only the faces and the raw mask (before blurring and binarization).
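The window enumeration and mask marking described above can be sketched in plain Python. The stride of 10 pixels is an assumption (this README does not state one), as is marking a window by setting its pixels to 1. `face_probability` is a hypothetical stand-in for running the CNN on a resized window, and neither function below is taken from FaceDetect.py.

```python
def sliding_windows(image_size=300, sides=range(40, 101, 10), stride=10):
    """Yield (x, y, side) for every square window position inside the image.

    ASSUMPTION: the stride is illustrative; the README does not specify it.
    """
    for side in sides:
        for y in range(0, image_size - side + 1, stride):
            for x in range(0, image_size - side + 1, stride):
                yield x, y, side

def build_mask(face_probability, image_size=300, threshold=0.99):
    """Mark every window whose face probability reaches the threshold."""
    mask = [[0] * image_size for _ in range(image_size)]
    for x, y, side in sliding_windows(image_size):
        if face_probability(x, y, side) >= threshold:
            for row in mask[y:y + side]:
                row[x:x + side] = [1] * side   # assumed marking scheme
    return mask

# Dummy classifier: pretends only the top-left 40x40 window contains a face
def dummy(x, y, side):
    return 1.0 if (x, y, side) == (0, 0, 40) else 0.0

mask = build_mask(dummy)
marked_pixels = sum(map(sum, mask))
```

In the real pipeline the raw mask would then be blurred and binarized as described above; that step is omitted here to keep the sketch free of external dependencies.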
The repo includes a pre-trained model: face_model. This can directly be used for localization. Sample usage of this model with FaceDetect.py can be seen in demo.py. Running the demo should display the result of running the localizer on demo.jpg. Demos with other images can be seen here: https://youtu.be/N4GIGVnyNBo