# Introduction to Face Detection
## Viola-Jones Algorithm with Haar Cascades

Face detection: find where faces are located in an image. This can be done with classic computer vision (without deep learning).  
Face recognition: identify whose faces are in the image. Modern approaches typically rely on deep learning and large training datasets.

From [Haar-like feature (Wikipedia)](https://en.wikipedia.org/wiki/Haar-like_feature):
* used in the first real-time face detector
* A Haar-like feature considers adjacent rectangular regions at a specific location in a detection window, sums up the pixel intensities in each region and calculates the difference between these sums. This difference is then used to categorize subsections of an image. For example, with a human face, it is a common observation that among all faces the region of the eyes is darker than the region of the cheeks. Therefore, a common Haar feature for face detection is a set of two adjacent rectangles that lie above the eye and the cheek region. The position of these rectangles is defined relative to a detection window that acts like a bounding box to the target object (the face in this case).

[Viola, Jones: Rapid Object Detection using a Boosted Cascade of Simple Features (2001)](https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/viola-cvpr-01.pdf)
- pre-computing "integral images" to save time on calculations
- the main proposed feature types:
    - Edge Features (horizontal, vertical)
    - Line Features (horizontal, vertical)
    - Four rectangle features (diagonal)
    
<img src="../img/haar-features.png" alt="haar_features" width="400"/>

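- each feature is a single value: the difference between the sum of the pixels under the black rectangles and the sum of the pixels under the white rectangles (in the paper, the white sum is subtracted from the black sum)
- example: let's look at a horizontal edge filter applied to a 4x4 patch of pixel intensities: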

0000  
0000  
1111  
1111  

0.1 0.2 0.1 0.4  
0.2 0.2 0.1 0.1  
0.8 0.6 0.4 0.4  
0.7 0.9 1.0 0.9  

sum under 0s = (0.1 + 0.2 + 0.1 + 0.4) + (0.2 + 0.2 + 0.1 + 0.1) = 1.4  
average under 0s = 1.4 / 8 = 0.175

sum under 1s = (0.8 + 0.6 + 0.4 + 0.4) + (0.7 + 0.9 + 1.0 + 0.9) = 5.7  
average under 1s = 5.7 / 8 = 0.7125

delta = 0.7125 - 0.175 = 0.5375

We can say that if delta > 0.5, then there is a horizontal edge (feature) at this position.
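The calculation above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper names `region_mean` and `edge_response` and the constant `THRESHOLD` are made up for illustration, and the 0.5 threshold is simply the one used in the text, not a value from the paper.

```python
# 4x4 grayscale patch from the example (intensities in [0, 1]).
patch = [
    [0.1, 0.2, 0.1, 0.4],
    [0.2, 0.2, 0.1, 0.1],
    [0.8, 0.6, 0.4, 0.4],
    [0.7, 0.9, 1.0, 0.9],
]

# Horizontal edge filter: 0 marks the upper region, 1 the lower region.
edge_filter = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]


def region_mean(patch, mask, label):
    """Average intensity of the pixels whose mask entry equals `label`."""
    values = [
        pixel
        for patch_row, mask_row in zip(patch, mask)
        for pixel, m in zip(patch_row, mask_row)
        if m == label
    ]
    return sum(values) / len(values)


def edge_response(patch, mask):
    """Mean under the 1s minus mean under the 0s."""
    return region_mean(patch, mask, 1) - region_mean(patch, mask, 0)


THRESHOLD = 0.5  # illustrative threshold taken from the text (assumption)

delta = edge_response(patch, edge_filter)
has_edge = delta > THRESHOLD
print(f"delta = {delta:.4f}, edge detected: {has_edge}")
```

Using averages instead of raw sums keeps the value independent of the filter size, which is convenient when the same filter is applied at several scales.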

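- calculating this delta from scratch for each position (and size) of the filter in the image would be computationally expensive
- Viola-Jones algorithm solves this by using an *integral image*: after one pass over the image to build it, the sum of any rectangle can be read in O(1) time, regardless of the rectangle's size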
- integral image has the same dimension as the original image
- value at the position (x, y) in the integral image is the sum of all values of the rectangle crop of the original image defined by upper left corner coordinates (0, 0) and lower right corner coordinates (x, y) (inclusive). Let's name this value S(x, y). (0, 0) is the coordinate of the very upper left corner of the image.
- as each value is a sum, integral image is also called *summed area table*
- to calculate the sum of the pixels within an arbitrary rectangle, only four lookups are needed, at the points (x, y), (x + dx, y), (x, y + dy) and (x + dx, y + dy). Because S is inclusive, the result covers the pixels in columns x + 1 to x + dx and rows y + 1 to y + dy, i.e. a rectangle dx pixels wide and dy pixels high whose lower right pixel is (x + dx, y + dy):

```
S(rectangle) = S(x + dx, y + dy) - S(x, y + dy) - S(x + dx, y) + S(x, y)
```

To understand how we got this calculation, look at the following image (source: Wikipedia):

```
S(rectangle) = S(yellow rect) - S(blue rect) - S(green rect) + S(red rect)
```

S(blue rect) and S(green rect) both contain S(red rect), so when we subtract them both we subtract S(red rect) twice. That is the reason why we need to add S(red rect) back once at the end.

<img src="../img/integral_image.png" alt="integral_image" width="200"/>
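As a rough illustration of the summed area table, here is a small pure-Python sketch. The function names (`integral_image`, `lookup`, `rect_sum`) are hypothetical, the image is indexed as `image[y][x]`, and any lookup left of or above the image is assumed to be 0 so that rectangles touching the border also work.

```python
def integral_image(image):
    """Summed area table: table[y][x] = sum of image[0..y][0..x], inclusive."""
    height, width = len(image), len(image[0])
    table = [[0.0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0.0
        for x in range(width):
            row_sum += image[y][x]
            above = table[y - 1][x] if y > 0 else 0.0
            table[y][x] = row_sum + above
    return table


def lookup(table, x, y):
    """S(x, y), treating any index left of or above the image as 0."""
    if x < 0 or y < 0:
        return 0.0
    return table[y][x]


def rect_sum(table, x, y, dx, dy):
    """Sum of pixels in columns x+1..x+dx and rows y+1..y+dy (four lookups)."""
    return (
        lookup(table, x + dx, y + dy)
        - lookup(table, x, y + dy)
        - lookup(table, x + dx, y)
        + lookup(table, x, y)
    )


# The 4x4 patch from the edge-filter example.
patch = [
    [0.1, 0.2, 0.1, 0.4],
    [0.2, 0.2, 0.1, 0.1],
    [0.8, 0.6, 0.4, 0.4],
    [0.7, 0.9, 1.0, 0.9],
]
table = integral_image(patch)

# Upper half (rows 0-1) and lower half (rows 2-3), all four columns.
top_sum = rect_sum(table, -1, -1, 4, 2)
bottom_sum = rect_sum(table, -1, 1, 4, 2)
print(f"top = {top_sum:.2f}, bottom = {bottom_sum:.2f}")
```

Building the table takes one pass over the image; after that, every rectangle sum costs the same four lookups no matter how large the rectangle is.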

- another idea that contributes to the speed of the algorithm is using a cascade of classifiers
- each candidate sub-window of the image is passed through a series of classifiers (stages), based on the simple features mentioned above (edges, lines)
- once a sub-window fails a stage, it is rejected immediately and the remaining stages are skipped for it
- the detector still slides a window over the image at multiple scales, but most sub-windows are discarded by the first few cheap stages, so the more expensive later stages only run on promising regions


- for a frontal-face cascade, the image has to contain a front-facing face
- image is converted to grayscale, as Haar features are computed from pixel intensities only
- then the search for Haar cascade features begins
- the first one is a horizontal edge feature covering the eyes and the cheeks: the eye region is darker, the cheek region is lighter
- so we test the region being examined for this feature; if it's not found, the process stops for that region: no face is detected there
- the next test looks for the bridge of the nose, which is brighter than the eye regions on either side of it (a vertical line feature)
- the next classifiers test for e.g. eyebrows, then mouth, etc. We can have thousands of such features
- if all classifiers have been passed, that means that the face has been detected
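The early-rejection idea can be sketched as a loop over stages that stops at the first failure. Everything here is a toy assumption: `run_cascade`, `has_contrast` and `lower_half_brighter` are hypothetical names, and the two hand-written tests only stand in for the boosted classifiers that a real cascade learns from training data.

```python
def mean(values):
    return sum(values) / len(values)


def has_contrast(window):
    """Toy stage 1 (cheap): reject windows that are almost flat."""
    pixels = [p for row in window for p in row]
    return max(pixels) - min(pixels) > 0.3


def lower_half_brighter(window):
    """Toy stage 2: horizontal edge test, lower rows brighter than upper rows."""
    half = len(window) // 2
    upper = [p for row in window[:half] for p in row]
    lower = [p for row in window[half:] for p in row]
    return mean(lower) - mean(upper) > 0.5


def run_cascade(window, stages):
    """Return (accepted, stages_evaluated), stopping at the first failed stage."""
    for index, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, index  # early rejection: later stages never run
    return True, len(stages)


stages = [has_contrast, lower_half_brighter]

edge_window = [
    [0.1, 0.2, 0.1, 0.4],
    [0.2, 0.2, 0.1, 0.1],
    [0.8, 0.6, 0.4, 0.4],
    [0.7, 0.9, 1.0, 0.9],
]
flat_window = [[0.5] * 4 for _ in range(4)]

print(run_cascade(edge_window, stages))
print(run_cascade(flat_window, stages))
```

Ordering the stages from cheapest to most expensive is what makes the cascade pay off: most inputs are thrown away before the costly tests are ever evaluated.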


- theoretically, this approach can be used for detecting any object, not just faces

Downsides:  
- This algorithm needs large datasets to train your own features and classifiers, but luckily many pre-trained sets of features already exist. OpenCV comes with pre-trained XML files of various Haar cascades.
- Because of this need for large datasets, it is often more efficient to create image recognition classifiers with neural networks