# Machine Learning Final Project Report
Chang Yan (cyan13) and Jingguo Liang (jliang35)

## Goal
The goal of this project is to develop deep neural networks that detect human faces in an image and predict each person's age, sex, and race. Such a system has many applications in real-time camera pipelines.

## Deliverables
We successfully finished all of our deliverables. They are listed below:
#### Must accomplish
1. A program with a trained deep convolutional network with optimized structure and best tuned parameters, that can predict the age of the face in the input image
2. The prediction of the program should be distinctly better than a simple benchmark run, such as logistic regression.
3. The program should use appropriate data preprocessing methods to improve the performance.

#### Expect to accomplish
1. The program should be able to predict not only age but also sex and ethnicity of the face.
2. The program should use multiple different network structures, ensembles or other methods to further improve the performance.
3. The prediction of the program should reach an accuracy that is usable in real life, such as over 90%.
4. Adding feature extraction methods before the network (like edge detection) to create more features from the image.

#### Would like to accomplish
1. The program should be able to first identify the location of the face in the image, and then predict using the cropped face area.
The position of the face should also be an output.
2. The program should be able to take in different sizes of image files.
3. The program should be able to identify an image with no face, instead of giving random output.
4. The program should work with both RGB and grayscale images, and use the RGB channels to achieve better performance than with grayscale.
5. The speed of the prediction should be fast enough to be done in real-time.

## Dataset

The dataset used in the project is UTKFace, which can be found on the following website:
https://susanqq.github.io/UTKFace/

The raw data consists of four parts: 

(1) an "original" RGB image in .jpg format with arbitrary size that contains one human face. 

(2) a "cropped" RGB image in .jpg format with size (200, 200), which is the part of the image in (1) where the face is located.

(3) labels for the image, in four parts: the age, sex, and race of the person in the image, and the date on which the image was collected. <br>
The labels are contained in the image file names, formatted as [age]_[gender]_[race]_[date&time].jpg. <br>
Cropped images are named [age]_[gender]_[race]_[date&time].jpg.chip.jpg, so we can match original and cropped images using their file names. <br>
[age] is an integer from 0 to 116. <br>
[gender] is either 0 (male) or 1 (female). <br>
[race] is an integer from 0 to 4: 0 denotes White, 1 denotes Black, 2 denotes (East) Asian, 3 denotes Indian, and 4 denotes others (e.g. Hispanic, Latino, Middle Eastern). <br>
[date&time] is in the format yyyymmddHHMMSSFFF, indicating when the image was collected into the dataset.

(4) Facial landmarks located in the cropped image

In the project, we only use the first three parts of the data (original image, cropped image, labels). Of the labels, we only use age, gender, and race, as date&time is the collection time, which is irrelevant.
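
The file-name convention described above can be parsed with a few lines of Python. This is a minimal sketch rather than the project's actual loader: `FaceLabel`, `parse_utkface_name`, and `original_name` are hypothetical names introduced here for illustration, and a name that does not split into exactly four fields is simply rejected.

```python
import os
from typing import NamedTuple


class FaceLabel(NamedTuple):
    age: int     # 0 to 116
    gender: int  # 0 = male, 1 = female
    race: int    # 0 to 4


def parse_utkface_name(path: str) -> FaceLabel:
    """Parse '[age]_[gender]_[race]_[date&time].jpg' (or its '.chip.jpg' variant)."""
    parts = os.path.basename(path).split("_")
    if len(parts) != 4:
        raise ValueError(f"unexpected file name: {path!r}")
    age, gender, race = (int(p) for p in parts[:3])
    return FaceLabel(age, gender, race)


def original_name(cropped_name: str) -> str:
    """Map a cropped file name back to the file name of its original image."""
    suffix = ".chip.jpg"
    if cropped_name.endswith(suffix):
        return cropped_name[: -len(suffix)]
    return cropped_name


print(parse_utkface_name("25_1_2_20170116174525125.jpg.chip.jpg"))
print(original_name("25_1_2_20170116174525125.jpg.chip.jpg"))
```

The date&time field is split off but never converted, since it is not used as a label.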

## Data Processing
The full data preprocessing scripts can be found in the repo under the folder "DataPreprocessing".
### Data cleansing

The first step is to go through all cropped images and remove those that are obviously wrong (clearly not a face). We also removed all black-and-white images because we want RGB images as input.
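
The report does not say how black-and-white images were identified. One possible automated check, shown here only as a sketch, assumes that a black-and-white photo loaded as a 3-channel array has (nearly) identical values in all three channels. The function name `is_grayscale` and the `tolerance` parameter are hypothetical.

```python
import numpy as np


def is_grayscale(img: np.ndarray, tolerance: int = 0) -> bool:
    """Return True if an image carries no color information.

    A 2-D array is grayscale by definition. A [H, W, 3] array counts as
    grayscale when no pixel's channels differ by more than `tolerance`.
    """
    if img.ndim == 2:
        return True
    img = img.astype(np.int16)  # avoid uint8 wrap-around when subtracting
    diff_01 = np.abs(img[:, :, 0] - img[:, :, 1]).max()
    diff_12 = np.abs(img[:, :, 1] - img[:, :, 2]).max()
    return bool(max(diff_01, diff_12) <= tolerance)


gray = np.full((4, 4, 3), 128, dtype=np.uint8)
color = gray.copy()
color[0, 0, 2] = 200  # a single colored pixel
print(is_grayscale(gray), is_grayscale(color))
```

A small positive `tolerance` may be useful if compression introduces slight channel differences; whether that is needed for this dataset is not established here.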

### Crop Box Generation
Next, we want to locate the cropbox in the original image. We do this by matching SIFT keypoints between the two images and then finding the homography that maps the cropped image to the original image.

In [None]:
# Code is only for demonstration purpose and not executable in Jupyter Notebook.
# Complete, executable script can be found in the deliverables (.py files)

import numpy as np
import cv2 as cv

def extractKeypoints(img1, img2, N):
    sift = cv.SIFT_create()
    keypoint1, descriptor1 = sift.detectAndCompute(img1, None)
    keypoint2, descriptor2 = sift.detectAndCompute(img2, None)

    if keypoint1 is None or keypoint2 is None or len(keypoint1) < N or len(keypoint2) < N:
        return np.array([]), np.array([])

    bf = cv.BFMatcher_create()
    matches = bf.knnMatch(descriptor1, descriptor2, k = 2)

    good = []
    for m, n in matches:
        if m.distance < 0.75 * n.distance:
            good.append(m)

    matches = sorted(good, key = lambda x:x.distance)[:N]

    points1 = np.float32([keypoint1[m.queryIdx].pt for m in matches])
    points2 = np.float32([keypoint2[m.trainIdx].pt for m in matches])

    return points1, points2

[pts1, pts2] = extractKeypoints(imgOriginal, imgCropped, 25)
[H, status] = cv.findHomography(pts2, pts1, method = cv.RANSAC)

We notice that some images do not generate any features through SIFT, or generate fewer features than we want (<25). Those images are discarded.

After finding the homography, we map the four corners of the cropbox ([[0, 0], [0, 200], [200, 200], [200, 0]]) back to the original image. After mapping, the sides of the box are generally not parallel to the sides of the image, so we also compute the circumscribed (axis-aligned bounding) box of the mapped cropbox.

We also discard boxes that are rotated by a significant angle relative to the original image (we keep only rotations below 5 degrees), because in that case the circumscribed box would be significantly larger than the cropbox. The rotation angle can be extracted from the homography.

In [None]:
# Code is only for demonstration purpose and not executable in Jupyter Notebook.
# Complete, executable script can be found in the deliverables (.py files)

import math

angle = abs(math.atan2(H[1,0], H[0,0])) * 180 / math.pi
if angle < 5:
    originalShape = imgOriginal.shape
    croppedShape = imgCropped.shape
    # Corners as homogeneous (x, y, 1); shape is (height, width, channels)
    croppedPts = np.array([[0, 0, 1], [0, croppedShape[0], 1], [croppedShape[1], croppedShape[0], 1], [croppedShape[1], 0, 1]])
    originalPts = np.matmul(H, croppedPts.transpose()).transpose()
    originalPts = np.round(originalPts[:, 0:2] / originalPts[:, 2:3]).astype(np.int32)
    x1 = np.min(originalPts[:, 0]).clip(0, originalShape[1])
    x2 = np.max(originalPts[:, 0]).clip(0, originalShape[1])
    y1 = np.min(originalPts[:, 1]).clip(0, originalShape[0])
    y2 = np.max(originalPts[:, 1]).clip(0, originalShape[0])

<img src="res/SIFT.JPG"> 
<center>SIFT</center>

<img src="res/cropbox.JPG" width=400 />
<center>Cropbox</center>
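
The angle test and corner mapping used in this section can be checked on a synthetic homography whose effect is known in advance. The sketch below uses only NumPy. The helper names `rotation_angle_deg` and `bounding_box` and the example matrix are hypothetical, and clipping to the image bounds is left out for brevity.

```python
import math

import numpy as np


def rotation_angle_deg(H: np.ndarray) -> float:
    """In-plane rotation encoded in the upper-left block of H, in degrees."""
    return abs(math.atan2(H[1, 0], H[0, 0])) * 180 / math.pi


def bounding_box(H: np.ndarray, width: int, height: int) -> tuple:
    """Map the crop corners through H; return the axis-aligned (x1, y1, x2, y2)."""
    corners = np.array(
        [[0, 0, 1], [0, height, 1], [width, height, 1], [width, 0, 1]],
        dtype=float,
    )
    mapped = (H @ corners.T).T
    mapped = mapped[:, :2] / mapped[:, 2:3]  # homogeneous -> Cartesian
    x1, y1 = mapped.min(axis=0)
    x2, y2 = mapped.max(axis=0)
    return tuple(int(round(v)) for v in (x1, y1, x2, y2))


# Hypothetical homography: scale by 1.5, translate by (40, 60), no rotation.
H = np.array([[1.5, 0.0, 40.0],
              [0.0, 1.5, 60.0],
              [0.0, 0.0, 1.0]])

print(rotation_angle_deg(H))      # 0.0
print(bounding_box(H, 200, 200))  # (40, 60, 340, 360)
```

With a pure scale and shift, the mapped cropbox is already axis-aligned, so the bounding box coincides with it. A rotated homography would produce a larger bounding box, which is why rotated matches are discarded.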

### Feature Engineering
We would also like to use the intensity image and the edge map in training. The intensity image is a weighted combination of the RGB values, and the edge map is obtained through Canny edge detection. Appending both to the RGB image gives a 5-channel image.

The input to the neural model should be in [N,C,H,W] format, while our image is currently in [H,W,C] format. Moving the channel axis to the front gives the [C,H,W] layout of a single image, which is what the NCHW format requires.

In [None]:
# Code is only for demonstration purpose and not executable in Jupyter Notebook.
# Complete, executable script can be found in the deliverables (.py files)

class PreprocessImage(object):

    @classmethod
    def preprocess(cls, img, normalize):
        # img is a BGR image as returned by cv.imread, in [H, W, C] layout
        shape = img.shape
        # Weighted sum of the channels; channel order here is B, G, R
        intensity = (0.11 * img[:,:,0] + 0.59 * img[:,:,1] + 0.30 * img[:,:,2])
        intensity = intensity.reshape((shape[0], shape[1], 1))
        # Canny edge map: 0 for background, 255 for edge pixels
        edge = cv.Canny(img, 100, 100, apertureSize = 3)
        edge = edge.reshape((shape[0], shape[1], 1))
        if normalize:
            # Scale all five channels to [0, 1]
            intensity = intensity / 255.0
            edge = edge / 255.0
            img = img / 255.0
        result = np.append(img, intensity, axis = -1)
        result = np.append(result, edge, axis = -1)
        # [H, W, C] -> [C, H, W]
        result = np.moveaxis(result, -1, 0)
        # Swap channels 0 and 2: BGR -> RGB
        result[[0,2]] = result[[2,0]]
        return result.astype(float if normalize else np.uint8)

### Label Encoding
The original dataset gives ages directly as a number. However, in our model we would like to divide all ages into several categories: <br>
0-3: label 0 <br>
4-6: label 1 <br>
7-10: label 2 <br>
11-15: label 3 <br>
16-20: label 4 <br>
21-25: label 5 <br>
26-30: label 6 <br>
31-40: label 7 <br>
41-50: label 8 <br>
51-60: label 9 <br>
61-80: label 10 <br>
81+: label 11 <br>

Besides age, we also have sex and race in the labels. For sex: <br>
0: male <br>
1: female <br>

For race: <br>
0: White <br>
1: Black <br>
2: (East) Asian <br>
3: Indian <br>
4: Other
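
The age binning above can be written as a small lookup. This is a minimal sketch using the standard-library `bisect` module; `AGE_BIN_STARTS`, `AGE_BIN_NAMES`, and `age_to_label` are hypothetical names, not identifiers from the project scripts.

```python
from bisect import bisect_right

# Lowest age of every bin after the first; bin 0 starts at age 0.
AGE_BIN_STARTS = [4, 7, 11, 16, 21, 26, 31, 41, 51, 61, 81]
AGE_BIN_NAMES = ["0-3", "4-6", "7-10", "11-15", "16-20", "21-25",
                 "26-30", "31-40", "41-50", "51-60", "61-80", "81+"]


def age_to_label(age: int) -> int:
    """Map an age in years to its class index (0 to 11)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # The number of bin starts that are <= age is exactly the bin index.
    return bisect_right(AGE_BIN_STARTS, age)


for age in (3, 4, 30, 31, 116):
    label = age_to_label(age)
    print(age, label, AGE_BIN_NAMES[label])
```

`AGE_BIN_NAMES` also serves to turn a predicted class index back into a readable age range.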

### 5-Fold Cross Validation
The last step is to divide the data into 5 parts, which we use for 5-fold cross validation during training. After the division, we save the data as .npz files, which are read back and used directly in training. Each fold contains 1000 cases. Each fold has a single .npz file for all cropboxes, a single .npz file for all labels, and one .npz file for each of the original images.
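
The fold assignment and the train/validation rotation can be sketched as follows. This is an illustration only: the report does not state how cases were assigned to folds, so the random shuffle, the fixed seed, and the helper names `make_folds` and `cross_validation_splits` are assumptions.

```python
import numpy as np


def make_folds(n_cases: int, n_folds: int = 5, seed: int = 0) -> list:
    """Shuffle the case indices and split them into n_folds nearly equal parts."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_cases), n_folds)


def cross_validation_splits(folds: list):
    """Yield (train_indices, validation_indices), holding out one fold at a time."""
    for k, held_out in enumerate(folds):
        train = np.concatenate([f for i, f in enumerate(folds) if i != k])
        yield train, held_out


folds = make_folds(5000)  # 5 folds of 1000 cases each
for train_idx, val_idx in cross_validation_splits(folds):
    print(len(train_idx), len(val_idx))  # 4000 1000 on every iteration
```

Each fold's arrays could then be written to disk with `np.savez` and read back with `np.load` during training.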

## Model Result Visualization: Facial Detection: Faster-RCNN
Here are some examples of the predictions made by the facial detection Faster-RCNN network. The visualization is done using the file "visualization.py", a fully working prediction script that uses the trained networks to detect faces and predict labels. Part of the code is shown here for demonstration ONLY; the full code can be found in visualization.py.

In [None]:
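import cv2 as cv
import matplotlib.pyplot as plt
import numpy as np
import torch
import torchvision.transforms.functional as F
from torchvision.io import read_image
from torchvision.utils import draw_bounding_boxes
# PreprocessImage and Rcnn are defined in the project's .py files

def show(imgs):
    if not isinstance(imgs, list):
        imgs = [imgs]
    fig, axs = plt.subplots(ncols=len(imgs), squeeze=False)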
    for i, img in enumerate(imgs):
        img = img.detach()
        img = F.to_pil_image(img)
        axs[0, i].imshow(np.asarray(img))
        axs[0, i].set(xticklabels=[], yticklabels=[], xticks=[], yticks=[])
    plt.savefig("image.jpg")

def main(args):
    image = cv.imread("./Test/9.jpg")
    imageT = read_image("./Test/9.jpg")
    image = PreprocessImage.preprocess(image, False)  # unnormalized; scaled to [0, 1] below
    shape = image.shape
    
    imageT = F.resize(imageT, [shape[1], shape[2]])
    test_image = image[0:3].astype(float) / 255
    model = Rcnn(args.svpath, "Rcnn_full_feature_fold_4.pt", 50, 1)
    model.load()
    boxes, scores = model.predict([test_image])
    if (boxes[0] is not None):
        result = draw_bounding_boxes(imageT, torch.tensor([boxes[0]], dtype=torch.float), colors=["blue"], width=5)
        show(result)
        

<img src="res/box1.png"> 
<img src="res/box2.png"> 
<img src="res/box3.png"> 

We can see that our predictions are very accurate. The networks in each fold capture every face in their test sets. We show only 3 random cases here, but all other results look as accurate as these, and the accuracy is quite consistent across the 5 folds. This demonstrates that the facial detection of our network is very reliable.

## Model Result Visualization: Label Prediction: The Ensemble
Here are some examples of the predictions made by the ensemble classification network. The visualization is done using the file "visualization.py", a fully working prediction script that uses the trained networks to detect faces and predict labels. Part of the code is shown here for demonstration ONLY; the full code can be found in visualization.py.

In [None]:
# Code is only for demonstration purpose and not executable in Jupyter Notebook.
# Complete, executable script can be found in the deliverables (.py files)

def main(args):
    imageCropped = cv.imread('./Test/cropped3.jpg')
    imageCropped = PreprocessImage.preprocess2(imageCropped).reshape((1, 5, 224, 224))
    model = EnsembleWrapper(args.svpath, "Ensemble_age_fold_0.pt", 12, 50, 1)
    model.load()
    age = model.predict(imageCropped)
    print(age)
    model = EnsembleWrapper(args.svpath, "Ensemble_full_sex_fold_4.pt", 2, 50, 1)
    model.load()
    sex = model.predict(imageCropped)
    print(sex)
    model = EnsembleWrapper(args.svpath, "Ensemble_full_race_fold_4.pt", 5, 50, 1)
    model.load()
    race = model.predict(imageCropped)
    print(race)


<img src="res/cropped1.JPG"> 
<center>
    Age: [6] (26-30) <br>
    Sex: [0] (Male) <br>
    Race: [1] (Black) <br>
</center>

<img src="res/cropped2.JPG"> 
<center>
    Age: [10] (61-80) <br>
    Sex: [0] (Male) <br>
    Race: [0] (White) <br>
</center>

<img src="res/cropped3.JPG"> 
<center>
    Age: [6] (26-30) <br>
    Sex: [1] (Female) <br>
    Race: [3] (Indian) <br>
</center>

We can see that the predictions of both sex and race are quite accurate, but for age the third prediction is a bit higher than it should be. We believe that there are three reasons for this:

1. The original age labels may not be accurate. We found and removed some wrongly labeled images, but others likely remain.
2. Age is very hard to estimate even for human eyes, which makes it difficult for the network (we also did not use the largest version of each model because of training time).
3. The labels are imbalanced. 26-30 is the most common label, and the network is somewhat biased toward it.