# FaceMask Detection 

`Group Name: Gogogo`


|     Name      | Student ID |
| :-----------: | :--------: |
|  Puquan Chen  |  z5405329  |
| Wenzhen Zhang |  z5282188  |
|   Zeran Qiu   |  z5237346  |
| Xiaolan Zhang |  z5400028  |
|  Haoyu Zang   |  z5326339  |


# 1. Introduction

## 1.1 Background

COVID-19 has been spreading for three years. Although its impact is waning in many areas, we still do not have a good solution to the harm it causes. As a result, the use of face masks has increased significantly in many settings. Nevertheless, it remains important to raise public awareness of mask wearing, since people without masks are still commonly seen in confined public spaces. Face mask detection can therefore help local authorities monitor compliance with local mask-wearing policies.

## 1.2 Motivation
As mentioned above, while COVID-19 is still spreading, we need to ensure that people comply with mask-wearing policies. However, it is very difficult to check whether people are wearing masks by manual inspection alone, especially in crowded places such as trains and shopping malls. Therefore, there is a great need for face mask detection that can quickly and accurately identify whether people are wearing masks.


## 1.3 Purpose/ Project Abstract


In this project, our group applied three popular deep learning object detection models, `RCNN`, `YOLO` and `SSD`, and compared their accuracy, training time and loss on the same dataset. After collecting the necessary data and details, our group built our own model based on the existing models mentioned above and analysed methods that can be used to improve the performance of the network.

## 1.4 Common Technique / Key Terminology

**1.Intersection over Union (IoU):** 

The Intersection over Union (IoU) metric quantifies the overlap between the ground truth bounding box (BBox) and the predicted BBox: the area of their intersection divided by the area of their union. In NMS, however, IoU is computed between two predicted BBoxes instead.
![jupyter](./notebook_images/iou.png)
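
As a minimal sketch of the definition above, the function below computes IoU for two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box layout and the function name `iou` are assumptions made for illustration; they are not taken from the project scripts.

```python
def iou(box_a, box_b):
    """Return the IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (5, 5, 15, 15)
# Intersection is 5 * 5 = 25 and union is 100 + 100 - 25 = 175.
print(iou(ground_truth, prediction))
```

An IoU of 1 means the two boxes coincide, and an IoU of 0 means they do not overlap at all.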


**2.Anchor box:**

Anchor boxes are a set of predefined boxes with different scales and width-to-height ratios. YOLO has used anchor boxes since YOLOv2. The main idea is that, instead of predicting an entire bounding box from scratch, the bounding boxes for each grid cell are obtained by adjusting the initial anchor boxes. The sizes and shapes of the anchor boxes can be set either from experience or by running k-means clustering on the ground truth boxes of the dataset. In YOLOv5, besides setting the anchors by hand, the system can adaptively calculate the best anchors for the dataset during the training stage.
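
To make the idea of "predefined boxes with different scales and ratios" concrete, here is a small sketch that enumerates anchor shapes by hand. It is a generic illustration and not the adaptive anchor calculation in YOLOv5; the function `make_anchors`, the base size of 32 pixels and the chosen scales and ratios are all hypothetical.

```python
import math


def make_anchors(base_size, scales, ratios):
    """Return (width, height) pairs, one per scale and ratio combination.

    ratio is width / height, and every anchor of a given scale keeps
    the same area, (base_size * scale) ** 2.
    """
    anchors = []
    for scale in scales:
        area = (base_size * scale) ** 2
        for ratio in ratios:
            width = math.sqrt(area * ratio)
            height = width / ratio
            anchors.append((width, height))
    return anchors


# Hypothetical values: 2 scales x 3 ratios gives 6 anchor shapes.
anchors = make_anchors(base_size=32, scales=[1, 2], ratios=[0.5, 1.0, 2.0])
for width, height in anchors:
    print(f"{width:.1f} x {height:.1f}")
```

Clustering the ground truth box sizes with k-means, as mentioned above, replaces these hand-picked shapes with shapes that match the dataset.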

**3.Non-Maximum Suppression (NMS):**

NMS is a class of algorithms to select one entity (e.g., bounding boxes) out of many overlapping entities. We can choose the selection criteria to arrive at the desired results. 
![jupyter](./notebook_images/nms.jpg)

The process can be described in 3 main steps.

Assume we have a list P of predicted bounding boxes, each associated with a predicted confidence score c (for simplicity, we only consider one confidence score per box).

1. Select the prediction S with the highest confidence score, remove it from P and add it to the final prediction list.

2. Compare S with all the predictions remaining in P by calculating the IoU of S with every other prediction in P. If the IoU is greater than the threshold thresh_iou for any prediction T in P, remove T from P.

3. If there are still predictions left in P, go back to Step 1; otherwise return the final prediction list containing the filtered predictions.
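
The three steps above can be sketched in plain Python as follows. This is an illustrative greedy NMS for a single class, not the implementation used inside YOLOv5 or MMDetection; the function names and the `(x_min, y_min, x_max, y_max)` box layout are assumptions.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, thresh_iou):
    """Return the indices of the boxes kept by greedy NMS."""
    # P: indices of all predictions, sorted by descending confidence.
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while remaining:  # Step 3: repeat while P is not empty.
        best = remaining.pop(0)  # Step 1: take the most confident prediction S.
        keep.append(best)
        # Step 2: drop every prediction T that overlaps S too much.
        remaining = [
            i for i in remaining if iou(boxes[best], boxes[i]) <= thresh_iou
        ]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
# The second box overlaps the first with IoU = 81 / 119, so it is suppressed.
print(nms(boxes, scores, thresh_iou=0.5))  # [0, 2]
```

A lower `thresh_iou` suppresses more aggressively, while a higher value keeps more overlapping boxes.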



# 2. Data Sources / Preparation of Datasets

**All scripts used in this part are stored in the directory** `./tool_scripts` 

## 2.1 Basic Information about The Data 

Our original face mask dataset comes from https://github.com/AIZOOTech/FaceMaskDetection

It is a combined dataset. The training set contains 6120 images: 3114 from WIDER Face and 3006 from the MAsked FAces (MAFA) dataset. It covers normal faces with different lighting, poses and occlusion, as well as masked faces. The test set has 1839 images: 780 from WIDER Face and 1059 from MAFA. Here are some samples of the original images included in the dataset:

In [None]:
### TODO (Zeran): add sample images here, then check the surrounding sentences for problems

Since some label files in the original dataset are in unreadable formats, we pre-processed the original dataset: images with usable labels were identified and invalid samples were deleted. However, this processing dramatically reduced the number of reliable images, which could affect the subsequent training, so our group collected and added extra images to ensure the size and quality of the whole dataset. The extra datasets come from the links below:
1. https://public.roboflow.com/object-detection/mask-wearing
2. https://www.kaggle.com/datasets/andrewmvd/face-mask-detection

## 2.2 Split Dataset




After integrating the three datasets, we split the whole dataset into a training set, a validation set and a testing set.

We used [split_dataset.py](./tool_scripts/split_dataset.py) for the splitting task.

The script can be run on dataset with the command below:

`python3 split_dataset.py`

Our final face mask dataset contains `18756` training images, `2993` validation images and `2993` testing images 
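
For readers who want a feel for what such a split involves, here is a minimal sketch of a random three-way split. It is not the contents of `split_dataset.py`; the function `split_dataset`, the fractions and the file names are hypothetical and only illustrate the idea.

```python
import random


def split_dataset(items, val_fraction, test_fraction, seed=0):
    """Shuffle items and split them into (train, val, test) lists."""
    items = list(items)
    # A fixed seed makes the split reproducible across runs.
    random.Random(seed).shuffle(items)

    n_val = round(len(items) * val_fraction)
    n_test = round(len(items) * test_fraction)

    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    train = items[n_val + n_test:]
    return train, val, test


# Hypothetical file names; a real run would list the image directory instead.
names = [f"img_{i:04d}.jpg" for i in range(100)]
train, val, test = split_dataset(names, val_fraction=0.1, test_fraction=0.1)
print(len(train), len(val), len(test))  # 80 10 10
```

Each image and its label file need to end up in the same subset, so in practice the split is usually done on file stems rather than on images and labels separately.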

## 2.3 Process Dataset Labels 

We trained our models with different frameworks: `yolov5` for YOLO and `mmdetection` for the others. These frameworks need different label formats, such as `xml`, `txt`, and `json`. Therefore, we created and ran scripts that convert label files between these formats.






The usage of those scripts:

- 1. [formatting_xml.py](./tool_scripts): reformat the `xml` files so that they can be processed by the other scripts. 

<img style="float: left;" width = "650" height = "300" src="./notebook_images/2.png" >

- 2. [xml_to_txt.py](./tool_scripts/xml_to_txt.py): convert `xml` format labels to `txt` format labels. The `txt` format is also called the YOLO format; it only contains the object class, the coordinates of the object center, and the object width and height, as shown in the image below.

<img style="float: left;" width = "650" height = "300" src="./notebook_images/3.png" >

- 3. [txt_to_xml.py](./tool_scripts/txt_to_xml.py): convert `txt` format labels to `xml` format labels.

- 4. [voc2coc.py](./tool_scripts/voc2coc.py): convert `xml` format labels to `json` format labels.

Code reference of this part can be found in [tool_scripts/README.md](./tool_scripts/README.md)
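
The core of the `xml` to `txt` conversion is a change of box representation: VOC-style `xml` labels store pixel corners, while YOLO `txt` labels store a normalized center, width and height. The sketch below shows that arithmetic only; it is not the code of `xml_to_txt.py`, and the function names, image size and class id are hypothetical.

```python
def voc_to_yolo(box, image_width, image_height):
    """Convert (x_min, y_min, x_max, y_max) in pixels to normalized YOLO values."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return x_center, y_center, width, height


def yolo_to_voc(box, image_width, image_height):
    """Inverse conversion, back to pixel corners."""
    x_center, y_center, width, height = box
    x_min = (x_center - width / 2) * image_width
    y_min = (y_center - height / 2) * image_height
    x_max = (x_center + width / 2) * image_width
    y_max = (y_center + height / 2) * image_height
    return x_min, y_min, x_max, y_max


# Hypothetical example: one box in a 400 x 500 image, class id 1.
class_id = 1
yolo_box = voc_to_yolo((100, 50, 300, 250), image_width=400, image_height=500)
line = f"{class_id} " + " ".join(f"{value:.6f}" for value in yolo_box)
print(line)  # 1 0.500000 0.300000 0.500000 0.400000
```

Each line of a YOLO label file describes one object in this `class x_center y_center width height` order.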

# 3. Experimentation

The source code for each experiment is stored in the directory `source_code`.

All models were trained and tested on a [Cloud platform](https://www.autodl.com/), using Nvidia RTX 3090. (GPU Memory Size: 24 GB)

We tried to train each model until it converged, but this was very time-consuming.

## 3.1 Faster R-CNN

### 3.1.1 Introduction

Faster R-CNN is an improvement on R-CNN. R-CNN is a two-stage detector: it first generates region proposals and then predicts the class and location of each target with a convolutional neural network. R-CNN replaces the traditional feature extraction methods in object detection with a CNN, which extracts better feature information. It also uses transfer learning to improve object detection when the dataset is small. The main reason we chose Faster R-CNN is its shared feature extraction, which R-CNN does not have: it maps the region proposals directly onto the feature map of the last convolutional layer of the CNN, which reduces the feature extraction time.

We chose MMDetection as our working framework. MMDetection is an open source object detection toolbox based on PyTorch and is part of the OpenMMLab project. The master branch works with PyTorch 1.5+.

Reference: https://github.com/open-mmlab/mmdetection

### 3.1.2 Model Architecture

<img style="float: left;" width = "650" height = "300" src="./notebook_images/7.png" >
Reference: https://arxiv.org/abs/1506.01497

### 3.1.3 Experiment Implementation 

The source code of this part is in the directory `source_code/Faster-rcnn`

#### Install environment
- MMDetection works on Linux, Windows and macOS. It requires Python 3.6+, CUDA 9.2+ and PyTorch 1.5+.
1. `conda create --name openmmlab python=3.8 -y`
2. `conda activate openmmlab`
3. `conda install pytorch=1.8 torchvision cudatoolkit=10.2 -c pytorch`
(adjust this step to match your CUDA version)
4. `pip install -U openmim`
5. `mim install mmcv-full`
(MMDetection requires mmcv-full>=1.3.17, <1.7.0)
6. `cd Faster-rcnn`
7. `pip install -r requirements/build.txt`
8. `pip install -v -e .  # or "python setup.py develop"`

#### Training
We kept most of the hyperparameters at their default values.
The training hyperparameters can be found in [faster-rcnn_150epoch.log](experiment_result/Faster-rcnn/faster-rcnn_150epoch.log) within the directory `experiment_result/Faster-rcnn`

We used the command below to run our experiment.

In [None]:
!python ./tools/train.py

The training result is saved in the directory `train_mask` (this directory is created when training starts).

If training was interrupted, we used the command below to resume our experiment.

In [None]:
!python ./tools/train.py --resume-from ./train_mask/latest.pth

The training output can be found in this file: [faster-rcnn_result.json](experiment_result/Faster-rcnn/faster-rcnn_result.json)

### 3.1.4 Results 

#### Confusion Matrix

The figure below is the confusion matrix of our best Faster R-CNN model.

In [None]:
### TODO: add analysis ###

<img style="float: left;" width = "250" height = "300" src="./experiment_result/Faster-rcnn/confusion_matrix.png" >

#### The Plot of training

Based on the figure below, we can see that the model converged at about the 20th epoch.

<img style="float: left;" width = "650" height = "300" src="./experiment_result/Faster-rcnn/map.png" >

#### mAP

The table below shows the mAP values of our model.

| Class | mAP@0.5 | mAP@0.5:0.95 |
| :---: | :---: | :------: |
|  All  | 0.383 |  0.151   |

#### Inference Time

The inference speed (FPS) was evaluated with the script [benchmark.py](source_code/Faster-rcnn/tools/analysis_tools/benchmark.py). The test results varied between runs, so we took the best performance as the final result, which was `48.1` fps.
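
To illustrate the "best of several runs" idea in general terms, the sketch below times a callable repeatedly and reports the highest frames-per-second value. It is not the `benchmark.py` script; `measure_fps` and the dummy workload standing in for a model's forward pass are hypothetical.

```python
import time


def measure_fps(run_inference, num_images, repeats=3):
    """Call run_inference num_images times per run and return the best FPS."""
    best_fps = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(num_images):
            run_inference()
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best_fps = max(best_fps, num_images / elapsed)
    return best_fps


def dummy_inference():
    """Placeholder workload; a real benchmark would run the detector here."""
    return sum(range(1000))


fps = measure_fps(dummy_inference, num_images=50, repeats=3)
print(f"best throughput: {fps:.1f} fps")
```

Reporting the best run gives an optimistic figure; the mean over runs is a more conservative alternative.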

## 3.2 SSD

### 3.2.1 Introduction

Single Shot MultiBox Detector, also known as SSD, uses a fully convolutional approach. It is widely used for detecting objects in images and is considered one of the best models in both speed and accuracy, owing to its VGG-16 base architecture [5]. SSD needs only a single shot to detect multiple objects in an image, while R-CNN needs two stages, which is why SSD is much faster than R-CNN [6]. However, SSD is much worse than R-CNN at detecting small-scale objects, since these can only be detected in the higher resolution layers. SSD is also slightly less accurate than YOLO, and its large model slows it down. The structure of SSD is shown in Section 3.2.2.1 [7].

SSD has two main components: multi-scale feature maps and convolutional predictors. Multi-scale feature maps improve the accuracy significantly because multiple layers are used to detect objects independently, which is one of the important features for this project. In face mask detection, the faces in the images do not all appear at the same scale, so SSD remains effective on the validation and test sets.

### 3.2.2 Model Architecture

#### 3.2.2.1 SSD model architecture

In [1]:
### TODO: add an introduction ###

<img style="float: left;" width = "650" height = "300" src="./notebook_images/ssd.png" >

#### 3.2.2.2 MobileNetV2 model architecture

In [None]:
### TODO: add an introduction ###

<img style="float: left;" width = "350" height = "300" src="./notebook_images/8.png" >

### 3.2.3 Experiment Implementation

We used the MMDetection framework to train and test the ssd-mobilenetv2 model.

The source code of this part is in the directory `source_code/ssd-mobilenetv2`

#### Install environment
- MMDetection works on Linux, Windows and macOS. It requires Python 3.6+, CUDA 9.2+ and PyTorch 1.5+.
1. `conda create --name openmmlab python=3.8 -y`
2. `conda activate openmmlab`
3. `conda install pytorch=1.8 torchvision cudatoolkit=10.2 -c pytorch`
(adjust this step to match your CUDA version)
4. `pip install -U openmim`
5. `mim install mmcv-full`
(MMDetection requires mmcv-full>=1.3.17, <1.7.0)
6. `cd ssd-mobilenetv2`
7. `pip install -r requirements/build.txt`
8. `pip install -v -e .  # or "python setup.py develop"`

#### Training
We kept most of the hyperparameters at their default values.
The training hyperparameters can be found in [ssd_300epoch.log](experiment_result/ssd/ssd_300epoch.log) within the directory `experiment_result/ssd-mobilenetv2`

We used the command below to run our experiment.

In [None]:
!python ./tools/train.py

The training result is saved in the directory `train_mask` (this directory is created when training starts).

If training was interrupted, we used the command below to resume our experiment.

In [None]:
!python ./tools/train.py --resume-from ./train_mask/latest.pth

The training output can be found in this file: [ssd_result.json](experiment_result/ssd-mobilenetv2/ssd_result.json)

### 3.2.4 Results 

#### Confusion Matrix

The figure below is the confusion matrix of our best ssd-mobilenetv2 model.

In [2]:
### TODO: add analysis of this confusion matrix ###

<img style="float: left;" width = "250" height = "300" src="experiment_result/ssd-mobilenetv2/confusion_matrix.png" >

#### The Plot of training

Based on the figure below, we can see that the model converged at about the 200th epoch.

In [3]:
### TODO: add the mAP plot ###

#### mAP

The table below shows the mAP values of our model.

| Class | mAP@0.5 | mAP@0.5:0.95 |
| :---: | :---: | :------: |
|  All  | 0.69 |  0.333   |

#### Inference Time

The inference speed (FPS) was evaluated with the script [benchmark.py](source_code/ssd-mobilenetv2/tools/analysis_tools/benchmark.py). The test results varied between runs, so we took the best performance as the final result, which was `85.6` fps.

## 3.3 Yolov5

### 3.3.1 Introduction

YOLO (You Only Look Once) is a state-of-the-art object detection system and one of the most popular algorithms for real-time object detection. It is a `single-stage detector`, like SSD and unlike RCNN.

Since the debut of YOLOv1 in 2016, the YOLO family has been developed and updated up to YOLOv7, although Redmon, the author of YOLOv1, stopped working on YOLO after YOLOv3. For this project, our team chose YOLOv5, which was released in 2020, to perform face mask detection. Rather than a single model, YOLOv5 is a family of compound-scaled object detection models[3], and `YOLOv5s` is used in our project.

YOLOv5 comes in several versions, each with its own characteristics:
1. yolov5-n - The nano version
2. yolov5-s - The small version
3. yolov5-m - The medium version
4. yolov5-l - The large version
5. yolov5-x - The extra-large version

The performance of the above models is compared below:

<img style="float: left;" width = "650" height = "300" src="./notebook_images/4.png" >


Reference: https://github.com/ultralytics/yolov5

### 3.3.2 Model Architecture

The YOLOv5 network consists of three main parts. 
1. Backbone - CNN layers that aggregate image features at different scales.
2. Neck - A set of layers that combine image features and pass them forward to prediction.
3. Head - Takes features from the neck and performs localization and classification.

<img style="float: left;" width = "650" height = "300" src="./notebook_images/5.png" >
ref: https://www.researchgate.net/figure/The-network-architecture-of-Yolov5-It-consists-of-three-parts-1-Backbone-CSPDarknet_fig1_349299852

### 3.3.3 Experiment Implementation 

The source code of this part is in the directory `source_code/yolov5`

#### Install environment
1. `cd source_code/yolov5`
2. `conda create -n yolov5 python=3.8`
3. `conda activate yolov5`
4. `pip install -r requirements.txt`

#### Training

We kept most of the hyperparameters at their default values.

The training hyperparameters can be found in [opt.yaml](experiment_result/yolov5s/640_5s_300epoch/opt.yaml) within the directory `experiment_result/yolov5s/640_5s_300epoch`

We used the command below to run our experiment.

In [None]:
!python train.py --img 640 --epochs 300 --batch 64 --data './datasets/data.yaml' --cfg 'yolov5s.yaml' --weights '' --project 'facemask_640_yolo5s' --name '640_5s_300epoch'

The training output can be found in this file: [train_record.ipynb](source_code/yolov5/train_record.ipynb)

### 3.3.4 Results 

#### Confusion Matrix

The figure below is the confusion matrix of our best yolov5s model. Based on this figure, we can see that the background can be misdetected as face or mask.

<img style="float: left;" width = "650" height = "300" src="experiment_result/yolov5s/640_5s_300epoch/confusion_matrix.png" >

#### The Plot of training

Based on the figure below, we can see that the model converged at about the 200th epoch.

In [None]:
## TODO: figure. Plot a single mAP curve; the epochs on the x-axis need to be clearly visible.

#### mAP

The table below shows the mAP values of our model.

| Class | mAP@0.5 | mAP@0.5:0.95 |
| :---: | :---: | :------: |
|  All  | 0.887 |  0.504   |
| Face  | 0.931 |  0.557   |
| Mask  | 0.843 |  0.451   |

We can see that the mask class has a much lower mAP than the face class. This matches the confusion matrix, because masks are more easily misdetected as background than faces.

#### Inference Time

The inference speed (FPS) was evaluated with the script [val.py](source_code/yolov5/val.py). The test results varied between runs, so we took the best performance as the final result, which was `92.5` fps.

## 3.4 Yolov5-MobileNetV2

### 3.4.1 Introduction

We were inspired by ssd-mobilenetV2, which integrates the SSD algorithm with the MobileNetV2 network. Therefore, we decided to integrate yolov5 with MobileNet.

MobileNet has three versions: V1, V2 and V3. The first version (MobileNetV1) uses depthwise separable convolutions, which reduce model complexity and computational cost. The key method in the second version is the inverted residual structure, which keeps the computational cost low while reaching relatively higher accuracy than MobileNetV1. V3 adds squeeze-and-excitation (channel attention) modules and replaces ReLU activation with h-swish activation.[1] (TODO: add references to the MobileNet papers here)
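
A quick parameter count shows why depthwise separable convolutions reduce model complexity. The sketch below compares a standard convolution with a depthwise convolution followed by a 1x1 pointwise convolution; it ignores bias terms, and the channel counts and kernel size are hypothetical values chosen only for illustration.

```python
def standard_conv_params(c_in, c_out, kernel_size):
    """Weights in a standard convolution (no bias)."""
    return kernel_size * kernel_size * c_in * c_out


def depthwise_separable_params(c_in, c_out, kernel_size):
    """Weights in a depthwise conv plus a 1x1 pointwise conv (no bias)."""
    depthwise = kernel_size * kernel_size * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out                      # 1x1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 128 output channels, 3x3 kernel.
standard = standard_conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
print(standard)   # 73728
print(separable)  # 8768
print(f"ratio: {separable / standard:.3f}")
```

In general the ratio is `1 / c_out + 1 / kernel_size ** 2`, so with 3x3 kernels the separable form needs roughly one ninth of the weights.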



For comparison, we chose MobileNetV2, the same backbone used by the SSD model.

### 3.4.2 Custom Layers Modification in Yolov5

As introduced in Section 3.3.2, yolov5 has three main parts. We only need to modify the backbone of the yolov5 model, which extracts and processes features.

ref: 
1. https://blog.csdn.net/weixin_42182534/article/details/123418604?spm=1001.2101.3001.6650.1&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-1-123418604-blog-114492726.pc_relevant_3mothn_strategy_and_data_recovery&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-1-123418604-blog-114492726.pc_relevant_3mothn_strategy_and_data_recovery&utm_relevant_index=2

2. https://github.com/shaoshengsong/YOLOv5-ShuffleNetV2


To change the architecture, we needed to:

1. Add the inverted residual structure to the yolov5 model module; see the file [common.py](source_code/yolov5-MobileNetV2/models/common.py), line 876.
2. Modify the model configuration file of the original model. The key idea is to replace the original CNN layers with invertedResidual layers and connect them to their corresponding head layers. The final configuration file can be found in [yolov5-moblienetv2.yaml](source_code/yolov5-MobileNetV2/models/yolov5-moblienetv2.yaml)

The comparison of model complexity for Yolov5-MobileNetV2 and Yolov5s is shown below:

|       Model        | Number of Parameters | Number of Layers |
| :----------------: | :------------------: | :--------------: |
|      Yolov5s       |       7015519        |       157        |
| Yolov5-MobileNetV2 |       2916063        |       276        |

### 3.4.3 Experiment Implementation 

The training process and hyperparameter settings were the same as in the previous experiment.

The training output can be found in this file: [train_record.ipynb](source_code/yolov5-MobileNetV2/train_record.ipynb)

### 3.4.4 Results

#### Confusion Matrix

<img style="float: left;" width = "450" height = "300" src="experiment_result/yolov5-MobileNetV2/640_300epoch/confusion_matrix.png" >

<img style="float: left;" width = "450" height = "300" src="experiment_result/yolov5s/640_5s_300epoch/confusion_matrix.png" >

In the matrices above, the left one is for Yolov5-MobileNetV2 and the right one is for Yolov5s. We can see that the accuracy decreased slightly. The pattern of misdetection among mask, face and background remained the same as in the yolov5s model.

#### The Plot of training

Based on the training plot, we can see that the model converged at about the 100th epoch.

#### mAP

The table below shows the mAP values of our model.

| Class | mAP@0.5 | mAP@0.5:0.95 |
| :---: | :---: | :------: |
|  All  | 0.871 |  0.478   |
| Face  | 0.919 |  0.533   |
| Mask  | 0.823 |  0.423   |

#### Inference Time

The test results varied between runs, so we took the best performance as the final result, which was `95.2` fps.

# 4. Comparison / Conclusion