# FaceMask Detection 

`Group Name: Gogogo`


|     Name      | Student ID |
| :-----------: | :--------: |
|  Puquan Chen  |  z5405329  |
| Wenzhen Zhang |  z5282188  |
|   Zeran Qiu   |  z5237346  |
| Xiaolan Zhang |  z5400028  |
|  Haoyu Zang   |  z5326339  |


# 1. Introduction

## 1.1 Background

COVID-19 has been spreading for three years. Although its impact is waning in many areas, there is still no complete solution to the harm it causes. As a result, the use of face masks has increased significantly in many settings. Nevertheless, it remains important to raise public awareness of mask wearing, since people without masks are still commonly seen in confined public spaces. Face mask detection can help local authorities monitor compliance with local mask-wearing policies.

## 1.2 Motivation
As mentioned above, while COVID-19 is still spreading, we need to ensure that people comply with mask-wearing policies. However, it is very difficult to check whether people are wearing masks by manual inspection alone, especially in crowded places such as trains and shopping malls. There is therefore a strong need for face mask detection that identifies quickly and accurately whether people are wearing masks.


## 1.3 Purpose/ Project Abstract

In this project, our group aims to develop a real-time face mask detection system. We applied three popular deep learning object detection models, Faster R-CNN, YOLO and SSD, and compared their performance on the same dataset. We used MMDetection and YOLOv5 as the training frameworks. After collecting the results, we replaced the backbones of YOLOv5 and SSD with MobileNetV2 in the innovation part, which may improve the speed of the models.

# 2. Data Sources / Preparation of Datasets

All scripts used in this part are stored in the directory `./tool_scripts` 

Link for Dataset used: 

## 2.1 Basic Information about The Data 

Our original face mask dataset contains 6120 images and comes from https://github.com/AIZOOTech/FaceMaskDetection

The Face Mask Dataset is a combined dataset. The training set is made up of 3114 images from the Wider Face dataset and 3006 images from the MAsked FAces (MAFA) dataset. It contains normal faces under different lighting, poses and occlusion, as well as masked faces. The test set has 1839 images, of which 780 come from Wider Face and 1059 from MAFA. Here are some samples of the original images included in the dataset:

![jupyter](./notebook_images/dataset_sample.png)

Since some label files in the original dataset were in an unreadable format, we pre-processed the original dataset: images with usable labels were kept and invalid samples were deleted. After this processing, however, the number of reliable images dropped dramatically, which could have affected training, so our group collected extra images to ensure the size and quality of the whole dataset. The extra datasets come from the links below:
1. https://public.roboflow.com/object-detection/mask-wearing
2. https://www.kaggle.com/datasets/andrewmvd/face-mask-detection

## 2.2 Split Dataset




After integrating the three datasets, we split the whole dataset into a training set, a validation set and a testing set.

We used [split_dataset.py](./tool_scripts/split_dataset.py) to do the splitting task.

The script can be run on the dataset with the command below:

`python3 split_dataset.py`

Our final face mask dataset contains `18756` training images, `2993` validation images and `2993` testing images 
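
A split of this kind can be sketched as follows. This is a minimal illustration only and is not the contents of `split_dataset.py`: the function name `split_dataset`, the 12% validation and test ratios, and the fixed seed are all assumptions chosen for the example.

```python
import random


def split_dataset(image_names, val_ratio=0.12, test_ratio=0.12, seed=0):
    """Shuffle image names and split them into train/val/test lists.

    The ratios and the seed are illustrative assumptions, not the values
    used by the project's own script.
    """
    names = sorted(image_names)            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(names)     # seeded, in-place shuffle
    n_val = round(len(names) * val_ratio)
    n_test = round(len(names) * test_ratio)
    val = names[:n_val]
    test = names[n_val:n_val + n_test]
    train = names[n_val + n_test:]
    return train, val, test


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the function.
    names = [f"img_{i:05d}.jpg" for i in range(1000)]
    train, val, test = split_dataset(names)
    print(len(train), len(val), len(test))
```

Splitting on file names rather than on copies of the images keeps the step cheap; the label files can then be moved alongside the images that share their base name.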

## 2.3 Process Dataset Labels 

We trained our models with different frameworks: `yolov5` for YOLO and `mmdetection` for the others. These frameworks need labels in different formats, such as `xml`, `txt` and `json`, so we wrote and ran scripts that convert label files between these formats.






The usage of those scripts:

- 1. [formatting_xml.py](./tool_scripts): reformats the `xml` files so that they can be processed by the other scripts.

<img style="float: left;" width = "650" height = "300" src="./notebook_images/2.png" >

- 2. [xml_to_txt.py](./tool_scripts/xml_to_txt.py): converts `xml` format labels to `txt` format labels. The `txt` format is also called YOLO format; each line contains only the object class, the coordinates of the box center, and the box width and height, as shown in the image below.

<img style="float: left;" width = "650" height = "300" src="./notebook_images/3.png" >

- 3. [txt_to_xml.py](./tool_scripts/txt_to_xml.py): convert `txt` format labels to `xml` format labels.

- 4. [voc2coc.py](./tool_scripts/voc2coc.py): convert `xml` format labels to `json` format labels.

Code reference of this part can be found in [tool_scripts/README.md](./tool_scripts/README.md)
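
The core of the `xml` to `txt` conversion is a change of box representation, which can be sketched as below. This is a hedged illustration rather than the code in `xml_to_txt.py`: the function names and the example class id are assumptions, and the sketch assumes the usual YOLO convention of values normalized by the image width and height.

```python
def voc_to_yolo(box, img_w, img_h):
    """Convert a VOC box (xmin, ymin, xmax, ymax) in pixels to a
    YOLO box (x_center, y_center, width, height) normalized to [0, 1]."""
    xmin, ymin, xmax, ymax = box
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return x_center, y_center, width, height


def yolo_label_line(class_id, box, img_w, img_h):
    """Format one object as a line of a YOLO txt label file."""
    values = voc_to_yolo(box, img_w, img_h)
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)


if __name__ == "__main__":
    # Hypothetical object: class id 1 in a 400x500 image.
    print(yolo_label_line(1, (100, 50, 300, 250), img_w=400, img_h=500))
```

The reverse direction (`txt` to `xml`) multiplies by the image size and subtracts or adds half the width and height to recover the corner coordinates.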

# 3. Experimentation

The source code of the experiments is stored in the directory `source_code`.

### Experiment platform: 

All models were trained and tested on a [cloud platform](https://www.autodl.com/) with the following configuration:

    GPU: NVIDIA RTX 3090 24GB
    CPU: Intel Xeon Platinum 8255C @2.50GHz
    OS: Windows 10 and Linux
    Platform: Jupyter Notebook (Python 3.8)
    Framework: MMDetection (Faster R-CNN and SSD-mobilenetv2), YOLOv5 (YOLOv5s and YOLOv5s-mobilenetv2)


### Common Technique / Key Terminology:

**Intersection over Union (IoU):** 

The Intersection over Union (IoU) metric quantifies the percentage overlap between the ground-truth bounding box (BBox) and the predicted BBox [1]. In NMS, we instead compute the IoU between two predicted BBoxes.
![jupyter](./notebook_images/iou.png)
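
As a minimal sketch of the metric, the function below computes IoU for two axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the function name `iou` are assumptions made for this example, not code taken from the frameworks used in this project.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```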


**Anchor box:**

Anchor boxes are a set of predefined boxes with different scales and width-height ratios. The main idea is that, instead of predicting an entire bounding box from scratch, the bounding boxes for each grid cell are obtained by adjusting the initial anchor boxes. The size and shape of the anchor boxes can be set either from experience or by running k-means clustering on the ground-truth boxes of the dataset. In YOLOv5, besides setting the anchors by hand, the system can adaptively calculate the best anchors for the dataset during the training stage.
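
To make the idea of "scaled sizes and width-height ratios" concrete, the sketch below builds a small set of anchor shapes by hand. It illustrates the hand-set option only, not the k-means or YOLOv5 auto-anchor procedure; the function name, the scales and the ratios are assumptions for the example.

```python
import math


def make_anchors(scales, ratios):
    """Return (width, height) pairs, one per combination of scale and ratio.

    Each anchor keeps an area of scale**2 and a width/height ratio of `ratio`.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale * math.sqrt(ratio)
            height = scale / math.sqrt(ratio)
            anchors.append((width, height))
    return anchors


if __name__ == "__main__":
    # Illustrative values: two scales and three aspect ratios give six anchors.
    for width, height in make_anchors([32, 64], [0.5, 1.0, 2.0]):
        print(f"{width:.1f} x {height:.1f}")
```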

**Non-Maximum Suppression (NMS):**

NMS is a class of algorithms to select one entity (e.g., bounding boxes) out of many overlapping entities. We can choose the selection criteria to arrive at the desired results. [1]
![jupyter](./notebook_images/nms.jpg)

Assume we have a list P of predicted bounding boxes, each associated with a predicted confidence score c. The process can be described in 3 main steps:

1. Select the prediction S with the highest confidence score, remove it from P and add it to the final prediction list.

2. Compare S with all the predictions remaining in P by calculating the IoU of S with each of them. If the IoU is greater than the threshold thresh_iou for any prediction T in P, remove T from P.

3. If there are still predictions left in P, go to Step 1 again; otherwise return the final prediction list containing the filtered predictions.
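
The three steps above can be sketched in plain Python as follows. This is a minimal, unoptimized illustration and not the implementation used by MMDetection or YOLOv5; the corner box format, the helper `iou` and the default threshold of 0.5 are assumptions for the example.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, thresh_iou=0.5):
    """Return the indices of the boxes kept by non-maximum suppression."""
    # P: indices of the remaining predictions, highest confidence first.
    remaining = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while remaining:
        best = remaining.pop(0)          # Step 1: take the most confident prediction S
        keep.append(best)
        # Step 2: drop every prediction T that overlaps S by more than the threshold
        remaining = [i for i in remaining
                     if iou(boxes[best], boxes[i]) <= thresh_iou]
    return keep                          # Step 3: loop ends when P is empty


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # keeps indices [0, 2]
```

In the example, the second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed, while the third box does not overlap and is kept.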



## 3.1 Faster R-CNN

### 3.1.1 Introduction

**R-CNN** (Regions with CNN features)[2] was proposed at a time when traditional object detection had reached a bottleneck. The detection accuracy of traditional object detection algorithms was only about 30%, whereas R-CNN reached more than 50%, a large improvement in detection accuracy.

![jupyter](./notebook_images/rcnn.png)

R-CNN has two key points: 
1. It uses convolutional neural networks to replace traditional feature extraction methods (e.g. HOG) in object detection, which extracts better feature information. 
2. It uses transfer learning to improve detection, since the object detection datasets available at the time were small. 

The process of R-CNN can be divided into several main steps: 
1. Generate 1k to 2k candidate regions for the input image using the Selective Search method.
2. Extract features for each candidate region using a deep neural network.
3. Feed the resulting feature vectors into an SVM classifier for each class to determine which class they belong to.
4. Adjust the position of the candidate box using a regressor.

The figure above shows the flow of R-CNN in more detail.


**Fast R-CNN**[3] is an improvement on R-CNN; both use VGG16 as the backbone of the network. Compared to R-CNN, Fast R-CNN trains faster, runs inference faster and is more accurate.
![jupyter](./notebook_images/fastrcnn.png)

The algorithm of Fast R-CNN can be divided into three main steps: 
1. Generate about 2,000 candidate regions for the image using selective search.
2. Input the image into the network to obtain the corresponding feature map, and project the candidate regions generated in the previous step onto the feature map to obtain the corresponding feature matrices.
3. Scale each feature matrix to a feature map of the same size using a region of interest (ROI) pooling layer. Afterwards, the scaled feature maps are flattened and the final prediction is obtained through a series of fully connected layers. The algorithm flow of Fast R-CNN is shown in the diagram above.


**Faster R-CNN**[4] is an improvement on Fast R-CNN. It is a two-stage detector: it first generates region proposals, and then predicts the class and location of each target with convolutional neural networks. Faster R-CNN is largely identical to Fast R-CNN, except that it uses a specialized region proposal network (RPN) instead of the selective search algorithm. Faster R-CNN further improves inference speed and accuracy on top of Fast R-CNN.

### 3.1.2 Faster R-CNN Model Architecture

<img style="float: left;" width = "300" height = "150" src="./notebook_images/7.png" >
Reference: https://arxiv.org/abs/1506.01497 [4]

### 3.1.3 Experiment Implementation 

We chose MMDetection as the working framework. MMDetection is an open source object detection toolbox based on PyTorch.
Reference: https://github.com/open-mmlab/mmdetection

The source code of this part is in the directory `source_code/Faster-rcnn`

#### Install environment:
- MMDetection works on Linux, Windows and macOS. It requires Python 3.6+, CUDA 9.2+ and PyTorch 1.5+.
1. `conda create --name openmmlab python=3.8 -y`
2. `conda activate openmmlab`
3. `conda install pytorch=1.8 torchvision cudatoolkit=10.2 -c pytorch`
(change this step according to your CUDA version)
4. `pip install -U openmim`
5. `mim install mmcv-full`
(MMDetection requires mmcv-full>=1.3.17, <1.7.0)
6. `cd Faster-rcnn`
7. `pip install -r requirements/build.txt`
8. `pip install -v -e .  # or "python setup.py develop"`

#### Training:
We left most hyperparameters at their default values.
The training hyperparameters can be found in [faster-rcnn_150epoch.log](experiment_result/Faster-rcnn/faster-rcnn_150epoch.log) within the directory `experiment_result/Faster-rcnn`

We used the command below to run our experiment.

`!python ./tools/train.py`

The training result is saved in the directory `train_mask` (this directory is created when training starts).

If training was interrupted, we used the command below to continue our experiment.

`!python ./tools/train.py --resume-from ./train_mask/latest.pth`

The training output can be found in this file: [faster-rcnn_result.json](experiment_result/Faster-rcnn/faster-rcnn_result.json)

### 3.1.4 Results of Faster R-CNN model:

#### Confusion Matrix:

The figure below is the confusion matrix of our best Faster R-CNN model.

In the figure, e represents the mask, k represents the face and d represents the background. The x-axis is the predicted value and the y-axis is the true value. Overall, the detection performance of Faster R-CNN is not very good: faces and masks are often recognized as background, and background regions are also recognized as a mask or a face.

<img style="float: left;" width = "250" height = "300" src="./experiment_result/Faster-rcnn/confusion_matrix.png" >

#### The plot of training for Faster R-CNN model:
Based on the figure below, we can see that the model converged at about the 20th epoch.

<img style="float: left;" width = "650" height = "300" src="./experiment_result/Faster-rcnn/map.png" >

#### mAP of Faster R-CNN model:

The table below shows the mAP values of our model.

| Class | mAP50 | mAP50-95 |
| :---: | :---: | :------: |
|  All  | 0.383 |  0.151   |

#### The inference time of Faster R-CNN model:

The inference speed (FPS) was evaluated with the script [benchmark.py](source_code/Faster-rcnn/tools/analysis_tools/benchmark.py). The test results varied between runs; we took the best performance as the final result, which was `48.1` fps.

## 3.2 SSD

### 3.2.1 Introduction

Single-shot Multibox Detector, also known as SSD, uses a fully convolutional approach in the network. It is widely used for detecting objects in images and, built on the VGG-16 base architecture, is considered one of the best models in terms of both speed and accuracy [5]. SSD needs only one shot to detect multiple objects in an image, while R-CNN needs two, which is why SSD is much faster than R-CNN [6]. However, SSD is much worse than R-CNN at detecting small-scale objects, since these can only be detected in the higher-resolution layers, and SSD is slightly less accurate than YOLO, with its speed also held back by the large model. The structure of SSD [7] is shown in the figure in section 3.2.2.1.

SSD has two main components: multi-scale feature maps and convolutional predictors. Multi-scale feature maps improve accuracy significantly because multiple layers are used to detect objects independently, and that is one of the important features for this project. In face mask detection, faces do not all appear at the same scale, so the multi-scale design of SSD suits the validation and test sets.

### 3.2.2 Model Architecture

#### 3.2.2.1 SSD model architecture

SSD consists of two parts: the backbone model and the SSD head. In the figure below, the white boxes represent the backbone and the last few blue boxes show the SSD head. The backbone model is usually a pre-trained image classification network used as a feature extractor [7]. In this way, we get a deep neural network that is able to extract the semantics of the input image while preserving its spatial structure, although at a very low resolution. The SSD head is simply one or more convolutional layers added to the backbone, and its outputs are interpreted as the bounding boxes and object classes at the spatial locations of the final layers' activations.

The backbones provided by MMDetection for SSD include VGG-16 and MobileNetV2. Our group finally chose SSD-mobilenetv2 as the deployed model; it is a single-stage object detection model, which is widely used for its compact network and depthwise separable convolutions. A one-stage detector requires only a single pass through the neural network and predicts all the bounding boxes in one go. SSD-mobilenetv2 is a mobile-friendly variant of regular SSD with high precision. Furthermore, SSD-mobilenetv2 significantly reduces the parameter count and computational cost compared to SSD, as the comparison below shows [8]:



<img style="float: left;" width = "650" height = "300" src="./notebook_images/ssd.png" >

#### 3.2.2.2 MobileNetV2 model architecture

MobileNetV2 is a convolutional neural network architecture that is widely used as a backbone for its compact network and depthwise separable convolutions. It is based on an inverted residual structure where the residual connections are located between the bottleneck layers. The MobileNetV2 architecture consists of an initial full convolution layer with 32 filters, followed by 19 residual bottleneck layers.[9]

<img style="float: left;" width = "350" height = "300" src="./notebook_images/8.png" >

### 3.2.3 Experiment Implementation

We used the MMDetection framework to train and test the ssd-mobilenetv2 model.

The source code of this part is in the directory `source_code/ssd-mobilenetv2`

#### Install environment
- MMDetection works on Linux, Windows and macOS. It requires Python 3.6+, CUDA 9.2+ and PyTorch 1.5+.
1. `conda create --name openmmlab python=3.8 -y`
2. `conda activate openmmlab`
3. `conda install pytorch=1.8 torchvision cudatoolkit=10.2 -c pytorch`
(change this step according to your CUDA version)
4. `pip install -U openmim`
5. `mim install mmcv-full`
(MMDetection requires mmcv-full>=1.3.17, <1.7.0)
6. `cd ssd-mobilenetv2`
7. `pip install -r requirements/build.txt`
8. `pip install -v -e .  # or "python setup.py develop"`

#### Training
We left most hyperparameters at their default values.
The training hyperparameters can be found in [ssd_300epoch.log](experiment_result/ssd/ssd_300epoch.log) within the directory `experiment_result/ssd-mobilenetv2`

We used the command below to run our experiment.

`!python ./tools/train.py`

The training result is saved in the directory `train_mask` (this directory is created when training starts).

If training was interrupted, we used the command below to continue our experiment.

`!python ./tools/train.py --resume-from ./train_mask/latest.pth`

The training output can be found in this file: [ssd_result.json](experiment_result/ssd-mobilenetv2/ssd_result.json)

### 3.2.4 Results 

#### Confusion Matrix of SSD-mobilenetv2 model:

The figure below is the confusion matrix of our best ssd-mobilenetv2 model. The overall accuracy of the SSD-mobilenetv2 model is not bad; in particular, the accuracy reached 73% when detecting a human face without a mask. Nevertheless, the system sometimes mistakes the background for a human face or a face with a mask.

<img style="float: left;" width = "250" height = "300" src="experiment_result/ssd-mobilenetv2/confusion_matrix.png" >

#### The plot of training for SSD-mobilenetv2 model:

Based on the figure below, we can see that the model converged at about the 200th epoch.
![jupyter](./notebook_images/ssd_map.png)

####  mAP of SSD-mobilenetv2 model:

The table below shows the mAP values of our model.

| Class | mAP50 | mAP50-95 |
| :---: | :---: | :------: |
|  All  | 0.69  |  0.333   |

#### Inference Time of SSD-mobilenetv2 model:

The inference speed (FPS) was evaluated with the script [benchmark.py](source_code/ssd-mobilenetv2/tools/analysis_tools/benchmark.py). The test results varied between runs; we took the best performance as the final result, which was `85.6` fps.

## 3.3 Yolov5s

### 3.3.1 Introduction

YOLO (You Only Look Once) is a state-of-the-art object detection system and one of the most popular algorithms for real-time object detection. It is a `single-stage detector`, like SSD and unlike R-CNN.

Since the debut of YOLOv1 in 2016, the YOLO family has been actively developed up to YOLOv7, although Redmon, the author of YOLOv1, stopped working on YOLO after YOLOv3. For this project, our team chose YOLOv5, which was released in 2020, to perform face mask detection. Rather than a single model, YOLOv5 is a family of compound-scaled object detection models[10], and `YOLOv5s` is used in our project.

YOLOv5 comes in several versions, each with its own characteristics:
1. yolov5-n - The nano version
2. yolov5-s - The small version
3. yolov5-m - The medium version
4. yolov5-l - The large version
5. yolov5-x - The extra-large version

The performance analysis of above models is shown below:

<img style="float: left;" width = "650" height = "300" src="./notebook_images/4.png" >


Reference: https://github.com/ultralytics/yolov5

The main process of object detection in YOLO is illustrated in the following figure. The system divides the “feature map” into S×S grid cells[11]. Each grid cell is responsible for detecting the objects that belong to it. This is achieved by first generating three bounding boxes (each with box coordinates and a confidence) and the conditional class probabilities for each grid cell. The class probability of each bounding box is then calculated by multiplying the confidence of the box by the conditional class probability of the grid cell. Finally, based on these class probabilities, NMS is used to eliminate redundant bounding boxes and obtain the final detections.
![jupyter](./notebook_images/yolov1.png)
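
The scoring step described above is a simple product, sketched below for a single bounding box. The function name, the class names and the numbers are illustrative assumptions, not values taken from our trained model.

```python
def class_scores(box_confidence, conditional_class_probs):
    """Class-specific score of one box: confidence x P(class | object)."""
    return {name: box_confidence * prob
            for name, prob in conditional_class_probs.items()}


if __name__ == "__main__":
    # Hypothetical box with confidence 0.8 in a cell that favours "mask".
    scores = class_scores(0.8, {"face": 0.25, "mask": 0.75})
    best_class = max(scores, key=scores.get)
    print(best_class, scores)
```

Boxes whose best score falls below a chosen threshold are usually discarded before NMS is applied to the rest.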

### 3.3.2 Model Architecture

The YOLOv5 network consists of three main parts. 
1. Backbone - CNN layers that aggregate image features at different scales.
2. Neck - A set of layers that combine image features and pass them forward to prediction.
3. Head - Takes features from the neck and performs localization and classification.

<img style="float: left;" width = "650" height = "300" src="./notebook_images/5.png" >
ref: https://www.researchgate.net/figure/The-network-architecture-of-Yolov5-It-consists-of-three-parts-1-Backbone-CSPDarknet_fig1_349299852

### 3.3.3 Experiment Implementation 

The source code of this part is in the directory `source_code/yolov5`

#### Install environment
1. `cd source_code/yolov5`
2. `conda create -n yolov5 python=3.8`
3. `conda activate yolov5`
4. `pip install -r requirements.txt`

#### Training

We left most hyperparameters at their default values.

The training hyperparameters can be found in [opt.yaml](experiment_result/yolov5s/640_5s_300epoch/opt.yaml) within the directory `experiment_result/yolov5s/640_5s_300epoch`

We used the command below to run our experiment.

`!python train.py --img 640 --epochs 300 --batch 64 --data './datasets/data.yaml' --cfg 'yolov5s.yaml' --weights '' --project 'facemask_640_yolo5s' --name '640_5s_300epoch'`

The training output can be found in this file: [train_record.ipynb](source_code/yolov5/train_record.ipynb)

### 3.3.4 Results of YOLOv5s model:

#### Confusion Matrix of YOLOv5s model:

The figure below is the confusion matrix of our best yolov5s model. Based on this figure, we can see that the background can be misdetected as a face or a mask.

<img style="float: left;" width = "650" height = "300" src="experiment_result/yolov5s/640_5s_300epoch/confusion_matrix.png" >

#### The plot of training:

Based on the figure below, we can see that the model converged at about the 200th epoch.
![jupyter](./notebook_images/yolov5map.png)

####  mAP of YOLOv5s model:

The table below shows the mAP values of our model.

| Class | mAP50 | mAP50-95 |
| :---: | :---: | :------: |
|  All  | 0.887 |  0.504   |
| Face  | 0.931 |  0.557   |
| Mask  | 0.843 |  0.451   |

We can see that the mask class has a much lower mAP than the face class, which matches the confusion matrix: the mask class is more easily misdetected as background than the face class.

#### Inference Time of YOLOv5s model:

The inference speed (FPS) was evaluated with the script [val.py](source_code/yolov5/val.py). The test results varied between runs; we took the best performance as the final result, which was `92.5` fps.

## 3.4 Yolov5-MobileNetV2

### 3.4.1 Introduction

We were inspired by ssd-mobilenetV2, which integrates the SSD algorithm with the MobileNetV2 network. Therefore we decided to integrate yolov5 with MobileNet.

MobileNet has three versions: V1, V2 and V3. The first version (MobileNetV1) introduces depthwise separable convolution, which reduces the model complexity and computational cost. The key method in the second version is the inverted residual structure, which keeps the computational cost low while reaching relatively higher accuracy compared with MobileNetV1. V3 adds squeeze-and-excitation attention modules and replaces the ReLU activation with h-swish.[8]

For comparison, we chose MobileNetV2, the same backbone used in the SSD model.
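
To show why depthwise separable convolution lowers model complexity, the sketch below counts the weights of a standard convolution and of its depthwise separable replacement. It is simple arithmetic under stated assumptions (no bias terms, square kernels); the function names and the example layer sizes are illustrative and are not taken from our model.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes the channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Illustrative layer: 32 input channels, 64 output channels, 3 x 3 kernel.
    standard = standard_conv_params(32, 64, 3)
    separable = depthwise_separable_params(32, 64, 3)
    print(standard, separable, round(standard / separable, 2))  # 18432 2336 7.89
```

For this example layer the separable version needs roughly one eighth of the weights, which is the kind of saving that makes MobileNet backbones lighter than standard convolutional ones.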

### 3.4.2 Custom Layers Modification in Yolov5

As introduced in part 3.3.2, yolov5 has three main parts. We only need to modify the backbone of the yolov5 model, which is used for extracting and processing features.

ref: 
1. https://blog.csdn.net/weixin_42182534/article/details/123418604?spm=1001.2101.3001.6650.1&utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-1-123418604-blog-114492726.pc_relevant_3mothn_strategy_and_data_recovery&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-1-123418604-blog-114492726.pc_relevant_3mothn_strategy_and_data_recovery&utm_relevant_index=2

2. https://github.com/shaoshengsong/YOLOv5-ShuffleNetV2


To change the architecture, we need to:

1. Add the inverted residual structure to the yolov5 model module; see the file [common.py](source_code/yolov5-MobileNetV2/models/common.py), line no.876
2. Modify the configuration file of the original model. The key idea is to replace the original CNN layers with invertedResidual layers and connect them to their corresponding head layers. The final configuration file can be found in [yolov5-moblienetv2.yaml](source_code/yolov5-MobileNetV2/models/yolov5-moblienetv2.yaml)

The comparison of model complexity for Yolov5-MobileNetV2 and Yolov5s is shown below:

|       Model        | Number of Parameters | Number of Layers |
| :----------------: | :------------------: | :--------------: |
|      Yolov5s       |       7015519        |       157        |
| Yolov5-MobileNetV2 |       2916063        |       276        |

### 3.4.3 Experiment Implementation 

The training process and hyperparameter settings were the same as in the previous experiment.

The training output can be found in this file: [train_record.ipynb](source_code/yolov5-MobileNetV2/train_record.ipynb)

### 3.4.4 Results of YOLOv5s-mobilenetv2 model:

#### Confusion Matrix

<img style="float: left;" width = "450" height = "300" src="experiment_result/yolov5-MobileNetV2/640_300epoch/confusion_matrix.png" >

<img style="float: left;" width = "450" height = "300" src="experiment_result/yolov5s/640_5s_300epoch/confusion_matrix.png" >

In the matrices above, the left one is for Yolov5-MobileNetV2 and the right one is for Yolov5s. We can see that the accuracy decreased slightly. The pattern of misdetection among mask, face and background remained the same as in the yolov5s model.

#### The plot of training:
![jupyter](./notebook_images/yolov5map_mobilenetv2.png)
Based on the figure above, we can see that the model converged at about the 100th epoch.

####  mAP of YOLOv5s-mobilenetv2 model:

The table below shows the mAP values of our model.

| Class | mAP50 | mAP50-95 |
| :---: | :---: | :------: |
|  All  | 0.871 |  0.478   |
| Face  | 0.919 |  0.533   |
| Mask  | 0.823 |  0.423   |

#### Inference Time of YOLOv5s-mobilenetv2 model:

The test results varied between runs; we took the best performance as the final result, which was `95.2` fps.

# 4. Conclusion:

## 4.1 Comparison:
The table below summarizes the results in each field:
![jupyter](./notebook_images/compare_table.png)

![jupyter](./notebook_images/4in1.png)
In terms of mAP50, the original YOLOv5s model has the best result of 88.7, while Faster R-CNN only achieves 37.1. The result of YOLOv5s-mobilenetV2 is slightly lower than that of the original model. SSD-mobilenetV2 has an mAP50 of 68.9, which lies between the results of Faster R-CNN and YOLOv5s.

In terms of inference speed, YOLOv5s-mobilenetv2 has the best result of 95.2 FPS, while Faster R-CNN only achieves 48.1 FPS. The original YOLOv5s is slower than the altered model. The speed of SSD-mobilenetV2 lies between those of Faster R-CNN and YOLOv5s.

## 4.2 Discussion:

### 4.2.1 Data analysis
As shown in the previous result table and figures, all the models we trained achieve acceptable performance on the face mask detection task, except for the Faster R-CNN model with an mAP50 of 37.1 and an inference speed of 48.1 FPS. 
Among them, the original YOLOv5s had the best accuracy, with an mAP50 of 88.7 and an inference speed of 92.5 FPS. 
For the altered YOLOv5s model, replacing the backbone simplified the model by reducing the number of parameters from around 7M to 2.9M, which led to a smaller model and raised the inference speed from 92.5 FPS to 95.2 FPS with only a slight decrease in accuracy.
The SSD model performed worse than the YOLOv5s models, with an mAP50 of 68.9 and an inference speed of 85.6 FPS.
Overall, YOLOv5s achieves better performance than Faster R-CNN and SSD. After replacing the complex backbone network with a relatively lightweight network such as mobilenetv2, YOLOv5s achieves an even faster inference speed. After comparing all the models we tried, we believe the YOLOv5s-mobilenetv2 model meets the goal of our project: a real-time face mask detection model with a good trade-off between speed and accuracy.


### 4.2.2 Error analysis
While evaluating the YOLOv5 model, our group found that the mask class had a much lower mAP than the face class. This matches the confusion matrix: the mask class is more easily misdetected as background than the face class.

To address this problem, our group plans to add extra training images that contain neither a bare human face nor a face with a mask. Further experiments may be included in possible future work.


### 4.2.3 Possible future work
As future work, we will conduct further experiments to evaluate the performance of the proposed solutions. For example, our group chose YOLOv5s as one of the deployed models because YOLOv5 was released in 2020 and is the most popular architecture in the YOLO family. YOLOv5 has other models: 5n, 5m, 5l and 5x, whose complexity increases gradually. A more complex model costs more inference time. Due to the time limit of this project we only tested YOLOv5s; in the future we may try other versions of the YOLO model to see whether they can improve the performance of the system.


# 5. References

[1] Jatin Prakash, “Non Maximum Suppression: Theory and Implementation in PyTorch”, [Online]. Available: https://learnopencv.com/non-maximum-suppression-theory-and-implementation-in-pytorch/ [Accessed: 04-Nov-2022]

[2] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” arXiv.org, 22-Oct-2014. [Online]. Available: https://arxiv.org/abs/1311.2524. [Accessed: 03-Nov-2022]. 

[3] R. Girshick, “Fast R-CNN,” arXiv.org, 27-Sep-2015. [Online]. Available: https://arxiv.org/abs/1504.08083. [Accessed: 08-Nov-2022]. 

[4] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” arXiv.org, 06-Jan-2016. [Online]. Available: https://arxiv.org/abs/1506.01497. [Accessed: 08-Nov-2022]. 

[5] Renu Khandelwal, “SSD : Single Shot Detector for object detection using MultiBox”, Dec 1, 2019, [Online]. Available: https://towardsdatascience.com/ssd-single-shot-detector-for-object-detection-using-multibox-1818603644ca [Accessed: 30-Oct-2022]

[6] Jonathan Hui, “SSD object detection: Single Shot MultiBox Detector for real-time processing”, Mar 14, 2018, [Online]. Available: https://jonathan-hui.medium.com/ssd-object-detection-single-shot-multibox-detector-for-real-time-processing-9bd8deac0e06 [Accessed: 30-Oct-2022]

[7] Esri, “How single-shot detector (SSD) works?”, [Online]. Available: https://developers.arcgis.com/python/guide/how-ssd-works/ [Accessed: 31-Oct-2022]

[8] Mark Sandler, “MobileNetV2: Inverted Residuals and Linear Bottlenecks”, https://arxiv.org/pdf/1801.04381.pdf [Accessed: 11-Nov-2022]

[9] Sandler et al., “MobileNetV2”, https://paperswithcode.com/method/mobilenetv2 [Accessed: 19-Nov-2022]

[10] Ultralytics, “ultralytics_yolov5”, [Online]. Available: https://pytorch.org/hub/ultralytics_yolov5/ [Accessed: 04-Nov-2022]

[11] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” arXiv.org, 9-May-2016. [Online]. Available: https://arxiv.org/abs/1506.02640. [Accessed: 03-Nov-2022]. 

[12] A. Bochkovskiy, C. Wang and H. Liao, “YOLOv4: Optimal Speed and Accuracy of Object Detection,” arXiv.org, 23-Apr-2020. [Online]. Available: https://arxiv.org/abs/2004.10934. [Accessed: 04-Nov-2022]. 