<img src="images/cnn/maxinai.png" height="1000" width="1000" />

# Detecting Face Mask in Real Time

---

> * Summary of Convolutional Neural Networks
> * Data Collection and Pre-processing
> * Model Architecture
> * Model Building and Training
> * Performance Evaluation
> * Face Mask Detection in Real-Time Video Streams

# Summary of Convolutional Neural Networks

### Computer Vision Tasks


<img src="images/cnn/vision_tasks.png" height="1000" width="1000" />

### Kernel of Convolution

---

- One of the most important components of a convolution operation is the kernel (a.k.a. filter). 
It is a small matrix used for blurring, sharpening, embossing, edge detection, and more.

[Some well known kernels](https://bit.ly/36iPv2I)

## 1-D Convolution

---

A 1-D convolution is a simplified version of the 2-D convolution: the kernel slides along a single axis.


<img src="images/cnn/1_D_convolution.png" height="800" width="800">

<center>$(1 \times 2) + (3 \times 0) + (3 \times 1) = 5$</center>

[source](https://medium.com/apache-mxnet/1d-3d-convolutions-explained-with-ms-excel-5f88c0f35941)
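
The arithmetic above can be reproduced with a few lines of plain Python. This is a minimal sketch, not a library implementation: the function name `conv1d` and the full input sequence are assumptions chosen so that the first window `(1, 3, 3)` and the kernel `(2, 0, 1)` match the formula. As in most deep learning libraries, the kernel is not flipped.

```python
def conv1d(signal, kernel):
    """Slide `kernel` over `signal` one position at a time (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Assumed input sequence: only its first window (1, 3, 3) is taken from the formula above.
signal = [1, 3, 3, 0, 1, 2]
kernel = [2, 0, 1]

output = conv1d(signal, kernel)
print(output[0])  # 5, i.e. (1*2) + (3*0) + (3*1)
```

Without padding, the output is shorter than the input by `len(kernel) - 1` elements.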

### Padding and Stride

---

- To keep the convolution output the same size as the input, we pad the sequence with zeros on the left and right. 


- Stride is the step size: the number of positions the kernel jumps between evaluations.

### Padding and Stride


---

#### Padding = 1; Stride = 2

<img src="images/cnn/1_D_padding_stride.png" height="800" width="800">

<center>$(0 \times 2) + (1 \times 0) + (3 \times 1) = 3$</center>

<center>$(3 \times 2) + (3 \times 0) + (0 \times 1) = 6$</center>

[source](https://medium.com/apache-mxnet/1d-3d-convolutions-explained-with-ms-excel-5f88c0f35941)
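
A small sketch of how padding and stride change the computation. The helper `conv1d` and the full input sequence are assumptions; the sequence was chosen so that the first two windows, `(0, 1, 3)` and `(3, 3, 0)`, match the two formulas above.

```python
def conv1d(signal, kernel, padding=0, stride=1):
    """1-D convolution with zero padding on both sides and a configurable stride."""
    padded = [0] * padding + list(signal) + [0] * padding
    k = len(kernel)
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(padded) - k + 1, stride)
    ]


# Assumed input sequence, consistent with the windows used in the formulas above.
signal = [1, 3, 3, 0, 1, 2]
kernel = [2, 0, 1]

strided = conv1d(signal, kernel, padding=1, stride=2)
print(strided[:2])  # [3, 6]

# With padding 1 and stride 1, the output keeps the input length.
same = conv1d(signal, kernel, padding=1, stride=1)
print(len(same) == len(signal))  # True
```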

> * Rarely used in image processing
> * Used as alternative of RNNs in NLP
> * Used in time series modeling

The last two fields work with data that has a single dimension along which the kernel slides (time). Such data is sequential, which is why 1-D convolution is suitable!

Images, by contrast, have two spatial dimensions: height and width.

## 2-D Convolution

---

2D convolutions are used as image filters. With 2D Convolutions we slide the kernel in two directions: left/right and up/down.

<img src="images/cnn/Conv2dUsage1.png" height="800" width="800">

[source](https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/convolution.html)

### 2-D Convolution - No Padding, No Stride

---

<img src="images/cnn/2-D_convolution.png" height="800" width="800">

<center>$(1 \times 1)+(3 \times 2)+(2 \times 3)+(1 \times 0)+(3 \times 1)+(3 \times 0)+(2 \times 2)+(1 \times 1)+(1 \times 2) = 23$</center>


[source](https://medium.com/apache-mxnet/convolutions-explained-with-ms-excel-465d6649831c)
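
The same sum can be written as a sliding window over a nested list. This is a minimal sketch: the function name `conv2d` and the $(4 \times 4)$ input are assumptions, and only the top-left $(3 \times 3)$ window and the kernel are taken from the formula above.

```python
def conv2d(image, kernel):
    """2-D convolution without padding, stride 1 (kernel is not flipped)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Assumed input: its top-left 3x3 window matches the values in the formula above.
image = [
    [1, 3, 2, 1],
    [1, 3, 3, 1],
    [2, 1, 1, 3],
    [3, 2, 3, 3],
]
kernel = [
    [1, 2, 3],
    [0, 1, 0],
    [2, 1, 2],
]

feature_map = conv2d(image, kernel)
print(feature_map[0][0])  # 23
```

A $(4 \times 4)$ input and a $(3 \times 3)$ kernel give a $(2 \times 2)$ output, because the kernel fits in only two positions along each axis.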

### 2-D Convolution - Padding, No Stride

---

With padding, we can maintain the spatial dimensions through the convolution operation. Zeros are placed all the way around the original matrix. For a $(3 \times 3)$ kernel, padding by one element in every direction preserves the dimensions, so the output is now $(4 \times 4)$.

<img src="images/cnn/2-D_convolution_padding.png" height="800" width="800">


[source](https://medium.com/apache-mxnet/convolutions-explained-with-ms-excel-465d6649831c)
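
The effect of padding and stride on the output size follows the standard formula $\lfloor (n + 2p - k) / s \rfloor + 1$, where $n$ is the input size, $k$ the kernel size, $p$ the padding and $s$ the stride. A short sketch, with `conv_output_size` as an illustrative helper name:

```python
def conv_output_size(n, kernel_size, padding=0, stride=1):
    """Output length along one axis for a convolution over an input of length n."""
    return (n + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(4, 3))             # 2: no padding shrinks 4x4 to 2x2
print(conv_output_size(4, 3, padding=1))  # 4: padding by one keeps 4x4
print(conv_output_size(5, 3, stride=2))   # 2: a hypothetical 5x5 input with stride 2
```

The same function applies to each spatial axis independently, so it covers both the 1-D and the 2-D case.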

### 2-D Convolution - No Padding, Stride

---

Using a stride greater than one is an effective way to reduce the spatial dimensions, as an alternative to pooling operations. The stride indicates how far the kernel jumps between evaluations; in our example, the jump size is two.

<img src="images/cnn/2-D_convolution_stride.png" height="800" width="800">


[source](https://medium.com/apache-mxnet/convolutions-explained-with-ms-excel-465d6649831c)

### 2-D Convolution - Max Pooling

---

Instead of a stride greater than one, we can use pooling to reduce the spatial dimensions. This is an example of max pooling, which keeps only the largest value in each window.

<img src="images/cnn/pooling_1.png" height="800" width="800">
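
A minimal sketch of max pooling on a nested list. The function name `max_pool2d` and the input values are hypothetical and are not taken from the figure.

```python
def max_pool2d(feature_map, pool_size=2, stride=2):
    """Keep the largest value in each pool_size x pool_size window."""
    out_h = (len(feature_map) - pool_size) // stride + 1
    out_w = (len(feature_map[0]) - pool_size) // stride + 1
    return [
        [
            max(feature_map[r * stride + i][c * stride + j]
                for i in range(pool_size) for j in range(pool_size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[6, 5], [7, 9]]
```

Unlike a convolution, pooling has no learnable weights; it only summarizes each window.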


### Greyscale Image Convolution
---

<img src="images/cnn/moving_kernel_1_d.gif" height="800" width="800">


### RGB Image Convolution

---

<img src="images/cnn/moving_kernel_rgb.gif" height="400" width="700">


### Complex feature extraction

---

Stacked convolution layers extract increasingly complex features from layer to layer, up to the dense layers, which then classify the input image.

<img src="images/cnn/face_features.png" height="900" width="1800">


### Whole model
---
##### Note that:
- Each kernel has the same number of channels as its input data;  
- Each convolution layer has several kernels (64, 128, etc.);
- The output of a convolutional layer has as many channels as the number of kernels applied to the input data;
- The output of a convolutional layer passes through an activation function (ReLU, Tanh, etc.) and then on to the next convolutional layer (or first through a pooling layer and then to the conv. layer);  
- The process repeats until the dense layers are reached. The dense layers then perform the classification task on the extracted features;
<img src="images/cnn/stacked_convs.jpg" height="700" width="1800">
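
The channel bookkeeping in the notes above can be sketched as a small shape calculator. The helper `conv_layer_summary` and the example layer (64 kernels of size $3 \times 3$ on a $224 \times 224 \times 3$ input) are illustrative assumptions, not a description of a specific layer in this model.

```python
def conv_layer_summary(input_shape, num_kernels, kernel_size, padding=0, stride=1):
    """Kernel shape, output shape and parameter count of one convolutional layer."""
    height, width, in_channels = input_shape
    out_h = (height + 2 * padding - kernel_size) // stride + 1
    out_w = (width + 2 * padding - kernel_size) // stride + 1
    weights = kernel_size * kernel_size * in_channels * num_kernels
    biases = num_kernels  # one bias per kernel
    return {
        # each kernel spans all input channels
        "kernel_shape": (kernel_size, kernel_size, in_channels),
        # one output channel per kernel
        "output_shape": (out_h, out_w, num_kernels),
        "parameters": weights + biases,
    }


rgb_layer = conv_layer_summary((224, 224, 3), num_kernels=64, kernel_size=3, padding=1)
print(rgb_layer["kernel_shape"])  # (3, 3, 3)
print(rgb_layer["output_shape"])  # (224, 224, 64)
print(rgb_layer["parameters"])    # 1792
```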

### Whole model
---
##### Note that:
- Data is fed to the model in batches. For example, if we set the batch size to 64 and our dataset consists of 28x28x1 images, we feed the model a tensor of shape [64, 28, 28, 1] (the batch dimension is omitted in the image below!);
- Commonly, before the dense layers, feature maps are flattened by averaging over the height and width dimensions (global average pooling). Thus a feature map of shape [64, 28, 28, 264] becomes [64, 264].
<img src="images/cnn/whole_model.jpeg" height="100" width="1500">
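
Averaging over height and width can be sketched on a tiny nested list. The function name `global_average_pool` and the toy tensor of shape [1, 2, 2, 2] are assumptions for illustration; the same logic turns [64, 28, 28, 264] into [64, 264].

```python
def global_average_pool(batch):
    """Average each channel over height and width.

    batch: nested lists shaped [batch, height, width, channels]
    returns: nested lists shaped [batch, channels]
    """
    pooled = []
    for sample in batch:
        height, width = len(sample), len(sample[0])
        channels = len(sample[0][0])
        pooled.append([
            sum(sample[r][c][ch] for r in range(height) for c in range(width))
            / (height * width)
            for ch in range(channels)
        ])
    return pooled


# Toy tensor: 1 sample, 2x2 spatial grid, 2 channels.
batch = [[
    [[1.0, 10.0], [2.0, 20.0]],
    [[3.0, 30.0], [4.0, 40.0]],
]]

pooled = global_average_pool(batch)
print(pooled)  # [[2.5, 25.0]]
```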

# Data Collection and Pre-processing

### Our Task

<img src="images/cnn/Classification-vs-Classification-and-Localization.png" height="600" width="700">

Our dataset consists of 690 images of faces with masks and 686 images of faces without masks. The data is partly artificial but still reflects real-world conditions. The data creation process has two steps:

* Finding normal face images

* Applying some computer vision tricks to add face masks to them


[data source](https://github.com/prajnasb/observations)

The "trick" here is to use [Facial Landmarks](https://paperswithcode.com/task/facial-landmark-detection) to automatically detect facial structures, such as the eyes, eyebrows, mouth, nose, and jawline. To build the dataset, we start from images of faces without a mask.

<img src="images/cnn/face.jpg" height="500" width="500">

[source](https://www.pyimagesearch.com/2020/05/04/covid-19-face-mask-detector-with-opencv-keras-tensorflow-and-deep-learning/)

We have to apply face detection to calculate the face bounding box.

<img src="images/cnn/face_bounding_box.jpg" height="500" width="500">


[source](https://www.pyimagesearch.com/2020/05/04/covid-19-face-mask-detector-with-opencv-keras-tensorflow-and-deep-learning/)

Having a face bounding box gives us the ability to apply facial landmarks and localize facial structure.

<img src="images/cnn/facial_landmarks.png" height="500" width="500">


[source](https://www.pyimagesearch.com/2020/05/04/covid-19-face-mask-detector-with-opencv-keras-tensorflow-and-deep-learning/)

In the next step, the mask will be applied to the face.

<img src="images/cnn/face_with_mask.jpg" height="500" width="500">


[source](https://www.pyimagesearch.com/2020/05/04/covid-19-face-mask-detector-with-opencv-keras-tensorflow-and-deep-learning/)

There is one **downside** to this dataset. The mask-free face images that were used to generate the masked images cannot be reused as "without mask" examples for model training. Reusing them would bias the model, and it would not generalize well.

# Model Architecture

## MobileNets

MobileNets are a class of efficient models for mobile and embedded vision applications. They are based on a streamlined architecture that uses **depth-wise separable convolutions** to build lightweight deep neural networks. The two main versions, among others, are:


* MobileNetV1


* MobileNetV2

<img src="images/cnn/mobilenets.png" height="1000" width="1000" />

## MobileNetV1

---

**Depthwise Convolution** applies a single convolution filter per input channel to perform lightweight filtering.

**Pointwise Convolution** is a $(1\times1)$ convolution responsible for building new features by computing linear combinations of the input channels.

Combining these two types of convolution operation, we get **Depthwise Separable Convolution**.
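
To see why this combination is lightweight, the multiplication counts of a standard convolution and a depthwise separable convolution can be compared, following the cost model in the MobileNets paper listed under Additional Resources. The function names and the example layer ($3 \times 3$ kernel, 64 input channels, 128 output channels, $56 \times 56$ feature map) are hypothetical.

```python
def standard_conv_cost(kernel_size, in_channels, out_channels, feature_size):
    """Multiplications for a standard convolution over a square feature map."""
    return kernel_size ** 2 * in_channels * out_channels * feature_size ** 2


def separable_conv_cost(kernel_size, in_channels, out_channels, feature_size):
    """Multiplications for a depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = kernel_size ** 2 * in_channels * feature_size ** 2
    pointwise = in_channels * out_channels * feature_size ** 2
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 56x56 feature map.
standard = standard_conv_cost(3, 64, 128, 56)
separable = separable_conv_cost(3, 64, 128, 56)

ratio = separable / standard
print(round(ratio, 3))  # 0.119
```

The ratio simplifies to $1/N + 1/D_K^2$, where $N$ is the number of output channels and $D_K$ the kernel size, so a $3 \times 3$ separable convolution needs roughly 8 to 9 times fewer multiplications.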

## MobileNetV2

---

In this network, there are two types of blocks: the first is a residual block with stride 1, and the second is a block with stride 2 for downsizing.

Both types of block have 3 layers:

* The first layer is a $(1\times1)$ convolution with ReLU6


* The second layer is depthwise convolution


* The third layer is another $(1\times1)$ convolution without any non-linearity
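
The three layers can be traced as a sequence of tensor shapes. This is a sketch under assumptions taken from the MobileNetV2 paper listed under Additional Resources rather than from the text above: an expansion factor of 6, a $3 \times 3$ depthwise kernel, and a residual connection only when the stride is 1 and the input and output channel counts match. The function name and the example sizes are hypothetical.

```python
def inverted_residual_shapes(height, width, in_channels, out_channels,
                             expansion=6, stride=1):
    """Trace (height, width, channels) through one MobileNetV2 block."""
    expanded = in_channels * expansion
    out_h = -(-height // stride)  # ceiling division, as with "same" padding
    out_w = -(-width // stride)
    return {
        # 1x1 convolution + ReLU6 widens the channel dimension
        "expand_1x1": (height, width, expanded),
        # depthwise convolution filters each channel; stride 2 downsizes
        "depthwise_3x3": (out_h, out_w, expanded),
        # 1x1 convolution without non-linearity projects back down
        "project_1x1": (out_h, out_w, out_channels),
        "has_residual": stride == 1 and in_channels == out_channels,
    }


stride1_block = inverted_residual_shapes(56, 56, 24, 24, stride=1)
stride2_block = inverted_residual_shapes(56, 56, 24, 32, stride=2)

print(stride1_block["project_1x1"], stride1_block["has_residual"])  # (56, 56, 24) True
print(stride2_block["project_1x1"], stride2_block["has_residual"])  # (28, 28, 32) False
```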

<center><strong>ImageNet Classification</strong></center>

<img src="images/cnn/imagenet.png" height="900" width="800" />

# Model Building and Training

As mentioned above, for face mask detection we used the MobileNetV2 architecture.

To improve the model's generalization ability we used [data augmentation](https://www.tensorflow.org/tutorials/images/data_augmentation), applying random rotation, zoom, shift, shear, and flip to the images. 

```python
# Load MobileNetV2 with pre-trained ImageNet weights,
# leaving off the fully connected (FC) head layers
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import Input

baseModel = MobileNetV2(weights="imagenet",
                        include_top=False,
                        input_tensor=Input(shape=(224, 224, 3)))
```

```python
# Construct the head of the model that will be placed on top of the base model
from tensorflow.keras.layers import AveragePooling2D, Dense, Dropout, Flatten
from tensorflow.keras.models import Model

headModel = baseModel.output
headModel = AveragePooling2D(pool_size=(7, 7))(headModel)
headModel = Flatten(name="flatten")(headModel)
headModel = Dense(128, activation="relu")(headModel)
headModel = Dropout(0.5)(headModel)
headModel = Dense(2, activation="softmax")(headModel)

# Place the head FC model on top of the base model
# (this will become the actual model we will train)
model = Model(inputs=baseModel.input, outputs=headModel)
```
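
The shapes flowing through this head can be checked by hand. The sketch below assumes the base model outputs a $7 \times 7 \times 1280$ feature map, which is what MobileNetV2 produces for a $224 \times 224$ input when the top is left off; `head_summary` is an illustrative helper, not part of Keras.

```python
def head_summary(feature_map_shape, pool_size=7, hidden_units=128, num_classes=2):
    """Shapes and parameter counts of the pooling + dense head."""
    height, width, channels = feature_map_shape
    pooled = (height // pool_size, width // pool_size, channels)
    flattened = pooled[0] * pooled[1] * pooled[2]
    dense1_params = flattened * hidden_units + hidden_units    # weights + biases
    dense2_params = hidden_units * num_classes + num_classes   # weights + biases
    return {
        "pooled_shape": pooled,
        "flattened_size": flattened,
        "trainable_head_params": dense1_params + dense2_params,
    }


# Assumed output shape of the base model for a 224x224x3 input.
summary = head_summary((7, 7, 1280))
print(summary["pooled_shape"])           # (1, 1, 1280)
print(summary["flattened_size"])         # 1280
print(summary["trainable_head_params"])  # 164226
```

Because the pooling window covers the whole $7 \times 7$ map, `AveragePooling2D` acts here as a global average over height and width; dropout adds no parameters.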

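```python
# Compile the model. INIT_LR, EPOCHS, BATCH_SIZE, the augmentation generator
# `aug`, and the train/test arrays are assumed to be defined in earlier cells.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import InverseTimeDecay

# Same schedule as the older `Adam(lr=INIT_LR, decay=INIT_LR / EPOCHS)` call,
# whose `lr` and `decay` arguments are rejected by recent Keras optimizers:
# lr = INIT_LR / (1 + decay_rate * step)
lr_schedule = InverseTimeDecay(initial_learning_rate=INIT_LR,
                               decay_steps=1,
                               decay_rate=INIT_LR / EPOCHS)
opt = Adam(learning_rate=lr_schedule)
model.compile(loss="binary_crossentropy",
              optimizer=opt,
              metrics=["accuracy"])

# Train the head of the network
H = model.fit(
    aug.flow(X_train,
             Y_train,
             batch_size=BATCH_SIZE),
    steps_per_epoch=len(X_train) // BATCH_SIZE,
    validation_data=(X_test, Y_test),
    validation_steps=len(X_test) // BATCH_SIZE,
    epochs=EPOCHS)
```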

# Performance Evaluation

### Classification Report

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| with_mask    | 0.99      | 1.00   | 0.99     | 138     |
| without_mask | 1.00      | 0.99   | 0.99     | 138     |
| accuracy     |           |        | 0.99     | 276     |
| macro avg    | 0.99      | 0.99   | 0.99     | 276     |
| weighted avg | 0.99      | 0.99   | 0.99     | 276     |
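
For reference, the columns of this report are derived from a confusion matrix. The sketch below uses a hypothetical matrix (2 unmasked faces misclassified as masked, no other errors) that is consistent with the rounded values in the table; the actual confusion matrix is not given here, and `classification_metrics` is an illustrative helper.

```python
def classification_metrics(confusion, labels):
    """Per-class (precision, recall, f1, support) and overall accuracy.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    report = {}
    for idx, label in enumerate(labels):
        true_positive = confusion[idx][idx]
        predicted = sum(row[idx] for row in confusion)  # column sum
        actual = sum(confusion[idx])                    # row sum (support)
        precision = true_positive / predicted
        recall = true_positive / actual
        f1 = 2 * precision * recall / (precision + recall)
        report[label] = (round(precision, 2), round(recall, 2), round(f1, 2), actual)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    return report, round(correct / total, 2)


# Hypothetical confusion matrix consistent with the rounded report above.
confusion = [
    [138, 0],   # true with_mask
    [2, 136],   # true without_mask
]

report, accuracy = classification_metrics(confusion, ["with_mask", "without_mask"])
print(report["with_mask"])     # (0.99, 1.0, 0.99, 138)
print(report["without_mask"])  # (1.0, 0.99, 0.99, 138)
print(accuracy)                # 0.99
```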

### Training Loss and Accuracy

<img src="./images/cnn/plot.png" height="900" width="800" />

# Face Mask Detection in Real-Time Video Streams

## Additional Resources

---

[MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications](https://arxiv.org/abs/1704.04861)

[MobileNetV2: Inverted Residuals and Linear Bottlenecks](https://arxiv.org/abs/1801.04381)

[Depthwise separable convolutions for machine learning](https://eli.thegreenplace.net/2018/depthwise-separable-convolutions-for-machine-learning/)

[MobileNetV2: The Next Generation of On-Device Computer Vision Networks](https://ai.googleblog.com/2018/04/mobilenetv2-next-generation-of-on.html)

[Google’s MobileNets on the iPhone](https://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/)

[MobileNet version 2](https://machinethink.net/blog/mobilenet-v2/)

<img src="images/cnn/questions.jpg" height="600" width="600" />

<center><strong style="font-size:40px">Thank You</strong></center>

## References

---

https://github.com/MaxinAI/school-of-ai

https://www.pyimagesearch.com/2020/05/04/covid-19-face-mask-detector-with-opencv-keras-tensorflow-and-deep-learning/

https://github.com/prajnasb/observations