# 1. Introduction 
### 1.1 Background 
Self-driving is one of the most popular applications of neural networks and deep learning, and real-time pedestrian detection is one of its most important topics. Because safety is at stake, it has to be applied carefully. Although many methods exist today, there is still room for improvement.
Motivated by this, we conduct experiments on long-range pedestrian detection.
### 1.2 Motivation
Traditional offline embedded systems have limited computational power, yet pedestrian detection requires an immediate result. Improving existing models so that they produce accurate and fast results under limited resources is therefore a challenge.
In addition, current lightweight frameworks such as YOLOv5 still have room to perform better.
### 1.3 Problem Statement 
Context:
 * We are running neural networks on embedded systems, so detection must be lightweight and fast.
 * Attention mechanisms are used because they help the neural network concentrate on pedestrians rather than on other features in the image.

We compare the effect of adding different attention mechanisms to the current YOLOv5 model to see which one performs best, and we examine how data augmentation changes the results of the models we have built.
### 1.4 Solutions
We propose two strategies:
* Experimenting with different attention models to see whether they bring any improvement.
* Applying augmentation techniques to see whether they bring any benefit.

# 2. Data Sources
Our dataset is the Caltech Pedestrian Dataset from the California Institute of Technology (Caltech). It consists of video taken from a vehicle driving through regular traffic in an urban area in the US.

The original version of the Caltech Pedestrian Dataset that we use contains 1550 images and comes from https://universe.roboflow.com/pionc/caltech-6f68o

Here is an example from the dataset:


## 2.1 Analysis of Data
### 2.1.1  Pre-processing
First, we convert the video data into images. We then resize each image to 640 x 640 to fit the YOLOv5 model and adjust the orientation of the images.

We have 1550 images and 2991 annotations. We split the images into a training set of 1082, a validation set of 311 and a test set of 157.

We label only one class: pedestrian.
### 2.1.2 Analysis
We generated an annotation heatmap with Roboflow; the green regions show where most of the pedestrian labels are positioned in the dataset. We can observe that most of the labelled pedestrians are located towards the back of the image, that is, far from the camera.


As we can see, pedestrians in the images are small, occluded and located at the back of the image against a cluttered background. This lines up with our goal of detecting pedestrians at long range, which is why we chose this dataset.
### 2.1.3 Data augmentation
We applied data augmentation to reduce overfitting and to try to improve the accuracy of the trained model. Mosaic data augmentation merges four training images into one, so that each augmented image contains four images placed side by side.
This enables the model to learn to recognise objects at a smaller scale than usual.
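
The merging step can be illustrated with a minimal sketch. It assumes a fixed 2 x 2 layout and nearest-neighbour downscaling, and the function name `mosaic` is hypothetical; it is not the implementation used by Roboflow or YOLOv5, which also randomise the layout and transform the bounding-box labels accordingly.

```python
import numpy as np

def mosaic(images, out_size=640):
    """Tile four H x W x 3 images into one out_size x out_size image (fixed 2 x 2 layout)."""
    assert len(images) == 4, "mosaic needs exactly four images"
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (top, left) in zip(images, corners):
        # Nearest-neighbour downscale by sampling row and column indices.
        rows = np.arange(half) * img.shape[0] // half
        cols = np.arange(half) * img.shape[1] // half
        canvas[top:top + half, left:left + half] = img[rows][:, cols]
    return canvas

# Four dummy 640 x 640 images, each filled with a different grey level.
tiles = [np.full((640, 640, 3), v, dtype=np.uint8) for v in (10, 20, 30, 40)]
merged = mosaic(tiles)
print(merged.shape)
```

Each source image ends up at half its original side length, which is why pedestrians appear smaller in the augmented image than in the originals.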


After data augmentation, the dataset is available from a second source: https://universe.roboflow.com/visdronedataset/caltech-ped

## 3. Experimentation

### 3.1 Experimental Procedures
For the experimental procedure, we first trained all the attention models once, then trained the same models again with data augmentation applied. After this, we merged all the results into one graph to analyse our findings.

##### 3.1.1 Experimental Parameters
To ensure the reliability of our experiment, we ran all our attention models on one server for more control, and we trained all models with the same parameters, as seen here, to keep the comparison valid.

We set the number of epochs to 400 because we wanted to make sure that our models were still improving and that the validation loss did not increase, despite potential overfitting for some light models such as MobileNet.

##### 3.1.2 Training
We used "graphic card" 

### 3.2 Evaluation Metrics


##### 3.2.1 Intersection over Union (IoU)
Intersection over Union measures the overlap between the predicted bounding box and the ground truth, as the area of their intersection divided by the area of their union. It helps quantify the accuracy of a prediction.

Given a threshold, a prediction counts as a positive if its IoU is greater than the threshold, and as a negative otherwise.
![lou.png](attachment:lou.png)
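
As a minimal sketch of this definition, the functions below compute IoU for axis-aligned boxes and apply a threshold. The `(x1, y1, x2, y2)` corner format, the function names and the default threshold of 0.5 are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_positive(pred_box, gt_box, threshold=0.5):
    """A prediction is a positive when its IoU with the ground truth exceeds the threshold."""
    return iou(pred_box, gt_box) > threshold

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))          # overlap 1, union 7
print(is_positive((0, 0, 2, 2), (1, 1, 3, 3)))
```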




##### 3.2.2 Quantitative Analysis of Prediction
**Precision** is the proportion of true positives (TP) among all positive predictions (TP + FP). It measures how many of the predicted positives are correct.


**Recall** is the proportion of true positives among the true positives plus the false negatives (TP + FN). It measures how well the model finds all the positives.

![confusion%20matrix.png](attachment:confusion%20matrix.png)
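
The two definitions translate directly into code. The counts in the usage line are made-up numbers for illustration, not results from our experiments.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts: 8 correct detections, 2 false alarms, 4 missed pedestrians.
p, r = precision_recall(tp=8, fp=2, fn=4)
print(f"precision={p:.2f} recall={r:.2f}")
```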



##### 3.2.3 mAP
Average Precision (AP) is built on the following:

- Confusion Matrix
- Recall
- Precision

The precision-recall curve is obtained by plotting precision against recall. It captures the trade-off between the two metrics.

Average Precision is the area under the precision-recall curve. It incorporates the trade-off between precision and recall and takes both false positives and false negatives into account. It gives us an idea of the overall accuracy of the model.

Mean Average Precision (mAP) is used to evaluate the performance of object detection models. mAP is the mean of the AP values of all classes.

![image.png](attachment:image.png)
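
A minimal sketch of these two quantities follows. It assumes the precision-recall points are already sorted by increasing recall and uses the common "all-point" interpolation of the curve; evaluation tools differ in how they interpolate, so the numbers may not match a particular framework exactly. The function names are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve (points sorted by increasing recall)."""
    # Sentinel points so the curve spans recall 0 to 1.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision with the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(ap_per_class):
    """mAP is the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

ap = average_precision([0.5, 1.0], [1.0, 0.5])
print(ap, mean_average_precision([ap]))
```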


## Methods

### 3.3 YOLOv5
YOLO, short for You Only Look Once, is one of the most popular and powerful algorithms in the field of object detection thanks to its fast and efficient performance. YOLOv5 is a recent generation of the YOLO series of algorithms.

YOLOv5 comes in several pre-trained variants that trade off network size against final performance. We chose YOLOv5l as our base model because it provides a good balance between efficiency and performance.


![yolov5Models.png](attachment:yolov5Models.png)



##### 3.3.1 Object Detection
Object detection is the task of locating instances of objects in an image and classifying each one into a particular class.
![label.png](attachment:label.png)



##### 3.3.3 YOLOv5 Mechanism
YOLO reframes object detection as a regression problem. It takes an input image and learns to predict multiple bounding boxes, each with a confidence score, and then classifies the object in each box.

To do so, the image is divided into an S × S grid. If the centre of an object falls inside a grid cell, that cell is responsible for detecting the object. Each grid cell predicts B bounding boxes, each with a confidence score indicating how confident the network is that an object is present in the box.

![yolo_Mechanism.jpg](attachment:yolo_Mechanism.jpg)
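
The grid assignment described above can be sketched as follows. The function name, the pixel-coordinate convention and the example value S = 7 are assumptions for illustration; the sketch follows the description in this section rather than the YOLOv5 source code.

```python
def responsible_cell(cx, cy, img_w, img_h, S):
    """Return (row, col) of the S x S grid cell that contains the object centre (cx, cy)."""
    # Scale the centre to grid units; clamp so a centre on the far edge stays in the grid.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col

# A pedestrian centred at pixel (320, 100) in a 640 x 640 image with a 7 x 7 grid.
print(responsible_cell(320, 100, 640, 640, S=7))
```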




If there is no object in the box, the confidence score should be zero. Otherwise, it should equal the intersection over union (IoU) between the predicted box and the ground truth. For each box, YOLO uses a CNN based on GoogLeNet to derive a class-specific probability. The final confidence score is the product of the conditional class-specific probability and the individual box confidence score. It tells us both the probability of the class appearing in the box and how well the predicted box fits the object. A threshold is then applied to eliminate low-confidence predictions.

![model2.png](attachment:model2.png)
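
The scoring and thresholding step can be sketched as below. The tuple layout, the function names and the threshold value of 0.25 are hypothetical choices for illustration.

```python
def class_confidence(class_prob, box_conf):
    """Final score: conditional class probability times box confidence."""
    return class_prob * box_conf

def filter_predictions(predictions, threshold=0.25):
    """Keep predictions whose final score reaches the threshold.

    predictions: list of (box, class_prob, box_conf) tuples.
    Returns a list of (box, score) tuples.
    """
    kept = []
    for box, class_prob, box_conf in predictions:
        score = class_confidence(class_prob, box_conf)
        if score >= threshold:
            kept.append((box, score))
    return kept

preds = [((0, 0, 10, 10), 0.9, 0.8), ((5, 5, 8, 8), 0.5, 0.2)]
print(filter_predictions(preds))  # only the first box survives
```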


##### 3.3.4 YOLOv5 Model Architecture
The YOLO network architecture consists of three parts. The backbone extracts features from the input image. The neck combines these features in a feature pyramid. The head generates the result: the bounding boxes, confidence scores and object classes.

YOLOv5 uses CSPDarknet as its backbone, a CNN that adopts the cross stage partial (CSP) structure. It partitions the feature map of a layer into two parts and merges them through a cross-stage hierarchy. This split-and-merge strategy reduces duplicate gradient information while maintaining accuracy. YOLOv5 then uses PANet as its neck, which downsamples the extracted features to obtain feature maps of decreasing size. This enhances recognition ability and helps detect objects at all scales. Finally, the head layer produces the final output.

![networkArchitecture.png](attachment:networkArchitecture.png)

## 3.4 Attention Modules
Attention is an attempt to enhance the important parts of the input while downplaying irrelevant information. On our dataset, it should focus on pedestrians and weaken non-relevant information such as traffic signals.




#### 3.4.1 SE
SENet, short for Squeeze-and-Excitation Networks, consists of three steps: the squeeze module, the excitation module and the scale module.

![se.jpg](attachment:se.jpg)

First, the squeeze module takes an input tensor x and reduces it to a (C × 1 × 1) tensor by passing it through a global average pooling layer.

Next is the excitation module. The (C × 1 × 1) tensor from the squeeze step is passed through a layer that outputs a vector of the same length C. In the scale step, this vector is broadcast and multiplied element-wise with the input x.
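
The three steps can be sketched in NumPy as below. This is an illustrative forward pass only: it assumes the usual SE design of two fully connected layers (ReLU, then sigmoid) with a reduction ratio, and the weights `w1` and `w2` are random placeholders rather than trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation forward pass.

    x:  feature map of shape (C, H, W)
    w1: weights of shape (C // r, C), r being the reduction ratio
    w2: weights of shape (C, C // r)
    """
    squeezed = x.mean(axis=(1, 2))            # squeeze: global average pool to (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)   # excitation: fully connected + ReLU
    weights = sigmoid(w2 @ hidden)            # fully connected + sigmoid, shape (C,)
    return x * weights[:, None, None]         # scale: broadcast over H and W

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4))
out = se_block(x, rng.normal(size=(2, 8)), rng.normal(size=(8, 2)))
print(out.shape)
```

The output has the same shape as the input; only the relative strength of each channel changes.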

SE is a well-known attention mechanism, and adding it in front of the SPPP layer in the YOLOv5 backbone, as suggested by the author, could increase performance.

![se2.jpg](attachment:se2.jpg)

#### 3.4.2 SPD
From the diagram, we can see that SPD-Conv consists of two components: a space-to-depth (SPD) layer followed by a non-strided convolution layer.

The SPD layer downsamples a feature map X but retains all of its information in the channel dimension, so there is no information loss.

The non-strided convolution applied after each SPD layer reduces the (increased) number of channels using the learnable parameters of the added convolution layer.
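
The two components can be sketched in NumPy as follows. The scale factor of 2 is an assumption, and a 1 x 1 convolution stands in for the non-strided convolution to keep the example short; the weights are placeholders, not trained parameters.

```python
import numpy as np

def space_to_depth(x, scale=2):
    """Rearrange (C, H, W) into (C * scale**2, H // scale, W // scale) without dropping values."""
    c, h, w = x.shape
    assert h % scale == 0 and w % scale == 0, "H and W must be divisible by scale"
    # Each (i, j) offset selects a sub-sampled copy of the map; stack them along channels.
    parts = [x[:, i::scale, j::scale] for i in range(scale) for j in range(scale)]
    return np.concatenate(parts, axis=0)

def pointwise_conv(x, weight):
    """Non-strided 1 x 1 convolution: weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

x = np.arange(16, dtype=float).reshape(1, 4, 4)
y = space_to_depth(x)                     # (4, 2, 2): same 16 values, smaller spatial size
z = pointwise_conv(y, np.ones((2, 4)))    # reduce 4 channels to 2
print(y.shape, z.shape)
```

Every input value is still present after `space_to_depth`, which is the sense in which the downsampling loses no information.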

We chose the SPD module because the SPD paper notes that it improves performance on low-resolution images of size 640 x 640, which matches our dataset.

![spd.jpg](attachment:spd.jpg)



#### 3.4.3 ECA
ECA-Net is similar to SENet. The ECA block uses the same global feature descriptor as SENet, described previously as the squeeze module: the input tensor passes through a global average pooling layer and is reduced to a channel vector.
This channel vector is then passed through a 1D convolutional layer whose kernel size k is chosen adaptively and which models the interaction between neighbouring channels.
Finally, as in SENet, the result is broadcast and multiplied element-wise with the input tensor.
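
A NumPy sketch of the ECA block follows. The kernel-size rule uses the mapping from the ECA paper with γ = 2 and b = 1, which should be treated as assumed defaults here; the 1D kernel passed in the usage example is a placeholder, not a trained weight.

```python
import math
import numpy as np

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive kernel size: |log2(C) / gamma + b / gamma| rounded to an odd number."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 else t + 1

def eca_block(x, kernel):
    """ECA forward pass for x of shape (C, H, W) and a 1D kernel of odd length k."""
    pooled = x.mean(axis=(1, 2))                       # global average pool to (C,)
    k = len(kernel)
    padded = np.pad(pooled, k // 2)                    # zero padding keeps the length at C
    conv = np.array([padded[i:i + k] @ kernel for i in range(len(pooled))])
    weights = 1.0 / (1.0 + np.exp(-conv))              # sigmoid gate per channel
    return x * weights[:, None, None]                  # broadcast over H and W

k = eca_kernel_size(256)
x = np.ones((256, 2, 2))
print(k, eca_block(x, np.full(k, 0.1)).shape)
```

Unlike the SE block, the only learnable parameters are the k weights of the 1D kernel.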

We chose the ECA attention module because the authors of the paper state that it performs better than SENet. We add it to YOLOv5 in the same position as the SE module for comparison.
![eca.jpg](attachment:eca.jpg)


#### 3.4.4 MobileNet
MobileNetV3 is added to our experiment mainly as a comparison with the basic YOLOv5 to see whether it can speed up training.

We replaced the whole backbone of YOLOv5 with MobileNetV3. Compared with YOLOv5, MobileNet uses depthwise separable convolutions to reduce the number of weights, so in theory it reduces the training time.
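
The weight reduction can be checked with simple arithmetic. The layer sizes in the example (128 input channels, 256 output channels, 3 x 3 kernel) are made up for illustration, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution mixes the channels
    return depthwise + pointwise

std = standard_conv_params(128, 256, 3)
sep = depthwise_separable_params(128, 256, 3)
print(std, sep, round(std / sep, 2))
```

For this hypothetical layer, the separable version needs roughly one ninth of the weights.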

MobileNetV3 is the updated version of the MobileNet family. Version 1 used depthwise separable convolutions and two hyper-parameters to trade off latency against accuracy. Version 2 added inverted residuals and shortcut connections to improve accuracy. Version 3 added the SE attention mechanism introduced above and replaced the swish activation function with the h-swish function.
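
For reference, the two activation functions can be compared with a short scalar sketch. h-swish replaces the sigmoid in swish with a piecewise-linear approximation built from ReLU6, which is cheaper to compute.

```python
import math

def relu6(x):
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)

def swish(x):
    """swish(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def h_swish(x):
    """h-swish(x) = x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0

for v in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(v, round(swish(v), 4), round(h_swish(v), 4))
```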

These changes improve the overall performance of MobileNet. In addition, MobileNet can run on small devices with limited computational resources.
![mobile.jpg](attachment:mobile.jpg)

## 5.0 Results
### 5.1 Before Data augmentation

![picture]("file")
As the chart shows, before data augmentation is applied, YOLOv5 outperforms all the attention models we implemented as well as the MobileNetV3 model selected for comparison.

### 5.2 After Data augmentation

![picture]("file")
After applying the mosaic data augmentation technique to our dataset, the SPD model outperforms the vanilla YOLOv5 model and all other models.
There is also an overall 30% increase in performance across all the trained models.


## 6.0 Discussion
### 6.1 Before Data Augmentation
![picture]("file")
Before data augmentation, SE-Net, ECA-Net and the other attention modules consistently performed worse than YOLOv5's base model. A likely reason is that these modules make YOLOv5's backbone apply pooling twice: once in the added attention mechanism and again in the SPPP section of the backbone.
The two pooling operations cause a further loss of the distinct characteristics of small pedestrians in the images and, as a result, lower performance.


### 6.2 After Data Augmentation
As mentioned before, mosaic data augmentation generates an extra image merged from one original image and three random images. As a result, our dataset doubled in size, and the detection of smaller pedestrians within the grid cells that YOLOv5 creates improved.

### 6.3 Models
#### 6.3.1 ECA and SE comparison:
![picture]("file")
The ECA module is intended as an improvement over SENet for object detection. In our experiments, however, ECA-Net performs worse than SE-Net. This is likely because ECA-Net's extra adaptive neighbourhood interaction filters out the distinct features of small pedestrians in the images, which reduces performance.

#### 6.3.2 SPD
SPD performs better than all other models at every epoch. We attribute this to its dedicated space-to-depth layer, which preserves the information from previous layers for low-resolution images and therefore fits our dataset well.

#### 6.3.3 MobileNet V3
MobileNet is a light model with a very short training time; however, it consistently performs the worst of all the models. We attribute this to the h-swish function, which reduces the amount of calculation to make the model faster.


## 7.0 Conclusion
From the experiment, we found that SPD and MobileNet can enhance YOLOv5l in terms of accuracy and speed respectively. If computational power is limited, MobileNet could be the better choice; if computational resources are not a concern, SPD can be used to obtain better accuracy.
