# YOLO Object Detection

### Image Classification vs Object Detection

In <b>classification</b>, there is generally an image with a single object as the focus, and the task is to say what that object is.

![Alt Text](https://cdn-images-1.medium.com/max/1600/1*8GVucX9yhnL21KCtcyFDRQ.png)

But when we look at the world around us, we carry out far more complex tasks.

![Alt Text](https://cdn-images-1.medium.com/max/1600/1*NdwfHMrW3rpj5SW_VQtWVw.png)

We see complicated scenes with multiple overlapping objects and different backgrounds. We not only classify these objects but also identify their boundaries, differences, and relations to one another!

Can CNNs help us with such complex tasks? Yes.

![Alt Text](https://irenelizihui.files.wordpress.com/2016/02/cnn2.png)

![Alt Text](https://www.pyimagesearch.com/wp-content/uploads/2017/03/imagenet_vgg16.png)

- We can take a classifier like VGGNet or Inception and turn it into an object detector by sliding a small window across the image.
- At each step you run the classifier to get a prediction of what sort of object is inside the current window. 
- Using a sliding window gives several hundred or thousand predictions for that image, but you only keep the ones the classifier is the most certain about.
- This approach works but it’s obviously going to be very slow, since you need to run the classifier many times.
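
The sliding-window idea above can be sketched in a few lines. This is only an illustration: `dummy_classifier` is a hypothetical stand-in for a real network such as VGGNet or Inception, and the window size, stride, and threshold are assumed values, not ones taken from any particular detector.

```python
import numpy as np

def sliding_windows(image, window=64, stride=32):
    """Yield (x, y, crop) for every window position that fits inside the image."""
    height, width = image.shape[:2]
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            yield x, y, image[y:y + window, x:x + window]

def dummy_classifier(crop):
    """Hypothetical stand-in for a real classifier: returns (label, confidence)."""
    # Mean brightness acts as a fake confidence so the sketch runs without a model.
    return "object", float(crop.mean()) / 255.0

def detect(image, classify=dummy_classifier, threshold=0.8):
    """Run the classifier on every window and keep only the confident predictions."""
    detections = []
    for x, y, crop in sliding_windows(image):
        label, confidence = classify(crop)
        if confidence >= threshold:
            detections.append((x, y, label, confidence))
    return detections

image = np.zeros((256, 256, 3), dtype=np.uint8)
image[64:128, 64:128] = 255  # one bright square plays the role of the "object"

windows = list(sliding_windows(image))
detections = detect(image)
print(len(windows), "classifier calls,", len(detections), "detection(s) kept")
```

Even on this small 256 X 256 image with a single window size, the classifier is called dozens of times, which is why the approach becomes slow with a real network and several window scales.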


### What is YOLO?

- YOLO takes a completely different approach. 
- It’s not a traditional classifier that is repurposed to be an object detector. 
- YOLO actually looks at the image just once (hence its name: You Only Look Once) but in a clever way.

### How does the YOLO Framework Function?

- YOLO first takes an input image:

![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-17-43-42.png)

- The framework then divides the input image into a grid of cells (say, a 3 X 3 grid):

![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-17-46-32.png)

- Image classification and localization are applied to each grid cell. YOLO then predicts the bounding boxes and their corresponding class probabilities for objects.

#### Let's break it down

<b>Training Data</b>
- We need to pass the labelled data to the model in order to train it.
    - Suppose we divide the image into a 3 X 3 grid.
    - There are 3 classes (Pedestrian, Car, and Motorcycle) into which we want to classify the objects.
    - For each grid cell, the label y will be an eight-dimensional vector:
    
    ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-18-01-24.png)
        
        - pc is the probability that an object is present in the grid cell
        - bx, by, bh, bw specify the bounding box if there is an object
        - c1, c2, c3 represent the classes. So, if the object is a car, c2 will be 1 and c1 and c3 will be 0, and so on
     
- Let’s say we select the first grid cell from the above example:

    ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-18-08-47.png)

    - Since there is no object in this grid cell, pc will be zero and the y label for this cell will be:

     ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-18-11-15.png)



- Let’s take another grid in which we have a car (c2 = 1):

    ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-18-15-50.png)
    
    - Even if an object spans more than one grid cell, it is assigned only to the single cell in which its midpoint is located.
    
    - y label for this grid will be:
    
    ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-18-27-25.png)

- Now we have an input image and its corresponding target vector. Using the above example (input image – 100 X 100 X 3, output – 3 X 3 X 8), our model will be trained as follows:

    ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-18-46-10.png)
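
To make the 3 X 3 X 8 target concrete, here is a minimal sketch of how such a label tensor could be assembled with NumPy. The helper `make_target` and the box values are illustrative assumptions, and cells without an object are simply filled with zeros here.

```python
import numpy as np

GRID = 3
CLASSES = ["pedestrian", "car", "motorcycle"]

def make_target(objects):
    """Build a GRID x GRID x 8 label; cells without an object stay all zeros."""
    target = np.zeros((GRID, GRID, 5 + len(CLASSES)), dtype=np.float32)
    for row, col, (bx, by, bh, bw), class_name in objects:
        target[row, col, 0] = 1.0                                # pc: object present
        target[row, col, 1:5] = (bx, by, bh, bw)                 # box, relative to the cell
        target[row, col, 5 + CLASSES.index(class_name)] = 1.0    # one-hot class (c1, c2, c3)
    return target

# Illustrative values: a car whose midpoint falls in the cell at row 1, column 2.
target = make_target([(1, 2, (0.4, 0.3, 0.9, 0.5), "car")])

print(target.shape)   # one 8-dimensional vector per grid cell
print(target[1, 2])   # the cell that contains the car
print(target[0, 0])   # an empty cell
```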

### How to Encode Bounding Boxes?

As we saw earlier, bx, by, bh, and bw are calculated relative to the grid cell we are dealing with. Let’s understand this concept with an example. Consider the center-right grid which contains a car:

 ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-19-23-07.png)

So, bx, by, bh, and bw will be calculated relative to this grid only. The y label for this grid will be:

 ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-18-27-25.png)

pc = 1 since there is an object in this grid and since it is a car, c2 = 1. Now, let’s see how to decide bx, by, bh, and bw. In YOLO, the coordinates assigned to all the grids are:

 ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-19-35-31.png)

bx, by are the x and y coordinates of the midpoint of the object with respect to this grid. In this case, it will be (around) bx = 0.4 and by = 0.3:

 ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-19-39-51.png)

bh is the ratio of the height of the bounding box (red box in the above example) to the height of the corresponding grid cell, which in our case is around 0.9. So,  bh = 0.9. bw is the ratio of the width of the bounding box to the width of the grid cell. So, bw = 0.5 (approximately). The y label for this grid will be:

   ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-16-12-44-34.png)

Notice here that bx and by will always range between 0 and 1, as the midpoint will always lie within the grid cell. In contrast, bh and bw can be more than 1 when the bounding box is larger than the grid cell.
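
The encoding described above can be sketched as a small function that converts a box given in pixels into cell-relative values. The function name `encode_box`, the corner-based pixel format, and the 300-pixel image size (chosen so that each cell is 100 pixels wide) are assumptions made for this example.

```python
def encode_box(x_min, y_min, x_max, y_max, image_size=300, grid=3):
    """Return (row, col, bx, by, bh, bw) for a box given in pixel coordinates."""
    cell = image_size / grid                    # side length of one grid cell
    x_mid = (x_min + x_max) / 2
    y_mid = (y_min + y_max) / 2
    col = min(int(x_mid // cell), grid - 1)     # the cell that contains the midpoint
    row = min(int(y_mid // cell), grid - 1)
    bx = x_mid / cell - col                     # midpoint offset inside the cell, in [0, 1)
    by = y_mid / cell - row
    bh = (y_max - y_min) / cell                 # box height as a ratio of cell height
    bw = (x_max - x_min) / cell                 # box width as a ratio of cell width
    return row, col, bx, by, bh, bw

# A box whose midpoint lies in the center-right cell (row 1, column 2).
car = encode_box(215, 85, 265, 175)
print(car)

# A box taller than one cell: bh exceeds 1 while bx and by stay below 1.
tall = encode_box(100, 50, 200, 250)
print(tall)
```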

### Intersection over Union and Non-Max Suppression:
    
- Consider the actual and predicted bounding boxes for a car as shown below:
    
    ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-16-13-07-50.png)
    
    - Here, the red box is the actual bounding box and the blue box is the predicted one. How can we decide whether it is a good prediction or not? Intersection over Union (IoU) measures how well the two boxes overlap: it is the ratio of the area of their intersection to the area of their union. These two areas are:
    
    ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-16-13-12-02.png)
    
    - IoU = Area of the intersection / Area of the union, i.e. IoU = Area of yellow box / Area of green box
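
As a rough sketch, IoU for two axis-aligned boxes can be computed as follows. The `(x_min, y_min, x_max, y_max)` box format is an assumption made for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlapping region (zero if the boxes do not overlap).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection      # do not count the overlap twice

    return intersection / union if union > 0 else 0.0

actual = (0, 0, 2, 2)
predicted = (1, 1, 3, 3)
print(iou(actual, predicted))   # overlap of 1 over a union of 7
```

An IoU of 1 means the boxes coincide exactly, and an IoU of 0 means they do not overlap at all.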

### Non-Max Suppression:

- There is one more technique that can improve the output of YOLO significantly

- One of the most common problems with object detection algorithms is that rather than detecting an object just once, they might detect it multiple times. Consider the below image:

    ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-16-13-32-40.png)

- Here, the cars are identified more than once. The Non-Max Suppression technique cleans this up so that we get only a single detection per object. Let’s see how this approach works.

1. It first looks at the probabilities associated with each detection and takes the largest one. In the above image, 0.9 is the highest probability, so the box with 0.9 probability will be selected first:

    ![Alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-17-12-08-14.png)

2. Now, it looks at all the other boxes in the image. The boxes which have high IoU with the current box are suppressed. So, the boxes with 0.6 and 0.7 probabilities will be suppressed in our example:

    ![Alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-17-12-09-17.png)

3. After the boxes have been suppressed, it selects the next box from all the boxes with the highest probability, which is 0.8 in our case:

    ![Alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-17-12-10-38.png)

4. Again, it will look at the IoU of this box with the remaining boxes and suppress the boxes with a high IoU:

    ![Alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-17-12-11-35.png)

5. We repeat these steps until all the boxes have either been selected or suppressed and we get the final bounding boxes:

    ![Alt text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-17-12-21-31.png)

This is what Non-Max Suppression is all about. We are taking the boxes with maximum probability and suppressing the close-by boxes with non-max probabilities. 

#### Let's Summarize

- Discard all the boxes having probabilities less than or equal to a pre-defined threshold (say, 0.5)
- For the remaining boxes:
    - Pick the box with the highest probability and take that as the output prediction
    - Discard any other box whose IoU with the output box from the above step is greater than a pre-defined IoU threshold
    - Repeat the two steps above until all the boxes are either taken as the output prediction or discarded
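
The summary above can be sketched in plain Python. The detection format `(probability, box)`, the box coordinates, and both threshold values are assumptions chosen to mirror the two-car example; this is a minimal greedy version, not the implementation of any particular library.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def non_max_suppression(detections, score_threshold=0.5, iou_threshold=0.5):
    """Greedy NMS over a list of (probability, box) tuples."""
    # Discard boxes with probability less than or equal to the threshold.
    candidates = sorted(
        (d for d in detections if d[0] > score_threshold),
        key=lambda d: d[0],
        reverse=True,
    )
    kept = []
    while candidates:
        best = candidates.pop(0)        # highest remaining probability
        kept.append(best)
        # Suppress every remaining box that overlaps the selected one too much.
        candidates = [d for d in candidates if iou(best[1], d[1]) <= iou_threshold]
    return kept

detections = [
    (0.9, (100, 100, 200, 200)),   # first car
    (0.6, (105, 105, 205, 205)),
    (0.7, (95, 98, 198, 202)),
    (0.8, (300, 100, 400, 200)),   # second car
    (0.7, (310, 105, 410, 205)),
    (0.3, (500, 300, 560, 360)),   # low-probability box, dropped by the first step
]

kept = non_max_suppression(detections)
for probability, box in kept:
    print(probability, box)
```

Only the 0.9 and 0.8 boxes survive in this example, one per car.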

### Anchor Boxes

What if there are multiple objects in a single grid cell?

- Consider the following image, divided into a 3 X 3 grid:

    ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-17-13-18-38.png)

- The midpoints of both objects lie in the same grid cell. This is how the actual bounding boxes for the objects will be:

    ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-17-13-20-41.png)

- Let's take two anchor boxes to make the concept easy to understand:

    ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-17-13-36-28.png)

- This is what the y label for YOLO <b>without</b> anchor boxes looks like:

    ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-15-18-01-24.png)

- The y label for YOLO <b>with</b> anchor boxes will be:

    ![Alt Text](https://cdn.analyticsvidhya.com/wp-content/uploads/2018/12/Screenshot-from-2018-11-17-13-33-31.png)

- The first 8 rows belong to anchor box 1 and the remaining 8 belong to anchor box 2. The output in this case, instead of 3 X 3 X 8 (using a 3 X 3 grid and 3 classes), will be 3 X 3 X 16 (since we are using 2 anchors).

##### So, for each grid, we can detect two or more objects based on the number of anchors.
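
Here is a minimal sketch of the 16-dimensional label for one grid cell with two anchors. The anchor shapes and object boxes are made-up values, and matching each object to the anchor whose shape overlaps it best is an assumption of this sketch (a common convention), not something spelled out above.

```python
import numpy as np

CLASSES = ["pedestrian", "car", "motorcycle"]
# Hypothetical anchor shapes as (width, height) in cell units: one tall, one wide.
ANCHORS = [(0.4, 1.2), (1.2, 0.5)]
VALUES_PER_ANCHOR = 5 + len(CLASSES)   # pc, bx, by, bh, bw, c1, c2, c3

def shape_iou(wh_a, wh_b):
    """IoU of two boxes that share the same center, so only their shapes matter."""
    intersection = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - intersection
    return intersection / union

def cell_label(objects):
    """Build the label for one cell; each object fills the slot of its best anchor."""
    label = np.zeros(len(ANCHORS) * VALUES_PER_ANCHOR, dtype=np.float32)
    for (bx, by, bh, bw), class_name in objects:
        best = max(range(len(ANCHORS)), key=lambda i: shape_iou((bw, bh), ANCHORS[i]))
        start = best * VALUES_PER_ANCHOR
        label[start] = 1.0                                   # pc for this anchor
        label[start + 1:start + 5] = (bx, by, bh, bw)
        label[start + 5 + CLASSES.index(class_name)] = 1.0   # one-hot class
    return label

pedestrian = ((0.5, 0.5, 1.1, 0.35), "pedestrian")   # (bx, by, bh, bw): tall and narrow
car = ((0.5, 0.6, 0.45, 1.3), "car")                 # short and wide
label = cell_label([pedestrian, car])

print(label[:8])   # anchor box 1 holds the pedestrian
print(label[8:])   # anchor box 2 holds the car
```

Note that in this simple version two objects that match the same anchor would overwrite each other, so the number of anchors limits how many objects a single cell can represent.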

### Let's combine all the above ideas 

YOLO divides up the image into a grid of 13 by 13 cells:

![Alt Text](http://machinethink.net/images/yolo/Grid@2x.png)

- Each of these cells is responsible for predicting 5 bounding boxes. 
- A bounding box describes the rectangle that encloses an object.
- YOLO also outputs a confidence score that tells us how certain it is that the predicted bounding box actually encloses some object.
- This score doesn’t say anything about what kind of object is in the box, just whether the shape of the box is any good.

The predicted bounding boxes may look something like the following (the higher the confidence score, the fatter the box is drawn):

![Alt Text](http://machinethink.net/images/yolo/Boxes@2x.png)

- For each bounding box, the cell also predicts a class. 
- This works just like a classifier: it gives a probability distribution over all the possible classes. 
- This version of YOLO was trained on the PASCAL VOC dataset, so it can detect 20 different classes, such as:

- bicycle
- boat
- car
- cat
- dog
- person

- The confidence score for the bounding box and the class prediction are combined into one final score that tells us the probability that this bounding box contains a specific type of object. 
- For example, the big fat yellow box on the left is 85% sure it contains the object “dog”:

![Alt Text](http://machinethink.net/images/yolo/Scores@2x.png)

- Since there are 13×13 = 169 grid cells and each cell predicts 5 bounding boxes, we end up with 845 bounding boxes in total. 
- It turns out that most of these boxes will have very low confidence scores, so we only keep the boxes whose final score is 30% or more (you can change this threshold depending on how accurate you want the detector to be).

The final prediction is then:

![Alt Text](http://machinethink.net/images/yolo/Prediction@2x.png)

- From the 845 total bounding boxes we only kept these three because their final scores were above the threshold. 
- But note that even though there were 845 separate predictions, they were all made at the same time — the neural network just ran once. And that’s why YOLO is so powerful and fast.
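
The scoring and filtering described above can be sketched with NumPy. The arrays below are made-up stand-ins for real network output: every box gets a low confidence except one planted detection, and class index 11 is just an arbitrary illustrative class.

```python
import numpy as np

GRID, BOXES, NUM_CLASSES = 13, 5, 20
THRESHOLD = 0.3

# Stand-in predictions: low confidence and a flat class distribution everywhere.
confidence = np.full((GRID, GRID, BOXES), 0.1)
class_probs = np.full((GRID, GRID, BOXES, NUM_CLASSES), 1.0 / NUM_CLASSES)

# Plant one confident detection in cell (6, 4), box 2.
confidence[6, 4, 2] = 0.95
class_probs[6, 4, 2] = 0.0
class_probs[6, 4, 2, 11] = 0.9
class_probs[6, 4, 2, 3] = 0.1

# Final score = box confidence x class probability, for every box and class.
scores = confidence[..., np.newaxis] * class_probs
best_class = scores.argmax(axis=-1)
best_score = scores.max(axis=-1)

keep = best_score >= THRESHOLD
print(best_score.size, "boxes in total,", int(keep.sum()), "kept")
for row, col, box in np.argwhere(keep):
    print((row, col, box), best_class[row, col, box], best_score[row, col, box])
```

The planted box ends up with a final score of 0.95 × 0.9 = 0.855, while all other boxes stay far below the 30% threshold and are dropped.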

The architecture of YOLO is simple; it’s just a convolutional neural network:

![Alt Text](https://i.imgur.com/QH0CvRN.png)

This neural network only uses standard layer types: convolution with a 3×3 kernel and max-pooling with a 2×2 kernel. No fancy stuff. There is no fully-connected layer in YOLOv2.

The very last convolutional layer has a 1×1 kernel and exists to reduce the data to the shape 13×13×125. This 13×13 should look familiar: that is the size of the grid that the image gets divided into.

So we end up with 125 channels for every grid cell. These 125 numbers contain the data for the bounding boxes and the class predictions. Why 125? Well, each grid cell predicts 5 bounding boxes and a bounding box is described by 25 data elements:

- x, y, width, height for the bounding box’s rectangle (4 values)
- the confidence score (1 value)
- the probability distribution over the 20 classes (20 values)

Using YOLO is simple: you give it an input image (resized to 416×416 pixels), it goes through the convolutional network in a single pass, and comes out the other end as a 13×13×125 tensor describing the bounding boxes for the grid cells. All you need to do then is compute the final scores for the bounding boxes and throw away the ones scoring lower than 30%.
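
To see how the 125 channels break down, the output tensor can be reshaped so that each of the 5 boxes gets its own 25 values. This is a sketch under assumptions: the zero array stands in for real network output, and the channel ordering (box by box, with x, y, width, height first, then confidence, then classes) is assumed for illustration. A real model's raw outputs may also need activation functions applied before they can be read as coordinates and probabilities.

```python
import numpy as np

GRID, BOXES, NUM_CLASSES = 13, 5, 20
VALUES_PER_BOX = 4 + 1 + NUM_CLASSES    # x, y, width, height + confidence + classes

# Stand-in for the network output for one 416x416 image.
output = np.zeros((GRID, GRID, BOXES * VALUES_PER_BOX), dtype=np.float32)

# Split the 125 channels into 5 boxes of 25 values each.
per_box = output.reshape(GRID, GRID, BOXES, VALUES_PER_BOX)

box_coords = per_box[..., 0:4]      # x, y, width, height
confidence = per_box[..., 4]        # one confidence score per box
class_scores = per_box[..., 5:]     # one value per class

print(output.shape, "->", per_box.shape)
print(box_coords.shape, confidence.shape, class_scores.shape)
```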

The YOLOv2 paper is available here:
https://arxiv.org/pdf/1612.08242v1.pdf