For a complete SSD implementation using TensorFlow, go to https://github.com/PranavEranki/SSD-Algorithm

# Single Shot Multibox Detection Algorithm
* How is the SSD algorithm different from other CNN-based detectors?
* How does it predict object detections?

Let's take this image. We want to detect all the sheep.

![](imgs/sheep.PNG)

Conventional CNN-based detectors use what is called an object proposal methodology.
* They first run an object proposal technique
* It segments the whole image into rectangles and attempts to find every rectangle that contains a sheep
* It checks for changes in gradient and color, which 'indicate' an object in that rectangle
* It estimates which rectangles contain sheep and which do not
* This is error-prone: it is quite easy to mistake shadows for objects or to miss objects entirely

This severely hurts both accuracy and computational efficiency.

![](imgs/ssd.PNG)

The SSD algorithm looks at the entire image only once - hence the name "single shot".

* All the boxes go through the network at the same time. The network also keeps track of which boxes it is dealing with.
* Small convolutions shrink the image so that the network can handle objects of different sizes (e.g. small sheep in the background)

Before we continue with how the multi-box detection concept works, let's define some basic vocab:
1. Ground truth - a concept that is used to separate observed / empirical evidence from inferred evidence
    - Ground truth is what exists for sure
    - Inferred evidence could be boxes that the model puts around objects


### So, how does the SSD algorithm work?
1. First, let's imagine we have a __fully trained model__, and we take in the image (think of the *person* box as a label for the model to use)
![](imgs/person.PNG)

The SSD will break the image down into segments (see the green dots?) and construct 3 boxes for each of the segments (the sizes and exact layout can vary).

*Please note that the algorithm does not draw this on your screen; it is performed internally. These images are just a visualization of how the algorithm works.*
![](imgs/beginningDetection.PNG)

Then, we cover the entire image with these boxes.
![](imgs/fullCovered.PNG)
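
As a rough illustration of how such boxes could be laid out, here is a minimal sketch in plain Python. It assumes a square grid of cell centers (the green dots) and three box shapes per center. The function name `make_default_boxes`, the grid size and the shape values are all hypothetical and chosen only for this example; they are not taken from the linked implementation.

```python
from itertools import product


def make_default_boxes(grid_size, box_shapes):
    """Return (cx, cy, w, h) boxes in normalized [0, 1] image coordinates.

    grid_size  -- number of cells per side (hypothetical value)
    box_shapes -- (width, height) pairs placed at every cell center
    """
    boxes = []
    for row, col in product(range(grid_size), repeat=2):
        # Center of the current cell, as a fraction of the image size.
        cx = (col + 0.5) / grid_size
        cy = (row + 0.5) / grid_size
        for w, h in box_shapes:
            boxes.append((cx, cy, w, h))
    return boxes


# Three illustrative shapes per cell: square, wide and tall (assumed values).
SHAPES = [(0.2, 0.2), (0.3, 0.15), (0.15, 0.3)]

boxes = make_default_boxes(grid_size=8, box_shapes=SHAPES)
print(len(boxes))  # 8 * 8 cells * 3 shapes = 192
```

Because the coordinates are fractions of the image size, the same list of boxes can be reused for images of any resolution.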

Then, for every box, it checks for the presence of an object it is being trained to detect (for example, if it is being trained on sheep, it will look for a sheep in every box it has made).

In our example, it detects these boxes (highlighted in red), which contain a person.
![](imgs/detected1.PNG)
*It does not detect the people in the back: they are too small, and the algorithm cannot find features for them.*
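
One simple way to picture the "does this box contain a person?" step is a score per box plus a cut-off. The sketch below assumes the network has already produced one score between 0 and 1 for each box. The scores, the box names and the 0.5 threshold are made-up values for illustration, and `keep_confident` is a hypothetical helper, not a function from any library.

```python
def keep_confident(boxes, scores, threshold=0.5):
    """Keep only the boxes whose object score reaches the threshold."""
    return [box for box, score in zip(boxes, scores) if score >= threshold]


# Made-up example: four candidate boxes and their "person" scores.
candidate_boxes = ["box_a", "box_b", "box_c", "box_d"]
person_scores = [0.91, 0.12, 0.67, 0.30]

detected = keep_confident(candidate_boxes, person_scores)
print(detected)  # ['box_a', 'box_c']
```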

Let's clean up those boxes:
![](imgs/detectedClean.PNG)
*As you can see, not every box is orange - one does not have enough features*

Since we have the ground truth box (the original white box), the algorithm can measure the difference between its predictions and the ground truth. It then adjusts its weights (the process of backpropagation) to account for this difference and draw better boxes around the people.
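
A common way to put a number on how well a predicted box agrees with a ground truth box is intersection over union (IoU, also called Jaccard overlap), which the SSD paper uses to match its boxes to ground truth boxes. The sketch below is a minimal version that assumes boxes are given as `(x_min, y_min, x_max, y_max)`; the example coordinates are made up, and this is not necessarily how the linked implementation computes it.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle (if there is one).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so boxes that do not overlap give an empty intersection.
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


ground_truth = (2, 2, 6, 6)  # the "white box" (made-up coordinates)
prediction = (4, 4, 8, 8)    # a box proposed by the model (made-up)

print(iou(ground_truth, prediction))    # overlap of 4 over a union of 28
print(iou(ground_truth, ground_truth))  # a perfect match gives 1.0
```

A value near 1 means the prediction almost coincides with the ground truth, while 0 means the two boxes do not overlap at all.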

Through the training, two things happen:
* The algorithm will learn how to use each box it generates to better find the existence of people inside the bounding box
* The algorithm will get better at generating more accurate boxes for the people in the correct positions


Now, our algorithm is doing well. There is just one small problem - __scaling__. Suppose we have an image like this:

![](imgs/horsesInAField.PNG)

Then the algorithm will only detect these horses:

![](imgs/detectedHorses1.PNG)


The reason is that the main horse is too close. The rectangles are simply too small for a single box to capture the entire horse and the important features the algorithm needs. (It might pick up a nose or a tail, but it will be unable to piece everything together and identify a full horse.)

Now, the next component of the SSD comes into play.

Let's refer back to the SSD architecture:
![](imgs/ssd.PNG)

The algorithm applies convolutions to shrink the size of the image. 

Everything we previously talked about occurs on these images.
* Rectangles identified
* Training of where and what the rectangles should detect

There are many detections, and every time we convolve the image, we detect the boxes again.

And again,

and again,

and again.

And eventually, during one of these convolutions, the image will reach a size at which the algorithm can detect the large horse, because the horse has shrunk enough to fit inside a rectangle (which stays the same size throughout).

![](imgs/smallHorse.PNG)
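
To see why shrinking helps, it can be useful to work out how much of the original image one grid cell covers at each stage. The sketch below assumes a 300 pixel input and grid sizes of 38, 19, 10, 5, 3 and 1, which are the values usually quoted for the 300 pixel SSD configuration. Treat them as assumptions here, since other configurations use different numbers.

```python
IMAGE_SIZE = 300                           # assumed input resolution in pixels
FEATURE_MAP_SIZES = [38, 19, 10, 5, 3, 1]  # assumed grid sizes, large to small


def cell_size_in_pixels(image_size, grid_size):
    """How many original-image pixels one grid cell spans along one side."""
    return image_size / grid_size


for grid in FEATURE_MAP_SIZES:
    cell = cell_size_in_pixels(IMAGE_SIZE, grid)
    print(f"{grid:>2} x {grid:<2} grid -> each cell spans about {cell:.0f} px")
```

The rectangle keeps the same size in grid cells, but each cell stands for more and more of the original image, so the later grids are the ones that can capture the large horse.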

But how is the rectangle drawn on this (smaller) horse translated back to the larger image?

Well, it is pretty straightforward. Referring to the architecture, the SSD keeps track of the scale at which each detection was made, then 'deconvolves' (maps the result back to the original image size) and places the box on the horse in the original image.
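
One way to picture the mapping back to the original image is a plain rescaling of the box coordinates. The sketch below assumes the box is expressed in grid-cell units on a square grid; the helper name `to_image_coords`, the 5 x 5 grid and the 300 pixel image are hypothetical values chosen for the example.

```python
def to_image_coords(box, grid_size, image_width, image_height):
    """Convert a box from grid-cell units to pixel coordinates.

    box -- (x_min, y_min, x_max, y_max) measured in cells of a
           grid_size x grid_size map
    """
    scale_x = image_width / grid_size   # pixels per cell, horizontally
    scale_y = image_height / grid_size  # pixels per cell, vertically
    x_min, y_min, x_max, y_max = box
    return (x_min * scale_x, y_min * scale_y, x_max * scale_x, y_max * scale_y)


# A horse found on a small 5 x 5 grid (made-up coordinates).
horse_on_small_map = (1, 1, 4, 3)

print(to_image_coords(horse_on_small_map, grid_size=5,
                      image_width=300, image_height=300))
# (60.0, 60.0, 240.0, 180.0)
```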

Again, to reiterate, the entire algorithm is done __in one go__. All the layers learn together, and as we have seen before, *power in numbers*, and the network can better adjust itself to see all the images.