# The Power of an End-to-End Detection Framework (YOLO)
$\text{by Kent Chiu}$

### Introduction


Recently, [YOLO](https://arxiv.org/pdf/1506.02640v5.pdf) and [SSD](https://arxiv.org/pdf/1512.02325v2.pdf) have been regarded as two of the most state-of-the-art object detection algorithms in computer vision. They make great progress on frames per second (FPS) while keeping a decent mean average precision (mAP).

One of their main strategies is to treat the detection problem as a regression problem. This idea seems to originate from [OverFeat](https://arxiv.org/pdf/1312.6229v4.pdf) and was strongly boosted by YOLO. SSD is a follow-up work that combines the idea with the region-proposal approach of the [Fast R-CNN](https://arxiv.org/pdf/1504.08083v2.pdf) line of work to trade off mAP against FPS.

In this article, I am going to share how I reimplemented YOLO with Python and TensorFlow, and what I have learned from the YOLO strategy. My work is inspired by many other works; please check these amazing works in the reference section.

### Article Structure

- YOLO Strategy

- Reimplementation 

### Core Concept

- The core concept of Darknet with YOLO is the encoded pair of **input image tensor** and **output tensor**, which is quite different from the **one-hot output vector** of the usual classification case.

- The encoding of the output tensor is kind of handcrafted and is **independent of** the convnet itself.

- The convnet just tries to fit the output tensor.

- YOLO divides the input image into an S × S grid.

- **If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object**.

- This means that, for a 448 × 448 input and S = 7, each grid cell covers (448/7, 448/7) = (64, 64) pixels.

$$ \text{Output } = F_\text{cnn}(\text{ Input }) $$

$$ \text{Input } : X \in \mathbb{R}^{448\times448\times3}$$

$$ \text{Output } : Y \in \mathbb{R}^{S\times S\times (B\times5+C)}$$

where

$$ \text{Grid size} : S \times S \text{, where } S \in \{\ \text{factors of the image width and height}\,\}$$

$$ \text{Number of bounding boxes per cell} : B \text{, where each box} \equiv \{\ x,\ y,\ w,\ h,\ \text{confidence score}\,\}$$

$$ \text{Number of classes} : C \text{ (object categories such as car, cat, ...)} $$
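
To make the shapes concrete, here is a minimal sketch. It assumes the PASCAL VOC configuration reported in the YOLO paper (S = 7, B = 2, C = 20), and the helper name `responsible_cell` is hypothetical; it only illustrates the rule that the cell containing the box center is responsible for the object.

```python
# Assumed configuration (PASCAL VOC setup from the YOLO paper).
IMG_SIZE = 448   # input width and height in pixels
S = 7            # grid cells per side
B = 2            # boxes predicted per cell
C = 20           # number of classes

cell_size = IMG_SIZE // S           # pixels covered by one cell along each side
output_shape = (S, S, B * 5 + C)    # shape of the encoded output tensor


def responsible_cell(cx, cy, img_size=IMG_SIZE, s=S):
    """Return (row, col) of the grid cell containing the box center (cx, cy), in pixels."""
    col = min(int(cx * s / img_size), s - 1)   # clamp centers lying on the right edge
    row = min(int(cy * s / img_size), s - 1)   # clamp centers lying on the bottom edge
    return row, col


print(cell_size)                    # 64
print(output_shape)                 # (7, 7, 30)
print(responsible_cell(300, 100))   # (1, 4)
```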

### A Single Grid Cell Predicts $B$ Boxes

- At first glance, predicting multiple boxes in a single grid cell might seem a little redundant.

- However, it actually increases the capacity of the convnet and, with a suitably selective loss function, helps the convnet generalize.



### What Does a Predicted Bounding Box Represent?

- Each bounding box consists of 5 predictions: {x, y, w, h, and confidence}.

- The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. 

- The width and height are predicted relative to the whole image. 

- Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.
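
As an illustration of this parameterization, the sketch below converts one cell-relative prediction back to pixel coordinates. The function name `decode_box` and the 448 × 448 / S = 7 defaults are assumptions made for the example, not part of the original implementation.

```python
def decode_box(row, col, x, y, w, h, img_size=448, s=7):
    """Convert one prediction to pixel corners (x_min, y_min, x_max, y_max).

    (x, y) is the box center relative to the bounds of cell (row, col), in [0, 1].
    (w, h) is the box size relative to the whole image, in [0, 1].
    """
    cell = img_size / s
    cx = (col + x) * cell      # center in pixels
    cy = (row + y) * cell
    bw = w * img_size          # size in pixels
    bh = h * img_size
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)


# A box centered in cell (row=1, col=4) that spans a quarter of the image per side.
print(decode_box(row=1, col=4, x=0.5, y=0.5, w=0.25, h=0.25))
# (232.0, 40.0, 344.0, 152.0)
```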

### What Does the Confidence Represent?

- The **confidence** here is well defined.

- The confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the predicted box is.

- If **no object** exists in that cell, the **confidence scores should be zero**.

- Otherwise we want the confidence score to equal **the intersection over union (IOU)** between the predicted box and the ground truth:

$$ \Pr ( \text{Objects} ) \times \text{IOU}^{pred}_{truth} $$

$$
\Pr ( \text{Objects} ) =
\begin{cases}
1,& \text{ if an object exists in the cell } \\
0,& \text{ if no object exists }
\end{cases}
$$
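
The confidence target can be sketched in a few lines. The box format `(x_min, y_min, x_max, y_max)` and the function names `iou` and `confidence_target` are assumptions chosen for readability.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clipped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_target(has_object, pred_box, truth_box):
    """Pr(Object) * IOU: zero for an empty cell, otherwise the IOU with the truth."""
    return iou(pred_box, truth_box) if has_object else 0.0


pred, truth = (0, 0, 2, 2), (1, 0, 3, 2)
print(confidence_target(True, pred, truth))    # overlap 2, union 6, so about 0.333
print(confidence_target(False, pred, truth))   # 0.0
```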


### What Does the Number of Classes Really Represent?

- $C$ is the number of class probabilities predicted per grid cell; they describe how likely the object in that cell is to belong to each class (a car? a dog? ...).

- For a training pair, the class vector of the grid cell that contains the center of a target bounding box is a one-hot vector.

- For a training pair, the class vector of a grid cell that does not contain the center of any target bounding box is an all-zero vector.

- Each grid cell also predicts $C$ conditional class probabilities, $\Pr(\text{Class}_i \mid \text{Object})$. These probabilities are conditioned on the grid cell containing an object.

- We only predict one set of class probabilities per grid cell, regardless of the number of boxes $B$.
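
Here is a small sketch of how the target for one cell could be encoded. It follows the per-cell truth layout used by the darknet source listed later in this article (objectness flag, then $C$ class entries, then the 4 box coordinates); the function name `encode_cell_truth` and the class index 6 are hypothetical.

```python
def encode_cell_truth(class_id, box, num_classes=20):
    """Build the truth vector of one grid cell: [is_obj, one-hot classes, x, y, w, h].

    Pass class_id=None for a cell that contains no object center.
    """
    truth = [0.0] * (1 + num_classes + 4)
    if class_id is None:
        return truth                       # all zeros: nothing to detect in this cell
    truth[0] = 1.0                         # objectness flag
    truth[1 + class_id] = 1.0              # one-hot class entry
    truth[1 + num_classes:] = list(box)    # x, y, w, h
    return truth


empty = encode_cell_truth(None, None)
cell = encode_cell_truth(class_id=6, box=(0.5, 0.5, 0.25, 0.25))

print(sum(empty))               # 0.0
print(cell[1:21].index(1.0))    # 6
print(cell[21:])                # [0.5, 0.5, 0.25, 0.25]
```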

### Some Insights about Encoded Output Tensor (EOT)

- Very sparse: much sparser than a traditional one-hot encoding, because the tensor is much larger and only the cells that contain an object center are non-zero.


### Loss Function Design 

We use **sum-squared error** because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights **localization error** equally with **classification error** which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the **“confidence” scores** of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, $\lambda_{coord}$ and $\lambda_{noobj}$, to accomplish this. We set **$\lambda_{coord} = 5$** and **$\lambda_{noobj} = 0.5$**.

**Sum-squared error** also equally weights errors in **large boxes** and **small boxes.** Our **error metric** should reflect that **small deviations in large boxes matter less than in small boxes.**  To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.
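
A quick numeric check of this argument: the sketch below applies the same absolute deviation to a small and a large width and compares the plain squared error with the square-root version. The widths 0.1 and 0.8 and the deviation 0.05 are arbitrary values chosen for illustration.

```python
import math


def wh_error(true_w, pred_w, use_sqrt):
    """Squared error on a width (or height), optionally in square-root space."""
    if use_sqrt:
        return (math.sqrt(true_w) - math.sqrt(pred_w)) ** 2
    return (true_w - pred_w) ** 2


deviation = 0.05
for true_w in (0.1, 0.8):   # a small box and a large box, relative to the image
    plain = wh_error(true_w, true_w + deviation, use_sqrt=False)
    rooted = wh_error(true_w, true_w + deviation, use_sqrt=True)
    print(f"w={true_w}: plain={plain:.6f}, sqrt={rooted:.6f}")
```

The plain error is identical for both boxes, while the square-root error is several times larger for the small box, which is the behaviour the loss is designed to have.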

YOLO predicts **multiple bounding boxes per grid cell.** At training time we **only want one bounding box predictor to be responsible for each object.** We assign one predictor to be “responsible” for predicting an object based on which prediction has the **highest current IOU with the ground truth.** 


This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
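
The assignment rule can be sketched as follows, assuming the IOU of each of the $B$ predictors with the ground truth has already been computed. The function names are hypothetical.

```python
def responsible_index(ious):
    """Index of the predictor with the highest current IOU against the ground truth."""
    return max(range(len(ious)), key=lambda j: ious[j])


def responsibility_mask(ious):
    """1 for the responsible predictor, 0 for all other predictors of the cell."""
    best = responsible_index(ious)
    return [1 if j == best else 0 for j in range(len(ious))]


# Two predictors (B = 2) in one cell, with IOUs 0.31 and 0.58 against the truth.
print(responsibility_mask([0.31, 0.58]))   # [0, 1]
```

The darknet code listed later additionally falls back to the lowest RMSE when no predictor overlaps the truth at all; that refinement is left out of this sketch.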



$$
\lambda_{coord}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}1_{ij}^{obj}\left[(x_i - \hat{x_i})^2 + (y_i - \hat{y_i})^2\right]
$$

$$
+\lambda_{coord}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}1_{ij}^{obj}\left[(\sqrt{w_i} - \sqrt{\hat{w_i}})^2 + (\sqrt{h_i} - \sqrt{\hat{h_i}})^2\right]
$$

$$
+\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}1_{ij}^{obj}(c_i - \hat{c_i})^2
$$

$$
+\lambda_{noobj}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}1_{ij}^{noobj}(c_i - \hat{c_i})^2
$$

$$
+\sum\limits_{i=0}^{S^2}1_{i}^{obj}\sum\limits_{c \in classes}(p_i(c)-\hat{p_i}(c))^2
$$

where $1^{obj}_{i}$ denotes whether an object appears in cell $i$ and $1^{obj}_{ij}$ denotes that the $j$-th bounding box predictor in cell $i$ is “responsible” for that prediction.

Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). 

It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
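
Putting the terms together, the contribution of a single grid cell to the loss can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `cell_loss` and its argument layout are hypothetical, predicted widths and heights are assumed to be non-negative, and every predictor that is not responsible for an object is treated as a "noobj" predictor.

```python
import math

LAMBDA_COORD = 5.0
LAMBDA_NOOBJ = 0.5


def cell_loss(pred_boxes, pred_probs, truth_box, truth_probs, responsible, truth_conf):
    """Loss contribution of one grid cell.

    pred_boxes:  list of B tuples (x, y, w, h, c) predicted by the cell
    pred_probs:  list of C predicted class probabilities
    truth_box:   (x, y, w, h) of the ground truth, or None if the cell is empty
    truth_probs: list of C target class probabilities (one-hot)
    responsible: index of the responsible predictor (ignored for an empty cell)
    truth_conf:  confidence target of the responsible predictor (its IOU)
    """
    loss = 0.0
    for j, (x, y, w, h, c) in enumerate(pred_boxes):
        if truth_box is not None and j == responsible:
            tx, ty, tw, th = truth_box
            # Localization terms, weighted by lambda_coord.
            loss += LAMBDA_COORD * ((tx - x) ** 2 + (ty - y) ** 2)
            loss += LAMBDA_COORD * ((math.sqrt(tw) - math.sqrt(w)) ** 2
                                    + (math.sqrt(th) - math.sqrt(h)) ** 2)
            # Confidence term of the responsible predictor.
            loss += (truth_conf - c) ** 2
        else:
            # Confidence of a non-responsible predictor is pushed towards zero.
            loss += LAMBDA_NOOBJ * (0.0 - c) ** 2
    if truth_box is not None:
        # Classification term, only for cells that contain an object.
        loss += sum((t - p) ** 2 for t, p in zip(truth_probs, pred_probs))
    return loss


# An empty cell whose two predictors output confidences 0.2 and 0.4.
boxes = [(0.5, 0.5, 0.2, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2, 0.4)]
print(cell_loss(boxes, [0.0, 0.0], None, [0.0, 0.0], 0, 0.0))   # about 0.1
```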

We train the network for about 135 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012. 

When testing on 2012 we also include the VOC 2007 test data for training. Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.

Our learning rate schedule is as follows: for the first epochs we slowly raise the learning rate from $10^{-3}$ to $10^{-2}$. If we start at a high learning rate our model often diverges due to unstable gradients. We continue training with $10^{-2}$ for 75 epochs, then $10^{-3}$ for 30 epochs, and finally $10^{-4}$ for 30 epochs.
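
The schedule can be written as a small function. The text does not say how many warm-up epochs are used or what shape the ramp has, so both the `warmup_epochs` parameter and the linear ramp below are assumptions.

```python
def learning_rate(epoch, warmup_epochs=1):
    """Learning rate for a (possibly fractional) epoch index, starting at 0."""
    if epoch < warmup_epochs:
        # Assumed linear ramp from 1e-3 to 1e-2 during the warm-up.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    epoch -= warmup_epochs
    if epoch < 75:
        return 1e-2     # 75 epochs at 1e-2
    if epoch < 105:
        return 1e-3     # then 30 epochs at 1e-3
    return 1e-4         # finally 30 epochs at 1e-4


for e in (0, 1, 80, 120):
    print(e, learning_rate(e))
```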

To avoid overfitting we use dropout and extensive data augmentation. A dropout layer with rate = 0.5 after the first connected layer prevents co-adaptation between layers [18].

For data augmentation we introduce random scaling and translations of up to 20% of the original image size. 

We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.



Merging the two coordinate terms, the same loss can be written more compactly as:

$$
\lambda_{coord}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}1_{ij}^{obj}\left[(x_i - \hat{x_i})^2 + (y_i - \hat{y_i})^2 + (\sqrt{w_i} - \sqrt{\hat{w_i}})^2 + (\sqrt{h_i} - \sqrt{\hat{h_i}})^2\right]
$$

$$
+\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}1_{ij}^{obj}(c_i - \hat{c_i})^2
$$

$$
+\lambda_{noobj}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}1_{ij}^{noobj}(c_i - \hat{c_i})^2
$$

$$
+\sum\limits_{i =0}^{S^2}1_{i}^{obj}({\sum\limits_{c \in classes}(p_i(c)-\hat{p_i}(c))^2})
$$

### Build the net

In the original source code, the author provides the model as a .config file for the **darknet framework**.

Here I am going to translate the model from the **darknet style to the Keras and TensorFlow style**.

### Darknet Source Code Mapping


```c
detection_layer make_detection_layer(int batch, int inputs, int n, int side, int classes, int coords, int rescore)
{
    detection_layer l = {0};
    l.type = DETECTION;

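    l.n = n;               // B: number of boxes predicted per grid cell
    l.batch = batch;
    l.inputs = inputs;
    l.classes = classes;   // C: number of classes
    l.coords = coords;     // x, y, w, h of each box
    l.rescore = rescore;   // if set, the confidence target is the IOU instead of 1
    l.side = side;         // S: number of grid cells per side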
    l.w = side;
    l.h = side;
    assert(side*side*((1 + l.coords)*l.n + l.classes) == inputs);
    l.cost = calloc(1, sizeof(float));
    l.outputs = l.inputs;
    l.truths = l.side*l.side*(1+l.coords+l.classes);
    l.output = calloc(batch*l.outputs, sizeof(float));
    l.delta = calloc(batch*l.outputs, sizeof(float));
#ifdef GPU
    l.output_gpu = cuda_make_array(l.output, batch*l.outputs);
    l.delta_gpu = cuda_make_array(l.delta, batch*l.outputs);
#endif

    fprintf(stderr, "Detection Layer\n");
    srand(0);

    return l;
}

void forward_detection_layer(const detection_layer l, network_state state)
{
    int locations = l.side*l.side;
    int i,j;
    memcpy(l.output, state.input, l.outputs*l.batch*sizeof(float));
    //if(l.reorg) reorg(l.output, l.w*l.h, size*l.n, l.batch, 1);
    int b;
    if (l.softmax){
        for(b = 0; b < l.batch; ++b){
            int index = b*l.inputs;
            for (i = 0; i < locations; ++i) {
                int offset = i*l.classes;
                softmax_array(l.output + index + offset, l.classes, 1,
                        l.output + index + offset);
            }
        }
    }
    if(state.train){
        float avg_iou = 0;
        float avg_cat = 0;
        float avg_allcat = 0;
        float avg_obj = 0;
        float avg_anyobj = 0;
        int count = 0;
        *(l.cost) = 0;
        int size = l.inputs * l.batch;
        memset(l.delta, 0, size * sizeof(float));
        for (b = 0; b < l.batch; ++b){
            int index = b*l.inputs;
            for (i = 0; i < locations; ++i) {
                int truth_index = (b*locations + i)*(1+l.coords+l.classes);
                int is_obj = state.truth[truth_index];
                for (j = 0; j < l.n; ++j) {
                    int p_index = index + locations*l.classes + i*l.n + j;
                    l.delta[p_index] = l.noobject_scale*(0 - l.output[p_index]);
                    *(l.cost) += l.noobject_scale*pow(l.output[p_index], 2);
                    avg_anyobj += l.output[p_index];
                }

                int best_index = -1;
                float best_iou = 0;
                float best_rmse = 20;

                if (!is_obj){
                    continue;
                }

                int class_index = index + i*l.classes;
                for(j = 0; j < l.classes; ++j) {
                    l.delta[class_index+j] = l.class_scale * (state.truth[truth_index+1+j] - l.output[class_index+j]);
                    *(l.cost) += l.class_scale * pow(state.truth[truth_index+1+j] - l.output[class_index+j], 2);
                    if(state.truth[truth_index + 1 + j]) avg_cat += l.output[class_index+j];
                    avg_allcat += l.output[class_index+j];
                }

                box truth = float_to_box(state.truth + truth_index + 1 + l.classes);
                truth.x /= l.side;
                truth.y /= l.side;

                for(j = 0; j < l.n; ++j){
                    int box_index = index + locations*(l.classes + l.n) + (i*l.n + j) * l.coords;
                    box out = float_to_box(l.output + box_index);
                    out.x /= l.side;
                    out.y /= l.side;

                    if (l.sqrt){
                        out.w = out.w*out.w;
                        out.h = out.h*out.h;
                    }

                    float iou  = box_iou(out, truth);
                    //iou = 0;
                    float rmse = box_rmse(out, truth);
                    if(best_iou > 0 || iou > 0){
                        if(iou > best_iou){
                            best_iou = iou;
                            best_index = j;
                        }
                    }else{
                        if(rmse < best_rmse){
                            best_rmse = rmse;
                            best_index = j;
                        }
                    }
                }

                if(l.forced){
                    if(truth.w*truth.h < .1){
                        best_index = 1;
                    }else{
                        best_index = 0;
                    }
                }
                if(l.random && *(state.net.seen) < 64000){
                    best_index = rand()%l.n;
                }

                int box_index = index + locations*(l.classes + l.n) + (i*l.n + best_index) * l.coords;
                int tbox_index = truth_index + 1 + l.classes;

                box out = float_to_box(l.output + box_index);
                out.x /= l.side;
                out.y /= l.side;
                if (l.sqrt) {
                    out.w = out.w*out.w;
                    out.h = out.h*out.h;
                }
                float iou  = box_iou(out, truth);

                //printf("%d,", best_index);
                int p_index = index + locations*l.classes + i*l.n + best_index;
                *(l.cost) -= l.noobject_scale * pow(l.output[p_index], 2);
                *(l.cost) += l.object_scale * pow(1-l.output[p_index], 2);
                avg_obj += l.output[p_index];
                l.delta[p_index] = l.object_scale * (1.-l.output[p_index]);

                if(l.rescore){
                    l.delta[p_index] = l.object_scale * (iou - l.output[p_index]);
                }

                l.delta[box_index+0] = l.coord_scale*(state.truth[tbox_index + 0] - l.output[box_index + 0]);
                l.delta[box_index+1] = l.coord_scale*(state.truth[tbox_index + 1] - l.output[box_index + 1]);
                l.delta[box_index+2] = l.coord_scale*(state.truth[tbox_index + 2] - l.output[box_index + 2]);
                l.delta[box_index+3] = l.coord_scale*(state.truth[tbox_index + 3] - l.output[box_index + 3]);
                if(l.sqrt){
                    l.delta[box_index+2] = l.coord_scale*(sqrt(state.truth[tbox_index + 2]) - l.output[box_index + 2]);
                    l.delta[box_index+3] = l.coord_scale*(sqrt(state.truth[tbox_index + 3]) - l.output[box_index + 3]);
                }

                *(l.cost) += pow(1-iou, 2);
                avg_iou += iou;
                ++count;
            }
        }

        if(0){
            float *costs = calloc(l.batch*locations*l.n, sizeof(float));
            for (b = 0; b < l.batch; ++b) {
                int index = b*l.inputs;
                for (i = 0; i < locations; ++i) {
                    for (j = 0; j < l.n; ++j) {
                        int p_index = index + locations*l.classes + i*l.n + j;
                        costs[b*locations*l.n + i*l.n + j] = l.delta[p_index]*l.delta[p_index];
                    }
                }
            }
            int indexes[100];
            top_k(costs, l.batch*locations*l.n, 100, indexes);
            float cutoff = costs[indexes[99]];
            for (b = 0; b < l.batch; ++b) {
                int index = b*l.inputs;
                for (i = 0; i < locations; ++i) {
                    for (j = 0; j < l.n; ++j) {
                        int p_index = index + locations*l.classes + i*l.n + j;
                        if (l.delta[p_index]*l.delta[p_index] < cutoff) l.delta[p_index] = 0;
                    }
                }
            }
            free(costs);
        }


        *(l.cost) = pow(mag_array(l.delta, l.outputs * l.batch), 2);


        printf("Detection Avg IOU: %f, Pos Cat: %f, All Cat: %f, Pos Obj: %f, Any Obj: %f, count: %d\n", avg_iou/count, avg_cat/count, avg_allcat/(count*l.classes), avg_obj/count, avg_anyobj/(l.batch*locations*l.n), count);
        //if(l.reorg) reorg(l.delta, l.w*l.h, size*l.n, l.batch, 0);
    }
}
```
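
The index arithmetic in `forward_detection_layer` implies that the flat output of one image is laid out as three consecutive blocks: all class probabilities, then all box confidences, then all box coordinates. The sketch below mirrors the `class_index`, `p_index` and `box_index` expressions from the C code, with the per-image offset `index` left out. The Python function names are my own, and the values S = 7, B = 2, C = 20 are the assumed PASCAL VOC configuration.

```python
def class_index(i, classes):
    """Offset of the first class probability of cell i."""
    return i * classes


def confidence_index(i, j, side, n, classes):
    """Offset of the confidence of box j in cell i (p_index in the C code)."""
    locations = side * side
    return locations * classes + i * n + j


def box_index(i, j, side, n, classes, coords=4):
    """Offset of the first coordinate (x) of box j in cell i."""
    locations = side * side
    return locations * (classes + n) + (i * n + j) * coords


side, n, classes = 7, 2, 20
print(class_index(0, classes))                      # 0
print(confidence_index(0, 0, side, n, classes))     # 980
print(box_index(0, 0, side, n, classes))            # 1078
print(box_index(48, 1, side, n, classes) + 3)       # 1469, the last of 1470 outputs
```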

# Reimplementation


### References

- https://github.com/gliese581gg/YOLO_tensorflow
- http://guanghan.info/blog/en/my-works/train-yolo/
- https://github.com/thtrieu/yolotf
- https://github.com/nilboy/tensorflow-yolo
- https://github.com/gliese581gg/YOLO_tensorflow/blob/master/YOLO_tiny_tf.py