### Introduction


Recently, [YOLO](https://arxiv.org/pdf/1506.02640v5.pdf) and [SSD](https://arxiv.org/pdf/1512.02325v2.pdf) have been regarded as among the most state-of-the-art algorithms for object detection in computer vision. They make great progress on frames per second (FPS) while keeping a decent mean average precision (mAP).

One of their main strategies is to treat the detection problem as a regression problem. This idea seems to originate from [OverFeat](https://arxiv.org/pdf/1312.6229v4.pdf) and was strongly boosted by YOLO. SSD is a follow-up work that combines the idea with the region-proposal-network concept from [FastRCNN](https://arxiv.org/pdf/1504.08083v2.pdf) to trade off mAP against FPS.

In this article, I am going to share how I reimplemented YOLO with TensorFlow and what I have learned from the YOLO strategy. My work is inspired by many other works; please check these amazing works in the Reference section.

<iframe width="560" height="315"
src="https://www.youtube.com/watch?v=NM6lrxy0bxs" 
frameborder="0" 
allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" 
allowfullscreen></iframe>

### 1. Key Concept

- The key concept of YOLO-v1 can be divided into 2 parts:
  - the design of the encoded-output-tensor
  - the loss that makes this encoded-output-tensor trainable.

- The encoding process is kind of handcrafted and is **independent of** the convnet itself. We can think of YOLO-v1 as letting the model predict a representation of the BBox output that better fits the nature of object detection. But since the output tensor is rather large and sparse, a corresponding customized loss is proposed to handle it.


### 2. Design of Encoded-Output-Tensor

- The output tensor is composed of
  - a designed grid system
  - the BBox representation
  - the probability of each object class in each grid cell

#### What is the grid system in YOLO?
- YOLO divides the input image into an $S \times S$ grid. For example, with 448 as the image width/height and S=7, each grid cell has a resolution of (448/7, 448/7) = (64, 64) pixels. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. $C$ stands for the number of classes we are going to detect.

$$ \text{Grid Numbers} : S \times S \in \mathbb{R}^{1} $$
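
As a small illustration of the grid assignment, the sketch below maps an object's center (in pixels) to the grid cell responsible for it. The function name `responsible_cell` is hypothetical; the 448-pixel image size and S=7 are taken from the example above.

```python
def responsible_cell(cx, cy, img_w=448, img_h=448, S=7):
    """Return the (column, row) index of the grid cell containing (cx, cy)."""
    cell_w = img_w / S   # 64 pixels in the 448 / 7 example
    cell_h = img_h / S
    # clamp so a center lying exactly on the right/bottom border stays in range
    col = min(int(cx // cell_w), S - 1)
    row = min(int(cy // cell_h), S - 1)
    return col, row


print(responsible_cell(100, 300))  # (1, 4): 100 // 64 = 1 and 300 // 64 = 4
```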

#### What is the BBox representation?
- In a single grid cell, we predict $B$ BBoxes. Each BBox is represented by a 5-dim tensor (x, y, wid, height, confidence-score). At first glance, predicting multiple boxes in a single grid cell might look a little redundant. However, it actually empowers the convnet, and with a suitable selective loss function it lets the convnet generalize better.

$$ \text{BBox Numbers} : B \in \mathbb{R}^{5} $$
$$ B \equiv \{\ \text{x, y, wid, height, confidence score}\, \} $$
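
To make the 5-dim representation concrete, here is a minimal sketch of one possible encoding. It mirrors the `encode` method shown later in this article (x and y as offsets inside the cell, width and height as square roots of the normalized size), but the function names `encode_box` and `decode_box` are hypothetical and only meant for illustration.

```python
import math


def encode_box(cx, cy, w, h, img_w=448, img_h=448, S=7):
    """Pixel-space box -> (col, row, [x, y, sqrt_w, sqrt_h, confidence])."""
    gx, gy = cx / (img_w / S), cy / (img_h / S)   # center in grid units
    col, row = int(gx), int(gy)
    x, y = gx - col, gy - row                     # offset inside the cell, in [0, 1)
    sw, sh = math.sqrt(w / img_w), math.sqrt(h / img_h)
    return col, row, [x, y, sw, sh, 1.0]          # confidence target 1.0: an object is present


def decode_box(col, row, vec, img_w=448, img_h=448, S=7):
    """Inverse of encode_box: recover the pixel-space (cx, cy, w, h)."""
    x, y, sw, sh, _ = vec
    cx = (col + x) / S * img_w
    cy = (row + y) / S * img_h
    return cx, cy, sw ** 2 * img_w, sh ** 2 * img_h


col, row, vec = encode_box(100, 300, 128, 64)
print(col, row, vec)
print(decode_box(col, row, vec))   # approximately (100, 300, 128, 64)
```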

#### What is the Confidence-Score in the BBox representation?
- The confidence score reflects how confident the model is that the box contains an object and also how accurate it thinks the predicted box is. If **no object** exists in that cell, the **confidence score should be zero**. Otherwise, we want the confidence score to equal **the intersection over union (IOU)** between the predicted box and the ground truth.

$$
\Pr ( \text{Confidence-Score} ) =
\begin{cases}
1,& \text{ if there are objects } \\
0,& \text{ if no object exists }
\end{cases}
$$
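
Since the confidence target is tied to the IOU, a minimal sketch of the IOU computation may help. It assumes boxes are given as (center x, center y, width, height), the same convention as the BBox representation above; the function name `iou` and the sample boxes are only illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2

    # overlap along each axis, clipped at zero when the boxes do not touch
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


# two 100x100 boxes shifted by 50 pixels: overlap 50x100 = 5000, union 15000
print(iou((50, 50, 100, 100), (100, 50, 100, 100)))  # 1/3
```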

#### What is the probability of each object in each grid cell?
- It is similar to the traditional one-hot-encoding output; the difference is that we have one in every grid cell.

$$ \text{Number of Objects } : C \in \mathbb{R}^{1}$$

  
#### What is the size of this encoded tensor?
- Finally, the output of the YOLO backbone is an encoded tensor of size $S \times S \times (B \times 5 + C)$. Notice that it is much sparser than a traditional one-hot vector encoding because the tensor is much larger. We only predict one set of class probabilities per grid cell, regardless of the number of boxes $B$.
  

$$ \text{Input } : X \in \mathbb{R}^{448\times448\times3}$$
$$ \text{Output } : Y \in \mathbb{R}^{S\times S\times (B\times5+C)}$$
$$ Y = F_\text{cnn}(\text{ X }) $$
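
As a quick check of the size formula, the sketch below computes the output length for S=7, B=2, C=20 and splits a flat vector into its three parts. The flat layout used here (class probabilities, then confidences, then boxes) is an assumption that follows the `traDim` method shown later in this article; other implementations may order the parts differently.

```python
import numpy as np

S, B, C = 7, 2, 20                      # grid size, boxes per cell, classes
size = S * S * (B * 5 + C)
print(size)                             # 1470

# Assumed layout: [class probabilities | confidences | boxes]
y = np.arange(size, dtype=np.float32)   # stand-in for one network output
n_cls, n_conf = S * S * C, S * S * B
class_prob = y[:n_cls].reshape(S, S, C)
confidence = y[n_cls:n_cls + n_conf].reshape(S, S, B)
boxes = y[n_cls + n_conf:].reshape(S, S, B, 4)
print(class_prob.shape, confidence.shape, boxes.shape)  # (7, 7, 20) (7, 7, 2) (7, 7, 2, 4)
```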



### 3. Specific Loss Design

- The loss can be decomposed into 3 parts:
  - (1) sum-squared error of the BBox position and size information
  - (2) sum-squared error of the BBox object confidence-score
  - (3) sum-squared error of which class is in this grid cell.

#### What is the loss of BBox position and size information?

$$
\lambda_{coord}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}1_{ij}^{obj}\left((x_i - \hat{x_i})^2 + (y_i - \hat{y_i})^2\right)
$$

$$
+\lambda_{coord}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}1_{ij}^{obj}\left((\sqrt{w_i} - \sqrt{\hat{w_i}})^2 + (\sqrt{h_i} - \sqrt{\hat{h_i}})^2\right)
$$

#### What is the sum-squared error loss of the BBox object confidence-score?


$$
+\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}1_{ij}^{obj}(c_i - \hat{c_i})^2
$$

$$
+\lambda_{noobj}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}1_{ij}^{noobj}(c_i - \hat{c_i})^2
$$

#### What is the loss of which class is in this grid cell?

$$
+\sum\limits_{i =0}^{S^2}1_{i}^{obj}({\sum\limits_{c \in classes}(p_i(c)-\hat{p_i}(c))^2})
$$

#### Why are there some special coefficients?

The original authors use **sum-squared error** because it is easy to optimize; however, it does not perfectly align with our goal of maximizing average precision. It weights **localization error** equally with **classification error**, which may not be ideal.

In (1) and (2), the original authors set **λcoord = 5** and **λnoobj = 0.5** to increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects.

They want to remedy the model instability caused by the many grid cells that do not contain any object. These cells push the **“confidence” scores** towards zero, often overpowering the gradient from cells that do contain objects.

Notice that: YOLO predicts **multiple bounding boxes per grid cell.** At training time we **only want one bounding box predictor to be responsible for each object.** We assign one predictor to be “responsible” for predicting an object based on which prediction has the **highest current IOU with the ground truth.**  This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
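
The selection of the "responsible" predictor can be sketched with a plain `argmax` over the IOUs. The IOU values below are made up for illustration (3 grid cells, B = 2 predictors each); in training they would come from comparing each predicted box with the ground truth.

```python
import numpy as np

# hypothetical IOUs between each predictor's box and the ground truth
ious = np.array([[0.10, 0.65],
                 [0.40, 0.20],
                 [0.00, 0.00]])

responsible = np.argmax(ious, axis=1)        # best predictor index per cell
mask = np.eye(ious.shape[1])[responsible]    # one-hot mask of the selected predictor per cell
print(responsible)   # [1 0 0]
print(mask)
```

Note that `argmax` returns the first index on a tie, so the third cell (no overlap at all) falls back to predictor 0. Cells without any object still have to be masked out separately before this mask is used in the loss.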

#### Overall Math Expression 


$$
\lambda_{coord}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}1_{ij}^{obj}\left((x_i - \hat{x_i})^2 + (y_i - \hat{y_i})^2 + (\sqrt{w_i} - \sqrt{\hat{w_i}})^2 + (\sqrt{h_i} - \sqrt{\hat{h_i}})^2\right)
$$

$$
+\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}1_{ij}^{obj}(c_i - \hat{c_i})^2
$$

$$
+\lambda_{noobj}\sum\limits_{i=0}^{S^2}\sum\limits_{j=0}^{B}1_{ij}^{noobj}(c_i - \hat{c_i})^2
$$

$$
+\sum\limits_{i =0}^{S^2}1_{i}^{obj}({\sum\limits_{c \in classes}(p_i(c)-\hat{p_i}(c))^2})
$$

where $1^{obj}_{i}$ denotes whether an object appears in cell $i$, and $1^{obj}_{ij}$ denotes that the $j$-th bounding box predictor in cell $i$ is “responsible” for that prediction.
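
The four terms can be written down almost literally in NumPy. The following is a rough single-image sketch, not the TensorFlow implementation of this article: the function name `yolo_loss`, the argument layout, and the choice of taking $1^{noobj}_{ij}$ as the complement of $1^{obj}_{ij}$ are all assumptions made for illustration. Boxes are assumed to already store the square roots of width and height.

```python
import numpy as np


def yolo_loss(true_box, pred_box, true_conf, pred_conf, true_cls, pred_cls,
              resp, obj, l_coord=5.0, l_noobj=0.5):
    """
    true_box, pred_box   : (S*S, B, 4) as (x, y, sqrt_w, sqrt_h)
    true_conf, pred_conf : (S*S, B)
    true_cls, pred_cls   : (S*S, C)
    resp                 : (S*S, B) mask of the responsible predictor (1_ij^obj)
    obj                  : (S*S,)   1 where a cell contains an object (1_i^obj)
    """
    noobj = 1.0 - resp   # assumption: every non-responsible predictor counts as "noobj"
    coord = l_coord * np.sum(resp * np.sum((true_box - pred_box) ** 2, axis=-1))
    conf_obj = np.sum(resp * (true_conf - pred_conf) ** 2)
    conf_noobj = l_noobj * np.sum(noobj * (true_conf - pred_conf) ** 2)
    cls = np.sum(obj * np.sum((true_cls - pred_cls) ** 2, axis=-1))
    return coord + conf_obj + conf_noobj + cls


# toy example: one cell, B = 2 predictors, C = 3 classes
true_box = np.zeros((1, 2, 4))
pred_box = np.zeros((1, 2, 4))
pred_box[0, 0, 0] = 0.1                        # x of the responsible box is off by 0.1
true_conf = np.array([[1.0, 1.0]])
pred_conf = np.array([[0.5, 1.0]])             # responsible confidence is off by 0.5
true_cls = np.array([[1.0, 0.0, 0.0]])
pred_cls = np.array([[1.0, 0.0, 0.0]])
resp = np.array([[1.0, 0.0]])
obj = np.array([1.0])

# 5 * 0.1^2 + 0.5^2 = 0.05 + 0.25, so roughly 0.3
print(yolo_loss(true_box, pred_box, true_conf, pred_conf, true_cls, pred_cls, resp, obj))
```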


### 4. YOLO Loss Layer Implementation

The following is my partial implementation of YOLO-v1.

from __future__ import division
import numpy as np
from keras.engine import Layer
from keras.optimizers import RMSprop  # used by YoloDetector.train()
import tensorflow as tf

# =============================================================================
# encode : (Ground Truth Box | Image ) -> Ground Truth Y
# decode : Predict Tensor Y ->
# the encode and decode are symmetric in logic
# =============================================================================


class YoloDetector(Layer):
    '''
    Description : 
    this class inherits the keras Layer and provides connectivity with tensorflow;
    it implements the customized yolo-loss & encode & decode from the [paper]
    (https://pjreddie.com/media/files/papers/yolo.pdf) for multi-object recognition.
    
    Usage: 
      loss = YoloDetector.loss(true_y, pred_y, batch_size=batch_size) # tf-style slice must have same rank
      summary_op = get_summary_op(model, loss)
      ...
      _, lossN, summary_log = sess.run([train_step,loss,summary_op], feed_dict =
            {input_tensor : images_feed, true_y :labels_feed, K.learning_phase(): 0})
            
    '''
    def __init__(self, C=20, rImgW=448, rImgH=448, S=7, B=2, classMap=None):
        super(YoloDetector, self).__init__()  # initialise the keras Layer base class
        # C = number of classes
        self.S = S
        self.B = B
        self.C = C
        self.W = rImgW
        self.H = rImgH
        self.iou_threshold=0.1
        if classMap:
            self.classMap = classMap
        else :
            self.classMap  =  ["aeroplane", "bicycle", "bird", "boat", "bottle", 
                           "bus", "car", "cat", "chair", "cow", "diningtable",
                           "dog", "horse", "motorbike", "person", "pottedplant",
                           "sheep", "sofa", "train","tvmonitor"]

    def set_class_map(self, mappingList):
        assert type(mappingList)==list ; assert len(mappingList) == self.C
        self.classMap=mappingList

    def encode(self, annotations):
        ''' annotations : nested list of [classid, cX, cY, boxW, boxH],
            with the box center and size given in pixels of the resized image
        '''
        S, B, C, W, H = self.S, self.B, self.C, self.W, self.H

        # init
        classProb  = np.zeros([S, S, C   ])
        confidence = np.zeros([S, S, B   ])
        boxes      = np.zeros([S, S, B, 4])

        for classid, cX, cY, boxW, boxH in annotations:
            assert int(classid) <= int(C-1)

            # Target the center grid
            gridX, gridY = W/S, H/S
            tarIdX, tarIdY = int(cX/gridX) , int(cY/gridY)    

            # assign the true value
            classProb[tarIdX, tarIdY, int(classid)] = 1.0
            confidence[tarIdX, tarIdY, :      ] = 1.0    

            # x,y,w,h
            boxes[tarIdX, tarIdY, :, 0] = (cX/gridX) - int(cX/gridX)
            boxes[tarIdX, tarIdY, :, 1] = (cY/gridY) - int(cY/gridY)
            boxes[tarIdX, tarIdY, :, 2] = np.sqrt(boxW/W)
            boxes[tarIdX, tarIdY, :, 3] = np.sqrt(boxH/H)

        return np.concatenate([classProb.flatten(),confidence.flatten(),
                               boxes.flatten()])

    def traDim(self, pred, mode=3):
        ''' Dimension Transformation of Tensor '''
        S, B, C, W, H = self.S, self.B, self.C, self.W, self.H

        if mode == 3 :
            pred = np.array(pred)
            classProb  = np.reshape(pred[0:S*S*C]         , (S,S,C))
            confidence = np.reshape(pred[S*S*C: S*S*(C+B)], (S,S,B)) 
            boxes      = np.reshape(pred[S*S*(C+B):]      , (S,S,B,4))

        elif mode == 2 :
            classProb  = tf.reshape(pred[0:S*S*C]         , (S*S,C))
            confidence = tf.reshape(pred[S*S*C: S*S*(C+B)], (S*S,B)) 
            boxes      = tf.reshape(pred[S*S*(C+B):]      , (S*S,B,4))

        return classProb, confidence, boxes        

    def decode(self, prediction,threshold=8e-25 ,only_objectness=0):
        '''
        this part is modified from https://github.com/gliese581gg/YOLO_tensorflow
        '''
        S, B, C, W, H = self.S, self.B, self.C, self.W, self.H
        classProb ,confidence, boxes = self.traDim(prediction, mode =3)

        # offset (7,7,2) mask, retrieve from offset
        offset = np.transpose(np.reshape(np.array([np.arange(S)]*S*B),(B,S,S)),(1,2,0))
        boxes[:,:,:,1] += offset
        boxes[:,:,:,0] += np.transpose(offset,(1,0,2))
        boxes[:,:,:,0:2] = boxes[:,:,:,0:2] / float(S)

        # retrieve from sqrt
        boxes[:,:,:,2] = np.multiply(boxes[:,:,:,2],boxes[:,:,:,2])
        boxes[:,:,:,3] = np.multiply(boxes[:,:,:,3],boxes[:,:,:,3])

        # retrieve from normalization
        boxes[:,:,:,0] *= self.W ; boxes[:,:,:,1] *= self.H
        boxes[:,:,:,2] *= self.W ; boxes[:,:,:,3] *= self.H

        # Pr(class|Obj) * Pr(obj) = Evaluate Proba
        eProbs = np.zeros((S,S,B,C))
        for i in range(B):
            for j in range(C):
                eProbs[:,:,i,j]=np.multiply(classProb[:,:,j],confidence[:,:,i])

        # Filter
        filter_mat_probs = np.array(eProbs >= threshold,dtype='bool')
        filter_mat_boxes = np.nonzero(filter_mat_probs)

        boxes_filtered = boxes[filter_mat_boxes[0],filter_mat_boxes[1],filter_mat_boxes[2]]
        probs_filtered = eProbs[filter_mat_probs]
        classes_num_filtered = np.argmax(filter_mat_probs,axis=3)[filter_mat_boxes[0],filter_mat_boxes[1],filter_mat_boxes[2]]

        argsort = np.array(np.argsort(probs_filtered))[::-1]
        boxes_filtered = boxes_filtered[argsort]
        probs_filtered = probs_filtered[argsort]
        classes_num_filtered = classes_num_filtered[argsort]

        # select the best predicted box with an idea similar to nms
        # if 2 probs are identical (unlikely), the first one in sort order wins
        for i in range(len(boxes_filtered)):
            if probs_filtered[i] == 0 : continue
            for j in range(i+1,len(boxes_filtered)):
                if self.iou(boxes_filtered[i],boxes_filtered[j]) > self.iou_threshold :
                    probs_filtered[j] = 0.0

        filter_iou = np.array(probs_filtered>0.0,dtype='bool')
        boxes_filtered = boxes_filtered[filter_iou]
        probs_filtered = probs_filtered[filter_iou]
        classes_num_filtered = classes_num_filtered[filter_iou]

        result = []
        for i in range(len(boxes_filtered)):
            result.append([self.classMap[classes_num_filtered[i]],boxes_filtered[i][0],boxes_filtered[i][1],boxes_filtered[i][2],boxes_filtered[i][3],probs_filtered[i]])

        return result

    def loss(self, truY_, preY_, COORD=5. , NOOBJ=.5 , loss_=0 , batch_size=8):
        '''
        [description]
        - mini-batch optimization
        - the output of loss-function should >= 0
        - use avg-loss of batch-size
        - max-gradient clip with loss max =1000.
        '''
        S, B, C, W, H = self.S, self.B, self.C, self.W, self.H

        for batch in range(batch_size):
            truY = truY_[batch,:]
            preY = preY_[batch,:]

            truCP ,truConf, truB = self.traDim(truY, mode=2)
            preCP ,preConf, preB = self.traDim(preY, mode=2)    

            #print truCP    

            # Select the responsible box: the predictor with the max IOU
            iouT = self.iouTensor(truB,preB)    # iouT (7*7,2)
            iouT = tf.argmax(iouT, axis=1)      # (7*7)

            def slec_Box(raw , iouT):
                # flatten (S*S,B,4) once; box j of cell i starts at (i*B+j)*4
                raw = tf.reshape(raw,[-1])
                idx = tf.cast(tf.constant([0,1,2,3]), tf.int64)
                for i in range(S*S):
                    j = iouT[i]
                    idx_flattened = idx + (i*B+j)*4
                    yield tf.gather(raw, idx_flattened)

            def slec_conf(raw , iouT):
                # flatten (S*S,B) once; confidence j of cell i sits at i*B+j
                raw = tf.reshape(raw,[-1])
                idx = tf.cast(tf.constant([0]), tf.int64)
                for i in range(S*S):
                    j = iouT[i]
                    idx_flattened = idx + (i*B+j)
                    yield tf.gather(raw, idx_flattened)

            # tf.stack replaces the older tf.pack
            # https://github.com/tensorflow/tensorflow/issues/206
            truB    = tf.stack([ a for a in slec_Box (truB    , iouT)] )  # (S*S,4)
            preB    = tf.stack([ a for a in slec_Box (preB    , iouT)] )  # (S*S,4)
            truConf = tf.stack([ a for a in slec_conf(truConf , iouT)] )  # (S*S,1)
            preConf = tf.stack([ a for a in slec_conf(preConf , iouT)] )  # (S*S,1)

            # Obj or noobj actually only depends on the truth
            # truCP = (S*S,C)
            def max_tf(raw):
                for i in range(S*S):
                    yield tf.reduce_max(raw[i,:])

            objMask  = tf.stack([ a for a in max_tf(truCP) ])              # (S*S)
            nobjMask = 1 - objMask

            # reduce every term to shape (S*S) before applying the (S*S) masks
            loss_ += tf.reduce_sum(tf.reduce_sum(tf.pow(truB-preB, 2), 1)       * objMask  ) * COORD
            loss_ += tf.reduce_sum(tf.reduce_sum(tf.pow(truConf-preConf, 2), 1) * objMask  )
            loss_ += tf.reduce_sum(tf.reduce_sum(tf.pow(truConf-preConf, 2), 1) * nobjMask ) * NOOBJ
            loss_ += tf.reduce_sum(tf.reduce_sum(tf.pow(truCP-preCP, 2), 1)     * objMask  )
        # clip the batch-averaged loss at 1000. and keep it a tensor
        return tf.minimum(loss_ / batch_size, 1000.)

    def boxArea(self, box):
        return box[:,:,2]*box[:,:,3]

    def iouTensor(self, box1, box2):
        S, B, C, W, H = self.S, self.B, self.C, self.W, self.H
        assert box1.get_shape() == box2.get_shape() == (S*S,B,4)
        
        minTop = tf.minimum(box1[:,:,0]+0.5*box1[:,:,2],
                      box2[:,:,0]+0.5*box2[:,:,2])
        maxBot = tf.maximum(box1[:,:,0]-0.5*box1[:,:,2],
                      box2[:,:,0]-0.5*box2[:,:,2])
        minR   = tf.minimum(box1[:,:,1]+0.5*box1[:,:,3],
                      box2[:,:,1]+0.5*box2[:,:,3])
        maxL   = tf.maximum(box1[:,:,1]-0.5*box1[:,:,3],
                      box2[:,:,1]-0.5*box2[:,:,3])
        # intersection

        #tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)

        inters = tf.clip_by_value( minTop-maxBot, clip_value_min=0, clip_value_max=999)* \
                 tf.clip_by_value( minR-maxL    , clip_value_min=0, clip_value_max=999)
        noZero = 0.000000001 # small constant to avoid division by zero in the IOU
        return inters/ (self.boxArea(box1)+ self.boxArea(box2)- inters+ noZero)

    def iou(self,box1,box2):
        tb = min(box1[0]+0.5*box1[2],
                 box2[0]+0.5*box2[2])-\
             max(box1[0]-0.5*box1[2],
                 box2[0]-0.5*box2[2])
        lr = min(box1[1]+0.5*box1[3],
                 box2[1]+0.5*box2[3])-\
             max(box1[1]-0.5*box1[3],
                 box2[1]-0.5*box2[3])
        if tb < 0 or lr < 0 : intersection = 0
        else : intersection =  tb*lr
        return intersection/ (box1[2]*box1[3] + box2[2]*box2[3] -intersection)

    def train(self, ):
        model = self.build()
        loss = self.loss
        model.compile(optimizer=RMSprop(lr=0.001), loss=loss, metrics=['accuracy'])

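
The suppression loop at the end of `decode` can be isolated into a small standalone sketch. It is class-agnostic and greedy, like the loop above: boxes are visited from the highest score down, and a box is kept only if it does not overlap an already kept box by more than the threshold. The names `greedy_nms` and `_iou` are hypothetical, the sample boxes and scores are made up, and the 0.1 default simply copies `iou_threshold` from the class.

```python
def _iou(a, b):
    """IOU of two boxes given as (cx, cy, w, h)."""
    w = min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2)
    h = min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2)
    inter = max(0.0, w) * max(0.0, h)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.1):
    """Return the indices of the boxes that survive suppression, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep box i only if it does not overlap any already kept box too much
        if all(_iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(50, 50, 100, 100), (55, 50, 100, 100), (300, 300, 50, 50)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 and is dropped
```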


### Reference
- https://github.com/gliese581gg/YOLO_tensorflow
- http://guanghan.info/blog/en/my-works/train-yolo/
- https://github.com/thtrieu/yolotf
- https://github.com/nilboy/tensorflow-yolo
- https://github.com/gliese581gg/YOLO_tensorflow/blob/master/YOLO_tiny_tf.py