<a href="https://colab.research.google.com/github/ProtossDragoon/CoMoLab/blob/master/CV/Yolov1Keras/Yolov1_Keras_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Yolov1 Model 

In this notebook I implement YOLOv1 as described in the paper "You Only Look Once". The goal is to replicate the model from the paper and, in the process, understand the nuances of using Keras on a complex problem.

- Original source of this notebook : https://www.maskaravivek.com/post/yolov1/
- A discussion of the YOLO loss : https://brunch.co.kr/@kmbmjn95/35#comment
- A thorough walkthrough of YOLO and one-stage detection : http://machinethink.net/blog/object-detection/

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
import tensorflow as tf
import matplotlib.pyplot as plt

## Data Preprocessing

I use the VOC 2007 dataset because its size is manageable, which makes it easy to run on Google Colab.

First, I download and extract the dataset.

*Once you have downloaded the data below, you do not need to download it again. Note that the download takes a very long time.

In [None]:
%cd /content/gdrive/"My Drive"/
!rm -r temp
!mkdir temp
%cd temp


!wget http://pjreddie.com/media/files/VOCtrainval_06-Nov-2007.tar
!wget http://pjreddie.com/media/files/VOCtest_06-Nov-2007.tar

!tar xvf VOCtrainval_06-Nov-2007.tar # extract the tar file into the current directory
!tar xvf VOCtest_06-Nov-2007.tar

!rm VOCtrainval_06-Nov-2007.tar
!rm VOCtest_06-Nov-2007.tar

In [None]:
%cd /content/gdrive/"My Drive"/temp/VOCdevkit
!ls


Next, we process the annotations and write the labels in a text file. A text file is easier to consume as compared to XML.

In [None]:
%cd /content/gdrive/"My Drive"/temp

import argparse
import xml.etree.ElementTree as ET
import os

parser = argparse.ArgumentParser(description='Build Annotations.')
parser.add_argument('dir', default='..', help='Annotations.')

sets = [('2007', 'train'), ('2007', 'val'), ('2007', 'test')]

# The classes we are interested in and their corresponding ids
classes_num = {'aeroplane': 0, 'bicycle': 1, 'bird': 2, 'boat': 3, 'bottle': 4, 'bus': 5,
               'car': 6, 'cat': 7, 'chair': 8, 'cow': 9, 'diningtable': 10, 'dog': 11,
               'horse': 12, 'motorbike': 13, 'person': 14, 'pottedplant': 15, 'sheep': 16,
               'sofa': 17, 'train': 18, 'tvmonitor': 19}


def convert_annotation(year, image_id, f):
    in_file = os.path.join('VOCdevkit/VOC%s/Annotations/%s.xml' % (year, image_id))
    tree = ET.parse(in_file)
    root = tree.getroot()

    for obj in root.iter('object'): # Python iterators, see: https://python.bakyeono.net/chapter-7-4.html 
                                    # ElementTree iter(), see: https://docs.python.org/2/library/xml.etree.elementtree.html#finding-interesting-elements
        difficult = obj.find('difficult').text # VOC flag: 1 marks an object that is hard to recognize
        cls = obj.find('name').text
        classes = list(classes_num.keys())
        if cls not in classes or int(difficult) == 1: # skip classes we are not interested in, and difficult objects
            continue
            continue
        cls_id = classes.index(cls)
        xmlbox = obj.find('bndbox')
        b = (int(xmlbox.find('xmin').text), int(xmlbox.find('ymin').text),
             int(xmlbox.find('xmax').text), int(xmlbox.find('ymax').text))
        f.write(' ' + ','.join([str(a) for a in b]) + ',' + str(cls_id)) # join(), see: https://zetawiki.com/wiki/%ED%8C%8C%EC%9D%B4%EC%8D%AC_join()
        # Format this function writes to the file:
        # " xmin,ymin,xmax,ymax,1 xmin,ymin,xmax,ymax,3 (... one entry per object)"


for year, image_set in sets:
  print(year, image_set)
  with open(os.path.join('VOCdevkit/VOC%s/ImageSets/Main/%s.txt' % (year, image_set)), 'r') as f: # Python context managers, see: https://sjquant.tistory.com/12
      image_ids = f.read().strip().split() # file I/O, see: https://wikidocs.net/26
  with open(os.path.join("VOCdevkit", '%s_%s.txt' % (year, image_set)), 'w') as f:
      for image_id in image_ids:
          f.write('%s/VOC%s/JPEGImages/%s.jpg' % ("VOCdevkit", year, image_id))
          print('%s/VOC%s/JPEGImages/%s.jpg' % ("VOCdevkit", year, image_id))
          convert_annotation(year, image_id, f)
          f.write('\n')
          # Format written to the file on each loop iteration:
          # "VOCdevkit/VOC2007/JPEGImages/<image name>.jpg xmin,ymin,xmax,ymax,1 xmin,ymin,xmax,ymax,3 (... one entry per object)"

In [None]:
!cat VOCdevkit/2007_val.txt
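
To make the conversion step concrete, here is a minimal sketch that applies the same filtering and formatting to an in-memory XML string. The annotation text and the two-entry `class_ids` dictionary are made up for illustration; real VOC files contain more fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed annotation in the VOC XML layout.
xml_text = """
<annotation>
  <object>
    <name>dog</name><difficult>0</difficult>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name><difficult>1</difficult>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>498</ymax></bndbox>
  </object>
</annotation>
"""

class_ids = {'dog': 11, 'person': 14}  # subset of classes_num, for illustration only
fields = []
for obj in ET.fromstring(xml_text).iter('object'):
    name = obj.find('name').text
    if name not in class_ids or int(obj.find('difficult').text) == 1:
        continue  # skipped, like the difficult 'person' object here
    box = obj.find('bndbox')
    coords = [box.find(tag).text for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
    fields.append(','.join(coords + [str(class_ids[name])]))

print(' '.join(fields))
```

Only the non-difficult `dog` object survives, so the line for this image would carry a single `xmin,ymin,xmax,ymax,class` entry.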

## Training the model

Next, I am adding a function to prepare the input and the output. The input is a (448, 448, 3) image and the output is a (7, 7, 30) tensor. The output shape follows S x S x (B * 5 + C), where:

- S x S is the number of grid cells
- B is the number of bounding boxes per grid cell
- C is the number of classes
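
As a quick sanity check of the shape formula, the arithmetic can be written out directly. The values S = 7, B = 2 and C = 20 are the ones used throughout this notebook.

```python
# Grid size, boxes per cell, and number of classes used in this notebook.
S, B, C = 7, 2, 20

values_per_cell = B * 5 + C          # each box carries x, y, w, h and a confidence
output_shape = (S, S, values_per_cell)
flat_size = S * S * values_per_cell  # number of units in the final Dense layer

print(output_shape)
print(flat_size)
```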

In [None]:
import cv2
import numpy as np

def read(image_path, label):
    # read() is called once per image path.
    # label : [(xmin,ymin,xmax,ymax,label), (xmin,ymin,xmax,ymax,label), ... , one per object in the image]

    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image_h, image_w = image.shape[0:2]
    image = cv2.resize(image, (448, 448))
    image = image / 255.
    # image shape : [448, 448, 3], dtype : float64, scale : 0~1

    label_matrix = np.zeros([7, 7, 30])
    for l in label:
        l = l.split(',')
        l = np.array(l, dtype=int)  # np.int was removed in recent NumPy releases

        # bbox parsing
        xmin = l[0]
        ymin = l[1]
        xmax = l[2]
        ymax = l[3]

        # class
        cls = l[4]

        # bbox center, scale : 0~1
        x = (xmin + xmax) / 2 / image_w
        y = (ymin + ymax) / 2 / image_h

        # bbox width and height, scale : 0~1
        w = (xmax - xmin) / image_w
        h = (ymax - ymin) / image_h

        # map the bbox center onto the 7x7 grid
        loc = [7 * x, 7 * y]
        loc_i = int(loc[1])
        loc_j = int(loc[0])

        # position of the bbox center inside its grid cell, scale : 0~1
        y = loc[1] - loc_i
        x = loc[0] - loc_j

        # [0:20] : class score
        # [20:24] : x, y, w, h
        # [24] : confidence
        if label_matrix[loc_i, loc_j, 24] == 0:
            label_matrix[loc_i, loc_j, cls] = 1
            label_matrix[loc_i, loc_j, 20:24] = [x, y, w, h]
            label_matrix[loc_i, loc_j, 24] = 1  # response

    return image, label_matrix
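
The grid-cell encoding inside `read()` can be tried on a single box without loading an image. This is a minimal sketch; `encode_box` is a hypothetical helper name and the image size and box coordinates are made up.

```python
def encode_box(xmin, ymin, xmax, ymax, image_w, image_h, S=7):
    """Return (row, col, x_offset, y_offset, w, h) for one ground-truth box."""
    # Box center and size, normalized to [0, 1] by the image dimensions.
    cx = (xmin + xmax) / 2 / image_w
    cy = (ymin + ymax) / 2 / image_h
    w = (xmax - xmin) / image_w
    h = (ymax - ymin) / image_h

    # Grid cell that contains the center (row from y, column from x).
    col = int(S * cx)
    row = int(S * cy)

    # Offset of the center inside that cell, again on a 0~1 scale.
    return row, col, S * cx - col, S * cy - row, w, h

# Hypothetical 500x375 image with a box spanning (100, 50) to (300, 250).
row, col, x_off, y_off, w, h = encode_box(100, 50, 300, 250, image_w=500, image_h=375)
print(row, col, round(x_off, 3), round(y_off, 3), round(w, 3), round(h, 3))
```

The center lands in cell (2, 2), and only the fractional position inside that cell is stored alongside the normalized width and height.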

Next, I am defining a custom generator that returns a batch of inputs and outputs.

<br>

### tf.keras.utils.Sequence

Base object for fitting to a sequence of data, such as a dataset. 


Every Sequence must implement the \_\_getitem\_\_ and the \_\_len\_\_ methods. If you want to modify your dataset between epochs you may implement on\_epoch\_end. The method \_\_getitem\_\_ should return a complete batch.

Notes:
Sequences are a safer way to do multiprocessing. This structure guarantees that the network will only train once on each sample per epoch, which is not the case with generators.

<br>

https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence

In [None]:
class My_Custom_Generator(tf.keras.utils.Sequence) :
  
  def __init__(self, images, labels, batch_size) :
    self.images = images
    self.labels = labels
    self.batch_size = batch_size
    
    
  def __len__(self) :
    # Returns the number of batches per epoch.
    # np.ceil rounds up, so a final partial batch is counted too.
    return int(np.ceil(len(self.images) / float(self.batch_size)))
  
  
  def __getitem__(self, idx) :
    # batch_x and batch_y are the image paths and the labels for this batch
    batch_x = self.images[idx * self.batch_size : (idx+1) * self.batch_size] # e.g. [./img31, ./img32, ... , batch_size items]
    batch_y = self.labels[idx * self.batch_size : (idx+1) * self.batch_size] # e.g. [15(xmin),159(ymin),75(xmax),170(ymax),13(label), ... , batch_size items]

    train_image = []
    train_label = []

    for i in range(0, len(batch_x)):
      img_path = batch_x[i]
      label = batch_y[i]
      image, label_matrix = read(img_path, label)
      train_image.append(image)
      train_label.append(label_matrix)
    return np.array(train_image), np.array(train_label)
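
The indexing that `__len__` and `__getitem__` rely on can be checked in isolation. The sketch below uses plain Python and a hypothetical helper, `batch_slices`, to show which samples each batch index covers.

```python
import math

def batch_slices(num_samples, batch_size):
    """Yield the (start, stop) index pair that each batch index maps to."""
    num_batches = math.ceil(num_samples / batch_size)  # same count as __len__
    for idx in range(num_batches):
        start = idx * batch_size
        stop = min((idx + 1) * batch_size, num_samples)  # list slicing clips the same way
        yield start, stop

# Hypothetical dataset of 10 samples with batch_size 4: the last batch is short.
slices = list(batch_slices(10, 4))
print(slices)
```

Because the count is rounded up, the final batch may hold fewer than `batch_size` samples, and every sample is visited exactly once per epoch.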

The code snippet below prepares arrays with inputs and outputs. We then create instances of the generator for our training and validation sets.

In [None]:
%cd /content/gdrive/"My Drive"/temp

train_datasets = []
val_datasets = []

with open(os.path.join("VOCdevkit", '2007_train.txt'), 'r') as f:
    train_datasets = train_datasets + f.readlines() # list + list = list
with open(os.path.join("VOCdevkit", '2007_val.txt'), 'r') as f:
    val_datasets = val_datasets + f.readlines()

# train_datasets : ["VOCdevkit/VOC2007/JPEGImages/009870.jpg 272,70,466,290,11 26,43,315,276,11", .... , ""]

X_train = []
Y_train = []

X_val = []
Y_val = []

for item in train_datasets:
  item = item.replace("\n", "").split(" ") # "hello world\n" -> ["hello", "world"]
  X_train.append(item[0])
  arr = []
  for i in range(1, len(item)): # [1:len(item)] : the ground-truth bboxes, one by one
    arr.append(item[i])
  Y_train.append(arr)

for item in val_datasets:
  item = item.replace("\n", "").split(" ") 
  X_val.append(item[0])
  arr = []
  for i in range(1, len(item)):
    arr.append(item[i])
  Y_val.append(arr)


batch_size = 4
my_training_batch_generator = My_Custom_Generator(X_train, Y_train, batch_size)
my_validation_batch_generator = My_Custom_Generator(X_val, Y_val, batch_size)

x_train, y_train = my_training_batch_generator[0]
x_val, y_val = my_validation_batch_generator[0]  # use the validation generator here

print(x_train.shape)
print(y_train.shape)

print(x_val.shape)
print(y_val.shape)
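
The per-line parsing done in the loops above can be condensed into one small function. This is a sketch with a hypothetical helper name, `parse_annotation_line`, applied to the sample line shown in the comment of the cell above.

```python
def parse_annotation_line(line):
    """Split one line into the image path and a list of 'xmin,ymin,xmax,ymax,cls' strings."""
    parts = line.strip().split(" ")
    return parts[0], parts[1:]

line = "VOCdevkit/VOC2007/JPEGImages/009870.jpg 272,70,466,290,11 26,43,315,276,11\n"
path, boxes = parse_annotation_line(line)
print(path)
print(boxes)
```

An image without any kept objects produces a line that holds only the path, in which case the box list is empty.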

## Define a custom output layer

We need to reshape the output from the model so we define a custom Keras layer for it.

### tf.keras.layers.Layer

This is the class from which all layers inherit.

Inherits From: Module

According to the API docs, call() is designed to be invoked internally by \_\_call\_\_().

- Reference : https://www.tensorflow.org/api_docs/python/tf/keras/layers/Layer
- Reference : https://jinmay.github.io/2019/12/03/python/python-callable/
- Reference : https://www.programiz.com/python-programming/methods/dictionary/update

In [None]:
from tensorflow.keras import backend as K  # use the tf.keras backend so it matches the tf.keras layers

class Yolo_Reshape(tf.keras.layers.Layer):
  def __init__(self, target_shape, **kwargs):
    # Forward name, dtype, etc. to the base Layer so that get_config() round-trips.
    super(Yolo_Reshape, self).__init__(**kwargs)
    self.target_shape = tuple(target_shape)

  def get_config(self):
    config = super().get_config().copy()
    # config return datatype : python dictionary
    config.update({
        'target_shape': self.target_shape
    })
    # dict.update()
    return config

  def call(self, input):
    # grids 7x7
    S = [self.target_shape[0], self.target_shape[1]]
    # classes
    C = 20
    # no of bounding boxes per grid
    B = 2

    idx1 = S[0] * S[1] * C
    idx2 = idx1 + S[0] * S[1] * B
    
    # class probabilities
    class_probs = K.reshape(input[:, :idx1], (K.shape(input)[0],) + tuple([S[0], S[1], C])) # (batch, S*S*C) -> (batch, S, S, C)
    class_probs = K.softmax(class_probs)

    # confidence
    confs = K.reshape(input[:, idx1:idx2], (K.shape(input)[0],) + tuple([S[0], S[1], B])) # (batch, S*S*B) -> (batch, S, S, B)
    confs = K.sigmoid(confs)

    # boxes
    boxes = K.reshape(input[:, idx2:], (K.shape(input)[0],) + tuple([S[0], S[1], B * 4])) # (batch, S*S*B*4) -> (batch, S, S, B*4)
    boxes = K.sigmoid(boxes)

    outputs = K.concatenate([class_probs, confs, boxes]) # finally, (batch, S*S*(C+B*5)) -> (batch, S, S, C+B*5)
    return outputs
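
The slicing logic of the layer can be reproduced with NumPy to see where each block of the flat output ends up. This sketch covers only the split, reshape and concatenate steps; the softmax and sigmoid activations are left out, and the input array is a stand-in.

```python
import numpy as np

S, C, B = 7, 20, 2
idx1 = S * S * C          # end of the class-probability block
idx2 = idx1 + S * S * B   # end of the confidence block

# Stand-in for one batch of flat network outputs (2 samples, 1470 values each).
flat = np.arange(2 * S * S * (C + B * 5), dtype=np.float32).reshape(2, -1)

class_probs = flat[:, :idx1].reshape(-1, S, S, C)
confs = flat[:, idx1:idx2].reshape(-1, S, S, B)
boxes = flat[:, idx2:].reshape(-1, S, S, B * 4)

out = np.concatenate([class_probs, confs, boxes], axis=-1)
print(idx1, idx2, out.shape)
```

The first 980 values become class scores, the next 98 become confidences, and the remaining 392 become box coordinates, which together give 30 channels per grid cell.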

## Defining the YOLO model

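Next, we define the model, following the architecture in the original paper in broad strokes. Note that this version departs from the paper in a few places: the first convolution uses stride 1 rather than 2, the last two convolutions use the default 'valid' padding, and the fully connected head uses Dense(512) and Dense(1024) layers instead of a single 4096-unit layer.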

In [None]:
from tensorflow.keras.layers import Dense, InputLayer, Dropout, Flatten, Reshape, LeakyReLU
from tensorflow.keras.layers import Conv2D, MaxPooling2D, GlobalMaxPooling2D
from tensorflow.keras.regularizers import l2

grid_w=7
grid_h=7
cell_w=64
cell_h=64
img_w=grid_w*cell_w
img_h=grid_h*cell_h

model = tf.keras.models.Sequential()
model.add(Conv2D(filters=64, kernel_size= (7, 7), strides=(1, 1), input_shape =(img_h, img_w, 3), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding = 'same'))

model.add(Conv2D(filters=192, kernel_size= (3, 3), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding = 'same'))

model.add(Conv2D(filters=128, kernel_size= (1, 1), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=256, kernel_size= (3, 3), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=256, kernel_size= (1, 1), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=512, kernel_size= (3, 3), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding = 'same'))

model.add(Conv2D(filters=256, kernel_size= (1, 1), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=512, kernel_size= (3, 3), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=256, kernel_size= (1, 1), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=512, kernel_size= (3, 3), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=256, kernel_size= (1, 1), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=512, kernel_size= (3, 3), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=256, kernel_size= (1, 1), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=512, kernel_size= (3, 3), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=512, kernel_size= (1, 1), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=1024, kernel_size= (3, 3), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding = 'same'))

model.add(Conv2D(filters=512, kernel_size= (1, 1), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=1024, kernel_size= (3, 3), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=512, kernel_size= (1, 1), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=1024, kernel_size= (3, 3), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=1024, kernel_size= (3, 3), padding = 'same', activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=1024, kernel_size= (3, 3), strides=(2, 2), padding = 'same'))

model.add(Conv2D(filters=1024, kernel_size= (3, 3), activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))
model.add(Conv2D(filters=1024, kernel_size= (3, 3), activation=LeakyReLU(alpha=0.1), kernel_regularizer=l2(5e-4)))

model.add(Flatten())
model.add(Dense(512))
model.add(Dense(1024))
model.add(Dropout(0.5))
model.add(Dense(1470, activation='sigmoid'))
model.add(Yolo_Reshape(target_shape=(7,7,30)))
model.summary()
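
To follow how the spatial size changes through the network, the sketch below traces it by hand using the usual output-size rules for 'same' and 'valid' padding. The helper names are hypothetical, and the output of `model.summary()` remains the authoritative reference.

```python
import math

def same_out(n, stride=1):
    """Output size for padding='same': ceil(n / stride)."""
    return math.ceil(n / stride)

def valid_out(n, kernel, stride=1):
    """Output size for padding='valid' (the Keras default)."""
    return (n - kernel) // stride + 1

size = 448
size = same_out(size, 1)       # first 7x7 conv, stride 1
for _ in range(4):             # four max-pooling layers, each with stride 2
    size = same_out(size, 2)   # the 'same' convs in between keep the size unchanged
size = same_out(size, 2)       # 3x3 conv with stride 2
size = valid_out(size, 3)      # 3x3 conv, default 'valid' padding
size = valid_out(size, 3)      # 3x3 conv, default 'valid' padding

flatten_units = size * size * 1024
print(size, flatten_units)
```

Under these rules the feature map entering `Flatten()` is 10 x 10 x 1024, not the 7 x 7 x 1024 map of the paper, which is a consequence of the stride and padding choices in the cell above.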

## Define a custom learning rate scheduler

The paper uses different learning rates for different epochs. So we define a custom Callback function for the learning rate.

<br>

hasattr()

- Checks whether an object has a given attribute.
- Returns True if the object passed as an argument has an attribute with the given name, and False otherwise.
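
A minimal illustration of `hasattr()`, using a made-up class as a stand-in for an optimizer (it is not a Keras class):

```python
class DummyOptimizer:
    """Hypothetical stand-in for an optimizer; not part of Keras."""
    def __init__(self):
        self.learning_rate = 0.01

opt = DummyOptimizer()
has_lr = hasattr(opt, "learning_rate")   # attribute exists
has_momentum = hasattr(opt, "momentum")  # attribute was never set

print(has_lr, has_momentum)
```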

<br>

Attributes of tf.keras.callbacks.Callback

- params (Attribute) : Dict. Training parameters (e.g. verbosity, batch size, number of epochs...).
- model (Attribute) : Instance of keras.models.Model. **Reference** of the model being trained.

<br>

*Note: you can write the scheduler by hand as in this source code, or try https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LearningRateScheduler instead.

In [None]:
class CustomLearningRateScheduler(tf.keras.callbacks.Callback):
    """Learning rate scheduler which sets the learning rate according to schedule.

    Arguments:
        schedule: a function that takes an epoch index
            (integer, indexed from 0) and current learning rate
            as inputs and returns a new learning rate as output (float).
    """

    def __init__(self, schedule):
        super(CustomLearningRateScheduler, self).__init__()
        self.schedule = schedule

    def on_epoch_begin(self, epoch, logs=None): 
        if not hasattr(self.model.optimizer, "learning_rate"): # https://technote.kr/251
            raise ValueError('Optimizer must have a "learning_rate" attribute.')
        # Get the current learning rate from model's optimizer.
        lr = float(K.get_value(self.model.optimizer.learning_rate))
        # Call schedule function to get the scheduled learning rate.
        scheduled_lr = self.schedule(epoch, lr)
        # Set the value back to the optimizer before this epoch starts
        K.set_value(self.model.optimizer.learning_rate, scheduled_lr)
        print("\nEpoch %05d: Learning rate is %6.4f." % (epoch, scheduled_lr))


LR_SCHEDULE = [
    # (epoch to start, learning rate) tuples
    (0, 0.01),
    (75, 0.001),
    (105, 0.0001),
]


def lr_schedule(epoch, lr):
    """Helper function to retrieve the scheduled learning rate based on epoch."""
    if epoch < LR_SCHEDULE[0][0] or epoch > LR_SCHEDULE[-1][0]:
        return lr # keep the current learning rate
    for i in range(len(LR_SCHEDULE)):
        if epoch == LR_SCHEDULE[i][0]:
            # switch to the learning rate scheduled for this epoch
            return LR_SCHEDULE[i][1]
    return lr




# The callbacks will be wired in shortly, with code like the following.
'''
callbacks=[
              CustomLearningRateScheduler(lr_schedule),
              mcp_save
          ])
'''
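
To see what the schedule does over a full run, the helper can be exercised without a model. The sketch below uses a condensed re-statement of `lr_schedule` with the same boundary epochs; the 135-epoch loop and the starting value are assumptions chosen only for the demonstration.

```python
LR_SCHEDULE = [(0, 0.01), (75, 0.001), (105, 0.0001)]

def lr_schedule(epoch, lr):
    """Return the new rate at a boundary epoch, otherwise keep the current one."""
    for start_epoch, new_lr in LR_SCHEDULE:
        if epoch == start_epoch:
            return new_lr
    return lr

lr = 0.1  # arbitrary starting value; it is overwritten at epoch 0
history = {}
for epoch in range(135):
    lr = lr_schedule(epoch, lr)
    history[epoch] = lr

print(history[0], history[74], history[75], history[105], history[134])
```

The rate changes only at epochs 0, 75 and 105, and is carried over unchanged in between.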

## Define the loss function

Next, we define a custom loss function to be used in the model. Take a look at this blog post to understand more about the loss function used in YOLO.

I understood the loss function but didn’t implement it on my own; I took the implementation as is from this GitHub repo.
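
For reference, the sum-squared-error loss from the YOLO paper can be written as below, with $\lambda_{\text{coord}} = 5$ and $\lambda_{\text{noobj}} = 0.5$. These are the constants 5 and 0.5 that appear in the implementation. The implementation is not a term-by-term transcription: for example, it uses a confidence target of 1 for the responsible box and rescales coordinates by 448.

$$
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 \\
& + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$

Here $\mathbb{1}_{ij}^{\text{obj}}$ is 1 when box $j$ in cell $i$ is responsible for an object, $\mathbb{1}_{i}^{\text{obj}}$ is 1 when cell $i$ contains an object, $C$ is the confidence, and $p_i(c)$ is the class probability.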

In [None]:
def xywh2minmax(xy, wh):
    xy_min = xy - wh / 2
    xy_max = xy + wh / 2

    return xy_min, xy_max


def iou(pred_mins, pred_maxes, true_mins, true_maxes):
    intersect_mins = K.maximum(pred_mins, true_mins)
    intersect_maxes = K.minimum(pred_maxes, true_maxes)
    intersect_wh = K.maximum(intersect_maxes - intersect_mins, 0.)
    intersect_areas = intersect_wh[..., 0] * intersect_wh[..., 1]

    pred_wh = pred_maxes - pred_mins
    true_wh = true_maxes - true_mins
    pred_areas = pred_wh[..., 0] * pred_wh[..., 1]
    true_areas = true_wh[..., 0] * true_wh[..., 1]

    union_areas = pred_areas + true_areas - intersect_areas
    iou_scores = intersect_areas / union_areas

    return iou_scores


def yolo_head(feats):
    # Dynamic implementation of conv dims for fully convolutional model.
    conv_dims = K.shape(feats)[1:3]  # assuming channels last
    # In YOLO the height index is the inner most iteration.
    conv_height_index = K.arange(0, stop=conv_dims[0])
    conv_width_index = K.arange(0, stop=conv_dims[1])
    conv_height_index = K.tile(conv_height_index, [conv_dims[1]])

    # TODO: Repeat_elements and tf.split doesn't support dynamic splits.
    # conv_width_index = K.repeat_elements(conv_width_index, conv_dims[1], axis=0)
    conv_width_index = K.tile(
        K.expand_dims(conv_width_index, 0), [conv_dims[0], 1])
    conv_width_index = K.flatten(K.transpose(conv_width_index))
    conv_index = K.transpose(K.stack([conv_height_index, conv_width_index]))
    conv_index = K.reshape(conv_index, [1, conv_dims[0], conv_dims[1], 1, 2])
    conv_index = K.cast(conv_index, K.dtype(feats))

    conv_dims = K.cast(K.reshape(conv_dims, [1, 1, 1, 1, 2]), K.dtype(feats))

    box_xy = (feats[..., :2] + conv_index) / conv_dims * 448
    box_wh = feats[..., 2:4] * 448

    return box_xy, box_wh


def yolo_loss(y_true, y_pred):
    label_class = y_true[..., :20]   # ? * 7 * 7 * 20
    label_box = y_true[..., 20:24]   # ? * 7 * 7 * 4
    response_mask = y_true[..., 24]  # ? * 7 * 7
    response_mask = K.expand_dims(response_mask)  # ? * 7 * 7 * 1

    predict_class = y_pred[..., :20]    # ? * 7 * 7 * 20
    predict_trust = y_pred[..., 20:22]  # ? * 7 * 7 * 2
    predict_box = y_pred[..., 22:]      # ? * 7 * 7 * 8

    _label_box = K.reshape(label_box, [-1, 7, 7, 1, 4])
    _predict_box = K.reshape(predict_box, [-1, 7, 7, 2, 4])

    label_xy, label_wh = yolo_head(_label_box)  # ? * 7 * 7 * 1 * 2, ? * 7 * 7 * 1 * 2
    label_xy = K.expand_dims(label_xy, 3)       # ? * 7 * 7 * 1 * 1 * 2
    label_wh = K.expand_dims(label_wh, 3)       # ? * 7 * 7 * 1 * 1 * 2
    label_xy_min, label_xy_max = xywh2minmax(label_xy, label_wh)  # ? * 7 * 7 * 1 * 1 * 2, ? * 7 * 7 * 1 * 1 * 2

    predict_xy, predict_wh = yolo_head(_predict_box)  # ? * 7 * 7 * 2 * 2, ? * 7 * 7 * 2 * 2
    predict_xy = K.expand_dims(predict_xy, 4)         # ? * 7 * 7 * 2 * 1 * 2
    predict_wh = K.expand_dims(predict_wh, 4)         # ? * 7 * 7 * 2 * 1 * 2
    predict_xy_min, predict_xy_max = xywh2minmax(predict_xy, predict_wh)  # ? * 7 * 7 * 2 * 1 * 2, ? * 7 * 7 * 2 * 1 * 2

    iou_scores = iou(predict_xy_min, predict_xy_max, label_xy_min, label_xy_max)  # ? * 7 * 7 * 2 * 1
    best_ious = K.max(iou_scores, axis=4)               # ? * 7 * 7 * 2
    best_box = K.max(best_ious, axis=3, keepdims=True)  # ? * 7 * 7 * 1

    box_mask = K.cast(best_ious >= best_box, K.dtype(best_ious))  # ? * 7 * 7 * 2

    no_object_loss = 0.5 * (1 - box_mask * response_mask) * K.square(0 - predict_trust)
    object_loss = box_mask * response_mask * K.square(1 - predict_trust)
    confidence_loss = no_object_loss + object_loss
    confidence_loss = K.sum(confidence_loss)

    class_loss = response_mask * K.square(label_class - predict_class)
    class_loss = K.sum(class_loss)

    _label_box = K.reshape(label_box, [-1, 7, 7, 1, 4])
    _predict_box = K.reshape(predict_box, [-1, 7, 7, 2, 4])

    label_xy, label_wh = yolo_head(_label_box)        # ? * 7 * 7 * 1 * 2, ? * 7 * 7 * 1 * 2
    predict_xy, predict_wh = yolo_head(_predict_box)  # ? * 7 * 7 * 2 * 2, ? * 7 * 7 * 2 * 2

    box_mask = K.expand_dims(box_mask)
    response_mask = K.expand_dims(response_mask)

    box_loss = 5 * box_mask * response_mask * K.square((label_xy - predict_xy) / 448)
    box_loss += 5 * box_mask * response_mask * K.square((K.sqrt(label_wh) - K.sqrt(predict_wh)) / 448)
    box_loss = K.sum(box_loss)

    loss = confidence_loss + class_loss + box_loss

    return loss
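
The `iou` helper above works on whole tensors, which makes it hard to check by eye. The same computation for a single pair of boxes can be sketched in plain Python; `box_iou` is a hypothetical name, and the guard against an empty union is an addition that the tensor version does not have.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Intersection rectangle: max of the mins, min of the maxes, clamped at zero.
    inter_w = max(min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]), 0.0)
    inter_h = max(min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]), 0.0)
    inter = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

overlapping = box_iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
disjoint = box_iou((0, 0, 1, 1), (2, 2, 3, 3))     # no overlap at all
print(overlapping, disjoint)
```

In the loss, the predicted box with the highest IoU in each cell is the one held responsible for the object, which is what `box_mask` encodes.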