Create a cfg folder and download the configuration file into it:
```
wget https://raw.githubusercontent.com/pjreddie/darknet/master/cfg/yolov3.cfg
```

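Write a parse_cfg function to parse the cfg we just downloaded. This cfg is the configuration file the YOLOv3 author uses himself. Its format is not one of the config formats commonly used in Python (the author wrote Darknet in C), so we have to write this slightly odd parser by hand.

Create a darknet.py to hold the network-building code.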

In [14]:
from __future__ import division

import torch 
import torch.nn as nn
import torch.nn.functional as F 
import numpy as np

In [11]:
def parse_cfg(cfgfile):
    """
    Takes a configuration file 
    
    Returns a list of blocks. Each block describes a block in the neural
    network to be built. A block is represented as a dictionary in the list
    
    """
    with open(cfgfile, 'r') as file:                       # close the file when done
        lines = file.read().split('\n')                    # store the lines in a list
    lines = [x.strip() for x in lines]                     # get rid of fringe whitespace
    lines = [x for x in lines if len(x) > 0]               # get rid of the empty lines
    lines = [x for x in lines if x[0] != '#']              # get rid of comments

    block = {}
    blocks = []

    for line in lines:
        if line[0] == "[":               # This marks the start of a new block
            if len(block) != 0:          # A non-empty block holds the values of the previous block,
                blocks.append(block)     # so add it to the blocks list
                block = {}               # and re-init the block
            block["type"] = line[1:-1].rstrip()     
        else:
            key,value = line.split("=", 1)          # split on the first '=' only
            block[key.rstrip()] = value.lstrip()
    blocks.append(block)

    return blocks
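
To make the file format concrete, here is a minimal sketch that applies the same parsing idea to a small inline fragment instead of the downloaded file. The fragment SAMPLE_CFG and the helper name parse_cfg_text are made up for illustration; only the [section] / key=value layout is taken from yolov3.cfg.

```python
# Illustrative only: SAMPLE_CFG and parse_cfg_text are hypothetical names,
# not part of the downloaded yolov3.cfg or of darknet.py.
SAMPLE_CFG = """
[net]
# Testing
batch=1
width= 416

[convolutional]
batch_normalize=1
filters=32
activation=leaky
"""

def parse_cfg_text(text):
    """Parse Darknet-style cfg text into a list of dicts, one per [section]."""
    blocks, block = [], {}
    for raw in text.split('\n'):
        line = raw.strip()
        if not line or line.startswith('#'):   # skip blank lines and comments
            continue
        if line.startswith('['):               # a new [section] begins
            if block:
                blocks.append(block)
            block = {'type': line[1:-1].strip()}
        else:                                  # key=value inside the current section
            key, value = line.split('=', 1)
            block[key.strip()] = value.strip()
    if block:
        blocks.append(block)
    return blocks

sample_blocks = parse_cfg_text(SAMPLE_CFG)
print(sample_blocks)
```

Note that every value stays a string, which is why the later code converts them with int() where needed.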

Take a look at the parsed result:

In [12]:
parse_cfg('./src/cfg/yolov3.cfg')

[{'type': 'net',
  'batch': '64',
  'subdivisions': '16',
  'width': '608',
  'height': '608',
  'channels': '3',
  'momentum': '0.9',
  'decay': '0.0005',
  'angle': '0',
  'saturation': '1.5',
  'exposure': '1.5',
  'hue': '.1',
  'learning_rate': '0.001',
  'burn_in': '1000',
  'max_batches': '500200',
  'policy': 'steps',
  'steps': '400000,450000',
  'scales': '.1,.1'},
 {'type': 'convolutional',
  'batch_normalize': '1',
  'filters': '32',
  'size': '3',
  'stride': '1',
  'pad': '1',
  'activation': 'leaky'},
 {'type': 'convolutional',
  'batch_normalize': '1',
  'filters': '64',
  'size': '3',
  'stride': '2',
  'pad': '1',
  'activation': 'leaky'},
 {'type': 'convolutional',
  'batch_normalize': '1',
  'filters': '32',
  'size': '1',
  'stride': '1',
  'pad': '1',
  'activation': 'leaky'},
 {'type': 'convolutional',
  'batch_normalize': '1',
  'filters': '64',
  'size': '3',
  'stride': '1',
  'pad': '1',
  'activation': 'leaky'},
 {'type': 'shortcut', 'from': '-3', 'activatio

Check this against the network architecture diagram in the paper to confirm that the structure matches.
![](https://pic1.zhimg.com/80/v2-770e443d1ad592a70bdf31868036a3fc_1440w.jpg)

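Define two layers: an empty layer, EmptyLayer, used for route and shortcut, and a detection layer, DetectionLayer, used to predict the bounding boxes (bbox) for object detection.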

In [15]:
class EmptyLayer(nn.Module):
    def __init__(self):
        super(EmptyLayer, self).__init__()

class DetectionLayer(nn.Module):
    def __init__(self, anchors):
        super(DetectionLayer, self).__init__()
        self.anchors = anchors

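As you can see, both layers are trivial. A route layer concatenates feature maps and a shortcut layer adds two feature maps element-wise. Both operations are simple enough to implement directly in the forward of the main network, so for now these simple layers only act as placeholders.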
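
As a rough illustration of those two operations, the sketch below uses made-up feature maps (the channel counts and the 13x13 size are assumptions, not values read from the cfg) to show what concatenation and element-wise addition do to the tensor shapes.

```python
import torch

# Hypothetical feature maps: batch of 1, spatial size 13x13.
a = torch.ones(1, 256, 13, 13)
b = torch.ones(1, 512, 13, 13)
c = torch.ones(1, 256, 13, 13)

routed = torch.cat((a, b), 1)   # route: concatenate along the channel dimension
shortcut = a + c                # shortcut: element-wise sum, shapes must match

print(routed.shape)    # torch.Size([1, 768, 13, 13])
print(shortcut.shape)  # torch.Size([1, 256, 13, 13])
```

Concatenation adds up the channel counts, while the shortcut keeps the shape unchanged.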

Next we use the parameters parsed from the cfg to create the network modules. For this we define create_modules():

In [16]:
def create_modules(blocks):
    net_info = blocks[0]        # Captures the information about the input and pre-processing    
    module_list = nn.ModuleList()
    prev_filters = 3            # previous feature map is an image, so the number of filters is 3 (R, G, B)
    output_filters = []

    for idx, each_block in enumerate(blocks[1:]):
        module = nn.Sequential()
        # check the type of block
        # create a new module for the block
        # append to module_list
        if each_block['type'] == 'convolutional':
            try:
                bn = int(each_block['batch_normalize'])
                bias = False
            except KeyError:    # the block has no batch_normalize key
                bn = 0
                bias = True
            filters = int(each_block['filters'])
            size = int(each_block['size'])
            stride = int(each_block['stride'])
            pad = int(each_block['pad'])
            activation = each_block['activation']

            if pad:
                pad = (size - 1) // 2
            else:
                pad = 0

            # add conv layer
            conv = nn.Conv2d(prev_filters, filters, size, stride, pad, bias=bias)
            module.add_module('conv_{}'.format(idx), conv)

            # add bn layer
            if bn:
                bn = nn.BatchNorm2d(filters)
                module.add_module('bn_{}'.format(idx), bn)
            
            # check the activation
            # activation will be either leaky or linear
            if activation == 'leaky':
                leaky = nn.LeakyReLU(0.1, inplace=True)
                module.add_module('leaky_{}'.format(idx), leaky)
            
        elif each_block['type'] == 'upsample':
            stride = int(each_block['stride'])
            upsample = nn.Upsample(scale_factor=stride, mode='bilinear')  # use the stride parsed from the cfg
            module.add_module("upsample_{}".format(idx), upsample)

        elif each_block['type'] == 'route':
            layers = each_block['layers'].split(',')
            start = int(layers[0])
            try:
                end = int(layers[1])
            except IndexError:  # the route has only one layer
                end = 0

            if start > 0:
                start = start - idx 
            if end > 0:
                end = end - idx     # a trick to let end negative, to keep end + idx correct
            
            route =  EmptyLayer()
            module.add_module("route_{0}".format(idx), route)

            if end < 0:
                end = output_filters[end + idx]
            filters = output_filters[start + idx] + end

        elif each_block['type'] == 'shortcut':
            shortcut = EmptyLayer()
            module.add_module("shortcut_{}".format(idx), shortcut)

        elif each_block['type'] == 'yolo':
            mask = each_block['mask'].split(',')
            mask = list(map(lambda x: int(x), mask))

            anchors = each_block['anchors'].split(',')
            anchors = list(map(lambda x: int(x), anchors))
            anchors = [(anchors[i], anchors[i+1]) for i in range(0, len(anchors), 2)]
            anchors = [anchors[i] for i in mask]

            detection = DetectionLayer(anchors)
            module.add_module('detection_{}'.format(idx), detection)
        
        module_list.append(module)
        prev_filters = filters
        output_filters.append(filters)

    return (net_info, module_list)
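
The yolo branch above selects a subset of the anchors through mask. A small sketch of that selection is shown below; the mask and anchors strings are assumed to mirror the first [yolo] block of yolov3.cfg, so check them against your own copy of the file.

```python
# Assumed to mirror the first [yolo] block of yolov3.cfg; check your own copy.
block = {
    'mask': '6,7,8',
    'anchors': '10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326',
}

mask = [int(m) for m in block['mask'].split(',')]
flat = [int(v) for v in block['anchors'].split(',')]      # int() tolerates the spaces
pairs = [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]
selected = [pairs[i] for i in mask]                        # anchors used by this layer

print(selected)  # [(116, 90), (156, 198), (373, 326)]
```

Each detection layer therefore carries only the anchors named by its own mask.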

A few simple tricks are used here to keep the code concise and readable. After reading it a couple of times it should not be hard to follow.
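
One of those tricks is the index handling in the route branch. The sketch below is one possible, more explicit way to write the same idea, not the code used in create_modules: the helper route_filters and the channel counts are hypothetical. A negative reference is relative to the current layer, a non-negative one is treated as an absolute index, and the output channel count is the sum over the referenced layers.

```python
def route_filters(layers, idx, output_filters):
    """Channel count produced by a route layer at position idx.

    layers: layer references from the cfg, negative = relative to idx,
            non-negative = absolute index.
    output_filters: channel counts of all layers before idx.
    """
    total = 0
    for ref in layers:
        absolute = ref if ref >= 0 else idx + ref   # normalise to an absolute index
        total += output_filters[absolute]
    return total

# Hypothetical channel counts for layers 0..4
output_filters = [32, 64, 32, 64, 64]
print(route_filters([-1, 1], 5, output_filters))  # layer 4 + layer 1 -> 128
print(route_filters([-4], 5, output_filters))     # layer 1 -> 64
```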

Finally, we can write a short snippet to verify that the network was created correctly:

In [18]:
blocks = parse_cfg("src/cfg/yolov3.cfg")
print(create_modules(blocks))

({'type': 'net', 'batch': '64', 'subdivisions': '16', 'width': '608', 'height': '608', 'channels': '3', 'momentum': '0.9', 'decay': '0.0005', 'angle': '0', 'saturation': '1.5', 'exposure': '1.5', 'hue': '.1', 'learning_rate': '0.001', 'burn_in': '1000', 'max_batches': '500200', 'policy': 'steps', 'steps': '400000,450000', 'scales': '.1,.1'}, ModuleList(
  (0): Sequential(
    (conv_0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn_0): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (leaky_0): LeakyReLU(negative_slope=0.1, inplace=True)
  )
  (1): Sequential(
    (conv_1): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
    (bn_1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (leaky_1): LeakyReLU(negative_slope=0.1, inplace=True)
  )
  (2): Sequential(
    (conv_2): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn_2): BatchN

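Now we can start building the network proper. forward describes the forward pass. This part looks a lot like the create_modules code, because it also walks the parsed blocks list and passes the feature map through layer by layer. Looking closely, the route branch simply fetches the results of earlier layers, which confirms that EmptyLayer is only a placeholder and that the real concat is written directly in forward.

Similarly, shortcut adds the result of an earlier layer to the input of this layer (that is, the output of the previous layer).

Also, in this cfg the layer before a shortcut is always a convolutional layer, so x is always equal to the previous layer's output and there is no need to fetch it through outputs.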
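
One thing to keep in mind when caching layer outputs in a dict like outputs: the cached entry and x are the same tensor object, so an in-place add also changes the cached value. The sketch below uses made-up tensors (it is not part of darknet.py) to show the difference. Whether this matters in practice depends on whether the cached layer is read again later.

```python
import torch

outputs = {}
x = torch.ones(2, 2)
outputs[0] = x                         # cache the "previous layer" output
earlier = torch.full((2, 2), 3.0)      # stand-in for an earlier layer's output

y = x + earlier                        # out-of-place: outputs[0] is untouched
before = outputs[0][0, 0].item()

x += earlier                           # in-place: x and outputs[0] are the same tensor
after = outputs[0][0, 0].item()

print(before, after)  # 1.0 4.0
```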

In [19]:
class Darknet(nn.Module):
    def __init__(self, cfg_file):
        super(Darknet, self).__init__()
        self.blocks = parse_cfg(cfg_file)
        self.net_info, self.module_list = create_modules(self.blocks)

    def forward(self, x):
        blocks = self.blocks[1:]
        outputs = {}

        for idx, block in enumerate(blocks):
            if block['type'] == 'convolutional' or block['type'] == 'upsample':
                x = self.module_list[idx](x)
            
            elif block['type'] == 'route':
                layers = list(map(lambda x: int(x), block['layers'].split(',')))
                
                if layers[0] > 0:
                    layers[0] -= idx
                if len(layers) == 1: # len must be equal to 1 or 2
                    x = outputs[layers[0] + idx]
                else:
                    if layers[1] > 0:
                        layers[1] -= idx
                    featuremap1 = outputs[layers[0] + idx]
                    featuremap2 = outputs[layers[1] + idx]

                    x = torch.cat((featuremap1, featuremap2), 1)
            
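            elif block['type'] == 'shortcut':
                # out-of-place add, so the tensor cached in outputs is not modified
                x = x + outputs[int(block['from']) + idx]

            outputs[idx] = x    # cache every layer's output for later route/shortcut layers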

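So far we have written the forward pass for four kinds of modules in the Darknet backbone: conv, upsample, route and shortcut. What remains is the yolo module, i.e. the forward pass of the detection layer. Because the detection layer's prediction is produced by a 1x1 convolution, the resulting feature map has shape bs, (5+C)xB, H, W, which is awkward to work with (for example, to get the parameters of the second box I would have to write [:, (5+C):2*(5+C), :, :]). To keep things intuitive and concise, we flatten this four-dimensional feature map into a two-dimensional table per image, so that each row holds the parameters of exactly one box.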

In [61]:
from __future__ import division

import torch 
import torch.nn as nn
import torch.nn.functional as F 
import numpy as np
import cv2 

def transform_predict(pred, img_size, anchors, num_classes, device=None):
    '''
    Takes the prediction featuremap and some params

    Returns a 3-dim tensor bs x (BHW) x (5+C) reshaped from the prediction
    C = num_classes
    B = len(anchors)
    H = W = pred.size(2) = pred.size(3)
    '''
    batch_size = pred.size(0)
    scale = img_size // pred.size(2) # original img size is 'scale' times larger than the pred size (due to conv)
    grid_size = pred.size(2)         # the grid is grid_size*grid_size
    bbox_attrs = 5 + num_classes
    num_anchors = len(anchors)

    # we want to reshape bs, (5+C)*B, H, W -> bs, BHW, (5+C)
    # step1: bs, (5+C)*B, H, W -> bs, (5+C)*B, HW 
    pred = pred.reshape(batch_size, bbox_attrs*num_anchors, grid_size*grid_size)
    # step2: bs, (5+C)*B, HW -> bs, HW, (5+C)*B
    pred = pred.transpose(1, 2)
    # step3: bs, HW, (5+C)*B -> bs, BHW, (5+C)
    pred = pred.reshape(batch_size, num_anchors*grid_size*grid_size, bbox_attrs)


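The flattening part of this transform takes three steps, which is a little hard to follow at first: we start from a fourth-order tensor and want to unfold it into a two-dimensional table.
![](https://pic1.zhimg.com/80/v2-f00c6ab7bb46832d43f90c96120a2b80_1440w.jpg)

Reshaping it directly definitely does not work. Here is a small experiment that makes it easy to see why the transform needs three steps:

We start by creating a random third-order tensor t (the batch_size dimension does not change, so three dimensions are enough for this example) of size 2x2, 2, 2, which formally corresponds to (5+C)xB, H, W (for simplicity, pretend that 5+C=2).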

In [42]:
t = torch.randn(((2*2), 2,2)) # (5+C)*B, H, W
t

tensor([[[ 0.3833,  1.2660],
         [-1.7883,  1.8974]],

        [[ 1.1179, -1.4644],
         [-0.4541, -0.9651]],

        [[ 1.9296,  0.9176],
         [-0.1147,  0.3998]],

        [[ 0.3357,  0.3666],
         [ 0.4282, -0.4085]]])

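The first transform merges H and W, the height and width of the original feature map. In other words, it stretches each plane into a single row.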

In [43]:
t1 = t.reshape(2*2,4) # (5+C)*B, H, W -> (5+C)*B, HW 
t1

tensor([[ 0.3833,  1.2660, -1.7883,  1.8974],
        [ 1.1179, -1.4644, -0.4541, -0.9651],
        [ 1.9296,  0.9176, -0.1147,  0.3998],
        [ 0.3357,  0.3666,  0.4282, -0.4085]])

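The second transform swaps the two dimensions, moving HW in front of the (5+C)xB dimension, so that each row now holds all the values predicted for one grid cell and they cannot be split apart.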

In [44]:
t2 = t1.transpose(0,1);t2 # (5+C)*B, HW -> HW, (5+C)*B

tensor([[ 0.3833,  1.1179,  1.9296,  0.3357],
        [ 1.2660, -1.4644,  0.9176,  0.3666],
        [-1.7883, -0.4541, -0.1147,  0.4282],
        [ 1.8974, -0.9651,  0.3998, -0.4085]])

The third transform limits each row to 5+C numbers, which unfolds the tensor into exactly the format we want.

In [49]:
t3 = t2.reshape(8,2);t3 # HW, (5+C)*B -> BHW, (5+C)

tensor([[ 0.3833,  1.1179],
        [ 1.9296,  0.3357],
        [ 1.2660, -1.4644],
        [ 0.9176,  0.3666],
        [-1.7883, -0.4541],
        [-0.1147,  0.4282],
        [ 1.8974, -0.9651],
        [ 0.3998, -0.4085]])

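For comparison, here is what happens if we reshape the original tensor directly to 8,2. The result is completely different, so a direct reshape cannot replace the three steps.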

In [50]:
t.reshape(8,2)

tensor([[ 0.3833,  1.2660],
        [-1.7883,  1.8974],
        [ 1.1179, -1.4644],
        [-0.4541, -0.9651],
        [ 1.9296,  0.9176],
        [-0.1147,  0.3998],
        [ 0.3357,  0.3666],
        [ 0.4282, -0.4085]])
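
The same check can be made deterministic. The sketch below uses made-up sizes and assumes the channels are ordered anchor by anchor (channel = b*(5+C) + attribute). Under that assumption it compares the three-step transform with an explicit reshape + permute and confirms that a given row holds the attributes of one box at one cell.

```python
import torch

bs, B, C, H, W = 2, 3, 1, 2, 2      # made-up sizes for illustration
A = 5 + C                            # attributes per box

pred = torch.arange(bs * B * A * H * W, dtype=torch.float32).reshape(bs, B * A, H, W)

# The three steps from the text
three_step = pred.reshape(bs, B * A, H * W).transpose(1, 2).reshape(bs, H * W * B, A)

# Same layout via an explicit reshape + permute
explicit = pred.reshape(bs, B, A, H, W).permute(0, 3, 4, 1, 2).reshape(bs, H * W * B, A)
same = torch.equal(three_step, explicit)

# The row for cell (h, w) and anchor b holds channels b*A .. b*A+A-1 at that cell
h, w, b = 1, 0, 2
row = three_step[0, (h * W + w) * B + b]
expected = pred[0, b * A:(b + 1) * A, h, w]

# A direct reshape of the 4-D tensor gives a different layout
direct = pred.reshape(bs, H * W * B, A)

print(same, torch.equal(row, expected), torch.equal(direct, three_step))
```

The rows come out grouped by grid cell first and by anchor second, which is the order the offsets and anchors need to follow later.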

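Next we complete the function. The preset anchors are scaled down to the scale of the feature map. The center offsets and the objectness score are passed through a sigmoid, and the top-left coordinates of each grid cell are then added to the center offsets. Every grid cell is assigned the same set of initial anchors, which are multiplied by the exponential of the predicted width and height to obtain the bbox size on the feature map. Once this is done, the coordinates and sizes are multiplied by the scale factor to recover the bbox in the original image.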
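
To make these steps concrete, here is a minimal sketch that decodes a single box in plain Python. The helper decode_box, the cell indices and the anchor values are chosen only for illustration; it follows the steps described above rather than reproducing the function that comes next.

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, anchor_w, anchor_h, stride):
    """Decode one raw prediction into (x, y, w, h) in input-image pixels.

    cx, cy: column/row index of the grid cell; anchor_w, anchor_h: anchor in pixels;
    stride: input size divided by grid size.
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = (sigmoid(tx) + cx) * stride                  # center x
    by = (sigmoid(ty) + cy) * stride                  # center y
    bw = math.exp(tw) * (anchor_w / stride) * stride  # anchor scaled to the grid, then back
    bh = math.exp(th) * (anchor_h / stride) * stride
    return bx, by, bw, bh

box = decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=4, anchor_w=116, anchor_h=90, stride=32)
print(box)  # (112.0, 144.0, 116.0, 90.0)
```

With all raw values at zero the box sits at the center of its cell and has exactly the anchor's size.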

In [63]:
def transform_predict(pred, img_size, anchors, num_classes, device=None):
    '''
    Takes the prediction featuremap and some params

    Returns a 3-dim tensor bs x (BHW) x (5+C) reshaped from the prediction
    C = num_classes
    B = len(anchors)
    H = W = pred.size(2) = pred.size(3)
    '''
    batch_size = pred.size(0)
    scale = img_size // pred.size(2) # original img size is 'scale' times larger than the pred size (due to conv)
    grid_size = pred.size(2)         # the grid is grid_size*grid_size
    bbox_attrs = 5 + num_classes
    num_anchors = len(anchors)

    # we want to reshape bs, (5+C)*B, H, W -> bs, BHW, (5+C)
    # step1: bs, (5+C)*B, H, W -> bs, (5+C)*B, HW 
    pred = pred.reshape(batch_size, bbox_attrs*num_anchors, grid_size*grid_size)
    # step2: bs, (5+C)*B, HW -> bs, HW, (5+C)*B
    pred = pred.transpose(1, 2)
    # step3: bs, HW, (5+C)*B -> bs, BHW, (5+C)
    pred = pred.reshape(batch_size, num_anchors*grid_size*grid_size, bbox_attrs)

    anchors = [(a[0]/scale, a[1]/scale) for a in anchors]

    # 5+C = tx, ty, tw, th, ob, c1, c2 ... cn
    pred[:,:,0] = pred[:,:,0].sigmoid() # tx
    pred[:,:,1] = pred[:,:,1].sigmoid() # ty
    pred[:,:,4] = pred[:,:,4].sigmoid() # ob

    grid = np.arange(grid_size)
    x, y = np.meshgrid(grid, grid)
    x_offset = torch.FloatTensor(x).reshape(-1,1) # HW x 1
    y_offset = torch.FloatTensor(y).reshape(-1,1) # HW x 1

    if device: # use GPU
        x_offset = x_offset.to(device)
        y_offset = y_offset.to(device)
    
    # concat -> HW, 2
    x_y_offset = torch.cat((x_offset, y_offset), 1) 
    # HW, 2 -> HW, 2xB
    x_y_offset = x_y_offset.repeat(1, num_anchors)
    # HW, 2xB -> HWxB, 2
    x_y_offset = x_y_offset.reshape(-1, 2)
    # HWxB, 2 -> 1, HWxB, 2
    x_y_offset = x_y_offset.unsqueeze(0)

    pred[:,:,:2] += x_y_offset # add offset to tx, ty

    anchors = torch.FloatTensor(anchors)
    if device:
        anchors = anchors.to(device)
    
    # give every grid cell the same B anchors
    # B, 2 -> HWxB, 2
    anchors = anchors.repeat(grid_size*grid_size, 1)
    # HWxB, 2 -> 1, HWxB, 2
    anchors = anchors.unsqueeze(0)
    # bw = exp(tw) * anchor_w, bh = exp(th) * anchor_h
    pred[:,:,2:4] = torch.exp(pred[:,:,2:4])*anchors

    # p(c) = sigmoid(c)
    pred[:,:,5:5+num_classes] = pred[:,:,5:5+num_classes].sigmoid()

    pred[:,:,:4] *= scale

    return pred
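
The order of the rows in x_y_offset has to match the order of the boxes in pred. The sketch below builds the offsets for a tiny made-up grid (2x2 cells, 2 anchors) so the ordering can be inspected by eye. It uses torch.from_numpy(...).float() for the conversion, which is a choice made for this sketch.

```python
import numpy as np
import torch

grid_size, num_anchors = 2, 2                 # tiny made-up sizes
grid = np.arange(grid_size)
x, y = np.meshgrid(grid, grid)                # x varies along columns, y along rows

x_offset = torch.from_numpy(x).float().reshape(-1, 1)
y_offset = torch.from_numpy(y).float().reshape(-1, 1)

# HW, 2 -> HW, 2xB -> HWxB, 2: each cell's (x, y) repeated once per anchor
x_y_offset = torch.cat((x_offset, y_offset), 1).repeat(1, num_anchors).reshape(-1, 2)
print(x_y_offset.tolist())
```

Each cell's (x, y) pair appears num_anchors times in a row before the next cell starts, i.e. the rows are ordered by cell first and by anchor second.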

Next we continue to flesh out the model definition:

In [76]:
class Darknet(nn.Module):
    def __init__(self, cfg_file):
        super(Darknet, self).__init__()
        self.blocks = parse_cfg(cfg_file)
        self.net_info, self.module_list = create_modules(self.blocks)

    def forward(self, x):
        blocks = self.blocks[1:]
        outputs = {}

        cnt_dets = 0
        for idx, block in enumerate(blocks):
            if block['type'] == 'convolutional' or block['type'] == 'upsample':
                x = self.module_list[idx](x)
            
            elif block['type'] == 'route':
                layers = list(map(lambda x: int(x), block['layers'].split(',')))
                
                if layers[0] > 0:
                    layers[0] -= idx
                if len(layers) == 1: # len must be equal to 1 or 2
                    x = outputs[layers[0] + idx]
                else:
                    if layers[1] > 0:
                        layers[1] -= idx
                    featuremap1 = outputs[layers[0] + idx]
                    featuremap2 = outputs[layers[1] + idx]

                    x = torch.cat((featuremap1, featuremap2), 1)
            
            elif block['type'] == 'shortcut':
                # out-of-place add, so the tensor cached in outputs is not modified
                x = x + outputs[int(block['from']) + idx]

            elif block['type'] == 'yolo':
                anchors =  self.module_list[idx][0].anchors
                img_size = int(self.net_info['height'])
                num_classes = int(block['classes'])
                x = transform_predict(x, img_size, anchors, num_classes)

                if cnt_dets == 0:
                    dets = x
                else:
                    dets = torch.cat((dets, x), 1)
                cnt_dets += 1

            outputs[idx] = x
        return dets    
            

With the model defined, let's feed in an image to check that it can complete one forward pass.

Grab an image with the following command:
```
wget https://github.com/ayooshkathuria/pytorch-yolo-v3/raw/master/dog-cycle-car.png
```
![](https://github.com/ayooshkathuria/pytorch-yolo-v3/raw/master/dog-cycle-car.png)

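Before the image is fed into the network it needs some simple preprocessing:
* read the image with the opencv library
* resize it to 416x416
* reorder the three color channels to RGB and move the channel axis first, H x W x C -> C x H x W (the channel order is a quirk of opencv, whose default is BGR; most other image libraries do not have this issue)
* add a new leading dimension for the batch, so the tensor exactly matches the network's input format
* finally, convert the preprocessed image into a PyTorch float tensor with torch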

In [67]:
def get_test_input():
    img = cv2.imread("dog-cycle-car.png")
    img = cv2.resize(img, (416,416))          # Resize to the input dimension
    img_ =  img[:,:,::-1].transpose((2,0,1))  # BGR -> RGB | H x W x C -> C x H x W
    img_ = img_[np.newaxis,:,:,:]/255.0       # Add a batch dimension at axis 0 | Normalise to [0, 1]
    img_ = torch.from_numpy(img_).float()     # Convert to a float tensor
    return img_

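Then we can build the model, feed it the preprocessed image and print the result. No weights have been loaded yet, so the numbers only show that the forward pass runs. Note also that forward takes img_size from the height in the cfg (608) rather than from the 416x416 input, so the coordinates below are scaled for a 608-pixel image. Setting width and height to 416 in the cfg makes the two agree.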

In [77]:
model = Darknet("src/cfg/yolov3.cfg")
inp = get_test_input()
pred = model(inp)
print (pred)

tensor([[[2.8506e+01, 2.9370e+01, 2.2536e+02,  ..., 6.5027e-01,
          6.5238e-01, 6.5067e-01],
         [2.3532e+01, 2.0787e+01, 2.9245e+02,  ..., 6.2272e-01,
          6.2550e-01, 6.2754e-01],
         [1.9401e+01, 2.2140e+01, 7.3307e+02,  ..., 6.2744e-01,
          6.2653e-01, 6.1711e-01],
         ...,
         [5.6003e+02, 5.6423e+02, 1.2923e+01,  ..., 5.6697e-01,
          5.2232e-01, 4.7274e-01],
         [5.5896e+02, 5.6110e+02, 1.8051e+01,  ..., 6.0637e-01,
          4.8670e-01, 4.5249e-01],
         [5.6047e+02, 5.6626e+02, 4.1285e+01,  ..., 5.0428e-01,
          4.8574e-01, 4.5820e-01]]], grad_fn=<CatBackward>)
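
As a quick sanity check on the size of this tensor, the sketch below counts the rows one would expect from the three detection layers. The strides 32, 16 and 8 and the three anchors per scale are assumptions based on the standard yolov3.cfg, and num_predictions is a hypothetical helper.

```python
def num_predictions(img_size, strides=(32, 16, 8), anchors_per_scale=3):
    """Total rows in the concatenated detection tensor for a square input."""
    return sum((img_size // s) ** 2 * anchors_per_scale for s in strides)

print(num_predictions(416))  # 13*13*3 + 26*26*3 + 52*52*3 = 10647
```

For a 416x416 input the second dimension of pred should therefore be 10647, and the last dimension is 5 plus the number of classes declared in the cfg.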
