## 2023 DataLab Cup2 : CNN for Object Detection
##### Competition for CS565600 Deep Learning
* Team: 瑜旋學姊教我DL
* Members: 112062531 王興彥, 112062559 邱仁緯, 112062632 林沁璿
* Public: 0.36306
* Private: 0.39025

In [1]:
classes_name =  ["aeroplane", "bicycle", "bird", "boat", "bottle",
                 "bus", "car", "cat", "chair", "cow", "diningtable",
                 "dog", "horse", "motorbike", "person", "pottedplant",
                 "sheep", "sofa", "train","tvmonitor"]

In [2]:
import tensorflow as tf
import numpy as np
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

training_data_file = open("./dataset/pascal_voc_training_data.txt", "r")

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        # CUDA_VISIBLE_DEVICES="1" above exposes only physical GPU 1, which appears here as gpus[0]
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

1 Physical GPUs, 1 Logical GPUs


In [3]:
# common params
IMAGE_SIZE = 448
BATCH_SIZE = 32
NUM_CLASSES = 20
MAX_OBJECTS_PER_IMAGE = 20

# dataset params
DATA_PATH = './dataset/pascal_voc_training_data.txt'
IMAGE_DIR = './dataset/VOCdevkit_train/VOC2007/JPEGImages/'

# model params
CELL_SIZE = 7
BOXES_PER_CELL = 2
OBJECT_SCALE = 1
NOOBJECT_SCALE = 0.5
CLASS_SCALE = 1
COORD_SCALE = 5

# training params
LEARNING_RATE = 1e-5
EPOCHS = 1

# Augmentation dataset params
AUG_DATASET = False
DATA_AUG_PATH = './dataset/pascal_voc_training_data_aug.txt'
IMAGE_AUG_DIR = './dataset/VOCdevkit_train/VOC2007/JPEGImages_Aug/'

### Data Augmentation

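1. Dealing with Data Imbalance (Over-sampling)
    * The number of samples per class is highly imbalanced, so the classes must be balanced. After balancing, both the rate at which the loss decreases and the final predictions improve significantly.
    * We balance the data by over-sampling: classes with few samples are increased to a relatively balanced count.
    * Following the reference below, every class is increased to at least 3000 object instances.
    * Reference: https://reurl.cc/q0ANdg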
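
The over-sampling rule described above can be sketched on toy data. This is a minimal illustration, not the notebook's code: the function name `oversample`, the toy class lists, and the `progressed` guard against an endless loop are assumptions added here. The key idea is that an image is duplicated only while none of its classes has reached the lower bound.

```python
from collections import Counter

def oversample(image_classes, lower_bound):
    """Greedy over-sampling sketch (hypothetical helper).

    image_classes: one list of class ids per image (toy data).
    Returns the number of copies kept per image and the final class counts.
    """
    copies = [1] * len(image_classes)
    counts = Counter(c for classes in image_classes for c in classes)
    # Visit images with the most objects first.
    order = sorted(range(len(image_classes)),
                   key=lambda i: len(image_classes[i]), reverse=True)
    blocked = [False] * len(image_classes)

    while any(counts[c] < lower_bound for c in counts):
        progressed = False
        for i in order:
            # Skip an image once any of its classes has reached the bound,
            # so classes that are already large are not inflated further.
            if blocked[i] or any(counts[c] >= lower_bound for c in image_classes[i]):
                blocked[i] = True
                continue
            copies[i] += 1
            counts.update(image_classes[i])
            progressed = True
        if not progressed:  # assumption: stop if no image can be duplicated
            break
    return copies, dict(counts)

# Toy example: class 0 is common, class 1 is rare.
toy = [[0, 0, 0], [0, 1], [1]]
copies, counts = oversample(toy, lower_bound=4)
print(copies, counts)  # [1, 1, 3] {0: 4, 1: 4}
```

Only the third image is duplicated, because the other two contain class 0, which already meets the bound.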

In [4]:
import numpy as np

# Define a function to process each line
def process_line(line, max_objects):
    data = line.strip().split()

    # Convert the remaining data to float and limit the number of objects
    record = [float(num) for num in data[1:]]
    if len(record) > max_objects * 5:
        record = record[:max_objects * 5]

    return data[0], record, len(record) // 5

# Initialize lists
image_names = []
record_list = []
object_num_list = []

# Read the file and process each line
with open(DATA_PATH, 'r') as file:
    for line in file:
        name, record, obj_num = process_line(line, MAX_OBJECTS_PER_IMAGE)
        image_names.append(name)
        record_list.append(record)
        object_num_list.append(obj_num)

# Convert records to bounding boxes and create a dictionary mapping names to boxes
bboxes_list = [np.array(record).reshape((-1, 5)) for record in record_list]
name_to_bboxes_dict = dict(zip(image_names, bboxes_list))

# Extract object classes
object_class_list = [[int(record[i]) for i in range(4, len(record), 5)] for record in record_list]


In [5]:
class_count = [0] * NUM_CLASSES
for class_list in object_class_list:
    for c in class_list:
        class_count[c] += 1

# Print class count in a formatted way
for i, class_name in enumerate(classes_name):
    print(f'{i:2d}) {class_name:11} {class_count[i]:4d}', end=('\n' if (i + 1) % 4 == 0 else '\t'))

 0) aeroplane    331	 1) bicycle      412	 2) bird         577	 3) boat         398
 4) bottle       612	 5) bus          271	 6) car         1634	 7) cat          389
 8) chair       1423	 9) cow          356	10) diningtable  309	11) dog          536
12) horse        403	13) motorbike    387	14) person      5318	15) pottedplant  603
16) sheep        353	17) sofa         419	18) train        328	19) tvmonitor    366


In [6]:
# Initialization
aug_image_name_list = []
class_count = np.zeros(NUM_CLASSES, int)
image_count = np.zeros(len(image_names), int)

# Generate augmented image names and update class counts
for i, class_list in enumerate(object_class_list):
    aug_image_name_list.append(f'{image_count[i]}_{image_names[i]}')
    image_count[i] += 1
    for c in class_list:
        class_count[c] += 1

# Set the minimum number of object instances per class
LOWER_BOUND = 3000
sorted_indices = np.argsort(object_num_list)[::-1]
invalid_aug = np.zeros(len(image_names), int)

# Duplicate images until every class has at least LOWER_BOUND object instances
while not (class_count >= LOWER_BOUND).all():
    for i in sorted_indices:
        if invalid_aug[i] or (class_count[object_class_list[i]] >= LOWER_BOUND).any():
            invalid_aug[i] = 1
            continue

        aug_image_name_list.append(f'{image_count[i]}_{image_names[i]}')
        image_count[i] += 1
        for c in object_class_list[i]:
            class_count[c] += 1

# Print class count in a formatted way
for i, class_name in enumerate(classes_name):
    print(f'{i:2d}) {class_name:11} {class_count[i]:4d}', end=('\n' if (i + 1) % 4 == 0 else '\t'))

 0) aeroplane   3001	 1) bicycle     3000	 2) bird        3000	 3) boat        3000
 4) bottle      3000	 5) bus         3000	 6) car         3002	 7) cat         3000
 8) chair       3001	 9) cow         3002	10) diningtable 3000	11) dog         3000
12) horse       3000	13) motorbike   3000	14) person      5318	15) pottedplant 3004
16) sheep       3007	17) sofa        3000	18) train       3000	19) tvmonitor   3000


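2. Perform Data Augmentation on new data
    * The step above balances the data by duplicating existing samples, but training on identical copies has very limited benefit, so the newly added samples must be transformed.
    * The transforms come from the library linked below. It offers other transforms besides these four, but the others produced some bugs, so only these four are used.
    * They are: HSV adjustment (Hue, Saturation, Value/Lightness), Flip, Scale, and Shear.
    * Reference: https://reurl.cc/QZAbLM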
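
Geometric transforms must move the bounding boxes together with the pixels. As a minimal sketch of that idea for the flip case (the helper `hflip_boxes` is hypothetical and is not the linked library's implementation; it assumes boxes are `[xmin, ymin, xmax, ymax, class]` in pixels and mirrors around `image_width`):

```python
def hflip_boxes(boxes, image_width):
    """Mirror [xmin, ymin, xmax, ymax, class] boxes around the vertical centre line."""
    flipped = []
    for xmin, ymin, xmax, ymax, cls in boxes:
        # After mirroring, the old right edge becomes the new left edge.
        flipped.append([image_width - xmax, ymin, image_width - xmin, ymax, cls])
    return flipped

boxes = [[10, 20, 110, 220, 14]]
print(hflip_boxes(boxes, image_width=500))  # [[390, 20, 490, 220, 14]]
```

Flipping twice returns the original boxes, which is a quick sanity check for this kind of transform.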

In [7]:
from data_aug.data_aug import *

def generate_aug_dataset():
    f =  open(DATA_AUG_PATH, 'w')
    np.random.seed(0)
    seq = Sequence([RandomHSV(50, 50, 50),
                    RandomHorizontalFlip(0.5),
                    RandomScale(0.15),
                    RandomShear(0.15)])

    for name in aug_image_name_list:
        _, img_name = name.split('_')
    
        img_path = os.path.join(IMAGE_DIR, img_name)
        img_file = tf.io.read_file(img_path)
        img = tf.io.decode_jpeg(img_file, channels=3)
        bboxes = name_to_bboxes_dict.get(img_name)
    
        if not name.startswith('0'):
            img, bboxes = seq(img.numpy().copy(), bboxes.copy())
    
        encoded_img = tf.io.encode_jpeg(img)
        augmented_img_path = os.path.join(IMAGE_AUG_DIR, name)
        tf.io.write_file(augmented_img_path, encoded_img)
    
        bbox_str_list = [str(int(b)) for bbox in bboxes for b in bbox]
        bbox_line = ' '.join([name] + bbox_str_list)
        f.write(bbox_line + '\n')

    # Close the annotation file so every line is flushed to disk
    f.close()

In [8]:
if AUG_DATASET:
    generate_aug_dataset()

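3. Apply mosaic data augmentation to raw data
    * Following the Mosaic method from YOLOv4, we further augment the original data.
    * Mosaic picks four images, transforms each of them, and then randomly stitches them into one new image, which enriches the image backgrounds.
    * Besides enriching the dataset, Mosaic keeps the amount of added data down, since each new image combines four source images. This reduces the training-time overhead of data augmentation.
    * This time Mosaic added as many images as the original dataset contains.
    * Reference: https://reurl.cc/7MQkal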
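
To make the box bookkeeping behind Mosaic concrete, here is a simplified sketch. It assumes a fixed 2 x 2 grid in which every image is resized to a quarter of the canvas; the real method uses a random split point and random transforms, and the helper `mosaic_boxes` and its sample data are hypothetical.

```python
def mosaic_boxes(samples, canvas=448):
    """samples: four (width, height, boxes) tuples, boxes as [xmin, ymin, xmax, ymax, class].

    Each image is resized to canvas/2 x canvas/2 and placed in one quadrant
    (top-left, top-right, bottom-left, bottom-right).
    """
    half = canvas / 2
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]
    merged = []
    for (w, h, boxes), (ox, oy) in zip(samples, offsets):
        sx, sy = half / w, half / h  # resize factors for this tile
        for xmin, ymin, xmax, ymax, cls in boxes:
            # Scale into the tile, then shift by the tile's offset.
            merged.append([xmin * sx + ox, ymin * sy + oy,
                           xmax * sx + ox, ymax * sy + oy, cls])
    return merged

samples = [
    (448, 448, [[0, 0, 448, 448, 6]]),
    (224, 224, [[0, 0, 112, 112, 14]]),
    (448, 224, [[224, 112, 448, 224, 11]]),
    (224, 448, [[56, 112, 168, 336, 7]]),
]
for box in mosaic_boxes(samples):
    print(box)
```

Each source box is scaled by its tile's resize factor and shifted by the tile's offset, so one mosaic image carries the labels of all four sources.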

In [9]:
def find_line_by_name(filename, name):
    with open(filename, 'r') as file:
        for line in file:
            if name in line:
                return line
    return None

In [10]:
def convert_line_format(line):
    # Split the line by spaces
    parts = line.split()
    # The first part is the image filename, so we keep it separate
    filename = parts[0]
    # The rest are assumed to be numeric values for bounding boxes and classes
    numeric_parts = parts[1:]
    
    # Reformat the numeric parts by grouping every five elements
    reformatted_numeric_parts = [' '.join([','.join(numeric_parts[i:i+5])]) for i in range(0, len(numeric_parts), 5)]
    
    # Combine the filename and the reformatted numeric parts
    new_line = filename + ' ' + ' '.join(reformatted_numeric_parts)
    return new_line

In [11]:
# from utils.random_data import get_random_data_with_Mosaic
from myutils.random_data import get_random_data_with_Mosaic
from random import sample
from PIL import Image

input_shape = [IMAGE_SIZE, IMAGE_SIZE]

if AUG_DATASET:
    img_names = os.listdir(IMAGE_DIR)

    for index in range(len(img_names)):
        sample_imgs = sample(img_names, 4)
        annotation_line = []
        for name in sample_imgs:
            tmp = find_line_by_name(DATA_PATH, name)
            # print(tmp)
            if tmp is None:
                annotation_line.append(None)
                continue
                
            found_line = IMAGE_DIR + tmp
            annotation_line.append(found_line)
        
        if None in annotation_line:
            continue
        else:
            annotation_line = [convert_line_format(l.strip()) for l in annotation_line]
            image_data, box_data = get_random_data_with_Mosaic(annotation_line, input_shape)

            img = Image.fromarray(image_data.astype(np.uint8))
            img.save(os.path.join(IMAGE_AUG_DIR, 'mosaic_'+str(index).zfill(6)+'.jpg'))
            formatted_line = ''
            for item in box_data:

                int_item = [str(int(num)) for num in item]
                my_string = ' '.join(int_item)
                formatted_line = formatted_line + ' ' + my_string

            formatted_line = 'mosaic_' + '{:06}'.format(index) + '.jpg'+ formatted_line

            with open(DATA_AUG_PATH, "a") as file:
                file.write(formatted_line + "\n")

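When we first set up data augmentation, over-sampling required at least 1000 images per class, and Mosaic augmentation from the raw data was applied at the same time.

Training under this setting, we found that at prediction time the model tended to predict too many objects, even when the image contained only a few. We believe this happened because the proportion of Mosaic images was too high, so we raised the over-sampling lower bound (`LOWER_BOUND`) to 3000 and retrained, which clearly improved the results.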

### Dataset Loader

In [12]:
class DatasetGenerator:
    """
    Load the augmented Pascal VOC 2007 dataset and create an input pipeline.
    - resizes images to 448 x 448
    - applies DenseNet preprocessing (tf.keras.applications.densenet.preprocess_input)
    - shuffles the input
    - builds batches
    """

    def __init__(self):
        self.image_names = []
        self.record_list = []
        self.object_num_list = []
        # filling the record_list
        input_file = open(DATA_AUG_PATH, 'r')

        for line in input_file:
            line = line.strip()
            ss = line.split(' ')
            self.image_names.append(ss[0])

            self.record_list.append([float(num) for num in ss[1:]])

            self.object_num_list.append(min(len(self.record_list[-1])//5, 
                                            MAX_OBJECTS_PER_IMAGE))
            if len(self.record_list[-1]) < MAX_OBJECTS_PER_IMAGE*5:
                # if there are objects less than MAX_OBJECTS_PER_IMAGE, pad the list
                self.record_list[-1] = self.record_list[-1] +\
                [0., 0., 0., 0., 0.]*\
                (MAX_OBJECTS_PER_IMAGE-len(self.record_list[-1])//5)
                
            elif len(self.record_list[-1]) > MAX_OBJECTS_PER_IMAGE*5:
               # if there are objects more than MAX_OBJECTS_PER_IMAGE, crop the list
                self.record_list[-1] = self.record_list[-1][:MAX_OBJECTS_PER_IMAGE*5]

    def _data_preprocess(self, image_name, raw_labels, object_num):
        image_file = tf.io.read_file(IMAGE_AUG_DIR+image_name)
        image = tf.io.decode_jpeg(image_file, channels=3)

        h = tf.shape(image)[0]
        w = tf.shape(image)[1]

        width_ratio  = IMAGE_SIZE * 1.0 / tf.cast(w, tf.float32) 
        height_ratio = IMAGE_SIZE * 1.0 / tf.cast(h, tf.float32) 

        image = tf.image.resize(image, size=[IMAGE_SIZE, IMAGE_SIZE])
        image = tf.keras.applications.densenet.preprocess_input(image)
        # image = (image/255) * 2 - 1
        

        raw_labels = tf.cast(tf.reshape(raw_labels, [-1, 5]), tf.float32)

        xmin = raw_labels[:, 0]
        ymin = raw_labels[:, 1]
        xmax = raw_labels[:, 2]
        ymax = raw_labels[:, 3]
        class_num = raw_labels[:, 4]

        xcenter = (xmin + xmax) * 1.0 / 2.0 * width_ratio
        ycenter = (ymin + ymax) * 1.0 / 2.0 * height_ratio

        box_w = (xmax - xmin) * width_ratio
        box_h = (ymax - ymin) * height_ratio

        labels = tf.stack([xcenter, ycenter, box_w, box_h, class_num], axis=1)

        return image, labels, tf.cast(object_num, tf.int32)

    def generate(self):
        dataset = tf.data.Dataset.from_tensor_slices((self.image_names, 
                                                      np.array(self.record_list), 
                                                      np.array(self.object_num_list)))
        dataset = dataset.shuffle(100000)
        dataset = dataset.map(self._data_preprocess, 
                              num_parallel_calls = tf.data.experimental.AUTOTUNE)
        dataset = dataset.batch(BATCH_SIZE)
        dataset = dataset.prefetch(buffer_size=200)

        return dataset
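
The label conversion in `_data_preprocess` can be checked by hand with a plain-Python sketch. The helper name `to_center_format` and the example numbers are illustrative assumptions; the arithmetic mirrors the corner-to-centre conversion and the rescaling to the 448 x 448 frame.

```python
def to_center_format(box, orig_w, orig_h, image_size=448):
    """Convert [xmin, ymin, xmax, ymax] in original pixels to
    (x_center, y_center, w, h) in the resized image_size x image_size frame."""
    xmin, ymin, xmax, ymax = box
    wr = image_size / orig_w   # width ratio
    hr = image_size / orig_h   # height ratio
    return ((xmin + xmax) / 2 * wr, (ymin + ymax) / 2 * hr,
            (xmax - xmin) * wr, (ymax - ymin) * hr)

# An 896 x 224 image is squeezed horizontally (x0.5) and stretched vertically (x2).
print(to_center_format([100, 50, 300, 150], orig_w=896, orig_h=224))
# (100.0, 200.0, 100.0, 200.0)
```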

### Model
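* DenseNet121 is used as the base model because it performed better than the other models we tried. DenseNet169 and DenseNet201 were not used because their gains were limited and they also lengthen training time.
* In addition to the base model, 8 convolution layers are kept, with Batch Normalization added to each of them. The rest is the same as the original code.
* Reference: https://arxiv.org/abs/1608.06993
* Reference: https://keras.io/api/applications/densenet/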
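
The shapes in the model below follow from a little arithmetic, sketched here. It assumes that DenseNet121 without its top downsamples by a factor of 32, which agrees with the 14 x 14 feature maps in the model summary; the variable names are chosen for this illustration only.

```python
IMAGE_SIZE, CELL_SIZE, BOXES_PER_CELL, NUM_CLASSES = 448, 7, 2, 20

backbone_stride = 32                           # assumed total stride of DenseNet121 (include_top=False)
feature_map = IMAGE_SIZE // backbone_stride    # spatial size entering the extra conv layers
after_stride2_conv = -(-feature_map // 2)      # ceil division: stride-2 conv with "same" padding
per_cell = NUM_CLASSES + BOXES_PER_CELL * 5    # class scores + (confidence, x, y, w, h) per box
output_units = CELL_SIZE * CELL_SIZE * per_cell

print(feature_map, after_stride2_conv, per_cell, output_units)  # 14 7 30 1470
```

The single stride-2 convolution brings the 14 x 14 map down to the 7 x 7 grid, and 7 x 7 x 30 gives the 1470 units of the final dense layer.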

In [13]:
from tensorflow import keras
from tensorflow.keras import layers

In [14]:
def conv_leaky_relu(inputs, filters, size, stride):
    x = layers.Conv2D(filters, size, stride, padding="same")(inputs)
    x = layers.LeakyReLU(0.1)(x)
    x = layers.BatchNormalization()(x)
    return x

In [15]:
img_inputs = keras.Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 3))

x = tf.keras.applications.densenet.DenseNet121(include_top=False, weights="imagenet")(img_inputs)

x = conv_leaky_relu(x, 512, 1, 1)
x = conv_leaky_relu(x, 1024, 3, 1)
x = conv_leaky_relu(x, 512, 1, 1)
x = conv_leaky_relu(x, 1024, 3, 1)
x = conv_leaky_relu(x, 1024, 3, 1)
x = conv_leaky_relu(x, 1024, 3, 2)
x = conv_leaky_relu(x, 1024, 3, 1)
x = conv_leaky_relu(x, 1024, 3, 1)

x = layers.Flatten()(x)
x = layers.Dense(4096, kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.01))(x)
x = layers.LeakyReLU(0.1)(x)
outputs = layers.Dense(1470, kernel_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.01))(x)

YOLO = keras.Model(inputs=img_inputs, outputs=outputs, name="YOLO")

In [16]:
YOLO.summary()

Model: "YOLO"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 448, 448, 3)]     0         
_________________________________________________________________
densenet121 (Functional)     (None, None, None, 1024)  7037504   
_________________________________________________________________
conv2d (Conv2D)              (None, 14, 14, 512)       524800    
_________________________________________________________________
leaky_re_lu (LeakyReLU)      (None, 14, 14, 512)       0         
_________________________________________________________________
batch_normalization (BatchNo (None, 14, 14, 512)       2048      
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 14, 14, 1024)      4719616   
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 14, 14, 1024)      0      

### Define Loss

In [17]:
# base boxes (for loss calculation)
base_boxes = np.zeros([CELL_SIZE, CELL_SIZE, 4])

# initialization for each cell
for y in range(CELL_SIZE):
    for x in range(CELL_SIZE):
        base_boxes[y, x, :] = [IMAGE_SIZE / CELL_SIZE * x, 
                               IMAGE_SIZE / CELL_SIZE * y, 0, 0]

base_boxes = np.resize(base_boxes, [CELL_SIZE, CELL_SIZE, 1, 4])
base_boxes = np.tile(base_boxes, [1, 1, BOXES_PER_CELL, 1])

In [18]:
def iou(boxes1, boxes2):
    """calculate ious
    Args:
      boxes1: 4-D tensor [CELL_SIZE, CELL_SIZE, BOXES_PER_CELL, 4]  ====> (x_center, y_center, w, h)
      boxes2: 1-D tensor [4] ===> (x_center, y_center, w, h)

    Return:
      iou: 3-D tensor [CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
      ====> iou score for each cell
    """

    #boxes1 : [4(xmin, ymin, xmax, ymax), cell_size, cell_size, boxes_per_cell]
    boxes1 = tf.stack([boxes1[:, :, :, 0] - boxes1[:, :, :, 2] / 2, boxes1[:, :, :, 1] - boxes1[:, :, :, 3] / 2,
                      boxes1[:, :, :, 0] + boxes1[:, :, :, 2] / 2, boxes1[:, :, :, 1] + boxes1[:, :, :, 3] / 2])

    #boxes1 : [cell_size, cell_size, boxes_per_cell, 4(xmin, ymin, xmax, ymax)]
    boxes1 = tf.transpose(boxes1, [1, 2, 3, 0])

    boxes2 =  tf.stack([boxes2[0] - boxes2[2] / 2, boxes2[1] - boxes2[3] / 2,
                      boxes2[0] + boxes2[2] / 2, boxes2[1] + boxes2[3] / 2])

    #calculate the left up point of boxes' overlap area
    lu = tf.maximum(boxes1[:, :, :, 0:2], boxes2[0:2])
    #calculate the right down point of boxes overlap area
    rd = tf.minimum(boxes1[:, :, :, 2:], boxes2[2:])

    #intersection
    intersection = rd - lu 

    #the size of the intersection area
    inter_square = intersection[:, :, :, 0] * intersection[:, :, :, 1]

    mask = tf.cast(intersection[:, :, :, 0] > 0, tf.float32) * tf.cast(intersection[:, :, :, 1] > 0, tf.float32)

    #if intersection is negative, then the boxes don't overlap
    inter_square = mask * inter_square

    #calculate the areas of boxes1 and boxes2
    square1 = (boxes1[:, :, :, 2] - boxes1[:, :, :, 0]) * (boxes1[:, :, :, 3] - boxes1[:, :, :, 1])
    square2 = (boxes2[2] - boxes2[0]) * (boxes2[3] - boxes2[1])

    return inter_square/(square1 + square2 - inter_square + 1e-6)
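
For checking the tensor version by hand, the same IoU computation can be written for two single boxes in plain Python. This is a sketch with a hypothetical helper name, `iou_center`; it uses the same centre format and the same `1e-6` term in the denominator.

```python
def iou_center(box_a, box_b):
    """IoU of two boxes given as (x_center, y_center, w, h)."""
    def corners(b):
        x, y, w, h = b
        return x - w / 2, y - h / 2, x + w / 2, y + h / 2

    ax1, ay1, ax2, ay2 = corners(box_a)
    bx1, by1, bx2, by2 = corners(box_b)
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / (union + 1e-6)

# Two 100 x 100 boxes offset by 50 pixels: intersection 2500, union 17500.
print(round(iou_center((50, 50, 100, 100), (100, 100, 100, 100)), 4))  # 0.1429
```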

In [19]:
def losses_calculation(predict, label):
    """
    calculate loss
    Args:
      predict: 3-D tensor [cell_size, cell_size, num_classes + 5 * boxes_per_cell]
      label : [1, 5]  (x_center, y_center, w, h, class)
    """
    label = tf.reshape(label, [-1])

    #Step A. calculate objects tensor [CELL_SIZE, CELL_SIZE]
    #turn pixel position into cell position (corner)
    min_x = (label[0] - label[2] / 2) / (IMAGE_SIZE / CELL_SIZE)
    max_x = (label[0] + label[2] / 2) / (IMAGE_SIZE / CELL_SIZE)

    min_y = (label[1] - label[3] / 2) / (IMAGE_SIZE / CELL_SIZE)
    max_y = (label[1] + label[3] / 2) / (IMAGE_SIZE / CELL_SIZE)

    min_x = tf.floor(min_x)
    min_y = tf.floor(min_y)

    max_x = tf.minimum(tf.math.ceil(max_x), CELL_SIZE)
    max_y = tf.minimum(tf.math.ceil(max_y), CELL_SIZE)
    
    #calculate mask of object with cells
    onset = tf.cast(tf.stack([max_y - min_y, max_x - min_x]), dtype=tf.int32)
    object_mask = tf.ones(onset, tf.float32)

    offset = tf.cast(tf.stack([min_y, CELL_SIZE - max_y, min_x, CELL_SIZE - max_x]), tf.int32)
    offset = tf.reshape(offset, (2, 2))
    object_mask = tf.pad(object_mask, offset, "CONSTANT")

    #Step B. calculate the coordinates of the object center and the corresponding mask
    #turn pixel position into cell position (center)
    center_x = label[0] / (IMAGE_SIZE / CELL_SIZE)
    center_x = tf.floor(center_x)

    center_y = label[1] / (IMAGE_SIZE / CELL_SIZE)
    center_y = tf.floor(center_y)

    response = tf.ones([1, 1], tf.float32)

    #calculate the coordinates of the object center in cell units
    objects_center_coord = tf.cast(tf.stack([center_y, CELL_SIZE - center_y - 1, 
                             center_x, CELL_SIZE - center_x - 1]), 
                             tf.int32)
    objects_center_coord = tf.reshape(objects_center_coord, (2, 2))

    #make mask
    response = tf.pad(response, objects_center_coord, "CONSTANT")

    #Step C. calculate iou_predict_truth [CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
    predict_boxes = predict[:, :, NUM_CLASSES + BOXES_PER_CELL:]

    predict_boxes = tf.reshape(predict_boxes, [CELL_SIZE, 
                                               CELL_SIZE, 
                                               BOXES_PER_CELL, 4])
    #cell position to pixel position
    predict_boxes = predict_boxes * [IMAGE_SIZE / CELL_SIZE, 
                                     IMAGE_SIZE / CELL_SIZE, 
                                     IMAGE_SIZE, IMAGE_SIZE]

    #if there's no predict_box in that cell, the base_boxes alone are compared with the label and give an IoU of 0
    predict_boxes = base_boxes + predict_boxes

    iou_predict_truth = iou(predict_boxes, label[0:4])

    #calculate C tensor [CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
    C = iou_predict_truth * tf.reshape(response, [CELL_SIZE, CELL_SIZE, 1])

    #calculate I tensor [CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
    I = iou_predict_truth * tf.reshape(response, [CELL_SIZE, CELL_SIZE, 1])

    max_I = tf.reduce_max(I, 2, keepdims=True)

    #replace large iou scores with response (object center) value
    I = tf.cast((I >= max_I), tf.float32) * tf.reshape(response, (CELL_SIZE, CELL_SIZE, 1))

    #calculate no_I tensor [CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
    no_I = tf.ones_like(I, dtype=tf.float32) - I

    p_C = predict[:, :, NUM_CLASSES:NUM_CLASSES + BOXES_PER_CELL]

    #calculate truth x, y, sqrt_w, sqrt_h 0-D
    x = label[0]
    y = label[1]

    sqrt_w = tf.sqrt(tf.abs(label[2]))
    sqrt_h = tf.sqrt(tf.abs(label[3]))

    #calculate predict p_x, p_y, p_sqrt_w, p_sqrt_h 3-D [CELL_SIZE, CELL_SIZE, BOXES_PER_CELL]
    p_x = predict_boxes[:, :, :, 0]
    p_y = predict_boxes[:, :, :, 1]

    p_sqrt_w = tf.sqrt(tf.minimum(IMAGE_SIZE * 1.0, tf.maximum(0.0, predict_boxes[:, :, :, 2])))
    p_sqrt_h = tf.sqrt(tf.minimum(IMAGE_SIZE * 1.0, tf.maximum(0.0, predict_boxes[:, :, :, 3])))

    #calculate ground truth p 1-D tensor [NUM_CLASSES]
    P = tf.one_hot(tf.cast(label[4], tf.int32), NUM_CLASSES, dtype=tf.float32)

    #calculate predicted p_P 3-D tensor [CELL_SIZE, CELL_SIZE, NUM_CLASSES]
    p_P = predict[:, :, 0:NUM_CLASSES]

    #class_loss
    class_loss = tf.nn.l2_loss(tf.reshape(object_mask, (CELL_SIZE, CELL_SIZE, 1)) * (p_P - P)) * CLASS_SCALE

    #object_loss
    object_loss = tf.nn.l2_loss(I * (p_C - C)) * OBJECT_SCALE

    #noobject_loss
    noobject_loss = tf.nn.l2_loss(no_I * (p_C)) * NOOBJECT_SCALE

    #coord_loss
    coord_loss = (tf.nn.l2_loss(I * (p_x - x)/(IMAGE_SIZE/CELL_SIZE)) +
                  tf.nn.l2_loss(I * (p_y - y)/(IMAGE_SIZE/CELL_SIZE)) +
                  tf.nn.l2_loss(I * (p_sqrt_w - sqrt_w))/IMAGE_SIZE +
                  tf.nn.l2_loss(I * (p_sqrt_h - sqrt_h))/IMAGE_SIZE) * COORD_SCALE

    return class_loss + object_loss + noobject_loss + coord_loss
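
Two ideas in the loss above can be illustrated with scalars: which grid cell is responsible for an object, and why the width and height terms use square roots. The helpers `responsible_cell` and `size_error` are hypothetical names for this sketch and are not part of the notebook.

```python
import math

def responsible_cell(x_center, y_center, image_size=448, cell_size=7):
    """Return (row, col) of the grid cell that contains the object centre."""
    cell = image_size / cell_size  # 64 pixels per cell
    return int(math.floor(y_center / cell)), int(math.floor(x_center / cell))

def size_error(w_true, w_pred):
    """Squared error on sqrt(width), the form used for the w/h terms."""
    return (math.sqrt(w_pred) - math.sqrt(w_true)) ** 2

print(responsible_cell(100.0, 200.0))              # (3, 1)
# The same 10-pixel error costs more on a small box than on a large one.
print(size_error(20, 30) > size_error(200, 210))   # True
```

Only the predictors in the responsible cell contribute to the coordinate and object terms, which is what the `response` mask encodes.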

In [20]:
def yolo_loss(predicts, labels, objects_num):
    """
    Add Loss to all the trainable variables
    Args:
        predicts: 4-D tensor [batch_size, cell_size, cell_size, num_classes + 5 * boxes_per_cell]
        ===> (num_classes, boxes_per_cell, 4 * boxes_per_cell)
        labels  : 3-D tensor of [batch_size, max_objects, 5]
        objects_num: 1-D tensor [batch_size]
    """

    loss = 0.
    
    #you can parallel the code with tf.map_fn or tf.vectorized_map (big performance gain!)
    for i in tf.range(predicts.shape[0]):
        predict = predicts[i, :, :, :]
        label = labels[i, :, :]
        object_num = objects_num[i]

        for j in tf.range(object_num):
            results = losses_calculation(predict, label[j:j+1, :])
            loss = loss + results

    return loss/predicts.shape[0]

### Training


In [21]:
dataset = DatasetGenerator().generate()
optimizer = tf.keras.optimizers.Adam(LEARNING_RATE)
train_loss_metric = tf.keras.metrics.Mean(name='loss')

In [22]:
ckpt = tf.train.Checkpoint(epoch=tf.Variable(0), net=YOLO)
manager = tf.train.CheckpointManager(ckpt, './ckpts/YOLO', max_to_keep=5, checkpoint_name='yolo')

ckpt.restore('./ckpts/YOLO/yolo-146')
# ckpt.restore(manager.latest_checkpoint)

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fe4ee1ce2e8>

In [23]:
@tf.function
def train_step(image, labels, objects_num):
    with tf.GradientTape() as tape:
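        # training=True so the BatchNormalization layers use batch statistics during training
        outputs = YOLO(image, training=True)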
        class_end = CELL_SIZE * CELL_SIZE * NUM_CLASSES
        conf_end = class_end + CELL_SIZE * CELL_SIZE * BOXES_PER_CELL
        class_probs = tf.reshape(outputs[:, 0:class_end], (-1, 7, 7, 20))
        confs = tf.reshape(outputs[:, class_end:conf_end], (-1, 7, 7, 2))
        boxes = tf.reshape(outputs[:, conf_end:], (-1, 7, 7, 2*4))
        predicts = tf.concat([class_probs, confs, boxes], 3)

        loss = yolo_loss(predicts, labels, objects_num)
        train_loss_metric(loss)

    grads = tape.gradient(loss, YOLO.trainable_weights)
    optimizer.apply_gradients(zip(grads, YOLO.trainable_weights))
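
The slicing in `train_step` implies a specific layout of the flat 1470-element output vector. The sketch below reproduces the slice boundaries and the row-major index of one entry in each block; the helper names are illustrative assumptions, while the arithmetic follows the `reshape` calls above.

```python
CELL_SIZE, NUM_CLASSES, BOXES_PER_CELL = 7, 20, 2

class_end = CELL_SIZE * CELL_SIZE * NUM_CLASSES                 # end of the class block
conf_end = class_end + CELL_SIZE * CELL_SIZE * BOXES_PER_CELL   # end of the confidence block
total = conf_end + CELL_SIZE * CELL_SIZE * BOXES_PER_CELL * 4   # end of the box block

def class_index(row, col, cls):
    """Flat index of the class score for (row, col, cls)."""
    return (row * CELL_SIZE + col) * NUM_CLASSES + cls

def conf_index(row, col, box):
    """Flat index of the confidence of predictor `box` in cell (row, col)."""
    return class_end + (row * CELL_SIZE + col) * BOXES_PER_CELL + box

def box_index(row, col, box, coord):
    """Flat index of coordinate `coord` (0..3) of predictor `box` in cell (row, col)."""
    return conf_end + ((row * CELL_SIZE + col) * BOXES_PER_CELL + box) * 4 + coord

print(class_end, conf_end, total)                                        # 980 1078 1470
print(class_index(0, 1, 0), conf_index(6, 6, 1), box_index(6, 6, 1, 3))  # 20 1077 1469
```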

In [24]:
from tqdm import tqdm
from datetime import datetime

print("{}, start training.".format(datetime.now()))
for i in range(EPOCHS):
    train_loss_metric.reset_states()
    ckpt.epoch.assign_add(1)

    for idx, (image, labels, objects_num) in enumerate(tqdm(dataset)):
        train_step(image, labels, objects_num)

    print("{}, Epoch {}: loss {:.2f}".format(datetime.now(), i+1, train_loss_metric.result()))

    save_path = manager.save()
    print("Saved checkpoint for epoch {}: {}".format(int(ckpt.epoch), save_path))  

2023-11-19 16:06:21.482863, start training.


100%|██████████| 1185/1185 [07:27<00:00,  2.65it/s]


2023-11-19 16:13:49.408052, Epoch 1: loss 1.77
Saved checkpoint for epoch 147: ./ckpts/YOLO/yolo-147


### Predict Test data
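* Output processing was changed from keeping only the prediction with the highest confidence score to outputting every prediction whose confidence score exceeds (highest confidence score * threshold).
* When inspecting the predictions, a threshold of 0.5 gave the best results. Raising the threshold to 0.6 and 0.7 made the results worse.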
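
The relative-threshold rule can be sketched on a plain list of scores. The function name `keep_relative` and the example scores are made up for illustration; the rule itself, keeping every score above `ratio * max(scores)`, is the one described above.

```python
def keep_relative(scores, ratio=0.5):
    """Return the indices whose score exceeds ratio * max(scores)."""
    if not scores:
        return []
    cutoff = max(scores) * ratio
    return [i for i, s in enumerate(scores) if s > cutoff]

scores = [0.80, 0.45, 0.39, 0.10]
print(keep_relative(scores, 0.5))  # [0, 1]
print(keep_relative(scores, 0.7))  # [0]
```

A higher ratio keeps fewer boxes per image, so it trades missed objects against spurious detections.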

In [25]:
def process_outputs(outputs):
    """
    Process YOLO outputs into bounding boxes
    """

    class_end = CELL_SIZE * CELL_SIZE * NUM_CLASSES
    conf_end = class_end + CELL_SIZE * CELL_SIZE * BOXES_PER_CELL
    class_probs = np.reshape(outputs[:, 0:class_end], (-1, 7, 7, 20))
    confs = np.reshape(outputs[:, class_end:conf_end], (-1, 7, 7, 2))
    boxes = np.reshape(outputs[:, conf_end:], (-1, 7, 7, 2 * 4))
    predicts = np.concatenate([class_probs, confs, boxes], 3)

    p_classes = predicts[0, :, :, 0:20]
    C = predicts[0, :, :, 20:22]
    coordinate = predicts[0, :, :, 22:]

    p_classes = np.reshape(p_classes, (CELL_SIZE, CELL_SIZE, 1, 20))
    C = np.reshape(C, (CELL_SIZE, CELL_SIZE, BOXES_PER_CELL, 1))

    P = C * p_classes
    # P's shape [7, 7, 2, 20]

    xmin, ymin, xmax, ymax, class_num, conf = [], [], [], [], [], []

    max_conf_idx = np.unravel_index(np.argmax(P), P.shape)
    threshold = P[max_conf_idx] * 0.5

    for i in range(np.prod(P.shape)):
        idx = np.unravel_index(i, P.shape)
        if (P[idx] > threshold):
            class_num.append(idx[3])
            conf.append(P[idx])

            coordinate = np.reshape(coordinate, (CELL_SIZE, CELL_SIZE, BOXES_PER_CELL, 4))
            xcenter, ycenter, w, h = coordinate[idx[0], idx[1], idx[2], :]

            xcenter = (idx[1] + xcenter) * (IMAGE_SIZE / float(CELL_SIZE))
            ycenter = (idx[0] + ycenter) * (IMAGE_SIZE / float(CELL_SIZE))

            w = w * IMAGE_SIZE
            h = h * IMAGE_SIZE

            xmin.append(xcenter - w / 2.0)
            ymin.append(ycenter - h / 2.0)

            xmax.append(xmin[-1] + w)
            ymax.append(ymin[-1] + h)

    return xmin, ymin, xmax, ymax, class_num, conf

In [26]:
test_img_files = open('./dataset/pascal_voc_testing_data.txt')
test_img_dir = './dataset/VOCdevkit_test/VOC2007/JPEGImages/'
test_images = []

for line in test_img_files:
    line = line.strip()
    ss = line.split(' ')
    test_images.append(ss[0])



def load_img_data(image_name):
    image_file = tf.io.read_file(test_img_dir+image_name)
    image = tf.image.decode_jpeg(image_file, channels=3)

    h = tf.shape(image)[0]
    w = tf.shape(image)[1]

    image = tf.image.resize(image, size=[IMAGE_SIZE, IMAGE_SIZE])
    image = tf.keras.applications.densenet.preprocess_input(image)

    return image_name, image, h, w


test_dataset = tf.data.Dataset.from_tensor_slices(test_images)\
                              .map(load_img_data, num_parallel_calls = tf.data.experimental.AUTOTUNE)\
                              .batch(32)\
                              .prefetch(tf.data.AUTOTUNE)

In [28]:
ckpt = tf.train.Checkpoint(net=YOLO)
ckpt.restore('./ckpts/YOLO/yolo-147')

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x7fe4ee1c8320>

In [29]:
@tf.function
def prediction_step(img):
    return YOLO(img, training=False)

In [30]:
output_file = open('./test_prediction_with_aug.txt', 'w')

for img_name, test_img, img_h, img_w in test_dataset:
    batch_num = img_name.shape[0]
    for i in range(batch_num):
        xmin, ymin, xmax, ymax, class_num, conf = process_outputs(prediction_step(test_img[i:i+1]))
        for j in range(len(xmin)):
            xmin[j] = xmin[j] * (img_w[i] / IMAGE_SIZE)
            ymin[j] = ymin[j] * (img_h[i] / IMAGE_SIZE)
            xmax[j] = xmax[j] * (img_w[i] / IMAGE_SIZE)
            ymax[j] = ymax[j] * (img_h[i] / IMAGE_SIZE)
            output_file.write(img_name[i].numpy().decode('ascii')+" %d %d %d %d %d %f\n"%(
                xmin[j], ymin[j], xmax[j], ymax[j], class_num[j], conf[j]))
                
output_file.close()

In [1]:
import sys
sys.path.insert(0, './evaluate')

In [2]:
import evaluate
import pandas as pd

evaluate.evaluate('./test_prediction_with_aug.txt', './output_file_with_aug.csv')

cap = pd.read_csv('./output_file_with_aug.csv')['packedCAP']
print('score: {:f}'.format(sum((1. - cap) ** 2) / len(cap)))



End Evalutation
score: 0.376731
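
The score printed above is the mean of `(1 - cap) ** 2` over the `packedCAP` column. A minimal sketch with made-up values shows the arithmetic; the function name `cup_score` is hypothetical, and reading lower scores as better is an inference from the formula, since a CAP of 1 contributes 0.

```python
def cup_score(cap_values):
    """Mean of (1 - CAP)^2 over all rows."""
    return sum((1.0 - c) ** 2 for c in cap_values) / len(cap_values)

# Made-up CAP values: a perfect row, a half-right row, and a fully wrong row.
print(cup_score([1.0, 0.5, 0.0]))  # (0 + 0.25 + 1) / 3
```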


### Conclusion
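At the start of this competition, we ran the TA's code and found that the score was still some distance from the threshold of 80, so we assumed we would have to implement a newer model architecture to raise it. However, once we processed the data, the score improved significantly. As in Competition 1, this confirmed that the quality of the training data strongly affects the result. In the end, combining several approaches pushed the score past the threshold of 80. Improving further would probably require changing the model architecture, but because the schedule conflicted with assignments from other courses, we did not continue.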

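The main difficulty was the trade-off between the amount of training data and the training time, because this competition has far more training data than the assignments and Competition 1. Our only solution was to reduce the amount of augmentation.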

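Finally, we came to appreciate how important the GPU is for DL. Training on a 1080 Ti at first was relatively slow, and its smaller memory also limited the batch size and model size. After later switching to a 4090, the difference was clearly noticeable.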