--------------------

# RTML LAB Report 
## Lab 4.2 YOLO training

----------------------



Name = 'ati tesakulsiri'
ID = 'st123009'

## Part 1 Introduction

- #### Object detection
Object detection is the computer vision task of identifying and locating objects in digital images or videos. Typical targets include people, vehicles, chairs, stones, structures, and animals.

- #### YOLO
YOLO is a method that provides real-time object detection using neural networks. The algorithm is popular because of its accuracy and speed. It has been applied in a variety of ways to identify animals, humans, parking meters, and traffic lights.

The YOLO method for object detection is described in this article along with its workings. It also highlights a few of its practical uses.

- #### YOLOv4
- YOLO v4 was developed based on YOLO v3 by a new group of authors, Alexey Bochkovskiy and colleagues, who took
over the development of Darknet and YOLO after [Joseph Redmon quit computer vision research](https://twitter.com/pjreddie/status/1230524770350817280?lang=en).
- Take a look at the [YOLO v4 paper](https://arxiv.org/abs/2004.10934). The authors make many small and some large
improvements to YOLOv3 to achieve a higher frame rate and higher accuracy. Source code is available at the
[Darknet GitHub repository](https://github.com/AlexeyAB/darknet).
    - Bag of specials

        - Mish activation function


            - Next, let's take a look at the newish activation function used in YOLOv4: Mish.
        Mish is a smooth, non-monotonic activation function built on SoftPlus and designed for
        neural networks that regularize themselves. It was inspired by the *swish* activation function.
        Its output ranges from about -0.31 to $\infty$, and it is defined in terms of the SoftPlus function:

$$\mathrm{SoftPlus}(x)=\ln(1+e^x) \\
f(x)=x \tanh(\mathrm{SoftPlus}(x))=x \tanh(\ln(1+e^x))$$
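As a quick numerical check of the definition above, here is a minimal standard-library sketch of Mish. The helper names `softplus` and `mish` are chosen for illustration and are not the lab's PyTorch module.

```python
import math

def softplus(x):
    # Numerically stable ln(1 + e^x): avoids overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    # f(x) = x * tanh(SoftPlus(x))
    return x * math.tanh(softplus(x))

# Scan a grid to see the lower bound of the output range.
grid = [i / 100.0 for i in range(-500, 501)]
print(min(mish(x) for x in grid))  # close to -0.31
print(mish(0.0))                   # 0.0
```

For large positive inputs Mish approaches the identity, and for large negative inputs it approaches zero from below.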

<img src = '/root/keep_lab/RTML_Labsession/04_y_olo3/to_submit/mish_activation_function_graph.png' title="weight" style="width: 480px;" />


- ### Mean Average Precision
Mean Average Precision (mAP) is a metric used to evaluate object detection models such as Fast R-CNN, YOLO, and Mask R-CNN. Average precision (AP) summarizes the precision-recall curve over recall values from 0 to 1, and mAP is the mean of the AP values.
- The mAP formula is based on the following sub-metrics:
    - Confusion Matrix,
    - Intersection over Union(IoU),
    - Recall, 
    - Precision


To create a confusion matrix, we need four attributes:

True Positives (TP): The model predicts a label and it matches the ground truth.

True Negatives (TN): The model does not predict a label and it is not part of the ground truth.

False Positives (FP): The model predicts a label, but it is not part of the ground truth (Type I error).

False Negatives (FN): The model does not predict a label, but it is part of the ground truth (Type II error).

**Intersection over Union** measures the overlap between the predicted bounding box and the ground truth box. A higher IoU indicates that the predicted bounding box coordinates closely resemble the ground truth box coordinates.
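The following is a minimal sketch of IoU for two axis-aligned boxes, assuming corner format `(x1, y1, x2, y2)`. The function name `iou` is illustrative and is not the lab's IoU helper.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the intersection rectangle (0 if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the doubly counted intersection.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.143
```

In detection evaluation, a prediction is usually counted as a true positive only when its IoU with a ground truth box exceeds a chosen threshold.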

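**Precision**
Precision measures the fraction of positive predictions that are true positives: $\text{Precision} = TP / (TP + FP)$. Recall measures the fraction of ground truth objects that the model finds: $\text{Recall} = TP / (TP + FN)$.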

Average Precision is calculated as the weighted mean of precisions at each threshold; the weight is the increase in recall from the prior threshold.
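Written as a formula, with $P_n$ and $R_n$ denoting the precision and recall at the $n$-th threshold (this notation is introduced here for illustration), the weighted mean described above is:

$$\mathrm{AP} = \sum_n (R_n - R_{n-1})\, P_n$$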

**Mean Average Precision** is the average of AP of each class. However, the interpretation of AP and mAP varies in different contexts. For instance, in the evaluation document of the COCO object detection challenge, AP and mAP are the same.

Here is a summary of the steps to calculate the AP:

1. Generate the prediction scores using the model.
2. Convert the prediction scores to class labels.
3. Calculate the confusion matrix: TP, FP, TN, FN.
4. Calculate the precision and recall metrics.
5. Calculate the area under the precision-recall curve.
6. Measure the average precision.

The mAP is calculated by finding the Average Precision (AP) for each class and then averaging over all classes.
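The sketch below illustrates the AP and mAP computation on toy data. It assumes each detection has already been marked as a true or false positive (for example by an IoU threshold), and it uses the plain recall-weighted sum of precisions; benchmark toolkits typically use interpolated variants. All names and numbers are hypothetical.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """AP as the sum of precision values weighted by the increase in recall."""
    if num_ground_truth == 0:
        return 0.0
    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / num_ground_truth
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

def mean_average_precision(ap_per_class):
    """mAP is the plain average of the per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Toy class with 2 ground truth objects and 4 detections.
ap_cat = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False], 2)
print(ap_cat)                                                 # about 0.833
print(mean_average_precision({"cat": ap_cat, "dog": 0.5}))    # about 0.667
```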

- ### Complete IoU Loss (CIoU loss)
CIoU loss for bounding box regression uses three geometric factors:

- The overlap area between the predicted box and the ground truth box (IoU loss)
- The distance between the center points of the predicted box and the ground truth box (DIoU loss)
- The aspect ratio of the predicted box and the ground truth box

Because CIoU loss uses all three geometric factors, it converges faster than GIoU loss. It improves average precision (AP) and average recall (AR) for object detection and segmentation.

CIoU loss aggregates the overlap area, distance, and aspect ratio terms, which is why it is referred to as Complete IoU loss:

- S is the overlap area term, S = 1 - IoU.
- D is the normalized distance between the center points of the predicted and ground truth boxes.
- V is the consistency of the aspect ratio.

S, V, and D are all invariant to the regression scale and are normalized to values between 0 and 1.

CIoU loss, like GIoU loss and DIoU loss, moves the predicted bounding box towards the ground truth bounding box for non-overlapping cases.

CIoU loss needs fewer iterations to converge than GIoU loss, and it makes regression very fast for boxes with extreme aspect ratios.
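In symbols, matching the `CIOU_xywh_torch` implementation in Part 3, let $\rho$ be the distance between the two box centers, $c$ the diagonal of the smallest box enclosing both, and $w, h$ and $w^{gt}, h^{gt}$ the widths and heights of the predicted and ground truth boxes (the symbol names are notation chosen here):

$$\mathrm{CIoU} = \mathrm{IoU} - \left(\frac{\rho^2}{c^2} + \alpha v\right), \qquad
v = \frac{4}{\pi^2}\left(\arctan\frac{w}{h} - \arctan\frac{w^{gt}}{h^{gt}}\right)^2, \qquad
\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$$

$$\mathcal{L}_{\mathrm{CIoU}} = 1 - \mathrm{CIoU} = \underbrace{(1 - \mathrm{IoU})}_{S} + \underbrace{\frac{\rho^2}{c^2}}_{D} + \underbrace{\alpha v}_{V}$$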

## Part 2 Method
- We continue from the previous lab, which covered:
   1. Implementation of the Mish activation function
   2. An option for the maxpool layer in the `create_modules` function and in the model's `forward()` method
   3. Enabling a `[route]` module to concatenate more than two previous layers
   4. Loading the pre-trained weights [provided by the authors](https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.weights)
   5. Scaling inputs to 608×608 and making sure the input channels are passed in RGB order, not OpenCV's BGR order
   <br>

<br>

- Now we implement the following:
   train the YOLOv4 model on the COCO dataset (or another dataset if one is available).
   The purpose here is not to get the best possible model (that would require implementing all
   of the "bag of freebies" training tricks described in the paper), but to implement just some of them to
   get a feel for their importance.

1. Get a set of ImageNet pretrained weights for CSPDarknet53

2. Load the pretrained weights into the backbone portion of PyTorch YOLOv4 model.

3. Implement a function similar to the train-model function developed in previous labs for classifiers. It preprocesses
   the input with basic augmentation transformations, converts the anchor-relative outputs to bounding box
   coordinates, computes MSE loss for the bounding box coordinates, backpropagates the loss, and takes an
   optimizer step. Use the recommended IoU thresholds to determine which predicted bounding boxes to include
   in the loss. Many examples of how to do this are available online.

4. Train model on COCO. 

5. Compute mAP for model on the COCO validation set.

6. Implement the CIoU loss function.
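To make the "anchor-relative outputs to bounding box coordinates" part of step 3 concrete, here is a minimal sketch of YOLOv3-style decoding for a single prediction. It assumes anchors are given in pixels and that the raw outputs `tx, ty, tw, th` come from one grid cell; the function name and the example numbers are hypothetical, and this is not the lab's own decoding code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h, stride):
    """Convert raw network outputs for one anchor into a pixel-space (cx, cy, w, h) box."""
    # The sigmoid keeps the center offset inside its grid cell; stride maps cells to pixels.
    cx = (sigmoid(tx) + cell_x) * stride
    cy = (sigmoid(ty) + cell_y) * stride
    # Width and height scale the anchor dimensions exponentially.
    w = anchor_w * math.exp(tw)
    h = anchor_h * math.exp(th)
    return cx, cy, w, h

# All-zero outputs place the box at the center of cell (3, 5) with the anchor's size.
print(decode_box(0.0, 0.0, 0.0, 0.0, cell_x=3, cell_y=5, anchor_w=116, anchor_h=90, stride=32))
# (112.0, 176.0, 116.0, 90.0)
```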


## Part 3 Results

- ### 1, 2. CSPDarknet53 pre-trained weights for the backbone
- Downloaded weight files:
   - `-rw-r--r-- 1 root 5097 110710068 Feb 10 04:16 csdarknet53-omega_final.weights`
   - `-rw-r--r-- 1 root 5097 248007048 Feb 10 04:16 yolov3.weights`
   - `-rw-r--r-- 1 root 5097 257717640 Feb 10 04:16 yolov4.weights`

<br><br>

- To load weights into only part of the model (the backbone), we re-implemented the weight-loading method of the `Darknet` class as `load_weights_`.
> Note: parts of the code are replaced with `...` to shorten the report.
```python
    def load_weights_(self, weightfile, backbone=False):
        ...

        # stage 0: before the backbone (skip), 1: inside it (load), 2: past it (stop)
        stage = 0 if backbone else 1

        for i in range(len(self.module_list)):
            block = self.blocks[i + 1]
            module_type = block["type"]

            if backbone:
                if stage == 2:
                    break

                # A cfg block with backbone=0 starts loading;
                # a block with backbone=1 is the last one loaded.
                if "backbone" in block and int(block["backbone"]) == 0:
                    stage = 1
                elif "backbone" in block and int(block["backbone"]) == 1:
                    stage = 2

                if stage == 0:
                    continue

            # Only convolutional modules carry weights; other module types are ignored.
            if module_type == "convolutional":
                model = self.module_list[i]
                try:
                    batch_normalize = int(block["batch_normalize"])

                ...

                conv.weight.data.copy_(conv_weights)
```
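The stage logic in `load_weights_` can be isolated and checked on a toy block list. The sketch below assumes blocks are plain dicts and that a `backbone` key marks the first (`0`) and last (`1`) backbone block, as in the excerpt above; the helper name and the toy data are hypothetical.

```python
def backbone_block_indices(blocks):
    """Return the indices of the blocks that the stage logic would load."""
    stage = 0          # 0: before the backbone, 1: inside it, 2: past it
    selected = []
    for i, block in enumerate(blocks):
        if stage == 2:
            break
        marker = block.get("backbone")
        if marker is not None and int(marker) == 0:
            stage = 1
        elif marker is not None and int(marker) == 1:
            stage = 2
        if stage == 0:
            continue
        selected.append(i)
    return selected

blocks = [
    {"type": "convolutional"},                   # before the backbone: skipped
    {"type": "convolutional", "backbone": "0"},  # first backbone block
    {"type": "shortcut"},
    {"type": "convolutional", "backbone": "1"},  # last backbone block
    {"type": "convolutional"},                   # after the backbone: never reached
]
print(backbone_block_indices(blocks))  # [1, 2, 3]
```

Note that the block carrying `backbone=1` is still processed, because the `break` only fires on the following iteration.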


- We then load the weights with the new method:
```python
model = Darknet("cfg/yolov4.cfg")
# Edit Convo Layer 114
model.module_list[114].conv_114 = nn.Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
model.load_weights_("csdarknet53-omega_final.weights",True)
print("Network successfully loaded")

model.net_info["height"] = 608

```

- Here is the result:
```bash
Loading network.....
Network successfully loaded
```

- ### 3. Implement the training function
    - The function is adapted from the lab instructions.
```py
import numpy as np
import torch
import torch.nn as nn

def run_training(model, optimizer, dataloader, device, img_size, n_epoch, every_n_batch, every_n_epoch, ckpt_dir):
    # Loss weights and criteria. With the default reduction ("mean"),
    # each criterion call returns a single scalar.
    lambda_coord = 0.001
    lambda_noobj = 0.05
    mse = nn.MSELoss()
    bce = nn.BCELoss()

    for epoch_i in range(n_epoch):
        running_loss = 0.0
        for inputs, labels, bboxes in dataloader:
            # Channels-last image batch -> channels-first float tensor.
            inputs = torch.from_numpy(np.array(inputs)).squeeze(1).permute(0, 3, 1, 2).float()
            inputs = inputs.to(device)
            labels = torch.stack(labels).to(device)

            optimizer.zero_grad()
            with torch.set_grad_enabled(True):
                outputs = model(inputs, True)

                # Normalize box coordinates by the input size.
                pred_xywh = outputs[..., 0:4] / img_size
                pred_conf = outputs[..., 4:5]
                pred_cls = outputs[..., 5:]

                label_xywh = labels[..., :4] / img_size
                label_obj_mask = labels[..., 4:5]
                label_noobj_mask = 1.0 - label_obj_mask
                label_cls = labels[..., 5:]

                # MSE coordinate loss: computed, but not part of the total loss below.
                loss_coord = torch.sum(lambda_coord * label_obj_mask * mse(input=pred_xywh, target=label_xywh))
                loss_conf = torch.sum(
                    label_obj_mask * bce(input=pred_conf, target=label_obj_mask)
                    + lambda_noobj * label_noobj_mask * bce(input=pred_conf, target=label_obj_mask))
                loss_cls = torch.sum(label_obj_mask * bce(input=pred_cls, target=label_cls))

                # CIoU box loss over the cells that contain an object.
                ciou = CIOU_xywh_torch(pred_xywh, label_xywh).unsqueeze(-1)
                loss_ciou = torch.sum(label_obj_mask * (1.0 - ciou))

                loss = loss_ciou + loss_conf + loss_cls
                loss.backward()
                optimizer.step()

                running_loss += loss.item() * inputs.size(0)

        # 750 is the hard-coded normalizer used for the reported runs.
        epoch_loss = running_loss / 750
        print(epoch_loss)
        print('End Epoch')
```
- Here is the result (loss per epoch):
``` bash
index created!
31.06427083333333
End Epoch
30.9518505859375
End Epoch
31.288087565104167
End Epoch
30.979817708333332
End Epoch
30.667805989583332
End Epoch
```

- ### 4. Train on COCO
    - The custom COCO dataset class is shown below.
```python
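import os
from typing import Any, Callable, Optional, Tuple

import numpy as np
import torch
from PIL import Image
from torchvision.datasets import CocoDetection

class CustomCoco(CocoDetection):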
    def __init__(
            self,
            root: str,
            annFile: str,
            transform: Optional[Callable] = None,
            target_transform: Optional[Callable] = None,
            transforms: Optional[Callable] = None,
    ) -> None:
        super(CocoDetection, self).__init__(root, transforms, transform, target_transform)
        from pycocotools.coco import COCO
        self.coco = COCO(annFile)
        self.ids = list(sorted(self.coco.imgs.keys()))


    def __getitem__(self, index: int) -> Tuple[Any, Any]:

        coco = self.coco
        img_id = self.ids[index]
        ann_ids = coco.getAnnIds(imgIds=img_id)
        target = coco.loadAnns(ann_ids)
        # self.target = target

        path = coco.loadImgs(img_id)[0]['file_name']

        img = Image.open(os.path.join(self.root, path)).convert('RGB')
        img = np.array(img)

        category_ids = list(obj['category_id'] for obj in target)
        bboxes = list(obj['bbox'] for obj in target)

        if self.transform is not None:
            # Apply the augmentation pipeline jointly to the image, boxes, and class ids.
            transformed = self.transform(image=img, bboxes=bboxes, category_ids=category_ids)
            img = transformed['image'],  # note the trailing comma: img becomes a 1-tuple
            bboxes = torch.Tensor(transformed['bboxes'])
            cat_ids = torch.Tensor(transformed['category_ids'])
            # Build the training labels from the transformed boxes and class ids.
            labels, bboxes = self.__create_label(bboxes, cat_ids.type(torch.IntTensor))

        return img, labels, bboxes

    def __len__(self) -> int:
        return len(self.ids)

    def __create_label(self, bboxes, class_inds):


        # print("Class indices: ", class_inds)
        bboxes = np.array(bboxes)
        class_inds = np.array(class_inds)
        anchors = ANCHORS # all the anchors
        strides = np.array(STRIDES) # list of strides
        train_output_size = IP_SIZE / strides # image with different scales
        anchors_per_scale = NUM_ANCHORS # anchor per scale
        
        ...
        # print(train_output_size)

                    
```


- Here is the output from loading the dataset:
``` bash
Load Dataset
loading annotations into memory...
Done (t=17.65s)
creating index...
index created!
loading annotations into memory...
Done (t=7.63s)
creating index...
```

- After training with the `run_training` function:
```bash
31.06427083333333
End Epoch
30.9518505859375
End Epoch
31.288087565104167
End Epoch
30.979817708333332
End Epoch
30.667805989583332
End Epoch
```

- ### 5. Calculate the mAP of the model
- I tried to use sklearn to calculate FN, TN, TP, and FP, together with the IoU function from the lab instructions file.
    - Unfortunately, we did not train the model long enough for it to predict correct results.
Part of our output looks like this:

```bash
n person person person person person person person person person person person person person person person person person person person person person pperson person person person person cat cat cat cat cat cat cat cat cat cat cat cla umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella rella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella umbrella skis skis skis skis skis skis skis skis skis skis skis skis skis baseball bat baseball bat baseball bat baseball bat baseball bat baseball bat baseball bat baseball bat baseball bat baseball bat baseball bat diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningtable diningier hair drier hair drier hair drier hair drier hair drier hair drier hair drier hair drier hair drier hair drier hair drier hair drier hair drier hair drier hair drier hair drier hair dr

----------------------------------------------------------
SUMMARY
----------------------------------------------------------
Task                     : Time Taken (in seconds)

Reading addresses        : 0.000
Loading batch            : 0.715
Detection (1 images)     : 2.929
Output Processing        : 0.000
Drawing Boxes            : 0.889
Average time_per_img     : 4.532
----------------------------------------------------------
```


- ### 6. CIoU

I implemented the code below, adapted from the lab instructions.
``` python
import math
import torch

def CIOU_xywh_torch(boxes1, boxes2):
    '''
    Compute the CIoU of two boxes or two batches of boxes.
    :param boxes1: [cx, cy, w, h] or
                [[cx, cy, w, h], [cx, cy, w, h], ...]
    :param boxes2: same format as boxes1
    :return: CIoU value(s), without the last (coordinate) dimension
    '''
    # cx cy w h->xyxy
    boxes1 = torch.cat([boxes1[..., :2] - boxes1[..., 2:] * 0.5,
                        boxes1[..., :2] + boxes1[..., 2:] * 0.5], dim=-1)
    boxes2 = torch.cat([boxes2[..., :2] - boxes2[..., 2:] * 0.5,
                        boxes2[..., :2] + boxes2[..., 2:] * 0.5], dim=-1)

    boxes1 = torch.cat([torch.min(boxes1[..., :2], boxes1[..., 2:]),
                        torch.max(boxes1[..., :2], boxes1[..., 2:])], dim=-1)
    boxes2 = torch.cat([torch.min(boxes2[..., :2], boxes2[..., 2:]),
                        torch.max(boxes2[..., :2], boxes2[..., 2:])], dim=-1)

    # (x2 minus x1 = width)  * (y2 - y1 = height)
    boxes1_area = (boxes1[..., 2] - boxes1[..., 0]) * (boxes1[..., 3] - boxes1[..., 1])
    boxes2_area = (boxes2[..., 2] - boxes2[..., 0]) * (boxes2[..., 3] - boxes2[..., 1])

    # upper left of the intersection region (x,y)
    inter_left_up = torch.max(boxes1[..., :2], boxes2[..., :2])

    # bottom right of the intersection region (x,y)
    inter_right_down = torch.min(boxes1[..., 2:], boxes2[..., 2:])

    # if there is overlapping we will get (w,h) else set to (0,0) because it could be negative if no overlapping
    inter_section = torch.max(inter_right_down - inter_left_up, torch.zeros_like(inter_right_down))
    inter_area = inter_section[..., 0] * inter_section[..., 1]
    union_area = boxes1_area + boxes2_area - inter_area
    ious = 1.0 * inter_area / union_area
    # cal outer boxes
    outer_left_up = torch.min(boxes1[..., :2], boxes2[..., :2])
    outer_right_down = torch.max(boxes1[..., 2:], boxes2[..., 2:])
    outer = torch.max(outer_right_down - outer_left_up, torch.zeros_like(inter_right_down))
    outer_diagonal_line = torch.pow(outer[..., 0], 2) + torch.pow(outer[..., 1], 2)

    # cal center distance
    # center x center y
    boxes1_center = (boxes1[..., :2] +  boxes1[...,2:]) * 0.5
    boxes2_center = (boxes2[..., :2] +  boxes2[...,2:]) * 0.5

    # euclidean distance
    # x1-x2 square 
    center_dis = torch.pow(boxes1_center[...,0]-boxes2_center[...,0], 2) +\
                 torch.pow(boxes1_center[...,1]-boxes2_center[...,1], 2)

    # cal penalty term
    # cal width,height
    boxes1_size = torch.max(boxes1[..., 2:] - boxes1[..., :2], torch.zeros_like(inter_right_down))
    boxes2_size = torch.max(boxes2[..., 2:] - boxes2[..., :2], torch.zeros_like(inter_right_down))
    v = (4 / (math.pi ** 2)) * torch.pow(
            torch.atan((boxes1_size[...,0]/torch.clamp(boxes1_size[...,1],min = 1e-6))) -
            torch.atan((boxes2_size[..., 0] / torch.clamp(boxes2_size[..., 1],min = 1e-6))), 2)

    # The small epsilon avoids 0/0 when the boxes coincide (ious == 1 and v == 0).
    alpha = v / (1 - ious + v + 1e-7)

    #cal ciou
    cious = ious - (center_dis / outer_diagonal_line + alpha*v)

    return cious
```
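As a sanity check for the batched function, the same computation can be written for a single pair of boxes with the standard library only. This is a minimal sketch, assuming `(cx, cy, w, h)` boxes with positive width and height; the name `ciou_xywh` and the `eps` guard are choices made here, not part of the lab code.

```python
import math

def ciou_xywh(box1, box2, eps=1e-7):
    """Scalar CIoU of two (cx, cy, w, h) boxes."""
    cx1, cy1, w1, h1 = box1
    cx2, cy2, w2, h2 = box2

    # Convert center/size to corner coordinates (x1, y1, x2, y2).
    a = (cx1 - w1 / 2, cy1 - h1 / 2, cx1 + w1 / 2, cy1 + h1 / 2)
    b = (cx2 - w2 / 2, cy2 - h2 / 2, cx2 + w2 / 2, cy2 + h2 / 2)

    # Plain IoU.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / (union + eps)

    # Squared diagonal of the smallest box enclosing both boxes.
    outer_w = max(a[2], b[2]) - min(a[0], b[0])
    outer_h = max(a[3], b[3]) - min(a[1], b[1])
    diag2 = outer_w ** 2 + outer_h ** 2

    # Squared distance between the two centers.
    center2 = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2

    # Aspect-ratio consistency term and its weight.
    v = (4 / math.pi ** 2) * (math.atan(w1 / max(h1, 1e-6)) - math.atan(w2 / max(h2, 1e-6))) ** 2
    alpha = v / (1 - iou + v + eps)

    return iou - (center2 / (diag2 + eps) + alpha * v)

print(ciou_xywh((0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2)))  # close to 1 for identical boxes
print(ciou_xywh((0.2, 0.2, 0.1, 0.1), (0.8, 0.8, 0.1, 0.1)))  # negative for distant boxes
```

Unlike plain IoU, the value stays informative (negative) when the boxes do not overlap, so the loss `1 - CIoU` still indicates how far apart they are.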

## Part 4 Conclusion

- To conclude, we managed to load the CSPDarknet53 weights and use them as the backbone of the model. We also successfully loaded the COCO dataset provided on the system and were able to run training on it. Unfortunately, we were unable to train on the whole dataset because it takes a lot of time. In the end, our mAP and CIoU loss implementations (adapted from the lab instructions) were left unused.