<a href="https://colab.research.google.com/github/nyp-sit/sdaai-pdc2-students/blob/master/iti107/session-4/object_detection_yolov2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" align="left"/></a>

# Object Detection with YOLOv2

Welcome to your object detection programming exercise. In this exercise you will implement some of the main ideas covered in the lecture on the YOLOv2 algorithm, such as bounding boxes, IoU and non-max suppression.

**You will learn how to:**
- apply object detection to images and video
- implement IoU and non-max suppression

**Note**: Please run this in a TensorFlow 1.14 environment.

## 1. Import Required Packages

In [None]:
import matplotlib.pyplot as plt
import imgaug as ia
from tqdm import tqdm
import numpy as np
import os, cv2
from darknet import build_darknet
from utils import sigmoid, softmax, draw_boxes
%matplotlib inline

## 2. Build the YOLO network (Darknet)

In [None]:
model = build_darknet()


**Exercise**: 

Examine the model using model.summary().

1. What is the input shape (of the image) for YOLO's Darknet?

2. What is the output shape of YOLO's Darknet?

3. What does the last axis of the YOLO output contain?

<details><summary>Click here for answer</summary>
    
1. (416, 416, 3)
2. (13, 13, 5, 85)
3. The last axis has 85 elements, which are: x, y, w, h of the bounding box, the confidence score, and 80 class probabilities

</details>

## 3. Load pretrained weights

The pretrained weights for YOLOv2 converted to .h5 format (pre-trained on COCO dataset) can be downloaded from: 

https://sdaaidata.s3-ap-southeast-1.amazonaws.com/pretrained-weights/full_yolov2.h5

The original YOLO pretrained weights and config file can be downloaded from the YOLO author's website:

https://pjreddie.com/media/files/yolov2.weights

However, this weight file needs to be converted to .h5 format before it can be loaded using Keras's ```model.load_weights()``` method.

Refer to this link on how to convert the weights:

https://github.com/allanzelener/YAD2K



In [None]:
# run this only if you have wget installed on your Linux system
! wget https://sdaaidata.s3-ap-southeast-1.amazonaws.com/pretrained-weights/iti107/session-4/full_yolov2.h5

In [None]:
model.load_weights("full_yolov2.h5")

## 4. Process the detection output

- **Anchor boxes** in YOLO allow multiple objects to be detected within a grid cell. YOLOv2 uses 5 anchor boxes, which are shown below. Each pair of numbers represents the width and height of a single anchor box. The dimensions are relative to a grid cell.
- **Object threshold** controls which boxes to keep based on the confidence score (in the corresponding grid cell) and the class probabilities.
- **NMS threshold** is the IoU threshold used to decide whether to remove a bounding box.

In [None]:
NUM_CLASSES = 80
OBJ_THRESHOLD = 0.5
NMS_THRESHOLD = 0.45
ANCHOR_BOXES = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828]
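
To make the flat list easier to read, the sketch below reshapes it into five (width, height) pairs and converts them to pixels. It is only an illustration and not part of the exercise code: it assumes a 416 x 416 input divided into a 13 x 13 grid, and the names `CELL_SIZE_PX`, `anchor_pairs` and `anchor_pixels` are hypothetical.

```python
import numpy as np

# Same values as ANCHOR_BOXES above, repeated so the sketch is self-contained
ANCHOR_BOXES = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434,
                7.88282, 3.52778, 9.77052, 9.16828]

# Assumption: a 416-pixel input divided into a 13 x 13 grid
CELL_SIZE_PX = 416 / 13  # 32 pixels per grid cell

# One row per anchor box: (width, height) in grid-cell units
anchor_pairs = np.array(ANCHOR_BOXES).reshape(-1, 2)

# The same anchors expressed in pixels of the network input
anchor_pixels = anchor_pairs * CELL_SIZE_PX

for (w_cells, h_cells), (w_px, h_px) in zip(anchor_pairs, anchor_pixels):
    print(f"{w_cells:.2f} x {h_cells:.2f} cells -> {w_px:.0f} x {h_px:.0f} px")
```

The smallest anchor is well under one grid cell in each direction, while the largest covers most of the 13 x 13 grid.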

Read the class labels from the file labels.txt (this contains the names of all 80 classes).

In [None]:
# read the class labels YOLOv2 is trained on
with open('labels.txt') as f:
    labels = [label.rstrip('\n') for label in f]
#print(labels)

Here we define a convenient class to hold the information about each bounding box predicted.

In [None]:
class BoundBox:
    def __init__(self, xmin, ymin, xmax, ymax, c=None, class_prob_score=None):
        self.xmin = xmin
        self.ymin = ymin
        self.xmax = xmax
        self.ymax = ymax
            
        # This is the confidence score
        self.confidence = c
        # These are the per-class scores (confidence score * class probabilities)
        self.class_prob_scores = class_prob_score

        self.label = -1
        self.class_prob_score = -1

    def get_label(self):
        if self.label == -1:
            # return the class label corresponding to the highest class probability score
            self.label = np.argmax(self.class_prob_scores)

        return self.label

    def get_score(self):
        if self.class_prob_score == -1:
            self.class_prob_score = self.class_prob_scores[self.get_label()]

        return self.class_prob_score


The output of the detection layer (netout) has shape 13 x 13 x 5 x 85.

It is convenient to rearrange the (13,13,5,85) tensor into the following variables:  
- `box_confidence`: tensor of shape $(13 \times 13, 5, 1)$ containing $p_c$ (the confidence that there is some object) for each of the 5 boxes predicted in each of the 13x13 cells.
- `boxes`: tensor of shape $(13 \times 13, 5, 4)$ containing $(b_x, b_y, b_w, b_h)$ for each of the 5 boxes per cell.
- `box_class_probs`: tensor of shape $(13 \times 13, 5, 80)$ containing the detection probabilities $(c_1, c_2, ... c_{80})$ for each of the 80 classes for each of the 5 boxes per cell.


**Exercise** 

***This is quite a challenging exercise. Don't worry if you do not know how to do it :)***
- Retrieve the box confidence from netout and apply the sigmoid function `sigmoid(x)` so that the confidence score is between 0 and 1.
- Retrieve the box class probabilities (80 of them) and apply the function `softmax(x)` so that the class probabilities sum to 1.0.
- Compute the class scores by multiplying (element-wise) the confidence score with the class probabilities (see the diagram below for a clearer picture of how this is done).
- For scores below the threshold, set them to 0 so that the associated box will be removed later.

<img src="nb_images/probability_extraction.png" style="width:500px;height:400;"/>
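
The helpers `sigmoid` and `softmax` come from the course's `utils` module, whose source is not shown here. The sketch below assumes they follow the standard definitions and walks through the four steps on a single made-up box with three classes instead of 80. The names `sigmoid_sketch` and `softmax_sketch` and all the numbers are illustrative only.

```python
import numpy as np

def sigmoid_sketch(x):
    """Standard logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax_sketch(x, axis=-1):
    """Standard softmax along `axis`; subtracting the max avoids overflow."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

# One made-up box: a raw confidence value and raw values for 3 classes
raw_confidence = np.array(2.0)
raw_class_values = np.array([1.0, 3.0, 0.5])

confidence = sigmoid_sketch(raw_confidence)        # roughly 0.88
class_probs = softmax_sketch(raw_class_values)     # sums to 1.0
scores = confidence * class_probs                  # per-class scores

# Zero out every score that does not exceed the threshold
threshold = 0.3
filtered = scores * (scores > threshold)

print(confidence, class_probs, scores, filtered)
```

Only the second class survives the threshold here, because it holds most of the probability mass and the confidence is high.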

***Hint***:

`netout` is of shape (13, 13, 5, 85). To quickly access the elements of the last axis, you can use the ellipsis in numpy array to expand to the number of ':' objects needed to make a selection tuple of the same length as x.ndim. 

e.g. Given a NumPy array x of shape (2,3,2):
```
[[[ 0  1]
  [ 2  3]
  [ 4  5]]

 [[ 6  7]
  [ 8  9]
  [10 11]]]
```
``x[..., 0]`` will give  ``[[0,2,4], [6,8,10]]`` and ``x[..., 1]`` will give ``[[1,3,5], [7,9,11]]``
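
The same example can be run directly. This short sketch rebuilds the array with `np.arange` and also shows that a slice such as `1:` keeps the last axis, which is how a range of elements can be selected from the 85-element axis.

```python
import numpy as np

# Rebuild the (2, 3, 2) example array holding the values 0..11
x = np.arange(12).reshape(2, 3, 2)

first = x[..., 0]    # same as x[:, :, 0]
second = x[..., 1]   # same as x[:, :, 1]
tail = x[..., 1:]    # a slice keeps the last axis

print(first.tolist())   # [[0, 2, 4], [6, 8, 10]]
print(second.tolist())  # [[1, 3, 5], [7, 9, 11]]
print(tail.shape)       # (2, 3, 1)
```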

<details><summary>Click here for answer</summary>
    
def filter_boxes(netout, obj_threshold=0.3):
    
    boxes = []
   
    box_confidences = netout[..., 4] 
    box_confidences = sigmoid(box_confidences)

    # make the box_confidences the same number of axis as box_class_probs so you can multiply them together
    box_confidences = box_confidences[..., np.newaxis]
    
    box_class_probs = netout[..., 5:]   # elements from index 5 onwards are the individual class probabilities
    box_class_probs = softmax(box_class_probs)
    
    # Compute the box class scores as the element-wise product of box_confidences and box_class_probs
    # Both box_confidences and box_class_probs need to have the same number of axes
    box_scores = box_confidences * box_class_probs
    
    # set scores that do not exceed the threshold to 0; the other scores are left unchanged
    box_scores *= box_scores > obj_threshold
    
    return box_confidences, box_class_probs, box_scores

</details>


In [2]:
def filter_boxes(netout, obj_threshold=0.3):
    
    boxes = []
    
    # look at the last axis, which is the one with 85 elements
    # the first 5 of these 85 are (x, y, w, h, confidence)
    # the element at index 4 (the 5th element) is the confidence score
    
    ### START YOUR CODE HERE ###
    
    box_confidences = None
    
    ### END CODE HERE ###
    
    # make the box_confidences the same number of axis as box_class_probs so you can multiply them together
    box_confidences = box_confidences[..., np.newaxis]
    
    # On the last axis, the elements from index 5 onwards are the individual class probabilities
    
    ### START YOUR CODE HERE ###
    
    box_class_probs = None   

    ### END CODE HERE ###
    
    # Compute the box class scores as the element-wise product of box_confidences and box_class_probs
    # Both box_confidences and box_class_probs need to have the same number of axes
    
    ### START YOUR CODE HERE ###
    
    box_scores = None
    
    ### END CODE HERE ###
    
    # set scores that do not exceed the threshold to 0; the other scores are left unchanged
    box_scores *= box_scores > obj_threshold
    
    return box_confidences, box_class_probs, box_scores

The following code loops through each of the 13 x 13 grid cells and, for each of the 5 predicted boxes in a cell, computes the bounding box's x_min, y_min (top-left corner) and x_max, y_max (bottom-right corner) using the following formulas (reproduced from the YOLOv2 paper). Here $p_w$ and $p_h$ are the width and height of the corresponding anchor box, $t_x$, $t_y$, $t_w$ and $t_h$ are the 4 raw coordinates predicted for each bounding box, and $c_x$, $c_y$ are the offsets of the grid cell from the top-left corner of the image:

<img src="nb_images/bounding_box_location.png" style="width:150px;height:100"/>
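
As a worked example of these formulas, the sketch below decodes one made-up prediction in the grid cell at row 6, column 4, using the first anchor box. The raw values `t_x`, `t_y`, `t_w` and `t_h` are invented for illustration, and `sigmoid_sketch` is a stand-in for the `sigmoid` helper from `utils`, assumed here to be the standard logistic function.

```python
import numpy as np

def sigmoid_sketch(x):
    """Standard logistic function (assumed to match utils.sigmoid)."""
    return 1.0 / (1.0 + np.exp(-x))

GRID_W = GRID_H = 13
p_w, p_h = 0.57273, 0.677385   # first anchor box, in grid-cell units
c_x, c_y = 4, 6                # column and row of the grid cell

# Made-up raw network outputs for one box
t_x, t_y, t_w, t_h = 0.0, 0.0, np.log(2.0), 0.0

# Box center and size in grid-cell units
b_x = sigmoid_sketch(t_x) + c_x   # 4.5: the center of column 4
b_y = sigmoid_sketch(t_y) + c_y   # 6.5: the center of row 6
b_w = p_w * np.exp(t_w)           # twice the anchor width
b_h = p_h * np.exp(t_h)           # exactly the anchor height

# Normalize by the grid size so the values are fractions of the image
x, y = b_x / GRID_W, b_y / GRID_H
w, h = b_w / GRID_W, b_h / GRID_H

# Convert center/size to corner coordinates
x_min, y_min = x - w / 2, y - h / 2
x_max, y_max = x + w / 2, y + h / 2

print(x_min, y_min, x_max, y_max)
```

Because the sigmoid keeps the center offset between 0 and 1, the predicted center always stays inside the grid cell that produced it.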



In [None]:
def decode_netout(netout, anchors, nb_class, obj_threshold=0.3, nms_threshold=0.3):
    boxes = []
    
    ### START CODE HERE ###
    # call filter_boxes() to get box_confidences, box_class_probs and box_scores
    box_confidences, box_class_probs, box_scores = filter_boxes(netout, obj_threshold)
    ### END CODE HERE ###
    
    grid_h, grid_w, nb_box = netout.shape[:3]
    
    # calculate the locations of each of the 5 bounding boxes for each of the 13 x 13 locations
    count = 0
    for row in range(grid_h):
        for col in range(grid_w):
            for b in range(nb_box):
                # per-class scores (confidence * class probability) for this box
                box_score = box_scores[row,col,b]
                
                # if scores for all classes are 0, then skip the box
                if np.sum(box_score) > 0:
                    # first 4 elements are x, y, w, and h
                    x, y, w, h = netout[row,col,b,:4]
            
                    # x and y are predicted relative to the grid cell, so add the cell offset (col, row)
                    # to get the center of the bounding box, then normalize by the grid size
                    x = (col + sigmoid(x)) / grid_w # center position, unit: image width
                    y = (row + sigmoid(y)) / grid_h # center position, unit: image height
                    w = anchors[2 * b + 0] * np.exp(w) / grid_w # unit: image width
                    h = anchors[2 * b + 1] * np.exp(h) / grid_h # unit: image height
                    
                    confidence = box_confidences[row,col,b]
                    
                    # convert the coordinate to top/left corner and bottom/right corner 
                    x_min = x - w/2
                    x_max = x + w/2 
                    y_min = y - h/2
                    y_max = y + h/2 
                    
                    box = BoundBox(x_min, y_min, x_max, y_max, confidence, box_score)

                    boxes.append(box)
                    
    return boxes

### Intersection over Union

Non-max suppression uses a very important function called **"Intersection over Union"**, or IoU.
<img src="nb_images/iou.png" style="width:500px;height:400;"/>
<caption><center> Definition of "Intersection over Union"<br> </center></caption>

**Exercise**: Implement bbox_iou(). 

Some hints:
- In this exercise only, we define a box using its two corners (upper left and lower right): `(xmin, ymin, xmax, ymax)` rather than the midpoint and height/width.
- To calculate the area of a rectangle you need to multiply its height `(ymax - ymin)` by its width `(xmax - xmin)`.
- You'll also need to find the coordinates `(x1_i, y1_i, x2_i, y2_i)` of the intersection of two boxes. 

Remember that:
    - x1_i = maximum of the x1 coordinates of the two boxes
    - y1_i = maximum of the y1 coordinates of the two boxes
    - x2_i = minimum of the x2 coordinates of the two boxes
    - y2_i = minimum of the y2 coordinates of the two boxes
    
- In order to compute the intersection area, you need to make sure the height and width of the intersection are positive, otherwise the intersection area should be zero. Use `max(height, 0)` and `max(width, 0)`.

In this code, we use the convention that (0,0) is the top-left corner of an image, (1,0) is the upper-right corner, and (1,1) the lower-right corner. 
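
To check an implementation by hand, consider two 2 x 2 boxes offset by one unit in each direction: they share a 1 x 1 square, so the IoU is 1 / (4 + 4 - 1) = 1/7. The sketch below verifies this using plain tuples rather than `BoundBox` objects. `iou_xyxy` is a hypothetical helper name, and the guard against an empty union is an addition of this sketch.

```python
def iou_xyxy(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax) tuples."""
    # Width and height of the overlap, clamped at 0 when the boxes are apart
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    intersection = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection

    # Guard against degenerate boxes with zero area
    return intersection / union if union > 0 else 0.0

overlap = iou_xyxy((0, 0, 2, 2), (1, 1, 3, 3))     # 1/7, about 0.143
disjoint = iou_xyxy((0, 0, 1, 1), (2, 2, 3, 3))    # 0.0
identical = iou_xyxy((0, 0, 2, 2), (0, 0, 2, 2))   # 1.0

print(overlap, disjoint, identical)
```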

<details><summary>Click here for answer</summary>
    
def bbox_iou(box1, box2):
    """Implement the intersection over union (IoU) between box1 and box2
    
    Arguments:
    box1 -- first box, which is an object with the following attributes(xmin, ymin, xmax, ymax)
    box2 -- second box, which is an object with the following attributes(xmin, ymin, xmax, ymax)
    """
    
    # calculate the intersection
    x1_i = max(box1.xmin, box2.xmin)  
    y1_i = max(box1.ymin, box2.ymin)
    x2_i = min(box1.xmax, box2.xmax)
    y2_i = min(box1.ymax, box2.ymax)
    intersection_w = max(x2_i - x1_i, 0)
    intersection_h = max(y2_i - y1_i, 0)
    intersection_area = intersection_w * intersection_h
    
    # calculate the union 
    box1_area = (box1.xmax - box1.xmin) * (box1.ymax - box1.ymin)
    box2_area = (box2.xmax - box2.xmin) * (box2.ymax - box2.ymin)
    
    union_area = box1_area + box2_area - intersection_area
    
    iou = float(intersection_area)/union_area
    
    return iou

</details>

In [3]:
def bbox_iou(box1, box2):
    """Implement the intersection over union (IoU) between box1 and box2
    
    Arguments:
    box1 -- first box, which is an object with the following attributes(xmin, ymin, xmax, ymax)
    box2 -- second box, which is an object with the following attributes(xmin, ymin, xmax, ymax)
    """
    
    ### START YOUR CODE HERE ###
    
    # calculate the intersection
    
    
    # calculate the union 
    
    
    
    ### END YOUR CODE HERE ###
    
    return iou

### Non-Max Suppression

Here is the code that implements non-max suppression. The key steps are:
1. Select the box that has the highest score.
2. Compute the overlap of this box with all other boxes, and remove boxes that overlap significantly (iou >= `iou_threshold`).
3. Go back to step 1 and iterate until there are no more boxes with a lower score than the currently selected box.

This will remove all boxes that have a large overlap with the selected boxes. Only the "best" boxes remain.
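
For a single class, the three steps can be sketched with plain lists and tuples. This is a simplified greedy variant written for illustration and is not the notebook's `non_max_suppression`, which loops over all classes and zeroes scores in place. The names `iou_xyxy` and `greedy_nms_sketch`, the boxes and the scores are all made up.

```python
import numpy as np

def iou_xyxy(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax) tuples."""
    inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    intersection = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - intersection)
    return intersection / union if union > 0 else 0.0

def greedy_nms_sketch(boxes, scores, iou_threshold):
    """Return the indices of the boxes that are kept, highest score first."""
    order = [int(i) for i in np.argsort(scores)[::-1]]
    keep = []
    while order:
        best = order.pop(0)          # step 1: highest remaining score
        keep.append(best)
        # step 2: drop boxes that overlap the selected box too much
        order = [i for i in order
                 if iou_xyxy(boxes[best], boxes[i]) < iou_threshold]
        # step 3: repeat with whatever is left
    return keep

boxes = [(0.10, 0.10, 0.50, 0.50),
         (0.12, 0.12, 0.52, 0.52),   # near-duplicate of the first box
         (0.60, 0.60, 0.90, 0.90)]   # a separate object
scores = [0.9, 0.8, 0.7]

kept = greedy_nms_sketch(boxes, scores, iou_threshold=0.45)
print(kept)  # [0, 2]
```

The near-duplicate is removed because its IoU with the best box is about 0.82, well above the threshold, while the third box does not overlap it at all.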


In [None]:
def non_max_suppression(boxes, nb_class, nms_threshold, obj_threshold):

    # np.argsort sorts in ascending order, so we reverse it to visit the boxes with the highest scores first
    for c in range(nb_class):
        sorted_indices = list(reversed(np.argsort([box.class_prob_scores[c] for box in boxes])))

        for i in range(len(sorted_indices)):
            index_i = sorted_indices[i]
            
            if boxes[index_i].class_prob_scores[c] == 0: 
                continue
            else:
                for j in range(i+1, len(sorted_indices)):
                    index_j = sorted_indices[j]
                    
                    if bbox_iou(boxes[index_i], boxes[index_j]) >= nms_threshold:
                        boxes[index_j].class_prob_scores[c] = 0
                        
    # remove the boxes whose scores were all set to 0 (below the threshold or suppressed)
    boxes = [box for box in boxes if box.get_score() > 0]
    
    return boxes

## 5. Perform detection on image

**Exercise:**

Before you give the image to the model, it needs to be preprocessed as follows:

1. Resize the image to the input size expected by the model (recall the input shape from Part 2).
2. Normalize the value of each pixel to the range [0, 1].
3. Reverse the channel order (if necessary). **Note**: OpenCV reads images in BGR order, so we need to reverse it. ***Hint***: use ``::-1`` to reverse the items.
4. Add a batch dimension as the first dimension, since the model expects inputs in batches, i.e. of shape (batch, height, width, channels). ***Hint***: use ```np.expand_dims()```.
5. Call model.predict() to get the predictions, of shape (13, 13, 5, 85) for each image in the batch.
6. Pass the predictions to decode_netout(). ***Hint***: remove the first axis (batch axis) before calling decode_netout().
7. Call non_max_suppression() to get the final list of boxes
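
The shape and value changes in steps 2 to 4 can be checked on a synthetic image without loading the model. The sketch below assumes the image has already been resized to 416 x 416 and uses a made-up, pure-blue BGR array; the names `fake_bgr`, `scaled`, `rgb` and `batch` are hypothetical.

```python
import numpy as np

# Made-up 416 x 416 image, pure blue, laid out in BGR order as OpenCV does
fake_bgr = np.zeros((416, 416, 3), dtype=np.uint8)
fake_bgr[..., 0] = 255              # channel 0 is blue in BGR order

scaled = fake_bgr / 255.0           # step 2: values in [0, 1]
rgb = scaled[:, :, ::-1]            # step 3: BGR -> RGB
batch = np.expand_dims(rgb, 0)      # step 4: add the batch axis

print(batch.shape)      # (1, 416, 416, 3)
print(batch[0, 0, 0])   # [0. 0. 1.]
```

After the reversal, blue sits in the last channel, which is where an RGB consumer expects it.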

<details><summary>Click here for answer</summary>

input_image = cv2.resize(image, (416, 416))
input_image = input_image / 255.
input_image = input_image[:,:,::-1]
input_image = np.expand_dims(input_image, 0)

netout = model.predict(input_image)

boxes = decode_netout(netout[0], 
                      anchors=ANCHOR_BOXES, 
                      nb_class=NUM_CLASSES)

boxes = non_max_suppression(boxes, NUM_CLASSES, NMS_THRESHOLD, OBJ_THRESHOLD)

</details>

In [None]:
image = cv2.imread('data/giraffe.jpg')

plt.figure(figsize=(10,10))

### START YOUR CODE HERE ###

### END THE CODE  ###

## draw the box on the original image, not preprocessed image
image = draw_boxes(image, boxes, labels=labels)

## reverse the BGR to RGB channel ordering
plt.imshow(image[:,:,::-1]) 
plt.show()


## 6. Perform detection on video

The following code shows how to perform detection on a video and write the result (frames with drawn bounding boxes) to a video file.

In [None]:
video_inp = 'data/street.mp4'
video_out = 'data/street_predicted.mp4'

video_reader = cv2.VideoCapture(video_inp)

nb_frames = int(video_reader.get(cv2.CAP_PROP_FRAME_COUNT))
frame_h = int(video_reader.get(cv2.CAP_PROP_FRAME_HEIGHT))
frame_w = int(video_reader.get(cv2.CAP_PROP_FRAME_WIDTH))

video_writer = cv2.VideoWriter(video_out,
                               cv2.VideoWriter_fourcc(*'XVID'), 
                               30.0, 
                               (frame_w, frame_h))

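for i in tqdm(range(nb_frames)):
    ret, image = video_reader.read()
    
    # stop if the frame could not be read (the reported frame count can be inexact)
    if not ret:
        break
    
    input_image = cv2.resize(image, (416, 416))
    input_image = input_image / 255.
    input_image = input_image[:,:,::-1]
    input_image = np.expand_dims(input_image, 0)

    netout = model.predict(input_image)

    boxes = decode_netout(netout[0], 
                          obj_threshold=0.3,
                          nms_threshold=NMS_THRESHOLD,
                          anchors=ANCHOR_BOXES, 
                          nb_class=NUM_CLASSES)
    # decode_netout() does not apply non-max suppression, so call it explicitly
    boxes = non_max_suppression(boxes, NUM_CLASSES, NMS_THRESHOLD, OBJ_THRESHOLD)
    image = draw_boxes(image, boxes, labels=labels)

    video_writer.write(np.uint8(image))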
    
video_reader.release()
video_writer.release()  


Now let's playback the video that we have created. 

***Note***: 

Only run the cell below if you are running on a local PC. OpenCV needs to open a local window to play the video, and this is not possible if you access the server remotely through a browser (e.g. when you are using a cloud VM). So, if you are using a cloud VM, you can download the video to your local PC and play it using any video player.

In [None]:
## Open the video file and play it
video_out = 'data/street_predicted.mp4'

cap = cv2.VideoCapture(video_out)

if not cap.isOpened():
    print('Error')
cv2.namedWindow('Frame')
cv2.startWindowThread()    
while cap.isOpened():   
    ret, frame = cap.read() 
    cv2.startWindowThread()
    if ret: 
        # Display the resulting frame 
        cv2.imshow('Frame', frame) 

        # Press Q on keyboard to  exit 
        if cv2.waitKey(25) & 0xFF == ord('q'): 
            break
    else:
        break

# When everything done, release the capture
cap.release()
cv2.waitKey(1)
cv2.destroyAllWindows()
cv2.waitKey(1)


**Additional Exercise**

Try object detection on your own image or video. Have fun!