## Dataset
The data comes from the [PASCAL VOC 2007](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/) dataset.

The twenty object classes that have been selected are:

* Person: person
* Animal: bird, cat, cow, dog, horse, sheep
* Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train
* Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import argparse
import xml.etree.ElementTree as ET
import os

In [3]:
parser = argparse.ArgumentParser(description='Build Annotations.')
parser.add_argument('dir', default='..', help='Annotations.')

_StoreAction(option_strings=[], dest='dir', nargs=None, const=None, default='..', type=None, choices=None, help='Annotations.', metavar=None)

In [2]:
sets = [('2007', 'train'), ('2007', 'val'), ('2007', 'test')]

In [3]:
classes_num = {'aeroplane': 0, 'bicycle': 1, 'bird': 2, 'boat': 3, 'bottle': 4, 'bus': 5,
               'car': 6, 'cat': 7, 'chair': 8, 'cow': 9, 'diningtable': 10, 'dog': 11,
               'horse': 12, 'motorbike': 13, 'person': 14, 'pottedplant': 15, 'sheep': 16,
               'sofa': 17, 'train': 18, 'tvmonitor': 19}

In [4]:
def convert_annotation(in_file):
    """Convert one VOC XML annotation into ' xmin,ymin,xmax,ymax,class_id ...'."""
    tree = ET.parse(in_file)
    root = tree.getroot()
    res = ''

    for obj in root.iter('object'):
        class_name = obj.find('name').text
        cls_id = classes_num.get(class_name)
        # Skip objects whose class name is not in classes_num.
        if cls_id is None:
            continue
        xmlbox = obj.find('bndbox')
        # Box corners in pixel coordinates: (xmin, ymin, xmax, ymax).
        b = (int(xmlbox.find('xmin').text), int(xmlbox.find('ymin').text),
             int(xmlbox.find('xmax').text), int(xmlbox.find('ymax').text))
        # Each object becomes one space-prefixed, comma-separated record.
        res += ' ' + ','.join([str(a) for a in b]) + ',' + str(cls_id)
    return res
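
The string returned by `convert_annotation` holds one `xmin,ymin,xmax,ymax,class_id` record per object, separated by spaces. As a minimal sketch of how such a label line could be read back, the helper below (`parse_label_line` is a hypothetical name, and the sample line is made up for illustration) splits it into integer tuples:

```python
from typing import List, Tuple

# (xmin, ymin, xmax, ymax, class_id)
Box = Tuple[int, int, int, int, int]


def parse_label_line(line: str) -> List[Box]:
    """Parse ' xmin,ymin,xmax,ymax,cls ...' back into integer tuples."""
    boxes = []
    # str.split() with no argument ignores the leading space and any blank input.
    for token in line.split():
        xmin, ymin, xmax, ymax, cls_id = (int(v) for v in token.split(','))
        boxes.append((xmin, ymin, xmax, ymax, cls_id))
    return boxes


# Illustrative label line with two objects (class ids 11 = dog, 14 = person).
example = ' 48,240,195,371,11 8,12,352,498,14'
print(parse_label_line(example))
```

An image with no recognised objects yields an empty string, which this sketch turns into an empty list.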

In [13]:
classes_num.get('a') == None

True

## Algorithm
### Shapes and probabilities
The input is a (448, 448, 3) image and the output is a (7, 7, 30) tensor. In general, the output shape is S x S x (B * 5 + C).

With B = 2 boxes per cell in YOLO-v1 and C = 20 classes, each cell holds 2 * 5 + 20 = 30 values.

S x S is the number of grid cells, B is the number of bounding boxes predicted per cell (each described by 5 values: x, y, w, h, and a confidence score), and C is the number of classes.
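
The shape arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch; `yolo_output_shape` is a hypothetical helper name, not part of any library.

```python
def yolo_output_shape(S: int, B: int, C: int):
    """Return the (S, S, B * 5 + C) output shape of a YOLO-v1 style head."""
    # Each box contributes 5 values; the C class values are shared per cell.
    return (S, S, B * 5 + C)


shape = yolo_output_shape(S=7, B=2, C=20)
total_values = shape[0] * shape[1] * shape[2]
print(shape, total_values)
```

For the Pascal VOC settings this gives 30 values per cell and 7 * 7 * 30 = 1470 output values in total.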

The system divides the image into an S x S grid. Each grid cell predicts B bounding boxes and a confidence score for each box. The confidence score indicates how sure the model is that the box contains an object and how accurate it thinks the predicted box is. The confidence score is defined as:
$Confidence = Pr(Object) * IoU$

$IoU$: Intersection over Union between the predicted box and the ground truth.

If no object exists in a cell, its confidence score should be zero.
<img src ="attachment:dab909d4-0afa-4b31-a846-0e13fe1fbe3b.png" width= "200px"/>

Each grid cell also predicts C conditional class probabilities, $Pr(Class_i|Object)$.

Only one set of class probabilities is predicted per grid cell, regardless of the number of boxes B. At test time, these conditional class probabilities are multiplied by each box's confidence prediction, which gives a class-specific confidence score for each box. These scores encode both the probability of that class appearing in the box and how well the box fits the object.

$Pr(Class_i|Object) * Pr(Object) * IoU = Pr(Class_i) * IoU$

The final predictions are encoded as an S x S x (B*5 + C) tensor.
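
As a rough sketch of how the class-specific scores could be computed from such a tensor, the NumPy snippet below multiplies each box confidence by the cell's class probabilities. The channel layout (class probabilities first, then box confidences, then box coordinates) is an assumption made for this example, as is the helper name `class_specific_scores`; the random tensor only stands in for a real network output.

```python
import numpy as np


def class_specific_scores(pred: np.ndarray, B: int, C: int) -> np.ndarray:
    """Return an (S, S, B, C) array of box confidence * class probability."""
    # Assumed layout per cell: [0:C] class probs, [C:C+B] box confidences,
    # [C+B:] box coordinates (unused here).
    class_probs = pred[..., :C]       # (S, S, C)
    box_conf = pred[..., C:C + B]     # (S, S, B)
    # Broadcast to one score per (cell, box, class) combination.
    return box_conf[..., :, None] * class_probs[..., None, :]


S, B, C = 7, 2, 20
rng = np.random.default_rng(0)
pred = rng.random((S, S, B * 5 + C))  # stand-in for a network output
scores = class_specific_scores(pred, B, C)
print(scores.shape)
```

Each cell ends up with B * C scores, one per box and class, which is what the thresholding and non-max suppression steps would operate on.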

### Intersection Over Union (IoU):
IoU is used to evaluate an object detection algorithm. It is the area of overlap between the ground-truth box and the predicted box divided by the area of their union, i.e., it measures how closely the predicted box matches the ground truth.

![image.png](attachment:9d1ce88c-f91e-443b-b2c3-1a90f70e6249.png)

The IoU threshold is usually set to 0.5, although many researchers apply a more stringent threshold such as 0.6 or 0.7. A bounding box whose IoU is below the chosen threshold is not taken into consideration.
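
A minimal sketch of the IoU computation follows. It assumes boxes are given as `(xmin, ymin, xmax, ymax)` corner coordinates, matching the label format produced earlier, and treats coordinates as continuous values; the function name `iou` is chosen for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2 x 2 boxes overlapping in a 1 x 1 square: IoU = 1 / (4 + 4 - 1).
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

With a threshold of 0.5, this pair of boxes would not count as a match, since their IoU is 1/7.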

### Non-Max Suppression (NMS):
The algorithm may find multiple detections of the same object. Non-max suppression is a technique that ensures each object is reported only once. Consider an example where the algorithm detects three bounding boxes for the same object. The boxes and their respective probabilities are shown in the image below.

![image.png](attachment:61386d5f-b32e-42bf-8ee5-50e6da35b42a.png)

The probabilities of the boxes are 0.7, 0.9, and 0.6 respectively. To remove the duplicates, first select the box with the highest probability and output it as a prediction. Then eliminate every remaining box whose IoU with that prediction exceeds 0.5 (or another chosen threshold). The result will be:

![image.png](attachment:391892c8-8da1-4dc9-9849-e5989c8c49ec.png)
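
The procedure can be sketched as a greedy loop in plain Python. The box coordinates below are made up to mimic the three overlapping detections with probabilities 0.7, 0.9, and 0.6; `non_max_suppression` and `iou` are names chosen for this example, and boxes are assumed to be `(xmin, ymin, xmax, ymax)` tuples.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    # Candidate indices sorted by descending score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too strongly.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


# Hypothetical coordinates for three detections of the same object.
boxes = [(100, 100, 210, 210), (105, 105, 215, 215), (95, 110, 205, 220)]
scores = [0.7, 0.9, 0.6]
print(non_max_suppression(boxes, scores))
```

Only the index of the 0.9 box survives here, because both other boxes overlap it with an IoU above 0.5. In a multi-class detector this loop would typically be run separately for each class.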

### Network Architecture:
The base model has 24 convolutional layers followed by 2 fully connected layers. It uses 1 x 1 reduction layers followed by a 3 x 3 convolutional layer. Fast YOLO uses a neural network with 9 convolutional layers and fewer filters in those layers. The complete network is shown in the figure.

![image.png](attachment:11275151-1162-4226-a361-70ad0a36728b.png)

* The architecture was designed for the Pascal VOC dataset, where S = 7, B = 2, and C = 20. This is why the final feature maps are 7 x 7 and the output tensor has the shape (7 x 7 x (2*5 + 20)). To use this network with a different number of classes or a different grid size, you might have to adjust the layer dimensions.

* The final layer uses a linear activation function. All other layers use a leaky ReLU.
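
For reference, the two activations mentioned above can be written in a few lines of NumPy. This is a sketch only: the negative slope of 0.1 is the value given in the original YOLO paper, and the function names are chosen for this example.

```python
import numpy as np


def leaky_relu(x, negative_slope=0.1):
    """Pass positive values through; scale the rest by negative_slope."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, negative_slope * x)


def linear(x):
    """Identity activation, as used on the final layer."""
    return np.asarray(x, dtype=float)


sample = [-2.0, 0.0, 3.0]
print(leaky_relu(sample))
print(linear(sample))
```

Unlike a plain ReLU, the leaky variant keeps a small non-zero output for negative inputs.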

### Limitations Of YOLO:
* It imposes strong spatial constraints on bounding box predictions, since each grid cell predicts only two boxes and can have only one class.
* It has difficulty detecting small objects that appear in groups.
* It struggles to generalize to objects with new or unusual aspect ratios, because the model learns to predict bounding boxes from the data itself.

## Create separate partitions for train, test and val data

In [5]:
# Create <split>/images and <split>/labels for every split.
# exist_ok=True also fills in a missing subfolder when the split folder already exists.
for split in ('val', 'train', 'test'):
    for sub in ('images', 'labels'):
        os.makedirs(os.path.join(split, sub), exist_ok=True)

In [6]:
import shutil
for set_ in sets :
    print(set_)
    full_file_path = os.path.join('VOCdevkit', 
                                f'VOC{set_[0]}', 
                                'ImageSets', 
                                'Main',
                                 f'{set_[1]}.txt')
    
    jpeg_images_folder = os.path.join('VOCdevkit', 
                                      f'VOC{set_[0]}',
                                     'JPEGImages')
    
    annotations_folder = os.path.join('VOCdevkit', 
                                      f'VOC{set_[0]}',
                                     'Annotations')

    with open(full_file_path, 'r') as f :
        for i in f :
            i = i.strip()
            if i == '' : continue
            orig_jpg_path =  os.path.join(jpeg_images_folder, f'{i}.jpg')
            new_jpg_path = os.path.join(f'{set_[1]}', 'images', f'{i}.jpg')
            orig_annot_path = os.path.join(annotations_folder, f'{i}.xml')
            new_annot_path = os.path.join(f'{set_[1]}', 'labels', f'{i}.txt')
            if not os.path.exists(new_jpg_path) :
                shutil.copy(orig_jpg_path, new_jpg_path)
            if not os.path.exists(new_annot_path) :
                with open(new_annot_path, 'w') as f_w :
                    f_w.write(convert_annotation(orig_annot_path))
                    

('2007', 'train')
('2007', 'val')
('2007', 'test')
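
After the copy, a quick sanity check is to compare the number of images and label files in each partition. The helper below is a sketch only: `count_split_files` is a hypothetical name, and it assumes the `<split>/images` and `<split>/labels` layout created above.

```python
import os


def count_split_files(root='.', splits=('train', 'val', 'test')):
    """Return {split: (n_images, n_labels)} for the folder layout above."""
    counts = {}
    for split in splits:
        per_folder = []
        for sub in ('images', 'labels'):
            folder = os.path.join(root, split, sub)
            # A missing folder is reported as zero files instead of raising.
            per_folder.append(len(os.listdir(folder)) if os.path.isdir(folder) else 0)
        counts[split] = tuple(per_folder)
    return counts


for split, (n_images, n_labels) in count_split_files().items():
    print(f'{split}: {n_images} images, {n_labels} labels')
```

The two counts should match for every split, since one label file is written per copied image.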


## Create partitions for just 2 classes

In [12]:
# Create <split>2/images and <split>2/labels for every split.
# exist_ok=True also fills in a missing subfolder when the split folder already exists.
for split in ('val2', 'train2', 'test2'):
    for sub in ('images', 'labels'):
        os.makedirs(os.path.join(split, sub), exist_ok=True)

In [None]:
import shutil

for set_ in sets :
    print(set_)
    full_file_path = os.path.join('VOCdevkit', 
                                  f'VOC{set_[0]}', 
                                  'ImageSets', 
                                  'Main',
                                  f'person_{set_[1]}.txt')

    jpeg_images_folder = os.path.join('VOCdevkit', 
                                      f'VOC{set_[0]}',
                                      'JPEGImages')

    annotations_folder = os.path.join('VOCdevkit', 
                                      f'VOC{set_[0]}',
                                      'Annotations')

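    non_person = 0  # number of images seen so far that are flagged as containing no person
    with open(full_file_path, 'r') as f :
        for i in f  :
            # Stop once more than 1015 non-person images have been counted.
            if non_person > 1015 : break
            a = i.strip()
            
            if a == '' : continue
            # Each line is '<image id> <flag>'; a flag of -1 marks an image without a person.
            x = a.split()
            if x[1] == '-1' :
                non_person += 1
                
            a = x[0]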
            
            orig_jpg_path =  os.path.join(jpeg_images_folder, f'{a}.jpg')
            new_jpg_path = os.path.join(f'{set_[1]}2', 'images', f'{a}.jpg')
            orig_annot_path = os.path.join(annotations_folder, f'{a}.xml')
            new_annot_path = os.path.join(f'{set_[1]}2', 'labels', f'{a}.txt')
            if not os.path.exists(new_jpg_path) :
                shutil.copy(orig_jpg_path, new_jpg_path)
            if not os.path.exists(new_annot_path) :
                with open(new_annot_path, 'w') as f_w :
                    f_w.write(convert_annotation(orig_annot_path))
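
The loop above relies on the per-class image-set files, where each line holds an image id followed by a flag. As a hedged sketch of that format (to my understanding of the VOC devkit, 1 marks a positive image, -1 a negative one, and 0 a difficult one), the helper below groups image ids by flag. `split_by_flag` and the sample lines are illustrative only.

```python
def split_by_flag(lines):
    """Group image ids from '<image id> <flag>' lines into three lists."""
    positives, negatives, difficult = [], [], []
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines.
        if len(parts) != 2:
            continue
        image_id, flag = parts
        if flag == '1':
            positives.append(image_id)
        elif flag == '-1':
            negatives.append(image_id)
        else:
            difficult.append(image_id)
    return positives, negatives, difficult


# Made-up sample in the same layout as person_train.txt.
sample = ['000005 -1', '000007 -1', '000009  1', '000012  0', '']
print(split_by_flag(sample))
```

Separating the ids first makes it easy to see how many person and non-person images a split contains before deciding on a cap such as the 1015 used above.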
                
             