# How to prepare your own dataset for training an Object Detector



## Gathering data
First of all, you need data. To train a robust detector, you need many pictures that differ substantially from each other: different backgrounds, other random objects in the scene, and varying lighting conditions. You can either take the pictures yourself or download them from the internet.

## Structure of the Dataset folder - Pascal VOC notation
To train an object detector on our own dataset, the folder that contains our images ('./data') needs a specific internal structure. Here we follow the Pascal VOC notation. Pascal VOC annotations are saved as XML files, one XML file per image. Each XML file contains the path to the image in the 'path' element, one bounding box per 'object' element, and other features, as can be seen in the example below.

<img src = "images/pascalvoc.PNG" style = "height:400px">

As you can see, each bounding box is defined by two points: the upper-left and bottom-right corners.
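To make the layout concrete, here is a minimal sketch that reads the class name and the two corner points of each bounding box from a Pascal VOC annotation. The XML text, the file name and the helper `read_boxes` are made up for illustration; only the element names follow the format described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation, shaped like the XML files that LabelImg writes.
SAMPLE_XML = """
<annotation>
    <filename>000001.jpg</filename>
    <path>/tmp/data/VOC2007/JPEGImages/000001.jpg</path>
    <size>
        <width>800</width>
        <height>600</height>
        <depth>3</depth>
    </size>
    <object>
        <name>dog</name>
        <bndbox>
            <xmin>48</xmin>
            <ymin>240</ymin>
            <xmax>195</xmax>
            <ymax>371</ymax>
        </bndbox>
    </object>
</annotation>
"""


def read_boxes(xml_text):
    """Return a list of (class_name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.findall("object"):
        bndbox = obj.find("bndbox")
        # (xmin, ymin) is the upper-left corner, (xmax, ymax) the bottom-right one.
        corners = [int(bndbox.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.find("name").text, *corners))
    return boxes


print(read_boxes(SAMPLE_XML))
```
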

Inside the main folder 'data' we need to create a subfolder 'VOC2007' that contains three subfolders:
1. Annotations: this folder holds the Pascal VOC formatted annotation XML files, which we are going to generate below using LabelImg.
2. ImageSets: this folder contains another subfolder 'Main' with two files, 'test.txt' and 'trainval.txt'. Each file lists the images that belong to the corresponding split.
3. JPEGImages: this folder holds all the images, which have to be in JPG format.

At the end your dataset should have a structure like this:

![](images/structure.PNG)
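If you prefer to create the empty folders from Python rather than by hand, a sketch like the following could be used. The function name `make_voc_folders` and its `root` parameter are assumptions for illustration; adapt the root to wherever your dataset lives.

```python
import os


def make_voc_folders(root="./data"):
    """Create the empty Pascal VOC folder skeleton under `root` and return its path."""
    voc_dir = os.path.join(root, "VOC2007")
    subfolders = ["Annotations", os.path.join("ImageSets", "Main"), "JPEGImages"]
    for sub in subfolders:
        # exist_ok=True makes the call safe to repeat on an existing dataset folder
        os.makedirs(os.path.join(voc_dir, sub), exist_ok=True)
    return voc_dir
```

Calling `make_voc_folders("./data")` leaves existing files untouched and only adds the folders that are missing.
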

## Labeling Data
To label our data we used the image labeling software LabelImg, which is available on GitHub at https://github.com/tzutalin/labelImg. With this tool we can easily create XML files in the PASCAL VOC format for all images in the training and testing directories.
### Instructions to install LabelImg
1. git clone https://github.com/tzutalin/labelImg
2. cd labelImg
3. conda install pyqt=5
4. pyrcc5 -o libs/resources.py resources.qrc
5. python labelImg.py
### Steps
1. Build and launch LabelImg using the instructions above
2. In the left column, choose the annotation save format (Pascal VOC / YOLO). Make sure Pascal VOC is selected.
3. Click 'Open Dir' to open the directory that contains all your images (the folder should be found at "./data/VOCdevkit/VOC2007/JPEGImages")
4. Change the save directory for the XML annotation files to "./data/VOCdevkit/VOC2007/Annotations".
5. Click 'Create RectBox'
6. Click and drag with the left mouse button to draw the rect box around the object, then release
7. Drag a rect box with the right mouse button to copy or move it


A txt file listing the classes used in the labeling is also created and placed in the folder you are working in.

<img src="images/labelimg.jpg" style="height:500px">
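After a labeling session it can be useful to check that no image was skipped. The sketch below lists the images that have no XML file with the same base name. The helper name `find_unlabeled` is hypothetical, and the check assumes that images and annotations share their base names, as LabelImg does by default.

```python
import os


def find_unlabeled(images_dir, annotations_dir):
    """Return the sorted names of JPG images that have no matching XML file."""
    labeled = {
        os.path.splitext(name)[0]
        for name in os.listdir(annotations_dir)
        if name.lower().endswith(".xml")
    }
    return sorted(
        name
        for name in os.listdir(images_dir)
        if name.lower().endswith((".jpg", ".jpeg"))
        and os.path.splitext(name)[0] not in labeled
    )
```

An empty list means that every image in the folder has an annotation file.
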


## Transform image resolution
The images you collect will usually have different sizes and a high resolution, since they may come from different photographic equipment: a mobile phone, a webcam, a reflex camera, etc. To speed up training, all images should therefore be scaled down to the same lower resolution. The code below uses 800 x 600; something as small as 200 x 150 can also be an option. Whatever size you choose, check that you can still distinguish and recognize the objects in the picture, otherwise it will also be difficult for the CNN to learn the main features of your objects.

In [1]:
from PIL import Image
import os

def rescale_images(directory, size):
    """Resize every image in `directory` in place to `size` = (width, height)."""
    for img in os.listdir(directory):
        img_path = os.path.join(directory, img)
        im = Image.open(img_path)
        # Image.LANCZOS is the same filter as Image.ANTIALIAS, an alias that Pillow 10 removed
        im_resized = im.resize(size, Image.LANCZOS)
        im_resized.save(img_path)

path_images = '../data_lab/VOC2007/JPEGImages'
WIDTH_NEW = 800
HEIGHT_NEW = 600
rescale_images(path_images, (WIDTH_NEW, HEIGHT_NEW))

## Split of the dataset
Once we have our images, we need to split them into a training set (80% of the images) and a test set (the remaining 20%). First we create a txt file, 'datafile.txt', that contains the names of all the images in our dataset. Then, using the function sklearn.model_selection.train_test_split, we split the dataset and create the two txt files to place inside the folder 'ImageSets/Main': 'trainval.txt' with the training images and 'test.txt' with the test images.

In [4]:
from sklearn.model_selection import train_test_split

train_file = '../data_lab/VOC2007/ImageSets/Main/trainval.txt'
test_file = '../data_lab/VOC2007/ImageSets/Main/test.txt'

# Read the image names, skipping blank lines (e.g. the one after a trailing newline)
with open("../data_lab/datafile.txt", "r") as f:
    data = [line.strip() for line in f if line.strip()]

# 80% of the names go to the training set, 20% to the test set.
# Pass random_state=<int> to train_test_split to make the split reproducible.
train, test = train_test_split(data, test_size=0.2)

# Write one image name per line into each split file
for out_path, names in ((train_file, train), (test_file, test)):
    with open(out_path, "w") as out:
        for name_img in names:
            out.write(name_img + '\n')
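The split reads 'datafile.txt', which has to exist beforehand. One possible way to generate it from the contents of 'JPEGImages' is sketched below. The helper `write_datafile` and its `keep_extension` flag are assumptions for illustration; by default the names are written without extension, which is how the original Pascal VOC 'ImageSets' files list their images.

```python
import os


def write_datafile(images_dir, datafile, keep_extension=False):
    """Write the names of all JPG images in `images_dir` to `datafile`, one per line."""
    names = sorted(
        name for name in os.listdir(images_dir)
        if name.lower().endswith((".jpg", ".jpeg"))
    )
    if not keep_extension:
        names = [os.path.splitext(name)[0] for name in names]
    with open(datafile, "w") as f:
        # No trailing newline, so splitting the file on '\n' yields no empty entry
        f.write("\n".join(names))
    return names
```

For the paths used in this section the call would be `write_datafile('../data_lab/VOC2007/JPEGImages', '../data_lab/datafile.txt')`.
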

## Change xml file
Once we have labeled our images and created all the XML files, we may still want to change the image resolution as explained above. The XML files then have to be updated too, so we define a function that modifies only the lines describing the features that change: the image width and height, the bounding box coordinates, and the path where the image is located.

In [5]:
from __future__ import print_function
from sys import argv
from os import listdir, path
import re


WIDTH_NEW = 800
HEIGHT_NEW = 600

# Regular expressions matching the XML lines that have to be rewritten
DIMLINE_MASK = r'<(?P<type1>width|height)>(?P<size>\d+)</(?P<type2>width|height)>'
BBLINE_MASK = r'<(?P<type1>xmin|xmax|ymin|ymax)>(?P<size>\d+)</(?P<type2>xmin|xmax|ymin|ymax)>'
NAMELINE_MASK = r'<(?P<type1>filename)>(?P<size>\S+)</(?P<type2>filename)>'
PATHLINE_MASK = r'<(?P<type1>path)>(?P<size>.+)</(?P<type2>path)>'

def resize_file(file_lines):
    new_lines = []
    for line in file_lines:
        match = re.search(DIMLINE_MASK, line) or re.search(BBLINE_MASK, line) or  re.search(NAMELINE_MASK, line) or re.search(PATHLINE_MASK, line) 
        if match is not None:
            size = match.group('size')
            type1 = match.group('type1')
            type2 = match.group('type2')    
            if type1 != type2:
                raise ValueError('Malformed line: {}'.format(line))
          
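            if type1.startswith('f'):
                # <filename>: force a 'jpg' extension (assumes a three-letter extension)
                new_name = size[:-3] + 'jpg'
                new_line = '\t<{}>{}</{}>\n'.format(type1, new_name, type1)
            elif type1.startswith('p'):
                # <path>: reuses new_name, so <filename> must come before <path> in the XML.
                # Replace the hard-coded prefix with the location of your own dataset.
                new_size = '/scratch/lmeneghe/electrolux/Object_Detector/data_lab/VOC2007/Annotations/' + new_name
                new_line = '\t<{}>{}</{}>\n'.format(type1, new_size, type1)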
            elif type1.startswith('x'):
                size = int(size)
                new_size = int(round(size * WIDTH_NEW / width_old))
                new_line = '\t\t\t<{}>{}</{}>\n'.format(type1, new_size, type1)
            elif type1.startswith('y'):
                size = int(size)
                new_size = int(round(size * HEIGHT_NEW / height_old))
                new_line = '\t\t\t<{}>{}</{}>\n'.format(type1, new_size, type1)
            elif type1.startswith('w'):
                size = int(size)
                width_old = size
                new_size = int(WIDTH_NEW)
                new_line = '\t\t<{}>{}</{}>\n'.format(type1, new_size, type1)
            elif type1.startswith('h'):
                size = int(size)
                height_old = size
                new_size = int(HEIGHT_NEW)
                new_line = '\t\t<{}>{}</{}>\n'.format(type1, new_size, type1)
            else:
                raise ValueError('Unknown type: {}'.format(type1))
            #new_line = '\t\t\t<{}>{}</{}>\n'.format(type1, new_size, type1)
            new_lines.append(new_line)
        else:
            new_lines.append(line)

    return ''.join(new_lines)


        
        
def change_xml(nome_file):
    if len(nome_file) < 1:
        raise ValueError('No file submitted')

    if path.isdir(nome_file):
        # the argument is a directory: convert every XML file inside it
        files = listdir(nome_file)
        for file in files:
            file_path = path.join(nome_file, file)
            file_name, file_ext = path.splitext(file)
            #print(file_path, end='') # This is not very clever
            if file_ext.lower() == '.xml':
                #print(': CONVERT ME!!!', end='')
                with open(file_path,'r') as f:
                    righe = f.readlines()

                nuovo_file = resize_file(righe)
                #print(nuovo_file)
                with open(file_path,'w') as f:
                    f.write(nuovo_file)
            #print()
        
    else:
        # otherwise the argument is assumed to be a single XML file
        with open(nome_file,'r') as f:
            righe = f.readlines()

        nuovo_file = resize_file(righe)
        #print(nuovo_file)
        with open(nome_file,'w') as f:
            f.write(nuovo_file)        

#insert name of the xml file or directory that contains them
xml_file = '../data_lab/VOC2007/Annotations' 
change_xml(xml_file)
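The arithmetic behind the box update is a plain proportional rescaling: x coordinates are multiplied by the ratio of new to old width, y coordinates by the ratio of new to old height. The small function below isolates that step as a sketch; `scale_box` is a hypothetical helper that is not part of the script above, and the numbers are made up.

```python
def scale_box(box, old_size, new_size):
    """Rescale (xmin, ymin, xmax, ymax) from old_size to new_size, both (width, height)."""
    xmin, ymin, xmax, ymax = box
    width_old, height_old = old_size
    width_new, height_new = new_size
    return (
        int(round(xmin * width_new / width_old)),
        int(round(ymin * height_new / height_old)),
        int(round(xmax * width_new / width_old)),
        int(round(ymax * height_new / height_old)),
    )


# A box in a 3200 x 2400 photo, after the photo is resized to 800 x 600
print(scale_box((400, 300, 1600, 1200), (3200, 2400), (800, 600)))
# prints (100, 75, 400, 300)
```
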

## Optional: How to convert your dataset in the COCO notation
If your object detection training pipeline requires the COCO data format, note that the labeling tool we use does not support it. However, once the dataset has been generated in Pascal VOC notation, you can convert the annotations to the COCO format.
### COCO notation
In the COCO data format there is a single JSON file for all the annotations in a dataset, or one for each split of the dataset (train/val/test). The bounding box is expressed as the upper-left corner coordinates followed by the box width and height: "bbox": [x, y, width, height]. Below is an example of a COCO JSON file that contains a single image (top-level "images" element), 3 unique categories/classes (top-level "categories" element) and 2 annotated bounding boxes for that image (top-level "annotations" element). To understand the COCO format better, check the official webpage of the COCO dataset: http://cocodataset.org/#format-data.

<img src="images/coco.PNG" style="height:800px">
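The difference between the two box conventions can be sketched in a few lines. The helper names below are hypothetical and the coordinates are made up. Note that this sketch converts the coordinates as they are, whereas the `convert` function in this section additionally subtracts 1 from xmin and ymin before computing the width and height.

```python
def voc_to_coco_bbox(xmin, ymin, xmax, ymax):
    """Pascal VOC corners -> COCO [x, y, width, height]."""
    if xmax <= xmin or ymax <= ymin:
        raise ValueError("Degenerate box: max corner must exceed min corner")
    return [xmin, ymin, xmax - xmin, ymax - ymin]


def coco_to_voc_bbox(bbox):
    """COCO [x, y, width, height] -> Pascal VOC (xmin, ymin, xmax, ymax)."""
    x, y, width, height = bbox
    return (x, y, x + width, y + height)


print(voc_to_coco_bbox(48, 240, 195, 371))
# prints [48, 240, 147, 131]
```
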

Once you have annotated XML files and image files in a folder structure similar to the one explained above, you can generate a COCO formatted JSON file using the function voc2coco (see also https://github.com/Tony607/voc2coco).

ATTENTION: this function requires the names of all the images to be numbers, because each file name is converted to an integer and used as the image id. Rename your images if this is not the case.
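Renaming by hand is error-prone because the 'filename' and 'path' elements inside each XML file must change together with the file names. The following is a minimal sketch of one way to do it, not part of voc2coco: `rename_to_numbers` is a hypothetical helper that assumes '.jpg' images whose annotation files share their base name, and it stops instead of overwriting an existing file.

```python
import os
import xml.etree.ElementTree as ET


def rename_to_numbers(images_dir, annotations_dir, start=1):
    """Rename each image and its XML file to a zero-padded number; return the mapping."""
    mapping = {}
    images = sorted(n for n in os.listdir(images_dir) if n.lower().endswith(".jpg"))
    for number, old_name in enumerate(images, start=start):
        old_stem = os.path.splitext(old_name)[0]
        new_stem = "{:06d}".format(number)  # int("000001") == 1, so padding is harmless
        new_name = new_stem + ".jpg"
        mapping[old_name] = new_name
        if new_name == old_name:
            continue
        new_image = os.path.join(images_dir, new_name)
        if os.path.exists(new_image):
            raise FileExistsError("Refusing to overwrite " + new_image)
        os.rename(os.path.join(images_dir, old_name), new_image)

        old_xml = os.path.join(annotations_dir, old_stem + ".xml")
        if not os.path.exists(old_xml):
            continue  # this image has not been labeled yet
        tree = ET.parse(old_xml)
        root = tree.getroot()
        # Keep the directory part of <path>, replace only the file name
        for tag in ("filename", "path"):
            element = root.find(tag)
            if element is not None and element.text:
                element.text = os.path.join(os.path.dirname(element.text), new_name)
        tree.write(os.path.join(annotations_dir, new_stem + ".xml"))
        os.remove(old_xml)
    return mapping
```

Keep the returned mapping (or work on a copy of the dataset) so that the original names can be recovered if needed.
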

In [27]:
import sys
import os
import json
import xml.etree.ElementTree as ET
import glob

START_BOUNDING_BOX_ID = 1
PRE_DEFINE_CATEGORIES = None
# If necessary, pre-define category and its id
#  PRE_DEFINE_CATEGORIES = {"aeroplane": 1, "bicycle": 2, "bird": 3, "boat": 4,
#  "bottle":5, "bus": 6, "car": 7, "cat": 8, "chair": 9,
#  "cow": 10, "diningtable": 11, "dog": 12, "horse": 13,
#  "motorbike": 14, "person": 15, "pottedplant": 16,
#  "sheep": 17, "sofa": 18, "train": 19, "tvmonitor": 20}


def get(root, name):
    vars = root.findall(name)
    return vars


def get_and_check(root, name, length):
    vars = root.findall(name)
    if len(vars) == 0:
        raise ValueError("Can not find %s in %s." % (name, root.tag))
    if length > 0 and len(vars) != length:
        raise ValueError(
            "The size of %s is supposed to be %d, but is %d."
            % (name, length, len(vars))
        )
    if length == 1:
        vars = vars[0]
    return vars


def get_filename_as_int(filename):
    try:
        filename = filename.replace("\\", "/")
        filename = os.path.splitext(os.path.basename(filename))[0]
        return int(filename)
    except ValueError:
        # int() failed: the base name is not a number
        raise ValueError("Filename %s is supposed to be an integer." % (filename))


def get_categories(xml_files):
    """Generate category name to id mapping from a list of xml files.
    
    Arguments:
        xml_files {list} -- A list of xml file paths.
    
    Returns:
        dict -- category name to id mapping.
    """
    classes_names = []
    for xml_file in xml_files:
        tree = ET.parse(xml_file)
        root = tree.getroot()
        for member in root.findall("object"):
            classes_names.append(member[0].text)
    classes_names = list(set(classes_names))
    classes_names.sort()
    return {name: i for i, name in enumerate(classes_names)}


def convert(xml_files, json_file):
    json_dict = {"images": [], "type": "instances", "annotations": [], "categories": []}
    if PRE_DEFINE_CATEGORIES is not None:
        categories = PRE_DEFINE_CATEGORIES
    else:
        categories = get_categories(xml_files)
    bnd_id = START_BOUNDING_BOX_ID
    for xml_file in xml_files:
        tree = ET.parse(xml_file)
        root = tree.getroot()
        path = get(root, "path")
        if len(path) == 1:
            filename = os.path.basename(path[0].text)
        elif len(path) == 0:
            filename = get_and_check(root, "filename", 1).text
        else:
            raise ValueError("%d paths found in %s" % (len(path), xml_file))
        ## The filename must be a number
        image_id = get_filename_as_int(filename)
        size = get_and_check(root, "size", 1)
        width = int(get_and_check(size, "width", 1).text)
        height = int(get_and_check(size, "height", 1).text)
        image = {
            "file_name": filename,
            "height": height,
            "width": width,
            "id": image_id,
        }
        json_dict["images"].append(image)
        ## Currently we do not support segmentation.
        #  segmented = get_and_check(root, 'segmented', 1).text
        #  assert segmented == '0'
        for obj in get(root, "object"):
            category = get_and_check(obj, "name", 1).text
            if category not in categories:
                new_id = len(categories)
                categories[category] = new_id
            category_id = categories[category]
            bndbox = get_and_check(obj, "bndbox", 1)
            xmin = int(get_and_check(bndbox, "xmin", 1).text) - 1
            ymin = int(get_and_check(bndbox, "ymin", 1).text) - 1
            xmax = int(get_and_check(bndbox, "xmax", 1).text)
            ymax = int(get_and_check(bndbox, "ymax", 1).text)
            assert xmax > xmin
            assert ymax > ymin
            o_width = abs(xmax - xmin)
            o_height = abs(ymax - ymin)
            ann = {
                "area": o_width * o_height,
                "iscrowd": 0,
                "image_id": image_id,
                "bbox": [xmin, ymin, o_width, o_height],
                "category_id": category_id,
                "id": bnd_id,
                "ignore": 0,
                "segmentation": [],
            }
            json_dict["annotations"].append(ann)
            bnd_id = bnd_id + 1

    for cate, cid in categories.items():
        cat = {"supercategory": "none", "id": cid, "name": cate}
        json_dict["categories"].append(cat)

    # os.makedirs(os.path.dirname(json_file), exist_ok=True)
    # Write the JSON file; the with-block closes it even if writing fails
    with open(json_file, "w") as json_fp:
        json_fp.write(json.dumps(json_dict))


xml_dir = './data/Annotations'
json_file = './data/output.json'
xml_files = glob.glob(os.path.join(xml_dir, "*.xml"))
# If you want to do train/test split, you can pass a subset of xml files to convert function.
print("Number of xml files: {}".format(len(xml_files)))
convert(xml_files, json_file)
print("Success: {}".format(json_file))

Number of xml files: 14
Success: ./data/output.json


### Visualize the COCO annotation
Once we have the JSON file, we can visualize the COCO annotations by drawing the bounding boxes and class labels as an overlay on the image. To do so, open the notebook COCO_Image_Viewer.ipynb, which can be found on the GitHub page https://github.com/Tony607/voc2coco, in Jupyter Notebook.
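For a quick check without the notebook, the overlay can also be drawn with PIL. This is a minimal sketch and not the viewer from the voc2coco repository: `draw_coco_boxes` is a hypothetical helper that assumes the JSON layout produced above, with 'file_name' entries that are relative to `images_dir`.

```python
import json
import os

from PIL import Image, ImageDraw


def draw_coco_boxes(json_file, images_dir, image_id, out_path):
    """Draw all boxes of one image onto a copy of it; return the number of boxes drawn."""
    with open(json_file) as f:
        coco = json.load(f)
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    info = next(img for img in coco["images"] if img["id"] == image_id)
    with Image.open(os.path.join(images_dir, info["file_name"])) as source:
        image = source.convert("RGB")
    draw = ImageDraw.Draw(image)
    count = 0
    for ann in coco["annotations"]:
        if ann["image_id"] != image_id:
            continue
        # COCO stores the upper-left corner plus width and height
        x, y, width, height = ann["bbox"]
        draw.rectangle([x, y, x + width, y + height], outline="red", width=2)
        draw.text((x + 3, y + 3), names[ann["category_id"]], fill="red")
        count += 1
    image.save(out_path)
    return count
```

The function writes the overlay to `out_path` and leaves the original image untouched.
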