# Data Preparation Tutorial
This tutorial transforms the Prima benchmark dataset into a ready-to-train structure. We first convert the XML annotations into a single JSON file, defining the taxonomy we want during the conversion. From that JSON file we then create new images that serve as annotation masks.
## Dataset Preparation

You can download the dataset from https://www.primaresearch.org/datasets/Layout_Analysis

In [1]:
#Global Variables
DATA_PATH           = './prima'
PATH_TO_ANNOTATIONS = './prima/annotations.json'
PATH_TO_IMAGES      = './prima/Images'
TEST_SET_PATH       = './test_set'
CLASSES_DATA        = ('Background', 'Paragraph', 'OtherText', 'VisualFigure') # Define classes, don't forget to add a 'Background' Class in front
PALETTE_DATA        = [[0, 0, 0], [0, 0, 255], [255, 0, 0], [0, 255, 0]] # Each color is mapped to its corresponding label, mapped by index
DATA_ROOT           = 'data'
PATH_TO_TR_IMAGES   = 'setr_images'
PATH_TO_ANN_IMAGES  = 'setr_annotations_palette'
PATH_TO_SPLIT       = 'splits'
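
The class tuple and the palette are paired by position, so index 0 is 'Background' with colour `[0, 0, 0]`, index 1 is 'Paragraph', and so on. A minimal sketch of that pairing follows; the names `CLASSES`, `PALETTE`, `class_to_index` and `index_to_color` are hypothetical and are not part of the tutorial's `utils` package.

```python
# Hypothetical names; only the two sequences mirror the tutorial's globals.
CLASSES = ('Background', 'Paragraph', 'OtherText', 'VisualFigure')
PALETTE = [[0, 0, 0], [0, 0, 255], [255, 0, 0], [0, 255, 0]]

# Both sequences are paired by position, so they must have the same length.
assert len(CLASSES) == len(PALETTE)

# Class name -> label index stored in the mask.
class_to_index = {name: idx for idx, name in enumerate(CLASSES)}
# Label index -> RGB colour used when the mask is rendered.
index_to_color = {idx: tuple(rgb) for idx, rgb in enumerate(PALETTE)}

print(class_to_index['Paragraph'])  # 1
print(index_to_color[1])            # (0, 0, 255)
```

Adding or removing a class means editing both sequences together, otherwise every later index shifts to the wrong colour.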

First, we call the wrapper function that converts the XML annotations to JSON. The JSON format is needed in order to train the Mask R-CNN. The custom taxonomy is defined in the file <u>convert_prima_to_3_classes</u> and can be changed, depending on the purpose of the task.

In [2]:
from utils.convert_prima_to_3_classes import wrapper

wrapper(DATA_PATH, PATH_TO_ANNOTATIONS)

100%|██████████| 478/478 [00:16<00:00, 28.14it/s]
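
To illustrate what such a conversion involves, here is a rough sketch that maps region types from a simplified, made-up XML snippet onto the three foreground classes and emits a COCO-style dictionary. The XML layout, the mapping rules, and the function names `to_category` and `page_to_coco` are all assumptions for illustration; the actual logic lives in `convert_prima_to_3_classes` and is not shown here. That the JSON is COCO-style is itself an assumption, suggested only by the `cocosplit` helper name.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up XML in the spirit of a page-layout annotation file.
SAMPLE_XML = """
<Page imageFilename="00000001.tif" imageWidth="100" imageHeight="80">
  <TextRegion type="paragraph"><Coords points="10,10 50,10 50,30 10,30"/></TextRegion>
  <TextRegion type="heading"><Coords points="10,40 60,40 60,50 10,50"/></TextRegion>
  <ImageRegion><Coords points="60,10 90,10 90,30 60,30"/></ImageRegion>
</Page>
"""

# Hypothetical taxonomy: category id 0 is left for 'Background'.
CATEGORY_IDS = {'Paragraph': 1, 'OtherText': 2, 'VisualFigure': 3}


def to_category(tag, region_type):
    """Collapse the source region types into the three target classes."""
    if tag == 'TextRegion':
        return 'Paragraph' if region_type == 'paragraph' else 'OtherText'
    return 'VisualFigure'


def page_to_coco(xml_text, image_id=1):
    """Turn one page of XML into a COCO-style dict (the 'area' field is omitted)."""
    page = ET.fromstring(xml_text)
    image = {
        'id': image_id,
        'file_name': page.get('imageFilename'),
        'width': int(page.get('imageWidth')),
        'height': int(page.get('imageHeight')),
    }
    annotations = []
    for ann_id, region in enumerate(page, start=1):
        raw = region.find('Coords').get('points').split()
        points = [tuple(map(int, p.split(','))) for p in raw]
        xs, ys = zip(*points)
        name = to_category(region.tag, region.get('type'))
        annotations.append({
            'id': ann_id,
            'image_id': image_id,
            'category_id': CATEGORY_IDS[name],
            # COCO polygons are flat lists: [x1, y1, x2, y2, ...]
            'segmentation': [[c for p in points for c in p]],
            # COCO boxes are [x, y, width, height]
            'bbox': [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)],
            'iscrowd': 0,
        })
    categories = [{'id': i, 'name': n} for n, i in CATEGORY_IDS.items()]
    return {'images': [image], 'annotations': annotations, 'categories': categories}


coco = page_to_coco(SAMPLE_XML)
print([a['category_id'] for a in coco['annotations']])  # [1, 2, 3]
```

Changing the taxonomy then amounts to editing the mapping function and the category table, which is the part the tutorial says can be adapted to the task.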


Second, we split the annotated dataset into a training set and a validation set, using 80% of the entries for training.

In [10]:
from utils.cocosplit import main

main(PATH_TO_ANNOTATIONS, 0.8, True, PATH_TO_ANNOTATIONS[:-5]+'-train.json', PATH_TO_ANNOTATIONS[:-5]+'-val.json', 24)

Saved 382 entries in ./data/prima/annotations-train.json and 96 in ./data/prima/annotations-val.json
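
For readers who want to see what a split of this kind does, the sketch below divides a COCO-style dictionary by image and keeps each annotation with its image. It assumes the JSON has the usual `images`, `annotations` and `categories` keys; `split_coco` and its `seed` parameter are hypothetical and do not describe the internals of `utils.cocosplit`.

```python
import random


def split_coco(coco, train_fraction, seed):
    """Split a COCO-style dict by image; annotations follow their image."""
    images = list(coco['images'])
    # A seeded shuffle makes the split reproducible.
    random.Random(seed).shuffle(images)
    cut = int(len(images) * train_fraction)

    subsets = []
    for subset in (images[:cut], images[cut:]):
        ids = {img['id'] for img in subset}
        subsets.append({
            'images': subset,
            'annotations': [a for a in coco['annotations'] if a['image_id'] in ids],
            'categories': coco['categories'],
        })
    return subsets


# Toy data: ten images with one annotation each.
toy = {
    'images': [{'id': i, 'file_name': f'{i:08d}.tif'} for i in range(10)],
    'annotations': [{'id': i, 'image_id': i, 'category_id': 1} for i in range(10)],
    'categories': [{'id': 1, 'name': 'Paragraph'}],
}

train, val = split_coco(toy, 0.8, seed=24)
print(len(train['images']), len(val['images']))  # 8 2
```

Splitting by image rather than by annotation keeps all regions of one page in the same subset, so no page leaks from training into validation.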


Some images come without annotations, so we create a test set from images that are not used to train the model.

In [11]:
from utils.test_set_creator import test_set_creator

test_set_creator(PATH_TO_ANNOTATIONS, PATH_TO_IMAGES, TEST_SET_PATH)
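
One way to pick such images is to compare the files on disk with the files that have at least one annotation in the JSON. The sketch below shows that idea on toy data; `unannotated_files` is a hypothetical helper, and how `test_set_creator` actually selects its images is not shown in this tutorial.

```python
def unannotated_files(coco, image_files):
    """Return the file names that have no annotation in a COCO-style dict."""
    annotated_ids = {a['image_id'] for a in coco['annotations']}
    annotated_names = {
        img['file_name'] for img in coco['images'] if img['id'] in annotated_ids
    }
    return sorted(f for f in image_files if f not in annotated_names)


# Toy data: three files on disk, only the first one is annotated.
toy_coco = {
    'images': [
        {'id': 1, 'file_name': 'a.tif'},
        {'id': 2, 'file_name': 'b.tif'},
    ],
    'annotations': [{'id': 1, 'image_id': 1, 'category_id': 1}],
}
files_on_disk = ['a.tif', 'b.tif', 'c.tif']

print(unannotated_files(toy_coco, files_on_disk))  # ['b.tif', 'c.tif']
```

In a real run the list of files would come from something like `os.listdir(PATH_TO_IMAGES)`, and the selected files would be copied to `TEST_SET_PATH`.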

## Create a Training Set
The training dataset should be structured as follows:
* data
    * setr_images
    * setr_annotations_palette

In [2]:
import utils.train_set_creator as tsc

In [4]:
import os

# Create data/setr_images (and the parent data directory) if it does not exist yet.
os.makedirs(os.path.join(DATA_ROOT, PATH_TO_TR_IMAGES), exist_ok=True)

tsc.train_set_builder(PATH_TO_IMAGES, os.path.join(DATA_ROOT, PATH_TO_TR_IMAGES), PATH_TO_ANNOTATIONS)

In [3]:
if not os.path.exists(os.path.join(DATA_ROOT, PATH_TO_ANN_IMAGES)):
    os.mkdir(os.path.join(DATA_ROOT, PATH_TO_ANN_IMAGES))
    
tsc.annotation_set_builder(os.path.join(DATA_ROOT, PATH_TO_TR_IMAGES), PATH_TO_ANNOTATIONS, os.path.join(DATA_ROOT, PATH_TO_ANN_IMAGES), PALETTE_DATA)

FileNotFoundError: [Errno 2] No such file or directory: 'data/setr_images/00001086.png'
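
The idea behind an annotation mask is that every pixel stores the label index of the region covering it, and the palette turns those indices into colours. The sketch below shows this on a tiny grid. It is deliberately simplified: it fills bounding boxes instead of polygons, assumes integer box coordinates, and uses nested lists instead of image files. `boxes_to_index_mask` and `colorize` are hypothetical names and do not describe `tsc.annotation_set_builder`.

```python
PALETTE = [[0, 0, 0], [0, 0, 255], [255, 0, 0], [0, 255, 0]]


def boxes_to_index_mask(width, height, annotations):
    """Paint each COCO-style [x, y, w, h] box into a grid of label indices."""
    # Start with an all-background mask (index 0).
    mask = [[0] * width for _ in range(height)]
    for ann in annotations:
        x, y, w, h = ann['bbox']
        # Clip the box to the image so out-of-range boxes do not raise.
        for row in range(max(y, 0), min(y + h, height)):
            for col in range(max(x, 0), min(x + w, width)):
                mask[row][col] = ann['category_id']
    return mask


def colorize(mask, palette):
    """Replace every label index by its RGB colour from the palette."""
    return [[tuple(palette[idx]) for idx in row] for row in mask]


toy_annotations = [
    {'bbox': [1, 1, 2, 2], 'category_id': 1},  # a 'Paragraph' block
    {'bbox': [4, 0, 2, 1], 'category_id': 3},  # a 'VisualFigure' block
]
mask = boxes_to_index_mask(6, 4, toy_annotations)
for row in mask:
    print(row)
# [0, 0, 0, 0, 3, 3]
# [0, 1, 1, 0, 0, 0]
# [0, 1, 1, 0, 0, 0]
# [0, 0, 0, 0, 0, 0]

rgb = colorize(mask, PALETTE)
print(rgb[1][1])  # (0, 0, 255)
```

Later annotations overwrite earlier ones where boxes overlap, so the order of the annotation list decides which class wins in overlapping areas.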

## Create Data Split
SETR expects the data split as two `.txt` files: one listing the training examples and one listing the evaluation examples.

In [4]:
if not os.path.exists(os.path.join(DATA_ROOT, PATH_TO_SPLIT)):
    os.mkdir(os.path.join(DATA_ROOT, PATH_TO_SPLIT))

tsc.split_set(PATH_TO_ANNOTATIONS[:-5]+'-train.json', PATH_TO_ANNOTATIONS[:-5]+'-val.json', os.path.join(DATA_ROOT, PATH_TO_SPLIT, 'train.txt'), os.path.join(DATA_ROOT, PATH_TO_SPLIT, 'val.txt'))
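
As a rough illustration of what such a split file can look like, the sketch below writes one image name per line, without the file extension, from a COCO-style dictionary. The one-name-per-line layout without extension is an assumption, and `write_split_file` is a hypothetical helper; the exact format produced by `tsc.split_set` is not shown in this tutorial.

```python
import os
import tempfile


def write_split_file(coco, txt_path):
    """Write the extension-less image names of a COCO-style dict, one per line."""
    stems = sorted(os.path.splitext(img['file_name'])[0] for img in coco['images'])
    with open(txt_path, 'w') as fh:
        fh.write('\n'.join(stems) + '\n')
    return stems


toy_train = {
    'images': [
        {'id': 2, 'file_name': '00000002.tif'},
        {'id': 1, 'file_name': '00000001.tif'},
    ]
}

# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'train.txt')
    written = write_split_file(toy_train, path)
    with open(path) as fh:
        content = fh.read()

print(written)  # ['00000001', '00000002']
```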