# Roboflow Dataset Preprocessing and General Structure

This notebook has two purposes: to give an overview of the structure of the Roboflow dataset, and to document certain files and preprocessing cells.

## Expected Directory Structure
We expect the dataset directory to have the following structure:

```
.
└── dataset/
    ├── train/
    │   ├── images/
    │   └── labels/
    ├── valid/
    │   ├── images/
    │   └── labels/
    ├── test/
    │   ├── images/
    │   └── labels/
    └── data.yaml
```

This structure is required for training the Ultralytics YOLO model.
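
A minimal sketch for checking a folder against the tree above before training. The helper name `missing_entries` is hypothetical and not part of any library; it only tests that the listed folders and `data.yaml` exist.

```python
from pathlib import Path

# Subfolders listed in the expected directory tree above.
EXPECTED_DIRS = [
    f"{split}/{kind}"
    for split in ("train", "valid", "test")
    for kind in ("images", "labels")
]


def missing_entries(root):
    """Return the expected folders and files that are absent under `root`."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    if not (root / "data.yaml").is_file():
        missing.append("data.yaml")
    return missing
```

Calling `missing_entries("dataset")` returns an empty list when the layout is complete.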

## *data.yaml* Structure

*data.yaml* is a configuration file that contains some basic information about the dataset. The basic structure is:

```yaml
train: ../train/images
val: ../valid/images
test: ../test/images

nc: 5
names: ['fire_extinguisher', 'baby_face', 'hard_hat', 'backpack', 'propane_tank']
```

- The first three lines are constant and specify the locations of the train, val and test image folders.
- `nc` is the number of classes; the class ids run from 0 to `nc - 1`.
- `names` lists the class names; the position of a name in the list is its class id.
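
The relationship between `nc` and `names` can be checked with a short sketch like the one below. The function `check_data_config` is a hypothetical helper, and the configuration is written as the dictionary a YAML parser would produce rather than read from disk.

```python
def check_data_config(config):
    """Return a list of problems found in a parsed data.yaml mapping."""
    problems = []
    # Keys shown in the example data.yaml above.
    for key in ("train", "val", "test", "nc", "names"):
        if key not in config:
            problems.append(f"missing key: {key}")
    if "nc" in config and "names" in config and config["nc"] != len(config["names"]):
        problems.append(
            f"nc is {config['nc']} but names has {len(config['names'])} entries"
        )
    return problems


# The example configuration above as a plain dictionary.
config = {
    "train": "../train/images",
    "val": "../valid/images",
    "test": "../test/images",
    "nc": 5,
    "names": ["fire_extinguisher", "baby_face", "hard_hat", "backpack", "propane_tank"],
}

print(check_data_config(config))
# Class ids are the positions of the names in the list.
print(dict(enumerate(config["names"])))
```

If PyYAML is installed, `yaml.safe_load` applied to the open file should yield a dictionary of this shape.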

## The Structure of *labels* Folder

Each file in the *labels* folder describes the boxes of one image, one box per line. The first field of a line is the class id of the box, and the remaining fields are the properties of the box. For example:

```
0 0.34698 0.49266666666666664 0.34998 0.8866666666666667
0 0.61648 0.49896 0.2253 0.8895466666666666
```

The label above means that the corresponding image has two boxes, because there are two lines. Each line usually follows the pattern [class x_center y_center width height], with the coordinates normalized to the range 0 to 1 relative to the image width and height.
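
As an illustration of the line format, the sketch below converts one label line into pixel corner coordinates. The function name `yolo_to_pixels` and the 640x480 image size are assumptions made for the example.

```python
def yolo_to_pixels(line, img_w, img_h):
    """Convert one label line to (class_id, x_min, y_min, x_max, y_max) in pixels."""
    class_id, xc, yc, w, h = line.split()
    # Scale the normalized values by the image size.
    xc, w = float(xc) * img_w, float(w) * img_w
    yc, h = float(yc) * img_h, float(h) * img_h
    # The label stores the box center, so the corners are half a size away.
    return int(class_id), xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2


# Hypothetical 640x480 image with a box centered in the middle.
print(yolo_to_pixels("0 0.5 0.5 0.2 0.4", 640, 480))
```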

## Preprocessing Dataset
This section is intended to provide some quick code snippets that may be of use to someone pre-processing data.

**Note:** This section is only useful if you want to pre-process a large amount of data in bulk. Otherwise, use [Roboflow](https://blog.roboflow.com/getting-started-with-roboflow/) to pre-process the data you want to add. **Skip this section if you are not adding or pre-processing data for your training data set.**

Define the root directory of the dataset whose *labels* folders we aim to modify.

In [None]:
import os

DIR = "archive"

Read each label file into the `lines` variable and rewrite it, replacing one class id with another.

In [None]:
prev_class_id = '1' # class id to be replaced
new_class_id = '4' # new class id

for subfolder in ["train", "valid"]:
    dirname = f"{DIR}/{subfolder}/labels"
    filenames = os.listdir(dirname)
    for filename in filenames:
        path = os.path.join(dirname, filename)
        with open(path, 'r') as file:
            lines = file.readlines()

        with open(path, 'w') as file:
            for line in lines:
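                fields = line.split()
                # Compare the whole first field so that '1' does not also match '10', '11', ...
                if fields and fields[0] == prev_class_id:
                    line = ' '.join([new_class_id] + fields[1:]) + '\n'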
                file.write(line)
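
The cell above replaces one class id per run, so swapping two ids (1 to 4 and 4 to 1) would need a temporary id in between. A possible alternative, sketched here with a hypothetical helper `remap_class_ids`, applies a whole mapping in a single pass over the lines of a file.

```python
def remap_class_ids(lines, mapping):
    """Return label lines with class ids replaced according to `mapping`."""
    out = []
    for line in lines:
        fields = line.split()
        # Only the first field is the class id; the rest is left untouched.
        if fields and fields[0] in mapping:
            fields[0] = mapping[fields[0]]
            line = " ".join(fields) + "\n"
        out.append(line)
    return out


lines = ["1 0.5 0.5 0.2 0.2\n", "4 0.1 0.1 0.05 0.05\n", "10 0.3 0.3 0.1 0.1\n"]
# Swap classes 1 and 4; class 10 is not in the mapping and stays as it is.
print(remap_class_ids(lines, {"1": "4", "4": "1"}))
```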


Convert segmentation polygons into bounding boxes.

In [None]:
for subfolder in ["train", "valid"]:
    dirname = f"{DIR}/{subfolder}/labels"
    filenames = os.listdir(dirname)
    for filename in filenames:
        path = os.path.join(dirname, filename)
        with open(path, 'r') as file:
            lines = file.readlines()

        with open(path, 'w') as file:
            for line in lines:
                fields = line.split()
                if len(fields) > 5:  # more than 5 fields means a polygon, not a box
                    # After the class id, the fields alternate between x and y
                    x = [float(v) for v in fields[1::2]]
                    y = [float(v) for v in fields[2::2]]
                    print(line)
                    # Bounding box of the polygon: x_center y_center width height
                    line = f"{fields[0]} {(min(x) + max(x)) / 2} {(min(y) + max(y)) / 2} {max(x) - min(x)} {max(y) - min(y)}\n"
                    print(line)
                file.write(line)
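
The conversion itself can be isolated as a pure function, which makes it easy to try on a single line before rewriting any files. This is a sketch; the name `polygon_to_bbox` is hypothetical.

```python
def polygon_to_bbox(line):
    """Turn a polygon label line into a box line; box lines pass through unchanged."""
    fields = line.split()
    if len(fields) <= 5:
        return line
    # After the class id, the fields alternate between x and y coordinates.
    xs = [float(v) for v in fields[1::2]]
    ys = [float(v) for v in fields[2::2]]
    x_center = (min(xs) + max(xs)) / 2
    y_center = (min(ys) + max(ys)) / 2
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    return f"{fields[0]} {x_center} {y_center} {width} {height}\n"


# A rectangle given as four polygon points.
print(polygon_to_bbox("2 0.1 0.2 0.5 0.2 0.5 0.6 0.1 0.6\n"))
```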


Delete the lines with unwanted class ids.

In [None]:
class_id = '9' # class id to be removed

for subfolder in ["train", "valid"]:
    dirname = f"{DIR}/{subfolder}/labels"
    filenames = os.listdir(dirname)
    for filename in filenames:
        path = os.path.join(dirname, filename)
        with open(path, 'r') as file:
            lines = file.readlines()

        with open(path, 'w') as file:
            for line in lines:
                # Keep the line unless its first field is exactly class_id
                if line.split()[:1] != [class_id]:
                    file.write(line)


Delete the images and label files that contain an unwanted class id.

In [None]:
class_id = '0' # class id to be removed

for subfolder in ["train", "valid"]:
    imgs = f"{DIR}/{subfolder}/images"
    labels = f"{DIR}/{subfolder}/labels"
    # Pair each label with its image by file stem; os.listdir returns entries in arbitrary order
    img_by_stem = {os.path.splitext(name)[0]: name for name in os.listdir(imgs)}
    for label_name in os.listdir(labels):
        path = os.path.join(labels, label_name)
        with open(path, 'r') as file:
            lines = file.readlines()
        if any(line.split()[:1] == [class_id] for line in lines): # Delete if any line has class_id
            img_name = img_by_stem.get(os.path.splitext(label_name)[0])
            if img_name is not None:
                os.remove(os.path.join(imgs, img_name))
            os.remove(os.path.join(labels, label_name))
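
An image and its label file share the same file stem, so a quick consistency check before deleting or moving anything can be useful. The sketch below, using a hypothetical helper `unpaired_files`, lists the stems that appear in only one of the two folders. An image without a label file may be intentional (for example a background image), so the result is a hint rather than an error report.

```python
import os


def unpaired_files(imgs_dir, labels_dir):
    """Return (image stems without a label, label stems without an image)."""
    img_stems = {os.path.splitext(name)[0] for name in os.listdir(imgs_dir)}
    label_stems = {os.path.splitext(name)[0] for name in os.listdir(labels_dir)}
    return sorted(img_stems - label_stems), sorted(label_stems - img_stems)
```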

Analyse the distribution of class ids.

In [None]:
for subfolder in ["train", "valid"]:
    dirname = f"{DIR}/{subfolder}/labels"
    filenames = os.listdir(dirname)

    class_ids = ['0', '1', '2', '3', '4']
    count = [0] * len(class_ids)
    empty = 0
    for filename in filenames:
        path = os.path.join(dirname, filename)
        with open(path, 'r') as file:
            lines = file.readlines()
        if len(lines) == 0:
            empty += 1
        # Each file is counted once, under the class of its first line with a known class id
        for line in lines:
            fields = line.split()
            if fields and fields[0] in class_ids:
                count[class_ids.index(fields[0])] += 1
                break

    # Print the results
    print(f'{subfolder} folder:', end='\n\n')
    for ci, c in zip(class_ids, count):
        print(f'Class ID {ci}: {c} files', end='\n\n')
    print(f'Empty files: {empty}', end='\n\n')
    
    if subfolder == "train":
        print('-------------------------', end='\n\n')
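
The cell above stops at the first recognised line of each file, so it counts files rather than boxes. As a complement, the following sketch counts every box per class id in one *labels* folder; the function name `count_boxes` is hypothetical.

```python
import os
from collections import Counter


def count_boxes(labels_dir):
    """Count the boxes (label lines) per class id in one labels folder."""
    counts = Counter()
    for filename in os.listdir(labels_dir):
        with open(os.path.join(labels_dir, filename)) as file:
            for line in file:
                fields = line.split()
                if fields:  # skip blank lines
                    counts[fields[0]] += 1
    return counts
```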

Move files from train to valid, or vice versa by swapping `src` and `des`.

In [None]:
DIR = "archive" # Dataset root directory to move files within
class_id = '0' # class id of the files to be moved
no_class = False # True if you want to move files with no class, False if you want to move files with a specific class
max_files = 300 # Maximum number of image/label pairs to move

src = "train" # Source folder to move files from
des = "valid" # Destination folder to move files to

imgs = f"{DIR}/{src}/images"
labels = f"{DIR}/{src}/labels"

# Pair each label with its image by file stem; os.listdir returns entries in arbitrary order
img_by_stem = {os.path.splitext(name)[0]: name for name in os.listdir(imgs)}

c = 0 # Counter to track the number of files moved
for label_name in os.listdir(labels):
    img_name = img_by_stem.get(os.path.splitext(label_name)[0])
    if img_name is None or c >= max_files:
        continue
    path = os.path.join(labels, label_name)
    with open(path, 'r') as file:
        lines = file.readlines()
    if no_class:
        move = len(lines) == 0
    else:
        move = any(line.split()[:1] == [class_id] for line in lines)
    if move:
        os.rename(os.path.join(imgs, img_name), os.path.join(f"{DIR}/{des}/images", img_name))
        os.rename(os.path.join(labels, label_name), os.path.join(f"{DIR}/{des}/labels", label_name))
        c += 1

- The [Preprocessing Dataset](#preprocessing-dataset) section provides concise code snippets for specific preprocessing operations, either to fix existing issues or to add features as required.
- The main dataset that we use for training the model is on Dropbox.
- There are also other datasets that we can merge with our main dataset. They can be found on [Roboflow Universe](https://universe.roboflow.com/), pre-processed, and then merged.