# Roboflow Dataset Preprocessing and General Structure

This notebook has two purposes: to give an overview of the structure of the Roboflow dataset, and to document certain files and preprocessing cells.

## Expected Directory Structure
We expect the directory structure to be in this format:

```
.
└── Dataset/Directory/
    ├── README_files/
    │   └── ...
    ├── test/
    │   ├── images/
    │   │   ├── 38.png
    │   │   └── ...
    │   └── labels/
    │       ├── 38_png.txt
    │       └── ...
    ├── train/
    │   ├── images/
    │   │   ├── 1.png
    │   │   └── ...
    │   └── labels/
    │       ├── 1_png.txt
    │       └── ...
    ├── valid/
    │   ├── images/
    │   │   ├── 15.png
    │   │   └── ...
    │   └── labels/
    │       ├── 15_png.txt
    │       └── ...
    └── data.yaml
```

This structure is required for training the Ultralytics YOLO model.
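
As a quick sanity check before training, the layout above can be verified with a few lines of Python. This is only a minimal sketch: the helper name `missing_entries` is hypothetical, and it checks only for the folders and the `data.yaml` file shown in the tree, not for their contents.

```python
from pathlib import Path


def missing_entries(root):
    """Return the expected folders/files (relative to root) that do not exist."""
    root = Path(root)
    # Every split needs both an images/ and a labels/ folder.
    expected = [
        root / split / kind
        for split in ("train", "valid", "test")
        for kind in ("images", "labels")
    ]
    missing = [p.relative_to(root).as_posix() for p in expected if not p.is_dir()]
    # data.yaml sits next to the split folders.
    if not (root / "data.yaml").is_file():
        missing.append("data.yaml")
    return missing
```

Calling `missing_entries("archive")`, for instance, returns an empty list when the dataset folder is complete, and otherwise the relative paths that still need to be created.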

## *data.yaml* Structure

*data.yaml* is a configuration file that contains some basic information about the dataset. The basic structure is:

```yaml
train: ../train/images
val: ../valid/images
test: ../test/images

nc: 5
names: ['fire_extinguisher', 'baby_face', 'hard_hat', 'backpack', 'propane_tank']
```

- The first three lines are constant; they specify the locations of the train, val and test image folders.
- `nc` is the number of classes; the class ids run from 0 to `nc - 1`.
- `names` lists the class names in class id order.

There is no need to change this as we only have 5 classes and have no intention of adding more.
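
The relationship between `nc`, `names` and the class ids can be illustrated with a short sketch. The values are copied by hand from the *data.yaml* above rather than parsed from the file, and `id_to_name` is a name chosen here for illustration.

```python
# Values copied from the data.yaml shown above.
nc = 5
names = ['fire_extinguisher', 'baby_face', 'hard_hat', 'backpack', 'propane_tank']

# The number of names must match nc.
assert len(names) == nc

# A class id is simply the position of the name in the list: 0 .. nc - 1.
id_to_name = dict(enumerate(names))

print(id_to_name[0])  # fire_extinguisher
print(id_to_name[4])  # propane_tank
```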

## The Structure of *labels* Folder

Each text file in the *labels* folder contains one line per box: the first value is the class id of the box and the following values are the properties of the box. For example:

`0 0.46015625 0.409375 0.7890625 0.6875`

The text above means that the image with the same name as the text file has only one box since there is only one line and it has a class ID of 0 (fire_extinguisher). The remaining numbers represent the properties of the box (centerx, centery, width and height, respectively).
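
To make the numbers more concrete, the sketch below converts such a line into pixel corner coordinates. It relies on the YOLO convention that the four box values are normalized to the image size; the function name `yolo_to_pixel_box` and the 640x640 image size are assumptions made for illustration only.

```python
def yolo_to_pixel_box(line, img_w, img_h):
    """Convert one label line to (class_id, x_min, y_min, x_max, y_max) in pixels."""
    parts = line.split()
    class_id = int(parts[0])
    # centerx, centery, width, height as fractions of the image size
    cx, cy, w, h = (float(v) for v in parts[1:5])
    x_min = (cx - w / 2) * img_w
    y_min = (cy - h / 2) * img_h
    x_max = (cx + w / 2) * img_w
    y_max = (cy + h / 2) * img_h
    return class_id, x_min, y_min, x_max, y_max


# Assumed image size of 640x640 pixels (not stated in the dataset description).
box = yolo_to_pixel_box("0 0.46015625 0.409375 0.7890625 0.6875", 640, 640)
print(box)
```

For the example line and the assumed image size, the box spans roughly from pixel (42, 42) to pixel (547, 482).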

## Preprocessing Dataset

Define the dirname of the *labels* folder we aim to modify.

In [None]:
import os

dirname = "archive/valid/labels" # validation labels
filenames = os.listdir(dirname)

Read each file into a `lines` variable and rewrite the label classes.

In [181]:
prev_class_id = '0' # class id to be replaced
new_class_id = '4' # new class id

for filename in filenames:
    path = os.path.join(dirname, filename)
    with open(path, 'r') as file:
        lines = file.readlines()

    with open(path, 'w') as file:
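        for line in lines:
            parts = line.split(maxsplit=1)
            # Compare the whole first field so that e.g. '1' does not also match '10'
            if parts and parts[0] == prev_class_id:
                line = new_class_id + line[len(prev_class_id):]
            file.write(line)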


Convert segmentation polygons into bounding boxes.

In [174]:
for filename in filenames:
    path = os.path.join(dirname, filename)
    with open(path, 'r') as file:
        lines = file.readlines()

    with open(path, 'w') as file:
        for line in lines:
            parts = line.split()
            if len(parts) > 5: # more than class id + 4 box values, so a polygon
                print(line)
                x = [float(v) for v in parts[1::2]] # x coordinates of the polygon
                y = [float(v) for v in parts[2::2]] # y coordinates of the polygon
                # centerx, centery, width, height; the trailing newline keeps one box per line
                line = f"{parts[0]} {(min(x) + max(x)) / 2} {(min(y) + max(y)) / 2} {max(x) - min(x)} {max(y) - min(y)}\n"
                print(line)
            file.write(line)
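
The conversion itself can be sketched as a small standalone function, which is easier to try on a single line before rewriting any files. The name `segment_to_box` is hypothetical; the function assumes a polygon line has the form `class x1 y1 x2 y2 ...` and leaves ordinary box lines untouched.

```python
def segment_to_box(line):
    """Turn a polygon label line into 'class centerx centery width height'."""
    parts = line.split()
    if len(parts) <= 5:
        return line  # already a box (or an empty line)
    xs = [float(v) for v in parts[1::2]]  # x coordinates
    ys = [float(v) for v in parts[2::2]]  # y coordinates
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    w = max(xs) - min(xs)
    h = max(ys) - min(ys)
    return f"{parts[0]} {cx} {cy} {w} {h}\n"


# A rectangle given as four polygon points.
print(segment_to_box("2 0.25 0.25 0.75 0.25 0.75 0.5 0.25 0.5\n"), end="")
# prints: 2 0.5 0.375 0.5 0.25
```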

Delete the lines with unwanted class ids.

In [None]:
class_id = '9 ' # class id to be removed

for filename in filenames:
    path = os.path.join(dirname, filename)
    with open(path, 'r') as file:
        lines = file.readlines()

    with open(path, 'w') as file:
        for line in lines:
            if line[:2] != class_id: # keep only the lines of the other classes
                file.write(line)


Delete the files with unwanted class ids.

In [None]:
class_id = '0 ' # class id to be removed

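for subfolder in ["train", "valid"]:
    imgs = f"yolo2.v2i.yolov11/{subfolder}/images"
    labels = f"yolo2.v2i.yolov11/{subfolder}/labels"
    # os.listdir() returns names in arbitrary order, so pair each label with its
    # image by name instead of zipping the two listings.
    images = {os.path.splitext(name)[0]: name for name in os.listdir(imgs)}
    for label_name in os.listdir(labels):
        stem = os.path.splitext(label_name)[0] # e.g. 38_png
        # Same stem, or the 38_png.txt <-> 38.png naming shown above (Python 3.9+)
        img_name = images.get(stem) or images.get(stem.removesuffix("_png"))
        if img_name is None:
            continue # no matching image found, leave the label alone
        path = os.path.join(labels, label_name)
        with open(path, 'r') as file:
            lines = file.readlines()
        if any(line[:2] == class_id for line in lines): # Delete if any line starts with class_id
            os.remove(os.path.join(imgs, img_name))
            os.remove(os.path.join(labels, label_name))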

Move files from valid to train, or vice versa.

In [None]:
class_id = '1 ' # class id of the files to be moved
no_class = False # if True, move files with empty labels instead
max_files = 0 # maximum number of files to move; 0 moves nothing

src = "valid"
des = "train"

imgs = f"archive/{src}/images"
labels = f"archive/{src}/labels"
c = 0 # Counter to track the number of files moved
# os.listdir() returns names in arbitrary order, so pair each label with its
# image by name instead of zipping the two listings.
images = {os.path.splitext(name)[0]: name for name in os.listdir(imgs)}
for label_name in os.listdir(labels):
    stem = os.path.splitext(label_name)[0] # e.g. 15_png
    # Same stem, or the 15_png.txt <-> 15.png naming shown above (Python 3.9+)
    img_name = images.get(stem) or images.get(stem.removesuffix("_png"))
    if img_name is None or c >= max_files:
        continue
    path = os.path.join(labels, label_name)
    with open(path, 'r') as file:
        lines = file.readlines()
    if no_class:
        selected = len(lines) == 0
    else:
        selected = any(line[:2] == class_id for line in lines)
    if selected:
        os.rename(os.path.join(imgs, img_name), os.path.join(f"archive/{des}/images", img_name))
        os.rename(os.path.join(labels, label_name), os.path.join(f"archive/{des}/labels", label_name))
        c += 1

Count the label files per class id. Each file is counted once, under the class id of its first labelled line.

In [96]:
class_ids = ['0', '1', '2', '3', '4']
count = [0] * len(class_ids)
empty = 0
for filename in filenames:
    path = os.path.join(dirname, filename)
    with open(path, 'r') as file:
        lines = file.readlines()
        if len(lines) == 0:
            empty += 1
        for line in lines:
            if line[0] in class_ids:
                count[class_ids.index(line[0])] += 1
                break # count each file only once

for ci, c in zip(class_ids, count):
    print(f'Class ID {ci}: {c} instances')
print(f'Empty files: {empty} instances')

Class ID 0: 1109 instances
Class ID 1: 176 instances
Class ID 2: 1347 instances
Class ID 3: 185 instances
Class ID 4: 184 instances
Empty files: 401 instances
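
The cell above counts each file once, so a file with several boxes contributes a single count. If the number of individual boxes per class is of interest instead, a variant along the following lines could be used. This is a sketch only: `count_instances` is a hypothetical helper, and it was not used to produce the numbers shown above.

```python
import os
from collections import Counter


def count_instances(labels_dir):
    """Count every label line per class id, plus the number of empty files."""
    counts = Counter()
    empty = 0
    for filename in os.listdir(labels_dir):
        with open(os.path.join(labels_dir, filename), 'r') as file:
            # Ignore blank lines so that a file with only whitespace counts as empty.
            lines = [line for line in file if line.strip()]
        if not lines:
            empty += 1
        for line in lines:
            counts[line.split()[0]] += 1  # the first field is the class id
    return counts, empty
```

For example, `counts, empty = count_instances(dirname)` returns a `Counter` keyed by class id string together with the number of empty label files.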


- The [Preprocessing Dataset](#preprocessing-dataset) section provides concise code snippets for specific preprocessing operations, either to fix existing issues or to add features as required.
- Datasets for preprocessing can be found on [Roboflow Universe](https://universe.roboflow.com/).