## Creating a YOLO Formatted Dataset from Images and YOLO Labels

A YOLO dataset must follow a specific directory tree: a root folder, usually named `dataset`, that contains an `images` folder and a `labels` folder, each with `train`, `test` and `val` subfolders.

The directory tree for a given YOLO dataset looks something like this:
<pre>
.
└── dataset/
    ├── images/
    │   ├── train
    │   ├── test
    │   └── val
    └── labels/
        ├── train
        ├── test
        └── val
</pre>

The `test` directory is optional, but it is recommended because it lets you evaluate the trained model's performance on images it has not seen.
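A quick way to confirm that a folder follows this layout is to look for the expected sub-directories. The snippet below is a minimal sketch; the helper name `missing_dataset_dirs` is hypothetical, and it assumes that only `train` and `val` are mandatory while `test` is optional.

```python
from pathlib import Path

# Assumption: train and val are required, test is optional
REQUIRED_SPLITS = ("train", "val")


def missing_dataset_dirs(root: Path) -> list[Path]:
    """Return the required directories that do not exist under root."""
    missing = []
    for sub in ("images", "labels"):
        for split in REQUIRED_SPLITS:
            candidate = root / sub / split
            if not candidate.is_dir():
                missing.append(candidate)
    return missing


# An empty list means every required directory is present
print(missing_dataset_dirs(Path("./dataset")))
```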

The following program applies a `80 / 10 / 10` split on the given images and their corresponding labels by default.

A `data.yaml` file should be created in order for YOLO to use the dataset.

An example `data.yaml` file would look something like this:

```
path: dataset     # The dataset root

train: images/train     # Path to train images, relative to path
test: images/test       # Path to test images, relative to path
val: images/val         # Path to val images, relative to path

names:  # Class names
  0: class0
  1: class1
  2: class2
  3: class3
  4: class4
```

### Preparation

Most of the work happens in this section:

- We need to decide what ratio the train, test and validation splits should have. As stated above, the default split is `80 / 10 / 10` for the train, test and validation sets respectively.

- We also need to count the images and labels to work out how many files go into each split.

- Lastly, using the shuffled list of images and the calculated counts, we decide which images belong to which split.
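Comparing only the number of images and labels shows that something is missing, but not which files. The sketch below is one possible way to list the mismatches by file stem; the helper name `find_unpaired` is hypothetical, and the extension set is assumed to match the one used in this notebook.

```python
from pathlib import Path

# Assumption: same image extensions as the notebook supports
IMAGE_EXTENSIONS = {".bmp", ".jpg", ".jpeg", ".png", ".tif", ".tiff", ".dng"}


def find_unpaired(images_dir: Path, labels_dir: Path) -> tuple[list[str], list[str]]:
    """Return (stems of images without a label, stems of labels without an image)."""
    image_stems = {
        p.stem for p in images_dir.iterdir() if p.suffix.lower() in IMAGE_EXTENSIONS
    }
    label_stems = {p.stem for p in labels_dir.glob("*.txt")}
    unlabeled = sorted(image_stems - label_stems)
    orphaned = sorted(label_stems - image_stems)
    return unlabeled, orphaned
```

Calling `find_unpaired(Path("./images"), Path("./labels"))` before splitting would return two empty lists when every image has exactly one matching label file.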

In [None]:
import random
from pathlib import Path

# Ratios for train, test and validation sets
TRAIN_RATIO = 0.8
TEST_RATIO = 0.1
VAL_RATIO = 0.1

# List for all supported extensions.
SUPPORTED_EXTENSIONS = ['.bmp', '.jpg', '.jpeg', '.png', '.tif', '.tiff', '.dng']

# Required paths
images_path = Path("./images")
labels_path = Path("./labels")
dataset_path = Path("./dataset")

def split_images(train_ratio: float,
                 test_ratio: float,
                 val_ratio: float,
                 images_path: Path,
                 labels_path: Path,
                 seed: int = 59) -> dict[str, list[Path]]:
    """
    Shuffles the images and assigns them to the train, test and validation splits

    Args:
        train_ratio (float): Ratio of the train split (Between 0 and 1)
        test_ratio (float): Ratio of the test split (Between 0 and 1)
        val_ratio (float): Ratio of the validation split (Between 0 and 1)
        images_path (Path): Path to the images
        labels_path (Path): Path to the labels
        seed (int): Seed for the randomiser

    Returns:
        dict[str, list[Path]]: A dictionary with keys 'train', 'test' and 'val' denoting each split's files' list.
    """

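    ratio_sum = train_ratio + test_ratio + val_ratio
    # Compare with a tolerance: a float sum such as 0.7 + 0.2 + 0.1 can differ from 1 by a rounding error
    if abs(ratio_sum - 1.0) > 1e-9:
        raise ValueError(f"Invalid ratio sum ({ratio_sum}). Sum of train, test and val ratios should be equal to one")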

    if train_ratio <= 0:
        raise ValueError("Train ratio cannot be less than or equal to 0")
    
    # Collect every image whose extension is listed in SUPPORTED_EXTENSIONS
    all_images = []
    for ext in SUPPORTED_EXTENSIONS:
        all_images.extend(list(images_path.glob(f"*{ext}")))

    # All label extensions are .txt by default
    all_labels = list(labels_path.glob("*.txt"))

    if len(all_images) != len(all_labels):
        raise ValueError(f"Length of all images ({len(all_images)}) is not equal to the length of all labels ({len(all_labels)}). Please check the dataset for unlabeled images before proceeding.")

    # Initialise randomisation
    random.seed(seed)

    # Sort first so the result does not depend on filesystem order, then shuffle all images
    all_images.sort()
    random.shuffle(all_images)

    # Calculate the count of train, test and validation groups
    n = len(all_images)
    train_count = int(n * train_ratio)
    test_count = int(n * test_ratio)
    val_count = int(n * val_ratio)

    # If the sum of all groups isn't equal to all image count add the difference to the train count
    difference = len(all_images) - (train_count + test_count + val_count)
    if difference != 0:
        train_count += difference
    
    # Get train, test and val images from all images based on their calculated counts
    train_images = all_images[:train_count]                         # Images between 0 and train_count
    val_images   = all_images[train_count:train_count + val_count]  # Images between train_count and train_count + val_count
    test_images  = all_images[train_count + val_count:]             # Images between train_count + val_count and len(all_images) - 1

    print(f"""Split dataset into: 
    Train: {train_count}
    Val: {val_count}
    Test: {test_count}""")

    return {
        "train": train_images,
        "test": test_images,
        "val": val_images 
    }

splits = split_images(train_ratio=TRAIN_RATIO,
                      test_ratio=TEST_RATIO,
                      val_ratio=VAL_RATIO,
                      images_path=images_path,
                      labels_path=labels_path)

Split dataset into: 
    Train: 3426
    Val: 428
    Test: 428
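
The counts above sum to 4282 images, and the train split is one larger than `int(4282 * 0.8)` because the leftover from rounding down is added to it. The arithmetic can be reproduced in isolation with the sketch below; `split_counts` is a hypothetical helper that mirrors the calculation inside `split_images`.

```python
def split_counts(n: int, train_ratio: float, test_ratio: float, val_ratio: float) -> tuple[int, int, int]:
    """Return (train, test, val) counts for n files, giving any remainder to train."""
    train = int(n * train_ratio)   # int() truncates, so each count is rounded down
    test = int(n * test_ratio)
    val = int(n * val_ratio)
    train += n - (train + test + val)   # leftover files go to the train split
    return train, test, val


print(split_counts(4282, 0.8, 0.1, 0.1))   # (3426, 428, 428)
```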


### Creating the required dataset directory tree

In order to proceed, we need to create the aforementioned directory tree:

<pre>
.
└── dataset/
    ├── images/
    │   ├── train
    │   ├── test
    │   └── val
    └── labels/
        ├── train
        ├── test
        └── val
</pre>

In [29]:
# Creating the dataset paths
def create_dataset_paths(sub_paths: list[str] = ["images", "labels"], splits: list[str] = ["train", "test", "val"]) -> None:
    """
    Creates the required dataset paths

    Args:
        sub_paths (list[str]): Sub paths of the dataset. ['images', 'labels'] by default
        splits (list[str]): The names of the splits. ['train', 'test', 'val'] by default

    Returns:
        None
    """
    # Set required paths
    dataset_path = Path("./dataset")

    for path in sub_paths:
        for split in splits:
            new_dir = dataset_path / path / split
            if new_dir.exists():
                print(f"Directory '{new_dir}' already exists, skipping creation")
            else:
                new_dir.mkdir(parents=True, exist_ok=True)
                print(f"Created directory: '{new_dir}'")

create_dataset_paths()

Directory 'dataset\images\train' already exists, skipping creation
Directory 'dataset\images\test' already exists, skipping creation
Directory 'dataset\images\val' already exists, skipping creation
Directory 'dataset\labels\train' already exists, skipping creation
Directory 'dataset\labels\test' already exists, skipping creation
Directory 'dataset\labels\val' already exists, skipping creation


### Copy the images and labels into their target directories

The last step of splitting the dataset is copying the images and labels to the target directories.

In [None]:
import shutil
from tqdm import tqdm   # For the progress bar

def copy_dataset_contents(dataset_path: Path, splits: dict[str, list[Path]]) -> None:
    """
    Copies the dataset contents to the target

    Args:
        dataset_path (Path): Path to the root dataset folder
        splits (dict[str, list[Path]]): Dictionary containing the dataset splits

    Returns:
        None
    """

    # Check if the provided dictionary is in the required format
    required_keys = ["train", "test", "val"]
    for key in required_keys:
        if key not in splits:
            raise ValueError(f"Missing key: {key}")
        
    # Copy all files
    for split, img_list in splits.items():
        print(f"\nProcessing {split} split:")
        for img_path in tqdm(img_list, desc=f"{split}", unit="files"):

            # Find the label with the same stem as the image
            # (labels_path is the module-level path defined earlier in the notebook)
            lbl_path = labels_path / (img_path.stem + '.txt')

            # Create the destination paths
            dst_img = dataset_path / "images" / split / img_path.name
            dst_lbl = dataset_path / "labels" / split / lbl_path.name

            # Copy the images and labels to the destination paths
            shutil.copy2(img_path, dst_img)
            shutil.copy2(lbl_path, dst_lbl)

copy_dataset_contents(dataset_path=dataset_path,
                      splits=splits)


Processing train split:


train: 100%|██████████| 3426/3426 [01:30<00:00, 37.77files/s] 



Processing test split:


test: 100%|██████████| 428/428 [00:01<00:00, 413.44files/s]



Processing val split:


val: 100%|██████████| 428/428 [00:01<00:00, 415.47files/s]
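
Once the copy has finished, one way to confirm the result is to count the files that ended up in each split folder. This is a minimal sketch rather than part of the original notebook; `count_split_files` is a hypothetical helper that assumes the `images/<split>` and `labels/<split>` layout shown earlier.

```python
from pathlib import Path


def count_split_files(dataset_root: Path,
                      splits: tuple[str, ...] = ("train", "test", "val")) -> dict[str, tuple[int, int]]:
    """Return {split: (number of images, number of labels)} for each split."""
    counts = {}
    for split in splits:
        image_dir = dataset_root / "images" / split
        label_dir = dataset_root / "labels" / split
        # Missing directories count as zero files instead of raising
        n_images = sum(1 for p in image_dir.iterdir() if p.is_file()) if image_dir.is_dir() else 0
        n_labels = len(list(label_dir.glob("*.txt"))) if label_dir.is_dir() else 0
        counts[split] = (n_images, n_labels)
    return counts


for split, (n_images, n_labels) in count_split_files(Path("./dataset")).items():
    status = "OK" if n_images == n_labels else "MISMATCH"
    print(f"{split}: {n_images} images, {n_labels} labels ({status})")
```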


### Generating the data.yaml file

YOLO models use a `data.yaml` file to find the dataset location and to map class indices to class names. We need to create one before YOLO can train on the dataset.
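
Because the class names are matched to the labels by index, it can help to check that every class index used in the label files has an entry in the class list. The sketch below assumes the usual YOLO label layout, where each non-empty line starts with an integer class index followed by the box coordinates; both helper names are hypothetical.

```python
from pathlib import Path


def collect_class_ids(labels_dir: Path) -> set[int]:
    """Return every class index that appears in the .txt label files."""
    class_ids = set()
    for label_file in labels_dir.glob("*.txt"):
        for line in label_file.read_text().splitlines():
            fields = line.split()
            if fields:   # skip blank lines
                class_ids.add(int(fields[0]))
    return class_ids


def unknown_class_ids(classes: list[str], labels_dir: Path) -> list[int]:
    """Return label class indices that have no matching entry in classes."""
    return sorted(i for i in collect_class_ids(labels_dir) if not 0 <= i < len(classes))
```

An empty list from `unknown_class_ids(classes, Path("./labels"))` suggests the class list is long enough for the labels; it does not prove that the names are in the right order.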

In [None]:
classes = ["aircraft", "armored_vehicle", "helicopter", "tank", "truck"]

def create_yolo_yaml(classes: list[str]) -> None:
    """
    Creates the data.yaml file for YOLO to use

    Args:
        classes (list[str]): List of class names in order.

    Returns:
        None
    """

    with open("data.yaml", "w") as file:
        # as_posix() keeps forward slashes in the path, even on Windows
        file.write(f"""path: {dataset_path.as_posix()}     # The dataset root

train: images/train     # Path to train images, relative to path
test: images/test       # Path to test images, relative to path
val: images/val         # Path to val images, relative to path

names:  # Class names
""")
        # YAML does not allow tabs for indentation, so indent each entry with spaces
        for index, name in enumerate(classes):
            file.write(f"  {index}: {name}\n")