## 1. Pre-processing  
This section prepares the raw dataset for training a **YOLOv11** model.

* **Libraries** – We import `os`, `yaml`, and Ultralytics’ helper utilities.  
* **Dataset split** – `autosplit` randomly assigns the annotated images to training,  
  validation, and test sets in roughly an 80/10/10 ratio (only images with labels are kept).  
* **`data.yaml` creation** – We write a minimal `data.yaml` that tells Ultralytics  
  where the images live, which text files list each split, and what class names exist.  
  This file is the single source of truth YOLOv11 needs to locate data at train time.


In [3]:
import os
import yaml
from ultralytics.data.utils import autosplit

In [4]:
images_path = "../data/raw/images"

## Images and Annotations Split

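Randomly divide the dataset into roughly **80 % train**, **10 % validation**, and **10 % test**. The weights act as per-image probabilities, so the resulting split sizes are approximate rather than exact.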

`annotated_only=True` ensures that only images with matching label files are kept in the split.
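
To illustrate why the proportions are only approximate, here is a minimal sketch of a weighted random assignment. It is not Ultralytics' implementation; `assign_splits` and the file names are hypothetical, and the fixed seed is an assumption made only so the sketch is repeatable.

```python
import random


def assign_splits(items, weights=(0.8, 0.1, 0.1), seed=0):
    """Assign each item independently to train/val/test with the given probabilities."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    names = ("train", "val", "test")
    picks = rng.choices(names, weights=weights, k=len(items))
    splits = {name: [] for name in names}
    for item, name in zip(items, picks):
        splits[name].append(item)
    return splits


# Hypothetical file names, for illustration only
files = [f"img_{i:04d}.jpg" for i in range(1000)]
splits = assign_splits(files)
print({name: len(members) for name, members in splits.items()})
```

Every image lands in exactly one split, but the counts fluctuate around the requested ratios, which is most noticeable on small datasets.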


In [None]:
autosplit(path=images_path,
          weights=(0.8, 0.1, 0.1),
          annotated_only=True)

## `data.yaml` Generation
Ultralytics expects a YAML configuration with:

* **path** – root directory of the dataset  
* **train / val / test** – the `.txt` file listings produced by `autosplit`  
* **names** – a dictionary mapping numeric class IDs to class names  


In [7]:
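# Dataset description read by Ultralytics at train time
data_config = {
    "names": {0: "panel"},  # class ID -> class name
    "path": "/data/raw",  # dataset root; must be the directory holding the autosplit_*.txt listings
    "train": "autosplit_train.txt",
    "val": "autosplit_val.txt",
    "test": "autosplit_test.txt"
}

# Write data.yaml next to the images directory, keeping the key order above
yaml_path = os.path.join(os.path.dirname(images_path), "data.yaml")
with open(yaml_path, "w") as f:
    yaml.dump(data_config, f, sort_keys=False)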

print("data.yaml generated successfully")

data.yaml generated successfully
