## 01. Setup and split data from Combined Dataset


1. Download dataset from: [https://app.roboflow.com/facemaskdatasetunion/face-mask-combined-dataset/6](https://app.roboflow.com/facemaskdatasetunion/face-mask-combined-dataset/6)
2. Unpack the folder into `/YoloV8/data/combined/`.
3. `/YoloV8/data/combined/` should now contain: `data.yaml`, `README.dataset.txt`, `README.roboflow.txt`, `train`.
4. Run this notebook to split the original `train` data into train/valid/test sets.
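
Before running the cells, it can help to confirm that step 3 holds. The following is a minimal sketch, not part of the original notebook: the helper name `missing_entries` is hypothetical, and the relative path assumes the notebook runs from the `YoloV8` folder.

```python
from pathlib import Path

# Entries listed in step 3 of the setup instructions
EXPECTED_ENTRIES = ["data.yaml", "README.dataset.txt", "README.roboflow.txt", "train"]


def missing_entries(root, expected=EXPECTED_ENTRIES):
    """Return the expected file or folder names that are absent from `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


# Assumed location, relative to the notebook's working directory
missing = missing_entries("data/combined")
if missing:
    print("Missing from data/combined:", ", ".join(missing))
else:
    print("Dataset layout looks complete.")
```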

In [1]:
import os
import random
import shutil

# Set the seed for reproducibility
seed_value = 42
random.seed(seed_value)

# Set the path to the dataset directory; the split folders are created inside it
data_dir = 'data/combined'

In [2]:
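# Rename the original "train" folder to "data" so that "train" can hold the new split
if not os.path.exists(os.path.join(data_dir, "data")):
    os.rename(os.path.join(data_dir, "train"), os.path.join(data_dir, "data"))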
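
The split step copies a `.txt` label for every `.jpg` image, and `shutil.copy` raises `FileNotFoundError` if a label is missing. A quick pre-check along these lines can surface unpaired images first. It is a sketch only: `images_without_labels` is a hypothetical helper, and the paths assume the renamed `data/images` and `data/labels` layout.

```python
import os


def images_without_labels(images_dir, labels_dir, extension=".jpg"):
    """Return image file names that have no same-named .txt file in labels_dir."""
    unpaired = []
    for name in sorted(os.listdir(images_dir)):
        if not name.endswith(extension):
            continue
        label = os.path.splitext(name)[0] + ".txt"
        if not os.path.isfile(os.path.join(labels_dir, label)):
            unpaired.append(name)
    return unpaired


# Assumed paths after the rename; the check is skipped when the folders are absent
images_dir = os.path.join("data", "combined", "data", "images")
labels_dir = os.path.join("data", "combined", "data", "labels")
if os.path.isdir(images_dir) and os.path.isdir(labels_dir):
    print("Images without labels:", images_without_labels(images_dir, labels_dir))
```

An empty list means every image has a label and the copy loop should not fail on a missing file.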

In [3]:
# Create the split directories
os.makedirs(data_dir, exist_ok=True)
os.makedirs(os.path.join(data_dir, 'train'), exist_ok=True)
os.makedirs(os.path.join(data_dir, 'valid'), exist_ok=True)
os.makedirs(os.path.join(data_dir, 'test'), exist_ok=True)

images_dir = os.path.join(data_dir, "data", "images")
labels_dir = os.path.join(data_dir, "data", "labels")

# Set the train/validation/test split percentages
train_percent = 0.7
validation_percent = 0.1
test_percent = 0.2

# Get the list of image files, sorted so the seeded shuffle is reproducible
# (os.listdir returns entries in arbitrary order)
image_files = sorted(file for file in os.listdir(images_dir) if file.endswith('.jpg'))

# Shuffle the image files randomly
random.shuffle(image_files)

# Calculate the number of images for each split
num_images = len(image_files)
print("# of images:", num_images)

num_train = int(num_images * train_percent)
num_validation = int(num_images * validation_percent)
num_test = num_images - num_train - num_validation

# Split the image files into train/validation/test sets
train_files = image_files[:num_train]
validation_files = image_files[num_train:num_train + num_validation]
test_files = image_files[num_train + num_validation:]
    
def move_partition(partition, files):
    """Copy each image and its matching label into data_dir/<partition>.

    Despite the name, files are copied, so the originals stay in place.
    """
    partition_images_dir = os.path.join(data_dir, partition, "images")
    partition_labels_dir = os.path.join(data_dir, partition, "labels")

    # Create the destination folders once, before copying
    os.makedirs(partition_images_dir, exist_ok=True)
    os.makedirs(partition_labels_dir, exist_ok=True)

    for file in files:
        # The label shares the image's base name, with a .txt extension
        label = os.path.splitext(os.path.basename(file))[0] + ".txt"

        # Copy image
        shutil.copy(os.path.join(images_dir, file), os.path.join(partition_images_dir, file))
        # Copy label
        shutil.copy(os.path.join(labels_dir, label), os.path.join(partition_labels_dir, label))

move_partition("train", train_files)
move_partition("valid", validation_files)
move_partition("test", test_files)

print('Data split completed successfully.')

# of images: 2924
Data split completed successfully.
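
For the 2924 images reported above, the split sizes follow from the truncating `int()` calls, with the test set taking whatever remains. The snippet below reproduces that arithmetic in isolation; `split_counts` is a hypothetical helper name, not part of the notebook.

```python
def split_counts(num_images, train_percent=0.7, validation_percent=0.1):
    """Return (train, validation, test) sizes using the notebook's rounding."""
    num_train = int(num_images * train_percent)            # truncates 2046.8 to 2046
    num_validation = int(num_images * validation_percent)  # truncates 292.4 to 292
    num_test = num_images - num_train - num_validation     # remainder goes to test
    return num_train, num_validation, num_test


print(split_counts(2924))  # prints (2046, 292, 586)
```

The test share ends up at 586 / 2924, roughly 20.04%, because both truncation remainders fall into the test set.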


## Modify the YAML according to our new setup

The `data.yaml` file should already match the new folder layout. If it does not, edit it to read as follows:
``` 
train: ../train/images
val: ../valid/images
test: ../test/images

nc: 2
names: ['with_mask', 'without_mask']

roboflow:
  workspace: facemaskdatasetunion
  project: face-mask-combined-dataset
  version: 6
  license: CC BY 4.0
  url: https://universe.roboflow.com/facemaskdatasetunion/face-mask-combined-dataset/dataset/6
```
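
To confirm that the file on disk matches, the three split entries can be read back and compared. This sketch uses plain line matching instead of a YAML parser, so it only handles simple unquoted `key: value` lines like those above. `read_split_entries` is a hypothetical helper, and the `data/combined/data.yaml` path is assumed from the setup steps.

```python
import os

# Split entries expected in data.yaml after the notebook has run
EXPECTED_SPLITS = {
    "train": "../train/images",
    "val": "../valid/images",
    "test": "../test/images",
}


def read_split_entries(yaml_path, keys=("train", "val", "test")):
    """Collect top-level `key: value` entries for the given keys (no YAML parser)."""
    entries = {}
    with open(yaml_path, encoding="utf-8") as handle:
        for line in handle:
            key, sep, value = line.partition(":")
            # Indented keys keep their leading spaces, so only top-level keys match
            if sep and key in keys:
                entries[key] = value.strip()
    return entries


yaml_path = os.path.join("data", "combined", "data.yaml")  # assumed location
if os.path.isfile(yaml_path):
    found = read_split_entries(yaml_path)
    print("data.yaml matches:", found == EXPECTED_SPLITS)
```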