# Merging Multiple YOLO v11 Datasets into One

This notebook walks through merging multiple YOLO v11 datasets into a single dataset. Class names are collected across all datasets and assigned new unique IDs so that labels from different sources do not collide, and the output keeps the `train/` and `valid/` layout, each with `images/` and `labels/` subfolders.

## Overview

1. **Setup and Imports**
2. **Utility Functions**
3. **Collecting Unique Classes**
4. **Merging Datasets**
5. **Generating Final `data.yaml`**
6. **Running the Merge Process**


In [3]:
import os
import shutil

import yaml  # PyYAML


In [4]:
from dotenv import load_dotenv
from roboflow import Roboflow

# Load ROBOFLOW_API_KEY (and any other settings) from a local .env file
load_dotenv()

# Create a 'datasets' folder if it doesn't exist
datasets_folder = "datasets"
os.makedirs(datasets_folder, exist_ok=True)

# Fail early with a clear message if the API key is missing
api_key = os.getenv("ROBOFLOW_API_KEY")
if not api_key:
    raise RuntimeError("ROBOFLOW_API_KEY is not set; add it to your environment or .env file.")
rf = Roboflow(api_key=api_key)

# Datasets to download, as (workspace, project, version) tuples
datasets = [
    ("ghost-gsj7h", "broken-pole", 3),
    ("mytyolov8", "mytest-hrswj", 2),
    ("september-23-2024-flood-dataset", "flood-september-23-dataset", 1),
    ("hackathons-xbnwb", "fallen_trees_improve_1-1185-dkpus-5ufzr", 1)
]

for workspace, project_name, version_number in datasets:
    project = rf.workspace(workspace).project(project_name)
    version = project.version(version_number)
    # Each dataset is exported in YOLO v11 format into its own subfolder
    dataset_path = os.path.join(datasets_folder, project_name)
    version.download("yolov11", location=dataset_path)
    print(f"Dataset '{project_name}' downloaded to: {dataset_path}")


loading Roboflow workspace...
loading Roboflow project...
Downloading Dataset Version Zip in datasets/broken-pole to yolov11:: 100%|██████████| 11588/11588 [00:00<00:00, 82990.57it/s]
Extracting Dataset Version Zip to datasets/broken-pole in yolov11:: 100%|██████████| 412/412 [00:00<00:00, 14004.01it/s]
Dataset 'broken-pole' downloaded to: datasets/broken-pole
loading Roboflow workspace...
loading Roboflow project...
Downloading Dataset Version Zip in datasets/mytest-hrswj to yolov11:: 100%|██████████| 106200/106200 [00:02<00:00, 53091.20it/s]
Extracting Dataset Version Zip to datasets/mytest-hrswj in yolov11:: 100%|██████████| 4780/4780 [00:00<00:00, 16166.08it/s]
Dataset 'mytest-hrswj' downloaded to: datasets/mytest-hrswj
loading Roboflow workspace...
loading Roboflow project...
Downloading Dataset Version Zip in datasets/flood-september-23-dataset to yolov11:: 100%|██████████| 146635/146635 [00:02<00:00, 67407.44it/s]
Extracting Dataset Version Zip to datasets/flood-september-23-dataset in yolov11:: 100%|██████████| 4275/4275 [00:00<00:00, 12482.44it/s]
Dataset 'flood-september-23-dataset' downloaded to: datasets/flood-september-23-dataset
loading Roboflow workspace...
loading Roboflow project...
Downloading Dataset Version Zip in datasets/fallen_trees_improve_1-1185-dkpus-5ufzr to yolov11:: 100%|██████████| 68223/68223 [00:00<00:00, 86315.59it/s]
Extracting Dataset Version Zip to datasets/fallen_trees_improve_1-1185-dkpus-5ufzr in yolov11:: 100%|██████████| 1779/1779 [00:00<00:00, 12688.66it/s]
Dataset 'fallen_trees_improve_1-1185-dkpus-5ufzr' downloaded to: datasets/fallen_trees_improve_1-1185-dkpus-5ufzr

## Utility Functions

We'll define helper functions to load and save YAML files and to discover the dataset directories under a root folder.


In [5]:
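def load_yaml(yaml_path):
    """
    Load a YAML file and return its contents (an empty dict if the file is empty).
    """
    with open(yaml_path, 'r') as file:
        # safe_load returns None for an empty file; fall back to {} so callers can use .get()
        return yaml.safe_load(file) or {}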

def save_yaml(data, yaml_path):
    """
    Save data to a YAML file.
    """
    with open(yaml_path, 'w') as file:
        yaml.dump(data, file, sort_keys=False)

def get_all_datasets(datasets_root):
    """
    Discover all dataset directories within the root datasets directory.
    Assumes each subdirectory is a separate dataset.
    """
    datasets = [os.path.join(datasets_root, d) for d in os.listdir(datasets_root) 
                if os.path.isdir(os.path.join(datasets_root, d))]
    print(f"Found {len(datasets)} datasets to merge.")
    return datasets
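
`os.listdir` returns entries in arbitrary order, and because class IDs are later assigned by first appearance, the discovery order influences which ID each class receives. The following is a minimal sketch of a deterministic variant, assuming you want alphabetical order and only folders that contain a `data.yaml`. The name `find_datasets_sorted` is hypothetical and not part of this notebook.

```python
import os

def find_datasets_sorted(datasets_root):
    """Return dataset folders under datasets_root in alphabetical order.

    Hypothetical helper: only subdirectories containing a data.yaml are kept.
    """
    found = []
    for name in sorted(os.listdir(datasets_root)):
        path = os.path.join(datasets_root, name)
        if os.path.isdir(path) and os.path.isfile(os.path.join(path, "data.yaml")):
            found.append(path)
    return found
```

With a fixed discovery order, rerunning the merge on another machine should produce the same class-to-ID mapping, which matters if you retrain or fine-tune against the merged dataset later.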


## Collecting Unique Classes

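Each dataset numbers its classes from 0, so the same ID can mean different things in different datasets. To avoid collisions, we'll collect every class name across all datasets and assign each distinct name a new ID in order of first appearance. Classes that share a name across datasets are treated as the same class.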


In [6]:
def collect_unique_classes(datasets):
    """
    Collect all unique class names across all datasets and assign new unique IDs.

    Args:
        datasets (list): List of dataset directory paths.

    Returns:
        dict: Mapping from class names to unique IDs.
    """
    unique_classes = []
    for dataset in datasets:
        data_yaml_path = os.path.join(dataset, 'data.yaml')
        if not os.path.exists(data_yaml_path):
            print(f"Warning: data.yaml not found in {dataset}. Skipping.")
            continue
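        data = load_yaml(data_yaml_path) or {}
        classes = data.get('names', [])
        # 'names' may be a list or an {id: name} mapping; normalize to a list ordered by ID
        if isinstance(classes, dict):
            classes = [classes[k] for k in sorted(classes, key=int)]
        for cls in classes:
            if cls not in unique_classes:
                unique_classes.append(cls)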
    class_to_id = {cls: idx for idx, cls in enumerate(unique_classes)}
    print(f"Total unique classes collected: {len(class_to_id)}")
    return class_to_id
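
To make the ID assignment concrete, here is a small standalone sketch of the same first-appearance rule applied to in-memory lists rather than `data.yaml` files. The function name `build_class_map` and the class names are hypothetical and chosen only for illustration.

```python
def build_class_map(names_per_dataset):
    """Assign IDs to class names in order of first appearance."""
    class_to_id = {}
    for names in names_per_dataset:
        for name in names:
            if name not in class_to_id:
                class_to_id[name] = len(class_to_id)
    return class_to_id

# Hypothetical class lists for two datasets; "pole" appears in both.
example = build_class_map([["pole", "tree"], ["flood", "pole"]])
print(example)  # {'pole': 0, 'tree': 1, 'flood': 2}
```

Note that "pole" keeps a single ID even though it is class 0 in the first list and class 1 in the second; that difference is what the per-dataset remapping in the merge step has to account for.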


## Merging Datasets

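This section copies the images and labels from every dataset into the final dataset structure and rewrites the class ID on each label line to the new unique ID. The merged dataset has only `train` and `valid` splits, so images from a source `test` split are added to `valid`. Files whose names collide with ones already copied are renamed with a numeric suffix.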


In [7]:
def merge_datasets(datasets_root, output_root, class_to_id):
    """
    Merge multiple YOLO v11 datasets into a single dataset.

    Args:
        datasets_root (str): Path to the folder containing multiple datasets.
        output_root (str): Path to the output merged dataset folder.
        class_to_id (dict): Mapping from class names to unique IDs.
    """
    datasets = get_all_datasets(datasets_root)
    if not datasets:
        print("No datasets found to merge.")
        return
    
    # Prepare output directories
    splits = ['train', 'valid']
    output_dirs = {}
    for split in splits:
        images_path = os.path.join(output_root, split, 'images')
        labels_path = os.path.join(output_root, split, 'labels')
        os.makedirs(images_path, exist_ok=True)
        os.makedirs(labels_path, exist_ok=True)
        output_dirs[split] = {'images': images_path, 'labels': labels_path}
    
    # To handle potential filename conflicts, keep track of filenames
    existing_filenames = set()
    
    for dataset in datasets:
        print(f"\nMerging dataset: {dataset}")
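        data_yaml_path = os.path.join(dataset, 'data.yaml')
        if not os.path.exists(data_yaml_path):
            print(f"Warning: data.yaml not found in {dataset}. Skipping this dataset.")
            continue
        data = load_yaml(data_yaml_path) or {}
        dataset_classes = data.get('names', [])
        if not dataset_classes:
            print(f"Warning: No class names found in {data_yaml_path}. Skipping this dataset.")
            continue
        # 'names' may be a list or an {id: name} mapping; build (old_id, name) pairs either way
        if isinstance(dataset_classes, dict):
            id_name_pairs = sorted((int(k), v) for k, v in dataset_classes.items())
        else:
            id_name_pairs = list(enumerate(dataset_classes))
        # Map this dataset's class IDs (as strings, as read from label files) to the merged IDs
        old_to_new = {str(idx): class_to_id[cls] for idx, cls in id_name_pairs}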
        
        # Directories to process: train, valid, test (if exists)
        for split in ['train', 'valid', 'test']:
            split_dir = os.path.join(dataset, split)
            if not os.path.exists(split_dir):
                if split != 'test':  # 'test' is optional
                    print(f"Warning: {split_dir} does not exist. Skipping this split.")
                continue
            images_dir = os.path.join(split_dir, 'images')
            labels_dir = os.path.join(split_dir, 'labels')
            if not os.path.exists(images_dir) or not os.path.exists(labels_dir):
                print(f"Warning: Missing images or labels in {split_dir}. Skipping this split.")
                continue
            
            # Determine target split: the merged dataset only has train and valid,
            # so any test images are folded into valid
            target_split = 'valid' if split == 'test' else split
            target_images = output_dirs[target_split]['images']
            target_labels = output_dirs[target_split]['labels']
            
            # Process images
            for img_name in os.listdir(images_dir):
                src_img_path = os.path.join(images_dir, img_name)
                if not os.path.isfile(src_img_path):
                    continue
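                # Handle duplicate filenames by renaming. Stems (names without the
                # extension) are tracked so that e.g. a.jpg and a.png cannot end up
                # sharing the same label file a.txt.
                base_name, ext = os.path.splitext(img_name)
                new_stem = base_name
                counter = 1
                while new_stem in existing_filenames:
                    new_stem = f"{base_name}_{counter}"
                    counter += 1
                existing_filenames.add(new_stem)
                new_img_name = new_stem + ext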
                dst_img_path = os.path.join(target_images, new_img_name)
                shutil.copy2(src_img_path, dst_img_path)
                
                # Process corresponding label
                label_name = base_name + '.txt'
                src_label_path = os.path.join(labels_dir, label_name)
                if os.path.exists(src_label_path):
                    with open(src_label_path, 'r') as f:
                        lines = f.readlines()
                    new_label_lines = []
                    for line in lines:
                        parts = line.strip().split()
                        if len(parts) < 1:
                            continue
                        old_class = parts[0]
                        if old_class not in old_to_new:
                            print(f"Warning: Class {old_class} not found in mapping for {dataset}. Skipping.")
                            continue
                        new_class = old_to_new[old_class]
                        new_line = ' '.join([str(new_class)] + parts[1:])
                        new_label_lines.append(new_line)
                    # Write to new label file with the new image name
                    new_label_name = os.path.splitext(new_img_name)[0] + '.txt'
                    dst_label_path = os.path.join(target_labels, new_label_name)
                    with open(dst_label_path, 'w') as f:
                        # An empty label file marks a background image; avoid writing a stray blank line
                        if new_label_lines:
                            f.write('\n'.join(new_label_lines) + '\n')
                else:
                    print(f"Warning: Label file {src_label_path} does not exist for image {src_img_path}.")
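
The core of the merge is the per-line class remapping. As a minimal sketch, the same logic can be isolated into a pure function that is easy to test without touching the filesystem. The name `remap_label_lines` and the sample mapping are hypothetical; the behavior mirrors the loop above, where blank lines and lines with an unknown class ID are dropped.

```python
def remap_label_lines(lines, old_to_new):
    """Rewrite the class ID (first field) of each YOLO label line.

    Lines whose class ID is not in old_to_new are dropped, as are blank lines.
    """
    remapped = []
    for line in lines:
        parts = line.split()
        if not parts or parts[0] not in old_to_new:
            continue
        remapped.append(" ".join([str(old_to_new[parts[0]])] + parts[1:]))
    return remapped

# Hypothetical mapping: local class 0 becomes 3, local class 1 becomes 0.
old_to_new = {"0": 3, "1": 0}
labels = ["0 0.5 0.5 0.2 0.1", "1 0.3 0.4 0.1 0.1", "", "7 0.1 0.1 0.1 0.1"]
print(remap_label_lines(labels, old_to_new))
# ['3 0.5 0.5 0.2 0.1', '0 0.3 0.4 0.1 0.1']
```

Only the first field changes; everything after it is passed through untouched, so the same approach should work for both bounding-box and polygon label lines.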


## Generating Final `data.yaml`

After merging all datasets, we'll create a final `data.yaml` file that points to the new training and validation image directories and records the class count (`nc`) along with the consolidated list of class names, ordered by ID.


In [8]:
def generate_final_yaml(output_root, class_to_id):
    """
    Generate the final data.yaml file for the merged dataset.

    Args:
        output_root (str): Path to the output merged dataset folder.
        class_to_id (dict): Mapping from class names to unique IDs.
    """
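    final_data_yaml = {
        # Paths are relative to the merged dataset root (output_root)
        'train': 'train/images',
        'val': 'valid/images',
        'nc': len(class_to_id),
        # Order names by ID so that index i in 'names' is class ID i
        'names': sorted(class_to_id, key=class_to_id.get)
    }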
    # Save data.yaml
    data_yaml_path = os.path.join(output_root, 'data.yaml')
    save_yaml(final_data_yaml, data_yaml_path)
    print(f"\nFinal data.yaml created at {data_yaml_path}")
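
Label files refer to classes by their index in `names`, so the list order has to match the IDs in `class_to_id`. The sketch below, with the hypothetical name `names_in_id_order`, builds the list explicitly by ID and rejects mappings whose IDs are not contiguous from 0. It is an optional safeguard, not something the notebook requires.

```python
def names_in_id_order(class_to_id):
    """Return class names sorted by ID, checking that IDs are 0..n-1."""
    ids = sorted(class_to_id.values())
    if ids != list(range(len(ids))):
        raise ValueError(f"Class IDs must be contiguous from 0, got {ids}")
    return [name for name, _ in sorted(class_to_id.items(), key=lambda kv: kv[1])]

# Hypothetical mapping whose insertion order differs from its ID order.
print(names_in_id_order({"tree": 1, "pole": 0, "flood": 2}))
# ['pole', 'tree', 'flood']
```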


## Running the Merge Process

Specify the paths to your datasets and the desired output directory. Then, execute the merging process.


In [9]:
# Define the root directory containing all individual datasets
datasets_root = 'datasets'  # Replace with your datasets folder path

# Define the output directory for the merged dataset
output_root = 'final_dataset'  # Replace with your desired output path

# Ensure the output directory exists
os.makedirs(output_root, exist_ok=True)

# Step 1: Discover all datasets
datasets = get_all_datasets(datasets_root)

# Step 2: Collect all unique classes
class_to_id = collect_unique_classes(datasets)

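# Step 3: Merge datasets (merge_datasets rediscovers the datasets under datasets_root,
# which is why "Found N datasets to merge." is printed a second time)
merge_datasets(datasets_root, output_root, class_to_id)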

# Step 4: Generate final data.yaml
generate_final_yaml(output_root, class_to_id)

print("\nMerging process completed successfully!")


Found 4 datasets to merge.
Total unique classes collected: 5
Found 4 datasets to merge.

Merging dataset: datasets/mytest-hrswj

Merging dataset: datasets/flood-september-23-dataset

Merging dataset: datasets/broken-pole

Merging dataset: datasets/fallen_trees_improve_1-1185-dkpus-5ufzr

Final data.yaml created at final_dataset/data.yaml

Merging process completed successfully!
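
The success message only indicates that the script ran to the end, so it can be worth checking the result before training. Below is a minimal sketch of such a check, assuming the `train`/`valid` layout produced above; the function name `check_merged_dataset` is hypothetical. It reports label files without a matching image and label lines whose class ID is not an integer below the class count.

```python
import os

def check_merged_dataset(output_root, num_classes, splits=("train", "valid")):
    """Collect basic consistency problems in a merged YOLO dataset.

    Returns a list of human-readable problem strings (empty if none found).
    """
    problems = []
    for split in splits:
        images_dir = os.path.join(output_root, split, "images")
        labels_dir = os.path.join(output_root, split, "labels")
        if not os.path.isdir(images_dir) or not os.path.isdir(labels_dir):
            problems.append(f"{split}: missing images or labels directory")
            continue
        # Image and label files are matched by name without the extension
        image_stems = {os.path.splitext(f)[0] for f in os.listdir(images_dir)}
        for label_file in sorted(os.listdir(labels_dir)):
            stem = os.path.splitext(label_file)[0]
            if stem not in image_stems:
                problems.append(f"{split}: {label_file} has no matching image")
            with open(os.path.join(labels_dir, label_file)) as f:
                for line_no, line in enumerate(f, start=1):
                    parts = line.split()
                    if not parts:
                        continue
                    # The first field must be an integer class ID below num_classes
                    if not parts[0].isdigit() or int(parts[0]) >= num_classes:
                        problems.append(
                            f"{split}: {label_file}:{line_no} has class ID {parts[0]!r}"
                        )
    return problems
```

For the run above, this could be called as `check_merged_dataset(output_root, len(class_to_id))`; an empty list means none of these particular problems were found, not that the dataset is otherwise correct.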
