# Split to Inpainting

This notebook reorganizes a pre-split damaged-image dataset (`v6-split-dataset`) into a standard layout for image inpainting. It copies each damaged image and its mask into `inpainting-dataset/{train,val,test}/{img,mask}`, keeping the original filenames so the damage type label stays in the name. Images with no matching mask are skipped and counted. The steps are identifier extraction, directory setup, image-to-mask matching, copying with tqdm progress bars, and a sanity check of the result.

## Imports and Configuration

In [1]:
import os
import shutil
from pathlib import Path
from tqdm import tqdm

# Base paths
SOURCE_BASE = Path("../../data/v6-split-dataset")
DEST_BASE = Path("../../data/inpainting-dataset")

## Helper Functions

In [2]:
def extract_identifier(filename):
    """Extract base identifier from filename like 'abc123-scratch.png' or 'abc123-scratch-mask.png'."""
    return filename.split("-")[0]

def copy_file(src_path, dest_path):
    """Copy file from src to dest, creating directories if needed."""
    dest_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src_path, dest_path)
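
The naming convention these helpers rely on can be sketched as follows. This is a minimal illustration, not the notebook's own parser: `parse_damaged_name` and `expected_mask_name` are hypothetical names, and the sketch assumes the damage type is a single hyphen-free token at the end of the name, so it splits from the right. `extract_identifier` above splits on the first hyphen instead, and the two agree only when the filename contains exactly one hyphen.

```python
from pathlib import PurePath

def parse_damaged_name(filename):
    """Split '<identifier>-<damage_type>.png' into (identifier, damage_type).

    Assumption: the damage type contains no hyphen, so the split is taken
    from the right. A name without any hyphen raises ValueError.
    """
    stem = PurePath(filename).stem              # drop the '.png' extension
    identifier, damage_type = stem.rsplit("-", 1)
    return identifier, damage_type

def expected_mask_name(filename):
    """Build the mask filename that should accompany a damaged image."""
    identifier, damage_type = parse_damaged_name(filename)
    return f"{identifier}-{damage_type}-mask.png"

print(parse_damaged_name("abc123-scratch.png"))
print(expected_mask_name("abc123-scratch.png"))
```

Splitting from the right keeps an identifier such as `a-b` intact, which matters only if the source dataset actually contains hyphenated identifiers.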

## Directory Setup

In [3]:
def create_inpainting_dirs(base_path):
    """Create the <base_path>/<split>/{img,mask} directory structure."""
    for split in ['train', 'val', 'test']:
        (base_path / split / 'img').mkdir(parents=True, exist_ok=True)
        (base_path / split / 'mask').mkdir(parents=True, exist_ok=True)

create_inpainting_dirs(DEST_BASE)

## File Mapping and Copying Logic

In [4]:
def process_split(split):
    """Process a single data split (train, val, or test) and copy files into the new format."""
    print(f"\nProcessing split: {split}")

    src_split_path = SOURCE_BASE / split
    src_img_dir = src_split_path / "img"
    src_mask_dir = src_split_path / "mask"

    dest_img_dir = DEST_BASE / split / "img"
    dest_mask_dir = DEST_BASE / split / "mask"

    processed = 0
    missing_masks = 0

    for img_file in tqdm(list(src_img_dir.glob("*.png")), desc=f"{split.upper()}"):
        identifier = extract_identifier(img_file.name)
        damage_type = img_file.name.split("-")[1].replace(".png", "")

        mask_file = src_mask_dir / f"{identifier}-{damage_type}-mask.png"
        if not mask_file.exists():
            missing_masks += 1
            continue

        # Keep the original filenames so the damage type stays in the name
        dest_img_path = dest_img_dir / img_file.name
        dest_mask_path = dest_mask_dir / mask_file.name

        copy_file(img_file, dest_img_path)
        copy_file(mask_file, dest_mask_path)

        processed += 1

    print(f"Copied {processed} image-mask pairs.")
    if missing_masks > 0:
        print(f"Skipped {missing_masks} due to missing masks.")

## Execute for All Splits

In [5]:
for split in ['train', 'val', 'test']:
    process_split(split)


Processing split: train


TRAIN: 100%|██████████| 24206/24206 [09:19<00:00, 43.23it/s]


Copied 24205 image-mask pairs.
Skipped 1 due to missing masks.

Processing split: val


VAL: 100%|██████████| 5187/5187 [01:59<00:00, 43.30it/s]


Copied 5187 image-mask pairs.

Processing split: test


TEST: 100%|██████████| 5187/5187 [02:00<00:00, 43.12it/s]

Copied 5186 image-mask pairs.
Skipped 1 due to missing masks.
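
The run above reports one skipped image in `train` and one in `test`, but not which files they were. A minimal sketch for listing them is shown below. `find_images_without_masks` is a hypothetical helper, and it assumes the mask name is the image stem plus `-mask.png`. That matches the rule in `process_split` only for filenames with exactly one hyphen, so a name with extra hyphens could be reported differently here than it was handled there.

```python
import tempfile
from pathlib import Path

def find_images_without_masks(img_dir, mask_dir, mask_suffix="-mask.png"):
    """Return sorted names of PNG images in img_dir that have no
    '<stem><mask_suffix>' file in mask_dir."""
    mask_names = {p.name for p in Path(mask_dir).glob("*.png")}
    return sorted(
        p.name
        for p in Path(img_dir).glob("*.png")
        if f"{p.stem}{mask_suffix}" not in mask_names
    )

# Tiny self-contained demo on a throwaway directory (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    img_dir = Path(tmp) / "img"
    mask_dir = Path(tmp) / "mask"
    img_dir.mkdir()
    mask_dir.mkdir()
    (img_dir / "a1-scratch.png").touch()
    (img_dir / "b2-dust.png").touch()
    (mask_dir / "a1-scratch-mask.png").touch()
    demo_missing = find_images_without_masks(img_dir, mask_dir)
    print(demo_missing)
```

Against this dataset the call would be `find_images_without_masks(SOURCE_BASE / "train" / "img", SOURCE_BASE / "train" / "mask")`, and likewise for `test`.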





## Sanity Check

In [6]:
def count_files(directory):
    """Count the PNG files directly inside a directory."""
    return len(list(Path(directory).glob("*.png")))

print("\n=== Sanity Check ===")
for split in ['train', 'val', 'test']:
    img_count = count_files(DEST_BASE / split / 'img')
    mask_count = count_files(DEST_BASE / split / 'mask')
    print(f"{split.upper()}: {img_count} images, {mask_count} masks")


=== Sanity Check ===
TRAIN: 24205 images, 24205 masks
VAL: 5187 images, 5187 masks
TEST: 5186 images, 5186 masks
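
As a quick follow-up, the counts printed above can be turned into split proportions. The sketch below hard-codes the three counts from that output. The variable names are illustrative only.

```python
# Pair counts copied from the sanity-check output above.
counts = {"train": 24205, "val": 5187, "test": 5186}

total = sum(counts.values())
shares = {split: n / total for split, n in counts.items()}

for split, share in shares.items():
    print(f"{split}: {counts[split]} pairs ({share:.1%} of {total})")
```

This confirms that the copied pairs still follow a roughly 70/15/15 train/val/test split after the two images without masks were dropped.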


In [8]:
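def check_split_integrity(split_dir, mask_suffix=None):
    """Check that every image in <split_dir>/img has a matching file in <split_dir>/mask."""
    image_dir = os.path.join(split_dir, "img")
    mask_dir = os.path.join(split_dir, "mask")

    image_files = sorted(os.listdir(image_dir))
    mask_files = set(os.listdir(mask_dir))  # set gives constant-time membership tests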

    print(f"\nChecking: {split_dir}")
    print(f"- Images: {len(image_files)}")
    print(f"- Masks : {len(mask_files)}")

    mismatched = []
    for img_file in tqdm(image_files, desc=f"Validating {os.path.basename(split_dir)}"):
        img_id = os.path.splitext(img_file)[0]
        expected_mask = (img_id + mask_suffix) if mask_suffix else img_file
        if expected_mask not in mask_files:
            mismatched.append((img_file, expected_mask))

    if mismatched:
        print(f"\nMismatched files ({len(mismatched)}):")
        for img, expected in mismatched[:10]:  # show only first 10
            print(f"  Image: {img} -> Expected mask: {expected}")
    else:
        print("All image-mask pairs are valid.")

for split in ["train", "val", "test"]:
    check_split_integrity(os.path.join(DEST_BASE, split), mask_suffix="-mask.png")  # or "_mask.png"


Checking: ../../data/inpainting-dataset/train
- Images: 24205
- Masks : 24205


Validating train: 100%|██████████| 24205/24205 [00:06<00:00, 3765.98it/s]


All image-mask pairs are valid.

Checking: ../../data/inpainting-dataset/val
- Images: 5187
- Masks : 5187


Validating val: 100%|██████████| 5187/5187 [00:00<00:00, 19393.81it/s]


All image-mask pairs are valid.

Checking: ../../data/inpainting-dataset/test
- Images: 5186
- Masks : 5186


Validating test: 100%|██████████| 5186/5186 [00:00<00:00, 18817.73it/s]

All image-mask pairs are valid.
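
The check above walks from images to masks only. A set-based variant that also reports masks with no corresponding image might look like the sketch below. `pairing_report` is a hypothetical helper that works on plain lists of filenames, and it assumes the same `-mask.png` suffix convention used in this notebook.

```python
import os

def pairing_report(image_names, mask_names, mask_suffix="-mask.png"):
    """Compare image and mask filenames in both directions.

    Returns (missing, orphans):
      missing - images whose expected mask is absent
      orphans - masks that no image points to
    """
    # Map each expected mask name back to the image that requires it.
    expected = {os.path.splitext(name)[0] + mask_suffix: name for name in image_names}
    masks = set(mask_names)
    missing = sorted(img for mask, img in expected.items() if mask not in masks)
    orphans = sorted(masks - set(expected))
    return missing, orphans

# Made-up filenames for illustration.
missing, orphans = pairing_report(
    ["a-scratch.png", "b-dust.png"],
    ["a-scratch-mask.png", "c-tear-mask.png"],
)
print("Images without masks:", missing)
print("Masks without images:", orphans)
```

For a real split, the two lists could come from `os.listdir` on the `img` and `mask` directories. Both results being empty would indicate a one-to-one pairing.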



