<div style="background-color:#f0f0f0; padding:20px; border-radius:12px; max-width:1490px; width:100%; box-sizing:border-box;">

  <h1 style="margin:0; color:#2c3e50; font-size:32px;">001 🏠 Cleaning Setup</h1>

  <p style="margin:5px 0 0 0; font-size:16px; color:#34495e;">
    <strong>Authors:</strong> Cecil Quibranza & Matthew Israel &nbsp;&nbsp;|&nbsp;&nbsp; 
    <strong>Date:</strong> 2025-08-30
  </p>

  <hr style="margin:10px 0; border:none; border-top:2px solid #dcdcdc;">

  <p style="margin:0; font-size:16px; color:#34495e;">
    This is the <strong>first notebook</strong>, focusing on the <strong>cleaning of images</strong>. 
    The goal is to standardize and prepare datasets by removing redundant or irrelevant data, ensuring that only high-quality, non-augmented images are retained. 
    This process minimizes dataset noise and prevents biases during training.
  </p>

  <ol style="margin:10px 0 0 20px; color:#34495e; font-size:16px; list-style-position:inside;">
    <li><strong>Extract:</strong> Unzip the dataset files into their respective folders to make all images accessible for preprocessing.</li>
    <li><strong>Remove augmented/duplicates:</strong> Remove rotated, mirrored, or heavily filtered copies and exact duplicates; scripts flag the candidates, and the result is verified manually. 
        Keep only originals to reduce imbalance and overfitting.</li>
    <li><strong>Move cleaned images:</strong> Consolidate remaining images into a single folder structure by class (e.g., disease/condition). 
        This enforces a uniform layout for loading and evaluation.</li>
    <li><strong>Repeat for all sources:</strong> Apply steps 2–3 across every dataset to maintain consistency for downstream training.</li>
  </ol>

  <p style="margin:10px 0 0 0; font-size:15px; color:#7f8c8d;">
    ✅ This establishes a clean baseline so later steps (resize, normalization, controlled augmentation) operate on unbiased data.
  </p>
</div>
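
Step 3 groups images by class. As a minimal sketch, assuming dataset folders are named `<Class>_<index>` (as with Acne_0 or Eczema_1 in this notebook), the class can be derived from the folder name and the images counted per class to check balance. The names `class_of` and `count_images_per_class` are hypothetical helpers, not part of the notebook.

```python
import os
from collections import Counter

IMAGE_EXTS = (".jpg", ".jpeg", ".png")

def class_of(dataset_name):
    """Return the class part of a '<Class>_<index>' folder name (assumed convention)."""
    head, sep, tail = dataset_name.rpartition("_")
    return head if sep and tail.isdigit() else dataset_name

def count_images_per_class(root_dir):
    """Count image files under each dataset folder, summed per class."""
    counts = Counter()
    for dataset in sorted(os.listdir(root_dir)):
        dataset_path = os.path.join(root_dir, dataset)
        if not os.path.isdir(dataset_path):
            continue
        # Walk all nested folders (train/valid/test, images, ...) of this dataset
        for _, _, files in os.walk(dataset_path):
            counts[class_of(dataset)] += sum(
                1 for f in files if f.lower().endswith(IMAGE_EXTS)
            )
    return dict(counts)
```

Calling `count_images_per_class("Extracted")` after cleaning would give one total per condition, which makes a large class imbalance easy to spot.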


<div style="background-color:#f0f0f0; padding:20px; border-radius:12px; max-width:1490px; width:100%; box-sizing:border-box;">

  <h1 style="margin:0; color:#2c3e50; font-size:32px;">Modules</h1>

  <p style="margin:10px 0 0 0; font-size:16px; color:#34495e;">
    The <strong>code</strong> below installs the Python modules this notebook needs (<code>termcolor</code> and <code>roboflow</code>).
  </p>

</div>


In [None]:
%pip install termcolor
%pip install roboflow


Collecting termcolor
  Using cached termcolor-3.1.0-py3-none-any.whl.metadata (6.4 kB)
Using cached termcolor-3.1.0-py3-none-any.whl (7.7 kB)
Installing collected packages: termcolor
Successfully installed termcolor-3.1.0
Note: you may need to restart the kernel to use updated packages.


Could not find platform independent libraries <prefix>
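
To confirm that the installation worked before running the later cells, the import system can be queried directly. This is a minimal sketch; `missing_modules` is a hypothetical helper name, and the printed result depends on the local environment.

```python
import importlib.util

def missing_modules(names):
    """Return the module names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# An empty list means both packages can be imported
print(missing_modules(["termcolor", "roboflow"]))
```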


<div style="background-color:#f0f0f0; padding:20px; border-radius:12px; max-width:1490px; width:100%; box-sizing:border-box;">

  <h1 style="margin:0; color:#2c3e50; font-size:32px;">1. Extract</h1>

  <p style="margin:10px 0 0 0; font-size:16px; color:#34495e;">
    The <strong>code</strong> below unzips every archive in the "RawDatasets" folder and places each one in its own subfolder of a new folder called "Extracted".
  </p>

</div>


In [2]:
import os
import zipfile

# Define the base directory where ZIP files are located
base_dir = "RawDatasets"

# The Extracted folder is created next to base_dir, not inside it
extract_dir = os.path.join(os.path.dirname(base_dir), "Extracted")

# Ensure the extracted directory exists
os.makedirs(extract_dir, exist_ok=True)

# Loop through all folders and extract ZIP files separately
for root, _, files in os.walk(base_dir):
    for file in files:
        if file.endswith(".zip"):
            zip_path = os.path.join(root, file)
            folder_name = os.path.splitext(file)[0]  # Get ZIP filename without extension
            target_dir = os.path.join(extract_dir, folder_name)  # Create a separate folder
            os.makedirs(target_dir, exist_ok=True)  # Ensure folder exists

            with zipfile.ZipFile(zip_path, 'r') as zip_ref:
                zip_ref.extractall(target_dir)  # Extract into separate folder
            print(f"Extracted: {file} → {target_dir}")


Extracted: Acne_0.zip → Extracted\Acne_0
Extracted: Acne_1.zip → Extracted\Acne_1
Extracted: Eczema_0.zip → Extracted\Eczema_0
Extracted: Eczema_1.zip → Extracted\Eczema_1
Extracted: Melasma_0.zip → Extracted\Melasma_0
Extracted: Melasma_1.zip → Extracted\Melasma_1
Extracted: Melasma_2.zip → Extracted\Melasma_2
Extracted: Melasma_3.zip → Extracted\Melasma_3
Extracted: Rosacea_0.zip → Extracted\Rosacea_0
Extracted: Shingles_0.zip → Extracted\Shingles_0
Extracted: Shingles_1.zip → Extracted\Shingles_1
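
A variant of the extraction loop, as a minimal sketch: it matches the `.zip` extension case-insensitively and returns what it extracted so the result can be checked programmatically. The function name `extract_archives` is hypothetical and not part of the notebook.

```python
import os
import zipfile

def extract_archives(base_dir, extract_dir):
    """Extract every ZIP under base_dir into extract_dir/<zip name without extension>.

    Returns a dict mapping each target folder to the number of archive members.
    """
    extracted = {}
    for root, _, files in os.walk(base_dir):
        for file in sorted(files):
            if not file.lower().endswith(".zip"):
                continue
            zip_path = os.path.join(root, file)
            target_dir = os.path.join(extract_dir, os.path.splitext(file)[0])
            os.makedirs(target_dir, exist_ok=True)
            with zipfile.ZipFile(zip_path) as zip_ref:
                zip_ref.extractall(target_dir)
                extracted[target_dir] = len(zip_ref.namelist())
    return extracted
```

Returning the member counts makes it possible to compare them against the file counts reported by the scan step.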


<div style="background-color:#f0f0f0; padding:20px; border-radius:12px; max-width:1490px; width:100%; box-sizing:border-box;">

  <h1 style="margin:0; color:#2c3e50; font-size:32px;">Scan</h1>

  <p style="margin:10px 0 0 0; font-size:16px; color:#34495e;">
    The <strong>Scan</strong> step displays the contents of a dataset folder in a 
    hierarchical tree view. This provides a clear overview of the directory 
    structure, including subfolders and files, making it easier to verify 
    dataset organization before cleaning or preprocessing.
  </p>

</div>
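
Besides the tree view, it can help to check that every image has a matching label before cleaning. The sketch below assumes the pairing used elsewhere in this notebook, where the label of an image is the `.txt` file with the same base name in the sibling `labels` folder. `find_unpaired` is a hypothetical helper name.

```python
import os

IMAGE_EXTS = (".jpg", ".jpeg", ".png")

def find_unpaired(split_path):
    """Return (images without a label, labels without an image) for one split folder."""
    images_path = os.path.join(split_path, "images")
    labels_path = os.path.join(split_path, "labels")
    image_stems, label_stems = set(), set()
    if os.path.isdir(images_path):
        image_stems = {os.path.splitext(f)[0] for f in os.listdir(images_path)
                       if f.lower().endswith(IMAGE_EXTS)}
    if os.path.isdir(labels_path):
        label_stems = {os.path.splitext(f)[0] for f in os.listdir(labels_path)
                       if f.lower().endswith(".txt")}
    # Set differences give the stems present on only one side
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

Two empty lists for a split such as `Extracted/Acne_0/train` mean the image and label folders are consistent.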


In [34]:
import os

# Define the root directory to scan
root_dir = "Extracted"

# ANSI escape codes for colors
CYAN = "\033[96m"
YELLOW = "\033[93m"
GREEN = "\033[92m"
BOLD = "\033[1m"
RESET = "\033[0m"

# Function to count files in a folder
def count_files(directory):
    return len([f for f in os.listdir(directory) if os.path.isfile(os.path.join(directory, f))])

# Function to count total images and labels in a dataset folder
def count_dataset_items(dataset_path):
    total_images, total_labels = 0, 0
    for split in ["train", "valid", "test"]:
        split_path = os.path.join(dataset_path, split)
        images_path = os.path.join(split_path, "images")
        labels_path = os.path.join(split_path, "labels")

        if os.path.isdir(images_path):
            total_images += count_files(images_path)
        if os.path.isdir(labels_path):
            total_labels += count_files(labels_path)

    return total_images, total_labels

# Function to recursively scan and display folder hierarchy with image/label counts
def scan_folders(directory, indent=0):
    if not os.path.exists(directory):
        print(f"{YELLOW}⚠️ Directory not found!{RESET}")
        return

    for item in sorted(os.listdir(directory)):  # Sort for consistent order
        item_path = os.path.join(directory, item)
        prefix = " " * indent + "|-- "  # Indentation for hierarchy visualization
        
        if os.path.isdir(item_path):
            if item in ["images", "labels"]:  # Count files instead of listing them
                file_count = count_files(item_path)
                print(f"{CYAN}{prefix}📂 {BOLD}{item}{RESET} ({file_count} files)")
            else:
                # Count total images and labels for dataset folders
                total_images, total_labels = count_dataset_items(item_path)
                
                # Format the display string (hide 0 values)
                count_text = []
                if total_images > 0:
                    count_text.append(f"✅ {total_images} images")
                if total_labels > 0:
                    count_text.append(f"✅ {total_labels} labels")
                
                count_display = " ".join(count_text) if count_text else ""  
                print(f"{CYAN}{prefix}📂 {BOLD}{item}{RESET} {count_display}")
                
                scan_folders(item_path, indent + 4)  # Recursively scan subfolders
        else:
            print(f"{GREEN}{prefix}📄 {item}{RESET}")

# Run the folder scan
print(f"\n{BOLD}📁 Scanning folder hierarchy in:{RESET} {CYAN}{root_dir}{RESET}\n")
scan_folders(root_dir)



📁 Scanning folder hierarchy in: Extracted

|-- 📂 Acne_0 ✅ 569 images ✅ 569 labels
    |-- 📄 README.dataset.txt
    |-- 📄 README.roboflow.txt
    |-- 📄 data.yaml
    |-- 📂 test
        |-- 📂 images (114 files)
        |-- 📂 labels (114 files)
    |-- 📂 train
        |-- 📂 images (330 files)
        |-- 📂 labels (330 files)
    |-- 📂 valid
        |-- 📂 images (125 files)
        |-- 📂 labels (125 files)
|-- 📂 Acne_1 ✅ 608 images ✅ 608 labels
    |-- 📄 README.dataset.txt
    |-- 📄 README.roboflow.txt
    |-- 📄 data.yaml
    |-- 📂 test
        |-- 📂 images (13 files)
        |-- 📂 labels (13 files)
    |-- 📂 train
        |-- 📂 images (339 files)
        |-- 📂 labels (339 files)
    |-- 📂

<div style="background-color:#f0f0f0; padding:20px; border-radius:12px; max-width:1490px; width:100%; box-sizing:border-box;">

  <h1 style="margin:0; color:#2c3e50; font-size:32px;">2. Remove Augmented Images</h1>

  <p style="margin:10px 0 0 0; font-size:16px; color:#34495e;">
    Augmented copies are detected by file name only, so the results still need to be verified manually.
  </p>

</div>


<div style="background-color:#f0f0f0; padding:20px; border-radius:12px; max-width:1490px; width:100%; box-sizing:border-box;">

  <h1 style="margin:0; color:#2c3e50; font-size:32px;">Scan all duplicates (Augmented)</h1>

</div>
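
The scan in this section treats files that share the same base name (everything before the last underscore) as copies of one image. A minimal sketch of that grouping rule follows; `group_by_base_name` is a hypothetical helper and the file names are invented for illustration.

```python
from collections import defaultdict

def group_by_base_name(filenames):
    """Group file names by everything before their last underscore."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        groups[name.rsplit("_", 1)[0]].append(name)
    return dict(groups)

# Hypothetical file names, invented for illustration only
names = [
    "acne-01_jpg.rf.aaa111.jpg",
    "acne-01_jpg.rf.bbb222.jpg",
    "acne-02_jpg.rf.ccc333.jpg",
]
groups = group_by_base_name(names)
# Every file beyond the first in a group counts as an extra copy
extra_copies = sum(len(group) - 1 for group in groups.values())
print(groups)
print(extra_copies)
```

The rule is purely name-based: two unrelated images called `face_1.jpg` and `face_2.jpg` would also land in the same group, which is one reason a manual check remains necessary.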


In [36]:
import os
from collections import defaultdict
from termcolor import colored

# Define root path where all datasets are located
root_path = r'Extracted'
folders = ['train', 'valid', 'test']
image_exts = ('.jpg', '.png', '.jpeg')

# Grand total for all datasets
grand_total_duplicates = 0

print(colored(f"\n📁 Scanning for duplicate images in datasets inside: {root_path}\n", "cyan", attrs=["bold"]))

# Iterate over each dataset folder (e.g., Acne_0, Eczema_1)
for dataset_name in sorted(os.listdir(root_path)):
    dataset_path = os.path.join(root_path, dataset_name)
    if not os.path.isdir(dataset_path):
        continue

    total_duplicates = 0

    for folder in folders:
        image_folder = os.path.join(dataset_path, folder, 'images')

        if not os.path.exists(image_folder):
            continue

        filename_dict = defaultdict(int)

        # Count image base names: everything before the last underscore (e.g., Acne_001 from Acne_001_jpg.rf.asdf123.jpg)
        for filename in os.listdir(image_folder):
            if filename.lower().endswith(image_exts):
                base_name = filename.rsplit('_', 1)[0]
                filename_dict[base_name] += 1

        # Count duplicates
        for count in filename_dict.values():
            if count > 1:
                total_duplicates += count - 1

    if total_duplicates > 0:
        print(colored(f"📂 {dataset_name} → ❌ {total_duplicates} duplicate images found", "red"))
    else:
        print(colored(f"📂 {dataset_name} → ✅ No duplicate images", "green"))

    grand_total_duplicates += total_duplicates

# Final summary
print(colored(f"\n🧮 Grand Total Duplicate Images Across All Datasets: {grand_total_duplicates}\n", "magenta", attrs=["bold"]))


[1m[36m
📁 Scanning for duplicate images in datasets inside: Extracted
[0m
[32m📂 Acne_0 → ✅ No duplicate images[0m
[32m📂 Acne_1 → ✅ No duplicate images[0m
[32m📂 Eczema_0 → ✅ No duplicate images[0m
[32m📂 Eczema_1 → ✅ No duplicate images[0m
[32m📂 Melasma_0 → ✅ No duplicate images[0m
[32m📂 Melasma_1 → ✅ No duplicate images[0m
[32m📂 Melasma_2 → ✅ No duplicate images[0m
[32m📂 Melasma_3 → ✅ No duplicate images[0m
[32m📂 Rosacea_0 → ✅ No duplicate images[0m
[32m📂 Shingles_0 → ✅ No duplicate images[0m
[32m📂 Shingles_1 → ✅ No duplicate images[0m
[1m[35m
🧮 Grand Total Duplicate Images Across All Datasets: 0
[0m


<div style="background-color:#f0f0f0; padding:20px; border-radius:12px; max-width:1490px; width:100%; box-sizing:border-box;">

  <h1 style="margin:0; color:#2c3e50; font-size:32px;">2. Remove Augmented Images</h1>

  <p style="margin:10px 0 0 0; font-size:16px; color:#34495e;">
    The <strong>code</strong> below moves every augmented copy out of the "Extracted" folder into a new folder called "Cleaned", keeping one image per base name (with its label) in "Extracted".
  </p>

</div>


In [7]:
import os
import shutil
from collections import defaultdict

# Main paths
base_path = r'Extracted'
cleaned_base_path = r'Cleaned'
folders = ['train', 'valid', 'test']
image_exts = ('.jpg', '.jpeg', '.png')

# Function to get base filename (excluding hash)
def get_base_filename(filename):
    return filename.rsplit('_', 1)[0]

# Global counters
grand_total_duplicates = 0
summary = {}

# Process each dataset inside "Extracted"
for dataset in os.listdir(base_path):
    dataset_path = os.path.join(base_path, dataset)
    if not os.path.isdir(dataset_path):
        continue

    cleaned_path = os.path.join(cleaned_base_path, dataset)
    
    # Prepare cleaned folders
    for folder in folders:
        os.makedirs(os.path.join(cleaned_path, folder, 'images'), exist_ok=True)
        os.makedirs(os.path.join(cleaned_path, folder, 'labels'), exist_ok=True)
    
    # Registry to store all base names in this dataset
    image_registry = defaultdict(list)
    duplicates_count = {folder: 0 for folder in folders}

    # Step 1: Register image filenames
    for folder in folders:
        image_folder = os.path.join(dataset_path, folder, 'images')
        if not os.path.exists(image_folder):
            continue
        for filename in os.listdir(image_folder):
            if not filename.lower().endswith(image_exts):
                continue
            base_name = get_base_filename(filename)
            image_registry[(folder, base_name)].append(filename)

    # Step 2: Remove duplicates and move them to cleaned folder
    for (folder, base_name), filenames in image_registry.items():
        if len(filenames) > 1:
            duplicates_count[folder] += len(filenames) - 1
            image_folder = os.path.join(dataset_path, folder, 'images')
            label_folder = os.path.join(dataset_path, folder, 'labels')
            cleaned_image_folder = os.path.join(cleaned_path, folder, 'images')
            cleaned_label_folder = os.path.join(cleaned_path, folder, 'labels')
            for file in filenames[1:]:
                # Move image
                src_image = os.path.join(image_folder, file)
                dst_image = os.path.join(cleaned_image_folder, file)
                if os.path.exists(src_image):
                    shutil.move(src_image, dst_image)

                # Move corresponding label
                label_file = os.path.splitext(file)[0] + '.txt'
                src_label = os.path.join(label_folder, label_file)
                dst_label = os.path.join(cleaned_label_folder, label_file)
                if os.path.exists(src_label):
                    shutil.move(src_label, dst_label)

    # Collect summary
    dataset_total = sum(duplicates_count.values())
    grand_total_duplicates += dataset_total
    summary[dataset] = dataset_total

# Final Output
print("\n📁 Duplicate Removal Summary Across All Datasets:\n")
for dataset, count in summary.items():
    if count > 0:
        print(f"📂 {dataset} → ❌ {count} duplicate images removed")
    else:
        print(f"📂 {dataset} → ✅ No duplicates found")
print(f"\n🧮 Grand Total Duplicates Removed: {grand_total_duplicates}")



📁 Duplicate Removal Summary Across All Datasets:

📂 Acne_0 → ❌ 1944 duplicate images removed
📂 Acne_1 → ❌ 674 duplicate images removed
📂 Eczema_0 → ❌ 6261 duplicate images removed
📂 Eczema_1 → ✅ No duplicates found
📂 Melasma_0 → ❌ 101 duplicate images removed
📂 Melasma_1 → ✅ No duplicates found
📂 Melasma_2 → ✅ No duplicates found
📂 Melasma_3 → ✅ No duplicates found
📂 Rosacea_0 → ❌ 2200 duplicate images removed
📂 Shingles_0 → ❌ 1181 duplicate images removed
📂 Shingles_1 → ❌ 7992 duplicate images removed

🧮 Grand Total Duplicates Removed: 20353
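
Name-based grouping cannot detect byte-identical images saved under different names. As an aid to the manual verification, a content hash can flag such exact duplicates. This is a minimal sketch using SHA-256 from the standard library; `find_exact_duplicates` is a hypothetical helper, and it does not detect rotated, mirrored, or re-encoded copies.

```python
import hashlib
import os
from collections import defaultdict

def find_exact_duplicates(folder, exts=(".jpg", ".jpeg", ".png")):
    """Return lists of paths whose file contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for root, _, files in os.walk(folder):
        for name in sorted(files):
            if not name.lower().endswith(exts):
                continue
            path = os.path.join(root, name)
            digest = hashlib.sha256()
            with open(path, "rb") as fh:
                # Read in 1 MiB chunks so large images are not loaded at once
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            by_digest[digest.hexdigest()].append(path)
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned list holds files with identical content, so all but one path per list are candidates for removal.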


In [37]:
import os

def list_files_hierarchy(startpath):
    """
    Displays the file and directory hierarchy from a given start path,
    excluding the directories listed in excluded_dirs (e.g., images, labels, .git).
    """
    # Define directories to exclude from the hierarchy display
    # You can add more directories here if needed (e.g., '.expo', '__pycache__')
    excluded_dirs = ['images', 'labels', '.vscode', '.venv', '.git']

    if not os.path.isdir(startpath):
        print(f"Error: Directory '{startpath}' not found.")
        return

    print(f"Directory hierarchy for: {startpath}\n")

    for root, dirs, files in os.walk(startpath):
        # Calculate current level for indentation
        level = root.replace(startpath, '').count(os.sep)
        indent = ' ' * 4 * (level)

        # Print current directory
        # The first root will be the startpath itself, handled below
        if root != startpath:
            dir_name = os.path.basename(root)
            # Check if current directory should be excluded
            if dir_name in excluded_dirs:
                # If a directory is in excluded_dirs, we skip its contents
                # and prevent os.walk from descending into it.
                del dirs[:] # This modifies 'dirs' in-place, preventing os.walk from entering these.
                continue
            print(f'{indent}📦 {dir_name}/') # Folder icon

        # Remove excluded directories from the list to prevent os.walk from entering them
        # This is important to not process files inside excluded_dirs
        dirs[:] = [d for d in dirs if d not in excluded_dirs]


        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print(f'{subindent}📄 {f}') # File icon

# --- How to use it ---
if __name__ == "__main__":
    # You can set the path manually or use os.getcwd() to get the current working directory
    project_root = r"C:\Users\rapha\OneDrive\Desktop\CS Thesis 2" # Use 'r' for raw string to handle backslashes
    # Or to use the directory where the script is run:
    # project_root = os.getcwd()

    list_files_hierarchy(project_root)

Directory hierarchy for: C:\Users\rapha\OneDrive\Desktop\CS Thesis 2

    📄 .gitignore
    📄 main.ipynb
    📄 README.md
    📄 RunProjectThisDirectory.txt
    📦 Cleaned/
        📦 Acne_0/
            📦 test/
            📦 train/
            📦 valid/
        📦 Acne_1/
            📦 test/
            📦 train/
            📦 valid/
        📦 Eczema_0/
            📦 test/
            📦 train/
            📦 valid/
        📦 Eczema_1/
            📦 test/
            📦 train/
            📦 valid/
        📦 Melasma_0/
            📦 test/
            📦 train/
            📦 valid/
        📦 Melasma_1/
            📦 test/
            📦 train/
            📦 valid/
        📦 Melasma_2/
            📦 test/
            📦 train/
            📦 valid/
        📦 Melasma_3/
            📦 test/
            📦 train/
            📦 valid/
        📦 Rosacea_0/
            📦 test/
            📦 train/
            📦 valid/
        📦 Shingles_0/
            📦 test/
            📦 train/
            📦 valid/
        

<div style="background-color:#f0f0f0; padding:20px; border-radius:12px; max-width:1490px; width:100%; box-sizing:border-box;">

  <h1 style="margin:0; color:#2c3e50; font-size:32px;">Merge</h1>

  <p style="margin:10px 0 0 0; font-size:16px; color:#34495e;">
    The <strong>code</strong> below merges the nested split folders (test, train, valid) of each dataset into a single folder per dataset, copying only the images and disregarding the label folders.
  </p>

</div>


In [30]:
import os
import shutil

# Base Extracted directory (relative path)
base_dir = "Extracted"
merged_dir = "Merged"

# Make sure Merged folder exists
os.makedirs(merged_dir, exist_ok=True)

# Walk through each dataset folder inside Extracted
for dataset in os.listdir(base_dir):
    dataset_path = os.path.join(base_dir, dataset)
    if os.path.isdir(dataset_path):
        print(f"Processing {dataset}...")

        # Create subfolder inside Merged for this dataset
        dataset_merged_dir = os.path.join(merged_dir, dataset)
        os.makedirs(dataset_merged_dir, exist_ok=True)

        # Look into test, train, valid
        for subset in ["train", "test", "valid"]:
            subset_path = os.path.join(dataset_path, subset)
            if os.path.exists(subset_path):
                for root, _, files in os.walk(subset_path):
                    for file in files:
                        # Only copy image files
                        if file.lower().endswith((".jpg", ".jpeg", ".png")):
                            src_file = os.path.join(root, file)
                            dst_file = os.path.join(dataset_merged_dir, file)

                            # Handle duplicates by prefixing subset name
                            if os.path.exists(dst_file):
                                filename, ext = os.path.splitext(file)
                                dst_file = os.path.join(dataset_merged_dir, f"{subset}_{filename}{ext}")

                            shutil.copy2(src_file, dst_file)

print(f"\n✅ Merging complete! All images are in: {merged_dir}")


Processing Acne_0...
Processing Acne_1...
Processing Eczema_0...
Processing Eczema_1...
Processing Melasma_0...
Processing Melasma_1...
Processing Melasma_2...
Processing Melasma_3...
Processing Rosacea_0...
Processing Shingles_0...
Processing Shingles_1...

✅ Merging complete! All images are in: Merged
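
When two splits contain a file with the same name, the merge prefixes the subset name once. If the prefixed name were also taken, the copy would overwrite it. A minimal sketch of a stricter naming rule that keeps trying until a free name is found; `unique_destination` is a hypothetical helper and the counter-based fallback is an assumption, not the notebook's behavior.

```python
import os

def unique_destination(dst_dir, filename, prefix):
    """Return a path in dst_dir that does not exist yet.

    Tries filename, then '<prefix>_filename', then '<prefix>_<n>_filename' for n = 2, 3, ...
    """
    candidate = os.path.join(dst_dir, filename)
    if not os.path.exists(candidate):
        return candidate
    candidate = os.path.join(dst_dir, f"{prefix}_{filename}")
    counter = 2
    while os.path.exists(candidate):
        candidate = os.path.join(dst_dir, f"{prefix}_{counter}_{filename}")
        counter += 1
    return candidate
```

In the merge loop, `dst_file = unique_destination(dataset_merged_dir, file, subset)` would replace the existence check and the single prefix step.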


<div style="background-color:#f0f0f0; padding:20px; border-radius:12px; max-width:1490px; width:100%; box-sizing:border-box;">

  <h1 style="margin:0; color:#2c3e50; font-size:32px;">Finish</h1>

  <p style="margin:10px 0 0 0; font-size:16px; color:#34495e;">
    The next step is labelling. The "Merged" folder is used as the input and is labelled on the website.
  </p>

</div>


<div style="background-color:#f0f0f0; padding:20px; border-radius:12px; max-width:1490px; width:100%; box-sizing:border-box;">

  <h1 style="margin:0; color:#2c3e50; font-size:32px;">Rename image names</h1>

  <p style="margin:10px 0 0 0; font-size:16px; color:#34495e;">
    Rename the files based on the name of their dataset folder, e.g. Acne_0_0001.jpg.
  </p>

</div>


In [1]:
import os
import shutil

# Original and new folder paths
src_folder = "Merged"
dst_folder = "Merged_Renamed"

# Copy the entire folder first (without overwriting if exists)
if not os.path.exists(dst_folder):
    shutil.copytree(src_folder, dst_folder)
    print(f"📂 Copied '{src_folder}' to '{dst_folder}'")
else:
    print(f"⚠️ Destination '{dst_folder}' already exists, will rename inside it.")

# Walk through new copied folder and rename files
for root, _, files in os.walk(dst_folder):
    folder_name = os.path.basename(root)  # e.g. Acne_0, Eczema_0
    for i, file in enumerate(sorted(files), start=1):  # sorted for a deterministic numbering
        ext = os.path.splitext(file)[1].lower()  # keep original extension
        old_path = os.path.join(root, file)
        new_filename = f"{folder_name}_{i:04d}{ext}"  # Acne_0_0001.jpg
        new_path = os.path.join(root, new_filename)

        # Avoid overwriting by checking existence
        if not os.path.exists(new_path):
            os.rename(old_path, new_path)

print("✅ All images renamed with their parent folder name.")


📂 Copied 'Merged' to 'Merged_Renamed'
✅ All images renamed with their parent folder name.
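
Before renaming anything on disk, the naming scheme can be previewed as a plain mapping from old to new names. This is a minimal sketch of a dry run; `plan_renames` is a hypothetical helper, and the file names in the example are invented.

```python
import os

def plan_renames(folder_name, filenames):
    """Map each file name to '<folder_name>_<NNNN><ext>', numbering in sorted order."""
    plan = {}
    for i, name in enumerate(sorted(filenames), start=1):
        ext = os.path.splitext(name)[1].lower()  # keep the extension, lowercased
        plan[name] = f"{folder_name}_{i:04d}{ext}"
    return plan

# Hypothetical file names, for illustration only
print(plan_renames("Acne_0", ["b.PNG", "a.jpg"]))
# → {'a.jpg': 'Acne_0_0001.jpg', 'b.PNG': 'Acne_0_0002.png'}
```

Inspecting the mapping first makes it easy to spot unexpected files in a folder before any of them are renamed.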
