# 02d_data_cleaning_create_clean_dataset

- copy_all_excluded_images: copies the bad images to `outputs/all_excluded_images/` for manual inspection
- create_clean_dataset: copies the good images to `data/CleanPetImages/` for training
- verify clean dataset: verifies that the copy succeeded and that the right images were excluded

### copy_all_excluded_images - Copies the bad images to outputs/all_excluded_images/ for manual inspection

### Two exclusion files to read:

- `clip_cleanup_exclusions.txt`: contains just file paths (one per line), in the same path format as the detailed file but without the scores/reasons columns. These are the images flagged by CLIP cleaning that were still rejected once the good ones had been kept, i.e., the undesirable leftovers.
- `exclude_list.txt`: contains file paths (one per line, no header). These are images that should be excluded from the project in general, coming from earlier cleaning stages (tiny images, duplicates, corrupted files, etc.).
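
Both files store one path per line, and every later step needs the category folder and the filename from that path. A minimal sketch of that split using `pathlib`, assuming Windows-style relative paths such as `..\data\PetImages\Cat\10029.jpg`; the function name `split_category_and_filename` is hypothetical, and this is an alternative to the string-splitting helper used in the cells of this notebook, not a copy of it:

```python
from pathlib import PureWindowsPath

def split_category_and_filename(raw_path):
    """Return (category, filename) for a path that contains a Cat or Dog folder."""
    # PureWindowsPath accepts both backslashes and forward slashes as separators.
    parts = PureWindowsPath(raw_path.strip()).parts
    # Skip the last part: a category folder must be followed by a filename.
    for i, part in enumerate(parts[:-1]):
        if part in ("Cat", "Dog"):
            return part, parts[i + 1]
    return None, None

print(split_category_and_filename(r"..\data\PetImages\Cat\10029.jpg"))
print(split_category_and_filename("../data/PetImages/Dog/42.jpg"))
print(split_category_and_filename("notes.txt"))
```

A path without a `Cat` or `Dog` folder yields `(None, None)`, so the caller can record it as a parse error instead of failing.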

The script should:

- Read both files and collect all unique paths
- Copy all these excluded images to a review/inspection folder (preserving the Cat/Dog subfolders)
- Handle the different formats (one has headers to skip, one doesn't)
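
The requirements above can be sketched with one reader that tolerates both formats. This is an illustration under assumptions: `read_exclusion_paths` is a hypothetical name, the sample lines (including the score and reason values) are made up, and the `.jpg` check uses `endswith`, which is slightly stricter than a substring test.

```python
def read_exclusion_paths(lines):
    """Collect unique image paths from an iterable of text lines."""
    paths = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and header/separator lines.
        if not line or line.startswith(("#", "-")):
            continue
        # Keep only the first tab-separated column (the path).
        path = line.split("\t")[0].strip()
        if path.lower().endswith(".jpg"):
            paths.add(path)
    return paths

# Made-up sample content for the two formats.
clip_lines = [
    "# path\tscore\treason",
    "------",
    "..\\data\\PetImages\\Cat\\10029.jpg\t0.12\tnot a pet",
    "",
]
general_lines = [
    "..\\data\\PetImages\\Cat\\10029.jpg",
    "..\\data\\PetImages\\Dog\\7.jpg",
]

# A set union removes paths that appear in both files.
all_paths = read_exclusion_paths(clip_lines) | read_exclusion_paths(general_lines)
print(len(all_paths))
```

The same function works on an open file handle, since iterating a text file yields its lines.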

What the script does:

1. Reads `clip_cleanup_exclusions.txt`: handles the same format as the detailed file but expects only paths (skips any lines starting with `#` or `-`)
2. Reads `exclude_list.txt`: simple one-path-per-line format, no header
3. Combines both into a unique set (so duplicates between files aren't copied twice)
4. Copies to `all_excluded_images/` with the Cat/Dog subfolders preserved
5. Prints a detailed breakdown showing:
   - How many came from each file
   - How many overlap between both files
   - Copy success/failure counts
   - Final counts by category
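
The per-file and overlap counts in the breakdown can also be expressed directly as set operations. A small sketch with made-up paths; the variable names `only_clip`, `only_general`, and `in_both` are illustrative:

```python
# Made-up example paths standing in for the contents of the two files.
clip_paths = {"Cat/1.jpg", "Cat/2.jpg", "Dog/5.jpg"}
general_paths = {"Cat/2.jpg", "Dog/9.jpg"}

only_clip = clip_paths - general_paths      # listed only in the CLIP cleanup file
only_general = general_paths - clip_paths   # listed only in the general exclude list
in_both = clip_paths & general_paths        # listed in both files
all_paths = clip_paths | general_paths      # every unique path

# The three groups are disjoint and together cover the union.
assert len(only_clip) + len(only_general) + len(in_both) == len(all_paths)

print(len(only_clip), len(only_general), len(in_both), len(all_paths))
```

Because the three groups partition the union, their sizes always add up to the total of unique paths, which is a cheap sanity check on the printed breakdown.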

In [None]:
## Imports

In [1]:
import shutil
from pathlib import Path

In [None]:
# -----------------------------------------------------------------------------
# PATHS
# -----------------------------------------------------------------------------

base_path = Path(r"C:\AWrk\cats_dogs_project")
outputs_folder = base_path / "outputs"

# Exclusion files
clip_cleanup_file = outputs_folder / "clip_cleanup_exclusions.txt"
general_exclude_file = outputs_folder / "exclude_list.txt"

# Destination folder for inspection
dest_folder = outputs_folder / "all_excluded_images"

# Source image folders
cat_folder = base_path / "data" / "PetImages" / "Cat"
dog_folder = base_path / "data" / "PetImages" / "Dog"

# -----------------------------------------------------------------------------
# CREATE DESTINATION FOLDER
# -----------------------------------------------------------------------------

dest_folder.mkdir(parents=True, exist_ok=True)
(dest_folder / "Cat").mkdir(exist_ok=True)
(dest_folder / "Dog").mkdir(exist_ok=True)

# -----------------------------------------------------------------------------
# HELPER FUNCTION TO PARSE PATHS
# -----------------------------------------------------------------------------

def extract_category_and_filename(rel_path):
    """
    Extract category (Cat/Dog) and filename from a relative path.
    Handles paths like: ..\data\PetImages\Cat\10029.jpg
    """
    path_parts = rel_path.replace("\\", "/").split("/")
    # Find Cat or Dog in the path
    for i, part in enumerate(path_parts):
        if part in ("Cat", "Dog"):
            category = part
            filename = path_parts[i + 1] if i + 1 < len(path_parts) else None
            return category, filename
    return None, None

# -----------------------------------------------------------------------------
# READ CLIP CLEANUP EXCLUSIONS (paths only, may have header)
# -----------------------------------------------------------------------------

clip_paths = set()

if clip_cleanup_file.exists():
    with open(clip_cleanup_file, "r") as f:
        for line in f:
            line = line.strip()
            # Skip empty lines and header lines
            if not line or line.startswith("#") or line.startswith("-"):
                continue
            # Take first column if tab-separated, otherwise whole line
            path = line.split("\t")[0].strip()
            if path and ".jpg" in path.lower():
                clip_paths.add(path)
    print(f"Read {len(clip_paths)} paths from clip_cleanup_exclusions.txt")
else:
    print(f"WARNING: {clip_cleanup_file} not found")

# -----------------------------------------------------------------------------
# READ GENERAL EXCLUDE LIST (one path per line, no header)
# -----------------------------------------------------------------------------

general_paths = set()

if general_exclude_file.exists():
    with open(general_exclude_file, "r") as f:
        for line in f:
            line = line.strip()
            if line and ".jpg" in line.lower():
                general_paths.add(line)
    print(f"Read {len(general_paths)} paths from exclude_list.txt")
else:
    print(f"WARNING: {general_exclude_file} not found")

# -----------------------------------------------------------------------------
# COMBINE AND COPY
# -----------------------------------------------------------------------------

all_paths = clip_paths.union(general_paths)
print(f"\nTotal unique paths to copy: {len(all_paths)}")

# Track results
copied = 0
errors = []
from_clip = 0
from_general = 0
from_both = 0

for rel_path in sorted(all_paths):
    category, filename = extract_category_and_filename(rel_path)
    
    if not category or not filename:
        errors.append(f"Could not parse path: {rel_path}")
        continue
    
    # Track source
    in_clip = rel_path in clip_paths
    in_general = rel_path in general_paths
    if in_clip and in_general:
        from_both += 1
    elif in_clip:
        from_clip += 1
    else:
        from_general += 1
    
    # Build source path
    if category == "Cat":
        source = cat_folder / filename
    else:
        source = dog_folder / filename
    
    # Build destination path
    dest = dest_folder / category / filename
    
    # Copy the file
    try:
        if source.exists():
            shutil.copy2(source, dest)
            copied += 1
        else:
            errors.append(f"File not found: {source}")
    except Exception as e:
        errors.append(f"Error copying {source}: {e}")

# -----------------------------------------------------------------------------
# SUMMARY
# -----------------------------------------------------------------------------

print()
print("=" * 70)
print("COPY ALL EXCLUDED IMAGES FOR INSPECTION")
print("=" * 70)
print()
print("SOURCE FILES:")
print(f"  CLIP cleanup exclusions: {clip_cleanup_file}")
print(f"    Paths in file: {len(clip_paths)}")
print(f"  General exclude list:    {general_exclude_file}")
print(f"    Paths in file: {len(general_paths)}")
print()
print("BREAKDOWN:")
print(f"  Only in CLIP cleanup:    {from_clip}")
print(f"  Only in general exclude: {from_general}")
print(f"  In both files:           {from_both}")
print(f"  Total unique:            {len(all_paths)}")
print()
print("RESULTS:")
print(f"  Successfully copied: {copied}")
print(f"  Errors:              {len(errors)}")
print()
print(f"DESTINATION: {dest_folder}")

# Count by category in destination
cat_count = len(list((dest_folder / "Cat").glob("*.jpg")))
dog_count = len(list((dest_folder / "Dog").glob("*.jpg")))
print(f"  Cats: {cat_count}")
print(f"  Dogs: {dog_count}")

if errors:
    print()
    print("ERRORS:")
    for err in errors[:30]:
        print(f"  {err}")
    if len(errors) > 30:
        print(f"  ... and {len(errors) - 30} more")

print("=" * 70)

Read 36 paths from exclude_list.txt

Total unique paths to copy: 36

COPY ALL EXCLUDED IMAGES FOR INSPECTION

SOURCE FILES:
  CLIP cleanup exclusions: C:\AWrk\cats_dogs_project\outputs\clip_cleanup_exclusions.txt
    Paths in file: 0
  General exclude list:    C:\AWrk\cats_dogs_project\outputs\exclude_list.txt
    Paths in file: 36

BREAKDOWN:
  Only in CLIP cleanup:    0
  Only in general exclude: 36
  In both files:           0
  Total unique:            36

RESULTS:
  Successfully copied: 36
  Errors:              0

DESTINATION: C:\AWrk\cats_dogs_project\outputs\all_excluded_images
  Cats: 22
  Dogs: 14


In [4]:
import shutil
from pathlib import Path

# -----------------------------------------------------------------------------
# PATHS
# -----------------------------------------------------------------------------

base_path = Path(r"C:\AWrk\cats_dogs_project")
outputs_folder = base_path / "outputs"

# Input: CLIP filter results (for review)
clip_results_file = outputs_folder / "clip_exclude_details_v3plus.txt"

# Output: Folder to visually inspect rejected images
review_folder = outputs_folder / "clip_rejected_for_review"

# Source image folders
cat_folder = base_path / "data" / "PetImages" / "Cat"
dog_folder = base_path / "data" / "PetImages" / "Dog"

# -----------------------------------------------------------------------------
# CREATE DESTINATION FOLDERS
# -----------------------------------------------------------------------------

review_folder.mkdir(parents=True, exist_ok=True)
(review_folder / "Cat").mkdir(exist_ok=True)
(review_folder / "Dog").mkdir(exist_ok=True)

# -----------------------------------------------------------------------------
# HELPER FUNCTION
# -----------------------------------------------------------------------------

def extract_category_and_filename(rel_path):
    """Extract category (Cat/Dog) and filename from a path."""
    path_parts = rel_path.replace("\\", "/").split("/")
    for i, part in enumerate(path_parts):
        if part in ("Cat", "Dog"):
            category = part
            filename = path_parts[i + 1] if i + 1 < len(path_parts) else None
            return category, filename
    return None, None

# -----------------------------------------------------------------------------
# READ CLIP RESULTS AND COPY IMAGES
# -----------------------------------------------------------------------------

paths_to_copy = []

with open(clip_results_file, "r") as f:
    for line in f:
        line = line.strip()
        # Skip empty lines and header lines
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        # First column is the path (tab-separated)
        path = line.split("\t")[0].strip()
        if path:
            paths_to_copy.append(path)

print(f"Found {len(paths_to_copy)} rejected images in {clip_results_file.name}")

# Copy each image
copied = 0
errors = []

for rel_path in paths_to_copy:
    category, filename = extract_category_and_filename(rel_path)
    
    if not category or not filename:
        errors.append(f"Could not parse: {rel_path}")
        continue
    
    # Build source path
    if category == "Cat":
        source = cat_folder / filename
    else:
        source = dog_folder / filename
    
    # Build destination
    dest = review_folder / category / filename
    
    # Copy
    try:
        if source.exists():
            shutil.copy2(source, dest)
            copied += 1
        else:
            errors.append(f"File not found: {source}")
    except Exception as e:
        errors.append(f"Error copying {source}: {e}")

# -----------------------------------------------------------------------------
# SUMMARY
# -----------------------------------------------------------------------------

print()
print("=" * 70)
print("COPY CLIP REJECTED IMAGES FOR MANUAL REVIEW")
print("=" * 70)
print()
print(f"Source file: {clip_results_file}")
print(f"Destination: {review_folder}")
print()
print(f"Images in file:     {len(paths_to_copy)}")
print(f"Successfully copied: {copied}")
print(f"Errors:              {len(errors)}")
print()

# Count by category
cat_count = len(list((review_folder / "Cat").glob("*.jpg")))
dog_count = len(list((review_folder / "Dog").glob("*.jpg")))
print(f"Cats: {cat_count}")
print(f"Dogs: {dog_count}")
print(f"Total: {cat_count + dog_count}")

if errors:
    print()
    print("ERRORS:")
    for err in errors[:20]:
        print(f"  {err}")
    if len(errors) > 20:
        print(f"  ... and {len(errors) - 20} more")

print("=" * 70)

Found 86 rejected images in clip_exclude_details_v3plus.txt

COPY CLIP REJECTED IMAGES FOR MANUAL REVIEW

Source file: C:\AWrk\cats_dogs_project\outputs\clip_exclude_details_v3plus.txt
Destination: C:\AWrk\cats_dogs_project\outputs\clip_rejected_for_review

Images in file:     86
Successfully copied: 86
Errors:              0

Cats: 44
Dogs: 42
Total: 86


### create clean dataset

**create_clean_dataset: what the script does**

1. **Reads both exclusion files** and builds a set of filenames to exclude (stored separately for Cat and Dog)

2. **Iterates through all images** in the original `PetImages/Cat` and `PetImages/Dog` folders

3. **Copies only clean images** (those NOT in the exclusion sets) to `CleanPetImages/Cat` and `CleanPetImages/Dog`

4. **Prints a detailed summary** showing:
   - Original dataset counts
   - How many were excluded
   - Final clean dataset counts
   - Verification of files actually in destination
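
The filter-and-copy step can be tried safely on a throwaway folder before touching the real dataset. A minimal sketch, assuming a hypothetical helper `copy_clean` and placeholder files created in a temporary directory; it mirrors the idea of comparing lowercased filenames against an exclusion set:

```python
import shutil
import tempfile
from pathlib import Path

def copy_clean(source_dir, dest_dir, excluded_names):
    """Copy every .jpg in source_dir whose lowercased name is not excluded."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied, skipped = 0, 0
    for img in sorted(source_dir.glob("*.jpg")):
        if img.name.lower() in excluded_names:
            skipped += 1
            continue
        shutil.copy2(img, dest_dir / img.name)
        copied += 1
    return copied, skipped

with tempfile.TemporaryDirectory() as tmp:
    # Throwaway stand-ins for PetImages/Cat and CleanPetImages/Cat.
    src = Path(tmp) / "PetImages" / "Cat"
    dst = Path(tmp) / "CleanPetImages" / "Cat"
    src.mkdir(parents=True)
    for name in ("1.jpg", "2.jpg", "3.jpg"):
        (src / name).write_bytes(b"placeholder")

    copied, skipped = copy_clean(src, dst, {"2.jpg"})
    remaining = sorted(p.name for p in dst.glob("*.jpg"))

print(copied, skipped, remaining)
```

Copying into a new folder rather than deleting from the original keeps the source dataset untouched, so the exclusion lists can be revised and the clean copy rebuilt.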

**Clean dataset output structure:**
```
C:\AWrk\cats_dogs_project\data\CleanPetImages\
    Cat\
        (clean cat images only)
    Dog\
        (clean dog images only)
```

Run the first script (`copy_all_excluded_images.py`) first if you want to inspect the rejected images, then run this one (`create_clean_dataset.py`) to create the clean training dataset.

In [5]:
# -----------------------------------------------------------------------------
# PATHS
# -----------------------------------------------------------------------------

base_path = Path(r"C:\AWrk\cats_dogs_project")
outputs_folder = base_path / "outputs"

# Exclusion files (TWO sources)
clip_cleanup_file = outputs_folder / "clip_cleanup_exclusions.txt"  # Manual confirmations after review
general_exclude_file = outputs_folder / "exclude_list.txt"          # Earlier exclusions (corrupted, duplicates)

# Source image folders
source_cat = base_path / "data" / "PetImages" / "Cat"
source_dog = base_path / "data" / "PetImages" / "Dog"

# Destination - clean dataset
dest_folder = base_path / "data" / "CleanPetImages"
dest_cat = dest_folder / "Cat"
dest_dog = dest_folder / "Dog"

# -----------------------------------------------------------------------------
# CREATE DESTINATION FOLDERS
# -----------------------------------------------------------------------------

dest_folder.mkdir(parents=True, exist_ok=True)
dest_cat.mkdir(exist_ok=True)
dest_dog.mkdir(exist_ok=True)

# -----------------------------------------------------------------------------
# HELPER FUNCTION
# -----------------------------------------------------------------------------

def extract_category_and_filename(rel_path):
    """Extract category (Cat/Dog) and filename from a path."""
    path_parts = rel_path.replace("\\", "/").split("/")
    for i, part in enumerate(path_parts):
        if part in ("Cat", "Dog"):
            category = part
            filename = path_parts[i + 1] if i + 1 < len(path_parts) else None
            return category, filename
    return None, None

# -----------------------------------------------------------------------------
# BUILD EXCLUSION SET FROM BOTH FILES
# -----------------------------------------------------------------------------

excluded_files = {"Cat": set(), "Dog": set()}

# Read clip_cleanup_exclusions.txt (manual confirmations)
clip_count = 0
if clip_cleanup_file.exists():
    with open(clip_cleanup_file, "r") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or line.startswith("-"):
                continue
            path = line.split("\t")[0].strip()
            if path:
                category, filename = extract_category_and_filename(path)
                if category and filename:
                    excluded_files[category].add(filename.lower())
                    clip_count += 1
    print(f"Read {clip_count} exclusions from clip_cleanup_exclusions.txt")
else:
    print(f"WARNING: {clip_cleanup_file} not found")

# Read exclude_list.txt (earlier exclusions)
general_count = 0
if general_exclude_file.exists():
    with open(general_exclude_file, "r") as f:
        for line in f:
            line = line.strip()
            if line:
                category, filename = extract_category_and_filename(line)
                if category and filename:
                    excluded_files[category].add(filename.lower())
                    general_count += 1
    print(f"Read {general_count} exclusions from exclude_list.txt")
else:
    print(f"WARNING: {general_exclude_file} not found")

total_excluded_cats = len(excluded_files["Cat"])
total_excluded_dogs = len(excluded_files["Dog"])
print(f"\nTotal unique exclusions: {total_excluded_cats} cats, {total_excluded_dogs} dogs")

# -----------------------------------------------------------------------------
# COPY CLEAN IMAGES
# -----------------------------------------------------------------------------

def copy_clean_images(source_folder, dest_folder, excluded_set):
    """Copy all images except those in the exclusion set."""
    copied = 0
    skipped = 0
    errors = []
    
    all_images = list(source_folder.glob("*.jpg"))
    
    for img_path in all_images:
        filename_lower = img_path.name.lower()
        
        if filename_lower in excluded_set:
            skipped += 1
            continue
        
        dest_path = dest_folder / img_path.name
        
        try:
            shutil.copy2(img_path, dest_path)
            copied += 1
        except Exception as e:
            errors.append(f"{img_path.name}: {e}")
    
    return copied, skipped, errors, len(all_images)

print("\nCopying clean images...")
print("-" * 50)

# Copy cats
cat_copied, cat_skipped, cat_errors, cat_total = copy_clean_images(
    source_cat, dest_cat, excluded_files["Cat"]
)
print(f"Cat: {cat_copied} copied, {cat_skipped} excluded (from {cat_total} total)")

# Copy dogs
dog_copied, dog_skipped, dog_errors, dog_total = copy_clean_images(
    source_dog, dest_dog, excluded_files["Dog"]
)
print(f"Dog: {dog_copied} copied, {dog_skipped} excluded (from {dog_total} total)")

# -----------------------------------------------------------------------------
# SUMMARY
# -----------------------------------------------------------------------------

print()
print("=" * 70)
print("CLEAN DATASET CREATION COMPLETE")
print("=" * 70)
print()
print("EXCLUSION SOURCES:")
print(f"  {clip_cleanup_file.name}: {clip_count} paths")
print(f"  {general_exclude_file.name}: {general_count} paths")
print()
print("ORIGINAL DATASET:")
print(f"  Cats: {cat_total:,}")
print(f"  Dogs: {dog_total:,}")
print(f"  Total: {cat_total + dog_total:,}")
print()
print("EXCLUDED:")
print(f"  Cats: {cat_skipped:,}")
print(f"  Dogs: {dog_skipped:,}")
print(f"  Total: {cat_skipped + dog_skipped:,}")
print()
print("CLEAN DATASET:")
print(f"  Cats: {cat_copied:,}")
print(f"  Dogs: {dog_copied:,}")
print(f"  Total: {cat_copied + dog_copied:,}")
print()
print(f"DESTINATION: {dest_folder}")

# Verify
final_cats = len(list(dest_cat.glob("*.jpg")))
final_dogs = len(list(dest_dog.glob("*.jpg")))
print()
print("VERIFICATION (files in destination):")
print(f"  Cats: {final_cats:,}")
print(f"  Dogs: {final_dogs:,}")
print(f"  Total: {final_cats + final_dogs:,}")

all_errors = cat_errors + dog_errors
if all_errors:
    print()
    print("ERRORS:")
    for err in all_errors[:20]:
        print(f"  {err}")
    if len(all_errors) > 20:
        print(f"  ... and {len(all_errors) - 20} more")

print("=" * 70)

Read 47 exclusions from clip_cleanup_exclusions.txt
Read 36 exclusions from exclude_list.txt

Total unique exclusions: 44 cats, 32 dogs

Copying clean images...
--------------------------------------------------
Cat: 12456 copied, 44 excluded (from 12500 total)
Dog: 12468 copied, 32 excluded (from 12500 total)

CLEAN DATASET CREATION COMPLETE

EXCLUSION SOURCES:
  clip_cleanup_exclusions.txt: 47 paths
  exclude_list.txt: 36 paths

ORIGINAL DATASET:
  Cats: 12,500
  Dogs: 12,500
  Total: 25,000

EXCLUDED:
  Cats: 44
  Dogs: 32
  Total: 76

CLEAN DATASET:
  Cats: 12,456
  Dogs: 12,468
  Total: 24,924

DESTINATION: C:\AWrk\cats_dogs_project\data\CleanPetImages

VERIFICATION (files in destination):
  Cats: 12,456
  Dogs: 12,468
  Total: 24,924
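
The numbers in this output can be cross-checked with a little arithmetic. The sketch below only restates figures printed above. That the 7 repeated entries are overlaps between the two files is an assumption, since repeats inside a single file or names differing only in letter case would produce the same count.

```python
# Figures taken from the output above.
clip_paths_read = 47
general_paths_read = 36
unique_excluded = {"Cat": 44, "Dog": 32}
original = {"Cat": 12500, "Dog": 12500}

total_read = clip_paths_read + general_paths_read   # 83 paths read in total
total_unique = sum(unique_excluded.values())        # 76 unique exclusions
repeated = total_read - total_unique                # 7 repeated entries

# Clean count per category = original count minus unique exclusions.
clean = {cat: original[cat] - unique_excluded[cat] for cat in original}

print(total_read, total_unique, repeated)
print(clean, sum(clean.values()))
```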


### verify clean dataset - Verify the clean dataset was created correctly
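
The checks in this step reduce to a few set comparisons per category. A compact sketch, assuming a hypothetical function `verify_category` and made-up filenames; note that it compares set membership rather than counts, which is a slightly stricter variant of the count check:

```python
def verify_category(original, clean, excluded):
    """Compare filename sets for one category and report any problems."""
    leaked = clean & excluded                # excluded files that were copied anyway
    unexpected = clean - original            # files that never existed in the source
    missing = (original - excluded) - clean  # good files that were not copied
    return {
        "leaked": leaked,
        "unexpected": unexpected,
        "missing": missing,
        "passed": not (leaked or unexpected or missing),
    }

# Made-up filenames for illustration.
original = {"1.jpg", "2.jpg", "3.jpg"}
excluded = {"2.jpg", "99.jpg"}  # 99.jpg is listed but absent from the source

good = verify_category(original, {"1.jpg", "3.jpg"}, excluded)
bad = verify_category(original, {"1.jpg", "2.jpg"}, excluded)

print(good["passed"])
print(bad["passed"], sorted(bad["leaked"]), sorted(bad["missing"]))
```

An exclusion entry that does not exist in the source folder (like `99.jpg` here) does not cause a failure, which matches the idea of intersecting the exclusion set with the original files before computing the expected count.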

In [6]:
# -----------------------------------------------------------------------------
# PATHS
# -----------------------------------------------------------------------------

base_path = Path(r"C:\AWrk\cats_dogs_project")
outputs_folder = base_path / "outputs"

# Exclusion files
clip_cleanup_file = outputs_folder / "clip_cleanup_exclusions.txt"
general_exclude_file = outputs_folder / "exclude_list.txt"

# Original and clean datasets
original_cat = base_path / "data" / "PetImages" / "Cat"
original_dog = base_path / "data" / "PetImages" / "Dog"
clean_cat = base_path / "data" / "CleanPetImages" / "Cat"
clean_dog = base_path / "data" / "CleanPetImages" / "Dog"

# -----------------------------------------------------------------------------
# HELPER FUNCTION
# -----------------------------------------------------------------------------

def extract_category_and_filename(rel_path):
    """Extract category (Cat/Dog) and filename from a path."""
    path_parts = rel_path.replace("\\", "/").split("/")
    for i, part in enumerate(path_parts):
        if part in ("Cat", "Dog"):
            category = part
            filename = path_parts[i + 1] if i + 1 < len(path_parts) else None
            return category, filename
    return None, None

# -----------------------------------------------------------------------------
# BUILD EXCLUSION SET (same logic as create_clean_dataset.py)
# -----------------------------------------------------------------------------

excluded_files = {"Cat": set(), "Dog": set()}

# Read clip_cleanup_exclusions.txt
if clip_cleanup_file.exists():
    with open(clip_cleanup_file, "r") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or line.startswith("-"):
                continue
            path = line.split("\t")[0].strip()
            if path:
                category, filename = extract_category_and_filename(path)
                if category and filename:
                    excluded_files[category].add(filename.lower())

# Read exclude_list.txt
if general_exclude_file.exists():
    with open(general_exclude_file, "r") as f:
        for line in f:
            line = line.strip()
            if line:
                category, filename = extract_category_and_filename(line)
                if category and filename:
                    excluded_files[category].add(filename.lower())

# -----------------------------------------------------------------------------
# COUNT FILES
# -----------------------------------------------------------------------------

original_cats = set(f.name.lower() for f in original_cat.glob("*.jpg"))
original_dogs = set(f.name.lower() for f in original_dog.glob("*.jpg"))
clean_cats = set(f.name.lower() for f in clean_cat.glob("*.jpg"))
clean_dogs = set(f.name.lower() for f in clean_dog.glob("*.jpg"))

# -----------------------------------------------------------------------------
# VERIFICATION TESTS
# -----------------------------------------------------------------------------

print("=" * 70)
print("CLEAN DATASET VERIFICATION")
print("=" * 70)
print()

all_passed = True

# Test 1: No excluded files should be in clean dataset
print("TEST 1: No excluded files in clean dataset")
print("-" * 50)

leaked_cats = clean_cats.intersection(excluded_files["Cat"])
leaked_dogs = clean_dogs.intersection(excluded_files["Dog"])

if leaked_cats:
    print(f"  ❌ FAIL: {len(leaked_cats)} excluded cats found in clean dataset:")
    for f in sorted(leaked_cats)[:10]:
        print(f"       {f}")
    if len(leaked_cats) > 10:
        print(f"       ... and {len(leaked_cats) - 10} more")
    all_passed = False
else:
    print(f"  ✓ PASS: No excluded cats leaked into clean dataset")

if leaked_dogs:
    print(f"  ❌ FAIL: {len(leaked_dogs)} excluded dogs found in clean dataset:")
    for f in sorted(leaked_dogs)[:10]:
        print(f"       {f}")
    if len(leaked_dogs) > 10:
        print(f"       ... and {len(leaked_dogs) - 10} more")
    all_passed = False
else:
    print(f"  ✓ PASS: No excluded dogs leaked into clean dataset")

print()

# Test 2: Count verification
print("TEST 2: Count verification")
print("-" * 50)

expected_cats = len(original_cats) - len(excluded_files["Cat"].intersection(original_cats))
expected_dogs = len(original_dogs) - len(excluded_files["Dog"].intersection(original_dogs))

print(f"  Cats: {len(original_cats)} original - {len(excluded_files['Cat'].intersection(original_cats))} excluded = {expected_cats} expected")
print(f"        Clean dataset has: {len(clean_cats)}")
if len(clean_cats) == expected_cats:
    print(f"  ✓ PASS: Cat count matches")
else:
    print(f"  ❌ FAIL: Cat count mismatch (diff: {len(clean_cats) - expected_cats})")
    all_passed = False

print()
print(f"  Dogs: {len(original_dogs)} original - {len(excluded_files['Dog'].intersection(original_dogs))} excluded = {expected_dogs} expected")
print(f"        Clean dataset has: {len(clean_dogs)}")
if len(clean_dogs) == expected_dogs:
    print(f"  ✓ PASS: Dog count matches")
else:
    print(f"  ❌ FAIL: Dog count mismatch (diff: {len(clean_dogs) - expected_dogs})")
    all_passed = False

print()

# Test 3: No unexpected files (files in clean that weren't in original)
print("TEST 3: No unexpected files")
print("-" * 50)

unexpected_cats = clean_cats - original_cats
unexpected_dogs = clean_dogs - original_dogs

if unexpected_cats:
    print(f"  ❌ FAIL: {len(unexpected_cats)} unexpected cats in clean dataset")
    all_passed = False
else:
    print(f"  ✓ PASS: All clean cats came from original")

if unexpected_dogs:
    print(f"  ❌ FAIL: {len(unexpected_dogs)} unexpected dogs in clean dataset")
    all_passed = False
else:
    print(f"  ✓ PASS: All clean dogs came from original")

print()

# Test 4: Check exclusion files were found
print("TEST 4: Exclusion files loaded correctly")
print("-" * 50)

if clip_cleanup_file.exists():
    print(f"  ✓ Found: {clip_cleanup_file.name} ({len(excluded_files['Cat']) + len(excluded_files['Dog'])} total after combining)")
else:
    print(f"  ⚠ WARNING: {clip_cleanup_file.name} not found")

if general_exclude_file.exists():
    print(f"  ✓ Found: {general_exclude_file.name}")
else:
    print(f"  ⚠ WARNING: {general_exclude_file.name} not found")

print()

# -----------------------------------------------------------------------------
# SUMMARY
# -----------------------------------------------------------------------------

print("=" * 70)
print("SUMMARY")
print("=" * 70)
print()
print(f"Original dataset:  {len(original_cats):,} cats + {len(original_dogs):,} dogs = {len(original_cats) + len(original_dogs):,} total")
print(f"Exclusions:        {len(excluded_files['Cat']):,} cats + {len(excluded_files['Dog']):,} dogs = {len(excluded_files['Cat']) + len(excluded_files['Dog']):,} total")
print(f"Clean dataset:     {len(clean_cats):,} cats + {len(clean_dogs):,} dogs = {len(clean_cats) + len(clean_dogs):,} total")
print()

if all_passed:
    print("✓ ALL TESTS PASSED - Clean dataset is valid!")
else:
    print("❌ SOME TESTS FAILED - Check issues above")

print("=" * 70)

CLEAN DATASET VERIFICATION

TEST 1: No excluded files in clean dataset
--------------------------------------------------
  ✓ PASS: No excluded cats leaked into clean dataset
  ✓ PASS: No excluded dogs leaked into clean dataset

TEST 2: Count verification
--------------------------------------------------
  Cats: 12500 original - 44 excluded = 12456 expected
        Clean dataset has: 12456
  ✓ PASS: Cat count matches

  Dogs: 12500 original - 32 excluded = 12468 expected
        Clean dataset has: 12468
  ✓ PASS: Dog count matches

TEST 3: No unexpected files
--------------------------------------------------
  ✓ PASS: All clean cats came from original
  ✓ PASS: All clean dogs came from original

TEST 4: Exclusion files loaded correctly
--------------------------------------------------
  ✓ Found: clip_cleanup_exclusions.txt (76 total after combining)
  ✓ Found: exclude_list.txt

SUMMARY

Original dataset:  12,500 cats + 12,500 dogs = 25,000 total
Exclusions:        44 cats + 32 dogs 