# Comic Dataset Exploration

This notebook explores the structure of the comic dataset located at `C:\Users\uanus\Box\AML Comic Project`.

## Objectives:
1. List all folders in the dataset directory
2. Count the number of images in each folder
3. Analyze the dataset structure and distribution


In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
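import warnings

# Suppress all warnings to keep the notebook output readable.
# Note: this also hides deprecation warnings from pandas and seaborn.
warnings.filterwarnings('ignore')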

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")


## 1. Dataset Path Configuration


In [2]:
# Define the dataset path
dataset_path = Path(r"C:\Users\uanus\Box\AML Comic Project")

print(f"Dataset path: {dataset_path}")
print(f"Path exists: {dataset_path.exists()}")

if dataset_path.exists():
    print(f"Path is directory: {dataset_path.is_dir()}")
else:
    print("Warning: Dataset path does not exist!")


Dataset path: C:\Users\uanus\Box\AML Comic Project
Path exists: True
Path is directory: True
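
The path above is hard-coded to one machine, so the notebook will report a missing path anywhere else. A minimal sketch of one way to make it configurable, assuming a hypothetical environment variable named `COMIC_DATASET_DIR` (not part of the original notebook) and falling back to the path used here:

```python
import os
from pathlib import Path

# Path used in this notebook; kept as the fallback.
DEFAULT_DATASET_DIR = r"C:\Users\uanus\Box\AML Comic Project"

def resolve_dataset_path(env_var: str = "COMIC_DATASET_DIR") -> Path:
    """Return the path from the (hypothetical) environment variable, else the default."""
    # `or` also covers the case where the variable is set but empty.
    return Path(os.environ.get(env_var) or DEFAULT_DATASET_DIR)

dataset_path = resolve_dataset_path()
print(f"Dataset path: {dataset_path}")
print(f"Path is directory: {dataset_path.is_dir()}")
```

With this in place, the remaining cells can stay unchanged because they only rely on `dataset_path`.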


## 2. List All Folders in Dataset Directory


In [9]:
# List all folders in the dataset directory
if dataset_path.exists():
    folders = [f for f in dataset_path.iterdir() if f.is_dir()]
    # Sort numerically so that "10" comes after "9"; non-numeric names go last
    folders.sort(key=lambda x: int(x.name) if x.name.isdigit() else float('inf'))
    
    print(f"Found {len(folders)} folders in the dataset:")
    print("=" * 50)
    
    for i, folder in enumerate(folders, 1):
        print(f"{i:2d}. {folder.name}")
    
    print(f"\nTotal folders: {len(folders)}")
else:
    print("Cannot list folders - dataset path does not exist")


Found 15 folders in the dataset:
 1. 0
 2. 1
 3. 2
 4. 3
 5. 4
 6. 5
 7. 6
 8. 7
 9. 8
10. 9
11. 10
12. 11
13. 12
14. 13
15. 14

Total folders: 15
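
The numeric sort key matters because folder names are strings, and plain string sorting would place `10` before `2`. A small illustration using the folder names listed above (the helper name `numeric_key` is introduced here for illustration only):

```python
# Folder names "0" .. "14", as listed in the output above.
names = [str(i) for i in range(15)]

def numeric_key(name: str):
    """Sort digit-only names by value; push anything else to the end."""
    return int(name) if name.isdigit() else float("inf")

lexicographic = sorted(names)
numeric = sorted(names, key=numeric_key)

print(lexicographic[:5])  # ['0', '1', '10', '11', '12']
print(numeric[:5])        # ['0', '1', '2', '3', '4']
```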


## 3. Count Images in Each Folder


In [10]:
def count_images_in_folder(folder_path):
    """Count the .jpg files directly inside a folder (subfolders are not searched)."""
    try:
        jpg_files = list(folder_path.glob("*.jpg"))
        return len(jpg_files)
    except Exception as e:
        print(f"Error counting images in {folder_path.name}: {e}")
        return 0

# Count images in each folder
folder_stats = []

if dataset_path.exists():
    for folder in folders:
        image_count = count_images_in_folder(folder)
        folder_stats.append({
            'folder': folder.name,
            'image_count': image_count,
            'folder_path': str(folder)
        })
    
    # Create DataFrame for analysis
    df = pd.DataFrame(folder_stats)
    
    print("Image Count by Folder:")
    print("=" * 50)
    
    for _, row in df.iterrows():
        print(f"Folder {row['folder']:>3s}: {row['image_count']:>4d} images")
    
    print("=" * 50)
    print(f"Total images across all folders: {df['image_count'].sum():,}")
    print(f"Average images per folder: {df['image_count'].mean():.1f}")
    print(f"Median images per folder: {df['image_count'].median():.1f}")
else:
    print("Cannot count images - dataset path does not exist")


Image Count by Folder:
Folder   0:  314 images
Folder   1:  452 images
Folder   2:  221 images
Folder   3:  207 images
Folder   4:  309 images
Folder   5:  221 images
Folder   6:  232 images
Folder   7:  222 images
Folder   8:  216 images
Folder   9:  208 images
Folder  10:  202 images
Folder  11:  215 images
Folder  12:  215 images
Folder  13:  219 images
Folder  14:    4 images
Total images across all folders: 3,457
Average images per folder: 230.5
Median images per folder: 219.0
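
`count_images_in_folder` matches only the `*.jpg` pattern at the top level of each folder, so files with other extensions such as `.jpeg` or `.png` would not be counted if any exist. Whether the dataset contains such files cannot be told from the output above. A hedged sketch of a broader count, where the function name `count_images_by_extension` and the extension set are assumptions for illustration:

```python
from collections import Counter
from pathlib import Path

# Assumed set of image extensions; adjust to the formats actually present.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def count_images_by_extension(folder_path: Path) -> Counter:
    """Count top-level files per image extension, ignoring case (e.g. .JPG)."""
    counts = Counter()
    for entry in folder_path.iterdir():
        suffix = entry.suffix.lower()
        if entry.is_file() and suffix in IMAGE_EXTENSIONS:
            counts[suffix] += 1
    return counts
```

Calling `sum(count_images_by_extension(folder).values())` for each folder and comparing it with the `.jpg` count would show whether any images are being missed.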


## 4. Display Detailed Statistics


In [12]:
if 'df' in locals() and not df.empty:
    print("\nDetailed Statistics:")
    print("=" * 50)
    
    # Basic statistics
    print(f"Minimum images in a folder: {df['image_count'].min()}")
    print(f"Maximum images in a folder: {df['image_count'].max()}")
    
    # Show top 5 folders with most images
    print("\nTop 5 folders with most images:")
    top_folders = df.nlargest(5, 'image_count')
    for _, row in top_folders.iterrows():
        print(f"  Folder {row['folder']}: {row['image_count']} images")
    
    # Show the 3 folders with the fewest images
    print("\nFolders with fewest images:")
    bottom_folders = df.nsmallest(3, 'image_count')
    for _, row in bottom_folders.iterrows():
        print(f"  Folder {row['folder']}: {row['image_count']} images")



Detailed Statistics:
Minimum images in a folder: 4
Maximum images in a folder: 452

Top 5 folders with most images:
  Folder 1: 452 images
  Folder 0: 314 images
  Folder 4: 309 images
  Folder 6: 232 images
  Folder 7: 222 images

Folders with fewest images:
  Folder 14: 4 images
  Folder 10: 202 images
  Folder 3: 207 images
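
Folder 14 holds only 4 images, far below the median of 219, which suggests it may be incomplete or a leftover directory and is worth checking before training. One way to flag such folders automatically is a median-based threshold. The sketch below uses the counts printed above; the 50% cutoff and the name `flag_sparse_folders` are illustrative assumptions, not part of the original analysis:

```python
from statistics import median

# Image counts per folder, copied from the output above.
image_counts = {
    "0": 314, "1": 452, "2": 221, "3": 207, "4": 309,
    "5": 221, "6": 232, "7": 222, "8": 216, "9": 208,
    "10": 202, "11": 215, "12": 215, "13": 219, "14": 4,
}

def flag_sparse_folders(counts: dict, fraction: float = 0.5) -> list:
    """Return folders whose count is below `fraction` of the median count."""
    cutoff = fraction * median(counts.values())
    return [name for name, n in counts.items() if n < cutoff]

sparse = flag_sparse_folders(image_counts)
print(sparse)  # ['14']
```

The median is used rather than the mean because the mean is itself pulled around by extreme folders such as 1 and 14.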


## 5. Visualizations
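
One possible visualization for this section is a bar chart of images per folder with the median marked. The sketch below uses the counts reported in section 3; the function name `plot_image_counts` and the styling choices are assumptions for illustration, not the notebook's original plotting code:

```python
import matplotlib.pyplot as plt

# Image counts per folder, copied from the section 3 output.
folder_names = [str(i) for i in range(15)]
image_counts = [314, 452, 221, 207, 309, 221, 232, 222,
                216, 208, 202, 215, 215, 219, 4]

def plot_image_counts(names, counts, median_value=219.0):
    """Draw a bar chart of image counts with a dashed line at the median."""
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.bar(names, counts)
    ax.axhline(median_value, linestyle="--", color="gray",
               label=f"Median ({median_value:.0f})")
    ax.set_xlabel("Folder")
    ax.set_ylabel("Number of images")
    ax.set_title("Images per folder")
    ax.legend()
    return ax

ax = plot_image_counts(folder_names, image_counts)
```

In a notebook the figure renders inline after the cell runs; in a plain script, `plt.show()` would be needed.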


## 6. Sample Image Analysis


## 7. Export Results


In [8]:
# Save the analysis results to CSV
if 'df' in locals() and not df.empty:
    # Save folder statistics
    df.to_csv('comic_dataset_folder_stats.csv', index=False)
    print("Folder statistics saved to 'comic_dataset_folder_stats.csv'")
    
    # Create a summary report
    summary_report = f"""
Comic Dataset Summary Report
{'='*50}

Dataset Location: {dataset_path}
Total Folders: {len(df)}
Total Images: {df['image_count'].sum():,}

Statistics:
- Average images per folder: {df['image_count'].mean():.1f}
- Median images per folder: {df['image_count'].median():.1f}
- Minimum images in folder: {df['image_count'].min()}
- Maximum images in folder: {df['image_count'].max()}
- Standard deviation: {df['image_count'].std():.1f}

Folder Distribution:
- Folders with 100+ images: {len(df[df['image_count'] >= 100])}
- Folders with 200+ images: {len(df[df['image_count'] >= 200])}
- Folders with 300+ images: {len(df[df['image_count'] >= 300])}

Generated on: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}
"""
    
    with open('comic_dataset_summary.txt', 'w', encoding='utf-8') as f:
        f.write(summary_report)
    
    print("Summary report saved to 'comic_dataset_summary.txt'")
    print("\nSummary Report:")
    print(summary_report)


Folder statistics saved to 'comic_dataset_folder_stats.csv'
Summary report saved to 'comic_dataset_summary.txt'

Summary Report:

Comic Dataset Summary Report

Dataset Location: C:\Users\uanus\Box\AML Comic Project
Total Folders: 15
Total Images: 3,457

Statistics:
- Average images per folder: 230.5
- Median images per folder: 219.0
- Minimum images in folder: 4
- Maximum images in folder: 452
- Standard deviation: 91.2

Folder Distribution:
- Folders with 100+ images: 14
- Folders with 200+ images: 14
- Folders with 300+ images: 3

Generated on: 2025-10-06 00:03:30
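
The standard deviation in the report comes from pandas' `Series.std()`, which by default computes the sample standard deviation (`ddof=1`). As a cross-check, the reported figures can be reproduced from the per-folder counts with the standard library alone. This is a verification sketch, not part of the original notebook:

```python
from statistics import mean, median, stdev

# Image counts for folders 0..14, copied from the section 3 output.
counts = [314, 452, 221, 207, 309, 221, 232, 222,
          216, 208, 202, 215, 215, 219, 4]

total = sum(counts)
sample_std = stdev(counts)  # sample standard deviation, matching pandas' default
over_300 = sum(c >= 300 for c in counts)

print(f"Total: {total:,}")                  # 3,457
print(f"Mean: {mean(counts):.1f}")          # 230.5
print(f"Median: {median(counts):.1f}")      # 219.0
print(f"Sample std: {sample_std:.1f}")      # 91.2
print(f"Folders with 300+ images: {over_300}")  # 3
```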

