# Dataset Processing Pipeline for Image Classification

## Overview
This notebook processes raw image folders into a machine learning-ready pickle dataset.

### Features:
- Loads images from class folders
- Resizes all images to 128×128 pixels
- Caps each class at 1000 randomly sampled images (classes with fewer images are kept whole)
- Saves the result as a pickle file for fast loading

### Input Structure:
```
data/
├── butterfly/
├── cat/
├── chicken/
└── ...
```
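
Before running the pipeline, it can help to check how many files each class folder holds. The sketch below is illustrative only: `count_files_per_class` is a hypothetical helper that is not part of this notebook, and it runs against a throwaway directory that mimics the layout above rather than the real `data/` folder.

```python
import os
import tempfile

# Hypothetical helper (not part of the notebook): count files per class folder.
def count_files_per_class(root):
    counts = {}
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        if os.path.isdir(class_dir):
            counts[name] = sum(
                os.path.isfile(os.path.join(class_dir, f))
                for f in os.listdir(class_dir)
            )
    return counts

# Build a tiny throwaway tree that mimics the layout above.
with tempfile.TemporaryDirectory() as root:
    for class_name, n in [("butterfly", 3), ("cat", 2)]:
        os.makedirs(os.path.join(root, class_name))
        for i in range(n):
            open(os.path.join(root, class_name, f"{i}.jpg"), "w").close()
    demo_counts = count_files_per_class(root)

print(demo_counts)  # {'butterfly': 3, 'cat': 2}
```

Pointing the helper at the real dataset folder would show which classes fall below the per-class limit.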

## 1. Import Libraries

In [1]:
import os
import pickle
import numpy as np
from PIL import Image
from tqdm import tqdm
import random

## 2. Configuration Parameters

Main settings for dataset processing.

In [2]:
DATASET_PATH = "../data"  # Input folder with class subfolders
IMAGE_SIZE = (128, 128)   # Target size for all images
MAX_IMAGES_PER_CLASS = 1000  # Upper limit per class; larger classes are randomly down-sampled
OUTPUT_PATH = "../data/dataset.pkl"  # Output pickle file
SUPPORTED_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.bmp', '.gif', '.tiff'}
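
The effect of a per-class limit can be sketched on plain file-name lists. This is a minimal illustration, not the notebook's own code: `cap_class` is a hypothetical helper, and it uses a local `random.Random(42)` generator as an assumption, whereas the pipeline seeds the global `random` module.

```python
import random

# Hypothetical helper: keep at most `limit` items, sampled without replacement.
def cap_class(files, limit, rng):
    if len(files) > limit:
        return rng.sample(files, limit)
    return list(files)

rng = random.Random(42)  # local generator, used here only for the demo
large = [f"img_{i}.jpg" for i in range(2500)]  # class above the limit
small = [f"img_{i}.jpg" for i in range(300)]   # class below the limit

capped_large = cap_class(large, 1000, rng)
capped_small = cap_class(small, 1000, rng)

print(len(capped_large), len(capped_small))  # 1000 300
```

A cap like this only shrinks large classes. A class with fewer images than the limit stays smaller, so the result is balanced only when every class reaches the limit.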

## 3. Processing Functions

Core functions for image loading, filtering, and dataset creation.

In [3]:
def load_and_resize_image(image_path, target_size):
    """
    Load image, convert to RGB, and resize to target size.
    Returns None if image cannot be processed.
    """
    try:
        with Image.open(image_path) as img:
            if img.mode != 'RGB':
                img = img.convert('RGB')
            img_resized = img.resize(target_size, Image.Resampling.LANCZOS)
            return np.array(img_resized)
    except Exception as e:
        print(f"Error: {image_path}: {e}")
        return None

def get_image_files(folder_path):
    """
    Get all image files from a folder based on supported extensions.
    """
    image_files = []
    for file in os.listdir(folder_path):
        if os.path.splitext(file.lower())[1] in SUPPORTED_EXTENSIONS:
            image_files.append(os.path.join(folder_path, file))
    return image_files

def process_dataset():
    """
    Process all class folders and create balanced dataset.
    Returns images (X), labels (y), and class names.
    """
    # Find all class folders
    class_folders = [item for item in os.listdir(DATASET_PATH) 
                    if os.path.isdir(os.path.join(DATASET_PATH, item))]
    class_folders.sort()    
    print(f"\nClasses found: {class_folders}")
    
    X = []
    y = []

    # Process each class
    for class_name in class_folders:
        print(f"\nProcessing: {class_name}")
        class_path = os.path.join(DATASET_PATH, class_name)
        image_files = get_image_files(class_path)
        
        # Limit images per class for balancing
        if len(image_files) > MAX_IMAGES_PER_CLASS:
            image_files = random.sample(image_files, MAX_IMAGES_PER_CLASS)
        
        # Process each image, counting the ones that load successfully
        added = 0
        for img_path in tqdm(image_files, desc=f"Processing {class_name}"):
            img_array = load_and_resize_image(img_path, IMAGE_SIZE)
            if img_array is not None:
                X.append(img_array)
                y.append(class_name)
                added += 1

        # Report the count without loading every image a second time
        print(f"  {added} images added !")
    
    # Convert to numpy arrays
    X = np.array(X)
    y = np.array(y)
    
    print("\nResult:")
    print(f"X shape: {X.shape}")
    print(f"y shape: {y.shape}")
    print(f"Classes: {len(class_folders)}")
    print(f"Example y: {y[:5]}")
    
    return X, y, class_folders
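
`process_dataset()` returns the labels as class-name strings, while most training code expects integer class indices. A minimal sketch of that conversion follows, assuming the sorted class list is used as the index order. The short lists are stand-ins for the real return values, and `class_to_index` and `y_encoded` are illustrative names.

```python
# Stand-ins for the values returned by process_dataset()
class_names = ["butterfly", "cat", "chicken"]
y = ["butterfly", "butterfly", "chicken", "cat"]

# Map each class name to its position in the sorted class list
class_to_index = {name: i for i, name in enumerate(class_names)}

# Replace every string label with its integer index
y_encoded = [class_to_index[label] for label in y]

print(y_encoded)  # [0, 0, 2, 1]
```

The same dictionary lookup also works when `y` is a NumPy array of strings, because NumPy string scalars hash and compare like ordinary Python strings.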

## 4. Save Dataset

Save the processed dataset as a pickle file for fast loading during training.

In [4]:
def save_dataset(X, y, class_names, output_path):
    """
    Save dataset as pickle file with images, labels, and class names.
    """
    dataset = {
        'X': X,
        'y': y,
        'class_names': class_names
    }
    with open(output_path, 'wb') as f:
        pickle.dump(dataset, f)
    
    file_size = os.path.getsize(output_path) / (1024**2)
    print(f"\nDataset saved: {output_path} ({file_size:.2f} MB)")
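
Reading the file back is the mirror image of `save_dataset`. The sketch below assumes the same three dictionary keys. `load_dataset` is a hypothetical counterpart that is not defined in this notebook, and the round trip uses a tiny toy dictionary in a temporary folder instead of the real dataset.

```python
import os
import pickle
import tempfile

# Hypothetical counterpart to save_dataset(): read the three entries back.
def load_dataset(path):
    with open(path, "rb") as f:
        dataset = pickle.load(f)
    return dataset["X"], dataset["y"], dataset["class_names"]

# Round-trip demo with toy data in place of the real arrays
toy = {"X": [[0, 1], [2, 3]], "y": ["cat", "dog"], "class_names": ["cat", "dog"]}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy, f)
    X_loaded, y_loaded, names_loaded = load_dataset(path)

print(y_loaded, names_loaded)  # ['cat', 'dog'] ['cat', 'dog']
```

Because `pickle.load` can execute arbitrary code while unpickling, it should only be used on files from a trusted source.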

## 5. Execute Pipeline

Run the complete processing pipeline with reproducible random sampling.

In [5]:
if __name__ == "__main__":
    # Set random seeds for reproducibility
    random.seed(42)
    np.random.seed(42)

    print("🚀 Starting processing...")
    X, y, class_names = process_dataset()
    save_dataset(X, y, class_names, OUTPUT_PATH)
    print("\n✅ Completed!")
    print(f"📁 Dataset: {OUTPUT_PATH}")

🚀 Starting processing...

Classes found: ['butterfly', 'cat', 'chicken', 'cow', 'dog', 'elephant', 'horse', 'sheep', 'spider', 'squirrel']

Processing: butterfly


Processing butterfly: 100%|██████████| 1000/1000 [00:03<00:00, 262.74it/s]


  1000 images added !

Processing: cat


Processing cat: 100%|██████████| 1000/1000 [00:08<00:00, 124.25it/s]


  1000 images added !

Processing: chicken


Processing chicken: 100%|██████████| 1000/1000 [00:01<00:00, 581.79it/s]


  1000 images added !

Processing: cow


Processing cow: 100%|██████████| 1000/1000 [00:01<00:00, 614.63it/s]


  1000 images added !

Processing: dog


Processing dog: 100%|██████████| 1000/1000 [00:01<00:00, 566.80it/s]


  1000 images added !

Processing: elephant


Processing elephant: 100%|██████████| 1000/1000 [00:02<00:00, 381.66it/s]


  1000 images added !

Processing: horse


Processing horse: 100%|██████████| 1000/1000 [00:01<00:00, 617.82it/s]


  1000 images added !

Processing: sheep


Processing sheep: 100%|██████████| 1000/1000 [00:02<00:00, 424.57it/s]


  1000 images added !

Processing: spider


Processing spider: 100%|██████████| 1000/1000 [00:01<00:00, 525.40it/s]


  1000 images added !

Processing: squirrel


Processing squirrel: 100%|██████████| 1000/1000 [00:01<00:00, 605.83it/s]


  1000 images added !

Result:
X shape: (10000, 128, 128, 3)
y shape: (10000,)
Classes: 10
Example y: ['butterfly' 'butterfly' 'butterfly' 'butterfly' 'butterfly']

Dataset saved: ../data/dataset.pkl (469.09 MB)

✅ Completed!
📁 Dataset: ../data/dataset.pkl
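
The reported file size can be sanity-checked from the array shapes above. The arithmetic below rests on two assumptions: `X` holds one byte per value (`uint8`, which is what `np.array` yields for an RGB PIL image), and `y` is a fixed-width NumPy Unicode array at 4 bytes per character, sized to the longest class name. Pickle overhead is ignored.

```python
n_images, height, width, channels = 10000, 128, 128, 3

# X: one byte per channel value (assumed uint8)
x_bytes = n_images * height * width * channels

# y: fixed-width Unicode strings, 4 bytes per character (assumed),
# padded to the longest class name in the run above
longest_label = len("butterfly")
y_bytes = n_images * longest_label * 4

total_mib = (x_bytes + y_bytes) / 1024**2
print(f"{x_bytes / 1024**2:.2f} MiB + {y_bytes / 1024**2:.2f} MiB = {total_mib:.2f} MiB")
# 468.75 MiB + 0.34 MiB = 469.09 MiB
```

Under these assumptions the estimate agrees with the 469.09 MB printed by `save_dataset`, which divides the byte count by 1024**2. That suggests the file is essentially the raw pixel data with negligible overhead.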
