# STL-10 Subset Classification: Optimized Preprocessing Pipeline

## Dataset Overview
- **Classes used:** airplane, bird, car, cat, deer (5 classes total)
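- **Source:** STL-10 binary files (obtained from Kaggle), read through `torchvision.datasets.STL10`
- **Image size:** 96x96 as loaded; resized to 64x64 during preprocessing for faster processing
- **Train/Test split:** 80/20 after combining the official train and test sets and shuffling (5,200 training and 1,300 test images)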
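
The split sizes follow from the labeled portion of STL-10, which has 500 training and 800 test images per class. A quick arithmetic check, as a sketch (the constant names are illustrative and not part of the notebook):

```python
# STL-10 provides 500 labeled training and 800 labeled test images per class.
TRAIN_PER_CLASS = 500
TEST_PER_CLASS = 800
NUM_CLASSES = 5          # airplane, bird, car, cat, deer
TRAIN_FRACTION = 0.8

total = NUM_CLASSES * (TRAIN_PER_CLASS + TEST_PER_CLASS)
n_train = int(TRAIN_FRACTION * total)
n_test = total - n_train

print(total, n_train, n_test)  # 6500 5200 1300
```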

## Preprocessing Pipeline
1. Image loading and class filtering
2. Grayscale conversion with CLAHE contrast enhancement
3. Noise reduction with Gaussian blur
4. Normalization and standardization
5. Dimensionality reduction using Incremental PCA

In [2]:
# ================== Cell 1: Import Essential Libraries ================== #
import numpy as np
import cv2
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import IncrementalPCA
import gc

# 1️⃣ Load STL-10 Binary Data
We use the STL-10 dataset downloaded from Kaggle in binary format. The official train and test sets are combined, and only the 5 target classes are kept.


In [3]:
# ================== Cell 2: Load STL-10 subset (5 classes) ==================
from torchvision import datasets, transforms

# --- Target Classes ---
TARGET_CLASSES = [0, 1, 2, 3, 4]
class_names = ['airplane', 'bird', 'car', 'cat', 'deer']

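# --- Resize transform ---
# Note: this transform only runs when samples are fetched through the dataset's
# __getitem__. The code below reads `train_set.data` directly, which bypasses it,
# so images stay 96x96 here and are resized later in the preprocessing cell.
transform = transforms.Compose([
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
])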

# Load STL-10 dataset
print("Loading STL-10 dataset...")
train_set = datasets.STL10(root='../../Datasets', split='train', download=True, transform=transform)
test_set = datasets.STL10(root='../../Datasets', split='test', download=True, transform=transform)

# Combine and filter
print("Combining and filtering data...")
combined_data = np.concatenate([train_set.data, test_set.data], axis=0)
combined_labels = np.concatenate([train_set.labels, test_set.labels], axis=0)

# Filter target classes (boolean mask over all labels)
mask = np.isin(combined_labels, TARGET_CLASSES)
X = combined_data[mask]
y = combined_labels[mask]

# Re-map labels to 0..4
label_map = {old: new for new, old in enumerate(TARGET_CLASSES)}
y = np.array([label_map[label] for label in y])

# Shuffle and split (fixed seed so the split is reproducible; the value 42 is arbitrary)
print("Shuffling and splitting...")
rng = np.random.default_rng(42)
idx = rng.permutation(len(y))
X = X[idx]
y = y[idx]

split = int(0.8 * len(y))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")
print(f"Train labels: {len(y_train)}, Test labels: {len(y_test)}")

Loading STL-10 dataset...
Combining and filtering data...
Shuffling and splitting...
Train shape: (5200, 3, 96, 96), Test shape: (1300, 3, 96, 96)
Train labels: 5200, Test labels: 1300
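
Because the split is a plain random shuffle rather than a stratified one, the per-class counts in each split are only approximately balanced. A minimal sketch for checking this, using synthetic labels in place of `y_train` and `y_test` (the helper `class_counts` and the `*_demo` names are hypothetical, not part of the notebook):

```python
import numpy as np

def class_counts(labels, num_classes=5):
    """Count how many samples of each class appear in `labels`."""
    return np.bincount(labels, minlength=num_classes)

# Synthetic stand-in for the notebook's labels: 1300 samples per class.
rng = np.random.default_rng(0)
y_all = rng.permutation(np.repeat(np.arange(5), 1300))

split = int(0.8 * len(y_all))
y_train_demo, y_test_demo = y_all[:split], y_all[split:]

print("train:", class_counts(y_train_demo))
print("test: ", class_counts(y_test_demo))
```

If exact balance matters, `sklearn.model_selection.train_test_split` with `stratify=y` is one alternative to the manual shuffle.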


# 2️⃣ Preprocessing
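- Resize images to 64x64  
- Convert to grayscale and apply CLAHE  
- Apply a mild 3x3 Gaussian blur  
- Normalize to [0,1]  
- Flatten into a feature matrix (4,096 features per image)  
- StandardScaler (applied in the standardization cell that follows)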



In [4]:
# ================== Cell 3: Optimized Preprocessing ==================
def optimized_preprocessing(images, target_size=64):
    """
    Optimized preprocessing pipeline for STL-10 images
    
    Args:
        images: Input uint8 images in (batch, channels, height, width) format
        target_size: Target height and width for resizing
    
    Returns:
        float32 array of shape (batch, target_size * target_size), values in [0, 1]
    """
    processed_images = []

    # Create the CLAHE operator once and reuse it for every image.
    # CLAHE (Contrast Limited Adaptive Histogram Equalization) preserves local
    # detail better than standard histogram equalization.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

    for img in images:
        # Convert from (channels, height, width) to (height, width, channels)
        # for OpenCV processing
        img_processed = np.transpose(img, (1, 2, 0))

        # 1. Resize to target size
        if img_processed.shape[0] != target_size or img_processed.shape[1] != target_size:
            img_processed = cv2.resize(img_processed, (target_size, target_size))

        # 2. Convert to grayscale (sufficient for traditional ML algorithms)
        gray = cv2.cvtColor(img_processed, cv2.COLOR_RGB2GRAY)

        # 3. Apply CLAHE
        gray = clahe.apply(gray)
        
        # 4. Apply mild Gaussian blur to reduce noise
        gray = cv2.GaussianBlur(gray, (3, 3), 0)
        
        # 5. Normalize to [0, 1] range
        gray = gray.astype(np.float32) / 255.0
        
        processed_images.append(gray.flatten())  # Flatten directly
    
    return np.array(processed_images)

print("\nApplying optimized preprocessing...")
Xtr_processed = optimized_preprocessing(X_train)
Xte_processed = optimized_preprocessing(X_test)

print(f"After preprocessing - Train shape: {Xtr_processed.shape}, Test shape: {Xte_processed.shape}")


Applying optimized preprocessing...
After preprocessing - Train shape: (5200, 4096), Test shape: (1300, 4096)
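
To make the layout change and the 4,096-feature count concrete, here is a NumPy-only sketch on a synthetic image. It only approximates the grayscale step: OpenCV's `COLOR_RGB2GRAY` uses the same luminance weights but rounds the result to `uint8`. CLAHE and the blur are left out, and all names are illustrative.

```python
import numpy as np

# Synthetic channel-first RGB image, standing in for one STL-10 sample
# that has already been resized to 64x64.
rng = np.random.default_rng(0)
img_chw = rng.integers(0, 256, size=(3, 64, 64), dtype=np.uint8)

# (channels, height, width) -> (height, width, channels)
img_hwc = np.transpose(img_chw, (1, 2, 0))

# Luminance weights for RGB -> gray conversion (ITU-R BT.601).
weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
gray = img_hwc.astype(np.float32) @ weights      # shape (64, 64)

# Scale to [0, 1] and flatten into one feature vector per image.
features = (gray / 255.0).flatten()              # shape (4096,)
print(img_hwc.shape, gray.shape, features.shape)
```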


# 3️⃣ Feature Extraction (PCA)
Standardize the features, then reduce the 4,096 pixel features to 200 principal components with Incremental PCA.


In [5]:
# ================== Cell 4: Feature Standardization ==================
print("\nApplying feature standardization...")
scaler = StandardScaler()
Xtr_scaled = scaler.fit_transform(Xtr_processed)
Xte_scaled = scaler.transform(Xte_processed)

print(f"Scaled features - Mean: {Xtr_scaled.mean():.6f}, Std: {Xtr_scaled.std():.6f}")
print(f"Features per image: {Xtr_scaled.shape[1]}")



Applying feature standardization...
Scaled features - Mean: 0.000000, Std: 1.000000
Features per image: 4096
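
The scaler is fitted on the training split only and then reused for the test split, which is why only the training features come out with a mean of exactly 0 and a standard deviation of 1. A NumPy sketch of the same idea on synthetic data (the arrays and names here are illustrative, not the notebook's):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.random((100, 8)).astype(np.float32)   # synthetic "training" features
test = rng.random((20, 8)).astype(np.float32)     # synthetic "test" features

# Statistics come from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)
std[std == 0] = 1.0   # guard against constant features

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std   # reuses the training statistics

print(train_scaled.mean(), train_scaled.std())   # close to 0 and 1
print(test_scaled.mean(), test_scaled.std())     # near, but not exactly, 0 and 1
```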


In [6]:
# ================== Cell 5: Efficient Dimensionality Reduction ==================
print("\nApplying Incremental PCA...")

# Use a fixed number of components (capped at 200); this is not chosen by a variance threshold
n_components = min(200, Xtr_scaled.shape[1] - 1)
batch_size = min(500, Xtr_scaled.shape[0])

print(f"PCA Parameters: n_components={n_components}, batch_size={batch_size}")

# Initialize and fit PCA
pca = IncrementalPCA(n_components=n_components, batch_size=batch_size)

# Fit PCA incrementally
n_batches = int(np.ceil(Xtr_scaled.shape[0] / batch_size))
for i in range(n_batches):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, Xtr_scaled.shape[0])
    batch = Xtr_scaled[start_idx:end_idx]
    pca.partial_fit(batch)
    
    if (i + 1) % 5 == 0 or (i + 1) == n_batches:
        print(f"  Fitted {end_idx}/{Xtr_scaled.shape[0]} samples")

# Transform data
print("\nTransforming data...")
Xtr_pca = pca.transform(Xtr_scaled)
Xte_pca = pca.transform(Xte_scaled)

print("\nPCA Transformation Complete!")
print(f"  Original shape: {Xtr_scaled.shape}")
print(f"  Reduced shape: {Xtr_pca.shape}")
print(f"  Explained variance: {np.sum(pca.explained_variance_ratio_):.4f}")
print(f"  Number of components: {pca.n_components_}")

# Memory cleanup
print("\nCleaning up memory...")
del Xtr_processed, Xte_processed, Xtr_scaled, Xte_scaled
gc.collect()

print("\nPreprocessing pipeline completed successfully!")
print(f"Final training data shape: {Xtr_pca.shape}")
print(f"Final test data shape: {Xte_pca.shape}")


Applying Incremental PCA...
PCA Parameters: n_components=200, batch_size=500
  Fitted 2500/5200 samples
  Fitted 5000/5200 samples
  Fitted 5200/5200 samples

Transforming data...

PCA Transformation Complete!
  Original shape: (5200, 4096)
  Reduced shape: (5200, 200)
  Explained variance: 0.8800
  Number of components: 200

Cleaning up memory...

Preprocessing pipeline completed successfully!
Final training data shape: (5200, 200)
Final test data shape: (1300, 200)
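
In the recorded run, 200 components retain about 88% of the variance. If a specific variance target were wanted instead, the component count could be read off the cumulative explained-variance curve. A sketch under that assumption: `components_for_variance` is a hypothetical helper, and the ratios below are synthetic rather than taken from the notebook.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, target=0.95):
    """Return the smallest k whose first k ratios sum to at least `target`,
    or None if the fitted components never reach it."""
    cumulative = np.cumsum(explained_variance_ratio)
    if cumulative[-1] < target:
        return None
    # Index of the first cumulative value >= target, converted to a count.
    return int(np.searchsorted(cumulative, target) + 1)

# Synthetic, decreasing ratios for illustration only.
ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.03125])
print(components_for_variance(ratios, target=0.90))  # 4
print(components_for_variance(ratios, target=0.99))  # None
```

With the notebook's model, the input would be `pca.explained_variance_ratio_`. A `None` result would mean that more than 200 components are needed to reach the target.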


In [7]:
# ================== Cell 6: Save Processed Data for Next Notebook ==================
import joblib
import os

print("\nSaving processed data for use in modeling notebook...")

# Create directories if they don't exist
os.makedirs('../data/processed', exist_ok=True)
os.makedirs('../models', exist_ok=True)

# Save processed data
np.save('../data/processed/X_train_pca.npy', Xtr_pca)
np.save('../data/processed/X_test_pca.npy', Xte_pca)
np.save('../data/processed/y_train.npy', y_train)
np.save('../data/processed/y_test.npy', y_test)

# Save the scaler and PCA models for consistency
joblib.dump(scaler, '../models/scaler.pkl')
joblib.dump(pca, '../models/pca_model.pkl')

# Save class names for reference
np.save('../data/processed/class_names.npy', np.array(class_names))

print("Data saved successfully!")
print("Files saved in '../data/processed/':")
print("  - X_train_pca.npy: PCA-transformed training features")
print("  - X_test_pca.npy: PCA-transformed test features")
print("  - y_train.npy: Training labels")
print("  - y_test.npy: Test labels")
print("  - class_names.npy: Class names for reference")
print("\nModels saved in '../models/':")
print("  - scaler.pkl: StandardScaler for consistency")
print("  - pca_model.pkl: PCA model for transforming new data")


Saving processed data for use in modeling notebook...
Data saved successfully!
Files saved in '../data/processed/':
  - X_train_pca.npy: PCA-transformed training features
  - X_test_pca.npy: PCA-transformed test features
  - y_train.npy: Training labels
  - y_test.npy: Test labels
  - class_names.npy: Class names for reference

Models saved in '../models/':
  - scaler.pkl: StandardScaler for consistency
  - pca_model.pkl: PCA model for transforming new data
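
For the modeling notebook, the saved arrays could be read back along these lines. This is a sketch only: `load_processed` is a hypothetical helper, and it assumes the file names written by the save cell above.

```python
import os
import numpy as np

def load_processed(data_dir):
    """Load the arrays written by the save cell from `data_dir`.

    Returns a dict keyed by file stem, e.g. data["X_train_pca"].
    """
    names = ["X_train_pca", "X_test_pca", "y_train", "y_test", "class_names"]
    return {name: np.load(os.path.join(data_dir, f"{name}.npy")) for name in names}
```

A call such as `data = load_processed('../data/processed')` would return the features, labels, and class names. The fitted scaler and PCA objects can be restored separately with `joblib.load('../models/scaler.pkl')` and `joblib.load('../models/pca_model.pkl')`.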
