### Introduction
This notebook prepares the CIFAR-10 dataset for training a Convolutional Neural Network (CNN). The preprocessing steps are: normalizing pixel values, splitting the data into training, validation, and test sets, configuring data augmentation to improve model generalization, and one-hot encoding the labels. The processed arrays are then saved to disk for reuse during training.

### Importing Libraries

In [1]:
import numpy as np
import os
from tensorflow.keras.datasets import cifar10
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.utils import to_categorical

### Loading the CIFAR-10 Dataset

In [2]:
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

print("Training set shape:", X_train.shape, y_train.shape)
print("Test set shape:", X_test.shape, y_test.shape)


Training set shape: (50000, 32, 32, 3) (50000, 1)
Test set shape: (10000, 32, 32, 3) (10000, 1)


### Normalizing the Pixel Values
Convert pixel values from the integer range [0, 255] to floats in [0, 1]. Inputs on a small, consistent scale generally make gradient-based training more stable.


In [3]:
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0
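
As a quick sanity check on the scaling step, the following minimal sketch applies the same conversion to a tiny synthetic `uint8` array (a hypothetical stand-in for the CIFAR-10 images) and confirms the resulting dtype and value range.

```python
import numpy as np

# Hypothetical stand-in for image data: uint8 pixels covering the full range.
pixels = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# Same conversion as in the cell above: cast to float32 first, then divide.
scaled = pixels.astype('float32') / 255.0

print(scaled.dtype)                # float32
print(scaled.min(), scaled.max())  # 0.0 1.0
```

Casting before dividing matters: dividing the `uint8` array by `255.0` directly would yield `float64`, which doubles the memory footprint compared with `float32`.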

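### Splitting Data into Training and Validation Sets
Hold out 20% of the original 50,000 training images for validation. Together with the 10,000-image test set that ships with CIFAR-10, the 60,000 images are divided into:
 - 66.7% training set (40,000 images)
 - 16.7% validation set (10,000 images)
 - 16.7% test set (10,000 images)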

In [4]:
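# Hold out 20% of the original training set for validation; random_state fixes the shuffle for reproducibility.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)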

print("Training data shape:", X_train.shape, y_train.shape)
print("Validation data shape:", X_val.shape, y_val.shape)
print("Testing data shape:", X_test.shape, y_test.shape)


Training data shape: (40000, 32, 32, 3) (40000, 1)
Validation data shape: (10000, 32, 32, 3) (10000, 1)
Testing data shape: (10000, 32, 32, 3) (10000, 1)
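
The share of the full dataset in each split follows directly from the shapes printed above. A minimal sketch of that arithmetic, using only the three sample counts:

```python
# Split sizes taken from the shapes printed above.
sizes = {"train": 40000, "validation": 10000, "test": 10000}
total = sum(sizes.values())

# Percentage of the full 60,000-image dataset held by each split.
shares = {name: round(100 * n / total, 1) for name, n in sizes.items()}
print(shares)  # {'train': 66.7, 'validation': 16.7, 'test': 16.7}
```

The split above is a plain random shuffle. If exact class balance across the training and validation sets matters, passing `stratify=y_train` to `train_test_split` is one option.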


### Data Augmentation for Training Set
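Data augmentation increases the diversity of the training data and helps reduce overfitting. The generator is only configured here: the random rotations, shifts, and flips are applied on the fly when batches are drawn during training (for example via `train_datagen.flow(...)`), so the arrays saved at the end of this notebook are not augmented.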

In [5]:
train_datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    fill_mode='nearest'
)
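
To make two of these settings concrete, here is a NumPy-only illustration of what a horizontal flip and a width shift with edge filling do to an image. This is a simplified sketch, not Keras's implementation: the helper names `horizontal_flip` and `shift_width` are hypothetical, and the real generator draws random parameters per image and shifts by fractional amounts with interpolation.

```python
import numpy as np

def horizontal_flip(img):
    """Mirror an (H, W, C) image left to right."""
    return img[:, ::-1, :]

def shift_width(img, dx):
    """Shift an (H, W, C) image by dx whole pixels along the width axis,
    filling the exposed border by repeating the edge pixels."""
    width = img.shape[1]
    pad = abs(dx)
    padded = np.pad(img, ((0, 0), (pad, pad), (0, 0)), mode='edge')
    start = pad - dx
    return padded[:, start:start + width, :]

# Tiny 1x4 single-channel "image" so the effect is easy to read.
img = np.arange(4).reshape(1, 4, 1)

print(horizontal_flip(img).ravel())  # [3 2 1 0]
print(shift_width(img, 1).ravel())   # [0 0 1 2]
print(shift_width(img, -1).ravel())  # [1 2 3 3]
```

With `width_shift_range=0.1` on 32-pixel-wide images, the shifts drawn by the generator stay within roughly 3 pixels in either direction.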

### One-Hot Encode the Labels

In [6]:
num_classes = 10

y_train = to_categorical(y_train, num_classes)
y_val = to_categorical(y_val, num_classes)
y_test = to_categorical(y_test, num_classes)

print("Labels converted to One-Hot Encoding")


Labels converted to One-Hot Encoding
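
For readers unfamiliar with the encoding, the sketch below reproduces the idea in plain NumPy on three hypothetical labels. It is an illustration of the output format, not the internals of `to_categorical`.

```python
import numpy as np

num_classes = 10

# Hypothetical labels in the (N, 1) integer layout that cifar10.load_data() returns.
labels = np.array([[3], [0], [9]])

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype='float32')[labels.ravel()]

print(one_hot.shape)  # (3, 10)
print(one_hot[0])     # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]

# argmax inverts the encoding and recovers the integer labels.
print(one_hot.argmax(axis=1))  # [3 0 9]
```

One-hot labels pair with the `categorical_crossentropy` loss in Keras. Keeping the integer labels and using `sparse_categorical_crossentropy` would be an alternative.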


### Saving Preprocessed Data

In [7]:
# Assumes the notebook is run from a directory one level below the project root.
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
processed_data_dir = os.path.join(processed_data_dir := os.path.join(project_root, "data", "processed"))
train_dir = os.path.join(processed_data_dir, "train")
val_dir = os.path.join(processed_data_dir, "val")
test_dir = os.path.join(processed_data_dir, "test")

os.makedirs(train_dir, exist_ok=True)
os.makedirs(val_dir, exist_ok=True)
os.makedirs(test_dir, exist_ok=True)


np.save(os.path.join(train_dir, 'x_train.npy'), X_train)
np.save(os.path.join(train_dir, 'y_train.npy'), y_train)

np.save(os.path.join(val_dir, 'x_val.npy'), X_val)
np.save(os.path.join(val_dir, 'y_val.npy'), y_val)

np.save(os.path.join(test_dir, 'x_test.npy'), X_test)
np.save(os.path.join(test_dir, 'y_test.npy'), y_test)

print("Data preprocessing complete. Preprocessed files saved in 'data/processed/'.")


Data preprocessing complete. Preprocessed files saved in 'data/processed/'.
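
A training script can read the saved arrays back with `np.load`. The following minimal sketch shows the save and load round trip on small hypothetical arrays in a temporary directory, so it runs without the real dataset or the project directory layout.

```python
import os
import tempfile
import numpy as np

# Hypothetical small arrays standing in for X_train and y_train.
x = np.random.rand(4, 32, 32, 3).astype('float32')
y = np.eye(10, dtype='float32')[[1, 0, 7, 3]]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save with the same file names used above.
    np.save(os.path.join(tmp_dir, 'x_train.npy'), x)
    np.save(os.path.join(tmp_dir, 'y_train.npy'), y)

    # Reload, as a training script might.
    x_loaded = np.load(os.path.join(tmp_dir, 'x_train.npy'))
    y_loaded = np.load(os.path.join(tmp_dir, 'y_train.npy'))

print(x_loaded.shape, x_loaded.dtype)  # (4, 32, 32, 3) float32
print(np.array_equal(x, x_loaded))     # True
```

At full scale, the `float32` training images occupy 40,000 × 32 × 32 × 3 × 4 bytes, about 492 MB on disk, which is four times the size of the original `uint8` data.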
