# 01 - Data Exploration

**Goal:** Load and understand the CIFAR-10 dataset.

## Overview
In this notebook, we will:
1. Load the CIFAR-10 dataset
2. Explore the data structure and basic statistics
3. Visualize class distribution
4. Examine sample images

### Setup and Imports

In [4]:
import sys
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Add the parent directory to the import path so the `src` package resolves
sys.path.append('../')
from src.data.cifar10 import load_cifar10, visualize_samples, get_class_distribution

### Load CIFAR-10 Dataset

We'll load the full CIFAR-10 dataset with:
- 50,000 training images (split into 45,000 training and 5,000 validation)
- 10,000 testing images
- 10 classes, balanced in the original dataset (6,000 images per class: 5,000 training and 1,000 testing)

The `load_cifar10()` function:
- Normalizes pixel values to the range [0, 1]
- Preserves the class balance across the train/validation split
- Flattens labels from (N, 1) to (N,)
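
The three steps above can be sketched on synthetic data as follows. This is only an illustration and not the actual code in `src.data.cifar10`: the function name `prepare_cifar_like`, its `val_fraction` and `seed` parameters, and the per-class (stratified) split are all assumptions made for the example.

```python
import numpy as np


def prepare_cifar_like(images, labels, val_fraction=0.1, seed=0):
    """Hypothetical stand-in for the preprocessing described above."""
    # Scale uint8 pixels from [0, 255] to [0, 1]
    images = images.astype(np.float64) / 255.0
    # Flatten labels from (N, 1) to (N,)
    labels = labels.reshape(-1)

    # Assumed strategy: hold out the same fraction of every class
    rng = np.random.default_rng(seed)
    val_parts = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        rng.shuffle(cls_idx)
        n_val = int(round(len(cls_idx) * val_fraction))
        val_parts.append(cls_idx[:n_val])
    val_idx = np.concatenate(val_parts)

    # Everything not selected for validation stays in training
    train_mask = np.ones(len(labels), dtype=bool)
    train_mask[val_idx] = False
    return (images[train_mask], labels[train_mask]), (images[val_idx], labels[val_idx])


# Synthetic stand-in data: 1,000 random images, 100 per class, labels shaped (N, 1)
data_rng = np.random.default_rng(42)
fake_images = data_rng.integers(0, 256, size=(1000, 32, 32, 3), dtype=np.uint8)
fake_labels = np.repeat(np.arange(10, dtype=np.uint8), 100).reshape(-1, 1)

(train_x, train_y), (val_x, val_y) = prepare_cifar_like(fake_images, fake_labels)
val_counts = np.bincount(val_y.astype(np.int64), minlength=10)
print(train_x.shape, val_x.shape)
print(val_counts)
```

Splitting within each class keeps every class equally represented in the validation set, which a plain random split only achieves approximately.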

In [8]:
# Load the full dataset; 10% of the training data is held out for validation

(
    (training_images, training_labels),
    (validation_images, validation_labels),
    (testing_images, testing_labels),
    class_names,
) = load_cifar10()

print("\nDataset loading completed")


Final splits:
Training:   45000 samples
Validation: 5000 samples
Testing:    10000 samples

Dataset loading completed


### Data Structure Overview

Let's examine the shape and overall structure of our dataset:

In [None]:
# Shapes
print("\nDataset Shapes:")
print(f"Training:   {training_images.shape} | Labels: {training_labels.shape}")
print(f"Validation: {validation_images.shape} | Labels: {validation_labels.shape}")
print(f"Testing:    {testing_images.shape} | Labels: {testing_labels.shape}")

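# Data types
print("\nData Types:")
print(f"Images: {training_images.dtype}")
print(f"Labels: {training_labels.dtype}")

# Ranges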
print("\nValue Ranges:")
print(f"Image pixels: [{training_images.min():.3f}, {training_images.max():.3f}]")
print(f"Label range:  [{training_labels.min()}, {training_labels.max()}]")

# Classes
print("\nClasses:")
for i, name in enumerate(class_names):
    print(f"{i}: {name}")


Dataset Shapes:
Training:   (45000, 32, 32, 3) | Labels: (45000,)
Validation: (5000, 32, 32, 3) | Labels: (5000,)
Testing:    (10000, 32, 32, 3) | Labels: (10000,)

Data Types:
Images: float64
Labels: uint8

Value Ranges:
Image pixels: [0.000, 1.000]
Label range:  [0, 9]

Classes:
0: airplane
1: automobile
2: bird
3: cat
4: deer
5: dog
6: frog
7: horse
8: ship
9: truck
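
The output above shows the images stored as `float64`, which makes the 45,000-image training array roughly 1.1 GB in memory. The sketch below estimates the footprint for a few dtypes from the shape alone, without allocating anything; the helper `array_nbytes` is a hypothetical name introduced for this example.

```python
import numpy as np

# Shape of the training images as reported above
train_shape = (45000, 32, 32, 3)


def array_nbytes(shape, dtype):
    """Bytes needed for a dense array of the given shape and dtype."""
    n_elements = int(np.prod(shape, dtype=np.int64))
    return n_elements * np.dtype(dtype).itemsize


for dtype in (np.float64, np.float32, np.uint8):
    size_mib = array_nbytes(train_shape, dtype) / 2**20
    print(f"{np.dtype(dtype).name:>8}: {size_mib:8.1f} MiB")
```

If memory is a concern, casting with `training_images.astype(np.float32)` halves the footprint while keeping the [0, 1] range.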
