# Lesson 16: Datasets and DataLoaders

In the last lesson, we built a simple model from scratch. But to train any serious model, we need data. A lot of it.

Today, we focus on the most important part of the machine learning pipeline: **the data**. We will learn how to load, prepare, and visualize a real-world dataset. We will **not** be building a model in this lesson; our only goal is to understand the data.

## 1. What are Datasets and DataLoaders?

These two classes are the core of PyTorch's data pipeline.

1.  **`torch.utils.data.Dataset`**
    * Think of this as a **textbook**. It's the complete collection of your data (e.g., all 70,000 MNIST images and their labels), and it knows how to return any single item. Indexing starts at 0, so `dataset[123]` returns the 124th image and its label.

2.  **`torch.utils.data.DataLoader`**
    * Think of this as the **teacher** who uses the textbook. You don't read the entire textbook at once; the teacher gives you one chapter or one page at a time. 
    * The `DataLoader` wraps the `Dataset` and serves you the data in small, manageable **batches** (e.g., 64 images at a time). It also handles shuffling the data (so the model can't rely on the order of the examples) and can load data in parallel using multiple worker processes.
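
To make the textbook/teacher analogy concrete, here is a minimal sketch of a custom `Dataset` wrapped in a `DataLoader`. `ToyDigits` and its random "images" are made up for illustration (they are not MNIST and not part of `data_loader.py`); only `Dataset`, `DataLoader`, and the `__len__`/`__getitem__` protocol come from PyTorch.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDigits(Dataset):
    """Hypothetical stand-in dataset: random 28x28 'images' with labels 0-9."""

    def __init__(self, num_items=1000, seed=0):
        generator = torch.Generator().manual_seed(seed)
        self.images = torch.rand(num_items, 1, 28, 28, generator=generator)
        self.labels = torch.randint(0, 10, (num_items,), generator=generator)

    def __len__(self):
        # How many items the "textbook" contains.
        return len(self.labels)

    def __getitem__(self, index):
        # Return one (image, label) pair; indexing starts at 0.
        return self.images[index], self.labels[index]


dataset = ToyDigits()
image, label = dataset[123]  # the item at index 123
print(image.shape, label.item())

# The "teacher": hands out shuffled batches of 64 items.
loader = DataLoader(dataset, batch_size=64, shuffle=True)
batch_images, batch_labels = next(iter(loader))
print(batch_images.shape)  # torch.Size([64, 1, 28, 28])
print(batch_labels.shape)  # torch.Size([64])
print(len(loader))         # 16 batches: 15 full ones plus a final batch of 40
```

Any object that implements these two methods can be handed to a `DataLoader`, which is why the same batching code works for very different kinds of data.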

We have put all this logic into a separate file, `data_loader.py`, so we can reuse it easily.

## 2. The MNIST Dataset

We will use the **MNIST** dataset. It's the "Hello, World!" of deep learning.

* **What it is:** A collection of 70,000 grayscale images (28x28 pixels each) of handwritten digits (0 through 9), distributed as 60,000 training images and 10,000 test images.
* **Goal:** To eventually train a model that can look at one of these images and correctly classify which digit it is.

### Data Format & Efficiency

You might ask: "Why don't we just have 70,000 PNG files in a folder?"

The dataset is stored in a few large, compressed binary files. This is *much* more efficient. Imagine if your dataset was 10 million images. If they were all tiny, individual `.png` files, your computer's hard drive would spend most of its time just *finding* and *opening* files, not reading them. This is called an **I/O (Input/Output) bottleneck**.

By storing the data in a few large binary "blobs", the program can read many images in one continuous operation, which is typically far faster than opening each image as a separate file.

Luckily for us, the `torchvision` library handles all the downloading and parsing of this binary format automatically.
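
For the curious, the sketch below shows roughly what that parsing involves. MNIST's image files use the IDX layout: a 16-byte header (a magic number, then the image count, rows, and columns, each a big-endian 32-bit integer) followed by one unsigned byte per pixel. The helper names `build_idx_images` and `parse_idx_images` are hypothetical, and the example round-trips two tiny 2x2 "images" in memory rather than reading the real files; it is not a description of how `torchvision` is implemented internally.

```python
import gzip
import struct


def build_idx_images(images):
    """Pack a list of equally sized 2D pixel grids into IDX bytes."""
    rows, cols = len(images[0]), len(images[0][0])
    # Header: magic number 2051 (0x00000803), then count, rows, cols,
    # each stored as a big-endian unsigned 32-bit integer.
    header = struct.pack(">IIII", 2051, len(images), rows, cols)
    pixels = bytes(p for image in images for row in image for p in row)
    return header + pixels


def parse_idx_images(blob):
    """Read every image out of an IDX byte string."""
    magic, count, rows, cols = struct.unpack(">IIII", blob[:16])
    if magic != 2051:
        raise ValueError(f"not an IDX image file (magic={magic})")
    size = rows * cols
    images = []
    for i in range(count):
        start = 16 + i * size          # pixel data begins after the header
        flat = blob[start:start + size]
        images.append([list(flat[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images


tiny = [[[0, 255], [128, 64]], [[1, 2], [3, 4]]]    # two 2x2 "images"
compressed = gzip.compress(build_idx_images(tiny))  # MNIST files are gzipped
restored = parse_idx_images(gzip.decompress(compressed))
print(restored == tiny)
```

Because every image occupies the same number of bytes, image `i` can be located with simple arithmetic instead of a file-system lookup.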

### Why Split Data into Train, Validation, and Test?

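This is one of the most important concepts in all of machine learning. We *never* evaluate a model on the same data it was trained on, because a model that has simply memorized its training examples would score well without having learned anything general. Think of a student studying for an exam: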

1.  **Training Set (The Homework):**
    * This is the largest part of the data (e.g., ~54,000 images).
    * The model looks at this data and its labels, and **learns by updating its parameters**.
    * This is the *only* data the model ever learns from.

2.  **Validation Set (The Mock/Practice Exam):**
    * A smaller part of the data, held back from the model (e.g., ~6,000 images).
    * The model **does not learn** from this. It only makes predictions.
    * **We (the humans)** watch the model's score on this set *during* training to see whether it is actually learning or just memorizing. If the validation score stops improving while the training score keeps rising, the model has started to overfit and we should stop training.

3.  **Test Set (The Final Exam):**
    * A completely separate set of data (e.g., 10,000 images).
    * The model never sees this data during training, and we do not use it to make any decisions while developing the model. We run the model on this set **only once** at the very end.
    * This gives us the final, unbiased score of how well our model will perform in the real world on new, unseen data.
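
A minimal sketch of how such a split can be made, using only the standard library. MNIST already ships with a fixed 60,000-image training file and a 10,000-image test file, so only the validation set has to be carved out of the training file. The 6,000-image validation size and the seed are assumptions chosen to match the approximate numbers above; `data_loader.py` may do this differently.

```python
import random

NUM_TRAIN_FILE = 60_000   # images in MNIST's official training file
NUM_TEST_FILE = 10_000    # images in MNIST's official test file
VAL_SIZE = 6_000          # assumption: hold back 10% of the training file

rng = random.Random(0)    # fixed seed so the split is reproducible
indices = list(range(NUM_TRAIN_FILE))
rng.shuffle(indices)

val_indices = indices[:VAL_SIZE]     # the "practice exam"
train_indices = indices[VAL_SIZE:]   # the "homework"

print(len(train_indices), len(val_indices), NUM_TEST_FILE)
# No image may appear in both the training and validation sets.
print(set(train_indices).isdisjoint(val_indices))
```

Fixing the seed matters: if the split changed on every run, images used for validation in one run could leak into training in the next, and scores would no longer be comparable.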

## 3. Loading and Visualizing the Data

Now, let's use our `data_loader.py` file to load the data and see what it actually looks like.

In [None]:
import torch
import matplotlib.pyplot as plt
from data_loader import get_mnist_loaders

# This will download the data (if not already present) and prepare the loaders
# We set batch_size=64, meaning our DataLoader will give us 64 images at a time.
train_loader, val_loader, test_loader = get_mnist_loaders(batch_size=64)

### Inspecting a Batch

Our `train_loader` is a Python iterable. Let's get one batch from it to see its properties.

In [None]:
# Get one batch of data
data_iter = iter(train_loader)
images, labels = next(data_iter)

print(f"Shape of one batch of images: {images.shape}")
print(f"Shape of one batch of labels: {labels.shape}")

# The image batch has shape [64, 1, 28, 28], which means:
# [batch_size, color_channels, height, width]
# There is only 1 color channel because the images are grayscale.
# The label batch has shape [64]: one digit label per image.
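
A quick back-of-the-envelope check on how many batches the loader produces per pass over the data. The 54,000 figure is the approximate training-set size mentioned earlier and is an assumption about `data_loader.py`; the calculation also assumes the `DataLoader` default of keeping the final, smaller batch (`drop_last=False`).

```python
import math

train_size = 54_000   # assumption: approximate training-set size quoted earlier
batch_size = 64

# The last batch is smaller when train_size is not a multiple of batch_size.
num_batches = math.ceil(train_size / batch_size)
last_batch = train_size - (num_batches - 1) * batch_size

print(num_batches)  # 844
print(last_batch)   # 48
```

If those assumptions hold, `len(train_loader)` should report the same number of batches.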

### Plotting a Single Digit

Let's plot the very first image from the batch we just loaded.

In [None]:
first_image = images[0]
first_label = labels[0]

print(f"Shape of one image: {first_image.shape}")

plt.figure(figsize=(6, 6))
# We use .squeeze() to remove the '1' color channel dimension (from [1, 28, 28] to [28, 28]), 
# so matplotlib can plot the 2D image.
plt.imshow(first_image.squeeze(), cmap='gray')
plt.title(f"This image has the label: {first_label.item()}")
plt.show()

## Next Steps

We have successfully:
1.  Understood `Datasets` and `DataLoaders`.
2.  Created a reusable `data_loader.py` file to handle all data preparation.
3.  Visualized our input data to confirm it's correct.

In the next lesson, we will import this *exact same* `data_loader.py` file and use the `train_loader` to train our first neural network.