### 1. Getting a dataset
For demonstration purposes, we will use the FashionMNIST dataset, which is available directly through torchvision. It contains 28×28 greyscale images of 10 types of clothing, split into 60,000 training images and 10,000 test images.

In [None]:
import torch
from torchvision import datasets
from torchvision.transforms import ToTensor

# Set up the training data
training_data = datasets.FashionMNIST(
    root="data",  # directory where the dataset is stored
    train=True,  # True selects the training split, False the test split
    download=True,  # download the data if it is not already present in root
    transform=ToTensor(),  # convert each PIL image to a float tensor scaled to [0, 1]
)

# Set up the testing data
test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)





In [None]:
# Get a sample image and label from the training data.
# image is a tensor holding the pixel data; label is an int giving the index
# of the category this image belongs to (bag, coat, dress, etc.).
image, label = training_data[0]

print(f"Shape of the image tensor: {image.shape}")
print(f"Label of the image: {label}")

##### Input and output shapes of a computer vision model
As the output above shows, the shape of the image tensor is [1, 28, 28], that is [color_channels=1, height=28, width=28]. The input and output shapes of a model depend on the task and on the model you are using. Image tensors are generally laid out as **CHW** (Color Channels, Height, Width) or **HWC** (Height, Width, Color Channels). Later we will see shapes like **NCHW** or **NHWC**, where **N** is the batch size.
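
To make the layouts concrete, here is a minimal sketch of moving between them. It uses a random tensor as a stand-in for a FashionMNIST sample (an assumption, so that it runs without downloading anything); the variable names are illustrative only.

```python
import torch

# Stand-in for one FashionMNIST sample: 1 color channel, 28x28 pixels (CHW).
image_chw = torch.rand(1, 28, 28)

# CHW -> HWC: move the channel axis to the end.
image_hwc = image_chw.permute(1, 2, 0)

# CHW -> NCHW: add a leading batch dimension of size 1.
batch_nchw = image_chw.unsqueeze(0)

# NCHW -> NHWC: same reordering, keeping the batch axis first.
batch_nhwc = batch_nchw.permute(0, 2, 3, 1)

print(image_chw.shape)   # torch.Size([1, 28, 28])
print(image_hwc.shape)   # torch.Size([28, 28, 1])
print(batch_nchw.shape)  # torch.Size([1, 1, 28, 28])
print(batch_nhwc.shape)  # torch.Size([1, 28, 28, 1])
```

PyTorch layers such as `torch.nn.Conv2d` expect NCHW input, while matplotlib's `imshow` expects HW or HWC, which is why plotting code typically calls `squeeze()` or `permute()` first.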

In [None]:
print(f"Training images: {len(training_data.data)}")
print(f"Training targets: {len(training_data.targets)}")

print(f"Test images: {len(test_data.data)}")
print(f"Test targets: {len(test_data.targets)}")

print(f"Class names: {training_data.classes}")

In [None]:
import matplotlib.pyplot as plt

labels_map = training_data.classes  # Map of label indices to label names

# Create a figure with a specified size
figure = plt.figure(figsize=(8, 8))
cols, rows = 3, 3  # Define the number of columns and rows for the grid

# Loop through to create a grid of images
for i in range(1, cols * rows + 1):
    # Get a random sample index
    sample_idx = torch.randint(len(training_data), size=(1,)).item()
    # Get the image and label at the random index
    img, label = training_data[sample_idx]
    # Add a subplot in the correct grid position
    figure.add_subplot(rows, cols, i)
    # Set the title of the subplot to the label name
    plt.title(labels_map[label])
    # Turn off the axis
    plt.axis("off")
    # Display the image, removing singleton dimensions and setting the color map to gray
    plt.imshow(img.squeeze(), cmap="gray")

# Show the figure with all the subplots
plt.show()

### Dataloaders
Now that we have a dataset, the next step is to create a **DataLoader**.

What is a DataLoader? It is a PyTorch class that wraps a dataset and provides an efficient, flexible way to iterate over it in batches when training and evaluating models.

Key functions and benefits of DataLoader:

- **Batch processing:** DataLoader divides the dataset into batches. Processing the entire dataset at once is often too expensive in compute and memory, so batching keeps each step within the memory limits of your hardware.
- **Shuffling:** It can reshuffle the dataset at every epoch, so the model does not see the samples in the same order each time. This keeps the model from picking up patterns in the ordering of the data and tends to improve generalization.
- **Parallel data loading:** It can load data in parallel using multiple worker subprocesses (`num_workers`), which speeds up loading, especially for large datasets.
- **Data transformations:** Transforms attached to the dataset (normalization, augmentation, etc.) are applied on the fly as each sample is loaded, so preprocessing happens while batches are being assembled rather than in a separate pass.
- **Custom batching:** You can pass a custom collate function (`collate_fn`) to control how individual samples are combined into a batch. This is useful when batches need special handling, e.g. images of varying sizes or sequences of different lengths.
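
The custom batching point can be illustrated with a small sketch. It assumes a hypothetical toy dataset of variable-length sequences (not part of FashionMNIST) and a helper named `pad_collate`, which is our own illustrative name; `collate_fn` and `pad_sequence` are existing PyTorch APIs.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Hypothetical toy dataset: (sequence, label) pairs with different sequence lengths.
samples = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4, 5]), 1),
    (torch.tensor([6]), 0),
    (torch.tensor([7, 8, 9, 10]), 1),
]

def pad_collate(batch):
    """Pad every sequence in the batch to the length of the longest one."""
    sequences, labels = zip(*batch)
    padded = pad_sequence(list(sequences), batch_first=True, padding_value=0)
    return padded, torch.tensor(labels)

# shuffle=False keeps the batch contents predictable for this demonstration.
loader = DataLoader(samples, batch_size=2, shuffle=False, collate_fn=pad_collate)

batches = list(loader)
for padded, labels in batches:
    print(padded.shape, labels)
# torch.Size([2, 3]) tensor([0, 1])
# torch.Size([2, 4]) tensor([0, 1])
```

Without `collate_fn`, the default collation would try to stack the sequences directly and fail because their lengths differ. FashionMNIST images all share one shape, so the default works there.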


In [None]:
from torch.utils.data import DataLoader

# the batch size we will use
batch_size = 16

# Create dataloaders for training and testing.
# Shuffling matters for training only; evaluation does not depend on sample order.
train_dataloader = DataLoader(dataset=training_data, batch_size=batch_size, shuffle=True, num_workers=2)
test_dataloader = DataLoader(dataset=test_data, batch_size=batch_size, shuffle=False, num_workers=2)

- dataset: The dataset from which to load the data.
- batch_size: The number of samples per batch to load.
- shuffle: Whether to shuffle the data at every epoch.
- num_workers: How many subprocesses to use for data loading. 0 means that the data will be loaded in the main process.
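
As a rough illustration of how these parameters interact, the sketch below uses a small synthetic dataset (an assumption, chosen so that it runs without downloading FashionMNIST). It also uses `drop_last`, a DataLoader parameter not covered in the list above, to show what happens to an incomplete final batch.

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for an image dataset: 50 greyscale 28x28 "images" with labels 0-9.
features = torch.rand(50, 1, 28, 28)
labels = torch.randint(0, 10, (50,))
toy_data = TensorDataset(features, labels)

# num_workers=0 loads data in the main process, which is simplest for a small demo.
loader = DataLoader(toy_data, batch_size=16, shuffle=True, num_workers=0)

# The number of batches is ceil(dataset_size / batch_size); the last batch may be smaller.
expected_batches = math.ceil(len(toy_data) / 16)
print(len(loader), expected_batches)  # 4 4

batch_sizes = [batch_features.shape[0] for batch_features, _ in loader]
print(batch_sizes)  # [16, 16, 16, 2]

# drop_last=True discards the incomplete final batch.
loader_full = DataLoader(toy_data, batch_size=16, shuffle=True, drop_last=True)
print(len(loader_full))  # 3
```

With `shuffle=True` the samples in each batch change from epoch to epoch, but the batch sizes stay the same.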

In [None]:
import math

# get the images and the labels from the dataloader.
train_features, train_labels = next(iter(train_dataloader))

print(f"Feature batch shape: {train_features.size()}")
print(f"Labels batch shape: {train_labels.size()}")

# Initialize a figure for plotting
figure = plt.figure(figsize=(8, 8))

# Calculate the number of rows and columns for the grid
cols = int(math.sqrt(batch_size))
rows = math.ceil(batch_size / cols)

# Plot each image in the batch (a batch can hold fewer than batch_size samples)
for i in range(len(train_features)):
    figure.add_subplot(rows, cols, i + 1)
    plt.axis("off")
    plt.imshow(train_features[i].squeeze(), cmap="gray")

# Show the figure
plt.show()
    
    