# Exercises on Data Pipeline in PyTorch

In this exercise, you will build a data pipeline for a pet breed classification task using the [_Oxford-IIIT Pet Dataset_](https://www.robots.ox.ac.uk/~vgg/data/pets/) and practice handling complex data scenarios. You will learn how to build a robust and efficient pipeline that copes with data accessibility and quality issues: data augmentation makes the model more reliable, and error handling keeps a single bad image from crashing training.

## Importing Packages

In [None]:
# Import packages
...


## Data Ingestion

The _Oxford-IIIT Pet Dataset_ is a 37-category pet dataset (12 cat breeds and 25 dog breeds, ~800 MB) with roughly 200 images per class. The images vary widely in scale, pose, and lighting. Every image has an associated ground-truth annotation of breed, head ROI, and pixel-level trimap segmentation. To keep things simple, use only the breed label from the ground-truth data. After downloading and extracting the dataset archives, you will have a folder of JPEG images named like `<breed_name>_1.jpg`.

**Downloading Data Files**

In [None]:
image_url = "https://thor.robots.ox.ac.uk/~vgg/data/pets/images.tar.gz"
annotations_url = "https://thor.robots.ox.ac.uk/~vgg/data/pets/annotations.tar.gz"

# Creates the dataset directory if it does not exist
dataset_folder = "./datasets/oxford-iiit-pet"
...

# Downloads image file only if it does not exist
...

# Downloads annotations file only if it does not exist
...
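
One way to implement the "download only if missing" step is sketched below. The helper name `download_if_missing` is hypothetical and not part of the exercise; it relies only on the standard library and assumes the destination path includes the file name.

```python
import os
import urllib.request


def download_if_missing(url, dest_path):
    """Download `url` to `dest_path` unless the file already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(dest_path):
        return False
    # Create the parent directory if needed
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, dest_path)
    return True


# Example usage (left commented out because it downloads a large archive):
# download_if_missing(image_url, os.path.join(dataset_folder, "images.tar.gz"))
```

Returning a boolean makes it easy to log whether the cached file was reused.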


**Decompressing Data Files**

In [2]:
# Decompresses images file
...

# Decompress annotations file
...
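
A minimal sketch of the decompression step, assuming the archives are gzip-compressed tar files (as the `.tar.gz` extension suggests). The helper name `extract_archive` is hypothetical.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into `dest_dir` and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Newer Python versions support extraction filters that reject unsafe
        # members; fall back to plain extraction where they are unavailable.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return names


# Example usage (assumes the archive was downloaded into dataset_folder):
# extract_archive(os.path.join(dataset_folder, "images.tar.gz"), dataset_folder)
```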

In [None]:
# Loads class-ids (labels) from list.txt in the 'annotations' folder
# Note: Open list.txt to understand the format. In addition to the class-id, it also contains information
# on the species (cat or dog) and the breed (e.g. Abyssinian, Bengal, etc.). To keep things simple, you should
# use only the class-id [1 through 37] as the label for the model to predict.
...
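
The label-loading step might look like the sketch below. It assumes each non-comment line of `list.txt` holds four whitespace-separated fields (image name, class-id, species, breed id) and that comment lines start with `#`; check the actual file before relying on this. The function name and the example lines are illustrative only.

```python
def parse_annotation_lines(lines):
    """Return (image_name, zero_based_class_id) pairs from list.txt-style lines."""
    samples = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and header comments
        if not line or line.startswith("#"):
            continue
        image_name, class_id, _species, _breed_id = line.split()
        # Shift class-ids from 1..37 to 0..36, as PyTorch loss functions expect
        samples.append((image_name, int(class_id) - 1))
    return samples


# Illustrative lines in the assumed format
example_lines = [
    "#Image CLASS-ID SPECIES BREED ID",
    "Abyssinian_100 1 1 1",
    "yorkshire_terrier_99 37 2 25",
]
print(parse_annotation_lines(example_lines))
# [('Abyssinian_100', 0), ('yorkshire_terrier_99', 36)]
```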

## Exploratory Data Analysis (EDA)

In [None]:
# Plots a few random images in a figure to visualize them
# (Visualizes those sample images in a grid format)
...

In [None]:
# Prints the sizes of sample images to confirm that the image sizes vary
...

In [None]:
# Takes any one sample image and checks the min. and max. pixel values for each channel
# This will be used later for the normalization step.
...

In [None]:
# Prints the minimum and maximum class-id values
# This confirms whether the class-ids need to be shifted to start from 0
...

## Data Preparation

Handle image augmentation (training only), resizing, format conversion, and normalization using PyTorch's transformation pipeline.

**Image Transformations**

In [None]:
# Compose transformations for
# 1) resizing,
# 2) center cropping, and
# 3) normalization

In [None]:
# Also compose transformations for the training set, with data augmentation
# in addition to the three transformations above that are applied to the validation/test sets
...

In [None]:
# Applies the composed transformations above to sample images and checks that all
# the transformations are actually applied to those images
...

**Building Data Pipeline**

Build a data pipeline to
- access image files and pair them with their labels (class-ids),
- get the images into the right format, size, data type, and structure for the model to learn from them, and finally
- load data in batches for efficient learning.

Use PyTorch's `Dataset` and `DataLoader` classes as the primary tools for this task.

In [None]:
# Defines a custom `Dataset` class for the Oxford-IIIT Pet dataset so that PyTorch can handle the data as required

class OxfordPetDataset(Dataset):
    """Represents Oxford-IIIT Pet Dataset that follows lazy loading pattern."""

    def __init__(self, root_dir, transform=None):
        self.root_dir = ...
        self.images_dir = ...
        self.transform = ...

        # Prepares a list of paths for all images (so the Dataset can load them later on the fly)
        ...

        # Loads labels from list.txt in the 'annotations' folder
        # (Also adjusts labels to start from 0)
        ...

        # A list to hold statistics on the individual images being accessed
        ...

        # A list to keep track of errors during loading or transformation
        ...

    def __len__(self):
        """Returns the total number of samples in the dataset."""
        ...
        
    def __getitem__(self, idx):
        """Retrieves the image and label at the specified index."""
        
        try:
            # Prepare the image file name based on the index
            ...

            # Load image from disk
            ...

            # Checks the file for corruption (integrity check)
            # (You might need to re-open the image, as the integrity check can leave it unusable)
            ...

            # Skips images that are too small (as they may break the transformations)
            ...

            # Converts grayscale images to RGB
            ...

            # Applies transformations if any
            ...

            # Records statistics on the image
            ...

            # Return the image and label
            ...

        except Exception as e:
            # Logs the error
            ...
            # Print the error message
            ...
            # Keeps pipeline moving even when files are broken

    def get_error_summary(self):
        """Provides error summary to inspect which images had problems during loading or transformation."""

        # Shows errors only for the first few images if the list is too long
        ...
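
The loading steps inside `__getitem__` (integrity check, size check, RGB conversion) can be sketched as a standalone helper using Pillow. The function name `load_rgb_image` and the `min_size=32` threshold are assumptions made for illustration. Note that `verify()` only catches problems the format plugin checks for, so decoding errors are caught as well.

```python
from PIL import Image


def load_rgb_image(path, min_size=32):
    """Return an RGB PIL image, or None if the file is unreadable or too small."""
    try:
        # verify() leaves the image object unusable afterwards,
        # so the file is opened a second time for actual decoding.
        with Image.open(path) as probe:
            probe.verify()
        with Image.open(path) as img:
            # Skip images whose shorter side is below the threshold
            if min(img.size) < min_size:
                return None
            # convert() decodes the pixels and returns an independent RGB copy
            return img.convert("RGB")
    except Exception as exc:
        # Report the problem and keep going instead of crashing
        print(f"Skipping {path}: {exc}")
        return None
```

Returning `None` for bad files keeps the pipeline moving, but the code that batches samples then has to filter those entries out.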

In [None]:
# Creates an instance of the custom Dataset class for the full dataset
dataset = OxfordPetDataset(
    root_dir=...,
    transform=...
)

**Splitting Data**

Split the full dataset into training, validation, and test sets.

You will use the training set to train the model, the validation set to check the model's performance during training and to tune hyperparameters, and the test set for the final check of model performance.

In [None]:
# Randomly splits the full dataset into train, validation, and test sets with a 70:15:15 ratio.

train_set, val_set, test_set = ...

# Prints the length of each split
...
...
...
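
A 70:15:15 split can be sketched with `torch.utils.data.random_split`. The example uses a 100-sample `TensorDataset` as a stand-in for the real dataset, and the seed value is arbitrary. Integer arithmetic guarantees that the three lengths add up to the dataset size.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in dataset with 100 fake samples
full_dataset = TensorDataset(torch.arange(100))

n_total = len(full_dataset)
n_train = n_total * 70 // 100
n_val = n_total * 15 // 100
n_test = n_total - n_train - n_val  # remainder goes to the test set

# A seeded generator makes the split reproducible across runs
generator = torch.Generator().manual_seed(42)
train_set, val_set, test_set = random_split(
    full_dataset, [n_train, n_val, n_test], generator=generator
)

print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```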

**Batching Data**

Use `DataLoader` to load the data efficiently in batches. A batch size of 32 or 64 is a reasonable choice. Shuffle only the training set; the validation and test sets are used for evaluation, not training, so shuffling them has no benefit.

In [None]:
# Creates DataLoaders for each split of the data
train_set_loader = DataLoader(
    dataset = ..., 
    batch_size = ..., 
    shuffle = ...
)

val_set_loader = DataLoader(
    dataset = ..., 
    batch_size = ..., 
    shuffle = ...
)

test_set_loader = DataLoader(
    dataset = ..., 
    batch_size = ..., 
    shuffle = ...
)

In [None]:
# Performs a test run to ensure the DataLoader works as expected
# by fetching a single batch from the training set and printing the batch shapes of the images and their labels.
...
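
If `__getitem__` returns `None` for a broken image, the default batching logic fails on that entry. One possible workaround, sketched below, is a custom `collate_fn` that drops `None` samples before batching. `FlakyDataset` and `skip_none_collate` are hypothetical names, and the dataset simulates a bad image at every fifth index.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.dataloader import default_collate


class FlakyDataset(Dataset):
    """Hypothetical stand-in that returns None for every 5th index."""

    def __len__(self):
        return 20

    def __getitem__(self, idx):
        if idx % 5 == 0:
            return None  # mimics an image that failed to load
        return torch.zeros(3, 224, 224), idx % 37


def skip_none_collate(batch):
    """Drop failed samples, then batch the rest with the default collate."""
    batch = [sample for sample in batch if sample is not None]
    if not batch:
        return None  # the caller should skip a fully empty batch
    return default_collate(batch)


loader = DataLoader(
    FlakyDataset(), batch_size=8, shuffle=False, collate_fn=skip_none_collate
)

# Fetch one batch and inspect its shapes
images, labels = next(iter(loader))
print(images.shape, labels.shape)
# torch.Size([6, 3, 224, 224]) torch.Size([6])
```

The first batch covers indices 0 to 7, of which 0 and 5 are dropped, so batches can come out smaller than `batch_size`.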