# CV Data Loading Benchmarking

## Section 1: Preparing the Benchmark

This notebook prepares a computer vision data loading benchmark that:
- Creates a data loader for a given dataset path
- Iterates over the full dataset for multiple epochs (5 by default)
- Reports data loading times so that different data loading solutions can be compared

In [3]:
import time
import torch

import torchvision.transforms as transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader

In [4]:
batch_size = 32
num_workers = 4
num_epochs = 5

In [5]:
mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)

transform = transforms.Compose(
    [
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean, std),
    ]
)

In [6]:
def benchmark_data_loading(dataset):
    """Iterate over `dataset` for `num_epochs` epochs and print the elapsed times."""
    data_loader = DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=False,
        num_workers=num_workers,
    )
    start_time = time.perf_counter()
    for epoch in range(num_epochs):
        epoch_start = time.perf_counter()
        # Only loading is measured: each batch is fetched and then discarded.
        for _, _ in data_loader:
            pass
        epoch_end = time.perf_counter()
        print(f"Epoch {epoch}: {epoch_end - epoch_start:0.4f} seconds")
    end_time = time.perf_counter()
    elapsed_time = end_time - start_time
    print(f"Data loading in {elapsed_time:0.4f} seconds")
    print(f"num_epochs: {num_epochs} | batch_size: {batch_size} | num_workers: {num_workers} | time: {elapsed_time:0.4f}")

## Section 2: Run benchmark against Alluxio dataset

Run the benchmark against the Alluxio dataset.

In [8]:
alluxio_data_path = "/mnt/alluxio/data/imagenet-mini/train"
alluxio_dataset = ImageFolder(root=alluxio_data_path, transform=transform)
benchmark_data_loading(alluxio_dataset)

Epoch 0: 35.9072 seconds
Epoch 1: 34.8164 seconds
Epoch 2: 35.5212 seconds
Epoch 3: 34.5897 seconds
Epoch 4: 34.9106 seconds
Data loading in 175.7465 seconds
[Summary] Experiment: DataLoaderBenchmark
num_epochs: 5 | batch_size: 32 | num_workers: 4 | time: 175.7465
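
The per-epoch times printed above can be condensed into a few summary statistics, which makes runs against different storage backends easier to compare. The following is a minimal sketch, not part of the original notebook: the helper name `summarize_epochs` and its optional `num_samples` parameter are hypothetical, and the epoch times are copied from the Alluxio output above.

```python
import statistics


def summarize_epochs(epoch_seconds, num_samples=None):
    """Return mean, stdev, min and max of per-epoch times in seconds.

    If num_samples (the dataset size) is given, also report the average
    number of samples loaded per second.
    """
    mean = statistics.mean(epoch_seconds)
    summary = {
        "mean_s": mean,
        # stdev needs at least two data points
        "stdev_s": statistics.stdev(epoch_seconds) if len(epoch_seconds) > 1 else 0.0,
        "min_s": min(epoch_seconds),
        "max_s": max(epoch_seconds),
    }
    if num_samples is not None:
        summary["samples_per_s"] = num_samples / mean
    return summary


# Epoch times copied from the Alluxio run above.
alluxio_epochs = [35.9072, 34.8164, 35.5212, 34.5897, 34.9106]
summary = summarize_epochs(alluxio_epochs)
print(f"mean: {summary['mean_s']:.4f} s | stdev: {summary['stdev_s']:.4f} s")
```

Passing `len(dataset)` as `num_samples` would additionally give a throughput figure, which is useful when the compared datasets differ in size.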


## Section 3: Run benchmark against S3 dataset via S3FS-FUSE

[S3FS-FUSE](https://github.com/s3fs-fuse/s3fs-fuse) is a widely used FUSE-based solution that exposes an S3 bucket as a local folder.
In this benchmark, S3FS-FUSE is mounted with local metadata and data caching enabled.

In [None]:
s3fs_data_path = "/mnt/s3fs/data/imagenet-mini/train"
s3fs_dataset = ImageFolder(root=s3fs_data_path, transform=transform)
benchmark_data_loading(s3fs_dataset)

## Section 4: Run benchmark against S3 dataset via boto3 S3 Python API

When using boto3, the S3 Python API, users need to modify their training scripts and explicitly define the dataset behavior.

In [None]:
from s3_dataset import S3ImageDataset

s3_dataset = S3ImageDataset("s3-bucket", "imagenet-mini/train", transform)
benchmark_data_loading(s3_dataset)
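
The `s3_dataset` module imported above is not shown in this notebook. To illustrate what "explicitly defining dataset behavior" can involve, here is a minimal sketch of a map-style dataset backed by boto3. It is not the actual `S3ImageDataset` implementation: the class name `S3ImageFolderSketch`, the injectable `client` parameter, the extension filter, and the assumed key layout `<prefix>/<class>/<image>` (mirroring `ImageFolder`) are all assumptions made for illustration. Credentials are assumed to come from the default boto3 configuration.

```python
import io


class S3ImageFolderSketch:
    """Hypothetical map-style dataset reading images from S3 via boto3.

    Any object with __len__ and __getitem__ can be passed to DataLoader.
    Assumes keys are laid out as "<prefix>/<class>/<image>".
    """

    EXTENSIONS = (".jpg", ".jpeg", ".png")

    def __init__(self, bucket, prefix, transform=None, client=None):
        self.bucket = bucket
        self.prefix = prefix.rstrip("/") + "/"
        self.transform = transform
        self._client = client  # created lazily on first use if not supplied
        keys = self._list_keys()
        classes = sorted({self._class_of(k) for k in keys})
        self.class_to_idx = {name: i for i, name in enumerate(classes)}
        self.samples = [(k, self.class_to_idx[self._class_of(k)]) for k in keys]

    def _get_client(self):
        if self._client is None:
            import boto3  # imported lazily; requires the boto3 package

            self._client = boto3.client("s3")
        return self._client

    def _class_of(self, key):
        # "<prefix><class>/<file>" -> "<class>"
        return key[len(self.prefix):].split("/", 1)[0]

    def _list_keys(self):
        # list_objects_v2 returns at most 1000 keys per call, so paginate.
        paginator = self._get_client().get_paginator("list_objects_v2")
        keys = []
        for page in paginator.paginate(Bucket=self.bucket, Prefix=self.prefix):
            for obj in page.get("Contents", []):
                key = obj["Key"]
                relative = key[len(self.prefix):]
                # Keep only image files that sit inside a class folder.
                if "/" in relative and key.lower().endswith(self.EXTENSIONS):
                    keys.append(key)
        return sorted(keys)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        from PIL import Image  # imported lazily; requires Pillow

        key, label = self.samples[index]
        response = self._get_client().get_object(Bucket=self.bucket, Key=key)
        image = Image.open(io.BytesIO(response["Body"].read())).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, label
```

Under these assumptions, an instance such as `S3ImageFolderSketch("s3-bucket", "imagenet-mini/train", transform)` could be passed to `benchmark_data_loading` in the same way as the other datasets. Every `__getitem__` call issues one `get_object` request, so each image is fetched from S3 individually and no local caching takes place.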