# Benchmarking the performance of Pelican, part 2

## Dataset
[ImageNet](https://www.kaggle.com/c/imagenet-object-localization-challenge/overview)

Use this [script](https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh) to prepare the data first, then train on it. The training function in this notebook loads VGG16 (`models.vgg16`).


<br>


To download the dataset from our Pelican origin, run:

```
pelican object get pelican://osg-htc.org/chtc/PUBLIC/hzhao292/<filename> <destination>
```

Available files and their sizes:


| File | Size | Description |
| :----------- | :-----------: | :---------|
| ImageNet.zip | 156G | Zipped version of the full ImageNet dataset |
| ImageNet | 161G | Decompressed version of ImageNet.zip; data is in `/train` and `/val` |
| ImageNetMini.tgz | 1.5G | Compressed archive of the smaller version, a subset of ImageNet |
| ImageNetMini | 1.5G | Folder containing the smaller ImageNet subset; classified images are in `/train` and `/val` |





## Hardware
GPU:                   NVIDIA Tesla V100

RAM:                   256G

Architecture:          x86_64

CPU op-mode(s):        32-bit, 64-bit

Byte Order:            Little Endian

CPU(s):                40


In [1]:
import os
import time
import torch
import argparse
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp
from torch.utils.data import Dataset, DataLoader

from torchvision import models
from torchvision.datasets import VisionDataset
from torchvision.io import read_image
import torchvision.transforms as transforms

import fsspec
from pelicanfs.core import PelicanFileSystem
from fsspec.implementations.local import LocalFileSystem
from fsspec.implementations.cached import WholeFileCacheFileSystem

from PIL import Image
import warnings
import zipfile
from remote_image_folder import RemoteImageFolder

warnings.filterwarnings("ignore")
mp.set_start_method('spawn', force=True)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In this notebook, we use `dev_trainfile_path` and `dev_valfile_path`. They point to the smaller version of the ImageNet dataset, which is 1.5G in total.

To test the full dataset (161G decompressed), pass `trainfile_path` and `valfile_path` to `RemoteImageFolder` instead.

## Define paths and transforms

In [2]:
# Local data paths
local_trainfile_path = "ImageNetMini/train"
local_valfile_path = "ImageNetMini/val"

# Define the Pelican paths
trainfile_path = "/chtc/PUBLIC/hzhao292/ImageNet/train"
valfile_path = "/chtc/PUBLIC/hzhao292/ImageNet/val"

dev_trainfile_path = "/chtc/PUBLIC/hzhao292/ImageNetMini/train"
dev_valfile_path = "/chtc/PUBLIC/hzhao292/ImageNetMini/val"

# Define the transforms.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),  # Convert the PIL image to a tensor before Normalize
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

val_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),  # Convert the PIL image to a tensor before Normalize
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])



`RemoteImageFolder` is a class that inherits from `VisionDataset`. It works like torchvision's `ImageFolder`, but it also accepts a remote data source. It remains compatible with local paths.
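
To make the idea concrete, here is a minimal sketch of an `ImageFolder`-style dataset that reads through a filesystem object. It is not the `remote_image_folder` module used in this notebook: `SketchImageFolder` and `LocalFS` are hypothetical names, the class is a plain map-style dataset rather than a `VisionDataset` subclass, and the default loader returns raw bytes instead of decoding an image. It assumes only that `fs` offers fsspec-style `ls(path, detail=False)` and `open(path, "rb")`, and that every entry under `root` is a class directory.

```python
import os
import posixpath


class LocalFS:
    """Stand-in exposing the two fsspec-style calls used below (ls, open)."""

    def ls(self, path, detail=False):
        base = path.rstrip("/")
        return sorted(base + "/" + name for name in os.listdir(path))

    def open(self, path, mode="rb"):
        return open(path, mode)


class SketchImageFolder:
    """Map-style dataset over <root>/<class_name>/<file> on any fs object."""

    def __init__(self, root, fs=None, transform=None, loader=None):
        self.fs = fs if fs is not None else LocalFS()
        self.transform = transform
        # Default loader returns raw bytes; pass a decoding loader for images.
        self.loader = loader if loader is not None else (lambda f: f.read())

        # Assumption: every entry directly under root is a class directory.
        class_dirs = sorted(self.fs.ls(root, detail=False))
        self.classes = [posixpath.basename(d.rstrip("/")) for d in class_dirs]
        self.class_to_idx = {name: i for i, name in enumerate(self.classes)}

        # Build the (path, label) index once, up front.
        self.samples = []
        for class_dir, name in zip(class_dirs, self.classes):
            label = self.class_to_idx[name]
            for path in sorted(self.fs.ls(class_dir, detail=False)):
                self.samples.append((path, label))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        with self.fs.open(path, "rb") as f:
            sample = self.loader(f)
        if self.transform is not None:
            sample = self.transform(sample)
        return sample, label
```

With a remote filesystem object passed as `fs`, the same indexing code would list and open remote paths, which is the property the cells in this notebook rely on when they pass `fs=fs`.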

## Training function

In [3]:
def training(train_loader, val_loader):

    model = models.vgg16(pretrained=True)

    model = model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    num_epochs = 5

    print("Training started.")
    for epoch in range(num_epochs):
        start_time = time.time()
        # Training phase
        model.train()
        running_loss = 0.0
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item() * inputs.size(0)

        epoch_loss = running_loss / len(train_loader.dataset)
        print(f"Epoch {epoch+1}/{num_epochs}, Training Loss: {epoch_loss:.4f}")

        # Validation phase
        model.eval()
        running_loss = 0.0
        correct = 0
        total = 0
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                running_loss += loss.item() * inputs.size(0)
                _, predicted = torch.max(outputs, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

        epoch_loss = running_loss / len(val_loader.dataset)
        accuracy = correct / total
        end_time = time.time()
        time_taken = end_time - start_time
        print(f"Epoch {epoch+1}/{num_epochs}, Validation Loss: {epoch_loss:.2f}, Accuracy: {accuracy:.2f}, Time Taken: {time_taken:.2f} seconds")
    print("Training completed.")

# Benchmarking
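
Each benchmark cell in this section measures elapsed time with paired `time.time()` calls. As an optional alternative, the same measurement can be written as a small context manager. This is a sketch only: `timed` and `Timer` are hypothetical helpers that are not part of the notebook, and the sketch uses `time.perf_counter()` rather than `time.time()`.

```python
import time
from contextlib import contextmanager


class Timer:
    """Holds the elapsed time once the timed block has finished."""
    elapsed = 0.0


@contextmanager
def timed(label):
    """Print and record how long the enclosed block took, in seconds."""
    timer = Timer()
    start = time.perf_counter()
    try:
        yield timer
    finally:
        timer.elapsed = time.perf_counter() - start
        print(f"{label}: {timer.elapsed:.4f} s")
```

A cell could then wrap its dataset and loader construction in `with timed("Data preparing time"):` instead of computing `end_time - start_time` by hand.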

## Training with data read locally

In [4]:
print()
print("Read Locally.")

start_time = time.time()

train_dataset = RemoteImageFolder(root=local_trainfile_path, transform=train_transforms)
val_dataset = RemoteImageFolder(root=local_valfile_path, transform=val_transforms)

# Create the dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=2, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=True, num_workers=2, pin_memory=True)

end_time = time.time()
print(f"Data preparing time: {end_time-start_time:4f}.")
training(train_loader, val_loader)


Read Locally.
Data preparing time: 0.036054.


Training started.


Epoch 1/5, Training Loss: 2.3163


Epoch 1/5, Validation Loss: 1.76, Accuracy: 0.37, Time Taken: 69.63 seconds


Epoch 2/5, Training Loss: 1.8953


Epoch 2/5, Validation Loss: 1.69, Accuracy: 0.39, Time Taken: 69.31 seconds


Epoch 3/5, Training Loss: 1.7722


Epoch 3/5, Validation Loss: 1.53, Accuracy: 0.44, Time Taken: 69.18 seconds


Epoch 4/5, Training Loss: 1.6996


Epoch 4/5, Validation Loss: 1.62, Accuracy: 0.42, Time Taken: 69.93 seconds


Epoch 5/5, Training Loss: 1.6504


Epoch 5/5, Validation Loss: 1.50, Accuracy: 0.49, Time Taken: 69.68 seconds
Training completed.


## Training with data read remotely using pelicanfs

In [5]:
print("Read data remotely from Pelican")

start_time = time.time()

fs = PelicanFileSystem("pelican://osg-htc.org")

# Load the datasets
train_dataset = RemoteImageFolder(root=dev_trainfile_path,fs=fs,transform=train_transforms)
val_dataset = RemoteImageFolder(root=dev_valfile_path,fs=fs,transform=val_transforms)

# Create the dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=2, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=True, num_workers=2, pin_memory=True)

end_time = time.time()
print(f"Data preparing time: {end_time-start_time:4f}.")

training(train_loader, val_loader)

Read data remotely from Pelican


Data preparing time: 0.939501.


Training started.


Epoch 1/5, Training Loss: 2.4811


Epoch 1/5, Validation Loss: 1.84, Accuracy: 0.31, Time Taken: 316.74 seconds


Epoch 2/5, Training Loss: 1.8827


Epoch 2/5, Validation Loss: 1.74, Accuracy: 0.37, Time Taken: 95.79 seconds


Epoch 3/5, Training Loss: 1.8245


Epoch 3/5, Validation Loss: 1.90, Accuracy: 0.32, Time Taken: 91.96 seconds


Epoch 4/5, Training Loss: 1.7541


Epoch 4/5, Validation Loss: 1.55, Accuracy: 0.45, Time Taken: 89.68 seconds


Epoch 5/5, Training Loss: 1.6650


Epoch 5/5, Validation Loss: 1.46, Accuracy: 0.51, Time Taken: 89.63 seconds
Training completed.


## Training with data read remotely using pelicanfs, with a local cache

In [6]:

print("Read data remotely from Pelican with local Cache")

start_time = time.time()
# Load the datasets
fs = fsspec.filesystem("filecache", target_protocol='osdf', cache_storage='tmp/files/')
train_dataset = RemoteImageFolder(root=dev_trainfile_path, fs=fs, transform=train_transforms)
val_dataset = RemoteImageFolder(root=dev_valfile_path, fs=fs, transform=val_transforms)

# Create the dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=2, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=True, num_workers=2, pin_memory=True)
end_time = time.time()
print(f"Data preparing time: {end_time-start_time:4f}.")

training(train_loader, val_loader)

Read data remotely from Pelican with local Cache


Data preparing time: 1.024198.


Training started.


Epoch 1/5, Training Loss: 2.3736


Epoch 1/5, Validation Loss: 2.02, Accuracy: 0.27, Time Taken: 2076.08 seconds


Epoch 2/5, Training Loss: 1.8901


Epoch 2/5, Validation Loss: 1.55, Accuracy: 0.45, Time Taken: 74.39 seconds


Epoch 3/5, Training Loss: 1.7297


Epoch 3/5, Validation Loss: 1.47, Accuracy: 0.49, Time Taken: 69.88 seconds


Epoch 4/5, Training Loss: 1.6441


Epoch 4/5, Validation Loss: 1.43, Accuracy: 0.51, Time Taken: 69.89 seconds


Epoch 5/5, Training Loss: 1.5515


Epoch 5/5, Validation Loss: 1.22, Accuracy: 0.59, Time Taken: 69.93 seconds
Training completed.
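
The timings in this run (a first epoch of 2076.08 seconds, then roughly 70 seconds per epoch) are consistent with whole-file caching: each file is fetched once and later reads come from local disk. The sketch below illustrates that idea in isolation. It is a simplified stand-in, not fsspec's `filecache` implementation; `WholeFileCache` and the `fetch` callable are hypothetical names introduced for illustration.

```python
import hashlib
import os


class WholeFileCache:
    """Fetch each remote path once, then serve it from a local directory."""

    def __init__(self, fetch, cache_dir):
        self.fetch = fetch  # hypothetical callable: remote path -> bytes
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _local_path(self, remote_path):
        # Hash the remote path so any path maps to a flat, safe file name.
        digest = hashlib.sha256(remote_path.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, digest)

    def read(self, remote_path):
        local = self._local_path(remote_path)
        if not os.path.exists(local):
            # Cache miss: pay the remote cost once, then publish atomically.
            data = self.fetch(remote_path)
            tmp = local + ".part"
            with open(tmp, "wb") as f:
                f.write(data)
            os.replace(tmp, local)
        # Cache hit (or freshly filled entry): read from local disk.
        with open(local, "rb") as f:
            return f.read()
```

Under this model, the first pass over the dataset pays the remote fetch for every file, and subsequent passes only touch local storage.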


## Downloading the zip file with pelicanfs first, then unzipping and training locally

In [7]:
print("Downloading zip file from pelican first, extract and train on it.")
time1 = time.time()
print("Downloading ImageNetMini.zip")
fs = PelicanFileSystem("pelican://osg-htc.org")
fs.get("/chtc/PUBLIC/hzhao292/ImageNetMini.zip","./")
time2 = time.time()
print(f"  - Time used: {time2-time1:2f}.")


print("Extracting ImageNetMini.zip")
file = zipfile.ZipFile('ImageNetMini.zip')
file.extractall('./data')
time3 = time.time()
print(f"  - Time used: {time3-time2:2f}.")

trainfile_path = "./data/ImageNetMini/train/"
valfile_path = "./data/ImageNetMini/val/"

# Load the datasets
train_dataset = RemoteImageFolder(root=trainfile_path, transform=train_transforms)
val_dataset = RemoteImageFolder(root=valfile_path, transform=val_transforms)

# Create the dataloaders
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=2, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=True, num_workers=2, pin_memory=True)
time4 = time.time()
print(f"Data preparing time: {time4-time3:2f}.")

training(train_loader, val_loader)

Downloading zip file from pelican first, extract and train on it.
Downloading ImageNetMini.zip


  - Time used: 19.973815.
Extracting ImageNetMini.zip


  - Time used: 7.157655.
Data preparing time: 0.046287.


Training started.


Epoch 1/5, Training Loss: 2.3754


Epoch 1/5, Validation Loss: 1.77, Accuracy: 0.35, Time Taken: 69.38 seconds


Epoch 2/5, Training Loss: 1.8147


Epoch 2/5, Validation Loss: 1.68, Accuracy: 0.37, Time Taken: 69.67 seconds


Epoch 3/5, Training Loss: 1.7315


Epoch 3/5, Validation Loss: 1.45, Accuracy: 0.49, Time Taken: 69.56 seconds


Epoch 4/5, Training Loss: 1.6722


Epoch 4/5, Validation Loss: 1.42, Accuracy: 0.51, Time Taken: 69.33 seconds


Epoch 5/5, Training Loss: 1.5247


Epoch 5/5, Validation Loss: 1.27, Accuracy: 0.58, Time Taken: 69.76 seconds
Training completed.
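
To compare the four configurations at a glance, the per-epoch times printed above can be totaled. The snippet below copies those figures from the outputs in this notebook; the `summarize` helper and the dictionary keys are names chosen for illustration. The sub-second to one-second "Data preparing time" values are left out, and the download and extraction times are counted as one-off setup cost for the last configuration.

```python
# Epoch times in seconds, copied from the outputs in this notebook.
epoch_times = {
    "local": [69.63, 69.31, 69.18, 69.93, 69.68],
    "pelicanfs": [316.74, 95.79, 91.96, 89.68, 89.63],
    "pelicanfs + filecache": [2076.08, 74.39, 69.88, 69.89, 69.93],
    "download + unzip": [69.38, 69.67, 69.56, 69.33, 69.76],
}

# One-off costs outside the epoch loop: download time plus extraction time.
setup_times = {"download + unzip": 19.973815 + 7.157655}


def summarize(epoch_times, setup_times):
    """Return {name: (total_seconds, first_epoch, mean_of_later_epochs)}."""
    summary = {}
    for name, times in epoch_times.items():
        total = sum(times) + setup_times.get(name, 0.0)
        later = times[1:]
        summary[name] = (total, times[0], sum(later) / len(later))
    return summary


summary = summarize(epoch_times, setup_times)
for name, (total, first, later_mean) in summary.items():
    print(f"{name:24s} total={total:8.2f}s  first={first:8.2f}s  later_mean={later_mean:6.2f}s")
```

Over five epochs, the totals come to about 347.7 s for local reads, 683.8 s for pelicanfs, 2360.2 s for pelicanfs with the file cache, and 374.8 s for download and unzip (including about 27.1 s of setup). Each figure comes from a single run, so treat the comparison as indicative rather than conclusive.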
