This kernel shows how training time can be reduced considerably. The approach is demonstrated using competition data from the [APTOS 2019 Blindness Detection challenge](https://www.kaggle.com/c/aptos2019-blindness-detection), but we can easily scale it up to also include the old competition data from 2015 in training (around 40,000 images in total).

## Objective

Most public kernels currently perform image pre-processing dynamically during training. This is memory-efficient, because only the data needed for the current batch is loaded into memory. However, it also increases training time, because pre-processing is repeated for each image in each epoch. For example, for a 5-fold CV with 10 epochs each, we would apply the same pre-processing 50 times to each image. My objective was to move image pre-processing out of the training loop, so that it is applied only once per image.

Note: I define image 'pre-processing' in this kernel as any processing steps that are uniformly applied to each image, such as resizing the image, cropping, or applying Ben's pre-processing. In contrast, data augmentation, such as random flipping or random rotation, is randomly applied to each image in each epoch and therefore has to stay inside the training loop.
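
To make the distinction concrete, here is a minimal sketch that uses NumPy arrays as stand-ins for images. The functions `preprocess` and `augment` and the toy data are hypothetical placeholders, not the functions used later in this kernel. The only point is that the deterministic step runs once per image, while the random step runs in every epoch.

```python
import numpy as np

calls = {"preprocess": 0, "augment": 0}

def preprocess(img):
    """Deterministic step (stand-in for cropping/resizing): same output every time."""
    calls["preprocess"] += 1
    return img[::2, ::2]

def augment(img, rng):
    """Random step (stand-in for random flipping): may differ between epochs."""
    calls["augment"] += 1
    return img[:, ::-1] if rng.random() < 0.5 else img

rng = np.random.default_rng(0)
raw_images = [np.arange(16).reshape(4, 4) for _ in range(3)]  # toy "images"
n_epochs = 10

# pre-process once, outside the training loop
cached = [preprocess(img) for img in raw_images]

# augment inside the training loop
for epoch in range(n_epochs):
    for img in cached:
        batch_item = augment(img, rng)

print(calls)  # pre-processing ran once per image, augmentation once per image per epoch
```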


## Method

First, what didn't work:

* Loading all pre-processed images into memory (as demonstrated in [this excellent kernel](https://www.kaggle.com/xhlulu/aptos-2019-densenet-keras-starter)). This works very well if we only use data from this year’s competition, but I ran into memory issues when training on the old competition data (~40,000 images are too many for the kernel's memory).
* Saving all pre-processed images as .png files and loading them during training. This led to a kernel error saying that too many output files were produced. 

What did work in the end: saving all pre-processed images in a single HDF5 file (see below).


## Speed improvement 

To benchmark the speed improvement of the approach described in this kernel, I used these references:

* Pre-processing is entirely based on [Neuron Engineer's kernel](https://www.kaggle.com/ratthachat/aptos-updated-preprocessing-ben-s-cropping).
* Training is entirely based on [Chanhu's EfficientNet training kernel](https://www.kaggle.com/chanhu/eye-efficientnet-pytorch-lb-0-777), which is currently the best-scoring public kernel. That kernel is used as the reference kernel in the table below.

Speed improvements were substantial: 


| Kernel | Pre-processing duration | Duration of 1 epoch | Total duration (pre-processing + 10 epochs) |
| --- | --- | --- | --- |
| Reference kernel | -- | ~600s | ~6,000s |
| This kernel | ~500s | ~35s | ~850s |

<br>

In another (private) kernel, I trained a model with the same EfficientNet-B0 architecture using old and new data in around 400s per epoch (pre-processing took around 5,000s).

Speed improvements are biggest when using k-fold CV, because the one-off pre-processing cost is shared across all folds. A 5-fold CV of the reference kernel would take around 30,000s. With the approach from this kernel, this can be reduced to around 2,500s.
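
The totals follow from simple arithmetic on the rounded figures in the table. The helper `total_seconds` below is only an illustration and ignores any overhead outside pre-processing and training epochs. With these rounded inputs, the 5-fold estimate for this kernel comes out somewhat below the figure of around 2,500s quoted above.

```python
def total_seconds(preprocessing_s, epoch_s, n_epochs, n_folds=1):
    """One-off pre-processing cost plus the cost of all training epochs."""
    return preprocessing_s + n_folds * n_epochs * epoch_s

# single run, 10 epochs
reference_single = total_seconds(0, 600, n_epochs=10)
this_single = total_seconds(500, 35, n_epochs=10)

# 5-fold CV, 10 epochs per fold; pre-processing is still paid only once
reference_cv = total_seconds(0, 600, n_epochs=10, n_folds=5)
this_cv = total_seconds(500, 35, n_epochs=10, n_folds=5)

print(reference_single, this_single, reference_cv, this_cv)
```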

## Demonstration

The code in this kernel is almost entirely from the reference kernel (https://www.kaggle.com/chanhu/eye-efficientnet-pytorch-lb-0-777). All modifications made by me are commented.

In [None]:
! pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ../input/nvidiaapex/repository/NVIDIA-apex-39e153a

In [None]:
import cv2
import matplotlib.pyplot as plt
from os.path import isfile
import torch.nn.init as init
import torch
import torch.nn as nn
import numpy as np
import pandas as pd 
import os
from PIL import Image, ImageFilter
print(os.listdir("../input"))
from sklearn.model_selection import train_test_split, StratifiedKFold
from torch.utils.data import Dataset
from torchvision import transforms
from torch.optim import Adam, SGD, RMSprop
import time
from torch.autograd import Variable
from tqdm import tqdm
from sklearn import metrics
import urllib
import pickle
import torch.nn.functional as F
from torchvision import models
import seaborn as sns
import random
from apex import amp
import sys

We'll use [h5py](https://www.h5py.org/) as the HDF5 interface and multiprocessing for image pre-processing.

In [None]:
import h5py
from multiprocessing import Pool
from itertools import repeat
from timeit import default_timer as timer

In [None]:
package_path = '../input/efficientnet/efficientnet-pytorch/EfficientNet-PyTorch/'
sys.path.append(package_path)

In [None]:
from efficientnet_pytorch import EfficientNet

In [None]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

In [None]:
num_classes = 1
seed_everything(1234)
lr          = 1e-3
IMG_SIZE    = 256

In [None]:
train      = '../input/aptos2019-blindness-detection/train_images/'
test       = '../input/aptos2019-blindness-detection/test_images/'
train_csv  = pd.read_csv('../input/aptos2019-blindness-detection/train.csv')

train_df, val_df = train_test_split(train_csv, test_size=0.1, random_state=2018, stratify=train_csv.diagnosis)
train_df.reset_index(drop=True, inplace=True)
val_df.reset_index(drop=True, inplace=True)
train_df.head()

In [None]:
#The Code from: https://www.kaggle.com/ratthachat/aptos-updated-albumentation-meets-grad-cam

def crop_image1(img,tol=7):
    # img is image data
    # tol  is tolerance
        
    mask = img>tol
    return img[np.ix_(mask.any(1),mask.any(0))]

def crop_image_from_gray(img,tol=7):
    if img.ndim ==2:
        mask = img>tol
        return img[np.ix_(mask.any(1),mask.any(0))]
    elif img.ndim==3:
        gray_img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
        mask = gray_img>tol
        
        check_shape = img[:,:,0][np.ix_(mask.any(1),mask.any(0))].shape[0]
        if (check_shape == 0): # image is too dark so that we crop out everything,
            return img # return original image
        else:
            img1=img[:,:,0][np.ix_(mask.any(1),mask.any(0))]
            img2=img[:,:,1][np.ix_(mask.any(1),mask.any(0))]
            img3=img[:,:,2][np.ix_(mask.any(1),mask.any(0))]
    #         print(img1.shape,img2.shape,img3.shape)
            img = np.stack([img1,img2,img3],axis=-1)
    #         print(img.shape)
        return img
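
The cropping above relies on a boolean mask combined with `np.ix_`. A minimal sketch on a made-up 5x6 array shows the idea: rows and columns that contain no pixel brighter than the tolerance are dropped.

```python
import numpy as np

img = np.zeros((5, 6), dtype=np.uint8)
img[1:3, 2:5] = 200            # bright region: rows 1-2, columns 2-4

tol = 7
mask = img > tol
rows = mask.any(axis=1)        # True for rows containing at least one bright pixel
cols = mask.any(axis=0)        # True for columns containing at least one bright pixel
cropped = img[np.ix_(rows, cols)]

print(cropped.shape)  # (2, 3)
```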

In [None]:
n_cores = 2
h5_file_name = 'train_images.h5' # name of the hdf5 file used throughout the notebook

Pre-processing from https://www.kaggle.com/ratthachat/aptos-updated-preprocessing-ben-s-cropping :

In [None]:
def crop_and_resize_images(image_name, image_folder, target_size):
    image_path = os.path.join(image_folder, image_name + '.png')

    image = cv2.imread(image_path)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = crop_image_from_gray(image)
    image = cv2.resize(image, (target_size, target_size))
    image=cv2.addWeighted(image, 4, cv2.GaussianBlur(image, (0,0), 10), -4, 128)

    return image
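
The last step of this function, `cv2.addWeighted(image, 4, blurred, -4, 128)`, computes `4 * image - 4 * blurred + 128` and saturates the result to the range 0 to 255. The sketch below emulates only this arithmetic in NumPy. The function `weighted_difference` is a hypothetical stand-in that takes the blurred image as an input instead of computing a Gaussian blur, so it illustrates the formula rather than reproducing the OpenCV call exactly.

```python
import numpy as np

def weighted_difference(image, blurred, weight=4, offset=128):
    """NumPy stand-in for cv2.addWeighted(image, weight, blurred, -weight, offset)."""
    result = weight * image.astype(np.float64) - weight * blurred.astype(np.float64) + offset
    return np.clip(np.rint(result), 0, 255).astype(np.uint8)

# a flat region equals its own blur, so it maps to the mid-grey offset
flat = np.full((4, 4), 90, dtype=np.uint8)
flat_result = weighted_difference(flat, flat)

# a pixel that is 10 brighter than its blurred surroundings is pushed 40 above mid-grey
image = np.array([[100, 110]], dtype=np.uint8)
blurred = np.array([[100, 100]], dtype=np.uint8)
detail_result = weighted_difference(image, blurred)

print(flat_result[0, 0], detail_result.tolist())
```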

In the following function, we

1. create an empty hdf5 file,
2. pre-process images using multiprocessing, and
3. save the pre-processed images in the hdf5 file.

Comments on 1:
* We create an empty dataset inside the hdf5 file with compression='gzip'. Compression can be removed if we only want to train on images from the current competition. However, it is necessary if we also want to train on images from 2015; without compression, the resulting hdf5 file might become too big for the Kaggle kernel and throw an error.
* Choosing a chunk size that matches the access pattern is crucial. During testing, I used h5py's 'auto' option to select a chunk size. As a result, one training epoch took around 25 min (instead of 35s in the current version of this kernel). As our Dataloader only ever accesses one image at a time, we use a chunk size that corresponds to exactly one image. This means that each image is compressed separately inside the hdf5 file, so reading one image only requires decompressing one chunk.

Comments on 2:
* We pre-process images in batches of 1000 to prevent memory issues if we train on images from 2015. If we only want to train on images from the current competition, we could process all images in a single batch.

Comments on 3:
* We add a dataset with image names to the hdf5 file. This is later used as a lookup table when retrieving images from the hdf5 file.
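
The chunking choice from the comments on step 1 can be inspected on a toy file. The sketch below assumes a small, made-up image size of 64 and keeps the file in memory via `driver='core'`; the dataset names `per_image` and `auto` are hypothetical. It creates one dataset with one-image chunks and one with h5py's automatically guessed chunks, so that the two chunk shapes can be compared.

```python
import h5py
import numpy as np

size = 64  # small hypothetical image size to keep the example light

# driver='core' with backing_store=False keeps the file in memory only
with h5py.File('chunk_demo.h5', 'w', driver='core', backing_store=False) as f:
    per_image = f.create_dataset(
        'per_image', shape=(0, size, size, 3), maxshape=(None, size, size, 3),
        chunks=(1, size, size, 3), compression='gzip', compression_opts=2, dtype=np.uint8)
    auto = f.create_dataset(
        'auto', shape=(0, size, size, 3), maxshape=(None, size, size, 3),
        chunks=True, compression='gzip', compression_opts=2, dtype=np.uint8)

    # append a batch of random "images" by growing the first axis
    batch = np.random.randint(0, 256, size=(10, size, size, 3), dtype=np.uint8)
    per_image.resize(per_image.shape[0] + batch.shape[0], axis=0)
    per_image[-batch.shape[0]:] = batch

    per_image_chunks = per_image.chunks
    auto_chunks = auto.chunks
    first_image = per_image[0]   # touches exactly one chunk of this dataset

print(per_image_chunks, auto_chunks, first_image.shape)
```

With one-image chunks, the chunk shape equals the shape of a single stored image. The automatically guessed shape depends on h5py's heuristics and generally does not align with image boundaries.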






In [None]:
def create_hdf5_file(image_names, image_folder, target_size, h5_file_name):
   
    # create hdf5 file and datasets
    hdf5_file = h5py.File(h5_file_name, 'w', libver='latest')
    
    # compression 'gzip' and adequate chunk-size seems to be the best compromise between file size and access speed
    hdf5_file.create_dataset('images', (0, target_size, target_size, 3), maxshape=(None, target_size, target_size, 3), compression='gzip',
                             compression_opts=2, chunks=(1, target_size, target_size, 3), dtype=np.uint8)
    hdf5_file.close()

    # create batches of images to prevent holding all images into memory at the same time
    batch_size = 1000
    n_batches = int(np.ceil(len(image_names) / batch_size))  # np.int was removed in recent NumPy versions

    image_i = 0
    for i in range(n_batches):
        
        start_time = timer()

        batch_start_image_idx = image_i
        batch_stop_image_idx = image_i + batch_size

        image_i = batch_stop_image_idx

        if batch_stop_image_idx > len(image_names):
            batch_stop_image_idx = len(image_names)
            
        print(f'batch {i+1}/{n_batches}, processing images {batch_start_image_idx} to {batch_stop_image_idx - 1}')

        image_names_in_batch = image_names.iloc[batch_start_image_idx:batch_stop_image_idx]

        with Pool(n_cores) as pool:
            images_in_batch = pool.starmap(crop_and_resize_images, 
                                           zip(image_names_in_batch, 
                                               repeat(image_folder), 
                                               repeat(target_size)))
        
        images_in_batch = np.asarray(images_in_batch)

        time_elapsed = timer() - start_time
        print(f'batch processing finished in {(time_elapsed // 60):.0f}m {(time_elapsed % 60):.0f}s, inserting data into h5 file')

        with h5py.File(h5_file_name, 'a') as hdf5_file:
            # dynamically increase the size of the dataset:
            hdf5_file['images'].resize((hdf5_file['images'].shape[0] + images_in_batch.shape[0]), axis=0)
            
            hdf5_file['images'][-images_in_batch.shape[0]:] = images_in_batch

    # add dataset with image names to hdf5 file 
    with h5py.File(h5_file_name, 'a') as hdf5_file:
        assert hdf5_file['images'].shape[0] == len(image_names)
        hdf5_file.create_dataset('id_code', data=image_names.values.astype('S'))
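
The batch bookkeeping in the loop above boils down to splitting the range of image indices into consecutive slices, with a shorter last slice. A minimal sketch, using a hypothetical helper `batch_ranges` and a made-up count of 2,500 images:

```python
import math

def batch_ranges(n_items, batch_size):
    """Yield (start, stop) index pairs that cover range(n_items) in order."""
    n_batches = math.ceil(n_items / batch_size)
    for i in range(n_batches):
        start = i * batch_size
        stop = min(start + batch_size, n_items)  # the last batch may be shorter
        yield start, stop

ranges = list(batch_ranges(2500, 1000))
print(ranges)  # [(0, 1000), (1000, 2000), (2000, 2500)]
```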

In [None]:
%%time
create_hdf5_file(image_names=train_csv['id_code'], image_folder=train, target_size=IMG_SIZE, h5_file_name=h5_file_name)

In [None]:
print(f'Size of h5 file: {(os.path.getsize(h5_file_name) / float(1 << 20)):.0f} MB')

The following function returns a single image from the hdf5 file:

In [None]:
def open_image_from_hdf5_file(image_name):
    with h5py.File(h5_file_name, 'r') as hdf5_file:
        image_names_in_h5 = np.array(hdf5_file.get('id_code'))
        image_idx_in_h5 = np.where(image_names_in_h5 == image_name.encode())[0][0]
        image = hdf5_file['images'][image_idx_in_h5]
        image = transforms.ToPILImage()(image)
        return image

In [None]:
def p_show(imgs, label_name=None, per_row=3):
    n = len(imgs)
    rows = (n + per_row - 1)//per_row
    cols = min(per_row, n)
    fig, axes = plt.subplots(rows,cols, figsize=(15,15))
    for ax in axes.flatten(): ax.axis('off')
    for i,(p, ax) in enumerate(zip(imgs, axes.flatten())): 
        img = open_image_from_hdf5_file(p)
        ax.imshow(img)
        ax.set_title(train_df[train_df.id_code == p].diagnosis.values)

In [None]:
imgs = []
for p in train_df.id_code:
    imgs.append(p)
    if len(imgs) == 16: break
p_show(imgs)

The following dataset is modified to work with our hdf5 file. One thing to note is that the image dataset in the hdf5 file is not opened in `__init__`, but lazily on the first call to `__getitem__`; the open handle is then reused for all later calls. This seems to considerably improve access speed. More information can be found here: https://discuss.pytorch.org/t/dataloader-when-num-worker-0-there-is-bug/25643/16

In [None]:
class MyDatasetHDF5(Dataset):
    
    def __init__(self, dataframe, h5_file_name, transform=None):
        self.df = dataframe
        self.transform = transform
        self.h5_file_name = h5_file_name
        
        with h5py.File(h5_file_name, 'r') as hdf5_file:
            self.image_names_in_h5 = np.array(hdf5_file.get('id_code')) 
            
        self.h5_dataset = None
    
    def __len__(self):
        return len(self.df)
    
    def __getitem__(self, idx):
        
        label = self.df.diagnosis.values[idx]
        label = np.expand_dims(label, -1)
        
        image_name = self.df.id_code.values[idx]
        
        # here, we get the index of the image in the hdf5 file:
        image_idx_in_h5 = np.where(self.image_names_in_h5 == image_name.encode())[0][0] 
        
        if self.h5_dataset is None:
            self.h5_dataset = h5py.File(self.h5_file_name, 'r')['images']
        
        image = self.h5_dataset[image_idx_in_h5]
        image = transforms.ToPILImage()(image)
        
        if self.transform:
            image = self.transform(image)
        
        return image, label
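
The `np.where` lookup in `__getitem__` scans all image names on every call. One possible refinement, which is not part of this kernel, would be to build a name-to-index dictionary once and use it for the lookup. The sketch below uses made-up ids and only checks that both lookups return the same index.

```python
import numpy as np

# made-up ids, stored as byte strings like the 'id_code' dataset in the hdf5 file
image_names_in_h5 = np.array(['img_a', 'img_b', 'img_c']).astype('S')

# built once, e.g. in __init__
name_to_idx = {name.decode(): i for i, name in enumerate(image_names_in_h5)}

image_name = 'img_b'
idx_dict = name_to_idx[image_name]                                            # dictionary lookup
idx_where = np.where(image_names_in_h5 == image_name.encode())[0][0]          # lookup used above

print(idx_dict, idx_where)
```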

In [None]:
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation((-120, 120)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

trainset     = MyDatasetHDF5(train_df, h5_file_name=h5_file_name, transform =train_transform)
train_loader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=0)
valset       = MyDatasetHDF5(val_df, h5_file_name=h5_file_name, transform   =train_transform)
val_loader   = torch.utils.data.DataLoader(valset, batch_size=32, shuffle=False, num_workers=0)

Note: Setting the number of workers in the Dataloader to anything other than 0 results in an error.

The rest of the code is left exactly as it is in the reference kernel. 

In [None]:
model = EfficientNet.from_name('efficientnet-b0')
model.load_state_dict(torch.load('../input/efficientnet-pytorch/efficientnet-b0-08094119.pth'))
in_features = model._fc.in_features
model._fc = nn.Linear(in_features, num_classes)
model.cuda()

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)
criterion = nn.MSELoss()
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1",verbosity=0)

In [None]:
def train_model(epoch):
    model.train() 
        
    avg_loss = 0.
    optimizer.zero_grad()
    for idx, (imgs, labels) in enumerate(train_loader):
        imgs_train, labels_train = imgs.cuda(), labels.float().cuda()
        output_train = model(imgs_train)
        loss = criterion(output_train,labels_train)
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step() 
        optimizer.zero_grad() 
        avg_loss += loss.item() / len(train_loader)
        
    return avg_loss

def test_model():
    
    avg_val_loss = 0.
    model.eval()
    with torch.no_grad():
        for idx, (imgs, labels) in enumerate(val_loader):
            imgs_vaild, labels_vaild = imgs.cuda(), labels.float().cuda()
            output_test = model(imgs_vaild)
            avg_val_loss += criterion(output_test, labels_vaild).item() / len(val_loader)
        
    return avg_val_loss

In [None]:
best_avg_loss = 100.0
n_epochs      = 10

for epoch in range(n_epochs):
    
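    # modified: read the current lr from the optimizer; scheduler.get_lr() can report a wrong value outside step() in newer PyTorch versions
    print('lr:', optimizer.param_groups[0]['lr'])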
    start_time   = time.time()
    avg_loss     = train_model(epoch)
    avg_val_loss = test_model()
    elapsed_time = time.time() - start_time 
    print('Epoch {}/{} \t loss={:.4f} \t val_loss={:.4f} \t time={:.2f}s'.format(
        epoch + 1, n_epochs, avg_loss, avg_val_loss, elapsed_time))
    
    if avg_val_loss < best_avg_loss:
        best_avg_loss = avg_val_loss
        torch.save(model.state_dict(), 'weight_best.pt')
    
    scheduler.step()

## Further improvements and ideas

* The resulting hdf5 file can be used as an input dataset in subsequent kernels. This leads to a further improvement in training speed, because no pre-processing has to be performed in the training kernel at all.
* Pre-processing could be moved to a dedicated CPU kernel, where we could use 4 CPUs for pre-processing (instead of the 2 available CPUs in the GPU kernels).
* Using multiple workers in the Dataloader could further improve training speed. According to [this discussion](https://discuss.pytorch.org/t/dataloader-when-num-worker-0-there-is-bug/25643/16), it should be possible in theory to do so. 
* For inference, the approach from this kernel is not relevant if we submit only one model. However, for ensembling, this approach could lead to a large speed boost if multiple models are based on identical pre-processing. 
