<a href="https://colab.research.google.com/github/Asciotti/neural-sar/blob/master/99_1_speedup_image_loading.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Experiment to speed up data loading

Right now the model takes ~20-30 minutes to train on 256x256 images, starting with 32 filters on the front and end CNN. I noticed that image loading was extremely slow: up to 1-2 seconds per batch of ~32 images. Given that 32 random images are loaded every step, with 32 steps per epoch and 20 epochs, loading alone accounts for roughly 10-20 minutes of training time.
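The estimate above can be checked with a quick calculation. This is only a back-of-the-envelope sketch using the figures quoted in the paragraph; the variable names are illustrative.

```python
# Back-of-the-envelope estimate of time spent loading images during training.
steps_per_epoch = 32
epochs = 20
seconds_per_batch = (1.0, 2.0)  # observed range per batch of ~32 images

total_batches = steps_per_epoch * epochs  # 640 batches over the whole run
low_min, high_min = (total_batches * s / 60 for s in seconds_per_batch)
print(f"{total_batches} batches -> {low_min:.1f} to {high_min:.1f} minutes of loading")
# prints: 640 batches -> 10.7 to 21.3 minutes of loading
```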

In [1]:
import numpy as np
from skimage.util import random_noise
from skimage import io
import os
import importlib
from PIL import Image
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
from skimage.io import imread, imread_collection, ImageCollection
from skimage.transform import resize
from skimage.color import rgb2lab

## TL;DR - Results

### Setup: Randomly select a batch of 32 images from a corpus of ~1000 images (files 1 through 999), 10 times

### Old Method: ~35 seconds

### New Method: ~15 seconds

Note that with the old method, retrieval time grows roughly linearly with the number of images requested, because no caching occurs. With the new method it does not: images are cached in memory (at the expense of memory usage), so once an image has been loaded, retrieving it again is merely list indexing.
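To illustrate why caching breaks the linear relationship, here is a minimal sketch of a generic load-once cache. It is not scikit-image's implementation; `CachedLoader` and `fake_load` are hypothetical names, and `fake_load` stands in for an expensive `imread` call so the example runs without any files.

```python
import numpy as np

class CachedLoader:
    """Load each item at most once and keep it in memory afterwards."""

    def __init__(self, load_fn):
        self.load_fn = load_fn  # hypothetical callable: index -> image array
        self.cache = {}
        self.loads = 0          # number of real (slow) loads performed

    def __getitem__(self, idx):
        if idx not in self.cache:
            self.cache[idx] = self.load_fn(idx)
            self.loads += 1
        return self.cache[idx]

def fake_load(idx):
    # Stand-in for reading a file from disk: returns a small deterministic array
    return np.full((4, 4, 3), idx % 256, dtype=np.uint8)

loader = CachedLoader(fake_load)
rng = np.random.default_rng(0)
draws = rng.integers(0, 999, size=320)  # 10 batches of 32, sampled with replacement
for i in draws:
    _ = loader[int(i)]

print(loader.loads, "slow loads for", len(draws), "requests")
```

Every repeated index is served from memory, so the number of slow loads is bounded by the corpus size no matter how many batches are drawn.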

Note that any attempt to time individual events (i.e. loading vs. indexing) was difficult because the two methods are structured differently, and because of some annoyances with trying to control for caching within the notebook. Ultimately, the initial results speak for themselves.
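One possible way to separate the phases, sketched here with the standard library only, is a small accumulating timer built on `time.perf_counter`. The `timed` helper and the stand-in workloads are assumptions for illustration; this notebook did not use them.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> accumulated seconds

@contextmanager
def timed(label):
    """Accumulate the wall time spent inside the `with` block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start

with timed("load"):
    data = [list(range(1000)) for _ in range(100)]  # stand-in for reading files
with timed("index"):
    picked = [data[i % 100] for i in range(320)]    # stand-in for batch indexing

for label, seconds in timings.items():
    print(f"{label}: {seconds:.6f} s")
```

This does not solve the caching problem mentioned above: a fair comparison would still need a fresh runtime (or cleared caches) before each measurement.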

## Old method of loading data

In [0]:
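def get_input(name):
    # Load image `name` (e.g. 12 -> '12.png') and resize it to a 256x256 single-channel array
    img = imread('/content/drive/My Drive/Colab Notebooks/1_ft_ortho_images/'  + str(int(name)) + '.png', as_gray=False) #Note in production this would be as_gray=True
    img = resize(img, (256,256,1))
    # Safeguard: rescale to [0, 1] if the values are still in the 0-255 range
    if np.max(img) > 1:
        img = img/255.0

    return img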

In [0]:
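def get_output(name):
    # Load image `name` (e.g. 12 -> '12.png') and resize it to a 256x256 RGB array
    img = imread('/content/drive/My Drive/Colab Notebooks/1_ft_ortho_images/'  + str(int(name)) + '.png', as_gray=False)
    img = resize(img, (256,256,3))
    # Safeguard: rescale to [0, 1] if the values are still in the 0-255 range
    if np.max(img) > 1:
        img = img/255.0
    return img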

In [0]:
def image_generator(indices, batch_size=32):
    # Endlessly yield (input, output) batches of randomly chosen images
    while True:
        # Select files (paths/indices) for the batch (sampled with replacement)
        batch_paths = np.random.choice(a=indices, size=batch_size)
        batch_input = []
        batch_output = []
        # print(batch_paths[0], batch_paths[-1])

        # Read in each input, perform preprocessing and get labels
        for input_path in batch_paths:
            batch_input.append(get_input(input_path))
            batch_output.append(get_output(input_path))

        # Return a tuple of (input, output) to feed the network
        batch_x = np.array(batch_input)
        batch_y = np.array(batch_output)

        yield batch_x, batch_y

In [6]:
%%time
x = image_generator(range(1,1000),batch_size=32)
imgs = [next(x) for _ in range(10)]
print('Final # of images ',len(imgs))
print('Old method loading data times:')

Final # of images  10
Old method loading data times:
CPU times: user 35.8 s, sys: 29.8 s, total: 1min 5s
Wall time: 35.1 s
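As a rough sanity check on the measurement above: the old generator calls `imread` twice per image (once in `get_input`, once in `get_output`). The arithmetic below is only a sketch based on the reported wall time, and the per-read figure also includes the resize, so it is an upper bound on the pure file-read cost.

```python
batches, batch_size = 10, 32
reads_per_image = 2   # get_input and get_output each call imread on the same file
wall_time_s = 35.1    # wall time reported by %%time for the old method

total_reads = batches * batch_size * reads_per_image  # 640 reads
ms_per_read = 1000 * wall_time_s / total_reads
print(f"{total_reads} reads, ~{ms_per_read:.0f} ms per read (including resize)")
# prints: 640 reads, ~55 ms per read (including resize)
```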


## New method of loading data

In [7]:
%%time
arr = ['/content/drive/My Drive/Colab Notebooks/1_ft_ortho_images/{}.png'.format(x) for x in range(1,1000)]
# print(arr)
IC = imread_collection(arr, conserve_memory=False)
imgs2 = []
for i in range(10):
    choices = np.random.choice(a = len(IC), size = 32)
    # resize() already returns floats in [0, 1] for uint8 input, so no further division by 255 is needed
    _imgs = np.stack([resize(IC[x], (256,256,3)) for x in choices])
    imgs2.append(_imgs)
print('Final # of images ',len(imgs2))
print('New method loading data times:')

Final # of images  10
New method loading data times:
CPU times: user 14.4 s, sys: 13.1 s, total: 27.5 s
Wall time: 15.1 s
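To use the cached approach during training, the in-memory images could be wrapped in a generator with the same interface as `image_generator`. The following is a minimal sketch under several assumptions that go beyond this notebook: the images have already been resized once into a single float array, the grayscale input is approximated by the channel mean (production would use `as_gray=True` instead), and `cached_image_generator` is a hypothetical name. Synthetic data is used so the example runs without Drive.

```python
import numpy as np

def cached_image_generator(images, batch_size=32, rng=None):
    """Yield (input, output) batches from preprocessed images held in memory.

    images: float array of shape (N, H, W, 3) with values in [0, 1].
    """
    rng = np.random.default_rng() if rng is None else rng
    while True:
        # Sample indices with replacement, matching np.random.choice's default
        idx = rng.integers(0, len(images), size=batch_size)
        batch_y = images[idx]                           # colour targets
        batch_x = batch_y.mean(axis=-1, keepdims=True)  # assumed grayscale input: channel mean
        yield batch_x, batch_y

# Synthetic stand-in for the cached, already resized corpus
images = np.random.default_rng(0).random((100, 16, 16, 3))
gen = cached_image_generator(images, batch_size=32)
batch_x, batch_y = next(gen)
print(batch_x.shape, batch_y.shape)
# prints: (32, 16, 16, 1) (32, 16, 16, 3)
```

Resizing once up front trades extra memory for removing the per-batch `resize` calls, which the timing cell above still pays on every batch.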
