# Uncorking the I/O Bottleneck of Bio-Imaging

## Dealing with the dimensionality of microscopy images

Today’s image acquisition devices, be they digital pathology slide scanners or lightsheet microscopes, can generate a huge amount of data. This volume of data can make it very challenging to move around, save and load, let alone ingest into some sort of machine-learning or deep-learning algorithm.

The objective of this workshop is to introduce you to a few tools and techniques that can help to deliver insights from this rich data without exhausting your system memory or taking eons to run. In fact, you may be surprised to see that toolsets such as RAPIDS can turn some previously infeasible analyses into near-real-time processes.

This workshop will mostly focus on digital pathology, but really, these techniques are very generic and could be applied to data from many different modalities.

Images or volumes can be saved in a variety of formats, some of which are generic and some of which are domain-specific. Additionally, the formats used to save and load images may be based on open standards or may be proprietary to the manufacturer of the device used to capture the image.

The image that we will be using came from the [Camelyon Challenge Dataset](https://camelyon17.grand-challenge.org) and was saved in the .tif format.


## Part 1 - Loading images

Images store a lot of information. Most commonly, images are composed of one or more channels of intensity values across 2 or 3 dimensions. To keep file sizes manageable, compression is usually employed. In some cases lossy compression is suitable, but in other domains the compression needs to be lossless. Either way, getting all of the pixel data from disk into computer memory involves reading and decompressing a lot of data, and without the right tools, techniques and hardware it can be a slow process. If you have an accelerator such as a GPU, you may find that you cannot utilize its full capabilities because you cannot feed it data at a sufficiently high rate to keep it busy.

This first section introduces a few tools that you can use to make the best use of the available resources when loading the data. There are a few factors that come into play here:
* The efficiency of the software algorithm
* The speed of the machine-code that the software is compiled into
* The number of CPU threads or processes used
* The performance of the disk and networking that the data needs to traverse
* The speed of the CPU
* Any hardware that the CPU supports to accelerate certain processes, such as AVX instructions

For the loading of a variety of biomedical imaging formats, the go-to software has been OpenSlide, which can load formats such as Aperio’s .svs format and many other tiff-based formats. First of all, let’s use OpenSlide to load up one of the images we have to get a feel for the latency involved in loading images at certain resolutions.


In [None]:
import openslide

# Load the image
slide = openslide.OpenSlide("data/patient_100_node_0.tif")

# Get the dimensions at level 0 (Full size)
width, height = slide.level_dimensions[0]

print("Full-Size Image Dimensions - Width = {}, Height = {}".format(width, height))

You should see that this image is 126976 x 82944, so that’s 10.5 billion pixels, with 3 color channels - just in one image. To put this into perspective, at a standard display resolution of 120 dots per inch, you’d need a monitor of about 27 x 18 metres to view this image at full resolution - that's almost 2 tennis courts!
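As a rough check of the scale involved, the arithmetic can be sketched in a few lines. This assumes one byte per channel with no compression and 2.54 cm per inch; the variable names are illustrative only.

```python
# Dimensions reported by OpenSlide for level 0 of the example slide
width_px, height_px = 126976, 82944
channels = 3              # RGB
dpi = 120                 # assumed display resolution (dots per inch)
metres_per_inch = 0.0254

total_pixels = width_px * height_px
# Assumption: one byte (8 bits) per channel, no compression
raw_bytes = total_pixels * channels

# Physical size of a display showing one image pixel per dot
width_m = width_px / dpi * metres_per_inch
height_m = height_px / dpi * metres_per_inch

print("Pixels: {:,}".format(total_pixels))
print("Uncompressed size: {:.1f} GB".format(raw_bytes / 1e9))
print("Display size at {} dpi: {:.1f} m x {:.1f} m".format(dpi, width_m, height_m))
```

The uncompressed size is the more practical number here: it indicates how much RAM a full-resolution load would need before any processing starts.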

For this very reason, these types of image are often saved in formats that allow the image to be loaded at a lower resolution or provide a means of only loading a small sub-region of the image.


In [None]:
import matplotlib.pyplot as plt

# get height and width at the lowest level resolution
w_thumbnail, h_thumbnail = slide.level_dimensions[slide.level_count-1]

# Load up the image data at the lowest resolution - to preview it
img = slide.read_region((0,0), slide.level_count-1,(w_thumbnail, h_thumbnail))
print("Reduced-Size Image Dimensions - Width = {}, Height = {}".format(w_thumbnail,h_thumbnail))

# Use Matplotlib to display the thumbnail view of the image
plt.figure(figsize=(10,10))
plt.imshow(img)
plt.title('patient_100_node_0.tif')
plt.show()

The image displayed shows that the majority of the image pixels are actually not very informative. Most of the image is white background surrounding three distinct tissue slices that have been scanned as one image.
Many image-processing pipelines therefore perform some sort of thresholding on the whole slide before doing more computationally intensive operations.
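To give a feel for what such a thresholding step might look like, here is a minimal sketch on a synthetic thumbnail. The function name `tissue_mask` and the threshold of 220 are assumptions chosen for illustration, not values from this workshop; a real pipeline would typically tune the threshold or use a method such as Otsu's.

```python
import numpy as np

def tissue_mask(rgb, background_threshold=220):
    """Return a boolean mask that is True where a pixel is darker than near-white.

    rgb: array of shape (height, width, 3 or 4); an alpha channel is ignored.
    background_threshold: assumed cut-off on the mean channel intensity.
    """
    gray = rgb[..., :3].mean(axis=-1)
    return gray < background_threshold

# Synthetic stand-in for a thumbnail: a white slide with one stained patch
thumb = np.full((100, 200, 3), 255, dtype=np.uint8)
thumb[20:60, 50:120] = (180, 90, 160)

mask = tissue_mask(thumb)
print("Tissue fraction: {:.2f}".format(mask.mean()))
```

With a real slide, the thumbnail returned by OpenSlide is a PIL image, so it would first be converted with `np.asarray(img)` before being passed to a function like this.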

So let’s investigate the loading time for different resolutions of this image. In the cells below you will see some skeleton code which you need to flesh out to measure the time it takes to load the image at each of the resolutions that it contains. You should complete the code so that it plots the times for each resolution.

`time_loading_at_resolution` is a function that takes a slide and a reduction level and returns the time it takes to load the image data at that resolution.

Check the [solution](solutions/solution1_1.py) if you get stuck.


In [None]:
from timeit import default_timer as timer

def time_loading_at_resolution(slide, level):
    
    start = timer()

    # TODO insert code to load the image at the specified resolution reduction level  
    # and with the full width and height at that resolution

    end = timer()
    
    return end - start

When you have completed the code in the cell above and run it, you can test it by running the cell below. It should print out a range of load times for different resolutions

In [None]:
# Initialise a list to hold the loading times
times = [0] * (slide.level_count-2)

# Now call the timing function for a range of possible resolutions
for i in range(slide.level_count-1,1,-1):
    times[i-2] = time_loading_at_resolution(slide,i)
    print("Time at resolution reduction level {} = {}".format(i, times[i-2]))
    
print("Completed")

If that works as expected, you should now have the load times in the array we created, which we can plot out by running the cell below

In [None]:
# Now plot the load times
x = range(2,slide.level_count)

plt.plot(x, times, '-ok')
plt.xlabel("Reduction Level")
plt.ylabel("Load Time (s)")

plt.show()

Because each reduction level halves the number of pixels in each dimension, the load time increases by a corresponding factor of about 4 each time we move to the next higher resolution. We stopped this experiment at reduction level 2 because, according to this trend, it would take about 100 seconds to load at reduction level 1 and 400 seconds at full resolution.
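The extrapolation can be written down as a one-line formula. The sketch below assumes the load time scales by exactly 4 per level, which is an idealisation, and the 25-second starting value is a hypothetical measurement used only to show the calculation.

```python
def extrapolate_load_time(measured_time, measured_level, target_level, factor=4.0):
    """Estimate the load time at target_level from a measurement at measured_level.

    Assumes the time scales by `factor` per level, i.e. 2x in each dimension.
    """
    return measured_time * factor ** (measured_level - target_level)

level2_time = 25.0  # hypothetical measurement in seconds, not a real result

for target in (1, 0):
    estimate = extrapolate_load_time(level2_time, 2, target)
    print("Estimated time at reduction level {} = {:.0f} s".format(target, estimate))
```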

So, what can we do to reduce this load time? One technique that is often used to speed up many different types of operation is multi-threading. Multi-threading is a technique in which the process running your code spawns multiple threads of execution, which share the memory of that process and can operate in parallel, reducing the overall time to perform certain operations. In this case, we could get multiple threads loading different parts of the image. There are many ways of using multi-threading in your code, but let’s start by using Python’s threading library to create a simple class that we can launch as a thread.

In the code above, the loading of the level 2 reduction took about 20 seconds. So, what would happen if we tried to use different threads to load different regions of the image at the same time? Let's try it and see...


In [None]:
%%time
import threading
import time

# TODO Try changing the number of threads and see what effect
# it has on the overall run time
num_threads = 16

class loaderThread (threading.Thread):
    def __init__(self, threadID):
        # Class initialisation - set its ID
        threading.Thread.__init__(self)
        self.threadID = threadID

    def run(self):
        print("Starting thread {}".format(self.threadID))
        start = timer()
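        width, height = slide.level_dimensions[2]
        # Each thread loads its own column strip. OpenSlide expects the
        # location in level-0 coordinates, so scale x by the level's downsample.
        x_level = (width // num_threads) * self.threadID
        x = int(x_level * slide.level_downsamples[2])
        img = slide.read_region((x,0), 2,(width//num_threads, height))
        del img # To conserve memory 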
        end = timer()
        print("Exiting thread {}, running time = {}".format(self.threadID, str(end-start)))

threads = []

for i in range(num_threads):
    # Create new threads
    thread = loaderThread(i)
    thread.start()
    threads.append(thread)

# Wait for all threads to complete
for t in threads:
    t.join()
    
print("Exiting Main Thread")

What you should notice is that each thread takes roughly the same amount of time, but the overall wall time is reduced. You will also notice that the order in which events happen is non-deterministic. This is an important point to remember: by default, threads and processes do not execute in a particular order.

Next, try seeing how this approach performs with different numbers of threads. Try changing the num_threads - perhaps refactor the code to execute the same image load whilst varying the number of threads (say 1,2,4,8,16,32). Note the overall runtime and plot the results to get a feel for how this approach scales using the code in the cell below. Notice that each thread is loading a different column-chunk of the whole image. What happens if we change this so that each worker loads a different row-chunk? Does it make a difference to the time?
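When refactoring, it can help to separate the chunk geometry from the loading itself. The helpers below are a sketch with hypothetical names (`column_chunks`, `row_chunks`); they return regions in the pixel grid of the chosen level, so the x and y values would still need to be scaled to level-0 coordinates before being passed as a `read_region` location.

```python
def column_chunks(width, height, n_chunks):
    """Split a width x height region into n_chunks vertical strips.

    Returns a list of (x, y, w, h) tuples. The last strip absorbs any
    remainder so that the strips always cover the full width.
    """
    base = width // n_chunks
    chunks = []
    for i in range(n_chunks):
        x = i * base
        w = base if i < n_chunks - 1 else width - x
        chunks.append((x, 0, w, height))
    return chunks

def row_chunks(width, height, n_chunks):
    """Split a width x height region into n_chunks horizontal strips."""
    base = height // n_chunks
    chunks = []
    for i in range(n_chunks):
        y = i * base
        h = base if i < n_chunks - 1 else height - y
        chunks.append((0, y, width, h))
    return chunks

print(column_chunks(10, 4, 3))
print(row_chunks(10, 4, 2))
```

Handling the remainder explicitly matters when the width is not an exact multiple of the number of workers; plain integer division would otherwise leave a few columns unread.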


In [None]:
num_threads_array = [1,2,4,8,16,32]
times = [0] * len(num_threads_array)

# TODO Add code to log the times for different numbers of threads (use the code above as a starting point)

plt.plot(num_threads_array, times, '-ok')
plt.xlabel("Number of threads")
plt.ylabel("Load Time (s)")

plt.show()

What you should observe is that doubling the number of threads almost halves the run time but as the number of threads increases the rate of improvement tails off until, at some point, the cost of spawning additional threads and coordinating them outweighs the reduced runtime. This is evidenced by the growing disparity between the running time of each thread and the overall run time.
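One way to structure such a scaling measurement is sketched below using `concurrent.futures.ThreadPoolExecutor` from the standard library. The `fake_load` function is a hypothetical stand-in that simply sleeps, which releases the GIL much like blocking I/O does. Because it does no decompression and never competes for disk bandwidth, it scales more cleanly than a real image load would.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from timeit import default_timer as timer

def fake_load(chunk_id, delay=0.05):
    """Hypothetical stand-in for reading one chunk; sleeps instead of doing I/O."""
    time.sleep(delay)
    return chunk_id

def time_with_threads(n_threads, n_chunks=16):
    """Return the wall time taken to 'load' n_chunks chunks using n_threads threads."""
    start = timer()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(fake_load, range(n_chunks)))
    assert len(results) == n_chunks
    return timer() - start

thread_counts = [1, 2, 4, 8, 16]
timings = {n: time_with_threads(n) for n in thread_counts}

for n, t in timings.items():
    print("{:2d} threads: {:.2f} s".format(n, t))
```

Swapping `fake_load` for a real chunk read would turn this into a measurement of the actual loader, at which point the tail-off described above should become visible.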

In this case, you probably won’t see much of a difference when inverting the column/row loading, but this can make a difference for some file formats if data is stored in row-major or column-major format.


There is another approach to parallelisation in which we do not launch multiple threads, but instead use processes. Processes are more resource hungry than threads but they do have the advantage of not falling foul of the Python [GIL](https://wiki.python.org/moin/GlobalInterpreterLock), since the GIL only operates within each process. The syntax for multiprocessing is very similar to that for multithreading, so you can simply refactor the test above to use multi-processing and see how the results change. If you get stuck, check the [solution](solutions/solution1_2.py).

In [None]:
%%time
import multiprocessing
import time

# TODO Try changing the number of processes and see what effect
# it has on the overall run time
num_processes = 1

#TODO - create a class similar to the loaderThread, but using processes instead
class loaderProcess():
    pass

processes = []

# TODO Create new processes
    
# TODO Wait for all processes to complete
    
print("Exiting Main Process")

In [None]:
num_procs_array = [1,2,4,8,16,32]
times = [0] * len(num_procs_array)

# TODO As before refactor the code to record the load time for each number of processes

plt.plot(num_procs_array, times, '-ok')
plt.xlabel("Number of processes")
plt.ylabel("Load Time (s)")

plt.show()

Even though each process occupies more resources, multiprocessing generally loads the image a little faster than threading. This is because we are bypassing the GIL, which would otherwise force certain code regions to execute serially to ensure that no data races occur.

## Introducing cuCIM

When dealing with much larger images, it is necessary to utilise as much of the available compute power as we can in parallel, otherwise it can be difficult to keep the GPU busy all the time. The problem we face here is not just the Python GIL: OpenSlide itself is not especially fast at this sort of operation. For this reason, the cuCIM library was recently added to the RAPIDS platform. cuCIM offers similar capabilities to OpenSlide but has been optimised for the scenario we have just been exploring. The API is not exactly the same as OpenSlide's, so to do what we did before we will need to amend the loading code slightly. Have a look at the code cell below to see how to get the image dimensions at a specific resolution and load the image.

N.B. When loading a specific region of interest at a reduction level > 0, you need to supply the x and y coordinates of the region's top-left corner at the full resolution, whereas the width and height should be supplied at the reduced size. See the [documentation](https://docs.rapids.ai/api/cucim/stable/api.html#module-cucim.CuImage) for more details.
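The coordinate convention in the note above can be captured in a small helper. This is a sketch: `level_to_level0` is a hypothetical name, and the downsample factor is assumed to be supplied by the caller. For this image each level appears to halve each dimension, so level n would correspond to a factor of 2**n.

```python
def level_to_level0(x_level, y_level, downsample):
    """Convert a position in a reduced level's pixel grid to level-0 coordinates."""
    return int(round(x_level * downsample)), int(round(y_level * downsample))

# Example: a 512 x 512 region whose top-left corner sits at (1000, 500)
# in the pixel grid of level 2, assuming a downsample factor of 4
location = level_to_level0(1000, 500, 4.0)   # passed as the location
size = (512, 512)                            # passed as the size, at level 2 resolution

print(location, size)
```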

In [None]:
from cucim import CuImage

input_file = "data/patient_100_node_0.tif"
# load the image header
wsi = CuImage(input_file)

# Get the resolution meta data
sizes=wsi.metadata["cucim"]["resolutions"]
levels = sizes["level_count"]

# Get the dimensions at the lowest resolution level
wt = sizes["level_dimensions"][levels-1][0]
ht = sizes["level_dimensions"][levels-1][1]

# Load the image data at this resolution
wsi_thumb = wsi.read_region(location=(0,0), size=(wt,ht), level=levels-1)

plt.figure(figsize=(10,10))
plt.imshow(wsi_thumb)
plt.title('patient_100_node_0.tif')
print(wt,ht)
plt.show()

Now we can compare the performance of image loading using OpenSlide and cuCIM. In the code cell below, add the necessary steps for cuCIM to load the image at the specified resolution ([solution](solutions/solution1_3.py)).

In [None]:
from timeit import default_timer as timer

def time_loading_at_resolution(level, use_cucim):
    
    start = timer()

    if use_cucim:
        # TODO insert code to load the image at the specified resolution reduction level  
        # and with the full width and height at that resolution
        pass  # placeholder so that the cell parses; replace with your loading code
    else:
        width, height = slide.level_dimensions[level]
        img = slide.read_region((0,0), level, (width, height))
        del(img) # To conserve memory 

    end = timer()
    
    return end - start

Once you have completed and run the code cell above, you can run the code below to test the function and generate some load times to compare

In [None]:
# Now call the timing function for each of the possible resolutions
cu_times = [0] * (slide.level_count-2)
times = [0] * (slide.level_count-2)

print("Using cuCIM...")
for i in range(slide.level_count-1,1,-1):
    cu_times[i-2] = time_loading_at_resolution(i,True)
    print("Time at resolution reduction level {} = {}".format(i, cu_times[i-2]))

print("Using OpenSlide...")
for i in range(slide.level_count-1,1,-1):
    times[i-2] = time_loading_at_resolution(i,False)
    print("Time at resolution reduction level {} = {}".format(i, times[i-2]))

print("Completed")

Let's plot that out 

In [None]:
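# times[0] and cu_times[0] hold the level 2 results, so x must run from 2 upwards
reduction_level = list(range(2, slide.level_count))

plt.plot(reduction_level, times, '-ok', label="OpenSlide")
plt.plot(reduction_level, cu_times, '-or', label="cuCIM")
plt.xlabel("Reduction Level")
plt.ylabel("Load Time (s)")
plt.legend()

plt.show()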

So, you should notice that cuCIM is almost an order of magnitude faster at loading the image data. Next we can see how the use of multiprocessing and threading affects the performance.

In [None]:
%%time
import threading
import time
from cucim import CuImage
from timeit import default_timer as timer
import openslide

# TODO Try changing the number of threads, and which loader (cucim or openslide) to 
# use and see what effect it has on the overall run time
num_threads = 8
use_cucim = False
level = 1
input_file = "data/patient_100_node_0.tif"

if use_cucim:
    cuslide = CuImage(input_file)
else:
    slide = openslide.OpenSlide(input_file)


class loaderThread (threading.Thread):
    def __init__(self, threadID):
        threading.Thread.__init__(self)
        self.threadID = threadID

    def run(self):
        #print("Starting thread {}".format(self.threadID))
        start = timer()
        if use_cucim:
            sizes=cuslide.metadata["cucim"]["resolutions"]
            width = sizes["level_dimensions"][level][0]
            height = sizes["level_dimensions"][level][1]
            # The location must be given in level-0 coordinates
            x = int((width // num_threads) * self.threadID * sizes["level_downsamples"][level])
            img = cuslide.read_region((x,0),(width//num_threads, height), level)
            del img # To conserve memory 
        else:
            width, height = slide.level_dimensions[level]
            # The location must be given in level-0 coordinates
            x = int((width // num_threads) * self.threadID * slide.level_downsamples[level])
            img = slide.read_region((x,0), level,(width//num_threads, height))
            del img # To conserve memory 
            
        print("Thread {}, running time = {}".format(self.threadID, str(timer()-start)))

threads = []

print("Starting Threads")
for i in range(num_threads):
    # Create new threads
    thread = loaderThread(i)
    thread.start()
    threads.append(thread)

# Wait for all threads to complete
for t in threads:
    t.join()
    
print("Exited Threads")


You will probably notice that, with OpenSlide, some of the threads actually don't take too long to run, whereas others seem to get stuck and take a lot longer. This is precisely the sort of behaviour that you'd expect if there is some sort of blocking going on. When we use multiprocessing you might expect to see less of this, but in fact it does not make much difference. Using cuCIM, on the other hand, we get faster loading of each sub-region and little or no competing for resources between the threads or processes. However, multi-processing still tends to give slightly better performance than threading with cuCIM.

In [None]:
%%time
import multiprocessing
import time
import openslide
from cucim import CuImage
from timeit import default_timer as timer

# TODO Try changing the number of processes and see what effect
# it has on the overall run time
num_processes = 8
use_cucim = True
level = 1
input_file = "data/patient_100_node_0.tif"

if use_cucim:
    cuslide = CuImage(input_file)
else:
    slide = openslide.OpenSlide(input_file)


class loaderProcess (multiprocessing.Process):
    def __init__(self, processID):
        multiprocessing.Process.__init__(self)
        self.processID = processID

    def run(self):
        start = timer()
        if use_cucim:
            sizes=cuslide.metadata["cucim"]["resolutions"]
            width = sizes["level_dimensions"][level][0]
            height = sizes["level_dimensions"][level][1]
            # The location must be given in level-0 coordinates
            x = int((width // num_processes) * self.processID * sizes["level_downsamples"][level])
            img = cuslide.read_region((x,0),(width//num_processes, height), level)
            del img # To conserve memory 
        else:
            width, height = slide.level_dimensions[level]
            # The location must be given in level-0 coordinates
            x = int((width // num_processes) * self.processID * slide.level_downsamples[level])
            img = slide.read_region((x,0), level,(width//num_processes, height))
            del img # To conserve memory 
            
        print("Process {}, running time = {}".format(self.processID, str(timer()-start)))

processes = []

print("Starting Processes")
for i in range(num_processes):
    # Create new process
    proc = loaderProcess(i)
    proc.start()
    processes.append(proc)

# Wait for all processes to complete
for p in processes:
    p.join()
    
print("Exited Processes")


In [None]:
import matplotlib.pyplot as plt

num_procs_array = [1,2,4,8,16,32]
# TODO replace times with actual measured values for each number of processes 
times = [0] * len(num_procs_array)

plt.plot(num_procs_array, times, '-ok')
plt.xlabel("Number of processes")
plt.ylabel("Load Time (s)")

plt.show()

cuCIM has just provided a new feature (v21.12.1+) that actually uses multiple threads internally to load an image. This is a much more efficient and cleaner way of quickly loading an image. It requires no Python GIL workarounds and uses far fewer resources. Let's compare it with our Python implementation.

Please note that we are loading a very large image, so there is a chance that we will run out of RAM. If this happens, you will most likely see an error message telling you that the kernel has restarted. A restart clears the data left in RAM by the previous cells, so the load will probably work if you run this cell again (there is no need to re-run any previous cells).

In [None]:
%%time
from cucim import CuImage
level = 1

input_file = "data/patient_100_node_0.tif"
cuslide = CuImage(input_file)
sizes=cuslide.metadata["cucim"]["resolutions"]
# The size must match the level being loaded, not level 0
width = sizes["level_dimensions"][level][0]
height = sizes["level_dimensions"][level][1]
img = cuslide.read_region((0,0),(width, height), level, num_workers=32)

In [None]:
import matplotlib.pyplot as plt

workers = [1,2,4,8,16,32,64]
# TODO replace times with actual measured values for each number of workers 
times = [0] * len(workers)

plt.plot(workers, times, '-ok')
plt.xlabel("Number of workers")
plt.ylabel("Load Time (s)")

plt.show()

So, you may be wondering why this method seems to be taking slightly longer than our multi-threaded and multi-process code. The reason for this disparity is actually because, in our previous experiments, each thread/process was loading up a chunk of the image, but it wasn't actually doing anything with it. In contrast, when cuCIM is loading the image, it is actually assembling these chunks into a single matrix, which requires some additional coordination between the threads/processes.

To make a fair comparison, we need to allocate an array for the whole image first and then insert each chunk into the correct location in the array as it is loaded. As a final exercise, see what the best overall loading time is that you can achieve using either multi-processing or multi-threading whilst actually assembling the whole image.

Again, you may have to restart the kernel if you run out of memory. Alternatively, you could load the image at a reduction level of >=1 (but make sure you compare your load time with cuCIM also loading the whole image at the same reduction level). Solution basis [here](solutions/solution1_4.py).
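The assembly pattern itself can be sketched independently of the image loader. In the example below a small synthetic array stands in for the slide and `load_chunk` is a hypothetical replacement for a `read_region` call; only the pre-allocation and the per-thread slice assignment are the point.

```python
import threading
import numpy as np

# Synthetic stand-in for the slide at one level: (height, width, channels)
source = (np.arange(64 * 96 * 3) % 256).astype(np.uint8).reshape(64, 96, 3)

def load_chunk(x, chunk_width):
    """Hypothetical loader: returns columns [x, x + chunk_width) as a new array."""
    return source[:, x:x + chunk_width].copy()

height, width, channels = source.shape
num_threads = 4

# Allocate the destination once, before any thread starts
assembled = np.empty((height, width, channels), dtype=np.uint8)

def worker(thread_id):
    chunk_w = width // num_threads
    x = chunk_w * thread_id
    # Let the last worker pick up any remainder columns
    w = chunk_w if thread_id < num_threads - 1 else width - x
    # Each thread writes to a disjoint slice, so no lock is needed
    assembled[:, x:x + w] = load_chunk(x, w)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("Assembled correctly:", np.array_equal(assembled, source))
```

This works with threads because they share the memory of the parent process. With processes, a plain NumPy array is not shared between workers, so some form of shared memory would be needed instead.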

In [None]:
%%time
import threading
import multiprocessing
import time
from cucim import CuImage

# TODO Try changing the number of threads/processes and see what effect
# it has on the overall run time
num_... = 32

# TODO get the image size information using cuCIM

# TODO create the array to contain the whole image

# TODO create a thread or process loading class that inserts each
# loaded chunk into the correct location in the array
class loader... (...):

# TODO Launch the threads/processes


# Wait for all the thread/processes to complete


This section should have given you a good grasp of how much of a difference the combination of a decent image loader and some threading or multi-processing can make.

You can now close this notebook and open Notebook_2 to find out how to use a different technique to load images. 

## Bonus Exercise
Can you use a [Zarr](https://zarr.readthedocs.io/en/stable/) array to save the data from each chunk at full resolution?

We have been loading a 1x reduced resolution version of the image (i.e. 4x fewer pixels than full resolution) because this environment has limited RAM and, if we try to load the image at full resolution it is likely that we will run out of memory. Zarr arrays work a little differently from, say, a numpy array. They save the data in chunks and can load the data lazily (i.e. as it is needed), which can reduce the memory consumption when loading the data.

There are a few things you will need to do to make this work (get it working at reduction level=1 first):
* Set the datatype of the zarr array
* Delete the temporary img_s array to reduce memory usage
* Set a chunk size that matches the width of the data chunks you are loading
* Split the loading of each thread's chunk into 2 halves and load them serially to conserve memory
* Use the zarr.open method to automatically write data to disk

You can see whether it worked by looking in the data subdirectory and checking the example.zarr folder

You will also need to pip install Zarr (execute cell below)

In [None]:
!pip install zarr

In [None]:
%%time
import zarr

# TODO refactor the previous code to use a Zarr array rather than a numpy array
