# Multi-thread and Multi-process Tests

In this notebook, we compare hipCIM with OpenSlide in a multi-thread/multi-process environment.

The experiment uses `input/image.tif`, a 19920x26420 image with a tile size of 256x256.

Because hipCIM does not implement an internal cache yet, its performance depends on the `start_location` variable in the experiment code.

![](static_images/Multi-thread_and_Multi-process_Tests_Alignment.png)

In the first case (`start_location = 0`), the whole image is read starting from (0,0) with a 256x256 patch size, so both OpenSlide and hipCIM read each tile only once.
In the second case (`start_location = 1`), reading starts from (1,1) and hipCIM is at a disadvantage: for the second patch (the second red box), hipCIM needs to read four tiles, whereas OpenSlide reads only two, because the two tiles in the middle were cached when OpenSlide read the first patch.
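
The tile counts in the two cases can be checked with a little arithmetic. The following is a minimal sketch, assuming square patches and tiles and ignoring image borders and any caching; `tiles_touched` is a hypothetical helper introduced here for illustration, not part of hipCIM or OpenSlide.

```python
def tiles_touched(start, patch_size=256, tile_size=256):
    """Count the tiles overlapped by a square patch whose top-left corner is (start, start)."""
    first = start // tile_size                     # index of the first tile along one axis
    last = (start + patch_size - 1) // tile_size   # index of the last tile along one axis
    per_axis = last - first + 1
    return per_axis * per_axis                     # same count along x and y


print(tiles_touched(0))  # 1: the patch is aligned with the tile grid
print(tiles_touched(1))  # 4: the patch straddles a tile boundary on both axes
```

Without a cache, every unaligned patch pays for all four tiles, which is the disadvantage described above.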

**Note:** hipCIM is expected to support a cache mechanism in the near future.

The following system is used to conduct the experiment:
- OS: Ubuntu 22.04 LTS
- CPU: AMD EPYC 9654


Before running this notebook, download the test data by following these steps:
```
cd hipCIM
./run_amd download_testdata
```

In [1]:
from contextlib import ContextDecorator
from time import perf_counter

class Timer(ContextDecorator):
    def __init__(self, message):
        self.message = message
        self.end = None
    def elapsed_time(self):
        self.end = perf_counter()
        return self.end - self.start
    def __enter__(self):
        self.start = perf_counter()
        return self
    def __exit__(self, exc_type, exc, exc_tb):
        if not self.end:
            self.elapsed_time()
        print("{} : {}".format(self.message, self.end - self.start))

## Multithreading

In [None]:
import numpy as np
from openslide import OpenSlide
import concurrent.futures
from cucim import CuImage

import os

# num_threads = os.cpu_count() # uncomment this line to work with all threads
num_threads = 5 #comment this line if you uncomment the line above

input_file = "input/image.tif"
start_location = 0
patch_size = 256


def load_tile_openslide(slide, start_loc, patch_size):
    region = slide.read_region(start_loc, 0, [patch_size, patch_size])

def load_tile_cucim(slide, start_loc, patch_size):
    region = slide.read_region(start_loc, [patch_size, patch_size], 0)

openslide_tot_time = 0
cucim_tot_time = 0
for num_workers in range(1, num_threads + 1):

    print("# of thread : {}".format(num_workers))
    openslide_time = 0
    
    with OpenSlide(input_file) as slide:
        width, height = slide.dimensions

        # Lazily yield the top-left (x, y) corner of every patch, row by row.
        start_loc_iter = ((sx, sy)
                          for sy in range(start_location, height, patch_size)
                              for sx in range(start_location, width, patch_size))
        with Timer("  Thread elapsed time (OpenSlide)") as timer:
            with concurrent.futures.ThreadPoolExecutor(
                max_workers=num_workers
            ) as executor:
                executor.map(
                    lambda start_loc: load_tile_openslide(slide, start_loc, patch_size),
                    start_loc_iter,
                )
            openslide_time = timer.elapsed_time()
            openslide_tot_time += openslide_time

    cucim_time = 0
    slide = CuImage(input_file)
    start_loc_iter = ((sx, sy)
                      for sy in range(start_location, height, patch_size)
                          for sx in range(start_location, width, patch_size))
    with Timer("  Thread elapsed time (hipCIM)") as timer:
        with concurrent.futures.ThreadPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(
                lambda start_loc: load_tile_cucim(slide, start_loc, patch_size),
                start_loc_iter,
            )
        cucim_time = timer.elapsed_time()
        cucim_tot_time += cucim_time
    print("  Performance gain (OpenSlide/hipCIM): {}".format(openslide_time / cucim_time))

print("Total time (OpenSlide):", openslide_tot_time)
print("Total time (hipCIM):", cucim_tot_time)
print("Average performance gain (OpenSlide/hipCIM): {}".format(openslide_tot_time / cucim_tot_time))


# of thread : 1


  Thread elapsed time (OpenSlide) : 10.050760764992447
  Thread elapsed time (hipCIM) : 2.159018546997686
  Performance gain (OpenSlide/hipCIM): 4.6552452173090195
# of thread : 2
  Thread elapsed time (OpenSlide) : 5.1799155339977005
  Thread elapsed time (hipCIM) : 1.2631855660001747
  Performance gain (OpenSlide/hipCIM): 4.100676633283335
# of thread : 3
  Thread elapsed time (OpenSlide) : 3.5216421329969307
  Thread elapsed time (hipCIM) : 0.9337919630052056
  Performance gain (OpenSlide/hipCIM): 3.771334807448218
# of thread : 4
  Thread elapsed time (OpenSlide) : 2.7125375229952624
  Thread elapsed time (hipCIM) : 0.7509548369998811
  Performance gain (OpenSlide/hipCIM): 3.6121180520416463
# of thread : 5
  Thread elapsed time (OpenSlide) : 2.211385290007456
  Thread elapsed time (hipCIM) : 0.6447373099945253
  Performance gain (OpenSlide/hipCIM): 3.4299012260144117
Total time (OpenSlide): 23.676241244989797
Total time (hipCIM): 5.751688222997473
Average performance gain (OpenSli

## Multiprocessing (method1: Slow)

Each patch request opens the image file again.

In [None]:
import concurrent.futures
from itertools import repeat

import numpy as np
from openslide import OpenSlide
from cucim import CuImage

import os

# num_processes = os.cpu_count()  # uncomment this line to use all CPU cores
num_processes = 5  # comment this line out if you uncomment the line above

input_file = "input/image.tif"
start_location = 0
patch_size = 256


def load_tile_openslide_mp(inp_file, start_loc, patch_size):
    with OpenSlide(inp_file) as slide:
        region = slide.read_region(start_loc, 0, [patch_size, patch_size])

def load_tile_cucim_mp(inp_file, start_loc, patch_size):
    slide = CuImage(inp_file)
    region = slide.read_region(start_loc, [patch_size, patch_size], 0)

openslide_tot_time = 0
cucim_tot_time = 0
for num_workers in range(1, num_processes + 1):

    print("# of processes : {}".format(num_workers))
    openslide_time = 0
    
    with OpenSlide(input_file) as slide:
        width, height = slide.dimensions

        # Locations are (x, y), the order read_region expects.
        start_loc_iter = ((sx, sy)
                          for sy in range(start_location, height, patch_size)
                              for sx in range(start_location, width, patch_size))

        with Timer("  Process elapsed time (OpenSlide)") as timer:
            with concurrent.futures.ProcessPoolExecutor(
                max_workers=num_workers
            ) as executor:
                executor.map(
                    load_tile_openslide_mp,
                    repeat(input_file),
                    start_loc_iter,
                    repeat(patch_size)
                )
            openslide_time = timer.elapsed_time()
            openslide_tot_time += openslide_time

    cucim_time = 0
    slide = CuImage(input_file)
    # Locations are (x, y), the order read_region expects.
    start_loc_iter = ((sx, sy)
                      for sy in range(start_location, height, patch_size)
                          for sx in range(start_location, width, patch_size))
    with Timer("  Process elapsed time (hipCIM)") as timer:
        with concurrent.futures.ProcessPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(
                load_tile_cucim_mp,
                repeat(input_file),
                start_loc_iter,
                repeat(patch_size)
            )
        cucim_time = timer.elapsed_time()
        cucim_tot_time += cucim_time
    print("  Performance gain (OpenSlide/hipCIM): {}".format(openslide_time / cucim_time))

print("Total time (OpenSlide):", openslide_tot_time)
print("Total time (hipCIM):", cucim_tot_time)
print("Average performance gain (OpenSlide/hipCIM): {}".format(openslide_tot_time / cucim_tot_time))


# of processes : 1
  Process elapsed time (OpenSlide) : 16.669776021997677
  Process elapsed time (hipCIM) : 3.1612361349980347
  Performance gain (OpenSlide/hipCIM): 5.273182802589988
# of processes : 2
  Process elapsed time (OpenSlide) : 9.090929320009309
  Process elapsed time (hipCIM) : 1.7610195820016088
  Performance gain (OpenSlide/hipCIM): 5.162310182647931
# of processes : 3
  Process elapsed time (OpenSlide) : 6.427699690000736
  Process elapsed time (hipCIM) : 1.3449563359899912
  Performance gain (OpenSlide/hipCIM): 4.779114026233019
# of processes : 4
  Process elapsed time (OpenSlide) : 4.91031409399875
  Process elapsed time (hipCIM) : 1.1096740909997607
  Performance gain (OpenSlide/hipCIM): 4.425005624466552
# of processes : 5
  Process elapsed time (OpenSlide) : 4.274232757990831
  Process elapsed time (hipCIM) : 0.937119823996909
  Performance gain (OpenSlide/hipCIM): 4.561031202777041
Total time (OpenSlide): 41.3729518839973
Total time (hipCIM): 8.314005967986304
A

## Multiprocessing (method2: Faster)

Each process opens the file once and reuses it, but a separate job is still submitted for each patch request.
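
The open-once-and-reuse idea can be illustrated without an image file. The following is a minimal sketch that runs in a single process; `FakeSlide` and `read_patch` are hypothetical stand-ins for the slide object and the per-patch job, not hipCIM or OpenSlide APIs. In a process pool, each worker process would hold its own copy of the module-level global.

```python
# Stand-in for a slide object that is expensive to open (hypothetical; it
# replaces OpenSlide/CuImage so the sketch runs without an image file).
class FakeSlide:
    open_count = 0  # how many times a "file" was opened in this process

    def __init__(self, path):
        FakeSlide.open_count += 1
        self.path = path

    def read_region(self, start_loc, size):
        return (self.path, start_loc, size)


_slide = None  # module-level global: one copy per worker process


def read_patch(path, start_loc, patch_size):
    """Open the slide on the first call only, then reuse the cached handle."""
    global _slide
    if _slide is None:
        _slide = FakeSlide(path)
    return _slide.read_region(start_loc, (patch_size, patch_size))


# Simulate four patch requests arriving at the same worker process.
results = [read_patch("input/image.tif", (x, 0), 256) for x in range(0, 1024, 256)]
print(len(results), FakeSlide.open_count)  # 4 1
```

The file is opened once however many patches the worker serves, but each patch still costs one job submission, which is the overhead that method 3 removes.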

In [None]:
import concurrent.futures
from itertools import repeat
from functools import partial

import numpy as np
from openslide import OpenSlide
from cucim import CuImage

import os

# num_processes = os.cpu_count()  # uncomment this line to use all CPU cores
num_processes = 5  # comment this line out if you uncomment the line above

input_file = "input/image.tif"
start_location = 0
patch_size = 256

is_process_initialized = False
openslide_obj = None
cucim_obj = None


def load_tile_openslide_mp(slide, start_loc, patch_size):
    region = slide.read_region(start_loc, 0, [patch_size, patch_size])

def proc_init_openslide(inp_file, f, *iters):
    global is_process_initialized, openslide_obj
    if not is_process_initialized:
        is_process_initialized = True
        openslide_obj = OpenSlide(inp_file)
    return f(openslide_obj, *iters)

def load_tile_cucim_mp(slide, start_loc, patch_size):
    region = slide.read_region(start_loc, [patch_size, patch_size], 0)

def proc_init_cucim(inp_file, f, *iters):
    global is_process_initialized, cucim_obj
    if not is_process_initialized:
        is_process_initialized = True
        cucim_obj = CuImage(inp_file)
    return f(cucim_obj, *iters)

openslide_tot_time = 0
cucim_tot_time = 0
for num_workers in range(1, num_processes + 1):

    print("# of processes : {}".format(num_workers))
    openslide_time = 0
    
    with OpenSlide(input_file) as slide:
        width, height = slide.dimensions

        start_loc_iter = ((sx, sy)
                          for sy in range(start_location, height, patch_size)
                              for sx in range(start_location, width, patch_size))

        with Timer("  Process elapsed time (OpenSlide)") as timer:
            with concurrent.futures.ProcessPoolExecutor(
                max_workers=num_workers
            ) as executor:
                executor.map(
                    partial(proc_init_openslide, input_file, load_tile_openslide_mp),
                    start_loc_iter,
                    repeat(patch_size)
                )
            openslide_time = timer.elapsed_time()
            openslide_tot_time += openslide_time

    cucim_time = 0
    slide = CuImage(input_file)
    start_loc_iter = ((sx, sy)
                      for sy in range(start_location, height, patch_size)
                          for sx in range(start_location, width, patch_size))
    with Timer("  Process elapsed time (hipCIM)") as timer:
        with concurrent.futures.ProcessPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(
                partial(proc_init_cucim, input_file, load_tile_cucim_mp),
                start_loc_iter,
                repeat(patch_size)
            )
        cucim_time = timer.elapsed_time()
        cucim_tot_time += cucim_time
    print("  Performance gain (OpenSlide/hipCIM): {}".format(openslide_time / cucim_time))

print("Total time (OpenSlide):", openslide_tot_time)
print("Total time (hipCIM):", cucim_tot_time)
print("Average performance gain (OpenSlide/hipCIM): {}".format(openslide_tot_time / cucim_tot_time))


# of processes : 1
  Process elapsed time (OpenSlide) : 10.514794820002862
  Process elapsed time (hipCIM) : 2.644248381999205
  Performance gain (OpenSlide/hipCIM): 3.9764777362000574
# of processes : 2
  Process elapsed time (OpenSlide) : 5.417017662999569
  Process elapsed time (hipCIM) : 1.5246468510013074
  Performance gain (OpenSlide/hipCIM): 3.5529655011204455
# of processes : 3
  Process elapsed time (OpenSlide) : 3.7354009700065944
  Process elapsed time (hipCIM) : 1.0586009879916674
  Performance gain (OpenSlide/hipCIM): 3.5286203322870855
# of processes : 4
  Process elapsed time (OpenSlide) : 2.8769145969999954
  Process elapsed time (hipCIM) : 1.0371424960030708
  Performance gain (OpenSlide/hipCIM): 2.773885563543119
# of processes : 5
  Process elapsed time (OpenSlide) : 2.359736131998943
  Process elapsed time (hipCIM) : 0.9017732270003762
  Performance gain (OpenSlide/hipCIM): 2.6167733320807036
Total time (OpenSlide): 24.903864182007965
Total time (hipCIM): 7.16641194

## Multiprocessing (method3: Fastest)

The patch requests are split into chunks, roughly one per process, and each chunk is submitted as a single job that carries its whole list of patch requests.
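
The chunking step can be tried on its own. The following is a minimal sketch that uses the image dimensions stated at the top of this notebook; `split_into_chunks` is a hypothetical helper name, and its floor-division chunk size mirrors the experiment code.

```python
def split_into_chunks(items, num_workers):
    """Split items into consecutive chunks of len(items) // num_workers elements (at least 1)."""
    chunk_size = max(1, len(items) // num_workers)
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


width, height, patch_size = 19920, 26420, 256
locations = [(sx, sy)
             for sy in range(0, height, patch_size)
             for sx in range(0, width, patch_size)]

chunks = split_into_chunks(locations, 5)
print(len(locations))            # 8112
print([len(c) for c in chunks])  # [1622, 1622, 1622, 1622, 1622, 2]
```

Because the chunk size is rounded down, a remainder produces one extra short chunk, so there can be one more job than there are processes.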

In [None]:
import concurrent.futures
from itertools import repeat

import numpy as np
from openslide import OpenSlide
from cucim import CuImage

import os

# num_processes = os.cpu_count()  # uncomment this line to use all CPU cores
num_processes = 5  # comment this line out if you uncomment the line above

input_file = "input/image.tif"
start_location = 0
patch_size = 256


def load_tile_openslide_chunk_mp(inp_file, start_loc_list, patch_size):
    with OpenSlide(inp_file) as slide:
        for start_loc in start_loc_list:
            region = slide.read_region(start_loc, 0, [patch_size, patch_size])

def load_tile_cucim_chunk_mp(inp_file, start_loc_list, patch_size):
    slide = CuImage(inp_file)
    for start_loc in start_loc_list:
        region = slide.read_region(start_loc, [patch_size, patch_size], 0)

openslide_tot_time = 0
cucim_tot_time = 0
print("Total # of processes : {}".format(num_processes))
for num_workers in range(1, num_processes + 1):

    print("# of processes : {}".format(num_workers))
    openslide_time = 0
    
    with OpenSlide(input_file) as slide:
        width, height = slide.dimensions
        
        start_loc_data = [(sx, sy)
                          for sy in range(start_location, height, patch_size)
                              for sx in range(start_location, width, patch_size)]

        # Plain floor division would give a chunk size of 0 for small images
        # (e.g. 32x32), so clamp it to at least 1.
        chunk_size = max(1, len(start_loc_data) // num_workers)
        
        start_loc_list_iter = [start_loc_data[i:i+chunk_size] for i in range(0, len(start_loc_data), chunk_size)]


        with Timer("  Process elapsed time (OpenSlide)") as timer:
            with concurrent.futures.ProcessPoolExecutor(
                max_workers=num_workers
            ) as executor:
                executor.map(
                    load_tile_openslide_chunk_mp,
                    repeat(input_file),
                    start_loc_list_iter,
                    repeat(patch_size)
                )
            openslide_time = timer.elapsed_time()
            openslide_tot_time += openslide_time

    cucim_time = 0
    slide = CuImage(input_file)
    start_loc_data = [(sx, sy)
                      for sy in range(start_location, height, patch_size)
                          for sx in range(start_location, width, patch_size)]
    # Plain floor division would give a chunk size of 0 for small images
    # (e.g. 32x32), so clamp it to at least 1.
    chunk_size = max(1, len(start_loc_data) // num_workers)
    start_loc_list_iter = [start_loc_data[i:i+chunk_size] for i in range(0, len(start_loc_data), chunk_size)]

    with Timer("  Process elapsed time (hipCIM)") as timer:
        with concurrent.futures.ProcessPoolExecutor(
            max_workers=num_workers
        ) as executor:
            executor.map(
                load_tile_cucim_chunk_mp,
                repeat(input_file),
                start_loc_list_iter,
                repeat(patch_size)
            )
        cucim_time = timer.elapsed_time()
        cucim_tot_time += cucim_time
    print("  Performance gain (OpenSlide/hipCIM): {}".format(openslide_time / cucim_time))

print("Total time (OpenSlide):", openslide_tot_time)
print("Total time (hipCIM):", cucim_tot_time)
print("Average performance gain (OpenSlide/hipCIM): {}".format(openslide_tot_time / cucim_tot_time))


Total # of processes : 5
# of processes : 1
  Process elapsed time (OpenSlide) : 9.910386636009207
  Process elapsed time (hipCIM) : 2.2090284360019723
  Performance gain (OpenSlide/hipCIM): 4.4863101237146585
# of processes : 2
  Process elapsed time (OpenSlide) : 5.090444343994022
  Process elapsed time (hipCIM) : 1.2175663139933022
  Performance gain (OpenSlide/hipCIM): 4.180835397210262
# of processes : 3
  Process elapsed time (OpenSlide) : 3.415929338996648
  Process elapsed time (hipCIM) : 0.8279977780039189
  Performance gain (OpenSlide/hipCIM): 4.1255295965063326
# of processes : 4
  Process elapsed time (OpenSlide) : 2.688147079999908
  Process elapsed time (hipCIM) : 0.7138953260000562
  Performance gain (OpenSlide/hipCIM): 3.765463902196351
# of processes : 5
  Process elapsed time (OpenSlide) : 2.166374384003575
  Process elapsed time (hipCIM) : 0.6026285010011634
  Performance gain (OpenSlide/hipCIM): 3.5948754172836455
Total time (OpenSlide): 23.27128178300336
Total time