In this notebook, we explore how well Dask handles the large volume of data in the LUNA16 dataset.

We first load the necessary libraries and set the path to the data folder. If the `all_mhd_files` list is empty, please make sure your data has been downloaded.

In [1]:
from LUNA16.utils.analyze_folders import analyze_folder
from LUNA16.utils.analyze_data_distribution import read_mhd
import matplotlib.pyplot as plt
import random
import numpy as np
import dask
import dask.array as da
from dask.distributed import Client
import time
%matplotlib inline
plt.rcParams["figure.figsize"] = [20, 8]
random.seed(123)

In [2]:
ROOT_FOLDER = "/home/azureuser/cloudfiles/data/LUNA16/extracted"
all_files = analyze_folder(ROOT_FOLDER)
assert len(all_files) == 3567
all_mhd_files = [file for file in all_files if file.extension == "mhd"]
assert len(all_mhd_files) == 1776

First experiment: a single file. We start with a standard NumPy calculation (the sum over one scan) to establish a baseline.

In [3]:
%%time
sample = all_mhd_files[0]
sample_array = read_mhd(sample)
print(np.sum(sample_array))

-25956655000.0
CPU times: user 18.5 ms, sys: 37 ms, total: 55.6 ms
Wall time: 74.2 ms


In [4]:
client = Client(n_workers=4)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 36221 instead


In [5]:
%%time
lazy_sample = dask.delayed(read_mhd)(sample)
lazy_sample

CPU times: user 1.45 ms, sys: 0 ns, total: 1.45 ms
Wall time: 925 µs


Delayed('read_mhd-5f148ce4-7c96-4f2f-b016-2e5d4857cbf2')
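The cell above returns almost instantly because `dask.delayed` only records the call and reads nothing from disk. A minimal sketch of that behaviour follows. It uses a hypothetical `load` function in place of `read_mhd`, and it forces the local threaded scheduler so the call counter stays visible in this process. Both choices are assumptions made for illustration.

```python
import dask

calls = []  # records every time the loader really runs


def load(path):
    """Hypothetical stand-in for read_mhd: logs the call and returns a dummy value."""
    calls.append(path)
    return len(path)


# Wrapping the call only builds a task; the function body has not run yet.
lazy = dask.delayed(load)("scan.mhd")
calls_before = len(calls)

# The work happens on compute(). The threaded scheduler is chosen explicitly
# so that the `calls` list is shared with the worker threads.
result = lazy.compute(scheduler="threads")
calls_after = len(calls)

print(calls_before, calls_after, result)
```

The counter only increases after `compute()`, which is why building the `Delayed` object takes microseconds even for a large file.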

In [6]:
# %%time
# lazy_array = da.from_delayed(lazy_sample, shape=sample_array.shape, dtype=np.float32)
# lazy_array

In [7]:
# %%time
# lazy_array.sum().compute()

In [8]:
lazy_array = da.from_delayed(lazy_sample, shape=(np.nan, 512, 512), dtype=np.float32)
lazy_array

         Array              Chunk
Bytes    unknown            unknown
Shape    (nan, 512, 512)    (nan, 512, 512)
Count    2 Tasks            1 Chunks
Type     float32            numpy.ndarray


In [9]:
lazy_array.sum().compute()

-25956655000.0
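The lazy sum matches the NumPy baseline even though the number of slices was declared as `np.nan`. The sketch below reproduces that pattern on synthetic data, as an illustration and not as the notebook's own pipeline. `fake_read_volume` is a hypothetical stand-in for `read_mhd`, and the 64 x 64 slice size is an assumption chosen to keep the example small.

```python
import numpy as np
import dask
import dask.array as da


def fake_read_volume(n_slices, seed=0):
    """Hypothetical stand-in for read_mhd: returns an (n_slices, 64, 64) float32 volume."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_slices, 64, 64)).astype(np.float32)


# Eager baseline: load the volume and sum it with NumPy.
eager_total = float(fake_read_volume(8).sum())

# Lazy version: the first axis is declared unknown (nan), as in the cell above.
lazy_volume = dask.delayed(fake_read_volume)(8)
lazy_array = da.from_delayed(lazy_volume, shape=(np.nan, 64, 64), dtype=np.float32)
unknown_len = lazy_array.shape[0]  # nan until the data has been loaded

# A full reduction does not need the chunk sizes, so it works on the nan shape.
lazy_total = float(lazy_array.sum().compute())

# compute_chunk_sizes() loads the data once to resolve the unknown axis.
resolved = lazy_array.compute_chunk_sizes()

print(eager_total, lazy_total, resolved.shape)
```

Declaring the unknown axis as `np.nan` avoids reading the file just to learn its shape, at the cost that operations needing exact chunk sizes (slicing along that axis, for example) require `compute_chunk_sizes()` first.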