<a href="https://colab.research.google.com/github/AceroMike/Big-Data/blob/main/Dask_Arrays_vs_Numpy_Arrays.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Dask allows us to speed up computation time because it parallelizes the computations. That means that it can run multiple smaller computations simulataneously. NumPy, like Pandas, does not parallelize. But instead works on memory. Working on memory is great if you have the space for it but Dask let's you run computations on data that would not fit to memory. As we will see, parallelizing with Dask arrays will lead to faster compute times than working with NumPy arrays. 

In [1]:
# Installations

!pip install dask[complete] --quiet
!pip install dask distributed --upgrade --quiet
!pip install aiohttp --quiet
!pip install dask-ml --quiet

[K     |████████████████████████████████| 675kB 10.5MB/s 
[K     |████████████████████████████████| 112kB 27.5MB/s 
[31mERROR: distributed 2021.3.0 has requirement cloudpickle>=1.5.0, but you'll have cloudpickle 1.3.0 which is incompatible.[0m
[31mERROR: distributed 2021.3.0 has requirement dask>=2021.03.0, but you'll have dask 2.12.0 which is incompatible.[0m
[K     |████████████████████████████████| 931kB 12.8MB/s 
[K     |████████████████████████████████| 1.3MB 13.8MB/s 
[K     |████████████████████████████████| 296kB 43.8MB/s 
[K     |████████████████████████████████| 143kB 47.5MB/s 
[K     |████████████████████████████████| 143kB 12.6MB/s 
[K     |████████████████████████████████| 22.3MB 1.7MB/s 
[?25h

In [2]:
# Imports
import numpy as np
import dask.array as da

We will be comparing the runtimes for Numpy and Dask array operations on some made up data. Now let's create the data. 

1. Let's create a 10,000 x 10,000 array of random numbers
2. Then we will add this array with it's transpose
3. Filter the resulting array and calculate its mean

Remember, that we have to use `.compute()` for Dask to evaluate the results. Also notice that we will be setting the chunks parameter. This simply tells Dask how to partition the data into NumPy arrays. In this case, we will have 100 NumPy arrays of size 1000 x 1000. We will be changing these parameters to see how our time change. 

In [3]:
# With Dask
%time
x = da.random.random((10000,10000), chunks=(1000,1000))
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z.compute()

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 6.68 µs


array([1.00436239, 1.00289961, 0.99869341, ..., 0.98985282, 0.99361362,
       1.01671068])

In [4]:
# With NumPy
%%time
x = np.random.random((10000, 10000))
y = x + x.T
z = y[::2, 5000:].mean(axis=1)

CPU times: user 1.41 s, sys: 439 ms, total: 1.85 s
Wall time: 1.85 s


As we can see, thanks to parallelization, we are able to run the same computation faster with Dask than with NumPy. Now we want to see how the computation times change when we adjust the array we are working with. We will change the chunk size first to 500x500 and then to 250x250. Each time we reduce the chunk size our Dask array will have more individual Numpy arrays. Will this make computations faster? or slower?

In [6]:
%%timeit
x = da.random.random((10000, 10000), chunks=(500, 500))
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z.compute()

1 loop, best of 5: 1.04 s per loop


In [5]:
%%timeit
x = da.random.random((10000, 10000), chunks=(250, 250))
y = x + x.T
z = y[::2, 5000:].mean(axis=1)
z.compute()

1 loop, best of 5: 1.8 s per loop


As we might have expected, there is some diminishing returns to parralelization. If we ask Dask to divide the data too much then it will take even longer to run the computations. Setting the chunk size to a larger number leads to faster results because it leads to fewer computations. 