<a href="https://www.dask.org/" target="_blank">
<img src="http://dask.readthedocs.io/en/latest/_images/dask_horizontal.svg"
     align="right"
     width="30%"
     alt="Dask logo">
</a>

# Out-of-core computations

In this notebook, we will explain how Dask enables **out-of-core computations**, that is, the ability to process datasets larger than the computer's memory.

**Content**

1. Eager evaluation
2. Lazy evaluation
3. Out-of-core computation

**Learning outcomes**
* Define eager evaluation.
* Define lazy evaluation.
* Describe how lazy evaluation and domain decomposition enable out-of-core computations.

## 1. Eager evaluation

Python computations are **eagerly evaluated** by default, i.e., as soon as you define a computation, Python evaluates and executes it. This means that all the values required by the computation are loaded into main memory and the result is calculated immediately. You can see this behaviour in the following cells. Pay attention to the memory usage of each computation.

__1. Import required libraries, define required variables and functions__

_Hint: `memory_profiler` provides a set of notebook magics for measuring memory usage._

In [None]:
%load_ext memory_profiler

import numpy as np

shape = (10000,10000)

__2. Create array and measure memory usage__

_Hint: the `%%memit` magic measures the peak memory usage and how much the memory usage increased as a consequence of running the cell._

_Questions: How much data was loaded into main memory once the `x` and `y` arrays were created? Is the data being loaded eagerly?_

In [None]:
%%memit

x = np.ones(shape=shape)
y = np.ones(shape=shape)

__3. Perform an arithmetic operation and measure memory usage__

_Questions: How much data was loaded into main memory once the `z` array was created? Is the data being loaded eagerly?_

In [None]:
%%memit

z = x * y
z
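As a rough check on the `%%memit` readings, the expected size of each array follows from its shape and element size. The snippet below is a minimal sketch, assuming NumPy's default `float64` dtype (8 bytes per element); it only does arithmetic and allocates nothing large.

```python
import numpy as np

shape = (10000, 10000)

# np.ones defaults to float64, i.e. 8 bytes per element
itemsize = np.dtype(np.float64).itemsize
n_elements = shape[0] * shape[1]

# Size of a single array in MiB (1 MiB = 2**20 bytes)
array_mib = n_elements * itemsize / 2**20

print(f"one array: {array_mib:.2f} MiB")
print(f"x and y:   {2 * array_mib:.2f} MiB")
```

Each array should therefore account for roughly 763 MiB, and the increment reported by `%%memit` should be of the same order.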

## 2. Lazy evaluation

Dask, as opposed to the default Python execution behavior, uses **lazy evaluation**, i.e., when you define a computation, Dask records it (as a task graph) but does **NOT** execute it. This means that the data is not loaded into memory until it is needed. Pay attention to the memory usage of each of the following computations.
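The difference between defining and executing a computation can be seen on a tiny array. This is a minimal sketch; the shape and chunk sizes are illustrative choices, not values used elsewhere in this notebook.

```python
import numpy as np
import dask.array as da

x = da.ones(shape=(4, 4), chunks=(2, 2))
z = x + 1  # recorded in the task graph; nothing is computed yet

print(type(z))        # a Dask Array: a description of the work to do
result = z.compute()  # the task graph is executed here
print(type(result))   # a NumPy array holding the actual values
print(result.sum())
```

Until `compute()` is called, `z` only carries metadata (shape, dtype, chunks) and the recipe for producing its values.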

__1. Import required libraries, define required variables and functions__

In [None]:
%load_ext memory_profiler

import dask.array as da

shape = (10000,10000)
chunks = (1000,1000)

__2. Create "lazy" Dask Arrays and measure their memory usage__

_Hint: The arrays have been defined, but the data they hold has NOT been created yet. It will be created once it is required._

_Questions: How much data was loaded into the main memory once the `x` and `y` arrays were evaluated?_

In [None]:
%%memit

x = da.ones(shape=shape, chunks=chunks)
y = da.ones(shape=shape, chunks=chunks)

__3. Perform a "lazy" arithmetic operation and measure its memory usage__

_Hint: Again, the computation has been defined, but the result has not been computed yet. It will be computed once it is required._


In [None]:
%%memit

z = (x**2) + (y**2)
z

__4. Compute the result__ of the equation and measure its memory usage.

_Hint: Now the result is required, so all the computations, including the array creation and the array operations (the previous equation), are executed._

In [None]:
%%memit

z.compute()

## 3. Out-of-core computation

Dask achieves out-of-core computations by using **lazy evaluation**, **domain decomposition**, and **task scheduling**.

Example: Suppose you need to process a 20 GiB array, and your computer only has 4 GiB of memory.

* If you use NumPy, the array is loaded eagerly, so creating it fails because the whole array does not fit in memory.
* If you use Dask Array, the array is **lazily loaded** and, in addition, partitioned into chunks (**domain decomposition**).
* Finally, Dask loads and processes the array chunk by chunk. If each chunk is, for example, just 2 GiB, it fits in the available memory. Dask determines which chunk to load and process next using **task scheduling**.
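The chunk-by-chunk idea can be sketched in plain NumPy. This is only an illustration of domain decomposition under simplifying assumptions, not how Dask is implemented: the helper names `chunk_slices` and `chunked_sum_of_squares` are hypothetical, the chunks are visited in a fixed loop order instead of being scheduled, and the domain is small so the example runs anywhere.

```python
import numpy as np

def chunk_slices(shape, chunks):
    """Yield one (row_slice, col_slice) pair per chunk of a 2-D domain."""
    for r in range(0, shape[0], chunks[0]):
        for c in range(0, shape[1], chunks[1]):
            yield (slice(r, min(r + chunks[0], shape[0])),
                   slice(c, min(c + chunks[1], shape[1])))

def chunked_sum_of_squares(shape, chunks):
    """Sum x**2 + y**2 over an all-ones domain, one chunk at a time."""
    total = 0.0
    for rows, cols in chunk_slices(shape, chunks):
        chunk_shape = (rows.stop - rows.start, cols.stop - cols.start)
        # Only the data for this chunk is held in memory at this point
        x_chunk = np.ones(chunk_shape)
        y_chunk = np.ones(chunk_shape)
        total += float((x_chunk**2 + y_chunk**2).sum())
    return total

total = chunked_sum_of_squares(shape=(2000, 2000), chunks=(500, 500))
print(total)
```

Peak memory here depends on the chunk size, not on the size of the whole domain, which is the property that out-of-core processing relies on.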

In the following examples, you will see how **lazy evaluation**, **domain decomposition**, and **task scheduling** enable out-of-core computation.

__1. Import required libraries, define required variables and functions__

In [6]:
import numpy as np
import dask.array as da

# Shape of a 74.5 GiB array (100000 x 100000 float64 values)
shape = (100000,100000)
chunks = (5000,5000)

__2. Create two large arrays__

_Hint: if your computer has less than 74.5 GiB of memory, **you will get a memory error**, since the arrays are too large to fit into main memory._

_Question: Are these arrays eagerly or lazily evaluated? Do you think it is even possible to process a 74.5 GiB dataset on a computer with 4 GiB of memory?_

In [None]:
x = np.ones(shape=shape)
y = np.ones(shape=shape)
y

__3. Now create the same large arrays using Dask__

_Question: Are these arrays eagerly or lazily evaluated? Did you get the same memory error as before? Why can Dask define a 74.5 GiB dataset on a computer with 4 GiB of memory?_

In [7]:
x = da.ones(shape=shape, chunks=chunks)
y = da.ones(shape=shape, chunks=chunks)
y

| | Array | Chunk |
|---|---|---|
| Bytes | 74.51 GiB | 190.73 MiB |
| Shape | (100000, 100000) | (5000, 5000) |
| Dask graph | 400 chunks in 1 graph layer | |
| Data type | float64 numpy.ndarray | |
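The figures in this summary can also be read from the array's metadata without computing anything. A minimal sketch, using the `nbytes`, `numblocks` and `npartitions` attributes of a Dask Array:

```python
import dask.array as da

shape = (100000, 100000)
chunks = (5000, 5000)

# Lazy: no 74.5 GiB buffer is allocated here, only metadata and a task graph
x = da.ones(shape=shape, chunks=chunks)

array_gib = x.nbytes / 2**30
chunk_mib = chunks[0] * chunks[1] * x.dtype.itemsize / 2**20

print(f"array: {array_gib:.2f} GiB")
print(f"chunk: {chunk_mib:.2f} MiB")
print(f"blocks per axis: {x.numblocks}, total chunks: {x.npartitions}")
```

With 20 blocks along each axis, the array is split into 400 chunks of about 190.73 MiB each, which is the unit of work Dask loads and processes.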


__4. Define the equation__ using Dask. The computation is only recorded at this point, not executed.

_Question: How is Dask able to process a 74.5 GiB dataset on a computer with 4 GiB of memory?_

In [None]:
z = (x**2) + (y**2)
z

__5. Visualize the computations to be performed per array chunk.__

_Hint: `visualize` displays the graph of the computations to be performed on the array chunks._

In [None]:
z.visualize()

__6. Compute the result__

In [None]:
z.compute()
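One detail is worth checking before running this cell on a small machine: `compute()` returns the result as a single in-memory NumPy array, so a result as large as `z` only fits if the machine has enough memory to hold it. The snippet below is a minimal sketch, on a deliberately small array (the shape and chunk sizes are illustrative assumptions), of reducing the result to a scalar so that only chunk-sized intermediates and partial sums need to be held at any time.

```python
import dask.array as da

# Small illustrative sizes so the example runs on any machine
shape = (2000, 2000)
chunks = (500, 500)

x = da.ones(shape=shape, chunks=chunks)
y = da.ones(shape=shape, chunks=chunks)
z = (x**2) + (y**2)

# The reduction is applied per chunk and the partial sums are combined,
# so the full z array is never materialized at once
total = z.sum().compute()
print(total)
```

The same pattern applies to other reductions such as `mean()` or `max()`; when the full result is needed, writing it to disk chunk by chunk is the usual alternative to returning it in memory.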

# [Exercise 2](labs/Lab2.ipynb)