This group of notebooks covers concepts such as:

*Lazy evaluation and building delayed pipelines*

- Task graphs
- Lazy evaluation
- Processes, threads, and multiprocessing
- dask.delayed

*Parallel processing of big data*

- Analysis of large datasets
- Dask arrays
- Dask DataFrames
- Advanced use of data stores such as h5py, zarr, and parquet
- pandas and NumPy -> Dask

*Dask bags for unstructured data*

- Dask bags for large unstructured and semi-structured data such as JSON, text, images, or audio.

*Dask machine learning*

- Using LocalCluster and other clusters
- Dask ML
- Training ML models on big data
- Lazy evaluation with big data

This notebook gives a quick introduction to the first concepts needed to work with and understand Dask.

In [1]:
import dask.array as da

images = da.ones((10000,1000,1000))
images

Unnamed: 0,Array,Chunk
Bytes,74.51 GiB,119.21 MiB
Shape,"(10000, 1000, 1000)","(250, 250, 250)"
Count,640 Tasks,640 Chunks
Type,float64,numpy.ndarray
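The chunk figures in the summary above can be reproduced by hand. The following is a minimal sketch on a smaller stand-in array; the shape and the explicit `chunks=` value are assumptions chosen so that it runs quickly, not values taken from the notebook.

```python
import math

import dask.array as da

# Smaller stand-in for `images`; shape and chunks are illustrative assumptions.
x = da.ones((1000, 100, 100), chunks=(250, 25, 25))

# Number of blocks along each axis, and the total number of chunks.
blocks_per_axis = x.numblocks
n_chunks = math.prod(blocks_per_axis)

# Bytes per chunk = elements per chunk * bytes per element (float64 uses 8 bytes).
chunk_bytes = math.prod(x.chunksize) * x.dtype.itemsize

# Nothing has been computed so far; compute() runs the reduction chunk by chunk.
total = x.sum().compute()

print(blocks_per_axis, n_chunks, chunk_bytes, total)
```

The same arithmetic applied to the array above gives 40 × 4 × 4 = 640 chunks of 250³ × 8 bytes each, which is about 119.21 MiB per chunk.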


In [3]:
import numpy as np
import pandas as pd

import dask.dataframe as dd
import dask.array as da
import dask.bag as db

In [13]:
index = pd.date_range("2021-09-01", periods=2400, freq="1H")
df = pd.DataFrame({"a": np.arange(2400), "b": list("abcaddbe" * 300)}, index=index)

In [14]:
type(df)

pandas.core.frame.DataFrame

In [15]:
df.head()

Unnamed: 0,a,b
2021-09-01 00:00:00,0,a
2021-09-01 01:00:00,1,b
2021-09-01 02:00:00,2,c
2021-09-01 03:00:00,3,a
2021-09-01 04:00:00,4,d


In [26]:
df.shape

(2400, 2)

In [56]:
# Convert the pandas DataFrame into a Dask DataFrame split into 10 partitions.
ddf = dd.from_pandas(df, npartitions=10)
ddf

npartitions=10,a,b
2021-09-01 00:00:00,int32,object
2021-09-11 00:00:00,...,...
...,...,...
2021-11-30 00:00:00,...,...
2021-12-09 23:00:00,...,...


In [57]:
ddf.dask

0,1
layer_type,MaterializedLayer
is_materialized,True
npartitions,10
columns,"['a', 'b']"
type,dask.dataframe.core.DataFrame
dataframe_type,pandas.core.frame.DataFrame
series_dtypes,"{'a': dtype('int32'), 'b': dtype('O')}"


In [58]:
type(ddf)

dask.dataframe.core.DataFrame

In [59]:
# check the index values covered by each partition
ddf.divisions

(Timestamp('2021-09-01 00:00:00', freq='H'),
 Timestamp('2021-09-11 00:00:00', freq='H'),
 Timestamp('2021-09-21 00:00:00', freq='H'),
 Timestamp('2021-10-01 00:00:00', freq='H'),
 Timestamp('2021-10-11 00:00:00', freq='H'),
 Timestamp('2021-10-21 00:00:00', freq='H'),
 Timestamp('2021-10-31 00:00:00', freq='H'),
 Timestamp('2021-11-10 00:00:00', freq='H'),
 Timestamp('2021-11-20 00:00:00', freq='H'),
 Timestamp('2021-11-30 00:00:00', freq='H'),
 Timestamp('2021-12-09 23:00:00', freq='H'))

In [60]:
# access a particular partition
ddf.partitions[9]

npartitions=1,a,b
2021-11-30 00:00:00,int32,object
2021-12-09 23:00:00,...,...


In [61]:
# Dask is lazily evaluated: a result is not computed until you ask for it. Until then, Dask only builds a task graph for the computation.
# Whenever you have a Dask object and want its result, call compute():
ddf.b.compute()

2021-09-01 00:00:00    a
2021-09-01 01:00:00    b
2021-09-01 02:00:00    c
2021-09-01 03:00:00    a
2021-09-01 04:00:00    d
                      ..
2021-12-09 19:00:00    a
2021-12-09 20:00:00    d
2021-12-09 21:00:00    d
2021-12-09 22:00:00    b
2021-12-09 23:00:00    e
Freq: H, Name: b, Length: 2400, dtype: object

In [63]:
ddf.partitions[9].a.compute()

2021-11-30 00:00:00    2160
2021-11-30 01:00:00    2161
2021-11-30 02:00:00    2162
2021-11-30 03:00:00    2163
2021-11-30 04:00:00    2164
                       ... 
2021-12-09 19:00:00    2395
2021-12-09 20:00:00    2396
2021-12-09 21:00:00    2397
2021-12-09 22:00:00    2398
2021-12-09 23:00:00    2399
Freq: H, Name: a, Length: 240, dtype: int32

In [40]:
ddf["2021-09-01": "2021-10-21 5:00"].compute()

Unnamed: 0,a,b
2021-09-01 00:00:00,0,a
2021-09-01 01:00:00,1,b
2021-09-01 02:00:00,2,c
2021-09-01 03:00:00,3,a
2021-09-01 04:00:00,4,d
...,...,...
2021-10-21 01:00:00,1201,b
2021-10-21 02:00:00,1202,c
2021-10-21 03:00:00,1203,a
2021-10-21 04:00:00,1204,d


In [43]:
ddf.b.value_counts().compute()

a    600
b    600
d    600
c    300
e    300
Name: b, dtype: int64

In [44]:
ddf.a.mean().compute()

1199.5

In [49]:
ddf.describe().compute()

Unnamed: 0,a
count,2400.0
mean,1199.5
std,692.964646
min,0.0
25%,599.5
50%,1199.0
75%,1799.5
max,2399.0


In [75]:
# Recreate the Dask DataFrame, then add a derived column.
ddf = dd.from_pandas(df, npartitions=10)

ddf["a_1"] = ddf.a + 10
ddf.compute().head()

Unnamed: 0,a,b,a_1
2021-09-01 00:00:00,0,a,10
2021-09-01 01:00:00,1,b,11
2021-09-01 02:00:00,2,c,12
2021-09-01 03:00:00,3,a,13
2021-09-01 04:00:00,4,d,14


In [71]:
ddf.dask

0,1
layer_type,MaterializedLayer
is_materialized,True
npartitions,10
columns,"['a', 'b']"
type,dask.dataframe.core.DataFrame
dataframe_type,pandas.core.frame.DataFrame
series_dtypes,"{'a': dtype('int32'), 'b': dtype('O')}"

0,1
layer_type,Blockwise
is_materialized,False

0,1
layer_type,Blockwise
is_materialized,False

0,1
layer_type,Blockwise
is_materialized,False
npartitions,10
columns,"['a', 'b', 'a_1']"
type,dask.dataframe.core.DataFrame
dataframe_type,pandas.core.frame.DataFrame
series_dtypes,"{'a': dtype('int32'), 'b': dtype('O'), 'a_1': dtype('int32')}"
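Every operation adds layers to the graph, as the listing above shows. The sketch below illustrates the same growth on a small Dask array; it assumes a Dask version in which the `.dask` attribute of an array is a high-level graph exposing a `layers` mapping.

```python
import dask.array as da

# A small 4 x 4 array of ones, split into four 2 x 2 chunks.
x = da.ones((4, 4), chunks=(2, 2))

# Two more operations: an elementwise add and a full reduction.
y = (x + 1).sum()

# Assumption: `.dask` is a high-level graph with a `layers` mapping.
layers_before = len(x.dask.layers)
layers_after = len(y.dask.layers)

# Only now is the graph executed.
result = y.compute()

print(layers_before, layers_after, result)
```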


In [76]:
result = ddf["2021-10-01": "2021-10-09 5:00"].a.cumsum() - 100
result.compute().head()

2021-10-01 00:00:00     620
2021-10-01 01:00:00    1341
2021-10-01 02:00:00    2063
2021-10-01 03:00:00    2786
2021-10-01 04:00:00    3510
Freq: H, Name: a, dtype: int32

In [77]:
result.dask

0,1
layer_type,MaterializedLayer
is_materialized,True
npartitions,10
columns,"['a', 'b']"
type,dask.dataframe.core.DataFrame
dataframe_type,pandas.core.frame.DataFrame
series_dtypes,"{'a': dtype('int32'), 'b': dtype('O')}"

0,1
layer_type,Blockwise
is_materialized,False

0,1
layer_type,Blockwise
is_materialized,False

0,1
layer_type,Blockwise
is_materialized,False
npartitions,10
columns,"['a', 'b', 'a_1']"
type,dask.dataframe.core.DataFrame
dataframe_type,pandas.core.frame.DataFrame
series_dtypes,"{'a': dtype('int32'), 'b': dtype('O'), 'a_1': dtype('int32')}"

0,1
layer_type,MaterializedLayer
is_materialized,True
npartitions,1
columns,"['a', 'b', 'a_1']"
type,dask.dataframe.core.DataFrame
dataframe_type,pandas.core.frame.DataFrame
series_dtypes,"{'a': dtype('int32'), 'b': dtype('O'), 'a_1': dtype('int32')}"

0,1
layer_type,Blockwise
is_materialized,False

0,1
layer_type,Blockwise
is_materialized,False

0,1
layer_type,Blockwise
is_materialized,False

0,1
layer_type,MaterializedLayer
is_materialized,True

0,1
layer_type,Blockwise
is_materialized,True


**Low-Level Interfaces**

In [85]:
import dask


@dask.delayed
def inc(x):
    return x + 1


@dask.delayed
def add(x, y):
    return x + y


a = inc(1)       # no work has happened yet
b = inc(2)       # no work has happened yet
c = add(a, b)    # no work has happened yet

c = c.compute()  # This triggers all of the above computations
c

5
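The same pattern scales to a loop: delayed calls can be collected in a list and reduced with another delayed function. The following is a minimal sketch; the `double` helper and the input list are illustrative assumptions, not part of the notebook.

```python
import dask


@dask.delayed
def inc(x):
    return x + 1


@dask.delayed
def double(x):
    # Hypothetical second step, added only for illustration.
    return 2 * x


data = [1, 2, 3, 4]

# Build one small chain per input; nothing runs yet.
results = [double(inc(x)) for x in data]

# Wrap the built-in sum so the reduction is also part of the graph.
total = dask.delayed(sum)(results)

# compute() runs the independent chains (potentially in parallel) and then the sum.
value = total.compute()
print(value)  # (2 + 3 + 4 + 5) * 2 = 28
```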