# 10+ Minutes to Dask

<a href="https://colab.research.google.com/github/shauryashaurya/learn-data-munging/blob/main/03.001%20-%2010%2B%20minutes%20to%20dask.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np
import pandas as pd
import dask
import dask.dataframe as dd
import dask.array as da
import dask.bag as db

# Dask Objects

## Dask DataFrames

Dask DataFrames coordinate many pandas DataFrames, partitioned along an index.  
They support a subset of the pandas API.  


In [2]:
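# build a Dask DataFrame from a pandas DataFrame
# start with an hourly DatetimeIndex of 1000 timestamps
# (pandas 2.2+ prefers the lowercase alias "h" over "H")
idx = pd.date_range("2023-05-06", periods=1000, freq="1H")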

In [3]:
idx

DatetimeIndex(['2023-05-06 00:00:00', '2023-05-06 01:00:00',
               '2023-05-06 02:00:00', '2023-05-06 03:00:00',
               '2023-05-06 04:00:00', '2023-05-06 05:00:00',
               '2023-05-06 06:00:00', '2023-05-06 07:00:00',
               '2023-05-06 08:00:00', '2023-05-06 09:00:00',
               ...
               '2023-06-16 06:00:00', '2023-06-16 07:00:00',
               '2023-06-16 08:00:00', '2023-06-16 09:00:00',
               '2023-06-16 10:00:00', '2023-06-16 11:00:00',
               '2023-06-16 12:00:00', '2023-06-16 13:00:00',
               '2023-06-16 14:00:00', '2023-06-16 15:00:00'],
              dtype='datetime64[ns]', length=1000, freq='H')

In [4]:
pd_df = pd.DataFrame({"a": np.arange(1000), "b": list("abcd"*250)}, index = idx)

In [5]:
pd_df

| | a | b |
|---|---|---|
| 2023-05-06 00:00:00 | 0 | a |
| 2023-05-06 01:00:00 | 1 | b |
| 2023-05-06 02:00:00 | 2 | c |
| 2023-05-06 03:00:00 | 3 | d |
| 2023-05-06 04:00:00 | 4 | a |
| ... | ... | ... |
| 2023-06-16 11:00:00 | 995 | d |
| 2023-06-16 12:00:00 | 996 | a |
| 2023-06-16 13:00:00 | 997 | b |
| 2023-06-16 14:00:00 | 998 | c |


In [6]:
dask_df = dd.from_pandas(pd_df, npartitions=10)

In [7]:
dask_df

| npartitions=10 | a | b |
|---|---|---|
| 2023-05-06 00:00:00 | int32 | string |
| 2023-05-10 04:00:00 | ... | ... |
| ... | ... | ... |
| 2023-06-12 12:00:00 | ... | ... |
| 2023-06-16 15:00:00 | ... | ... |


In [8]:
dask_df.divisions

(Timestamp('2023-05-06 00:00:00'),
 Timestamp('2023-05-10 04:00:00'),
 Timestamp('2023-05-14 08:00:00'),
 Timestamp('2023-05-18 12:00:00'),
 Timestamp('2023-05-22 16:00:00'),
 Timestamp('2023-05-26 20:00:00'),
 Timestamp('2023-05-31 00:00:00'),
 Timestamp('2023-06-04 04:00:00'),
 Timestamp('2023-06-08 08:00:00'),
 Timestamp('2023-06-12 12:00:00'),
 Timestamp('2023-06-16 15:00:00'))

In [9]:
dask_df.partitions[1]

| npartitions=1 | a | b |
|---|---|---|
| 2023-05-10 04:00:00 | int32 | string |
| 2023-05-14 08:00:00 | ... | ... |


In [10]:
# data types of each of the columns
dask_df.dtypes

a              int32
b    string[pyarrow]
dtype: object

We can now use familiar pandas operations on the Dask DataFrame.

In [11]:
# get a subset based on index (date-time)
dask_df2 = dask_df.loc[idx[0:100]]

In [12]:
dask_df2

| npartitions=1 | a | b |
|---|---|---|
| 2023-05-06 00:00:00 | int32 | string |
| 2023-05-10 03:00:00 | ... | ... |


In [13]:
# perform analysis on the subset
dask_df2_grpby_count = dask_df2.groupby("b").count()

In [14]:
# Dask evaluates lazily:
# nothing happens until we call .compute()
dask_df2_grpby_count.compute()

| b | a |
|---|---|
| a | 25 |
| b | 25 |
| c | 25 |
| d | 25 |


## Dask Arrays

Dask arrays coordinate many NumPy arrays, arranged into chunks within a grid.  
They support a subset of the NumPy API.

In [15]:
np_array = np.arange(100000).reshape(200,500)

In [16]:
dask_array = da.from_array(np_array, chunks = (100,100))

In [17]:
dask_array

| | Array | Chunk |
|---|---|---|
| Bytes | 390.62 kiB | 39.06 kiB |
| Shape | (200, 500) | (100, 100) |
| Dask graph | 10 chunks in 1 graph layer | 10 chunks in 1 graph layer |
| Data type | int32 numpy.ndarray | int32 numpy.ndarray |


In [18]:
dask_array.chunks

((100, 100), (100, 100, 100, 100, 100))
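
The `chunks` tuple lists the block sizes along each axis, so a (200, 500) array cut into (100, 100) chunks forms a 2 by 5 grid of blocks. The following sketch checks that with `numblocks` and shows that a single block corresponds to a plain slice of the original NumPy array.

```python
import numpy as np
import dask.array as da

np_array = np.arange(100000).reshape(200, 500)
dask_array = da.from_array(np_array, chunks=(100, 100))

# chunks gives the block sizes per axis; numblocks counts the blocks per axis.
print(dask_array.chunks)
print(dask_array.numblocks)

# blocks[i, j] selects one block lazily; compute() returns its NumPy data.
block = dask_array.blocks[1, 3].compute()

# Block (1, 3) covers rows 100..199 and columns 300..399 of the original.
same_as_slice = np.array_equal(block, np_array[100:200, 300:400])
print(block.shape, same_as_slice)
```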

In [19]:
dask_array.blocks[1,3]

| | Array | Chunk |
|---|---|---|
| Bytes | 39.06 kiB | 39.06 kiB |
| Shape | (100, 100) | (100, 100) |
| Dask graph | 1 chunks in 2 graph layers | 1 chunks in 2 graph layers |
| Data type | int32 numpy.ndarray | int32 numpy.ndarray |


In [20]:
# let's play with a slightly more interesting example
# x is a matrix of random numbers
x = da.random.random((100, 100), chunks=(10,10))

In [21]:
x

| | Array | Chunk |
|---|---|---|
| Bytes | 78.12 kiB | 800 B |
| Shape | (100, 100) | (10, 10) |
| Dask graph | 100 chunks in 1 graph layer | 100 chunks in 1 graph layer |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [22]:
# operations work just like NumPy
y = x + x.T
y

| | Array | Chunk |
|---|---|---|
| Bytes | 78.12 kiB | 800 B |
| Shape | (100, 100) | (10, 10) |
| Dask graph | 100 chunks in 3 graph layers | 100 chunks in 3 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [23]:
z1 = y[::2, 50:].mean(axis=0)
z2 = y[::2, 50:].mean(axis=1)

In [24]:
z1

| | Array | Chunk |
|---|---|---|
| Bytes | 400 B | 80 B |
| Shape | (50,) | (10,) |
| Dask graph | 5 chunks in 7 graph layers | 5 chunks in 7 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [25]:
# to actually compute z1, let's use .compute()
z1.compute()

array([0.90160892, 0.96692917, 0.92467744, 0.93958085, 0.9467103 ,
       0.94339515, 0.98915593, 0.98704426, 0.88760331, 0.99642897,
       1.04525434, 1.11351807, 1.01448251, 1.09310032, 1.12733542,
       1.10749337, 0.99227195, 1.09288144, 0.936306  , 1.12143735,
       1.00998179, 1.02665084, 0.97864114, 1.04484189, 0.95065862,
       0.95256728, 1.00319956, 0.98897261, 1.01678169, 1.05561826,
       0.98927774, 0.88228437, 1.00065157, 0.95785663, 0.95374877,
       1.01668819, 1.10717317, 0.85006956, 1.10576345, 1.14880675,
       1.03569118, 1.075367  , 1.04725386, 0.99498959, 0.99719567,
       0.99527439, 0.94531162, 1.02718389, 0.97175097, 0.99791746])

In [26]:
z2

| | Array | Chunk |
|---|---|---|
| Bytes | 400 B | 40 B |
| Shape | (50,) | (5,) |
| Dask graph | 10 chunks in 7 graph layers | 10 chunks in 7 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [27]:
z2.compute()

array([0.93729662, 1.0657908 , 1.02576267, 0.96314044, 0.98717069,
       1.01674448, 0.99211418, 0.99074132, 1.0299492 , 1.01475323,
       1.02094913, 0.94097618, 1.03503597, 1.01536134, 1.12059305,
       1.02767104, 1.02767964, 1.02260553, 0.93744926, 1.0646103 ,
       0.93683829, 0.90634198, 1.06738739, 1.08563189, 0.98015144,
       0.89131072, 0.97274789, 0.98014751, 1.03899682, 1.04477446,
       1.0734944 , 1.02820681, 1.04757768, 1.0062819 , 0.98738677,
       1.06747535, 0.98914537, 0.96119075, 0.93506423, 1.04465264,
       0.93181948, 1.1068038 , 0.94677491, 1.11074815, 1.01911052,
       0.95226363, 0.94151463, 1.00751305, 0.98595969, 0.97167731])
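
The means cluster around 1.0 because every entry of `y` is the sum of two uniform draws from [0, 1), each with an expected value of 0.5. As a sanity check, the sketch below repeats the same slicing and reductions on a seeded NumPy array and confirms that Dask and NumPy agree. The fixed seed and the `from_array` wrapper are assumptions made here so the comparison is reproducible; the cells above use `da.random.random` directly.

```python
import numpy as np
import dask.array as da

# Seeded NumPy data (an assumption for repeatability), wrapped as a Dask array.
rng = np.random.default_rng(0)
x_np = rng.random((100, 100))
x = da.from_array(x_np, chunks=(10, 10))

# Same operations as above, still lazy at this point.
y = x + x.T
z1 = y[::2, 50:].mean(axis=0)  # mean down each of the 50 selected columns
z2 = y[::2, 50:].mean(axis=1)  # mean across each of the 50 selected rows

# The NumPy equivalent, evaluated eagerly.
y_np = x_np + x_np.T
z1_np = y_np[::2, 50:].mean(axis=0)
z2_np = y_np[::2, 50:].mean(axis=1)

print(np.allclose(z1.compute(), z1_np), np.allclose(z2.compute(), z2_np))
```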

## Dask Bag

A Bag is an unordered collection of objects that allows repeats. Use it for semi-structured or unstructured data.  
Bags are fun, but slower than DataFrames and arrays.  
The [examples](https://examples.dask.org/bag.html) page is really interesting.
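
Because a bag allows repeats, counting and de-duplicating are natural first operations. Here is a small sketch using `frequencies` and `distinct` on made-up letters. The `dask.config.set(scheduler="synchronous")` context is a choice made for this sketch so that it runs in a single process; the notebook cells rely on the default scheduler.

```python
import dask
import dask.bag as db

# A tiny bag with repeated items, split over two partitions.
letters = db.from_sequence(["a", "b", "a", "c", "b", "a"], npartitions=2)

with dask.config.set(scheduler="synchronous"):
    # frequencies() yields (item, count) pairs across all partitions.
    counts = dict(letters.frequencies().compute())
    # distinct() drops the repeats; a bag has no guaranteed order, so sort.
    unique = sorted(letters.distinct().compute())

print(counts)
print(unique)
```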

In [28]:
dask_bag = db.from_sequence([1,2,3,4,5,6,7,8,9,0], npartitions = 2)

In [29]:
dask_bag

dask.bag<from_sequence, npartitions=2>

In [30]:
dask_bag.take(2)

(1, 2)

In [31]:
# take(k) only reads the first partition by default, so it stays cheap
dask_bag.filter(lambda x: x>3).take(2)

(4, 5)
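
Since `take` reads only the first partition by default, it can return fewer elements than requested even when later partitions hold more matches. The sketch below asks for five values: the default call finds only the matches in the first partition, while `npartitions=-1` searches all partitions. The synchronous scheduler is again a choice made for this sketch.

```python
import dask
import dask.bag as db

dask_bag = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 0], npartitions=2)
greater_than_3 = dask_bag.filter(lambda x: x > 3)

with dask.config.set(scheduler="synchronous"):
    # Default: look at the first partition only (warn=False silences the
    # "insufficient elements" warning that would otherwise be raised).
    first_partition_only = greater_than_3.take(5, warn=False)
    # npartitions=-1 tells take to draw from every partition.
    all_partitions = greater_than_3.take(5, npartitions=-1)

print(first_partition_only)
print(all_partitions)
```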

In [32]:
# Here's how we take ALL across all partitions
dask_bag.filter(lambda x: x>3).compute()

[4, 5, 6, 7, 8, 9]

In [33]:
dask_bag.map(lambda x:x*x).take(5)

(1, 4, 9, 16, 25)

In [34]:
dask_bag.count().compute()

10
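
`count` is one of several reductions a bag offers. As a brief sketch, `map` can be chained with `sum`, `max`, or `mean` in the same way. The synchronous scheduler is a choice made for this sketch to keep it in a single process.

```python
import dask
import dask.bag as db

dask_bag = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 0], npartitions=2)

with dask.config.set(scheduler="synchronous"):
    # Each reduction is lazy until compute() is called.
    total_of_squares = dask_bag.map(lambda x: x * x).sum().compute()
    largest = dask_bag.max().compute()
    average = dask_bag.mean().compute()

print(total_of_squares, largest, average)
```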

In [35]:
# convert to a dask dataframe
# this is a trivial example
dask_df_from_bag = dask_bag.to_dataframe()

In [36]:
dask_df_from_bag

| npartitions=2 | 0 |
|---|---|
| | int64 |
| | ... |
| | ... |


### Build a bag from nested JSON and convert it to a DataFrame
* Step 1: define a 'flatten' function
* Step 2: map 'flatten' to the bag
* Step 3: convert the flattened bag to dataframe using bag_instance.to_dataframe()

Using example from https://examples.dask.org/bag.html

#### Create Random Data

In [37]:
import json
import os

In [38]:
os.makedirs("./data/dask-bag-example-01", exist_ok = True)

In [39]:
b = dask.datasets.make_people()

In [40]:
b.map(json.dumps).to_textfiles("./data/dask-bag-example-01/*.json")

['D:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/0.json',
 'D:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/1.json',
 'D:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/2.json',
 'D:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/3.json',
 'D:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/4.json',
 'D:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/5.json',
 'D:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/6.json',
 'D:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/7.json',
 'D:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/8.json',
 'D:/2/shaurya-lab/learn-data-munging/data/dask-bag-example-01/9.json']

#### Read JSON Data

In [41]:
# for windows
# !more .\data\dask-bag-example-01\0.json
# for linux
# !head -n 2 ./data/dask-bag-example-01/0.json

In [42]:
b = db.read_text('./data/dask-bag-example-01/*.json').map(json.loads)
b

dask.bag<loads, npartitions=10>
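
The write and read cells rely on a line-delimited layout: `to_textfiles` writes one file per partition with one JSON document per line, and `read_text` yields one string per line for `json.loads` to parse. A minimal round-trip sketch in a temporary directory follows. The records and the directory are hypothetical stand-ins, and the synchronous scheduler is a choice made for this sketch.

```python
import json
import os
import tempfile

import dask
import dask.bag as db

# Hypothetical nested records standing in for the generated people data.
records = [{"id": i, "tags": ["x", "y"]} for i in range(6)]

with tempfile.TemporaryDirectory() as tmpdir:
    pattern = os.path.join(tmpdir, "*.json")
    with dask.config.set(scheduler="synchronous"):
        # The * in the pattern is replaced by the partition number.
        paths = db.from_sequence(records, npartitions=2).map(json.dumps).to_textfiles(pattern)
        # Read every matching file back and parse each line as JSON.
        loaded = db.read_text(pattern).map(json.loads).compute()

print(len(paths), len(loaded))
```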

In [43]:
b.take(2)

({'age': 41,
  'name': ['Duncan', 'Ramos'],
  'occupation': 'Medical Officer',
  'telephone': '+1-878-428-6516',
  'address': {'address': '806 Storey Ferry', 'city': 'San Pablo'},
  'credit-card': {'number': '2626 2473 7537 3914',
   'expiration-date': '07/18'}},
 {'age': 27,
  'name': ['Layne', 'Morrow'],
  'occupation': 'Milliner',
  'telephone': '+18485702536',
  'address': {'address': '333 Jack Kerouac Square', 'city': 'Elizabeth'},
  'credit-card': {'number': '5423 1684 6073 8696',
   'expiration-date': '10/24'}})