# A Quick Look at Dask

---

&emsp;Dask is typically imported as follows. Depending on the type of data you are working with (DataFrame, array, or list), you may not need all of these imports.

In [1]:
import numpy as np
import pandas as pd

import dask.dataframe as dd
import dask.array as da
import dask.bag as db

---

## Dask DataFrame

### Creating a Dask Object

&emsp;First, let's use a Dask DataFrame to work with data held in a pandas DataFrame.

In [2]:
index = pd.date_range("2021-09-01", periods=2400, freq="1H")
df = pd.DataFrame({"a": np.arange(2400), "b": list("abcaddbe" * 300)}, index=index)
df

| | a | b |
|---|---|---|
| 2021-09-01 00:00:00 | 0 | a |
| 2021-09-01 01:00:00 | 1 | b |
| 2021-09-01 02:00:00 | 2 | c |
| 2021-09-01 03:00:00 | 3 | a |
| 2021-09-01 04:00:00 | 4 | d |
| ... | ... | ... |
| 2021-12-09 19:00:00 | 2395 | a |
| 2021-12-09 20:00:00 | 2396 | d |
| 2021-12-09 21:00:00 | 2397 | d |
| 2021-12-09 22:00:00 | 2398 | b |


In [3]:
ddf = dd.from_pandas(df, npartitions=10)
ddf

| npartitions=10 | a | b |
|---|---|---|
| 2021-09-01 00:00:00 | int32 | string |
| 2021-09-11 00:00:00 | ... | ... |
| ... | ... | ... |
| 2021-11-30 00:00:00 | ... | ... |
| 2021-12-09 23:00:00 | ... | ... |


&emsp;We now have a Dask DataFrame with 2 columns and 2400 rows. It splits the 2400 rows into 10 partitions, so each partition holds 240 rows. A partition here is simply a chunk of the data.

&emsp;Here are a few key attributes of a Dask DataFrame.

In [4]:
# Check the index values that bound each partition
ddf.divisions

(Timestamp('2021-09-01 00:00:00'),
 Timestamp('2021-09-11 00:00:00'),
 Timestamp('2021-09-21 00:00:00'),
 Timestamp('2021-10-01 00:00:00'),
 Timestamp('2021-10-11 00:00:00'),
 Timestamp('2021-10-21 00:00:00'),
 Timestamp('2021-10-31 00:00:00'),
 Timestamp('2021-11-10 00:00:00'),
 Timestamp('2021-11-20 00:00:00'),
 Timestamp('2021-11-30 00:00:00'),
 Timestamp('2021-12-09 23:00:00'))

In [5]:
# Access a specific partition
ddf.partitions[1]

| npartitions=1 | a | b |
|---|---|---|
| 2021-09-11 | int32 | string |
| 2021-09-21 | ... | ... |


### Indexing

&emsp;Indexing a Dask DataFrame works much like slicing a pandas DataFrame.

In [6]:
ddf.b

Dask Series Structure:
npartitions=10
2021-09-01 00:00:00    string
2021-09-11 00:00:00       ...
                        ...  
2021-11-30 00:00:00       ...
2021-12-09 23:00:00       ...
Name: b, dtype: string
Dask Name: getitem, 3 graph layers

In [7]:
ddf["2021-10-01": "2021-10-09 5:00"]

| npartitions=1 | a | b |
|---|---|---|
| 2021-10-01 00:00:00.000000000 | int32 | string |
| 2021-10-09 05:00:59.999999999 | ... | ... |


### Computation

&emsp;Results are not computed until you ask for them; instead, Dask builds a task graph that describes the computation. To get the actual result, call `compute`.

In [8]:
ddf["2021-10-01": "2021-10-09 5:00"].compute()

| | a | b |
|---|---|---|
| 2021-10-01 00:00:00 | 720 | a |
| 2021-10-01 01:00:00 | 721 | b |
| 2021-10-01 02:00:00 | 722 | c |
| 2021-10-01 03:00:00 | 723 | a |
| 2021-10-01 04:00:00 | 724 | d |
| ... | ... | ... |
| 2021-10-09 01:00:00 | 913 | b |
| 2021-10-09 02:00:00 | 914 | c |
| 2021-10-09 03:00:00 | 915 | a |
| 2021-10-09 04:00:00 | 916 | d |


### Methods

&emsp;Dask DataFrame methods largely mirror their pandas counterparts. Calling a method sets up the task graph, and calling `compute` then returns the result.

In [9]:
ddf.a.mean()

dd.Scalar<series-..., dtype=float64>

In [10]:
ddf.a.mean().compute()

1199.5

In [11]:
ddf.b.unique()

Dask Series Structure:
npartitions=1
    string
       ...
Name: b, dtype: string
Dask Name: unique-agg, 5 graph layers

In [12]:
ddf.b.unique().compute()

0    a
1    b
2    c
3    d
4    e
Name: b, dtype: string

&emsp;As in pandas, methods can be chained together.

In [13]:
result = ddf["2021-10-01": "2021-10-09 5:00"].a.cumsum() - 100
result

Dask Series Structure:
npartitions=1
2021-10-01 00:00:00.000000000    int32
2021-10-09 05:00:59.999999999      ...
Name: a, dtype: int32
Dask Name: sub, 8 graph layers

In [14]:
result.compute()

2021-10-01 00:00:00       620
2021-10-01 01:00:00      1341
2021-10-01 02:00:00      2063
2021-10-01 03:00:00      2786
2021-10-01 04:00:00      3510
                        ...  
2021-10-09 01:00:00    158301
2021-10-09 02:00:00    159215
2021-10-09 03:00:00    160130
2021-10-09 04:00:00    161046
2021-10-09 05:00:00    161963
Freq: H, Name: a, Length: 198, dtype: int32

### Visualizing the Task Graph

&emsp;So far, we have set up computations and called `compute`. In addition, before calling `compute`, you can inspect the task graph to understand and review how the computation will be carried out.

In [15]:
result.dask

| # | layer_type | is_materialized | number of outputs | npartitions | depends on |
|---|---|---|---|---|---|
| 0 | MaterializedLayer | True | 10 | 10 | |
| 1 | Blockwise | False | 10 | 10 | from_pandas-6142e84c1d4011397f444f391cd4e05f |
| 2 | MaterializedLayer | True | 1 | 1 | to_pyarrow_string-0f3b37cf4394c7ec6c87ad2f07e26a91 |
| 3 | Blockwise | False | 1 | | loc-f142afadbb298ee83a952d7ae89f15bd |
| 4 | Blockwise | False | 1 | | getitem-4c2d6071b79fb3cda430b7c7b592b9c1 |
| 5 | Blockwise | False | 1 | | series-cumsum-map-a07067c6901a2e403c12ad5e674204bb |
| 6 | MaterializedLayer | True | 1 | | series-cumsum-map-a07067c6901a2e403c12ad5e674204bb, series-cumsum-take-last-156f128c2ac1566b7ee1873a77b6286c |
| 7 | Blockwise | True | 1 | | series-cumsum-014418bacdeef6b8de8aca9141b970c7 |

Layers 0 to 2 additionally report columns ['a', 'b'], type dask.dataframe.core.DataFrame, dataframe_type pandas.core.frame.DataFrame, and series_dtypes {'a': dtype('int32'), 'b': string[pyarrow]}.

&emsp;`visualize` draws the task graph with Graphviz, so the Graphviz executables must be on the system `PATH`; otherwise the call fails:

In [20]:
result.visualize()

ExecutableNotFound: failed to execute WindowsPath('dot'), make sure the Graphviz executables are on your systems' PATH

&emsp;On Windows, the Graphviz directory can be appended to `PATH` for the current session. Appending it as an ordinary string literal, `"C:\Program Files\Graphviz\bin"`, does not work, because `\b` is a backspace escape (`\x08`). The last entry of the resulting `PATH` shows the damage:

In [17]:
import os

In [21]:
os.environ["PATH"]

'C:\\Users\\BEGAS_15\\PycharmProjects\\test_dask\\venv/Scripts;C:\\Program Files (x86)\\Common Files\\Oracle\\Java\\javapath;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.7\\bin;C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.7\\libnvvp;C:\\Windows\\system32;C:\\Windows;C:\\Windows\\System32\\Wbem;C:\\Windows\\System32\\WindowsPowerShell\\v1.0\\;C:\\Windows\\System32\\OpenSSH\\;C:\\Program Files\\NVIDIA Corporation\\Nsight Compute 2022.2.1\\;C:\\Program Files (x86)\\NVIDIA Corporation\\PhysX\\Common;C:\\WINDOWS\\system32;C:\\WINDOWS;C:\\WINDOWS\\System32\\Wbem;C:\\WINDOWS\\System32\\WindowsPowerShell\\v1.0\\;C:\\WINDOWS\\System32\\OpenSSH\\;C:\\Program Files\\Git\\cmd;C:\\Program Files\\dotnet\\;C:\\Users\\BEGAS_15\\AppData\\Local\\Microsoft\\WindowsApps;;C:\\Program Files\\JetBrains\\PyCharm Community Edition 2022.3.1\\bin;;C:\\Program Files\\Graphviz\x08in'

&emsp;Writing the path as a raw string keeps the backslashes literal:

In [19]:
# Raw string: "\b" stays as a backslash followed by "b" instead of a backspace
os.environ["PATH"] += os.pathsep + r"C:\Program Files\Graphviz\bin"
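
&emsp;The `\x08in` at the end of the `PATH` output above comes from how Python reads `\b` in an ordinary string literal. The sketch below shows the difference between the two spellings and then uses the standard library's `shutil.which` to check whether `dot` can be found. The Graphviz directory is the machine-specific location used in this notebook and is only an assumption for other systems; `graphviz_dir` and `dot_path` are illustrative names.

```python
import os
import shutil

# In an ordinary literal, "\b" is the backspace character (0x08)
plain = "Graphviz\bin"
raw = r"Graphviz\bin"
print(repr(plain))  # 'Graphviz\x08in'
print(repr(raw))    # 'Graphviz\\bin'

# Machine-specific install location from this notebook; adjust as needed
graphviz_dir = r"C:\Program Files\Graphviz\bin"
if os.path.isdir(graphviz_dir):
    os.environ["PATH"] += os.pathsep + graphviz_dir

# shutil.which returns the full path to "dot" if it is on PATH, otherwise None
dot_path = shutil.which("dot")
print(dot_path)
```

&emsp;If `dot_path` is `None`, Graphviz is either not installed or installed in a different directory, and `visualize` will keep raising `ExecutableNotFound`.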