![title](img/dask.png)

![title](img/parallel.jpg)

#### **Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love**

#### **Dask is composed of two parts:**

    1. Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.
    2. “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.
    
[detailed information here](https://docs.dask.org/en/latest/)
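
The two parts can be seen side by side in a minimal sketch. It assumes `dask` (with its array extras) is installed; the helper functions `inc` and `add` and the array sizes are illustrative choices, not part of the material above.

```python
import dask
import dask.array as da

# Part 1: task scheduling. dask.delayed wraps ordinary Python functions so
# that calling them records a task instead of running it immediately.
@dask.delayed
def inc(x):
    return x + 1

@dask.delayed
def add(x, y):
    return x + y

total = add(inc(1), inc(2))        # builds a small task graph; nothing runs yet
delayed_result = total.compute()   # the scheduler executes the graph

# Part 2: collections. A Dask array looks like a NumPy array but is split
# into chunks, and each chunk is handled by tasks on the same scheduler.
arr = da.ones((1000, 1000), chunks=(250, 250))   # a 4 x 4 grid of chunks
array_sum = arr.sum().compute()

print(delayed_result, array_sum)
```

Both halves end in `.compute()`: the collection API is a convenient way of generating the same kind of task graph that `dask.delayed` builds by hand.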

#### **Dask's features:**
    • Built in Python
    • Scales properly from single laptops to 1000-node clusters
    • Leverages and interops with existing Python APIs as much as possible
    • Adheres to (Tim Peters') "Zen of Python" (https://www.python.org/dev/peps/pep-0020/) ... especially these elements:
        ◦ Explicit is better than implicit.
        ◦ Simple is better than complex.
        ◦ Complex is better than complicated.
        ◦ Readability counts. [ed: that goes for docs, too!]
        ◦ Special cases aren't special enough to break the rules.
        ◦ Although practicality beats purity.
        ◦ In the face of ambiguity, refuse the temptation to guess.
        ◦ If the implementation is hard to explain, it's a bad idea.
        ◦ If the implementation is easy to explain, it may be a good idea.
    • While we're borrowing inspiration, Dask embodies one of Perl's slogans: making easy things easy and hard things possible
        ◦ Specifically, it supports common data-parallel abstractions like Pandas and NumPy
        ◦ But it also allows scheduling arbitrary custom computation that doesn't fit a preset mold
   

    


#### **Dask emphasizes the following virtues:**


    • Familiar: Provides parallelized NumPy array and Pandas DataFrame objects
    • Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.
    • Native: Enables distributed computing in pure Python with access to the PyData stack.
    • Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms
    • Scales up: Runs resiliently on clusters with 1000s of cores
    • Scales down: Trivial to set up and run on a laptop in a single process
    • Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans.

In [8]:
import dask.dataframe as dd
import s3fs
# ddf = dd.read_csv('/home/koustav/Documents/DataframeWar/Data/yellow_tripdata_2020-01.csv',
#                        dtype={'RatecodeID': 'float64', 'VendorID': 'float64', 'passenger_count':'float64', 'payment_type': 'float64'},assume_missing=True, blocksize=12e6)

taxi_dtypes = {
    'store_and_fwd_flag': str,
    'RatecodeID': 'float64',
    'VendorID': 'float64',
    'passenger_count': 'float64',
    'payment_type': 'float64',
}



ddf = dd.read_csv(
    's3://nyc-tlc/trip data/yellow_tripdata_2020-01.csv',
    dtype=taxi_dtypes, 
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    storage_options={'anon': True},
)

In [10]:
%%time
len(ddf)

CPU times: user 5 µs, sys: 1e+03 ns, total: 6 µs
Wall time: 10 µs


In [6]:
ddf.loc[0]

| npartitions=127 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | float64 | datetime64[ns] | datetime64[ns] | float64 | float64 | float64 | object | int64 | int64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


In [3]:
len(ddf)

6405008

In [3]:
ddf_part = dd.read_csv(
    '/home/koustav/Documents/DataframeWar/Data/yellow_tripdata_2020-01.csv',
    dtype={
        'RatecodeID': 'float64',
        'VendorID': 'float64',
        'passenger_count': 'float64',
        'payment_type': 'float64',
    },
    assume_missing=True,
)
ddf_part = ddf_part.repartition(npartitions=100)

In [1]:
import pandas as pd
pdf = pd.read_csv('/home/koustav/Documents/DataframeWar/Data/yellow_tripdata_2020-01.csv')



#### **DataFrame shapes**

In [4]:
print(f'Pandas shape: {pdf.shape}')
print('---------------------------')
print(f'Dask lazy shape: {ddf_part.shape}')

Pandas shape: (6405008, 18)
---------------------------
Dask lazy shape: (Delayed('int-e4c39bbb-1014-470f-8a9f-9c789aca477c'), 18)


In [5]:
print(f'Dask computed shape: {len(ddf.index):,}')  # expensive


Dask computed shape: 6,405,008


In [5]:
len(ddf_part)



6405008

In [6]:
ddf_part.loc[0]

| npartitions=100 | VendorID | tpep_pickup_datetime | tpep_dropoff_datetime | passenger_count | trip_distance | RatecodeID | store_and_fwd_flag | PULocationID | DOLocationID | payment_type | fare_amount | extra | mta_tax | tip_amount | tolls_amount | improvement_surcharge | total_amount | congestion_surcharge |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | float64 | object | object | float64 | float64 | float64 | object | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 | float64 |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


![title](img/dask-dataframe.png)

In [11]:
import dask.dataframe as dd

In [8]:
ddf = dd.read_csv('/home/koustav/Documents/DataframeWar/Data/weather.csv')

In [16]:
ddf.npartitions
len(ddf)
ddf.loc[0]

| npartitions=1 | date | temp | wind | rainfall | humidity |
| --- | --- | --- | --- | --- | --- |
| | int64 | float64 | float64 | float64 | float64 |
| | ... | ... | ... | ... | ... |


In [21]:
ddf_part = ddf.repartition(npartitions=10)


In [22]:
len(ddf_part)
ddf_part.loc[0]

| npartitions=10 | date | temp | wind | rainfall | humidity |
| --- | --- | --- | --- | --- | --- |
| | int64 | float64 | float64 | float64 | float64 |
| | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... |


[Strategies for choosing the number of partitions](https://stackoverflow.com/questions/44657631/strategy-for-partitioning-dask-dataframes-efficiently)

In [26]:
ddf_part.map_partitions(type).compute()


0    <class 'pandas.core.frame.DataFrame'>
1    <class 'pandas.core.frame.DataFrame'>
2    <class 'pandas.core.frame.DataFrame'>
3    <class 'pandas.core.frame.DataFrame'>
4    <class 'pandas.core.frame.DataFrame'>
5    <class 'pandas.core.frame.DataFrame'>
6    <class 'pandas.core.frame.DataFrame'>
7    <class 'pandas.core.frame.DataFrame'>
8    <class 'pandas.core.frame.DataFrame'>
9    <class 'pandas.core.frame.DataFrame'>
dtype: object

In [25]:
ddf.head()


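| | date | temp | wind | rainfall | humidity |
| --- | --- | --- | --- | --- | --- |
| 0 | 19900101 | 28.4 | 17.96 | 20.4 | 32.18 |
| 1 | 19900102 | 35.5 | 2.23 | 0.0 | 22.84 |
| 2 | 19900103 | 17.4 | 9.06 | 0.0 | 29.38 |
| 3 | 19900104 | 28.4 | 1.57 | 0.0 | 26.3 |
| 4 | 19900105 | 28.3 | 0.3 | 0.0 | 26.75 |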


>### Partitions/Chunks and Tasks

Remember that Dask is a scheduler for regular Python functions operating on (and producing) regular Python objects.

Your partitions, chunks, or data segments should be small enough to comfortably fit in RAM for each worker thread/core.

That is...
* with a 1 GB worker with 1 core, you want to keep your partitions below 1 GB
* with 2 x 1 GB workers with 1 core each, you still want partitions below 1 GB
* with n x 4 GB workers with 2 cores per worker, you want partitions below 2 GB

It's also good to take into account that more memory may be used for operations than the data chunk size itself, and that it's helpful to have a few chunks of data available to keep Dask's worker cores busy. 

So we might want to take those numbers above and make them 2-4x smaller (or, equivalently, create 2-4x as many partitions).
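
The rule of thumb above is simple arithmetic, sketched here in plain Python. The function names, the default safety factor of 3, and the 100 GB dataset are illustrative assumptions; nothing here is a Dask API.

```python
import math

GB = 1024 ** 3

def max_partition_bytes(worker_memory_bytes, cores_per_worker, safety_factor=1):
    """Upper bound on partition size: worker memory per core,
    divided by a safety factor (2-4 per the guidance above)."""
    return worker_memory_bytes / cores_per_worker / safety_factor

def suggested_npartitions(dataset_bytes, worker_memory_bytes,
                          cores_per_worker, safety_factor=3):
    """Smallest partition count that keeps each partition under the bound."""
    limit = max_partition_bytes(worker_memory_bytes, cores_per_worker, safety_factor)
    return math.ceil(dataset_bytes / limit)

# The three cases from the list above, with no safety factor applied.
print(max_partition_bytes(1 * GB, 1) / GB)   # 1 GB worker, 1 core
print(max_partition_bytes(4 * GB, 2) / GB)   # 4 GB worker, 2 cores

# A hypothetical 100 GB dataset on 4 GB / 2-core workers, 4x safety factor.
print(suggested_npartitions(100 * GB, 4 * GB, 2, safety_factor=4))
```

The number of workers does not appear in the calculation: it determines how many partitions can be processed at once, not how large each one may be.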

Generally speaking, a lot of tasks is not a bad thing. Scheduling overhead for each additional task is typically less than 1 millisecond, and can be a lot less.

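That said, per-task overhead adds up at very large scales. With, say, a billion tasks, even one microsecond per task comes to roughly 17 minutes of scheduling, and a full millisecond per task would come to more than 11 days. In that case you may want to simplify your task graph or use larger (and hence fewer) partitions/chunks.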
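
A back-of-the-envelope estimate makes the trade-off concrete. This sketch assumes a fixed cost per task; the 1 ms and 1 µs figures are illustrative bounds rather than measurements, and `scheduling_overhead_seconds` is a hypothetical helper, not a Dask function.

```python
def scheduling_overhead_seconds(n_tasks, seconds_per_task=1e-3):
    """Total scheduler time if every task costs the same fixed overhead."""
    return n_tasks * seconds_per_task

for n_tasks in (10**3, 10**6, 10**9):
    slow = scheduling_overhead_seconds(n_tasks, 1e-3)   # 1 ms per task
    fast = scheduling_overhead_seconds(n_tasks, 1e-6)   # 1 microsecond per task
    print(f"{n_tasks:>13,} tasks: {slow:>12,.1f} s at 1 ms, {fast:>10,.3f} s at 1 us")
```

Under these assumptions the overhead is negligible for thousands of tasks and becomes the dominant cost somewhere between millions and billions of tasks, which is where fewer, larger partitions start to pay off.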