## Why it doesn't make sense in Dask
To support the kinds of loops we want, we need to access individual elements. Dask is not good at that, and the workarounds are complex enough that nobody would write code like this. To see why, suppose we want to access the first element of a column. `.loc[0, "VendorID"]` does not do that, because a Dask DataFrame is split row-wise into multiple partitions and (AFAIK) each partition read by `read_csv()` has its own index starting at 0. So, if you index `0`, you in fact get 31 rows back (one per partition), which is not what you want. Working around that becomes too complex.
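
The duplicate-label effect described above can be reproduced with plain Pandas. This is a minimal sketch and not Dask itself: the three small `chunks` are made-up stand-ins for partitions, each with its own index starting at 0 (which, as far as I know, is how `dd.read_csv()` labels the rows of each partition).

```python
import pandas as pd

# Hypothetical stand-ins for partitions: three chunks, each indexed from 0,
# the way separately parsed blocks of one CSV file would be.
chunks = [
    pd.DataFrame({"VendorID": [2, 1, 1]}),
    pd.DataFrame({"VendorID": [1, 2, 2]}),
    pd.DataFrame({"VendorID": [2, 2, 1]}),
]

# Concatenating without ignore_index keeps every chunk's own 0-based labels,
# so the label 0 now appears once per chunk.
partitioned = pd.concat(chunks)

first_rows = partitioned.loc[0, "VendorID"]
print(type(first_rows).__name__)  # Series, not a scalar
print(first_rows.tolist())        # [2, 1, 2]: one value per chunk
print(first_rows.iloc[0])         # the single element we actually wanted
```

With 31 partitions instead of 3, the same lookup returns 31 values, and every partition has to be inspected to produce them.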

Nevertheless, the code below shows that indexing individual elements of a Dask DataFrame is _really_ slow. I will index one element of the Dask DataFrame while running _a whole loop of 14 iterations_ in Pandas, and Pandas will still be faster.

In [1]:
import utils
import pandas as pd
import dask.dataframe as dd

dask_df = dd.read_csv('../datasets/yellow_tripdata_2015-01.csv')
pandas_df = pd.read_csv('../datasets/yellow_tripdata_2015-01.csv')
# Trigger some computation before timing anything, because read_csv() with Dask returns very quickly (so, either it's
# actually that fast or we need to trigger a computation)
print(dask_df.head())

   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         2  2015-01-15 19:05:39   2015-01-15 19:23:42                1   
1         1  2015-01-10 20:33:38   2015-01-10 20:53:28                1   
2         1  2015-01-10 20:33:38   2015-01-10 20:43:41                1   
3         1  2015-01-10 20:33:39   2015-01-10 20:35:31                1   
4         1  2015-01-10 20:33:39   2015-01-10 20:52:58                1   

   trip_distance  pickup_longitude  pickup_latitude  RateCodeID  \
0           1.59        -73.993896        40.750111           1   
1           3.30        -74.001648        40.724243           1   
2           1.80        -73.963341        40.802788           1   
3           0.50        -74.009087        40.713818           1   
4           3.00        -73.971176        40.762428           1   

  store_and_fwd_flag  dropoff_longitude  dropoff_latitude  payment_type  \
0                  N         -73.974785         40.750618             1

In [2]:
%%time_cell
# Compute one element
dask_df.loc[0, "VendorID"].compute().iloc[0]

2

In [3]:
dask_time = _TIMED_CELL
print(f"Dask time: {dask_time:.1f}s")

Dask time: 14.7s


In [4]:
%%time_cell

# Do a whole loop with 14 iterations (range(1, 15)) and multiple operations

pandas_df['discourse_nr'] = 1
counter = 1

for i in range(1, 15):
    # Extend the running count while VendorID repeats; otherwise restart it
    if pandas_df.loc[i, 'VendorID'] == pandas_df.loc[i-1, 'VendorID']:
        counter += 1
    else:
        counter = 1
    pandas_df.loc[i, 'discourse_nr'] = counter

In [5]:
pandas_time = _TIMED_CELL
print(f"Pandas time: {pandas_time:.1f}s")

Pandas time: 0.0s


In [6]:
slowdown = dask_time / pandas_time
utils.print_md(f"### Dask is {slowdown:.1f}x slower.")

### Dask is 506.2x slower.