In [1]:
import numpy as np
import pandas as pd
import math

In [2]:
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
import dask.delayed as delayed
from dask.distributed import Client

print('Dask lib dependencies imported properly!')

Dask lib dependencies imported properly!


The New York City taxi data set: it covers 10 years of trips, and the data file for each month is close to 1 GB in size. Loading all of it fully into memory, as pandas requires, takes a powerful computer. With Dask, however, the entire data set can be processed on a typical laptop. We’re going to step through the code that processes the NYC taxi data one piece at a time.

In [3]:
taxi_data = dd.read_csv('s3://nyc-tlc/trip data/yellow_tripdata_2018-04.csv',
                        storage_options = {'anon': True, 'use_ssl': False})

You can see that dask.dataframe.read_csv supports reading files directly from S3. The code here reads a single file, since each one is about 1 GB in size. To have Dask read all of the yellow taxi files, simply change yellow_tripdata_2018-04.csv to yellow_tripdata_*.csv. One thing to be aware of when making this change is that while Dask will improve the speed of processing the data, it won’t improve the speed of downloading it. So, depending on your connection speed, this may take a long time.

A better option to get an idea of the speed of Dask would be to download all of the data to your local system using the AWS CLI:

Simply remove the --dryrun flag to download the files. Then the read_csv function can be pointed to the data files location on local disk rather than in S3.

One thing to note about the read_csv function is that it doesn’t actually load your data into memory. Instead, it creates a task graph of the work that needs to be done to load the data into memory. The task graph is executed in a later part of the program. We can take a quick look at what the dask.dataframe looks like by printing it out:

In [4]:
taxi_data

,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
npartitions=13,,,,,,,,,,,,,,,,,
,int64,object,object,int64,float64,int64,object,int64,int64,int64,float64,float64,float64,float64,float64,float64,float64
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


At this point we see that the dataframe knows the structure of the data it will load and has divided the work into tasks. The data includes the columns VendorID, tpep_pickup_datetime and others, and is split into 13 partitions (npartitions=13). Dask divides the work of loading the data into 39 tasks. However, it hasn’t completed any of the tasks yet.

In [5]:
taxi_data = taxi_data.compute()

In [6]:
taxi_data

,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,1,2018-04-01 00:22:20,2018-04-01 00:22:26,1,0.00,1,N,145,145,2,2.5,0.5,0.5,0.00,0.0,0.3,3.80
1,1,2018-04-01 00:47:37,2018-04-01 01:08:42,1,6.70,1,N,152,90,2,22.5,0.5,0.5,0.00,0.0,0.3,23.80
2,1,2018-04-01 00:02:13,2018-04-01 00:17:52,2,4.10,1,N,239,158,1,15.5,0.5,0.5,3.35,0.0,0.3,20.15
3,1,2018-04-01 00:46:49,2018-04-01 00:52:05,1,0.70,1,N,90,249,1,5.5,0.5,0.5,1.35,0.0,0.3,8.15
4,1,2018-04-01 00:19:04,2018-04-01 00:19:09,1,0.00,1,N,145,145,2,2.5,0.5,0.5,0.00,0.0,0.3,3.80
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
603234,1,2018-04-30 23:15:20,2018-04-30 23:32:58,1,3.60,1,N,148,112,1,14.5,0.5,0.5,3.15,0.0,0.3,18.95
603235,2,2018-04-30 23:02:02,2018-04-30 23:03:37,5,0.01,1,N,151,151,2,3.0,0.5,0.5,0.00,0.0,0.3,4.30
603236,2,2018-04-30 23:38:18,2018-04-30 23:44:57,1,1.62,1,N,186,125,1,7.5,0.5,0.5,1.76,0.0,0.3,10.56
603237,2,2018-04-30 23:07:08,2018-04-30 23:23:04,1,6.36,1,N,261,162,2,20.0,0.5,0.5,0.00,0.0,0.3,21.30


The next step is to transform the data. In order to create a reasonably complex transformation process to show off what Dask can do, we’ll assume that we’re interested in the mean fare to go a specified distance. Since the time to travel a set distance may vary by traffic, we’ll also break the mean fare down by travel time. Here’s the code to calculate those numbers.

In [7]:
def roundup(x, base: int = 5):
    """Round `x` up to nearest `base`"""
    return int(math.ceil(x / float(base))) * base


def round_series_up(s: dd.Series) -> dd.Series:
    """Apply roundup function to all elements of `s`"""
    # roundup returns Python ints, so the declared meta dtype is an integer type
    return s.apply(roundup, meta=(s.name, 'int64'))


def transform_dask_dataframe(df: dd.DataFrame) -> dd.DataFrame:
    """Process NYC taxi data"""
    return (
        df[[
            'tpep_pickup_datetime', 'tpep_dropoff_datetime',
            'trip_distance', 'total_amount'
        ]]
        .astype({
            'tpep_pickup_datetime': 'datetime64[ms]',
            'tpep_dropoff_datetime': 'datetime64[ms]'
        })
        .assign(drive_time=(lambda df: (
            df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.seconds
            // 300))
        .assign(drive_time=lambda df: round_series_up(df.drive_time))
        .assign(trip_distance=lambda df: round_series_up(df.trip_distance))
        .query('drive_time <= 120 & trip_distance <= 50')
        .drop(['tpep_pickup_datetime', 'tpep_dropoff_datetime'], axis=1)
        .round({'trip_distance': 0})
        .groupby(['drive_time', 'trip_distance'])
        .mean()
        .rename(columns={'total_amount': 'avg_amount'})
    )

This code will look hauntingly familiar if you’re experienced with pandas. We see many functions which are old friends: groupby, drop and assign for example. Almost exactly the same code could be used to process a pandas.DataFrame if the data would fit into memory.
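To illustrate how close the two APIs are, here is a sketch of the same chain applied to a plain pandas.DataFrame. The three sample rows are hypothetical, the timestamps are parsed up front with pd.to_datetime rather than astype, and the redundant round step is left out.

```python
import math

import pandas as pd


def roundup(x, base: int = 5) -> int:
    """Round `x` up to the nearest multiple of `base`."""
    return int(math.ceil(x / float(base))) * base


# Hypothetical sample rows in the shape of the taxi data.
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2018-04-01 00:47:37", "2018-04-01 00:02:13", "2018-04-01 00:46:49"]),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2018-04-01 01:08:42", "2018-04-01 00:17:52", "2018-04-01 00:52:05"]),
    "trip_distance": [6.7, 4.1, 0.7],
    "total_amount": [23.80, 20.15, 8.15],
})

summary = (
    trips
    # e.g. the first trip lasts 1265 s: 1265 // 300 = 4, rounded up to 5
    .assign(drive_time=lambda df: (
        df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.seconds // 300)
    .assign(drive_time=lambda df: df.drive_time.apply(roundup))
    # distances 6.7, 4.1 and 0.7 fall into the buckets 10, 5 and 5
    .assign(trip_distance=lambda df: df.trip_distance.apply(roundup))
    .query("drive_time <= 120 & trip_distance <= 50")
    .drop(["tpep_pickup_datetime", "tpep_dropoff_datetime"], axis=1)
    .groupby(["drive_time", "trip_distance"])
    .mean()
    .rename(columns={"total_amount": "avg_amount"})
)
print(summary)
```

The main difference is that pandas evaluates each step immediately, so summary already holds the result here.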
At this point you might think we’re really getting close to our final result! Sorry to disappoint, but we’re still figuring out what tasks need to be done. Dask adds more steps to the task graph, but none of them has executed yet. Here’s what the dask.dataframe looks like now:

You can see that the only column which exists outside the index is avg_amount which stores the average fare for a given trip time and distance. The number of tasks has also grown. Over two hundred new tasks are added to the dataframe’s task graph to handle all of the data processing.

In [9]:
def compute_final_dataframe(df: dd.DataFrame) -> pd.DataFrame:
    """Execute dask task graph and compute final results"""
    return (
        df
        .compute()
        .reset_index()
        .pivot(
             index='drive_time',
             columns='trip_distance',
             values='avg_amount'
        )
        .fillna(0)
    )

We’ve finally reached the part of the code that does the data processing! Any time you call compute on a dask.dataframe, all of the tasks in its graph get executed. Calls to head or len also trigger execution. By default head only reads from the first partition, but every upstream task that partition depends on still has to run, and after a groupby that can be most of the graph. Basically, any time you want to inspect the data itself, Dask kicks off execution of the task graph. Development patterns will therefore differ from pandas. With pandas it’s easy to display the head of a DataFrame to make sure the processing is going as expected. Calling head on a dask.dataframe can kick off processing of potentially hundreds or thousands of tasks. So while it may be necessary during the development of a script, it should be avoided in any production processing job.

In [10]:
if __name__ == "__main__":
    client = Client()

    taxi_data = dd.read_csv(
        's3://nyc-tlc/trip data/yellow_tripdata_2018-04.csv',
        storage_options={'anon': True, 'use_ssl': False}
    )

    transformed_data = transform_dask_dataframe(taxi_data)
    fare_distribution = compute_final_dataframe(transformed_data)

    print(fare_distribution.to_string())

trip_distance          0          5          10         15         20          25          30          35          40          45          50
drive_time                                                                                                                                   
0               34.106532   6.777906  29.322293  41.465332  48.328954   53.812857   62.784265  114.779697  137.180000  188.779091  237.975000
5               13.162592  12.659089  28.653480  43.249538  64.541436   71.395503   91.944865   80.038333   52.800000    0.000000    0.000000
10              28.810443  25.936256  36.371690  49.040409  65.759698   70.834030   88.876753  131.893420  155.772594  182.284359  245.113750
15              35.985702  36.814331  49.385895  57.912027  66.867544   69.883872   85.950816  117.336853  153.899789  180.471179  220.812613
20              55.574459  30.579348  60.521722  68.039804  68.516193   72.595018   90.693380  108.568716  157.129796  161.684426  196.237400
25    

We’re finally looking at some real live data. The table shows the mean fare by distance and drive time. If you run this code on your laptop, Dask runs tasks on multiple cores in the background. With four cores on your machine, the processing can therefore run up to roughly four times faster than using pandas, although the actual speedup depends on the workload and on how fast the data can be read. While this is a nice performance boost on a single machine, the great thing about Dask is that the exact same code runs on a distributed cluster of up to hundreds of machines. The task scheduler takes care of everything in the background, and the only difference you notice is that the program runs much faster! For work that parallelizes well, the speed scales approximately linearly with the size of the cluster.