# NYC Taxi Dataset - Dask for Multi GPU

# Demo 6: A quick intro to Dask + RAPIDS

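Dask is a library for parallel and distributed computation that provides parallel versions of familiar collections such as arrays and DataFrames. For much more detail, see the Dask tutorial: https://tutorial.dask.org/

In these examples, we'll focus on the basics of `dask_cudf` (GPU DataFrames partitioned across workers) and `dask_cuda` (cluster setup for GPU workers).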

In [17]:
import numpy as np
import pandas as pd
import cuml
import cudf
import os

In [19]:
import dask, dask_cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait


In [20]:
# Set up a local cluster (one worker per visible GPU) and connect a client to it

cluster = LocalCUDACluster()
client = Client(cluster)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 37895 instead
  http_address["port"], self.http_server.port


In [21]:
client

| Client | Cluster |
| --- | --- |
| Scheduler: tcp://127.0.0.1:41491 | Workers: 2 |
| Dashboard: http://127.0.0.1:37895/status | Cores: 2 |
| | Memory: 100.01 GB |


In [22]:
%%time
ddf = dask_cudf.read_orc("yellow_tripdata_2014-03-cleaned.orc")

CPU times: user 253 ms, sys: 402 ms, total: 656 ms
Wall time: 606 ms


In [23]:
len(ddf)

15428127

In [24]:
# Compute a simple histogram of passengers

value_counts = ddf.passenger_count.value_counts()
print(value_counts)
print(value_counts.compute())

<dask_cudf.Series | 39 tasks | 1 npartitions>
1    10921910
2     2105662
5      882918
3      637338
6      584503
4      295668
0         120
7           4
8           3
9           1
Name: passenger_count, dtype: int64
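
Note that the first `print` shows only a lazy task graph; the counts appear once `.compute()` runs. Conceptually, a partitioned `value_counts` counts values within each partition and then merges the partial results. The sketch below illustrates that idea in plain Python on made-up data. The `partitions` list and the helper names are hypothetical, and this is not a description of how Dask or cuDF implement the operation internally.

```python
from collections import Counter
from functools import reduce

# Hypothetical data: three "partitions" of passenger counts (made up for illustration)
partitions = [
    [1, 1, 2, 5],
    [1, 3, 1],
    [2, 1, 6, 1],
]

def count_partition(values):
    """Count values within a single partition (the per-worker step)."""
    return Counter(values)

def merge_counts(left, right):
    """Combine two partial counts into one (the reduction step)."""
    return left + right

partial_counts = [count_partition(p) for p in partitions]
total_counts = reduce(merge_counts, partial_counts, Counter())

# Print values ordered from most to least frequent, as value_counts does
for value, count in total_counts.most_common():
    print(value, count)
```

Because each partition is counted independently, the per-partition step can run on different workers at the same time; only the small partial results need to be combined at the end.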


## Machine learning with Dask

See also XGBoost's Dask interface docs: https://github.com/dmlc/xgboost/tree/master/demo/dask

In [28]:
kmeans_cols = ["passenger_count", "trip_distance", "rate_code", "fare_amount"]

In [30]:
%%time
X_ddf = ddf[kmeans_cols]
for c in X_ddf.columns:
    X_ddf[c] = X_ddf[c].astype(np.float32)  # Cast every column to float32
y_ddf = X_ddf["fare_amount"]  # Target column
X_ddf = X_ddf.drop(columns="fare_amount")  # Remaining columns are the features

X_ddf, y_ddf = client.persist([X_ddf, y_ddf])  # Trigger the computation and keep the results in worker memory
_ = wait([X_ddf, y_ddf])  # Block until persistence has finished

CPU times: user 394 ms, sys: 77.7 ms, total: 472 ms
Wall time: 2.09 s
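
`client.persist` returns immediately while the work proceeds in the background on the workers, which is why `wait` is called before the cell finishes. Dask's futures interface is modeled on Python's `concurrent.futures`, so the same submit-then-wait pattern can be sketched with the standard library alone. This is an analogy rather than Dask code: `slow_cast` and the sample columns are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait
import time

def slow_cast(values):
    """Hypothetical stand-in for a column cast that takes a while to finish."""
    time.sleep(0.1)
    return [float(v) for v in values]

# Made-up columns, for illustration only
columns = {"passenger_count": [1, 2, 5], "trip_distance": [3, 1, 7]}

with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns futures right away; the work runs in the background
    futures = {name: pool.submit(slow_cast, vals) for name, vals in columns.items()}

    # wait() blocks until every future has finished
    done, not_done = wait(futures.values())

results = {name: fut.result() for name, fut in futures.items()}
print(results)
```

Without the `wait` call, a timing measurement would capture only how long it took to hand off the work, not how long the work itself took.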


In [70]:
import xgboost
from xgboost.dask import DaskDMatrix

In [75]:
# TODO: explore within-train-sample predictions