# Machine Learning on Distributed Dask with SageMaker and Fargate

This notebook demonstrates how to perform machine learning on distributed Dask using SageMaker and Fargate. We show how to connect to a distributed Dask cluster on Fargate, scale out the Dask worker nodes, and perform exploratory data analysis (EDA) on the public New York City taxi trip data sets. We then run regression algorithms and hyperparameter optimization on the distributed Dask cluster. Next, we show how to monitor the operational metrics of the Dask cluster, which is fronted by a Network Load Balancer so that the Dask cluster status UI can be reached from the internet. Finally, we show how to build your own Python script container and run it against the Dask Fargate cluster. This notebook was inspired by a customer use case in which Dask was run on a local computer to build regression models.

## Setup required packages

In [None]:
!conda install -c conda-forge scikit-learn==0.23 -y

In [None]:
!conda install -y dask=2.25

In [None]:
!conda install -c conda-forge s3fs=0.3.0 -y

In [None]:
!conda install -c conda-forge matplotlib=3.3.1 -y

In [None]:
!conda install -c conda-forge dask-ml=1.6.0 -y

## Connect to Dask Fargate Cluster

Provision this cluster first by following the instructions at https://github.com/rvvittal/aws-dask-sm-fargate

In [1]:
from dask.distributed import Client
client = Client('localhost:8786')


+---------+---------------+---------------+---------------+
| Package | client        | scheduler     | workers       |
+---------+---------------+---------------+---------------+
| blosc   | None          | 1.7.0         | 1.7.0         |
| python  | 3.7.6.final.0 | 3.7.7.final.0 | 3.7.7.final.0 |
+---------+---------------+---------------+---------------+


## Scale out the number of dask workers as needed for your data science work

In [None]:
#!aws ecs update-service --service Dask-Workers --desired-count 20 --cluster Fargate-Dask-Cluster
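
The same scale-out can be issued from Python instead of the shell. The sketch below assumes boto3 is installed and that the notebook role is allowed to call ECS; it reuses the service and cluster names from the CLI command above. The `scale_dask_workers` helper is a hypothetical name for illustration, not something provided by the repository.

```python
def scale_dask_workers(ecs_client, desired_count,
                       cluster="Fargate-Dask-Cluster",
                       service="Dask-Workers"):
    """Ask ECS to run `desired_count` tasks for the Dask worker service.

    `ecs_client` is expected to be a boto3 ECS client, i.e. boto3.client("ecs").
    """
    if desired_count < 0:
        raise ValueError("desired_count must be non-negative")
    return ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=desired_count,
    )

# Usage (not run here, requires AWS credentials):
#   import boto3
#   scale_dask_workers(boto3.client("ecs"), 20)
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before pointing it at a real cluster.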

## Restart the client after scale out operation

In [None]:
client.restart()
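
New Fargate tasks take some time to start and register with the scheduler, so it can help to wait for them before submitting work. A minimal sketch, assuming `client` is the `dask.distributed.Client` created earlier; `wait_until_workers` is a hypothetical helper that polls `Client.scheduler_info()`.

```python
import time

def wait_until_workers(client, expected, timeout=300.0, poll=5.0):
    """Poll the scheduler until at least `expected` workers are registered.

    Returns the number of workers seen; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        n_workers = len(client.scheduler_info()["workers"])
        if n_workers >= expected:
            return n_workers
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {n_workers} of {expected} workers joined within {timeout}s"
            )
        time.sleep(poll)

# Usage (not run here, requires a live cluster):
#   wait_until_workers(client, 20)
```

The built-in `client.wait_for_workers(n_workers=20)` serves a similar purpose; the explicit loop is shown only to make the polling visible.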

## Load a Dask DataFrame with the trip data


## TODO: Introduction to Dask DataFrame

In [2]:
import s3fs
import dask.dataframe as dd
import boto3
import dask.distributed
#df = dd.read_csv('s3://octank-claims-web/public-data/yellow_tripdata_2018-01.csv', storage_options={'anon': False})
# df = dd.read_csv('s3://nyc-tlc/trip data/green_tripdata_2018-02.csv', storage_options={'anon': True})

In [3]:
df = dd.read_csv(
    's3://nyc-tlc/trip data/green_tripdata_2018-02.csv', storage_options={'anon': True}
)

In [4]:
df.head()

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type
0,2,2018-02-01 00:39:38,2018-02-01 00:39:41,N,5,97,65,1,0.0,20.0,0.0,0.0,3.0,0.0,,0.0,23.0,1,2
1,2,2018-02-01 00:58:28,2018-02-01 01:05:35,N,1,256,80,5,1.6,7.5,0.5,0.5,0.88,0.0,,0.3,9.68,1,1
2,2,2018-02-01 00:56:05,2018-02-01 01:18:54,N,1,25,95,1,9.6,28.5,0.5,0.5,5.96,0.0,,0.3,35.76,1,1
3,2,2018-02-01 00:12:40,2018-02-01 00:15:50,N,1,61,61,1,0.73,4.5,0.5,0.5,0.0,0.0,,0.3,5.8,2,1
4,2,2018-02-01 00:45:18,2018-02-01 00:51:56,N,1,65,17,2,1.87,8.0,0.5,0.5,0.0,0.0,,0.3,9.3,2,1


In [None]:
# load and count number of rows
len(df)

In [None]:
df.dtypes

## Persist the Dask DataFrame in cluster memory

In [16]:
df_persisted = client.persist(df)
print(df_persisted.head())

   VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag  \
0         2  2018-02-01 00:39:38   2018-02-01 00:39:41                  N   
1         2  2018-02-01 00:58:28   2018-02-01 01:05:35                  N   
2         2  2018-02-01 00:56:05   2018-02-01 01:18:54                  N   
3         2  2018-02-01 00:12:40   2018-02-01 00:15:50                  N   
4         2  2018-02-01 00:45:18   2018-02-01 00:51:56                  N   

   RatecodeID  PULocationID  DOLocationID  passenger_count  trip_distance  \
0           5            97            65                1           0.00   
1           1           256            80                5           1.60   
2           1            25            95                1           9.60   
3           1            61            61                1           0.73   
4           1            65            17                2           1.87   

   fare_amount  extra  mta_tax  tip_amount  tolls_amount  ehail_fee  \
0  

## Compute the mean trip distance grouped by the number of passengers

In [17]:
# Group the persisted DataFrame so the CSV is not read from S3 again
grouped_df = df_persisted.groupby('passenger_count').trip_distance.mean().compute()
print(grouped_df)

passenger_count
0    2.248809
1    2.719773
2    2.800221
3    2.763138
4    2.660013
5    2.740212
6    2.653089
7    1.260000
8    1.013571
9    0.132500
Name: trip_distance, dtype: float64
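
A grouped mean cannot be obtained by averaging per-partition means, because each partition holds a different number of rows per group. The pure-Python sketch below shows the idea of keeping a sum and a count per group in each partition and combining them at the end. The two tiny partitions are made-up illustration data, and this is a conceptual picture rather than Dask's actual implementation.

```python
from collections import defaultdict

def partial_sums(rows):
    """Per-partition step: (sum, count) of trip_distance per passenger_count."""
    acc = defaultdict(lambda: [0.0, 0])
    for passenger_count, trip_distance in rows:
        acc[passenger_count][0] += trip_distance
        acc[passenger_count][1] += 1
    return acc

def combine_means(partials):
    """Final step: merge the partial (sum, count) pairs and divide."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for key, (s, n) in part.items():
            total[key][0] += s
            total[key][1] += n
    return {key: s / n for key, (s, n) in total.items()}

# Made-up (passenger_count, trip_distance) rows split over two partitions.
partition_a = [(1, 1.0), (1, 3.0), (2, 4.0)]
partition_b = [(1, 8.0), (2, 6.0)]

means = combine_means([partial_sums(partition_a), partial_sums(partition_b)])
print(means)  # {1: 4.0, 2: 5.0}
```

Averaging the per-partition means for group 1 would give (2.0 + 8.0) / 2 = 5.0 instead of the correct 4.0, which is why the counts have to travel with the sums.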


## Compute the maximum trip distance

In [18]:
max_trip_dist = df_persisted.trip_distance.max().compute()
print(max_trip_dist)

120.47


## Compute the trip count and total trip distance for each vendor

In [19]:
%%time
df.groupby('VendorID').agg({'passenger_count':'count', 'trip_distance': 'sum'}).astype(int).reset_index()\
.rename(columns={'passenger_count':'Trip Count'}).compute()

CPU times: user 72.1 ms, sys: 36.4 ms, total: 108 ms
Wall time: 1min 5s


Unnamed: 0,VendorID,Trip Count,trip_distance
0,1,131590,339995
1,2,638350,1758388


## Count Missing Values for Each Feature

In [20]:
df.isna().sum().compute()

VendorID                      0
lpep_pickup_datetime          0
lpep_dropoff_datetime         0
store_and_fwd_flag            0
RatecodeID                    0
PULocationID                  0
DOLocationID                  0
passenger_count               0
trip_distance                 0
fare_amount                   0
extra                         0
mta_tax                       0
tip_amount                    0
tolls_amount                  0
ehail_fee                769940
improvement_surcharge         0
total_amount                  0
payment_type                  0
trip_type                     0
dtype: int64
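
The per-vendor trip counts reported earlier add up to 131,590 + 638,350 = 769,940 rows, which equals the `ehail_fee` missing count, so that column is empty in this file. A small sketch of that check, using only numbers copied from the outputs above (abbreviated to three columns):

```python
# Row count taken from the per-vendor trip counts reported earlier.
n_rows = 131590 + 638350

# Missing-value counts copied from the output above (abbreviated).
missing = {"trip_distance": 0, "fare_amount": 0, "ehail_fee": 769940}

# Fraction of rows missing per column.
missing_fraction = {col: n / n_rows for col, n in missing.items()}

# Columns with no usable values at all.
empty_columns = [col for col, frac in missing_fraction.items() if frac == 1.0]
print(empty_columns)  # ['ehail_fee']
```

A column like this carries no signal for modeling, so dropping it (for example with `df.drop(columns=['ehail_fee'])`) is a reasonable alternative to filling it with a constant.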

##  Visual EDA  
- ref https://medium.com/datadriveninvestor/analyzing-big-data-with-dask-a05a8798da8c

In [21]:
##Selecting top 10 rides based on fare amount
most_paid_rides_dask = df[['PULocationID', 'fare_amount']].nlargest(10, "fare_amount")

In [None]:
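# Visualize the highest-fare rides as a horizontal bar plot
import matplotlib.pyplot as plt
# Only 10 rows remain, so compute to pandas first and set the index there;
# the nlargest result is not sorted by PULocationID, so sorted=True would be wrong.
most_paid_rides_dask.compute().set_index('PULocationID').plot(kind='barh', stacked=False, figsize=[10, 8], legend=True)
plt.title('Most Paid Rides')
plt.xlabel('Fare Amount')
plt.ylabel('PU LocationID')
plt.show()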


In [None]:
# Visualize fares for the 10 longest trips as a bar plot
import matplotlib.pyplot as plt
most_paid_rides_dask2 = df[['trip_distance', 'fare_amount']].nlargest(10, "trip_distance")
# nlargest returns rows in descending order, so set the index in pandas
# after compute() instead of claiming a sorted Dask index.
most_paid_rides_dask2.compute().set_index('trip_distance').plot(kind='bar', colormap='PiYG', stacked=False, figsize=[10, 8], legend=True)
plt.title('Fares by Distance')
plt.xlabel('Trip Distance')
plt.ylabel('Fare Amount')
plt.show()

## TODO: Regression modeling with Scikit Learn

In [23]:
dfl = dd.read_csv(
    's3://nyc-tlc/trip data/green_tripdata_2018-02.csv', storage_options={'anon': True},
    parse_dates=['lpep_pickup_datetime', 'lpep_dropoff_datetime'],
).sample(frac=0.1, replace=True)

In [24]:
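# Trip duration as a timedelta (dropoff minus pickup)
dfl['trip_duration'] = dfl['lpep_dropoff_datetime'] - dfl['lpep_pickup_datetime']

In [25]:
import numpy as np
# Convert the timedelta to fractional days
dfl['trip_duration'] = dfl['trip_duration']/np.timedelta64(1,'D')

In [26]:
# Convert days to hours
dfl['trip_duration'] = dfl['trip_duration'] * 24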
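
Converting the timedelta to days and then multiplying by 24 is equivalent to dividing the number of seconds by 3600. A worked check with the standard library, using the pickup and dropoff times of the first sampled row (index 601557) in the `dfl.head()` output:

```python
from datetime import datetime, timedelta

pickup = datetime(2018, 2, 22, 21, 11, 19)
dropoff = datetime(2018, 2, 22, 21, 24, 43)
duration = dropoff - pickup  # 13 minutes 24 seconds

# Notebook approach: fractional days, then multiply by 24.
hours_via_days = (duration / timedelta(days=1)) * 24

# Direct approach: seconds divided by 3600.
hours_direct = duration.total_seconds() / 3600

print(round(hours_via_days, 6), round(hours_direct, 6))  # 0.223333 0.223333
```

Both routes reproduce the 0.223333 hours reported for that row. On a timedelta column, the direct form would presumably be `.dt.total_seconds() / 3600`.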

In [27]:
dfl['trip_duration']

Dask Series Structure:
npartitions=2
    float64
        ...
        ...
Name: trip_duration, dtype: float64
Dask Name: getitem, 26 tasks

In [28]:
dfl.head()

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,store_and_fwd_flag,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,mta_tax,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,trip_duration
601557,2,2018-02-22 21:11:19,2018-02-22 21:24:43,N,1,97,79,2,3.05,12.0,0.5,0.5,2.66,0.0,,0.3,15.96,1,1,0.223333
205086,2,2018-02-08 11:40:40,2018-02-08 11:49:41,N,1,69,116,5,1.02,7.5,0.0,0.5,0.0,0.0,,0.3,8.3,1,1,0.150278
450895,2,2018-02-17 00:49:33,2018-02-17 00:54:43,N,1,80,112,1,1.14,6.0,0.5,0.5,1.46,0.0,,0.3,8.76,1,1,0.086111
333002,2,2018-02-12 21:26:55,2018-02-12 21:32:16,N,1,149,178,1,1.07,6.0,0.5,0.5,0.0,0.0,,0.3,7.3,1,1,0.089167
264125,2,2018-02-10 10:32:11,2018-02-10 10:35:34,N,1,75,75,1,0.52,4.5,0.0,0.5,0.0,0.0,,0.3,5.3,2,1,0.056389


In [29]:
dfl = dfl.fillna(value=0)

In [30]:
dfl = dd.get_dummies(dfl.categorize()).compute()

In [31]:
dfl.head()

Unnamed: 0,VendorID,lpep_pickup_datetime,lpep_dropoff_datetime,RatecodeID,PULocationID,DOLocationID,passenger_count,trip_distance,fare_amount,extra,...,tip_amount,tolls_amount,ehail_fee,improvement_surcharge,total_amount,payment_type,trip_type,trip_duration,store_and_fwd_flag_N,store_and_fwd_flag_Y
601557,2,2018-02-22 21:11:19,2018-02-22 21:24:43,1,97,79,2,3.05,12.0,0.5,...,2.66,0.0,0.0,0.3,15.96,1,1,0.223333,1,0
205086,2,2018-02-08 11:40:40,2018-02-08 11:49:41,1,69,116,5,1.02,7.5,0.0,...,0.0,0.0,0.0,0.3,8.3,1,1,0.150278,1,0
450895,2,2018-02-17 00:49:33,2018-02-17 00:54:43,1,80,112,1,1.14,6.0,0.5,...,1.46,0.0,0.0,0.3,8.76,1,1,0.086111,1,0
333002,2,2018-02-12 21:26:55,2018-02-12 21:32:16,1,149,178,1,1.07,6.0,0.5,...,0.0,0.0,0.0,0.3,7.3,1,1,0.089167,1,0
264125,2,2018-02-10 10:32:11,2018-02-10 10:35:34,1,75,75,1,0.52,4.5,0.0,...,0.0,0.0,0.0,0.3,5.3,2,1,0.056389,1,0


In [32]:
x = dfl[['VendorID','RatecodeID','PULocationID','DOLocationID','passenger_count','trip_distance','fare_amount','total_amount']]

In [33]:
y = dfl['trip_duration']

In [34]:
from dask_ml.model_selection import train_test_split

In [35]:
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1)

In [36]:
len(X_train), len(X_test), len(y_train), len(y_test)

(69294, 7700, 69294, 7700)

In [37]:
training_x = X_train.values
training_y = y_train.values

In [38]:
testing_x = X_test.values
testing_y = y_test.values

In [39]:
import numpy as np
from sklearn.metrics import mean_squared_error

In [40]:
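def rmse(preds, actuals):
    """Return the root mean squared error between predictions and actual values."""
    error = mean_squared_error(actuals, preds)
    # Return the value (instead of only printing it) so it can be reused or compared
    return np.sqrt(error)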
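
As a worked example of the metric, here is a dependency-free sketch that mirrors the calculation for small lists. `rmse_plain` is a hypothetical name used only here, and the numbers are made up.

```python
import math

def rmse_plain(preds, actuals):
    """Root mean squared error for two equal-length sequences of numbers."""
    if len(preds) != len(actuals):
        raise ValueError("preds and actuals must have the same length")
    squared_errors = [(a - p) ** 2 for p, a in zip(preds, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Three exact predictions and one that is off by 4:
# squared errors 0, 0, 0, 16 -> mean 4 -> square root 2.
value = rmse_plain(preds=[1, 2, 3, 4], actuals=[1, 2, 3, 8])
print(value)  # 2.0
```

Once a model has been fitted, the notebook's own `rmse` function would be applied in the same way to the test split, for instance to `lr.predict(testing_x)` and `testing_y`.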

In [41]:
from dask_ml.linear_model import LinearRegression
lr = LinearRegression(random_state=1, n_jobs=-1, fit_intercept=True)
lr.fit(training_x,training_y)

LinearRegression(n_jobs=-1, random_state=1)

In [42]:
import joblib
from dask_ml.linear_model import LinearRegression

with joblib.parallel_backend('dask'):
    lr = LinearRegression(random_state=1, fit_intercept=False)
    lr.fit(training_x,training_y)

## Linear Regression with Dask distributed machine learning

In [5]:
from dask_glm.datasets import make_regression
X, y = make_regression(n_samples=200000, n_features=100, n_informative=5, chunksize=10000)
X


Unnamed: 0,Array,Chunk
Bytes,160.00 MB,8.00 MB
Shape,"(200000, 100)","(10000, 100)"
Count,20 Tasks,20 Chunks
Type,float64,numpy.ndarray


In [6]:
import dask
X, y = dask.persist(X, y)
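
Persisting keeps every chunk of `X` in worker memory, so it is worth checking the footprint first. The figures in the array summary above follow from the shape, the chunk size, and 8 bytes per float64 value:

```python
n_samples, n_features, chunk_rows = 200_000, 100, 10_000
bytes_per_value = 8  # float64

total_bytes = n_samples * n_features * bytes_per_value
chunk_bytes = chunk_rows * n_features * bytes_per_value
n_chunks = n_samples // chunk_rows

# Decimal megabytes, matching the summary shown above.
print(total_bytes / 1e6, chunk_bytes / 1e6, n_chunks)  # 160.0 8.0 20
```

The same arithmetic gives a rough lower bound on the combined worker memory needed before scaling the sample count up.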

In [7]:
import dask_glm.algorithms

b = dask_glm.algorithms.admm(X, y, max_iter=5)

In [8]:
b = dask_glm.algorithms.proximal_grad(X, y, max_iter=5)

In [9]:
import dask_glm.families
import dask_glm.regularizers

family = dask_glm.families.Poisson()
regularizer = dask_glm.regularizers.ElasticNet()

b = dask_glm.algorithms.proximal_grad(
    X, y,
    max_iter=5,
    family=family,
    regularizer=regularizer,
)


## Hyperparameter Optimization with Dask distributed machine learning

Scikit-learn uses joblib for single-machine parallelism. This lets you train most estimators (anything that accepts an `n_jobs` parameter) using all the cores of your laptop or workstation. Alternatively, scikit-learn can use Dask for parallelism. This lets you train those estimators using all the cores of your cluster without significantly changing your code.

In [10]:
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import pandas as pd


In [11]:
X, y = make_classification(n_samples=1000, random_state=0)
X[:5]

array([[-1.06377997,  0.67640868,  1.06935647, -0.21758002,  0.46021477,
        -0.39916689, -0.07918751,  1.20938491, -0.78531472, -0.17218611,
        -1.08535744, -0.99311895,  0.30693511,  0.06405769, -1.0542328 ,
        -0.52749607, -0.0741832 , -0.35562842,  1.05721416, -0.90259159],
       [ 0.0708476 , -1.69528125,  2.44944917, -0.5304942 , -0.93296221,
         2.86520354,  2.43572851, -1.61850016,  1.30071691,  0.34840246,
         0.54493439,  0.22532411,  0.60556322, -0.19210097, -0.06802699,
         0.9716812 , -1.79204799,  0.01708348, -0.37566904, -0.62323644],
       [ 0.94028404, -0.49214582,  0.67795602, -0.22775445,  1.40175261,
         1.23165333, -0.77746425,  0.01561602,  1.33171299,  1.08477266,
        -0.97805157, -0.05012039,  0.94838552, -0.17342825, -0.47767184,
         0.76089649,  1.00115812, -0.06946407,  1.35904607, -1.18958963],
       [-0.29951677,  0.75988955,  0.18280267, -1.55023271,  0.33821802,
         0.36324148, -2.10052547, -0.4380675 , -

In [22]:
y[:5]

array([0, 1, 1, 1, 0])

In [12]:
param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}

grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           iid=True,
                           cv=3,
                           n_jobs=-1)
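
The size of this search determines how much work there is to spread across the cluster. A quick count with the standard library, using the same grid and `cv=3`:

```python
from itertools import product

param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ["rbf", "poly", "sigmoid"],
              "shrinking": [True, False]}
cv = 3

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_candidates = len(candidates)   # 8 * 3 * 2
n_fits = n_candidates * cv       # one fit per candidate per fold
print(n_candidates, n_fits)      # 48 144
```

These 144 cross-validation fits are independent of one another, which is what makes the search a good candidate for distribution; one further refit on the full data follows because `refit` defaults to `True`.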


In [13]:
grid_search.fit(X, y)



GridSearchCV(cv=3,
             estimator=SVC(gamma='auto', probability=True, random_state=0),
             iid=True, n_jobs=-1,
             param_grid={'C': [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
                         'kernel': ['rbf', 'poly', 'sigmoid'],
                         'shrinking': [True, False]})

In [14]:
import joblib

with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)




In [15]:
pd.DataFrame(grid_search.cv_results_).head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_kernel,param_shrinking,params,split0_test_score,split1_test_score,split2_test_score,mean_test_score,std_test_score,rank_test_score
0,0.15557,0.011876,0.013422,0.001323,0.001,rbf,True,"{'C': 0.001, 'kernel': 'rbf', 'shrinking': True}",0.502994,0.501502,0.501502,0.502,0.000704,41
1,0.190303,0.038406,0.010841,0.000753,0.001,rbf,False,"{'C': 0.001, 'kernel': 'rbf', 'shrinking': False}",0.502994,0.501502,0.501502,0.502,0.000704,41
2,0.068233,0.007792,0.008976,0.002419,0.001,poly,True,"{'C': 0.001, 'kernel': 'poly', 'shrinking': True}",0.502994,0.501502,0.501502,0.502,0.000704,41
3,0.085549,0.009503,0.006081,0.000797,0.001,poly,False,"{'C': 0.001, 'kernel': 'poly', 'shrinking': Fa...",0.502994,0.501502,0.501502,0.502,0.000704,41
4,0.17243,0.015639,0.013679,0.001162,0.001,sigmoid,True,"{'C': 0.001, 'kernel': 'sigmoid', 'shrinking':...",0.502994,0.501502,0.501502,0.502,0.000704,41


## Run your Python script container for your machine learning work

Make sure to follow the steps in the GitHub repo for building and deploying this container before running this step.

In [None]:
!aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 716664005094.dkr.ecr.us-west-2.amazonaws.com

In [None]:
!docker run -e s3url='s3://nyc-tlc/trip data/green_tripdata_2018-02.csv' -e schurl='tcp://Dask-Scheduler.local-dask:8786' 716664005094.dkr.ecr.us-west-2.amazonaws.com/daskclientapp:latest
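
The `docker run` command passes two environment variables, `s3url` and `schurl`, into the container. The actual script lives in the GitHub repo; the sketch below only illustrates how such a script might read them. The `read_settings` helper and its error handling are assumptions, not the repository's code.

```python
import os

def read_settings(environ=None):
    """Read the data location and scheduler address from environment variables."""
    environ = os.environ if environ is None else environ
    missing = [name for name in ("s3url", "schurl") if not environ.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {"s3url": environ["s3url"], "scheduler": environ["schurl"]}

# Example with the values used in the docker run command above.
settings = read_settings({
    "s3url": "s3://nyc-tlc/trip data/green_tripdata_2018-02.csv",
    "schurl": "tcp://Dask-Scheduler.local-dask:8786",
})
print(settings["scheduler"])  # tcp://Dask-Scheduler.local-dask:8786
```

A script along these lines could then pass `settings["scheduler"]` to `Client(...)` and `settings["s3url"]` to `dd.read_csv(...)`, as the notebook cells do interactively.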

## Scale in the Fargate cluster worker nodes after all work is done

In [None]:
!pip3 install --upgrade --user awscli

In [None]:
conda install -c conda-forge awscli -y

In [None]:
!aws ecs update-service --service Dask-Workers --desired-count 1 --cluster Fargate-Dask-Cluster