# Machine Learning on Distributed Dask with SageMaker and Fargate

This notebook demonstrates how to perform machine learning on a distributed Dask cluster using SageMaker and Fargate. We show how to connect to a distributed Dask cluster on Fargate, scale out the Dask worker nodes, and perform exploratory data analysis (EDA) on the public New York taxi trip data sets. We then run regression algorithms and hyperparameter optimization on the distributed Dask cluster. Next, we show how to monitor the operational metrics of the Dask cluster, which is fronted by a Network Load Balancer so that the Dask cluster status UI can be reached from the internet. Finally, we show how to build your own Python script container and run it against the Dask Fargate cluster. This notebook was inspired by a customer use case in which Dask was run on a local computer to build regression models.

In [None]:
!conda install scikit-learn=0.23.2 -c conda-forge -y

In [None]:
!conda install dask-ml=1.6.0 -c conda-forge -y

In [None]:
!conda install cloudpickle=1.3.0 -c conda-forge -y

## Connect to Dask Fargate Cluster

You need to provision this cluster first by following the instructions at https://github.com/rvvittal/aws-dask-sm-fargate

In [2]:
from dask.distributed import Client

# Use this client for local testing
client = Client('localhost:8786')

# Use this client instead when running from a SageMaker notebook
#client = Client('dask-scheduler.local-dask:8786')


cloudpickle
+------------------------+---------+
|                        | version |
+------------------------+---------+
| client                 | 1.3.0   |
| scheduler              | 1.6.0   |
| tcp://172.20.0.3:39593 | 1.6.0   |
| tcp://172.20.0.4:44457 | 1.6.0   |
+------------------------+---------+

python
+------------------------+----------------+
|                        | version        |
+------------------------+----------------+
| client                 | 3.7.6.final.0  |
| scheduler              | 3.6.10.final.0 |
| tcp://172.20.0.3:39593 | 3.6.10.final.0 |
| tcp://172.20.0.4:44457 | 3.6.10.final.0 |
+------------------------+----------------+


## Scale out the number of dask workers as needed for your data science work

In [None]:
# Uncomment the command below to scale out the workers when the cluster is running on Fargate.
#!aws ecs update-service --service Dask-Workers --desired-count 10 --cluster Fargate-Dask-Cluster

## Restart the client after the scale-out operation

In [None]:
client.restart()

## Introduction to Dask DataFrame
A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index. These Pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask DataFrame operation triggers many operations on the constituent Pandas DataFrames. For more details, review this page: https://docs.dask.org/en/latest/dataframe.html 
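
To make the partitioning idea concrete, the sketch below mimics how a single mean over the whole collection decomposes into one small operation per partition followed by a cheap combine step. It is an illustration only: plain Python lists stand in for the constituent Pandas DataFrames, the values are made up, and the names `partitions`, `partial_sum_count`, and `combined_mean` are hypothetical, not part of the Dask API.

```python
# Illustrative sketch: plain lists stand in for the pandas partitions
# of one Dask DataFrame column (made-up trip distances).
partitions = [
    [1.0, 2.0, 3.0],   # partition 0
    [4.0, 5.0],        # partition 1
    [6.0],             # partition 2
]

def partial_sum_count(partition):
    """Per-partition work: return (sum, count) for one block."""
    return sum(partition), len(partition)

def combined_mean(partials):
    """Combine step: merge the per-partition results into one mean."""
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

partials = [partial_sum_count(p) for p in partitions]
mean_distance = combined_mean(partials)
print(mean_distance)  # 3.5
```

Only the small `(sum, count)` pairs need to travel between partitions, which is why this kind of aggregation scales well across workers.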



In [3]:
import s3fs
import dask.dataframe as dd
import boto3
import dask.distributed

## Using Dask for EDA on New York Taxi Trip datasets

In [4]:
df = dd.read_csv(
    's3://nyc-tlc/trip data/green_tripdata_2018-02.csv', storage_options={'anon': True}
)

In [5]:
df.head()

Exception: maximum recursion depth exceeded

distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
concurrent.futures._base.CancelledError


In [None]:
# load and count number of rows
len(df)

In [None]:
df.dtypes

## Persist collections into memory
Calls to Client.compute or Client.persist submit task graphs to the cluster and return Future objects that point to particular output tasks. Compute returns a single future per input, whereas persist returns a copy of the collection with each block or partition replaced by a single future. In short, use persist to keep the full collection on the cluster, and use compute when you want a small result as a single future.
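
The difference in the shape of what comes back can be illustrated with the standard library alone. The sketch below is an analogy, not Dask code: it uses `concurrent.futures` in place of the distributed client, and `partitions`, `load`, and `total` are hypothetical stand-ins. The persist-like call yields one future per partition, while the compute-like call yields a single future for one small result.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions of a collection (stand-ins for Dask blocks).
partitions = [[1, 2, 3], [4, 5], [6]]

def load(partition):
    """Pretend to materialize one partition in worker memory."""
    return list(partition)

def total(parts):
    """Reduce the whole collection to one small result."""
    return sum(sum(p) for p in parts)

with ThreadPoolExecutor(max_workers=2) as pool:
    # persist-like: one future per partition; the data stays spread out.
    persisted = [pool.submit(load, p) for p in partitions]
    # compute-like: a single future holding one small result.
    computed = pool.submit(total, partitions)

    n_persisted_futures = len(persisted)
    result = computed.result()

print(n_persisted_futures, result)  # 3 21
```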


In [None]:
df_persisted = client.persist(df)
print(df_persisted.head())

## Compute the mean trip distance grouped by the number of passengers

In [None]:
# Group the persisted DataFrame directly so the data is not re-read from S3
grouped_df = df_persisted.groupby('passenger_count').trip_distance.mean().compute()
print(grouped_df)

## Compute Max trip distance

In [None]:
max_trip_dist = df_persisted.trip_distance.max().compute()
print(max_trip_dist)

## Compute the total trip distance and trip count for each vendor

In [None]:
%%time
df.groupby('VendorID').agg({'passenger_count':'count', 'trip_distance': 'sum'}).astype(int).reset_index()\
.rename(columns={'passenger_count':'Trip Count'}).compute()

## Count Missing Values for Each Feature

In [None]:
df.isna().sum().compute()

## Visual EDA

In [None]:
##Selecting top 10 rides based on fare amount
most_paid_rides_dask = df[['PULocationID', 'fare_amount']].nlargest(10, "fare_amount")

In [None]:
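# Visualize the highest-fare rides as a horizontal bar plot
import matplotlib.pyplot as plt
# The top-10 result is small, so compute it into pandas before indexing and plotting.
# set_index(..., sorted=True) would assume the rows are already sorted by PULocationID, which nlargest does not guarantee.
most_paid_rides_dask.compute().set_index('PULocationID').plot(kind='barh', stacked=False, figsize=[10,8], legend=True)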
plt.title('Most Paid Rides')
plt.xlabel('Fare Amount')
plt.ylabel('PU LocationID')
plt.show()


In [None]:
# Visualize fare amounts for the 10 longest trips as a bar plot
import matplotlib.pyplot as plt
most_paid_rides_dask2 = df[['trip_distance', 'fare_amount']].nlargest(10, "trip_distance")
# Compute the small top-10 result into pandas first; set_index(..., sorted=True) would assume pre-sorted rows.
most_paid_rides_dask2.compute().set_index('trip_distance').plot(kind='bar', colormap='PiYG', stacked=False, figsize=[10,8], legend=True)
plt.title('Fares by Distance')
plt.xlabel('Trip Distance')
plt.ylabel('Fare Amount')
plt.show()

## Regression modeling with Scikit Learn

In [None]:
dfl = dd.read_csv(
    's3://nyc-tlc/trip data/green_tripdata_2018-02.csv', storage_options={'anon': True},
    parse_dates=['lpep_pickup_datetime', 'lpep_dropoff_datetime'],
).sample(frac=0.1, replace=True)

In [None]:
dfl['trip_duration'] = dfl['lpep_dropoff_datetime'] - dfl['lpep_pickup_datetime']

In [None]:
import numpy as np
dfl['trip_duration'] = dfl['trip_duration']/np.timedelta64(1,'D')

In [None]:
dfl['trip_duration'] = dfl['trip_duration'] * 24
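
The cells above express the duration as fractional days and then multiply by 24 to obtain hours. As a sanity check, the same arithmetic can be reproduced for a single trip with the standard library. The timestamps below are made up for illustration; the sketch also shows that dividing directly by one hour gives the same value.

```python
from datetime import datetime, timedelta

# Hypothetical pickup and dropoff times for one trip.
pickup = datetime(2018, 2, 1, 8, 15, 0)
dropoff = datetime(2018, 2, 1, 9, 45, 0)

duration = dropoff - pickup

# Two-step conversion used in the notebook: days first, then hours.
hours_via_days = (duration / timedelta(days=1)) * 24

# Equivalent one-step conversion: divide by one hour.
hours_direct = duration / timedelta(hours=1)

print(hours_via_days, hours_direct)
```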

In [None]:
dfl['trip_duration']

In [None]:
dfl.head()

In [None]:
dfl = dfl.fillna(value=0)

In [None]:
dfl = dd.get_dummies(dfl.categorize()).compute()

In [None]:
dfl.head()

In [None]:
x = dfl[['VendorID','RatecodeID','PULocationID','DOLocationID','passenger_count','trip_distance','fare_amount','total_amount']]

In [None]:
y = dfl['trip_duration']

In [None]:
from dask_ml.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1)

In [None]:
len(X_train), len(X_test), len(y_train), len(y_test)

In [None]:
training_x = X_train.values
training_y = y_train.values

In [None]:
testing_x = X_test.values
testing_y = y_test.values

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error

In [None]:
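def rmse(preds, actuals):
    # Root mean squared error between predictions and actual values
    error = mean_squared_error(actuals, preds)
    value = np.sqrt(error)
    print(value)
    # Return the value as well so callers can reuse it
    return value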
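
For reference, RMSE is the square root of the mean of the squared differences between predictions and actual values. The sketch below works this out on three made-up values using only the standard library; `rmse_plain` is a hypothetical helper for illustration and is separate from the notebook's `rmse` function.

```python
import math

def rmse_plain(preds, actuals):
    """Root mean squared error computed without NumPy or scikit-learn."""
    squared_errors = [(p - a) ** 2 for p, a in zip(preds, actuals)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up trip durations in hours: the errors are 0, 0 and 3.
preds = [1.0, 2.0, 6.0]
actuals = [1.0, 2.0, 3.0]

value = rmse_plain(preds, actuals)
print(round(value, 4))  # 1.7321
```

Because the errors are squared before averaging, the single large miss dominates the score here.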

In [None]:
from dask_ml.linear_model import LinearRegression
lr = LinearRegression(random_state=1, n_jobs=-1, fit_intercept=True)
lr.fit(training_x,training_y)

In [None]:
import joblib
from dask_ml.linear_model import LinearRegression

with joblib.parallel_backend('dask'):
    lr = LinearRegression(random_state=1, fit_intercept=False)
    lr.fit(training_x,training_y)

## Linear Regression with Dask distributed machine learning

In [None]:
from dask_glm.datasets import make_regression
X, y = make_regression(n_samples=200000, n_features=100, n_informative=5, chunksize=10000)
X


In [None]:
import dask
X, y = dask.persist(X, y)

In [None]:
import dask_glm.algorithms

b = dask_glm.algorithms.admm(X, y, max_iter=5)

In [None]:
b = dask_glm.algorithms.proximal_grad(X, y, max_iter=5)

In [None]:
import dask_glm.families
import dask_glm.regularizers

family = dask_glm.families.Poisson()
regularizer = dask_glm.regularizers.ElasticNet()

b = dask_glm.algorithms.proximal_grad(
    X, y,
    max_iter=5,
    family=family,
    regularizer=regularizer,
)


## Hyperparameter Optimization with Dask distributed machine learning

Scikit-learn uses joblib for single-machine parallelism. This lets you train most estimators (anything that accepts an n_jobs parameter) using all the cores of your laptop or workstation. Alternatively, scikit-learn can use Dask for parallelism. This lets you train those estimators using all the cores of your cluster without significantly changing your code.

In [None]:
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import pandas as pd


In [None]:
X, y = make_classification(n_samples=1000, random_state=0)
X[:5]

In [None]:
y[:5]

In [None]:
param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}

grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           iid=True,  # deprecated since scikit-learn 0.22 and removed in 0.24; drop this argument on newer versions
                           cv=3,
                           n_jobs=-1)
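
It helps to know how much work this grid implies before sending it to the cluster. The sketch below counts the candidate settings and the resulting number of cross-validation fits using only the standard library. It copies the grid and the `cv=3` setting from the cell above and does not train anything; `cv_folds`, `candidates`, `n_candidates`, and `n_fits` are names introduced here for illustration.

```python
from itertools import product

# Same grid as in the search above.
param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ["rbf", "poly", "sigmoid"],
              "shrinking": [True, False]}
cv_folds = 3

# Enumerate every combination of parameter values.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 48 144
```

With the default `refit=True`, GridSearchCV performs one additional fit of the best candidate on the full data after the search.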


In [None]:
grid_search.fit(X, y)

In [None]:
import joblib

with joblib.parallel_backend('dask'):
    grid_search.fit(X, y)


In [None]:
pd.DataFrame(grid_search.cv_results_).head()

## Run your Python script container for your machine learning work

Make sure to follow the steps in the GitHub repo for building and deploying this container before running this step. Change {YOURACCOUNTID} to your AWS account ID.

In [None]:
!aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin {YOURACCOUNTID}.dkr.ecr.us-west-2.amazonaws.com

In [None]:
!docker run -e s3url='s3://nyc-tlc/trip data/green_tripdata_2018-02.csv' -e schurl='tcp://Dask-Scheduler.local-dask:8786' {YOURACCOUNTID}.dkr.ecr.us-west-2.amazonaws.com/daskclientapp:latest

## Scale in the Fargate cluster worker nodes after all work is done

In [None]:
!aws ecs update-service --service Dask-Workers --desired-count 1 --cluster Fargate-Dask-Cluster