# Machine Learning on Distributed Dask with SageMaker and Fargate

This notebook demonstrates how to perform machine learning on distributed Dask using SageMaker and Fargate. We show how to connect to a distributed Dask cluster running on Fargate, scale out the Dask worker nodes, and perform exploratory data analysis (EDA) on a public New York cab trip data set. We then demonstrate how to run regression algorithms and hyperparameter optimization on the distributed Dask cluster. Next, we show how to monitor the operational metrics of the Dask cluster, which is fronted by a Network Load Balancer so that the Dask cluster status UI can be reached from the internet. Finally, we close with how to build your own Python script container and run it against the Dask Fargate cluster. This notebook was inspired by a customer use case in which Dask was run on a local computer to build regression models.

## Set up required packages

In [None]:
!conda update -y dask

In [None]:
!conda update -y s3fs

## Connect to the Dask Fargate cluster

You need to provision this cluster first by following the instructions at https://github.com/rvvittal/aws-dask-sm-fargate

In [None]:
from dask.distributed import Client
client = Client('Dask-Scheduler.local-dask:8786')

## Scale out the number of Dask workers as needed for your data science work

In [None]:
!aws ecs update-service --service Dask-Workers --desired-count 10 --cluster Fargate-Dask-Cluster
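The same scale-out can be issued from Python instead of the shell. The sketch below wraps the boto3 ECS `update_service` call; the helper name `scale_dask_workers` is illustrative and not part of the referenced repository, and the ECS client is passed in as an argument so the function can be exercised without AWS credentials.

```python
def scale_dask_workers(ecs_client, desired_count,
                       cluster="Fargate-Dask-Cluster", service="Dask-Workers"):
    """Set the desired task count of the Dask worker service (hypothetical helper)."""
    if desired_count < 0:
        raise ValueError("desired_count must be non-negative")
    # Equivalent to: aws ecs update-service --service ... --desired-count ... --cluster ...
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=desired_count,
    )
    # ECS echoes the updated service description back in the response.
    return response["service"]["desiredCount"]

# Usage in the notebook (assumes boto3 is installed and credentials are configured):
# import boto3
# scale_dask_workers(boto3.client("ecs"), 10)
```

Passing the client in also makes it easy to reuse the helper for the scale-in step by calling it with a smaller count.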

## Restart the client after the scale-out operation

In [None]:
client.restart()
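New Fargate tasks take some time to start, so the extra workers may not have registered with the scheduler right after the scale-out command returns. A minimal sketch of one way to wait for them is shown below; it polls `client.nthreads()`, which returns one entry per connected worker. The helper name `wait_for_worker_count` and its default timeouts are assumptions chosen for illustration. The Dask client also offers a built-in `client.wait_for_workers(n_workers, timeout)` that serves the same purpose.

```python
import time

def wait_for_worker_count(client, expected, timeout=300.0, poll_interval=5.0):
    """Poll the scheduler until at least `expected` workers are connected."""
    deadline = time.monotonic() + timeout
    while True:
        # nthreads() maps each worker address to its thread count,
        # so the number of keys is the number of connected workers.
        n_workers = len(client.nthreads())
        if n_workers >= expected:
            return n_workers
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {n_workers} of {expected} workers joined within {timeout}s"
            )
        time.sleep(poll_interval)

# Usage in the notebook, after scaling the service to 10 workers:
# wait_for_worker_count(client, expected=10)
```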

## Load the trip data into a Dask dataframe
### TODO: update S3 trip data set with actual public location


In [None]:
import s3fs  # backend that Dask uses for s3:// paths
import dask.dataframe as dd

# 'anon': False makes s3fs sign requests with your AWS credentials
df = dd.read_csv('s3://octank-claims-web/public-data/yellow_tripdata_2018-01.csv', storage_options={'anon': False})
df

## Persist the Dask dataframe in cluster memory

In [None]:
df_persisted = client.persist(df)
print(df_persisted.head())

## Compute the mean trip distance grouped by the number of passengers

In [None]:
# Group the persisted dataframe so the computation reuses data already in worker memory
grouped_df = df_persisted.groupby('passenger_count').trip_distance.mean().compute()
print(grouped_df)

## Compute the trip count and total trip distance for each vendor

In [None]:
%%time
df.groupby('VendorID').agg({'passenger_count':'count', 'trip_distance': 'sum'}).astype(int).reset_index()\
.rename(columns={'passenger_count':'Trip Count'}).compute()

## Count missing values for each feature

In [None]:
df.isna().sum().compute()
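Raw missing-value counts are easier to interpret relative to the number of rows. The sketch below converts a mapping of counts into fractions, sorted from most to least incomplete. The helper `missing_fractions` is hypothetical; it works on a plain dictionary, which can be obtained from the pandas Series returned by the cell above with `.to_dict()`.

```python
def missing_fractions(missing_counts, total_rows):
    """Convert per-column missing counts into fractions of all rows (hypothetical helper)."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    fractions = {col: count / total_rows for col, count in missing_counts.items()}
    # Sort so the columns with the most missing values come first.
    return dict(sorted(fractions.items(), key=lambda item: item[1], reverse=True))

# Usage in the notebook (len(df) triggers a row count on the cluster):
# counts = df.isna().sum().compute().to_dict()
# missing_fractions(counts, len(df))
```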

## Run your Python script container for your machine learning work
### Make sure to follow the steps in the GitHub repo for building and deploying this container before running this step

In [None]:
!aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin <<your-account-id>>.dkr.ecr.us-west-2.amazonaws.com


In [None]:
!docker run -e s3url='s3://octank-claims-web/public-data/yellow_tripdata_2018-01.csv' -e schurl='tcp://Dask-Scheduler.local-dask:8786' <<your-account-id>>.dkr.ecr.us-west-2.amazonaws.com/daskclientapp:latest
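The `docker run` command above passes two environment variables, `s3url` and `schurl`, to the container. The actual client script lives in the GitHub repo and may differ; the sketch below only illustrates how a script could read and validate those two variables at startup. The helper name `load_settings` and the returned keys are assumptions.

```python
import os

def load_settings(environ=None):
    """Read the variables passed with `docker run -e` (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    missing = [name for name in ("s3url", "schurl") if not environ.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {"s3_url": environ["s3url"], "scheduler_url": environ["schurl"]}

# Inside the container, the settings could then drive the same calls used earlier:
# from dask.distributed import Client
# import dask.dataframe as dd
# settings = load_settings()
# client = Client(settings["scheduler_url"])
# df = dd.read_csv(settings["s3_url"])
```

Failing fast on a missing variable gives a clearer error than a connection timeout against an empty scheduler address.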

## TODO:  Visual EDA 

## TODO: Regression modeling with Scikit Learn

## Scale in the Fargate cluster worker nodes after all work is done

In [None]:
!aws ecs update-service --service Dask-Workers --desired-count 1 --cluster Fargate-Dask-Cluster