# Machine Learning on Distributed Dask with SageMaker and Fargate

This notebook demonstrates how to perform machine learning on a distributed Dask cluster using SageMaker and Fargate. We show how to connect to a Dask cluster running on Fargate, scale out the Dask worker nodes, and perform exploratory data analysis (EDA) on the public New York City taxi trip data sets. We then run regression algorithms and hyperparameter optimization on the distributed Dask cluster. Finally, we show how to monitor the operational metrics of the Dask cluster, which is fronted by a Network Load Balancer so that the Dask cluster status UI can be accessed from the internet.

# 1. Set up conda package dependencies
Running distributed Dask on a SageMaker notebook and a Fargate cluster requires additional conda packages and newer versions of a few existing ones. The SageMaker notebook's conda_python3 environment does not ship with these packages, so we install them in this section.

scikit-learn 0.23 is required so that its joblib-based parallelism can be integrated with Dask for processing across the distributed cluster.

dask-ml provides scalable machine learning in Python using Dask alongside popular machine learning libraries like Scikit-Learn, XGBoost, and others.

cloudpickle 1.6.0 is required to serialize Python constructs not supported by the default pickle module from the Python standard library.

In [None]:
!conda install scikit-learn=0.23.2 -c conda-forge -n python3 -y

In [None]:
!conda install -n python3 dask-ml=1.6.0 -c conda-forge -y

In [None]:
!conda install -n python3 cloudpickle=1.6.0 -c conda-forge -y
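
After the installs finish, it can help to confirm that the kernel actually sees the pinned versions. The following is a minimal sketch using only the standard library, assuming Python 3.8 or later; the helper name `installed_versions` is hypothetical and not part of any package. If a package was already imported in this session, restart the kernel before checking.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names pinned in the install cells of this section.
for name, version in installed_versions(["scikit-learn", "dask-ml", "cloudpickle"]).items():
    print(f"{name}: {version or 'not installed'}")
```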

# 2. Set up the Dask client

In [None]:
from dask.distributed import Client

# Enable this client for local testing on a single machine
#client = Client()

# Enable this client for testing against a local distributed cluster
#client = Client('localhost:8786')

# Enable this client for the distributed cluster running on Fargate
client = Client('Dask-Scheduler.local-dask:8786')

## Scale out the number of Dask workers as needed for your data science work

In [None]:
# Enable this when the cluster is running on Fargate to scale out the workers.
!sudo aws ecs update-service --service Dask-Workers --desired-count 2 --cluster Fargate-Dask-Cluster
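
The same scale-out request can be issued from Python with boto3 instead of the AWS CLI. This is a minimal sketch, not the notebook's own method: the helper name `scale_dask_workers` is hypothetical, the ECS client is passed in by the caller, and the default cluster and service names are simply the ones used in the CLI command.

```python
def scale_dask_workers(ecs_client, desired_count,
                       cluster="Fargate-Dask-Cluster", service="Dask-Workers"):
    """Set the desired task count of the worker service and return the value ECS reports."""
    response = ecs_client.update_service(
        cluster=cluster, service=service, desiredCount=desired_count
    )
    return response["service"]["desiredCount"]


# Example call (requires AWS credentials with permission to update the service):
#   import boto3
#   scale_dask_workers(boto3.client("ecs"), 2)
```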

## Restart the client after scale out operation

In [None]:
client.restart()
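
New Fargate tasks take some time to start and register with the scheduler, so work submitted right after scaling may still run on the old set of workers. A minimal sketch for blocking until the expected number of workers is available, assuming the `client` created in this section; the helper name `wait_for_scale_out` and the 300-second timeout are illustrative choices.

```python
def wait_for_scale_out(client, expected_workers, timeout=300):
    """Block until `expected_workers` workers have registered, then return the count."""
    client.wait_for_workers(n_workers=expected_workers, timeout=timeout)
    # nthreads() maps each worker address to its thread count.
    return len(client.nthreads())


# Example call against the cluster client:
#   wait_for_scale_out(client, 2)
```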

# 3. Exploratory Data Analysis (EDA)

We use a Dask DataFrame and perform various operations on it for data analysis.

A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index. These Pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. One Dask DataFrame operation triggers many operations on the constituent Pandas DataFrames. For more details, review this page: https://docs.dask.org/en/latest/dataframe.html 



In [None]:
import s3fs
import dask.dataframe as dd
import boto3


In [None]:
df = dd.read_csv(
    's3://nyc-tlc/trip data/yellow_tripdata_2018-01.csv',
    storage_options={'anon': True},
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
)

##  Calculate the trip duration in seconds 

In [None]:
# total_seconds() gives the full elapsed time; .dt.seconds would drop whole days
df['trip_dur_secs'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds()
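
For a timedelta column, `.dt.seconds` returns only the seconds component (0 to 86399) and drops whole days, while `.dt.total_seconds()` returns the full elapsed time. The distinction matters for any trip record that spans more than 24 hours. A small pandas sketch with made-up timestamps:

```python
import pandas as pd

pickup = pd.Series(pd.to_datetime(["2018-01-01 23:50:00", "2018-01-01 08:00:00"]))
dropoff = pd.Series(pd.to_datetime(["2018-01-02 00:10:00", "2018-01-02 08:00:30"]))
duration = dropoff - pickup

seconds_component = duration.dt.seconds      # ignores whole days
total_seconds = duration.dt.total_seconds()  # full elapsed time
print(seconds_component.tolist(), total_seconds.tolist())
```

The first trip lasts 20 minutes, so both values agree; the second lasts one day and 30 seconds, so only `total_seconds()` reflects the real duration.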

In [None]:
%%time
df.head()

## Calculate max trip duration across all trips

In [None]:
%%time
max_trip_duration = df.trip_dur_secs.max().compute()
print(max_trip_duration)

## Calculate the mean passenger count across trips by pickup date

In [None]:
# Derive the pickup date from the pickup timestamp
df['pickup_date'] = df['tpep_pickup_datetime'].dt.date

In [None]:
%%time
df.head()

In [None]:
%%time
df_mean_psngr_pickup_date = df.groupby('pickup_date').passenger_count.mean().compute()

## Calculate total trips by pickup date

In [None]:
%%time
df_trips_by_pickup_date = df.groupby('pickup_date').store_and_fwd_flag.count().compute()
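
Note that `count()` on a column counts only non-null values, so the result equals the number of trips only if `store_and_fwd_flag` has no missing values; `size()` counts rows regardless. A small pandas sketch of the difference, using invented rows:

```python
import pandas as pd

trips = pd.DataFrame({
    "pickup_date": ["2018-01-01", "2018-01-01", "2018-01-02"],
    "store_and_fwd_flag": ["N", None, "Y"],
})

non_null_counts = trips.groupby("pickup_date").store_and_fwd_flag.count()
row_counts = trips.groupby("pickup_date").size()
print(non_null_counts.tolist(), row_counts.tolist())
```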

In [None]:
len(df_trips_by_pickup_date)

In [None]:
df_trips_by_pickup_date.head()

In [None]:
# load and count number of rows
len(df)

In [None]:
df.dtypes

## Persist collections into memory
Calls to Client.compute or Client.persist submit task graphs to the cluster and return Future objects that point to particular output tasks. Compute returns a single future per input, whereas persist returns a copy of the collection with each block or partition replaced by a single future. In short, use persist to keep the full collection on the cluster, and use compute when you want a small result as a single future.


In [None]:
from dask.distributed import Client, progress


In [None]:
%%time
df_persisted = client.persist(df)
print(df_persisted.head())

## Compute the mean trip distance grouped by the number of passengers

In [None]:
%%time
grouped_df = df_persisted.groupby('passenger_count').trip_distance.mean().compute()
print(grouped_df)

## Compute Max trip distance

In [None]:
%%time
max_trip_dist = df_persisted.trip_distance.max().compute()
print(max_trip_dist)

## Compute the trip count and total trip distance for each vendor

In [None]:
%%time
df.groupby('VendorID').agg({'passenger_count':'count', 'trip_distance': 'sum'}).astype(int).reset_index()\
.rename(columns={'passenger_count':'Trip Count'}).compute()

## Count Missing Values for Each Feature

In [None]:
df.isna().sum().compute()

## Visualize your Exploratory Data Analysis

This section demonstrates how to perform visual exploratory data analysis.

In [None]:
# Select the top 10 rides by fare amount
most_paid_rides_dask = df[['PULocationID', 'fare_amount']].nlargest(10, "fare_amount")

In [None]:
# Visualize the most paid rides as a horizontal bar plot
import matplotlib.pyplot as plt
# Compute first: the nlargest result is small and is not sorted by PULocationID
most_paid_rides_dask.compute().set_index('PULocationID').plot(kind='barh', stacked=False, figsize=[10,8], legend=True)
plt.title('Most Paid Rides')
plt.xlabel('Fare Amount')
plt.ylabel('PU LocationID')
plt.show()


In [None]:
# Visualize fares for the 10 longest trips as a bar plot
import matplotlib.pyplot as plt
most_paid_rides_dask2 = df[['trip_distance', 'fare_amount']].nlargest(10, "trip_distance")
# Compute first: the nlargest result is small and is not in ascending index order
most_paid_rides_dask2.compute().set_index('trip_distance').plot(kind='bar', colormap='PiYG', stacked=False, figsize=[10,8], legend=True)
plt.title('Fares by Distance')
plt.xlabel('Trip Distance')
plt.ylabel('Fare Amount')
plt.show()

# 4. Regression modeling with Scikit-Learn and Distributed Dask

This section demonstrates how to perform regression modeling using Scikit-Learn on a distributed Dask back-end. We continue with the New York City taxi trip data, this time the green taxi data set, and predict the duration of each trip using linear regression.

Many Scikit-Learn algorithms are written for parallel execution using Joblib, which natively provides thread-based and process-based parallelism. Joblib is what backs the n_jobs= parameter in normal use of Scikit-Learn. Dask can scale these Joblib-backed algorithms out to a cluster of machines by providing an alternative Joblib backend. 


In [None]:
dfl = dd.read_csv(
    's3://nyc-tlc/trip data/green_tripdata_2018-02.csv', storage_options={'anon': True},
    parse_dates=['lpep_pickup_datetime', 'lpep_dropoff_datetime'],
).sample(frac=0.8, replace=True)

In [None]:
dfl['trip_duration'] = dfl['lpep_dropoff_datetime'] - dfl['lpep_pickup_datetime']

In [None]:
import numpy as np
dfl['trip_duration'] = dfl['trip_duration']/np.timedelta64(1,'D')

In [None]:
dfl['trip_duration'] = dfl['trip_duration'] * 24
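
The two cells above convert the timedelta to a fraction of a day and then multiply by 24, which leaves `trip_duration` in hours. Dividing by a one-hour timedelta gives the same result in one step, as this small pandas sketch with made-up durations shows:

```python
import numpy as np
import pandas as pd

durations = pd.Series([pd.Timedelta(minutes=90), pd.Timedelta(hours=2, minutes=15)])

# Two-step conversion: fraction of a day, then hours.
hours_two_step = durations / np.timedelta64(1, "D") * 24
# Equivalent single step: divide by one hour.
hours_one_step = durations / np.timedelta64(1, "h")
print(hours_one_step.tolist())
```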

In [None]:
dfl['trip_duration']

In [None]:
dfl.head()

In [None]:
dfl = dfl.fillna(value=0)

In [None]:
dfl = dd.get_dummies(dfl.categorize()).compute()

In [None]:
dfl.head()

In [None]:
x = dfl[['VendorID','RatecodeID','PULocationID','DOLocationID','passenger_count','trip_distance','fare_amount','total_amount']]

In [None]:
y = dfl['trip_duration']

In [None]:
from dask_ml.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1)

In [None]:
len(X_train), len(X_test), len(y_train), len(y_test)

In [None]:
training_x = X_train.values
training_y = y_train.values

In [None]:
testing_x = X_test.values
testing_y = y_test.values

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error

In [None]:
def rmse(preds, actuals):
    # Root mean squared error between predictions and actual values
    error = mean_squared_error(actuals, preds)
    value = np.sqrt(error)
    print(value)
    return value
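
As a worked check of the metric, the root mean squared error can be computed by hand on three invented predictions. This sketch uses only NumPy; the helper name `rmse_value` is hypothetical and is kept separate from the notebook's own function.

```python
import numpy as np


def rmse_value(preds, actuals):
    """Return the root mean squared error of preds against actuals."""
    preds = np.asarray(preds, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    return float(np.sqrt(np.mean((actuals - preds) ** 2)))


# Squared errors are 1, 0 and 4, so the mean squared error is 5/3.
score = rmse_value([2.0, 4.0, 6.0], [1.0, 4.0, 8.0])
print(round(score, 4))
```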

In [None]:
from dask_ml.linear_model import LinearRegression
lr = LinearRegression(random_state=1, n_jobs=-1, fit_intercept=True)
lr.fit(training_x,training_y)

In [None]:
import joblib
from dask_ml.linear_model import LinearRegression

with joblib.parallel_backend('dask'):
    lr = LinearRegression(random_state=1, fit_intercept=False)
    lr.fit(training_x,training_y)

In [None]:
preds = lr.predict(testing_x)
rmse(preds, testing_y)

# 5. Hyperparameter Optimization with Dask distributed machine learning

This section demonstrates how to perform hyperparameter optimization on a distributed Dask cluster.

Scikit-learn uses joblib for single-machine parallelism. This lets you train most estimators (anything that accepts an n_jobs parameter) using all the cores of your laptop or workstation. Alternatively, Scikit-Learn can use Dask for parallelism. This lets you train those estimators using all the cores of your cluster without significantly changing your code.

In [None]:
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import pandas as pd


In [None]:
X, y = make_classification(n_samples=1000, random_state=0)
X[:5]

In [None]:
y[:5]

In [None]:
param_grid = {"C": [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0],
              "kernel": ['rbf', 'poly', 'sigmoid'],
              "shrinking": [True, False]}

grid_search = GridSearchCV(SVC(gamma='auto', random_state=0, probability=True),
                           param_grid=param_grid,
                           return_train_score=False,
                           cv=3,
                           n_jobs=-1)


In [None]:
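with joblib.parallel_backend('dask'):  # run the search on the Dask cluster (joblib is imported in section 4)
    grid_search.fit(X, y)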

In [None]:
pd.DataFrame(grid_search.cv_results_).head()

In [None]:
grid_search.score(X, y)
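
Because `grid_search.score(X, y)` evaluates on the same data used for fitting, it tends to be optimistic. The fitted search object also exposes `best_params_` and `best_score_`, the latter being the mean cross-validated score of the winning combination. A minimal sketch on a deliberately small grid and data set, run locally so that it finishes quickly; the names `X_small`, `y_small`, `small_grid` and `search` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_small, y_small = make_classification(n_samples=200, random_state=0)
small_grid = {"C": [0.1, 1.0], "kernel": ["rbf", "sigmoid"]}

search = GridSearchCV(SVC(gamma="auto", random_state=0), param_grid=small_grid, cv=3)
search.fit(X_small, y_small)

# best_params_ is the winning combination; best_score_ is its mean cross-validated accuracy.
print(search.best_params_, round(search.best_score_, 3))
```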


# 6. Monitoring Dask Cluster Operational Metrics

This section discusses how to set up monitoring for your distributed Dask cluster, check its health, and understand what happens behind the scenes as your workload executes on the back-end. For detailed documentation on monitoring a Dask cluster, visit this page: https://docs.dask.org/en/latest/diagnostics-distributed.html


## Set up the Dask Dashboard UI

1. Navigate to Amazon ECS > Fargate Dask Cluster > Dask Scheduler Service > Tasks and select the running task.
2. Copy the private IP of the running task.
3. Navigate to EC2 > Target Groups and select dask-scheduler-tg.
4. Select Targets and click Register targets.
5. Select dask-vpc-main, paste the private IP from step 2, and click the "Include as pending below" button.
6. Navigate to EC2 > Load Balancers and copy the DNS name into a browser tab to view the Dask Dashboard.

## Dask Dashboard - Example UI

![dashboard](./dask-dashboard-ui.png)

# 7. Scale in the Fargate cluster worker nodes after all work is done

In [None]:
!sudo aws ecs update-service --service Dask-Workers --desired-count 1 --cluster Fargate-Dask-Cluster