# Getting started with Dask on Saturn Cloud


Dask is a framework that lets you run Python code in parallel across distributed machines. This notebook is a small example of using Dask on Saturn Cloud: it connects to a Dask cluster, loads EPC price data, and runs a trained pipeline's predictions on the cluster.

_For more details about the basics of Dask, read the [Parallelization in Python](https://www.saturncloud.io/docs/reference/dask_concepts/) article in the Saturn Cloud docs._ You can also look at the [Saturn Cloud Dask examples](https://www.saturncloud.io/docs/examples/dask/), and [the official Dask documentation](https://docs.dask.org/en/latest/).
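As a minimal illustration of the basic idea (running one Python function over a list of inputs in parallel), the sketch below uses `dask.delayed`. The `power` function and the input list are made up for illustration, and the local threaded scheduler is requested explicitly so the snippet runs without a cluster.

```python
import dask


@dask.delayed
def power(base, exponent):
    """Return base raised to exponent; the decorator makes each call lazy."""
    return base ** exponent


inputs = [1, 2, 3, 4, 5]  # illustrative inputs, not data from this notebook

# Build one lazy task per input; nothing is computed yet.
tasks = [power(x, 2) for x in inputs]

# Run the tasks in parallel. The local threaded scheduler is requested here
# so the sketch works without a cluster.
results = dask.compute(*tasks, scheduler="threads")
print(results)
```

`dask.compute` returns a tuple with one result per task, in the same order as the inputs.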

Before running this example, you need to create a Dask cluster associated with this project. You can create the cluster through the [Saturn Cloud project page](https://www.saturncloud.io/docs/getting-started/create_cluster_ui/), or [programmatically in Python](https://www.saturncloud.io/docs/getting-started/create_cluster/#create-clustersaturncluster-object).

This code chunk imports the Dask libraries, starts a Saturn Cloud Dask cluster with `n_jobs` workers, and connects a client to it.

In [9]:
import dask
from dask.distributed import Client
from dask_saturn import SaturnCluster

# cluster = SaturnCluster.reset()

n_jobs = 10  # number of Dask workers to request

cluster = SaturnCluster(
    scheduler_size='2xlarge',
    worker_size='medium',
    nthreads=2,
    n_workers=n_jobs,
)

client = Client(cluster)

INFO:dask-saturn:Starting cluster. Status: pending
INFO:dask-saturn:Starting cluster. Status: pending
INFO:dask-saturn:Starting cluster. Status: pending
INFO:dask-saturn:Starting cluster. Status: pending
INFO:dask-saturn:Starting cluster. Status: pending
INFO:dask-saturn:Starting cluster. Status: pending
INFO:dask-saturn:Starting cluster. Status: pending
INFO:dask-saturn:Cluster is ready
INFO:dask-saturn:Registering default plugins
INFO:dask-saturn:{}


In [None]:
client.wait_for_workers(n_jobs)

In [None]:
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://d-admin-bciavm-64ad14d7cb76443591b4d1ebacd4bbf0.main-namespace:8786 | Workers: 20 |
| Dashboard: https://d-admin-bciavm-64ad14d7cb76443591b4d1ebacd4bbf0.gcodeai.saturnenterprise.io | Cores: 40 |
| | Memory: 70.00 GB |


In [None]:
import pandas as pd
import bciavm
from bciavm.core.config import your_bucket
from bciavm.utils.bci_utils import ReadParquetFile, get_postcodeOutcode_from_postcode, get_postcodeArea_from_outcode, drop_outliers, preprocess_data
from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit

# Read one parquet extract per year and combine them into a single DataFrame.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
yearArray = ['2020', '2019']
yearlyFrames = [
    pd.DataFrame(ReadParquetFile(your_bucket, 'epc_price_data/byDate/2021-02-04/parquet/' + year))
    for year in yearArray
]
dfPricesEpc = pd.concat(yearlyFrames)

# Derive the postcode outcode and area, then count rows per address-matching type.
dfPricesEpc['POSTCODE_OUTCODE'] = dfPricesEpc['Postcode'].apply(get_postcodeOutcode_from_postcode)
dfPricesEpc['POSTCODE_AREA'] = dfPricesEpc['POSTCODE_OUTCODE'].apply(get_postcodeArea_from_outcode)
dfPricesEpc.groupby('TypeOfMatching_m').count()['Postcode']

TypeOfMatching_m
1. Address Matched            699206
2. Address Matched No Spec     26055
3. No in Address Matched      325243
4. No match                   339476
Name: Postcode, dtype: int64
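The `bciavm` helpers that derive `POSTCODE_OUTCODE` and `POSTCODE_AREA` are not shown in this notebook. As a rough sketch of what such a derivation could look like, the hypothetical stand-ins below (`outcode_from_postcode` and `area_from_outcode`, which are not the library's functions) assume well-formed UK postcodes, where the inward code is always the last three characters and the area is the leading run of letters.

```python
import re


def outcode_from_postcode(postcode):
    """Hypothetical stand-in: return the outward code of a UK postcode.

    Assumes a well-formed postcode, whose inward code is the final
    three characters once spaces are removed.
    """
    compact = postcode.replace(" ", "").upper()
    return compact[:-3]


def area_from_outcode(outcode):
    """Hypothetical stand-in: return the leading letters of an outward code."""
    match = re.match(r"[A-Z]+", outcode.upper())
    return match.group(0) if match else ""


for postcode in ["SW1A 1AA", "M1 1AE", "b33 8th"]:
    outcode = outcode_from_postcode(postcode)
    print(postcode, "->", outcode, "->", area_from_outcode(outcode))
```

Malformed or missing postcodes would need extra handling that this sketch leaves out.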

## Preprocessing Data
Prior to training a model, check for missing values and split the data into training and validation sets.

In [None]:
# initial preprocessing+cleaning
train, test = preprocess_data(dfPricesEpc)
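The internals of `preprocess_data` are not shown here. The sketch below illustrates the two steps named above (counting missing values and splitting into training and held-out sets) using plain pandas on a toy frame. The values are invented, the `Price` column name is an assumption, and the 80/20 split ratio is chosen only for illustration.

```python
import pandas as pd

# Toy data: column names borrow from this notebook where possible,
# but "Price" and all values are made up for illustration.
df = pd.DataFrame({
    "TOTAL_FLOOR_AREA_e": [55.0, None, 120.0, 80.0, 64.0],
    "NUMBER_HEATED_ROOMS_e": [3.0, 4.0, None, 5.0, 2.0],
    "Price": [150000, 210000, 330000, 250000, 180000],
})

# Count missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Randomly hold out 20% of the rows; the rest form the training set.
train_demo = df.sample(frac=0.8, random_state=0)
test_demo = df.drop(train_demo.index)
print(len(train_demo), len(test_demo))
```

A fixed `random_state` keeps the split reproducible between runs.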

In [None]:
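import dask.dataframe as dd
from dask.distributed import wait

# NOTE: X_test is not defined in the cells shown here; it is assumed to be the
# pandas DataFrame of test features produced by the preprocessing step.
X_test_arr = dd.from_pandas(X_test, npartitions=n_jobs)

# dask.persist returns a tuple of persisted collections, so the Dask DataFrame
# is accessed as X_test_arr[0] in the cells below.
X_test_arr = dask.persist(X_test_arr)
_ = wait(X_test_arr)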
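The top-level `dask.persist` function returns a tuple even when it is given a single collection, which is why later cells index the result with `[0]`. The sketch below shows this on a small delayed object. `load_chunk` is a made-up function, and the local threaded scheduler is used so the snippet runs without a cluster.

```python
import dask


@dask.delayed
def load_chunk(n):
    """Made-up stand-in for an expensive load step."""
    return list(range(n))


lazy = load_chunk(3)

# dask.persist always returns a tuple, one entry per argument passed in.
persisted = dask.persist(lazy, scheduler="threads")
print(type(persisted), len(persisted))

# Index (or unpack) the tuple to get the persisted object back.
first = persisted[0]
value = first.compute(scheduler="threads")
print(value)
```

Unpacking with `(first,) = dask.persist(lazy)` is an equivalent way to avoid carrying the tuple around.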

In [None]:
X_test_arr[0]

| | unit_indx | POSTCODE | POSTCODE_OUTCODE | POSTTOWN_e | PROPERTY_TYPE_e | TOTAL_FLOOR_AREA_e | NUMBER_HEATED_ROOMS_e | FLOOR_LEVEL_e | Latitude_m | Longitude_m | POSTCODE_AREA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| npartitions=20 | | | | | | | | | | | |
| 1544 | Int64 | category[known] | category[known] | category[known] | category[known] | float64 | float64 | float64 | float64 | float64 | category[known] |
| 37710 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 696716 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 731157 | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


## Load The Pipeline
Load the pipeline we trained in the **hypertuning** notebook.

In [None]:
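import bciavm
from bciavm.pipelines import RegressionPipeline

# Path to the pipeline saved by the hypertuning notebook; left blank here, set it before running.
model_path = ''
# Load the saved pipeline through the imported pipeline class.
avm_pipeline = RegressionPipeline.load(model_path)
avm_pipeline.parameters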

{'Imputer': {'categorical_impute_strategy': 'most_frequent',
  'numeric_impute_strategy': 'mean',
  'categorical_fill_value': None,
  'numeric_fill_value': None,
  'random_seed': 0},
 'One Hot Encoder': {'top_n': 100,
  'features_to_encode': ['agg_cat'],
  'categories': None,
  'drop': None,
  'handle_unknown': 'ignore',
  'handle_missing': 'error',
  'random_seed': 0},
 'K Nearest Neighbors Regressor': {'n_neighbors': 5,
  'weights': 'distance',
  'algorithm': 'auto',
  'leaf_size': 20,
  'p': 2,
  'metric': 'minkowski',
  'n_jobs': 4},
 'XGBoost Regressor': {'learning_rate': 0.06325261812661621,
  'max_depth': 14,
  'min_child_weight': 0.6718934260322275,
  'reg_alpha': 0.043706006022706405,
  'reg_lambda': 0.026408282583277758,
  'n_estimators': 766},
 'MultiLayer Perceptron Regressor': {'activation': 'relu',
  'solver': 'adam',
  'alpha': 0.043706006022706405,
  'batch_size': 'auto',
  'learning_rate': 'constant',
  'learning_rate_init': 0.001,
  'max_iter': 500,
  'early_stopping'

In [14]:
len(X_test_arr[0])

1450

In [63]:
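%%time
# Cluster configuration for this run: 10 workers x 2 threads per worker = 20 threads.
# NOTE: wmodel is not defined in the cells shown here; it is assumed to be a
# Dask-aware wrapper around avm_pipeline.
predicted = wmodel.predict(X_test_arr[0]).compute(scheduler='distributed', n_jobs=n_jobs)
predicted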

CPU times: user 55.4 s, sys: 5.68 s, total: 1min 1s
Wall time: 7min 44s


array([[66, 261164.0, nan, ..., '1621092479.787824', 1, 0],
       [411, 211335.0, 74850.0, ..., '1621092484.002987', 1, 0],
       [1056, 140014.0, 69653.0, ..., '1621092486.532709', 1, 0],
       ...,
       [730039, 250948.0, 164031.0, ..., '1621092819.282984', 1, 0],
       [730481, 242548.0, 185670.0, ..., '1621092821.8026', 1, 0],
       [730754, 237321.0, 48984.0, ..., '1621092824.096057', 1, 0]],
      dtype=object)
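To put the timing above in per-row terms, the arithmetic below combines the reported wall time (7min 44s), the reported CPU time (user 55.4 s plus sys 5.68 s), and the row count of 1450 from the earlier `len` cell. It assumes those figures come from the same run. `%%time` reports the CPU time of the notebook process only, so the gap between CPU and wall time is time the notebook spent waiting rather than computing locally.

```python
# Figures copied from the outputs above.
wall_seconds = 7 * 60 + 44      # wall time: 7min 44s
cpu_seconds = 55.4 + 5.68       # user + sys CPU time in the notebook process
n_rows = 1450                   # len(X_test_arr[0])

seconds_per_row = wall_seconds / n_rows
rows_per_second = n_rows / wall_seconds
cpu_fraction = cpu_seconds / wall_seconds

print(f"{seconds_per_row:.2f} s per row")
print(f"{rows_per_second:.3f} rows per second")
print(f"{cpu_fraction:.1%} of wall time was local CPU time")
```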


When you're done, you can close the connection to the cluster:

In [79]:
client.close()

In [80]:
cluster.close()