# Scaling XGBoost with Dask and Coiled

This notebook walks through training a distributed [XGBoost](https://xgboost.readthedocs.io/en/latest/) model in two stages. First, we train locally on a small subset of the data using [Dask](https://dask.org/). Then we use Dask and [Coiled](https://coiled.io/) to scale out to the cloud and train XGBoost on a larger-than-memory dataset.

In [None]:
# coiled.create_software_environment(
#     name='coiled-xgboost',
#     conda="/Users/rpelgrim/Documents/coiled/coiled-local/xgboost/xgboost.yml"
# )

## 1. Importing Libraries

We'll start by importing all the libraries we'll need to run this notebook.

In [1]:
import coiled
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster
from dask_ml.preprocessing import Categorizer
from dask_ml.model_selection import train_test_split
import xgboost as xgb

## 2. Local Distributed XGBoost Model using Dask

Next, let's instantiate a local version of the Dask distributed scheduler using the **LocalCluster** object. 

This object will handle parallelism for us on our local machine.

In [2]:
# local dask cluster
cluster = LocalCluster(n_workers=4)
client = Client(cluster)
client

**Client**

- Connection method: Cluster object
- Cluster type: LocalCluster
- Dashboard: http://127.0.0.1:8787/status

**Cluster**

- Status: running
- Using processes: True
- Dashboard: http://127.0.0.1:8787/status
- Workers: 4
- Total threads: 8
- Total memory: 16.00 GiB

**Scheduler**

- Comm: tcp://127.0.0.1:54693
- Workers: 4
- Dashboard: http://127.0.0.1:8787/status
- Total threads: 8
- Started: Just now
- Total memory: 16.00 GiB

**Workers**

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tcp://127.0.0.1:54705 | http://127.0.0.1:54706/status | tcp://127.0.0.1:54697 | 2 | 4.00 GiB | /Users/rpelgrim/Documents/coiled/coiled-local/xgboost/dask-worker-space/worker-ldk4yow0 |
| tcp://127.0.0.1:54699 | http://127.0.0.1:54700/status | tcp://127.0.0.1:54696 | 2 | 4.00 GiB | /Users/rpelgrim/Documents/coiled/coiled-local/xgboost/dask-worker-space/worker-h4u5zbwi |
| tcp://127.0.0.1:54701 | http://127.0.0.1:54703/status | tcp://127.0.0.1:54695 | 2 | 4.00 GiB | /Users/rpelgrim/Documents/coiled/coiled-local/xgboost/dask-worker-space/worker-401910m0 |
| tcp://127.0.0.1:54708 | http://127.0.0.1:54709/status | tcp://127.0.0.1:54698 | 2 | 4.00 GiB | /Users/rpelgrim/Documents/coiled/coiled-local/xgboost/dask-worker-space/worker-8qo2hojm |


In [3]:
# Specify the columns we want to download
columns = [
    "interest_rate", "loan_age", "num_borrowers", 
    "borrower_credit_score", "num_units"
]

categorical = [
    "orig_channel", "occupancy_status", "property_state",
    "first_home_buyer", "loan_purpose", "property_type",
    "zip", "relocation_mortgage_indicator", "delinquency_12"
]

In [4]:
# Download data from S3
mortgage_data_local = dd.read_parquet(
    "s3://coiled-data/mortgage-2000.parq/part.0.parquet", 
    #compression="gzip",
    columns=columns + categorical, 
    storage_options={"anon": True}
)

# Cache the data on Cluster workers
mortgage_data_local = mortgage_data_local.persist()



In [None]:
# inspect the first 5 entries
mortgage_data_local.head()



This is looking good.

Before we can start training our XGBoost model, however, we'll have to conduct two preprocessing steps:
1. Cast our categorical columns to integer codes (by default, XGBoost accepts only integer, float, and boolean dtypes)
2. Create our train and test splits

*Note: we're using the **dask_ml** library for this, which mimics the familiar scikit-learn API*

In [7]:
# Cast categorical columns to the correct type
ce = Categorizer(columns=categorical)
mortgage_data_local = ce.fit_transform(mortgage_data_local)
for col in categorical:
    mortgage_data_local[col] = mortgage_data_local[col].cat.codes
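
To make the integer-coding step concrete, here is a minimal pure-Python sketch of what `.cat.codes` produces conceptually: each distinct category is replaced by its position in the list of known categories. The helper `encode_categories` and the sample values are hypothetical, and the sorted ordering is an assumption made for illustration; in the notebook, the category order is whatever the fitted `Categorizer` recorded.

```python
def encode_categories(values):
    """Map each distinct value to an integer code (hypothetical helper)."""
    # Assumption: categories are ordered by sorting the distinct values.
    categories = sorted(set(values))
    lookup = {category: code for code, category in enumerate(categories)}
    return categories, [lookup[value] for value in values]


# Sample values shaped like the `property_state` column (made up for illustration)
states = ["MA", "CA", "MA", "NY", "CA"]
categories, codes = encode_categories(states)

print(categories)  # ['CA', 'MA', 'NY']
print(codes)       # [1, 0, 1, 2, 0]
```

The codes carry no numeric meaning beyond identifying the category, which is worth keeping in mind when interpreting tree splits on these columns.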

In [8]:
# Create the train-test split
X, y = mortgage_data_local.iloc[:, :-1], mortgage_data_local["delinquency_12"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=True, random_state=2
)

X_train = X_train.persist()
X_test = X_test.persist()
y_train = y_train.persist()
y_test = y_test.persist()
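
As a rough illustration of what a seeded, shuffled 90/10 split means, the sketch below does the same thing on a plain Python list. The helper `split_rows` is hypothetical and is not how Dask-ML implements `train_test_split`; on a distributed DataFrame the resulting proportions may only be approximate rather than exact.

```python
import random


def split_rows(rows, test_size=0.1, seed=2):
    """Shuffle rows reproducibly, then hold out a fraction for testing."""
    rng = random.Random(seed)      # fixed seed makes the split repeatable
    shuffled = list(rows)          # copy so the input is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))
train, test = split_rows(rows)

print(len(train), len(test))              # 90 10
print(split_rows(rows) == (train, test))  # True: same seed, same split
```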

Great, now we're all set to start training our XGBoost model.

First, we'll create the XGBoost `DaskDMatrix` and set the model parameters.

In [18]:
# Create the XGBoost DMatrix

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

# Set parameters
params = {
    "max_depth": 8,
    "max_leaves": 2 ** 8,
    "gamma": 0.1,
    "eta": 0.1,
    "min_child_weight": 30,
    "objective": "binary:logistic",
    "grow_policy": "lossguide"
}
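
One detail in these parameters is worth spelling out: a binary tree of depth 8 can have at most `2 ** 8` leaves, so the `max_leaves` value matches the ceiling already implied by `max_depth`. The short check below is only a sketch of that arithmetic; `max_leaves_for_depth` is a hypothetical helper, not part of XGBoost.

```python
# Only the two parameters relevant to the check are repeated here.
params = {"max_depth": 8, "max_leaves": 2 ** 8}


def max_leaves_for_depth(depth):
    """Upper bound on the number of leaves in a binary tree of this depth."""
    return 2 ** depth


bound = max_leaves_for_depth(params["max_depth"])

print(bound)                          # 256
print(params["max_leaves"] <= bound)  # True
```

If you lower `max_leaves` below this bound, it becomes the binding limit under the `lossguide` growth policy.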



Then let's go ahead and train the model.

In [10]:
%%time 
# train the model
output = xgb.dask.train(
    client, params, dtrain, num_boost_round=5,
    evals=[(dtrain, 'train')]
)

[10:56:40] task [xgboost.dask]:tcp://127.0.0.1:52294 got new rank 0


CPU times: user 1.92 s, sys: 380 ms, total: 2.3 s
Wall time: 30.3 s


And see the results:

In [11]:
# 'booster' is the trained model
booster = output['booster']  

# 'history' is a dictionary containing evaluation metrics
history = output['history']  
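
As a sketch of how the `history` dictionary can be read: it is keyed first by the name given in `evals` (`'train'` here) and then by metric name, with one value per boosting round. The numbers below are invented for illustration, and `logloss` is assumed to be the metric recorded for the `binary:logistic` objective; inspect `history` in your own session for the real keys and values. `final_metrics` is a hypothetical helper.

```python
# Hypothetical history with the same nesting as output['history'].
# The values are invented for illustration, not results from this notebook.
example_history = {"train": {"logloss": [0.62, 0.56, 0.51, 0.47, 0.44]}}


def final_metrics(history):
    """Return the last recorded value of every metric for every eval set."""
    return {
        eval_name: {metric: values[-1] for metric, values in metrics.items()}
        for eval_name, metrics in history.items()
    }


print(final_metrics(example_history))  # {'train': {'logloss': 0.44}}
```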

In [6]:
# Close the client, then shut down the local cluster
client.close()
cluster.close()

## 3. Cloud-Based Distributed XGBoost using Dask and Coiled

Let's now expand this workflow to process the entire dataset (~200GB). We'll run almost exactly the same code except for **2 changes**:
1. We'll connect Dask to a Coiled cluster in the cloud, instead of to our local CPU cores,
2. We'll download the entire dataset, instead of a single partition.
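
The second change comes down to the path passed to `dd.read_parquet`: a single file versus a `*` wildcard over the whole directory. The sketch below uses the standard library's `fnmatch` purely to illustrate what the wildcard selects. The real expansion is done by the S3 filesystem layer, and the `part.1` and `part.2` file names are hypothetical.

```python
from fnmatch import fnmatchcase

single_file = "s3://coiled-data/mortgage-2000.parq/part.0.parquet"
pattern = "s3://coiled-data/mortgage-2000.parq/*"

# part.1 and part.2 are made-up names used only for this illustration
keys = [
    single_file,
    "s3://coiled-data/mortgage-2000.parq/part.1.parquet",
    "s3://coiled-data/mortgage-2000.parq/part.2.parquet",
]

matched = [key for key in keys if fnmatchcase(key, pattern)]

print(len(matched))            # 3
print(single_file in matched)  # True
```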

In the section below we've copied and pasted the cells from above so that you can run this notebook from top to bottom in one go. Alternatively, you could run the cell below (where we instantiate the Coiled Cluster) and then simply re-run the cells above -- making sure to adjust the cell that downloads the data as well, of course.

### Instantiate Coiled Cluster
Let's create our Coiled cluster in the cloud. We'll specify a cluster of 20 workers, with 4 CPU cores and 16GB of RAM each. That should allow the entire dataset to fit into the cluster's memory comfortably.
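
A quick back-of-the-envelope check of that sizing, as a sketch only: it assumes the ~200GB figure is the in-memory size of the data, which may not hold if the quoted size refers to compressed Parquet files on S3.

```python
n_workers = 20
worker_memory_gib = 16
dataset_gb = 200  # approximate dataset size quoted in this notebook

GIB = 2 ** 30  # bytes in a gibibyte
GB = 10 ** 9   # bytes in a gigabyte

cluster_memory_gib = n_workers * worker_memory_gib
cluster_memory_gb = cluster_memory_gib * GIB / GB
headroom = cluster_memory_gb / dataset_gb

print(cluster_memory_gib)           # 320
print(round(cluster_memory_gb, 1))  # 343.6
print(round(headroom, 2))           # 1.72
```

Each worker also needs memory for its own process and for intermediate results, so the usable headroom is smaller than this ratio suggests.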

In [2]:
# Create Coiled Cloud cluster
cluster = coiled.Cluster(
    name='xgboost',
    n_workers=20,
    worker_cpu=4,
    worker_memory='16GiB',
    software='rrpelgrim/coiled-xgboost',
    shutdown_on_close=False,
    scheduler_options={'idle_timeout': '2hours'}
)

# Connect Dask client to the Coiled cluster
client = Client(cluster)
client


Found software environment build
Created FW rules: coiled-dask-rrpelgr71-30855-firewall
Created scheduler VM: coiled-dask-rrpelgr71-30855-scheduler (ip: ['34.234.78.82'])


**Client**

- Connection method: Cluster object
- Cluster type: Cluster
- Dashboard: http://34.234.78.82:8787

**Cluster**

- Dashboard: http://34.234.78.82:8787
- Workers: 9
- Total threads: 36
- Total memory: 138.73 GiB

**Scheduler**

- Comm: tls://10.4.0.66:8786
- Workers: 9
- Dashboard: http://10.4.0.66:8787/status
- Total threads: 36
- Started: Just now
- Total memory: 138.73 GiB

**Workers**

| Comm | Dashboard | Nanny | Total threads | Memory | Local directory |
|---|---|---|---|---|---|
| tls://10.4.1.128:45913 | http://10.4.1.128:36575/status | tls://10.4.1.128:44515 | 4 | 15.46 GiB | /dask-worker-space/worker-n09piu3_ |
| tls://10.4.1.247:36991 | http://10.4.1.247:34995/status | tls://10.4.1.247:46751 | 4 | 15.35 GiB | /dask-worker-space/worker-t3shgvt0 |
| tls://10.4.1.110:33413 | http://10.4.1.110:37995/status | tls://10.4.1.110:42829 | 4 | 15.46 GiB | /dask-worker-space/worker-vf5yq4y7 |
| tls://10.4.1.225:36929 | http://10.4.1.225:46011/status | tls://10.4.1.225:39037 | 4 | 15.46 GiB | /dask-worker-space/worker-vo19vgiu |
| tls://10.4.1.241:46579 | http://10.4.1.241:41209/status | tls://10.4.1.241:44997 | 4 | 15.46 GiB | /dask-worker-space/worker-oky8nktl |
| tls://10.4.1.43:37381 | http://10.4.1.43:35095/status | tls://10.4.1.43:39453 | 4 | 15.46 GiB | /dask-worker-space/worker-7b6gp20t |
| tls://10.4.1.148:38967 | http://10.4.1.148:34325/status | tls://10.4.1.148:34841 | 4 | 15.46 GiB | /dask-worker-space/worker-ou6y3t8i |
| tls://10.4.1.158:37415 | http://10.4.1.158:41077/status | tls://10.4.1.158:32877 | 4 | 15.46 GiB | /dask-worker-space/worker-oariiun6 |
| tls://10.4.1.34:36499 | http://10.4.1.34:43211/status | tls://10.4.1.34:40333 | 4 | 15.19 GiB | /dask-worker-space/worker-v1xlnyyn |


### Download the Data

In [3]:
# Specify the columns we want to download
columns = [
    "interest_rate", "loan_age", "num_borrowers", 
    "borrower_credit_score", "num_units"
]

categorical = [
    "orig_channel", "occupancy_status", "property_state",
    "first_home_buyer", "loan_purpose", "property_type",
    "zip", "relocation_mortgage_indicator", "delinquency_12"
]

In [4]:
# Download data from S3
mortgage_data_all = dd.read_parquet(
    "s3://coiled-data/mortgage-2000.parq/*", 
    storage_options={"anon": True},
    columns = columns + categorical,
)

# Cache the data on Cluster workers
mortgage_data_all = mortgage_data_all.repartition(partition_size='50MB').persist()

In [5]:
mortgage_data_all

| npartitions=197 | interest_rate | loan_age | num_borrowers | borrower_credit_score | num_units | orig_channel | occupancy_status | property_state | first_home_buyer | loan_purpose | property_type | zip | relocation_mortgage_indicator | delinquency_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | float64 | float64 | float64 | float64 | int32 | object | object | object | object | object | object | int32 | object | object |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |


In [6]:
# inspect the first 5 entries
mortgage_data_all.head()

| loan_id | interest_rate | loan_age | num_borrowers | borrower_credit_score | num_units | orig_channel | occupancy_status | property_state | first_home_buyer | loan_purpose | property_type | zip | relocation_mortgage_indicator | delinquency_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100000174660 | 7.875 | 18.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |
| 100000174660 | 7.875 | 5.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |
| 100000174660 | 7.875 | 17.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |
| 100000174660 | 7.875 | 6.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |
| 100000174660 | 7.875 | 7.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |


### Preprocessing

In [7]:
# Cast categorical columns to the correct type
ce = Categorizer(columns=categorical)
mortgage_data_all = ce.fit_transform(mortgage_data_all)
for col in categorical:
    mortgage_data_all[col] = mortgage_data_all[col].cat.codes

In [8]:
# Create the train-test split
X, y = mortgage_data_all.iloc[:, :-1], mortgage_data_all["delinquency_12"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=True, random_state=2
)

X_train = X_train.persist()
y_train = y_train.persist()
X_test = X_test.persist()
y_test = y_test.persist()

### Training Model

In [9]:
from dask.distributed import performance_report

In [12]:
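with performance_report('create_DaskDMatrix.html'):
    # Create the XGBoost DMatrix
    dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

# Note: DaskDMatrix has no persist() method, and X_train and y_train
# were already persisted on the cluster above, so no extra call is needed.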

# Set model parameters
params = {
    "max_depth": 8,
    "max_leaves": 2 ** 8,
    "gamma": 0.1,
    "eta": 0.1,
    "min_child_weight": 30,
    "objective": "binary:logistic",
    "grow_policy": "lossguide"
}



In [11]:
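%%time
# Write the training report to its own file so that it does not
# overwrite the DaskDMatrix report created in the previous cell
with performance_report('train_xgboost.html'):

    # train the model
    output = xgb.dask.train(
        client, params, dtrain, num_boost_round=5,
        evals=[(dtrain, 'train')]
    )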

ValueError: DataFrame.dtypes for data must be int, float, bool or categorical. When categorical type is supplied, DMatrix parameter `enable_categorical` must be set to `True`. timestamp, mod_flag, seller_name, orig_date, product_type

In [21]:
# 'booster' is the trained model
booster = output['booster']  

# 'history' is a dictionary containing evaluation metrics
history = output['history']  

### Shutting down the cluster

In [24]:
# Close the client first, then delete the cluster
client.close()
coiled.delete_cluster(name='xgboost')

## 4. Recap

In this notebook, we:
- trained a distributed XGBoost model on a single partition of the mortgage dataset, using all of the cores of our machine in parallel by instantiating a Dask LocalCluster,
- expanded the distributed XGBoost model to train on the entire dataset using a Coiled cluster of 20 workers in the cloud, each with 4 CPU cores and 16 GiB of RAM.

We’d love to see you apply distributed XGBoost to a dataset that’s meaningful to you. If you’d like to try, swap your own dataset into this notebook and see how well it does! 

Let us know how you get on in our [Coiled Community Slack channel](https://join.slack.com/t/coiled-users/shared_invite/zt-hx1fnr7k-In~Q8ui3XkQfvQon0yN5WQ) or by tweeting at us.