# Scaling XGBoost with Dask and Coiled

This notebook first trains a distributed [XGBoost](https://xgboost.readthedocs.io/en/latest/) model locally on a small subset of the data using [Dask](https://dask.org/), then uses Dask and [Coiled](https://coiled.io/) to scale out to the cloud and train XGBoost on a larger-than-memory dataset.

## 1. Importing Libraries

We'll start by importing all the libraries we'll need to run this notebook.

*Note how the objects we import from **dask_ml** resemble the familiar scikit-learn API.*

In [1]:
import coiled
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster, performance_report
from dask_ml.model_selection import train_test_split
from dask_ml.preprocessing import Categorizer
import xgboost as xgb

In [None]:
# coiled.create_software_environment(
#     name='coiled-xgboost',
#     account='coiled-examples',
#     conda="/Users/rpelgrim/Documents/git/coiled-resources/xgboost-with-coiled/xgboost.yml"
# )

## 2. Local Distributed XGBoost Model using Dask

Next, let's instantiate a local version of the Dask distributed scheduler using the **LocalCluster** object. 

This object will handle parallelism for us on our local machine.

In [2]:
# local dask cluster
cluster = LocalCluster(n_workers=8)
client = Client(cluster)
client

Client
  Connection method: Cluster object | Cluster type: LocalCluster
  Dashboard: http://127.0.0.1:8787/status
LocalCluster
  Status: running | Using processes: True
  Workers: 8 | Total threads: 8 | Total memory: 16.00 GiB
Scheduler
  Comm: tcp://127.0.0.1:50381 | Started: Just now
Workers (8)
  Each worker: 1 thread, 2.00 GiB memory


In [6]:
data = dd.read_table(
    "s3://coiled-datasets/dea-opioid/arcos_washpost.tsv",
    storage_options={'anon': True},
    dtype={
        'ACTION_INDICATOR': 'object',
        'ORDER_FORM_NO': 'object',
        'REPORTER_ADDL_CO_INFO': 'object',
        'REPORTER_ADDRESS2': 'object'
    }
)

In [None]:
# head() already computes and returns a pandas DataFrame, so there is nothing to persist
data_local = data.head(10000)

In [4]:
# Specify the columns we want to download
columns = [
    "interest_rate", "loan_age", "num_borrowers", 
    "borrower_credit_score", "num_units"
]

categorical = [
    "orig_channel", "occupancy_status", "property_state",
    "first_home_buyer", "loan_purpose", "property_type",
    "zip", "relocation_mortgage_indicator", "delinquency_12"
]

In [5]:
%%time
# Download data from S3
mortgage_data_local = dd.read_parquet(
    "s3://coiled-data/mortgage-2000.parq/part.0.parquet", 
    compression="gzip",
    columns=columns + categorical, 
    storage_options={"anon": True}
)

# Cache the data on Cluster workers
mortgage_data_local = mortgage_data_local.persist()

CPU times: user 980 ms, sys: 546 ms, total: 1.53 s
Wall time: 3.86 s


In [6]:
# inspect the first 5 entries
mortgage_data_local.head()

| loan_id | interest_rate | loan_age | num_borrowers | borrower_credit_score | num_units | orig_channel | occupancy_status | property_state | first_home_buyer | loan_purpose | property_type | zip | relocation_mortgage_indicator | delinquency_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100000174660 | 7.875 | 18.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |
| 100000174660 | 7.875 | 5.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |
| 100000174660 | 7.875 | 17.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |
| 100000174660 | 7.875 | 6.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |
| 100000174660 | 7.875 | 7.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |


This is looking good.

Before we can start training our XGBoost model, however, we'll have to conduct two preprocessing steps:
1. Cast our categorical columns to the correct types (XGBoost only accepts float, integer and boolean dtypes)
2. Create our train and test splits

*Note: we're using the **dask_ml** library for this, which mimics the familiar scikit-learn API*
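To make the first step concrete, here is a minimal pandas sketch of what converting a categorical column to integer codes looks like. It is an analogue of the Dask code, not the `Categorizer` implementation itself; the toy values are made up, and only the column names are borrowed from this notebook.

```python
import pandas as pd

# Toy frame with invented values; column names borrowed from the notebook
toy = pd.DataFrame({
    "property_state": ["MA", "CA", "MA", "NY"],
    "first_home_buyer": ["N", "Y", "N", "N"],
})

for col in ["property_state", "first_home_buyer"]:
    # astype("category") builds an index of the distinct values (sorted here);
    # .cat.codes replaces each value with its integer position in that index
    toy[col] = toy[col].astype("category").cat.codes

print(toy.dtypes)
print(toy)
```

After this step every column holds small integers, which is the kind of input XGBoost accepts. Note that the codes carry no ordering meaning beyond the position of each category in the index.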

In [8]:
# Cast categorical columns to the correct type
ce = Categorizer(columns=categorical)
mortgage_data_local = ce.fit_transform(mortgage_data_local)
for col in categorical:
    mortgage_data_local[col] = mortgage_data_local[col].cat.codes

In [9]:
# Create the train-test split
X, y = mortgage_data_local.iloc[:, :-1], mortgage_data_local["delinquency_12"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=True, random_state=2
)

Great, now we're all set to start training our XGBoost model.

First, we'll create the XGBoost DMatrix and set the model parameters.

In [10]:
# Create the XGBoost DMatrix

dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

# Set parameters
params = {
    "max_depth": 8,
    "max_leaves": 2 ** 8,
    "gamma": 0.1,
    "eta": 0.1,
    "min_child_weight": 30,
    "objective": "binary:logistic",
    "grow_policy": "lossguide"
}
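As a rough illustration of what the `binary:logistic` objective means, the sketch below maps raw tree scores to probabilities with a sigmoid and scores them with log loss. It is plain Python written for this explanation, not XGBoost's internal code; the scores and labels are made up, and it assumes log loss is the metric you report.

```python
import math


def sigmoid(margin: float) -> float:
    """Map a raw boosted-tree score (margin) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def log_loss(y_true, y_prob):
    """Mean binary cross-entropy over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


margins = [-2.0, 0.0, 3.0]  # invented raw scores
labels = [0, 0, 1]          # invented labels
probs = [sigmoid(m) for m in margins]

print(probs)
print(log_loss(labels, probs))
```

A margin of zero corresponds to a probability of 0.5, and confident, correct predictions drive the loss toward zero.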


Then let's go ahead and train the model.

In [None]:
%%time 
# train the model
output = xgb.dask.train(
    client, params, dtrain, num_boost_round=5,
    evals=[(dtrain, 'train')]
)

And see the results:

In [None]:
# 'booster' is the trained model
booster = output['booster']  

# 'history' is a dictionary containing evaluation metrics
history = output['history']  
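
The `history` dictionary is nested: evaluation-set name first, then metric name, then one value per boosting round. The sketch below walks a hypothetical stand-in with that shape. The name `example_history`, the helper `final_metrics`, and all numbers are invented for illustration and are not results from this notebook.

```python
# Hypothetical stand-in for output["history"]; the numbers are invented
example_history = {"train": {"logloss": [0.62, 0.56, 0.51, 0.47, 0.44]}}


def final_metrics(history):
    """Return {(eval_name, metric_name): last_value} for a nested history dict."""
    result = {}
    for eval_name, metrics in history.items():
        for metric_name, values in metrics.items():
            result[(eval_name, metric_name)] = values[-1]
    return result


for (eval_name, metric_name), value in final_metrics(example_history).items():
    rounds = len(example_history[eval_name][metric_name])
    print(f"{eval_name}-{metric_name} after {rounds} rounds: {value}")
```

With `num_boost_round=5` and a single evaluation set named `'train'`, you would expect one list of five values. Predictions on the held-out split would typically come from `xgb.dask.predict(client, booster, X_test)`, which this notebook does not run.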

In [14]:
# Disconnect the client and shut down the local cluster
client.close()
cluster.close()

## 3. Cloud-Based Distributed XGBoost using Dask and Coiled

Let's now expand this workflow to process the entire dataset (~200GB). We'll run almost exactly the same code except for **2 changes**:
1. We'll connect Dask to a Coiled cluster in the cloud, instead of to our local CPU cores,
2. We'll download the entire dataset, instead of a single partition.

In the section below we've copied and pasted the cells from above so that you can run this notebook from top to bottom in one go. Alternatively, you could run the cell below (where we instantiate the Coiled Cluster) and then simply re-run the cells above -- making sure to adjust the cell that downloads the data as well, of course.

### Instantiate Coiled Cluster
Let's create our Coiled cluster in the cloud. We'll request a cluster of 10 workers, with 4 CPU cores and 16 GiB of RAM each. Since we only load a subset of the columns, that should leave enough room to hold the data in the cluster's memory.
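
A quick back-of-the-envelope check helps when picking a cluster size. The sketch below multiplies worker count by per-worker memory for two candidate sizes and compares the total with the approximate 200 GB figure quoted above. The helper name is made up for this example, and the comparison is only a rough guide: on-disk size and in-memory size can differ considerably, and we load only some of the columns.

```python
def total_cluster_memory_gib(n_workers: int, worker_memory_gib: float) -> float:
    """Aggregate requested memory across all workers, in GiB."""
    return n_workers * worker_memory_gib


DATASET_GB = 200          # approximate full-dataset size quoted in this notebook
GIB_IN_GB = 2**30 / 1e9   # 1 GiB is about 1.074 GB

for n_workers in (10, 20):
    total_gib = total_cluster_memory_gib(n_workers, 16)
    total_gb = total_gib * GIB_IN_GB
    print(
        f"{n_workers} workers x 16 GiB = {total_gib:.0f} GiB "
        f"(~{total_gb:.0f} GB), ratio to dataset: {total_gb / DATASET_GB:.2f}"
    )
```

Workers also need headroom for intermediate results during preprocessing and training, so treat the ratio as a starting point and watch the Dask dashboard for memory pressure.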

In [2]:
# Create Coiled Cloud cluster
cluster = coiled.Cluster(
    name='xgboost',
    n_workers=10,
    worker_cpu=4,
    worker_memory='16GiB',
    software='coiled-examples/xgboost',
    shutdown_on_close=False,
    scheduler_options={'idle_timeout': '1hour'}
)

# Connect Dask client to the Coiled cluster
client = Client(cluster)
client


Found software environment build
Created FW rules: coiled-dask-rrpelgr71-34006-firewall
Created scheduler VM: coiled-dask-rrpelgr71-34006-scheduler (ip: ['34.229.95.53'])



+-------------+---------------+----------------+----------------+
| Package     | client        | scheduler      | workers        |
+-------------+---------------+----------------+----------------+
| blosc       | None          | 1.10.2         | 1.10.2         |
| dask        | 2021.07.1     | 2021.07.0      | 2021.07.0      |
| distributed | 2021.07.1     | 2021.07.0      | 2021.07.0      |
| numpy       | 1.21.1        | 1.21.0         | 1.21.0         |
| pandas      | 1.3.1         | 1.2.4          | 1.2.4          |
| python      | 3.8.8.final.0 | 3.8.10.final.0 | 3.8.10.final.0 |
+-------------+---------------+----------------+----------------+


Client
  Connection method: Cluster object | Cluster type: Cluster
  Dashboard: http://34.229.95.53:8787
Cluster
  Workers: 6 | Total threads: 24 | Total memory: 92.47 GiB
Scheduler
  Comm: tls://10.4.0.141:8786 | Started: Just now
Workers (6)
  Each worker: 4 threads, 15.19-15.46 GiB memory


### Download the Data

In [9]:
data_coiled = dd.read_table(
    "s3://coiled-datasets/dea-opioid/arcos_washpost.tsv",
    storage_options={'anon': True},
    dtype={
        'ACTION_INDICATOR': 'object',
        'ORDER_FORM_NO': 'object',
        'REPORTER_ADDL_CO_INFO': 'object',
        'REPORTER_ADDRESS2': 'object'
    }
)

In [10]:
data_coiled.head(10)

CancelledError: ('head-1-10-read-csv-fd4bd5121d8267058f74a43a99d11cda', 0)

In [4]:
data_coiled = data_coiled.persist()

In [5]:
data_coiled.head()

CancelledError: ('read-csv-fd4bd5121d8267058f74a43a99d11cda', 0)

In [3]:
# Specify the columns we want to download
columns = [
    "interest_rate", "loan_age", "num_borrowers", 
    "borrower_credit_score", "num_units"
]

categorical = [
    "orig_channel", "occupancy_status", "property_state",
    "first_home_buyer", "loan_purpose", "property_type",
    "zip", "relocation_mortgage_indicator", "delinquency_12"
]

In [4]:
%%time
# Download data from S3
mortgage_data_all = dd.read_parquet(
    "s3://coiled-data/mortgage-2000.parq", 
    compression="gzip",
    columns=columns + categorical, 
    storage_options={"anon": True}
)

# Cache the data on Cluster workers
mortgage_data_all = mortgage_data_all.persist()

CPU times: user 734 ms, sys: 353 ms, total: 1.09 s
Wall time: 3.78 s


In [5]:
# inspect the first 5 entries
mortgage_data_all.head()

| loan_id | interest_rate | loan_age | num_borrowers | borrower_credit_score | num_units | orig_channel | occupancy_status | property_state | first_home_buyer | loan_purpose | property_type | zip | relocation_mortgage_indicator | delinquency_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100000174660 | 7.875 | 18.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |
| 100000174660 | 7.875 | 5.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |
| 100000174660 | 7.875 | 17.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |
| 100000174660 | 7.875 | 6.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |
| 100000174660 | 7.875 | 7.0 | 2.0 | 673.0 | 1 | B | P | MA | N | C | SF | 26 | N | False |


### Preprocessing

In [6]:
# Cast categorical columns to the correct type
ce = Categorizer(columns=categorical)
mortgage_data_all = ce.fit_transform(mortgage_data_all)
for col in categorical:
    mortgage_data_all[col] = mortgage_data_all[col].cat.codes

In [7]:
# Create the train-test split
X, y = mortgage_data_all.iloc[:, :-1], mortgage_data_all["delinquency_12"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=True, random_state=2
)

### Training Model

In [8]:
# Create the XGBoost DMatrix
dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)

# Set model parameters
params = {
    "max_depth": 8,
    "max_leaves": 2 ** 8,
    "gamma": 0.1,
    "eta": 0.1,
    "min_child_weight": 30,
    "objective": "binary:logistic",
    "grow_policy": "lossguide"
}
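
The `max_depth` and `max_leaves` values above are chosen to agree with each other: a binary tree of depth `d` has at most `2 ** d` leaves. The small check below spells out that arithmetic. The helper and the `tree_params` name are made up for this example, and the dictionary repeats only the tree-shape entries from the parameters above.

```python
def max_leaves_for_depth(depth: int) -> int:
    """Upper bound on the leaf count of a binary tree whose root is at depth 0."""
    return 2 ** depth


# Tree-shape entries copied from the parameter dictionary above
tree_params = {"max_depth": 8, "max_leaves": 2 ** 8, "grow_policy": "lossguide"}

bound = max_leaves_for_depth(tree_params["max_depth"])
print(f"depth {tree_params['max_depth']} allows at most {bound} leaves")
print("max_leaves matches the bound:", tree_params["max_leaves"] == bound)
```

With `grow_policy` set to `"lossguide"`, trees are grown by splitting the node with the largest loss reduction first, so the leaf cap is the setting that most directly limits tree size. Setting it to the depth bound means the cap never becomes tighter than the depth limit already is.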


In [9]:
%%time 
# train the model (and generate a Dask performance report)
with performance_report(filename="xgboost-training.html"):
    output = xgb.dask.train(
        client, params, dtrain, num_boost_round=5,
        evals=[(dtrain, 'train')]
    )

CPU times: user 749 ms, sys: 562 ms, total: 1.31 s
Wall time: 24.7 s


distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError


In [None]:
# 'booster' is the trained model
booster = output['booster']  

# 'history' is a dictionary containing evaluation metrics
history = output['history']  

### Shutting down the cluster

In [29]:
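# Disconnect the client. The cluster was created with shutdown_on_close=False,
# so it keeps running until it hits the 1-hour idle timeout or is stopped explicitly.
client.close()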

## 4. Recap

In this notebook, we:
- trained a distributed XGBoost model on a single partition of the mortgage dataset, using all of the cores of our machine in parallel by instantiating a Dask LocalCluster,
- expanded the distributed XGBoost model to train on the entire dataset using a Coiled cluster in the cloud, configured with 10 workers and 16 GiB of memory each.

We’d love to see you apply distributed XGBoost to a dataset that’s meaningful to you. If you’d like to try, swap your own dataset into this notebook and see how well it does! 

Let us know how you get on in our [Coiled Community Slack channel](https://join.slack.com/t/coiled-users/shared_invite/zt-hx1fnr7k-In~Q8ui3XkQfvQon0yN5WQ) or by tweeting at us.

In [30]:
coiled.create_notebook(
    name="xgboost-demo",
    conda="xgboost.yml",
    cpu=4,
    memory="16 GiB",
    files=["coiled-xgboost.ipynb"],
    description="Analyzes dataset with XGBoost, Dask, and Coiled",
)



Found existing software environment build, returning
