<img src="https://raw.githubusercontent.com/dask/dask/main/docs/source/images/dask_horizontal_no_pad.svg"
     width="30%"
     alt="Dask logo" />

# Parallel and Distributed Machine Learning

The material in this notebook is based on open-source content from [Dask's tutorial repository](https://github.com/dask/dask-tutorial) and the [Machine learning notebook](https://github.com/coiled/data-science-at-scale/blob/master/3-machine-learning.ipynb) from Coiled's Data Science at Scale tutorial.

So far we have seen how Dask makes data analysis scalable with parallelization via Dask DataFrames. Let's now see how [Dask-ML](https://ml.dask.org/) allows us to do machine learning in a parallel and distributed manner. Note, machine learning is really just a special case of data analysis (one that automates analytical model building), so the 💪 Dask gains 💪 we've seen will apply here as well!

(If you'd like a refresher on the difference between parallel and distributed computing, [here's a good discussion on StackExchange](https://cs.stackexchange.com/questions/1580/distributed-vs-parallel-computing).)


## Types of scaling problems in machine learning

There are two main types of scaling challenges you can run into in your machine learning workflow: scaling the **size of your model** and scaling the **size of your data**. That is:

1. **CPU-bound problems**: Data fits in RAM, but training takes too long. Many hyperparameter combinations, a large ensemble of many models, etc.
2. **Memory-bound problems**: Data is larger than RAM, and sampling isn't an option.

Here's a handy diagram for visualizing these problems:

<img src="https://raw.githubusercontent.com/coiled/data-science-at-scale/master/images/dimensions_of_scale.svg"
     width="60%"
     alt="scaling problems" />


In the bottom-left quadrant, your datasets are not too large (they fit comfortably in RAM) and your model is not too large either. When these conditions are met, you are much better off using something like scikit-learn, XGBoost, and similar libraries. You don't need to leverage multiple machines in a distributed manner with a library like Dask-ML. However, if you are in any of the other quadrants, distributed machine learning is the way to go.

Summarizing: 

* For in-memory problems, just use scikit-learn (or your favorite ML library).
* For large models, use Joblib's Dask backend (`joblib.parallel_backend("dask")`) and your favorite scikit-learn estimator.
* For large datasets, use `dask_ml` estimators.

## Scikit-Learn in five minutes

<img src="https://raw.githubusercontent.com/coiled/data-science-at-scale/master/images/scikit_learn_logo_small.svg" 
     width="30%"
     alt="sklearn logo" />


In this section, we'll quickly run through a typical Scikit-Learn workflow:

* Load some data (in this case, we'll generate it)
* Import the Scikit-Learn module for our chosen ML algorithm
* Create an estimator for that algorithm and fit it with our data
* Inspect the learned attributes
* Check the accuracy of our model

Scikit-Learn has a nice, consistent API:

* You instantiate an `Estimator` (e.g. `LinearRegression`, `RandomForestClassifier`, etc.). All of the model's *hyperparameters* (user-specified parameters, not the ones learned by the estimator) are passed to the estimator when it's created.
* You call `estimator.fit(X, y)` to train the estimator.
* Use `estimator` to inspect attributes, make predictions, etc. 

Here `X` is an array of *feature variables* (what you're using to predict) and `y` is an array of *target variables* (what we're trying to predict).
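The three steps above can be sketched on a tiny dataset. This is only an illustration: `LogisticRegression`, the variable names, and the parameter values are choices made for this sketch, not part of the tutorial's workflow. It shows that hyperparameters are set at construction time, while learned attributes (by scikit-learn convention, names ending in an underscore) only exist after `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset: 200 samples (rows), 4 features (columns)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Hyperparameters are passed when the estimator is created
clf = LogisticRegression(C=0.5, max_iter=200)
hyperparams = clf.get_params()          # dict of user-specified settings

# Learned attributes appear only after fit() and end with an underscore
clf.fit(X_demo, y_demo)
learned_shape = clf.coef_.shape         # one weight per feature for a binary problem
predictions = clf.predict(X_demo[:5])   # predicted class labels for five samples

print(hyperparams["C"], learned_shape, predictions.shape)
```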

### Generate some random data

In [1]:
from sklearn.datasets import make_classification

# Generate data
X, y = make_classification(n_samples=10000, n_features=4, random_state=0)

**Refreshing some ML concepts**

- `X` is the samples matrix (or design matrix). The size of `X` is typically (`n_samples`, `n_features`), which means that samples are represented as rows and features are represented as columns.
- A "feature" (also called an "attribute") is a measurable property of the phenomenon we're trying to analyze. A feature for a dataset of employees might be their hire date, for example.
- `y` holds the target values, which are real numbers for regression tasks, or integers (or any other discrete set of values) for classification. For unsupervised learning tasks, `y` does not need to be specified. `y` is usually a 1-D array where the `i`th entry corresponds to the target of the `i`th sample (row) of `X`.

In [2]:
# Let's take a look at X
X[:8]

array([[-0.77244139,  0.3607576 , -2.38110133,  0.08757   ],
       [ 1.14946035,  0.62254594,  0.37302939,  0.45965795],
       [-1.90879217, -1.1602627 , -0.27364545, -0.82766028],
       [-0.77694695,  0.31434299, -2.26231851,  0.06339125],
       [-1.17047054,  0.02212382, -2.17376797, -0.13421976],
       [ 0.79010037,  0.68530624, -0.44740487,  0.44692959],
       [ 1.68616989,  1.6329131 , -1.42072654,  1.04050557],
       [-0.93912893, -1.02270838,  1.10093827, -0.63714432]])

In [3]:
# Let's take a look at y
y[:8]

array([0, 0, 1, 0, 0, 0, 0, 1])

### Fitting an SVC

For this example, we will fit a [Support Vector Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

In [4]:
from sklearn.svm import SVC

estimator = SVC(random_state=0)
estimator.fit(X, y)

SVC(random_state=0)

We can inspect the learned attributes by taking a look at the `support_vectors_`:

In [5]:
estimator.support_vectors_[:4]

array([[-0.77244139,  0.3607576 , -2.38110133,  0.08757   ],
       [ 1.14946035,  0.62254594,  0.37302939,  0.45965795],
       [-0.77694695,  0.31434299, -2.26231851,  0.06339125],
       [ 0.79010037,  0.68530624, -0.44740487,  0.44692959]])

And we check the accuracy:

In [6]:
estimator.score(X, y)

0.905
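Note that the score above is measured on the same data the estimator was trained on. A common complement, sketched here under the assumption that a simple random split is acceptable, is to hold out part of the data with `train_test_split` and score on the unseen portion. The smaller sample size and the 25% test fraction are arbitrary choices for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Smaller dataset than in the tutorial, to keep the sketch quick
X_small, y_small = make_classification(n_samples=2000, n_features=4, random_state=0)

# Hold out 25% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_small, y_small, test_size=0.25, random_state=0
)

svc = SVC(random_state=0)
svc.fit(X_train, y_train)

train_acc = svc.score(X_train, y_train)   # accuracy on data the model has seen
test_acc = svc.score(X_test, y_test)      # accuracy on held-out data

print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```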

### Hyperparameter Optimization

There are a few ways to learn the best *hyper*parameters while training. One is `GridSearchCV`.
As the name implies, this does a brute-force search over a grid of hyperparameter combinations. Scikit-learn provides tools to automatically find the best parameter combinations via cross-validation (which is the "CV" in `GridSearchCV`).

In [7]:
from sklearn.model_selection import GridSearchCV

In [8]:
%%time
estimator = SVC(gamma='auto', random_state=0, probability=True)
param_grid = {
    'C': [0.001, 10.0],
    'kernel': ['rbf', 'poly'],
}

# Brute-force search over a grid of hyperparameter combinations
grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2)
grid_search.fit(X, y)

Fitting 2 folds for each of 4 candidates, totalling 8 fits
[CV] END ................................C=0.001, kernel=rbf; total time=   2.9s
[CV] END ................................C=0.001, kernel=rbf; total time=   2.9s
[CV] END ...............................C=0.001, kernel=poly; total time=   1.1s
[CV] END ...............................C=0.001, kernel=poly; total time=   1.2s
[CV] END .................................C=10.0, kernel=rbf; total time=   0.8s
[CV] END .................................C=10.0, kernel=rbf; total time=   0.8s
[CV] END ................................C=10.0, kernel=poly; total time=   1.1s
[CV] END ................................C=10.0, kernel=poly; total time=   1.0s
CPU times: user 13.8 s, sys: 194 ms, total: 14 s
Wall time: 14 s


GridSearchCV(cv=2,
             estimator=SVC(gamma='auto', probability=True, random_state=0),
             param_grid={'C': [0.001, 10.0], 'kernel': ['rbf', 'poly']},
             verbose=2)

In [9]:
grid_search.best_params_, grid_search.best_score_

({'C': 10.0, 'kernel': 'rbf'}, 0.9086000000000001)
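The "4 candidates, 8 fits" line in the log above follows from simple arithmetic: the number of candidates is the product of the list lengths in the grid, and each candidate is fitted once per cross-validation fold. The helper below, `count_fits`, is a hypothetical name used only for this sketch.

```python
from itertools import product

def count_fits(param_grid, cv):
    """Return (number of candidates, number of cross-validation fits)."""
    keys = sorted(param_grid)
    # Every combination of one value per hyperparameter is a candidate
    candidates = list(product(*(param_grid[k] for k in keys)))
    return len(candidates), len(candidates) * cv

small_grid = {'C': [0.001, 10.0], 'kernel': ['rbf', 'poly']}
n_candidates, n_fits = count_fits(small_grid, cv=2)

print(n_candidates, n_fits)   # 4 candidates, 8 fits
```

Because the count grows multiplicatively with every hyperparameter added, grid search quickly becomes a CPU-bound problem.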

## Compute Bound: Single-machine parallelism with Joblib

<img src="https://raw.githubusercontent.com/coiled/data-science-at-scale/master/images/joblib_logo.svg" 
     alt="Joblib logo" 
     width="15%"/>

In this section we'll see how [Joblib](https://joblib.readthedocs.io/en/latest/) ("*a set of tools to provide lightweight pipelining in Python*") gives us parallelism on our laptop. Here's what our grid search graph would look like if we set up six training "jobs" in parallel:

<img src="https://raw.githubusercontent.com/coiled/data-science-at-scale/master/images/unmerged_grid_search_graph.svg" 
     alt="grid search graph" 
     width="75%"/>

With Joblib, we can say that Scikit-Learn has *single-machine* parallelism.

**Any Scikit-Learn estimator that can operate in parallel exposes an `n_jobs` keyword**, which controls how many tasks run in parallel. Setting `n_jobs=-1` uses all available CPU cores.
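To get a feel for what `n_jobs` does underneath, here is a minimal sketch that uses Joblib directly. The `square` function is a stand-in for a training job, and the threading backend is chosen only to keep the sketch lightweight; it is not a claim about which backend Scikit-Learn uses for a given estimator.

```python
from joblib import Parallel, delayed, cpu_count

def square(x):
    # Stand-in for an expensive, independent task such as one model fit
    return x * x

# n_jobs=-1 asks Joblib to use every core it can see
print("cores visible to Joblib:", cpu_count())

# Each delayed(...) call is one task; results come back in submission order
results = Parallel(n_jobs=-1, backend="threading")(
    delayed(square)(i) for i in range(8)
)

print(results)   # [0, 1, 4, 9, 16, 25, 36, 49]
```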

In [10]:
%%time
grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2, n_jobs=-1)
grid_search.fit(X, y)

Fitting 2 folds for each of 4 candidates, totalling 8 fits
CPU times: user 2.23 s, sys: 115 ms, total: 2.35 s
Wall time: 6.66 s


GridSearchCV(cv=2,
             estimator=SVC(gamma='auto', probability=True, random_state=0),
             n_jobs=-1,
             param_grid={'C': [0.001, 10.0], 'kernel': ['rbf', 'poly']},
             verbose=2)

Notice that the computation above is faster than before. If you are running this on Binder, you might not see a speed-up, because Binder instances tend to have only one core, which leaves no room for parallelism.

## Compute Bound: Multi-machine parallelism with Dask


In this section we'll see how Dask (plus Joblib and Scikit-Learn) gives us multi-machine parallelism. Here's what our grid search graph would look like if we allowed Dask to schedule our training "jobs" over multiple machines in our cluster:

<img src="https://raw.githubusercontent.com/coiled/data-science-at-scale/master/images/merged_grid_search_graph.svg" 
     alt="merged grid search graph" 
     width="100%"/>
     
Dask can talk to Scikit-Learn (via Joblib) so that our *cluster* is used to train a model. 

Let's instantiate a Client with `n_workers=8`, which will give us a distributed `LocalCluster`.

In [12]:
import dask.distributed

client = dask.distributed.Client(n_workers=8)
client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 53793 instead


- Connection method: Cluster object
- Cluster type: distributed.LocalCluster
- Dashboard: http://127.0.0.1:53793/status
- Workers: 8
- Total threads: 8
- Total memory: 16.00 GiB
- Status: running
- Using processes: True
- Scheduler comm: tcp://127.0.0.1:53794 (started just now)
- Each of the 8 workers: 1 thread, 2.00 GiB memory

**Note:** Click on Cluster Info to see more details about the cluster, such as its configuration and other specs.

We can expand our problem by specifying more hyperparameters before training, and see how using `dask` as the backend can help us.

In [13]:
param_grid = {
    'C': [0.001, 0.1, 1.0, 2.5, 5, 10.0],
    'kernel': ['rbf', 'poly', 'linear'],
    'shrinking': [True, False],
}

grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2, n_jobs=-1)

### Dask parallel backend

We can fit our estimator with multi-machine parallelism by quickly *switching to a Dask parallel backend* when using joblib. 

In [14]:
import joblib

In [15]:
%%time
with joblib.parallel_backend("dask", scatter=[X, y]):
    grid_search.fit(X, y)

Fitting 2 folds for each of 36 candidates, totalling 72 fits
CPU times: user 5.64 s, sys: 851 ms, total: 6.49 s
Wall time: 25.5 s


**What just happened?**

Dask-ML developers worked with the Scikit-Learn and Joblib developers to implement a Dask parallel backend. So internally, scikit-learn now talks to Joblib, and Joblib talks to Dask, and Dask is what handles scheduling all of those tasks on multiple machines.

The best parameters and best score:

In [16]:
grid_search.best_params_, grid_search.best_score_

({'C': 10.0, 'kernel': 'rbf', 'shrinking': True}, 0.9086000000000001)

## Memory Bound: Multi-machine parallelism with Dask-ML

We have seen how to work with larger models, but sometimes you'll want to train on a larger-than-memory dataset. Dask-ML implements estimators that work well on Dask `Array`s and `DataFrame`s that may be larger than your machine's RAM.

In [17]:
from dask_ml.datasets import make_regression
from dask_ml.linear_model import LinearRegression

In [18]:
X, y = make_regression(n_samples=10_000, chunks=100)

In [19]:
X

| | Array | Chunk |
|---|---|---|
| Bytes | 7.63 MiB | 78.12 kiB |
| Shape | (10000, 100) | (100, 100) |
| Count | 100 Tasks | 100 Chunks |
| Type | float64 | numpy.ndarray |
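The numbers in the array summary above can be reproduced by hand. The sketch below assumes 8-byte `float64` values, chunking along rows only (each chunk spans all columns), and a row count that divides evenly into chunks; `chunk_summary` is a hypothetical helper, not a Dask function.

```python
import math

def chunk_summary(n_rows, n_cols, chunk_rows, itemsize=8):
    """Return (number of chunks, bytes per chunk, total bytes)."""
    n_chunks = math.ceil(n_rows / chunk_rows)
    chunk_bytes = chunk_rows * n_cols * itemsize
    total_bytes = n_rows * n_cols * itemsize
    return n_chunks, chunk_bytes, total_bytes

n_chunks, chunk_bytes, total_bytes = chunk_summary(10_000, 100, 100)

print(n_chunks)                         # 100
print(chunk_bytes / 2**10)              # 78.125 (kiB)
print(round(total_bytes / 2**20, 2))    # 7.63 (MiB)
```

Each chunk is an ordinary NumPy array, so only the chunks currently being processed need to fit in a worker's memory.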


In [20]:
lr = LinearRegression()

In [21]:
%%time
lr.fit(X, y)

CPU times: user 2.9 s, sys: 524 ms, total: 3.42 s
Wall time: 16.5 s


LinearRegression()

In [22]:
lr.predict(X)

| | Array | Chunk |
|---|---|---|
| Bytes | 78.12 kiB | 800 B |
| Shape | (10000,) | (100,) |
| Count | 301 Tasks | 100 Chunks |
| Type | float64 | numpy.ndarray |
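The output above lists tasks and chunks rather than values, which indicates that `lr.predict(X)` returned a lazy Dask array. As a minimal sketch of that laziness, using a plain Dask array instead of the model's predictions (an assumption made to keep the example self-contained), nothing is computed until `.compute()` is called:

```python
import dask.array as da

# A 1-D array of 10,000 integers split into chunks of 100 elements
x = da.arange(10_000, chunks=100)

# Building the sum only records tasks; no values are computed yet
lazy_total = x.sum()

# .compute() runs the task graph and returns a concrete value
total = lazy_total.compute()

print(x.numblocks)   # (100,)
print(total)         # 49995000
```

The same pattern would apply to predictions: calling `.compute()` on the result materializes it in local memory, so it is only advisable when the result is small enough to fit.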


In [23]:
lr.score(X, y)

0.9999928479369835

This works...yay!

But it does take a while...can we speed this up even more?

In [None]:
client.close()

## Multi-machine parallelism in the cloud with Coiled

<br>
<img src="https://raw.githubusercontent.com/coiled/data-science-at-scale/master/images/Coiled-Logo_Horizontal_RGB_Black.png"
     alt="Coiled logo" 
     width="25%"/>
<br>

In this section we'll see how Coiled allows us to solve machine learning problems with multi-machine parallelism in the cloud.

Coiled, [among other things](https://coiled.io/product/), provides hosted and scalable Dask clusters. **The biggest barriers to entry for doing machine learning at scale are "Do you have access to a cluster?" and "Do you know how to manage it?"** Coiled solves both of those problems. 

We'll connect to the Coiled cluster we launched earlier, then connect our Dask client to that cluster.

If you are running on your local machine and not in Binder, and you want to give Coiled a try, you can sign up [here](https://cloud.coiled.io/login?redirect_uri=/) and reach out to Support to get set up with some free credits. If you installed the environment by following the steps in the repository's [README](https://github.com/coiled/dask-mini-tutorial/blob/main/README.md), you will have `coiled` installed. You will just need to log in by following the steps on the [setup page](https://docs.coiled.io/user_guide/getting_started.html), and you will be ready to go.

To learn more about how to set up an environment, you can visit the Coiled documentation on [Creating software environments](https://docs.coiled.io/user_guide/software_environment_creation.html). But for now you can use the environment we set up for this tutorial.

In [25]:
import coiled
from dask.distributed import Client

In [27]:
# Connect to running cluster by referencing only its name
cluster = coiled.Cluster(
    name="dask-mini-tutorial", 
)

Output()

Our cluster currently has 20 workers.

Machine Learning is a compute-heavy process and we're working against a tight deadline. Let's scale up to 100 to speed this up.

In [28]:
cluster.scale(100)

In [29]:
client = Client(cluster)
client

- Connection method: Cluster object
- Cluster type: coiled.Cluster
- Dashboard: http://3.239.225.22:8787
- Workers: 20
- Total threads: 160
- Total memory: 620.19 GiB
- Scheduler comm: tls://10.4.14.86:8786 (started 1 hour ago)
- Each of the 20 workers: 8 threads, 31.01 GiB memory


## Same Linear Regression Model as before

We can use Dask-ML estimators on the cloud to work with larger datasets.

In [36]:
X, y = make_regression(n_samples=2_000_000, chunks=1000) 

Notice we created a dataset with chunks of 1,000 rows each, which gives 2,000 chunks, a number that can be evenly distributed over our 100 workers.

In [37]:
X

| | Array | Chunk |
|---|---|---|
| Bytes | 1.49 GiB | 781.25 kiB |
| Shape | (2000000, 100) | (1000, 100) |
| Count | 2000 Tasks | 2000 Chunks |
| Type | float64 | numpy.ndarray |


In [38]:
lr = LinearRegression()

In [39]:
%%time
lr.fit(X, y)

CPU times: user 13.3 s, sys: 2.24 s, total: 15.5 s
Wall time: 1min 6s


LinearRegression()

distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
Traceback (most recent call last):
  File "/Users/rpelgrim/mambaforge/envs/dask-mini-tutorial/lib/python3.9/site-packages/distributed/comm/tcp.py", line 398, in connect
    stream = await self.client.connect(
  File "/Users/rpelgrim/mambaforge/envs/dask-mini-tutorial/lib/python3.9/site-packages/tornado/tcpclient.py", line 275, in connect
    af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/rpelgrim/mambaforge/envs/dask-mini-tutorial/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of

In [40]:
lr.predict(X)

| | Array | Chunk |
|---|---|---|
| Bytes | 78.12 kiB | 800 B |
| Shape | (10000,) | (100,) |
| Count | 301 Tasks | 100 Chunks |
| Type | float64 | numpy.ndarray |


In [41]:
lr.score(X, y)

0.9999915557428719

In [None]:
client.close()

## Training XGBoost in Parallel

Dask-ML implements some of the most popular machine learning algorithms for parallel processing, but not all of them.

For XGBoost, the maintainers of Dask and XGBoost took a different approach: they built a Dask backend for XGBoost so you can run XGBoost in parallel with Dask straight from your normal XGBoost library.

Running an XGBoost model with the distributed Dask backend only requires two changes to your regular XGBoost code:

1. Substitute `dtrain = xgb.DMatrix(X_train, y_train)` with `dtrain = xgb.dask.DaskDMatrix(client, X_train, y_train)`.
2. Substitute `xgb.train(params, dtrain, ...)` with `xgb.dask.train(client, params, dtrain, ...)`.

[Here's a step-by-step tutorial.](https://coiled.io/blog/dask-xgboost-python-example/)



## Extra resources:

- [Dask-ML documentation](https://ml.dask.org/)
- [Getting started with Coiled](https://docs.coiled.io/user_guide/getting_started.html)