# Dask-ML Part 2

### Parallelizing Scikit-Learn

We'll take a look at:
* closer interoperation with scikit-learn
* syntax/constructs that look or work like scikit-learn
* using Dask for high parallelism on small/medium data tasks (where data can fit in memory)

In [None]:
from dask.distributed import Client

# Start a small local cluster: 4 single-threaded workers, 512 MB of memory each
client = Client(n_workers=4, threads_per_worker=1, memory_limit='512MB')

client

In [None]:
import dask.dataframe

ddf = dask.dataframe.read_csv('data/diamonds.csv', blocksize=1e6)
ddf

In [None]:
y = ddf.price
ddf = ddf.drop(['Unnamed: 0', 'price'], axis=1)

y

Now that we have the core data loaded, let's apply categorical preprocessing similar to our earlier example, but this time using `sklearn.pipeline`:

In [None]:
from sklearn.pipeline import make_pipeline
from dask_ml.preprocessing import Categorizer, DummyEncoder

pipe = make_pipeline(
    Categorizer(),
    DummyEncoder()
)

Calling `.fit` on the Dask dataframe applies the relevant `fit`, `transform`, or `fit_transform` operation for each element of the pipeline ...

In [None]:
pipe.fit(ddf)

... making the pipeline ready to transform the actual data:

In [None]:
pipe.transform(ddf)
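
The fit-then-transform behaviour described above can be illustrated with a toy pipeline in plain Python. This is only a simplified sketch of the chaining pattern, not scikit-learn's actual implementation; `AddOne`, `Center`, and `ToyPipeline` are hypothetical names invented for this example.

```python
class AddOne:
    """Toy transformer (hypothetical): learns nothing, adds 1 to every value."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x + 1 for x in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


class Center:
    """Toy transformer (hypothetical): learns the mean, then subtracts it."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


class ToyPipeline:
    """Simplified stand-in for a pipeline: chains steps in order."""

    def __init__(self, *steps):
        self.steps = steps

    def fit(self, X, y=None):
        # Each step except the last is fitted and its output fed to the next.
        for step in self.steps[:-1]:
            X = step.fit_transform(X, y)
        # The final step only needs to be fitted.
        self.steps[-1].fit(X, y)
        return self

    def transform(self, X):
        # After fitting, data flows through every step's transform.
        for step in self.steps:
            X = step.transform(X)
        return X


toy = ToyPipeline(AddOne(), Center())
toy.fit([1, 2, 3])
result = toy.transform([1, 2, 3])
print(result)
```

The second step learns its mean from the output of the first step rather than from the raw input, which is why fitting has to run through the steps in order.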

### Mixing Dask and scikit-learn APIs

We've just seen a pipeline composed entirely of Dask drop-ins.

We can also operate on Dask dataframes using Dask's own APIs, and build pipelines that mix Dask and scikit-learn components, provided we're careful about which ones we combine.

Let's `categorize` the data via Dask's API, then see how we can use Dask's `DummyEncoder` and scikit-learn's unmodified `RandomForestRegressor` together with Dask's joblib backend:

In [None]:
ddf_cat = ddf.categorize()
ddf_cat

In [None]:
from dask_ml.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(ddf_cat, y, test_size=0.3)

X_train

Note, in the training cell below:
* the joblib Dask backend is specified as a context manager
* `n_jobs` is specified for the `RandomForestRegressor`

This approach to parallel training only works where the sklearn estimator supports multiple jobs via joblib. The relevant classes are not well documented, but they are typically the ones that take `n_jobs` as a constructor argument.

There's an open issue (around the sklearn documentation) at:
* https://github.com/scikit-learn/scikit-learn/issues/14228

And a list generated from the source code at: 
* https://gist.github.com/cmarmo/f8cd0f4c82f8fc816a106fd3510c61dd
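
As a rough heuristic for the `n_jobs` rule of thumb above, one can inspect a class's constructor signature. The sketch below uses only the standard library; `accepts_n_jobs` and the two stand-in classes are hypothetical names made up for illustration, and the presence of `n_jobs` is a hint rather than a guarantee that an estimator parallelizes via joblib.

```python
import inspect


def accepts_n_jobs(estimator_cls):
    """Return True if the class constructor has an `n_jobs` parameter."""
    params = inspect.signature(estimator_cls.__init__).parameters
    return "n_jobs" in params


class ParallelStandIn:
    """Hypothetical stand-in for an estimator that takes `n_jobs`."""

    def __init__(self, n_estimators=100, n_jobs=None):
        self.n_estimators = n_estimators
        self.n_jobs = n_jobs


class SerialStandIn:
    """Hypothetical stand-in for an estimator without `n_jobs`."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha


checks = {
    cls.__name__: accepts_n_jobs(cls)
    for cls in (ParallelStandIn, SerialStandIn)
}
print(checks)
```

The same helper could be applied to a real scikit-learn class, for example `accepts_n_jobs(RandomForestRegressor)`.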

In [None]:
from sklearn.ensemble import RandomForestRegressor
import joblib

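pipe_2 = make_pipeline(
    DummyEncoder(),
    RandomForestRegressor(n_jobs=4)  # request 4 parallel joblib jobs
)

# Route joblib's parallel work to the Dask cluster for the duration of the fit
with joblib.parallel_backend('dask'):
    pipe_2.fit(X_train, y_train)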

Note that although the training involves parallel tasks scheduled by Dask, the pipeline itself is not parallel-aware, so calling `.predict` is a local operation that returns a regular ndarray:

In [None]:
y_test_predicted = pipe_2.predict(X_test)
y_test_predicted

If `X_test` were very large, or if we wanted a production system that scores many records in parallel, Dask provides a wrapper to parallelize the "post-fit" operations such as predict and score. We'll look at that in a future notebook.

In [None]:
y_test

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

# Root mean squared error (RMSE) on the held-out test set
sqrt(mean_squared_error(y_test.compute(), y_test_predicted))
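
To make the metric above concrete, here is a minimal plain-Python sketch of the same quantity, the square root of the mean squared error. The helper `rmse` and the sample values are hypothetical and unrelated to the diamonds data.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: the errors are 1, 0 and -2, so the MSE is (1 + 0 + 4) / 3
example = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(example)
```

Because the result is in the same units as the target, it can be read directly as a typical prediction error in price.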

In [None]:
client.close()