# Dask-ML Part 3

Here we'll look at additional scenarios, including:
* Big data, low parallelism (using Dask just to support out-of-core sklearn training)
* Big data, big parallelism with sklearn (only supports a small number of estimators)
* Parallel scoring (both in conjunction with parallel training, and in the scoring-only scenario)

### Out-of-Core Scikit-Learn via Dask

This use case applies when we have a large dataset, and we're using a scikit-learn estimator that supports incremental training (`partial_fit` method).

Dask can manage the "chunking" of the data so that we can easily train on datasets that don't fit in memory. But __we will only have the parallelism supported by that sklearn estimator, which is usually none__: the chunks are fed to the estimator one after another.
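To make the chunked pattern concrete, here is a minimal sketch of sequential `partial_fit` training without Dask. The synthetic data, the choice of `SGDRegressor`, and the chunk count are illustrative assumptions, not part of this notebook's dataset; Dask's role in the real workflow is to produce and deliver the chunks.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for a large dataset (assumption): y = 3*x0 - 2*x1 + noise
X = rng.normal(size=(10_000, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=10_000)

model = SGDRegressor(random_state=0)

# Feed the estimator one chunk at a time. Each call updates the same model,
# so only one chunk needs to be in memory at once, and the calls are sequential.
n_chunks = 10
for X_chunk, y_chunk in zip(np.array_split(X, n_chunks), np.array_split(y, n_chunks)):
    model.partial_fit(X_chunk, y_chunk)

print(model.coef_)
```

Because every `partial_fit` call depends on the model state left by the previous one, the loop itself cannot be parallelized, which is why the estimator's own parallelism is the limit.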

In [None]:
from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=1, memory_limit='256MB')

client

In [None]:
import dask.dataframe

ddf = dask.dataframe.read_csv('data/diamonds.csv', blocksize=1e6)
ddf

In [None]:
y = ddf.price
ddf = ddf.drop(['Unnamed: 0', 'price'], axis=1)

y

In [None]:
from sklearn.pipeline import make_pipeline
from dask_ml.preprocessing import Categorizer, DummyEncoder

pipe = make_pipeline(
    Categorizer(),
    DummyEncoder()
)

In [None]:
ddf_cat = pipe.fit_transform(ddf)
ddf_cat

In [None]:
from dask_ml.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(ddf_cat, y, test_size=0.3)

X_train

__The `PassiveAggressiveRegressor` is designed for incremental training__

In [None]:
from sklearn.linear_model import PassiveAggressiveRegressor

est = PassiveAggressiveRegressor(n_iter_no_change=10)

Here we wrap the sklearn estimator in Dask's `Incremental` meta-estimator. Note the `scoring` kwarg. It is strongly recommended to pass a scoring parameter in order to ensure that a Dask-compatible metric calculation is used during training. 

More info at https://ml.dask.org/incremental.html

In [None]:
from dask_ml.wrappers import Incremental

inc = Incremental(est, scoring='neg_mean_squared_error')
inc.fit(X_train, y_train)

In [None]:
import math

neg_mse = inc.score(X_test, y_test.to_dask_array())
math.sqrt(-neg_mse)

Not so great ... but we can run multiple batches (or, here, epochs, since the data isn't so large) and perhaps converge to something better:

In [None]:
for _ in range(10):
    inc.partial_fit(X_train, y_train)
    print('Score:', math.sqrt(-inc.score(X_test, y_test.to_dask_array())))

### Putting it all together...

For a very small number of sklearn estimators, we get support for both incremental training (batches) and parallel fitting. In this case, we can use Dask to handle scaling both the data and the training.

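We'll try a classification problem with three price classes: "cheap" diamonds (price below the lower threshold), "expensive" diamonds (price above the upper threshold), and the rest. The summary statistics below help in picking the thresholds.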

In [None]:
y_train.describe().compute()

We can use Dask array or DF APIs to threshold the response value, or use `.apply` and provide our own simple function:

In [None]:
THRESHOLDS = (1200, 11000)

def threshold(p):
    # 0 = below the lower threshold, 2 = above the upper threshold, 1 = in between
    if p < THRESHOLDS[0]:
        return 0
    if p > THRESHOLDS[1]:
        return 2
    return 1

y_train_cat = y_train.apply(threshold, meta=('price','int64'))
y_test_cat = y_test.apply(threshold, meta=('price','int64'))
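As an alternative to a per-element `.apply`, the same labeling can be written with vectorized comparisons. The sketch below uses plain NumPy on a handful of made-up prices; `threshold_vectorized` is a hypothetical helper name. Writing the same comparison expression directly on a Dask Series should stay lazy, but that is an assumption not exercised here.

```python
import numpy as np

THRESHOLDS = (1200, 11000)

def threshold_vectorized(prices):
    """Label prices 0 (below lower threshold), 2 (above upper), 1 otherwise."""
    prices = np.asarray(prices)
    # Each comparison contributes 0 or 1, so the sum is the class label.
    at_least_lower = (prices >= THRESHOLDS[0]).astype("int64")
    above_upper = (prices > THRESHOLDS[1]).astype("int64")
    return at_least_lower + above_upper

# Made-up prices, including both boundary values
labels = threshold_vectorized([500, 1200, 5000, 11000, 20000])
print(labels)
```

The boundary values land in the middle class, matching the `threshold` function, which uses strict `<` and `>` comparisons.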

Here we'll use the Dask joblib backend together with the `Incremental` wrapper.

In [None]:
from sklearn.linear_model import SGDClassifier
import joblib

sgd = SGDClassifier(n_jobs=4)

with joblib.parallel_backend('dask'):
    inc2 = Incremental(sgd, scoring='accuracy')
    inc2.fit(X_train, y_train_cat, classes=[0,1,2])

In [None]:
inc2.score(X_test, y_test_cat)

Depending on our luck with SGD we may or may not have a great solution ... we can still run multiple epochs/batches if we want:

In [None]:
for _ in range(10):
    inc2.partial_fit(X_train, y_train_cat)
    print('Score:', inc2.score(X_test, y_test_cat))

The `Incremental` wrapper also parallelizes post-fit operations like `score` and `predict`.

In [None]:
predictions = inc2.predict(X_test)
predictions

In [None]:
predictions.partitions[0].compute()

### Parallel Prediction Only

In the case where we have small data and can train a sklearn model locally (or load a model trained elsewhere), we can still use Dask to parallelize certain post-fit operations like `transform`, `predict`, and `predict_proba`.

Dask's `ParallelPostFit` wrapper/meta-estimator can make predictions using parallel tasks for *any* sklearn estimator because, under the hood, it's basically just doing a `map_partitions` or `map_blocks` with the relevant function.
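The per-block idea can be imitated without Dask: split the input into blocks, call `predict` on each block independently, and stitch the results back together. This is only a sketch of the pattern with NumPy splits standing in for Dask partitions; the synthetic data and block count are assumptions, and it is not how `ParallelPostFit` is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Small training set that fits comfortably in memory (synthetic, assumption)
X_small = rng.normal(size=(200, 3))
y_small = 2.0 * X_small[:, 0] + X_small[:, 1]
model = DecisionTreeRegressor(random_state=0).fit(X_small, y_small)

# Larger set to score, split into blocks that stand in for Dask partitions
X_big = rng.normal(size=(1000, 3))
blocks = np.array_split(X_big, 4)

# Each block is predicted independently of the others, so these calls
# could run as separate tasks on separate workers.
block_preds = [model.predict(block) for block in blocks]
stitched = np.concatenate(block_preds)

print(stitched.shape)
```

Since prediction for one row never depends on another row, the stitched result matches a single `predict` call on the whole array, which is what makes this step safe to parallelize for any fitted estimator.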

In [None]:
from dask_ml.wrappers import ParallelPostFit

In [None]:
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor()
# Note: sklearn is not Dask-aware, so X_train and y_train get
# `compute`d into local memory on the machine running this notebook
tree.fit(X_train, y_train)

parallel_predicting_scorer = ParallelPostFit(estimator=tree)

In [None]:
da_scores = parallel_predicting_scorer.predict(X_test)
da_scores

In [None]:
da_scores.compute()

In [None]:
from math import sqrt
from dask_ml.metrics import mean_squared_error

sqrt(mean_squared_error(y_test.to_dask_array(), da_scores))

In [None]:
client.close()