# A Guided Tour of Ray Core: JobLib

[*Distributed scikit-learn*](https://docs.ray.io/en/latest/joblib.html) lets Ray act as a drop-in replacement for the [`JobLib`](https://joblib.readthedocs.io/en/latest/) backend that [`scikit-learn`](https://scikit-learn.org/stable/) uses to parallelize its work.


---

First, let's start Ray…

In [None]:
import logging
import ray

ray.init(
    ignore_reinit_error=True,
    logging_level=logging.ERROR,
)

## JobLib example

Import everything this example needs:

In [None]:
from ray.util.joblib import register_ray
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
import numpy as np
import joblib

First, let's register Ray as a parallel [*joblib*](https://scikit-learn.org/stable/modules/generated/sklearn.utils.parallel_backend.html) backend for `scikit-learn`, so that work is distributed across Ray actors instead of local processes.
This makes it easy to scale an existing application from a single node to a cluster.

See: <https://docs.ray.io/en/master/joblib.html>

In [None]:
%%time

register_ray()

Next, load a copy of the handwritten *digits* dataset from the UCI machine learning repository.
See: <https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html>

In [None]:
%%time

digits = load_digits()

We'll define the hyper-parameter space for training a *support vector machine* model:

In [None]:
%%time

param_space = {
    "C": np.logspace(-6, 6, 30),
    "gamma": np.logspace(-8, 8, 30),
    "tol": np.logspace(-4, -1, 30),
    "class_weight": [None, "balanced"],
}

model = SVC(kernel="rbf")
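
To get a feel for the ranges covered by `param_space`, the short sketch below rebuilds the three numeric grids and prints their endpoints. `np.logspace(start, stop, num)` returns `num` values evenly spaced on a log scale between `10**start` and `10**stop`. The `grids` dictionary is only an illustrative name and is not used elsewhere in this notebook.

```python
import numpy as np

# Same numeric grids as in param_space: 30 points each,
# evenly spaced on a log10 scale.
grids = {
    "C": np.logspace(-6, 6, 30),
    "gamma": np.logspace(-8, 8, 30),
    "tol": np.logspace(-4, -1, 30),
}

for name, values in grids.items():
    # The first and last entries are 10**start and 10**stop.
    print(f"{name}: {len(values)} values from {values[0]:.0e} to {values[-1]:.0e}")
```
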

Then use a randomized search to optimize these hyper-parameters. See: <https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html>

We'll use 10 cross-validation splits and 50 iterations, for a total of 500 "fits". This is enough to show `joblib` running in parallel on Ray, although in practice you'd probably use more iterations.
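
The arithmetic behind that number can be checked in a few lines. This is a sketch based only on the grid sizes in `param_space`; the names `grid_sizes`, `total_combinations`, and `cv_fits` are illustrative.

```python
import math

# Number of candidate values per hyper-parameter, as defined in param_space.
grid_sizes = {"C": 30, "gamma": 30, "tol": 30, "class_weight": 2}

n_iter = 50  # candidates sampled by the randomized search
cv = 10      # cross-validation splits per candidate

total_combinations = math.prod(grid_sizes.values())
cv_fits = n_iter * cv

print(f"distinct combinations in the grid: {total_combinations}")
print(f"cross-validation fits: {cv_fits}")
print(f"fraction of the grid sampled: {n_iter / total_combinations:.4%}")
```

The search samples only 50 of the 54,000 possible combinations. With the default `refit=True`, scikit-learn also trains the best candidate once more on the full dataset after the 500 cross-validation fits.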

In [None]:
%%time

clf = RandomizedSearchCV(model, param_space, cv=10, n_iter=50, verbose=True)
clf

Run the cross-validation fits (i.e., the random search for hyper-parameter optimization) using Ray to parallelize the backend processes.

NB: **While this runs, check out the performance metrics on the Ray dashboard**

In [None]:
%%time

with joblib.parallel_backend("ray"):
    search = clf.fit(digits.data, digits.target)

So far, what is the best set of hyper-parameters found?

In [None]:
search.best_params_
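
Beyond `best_params_`, a fitted search also exposes `best_score_` and the per-candidate table `cv_results_`. The sketch below shows how these might be read. To stay self-contained and quick, it assumes a much smaller search (`small_space`, `small_search`, and `random_state=0` are illustrative choices, not part of the example above) and runs on joblib's default local backend rather than on Ray.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

digits = load_digits()

# Deliberately tiny search so the sketch runs quickly without a cluster.
small_space = {
    "C": np.logspace(-1, 2, 4),
    "gamma": np.logspace(-4, -2, 3),
}
small_search = RandomizedSearchCV(
    SVC(kernel="rbf"), small_space, cv=3, n_iter=4, random_state=0
).fit(digits.data, digits.target)

print("best params:", small_search.best_params_)
print(f"best mean CV accuracy: {small_search.best_score_:.3f}")

# cv_results_ holds one entry per sampled candidate; rank 1 is the best.
results = small_search.cv_results_
rows = zip(
    results["rank_test_score"], results["params"], results["mean_test_score"]
)
for rank, params, score in sorted(rows, key=lambda row: row[0]):
    print(rank, params, round(score, 3))
```

The same attributes are available on the `search` object fitted under the Ray backend.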

Finally, shut down Ray.

In [None]:
ray.shutdown()