# Parallelizing a simple grid search

When fitting a machine learning model, it is often unclear which hyperparameter values will produce the lowest error. A common way to gain insight is a hyperparameter grid search, where we fit the model for each combination of hyperparameters on a specified grid.

This notebook will go over a simple grid search for a Support Vector Classifier and how we can parallelize it using `scikit-learn` and `dask`. The majority of its content is taken from [this dask tutorial](https://github.com/dask/dask-tutorial/blob/main/08_machine_learning.ipynb).

## 0. Create the dataset

Our hypothetical problem has 4 features and 13,000 samples. Each sample is labeled either 0 or 1. We use `scikit-learn`'s `make_classification` to generate a random dataset for testing. We can look at a few samples of inputs and outputs and plot them to get an idea of what our dataset looks like.

In [None]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=13_000, n_features=4, random_state=0)
X[:8]

In [None]:
y[:8]

In [None]:
import matplotlib.pyplot as plt

nplot = 1000  # no. of points to plot
ix, iy = 0, 1  # which features to plot on x and y

fig, ax = plt.subplots(1, clear=True, figsize=(6, 5))
ax.grid();
ax.scatter(X[:nplot, ix], X[:nplot, iy], c=y[:nplot], zorder=4, alpha=0.8)
ax.set_xlabel(f'Feature {ix}')
ax.set_ylabel(f'Feature {iy}')

## 1. Fit a Support Vector Classifier

Let's start by fitting a single SVC model using a hard-coded set of hyperparameters. Note how long it takes to fit the model.

In [None]:
%%time

from sklearn.svm import SVC

estimator = SVC(C=1.0, kernel='poly', gamma='auto', random_state=0, probability=True)

estimator.fit(X, y)

We can quantify how well the model performs using the `estimator.score` method, which returns the mean accuracy on the data it is given (here, the same data we trained on).

In [None]:
estimator.score(X, y)
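Scoring on the same data the model was fit on tends to give an optimistic estimate. As a minimal sketch (not part of the original notebook), one could hold out part of the data with `train_test_split` and score on that instead. The smaller `n_samples` and the 25% test fraction are arbitrary choices made here to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Smaller dataset than in the notebook so the sketch runs quickly (assumption).
X_small, y_small = make_classification(n_samples=1000, n_features=4, random_state=0)

# Hold out 25% of the samples for scoring.
X_train, X_test, y_train, y_test = train_test_split(
    X_small, y_small, test_size=0.25, random_state=0
)

model = SVC(C=1.0, kernel='poly', gamma='auto', random_state=0)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)  # mean accuracy on the training data
test_score = model.score(X_test, y_test)     # mean accuracy on the held-out data
print(f'train: {train_score:.4f}, test: {test_score:.4f}')
```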

## 2. Hyperparameter optimization (no parallelization)

There are a few ways to search for the best hyperparameters. One is `GridSearchCV`. As the name implies, it does a brute-force search over a grid of hyperparameter combinations and scores each one with cross-validation.

We start by defining the grid of hyperparameters that we want to search. We'll also redefine the estimator that we'll use in the grid search. Finally, we will use 2-fold cross-validation: the data is split into two halves, the model is fit on one half and scored on the other, and then the roles are swapped. Every hyperparameter combination is therefore fit twice, and its two scores are averaged for a more robust estimate. The number of folds is specified using the `cv` keyword argument in `GridSearchCV`.

In [None]:
estimator = SVC(gamma='auto', random_state=0, probability=True)
param_grid = {
    'C': [0.001, 10.0],
    'kernel': ['rbf', 'poly'],
}
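To see what a grid like this expands to, here is a standard-library sketch that enumerates every combination with `itertools.product` and counts the cross-validation fits. It only illustrates the bookkeeping and is not how `GridSearchCV` is implemented; the variable names are made up for this example.

```python
from itertools import product

param_grid = {'C': [0.001, 10.0], 'kernel': ['rbf', 'poly']}
cv = 2  # number of cross-validation folds

# Build one dict per combination of hyperparameter values.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
for candidate in candidates:
    print(candidate)

# Each candidate is fit once per fold.
n_cv_fits = len(candidates) * cv
print(f'{len(candidates)} candidates x {cv} folds = {n_cv_fits} fits')
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full dataset after the search.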

**Exercise together:** Considering that our grid has (1) two values for `C` and (2) two values for `kernel` and (3) 2-fold cross validation, about how long do we think it would take to search the grid without parallelization? Use the fitting time from the previous section.

**Answer:**

Let's go ahead and test that. Does it match?

In [None]:
%%time
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2)
grid_search.fit(X, y)

Here are the values and score for the best hyperparameters.

In [None]:
def print_best(grid_search):
    """Print the values and score for the grid_search result"""
    best_svc = grid_search.best_estimator_
    print(f'The best model has C = {best_svc.C} and kernel = {best_svc.kernel}')
    print(f'  Model score: {grid_search.best_score_:.4f}')

print_best(grid_search)
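Beyond the single best model, a fitted search object records the score of every candidate in its `cv_results_` attribute. The sketch below is self-contained and uses a much smaller dataset (an assumption made for speed), so its scores will differ from the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small dataset so the example finishes quickly (assumption).
X_small, y_small = make_classification(n_samples=500, n_features=4, random_state=0)

search = GridSearchCV(
    SVC(gamma='auto', random_state=0),
    {'C': [0.001, 10.0], 'kernel': ['rbf', 'poly']},
    cv=2,
)
search.fit(X_small, y_small)

# cv_results_ is a dict of arrays with one entry per candidate.
results = search.cv_results_
for params, mean, rank in zip(
    results['params'], results['mean_test_score'], results['rank_test_score']
):
    print(f'rank {rank}: {params} -> mean CV score {mean:.4f}')
```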

## 3. Parallelization with scikit-learn

**Exercise together:** Considering the number of CPUs on your laptop, how fast do you expect the code to run if it is parallelized using multiple processes?

**Answer:**

`GridSearchCV` in `scikit-learn` supports basic parallelization through the `n_jobs` keyword, which causes multiple processes to be launched. You can pass an integer number of processes, or `-1` to use all available CPUs.

Note! On Windows, running multiple processes can interfere with printing to the console, so expect no verbose output.
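For a rough best-case estimate, assume every fit takes the same time and ignore process start-up costs. The fits then run in "rounds" of at most one fit per worker. The helper `ideal_wall_time` below is hypothetical and written only for this estimate; the 3-second fit time is a placeholder, not a measurement.

```python
import math
import os


def ideal_wall_time(n_fits, n_workers, seconds_per_fit):
    """Best-case wall time if all fits are equally long and run in full rounds."""
    rounds = math.ceil(n_fits / n_workers)
    return rounds * seconds_per_fit


n_cpus = os.cpu_count() or 1  # os.cpu_count() may return None
print(f'This machine reports {n_cpus} CPUs')

# 8 cross-validation fits (4 candidates x 2 folds) at a placeholder 3 s each.
for workers in (1, 2, 4, 8):
    print(f'{workers} workers -> {ideal_wall_time(8, workers, 3.0)} s')
```

Real runs are usually slower than this bound, since fits differ in length and starting worker processes takes time.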

**Exercise for the reader:** How does the wall time compare? Did it match your prediction? Why/why not?

**Answer:**

In [None]:
%%time
grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2, n_jobs=-1)
grid_search.fit(X, y)

Verify that we got the same solution for the grid search.

In [None]:
print_best(grid_search)

## 4. Parallelization with dask

**Exercise together:** The `scikit-learn` `n_jobs` option is good for this problem, but can you think of a situation (or two) where it would be insufficient?

**Answer:** 

Dask can also be used to parallelize, and it can run across multiple machines (e.g., on a cluster). For dask to run, we first define something called a "client", which is used to communicate with the workers. We will set up a simple local client for this exercise, but clients can also be connected to clusters. Because that setup is very cluster-specific, we won't discuss it here.

**Note!** If you have larger-than-memory datasets, you might want to check out some of the estimators in `dask_ml`.

Be sure to close the client using `c.close()` before closing the notebook or goblins will appear from the 4th hellish dimension. (Not really, it just may leave some ports open that we want closed.)

In [None]:
import dask.distributed

c = dask.distributed.Client()
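A `Client` can also be used as a context manager, which closes it automatically when the block exits. This is a minimal sketch rather than the approach used in this notebook: the worker counts and `processes=False` (threads instead of processes) are choices made here to keep the example light.

```python
from dask.distributed import Client

# The client (and its local cluster) is shut down when the `with` block exits.
with Client(n_workers=2, threads_per_worker=1, processes=False) as client:
    future = client.submit(sum, [1, 2, 3])  # run sum([1, 2, 3]) on a worker
    result = future.result()                # block until the result is ready

print(result)
```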

One cool thing about dask is that we can open a dashboard showing the status of our parallelization task. Open the following link in a browser.

The status window includes the number of bytes per worker (process), which processes are doing which tasks, and how many tasks are happening per process. You can also click the CPU tab on the bottom left to see how well your CPU is being utilized. Don't worry about the details of this dashboard. If you want to learn more about dask, there are many tutorials available.

In [None]:
c.dashboard_link

If you run this on a cluster, uncomment lines in the grid below to get a larger search grid.

In [None]:
param_grid = {
    'C': [0.001, 10.0],
    'kernel': ['rbf', 'poly'],
    # Uncomment this for larger Grid searches on a cluster
    # 'C': [0.001, 0.1, 1.0, 2.5, 5, 10.0],
    # 'kernel': ['rbf', 'poly', 'linear'],
    # 'shrinking': [True, False],
}

grid_search = GridSearchCV(estimator, param_grid, verbose=2, cv=2, n_jobs=-1)

It's finally time to run!

We use the `parallel_backend` context manager from the `joblib` package ([docs here](https://joblib.readthedocs.io/en/latest/parallel.html#joblib.parallel_backend)) to specify that we want our code to run using dask instead of the default backend. The `scatter` keyword sends a copy of the data to each worker before fitting begins. Don't worry about understanding what `joblib` is doing; it is not essential to the exercise.

While the code is running, you can keep an eye on the dask dashboard to see CPU usage, tasks per CPU, and a bunch of other information.

In [None]:
%%time
import joblib

with joblib.parallel_backend('dask', scatter=[X, y]):
    grid_search.fit(X, y)
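As an aside, the same context-manager pattern works with joblib's other backends. The sketch below is independent of dask and of the grid search: it runs a trivial function under the built-in `'threading'` backend, just to show how the backend is selected by name while the parallel call itself stays the same.

```python
from math import sqrt

import joblib

# Inside the block, joblib.Parallel picks up the backend and n_jobs set here.
with joblib.parallel_backend('threading', n_jobs=2):
    roots = joblib.Parallel()(joblib.delayed(sqrt)(x) for x in [0, 1, 4, 9])

print(roots)
```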

Verify the correct result and close the client:

In [None]:
c.close()
print_best(grid_search)

The wall time is much longer than with multiprocessing! But this is actually expected: dask adds a lot of overhead and doesn't always assign tasks evenly to workers.

One advantage of dask is that it can be run on multiple nodes of a cluster using almost exactly the same code as what we have here. Moreover, there are functions in the `dask_ml` library that work on very large datasets that cannot be loaded into memory. If that is of interest to you, see the last exercise in [this tutorial](https://github.com/dask/dask-tutorial/blob/main/08_machine_learning.ipynb).