(hyper-params)=
# Hyperparameter tuning optimization

MLRun supports iterative tasks for automatic and distributed execution of many tasks with variable parameters (hyperparams). Iterative tasks can be distributed across multiple containers. They can be used for:
* Parallel loading and preparation of many data objects
* Model training with different parameter sets and/or algorithms
* Parallel testing with many test vector options
* AutoML

MLRun iterations can be viewed as child runs under the main task/run. Each child run gets a set of parameters that are computed/selected from the input hyperparameters based on the chosen strategy ([Grid](#grid-search-default), [List](#list-search), [Random](#random-search) or [Custom](#custom-iterator)).

The different iterations can run in parallel over multiple containers (using Dask or Nuclio runtimes, which manage the workers). Read more in [Parallel execution over containers](#parallel-execution-over-containers).

The hyperparameters and options are specified in the `task` or the {py:meth}`~mlrun.runtimes.BaseRuntime.run` command 
through the `hyperparams` (for hyperparam values) and `hyper_param_options` (for 
{py:class}`~mlrun.model.HyperParamOptions`) properties. See the examples below. 

The hyperparams are specified as a struct of `key: list` values. The values can be of any type (int, string, float, and so on). 
The lists are used to compute the parameter combinations using one of the 
following strategies: 
- [Grid search](#grid-search-default) (`grid`): runs all the parameter combinations. The `key: list` values structure is similar to: 
  `{ "p1": [1,2], "p2": [2,4] }`<br>
   This yields four iterations, one for each combination of `p1` and `p2`. 
   Hyperparameters can also be loaded directly from a JSON file (specify `param_file` in {py:class}`~mlrun.model.HyperParamOptions`).
- [Random](#random-search) (`random`): runs a sampled set from all the parameter combinations. Hyperparameters can also be loaded directly from a JSON file, the same as for `grid`.
- [List](#list-search) (`list`): runs the first parameter from each list, followed by the second from each list, and so on. **All the lists must be of equal length**. Hyperparameters can also be loaded directly from a JSON or CSV file containing a list of the iterations to be executed. Example JSON: `{"p1": [1], "p2": [10]}` (specify `param_file` in {py:class}`~mlrun.model.HyperParamOptions`).
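
The difference between the `grid` and `list` strategies can be sketched in plain Python. This is only an illustration of the combination rules described above, not MLRun's implementation; the helper names `expand_grid` and `expand_list` are hypothetical.

```python
from itertools import product


def expand_grid(hyperparams):
    """Return one parameter dict per combination of the listed values."""
    keys = list(hyperparams)
    return [dict(zip(keys, combo)) for combo in product(*hyperparams.values())]


def expand_list(hyperparams):
    """Return one parameter dict per position: first values, then second, and so on."""
    lengths = {len(values) for values in hyperparams.values()}
    if len(lengths) > 1:
        raise ValueError("the list strategy requires all lists to be of equal length")
    keys = list(hyperparams)
    return [dict(zip(keys, row)) for row in zip(*hyperparams.values())]


hyperparams = {"p1": [1, 2], "p2": [2, 4]}

grid_iterations = expand_grid(hyperparams)  # 2 x 2 = 4 parameter sets
list_iterations = expand_list(hyperparams)  # 2 parameter sets, paired by position

print(grid_iterations)
print(list_iterations)
```

With the same input, the grid expansion produces four parameter sets while the list expansion produces two, which is why the list strategy needs lists of equal length.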

You can specify a selection criterion for choosing the best run among the different child runs by setting the `selector` option. MLRun then marks the selected result as the parent (iteration 0) result, and marks the best result in the user interface.
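
As a rough mental model, a selector such as `max.accuracy` names an operator and a result key. The sketch below picks the best child from a list of result dicts. It is a simplified illustration, not MLRun's code: the helper `select_best` and the sample accuracy values are hypothetical, and the `min.` prefix is shown as an assumed counterpart of `max.`.

```python
def select_best(results, selector):
    """Return the 1-based iteration number of the best child result.

    `selector` is assumed to have the form "<max|min>.<result name>".
    """
    op, _, key = selector.partition(".")
    if op not in ("max", "min") or not key:
        raise ValueError(f"unsupported selector: {selector!r}")
    pick = max if op == "max" else min
    # Child iterations are numbered from 1; iteration 0 is the parent run.
    candidates = [
        (iteration, result[key])
        for iteration, result in enumerate(results, start=1)
        if key in result
    ]
    if not candidates:
        raise ValueError(f"no child run reported a result named {key!r}")
    best_iteration, _ = pick(candidates, key=lambda item: item[1])
    return best_iteration


# Hypothetical results returned by three child runs
child_results = [{"accuracy": 0.81}, {"accuracy": 0.93}, {"accuracy": 0.88}]

best_iteration = select_best(child_results, "max.accuracy")
lowest_iteration = select_best(child_results, "min.accuracy")
print(best_iteration, lowest_iteration)
```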

You can also specify the `stop_condition` option to stop the execution of child runs when a criterion, based on the returned results, is met (for example `stop_condition="accuracy>=0.9"`).

**In this section**
- [Basic code](#basic-code)
- [Review the results](#review-the-results)
- [Examples](#examples)
- [Parallel execution over containers](#parallel-execution-over-containers)

## Basic code

Here's a basic example of running multiple jobs in parallel for **hyperparameter tuning**, selecting the run with the highest `accuracy` as the best run. 

Run the hyperparameter tuning job by using the keyword arguments: 

* `hyperparams` for the hyperparameter names and the values to try.
* `selector` for specifying how to select the best model.

```python
hp_tuning_run = project.run_function(
    "trainer", 
    inputs={"dataset": gen_data_run.outputs["dataset"]}, 
    hyperparams={
        "n_estimators": [100, 500, 1000], 
        "max_depth": [5, 15, 30]
    }, 
    selector="max.accuracy", 
    local=True
)
```

The returned run object in this case represents the `parent` (and the **best** result). You can also access the 
individual child runs (called iterations) in the MLRun UI.

## Review the results

When running a hyperparam job, the job `results` tab lists the iterations and marks the best run:

<img src="./_static/images/hyperparam-results.png" alt="results" width="800"/>

You can also view the results by printing the `iteration_results` artifact:

```python
hp_tuning_run.artifact("iteration_results").as_df()
```

MLRun also generates a `parallel coordinates plot` for the run. You can view it in the MLRun UI.

![parallel_coordinates](./_static/images/parallel-coordinates.png)

## Examples

**Base dummy function:**

```python
import mlrun
```

```python
def hyper_func(context, p1, p2):
    print(f"p1={p1}, p2={p2}, result={p1 * p2}")
    context.log_result("multiplier", p1 * p2)
```

### Grid search (default)

```python
grid_params = {"p1": [2, 4, 1], "p2": [10, 20]}
task = mlrun.new_task("grid-demo").with_hyper_params(
    grid_params, selector="max.multiplier"
)
run = mlrun.new_function().run(task, handler=hyper_func)
```

**UI Screenshot:**
<br><br>
<img src="_static/images/hyper-params.png" alt="hyper-params" width="800"/>


### Random search
MLRun chooses random parameter combinations. Limit the number of combinations using the `max_iterations` attribute.

```python
grid_params = {"p1": [2, 4, 1, 3], "p2": [10, 20, 30]}
task = mlrun.new_task("random-demo")
task.with_hyper_params(
    grid_params, selector="max.multiplier", strategy="random", max_iterations=4
)
run = mlrun.new_function().run(task, handler=hyper_func)
```
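
To build intuition for what a capped random search does, the sketch below draws four parameter sets from the same lists using only the standard library. It assumes each parameter is drawn independently, so repeated combinations are possible; MLRun's actual sampling may differ. The helper `sample_random` and its `seed` argument are hypothetical.

```python
import random


def sample_random(hyperparams, max_iterations, seed=None):
    """Draw `max_iterations` parameter sets, choosing each value independently."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return [
        {key: rng.choice(values) for key, values in hyperparams.items()}
        for _ in range(max_iterations)
    ]


grid_params = {"p1": [2, 4, 1, 3], "p2": [10, 20, 30]}
samples = sample_random(grid_params, max_iterations=4, seed=0)

for params in samples:
    print(params, "multiplier =", params["p1"] * params["p2"])
```

Only 4 of the 12 possible combinations are evaluated, so the best multiplier found is not guaranteed to be the global maximum (`4 * 30 = 120`).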

### List search

This example also shows how to use the `stop_condition` option.

```python
list_params = {"p1": [2, 3, 7, 4, 5], "p2": [15, 10, 10, 20, 30]}
task = mlrun.new_task("list-demo").with_hyper_params(
    list_params,
    selector="max.multiplier",
    strategy="list",
    stop_condition="multiplier>=70",
)
run = mlrun.new_function().run(task, handler=hyper_func)
```
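
The effect of the stop condition on this example can be traced by hand. The sketch below assumes that the iterations run one after another and that the condition is checked after each child run; it is plain Python arithmetic, not MLRun code.

```python
list_params = {"p1": [2, 3, 7, 4, 5], "p2": [15, 10, 10, 20, 30]}
threshold = 70  # taken from stop_condition="multiplier>=70"

executed = []
for p1, p2 in zip(list_params["p1"], list_params["p2"]):
    multiplier = p1 * p2
    executed.append({"p1": p1, "p2": p2, "multiplier": multiplier})
    if multiplier >= threshold:
        break  # the remaining pairs are never executed

print(executed)
# multipliers of the executed runs: 30, 30, 70
```

Under these assumptions only three of the five pairs run, and the pairs `(4, 20)` and `(5, 30)` are skipped even though they would produce larger multipliers. A stop condition trades completeness for a shorter run.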

### Custom iterator

You can define a child iteration context under the parent/main run. The child run is logged independently.

```python
def handler(context: mlrun.MLClientCtx, param_list):
    best_multiplier = total = 0
    for param in param_list:
        with context.get_child_context(**param) as child:
            hyper_func(child, **child.parameters)
            multiplier = child.results["multiplier"]
            total += multiplier
            if multiplier > best_multiplier:
                child.mark_as_best()
                best_multiplier = multiplier

    # log result at the parent
    context.log_result("avg_multiplier", total / len(param_list))
```

```python
param_list = [{"p1": 2, "p2": 10}, {"p1": 3, "p2": 30}, {"p1": 4, "p2": 7}]
run = mlrun.new_function().run(handler=handler, params={"param_list": param_list})
```
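
The handler's bookkeeping can be checked without MLRun by replaying its loop over the same `param_list`. This trace only mirrors the arithmetic and the points at which `mark_as_best()` would be called; the `marked` list is a hypothetical stand-in for those calls.

```python
param_list = [{"p1": 2, "p2": 10}, {"p1": 3, "p2": 30}, {"p1": 4, "p2": 7}]

best_multiplier = total = 0
marked = []  # 1-based positions at which the handler would call mark_as_best()
for position, params in enumerate(param_list, start=1):
    multiplier = params["p1"] * params["p2"]
    total += multiplier
    if multiplier > best_multiplier:
        marked.append(position)
        best_multiplier = multiplier

avg_multiplier = total / len(param_list)
print(marked, best_multiplier, avg_multiplier)
```

The first child is marked as best and is then superseded by the second (multiplier 90), so the handler relies on a later `mark_as_best()` call replacing an earlier one. The parent logs `avg_multiplier` as `(20 + 90 + 28) / 3 = 46.0`.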

## Parallel execution over containers

When working with compute-intensive or long-running tasks, you'll want to run your iterations over a cluster of containers. At the same time, you don't want to bring up too many containers, so you want to limit the number of parallel tasks.

MLRun supports distribution of the child runs over Dask or Nuclio clusters. This is handled automatically by MLRun. You only need to deploy the Dask or Nuclio function used by the workers, and set the level of parallelism in the task. The execution can be controlled from the client/notebook, or by a job (immediate or scheduled).

### Code example (single task)

```python
# mark the start of a code section that will be sent to the job
# mlrun: start-code
```

```python
import socket
import pandas as pd


def hyper_func2(context, data, p1, p2, p3):
    print(data.as_df().head())
    context.logger.info(f"p2={p2}, p3={p3}, r1={p2 * p3} at {socket.gethostname()}")
    context.log_result("r1", p2 * p3)
    raw_data = {
        "first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
        "age": [42, 52, 36, 24, 73],
        "testScore": [25, 94, 57, 62, 70],
    }
    df = pd.DataFrame(raw_data, columns=["first_name", "age", "testScore"])
    context.log_dataset("mydf", df=df, stats=True)
```

```python
# mlrun: end-code
```

### Running the workers using Dask

This example creates a new function and executes the parent/controller as an MLRun `job` and the different child runs over a Dask cluster (MLRun Dask function).

#### Define a Dask cluster (using MLRun serverless Dask)

```python
dask_cluster = mlrun.new_function("dask-cluster", kind="dask", image="mlrun/mlrun")
dask_cluster.apply(mlrun.mount_v3io())  # add volume mounts
dask_cluster.spec.service_type = "NodePort"  # open interface to the dask UI dashboard
dask_cluster.spec.replicas = 2  # define two containers
uri = dask_cluster.save()
uri
```

```python
# initialize the dask cluster and get its dashboard url
dask_cluster.client
```

#### Define the parallel work

Set the `parallel_runs` attribute to indicate how many child tasks to run in parallel. Set `dask_cluster_uri` to point 
to the Dask cluster (if it is not set, a local Dask cluster is used). You can also set the `teardown_dask` flag to free up 
all the Dask resources after completion.

```python
grid_params = {"p2": [2, 1, 4, 1], "p3": [10, 20]}
task = mlrun.new_task(
    params={"p1": 8},
    inputs={"data": "https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv"},
)
task.with_hyper_params(
    grid_params,
    selector="r1",
    strategy="grid",
    parallel_runs=4,
    dask_cluster_uri=uri,
    teardown_dask=True,
)
```
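
The idea of capping concurrency with `parallel_runs` can be illustrated locally with a thread pool standing in for the Dask workers. This is a sketch of bounded parallelism in general, not of how MLRun schedules child runs: `child_run` is a hypothetical stand-in for `hyper_func2`, and the repeated value `1` in `p2` is assumed to be kept rather than deduplicated.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import product

grid_params = {"p2": [2, 1, 4, 1], "p3": [10, 20]}
parallel_runs = 4  # upper bound on child runs executing at the same time

# Expand the grid: 4 values of p2 x 2 values of p3 = 8 child runs
combos = [dict(zip(grid_params, values)) for values in product(*grid_params.values())]


def child_run(params):
    """Stand-in for one child run: compute the r1 result for a parameter set."""
    return {**params, "r1": params["p2"] * params["p3"]}


# max_workers plays the role of parallel_runs
with ThreadPoolExecutor(max_workers=parallel_runs) as pool:
    results = list(pool.map(child_run, combos))

best = max(results, key=lambda r: r["r1"])
min_rounds = math.ceil(len(combos) / parallel_runs)
print(len(results), best, min_rounds)
```

With 8 child runs and at most 4 running at once, the work needs at least two rounds, and the largest `r1` is `4 * 20 = 80`.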

**Define a job that will take the code (using `code_to_function`) and run it over the cluster**

```python
fn = mlrun.code_to_function(name="hyper-tst", kind="job", image="mlrun/mlrun")
```

```python
run = fn.run(task, handler=hyper_func2)
```

### Running the workers using Nuclio

Nuclio is a high-performance serverless engine that can process many events in parallel. It can also separate initialization from execution, so certain parts of the code (imports, loading data, etc.) can run once per worker rather than in every run.

By default, Nuclio processes events (HTTP, stream, and so on). A special Nuclio kind, `nuclio:mlrun`, runs MLRun jobs.

```{admonition} Notes
* Nuclio tasks should be relatively short (preferably under 5 minutes). Use Nuclio for running many iterations where each individual run takes less than 5 minutes.
* Use `context.logger` to drive text outputs (vs `print()`).
```

#### Create a nuclio:mlrun function



```python
fn = mlrun.code_to_function(name="hyper-tst2", kind="nuclio:mlrun", image="mlrun/mlrun")
# replicas * workers need to match or exceed parallel_runs
fn.spec.replicas = 2
fn.with_http(workers=2)
fn.deploy()
```
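
The sizing rule in the comment above (replicas times workers must match or exceed `parallel_runs`) is simple arithmetic. The sketch below computes the smallest replica count for a given number of workers per replica; the helper `min_replicas` is hypothetical and not part of MLRun.

```python
import math


def min_replicas(parallel_runs, workers_per_replica):
    """Smallest replica count for which replicas * workers >= parallel_runs."""
    if parallel_runs < 1 or workers_per_replica < 1:
        raise ValueError("both arguments must be positive integers")
    return math.ceil(parallel_runs / workers_per_replica)


replicas = min_replicas(parallel_runs=4, workers_per_replica=2)
capacity = replicas * 2
print(replicas, capacity)
```

For `parallel_runs=4` with two workers per replica, two replicas are enough, which matches the values used in the function definition above.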

#### Run the parallel task over the function

```python
# this is required to fix Jupyter issue with asyncio (not required outside of Jupyter)
# run it only once
import nest_asyncio

nest_asyncio.apply()
```

```python
grid_params = {"p2": [2, 1, 4, 1], "p3": [10, 20]}
task = mlrun.new_task(
    params={"p1": 8},
    inputs={"data": "https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv"},
)
task.with_hyper_params(
    grid_params, selector="r1", strategy="grid", parallel_runs=4, max_errors=3
)
run = fn.run(task, handler=hyper_func2)
```