# Hyperparameter Tuning with Katib (PyTorch)

## Introduction

Hyperparameter tuning is the process of optimizing a model's hyperparameter values in order to maximize the predictive quality of the model.
Examples of such hyperparameters are the learning rate, neural architecture depth (layers) and width (nodes), epochs, batch size, dropout rate, and activation functions.
These are the parameters that are set prior to training; unlike the model parameters (weights and biases), these do not change during the process of training the model.


This notebook shows how you can create and configure an `Experiment` for a `PyTorch` training job.
In terms of Kubernetes, such an experiment is a custom resource handled by the Katib operator.

### What You'll Need
The Docker container with the PyTorch operator from the previous session can be used:
 - [PyTorch](../training/pytorch/MNIST%20with%20PyTorch.ipynb)

The model must accept the hyperparameters to be tuned as inputs.


## How to Specify Hyperparameters in Your Models
In order for Katib to be able to tweak hyperparameters it needs to know what these are called in the model.
Beyond that, the model must specify these hyperparameters either as regular (command line) parameters or as environment variables.
Since the model needs to be containerized, any command line parameters or environment variables must be passed to the container that holds your model.
By far the most common and also the recommended way is to use command line parameters that are captured with [`argparse`](https://docs.python.org/3/library/argparse.html) or similar; the trainer (function) then uses their values internally.
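
As a rough illustration of the recommended approach, here is a minimal sketch of a command-line interface built with `argparse`. It assumes the four flags used by the example model in this notebook; the `parse_args` helper name and all default values are illustrative assumptions, not taken from the actual training script.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that are passed to the training container."""
    parser = argparse.ArgumentParser(description="Example trainer (illustrative)")
    # Default values below are placeholders chosen for this sketch.
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=0.01, help="learning rate")
    parser.add_argument("--momentum", type=float, default=0.5)
    return parser.parse_args(argv)


# Simulate the arguments a tuner could inject for a single trial.
args = parse_args(["--lr=0.03", "--momentum=0.7"])
print(args.batch_size, args.epochs, args.lr, args.momentum)
```

Note that `argparse` exposes `--batch-size` as the attribute `batch_size`; the trainer function would then read these attributes instead of hard-coded values.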

## How to Expose Model Metrics as Objective Functions
By default, Katib collects metrics from the standard output of a job container by using a sidecar container.
In order to make the metrics available to Katib, they must be logged to [stdout](https://www.kubeflow.org/docs/components/hyperparameter-tuning/experiment/#metrics-collector) in the `key=value` format.
The job output will be redirected to the `/var/log/katib/metrics.log` file.
This means that the objective metric name configured in Katib must match the metric's `key` in the model's output.
It's therefore possible to define custom model metrics for your use case.
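
A minimal sketch of how a training script might print its metrics in the `key=value` format follows. The helper names `format_metric` and `report_metrics`, the four-decimal precision, and the sample values are assumptions made for illustration; only the `key=value` convention itself comes from the text above.

```python
def format_metric(name: str, value: float) -> str:
    """Return a single metric as a `key=value` line."""
    return f"{name}={value:.4f}"


def report_metrics(loss: float, accuracy: float) -> None:
    """Print each metric on its own line to standard output."""
    # flush=True pushes the lines out immediately instead of buffering them.
    print(format_metric("loss", loss), flush=True)
    print(format_metric("accuracy", accuracy), flush=True)


# Hypothetical values, e.g. computed at the end of an epoch.
report_metrics(loss=0.25, accuracy=0.9)
```

With this convention, an experiment whose objective metric is named `loss` would pick up the first line, and `accuracy` could be collected as an additional metric.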

## How to Create Experiments
Before we proceed, let's set up a few basic definitions that we can re-use.
Note that you typically use (YAML) resource definitions for Kubernetes from the command line, but we shall show you how to do everything from a notebook, so that you do not have to exit your favourite environment at all!
Of course, if you are more familiar or comfortable with `kubectl` and the command line, feel free to use a local CLI or the embedded terminals from the JupyterLab launch screen.

In [1]:
PYTORCH_EXPERIMENT_FILE = "katib-pytorchjob-experiment.yaml"

We also want to capture the output of a cell with [`%%capture`](https://ipython.readthedocs.io/en/stable/interactive/magics.html#cellmagic-capture); for `kubectl apply`, it usually looks like `some-resource created`.
To that end, let's define a helper function:

In [2]:
import re

from IPython.utils.capture import CapturedIO


def get_resource(captured_io: CapturedIO) -> str:
    """
    Gets a resource name from `kubectl apply -f <configuration.yaml>`.

    :param CapturedIO captured_io: Output captured by using `%%capture` cell magic
    :return: Name of the Kubernetes resource
    :rtype: str
    :raises Exception: if the resource could not be created
    """
    out = captured_io.stdout
    # Look for a line such as "<resource> created" anywhere in the output
    matches = re.search(r"^(\S+)\s+created", out, flags=re.MULTILINE)
    if matches is None:
        raise Exception(f"Cannot get resource as its creation failed: {out}. It may already exist.")
    return matches.group(1)

### PyTorch: Katib PyTorchJob Experiment

This example is based on the FASHION MNIST with PyTorch notebook.

This model accepts several arguments:
- `--batch-size`
- `--epochs`
- `--lr` (i.e. the learning rate)
- `--momentum`

For our experiment we wish to find the optimal learning rate and momentum on the test data set.

In [3]:
%%writefile $PYTORCH_EXPERIMENT_FILE
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: demo01
  name: pytorchjob-ex2
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: minimize
    goal: 0.2
    objectiveMetricName: loss
  algorithm:
    algorithmName: random
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: momentum
      parameterType: double
      feasibleSpace:
        min: "0.5"
        max: "0.9"
  trialTemplate:
    primaryContainerName: pytorch
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: lr
      - name: momentum
        description: Momentum for the training model
        reference: momentum
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: pytorch
                    image: mavencodev/iris_pytorchjob:5.0
                    command:
                      - "python3"
                      - "/opt/iris.py"
                      - "--epochs=10"
                      - "--lr=${trialParameters.learningRate}"
                      - "--momentum=${trialParameters.momentum}"
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: pytorch
                    image: mavencodev/iris_pytorchjob:5.0
                    command:
                      - "python3"
                      - "/opt/iris.py"
                      - "--epochs=10"
                      - "--lr=${trialParameters.learningRate}"
                      - "--momentum=${trialParameters.momentum}"

Overwriting katib-pytorchjob-experiment.yaml
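
To make the experiment definition more concrete, the following sketch mimics what a random search does conceptually: it draws one value per parameter from its feasible space and substitutes the values into the trial command. This is not Katib's implementation; the names `FEASIBLE_SPACE`, `COMMAND_TEMPLATE`, `sample_assignment`, and `render_command` are hypothetical, and uniform sampling is an assumption made for illustration.

```python
import random

# Bounds copied from the `feasibleSpace` entries of the experiment above.
FEASIBLE_SPACE = {"lr": (0.01, 0.05), "momentum": (0.5, 0.9)}

# Simplified stand-in for the container command with its placeholders.
COMMAND_TEMPLATE = [
    "python3",
    "/opt/iris.py",
    "--epochs=10",
    "--lr={lr}",
    "--momentum={momentum}",
]


def sample_assignment(rng: random.Random) -> dict:
    """Draw one value per parameter, uniformly within its bounds."""
    return {
        name: rng.uniform(low, high)
        for name, (low, high) in FEASIBLE_SPACE.items()
    }


def render_command(assignment: dict) -> list:
    """Fill the placeholders of the command with the sampled values."""
    return [part.format(**assignment) for part in COMMAND_TEMPLATE]


rng = random.Random(0)  # fixed seed so that repeated runs draw the same values
for _ in range(3):
    print(render_command(sample_assignment(rng)))
```

Each printed command corresponds to one trial; in the real experiment, up to `parallelTrialCount` such trials run at the same time until `maxTrialCount` is reached or the objective `goal` is met.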


## How to Run and Monitor Experiments

You can either execute these commands on your local machine with `kubectl` or you can run them from the notebook.

To submit our experiment, we execute:

In [4]:
%%capture kubectl_output --no-stderr
! kubectl apply -f $PYTORCH_EXPERIMENT_FILE

The cell magic grabs the output of the `kubectl` command and stores it in an object named `kubectl_output`.
From there we can use the utility function we defined earlier:

In [5]:
EXPERIMENT = get_resource(kubectl_output)

To see the status, we can then run:

In [6]:
! kubectl describe $EXPERIMENT



To get the list of created trials, use the following command:

In [7]:
! kubectl get trials.kubeflow.org -l experiment=pytorchjob-ex2

NAME                      TYPE        STATUS   AGE
pytorchjob-ex2-cp848tn9   Succeeded   True     6m8s
pytorchjob-ex2-drhn5xtt   Succeeded   True     6m4s
pytorchjob-ex2-j9fx4p8f   Succeeded   True     5m57s
pytorchjob-ex2-jzwmzpk2   Succeeded   True     5m52s
pytorchjob-ex2-m56k9w8l   Succeeded   True     5m44s
pytorchjob-ex2-nwvwh6nx   Succeeded   True     6m17s
pytorchjob-ex2-p8lx4hgn   Succeeded   True     6m17s
pytorchjob-ex2-r5kxqrkz   Succeeded   True     6m9s
pytorchjob-ex2-rvb5vhpg   Succeeded   True     6m17s
pytorchjob-ex2-s7jj7w9n   Succeeded   True     5m57s
pytorchjob-ex2-t66xzgt7   Succeeded   True     5m40s
pytorchjob-ex2-zznbzlws   Succeeded   True     5m43s


In [8]:
! kubectl get pods

NAME                                                              READY   STATUS                  RESTARTS   AGE
flo-worker-0                                                      0/1     Completed               0          16h
flo-worker-1                                                      0/1     Completed               0          16h
flo1-worker-0                                                     0/1     Completed               0          16h
flo1-worker-1                                                     0/1     Completed               0          16h
horovod-mnist-charles-driver                                      0/1     Completed               0          6h24m
kale-demo-training-0                                              2/2     Running                 0          2d16h
minio-covid-default-0-classifier-6dd59b964c-cb8bf                 0/3     Init:CrashLoopBackOff   352        31h
minio-sklearn-default-0-classifier-5b95d54bb6-265rm               0/3     Init:CrashLoopBack

After the experiment is completed, use `describe` to get the best trial results:

In [9]:
! kubectl describe $EXPERIMENT



The relevant section of the output looks similar to the following (the names and values shown here come from a different example run):
    
```yaml
Name:         katib-pytorchjob-experiment
...
Status:
  ...
  Current Optimal Trial:
    Best Trial Name:  katib-pytorchjob-experiment-jv4sc9q7
    Observation:
      Metrics:
        Name:   accuracy
        Value:  0.9902
    Parameter Assignments:
      Name:    --lr
      Value:   0.5512569257804198
  ...
  Trials:            6
  Trials Succeeded:  6
...
```
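
If you prefer to process the result in Python instead of reading the `describe` output, a sketch like the one below could extract the best trial from the experiment's JSON representation. It assumes that the JSON field names (`currentOptimalTrial`, `bestTrialName`, `observation`, `metrics`, `latest`, `parameterAssignments`) mirror the headings shown above; the `summarize_best_trial` helper and the sample data are hypothetical and do not come from a real run.

```python
import json


def summarize_best_trial(experiment: dict) -> dict:
    """Collect the best trial's name, metrics, and parameters from an experiment."""
    optimal = experiment.get("status", {}).get("currentOptimalTrial", {})
    metrics = {
        metric["name"]: metric.get("latest")
        for metric in optimal.get("observation", {}).get("metrics", [])
    }
    parameters = {
        assignment["name"]: assignment["value"]
        for assignment in optimal.get("parameterAssignments", [])
    }
    return {
        "trial": optimal.get("bestTrialName"),
        "metrics": metrics,
        "parameters": parameters,
    }


# Made-up sample with the assumed structure; values are placeholders.
sample_experiment = {
    "status": {
        "currentOptimalTrial": {
            "bestTrialName": "example-trial",
            "observation": {"metrics": [{"name": "loss", "latest": "0.25"}]},
            "parameterAssignments": [
                {"name": "lr", "value": "0.03"},
                {"name": "momentum", "value": "0.7"},
            ],
        }
    }
}

print(json.dumps(summarize_best_trial(sample_experiment), indent=2))
```

The input dictionary could be obtained by parsing the output of `kubectl get` with the `-o json` option for the experiment resource.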

## Delete Katib Job Runs to Free Up Resources

In [10]:
! kubectl delete -f $PYTORCH_EXPERIMENT_FILE

experiment.kubeflow.org "pytorchjob-ex2" deleted


Check whether the pod is still up and running:

In [11]:
! kubectl -n demo01 logs -f pytorchjob-ex2

Error from server (NotFound): pods "pytorchjob-ex2" not found
