# Hyperparameter Tuning with Katib (TensorFlow)

## Introduction

Hyperparameter tuning is the process of optimizing a model's hyperparameter values in order to maximize the predictive quality of the model.
Examples of such hyperparameters are the learning rate, neural architecture depth (layers) and width (nodes), epochs, batch size, dropout rate, and activation functions.
These are the parameters that are set prior to training; unlike the model parameters (weights and biases), these do not change during the process of training the model.


This notebook shows how you can create and configure an `Experiment` for a TensorFlow training job.
In terms of Kubernetes, such an experiment is a custom resource handled by the Katib operator.

### What You'll Need
You can reuse the Docker container with the TensorFlow Operator from the previous session:
 - [TensorFlow](../training/tensorflow/MNIST%20with%20TensorFlow.ipynb)

The model must accept the hyperparameters to be tuned as inputs.


## How to Specify Hyperparameters in Your Models
In order for Katib to be able to tweak hyperparameters it needs to know what these are called in the model.
Beyond that, the model must specify these hyperparameters either as regular (command line) parameters or as environment variables.
Since the model needs to be containerized, any command line parameters or environment variables must be passed to the container that holds your model.
By far the most common and also the recommended way is to use command line parameters that are captured with [`argparse`](https://docs.python.org/3/library/argparse.html) or similar; the trainer (function) then uses their values internally.
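
As a minimal sketch of this pattern, the trainer below reads its hyperparameters with `argparse`. The flag names match the ones passed to the container in the experiment later in this notebook; the function names, default values, and the placeholder `train` routine are assumptions made for illustration and are not taken from the actual training script.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the hyperparameters that are passed on the command line."""
    parser = argparse.ArgumentParser(description="Hypothetical trainer entry point")
    # Defaults are illustrative assumptions; Katib overrides them for each trial.
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--batch_size", type=int, default=100)
    parser.add_argument("--optimizer", choices=["rmsprop", "adam"], default="adam")
    return parser.parse_args(argv)


def train(learning_rate, batch_size, optimizer):
    """Placeholder for the real training routine; it only echoes its inputs."""
    return {
        "learning_rate": learning_rate,
        "batch_size": batch_size,
        "optimizer": optimizer,
    }


# Simulate the arguments that a single trial would pass to the container.
args = parse_hyperparameters(
    ["--learning_rate=0.003", "--batch_size=128", "--optimizer=rmsprop"]
)
config = train(args.learning_rate, args.batch_size, args.optimizer)
print(config)
```

Passing an explicit list to `parse_args` keeps the example runnable inside a notebook; in the container the function would be called without arguments so that it reads `sys.argv`.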

## How to Expose Model Metrics as Objective Functions
By default, Katib collects metrics from the standard output of a job container by using a sidecar container.
In order to make the metrics available to Katib, they must be logged to [stdout](https://www.kubeflow.org/docs/components/hyperparameter-tuning/experiment/#metrics-collector) in the `key=value` format.
The job output is redirected to the `/var/log/katib/metrics.log` file.
This means that the objective metric name configured for Katib (`objectiveMetricName` in the experiment) must match the metric's `key` in the model's output.
It's therefore possible to define custom model metrics for your use case.
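
The following sketch shows one way a trainer might emit metrics in the `key=value` format. The helper names and the `loss` metric are assumptions for illustration; only the `accuracy` key corresponds to the objective used later in this notebook.

```python
def format_metric(name, value):
    """Render one metric as a `key=value` line."""
    return f"{name}={value}"


def report_metrics(metrics):
    """Print each metric on its own line so it appears on standard output."""
    for name, value in metrics.items():
        # flush=True avoids losing lines to output buffering inside a container.
        print(format_metric(name, value), flush=True)


# Hypothetical values, e.g. computed at the end of an epoch.
report_metrics({"accuracy": 0.9902, "loss": 0.0312})
```

If the experiment's objective is `accuracy`, the line starting with `accuracy=` is the one that matters; additional keys such as `loss` are only collected if they are also configured as metrics.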

## How to Create Experiments
Before we proceed, let's set up a few basic definitions that we can re-use.
Note that you typically use (YAML) resource definitions for Kubernetes from the command line, but we shall show you how to do everything from a notebook, so that you do not have to exit your favourite environment at all!
Of course, if you are more familiar or comfortable with `kubectl` and the command line, feel free to use a local CLI or the embedded terminals from the Jupyter Lab launch screen.

In [1]:
TF_EXPERIMENT_FILE = "katib-tfjob-experimentnew.yaml"

In [2]:
import re

from IPython.utils.capture import CapturedIO


def get_resource(captured_io: CapturedIO) -> str:
    """
    Gets a resource name from `kubectl apply -f <configuration.yaml>`.

    :param CapturedIO captured_io: Output captured by using `%%capture` cell magic
    :return: Name of the Kubernetes resource
    :rtype: str
    :raises Exception: if the resource could not be created
    """
    out = captured_io.stdout
    matches = re.search(r"^(.+)\s+created", out)
    if matches is not None:
        return matches.group(1)
    else:
        raise Exception(f"Cannot get resource as its creation failed: {out}. It may already exist.")
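
To make the helper's behaviour concrete, the sketch below applies the same regular expression to a sample line of the form that `kubectl apply` prints when it creates a resource. The `fake_capture` object is a hypothetical stand-in for the output of `%%capture`; only its `stdout` attribute is needed.

```python
import re
from types import SimpleNamespace

# Stand-in for the object produced by `%%capture`; the sample line is illustrative.
fake_capture = SimpleNamespace(stdout="experiment.kubeflow.org/newtfjob created\n")

# Same pattern as in `get_resource`: everything before the trailing "created".
match = re.search(r"^(.+)\s+created", fake_capture.stdout)
resource = match.group(1) if match else None
print(resource)
```

The extracted string has the form `<kind>.<group>/<name>`, which is why it can be handed directly to commands such as `kubectl describe`.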

### TensorFlow: Katib TFJob Experiment

The `TFJob` definition for this example is based on the [MNIST with TensorFlow](../training/tensorflow/MNIST%20with%20TensorFlow.ipynb) notebook.

This model accepts several arguments:
- `--batch-size`
- `--buffer-size`
- `--epochs`
- `--steps`
- `--learning-rate`
- `--momentum`

For our experiment, we want to focus on the learning rate, the batch size, and the choice of optimizer (`rmsprop` or `adam`), which are the parameters tuned in the `Experiment` below.

The following YAML file describes an `Experiment` object:

In [3]:
%%writefile $TF_EXPERIMENT_FILE
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: demo01
  name: newtfjob
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    kind: StdOut
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.005"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - rmsprop
          - adam
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: learning_rate
      - name: batchSize
        description: Batch Size
        reference: batch_size
      - name: optimizer
        description: Training model optimizer (rmsprop, adam)
        reference: optimizer
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: tensorflow
                    image: mavencodev/tfjob:6.0
                    command:
                      - "python"
                      - "/tfjob.py"
                      - "--learning_rate=${trialParameters.learningRate}"
                      - "--batch_size=${trialParameters.batchSize}"
                      - "--optimizer=${trialParameters.optimizer}"

Overwriting katib-tfjob-experimentnew.yaml
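
To illustrate how the `parameters`, `trialParameters`, and `command` sections fit together, here is a conceptual sketch of what happens for a single trial under random search: one value is drawn per parameter from its feasible space and then substituted into the command placeholders. This is not Katib's implementation; the function names and data structures are assumptions, and only the values are copied from the YAML above.

```python
import random

# Search space copied from the `parameters` section of the experiment.
SEARCH_SPACE = {
    "learning_rate": ("double", 0.001, 0.005),
    "batch_size": ("int", 100, 200),
    "optimizer": ("categorical", ["rmsprop", "adam"]),
}

# `trialParameters` name -> search-space parameter it references.
TRIAL_PARAMETERS = {
    "learningRate": "learning_rate",
    "batchSize": "batch_size",
    "optimizer": "optimizer",
}

COMMAND_TEMPLATE = [
    "python",
    "/tfjob.py",
    "--learning_rate=${trialParameters.learningRate}",
    "--batch_size=${trialParameters.batchSize}",
    "--optimizer=${trialParameters.optimizer}",
]


def sample_assignment(space, rng):
    """Draw one value per parameter, uniformly from its feasible space."""
    assignment = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "double":
            assignment[name] = rng.uniform(spec[1], spec[2])
        elif kind == "int":
            assignment[name] = rng.randint(spec[1], spec[2])  # inclusive bounds
        elif kind == "categorical":
            assignment[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"Unsupported parameter type: {kind}")
    return assignment


def render_command(template, trial_parameters, assignment):
    """Replace each `${trialParameters.<name>}` placeholder with its value."""
    rendered = []
    for arg in template:
        for trial_name, reference in trial_parameters.items():
            placeholder = "${trialParameters." + trial_name + "}"
            arg = arg.replace(placeholder, str(assignment[reference]))
        rendered.append(arg)
    return rendered


rng = random.Random(0)  # fixed seed so the illustration is repeatable
assignment = sample_assignment(SEARCH_SPACE, rng)
command = render_command(COMMAND_TEMPLATE, TRIAL_PARAMETERS, assignment)
print(command)
```

With `parallelTrialCount: 3` and `maxTrialCount: 12`, the experiment runs up to three such trials at a time and at most twelve in total, each with its own assignment.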


## How to Run and Monitor Experiments

You can either execute these commands on your local machine with `kubectl` or you can run them from the notebook.

To submit our experiment, we execute:

In [18]:
%%capture kubectl_output --no-stderr
! kubectl apply -f $TF_EXPERIMENT_FILE

The cell magic grabs the output of the `kubectl` command and stores it in an object named `kubectl_output`.
From there we can use the utility function we defined earlier:

In [19]:
EXPERIMENT = get_resource(kubectl_output)

To see the status, we can then run:

In [22]:
! kubectl describe $EXPERIMENT

Name:         newtfjob
Namespace:    demo01
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"kubeflow.org/v1beta1","kind":"Experiment","metadata":{"annotations":{},"name":"newtfjob","namespace":"demo01"},"spec":{"alg...
API Version:  kubeflow.org/v1beta1
Kind:         Experiment
Metadata:
  Creation Timestamp:  2021-03-21T01:41:55Z
  Finalizers:
    update-prometheus-metrics
  Generation:  2
  Managed Fields:
    API Version:  kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:algorithm:
          .:
          f:algorithmName:
        f:maxFailedTrialCount:
        f:maxTrialCount:
        f:metricsCollectorSpec:
        f:objective:
          .:
          f:goal:
          f:objectiveMetricName:
          f:type:
        f:para

To get the list of created trials, use the following command:

In [27]:
! kubectl get trials

No resources found in demo01 namespace.


After the experiment is completed, use `describe` to get the best trial results:

In [24]:
! kubectl describe $EXPERIMENT


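The relevant section of the output looks like the following sample, which was taken from a different experiment (`katib-tfjob-experiment`), so the names and parameters differ from the ones used above: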
    
```yaml
Name:         katib-tfjob-experiment
...
Status:
  ...
  Current Optimal Trial:
    Best Trial Name:  katib-tfjob-experiment-jv4sc9q7
    Observation:
      Metrics:
        Name:   accuracy
        Value:  0.9902
    Parameter Assignments:
      Name:    --lr
      Value:   0.5512569257804198
  ...
  Trials:            3
  Trials Succeeded:  3
...
```
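
If you prefer to read the best trial programmatically, a sketch along the following lines could summarize the status once it is available as a Python dictionary, for example after parsing JSON output. The key names (`currentOptimalTrial`, `bestTrialName`, and so on) are assumptions inferred from the field labels in the excerpt above and may differ between Katib versions; the sample data is transcribed from that excerpt.

```python
def best_trial_summary(status):
    """Summarize the best trial from an experiment's `status` mapping."""
    optimal = status.get("currentOptimalTrial", {})
    observation = optimal.get("observation", {})
    return {
        "trial": optimal.get("bestTrialName"),
        "metrics": {
            m["name"]: float(m["value"]) for m in observation.get("metrics", [])
        },
        "parameters": {
            p["name"]: p["value"] for p in optimal.get("parameterAssignments", [])
        },
    }


# Sample data transcribed from the excerpt above; the key names are assumptions.
sample_status = {
    "currentOptimalTrial": {
        "bestTrialName": "katib-tfjob-experiment-jv4sc9q7",
        "observation": {"metrics": [{"name": "accuracy", "value": "0.9902"}]},
        "parameterAssignments": [{"name": "--lr", "value": "0.5512569257804198"}],
    },
    "trials": 3,
    "trialsSucceeded": 3,
}

summary = best_trial_summary(sample_status)
print(summary["trial"], summary["metrics"], summary["parameters"])
```

The function returns empty metrics and parameters when no optimal trial has been recorded yet, which is a convenient signal that the experiment is still running.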

## Delete Katib Job Runs to Free Up Resources

In [25]:
! kubectl delete -f $TF_EXPERIMENT_FILE

experiment.kubeflow.org "newtfjob" deleted


Check whether the pod is still up and running:

In [26]:
! kubectl -n demo01 logs -f newtfjob

Error from server (NotFound): pods "newtfjob" not found
