### Create Experiments

##### Before we proceed, let's set up a few basic definitions that we can re-use.

In [1]:
TF_EXPERIMENT_FILE = "katib-tfjob-experiment.yaml"

In [2]:
import re

from IPython.utils.capture import CapturedIO


def get_resource(captured_io: CapturedIO) -> str:
    """
    Gets a resource name from `kubectl apply -f <configuration.yaml>`.

    :param CapturedIO captured_io: Output captured by using `%%capture` cell magic
    :return: Name of the Kubernetes resource
    :rtype: str
    :raises Exception: if the resource could not be created
    """
    out = captured_io.stdout
    matches = re.search(r"^(.+)\s+created", out)
    if matches is not None:
        return matches.group(1)
    else:
        raise Exception(f"Cannot get resource as its creation failed: {out}. It may already exist.")
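
As a quick check of the pattern used in `get_resource`, the sketch below runs the same regular expression against two sample strings. The sample lines are assumptions modelled on the `<resource> created` and `<resource> unchanged` messages that `kubectl apply` typically prints; they were not captured from a real cluster.

```python
import re

# Same pattern as in get_resource.
PATTERN = r"^(.+)\s+created"

# Hypothetical kubectl output lines, for illustration only.
created_line = "experiment.kubeflow.org/katibtf created\n"
unchanged_line = "experiment.kubeflow.org/katibtf unchanged\n"

created_match = re.search(PATTERN, created_line)
unchanged_match = re.search(PATTERN, unchanged_line)

# The first group holds everything before the trailing "created".
resource_name = created_match.group(1) if created_match else None
print(resource_name)

# No match here, which is the case where get_resource raises.
print(unchanged_match)
```

The second case is why the error message mentions that the resource may already exist: re-applying an unchanged file produces no `created` line.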

### TensorFlow: Katib TFJob Experiment

The TFJob definition for this example is based on the birds TensorFlow notebook.

This model accepts several arguments, but for our experiment, we want to focus on the following parameters of the algorithm:

--learning_rate
--batch_size
--optimizer

The following YAML file describes an Experiment object:

In [3]:
%%writefile $TF_EXPERIMENT_FILE
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: ekemini
  name: katibtf
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    kind: StdOut
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.005"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - rmsprop
          - adam
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: learning_rate
      - name: batchSize
        description: Batch Size
        reference: batch_size
      - name: optimizer
        description: Training model optimizer (rmsprop, adam)
        reference: optimizer
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: tensorflow
                    image: mavencodev/tfbirds:v.0.3
                    command:
                      - "python"
                      - "/tfjob.py"
                      - "--batch_size=${trialParameters.batchSize}"
                      - "--learning_rate=${trialParameters.learningRate}"
                      - "--optimizer=${trialParameters.optimizer}"

Writing katib-tfjob-experiment.yaml
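
To make the relationship between `parameters`, `trialParameters`, and the container `command` more concrete, here is a minimal sketch of what one trial of a random search could look like. It is not Katib's implementation: the uniform sampling, the helper names `sample_assignment` and `render_command`, and the fixed seed are all assumptions made for illustration. Only the search space and the command template are taken from the YAML above.

```python
import random

# Search space copied from the `parameters` section of the Experiment.
SEARCH_SPACE = {
    "learning_rate": {"type": "double", "min": 0.001, "max": 0.005},
    "batch_size": {"type": "int", "min": 100, "max": 200},
    "optimizer": {"type": "categorical", "list": ["rmsprop", "adam"]},
}

# Maps each trialParameters name to the search-space parameter it references.
TRIAL_PARAMETERS = {
    "learningRate": "learning_rate",
    "batchSize": "batch_size",
    "optimizer": "optimizer",
}

# Command template copied from the trialSpec container.
COMMAND_TEMPLATE = [
    "python",
    "/tfjob.py",
    "--batch_size=${trialParameters.batchSize}",
    "--learning_rate=${trialParameters.learningRate}",
    "--optimizer=${trialParameters.optimizer}",
]


def sample_assignment(rng: random.Random) -> dict:
    """Draw one value per parameter, uniformly within its feasible space."""
    assignment = {}
    for name, space in SEARCH_SPACE.items():
        if space["type"] == "double":
            assignment[name] = rng.uniform(space["min"], space["max"])
        elif space["type"] == "int":
            assignment[name] = rng.randint(space["min"], space["max"])
        else:
            assignment[name] = rng.choice(space["list"])
    return assignment


def render_command(assignment: dict) -> list:
    """Replace each ${trialParameters.<name>} placeholder with its value."""
    rendered = []
    for arg in COMMAND_TEMPLATE:
        for trial_name, reference in TRIAL_PARAMETERS.items():
            placeholder = "${trialParameters." + trial_name + "}"
            arg = arg.replace(placeholder, str(assignment[reference]))
        rendered.append(arg)
    return rendered


rng = random.Random(0)  # fixed seed so the sketch is repeatable
assignment = sample_assignment(rng)
command = render_command(assignment)
print(command)
```

Each trial receives its own assignment, so with `maxTrialCount: 12` the experiment would launch up to twelve such commands, three at a time.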


### Run and Monitor Experiments
You can run these commands either on your local machine with kubectl or directly from the notebook.

To submit our experiment, we execute:

In [None]:
%%capture kubectl_output --no-stderr
! kubectl apply -f $TF_EXPERIMENT_FILE

##### The cell magic grabs the output of the kubectl command and stores it in an object named kubectl_output. From there we can use the utility function we defined earlier:

In [None]:
EXPERIMENT = get_resource(kubectl_output)
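
Outside a notebook, where the `%%capture` magic is not available, a similar flow can be sketched with `subprocess.run`. The helper name `run_and_get_resource` is hypothetical. So that the example runs without a cluster, it uses a child Python process that prints a sample line as a stand-in for kubectl; that sample line is an assumption, not real output.

```python
import re
import subprocess
import sys


def run_and_get_resource(command: list) -> str:
    """Run a command and return the resource name from a '<name> created' line."""
    # capture_output/text collect stdout as a string, much like %%capture does.
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    match = re.search(r"^(.+)\s+created", result.stdout)
    if match is None:
        raise RuntimeError(f"No 'created' line in output: {result.stdout!r}")
    return match.group(1)


# Stand-in for kubectl: a child Python process that prints a sample line.
# With a cluster available, the command would instead be
# ["kubectl", "apply", "-f", "katib-tfjob-experiment.yaml"].
fake_kubectl = [
    sys.executable,
    "-c",
    "print('experiment.kubeflow.org/katibtf created')",
]
experiment = run_and_get_resource(fake_kubectl)
print(experiment)
```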

##### To see the status, we can then run:

In [None]:
! kubectl describe $EXPERIMENT

##### To get the list of created experiments, use the following command:

In [None]:
! kubectl get experiments

##### To get the list of created trials, use the following command:

In [None]:
! kubectl get trials

##### After the experiment is completed, use describe to get the best trial results:

In [None]:
! kubectl describe $EXPERIMENT
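
If you would rather read the best trial programmatically than scan the `describe` output, one option is to fetch the Experiment as JSON (for example with `kubectl get experiment katibtf -o json`) and look at `status.currentOptimalTrial`. The sketch below parses a hand-written sample shaped like that field as it appears in the Katib v1beta1 API; check the field names against your Katib version. The trial name and all values are placeholders, not results from a real run, and `summarize_best_trial` is a hypothetical helper.

```python
import json

# Hand-written placeholder shaped like the output of
# `kubectl get experiment katibtf -o json`; values are NOT real results.
sample_json = """
{
  "status": {
    "currentOptimalTrial": {
      "bestTrialName": "katibtf-placeholder",
      "parameterAssignments": [
        {"name": "learning_rate", "value": "0.003"},
        {"name": "batch_size", "value": "128"},
        {"name": "optimizer", "value": "adam"}
      ],
      "observation": {
        "metrics": [
          {"name": "accuracy", "latest": "0.9", "max": "0.9", "min": "0.5"}
        ]
      }
    }
  }
}
"""


def summarize_best_trial(experiment: dict) -> dict:
    """Collect the best trial's name, parameters, and latest metric values."""
    optimal = experiment.get("status", {}).get("currentOptimalTrial", {})
    parameters = {
        p["name"]: p["value"] for p in optimal.get("parameterAssignments", [])
    }
    metrics = {
        m["name"]: m["latest"]
        for m in optimal.get("observation", {}).get("metrics", [])
    }
    return {
        "trial": optimal.get("bestTrialName"),
        "parameters": parameters,
        "metrics": metrics,
    }


summary = summarize_best_trial(json.loads(sample_json))
print(summary)
```

The `.get(..., {})` defaults mean the helper returns empty results, rather than raising, when the experiment has not yet reported an optimal trial.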

##### Delete the Katib experiment to free up resources

In [None]:
#! kubectl delete -f $TF_EXPERIMENT_FILE

##### Check to see if the pod is still up and running

In [None]:
#! kubectl -n ekemini get pods