#### Create Experiments

In [93]:
TF_EXPERIMENT_FILE = "katibairline.yaml"

In [94]:
import re

from IPython.utils.capture import CapturedIO


def get_resource(captured_io: CapturedIO) -> str:
    """
    Gets a resource name from `kubectl apply -f <configuration.yaml>`.

    :param CapturedIO captured_io: Output captured by using the `%%capture` cell magic
    :return: Name of the Kubernetes resource
    :rtype: str
    :raises Exception: if the resource could not be created
    """
    out = captured_io.stdout
    matches = re.search(r"^(.+)\s+created", out)
    if matches is not None:
        return matches.group(1)
    else:
        raise Exception(f"Cannot get resource as its creation failed: {out}. It may already exist.")
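
A quick way to see what `get_resource` returns without a cluster is to feed it a stand-in object that only has a `stdout` attribute. The sketch below repeats the function's logic in compact form so it runs on its own. The sample strings are assumptions modeled on the usual `kubectl apply` messages (`<resource>/<name> created` and `<resource>/<name> unchanged`); they are not captured from this notebook.

```python
import re
from types import SimpleNamespace


def get_resource(captured_io) -> str:
    # Same logic as the notebook helper: take the text before " created".
    out = captured_io.stdout
    matches = re.search(r"^(.+)\s+created", out)
    if matches is not None:
        return matches.group(1)
    raise Exception(f"Cannot get resource as its creation failed: {out}. It may already exist.")


# Hypothetical outputs standing in for what `%%capture` would store.
created = SimpleNamespace(stdout="experiment.kubeflow.org/airline2 created\n")
unchanged = SimpleNamespace(stdout="experiment.kubeflow.org/airline2 unchanged\n")

resource = get_resource(created)
print(resource)

try:
    get_resource(unchanged)
except Exception as err:
    # The helper only recognizes "created", so re-applying an existing
    # Experiment ends up here.
    print(f"Not created: {err}")
```

Because the helper matches only on `created`, re-running the apply cell against an Experiment that already exists raises the exception; deleting the Experiment first avoids this.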

#### For our experiment, we want to tune the learning rate, batch size, and optimizer. The following YAML file describes an Experiment object:

In [98]:
%%writefile $TF_EXPERIMENT_FILE
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: admin
  name: airline2
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    kind: StdOut
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "50"
        max: "100"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - rmsprop
          - adam
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: learning_rate
      - name: batchSize
        description: Batch Size
        reference: batch_size
      - name: optimizer
        description: Training model optimizer (rmsprop, adam)
        reference: optimizer
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: tensorflow
                    image: mavencodevv/tfjob_airline:v.0.1
                    command:
                      - "python"
                      - "/tfjobairline.py"
                      - "--batch_size=${trialParameters.batchSize}"
                      - "--learning_rate=${trialParameters.learningRate}"
                      - "--optimizer=${trialParameters.optimizer}"


Overwriting katibairline.yaml
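
To make the link between `parameters`, `trialParameters`, and the container `command` concrete, here is a minimal sketch of what one trial of a random search could look like: draw one value per parameter from its feasible space, then substitute the `${trialParameters.<name>}` placeholders. This is an illustration only, not Katib's implementation; the helper names `sample_assignment` and `render_command` are hypothetical, and the dictionaries simply restate the YAML above.

```python
import random

# Search space restated from the Experiment's `parameters` section.
SEARCH_SPACE = {
    "learning_rate": {"type": "double", "min": 0.01, "max": 0.1},
    "batch_size": {"type": "int", "min": 50, "max": 100},
    "optimizer": {"type": "categorical", "list": ["rmsprop", "adam"]},
}

# trialParameters name -> referenced search-space parameter.
TRIAL_PARAMETERS = {
    "learningRate": "learning_rate",
    "batchSize": "batch_size",
    "optimizer": "optimizer",
}

# Container command restated from the trialSpec.
COMMAND_TEMPLATE = [
    "python",
    "/tfjobairline.py",
    "--batch_size=${trialParameters.batchSize}",
    "--learning_rate=${trialParameters.learningRate}",
    "--optimizer=${trialParameters.optimizer}",
]


def sample_assignment(rng: random.Random) -> dict:
    """Draw one value per parameter, uniformly within its feasible space."""
    assignment = {}
    for name, space in SEARCH_SPACE.items():
        if space["type"] == "double":
            assignment[name] = rng.uniform(space["min"], space["max"])
        elif space["type"] == "int":
            assignment[name] = rng.randint(space["min"], space["max"])
        else:
            assignment[name] = rng.choice(space["list"])
    return assignment


def render_command(assignment: dict) -> list:
    """Replace each ${trialParameters.<name>} placeholder with its sampled value."""
    rendered = []
    for arg in COMMAND_TEMPLATE:
        for trial_name, reference in TRIAL_PARAMETERS.items():
            placeholder = "${trialParameters." + trial_name + "}"
            arg = arg.replace(placeholder, str(assignment[reference]))
        rendered.append(arg)
    return rendered


rng = random.Random(0)  # seeded only so the sketch is repeatable
assignment = sample_assignment(rng)
command = render_command(assignment)
print(command)
```

With `maxTrialCount: 12` and `parallelTrialCount: 3`, the Experiment would run up to twelve such assignments, three at a time.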


#### Run and Monitor Experiments

To submit our experiment, we execute:

In [99]:
%%capture kubectl_output --no-stderr
! kubectl apply -f $TF_EXPERIMENT_FILE

#### The cell magic grabs the output of the kubectl command and stores it in an object named kubectl_output. From there we can use the utility function we defined earlier:

In [100]:
EXPERIMENT = get_resource(kubectl_output)

##### To see the status, we can then run:

In [101]:
! kubectl describe $EXPERIMENT

Name:         airline2
Namespace:    admin
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"kubeflow.org/v1alpha3","kind":"Experiment","metadata":{"annotations":{},"name":"airline2","namespace":"admin"},"spec":{"alg...
API Version:  kubeflow.org/v1alpha3
Kind:         Experiment
Metadata:
  Creation Timestamp:  2021-07-09T10:01:18Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:algorithm:
          .:
          f:algorithmName:
        f:maxFailedTrialCount:
        f:maxTrialCount:
        f:metricsCollectorSpec:
          .:
          f:kind:
        f:objective:
          .:
          f:goal:
          f:objectiveMetricName:
          f:type:
        f:parallel

##### To get the list of created experiments, use the following command:

In [91]:
! kubectl get experiments

NAME       STATUS   AGE
airline             41m
airline1            15h


##### To get the list of created trials, use the following command:

In [84]:
! kubectl get trials

No resources found in admin namespace.


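##### After the experiment is completed, use `kubectl describe` to get the best trial results. `kubectl logs` reads container logs from pods and cannot be used on an Experiment resource, so the command below fails: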

In [102]:
! kubectl logs $EXPERIMENT

error: no kind "Experiment" is registered for version "kubeflow.org/v1alpha3" in scheme "k8s.io/kubectl/pkg/scheme/scheme.go:28"
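
The best trial is reported in the Experiment's status rather than in any log. Below is a minimal sketch of extracting it from the JSON form of the resource (for example from `kubectl get <experiment> -o json`). It assumes the status carries a `currentOptimalTrial` field with `parameterAssignments` and `observation.metrics`, which is how Katib documents it; verify the field names against your Katib version. The sample document is a hypothetical placeholder, not output from this experiment, and `best_trial` is a hypothetical helper name.

```python
import json

# Hypothetical placeholder document; the values are NOT results from this run.
SAMPLE_EXPERIMENT_JSON = """
{
  "status": {
    "currentOptimalTrial": {
      "parameterAssignments": [
        {"name": "learning_rate", "value": "0.05"},
        {"name": "batch_size", "value": "64"},
        {"name": "optimizer", "value": "adam"}
      ],
      "observation": {
        "metrics": [{"name": "accuracy", "value": 0.9}]
      }
    }
  }
}
"""


def best_trial(experiment: dict):
    """Return (parameters, metrics) of the current optimal trial.

    Both dictionaries are empty when the status has no optimal trial yet.
    """
    optimal = experiment.get("status", {}).get("currentOptimalTrial") or {}
    params = {
        p["name"]: p["value"]
        for p in optimal.get("parameterAssignments") or []
    }
    metrics = {
        m["name"]: m
        for m in (optimal.get("observation") or {}).get("metrics") or []
    }
    return params, metrics


params, metrics = best_trial(json.loads(SAMPLE_EXPERIMENT_JSON))
print(params)
print(metrics)
```

In the notebook, the input could come from capturing the command output, for example `raw = !kubectl get $EXPERIMENT -o json` followed by `json.loads("\n".join(raw))`.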


##### Delete Katib job runs to free up resources

In [63]:
! kubectl delete -f $TF_EXPERIMENT_FILE

experiment.kubeflow.org "airline" deleted


Check whether the pod is still up and running:

In [88]:
! kubectl -n admin logs -f airline

Error from server (NotFound): pods "airline" not found
