#### Create Experiments

In [38]:
TF_EXPERIMENT_FILE = "katibairline.yaml"

In [39]:
import re

from IPython.utils.capture import CapturedIO


def get_resource(captured_io: CapturedIO) -> str:
    """
    Gets a resource name from `kubectl apply -f <configuration.yaml>`.

    :param CapturedIO captured_io: Output captured by using the `%%capture` cell magic
    :return: Name of the Kubernetes resource
    :rtype: str
    :raises Exception: if the resource could not be created
    """
    out = captured_io.stdout
    matches = re.search(r"^(.+)\s+created", out)
    if matches is not None:
        return matches.group(1)
    else:
        raise Exception(f"Cannot get resource as its creation failed: {out}. It may already exist.")
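
To make the pattern concrete, here is a minimal sketch that applies a similar regular expression to a line in the format `kubectl apply` prints on success. The helper name `extract_created_resource` is hypothetical, and the `re.MULTILINE` flag is an addition that is assumed to be useful when a file creates more than one object; the sample string reuses the experiment name from this notebook.

```python
import re
from typing import Optional


def extract_created_resource(stdout: str) -> Optional[str]:
    """Return the first '<kind>/<name>' reported as created, or None."""
    # re.MULTILINE lets ^ and $ match at every line boundary, so the name is
    # still found when kubectl prints more than one line.
    match = re.search(r"^(\S+)\s+created\s*$", stdout, flags=re.MULTILINE)
    return match.group(1) if match else None


# Sample text in the format kubectl prints for a newly created object.
sample = "experiment.kubeflow.org/airlineearly1 created\n"
print(extract_created_resource(sample))

# An object that already exists is reported as "unchanged" or "configured",
# so there is nothing to extract and the helper returns None.
print(extract_created_resource("experiment.kubeflow.org/airlineearly1 unchanged\n"))
```

The extracted value has the form `<kind>/<name>`, which is why it can be passed directly to `kubectl describe` later on.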

#### For our experiment, we want to tune the learning rate, batch size, and optimizer. The following YAML file describes an Experiment object:

In [53]:
%%writefile $TF_EXPERIMENT_FILE
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: sooter
  name: airlineearly1
spec:
  parallelTrialCount: 3
  maxTrialCount: 50
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  earlyStopping:
    algorithmName: medianstop
    algorithmSettings:
      - name: min_trials_required
        value: "3"
      - name: start_step
        value: "5"
  metricsCollectorSpec:
    kind: StdOut
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "50"
        max: "200"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - adam
          - sgd
  trialTemplate:
    retain: true
    primaryContainerName: tensorflow
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: learning_rate
      - name: batchSize
        description: Batch Size
        reference: batch_size
      - name: optimizer
        description: Training model optimizer (sgd, adam)
        reference: optimizer
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: tensorflow
                    image: mavencodevv/tfjob_airline:v.0.9
                    command:
                      - "python"
                      - "/tfjobairline.py"
                      - "--batch_size=${trialParameters.batchSize}"
                      - "--learning_rate=${trialParameters.learningRate}"
                      - "--optimizer=${trialParameters.optimizer}"


Overwriting katibairline.yaml
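
To get a feel for what the `random` algorithm does with this search space, the sketch below draws a few parameter assignments and prints the arguments each trial would receive. It assumes independent uniform sampling for every parameter, which may not match how Katib's suggestion service samples internally; `SEARCH_SPACE` and `sample_trial` are hypothetical names used only for illustration.

```python
import random

# Search space copied from the Experiment spec above.
SEARCH_SPACE = {
    "learning_rate": ("double", 0.0001, 0.1),
    "batch_size": ("int", 50, 200),
    "optimizer": ("categorical", ["adam", "sgd"]),
}


def sample_trial(rng: random.Random) -> dict:
    """Draw one parameter assignment, assuming uniform sampling per parameter."""
    assignment = {}
    for name, spec in SEARCH_SPACE.items():
        kind = spec[0]
        if kind == "double":
            assignment[name] = rng.uniform(spec[1], spec[2])
        elif kind == "int":
            assignment[name] = rng.randint(spec[1], spec[2])  # both ends inclusive
        else:
            assignment[name] = rng.choice(spec[1])
    return assignment


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for _ in range(3):  # parallelTrialCount is 3, so three trials start together
    params = sample_trial(rng)
    # Mirror the command-line arguments built in trialTemplate.
    print(
        f"--batch_size={params['batch_size']} "
        f"--learning_rate={params['learning_rate']:.5f} "
        f"--optimizer={params['optimizer']}"
    )
```

Each printed line corresponds to the `command` entries in `trialSpec` after the `${trialParameters...}` placeholders are substituted.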


#### Run and Monitor Experiments

To submit our experiment, we execute:

In [54]:
%%capture kubectl_output --no-stderr
! kubectl apply -f $TF_EXPERIMENT_FILE

#### The `%%capture` cell magic stores the output of the `kubectl` command in an object named `kubectl_output`. We can then pass it to the `get_resource` utility function defined earlier:

In [55]:
EXPERIMENT = get_resource(kubectl_output)

##### To see the status, we can then run:

In [56]:
! kubectl describe $EXPERIMENT

Name:         airlineearly1
Namespace:    sooter
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1beta1
Kind:         Experiment
Metadata:
  Creation Timestamp:  2021-07-19T18:03:35Z
  Finalizers:
    update-prometheus-metrics
  Generation:  1
  Managed Fields:
    API Version:  kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
      f:status:
        .:
        f:completionTime:
        f:conditions:
        f:currentOptimalTrial:
          .:
          f:bestTrialName:
          f:observation:
            .:
            f:metrics:
          f:parameterAssignments:
        f:startTime:
    Manager:      katib-controller
    Operation:    Update
    Time:         2021-07-19T18:03:35Z
    API Version:  kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:al

##### To get the list of created experiments, use the following command:

In [57]:
! kubectl get experiments

NAME                  TYPE        STATUS   AGE
airline1              Succeeded   True     5h53m
airline2-end-to-end   Running     True     74m
airline3-end-to-end   Running     True     6m44s
airlineearly1         Created     True     6s
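
When the list grows, it can help to summarize it in Python. The following is a minimal sketch that parses the whitespace-aligned table above and tallies experiments by the value in the `TYPE` column; `parse_rows` is a hypothetical helper, and the table text is copied from the output of this cell.

```python
from collections import Counter

# Output of `kubectl get experiments`, copied from the cell above.
TABLE = """\
NAME                  TYPE        STATUS   AGE
airline1              Succeeded   True     5h53m
airline2-end-to-end   Running     True     74m
airline3-end-to-end   Running     True     6m44s
airlineearly1         Created     True     6s
"""


def parse_rows(table: str) -> list:
    """Split whitespace-aligned kubectl output into one dict per row."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        cells = line.split()
        if len(cells) == len(header):  # skip truncated or malformed rows
            rows.append(dict(zip(header, cells)))
    return rows


rows = parse_rows(TABLE)
# Number of experiments per TYPE value.
print(Counter(row["TYPE"] for row in rows))
# Names of experiments that have not succeeded yet.
unfinished = [row["NAME"] for row in rows if row["TYPE"] != "Succeeded"]
print(unfinished)
```

In the notebook, the same function could be fed the `stdout` of a cell captured with `%%capture` instead of the hard-coded `TABLE` string.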


##### To get the list of created trials, use the following command:

In [58]:
! kubectl get trials

NAME                           TYPE        STATUS   AGE
airline1-44c7fzhg              Succeeded   True     5h47m
airline1-5frqpdkw              Succeeded   True     5h14m
airline1-5xb6npd4              Succeeded   True     5h22m
airline1-6m5qnqrm              Succeeded   True     5h6m
airline1-7hm5hbn7              Succeeded   True     4h58m
airline1-7s9zjzz6              Succeeded   True     5h52m
airline1-7z2ncblx              Succeeded   True     5h21m
airline1-8d85jh8m              Succeeded   True     4h58m
airline1-8p8dlw5j              Succeeded   True     5h23m
airline1-8t4pzj4m              Succeeded   True     5h36m
airline1-9tbrwskt              Succeeded   True     5h6m
airline1-9v9prk2v              Succeeded   True     5h51m
airline1-b2xk89bs              Succeeded   True     5h14m
airline1-d6z989nh              Succeeded   True     5h52m
airline1-dhx22qg9              Succeeded   True     5h42m
airline1-h2sfsb7g              Succeeded   True     5h31m
airline1-hwm2p99f 

##### After the experiment is completed, use describe to get the best trial results:

In [63]:
! kubectl describe $EXPERIMENT

Name:         airlineearly1
Namespace:    sooter
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1beta1
Kind:         Experiment
Metadata:
  Creation Timestamp:  2021-07-19T18:03:35Z
  Finalizers:
    update-prometheus-metrics
  Generation:  1
  Managed Fields:
    API Version:  kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:algorithm:
          .:
          f:algorithmName:
        f:earlyStopping:
          .:
          f:algorithmName:
          f:algorithmSettings:
        f:maxFailedTrialCount:
        f:maxTrialCount:
        f:metricsCollectorSpec:
          .:
          f:kind:
        f:objective:
          .:
          f:goal:
          f:objectiveMetricName:
          f:type:
        f:parallelTrialCount:
        f:parameters:
        f:trialTemplate:
          .:
          f:primaryContain
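
The `describe` output is long, so it can be easier to read the best trial from the JSON form of the object (for example, from `kubectl get $EXPERIMENT -o json` captured with `%%capture`). The sketch below assumes the status layout suggested by the managed fields shown earlier (`currentOptimalTrial` with `bestTrialName`, `parameterAssignments`, and `observation.metrics`); the helper name `best_trial_summary` is hypothetical, and every value in the example document is an invented placeholder, not a real result.

```python
import json


def best_trial_summary(experiment_json: str) -> dict:
    """Pull the best trial name, parameters, and metrics out of an Experiment."""
    status = json.loads(experiment_json).get("status", {})
    optimal = status.get("currentOptimalTrial", {})
    return {
        "trial": optimal.get("bestTrialName"),
        "parameters": {
            p["name"]: p["value"] for p in optimal.get("parameterAssignments", [])
        },
        # Metric entries are passed through unchanged, since their exact
        # shape is an assumption here.
        "metrics": optimal.get("observation", {}).get("metrics", []),
    }


# Placeholder document: the field names follow the managed fields listed above,
# but every value is invented for illustration and is not a real result.
example = json.dumps({
    "status": {
        "currentOptimalTrial": {
            "bestTrialName": "example-trial",
            "parameterAssignments": [
                {"name": "learning_rate", "value": "0.01"},
                {"name": "batch_size", "value": "100"},
                {"name": "optimizer", "value": "adam"},
            ],
            "observation": {"metrics": [{"name": "accuracy", "latest": "0.0"}]},
        }
    }
})
print(best_trial_summary(example))
```

Missing keys fall back to empty values, so the function also works on an experiment that has not produced an optimal trial yet.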

##### Delete Katib job runs to free up resources

In [64]:
! kubectl delete -f $TF_EXPERIMENT_FILE

experiment.kubeflow.org "airlineearly1" deleted


Check whether the pod is still up and running:

In [88]:
! kubectl -n sooter logs -f airline

Error from server (NotFound): pods "airline" not found
