#### Create Experiments

In [15]:
TF_EXPERIMENT_FILE = "katibheartdisease-tfjob-experiment.yaml"

In [16]:
import re

from IPython.utils.capture import CapturedIO


def get_resource(captured_io: CapturedIO) -> str:
    """
    Gets a resource name from `kubectl apply -f <configuration.yaml>`.

    :param CapturedIO captured_io: Output captured by using `%%capture` cell magic
    :return: Name of the Kubernetes resource
    :rtype: str
    :raises Exception: if the resource could not be created
    """
    out = captured_io.stdout
    matches = re.search(r"^(.+)\s+created", out)
    if matches is not None:
        return matches.group(1)
    else:
        raise Exception(f"Cannot get resource as its creation failed: {out}. It may already exist.")
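
As a quick check of the helper's pattern, the sketch below applies the same regular expression to a line of the form `kubectl apply` prints when it creates a resource. The `parse_created` name and the sample string are illustrative assumptions, not output captured from this notebook.

```python
import re


def parse_created(stdout: str) -> str:
    """Hypothetical plain-text stand-in for get_resource (takes a str, not CapturedIO)."""
    # Same pattern as get_resource: capture everything before the word "created".
    match = re.search(r"^(.+)\s+created", stdout)
    if match is None:
        raise ValueError(f"No created resource found in: {stdout!r}")
    return match.group(1)


# Assumed sample of what `kubectl apply -f` prints on first creation.
sample = "experiment.kubeflow.org/heart created\n"
resource = parse_created(sample)
print(resource)
```

When the resource already exists, `kubectl apply` reports `unchanged` or `configured` instead of `created`, so the pattern does not match and the helper raises. That is the case the error message in `get_resource` hints at.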

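#### For our experiment, we tune the learning rate, batch size, and optimizer. The following YAML file describes an Experiment object that runs random search over these parameters, with at most 12 trials (3 in parallel), to maximize accuracy toward a goal of 0.8: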

In [17]:
%%writefile $TF_EXPERIMENT_FILE
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: ekemini
  name: heart
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.8
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    kind: StdOut
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "50"
        max: "100"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list:
          - rmsprop
          - adam
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: learning_rate
      - name: batchSize
        description: Batch Size
        reference: batch_size
      - name: optimizer
        description: Training model optimizer (rmsprop, adam)
        reference: optimizer
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: tensorflow
                    image: mavencodevv/tfjob_heart:v.0.1
                    command:
                      - "python"
                      - "/tfjobheart.py"
                      - "--batch_size=${trialParameters.batchSize}"
                      - "--learning_rate=${trialParameters.learningRate}"
                      - "--optimizer=${trialParameters.optimizer}"

Overwriting katibheartdisease-tfjob-experiment.yaml
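
The contents of `/tfjobheart.py` are not shown in this notebook, so the following is only a minimal sketch of how a trial script could consume the three flags that the trial template passes in, and how it could report the objective. It assumes the `StdOut` metrics collector uses its default `name=value` line format. The function names, the default values, and the placeholder accuracy are all hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the trialSpec command passes to the container."""
    parser = argparse.ArgumentParser(description="Hypothetical trial entry point")
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--learning_rate", type=float, default=0.05)
    parser.add_argument("--optimizer", choices=["rmsprop", "adam"], default="adam")
    return parser.parse_args(argv)


def report_metric(name: str, value: float) -> str:
    """Print a metric as `name=value` so a stdout-based collector can pick it up."""
    line = f"{name}={value:.4f}"
    print(line)
    return line


# Simulate the arguments Katib would substitute for one trial.
args = parse_args(["--batch_size=64", "--learning_rate=0.05", "--optimizer=adam"])

# Placeholder only: a real script would train and evaluate the model here.
accuracy = 0.0
report_metric("accuracy", accuracy)
```

The metric name printed by the script has to match `objectiveMetricName` in the Experiment (`accuracy` here); otherwise the collector has nothing to associate with the objective.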


#### Run and Monitor Experiments

To submit our experiment, we execute:

In [18]:
%%capture kubectl_output --no-stderr
! kubectl apply -f $TF_EXPERIMENT_FILE

#### The `%%capture` cell magic stores the output of the `kubectl` command in an object named `kubectl_output`. We can then pass it to the utility function defined earlier:

In [19]:
EXPERIMENT = get_resource(kubectl_output)

##### To see the status, we can then run:

In [21]:
! kubectl describe $EXPERIMENT

Name:         heart
Namespace:    ekemini
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1beta1
Kind:         Experiment
Metadata:
  Creation Timestamp:  2021-04-13T08:12:57Z
  Finalizers:
    update-prometheus-metrics
  Generation:  2
  Managed Fields:
    API Version:  kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:algorithm:
          .:
          f:algorithmName:
        f:maxFailedTrialCount:
        f:maxTrialCount:
        f:metricsCollectorSpec:
        f:objective:
          .:
          f:goal:
          f:objectiveMetricName:
          f:type:
        f:parallelTrialCount:
        f:parameters:
        f:trialTemplate:
          .:
          f:primaryContainerName:
          f:trialParameters:
          f:trialSpec:
            .:
            f:apiVersion:
            f:kind:
            f

##### To get the list of created experiments, use the following command:

In [23]:
! kubectl get experiments

NAME    TYPE        STATUS   AGE
heart   Succeeded   True     48s


##### To get the list of created trials, use the following command:

In [24]:
! kubectl get trials

NAME             TYPE        STATUS   AGE
heart-4cs6hjx5   Succeeded   True     33s
heart-sc6rhzv7   Succeeded   True     33s
heart-tlgzln5p   Succeeded   True     33s
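
To work with this listing programmatically, for example to count finished trials, the table can be split into rows. The sketch below is one simple way to do that. It assumes no cell is empty or contains spaces, which holds for the output shown above but not for every `kubectl get` table; `parse_kubectl_table` is a hypothetical helper name.

```python
from typing import Dict, List


def parse_kubectl_table(text: str) -> List[Dict[str, str]]:
    """Turn whitespace-aligned `kubectl get` output into a list of row dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    headers = lines[0].split()
    # Assumes every row has one whitespace-free value per header.
    return [dict(zip(headers, line.split())) for line in lines[1:]]


# The trial listing shown above, pasted as text.
listing = """
NAME             TYPE        STATUS   AGE
heart-4cs6hjx5   Succeeded   True     33s
heart-sc6rhzv7   Succeeded   True     33s
heart-tlgzln5p   Succeeded   True     33s
"""

rows = parse_kubectl_table(listing)
succeeded = [r["NAME"] for r in rows if r["TYPE"] == "Succeeded" and r["STATUS"] == "True"]
print(len(succeeded), "trials succeeded")
```

For anything beyond a quick check, requesting structured output with `kubectl get trials -o json` is likely more robust than parsing the table.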


##### After the experiment has completed, use `kubectl describe` to get the best trial results:

In [25]:
! kubectl describe $EXPERIMENT

Name:         heart
Namespace:    ekemini
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1beta1
Kind:         Experiment
Metadata:
  Creation Timestamp:  2021-04-13T08:12:57Z
  Finalizers:
    update-prometheus-metrics
  Generation:  2
  Managed Fields:
    API Version:  kubeflow.org/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
      f:spec:
        .:
        f:algorithm:
          .:
          f:algorithmName:
        f:maxFailedTrialCount:
        f:maxTrialCount:
        f:metricsCollectorSpec:
        f:objective:
          .:
          f:goal:
          f:objectiveMetricName:
          f:type:
        f:parallelTrialCount:
        f:parameters:
        f:trialTemplate:
          .:
          f:primaryContainerName:
          f:trialParameters:
          f:trialSpec:
            .:
            f:apiVersion:
            f:kind:
            f
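
The `describe` output is long, and the best trial appears only near the end, in the status section. As an alternative, the sketch below reads the Experiment as JSON and extracts just the best trial. It assumes the status follows the Katib `v1beta1` layout, with a `currentOptimalTrial` entry holding `bestTrialName`, `parameterAssignments`, and `observation.metrics`; verify these field names against your cluster. Both function names are hypothetical.

```python
import json
import subprocess
from typing import Any, Dict


def summarize_best_trial(experiment: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the best trial's name, parameters, and latest metrics from an Experiment dict."""
    # Assumed layout: status.currentOptimalTrial (Katib v1beta1).
    optimal = experiment.get("status", {}).get("currentOptimalTrial", {})
    params = {p["name"]: p["value"] for p in optimal.get("parameterAssignments", [])}
    metrics = {
        m["name"]: m.get("latest")
        for m in optimal.get("observation", {}).get("metrics", [])
    }
    return {
        "trial": optimal.get("bestTrialName"),
        "parameters": params,
        "metrics": metrics,
    }


def fetch_experiment(name: str, namespace: str) -> Dict[str, Any]:
    """Fetch an Experiment as a dict via `kubectl get ... -o json`."""
    result = subprocess.run(
        ["kubectl", "get", "experiment", name, "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Usage against a live cluster (not executed here):
# print(summarize_best_trial(fetch_experiment("heart", "ekemini")))
```

Keeping the parsing separate from the `kubectl` call means `summarize_best_trial` can be checked against a saved JSON dump without cluster access.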

##### Delete Katib job runs to free up resources

In [26]:
#! kubectl delete -f $TF_EXPERIMENT_FILE

Check whether the pod is still up and running:

In [11]:
#! kubectl -n ekemini logs -f heartdisease