## About

> Role of training operators in Kubeflow

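- Training operators in Kubeflow manage and orchestrate the training of ML models on a Kubernetes cluster.

- They act as specialized components that handle the training process: given a job description such as a TFJob, the operator launches and supervises the training pods, which makes it easier to run training efficiently and at scale.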

[Reference Video](https://www.youtube.com/watch?v=fMXFbREG7Yg)

> Let's explore an example using Katib and TFJob

In [None]:
# Step 1: Define a complex TensorFlow model and training script
# Assume you have a file called train_advanced.py

# train_advanced.py
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
from sklearn.model_selection import train_test_split

# Load and preprocess your training data
(train_images, train_labels), _ = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1)).astype('float32') / 255
train_labels = tf.keras.utils.to_categorical(train_labels)

# Split the data into training and validation sets
train_images, val_images, train_labels, val_labels = train_test_split(
    train_images, train_labels, test_size=0.1, random_state=42
)

# Define a convolutional neural network
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(train_images, train_labels, epochs=5, validation_data=(val_images, val_labels))

# Save the trained model
model.save('/output/model.h5')
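
The script above hard-codes its settings, while the TFJob in Step 3 passes `--learning_rate`, `--batch_size`, and `--dropout_rate` on the command line. Below is a minimal sketch of how `train_advanced.py` might read those flags and report a metric. The helper names `parse_hyperparameters` and `format_metric` and the default values are hypothetical, and the sketch assumes Katib's metrics collector reads `name=value` lines from standard output.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse the flags that the TFJob command in Step 3 passes to the script."""
    parser = argparse.ArgumentParser(description="MNIST training hyperparameters")
    # Defaults are illustrative assumptions, not values from the original script.
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--dropout_rate", type=float, default=0.2)
    return parser.parse_args(argv)


def format_metric(name, value):
    """Render one metric as a `name=value` line (assumed collector format)."""
    return f"{name}={value:.4f}"


# Example with an explicit argument list; inside the real script you would
# call parse_hyperparameters() with no arguments to read sys.argv.
args = parse_hyperparameters(
    ["--learning_rate=0.05", "--batch_size=64", "--dropout_rate=0.3"]
)

# In train_advanced.py these values could feed the existing Keras calls, e.g.:
#   optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate)
#   model.add(layers.Dropout(args.dropout_rate))  # before the final Dense layer
#   model.fit(..., batch_size=args.batch_size)
print(args.learning_rate, args.batch_size, args.dropout_rate)
print(format_metric("accuracy", 0.9871))
```

Printing the validation accuracy and loss in this form after training would give the experiment in Step 2 something to read for its `accuracy` objective and `loss` additional metric.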


In [None]:
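# Step 2: Create a Katib experiment YAML file for hyperparameter tuning
# Assume you have a file called katib-experiment.yaml
# Note: a complete Experiment also needs a trialTemplate section; the TFJob in
# Step 3 uses ${trialParameters.*} placeholders and can serve as its trialSpec.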

# katib-experiment.yaml
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  name: mnist-tuning
spec:
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parallelTrialCount: 3
  algorithm:
    algorithmName: random
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy
    additionalMetricNames:
      - loss
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "32"
        max: "128"
    - name: dropout_rate
      parameterType: double
      feasibleSpace:
        min: "0.2"
        max: "0.5"
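
To make the search space concrete, here is a small sketch of what a `random` search over these parameters amounts to: each trial draws one value per parameter from its `feasibleSpace`. This is only an illustration of the idea in plain Python, not Katib's implementation; `sample_trial` and `sample_trials` are hypothetical helper names, and uniform sampling is an assumption.

```python
import random

# Search space copied from katib-experiment.yaml (feasibleSpace min/max).
SEARCH_SPACE = {
    "learning_rate": ("double", 0.01, 0.1),
    "batch_size": ("int", 32, 128),
    "dropout_rate": ("double", 0.2, 0.5),
}
MAX_TRIAL_COUNT = 12  # maxTrialCount in the experiment


def sample_trial(rng):
    """Draw one hyperparameter assignment uniformly from the search space."""
    trial = {}
    for name, (kind, low, high) in SEARCH_SPACE.items():
        if kind == "int":
            trial[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            trial[name] = rng.uniform(low, high)
    return trial


def sample_trials(count=MAX_TRIAL_COUNT, seed=0):
    """Return `count` independent random assignments, reproducible via `seed`."""
    rng = random.Random(seed)
    return [sample_trial(rng) for _ in range(count)]


# Show a few sampled assignments.
for trial in sample_trials(count=3):
    print(trial)
```

With `parallelTrialCount: 3`, up to three such assignments would be trained at the same time, and the experiment stops once 12 trials have run, 3 have failed, or a trial reaches the `accuracy` goal of 0.99.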


In [None]:
# Step 3: Modify the TFJob YAML file to use hyperparameters
# Assume you have a file called tfjob_advanced.yaml

# tfjob_advanced.yaml
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "mnist-tfjob"
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: "your-tensorflow-image:latest"
            command:
              - "python"
              - "/mnt/train_advanced.py"
              - "--learning_rate=${trialParameters.learningRate}"
              - "--batch_size=${trialParameters.batchSize}"
              - "--dropout_rate=${trialParameters.dropoutRate}"
            volumeMounts:
              - name: data
                mountPath: "/mnt"
          volumes:
            - name: data
              persistentVolumeClaim:
                claimName: "your-data-pvc"
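
The `${trialParameters.*}` placeholders in the command only get values when the TFJob is used as the trial specification of the Katib experiment. The sketch below shows one way the link between the placeholder names (`learningRate`) and the experiment parameters (`learning_rate`) could be expressed. The field names `primaryContainerName`, `trialParameters`, `reference`, and `trialSpec` are given as I understand the Katib v1beta1 API and should be checked against the Katib documentation; the helper functions are hypothetical.

```python
import json
import re

# Command copied from tfjob_advanced.yaml.
TFJOB_COMMAND = [
    "python",
    "/mnt/train_advanced.py",
    "--learning_rate=${trialParameters.learningRate}",
    "--batch_size=${trialParameters.batchSize}",
    "--dropout_rate=${trialParameters.dropoutRate}",
]

# Placeholder name -> parameter name in katib-experiment.yaml.
PLACEHOLDER_TO_PARAMETER = {
    "learningRate": "learning_rate",
    "batchSize": "batch_size",
    "dropoutRate": "dropout_rate",
}


def placeholders(command):
    """Return the ${trialParameters.<name>} names used in a command, in order."""
    pattern = re.compile(r"\$\{trialParameters\.(\w+)\}")
    return [m.group(1) for arg in command for m in pattern.finditer(arg)]


def build_trial_template(tfjob_spec):
    """Assemble a trialTemplate mapping (field names assumed, see lead-in)."""
    return {
        "primaryContainerName": "tensorflow",  # container name from the TFJob
        "trialParameters": [
            {"name": name, "reference": PLACEHOLDER_TO_PARAMETER[name]}
            for name in placeholders(TFJOB_COMMAND)
        ],
        "trialSpec": tfjob_spec,
    }


# Stub standing in for the full contents of tfjob_advanced.yaml.
tfjob_stub = {"apiVersion": "kubeflow.org/v1", "kind": "TFJob"}
print(json.dumps(build_trial_template(tfjob_stub), indent=2))
```

The resulting mapping would sit under `spec.trialTemplate` in `katib-experiment.yaml`, so that each trial creates one TFJob with the sampled values substituted into its command.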


## Script Definitions

- The `train_advanced.py` script loads and preprocesses the MNIST data, then defines, trains, and saves a convolutional neural network for image classification.
- The `katib-experiment.yaml` file sets up a Katib experiment for hyperparameter tuning, specifying the search ranges for learning rate, batch size, and dropout rate.
- The `tfjob_advanced.yaml` file defines a TFJob whose training command receives the hyperparameters chosen by Katib through the `${trialParameters.*}` placeholders.