Building distributed machine learning systems is a practical necessity in the modern deep learning era. Every company that uses machine learning wants to serve its customers at scale. Models are getting huge, and the datasets required to train them keep growing; scaling laws suggest this trend will continue. On top of that, GPUs are expensive, so leaving them idle is costly. Multi-GPU training and optimized inference can save costs and improve user experience.
I'm working on this project to get a deeper understanding of distributed machine learning systems. I'm using Kubernetes, Kubeflow, TensorFlow, and Argo.
- Distributed Machine Learning System
Distributed systems are a group of nodes that talk to each other to achieve a specific task, such as streaming movies across devices, search engines, etc. - Understanding distributed systems
These systems handle massive amounts of data across multiple clusters, use automation tools, and benefit from hardware accelerations.
This repository includes code and references to implement a scalable and reliable machine learning system. I'm building all the components: data ingestion, training, serving, and monitoring of these workloads.
I'm building an image classification end-to-end system.
The steps involved are:
- Setup
- Data Ingestion
- Distributed Training
- Evaluation
- Serving
- End-to-End Workflow
I'm using a MacBook. These systems are generally built on the cloud, but I'm developing locally. I'm using conda as the package manager and Homebrew for installations.
[1] Let's install TensorFlow for the data processing, model building, and evaluation workflows.
conda install tensorflow tensorflow_datasets
[2] Docker is required to create single- or multi-node clusters. I'm installing Docker Desktop.
[3] Install kubectl, the popular CLI utility for interacting with Kubernetes clusters.
brew install kubectl
[4] To use Kubernetes on the local machine, install k3d, a lightweight wrapper to run k8s. There are minikube, kind, and other distributions as well, but I find k3d lean, memory-efficient, and simpler.
wget -q -O - https://raw.githubusercontent.com/rancher/k3d/main/install.sh | bash
Create a single-node cluster. You can create a multi-server cluster as well by specifying --servers 3.
k3d cluster create dist-ml --image rancher/k3s:v1.25.3-k3s1
You can see the cluster info using:
kubectl cluster-info
Let's see which nodes are created using:
kubectl get nodes
[5] Install kubectx to easily switch between clusters, and kubens for namespaces. These are very handy utilities.
brew install kubectx
Using kubens you can switch between namespaces easily.
[6] Next, we install the Kubeflow training operator, which allows us to train models in a distributed fashion on Kubernetes.
But first create a new namespace.
kubectl create namespace kubeflow
kubens kubeflow
Now install the operator.
kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.7.0"
[7] To create an end-to-end workflow, we need Argo Workflows.
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.5.8/install.yaml
[8] For experiment tracking, install MLflow.
helm install dist-mlflow community-charts/mlflow
There are multiple design patterns that can be used to create an ML system. In this project, I'm sticking to the simplest one. It has a data ingestion component: once data is available, you can schedule the pipeline to download the data and store it somewhere (e.g., S3). We then train multiple models on the same dataset in parallel. Once the models are available, we pick the best one and create a scalable inference service.

I'm using the MNIST dataset, which has 70,000 images (60,000 for training and 10,000 for testing). It has 10 different categories, and each example is a low-resolution 28x28 px grayscale image. (Switching to CIFAR-10 is on the roadmap below.)
The tf.data API enables you to build complex input pipelines from simple, reusable pieces. It's very efficient and handles large amounts of data, reading from different data formats, and performing complex transformations.
I'm loading the dataset into a tf.data.Dataset object and casting the images to float32. Next, I'm normalizing the image pixel values from the [0, 255] range to the [0, 1] range. These are standard practices. I'm keeping an in-memory cache to improve performance. Let's also shuffle the training data to add some randomness.
import tensorflow_datasets as tfds
import tensorflow as tf

def get_dataset():
    BUFFER_SIZE = 10000

    def scale(image, label):
        image = tf.cast(image, tf.float32)
        image /= 255
        return image, label

    datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
    train = datasets['train']
    return train.map(scale).cache().shuffle(BUFFER_SIZE)
I'm using the TensorFlow Datasets module to load the dataset. The above piece of code gives a shuffled dataset where each element consists of an image and a label.
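As a quick sanity check (a small snippet, not part of the repo's scripts), you can pull one element from the pipeline and inspect its shape and dtype:

ds = get_dataset()
image, label = next(iter(ds))
# Expect a (28, 28, 1) float32 image tensor and a scalar integer label
print(image.shape, image.dtype, label.numpy())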
To consume a large dataset (petabyte-scale and beyond), we need a distributed approach. We can do that with some tweaks to the same function we created.
For distributed data ingestion, increase the batch size so the extra computing power is used effectively.
Tip
Use the largest batch size that fits in GPU memory.
There are several strategies built into the TensorFlow library. MirroredStrategy() can be used to train on a single machine with multiple GPUs, but if you want to distribute training across multiple machines in a cluster (recommended, and my goal), MultiWorkerMirroredStrategy() is the way to go.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
num_replicas_in_sync equals the number of devices that participate in the all-reduce operation. With the tf.distribute.MultiWorkerMirroredStrategy API, a Keras model that was designed to run on a single worker can seamlessly work on multiple workers with minimal code changes.
- Each GPU performs the forward pass on a different slice of the input data and computes the loss
- Next, each GPU computes the gradients based on that loss
- These gradients are then aggregated across all of the devices (using an all-reduce algorithm)
- The optimizer updates the weights using the reduced gradients, thereby keeping the devices in sync
I'm enabling automatic data sharding across workers by setting tf.data.experimental.AutoShardPolicy to AutoShardPolicy.DATA. This setting is needed to ensure convergence and performance. Sharding means handing each worker a subset of the entire dataset.
Now the final training workflow can be written as:
with strategy.scope():
    dataset = get_dataset().batch(BATCH_SIZE).repeat()
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
    dataset = dataset.with_options(options)
    model = build_and_compile_cnn_model()
    model.fit(dataset, epochs=5, steps_per_epoch=70)
A simple neural net with the Adam optimizer and SparseCategoricalCrossentropy loss, as we have 10 categories to predict from.
def build_and_compile_cnn_model():
    """
    Build and compile a simple cnn model
    """
    print("Training a simple cnn model")
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Input(shape=(28, 28, 1), name="image_bytes"))
    model.add(tf.keras.layers.Conv2D(32, (3, 3), activation="relu"))
    model.add(tf.keras.layers.MaxPooling2D((2, 2)))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(64, activation="relu"))
    model.add(tf.keras.layers.Dense(10, activation="softmax"))
    model.summary()

    model.compile(
        optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"]
    )

    return model
Let's define some necessary callbacks that will be executed during model training.
- Checkpointing saves model weights at some frequency (use save_freq). We use tf.keras.callbacks.ModelCheckpoint for checkpointing.
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
I define the checkpoint directory to store the checkpoints and the names of the checkpoint files. Checkpoints are important for restoring the weights if model training stops due to some issue.
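For example, if a run is interrupted, the latest weights can be restored before resuming. A minimal sketch, assuming the ModelCheckpoint callback below has written weights-only checkpoints into checkpoint_dir:

# Restore the most recent checkpoint, if one exists (hypothetical resume logic)
latest = tf.train.latest_checkpoint(checkpoint_dir)
if latest:
    model.load_weights(latest)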
- tf.keras.callbacks.TensorBoard writes a log for TensorBoard, which allows you to visualize the graphs.
- tf.keras.callbacks.LearningRateScheduler schedules the learning rate to change after, for example, every epoch/batch.
def decay(epoch):
    if epoch < 3:
        return 1e-3
    elif epoch >= 3 and epoch < 7:
        return 1e-4
    else:
        return 1e-5
- PrintLR prints the learning rate at the end of each epoch.
class PrintLR(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        print('\nLearning rate for epoch {} is {}'.format(epoch + 1, model.optimizer.lr.numpy()))
Now put all the components together.
callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
    tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix, save_weights_only=True),
    tf.keras.callbacks.LearningRateScheduler(decay),
    PrintLR()
]
Now every piece is in its correct place.
Next, train the model.
model = build_and_compile_cnn_model()
model.fit(dataset, epochs=3, steps_per_epoch=70, callbacks=callbacks)
I'm getting an accuracy of 94% on the training data. I'm not spending much time on increasing the accuracy, as that's not the end goal here.
Note
I'm doing these experiments in a Colab notebook. Later I will copy this code into Python scripts.
I've already discussed the strategy to use here. Just a recap: for distributed training on multiple workers, use MultiWorkerMirroredStrategy with Keras (TensorFlow backend).
There are different ways to do distributed training, and data parallelism is the most common one. There are two common approaches to data-parallel training:
- Synchronous training, where the training steps are synced across the workers and replicas. All workers train over different slices of the input data in sync and aggregate gradients at each step.
- Asynchronous training, where the training steps are not strictly synced. All workers train independently over the input data and update variables asynchronously (see parameter server training).
I'm using MultiWorkerMirroredStrategy, which does synchronous distributed training across multiple workers, each with potentially multiple GPUs. It replicates all variables and computations to each local device and uses a distributed collective implementation (e.g., all-reduce) so that multiple workers can work together.
Let's initiate the distributed input data pipeline and the model inside the strategy scope. But hold on: we need to save these models somewhere so that they can be fetched for inference later.
To save the model we can use model.save, but since multiple workers are training in parallel, the saving destination needs to be different for each worker.
One approach is:
- For worker nodes, save the model to a temporary directory
- For the master node, save the model to the provided directory
The temporary directories of the workers need to be unique to prevent errors. The model saved in the directories will be identical, and only the model saved by the master should be referenced for restoring or serving.
I'm not saving the model to temporary directories as doing this will waste my laptop's computing resources and memory. I'm determining which worker node is the master and saving its model only.
To determine whether the worker node is the master or not, use the TF_CONFIG environment variable.
An example configuration looks like below:
tf_config = {
    'cluster': {
        'worker': ['localhost:12345', 'localhost:23456']
    },
    'task': {'type': 'worker', 'index': 0}
}
The _is_worker_master function inspects the cluster spec and current task type and returns True if the worker is the master and False otherwise.
def _is_worker_master():
    return TASK_INDEX == 0

tf_config = json.loads(os.environ.get('TF_CONFIG') or '{}')
TASK_INDEX = tf_config['task']['index']

if _is_worker_master():
    model_path = args.saved_model_dir
else:
    model_path = args.saved_model_dir + '/worker_tmp_' + str(TASK_INDEX)

multi_worker_model.save(model_path)
To containerize the code, I have a Python script called distributed-training.py that contains all three models.
Let's create a Dockerfile
FROM python:3.9
RUN pip install tensorflow tensorflow_datasets
COPY distributed-training.py /
Next, build the Docker image.
docker build -f Dockerfile -t kubeflow/distributed-training-strategy:v0.1 .
I need to import the image into the k3d cluster, as it cannot access the local image registry.
k3d image import kubeflow/distributed-training-strategy:v0.1 --cluster dist-ml
When training a model inside pods, once the computation completes (or fails), the files in the pod are recycled/deleted by the garbage collector. That means all the model checkpoints are lost, and we no longer have a model for serving.
To avoid this, we have to use a PersistentVolume and a PersistentVolumeClaim.
A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned. It is a resource in the cluster, just like a node is a cluster resource. PVs are volume plugins like volumes, but have a lifecycle independent of any individual Pod that uses the PV, which means the storage persists even when the pods are deleted.
A PersistentVolumeClaim (PVC) is a request for storage by a user. Pods consume node resources and PVCs consume PV resources. Pods can request specific levels of resources (CPU and Memory). Claims can request specific size and access modes (e.g., they can be mounted ReadWriteOnce, ReadOnlyMany, or ReadWriteMany).
Next, create a PVC to submit a request for storage that will be used in the worker pods to store the trained model. I'm requesting 1GB of storage with ReadWriteOnce access mode.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: strategy-volume
spec:
  accessModes: [ "ReadWriteOnce" ]
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
kubectl apply -f pvc.yaml
Next, define a TFJob spec, which runs the distributed model training.
Note
There's also the concept of deployments. The main difference between deployments and jobs is how they handle a pod that is terminated: a deployment is intended to be a "service" that should always be up and running, so it will try to restart the pods it's managing to match the desired number of replicas, while a job is intended to execute and successfully terminate.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: training
spec:
  runPolicy:
    cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/distributed-training-strategy:v0.1
              imagePullPolicy: IfNotPresent
              command: ["python", "/distributed-training.py", "--saved_model_dir", "/trained_model/saved_model_versions/2/", "--checkpoint_dir", "/trained_model/checkpoint", "--model_type", "cnn"]
              volumeMounts:
                - mountPath: /trained_model
                  name: training
              resources:
                limits:
                  cpu: 500m
          volumes:
            - name: training
              persistentVolumeClaim:
                claimName: strategy-volume
You can pass saved_model_dir and checkpoint_dir to the container as arguments.
The volumes field specifies the persistent volume claim, and the volumeMounts field specifies what folder to mount the files into. The CleanPodPolicy in the TFJob spec controls the deletion of pods when a job terminates. The restartPolicy determines whether pods will be restarted when they exit.
Submit the TFJob to our cluster and start our distributed model training.
kubectl apply -f tfjob.yaml
I can see two pods running our distributed training, as we've specified 2 workers:
- training-worker-0
- training-worker-1
Let's see the logs from the pod training-worker-0
kubectl logs training-worker-0
#TODO
While training the model, I'm storing it in the /saved_model_versions/1/ path.
Note
We can edit/update the code and resubmit the job. Just delete the running job, rebuild the Docker image, import it, and resubmit the job. These are the steps to repeat every time we change the code.
kubectl delete tfjob --all
docker build -f Dockerfile -t kubeflow/distributed-training-strategy:v0.1 .
k3d image import kubeflow/distributed-training-strategy:v0.1 --cluster dist-ml
kubectl apply -f tfjob.yaml
Voilà! Model training is done.
Next, evaluate the model's performance.
kubectl apply -f predict-service.yaml
Finally, I have a trained model stored in the file path trained_model/saved_model_versions/2/.
To see the evals, you can exec into the running predict-service container.
kubectl exec --stdin --tty predict-service -- /bin/bash
#TODO
Execute predict-service.py, which takes the trained model and evaluates it on the test dataset.
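predict-service.py isn't reproduced in full here; a minimal sketch of what it does, assuming the same scale preprocessing and the model path mentioned above (the repo script may differ):

import tensorflow as tf
import tensorflow_datasets as tfds

def scale(image, label):
    # Cast to float32 and normalize to [0, 1], matching the training pipeline
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label

# Load the trained model from the shared persistent volume
model = tf.keras.models.load_model("trained_model/saved_model_versions/2")

# Evaluate on the MNIST test split
datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
test_ds = datasets['test'].map(scale).batch(64)
loss, accuracy = model.evaluate(test_ds)
print("test loss:", loss, "test accuracy:", accuracy)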
I'm training multiple models; let's pick the best one and use it for inference.
First, I have a CNN with batch normalization layers.
from tensorflow.keras import layers, models

def build_and_compile_cnn_model_with_batch_norm():
    print("Training CNN model with batch normalization")
    model = models.Sequential()
    model.add(layers.Input(shape=(28, 28, 1), name='image_bytes'))
    model.add(layers.Conv2D(32, (3, 3), activation='relu'))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('sigmoid'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.BatchNormalization())
    model.add(layers.Activation('sigmoid'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(10, activation='softmax'))
    model.summary()

    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    return model
Let's create one more CNN with dropout.
def build_and_compile_cnn_model_with_dropout():
    print("Training CNN model with dropout")
    model = models.Sequential()
    model.add(layers.Input(shape=(28, 28, 1), name='image_bytes'))
    model.add(layers.Conv2D(32, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.5))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(10, activation='softmax'))
    model.summary()

    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    return model
Let's train these models by submitting three different TFJobs with the arguments --model_type and --saved_model_dir.
kubectl apply -f tfjob.yaml
Next, evaluate all the models' performance. The model with the highest accuracy score is moved to a different folder and then used for serving.
best_model_path = ""
best_accuracy = 0
for i in range(3):
model_path = "trained_models/saved_model_versions/" + str(i)
model = tf.keras.models.load_model(model_path)
datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
test = datasets['test']
ds = mnist_test.map(scale).cache().shuffle(BUFFER_SIZE).batch(64)
loss, accuracy = model.evaluate(ds)
if accuracy > best_accuracy:
best_accuracy = accuracy
best_model_path = model_path
dest = "trained_model/saved_model_versions/3"
shutil.copytree(best_model_path, dest)
You can run model-selection.py in a pod:
apiVersion: v1
kind: Pod
metadata:
  name: model-selection
spec:
  containers:
    - name: predict
      image: kubeflow/distributed-training-strategy:v0.1
      command: ["python", "/model-selection.py"]
      volumeMounts:
        - name: model
          mountPath: /trained_model
  volumes:
    - name: model
      persistentVolumeClaim:
        claimName: strategy-volume
#TODO
I've implemented the distributed training and model selection components. Next, I'm creating a model serving component. This component takes the trained model from trained_model/saved_model_versions/3.
The inference service should be highly performant and robust. I'm not considering cost at this moment.
model_path = "trained_models/saved_model_versions/3"
model = tf.keras.models.load_model(model_path)
datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
mnist_test = datasets['test']
ds = mnist_test.map(scale).cache().shuffle(BUFFER_SIZE).batch(64)
loss, accuracy = model.predict(ds)
I'm using TensorFlow Serving (TFServing) to expose the model as an endpoint service.
# Environment variable with the path to the model
os.environ["MODEL_PATH"] = f"{model_path}"

nohup tensorflow_model_server \
  --port=8500 \
  --rest_api_port=8501 \
  --model_name=model \
  --model_base_path=$MODEL_PATH
nohup, short for "no hang-up", is a command on Linux systems that keeps processes running even after exiting the shell or terminal.
The method mentioned above works great if we're only experimenting locally. There are more efficient ways for distributed model serving.
TensorFlow models contain a signature definition that defines the signature of a computation supported in a TensorFlow graph. SignatureDefs aim to provide generic support for identifying the inputs and outputs of a function. We can modify the input layer with a preprocessing function so that clients can send base64-encoded images, which is a standard way of sending images through RESTful APIs. To do that, we'll save the model with new serving signatures. The new signatures use Python functions to handle preprocessing the image from a JPEG to a tensor.
def _preprocess(bytes_inputs):
    decoded = tf.io.decode_jpeg(bytes_inputs, channels=1)
    resized = tf.image.resize(decoded, size=(28, 28))
    return tf.cast(resized, dtype=tf.uint8)

def _get_serve_image_fn(model):
    @tf.function(input_signature=[tf.TensorSpec([None], dtype=tf.string, name='image_bytes')])
    def serve_image_fn(bytes_inputs):
        decoded_images = tf.map_fn(_preprocess, bytes_inputs, dtype=tf.uint8)
        return model(decoded_images)
    return serve_image_fn

signatures = {
    "serving_default": _get_serve_image_fn(model).get_concrete_function(
        tf.TensorSpec(shape=[None], dtype=tf.string, name='image_bytes')
    )
}

tf.saved_model.save(multi_worker_model, model_path, signatures=signatures)
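Since the new serving signature accepts base64-encoded JPEG bytes under the image_bytes input, a request payload like the mnist-input.json used later can be constructed along these lines (a sketch; sample.jpg is a placeholder filename, and the repo may generate this file differently):

import base64
import json

# Read a sample digit image and base64-encode it
with open("sample.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

# TFServing/KServe v1 REST format: a list of instances keyed by the input tensor name
payload = {"instances": [{"image_bytes": {"b64": encoded}}]}

with open("mnist-input.json", "w") as f:
    json.dump(payload, f)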
Now that we have updated the training script, we should rebuild the image and re-train the model.
Next, we will use KServe for the inference service. KServe enables serverless inferencing on Kubernetes and provides performant, high-abstraction interfaces for common machine learning (ML) frameworks like TensorFlow, PyTorch, etc.
We create an InferenceService YAML, which specifies the framework (tensorflow) and a storageUri that points to the saved TensorFlow model.
apiVersion: "serving.kserve.io/v1beta1"
kind: InferenceService
metadata:
name: tf-mnist
spec:
predictor:
model:
modelFormat:
name: tensorflow
storageUri: "pvc://strategy-volume/saved_model_versions"
Install KServe.
curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.11/hack/quick_install.sh" | bash
Next, apply the inference-service.yaml to create the InferenceService. By default, it exposes an HTTP/REST endpoint.
kubectl apply -f inference-service.yaml
Wait for the InferenceService to be in a ready state.
kubectl get isvc tf-mnist
Next, we run the prediction. But first, we need to determine and set the INGRESS_HOST and INGRESS_PORT. An ingress gateway is like an API gateway that load-balances requests. To test it locally, we have to port-forward the ingress gateway.
INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system --selector="app=istio-ingressgateway" --output jsonpath='{.items[0].metadata.name}')
kubectl port-forward --namespace istio-system svc/${INGRESS_GATEWAY_SERVICE} 8080:80
Then do the following in a different terminal window.
export INGRESS_HOST=localhost
export INGRESS_PORT=8080
We can send a sample request to our inference service using curl.
MODEL_NAME=tf-mnist
INPUT_PATH=@./mnist-input.json
SERVICE_HOSTNAME=$(kubectl get inferenceservice $MODEL_NAME -n kubeflow -o jsonpath='{.status.url}' | cut -d "/" -f 3)
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d $INPUT_PATH
Or we can use the requests library.
input_path = "mnist-input.json"
with open(input_path) as json_file:
data = json.load(json_file)
response = requests.post(
url="http://localhost:8080/v1/models/tf-mnist:predict",
data=json.dumps(data),
headers={"Host": "tf-mnist.kubeflow.example.com"},
)
print(response.text)
#TODO
Our inference service is working as expected.
Next, I want multiple model servers to handle large amounts of traffic. KServe can autoscale based on the number of requests. The autoscaler can scale down to zero if the application is receiving no traffic, or we can specify a minimum number of replicas to keep. The autoscaling.knative.dev/target annotation sets a soft limit. Other specs can be configured as well, such as minReplicas, containerConcurrency, and scaleMetric.
apiVersion: "serving.kserve.io/v1beta1"
kind: InferenceService
metadata:
name: tf-mnist
annotations:
autoscaling.knative.dev/target: "1"
spec:
predictor:
model:
modelFormat:
name: tensorflow
storageUri: "pvc://strategy-volume/saved_model_versions"
Next, I install hey, a tiny program that sends some load to a web application. hey runs a provided number of requests at the provided concurrency level and prints stats.
# https://github.com/rakyll/hey
brew install hey
kubectl apply -f inference-service.yaml
hey -z 30s -c 5 -m POST -host ${SERVICE_HOSTNAME} -D mnist-input.json "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/tf-mnist:predict"
#TODO
I'm sending traffic for 30 seconds with 5 concurrent requests. As the scaling target is set to 1 and we load the service with 5 concurrent requests, the autoscaler tries scaling up to 5 pods. There will be a cold-start time initially to spawn the pods, and it may take longer (to pull the Docker image) if the image is not cached on the node.
It's time to connect all the parts. I'm using Argo Workflows to orchestrate the jobs we executed before in an end-to-end fashion. We can build a CI/CD-style workflow as a DAG (similar to GitLab CI/CD) on Kubernetes. Argo is the de facto engine for workflow orchestration on Kubernetes.
We already installed Argo Workflows in the argo namespace during setup; if you skipped that step, install it now.
kubectl create namespace argo
kubectl apply -n argo -f https://github.com/argoproj/argo-workflows/releases/download/v3.4.11/install.yaml
I'm creating an end-to-end workflow with 4 steps:
- Data Ingestion
- Distributed Training
- Model Selection
- Model Serving
apiVersion: argoproj.io/v1alpha1
kind: Workflow              # new type of k8s spec
metadata:
  generateName: tfjob-wf-   # name of the workflow spec
spec:
  entrypoint: tfjob-wf      # invoke the tfjob template
  templates:
    - name: tfjob-wf
      steps:
        - - name: data-ingestion-step
            template: data-ingestion-step
        - - name: distributed-tf-training-steps
            template: distributed-tf-training-steps
        - - name: model-selection-step
            template: model-selection-step
        - - name: create-model-serving-service
            template: create-model-serving-service
  podGC:
    strategy: OnPodSuccess
  volumes:
    - name: model
      persistentVolumeClaim:
        claimName: strategy-volume
This is a multi-step workflow where all the steps are executed sequentially (that's what the double dash means). PodGC describes how to delete completed pods; deleting completed pods frees resources. I'm also using persistent storage to store the dataset and the trained models.
The first step is data ingestion. We add a memoize spec to cache the output of this step. Memoization reduces cost and execution time. Since we do not want to download the data every time, we cache it using a ConfigMap. We have to specify the key and the name of the ConfigMap cache. I have also set maxAge to 1h, which defines how long the cache should be considered valid.
- name: data-ingestion-step
  serviceAccountName: argo
  memoize:
    cache:
      configMap:
        name: data-ingestion-config
        key: "data-ingestion-cache"
    maxAge: "1h"
  container:
    image: kubeflow/distributed-training-strategy:v0.1
    imagePullPolicy: IfNotPresent
    command: ["python", "/data-ingestion.py"]
Next, we execute the model training steps in parallel.
- name: distributed-tf-training-steps
  steps:
    - - name: cnn-model
        template: cnn-model
      - name: cnn-model-with-dropout
        template: cnn-model-with-dropout
      - name: cnn-model-with-batch-norm
        template: cnn-model-with-batch-norm
Next, we create a step to run distributed training with the CNN model. To create the TFJob, we include the manifest we created before. We also add a successCondition and a failureCondition to indicate whether the job has succeeded or failed. Here we store the trained model in a different folder. We create similar steps for the other two models.
- name: cnn-model
  serviceAccountName: training-operator
  resource:
    action: create
    setOwnerReference: true
    successCondition: status.replicaStatuses.Worker.succeeded = 2
    failureCondition: status.replicaStatuses.Worker.failed > 0
    manifest: |
      apiVersion: kubeflow.org/v1
      kind: TFJob
      metadata:
        generateName: training-
      spec:
        runPolicy:
          cleanPodPolicy: None
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: Never
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: kubeflow/distributed-training-strategy:v0.1
                    imagePullPolicy: IfNotPresent
                    command: ["python", "/distributed-training.py", "--saved_model_dir", "/trained_model/saved_model_versions/1/", "--checkpoint_dir", "/trained_model/checkpoint", "--model_type", "cnn"]
                    volumeMounts:
                      - mountPath: /trained_model
                        name: training
                    resources:
                      limits:
                        cpu: 500m
                volumes:
                  - name: training
                    persistentVolumeClaim:
                      claimName: strategy-volume
Next, we add the model selection step. It is similar to the model-selection.yaml we created earlier.
- name: model-selection-step
  serviceAccountName: argo
  container:
    image: kubeflow/distributed-training-strategy:v0.1
    imagePullPolicy: IfNotPresent
    command: ["python", "/model-selection.py"]
    volumeMounts:
      - name: model
        mountPath: /trained_model
The last step of the workflow is the model serving.
- name: create-model-serving-service
  serviceAccountName: training-operator
  resource:
    action: create
    setOwnerReference: true
    successCondition: status.modelStatus.states.transitionStatus = UpToDate
    manifest: |
      apiVersion: "serving.kserve.io/v1beta1"
      kind: InferenceService
      metadata:
        name: tf-mnist
      spec:
        predictor:
          model:
            modelFormat:
              name: tensorflow
            storageUri: "pvc://strategy-volume/saved_model_versions"
Next, run the workflow.
kubectl create -f workflow.yaml
Logging is an essential component of a machine learning system. It helps debug issues, analyze performance, troubleshoot errors, gather insights, and implement a feedback loop. Fortunately, KServe makes it easy to log inference traffic to a simple service, here called message-dumper. It logs the request and the response, along with a unique identifier that ties them together.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: message-dumper
spec:
  template:
    spec:
      containers:
        - image: gcr.io/knative-releases/knative.dev/eventing-contrib/cmd/event_display
kubectl apply -f message-dumper.yaml
Next, we include a logger in the InferenceService predictor that points to the message-dumper URL.
logger:
  mode: all
  url: http://message-dumper.default/
You can read more about the inference logger in the KServe docs.
- A distributed machine learning system is designed to train machine learning models on large datasets that cannot be processed on a single machine. Distributing the computation is necessary to train complex models with millions or even billions of parameters.
- Kubernetes is a popular choice for building such complex distributed systems. We can build scalable and highly available systems using K8s.
- TensorFlow provides a number of strategies for distributed training. We have used MultiWorkerMirroredStrategy here.
- We have used KServe to build an inference service that can be autoscaled based on traffic.
- Argo Workflows are helpful for building CI/CD pipelines on Kubernetes.
- add a new dataset cifar10
- update tf code to pytorch
- add mlflow to track experiments
- monitoring the metrics with prometheus and grafana
- deploy the app to aws eks
- add ci using github actions
- add gitops for cd