This repository has been archived by the owner on Jan 10, 2023. It is now read-only.

TensorFlow model training on ABM using Kubeflow TFJob #1

Closed
puneith opened this issue May 13, 2021 · 0 comments · Fixed by #2

puneith commented May 13, 2021

Train a TensorFlow ML model on Anthos Bare Metal (ABM). Create a basic Hello World training app using TFJob.

TFJob is a Kubernetes custom resource that you can use to run TensorFlow training jobs on Kubernetes. The Kubeflow implementation of TFJob is in tf-operator.
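To make the shape of a TFJob resource concrete, here is a minimal sketch that builds a manifest as a plain Python dict and prints it as JSON. It assumes the `kubeflow.org/v1` API version and the usual `tfReplicaSpecs` layout with a container named `tensorflow`. The job name, the image `example.com/mnist-train:latest`, and the helper `make_tfjob` are hypothetical placeholders, not part of this repository.

```python
import json


def make_tfjob(name, image, workers=2):
    """Return a TFJob manifest as a plain dict (illustrative sketch only)."""

    def replica(count):
        # One replica spec: a pod template with a single training container.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {
                    "containers": [
                        # Assumption: the training container is named "tensorflow".
                        {"name": "tensorflow", "image": image}
                    ]
                }
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",  # assumed API version
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "Chief": replica(1),
                "Worker": replica(workers),
            }
        },
    }


# Hypothetical job name and image, used only for illustration.
manifest = make_tfjob("mnist-hello-world", "example.com/mnist-train:latest")
print(json.dumps(manifest, indent=2))
```

Since `kubectl` accepts JSON as well as YAML, the printed output could in principle be piped to `kubectl apply -f -` on a cluster where the TFJob custom resource is installed.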

The model is a simple MNIST model that uses a persistent volume. Once we have a running example on ABM, we'll need to create an end-to-end ML workflow on ABM and serve the model as well.

This example also provides an exploration into Kubeflow on ABM and why such a workflow makes sense. Even though we can run TensorFlow directly on Kubernetes (as shown in the TF Serving example), the TFJob abstraction makes it easy to define TensorFlow deployments.

The standard way of performing multi-node training in TensorFlow is to use the TF_CONFIG environment variable. This environment variable is a JSON string that provides cluster and task information, and it is set for each binary running on the cluster. Each node in the cluster has one of the roles worker, ps, or chief. In multi-worker training, there is usually one worker that takes on a little more responsibility, such as saving checkpoints and writing summary files for TensorBoard, in addition to what a regular worker does. Such a worker is referred to as the chief worker, and it is customary to appoint the worker with index 0 as the chief.
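As a rough illustration of the TF_CONFIG structure described above, the following sketch parses the variable and decides whether the current task should act as chief. The helper names `read_task` and `is_chief` and the host names are hypothetical. The chief rule follows the convention stated in this issue (an explicit chief task, or worker 0 when the cluster defines no chief), not any particular TensorFlow API.

```python
import json
import os


def read_task(environ=os.environ):
    """Parse TF_CONFIG and return (cluster, task_type, task_index)."""
    tf_config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    return cluster, task.get("type"), task.get("index")


def is_chief(cluster, task_type, task_index):
    """Chief is an explicit 'chief' task, or worker 0 if no chief is listed."""
    if task_type == "chief":
        return True
    return task_type == "worker" and task_index == 0 and "chief" not in cluster


# Hypothetical two-worker, one-ps cluster as seen from worker 0.
example_env = {
    "TF_CONFIG": json.dumps(
        {
            "cluster": {
                "worker": ["worker-0.example:2222", "worker-1.example:2222"],
                "ps": ["ps-0.example:2222"],
            },
            "task": {"type": "worker", "index": 0},
        }
    )
}

cluster, task_type, task_index = read_task(example_env)
print(task_type, task_index, is_chief(cluster, task_type, task_index))
```

In a real training script the same check could gate checkpoint and TensorBoard summary writing so that only one process performs them.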

Outside Kubeflow, setting the TF_CONFIG environment variable can be a manual process; this is the part that the TFJob controller manages automatically.
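To show what the manual process involves, here is a minimal sketch that generates one TF_CONFIG string per replica from a role-to-count mapping. It is not the tf-operator implementation. The helper `tf_configs_for`, the `example.internal` host names, and port 2222 are assumptions made for illustration.

```python
import json


def tf_configs_for(replicas, port=2222):
    """Build a TF_CONFIG JSON string for every replica.

    `replicas` maps a role name to a replica count, e.g. {"worker": 2, "ps": 1}.
    Host names are hypothetical placeholders.
    """
    # Every replica sees the same cluster description.
    cluster = {
        role: [f"{role}-{i}.example.internal:{port}" for i in range(count)]
        for role, count in replicas.items()
    }
    configs = {}
    for role, count in replicas.items():
        for i in range(count):
            # Only the "task" entry differs from one replica to the next.
            configs[f"{role}-{i}"] = json.dumps(
                {"cluster": cluster, "task": {"type": role, "index": i}}
            )
    return configs


configs = tf_configs_for({"worker": 2, "ps": 1})
for replica_name, value in configs.items():
    print(replica_name, value)
```

Each value would have to be exported as TF_CONFIG in the matching process before training starts, and regenerated whenever the replica counts or addresses change, which is the bookkeeping that TFJob takes over.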

@puneith puneith self-assigned this May 13, 2021
@puneith puneith added the enhancement New feature or request label May 13, 2021
@puneith puneith linked a pull request May 14, 2021 that will close this issue