Distributed deep learning training with Horovod and FfDL

You can use Uber's Horovod for distributed deep learning training with FfDL. Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch. It makes inter-GPU communication efficient via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training. Horovod runs distributed model training over MPI, a low-level interface for high-performance parallel computing.

Horovod TensorFlow example

  1. Deploy FfDL on your Kubernetes Cluster.

  2. In the main FfDL repository, run the following commands to obtain the object storage endpoint from your cluster.

node_ip=$PUBLIC_IP
s3_port=$(kubectl get service s3 -o jsonpath='{.spec.ports[0].nodePort}')
s3_url=http://$node_ip:$s3_port
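
If `PUBLIC_IP` is unset, or the `s3` service does not exist yet, `s3_url` silently ends up as something like `http://:` and the later commands fail with unrelated-looking errors. A minimal sketch of a guard follows; `build_url` is a hypothetical helper name, not part of FfDL, and the address and port in the example call are placeholders.

```shell
# build_url is a hypothetical helper, not part of FfDL.
# It prints http://HOST:PORT, or fails if either part is empty
# or the port contains anything other than digits.
build_url() (
  host=$1
  port=$2
  if [ -z "$host" ] || [ -z "$port" ]; then
    echo "build_url: host and port must both be non-empty" >&2
    exit 1
  fi
  case $port in
    *[!0-9]*)
      echo "build_url: port '$port' is not numeric" >&2
      exit 1
      ;;
  esac
  printf 'http://%s:%s\n' "$host" "$port"
)

# Example call with placeholder values (192.0.2.10 is a documentation-only address).
build_url 192.0.2.10 30080
```

With this in place, the last command of the step could be written as `s3_url=$(build_url "$node_ip" "$s3_port")`, which fails visibly instead of producing a malformed URL.
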
  3. Next, set up the default object storage access ID and key. Then create buckets for the training data and the trained models.
export AWS_ACCESS_KEY_ID=test; export AWS_SECRET_ACCESS_KEY=test; export AWS_DEFAULT_REGION=us-east-1;

s3cmd="aws --endpoint-url=$s3_url s3"
$s3cmd mb s3://tf_training_data
$s3cmd mb s3://tf_trained_model
  4. Now, create a temporary directory, download the MNIST image and label files used to train the TensorFlow model, and upload them to your tf_training_data bucket.
mkdir -p tmp
for file in t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz;
do
  test -e tmp/$file || wget -q -O tmp/$file http://yann.lecun.com/exdb/mnist/$file
  $s3cmd cp tmp/$file s3://tf_training_data/$file
done
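
Because of the `test -e` check, the loop skips any file that already exists in `tmp`, so an interrupted or failed download that left an empty or truncated file behind is not retried and would be uploaded as-is. A minimal sketch of an integrity check follows, assuming `gzip` is installed; `valid_gz` is a hypothetical helper name, not part of FfDL, and the demonstration uses throwaway files rather than the real data set.

```shell
# valid_gz is a hypothetical helper, not part of FfDL.
# It succeeds only if the file exists, is non-empty, and passes gzip's integrity test.
valid_gz() {
  [ -s "$1" ] && gzip -t "$1" 2>/dev/null
}

# Demonstration on throwaway files in a temporary directory.
demo_dir=$(mktemp -d)
printf 'sample' | gzip > "$demo_dir/good.gz"   # a valid gzip file
: > "$demo_dir/empty.gz"                       # an empty file, as a failed download may leave

for f in good.gz empty.gz; do
  if valid_gz "$demo_dir/$f"; then
    echo "$f: ok"
  else
    echo "$f: invalid, delete it and download again"
  fi
done
rm -rf "$demo_dir"
```

In the loop above, a call such as `valid_gz tmp/$file || rm -f tmp/$file` before the `test -e` line would force a fresh download of any damaged file.
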
  5. You should now have all the necessary training data in your object storage. Next, set up your restapi endpoint and default credentials for Deep Learning as a Service. Once you have done that, you can start running jobs using the FfDL CLI (executable binary).
restapi_port=$(kubectl get service ffdl-restapi -o jsonpath='{.spec.ports[0].nodePort}')
export DLAAS_URL=http://$node_ip:$restapi_port; export DLAAS_USERNAME=test-user; export DLAAS_PASSWORD=test;

Replace the default object storage path with your s3_url. You can skip this step if you have already modified the object storage path with your s3_url.

if [ "$(uname)" = "Darwin" ]; then
  sed -i '' s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/horovod/manifest_tfmnist.yml
else
  sed -i s/s3.default.svc.cluster.local/$node_ip:$s3_port/ etc/examples/horovod/manifest_tfmnist.yml
fi
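
The two branches exist only because BSD `sed` on macOS and GNU `sed` on Linux take different arguments for `-i`. Note also that the unescaped dots in the pattern match any character, not just a literal dot. A sketch of one portable alternative that writes through a temporary file follows; `replace_host` is a hypothetical helper name, not part of FfDL, it assumes the replacement contains no `/` or `&`, and the sample line in the demonstration is illustrative rather than copied from the manifest.

```shell
# replace_host is a hypothetical helper, not part of FfDL.
# It replaces the first occurrence per line of the literal host OLD with NEW in FILE,
# writing through a temporary file so that no "sed -i" variant is needed.
# Assumption: NEW contains no "/" or "&" characters.
replace_host() (
  old=$1
  new=$2
  file=$3
  # Escape dots so that "." in OLD matches only a literal dot.
  pattern=$(printf '%s' "$old" | sed 's/\./\\./g')
  tmp_file=$(mktemp) || exit 1
  # Overwrite FILE in place (via cat) so its permissions are preserved.
  sed "s/$pattern/$new/" "$file" > "$tmp_file" && cat "$tmp_file" > "$file"
  status=$?
  rm -f "$tmp_file"
  exit "$status"
)

# Demonstration on a throwaway file with an illustrative line and placeholder address.
demo_file=$(mktemp)
echo 'endpoint: http://s3.default.svc.cluster.local' > "$demo_file"
replace_host s3.default.svc.cluster.local 192.0.2.10:30080 "$demo_file"
cat "$demo_file"   # prints: endpoint: http://192.0.2.10:30080
rm -f "$demo_file"
```

Applied to this example, the call would be `replace_host s3.default.svc.cluster.local "$node_ip:$s3_port" etc/examples/horovod/manifest_tfmnist.yml` on either platform.
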

Obtain the correct CLI for your machine and run the training job with our default Horovod model.

CLI_CMD=$(pwd)/cli/bin/ffdl-$(if [ "$(uname)" = "Darwin" ]; then echo 'osx'; else echo 'linux'; fi)
$CLI_CMD train etc/examples/horovod/manifest_tfmnist.yml etc/examples/horovod
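
The `CLI_CMD` path is relative to the current directory, so running the commands from anywhere other than the main FfDL repository produces a "not found" error at the `train` step. A small sketch of a pre-flight check follows; `ffdl_cli_name` is a hypothetical helper name, and the mapping simply mirrors the `uname` test used above.

```shell
# ffdl_cli_name is a hypothetical helper, not part of FfDL.
# It maps a kernel name, as printed by uname, to the CLI binary name used in this README.
ffdl_cli_name() {
  case $1 in
    Darwin) echo ffdl-osx ;;
    *)      echo ffdl-linux ;;
  esac
}

# Build the same path as CLI_CMD and check that an executable is actually there.
cli_cmd=$(pwd)/cli/bin/$(ffdl_cli_name "$(uname)")
if [ -x "$cli_cmd" ]; then
  echo "using $cli_cmd"
else
  echo "no executable found at $cli_cmd; run this from the main FfDL repository" >&2
fi
```
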

Congratulations, you have submitted your first Horovod TensorFlow job on FfDL. You can check its status either from the FfDL UI or by running $CLI_CMD list.

Troubleshooting

  • On a Kubeadm-DIND cluster, some users have issues with inter-node pod communication. We therefore suggest using a real Kubernetes cluster, or only one worker node if you are testing on Kubeadm-DIND (e.g. run export NUM_NODES=1 before provisioning your cluster).