
Train Deep Learning Models using GPUs

Prerequisites

  • You need a Kubernetes cluster configured to use GPUs. This guide has so far been tested only on clusters that expose GPUs through the device plugin. If your cluster uses Accelerators instead, set lcm.device_plugin=false when deploying the FfDL helm chart (e.g. helm install --set lcm.device_plugin=false .).

  • FfDL must be running on your cluster.

  • TensorFlow, Caffe, PyTorch, and Horovod are the frameworks currently tested with GPUs.

Instructions

TensorFlow example

To run the TensorFlow job on a GPU, open the tf-model manifest file and make the following changes:

  • Change the framework version to one of the TensorFlow versions that support GPUs.
  • Set the gpus section to a value greater than 0 so that the learner is allocated GPU resources for the training job.
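
As a rough illustration of the two changes above, a manifest fragment might look like the sketch below. Only the gpus section and the framework version come from this guide. The nesting under framework, the name field, and the version string are assumptions for illustration, and the version shown is a placeholder rather than a confirmed FfDL tag. Refer to etc/examples/tf-model/gpu-manifest.yml for the actual values.

```yaml
# Sketch of the GPU-related fields only; all other manifest fields are omitted.
# Layout and field names other than "gpus" and "version" are assumptions.
gpus: 1                      # any value greater than 0 requests GPU resources for the learner
framework:
  name: tensorflow           # assumed field name
  version: "<gpu-enabled-tensorflow-version>"   # placeholder, not a real tag
```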

The file etc/examples/tf-model/gpu-manifest.yml is an example manifest for running the TensorFlow example on a GPU. Once you have made the changes above, follow the same testing instructions in the main README to run the sample TensorFlow job on a GPU.

Caffe example

To run the Caffe job on a GPU, open the caffe-model manifest file and make the following changes:

  • Change the framework version from cpu to gpu.
  • Set the gpus section to a value greater than 0 so that the learner is allocated GPU resources for the training job.
  • Add the Caffe GPU flag to the command section (e.g. change the command from caffe train -solver lenet_solver.prototxt to caffe train -gpu all -solver lenet_solver.prototxt).
  • Lastly, open the lenet_solver.prototxt file and change solver_mode to GPU so that Caffe runs on the GPU.
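
Taken together, the manifest changes above could be sketched as the fragment below. The gpus value, the gpu version, and the command string come from the steps in this guide. Placing command under framework and the name field are assumptions for illustration. The solver_mode change is made separately in lenet_solver.prototxt and is not part of the manifest. Refer to etc/examples/caffe-model/gpu-manifest.yml for the actual layout.

```yaml
# Sketch of the GPU-related fields only; all other manifest fields are omitted.
# The nesting of "command" under "framework" and the "name" field are assumptions.
gpus: 1                      # any value greater than 0 requests GPU resources for the learner
framework:
  name: caffe                # assumed field name
  version: "gpu"             # changed from "cpu"
  command: caffe train -gpu all -solver lenet_solver.prototxt   # -gpu all added
```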

The file etc/examples/caffe-model/gpu-manifest.yml is an example manifest for running the Caffe example on a GPU. Once you have made the changes above, follow the same testing instructions in the main README to run the sample Caffe job on a GPU.

See the user guide to learn more about modifying the model manifest file and running GPU jobs with your own settings. Note that, to run a job on a GPU, the manifest must select a framework version that supports GPUs and set the gpus section to a value greater than 0.
