Update DeepSpeed Training example launcher (#2103)
* change training example

* fix workflow

* fix syntax

* update readmes

* edit readme

* remove env variables
cassieesvelt committed Mar 9, 2023
1 parent 23806f8 commit 8968381
Showing 8 changed files with 100 additions and 129 deletions.
@@ -46,6 +46,5 @@ jobs:
      run: |
        source "${{ github.workspace }}/infra/sdk_helpers.sh";
        source "${{ github.workspace }}/infra/init_environment.sh";
        bash -x generate-yml.sh
        bash -x ../../../run-job.sh job.yml
      working-directory: cli/jobs/deepspeed/deepspeed-training
69 changes: 59 additions & 10 deletions cli/jobs/deepspeed/deepspeed-autotuning/README.md
@@ -1,10 +1,59 @@
### DeepSpeed autotuning with Azure Machine Learning
## Overview
The DeepSpeed autotuner generates an optimal configuration file (``ds_config.json``) that can be used in a DeepSpeed training job.
## How to Run
1. Create a compute that can run the job. Tesla V100 or A100 GPUs are strongly recommended; the example may not work correctly without them. In the ``generate-yml.sh`` file, set the compute to the name of your compute.
2. Generate the autotuning job yaml file with the following command:<br />
```bash generate-yml.sh```
3. Start the job with the following command:<br />
```az ml job create --file job.yml```
4. The optimal configuration file ``ds_config_optimal.json`` can be found at ``outputs/autotuning_results/exps`` under the ``outputs + logs`` tab of the completed run.
## Using DeepSpeed Autotuning to generate an optimal DeepSpeed configuration file

DeepSpeed Autotuning is a feature that finds the configuration maximizing a model's training speed and memory efficiency for a given hardware configuration. This gives users the best possible performance without having to spend time manually tweaking hyperparameters.

To use DeepSpeed Autotuning, we need a DeepSpeed config file to start from.

```json
{
  "train_batch_size": 64,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0002,
      "betas": [
        0.5,
        0.999
      ],
      "eps": 1e-8
    }
  },
  "fp16": {
    "enabled": true
  },
  "autotuning": {
    "enabled": true,
    "fast": false,
    "results_dir": "outputs/autotuning_results/results",
    "exps_dir": "outputs/autotuning_results/exps"
  },
  "steps_per_print": 10
}
```
This ``ds_config.json`` file has some typical configurations, but autotuning will find the ideal configuration for the provided resources. Notice that the file includes an ``autotuning`` section, where we can configure the autotuning job and set an output directory. In the ``train.py`` file, DeepSpeed is initialized with this configuration and trains a simple neural network model.
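
For reference, the initialization pattern inside a training script looks roughly like the following. This is a minimal sketch, not the example's actual ``train.py``: the model is a stand-in, and only the DeepSpeed-related plumbing is shown.

```python
import argparse

import deepspeed
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--with_aml_log", default=True)        # flag used by this example's command
parser.add_argument("--local_rank", type=int, default=-1)  # set by the DeepSpeed launcher
parser = deepspeed.add_config_arguments(parser)            # adds --deepspeed and --deepspeed_config
args = parser.parse_args()

model = torch.nn.Sequential(                               # stand-in for the simple network
    torch.nn.Flatten(), torch.nn.Linear(32 * 32 * 3, 10)
)

# DeepSpeed reads the JSON config passed via --deepspeed_config and builds the
# optimizer and fp16 handling from it, returning a wrapped "engine" to train with.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters()
)
```

Because ``deepspeed.initialize`` builds the Adam optimizer and fp16 handling from the JSON config, the training script itself stays free of those details.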

## Using DeepSpeed Autotuning with AzureML

Using DeepSpeed Autotuning with AzureML requires that all nodes can communicate with each other via SSH. To set this up, two scripts are needed to start the job.

First is the ``generate-yml.sh`` script. Typically, an AzureML job is started from a yml file; instead of including one in this example, this script generates it. This is done for security reasons: a unique SSH key is generated per job for passwordless login. After the SSH key is generated, it is added as an environment variable in the job so that each node can access it later.

Next is the ``start-deepspeed`` script. It does not need to be modified, and it has three purposes:
- Add the generated SSH key to all nodes.
- Generate a hostfile for DeepSpeed Autotuning. This file lists the nodes available to the job and the number of GPUs each has; a sample is shown after this list. (The number of GPUs used can be changed via the ``num_gpus_per_node`` parameter in ``generate-yml.sh``.)
- Start the DeepSpeed launcher with the arguments provided in ``generate-yml.sh``.

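For reference, the generated hostfile follows DeepSpeed's standard format: one line per node, naming the host and its GPU slot count. The hostnames below are illustrative placeholders; the script fills in the job's real node addresses at runtime.

```
node-0 slots=8
node-1 slots=8
```
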
### Setup
#### Environment
The environment for this job is provided in the ``docker-context`` directory. There is no setup needed to run the job; however, if you want to set up the environment separately for later use (to avoid rebuilding it every time the job runs), follow [this guide](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-in-studio).
#### Compute Instance
V100 GPUs (ND40rs) are recommended for this job. This example was originally run on two ND40rs nodes with 8 V100 GPUs each. Make sure to edit the ``generate-yml.sh`` file with the name of the compute you want to use for this job.
- Inside ``generate-yml.sh``, edit the ``num_gpus_per_node`` variable at the top of the file to specify how many GPUs each compute node has, and the ``instance_count`` variable to specify how many nodes to use, as shown below.
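
For two 8-GPU nodes, matching the original run of this example, these settings would look roughly like the following sketch (the exact surrounding script contents may differ):

```bash
num_gpus_per_node=8  # GPUs available on each compute node
instance_count=2     # number of compute nodes to use
```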
### Running the Job
1. Inside the ``generate-yml.sh`` file, uncomment the last line of the file. This allows the job to be run with only the command in step 2, since the job is started immediately once the ``job.yml`` file has been generated.
2. To start a DeepSpeed Autotuning job, run the following command in the command line while inside this directory:
```bash
bash generate-yml.sh
```

When the job completes, the optimal DeepSpeed configuration can be found at ``outputs/autotuning_results/results/ds_config_optimal.json``.
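
Instead of browsing the studio UI, the run's outputs can also be pulled locally with the CLI. A minimal sketch, assuming the AzureML CLI v2 is installed (``<job-name>`` is a placeholder for the completed run's name):

```bash
az ml job download --name <job-name>
```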
2 changes: 1 addition & 1 deletion cli/jobs/deepspeed/deepspeed-autotuning/generate-yml.sh
@@ -36,4 +36,4 @@ distribution:
resources:
  instance_count: 2
EOF
# az ml job create --file deepspeed-autotune-aml.yaml
# az ml job create --file job.yml
24 changes: 14 additions & 10 deletions cli/jobs/deepspeed/deepspeed-training/README.md
@@ -1,10 +1,14 @@
### DeepSpeed training with Azure Machine Learning
## Overview
Train a model using DeepSpeed.
## How to Run
1. Create a compute that can run the job. Tesla V100 or A100 GPUs are strongly recommended; the example may not work correctly without them. In the ``generate-yml.sh`` file, set the compute to the name of your compute.
2. Generate the training job yaml file with the following command:<br />
```bash generate-yml.sh```
3. Start the job with the following command:<br />
```az ml job create --file job.yml```
4. This example provides a basic ``ds_config.json`` file to configure DeepSpeed. For a more optimal configuration, run the deepspeed-autotuning example first to generate a new ds_config file to replace this one.
## DeepSpeed training with Azure Machine Learning
This example showcases how to use DeepSpeed with an AzureML training job.
### Setup
#### Environment
The environment for this job is provided in the ``docker-context`` directory. There is no setup needed to run the job; however, if you want to set up the environment separately for later use (to avoid rebuilding it every time the job runs), follow [this guide](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-in-studio).
#### Compute Instance
V100 GPUs (ND40rs) are recommended for this job. This example was originally run on two ND40rs nodes with 8 V100 GPUs each. Make sure to edit the ``job.yml`` file with the name of the compute you want to use for this job.
- Inside the ``job.yml`` file, the ``instance_count`` and ``process_count_per_instance`` values will also need to be edited to specify how many compute nodes to use and how many GPUs each node has.
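
Together these two values set the DeepSpeed world size: with an ``instance_count`` of 2 and a ``process_count_per_instance`` of 8, the job launches 16 processes. DeepSpeed requires that ``train_batch_size`` equal ``train_micro_batch_size_per_gpu`` × ``gradient_accumulation_steps`` × world size, so a ``train_batch_size`` of 64 (as in the autotuning example's config above) with no gradient accumulation implies a micro-batch of 4 per GPU per step.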
### Running the Job
To start a DeepSpeed training job, run the following command in the command line while inside this directory:
```bash
az ml job create --file job.yml
```
> Using with Autotuning: If the deepspeed-autotuning example is run first, you can use the optimal configuration file it generates (``ds_config_optimal.json``) as the ``ds_config.json`` for this example. This allows for better resource utilization.
39 changes: 0 additions & 39 deletions cli/jobs/deepspeed/deepspeed-training/generate-yml.sh

This file was deleted.

26 changes: 25 additions & 1 deletion cli/jobs/deepspeed/deepspeed-training/job.yml
@@ -1 +1,25 @@
# This is a temporary file until generate-yml.sh is run to generate the job.yml file that will be used for this example.
# Training job submission via AML CLI v2

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

command: python train.py --with_aml_log=True --deepspeed --deepspeed_config ds_config.json

experiment_name: DistributedJob-DeepsSpeed-Training-cifar
display_name: deepspeed-training-example
code: src
environment:
  build:
    path: docker-context
limits:
  timeout: 900
outputs:
  output:
    type: uri_folder
    mode: rw_mount
    path: azureml://datastores/workspaceblobstore/paths/outputs/training_results
compute: azureml:gpu-v100-cluster
distribution:
  type: pytorch
  process_count_per_instance: 8
resources:
  instance_count: 2
66 changes: 0 additions & 66 deletions cli/jobs/deepspeed/deepspeed-training/src/start-deepspeed.sh

This file was deleted.

2 changes: 1 addition & 1 deletion cli/readme.py
@@ -459,7 +459,7 @@ def write_job_workflow(job):
pip install azure-identity
bash \"{GITHUB_WORKSPACE}/sdk/python/setup.sh\"
python prepare_data.py --subscription $SUBSCRIPTION_ID --group $RESOURCE_GROUP_NAME --workspace $WORKSPACE_NAME\n"""
elif "deepspeed" in job:
elif "autotuning" in job:
workflow_yaml += f""" bash -x generate-yml.sh\n"""
# workflow_yaml += f""" bash -x {os.path.relpath(".", project_dir)}/run-job.sh generate-yml.yml\n"""
workflow_yaml += f""" bash -x {os.path.relpath(".", project_dir).replace(os.sep, "/")}/run-job.sh {filename}.yml
