Update DeepSpeed Training example launcher (#2103)
* change training example

* fix workflow

* fix syntax

* update readmes

* edit readme

* remove env variables
cassieesvelt committed Mar 9, 2023
1 parent 23806f8 commit 8968381
Showing 8 changed files with 100 additions and 129 deletions.
@@ -46,6 +46,5 @@ jobs:
      run: |
        source "${{ github.workspace }}/infra/sdk_helpers.sh";
        source "${{ github.workspace }}/infra/init_environment.sh";
        bash -x generate-yml.sh
        bash -x ../../../run-job.sh job.yml
      working-directory: cli/jobs/deepspeed/deepspeed-training
69 changes: 59 additions & 10 deletions cli/jobs/deepspeed/deepspeed-autotuning/README.md
@@ -1,10 +1,59 @@
### DeepSpeed autotuning with Azure Machine Learning
## Overview
The DeepSpeed autotuner generates an optimal configuration file (``ds_config.json``) that can be used in a DeepSpeed training job.
## How to Run
1. Create a compute that can run the job. Tesla V100 or A100 GPUs are strongly recommended; the example may not work correctly without them. In the ``generate-yml.sh`` file, set the compute to the name of your compute.
2. Generate the autotuning job yaml file with the following command:<br />
```bash generate-yml.sh```
3. Start the job with the following command:<br />
```az ml job create --file job.yml```
4. The optimal configuration file ``ds_config_optimal.json`` can be found at ``outputs/autotuning_results/exps`` under the ``outputs + logs`` tab of the completed run.
## Using DeepSpeed Autotuning to generate an optimal DeepSpeed configuration file

DeepSpeed Autotuning is a feature that finds the configuration maximizing a model's training speed and memory efficiency for a given hardware configuration. This gives users the best possible performance without having to spend time manually tweaking hyperparameters.

To use DeepSpeed Autotuning, we need a DeepSpeed config file to start from.

```json
{
  "train_batch_size": 64,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0002,
      "betas": [
        0.5,
        0.999
      ],
      "eps": 1e-8
    }
  },
  "fp16": {
    "enabled": true
  },
  "autotuning": {
    "enabled": true,
    "fast": false,
    "results_dir": "outputs/autotuning_results/results",
    "exps_dir": "outputs/autotuning_results/exps"
  },
  "steps_per_print": 10
}
```
This ``ds_config.json`` file has some typical configurations, but autotuning will find the ideal configuration for the provided resources. Notice that the file includes an ``autotuning`` section, where we can configure the autotuning job and set an output directory. In the ``train.py`` file, DeepSpeed is initialized with this configuration and trains a simple neural network model.
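
For reference, the initialization pattern inside a training script looks roughly like the following. This is a minimal sketch, not the example's actual ``train.py``: the model is a stand-in, and only the DeepSpeed-related plumbing is shown.

```python
import argparse

import deepspeed
import torch

parser = argparse.ArgumentParser()
parser.add_argument("--with_aml_log", default=True)        # flag used by this example's command
parser.add_argument("--local_rank", type=int, default=-1)  # set by the DeepSpeed launcher
parser = deepspeed.add_config_arguments(parser)            # adds --deepspeed and --deepspeed_config
args = parser.parse_args()

model = torch.nn.Sequential(                               # stand-in for the simple network
    torch.nn.Flatten(), torch.nn.Linear(32 * 32 * 3, 10)
)

# DeepSpeed reads the JSON config passed via --deepspeed_config and builds the
# optimizer and fp16 handling from it, returning a wrapped "engine" to train with.
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=model.parameters()
)
```

Because ``deepspeed.initialize`` builds the Adam optimizer and fp16 handling from the JSON config, the training script itself stays free of those details.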

## Using DeepSpeed Autotuning with AzureML

Using DeepSpeed Autotuning with AzureML requires that all nodes can communicate with each other via SSH. To set this up, two scripts are needed to start the job.

First is the ``generate-yml.sh`` script. Typically, an AzureML job is started from a yml file; instead of including one in this example, this script generates it. This is done for security reasons: a unique SSH key is generated per job for passwordless login. After the SSH key is generated, it is added as an environment variable in the job so that each node can access it later.

Next is the ``start-deepspeed`` script. It does not need to be modified, and it has three purposes:
- Add the generated SSH key to all nodes.
- Generate a hostfile for DeepSpeed Autotuning. This file lists the nodes available to the job and the number of GPUs each has; a sample is shown after this list. (The number of GPUs used can be changed via the ``num_gpus_per_node`` parameter in ``generate-yml.sh``.)
- Start the DeepSpeed launcher with the arguments provided in ``generate-yml.sh``.

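For reference, the generated hostfile follows DeepSpeed's standard format: one line per node, naming the host and its GPU slot count. The hostnames below are illustrative placeholders; the script fills in the job's real node addresses at runtime.

```
node-0 slots=8
node-1 slots=8
```
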
### Setup
#### Environment
The environment for this job is provided in the ``docker-context`` directory. There is no setup needed to run the job; however, if you want to set up the environment separately for later use (to avoid rebuilding it every time the job runs), follow [this guide](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-in-studio).
#### Compute Instance
V100 GPUs (ND40rs) are recommended for this job. This example was originally run on two ND40rs nodes with 8 V100 GPUs each. Make sure to edit the ``generate-yml.sh`` file with the name of the compute you want to use for this job.
- Inside ``generate-yml.sh``, edit the ``num_gpus_per_node`` variable at the top of the file to specify how many GPUs each compute node has, and the ``instance_count`` variable to specify how many nodes to use, as shown below.
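
For two 8-GPU nodes, matching the original run of this example, these settings would look roughly like the following sketch (the exact surrounding script contents may differ):

```bash
num_gpus_per_node=8  # GPUs available on each compute node
instance_count=2     # number of compute nodes to use
```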
### Running the Job
1. Inside the ``generate-yml.sh`` file, uncomment the last line of the file. This allows the job to be run with only the command in step 2, since the job is started immediately once the ``job.yml`` file has been generated.
2. To start a DeepSpeed Autotuning job, run the following command in the command line while inside this directory:
```bash
bash generate-yml.sh
```

When the job completes, the optimal DeepSpeed configuration can be found at ``outputs/autotuning_results/results/ds_config_optimal.json``.
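
Instead of browsing the studio UI, the run's outputs can also be pulled locally with the CLI. A minimal sketch, assuming the AzureML CLI v2 is installed (``<job-name>`` is a placeholder for the completed run's name):

```bash
az ml job download --name <job-name>
```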
2 changes: 1 addition & 1 deletion cli/jobs/deepspeed/deepspeed-autotuning/generate-yml.sh
@@ -36,4 +36,4 @@ distribution:
resources:
  instance_count: 2
EOF
# az ml job create --file deepspeed-autotune-aml.yaml
# az ml job create --file job.yml
24 changes: 14 additions & 10 deletions cli/jobs/deepspeed/deepspeed-training/README.md
@@ -1,10 +1,14 @@
### DeepSpeed training with Azure Machine Learning
## Overview
Train a model using DeepSpeed.
## How to Run
1. Create a compute that can run the job. Tesla V100 or A100 GPUs are strongly recommended; the example may not work correctly without them. In the ``generate-yml.sh`` file, set the compute to the name of your compute.
2. Generate the training job yaml file with the following command:<br />
```bash generate-yml.sh```
3. Start the job with the following command:<br />
```az ml job create --file job.yml```
4. This example provides a basic ``ds_config.json`` file to configure DeepSpeed. For a more optimal configuration, run the deepspeed-autotuning example first to generate a new ds_config file to replace this one.
## DeepSpeed training with Azure Machine Learning
This example showcases how to use DeepSpeed with an AzureML training job.
### Setup
#### Environment
The environment for this job is provided in the ``docker-context`` directory. There is no setup needed to run the job; however, if you want to set up the environment separately for later use (to avoid rebuilding it every time the job runs), follow [this guide](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-in-studio).
#### Compute Instance
V100 GPUs (ND40rs) are recommended for this job. This example was originally run on two ND40rs nodes with 8 V100 GPUs each. Make sure to edit the ``job.yml`` file with the name of the compute you want to use for this job.
- Inside the ``job.yml`` file, the ``instance_count`` and ``process_count_per_instance`` values will also need to be edited to specify how many compute nodes to use and how many GPUs each node has.
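
Together these two values set the DeepSpeed world size: with an ``instance_count`` of 2 and a ``process_count_per_instance`` of 8, the job launches 16 processes. DeepSpeed requires that ``train_batch_size`` equal ``train_micro_batch_size_per_gpu`` × ``gradient_accumulation_steps`` × world size, so a ``train_batch_size`` of 64 (as in the autotuning example's config above) with no gradient accumulation implies a micro-batch of 4 per GPU per step.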
### Running the Job
To start a DeepSpeed training job, run the following command in the command line while inside this directory:
```bash
az ml job create --file job.yml
```
> Using with Autotuning: If the deepspeed-autotuning example is run first, you can use the optimal configuration file it generates (``ds_config_optimal.json``) as the ``ds_config.json`` for this example. This allows for better resource utilization.
39 changes: 0 additions & 39 deletions cli/jobs/deepspeed/deepspeed-training/generate-yml.sh

This file was deleted.

26 changes: 25 additions & 1 deletion cli/jobs/deepspeed/deepspeed-training/job.yml
@@ -1 +1,25 @@
# This is a temporary file until generate-yml.sh is run to generate the job.yml file that will be used for this example.
# Training job submission via AML CLI v2

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

command: python train.py --with_aml_log=True --deepspeed --deepspeed_config ds_config.json

experiment_name: DistributedJob-DeepsSpeed-Training-cifar
display_name: deepspeed-training-example
code: src
environment:
  build:
    path: docker-context
limits:
  timeout: 900
outputs:
  output:
    type: uri_folder
    mode: rw_mount
    path: azureml://datastores/workspaceblobstore/paths/outputs/training_results
compute: azureml:gpu-v100-cluster
distribution:
  type: pytorch
  process_count_per_instance: 8
resources:
  instance_count: 2
66 changes: 0 additions & 66 deletions cli/jobs/deepspeed/deepspeed-training/src/start-deepspeed.sh

This file was deleted.

2 changes: 1 addition & 1 deletion cli/readme.py
@@ -459,7 +459,7 @@ def write_job_workflow(job):
pip install azure-identity
bash \"{GITHUB_WORKSPACE}/sdk/python/setup.sh\"
python prepare_data.py --subscription $SUBSCRIPTION_ID --group $RESOURCE_GROUP_NAME --workspace $WORKSPACE_NAME\n"""
elif "deepspeed" in job:
elif "autotuning" in job:
workflow_yaml += f""" bash -x generate-yml.sh\n"""
# workflow_yaml += f""" bash -x {os.path.relpath(".", project_dir)}/run-job.sh generate-yml.yml\n"""
workflow_yaml += f""" bash -x {os.path.relpath(".", project_dir).replace(os.sep, "/")}/run-job.sh {filename}.yml
