Update DeepSpeed Training example launcher #2103

Merged · 6 commits · Mar 9, 2023
@@ -46,6 +46,5 @@ jobs:
run: |
source "${{ github.workspace }}/infra/sdk_helpers.sh";
source "${{ github.workspace }}/infra/init_environment.sh";
bash -x generate-yml.sh
bash -x ../../../run-job.sh job.yml
working-directory: cli/jobs/deepspeed/deepspeed-training
69 changes: 59 additions & 10 deletions cli/jobs/deepspeed/deepspeed-autotuning/README.md
@@ -1,10 +1,59 @@
### Deepspeed autotuning with Azure Machine Learning
## Overview
The deepspeed autotuner will generate an optimal configuration file (``ds_config.json``) that can be used in a deepspeed training job.
## How to Run
1. Create a compute that can run the job. Tesla V100 or A100 GPUs are strongly recommended, the example may not work correctly without them. In the ``generate-yml.sh`` file, set the compute to be the name of your compute.
2. Generate the autotuning job yaml file with the following command:<br />
```bash generate-yml.sh```
3. Start the job with the following command:<br />
```az ml job create --file job.yml```
4. The optimal configuration file ``ds_config_optimal.json`` can be found at ``outputs/autotuning_results/exps`` under the ``outputs + logs`` tab of the completed run.
## Using DeepSpeed Autotuning to generate an optimal DeepSpeed configuration file

DeepSpeed Autotuning is a feature that finds the configuration file that maximizes training speed and memory efficiency of a model for a given hardware configuration. This gives users the best possible performance without spending time manually tweaking hyperparameters.

To use DeepSpeed Autotuning, we are going to need a DeepSpeed config file to start with.

```
{
  "train_batch_size": 64,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 0.0002,
      "betas": [
        0.5,
        0.999
      ],
      "eps": 1e-8
    }
  },
  "fp16": {
    "enabled": true
  },
  "autotuning": {
    "enabled": true,
    "fast": false,
    "results_dir": "outputs/autotuning_results/results",
    "exps_dir": "outputs/autotuning_results/exps"
  },
  "steps_per_print": 10
}
```
This ``ds_config.json`` file starts from typical values; autotuning will then find the ideal configuration for the provided resources. Notice the ``autotuning`` section, where the autotuning job is configured and the output directories are set. In the ``train.py`` file, DeepSpeed is initialized with this configuration and trains a simple neural network model.
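
As a rough sketch of that initialization pattern (this is not the example's actual ``train.py``; the model and argument names are placeholders), DeepSpeed picks up the config passed via ``--deepspeed_config``:

```python
# Minimal sketch only -- the real train.py in this example differs.
import argparse

import deepspeed
import torch


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1)
    # Adds the standard --deepspeed / --deepspeed_config flags.
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()

    # Placeholder model; the example trains its own simple network.
    model = torch.nn.Sequential(
        torch.nn.Linear(784, 128),
        torch.nn.ReLU(),
        torch.nn.Linear(128, 10),
    )

    # deepspeed.initialize reads ds_config.json (including the autotuning section)
    # and returns an engine that manages fp16, batch sizes, and the optimizer.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=model.parameters(),
    )

    # A training loop would then call model_engine(batch),
    # model_engine.backward(loss), and model_engine.step().


if __name__ == "__main__":
    main()
```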

## Using DeepSpeed Autotuning with AzureML

Using DeepSpeed Autotuning with AzureML requires that all nodes can communicate with each other via SSH. To set this up, we need two scripts to start the job.

First is the ``generate-yml.sh`` script. Starting an AzureML job normally requires a yml file; instead of shipping one with this example, this script generates it. This is done for security reasons: a unique SSH key is generated per job for passwordless login. After generating the SSH key, the script adds it to the job as an environment variable so each node can access it later on.

Next is the ``start-deepspeed`` script. It does not need to be modified, and it serves three purposes:
- Add the generated SSH key to all nodes.
- Generate a hostfile to be used by DeepSpeed Autotuning (the format is shown after this list). This file lists the nodes available to the job and the number of GPUs each one has. (The number of GPUs used can be changed via the num_gpus_per_node parameter in ``generate-yml.sh``.)
- Start the DeepSpeed launcher with the arguments provided in ``generate-yml.sh``.
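
For reference, a DeepSpeed hostfile is a plain-text file with one line per node in the form ``<hostname> slots=<num_gpus>``. For the two 8-GPU nodes used here it would look roughly like this (hostnames are illustrative):

```
node-0 slots=8
node-1 slots=8
```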

### Setup
#### Environment
The environment for this job is provided in the docker-context file. No setup is needed to run the job; however, if you want to set up the environment separately for later use (to avoid rebuilding it every time the job runs), follow [this guide](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-in-studio).
#### Compute Instance
V100 GPUs (ND40rs) are recommended for this job. This example was originally run using 2 ND40rs nodes with 8 V100 GPUs each. Make sure to edit the ``generate-yml.sh`` file with the name of the compute you want to use for this job.
- Inside the ``generate-yml.sh`` file, edit the ``num_gpus_per_node`` variable at the top of the file to specify how many GPUs each compute node has, and the ``instance_count`` variable to specify how many nodes to use.
### Running the Job
1. Inside the ``generate-yml.sh`` file, uncomment the last line. This lets the job be run with only the command in step 2, since the job starts immediately once the ``job.yml`` file has been generated.
2. To start a DeepSpeed Autotuning job, run the following command in the command line while inside this directory:
```
bash generate-yml.sh
```

When the job completes, the optimal DeepSpeed configuration can be found at ``outputs/autotuning_results/results/ds_config_optimal.json``.
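
If you prefer the CLI to the studio UI, the run's outputs can usually be downloaded as well; a hedged example (check ``az ml job download --help`` for the exact flags available in your CLI version, and replace ``<run-id>`` with the completed job's name):
```
az ml job download --name <run-id> --all
```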
2 changes: 1 addition & 1 deletion cli/jobs/deepspeed/deepspeed-autotuning/generate-yml.sh
@@ -36,4 +36,4 @@ distribution:
resources:
instance_count: 2
EOF
# az ml job create --file deepspeed-autotune-aml.yaml
# az ml job create --file job.yml
24 changes: 14 additions & 10 deletions cli/jobs/deepspeed/deepspeed-training/README.md
@@ -1,10 +1,14 @@
### Deepspeed training with Azure Machine Learning
## Overview
Train a model using deepspeed.
## How to Run
1. Create a compute that can run the job. Tesla V100 or A100 GPUs are strongly recommended, the example may not work correctly without them. In the ``generate-yml.sh`` file, set the compute to be the name of your compute.
2. Generate the training job yaml file with the following command:<br />
```bash generate-yml.sh```
3. Start the job with the following command:<br />
```az ml job create --file job.yml```
4. This example provides a basic ``ds_config.json`` file to configure deepspeed. To have a more optimal configuration, run the deepspeed-autotuning example first to generate a new ds_config file to replace this one.
## Deepspeed training with Azure Machine Learning
This example showcases how to use DeepSpeed with an AzureML training job.
### Setup
#### Environment
The environment for this job is provided in the docker-context file. No setup is needed to run the job; however, if you want to set up the environment separately for later use (to avoid rebuilding it every time the job runs), follow [this guide](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-environments-in-studio).
#### Compute Instance
V100 GPUs (ND40rs) are recommended for this job. This example was originally run using 2 ND40rs nodes with 8 V100 GPUs each. Make sure to edit the ``job.yml`` file with the name of the compute you want to use for this job.
- Inside the ``job.yml`` file, the ``instance_count`` and ``process_count_per_instance`` values will also need to be edited to specify how many compute nodes to use and how many GPUs each node has.
### Running the Job
To start a DeepSpeed training job, run the following command in the command line while inside this directory:
```
az ml job create --file job.yml
```
> Using with Autotuning: If the deepspeed-autotuning example is run first, you can use the optimal ``ds_config.json`` file it generates as the configuration file for this example, allowing for better resource utilization.
39 changes: 0 additions & 39 deletions cli/jobs/deepspeed/deepspeed-training/generate-yml.sh

This file was deleted.

26 changes: 25 additions & 1 deletion cli/jobs/deepspeed/deepspeed-training/job.yml
@@ -1 +1,25 @@
# This is a temporary file until generate-yml.sh is run to generate the job.yml file that will be used for this example.
# Training job submission via AML CLI v2

$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

command: python train.py --with_aml_log=True --deepspeed --deepspeed_config ds_config.json

experiment_name: DistributedJob-DeepsSpeed-Training-cifar
display_name: deepspeed-training-example
code: src
environment:
  build:
    path: docker-context
limits:
  timeout: 900
outputs:
  output:
    type: uri_folder
    mode: rw_mount
    path: azureml://datastores/workspaceblobstore/paths/outputs/training_results
compute: azureml:gpu-v100-cluster
distribution:
  type: pytorch
  process_count_per_instance: 8
resources:
  instance_count: 2
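
As an aside (an illustration, not part of this PR): scaling this job to different hardware is mainly a matter of keeping these two settings in sync with the cluster. A hypothetical four-node cluster with four GPUs per node would use:
```
distribution:
  type: pytorch
  process_count_per_instance: 4
resources:
  instance_count: 4
```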
66 changes: 0 additions & 66 deletions cli/jobs/deepspeed/deepspeed-training/src/start-deepspeed.sh

This file was deleted.

2 changes: 1 addition & 1 deletion cli/readme.py
@@ -459,7 +459,7 @@ def write_job_workflow(job):
pip install azure-identity
bash \"{GITHUB_WORKSPACE}/sdk/python/setup.sh\"
python prepare_data.py --subscription $SUBSCRIPTION_ID --group $RESOURCE_GROUP_NAME --workspace $WORKSPACE_NAME\n"""
elif "deepspeed" in job:
elif "autotuning" in job:
workflow_yaml += f""" bash -x generate-yml.sh\n"""
# workflow_yaml += f""" bash -x {os.path.relpath(".", project_dir)}/run-job.sh generate-yml.yml\n"""
workflow_yaml += f""" bash -x {os.path.relpath(".", project_dir).replace(os.sep, "/")}/run-job.sh {filename}.yml