Fixing broken links in the BestPractices folder. (#2146)
Fixing broken links under BestPractices folder, used relative paths.
rtanase committed Mar 21, 2023
1 parent 7a7395d commit 92120d3
Showing 4 changed files with 19 additions and 20 deletions.
5 changes: 2 additions & 3 deletions best-practices/largescale-deep-learning/README.md
@@ -3,7 +3,6 @@
## Table of Contents

- [AzureML Large Scale Deep Learning Best Practices](#azureml-large-scale-deep-learning-best-practices)
-- [Table of Contents](#table-of-contents)
- [Welcome](#welcome)
- [Optimizations for Deep Learning in AzureML](#optimizations-for-deep-learning-in-azureml)
- [Create ML resources to get started](#create-ml-resources-to-get-started)
@@ -34,7 +33,7 @@ The host OS is updated with the latest drivers and patches to ensure smooth oper

The AzureML Compute layer abstracts the complexities of managing cloud-scale infrastructure for compute, storage and networking.

-AzureML supports curated environments for training execution on cached Docker images, reducing run preparation cost and improving consistency across experiment runs. The Azure Container for PyTorch ([ACPT](https://learn.microsoft.com/azure/machine-learning/reference-azure-container-for-pytorch)) Curated Environment is the built-in setup for running PyTorch training experiments on Azure AI hardware. ACPT includes a curated set of optimizer libraries to improve training throughput, with DeepSpeed for GPU memory optimization, ONNX Runtime Training for efficient op-level execution and NebulaML for fast checkpointing.
+AzureML supports curated environments for training execution on cached Docker images, reducing run preparation cost and improving consistency across experiment runs. The Azure Container for PyTorch ([ACPT](https://learn.microsoft.com/en-us/azure/machine-learning/resource-azure-container-for-pytorch)) Curated Environment is the built-in setup for running PyTorch training experiments on Azure AI hardware. ACPT includes a curated set of optimizer libraries to improve training throughput, with DeepSpeed for GPU memory optimization, ONNX Runtime Training for efficient op-level execution and NebulaML for fast checkpointing.
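
For illustration only, the sketch below shows how a distributed command job might reference an ACPT curated environment with the Azure ML Python SDK v2. The workspace identifiers, compute target, training command, and the exact curated-environment name and version are placeholders, not values taken from this repository; check the environments available in your workspace.

```
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Placeholder workspace identifiers.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="./src",                                  # hypothetical folder containing the training script
    command="python train.py --epochs 1",          # hypothetical entry point
    # Curated ACPT environment; the exact name and version vary by release.
    environment="AzureML-ACPT-pytorch-2.0-cuda11.7@latest",
    compute="gpu-cluster",                         # placeholder compute target
    instance_count=2,
    distribution={"type": "pytorch", "process_count_per_instance": 4},
)

ml_client.jobs.create_or_update(job)
```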

The AzureML PaaS offers capabilities for the enterprise MLOps lifecycle to manage all aspects of the experimentation and deployment loops.

@@ -56,7 +55,7 @@ AzureML supports three data asset types:
Follow this [guide](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-data-assets) to learn more about how to create any of the supported data assets.
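
For example, a folder in a datastore could be registered as a ``uri_folder`` data asset with the Python SDK v2 roughly as follows; the workspace identifiers, asset name and path below are placeholders:

```
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data
from azure.identity import DefaultAzureCredential

# Placeholder workspace identifiers.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Register a datastore folder (or any supported URI) as a versioned uri_folder data asset.
training_data = Data(
    name="my-training-data",          # placeholder asset name
    version="1",
    type=AssetTypes.URI_FOLDER,
    path="azureml://datastores/workspaceblobstore/paths/training-data/",
)
ml_client.data.create_or_update(training_data)
```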

## Create Training environment
-An Environment is useful for tracking and reproducing the project's software dependencies as they evolve over time. In this [section](./Environment/Readme.md) you can learn more about Environments in AzureML, how to quickly get started, and how to validate the setup before we begin training.
+An Environment is useful for tracking and reproducing the project's software dependencies as they evolve over time. In this [section](./Environment/README.md) you can learn more about Environments in AzureML, how to quickly get started, and how to validate the setup before we begin training.

## Efficient data loading for large training workloads

@@ -10,9 +10,9 @@ V100 GPUs (STANDARD_ND40RS_V2) are recommended for this job. This example was or
To attain linear scaling for large models, one important step can be to use InfiniBand. InfiniBand enables low-latency, GPU-to-GPU communication across nodes in a cluster. InfiniBand requires specialized hardware to operate. Only some VM SKUs on Azure contain this required hardware. You can view the full list of InfiniBand-enabled machine SKUs [here](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-hpc#rdma-capable-instances).

### **Environment**
-The environment found at ``src/environments`` is an ACPT environment with multiple accelerators to boost the training job. If you would like to add additional packages, edit the appropriate files in ``src/environments`` with your changes, then create the custom environment using the following command:
+The environment found at ``src/environment`` is an ACPT environment with multiple accelerators to boost the training job. If you would like to add additional packages, edit the appropriate files in ``src/environment`` with your changes, then create the custom environment using the following command:
```
-az ml environment create --file ./src/environments/env.yml
+az ml environment create --file ./src/environment/env.yml
```
## **Code**
All of the code described in this document can be found either in one of the submit yml files or in the ``src`` folder of this directory.
@@ -24,7 +24,7 @@ The first step in the training script is to parse the arguments passed in from t
- ``--gradient_accumulation_steps`` Number of training steps over which gradients are accumulated before they are applied to update the model variables. This value should match the value of ``gradient_accumulation_steps`` in your ``ds_config.json`` file if DeepSpeed is enabled.
- ``--model_checkpoint`` The model to pretrain. In this case we pretrain "bert-large-uncased", but this example was also run with DistilBERT and BERT-base. See below for more information.
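
As a rough sketch only (the helper name and defaults below are assumptions, not the script's actual code), parsing these arguments with ``argparse`` might look like this:

```
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Sketch of the pretraining arguments described above.")
    parser.add_argument("--gradient_accumulation_steps", type=int, default=1,
                        help="Must match gradient_accumulation_steps in ds_config.json when DeepSpeed is enabled.")
    parser.add_argument("--model_checkpoint", type=str, default="bert-large-uncased",
                        help="Model to pretrain, e.g. bert-large-uncased, bert-base-uncased or distilbert-base-uncased.")
    return parser.parse_args()
```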

-This example also supports the interactive capabilities of JupyterLab, TensorBoard and VS Code. These are added via the ``services`` section of the yml submit files. For more information on these, see [this](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training#interactive-debugging) page. Remove these sections under ``services`` to disable these tools.
+This example also supports the interactive capabilities of JupyterLab, TensorBoard and VS Code. These are added via the ``services`` section of the yml submit files. For more information on these, see [this](../README.md#interactive-debugging) page. Remove these sections under ``services`` to disable these tools.

#### **DeepSpeed Configuration**
As discussed above, arguments to the command job will need to match arguments in the DeepSpeed configuration file (``ds_config.json``) if DeepSpeed is being used. We use a very simple configuration for this experiment. This config omits the additional profiling and checkpointing tools that are added to the ``ds_config.json`` located in the ``src`` folder.
@@ -42,7 +42,7 @@ As discussed above, arguments to the command job will need to match arguments in
```
Each setting here is described above, but this configuration also includes ``fp16`` to improve training speed and reduce memory usage.
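
To illustrate keeping the command-line arguments and the DeepSpeed configuration in sync, a minimal ``ds_config.json`` in this style could be generated from Python as below; the batch size is a placeholder and the experiment's real config contains more settings:

```
import json

# These values must stay consistent with the matching command-line arguments of the job.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,   # placeholder value
    "gradient_accumulation_steps": 1,       # must equal --gradient_accumulation_steps
    "fp16": {"enabled": True},              # mixed precision for speed and lower memory usage
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```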

-This configuration was found by running [DeepSpeed Autotuning](https://www.deepspeed.ai/tutorials/autotuning/) with this training script and BERT large in [this example](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training/DeepSpeed-Autotuning). DeepSpeed as it relates to this example is described in more detail [here](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training#deepspeed).
+This configuration was found by running [DeepSpeed Autotuning](https://www.deepspeed.ai/tutorials/autotuning/) with this training script and BERT large in [this example](../DeepSpeed-Autotuning). DeepSpeed as it relates to this example is described in more detail [here](../README.md#deepspeed).
### **Load the dataset**
Once arguments have been parsed, it's time to prepare the dataset. First we prepare a tokenizer to tokenize the data:
```
@@ -56,7 +55,7 @@ encoded_dataset_train, encoded_dataset_eval = load_encoded_glue_dataset(
task=task, tokenizer=tokenizer
)
```
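
As a hedged sketch of what that encoding step typically involves (the actual implementation lives in ``glue_datasets.py`` and handles the task-specific text columns), tokenizing a GLUE task with a Hugging Face tokenizer might look roughly like this:

```
from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical stand-in for the repository's helper, shown for a single-sentence task such as "cola".
def encode_glue_dataset(task: str, checkpoint: str = "bert-large-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    dataset = load_dataset("glue", task)
    encoded = dataset.map(
        lambda batch: tokenizer(batch["sentence"], truncation=True, padding="max_length"),
        batched=True,
    )
    return encoded["train"], encoded["validation"]
```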
-This is done from within the [``glue_datasets.py``](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/blob/main/Training/Bert-Pretrain/src/glue_datasets.py) file.
+This is done from within the [``glue_datasets.py``](./src/glue_datasets.py) file.
```
def load_raw_glue_dataset(task: str) -> Union[DatasetDict, Dataset]:
dataset = load_dataset("glue", actual_task(task))
@@ -113,7 +113,7 @@ trainer.pop_callback(MLflowCallback)
result = trainer.train()
```
-The ``ProfilerCallback`` in the above code is used to integrate the experiment with [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html). For more information on this code, see [this page](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training#pytorch-profiler).
+The ``ProfilerCallback`` in the above code is used to integrate the experiment with [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html). For more information on this code, see [this page](../README.md#pytorch-profiler).
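
As a rough illustration of the pattern only (not the repository's actual ``ProfilerCallback``), a Hugging Face ``TrainerCallback`` can drive the profiler like this:

```
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler
from transformers import TrainerCallback

class SimpleProfilerCallback(TrainerCallback):
    """Illustrative only: profile a few training steps and write TensorBoard traces."""

    def on_train_begin(self, args, state, control, **kwargs):
        self.prof = profile(
            activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
            schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
            on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
        )
        self.prof.start()

    def on_step_end(self, args, state, control, **kwargs):
        self.prof.step()  # advance the profiling schedule once per training step

    def on_train_end(self, args, state, control, **kwargs):
        self.prof.stop()
```

A callback like this would be registered with ``trainer.add_callback(...)`` before calling ``trainer.train()``.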

## **Run the Job**
### **Submit with vanilla PyTorch**
@@ -10,7 +10,7 @@ NVIDIA A100 80GB GPUs are recommended for this job. This experiment was original
To attain linear scaling for large models, one important step can be to use InfiniBand. InfiniBand enables low-latency, GPU-to-GPU communication across nodes in a cluster. InfiniBand requires specialized hardware to operate. Only some VM SKUs on Azure contain this required hardware. You can view the full list of InfiniBand-enabled machine SKUs [here](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-hpc#rdma-capable-instances).

### **Environment**
-The environment found [here](https://github.com/savitamittal1/Megatron-DeepSpeed-AML/blob/353db918a3a061552aa541e8d67d9b55a35b2f3d/examples/azureml/environment/context/Dockerfile) is an ACPT environment with multiple accelerators to boost the training job. Also included are HuggingFace packages used for this training. If you would like to add additional packages, edit the appropriate files in that directory with your changes, then create the custom environment using the following command:
+The environment found [here](./src/environment/context/Dockerfile) is an ACPT environment with multiple accelerators to boost the training job. Also included are HuggingFace packages used for this training. If you would like to add additional packages, edit the appropriate files in that directory with your changes, then create the custom environment using the following command:
```
az ml environment create --file ./src/environment/env.yml
```
@@ -19,7 +19,7 @@ az ml environment create --file ./src/environment/env.yml
The following code can be found under this directory: ``src/deepspeed-BLOOM-AML-SDKv2.yaml`` for the submit file and environment, and ``src/Megatron-DeepSpeed/pretrain_gpt.py`` for the training code.

### **Job Configuration**
-In the [``deepspeed-BLOOM-AML-SDKv2.yaml``](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training/Bloom-Pretrain/src/deepspeed-BLOOM-AML-SDKv2.yaml) file for submitting the job, there are several arguments passed in for the pretraining, with most being settings specific to how the model will be trained. For more information on command line arguments, see [here](https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/arguments.py). Some arguments relevant to this example are:
+In the [``deepspeed-BLOOM-AML-SDKv2.yaml``](./src/deepspeed-BLOOM-AML-SDKv2.yaml) file for submitting the job, there are several arguments passed in for the pretraining, with most being settings specific to how the model will be trained. For more information on command line arguments, see [here](https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/arguments.py). Some arguments relevant to this example are:

- ``--data-path`` - The paths to the data the model is trained on. The format is the weight of the dataset followed by the path and name of the file that references the .bin and .idx files (without the extension). For example, the command below adds a weight of 0.033178301 to the ar language data, and the ar folder should contain ar_text_document.bin and ar_text_document.idx. This format is also sketched after this list.
- ``--deepspeed`` and other DeepSpeed-related arguments. These arguments are specific to DeepSpeed. The ``ds_config.json`` file passed in gives the configuration settings for DeepSpeed. Notice that the ``global_batch_size`` argument matches the ``train_batch_size`` setting in the ds_config. Similarly, the ``--zero_stage`` command line argument matches the ``zero_optimization`` setting in the ``ds_config.json`` file.
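
As an illustration of the weighted ``--data-path`` format only (the weights, paths and helper below are invented for the example, not taken from the actual submit file), the argument value could be assembled like this:

```
# Each dataset is given as "<weight> <prefix>", where <prefix>.bin and <prefix>.idx must exist.
datasets = [
    (0.033178301, "data/ar/ar_text_document"),  # ar language data, as in the example above
    (0.5, "data/en/en_text_document"),          # placeholder weight and path
]
data_path_arg = " ".join(f"{weight} {prefix}" for weight, prefix in datasets)
print(f"--data-path {data_path_arg}")
```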