From 92120d3981e2e5826fe0ef501639365cd98f1a5d Mon Sep 17 00:00:00 2001
From: Razvan Tanase
Date: Tue, 21 Mar 2023 15:15:16 -0700
Subject: [PATCH] Fixing broken links in the BestPractices folder. (#2146)

Fixing broken links under the BestPractices folder, using relative paths.
---
 .../largescale-deep-learning/README.md |  5 ++---
 .../Training/Bert-Pretrain/README.md   | 12 ++++++------
 .../Training/Bloom-Pretrain/README.md  |  4 ++--
 .../Training/README.md                 | 18 +++++++++---------
 4 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/best-practices/largescale-deep-learning/README.md b/best-practices/largescale-deep-learning/README.md
index 783a4fe2b0..21ecd567ac 100644
--- a/best-practices/largescale-deep-learning/README.md
+++ b/best-practices/largescale-deep-learning/README.md
@@ -3,7 +3,6 @@
 ## Table of Contents
 - [AzureML Large Scale Deep Learning Best Practices](#azureml-large-scale-deep-learning-best-practices)
-  - [Table of Contents](#table-of-contents)
   - [Welcome](#welcome)
   - [Optimizations for Deep Learning in AzureML](#optimizations-for-deep-learning-in-azureml)
   - [Create ML resources to get started](#create-ml-resources-to-get-started)
@@ -34,7 +33,7 @@ The host OS is updated with the latest drivers and patches to ensure smooth oper
 
 The AzureML Compute layer abstracts the complexities of managing the cloud-scale infrastructure for compute, storage and networking.
 
-AzureML supports curated environments for training execution on cached Docker images reducing the run preparation cost and consistency for experiment runs. The Azure Container for PyTorch ([ACPT](https://learn.microsoft.com/azure/machine-learning/reference-azure-container-for-pytorch)) Curated Environment is the built-in setup for running pytorch training experiments on the Azure AI hardware. ACPT includes a curated set of optimizer libraries to improve the training throughput with DeepSpeed for GPU memory optimization, ONNX Runtime Training for efficient op-level execution and NebulaML for fast checkpointing.
+AzureML supports curated environments for training execution on cached Docker images, reducing run preparation cost and improving consistency across experiment runs. The Azure Container for PyTorch ([ACPT](https://learn.microsoft.com/en-us/azure/machine-learning/resource-azure-container-for-pytorch)) curated environment is the built-in setup for running PyTorch training experiments on Azure AI hardware. ACPT includes a curated set of optimizer libraries to improve training throughput, with DeepSpeed for GPU memory optimization, ONNX Runtime Training for efficient op-level execution and NebulaML for fast checkpointing.
 
 The AzureML PaaS offers capabilities for the enterprise MLOps lifecycle to manage all aspects of the experimentation and deployment loops.
 
@@ -56,7 +55,7 @@ AzureML supports thre data asset types:
 
 Follow this [guide](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-data-assets) to learn more about how to create any of the supported data assets.
 
 ## Create Training environment
-An Environment is useful to track and reproduce the projects' software dependencies as they evolve over time. In this [section](./Environment/Readme.md) you can learn more about Environments in AzureML, how to quickly get started and validate the setup before we begin training.
+An Environment is useful to track and reproduce the projects' software dependencies as they evolve over time.
+In this [section](./Environment/README.md) you can learn more about Environments in AzureML, how to quickly get started and validate the setup before we begin training.
 
 ## Efficient data loading for large training workloads
diff --git a/best-practices/largescale-deep-learning/Training/Bert-Pretrain/README.md b/best-practices/largescale-deep-learning/Training/Bert-Pretrain/README.md
index ce8945cb7f..a3b33a2df5 100644
--- a/best-practices/largescale-deep-learning/Training/Bert-Pretrain/README.md
+++ b/best-practices/largescale-deep-learning/Training/Bert-Pretrain/README.md
@@ -10,9 +10,9 @@ V100 GPUs (STANDARD_ND40RS_V2) are recommended for this job. This example was or
 To attain linear scaling for large models, one important step can be to use InfiniBand. InfiniBand enables low-latency, GPU-to-GPU communication across nodes in a cluster. InfiniBand requires specialized hardware to operate. Only some VM SKUs on Azure contain this required hardware. You can view the full list of InfiniBand-enabled machine SKUs [here](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-hpc#rdma-capable-instances).
 
 ### **Environment**
-The environment found at ``src/envrionments`` is an ACPT environment with multiple accelerators to boost the training job. If you would like to add additional packages, edit the appropriate files in ``src/environments`` with your changes, then create the custom environment using the following command:
+The environment found at ``src/environment`` is an ACPT environment with multiple accelerators to boost the training job. If you would like to add additional packages, edit the appropriate files in ``src/environment`` with your changes, then create the custom environment using the following command:
 ```
-az ml environment create --file ./src/environments/env.yml
+az ml environment create --file ./src/environment/env.yml
 ```
 ## **Code**
 All of the code described in this document can be found either in one of the submit yml files or in the ``src`` folder of this directory.
@@ -24,7 +24,7 @@ The first step in the training script is to parse the arguments passed in from t
 - ``--gradient_accumulation_steps`` Number of training steps to accumulate gradients over before they are used to update the model weights. This value should match the value of ``gradient_accumulation_steps`` in your ``ds_config.json`` file if DeepSpeed is enabled.
 - ``--model_checkpoint`` The model to pretrain. In this case we are pretraining "bert-large-uncased", but this example was also run with DistilBERT and BERT-base. See below for more information.
-This example also supports the interactive capabilities from JupyterLab, TensorBoard and VSCode. These are added via the ``services`` section of the yml submit files. For more information on these, see [this](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training#interactive-debugging) page. Remove these sections under ``services`` to disable these tools.
+This example also supports the interactive capabilities from JupyterLab, TensorBoard and VSCode. These are added via the ``services`` section of the yml submit files. For more information on these, see [this](../README.md#interactive-debugging) page. Remove these sections under ``services`` to disable these tools.
 #### **DeepSpeed Configuration**
 As discussed above, arguments to the command job will need to match arguments in the DeepSpeed configuration file (``ds_config.json``) if DeepSpeed is being used. We use a very simple configuration for this experiment.
 This config is without the additional profiling + checkpointing tools added to the ``ds_config.json`` located in the ``src`` folder.
@@ -42,7 +42,7 @@ As discussed above, arguments to the command job will need to match arguments in
 ```
 Each setting here is described above, but this configuration also includes ``fp16`` to improve training speed and reduce memory usage.
 
-This configuration was found by running [DeepSpeed Autotuning](https://www.deepspeed.ai/tutorials/autotuning/) with this training script and BERT large in [this example](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training/DeepSpeed-Autotuning). DeepSpeed as it relates to this example is described in more detail [here](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training#deepspeed).
+This configuration was found by running [DeepSpeed Autotuning](https://www.deepspeed.ai/tutorials/autotuning/) with this training script and BERT large in [this example](../DeepSpeed-Autotuning). DeepSpeed as it relates to this example is described in more detail [here](../README.md#deepspeed).
 ### **Load the dataset**
 Once arguments have been parsed, it's time to prepare the dataset. First we prepare a tokenizer to tokenize the data:
 ```
@@ -56,7 +56,7 @@ encoded_dataset_train, encoded_dataset_eval = load_encoded_glue_dataset(
     task=task, tokenizer=tokenizer
 )
 ```
-This is done from within the [``glue_datasets.py``](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/blob/main/Training/Bert-Pretrain/src/glue_datasets.py) file.
+This is done from within the [``glue_datasets.py``](./src/glue_datasets.py) file.
 ```
 def load_raw_glue_dataset(task: str) -> Union[DatasetDict, Dataset]:
     dataset = load_dataset("glue", actual_task(task))
@@ -113,7 +113,7 @@ trainer.pop_callback(MLflowCallback)
 
 result = trainer.train()
 ```
-The ``ProfilerCallback`` in the above code is used to integrate the experiment with [Pytorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html). For more information on this code, see [this page](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training#pytorch-profiler).
+The ``ProfilerCallback`` in the above code is used to integrate the experiment with [Pytorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html). For more information on this code, see [this page](../README.md#pytorch-profiler).
 
 ## **Run the Job**
 ### **Submit with vanilla Pytorch**
diff --git a/best-practices/largescale-deep-learning/Training/Bloom-Pretrain/README.md b/best-practices/largescale-deep-learning/Training/Bloom-Pretrain/README.md
index 475a7a3f9e..f14cc79b8a 100644
--- a/best-practices/largescale-deep-learning/Training/Bloom-Pretrain/README.md
+++ b/best-practices/largescale-deep-learning/Training/Bloom-Pretrain/README.md
@@ -10,7 +10,7 @@ NVIDIA A100 80GB GPUs are recommended for this job. This experiment was original
 To attain linear scaling for large models, one important step can be to use InfiniBand. InfiniBand enables low-latency, GPU-to-GPU communication across nodes in a cluster. InfiniBand requires specialized hardware to operate. Only some VM SKUs on Azure contain this required hardware. You can view the full list of InfiniBand-enabled machine SKUs [here](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-hpc#rdma-capable-instances).
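As a quick sanity check that a multi-node job is really using the InfiniBand fabric described above, NCCL's own debug logging can be turned on from the training script. The snippet below is a minimal, illustrative sketch rather than part of this example's code; ``NCCL_DEBUG`` and ``NCCL_IB_DISABLE`` are standard NCCL environment variables, and the script assumes it is started by a distributed launcher (for example ``torchrun``) so the usual rank variables are present.
```
import os

import torch
import torch.distributed as dist

# Ask NCCL to log its transport selection; look for "NET/IB" lines in the job output.
os.environ.setdefault("NCCL_DEBUG", "INFO")
# Keep InfiniBand enabled (setting this to "1" forces TCP and usually hurts scaling).
os.environ.setdefault("NCCL_IB_DISABLE", "0")

dist.init_process_group(backend="nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# A tiny all-reduce is enough to trigger communicator setup and the NCCL log lines.
payload = torch.ones(1, device="cuda")
dist.all_reduce(payload)
print(f"rank {dist.get_rank()}: all-reduce result {payload.item()}")
```
If the log shows NCCL falling back to plain TCP sockets, double-check that the compute SKU is on the RDMA-capable list linked above.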
 ### **Environment**
-The environment found [here](https://github.com/savitamittal1/Megatron-DeepSpeed-AML/blob/353db918a3a061552aa541e8d67d9b55a35b2f3d/examples/azureml/environment/context/Dockerfile) is an ACPT environment with multiple accelerators to boost the training job. Also included are HuggingFace packages used for this training. If you would like to add additional packages, edit the appropriate files in that directory with your changes, then create the custom environment using the following command:
+The environment found [here](./src/environment/context/Dockerfile) is an ACPT environment with multiple accelerators to boost the training job. Also included are HuggingFace packages used for this training. If you would like to add additional packages, edit the appropriate files in that directory with your changes, then create the custom environment using the following command:
 ```
 az ml environment create --file ./src/environment/env.yml
 ```
@@ -19,7 +19,7 @@ az ml environment create --file ./src/environment/env.yml
 
 ## **Code**
 The following code can be found under this directory in ``src/deepspeed-BLOOM-AML-SDKv2.yaml`` for the submit file and environment and ``src/Megatron-DeepSpeed/pretrain_gpt.py`` for the training code.
 
 ### **Job Configuration**
-In the [``deepspeed-BLOOM-AML-SDKv2.yaml``](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training/Bloom-Pretrain/src/deepspeed-BLOOM-AML-SDKv2.yaml) file for submitting the job, there are several arguments passed in for the pretraining, with most being settings specific to how the model will be trained. For more information on command line arguments, see [here](https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/arguments.py). Some arguments relevant to this example are:
+In the [``deepspeed-BLOOM-AML-SDKv2.yaml``](./src/deepspeed-BLOOM-AML-SDKv2.yaml) file for submitting the job, there are several arguments passed in for the pretraining, with most being settings specific to how the model will be trained. For more information on command line arguments, see [here](https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/arguments.py). Some arguments relevant to this example are:
 - ``--data-path`` - The paths to the data the model is trained on. The format should be the weight of the dataset followed by the path and name of the file that references the .bin and .idx files (without the extension). For example, the command below adds a weight of 0.033178301 to the ar language data, and inside the ar folder there should be ar_text_document.bin and ar_text_document.idx.
 - ``--deepspeed`` and other DeepSpeed-related arguments. These arguments are specific to DeepSpeed. The ``ds_config.json`` file passed in gives the configuration settings for DeepSpeed. Notice that the argument ``global_batch_size`` matches the ``train_batch_size`` setting in the ds_config. Similarly, the ``--zero_stage`` command line argument matches the ``zero_optimization`` setting in the ``ds_config.json`` file. A minimal sketch of this pairing follows below.
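To make the pairing between these launcher arguments and ``ds_config.json`` concrete, here is a minimal sketch that generates such a config; the numbers are placeholders rather than the settings actually used in this example.
```
import json

# Placeholders for illustration: keep these equal to the values passed on the command line.
GLOBAL_BATCH_SIZE = 2048   # must match the job's global batch size argument
ZERO_STAGE = 1             # must match the job's ZeRO stage argument

ds_config = {
    # DeepSpeed's name for the effective batch size across all GPUs and accumulation steps.
    "train_batch_size": GLOBAL_BATCH_SIZE,
    # ZeRO partitioning stage; must stay in sync with the launcher argument above.
    "zero_optimization": {"stage": ZERO_STAGE},
    # Mixed precision is typically enabled for large-model pretraining.
    "bf16": {"enabled": True},
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```
Generating the file from the same constants used to build the command line is one simple way to keep the two in sync.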
diff --git a/best-practices/largescale-deep-learning/Training/README.md b/best-practices/largescale-deep-learning/Training/README.md
index 3e327b2ab7..a00ca73b7f 100644
--- a/best-practices/largescale-deep-learning/Training/README.md
+++ b/best-practices/largescale-deep-learning/Training/README.md
@@ -87,7 +87,7 @@ This guide will show best practices to allow you to train large models very effi
 - ### **Environment**
-  The recommended environment for a large scale distributed training job is an Azure Container for PyTorch (ACPT) environment with several built in optimizers and is described in more detail [here](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/blob/main/Environment/ACPT.md). This environment is built and ready to use under the 'Environments' tab in AzureML studio. Some optimizers included in the environment are:
+  The recommended environment for a large scale distributed training job is an Azure Container for PyTorch (ACPT) environment with several built-in optimizers; it is described in more detail [here](../Environment/ACPT.md). This environment is built and ready to use under the 'Environments' tab in AzureML studio. Some optimizers included in the environment are:
   - ONNX Runtime, built-in optimizations that deliver up to 1.4X faster training
   - DeepSpeed, which allows training trillion-parameter models at low cost by achieving excellent system throughput and efficient scaling to thousands of GPUs
   - MSCCL, an inter-accelerator communication framework that is built on top of NCCL
@@ -95,7 +95,7 @@ This guide will show best practices to allow you to train large models very effi
 - ### **Data Loading**
-  To load data in the most efficient way with large scale distributed training jobs, follow [this guide](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/blob/main/Data-loading/data-loading.md).
+  To load data in the most efficient way with large scale distributed training jobs, follow [this guide](../Data-loading/data-loading.md).
 ## Optimizations
 To achieve the best possible performance and resource utilization of jobs on AzureML, we employ several different optimization tools, showcased below.
 - ### **DeepSpeed**
@@ -128,7 +128,7 @@ To achive the best possible performance and resource utilization of jobs on Azur
 
   DeepSpeed features can be enabled, disabled, or configured using a config JSON file that should be specified as ``args.deepspeed_config``.
 
-  To include DeepSpeed in a job using the HuggingFace ``Trainer`` class, simply include the argument ``--deepspeed ds_config.json`` as part of the ``TrainerArguments`` class passed into the Trainer. Example code for Bert Pretraining with Deepspeed and the HuggingFace Trainer class is shown at [BERT pretraining guide](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training/Bert-Pretrain).
+  To include DeepSpeed in a job using the HuggingFace ``Trainer`` class, simply include the argument ``--deepspeed ds_config.json`` as part of the ``TrainingArguments`` passed into the Trainer, as sketched below. Example code for BERT pretraining with DeepSpeed and the HuggingFace Trainer class is shown in the [BERT pretraining guide](./Bert-Pretrain).
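For orientation, here is a hedged sketch of what that wiring looks like in code; the model name, the tiny stand-in dataset and the hyperparameter values are illustrative assumptions, not the example's actual script.
```
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # the example itself pretrains bert-large-uncased
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Tiny stand-in dataset so the sketch is self-contained.
dataset = Dataset.from_dict({"text": ["AzureML makes large scale training easier."] * 32})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True,
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,  # keep in sync with ds_config.json
    fp16=True,                      # assumes a GPU-backed compute target
    deepspeed="ds_config.json",     # same effect as passing --deepspeed ds_config.json
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer),
)
trainer.train()
```
When this is launched across the cluster with the ``deepspeed`` launcher or ``torchrun``, the Trainer hands optimizer and engine setup to DeepSpeed using the settings in ``ds_config.json``.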
  To include DeepSpeed in a job using a custom training loop, DeepSpeed will have to be initialized before the training loop as shown here:
@@ -160,7 +160,7 @@ To achive the best possible performance and resource utilization of jobs on Azur
   | DeBERTa | 1.5B | Not runnable | 140.587 (z = 1, gas = 1 mbs = 8) | 162.395 (z1_gas1_tmbspg11) | inf | 40 | 12 |
 
-  To learn how to use DeepSpeed Autotuning with AzureML, see [this tutorial](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training/DeepSpeed-Autotuning).
+  To learn how to use DeepSpeed Autotuning with AzureML, see [this tutorial](./DeepSpeed-Autotuning/README.md).
 
   When running the Bloom and BERT examples in this repo, the following results were found:
   | Metrics | Vanilla Pytorch | DeepSpeed + Autotuning|
@@ -192,7 +192,7 @@ To achive the best possible performance and resource utilization of jobs on Azur
   ```
   --optim adamw_ort_fused
   ```
-  This is an extra argument added with ORTTrainingArguments that applies the Fused Adam Optimizer to give a little extra performance gain. For a training example that uses ORT, See the [BERT Pretrain example](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training/Bert-Pretrain).
+  This is an extra argument added with ORTTrainingArguments that applies the Fused Adam Optimizer to give a little extra performance gain. For a training example that uses ORT, see the [BERT Pretrain example](./Bert-Pretrain/README.md).
 ## Monitoring
 - ### **Interactive Debugging**
   Machine learning model training is usually an iterative process and requires significant experimentation. With the Azure Machine Learning interactive job experience, we can access the container where the job is running and iterate on training scripts, monitor progress and even debug the job remotely on local machines.
@@ -216,7 +216,7 @@ To achive the best possible performance and resource utilization of jobs on Azur
   SSH Connections
 
-  For an example that enables these tools, see [here](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training/Bert-Pretrain).
+  For an example that enables these tools, see [here](./Bert-Pretrain/README.md).
 
   #### **JupyterLab**
   With JupyterLab, you can open a terminal and interact with the job container as well as iterate on your training script.
@@ -270,7 +270,7 @@ To achive the best possible performance and resource utilization of jobs on Azur
           self.prof.step()
   ```
   > NOTE: To make sure the Pytorch Profiler is visible with TensorBoard, we create a variable called `my_logs` (as shown in the above code) by passing an additional argument ``--tensorboard_log_dir "/outputs/runs/"`` to our training script. This path matches the ``logDir`` property under ``my_tensorboard`` in our yaml file for submitting the job.
-  See the [BERT Pretrain example](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training/Bert-Pretrain) for the full implementation of this code.
+  See the [BERT Pretrain example](./Bert-Pretrain/README.md) for the full implementation of this code.
 
   After the job starts running, go to the TensorBoard as described above and click on 'Pytorch Profiler'. This page will show the relevant resource utilization information.
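For orientation, here is a hedged sketch of what a ``ProfilerCallback`` along these lines might look like; the class name, the schedule values and the ``/outputs/runs/`` default mirror the description above but are illustrative assumptions rather than the example's exact implementation.
```
import torch
from transformers import TrainerCallback

class ProfilerCallback(TrainerCallback):
    def __init__(self, log_dir="/outputs/runs/"):
        # tensorboard_trace_handler writes traces that TensorBoard's profiler plugin can read.
        self.prof = torch.profiler.profile(
            schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1),
            on_trace_ready=torch.profiler.tensorboard_trace_handler(log_dir),
            profile_memory=True,
            with_stack=True,
        )

    def on_train_begin(self, args, state, control, **kwargs):
        self.prof.start()

    def on_step_end(self, args, state, control, **kwargs):
        # Advance the profiling schedule once per optimizer step.
        self.prof.step()

    def on_train_end(self, args, state, control, **kwargs):
        self.prof.stop()
```
A callback like this is registered on the Trainer (for example with ``trainer.add_callback(ProfilerCallback(my_logs))``), and the traces it writes under the ``--tensorboard_log_dir`` path are what make the 'Pytorch Profiler' tab appear in TensorBoard.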
@@ -326,7 +326,7 @@ Nebula Checkpointing improves on standard model checkpointing by saving models 1
 - ### **Pretraining a model**
   Pretraining a language model is the process of training a model on a large corpus of unlabeled text using self-supervision, which means that the model learns to predict some parts of the text from other parts. Pretraining helps the model learn general language knowledge and skills that can be useful for various downstream tasks. Pretraining from scratch means training a model from random initialization without using any existing pretrained models. Pretraining from scratch can be beneficial when you have a large amount of domain-specific data that differs significantly from general text corpora, or when you want to customize your model architecture or hyperparameters. However, pretraining from scratch can also be more costly and time-consuming than finetuning an existing pretrained model.
 - ### **BERT Pretrain**
-  [This example](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training/Bert-Pretrain) shows how to run a BERT pretraining job on AzureML.
+  [This example](./Bert-Pretrain/README.md) shows how to run a BERT pretraining job on AzureML.
   The following results were found using 2 ND40rs nodes with 8 V100 GPUs each.
 
   | Optimizations | Model size | GPU | MBS | Samples/Second | GPU memory utilized |
   | Vanilla Pytorch | 330M | 16 | 64 | 2431.02 | 49.4% |
   | DeepSpeed + Autotuning | 330M | 16 | 93 | 3369.37 | 64.5% |
 - ### **Bloom Pretrain**
-  [This example](https://github.com/microsoft/azureml-largescale-deeplearning-bestpractices/tree/main/Training/Bloom-Pretrain) shows how to pretrain the Bloom model in AzureML. The following results were found using 16 NVIDIA A100 80GB GPUs (2 nodes NVLink enabled).
+  [This example](./Bloom-Pretrain/README.md) shows how to pretrain the Bloom model in AzureML. The following results were found using 16 NVIDIA A100 80GB GPUs (2 nodes NVLink enabled).
   | Experiment | Model size | GPU Count | TP | PP | MBS | TFlops | Samples per second | GPU memory Utilized |
   |----|----|----|----|----|----|----|----|----|
   | 1 | 25B | 16 | 8 | 1 | 1 | 119.42 | 4.173 | 69.7% |