diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/README.md b/training/a4/llama3-1-70b/nemo-pretraining-gke/README.md
deleted file mode 100644
index 4982e1a..0000000
--- a/training/a4/llama3-1-70b/nemo-pretraining-gke/README.md
+++ /dev/null
@@ -1,402 +0,0 @@
-# Pretrain Llama-3.1-70B workloads on A4 GKE Node pools with the NVIDIA NeMo Framework
-
-This recipe outlines the steps for running a Llama-3.1-70B pretraining workload
-on [A4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the
-[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).
-
-## Orchestration and deployment tools
-
-For this recipe, the following setup is used:
-
-- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
-- Pretraining job configuration and deployment - a Helm chart is used to configure and deploy
-  the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
-  resource, which manages the execution of the
-  [NeMo pretraining workload](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_pretraining.py).
-
-## Test environment
-
-This recipe has been optimized for and tested with the following configuration:
-
-- GKE cluster
-  - [A regional standard cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview) version 1.31.7-gke.1265000 or later.
-  - A GPU node pool with 32 or 64
-    [a4-highgpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-high-vms) machines provisioned using the DENSE deployment type.
-  - [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled.
-  - [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled.
-  - [DCGM metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics) enabled.
-  - [Kueue](https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/) and [JobSet](https://jobset.sigs.k8s.io/docs/overview/) APIs installed.
-  - Kueue configured to support [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/).
-- A regional Google Cloud Storage (GCS) bucket to store logs generated by the recipe runs.
-
-To prepare the required environment, see the
-[GKE environment setup guide](../../../../docs/configuring-environment-gke-a4.md).
-
-## Training dataset
-
-This recipe uses a mock pretraining dataset provided by the NeMo framework.
-
-## Docker container image
-
-This recipe uses the following Docker image:
-`us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo25.02-gib1.0.5-A4`.
-
-This image is based on NVIDIA NeMo 25.02 and contains the NCCL gIB plugin
-v1.0.5, bundling all NCCL binaries validated for use with A4 GPUs.
-
-## Run the recipe
-
-From your client workstation, complete the following steps:
-
-### Configure environment settings
-
-Set the environment variables to match your environment:
-
-  ```bash
-  export PROJECT_ID=<PROJECT_ID>
-  export CLUSTER_REGION=<CLUSTER_REGION>
-  export CLUSTER_NAME=<CLUSTER_NAME>
-  export GCS_BUCKET=<GCS_BUCKET>
-  export KUEUE_NAME=<KUEUE_NAME>
-  ```
-
-Replace the following values:
-
-  - `<PROJECT_ID>`: your Google Cloud project ID.
-  - `<CLUSTER_REGION>`: the region where your cluster is located.
-  - `<CLUSTER_NAME>`: the name of your GKE cluster.
-  - `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
-  - `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the cluster toolkit is `a4`.
Make sure to verify the name of the local queue in your cluster. - -Set the default project: - - ```bash - gcloud config set project $PROJECT_ID - ``` - -### Get the recipe - -Clone the `gpu-recipes` repository and set a reference to the recipe folder. - -``` -git clone https://github.com/ai-hypercomputer/gpu-recipes.git -cd gpu-recipes -export REPO_ROOT=`git rev-parse --show-toplevel` -export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-70b/nemo-pretraining-gke -cd $RECIPE_ROOT -``` - -### Get cluster credentials - -``` -gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION -``` - -### Configure and submit a pretraining job - -#### Using 32 nodes (256 GPUs) FP8 precision - -The default job setting is 15 training steps and fp8 precision. To execute the -job with the default settings, run the following command from your client: - -```bash -helm install -f $RECIPE_ROOT/values.yaml \ - --set-file workload_launcher=$REPO_ROOT/src/launchers/nemo-10-launcher.sh \ - --set-file workload_config=$REPO_ROOT/src/frameworks/a4/nemo-configs/llama3-1-70b-256gpus-a4-fp8.yaml \ - --set queue=${KUEUE_NAME} \ - --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ - $USER-llama-3-1-70b-nemo-fp8 \ - $REPO_ROOT/src/helm-charts/a4/jobset -``` - -#### Using 32 nodes (256 GPUs) BF16 precision - -The default job setting is 15 training steps and bf16 precision. To execute the -job with the default settings, run the following command from your client: - -```bash -helm install -f $RECIPE_ROOT/values.yaml \ - --set-file workload_launcher=$REPO_ROOT/src/launchers/nemo-10-launcher.sh \ - --set-file workload_config=$REPO_ROOT/src/frameworks/a4/nemo-configs/llama3-1-70b-256gpus-a4-bf16.yaml \ - --set queue=${KUEUE_NAME} \ - --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ - $USER-llama-3-1-70b-nemo-bf16 \ - $REPO_ROOT/src/helm-charts/a4/jobset -``` - -#### Using 64 nodes (512 GPUs) FP8 precision - -The default job setting is 15 training steps and fp8 precision. To execute the -job with the default settings, run the following command from your client: - -```bash -helm install -f $RECIPE_ROOT/values-64-128-nodes.yaml \ - --set-file workload_launcher=$REPO_ROOT/src/launchers/nemo-10-launcher.sh \ - --set-file workload_config=$REPO_ROOT/src/frameworks/a4/nemo-configs/llama3-1-70b-512gpus-a4-fp8.yaml \ - --set queue=${KUEUE_NAME} \ - --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ - $USER-llama-3-1-70b-nemo-fp8 \ - $REPO_ROOT/src/helm-charts/a4/jobset -``` - -#### Using 64 nodes (512 GPUs) BF16 precision - -The default job setting is 15 training steps and bf16 precision. To execute the -job with the default settings, run the following command from your client: - -```bash -helm install -f $RECIPE_ROOT/values-64-128-nodes.yaml \ - --set-file workload_launcher=$REPO_ROOT/src/launchers/nemo-10-launcher.sh \ - --set-file workload_config=$REPO_ROOT/src/frameworks/a4/nemo-configs/llama3-1-70b-512gpus-a4-bf16.yaml \ - --set queue=${KUEUE_NAME} \ - --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ - $USER-llama-3-1-70b-nemo-bf16 \ - $REPO_ROOT/src/helm-charts/a4/jobset -``` - -#### Using 128 nodes (1024 GPUs) FP8 precision - -The default job setting is 15 training steps and fp8 precision. 
To execute the
-job with the default settings, run the following command from your client:
-
-```bash
-helm install -f $RECIPE_ROOT/values-64-128-nodes.yaml \
-    --set-file workload_launcher=$REPO_ROOT/src/launchers/nemo-10-launcher.sh \
-    --set-file workload_config=$REPO_ROOT/src/frameworks/a4/nemo-configs/llama3-1-70b-1024gpus-a4-fp8.yaml \
-    --set queue=${KUEUE_NAME} \
-    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
-    --set workload.gpus=1024 \
-    $USER-llama-3-1-70b-nemo-fp8 \
-    $REPO_ROOT/src/helm-charts/a4/jobset
-```
-
-#### Using 128 nodes (1024 GPUs) BF16 precision
-
-The default job setting is 15 training steps and bf16 precision. To execute the
-job with the default settings, run the following command from your client:
-
-```bash
-helm install -f $RECIPE_ROOT/values-64-128-nodes.yaml \
-    --set-file workload_launcher=$REPO_ROOT/src/launchers/nemo-10-launcher.sh \
-    --set-file workload_config=$REPO_ROOT/src/frameworks/a4/nemo-configs/llama3-1-70b-1024gpus-a4-bf16.yaml \
-    --set queue=${KUEUE_NAME} \
-    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
-    --set workload.gpus=1024 \
-    $USER-llama-3-1-70b-nemo-bf16 \
-    $REPO_ROOT/src/helm-charts/a4/jobset
-```
-
-#### Configure job settings
-
-You can override any of the default settings for this job in the following NeMo configuration files:
-
-- [NeMo configurations 32 nodes fp8](../../../../src/frameworks/a4/nemo-configs/llama3-1-70b-256gpus-a4-fp8.yaml)
-- [NeMo configurations 32 nodes bf16](../../../../src/frameworks/a4/nemo-configs/llama3-1-70b-256gpus-a4-bf16.yaml)
-- [NeMo configurations 64 nodes fp8](../../../../src/frameworks/a4/nemo-configs/llama3-1-70b-512gpus-a4-fp8.yaml)
-- [NeMo configurations 64 nodes bf16](../../../../src/frameworks/a4/nemo-configs/llama3-1-70b-512gpus-a4-bf16.yaml)
-- [NeMo configurations 128 nodes fp8](../../../../src/frameworks/a4/nemo-configs/llama3-1-70b-1024gpus-a4-fp8.yaml)
-- [NeMo configurations 128 nodes bf16](../../../../src/frameworks/a4/nemo-configs/llama3-1-70b-1024gpus-a4-bf16.yaml)
-
-To do this, set the new arguments by using `--set workload.arguments`.
-
-**Examples**
-
-- To set the number of training steps to 100, run the following command from
-  your client:
-
-  ```bash
-  helm install -f $RECIPE_ROOT/values.yaml \
-    --set-file workload_launcher=$REPO_ROOT/src/launchers/nemo-10-launcher.sh \
-    --set-file workload_config=$REPO_ROOT/src/frameworks/a4/nemo-configs/llama3-1-70b-256gpus-a4-fp8.yaml \
-    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
-    --set queue=${KUEUE_NAME} \
-    --set workload.arguments[0]="trainer.max_steps=100" \
-    $USER-llama-3-1-70b-nemo-fp8 \
-    $REPO_ROOT/src/helm-charts/a4/jobset
-  ```
-
-### Monitor the job
-
-To check the status of pods in your job, run the following command:
-
-```
-kubectl get pods | grep JOB_NAME_PREFIX
-```
-
-Replace the following:
-
-- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-llama-3-1-70b-nemo-fp8`.
-
-To get the logs for one of the pods, run the following command:
-
-```
-kubectl logs POD_NAME
-```
-
-Information about the training job's progress, including crucial details such as loss,
-step count, and step time, is generated by the rank 0 process.
-This process runs on the pod whose name begins with `JOB_NAME_PREFIX-workload-0-0`.
-For example: `user-llama-3-1-70b-nemo-fp8-workload-0-0-s9zrv`.
-
-### Analyze results
-
-When completed, the job creates several artifacts, including logs and traces,
-and places them in the configured Google Cloud Storage bucket as follows:
-
-```
-gs://${GCS_BUCKET}/nemo-experiments/
-├── hparams.yaml
-├── lightning_logs.txt
-├── nemo_error_logs.txt
-├── nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt
-├── dllogger
-│   ├── rank-0
-│   │   ├── dllogger.json
-...
-```
-
-- `hparams.yaml`: the NeMo configuration used by the pretraining script. This
-  includes the combined
-  [configuration file](../../../../src/frameworks/a4/nemo-configs/llama3-1-70b-256gpus-a4-fp8.yaml)
-  and the command line overrides.
-- `lightning_logs.txt`: the log files generated by PyTorch Lightning, which is
-  used by NeMo.
-- `nemo_error_logs.txt`: the warning and error logs generated by NeMo.
-- `nemo_log_globalrank-[RANK]_localrank-[LOCAL].txt`: the NeMo logs for each
-  rank.
-- `dllogger/`: the log captured by [NVIDIA
-  DLLogger](https://github.com/NVIDIA/dllogger). DLLogger is configured to
-  store logs on the rank 0 node. The log is in JSON format and includes loss,
-  step_time, and other key metrics for each training step.
-
-The JOB_ID has the following format:
-`$USER-llama-3-1-70b-nemo-[YYYY]-[MM]-[DD]-[hh]-[mm]-[ss]`, where the suffix
-of the ID is the date and time when the job was started.
-
-Here is an example of an entry in the DLLogger log:
-
-```json
-DLLL
-{
-  "timestamp": "1742531120.867155",
-  "datetime": "2025-03-21 04:25:20.867155",
-  "elapsedtime": "416.858187",
-  "type": "LOG",
-  "step": 11,
-  "data":
-  {
-    "reduced_train_loss": 2.589764356613159,
-    "lr": 8.249999723375367e-07,
-    "global_step": 11.0,
-    "consumed_samples": 24576.0,
-    "train_backward_timing in s": 4.3010711669921876e-05,
-    "train_step_timing in s": 19.954481744766234,
-    "epoch": 0
-  }
-}
-```
-
-The DLLogger log can be used to calculate the Model FLOPS Utilization (MFU)
-metric, as described in the next section.
-
-### Calculate training performance metrics (MFU, TFLOPS, Average Step Time)
-
-This section explains how to calculate key training performance metrics, such as
-Model FLOPS Utilization (MFU), using the `dllogger.json` file generated during
-training.
-
-We provide a tool called
-[training_metrics](../../../../src/utils/training_metrics/) to help you easily
-compute these metrics. This tool can calculate the following metrics:
-
-- *MFU*: Model FLOPS Utilization
-- *Average training step time*: the average time taken for each training step
-- *TFLOPS per GPU*: the number of Tera Floating Point Operations per second
-  achieved by each GPU
-
-To calculate training performance metrics using the `training_metrics` tool,
-complete the following steps from your client:
-
-1. Download the `dllogger.json` file. The `dllogger.json` file is generated
-   during the training session.
-
-   To download the file, run the following command. Replace `<JOB_ID>` with the
-   ID of your training session.
-
-   ```bash
-   gcloud storage cp gs://${GCS_BUCKET}/nemo-experiments/megatron_gpt/<JOB_ID>/dllogger/rank-0/dllogger.json \
-     $RECIPE_ROOT/dllogger.json
-   ```
-
-2. Run the
-   [`process_training_results.py`](../../../../src/utils/training_metrics/process_training_results.py)
-   script:
-
-   ```bash
-   cd $REPO_ROOT/src/utils/training_metrics
-   python3 process_training_results.py --file $RECIPE_ROOT/dllogger.json \
-     --batch_size 2048 \
-     --num_accelerators 256 \
-     --precision fp8 \
-     --model_type llama3.1-70b \
-     --accelerator_type b200
-   ```
-
-**Note:** The `batch_size`, `num_accelerators`, `precision`, `model_type` and
-`accelerator_type` values shown are specific to this recipe running the default
-configuration. By default, the average step time is computed using steps 10 to
-30.
-
-For more detailed information and advanced usage instructions for this tool, see
-the [full documentation](../../../../src/utils/training_metrics/README.md).
-
-### Troubleshooting
-
-This section provides guidance on troubleshooting issues with the training job.
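-
-Because the training run is managed as a JobSet, a quick first check is to look
-at the JobSet resource itself before inspecting individual pods. The commands
-below are a minimal sketch; they assume the JobSet API from the test environment
-is installed, and `JOBSET_NAME` is a placeholder for the JobSet created by your
-Helm release, which typically starts with your job name prefix:
-
-```bash
-# List JobSet resources and their overall status
-kubectl get jobsets
-
-# Show detailed status and recent events for a specific JobSet
-kubectl describe jobset JOBSET_NAME
-```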
-
-To check the status of the job's pods, use the following command:
-
-```bash
-kubectl get pods | grep JOB_NAME_PREFIX
-```
-
-Replace `JOB_NAME_PREFIX` with the prefix of your job name. For example, `$USER-llama-3-1-70b-nemo`. This command will list all pods associated with the specified job, along with their current status.
-
-To get the logs from a specific pod, use the following command:
-
-```bash
-kubectl logs POD_NAME
-```
-
-Replace `POD_NAME` with the name of the pod you want to inspect.
-
-In this recipe, the training job is orchestrated by the [Kubernetes JobSet](https://jobset.sigs.k8s.io/docs/overview/). If the JobSet encounters a fatal failure, it removes all pods, making it impossible to inspect their logs directly. To analyze logs from a failed job, retrieve them from Cloud Logging using the following filter:
-
-```
-resource.type="k8s_container"
-resource.labels.project_id="PROJECT_ID"
-resource.labels.location="CLUSTER_REGION"
-resource.labels.cluster_name="CLUSTER_NAME"
-resource.labels.namespace_name="default"
-resource.labels.pod_name=~"^JOB_NAME_PREFIX.*"
-severity>=DEFAULT
-```
-
-Replace the following:
-
-- `PROJECT_ID`: your Google Cloud project ID.
-- `CLUSTER_REGION`: the region where your cluster is located.
-- `CLUSTER_NAME`: the name of your GKE cluster.
-- `JOB_NAME_PREFIX`: the prefix of your job name (for example, `$USER-llama-3-1-70b-nemo`).
-
-This filter will retrieve logs from all containers within pods that match the job with the specified name prefix.
-
-### Uninstall the Helm release
-
-You can delete the job and other resources created by the Helm chart. To
-uninstall the Helm releases, run the following commands from your client:
-
-```bash
-helm uninstall $USER-llama-3-1-70b-nemo-fp8
-helm uninstall $USER-llama-3-1-70b-nemo-bf16
-```
diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/values-64-128-nodes.yaml b/training/a4/llama3-1-70b/nemo-pretraining-gke/values-64-128-nodes.yaml
deleted file mode 100644
index 8d7f86e..0000000
--- a/training/a4/llama3-1-70b/nemo-pretraining-gke/values-64-128-nodes.yaml
+++ /dev/null
@@ -1,68 +0,0 @@
-# Copyright 2025 Google LLC
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
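-
-# Values in this file configure the 64-node (512 GPU) and 128-node (1024 GPU)
-# runs described in the README. Empty fields (for example `queue` and the
-# first `bucketName`) are supplied at install time with `--set queue=...` and
-# `--set volumes.gcsMounts[0].bucketName=...`; for 128-node runs the README
-# also sets `--set workload.gpus=1024`.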
- -queue: -dwsSettings: - maxRunDurationSeconds: - -tasSettings: - topologyRequest: - kueue.x-k8s.io/podset-preferred-topology: "kubernetes.io/hostname" - -volumes: - gcsVolumes: true - psVolumes: false - gcsMounts: - - bucketName: - mountPath: "/job-logs" - - bucketName: cloud-samples-data - mountPath: "/artifacts" - mountOptions: "implicit-dirs" - -workload: - gpus: 512 # This should be one of: {<= 8, multiple of 8} - image: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo25.02-gib1.0.5-A4 - defaultArguments[]: - arguments[]: - configFile: nemo-config.yaml - configPath: /workload/configs - envs: - - name: NEMO_CONFIG_PATH - value: "/workload/configs" - - name: NEMO_CONFIG_NAME - value: "nemo-config.yaml" - - name: EXPERIMENT_NAME - value: "nemo-experiments" - - name: EXPERIMENT_ROOT_DIR - value: "/job-logs" - - name: NVTE_FWD_LAYERNORM_SM_MARGIN - value: "8" - - name: NVTE_BWD_LAYERNORM_SM_MARGIN - value: "8" - - name: GLOO_SOCKET_IFNAME - value: "eth0" - - name: TOKENIZER_PATH - value: "/artifacts/third-party/tokenizers/gpt2" - - name: NEMO_LAUNCH_SCRIPT - value: "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py" - - name: TORCH_DISTRIBUTED_TRACING - value: "ALL" - -network: - hostNetwork: true - gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.0.5 - subnetworks[]: - ncclSettings: - - name: NCCL_DEBUG - value: "VERSION" diff --git a/training/a4/llama3-1-70b/nemo-pretraining-gke/values.yaml b/training/a4/llama3-1-70b/nemo-pretraining-gke/values.yaml deleted file mode 100644 index 3b86e81..0000000 --- a/training/a4/llama3-1-70b/nemo-pretraining-gke/values.yaml +++ /dev/null @@ -1,70 +0,0 @@ -# Copyright 2025 Google LLC -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -queue: -dwsSettings: - maxRunDurationSeconds: - -tasSettings: - topologyRequest: - kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-block" - -volumes: - gcsVolumes: true - psVolumes: false - gcsMounts: - - bucketName: - mountPath: "/job-logs" - - bucketName: cloud-samples-data - mountPath: "/artifacts" - mountOptions: "implicit-dirs" - -workload: - gpus: 256 # This should be one of: {<= 8, multiple of 8} - image: us-central1-docker.pkg.dev/deeplearning-images/reproducibility/pytorch-gpu-nemo-nccl:nemo25.02-gib1.0.5-A4 - defaultArguments[]: - arguments[]: - configFile: nemo-config.yaml - configPath: /workload/configs - envs: - - name: NEMO_CONFIG_PATH - value: "/workload/configs" - - name: NEMO_CONFIG_NAME - value: "nemo-config.yaml" - - name: EXPERIMENT_NAME - value: "nemo-experiments" - - name: EXPERIMENT_ROOT_DIR - value: "/job-logs" - - name: NVTE_FWD_LAYERNORM_SM_MARGIN - value: "8" - - name: NVTE_BWD_LAYERNORM_SM_MARGIN - value: "8" - - name: GLOO_SOCKET_IFNAME - value: "eth0" - - name: TOKENIZER_PATH - value: "/artifacts/third-party/tokenizers/gpt2" - - name: NEMO_LAUNCH_SCRIPT - value: "/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py" - - name: TORCH_DISTRIBUTED_TRACING - value: "ALL" - -network: - hostNetwork: true - gibVersion: us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.0.5 - subnetworks[]: - ncclSettings: - - name: NCCL_DEBUG - value: "VERSION" - - name: NVTE_UB_SOCKET_IFNAME - value: "eth1"