# Training Llama-3.1 8B with Megatron-LM

This tutorial demonstrates how to train the Llama-3.1 8B model with Megatron-LM using *mock data*. Llama-3.1 8B is a popular open-source large language model (LLM) designed to handle a wide range of natural language processing tasks efficiently. Learn more about the Llama models at [Llama's website](https://www.llama.com/).

Mock data provides a quick, lightweight way to exercise the training workflow and verify that your environment is correctly configured, without requiring a large dataset.

The training process uses Megatron-LM, a framework for pretraining and fine-tuning large-scale language models. For more information about Megatron-LM, see the [GitHub repository](https://github.com/NVIDIA/Megatron-LM). All steps are executed within a Docker container, which provides a ready-to-use environment with all necessary dependencies.

This tutorial builds on the setup completed in the [Pretraining with Megatron-LM tutorial](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/pretrain/setup_tutorial.html).

## Prerequisites

### Hugging Face API access

* Obtain an API token from [Hugging Face](https://huggingface.co) for downloading models.
* Ensure the Hugging Face API token has the necessary permissions and approval to access [Meta's Llama checkpoints](https://huggingface.co/meta-llama/Llama-3.1-8B).


## Prepare the training environment

After your system meets the prerequisites, follow these steps to set up the training environment.

### 1. Clone the Megatron-LM repository

Run the following command inside the Docker container to clone the Megatron-LM repository and check out the validated commit:

In [1]:
# Clone the Megatron-LM repository and check out the validated commit
!git clone https://github.com/ROCm/Megatron-LM && cd Megatron-LM && git checkout bb93ccbfeae6363c67b361a97a27c74ab86e7e92

fatal: destination path 'Megatron-LM' already exists and is not an empty directory.
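
The `fatal: destination path` message above appears when the cell is rerun after the repository has already been cloned. Because the commands are chained with `&&`, the checkout is then skipped as well. A minimal sketch of a rerunnable variant follows, assuming `git` is available on the `PATH`; the helper names `plan_checkout` and `ensure_checkout` are hypothetical and not part of Megatron-LM.

```python
import os
import subprocess

REPO_URL = "https://github.com/ROCm/Megatron-LM"
COMMIT = "bb93ccbfeae6363c67b361a97a27c74ab86e7e92"


def plan_checkout(repo_dir, repo_exists):
    """Return the git commands needed to reach the validated commit."""
    commands = []
    if not repo_exists:
        # Clone only when the working copy is missing.
        commands.append(["git", "clone", REPO_URL, repo_dir])
    # Always pin the working copy to the validated commit.
    commands.append(["git", "-C", repo_dir, "checkout", COMMIT])
    return commands


def ensure_checkout(repo_dir="Megatron-LM"):
    """Clone if needed, then check out the validated commit."""
    repo_exists = os.path.isdir(os.path.join(repo_dir, ".git"))
    for command in plan_checkout(repo_dir, repo_exists):
        subprocess.run(command, check=True)
```

Calling `ensure_checkout()` in a notebook cell should then be safe to repeat, since the clone step is skipped once `Megatron-LM/.git` exists.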


### 2. Install the required packages

In [2]:
try:
    import huggingface_hub
    print("huggingface_hub is already installed.")
except ImportError:
    !pip install huggingface_hub

try:
    import regex
    print("regex is already installed.")
except ImportError:
    !pip install regex

huggingface_hub is already installed.
regex is already installed.
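
As an alternative to one `try`/`except` block per package, the check can be written once over a list of names. This is a small sketch that assumes the import name matches the `pip` package name, which holds for `huggingface_hub` and `regex`; the helper `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


required = ["huggingface_hub", "regex"]
missing = missing_packages(required)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are already installed.")
```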


### 3. Provide your Hugging Face token

You need a Hugging Face API token to access Llama-3.1 8B. Generate a token by signing in to your account at [Hugging Face Tokens](https://huggingface.co/settings/tokens), then request access to Llama-3.1 8B. Tokens typically start with `hf_`.

Run the following interactive block in your Jupyter notebook to set up the token:

**Note**: Uncheck the "Add token as Git credential" option.

In [3]:
from huggingface_hub import notebook_login, HfApi

# Prompt the user to log in
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Verify that your token was accepted correctly:

In [4]:
try:
    api = HfApi()
    user_info = api.whoami()
    print(f"Token validated successfully! Logged in as: {user_info['name']}")
except Exception as e:
    print(f"Token validation failed. Error: {e}")

Token validated successfully! Logged in as: andyll7772
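
A successful `whoami()` call confirms that the token is valid, but not that access to the gated Llama repository has been approved. One way to probe this is to try downloading a single small file. The sketch below is illustrative only: `check_gated_access` is a hypothetical helper, the download function is passed in as an argument, and the default file name `config.json` is an assumption about the repository contents.

```python
def check_gated_access(repo_id, download, filename="config.json"):
    """Try to fetch one small file and return (ok, detail).

    `download` is any callable taking (repo_id, filename) and returning
    a local path, for example huggingface_hub.hf_hub_download.
    """
    try:
        local_path = download(repo_id, filename)
    except Exception as exc:  # e.g. a gated-repo or authentication error
        return False, f"{type(exc).__name__}: {exc}"
    return True, str(local_path)
```

In the notebook, this could be called as `check_gated_access("meta-llama/Llama-3.1-8B", hf_hub_download)` after `from huggingface_hub import hf_hub_download`. A `False` result with a gated-repository or authorization error suggests that the access request is still pending.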


## Run the training script

This section describes how to run the training script, with an explanation of the key parameters.

### Single-node training overview

The pre-configured script `examples/llama/train_llama3.sh` initializes and runs Llama-3.1 8B training with Megatron-LM on mock data, which exercises the full training pipeline. A successful run indicates that your environment is configured correctly before you move on to real datasets.

The script reads its configuration from environment variables, so check the values described below before running it.

### Key parameters for training

* **Batch size (`BS`)**: Global batch size across all GPUs, set to `64`.
* **Sequence length (`SEQ_LENGTH`)**: Input sequence length, set to `4096`.
* **Tensor parallelism (`TP`)**: Number of GPUs the model is split across, set to `8` to match the eight GPUs on the node.
* **Precision (`TE_FP8`)**: Set this to `0` for `BF16` precision.
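
These values interact with the micro-batch size (`MBS=2` in the launch command). The sketch below works through the arithmetic under the usual Megatron-LM convention that the data-parallel size is the world size divided by `TP * PP * CP`. It assumes pipeline and context parallelism of 1 (the `pp1_cp1` in the experiment path) and an 8-GPU node; `batch_layout` is a hypothetical helper, not part of the training script.

```python
def batch_layout(world_size, tp, mbs, global_batch, seq_length, pp=1, cp=1):
    """Derive data-parallel size, micro-batches and tokens per iteration."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tp * pp * cp")
    data_parallel = world_size // model_parallel
    samples_per_step = mbs * data_parallel
    if global_batch % samples_per_step != 0:
        raise ValueError("global batch must be divisible by mbs * data-parallel size")
    return {
        "data_parallel_size": data_parallel,
        "micro_batches_per_iteration": global_batch // samples_per_step,
        "tokens_per_iteration": global_batch * seq_length,
    }


layout = batch_layout(world_size=8, tp=8, mbs=2, global_batch=64, seq_length=4096)
print(layout)
# {'data_parallel_size': 1, 'micro_batches_per_iteration': 32, 'tokens_per_iteration': 262144}
```

With `TP=8` on eight GPUs there is a single data-parallel replica, so each iteration accumulates 32 micro-batches of 2 samples to reach the global batch of 64.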

### Start the training run

Use the following command to train the model on a single node:


In [5]:
!cd Megatron-LM && TEE_OUTPUT=1 MBS=2 BS=64 TP=8 TE_FP8=0 SEQ_LENGTH=4096  \
TOKENIZER_MODEL='meta-llama/Llama-3.1-8B' MODEL_SIZE='8' \
bash examples/llama/train_llama3.sh

NO_TRAINING=0
Single node setup, skipping NCCL and GLOO socket interface settings.
experiment/1nodes_rank0_train_8B_mbs2_bs64_tp8_pp1_cp1_iter10/TE_FP8_0/2025-06-09_22-25-07/output_perf.log
  @custom_fwd
  @custom_bwd
  @custom_fwd
  @custom_bwd
Traceback (most recent call last):
  File "/var/lib/jenkins/jupyter/Megatron-LM/pretrain_gpt.py", line 265, in <module>
    pretrain(
  File "/var/lib/jenkins/jupyter/Megatron-LM/megatron/training/training.py", line 245, in pretrain
    initialize_megatron(
  File "/var/lib/jenkins/jupyter/Megatron-LM/megatron/training/initialize.py", line 67, in initialize_megatron
    validate_args(args, args_defaults)
  File "/var/lib/jenkins/jupyter/Megatron-LM/megatron/training/arguments.py", line 181, in validate_args
    assert args.world_size % total_model_size == 0, (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: world size (1) is not divisible by total_model_size (encoder_model_size=0 + decoder_model_size=8)
E0609 22:25:34.678000 
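
The recorded output above ends in an `AssertionError`: the run saw a world size of 1, while `TP=8` requires the world size to be a multiple of 8. A preflight check along these lines can catch the mismatch before launching. It is only a sketch that mirrors the divisibility condition in the traceback, assuming the total model size is `TP * PP * CP`; `valid_tp_sizes` and `preflight` are hypothetical helpers.

```python
def valid_tp_sizes(world_size, pp=1, cp=1):
    """List TP values for which world_size is divisible by tp * pp * cp."""
    return [tp for tp in range(1, world_size + 1) if world_size % (tp * pp * cp) == 0]


def preflight(world_size, tp, pp=1, cp=1):
    """Return a short message describing whether the layout passes the check."""
    total_model_size = tp * pp * cp
    if world_size % total_model_size == 0:
        return f"OK: world size {world_size} fits total model size {total_model_size}"
    return (
        f"world size ({world_size}) is not divisible by total model size "
        f"({total_model_size}); TP values passing this check: "
        f"{valid_tp_sizes(world_size, pp, cp)}"
    )


print(preflight(world_size=1, tp=8))
print(preflight(world_size=8, tp=8))
```

In the container, `torch.cuda.device_count()` is one way to obtain the number of visible GPUs to pass as `world_size` for a single-node run. If it reports fewer than eight devices, either expose all GPUs to the container or lower `TP` accordingly.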

### Additional details about the command

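This command sets the following environment variables for the training script:

* **`TEE_OUTPUT=1`**: Prints the training output to the console in addition to saving it to the log file.
* **`MBS=2`**: Micro-batch size per GPU.
* **`BS=64`**: Total (global) batch size across all GPUs.
* **`TP=8`**: Tensor-parallel size, the number of GPUs the model is split across.
* **`TE_FP8=0`**: Sets the precision to `BF16` for training.
* **`SEQ_LENGTH=4096`**: Maximum input sequence length.
* **`TOKENIZER_MODEL='meta-llama/Llama-3.1-8B'`**: Hugging Face model ID of the tokenizer to use.
* **`MODEL_SIZE='8'`**: Selects the 8B model configuration.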

The training script does the following:
* Uses mock data as input.
* Trains the Llama-3.1 8B model with the specified configurations.

You can customize these parameters for your hardware and desired configuration by changing the environment variable values in the command.
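
When experimenting with several configurations, it can be convenient to launch the script from Python rather than editing the shell command each time. The following sketch assumes the repository was cloned to `Megatron-LM` in the current directory; `DEFAULTS`, `build_env`, and `launch` are hypothetical names, and the default values are the ones used in this tutorial.

```python
import os
import subprocess

# Values taken from the launch command in this tutorial.
DEFAULTS = {
    "TEE_OUTPUT": "1",
    "MBS": "2",
    "BS": "64",
    "TP": "8",
    "TE_FP8": "0",
    "SEQ_LENGTH": "4096",
    "TOKENIZER_MODEL": "meta-llama/Llama-3.1-8B",
    "MODEL_SIZE": "8",
}


def build_env(overrides=None, base=None):
    """Merge the base environment, the tutorial defaults and any overrides."""
    env = dict(os.environ if base is None else base)
    env.update(DEFAULTS)
    env.update({key: str(value) for key, value in (overrides or {}).items()})
    return env


def launch(overrides=None, repo_dir="Megatron-LM"):
    """Run the training script with the merged environment."""
    return subprocess.run(
        ["bash", "examples/llama/train_llama3.sh"],
        cwd=repo_dir,
        env=build_env(overrides),
        check=False,  # inspect returncode instead of raising
    )
```

For example, `launch({"MBS": 1, "BS": 32})` would rerun the script with a smaller batch configuration; these override values are illustrative, not tuned recommendations.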

## Monitor the training progress

Watch the output logs during training for the following:

* **Iteration progress**: The number of completed iterations.
* **Loss values**: Indicate the model's learning progress. Lower values suggest better learning.
* **GPU utilization**: Shows whether your hardware resources are being used effectively.

Logs are printed to the console and saved to an `output_perf.log` file under the `experiment/` directory. The script prints the full log path when it starts.
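
To inspect the most recent run from a notebook cell, the log file can be located and its last lines printed. This sketch assumes the `experiment/` directory is created inside `Megatron-LM` and that the log file is named `output_perf.log`, as in the path printed at the start of the run; `latest_log` and `tail` are hypothetical helpers.

```python
import glob
import os
from collections import deque


def latest_log(root="Megatron-LM/experiment", name="output_perf.log"):
    """Return the most recently modified log file under root, or None."""
    pattern = os.path.join(root, "**", name)
    paths = glob.glob(pattern, recursive=True)
    return max(paths, key=os.path.getmtime) if paths else None


def tail(path, n=20):
    """Return the last n lines of a text file without trailing newlines."""
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]
```

A typical use is `path = latest_log()` followed by `print("\n".join(tail(path)))` when `path` is not `None`.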

## Key notes

* Mock data is for validation only. To use a different dataset, see the [Pretraining with Megatron-LM tutorial](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/pretrain/setup_tutorial.html).
* Tune the hyperparameters based on your hardware. The hyperparameter configuration in this tutorial is based on one node with 8x MI300X GPUs.
* This example illustrates how to run a training task on a single node. For multi-node training instructions, see the [Pretraining with Megatron-LM tutorial](https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/pretrain/setup_tutorial.html).
* Review the logs to confirm that the run completed without errors.