# Fine-Tune Llama 3.1 405B on a Single Node with PyTorch FSDP and Q-LoRA

The release of [Llama 3.1 405B](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B) marks a significant change in the landscape of large language models, setting a new benchmark for performance in general knowledge, reasoning, and multilingual tasks. As Meta's largest open-source model, Llama 3.1 405B competes directly with proprietary models like GPT-4 and Claude 3.5 Sonnet, offering frontier-level capabilities at a more accessible price point.

This blog post will guide you through the process of fine-tuning Llama 3.1 405B using PyTorch FSDP and Q-LoRA, supported by Hugging Face's [TRL](https://huggingface.co/docs/trl/index), [Transformers](https://huggingface.co/docs/transformers/index), [peft](https://huggingface.co/docs/peft/index) & [datasets](https://huggingface.co/docs/datasets/index). We will also integrate [Flash Attention v2](https://github.com/Dao-AILab/flash-attention) for enhanced performance.

1. **Setup development environment**
2. **Create and prepare the dataset**
3. **Fine-tune the LLM with PyTorch FSDP, Q-LoRA, and SDPA**
4. **Test Model and run Inference**

_Note: This example is optimized for NVIDIA H100 and A100 GPUs. Adjustments can be made for different hardware configurations._ 

**Background on FSDP and Q-LoRA**

FSDP enables efficient model sharding across GPUs, allowing the training of large models like Llama 3.1 405B. Q-LoRA reduces computational and memory requirements by combining quantization and low-rank adaptation. Using the two together is possible thanks to a collaboration between [Answer.AI](https://www.answer.ai/posts/2024-03-06-fsdp-qlora.html), [Tim Dettmers](https://github.com/TimDettmers/bitsandbytes), and [Hugging Face](https://huggingface.co/).

For more details on these techniques, refer to the following resources:

* [PyTorch FSDP](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) is a data/model parallelism technique that shards the model across GPUs, reducing memory requirements and enabling the training of larger models more efficiently.
* Q-LoRA is a fine-tuning method that leverages quantization and Low-Rank Adapters to efficiently reduce computational requirements and memory footprint.
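
To make the motivation concrete, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. It assumes a nominal 405 billion parameters and a node with 8 GPUs, and it ignores quantization metadata, activations, the LoRA adapters and optimizer states, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 405e9  # nominal parameter count of Llama 3.1 405B
NUM_GPUS = 8        # assumption: a single node with 8 GPUs

bf16_total = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
nf4_total = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit: {bf16_total:.1f} GB total, {bf16_total / NUM_GPUS:.1f} GB per GPU when sharded")
print(f" 4-bit: {nf4_total:.1f} GB total, {nf4_total / NUM_GPUS:.1f} GB per GPU when sharded")
```

Under these assumptions, the sharded 16-bit weights alone would already exceed the 80GB of an H100, while the sharded 4-bit weights leave room for activations and adapter training.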


## 1. Setup development environment

Our first step is to install the Hugging Face libraries and PyTorch, including trl, transformers and datasets. If you haven't heard of trl yet, don't worry. It is a new library on top of transformers and datasets that makes it easier to fine-tune, apply RLHF to, and align open LLMs.

In [1]:
# Install Pytorch for FSDP and FA/SDPA
%pip install "torch==2.4.0" tensorboard
# Install Hugging Face libraries
%pip install  --upgrade "transformers==4.44.0" "datasets==2.21.0" "accelerate==0.33.0" "evaluate==0.4.1" "bitsandbytes==0.43.3" "huggingface_hub==0.24.2" "trl==0.9.6" "peft==0.12.0" "hf_transfer==0.1.8" "flash-attn==2.6.3"
# If you are running on an AWS cluster, you might need to update the NCCL version
%pip install nvidia-nccl-cu12==2.22.3 --upgrade

# Install transformers from git commit to support pre-quantized weights
%pip install git+https://github.com/huggingface/transformers.git@c409cd81777fb27aadc043ed3d8339dbc020fb3b


Next, we need to log in to Hugging Face to access the Llama 3.1 405B model. If you don't have an account yet, you can create one [here](https://huggingface.co/join). You also need to have [accepted the terms](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B) of the model.

In [None]:
!huggingface-cli login --token ""

## 2. Create and prepare the dataset

After our environment is set up, we can start creating and preparing our dataset. A fine-tuning dataset should have a diverse set of demonstrations of the task you want to solve. If you want to learn more about how to create a dataset, take a look at the [How to Fine-Tune LLMs in 2024 with Hugging Face](https://www.philschmid.de/fine-tune-llms-in-2024-with-trl#3-create-and-prepare-the-dataset).

We will use the [HuggingFaceH4/no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) dataset, a high-quality dataset of 10,000 instructions and demonstrations created by skilled human annotators. This data can be used for supervised fine-tuning (SFT) to make language models follow instructions better. No Robots was modelled after the instruction dataset described in OpenAI's [InstructGPT paper](https://huggingface.co/papers/2203.02155), and consists mostly of single-turn instructions.

```json
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

The [no_robots](https://huggingface.co/datasets/HuggingFaceH4/no_robots) dataset has 10,000 samples split into 9,500 training and 500 test examples. Some samples do not include a `system` message. We will load the dataset with the `datasets` library, add the missing `system` messages, and save the splits to separate JSON files.

In [1]:
from datasets import load_dataset

# Convert dataset to OAI messages
system_message = """You are Llama, an AI assistant created by Philipp to be helpful and honest. Your knowledge spans a wide range of topics, allowing you to engage in substantive conversations and provide analysis on complex subjects."""

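def create_conversation(sample):
    # Keep samples that already start with a system message unchanged
    if sample["messages"][0]["role"] == "system":
        return sample
    # Otherwise prepend the default system message
    sample["messages"] = [{"role": "system", "content": system_message}] + sample["messages"]
    return sample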

# Load dataset from the hub
dataset = load_dataset("HuggingFaceH4/no_robots")

# Add system message to each conversation
columns_to_remove = list(dataset["train"].features)
columns_to_remove.remove("messages")
dataset = dataset.map(create_conversation, remove_columns=columns_to_remove, batched=False)

# Filter out corrupted conversations: after the system message, keep only those with an even number of turns (user/assistant pairs)
dataset["train"] = dataset["train"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)
dataset["test"] = dataset["test"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)

# save datasets to disk 
dataset["train"].to_json("train_dataset.json", orient="records", force_ascii=False)
dataset["test"].to_json("test_dataset.json", orient="records", force_ascii=False)
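
To illustrate what the two preparation steps do, here is a minimal sketch on toy data that uses plain Python lists instead of the `datasets` library. The placeholder system prompt and the helper names `add_system_message` and `has_complete_turns` are made up for this illustration and are not part of the cell above.

```python
SYSTEM_MESSAGE = "You are a helpful assistant."  # hypothetical placeholder prompt


def add_system_message(messages):
    """Return the conversation with a system message guaranteed at index 0."""
    if messages and messages[0]["role"] == "system":
        return messages
    return [{"role": "system", "content": SYSTEM_MESSAGE}] + messages


def has_complete_turns(messages):
    """True when the turns after the system message come in pairs."""
    return len(messages[1:]) % 2 == 0


toy_samples = [
    # No system message: one gets added
    [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
    # Already has a system message: left unchanged
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ],
    # No assistant reply: odd number of turns, gets filtered out
    [{"role": "user", "content": "Hi"}],
]

prepared = [add_system_message(m) for m in toy_samples]
kept = [m for m in prepared if has_complete_turns(m)]
print(f"kept {len(kept)} of {len(toy_samples)} conversations")
```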


## 3. Fine-tune Llama 405B with PyTorch FSDP, Q-LoRA and SDPA

We are now ready to fine-tune our model with PyTorch FSDP, Q-LoRA and SDPA. Since we are running in a distributed setup, we need to use `torchrun` and a Python script to start the training.

We prepared a script [run_fsdp_qlora.py](./scripts/run_fsdp_qlora.py) which will load the dataset from disk, prepare the model and tokenizer, and start the training. It uses the [SFTTrainer](https://huggingface.co/docs/trl/sft_trainer) from `trl` to fine-tune our model. The `SFTTrainer` makes it straightforward to run supervised fine-tuning of open LLMs, supporting:
* Dataset formatting, including conversational and instruction format (✅ used)
* Training on completions only, ignoring prompts (❌ not used)
* Packing datasets for more efficient training (✅ used)
* PEFT (parameter-efficient fine-tuning) support including Q-LoRA (✅ used)
* Preparing the model and tokenizer for conversational fine-tuning (❌ not used, see below)

_Note: We are using an Anthropic/Vicuna-like chat template with `User:` and `Assistant:` roles. This is done because the special tokens in base Llama 3 (`<|begin_of_text|>` or `<|reserved_special_token_XX|>`) are not trained. If we wanted to use them for the template, we would need to train them, which requires more memory since we would have to update the embedding layer and lm_head. If you have access to more compute, you can modify `LLAMA_3_CHAT_TEMPLATE` in the [run_fsdp_qlora.py](./scripts/run_fsdp_qlora.py) script._
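
As an illustration of the idea, the sketch below renders a conversation using only plain-text role prefixes, so no untrained special tokens are involved. The function `render_chat` and the exact separators are assumptions made for this example. The real template is the Jinja string `LLAMA_3_CHAT_TEMPLATE` in the script and may differ, for example in where it places the EOS token.

```python
def render_chat(messages, add_generation_prompt=False):
    """Render messages with plain-text role prefixes (illustrative format only)."""
    parts = []
    for message in messages:
        if message["role"] == "system":
            parts.append(message["content"])
        elif message["role"] == "user":
            parts.append("User: " + message["content"])
        elif message["role"] == "assistant":
            parts.append("Assistant: " + message["content"])
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("Assistant:")
    return "\n\n".join(parts)


prompt = render_chat(
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "How long was the Revolutionary War?"},
    ],
    add_generation_prompt=True,
)
print(prompt)
```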

For configuration we use the new `TrlParser`, which allows us to provide hyperparameters in a YAML file or to overwrite the arguments from the config file by explicitly passing them to the CLI, e.g. `--num_train_epochs 10`.
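
The precedence rule (values from the config file act as defaults, and explicitly passed CLI flags win) can be sketched with the standard library alone. This is not how `TrlParser` is implemented. `merge_config` is a hypothetical helper that only handles simple `int` and `float` values.

```python
import argparse


def merge_config(file_config, cli_args):
    """Overlay explicitly passed CLI values on top of values from a config file."""
    parser = argparse.ArgumentParser()
    for key, value in file_config.items():
        # SUPPRESS keeps flags that were not passed out of the parsed namespace,
        # so only explicit CLI values override the file config
        parser.add_argument(f"--{key}", type=type(value), default=argparse.SUPPRESS)
    overrides = vars(parser.parse_args(cli_args))
    return {**file_config, **overrides}


file_config = {"num_train_epochs": 3, "learning_rate": 2.0e-4}  # as if read from YAML
merged = merge_config(file_config, ["--num_train_epochs", "10"])
print(merged)
```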

_Note: The config below is optimized for 8x H100 80GB. You can also fine-tune Llama 3.1 405B on 4x H100; in that case, set `per_device_train_batch_size` to `2`. With a `max_seq_len` of `2048`, this should lead to ~64GB of memory usage per GPU._

**Pre-quantize Llama 3.1 405B**

We are using the [pre-quantized version of Llama 3.1 405B](https://huggingface.co/hugging-quants/Meta-Llama-3.1-405B-BNB-NF4-BF16), which has already been quantized with `bitsandbytes`. When you fine-tune a model with Q-LoRA that is not pre-quantized, `bitsandbytes` first quantizes the model and then starts the training process. By using the pre-quantized model we save time, as the model is significantly smaller to download and to load onto the GPU. You can learn more about pre-quantizing models with `bitsandbytes` in the [bitsandbytes documentation]().


**Tested Hardware Configurations:**  
✅ 4x H100 80GB with ~900GB CPU RAM  
✅ 8x H100 80GB with ~1.5TB CPU RAM  
❌ 8x L40 48GB _(might work but not tested)._  

In [4]:
%%writefile llama_31_405b_fsdp_qlora.yaml
# script parameters
# model_id: "meta-llama/Meta-Llama-3.1-405B" # Hugging Face model id
model_id: "hugging-quants/Meta-Llama-3.1-405B-BNB-NF4-BF16" # Hugging Face model id
dataset_path: "."                      # path to dataset
max_seq_len:  2048                     # max sequence length for model and packing of the dataset
# training parameters
output_dir: "./llama-31-405b-hf-no-robot" # Temporary output directory for model checkpoints
report_to: "tensorboard"               # report metrics to tensorboard
learning_rate: 2.0e-4                  # learning rate 2.0e-4
lr_scheduler_type: "constant"          # learning rate scheduler
num_train_epochs: 3                    # number of training epochs
per_device_train_batch_size: 2         # batch size per device during training
per_device_eval_batch_size: 1          # batch size for evaluation
gradient_accumulation_steps: 4         # number of steps before performing a backward/update pass
optim: adamw_torch                     # use torch adamw optimizer
logging_steps: 10                      # log every 10 steps
save_strategy: epoch                   # save checkpoint every epoch
eval_strategy: epoch                   # evaluate every epoch
max_grad_norm: 0.3                     # max gradient norm
warmup_ratio: 0.03                     # warmup ratio
bf16: true                             # use bfloat16 precision
tf32: true                             # use tf32 precision
gradient_checkpointing: true           # we use activation_checkpointing instead
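
As a quick sanity check on the batch settings in this config, the effective batch size per optimizer step can be worked out as follows. The sketch assumes 8 GPUs, matching `--num_processes 8` in the launch command. The token count is an upper bound that is only reached when packing fills every sequence to `max_seq_len`.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 8        # assumption: one process per GPU on an 8-GPU node
max_seq_len = 2048

# Sequences that contribute to a single optimizer update
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
# Upper bound on tokens per optimizer update (fully packed sequences)
tokens_per_step = sequences_per_step * max_seq_len

print(f"{sequences_per_step} sequences, up to {tokens_per_step} tokens per optimizer step")
```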

_Note: At the end of the training there will be a slight increase in GPU memory usage (~10%). This is caused by the steps needed to save the model correctly. Make sure to have enough memory left on your GPU to save the model. [REF](https://huggingface.co/docs/peft/v0.10.0/en/accelerate/fsdp#memory-usage)_

We are going to use accelerate to distribute the training across multiple GPUs. Accelerate is a PyTorch library that makes it easier to write distributed PyTorch code. It provides a high-level API that abstracts away the complexity of distributed training. We created a [fsdp_qlora.yaml](./configs/fsdp_qlora.yaml) configuration file that contains the environment configuration for the training. Here you can change the number of GPUs (`num_processes`) or FSDP configuration. 

Now, let's launch the training with the following command:

In [None]:
## Run training via slurm script slurm | script | accelerate | config
# sbatch --job-name=l31-405 slurm/hf.slurm scripts/run_fsdp_qlora.py configs/fsdp_qlora.yaml llama_31_405b_fsdp_qlora.yaml
# Run training via terminal
!HF_HUB_ENABLE_HF_TRANSFER=1 accelerate launch --config_file ./configs/fsdp_qlora.yaml --num_processes 8 ./scripts/run_fsdp_qlora.py --config llama_31_405b_fsdp_qlora.yaml

The training of Llama 3.1 405B with Flash Attention for 3 epochs on a dataset of 10k samples takes 3h on a `p5.48xlarge` (8x H100). The on-demand instance price is `$98.320/h`, which results in a total cost of `~$300`.
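
To adapt this estimate to your own run, the arithmetic can be sketched as below. `training_cost_usd` is a hypothetical helper. The default price is the on-demand rate quoted above and will differ by region and over time.

```python
def training_cost_usd(hours: float, price_per_hour_usd: float = 98.32) -> float:
    """Total on-demand cost for a run of the given duration."""
    return hours * price_per_hour_usd


total = training_cost_usd(3)  # 3 epochs took 3h in this example
print(f"total: ${total:.2f}, per epoch: ${total / 3:.2f}")
```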

### _Optional: Merge LoRA adapter into the original model_

When using Q-LoRA, we only train adapters and not the full model. This means that when saving the model during training we only save the adapter weights and not the full model. If you want to save the full model, which makes it easier to use with Text Generation Inference, you can merge the adapter weights into the model weights using the `merge_and_unload` method and then save the model with the `save_pretrained` method. This will save a default model, which can be used for inference.

_Note: You might require > 800GB CPU Memory._

In [None]:
#### UNCOMMENT TO MERGE PEFT AND BASE MODEL ####
# import torch
# from peft import AutoPeftModelForCausalLM
#
# output_dir = "./llama-31-405b-hf-no-robot"  # same as `output_dir` in the training config
#
# # Load PEFT model on CPU
# model = AutoPeftModelForCausalLM.from_pretrained(
#     output_dir,
#     torch_dtype=torch.float16,
#     low_cpu_mem_usage=True,
# )
# # Merge LoRA and base model and save
# merged_model = model.merge_and_unload()
# merged_model.save_pretrained(output_dir, safe_serialization=True, max_shard_size="2GB")

## 4. Test Model and run Inference

After the training is done we want to evaluate and test our model. We will load different samples from the original dataset and evaluate the model manually. Evaluating generative AI models is not a trivial task since one input can have multiple correct outputs. If you want to learn more about evaluating generative models, check out the blog posts [Evaluate LLMs and RAG a practical example using Langchain and Hugging Face](https://www.philschmid.de/evaluate-llm) and [LLM Evaluation doesn't need to be complicated](https://www.philschmid.de/llm-evaluation).

In [None]:
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer 

peft_model_id = "./llama-31-405b-hf-no-robot"

# Load Model with PEFT adapter
model = AutoPeftModelForCausalLM.from_pretrained(
  peft_model_id,
  torch_dtype=torch.float16,
  quantization_config= {"load_in_4bit": True},
  device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

Let’s load our test dataset and try to generate an answer for one of its instructions.

In [None]:
from datasets import load_dataset 
from random import randint


# Load our test dataset
eval_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
rand_idx = randint(0, len(eval_dataset) - 1)  # randint includes both endpoints
messages = eval_dataset[rand_idx]["messages"][:2]

# Test on sample 
input_ids = tokenizer.apply_chat_template(messages,add_generation_prompt=True,return_tensors="pt").to(model.device)
outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id= tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]

print(f"**Query:**\n{eval_dataset[rand_idx]['messages'][1]['content']}\n")
print(f"**Original Answer:**\n{eval_dataset[rand_idx]['messages'][2]['content']}\n")
print(f"**Generated Answer:**\n{tokenizer.decode(response,skip_special_tokens=True)}")

# **Query:**
# How long was the Revolutionary War?
# **Original Answer:**
# The American Revolutionary War lasted just over seven years. The war started on April 19, 1775, and ended on September 3, 1783. 
# **Generated Answer:**
# The Revolutionary War, also known as the American Revolution, was an 18th-century war fought between the Kingdom of Great Britain and the Thirteen Colonies. The war lasted from 1775 to 1783.

That looks pretty good! 🚀 Now, it's your turn!

If you want to deploy your model into production, check out [Deploy the LLM for Production](https://www.philschmid.de/fine-tune-llms-in-2024-with-trl#6-deploy-the-llm-for-production).