# Fine-Tuning Llama-3 on AMD Radeon GPU

## Authors:
### **Fluid Numerics**
* **Garrett Byrd**             ([garrett@fluidnumerics.com](mailto:garrett@fluidnumerics.com))
* **Dr. Joseph Schoonover**    ([joe@fluidnumerics.com](mailto:joe@fluidnumerics.com))

## Requirements
### Software
* ROCm 6.1
* Python 3.12

### Hardware
* A `gfx1100` Radeon (or Radeon Pro) GPU ([check the ROCm system requirements](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html))

## Python Package List

We recommend using Anaconda/Miniconda as your Python environment manager. Install Miniconda [here](https://docs.anaconda.com/miniconda/).

1. PyTorch 2.4.0+rocm6.1: install the ROCm-compatible build of PyTorch (`pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1`)

2. Hugging Face libraries (`pip install transformers datasets accelerate peft evaluate trl`)

3. JupyterLab, if you are running this as a notebook (`pip install jupyterlab`)


## Installing `bitsandbytes`
Install the **rocm_enabled** branch of the **ROCm/bitsandbytes** repository [found here](https://github.com/ROCm/bitsandbytes/tree/rocm_enabled). 

Note: this branch is actively developed by the ROCm team and differs from the branches of the [main bitsandbytes repository](https://github.com/bitsandbytes-foundation/bitsandbytes). At the time of writing, neither the `main` nor the `multi-backend-refactor` branch of the main repository works for this example.

To install:
```sh
git clone --recurse-submodules https://github.com/ROCm/bitsandbytes
cd bitsandbytes
git checkout rocm_enabled
pip install -r requirements-dev.txt
cmake -DCOMPUTE_BACKEND=hip -S . -DBNB_ROCM_ARCH="gfx1100"
make
pip install .
```
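
Once the installs above have finished, it can help to confirm that every package is visible in the active environment before opening the notebook. The following is a minimal sketch that uses only the standard library; the helper names `installed_version` and `environment_report` are ours, and the distribution names are assumed to match the `pip` commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip commands above (assumed to be exact).
REQUIRED = [
    "torch", "transformers", "datasets", "accelerate",
    "peft", "evaluate", "trl", "bitsandbytes",
]


def installed_version(distribution):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


def environment_report(distributions=REQUIRED):
    """Map each distribution name to its installed version (or None)."""
    return {name: installed_version(name) for name in distributions}


for name, found in environment_report().items():
    print(f"{name:<14} {found or 'NOT INSTALLED'}")
```

For the ROCm build of PyTorch, the reported version is expected to carry a `+rocm6.1` suffix; a plain version number may indicate that a CPU or CUDA wheel was installed instead.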

In [None]:
import torch
from numpy import argmax

# 🤗 Hugging Face Libraries

# transformers
# https://huggingface.co/docs/transformers/index
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    LlamaForCausalLM,
    pipeline,
    TrainingArguments,
)

# datasets
# https://huggingface.co/docs/datasets/index
from datasets import load_dataset

# peft
# https://huggingface.co/docs/peft/index
from peft import LoraConfig, get_peft_model

# evaluate
# https://huggingface.co/docs/evaluate/index
import evaluate

# trl (Transformer Reinforcement Learning)
# https://huggingface.co/docs/trl/en/index
from trl import SFTTrainer, SFTConfig

In [None]:
! rocm-smi

In [None]:
# Confirm the correct device is being used
# E.g. 'AMD Radeon Pro W7800'
print(f"Device name: {torch.cuda.get_device_name(0)}")
# print(f"Device name: {torch.cuda.get_device_name(1)}")

# set device to 'cuda' for ROCm GPUs, else use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# verify the device is set to 'cuda'
print(f"Device: {device}")

In [4]:
# set path to local model
path_to_model = "/home/garrett/amd/misc/Meta-Llama-3-8B"

# If not using a local model, set this to the model's repository ID on the Hugging Face Hub, e.g.
# path_to_model = "meta-llama/Meta-Llama-3-8B"
# https://huggingface.co/meta-llama/Meta-Llama-3-8B

In [None]:
# Input/output before training

my_tokenizer = AutoTokenizer.from_pretrained(path_to_model)  # Load model tokenizer
my_tokenizer.pad_token = my_tokenizer.eos_token  # Set padding token to EOS token

# BitsandBytes config
fp4_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Use 4bit quantization
    bnb_4bit_quant_type="fp4",  # Use FP4 datatype ("nf4" alternative)
    bnb_4bit_use_double_quant=True,  # Nested quantization
    bnb_4bit_compute_dtype=torch.float16,  # Computational type might be different than input type
)

quantized_model = LlamaForCausalLM.from_pretrained(
    path_to_model,  # Set model
    quantization_config=fp4_config,  # Apply config
    device_map="auto",
)

# Keep a printable summary of the module layout for later inspection
quantized_model_architecture = str(quantized_model)

# Sample prompt (the integral evaluates to 1/4 - 3/(4e**2))
prompt = r"Evaluate the integral $\int_0^1 x e^{-2x} dx$."
# Alternative word-problem prompt; this later assignment is the one passed to the pipeline below
prompt = r"Johnny has three apples. Jane has fourteen oranges. Jane says that she will trade three oranges for one apple. What is the maximum number of oranges that Johnny could trade for?"

quantized_pipeline = pipeline(
    "text-generation",
    model=quantized_model,
    tokenizer=my_tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = quantized_pipeline(
    text_inputs=prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=my_tokenizer.eos_token_id,
    max_new_tokens=512,
    temperature=2.0,
)

for seq in sequences:
    print(f"\nResult:\n{seq['generated_text']}")
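
The effect of the 4-bit settings in `fp4_config` on weight memory can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes roughly 8.03 billion parameters for Llama-3-8B, a quantization block size of 64 with 32-bit scaling constants, and nested quantization of those constants to 8 bits in groups of 256 (the figures used in the QLoRA paper; the ROCm build of `bitsandbytes` may use different block sizes). It also ignores layers that are left unquantized.

```python
GIB = 1024**3


def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store n_params weights at the given bit width, in GiB."""
    return n_params * bits_per_param / 8 / GIB


def quant_constant_bits(block_size=64, double_quant=False, outer_block=256):
    """Per-parameter overhead (in bits) of storing the quantization constants."""
    if not double_quant:
        # One 32-bit constant per block of weights.
        return 32 / block_size
    # Nested quantization: 8-bit constants, plus one 32-bit constant
    # per group of `outer_block` first-level constants.
    return 8 / block_size + 32 / (block_size * outer_block)


n_params = 8.03e9  # assumed parameter count, not taken from this notebook

print(f"fp16 weights: {weight_memory_gib(n_params, 16):.2f} GiB")
for label, nested in (("4-bit", False), ("4-bit + nested", True)):
    bits = 4 + quant_constant_bits(double_quant=nested)
    print(f"{label}: {bits:.3f} bits/param, {weight_memory_gib(n_params, bits):.2f} GiB")
```

Under these assumptions, nested quantization trims the per-parameter overhead from 0.5 bits to about 0.127 bits, which is why `bnb_4bit_use_double_quant=True` is a cheap way to save a few hundred megabytes on a model of this size.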

In [None]:
# LoRA (Low-Rank Adaptation)
# https://huggingface.co/docs/peft/main/en/developer_guides/lora

# LoRA config
lora_config = LoraConfig(
    r=16,  # Size of adapter layer
    lora_alpha=16,  # "How strongly does the adaptation layer affect the base model?" (see 4.1 of https://arxiv.org/abs/2106.09685)
    lora_dropout=0.05,  # Optional dropout layer
    bias="none",  # No bias
    task_type="CAUSAL_LM",  # Task type, see https://huggingface.co/docs/peft/en/package_reference/peft_types#peft.TaskType
    target_modules=[  # Which modules to apply adapter layers to?
        "up_proj",  # up projection
        "down_proj",  # down projection
        "gate_proj",  # gate projection
        "k_proj",  # Key
        "q_proj",  # Query
        "v_proj",  # Value
        "o_proj",  # Output
    ],
)

# Apply the LoRA config
adapted_model = get_peft_model(quantized_model, lora_config)

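# Before training, PEFT's default LoRA initialization (B = 0) leaves the adapter
# a no-op, so expect output comparable to the base model's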
prompt = r"Evaluate the integral $\int_0^1 x e^{-2x} dx$."

adapted_pipeline = pipeline(
    "text-generation",
    model=adapted_model,
    tokenizer=my_tokenizer,
    device_map="auto",
)

sequences = adapted_pipeline(
    text_inputs=prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=my_tokenizer.eos_token_id,
    max_new_tokens=512,
)

for seq in sequences:
    print(f"\nResult:\n{seq['generated_text']}")
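
The number of trainable parameters implied by `r=16` and the seven `target_modules` can be worked out by hand: a rank-`r` adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below applies that formula using the layer dimensions from the public Meta-Llama-3-8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers). These dimensions are assumptions brought in for illustration; they are not stated in this notebook.

```python
# Assumed Meta-Llama-3-8B dimensions (from the public model config).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # grouped-query attention: 8 key/value heads x 128 dims
N_LAYERS = 32

# (in_features, out_features) for each module targeted in lora_config
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes=MODULE_SHAPES, n_layers=N_LAYERS):
    """Total LoRA weights: rank * (d_in + d_out) per module, per layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers


print(f"{lora_param_count(16):,} trainable parameters")  # 41,943,040
```

That is roughly 0.5% of the base model's weights. Calling `adapted_model.print_trainable_parameters()` reports the actual count for comparison.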

In [None]:
# Load dataset
# https://huggingface.co/datasets/meta-math/MetaMathQA
MetaMathQA = load_dataset(
    "json", data_files="MetaMathQA/MetaMathQA-395K.json", split="train[:10000]"
)
# Split dataset into "train" and "test" splits (80% / 20%)
MetaMathQA = MetaMathQA.train_test_split(test_size=0.2)

print(MetaMathQA)

In [8]:
# Chat templates are Jinja template strings
# https://huggingface.co/blog/chat-templates


# Format
def instructify(qr_row):
    qr_json = [
        {
            "role": "user",
            "content": qr_row["query"],
        },
        {
            "role": "assistant",
            "content": qr_row["response"],
        },
    ]

    qr_row["text"] = my_tokenizer.apply_chat_template(qr_json, tokenize=False)
    return qr_row

In [None]:
# Llama 3 chat template; the prompt format is documented at
# https://www.llama.com/docs/model-cards-and-prompt-formats/meta-llama-3/
my_tokenizer.chat_template = """{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}"""

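# Override the template above with a simpler one for this example: each message's
# trimmed content on its own line, with no role headers or special tokens.
my_tokenizer.chat_template = """{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = message['content'] | trim + '\n' %}{{ content }}{% endfor %}"""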

print(my_tokenizer.chat_template)
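
To see what the last template assigned above produces without loading a tokenizer, its behavior can be mimicked in plain Python. This is a sketch of our reading of that Jinja string (trim each message's content, append a newline, concatenate), not a replacement for `apply_chat_template`; the function name and the example messages are made up for illustration.

```python
def render_simple_template(messages):
    """Mimic the simplified template: trimmed content, one message per line."""
    return "".join(message["content"].strip() + "\n" for message in messages)


# Hypothetical query/response pair in the same shape that instructify builds.
example = [
    {"role": "user", "content": "  What is 2 + 3? "},
    {"role": "assistant", "content": "2 + 3 = 5"},
]

print(render_simple_template(example))
```

Note that the role fields are dropped entirely: the `text` column seen during training is just the query followed by the response, separated by a newline.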

In [None]:
formatted_dataset = MetaMathQA.map(instructify)

In [None]:
print(formatted_dataset["train"][0]["text"])
print(formatted_dataset["test"][0]["text"])

In [None]:
print('"query": \n')
print(formatted_dataset["test"][0]["query"], "\n")
print('"response": \n')
print(formatted_dataset["test"][0]["response"], "\n")
print('"text": \n')
print(formatted_dataset["test"][0]["text"], "\n")

In [None]:
example_prompt = formatted_dataset["test"][0]["query"]

sequences = adapted_pipeline(
    text_inputs=example_prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=my_tokenizer.eos_token_id,
    max_new_tokens=512,
)

for seq in sequences:
    print(f"\nResult:\n{seq['generated_text']}")

In [14]:
# https://huggingface.co/docs/evaluate/package_reference/loading_methods#evaluate.load
metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = argmax(logits, axis=-1)
    # Use the metric object loaded above (the evaluate module has no `metric` attribute)
    return metric.compute(predictions=predictions, references=labels)
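
For a causal language model, accuracy is usually computed per token: the logits at position `t` predict the token at position `t + 1`, and label positions marked with `-100` (the index Hugging Face loss functions ignore) are skipped. The sketch below shows that bookkeeping in plain Python for a single sequence, so it runs without a GPU or extra packages. It is an illustration of the idea, not the metric used by the trainer in this notebook, and the function name is ours.

```python
IGNORE_INDEX = -100  # label value skipped by Hugging Face loss functions


def token_accuracy(logits, labels):
    """Next-token accuracy for one sequence.

    logits: list of per-position score lists, shape [seq_len][vocab_size]
    labels: list of token IDs, shape [seq_len]
    """
    correct = 0
    total = 0
    # Shift by one: logits[t] is scored against labels[t + 1].
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # padding or masked position
        predicted = max(range(len(step_logits)), key=step_logits.__getitem__)
        correct += predicted == target
        total += 1
    return correct / total if total else 0.0


toy_logits = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.0, 0.0, 1.0]]
print(token_accuracy(toy_logits, [2, 1, 0]))  # both shifted predictions match
print(token_accuracy(toy_logits, [2, 0, 0]))  # only the second one matches
```

A NumPy version would apply the same shift and mask to whole `(batch, seq_len)` arrays before comparing.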

In [None]:
# Training Arguments
# https://huggingface.co/docs/transformers/v4.45.1/en/main_classes/trainer#transformers.TrainingArguments
# https://huggingface.co/docs/transformers/en/perf_train_gpu_one
training_arguments = TrainingArguments(
    output_dir="Llama-Math-TEMP",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=1,
    optim="paged_adamw_8bit",  # complete list: https://github.com/huggingface/transformers/blob/a43e84cb3b78fcac3d5d9374a8488f74f3f19245/src/transformers/training_args.py#L144
    num_train_epochs=1,
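    evaluation_strategy="steps",  # named `eval_strategy` in newer transformers releases
    eval_steps=0.25,  # a float below 1 is read as a fraction of the total training steps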
    logging_steps=1,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=1e-4,
    fp16=True,
    bf16=False,
    group_by_length=True,
)
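
The arguments above determine how many optimizer steps the run takes and how often evaluation fires. The sketch below works through that arithmetic for the 8,000-example training split (80% of the 10,000 rows loaded earlier). It assumes a single GPU, and the rounding reflects our understanding of how the `Trainer` resolves a fractional `eval_steps`; treat the numbers as an estimate.

```python
import math


def training_schedule(n_examples, batch_size, grad_accum, epochs, eval_ratio):
    """Return (total optimizer steps, steps between evaluations)."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    # One optimizer step per `grad_accum` batches, at least one per epoch.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_updates = updates_per_epoch * epochs
    # A fractional eval_steps is scaled by the total number of steps.
    eval_every = math.ceil(total_updates * eval_ratio)
    return total_updates, eval_every


total, eval_every = training_schedule(
    n_examples=8000, batch_size=1, grad_accum=1, epochs=1, eval_ratio=0.25
)
print(f"{total} optimizer steps, evaluating every {eval_every} steps")
```

With these settings the run takes 8,000 steps and evaluates four times; `warmup_steps=10` therefore covers only the first 0.125% of training.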

In [None]:
# Supervised Fine-Tuning Trainer
# https://huggingface.co/docs/trl/en/sft_trainer
trainer = SFTTrainer(
    model=adapted_model,
    train_dataset=formatted_dataset["train"],
    eval_dataset=formatted_dataset["test"],
    max_seq_length=512,
    dataset_text_field="text",
    tokenizer=my_tokenizer,
    args=training_arguments,
    packing=False,
    peft_config=lora_config,
)

In [None]:
trainer.train()

In [None]:
! rocm-smi
# Observed in our run:
# memory: 13.76 GB
# training time: 101m 40.6s
# average temperature: ~76 °C