<a href="https://www.kaggle.com/code/aisuko/multiple-gpus-ft-llama3-1-with-fsdp-and-qlora?scriptVersionId=193245457" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Let's fine-tune LLMs using two GPUs and many CPUs with FSDP and QLoRA. Thanks to the article [Multi-GPU Fine-tuning for Llama 3.1 70B with FSDP and QLoRA](https://towardsdatascience.com/multi-gpu-fine-tuning-for-llama-3-1-70b-with-fsdp-and-qlora-67a8a5b4f0d6).


## Fully Sharded Data Parallel (FSDP)

FSDP splits everything (the model itself, the gradients, the activations, and the optimizer states) across all the GPUs and offloads some of it to CPU RAM if there is not enough GPU memory. In other words, FSDP distributes optimizer states, gradients, and parameters across multiple devices (GPUs and CPUs).
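
To make the sharding concrete, here is a rough back-of-the-envelope sketch of the memory that the model weights alone would need per GPU when they are split evenly. The helper name `weights_gib_per_device`, the even split, and the round figure of 70e9 parameters are assumptions for illustration, not part of any FSDP API. Gradients, optimizer states, activations, and quantization overhead are ignored.

```python
def weights_gib_per_device(n_params: float, bits_per_param: int, n_devices: int) -> float:
    """Approximate GiB of weight storage per device under an even split."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / n_devices / 2**30

# Assumption: Llama 3.1 70B is treated as exactly 70e9 parameters.
N_PARAMS = 70e9

for bits, label in [(16, "bf16"), (4, "4-bit")]:
    for gpus in (1, 2):
        gib = weights_gib_per_device(N_PARAMS, bits, gpus)
        print(f"{label:>5} weights on {gpus} GPU(s): {gib:6.1f} GiB each")
```

The estimate suggests why both 4-bit quantization and sharding are combined here: either one alone still leaves a large weight footprint per GPU.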

During the forward pass, each FSDP unit gathers the necessary weight shards from the other devices to form the complete set of weights, performs the computation, and then discards the non-local shards.

After computing the loss, during the backward pass, each FSDP unit again gathers the complete set of weights and performs computations to determine local gradients, which are then averaged.

These averaged gradients are redistributed across the devices through a reduce-scatter operation. After this, each device updates its own shard of the parameters.
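
The forward pass, backward pass, and update described above can be imitated with a toy, single-process simulation. This is only a sketch: plain Python lists stand in for tensors and devices, the function names (`all_gather`, `reduce_scatter`, `local_gradient`, `fsdp_step`) are hypothetical helpers rather than PyTorch calls, and the "model" is a single dot product with a squared-error loss.

```python
def all_gather(shards):
    """Concatenate every device's shard into the full parameter vector."""
    return [w for shard in shards for w in shard]

def reduce_scatter(per_device_grads, n_devices):
    """Average full-size gradients across devices, then give each device one slice."""
    n = len(per_device_grads[0])
    avg = [sum(g[i] for g in per_device_grads) / n_devices for i in range(n)]
    size = n // n_devices
    return [avg[d * size:(d + 1) * size] for d in range(n_devices)]

def local_gradient(weights, x, y):
    """Gradient of the squared error (w.x - y)^2 with respect to each weight."""
    err = sum(w * xi for w, xi in zip(weights, x)) - y
    return [2 * err * xi for xi in x]

def fsdp_step(shards, batches, lr):
    n_devices = len(shards)
    grads = []
    for x, y in batches:
        # Each device gathers the full weights and computes on its own batch.
        full = all_gather(shards)
        grads.append(local_gradient(full, x, y))
        # The gathered copy is not kept, mimicking the discard of non-local shards.
    # Average the gradients and hand each device only its own slice.
    grad_shards = reduce_scatter(grads, n_devices)
    # Each device updates only the parameters it owns.
    return [[w - lr * g for w, g in zip(ws, gs)]
            for ws, gs in zip(shards, grad_shards)]

shards = [[1.0, 2.0], [3.0, 4.0]]          # 4 weights split over 2 simulated GPUs
batches = [([1.0, 0.0, 0.0, 0.0], 0.0),    # sample seen by device 0
           ([0.0, 0.0, 0.0, 1.0], 0.0)]    # sample seen by device 1
new_shards = fsdp_step(shards, batches, lr=0.1)
print(new_shards)
```

Note that no device ever stores more than its own shard between steps; the full weight vector exists only temporarily inside the loop.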

Here we are going to use Hugging Face's Accelerate.

In [None]:
!pip install -U -q transformers==4.39.3
!pip install -U -q accelerate==0.28.0
!pip install -U -q datasets==2.18.0
!pip install -U -q peft==0.10.0
!pip install -U -q bitsandbytes==0.43.1
!pip install -U -q trl==0.8.6

In [None]:
import os
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
login(token=user_secrets.get_secret("HUGGINGFACE_TOKEN"))

os.environ["WANDB_API_KEY"]=user_secrets.get_secret("WANDB_API_KEY")
os.environ["WANDB_PROJECT"] = "Fine-tuning Llama3.1 with SFT on openassistant-guanaco"
os.environ["WANDB_NAME"] = "ft-sft-Llama3-1-on-openassistant-guanaco"
os.environ["MODEL_NAME"] = "meta-llama/Meta-Llama-3.1-70B"
os.environ["TOKENIZER_NAME"] = "meta-llama/Meta-Llama-3.1-70B"
os.environ["DATASET"] = "timdettmers/openassistant-guanaco"

In [None]:
import torch, multiprocessing
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    set_seed
)
from trl import SFTTrainer, SFTConfig
from peft.utils.other import fsdp_auto_wrap_policy
from accelerate import Accelerator

In [None]:
accelerator = Accelerator()
set_seed(1234)
#use bf16 and FlashAttention if supported
if torch.cuda.is_bf16_supported():
    os.system('pip install flash_attn')
    compute_dtype = torch.bfloat16
    attn_implementation = 'flash_attention_2'
else:
    compute_dtype = torch.float16
    attn_implementation = 'sdpa'
model_name = "meta-llama/Meta-Llama-3.1-70B"
#Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
tokenizer.pad_token = "<|finetune_right_pad_id|>"
tokenizer.pad_token_id = 128004
tokenizer.padding_side = 'right'
ds = load_dataset("timdettmers/openassistant-guanaco")
#Add the EOS token
def process(row):
    row["text"] = row["text"]+"<|end_of_text|>"
    return row
ds = ds.map(
    process,
    num_proc= multiprocessing.cpu_count(),
    load_from_cache_file=False,
)
bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_storage=compute_dtype,
)
# torch_dtype follows compute_dtype so it matches bnb_4bit_quant_storage above
model = AutoModelForCausalLM.from_pretrained(
          model_name, quantization_config=bnb_config, torch_dtype=compute_dtype, attn_implementation=attn_implementation
)
for name, param in model.named_parameters():
    # freeze base model's layers
    param.requires_grad = False
def make_inputs_require_grad(module, input, output):
    output.requires_grad_(True)
model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={'use_reentrant':True})
peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)

training_arguments = SFTConfig(
        output_dir=os.getenv("WANDB_NAME") ,
        eval_strategy="steps",
        do_eval=True,
        optim="adamw_torch",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        per_device_eval_batch_size=1,
        log_level="debug",
        logging_steps=10,
        learning_rate=1e-4,
        bf16 = True,
        eval_steps=10,
        max_steps=50,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        dataset_text_field="text",
        max_seq_length=512,
        report_to="tensorboard",
        run_name=os.getenv('WANDB_NAME')
)
trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        tokenizer=tokenizer,
        args=training_arguments,
)
fsdp_plugin = trainer.accelerator.state.fsdp_plugin
fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(trainer.model)
trainer.train()
# Switch to a full (unsharded) state dict so the saved checkpoint is complete
if trainer.is_fsdp_enabled:
    trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

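As a quick sanity check on the training arguments above, the effective batch size can be worked out by hand. The sketch below copies the values from the `SFTConfig` call and assumes `n_gpus = 2` (the two GPUs mentioned in the overview); the variable names are only for illustration.

```python
# Values copied from the SFTConfig above; n_gpus is an assumption from the overview.
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
n_gpus = 2
max_steps = 50
max_seq_length = 512

# Sequences that contribute to a single optimizer step across all GPUs.
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps * n_gpus
# Sequences seen over the whole (short) run.
total_sequences = sequences_per_step * max_steps
# Upper bound on tokens per optimizer step, since sequences are capped at max_seq_length.
max_tokens_per_step = sequences_per_step * max_seq_length

print(f"sequences per optimizer step: {sequences_per_step}")
print(f"sequences over {max_steps} steps: {total_sequences}")
print(f"max tokens per optimizer step: {max_tokens_per_step}")
```

With `max_steps=50` this is a short demonstration run rather than a full pass over the dataset.
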
In [None]:
# trainer.save_model(output_dir)

In [None]:
kwargs={
    'model_name': os.getenv("WANDB_NAME"),
    'finetuned_from': os.getenv('MODEL_NAME'),
#     'tasks': 'Text-Generation',
#     'dataset_tags':'',
    'dataset': os.getenv("DATASET")
}

tokenizer.push_to_hub(os.getenv("WANDB_NAME"))
trainer.push_to_hub(**kwargs)

# Acknowledgements

* https://towardsdatascience.com/multi-gpu-fine-tuning-for-llama-3-1-70b-with-fsdp-and-qlora-67a8a5b4f0d6