# Fine-tune LLM with PyTorch FSDP, Q-Lora, and SDPA
- https://www.philschmid.de/fsdp-qlora-llama3
- https://medium.com/@xuebinbin12/fine-tuning-chat-based-llm-with-multi-turn-conversational-data-part-i-d8c64d01a20d
- https://colab.research.google.com/github/openai/openai-cookbook/blob/main/examples/How_to_finetune_chat_models.ipynb

## Setup environment

In [26]:
# !pip install -q -U bitsandbytes
# !pip install -q -U git+https://github.com/huggingface/transformers.git
# !pip install -q -U git+https://github.com/huggingface/peft.git
# !pip install -q -U datasets
# !pip install -q -U evaluate
# !pip install -q -U huggingface_hub
# !pip install -q -U flash-attn
# !pip install -q -U trl
# !pip install -q -U tensorboard

From: https://www.philschmid.de/fine-tune-llms-in-2024-with-trl
> Note: If your machine has less than 96GB of RAM and lots of CPU cores, reduce the number of `MAX_JOBS`. On the `g5.2xlarge` we used `4`.

In [34]:
# import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'
# # install flash-attn
# !pip install ninja packaging
# !MAX_JOBS=4 pip install flash-attn --no-build-isolation

Next, we need to log in to Hugging Face to access the `Llama-3-8b` or `Phi-3-mini-128k-instruct` model.

In [36]:
# from huggingface_hub import login

# login(
#   token="", # ADD YOUR TOKEN HERE
#   add_to_git_credential=True
# )

## Download dataset from the HF Hub and process it

In [58]:
from datasets import load_dataset

# Define the system message
system_message = """You are Milei-GPT, an AI assistant inspired by conversations with Javier Milei, the current president of Argentina. Your knowledge spans a wide range of topics, allowing you to engage in substantive conversations and provide analysis on complex subjects."""

# Function to add the system message
def create_conversation(sample):
    if sample["messages"][0]["role"] == "system":
        return sample
    else:
        sample["messages"] = [{"content": system_message, "role": "system"}] + sample["messages"]
        return sample

# Load the dataset from the hub
dataset = load_dataset("machinelearnear/multiturn_chat_milei_gpt")

# Access the train dataset and shuffle it
train_dataset = dataset['train'].shuffle(seed=42).select(range(440))  # randomly downsample the dataset to 440 samples

# Add the system message to each conversation
columns_to_remove = list(train_dataset.features)
columns_to_remove.remove("messages")
train_dataset = train_dataset.map(create_conversation, remove_columns=columns_to_remove, batched=False)

# Split the dataset into 400 training samples and 40 test samples
train_test_split = train_dataset.train_test_split(test_size=40/440)

# Filter out conversations with an odd number of turns (after adding system message)
train_test_split["train"] = train_test_split["train"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)
train_test_split["test"] = train_test_split["test"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)

# Save the datasets to disk
train_test_split["train"].to_json("../data/train_dataset.json", orient="records", force_ascii=False)
train_test_split["test"].to_json("../data/test_dataset.json", orient="records", force_ascii=False)

883730
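
The filter in the cell above only checks that the number of non-system messages is even. It does not verify that the roles actually alternate. A minimal sketch of a stricter check follows; `has_alternating_turns` is a hypothetical helper, not part of the original notebook, and it assumes each message is a dict with `role` and `content` keys as in the dataset shown here.

```python
def has_alternating_turns(messages):
    """Return True if, after an optional leading system message, the turns
    strictly alternate user -> assistant and end on an assistant reply."""
    # Skip the system message if one is present.
    if messages and messages[0]["role"] == "system":
        turns = messages[1:]
    else:
        turns = messages
    # An empty or odd-length conversation cannot end on an assistant reply.
    if not turns or len(turns) % 2 != 0:
        return False
    expected = ("user", "assistant")
    return all(m["role"] == expected[i % 2] for i, m in enumerate(turns))


example = [
    {"role": "system", "content": "You are Milei-GPT."},
    {"role": "user", "content": "Gracias por recibirme."},
    {"role": "assistant", "content": "No, por favor, es un placer."},
]
print(has_alternating_turns(example))      # True
print(has_alternating_turns(example[:2]))  # False (ends on a user turn)
```

If this stricter behaviour is wanted, the helper could be passed to `Dataset.filter`, e.g. `ds.filter(lambda x: has_alternating_turns(x["messages"]))`.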

In [68]:
train_test_split["train"]['messages'][0][:10]

[{'content': 'You are Milei-GPT, an AI assistant inspired by conversations with Javier Milei, the current president of Argentina. Your knowledge spans a wide range of topics, allowing you to engage in substantive conversations and provide analysis on complex subjects.',
  'role': 'system'},
 {'content': ' Presidente de mi ley, ¿cómo le va?', 'role': 'user'},
 {'content': '¿Cómo anda?  ¿Qué tal, Johnny?  ¿Cómo estás?',
  'role': 'assistant'},
 {'content': 'Gracias por recibirme.', 'role': 'user'},
 {'content': 'No, por favor, es un placer.  ¿Está bien?  Sí, claro.',
  'role': 'assistant'},
 {'content': '¿Viene mucho por acá?  Vamos a tutear, por más fácil.  ¿Venís mucho por acá o estás más en Olivos?',
  'role': 'user'},
 {'content': 'Estoy más en Olivos.  Yo vengo en general martes y jueves a Casa Rosada, que es los días que tenemos reunión de gabinete.  Y ya después enlazamos... las reuniones que tenga que tener protocolarmente y el resto de la semana trabajo en Olivos.',
  'role': 'a

## Model training

> We are now ready to fine-tune our model with PyTorch FSDP, Q-Lora, and SDPA. Since we are running in a distributed setup, we need to use torchrun and a python script to start the training.

> We prepared a script, `run_fsdp_qlora.py`, which loads the dataset from disk, prepares the model and tokenizer, and starts the training. It uses the `SFTTrainer` from `trl` to fine-tune our model. The `SFTTrainer` makes it straightforward to run supervised fine-tuning of open LLMs, with support for:

> - Dataset formatting, including conversational and instruction format (✅ used)
> - Training on completions only, ignoring prompts (❌ not used)
> - Packing datasets for more efficient training (✅ used)
> - PEFT (parameter-efficient fine-tuning) support including Q-LoRA (✅ used)
> - Preparing the model and tokenizer for conversational fine-tuning (❌ not used, see below)
>
> Note: We are using an Anthropic/Vicuna-like chat template with `User:` and `Assistant:` roles. This is done because the special tokens in base Llama 3 (`<|begin_of_text|>` or `<|reserved_special_token_XX|>`) are not trained. If we wanted to use them for the template, we would need to train them, which requires more memory because the embedding layer and `lm_head` must be updated. If you have access to more compute, you can modify `LLAMA_3_CHAT_TEMPLATE` in the `run_fsdp_qlora.py` script.
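
To make the idea of a plain-text chat template concrete, here is a rough sketch in pure Python. It is not the `LLAMA_3_CHAT_TEMPLATE` from `run_fsdp_qlora.py`, which is not shown in this notebook. The `Human`/`Assistant` labels follow the `Human:` prefix visible in the training log further down, and the `render_conversation` name, the blank-line separators, and the default `eos_token` are all assumptions made for illustration.

```python
# Assumed role labels; the real template may use different ones (e.g. "User").
ROLE_LABELS = {"user": "Human", "assistant": "Assistant"}


def render_conversation(messages, eos_token="<|end_of_text|>"):
    """Flatten a list of chat messages into one training string.

    The system message is emitted as-is, each user/assistant turn is
    prefixed with its label, and every assistant reply is closed with
    the EOS token so the model learns where a reply ends.
    """
    parts = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "system":
            parts.append(content)
        elif role == "user":
            parts.append(f"\n\n{ROLE_LABELS['user']}: {content}")
        elif role == "assistant":
            parts.append(f"\n\n{ROLE_LABELS['assistant']}: {content}{eos_token}")
        else:
            raise ValueError(f"unexpected role: {role!r}")
    return "".join(parts)


demo = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hola"},
    {"role": "assistant", "content": "Hola, ¿cómo estás?"},
]
print(render_conversation(demo))
```

Because the template uses only ordinary text, no new special tokens have to be learned; in practice the EOS string would come from `tokenizer.eos_token` rather than being hard-coded.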

> For configuration we use the new `TrlParser`, which lets us provide hyperparameters in a `yaml` file or override the arguments from the config file by passing them explicitly on the CLI, e.g. `--num_train_epochs 10`. Below is the config file for fine-tuning Llama 3 8B on 4x A10G GPUs or other 4x 24GB GPUs.

In [75]:
%%writefile ../scripts/llama_3_8b_fsdp_qlora.yaml
# script parameters
model_id: "meta-llama/Meta-Llama-3-8B-Instruct" # Hugging Face model id
dataset_path: "../data/"                      # path to dataset
max_seq_len:  3072 # 2048              # max sequence length for model and packing of the dataset
# training parameters
output_dir: "./llama-3-8b-machinelearnear-milei-gpt" # Temporary output directory for model checkpoints
report_to: "tensorboard"               # report metrics to tensorboard
learning_rate: 0.0002                  # learning rate 2e-4
lr_scheduler_type: "constant"          # learning rate scheduler
num_train_epochs: 3                    # number of training epochs
per_device_train_batch_size: 1         # batch size per device during training
per_device_eval_batch_size: 1          # batch size for evaluation
gradient_accumulation_steps: 2         # number of steps to accumulate gradients before an optimizer update
optim: adamw_torch                     # use torch adamw optimizer
logging_steps: 10                      # log every 10 steps
save_strategy: epoch                   # save checkpoint every epoch
evaluation_strategy: epoch             # evaluate every epoch
max_grad_norm: 0.3                     # max gradient norm
warmup_ratio: 0.03                     # warmup ratio
bf16: true                             # use bfloat16 precision
tf32: true                             # use tf32 precision
gradient_checkpointing: true           # use gradient checkpointing to save memory
# FSDP parameters: https://huggingface.co/docs/transformers/main/en/fsdp
fsdp: "full_shard auto_wrap offload" # remove offload if enough GPU memory
fsdp_config:
    backward_prefetch: "backward_pre"
    forward_prefetch: "false"
    use_orig_params: "false"

Overwriting ../scripts/llama_3_8b_fsdp_qlora.yaml
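
As a quick sanity check on the config above, the effective batch size per optimizer step is the product of the per-device batch size, the gradient accumulation steps, and the number of GPUs. The sketch below works through that arithmetic; the helper name is hypothetical, and the token count is only an upper bound that assumes every packed sequence is filled to `max_seq_len`.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_gpus):
    """Number of sequences that contribute to a single optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus


max_seq_len = 3072  # value from the YAML config above

# 4 GPUs, as in the setup the config was written for.
sequences = effective_batch_size(1, 2, 4)
print(sequences)                # 8
print(sequences * max_seq_len)  # 24576 tokens per update at most

# A single GPU, as in a one-process test launch.
print(effective_batch_size(1, 2, 1))  # 2
```

When the number of GPUs changes, `gradient_accumulation_steps` can be adjusted to keep this product constant.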


> Note: At the end of the training there will be a slight increase in GPU memory usage (~10%). This is caused by saving the model. Make sure enough GPU memory is left to save the model. See also [this Reddit conversation](https://www.reddit.com/r/LocalLLaMA/comments/16v9hms/fine_tune_base_model_or_chat_model_for/)

> To launch our training we will use `torchrun` to keep the example flexible and easy to adapt to, e.g., Amazon SageMaker or Google Cloud Vertex AI. For `torchrun` and FSDP we need to set the environment variables `ACCELERATE_USE_FSDP` and `FSDP_CPU_RAM_EFFICIENT_LOADING` to tell transformers/accelerate to use FSDP and load the model in a memory-efficient way.
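
The two environment variables can also be set programmatically, which is handy when the launch is driven from a Python job wrapper instead of a shell. The sketch below only builds the command and environment; `build_torchrun_command` is a hypothetical helper, and the script and config paths are the ones used in this notebook.

```python
import os
import shlex


def build_torchrun_command(script, config, nproc_per_node):
    """Return (argv, env) for a torchrun launch with FSDP enabled."""
    env = dict(os.environ)
    env["ACCELERATE_USE_FSDP"] = "1"             # tell accelerate to use FSDP
    env["FSDP_CPU_RAM_EFFICIENT_LOADING"] = "1"  # memory-efficient model loading
    cmd = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",  # one process per GPU
        script,
        "--config",
        config,
    ]
    return cmd, env


cmd, env = build_torchrun_command(
    "../scripts/run_fsdp_qlora.py",
    "../scripts/llama_3_8b_fsdp_qlora.yaml",
    nproc_per_node=4,
)
print(shlex.join(cmd))
# To actually start training: subprocess.run(cmd, env=env, check=True)
```

Setting `nproc_per_node` to the number of available GPUs is the only change needed to move between a single-GPU test and the 4-GPU setup.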

> Note: To disable CPU offloading, change the value of `fsdp` and remove `offload`. This only works on GPUs with more than 40GB of memory, since it requires more GPU memory.

> Now, let's launch the training with the following command (a test only; we are not running the full training in this notebook):

In [76]:
!ACCELERATE_USE_FSDP=1 FSDP_CPU_RAM_EFFICIENT_LOADING=1 torchrun --nproc_per_node=1 ../scripts/run_fsdp_qlora.py --config ../scripts/llama_3_8b_fsdp_qlora.yaml

Generating train split: 177 examples [00:00, 7021.32 examples/s]
Generating train split: 23 examples [00:00, 5675.98 examples/s]
tokenizer_config.json: 100%|███████████████| 51.0k/51.0k [00:00<00:00, 30.5MB/s]
tokenizer.json: 100%|██████████████████████| 9.09M/9.09M [00:00<00:00, 27.7MB/s]
special_tokens_map.json: 100%|███████████████| 73.0/73.0 [00:00<00:00, 1.11MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Map: 100%|████████████████████████████| 177/177 [00:00<00:00, 721.29 examples/s]
Map: 100%|██████████████████████████████| 23/23 [00:00<00:00, 880.09 examples/s]
You are Milei-GPT, an AI assistant inspired by conversations with Javier Milei, the current president of Argentina. Your knowledge spans a wide range of topics, allowing you to engage in substantive conversations and provide analysis on complex subjects.

Human:  Vos planteando esas cosas tan disruptivas, por no decir alguna recontra violenta, es