# Fine-tune Llama 3 with PyTorch FSDP and Q-Lora
- https://www.philschmid.de/fsdp-qlora-llama3

In [1]:
%store -r

In [2]:
print(f"test_model_id : {test_model_id}")
print(f"bucket : {bucket}")
print(f"prefix : {prefix}")
print(f"model_weight_path : {model_weight_path}")
print(f"training_input_path : {training_input_path}")
print(f"test_input_path : {test_input_path}")
print(f"local_training_input_path : {local_training_input_path}")
print(f"local_test_input_path : {local_test_input_path}")

test_model_id : MLP-KTLim/llama-3-Korean-Bllossom-8B
bucket : sagemaker-us-west-2-322537213286
prefix : sagemaker/llama-3-1-kor-bllossom-8b
model_weight_path : s3://sagemaker-us-west-2-322537213286/sagemaker/llama-3-1-kor-bllossom-8b/model_weight/MLP-KTLim/llama-3-Korean-Bllossom-8B
training_input_path : s3://sagemaker-us-west-2-322537213286/sagemaker/llama-3-1-kor-bllossom-8b/gemini_result_kospi_0517/train/train_dataset.json
test_input_path : s3://sagemaker-us-west-2-322537213286/sagemaker/llama-3-1-kor-bllossom-8b/gemini_result_kospi_0517/test/test_dataset.json
local_training_input_path : /home/ec2-user/SageMaker/2024/llama-3-on-sagemaker/dataset/train
local_test_input_path : /home/ec2-user/SageMaker/2024/llama-3-on-sagemaker/dataset/test


In [3]:
import sagemaker
from pathlib import Path
from time import strftime

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [4]:
sagemaker.__version__

'2.226.1'

## Prepare the training script

## Fine-tune Llama-3 on Amazon SageMaker

In addition to the FSDP settings, we need to define the `training_hyperparameters` for our training script. The `training_hyperparameters` are passed to the training script as CLI arguments in the form `--key value`. In this notebook the only entry is `config`, the path to the YAML file written below.

If you want to better understand which batch size and `deepspeed_config` work with which hardware setup, you can check out the [Results & Experiments](https://www.philschmid.de/fine-tune-flan-t5-deepspeed#3-results--experiments) we ran.
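To make the `--key value` convention concrete, here is a minimal sketch of how a hyperparameter dictionary could be flattened into command-line arguments. The helper `to_cli_args` is hypothetical and exists only for illustration; the real conversion happens inside the SageMaker training toolkit and may differ in detail.

```python
def to_cli_args(hyperparameters):
    """Flatten a dict into ['--key', 'value', ...] (illustrative helper, not a SageMaker API)."""
    args = []
    for key, value in hyperparameters.items():
        args.extend([f"--{key}", str(value)])
    return args


# The single hyperparameter this notebook passes to the training script
example = {"config": "/opt/ml/code/configs/llama_3_8b_fsdp_qlora.yaml"}
print(to_cli_args(example))
```

The training script can then read `--config` with any argument parser and load the remaining settings from the YAML file.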


In [3]:
!mkdir -p src/configs


In [4]:
%%writefile src/configs/llama_3_8b_fsdp_qlora.yaml
# script parameters
model_name_or_path: "/opt/ml/input/data/model_weight"        # local path of the model weights inside the container (model_weight channel)
train_dataset_path: "/opt/ml/input/data/training"                      # path to dataset
test_dataset_path: "/opt/ml/input/data/test"
max_seq_length: 256
# training parameters
output_dir: "/opt/ml/checkpoints"     # Temporary output directory for model checkpoints
# report_to: "wandb"                  # report metrics to Weights & Biases
report_to: "tensorboard"               # report metrics to tensorboard
learning_rate: 0.0002                  # learning rate 2e-4
lr_scheduler_type: "constant"          # learning rate scheduler
num_train_epochs: 2                    # number of training epochs
per_device_train_batch_size: 16         # batch size per device during training
per_device_eval_batch_size: 1          # batch size for evaluation
gradient_accumulation_steps: 2         # number of steps before performing a backward/update pass
optim: adamw_torch                     # use torch adamw optimizer
logging_steps: 10                      # log every 10 steps
save_strategy: steps                   # save a checkpoint every save_steps steps (alternative: epoch)
save_steps: 1000
evaluation_strategy: epoch             # evaluate every epoch
max_grad_norm: 0.3                     # max gradient norm
warmup_ratio: 0.03                     # warmup ratio
bf16: true                             # use bfloat16 precision
tf32: false                            # tf32 precision disabled
gradient_checkpointing: true           # use gradient checkpointing to save memory
# FSDP parameters: https://huggingface.co/docs/transformers/main/en/fsdp
fsdp: "full_shard auto_wrap offload" # remove offload if enough GPU memory
fsdp_config:
    backward_prefetch: "backward_pre"
    forward_prefetch: "false"
    use_orig_params: "false"

Overwriting src/configs/llama_3_8b_fsdp_qlora.yaml
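The batch-size settings in the config combine into an effective global batch size per optimizer step. The sketch below works it out, assuming the job runs on a single `ml.g5.48xlarge` with 8 GPUs; the GPU count is an assumption stated here, not something read from the config, and `effective_batch_size` is a hypothetical helper.

```python
def effective_batch_size(per_device, grad_accum_steps, num_gpus):
    """Samples contributing to one optimizer step across all data-parallel workers."""
    return per_device * grad_accum_steps * num_gpus


# per_device_train_batch_size=16, gradient_accumulation_steps=2, 8 GPUs (assumed)
print(effective_batch_size(16, 2, 8))
```

If you move to an instance with fewer GPUs, raising `gradient_accumulation_steps` is one way to keep this number roughly constant.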


In [151]:
# Provide the ARN of the tracking server that you want to track your training job with
tracking_server_arn = '<your tracking server arn here>'

In [152]:
from sagemaker.huggingface import HuggingFace

# hyperparameters, which are passed into the training job
# training_hyperparameters={
#     "wb_token" : "<your wandb token>"
# }
training_hyperparameters={}

To create a SageMaker training job we need an Estimator; this notebook uses the `PyTorch` Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure.
SageMaker starts and manages all the required EC2 instances for us, provides the correct PyTorch container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at `/opt/ml/input/data`. Then it starts the training job.
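Each input channel passed to `.fit()` is made available in its own subdirectory of `/opt/ml/input/data`, which is why the YAML config points at paths such as `/opt/ml/input/data/training`. The following sketch shows that mapping; `channel_dir` is a hypothetical helper written for illustration only.

```python
from pathlib import PurePosixPath

# Root directory where SageMaker places input channels inside the container
INPUT_ROOT = PurePosixPath("/opt/ml/input/data")


def channel_dir(channel_name):
    """Return the container path for a given input channel name (illustrative helper)."""
    return str(INPUT_ROOT / channel_name)


# Channel names used by this notebook
for name in ("training", "test", "model_weight"):
    print(f"{name} -> {channel_dir(name)}")
```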

In [200]:
# instance_type = 'ml.p4d.24xlarge' # 'ml.p3.16xlarge', 'ml.p3dn.24xlarge', 'ml.p4d.24xlarge', 'local_gpu'
# instance_type = 'ml.g5.12xlarge'
instance_type = 'ml.g5.48xlarge'
# instance_type = 'ml.p5.48xlarge'
# instance_type = 'local_gpu'  # uncomment to run in SageMaker local mode instead
instance_count = 1
max_run = 24*60*60

In [201]:
local_model_weight_path = f"{Path.cwd()}/model_weight/{test_model_id}"

In [202]:
if instance_type =='local_gpu':
    import os
    from sagemaker.local import LocalSession

    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}
    training = f"file://{local_training_input_path}"
    test = f"file://{local_test_input_path}"
    model_weight = f"file://{local_model_weight_path}"
else:
    sagemaker_session = sagemaker.Session()
    training = training_input_path
    test = test_input_path
    model_weight = model_weight_path

training, test, model_weight

('s3://sagemaker-us-west-2-322537213286/sagemaker/llama-3-1-kor-bllossom-8b/gemini_result_kospi_0517/train/train_dataset.json',
 's3://sagemaker-us-west-2-322537213286/sagemaker/llama-3-1-kor-bllossom-8b/gemini_result_kospi_0517/test/test_dataset.json',
 's3://sagemaker-us-west-2-322537213286/sagemaker/llama-3-1-kor-bllossom-8b/model_weight/MLP-KTLim/llama-3-Korean-Bllossom-8B')

In [203]:
from sagemaker.pytorch import PyTorch
import time
# define Training Job Name 
job_name = f'huggingface-llama-3-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# distribution={ "pytorchddp": { "enabled": True } }  # mpirun, activates SMDDP AllReduce OR AllGather
distribution={
    "torch_distributed": {
        "enabled": True,
        # "NCCL_DEBUG":"INFO"
        # "mpi": "-verbose -x NCCL_DEBUG=INFO"
    }
}  # torchrun, activates SMDDP AllGather
# distribution={ "smdistributed": { "dataparallel": { "enabled": True } } }  # mpirun, activates SMDDP AllReduce OR AllGather

environment={
    "NCCL_DEBUG" : "INFO", 
    "SM_LOG_LEVEL": "10",
    "MLFLOW_TRACKING_ARN": tracking_server_arn
}

training_hyperparameters["config"] = "/opt/ml/code/configs/llama_3_8b_fsdp_qlora.yaml"
    
estimator = PyTorch(
                    entry_point='run_fsdp_qlora.py',
                    source_dir=f'{Path.cwd()}/src',
                    role=role,
                    # image_uri=image_uri,
                    framework_version='2.2.0',
                    py_version='py310',
                    instance_count=instance_count,
                    instance_type=instance_type,
                    distribution=distribution,
                    # metric_definitions=metric_definitions,
                    disable_profiler=True,
                    debugger_hook_config=False,
                    max_run=max_run,
                    hyperparameters={
                      **training_hyperparameters,
                    },   # the hyperparameter used for running the training job
                    sagemaker_session=sagemaker_session,
                    enable_remote_debug=True,
                    # keep_alive_period_in_seconds=1200,
                    # input_mode='FastFile'
                    # max_wait=max_run,
                    # use_spot_instances=True,
                    # subnets=['subnet-090e278f3622051c4'],
                    # security_group_ids=['sg-05baa06337a188842'],
                    max_retry_attempts=30,
                    environment = environment,
                   )

We created our `PyTorch` estimator with `run_fsdp_qlora.py` as the `entry_point` and passed the path of the FSDP/Q-LoRA config file through `training_hyperparameters`. We can now start the training job with the `.fit()` method, passing the input paths for the training data, the test data, and the model weights.

In [204]:
!sudo rm -rf src/core.*

In [205]:
current_time = strftime("%m%d-%H%M%s")  # note: lowercase %s is platform-specific (epoch seconds on Linux/glibc); %S would give seconds
i_type = instance_type.replace('.','-')
job_name = f'llama-3-{i_type}-{instance_count}-{current_time}'

## additional setting for mlflow
estimator._hyperparameters["model_uri"] = f's3://{bucket}/{prefix}/checkpoint/{test_model_id}/{job_name}'
estimator.environment["MLFLOW_EXPERIMENT_NAME"] = prefix.split("/")[-1]
estimator.environment["MLFLOW_RUN_NAME"] = job_name


if instance_type =='local_gpu':
    estimator.checkpoint_s3_uri = None
else:
    estimator.checkpoint_s3_uri = estimator._hyperparameters["model_uri"] 
    
    
estimator.fit(
    # inputs={'training': s3_data_path, 'model_weight': model_weight}, 
    inputs={
        'training': training,
        'test': test,
        'model_weight' : model_weight
    }, 
    job_name=job_name,
    wait=False
)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: llama-3-ml-g5-48xlarge-1-0731-08031722413018
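The job name replaces the dots in the instance type with hyphens because SageMaker training job names may only contain alphanumeric characters and hyphens and are limited to 63 characters. A minimal sketch of a pre-flight check follows; `is_valid_job_name` is a hypothetical helper, not part of the SageMaker SDK.

```python
import re

# Alphanumeric start, then alphanumerics optionally preceded by hyphens, 63 characters at most
JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}")


def is_valid_job_name(name):
    """Check a candidate training job name before submitting it (illustrative helper)."""
    return JOB_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_job_name("llama-3-ml-g5-48xlarge-1-0731-08031722413018"))
print(is_valid_job_name("llama-3-ml.g5.48xlarge"))
```

Checking the name locally avoids a failed API call when a new instance type or prefix is substituted into the template.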


In [206]:
sagemaker_session = sagemaker.Session()
sagemaker_session.logs_for_job(job_name=job_name, wait=True)

2024-07-31 08:03:39 Starting - Starting the training job...
2024-07-31 08:03:40 Pending - Training job waiting for capacity...................................................
2024-07-31 08:12:37 Pending - Preparing the instances for training......
2024-07-31 08:13:37 Downloading - Downloading input data.....................
2024-07-31 08:16:58 Downloading - Downloading the training image...
2024-07-31 08:17:34 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2024-07-31 08:17:36,863 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2024-07-31 08:17:36,926 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2024-07-31 08:17:36,938 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2024-07-31 08:17

## Run inference with the PEFT model

In [207]:
import sagemaker
sagemaker_session = sagemaker.Session()
train_result = sagemaker_session.describe_training_job(job_name=job_name)

In [226]:
checkpoint_s3uri = train_result['CheckpointConfig']['S3Uri']
output_dir = './checkpoints/20240731'
checkpoint_s3uri

's3://sagemaker-us-west-2-322537213286/sagemaker/llama-3-1-kor-bllossom-8b/checkpoint/MLP-KTLim/llama-3-Korean-Bllossom-8B/llama-3-ml-g5-48xlarge-1-0731-08031722413018'

In [227]:
!aws s3 sync $checkpoint_s3uri $output_dir

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [221]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
peft_model_id = output_dir

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    local_model_weight_path,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
peft_model = PeftModel.from_pretrained(base_model, peft_model_id)
peft_model = peft_model.merge_and_unload()

merged_save_dir = "merged_model"
peft_model.save_pretrained(merged_save_dir, safe_serialization=True, max_shard_size="2GB")

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(local_model_weight_path, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
tokenizer.save_pretrained(merged_save_dir)

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

('merged_model/tokenizer_config.json',
 'merged_model/special_tokens_map.json',
 'merged_model/tokenizer.json')
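Conceptually, `merge_and_unload()` folds the low-rank adapter update into the base weights so that no separate adapter layers are needed at inference time. The toy sketch below illustrates the standard LoRA merge rule, W' = W + (alpha / r) · B · A, on tiny Python lists. It is not PEFT's implementation, and all names and numbers are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, shape (out=2, in=2)
A = [[1.0, 2.0]]              # LoRA A, shape (r=1, in=2)
B = [[0.5], [0.25]]           # LoRA B, shape (out=2, r=1)

merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)
```

After merging, the result has the same shape as the original weight, which is why the merged model can be saved and loaded like a regular checkpoint.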

In [225]:
%%time
max_new_tokens = 250

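# Korean prompt: "Can you put together a tour of Seoul's famous sights?"
input_ids = tokenizer(
    "서울의 유명한 관광 코스를 만들어줄래?", return_tensors="pt"
).input_ids.to(peft_model.device)  # move inputs to the model's device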

outputs = peft_model.generate(input_ids, max_new_tokens=max_new_tokens)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

서울의 유명한 관광 코스를 만들어줄래? 서울의 다양한 명소를 소개해줄게! 😊

서울은 역사와 문화, 자연과 현대가 공존하는 도시로, 다양한 명소를 통해 다양한 경험을 할 수 있어. 여기 몇 가지 추천할게!

1. **경복궁**: 조선 왕조의 궁궐로, 한국의 역사와 문화를 체험할 수 있는 곳. 특히, 왕의 생활을 체험할 수 있는 '궁중 체험 프로그램'이 인기가 많아. 🏯

2. **남산서울타워**: 서울의 전경을 한눈에 볼 수 있는 곳. 특히, 저녁 시간에는 서울의 야경이 아름답게 보인다. 🌆

3. **홍대**: 젊음과 예술의 중심지. 다양한 카페, 레스토랑, 갤러리, 클럽 등이 몰려 있어, 밤에 방문하면 더욱 분위기가 좋다. 🎨

4. **청계천**: 서울의 중심부를 흐르는 천. 다양한 공연과 이벤트가 열리는 곳으로, 특히, 야간에 방문하면 아름다운 조명이 켜져 있어. 💧

5
CPU times: user 18min 59s, sys: 1min 21s, total: 20min 21s
Wall time: 12.8 s


In [2]:
%store merged_save_dir
%store checkpoint_s3uri
%store tracking_server_arn

UsageError: Unknown variable 'merged_save_dir'
