# [model_tuner]flan_t5_xl_with_LoRA_ml_g5_2xl

In this SageMaker example, we look at how to fine-tune flan-t5-xl on a single GPU by applying [Low-Rank Adaptation of Large Language Models (LoRA)](https://arxiv.org/abs/2106.09685).
We will use Hugging Face [Transformers](https://huggingface.co/docs/transformers/index), [Accelerate](https://huggingface.co/docs/accelerate/index), and [PEFT](https://github.com/huggingface/peft).

1. Setup Development Environment
2. Load and prepare the dataset
3. Fine-Tune flan-t5-xl with LoRA and bnb int-8 on Amazon SageMaker
4. Deploy the model to Amazon SageMaker Endpoint

### Quick intro: PEFT or Parameter Efficient Fine-tuning

[PEFT](https://github.com/huggingface/peft), or Parameter Efficient Fine-tuning, is a new open-source library from Hugging Face that enables efficient adaptation of pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters. PEFT currently includes the following techniques:

- LoRA: [LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS](https://arxiv.org/pdf/2106.09685.pdf)
- Prefix Tuning: [P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks](https://arxiv.org/pdf/2110.07602.pdf)
- P-Tuning: [GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf)
- Prompt Tuning: [The Power of Scale for Parameter-Efficient Prompt Tuning](https://arxiv.org/pdf/2104.08691.pdf)
- AdaLoRA: [Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning](https://arxiv.org/pdf/2303.10512.pdf)
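
To make the "without fine-tuning all parameters" point concrete for LoRA, here is a minimal sketch that counts trainable parameters for one linear layer. LoRA keeps the dense weight `W` frozen and trains two low-rank factors instead. The layer size and rank below are hypothetical values chosen for illustration, not the settings used by this notebook's training script.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix W (d_out x d_in)."""
    return d_in * d_out


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 2048, 2048, 16
lora = lora_trainable_params(d_in, d_out, rank)
full = full_params(d_in, d_out)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

For this example layer, the adapter trains under 2% of the parameters that full fine-tuning would update.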

[](https://github.com/huggingface/notebooks/blob/main/sagemaker/24_train_bloom_peft_lora/sagemaker-notebook.ipynb)


In [2]:
# %%bash
# #!/bin/bash

# DAEMON_PATH="/etc/docker"
# MEMORY_SIZE=10G

# FLAG=$(cat $DAEMON_PATH/daemon.json | jq 'has("data-root")')
# # echo $FLAG

# if [ "$FLAG" == true ]; then
#     echo "Already revised"
# else
#     echo "Add data-root and default-shm-size=$MEMORY_SIZE"
#     sudo cp $DAEMON_PATH/daemon.json $DAEMON_PATH/daemon.json.bak
#     sudo cat $DAEMON_PATH/daemon.json.bak | jq '. += {"data-root":"/home/ec2-user/SageMaker/.container/docker","default-shm-size":"'$MEMORY_SIZE'"}' | sudo tee $DAEMON_PATH/daemon.json > /dev/null
#     sudo service docker restart
#     echo "Docker Restart"
# fi

In [3]:
!pip install datasets[s3] transformers sagemaker py7zr --upgrade --quiet


In [4]:
# !pip install "transformers==4.26.0" "datasets[s3]==2.9.0" sagemaker py7zr --upgrade --quiet

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more details [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

In [5]:
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")


sagemaker role arn: arn:aws:iam::322537213286:role/service-role/AmazonSageMaker-ExecutionRole-20230528T120509
sagemaker bucket: sagemaker-us-west-2-322537213286
sagemaker session region: us-west-2


## 2. Load and prepare the dataset

Here we use the [samsum](https://huggingface.co/datasets/samsum) dataset, a collection of about 16k messenger-style conversations with summaries. The conversations were created and written down by linguists fluent in English.

```python
{
  "id": "13818513",
  "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
  "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
}
```

To load the `samsum` dataset, we use the `load_dataset()` method from the 🤗 Datasets library.

In [6]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("samsum")

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")
# Train dataset size: 14732

Found cached dataset samsum (/root/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e)


  0%|          | 0/3 [00:00<?, ?it/s]

Train dataset size: 14732
Test dataset size: 819


To train the model, we need to convert the inputs (text) to token IDs using a 🤗 Transformers Tokenizer. If you are not sure what this means, check out **[chapter 6](https://huggingface.co/course/chapter6/1?fw=tf)** of the Hugging Face Course.

In [7]:
from transformers import AutoTokenizer

model_id='google/flan-t5-xl'

# Load tokenizer of flan-t5-xl
tokenizer = AutoTokenizer.from_pretrained(model_id)
# tokenizer.model_max_length = 2048 # overwrite wrong value

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


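Before we can start training, we need to preprocess the data. Abstractive summarization is a text-generation task: the model takes a text as input and generates a summary as output. To batch the data efficiently, we first need to know how long the inputs and outputs are.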

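To improve model performance, we defined a `prompt_template` that is used to construct an instruct prompt. The `prompt_template` has a "fixed" start and end, with the document in the middle. This means we need to make sure that the "fixed" template parts plus the document do not exceed the model's maximum length.
Before training, we preprocess the dataset, save it to disk, and then upload it to S3. You can run this step on your local machine or on a CPU and upload the result to the [Hugging Face Hub](https://huggingface.co/docs/hub/datasets-overview).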
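
The length constraint on "fixed template + document" can be sketched as a simple token budget. This is an illustration only: whitespace splitting stands in for the real tokenizer, `fit_document` is a hypothetical helper, and the template string is the one used for inference later in this notebook.

```python
def fit_document(template: str, document: str, max_length: int) -> str:
    """Truncate `document` so template + document stays within `max_length` tokens.

    Whitespace splitting stands in for a real tokenizer here.
    """
    # Tokens consumed by the fixed start and end of the template.
    fixed_tokens = len(template.format(dialogue="").split())
    # Whatever is left is the budget for the document itself.
    budget = max(max_length - fixed_tokens, 0)
    doc_tokens = document.split()[:budget]
    return template.format(dialogue=" ".join(doc_tokens))


template = "Summarize the chat dialogue:\n{dialogue}\n---\nSummary:\n"
prompt = fit_document(template, "a b c d e f g h i j", max_length=10)
print(len(prompt.split()))
```

With a real tokenizer, the same idea applies: measure the tokenized length of the empty template once, then truncate the document to the remaining budget.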

In [8]:
from datasets import concatenate_datasets
import numpy as np

# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = dataset["train"].map(lambda x: tokenizer(x["dialogue"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
input_lengths = [len(x) for x in tokenized_inputs["input_ids"]]

# take 85 percentile of max length for better utilization
max_source_length = int(np.percentile(input_lengths, 85))
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = dataset["train"].map(lambda x: tokenizer(x["summary"], truncation=True), batched=True, remove_columns=["dialogue", "summary"])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# take 90 percentile of max length for better utilization
max_target_length = int(np.percentile(target_lengths, 90))
print(f"Max target length: {max_target_length}")


Loading cached processed dataset at /root/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-ccf0f9a49ab83084.arrow


Max source length: 255


Map:   0%|          | 0/14732 [00:00<?, ? examples/s]

Max target length: 50


In [9]:
def preprocess_function(sample,padding="max_length"):
    # add prefix to the input for t5
    inputs = ["summarize: " + item for item in sample["dialogue"]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample["summary"], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
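
The label-masking step in `preprocess_function` can be seen in isolation on a toy batch. This sketch assumes a pad token id of 0 (the value T5 tokenizers use) and relies on -100 being the default index that PyTorch's cross-entropy loss ignores. The token ids are made up for illustration.

```python
PAD_TOKEN_ID = 0      # assumption: T5 tokenizers use 0 as the pad token id
IGNORE_INDEX = -100   # index ignored by PyTorch's cross-entropy loss by default


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding positions with IGNORE_INDEX so they do not contribute to the loss."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Two padded label sequences with made-up token ids.
labels = [[71, 1782, 1, 0, 0], [37, 1, 0, 0, 0]]
masked = mask_padding(labels)
print(masked)
```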

In [10]:
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=["dialogue", "summary", "id"])
print(f"Keys of tokenized dataset: {list(tokenized_dataset['train'].features)}")

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed/samsum-sagemaker/data'

# save datasets to disk for later easy loading
tokenized_dataset["train"].save_to_disk(f"{training_input_path}/tokenized_train")

## save test dataset without preprocessing to evaluate in the training script
tokenized_dataset["test"].save_to_disk(f"{training_input_path}/tokenized_test")
dataset["test"].save_to_disk(f"{training_input_path}/test")

# save datasets to disk for local debugging
tokenized_dataset["train"].save_to_disk("./data/tokenized_train")
tokenized_dataset["test"].save_to_disk("./data/tokenized_test")
dataset["test"].save_to_disk("./data/test")

print("uploaded data to:")
print(f"dataset to: {training_input_path}")

Loading cached processed dataset at /root/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-e4c84224a7b3f37d.arrow


Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Loading cached processed dataset at /root/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e/cache-9fb0724aee9b2408.arrow


Keys of tokenized dataset: ['input_ids', 'attention_mask', 'labels']


Saving the dataset (0/1 shards):   0%|          | 0/14732 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/819 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/819 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/819 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/819 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/819 [00:00<?, ? examples/s]

uploaded data to:
dataset to: s3://sagemaker-us-west-2-322537213286/processed/samsum-sagemaker/data


After processing the dataset, the cell above uses the [FileSystem integration](https://huggingface.co/docs/datasets/filesystems) to upload it to S3. We use `sess.default_bucket()` here; adjust this value if you want to store the dataset in a different S3 bucket. We will use the S3 path later in the training script.


## 3. Fine-Tune flan-t5-xl with LoRA and bnb int-8 on Amazon SageMaker

In addition to the LoRA technique, we use [bitsandbytes LLM.int8()](https://huggingface.co/blog/hf-bitsandbytes-integration) to quantize the frozen LLM to int8. This reduces the memory needed for flan-t5-xl by up to 4x.
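
The "up to 4x" figure follows from bytes per parameter: 4 bytes in fp32 versus 1 byte in int8. The sketch below is a back-of-the-envelope estimate for the weights only. It assumes roughly 3 billion parameters for flan-t5-xl and ignores activations, optimizer state, and quantization overhead.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate memory for the model weights alone, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 3_000_000_000  # assumption: approximate size of flan-t5-xl
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(num_params, dtype):.1f} GB")
```

Relative to fp16 weights the saving is 2x, which is why the reduction is stated as "up to" 4x.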

We prepared [run_clm.py](./scripts/run_clm.py), which trains the model using PEFT. If you are curious how it works, the blog post [Efficient Large Language Model training with LoRA and Hugging Face](https://www.philschmid.de/fine-tune-flan-t5-peft) explains the training script in detail.

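To create a SageMaker training job we need an Estimator; this notebook uses the `PyTorch` Estimator from the SageMaker Python SDK. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure.
SageMaker starts and manages all the required EC2 instances, provides the correct training container, uploads the provided scripts, and downloads the data from the S3 bucket into the container at `/opt/ml/input/data`. It then starts the training job by running the script.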
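
Inside the container, each input channel ends up in its own subdirectory of `/opt/ml/input/data`, and the SageMaker training toolkit also exposes the path as an `SM_CHANNEL_<NAME>` environment variable. The helper below is a hypothetical sketch of how a training script could resolve a channel directory. It is not taken from `run_clm.py`.

```python
import os
from pathlib import Path


def channel_dir(channel: str) -> Path:
    """Return the local directory for a SageMaker input channel.

    Prefers the SM_CHANNEL_<NAME> environment variable and falls back to
    the default location under /opt/ml/input/data.
    """
    env_name = f"SM_CHANNEL_{channel.upper()}"
    default = f"/opt/ml/input/data/{channel}"
    return Path(os.environ.get(env_name, default))


# Channel names match the keys of the dictionary passed to `.fit()`.
print(channel_dir("tokenized_train"))
```

A training script could then pass the result to `datasets.load_from_disk()` to read the tokenized splits saved earlier.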

In [11]:
import time
from pathlib import Path

In [12]:
instance_type='ml.g5.2xlarge'
# instance_type='local_gpu'

In [13]:
if instance_type in ['local', 'local_gpu']:
    from sagemaker.local import LocalSession
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}
    data_path=f'file://{Path.cwd()}/data'
else:
    sagemaker_session = sagemaker.session.Session()
    data_path=training_input_path

In [14]:
# define Training Job Name 
job_name = f'huggingface-peft-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

from sagemaker.pytorch.estimator import PyTorch

# hyperparameters, which are passed into the training job
hyperparameters ={
  'model_id': model_id,                                 # pre-trained model
  # 'model_id': "philschmid/flan-t5-xxl-sharded-fp16",
  'epochs': 1,                                          # number of training epochs
  'per_device_train_batch_size': 100, #50, #15,         # batch size for training
  'eval_sample': 50,                                    # batch size for evaluation
  'lr': 2e-4,                                           # learning rate used during training
}

# create the Estimator
estimator = PyTorch(
    entry_point          = 'run_clm.py',               # train script
    source_dir           = f'{Path.cwd()}/flan_t5_xl_with_LoRA',  # directory which includes all the files needed for training
    instance_type        = instance_type,              # instances type used for the training job
    instance_count       = 1,                          # the number of instances used for training
    base_job_name        = job_name,                   # the name of the training job
    role                 = role,                       # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,                        # the size of the EBS volume in GB
    framework_version    = '2.0',                      # the PyTorch version used in the training job
    py_version           = 'py310',                    # the python version used in the training job
    sagemaker_session    = sagemaker_session,
    hyperparameters      = hyperparameters
)

Now we can start the training job by calling the `.fit()` method, which passes the S3 paths to the training script.

In [15]:
# define a data input dictionary with our uploaded s3 uris
data = {
    'tokenized_train': data_path+'/tokenized_train', 
    'tokenized_test': data_path+'/tokenized_test',
    'test': data_path+'/test',
}

# starting the train job with our uploaded datasets as input
estimator.fit(data, wait=False)

INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: huggingface-peft-2023-06-26-01-45-31-2023-06-26-01-45-31-106


Using provided s3_resource


In [16]:
estimator.logs()

2023-06-26 01:45:31 Starting - Starting the training job...
2023-06-26 01:45:48 Starting - Preparing the instances for training......
2023-06-26 01:46:45 Downloading - Downloading input data...
2023-06-26 01:47:06 Training - Downloading the training image...........................
2023-06-26 01:51:42 Training - Training image download completed. Training in progress....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-06-26 01:52:19,282 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-06-26 01:52:19,294 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)
2023-06-26 01:52:19,303 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-06-26 01:52:19,304 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-06-26 01:

In this example, the SageMaker training job took `7946 seconds`, which is about `2.2 hours`. The ml.g5.2xlarge instance we used costs `$1.515 per hour` on demand (US region pricing). As a result, the total cost of fine-tuning flan-t5-xl was only about `$3.34`.
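
The cost figure can be reproduced with a few lines of arithmetic, using only the duration and hourly price quoted above:

```python
training_seconds = 7946        # billable training time reported for this job
price_per_hour_usd = 1.515     # on-demand ml.g5.2xlarge price quoted above

hours = training_seconds / 3600
cost = hours * price_per_hour_usd
print(f"{hours:.2f} h -> ${cost:.2f}")
```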

You can reduce the training cost further by using Spot instances. However, Spot instance interruptions may increase the total training time. See the [SageMaker pricing page](https://aws.amazon.com/sagemaker/pricing/) for more details on instance pricing.

## 4. Deploy the model to Amazon SageMaker Endpoint

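When using `peft` for training, you normally end up with adapter weights. To make the model easier to deploy, we added the `merge_and_unload()` method, which merges the base model with the adapter. This makes it possible to use the `pipelines` feature of the `transformers` library.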
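
Conceptually, merging a LoRA adapter folds the low-rank update into the base weight, `W_merged = W + scale * (B @ A)`, so the merged layer produces the same output as base plus adapter. The pure-Python sketch below checks this on tiny made-up matrices. It illustrates the idea only and is not PEFT's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b, scale=1.0):
    """Element-wise a + scale * b for two matrices of the same shape."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Made-up values for illustration.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2 x 2)
B = [[0.5], [1.0]]             # adapter factor B (2 x 1)
A = [[2.0, -1.0]]              # adapter factor A (1 x 2)
scale = 0.5                    # plays the role of lora_alpha / r

# Fold the adapter into the base weight once.
W_merged = add(W, matmul(B, A), scale)

x = [[3.0], [4.0]]             # input as a column vector
separate = add(matmul(W, x), matmul(B, matmul(A, x)), scale)  # base + adapter path
merged = matmul(W_merged, x)                                   # single merged matmul
print(merged, separate)
```

After merging, inference needs only the single weight matrix, which is why the merged model can be served like a regular `transformers` model.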

SageMaker starts the deployment process by creating a SageMaker Endpoint Configuration and a SageMaker Endpoint. The Endpoint Configuration defines the model and the instance type.

In [17]:
!rm -rf ./model
!aws s3 cp {estimator.model_data} ./model/model.tar.gz
!tar -xzvf ./model/model.tar.gz -C ./model/ # && mv ./model/model.tar.gz ./

download: s3://sagemaker-us-west-2-322537213286/huggingface-peft-2023-06-26-01-45-31-2023-06-26-01-45-31-106/output/model.tar.gz to model/model.tar.gz
tokenizer_config.json
rogue.pickle
tokenizer.json
adapter_config.json
adapter_model.bin
special_tokens_map.json


In [18]:
instance_type= "ml.g5.4xlarge"
# instance_type= "local_gpu"

In [19]:
from pathlib import Path

# source_dir=f"file://{Path.cwd()}/src"

if instance_type in ['local', 'local_gpu']:
    from sagemaker.local import LocalSession
    
    sagemaker_session = LocalSession()
    sagemaker_session.config = {'local': {'local_code': True}}
    model_data=f"file://{Path.cwd()}/model/model.tar.gz"
else:
    sagemaker_session = sagemaker.session.Session()
    model_data=estimator.model_data
    

In [20]:
env = {
    'SAGEMAKER_MODEL_SERVER_TIMEOUT': str(3600),
    'MODEL_CACHE_ROOT': '/opt/ml/model', 
    'SAGEMAKER_ENV': '1',
    'SAGEMAKER_SUBMIT_DIRECTORY': '/opt/ml/code',
    'TS_DEFAULT_WORKERS_PER_MODEL': '1', 
}


In [23]:
# from sagemaker.huggingface import HuggingFaceModel
from sagemaker.pytorch.model import PyTorchModel

# create the model object (PyTorchModel is used here instead of HuggingFaceModel)
model = PyTorchModel(
    entry_point='inference.py',
    source_dir=f'{Path.cwd()}/flan_t5_xl_with_LoRA',
    model_data=model_data,
    role=role, 
    framework_version="2.0", 
    py_version="py310",
    model_server_workers=1,
    sagemaker_session=sagemaker_session,
    env=env
)

Now we can deploy the model by calling `deploy()` on the model object, passing the desired number of instances and the instance type.

In [24]:
import time
endpoint_name = f'huggingface-peft-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# deploy model to SageMaker Inference
predictor = model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type=instance_type
)

INFO:sagemaker:Repacking model artifact (s3://sagemaker-us-west-2-322537213286/huggingface-peft-2023-06-26-01-45-31-2023-06-26-01-45-31-106/output/model.tar.gz), script artifact (/root/aws-ai-ml-workshop-kr/genai/jumpstart/text_to_text/flan_t5_xl_with_LoRA), and dependencies ([]) into single tar.gz file located at s3://sagemaker-us-west-2-322537213286/pytorch-inference-2023-06-26-03-22-15-557/model.tar.gz. This may take some time depending on model size...
INFO:sagemaker:Creating model with name: pytorch-inference-2023-06-26-03-22-19-571
INFO:sagemaker:Creating endpoint-config with name huggingface-peft-2023-06-26-03-22-15
INFO:sagemaker:Creating endpoint with name huggingface-peft-2023-06-26-03-22-15


----------!

Note: It can take 5 to 10 minutes for the SageMaker endpoint to bring the instance online and download the model before it can accept inference requests.
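
If the deployment is started without blocking (for example with `deploy(..., wait=False)`), the endpoint status can be polled until it is ready. The sketch below assumes a boto3 SageMaker client is passed in as `client`. `wait_until_in_service` is a hypothetical helper, and the timeout and poll interval are arbitrary.

```python
import time


def wait_until_in_service(client, endpoint_name, poll_seconds=30, timeout_seconds=1200):
    """Poll DescribeEndpoint until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after timeout")
        time.sleep(poll_seconds)


# Usage (requires AWS credentials):
#   import boto3
#   wait_until_in_service(boto3.client("sagemaker"), endpoint_name)
```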

Let's test it with an example from the `test` split.

In [25]:
from random import randint
from datasets import load_dataset

# Load dataset from the hub
test_dataset = load_dataset("samsum", split="test")

# select a random test sample (randint is inclusive on both ends)
sample = test_dataset[randint(0, len(test_dataset) - 1)]

# format sample
prompt_template = f"Summarize the chat dialogue:\n{{dialogue}}\n---\nSummary:\n"

fomatted_sample = {
  "inputs": prompt_template.format(dialogue=sample["dialogue"]),
  "parameters": {
    "do_sample": True, # sample output predicted probabilities
    "top_p": 0.9, # nucleus (top-p) sampling, Holtzman et al. (2019)
    "temperature": 0.1, # increasing the likelihood of high probability words and decreasing the likelihood of low probability words
    "max_new_tokens": 100, # The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt
  }
}
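
The effect of the `temperature` and `top_p` parameters in the request above can be illustrated on a toy distribution. This is a minimal sketch with made-up logits, not the generation code that runs on the endpoint. It shows that a temperature of 0.1 concentrates almost all probability on the top token, so sampling behaves close to greedy decoding.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, 0.1]            # made-up scores for four tokens
warm = softmax(logits, temperature=1.0)
cold = softmax(logits, temperature=0.1)
print("T=1.0 candidates:", sorted(top_p_filter(warm, 0.9)))
print("T=0.1 candidates:", sorted(top_p_filter(cold, 0.9)))
```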



In [26]:
# predict
res = predictor.predict(fomatted_sample)


print(res[0]["generated_text"].split("Summary:")[-1])

# Sample model output: Kirsten and Alex are going bowling this Friday at 7 pm. They will meet up and then go together.

Sean has decided that his spirit animal is a tortoise. Tiffany thinks Sean is a wasp.


Now let's compare the model's summary of the dialogue with the reference summary of the test sample.

In [27]:
print(sample["summary"])

# Test sample summary: Kirsten reminds Alex that the youth group meets this Friday at 7 pm to go bowling.

Sean believes his spirit animal is a tortoise and Tiffany's could be a wasp. 
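
To put a rough number on such a comparison, a unigram-overlap F1 score can serve as a simplified stand-in for ROUGE-1. The sketch below uses plain whitespace tokenization with no stemming or punctuation handling, so its values will differ from a proper ROUGE implementation. `unigram_f1` is a hypothetical helper.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 over overlapping lowercase whitespace tokens (a rough ROUGE-1 approximation)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


prediction = "Sean has decided that his spirit animal is a tortoise. Tiffany thinks Sean is a wasp."
reference = "Sean believes his spirit animal is a tortoise and Tiffany's could be a wasp."
print(f"unigram F1: {unigram_f1(prediction, reference):.2f}")
```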


Finally, we delete the model and the endpoint.

In [28]:
predictor.delete_model()
predictor.delete_endpoint()

INFO:sagemaker:Deleting model with name: pytorch-inference-2023-06-26-03-22-19-571
INFO:sagemaker:Deleting endpoint configuration with name: huggingface-peft-2023-06-26-03-22-15
INFO:sagemaker:Deleting endpoint with name: huggingface-peft-2023-06-26-03-22-15
