<a href="https://colab.research.google.com/github/Soy-code/one-day-LLM-FT/blob/main/Alpaca_LLaMa_instruction_fintuning_torch_style.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# From Llama to Alpaca: Fine-tuning an LLM with Weights & Biases

In this notebook you will learn how to fine-tune a pretrained Llama model on an instruction dataset. We use an updated version of the Alpaca dataset whose instructions and outputs were generated with GPT-4 instead of davinci-003 (GPT-3), which gives a higher-quality instruction dataset. See the [official repository page](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#how-good-is-the-data) for details.

This notebook requires an A100/A10 GPU with at least 24GB of memory. You can run it on a T4 by adjusting the parameters, but the run time becomes very long.

This notebook has an accompanying project and report: [wandb](wandb.me/alpaca).

In [2]:
!pip install wandb
#!pip install git+https://github.com/huggingface/transformers@v4.31-release
!pip install transformers
!pip install accelerate -U
!pip install trl



## Prepare your Instruction Dataset

Download the Alpaca dataset (GPT-4 curated instructions and outputs).

In [3]:
!wget https://raw.githubusercontent.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/main/data/alpaca_gpt4_data.json

--2024-11-13 07:18:38--  https://raw.githubusercontent.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/main/data/alpaca_gpt4_data.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43379276 (41M) [text/plain]
Saving to: ‘alpaca_gpt4_data.json’


2024-11-13 07:18:38 (218 MB/s) - ‘alpaca_gpt4_data.json’ saved [43379276/43379276]



Load the dataset (PyTorch style).

In [4]:
from datasets import load_dataset
dataset = load_dataset("json", data_files="alpaca_gpt4_data.json")
# by default the data is loaded as the 'train' split

Generating train split: 0 examples [00:00, ? examples/s]

Dataset structure:

In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output'],
        num_rows: 52002
    })
})

Let's look at one sample.

In [6]:
dataset['train'][0]

{'instruction': 'Give three tips for staying healthy.',
 'input': '',
 'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.'}

In [7]:
dataset['train'][9]

{'instruction': 'Evaluate this sentence for spelling and grammar mistakes',
 'input': 'He finnished his meal and left the resturant',
 'output': 'There are two spelling errors in the sentence. The corrected sentence should be: "He finished his meal and left the restaurant."'}

The dataset contains instructions and outputs. Because the model is trained to predict the next token, one option is simply to concatenate the two and train the model on the result. Ideally, the prompt is structured so that it clearly marks where the input and the output are.

In [10]:
## We will instruction-tune Llama 2

def prompt_no_input(example):
    return ("Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Response:\n").format_map(example)

In [11]:
row = dataset['train'][0]
print(prompt_no_input(row))

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:



Some instructions come with context in the `input` field.

In [12]:
def prompt_input(example):
    return ("Below is an instruction that describes a task, paired with an input that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n").format_map(example)

In [13]:
row = dataset['train'][9]
print(prompt_input(row))

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Evaluate this sentence for spelling and grammar mistakes

### Input:
He finnished his meal and left the resturant

### Response:



For now we only build the prompt. Later we can append the output along with an appropriate amount of padding.

Implement a function that handles both the case with an input and the case without one.

In [14]:
def create_alpaca_prompt(example):
    example['prompt'] = prompt_no_input(example) if example["input"] == "" else prompt_input(example)
    return example

In [15]:
prompt_dataset = dataset.map(create_alpaca_prompt)

Map:   0%|          | 0/52002 [00:00<?, ? examples/s]

In [17]:
prompt_dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'prompt'],
        num_rows: 52002
    })
})

In [18]:
print(prompt_dataset['train']['prompt'][0])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:



We need to process the target and append the end-of-sequence (EOS) token. For Llama this is `"</s>"`.

In [19]:
def pad_eos(example):
    EOS_TOKEN = "</s>"
    example['answer'] = f"{example['output']}{EOS_TOKEN}"
    return example

In [20]:
answer_dataset = prompt_dataset.map(pad_eos)
answer_dataset

Map:   0%|          | 0/52002 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'prompt', 'answer'],
        num_rows: 52002
    })
})

In [22]:
dataset['train'][0]

{'instruction': 'Give three tips for staying healthy.',
 'input': '',
 'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.'}

In [21]:
print(answer_dataset['train']['answer'][0])

1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.</s>


Finally, combine the user prompt and the model answer.

In [23]:
def get_example(example):
    example['example'] = example['prompt'] + example['answer']
    return example


final_dataset = answer_dataset.map(get_example)

Map:   0%|          | 0/52002 [00:00<?, ? examples/s]

This is what the model needs to see and learn from.

In [24]:
print(final_dataset['train']['example'][0])

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.</s>


## Converting text to numbers: Tokenizer

We need to convert the dataset into tokens. This is easy to do with the transformers tokenizer.

In [25]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

In [26]:
model_id = 'NousResearch/Llama-2-7b-chat-hf'
tokenizer = AutoTokenizer.from_pretrained(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/746 [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

In [27]:
print(f"Pad Token id: {tokenizer.pad_token_id} and Pad Token: {tokenizer.pad_token}")
print(f"EOS Token id: {tokenizer.eos_token_id} and EOS Token: {tokenizer.eos_token}")

Pad Token id: 0 and Pad Token: <unk>
EOS Token id: 2 and EOS Token: </s>


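Many tutorials recommend the following:

tokenizer.pad_token = tokenizer.eos_token

In that case, however, pad tokens are ignored during training, so the EOS token is ignored as well, which makes it hard for the model to learn where a sequence ends. (This was observed when training Llama; it is unclear whether other models behave the same way.)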
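The problem can be sketched without any library. With `mlm=False`, the language-modeling collator copies `input_ids` into `labels` and replaces every position whose id equals `pad_token_id` with -100. The helper `build_labels` below is a hypothetical stand-in for that step, and the token ids in the middle of the sequence are only illustrative.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def build_labels(input_ids, pad_token_id):
    """Copy input_ids into labels and mask every pad position (hypothetical helper)."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

EOS_ID = 2  # "</s>" in the Llama tokenizer, as printed above
UNK_ID = 0  # "<unk>", used as the pad token in this notebook

def padded_sequence(pad_id):
    """A left-padded toy sequence that ends with EOS; the middle ids are illustrative."""
    return [pad_id, pad_id, 1, 1619, 15729, EOS_ID]

# pad == eos: the final EOS label is masked together with the padding
labels_pad_is_eos = build_labels(padded_sequence(EOS_ID), pad_token_id=EOS_ID)
# pad == <unk>: only the padding is masked, EOS stays a training target
labels_pad_is_unk = build_labels(padded_sequence(UNK_ID), pad_token_id=UNK_ID)

print(labels_pad_is_eos)  # [-100, -100, 1, 1619, 15729, -100]
print(labels_pad_is_unk)  # [-100, -100, 1, 1619, 15729, 2]
```

In the first case the model never receives a training signal for emitting `</s>`, which is the failure described above.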

In [28]:
tokenizer.pad_token_id = 0  # at inference time, using tokenizer.eos_token_id instead is also fine

print(f"Pad Token id: {tokenizer.pad_token_id} and Pad Token: {tokenizer.pad_token}")
print(f"EOS Token id: {tokenizer.eos_token_id} and EOS Token: {tokenizer.eos_token}")

Pad Token id: 0 and Pad Token: <unk>
EOS Token id: 2 and EOS Token: </s>


In [29]:
tokenizer.encode("My experiments are going strong!")

[1, 1619, 15729, 526, 2675, 4549, 29991]

In [30]:
tokenizer.encode("My experiments are going strong!", padding='max_length', max_length=10)

[0, 0, 0, 1, 1619, 15729, 526, 2675, 4549, 29991]

In [31]:
tokenizer.encode("My experiments are going strong!",
                 padding='max_length',
                 max_length=10,
                 return_tensors="pt")

tensor([[    0,     0,     0,     1,  1619, 15729,   526,  2675,  4549, 29991]])

In [32]:
tokenizer(["My experiments are going strong!",
           "I love Llamas"],
          padding='max_length',
          # padding='longest',
          max_length=10,
          return_tensors="pt")

{'input_ids': tensor([[    0,     0,     0,     1,  1619, 15729,   526,  2675,  4549, 29991],
        [    0,     0,     0,     0,     1,   306,  5360,   365,  5288,   294]]), 'attention_mask': tensor([[0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]])}
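As the outputs above show, this tokenizer pads on the left and marks the padded positions with 0 in `attention_mask`. A minimal pure-Python sketch of that behaviour follows; `left_pad` is a hypothetical helper, not part of transformers, and the token ids are copied from the tokenizer outputs above.

```python
def left_pad(batch, max_length, pad_token_id=0):
    """Left-pad each id list to max_length and build the matching attention mask."""
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max_length - len(ids)
        if n_pad < 0:
            raise ValueError("sequence longer than max_length")
        input_ids.append([pad_token_id] * n_pad + ids)
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids copied from the tokenizer outputs above
batch = [
    [1, 1619, 15729, 526, 2675, 4549, 29991],  # "My experiments are going strong!"
    [1, 306, 5360, 365, 5288, 294],            # "I love Llamas"
]
padded = left_pad(batch, max_length=10)
for ids, mask in zip(padded["input_ids"], padded["attention_mask"]):
    print(ids, mask)
```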

In [33]:
x = tokenizer(["My experiments are going strong!",
           "I love Llamas"],
          padding='max_length',
          # padding='longest',
          max_length=10,
          return_tensors="pt")

In [34]:
tokenizer.decode(x['input_ids'][0])

'<unk><unk><unk><s> My experiments are going strong!'

In [35]:
for i, example in enumerate(final_dataset['train']['example'][0:3]):
    print(f"---------Data sample {i+1}--------------")
    print(example)
    print("\n")

---------Data sample 1--------------
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.</s>


---------Data sample 2----------

## Data collator

For causal language modeling, we can train GPT-style models with DataCollatorForLanguageModeling(tokenizer, mlm=False), that is, with masked language modeling (dynamic masking) turned off.

### DataCollatorForLanguageModeling

In [36]:
from transformers import DataCollatorForLanguageModeling

causal_model_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal (GPT-style) model, so mlm=False

In [37]:
out = causal_model_collator([tokenizer(example) for example in final_dataset['train']['example'][:1]])
for key in out:
    print(f"{key} : {out[key]}")

input_ids : tensor([[    1, 13866,   338,   385, 15278,   393, 16612,   263,  3414, 29889,
         14350,   263,  2933,   393,  7128,  2486,  1614,  2167,   278,  2009,
         29889,    13,    13,  2277, 29937,  2799,  4080, 29901,    13, 29954,
           573,  2211, 25562,   363,  7952,   292,  9045, 29891, 29889,    13,
            13,  2277, 29937, 13291, 29901,    13, 29896, 29889,   382,   271,
           263,  6411,  8362,   322, 18254,   768,  2738,   652,   300, 29901,
          8561,  1854,   596,   592,  1338,   526, 20978,   573,   310,   263,
         12875,   310,   285, 21211,   322, 18655,  1849, 29892, 20793, 26823,
         29892,  3353,  2646,  1144, 29892,   322,  9045, 29891,   285,  1446,
         29889,   910,  6911,   304,  3867,   596,  3573,   411,   278, 18853,
         18254,   374,  1237,   304,   740,   472,   967,  1900,   322,   508,
          1371,  5557, 17168,   293, 10267,  2129, 29889,    13,    13, 29906,
         29889,  2201,   482,   297,  49

In [38]:
print(tokenizer.decode(out['input_ids'][0]))

<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.</s>


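### Using DataCollatorForCompletionOnlyLM for fine-tuning

With DataCollatorForLanguageModeling, the model learns to predict the next token starting from the very first token of the user's input. What we actually want is for the model to learn to generate a response when given an instruction. To do that, we mask the user's input so that it does not contribute to the loss during training. In other words, for instruction fine-tuning we do not train the model to predict the next token of the instruction; we mask the instruction and train the model to predict the response.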
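The masking idea can be sketched in plain Python: locate the token ids of the response template inside the sequence, then set every label up to and including the template to -100. The helpers `find_sublist` and `mask_prompt` and all token ids below are hypothetical; this is an illustration of the idea under those assumptions, not the TRL implementation.

```python
IGNORE_INDEX = -100

def find_sublist(haystack, needle):
    """Return the start index of the first occurrence of needle in haystack, or -1."""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1

def mask_prompt(input_ids, response_template_ids):
    """Keep labels only for tokens after the response template (hypothetical helper)."""
    start = find_sublist(input_ids, response_template_ids)
    if start == -1:
        # Template not found: this sketch ignores the whole example
        # rather than training on the prompt.
        return [IGNORE_INDEX] * len(input_ids)
    response_start = start + len(response_template_ids)
    return [IGNORE_INDEX] * response_start + input_ids[response_start:]

# Made-up ids: 1 = BOS, 50..52 = instruction, 90 91 = "Response:", 7 8 = answer, 2 = EOS
input_ids = [1, 50, 51, 52, 90, 91, 7, 8, 2]
labels = mask_prompt(input_ids, response_template_ids=[90, 91])
print(labels)  # [-100, -100, -100, -100, -100, -100, 7, 8, 2]
```

Only the answer tokens and the final EOS keep their labels, so only they contribute to the loss.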

In [40]:
from trl import DataCollatorForCompletionOnlyLM

response_template = "Response:"   # the collator locates this template in each example
completion_only_collator = DataCollatorForCompletionOnlyLM(response_template=response_template, tokenizer=tokenizer)

In [41]:
out = completion_only_collator([tokenizer(example) for example in final_dataset['train']['example'][:1]])
for key in out:
    print(f"{key} : {out[key]}")

# In PyTorch, -100 is the label id that the loss ignores, so those positions incur no cost. Everything before the response should be -100.

input_ids : tensor([[    1, 13866,   338,   385, 15278,   393, 16612,   263,  3414, 29889,
         14350,   263,  2933,   393,  7128,  2486,  1614,  2167,   278,  2009,
         29889,    13,    13,  2277, 29937,  2799,  4080, 29901,    13, 29954,
           573,  2211, 25562,   363,  7952,   292,  9045, 29891, 29889,    13,
            13,  2277, 29937, 13291, 29901,    13, 29896, 29889,   382,   271,
           263,  6411,  8362,   322, 18254,   768,  2738,   652,   300, 29901,
          8561,  1854,   596,   592,  1338,   526, 20978,   573,   310,   263,
         12875,   310,   285, 21211,   322, 18655,  1849, 29892, 20793, 26823,
         29892,  3353,  2646,  1144, 29892,   322,  9045, 29891,   285,  1446,
         29889,   910,  6911,   304,  3867,   596,  3573,   411,   278, 18853,
         18254,   374,  1237,   304,   740,   472,   967,  1900,   322,   508,
          1371,  5557, 17168,   293, 10267,  2129, 29889,    13,    13, 29906,
         29889,  2201,   482,   297,  49

In [42]:
print(tokenizer.decode(out['input_ids'][0]))

<s> Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.</s>


## Data split

In [43]:
train_test_split = final_dataset['train'].train_test_split(test_size=1000)

In [44]:
# Check the split datasets.
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

print(f"Number of training examples: {len(train_dataset)}")
print(f"Number of evaluation examples: {len(eval_dataset)}")

Number of training examples: 51002
Number of evaluation examples: 1000


## Tokenization preprocessing and DataLoader setup

In [48]:
tokenized_train_dataset = train_dataset.map(lambda x : tokenizer(x['example']))
tokenized_eval_dataset = eval_dataset.map(lambda x : tokenizer(x['example']))

In [49]:
tokenized_train_dataset = tokenized_train_dataset.remove_columns(['instruction', 'input', 'output', 'prompt', 'answer', 'example'])
tokenized_eval_dataset = tokenized_eval_dataset.remove_columns(['instruction', 'input', 'output', 'prompt', 'answer', 'example'])

Connect the tokenized datasets to dataloaders.

In [50]:
from torch.utils.data import DataLoader
from transformers import default_data_collator

seed = 42
torch.manual_seed(seed)
batch_size = 16  # I have an A100 GPU with 40GB of RAM 😎

train_dataloader = DataLoader(
    tokenized_train_dataset,
    batch_size=batch_size,
    collate_fn=completion_only_collator,
)

eval_dataloader = DataLoader(
    tokenized_eval_dataset,
    batch_size=batch_size,
    collate_fn=completion_only_collator,
    shuffle=False,
)

In [51]:
b = next(iter(train_dataloader))
b
# in labels, padded positions are set to -100

{'input_ids': tensor([[    0,     0,     0,  ..., 21106, 29879, 29958],
        [    0,     0,     0,  ..., 21106, 29879, 29958],
        [    0,     0,     0,  ..., 21106, 29879, 29958],
        ...,
        [    0,     0,     0,  ..., 21106, 29879, 29958],
        [    0,     0,     0,  ..., 21106, 29879, 29958],
        [    1, 13866,   338,  ..., 21106, 29879, 29958]]), 'attention_mask': tensor([[0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        ...,
        [0, 0, 0,  ..., 1, 1, 1],
        [0, 0, 0,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]]), 'labels': tensor([[ -100,  -100,  -100,  ..., 21106, 29879, 29958],
        [ -100,  -100,  -100,  ..., 21106, 29879, 29958],
        [ -100,  -100,  -100,  ..., 21106, 29879, 29958],
        ...,
        [ -100,  -100,  -100,  ..., 21106, 29879, 29958],
        [ -100,  -100,  -100,  ..., 21106, 29879, 29958],
        [ -100,  -100,  -100,  ..., 21106, 29879, 29958]])}

In [52]:
tokenizer.decode(b["input_ids"][0])

'<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk

In [53]:
# Keep only the tokens whose label is not -100
valid_labels = [token_id for token_id in b["labels"][0] if token_id != -100]

# Decode the valid tokens
decoded_text = tokenizer.decode(valid_labels)

# Print the result, which excludes the leading padding
print(decoded_text)



The given text describes a story, which indicates that it is fiction.</s>
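To make concrete why positions labelled -100 do not affect training, here is a minimal pure-Python sketch of a mean negative log-likelihood computed only over non-ignored positions (PyTorch's cross-entropy uses `ignore_index=-100` by default). The function `masked_nll` and the toy probabilities are hypothetical, and the sketch assumes `probs[t]` is already aligned with `labels[t]`.

```python
import math

IGNORE_INDEX = -100

def masked_nll(probs, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignore_index.

    probs[t] is the predicted distribution (a list of probabilities) at position t.
    """
    losses = [-math.log(p[y]) for p, y in zip(probs, labels) if y != ignore_index]
    if not losses:
        return float("nan")  # nothing to average over
    return sum(losses) / len(losses)

# Toy vocabulary of 3 tokens, 4 positions; the first two positions are prompt/padding
probs = [
    [0.98, 0.01, 0.01],
    [0.01, 0.98, 0.01],
    [0.10, 0.80, 0.10],
    [0.25, 0.25, 0.50],
]
labels = [IGNORE_INDEX, IGNORE_INDEX, 1, 2]

loss = masked_nll(probs, labels)

# Changing the predictions at ignored positions leaves the loss unchanged
uniform = [1 / 3, 1 / 3, 1 / 3]
loss_other_prompt = masked_nll([uniform, uniform] + probs[2:], labels)
print(loss, loss_other_prompt)
```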


## Train

We manage all the hyperparameters as follows.

In [54]:
from types import SimpleNamespace

gradient_accumulation_steps = 2

config = SimpleNamespace(
    project_name='llama-ft-alpaca-prj',
    model_id=model_id,
    dataset_name="alpaca-gpt4",
    precision="bf16",  # faster and better than fp16, requires new GPUs
    n_freeze=24,  # How many layers we don't train; Llama 7B has 32.
    # Training the model as-is would be full fine-tuning.
    # Llama has 32 layers: we freeze 24 and train only the remaining 8 decoder layers.
    lr=2e-4,
    n_eval_samples=10,  # How many samples to generate on validation
    epochs=3,  # we do 3 passes over the dataset.
    gradient_accumulation_steps=gradient_accumulation_steps,  # every how many iterations we update the weights; simulates larger batch sizes
    batch_size=batch_size,  # what my GPU can handle, depends on how many layers we are training
    log_model=False,  # upload the model to W&B?
    gradient_checkpointing = True,  # saves even more memory
    freeze_embed = True,  # why train this? let's keep them frozen ❄️
    seed=seed,
    report_to=['tensorboard'],
)

config.total_train_steps = config.epochs * len(train_dataloader) // config.gradient_accumulation_steps

In [55]:
print(f"We will train for {config.total_train_steps} steps and evaluate every epoch")

We will train for 4782 steps and evaluate every epoch
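The step count printed above can be reproduced by hand. This short check assumes the DataLoader keeps the last partial batch (the PyTorch default) and uses the dataset size, batch size, epochs, and accumulation steps from the cells above.

```python
import math

n_train, batch_size = 51002, 16   # from the split and DataLoader cells above
epochs, grad_accum = 3, 2         # from config

# The DataLoader keeps the last partial batch, so round up
batches_per_epoch = math.ceil(n_train / batch_size)
total_train_steps = epochs * batches_per_epoch // grad_accum

print(batches_per_epoch, total_train_steps)  # 3188 4782
```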


Load the pretrained model.

In [56]:
model = AutoModelForCausalLM.from_pretrained(
    config.model_id,
    device_map=0,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    use_cache=False,
)

config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

In [71]:
def param_count(m):
    params = sum([p.numel() for p in m.parameters()])/1_000_000
    trainable_params = sum([p.numel() for p in m.parameters() if p.requires_grad])/1_000_000
    print(f"Total params: {params:.2f}M, Trainable: {trainable_params:.2f}M")
    return params, trainable_params

params, trainable_params = param_count(model)

Total params: 6738.42M, Trainable: 1750.14M


Training the full model requires a lot of compute and memory, so we will tune only 8 layers. Llama has 32 in total.

In [72]:
# freeze layers (disable gradients)
# freeze every parameter first,
# then train only the layers from index 24 onward (plus the LM head)
for param in model.parameters(): param.requires_grad = False
for param in model.lm_head.parameters(): param.requires_grad = True
for param in model.model.layers[config.n_freeze:].parameters(): param.requires_grad = True

In [73]:
# Just freeze embeddings for small memory decrease
if config.freeze_embed:
    model.model.embed_tokens.weight.requires_grad_(False);

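We can also use gradient checkpointing to save even more memory (this slows training down, but by how much depends on your particular setup). There is a [good article](https://huggingface.co/docs/transformers/v4.18.0/en/performance) on the Hugging Face website about fitting large models in memory; I recommend checking it out!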


In [74]:
# save more memory
if config.gradient_checkpointing:
    model.gradient_checkpointing_enable()

In [75]:
params, trainable_params = param_count(model)

Total params: 6738.42M, Trainable: 1750.14M
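The two numbers above can be checked with a little arithmetic. The architecture sizes below (hidden size 4096, MLP size 11008, vocabulary 32000, no bias terms) are assumptions about Llama-2-7B that are not printed anywhere in this notebook; with them, the 8 unfrozen decoder layers plus the LM head account for the trainable count.

```python
# Architecture sizes assumed for Llama-2-7B (not printed anywhere in this notebook)
hidden, intermediate, vocab, n_layers = 4096, 11008, 32000, 32
n_freeze = 24  # config.n_freeze

attention = 4 * hidden * hidden      # q, k, v and o projections, no biases
mlp = 3 * hidden * intermediate      # gate, up and down projections
norms = 2 * hidden                   # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms

embed = vocab * hidden               # input embeddings (frozen)
lm_head = vocab * hidden             # output projection (trained)
final_norm = hidden                  # final RMSNorm (frozen)

total = embed + n_layers * per_layer + final_norm + lm_head
trainable = (n_layers - n_freeze) * per_layer + lm_head

print(f"Total params: {total / 1e6:.2f}M, Trainable: {trainable / 1e6:.2f}M")
```

Under these assumptions roughly a quarter of the parameters receive gradients, which is where most of the memory saving over full fine-tuning comes from.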


### Optimizer


In [76]:
from transformers import get_cosine_schedule_with_warmup

optim = torch.optim.Adam(model.parameters(), lr=config.lr, betas=(0.9,0.99), eps=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optim,
    num_training_steps=config.total_train_steps,
    num_warmup_steps=config.total_train_steps // 10,
)
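For intuition about the schedule configured above, here is a small sketch of a cosine schedule with linear warmup, written from the usual formula rather than copied from the library. The function `cosine_with_warmup` is hypothetical; it assumes a single half-cosine decay to zero after warmup.

```python
import math

def cosine_with_warmup(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given optimizer step: linear warmup, then half-cosine decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

total_steps = 4782                 # config.total_train_steps
warmup_steps = total_steps // 10   # 10% of training is warmup
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{cosine_with_warmup(step, 2e-4, warmup_steps, total_steps):.2e}")
```

The learning rate starts at zero, reaches `config.lr` at the end of warmup, and decays back to zero by the last step.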

In [77]:
# The model already computes the loss automatically in output.loss, and we will use that.
# So this function is not actually used.
def loss_fn(x, y):
    "A Flat CrossEntropy"
    return torch.nn.functional.cross_entropy(x.view(-1, x.shape[-1]), y.view(-1))

## Testing during training

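We are almost there. Let's create a simple function that samples from the model so we can occasionally inspect what it outputs. We simply wrap the model.generate method. The default sampling parameters come from GenerationConfig, which we load by passing the model ID.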


In [78]:
from types import SimpleNamespace
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained(config.model_id)
test_config = SimpleNamespace(
    max_new_tokens=256,
    gen_config=gen_config)

In [79]:
def generate(prompt, max_new_tokens=test_config.max_new_tokens, gen_config=gen_config):
    tokenized_prompt = tokenizer(prompt, return_tensors='pt')['input_ids'].cuda()
    with torch.inference_mode():
        output = model.generate(tokenized_prompt,
                            max_new_tokens=max_new_tokens,
                            generation_config=gen_config)
    return tokenizer.decode(output[0][len(tokenized_prompt[0]):], skip_special_tokens=True)

LoL 🤷

In [80]:
prompt = eval_dataset[14]["prompt"]
print(prompt + generate(prompt, 128))

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Select the correct word to complete the following sentence. Output the correct word.

### Input:
He was so _______ that he could barely speak.
Options:
A. petrified
B. paralyzed
C. permeated

### Response:
B. paralyzed


We can log these results to the project as a table every n steps.

In [81]:
import wandb
from tqdm.auto import tqdm

def prompt_table(examples, log=False, table_name="predictions"):
    table = wandb.Table(columns=["prompt", "generation", "concat", "output", "max_new_tokens", "temperature", "top_p"])

    for prompt, gpt4_output in tqdm(zip(examples['prompt'], examples['output']), leave=False):
        out = generate(prompt, test_config.max_new_tokens, test_config.gen_config)
        table.add_data(prompt, out, prompt+out, gpt4_output, test_config.max_new_tokens, test_config.gen_config.temperature, test_config.gen_config.top_p)
    if log:
        wandb.log({table_name:table})
    return table

def to_gpu(tensor_dict):
    return {k: v.to('cuda') for k, v in tensor_dict.items()}

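class Accuracy:
    "A simple token-accuracy metric compatible with HF causal LM outputs"
    def __init__(self):
        self.count = 0
        self.tp = 0.
    def update(self, logits, labels):
        # logits at position t predict the token at t+1, so shift before comparing
        preds = logits[:, :-1].argmax(dim=-1).reshape(-1).cpu()
        labels = labels[:, 1:].reshape(-1).cpu()
        mask = labels != -100  # skip prompt and padding positions
        tp = (preds[mask] == labels[mask]).sum().item()
        n = mask.sum().item()
        self.count += n
        self.tp += tp
        return tp / max(n, 1)
    def compute(self):
        return self.tp / max(self.count, 1)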

If you like, you can quickly add validation as well. You can also create the table at this step.

In [82]:
@torch.no_grad()
def validate():
    model.eval();
    eval_acc = Accuracy()
    loss, total_steps = 0., 0
    for step, batch in enumerate(pbar:=tqdm(eval_dataloader, leave=False)):
        if "length" in batch:
            del batch["length"]
        pbar.set_description(f"doing validation")
        batch = to_gpu(batch)
        total_steps += 1
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**batch)
            #loss += loss_fn(out.logits, batch["labels"])  # you could use out.loss and not shift the dataset
            loss += out.loss
        eval_acc.update(out.logits, batch["labels"])
    # we log results at the end
    wandb.log({"eval/loss": loss.item() / total_steps,
               "eval/accuracy": eval_acc.compute()})
    prompt_table(eval_dataset[:config.n_eval_samples], log=True)
    model.train();

We have now defined the loop that evaluates the model and logs its outputs to a table.

In [83]:
from pathlib import Path
def save_model(model, model_name, models_folder="models", log=False):
    """Save the model to wandb as an artifact
    Args:
        model (nn.Module): Model to save.
        model_name (str): Name of the model.
        models_folder (str, optional): Folder to save the model. Defaults to "models".
    """
    model_name = f"{wandb.run.id}_{model_name}"
    file_name = Path(f"{models_folder}/{model_name}")
    file_name.parent.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(file_name, safe_serialization=True)
    # save the tokenizer next to the model weights for easy inference
    tokenizer = AutoTokenizer.from_pretrained(model.name_or_path)
    tokenizer.save_pretrained(file_name)
    if log:
        at = wandb.Artifact(model_name, type="model")
        at.add_dir(file_name)
        wandb.log_artifact(at)

## The actual Loop
- Gradient accumulation and gradient scaling
- Sampling and saving a model checkpoint (this trains very quickly, so there is no need to save multiple checkpoints)
- We compute token accuracy, a better metric than the loss.
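Gradient accumulation can be illustrated with a toy one-parameter model. Dividing each micro-batch loss by the number of accumulation steps and summing the gradients reproduces the gradient of one large batch, provided the micro-batches have equal size. The model, data, and `mse_grad` helper below are made up for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
accum_steps = 2
micro = len(xs) // accum_steps

# One large batch
full_grad = mse_grad(w, xs, ys)

# Two micro-batches: scale each loss (and so each gradient) by 1/accum_steps and sum
accumulated = 0.0
for i in range(accum_steps):
    sl = slice(i * micro, (i + 1) * micro)
    accumulated += mse_grad(w, xs[sl], ys[sl]) / accum_steps

print(full_grad, accumulated)
```

For a language-model loss averaged over target tokens, the match is only approximate when micro-batches contain different numbers of unmasked tokens.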

In [84]:
wandb.init(project=config.project_name, # the project I am working on
           tags=["baseline","7b"],
           job_type="train",
           config=config) # the Hyperparameters I want to keep track of

# Training
acc = Accuracy()
model.train()
train_step = 0
for epoch in tqdm(range(config.epochs)):
    for step, batch in enumerate(tqdm(train_dataloader)):
        if "length" in batch:
            del batch["length"]

        batch = to_gpu(batch)
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            out = model(**batch)
            #loss = loss_fn(out.logits, batch["labels"]) / config.gradient_accumulation_steps  # you could use out.loss and not shift the dataset
            loss = out.loss / config.gradient_accumulation_steps
        loss.backward()  # run the backward pass outside the autocast context
        if (step + 1) % config.gradient_accumulation_steps == 0:
            # we can log the metrics to W&B
            wandb.log({"train/loss": loss.item() * config.gradient_accumulation_steps,
                       "train/accuracy": acc.update(out.logits, batch["labels"]),
                       "train/learning_rate": scheduler.get_last_lr()[0],
                       "train/global_step": train_step})
            optim.step()
            scheduler.step()
            optim.zero_grad(set_to_none=True)
            train_step += 1
    validate()
    # logging to wandb happens automatically if you are signed in

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:


Abort: 

In [None]:
# we save the model checkpoint at the end
#config.do_sample = True  # enable sampling

# del config.temperature  # remove the temperature setting
# del config.top_p  # remove the top_p setting
save_model(model, model_name=config.model_id.replace("/", "_"), models_folder="models/", log=config.log_model)

wandb.finish()

This takes about 70 minutes on an A100.

## Full Eval Dataset evaluation

Let's build a table that logs the model's predictions on the evaluation dataset (eval_dataset), for the first 250 samples.

In [None]:
with wandb.init(project=config.project_name, # the project I am working on
           job_type="eval",
           config=config): # the Hyperparameters I want to keep track of
    model.eval();
    prompt_table(eval_dataset[:250], log=True, table_name="eval_predictions")