# Tang Poetry Generator
This notebook takes the **MediaTek Breeze-7B** model as a base and fine-tunes it with the **LoRA** method, using a dataset from GitHub, into an LLM specialized in generating Tang poetry.

## Mount Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Download Datasets

References:
- [H. Lee < GAI 2024 hw5 >](https://github.com/CheeEn-Yu/GenAI-Hw5.git)
- [chinese-poetry](https://github.com/chinese-poetry/chinese-poetry/tree/master/%E5%85%A8%E5%94%90%E8%AF%97?fbclid=IwAR2bM14S42T-VtrvMi3wywCqKfYJraBtMl7QVTo0qyPMjX9jj9Vj3JepFBA)

In [3]:
source = "https://raw.githubusercontent.com/RyanCCJ/LLM-practice/refs/heads/master/Practice_II/Tang_poem_dataset"

!wget $source/Tang_training_data.json
!wget $source/Tang_testing_data.json

--2024-10-14 09:08:36--  https://raw.githubusercontent.com/RyanCCJ/LLM-practice/refs/heads/master/Practice_II/Tang_poem_dataset/Tang_training_data.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1409962 (1.3M) [text/plain]
Saving to: ‘Tang_training_data.json’


2024-10-14 09:08:36 (49.7 MB/s) - ‘Tang_training_data.json’ saved [1409962/1409962]

--2024-10-14 09:08:37--  https://raw.githubusercontent.com/RyanCCJ/LLM-practice/refs/heads/master/Practice_II/Tang_poem_dataset/Tang_testing_data.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 

## Install Packages
Note that some of these pinned versions may conflict with packages that are already installed.
After installing, restart the session from the 'Runtime' menu so that the changes take effect.

In [2]:
!pip install transformers==4.38.2
!pip install datasets==2.10.1
!pip install accelerate==0.28.0
!pip install peft==0.9.0
!pip install bitsandbytes==0.43.0
!pip install fsspec==2023.9.2
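
After restarting, it can help to confirm that the pinned versions are the ones actually installed. The sketch below is one possible check using only the standard library; the helper name `find_mismatches` is hypothetical, and the pins are copied from the install cell above. It reports what is installed on disk, so a module that was imported before the restart may still be an older version.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above.
PINNED = {
    "transformers": "4.38.2",
    "datasets": "2.10.1",
    "accelerate": "0.28.0",
    "peft": "0.9.0",
    "bitsandbytes": "0.43.0",
    "fsspec": "2023.9.2",
}


def find_mismatches(pins):
    """Return {package: (expected, installed)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Print one line per package whose installed version differs from the pin.
for name, (expected, installed) in find_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

An empty output means every pin matches.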



In [7]:
import os
import sys
import json
import tqdm
import torch
import torch.nn as nn
import logging
import warnings
warnings.filterwarnings("ignore")
import datasets, transformers
from datasets import load_dataset, load_from_disk
from transformers import (
    AutoTokenizer,
    AutoConfig,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    GenerationConfig
)
from peft import (
    PeftModel,
    LoraConfig,
    get_peft_model,
    prepare_model_for_int8_training,
)

## Set Hyperparameters

Package and Path Parameters

In [4]:
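output_dir = "/content/drive/MyDrive"  # Directory for the generated results
cache_dir = "./cache"  # Cache directory for downloaded model files
ckpt_dir = "./ckpt"  # Directory for model checkpoints
from_ckpt = False  # Whether to load adapter weights from a checkpoint (default: no)
ckpt_name = None  # Checkpoint to load when from_ckpt is True (default: none)
dataset_dir = "Tang_training_data.json"  # Path to the training dataset file
test_data_path = "Tang_testing_data.json"  # Path to the test dataset file
num_train_data = 1040  # Number of training examples to use; the maximum is 5000
num_epoch = 1  # Number of training epochs (more epochs take longer; long runs may disconnect on free Colab)
LEARNING_RATE = 3e-4  # Learning rate
logging_steps = 20  # Log the training loss every this many steps
save_steps = 65  # Save a checkpoint every this many steps
save_total_limit = 3  # Maximum number of checkpoints to keep
report_to = None  # Where to report training metrics; None leaves the Trainer default in place
MICRO_BATCH_SIZE = 4  # Micro-batch size (examples per forward/backward pass)
BATCH_SIZE = 16  # Effective batch size per optimizer step
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE  # Micro-batches accumulated per optimizer step
CUTOFF_LEN = 256  # Maximum sequence length in tokens; longer text is truncated
LORA_R = 8  # Rank r of the LoRA (Low-Rank Adaptation) update matrices
LORA_ALPHA = 16  # LoRA scaling factor alpha
LORA_DROPOUT = 0.05  # Dropout rate applied inside the LoRA layers
VAL_SET_SIZE = 0  # Validation set size (0 disables validation)
TARGET_MODULES = ["q_proj", "up_proj", "o_proj", "k_proj", "down_proj", "gate_proj", "v_proj"] # Modules that receive LoRA adapters; only the adapter weights are saved in checkpoints
device_map = "auto"  # Device placement; "auto" lets the library decide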
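
The batch-related values above determine how many optimizer steps the run takes, which in turn decides when losses are logged and checkpoints are saved. The following is a rough sketch of that arithmetic, assuming a single GPU; the helper `training_schedule` is hypothetical and only approximates how the Trainer counts steps.

```python
import math

# Values copied from the hyperparameter cell above.
num_train_data = 1040
num_epoch = 1
MICRO_BATCH_SIZE = 4
BATCH_SIZE = 16
logging_steps = 20
save_steps = 65


def training_schedule(n_examples, micro_batch, batch, epochs):
    """Return (gradient accumulation steps, total optimizer steps) for one device."""
    grad_accum = batch // micro_batch
    micro_batches_per_epoch = math.ceil(n_examples / micro_batch)
    steps_per_epoch = max(micro_batches_per_epoch // grad_accum, 1)
    return grad_accum, steps_per_epoch * epochs


grad_accum, total_steps = training_schedule(
    num_train_data, MICRO_BATCH_SIZE, BATCH_SIZE, num_epoch
)
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))
saved_at = list(range(save_steps, total_steps + 1, save_steps))

print(grad_accum, total_steps)  # 4 65
print(logged_at)                # [20, 40, 60]
print(saved_at)                 # [65]
```

With these settings one epoch is 65 optimizer steps, so `save_steps = 65` yields a single checkpoint at the end of the epoch.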

Text Generation Parameters

In [5]:
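max_len = 128      # Maximum number of newly generated tokens
temperature = 0.1  # Sampling randomness; lower values give more deterministic output
top_p = 0.3        # Top-p (nucleus) sampling threshold; controls output diversity
top_k = 5          # Top-k cutoff; defined here but commented out in generation_config below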
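
To give a feel for what `temperature` and `top_p` do, here is a minimal pure-Python sketch of temperature scaling followed by nucleus filtering. The logits are made up, and the two helper functions are illustrative stand-ins, not the code that `model.generate` runs internally.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose probabilities sum to at least top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

# Compare a neutral temperature with the notebook's temperature of 0.1.
for temperature in (1.0, 0.1):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 4) for p in probs], top_p_filter(probs, 0.3))
```

At a temperature of 0.1 nearly all probability mass falls on the top token, and `top_p = 0.3` then keeps only that token, so these settings behave close to greedy decoding.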

## Define Some Functions

In [8]:
# Build one training example
def generate_training_data(data_point):
    """
    (1) Goal:
        - Transform a data point (input and output texts) into tokens that the model can read

    (2) Arguments:
        - data_point: dict with the fields "instruction", "input", and "output", all of type str

    (3) Returns:
        - a dict with the model's input tokens, the attention mask, and the corresponding output targets

    (4) Example:
        - Given a dict data_point_1 with the str fields "instruction", "input", and "output", call the function like this:
            generate_training_data(data_point_1)
    """

    # construct full input prompt
    prompt = f"""\
    [INST] <<SYS>>
    You are a helpful assistant and good at writing Tang poem. 你是一個樂於助人的助手且擅長寫唐詩。
    <</SYS>>

    {data_point["instruction"]}
    {data_point["input"]}
    [/INST]"""

    # count the number of input tokens
    len_user_prompt_tokens = (
        len(
            tokenizer(
                prompt,
                truncation=True,
                max_length=CUTOFF_LEN + 1,
                padding="max_length",
            )["input_ids"]
        ) - 1
    )

    # transform input prompt into tokens
    full_tokens = tokenizer(
        prompt + " " + data_point["output"] + "</s>",
        truncation=True,
        max_length=CUTOFF_LEN + 1,
        padding="max_length",
    )["input_ids"][:-1]

    return {
        "input_ids": full_tokens,
        "labels": [-100] * len_user_prompt_tokens
        + full_tokens[len_user_prompt_tokens:],
        "attention_mask": [1] * (len(full_tokens)),
    }


# Generate a reply from the model
def evaluate(instruction, generation_config, max_len, input="", verbose=True):
    """
    (1) Goal:
        - Get the model's output for the given input strings

    (2) Arguments:
        - instruction: str, description of what you want the model to do
        - generation_config: transformers.GenerationConfig object specifying the decoding parameters for inference
        - max_len: int, maximum number of tokens the model may generate
        - input: str, input string the model needs to carry out the instruction, default is "" (no input)
        - verbose: bool, whether to print the model's output, default is True

    (3) Returns:
        - output: str, the model's response to the instruction and the input

    (4) Example:
        - If the instruction is "ABC", the input is "DEF", and you want an answer of at most 128 tokens, call the function like this:
            evaluate(instruction="ABC", generation_config=generation_config, max_len=128, input="DEF")
    """

    # construct full input prompt
    prompt = f"""\
    [INST] <<SYS>>
    You are a helpful assistant and good at writing Tang poem. 你是一個樂於助人的助手且擅長寫唐詩。
    <</SYS>>

    {instruction}
    {input}
    [/INST]"""

    # Convert the prompt text into the token IDs the model expects
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()

    # Generate a reply with the model
    generation_output = model.generate(
        input_ids=input_ids,
        generation_config=generation_config,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=max_len,
    )

    # Decode the generated reply and print it
    for s in generation_output.sequences:
        output = tokenizer.decode(s)
        output = output.split("[/INST]")[1].replace("</s>", "").replace("<s>", "").replace("Assistant:", "").replace("Assistant", "").strip()
        if (verbose):
            print(output)

    return output
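
The idea behind the `labels` field in `generate_training_data` is that prompt positions are set to -100, the default ignore index of the PyTorch cross-entropy loss, so that only the poem continuation contributes to the loss. The sketch below illustrates that idea on invented token IDs with right padding; `build_labels` is a hypothetical helper, not the notebook's implementation.

```python
IGNORE_INDEX = -100  # label value skipped by the PyTorch cross-entropy loss


def build_labels(prompt_ids, answer_ids, pad_id, max_len):
    """Return (input_ids, labels, attention_mask), padded on the right to max_len."""
    tokens = (prompt_ids + answer_ids)[:max_len]
    n_real = len(tokens)
    n_prompt = min(len(prompt_ids), max_len)
    n_pad = max_len - n_real

    input_ids = tokens + [pad_id] * n_pad
    # Mask the prompt and the padding; keep only the answer tokens as targets.
    labels = [IGNORE_INDEX] * n_prompt + tokens[n_prompt:] + [IGNORE_INDEX] * n_pad
    attention_mask = [1] * n_real + [0] * n_pad
    return input_ids, labels, attention_mask


# Toy token IDs, invented for illustration only.
prompt_ids = [1, 11, 12, 13]  # e.g. <s> plus the instruction and the first line
answer_ids = [21, 22, 2]      # the continuation followed by </s>

input_ids, labels, attention_mask = build_labels(prompt_ids, answer_ids, pad_id=0, max_len=10)
print(input_ids)       # [1, 11, 12, 13, 21, 22, 2, 0, 0, 0]
print(labels)          # [-100, -100, -100, -100, 21, 22, 2, -100, -100, -100]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

To my understanding, `DataCollatorForLanguageModeling(tokenizer, mlm=False)` rebuilds `labels` from `input_ids` and masks only padding, so whether prompt masking takes effect in a given run depends on the collator. Treat this as an illustration of the idea rather than a description of the training run in this notebook.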


## Set Model

In [9]:
model_name = "MediaTek-Research/Breeze-7B-Instruct-v0_1"


# Quantize with 4-bit NormalFloat (NF4)
nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the pretrained language model from the given model name or path
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    quantization_config=nf4_config,
    device_map=device_map,
    low_cpu_mem_usage = True
)

# Create the tokenizer (it takes no quantization config; only the model is quantized)
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    cache_dir=cache_dir,
)

# Set the decoding parameters used during inference
generation_config = GenerationConfig(
    do_sample=True,
    temperature=temperature,
    num_beams=1,
    top_p=top_p,
    # top_k=top_k,
    no_repeat_ngram_size=3,
    pad_token_id=2,
)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



## Before Fine-tuning
Let's first see what our model can do without fine-tuning.

In [10]:
# demo examples
test_tang_list = ['相見時難別亦難，東風無力百花殘。',
                  '重帷深下莫愁堂，臥後清宵細細長。',
                  '芳辰追逸趣，禁苑信多奇。']

# get the model output for each example
demo_before_finetune = []
for tang in test_tang_list:
  demo_before_finetune.append(f'模型輸入:\n以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。{tang}\n\n模型輸出:\n'
                              + evaluate('以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。',
                                         generation_config, max_len, tang, verbose = False))

# print the outputs
for idx in range(len(demo_before_finetune)):
  print(f"Example {idx + 1}:")
  print(demo_before_finetune[idx])
  print("-" * 80)

Example 1:
模型輸入:
以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。相見時難別亦難，東風無力百花殘。

模型輸出:
相知難，相別亦，難，百花，東，力，殘。
--------------------------------------------------------------------------------
Example 2:
模型輸入:
以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。重帷深下莫愁堂，臥後清宵細細長。

模型輸出:
重帷下，重帷之下，深下，莫愁之堂。

    清宵，清宵之宵，臥，臥之後，後，長，細細長，長長。
--------------------------------------------------------------------------------
Example 3:
模型輸入:
以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。芳辰追逸趣，禁苑信多奇。

模型輸出:
芳辰逐逸趣來，禁瑋信多姿。
--------------------------------------------------------------------------------


## Start Fine-tuning

In [11]:
# create the output directory you specify
os.makedirs(output_dir, exist_ok = True)
os.makedirs(ckpt_dir, exist_ok = True)

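# Load adapter weights from a checkpoint if from_ckpt is set
if from_ckpt:
    model = PeftModel.from_pretrained(model, ckpt_name)

# Prepare the quantized model for training (despite the name, this helper is applied here to the 4-bit model loaded above)
model = prepare_model_for_int8_training(model)

# Configure the LoRA adapters with LoraConfig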
config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=TARGET_MODULES,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)

# Set the tokenizer's padding token ID to 0
tokenizer.pad_token_id = 0

# Load and preprocess the training data
with open(dataset_dir, "r", encoding = "utf-8") as f:
    data_json = json.load(f)
with open("tmp_dataset.json", "w", encoding = "utf-8") as f:
    json.dump(data_json[:num_train_data], f, indent = 2, ensure_ascii = False)

data = load_dataset('json', data_files="tmp_dataset.json", download_mode="force_redownload")

# Split the data into training and validation sets (only if VAL_SET_SIZE is greater than 0)
if VAL_SET_SIZE > 0:
    train_val = data["train"].train_test_split(
        test_size=VAL_SET_SIZE, shuffle=True, seed=42
    )
    train_data = train_val["train"].shuffle().map(generate_training_data)
    val_data = train_val["test"].shuffle().map(generate_training_data)
else:
    train_data = data['train'].shuffle().map(generate_training_data)
    val_data = None

# Train the model with the Transformers Trainer
trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=val_data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=MICRO_BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
        warmup_steps=50,
        num_train_epochs=num_epoch,
        learning_rate=LEARNING_RATE,
        fp16=True,  # use mixed-precision training
        logging_steps=logging_steps,
        save_strategy="steps",
        save_steps=save_steps,
        output_dir=ckpt_dir,
        save_total_limit=save_total_limit,
        report_to=report_to,
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Disable the model's cache during training
model.config.use_cache = False

# Compile the model
model = torch.compile(model)

# Start training
trainer.train()

# Save the trained model to the specified directory
model.save_pretrained(ckpt_dir)

Downloading and preparing dataset json/default to /root/.cache/huggingface/datasets/json/default-ff7ba6e3ee801b6b/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...


Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-ff7ba6e3ee801b6b/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.


| Step | Training Loss |
|------|---------------|
| 20 | 3.1997 |
| 40 | 2.0239 |
| 60 | 1.9157 |


## Testing

In [12]:
# Load the pretrained language model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir=cache_dir,
    quantization_config=nf4_config,
    device_map=device_map,
    low_cpu_mem_usage = True
)

# Find and select the latest checkpoint
ckpts = []
for ckpt in os.listdir(ckpt_dir):
    if (ckpt.startswith("checkpoint-")):
        ckpts.append(ckpt)
ckpts = sorted(ckpts, key = lambda ckpt: int(ckpt.split("-")[1]))
ckpt_name = os.path.join(ckpt_dir, ckpts[-1])

# Load the fine-tuned adapter weights from the selected checkpoint
model = PeftModel.from_pretrained(model, ckpt_name, device_map=device_map)
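
The checkpoint search above sorts by the integer step number rather than by name. The small sketch below shows why that matters, using a hypothetical directory listing (this run would only produce `checkpoint-65`); `latest_checkpoint` is an illustrative helper, not part of the notebook.

```python
def latest_checkpoint(names):
    """Return the checkpoint directory name with the highest step number, or None."""
    ckpts = [n for n in names if n.startswith("checkpoint-")]
    if not ckpts:
        return None
    return max(ckpts, key=lambda n: int(n.split("-")[1]))


# Hypothetical listing, e.g. from a longer run with save_steps = 65.
names = ["checkpoint-65", "checkpoint-130", "checkpoint-260", "adapter_config.json"]

# A plain string sort compares character by character, so "65" sorts after "260".
print(sorted(n for n in names if n.startswith("checkpoint-"))[-1])  # checkpoint-65
print(latest_checkpoint(names))                                      # checkpoint-260
```

Returning `None` for an empty listing also avoids the `IndexError` that `ckpts[-1]` would raise if no checkpoint had been saved.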


See how the fine-tuned model does compared with the model before fine-tuning.

In [13]:
# using the same demo examples as before
test_tang_list = ['相見時難別亦難，東風無力百花殘。',
                  '重帷深下莫愁堂，臥後清宵細細長。',
                  '芳辰追逸趣，禁苑信多奇。']

# run inference with our fine-tuned model
demo_after_finetune = []
for tang in test_tang_list:
  demo_after_finetune.append(f'模型輸入:\n以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。{tang}\n\n模型輸出:\n'
  +evaluate('以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。', generation_config, max_len, tang, verbose = False))

# print the outputs
for idx in range(len(demo_after_finetune)):
  print(f"Example {idx + 1}:")
  print(demo_after_finetune[idx])
  print("-" * 80)


Example 1:
模型輸入:
以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。相見時難別亦難，東風無力百花殘。

模型輸出:
一去無歸日暮暮，一別無期年年新。
--------------------------------------------------------------------------------
Example 2:
模型輸入:
以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。重帷深下莫愁堂，臥後清宵細細長。

模型輸出:
玉樓春色無處寄，金屋秋月無處看。
--------------------------------------------------------------------------------
Example 3:
模型輸入:
以下是一首唐詩的第一句話，請用你的知識判斷並完成整首詩。芳辰追逸趣，禁苑信多奇。

模型輸出:
春草初生時，春花正開時。
--------------------------------------------------------------------------------


## Show Results

In [14]:
# Load the test data
with open(test_data_path, "r", encoding = "utf-8") as f:
    test_datas = json.load(f)

# Run a prediction for each test example and save the results
output_path = os.path.join(output_dir, "results.txt")
with open(output_path, "w", encoding = "utf-8") as f:
  for (i, test_data) in enumerate(test_datas):
      predict = evaluate(test_data["instruction"], generation_config, max_len, test_data["input"], verbose = False)
      f.write(f"{i+1}. "+test_data["input"]+predict+"\n")
      print(f"{i+1}. "+test_data["input"]+predict)

# Download the results
# from google.colab import files
# files.download(output_path)

1. 雪霽銀妝素，桔高映瓊枝。玉樓春色盡，金殿秋月明。
2. 夫子何爲者？栖栖一代中。不言聖道遠，不言道心重。
3. 飛蓋去芳園，蘭橈遊翠渚。春色未盡盡，秋景已盡盡。
4. 條風開獻節，灰律動初陽。玉樓春色新，金殿秋景長。
5. 昨夜星辰昨夜風，畫樓西畔桂堂東。今朝朝霞今朝月，一去一留一相惜。
6. 三日入廚下，洗手作羹湯。一壺酒自飲，一壺水送人。
7. 嵩雲秦樹久離居，雙鯉迢迢一紙書。書到天邊日已落，雲中人去路難尋。
8. 慨然撫長劒，濟世豈邀名。一去無復還，空留青雲天。
9. 乘興南遊不戒嚴，九重誰省諫書函。一去千載無歸期，空留青樓空留人。
10. 猿鳥猶疑畏簡書，風雲常爲護儲胥。一朝君恩已無望，千載臣忠在天書。
11. 君問歸期未有期，巴山夜雨漲秋池。池中白露生秋草，池邊青煙生秋月。
12. 相見時難別亦難，東風無力百花殘。一去無歸日暮暮，一別無期年年新。
13. 雲母屏風燭影深，長河漸落曉星沈。玉樓空自愁，玉樓中無人。
14. 高閣客竟去，小園花亂飛。春去春來，春去秋來。花落花飛，花落不飛。花飛花落，花飛不落。
15. 瑤池阿母綺窗開，黃竹歌聲動地哀。玉樓春色無別處，金殿秋月照空台。
