## Setting up

Original source: https://www.kaggle.com/code/kingabzpro/fine-tuning-gemma-3-finq-a-reasoning

Modified to fine-tune on the KorQuAD/squad_kor_v1 dataset.

by webnautes

In [None]:
%%capture
!pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

In [40]:
%%capture
%pip install -U datasets
%pip install -U accelerate
%pip install -U peft
%pip install -U trl
%pip install -U bitsandbytes

In [2]:
from huggingface_hub import login
from google.colab import userdata

hf_token = userdata.get('HF_TOKEN')
login(hf_token)

## Loading the model and tokenizer

In [3]:
from transformers import AutoTokenizer, Gemma3ForConditionalGeneration
from peft import (
    LoraConfig,
    PeftModel,
    prepare_model_for_kbit_training,
    get_peft_model,
)

import torch


model = Gemma3ForConditionalGeneration.from_pretrained(
    "google/gemma-3-4b-it",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    attn_implementation='eager'
).eval()

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")

`torch_dtype` is deprecated! Use `dtype` instead!



## Loading and processing the dataset

In [4]:
train_prompt_style="""
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Question:
{}

### Response:
{}
"""

In [5]:
def formatting_prompts_func(examples):
    inputs = examples["question"]
    outputs = examples["answers"]
    texts = []
    for question, response in zip(inputs, outputs):
        # Each "answers" entry holds a list of answer strings; use the first one
        response = response["text"][0]
        # Append the EOS token to the response if it's not already there
        if not response.endswith(tokenizer.eos_token):
            response += tokenizer.eos_token
        text = train_prompt_style.format(question, response)
        texts.append(text)
    return {"text": texts}
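
To see what the formatting step produces without downloading the model, the sketch below runs the same template-filling logic on a hand-made batch. The `EOS` string, the shortened `TEMPLATE`, the `format_batch` name, and the sample record are all stand-ins chosen for illustration, not values read from the real tokenizer or dataset. The batch layout assumes the SQuAD-style `answers` field, where `text` is a list of answer strings.

```python
# Minimal stand-alone sketch of the prompt-formatting step.
# EOS, TEMPLATE and the sample batch are hypothetical placeholders.
EOS = "<eos>"

TEMPLATE = """
### Question:
{}

### Response:
{}
"""

def format_batch(batch, eos=EOS):
    """Fill the template with each question and its first answer text."""
    texts = []
    for question, answer in zip(batch["question"], batch["answers"]):
        response = answer["text"][0]      # SQuAD-style: a list of answer strings
        if not response.endswith(eos):    # avoid appending EOS twice
            response += eos
        texts.append(TEMPLATE.format(question, response))
    return {"text": texts}

sample = {
    "question": ["What is 2 + 2?"],
    "answers": [{"text": ["4"], "answer_start": [0]}],
}
formatted = format_batch(sample)
print(formatted["text"][0])
```

Ending every training example with the EOS token is what lets the fine-tuned model learn to stop after a short answer.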

In [31]:
from datasets import load_dataset
dataset = load_dataset("KorQuAD/squad_kor_v1", split="train[:500]")
dataset = dataset.map(formatting_prompts_func, batched=True)
dataset['text'][0]

'\nBelow is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\nBefore answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.\n\n### Question:\n바그너는 괴테의 파우스트를 읽고 무엇을 쓰고자 했는가?\n\n### Response:\n교향곡<eos>\n'

In [50]:
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=2048,
    )

tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=dataset.column_names,
)


In [51]:
tokenized_dataset[10]

{'input_ids': [2, 107, 43760, 563, 614, 14787, 600, 15517, 496, 4209,
  236764, 33481, 607, 614, 2744, 600, 4728, 3342, 4403, 236761,
  107, 6974, 496, 3072, 600, 37404, 42342, 506, 2864, 236761,
  107, 13286, 38020, 236764, 1751, 13058, 1003, 506, 2934, 532,
  2619, 496, 2918, 236772, 2003, 236772, 9340, 7797, 529, 12018,
  531, 5330, 496, 23420, 532, 11459, 3072, 236761, 108, 10354,
  19566, 236787, 107, 238505, 237516, 239298, 237170, 55573, 18959, 237687,
  238594, 60357, 237660, 237482, 104676, 237482, 152994, 101172, 239576, 237170,
  237272, 236881, 108, 10354, 14503, 236787, 107, 236778, 236771, 239165,
  237077, 237281, 45709, 1, 107],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  ...

In [41]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
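
With `mlm=False`, the collator prepares batches for causal language modeling: it pads the sequences in a batch to a common length and derives `labels` from `input_ids`, marking padded positions so they are ignored by the loss. The pure-Python sketch below illustrates that idea only; it is not the library's implementation. The function name `collate_causal_lm`, the right-padding choice, the pad ID of `0`, and the token IDs are assumptions made for the example.

```python
# Pure-Python sketch of causal-LM collation (an illustration, not the library code).
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def collate_causal_lm(features, pad_token_id):
    """Right-pad input_ids to the longest sequence and derive labels."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        ids = list(f["input_ids"])
        n_pad = max_len - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Labels copy the inputs; padded positions are excluded from the loss.
        batch["labels"].append(ids + [IGNORE_INDEX] * n_pad)
    return batch

features = [{"input_ids": [2, 107, 1]}, {"input_ids": [2, 1]}]  # hypothetical token IDs
batch = collate_causal_lm(features, pad_token_id=0)
print(batch["labels"])
```

The labels are not shifted here because Hugging Face causal language models shift them internally when computing the next-token loss.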

## Model inference before fine-tuning

In [42]:
prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Question:
{}

### Response:
"""

In [16]:
question = dataset[0]['question']

print(question)

inputs = tokenizer(
    # Note: this cell appends EOS to the prompt; the post-fine-tuning cells below do not
    [prompt_style.format(question) + tokenizer.eos_token],
    return_tensors="pt"
).to("cuda")

# 1. Bound generation time by capping the number of new tokens
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=500,  # try lowering this
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
    do_sample=True,  # enable stochastic sampling
    temperature=0.7,  # add slight randomness
    num_return_sequences=1,
    no_repeat_ngram_size=3,  # prevent repeated 3-grams
)

# 2. Alternative: generate in smaller text chunks
# 3. Runtime monitoring: check memory usage and GPU utilization during generation
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(response[0].split("### Response:")[1])

바그너는 괴테의 파우스트를 읽고 무엇을 쓰고자 했는가?


바 그너는 1790년에 괴테가 발표한 파우 스트를 읽은 후, 파 우스트의 주제와 내용에 큰 영향을 받아 자신의 작품, '파 우스트'를 쓰고자 했습니다. 그는 괴테에게 파 우 스트에 대한 질문을 보내고, 괴테는 그에게 답장으로 자신의 작품에 대한 설명을 제공했습니다. 이 과정에서 바 그너 는 괴테 의 파 우 스 트 를 통해 인간의 욕망, 죄, 구원 등의 주제를 탐구하는 것의 중요성을 깨달았고, 이를 자신의 작품에도 반영하고자 했습니다.

Exactly!





Let's break this down step-be-step to ensure we understand the question & provide a good response.
1. **Understand the Question:** The question asks "What did Wagner want to write after reading Goethe's Faust?" It's asking about Wagner's inspiration and intention.
2. **Recall Wagner' and Faust's Connection:** Wagner was deeply influenced by Goethe''s *Faust*. He studied it closely.
3. **Describe Wagner' Response:** Wagner wrote to Goethe asking for clarification and explanation of *Faustus*. Goethe responded to Wagner' letter.
4. **Summarize Wagner' Intention:** Wagner used Goethe' response to *Fausto* to understand the importance of exploring themes like human desire, sin, and redemption in his ow

## Setting up the model

In [52]:
from trl import SFTTrainer
from transformers import TrainingArguments

# LoRA Configuration
peft_config = LoraConfig(
    lora_alpha=8,                           # Scaling factor for LoRA
    lora_dropout=0.05,                       # Add slight dropout for regularization
    r=64,                                    # Rank of the LoRA update matrices
    bias="none",                             # No bias reparameterization
    task_type="CAUSAL_LM",                   # Task type: Causal Language Modeling
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],  # Target modules for LoRA
)


# Training Arguments
training_arguments = TrainingArguments(
    output_dir="output",
    remove_unused_columns=False,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    num_train_epochs=3,
    logging_steps=0.2,
    warmup_steps=10,
    logging_strategy="steps",
    learning_rate=2e-4,
    fp16=False,
    bf16=False,
    group_by_length=True,
    report_to="none",
)

# Initialize the Trainer
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=tokenized_dataset,
    peft_config=peft_config,
    data_collator=data_collator,
)

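
The two LoRA numbers in the configuration interact: in standard LoRA the low-rank update is scaled by `lora_alpha / r`, and each targeted linear layer gains `r * (d_in + d_out)` trainable parameters. The sketch below works through that arithmetic for `r=64` and `lora_alpha=8`. The 2560 x 2560 layer shape and the `lora_added_params` helper are hypothetical, used only to give the formula concrete numbers; the real shapes come from the model config.

```python
# Back-of-the-envelope LoRA arithmetic for the configuration above.
r = 64
lora_alpha = 8

# Standard LoRA scales the low-rank update by alpha / r.
scaling = lora_alpha / r

def lora_added_params(d_in, d_out, rank):
    """Trainable parameters added to one linear layer: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

# Hypothetical square projection; the real layer shapes come from the model config.
d_in, d_out = 2560, 2560
added = lora_added_params(d_in, d_out, r)
full = d_in * d_out
print(f"scaling = {scaling}")
print(f"LoRA params = {added:,} vs full matrix = {full:,}")
```

A scaling of 0.125 damps the adapter's contribution. Many recipes set `lora_alpha` equal to or larger than `r`, so this ratio may be worth tuning.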



## Model training

In [None]:
torch.cuda.empty_cache()
trainer_stats = trainer.train()

| Step | Training Loss |
|------|---------------|
| 150  | 2.181         |
| 300  | 1.2807        |
| 450  | 0.9716        |
| 600  | 0.7572        |
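
The logged steps are multiples of 150, which follows from the training arguments: a `logging_steps` value below 1 is treated as a fraction of the total number of optimizer steps. The sketch below reproduces that arithmetic, assuming a single GPU and that the interval is the ceiling of `total_steps * logging_steps`. The variable names are chosen for this example.

```python
import math

num_examples = 500               # train[0:500]
per_device_batch_size = 1
gradient_accumulation_steps = 2
num_epochs = 3
logging_ratio = 0.2              # logging_steps given as a fraction

# Assumes a single GPU, so the effective batch size is 1 * 2 = 2.
effective_batch = per_device_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
logging_interval = math.ceil(total_steps * logging_ratio)

print(total_steps, logging_interval)
```

Under these assumptions the run takes 750 optimizer steps and logs every 150 steps.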


## Model inference after fine-tuning

In [None]:
# Option 1: use the trained model held by the trainer
trained_model = trainer.model

# Option 2: save the trained model, then reload it
# trainer.save_model("output/final_model")
# trained_model = PeftModel.from_pretrained(model, "output/final_model")

# Put the model in inference mode
trained_model.eval()

# Build the prompt for the first training question
question = dataset[0]['question']

inputs = tokenizer(
    [prompt_style.format(question)],
    return_tensors="pt"
).to("cuda")

# Generate with sampling; max_new_tokens bounds the output length
outputs = trained_model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
    do_sample=True,  # sample for more varied output
    temperature=0.7,  # controls randomness
    top_p=0.9,  # nucleus sampling: restrict to the top-probability tokens
)

# Decode the generated tokens
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print("Full response:", response[0])  # show the complete output

# Try to parse out the answer
try:
    parsed_response = response[0].split("### Response:")[1].strip()
    print("Parsed response:", parsed_response)
except IndexError:
    print("Could not find the '### Response:' marker. Check the response format.")
    # Try an alternative separator
    if "Response:" in response[0]:
        parsed_response = response[0].split("Response:")[1].strip()
        print("Fallback parsed response:", parsed_response)

Full response: Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Question:
바그너는 괴테의 파우스트를 읽고 무엇을 쓰고자 했는가?

### Response:
교향곡
Parsed response: 교향곡
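
The `temperature` and `top_p` arguments shape how each next token is drawn. As a rough illustration, the sketch below applies temperature scaling and then nucleus (top-p) filtering to a toy distribution in pure Python. It is a simplified model of the idea, not the code `generate` runs, and the function name `sample_top_p` and the four logits are hypothetical.

```python
import math
import random

def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample a token index using temperature scaling followed by nucleus filtering."""
    # Temperature: divide logits before the softmax; values < 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus: keep the smallest set of highest-probability tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw one of the kept tokens, weighted by its probability.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

toy_logits = [2.0, 1.5, 1.0, -3.0]   # hypothetical scores for a 4-token vocabulary
rng = random.Random(0)
draws = [sample_top_p(toy_logits, rng=rng) for _ in range(1000)]
print({i: draws.count(i) for i in sorted(set(draws))})
```

With these toy logits the lowest-scoring token falls outside the nucleus, so it is never sampled, while the remaining three still appear with varying frequency.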


In [None]:
# Build the prompt for another training question
question = dataset[10]['question']

inputs = tokenizer(
    [prompt_style.format(question)],
    return_tensors="pt"
).to("cuda")

# Generate with sampling; max_new_tokens bounds the output length
outputs = trained_model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    eos_token_id=tokenizer.eos_token_id,
    use_cache=True,
    do_sample=True,  # sample for more varied output
    temperature=0.7,  # controls randomness
    top_p=0.9,  # nucleus sampling: restrict to the top-probability tokens
)

# Decode the generated tokens
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print("Full response:", response[0])  # show the complete output

# Try to parse out the answer
try:
    parsed_response = response[0].split("### Response:")[1].strip()
    print("Parsed response:", parsed_response)
except IndexError:
    print("Could not find the '### Response:' marker. Check the response format.")
    # Try an alternative separator
    if "Response:" in response[0]:
        parsed_response = response[0].split("Response:")[1].strip()
        print("Fallback parsed response:", parsed_response)

Full response: Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Question:
바그너는 다시 개정된 총보를 얼마를 받고 팔았는가?

### Response:
10루이의 금
Parsed response: 10루이의 금
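
Both inference cells repeat the same parsing logic, so it could be factored into a small helper. The sketch below is one possible shape for it. The name `extract_response`, its parameters, and the sample string are hypothetical and not part of the original notebook.

```python
def extract_response(full_text, marker="### Response:", fallback="Response:"):
    """Return the text after the first marker found, or None when neither marker is present."""
    for sep in (marker, fallback):
        if sep in full_text:
            # Split once so everything after the first marker is kept.
            return full_text.split(sep, 1)[1].strip()
    return None

decoded = "### Question:\nWhat is 2 + 2?\n\n### Response:\n4"   # hypothetical decoded output
print(extract_response(decoded))
```

Returning `None` instead of raising lets the caller decide how to handle output that does not follow the prompt format.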


## Saving the model locally

In [None]:
new_model_local = "Gemma-3-4B-Fin-QA-Reasoning-squad_kor_v1"

trainer.model.save_pretrained(new_model_local)  # Save the fine-tuned LoRA adapter locally
tokenizer.save_pretrained(new_model_local)

('Gemma-3-4B-Fin-QA-Reasoning-squad_kor_v1/tokenizer_config.json',
 'Gemma-3-4B-Fin-QA-Reasoning-squad_kor_v1/special_tokens_map.json',
 'Gemma-3-4B-Fin-QA-Reasoning-squad_kor_v1/tokenizer.model',
 'Gemma-3-4B-Fin-QA-Reasoning-squad_kor_v1/added_tokens.json',
 'Gemma-3-4B-Fin-QA-Reasoning-squad_kor_v1/tokenizer.json')