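# Chapter 10: Pretrained Language Models (GPT-style)

In this chapter, we use GPT-style pretrained models (Transformer decoders) to work on text generation, building a sentiment analyzer (a positive/negative classifier), fine-tuning, reinforcement learning, and related tasks.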

In [1]:
import os
from dotenv import load_dotenv
import torch

dotenv_path = './.env'
load_dotenv(dotenv_path)
HF_TOKEN = os.getenv('HF_TOKEN')

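# Use the second GPU when one exists (the outputs below were produced on cuda:1);
# otherwise fall back to the first GPU or the CPU.
if torch.cuda.device_count() > 1:
    device = torch.device('cuda:1')
else:
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')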

In [63]:
import os
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

In [2]:
!huggingface-cli login --token $HF_TOKEN

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
The token `experiments` has been saved to /home/tosshy/.cache/huggingface/stored_tokens
Your token has been saved to /home/tosshy/.cache/huggingface/token
Login successful.
Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


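## 90. Next-token prediction

Find the 10 most likely tokens to follow "The movie was full of" (note that this means a single token, not a token sequence), together with their probabilities (likelihoods). Also check what token sequence the prompt is converted to before it is passed to the language model.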

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-1B-Instruct')
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.2-1B-Instruct')


In [4]:
sentence = 'The movie was full of'

encoded = tokenizer(sentence, return_tensors='pt')

tokenized = tokenizer.tokenize(sentence)
print(tokenized)

input_ids = tokenizer(sentence).input_ids
print(input_ids)

decoded = tokenizer.decode(input_ids)
print(decoded)

['The', 'Ġmovie', 'Ġwas', 'Ġfull', 'Ġof']
[128000, 791, 5818, 574, 2539, 315]
<|begin_of_text|>The movie was full of


In [5]:
# Experiment: select the most likely next token for each sequence
model.to(device)
encoded.to(device)

{'input_ids': tensor([[128000,    791,   5818,    574,   2539,    315]], device='cuda:1'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]], device='cuda:1')}

In [6]:
model.to(device)
encoded.to(device)

print(encoded)

outputs = model(**encoded)
logits = outputs.logits  # (batch_size, seq_len, vocab_size)
# logits[:, k, :] is computed from input_ids[:, :k+1] and scores the token at position k+1
next_token_logits = logits[:, -1, :]
next_token_id = next_token_logits.argmax(-1).item()
print(f'Next token id: {next_token_id}')
next_token = tokenizer.decode([next_token_id])
print(f'Next token: {next_token}')


{'input_ids': tensor([[128000,    791,   5818,    574,   2539,    315]], device='cuda:1'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]], device='cuda:1')}
Next token id: 1957
Next token:  action


In [7]:
top_k = 10
next_token_probs = torch.softmax(next_token_logits, dim=-1)
top_k_probs, top_k_indices = torch.topk(next_token_probs, top_k) # (batch_size, top_k)

print(f'Top {top_k} next tokens')
for i in range(top_k):
    token_id = top_k_indices[0, i].item()
    token = tokenizer.decode([token_id])
    prob = top_k_probs[0, i].item()
    print(f'{i+1}. Token: {token}, Probability: {prob:.4f}')

Top 10 next tokens
1. Token:  action, Probability: 0.0852
2. Token:  suspense, Probability: 0.0344
3. Token:  drama, Probability: 0.0242
4. Token:  excitement, Probability: 0.0190
5. Token:  great, Probability: 0.0189
6. Token:  surprises, Probability: 0.0155
7. Token:  interesting, Probability: 0.0143
8. Token:  twists, Probability: 0.0136
9. Token:  exciting, Probability: 0.0132
10. Token:  memorable, Probability: 0.0099
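
The step from logits to the ranked list above can be illustrated without a model. The following is a minimal sketch in plain Python, assuming a hypothetical five-token vocabulary; `softmax`, `top_k`, and `toy_logits` are illustrative names and values, not part of the notebook.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, k):
    """Return (index, probability) pairs of the k largest entries, highest first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical logits over a 5-token vocabulary (not taken from the model above)
toy_logits = [2.0, 0.5, 1.0, -1.0, 0.0]
toy_probs = softmax(toy_logits)
for rank, (idx, p) in enumerate(top_k(toy_probs, 3), start=1):
    print(f'{rank}. token id {idx}: {p:.4f}')
```

In the notebook, `torch.softmax` and `torch.topk` play the roles of these two helpers over the full vocabulary.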


## 91. Predicting continuations

Predict several continuations of "The movie was full of". While doing so, vary the decoding method and the temperature parameter, and observe how the predicted texts change.

In [8]:
prompt = 'The movie was full of'
encoded = tokenizer(prompt, return_tensors='pt')
input_ids = encoded.input_ids.to(device)
attention_mask = encoded.attention_mask.to(device)

In [9]:
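# No decoding options are passed here, so generate() uses the checkpoint's generation_config.json.
# If that file enables sampling (do_sample=True), this call samples rather than running greedy search;
# pass do_sample=False to force greedy decoding.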
print('--- Default (Greedy-like) ---')
outputs_default = model.generate(input_ids, attention_mask=attention_mask, max_length=50, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs_default[0], skip_special_tokens=True))
print('-' * 30)

--- Default (Greedy-like) ---
The movie was full of memorable characters, exciting plot twists, and stunning visuals. But, what made it truly unforgettable was the emotional resonance it brought to the audience.

For many viewers, the movie was a powerful reminder of the human experience. It
------------------------------


In [10]:
# Sampling (do_sample=True) with temperature
# temperature < 1.0: sharper distribution, closer to deterministic
# temperature > 1.0: flatter distribution, more random
print('--- Sampling with Temperature (0.7) ---')
outputs_temp_low = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=50,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs_temp_low[0], skip_special_tokens=True))
print('-' * 30)

print('--- Sampling with Temperature (1.5) ---')
outputs_temp_high = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=50,
    do_sample=True,
    temperature=1.5,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs_temp_high[0], skip_special_tokens=True))
print('-' * 30)

--- Sampling with Temperature (0.7) ---
The movie was full of memorable moments, but one scene that really stands out to me is the infamous "I'm a little tea pot" line from the character, Willy Wonka.

I remember being so mesmerized by the way Charlie Bucket
------------------------------
--- Sampling with Temperature (1.5) ---
The movie was full of exciting moments but, in the final cut, only these few scenes were retained.
The scenes that were omitted are the first shot of an empty theater, where it is left to believe that it has stood vacant.
------------------------------
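
To see why temperature changes the outputs, it can help to look at what it does to a single next-token distribution. The sketch below uses plain Python and hypothetical logits (`toy_logits` is an assumption, not model output): the logits are divided by the temperature before the softmax.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5, 0.0]  # hypothetical values
for t in (0.7, 1.0, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures move probability mass toward the top token, while higher temperatures move it toward the tail, which is consistent with the more varied text seen at temperature 1.5.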


In [11]:
# Top-k sampling
# Sample the next token from the k most probable candidates only
print('--- Top-k Sampling (k=30) ---')
outputs_top_k = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=50,
    do_sample=True,
    top_k=30,
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs_top_k[0], skip_special_tokens=True))
print("-" * 30)

--- Top-k Sampling (k=30) ---
The movie was full of action and suspense, but it was the way the characters interacted with each other that made it truly unforgettable.
I remember the movie like it was yesterday. I was a teenager at the time, and I had seen the
------------------------------


In [12]:
# Top-p (nucleus) sampling
# Sample from the smallest set of tokens whose cumulative probability reaches p
print('--- Top-p (Nucleus) Sampling (p=0.9) ---')
outputs_top_p = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=50,
    do_sample=True,
    top_p=0.9,
    top_k=0,  # top_k=0 turns off top-k filtering so that only top_p restricts the candidates
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs_top_p[0], skip_special_tokens=True))
print("-" * 30)

--- Top-p (Nucleus) Sampling (p=0.9) ---
The movie was full of surprises, but one of the most memorable moments was when the main character, a young girl named Lily, found out that her mother was a famous actress. She had always known that her mother was famous, but she had
------------------------------
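
The two filters used above can be sketched on a toy distribution. This is an illustrative plain-Python version under the assumption of a hypothetical five-token distribution; `top_k_filter` and `top_p_filter` are made-up helper names and do not reproduce the library's internal implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable entries, zero out the rest, and renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

toy_probs = [0.5, 0.3, 0.1, 0.06, 0.04]  # hypothetical next-token distribution
print(top_k_filter(toy_probs, 2))
print(top_p_filter(toy_probs, 0.85))
```

Top-k always keeps a fixed number of candidates, whereas the size of the top-p set depends on how peaked the distribution is.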


In [13]:
# Beam search
# Explore while keeping several candidate sequences (beams)
print("--- Beam Search (num_beams=5) ---")
outputs_beam = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=50,
    num_beams=5,
    early_stopping=True,  # stop as soon as num_beams finished candidates have been found
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs_beam[0], skip_special_tokens=True))
print("-" * 30)

--- Beam Search (num_beams=5) ---
The movie was full of action, suspense, and romance. The plot was engaging, and the characters were well-developed and relatable. The special effects were impressive, and the cinematography was stunning. Overall, the movie was a thrilling ride
------------------------------


In [14]:
# Use num_return_sequences to obtain several different outputs
print("--- Beam Search with num_return_sequences=3 ---")
outputs_beam_multiple = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=50,
    num_beams=5,
    num_return_sequences=3,
    early_stopping=True,
    pad_token_id=tokenizer.eos_token_id
)
for i, output in enumerate(outputs_beam_multiple):
    print(f"Output {i+1}: {tokenizer.decode(output, skip_special_tokens=True)}")
print("-" * 30)

--- Beam Search with num_return_sequences=3 ---
Output 1: The movie was full of action, suspense, and romance. The plot was engaging, and the characters were well-developed and relatable. The special effects were impressive, and the cinematography was stunning. Overall, the movie was a thrilling ride
Output 2: The movie was full of action, suspense, and romance. The plot was engaging, and the characters were well-developed and relatable. The special effects were impressive, and the cinematography was stunning. Overall, I would highly recommend this movie
Output 3: The movie was full of action, suspense, and romance. The plot was engaging, and the characters were well-developed and relatable. The special effects were impressive, and the cinematography was stunning. Overall, the movie was a thrilling and
------------------------------


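## 92. Probabilities of the predicted text

Predict a continuation of "The movie was full of" and display the likelihood of each generated word (long generations make the output hard to read, so cut generation off at a suitable length).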

In [15]:
prompt = 'The movie was full of'
encoded = tokenizer(prompt, return_tensors='pt')
input_ids = encoded.input_ids.to(device)
attention_mask = encoded.attention_mask.to(device)

In [16]:
outputs = model.generate(
    input_ids, 
    attention_mask=attention_mask, 
    max_length=20, 
    pad_token_id=tokenizer.eos_token_id,
    output_scores=True,
    return_dict_in_generate=True
)

sequences = list(outputs.sequences.squeeze(0))
scores = outputs.scores

prompt_len = input_ids.shape[1]

print(f"Input prompt: {tokenizer.decode(sequences[:prompt_len], skip_special_tokens=True)}")
print("Generated tokens and their probabilities:")

for k in range(len(scores)):
    current_token_idx_in_sequence = prompt_len + k

    generated_token_id = sequences[current_token_idx_in_sequence]

    generated_token_str = tokenizer.decode([generated_token_id])

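    # outputs.scores holds the scores after generate()'s logits processors
    # (e.g. temperature, top-k, top-p), not the raw model logits
    step_logits = scores[k]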

    step_probs = torch.softmax(step_logits, dim=-1).squeeze()

    prob_of_generated_token = step_probs[generated_token_id].item()

    print(f"Token: '{generated_token_str}', Probability: {prob_of_generated_token:.4f}")


Input prompt: The movie was full of
Generated tokens and their probabilities:
Token: ' action', Probability: 0.5336
Token: ',', Probability: 0.5750
Token: ' suspense', Probability: 0.8670
Token: ',', Probability: 1.0000
Token: ' and', Probability: 1.0000
Token: ' romance', Probability: 0.6102
Token: ',', Probability: 0.3331
Token: ' but', Probability: 0.3680
Token: ' it', Probability: 0.4549
Token: ' was', Probability: 0.6033
Token: ' also', Probability: 0.7383
Token: ' a', Probability: 0.8530
Token: ' thought', Probability: 0.0197
Token: '-pro', Probability: 1.0000
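
Per-token probabilities like these combine into a probability for the whole continuation by the chain rule. The following sketch takes the first three values printed above as given and sums their logarithms; it assumes they can be read as conditional probabilities under the decoding distribution that was actually used.

```python
import math

# Per-token probabilities copied from the first three lines of the output above
token_probs = [0.5336, 0.5750, 0.8670]

# Summing log-probabilities avoids underflow when sequences get long
log_prob = sum(math.log(p) for p in token_probs)
joint_prob = math.exp(log_prob)
print(f'log P = {log_prob:.4f}, P = {joint_prob:.4f}')
```

Working in log space is the usual choice here because a product of many values below 1 quickly becomes too small to represent.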


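## 93. Perplexity

Prepare some sentences and measure their perplexity with the pretrained language model. For example, take the following four sentences:

+ The movie was full of surprises
+ The movies were full of surprises
+ The movie were full of surprises
+ The movies was full of surprises

Measure the perplexity of each and compare the results (the last two sentences contain deliberate grammatical errors).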

In [17]:
sentences = [
    'The movie was full of surprises',
    'The movies were full of surprises',
    'The movie were full of surprises',
    'The movies was full of surprises'
]

# All four sentences tokenize to the same length, so no padding is needed here
encoded = tokenizer(sentences, return_tensors='pt')

In [18]:
model.to(device)
encoded.to(device)

{'input_ids': tensor([[128000,    791,   5818,    574,   2539,    315,  46540],
        [128000,    791,   9698,   1051,   2539,    315,  46540],
        [128000,    791,   5818,   1051,   2539,    315,  46540],
        [128000,    791,   9698,    574,   2539,    315,  46540]],
       device='cuda:1'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1]], device='cuda:1')}

In [19]:
input_ids = encoded.input_ids
outputs = model(**encoded)
logits = outputs.logits

for i in range(logits.shape[0]):
    # Position t predicts token t+1, so drop the last logit and the first (BOS) target
    current_prediction_logits = logits[i, :-1, :]
    current_target_ids = input_ids[i, 1:]

    criterion = torch.nn.CrossEntropyLoss(reduction='mean')

    mean_neg_log_likelihood = criterion(current_prediction_logits, current_target_ids)

    ppl = torch.exp(mean_neg_log_likelihood)
    print(f'Sentence: {sentences[i]}, Perplexity: {ppl.item()}')

Sentence: The movie was full of surprises, Perplexity: 159.2811737060547
Sentence: The movies were full of surprises, Perplexity: 265.9172668457031
Sentence: The movie were full of surprises, Perplexity: 460.8365173339844
Sentence: The movies was full of surprises, Perplexity: 407.7308349609375
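
Perplexity is the exponential of the mean negative log-likelihood per token, which is what `CrossEntropyLoss(reduction='mean')` followed by `torch.exp` computes above. A minimal plain-Python sketch, using hypothetical per-token probabilities rather than model outputs:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical per-token probabilities (not model outputs)
print(perplexity([0.25, 0.25, 0.25, 0.25]))
print(perplexity([0.5, 0.1, 0.9]))
```

A model that assigns probability 1/4 to every token has a perplexity of 4, so the value can be read as an effective number of equally likely choices per token.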


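## 94. Chat template

To generate a response to the question "What do you call a sweet eaten after dinner?", apply the chat template and build the prompt that should be given to the language model. Then generate and display the response to that prompt.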

In [21]:
instruction = 'Answer the following question.'
text = 'What do you call a sweet eaten after dinner?'

messages = [
    {"role": "system", "content": instruction},
    {"role": "user", "content": text}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print('Generated prompt:')
print(prompt)
print('=' * 100)

Generated prompt:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 01 Jun 2025

Answer the following question.<|eot_id|><|start_header_id|>user<|end_header_id|>

What do you call a sweet eaten after dinner?<|eot_id|><|start_header_id|>assistant<|end_header_id|>




In [22]:
inputs = tokenizer(prompt, return_tensors="pt")

if 'token_type_ids' in inputs:
    inputs.pop('token_type_ids')
inputs = inputs.to(device)

model.to(device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print('Model response:')
print(response)

Model response:
I don't know what you're asking, but I can try to help. Is the answer "dessert"?
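
The prompt printed for this problem follows a regular layout: a role header, a blank line, the message content, and an end-of-turn marker, repeated per message. The sketch below rebuilds that layout by hand to make the structure explicit. `build_llama3_style_prompt` is a hypothetical helper, not the tokenizer's template, and it deliberately omits the date lines that the real template adds to the system message.

```python
def build_llama3_style_prompt(messages, add_generation_prompt=True):
    """Hypothetical helper that mimics the layout printed above; not the library's template."""
    parts = ['<|begin_of_text|>']
    for message in messages:
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    if add_generation_prompt:
        # An open assistant header asks the model to write the next turn
        parts.append('<|start_header_id|>assistant<|end_header_id|>\n\n')
    return ''.join(parts)

messages = [
    {'role': 'system', 'content': 'Answer the following question.'},
    {'role': 'user', 'content': 'What do you call a sweet eaten after dinner?'},
]
print(build_llama3_style_prompt(messages))
```

In practice `tokenizer.apply_chat_template` should be preferred, since the exact template differs between models and versions.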


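## 95. Multi-turn chat

Following the response generated in Problem 94, ask the additional question "Please give me the plural form of the word with its spelling in reverse order." and generate and display the response. Also check the prompt that is given to the language model at that point.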

In [23]:
previous_response = response

messages = [
    {"role": "system", "content": instruction},
    {"role": "user", "content": "What do you call a sweet eaten after dinner?"},
    {"role": "assistant", "content": previous_response},
    {"role": "user", "content": "Please give me the plural form of the word with its spelling in reverse order."}
]

multi_turn_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print('Multi-turn chat prompt')
print(multi_turn_prompt)
print('=' * 100)

Multi-turn chat prompt
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 01 Jun 2025

Answer the following question.<|eot_id|><|start_header_id|>user<|end_header_id|>

What do you call a sweet eaten after dinner?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I don't know what you're asking, but I can try to help. Is the answer "dessert"?<|eot_id|><|start_header_id|>user<|end_header_id|>

Please give me the plural form of the word with its spelling in reverse order.<|eot_id|><|start_header_id|>assistant<|end_header_id|>




In [24]:
inputs = tokenizer(multi_turn_prompt, return_tensors='pt')
if 'token_type_ids' in inputs:
    inputs.pop('token_type_ids')
inputs = inputs.to(device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

multi_turn_response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print('Multi-turn response:')
print(multi_turn_response)


Multi-turn response:
The word "dessert" spelled in reverse order is "trseSSED".
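
For comparison with the model's answer, the expected string can be computed directly. This small check assumes the word in question is "dessert" and that its plural is formed regularly:

```python
word = 'dessert'
plural = word + 's'             # regular English plural, which applies to this word
reversed_plural = plural[::-1]  # reverse the spelling with a slice
print(reversed_plural)
```

The result differs from the generated response, which suggests that character-level manipulation is hard for a model of this size.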


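## 96. Sentiment analysis by prompting

We want to perform sentiment analysis with a pretrained language model. Using the strategy of giving the model a prompt that contains the text and having it predict whether the text is positive or negative (without any fine-tuning), measure the accuracy on the development set of [SST-2](https://dl.fbaipublicfiles.com/glue/data/SST-2.zip).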

In [25]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('llm-jp/llm-jp-3-150m-instruct3')
model = AutoModelForCausalLM.from_pretrained('llm-jp/llm-jp-3-150m-instruct3')

In [26]:
import pandas as pd

train_path = './data/SST-2/train.tsv'
dev_path = './data/SST-2/dev.tsv'

train_df = pd.read_csv(train_path, sep='\t')
dev_df = pd.read_csv(dev_path, sep='\t')
train_df

| | sentence | label |
|---|---|---|
| 0 | hide new secretions from the parental units | 0 |
| 1 | contains no wit , only labored gags | 0 |
| 2 | that loves its characters and communicates som... | 1 |
| 3 | remains utterly satisfied to remain the same t... | 0 |
| 4 | on the worst revenge-of-the-nerds clichés the ... | 0 |
| ... | ... | ... |
| 67344 | a delightful comedy | 1 |
| 67345 | anguish , anger and frustration | 0 |
| 67346 | at achieving the modest , crowd-pleasing goals... | 1 |
| 67347 | a patient viewer | 1 |


In [32]:
from tqdm import tqdm

dev_sentences = dev_df.sentence
dev_labels = dev_df.label

predictions = []

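# The instruction is written in Japanese. It means: "Decide whether the following sentence is
# positive or negative. Answer with 'ポジティブ' (positive) or 'ネガティブ' (negative)."
instruction = '以下の文がポジティブかネガティブのどちらなのか判定してください．"ポジティブ"か"ネガティブ"で回答してください．'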

model.to(device)

for i in tqdm(range(len(dev_sentences)), desc='Processing sentences'):
    sentence = dev_sentences.iloc[i]

    messages = [
        {"role": "system", "content": instruction},
        {"role": "user", "content": sentence}
    ]

    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(prompt, return_tensors='pt')
    if 'token_type_ids' in inputs:
        inputs.pop('token_type_ids')
    inputs = inputs.to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=15,
            temperature=0.1,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    predictions.append(response.strip())

correct = 0
unable_to_predict = 0
for i, pred in enumerate(predictions):
    true_label = dev_labels.iloc[i]
    pred_lower = pred.lower()

    if true_label == 1 and 'ポジティブ' in pred_lower:
        correct += 1
    elif true_label == 0 and 'ネガティブ' in pred_lower:
        correct += 1
    elif 'ポジティブ' not in pred_lower and 'ネガティブ' not in pred_lower:
        unable_to_predict += 1

total = len(predictions)
accuracy = correct / total
unable_rate = unable_to_predict / total

print(f'Total samples: {total}')
print(f'Correct predictions: {correct}')
print(f'Unable to predict: {unable_to_predict}')
print(f'Wrong predictions: {total - correct - unable_to_predict}')
print(f'Accuracy: {accuracy:.4f} ({correct}/{total})')
print(f'Unable to predict rate: {unable_rate:.4f} ({unable_to_predict}/{total})')

Processing sentences: 100%|██████████| 872/872 [02:03<00:00,  7.08it/s]

Total samples: 872
Correct predictions: 0
Unable to predict: 872
Wrong predictions: 0
Accuracy: 0.0000 (0/872)
Unable to predict rate: 1.0000 (872/872)





In [33]:
if unable_to_predict > 0:
    print(f'\nExamples of unable to predict:')
    count = 0
    for i, pred in enumerate(predictions):
        pred_lower = pred.lower()
        if 'ポジティブ' not in pred_lower and 'ネガティブ' not in pred_lower:
            print(f'  Sentence: "{dev_sentences.iloc[i]}"')
            print(f'  Prediction: "{pred}"')
            print(f'  True label: {dev_labels.iloc[i]}')
            print('-' * 50)
            count += 1
            if count >= 5:  # show only the first 5 examples
                break


Examples of unable to predict:
  Sentence: "it 's a charming and often affecting journey . "
  Prediction: "「it 's a charming and often affecting journey .」は、直訳すると"
  True label: 1
--------------------------------------------------
  Sentence: "unflinchingly bleak and desperate "
  Prediction: "「unflinchingly bleak and desperate」は、直訳すると「絶望"
  True label: 0
--------------------------------------------------
  Sentence: "allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . "
  Prediction: "Among the many reasons why a filmmaker might be poised to embar"
  True label: 1
--------------------------------------------------
  Sentence: "the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . "
  Prediction: "質問内容は個人のプライバシーを尊重し、個人情報保護の観点から"
  True label: 1
--------------------------------------------------
  Sentence: "it 's slow -- very , very slow . "
  Prediction: "「it 's slow
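
The scoring step can be factored into a small parser that maps a free-form response to a label. The sketch below is a hypothetical variant of the logic above, not the notebook's exact rule: it treats a response containing both keywords as unparseable, whereas the loop above counts it as correct whenever the keyword for the gold label is present. The example responses are invented for illustration.

```python
def parse_label(response, positive='ポジティブ', negative='ネガティブ'):
    """Map a free-form response to 1, 0, or None when it is missing or ambiguous."""
    has_pos = positive in response
    has_neg = negative in response
    if has_pos == has_neg:  # neither keyword, or both
        return None
    return 1 if has_pos else 0

def score(responses, gold_labels):
    """Return (accuracy, unparseable_rate) over paired responses and gold labels."""
    parsed = [parse_label(r) for r in responses]
    correct = sum(p == g for p, g in zip(parsed, gold_labels))
    unparseable = sum(p is None for p in parsed)
    n = len(gold_labels)
    return correct / n, unparseable / n

# Hypothetical responses, not actual model outputs
responses = ['ポジティブ', 'ネガティブです', '直訳すると', 'ポジティブかネガティブ']
gold = [1, 0, 1, 0]
print(score(responses, gold))
```

Reporting the unparseable rate separately, as the notebook does, makes it clear when a low accuracy comes from the output format rather than from wrong judgments.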

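## 97. Sentiment analysis based on embeddings

Encode each text as a vector with a pretrained language model, and train a model that predicts the polarity label by passing that vector through a feed-forward layer.

### Tuning Llama-3.2-3B-Instruct with QLoRA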

In [3]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score
import numpy as np
from tqdm import tqdm
from transformers import AutoTokenizer, LlamaModel

In [4]:
import pandas as pd

train_path = './data/SST-2/train.tsv'
dev_path = './data/SST-2/dev.tsv'

train_df = pd.read_csv(train_path, sep='\t')
dev_df = pd.read_csv(dev_path, sep='\t')
train_df

| | sentence | label |
|---|---|---|
| 0 | hide new secretions from the parental units | 0 |
| 1 | contains no wit , only labored gags | 0 |
| 2 | that loves its characters and communicates som... | 1 |
| 3 | remains utterly satisfied to remain the same t... | 0 |
| 4 | on the worst revenge-of-the-nerds clichés the ... | 0 |
| ... | ... | ... |
| 67344 | a delightful comedy | 1 |
| 67345 | anguish , anger and frustration | 0 |
| 67346 | at achieving the modest , crowd-pleasing goals... | 1 |
| 67347 | a patient viewer | 1 |


In [5]:
class LlamaBinaryClassifier(nn.Module):
    def __init__(self, llama_model, hidden_size):
        super().__init__()
        self.llama_model = llama_model

        self.classifier = nn.Linear(hidden_size, 1)

        self.dropout = nn.Dropout(0.1)

        for param in self.llama_model.parameters():
            param.requires_grad = False
    
    def forward(self, input_ids, attention_mask=None):
        outputs = self.llama_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
            use_cache=False
        )

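        hidden_states = outputs.hidden_states[-1]  # last hidden states, (batch, seq_len, hidden_size)

        # Max-pool over the sequence; padding positions are set to -inf so they never win the max
        if attention_mask is not None:
            masked_hidden = hidden_states.clone()
            masked_hidden[attention_mask == 0] = float('-inf')
            pooled = torch.max(masked_hidden, dim=1)[0]
        else:
            pooled = torch.max(hidden_states, dim=1)[0]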

        pooled = self.dropout(pooled)
        logits = self.classifier(pooled)
        return logits
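
The pooling step in `forward` can be checked on a tiny example. This plain-Python sketch handles a single sequence and uses invented numbers; `masked_max_pool` is a hypothetical helper that drops padding positions instead of overwriting them with `-inf`, which gives the same maximum.

```python
def masked_max_pool(hidden_states, attention_mask):
    """Max over non-padding positions for each hidden dimension of one sequence."""
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m == 1]
    if not kept:
        raise ValueError('at least one position must be unmasked')
    return [max(column) for column in zip(*kept)]

# Hypothetical sequence of 3 positions with hidden size 2; the last position is padding
hidden = [[0.1, -0.4], [0.3, -0.9], [5.0, 5.0]]
mask = [1, 1, 0]
print(masked_max_pool(hidden, mask))
```

Without the mask, the large values at the padding position would dominate the pooled vector, which is why the class masks before taking the maximum.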

In [6]:
class SST2Dataset(Dataset):
    def __init__(self, sentences, labels, tokenizer, max_length=256):
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        sentence = str(self.sentences.iloc[idx])
        label = self.labels.iloc[idx]

        encoding = self.tokenizer(
            sentence,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding.input_ids.flatten(),
            'attention_mask': encoding.attention_mask.flatten(),
            'label': torch.tensor(label, dtype=torch.float)
        }

In [7]:
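def check_gpu_memory():
    # Note: the queries below refer to device 0 or the current device,
    # which is not necessarily the GPU that holds the model.
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name()}")
        print(f"Total VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
        # memory_allocated() reports memory currently occupied by tensors, despite the "Available" label
        print(f"Available VRAM: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
        print(f"Cached VRAM: {torch.cuda.memory_reserved() / 1e9:.1f} GB")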

In [8]:
check_gpu_memory()

GPU: NVIDIA RTX A4500
Total VRAM: 21.0 GB
Available VRAM: 0.0 GB
Cached VRAM: 0.0 GB


In [9]:
print("Loading tokenizer and base LlamaModel...")
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-3B-Instruct')
model = LlamaModel.from_pretrained(
    'meta-llama/Llama-3.2-3B-Instruct',
    device_map={"": 1},
    low_cpu_mem_usage=True
)

Loading tokenizer and base LlamaModel...



In [10]:
print("Model structure:")
for name, module in model.named_modules():
    if 'head' in name.lower() or 'output' in name.lower() or 'proj' in name.lower():
        print(f"{name}: {type(module)}")

print("\nModel attributes:")
print([attr for attr in dir(model) if 'head' in attr.lower()])

print(f"\nModel config: {model.config}")

Model structure:
layers.0.self_attn.q_proj: <class 'torch.nn.modules.linear.Linear'>
layers.0.self_attn.k_proj: <class 'torch.nn.modules.linear.Linear'>
layers.0.self_attn.v_proj: <class 'torch.nn.modules.linear.Linear'>
layers.0.self_attn.o_proj: <class 'torch.nn.modules.linear.Linear'>
layers.0.mlp.gate_proj: <class 'torch.nn.modules.linear.Linear'>
layers.0.mlp.up_proj: <class 'torch.nn.modules.linear.Linear'>
layers.0.mlp.down_proj: <class 'torch.nn.modules.linear.Linear'>
layers.1.self_attn.q_proj: <class 'torch.nn.modules.linear.Linear'>
layers.1.self_attn.k_proj: <class 'torch.nn.modules.linear.Linear'>
layers.1.self_attn.v_proj: <class 'torch.nn.modules.linear.Linear'>
layers.1.self_attn.o_proj: <class 'torch.nn.modules.linear.Linear'>
layers.1.mlp.gate_proj: <class 'torch.nn.modules.linear.Linear'>
layers.1.mlp.up_proj: <class 'torch.nn.modules.linear.Linear'>
layers.1.mlp.down_proj: <class 'torch.nn.modules.linear.Linear'>
layers.2.self_attn.q_proj: <class 'torch.nn.modules.l

In [11]:
print("Setting up padding token...")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print(f"Set pad_token to: {tokenizer.pad_token}")

if hasattr(model, 'config'):
    model.config.pad_token_id = tokenizer.pad_token_id
    print(f"Set model pad_token_id to: {model.config.pad_token_id}")

print(f"tokenizer.pad_token: {tokenizer.pad_token}")
print(f"tokenizer.pad_token_id: {tokenizer.pad_token_id}")
print(f"tokenizer.eos_token: {tokenizer.eos_token}")
print(f"tokenizer.eos_token_id: {tokenizer.eos_token_id}")

Setting up padding token...
Set pad_token to: <|eot_id|>
Set model pad_token_id to: 128009
tokenizer.pad_token: <|eot_id|>
tokenizer.pad_token_id: 128009
tokenizer.eos_token: <|eot_id|>
tokenizer.eos_token_id: 128009


In [12]:
hidden_size = model.config.hidden_size
print(f'Hidden size: {hidden_size}')

Hidden size: 3072


In [13]:
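def quantize_4bit(W: torch.Tensor):
    # Symmetric absmax quantization with one scale per output row.
    # This is a simplified stand-in for the NF4 scheme of the QLoRA paper: values are rounded
    # to the signed 4-bit range [-8, 7] but stored as int8 (not bit-packed).
    maxv = W.abs().amax(dim=1, keepdim=True)  # (out, 1)
    scale = (maxv / 7.0).clamp_min(1e-8)  # guard against division by zero for all-zero rows
    Wq = torch.clamp((W / scale).round(), -8, 7).to(torch.int8)
    return Wq, scale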
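
A worked example on a single row may make the rounding behavior concrete. The sketch below mirrors the per-row absmax scheme in plain Python with invented weights; `quantize_row_4bit` and `dequantize_row` are hypothetical helpers introduced only for illustration.

```python
def quantize_row_4bit(row):
    """Symmetric absmax quantization of one row to integers in [-8, 7]."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(x / scale))) for x in row]
    return q, scale

def dequantize_row(q, scale):
    """Map the integer codes back to approximate real values."""
    return [v * scale for v in q]

row = [0.70, -0.33, 0.12, 0.0]  # hypothetical weights
q, scale = quantize_row_4bit(row)
restored = dequantize_row(q, scale)
print(q, scale)
print([round(x, 3) for x in restored])
```

The reconstruction error of each entry is at most half a quantization step, so rows with a single large outlier lose the most precision on their small entries.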

In [14]:
class QuantLinear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        W = linear.weight.data # (out, in)
        Wq, scale = quantize_4bit(W)
        self.register_buffer('Wq', Wq)
        self.register_buffer('scale', scale)
        if linear.bias is not None:
            self.bias = nn.Parameter(linear.bias.data.clone())
        else:
            self.bias = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # dequantize on the fly
        W = self.Wq.float() * self.scale
        return F.linear(x, W, self.bias)

In [15]:
class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.r = r
        self.scaling = alpha / r
        self.down = nn.Linear(in_features, r, bias=False)
        self.up   = nn.Linear(r, out_features, bias=False)
        nn.init.kaiming_uniform_(self.down.weight, a=math.sqrt(5))
        nn.init.zeros_(self.up.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x)) * self.scaling

In [16]:
class QuantLoRALinear(nn.Module):
    def __init__(self, linear: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.quant = QuantLinear(linear)
        self.lora = LoRALinear(linear.in_features, linear.out_features, r=r, alpha=alpha)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.quant(x) + self.lora(x)
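
Written as a formula, the forward pass of `QuantLoRALinear` can be summarized as follows. The notation is introduced here for illustration: $\hat{W}$ is the dequantized frozen weight (`Wq * scale`), $A$ is `down.weight`, $B$ is `up.weight`, and the bias is assumed to be absent.

```latex
y = \hat{W}x + \frac{\alpha}{r}\, B A x,
\qquad A \in \mathbb{R}^{r \times d_{\text{in}}},\quad
B \in \mathbb{R}^{d_{\text{out}} \times r}
```

With the defaults $r = 8$ and $\alpha = 16$ the scaling factor is 2. Because $B$ is initialized to zero, the layer initially computes $y = \hat{W}x$, so training starts from the quantized base model.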

In [17]:
def apply_qlora(
        model, 
        target_substrings=(
            # Attention
            'q_proj','v_proj','k_proj','o_proj'
            # MLP (FFN)
            # 'gate_proj', 'down_proj', 'up_proj'
        )
    ):
    for name, module in list(model.named_modules()):
        if isinstance(module, nn.Linear) and any(s in name for s in target_substrings):
            parent_name, attr = name.rsplit('.', 1)
            parent_mod = dict(model.named_modules())[parent_name]
            setattr(parent_mod,
                    attr,
                    QuantLoRALinear(module, r=8, alpha=16)
            )
    return model

In [18]:
classifier_model = LlamaBinaryClassifier(model, hidden_size)

In [19]:
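# Freeze every parameter first. Note that this includes classifier_model.classifier,
# so as written the classification head keeps its random initialization;
# set requires_grad = True on its parameters as well if the head should be trained.
for p in classifier_model.parameters():
    p.requires_grad = False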

apply_qlora(classifier_model)
for p in classifier_model.modules():
    if isinstance(p, LoRALinear):
        for sub in (p.down, p.up):
            for q in sub.parameters():
                q.requires_grad = True
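
The number of trainable LoRA parameters follows from the layer shapes. The arithmetic below is a rough sketch: the `q_proj`, `k_proj`, and `v_proj` shapes and the layer count are taken from the module tree printed later in this section, while the `o_proj` shape (3072 to 3072) is an assumption because that printout is truncated before it.

```python
def lora_param_count(in_features, out_features, r):
    """Parameters in one LoRA pair: down (r x in) plus up (out x r)."""
    return r * in_features + out_features * r

r = 8
num_layers = 28  # from the printed module tree: (0-27): 28 x LlamaDecoderLayer
# (in_features, out_features); the o_proj entry is an assumption, see the note above
shapes = {
    'q_proj': (3072, 3072),
    'k_proj': (3072, 1024),
    'v_proj': (3072, 1024),
    'o_proj': (3072, 3072),
}
per_layer = sum(lora_param_count(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers
print(per_layer, total)
```

Under these assumptions only a few million parameters receive gradients, a small fraction of the 3B-parameter base model.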

In [20]:
max_length = 256
batch_size = 4

train_dataset = SST2Dataset(train_df.sentence, train_df.label, tokenizer, max_length)
dev_dataset = SST2Dataset(dev_df.sentence, dev_df.label, tokenizer, max_length)

train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=16,
    pin_memory=True
)
dev_loader = DataLoader(
    dev_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=16,
    pin_memory=True
)

In [21]:
criterion = nn.BCEWithLogitsLoss()
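
`BCEWithLogitsLoss` takes raw logits and applies the sigmoid internally in a numerically stable way. The following plain-Python sketch shows one common stable formulation for a single example and compares it with the textbook definition. The function names and the input values are illustrative, and this is not claimed to be PyTorch's exact implementation.

```python
import math

def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit; target is 0.0 or 1.0."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# The stable form agrees with the textbook definition -[y log p + (1 - y) log(1 - p)]
logit, target = 1.2, 1.0  # hypothetical values
p = sigmoid(logit)
naive = -(target * math.log(p) + (1 - target) * math.log(1 - p))
print(bce_with_logits(logit, target), naive)
```

Since the classifier outputs a single logit per sentence, thresholding the sigmoid at 0.5 during evaluation is equivalent to checking whether the logit is positive.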

In [22]:
from torch.optim.lr_scheduler import LinearLR


lora_params = [p for p in classifier_model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(lora_params, lr=1e-4)

In [23]:
check_gpu_memory()

GPU: NVIDIA RTX A4500
Total VRAM: 21.0 GB
Available VRAM: 0.0 GB
Cached VRAM: 0.0 GB


In [24]:
from accelerate import Accelerator

accelerator = Accelerator()
classifier_model, optimizer, train_loader, dev_loader = accelerator.prepare(
    classifier_model, optimizer, train_loader, dev_loader
)

In [25]:
num_epochs = 1
classifier_model.to(accelerator.device)
classifier_model.train()

LlamaBinaryClassifier(
  (llama_model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): QuantLoRALinear(
            (quant): QuantLinear()
            (lora): LoRALinear(
              (down): Linear(in_features=3072, out_features=8, bias=False)
              (up): Linear(in_features=8, out_features=3072, bias=False)
            )
          )
          (k_proj): QuantLoRALinear(
            (quant): QuantLinear()
            (lora): LoRALinear(
              (down): Linear(in_features=3072, out_features=8, bias=False)
              (up): Linear(in_features=8, out_features=1024, bias=False)
            )
          )
          (v_proj): QuantLoRALinear(
            (quant): QuantLinear()
            (lora): LoRALinear(
              (down): Linear(in_features=3072, out_features=8, bias=False)
              (up): Linear(in_features=8, out_features=1024, bias=Fa

In [26]:
print("Starting training...")
for epoch in range(num_epochs):
    total_loss = 0
    optimizer.zero_grad()
    
    for batch in tqdm(train_loader, desc=f'Epoch {epoch+1}/{num_epochs}'):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        with accelerator.autocast():
            logits = classifier_model(input_ids, attention_mask)
            loss = criterion(logits.squeeze(-1), labels)
        
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)
    print(f'Epoch {epoch+1}, Average Loss: {avg_loss:.4f}')

print("Training completed.")



Starting training...


Epoch 1/1: 100%|███████████████████████████████████████████████████████████████| 16838/16838 [5:19:56<00:00,  1.14s/it]

Epoch 1, Average Loss: 0.1552
Training completed.
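
The step count in the progress bar can be cross-checked with a little arithmetic. The sketch below assumes the 67,349 training rows reported when `train.tsv` is loaded in Problem 98, a batch size of 4 on a single device, and the 1.14 s/it rate shown at the end of the bar. It is a sanity check, not a benchmark.

```python
import math

train_examples = 67349   # assumed: row count of SST-2 train.tsv
batch_size = 4
seconds_per_step = 1.14  # rate shown at the end of the progress bar

steps_per_epoch = math.ceil(train_examples / batch_size)
estimated_seconds = steps_per_epoch * seconds_per_step
hours, rest = divmod(int(estimated_seconds), 3600)
minutes, seconds = divmod(rest, 60)

print(f"steps per epoch: {steps_per_epoch}")
print(f"estimated time: {hours}:{minutes:02d}:{seconds:02d}")
```

The estimate lands within a second or so of the elapsed time in the log, so the roughly five-hour epoch comes from per-step cost rather than from data-loading stalls.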





In [27]:
# Evaluation
classifier_model.eval()
all_predictions = []
all_labels = []

with torch.no_grad():
    for batch in tqdm(dev_loader, desc='Evaluating'):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        logits = classifier_model(input_ids, attention_mask)
        predictions = torch.sigmoid(logits.squeeze(-1)) > 0.5
        
        all_predictions.extend(predictions.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# Compute accuracy
accuracy = accuracy_score(all_labels, all_predictions)
print(f'Development Set Accuracy: {accuracy:.4f}')

# Final check of GPU memory usage
check_gpu_memory()

# Show a few example predictions
print("\nPrediction examples:")
for i in range(5):
    sentence = dev_df['sentence'].iloc[i]
    true_label = dev_df['label'].iloc[i]
    pred_label = int(all_predictions[i])
    
    print(f"Sentence: {sentence}")
    print(f"True label: {true_label} ({'Positive' if true_label == 1 else 'Negative'})")
    print(f"Predicted: {pred_label} ({'Positive' if pred_label == 1 else 'Negative'})")
    print(f"Correct: {true_label == pred_label}")
    print("-" * 50)

Evaluating: 100%|████████████████████████████████████████████████████████████████████| 218/218 [02:07<00:00,  1.71it/s]

Development Set Accuracy: 0.9633
GPU: NVIDIA RTX A4500
Total VRAM: 21.0 GB
Available VRAM: 10.8 GB
Cached VRAM: 19.8 GB

Prediction examples:
Sentence: it 's a charming and often affecting journey . 
True label: 1 (Positive)
Predicted: 1 (Positive)
Correct: True
--------------------------------------------------
Sentence: unflinchingly bleak and desperate 
True label: 0 (Negative)
Predicted: 0 (Negative)
Correct: True
--------------------------------------------------
Sentence: allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . 
True label: 1 (Positive)
Predicted: 1 (Positive)
Correct: True
--------------------------------------------------
Sentence: the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . 
True label: 1 (Positive)
Predicted: 1 (Positive)
Correct: True
--------------------------------------------------
Sentence: it 's slow -- very , very slow . 
True labe




### Full fine-tuning of llm-jp-3-150m-instruct3

In [34]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModel

In [36]:
import pandas as pd

train_path = './data/SST-2/train.tsv'
dev_path = './data/SST-2/dev.tsv'

train_df = pd.read_csv(train_path, sep='\t')
dev_df = pd.read_csv(dev_path, sep='\t')

In [50]:
class LlmjpBinaryClassifier(nn.Module):
    def __init__(self, llmjp_model, hidden_size):
        super().__init__()
        self.llmjp_model = llmjp_model

        self.classifier = nn.Linear(hidden_size, 1)

        self.dropout = nn.Dropout(0.1)

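        # NOTE: this loop freezes the whole backbone, so only the linear head
        # (self.classifier) is trained. Remove it to update all weights.
        for param in self.llmjp_model.parameters():
            param.requires_grad = False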

    def forward(self, input_ids, attention_mask=None):
        outputs = self.llmjp_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
            use_cache=False
        )

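        # Last-layer hidden states: (batch, seq_len, hidden_size)
        hidden_states = outputs.hidden_states[-1]

        if attention_mask is not None:
            # Max-pool over tokens; padded positions are set to -inf so they never win.
            masked_hidden = hidden_states.clone()
            masked_hidden[attention_mask == 0] = float('-inf')
            pooled = torch.max(masked_hidden, dim=1)[0]
        else:
            pooled = torch.max(hidden_states, dim=1)[0]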

        pooled = self.dropout(pooled)
        logits = self.classifier(pooled)
        return logits
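
The pooling step in `forward` is easier to see on a toy example. The sketch below re-implements masked max pooling for a single sequence with plain Python lists; `masked_max_pool` is a name introduced here for illustration, and the numbers are made up.

```python
def masked_max_pool(hidden_states, attention_mask):
    """Max over the token axis, skipping positions whose mask is 0.

    hidden_states: list of token vectors (seq_len x hidden_size)
    attention_mask: list of 0/1 flags, one per token
    """
    hidden_size = len(hidden_states[0])
    pooled = [float("-inf")] * hidden_size
    for vector, keep in zip(hidden_states, attention_mask):
        if not keep:
            continue  # same effect as overwriting the padded row with -inf
        pooled = [max(p, v) for p, v in zip(pooled, vector)]
    return pooled


tokens = [[0.2, -1.0], [0.7, 0.1], [9.0, 9.0]]  # last row stands in for padding
mask = [1, 1, 0]
print(masked_max_pool(tokens, mask))       # [0.7, 0.1]
print(masked_max_pool(tokens, [1, 1, 1]))  # [9.0, 9.0]
```

Without the mask, the padded row would dominate the pooled vector, which is why the classifier overwrites padded positions before taking the maximum.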

In [51]:
class SST2Dataset(Dataset):
    def __init__(self, sentences, labels, tokenizer, max_length=256):
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        sentence = str(self.sentences.iloc[idx])
        label = self.labels.iloc[idx]

        encoding = self.tokenizer(
            sentence,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding.input_ids.flatten(),
            'attention_mask': encoding.attention_mask.flatten(),
            'label': torch.tensor(label, dtype=torch.float)
        }

In [52]:
print("Loading tokenizer and base LLMJPModel...")
tokenizer = AutoTokenizer.from_pretrained('llm-jp/llm-jp-3-150m-instruct3')
model = AutoModel.from_pretrained(
    'llm-jp/llm-jp-3-150m-instruct3',
    device_map={"": 1},
    low_cpu_mem_usage=True
)

Loading tokenizer and base LLMJPModel...


In [53]:
print("Model structure:")
for name, module in model.named_modules():
    if 'head' in name.lower() or 'output' in name.lower() or 'proj' in name.lower():
        print(f"{name}: {type(module)}")

print("\nModel attributes:")
print([attr for attr in dir(model) if 'head' in attr.lower()])

print(f"\nModel config: {model.config}")

Model structure:
layers.0.self_attn.q_proj: <class 'torch.nn.modules.linear.Linear'>
layers.0.self_attn.k_proj: <class 'torch.nn.modules.linear.Linear'>
layers.0.self_attn.v_proj: <class 'torch.nn.modules.linear.Linear'>
layers.0.self_attn.o_proj: <class 'torch.nn.modules.linear.Linear'>
layers.0.mlp.gate_proj: <class 'torch.nn.modules.linear.Linear'>
layers.0.mlp.up_proj: <class 'torch.nn.modules.linear.Linear'>
layers.0.mlp.down_proj: <class 'torch.nn.modules.linear.Linear'>
layers.1.self_attn.q_proj: <class 'torch.nn.modules.linear.Linear'>
layers.1.self_attn.k_proj: <class 'torch.nn.modules.linear.Linear'>
layers.1.self_attn.v_proj: <class 'torch.nn.modules.linear.Linear'>
layers.1.self_attn.o_proj: <class 'torch.nn.modules.linear.Linear'>
layers.1.mlp.gate_proj: <class 'torch.nn.modules.linear.Linear'>
layers.1.mlp.up_proj: <class 'torch.nn.modules.linear.Linear'>
layers.1.mlp.down_proj: <class 'torch.nn.modules.linear.Linear'>
layers.2.self_attn.q_proj: <class 'torch.nn.modules.l

In [54]:
print("Setting up padding token...")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    print(f"Set pad_token to: {tokenizer.pad_token}")

if hasattr(model, 'config'):
    model.config.pad_token_id = tokenizer.pad_token_id
    print(f"Set model pad_token_id to: {model.config.pad_token_id}")

print(f"tokenizer.pad_token: {tokenizer.pad_token}")
print(f"tokenizer.pad_token_id: {tokenizer.pad_token_id}")
print(f"tokenizer.eos_token: {tokenizer.eos_token}")
print(f"tokenizer.eos_token_id: {tokenizer.eos_token_id}")

Setting up padding token...
Set model pad_token_id to: 4
tokenizer.pad_token: <PAD|LLM-jp>
tokenizer.pad_token_id: 4
tokenizer.eos_token: </s>
tokenizer.eos_token_id: 2


In [55]:
hidden_size = model.config.hidden_size
print(f'Hidden size: {hidden_size}')

Hidden size: 512


In [56]:
classifier_model = LlmjpBinaryClassifier(model, hidden_size)

In [57]:
max_length = 256
batch_size = 4

train_dataset = SST2Dataset(train_df.sentence, train_df.label, tokenizer, max_length)
dev_dataset = SST2Dataset(dev_df.sentence, dev_df.label, tokenizer, max_length)

train_loader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=16,
    pin_memory=True
)
dev_loader = DataLoader(
    dev_dataset,
    batch_size=batch_size,
    shuffle=False,
    num_workers=16,
    pin_memory=True
)

In [58]:
criterion = nn.BCEWithLogitsLoss()

In [59]:
optimizer = torch.optim.AdamW(classifier_model.parameters(), lr=1e-4)

In [60]:
from accelerate import Accelerator

accelerator = Accelerator()
classifier_model, optimizer, train_loader, dev_loader = accelerator.prepare(
    classifier_model, optimizer, train_loader, dev_loader
)

In [61]:
num_epochs = 5
classifier_model.to(accelerator.device)
classifier_model.train()

LlmjpBinaryClassifier(
  (llmjp_model): LlamaModel(
    (embed_tokens): Embedding(99584, 512)
    (layers): ModuleList(
      (0-11): 12 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=512, out_features=512, bias=False)
          (k_proj): Linear(in_features=512, out_features=512, bias=False)
          (v_proj): Linear(in_features=512, out_features=512, bias=False)
          (o_proj): Linear(in_features=512, out_features=512, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=512, out_features=2048, bias=False)
          (up_proj): Linear(in_features=512, out_features=2048, bias=False)
          (down_proj): Linear(in_features=2048, out_features=512, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((512,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((512,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((512,), eps=1e-05)
    (rotary_emb): Ll

In [64]:
print("Starting training...")
for epoch in range(num_epochs):
    total_loss = 0
    optimizer.zero_grad()
    
    for batch in tqdm(train_loader, desc=f'Epoch {epoch+1}/{num_epochs}'):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        with accelerator.autocast():
            logits = classifier_model(input_ids, attention_mask)
            loss = criterion(logits.squeeze(-1), labels)
        
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)
    print(f'Epoch {epoch+1}, Average Loss: {avg_loss:.4f}')

print("Training completed.")



Starting training...


Epoch 1/5: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16838/16838 [04:41<00:00, 59.81it/s]


Epoch 1, Average Loss: 0.5494


Epoch 2/5: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16838/16838 [04:41<00:00, 59.73it/s]


Epoch 2, Average Loss: 0.5473


Epoch 3/5: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16838/16838 [04:41<00:00, 59.77it/s]


Epoch 3, Average Loss: 0.5470


Epoch 4/5: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16838/16838 [04:41<00:00, 59.74it/s]


Epoch 4, Average Loss: 0.5441


Epoch 5/5: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 16838/16838 [04:41<00:00, 59.79it/s]

Epoch 5, Average Loss: 0.5461
Training completed.





In [66]:
# Evaluation
classifier_model.eval()
all_predictions = []
all_labels = []

with torch.no_grad():
    for batch in tqdm(dev_loader, desc='Evaluating'):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        logits = classifier_model(input_ids, attention_mask)
        predictions = torch.sigmoid(logits.squeeze(-1)) > 0.5
        
        all_predictions.extend(predictions.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# Compute accuracy
accuracy = accuracy_score(all_labels, all_predictions)
print(f'Development Set Accuracy: {accuracy:.4f}')

# Show a few example predictions
print("\nPrediction examples:")
for i in range(5):
    sentence = dev_df['sentence'].iloc[i]
    true_label = dev_df['label'].iloc[i]
    pred_label = int(all_predictions[i])
    
    print(f"Sentence: {sentence}")
    print(f"True label: {true_label} ({'Positive' if true_label == 1 else 'Negative'})")
    print(f"Predicted: {pred_label} ({'Positive' if pred_label == 1 else 'Negative'})")
    print(f"Correct: {true_label == pred_label}")
    print("-" * 50)

Evaluating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 218/218 [00:04<00:00, 51.22it/s]

Development Set Accuracy: 0.7236

Prediction examples:
Sentence: it 's a charming and often affecting journey . 
True label: 1 (Positive)
Predicted: 1 (Positive)
Correct: True
--------------------------------------------------
Sentence: unflinchingly bleak and desperate 
True label: 0 (Negative)
Predicted: 1 (Positive)
Correct: False
--------------------------------------------------
Sentence: allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . 
True label: 1 (Positive)
Predicted: 1 (Positive)
Correct: True
--------------------------------------------------
Sentence: the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . 
True label: 1 (Positive)
Predicted: 1 (Positive)
Correct: True
--------------------------------------------------
Sentence: it 's slow -- very , very slow . 
True label: 0 (Negative)
Predicted: 0 (Negative)
Correct: True
--------------------------------
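
With 872 development examples, an accuracy of 0.7236 corresponds to 631 correct predictions. The decision rule used in the evaluation cell, `sigmoid(logit) > 0.5`, is the same as `logit > 0`. The plain-Python sketch below illustrates both the rule and the accuracy computation; the logits and labels are invented for the example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict(logits):
    """Label 1 when sigmoid(logit) > 0.5, which is the same as logit > 0."""
    return [int(sigmoid(z) > 0.5) for z in logits]


def accuracy(labels, predictions):
    correct = sum(int(y == p) for y, p in zip(labels, predictions))
    return correct / len(labels)


# Made-up logits and labels, for illustration only.
logits = [2.3, -0.4, 0.1, -3.0]
labels = [1, 0, 0, 0]
preds = predict(logits)
print(preds, accuracy(labels, preds))
```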




## 98. Fine-tuning

Fine-tune a pretrained model so that, given the prompt from Problem 96, it returns the correct sentiment label as its text response.

### Tuning Llama-3.2-3B-Instruct with QLoRA

In [3]:
import torch
from transformers import (
    LlamaForCausalLM, AutoTokenizer,
    Trainer, TrainingArguments, DataCollatorForLanguageModeling
)
from peft import (
    prepare_model_for_kbit_training,
    get_peft_model, LoraConfig
)

In [4]:
model_name = 'meta-llama/Llama-3.2-3B-Instruct'
output_dir = 'output/qlora-llama3b'

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Reuse EOS as the padding token (assumes the tokenizer ships without one).
tokenizer.pad_token_id = tokenizer.eos_token_id

In [6]:
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = LlamaForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [7]:
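# NOTE: task_type is left unset, so get_peft_model wraps the model in a generic
# PeftModel; passing task_type="CAUSAL_LM" would give PeftModelForCausalLM instead.
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    inference_mode=False
)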
model = get_peft_model(model, peft_config)
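
As a reminder of what the adapter adds to each targeted projection, here is a minimal plain-Python sketch of the LoRA forward pass, `y = W x + (alpha / r) * B (A x)`. The toy matrices use rank 1 to keep the numbers readable while keeping the same `alpha / r = 2` ratio as `r=8, lora_alpha=16`. The helper names and all values are illustrative assumptions, not PEFT internals.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, alpha):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy values: rank 1 on a 2x2 weight, with the same alpha / r = 2 as r=8, lora_alpha=16.
r, alpha = 1, 2
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]             # down projection: 2 -> 1
B_init = [[0.0], [0.0]]      # an all-zero up projection makes the adapter a no-op
B_trained = [[1.0], [-1.0]]  # hypothetical values after some training
x = [2.0, 4.0]

print(lora_forward(W, A, B_init, x, r, alpha))     # [2.0, 4.0]
print(lora_forward(W, A, B_trained, x, r, alpha))  # [8.0, -2.0]
```

Starting the up projection at zero means the wrapped model initially behaves exactly like the frozen base model, and training moves it away from that point gradually.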

In [8]:
from datasets import Dataset

In [9]:
def create_training_prompt(sentence, label, tokenizer):
    instruction = 'Determine if the sentiment of this sentence is positive or negative. Answer with only "positive" or "negative".'

    messages = [
        {"role": "system", "content": instruction},
        {"role": "user", "content": sentence}
    ]

    prompt = tokenizer.apply_chat_template(
        messages, 
        tokenize=False,
        add_generation_prompt=True
    )

    response = "positive" if label == 1 else "negative"
    full_text = prompt + response + tokenizer.eos_token

    return full_text

In [10]:
def preprocess_function(examples, tokenizer, max_length=512):
    texts = []
    for sentence, label in zip(examples['sentence'], examples['label']):
        full_text = create_training_prompt(sentence, label, tokenizer)
        texts.append(full_text)

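    # NOTE: the chat template already emits the BOS token, and the default
    # add_special_tokens=True prepends another one, which is why the decoded
    # example starts with two BOS tokens. add_special_tokens=False avoids this.
    model_inputs = tokenizer(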
        texts,
        truncation=True,
        max_length=max_length,
        padding='max_length',
        return_tensors='pt'
    )

    model_inputs["labels"] = model_inputs["input_ids"].clone()

    return model_inputs

In [11]:
import pandas as pd
train_df = pd.read_csv('./data/SST-2/train.tsv', sep='\t')
dev_df = pd.read_csv('./data/SST-2/dev.tsv', sep='\t')

print(f"Training samples: {len(train_df)}")
print(f"Development samples: {len(dev_df)}")

Training samples: 67349
Development samples: 872


In [12]:
train_dataset = Dataset.from_pandas(train_df)
dev_dataset = Dataset.from_pandas(dev_df)

print("Dataset created successfully!")
print(f"Train dataset: {train_dataset}")
print(f"Dev dataset: {dev_dataset}")

# Inspect the dataset contents
print(f"\nFirst training example:")
print(f"Sentence: {train_dataset[0]['sentence']}")
print(f"Label: {train_dataset[0]['label']}")

Dataset created successfully!
Train dataset: Dataset({
    features: ['sentence', 'label'],
    num_rows: 67349
})
Dev dataset: Dataset({
    features: ['sentence', 'label'],
    num_rows: 872
})

First training example:
Sentence: hide new secretions from the parental units 
Label: 0


In [13]:
tokenized_train = train_dataset.map(
    lambda x: preprocess_function(x, tokenizer),
    batched=True,
    remove_columns=train_dataset.column_names
)

tokenized_dev = dev_dataset.map(
    lambda x: preprocess_function(x, tokenizer),
    batched=True,
    remove_columns=dev_dataset.column_names
)

print("Tokenization completed!")
print(f"Tokenized train dataset: {tokenized_train}")
print(f"Tokenized dev dataset: {tokenized_dev}")

# Inspect the tokenized data
print(f"\nFirst tokenized example:")
example = tokenized_train[0]
print(f"Input IDs length: {len(example['input_ids'])}")
print(f"Decoded text: {tokenizer.decode(example['input_ids'][:100])}...")

Map:   0%|          | 0/67349 [00:00<?, ? examples/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

Tokenization completed!
Tokenized train dataset: Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 67349
})
Tokenized dev dataset: Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 872
})

First tokenized example:
Input IDs length: 512
Decoded text: <|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 01 Jun 2025

Determine if the sentiment of this sentence is positive or negative. Answer with only "positive" or "negative".<|eot_id|><|start_header_id|>user<|end_header_id|>

hide new secretions from the parental units<|eot_id|><|start_header_id|>assistant<|end_header_id|>

negative<|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_id|><|eot_

In [14]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # False because this is a causal language model
    pad_to_multiple_of=8,  # for efficiency
)

print("Data collator created!")

Data collator created!
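
One side effect of `tokenizer.pad_token_id = tokenizer.eos_token_id` is worth spelling out. With `mlm=False`, the collator is understood to rebuild `labels` from `input_ids` and replace every position equal to `pad_token_id` with -100 so that padding is ignored by the loss. The plain-Python sketch below imitates that masking on made-up token ids to show what happens when the pad id and the EOS id coincide. `mask_pad_labels` and the ids are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss


def mask_pad_labels(input_ids, pad_token_id):
    """Copy input_ids to labels and replace every pad id with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


# Hypothetical token ids: 11, 12 = prompt/answer tokens, 2 = EOS, 0 = dedicated PAD.
EOS, PAD = 2, 0

with_dedicated_pad = [11, 12, EOS, PAD, PAD]
print(mask_pad_labels(with_dedicated_pad, pad_token_id=PAD))
# [11, 12, 2, -100, -100]  -> the EOS after the answer is still a training target

with_eos_as_pad = [11, 12, EOS, EOS, EOS]
print(mask_pad_labels(with_eos_as_pad, pad_token_id=EOS))
# [11, 12, -100, -100, -100]  -> the real EOS is masked together with the padding
```

If the end-of-answer token is never a training target, the tuned model may not learn when to stop generating. A dedicated pad token is one way to avoid that.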


In [15]:
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=2,  # adjust to fit the available memory
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    learning_rate=2e-4,
    logging_steps=10000,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,
    warmup_steps=100,
    fp16=True,  # save memory
    dataloader_pin_memory=False,
    remove_unused_columns=False,
    report_to="none",  # disable wandb and other integrations (None may enable all installed ones)
)

print("Training arguments configured!")
print(f"Output directory: {training_args.output_dir}")
print(f"Batch size: {training_args.per_device_train_batch_size}")
print(f"Learning rate: {training_args.learning_rate}")

Training arguments configured!
Output directory: output/qlora-llama3b
Batch size: 2
Learning rate: 0.0002
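
These settings determine how often the progress table gets filled in. The arithmetic below is a rough sketch that assumes a single device and no gradient accumulation. It shows why a run with `logging_steps=10000` and `eval_steps=500` reports no training loss for the first several thousand steps.

```python
import math

train_examples = 67349
per_device_batch = 2
num_epochs = 1
logging_steps = 10000
eval_steps = 500

# Assumes a single device and no gradient accumulation.
total_steps = math.ceil(train_examples / per_device_batch) * num_epochs
evals_before_first_log = (logging_steps - 1) // eval_steps
total_evals = total_steps // eval_steps

print(f"total optimizer steps: {total_steps}")
print(f"evaluations before the first training-loss log: {evals_before_first_log}")
print(f"evaluations over the whole run: {total_evals}")
```

Choosing `logging_steps` no larger than `eval_steps` would put a training-loss value on every evaluation row.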


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_dev,
    data_collator=data_collator,
    processing_class=tokenizer,
)

print("Trainer created successfully!")
print("Starting fine-tuning...")

trainer.train()

print("Fine-tuning completed!")

No label_names provided for model class `PeftModel`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Trainer created successfully!
Starting fine-tuning...


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
  return fn(*args, **kwargs)


Step,Training Loss,Validation Loss
500,No log,No log
1000,No log,No log
1500,No log,No log
2000,No log,No log
2500,No log,No log
3000,No log,No log
3500,No log,No log
4000,No log,No log
4500,No log,No log
5000,No log,No log




In [None]:
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"Model saved to {output_dir}")

eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")

### Full fine-tuning of llm-jp-3-150m-instruct3

In [67]:
import torch
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    Trainer, TrainingArguments, DataCollatorForLanguageModeling
)

In [None]:
model_name = 'llm-jp/llm-jp-3-150m-instruct3'
output_dir = 'output/llm-jp150m'

In [102]:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map={"": "cuda:1"},
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [71]:
from datasets import Dataset

In [89]:
def create_training_prompt(sentence, label, tokenizer):
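    # Japanese instruction, kept as-is for the Japanese model. In English: "Decide whether the
    # following sentence is positive or negative. Answer with 'ポジティブ' or 'ネガティブ'."
    instruction = '以下の文がポジティブかネガティブのどちらなのか判定してください．"ポジティブ"か"ネガティブ"で回答してください．'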
    content = f'''{instruction}

{sentence}
    '''
    messages = [
        {"role": "user", "content": content}
    ]

    prompt = tokenizer.apply_chat_template(
        messages, 
        tokenize=False,
        add_generation_prompt=True
    )

    # "ポジティブ" = positive, "ネガティブ" = negative
    response = "ポジティブ" if label == 1 else "ネガティブ"
    full_text = prompt + response + tokenizer.eos_token

    return full_text

In [90]:
print(create_training_prompt("あいうえお", 1, tokenizer))

<s>

### 指示:
以下の文がポジティブかネガティブのどちらなのか判定してください．"ポジティブ"か"ネガティブ"で回答してください．

あいうえお
    

### 応答:
ポジティブ</s>


In [91]:
def preprocess_function(examples, tokenizer, max_length=512):
    texts = []
    for sentence, label in zip(examples['sentence'], examples['label']):
        full_text = create_training_prompt(sentence, label, tokenizer)
        texts.append(full_text)

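    # NOTE: the chat template already emits the BOS token, and the default
    # add_special_tokens=True prepends another one, which is why the decoded
    # example starts with two BOS tokens. add_special_tokens=False avoids this.
    model_inputs = tokenizer(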
        texts,
        truncation=True,
        max_length=max_length,
        padding='max_length',
        return_tensors='pt'
    )

    model_inputs["labels"] = model_inputs["input_ids"].clone()

    return model_inputs

In [92]:
import pandas as pd
train_df = pd.read_csv('./data/SST-2/train.tsv', sep='\t')
dev_df = pd.read_csv('./data/SST-2/dev.tsv', sep='\t')

print(f"Training samples: {len(train_df)}")
print(f"Development samples: {len(dev_df)}")

Training samples: 67349
Development samples: 872


In [93]:
train_dataset = Dataset.from_pandas(train_df)
dev_dataset = Dataset.from_pandas(dev_df)

print("Dataset created successfully!")
print(f"Train dataset: {train_dataset}")
print(f"Dev dataset: {dev_dataset}")

# Inspect the dataset contents
print(f"\nFirst training example:")
print(f"Sentence: {train_dataset[0]['sentence']}")
print(f"Label: {train_dataset[0]['label']}")

Dataset created successfully!
Train dataset: Dataset({
    features: ['sentence', 'label'],
    num_rows: 67349
})
Dev dataset: Dataset({
    features: ['sentence', 'label'],
    num_rows: 872
})

First training example:
Sentence: hide new secretions from the parental units 
Label: 0


In [94]:
tokenized_train = train_dataset.map(
    lambda x: preprocess_function(x, tokenizer),
    batched=True,
    remove_columns=train_dataset.column_names
)

tokenized_dev = dev_dataset.map(
    lambda x: preprocess_function(x, tokenizer),
    batched=True,
    remove_columns=dev_dataset.column_names
)

print("Tokenization completed!")
print(f"Tokenized train dataset: {tokenized_train}")
print(f"Tokenized dev dataset: {tokenized_dev}")

# Inspect the tokenized data
print(f"\nFirst tokenized example:")
example = tokenized_train[0]
print(f"Input IDs length: {len(example['input_ids'])}")
print(f"Decoded text: {tokenizer.decode(example['input_ids'][:100])}...")

Map:   0%|          | 0/67349 [00:00<?, ? examples/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

Tokenization completed!
Tokenized train dataset: Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 67349
})
Tokenized dev dataset: Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 872
})

First tokenized example:
Input IDs length: 512
Decoded text: <s><s> 

### 指示:
以下の文がポジティブかネガティブのどちらなのか判定してください．"ポジティブ"か"ネガティブ"で回答してください．

hide new secretions from the parental units 
    

### 応答:
ネガティブ</s><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD|LLM-jp><PAD

In [95]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # False because this is a causal language model
    pad_to_multiple_of=8,  # for efficiency
)

print("Data collator created!")

Data collator created!


In [None]:
training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=16,  # adjust to fit the available memory
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    learning_rate=2e-4,
    logging_steps=100,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,
    warmup_steps=100,
    fp16=True,  # save memory
    dataloader_pin_memory=False,
    remove_unused_columns=False,
    report_to="none",  # disable wandb and other integrations (None may enable all installed ones)
)

print("Training arguments configured!")
print(f"Output directory: {training_args.output_dir}")
print(f"Batch size: {training_args.per_device_train_batch_size}")
print(f"Learning rate: {training_args.learning_rate}")

Training arguments configured!
Output directory: output/llm-jp150m
Batch size: 16
Learning rate: 0.0002
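
`warmup_steps=100` only makes sense together with a learning-rate schedule. Assuming the Trainer's default linear schedule (ramp up during warmup, then linear decay to zero), the learning rate over the run would look roughly like the sketch below. `linear_schedule_with_warmup` is a stand-alone function written for this example, and the step counts assume a single device with no gradient accumulation.

```python
import math


def linear_schedule_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Linear ramp from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(67349 / 16)
total_steps = steps_per_epoch * 5

for step in (0, 50, 100, total_steps // 2, total_steps):
    lr = linear_schedule_with_warmup(step, 2e-4, 100, total_steps)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

With about 21,000 steps in total, 100 warmup steps is a very short ramp. For full fine-tuning of a small model at `2e-4`, a longer warmup or a lower peak rate may be worth trying.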


In [104]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_dev,
    data_collator=data_collator,
    processing_class=tokenizer,
)

print("Trainer created successfully!")
print("Starting fine-tuning...")

trainer.train()

print("Fine-tuning completed!")

Trainer created successfully!
Starting fine-tuning...


Step,Training Loss,Validation Loss


KeyboardInterrupt: 

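## 99. Preference tuning

For the prompt from Problem 96, apply preference tuning to a pretrained language model, treating text that contains the correct sentiment label as the preferred response and text that contains the wrong sentiment label as the dispreferred response. Possible algorithms include Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO).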
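
As a starting point, the plain-Python sketch below shows the two ingredients DPO needs: preference pairs built from SST-2 labels, and the per-example DPO loss computed from sequence log-probabilities under the policy and a frozen reference model. The `prompt`/`chosen`/`rejected` field names, the English label strings, `beta=0.1`, and the log-probability values are all assumptions made for illustration. A real run would wrap the sentence in the Problem 96 prompt and obtain the log-probabilities from the models.

```python
import math


def make_preference_pair(sentence: str, label: int) -> dict:
    """Correct label text is 'chosen'; the opposite label text is 'rejected'."""
    correct, wrong = ("positive", "negative") if label == 1 else ("negative", "positive")
    return {"prompt": sentence, "chosen": correct, "rejected": wrong}


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]) on sequence log-probs."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # softplus(-z) == -log(sigmoid(z)), written in a numerically stable form
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


print(make_preference_pair("it 's a charming and often affecting journey .", 1))

# Made-up log-probabilities, for illustration only.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))  # policy already prefers the chosen answer
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))  # policy == reference -> log(2)
```

The loss falls below log(2) as soon as the policy raises the chosen answer relative to the rejected one more than the reference does, which is the direction DPO training pushes.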
