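# Chapter 9: Pretrained Language Models (BERT-style)

In this chapter, we use BERT-style pretrained models to predict masked words, compute sentence vectors, and build a sentiment analyzer (a positive/negative classifier).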

## 80. Tokenization

Split the sentence "The movie was full of incomprehensibilities." into tokens and display the token sequence.

In [2]:
from transformers import AutoTokenizer

model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "The movie was full of incomprehensibilities."
tokens = tokenizer.tokenize(text)

print(tokens)

['the', 'movie', 'was', 'full', 'of', 'inc', '##omp', '##re', '##hen', '##si', '##bilities', '.']
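
The split of "incomprehensibilities" into `##`-prefixed pieces comes from WordPiece, which BERT tokenizers apply greedily, always taking the longest vocabulary entry that matches at the current position. The following is a minimal sketch of that matching step only. The function name `wordpiece` and the tiny `toy_vocab` are assumptions made for illustration; the real tokenizer also lowercases, splits punctuation, and uses a vocabulary of roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into subwords."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # nothing matched: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only so the example reproduces the split shown above.
toy_vocab = {"full", "inc", "##omp", "##re", "##hen", "##si", "##bilities"}

print(wordpiece("full", toy_vocab))
print(wordpiece("incomprehensibilities", toy_vocab))
```

A word that is in the vocabulary stays whole, while a rare word is broken into several pieces, which is why the token sequence can be longer than the word count.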


## 81. Mask prediction

Find the most suitable token to fill "[MASK]" in "The movie was full of [MASK].".

In [3]:
from transformers import pipeline

fill_mask = pipeline("fill-mask", model=model_name, device="cpu", top_k=1)
masked_text = "The movie was full of [MASK]."
outputs = fill_mask(masked_text)

print(f"\n{masked_text}")
print(f"[Mask]：{outputs[0]['token_str']}")

Some weights of the model checkpoint at google-bert/bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu



The movie was full of [MASK].
[Mask]：fun


## 82. Top-k mask prediction

Find the 10 most suitable tokens to fill "[MASK]" in "The movie was full of [MASK].", together with their probabilities (likelihoods).

In [None]:
from transformers import pipeline

model_name = "google-bert/bert-base-uncased"
fill_mask_10 = pipeline("fill-mask", model=model_name, device="cpu", top_k=10)

masked_text = "The movie was full of [MASK]."
results = fill_mask_10(masked_text)

print(f"\n{masked_text}")
for i in range(10):
    print(
        f"{str(i+1).rjust(2,' ')}.  [MASK]：{results[i]['token_str'].ljust(12, ' ')}  score：{results[i]['score']}"
    )


The movie was full of [MASK].
 1.  [MASK]：fun           score：0.10711917281150818
 2.  [MASK]：surprises     score：0.06634506583213806
 3.  [MASK]：drama         score：0.04468407481908798
 4.  [MASK]：stars         score：0.027217093855142593
 5.  [MASK]：laughs        score：0.025412950664758682
 6.  [MASK]：action        score：0.01951688528060913
 7.  [MASK]：excitement    score：0.01903812400996685
 8.  [MASK]：people        score：0.018290206789970398
 9.  [MASK]：tension       score：0.0150305712595582
10.  [MASK]：music         score：0.014646259136497974
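
The `score` values above are probabilities: the model produces one logit per vocabulary entry at the masked position, a softmax turns the logits into a distribution, and the pipeline reports the `top_k` largest entries. The sketch below reproduces that ranking step in plain Python. The five tokens and their logits are made-up values for illustration, not model outputs.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(tokens, logits, k):
    """Return the k (token, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical five-token vocabulary with hypothetical logits at the [MASK] position.
tokens = ["fun", "surprises", "drama", "stars", "the"]
logits = [2.0, 1.5, 1.1, 0.6, -1.0]

for rank, (token, prob) in enumerate(top_k(tokens, logits, k=3), start=1):
    print(f"{rank}. {token:<10} {prob:.4f}")
```

Because the softmax runs over the whole vocabulary, the top-10 scores in the real output sum to well under 1; the remaining mass is spread over all other tokens.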


## 83. Sentence vectors from the [CLS] token

For every pair of the following sentences, compute the cosine similarity using the final-layer embedding vector of the [CLS] token.

- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."


In [None]:
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = [
    "The movie was full of fun.",
    "The movie was full of excitement.",
    "The movie was full of crap.",
    "The movie was full of rubbish.",
]

# Tokenize all sentences as one padded batch
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)

# Run the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Take the final-layer embedding at position 0, which is the [CLS] token
cls_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()

# Compute the cosine similarity for all pairs at once
similarity_matrix = cosine_similarity(cls_embeddings)

print("Cosine similarity for each sentence pair:")
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        print(
            f"'{sentences[i]}' vs '{sentences[j]}': {similarity_matrix[i][j]:.4f}"
        )

Cosine similarity for each sentence pair:
'The movie was full of fun.' vs 'The movie was full of excitement.': 0.9881
'The movie was full of fun.' vs 'The movie was full of crap.': 0.9558
'The movie was full of fun.' vs 'The movie was full of rubbish.': 0.9475
'The movie was full of excitement.' vs 'The movie was full of crap.': 0.9541
'The movie was full of excitement.' vs 'The movie was full of rubbish.': 0.9487
'The movie was full of crap.' vs 'The movie was full of rubbish.': 0.9807
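
For reference, the cosine similarity used above is the dot product of two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. The sketch below writes it out in plain Python. The helper names `cosine` and `pairwise` and the 2-dimensional vectors are illustrative assumptions; the real [CLS] vectors have 768 dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise(vectors):
    """Full similarity matrix, analogous to calling cosine_similarity on one array."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


# Hypothetical 2-dimensional vectors.
vectors = [[1.0, 2.0], [2.0, 4.0], [2.0, -1.0]]
for row in pairwise(vectors):
    print([round(value, 4) for value in row])
```

Note that in the results above even sentences with opposite sentiment score above 0.94, so the raw [CLS] vector of the pretrained model separates these sentences only weakly.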


## 84. Sentence vectors by averaging

For every pair of the following sentences, compute the cosine similarity using the mean of the final-layer embedding vectors.

- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."

In [None]:
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = [
    "The movie was full of fun.",
    "The movie was full of excitement.",
    "The movie was full of crap.",
    "The movie was full of rubbish.",
]

# Tokenize all sentences as one padded batch
inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)

# Run the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Mean over all positions of the final layer, including [CLS] and [SEP].
# Note: if the sentences differ in token length, padding positions are averaged in
# as well; weighting by inputs["attention_mask"] avoids that.
mean_embeddings = outputs.last_hidden_state.mean(dim=1).cpu().numpy()

# Compute the cosine similarity for all pairs at once
similarity_matrix = cosine_similarity(mean_embeddings)

print("Cosine similarity for each sentence pair:")
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        print(
            f"'{sentences[i]}' vs '{sentences[j]}': {similarity_matrix[i][j]:.4f}"
        )

Cosine similarity for each sentence pair:
'The movie was full of fun.' vs 'The movie was full of excitement.': 0.9568
'The movie was full of fun.' vs 'The movie was full of crap.': 0.8490
'The movie was full of fun.' vs 'The movie was full of rubbish.': 0.8169
'The movie was full of excitement.' vs 'The movie was full of crap.': 0.8352
'The movie was full of excitement.' vs 'The movie was full of rubbish.': 0.7938
'The movie was full of crap.' vs 'The movie was full of rubbish.': 0.9226
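
When a batch mixes sentences of different token lengths, a plain mean over the sequence axis also averages the vectors at padding positions. A common remedy is to average only the positions whose attention mask is 1. The sketch below shows the idea for a single sentence in plain Python; `masked_mean` and the toy values are illustrative assumptions, not part of the notebook.

```python
def masked_mean(hidden_states, attention_mask):
    """Average the token vectors of one sentence, skipping positions whose mask is 0."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vector[d]
    return [value / count for value in total]


# Hypothetical 2-dimensional hidden states; the last two positions are padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]
attention_mask = [1, 1, 0, 0]

print(masked_mean(hidden_states, attention_mask))  # [2.0, 3.0]
```

A plain mean over all four positions would give `[5.5, 6.0]` here, which shows how padding can distort the sentence vector.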


## 85. Preparing the dataset

From the [Stanford Sentiment Treebank (SST)](https://dl.fbaipublicfiles.com/glue/data/SST-2.zip) distributed with the [General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) benchmark, load the texts and polarity labels of the training set (train.tsv) and the development set (dev.tsv), then convert every text into a token sequence.

In [None]:
from transformers import AutoModel, AutoTokenizer
import pandas as pd


def load_data(file_path):
    df = pd.read_csv(file_path, sep="\t", header=0)
    return df["sentence"].tolist(), df["label"].tolist()


# Convert each text into a token sequence
def tokenize_texts(texts):
    tokenized_texts = []
    for text in texts:
        # tokenize() returns the WordPiece tokens only; it does not add [CLS] or [SEP]
        tokens = tokenizer.tokenize(text)
        tokenized_texts.append(tokens)
    return tokenized_texts


model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_path = "../第7章：機械学習/SST-2/train.tsv"
dev_path = "../第7章：機械学習/SST-2/dev.tsv"

train_texts, train_labels = load_data(train_path)
dev_texts, dev_labels = load_data(dev_path)

# Tokenize both splits
train_tokenized = tokenize_texts(train_texts)
dev_tokenized = tokenize_texts(dev_texts)

print(train_tokenized[0])

['hide', 'new', 'secret', '##ions', 'from', 'the', 'parental', 'units']
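
The loader above relies on the file being tab-separated with a header row naming the `sentence` and `label` columns. As a sketch of the same parsing step without pandas, the standard `csv` module can read that layout directly. The function name `load_tsv` and the two sample rows are assumptions for illustration; `csv.QUOTE_NONE` is a design choice here so that quote characters inside a sentence are kept literally.

```python
import csv
import io


def load_tsv(handle):
    """Read a tab-separated file with 'sentence' and 'label' columns."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    texts, labels = [], []
    for row in reader:
        texts.append(row["sentence"])
        labels.append(int(row["label"]))
    return texts, labels


# Hypothetical in-memory file with the same column layout.
sample = io.StringIO(
    "sentence\tlabel\n"
    "a hypothetical positive line\t1\n"
    "a hypothetical negative line\t0\n"
)
texts, labels = load_tsv(sample)
print(texts)
print(labels)
```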


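## 86. Building a mini-batch

For a portion of the training data loaded in problem 85 (for example, the first four examples), apply padding and related processing so that the token sequences have the same length, and build a mini-batch.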

In [None]:
import pandas as pd
import pprint
from transformers import AutoModel, AutoTokenizer


def load_data(file_path):
    df = pd.read_csv(file_path, sep="\t", header=0)
    return df["sentence"].tolist(), df["label"].tolist()


model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_path = "../第7章：機械学習/SST-2/train.tsv"
dev_path = "../第7章：機械学習/SST-2/dev.tsv"

train_texts, train_labels = load_data(train_path)
dev_texts, dev_labels = load_data(dev_path)

# Encode the first four examples, padding to the longest sequence in the batch
encoded = tokenizer(train_texts[:4], padding=True, truncation=True, return_tensors="pt")

pprint.pprint(encoded)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]]),
 'input_ids': tensor([[  101,  5342,  2047,  3595,  8496,  2013,  1996, 18643,  3197,   102,
             0,     0,     0,     0,     0],
        [  101,  3397,  2053, 15966,  1010,  2069,  4450,  2098, 18201,  2015,
           102,     0,     0,     0,     0],
        [  101,  2008,  7459,  2049,  3494,  1998, 10639,  2015,  2242,  2738,
          3376,  2055,  2529,  3267,   102],
        [  101,  3464, 12580,  8510,  2000,  3961,  1996,  2168,  2802,   102,
             0,     0,     0,     0,     0]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])}
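
In the output above, each row starts with ID 101 (`[CLS]`), ends its real content with ID 102 (`[SEP]`), and is then filled with ID 0 (`[PAD]`), while `attention_mask` marks real tokens with 1 and padding with 0. The sketch below rebuilds that padding step by hand for two of the rows. The helper `pad_batch` is an illustrative assumption and not the tokenizer's actual implementation.

```python
PAD_ID = 0  # ID of [PAD] in bert-base-uncased, as visible in the output above


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every ID sequence to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# The first two rows of the output above with their padding removed.
batch = [
    [101, 5342, 2047, 3595, 8496, 2013, 1996, 18643, 3197, 102],
    [101, 3397, 2053, 15966, 1010, 2069, 4450, 2098, 18201, 2015, 102],
]

input_ids, attention_mask = pad_batch(batch)
for ids, mask in zip(input_ids, attention_mask):
    print(ids)
    print(mask)
```

With only these two rows the batch is padded to length 11; in the real output the third example has 15 tokens, so every row is padded to 15.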


## 87. Fine-tuning

Using the training set, fine-tune the pretrained model for the polarity analysis task. Measure the accuracy of the fine-tuned model on the validation set.

In [None]:
import pandas as pd
import torch
from torch.utils.data import DataLoader
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import accuracy_score
import logging
from datetime import datetime
import os
from pytorch_lightning.loggers import TensorBoardLogger


# Set up logging to both a timestamped file and the console
def setup_logging():
    os.makedirs("log", exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_filename = f"log/training_{timestamp}.log"
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(levelname)s - %(message)s",
        handlers=[logging.FileHandler(log_filename), logging.StreamHandler()],
    )
    logging.info(f"Log file: {log_filename}")
    return timestamp


def load_data(file_path):
    df = pd.read_csv(file_path, sep="\t", header=0)
    return df["sentence"].tolist(), df["label"].tolist()


def make_dataset(tokenizer, max_length, texts, labels=None):
    dataset_for_loader = []
    for i, text in enumerate(texts):
        # Tokenize to exactly max_length tokens: shorter texts are filled with [PAD],
        # and longer texts are truncated.
        encoding = tokenizer(
            text, max_length=max_length, padding="max_length", truncation=True
        )
        # The tokenizer returns a dict-like object; attach the label ID when one is given.
        if labels is not None:
            encoding["labels"] = labels[i]
        # Convert every field to a tensor and store the preprocessed example
        dataset_for_loader.append(
            {key: torch.tensor(value) for key, value in encoding.items()}
        )
    return dataset_for_loader


# ====================
# Text classification with BERT
# ====================


class Bert4Classification(pl.LightningModule):
    # Load the model. The loss function is chosen automatically:
    # num_labels == 1 -> regression task, so MSELoss()
    # num_labels > 1 -> classification task, so CrossEntropyLoss()
    def __init__(self, model_name, num_labels, lr):
        super().__init__()
        # Store model_name, num_labels, and lr; lr is then available as self.hparams.lr
        self.save_hyperparameters()
        self.bert_sc = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels
        )

    # Receive a training batch and compute the loss
    def training_step(self, batch, batch_idx):
        output = self.bert_sc(**batch)
        train_loss = output.loss
        labels_predicted = output.logits.argmax(-1)
        labels = batch["labels"]
        train_acc = accuracy_score(labels.cpu().numpy(), labels_predicted.cpu().numpy())
        self.log("train_loss", train_loss, prog_bar=True)
        self.log("train_acc", train_acc, prog_bar=True)
        return train_loss

    # Receive a validation batch and compute the loss and accuracy
    def validation_step(self, batch, batch_idx):
        output = self.bert_sc(**batch)
        val_loss = output.loss
        labels_predicted = output.logits.argmax(-1)
        labels = batch["labels"]
        val_acc = accuracy_score(labels.cpu().numpy(), labels_predicted.cpu().numpy())
        self.log("val_loss", val_loss, prog_bar=True)
        self.log("val_acc", val_acc, prog_bar=True)

    # Configure the optimizer
    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)


timestamp = setup_logging()
logger = TensorBoardLogger(
    save_dir="lightning_logs", name=f"training_{timestamp}", default_hp_metric=False
)


model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_path = "../第7章：機械学習/SST-2/train.tsv"
dev_path = "../第7章：機械学習/SST-2/dev.tsv"

train_texts, train_labels = load_data(train_path)
dev_texts, dev_labels = load_data(dev_path)

# Maximum sequence length in tokens
max_length = 128

dataset_train = make_dataset(tokenizer, max_length, train_texts, train_labels)
dataset_val = make_dataset(tokenizer, max_length, dev_texts, dev_labels)

# Create the data loaders. Only the training data is shuffled.
dataloader_train = DataLoader(dataset_train, batch_size=64, shuffle=True)
dataloader_val = DataLoader(dataset_val, batch_size=256, shuffle=False)


# ====================
# Training
# ====================
model = Bert4Classification(model_name, num_labels=2, lr=1e-5)

# Stop when val_acc has not improved for 3 consecutive validation runs
early_stopping = EarlyStopping(monitor="val_acc", mode="max", patience=3, verbose=True)

# Checkpointing during training
checkpoint = pl.callbacks.ModelCheckpoint(
    # Keep only the model with the highest validation accuracy
    monitor="val_acc",
    mode="max",
    save_top_k=1,
    # Save the weights only, into the "model" directory
    save_weights_only=True,
    dirpath="model/",
)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=[0],
    max_epochs=50,
    callbacks=[checkpoint, early_stopping],
    logger=logger,
)

# Train
logging.info("Starting training.")
trainer.fit(model, dataloader_train, dataloader_val)
logging.info("Training finished.")

logging.info(f"Best model: {checkpoint.best_model_path}")
logging.info(f"Validation accuracy of the best model: {checkpoint.best_model_score}")

2025-05-15 22:01:00,267 - INFO - Log file: log/training_20250515_220100.log
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
2025-05-15 22:01:30,801 - INFO - Starting training.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Epoch 0: 100%|██████████| 1053/1053 [09:29<00:00,  1.85it/s, v_num=0, train_loss=0.109, train_acc=0.952, val_loss=0.223, val_acc=0.916]

Metric val_acc improved. New best score: 0.916

Epoch 1: 100%|██████████| 1053/1053 [09:35<00:00,  1.83it/s, v_num=0, train_loss=0.103, train_acc=0.952, val_loss=0.225, val_acc=0.920]

Metric val_acc improved by 0.003 >= min_delta = 0.0. New best score: 0.920

Epoch 2: 100%|██████████| 1053/1053 [09:34<00:00,  1.83it/s, v_num=0, train_loss=0.0862, train_acc=0.952, val_loss=0.244, val_acc=0.922]

Metric val_acc improved by 0.002 >= min_delta = 0.0. New best score: 0.922

Epoch 5: 100%|██████████| 1053/1053 [09:25<00:00,  1.86it/s, v_num=0, train_loss=0.00104, train_acc=1.000, val_loss=0.326, val_acc=0.927]

Metric val_acc improved by 0.005 >= min_delta = 0.0. New best score: 0.927

Epoch 8: 100%|██████████| 1053/1053 [09:23<00:00,  1.87it/s, v_num=0, train_loss=0.00047, train_acc=1.000, val_loss=0.356, val_acc=0.914]

Monitored metric val_acc did not improve in the last 3 records. Best score: 0.927. Signaling Trainer to stop.

2025-05-15 23:27:24,028 - INFO - Training finished.
2025-05-15 23:27:24,030 - INFO - Best model: /net/nas8/data/home/murakami/nlp-100-knocks/第9章：事前学習済み言語モデル（BERT型）/model/epoch=5-step=6318.ckpt
2025-05-15 23:27:24,031 - INFO - Validation accuracy of the best model: 0.9266055226325989
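
The run above stops at epoch 8 even though `max_epochs=50`, because the monitored `val_acc` did not beat its best value for `patience=3` consecutive validation runs. The sketch below imitates that stopping rule in plain Python. The function `early_stopping_epoch` and the accuracy series are hypothetical illustrations (the series is not the logged values), and the logic is a simplified reading of the callback's behaviour with `mode="max"` and `min_delta=0`.

```python
def early_stopping_epoch(val_accs, patience):
    """Return (stop_epoch, best_epoch, best_score) under a patience-based rule."""
    best, best_epoch, bad_epochs = float("-inf"), -1, 0
    for epoch, acc in enumerate(val_accs):
        if acc > best:  # strict improvement resets the counter
            best, best_epoch, bad_epochs = acc, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch, best
    return len(val_accs) - 1, best_epoch, best


# Hypothetical validation accuracies, one per epoch.
val_accs = [0.90, 0.91, 0.92, 0.91, 0.91, 0.93, 0.92, 0.92, 0.91]

stop_epoch, best_epoch, best_score = early_stopping_epoch(val_accs, patience=3)
print(stop_epoch, best_epoch, best_score)  # 8 5 0.93
```

The checkpoint callback monitors the same metric, so the saved model comes from the best epoch rather than from the epoch at which training stopped.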


## 88. Polarity analysis

Using the model fine-tuned in problem 87, predict the polarity of the following sentences.

- "The movie was full of incomprehensibilities."
- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."


In [2]:
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import pytorch_lightning as pl
from sklearn.metrics import accuracy_score  # used by the model definition below


def make_dataset(tokenizer, max_length, texts, labels=None):
    dataset_for_loader = []
    for i, text in enumerate(texts):
        # Tokenize to exactly max_length tokens: shorter texts are filled with [PAD],
        # and longer texts are truncated.
        encoding = tokenizer(
            text, max_length=max_length, padding="max_length", truncation=True
        )
        # The tokenizer returns a dict-like object; attach the label ID when one is given.
        if labels is not None:
            encoding["labels"] = labels[i]
        # Convert every field to a tensor and store the preprocessed example
        dataset_for_loader.append(
            {key: torch.tensor(value) for key, value in encoding.items()}
        )
    return dataset_for_loader


# Same model definition as in problem 87; required to restore the checkpoint
class Bert4Classification(pl.LightningModule):
    # Load the model. The loss function is chosen automatically:
    # num_labels == 1 -> regression task, so MSELoss()
    # num_labels > 1 -> classification task, so CrossEntropyLoss()
    def __init__(self, model_name, num_labels, lr):
        super().__init__()
        # Store model_name, num_labels, and lr; lr is then available as self.hparams.lr
        self.save_hyperparameters()
        self.bert_sc = AutoModelForSequenceClassification.from_pretrained(
            model_name, num_labels=num_labels
        )

    # Receive a training batch and compute the loss
    def training_step(self, batch, batch_idx):
        output = self.bert_sc(**batch)
        train_loss = output.loss
        labels_predicted = output.logits.argmax(-1)
        labels = batch["labels"]
        train_acc = accuracy_score(labels.cpu().numpy(), labels_predicted.cpu().numpy())
        self.log("train_loss", train_loss, prog_bar=True)
        self.log("train_acc", train_acc, prog_bar=True)
        return train_loss

    # Receive a validation batch and compute the loss and accuracy
    def validation_step(self, batch, batch_idx):
        output = self.bert_sc(**batch)
        val_loss = output.loss
        labels_predicted = output.logits.argmax(-1)
        labels = batch["labels"]
        val_acc = accuracy_score(labels.cpu().numpy(), labels_predicted.cpu().numpy())
        self.log("val_loss", val_loss, prog_bar=True)
        self.log("val_acc", val_acc, prog_bar=True)

    # Configure the optimizer
    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.hparams.lr)


model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

eval_texts = [
    "The movie was full of incomprehensibilities.",
    "The movie was full of fun.",
    "The movie was full of excitement.",
    "The movie was full of crap.",
    "The movie was full of rubbish.",
]

max_length = 128

# No labels are passed, so each example holds only the model inputs
dataset_eval = make_dataset(tokenizer, max_length, eval_texts)
dataloader_eval = DataLoader(dataset_eval, batch_size=1, shuffle=False)

# Load the best checkpoint saved in problem 87
best_model_path = "model/epoch=5-step=6318.ckpt"
model = Bert4Classification.load_from_checkpoint(
    best_model_path, model_name=model_name, num_labels=2, lr=1e-5
)

model = model.to("cuda")
model.eval()
with torch.no_grad():
    preds = []
    for batch in dataloader_eval:
        batch = {k: v.to(model.device) for k, v in batch.items()}
        output = model.bert_sc(**batch)
        # The predicted class is the index of the larger logit
        labels_predicted = output.logits.argmax(-1)
        preds.append(labels_predicted)
    preds = torch.cat(preds)

for text, pred in zip(eval_texts, preds):
    print(f"Sentence: {text}  Prediction: {pred}")

Sentence: The movie was full of incomprehensibilities.  Prediction: 0
Sentence: The movie was full of fun.  Prediction: 1
Sentence: The movie was full of excitement.  Prediction: 1
Sentence: The movie was full of crap.  Prediction: 0
Sentence: The movie was full of rubbish.  Prediction: 0
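
The predictions above are class IDs. In SST-2, label 0 denotes negative and label 1 denotes positive, so the IDs can be mapped to readable names as in the short sketch below. The `ID2LABEL` dictionary is an illustrative helper, and `preds` simply repeats the IDs printed above.

```python
ID2LABEL = {0: "negative", 1: "positive"}  # SST-2 label convention

texts = [
    "The movie was full of incomprehensibilities.",
    "The movie was full of fun.",
    "The movie was full of excitement.",
    "The movie was full of crap.",
    "The movie was full of rubbish.",
]
preds = [0, 1, 1, 0, 0]  # class IDs copied from the output above

labels = [ID2LABEL[pred] for pred in preds]
for text, label in zip(texts, labels):
    print(f"{label:<8}  {text}")
```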


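## 89. Changing the architecture

Design a classification model whose architecture differs from the one in problem 87 (for example, one that uses the [CLS] token, or max pooling over the tokens), and fine-tune the pretrained model for the polarity analysis task. Measure the accuracy of the fine-tuned model on the validation set.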

In [3]:
import pandas as pd
import torch
from torch.utils.data import DataLoader
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics import accuracy_score
import logging
from datetime import datetime
import os
from pytorch_lightning.loggers import TensorBoardLogger


# Set up logging to both a timestamped file and the console
def setup_logging():
    os.makedirs("log", exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_filename = f"log/training_{timestamp}.log"
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(levelname)s - %(message)s",
        handlers=[logging.FileHandler(log_filename), logging.StreamHandler()],
    )
    logging.info(f"Log file: {log_filename}")
    return timestamp


def load_data(file_path):
    df = pd.read_csv(file_path, sep="\t", header=0)
    return df["sentence"].tolist(), df["label"].tolist()


def make_dataset(tokenizer, max_length, texts, labels=None):
    dataset_for_loader = []
    for i, text in enumerate(texts):
        # Tokenize to exactly max_length tokens: shorter texts are filled with [PAD],
        # and longer texts are truncated.
        encoding = tokenizer(
            text, max_length=max_length, padding="max_length", truncation=True
        )
        # The tokenizer returns a dict-like object; attach the label ID when one is given.
        if labels is not None:
            encoding["labels"] = labels[i]
        # Convert every field to a tensor and store the preprocessed example
        dataset_for_loader.append(
            {key: torch.tensor(value) for key, value in encoding.items()}
        )
    return dataset_for_loader


# ====================
# Text classification with BERT
# ====================

# Max pooling over the final-layer token vectors, followed by a linear classifier
class Bert4ClassificationMaxPool(pl.LightningModule):
    def __init__(self, model_name, num_labels, lr):
        super().__init__()
        self.save_hyperparameters()
        self.bert = AutoModel.from_pretrained(model_name)
        hidden_size = self.bert.config.hidden_size
        self.classifier = torch.nn.Linear(hidden_size, num_labels)
        self.lr = lr

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_states = outputs.last_hidden_state  # [batch, seq_len, hidden]

        # Fill padding positions (attention_mask == 0) with -1e9 so they never win the max
        masked_hidden = hidden_states.masked_fill(attention_mask.unsqueeze(-1) == 0, -1e9)
        pooled = masked_hidden.max(dim=1).values  # max pooling over the sequence axis

        return self.classifier(pooled)

    def training_step(self, batch, batch_idx):
        labels = batch["labels"]
        # Pass everything except the labels to forward()
        logits = self(**{k: v for k, v in batch.items() if k != "labels"})
        loss = torch.nn.functional.cross_entropy(logits, labels)
        preds = logits.argmax(dim=-1)
        acc = accuracy_score(labels.cpu().numpy(), preds.cpu().numpy())
        self.log("train_loss", loss, prog_bar=True)
        self.log("train_acc", acc, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        labels = batch["labels"]
        logits = self(**{k: v for k, v in batch.items() if k != "labels"})
        loss = torch.nn.functional.cross_entropy(logits, labels)
        preds = logits.argmax(dim=-1)
        acc = accuracy_score(labels.cpu().numpy(), preds.cpu().numpy())
        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", acc, prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)


timestamp = setup_logging()
logger = TensorBoardLogger(
    save_dir="lightning_logs", name=f"training_{timestamp}", default_hp_metric=False
)


model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_path = "../第7章：機械学習/SST-2/train.tsv"
dev_path = "../第7章：機械学習/SST-2/dev.tsv"

train_texts, train_labels = load_data(train_path)
dev_texts, dev_labels = load_data(dev_path)

# Maximum sequence length in tokens
max_length = 128

dataset_train = make_dataset(tokenizer, max_length, train_texts, train_labels)
dataset_val = make_dataset(tokenizer, max_length, dev_texts, dev_labels)

# Create the data loaders. Only the training data is shuffled.
dataloader_train = DataLoader(dataset_train, batch_size=64, shuffle=True)
dataloader_val = DataLoader(dataset_val, batch_size=256, shuffle=False)


# ====================
# Training
# ====================
model = Bert4ClassificationMaxPool(model_name, num_labels=2, lr=1e-5)

# Stop when val_acc has not improved for 3 consecutive validation runs
early_stopping = EarlyStopping(monitor="val_acc", mode="max", patience=3, verbose=True)

# Checkpointing during training
checkpoint = pl.callbacks.ModelCheckpoint(
    # Keep only the model with the highest validation accuracy
    monitor="val_acc",
    mode="max",
    save_top_k=1,
    # Save the weights only, into the "model" directory
    save_weights_only=True,
    dirpath="model/",
)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=[0],
    max_epochs=50,
    callbacks=[checkpoint, early_stopping],
    logger=logger,
)

# Train
logging.info("Starting training.")
trainer.fit(model, dataloader_train, dataloader_val)
logging.info("Training finished.")

logging.info(f"Best model: {checkpoint.best_model_path}")
logging.info(f"Validation accuracy of the best model: {checkpoint.best_model_score}")

2025-05-16 04:16:16,988 - INFO - Log file: log/training_20250516_041616.log
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
2025-05-16 04:16:29,564 - INFO - Starting training.
/net/nas8/data/home/murakami/nlp-100-knocks/.venv/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:654: Checkpoint directory /net/nas8/data/home/murakami/nlp-100-knocks/第9章：事前学習済み言語モデル（BERT型）/model exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]

  | Name       | Type      | Params | Mode 
-------------------------------------------------
0 | bert       | BertModel | 109 M  | eval 
1 | classifier | Linear    | 1.5 K  

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Epoch 0: 100%|██████████| 1053/1053 [09:14<00:00,  1.90it/s, v_num=0, train_loss=0.144, train_acc=0.952, val_loss=0.209, val_acc=0.917]

Metric val_acc improved. New best score: 0.917

Epoch 2: 100%|██████████| 1053/1053 [09:22<00:00,  1.87it/s, v_num=0, train_loss=0.0108, train_acc=1.000, val_loss=0.234, val_acc=0.919]

Metric val_acc improved by 0.001 >= min_delta = 0.0. New best score: 0.919

Epoch 5: 100%|██████████| 1053/1053 [09:22<00:00,  1.87it/s, v_num=0, train_loss=0.0246, train_acc=1.000, val_loss=0.398, val_acc=0.909]

Monitored metric val_acc did not improve in the last 3 records. Best score: 0.919. Signaling Trainer to stop.

2025-05-16 05:12:52,177 - INFO - Training finished.
2025-05-16 05:12:52,178 - INFO - Best model: /net/nas8/data/home/murakami/nlp-100-knocks/第9章：事前学習済み言語モデル（BERT型）/model/epoch=2-step=3159.ckpt
2025-05-16 05:12:52,179 - INFO - Validation accuracy of the best model: 0.9185779690742493
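
The key step in the max-pooling classifier is masking: padding positions are overwritten with a very negative value before the maximum is taken, so only real tokens can contribute. The sketch below shows the same idea for one sentence in plain Python. The function `masked_max` and the toy values are illustrative assumptions, not the notebook's tensors.

```python
def masked_max(hidden_states, attention_mask, fill=-1e9):
    """Per-dimension maximum over the token vectors, ignoring masked positions."""
    dim = len(hidden_states[0])
    pooled = [fill] * dim
    for vector, keep in zip(hidden_states, attention_mask):
        for d in range(dim):
            value = vector[d] if keep else fill  # padding can never be the maximum
            if value > pooled[d]:
                pooled[d] = value
    return pooled


# Hypothetical 2-dimensional hidden states; the last position is padding.
hidden_states = [[0.5, -2.0], [-1.0, -0.5], [3.0, 3.0]]
attention_mask = [1, 1, 0]

print(masked_max(hidden_states, attention_mask))  # [0.5, -0.5]
```

Filling with 0 instead would be wrong whenever every real value in a dimension is negative, as in the second dimension here, which is why a large negative constant is used.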
