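# Chapter 9: Pre-trained Language Models (BERT-style)
This chapter uses a BERT-style pre-trained model to predict masked words, compute sentence vectors, and build a sentiment analyzer (a positive/negative classifier).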

## 80. Tokenization
Split the sentence “The movie was full of incomprehensibilities.” into tokens and print the token sequence.

In [2]:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "The movie was full of incomprehensibilities."

# Tokenize
tokens = tokenizer.tokenize(text)

print(tokens)

['the', 'movie', 'was', 'full', 'of', 'inc', '##omp', '##re', '##hen', '##si', '##bilities', '.']
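
The `##` pieces in the output above come from WordPiece, which splits an out-of-vocabulary word by repeatedly taking the longest prefix found in the vocabulary. The following is a minimal sketch of that greedy matching, not the library's implementation: `wordpiece` and `toy_vocab` are hypothetical names, and the toy vocabulary was chosen by hand so that the split matches the one printed above. The real tokenizer also lowercases and splits on punctuation first.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matched, so the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not the real BERT vocabulary).
toy_vocab = {"the", "movie", "in", "inc", "##omp", "##re", "##hen",
             "##si", "##bilities", "##s"}

print(wordpiece("incomprehensibilities", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Because the longest candidate is tried first, `inc` wins over `in` and `##si` wins over `##s` even though both shorter pieces are in the toy vocabulary.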


## 81. Mask prediction
Find the most suitable token to fill “[MASK]” in “The movie was full of [MASK].”

In [12]:
from transformers import BertForMaskedLM
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')    # Classifier to predict token for [MASK]
model.eval()

text = "The movie was full of [MASK]."

inputs = tokenizer(text, return_tensors='pt')
# {
#   'input_ids': tensor([[ 101, 1996, 3185, 2001, 2440, 1997,  103, 1012,  102]]), 
#   'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
#   'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])
# }

# Get the index of `[MASK]` in the text
mask_token_idx = torch.where(inputs['input_ids'] == tokenizer.mask_token_id)[1] # (row_indices, col_indices)[1]

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits # shape: (batch_size, seq_len, vocab_size)

mask_logits = logits[0, mask_token_idx, :]  # shape: (num_masks, vocab_size) = (1, 30522)

max_logit, max_idx = torch.max(mask_logits, dim=1)

best_token_idx = max_idx.item()
best_token = tokenizer.decode([best_token_idx])

print(f"Predicted token: {best_token}")
print(f"Logit value: {max_logit.item():.4f}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Predicted token: fun
Logit value: 9.2889


## 82. Top-k mask prediction
Find the 10 most suitable tokens to fill “[MASK]” in “The movie was full of [MASK].”, together with their probabilities (likelihoods).

In [19]:
import torch.nn.functional as F

probs = F.softmax(mask_logits, dim=1)

# Top-10
topk = torch.topk(probs, k=10)
topk_indices = topk.indices[0].tolist() # topk.indices: (num_masks, k) = (1, 10)
topk_probs = topk.values[0].tolist()    # topk.values: (num_masks, k) = (1, 10)

for idx, prob in zip(topk_indices, topk_probs):
    token = tokenizer.decode([idx])
    print(f"{token:>12}    {prob * 100:.3f}%")

         fun    10.712%
   surprises    6.634%
       drama    4.468%
       stars    2.722%
      laughs    2.541%
      action    1.952%
  excitement    1.904%
      people    1.829%
     tension    1.503%
       music    1.465%
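
The probabilities above are the softmax of the logits at the `[MASK]` position, so the ranking by probability is the same as the ranking by logit. As a sketch of what `F.softmax` followed by `torch.topk` computes, here are the same two steps written with only the standard library. The function names and the five toy logits are assumptions for illustration, not values from BERT.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                          # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return (index, probability) pairs for the k largest probabilities."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in order[:k]]


# Hypothetical logits for a 5-token vocabulary.
toy_logits = [2.0, 0.5, 3.0, -1.0, 0.0]
toy_probs = softmax(toy_logits)

for idx, p in top_k(toy_probs, k=3):
    print(f"token {idx}: {p * 100:.3f}%")
```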


## 83. Sentence vectors from the [CLS] token
For every pair of the following sentences, compute the cosine similarity using the final-layer embedding of the [CLS] token.

“The movie was full of fun.”

“The movie was full of excitement.”

“The movie was full of crap.”

“The movie was full of rubbish.”

In [28]:
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

# 1. Prepare model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# 2. Prepare sentences
texts = [
    "The movie was full of fun.",
    "The movie was full of excitement.",
    "The movie was full of crap.",
    "The movie was full of rubbish."
]
len_texts = len(texts)

# 3. Extract the final-layer [CLS] embedding of each sentence
cls_embeddings = []
with torch.no_grad():
    for text in texts:
        # Encode
        inputs = tokenizer(text, return_tensors='pt')
        # Feed the model with inputs
        outputs = model(**inputs)
        # last_hidden_state: (batch_size, seq_len, hidden_size) (1, seq_len, 768)
        cls_vec = outputs.last_hidden_state[:, 0, :]    # shape: (batch_size, hidden_size)   (1, 768)
        cls_embeddings.append(cls_vec)

cls_embeddings = torch.vstack(cls_embeddings)   # shape: (4, 768)

# 4. Compute the cosine similarities for all combinations
for i in range(len_texts):
    for j in range(i + 1, len_texts):
        v1 = cls_embeddings[i]
        v2 = cls_embeddings[j]
        sim = F.cosine_similarity(v1, v2, dim=0).item()
        print(f"\"{texts[i]}\"  ←→  \"{texts[j]}\"   :  cosine-sim = {sim:.4f}")


"The movie was full of fun."  ←→  "The movie was full of excitement."   :  cosine-sim = 0.9881
"The movie was full of fun."  ←→  "The movie was full of crap."   :  cosine-sim = 0.9558
"The movie was full of fun."  ←→  "The movie was full of rubbish."   :  cosine-sim = 0.9475
"The movie was full of excitement."  ←→  "The movie was full of crap."   :  cosine-sim = 0.9541
"The movie was full of excitement."  ←→  "The movie was full of rubbish."   :  cosine-sim = 0.9487
"The movie was full of crap."  ←→  "The movie was full of rubbish."   :  cosine-sim = 0.9807

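The nested loop in the cell above calls `F.cosine_similarity` once per pair. An equivalent way to get every pair at once is to scale each row to unit length and take a matrix product. The sketch below does this on small hypothetical vectors standing in for the `(4, 768)` matrix of [CLS] embeddings; `pairwise_cosine` is an illustrative name.

```python
import torch
import torch.nn.functional as F


def pairwise_cosine(embeddings):
    """Cosine similarity between every pair of rows; input shape (n, d)."""
    unit = F.normalize(embeddings, p=2, dim=1)  # scale each row to unit length
    return unit @ unit.T                        # (n, n) matrix of dot products


# Hypothetical 3-dimensional vectors (an assumption, not model output).
toy = torch.tensor([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as row 0
    [0.0, 1.0, 0.0],   # orthogonal to row 0
    [1.0, 1.0, 0.0],
])
sim = pairwise_cosine(toy)
print(sim)
```

In the notebook, `pairwise_cosine(cls_embeddings)[i, j]` should agree with the value printed for sentences `i` and `j`, up to floating-point error.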

## 84. Sentence vectors by averaging
For every pair of the following sentences, compute the cosine similarity using the average of the final-layer embedding vectors.

“The movie was full of fun.”

“The movie was full of excitement.”

“The movie was full of crap.”

“The movie was full of rubbish.”

In [31]:
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

# 1. Prepare model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

# 2. Prepare sentences
texts = [
    "The movie was full of fun.",
    "The movie was full of excitement.",
    "The movie was full of crap.",
    "The movie was full of rubbish."
]
len_texts = len(texts)

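# 3. Average the final-layer token embeddings of each sentence
#    (includes [CLS] and [SEP]; there is no padding because each sentence is encoded alone)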
avg_embeddings = []
with torch.no_grad():
    for text in texts:
        # Encode
        inputs = tokenizer(text, return_tensors='pt')
        # Feed the model with inputs
        outputs = model(**inputs)
        # last_hidden_state: (batch_size, seq_len, hidden_size) (1, seq_len, 768)
        avg_vec = torch.mean(outputs.last_hidden_state, dim=1)    # shape: (batch_size, hidden_size)   (1, 768)
        avg_embeddings.append(avg_vec)

avg_embeddings = torch.vstack(avg_embeddings)   # shape: (4, 768)

# 4. Compute the cosine similarities for all combinations
for i in range(len_texts):
    for j in range(i + 1, len_texts):
        v1 = avg_embeddings[i]
        v2 = avg_embeddings[j]
        sim = F.cosine_similarity(v1, v2, dim=0).item()
        print(f"\"{texts[i]}\"  ←→  \"{texts[j]}\"   :  cosine-sim = {sim:.4f}")


"The movie was full of fun."  ←→  "The movie was full of excitement."   :  cosine-sim = 0.9568
"The movie was full of fun."  ←→  "The movie was full of crap."   :  cosine-sim = 0.8490
"The movie was full of fun."  ←→  "The movie was full of rubbish."   :  cosine-sim = 0.8169
"The movie was full of excitement."  ←→  "The movie was full of crap."   :  cosine-sim = 0.8352
"The movie was full of excitement."  ←→  "The movie was full of rubbish."   :  cosine-sim = 0.7938
"The movie was full of crap."  ←→  "The movie was full of rubbish."   :  cosine-sim = 0.9226

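Taking `torch.mean` over `dim=1` is safe in the cell above only because each sentence is encoded on its own, so no padding positions exist. If several sentences were encoded together with padding, the average would need to skip the padded positions. The following is a minimal sketch of such a masked mean on hypothetical tensors; `masked_mean` is an illustrative name.

```python
import torch


def masked_mean(last_hidden_state, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


# Hypothetical batch: 2 sequences, 3 positions, hidden size 2.
hidden = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
])
mask = torch.tensor([[1, 1, 0],
                     [1, 1, 1]])
pooled = masked_mean(hidden, mask)
print(pooled)
```

The large values at the padded position do not affect the first row's average, which is the property a plain `torch.mean` would lose.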

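## 85. Preparing the dataset
Load the texts and polarity labels of the training set (train.tsv) and the development set (dev.tsv) of the Stanford Sentiment Treebank (SST), distributed with the General Language Understanding Evaluation (GLUE) benchmark, and convert every text to a token sequence.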

In [50]:
import torch
import pandas as pd
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, DataCollatorWithPadding

class SSTDataset(Dataset):
    def __init__(self, df, tokenizer):
        self.texts = df['sentence'].tolist()
        self.labels = [int(x) for x in df['label'].tolist()]
        self.tokenizer = tokenizer
    
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encodings = self.tokenizer(
            text,
            add_special_tokens=True,
            truncation=True,
            max_length=128,
            return_attention_mask=True,
        )
        return {
            "input_ids": torch.tensor(encodings["input_ids"]),
            "attention_mask": torch.tensor(encodings["attention_mask"]),    # All ones here; padding is added later by the collator
            "labels": torch.tensor(label)
        }

    def __len__(self):
        return len(self.texts)
    

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

train_df = pd.read_csv("train.tsv", sep="\t")
valid_df = pd.read_csv("dev.tsv",   sep="\t")

train_dataset = SSTDataset(train_df, tokenizer)
valid_dataset = SSTDataset(valid_df, tokenizer)

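## 86. Building mini-batches
For part of the training data loaded in problem 85 (for example, the first four examples), apply padding and any other required processing so that the token sequences have the same length, and form a mini-batch.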

In [58]:
collate_fn = DataCollatorWithPadding(tokenizer, return_tensors="pt")
train_dl = DataLoader(train_dataset, batch_size=64, shuffle=True, collate_fn=collate_fn)
valid_dl = DataLoader(valid_dataset, batch_size=64, shuffle=False, collate_fn=collate_fn)

print(next(iter(train_dl))['input_ids'])

tensor([[ 101, 2200, 6057,  ...,    0,    0,    0],
        [ 101, 2028, 2062,  ...,    0,    0,    0],
        [ 101, 1011, 1011,  ...,    0,    0,    0],
        ...,
        [ 101, 1996, 2832,  ...,    0,    0,    0],
        [ 101, 2036, 1037,  ...,    0,    0,    0],
        [ 101, 2589, 1010,  ...,    0,    0,    0]])
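
The cell above builds shuffled batches of 64, while the task asks for a mini-batch of the first four examples. As a sketch of what the padding step does, the function below pads four variable-length examples by hand with `pad_sequence`. The name `pad_batch` and the toy token ids are assumptions; the pad id 0 and the ids 101 and 102 for [CLS] and [SEP] match the outputs shown earlier for `bert-base-uncased`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_batch(examples, pad_id=0):
    """Pad variable-length input_ids to the longest sequence in the batch."""
    ids = [torch.tensor(ex["input_ids"]) for ex in examples]
    lengths = torch.tensor([len(x) for x in ids])
    input_ids = pad_sequence(ids, batch_first=True, padding_value=pad_id)
    # 1 for real tokens, 0 for padding positions.
    positions = torch.arange(input_ids.size(1)).unsqueeze(0)
    attention_mask = (positions < lengths.unsqueeze(1)).long()
    labels = torch.tensor([ex["labels"] for ex in examples])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Hypothetical examples standing in for the first four training instances.
toy_examples = [
    {"input_ids": [101, 2200, 6057, 102], "labels": 1},
    {"input_ids": [101, 1996, 3185, 2001, 2440, 102], "labels": 0},
    {"input_ids": [101, 1996, 102], "labels": 1},
    {"input_ids": [101, 3185, 2001, 102], "labels": 0},
]
batch = pad_batch(toy_examples)
print(batch["input_ids"])
print(batch["attention_mask"])
```

In the notebook itself, calling the collator directly as `collate_fn([train_dataset[i] for i in range(4)])` should produce the same kind of four-example batch without a `DataLoader`.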


## 87. Fine-tuning
Fine-tune the pre-trained model for the polarity analysis task using the training set. Measure the accuracy of the fine-tuned model on the validation set.

In [59]:
import torch
from torch.optim import AdamW
from torch.nn.functional import cross_entropy
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup
from sklearn.metrics import accuracy_score

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2
)
model.to(device)

epochs = 3
total_steps = len(train_dl) * epochs

optimizer = AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps = int(0.1 * total_steps),
    num_training_steps = total_steps
)

for epoch in range(1, epochs + 1):
    model.train()
    losses = []

    for batch in train_dl:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)

        loss = outputs.loss
        losses.append(loss.item())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

    avg_train_loss = sum(losses) / len(losses)
    print(f"Epoch {epoch} — Avg Training Loss: {avg_train_loss:.4f}")

    # Evaluation
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for batch in valid_dl:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)

            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1)

            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(batch['labels'].cpu().tolist())
    
    acc = accuracy_score(all_labels, all_preds)
    print(f"Epoch {epoch} — Validation Accuracy: {acc:.4f}\n")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1 — Avg Training Loss: 0.2575
Epoch 1 — Validation Accuracy: 0.9209

Epoch 2 — Avg Training Loss: 0.1133
Epoch 2 — Validation Accuracy: 0.9209

Epoch 3 — Avg Training Loss: 0.0702
Epoch 3 — Validation Accuracy: 0.9220
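
The training loop above calls `scheduler.step()` after every optimizer step, with the first 10% of the steps used as warmup. To my understanding, the linear schedule multiplies the base learning rate by a factor that rises from 0 to 1 during warmup and then falls linearly back to 0. The sketch below writes out that factor; `linear_warmup_factor` is an illustrative name, and `total_steps = 100` is a hypothetical value (the notebook uses `len(train_dl) * epochs`).

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 2e-5
total_steps = 100                       # hypothetical value for illustration
warmup_steps = int(0.1 * total_steps)   # same 10% warmup ratio as the notebook

for step in (0, 5, 10, 55, 100):
    lr = base_lr * linear_warmup_factor(step, warmup_steps, total_steps)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

The peak learning rate of `2e-5` is reached exactly when warmup ends, which is why the first steps of epoch 1 update the newly initialized classifier head only gently.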



## 88. Polarity analysis
Use the model fine-tuned in problem 87 to predict the polarity of the following sentences.

“The movie was full of incomprehensibilities.”

“The movie was full of fun.”

“The movie was full of excitement.”

“The movie was full of crap.”

“The movie was full of rubbish.”

In [66]:
texts = [
    "The movie was full of incomprehensibilities.",
    "The movie was full of fun.",
    "The movie was full of excitement.",
    "The movie was full of crap.",
    "The movie was full of rubbish."
]

label_map = {0: "Negative", 1: "Positive"}  # SST-2 labels: 0 = negative, 1 = positive

model.eval()
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors='pt').to(device)
        outputs = model(**inputs)
        logits = outputs.logits[0]  # shape: (2,)

        pred_id = torch.argmax(logits).item()

        print(f"\"{text}\"")
        print(f"  → Predicted: {label_map[pred_id]}\n")

"The movie was full of incomprehensibilities."
  → Predicted: Negative

"The movie was full of fun."
  → Predicted: Positive

"The movie was full of excitement."
  → Predicted: Positive

"The movie was full of crap."
  → Predicted: Negative

"The movie was full of rubbish."
  → Predicted: Negative



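## 89. Changing the architecture
Design a classification model with an architecture different from the one in problem 87 (for example, one that uses the [CLS] token, or max pooling over the tokens), and fine-tune the pre-trained model for the polarity analysis task. Measure the accuracy of the fine-tuned model on the validation set.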

In [75]:
'''
Classify from the average of the last hidden states of the pre-trained BERT model,
which is fine-tuned end to end together with the linear head.
'''

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

class MyBertForBinaryClassification(nn.Module):
    def __init__(self):
        super().__init__()
        # Load the pre-trained BERT encoder
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # The hidden size sets the input width of the fully connected layer
        hidden_size = self.bert.config.hidden_size
        # Linear head for binary classification
        self.fc = nn.Linear(hidden_size, 2)

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
        h = outputs.last_hidden_state   # [batch_size, seq_len, hidden_size]
        # Drop the first position ([CLS]) and the last position of the padded batch.
        # Note: the last position is [SEP] only for the longest sequence in the batch;
        # shorter sequences still include their own [SEP] in the average.
        h = h[:, 1:-1, :]   # [batch_size, seq_len - 2, hidden_size]

        mask = attention_mask[:, 1:-1].unsqueeze(-1) # [batch_size, seq_len - 2, 1] for broadcasting
        vec = (h * mask).sum(1) / mask.sum(1)   # [batch_size, hidden_size]
        logits = self.fc(vec)
        # Compute the loss only when labels are supplied, so the model can also be used for inference
        loss = F.cross_entropy(logits, labels) if labels is not None else None
        return {'loss': loss, 'logits': logits}


model_89 = MyBertForBinaryClassification().to(device)

epochs = 3
total_steps = len(train_dl) * epochs
optimizer = AdamW(model_89.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps = int(0.1 * total_steps),
    num_training_steps = total_steps
)

for epoch in range(1, epochs + 1):
    model_89.train()
    losses = []

    for batch in train_dl:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model_89(**batch)
        loss = outputs['loss']

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

        losses.append(loss.item())

    avg_train_loss = sum(losses) / len(losses)
    print(f"Epoch {epoch} — Avg Training Loss: {avg_train_loss:.4f}")

    # Evaluation
    model_89.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for batch in valid_dl:
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model_89(**batch)

            logits = outputs['logits']
            preds = torch.argmax(logits, dim=-1)

            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(batch['labels'].cpu().tolist())
    
    acc = accuracy_score(all_labels, all_preds)
    print(f"Epoch {epoch} — Validation Accuracy: {acc:.4f}\n")


Epoch 1 — Avg Training Loss: 0.2492
Epoch 1 — Validation Accuracy: 0.9209

Epoch 2 — Avg Training Loss: 0.1083
Epoch 2 — Validation Accuracy: 0.9174

Epoch 3 — Avg Training Loss: 0.0679
Epoch 3 — Validation Accuracy: 0.9163
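
Slicing `[:, 1:-1]` in the model above removes [SEP] only from the longest sequence of each padded batch. One way to exclude both special tokens from every row is to build the pooling mask from the attention mask instead. The sketch below assumes each row has the layout `[CLS] tokens [SEP]` followed by right padding; `content_mask` is a hypothetical helper, not part of the notebook, and the reported accuracies were not obtained with it.

```python
import torch


def content_mask(attention_mask):
    """Mask that is 1 only for real tokens other than [CLS] and [SEP].

    Assumes right padding, with [CLS] first and [SEP] as the last real token.
    """
    mask = attention_mask.clone()
    mask[:, 0] = 0                               # [CLS] is always at position 0
    last = attention_mask.sum(dim=1) - 1         # index of each row's [SEP]
    mask[torch.arange(mask.size(0)), last] = 0   # zero out [SEP] row by row
    return mask


# Hypothetical attention masks: one full-length row and one padded row.
attention_mask = torch.tensor([[1, 1, 1, 1, 1],
                               [1, 1, 1, 0, 0]])
print(content_mask(attention_mask))
```

Inside `forward`, this mask could replace the sliced one, applied to the unsliced `last_hidden_state` as `content_mask(attention_mask).unsqueeze(-1)`.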

