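# Chapter 9: Pretrained Language Models (BERT-style)

In this chapter, we use a BERT-style pretrained model to predict masked words, compute sentence vectors, and build a sentiment analyzer (a positive/negative classifier).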

In [1]:
!pip install transformers



In [32]:
import torch
import numpy as np
import pandas as pd
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm
from transformers import BertModel, BertForMaskedLM, BertTokenizer

In [3]:
model_name = "bert-base-uncased"
model = BertForMaskedLM.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [4]:
#tokenizer info
print(tokenizer.vocab_size)
print(tokenizer.model_max_length)
print(tokenizer.model_input_names)
print(tokenizer.cls_token)
print(tokenizer.sep_token)
print(tokenizer.pad_token)
print(tokenizer.unk_token)
print(tokenizer.mask_token)

30522
512
['input_ids', 'token_type_ids', 'attention_mask']
[CLS]
[SEP]
[PAD]
[UNK]
[MASK]


## 80. Tokenization

Split the sentence "The movie was full of incomprehensibilities." into tokens and display the token sequence.

In [5]:
text_80 = "The movie was full of incomprehensibilities."

tokenized_text_80 = tokenizer(text_80)
print(tokenized_text_80)

{'input_ids': [101, 1996, 3185, 2001, 2440, 1997, 4297, 25377, 2890, 10222, 5332, 14680, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
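
The cell above prints token IDs rather than token strings. With the tokenizer already loaded, `tokenizer.convert_ids_to_tokens(tokenized_text_80["input_ids"])` should recover the strings. To illustrate why one rare word maps to several IDs, here is a minimal sketch of greedy longest-match-first subword splitting, the scheme WordPiece tokenizers use. The function `wordpiece_split` and the vocabulary `toy_vocab` are hypothetical and chosen only for illustration; they are not the real `bert-base-uncased` vocabulary.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into subwords."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not the real BERT vocabulary).
toy_vocab = {"the", "movie", "inc", "##omp", "##re", "##hen", "##si", "##bilities"}

print(wordpiece_split("movie", toy_vocab))
print(wordpiece_split("incomprehensibilities", toy_vocab))
```

A word that is in the vocabulary stays whole, while a long out-of-vocabulary word is broken into a first piece plus `##`-prefixed continuation pieces.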


## 81. Mask Prediction

Find the most suitable token to fill "[MASK]" in "The movie was full of [MASK]."

In [6]:
special_tokens = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']

for token in special_tokens:
    token_id = tokenizer.convert_tokens_to_ids(token)
    print(f"{token}: {token_id}")

[PAD]: 0
[UNK]: 100
[CLS]: 101
[SEP]: 102
[MASK]: 103


In [7]:
text_81 = "The movie was full of [MASK]."

inputs = tokenizer(text_81, return_tensors="pt")

mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
print(inputs)
print(mask_token_index)

with torch.no_grad():
    outputs = model(**inputs)
    print(outputs.logits.shape)  # torch.Size([1, 9, 30522]): [batch_size, num_tokens, vocab_size]
    logits = outputs.logits

mask_logits = logits[0, mask_token_index, :]
print(mask_logits.shape)

top_token_id = torch.argmax(mask_logits, dim=1)  # mask_logits is [1, vocab_size]; dim=1 takes the argmax over the vocabulary in each row
print(f"top_token_id:{top_token_id}")

predicted_token = tokenizer.convert_ids_to_tokens(top_token_id)

print(f"Predicted token: {predicted_token}")


{'input_ids': tensor([[ 101, 1996, 3185, 2001, 2440, 1997,  103, 1012,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]])}
tensor([6])
torch.Size([1, 9, 30522])
torch.Size([1, 30522])
top_token_id:tensor([4569])
Predicted token: ['fun']


## 82. Top-k Mask Prediction

Find the 10 most suitable tokens to fill "[MASK]" in "The movie was full of [MASK].", together with their probabilities (likelihoods).

In [8]:
text_82 = "The movie was full of [MASK]."

inputs = tokenizer(text_82, return_tensors="pt")

mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

with torch.no_grad():
    outputs = model(**inputs)
    print(outputs.logits.shape)  # torch.Size([1, 9, 30522]): [batch_size, num_tokens, vocab_size]
    logits = outputs.logits

mask_logits = logits[0, mask_token_index[0], :] #shape: [vocab_size]

mask_logits_prob = F.softmax(mask_logits, dim=0)
print(mask_logits.shape)
print(mask_logits)
print(mask_logits_prob.shape)
print(mask_logits_prob)

topk = torch.topk(mask_logits_prob, k=10)

for idx, score in zip(topk.indices, topk.values):
  token = tokenizer.convert_ids_to_tokens(idx.item())
  print(f"{token}: {score.item():.2f}")

torch.Size([1, 9, 30522])
torch.Size([30522])
tensor([-3.6503, -3.4497, -3.2653,  ..., -3.2588, -2.6857, -4.3190])
torch.Size([30522])
tensor([2.5729e-07, 3.1443e-07, 3.7810e-07,  ..., 3.8057e-07, 6.7506e-07,
        1.3182e-07])
fun: 0.11
surprises: 0.07
drama: 0.04
stars: 0.03
laughs: 0.03
action: 0.02
excitement: 0.02
people: 0.02
tension: 0.02
music: 0.01


## 83. Sentence Vectors from the CLS Token

For every pair of the following sentences, compute the cosine similarity using the final-layer embedding vector of the [CLS] token.

- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."


In [36]:
def cosine_similarity_matrix(x):
  '''
  arg shape: [batch_size, hidden_size]
  return shape: [batch_size, batch_size]
  '''
  x_norm = x / x.norm(dim=1, keepdim=True)  # L2-normalize each row (along dim=1)

  return torch.matmul(x_norm, x_norm.T)  # [batch_size, hidden_size] @ [hidden_size, batch_size]: pairwise dot products


In [39]:
model = BertModel.from_pretrained(model_name)

sentences = ["The movie was full of fun.",
             "The movie was full of excitement.",
             "The movie was full of crap.",
             "The movie was full of rubbish."]

inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)  # padding=True pads every sequence to the longest one in the batch

with torch.no_grad():
  outputs = model(**inputs)
  last_hidden_state = outputs.last_hidden_state #shape: [batch_size, sequence_length, hidden_size]

print(last_hidden_state.shape)
cls_last_hidden_state = last_hidden_state[:,0,:] #shape [batch_size, hidden_size]
cosine_similarity = cosine_similarity_matrix(cls_last_hidden_state)

df = pd.DataFrame(
    cosine_similarity.numpy(),
    columns=sentences,
    index=sentences,
)
pd.set_option("display.precision", 3)  # display floats with 3 decimal places
df

torch.Size([4, 9, 768])


| | The movie was full of fun. | The movie was full of excitement. | The movie was full of crap. | The movie was full of rubbish. |
|---|---|---|---|---|
| The movie was full of fun. | 1.0 | 0.988 | 0.956 | 0.948 |
| The movie was full of excitement. | 0.988 | 1.0 | 0.954 | 0.949 |
| The movie was full of crap. | 0.956 | 0.954 | 1.0 | 0.981 |
| The movie was full of rubbish. | 0.948 | 0.949 | 0.981 | 1.0 |


## 84. Sentence Vectors by Averaging

For every pair of the following sentences, compute the cosine similarity using the mean of the final-layer embedding vectors.

- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."
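
One possible way to build the averaged sentence vector is masked mean pooling: sum the final-layer token vectors where `attention_mask` is 1 and divide by the number of real tokens, so that padding does not dilute the average. The sketch below uses small hand-written tensors as stand-ins for BERT outputs. The helper name `masked_mean_pool` is hypothetical, and whether [CLS] and [SEP] should be included in the average is a design choice this sketch does not settle.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    last_hidden_state: [batch_size, seq_len, hidden_size]
    attention_mask:    [batch_size, seq_len], 1 for real tokens and 0 for padding
    returns:           [batch_size, hidden_size]
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [B, T, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                   # [B, H]
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts


def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    x_norm = x / x.norm(dim=1, keepdim=True)
    return x_norm @ x_norm.T


# Toy stand-ins for model outputs: 2 sentences, 3 positions, hidden size 2.
hidden = torch.tensor([[[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]],   # last position is padding
                       [[0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

sentence_vectors = masked_mean_pool(hidden, mask)
print(sentence_vectors)
print(cosine_similarity_matrix(sentence_vectors))
```

In the notebook, the same helper could presumably be called as `masked_mean_pool(outputs.last_hidden_state, inputs["attention_mask"])` and the result passed to the `cosine_similarity_matrix` function from problem 83.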

## 85. Preparing the Dataset

From the [Stanford Sentiment Treebank (SST)](https://dl.fbaipublicfiles.com/glue/data/SST-2.zip) distributed with the [General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) benchmark, load the texts and polarity labels of the training set (train.tsv) and the development set (dev.tsv), and convert every text into a token sequence.
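
A minimal sketch of the loading step, assuming each file is tab-separated with a header row containing the columns `sentence` and `label`, and that labels are the integers 0 and 1. The function name `read_sst2` is hypothetical, and the sample rows are made up so the block runs without the dataset.

```python
import csv
import io


def read_sst2(file_obj):
    """Read (sentence, label) pairs from a tab-separated file with a header row."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(file_obj, delimiter="\t", quoting=csv.QUOTE_NONE)
    texts, labels = [], []
    for row in reader:
        texts.append(row["sentence"].strip())
        labels.append(int(row["label"]))
    return texts, labels


# In-memory stand-in for train.tsv; these rows are invented for illustration.
sample_tsv = io.StringIO(
    "sentence\tlabel\n"
    "a fun ride \t1\n"
    "it 's a \"bad\" movie \t0\n"
)

texts, labels = read_sst2(sample_tsv)
print(texts)
print(labels)
```

With the real files, the same function could be applied to a file opened with `open(path, encoding="utf-8", newline="")`, where `path` depends on where the archive was extracted. The token sequences could then be obtained with something like `tokenizer(texts, truncation=True)`, using the tokenizer loaded at the top of the chapter.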

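## 86. Building a Mini-batch

Take part of the training data loaded in problem 85 (for example, the first 4 examples), apply padding and any other necessary processing so that all token sequences have the same length, and assemble them into a mini-batch.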
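
The padding step can be sketched by hand as follows: every ID sequence is right-padded with the [PAD] ID (0, as printed earlier in this chapter) up to the longest length in the batch, and an attention mask records which positions are real tokens. The function name `pad_batch` and the ID sequences are illustrative assumptions. In practice, calling the tokenizer with `padding=True` and `return_tensors="pt"`, as in problem 83, should produce the same kind of batch.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-ID lists to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up ID sequences; 101 and 102 are the [CLS] and [SEP] IDs shown earlier.
batch = pad_batch([[101, 2023, 102], [101, 1037, 2204, 3185, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The nested lists can be converted with `torch.tensor(...)` before being fed to the model.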

## 87. Fine-tuning

Using the training set, fine-tune the pretrained model for the polarity analysis task. Measure the accuracy of the fine-tuned model on the validation (development) set.
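
The overall shape of a fine-tuning loop and an accuracy measurement can be sketched as below. To keep the block small and runnable without downloading anything, `TinyClassifier` is a placeholder (an embedding layer with mean pooling and a linear head), not BERT. In the actual exercise a pretrained encoder, for example `BertForSequenceClassification.from_pretrained(model_name, num_labels=2)`, would take its place. The toy data and the learning rate are assumptions tuned for the placeholder, not for BERT.

```python
import torch
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Placeholder model: embedding + masked mean pooling + linear head."""

    def __init__(self, vocab_size=50, hidden_size=16, num_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size, padding_idx=0)
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.embed(input_ids)                        # [B, T, H]
        mask = attention_mask.unsqueeze(-1).float()           # [B, T, 1]
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1.0)
        return self.head(pooled)                              # [B, num_labels]


def train_epoch(model, batches, optimizer):
    """Run one pass over the batches and return the mean loss."""
    model.train()
    total_loss = 0.0
    for input_ids, attention_mask, labels in batches:
        optimizer.zero_grad()
        logits = model(input_ids, attention_mask)
        loss = nn.functional.cross_entropy(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(batches)


@torch.no_grad()
def accuracy(model, batches):
    """Fraction of examples whose argmax prediction equals the label."""
    model.eval()
    correct = total = 0
    for input_ids, attention_mask, labels in batches:
        preds = model(input_ids, attention_mask).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total


torch.manual_seed(0)
# Toy data: token 1 marks a positive example, token 2 a negative one.
ids = torch.tensor([[1, 3, 0], [2, 3, 4], [1, 4, 4], [2, 4, 0]])
mask = (ids != 0).long()
labels = torch.tensor([1, 0, 1, 0])
batches = [(ids, mask, labels)]

model = TinyClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=0.05)
losses = [train_epoch(model, batches, optimizer) for _ in range(30)]
acc = accuracy(model, batches)
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, accuracy {acc:.2f}")
```

For the real task, `accuracy` would be evaluated on batches built from dev.tsv rather than on the training batches used here.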

## 88. Polarity Analysis

Using the model fine-tuned in problem 87, predict the polarity of the following sentences.

- "The movie was full of incomprehensibilities."
- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."
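
Once a classifier returns logits for these sentences, turning them into polarity labels is a softmax followed by an argmax. The sketch below assumes two classes with label 0 meaning negative and label 1 meaning positive. The helper `logits_to_polarity` is hypothetical, and `toy_logits` are invented numbers, not model outputs.

```python
import torch

LABEL_NAMES = {0: "negative", 1: "positive"}  # assumed label convention


def logits_to_polarity(logits):
    """Map [batch_size, 2] logits to (label name, probability) pairs."""
    probs = torch.softmax(logits, dim=-1)
    confidences, label_ids = probs.max(dim=-1)
    return [(LABEL_NAMES[i.item()], c.item()) for i, c in zip(label_ids, confidences)]


# Invented logits for two sentences, for illustration only.
toy_logits = torch.tensor([[2.0, -1.0], [-0.5, 1.5]])
predictions = logits_to_polarity(toy_logits)
for name, prob in predictions:
    print(f"{name}: {prob:.3f}")
```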


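## 89. Changing the Architecture

Design a classification model with an architecture different from the one in problem 87 (for example, using the [CLS] token, or max pooling over the token vectors), and fine-tune the pretrained model for the polarity analysis task. Measure the accuracy of the fine-tuned model on the validation (development) set.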
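
As one example of an alternative architecture, the max pooling variant could be sketched as a small head placed on top of the encoder's `last_hidden_state`. Padded positions are set to negative infinity before the maximum is taken, so they cannot be selected. The class name `MaxPoolHead`, the dropout rate, and the toy tensors are assumptions for illustration. In the real model, `hidden_size` would be 768 to match the encoder output shape shown in problem 83.

```python
import torch
import torch.nn as nn


class MaxPoolHead(nn.Module):
    """Classify from the element-wise maximum over non-padded token vectors."""

    def __init__(self, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state, attention_mask):
        # Padded positions get -inf so they can never win the max.
        mask = attention_mask.unsqueeze(-1).bool()                    # [B, T, 1]
        masked = last_hidden_state.masked_fill(~mask, float("-inf"))
        pooled = masked.max(dim=1).values                             # [B, H]
        return self.classifier(self.dropout(pooled))                  # [B, num_labels]


torch.manual_seed(0)
head = MaxPoolHead(hidden_size=4)
head.eval()  # disable dropout for a deterministic check

hidden = torch.randn(2, 3, 4)                   # stand-in for encoder outputs
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])     # first sentence has one padded position
logits = head(hidden, mask)
print(logits.shape)
```

Replacing `pooled` with `last_hidden_state[:, 0, :]` would give the [CLS]-based variant mentioned in the problem statement.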