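# Chapter 9: Pretrained Language Models (BERT-style)

In this chapter, we use a BERT-style pretrained model to predict masked words, compute sentence vectors, and build a sentiment analyzer (a positive/negative classifier).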

In [1]:
import os
from dotenv import load_dotenv
import torch

dotenv_path = './.env'
load_dotenv(dotenv_path)
HF_TOKEN = os.getenv('HF_TOKEN')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## 80. Tokenization

Split the sentence "The movie was full of incomprehensibilities." into tokens and display the token sequence.

In [2]:
from transformers import BertTokenizer

text = 'The movie was full of incomprehensibilities.'

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

print(tokenizer.tokenize(text))
print(tokenizer(text)['input_ids'])

['the', 'movie', 'was', 'full', 'of', 'inc', '##omp', '##re', '##hen', '##si', '##bilities', '.']
[101, 1996, 3185, 2001, 2440, 1997, 4297, 25377, 2890, 10222, 5332, 14680, 1012, 102]
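The split of "incomprehensibilities" into `inc`, `##omp`, `##re`, ... comes from WordPiece's greedy longest-match-first lookup against the vocabulary. The following is a minimal sketch of that lookup for a single word, not the tokenizer's actual implementation. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are assumptions made for illustration; the real `bert-base-uncased` vocabulary is far larger.

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for this example.
toy_vocab = {'the', 'movie', 'inc', '##omp', '##re', '##hen', '##si', '##bilities', '##s'}

print(wordpiece_tokenize('incomprehensibilities', toy_vocab))
print(wordpiece_tokenize('movie', toy_vocab))
```

Note that `##si` is chosen over the shorter `##s` because the longest matching piece always wins.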


## 81. Mask Prediction

Find the most suitable token to fill the "[MASK]" in "The movie was full of [MASK]."

In [37]:
from transformers import pipeline

text = 'The movie was full of [MASK].'

unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker(text)[0]


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


{'score': 0.10711903125047684,
 'token': 4569,
 'token_str': 'fun',
 'sequence': 'the movie was full of fun.'}

## 82. Top-k Mask Prediction

Find the top 10 tokens most suitable to fill the "[MASK]" in "The movie was full of [MASK].", along with their probabilities (likelihoods).

In [38]:
unmasker(text, top_k=10)

[{'score': 0.10711903125047684,
  'token': 4569,
  'token_str': 'fun',
  'sequence': 'the movie was full of fun.'},
 {'score': 0.06634484976530075,
  'token': 20096,
  'token_str': 'surprises',
  'sequence': 'the movie was full of surprises.'},
 {'score': 0.04468414559960365,
  'token': 3689,
  'token_str': 'drama',
  'sequence': 'the movie was full of drama.'},
 {'score': 0.027217138558626175,
  'token': 3340,
  'token_str': 'stars',
  'sequence': 'the movie was full of stars.'},
 {'score': 0.025412822142243385,
  'token': 11680,
  'token_str': 'laughs',
  'sequence': 'the movie was full of laughs.'},
 {'score': 0.01951691508293152,
  'token': 2895,
  'token_str': 'action',
  'sequence': 'the movie was full of action.'},
 {'score': 0.01903809793293476,
  'token': 8277,
  'token_str': 'excitement',
  'sequence': 'the movie was full of excitement.'},
 {'score': 0.01829029619693756,
  'token': 2111,
  'token_str': 'people',
  'sequence': 'the movie was full of people.'},
 {'score': 0.015
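The `score` values returned by the pipeline are softmax probabilities over the vocabulary at the `[MASK]` position. As a rough sketch of that step, the snippet below turns logits into probabilities and keeps the `k` largest. The helper `top_k_probs`, the five-word `toy_vocab`, and the logit values are all hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def top_k_probs(logits, k):
    """Return (indices, probabilities) of the k most likely entries."""
    shifted = logits - np.max(logits)        # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    order = np.argsort(-probs)[:k]           # indices sorted by descending probability
    return order, probs[order]

# Hypothetical logits at the [MASK] position for a five-word toy vocabulary.
toy_vocab = ['fun', 'surprises', 'drama', 'stars', 'people']
toy_logits = np.array([2.0, 1.5, 1.1, 0.6, 0.2])

indices, probs = top_k_probs(toy_logits, k=3)
for i, p in zip(indices, probs):
    print(f'{toy_vocab[i]}: {p:.4f}')
```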

## 83. Sentence Vectors from the [CLS] Token

For every pair of the following sentences, compute the cosine similarity using the final-layer embedding vector of the [CLS] token.

- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."


In [48]:
from transformers import BertTokenizer, BertModel

texts = [
    'The movie was full of fun.',
    'The movie was full of excitement.',
    'The movie was full of crap.',
    'The movie was full of rubbish.'
]

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

encoded = tokenizer(texts, return_tensors='pt')
# last_hidden_state: feature vector for each token, shape (batch_size, seq_len, hidden_dim)
# pooler_output: the [CLS] vector passed through a linear layer and Tanh, shape (batch_size, hidden_dim)

model.to(device)
encoded.to(device)
outputs = model(**encoded)
CLS_hiddens = outputs.last_hidden_state[:, 0, :] # (batch_size, hidden_dim)
CLS_hiddens.shape

torch.Size([4, 768])


In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

CLS_hiddens = CLS_hiddens.cpu().detach().numpy()

cos_sims = cosine_similarity(CLS_hiddens)

In [58]:
import itertools

pairs = list(itertools.combinations(range(len(texts)), 2))
for i, j in pairs:
    print(f"Similarity between:\n  \"{texts[i]}\"\n  \"{texts[j]}\"\n  => {cos_sims[i][j]:.4f}\n")

Similarity between:
  "The movie was full of fun."
  "The movie was full of excitement."
  => 0.9881

Similarity between:
  "The movie was full of fun."
  "The movie was full of crap."
  => 0.9558

Similarity between:
  "The movie was full of fun."
  "The movie was full of rubbish."
  => 0.9475

Similarity between:
  "The movie was full of excitement."
  "The movie was full of crap."
  => 0.9541

Similarity between:
  "The movie was full of excitement."
  "The movie was full of rubbish."
  => 0.9487

Similarity between:
  "The movie was full of crap."
  "The movie was full of rubbish."
  => 0.9807
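For reference, the cosine similarity between two vectors is their dot product divided by the product of their lengths. The sketch below computes the full pairwise matrix with NumPy. The helper name `cosine_similarity_matrix` and the three toy vectors are assumptions for illustration and stand in for the 768-dimensional [CLS] vectors.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms               # scale every row to unit length
    return unit @ unit.T                 # dot products of unit vectors

# Hypothetical 3-dimensional "sentence vectors" for illustration only.
toy_vectors = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],   # same direction as row 0, different length
    [0.0, 3.0, 0.0],   # orthogonal to rows 0 and 1
])

sims = cosine_similarity_matrix(toy_vectors)
print(np.round(sims, 4))
```

Since every [CLS] similarity reported above is at least 0.94, the relative ordering of the pairs is more informative than the absolute values.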



In [None]:
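# Also try comparing using only the last word

## 84. Sentence Vectors by Averaging

For every pair of the following sentences, compute the cosine similarity using the mean of the final-layer embedding vectors.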

- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."

In [62]:
from transformers import BertTokenizer, BertModel

texts = [
    'The movie was full of fun.',
    'The movie was full of excitement.',
    'The movie was full of crap.',
    'The movie was full of rubbish.'
]

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

encoded = tokenizer(texts, return_tensors='pt')
# last_hidden_state: feature vector for each token, shape (batch_size, seq_len, hidden_dim)
# pooler_output: the [CLS] vector passed through a linear layer and Tanh, shape (batch_size, hidden_dim)

model.to(device)
encoded.to(device)
outputs = model(**encoded)
hiddens = outputs.last_hidden_state # (batch_size, seq_len, hidden_dim)
hiddens_mean = torch.mean(hiddens, dim=1) # (batch_size, hidden_dim)

In [63]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

hiddens_mean = hiddens_mean.cpu().detach().numpy()

cos_sims = cosine_similarity(hiddens_mean)

In [64]:
import itertools

pairs = list(itertools.combinations(range(len(texts)), 2))
for i, j in pairs:
    print(f"Similarity between:\n  \"{texts[i]}\"\n  \"{texts[j]}\"\n  => {cos_sims[i][j]:.4f}\n")

Similarity between:
  "The movie was full of fun."
  "The movie was full of excitement."
  => 0.9568

Similarity between:
  "The movie was full of fun."
  "The movie was full of crap."
  => 0.8490

Similarity between:
  "The movie was full of fun."
  "The movie was full of rubbish."
  => 0.8169

Similarity between:
  "The movie was full of excitement."
  "The movie was full of crap."
  => 0.8352

Similarity between:
  "The movie was full of excitement."
  "The movie was full of rubbish."
  => 0.7938

Similarity between:
  "The movie was full of crap."
  "The movie was full of rubbish."
  => 0.9226
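The plain `torch.mean` over the sequence dimension works here because the four sentences are tokenized without padding, so every position holds a real token. With a padded batch, padding positions would be averaged in as well. A minimal sketch of a mask-aware mean, written in NumPy with hypothetical toy values (`masked_mean`, `toy_hidden`, and `toy_mask` are illustrative names), could look like this; the same arithmetic should carry over to `last_hidden_state` and `attention_mask` in PyTorch.

```python
import numpy as np

def masked_mean(hidden, attention_mask):
    """Average token vectors over real tokens only.

    hidden:         (batch_size, seq_len, hidden_dim)
    attention_mask: (batch_size, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)  # (batch_size, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)                     # padding contributes zero
    counts = mask.sum(axis=1)                                # number of real tokens per row
    return summed / counts

# Hypothetical batch: 2 sentences, 3 positions, 2 hidden dimensions.
toy_hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],   # the last position is padding
])
toy_mask = np.array([[1, 1, 1], [1, 1, 0]])

print(masked_mean(toy_hidden, toy_mask))
```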



In [None]:
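# Measure similarity using the mean or concatenation of the embeddings at both ends

## 85. Preparing the Dataset

From the [Stanford Sentiment Treebank (SST)](https://dl.fbaipublicfiles.com/glue/data/SST-2.zip) distributed with the [General Language Understanding Evaluation (GLUE)](https://gluebenchmark.com/) benchmark, load the text and polarity labels of the training set (train.tsv) and the development set (dev.tsv), and convert all of the text into token sequences.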

In [2]:
import pandas as pd

train_path = './data/SST-2/train.tsv'
dev_path = './data/SST-2/dev.tsv'

train_df = pd.read_csv(train_path, sep='\t')
dev_df = pd.read_csv(dev_path, sep='\t')
train_df

|  | sentence | label |
|---|---|---|
| 0 | hide new secretions from the parental units | 0 |
| 1 | contains no wit , only labored gags | 0 |
| 2 | that loves its characters and communicates som... | 1 |
| 3 | remains utterly satisfied to remain the same t... | 0 |
| 4 | on the worst revenge-of-the-nerds clichés the ... | 0 |
| ... | ... | ... |
| 67344 | a delightful comedy | 1 |
| 67345 | anguish , anger and frustration | 0 |
| 67346 | at achieving the modest , crowd-pleasing goals... | 1 |
| 67347 | a patient viewer | 1 |


In [3]:
from datasets import Dataset

train_dataset = Dataset.from_pandas(train_df)
dev_dataset = Dataset.from_pandas(dev_df)

In [4]:
from transformers import DataCollatorWithPadding, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(example):
    return tokenizer(example['sentence'], truncation=True)

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_dev_dataset = dev_dataset.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/67349 [00:00<?, ? examples/s]

Map:   0%|          | 0/872 [00:00<?, ? examples/s]

In [5]:
tokenized_train_dataset = tokenized_train_dataset.remove_columns(["sentence"])
tokenized_train_dataset = tokenized_train_dataset.rename_column("label", "labels")
tokenized_train_dataset.set_format("torch")

tokenized_dev_dataset = tokenized_dev_dataset.remove_columns(["sentence"])
tokenized_dev_dataset = tokenized_dev_dataset.rename_column("label", "labels")
tokenized_dev_dataset.set_format("torch")

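## 86. Building Mini-batches

For a portion of the training data loaded in problem 85 (for example, the first 4 examples), apply padding and related processing so that the token sequences have equal length, and build a mini-batch.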

In [6]:
from torch.utils.data import DataLoader

batch_size = 8

train_dataloader = DataLoader(
    tokenized_train_dataset, shuffle=True, batch_size=batch_size, collate_fn=data_collator
)

dev_dataloader = DataLoader(
    tokenized_dev_dataset, batch_size=batch_size, collate_fn=data_collator
)
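`DataCollatorWithPadding` pads each batch to the length of its longest sequence and produces a matching attention mask. The sketch below reproduces that idea in plain Python for four examples. The helper `pad_batch` and the token-id sequences are hypothetical, and the padding id of 0 is an assumption that matches the `[PAD]` token of `bert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Four hypothetical token-id sequences of different lengths
# (101 and 102 are the [CLS] and [SEP] ids seen in problem 80).
toy_sequences = [
    [101, 7, 8, 9, 102],
    [101, 7, 102],
    [101, 7, 8, 9, 10, 11, 102],
    [101, 102],
]

batch = pad_batch(toy_sequences)
for ids, mask in zip(batch['input_ids'], batch['attention_mask']):
    print(ids, mask)
```

With the objects defined in this chapter, wrapping `tokenized_train_dataset.select(range(4))` in a `DataLoader` with `batch_size=4` and `collate_fn=data_collator` should yield the four-example mini-batch that the problem statement asks for.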

## 87. Fine-tuning

Using the training set, fine-tune the pretrained model for the polarity analysis task. Measure the accuracy of the fine-tuned model on the validation set.

In [9]:
from transformers import BertForSequenceClassification, get_scheduler
from torch.optim import AdamW
from accelerate import Accelerator
import torch

num_labels = 2

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)
model.to(device)

lr = 5e-5

optimizer = AdamW(model.parameters(), lr=lr)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name='linear', optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

accelerator = Accelerator()

train_dataloader, dev_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, dev_dataloader, model, optimizer
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
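The `'linear'` schedule with `num_warmup_steps=0` lowers the learning rate from its initial value to zero in a straight line over all training steps. The following is a small sketch of that rule under the usual linear-warmup-then-decay shape. The function `linear_lr` is an illustrative stand-in and not part of `transformers`; the step count assumes the 67,349 SST-2 training examples in batches of 8.

```python
import math

def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < num_warmup_steps:
        # Ramp up from 0 to base_lr during warmup.
        return base_lr * step / max(1, num_warmup_steps)
    # Then decay linearly so the rate reaches 0 at the final step.
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# Settings from this chapter: lr = 5e-5, batch size 8, 3 epochs.
steps_per_epoch = math.ceil(67349 / 8)
total_steps = 3 * steps_per_epoch

for step in (0, steps_per_epoch, 2 * steps_per_epoch, total_steps):
    print(step, linear_lr(step, 5e-5, 0, total_steps))
```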


In [None]:
from tqdm import tqdm
import torch
import numpy as np
import evaluate

metric = evaluate.load('accuracy')
progress_bar_train = tqdm(range(num_training_steps))

for epoch in range(num_epochs):
    model.train()
    total_train_loss = 0
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss

        total_train_loss += loss.item()

        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar_train.update(1)
        progress_bar_train.set_postfix({"Epoch": epoch + 1, "Loss": loss.item()})

    avg_train_loss = total_train_loss / len(train_dataloader)
    print(f'Epoch {epoch+1}/{num_epochs} - Average Training Loss: {avg_train_loss:.4f}')

    model.eval()
    total_eval_loss = 0
    all_predictions = []
    all_labels = []

    progress_bar_eval = tqdm(dev_dataloader, desc=f'Evaluating Epoch {epoch+1}')
    for batch in progress_bar_eval:
        with torch.no_grad():
            outputs = model(**batch)
        
        loss = outputs.loss
        total_eval_loss += loss.item()

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)

        predictions = accelerator.gather(predictions)
        labels = accelerator.gather(batch['labels'])

        all_predictions.append(predictions.cpu().numpy())
        all_labels.append(labels.cpu().numpy())
        progress_bar_eval.set_postfix({'Eval Loss': loss.item()})

    avg_eval_loss = total_eval_loss / len(dev_dataloader)

    flat_predictions = np.concatenate(all_predictions)
    flat_labels = np.concatenate(all_labels)

    eval_metric = metric.compute(predictions=flat_predictions, references=flat_labels)

    print(f'Epoch {epoch+1}/{num_epochs} - Validation Loss: {avg_eval_loss:.4f} - Validation Accuracy: {eval_metric["accuracy"]:.4f}')



Epoch 1/3 - Average Training Loss: 0.2195
Epoch 1/3 - Validation Loss: 0.2771 - Validation Accuracy: 0.8956
Epoch 2/3 - Average Training Loss: 0.1050
Epoch 2/3 - Validation Loss: 0.3198 - Validation Accuracy: 0.9071
100%|██████████| 25257/25257 [24:21<00:00, 18.84it/s, Epoch=3, Loss=0.00137]
Epoch 3/3 - Average Training Loss: 0.0509
Epoch 3/3 - Validation Loss: 0.2825 - Validation Accuracy: 0.9094
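The accuracy reported by `evaluate` is simply the fraction of predictions that match the reference labels. A minimal sketch of the same calculation follows, using hypothetical per-batch arrays that are concatenated in the same way as in the evaluation loop; the helper name `accuracy` is illustrative.

```python
import numpy as np

def accuracy(predictions, references):
    """Fraction of positions where the prediction equals the reference label."""
    predictions = np.asarray(predictions)
    references = np.asarray(references)
    return float((predictions == references).mean())

# Hypothetical per-batch outputs, collected batch by batch during evaluation.
batch_predictions = [np.array([1, 0, 1, 1]), np.array([0, 0, 1])]
batch_labels = [np.array([1, 0, 0, 1]), np.array([0, 1, 1])]

flat_predictions = np.concatenate(batch_predictions)
flat_labels = np.concatenate(batch_labels)
print(accuracy(flat_predictions, flat_labels))
```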


## 88. Polarity Analysis

Using the model fine-tuned in problem 87, predict the polarity of the following sentences.

- "The movie was full of incomprehensibilities."
- "The movie was full of fun."
- "The movie was full of excitement."
- "The movie was full of crap."
- "The movie was full of rubbish."


In [12]:
texts = [
    'The movie was full of incomprehensibilities.',
    'The movie was full of fun.',
    'The movie was full of excitement.',
    'The movie was full of crap.',
    'The movie was full of rubbish.'
]

encoded = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

model.to(device)
encoded.to(device)
outputs = model(**encoded)
logits = outputs.logits
labels = logits.argmax(-1)
labels


tensor([0, 1, 1, 0, 0], device='cuda:0')
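In SST-2, label 0 stands for negative and label 1 for positive. As a small illustration, the snippet below pairs the predicted labels printed above with their sentences. The `LABEL_NAMES` mapping and the variable names are introduced here only for readability.

```python
LABEL_NAMES = {0: 'negative', 1: 'positive'}  # SST-2 label convention

texts = [
    'The movie was full of incomprehensibilities.',
    'The movie was full of fun.',
    'The movie was full of excitement.',
    'The movie was full of crap.',
    'The movie was full of rubbish.',
]
predicted = [0, 1, 1, 0, 0]  # values of the tensor printed above

named = [(text, LABEL_NAMES[label]) for text, label in zip(texts, predicted)]
for text, name in named:
    print(f'{name:8s} {text}')
```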

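## 89. Changing the Architecture

Design a classification model with a different architecture from problem 87 (for example, one that uses the [CLS] token, or max pooling over the tokens) and fine-tune the pretrained model for the polarity analysis task. Measure the accuracy of the fine-tuned model on the validation set.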

In [7]:
import torch.nn as nn
from transformers import BertModel
from transformers.modeling_outputs import SequenceClassifierOutput
import torch

checkpoint = 'bert-base-uncased'

class BertForSequenceClassificationMaxPool(nn.Module):
    def __init__(self, num_labels=2, dropout_prob=0.1):
        super(BertForSequenceClassificationMaxPool, self).__init__()
        self.num_labels = num_labels
        self.bert = BertModel.from_pretrained(checkpoint)
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)
    
    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None, # (batch_size)
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):
        return_dict = return_dict if return_dict is not None else self.bert.config.use_return_dict

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict
        )

        # Index access works for both ModelOutput and the plain tuple returned when return_dict=False
        last_hidden_state = outputs[0] # (batch_size, sequence_length, hidden_size)

        if attention_mask is not None: # (batch_size, sequence_length)
            extended_attention_mask = attention_mask.unsqueeze(-1) # (batch_size, sequence_length, 1)

            masked_hidden_state = last_hidden_state.masked_fill(extended_attention_mask == 0, -1e9) # (batch_size, sequence_length, hidden_size)
        else:
            masked_hidden_state = last_hidden_state
        
        pooled_output = torch.max(masked_hidden_state, dim=1)[0] # (batch_size, hidden_size)
        # torch.max along a dimension returns a (values, indices) pair

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output) # (batch_size, num_labels)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
        
        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output
        
        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
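The `masked_fill(..., -1e9)` step in the class above keeps padding positions from ever being selected by the max. The NumPy sketch below shows the effect on a tiny hypothetical batch, where the padded position would otherwise win for the second sentence. The names `masked_max`, `toy_hidden`, and `toy_mask` are illustrative.

```python
import numpy as np

def masked_max(hidden, attention_mask, fill_value=-1e9):
    """Max-pool over the sequence dimension while ignoring padding positions."""
    mask = attention_mask[:, :, None].astype(bool)   # (batch_size, seq_len, 1)
    masked = np.where(mask, hidden, fill_value)      # padding can never win the max
    return masked.max(axis=1)                        # (batch_size, hidden_dim)

# Hypothetical batch: 2 sentences, 3 positions, 2 hidden dimensions.
toy_hidden = np.array([
    [[0.5, -1.0], [-0.2, 0.3], [0.9, 0.9]],
    [[-0.5, -1.0], [-0.2, -0.3], [0.0, 0.0]],   # the last position is padding
])
toy_mask = np.array([[1, 1, 1], [1, 1, 0]])

print(masked_max(toy_hidden, toy_mask))   # padding ignored
print(toy_hidden.max(axis=1))             # unmasked max, shown for comparison
```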

In [None]:
from transformers import get_scheduler
from torch.optim import AdamW
from accelerate import Accelerator
import torch

num_labels = 2

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = BertForSequenceClassificationMaxPool(num_labels=num_labels)
model.to(device)

lr = 5e-5

optimizer = AdamW(model.parameters(), lr=lr)

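# Cutting training off at an arbitrary number of epochs is frowned upon
# As a rule, leave the epoch count effectively unbounded and use the checkpoint with the lowest validation loss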
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name='linear', optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

accelerator = Accelerator()

train_dataloader, dev_dataloader, model, optimizer = accelerator.prepare(
    train_dataloader, dev_dataloader, model, optimizer
)

In [None]:
from tqdm import tqdm
import torch
import numpy as np
import evaluate

metric = evaluate.load('accuracy')
progress_bar_train = tqdm(range(num_training_steps))

for epoch in range(num_epochs):
    model.train()
    total_train_loss = 0
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss

        total_train_loss += loss.item()

        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar_train.update(1)
        progress_bar_train.set_postfix({"Epoch": epoch + 1, "Loss": loss.item()})

    avg_train_loss = total_train_loss / len(train_dataloader)
    print(f'Epoch {epoch+1}/{num_epochs} - Average Training Loss: {avg_train_loss:.4f}')

    model.eval()
    total_eval_loss = 0
    all_predictions = []
    all_labels = []

    progress_bar_eval = tqdm(dev_dataloader, desc=f'Evaluating Epoch {epoch+1}')
    for batch in progress_bar_eval:
        with torch.no_grad():
            outputs = model(**batch)
        
        loss = outputs.loss
        total_eval_loss += loss.item()

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)

        predictions = accelerator.gather(predictions)
        labels = accelerator.gather(batch['labels'])

        all_predictions.append(predictions.cpu().numpy())
        all_labels.append(labels.cpu().numpy())
        progress_bar_eval.set_postfix({'Eval Loss': loss.item()})

    avg_eval_loss = total_eval_loss / len(dev_dataloader)

    flat_predictions = np.concatenate(all_predictions)
    flat_labels = np.concatenate(all_labels)

    eval_metric = metric.compute(predictions=flat_predictions, references=flat_labels)

    print(f'Epoch {epoch+1}/{num_epochs} - Validation Loss: {avg_eval_loss:.4f} - Validation Accuracy: {eval_metric["accuracy"]:.4f}')



Epoch 1/3 - Average Training Loss: 0.2145
Epoch 1/3 - Validation Loss: 0.2379 - Validation Accuracy: 0.8979
Epoch 2/3 - Average Training Loss: 0.1028
Epoch 2/3 - Validation Loss: 0.2387 - Validation Accuracy: 0.9151
100%|██████████| 25257/25257 [24:29<00:00, 17.21it/s, Epoch=3, Loss=0.0177]
Epoch 3/3 - Average Training Loss: 0.0486
Epoch 3/3 - Validation Loss: 0.2645 - Validation Accuracy: 0.9037
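The comment in the setup cell suggests choosing the checkpoint with the lowest validation loss instead of stopping at a fixed epoch. One common way to do that is patience-based early stopping, sketched below with the validation losses reported for this run. The function `best_epoch_with_patience` and the `patience` value are assumptions for illustration; the notebook itself does not save checkpoints.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float('inf')
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)

# Validation losses reported above for the max-pooling model.
val_losses = [0.2379, 0.2387, 0.2645]
print(best_epoch_with_patience(val_losses, patience=2))
```

By validation loss the first epoch is the best checkpoint in this run, even though validation accuracy peaked at the second epoch (0.9151), so the selection criterion matters.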



In [10]:
texts = [
    'The movie was full of incomprehensibilities.',
    'The movie was full of fun.',
    'The movie was full of excitement.',
    'The movie was full of crap.',
    'The movie was full of rubbish.'
]

encoded = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

model.to(device)
encoded.to(device)
outputs = model(**encoded)
logits = outputs.logits
labels = logits.argmax(-1)
labels


tensor([0, 1, 1, 0, 0], device='cuda:0')