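### Chapter 9: Pre-trained Language Models (BERT-type)

In this chapter, we use a BERT-type pre-trained model to predict masked words, compute sentence vectors, and build a sentiment analyzer (a positive/negative classifier).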

In [1]:
"""
80. Tokenization
Split the sentence "The movie was full of incomprehensibilities." into tokens and display the token sequence.
"""
from transformers import AutoTokenizer

model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "The movie was full of incomprehensibilities."
tokens = tokenizer.tokenize(text)

print(tokens)



['the', 'movie', 'was', 'full', 'of', 'inc', '##omp', '##re', '##hen', '##si', '##bilities', '.']
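
The split of "incomprehensibilities" above is consistent with WordPiece's greedy longest-match-first lookup. The following is a minimal sketch of that lookup in plain Python, not the library's implementation. The function name `wordpiece_tokenize` and the vocabulary `toy_vocab` are hypothetical: the vocabulary contains only the pieces seen in the output above, whereas the real model has a vocabulary of roughly 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercase word into WordPiece tokens by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Try the longest remaining substring first and shrink until the vocabulary matches.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matched, so the whole word becomes unknown
        pieces.append(current)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption for illustration only).
toy_vocab = {"inc", "##omp", "##re", "##hen", "##si", "##bilities", "full", "of"}

print(wordpiece_tokenize("incomprehensibilities", toy_vocab))
print(wordpiece_tokenize("full", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A word that cannot be covered by vocabulary pieces collapses to a single `[UNK]`, which is why rare words are usually broken into several `##` pieces rather than dropped.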


In [2]:
"""
81. Mask prediction
Find the token that best fills the [MASK] in the sentence "The movie was full of [MASK]."
"""
from transformers import pipeline
from pprint import pprint

pipe = pipeline("fill-mask", model=model_name, device=0)  # device=0 runs on the first CUDA GPU
text = "The movie was full of [MASK]."
out_ten = pipe(text, top_k=10)
print("[Top 10 tokens for [MASK] and their probabilities]")
for i, out in enumerate(out_ten, 1):
    token = out["token_str"]
    score = out["score"]
    print(f"{i:2d}. {token:<15}  probability: {score:.6f}")

Some weights of the model checkpoint at google-bert/bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


[Top 10 tokens for [MASK] and their probabilities]
 1. fun              probability: 0.107119
 2. surprises        probability: 0.066345
 3. drama            probability: 0.044684
 4. stars            probability: 0.027217
 5. laughs           probability: 0.025413
 6. action           probability: 0.019517
 7. excitement       probability: 0.019038
 8. people           probability: 0.018290
 9. tension          probability: 0.015031
10. music            probability: 0.014646
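
The scores in this ranking are softmax probabilities over the vocabulary at the masked position. A minimal sketch of that step follows, written in plain Python. The five-word vocabulary `toy_tokens` and the values in `toy_logits` are made up for illustration and are not taken from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(tokens, logits, k):
    """Return the k (token, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical logits at the [MASK] position (assumed values).
toy_tokens = ["fun", "surprises", "drama", "the", "of"]
toy_logits = [4.0, 3.5, 3.1, -1.0, -2.0]

for rank, (token, prob) in enumerate(top_k(toy_tokens, toy_logits, k=3), 1):
    print(f"{rank}. {token:<10} probability: {prob:.4f}")
```

Because the probabilities are normalized over the whole vocabulary, even the best candidate can have a modest score, as with the 0.107 for "fun" above.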


In [3]:
"""
83. Sentence vectors from the [CLS] token
For every pair of the sentences below, compute the cosine similarity using the final-layer embedding vector of the [CLS] token.

“The movie was full of fun.”

“The movie was full of excitement.”

“The movie was full of crap.”

“The movie was full of rubbish.”
"""
from torch.nn.functional import cosine_similarity

sim_enc_pipeline = pipeline(model=model_name, task="feature-extraction")
texts = ["The movie was full of fun.",
         "The movie was full of excitement.",
         "The movie was full of crap.",
         "The movie was full of rubbish."]
# With return_tensors=True the pipeline returns a tensor of shape (1, seq_len, hidden_size);
# [0][0] selects the vector at position 0, which is the [CLS] token.
text_vecs = [sim_enc_pipeline(text, return_tensors=True)[0][0] for text in texts]

print("Cosine similarity")
for i in range(3):
    for j in range(i + 1, 4):
        print(f"Sentence {i+1} : {texts[i]}")
        print(f"Sentence {j+1} : {texts[j]}")
        sim_pair_score = cosine_similarity(text_vecs[i], text_vecs[j], dim=0)
        print(f"Similarity : {sim_pair_score.item()}")



Device set to use cuda:0


Cosine similarity
Sentence 1 : The movie was full of fun.
Sentence 2 : The movie was full of excitement.
Similarity : 0.9880610108375549
Sentence 1 : The movie was full of fun.
Sentence 3 : The movie was full of crap.
Similarity : 0.955765962600708
Sentence 1 : The movie was full of fun.
Sentence 4 : The movie was full of rubbish.
Similarity : 0.9475324749946594
Sentence 2 : The movie was full of excitement.
Sentence 3 : The movie was full of crap.
Similarity : 0.9541274905204773
Sentence 2 : The movie was full of excitement.
Sentence 4 : The movie was full of rubbish.
Similarity : 0.9486637115478516
Sentence 3 : The movie was full of crap.
Sentence 4 : The movie was full of rubbish.
Similarity : 0.9806932210922241
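
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The sketch below spells this out in plain Python. The three-dimensional vectors are invented for illustration; they show that two vectors sharing a large common component score close to 1 even when their differences are orthogonal. This may be one reason why every pair above scores over 0.94, though that is only a possible reading.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical vectors: a shared component [5, 5, 5] plus a small distinct part.
vec_a = [6.0, 5.0, 5.0]
vec_b = [5.0, 6.0, 5.0]

print(cosine(vec_a, vec_b))            # close to 1 because of the shared component
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```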


In [4]:
"""
84. Sentence vectors by averaging
For every pair of the sentences below, compute the cosine similarity using the mean of the final-layer embedding vectors.

“The movie was full of fun.”

“The movie was full of excitement.”

“The movie was full of crap.”

“The movie was full of rubbish.”
"""
from torch.nn.functional import cosine_similarity

sim_enc_pipeline = pipeline(model=model_name, task="feature-extraction")
texts = ["The movie was full of fun.",
         "The movie was full of excitement.",
         "The movie was full of crap.",
         "The movie was full of rubbish."]
text_vecs = []

for text in texts:
    vec = sim_enc_pipeline(text, return_tensors=True)[0]  # shape (seq_len, hidden_size)
    # Average over all token positions, including [CLS] and [SEP].
    mean_vec = vec.mean(dim=0)
    text_vecs.append(mean_vec)

print("Cosine similarity")
for i in range(3):
    for j in range(i + 1, 4):
        print(f"Sentence {i+1} : {texts[i]}")
        print(f"Sentence {j+1} : {texts[j]}")
        sim_pair_score = cosine_similarity(text_vecs[i], text_vecs[j], dim=0)
        print(f"Similarity : {sim_pair_score.item()}")


Device set to use cuda:0


Cosine similarity
Sentence 1 : The movie was full of fun.
Sentence 2 : The movie was full of excitement.
Similarity : 0.956811249256134
Sentence 1 : The movie was full of fun.
Sentence 3 : The movie was full of crap.
Similarity : 0.8489991426467896
Sentence 1 : The movie was full of fun.
Sentence 4 : The movie was full of rubbish.
Similarity : 0.8168841600418091
Sentence 2 : The movie was full of excitement.
Sentence 3 : The movie was full of crap.
Similarity : 0.8351833820343018
Sentence 2 : The movie was full of excitement.
Sentence 4 : The movie was full of rubbish.
Similarity : 0.7938442230224609
Sentence 3 : The movie was full of crap.
Sentence 4 : The movie was full of rubbish.
Similarity : 0.9225536584854126
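
The cell above encodes one sentence at a time, so no padding is involved. If several sentences were encoded as one padded batch, the padding positions would need to be left out of the average. A minimal sketch of such a masked mean follows, in plain Python. The function name `mean_pool` and the toy vectors are hypothetical.

```python
def mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]


# Hypothetical 2-dimensional token vectors; the last position is padding.
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
toy_mask = [1, 1, 0]

print(mean_pool(toy_vectors, toy_mask))   # padding excluded
print(mean_pool(toy_vectors, [1, 1, 1]))  # padding included, which pulls the mean toward zero
```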


In [5]:
"""
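85. Preparing the dataset
From the Stanford Sentiment Treebank (SST) distributed with the General Language Understanding Evaluation (GLUE) benchmark,
load the text and polarity labels of the training set (train.tsv) and the development set (dev.tsv), and convert every text into a token sequence.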
"""
import pandas as pd
import torch
from pprint import pprint
from transformers import AutoTokenizer


train_df = pd.read_csv("SST-2/train.tsv", sep="\t")
dev_df = pd.read_csv("SST-2/dev.tsv", sep="\t")

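def text_to_token(df, tokenizer):
    """Return one dict per row holding the raw text, its label tensor, and its tokens."""
    dct_lst = []
    for _, row in df.iterrows():
        sentence = row["sentence"]
        # Store the label as a float tensor of shape (1,).
        label = torch.tensor([float(row["label"])])

        # tokenize() returns WordPiece strings without [CLS]/[SEP].
        tokens = tokenizer.tokenize(sentence)

        dct_lst.append({
            'text': sentence,
            'label': label,
            'tokens': tokens,
        })

    return dct_lst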

model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_token_list  = text_to_token(train_df, tokenizer)
dev_token_list = text_to_token(dev_df, tokenizer)
pprint(train_token_list[0])

{'label': tensor([0.]),
 'text': 'hide new secretions from the parental units ',
 'tokens': ['hide',
            'new',
            'secret',
            '##ions',
            'from',
            'the',
            'parental',
            'units']}


In [6]:
"""
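86. Building a mini-batch
For part of the training data loaded in problem 85 (for example, the first four examples), apply padding and related processing so that the token sequences have equal length, and build a mini-batch.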
"""
import torch
from torch.utils.data import DataLoader

def padding(token_list, tokenizer, max_length):
    """Add fixed-length input_ids and attention_mask tensors to every example (in place)."""
    pad_token_id = tokenizer.pad_token_id
    for dct in token_list:
        # Note: the token lists from problem 85 contain no [CLS]/[SEP], and none are added here.
        input_ids = tokenizer.convert_tokens_to_ids(dct['tokens'])

        # Truncate to max_length, then right-pad with [PAD].
        padded = input_ids[:max_length]
        padded += [pad_token_id] * (max_length - len(padded))
        # 1 for real tokens, 0 for padding.
        attention_mask = [1 if token_id != pad_token_id else 0 for token_id in padded]

        dct["input_ids"] = torch.tensor(padded)
        dct["attention_mask"] = torch.tensor(attention_mask)
    return token_list

max_length = 64
train_padded_token_list = padding(train_token_list, tokenizer, max_length=max_length)
dev_padded_token_list = padding(dev_token_list, tokenizer, max_length=max_length)


from torch.utils.data import Dataset
class tokenDataset(Dataset):
    def __init__(self, data_list):
        self.data = data_list

    def __getitem__(self, idx):
        item = self.data[idx]
        return {
            'input_ids': item['input_ids'],
            'attention_mask': item['attention_mask'],
            'labels': item['label']
        }

    def __len__(self):
        return len(self.data)
    
train_dataset = tokenDataset(train_padded_token_list)
dev_dataset = tokenDataset(dev_padded_token_list)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
# Accuracy does not depend on example order, so the dev set is not shuffled.
dev_loader = DataLoader(dev_dataset, batch_size=256, shuffle=False)
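
The problem statement asks for a mini-batch built from a few examples, such as the first four. The sketch below shows one way to collate such a batch in plain Python, and it differs from the cell above in two respects: it pads to the longest sequence in the batch instead of a fixed 64, and it wraps each sequence in [CLS] and [SEP]. The ids 101, 102 and 0 are the [CLS], [SEP] and [PAD] ids of bert-base-uncased. The function name `collate` and the word ids in `toy_sequences` are hypothetical.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased


def collate(sequences, pad_id=PAD_ID):
    """Wrap each id sequence in [CLS]/[SEP] and pad to the longest one in the batch."""
    wrapped = [[CLS_ID] + seq + [SEP_ID] for seq in sequences]
    longest = max(len(seq) for seq in wrapped)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in wrapped]
    # Build the mask from the lengths rather than by comparing ids with pad_id.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in wrapped]
    return input_ids, attention_mask


# Hypothetical token ids for four short examples (assumed values).
toy_sequences = [[7, 8, 9], [7, 8], [5], [6, 7, 8, 9, 10]]

batch_ids, batch_mask = collate(toy_sequences)
for ids, mask in zip(batch_ids, batch_mask):
    print(ids, mask)
```

Padding to the longest sequence in each batch keeps the tensors as small as the batch allows, which can reduce computation compared with a fixed length.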


In [7]:
"""
87. Fine-tuning
Using the training set, fine-tune the pre-trained model for the polarity analysis task. Measure the accuracy of the fine-tuned model on the validation set.
"""
import torch
from torch import nn
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.optim import AdamW
from sklearn.metrics import accuracy_score
from tqdm import tqdm

def train(model, train_loader, optimizer, loss_fn, device):
    model.train()
    total_loss = 0

    for batch in tqdm(train_loader, desc="Training"):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].squeeze(-1).float().to(device)  # (batch, 1) -> (batch,)

        pred = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = pred.logits.squeeze(-1)  # squeeze(-1) keeps the batch dimension even for a batch of one

        loss = loss_fn(logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)
    return avg_loss

def evaluate(model, dev_loader, device):
    model.eval()
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch in tqdm(dev_loader, desc="Evaluating"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].squeeze(-1).float().to(device)  # (batch, 1) -> (batch,)

            pred = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = pred.logits.squeeze(-1)  # squeeze(-1) keeps the batch dimension even for a batch of one
            preds = torch.sigmoid(logits) > 0.5

            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    acc = accuracy_score(all_labels, all_preds)
    return acc

model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.BCEWithLogitsLoss()

max_epochs = 3


for epoch in range(max_epochs):
    print(f"=================== {epoch+1} / {max_epochs} epoch ===================")
    
    train_loss = train(model, train_loader, optimizer, loss_fn, device)
    print(f"Train Loss: {train_loss:.4f}")

    val_acc = evaluate(model, dev_loader, device)
    print(f"Validation Accuracy: {val_acc:.4f}")

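import os

save_path = 'model/model87.pth'
os.makedirs(os.path.dirname(save_path), exist_ok=True)  # torch.save does not create missing directories
torch.save(model.state_dict(), save_path)
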

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




Training: 100%|██████████| 2105/2105 [06:11<00:00,  5.67it/s]


Train Loss: 0.2922


Evaluating: 100%|██████████| 4/4 [00:01<00:00,  2.43it/s]


Validation Accuracy: 0.9163


Training: 100%|██████████| 2105/2105 [06:35<00:00,  5.32it/s]


Train Loss: 0.1515


Evaluating: 100%|██████████| 4/4 [00:01<00:00,  2.32it/s]


Validation Accuracy: 0.9060


Training: 100%|██████████| 2105/2105 [06:41<00:00,  5.25it/s]


Train Loss: 0.1048


Evaluating: 100%|██████████| 4/4 [00:01<00:00,  2.33it/s]


Validation Accuracy: 0.9117
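
The training loop uses a single output logit with `nn.BCEWithLogitsLoss`, and evaluation thresholds `sigmoid(logit)` at 0.5. As a rough illustration of what that loss computes for one example, here is a plain-Python sketch of the numerically stable form. The function names are hypothetical, and the real loss additionally averages over the batch by default.

```python
import math


def bce_with_logits(logit, label):
    """Binary cross-entropy for one logit x and a 0/1 label y.

    Equivalent to -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))],
    rewritten so that exp() never receives a large positive argument.
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))


for x, y in [(2.0, 1.0), (2.0, 0.0), (0.0, 1.0)]:
    print(f"logit={x:+.1f} label={y:.0f} loss={bce_with_logits(x, y):.4f}")

# Thresholding the probability at 0.5 is the same as checking the sign of the logit.
for x in [-3.0, -0.1, 0.1, 3.0]:
    print(x, sigmoid(x) > 0.5, x > 0)
```

A confident correct prediction (logit 2, label 1) yields a small loss, while the same logit with label 0 is penalized heavily.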


In [8]:
"""
88. Polarity analysis
Using the model fine-tuned in problem 87, predict the polarity of the following sentences.

“The movie was full of incomprehensibilities.”

“The movie was full of fun.”

“The movie was full of excitement.”

“The movie was full of crap.”

“The movie was full of rubbish.”
"""
def analyze(text, tokenizer, model, device):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    model.eval()
    with torch.no_grad():
        pred = model(**inputs)
        logits = pred.logits.squeeze() 
        pred_label = torch.sigmoid(logits) > 0.5
        print(f"text : {text}, polarity : {'positive' if pred_label else 'negative'}")

texts = ["The movie was full of incomprehensibilities.",
         "The movie was full of fun.",
         "The movie was full of excitement",
         "The movie was full of crap.",
         "The movie was full of rubbish."]


model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

load_path = 'model/model87.pth'
model.load_state_dict(torch.load(load_path, map_location=torch.device('cpu')))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for text in texts:
    analyze(text, tokenizer, model, device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


text : The movie was full of incomprehensibilities., polarity : negative
text : The movie was full of fun., polarity : positive
text : The movie was full of excitement, polarity : positive
text : The movie was full of crap., polarity : negative
text : The movie was full of rubbish., polarity : negative


In [9]:
"""
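89. Changing the architecture
Design a classification model with a different architecture from problem 87 (for example, using the [CLS] token, or max pooling over the tokens),
and fine-tune the pre-trained model for the polarity analysis task. Measure the accuracy of the fine-tuned model on the validation set.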
"""
import torch
from torch import nn
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from torch.optim import AdamW
from sklearn.metrics import accuracy_score
from tqdm import tqdm
from transformers import AutoModel

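class CustomBERTModel(nn.Module):
    def __init__(self, model_name, num_labels=1):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        # Read the hidden size from the model config (768 for bert-base-uncased).
        self.linear = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state  # final-layer output, shape (batch, seq_len, hidden)
        # Max pooling over the sequence dimension; padding positions are not masked out here.
        pooled = torch.max(hidden_states, dim=1).values
        logits = self.linear(pooled)  # linear layer maps the pooled vector to a single score
        return logits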
    
def train(model, train_loader, optimizer, loss_fn, device):
    model.train()
    total_loss = 0
    for batch in tqdm(train_loader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].squeeze(-1).float().to(device)  # (batch, 1) -> (batch,)

        pred = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = pred.squeeze(-1)  # squeeze(-1) keeps the batch dimension even for a batch of one
        
        loss = loss_fn(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)
    return avg_loss


def evaluate(model, dev_loader, device):
    model.eval()
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch in tqdm(dev_loader):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].squeeze(-1).float().to(device)  # (batch, 1) -> (batch,)

            pred = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = pred.squeeze(-1)  # squeeze(-1) keeps the batch dimension even for a batch of one
            preds = torch.sigmoid(logits) > 0.5

            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    acc = accuracy_score(all_labels, all_preds)
    return acc

model_name = "google-bert/bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = CustomBERTModel(model_name, num_labels=1)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.BCEWithLogitsLoss()

max_epochs = 3
for epoch in range(max_epochs):
    print(f"=================== {epoch+1} / {max_epochs} epoch ===================")
    
    train_loss = train(model, train_loader, optimizer, loss_fn, device)
    print(f"Train Loss: {train_loss:.4f}")

    val_acc = evaluate(model, dev_loader, device)
    print(f"Validation Accuracy: {val_acc:.4f}")

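import os

save_path = 'model/model89.pth'
os.makedirs(os.path.dirname(save_path), exist_ok=True)  # torch.save does not create missing directories
torch.save(model.state_dict(), save_path)
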



100%|██████████| 2105/2105 [07:13<00:00,  4.86it/s]


Train Loss: 0.2743


100%|██████████| 4/4 [00:01<00:00,  2.32it/s]


Validation Accuracy: 0.9128


100%|██████████| 2105/2105 [07:14<00:00,  4.84it/s]


Train Loss: 0.1445


100%|██████████| 4/4 [00:01<00:00,  2.31it/s]


Validation Accuracy: 0.9186


100%|██████████| 2105/2105 [07:15<00:00,  4.83it/s]


Train Loss: 0.0988


100%|██████████| 4/4 [00:01<00:00,  2.33it/s]


Validation Accuracy: 0.9186
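
The max pooling in this cell runs over all 64 positions, including padding. A variant worth considering restricts the maximum to real tokens. The sketch below shows the idea in plain Python, with a hypothetical function name `masked_max_pool` and invented toy vectors. It is not the method used for the results above, and whether it changes accuracy was not measured here.

```python
def masked_max_pool(token_vectors, attention_mask):
    """Element-wise max over the token vectors whose mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    # zip(*kept) iterates over hidden dimensions, so max() runs per dimension.
    return [max(column) for column in zip(*kept)]


# Hypothetical 2-dimensional token vectors; the last position is padding
# and happens to hold the largest values.
toy_vectors = [[-1.0, 0.5], [-2.0, 0.1], [3.0, 9.0]]
toy_mask = [1, 1, 0]

print(masked_max_pool(toy_vectors, toy_mask))   # ignores the padding position
print(masked_max_pool(toy_vectors, [1, 1, 1]))  # padding position dominates
```

In PyTorch the same effect is commonly obtained by filling the masked positions with a very negative value before calling `torch.max`.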
