## __Word Fill-in-the-Blank__

- Use a pretrained model (a Japanese BERT) to predict masked words

### __Setup__

In [1]:
!pip install transformers[ja] | tail -n 1

Successfully installed fugashi-1.3.0 huggingface-hub-0.17.3 ipadic-1.0.0 plac-1.4.0 rhoknp-1.3.0 safetensors-0.4.0 sudachidict-core-20230927 sudachipy-0.6.7 tokenizers-0.14.1 transformers-4.34.1 unidic-1.1.0 unidic-lite-1.0.8 wasabi-0.10.1


In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

In [2]:
model_name = "cl-tohoku/bert-base-japanese-whole-word-masking"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing BertForMaskedLM: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

This warning is expected here: the unused weights belong to the pooler and the next-sentence-prediction head, neither of which is part of `BertForMaskedLM`.


### __Inference__

In [6]:
# Write [MASK] wherever a word should be predicted

tokenizer.mask_token

'[MASK]'

In [8]:
texts = [
    "日本の首都は[MASK]です",      # "The capital of Japan is [MASK]."
    "アメリカの首都は[MASK]です"   # "The capital of America is [MASK]."
]

# Tokenize (padding=True also allows batching sentences of different lengths)
inputs = tokenizer(texts, return_tensors="pt", padding=True)
inputs

{'input_ids': tensor([[   2,   91,    5, 2676,    9,    4, 2992,    3],
        [   2,  286,    5, 2676,    9,    4, 2992,    3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1]])}
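
The `input_ids` above can be read by hand: ID 4 sits at index 5 of each row, which is where `[MASK]` was written. The following is a minimal sketch in plain Python, with the IDs copied from the output above. Reading 2 and 3 as the `[CLS]` and `[SEP]` markers is an assumption based on the usual BERT input layout, not something the output itself states.

```python
# IDs copied from the tokenizer output above.
input_ids = [
    [2, 91, 5, 2676, 9, 4, 2992, 3],
    [2, 286, 5, 2676, 9, 4, 2992, 3],
]

MASK_ID = 4  # ID found where [MASK] was written
CLS_ID = 2   # assumption: sentence-start marker
SEP_ID = 3   # assumption: sentence-end marker

# Position of the mask token in each sentence.
mask_positions = [row.index(MASK_ID) for row in input_ids]
print(mask_positions)  # [5, 5]

# Every row starts and ends with the same two special IDs.
print(all(row[0] == CLS_ID and row[-1] == SEP_ID for row in input_ids))  # True
```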

In [9]:
# Inference (no gradients are needed)
with torch.no_grad():
    logits = model(**inputs).logits

# Shape: batch size × number of tokens × vocabulary size
logits.shape

torch.Size([2, 8, 32000])
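
The logits are unnormalized scores. Applying softmax over the last dimension turns each token position into a probability distribution over the 32000 vocabulary entries. This sketch uses a random tensor (`fake_logits`, a hypothetical stand-in) with the same shape as the output above, so it runs without downloading the model.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for model(**inputs).logits, same shape as above.
fake_logits = torch.randn(2, 8, 32000)

# Normalize over the vocabulary dimension.
probs = torch.softmax(fake_logits, dim=-1)

# The shape is unchanged, and each position now sums to (approximately) 1.
row_sums = probs.sum(dim=-1)
print(probs.shape)
print(torch.allclose(row_sums, torch.ones(2, 8), atol=1e-4))
```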

In [14]:
# Boolean mask that is True at the [MASK] positions

mask_index = (inputs["input_ids"] == tokenizer.mask_token_id)
mask_index

tensor([[False, False, False, False, False,  True, False, False],
        [False, False, False, False, False,  True, False, False]])

In [15]:
# Expand the mask to the same shape as logits

mask_index = mask_index.unsqueeze(-1).expand_as(logits)
mask_index.shape

torch.Size([2, 8, 32000])

In [21]:
# Get the ID of the highest-scoring token for each [MASK]
# (the view() call assumes exactly one [MASK] per sentence)

predicted_token_ids = logits[mask_index].view(logits.size(0), -1).argmax(dim=-1)
predicted_token_ids

tensor([ 391, 1724])
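
One possible alternative is to index `logits` with the 2-D boolean mask directly. This selects one row of vocabulary scores per `[MASK]`, so it needs neither the `expand_as` step nor the one-mask-per-sentence assumption. The sketch below uses small synthetic tensors: the vocabulary size of 10 and the hand-set scores are made up for illustration.

```python
import torch

MASK_ID = 4
input_ids = torch.tensor([
    [2, 91, 5, 2676, 9, MASK_ID, 2992, 3],
    [2, 286, 5, 2676, 9, MASK_ID, 2992, 3],
])

vocab_size = 10  # tiny, made-up vocabulary size for illustration
logits = torch.zeros(2, 8, vocab_size)
logits[0, 5, 7] = 1.0  # make ID 7 the best candidate for sentence 0
logits[1, 5, 3] = 1.0  # make ID 3 the best candidate for sentence 1

mask_2d = input_ids == MASK_ID            # shape [2, 8]
masked_logits = logits[mask_2d]           # shape [number of masks, vocab_size]
predicted = masked_logits.argmax(dim=-1)  # one ID per [MASK]

print(masked_logits.shape)  # torch.Size([2, 10])
print(predicted)            # tensor([7, 3])
```

With several `[MASK]` tokens in one sentence, `predicted` would simply contain one entry per mask, in row-major order.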

In [22]:
# Convert the token IDs back to tokens

tokenizer.convert_ids_to_tokens(predicted_token_ids)

['東京', 'ニューヨーク']
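
The top-1 prediction for the second sentence is ニューヨーク (New York), which is not the actual capital, so it can help to inspect several candidates per `[MASK]` instead of only the best one. Below is a minimal sketch using `torch.topk`. The five-word `vocab` and the scores are made up for illustration and are not model output. With the real model, the IDs would be passed to `tokenizer.convert_ids_to_tokens` instead.

```python
import torch

# Hypothetical tiny vocabulary and made-up scores (not model output).
vocab = ["東京", "大阪", "京都", "ニューヨーク", "ワシントン"]
masked_logits = torch.tensor([
    [5.0, 2.0, 3.0, 0.5, 0.1],  # scores at the [MASK] of sentence 0
    [0.2, 0.1, 0.3, 4.0, 3.5],  # scores at the [MASK] of sentence 1
])

probs = torch.softmax(masked_logits, dim=-1)

# Top 3 candidates per [MASK], highest probability first.
top_probs, top_ids = probs.topk(k=3, dim=-1)
candidates = [[vocab[i] for i in row] for row in top_ids.tolist()]

print(candidates)
# [['東京', '京都', '大阪'], ['ニューヨーク', 'ワシントン', '京都']]
```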