# BERT
This notebook follows [an article](https://qiita.com/kenta1984/items/7f3a5d859a15b20657f3) to try out BERT.
It uses PyTorch.
### Pre-trained model
I use the pre-trained model `bert-base-japanese-whole-word-masking` released by Tohoku University ([GitHub](https://github.com/cl-tohoku/bert-japanese)).

The model is already available through `Transformers>=4.0.0` (references: [link](https://huggingface.co/transformers/pretrained_models.html), [link2](https://github.com/huggingface/transformers)).

#### Requirements
- PyTorch: `conda install pytorch`
- Transformers: `conda install -c huggingface transformers`
- ipywidgets: `conda install -c conda-forge ipywidgets`
- fugashi and ipadic: `pip install fugashi ipadic`

### Reference
- [Next Sentence Prediction](https://heartbeat.fritz.ai/implementing-mobile-bert-for-next-sentence-prediction-a2ae8b804f77)


In [1]:
from transformers import BertJapaneseTokenizer, BertForMaskedLM
import torch

## Test Masked Language Model

In [2]:
input_text = "有意義なイノベーションを通じて、人々の生活を向上させる。"

### Preprocess (Tokenizing)

Load the pre-trained tokenizer for preprocessing.

In [3]:
# load pre-trained tokenizer
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')

Tokenize the input text.

In [4]:
# Tokenize input
tokenized_input = tokenizer.tokenize(input_text)
print("Tokenized Input text: {}".format(tokenized_input))

Tokenized Input text: ['有意', '##義', 'な', 'イノベーション', 'を通じて', '、', '人々', 'の', '生活', 'を', '向上', 'さ', 'せる', '。']


Mask the token `人々` so that the model has to predict it.

In [5]:
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 6

tokenized_input[masked_index] = '[MASK]'
print("Masked Tokenized Input text: {}".format(tokenized_input))
print("Length of List: {}".format(len(tokenized_input)))

Masked Tokenized Input text: ['有意', '##義', 'な', 'イノベーション', 'を通じて', '、', '[MASK]', 'の', '生活', 'を', '向上', 'さ', 'せる', '。']
Length of List: 14
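
The hard-coded `masked_index = 6` only works for this exact sentence. Below is a minimal sketch of looking the position up by token instead. `mask_token` is a hypothetical helper, not part of Transformers, and it assumes the target appears as a single token in the list:

```python
def mask_token(tokens, target, mask="[MASK]"):
    """Return a copy of `tokens` with the first `target` replaced by `mask`, plus its index."""
    index = tokens.index(target)  # raises ValueError if the token is absent
    masked = list(tokens)         # copy so the original list stays untouched
    masked[index] = mask
    return masked, index


tokens = ['有意', '##義', 'な', 'イノベーション', 'を通じて', '、', '人々',
          'の', '生活', 'を', '向上', 'さ', 'せる', '。']
masked_tokens, masked_index = mask_token(tokens, '人々')
print(masked_index)
print(masked_tokens[masked_index])
```

Returning a copy rather than editing in place keeps the unmasked token list around for comparing against the prediction later.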


Map the tokens to their indices in the vocabulary of the pre-trained model.

In [6]:
# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_input)
print("Token Index: {}".format(indexed_tokens))
print("Length of List: {}".format(len(indexed_tokens)))

Token Index: [22949, 28845, 18, 22918, 3016, 6, 4, 5, 1326, 11, 2771, 26, 796, 8]
Length of List: 14
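
Conceptually, `convert_tokens_to_ids` is a dictionary lookup in the model's vocabulary. Here is a toy sketch of that idea. The small `vocab` reuses a few ids visible in the output above, while the `[UNK]` id of 1 and the fallback behaviour are assumptions made for illustration, not the library's actual implementation:

```python
# Toy vocabulary: the ids for '[MASK]', 'の', '、', '。' and 'を' come from the output above;
# '[UNK]': 1 is an assumption made for this illustration.
vocab = {'[UNK]': 1, '[MASK]': 4, 'の': 5, '、': 6, '。': 8, 'を': 11}


def tokens_to_ids(tokens, vocab, unk_token='[UNK]'):
    """Map each token to its id, falling back to the unknown-token id."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


ids = tokens_to_ids(['、', '[MASK]', 'の', '未知語', '。'], vocab)
print(ids)  # [6, 4, 5, 1, 8]
```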


Convert inputs to PyTorch tensors

In [7]:
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])

print("Tensor: {}".format(tokens_tensor))
# len() of a 2-D tensor is the size of its first (batch) dimension, hence 1
print("Length of List: {}".format(len(tokens_tensor)))

Tensor: tensor([[22949, 28845,    18, 22918,  3016,     6,     4,     5,  1326,    11,
          2771,    26,   796,     8]])
Length of List: 1


### Masked Language Model

Load the pre-trained masked language model (MLM).

In [8]:
# Load pre-trained model
model = BertForMaskedLM.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
model.eval()

Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr

Predict the `[MASK]` token.

In [9]:
# Predict
with torch.no_grad():
    outputs = model(tokens_tensor)
    # Pick the top 5 predictions
    predictions = outputs[0][0, masked_index].topk(5)
    
# Show results
for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i+1, token)

1 人
2 人々
3 子ども
4 国民
5 人間


In [10]:
input_text

'有意義なイノベーションを通じて、人々の生活を向上させる。'

The true token `人々` is ranked 2nd.

Looks good!
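
In the prediction cell, `outputs[0][0, masked_index]` is a vector with one score (logit) per vocabulary entry, and `topk(5)` keeps the five highest. The same selection can be sketched in plain Python on made-up scores. The candidate tokens and logit values below are invented for illustration and are not model output:

```python
import math


def top_k(scores, k):
    """Return (index, score) pairs for the k largest scores, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def softmax(scores):
    """Turn raw logits into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 4-word vocabulary and logits (not real model output)
toy_vocab = ['人', '人々', '犬', '山']
toy_logits = [3.0, 2.5, -1.0, 0.2]

probs = softmax(toy_logits)
for rank, (index, _) in enumerate(top_k(toy_logits, 2), start=1):
    print(rank, toy_vocab[index], round(probs[index], 3))
```

Ranking by logit and ranking by softmax probability give the same order, which is why the notebook can call `topk` on the raw scores directly.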

## Test Next Sentence Prediction

In [11]:
from transformers import BertForNextSentencePrediction

### Preprocessing

In [12]:
#  Prepare tokenized input
input_text1 = "発表者の方々ご連絡ありがとうございました。"
input_text2 = "次のミーティングのスケジュールは以下の通りです。"

tokenized_input1 = ["[CLS]"] + tokenizer.tokenize(input_text1) + ["[SEP]"]
print("Tokenized Input text #1: {}".format(tokenized_input1))
tokenized_input2 = tokenizer.tokenize(input_text2) + ["[SEP]"]
print("Tokenized Input text #2: {}".format(tokenized_input2))

Tokenized Input text #1: ['[CLS]', '発表', '者', 'の', '方', '##々', 'ご', '連絡', 'ありがとう', 'ござい', 'まし', 'た', '。', '[SEP]']
Tokenized Input text #2: ['次', 'の', 'ミーティング', 'の', 'スケジュール', 'は', '以下', 'の', '通り', 'です', '。', '[SEP]']


Map the tokens to their indices in the vocabulary of the pre-trained model, and build the segment ids (0 for the first sentence, 1 for the second).

In [13]:
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_input1 + tokenized_input2)
print("Token Index: {}".format(indexed_tokens))
segments_ids = [0]*len(tokenized_input1) + [1]*len(tokenized_input2)
print("Segment Index: {}".format(segments_ids))

Token Index: [2, 602, 104, 5, 283, 28827, 802, 2986, 21670, 27378, 3913, 10, 8, 3, 288, 5, 24257, 5, 11109, 9, 562, 5, 939, 2992, 8, 3]
Segment Index: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
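
The pair layout `[CLS] A [SEP] B [SEP]`, with segment id 0 for the first part and 1 for the second, can be wrapped in a small function. This is a sketch in plain Python; `build_pair` is a hypothetical helper name, not a Transformers API:

```python
def build_pair(tokens_a, tokens_b, cls="[CLS]", sep="[SEP]"):
    """Join two token lists as [CLS] A [SEP] B [SEP] and build matching segment ids."""
    first = [cls] + tokens_a + [sep]
    second = tokens_b + [sep]
    tokens = first + second
    # Segment 0 covers [CLS] A [SEP]; segment 1 covers B [SEP]
    segment_ids = [0] * len(first) + [1] * len(second)
    return tokens, segment_ids


tokens, segment_ids = build_pair(['次', 'の'], ['通り', 'です', '。'])
print(tokens)
print(segment_ids)
```

In practice, calling the tokenizer on both sentences at once, for example `tokenizer(input_text1, input_text2, return_tensors='pt')`, should yield `input_ids` and `token_type_ids` in this same layout without the manual steps.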


Convert inputs to PyTorch tensors

In [14]:
tokens_tensor = torch.tensor([indexed_tokens])
print("Token Tensor: {}".format(tokens_tensor))
segments_tensors = torch.tensor([segments_ids])
print("Segment Tensor: {}".format(segments_tensors))

Token Tensor: tensor([[    2,   602,   104,     5,   283, 28827,   802,  2986, 21670, 27378,
          3913,    10,     8,     3,   288,     5, 24257,     5, 11109,     9,
           562,     5,   939,  2992,     8,     3]])
Segment Tensor: tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1]])


### Next Sentence Prediction

Load the pre-trained next sentence prediction (NSP) model.

In [15]:
# Load pre-trained model (weights)
model = BertForNextSentencePrediction.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
model.eval()

Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing BertForNextSentencePrediction: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForNextSentencePrediction from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertForNextSentencePrediction(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

Predict whether the second sentence follows the first (isNextSentence).

In [16]:
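# Predict whether the second sentence follows the first
from torch.nn.functional import softmax
import numpy as np

# Caution: in Transformers 4.x the second positional argument of the model is
# `attention_mask`; pass `token_type_ids=segments_tensors` to feed the segment ids explicitly.
with torch.no_grad():  # inference only, no gradients needed
    predictions = model(tokens_tensor, segments_tensors)
# predictions[0] holds the logits: index 0 = is next, index 1 = is not next
prediction_sm = softmax(predictions[0], dim=1)
print("Prediction: {}".format(prediction_sm[0].tolist()))

res_index = np.argmax(prediction_sm[0].tolist())
if res_index == 0:
    print("***Result***\nisNextSentence")
else:
    print("***Result***\nNot isNextSentence")

Prediction: [0.9979470372200012, 0.0020529187750071287]
***Result***
isNextSentence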
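
The NSP head returns two logits per sentence pair, and `softmax` converts them into probabilities. In Transformers, index 0 means "sentence B follows sentence A" and index 1 means "sentence B is a random sentence". A plain-Python sketch of that decision rule follows; the logit values are invented for illustration and are not taken from the model:

```python
import math


def nsp_probabilities(logits):
    """Softmax over the two NSP logits, ordered as [is_next, not_next]."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_next_sentence(logits):
    """True when the 'is next' class (index 0) has the higher probability."""
    probs = nsp_probabilities(logits)
    return probs[0] > probs[1]


toy_logits = [3.1, -3.1]  # hypothetical values
print(nsp_probabilities(toy_logits))
print(is_next_sentence(toy_logits))
```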


# Fine Tuning
The Transformers repository provides fine-tuning examples for English text [here](https://github.com/huggingface/transformers/tree/master/examples), covering the following tasks:
- **Language Modeling:** Predict the next word, i.e. output the probability of a token conditioned on the preceding tokens.
- **Text Classification:** Classify texts (e.g. [Movie Reviews](https://www.tensorflow.org/tutorials/keras/text_classification?hl=en)).
- **Token Classification:** Classify individual tokens.
- **Multiple Choice:** Choose the correct answer from several options.
- **Q&A:** Answer a question.
- **Text Generation:** Generate text that is meant to be indistinguishable from human-written text (e.g. [a result](https://arxiv.org/pdf/1705.11001.pdf), p. 8 of the paper).
- **Distillation**: Train a model on the outputs of pre-trained models ([related reference](https://www.anlp.jp/proceedings/annual_meeting/2020/pdf_dir/F1-4.pdf)).
- **Summarization**: Summarize text.
- **Translation**: Translate text.
- **Adversarial**: GAN-like tasks (e.g. [link](https://towardsdatascience.com/what-are-adversarial-examples-in-nlp-f928c574478e)).