# BERT
This notebook follows [this article](https://qiita.com/kenta1984/items/7f3a5d859a15b20657f3) to try out BERT.
It uses PyTorch.

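#### Requirements
- PyTorch: `conda install pytorch`
- transformers: `conda install -c huggingface transformers`
- ipywidgets (download progress bars in the notebook): `conda install -c conda-forge ipywidgets`
- fugashi and ipadic (MeCab wrapper and dictionary used by `BertJapaneseTokenizer`): `pip install fugashi ipadic`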

# Load pre-trained model
Use the pre-trained model `cl-tohoku/bert-base-japanese-whole-word-masking` released by Tohoku University. (Reference: [link](https://huggingface.co/transformers/pretrained_models.html), [link2](https://github.com/huggingface/transformers))

In [28]:
from transformers import BertJapaneseTokenizer, BertForMaskedLM
import torch

## Test Masked Language Model

In [13]:
input_text = "有意義なイノベーションを通じて、人々の生活を向上させる。"

### Preprocess (Tokenizing)

Tokenize the input text.

In [4]:
# load pre-trained tokenizer
tokenizer = BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')

In [17]:
# Tokenize input
tokenized_input = tokenizer.tokenize(input_text)
print("Tokenized Input text: {}".format(tokenized_input))

Tokenized Input text: ['有意', '##義', 'な', 'イノベーション', 'を通じて', '、', '人々', 'の', '生活', 'を', '向上', 'さ', 'せる', '。']


Mask the token `人々` (index 6) so that the model has to predict it.

In [22]:
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 6

tokenized_input[masked_index] = '[MASK]'
print("Masked Tokenized Input text: {}".format(tokenized_input))
print("Length of List: {}".format(len(tokenized_input)))

Masked Tokenized Input text: ['有意', '##義', 'な', 'イノベーション', 'を通じて', '、', '[MASK]', 'の', '生活', 'を', '向上', 'さ', 'せる', '。']
Length of List: 14
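
The hard-coded `masked_index = 6` depends on how the tokenizer happens to split this particular sentence. As a minimal sketch, assuming the target token occurs in the list, the index can be looked up by value instead. The helper name `mask_token` is hypothetical and not part of transformers.

```python
from typing import List, Tuple


def mask_token(tokens: List[str], target: str, mask: str = "[MASK]") -> Tuple[List[str], int]:
    """Return a copy of `tokens` with the first `target` replaced by `mask`, plus its index."""
    index = tokens.index(target)  # raises ValueError if the target is absent
    masked = list(tokens)         # copy so the caller's list is left untouched
    masked[index] = mask
    return masked, index


# Token list copied from the tokenizer output shown above.
tokens = ['有意', '##義', 'な', 'イノベーション', 'を通じて', '、', '人々',
          'の', '生活', 'を', '向上', 'さ', 'せる', '。']
masked_tokens, masked_index = mask_token(tokens, '人々')
print(masked_index, masked_tokens[masked_index])
```

Returning a copy keeps the unmasked list available, which is handy for comparing the prediction with the original token later.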


Map each token to its index in the pre-trained model's vocabulary.

In [24]:
# Convert tokens to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_input)
print("Token Index: {}".format(indexed_tokens))
print("Length of List: {}".format(len(indexed_tokens)))

Token Index: [22949, 28845, 18, 22918, 3016, 6, 4, 5, 1326, 11, 2771, 26, 796, 8]
Length of List: 14
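
Conceptually, `convert_tokens_to_ids` is a dictionary lookup from token string to integer ID. The sketch below imitates that lookup with a tiny vocabulary built only from the token/ID pairs printed above; the function `tokens_to_ids` and the fallback value `UNK_ID` are illustrative assumptions, not the library's implementation.

```python
from typing import Dict, List

# Token/ID pairs copied from the outputs above; the full vocabulary is much larger.
vocab: Dict[str, int] = {
    '有意': 22949, '##義': 28845, 'な': 18, 'イノベーション': 22918,
    'を通じて': 3016, '、': 6, '[MASK]': 4, 'の': 5, '生活': 1326,
    'を': 11, '向上': 2771, 'さ': 26, 'せる': 796, '。': 8,
}
UNK_ID = 1  # assumption: placeholder ID for tokens missing from the vocabulary


def tokens_to_ids(tokens: List[str], vocab: Dict[str, int], unk_id: int = UNK_ID) -> List[int]:
    """Look up each token in `vocab`, falling back to `unk_id` when it is missing."""
    return [vocab.get(token, unk_id) for token in tokens]


masked_tokens = ['有意', '##義', 'な', 'イノベーション', 'を通じて', '、', '[MASK]',
                 'の', '生活', 'を', '向上', 'さ', 'せる', '。']
ids = tokens_to_ids(masked_tokens, vocab)
print(ids)
```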


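Convert the inputs to a PyTorch tensor. The extra pair of brackets adds a batch dimension, so `len(tokens_tensor)` below is 1 (the batch size), not the number of tokens.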

In [26]:
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])

print("Tensor: {}".format(tokens_tensor))
print("Length of List: {}".format(len(tokens_tensor)))

Tensor: tensor([[22949, 28845,    18, 22918,  3016,     6,     4,     5,  1326,    11,
          2771,    26,   796,     8]])
Length of List: 1
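
The manual steps above feed the model the bare token list. BERT models are pre-trained on sequences that begin with `[CLS]` and end with `[SEP]`, so a common variation is to add both before converting to IDs. A minimal sketch in plain Python, assuming the same masked token list as above; the helper `add_special_tokens` is hypothetical, and the masked position moves from 6 to 7 because of the leading `[CLS]`.

```python
from typing import List, Tuple


def add_special_tokens(tokens: List[str], masked_index: int) -> Tuple[List[str], int]:
    """Wrap `tokens` in [CLS] ... [SEP] and shift the masked index accordingly."""
    wrapped = ['[CLS]'] + tokens + ['[SEP]']
    return wrapped, masked_index + 1  # +1 accounts for the leading [CLS]


# Masked token list copied from the output above.
masked_tokens = ['有意', '##義', 'な', 'イノベーション', 'を通じて', '、', '[MASK]',
                 'の', '生活', 'を', '向上', 'さ', 'せる', '。']
wrapped_tokens, wrapped_index = add_special_tokens(masked_tokens, 6)
print(wrapped_index, wrapped_tokens[wrapped_index])
```

With the special tokens included, the predictions may differ from the ones shown in this notebook.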


### Masked Language Model

Load the pre-trained masked language model (MLM) and switch it to evaluation mode, which disables dropout.

In [29]:
# Load pre-trained model
model = BertForMaskedLM.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
model.eval()

Some weights of the model checkpoint at cl-tohoku/bert-base-japanese-whole-word-masking were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(32000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=Tr

Predict the `[MASK]` token

In [30]:
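# Predict all tokens; gradients are not needed for inference
with torch.no_grad():
    outputs = model(tokens_tensor)
    # outputs[0] holds the logits; pick the top 5 predictions at the masked position
    predictions = outputs[0][0, masked_index].topk(5)
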
# Show results
for i, index_t in enumerate(predictions.indices):
    index = index_t.item()
    token = tokenizer.convert_ids_to_tokens([index])[0]
    print(i, token)

0 人
1 人々
2 子ども
3 国民
4 人間


Looks good! The original token `人々` is ranked second.
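
`outputs[0][0, masked_index]` holds one raw score (logit) per vocabulary entry, and `topk(5)` returns the five largest. The sketch below shows the same idea in plain Python and also converts the scores to probabilities with a softmax. It runs on toy numbers, not on real model output, and the helper `top_k_probs` is a hypothetical name.

```python
import math
from typing import List, Tuple


def top_k_probs(logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Return the k (index, probability) pairs with the highest softmax probability."""
    peak = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Toy logits over a hypothetical 6-entry vocabulary (not real model output).
toy_logits = [0.5, 3.0, -1.0, 2.0, 0.0, 1.0]
top3 = top_k_probs(toy_logits, 3)
for rank, (index, prob) in enumerate(top3):
    print(rank, index, round(prob, 3))
```

Softmax preserves the ordering of the logits, so the ranking matches what `topk` returns on the raw scores; the probabilities only make the gap between candidates easier to read.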

# Fine Tuning
The transformers repository provides [fine-tuning examples](https://github.com/huggingface/transformers/tree/master/examples) for English text, covering the following tasks:
- **Language Modeling:** Predict the next word, i.e. output the probability of a token conditioned on the other tokens.
- **Text Classification:** Classify texts (e.g. [Movie Reviews](https://www.tensorflow.org/tutorials/keras/text_classification?hl=en)).
- **Token Classification:** Classify individual tokens.
- **Multiple Choice:** Choose the correct answer from several candidates.
- **Q&A:** Answer a question.
- **Text Generation:** Generate text that is meant to be indistinguishable from human-written text (e.g. [a result](https://arxiv.org/pdf/1705.11001.pdf), p. 8 of the paper).
- **Distillation:** Train a model on the outputs of pre-trained models ([related reference](https://www.anlp.jp/proceedings/annual_meeting/2020/pdf_dir/F1-4.pdf)).
- **Summarization:** Summarize text.
- **Translation:** Translate text.
- **Adversarial:** GAN-like task (e.g. [link](https://towardsdatascience.com/what-are-adversarial-examples-in-nlp-f928c574478e)).