Last test : 2021-01-30  
Korean Explanation : https://wikidocs.net/159246  
English Explanation : https://wikidocs.net/160289  
Github : https://github.com/RichardMinsooGo/51_Pretrained_BERT_NMT

We will use pytorch_pretrained_bert in this notebook.

In [None]:
!pip install pytorch_pretrained_bert

from IPython.display import clear_output 
clear_output()

Import the libraries used below, then check whether a GPU is available.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils import data
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM, BertForQuestionAnswering, BertForPreTraining

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


### 1. Data Load 
This step is skipped; the input and output sentences are defined directly in this notebook.

### 2. Build Input text, Output Text 
Define the English input sentences and their French target sentences.

In [None]:
raw_data = (
    ('What a ridiculous concept!', 'Quel concept ridicule !'),
    ('Your idea is not entirely crazy.', "Votre idée n'est pas complètement folle."),
    ("A man's worth lies in what he is.", "La valeur d'un homme réside dans ce qu'il est."),
    ('What he did is very wrong.', "Ce qu'il a fait est très mal."),
    ("All three of you need to do that.", "Vous avez besoin de faire cela, tous les trois."),
    ("Are you giving me another chance?", "Me donnez-vous une autre chance ?"),
    ("Both Tom and Mary work as models.", "Tom et Mary travaillent tous les deux comme mannequins."),
    ("Can I have a few minutes, please?", "Puis-je avoir quelques minutes, je vous prie ?"))

### 3. Preprocess  

Lowercase the text, strip accents, and insert a space between words and punctuation marks.   
Ex) "he is a boy." => "he is a boy ."   
All characters other than a-z, A-Z, ".", "?", "!" are replaced with a space, so commas are removed.

In [None]:
import unicodedata
import re


def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
            if unicodedata.category(c) != 'Mn')

def preprocess(sent):
    # Lowercase, then strip accents with the helper defined above
    sent = unicode_to_ascii(sent.lower())

    # Insert a space between words and punctuation marks.
    # Ex) "he is a boy." => "he is a boy ."
    sent = re.sub(r"([?.!,¿])", r" \1", sent)

    # Replace everything except a-z, A-Z, ".", "?", "!" with a space
    # (commas are removed here).
    sent = re.sub(r"[^a-zA-Z!.?]+", r" ", sent)

    sent = re.sub(r"\s+", " ", sent)
    return sent

# Encoding test
en_sent = u"Have you had dinner?"
fr_sent = u"Avez-vous déjà diné?"

print(preprocess(en_sent))
print(preprocess(fr_sent).encode('utf-8'))

have you had dinner ?
b'avez vous deja dine ?'


### Build Input/Output text data
To turn the input and output sentences into batch data, preprocess the raw pairs, convert them to lists, and print the first few entries.

In [None]:
raw_encoder_input, raw_data_fr = list(zip(*raw_data))
raw_encoder_input, raw_data_fr = list(raw_encoder_input), list(raw_data_fr)

input_text = ['[CLS] ' + preprocess(data) + ' [SEP]' for data in raw_encoder_input]
target_text = [preprocess(data) for data in raw_data_fr]

print(input_text[:5])
print(target_text[:5])

['[CLS] what a ridiculous concept ! [SEP]', '[CLS] your idea is not entirely crazy . [SEP]', '[CLS] a man s worth lies in what he is . [SEP]', '[CLS] what he did is very wrong . [SEP]', '[CLS] all three of you need to do that . [SEP]']
['quel concept ridicule !', 'votre idee n est pas completement folle .', 'la valeur d un homme reside dans ce qu il est .', 'ce qu il a fait est tres mal .', 'vous avez besoin de faire cela tous les trois .']


### Load pretrained BERT Model
Load the pretrained BERT model and set the sequence length used for the input/output data.  
The sentences here are longer than in the previous article, and the tokenizer splits many words, French ones in particular, into several subword tokens, so the length of the input/output sequence is set to 80.

In [None]:
# Name of the pre-trained checkpoint to load
modelpath = "bert-base-uncased"

# Load the pre-trained masked language model (weights)
model = BertForMaskedLM.from_pretrained(modelpath)
model = model.to(device)

n_seq_length = 80

100%|██████████| 407873900/407873900 [00:15<00:00, 26641671.35B/s]


### 4. Build Vocabulary
In the case of BertTokenizer, there is no need to create a dedicated vocabulary. It has its own built-in vocabulary, so you only need to define a tokenizer.

In [None]:
tokenizer = BertTokenizer.from_pretrained(modelpath)

100%|██████████| 231508/231508 [00:00<00:00, 683299.41B/s]


### 5. Tokenize 
Tokenizing works the same way as in the previous article, which trained on a single sentence. The only difference is that there are now several sentences, so each one is tokenized in turn inside a loop.
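
To illustrate why a sentence can expand into more tokens than it has words, here is a minimal sketch of greedy longest-match-first subword splitting, the general idea behind WordPiece. The vocabulary `toy_vocab` and the function `toy_wordpiece` are hypothetical and purely illustrative; the real `BertTokenizer` uses its own built-in vocabulary and will split these words differently.

```python
def toy_wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary (NOT the real bert-base-uncased vocabulary).
toy_vocab = {"concept", "rid", "##ic", "##ule", "que", "##l", "!"}

toy_sentence = "quel concept ridicule !"
toy_tokens = [p for w in toy_sentence.split() for p in toy_wordpiece(w, toy_vocab)]
print(toy_tokens)
# ['que', '##l', 'concept', 'rid', '##ic', '##ule', '!']
```

Four whitespace-separated words become seven tokens in this toy case, which is the kind of growth that motivates a generous sequence length.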

### 6. Data Processing
In this article, "6. Data Processing" and "7. Convert tokens to indexes" are done simultaneously.

### 7. Convert tokens to indexes
As noted above, this step is not performed as a separate stage; it is handled together with Step 6, in the same way as in the single-sentence case in the previous article.

### 8. Convert indexes to tensors 
Convert the indexes created in Step 7 to tensors.  
Keep in mind that deep learning models take batched tensors as input.  
With multiple sentences, the per-sentence tensors must be combined into tensors that contain all of the data. The process is expressed as follows.  

In [None]:
for idx in range(len(input_text)):

    # 5. Tokenize
    tokenized_inp_text = tokenizer.tokenize(input_text[idx])
    tokenized_trg_text = tokenizer.tokenize(target_text[idx])
    len_input_text = len(tokenized_inp_text)
    
    # 6. Data Processing & 7. Convert tokens to indexes
    # Processing for model
    for _ in range(n_seq_length-len(tokenized_inp_text)):
        tokenized_inp_text.append('[MASK]')

    indexed_inp_tokens = tokenizer.convert_tokens_to_ids(tokenized_inp_text)

    # Labels of -1 mark positions excluded from the loss (the source span and the padding)
    pad_idx = -1
    converted_trg_inds = [pad_idx] * len_input_text
    
    indexed_trg_tokens = tokenizer.convert_tokens_to_ids(tokenized_trg_text)
    converted_trg_inds += indexed_trg_tokens
    
    converted_trg_inds.append(tokenizer.convert_tokens_to_ids(['[SEP]'])[0])

    for _ in range(n_seq_length-len(converted_trg_inds)):
        converted_trg_inds.append(pad_idx)

    # 8. Convert indexes to tensors
    src_tensor = torch.tensor([indexed_inp_tokens]).to(device)
    trg_tensor = torch.tensor([converted_trg_inds]).to(device)

    # Stack each sentence's tensors along the batch dimension
    if idx == 0:
        tensors_src = src_tensor
        tensors_trg = trg_tensor
    else:
        tensors_src = torch.cat((tensors_src, src_tensor), 0)
        tensors_trg = torch.cat((tensors_trg, trg_tensor), 0)
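
The row layout built in the loop above can be easier to see on small numbers. The sketch below reproduces the same construction in plain Python: source ids followed by `[MASK]` in every remaining slot on the input side, and on the label side `-1` over the source span, then the target ids, then `[SEP]`, then `-1` to the end. The helper `build_row` and the word ids in the example are hypothetical; only 101, 102 and 103 are taken from the notebook output, where they appear as `[CLS]`, `[SEP]` and `[MASK]`.

```python
MASK_ID = 103   # [MASK] id, as seen in the notebook output
SEP_ID = 102    # [SEP] id, as seen in the notebook output
IGNORE = -1     # label value for positions that should not contribute to the loss

def build_row(src_ids, trg_ids, seq_len):
    """Return (input_ids, labels), both of length seq_len."""
    if len(src_ids) + len(trg_ids) + 1 > seq_len:
        raise ValueError("seq_len is too short for this sentence pair")
    # Input: the source tokens, then [MASK] in every remaining slot.
    input_ids = src_ids + [MASK_ID] * (seq_len - len(src_ids))
    # Labels: ignore the source span, supervise the target and a closing [SEP].
    labels = [IGNORE] * len(src_ids) + trg_ids + [SEP_ID]
    labels += [IGNORE] * (seq_len - len(labels))
    return input_ids, labels

# Hypothetical ids: a 4-token source ([CLS] x x [SEP]) and a 3-token target.
inp, lab = build_row([101, 2054, 1037, 102], [7001, 7002, 7003], seq_len=10)
print(inp)  # [101, 2054, 1037, 102, 103, 103, 103, 103, 103, 103]
print(lab)  # [-1, -1, -1, -1, 7001, 7002, 7003, 102, -1, -1]
```

Collecting such rows in two Python lists and calling `torch.tensor` once on each list is one alternative to concatenating inside the loop.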


### The rest is a normal training process

In [None]:
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-7)
# optimizer = torch.optim.SGD(model.parameters(), lr = 5e-5, momentum=0.9)
optimizer = torch.optim.Adamax(model.parameters(), lr = 5e-5)

num_epochs = 300

model.train()
for i in range(num_epochs):
    loss = model(tensors_src, masked_lm_labels=tensors_trg)
    eveloss = loss.mean().item()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (i+1)%10 == 0:
        print("step "+ str(i+1) + " : " + str(eveloss))

step 10 : 5.8896164894104
step 20 : 4.240797996520996
step 30 : 2.82885479927063
step 40 : 1.981243371963501
step 50 : 1.5004268884658813
step 60 : 1.0952112674713135
step 70 : 0.8964142799377441
step 80 : 0.6988181471824646
step 90 : 0.4597766697406769
step 100 : 0.2796086370944977
step 110 : 0.22384724020957947
step 120 : 0.21832644939422607
step 130 : 0.13917896151542664
step 140 : 0.09290452301502228
step 150 : 0.06827884912490845
step 160 : 0.057465214282274246
step 170 : 0.05416080355644226
step 180 : 0.04215862974524498
step 190 : 0.05267646908760071
step 200 : 0.05707816407084465
step 210 : 0.020082900300621986
step 220 : 0.02082022652029991
step 230 : 0.015502654016017914
step 240 : 0.012265943922102451
step 250 : 0.019831469282507896
step 260 : 0.010459261946380138
step 270 : 0.012032284401357174
step 280 : 0.00993704330176115
step 290 : 0.012004001066088676
step 300 : 0.007864445447921753
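
The role of the `-1` labels in the loss can be sketched in plain Python. The function below averages cross-entropy only over positions whose label is not the ignore value, assuming, as the training cell relies on, that labels equal to `-1` are excluded from the masked-LM loss. The name `masked_cross_entropy` and the toy scores are hypothetical; this is an illustration of the idea, not the library's implementation.

```python
import math

def masked_cross_entropy(logits, labels, ignore_index=-1):
    """Mean cross-entropy over positions whose label != ignore_index.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of target ids, ignore_index where nothing is supervised
    """
    total, counted = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # source and padding positions are skipped entirely
        # Numerically stable log-sum-exp, then the negative log-probability of the target.
        top = max(scores)
        log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_sum - scores[label]
        counted += 1
    return total / counted if counted else 0.0

# Toy example: 3-entry vocabulary, 3 positions, only position 1 is supervised.
toy_logits = [[5.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 9.0, 0.0]]
toy_labels = [-1, 2, -1]
toy_loss = masked_cross_entropy(toy_logits, toy_labels)
print(toy_loss)  # equals log(3): the supervised position has uniform scores
```

Whatever the model predicts at the ignored positions, the value stays the same, which is why only the target span drives the loss values printed above.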


### Inference
Using the model trained above, select one sentence pair and test it. Apart from the sentence selection, this process is the same as in the previous article, "51.2 Single Sentence with BERT Tokenizer".
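
The decoding rule used at inference can be sketched without the model: starting right after the source tokens, take the highest-scoring id at each position and stop at the first `[SEP]`. The function `greedy_until_sep` and the toy scores are hypothetical and stand in for the model's per-position predictions.

```python
SEP_ID = 102  # [SEP] id, as seen in the notebook output

def greedy_until_sep(position_scores, start, sep_id=SEP_ID):
    """Pick the argmax id at each position from `start`, stopping at sep_id."""
    result_ids = []
    for scores in position_scores[start:]:
        # scores maps a candidate id to its score at this position
        best_id = max(scores, key=scores.get)
        if best_id == sep_id:
            break
        result_ids.append(best_id)
    return result_ids

# Toy scores for 5 positions; the first 2 positions belong to the source.
toy_scores = [
    {101: 9.0, 7: 0.1},
    {102: 9.0, 7: 0.1},
    {7: 3.0, 8: 1.0, 102: 0.5},
    {7: 0.2, 8: 4.0, 102: 0.5},
    {7: 0.2, 8: 0.1, 102: 6.0},
]
decoded = greedy_until_sep(toy_scores, start=2)
print(decoded)  # [7, 8]
```

Each position is decoded independently of the others here, since all `[MASK]` slots are predicted in a single forward pass.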

In [None]:
print(tensors_src[6])
test_list = tensors_src[6].tolist()
test_tokens_tensor = torch.tensor([test_list]).to(device)
print(test_tokens_tensor)

result = []
result_ids = []
model.eval()
with torch.no_grad():
    predictions = model(test_tokens_tensor)

    # Start decoding right after the source tokens ([CLS] ... [SEP])
    start = len(tokenizer.tokenize(input_text[6]))
    while start < len(predictions[0]):
        # Take the highest-scoring vocabulary id at this position
        predicted_index = torch.argmax(predictions[0,start]).item()
        
        predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])
        # Stop at the first predicted [SEP]
        if '[SEP]' in predicted_token:
            break
        result += predicted_token
        result_ids.append(predicted_index)

        start += 1
print("input_text       :", input_text[6])
print("target_text      :", target_text[6])
print("tokenized target :", tokenizer.tokenize(target_text[6]))
print("result_ids       :",result_ids)
print("result           :",result)



tensor([ 101, 2119, 3419, 1998, 2984, 2147, 2004, 4275, 1012,  102,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103], device='cuda:0')
tensor([[ 101, 2119, 3419, 1998, 2984, 2147, 2004, 4275, 1012,  102,  103,  103,
          103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
          103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
          103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
          103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
          103,  103,  103,  103,  103,  1