Last test : 2021-01-30  
Korean Explanation : https://wikidocs.net/159246  
English Explanation : https://wikidocs.net/160289  
Github : https://github.com/RichardMinsooGo/51_Pretrained_BERT_NMT

We will use pytorch_pretrained_bert in this notebook.

In [None]:
!pip install pytorch_pretrained_bert

from IPython.display import clear_output 
clear_output()

Import the libraries we will use, then check whether a GPU is available.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils import data
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM, BertForQuestionAnswering, BertForPreTraining

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


### 1. Data Load 
This step is not used. The input and output are defined directly in this notebook.

### 2. Build Input text, Output Text 
Define the input sentence (English, wrapped in `[CLS]` and `[SEP]`) and the target sentence (French).

In [None]:
input_text  = "[CLS] I want to buy the new Apple M1 Pro MacBook pro [SEP] "
target_text = "Je veux acheter le nouveau MacBook Pro Apple M1 Pro"

### Load pretrained BERT Model
Load the pretrained BERT model and check that the input/output data was created correctly.  
The input and output sentences in this article are relatively long, and the tokenizer splits many words, the French ones in particular, into several word pieces, so the input/output sequence length is set to 30.

In [None]:
modelpath = "bert-base-uncased"

# Load the pre-trained masked-LM model and move it to the selected device
model = BertForMaskedLM.from_pretrained(modelpath)
model = model.to(device)

n_seq_length = 30

print("input_text       :", input_text)
print("target_text      :", target_text)

input_text       : [CLS] I want to buy the new Apple M1 Pro MacBook pro [SEP] 
target_text      : Je veux acheter le nouveau MacBook Pro Apple M1 Pro


### 3. Preprocess 
The BERT tokenizer has built-in preprocessing, so no separate preprocessing step is needed here. From the next example onward, an explicit preprocessing step is included again.

### 4. Build Vocabulary
BertTokenizer ships with its own built-in vocabulary, so there is no need to build one. You only need to instantiate the tokenizer.

In [None]:
tokenizer = BertTokenizer.from_pretrained(modelpath)
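
To illustrate how a fixed vocabulary ends up splitting a word into several pieces, here is a minimal sketch of greedy longest-match-first splitting, the general idea behind WordPiece. The function `wordpiece_split` and the vocabulary `toy_vocab` are hypothetical names made up for this illustration. They are not part of pytorch_pretrained_bert, and the toy vocabulary is not the real `bert-base-uncased` vocabulary.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word (illustrative only)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real bert-base-uncased vocabulary
toy_vocab = {"je", "ve", "##ux", "ache", "##ter", "mac", "##book"}

print(wordpiece_split("veux", toy_vocab))     # ['ve', '##ux']
print(wordpiece_split("macbook", toy_vocab))  # ['mac', '##book']
```

Because the vocabulary was built mostly from English text, French words such as "veux" tend to break into more pieces, which is why the target sequence is longer than its word count suggests.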

### 5. Tokenize 

In [None]:
tokenized_inp_text = tokenizer.tokenize(input_text)
tokenized_trg_text = tokenizer.tokenize(target_text)

len_input_text = len(tokenized_inp_text)
print("len_input_text   :", len_input_text)

print("tokenized input  :", tokenized_inp_text)
print("tokenized target :", tokenized_trg_text)

len_input_text   : 14
tokenized input  : ['[CLS]', 'i', 'want', 'to', 'buy', 'the', 'new', 'apple', 'm1', 'pro', 'mac', '##book', 'pro', '[SEP]']
tokenized target : ['je', 've', '##ux', 'ache', '##ter', 'le', 'nouveau', 'mac', '##book', 'pro', 'apple', 'm1', 'pro']
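
The token counts above explain why a sequence length of 30 is enough. The sketch below copies the two token lists from the output and checks the length budget, assuming the layout used in this notebook: input tokens first, then the target tokens, then one closing `[SEP]`. The names `needed` and `spare` are introduced here only for illustration.

```python
# Token lists copied from the tokenizer output shown above
tokenized_inp_text = ['[CLS]', 'i', 'want', 'to', 'buy', 'the', 'new', 'apple',
                      'm1', 'pro', 'mac', '##book', 'pro', '[SEP]']
tokenized_trg_text = ['je', 've', '##ux', 'ache', '##ter', 'le', 'nouveau',
                      'mac', '##book', 'pro', 'apple', 'm1', 'pro']
n_seq_length = 30

# Input tokens + target tokens + one closing [SEP] must fit in the sequence
needed = len(tokenized_inp_text) + len(tokenized_trg_text) + 1
spare = n_seq_length - needed

print("needed:", needed)  # needed: 28
print("spare :", spare)   # spare : 2
assert needed <= n_seq_length, "increase n_seq_length"
```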


### 6. Data Processing
In this article, steps 6 ("Data Processing") and 7 ("Convert tokens to indexes") are carried out together.

### 7. Convert tokens to indexes
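As noted above, the steps here do not follow the exact numbered order. The input is padded with `[MASK]` up to `n_seq_length`, and the target is laid out so that its tokens line up with those `[MASK]` positions. The key point is to check that the input and output token sequences have the expected form.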

In [None]:
# Processing for model
for _ in range(n_seq_length-len(tokenized_inp_text)):
    tokenized_inp_text.append('[MASK]')
    
indexed_inp_tokens = tokenizer.convert_tokens_to_ids(tokenized_inp_text)

# Label -1 marks positions that should not contribute to the loss
# (this relies on the masked-LM loss in pytorch_pretrained_bert ignoring -1)
pad_idx = -1
# The input positions are not predicted, so they are all labelled pad_idx
converted_trg_inds = [pad_idx] * len_input_text
indexed_trg_tokens = tokenizer.convert_tokens_to_ids(tokenized_trg_text)
converted_trg_inds += indexed_trg_tokens
# Close the target with [SEP] so the model learns where to stop
converted_trg_inds.append(tokenizer.convert_tokens_to_ids(['[SEP]'])[0])

for _ in range(n_seq_length-len(converted_trg_inds)):
    converted_trg_inds.append(pad_idx)

print("Input (Tokenized and indexed)  :\n", indexed_inp_tokens)
print("Output (Tokenized and indexed) :\n", converted_trg_inds)

Input (Tokenized and indexed)  :
 [101, 1045, 2215, 2000, 4965, 1996, 2047, 6207, 23290, 4013, 6097, 8654, 4013, 102, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103]
Output (Tokenized and indexed) :
 [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 15333, 2310, 5602, 12336, 3334, 3393, 25272, 6097, 8654, 4013, 6207, 23290, 4013, 102, -1, -1]
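
The layout printed above can be reproduced without the tokenizer. The following is a minimal sketch that rebuilds both lists from the ids shown in the output. `build_example` is a hypothetical helper written for this illustration, and the ids 103 (`[MASK]`) and 102 (`[SEP]`) are read off the printed output rather than looked up in the vocabulary.

```python
def build_example(src_ids, trg_ids, n_seq_length, mask_id, sep_id, ignore_id=-1):
    """Return (inputs, labels), both padded to n_seq_length (illustrative helper)."""
    needed = len(src_ids) + len(trg_ids) + 1  # +1 for the closing [SEP]
    if needed > n_seq_length:
        raise ValueError(f"need {needed} positions, but n_seq_length is {n_seq_length}")
    # Input: source ids, then [MASK] in every position the model must fill in
    inputs = src_ids + [mask_id] * (n_seq_length - len(src_ids))
    # Labels: ignored over the source, then target ids and [SEP], then ignored again
    labels = [ignore_id] * len(src_ids) + trg_ids + [sep_id]
    labels += [ignore_id] * (n_seq_length - len(labels))
    return inputs, labels


# Ids copied from the output above
src_ids = [101, 1045, 2215, 2000, 4965, 1996, 2047, 6207, 23290, 4013,
           6097, 8654, 4013, 102]
trg_ids = [15333, 2310, 5602, 12336, 3334, 3393, 25272, 6097, 8654, 4013,
           6207, 23290, 4013]

inputs, labels = build_example(src_ids, trg_ids, n_seq_length=30,
                               mask_id=103, sep_id=102)
print(len(inputs), len(labels))  # 30 30
print(labels[14], labels[27])    # 15333 102
```

Position 14 is the first `[MASK]` in the input and holds the first target id in the labels, so each masked position is trained to predict one target token.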


### 8. Convert indexes to tensors  
Convert the indexes created in step 7 to tensors.
The model expects batched input, so each list is wrapped in an outer list, which gives tensors of shape (1, n_seq_length).

In [None]:
tensors_src = torch.tensor([indexed_inp_tokens]).to(device)
tensors_trg = torch.tensor([converted_trg_inds]).to(device)

print(tensors_src)
print(tensors_trg)

tensor([[  101,  1045,  2215,  2000,  4965,  1996,  2047,  6207, 23290,  4013,
          6097,  8654,  4013,   102,   103,   103,   103,   103,   103,   103,
           103,   103,   103,   103,   103,   103,   103,   103,   103,   103]],
       device='cuda:0')
tensor([[   -1,    -1,    -1,    -1,    -1,    -1,    -1,    -1,    -1,    -1,
            -1,    -1,    -1,    -1, 15333,  2310,  5602, 12336,  3334,  3393,
         25272,  6097,  8654,  4013,  6207, 23290,  4013,   102,    -1,    -1]],
       device='cuda:0')


### The rest is the normal training process

In [None]:
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-7)
# optimizer = torch.optim.SGD(model.parameters(), lr = 5e-5, momentum=0.9)
optimizer = torch.optim.Adamax(model.parameters(), lr = 5e-5)

num_epochs = 300

model.train()
for i in range(num_epochs):
    loss = model(tensors_src, masked_lm_labels=tensors_trg)
    eveloss = loss.mean().item()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (i+1)%10 == 0:
        print("step "+ str(i+1) + " : " + str(eveloss))

step 10 : 3.771027088165283
step 20 : 2.700327157974243
step 30 : 1.988046646118164
step 40 : 1.4163614511489868
step 50 : 1.3190759420394897
step 60 : 0.6748390793800354
step 70 : 0.359968364238739
step 80 : 0.33379030227661133
step 90 : 0.21821965277194977
step 100 : 0.04551782086491585
step 110 : 0.07059331238269806
step 120 : 0.14557014405727386
step 130 : 0.09280121326446533
step 140 : 0.02755594253540039
step 150 : 0.009662961587309837
step 160 : 0.011001522652804852
step 170 : 0.024392014369368553
step 180 : 0.011023713275790215
step 190 : 0.018864227458834648
step 200 : 0.003558436641469598
step 210 : 0.020779123529791832
step 220 : 0.020972367376089096
step 230 : 0.004028483293950558
step 240 : 0.014759846031665802
step 250 : 0.012583794072270393
step 260 : 0.028509119525551796
step 270 : 0.006064563058316708
step 280 : 0.005579332821071148
step 290 : 0.006157734896987677
step 300 : 0.004537112545222044
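
The loss values above depend on the -1 labels being skipped. As far as I understand, `BertForMaskedLM` in pytorch_pretrained_bert computes a cross-entropy that ignores label -1; treat that as an assumption and check the library source if it matters. The pure-Python sketch below shows the idea on toy numbers. `masked_cross_entropy` is a hypothetical helper, not the library's implementation.

```python
import math


def masked_cross_entropy(logits, labels, ignore_id=-1):
    """Mean cross-entropy over positions whose label is not ignore_id."""
    total, counted = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_id:
            continue  # ignored positions add nothing to the loss
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]  # negative log-probability of the label
        counted += 1
    if counted == 0:
        return float("nan")  # nothing to average over
    return total / counted


# Toy example: 3 positions, vocabulary of size 2, only the middle one is labelled
toy_logits = [[5.0, -5.0], [0.0, 0.0], [10.0, 0.0]]
toy_labels = [-1, 0, -1]

loss = masked_cross_entropy(toy_logits, toy_labels)
print(round(loss, 4))  # 0.6931
```

Only the middle position counts, and its uniform logits give a loss of ln 2, regardless of what the model predicts at the two ignored positions.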


### Inference
Using the model trained above, pick one example and test it. Since this article has only one sentence, we test with the input sentence itself.

In [None]:
result = []
result_ids = []
model.eval()
with torch.no_grad():
    predictions = model(tensors_src)

    # Predictions for the target begin right after the input tokens
    start = len(tokenizer.tokenize(input_text))
    while start < len(predictions[0]):
        # Pick the most likely vocabulary id at this position
        predicted_index = torch.argmax(predictions[0,start]).item()
        predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])
        # Stop at the first predicted [SEP]
        if '[SEP]' in predicted_token:
            break
        result += predicted_token
        result_ids.append(predicted_index)
        start += 1
        
print("tokenized target :", tokenized_trg_text)
print("result_ids       :",result_ids)
print("result           :",result)

tokenized target : ['je', 've', '##ux', 'ache', '##ter', 'le', 'nouveau', 'mac', '##book', 'pro', 'apple', 'm1', 'pro']
result_ids       : [15333, 2310, 5602, 12336, 3334, 3393, 25272, 6097, 8654, 4013, 6207, 23290, 4013]
result           : ['je', 've', '##ux', 'ache', '##ter', 'le', 'nouveau', 'mac', '##book', 'pro', 'apple', 'm1', 'pro']
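
The result is still a list of word pieces. A minimal sketch for joining them back into a readable sentence follows. `merge_wordpieces` is a hypothetical helper written for this illustration: it simply glues every piece that starts with `##` onto the previous word.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into a sentence by attaching '##' pieces to the previous word."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: drop the ## and append
        else:
            words.append(token)     # a new word starts here
    return " ".join(words)


# Predicted tokens copied from the output above
result = ['je', 've', '##ux', 'ache', '##ter', 'le', 'nouveau',
          'mac', '##book', 'pro', 'apple', 'm1', 'pro']

print(merge_wordpieces(result))
# je veux acheter le nouveau macbook pro apple m1 pro
```

The output is lower-case because `bert-base-uncased` lower-cases its input, so the original capitalization of the target sentence cannot be recovered from the tokens.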
