Last test : 2021-01-30  
Korean Explanation : https://wikidocs.net/159246  
English Explanation : https://wikidocs.net/160289  
Github : https://github.com/RichardMinsooGo/51_Pretrained_BERT_NMT

We will use pytorch_pretrained_bert in this notebook.

In [None]:
!pip install pytorch_pretrained_bert

from IPython.display import clear_output 
clear_output()

Import the libraries that we will use, then check whether a GPU is selected.


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils import data
from pytorch_pretrained_bert import BertModel, BertForMaskedLM, BertForQuestionAnswering, BertForPreTraining

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


### 1. Data Load 
This step is not used; the input and output are defined directly in this notebook.

### 2. Build Input text, Output Text 
Define the input and output sentences.

In [None]:
input_text  = "I am a student"
target_text = "Je suis étudiant"

### Load pretrained BERT Model

In [None]:
modelpath = "bert-base-uncased"

# Load the pre-trained BERT masked-LM model (weights are downloaded on first use)
model = BertForMaskedLM.from_pretrained(modelpath)
model = model.to(device)

n_seq_length = 12

print("input_text       :", input_text)
print("target_text      :", target_text)

100%|██████████| 407873900/407873900 [00:11<00:00, 35385966.87B/s]


input_text       : I am a student
target_text      : Je suis étudiant


### 3. Preprocess  

Insert a space between words and the punctuation marks that follow them.   
Ex) "he is a boy." => "he is a boy ."   
All characters other than a-z, A-Z, ".", "?", "!", and "," are replaced with a space.

In [None]:
import unicodedata
import re

from tensorflow.keras.preprocessing.text import Tokenizer

def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
            if unicodedata.category(c) != 'Mn')

def preprocess(sent):
    # Lowercase the sentence and strip accents with the helper defined above
    sent = unicode_to_ascii(sent.lower())

    # Create spaces between words and punctuation marks.
    # Ex) "he is a boy." => "he is a boy ."
    sent = re.sub(r"([?.!,¿])", r" \1", sent)

    # Replace everything except (a-z, A-Z, ".", "?", "!", ",") with a space.
    sent = re.sub(r"[^a-zA-Z!.?,]+", r" ", sent)

    sent = re.sub(r"\s+", " ", sent)
    return sent

# Pre-processing test
en_sent = u"Have you had dinner?"
fr_sent = u"Avez-vous déjà diné?"
print(preprocess(en_sent))
print(preprocess(fr_sent).encode('utf-8'))


special_tokens = ['[CLS] [SEP] [MASK]']

# ----------------------------------------------------------
input_text  = [preprocess(input_text)]
target_text = [preprocess(target_text)]

have you had dinner ?
b'avez vous deja dine ?'


### 4. Build Vocabulary
This example uses the "Tokenizer" class from tensorflow as a word-level tokenizer. I could not easily find a word-level tokenizer in PyTorch, so I used the tensorflow one.

In [None]:
# Encoder Input Define
tokenizer = Tokenizer(filters="", lower=False)
tokenizer.fit_on_texts(special_tokens + input_text+ target_text)

print(tokenizer.word_index)
print(tokenizer.index_word)

{'[CLS]': 1, '[SEP]': 2, '[MASK]': 3, 'i': 4, 'am': 5, 'a': 6, 'student': 7, 'je': 8, 'suis': 9, 'etudiant': 10}
{1: '[CLS]', 2: '[SEP]', 3: '[MASK]', 4: 'i', 5: 'am', 6: 'a', 7: 'student', 8: 'je', 9: 'suis', 10: 'etudiant'}
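
For readers who would rather not install tensorflow for this single step, the vocabulary above can also be built with the standard library. This is a minimal sketch, not the Keras implementation: the function name `build_vocab` is hypothetical, and it assumes whitespace splitting with indexes starting at 1, ordered by descending frequency and then by first appearance.

```python
from collections import Counter

def build_vocab(texts):
    """Build word->index and index->word maps from whitespace-split texts.

    Hypothetical helper for illustration; index 0 is left unused.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    # sorted() is stable, so words with equal counts keep first-seen order
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    word_index = {word: i + 1 for i, (word, _) in enumerate(ordered)}
    index_word = {i: word for word, i in word_index.items()}
    return word_index, index_word

texts = ["[CLS] [SEP] [MASK]", "i am a student", "je suis etudiant"]
word_index, index_word = build_vocab(texts)
print(word_index)
print(index_word)
```

On these three strings the sketch gives the same mapping as the printed output above, since every token occurs exactly once.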


### 5. Tokenize 
### 7. Convert tokens to indexes
Because only one sentence is trained in this article, [CLS] and [SEP] are added to the input sentence, and [SEP] to the output sentence, before tokenizing.

This example follows the order "Tokenize" --> "Convert tokens to indexes" --> "Data Processing". You can change the order of these steps to suit your needs.

In [None]:
input_text  = [("[CLS] " + input_text[0]  + " [SEP]").split()]
target_text = [(target_text[0] + " [SEP]").split()]

len_input_text = len(input_text[0])

print("input_text       :", input_text)
print("target_text      :", target_text)
print("len_input_text   :", len_input_text)

tokenized_inp_text = tokenizer.texts_to_sequences(input_text)
tokenized_trg_text = tokenizer.texts_to_sequences(target_text)

print("tokenized input  :", tokenized_inp_text)
print("tokenized target :", tokenized_trg_text)

input_text       : [['[CLS]', 'i', 'am', 'a', 'student', '[SEP]']]
target_text      : [['je', 'suis', 'etudiant', '[SEP]']]
len_input_text   : 6
tokenized input  : [[1, 4, 5, 6, 7, 2]]
tokenized target : [[8, 9, 10, 2]]


### 6. Data Processing
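This step differs depending on the model (BERT, Transformer, GPT, T5) and on whether you are pretraining. Here, the input is filled with [MASK] up to n_seq_length, and the target indexes are placed at the first masked positions, with every other position set to an ignore index.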

In [None]:
input_text = tokenizer.texts_to_sequences(input_text)
mask_idx   = tokenizer.texts_to_sequences(['[MASK]'])
indexed_inp_tokens = input_text[0] + mask_idx[0] * (n_seq_length - len_input_text)

# -1 marks the positions that the masked-LM loss in pytorch_pretrained_bert ignores
pad_idx = -1
# Ignore the source positions, then append the target token indexes
converted_trg_inds = [pad_idx] * len_input_text
indexed_trg_tokens = tokenizer.texts_to_sequences(target_text)[0]
converted_trg_inds += indexed_trg_tokens

# Fill the remaining positions up to n_seq_length with the ignore index
converted_trg_inds += [pad_idx] * (n_seq_length - len(converted_trg_inds))

print("Input (Tokenized and indexed)  :\n", indexed_inp_tokens)
print("Output (Tokenized and indexed) :\n", converted_trg_inds)

Input (Tokenized and indexed)  :
 [1, 4, 5, 6, 7, 2, 3, 3, 3, 3, 3, 3]
Output (Tokenized and indexed) :
 [-1, -1, -1, -1, -1, -1, 8, 9, 10, 2, -1, -1]
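
The same layout can be wrapped in a small function so it is easier to reuse with other sentence pairs. This is a sketch under assumptions: the name `build_mlm_pair` and its parameters are hypothetical, and it works on plain Python lists of indexes rather than on the tokenizer output directly.

```python
def build_mlm_pair(src_ids, trg_ids, mask_id, seq_len, ignore_id=-1):
    """Return (inputs, labels) lists of length seq_len.

    inputs: source indexes followed by mask_id up to seq_len.
    labels: ignore_id over the source, then the target indexes,
            then ignore_id for the remaining positions.
    """
    if len(src_ids) + len(trg_ids) > seq_len:
        raise ValueError("source + target do not fit in seq_len")
    n_tail = seq_len - len(src_ids) - len(trg_ids)
    inputs = src_ids + [mask_id] * (seq_len - len(src_ids))
    labels = [ignore_id] * len(src_ids) + trg_ids + [ignore_id] * n_tail
    return inputs, labels

inputs, labels = build_mlm_pair(
    src_ids=[1, 4, 5, 6, 7, 2],   # [CLS] i am a student [SEP]
    trg_ids=[8, 9, 10, 2],        # je suis etudiant [SEP]
    mask_id=3,
    seq_len=12,
)
print(inputs)
print(labels)
```

The explicit length check is a design choice for the sketch: without it, a pair that is too long would silently produce sequences longer than the fixed length.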


### 8. Convert indexes to tensors  
Convert the indexes created in the previous steps to tensors.
Keep in mind that deep learning models take batched tensors as input, so each index list is wrapped in an outer list to add a batch dimension.

In [None]:
tensors_src = torch.tensor([indexed_inp_tokens]).to(device)
tensors_trg = torch.tensor([converted_trg_inds]).to(device)
print(tensors_src)
print(tensors_trg)

tensor([[1, 4, 5, 6, 7, 2, 3, 3, 3, 3, 3, 3]], device='cuda:0')
tensor([[-1, -1, -1, -1, -1, -1,  8,  9, 10,  2, -1, -1]], device='cuda:0')
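
The -1 entries in the target tensor do not contribute to the loss. The pure-Python sketch below illustrates the idea of a cross-entropy that averages only over positions with a real label; it is not the library's code, and the function name `masked_cross_entropy` and the toy logits are assumptions made for illustration.

```python
import math

def masked_cross_entropy(logits, labels, ignore_id=-1):
    """Mean negative log-likelihood over positions whose label != ignore_id."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_id:
            continue  # ignored positions add nothing to the loss
        # Numerically stable log-sum-exp of the row
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0

# Three positions over a toy vocabulary of size 3; only the middle one is labeled
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
labels = [-1, 1, -1]
loss = masked_cross_entropy(logits, labels)
print(loss)
```

Changing the logits at the ignored positions leaves `loss` unchanged, which is why only the translated tokens drive the training signal in this notebook.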


### The rest is a normal training process

In [None]:
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-7)
# optimizer = torch.optim.SGD(model.parameters(), lr = 5e-5, momentum=0.9)
optimizer = torch.optim.Adamax(model.parameters(), lr = 5e-5)

num_epochs = 1200

model.train()
for i in range(num_epochs):
    loss = model(tensors_src, masked_lm_labels=tensors_trg)
    eveloss = loss.mean().item()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if (i+1)%10 == 0:
        print("step "+ str(i+1) + " : " + str(eveloss))


step 10 : 5.930408000946045
step 20 : 4.585104465484619
step 30 : 3.5422909259796143
step 40 : 2.747951030731201
step 50 : 2.2070822715759277
step 60 : 1.9274042844772339
step 70 : 1.7657920122146606
step 80 : 1.6707193851470947
step 90 : 1.599414587020874
step 100 : 1.5259640216827393
step 110 : 1.5084165334701538
step 120 : 1.5720945596694946
step 130 : 1.5188556909561157
step 140 : 1.4818816184997559
step 150 : 1.5042755603790283
step 160 : 1.4471206665039062
step 170 : 1.447527289390564
step 180 : 1.4522638320922852
step 190 : 1.4283106327056885
step 200 : 1.4422211647033691
step 210 : 1.4621987342834473
step 220 : 1.4405546188354492
step 230 : 1.4442800283432007
step 240 : 1.4225499629974365
step 250 : 1.3906773328781128
step 260 : 1.4254376888275146
step 270 : 1.4201207160949707
step 280 : 1.426276683807373
step 290 : 1.4335026741027832
step 300 : 1.4516173601150513
step 310 : 1.3820154666900635
step 320 : 1.4101217985153198
step 330 : 1.407749891281128
step 340 : 1.3969225883483