Last test : 2021-01-30  
한국어 설명 : https://wikidocs.net/159246  
English Explanation : https://wikidocs.net/160289  
Github : https://github.com/RichardMinsooGo/51_Pretrained_BERT_NMT

We will use pytorch_pretrained_bert in this notebook.

In [1]:
!pip install pytorch_pretrained_bert

from IPython.display import clear_output 
clear_output()

Import the libraries we will use, then check whether a GPU is available.


In [2]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils import data
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM, BertForQuestionAnswering, BertForPreTraining

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


### 1. Data Load 
We will load the English-French data provided by www.manythings.org. Of the several ways to retrieve data from the web, this article uses wget; the alternative, urllib.request, is left in the code as a commented-out line.

Because some sites block particular download commands, it helps to be familiar with more than one method.
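
The urllib.request route can be sketched as follows. This is a minimal sketch, assuming the server rejects the default Python user agent; the helper name `download_and_extract` and the `User-Agent` value are illustrative and not part of the original notebook.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, dest_dir=".", user_agent="Mozilla/5.0"):
    """Download a zip archive from `url`, extract it into `dest_dir`,
    and return the names of the extracted members."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Name the local archive after the last path component of the URL.
    archive_path = dest / url.rsplit("/", 1)[-1]

    # Send an explicit User-Agent header (assumption: the site filters on it).
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response, open(archive_path, "wb") as out:
        shutil.copyfileobj(response, out)

    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Usage (requires network access):
# download_and_extract("http://www.manythings.org/anki/fra-eng.zip")
```

Streaming the response with `shutil.copyfileobj` avoids holding the whole archive in memory.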

In [3]:
import pandas as pd
# import sentencepiece as spm
import urllib.request
import csv

# urllib.request.urlretrieve("http://www.manythings.org/anki/fra-eng.zip", filename="fra-eng.zip")
! wget http://www.manythings.org/anki/fra-eng.zip

! unzip fra-eng.zip

--2022-01-31 01:05:27--  http://www.manythings.org/anki/fra-eng.zip
Resolving www.manythings.org (www.manythings.org)... 172.67.186.54, 104.21.92.44, 2606:4700:3030::6815:5c2c, ...
Connecting to www.manythings.org (www.manythings.org)|172.67.186.54|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6532197 (6.2M) [application/zip]
Saving to: ‘fra-eng.zip’


2022-01-31 01:05:29 (7.72 MB/s) - ‘fra-eng.zip’ saved [6532197/6532197]

Archive:  fra-eng.zip
  inflating: _about.txt              
  inflating: fra.txt                 


### 2. Build Input text, Output Text 
The code below uses pandas to turn the input and output sentences into lists.  
How to use pandas will be covered in a separate article.  
Here we only use pandas to confirm that the input/output data has been converted into list format.
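
For readers who have not used pandas, the same filtering idea can be sketched in plain Python. This is only an illustration: `load_pairs` is a hypothetical helper, the sample rows are toy data, and the 8 to 16 word window mirrors the `7 < eng_len < 17` condition used in the notebook.

```python
import csv
import io


def load_pairs(tsv_text, min_words=8, max_words=16):
    """Return (english, french) pairs whose English side has
    between min_words and max_words whitespace-separated words."""
    pairs = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip malformed lines
        english, french = row[0], row[1]
        if min_words <= len(english.split()) <= max_words:
            pairs.append((english, french))
    return pairs


# Toy rows in the same tab-separated layout as fra.txt (english, french, attribution).
sample = (
    "Go.\tVa !\tattribution\n"
    "Do you need a hand with your suitcases?\t"
    "as tu besoin d un coup de main avec tes valises ?\tattribution\n"
)

pairs = load_pairs(sample)
raw_encoder_input = [english for english, _ in pairs]
raw_data_fr = [french for _, french in pairs]
print(raw_encoder_input)
print(raw_data_fr)
```

The one-word pair is filtered out, and the eight-word sentence is kept.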

In [4]:
total_df = pd.read_csv('fra.txt', sep="\t", header=None)

# total_df = total_df.sample(frac=1)  # shuffle all rows
# total_df = total_df[:20000]
total_df[:5]

total_df.rename(columns={0: 'english', 1: 'french', 2: 'speaker'}, inplace=True)

total_df[:5]

print('Translation Pair :',len(total_df)) # print the number of translation pairs

total_df["eng_len"] = ""
total_df["fra_len"] = ""
total_df.head()

import sys
for idx in range(len(total_df['english'])):
    # initialize string
    text_eng = str(total_df.iloc[idx]['english'])

    # default separator: space
    result_eng = len(text_eng.split())
    total_df.at[idx, 'eng_len'] = int(result_eng)

    text_fra = str(total_df.iloc[idx]['french'])
    # default separator: space
    result_fra = len(text_fra.split())
    total_df.at[idx, 'fra_len'] = int(result_fra)

# Select the eng_len column.
# Compare its values against the length bounds.
# Assign the resulting boolean mask to a new variable.
is_within_len = ( 7 < total_df['eng_len']) & ( total_df['eng_len']<17)

# Keep only the rows that satisfy the condition and store them back in total_df.
total_df = total_df[is_within_len]

# Display the result.
total_df.head()

# n_samples = 43693
n_samples = 256
total_df = total_df.sample(n=n_samples, # number of items from axis to return.
          random_state=1234) # seed for random number generator for reproducibility
len(total_df)

with open('corpus_src.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(total_df['english']))

with open('corpus_trg.txt', 'w', encoding='utf8') as f:
    f.write('\n'.join(total_df['french']))

raw_encoder_input = total_df['english'].tolist()
raw_data_fr = total_df['french'].tolist()

print(raw_encoder_input)
print(raw_data_fr)

Translation Pair : 192341
['Do you need a hand with your suitcases?', 'What do you want to eat this weekend?', "I just can't get used to taking orders from Tom.", "We don't even know who Tom got married to.", 'I really need to talk to you privately.', 'How much are you being paid to do this?', 'We have some difficult problems that we need to deal with.', 'I found what you were looking for in the trunk of my car.', 'Several students in the back of the classroom were sleeping.', 'A new team was formed in order to take part in the boat race.', 'I read more today than I did yesterday.', "I'd be grateful if you could take a look when you've got time sometime.", "No matter how rich you are, you can't buy true love.", 'I can remember when you were just a little boy.', 'All things considered, he is a good teacher.', 'It is difficult for beginners to enjoy windsurfing.', "Just promise me that you won't do anything stupid.", 'He is generally at home in the evening.', 'The girl was visibly shaken

### 3. Preprocess  

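Insert a space between words and punctuation marks.   
Ex) "he is a boy." => "he is a boy ."   
Every character other than a-z, A-Z, ".", "?", and "!" is then replaced with a space. Note that the regex used below does not keep commas, so they are dropped as well.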
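
To see what each step contributes, the pipeline can be unrolled so that every intermediate string is visible. This sketch uses the same regular expressions as the notebook's `preprocess`; the helper name `preprocess_steps` is made up for illustration.

```python
import re
import unicodedata


def preprocess_steps(sentence):
    """Return every intermediate string produced by the cleaning pipeline."""
    lowered = sentence.lower()
    # Decompose accented characters (NFD) and drop the combining marks.
    no_accents = "".join(
        ch for ch in unicodedata.normalize("NFD", lowered)
        if unicodedata.category(ch) != "Mn"
    )
    # Put a space in front of each punctuation mark.
    spaced = re.sub(r"([?.!,¿])", r" \1", no_accents)
    # Replace runs of anything that is not a letter, "!", "." or "?" with one space.
    letters_only = re.sub(r"[^a-zA-Z!.?]+", " ", spaced)
    # Collapse any remaining whitespace runs.
    collapsed = re.sub(r"\s+", " ", letters_only)
    return [lowered, no_accents, spaced, letters_only, collapsed]


for stage in preprocess_steps("Well, he is a boy."):
    print(repr(stage))
# The final stage is 'well he is a boy .': the comma is gone, the period is kept.
```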

In [5]:
import unicodedata
import re

from tensorflow.keras.preprocessing.text import Tokenizer

def unicode_to_ascii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
            if unicodedata.category(c) != 'Mn')

def preprocess(sent):
    # Internally call the function implemented above
    sent = unicode_to_ascii(sent.lower())

    # Insert a space between words and punctuation marks.
    # Ex) "he is a boy." => "he is a boy ."
    sent = re.sub(r"([?.!,¿])", r" \1", sent)

    # Replace every character except (a-z, A-Z, ".", "?", "!") with a space; commas are dropped as well.
    sent = re.sub(r"[^a-zA-Z!.?]+", r" ", sent)

    sent = re.sub(r"\s+", " ", sent)
    return sent

# Encoding test
en_sent = u"Have you had dinner?"
fr_sent = u"Avez-vous déjà diné?"

print(preprocess(en_sent))
print(preprocess(fr_sent).encode('utf-8'))

have you had dinner ?
b'avez vous deja dine ?'


### Build Input/Output text data
To turn the input and output sentences into batch data, preprocess the raw data, collect the results into lists, and print them.

In [6]:
input_text = ['[CLS] ' + preprocess(data) + ' [SEP]' for data in raw_encoder_input]
target_text = [preprocess(data) for data in raw_data_fr]

print(input_text[:5])
print(target_text[:5])

['[CLS] do you need a hand with your suitcases ? [SEP]', '[CLS] what do you want to eat this weekend ? [SEP]', '[CLS] i just can t get used to taking orders from tom . [SEP]', '[CLS] we don t even know who tom got married to . [SEP]', '[CLS] i really need to talk to you privately . [SEP]']
['as tu besoin d un coup de main avec tes valises ?', 'qu est ce que vous voulez manger ce week end ?', 'je n arrive simplement pas a m habituer a recevoir des ordres de tom .', 'on ne sait meme pas qui tom a epouse .', 'il me faut vraiment vous parler en prive .']


### Load pretrained BERT Model
Load the pretrained BERT model and check whether the input/output data is correctly created.  
The input and output sentences here are longer than in the previous article, and the tokenizer splits many words, French ones in particular, into several sub-tokens, so the input/output sequence length is set to 80.
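
Why French text inflates the sequence length can be illustrated with a toy greedy longest-match splitter in the WordPiece style. This is a sketch under stated assumptions: the six-entry vocabulary is hypothetical, the real tokenizer's vocabulary is far larger, and `wordpiece` is not the library's implementation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split `word` into sub-tokens by repeatedly taking the longest
    prefix found in `vocab`; continuation pieces carry a '##' prefix."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matched at this position
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabulary: the English word is present whole, the French words are not.
toy_vocab = {"need", "be", "##so", "##in", "man", "##ger"}

print(wordpiece("need", toy_vocab))    # ['need']
print(wordpiece("besoin", toy_vocab))  # ['be', '##so', '##in']
print(wordpiece("manger", toy_vocab))  # ['man', '##ger']
```

A vocabulary built mostly from English text behaves similarly: English words tend to stay whole while French words break into several pieces, so the target side needs more positions than its word count suggests.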

In [7]:
# Name of the pre-trained model to load
modelpath = "bert-base-uncased"

# Load the pre-trained masked-language model and move it to the selected device
model = BertForMaskedLM.from_pretrained(modelpath)
model = model.to(device)

n_seq_length = 80

100%|██████████| 407873900/407873900 [00:18<00:00, 22104357.63B/s]


### 4. Build Vocabulary
In the case of BertTokenizer, there is no need to create a dedicated vocabulary. It has its own built-in vocabulary, so you only need to define a tokenizer.

In [8]:
tokenizer = BertTokenizer.from_pretrained(modelpath)

100%|██████████| 231508/231508 [00:00<00:00, 683770.96B/s]


### 5. Tokenize 
Tokenizing works the same way as in the previous article, where only one sentence was learned. The only difference is that, because there are now several sentences, each one is tokenized inside a loop.

### 6. Data Processing
In this article, "6. Data Processing" and "7. Convert tokens to indexes" are done simultaneously.

### 7. Convert tokens to indexes
As explained above, these steps are not carried out in strict order. The conversion itself is the same as in the previous article, where only one sentence was learned.
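
The layout produced by steps 6 and 7 can be sketched with plain lists. The IDs 101, 102, and 103 are the `[CLS]`, `[SEP]`, and `[MASK]` IDs visible in the printed tensors further down; the target IDs 7 and 8 and the helper name `build_example` are made up for illustration.

```python
def build_example(src_ids, trg_ids, n_seq_length,
                  mask_id=103, sep_id=102, ignore_index=-1):
    """Build one (inputs, labels) row of length n_seq_length.

    inputs: source ids followed by [MASK] ids.
    labels: ignore_index over the source span, then target ids and [SEP],
            then ignore_index up to the end.
    """
    if len(src_ids) + len(trg_ids) + 1 > n_seq_length:
        raise ValueError("source + target + [SEP] does not fit in n_seq_length")

    inputs = src_ids + [mask_id] * (n_seq_length - len(src_ids))

    labels = [ignore_index] * len(src_ids) + trg_ids + [sep_id]
    labels += [ignore_index] * (n_seq_length - len(labels))
    return inputs, labels


inputs, labels = build_example([101, 2057, 102], [7, 8], n_seq_length=8)
print(inputs)  # [101, 2057, 102, 103, 103, 103, 103, 103]
print(labels)  # [-1, -1, -1, 7, 8, 102, -1, -1]
```

Only the positions that hold target IDs or the closing `[SEP]` contribute to the loss; every position labeled -1 is skipped.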

### 8. Convert indexes to tensors 
Convert the indexes created in Step 7 to tensors.  
Keep in mind that in deep learning, batch tensors are given as input.  
When creating input/output tokens from multiple sentences, you need to create tensors that contain all of the data. The process is expressed as follows.  

In [9]:
for idx in range(len(input_text)):

    # 5. Tokenize
    tokenized_inp_text = tokenizer.tokenize(input_text[idx])
    tokenized_trg_text = tokenizer.tokenize(target_text[idx])
    len_input_text = len(tokenized_inp_text)
    
    # 6. Data Processing & 7. Convert tokens to indexes
    # Processing for model
    for _ in range(n_seq_length-len(tokenized_inp_text)):
        tokenized_inp_text.append('[MASK]')

    indexed_inp_tokens = tokenizer.convert_tokens_to_ids(tokenized_inp_text)

    pad_idx = -1  # label value ignored by the masked-LM loss in pytorch_pretrained_bert (ignore_index=-1)
    converted_trg_inds = []
    converted_trg_inds = [pad_idx] * len_input_text
    
    indexed_trg_tokens = tokenizer.convert_tokens_to_ids(tokenized_trg_text)
    tmp_trg_tensors   = torch.tensor([indexed_trg_tokens])
    converted_trg_inds += tmp_trg_tensors[0].tolist()
    
    converted_trg_inds.append(tokenizer.convert_tokens_to_ids(['[SEP]'])[0])

    for _ in range(n_seq_length-len(converted_trg_inds)):
        converted_trg_inds.append(pad_idx)

    # 8. Convert indexes to tensors
    src_tensor = torch.tensor([indexed_inp_tokens]).to(device)
    trg_tensor = torch.tensor([converted_trg_inds]).to(device)

    # With multiple sentences, concatenate each row into one tensor that holds the whole dataset.
    if idx == 0:
        tensors_src = src_tensor
    else :
        tensors_src = torch.cat((tensors_src, src_tensor), 0)

    if idx == 0:
        tensors_trg = trg_tensor
    else :
        tensors_trg = torch.cat((tensors_trg, trg_tensor), 0)


### 9. Build batches 
This is the part this article covers in detail. The code below follows the usual PyTorch batching procedure.

The dataset consists of the source tensors and the target tensors.
A DataLoader is then configured simply by passing that dataset, the batch size, and whether to shuffle.
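
Conceptually, a shuffling data loader does roughly the following. This is a plain-Python sketch of the idea, not PyTorch's implementation; `iter_batches` and its `seed` parameter are hypothetical.

```python
import random


def iter_batches(items, batch_size, shuffle=True, seed=None):
    """Yield consecutive batches of `items`, optionally in shuffled order.
    The last batch is smaller when len(items) is not a multiple of batch_size."""
    order = list(range(len(items)))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [items[i] for i in order[start:start + batch_size]]


# 256 samples with batch_size=64 give four full batches, as in the notebook.
sizes = [len(batch) for batch in iter_batches(list(range(256)), 64, seed=0)]
print(sizes)  # [64, 64, 64, 64]
```

Each sample appears exactly once per pass, which is what makes one pass over the batches an epoch.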

In [10]:
from torch.utils.data import TensorDataset   # tensor dataset
from torch.utils.data import DataLoader      # data loader

batch_size = 64
dataset = TensorDataset(tensors_src, tensors_trg)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

### The rest is the usual batch training process
The training process follows the usual PyTorch batch-training loop.  
It is almost identical to the previous articles, with the following differences:  
* The number of batches is defined; it is used to compute the average loss per epoch.  
* Because the dataloader yields n batches, training iterates over them in a for loop.
* The loss must be accumulated over the batches of each epoch.
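
The per-epoch bookkeeping described above can be sketched without a model. The helper names and loss values are illustrative. The point is that the number of batches is the ceiling of samples divided by batch size, because a DataLoader keeps the final partial batch by default, so plain division only gives the right count when the dataset size is a multiple of the batch size.

```python
import math


def count_batches(n_samples, batch_size):
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(n_samples / batch_size)


def epoch_mean_loss(batch_losses):
    """Average of the per-batch losses collected during one epoch."""
    return sum(batch_losses) / len(batch_losses)


print(count_batches(256, 64))  # 4
print(count_batches(250, 64))  # 4, whereas 250 / 64 is 3.90625

# Made-up per-batch losses for one epoch.
print(epoch_mean_loss([4.0, 2.0, 3.0, 3.0]))  # 3.0
```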

In [11]:
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-7)
# optimizer = torch.optim.SGD(model.parameters(), lr = 5e-5, momentum=0.9)
optimizer = torch.optim.Adamax(model.parameters(), lr = 5e-5)

num_epochs = 300

model.train()
# Defines number of batches. It is used to calculate the loss per epoch.
n_batches = len(dataset)/ batch_size

for i in range(num_epochs):
    
    # Since there are n batches in the dataloader, learning is carried out by batch using "for loop".
    epoch_loss = 0
    for batch_idx, samples in enumerate(dataloader):
        x_train, y_train = samples
        loss = model(x_train, masked_lm_labels=y_train)
        eveloss = loss.mean().item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # A process of finding the loss for each epoch is required.
        epoch_loss += eveloss / n_batches

    if (i+1)%10 == 0:
        print("step "+ str(i+1) + " : " + str(eveloss))

print(tensors_src[6])
test_list = tensors_src[6].tolist()
test_tokens_tensor = torch.tensor([test_list]).to(device)
print(test_tokens_tensor)

step 10 : 5.690773010253906
step 20 : 5.026670932769775
step 30 : 4.567258358001709
step 40 : 4.079259872436523
step 50 : 3.5920872688293457
step 60 : 3.1397042274475098
step 70 : 2.668572425842285
step 80 : 2.3047728538513184
step 90 : 2.047567129135132
step 100 : 1.8465665578842163
step 110 : 1.607183575630188
step 120 : 1.47799551486969
step 130 : 1.3569228649139404
step 140 : 1.2507659196853638
step 150 : 1.120534896850586
step 160 : 1.0619661808013916
step 170 : 0.9062176942825317
step 180 : 0.8478437662124634
step 190 : 0.7310414910316467
step 200 : 0.5976481437683105
step 210 : 0.47876372933387756
step 220 : 0.42999616265296936
step 230 : 0.3656133711338043
step 240 : 0.2720927894115448
step 250 : 0.26450079679489136
step 260 : 0.19397728145122528
step 270 : 0.1910354644060135
step 280 : 0.14040949940681458
step 290 : 0.14493447542190552
step 300 : 0.09570730477571487
tensor([ 101, 2057, 2031, 2070, 3697, 3471, 2008, 2057, 2342, 2000, 3066, 2007,
        1012,  102,  103,  103, 

### Inference
Using the model trained above, pick one of the training examples and test it. Apart from selecting the sentence, this process is the same as in the previous article, "51.2 Single Sentence with BERT Tokenizer".
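
The decoding loop used for inference is a greedy read-out: starting at the first position after the source sentence, take the highest-scoring token at each position and stop at `[SEP]`. The sketch below shows that logic on toy data; the four-token vocabulary, the scores, and the helper name `greedy_decode` are all made up.

```python
def greedy_decode(score_rows, start, id_to_token, sep_token="[SEP]"):
    """Pick the best-scoring token at each position from `start` onward,
    stopping at the first predicted separator."""
    tokens = []
    for row in score_rows[start:]:
        best_id = max(range(len(row)), key=row.__getitem__)  # argmax over the vocabulary
        token = id_to_token[best_id]
        if token == sep_token:
            break
        tokens.append(token)
    return tokens


id_to_token = ["[SEP]", "nous", "avons", "faim"]  # toy vocabulary
score_rows = [
    [0.9, 0.0, 0.1, 0.0],    # source position, skipped
    [0.1, 0.2, 0.3, 0.4],    # source position, skipped
    [0.0, 0.7, 0.2, 0.1],    # argmax 1
    [0.1, 0.1, 0.6, 0.2],    # argmax 2
    [0.8, 0.1, 0.05, 0.05],  # argmax 0, the separator: stop here
    [0.0, 0.0, 0.1, 0.9],    # never reached
]

print(greedy_decode(score_rows, start=2, id_to_token=id_to_token))  # ['nous', 'avons']
```

Every position is predicted independently in a single forward pass, so nothing is fed back into the model between tokens.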

In [12]:
print(tensors_src[6])
test_list = tensors_src[6].tolist()
test_tokens_tensor = torch.tensor([test_list]).to(device)
print(test_tokens_tensor)

result = []
result_ids = []
model.eval()
with torch.no_grad():
    predictions = model(test_tokens_tensor)

    start = len(tokenizer.tokenize(input_text[6]))
    count = 0
    while start < len(predictions[0]):
        predicted_index = torch.argmax(predictions[0,start]).item()
        
        predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])
        if '[SEP]' in predicted_token:
            break
        if count == 0:
            result = predicted_token
            result_ids = [predicted_index]
        else:
            result+= predicted_token
            result_ids+= [predicted_index]

        count += 1
        start += 1
print("input_text       :", input_text[6])
print("target_text      :", target_text[6])
print("tokenized target :", tokenizer.tokenize(target_text[6]))
print("result_ids       :",result_ids)
print("result           :",result)

tensor([ 101, 2057, 2031, 2070, 3697, 3471, 2008, 2057, 2342, 2000, 3066, 2007,
        1012,  102,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
         103,  103,  103,  103,  103,  103,  103,  103], device='cuda:0')
tensor([[ 101, 2057, 2031, 2070, 3697, 3471, 2008, 2057, 2342, 2000, 3066, 2007,
         1012,  102,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
          103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
          103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
          103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,  103,
          103,  103,  103,  103,  103,  1