## Overall Implementation Process and Analysis

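**1. Data processing** <br>
  1.1 Preprocessing the English and Korean data
- English data: parsed with NLTK's BracketParseCorpusReader module
- Korean data: the word order is rearranged using the phrase tags, and then only the morphemes are extracted. The rearrangement reverses the order of the morphemes that follow 'NP_SBJ' (the subject noun phrase); when a sentence has no 'NP_SBJ', all of its morphemes are reversed.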
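The reordering rule for the Korean data can be sketched as follows. This is a minimal illustration rather than the notebook's full preprocessing: the function name `reorder_after_subject` and the toy sentence are assumptions made for the example, and the input is assumed to be a list of (phrase tag, morphemes) tuples.

```python
def reorder_after_subject(tagged):
    """Keep everything up to the first NP_SBJ item and reverse the rest.

    `tagged` is a list of (phrase_tag, morpheme_string) tuples.
    If no item is tagged NP_SBJ, the whole list is reversed.
    """
    tags = [tag for tag, _ in tagged]
    if "NP_SBJ" in tags:
        cut = tags.index("NP_SBJ") + 1
        return tagged[:cut] + tagged[cut:][::-1]
    return tagged[::-1]


# Hypothetical toy sentence: "나는 학교에 간다" (I go to school)
toy = [
    ("NP_SBJ", "나/NP|는/JX"),
    ("NP_AJT", "학교/NNG|에/JKB"),
    ("VP", "가/VV|ㄴ다/EF"),
]
reordered = reorder_after_subject(toy)
print([morphemes for _, morphemes in reordered])
```

In the toy sentence the verb phrase moves directly after the subject, which is closer to English subject-verb-object order.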

1.2 Build English-Korean parallel sentence pairs, then shuffle them.

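**2. Performance by embedding** <br>
Both models that require static embeddings, the packed-padded encoder decoder model and the convolutional sequence to sequence model, performed best with pretrained word2vec embeddings for the Korean data and pretrained fasttext embeddings for the English data.
- Korean pretrained word2vec embeddings: reused as-is from the midterm assignment materials
- English pretrained fasttext embeddings: the embeddings provided by torchtext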

**3. Hyperparameters by model** <br>
3.1. packed-padded encoder decoder model
- Batch size: 128
- Encoder, decoder dropout rate: 0.5
- Teacher forcing rate: 0.5
- Learning rate(Adam Optimizer): 0.001
- Epoch size: 5
- Clip: 1 <br>

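Training for more than 5 epochs led to overfitting. Raising the teacher forcing rate increased both perplexity and BLEU score, which is a trade-off because higher perplexity is worse while higher BLEU is better. Lowering the clip value made perplexity diverge at epoch 5.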

3.2. convolutional sequence to sequence model
- Batch size: 128
- Encoder, decoder dropout rate: 0.25
- Learning rate(Adam Optimizer): 0.001
- Epoch size: 5
- Clip: 1 <br>

Training for more than 5 epochs led to overfitting. Raising the dropout rate to 0.3 increased perplexity and lowered the BLEU score, and raising it further made perplexity diverge. Lowering the clip value made perplexity diverge at epoch 5.

3.3. transformers from scratch
- Batch size: 128
- Encoder, decoder dropout rate: 0.1
- Encoder, decoder layers: 3 each
- Encoder, decoder heads: 8 each
- Dimensionality of the layers and the pooler layer: 512
- Learning rate: 0.0005

**4. Train, validation, and test results** <br>
4.1. packed-padded encoder decoder model
- Train perplexity: 4.042
- Validation perplexity: 10.193
- Test perplexity: 10.117

4.2. convolutional sequence to sequence model
- Train perplexity: 5.053
- Validation perplexity: 3.858
- Test perplexity: 3.803

4.3. transformers from scratch
- Train perplexity: 2.596
- Validation perplexity: 2.746
- Test perplexity: 2.770
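The perplexity values above are the exponential of the average cross-entropy loss, as in the notebook's `math.exp(train_loss)` calls. The sketch below shows that relationship on made-up numbers; the function name `perplexity` and the toy log-probabilities are assumptions for illustration only.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities a model assigned to four reference tokens
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
ppl = perplexity(log_probs)
print(round(ppl, 3))

# The same conversion applied to a loss value: a loss of 2.314 gives a perplexity of about 10.1
ppl_from_loss = math.exp(2.314)
print(round(ppl_from_loss, 1))
```

A perplexity of about 10 can be read loosely as the model being as uncertain as if it were choosing uniformly among roughly 10 tokens at each step. The notebook averages the loss per batch rather than per token over the whole corpus, so its values may differ slightly from a strict per-token computation.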

**5. BLEU score** <br>
5.1 packed-padded encoder decoder model: 42.38 <br>
5.2 convolutional sequence to sequence model: 38.11 <br>
5.3 transformers from scratch: 43.92 <br>
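For readers unfamiliar with the metric, here is a simplified, standalone sketch of corpus-level BLEU. It is not the `torchtext` implementation the notebook calls; it assumes exactly one reference per candidate, uniform weights over 1-grams to 4-grams, and no smoothing. The names `ngrams` and `corpus_bleu` are introduced for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with one reference per candidate and uniform weights."""
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            ref_counts = ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often as it occurs in the reference
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            totals[n - 1] += sum(cand_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, a missing n-gram order zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes candidates that are shorter than the references
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * math.exp(log_precision)


reference = "the cat sat on the red mat".split()
candidate = "the cat sat on the mat".split()
score = corpus_bleu([candidate], [reference])
print(f"BLEU = {score * 100:.2f}")
```

The scores listed above are this kind of value multiplied by 100, so 43.92 corresponds to a raw score of 0.4392.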

**6. Model comparison**
- Of the three models, transformers was best on both test perplexity and BLEU score.
- At inference time, feeding the user input in its original word order, without rearranging it, gave better results.






## Code

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### 1. Installing the required packages

In [None]:
!pip install --upgrade git+https://github.com/pytorch/text #upgrading torchtext for colab
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.data import Field, BucketIterator
from torchtext.datasets import TranslationDataset
import torchtext.vocab as vocab


import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import spacy
import numpy as np
import pandas as pd
from tqdm import tqdm_notebook

import re
import random
import math
import time

Successfully installed torchtext-0.9.0a0+ec413ff


In [None]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### 2. Loading and preprocessing the data

### 2.1 Korean data: includes reversing the word order to match English

In [None]:
#Load the Korean data and process it with regular expressions
ko_path = "/content/drive/MyDrive/dataset/ko-en.ko.parse"

with open (ko_path, 'r', encoding='utf-8') as f:
  data = f.read()
  contents = re.findall('<id.*?/id>', data, re.S)
  sentences = []
  for c in contents:
    pattern = '<.+?>'
    sent = re.sub(pattern, '', c)
    sent = re.sub('\n', '\t', sent)
    sentences.append(sent.strip())

In [None]:
#Split each entry into its tab-separated columns, and drop the empty strings that appear when one id holds two or more sentences
tabs_rm_b = []

for sents in sentences:

  split_sent = sents.split('\t')

  #remove empty strings, if any
  split_sent = [col for col in split_sent if col != '']

  tabs_rm_b.append(split_sent)

In [None]:
#Convert to a DataFrame
ko_df = pd.DataFrame(tabs_rm_b)

In [None]:
#For each word, take the third column (phrase tag) and the fourth column (morphemes with POS tags) and store them as a tuple: (phrase tag, morphemes with POS tags)
ko_corpus_list = []

for i in range (0, len(ko_df.index)):
  row = ko_df.loc[i,:].dropna()
  length = len(row)

  j = 2

  ko_sen_list = []

  while j <= length-1:
    index = row[j]
    word = row[j+1]

    ko_tup = (index,word)

    ko_sen_list.append(ko_tup)

    j = j + 4

  ko_corpus_list.append(ko_sen_list)

In [None]:
#Reorder each Korean sentence. If it contains NP_SBJ (subject noun phrase), reverse the morphemes after NP_SBJ; otherwise reverse all morphemes
ordered_ko_corpus = []

for i in range(0, len(ko_corpus_list)):
  sen = ko_corpus_list[i]
  sen_mi_list = []
  
  for n in range(0, len(sen)):
    sen_morph_index = sen[n][1]
    sen_mi_list.append(sen_morph_index)

  #If the entry holds two or more sentences, split it at each SF tag (period, question mark, exclamation mark)

  sf_index = [n for n, morph in enumerate(sen_mi_list) if 'SF' in morph] #positions of every SF-tagged item

  if len(sf_index) >= 2: 

    multi_sen = []

    a = 0 #start position of the current sub-sentence

    for sf_i in sf_index:

      raw_multi_sen = sen[a:sf_i+1]

      #look for NP_SBJ only among the phrase tags of this sub-sentence
      sen_index_list = [tag for tag, _ in raw_multi_sen]

      if 'NP_SBJ' in sen_index_list:
        ns_index = sen_index_list.index('NP_SBJ')

        raw_sen = raw_multi_sen[:ns_index+1]

        reverse_sen = raw_multi_sen[ns_index+1:][::-1]

        new_sen = raw_sen + reverse_sen

      else:
        new_sen = raw_multi_sen[::-1]
      
      multi_sen.append(new_sen)

      a = sf_i + 1 #the next sub-sentence starts right after this SF
    
    new_sen = [item for uni in multi_sen for item in uni]

  else: 
    sen_index_list = []
    
    for j in range(0, len(sen)):
      sen_index = sen[j][0]
      sen_index_list.append(sen_index)
      
    if 'NP_SBJ' in sen_index_list:
      ns_index = sen_index_list.index('NP_SBJ')

      raw_sen = [word for word in sen[:ns_index+1]]

      reverse_sen = [word for word in sen[ns_index+1:]]
      reverse_sen = reverse_sen[::-1]

      new_sen = raw_sen + reverse_sen

    else:
      new_sen = sen[::-1]
  
  ordered_ko_corpus.append(new_sen)

In [None]:
#Strip the POS tags from the reordered data, keeping only the morphemes
clean_ko_corpus = []

for sen in ordered_ko_corpus:

  clean_1 = []
  clean_2 = []

  for i in range(0, len(sen)):
    word = sen[i][1]

    if '|' in word:
      new_word = word.split('|')
      for w in new_word:
        clean_1.append(w)

    else:
      clean_1.append(word)

  
  for token in clean_1:
    new_token = token.split('/')
    clean_2.append(new_token[0])

  #join once per sentence, after all morphemes have been collected
  clean_sen = " ".join(clean_2)
  
  clean_ko_corpus.append(clean_sen)
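The tag-stripping step can be tried on a small example. The sketch below is a condensed version of the same idea under the assumption that morphemes are joined with `|` and each carries a `/TAG` suffix; the function name `strip_pos_tags` and the sample data are hypothetical.

```python
def strip_pos_tags(eojeol_list):
    """Turn ['나/NP|는/JX', ...] into a space-separated morpheme string."""
    morphemes = []
    for eojeol in eojeol_list:
        for piece in eojeol.split("|"):
            # Everything before the first '/' is the surface form of the morpheme
            morphemes.append(piece.split("/")[0])
    return " ".join(morphemes)


sample = ["나/NP|는/JX", "가/VV|ㄴ다/EF", "학교/NNG|에/JKB"]
cleaned = strip_pos_tags(sample)
print(cleaned)
```

One caveat of splitting on `/`: a morpheme that is itself a slash would come out as an empty string, so such tokens may need special handling.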

### 2.2 English data

In [None]:
#Use nltk's BracketParseCorpusReader module to process the English data
#BracketParseCorpusReader gives access to the sentences, words, and POS-tagged sentences of the English data

from nltk.corpus.reader import BracketParseCorpusReader
en = BracketParseCorpusReader(root="/content/drive/MyDrive/dataset/", fileids=['ko-en.en.parse.syn'], encoding='utf-8')

In [None]:
#Compare the number of Korean and English sentences to confirm that they match
print(len(clean_ko_corpus))
print(len(en.tagged_sents(fileids='ko-en.en.parse.syn'))) #tagged_sents returns each word together with its POS tag

330974
330974


In [None]:
#Build the token list for each English sentence
en_word_list = list(en.sents(fileids='ko-en.en.parse.syn'))

In [None]:
#Join the tokens of each English sentence with spaces
tokenized_sentences = [" ".join(sent) for sent in en_word_list]

### 2.3 Building the paired Korean-English dataset

In [None]:
#(English) Write a text file with one sentence per line
with open('/content/drive/MyDrive/dataset/en.txt', mode='wt', encoding='utf-8') as f:
  for sent in tokenized_sentences:
    f.write(sent)
    f.write("\n")

In [None]:
#(Korean) Write a text file with one sentence per line
with open('/content/drive/MyDrive/dataset/ko.txt', mode='wt', encoding='utf-8') as f:
  for sent in clean_ko_corpus:
    f.write(sent)
    f.write("\n")

In [None]:
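#Use a DataFrame to shuffle the data
#The separator '/n,' is not expected to occur in the text, so each line loads as a single column

df1 = pd.read_csv('/content/drive/MyDrive/dataset/ko.txt', sep='/n,', names=['src'], header=None, engine='python') # Korean
df2 = pd.read_csv('/content/drive/MyDrive/dataset/en.txt', sep='/n,', names=['trg'], header=None, engine='python') # English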

df = pd.concat([df1,df2],axis=1)



In [None]:
#Shuffle the data
df_shuffle = df.sample(frac = 1)

In [None]:
df_src = df_shuffle['src']
df_trg = df_shuffle['trg']

df_src.to_csv('/content/drive/MyDrive/dataset/ko_shuffle.txt', sep = '\n', index = False, header=None)
df_trg.to_csv('/content/drive/MyDrive/dataset/en_shuffle.txt', sep = '\n', index = False, header=None)

### 3. Models

### 3.1 Packed Encoder-Decoder
Reference: https://github.com/bentrevett/pytorch-seq2seq/blob/master/4%20-%20Packed%20Padded%20Sequences%2C%20Masking%2C%20Inference%20and%20BLEU.ipynb

### (1) Defining the Fields

In [None]:
#The text files created above hold Korean and English sentences whose morphemes are separated by spaces
#To keep those morphemes unchanged, the tokenizer functions simply split on whitespace

def tokenize_ko(text):

    return [tok for tok in text.split(" ")]

def tokenize_en(text):

    return [tok for tok in text.split(" ")]

In [None]:
#Define the Fields
#include_lengths = True makes the source (SRC) Field return sentence lengths, which the packed encoder-decoder model needs

SRC = Field(tokenize = tokenize_ko, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            include_lengths = True,
            lower = True)

TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)



### (2) Creating the TranslationDataset

In [None]:
#Load the shuffled data with torchtext's TranslationDataset (the shuffled files are expected under ordered_data/)

data_shuffled = TranslationDataset(path = '/content/drive/MyDrive/dataset/ordered_data/', exts = ('ko_shuffle.txt', 'en_shuffle.txt'), fields = (SRC, TRG))



### (3) Creating the train, validation, and test datasets

In [None]:
#Split into train, validation, and test datasets at an 80:10:10 ratio
train_data, test_data = data_shuffled.split(split_ratio = 0.8, random_state = random.seed(SEED))
valid_data, test_data = test_data.split(split_ratio = 0.5, random_state = random.seed(SEED))

In [None]:
#Check the number of train, validation, and test examples
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

Number of training examples: 264779
Number of validation examples: 33098
Number of testing examples: 33097
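The printed counts can be checked by hand. The sketch below assumes each split size is obtained by rounding the ratio times the number of examples with Python's `round`; that assumption about the splitting routine is mine, and the helper name `split_sizes` is hypothetical.

```python
def split_sizes(n_examples, train_ratio=0.8, valid_share_of_rest=0.5):
    """Sizes of an 80:10:10 split done in two steps, rounding at each step."""
    n_train = int(round(train_ratio * n_examples))
    n_rest = n_examples - n_train
    # The held-out 20% is split in half again into validation and test
    n_valid = int(round(valid_share_of_rest * n_rest))
    n_test = n_rest - n_valid
    return n_train, n_valid, n_test


sizes = split_sizes(330974)
print(sizes)
```

Under that assumption the 330,974 sentence pairs give the same three counts as the cell output above, with the odd remainder of 66,195 explaining why validation has one more example than test.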


### (4) Building the Korean and English vocabularies and setting the pretrained embeddings
Reference: https://rohit-agrawal.medium.com/using-fine-tuned-gensim-word2vec-embeddings-with-torchtext-and-pytorch-17eea2883cd

In [None]:
#Use the fasttext.en.300d vectors provided by torchtext as the English embeddings
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, vectors='fasttext.en.300d', min_freq=2)

In [None]:
#Vocabulary sizes for Korean and English
print(f"Unique tokens in source (ko) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (ko) vocabulary: 16179
Unique tokens in target (en) vocabulary: 14245


In [None]:
import gensim
from gensim.models.keyedvectors import KeyedVectors

path = '/content/drive/MyDrive/model/'
Word2Vec_300D_token_model = KeyedVectors.load_word2vec_format(path + 'Word2Vec_300D_token.model', binary=False, encoding='utf-8')

In [None]:
from tqdm.notebook import tqdm

word2vec_vectors_src = []

for token, idx in tqdm(SRC.vocab.stoi.items()):
    if token in Word2Vec_300D_token_model: #use the pretrained vector when the model has one for this token
        word2vec_vectors_src.append(torch.FloatTensor(Word2Vec_300D_token_model[token]))
    else:
        word2vec_vectors_src.append(torch.randn(300)) #otherwise fall back to a random vector
        
SRC.vocab.set_vectors(SRC.vocab.stoi, word2vec_vectors_src, 300) #assign an embedding vector to every token in the vocab
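When mixing pretrained vectors with random fallbacks, it can help to know how much of the vocabulary the pretrained model actually covers. The following is a library-free sketch of the same lookup-with-fallback idea that also reports coverage; `build_embedding_rows`, the toy vectors, and the toy vocabulary are all hypothetical.

```python
import random


def build_embedding_rows(vocab_tokens, pretrained, dim, rng):
    """Return one vector per vocab token plus the number of pretrained hits."""
    rows, hits = [], 0
    for token in vocab_tokens:
        if token in pretrained:
            rows.append(list(pretrained[token]))
            hits += 1
        else:
            # No pretrained vector: draw a random one from a standard normal
            rows.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
    return rows, hits


# Toy 3-dimensional "pretrained" vectors and vocabulary, for illustration only
pretrained = {"학교": [0.1, 0.2, 0.3], "가": [0.4, 0.5, 0.6]}
vocab_tokens = ["<unk>", "<pad>", "학교", "가", "선생님"]
rows, hits = build_embedding_rows(vocab_tokens, pretrained, dim=3, rng=random.Random(1234))
coverage = hits / len(vocab_tokens)
print(f"{hits} of {len(vocab_tokens)} tokens covered ({coverage:.0%})")
```

A low coverage figure would suggest that most of the embedding layer starts from random values, which may limit the benefit of the pretrained vectors.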






### (5) Creating the BucketIterator

In [None]:
#Define the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
#Create the BucketIterators
#Sort the source sentences within each batch by length
BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
     batch_size = BATCH_SIZE,
     sort_within_batch = True,
     sort_key = lambda x : len(x.src),
     device = device)



### (6) Defining the packed padded encoder-decoder model

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src, src_len):
        
        #src = [src len, batch size]
        #src_len = [batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
                
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, src_len.cpu())
                
        packed_outputs, hidden = self.rnn(packed_embedded)
                                 
        #packed_outputs is a packed sequence containing all hidden states
        #hidden is now from the final non-padded element in the batch
            
        outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs) 
            
        #outputs is now a non-packed sequence, all hidden states obtained
        #  when the input is a pad token are all zeros
            
        #outputs = [src len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        
        #outputs = [src len, batch size, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
        
        return outputs, hidden

In [None]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs, mask):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
  
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        #energy = [batch size, src len, dec hid dim]

        attention = self.v(energy).squeeze(2)
        
        #attention = [batch size, src len]
        
        attention = attention.masked_fill(mask == 0, -1e10)
        
        return F.softmax(attention, dim = 1)
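The `masked_fill(mask == 0, -1e10)` step before the softmax is what keeps attention away from `<pad>` positions. A plain-Python sketch of the same effect for a single sentence is shown below; `masked_softmax` and the toy scores are assumptions for illustration.

```python
import math


def masked_softmax(scores, mask, fill=-1e10):
    """Softmax over scores where masked-out positions get a very negative score."""
    filled = [s if keep else fill for s, keep in zip(scores, mask)]
    peak = max(filled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]


# Two real tokens followed by two <pad> positions
scores = [2.0, 1.0, 0.5, 0.3]
mask = [True, True, False, False]
weights = masked_softmax(scores, mask)
print(weights)
```

The padded positions receive zero weight, and the remaining weights still sum to one over the real tokens.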

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()

        self.output_dim = output_dim
        self.attention = attention
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, encoder_outputs, mask):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        #mask = [batch size, src len]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs, mask)
                
        #a = [batch size, src len]
        
        a = a.unsqueeze(1)
        
        #a = [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)
        
        #weighted = [batch size, 1, enc hid dim * 2]
        
        weighted = weighted.permute(1, 0, 2)
        
        #weighted = [1, batch size, enc hid dim * 2]
        
        rnn_input = torch.cat((embedded, weighted), dim = 2)
        
        #rnn_input = [1, batch size, (enc hid dim * 2) + emb dim]
            
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        #this also means that output == hidden
        assert (output == hidden).all()
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden.squeeze(0), a.squeeze(1)

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, src_pad_idx, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.device = device
        
    def create_mask(self, src):
        mask = (src != self.src_pad_idx).permute(1, 0)
        return mask
        
    def forward(self, src, src_len, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #src_len = [batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
                    
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src, src_len)
                
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        mask = self.create_mask(src)

        #mask = [batch size, src len]
                
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden state, all encoder hidden states 
            #  and mask
            #receive output tensor (predictions) and new hidden state
            output, hidden, _ = self.decoder(input, hidden, encoder_outputs, mask)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
            
        return outputs
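The effect of `teacher_forcing_ratio` is easiest to see at its two extremes. The sketch below mimics the decoding loop with a stand-in model; `decode_with_teacher_forcing` and the `always_wrong` model are hypothetical and only track which tokens are fed to the decoder.

```python
import random


def decode_with_teacher_forcing(target, predict_next, teacher_forcing_ratio, rng):
    """Return the decoder inputs actually used and the predictions made."""
    inputs_used, predictions = [], []
    current = target[0]  # decoding starts from <sos>
    for t in range(1, len(target)):
        inputs_used.append(current)
        predicted = predict_next(current)
        predictions.append(predicted)
        # With probability teacher_forcing_ratio feed the reference token,
        # otherwise feed the model's own prediction
        use_reference = rng.random() < teacher_forcing_ratio
        current = target[t] if use_reference else predicted
    return inputs_used, predictions


target = ["<sos>", "i", "go", "to", "school", "<eos>"]


def always_wrong(token):
    """Hypothetical model that never predicts the right token."""
    return "???"


forced, _ = decode_with_teacher_forcing(target, always_wrong, 1.0, random.Random(1234))
free, _ = decode_with_teacher_forcing(target, always_wrong, 0.0, random.Random(1234))
print(forced)
print(free)
```

With a ratio of 1.0 the decoder always sees the reference prefix, while with 0.0 it only sees its own outputs after `<sos>`, which is the setting the notebook's `evaluate` function uses.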

### (7) Training

In [None]:
#Set the hyperparameters

INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 300
DEC_EMB_DIM = 300
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
SRC_PAD_IDX = SRC.vocab.stoi[SRC.pad_token]

#Build the model from the hyperparameters
attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, SRC_PAD_IDX, device).to(device)

In [None]:
#Initialize all parameters
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(16179, 300)
    (rnn): GRU(300, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(14245, 300)
    (rnn): GRU(1324, 512)
    (fc_out): Linear(in_features=1836, out_features=14245, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [None]:
#Copy the pretrained embeddings into the model
pretrained_embeddings_src = SRC.vocab.vectors
model.encoder.embedding.weight.data.copy_(pretrained_embeddings_src)

pretrained_embeddings_trg = TRG.vocab.vectors
model.decoder.embedding.weight.data.copy_(pretrained_embeddings_trg)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [-0.1093, -0.1233, -0.2521,  ...,  0.3467,  0.1614,  0.1974],
        [ 0.0728, -0.2033, -0.0522,  ...,  0.2844,  0.0839, -0.2510],
        [-0.0395, -0.2716, -0.0760,  ...,  0.5530, -0.0286,  0.1793]],
       device='cuda:0')

In [None]:
#Count the model's trainable parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 41,931,297 trainable parameters


In [None]:
#Use the Adam optimizer
optimizer = optim.Adam(model.parameters())

In [None]:
#Use the cross-entropy loss
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX) #<pad> tokens are ignored when computing the loss

In [None]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src, src_len = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output = model(src, src_len, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
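The `clip` argument bounds the overall size of the gradient before each optimizer step. As a rough sketch of what clipping by global norm does, the plain-Python function below rescales a flat list of gradient values; `clip_by_global_norm` is a hypothetical helper that follows the usual formula rather than a copy of the PyTorch code.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients down together if their L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        grads = [g * scale for g in grads]
    return grads, total_norm


large, large_norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
small, small_norm = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(large, large_norm)
print(small, small_norm)
```

Gradients whose norm is already below the threshold pass through unchanged, and larger ones keep their direction but are shrunk to roughly the threshold. With `CLIP = 1` in the notebook, an update can never be driven by a gradient of norm much above 1.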

In [None]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src, src_len = batch.src
            trg = batch.trg

            output = model(src, src_len, trg, 0) #turn off teacher forcing
            
            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [None]:
#Compute the time taken per epoch
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 5
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), '/content/drive/MyDrive/dataset/packed-word2vec-fasttexten-v2_ordered_data.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')



Epoch: 01 | Time: 10m 9s
	Train Loss: 3.018 | Train PPL:  20.450
	 Val. Loss: 2.777 |  Val. PPL:  16.069
Epoch: 02 | Time: 10m 20s
	Train Loss: 1.952 | Train PPL:   7.041
	 Val. Loss: 2.475 |  Val. PPL:  11.880
Epoch: 03 | Time: 10m 18s
	Train Loss: 1.657 | Train PPL:   5.243
	 Val. Loss: 2.371 |  Val. PPL:  10.706
Epoch: 04 | Time: 10m 18s
	Train Loss: 1.500 | Train PPL:   4.482
	 Val. Loss: 2.355 |  Val. PPL:  10.541
Epoch: 05 | Time: 10m 21s
	Train Loss: 1.397 | Train PPL:   4.042
	 Val. Loss: 2.322 |  Val. PPL:  10.193


### (8) Testing

In [None]:
model.load_state_dict(torch.load('/content/drive/MyDrive/dataset/packed-word2vec-fasttexten-v2_ordered_data.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')



| Test Loss: 2.314 | Test PPL:  10.117 |


### (9) BLEU score

In [None]:
#ref: https://github.com/osori/korean-romanizer
!pip install korean_romanizer
from korean_romanizer.romanizer import Romanizer



In [None]:
# Return the translation of an input sentence
def translate_sentence(sentence, src_field, trg_field, model, device, reverse = False, romanize = False, max_len = 50):

    model.eval()
        
    if isinstance(sentence, list):                  
        tokens = sentence
    elif isinstance(sentence, str):                      # tokenize the input if it is not tokenized yet
      if reverse == False:                               
        tokens = [tok for tok in sentence.split(" ")]
      elif reverse == True:                              # translate the input sentence in reversed order
        tokens = [tok for tok in sentence.split(" ")][::-1]

    tokens = [src_field.init_token] + tokens + [src_field.eos_token]
        
    src_indexes = [src_field.vocab.stoi[token] for token in tokens]

    src_tokens = [src_field.vocab.itos[i] for i in src_indexes]
    
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)

    src_len = torch.LongTensor([len(src_indexes)]).to(device)
    
    with torch.no_grad():
        encoder_outputs, hidden = model.encoder(src_tensor, src_len)

    mask = model.create_mask(src_tensor)
        
    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]

    attentions = torch.zeros(max_len, 1, len(src_indexes)).to(device)
    
    for i in range(max_len):

        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
                
        with torch.no_grad():
            output, hidden, attention = model.decoder(trg_tensor, hidden, encoder_outputs, mask)

        attentions[i] = attention
            
        pred_token = output.argmax(1).item()
        
        trg_indexes.append(pred_token)

        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break
    
    if romanize == False:                                           # leave <unk> tokens as they are
      trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]
    
    elif romanize == True:                                          # replace each <unk> token with a romanized source token
      trg_tokens = []

      # attentions[i] was produced while predicting trg_indexes[i+1]
      attention_for_alignment = attentions[:len(trg_indexes)-1].squeeze(1).cpu().detach()
    
      for index, ori_index in enumerate(trg_indexes):

        trg_tk = trg_field.vocab.itos[ori_index]

        if trg_tk == '<unk>' and index > 0:
          # source position with the highest attention weight for this output step
          src_search = attention_for_alignment[index-1,:].argmax().item()

          # romanize the raw input token; src_tokens would already hold '<unk>' for out-of-vocabulary words
          roman_token = Romanizer(tokens[src_search]).romanize()

          trg_tokens.append(roman_token)

        else:
          trg_tokens.append(trg_tk)
  
    return trg_tokens[1:-1], attentions[:len(trg_tokens)-1]    
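The `<unk>` replacement idea can be isolated from the model: for every unknown output token, take the source token that received the most attention at that step and transliterate it. The sketch below uses plain lists and a dictionary lookup in place of the romanizer; `replace_unknowns`, `toy_romanize`, and all data are hypothetical.

```python
def replace_unknowns(output_tokens, attention_rows, source_tokens, transliterate):
    """Replace each '<unk>' with the transliterated, most-attended source token.

    attention_rows[i] holds the attention weights over source_tokens that were
    used when output_tokens[i] was predicted.
    """
    result = []
    for token, row in zip(output_tokens, attention_rows):
        if token == "<unk>":
            best = max(range(len(row)), key=row.__getitem__)
            result.append(transliterate(source_tokens[best]))
        else:
            result.append(token)
    return result


def toy_romanize(word):
    """Stand-in for a real romanizer: looks the word up in a tiny table."""
    return {"서울": "seoul"}.get(word, word)


source_tokens = ["<sos>", "나", "는", "서울", "에", "살", "ㄴ다", "<eos>"]
output_tokens = ["i", "live", "in", "<unk>", "<eos>"]
attention_rows = [
    [0.0, 0.8, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.0, 0.0, 0.6, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.2, 0.7, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.9],
]
fixed = replace_unknowns(output_tokens, attention_rows, source_tokens, toy_romanize)
print(" ".join(fixed))
```

This only works as well as the attention alignment does, so it is most plausible for names and other words that map one-to-one between the two languages.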

In [None]:
from torchtext.data.metrics import bleu_score

def calculate_bleu(data, src_field, trg_field, model, device, max_len = 50):
    
    trgs = []
    pred_trgs = []
    
    for datum in data:
        
        src = vars(datum)['src']
        trg = vars(datum)['trg']
        
        pred_trg, _ = translate_sentence(src, src_field, trg_field, model, device, max_len = max_len)
    
        pred_trgs.append(pred_trg)
        trgs.append([trg])
        
    return bleu_score(pred_trgs, trgs)

In [None]:
test_bleu = calculate_bleu(test_data, SRC, TRG, model, device) #separate name so the imported bleu_score function stays usable

print(f'BLEU score = {test_bleu*100:.2f}')

BLEU score = 42.38


### (10) Inference 
Inference is run on each user input sentence twice: once in its original order and once in reversed order.

In [None]:
sen_list = ['inference 문장 입력'] #placeholder ("enter inference sentence"); replace with the Korean sentences to translate

### (10.1) Original sentence inference

In [None]:
# reverse = False
for sent in sen_list:
  translation, attention = translate_sentence(sent, SRC, TRG, model, device, reverse = False, romanize = False)
  translation_r, attention = translate_sentence(sent, SRC, TRG, model, device, reverse = False, romanize = True)
  translated_text = " ".join(translation)
  translated_text_r = " ".join(translation_r)

  print(f'source sent = {sent}')
  print(f'predicted plain = {translated_text}')
  print(f'predicted romanize = {translated_text_r}')
  print('\n')

In [None]:
#attention display 
def display_attention(sentence, translation, attention, reverse = False):
    
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111)
    
    attention = attention.squeeze(1).cpu().detach().numpy()
    
    cax = ax.matshow(attention, cmap='bone')
   
    ax.tick_params(labelsize=12)

    if reverse:
      ax.set_xticklabels(['']+['<eos>']+[tok for tok in sentence.split(" ")][::-1]+['<sos>'], 
                       rotation=45)
    else: 
      ax.set_xticklabels(['']+['<sos>']+[tok for tok in sentence.split(" ")]+['<eos>'], 
                           rotation=45)
    ax.set_yticklabels(['']+translation)

    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()
    plt.close()

In [None]:
# reverse = False / romanize = True
for sent in sen_list:
  translation_r, attention = translate_sentence(sent, SRC, TRG, model, device, reverse = False, romanize = True)
  translated_text_r = " ".join(translation_r)
  print(f'source sent = {sent}')
  print(f'predicted romanize = {translated_text_r}')
  display_attention(sent, translation_r, attention)
  print('\n')

### (10.2) Reversed sentence inference

In [None]:
# reverse = true
for sent in sen_list:
  translation, attention = translate_sentence(sent, SRC, TRG, model, device, reverse = True, romanize = False)
  translation_r, attention = translate_sentence(sent, SRC, TRG, model, device, reverse = True, romanize = True)
  translated_text = " ".join(translation)
  translated_text_r = " ".join(translation_r)

  print(f'source sent = {sent}')
  print(f'predicted plain = {translated_text}')
  print(f'predicted romanize = {translated_text_r}')
  print('\n')

In [None]:
# reverse = True / romanize = False
for sent in sen_list:
  translation, attention = translate_sentence(sent, SRC, TRG, model, device, reverse = True, romanize = False)
  translated_text = " ".join(translation)
  print(f'source sent = {sent}')
  print(f'predicted plain = {translated_text}')
  display_attention(sent, translation, attention, reverse = True)
  print('\n')

### 3.2. Convolutional Seq2seq 
Reference: https://github.com/bentrevett/pytorch-seq2seq/blob/master/5%20-%20Convolutional%20Sequence%20to%20Sequence%20Learning.ipynb

### (1) Defining the Fields

In [None]:
#define the Fields
#same tokenizers as the packed-padded encoder-decoder model
from torchtext.data import Field
SRC_2 = Field(tokenize = tokenize_ko, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True,
            batch_first = True)

TRG_2 = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True,
            batch_first=True)

### (2) Creating the TranslationDataset

In [None]:
#load the shuffled data through torchtext's TranslationDataset
data_shuffled_2 = TranslationDataset(path = '/content/drive/MyDrive/dataset/ordered_data/', exts = ('ko_shuffle.txt', 'en_shuffle.txt'), fields = (SRC_2, TRG_2))

### (3) Creating the Train, Validation, and Test Datasets


In [None]:
#split into train / validation / test sets at an 80:10:10 ratio
#random.seed() returns None, so seed first and pass the RNG state explicitly
random.seed(SEED)
train_data_2, test_data_2 = data_shuffled_2.split(split_ratio = 0.8, random_state = random.getstate())
random.seed(SEED)
valid_data_2, test_data_2 = test_data_2.split(split_ratio = 0.5, random_state = random.getstate())

In [None]:
print(f"Number of training examples: {len(train_data_2.examples)}")
print(f"Number of validation examples: {len(valid_data_2.examples)}")
print(f"Number of testing examples: {len(test_data_2.examples)}")

Number of training examples: 264779
Number of validation examples: 33098
Number of testing examples: 33097
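
The counts above are consistent with each split size being the rounded product of the ratio and the dataset size. The sketch below reproduces that arithmetic in plain Python. `split_sizes` is a hypothetical helper, and the use of `round()` is an assumption about how the parts are sized, not something shown in this notebook.

```python
def split_sizes(n_examples, ratio):
    """Return (first, second) part sizes, rounding the first part with Python's round()."""
    first = int(round(ratio * n_examples))
    return first, n_examples - first

total = 264779 + 33098 + 33097               # 330,974 sentence pairs in all
n_train, n_rest = split_sizes(total, 0.8)    # 80% for training
n_valid, n_test = split_sizes(n_rest, 0.5)   # remaining 20% halved
print(n_train, n_valid, n_test)
```

Under that assumption, the validation set is one example larger than the test set because Python rounds 33097.5 to the even neighbour 33098.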


### (4) Building the Korean and English Vocabularies and Setting the Pretrained Embeddings
Reference: https://rohit-agrawal.medium.com/using-fine-tuned-gensim-word2vec-embeddings-with-torchtext-and-pytorch-17eea2883cd

In [None]:
#the English embeddings are the fasttext.en.300d vectors provided by torchtext
SRC_2.build_vocab(train_data_2, min_freq=2)
TRG_2.build_vocab(train_data_2, vectors='fasttext.en.300d', min_freq=2)

.vector_cache/wiki.en.vec: 6.60GB [04:24, 24.9MB/s]                            
  0%|          | 0/2519370 [00:00<?, ?it/s]Skipping token b'2519370' with 1-dimensional vector [b'300']; likely a header
100%|█████████▉| 2518747/2519370 [04:50<00:00, 8990.86it/s]

In [None]:
#vocabulary sizes for Korean and English
print(f"Unique tokens in source (ko) vocabulary: {len(SRC_2.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG_2.vocab)}")

Unique tokens in source (ko) vocabulary: 16179
Unique tokens in target (en) vocabulary: 14245
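
The `min_freq=2` setting means that tokens seen only once in the training data are left out of the vocabulary and later map to `<unk>`. A rough pure-Python sketch of that idea follows. `build_vocab` and `numericalize` are hypothetical helpers, and the order of the special tokens is an assumption made for illustration.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Keep tokens occurring at least min_freq times, most frequent first."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))   # ties broken alphabetically
    itos = list(specials) + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices, sending out-of-vocabulary tokens to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

itos, stoi = build_vocab([["a", "b", "a"], ["b", "c"]])
ids = numericalize(["a", "c"], stoi)   # "c" occurs once, so it becomes <unk>
```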


In [None]:
import gensim
from gensim.models.keyedvectors import KeyedVectors

path = '/content/drive/MyDrive/model/'
Word2Vec_300D_token_model = KeyedVectors.load_word2vec_format(path + 'Word2Vec_300D_token.model', binary=False, encoding='utf-8')

In [None]:
from tqdm.notebook import tqdm

word2vec_vectors_src_2 = []

for token, idx in tqdm(SRC_2.vocab.stoi.items()):
    if token in Word2Vec_300D_token_model: #use the pretrained vector when the word2vec model has one for this token
        word2vec_vectors_src_2.append(torch.FloatTensor(Word2Vec_300D_token_model[token]))
    else:
        word2vec_vectors_src_2.append(torch.randn(300)) #otherwise initialize the vector randomly
        
SRC_2.vocab.set_vectors(SRC_2.vocab.stoi, word2vec_vectors_src_2, 300) #assign each vocab token its embedding
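
The lookup-with-fallback logic of the cell above can be illustrated without torch or gensim. In this sketch, `build_embedding_rows` is a hypothetical helper, the `pretrained` dictionary stands in for the word2vec model, and the returned coverage figure is an addition that the notebook itself does not compute.

```python
import random

def build_embedding_rows(itos, pretrained, dim, seed=0):
    """Return one vector per vocabulary token plus the fraction found in `pretrained`."""
    rng = random.Random(seed)
    rows, n_found = [], 0
    for token in itos:
        if token in pretrained:
            rows.append(list(pretrained[token]))   # copy the pretrained vector
            n_found += 1
        else:
            # no pretrained vector: draw from a standard normal, like torch.randn
            rows.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
    return rows, n_found / len(itos)

rows, coverage = build_embedding_rows(
    ["<unk>", "학교", "사과"], {"학교": [0.1, 0.2, 0.3]}, dim=3)
```

Checking the coverage is a cheap way to see how much of the vocabulary actually benefits from the pretrained vectors.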






### (5) Creating the BucketIterator

In [None]:
#create the BucketIterators
BATCH_SIZE = 128

train_iterator_2, valid_iterator_2, test_iterator_2 = BucketIterator.splits(
    (train_data_2, valid_data_2, test_data_2), 
     batch_size = BATCH_SIZE,
     device = device)
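
The point of bucketing is that batches made of similar-length sentences need far less padding. The sketch below shows only that grouping effect in plain Python. `padded_batches` and `count_padding` are hypothetical helpers, and the real iterator does more than this (shuffling, for instance).

```python
def padded_batches(sentences, batch_size, sort_by_length=True, pad="<pad>"):
    """Cut sentences into batches and pad each batch to its longest sentence."""
    if sort_by_length:
        sentences = sorted(sentences, key=len)   # neighbours now have similar lengths
    batches = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        width = max(len(s) for s in batch)
        batches.append([s + [pad] * (width - len(s)) for s in batch])
    return batches

def count_padding(batches, pad="<pad>"):
    """Total number of padding tokens across all batches."""
    return sum(row.count(pad) for batch in batches for row in batch)

sents = [["a"], ["a"] * 5, ["a"], ["a"] * 5]
unsorted_pad = count_padding(padded_batches(sents, 2, sort_by_length=False))
sorted_pad = count_padding(padded_batches(sents, 2, sort_by_length=True))
```

In this toy case, grouping by length removes all padding, whereas batching in the given order pads each short sentence up to length 5.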



### (6) Convolutional Sequence to Sequence Model


In [None]:
class Encoder(nn.Module):
    def __init__(self, 
                 input_dim, 
                 emb_dim, 
                 hid_dim, 
                 n_layers, 
                 kernel_size, 
                 dropout, 
                 device,
                 max_length = 100):
        super().__init__()
        
        assert kernel_size % 2 == 1, "Kernel size must be odd!"
        
        self.device = device
        
        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)
        
        self.tok_embedding = nn.Embedding(input_dim, emb_dim)
        self.pos_embedding = nn.Embedding(max_length, emb_dim)
        
        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        self.hid2emb = nn.Linear(hid_dim, emb_dim)
        
        self.convs = nn.ModuleList([nn.Conv1d(in_channels = hid_dim, 
                                              out_channels = 2 * hid_dim, 
                                              kernel_size = kernel_size, 
                                              padding = (kernel_size - 1) // 2)
                                    for _ in range(n_layers)])
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [batch size, src len]
        
        batch_size = src.shape[0]
        src_len = src.shape[1]
        
        #create position tensor
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        
        #pos = [0, 1, 2, 3, ..., src len - 1]
        
        #pos = [batch size, src len]
        
        #embed tokens and positions
        tok_embedded = self.tok_embedding(src)
        pos_embedded = self.pos_embedding(pos)
        
        #tok_embedded = pos_embedded = [batch size, src len, emb dim]
        
        #combine embeddings by elementwise summing
        embedded = self.dropout(tok_embedded + pos_embedded)
        
        #embedded = [batch size, src len, emb dim]
        
        #pass embedded through linear layer to convert from emb dim to hid dim
        conv_input = self.emb2hid(embedded)
        
        #conv_input = [batch size, src len, hid dim]
        
        #permute for convolutional layer
        conv_input = conv_input.permute(0, 2, 1) 
        
        #conv_input = [batch size, hid dim, src len]
        
        #begin convolutional blocks...
        
        for i, conv in enumerate(self.convs):
        
            #pass through convolutional layer
            conved = conv(self.dropout(conv_input))

            #conved = [batch size, 2 * hid dim, src len]

            #pass through GLU activation function
            conved = F.glu(conved, dim = 1)

            #conved = [batch size, hid dim, src len]
            
            #apply residual connection
            conved = (conved + conv_input) * self.scale

            #conved = [batch size, hid dim, src len]
            
            #set conv_input to conved for next loop iteration
            conv_input = conved
        
        #...end convolutional blocks
        
        #permute and convert back to emb dim
        conved = self.hid2emb(conved.permute(0, 2, 1))
        
        #conved = [batch size, src len, emb dim]
        
        #elementwise sum output (conved) and input (embedded) to be used for attention
        combined = (conved + embedded) * self.scale
        
        #combined = [batch size, src len, emb dim]
        
        return conved, combined
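
Each encoder block above doubles the channels with a convolution, halves them again with a GLU, and adds the block input back with a factor of sqrt(0.5). A scalar-level sketch of those two steps at a single position follows. `glu` and `residual_block` are hypothetical helpers operating on plain lists instead of tensors.

```python
import math

def glu(channels):
    """Gated linear unit: first half of the channels times sigmoid of the second half."""
    half = len(channels) // 2
    values, gates = channels[:half], channels[half:]
    return [v * (1.0 / (1.0 + math.exp(-g))) for v, g in zip(values, gates)]

def residual_block(conv_output, block_input):
    """GLU followed by a residual connection scaled by sqrt(0.5)."""
    scale = math.sqrt(0.5)
    return [(c + x) * scale for c, x in zip(glu(conv_output), block_input)]

gated = glu([2.0, 4.0, 0.0, 0.0])                        # gates of 0 pass half the value
out = residual_block([2.0, 4.0, 0.0, 0.0], [1.0, 2.0])
```

The sqrt(0.5) factor keeps the variance of the sum roughly equal to that of its two inputs, which is why it appears after every residual addition.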

In [None]:
class Decoder(nn.Module):
    def __init__(self, 
                 output_dim, 
                 emb_dim, 
                 hid_dim, 
                 n_layers, 
                 kernel_size, 
                 dropout, 
                 trg_pad_idx, 
                 device,
                 max_length = 100):
        super().__init__()
        
        self.kernel_size = kernel_size
        self.trg_pad_idx = trg_pad_idx
        self.device = device
        
        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)
        
        self.tok_embedding = nn.Embedding(output_dim, emb_dim)
        self.pos_embedding = nn.Embedding(max_length, emb_dim)
        
        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        self.hid2emb = nn.Linear(hid_dim, emb_dim)
        
        self.attn_hid2emb = nn.Linear(hid_dim, emb_dim)
        self.attn_emb2hid = nn.Linear(emb_dim, hid_dim)
        
        self.fc_out = nn.Linear(emb_dim, output_dim)
        
        self.convs = nn.ModuleList([nn.Conv1d(in_channels = hid_dim, 
                                              out_channels = 2 * hid_dim, 
                                              kernel_size = kernel_size)
                                    for _ in range(n_layers)])
        
        self.dropout = nn.Dropout(dropout)
      
    def calculate_attention(self, embedded, conved, encoder_conved, encoder_combined):
        
        #embedded = [batch size, trg len, emb dim]
        #conved = [batch size, hid dim, trg len]
        #encoder_conved = encoder_combined = [batch size, src len, emb dim]
        
        #permute and convert back to emb dim
        conved_emb = self.attn_hid2emb(conved.permute(0, 2, 1))
        
        #conved_emb = [batch size, trg len, emb dim]
        
        combined = (conved_emb + embedded) * self.scale
        
        #combined = [batch size, trg len, emb dim]
                
        energy = torch.matmul(combined, encoder_conved.permute(0, 2, 1))
        
        #energy = [batch size, trg len, src len]
        
        attention = F.softmax(energy, dim=2)
        
        #attention = [batch size, trg len, src len]
            
        attended_encoding = torch.matmul(attention, encoder_combined)
        
        #attended_encoding = [batch size, trg len, emb dim]
        
        #convert from emb dim -> hid dim
        attended_encoding = self.attn_emb2hid(attended_encoding)
        
        #attended_encoding = [batch size, trg len, hid dim]
        
        #apply residual connection
        attended_combined = (conved + attended_encoding.permute(0, 2, 1)) * self.scale
        
        #attended_combined = [batch size, hid dim, trg len]
        
        return attention, attended_combined
        
    def forward(self, trg, encoder_conved, encoder_combined):
        
        #trg = [batch size, trg len]
        #encoder_conved = encoder_combined = [batch size, src len, emb dim]
                
        batch_size = trg.shape[0]
        trg_len = trg.shape[1]
            
        #create position tensor
        pos = torch.arange(0, trg_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        
        #pos = [batch size, trg len]
        
        #embed tokens and positions
        tok_embedded = self.tok_embedding(trg)
        pos_embedded = self.pos_embedding(pos)
        
        #tok_embedded = [batch size, trg len, emb dim]
        #pos_embedded = [batch size, trg len, emb dim]
        
        #combine embeddings by elementwise summing
        embedded = self.dropout(tok_embedded + pos_embedded)
        
        #embedded = [batch size, trg len, emb dim]
        
        #pass embedded through linear layer to go through emb dim -> hid dim
        conv_input = self.emb2hid(embedded)
        
        #conv_input = [batch size, trg len, hid dim]
        
        #permute for convolutional layer
        conv_input = conv_input.permute(0, 2, 1) 
        
        #conv_input = [batch size, hid dim, trg len]
        
        batch_size = conv_input.shape[0]
        hid_dim = conv_input.shape[1]
        
        for i, conv in enumerate(self.convs):
        
            #apply dropout
            conv_input = self.dropout(conv_input)
        
            #need to pad so decoder can't "cheat"
            padding = torch.zeros(batch_size, 
                                  hid_dim, 
                                  self.kernel_size - 1).fill_(self.trg_pad_idx).to(self.device)
                
            padded_conv_input = torch.cat((padding, conv_input), dim = 2)
        
            #padded_conv_input = [batch size, hid dim, trg len + kernel size - 1]
        
            #pass through convolutional layer
            conved = conv(padded_conv_input)

            #conved = [batch size, 2 * hid dim, trg len]
            
            #pass through GLU activation function
            conved = F.glu(conved, dim = 1)

            #conved = [batch size, hid dim, trg len]
            
            #calculate attention
            attention, conved = self.calculate_attention(embedded, 
                                                         conved, 
                                                         encoder_conved, 
                                                         encoder_combined)
            
            #attention = [batch size, trg len, src len]
            
            #apply residual connection
            conved = (conved + conv_input) * self.scale
            
            #conved = [batch size, hid dim, trg len]
            
            #set conv_input to conved for next loop iteration
            conv_input = conved
            
        conved = self.hid2emb(conved.permute(0, 2, 1))
         
        #conved = [batch size, trg len, emb dim]
            
        output = self.fc_out(self.dropout(conved))
        
        #output = [batch size, trg len, output dim]
            
        return output, attention
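
The decoder pads only on the left, by `kernel_size - 1` positions, so that the output at step t never depends on later target tokens. The sketch below demonstrates this property on a one-channel sequence. `causal_conv1d` is a hypothetical helper, and it pads with zeros as a simplification, whereas the decoder above fills the padding with the pad index value.

```python
def causal_conv1d(sequence, kernel, pad_value=0.0):
    """1-D convolution whose output at step t only uses inputs at steps <= t."""
    k = len(kernel)
    padded = [pad_value] * (k - 1) + list(sequence)   # pad on the left only
    return [sum(w * padded[t + j] for j, w in enumerate(kernel))
            for t in range(len(sequence))]

kernel = [1, 1, 1]
a = causal_conv1d([1, 2, 3, 4], kernel)
b = causal_conv1d([1, 2, 99, 99], kernel)   # same prefix, different "future"
```

Changing the last two inputs leaves the first two outputs untouched, which is exactly what stops the decoder from "cheating" during training.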

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        
    def forward(self, src, trg):
        
        #src = [batch size, src len]
        #trg = [batch size, trg len - 1] (<eos> token sliced off the end)
           
        #calculate z^u (encoder_conved) and (z^u + e) (encoder_combined)
        #encoder_conved is output from final encoder conv. block
        #encoder_combined is encoder_conved plus (elementwise) src embedding plus 
        #  positional embeddings 
        encoder_conved, encoder_combined = self.encoder(src)
            
        #encoder_conved = [batch size, src len, emb dim]
        #encoder_combined = [batch size, src len, emb dim]
        
        #calculate predictions of next words
        #output is a batch of predictions for each word in the trg sentence
        #attention a batch of attention scores across the src sentence for 
        #  each word in the trg sentence
        output, attention = self.decoder(trg, encoder_conved, encoder_combined)
        
        #output = [batch size, trg len - 1, output dim]
        #attention = [batch size, trg len - 1, src len]
        
        return output, attention
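
The attention step inside the decoder is a dot product between decoder states and `encoder_conved`, a softmax over the source positions, and a weighted sum of `encoder_combined`. A list-based sketch of that computation follows. `softmax`, `dot` and `attend` are hypothetical helpers, with queries, keys and values standing in for the decoder states, `encoder_conved` and `encoder_combined`.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(queries, keys, values):
    """Per query: softmax over dot products with the keys, then a weighted sum of values."""
    weights = [softmax([dot(q, k) for k in keys]) for q in queries]
    context = [[sum(w * v[d] for w, v in zip(row, values))
                for d in range(len(values[0]))]
               for row in weights]
    return weights, context

weights, context = attend([[10.0, 0.0]],
                          [[1.0, 0.0], [0.0, 1.0]],
                          [[1.0, 2.0], [3.0, 4.0]])
```

Each row of `weights` corresponds to one row of the attention maps plotted in the inference section: one target position distributed over all source positions.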

### (7) Training

In [None]:
INPUT_DIM = len(SRC_2.vocab)
OUTPUT_DIM = len(TRG_2.vocab)
EMB_DIM = 300
HID_DIM = 512 # each conv. layer has 2 * hid_dim filters
ENC_LAYERS = 10 # number of conv. blocks in encoder
DEC_LAYERS = 10 # number of conv. blocks in decoder
ENC_KERNEL_SIZE = 3 # must be odd!
DEC_KERNEL_SIZE = 3 # can be even or odd
ENC_DROPOUT = 0.25
DEC_DROPOUT = 0.25
TRG_PAD_IDX_2 = TRG_2.vocab.stoi[TRG_2.pad_token]
    
enc_2 = Encoder(INPUT_DIM, EMB_DIM, HID_DIM, ENC_LAYERS, ENC_KERNEL_SIZE, ENC_DROPOUT, device)
dec_2 = Decoder(OUTPUT_DIM, EMB_DIM, HID_DIM, DEC_LAYERS, DEC_KERNEL_SIZE, DEC_DROPOUT, TRG_PAD_IDX_2, device)

model_2 = Seq2Seq(enc_2, dec_2).to(device)

In [None]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model_2.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (tok_embedding): Embedding(16179, 300)
    (pos_embedding): Embedding(100, 300)
    (emb2hid): Linear(in_features=300, out_features=512, bias=True)
    (hid2emb): Linear(in_features=512, out_features=300, bias=True)
    (convs): ModuleList(
      (0): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
      (1): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
      (2): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
      (3): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
      (4): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
      (5): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
      (6): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
      (7): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
      (8): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
      (9): Conv1d(512, 1024, kernel_size=(3,), stride=(1,)

In [None]:
pretrained_embeddings_src_2 = SRC_2.vocab.vectors
model_2.encoder.tok_embedding.weight.data.copy_(pretrained_embeddings_src_2)

pretrained_embeddings_trg_2 = TRG_2.vocab.vectors
model_2.decoder.tok_embedding.weight.data.copy_(pretrained_embeddings_trg_2)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        ...,
        [-0.1093, -0.1233, -0.2521,  ...,  0.3467,  0.1614,  0.1974],
        [ 0.0728, -0.2033, -0.0522,  ...,  0.2844,  0.0839, -0.2510],
        [-0.0395, -0.2716, -0.0760,  ...,  0.5530, -0.0286,  0.1793]],
       device='cuda:0')

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model_2):,} trainable parameters')

The model has 45,876,741 trainable parameters
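
The parameter count can be checked by hand from the layer shapes printed above and the hyperparameters of this section. The sketch below is that arithmetic only. `linear`, `conv1d` and `embedding` are hypothetical helpers that return parameter counts, not layers.

```python
def linear(n_in, n_out):
    return n_in * n_out + n_out            # weights + biases

def conv1d(c_in, c_out, kernel):
    return c_in * c_out * kernel + c_out   # weights + biases

def embedding(n_rows, dim):
    return n_rows * dim

SRC_VOCAB, TRG_VOCAB = 16179, 14245
EMB, HID, LAYERS, KERNEL, MAX_LEN = 300, 512, 10, 3, 100

# parts that the encoder and the decoder both contain
shared = (embedding(MAX_LEN, EMB) + linear(EMB, HID) + linear(HID, EMB)
          + LAYERS * conv1d(HID, 2 * HID, KERNEL))
encoder = embedding(SRC_VOCAB, EMB) + shared
decoder = (embedding(TRG_VOCAB, EMB) + shared
           + linear(HID, EMB) + linear(EMB, HID)   # attention projections
           + linear(EMB, TRG_VOCAB))               # output layer
total = encoder + decoder
print(f"{total:,}")
```

About two thirds of the parameters sit in the twenty convolution layers, and most of the rest in the two token embeddings and the output layer.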


In [None]:
optimizer_2 = optim.Adam(model_2.parameters())

In [None]:
TRG_PAD_IDX_2 = TRG_2.vocab.stoi[TRG_2.pad_token]

criterion_2 = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX_2)

In [None]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output, _ = model(src, trg[:,:-1])
        
        #output = [batch size, trg len - 1, output dim]
        #trg = [batch size, trg len]
        
        output_dim = output.shape[-1]
        
        output = output.contiguous().view(-1, output_dim)
        trg = trg[:,1:].contiguous().view(-1)
        
        #output = [batch size * trg len - 1, output dim]
        #trg = [batch size * trg len - 1]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
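
The `clip` argument bounds the global gradient norm before each optimizer step. A simplified sketch of the idea follows. `clip_by_global_norm` is a hypothetical helper working on a flat list of gradient values, and it leaves out the small epsilon that the PyTorch routine adds to the denominator.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together when their joint L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        factor = max_norm / total_norm
        grads = [g * factor for g in grads]   # direction kept, length shortened
    return grads, total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

Because every gradient is multiplied by the same factor, clipping changes the step size but not the update direction.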

In [None]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output, _ = model(src, trg[:,:-1])
        
            #output = [batch size, trg len - 1, output dim]
            #trg = [batch size, trg len]

            output_dim = output.shape[-1]
            
            output = output.contiguous().view(-1, output_dim)
            trg = trg[:,1:].contiguous().view(-1)

            #output = [batch size * trg len - 1, output dim]
            #trg = [batch size * trg len - 1]
            
            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [None]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 5
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model_2, train_iterator_2, optimizer_2, criterion_2, CLIP)
    valid_loss = evaluate(model_2, valid_iterator_2, criterion_2)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model_2.state_dict(), '/content/drive/MyDrive/dataset/conv-seq2seq.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')



Epoch: 01 | Time: 9m 28s
	Train Loss: 3.791 | Train PPL:  44.291
	 Val. Loss: 2.039 |  Val. PPL:   7.683
Epoch: 02 | Time: 9m 39s
	Train Loss: 2.088 | Train PPL:   8.071
	 Val. Loss: 1.626 |  Val. PPL:   5.083
Epoch: 03 | Time: 9m 37s
	Train Loss: 1.802 | Train PPL:   6.061
	 Val. Loss: 1.595 |  Val. PPL:   4.927
Epoch: 04 | Time: 9m 37s
	Train Loss: 2.220 | Train PPL:   9.205
	 Val. Loss: 2.788 |  Val. PPL:  16.247
Epoch: 05 | Time: 9m 38s
	Train Loss: 1.620 | Train PPL:   5.053
	 Val. Loss: 1.350 |  Val. PPL:   3.858
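
The perplexities in this log are the exponential of the mean per-token cross-entropy loss, and the checkpoint on disk is the one from the epoch with the lowest validation loss. The sketch below recomputes both from the rounded losses printed above. `perplexity` is a hypothetical helper.

```python
import math

def perplexity(mean_token_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_token_loss)

# validation losses copied from the log above (rounded to three decimals)
epoch_val_losses = [2.039, 1.626, 1.595, 2.788, 1.350]
best_epoch = min(range(len(epoch_val_losses)),
                 key=epoch_val_losses.__getitem__) + 1
best_ppl = perplexity(epoch_val_losses[best_epoch - 1])
```

Epoch 5 has the lowest validation loss, so the spike at epoch 4 never overwrites the saved checkpoint. The recomputed perplexity differs from the logged 3.858 only in the last digits because the logged loss is rounded.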


### (8) Testing

In [None]:
model_2.load_state_dict(torch.load('/content/drive/MyDrive/dataset/conv-seq2seq.pt'))

test_loss = evaluate(model_2, test_iterator_2, criterion_2)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')



| Test Loss: 1.336 | Test PPL:   3.803 |


### (9) BLEU score

In [None]:
#a raw string input is tokenized before translation; an already-tokenized list is translated as is
#<unk> tokens can optionally be romanized
def translate_sentence_conv(sentence, src_field, trg_field, model, device, max_len = 50, reverse = False, romanize=False):
    
    model.eval()

    if isinstance(sentence, list):
      tokens = sentence
    elif isinstance(sentence, str):                      # tokenize the input if it is still a raw string
      if not reverse:
        tokens = sentence.split(" ")
      else:                                              # translate the input sentence in reversed token order
        tokens = sentence.split(" ")[::-1]

    tokens = [src_field.init_token] + tokens + [src_field.eos_token]

    src_indexes = [src_field.vocab.stoi[token] for token in tokens]

    src_tokens = [src_field.vocab.itos[i] for i in src_indexes]

    src_tensor = torch.LongTensor(src_indexes).unsqueeze(0).to(device)

    with torch.no_grad():
        encoder_conved, encoder_combined = model.encoder(src_tensor)

    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]

    for i in range(max_len):

        trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(device)

        with torch.no_grad():
            output, attention = model.decoder(trg_tensor, encoder_conved, encoder_combined)
        
        pred_token = output.argmax(2)[:,-1].item()
        
        trg_indexes.append(pred_token)

        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break
    
    if romanize == False:
      trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]

    elif romanize == True: 
      trg_tokens = []

      attention_for_alignment = attention
    
      for index, ori_index in enumerate(trg_indexes):

        trg_tk = trg_field.vocab.itos[ori_index]

        if trg_tk == '<unk>':
          attention_score2 = attention_for_alignment.squeeze(0).cpu().detach()

          #row index - 1 holds the attention that was used to predict the token at position index
          src_search = attention_score2[index - 1,:].argmax().item()

          roman_token = Romanizer(src_tokens[src_search]).romanize()

          trg_tokens.append(roman_token)

        else:
          trg_tokens.append(trg_tk)
    
    return trg_tokens[1:-1], attention.squeeze(0)
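
The romanization option is an attention-based copy: when the model emits `<unk>`, the source token with the highest attention weight for that step is transliterated and used instead. A torch-free sketch of that replacement step follows. `replace_unknowns` is a hypothetical helper, row t of `attention_rows` is assumed to be the attention that produced target token t, and `str.upper` merely stands in for a real romanizer.

```python
def replace_unknowns(target_tokens, attention_rows, source_tokens,
                     transliterate, unk="<unk>"):
    """Swap each unknown target token for the transliterated source token it attends to most."""
    output = []
    for token, row in zip(target_tokens, attention_rows):
        if token == unk:
            best_source = max(range(len(row)), key=row.__getitem__)
            token = transliterate(source_tokens[best_source])
        output.append(token)
    return output

result = replace_unknowns(
    ["i", "met", "<unk>"],
    [[0.7, 0.2, 0.1], [0.2, 0.2, 0.6], [0.1, 0.8, 0.1]],
    ["<sos>", "kim", "<eos>"],
    transliterate=str.upper)   # stand-in for a romanizer
```

The quality of the replacement depends entirely on how sharply the attention aligns the unknown target position with the right source token.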

In [None]:
from torchtext.data.metrics import bleu_score

def calculate_bleu_conv(data, src_field, trg_field, model, device, max_len = 50):
    
    trgs = []
    pred_trgs = []
    
    for datum in data:
        
        src = vars(datum)['src']
        trg = vars(datum)['trg']
        
        pred_trg, _ = translate_sentence_conv(src, src_field, trg_field, model, device, max_len)
        
        pred_trgs.append(pred_trg)
        trgs.append([trg])
        
    return bleu_score(pred_trgs, trgs)

In [None]:
#store the result under a new name so the imported bleu_score function is not overwritten
test_bleu_2 = calculate_bleu_conv(test_data_2, SRC_2, TRG_2, model_2, device)

print(f'BLEU score = {test_bleu_2*100:.2f}')

BLEU score = 38.11


### (10) Inference

In [None]:
sen_list = ['enter the Korean sentence to translate here']  #placeholder

### (10.1) Original sentence inference

In [None]:
#(no reverse) for each source sentence, print the plain English translation and the translation with <unk> tokens romanized
for sent in sen_list:
  translation, attention = translate_sentence_conv(sent, SRC_2, TRG_2, model_2, device, reverse = False, romanize = False)
  translation_r, attention_r = translate_sentence_conv(sent, SRC_2, TRG_2, model_2, device, reverse = False, romanize = True )
  translated_text = " ".join(translation)
  translated_text_r = " ".join(translation_r)

  print(f'source sent = {sent}')
  print(f'predicted plain = {translated_text}')
  print(f'predicted romanize = {translated_text_r}')
  print('\n')

In [None]:
def display_attention_conv(sentence, translation, attention, reverse = False):
    
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111)
    
    attention = attention.squeeze(0).cpu().detach().numpy()
    
    cax = ax.matshow(attention, cmap='bone')
   
    ax.tick_params(labelsize=12)
    if reverse:
      ax.set_xticklabels(['']+['<eos>']+[tok for tok in sentence.split(" ")][::-1]+['<sos>'], 
                       rotation=45)
    else:
      ax.set_xticklabels(['']+['<sos>']+[tok for tok in sentence.split(" ")]+['<eos>'], 
                           rotation=45)
    ax.set_yticklabels(['']+translation)

    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()
    plt.close()

In [None]:
#(no reverse)
#print the English translation of each source sentence with <unk> tokens romanized
#and display the attention between the two sentences

for sent in sen_list:
  translation_r, attention = translate_sentence_conv(sent, SRC_2, TRG_2, model_2, device, romanize = True)
  translated_text_r = " ".join(translation_r)
  print(f'source sent = {sent}')
  print(f'predicted romanize = {translated_text_r}')
  display_attention_conv(sent, translation_r, attention)
  print('\n')

### (10.2) Reversed sentence inference

In [None]:
#for each source sentence, print the plain English translation and the translation with <unk> tokens romanized
for sent in sen_list:
  translation, attention = translate_sentence_conv(sent, SRC_2, TRG_2, model_2, device, reverse = True, romanize = False)
  translation_r, attention_r = translate_sentence_conv(sent, SRC_2, TRG_2, model_2, device, reverse = True, romanize = True)
  translated_text = " ".join(translation)
  translated_text_r = " ".join(translation_r)

  print(f'source sent = {sent}')
  print(f'predicted plain = {translated_text}')
  print(f'predicted romanize = {translated_text_r}')
  print('\n')

In [None]:
#print the English translation of each source sentence with <unk> tokens romanized
#and display the attention between the two sentences

for sent in sen_list:
  translation_r, attention = translate_sentence_conv(sent, SRC_2, TRG_2, model_2, device, reverse = True, romanize = True)
  translated_text_r = " ".join(translation_r)
  print(f'source sent = {sent}')
  print(f'predicted romanize = {translated_text_r}')
  display_attention_conv(sent, translation_r, attention, reverse = True)
  print('\n')

### 3.3 Transformers
Reference: https://github.com/bentrevett/pytorch-seq2seq/blob/master/6%20-%20Attention%20is%20All%20You%20Need.ipynb

### (1) Defining the Fields

In [None]:
#define the Fields
#same tokenizers as the previous models
SRC_3 = Field(tokenize = tokenize_ko, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            batch_first = True)

TRG_3 = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            batch_first = True)



### (2) Creating the TranslationDataset

In [None]:
#load the shuffled data through torchtext's TranslationDataset

data_shuffled_3 = TranslationDataset(path = '/content/drive/MyDrive/dataset/ordered_data/', exts = ('ko_shuffle.txt', 'en_shuffle.txt'), fields = (SRC_3, TRG_3))



### (3) Train, Validation, Test Dataset

In [None]:
#split into train / validation / test sets at an 80:10:10 ratio
#random.seed() returns None, so seed first and pass the RNG state explicitly
random.seed(SEED)
train_data_3, test_data_3 = data_shuffled_3.split(split_ratio = 0.8, random_state = random.getstate())
random.seed(SEED)
valid_data_3, test_data_3 = test_data_3.split(split_ratio = 0.5, random_state = random.getstate())

In [None]:
#check the number of train, validation, and test examples
print(f"Number of training examples: {len(train_data_3.examples)}")
print(f"Number of validation examples: {len(valid_data_3.examples)}")
print(f"Number of testing examples: {len(test_data_3.examples)}")

Number of training examples: 264779
Number of validation examples: 33098
Number of testing examples: 33097


### (4) Building the Korean and English Vocabularies

In [None]:
#the transformer uses no pretrained word embeddings; its embedding vectors start from random initialization
SRC_3.build_vocab(train_data_3, min_freq=2)
TRG_3.build_vocab(train_data_3, min_freq=2)

In [None]:
#vocabulary sizes for Korean and English
print(f"Unique tokens in source (ko) vocabulary: {len(SRC_3.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG_3.vocab)}")

Unique tokens in source (ko) vocabulary: 16179
Unique tokens in target (en) vocabulary: 14245


### (5) Creating the BucketIterator

In [None]:
#define the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
#create the BucketIterators

BATCH_SIZE = 128

train_iterator_3, valid_iterator_3, test_iterator_3 = BucketIterator.splits(
    (train_data_3, valid_data_3, test_data_3), 
     batch_size = BATCH_SIZE,
     device = device)



### (6) Transformer Model






In [None]:
class Encoder(nn.Module):
    def __init__(self, 
                 input_dim, 
                 hid_dim, 
                 n_layers, 
                 n_heads, 
                 pf_dim,
                 dropout, 
                 device,
                 max_length = 100):
        super().__init__()

        self.device = device
        
        self.tok_embedding = nn.Embedding(input_dim, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        
        self.layers = nn.ModuleList([EncoderLayer(hid_dim, 
                                                  n_heads, 
                                                  pf_dim,
                                                  dropout, 
                                                  device) 
                                     for _ in range(n_layers)])
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
        
    def forward(self, src, src_mask):
        
        #src = [batch size, src len]
        #src_mask = [batch size, 1, 1, src len]
        
        batch_size = src.shape[0]
        src_len = src.shape[1]
        
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        
        #pos = [batch size, src len]
        
        src = self.dropout((self.tok_embedding(src) * self.scale) + self.pos_embedding(pos))
        
        #src = [batch size, src len, hid dim]
        
        for layer in self.layers:
            src = layer(src, src_mask)
            
        #src = [batch size, src len, hid dim]
            
        return src

In [None]:
class EncoderLayer(nn.Module):
    def __init__(self, 
                 hid_dim, 
                 n_heads, 
                 pf_dim,  
                 dropout, 
                 device):
        super().__init__()
        
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, 
                                                                     pf_dim, 
                                                                     dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src, src_mask):
        
        #src = [batch size, src len, hid dim]
        #src_mask = [batch size, 1, 1, src len] 
                
        #self attention
        _src, _ = self.self_attention(src, src, src, src_mask)
        
        #dropout, residual connection and layer norm
        src = self.self_attn_layer_norm(src + self.dropout(_src))
        
        #src = [batch size, src len, hid dim]
        
        #positionwise feedforward
        _src = self.positionwise_feedforward(src)
        
        #dropout, residual and layer norm
        src = self.ff_layer_norm(src + self.dropout(_src))
        
        #src = [batch size, src len, hid dim]
        
        return src


In [None]:
class MultiHeadAttentionLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout, device):
        super().__init__()
        
        assert hid_dim % n_heads == 0
        
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        self.head_dim = hid_dim // n_heads
        
        self.fc_q = nn.Linear(hid_dim, hid_dim)
        self.fc_k = nn.Linear(hid_dim, hid_dim)
        self.fc_v = nn.Linear(hid_dim, hid_dim)
        
        self.fc_o = nn.Linear(hid_dim, hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)
        
    def forward(self, query, key, value, mask = None):
        
        batch_size = query.shape[0]
        
        #query = [batch size, query len, hid dim]
        #key = [batch size, key len, hid dim]
        #value = [batch size, value len, hid dim]
                
        Q = self.fc_q(query)
        K = self.fc_k(key)
        V = self.fc_v(value)
        
        #Q = [batch size, query len, hid dim]
        #K = [batch size, key len, hid dim]
        #V = [batch size, value len, hid dim]
                
        Q = Q.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        K = K.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        V = V.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        
        #Q = [batch size, n heads, query len, head dim]
        #K = [batch size, n heads, key len, head dim]
        #V = [batch size, n heads, value len, head dim]
                
        energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        
        #energy = [batch size, n heads, query len, key len]
        
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)
        
        attention = torch.softmax(energy, dim = -1)
                
        #attention = [batch size, n heads, query len, key len]
                
        x = torch.matmul(self.dropout(attention), V)
        
        #x = [batch size, n heads, query len, head dim]
        
        x = x.permute(0, 2, 1, 3).contiguous()
        
        #x = [batch size, query len, n heads, head dim]
        
        x = x.view(batch_size, -1, self.hid_dim)
        
        #x = [batch size, query len, hid dim]
        
        x = self.fc_o(x)
        
        #x = [batch size, query len, hid dim]
        
        return x, attention
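
For a single head, the computation in `forward` reduces to `softmax(Q K^T / sqrt(head_dim)) V` with masked positions pushed to a very negative energy. The sketch below reproduces that for one head with plain Python lists; `softmax` and `attention` are hypothetical helpers written for illustration, and the learned projections (`fc_q`, `fc_k`, `fc_v`, `fc_o`) and dropout are left out.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, mask=None):
    #Q = [query len][d], K = [key len][d], V = [key len][d_v]
    d = len(Q[0])
    outputs, weights = [], []
    for q in Q:
        #scaled dot product between the query and every key
        energy = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        if mask is not None:
            #same trick as masked_fill(mask == 0, -1e10)
            energy = [e if keep else -1e10 for e, keep in zip(energy, mask)]
        w = softmax(energy)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
out, w = attention(Q, K, V, mask=[1, 1, 0])   #the third key is padding
print(w[0])
print(out[0])
```

The masked key receives zero weight, so its value vector does not contribute to the output even though its raw dot product with the query is as large as that of the first key.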

In [None]:
class PositionwiseFeedforwardLayer(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout):
        super().__init__()
        
        self.fc_1 = nn.Linear(hid_dim, pf_dim)
        self.fc_2 = nn.Linear(pf_dim, hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        
        #x = [batch size, seq len, hid dim]
        
        x = self.dropout(torch.relu(self.fc_1(x)))
        
        #x = [batch size, seq len, pf dim]
        
        x = self.fc_2(x)
        
        #x = [batch size, seq len, hid dim]
        
        return x

In [None]:
class Decoder(nn.Module):
    def __init__(self, 
                 output_dim, 
                 hid_dim, 
                 n_layers, 
                 n_heads, 
                 pf_dim, 
                 dropout, 
                 device,
                 max_length = 100):
        super().__init__()
        
        self.device = device
        
        self.tok_embedding = nn.Embedding(output_dim, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        
        self.layers = nn.ModuleList([DecoderLayer(hid_dim, 
                                                  n_heads, 
                                                  pf_dim, 
                                                  dropout, 
                                                  device)
                                     for _ in range(n_layers)])
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
        
    def forward(self, trg, enc_src, trg_mask, src_mask):
        
        #trg = [batch size, trg len]
        #enc_src = [batch size, src len, hid dim]
        #trg_mask = [batch size, 1, trg len, trg len]
        #src_mask = [batch size, 1, 1, src len]
                
        batch_size = trg.shape[0]
        trg_len = trg.shape[1]
        
        pos = torch.arange(0, trg_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
                            
        #pos = [batch size, trg len]
            
        trg = self.dropout((self.tok_embedding(trg) * self.scale) + self.pos_embedding(pos))
                
        #trg = [batch size, trg len, hid dim]
        
        for layer in self.layers:
            trg, attention = layer(trg, enc_src, trg_mask, src_mask)
        
        #trg = [batch size, trg len, hid dim]
        #attention = [batch size, n heads, trg len, src len]
        
        output = self.fc_out(trg)
        
        #output = [batch size, trg len, output dim]
            
        return output, attention

In [None]:
class DecoderLayer(nn.Module):
    def __init__(self, 
                 hid_dim, 
                 n_heads, 
                 pf_dim, 
                 dropout, 
                 device):
        super().__init__()
        
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.enc_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.encoder_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, 
                                                                     pf_dim, 
                                                                     dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, trg, enc_src, trg_mask, src_mask):
        
        #trg = [batch size, trg len, hid dim]
        #enc_src = [batch size, src len, hid dim]
        #trg_mask = [batch size, 1, trg len, trg len]
        #src_mask = [batch size, 1, 1, src len]
        
        #self attention
        _trg, _ = self.self_attention(trg, trg, trg, trg_mask)
        
        #dropout, residual connection and layer norm
        trg = self.self_attn_layer_norm(trg + self.dropout(_trg))
            
        #trg = [batch size, trg len, hid dim]
            
        #encoder attention
        _trg, attention = self.encoder_attention(trg, enc_src, enc_src, src_mask)
        
        #dropout, residual connection and layer norm
        trg = self.enc_attn_layer_norm(trg + self.dropout(_trg))
                    
        #trg = [batch size, trg len, hid dim]
        
        #positionwise feedforward
        _trg = self.positionwise_feedforward(trg)
        
        #dropout, residual and layer norm
        trg = self.ff_layer_norm(trg + self.dropout(_trg))
        
        #trg = [batch size, trg len, hid dim]
        #attention = [batch size, n heads, trg len, src len]
        
        return trg, attention

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, 
                 encoder, 
                 decoder, 
                 src_pad_idx, 
                 trg_pad_idx, 
                 device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device
        
    def make_src_mask(self, src):
        
        #src = [batch size, src len]
        
        src_mask = (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)

        #src_mask = [batch size, 1, 1, src len]

        return src_mask
    
    def make_trg_mask(self, trg):
        
        #trg = [batch size, trg len]
        
        trg_pad_mask = (trg != self.trg_pad_idx).unsqueeze(1).unsqueeze(2)
        
        #trg_pad_mask = [batch size, 1, 1, trg len]
        
        trg_len = trg.shape[1]
        
        trg_sub_mask = torch.tril(torch.ones((trg_len, trg_len), device = self.device)).bool()
        
        #trg_sub_mask = [trg len, trg len]
            
        trg_mask = trg_pad_mask & trg_sub_mask
        
        #trg_mask = [batch size, 1, trg len, trg len]
        
        return trg_mask

    def forward(self, src, trg):
        
        #src = [batch size, src len]
        #trg = [batch size, trg len]
                
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        
        #src_mask = [batch size, 1, 1, src len]
        #trg_mask = [batch size, 1, trg len, trg len]
        
        enc_src = self.encoder(src, src_mask)
        
        #enc_src = [batch size, src len, hid dim]
                
        output, attention = self.decoder(trg, enc_src, trg_mask, src_mask)
        
        #output = [batch size, trg len, output dim]
        #attention = [batch size, n heads, trg len, src len]
        
        return output, attention
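
The target mask combines two conditions: a key position must not be padding, and a query position may only look at itself and earlier positions. The sketch below shows the resulting pattern for one sentence using plain Python lists. `make_trg_mask` here is a hypothetical list-based stand-in for the method above, and the padding index `1` is an assumption.

```python
def make_trg_mask(trg, pad_idx):
    n = len(trg)
    pad_mask = [tok != pad_idx for tok in trg]
    #row i (query) may attend to column j (key) only if j <= i
    #and key j is not a padding token
    return [[pad_mask[j] and j <= i for j in range(n)] for i in range(n)]

PAD = 1  #assumed padding index
mask = make_trg_mask([2, 7, 9, PAD], PAD)
for row in mask:
    print([int(v) for v in row])
#[1, 0, 0, 0]
#[1, 1, 0, 0]
#[1, 1, 1, 0]
#[1, 1, 1, 0]
```

The padding mask removes whole columns, while the lower-triangular mask removes everything to the right of the diagonal; broadcasting `[batch size, 1, 1, trg len]` against `[trg len, trg len]` produces exactly this combination per sentence.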

### (7) Training



In [None]:
INPUT_DIM = len(SRC_3.vocab)
OUTPUT_DIM = len(TRG_3.vocab)
HID_DIM = 256
ENC_LAYERS = 3
DEC_LAYERS = 3
ENC_HEADS = 8
DEC_HEADS = 8
ENC_PF_DIM = 512
DEC_PF_DIM = 512
ENC_DROPOUT = 0.1
DEC_DROPOUT = 0.1

enc_3 = Encoder(INPUT_DIM, 
              HID_DIM, 
              ENC_LAYERS, 
              ENC_HEADS, 
              ENC_PF_DIM, 
              ENC_DROPOUT, 
              device)

dec_3 = Decoder(OUTPUT_DIM, 
              HID_DIM, 
              DEC_LAYERS, 
              DEC_HEADS, 
              DEC_PF_DIM, 
              DEC_DROPOUT, 
              device)

In [None]:
SRC_PAD_IDX_3 = SRC_3.vocab.stoi[SRC_3.pad_token]
TRG_PAD_IDX_3 = TRG_3.vocab.stoi[TRG_3.pad_token]

model_3 = Seq2Seq(enc_3, dec_3, SRC_PAD_IDX_3, TRG_PAD_IDX_3, device).to(device)

In [None]:
def initialize_weights(m):
    if hasattr(m, 'weight') and m.weight.dim() > 1:
        nn.init.xavier_uniform_(m.weight.data)
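
Xavier (Glorot) uniform initialization samples each weight from `U(-a, a)` with `a = gain * sqrt(6 / (fan_in + fan_out))`. The sketch below only evaluates that bound for the layer shapes used in this model; `xavier_uniform_bound` is a hypothetical helper, not a PyTorch function.

```python
import math

def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    #half-width of the uniform distribution used by Xavier initialization
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

#attention projections: Linear(256, 256)
print(f"{xavier_uniform_bound(256, 256):.4f}")
#first feedforward layer: Linear(256, 512)
print(f"{xavier_uniform_bound(256, 512):.4f}")
```

Because the hook checks `m.weight.dim() > 1`, it applies to every module with a 2-D weight, which includes the embedding tables as well as the linear layers; the 1-D `LayerNorm` weights are left at their defaults.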

In [None]:
model_3.apply(initialize_weights)

Seq2Seq(
  (encoder): Encoder(
    (tok_embedding): Embedding(16179, 256)
    (pos_embedding): Embedding(100, 256)
    (layers): ModuleList(
      (0): EncoderLayer(
        (self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (ff_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (self_attention): MultiHeadAttentionLayer(
          (fc_q): Linear(in_features=256, out_features=256, bias=True)
          (fc_k): Linear(in_features=256, out_features=256, bias=True)
          (fc_v): Linear(in_features=256, out_features=256, bias=True)
          (fc_o): Linear(in_features=256, out_features=256, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (positionwise_feedforward): PositionwiseFeedforwardLayer(
          (fc_1): Linear(in_features=256, out_features=512, bias=True)
          (fc_2): Linear(in_features=512, out_features=256, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
    

In [None]:
#count the trainable parameters of the model
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model_3):,} trainable parameters')

The model has 15,454,373 trainable parameters
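
The reported total can be reproduced from the hyperparameters alone. The sketch below adds up the weights and biases of each module for the vocabulary sizes and dimensions used above; all function names are hypothetical helpers introduced for this check.

```python
def linear(n_in, n_out):
    #weight matrix plus bias vector
    return n_in * n_out + n_out

def layer_norm(dim):
    #elementwise scale and shift
    return 2 * dim

def attention_block(hid):
    #fc_q, fc_k, fc_v, fc_o
    return 4 * linear(hid, hid)

def feedforward(hid, pf):
    return linear(hid, pf) + linear(pf, hid)

def encoder(vocab, hid, pf, n_layers, max_len=100):
    layer = 2 * layer_norm(hid) + attention_block(hid) + feedforward(hid, pf)
    return vocab * hid + max_len * hid + n_layers * layer

def decoder(vocab, hid, pf, n_layers, max_len=100):
    layer = 3 * layer_norm(hid) + 2 * attention_block(hid) + feedforward(hid, pf)
    return vocab * hid + max_len * hid + n_layers * layer + linear(hid, vocab)

total = encoder(16179, 256, 512, 3) + decoder(14245, 256, 512, 3)
print(f"{total:,}")   #15,454,373
```

Roughly three quarters of the parameters sit in the two token embedding tables and the output projection `fc_out`, so the vocabulary sizes dominate the model size at this depth.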


In [None]:
#use the Adam optimizer
LEARNING_RATE = 0.0005

optimizer_3 = torch.optim.Adam(model_3.parameters(), lr = LEARNING_RATE)

In [None]:
#use the cross-entropy loss function
criterion_3 = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX_3) #<pad> tokens are ignored when computing the loss
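
With `ignore_index`, positions whose target is the padding token contribute nothing to the loss, and the mean is taken over the remaining positions only. The plain-Python sketch below illustrates that behaviour; `cross_entropy` is a hypothetical helper, and the padding index `1` is an assumption.

```python
import math

def cross_entropy(logits, targets, ignore_index):
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  #padding positions are skipped entirely
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        #negative log-probability of the correct class
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)

PAD = 1  #assumed padding index
logits = [[0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0],
          [9.0, -9.0, 9.0, -9.0]]   #prediction at a padded position
targets = [2, 3, PAD]
print(cross_entropy(logits, targets, ignore_index=PAD))
print(math.log(4))
```

The third row has a badly wrong prediction for its target, but because that target is padding the loss equals `log(4)`, the value for the two uniform rows alone.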

In [None]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg
        
        optimizer.zero_grad()
        
        output, _ = model(src, trg[:,:-1])
                
        #output = [batch size, trg len - 1, output dim]
        #trg = [batch size, trg len]
            
        output_dim = output.shape[-1]
            
        output = output.contiguous().view(-1, output_dim)
        trg = trg[:,1:].contiguous().view(-1)
                
        #output = [batch size * (trg len - 1), output dim]
        #trg = [batch size * (trg len - 1)]
            
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
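
The slices `trg[:,:-1]` and `trg[:,1:]` implement the usual one-step shift between decoder input and prediction target. A toy example with a single unpadded sentence (the tokens are made up for illustration):

```python
trg = ["<sos>", "i", "am", "a", "student", "<eos>"]

decoder_input = trg[:-1]   #what the decoder sees
labels = trg[1:]           #what it has to predict

for step, gold in enumerate(labels):
    #at each step the decoder sees the gold prefix and predicts the next token
    print(decoder_input[:step + 1], "->", gold)
```

Every position is trained to predict the token that follows it, and the causal mask guarantees that position `step` cannot see `labels[step]` in its own input.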

In [None]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg

            output, _ = model(src, trg[:,:-1])
            
            #output = [batch size, trg len - 1, output dim]
            #trg = [batch size, trg len]
            
            output_dim = output.shape[-1]
            
            output = output.contiguous().view(-1, output_dim)
            trg = trg[:,1:].contiguous().view(-1)
            
            #output = [batch size * (trg len - 1), output dim]
            #trg = [batch size * (trg len - 1)]
            
            loss = criterion(output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [None]:
#compute the time taken per epoch
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
N_EPOCHS = 5
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model_3, train_iterator_3, optimizer_3, criterion_3, CLIP)
    valid_loss = evaluate(model_3, valid_iterator_3, criterion_3)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model_3.state_dict(), '/content/drive/MyDrive/dataset/transformer-word2vec-fasttext-ordered-data.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')



Epoch: 01 | Time: 3m 21s
	Train Loss: 2.580 | Train PPL:  13.202
	 Val. Loss: 1.580 |  Val. PPL:   4.854
Epoch: 02 | Time: 3m 25s
	Train Loss: 1.530 | Train PPL:   4.620
	 Val. Loss: 1.272 |  Val. PPL:   3.566
Epoch: 03 | Time: 3m 24s
	Train Loss: 1.233 | Train PPL:   3.433
	 Val. Loss: 1.136 |  Val. PPL:   3.115
Epoch: 04 | Time: 3m 24s
	Train Loss: 1.065 | Train PPL:   2.901
	 Val. Loss: 1.066 |  Val. PPL:   2.903
Epoch: 05 | Time: 3m 25s
	Train Loss: 0.954 | Train PPL:   2.596
	 Val. Loss: 1.010 |  Val. PPL:   2.746
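
Perplexity in this log is simply the exponential of the mean cross-entropy loss. The check below recomputes it from the epoch 5 losses; since the logged losses are already rounded to three decimals, the recomputed values can differ from the logged perplexities in the last digit.

```python
import math

def perplexity(mean_loss):
    #exp of the average per-token negative log-likelihood
    return math.exp(mean_loss)

for name, loss in [("train", 0.954), ("valid", 1.010)]:
    print(f"{name}: loss {loss:.3f} -> PPL {perplexity(loss):.3f}")
```

A perplexity of about 2.7 means that, on average, the model is about as uncertain as if it were choosing uniformly among 2.7 tokens at each target position.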


### (8) Testing

In [None]:
model_3.load_state_dict(torch.load('/content/drive/MyDrive/dataset/transformer-word2vec-fasttext-ordered-data.pt'))

test_loss = evaluate(model_3, test_iterator_3, criterion_3)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')



| Test Loss: 1.019 | Test PPL:   2.770 |


### (9) BLEU score

In [None]:
def translate_sentence_transformer(sentence, src_field, trg_field, model, device, max_len = 50, reverse = False, romanize = False):
    
    model.eval()
        
    if isinstance(sentence, list):
        #pre-tokenized input is used as-is (reverse is not applied)
        tokens = sentence
    elif isinstance(sentence, str):
        tokens = sentence.split(" ")
        if reverse:
            tokens = tokens[::-1]

    tokens = [src_field.init_token] + tokens + [src_field.eos_token]
        
    src_indexes = [src_field.vocab.stoi[token] for token in tokens]

    src_tokens = [src_field.vocab.itos[i] for i in src_indexes]

    src_tensor = torch.LongTensor(src_indexes).unsqueeze(0).to(device)
    
    src_mask = model.make_src_mask(src_tensor)
    
    with torch.no_grad():
        enc_src = model.encoder(src_tensor, src_mask)

    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]

    for i in range(max_len):

        trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(device)

        trg_mask = model.make_trg_mask(trg_tensor)
        
        with torch.no_grad():
            output, attention = model.decoder(trg_tensor, enc_src, trg_mask, src_mask)
        
        pred_token = output.argmax(2)[:,-1].item()
        
        trg_indexes.append(pred_token)

        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break

    if romanize == False: 
        trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]
    
    elif romanize == True:
        trg_tokens = []

        attention_for_alignment = attention
    
        for index, ori_index in enumerate(trg_indexes):

          trg_tk = trg_field.vocab.itos[ori_index]

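          if trg_tk == '<unk>':
              attention_score2 = attention_for_alignment.squeeze(0).cpu().detach()

              #average the attention over heads: [trg len, src len]
              attention_mean = attention_score2.mean(0)

              #attention row j predicts trg_indexes[j + 1], so the token at `index` was produced by row index - 1
              src_search = attention_mean[index - 1,:].argmax()

              src_search_index = src_search.item()

              for_roman = src_tokens[src_search_index]

              clean_fr = re.compile('[가-힣]+').findall(for_roman) #keep only complete Hangul syllables; drops standalone jamo such as the 'ㄴ' in 'ㄴ다'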

              roman_token = Romanizer(''.join(clean_fr)).romanize()

              trg_tokens.append(roman_token)

          else:
              trg_tokens.append(trg_tk)   
    
    return trg_tokens[1:-1], attention
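
The `<unk>` handling boils down to a simple alignment rule: for each unknown target token, take the source token that received the highest head-averaged attention when that target token was predicted. The sketch below shows the rule on made-up numbers with plain lists. `replace_unk` is a hypothetical helper, the attention values are invented, and the romanization step is left out, so the source token is copied verbatim.

```python
def replace_unk(trg_tokens, src_tokens, attention, unk="<unk>"):
    #trg_tokens: predicted tokens without <sos>
    #attention[j][k]: head-averaged weight on source token k
    #                 when trg_tokens[j] was predicted
    out = []
    for j, tok in enumerate(trg_tokens):
        if tok == unk:
            row = attention[j]
            best = max(range(len(row)), key=row.__getitem__)
            tok = src_tokens[best]
        out.append(tok)
    return out

src = ["<sos>", "철수", "학교", "가다", "<eos>"]
pred = ["<unk>", "goes", "to", "school"]
attn = [
    [0.05, 0.80, 0.05, 0.05, 0.05],
    [0.05, 0.10, 0.10, 0.70, 0.05],
    [0.05, 0.05, 0.60, 0.25, 0.05],
    [0.05, 0.05, 0.80, 0.05, 0.05],
]
print(replace_unk(pred, src, attn))
```

In the function above, the selected source token is additionally filtered to complete Hangul syllables and passed through `Romanizer` before being appended.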

In [None]:
from torchtext.data.metrics import bleu_score

def calculate_bleu_transformer(data, src_field, trg_field, model, device, max_len = 50):
    
    trgs = []
    pred_trgs = []
    
    for datum in data:
        
        src = vars(datum)['src']
        trg = vars(datum)['trg']
        
        pred_trg, _ = translate_sentence_transformer(src, src_field, trg_field, model, device, max_len)
        
        pred_trgs.append(pred_trg)
        trgs.append([trg])
        
    return bleu_score(pred_trgs, trgs)

In [None]:
#use a separate name so the imported bleu_score function is not overwritten
bleu_3 = calculate_bleu_transformer(test_data_3, SRC_3, TRG_3, model_3, device)

print(f'BLEU score = {bleu_3*100:.2f}')

BLEU score = 43.92
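
For reference, corpus-level BLEU is the geometric mean of clipped n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. The sketch below is a compact plain-Python version of that definition, without smoothing. `ngrams` and `corpus_bleu` are hypothetical helpers, and the sketch is not guaranteed to match torchtext's `bleu_score` in every edge case.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references_list, max_n=4):
    clipped = [0] * max_n
    total = [0] * max_n
    cand_len = ref_len = 0
    for cand, refs in zip(candidates, references_list):
        cand_len += len(cand)
        #reference length closest to the candidate length
        ref_len += min((len(r) for r in refs),
                       key=lambda length: (abs(length - len(cand)), length))
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)   #max count over references
            clipped[n - 1] += sum((cand_counts & max_ref).values())
            total[n - 1] += sum(cand_counts.values())
    if min(clipped) == 0:
        return 0.0
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    brevity = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * math.exp(log_precision)

ref = ["the", "cat", "sat", "on", "the", "mat"]
print(corpus_bleu([ref], [[ref]]))
print(corpus_bleu([["a", "dog", "ran", "away", "fast"]], [[ref]]))
```

The reference list is nested one level deeper than the candidate list because each candidate may have several references, which is why `calculate_bleu_transformer` appends `[trg]` rather than `trg`.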


### (10) Inference 

In [None]:
#replace the placeholder with the Korean sentence(s) to translate
sen_list = ['enter the sentence to translate here']

### (10.1) Original sentence inference

In [None]:
#(no reverse) for each source sentence, print the English translation and the translation with <unk> tokens romanized
for sent in sen_list:
  translation, attention = translate_sentence_transformer(sent, SRC_3, TRG_3, model_3, device, reverse = False, romanize = False)
  translation_r, attention_r = translate_sentence_transformer(sent, SRC_3, TRG_3, model_3, device, reverse = False, romanize = True)
  translated_text = " ".join(translation)
  translated_text_r = " ".join(translation_r)

  print(f'source sent = {sent}')
  print(f'predicted plain = {translated_text}')
  print(f'predicted romanize = {translated_text_r}')
  print('\n')

In [None]:
def display_attention_transformer(sentence, translation, attention, n_heads = 8, n_rows = 4, n_cols = 2, reverse = False):
    
    assert n_rows * n_cols == n_heads
    
    fig = plt.figure(figsize=(15,25))
    
    for i in range(n_heads):
        
        ax = fig.add_subplot(n_rows, n_cols, i+1)
        
        _attention = attention.squeeze(0)[i].cpu().detach().numpy()

        cax = ax.matshow(_attention, cmap='bone')

        ax.tick_params(labelsize=12)
        #the model always receives <sos> + tokens + <eos>; reverse only flips the tokens in between
        src_labels = sentence.split(" ")[::-1] if reverse else sentence.split(" ")
        ax.set_xticklabels(['']+['<sos>']+src_labels+['<eos>'], 
                           rotation=45)
        ax.set_yticklabels(['']+translation)

        ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
        ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()
    plt.close()

In [None]:
#(no reverse)
#for each source sentence, print the English translation with <unk> tokens romanized
#and display the attention between the two sentences

for sent in sen_list:
  translation, attention = translate_sentence_transformer(sent, SRC_3, TRG_3, model_3, device, reverse = False, romanize = True)
  translated_text_r = " ".join(translation)
  print(f'source sent = {sent}')
  print(f'predicted romanize = {translated_text_r}')
  display_attention_transformer(sent, translation, attention)
  print('\n')

### (10.2) Reversed sentence inference

In [None]:
#feed the inference sentences in reversed token order
#for each source sentence, print the English translation and the translation with <unk> tokens romanized
for sent in sen_list:
  translation, attention = translate_sentence_transformer(sent, SRC_3, TRG_3, model_3, device, reverse = True, romanize = False)
  translation_r, attention_r = translate_sentence_transformer(sent, SRC_3, TRG_3, model_3, device, reverse = True, romanize = True)
  translated_text = " ".join(translation)
  translated_text_r = " ".join(translation_r)

  print(f'source sent = {sent}')
  print(f'predicted plain = {translated_text}')
  print(f'predicted romanize = {translated_text_r}')
  print('\n')

In [None]:
#for each source sentence, print the English translation with <unk> tokens romanized
#and display the attention between the two sentences

for sent in sen_list:
  translation, attention = translate_sentence_transformer(sent, SRC_3, TRG_3, model_3, device, reverse = True, romanize = True)
  translated_text_r = " ".join(translation)
  print(f'source sent = {sent}')
  print(f'predicted romanize = {translated_text_r}')
  display_attention_transformer(sent, translation, attention, reverse = True)
  print('\n')