## 12. 번역가는 대화에도 능하다

****12-1. 들어가며****

### ****12-2. 번역 데이터 준비****

**영어-스페인어 데이터를 사용**

In [1]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Apr 12 01:04:38 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
[K     |████████████████████████████████| 1.2 MB 4.9 MB/s 
[?25hInstalling collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96


In [4]:
import numpy as np
import pandas as pd
import tensorflow as tf
import sentencepiece as spm
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction

import re
import os
import random
import math

from tqdm.notebook import tqdm
import matplotlib.pyplot as plt

print(tf.__version__)

2.8.0


영어-스페인어 데이터를 다운로드

In [5]:
zip_path = tf.keras.utils.get_file(
    'spa-eng.zip',
    origin='http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip',
    extract=True
)

Downloading data from http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip


In [6]:
file_path = os.path.dirname(zip_path)+"/spa-eng/spa.txt"

with open(file_path, "r", encoding='UTF-8') as f:
    spa_eng_sentences = f.read().splitlines()

spa_eng_sentences = list(set(spa_eng_sentences)) 
total_sentence_count = len(spa_eng_sentences)
print("Example:", total_sentence_count)

for sen in spa_eng_sentences[0:100][::20]: 
    print(">>", sen)

Example: 118964
>> This is by far the best seafood restaurant in this area.	Este es por lejos el mejor restaurante de mariscos en el área.
>> Where did you stay?	¿En dónde te quedaste?
>> She has a big family.	Ella tiene una familia numerosa.
>> Why don't you trust Tom?	¿Por qué no confiás en Tom?
>> It might sound crazy, but I think I'm still in love with Mary.	Puede que parezca una locura, pero creo que todavía estoy enamorado de Mary.


In [7]:
def preprocess_sentence(sentence):
    sentence = sentence.lower()
    sentence = re.sub(r'[" "]+', " ", sentence)
    sentence = sentence.strip()
    return sentence

In [8]:
spa_eng_sentences = list(map(preprocess_sentence, spa_eng_sentences))

In [9]:
test_sentence_count = total_sentence_count // 200
print("Test Size: ", test_sentence_count)
print("\n")

train_spa_eng_sentences = spa_eng_sentences[:-test_sentence_count]
test_spa_eng_sentences = spa_eng_sentences[-test_sentence_count:]
print("Train Example:", len(train_spa_eng_sentences))
for sen in train_spa_eng_sentences[0:100][::20]: 
    print(">>", sen)
print("\n")
print("Test Example:", len(test_spa_eng_sentences))
for sen in test_spa_eng_sentences[0:100][::20]: 
    print(">>", sen)

Test Size:  594


Train Example: 118370
>> this is by far the best seafood restaurant in this area.	este es por lejos el mejor restaurante de mariscos en el área.
>> where did you stay?	¿en dónde te quedaste?
>> she has a big family.	ella tiene una familia numerosa.
>> why don't you trust tom?	¿por qué no confiás en tom?
>> it might sound crazy, but i think i'm still in love with mary.	puede que parezca una locura, pero creo que todavía estoy enamorado de mary.


Test Example: 594
>> it is better for you to do it now.	es mejor que lo hagas ahora.
>> tom yelled at mary.	tom gritó a mary.
>> what are you crunching on?	¿qué estás mascando?
>> what kind of food do you eat?	¿qué tipo de comida tomas?
>> i have no friends.	no tengo amigos.


영어 문장과 스페인어 문장이 tab으로 연결되어 있으니 split('\t')을 사용. tab 이전이 영어, 이후가 스페인어 문장

In [10]:
def split_spa_eng_sentences(spa_eng_sentences):
    spa_sentences = []
    eng_sentences = []
    for spa_eng_sentence in tqdm(spa_eng_sentences):
        eng_sentence, spa_sentence = spa_eng_sentence.split('\t')
        spa_sentences.append(spa_sentence)
        eng_sentences.append(eng_sentence)
    return eng_sentences, spa_sentences

In [11]:
train_eng_sentences, train_spa_sentences = split_spa_eng_sentences(train_spa_eng_sentences)
print(len(train_eng_sentences))
print(train_eng_sentences[0])
print('\n')
print(len(train_spa_sentences))
print(train_spa_sentences[0])

  0%|          | 0/118370 [00:00<?, ?it/s]

118370
this is by far the best seafood restaurant in this area.


118370
este es por lejos el mejor restaurante de mariscos en el área.


In [12]:
test_eng_sentences, test_spa_sentences = split_spa_eng_sentences(test_spa_eng_sentences)
print(len(test_eng_sentences))
print(test_eng_sentences[0])
print('\n')
print(len(test_spa_sentences))
print(test_spa_sentences[0])

  0%|          | 0/594 [00:00<?, ?it/s]

594
it is better for you to do it now.


594
es mejor que lo hagas ahora.


In [13]:
type(test_eng_sentences)

list

In [14]:
def generate_tokenizer(corpus,
                       vocab_size,
                       lang="spa-eng",
                       pad_id=0,   # pad token의 일련번호
                       bos_id=1,  # 문장의 시작을 의미하는 bos token(<s>)의 일련번호
                       eos_id=2,  # 문장의 끝을 의미하는 eos token(</s>)의 일련번호
                       unk_id=3):   # unk token의 일련번호
    file = "./%s_corpus.txt" % lang
    model = "%s_spm" % lang

    with open(file, 'w', encoding='UTF-8') as f:
        for row in corpus: f.write(str(row) + '\n')

    import sentencepiece as spm
    spm.SentencePieceTrainer.Train(
        '--input=./%s --model_prefix=%s --vocab_size=%d'\
        % (file, model, vocab_size) + \
        '--pad_id==%d --bos_id=%d --eos_id=%d --unk_id=%d'\
        % (pad_id, bos_id, eos_id, unk_id)
    )

    tokenizer = spm.SentencePieceProcessor()
    tokenizer.Load('%s.model' % model)

    return tokenizer

두 언어가 단어 사전을 공유. 단어 사전 수는 20,000

In [15]:
VOCAB_SIZE = 20000
tokenizer = generate_tokenizer(train_eng_sentences + train_spa_sentences, VOCAB_SIZE, 'spa-eng')
tokenizer.set_encode_extra_options("bos:eos")  # 문장 양 끝에 <s> , </s> 추가

True

In [16]:
def make_corpus(sentences, tokenizer):
    corpus = []
    for sentence in tqdm(sentences):
        tokens = tokenizer.encode_as_ids(sentence)
        corpus.append(tokens)
    return corpus

In [17]:
eng_corpus = make_corpus(train_eng_sentences, tokenizer)
spa_corpus = make_corpus(train_spa_sentences, tokenizer)

  0%|          | 0/118370 [00:00<?, ?it/s]

  0%|          | 0/118370 [00:00<?, ?it/s]

In [18]:
print(train_eng_sentences[0])
print(eng_corpus[0])
print('\n')
print(train_spa_sentences[0])
print(spa_corpus[0])

this is by far the best seafood restaurant in this area.
[1, 41, 21, 143, 794, 9, 492, 9165, 1451, 28, 41, 3352, 0, 2]


este es por lejos el mejor restaurante de mariscos en el área.
[1, 100, 25, 43, 1035, 19, 261, 1455, 13, 10907, 24, 19, 3801, 0, 2]


In [19]:
MAX_LEN = 50
enc_ndarray = tf.keras.preprocessing.sequence.pad_sequences(eng_corpus, maxlen=MAX_LEN, padding='post')
dec_ndarray = tf.keras.preprocessing.sequence.pad_sequences(spa_corpus, maxlen=MAX_LEN, padding='post')

In [20]:
BATCH_SIZE = 64
train_dataset = tf.data.Dataset.from_tensor_slices((enc_ndarray, dec_ndarray)).batch(batch_size=BATCH_SIZE)

### ****12-3. 번역 모델 만들기****

****트랜스포머 구현하기****

**[위키독스: 트랜스포머](https://wikidocs.net/31379)**

**[Trax: Transformer](https://github.com/google/trax/blob/master/trax/models/transformer.py)**

**[Tensorflow: Transformer](https://www.tensorflow.org/text/tutorials/transformer)**

Encoder와 Decoder 각각의 Embedding과 출력층의 Linear, 총 3개의 레이어가 Weight를 공유

In [None]:
# transformer = Transformer(
#     n_layers=2,
#     d_model=512,
#     n_heads=8,
#     d_ff=2048,
#     src_vocab_size=VOCAB_SIZE,
#     tgt_vocab_size=VOCAB_SIZE,
#     pos_len=200,
#     dropout=0.3,
#     shared_fc=True,
#     shared_emb=True)

Positional Encoding

In [21]:
# Positional Encoding 구현
def positional_encoding(pos, d_model):
    # TODO: 코드 구현
    def cal_angle(position, i):
        return position / np.power(10000, int(i) / d_model)

    def get_posi_angle_vec(position):
        return [cal_angle(position, i) for i in range(d_model)]

    sinusoid_table = np.array([get_posi_angle_vec(pos_i) for pos_i in range(pos)])
    sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2])
    sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2])
    return sinusoid_table

마스크 생성

In [22]:
# Mask  생성하기
def generate_padding_mask(seq):
    # TODO: 구현
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)    
    return seq[:, tf.newaxis, tf.newaxis, :]

def generate_lookahead_mask(size):
    # TODO: 구현
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask

def generate_masks(src, tgt):
    # TODO: 구현
    enc_mask = generate_padding_mask(src)
    dec_enc_mask = generate_padding_mask(src)

    dec_lookahead_mask = generate_lookahead_mask(tgt.shape[1])
    dec_tgt_padding_mask = generate_padding_mask(tgt)
    dec_mask = tf.maximum(dec_tgt_padding_mask, dec_lookahead_mask)

    return enc_mask, dec_enc_mask, dec_mask

Multi-head Attention

In [23]:
# Multi Head Attention 구현
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        
        self.depth = d_model // self.num_heads
        
        self.W_q = tf.keras.layers.Dense(d_model)
        self.W_k = tf.keras.layers.Dense(d_model)
        self.W_v = tf.keras.layers.Dense(d_model)
        
        self.linear = tf.keras.layers.Dense(d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask):
        # TODO: 구현
        d_k = tf.cast(K.shape[-1], tf.float32)

        """
        Scaled QK 값 구하기
        """
        QK = tf.matmul(Q, K, transpose_b=True)

        scaled_qk = QK / tf.math.sqrt(d_k)

        if mask is not None: scaled_qk += (mask * -1e9) 

        """
        1. Attention Weights 값 구하기 -> attentions
        2. Attention 값을 V에 곱하기 -> out
        """ 
        attentions = tf.nn.softmax(scaled_qk, axis=-1)
        out = tf.matmul(attentions, V)        
        return out, attentions
        
    def split_heads(self, x):
        # TODO: 구현
        """
        Embedding을 Head의 수로 분할하는 함수

        x: [ batch x length x emb ]
        return: [ batch x length x heads x self.depth ]
        """
        batch_size = x.shape[0]
        split_x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        split_x = tf.transpose(split_x, perm=[0, 2, 1, 3])

        return split_x

    def combine_heads(self, x):
        # TODO: 구현
        """
        분할된 Embedding을 하나로 결합하는 함수

        x: [ batch x length x heads x self.depth ]
        return: [ batch x length x emb ]
        """
        batch_size = x.shape[0]
        combined_x = tf.transpose(x, perm=[0, 2, 1, 3])
        combined_x = tf.reshape(combined_x, (batch_size, -1, self.d_model))        
        return combined_x

    def call(self, Q, K, V, mask):
        # TODO: 구현
        
        WQ = self.W_q(Q)
        WK = self.W_k(K)
        WV = self.W_v(V)
        
        WQ_splits = self.split_heads(WQ)
        WK_splits = self.split_heads(WK)
        WV_splits = self.split_heads(WV)
        
        out, attention_weights = self.scaled_dot_product_attention(
            WQ_splits, WK_splits, WV_splits, mask)
                        
        out = self.combine_heads(out)
        out = self.linear(out)


        return out, attention_weights
print("슝=3")

슝=3


Position-wise Feed Forward Network

In [24]:
# Position-wise Feed Forward Network 구현
class PoswiseFeedForwardNet(tf.keras.layers.Layer):
    def __init__(self, d_model, d_ff):
        super(PoswiseFeedForwardNet, self).__init__()
        self.d_model = d_model
        self.d_ff = d_ff

        self.fc1 = tf.keras.layers.Dense(d_ff, activation='relu')
        self.fc2 = tf.keras.layers.Dense(d_model)

    def call(self, x):
        out = self.fc1(x)
        out = self.fc2(out)
        return out

Encoder Layer

In [25]:
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, n_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()

        self.enc_self_attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = PoswiseFeedForwardNet(d_model, d_ff)

        self.norm_1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm_2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

#         self.dropout = tf.keras.layers.Dropout(dropout)
        self.dropout_1 = tf.keras.layers.Dropout(dropout)
        self.dropout_2 = tf.keras.layers.Dropout(dropout)
        
    def call(self, x, training, mask):

        """
        Multi-Head Attention
        """
        residual = x
        out = self.norm_1(x)
        out, enc_attn = self.enc_self_attn(out, out, out, mask)
        out = self.dropout_1(out, training=training)
        out += residual
        
        """
        Position-Wise Feed Forward Network
        """
        residual = out
        out = self.norm_2(out)
        out = self.ffn(out)
        out = self.dropout_2(out, training=training)
        out += residual
        
        return out, enc_attn


Decoder Layer

In [26]:
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()

        self.dec_self_attn = MultiHeadAttention(d_model, num_heads)
        self.enc_dec_attn = MultiHeadAttention(d_model, num_heads)

        self.ffn = PoswiseFeedForwardNet(d_model, d_ff)

        self.norm_1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm_2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.norm_3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout_1 = tf.keras.layers.Dropout(dropout)
        self.dropout_2 = tf.keras.layers.Dropout(dropout)
        self.dropout_3 = tf.keras.layers.Dropout(dropout)
    
    def call(self, x, enc_out, training, causality_mask, padding_mask):

        """
        Masked Multi-Head Attention
        """
        residual = x
        out = self.norm_1(x)
        out, dec_attn = self.dec_self_attn(out, out, out, padding_mask)
        out = self.dropout_1(out, training=training)
        out += residual
        
        """
        Multi-Head Attention
        """
        residual = out
        out = self.norm_2(out)
        out, dec_enc_attn = self.enc_dec_attn(out, enc_out, enc_out, causality_mask)
        out = self.dropout_2(out, training=training)
        out += residual
        
        """
        Position-Wise Feed Forward Network
        """
        residual = out
        out = self.norm_3(out)
        out = self.ffn(out)
        out = self.dropout_3(out, training=training)
        out += residual

        return out, dec_attn, dec_enc_attn



Encoder

In [27]:
class Encoder(tf.keras.Model):
    def __init__(self,
                 n_layers,
                 d_model,
                 n_heads,
                 d_ff,
                 dropout):
        super(Encoder, self).__init__()
        self.n_layers = n_layers
        self.enc_layers = [EncoderLayer(d_model, n_heads, d_ff, dropout) 
                        for _ in range(n_layers)]
        
    def call(self, x, training, mask):
        out = x
    
        enc_attns = list()
        for i in range(self.n_layers):
            out, enc_attn = self.enc_layers[i](out, training, mask)
            enc_attns.append(enc_attn)
        
        return out, enc_attns

Decoder

In [28]:
class Decoder(tf.keras.Model):
    def __init__(self,
                 n_layers,
                 d_model,
                 n_heads,
                 d_ff,
                 dropout):
        super(Decoder, self).__init__()
        self.n_layers = n_layers
        self.dec_layers = [DecoderLayer(d_model, n_heads, d_ff, dropout) 
                            for _ in range(n_layers)]
                            
                            
    def call(self, x, enc_out, training, causality_mask, padding_mask):
        out = x
    
        dec_attns = list()
        dec_enc_attns = list()
        for i in range(self.n_layers):
            out, dec_attn, dec_enc_attn = \
            self.dec_layers[i](out, enc_out, training, causality_mask, padding_mask)

            dec_attns.append(dec_attn)
            dec_enc_attns.append(dec_enc_attn)

        return out, dec_attns, dec_enc_attns

Transformer 전체 모델 조립

In [29]:
class Transformer(tf.keras.Model):
    def __init__(self,
                    n_layers,
                    d_model,
                    n_heads,
                    d_ff,
                    src_vocab_size,
                    tgt_vocab_size,
                    pos_len,
                    dropout=0.2,
                    shared_fc=True,
                    shared_emb=False):
        super(Transformer, self).__init__()
        
        self.d_model = tf.cast(d_model, tf.float32)

        if shared_emb:
            self.enc_emb = self.dec_emb = \
            tf.keras.layers.Embedding(src_vocab_size, d_model)
        else:
            self.enc_emb = tf.keras.layers.Embedding(src_vocab_size, d_model)
            self.dec_emb = tf.keras.layers.Embedding(tgt_vocab_size, d_model)

        self.pos_encoding = positional_encoding(pos_len, d_model)
        self.dropout = tf.keras.layers.Dropout(dropout)

        self.encoder = Encoder(n_layers, d_model, n_heads, d_ff, dropout)
        self.decoder = Decoder(n_layers, d_model, n_heads, d_ff, dropout)

        self.fc = tf.keras.layers.Dense(tgt_vocab_size)

        self.shared_fc = shared_fc

        if shared_fc:
            self.fc.set_weights(tf.transpose(self.dec_emb.weights))


    def embedding(self, emb, x, training):
        # TODO: 구현
        """
        입력된 정수 배열을 Embedding + Pos Encoding
        + Shared일 경우 Scaling 작업 포함

        x: [ batch x length ]
        return: [ batch x length x emb ]
        """
        seq_len = x.shape[1]

        out = emb(x)

        if self.shared_fc: out *= tf.math.sqrt(self.d_model)

        out += self.pos_encoding[np.newaxis, ...][:, :seq_len, :]
        out = self.dropout(out, training)

        return out

    def call(self, enc_in, dec_in, enc_mask, causality_mask, dec_mask, training):
        # TODO: 구현
        """
        아래 순서에 따라 소스를 작성하세요.

        Step 1: Embedding(enc_in, dec_in) -> enc_in, dec_in
        Step 2: Encoder(enc_in, enc_mask) -> enc_out, enc_attns
        Step 3: Decoder(dec_in, enc_out, mask)
                -> dec_out, dec_attns, dec_enc_attns
        Step 4: Out Linear(dec_out) -> logits
        """
        enc_in = self.embedding(self.enc_emb, enc_in, training)
        dec_in = self.embedding(self.dec_emb, dec_in, training)

        enc_out, enc_attns = self.encoder(enc_in, training, enc_mask)
        
        dec_out, dec_attns, dec_enc_attns = \
        self.decoder(dec_in, enc_out, training, causality_mask, dec_mask)
        
        logits = self.fc(dec_out)
        
        return logits, enc_attns, dec_attns, dec_enc_attns


모델 인스턴스 생성

In [30]:
# 주어진 하이퍼파라미터로 Transformer 인스턴스 생성
# 하이퍼파라미터
NUM_LAYERS = 2    # 인코더와 디코더의 층의 개수
D_MODEL    = 512  # 인코더와 디코더 내부의 입, 출력의 고정 차원
NUM_HEADS  = 8    # 멀티 헤드 어텐션에서의 헤드 수 
UNITS      = 2048 # 피드 포워드 신경망의 은닉층의 크기
DROPOUT    = 0.3  # 드롭아웃의 비율

# `d_ff` 는 논문의 설명대로라면 2048  
# `d_model` 은 512  
# `[ batch x length x d_model ]` 의 입력을 받아 `fc1` 이 2048차원으로 매핑하고 활성함수 ReLU를 적용  
# `fc2` 를 통해 512차원으로 되돌림

# 문장의 길이 200, 임베딩 벡터의 차원 512
# sample_pos_encoding = PositionalEncoding(200, 512)
transformer = Transformer(
    n_layers=NUM_LAYERS,
    d_model=D_MODEL,
    n_heads=NUM_HEADS,
    d_ff=UNITS,
    src_vocab_size=VOCAB_SIZE,
    tgt_vocab_size=VOCAB_SIZE,
    pos_len=200,
    dropout=DROPOUT,
    shared_fc=True,
    shared_emb=True)


Learning Rate Scheduler

In [31]:
# Learning Rate Scheduler 구현
class LearningRateScheduler(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(LearningRateScheduler, self).__init__()
        
        self.d_model = d_model
        self.warmup_steps = warmup_steps
    
    def __call__(self, step):
        arg1 = step ** -0.5
        arg2 = step * (self.warmup_steps ** -1.5)
        # TODO: 구현
        return (self.d_model ** -0.5) * tf.math.minimum(arg1, arg2)


****Learning Rate & Optimizer****

In [32]:
# Learning Rate 인스턴스 선언 & Optimizer 구현
learning_rate = LearningRateScheduler(200)  # 200 스텝까지 러닝레이트 증가 후 하락
optimizer     = tf.keras.optimizers.Adam(learning_rate,
                                     beta_1 = 0.9,
                                     beta_2 = 0.98, 
                                     epsilon= 1e-9)

Loss Function 정의

In [33]:
# Loss Function 정의
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # TODO: 구현
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)

    # Masking 되지 않은 입력의 개수로 Scaling하는 과정
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask    
    return tf.reduce_sum(loss_)/tf.reduce_sum(mask)

Train Step 정의

In [34]:
# Train Step 정의
@tf.function()
def train_step(src, tgt, model, optimizer):
    tgt_in = tgt[:, :-1]  # Decoder의 input
    gold = tgt[:, 1:]     # Decoder의 output과 비교하기 위해 right shift를 통해 생성한 최종 타겟

    enc_mask, dec_enc_mask, dec_mask = generate_masks(src, tgt_in)

    with tf.GradientTape() as tape:
        predictions, enc_attns, dec_attns, dec_enc_attns = \
        model(src, tgt_in, enc_mask, dec_enc_mask, dec_mask, training = True)
        loss = loss_function(gold, predictions)

    gradients = tape.gradient(loss, model.trainable_variables)    
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    return loss, enc_attns, dec_attns, dec_enc_attns

훈련을 시키자!

In [35]:
# 훈련시키기
# BATCH_SIZE = 64
EPOCHS = 0

for epoch in range(EPOCHS):
    total_loss = 0
    
    dataset_count = tf.data.experimental.cardinality(train_dataset).numpy()
    tqdm_bar = tqdm(total=dataset_count)
    for step, (enc_batch, dec_batch) in enumerate(train_dataset):
        batch_loss, enc_attns, dec_attns, dec_enc_attns = \
        train_step(enc_batch,
                    dec_batch,
                    transformer,
                    optimizer)

        total_loss += batch_loss
        
        tqdm_bar.set_description_str('Epoch %2d' % (epoch + 1))
        tqdm_bar.set_postfix_str('Loss %.4f' % (total_loss.numpy() / (step + 1)))
        tqdm_bar.update()

### ****12-4. 번역 성능 측정하기 (1) BLEU Score****

BLEU Score를 실습

**[BLEU](https://en.wikipedia.org/wiki/BLEU)**

****NLTK를 활용한 BLEU Score****

NLTK(Natural Language Tool Kit)

In [36]:
# 아래 두 문장을 바꿔가며 테스트 해보세요
reference = "많 은 자연어 처리 연구자 들 이 트랜스포머 를 선호 한다".split()
candidate = "적 은 자연어 학 개발자 들 가 트랜스포머 을 선호 한다 요".split()

print("원문:", reference)
print("번역문:", candidate)
print("BLEU Score:", sentence_bleu([reference], candidate))

원문: ['많', '은', '자연어', '처리', '연구자', '들', '이', '트랜스포머', '를', '선호', '한다']
번역문: ['적', '은', '자연어', '학', '개발자', '들', '가', '트랜스포머', '을', '선호', '한다', '요']
BLEU Score: 0.5491004867761125


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


$(∏_{i=1}^4 precision_i)^{1\over4}=(1-gram×2-gram×3-gram×4-gram)^{1\over4}$

1-gram부터 4-gram까지의 점수(Precision)를 모두 곱한 후, ${\frac{1}{4}}$승하면 BLEU Score

`weights`의 디폴트 값은 `[0.25, 0.25, 0.25, 0.25]`로 1-gram부터 4-gram까지의 점수에 가중치를 동일하나 이 값을 `[1, 0, 0, 0]`으로 바꿔주면 BLEU Score에 1-gram의 점수만 반영

In [37]:
print("1-gram:", sentence_bleu([reference], candidate, weights=[1, 0, 0, 0]))
print("2-gram:", sentence_bleu([reference], candidate, weights=[0, 1, 0, 0]))
print("3-gram:", sentence_bleu([reference], candidate, weights=[0, 0, 1, 0]))
print("4-gram:", sentence_bleu([reference], candidate, weights=[0, 0, 0, 1]))

1-gram: 0.5
2-gram: 0.18181818181818182
3-gram: 1.0
4-gram: 1.0


Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


**예전 버전**에서는 위 수식에서 어떤 N-gram이 0의 값을 갖는다면 그 하위 N-gram 점수들이 곱했을 때 모두 소멸해버리기 때문에 일치하는 **N-gram이 없더라도 점수를 1.0 으로 유지**하여 하위 점수를 보존하게끔 구현되어 있었습니다. 하지만 1.0 은 모든 번역을 완벽히 재현했음을 의미하기 때문에 총점이 의도치 않게 높아질 수 있어요! 그럴 경우에는 BLEU Score가 바람직하지 못할 것(Undesirable) 이라는 **경고문**이 추가되긴 합니다.

****`SmoothingFunction()`으로 BLEU Score 보정하기**

Smoothing 함수는 모든 Precision에 아주 작은 `epsilon` 값을 더해줌

`nltk`에서는 `method0`부터 `method7`까지를 이미 제공

In [38]:
def calculate_bleu(reference, candidate, weights=[0.25, 0.25, 0.25, 0.25]):
    return sentence_bleu([reference],
                         candidate,
                         weights=weights,
                         smoothing_function=SmoothingFunction().method1)  # smoothing_function 적용

print("BLEU-1:", calculate_bleu(reference, candidate, weights=[1, 0, 0, 0]))
print("BLEU-2:", calculate_bleu(reference, candidate, weights=[0, 1, 0, 0]))
print("BLEU-3:", calculate_bleu(reference, candidate, weights=[0, 0, 1, 0]))
print("BLEU-4:", calculate_bleu(reference, candidate, weights=[0, 0, 0, 1]))

print("\nBLEU-Total:", calculate_bleu(reference, candidate))

BLEU-1: 0.5
BLEU-2: 0.18181818181818182
BLEU-3: 0.010000000000000004
BLEU-4: 0.011111111111111112

BLEU-Total: 0.05637560315259291


****트랜스포머 모델의 번역 성능 알아보기****

테스트셋으로 모델의 BLEU Score를 측정

번역기가 문장을 생성하도록 `translate()` 함수를 정의

In [39]:
def translate(tokens, model, src_tokenizer, tgt_tokenizer):
    padded_tokens = tf.keras.preprocessing.sequence.pad_sequences([tokens],
                                                           maxlen=MAX_LEN,
                                                           padding='post')
    ids = []
    output = tf.expand_dims([tgt_tokenizer.bos_id()], 0)   
    for i in range(MAX_LEN):
        enc_padding_mask, combined_mask, dec_padding_mask = \
        generate_masks(padded_tokens, output)

        predictions, _, _, _ = model(padded_tokens, 
                                      output,
                                      enc_padding_mask,
                                      combined_mask,
                                      dec_padding_mask,
                                      training = False)

        predicted_id = \
        tf.argmax(tf.math.softmax(predictions, axis=-1)[0, -1]).numpy().item()

        if tgt_tokenizer.eos_id() == predicted_id:
            result = tgt_tokenizer.decode_ids(ids)  
            return result

        ids.append(predicted_id)
        output = tf.concat([output, tf.expand_dims([predicted_id], 0)], axis=-1)

    result = tgt_tokenizer.decode_ids(ids)  
    return result

print("슝=3")

슝=3


In [40]:
# 한 문장만 평가하는 eval_bleu_single
def eval_bleu_single(model, src_sentence, tgt_sentence, src_tokenizer, tgt_tokenizer, verbose=True):
    src_tokens = src_tokenizer.encode_as_ids(src_sentence)
    tgt_tokens = tgt_tokenizer.encode_as_ids(tgt_sentence)

    if (len(src_tokens) > MAX_LEN): return None
    if (len(tgt_tokens) > MAX_LEN): return None

    reference = tgt_sentence.split()
    candidate = translate(src_tokens, model, src_tokenizer, tgt_tokenizer).split()

    score = sentence_bleu([reference], candidate,
                          smoothing_function=SmoothingFunction().method1)

    if verbose:
        print("Source Sentence: ", src_sentence)
        print("Model Prediction: ", candidate)
        print("Real: ", reference)
        print("Score: %lf\n" % score)
        
    return score

In [41]:
# beam_bleu() 구현
def beam_bleu(reference, ids, tokenizer):
    reference = reference.split()

    total_score = 0.0
    for _id in ids:
        candidate = tokenizer.decode_ids(_id.tolist()).split()
        score = calculate_bleu(reference, candidate)

        print("Reference:", reference)
        print("Candidate:", candidate)
        print("BLEU:", calculate_bleu(reference, candidate))

        total_score += score
        
    return total_score / len(ids)

In [None]:
# 인덱스를 바꿔가며 테스트해 보세요
test_idx = 0

# eval_bleu_single(transformer, 
#                  test_eng_sentences[test_idx], 
#                  test_spa_sentences[test_idx], 
#                  tokenizer, 
#                  tokenizer)

In [42]:
def eval_bleu(model, src_sentences, tgt_sentence, src_tokenizer, tgt_tokenizer, verbose=True):
    total_score = 0.0
    sample_size = len(src_sentences)
    
    for idx in tqdm(range(sample_size)):
        score = eval_bleu_single(model, src_sentences[idx], tgt_sentence[idx], src_tokenizer, tgt_tokenizer, verbose)
        if not score: continue
        
        total_score += score
    
    print("Num of Sample:", sample_size)
    print("Total Score:", total_score / sample_size)

테스트 데이터 중에 하나를 골라 평가

In [None]:
# eval_bleu(transformer, test_eng_sentences, test_spa_sentences, tokenizer, tokenizer, verbose=False)

### 12-5. 번역 성능 측정하기 (2) Beam Search Decoder

In [43]:
def beam_search_decoder(prob, beam_size):
    sequences = [[[], 1.0]]  # 생성된 문장과 점수를 저장

    for tok in prob:
        all_candidates = []

        for seq, score in sequences:
            for idx, p in enumerate(tok): # 각 단어의 확률을 총점에 누적 곱
                candidate = [seq + [idx], score * -math.log(-(p-1))]
                all_candidates.append(candidate)

        ordered = sorted(all_candidates,
                         key=lambda tup:tup[1],
                         reverse=True) # 총점 순 정렬
        sequences = ordered[:beam_size] # Beam Size에 해당하는 문장만 저장 

    return sequences

In [44]:
vocab = {
    0: "<pad>",
    1: "까요?",
    2: "커피",
    3: "마셔",
    4: "가져",
    5: "될",
    6: "를",
    7: "한",
    8: "잔",
    9: "도",
}

prob_seq = [[0.01, 0.01, 0.60, 0.32, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01],
            [0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.75, 0.01, 0.01, 0.17],
            [0.01, 0.01, 0.01, 0.35, 0.48, 0.10, 0.01, 0.01, 0.01, 0.01],
            [0.24, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.68],
            [0.01, 0.01, 0.12, 0.01, 0.01, 0.80, 0.01, 0.01, 0.01, 0.01],
            [0.01, 0.81, 0.01, 0.01, 0.01, 0.01, 0.11, 0.01, 0.01, 0.01],
            [0.70, 0.22, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01],
            [0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01],
            [0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01],
            [0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]]

prob_seq = np.array(prob_seq)
beam_size = 3

result = beam_search_decoder(prob_seq, beam_size)

for seq, score in result:
    sentence = ""

    for word in seq:
        sentence += vocab[word] + " "

    print(sentence, "// Score: %.4f" % score)

커피 를 가져 도 될 까요? <pad> <pad> <pad> <pad>  // Score: 42.5243
커피 를 마셔 도 될 까요? <pad> <pad> <pad> <pad>  // Score: 28.0135
마셔 를 가져 도 될 까요? <pad> <pad> <pad> <pad>  // Score: 17.8983


Beam Search를 생성 기법으로 구현할 때에는 분기를 잘 나눠줘야 합니다. Beam Size가 5라고 가정하면 맨 첫 단어로 적합한 5개의 단어를 생성하고, 두 번째 단어로 각 첫 단어(5개 단어)에 대해 5순위까지 확률을 구하여 총 25개의 문장을 생성하죠. 그 25개의 문장들은 각 단어에 할당된 확률을 곱하여 구한 점수(존재 확률) 를 가지고 있으니 각각의 순위를 매길 수 있겠죠? 점수 상위 5개의 표본만 살아남아 세 번째 단어를 구할 자격을 얻게 됩니다.

****Beam Search Decoder 작성 및 평가하기****

In [45]:
# calc_prob() 구현
def calc_prob(src_ids, tgt_ids, model):
    # TODO: 코드 구현
    enc_padding_mask, combined_mask, dec_padding_mask = \
    generate_masks(src_ids, tgt_ids)

    predictions, enc_attns, dec_attns, dec_enc_attns =\
    model(src_ids, 
            tgt_ids,
            enc_padding_mask,
            combined_mask,
            dec_padding_mask)

    return tf.math.softmax(predictions, axis=-1)

In [46]:
def beam_search_decoder(sentence, 
                        src_len,
                        tgt_len,
                        model,
                        src_tokenizer,
                        tgt_tokenizer,
                        beam_size):
    tokens = src_tokenizer.encode_as_ids(sentence)
    
    src_in = tf.keras.preprocessing.sequence.pad_sequences([tokens],
                                                            maxlen=src_len,
                                                            padding='post')

    pred_cache = np.zeros((beam_size * beam_size, tgt_len), dtype=np.int64)
    pred_tmp = np.zeros((beam_size, tgt_len), dtype=np.int64)

    eos_flag = np.zeros((beam_size, ), dtype=np.int64)
    scores = np.ones((beam_size, ))

    pred_tmp[:, 0] = tgt_tokenizer.bos_id()

    dec_in = tf.expand_dims(pred_tmp[0, :1], 0)
    prob = calc_prob(src_in, dec_in, model)[0, -1].numpy()


    for seq_pos in range(1, tgt_len):
        score_cache = np.ones((beam_size * beam_size, ))

        # init
        for branch_idx in range(beam_size):
            cache_pos = branch_idx*beam_size

            score_cache[cache_pos:cache_pos+beam_size] = scores[branch_idx]
            pred_cache[cache_pos:cache_pos+beam_size, :seq_pos] = \
            pred_tmp[branch_idx, :seq_pos]

        for branch_idx in range(beam_size):
            cache_pos = branch_idx*beam_size

            if seq_pos != 1:   # 모든 Branch를 로 시작하는 경우를 방지
                dec_in = pred_cache[branch_idx, :seq_pos]
                dec_in = tf.expand_dims(dec_in, 0)

                prob = calc_prob(src_in, dec_in, model)[0, -1].numpy()

            for beam_idx in range(beam_size):
                max_idx = np.argmax(prob)

                score_cache[cache_pos+beam_idx] *= prob[max_idx]
                pred_cache[cache_pos+beam_idx, seq_pos] = max_idx

                prob[max_idx] = -1

        for beam_idx in range(beam_size):
            if eos_flag[beam_idx] == -1: continue

            max_idx = np.argmax(score_cache)
            prediction = pred_cache[max_idx, :seq_pos+1]

            pred_tmp[beam_idx, :seq_pos+1] = prediction
            scores[beam_idx] = score_cache[max_idx]
            score_cache[max_idx] = -1

            if prediction[-1] == tgt_tokenizer.eos_id():
                eos_flag[beam_idx] = -1

    pred = []
    for long_pred in pred_tmp:
        zero_idx = long_pred.tolist().index(tgt_tokenizer.eos_id())
        short_pred = long_pred[:zero_idx+1]
        pred.append(short_pred)
    return pred

In [47]:
def calculate_bleu(reference, candidate, weights=[0.25, 0.25, 0.25, 0.25]):
    return sentence_bleu([reference],
                            candidate,
                            weights=weights,
                            smoothing_function=SmoothingFunction().method1)

In [None]:
# 인덱스를 바꿔가며 확인해 보세요
test_idx = 1

# ids = \
# beam_search_decoder(test_eng_sentences[test_idx],
#                     MAX_LEN,
#                     MAX_LEN,
#                     transformer,
#                     tokenizer,
#                     tokenizer,
#                     beam_size=5)

# bleu = beam_bleu(test_spa_sentences[test_idx], ids, tokenizer)
# print(bleu)

### ****12-6. 데이터 부풀리기****

Data Augmentation, 그중에서도 Embedding을 활용한 Lexical Substitution을 구현

직접 모델을 다운로드해 `load` 하는 방법

`gensim` 이 자체적으로 지원하는 `downloader` 를 활용해 모델을 `load` 하는 방법

**[RaRe-Technologies/gensim-data](https://github.com/RaRe-Technologies/gensim-data)**

`Available data → Model` 부분에서 공개된 모델의 종류를 확인

In [None]:
import gensim.downloader as api

wv = api.load('glove-wiki-gigaword-300')



KeyboardInterrupt: ignored

In [None]:
wv.most_similar("banana")

In [None]:
sample_sentence = "you know ? all you need is attention ."
sample_tokens = sample_sentence.split()

selected_tok = random.choice(sample_tokens)

result = ""
for tok in sample_tokens:
    if tok is selected_tok:
        result += wv.most_similar(tok)[0][0] + " "

    else:
        result += tok + " "

print("From:", sample_sentence)
print("To:", result)


****Lexical Substitution 구현하기****

입력된 문장을 Embedding 유사도를 기반으로 Augmentation 하여 반환하는 `lexical_sub()` 를 구현하세요!

In [48]:
def lexical_sub(sentence, word2vec):
    res = ""
    toks = sentence.split()

    try:
        _from = random.choice(toks)
        _to = word2vec.most_similar(_from)[0][0]
        
    except:   # 단어장에 없는 단어
        return None

    for tok in toks:
        if tok is _from: res += _to + " "
        else: res += tok + " "

    return res

시간이 오래 걸리니 우선 테스트 데이터의 Augmentation들을 만듬

In [None]:
new_corpus = []

for old_src in tqdm(test_eng_sentences):
    new_src = lexical_sub(old_src, wv)
    if new_src is not None: 
        new_corpus.append(new_src)
    # Augmentation이 없더라도 원본 문장을 포함시킵니다
    new_corpus.append(old_src)

print(new_corpus[:10])

### 12-7. Project: 멋진 챗봇 만들기

In [49]:
import numpy 
import pandas 
import tensorflow 
import nltk
import gensim

print(numpy.__version__)
print(pandas.__version__)
print(tensorflow.__version__)
print(nltk.__version__)
print(gensim.__version__)

1.21.5
1.3.5
2.8.0
3.2.5
3.6.0


****Step 1. 데이터 다운로드****

**[songys/Chatbot_data](https://github.com/songys/Chatbot_data)**

In [50]:
import pandas as pd
import tqdm
import urllib.request

In [51]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/songys/Chatbot_data/master/ChatbotData.csv", filename="ChatBotData.csv")
train_data = pd.read_csv('ChatBotData.csv')

In [52]:
len(train_data)

11823

In [53]:
train_data.head()

Unnamed: 0,Q,A,label
0,12시 땡!,하루가 또 가네요.,0
1,1지망 학교 떨어졌어,위로해 드립니다.,0
2,3박4일 놀러가고 싶다,여행은 언제나 좋죠.,0
3,3박4일 정도 놀러가고 싶다,여행은 언제나 좋죠.,0
4,PPL 심하네,눈살이 찌푸려지죠.,0


#### 데이터 중 label에 대한 설명
감정/감정조절이상    0  
감정/감정조절이상/화    1  
감정/걱정    2  
Korean Language Model for Wellness Conversation  
https://github.com/nawnoes/WellnessConversation-LanguageModel?fbclid=IwAR3ZhXYW_DwI2RXP1mbHzvafGXF80QWERa4t6TTz_m2NQug5QwjOwQt6Hvw

****Step 2. 데이터 정제****

`preprocess_sentence()` 함수를 구현

영문자의 경우, 모두 소문자로 변환합니다.
영문자와 한글, 숫자, 그리고 주요 특수문자를 제외하곤 정규식을 활용하여 모두 제거합니다.

In [54]:
import re
def preprocess_sentence(sentence):
    sentence = sentence.lower()
    sentence = re.sub(r"([?.!,])", r" \1 ", sentence)
    sentence = re.sub(r'[" "]+', " ", sentence)
    sentence = re.sub(r"[^0-9a-zA-Z가-힣?!,]+", " ", sentence) # 가-힣 한글은 지우지 말자.    
    sentence = sentence.strip()
    return sentence

In [55]:
# 사용할 샘플의 최대 개수
MAX_SAMPLES = 7700
print(MAX_SAMPLES)

7700


****Step 3. 데이터 토큰화****

토큰화에는 KoNLPy의 mecab 클래스를 사용합니다.

아래 조건을 만족하는 build_corpus() 함수를 구현하세요!

소스 문장 데이터와 타겟 문장 데이터를 입력으로 받습니다.
데이터를 앞서 정의한 preprocess_sentence() 함수로 정제하고, 토큰화합니다.
토큰화는 전달받은 토크나이즈 함수를 사용합니다. 이번엔 mecab.morphs 함수를 전달하시면 됩니다.
토큰의 개수가 일정 길이 이상인 문장은 데이터에서 제외합니다.
중복되는 문장은 데이터에서 제외합니다. 소스 : 타겟 쌍을 비교하지 않고 소스는 소스대로 타겟은 타겟대로 검사합니다. 중복 쌍이 흐트러지지 않도록 유의하세요!

구현한 함수를 활용하여 `questions` 와 `answers` 를 각각 `que_corpus` , `ans_corpus` 에 토큰화하여 저장합니다.

In [56]:
!python3 -m pip install konlpy 
!bash <(curl -s https://raw.githubusercontent.com/konlpy/konlpy/master/scripts/mecab.sh)

Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
[K     |████████████████████████████████| 19.4 MB 486 kB/s 
Collecting JPype1>=0.7.0
  Downloading JPype1-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (448 kB)
[K     |████████████████████████████████| 448 kB 61.4 MB/s 
Installing collected packages: JPype1, konlpy
Successfully installed JPype1-1.3.0 konlpy-0.6.0
Installing automake (A dependency for mecab-ko)
Hit:1 http://archive.ubuntu.com/ubuntu bionic InRelease
Get:2 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Ign:4 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:5 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Get:6 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Get:7 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB

In [57]:
from konlpy.tag  import Mecab
# mecab =  Mecab(dicpath=r"C:/mecab/mecab-ko-dic")
mecab =  Mecab()

In [58]:
def build_corpus(data, n_examples, tokenizer = None):
    data.drop_duplicates(subset=['Q'], inplace=True)
    data.drop_duplicates(subset=['A'], inplace=True)
    data.dropna(how = 'any', inplace=True)
    data.reset_index(drop=True, inplace=True)
    print('중복과 결측치 제거 후 남은 데이터 수',data.shape)    
    questions = []
    answers = []

    for i in range(n_examples):  
        q_txt = preprocess_sentence(data.at[i, 'Q'])
        a_txt = preprocess_sentence(data.at[i, 'A'])
        if q_txt != ''    and a_txt != '' and  \
           q_txt != 'nan' and a_txt != 'nan':
            
            if tokenizer == None:
                questions.append(q_txt)
                answers.append(a_txt)
            else:
                questions.append(' '.join(tokenizer.morphs(q_txt)))
                answers.append(' '.join(tokenizer.morphs(a_txt)))
    
    return questions, answers

In [59]:
que_corpus, ans_corpus = build_corpus(train_data, MAX_SAMPLES, mecab)   

중복과 결측치 제거 후 남은 데이터 수 (7731, 3)


In [60]:
for i in range(5):
    print(que_corpus[i],',', ans_corpus[i])

12 시 땡 ! , 하루 가 또 가 네요
1 지망 학교 떨어졌 어 , 위로 해 드립니다
3 박 4 일 놀 러 가 고 싶 다 , 여행 은 언제나 좋 죠
ppl 심하 네 , 눈살 이 찌푸려 지 죠
sd 카드 망가졌 어 , 다시 새로 사 는 게 마음 편해요


****Step 4. Augmentation****

데이터는 1만 개가량

Lexical Substitution을 실제로 적용

한국어로 사전 훈련된 Embedding 모델을 다운로드

`Korean (w)` 가 Word2Vec으로 학습한 모델이며 용량도 적당하므로 사이트에서 `Korean (w)`를 찾아 다운로드하고, `ko.bin` 파일을 얻으세요!

**[Kyubyong/wordvectors](https://github.com/Kyubyong/wordvectors)**  
Language ISO 639-1      Vector Size Corpus Size  Vocabulary Size  
Korean(w) | Korean(f) ko,   200,          339M,         30185  

다운로드한 모델을 활용해 데이터를 Augmentation 하세요! 앞서 정의한 `lexical_sub()` 함수를 참고하면 도움이 많이 될 겁니다.

*Augmentation된 `que_corpus` 와 원본 `ans_corpus` 가 병렬을 이루도록, 이후엔 반대로 원본 `que_corpus` 와 Augmentation된 `ans_corpus` 가 병렬을 이루도록 하여 전체 데이터가 원래의 3배가량으로 늘어나도록 합니다.*

In [61]:
!pip install --upgrade gensim==3.8.3

Collecting gensim==3.8.3
  Downloading gensim-3.8.3-cp37-cp37m-manylinux1_x86_64.whl (24.2 MB)
[K     |████████████████████████████████| 24.2 MB 1.6 MB/s 
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-3.8.3


In [62]:
import gensim
from gensim.models import KeyedVectors

In [63]:
from pathlib import Path
# directory = Path.joinpath(Path.cwd(),'data')
directory = Path.joinpath(Path.cwd(),'drive','MyDrive','Colab Notebooks','GD','GD-12','data')

filename = 'ko.bin' # Filename
word2vec_file_path  = Path.joinpath(directory,filename)
word2vec = gensim.models.Word2Vec.load(str(word2vec_file_path))

In [64]:
import random
from itertools import combinations

def lexical_sub_for_kor(sentence, word2vec):    
    res = ""
    toks_pos = mecab.pos(sentence)
    
    pumsa = list(map(lambda t:t[1],toks_pos))
    n_NNG = pumsa.count('NNG')
    v_NNG = pumsa.count('VV')
    
#     try:
#         _from = random.choice(toks)
#         _to = word2vec.wv.most_similar(_from)[0][0]
#         print(_from, _to)
        
#     except:   # 단어장에 없는 단어
#         return None
#     
    if n_NNG > 0: 
        i = random.randint(0, v_NNG + n_NNG+1)
        try:
#             _from = random.choice(toks)        
#             _to = word2vec.wv.most_similar(_from)[random.randint(0, 10)][0]
            n = 0
            for t in toks_pos:
                if t[1] == 'NNG' or t[1] == 'VV':
                    n += 1
                    if n == i :
                        _from = t[0]
                        break
            _to = word2vec.wv.most_similar(_from)[random.randint(0, 10)][0]
#             _to = word2vec.wv.most_similar(_from)[0][0]
#             print(_from, _to)
        except:   # 단어장에 없는 단어
            return None
    else:
        return None
        
    for tok in toks_pos:
        if tok[0] == _from: res += _to    + " "
        else:               res += tok[0] + " "
            
    return res    


sentence = '3박4일 놀러가고 싶다.'
lexical_sub_for_kor(sentence, word2vec)

In [65]:
from itertools import combinations

def lexical_sub_for_kor2(sentence, word2vec):
    toks_pos = mecab.pos(sentence)
    
    temp_list = []
#     for idx, t in enumerate(toks_pos):
    for t in toks_pos:
    #     print(idx, t)
        if t[1] == 'NNG' or t[1] == 'VV' :
#         if t[1] == 'NNG':
            temp_list.append(t[0])
            
#     print(temp_list) 
    
    if len(temp_list) > 1 :
        temp_comb = list(combinations(temp_list, 2))
        
        random_choice = random.randint(0, len(temp_comb)-1)
        
        temp = temp_comb[random_choice]
#         print(temp)
        _from = temp[random.randint(0, 1)]
#         _from = temp[0]
#         _from = temp[1]
        try:
            _to = word2vec.wv.similar_by_vector((word2vec.wv.get_vector(temp[0])+ \
                                         word2vec.wv.get_vector(temp[1])/2))[random.randint(2, 5)][0]
        except:   # 단어장에 없는 단어
            return None
        
    elif len(temp_list) == 1 :
        _from = temp_list[0]
        try:        
            _to   = word2vec.wv.most_similar(_from[0])[random.randint(0, 5)][0]
        except:   # 단어장에 없는 단어
            return None
        
    else:
        return None
    
#     print(_from)
    
    res = []
    
#     print(_to)
    
    for tok in toks_pos:
    #     print(tok)
        if tok[0] == _from: res += _to    + " "
        else:               res += tok[0] + " "
            
    return ''.join(res)
# print(''.join(res))    

In [66]:
from tqdm.notebook import tqdm
# que_corpus, ans_corpus

# new_corpus = []
new_que_corpus = []
new_ans_corpus = []

for old_que, old_ans in tqdm(zip(que_corpus, ans_corpus)):
#     new_src = lexical_sub_for_kor(old_src, word2vec)

    for i in range(7): # 몇번 반복해서 비슷한 말로 문장을 만든다
        new_que = lexical_sub_for_kor2(old_que, word2vec)
        new_ans =  lexical_sub_for_kor2(old_ans, word2vec)
        if new_que is not None and new_ans is not None : 
    #         new_corpus.append(new_src)
            new_que_corpus.append(new_que)
            new_ans_corpus.append(new_ans)
        
    # Augmentation이 없더라도 원본 문장을 포함시킵니다
#     new_corpus.append(old_src)
    new_que_corpus.append(old_que)
    new_ans_corpus.append(old_ans)

0it [00:00, ?it/s]

In [None]:
for i in range(50):
    print(new_ans_corpus[i])

하루 가 또 가 네요
위의 로 해 드립니다 
가부 로 해 드립니다 
위권 로 해 드립니다 
위권 로 해 드립니다 
후부 로 해 드립니다 
위의 로 해 드립니다 
최하위 로 해 드립니다 
위로 해 드립니다
어 은 언제나 좋 죠 
여서 은 언제나 좋 죠 
면서 은 언제나 좋 죠 
면서 은 언제나 좋 죠 
면서 은 언제나 좋 죠 
여서 은 언제나 좋 죠 
려면 은 언제나 좋 죠 
여행 은 언제나 좋 죠
눈살 이 찌푸려 지 죠
다시 새로 메노 는 게 마음 편해요 
다시 새로 마음 는 게 마음 편해요 
다시 새로 사 는 게 발르 편해요 
다시 새로 도리 는 게 마음 편해요 
다시 새로 사 는 게 도리 편해요 
다시 새로 메노 는 게 마음 편해요 
다시 새로 사 는 게 메노 편해요 
다시 새로 사 는 게 마음 편해요
잘 카 고 있 을 수 도 있 어요 
잘 콰 고 있 을 수 도 있 어요 
잘 카 고 있 을 수 도 있 어요 
잘 카 고 있 을 수 도 있 어요 
잘 마노 고 있 을 수 도 있 어요 
잘 카 고 있 을 수 도 있 어요 
잘 마 고 있 을 수 도 있 어요 
잘 모르 고 있 을 수 도 있 어요
간격 을 정하 고 해 보 세요 
일주일 을 정하 고 해 보 세요 
휴일 을 정하 고 해 보 세요 
일주일 을 정하 고 해 보 세요 
휴일 을 정하 고 해 보 세요 
시간 을 휴일 고 해 보 세요 
주일 을 정하 고 해 보 세요 
시간 을 정하 고 해 보 세요
겸비 하 는 자리 니까요 
웅장 하 는 자리 니까요 
자랑 하 는 겸비 니까요 
뽐내 하 는 자리 니까요 
자랑 하 는 차지 니까요 
겸비 하 는 자리 니까요 
자랑 하 는 뽐내 니까요 
자랑 하 는 자리 니까요


In [None]:
len(new_ans_corpus)

45806

In [None]:
augmented_data = pd.DataFrame(zip(new_que_corpus, new_ans_corpus))
augmented_data.columns=['Q','A']

In [None]:
MAX_SAMPLES = 30000

augmented_que_corpus, augmented_ans_corpus = build_corpus(augmented_data, MAX_SAMPLES, mecab)  

중복과 결측치 제거 후 남은 데이터 수 (30624, 2)


In [None]:
for i in range(100):
    print(augmented_ans_corpus[i])

하루 가 또 가 네요
위 의 로 해 드립니다
위 권 로 해 드립니다
최 하위 로 해 드립니다
위 로 해 드립니다
어 은 언제나 좋 죠
여서 은 언제나 좋 죠
면서 은 언제나 좋 죠
여행 은 언제나 좋 죠
눈살 이 찌푸려 지 죠
다시 새로 메 노 는 게 마음 편해요
다시 새로 사 는 게 발르 편 해요
다시 새로 사 는 게 도리 편해요
다시 새로 사 는 게 마음 편해요
잘 카 고 있 을 수 도 있 어요
잘 마노 고 있 을 수 도 있 어요
잘 모르 고 있 을 수 도 있 어요
간격 을 정하 고 해 보 세요
일 주일 을 정하 고 해 보 세요
휴일 을 정하 고 해 보 세요
시간 을 휴일 고 해 보 세요
주일 을 정하 고 해 보 세요
시간 을 정하 고 해 보 세요
겸비 하 는 자리 니까요
뽐내 하 는 자리 니까요
자랑 하 는 뽐내 니까요
자랑 하 는 자리 니까요
그 사람 도 그럴 거 예요
좋 아 하 를 즐기 세요
즐겁 를 즐기 세요
쫓아가 를 즐기 세요
혼자 를 즐기 세요
돈 은 다시 들어올 거 예요
소변 을 식혀 주 세요
햇볕 을 식혀 주 세요
오줌 을 식혀 주 세요
눈물 을 식혀 주 세요
땀 을 식혀 주 세요
어서 잊 고 새 잊어 버리 하 세요
어서 미워하 고 새 출발 하 세요
어서 잊 고 새 돌 아보 하 세요
어서 잊어버리 고 새 출발 하 세요
어서 잊 고 새 출발 하 세요
빨리 집 에 돌아가 서 돌려보내 고 나오 세요
빨리 집 에 달려가 서 끄 고 나오 세요
빨리 집 에 돌아가 서 들리 고 나오 세요
빨리 집 에 돌아가 서 끄 고 식객 세요
빨리 달려가 에 돌아가 서 끄 고 나오 세요
빨리 집 에 옮겨 가 서 끄 고 나오 세요
빨리 식객 에 돌아가 서 끄 고 나오 세요
빨리 집 에 돌아가 서 끄 고 나오 세요
다음 그 다음 에 는 더 절약 해봐요
다음 열흘 에 는 더 절약 해봐요
그 다음 달 에 는 더 절약 해봐요
다음 달 에 는 더 곱셈 해봐요
다음 달 에 는 더 절약 해봐요
따뜻 하 게 사세요 !
가장 확실 한 시간 은 오늘 이 에요 어제 와 내일 을 놓 고 고

In [None]:
len(augmented_que_corpus)

30000

In [67]:
data_directory = Path.joinpath(Path.cwd(),'drive','MyDrive','Colab Notebooks','GD','GD-12','data', 'clean_data.csv')
# data_directory = Path.joinpath(Path.cwd(),'data', 'clean_data.csv')

In [None]:
clean_data = pd.DataFrame(zip(augmented_que_corpus, augmented_ans_corpus))
clean_data.columns=['Q','A']
clean_data.to_csv(data_directory)

In [69]:
clean_data = pd.read_csv(data_directory)  # 클린 데이터 불러오기
augmented_que_corpus  = list(clean_data['Q'])
augmented_ans_corpus = list(clean_data['A'])

****Step 5. 데이터 벡터화****

타겟 데이터인 `ans_corpus` 에 `<start>` 토큰과 `<end>` 토큰이 추가되지 않은 상태이니 이를 먼저 해결한 후 벡터화를 진행합니다. 우리가 구축 한 `ans_corpus` 는 `list` 형태이기 때문에 아주 쉽게 이를 해결할 수 있다

In [70]:
sample_data = ["12", "시", "땡", "!"]

print(["<start>"] + sample_data + ["<end>"])

['<start>', '12', '시', '땡', '!', '<end>']


In [None]:
# que_corpus, ans_corpus

# que_corpus_list = list(map(lambda s: ["<start>"] + mecab.morphs(s) + ["<end>"], que_corpus))
# ans_corpus_list = list(map(lambda s: ["<start>"] + mecab.morphs(s) + ["<end>"], ans_corpus))

In [71]:
VOCAB_SIZE = 4770

ko_tokenizer = generate_tokenizer(augmented_que_corpus + augmented_ans_corpus, VOCAB_SIZE, "kor")
# ko_tokenizer = generate_tokenizer(que_corpus + ans_corpus, VOCAB_SIZE, "kor")


In [72]:
augmented_que_corpus[0]

'12 시 땡 !'

In [73]:
# augmented_que_corpus, augmented_ans_corpus
ko_tokenizer.set_encode_extra_options("bos:eos") # ansewer 토큰화 할 때 옵션 줄 것
aug_que_corpus = make_corpus(augmented_que_corpus, ko_tokenizer)


aug_ans_corpus = make_corpus(augmented_ans_corpus, ko_tokenizer)

  0%|          | 0/30000 [00:00<?, ?it/s]

  0%|          | 0/30000 [00:00<?, ?it/s]

In [74]:
print(augmented_que_corpus[0])
print(aug_que_corpus[0])
print('\n')
print(augmented_ans_corpus[0])
print(aug_ans_corpus[0])

12 시 땡 !
[1, 371, 3181, 146, 3072, 156, 2]


하루 가 또 가 네요
[1, 556, 8, 168, 8, 47, 2]


In [75]:
MAX_LEN = 50
MIN_LEN = 3

# from tqdm.notebook import tqdm    # Process 과정을 보기 위해

aug_que_corpus_50 = []
aug_ans_corpus_50 = []

assert len(aug_que_corpus) == len(aug_ans_corpus)

# 토큰의 길이가 50 이하인 문장만 남깁니다. 
for que, ans in tqdm(list(zip(aug_que_corpus, aug_ans_corpus))):
    if len(que) <= MAX_LEN and len(ans) <= MAX_LEN and \
       len(que) >= MIN_LEN and len(ans) >= MIN_LEN :
        aug_que_corpus_50.append(que) 
        aug_ans_corpus_50.append(ans) 

# 패딩처리를 완료하여 학습용 데이터를 완성합니다. 
enc_ndarray = tf.keras.preprocessing.sequence.pad_sequences(aug_que_corpus_50, maxlen=MAX_LEN, padding='post')
dec_ndarray = tf.keras.preprocessing.sequence.pad_sequences(aug_ans_corpus_50, maxlen=MAX_LEN, padding='post')

  0%|          | 0/30000 [00:00<?, ?it/s]

모델 인스턴스 생성

In [76]:
# 주어진 하이퍼파라미터로 Transformer 인스턴스 생성
# 하이퍼파라미터
NUM_LAYERS = 2    # 인코더와 디코더의 층의 개수
D_MODEL    = 512  # 인코더와 디코더 내부의 입, 출력의 고정 차원
NUM_HEADS  = 8    # 멀티 헤드 어텐션에서의 헤드 수 
UNITS      = 2048 # 피드 포워드 신경망의 은닉층의 크기
DROPOUT    = 0.2  # 드롭아웃의 비율
POS_LEN    = MAX_LEN # 최대 corpus 길이

# `d_ff` 는 논문의 설명대로라면 2048  
# `d_model` 은 512  
# `[ batch x length x d_model ]` 의 입력을 받아 `fc1` 이 2048차원으로 매핑하고 활성함수 ReLU를 적용  
# `fc2` 를 통해 512차원으로 되돌림

# 문장의 길이 50, 임베딩 벡터의 차원 512
# sample_pos_encoding = PositionalEncoding(200, 512)
transformer = Transformer(
    n_layers=NUM_LAYERS,
    d_model=D_MODEL,
    n_heads=NUM_HEADS,
    d_ff=UNITS,
    src_vocab_size=VOCAB_SIZE,
    tgt_vocab_size=VOCAB_SIZE,
    pos_len=POS_LEN,
    dropout=DROPOUT,
    shared_fc=True,
    shared_emb=True)

In [None]:
# tf.keras.layers.Embedding(vocab_size, 
#                                  word_vector_dim, 
#                                  embeddings_initializer=Constant(embedding_matrix),  # 카피한 임베딩을 여기서 활용
#                                  input_length=maxlen, 
#                                  trainable=True))   # trainable을 True로 주면 Fine-tuning

Learning Rate & Optimizer

In [77]:
# Learning Rate 인스턴스 선언 & Optimizer 구현
learning_rate = LearningRateScheduler(200)  # 200 스텝까지 러닝레이트 증가 후 하락
optimizer     = tf.keras.optimizers.Adam(learning_rate,
                                     beta_1 = 0.9,
                                     beta_2 = 0.98, 
                                     epsilon= 1e-9)

위 소스를 참고하여 타겟 데이터 전체에 <start> 토큰과 <end> 토큰을 추가해 주세요!

챗봇 훈련 데이터의 가장 큰 특징 중 하나라고 하자면 바로 소스 데이터와 타겟 데이터가 같은 언어를 사용한다는 것이겠죠. 앞서 배운 것처럼 이는 Embedding 층을 공유했을 때 많은 이점을 얻을 수 있습니다.

1. 특수 토큰을 더함으로써 `ans_corpus` 또한 완성이 되었으니, `que_corpus` 와 결합하여 전체 데이터에 대한 단어 사전을 구축하고 벡터화하여 `enc_train` 과 `dec_train` 을 얻으세요!

****Step 6. 훈련하기****

앞서 번역 모델을 훈련하며 정의한 `Transformer` 를 그대로 사용하시면 됩니다! 대신 데이터의 크기가 작으니 하이퍼파라미터를 튜닝해야 과적합을 피할 수 있습니다. 모델을 훈련하고 아래 예문에 대한 답변을 생성하세요! 가장 멋진 답변과 모델의 하이퍼파라미터를 제출하시면 됩니다.

In [78]:
import os
checkpoint_path = Path.joinpath(Path.cwd(),'drive','MyDrive','Colab Notebooks','GD','GD-12','training_checkpoints')
# checkpoint_path = Path.joinpath(Path.cwd(),'training_checkpoints')
ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

In [79]:
## 체크포인트 불러오기
# # if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print('Latest checkpoint restored!!')

Latest checkpoint restored!!


In [80]:
BATCH_SIZE = 1024


train_dataset = tf.data.Dataset.from_tensor_slices((enc_ndarray, dec_ndarray)).batch(batch_size=BATCH_SIZE)

In [None]:
# 훈련시키기
EPOCHS = 300

for epoch in range(EPOCHS):
    total_loss = 0
    
    dataset_count = tf.data.experimental.cardinality(train_dataset).numpy()
    tqdm_bar = tqdm(total=dataset_count)
    for step, (enc_batch, dec_batch) in enumerate(train_dataset):
        batch_loss, enc_attns, dec_attns, dec_enc_attns = \
        train_step(enc_batch,
                    dec_batch,
                    transformer,
                    optimizer)

        total_loss += batch_loss
        
        tqdm_bar.set_description_str('Epoch %2d' % (epoch + 1))
        tqdm_bar.set_postfix_str('Loss %.4f' % (total_loss.numpy() / (step + 1)))
        tqdm_bar.update()
        
# saving (checkpoint) the model every 20 epochs
    if (epoch + 1) % 20 == 0:
        ckpt_save_path = ckpt_manager.save()
        print(f'Saving checkpoint for epoch {epoch+1} at {ckpt_save_path}') 

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Saving checkpoint for epoch 20 at /content/drive/MyDrive/Colab Notebooks/GD/GD-12/training_checkpoints/ckpt-1


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Saving checkpoint for epoch 40 at /content/drive/MyDrive/Colab Notebooks/GD/GD-12/training_checkpoints/ckpt-2


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Saving checkpoint for epoch 60 at /content/drive/MyDrive/Colab Notebooks/GD/GD-12/training_checkpoints/ckpt-3


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Saving checkpoint for epoch 80 at /content/drive/MyDrive/Colab Notebooks/GD/GD-12/training_checkpoints/ckpt-4


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Saving checkpoint for epoch 100 at /content/drive/MyDrive/Colab Notebooks/GD/GD-12/training_checkpoints/ckpt-5


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Saving checkpoint for epoch 120 at /content/drive/MyDrive/Colab Notebooks/GD/GD-12/training_checkpoints/ckpt-6


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Saving checkpoint for epoch 140 at /content/drive/MyDrive/Colab Notebooks/GD/GD-12/training_checkpoints/ckpt-7


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Saving checkpoint for epoch 160 at /content/drive/MyDrive/Colab Notebooks/GD/GD-12/training_checkpoints/ckpt-8


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Saving checkpoint for epoch 180 at /content/drive/MyDrive/Colab Notebooks/GD/GD-12/training_checkpoints/ckpt-9


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Saving checkpoint for epoch 200 at /content/drive/MyDrive/Colab Notebooks/GD/GD-12/training_checkpoints/ckpt-10


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Saving checkpoint for epoch 220 at /content/drive/MyDrive/Colab Notebooks/GD/GD-12/training_checkpoints/ckpt-11


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Exception in thread Thread-12:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.7/dist-packages/tqdm/_monitor.py", line 69, in run
    instances = self.get_instances()
  File "/usr/local/lib/python3.7/dist-packages/tqdm/_monitor.py", line 49, in get_instances
    return [i for i in self.tqdm_cls._instances.copy()
  File "/usr/lib/python3.7/_weakrefset.py", line 92, in copy
    return self.__class__(self)
  File "/usr/lib/python3.7/_weakrefset.py", line 50, in __init__
    self.update(data)
  File "/usr/lib/python3.7/_weakrefset.py", line 119, in update
    for element in other:
  File "/usr/lib/python3.7/_weakrefset.py", line 60, in __iter__
    for itemref in self.data:
RuntimeError: Set changed size during iteration



  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Saving checkpoint for epoch 240 at /content/drive/MyDrive/Colab Notebooks/GD/GD-12/training_checkpoints/ckpt-12


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Saving checkpoint for epoch 260 at /content/drive/MyDrive/Colab Notebooks/GD/GD-12/training_checkpoints/ckpt-13


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

Saving checkpoint for epoch 280 at /content/drive/MyDrive/Colab Notebooks/GD/GD-12/training_checkpoints/ckpt-14


  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

  0%|          | 0/30 [00:00<?, ?it/s]

In [81]:
def make_clean_corpus_for_examples(q_txt,tokenizer):
    q_txt = preprocess_sentence(q_txt)
    if q_txt != ''    and q_txt != 'nan': 
        q_txt = ' '.join(tokenizer.morphs(q_txt))
        questions = make_corpus([q_txt], ko_tokenizer)
        questions = questions[0]
    return questions

In [None]:
def translate2(tokens, model, src_tokenizer, tgt_tokenizer):
    print(tokens)
    padded_tokens = tf.keras.preprocessing.sequence.pad_sequences([tokens],
                                                           maxlen=MAX_LEN,
                                                           padding='post')
    
    print(padded_tokens)
    ids = []
    output = tf.expand_dims([tgt_tokenizer.bos_id()], 0)   
    for i in range(MAX_LEN):
        enc_padding_mask, combined_mask, dec_padding_mask = \
        generate_masks(padded_tokens, output)

        predictions, _, _, _ = model(padded_tokens, 
                                      output,
                                      enc_padding_mask,
                                      combined_mask,
                                      dec_padding_mask,
                                      training = False)

        predicted_id = \
        tf.argmax(tf.math.softmax(predictions, axis=-1)[0, -1]).numpy().item()

        if tgt_tokenizer.eos_id() == predicted_id:
            break
#             result = tgt_tokenizer.decode_ids(ids)  
#             return result

        ids.append(predicted_id)
        output = tf.concat([output, tf.expand_dims([predicted_id], 0)], axis=-1)

    result = tgt_tokenizer.decode_ids(ids)  
    return result


# 예문

1. 지루하다, 놀러가고 싶어.
2. 오늘 일찍 일어났더니 피곤하다.
3. 간만에 여자친구랑 데이트 하기로 했어.
4. 집에 있는다는 소리야.

In [83]:
def test_example(example, mecab):
    que = make_clean_corpus_for_examples(example, mecab)  
    print('Question : ', example)
    print(que)
    result = translate(que, transformer, ko_tokenizer, ko_tokenizer)
    # print(result)
    return result

examples = ['지루하다, 놀러가고 싶어.',
            '오늘 일찍 일어났더니 피곤하다.',
            '간만에 여자친구랑 데이트 하기로 했어.',
            '집에 있는다는 소리야.']

for example in examples:
    candidate = test_example(example, mecab)
    print('Candidate : ', candidate)    

  0%|          | 0/1 [00:00<?, ?it/s]

Question :  지루하다, 놀러가고 싶어.
[1, 3374, 6, 32, 368, 258, 332, 8, 10, 41, 9, 2]
Candidate :  세상 을 비우 하 게 만들 어요


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  오늘 일찍 일어났더니 피곤하다.
[1, 151, 1240, 66, 1899, 0, 866, 302, 1077, 6, 32, 2]
Candidate :  아직 도 생각 한 날 이 에요


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  간만에 여자친구랑 데이트 하기로 했어.
[1, 307, 355, 813, 73, 50, 0, 95, 412, 6, 40, 89, 59, 9, 2]
Candidate :  좋 은 데이트 가 신규 었 길 바랄게요


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  집에 있는다는 소리야.
[1, 286, 24, 0, 19, 5, 570, 1175, 187, 2]
Candidate :  물려받 아 보 는 건 좋 은 거 죠


# 제출

Translations

> 잠깐 쉬 어도 돼요 . <end>맛난 거 드세요 . <end>떨리 겠 죠 . <end>좋 아 하 면 그럴 수 있 어요 . <end>
> 

Hyperparameters

>NUM_LAYERS = 2    # 인코더와 디코더의 층의 개수  
D_MODEL    = 512  # 인코더와 디코더 내부의 입, 출력의 고정 차원  
NUM_HEADS  = 8    # 멀티 헤드 어텐션에서의 헤드 수   
UNITS      = 2048 # 피드 포워드 신경망의 은닉층의 크기  
DROPOUT    = 0.2  # 드롭아웃의 비율  
POS_LEN    = MAX_LEN # 최대 corpus 길이  
> 

Training Parameters

> Warmup Steps: 200  
BATCH_SIZE = 1024  
EPOCHS = 300  
>






In [93]:
examples = ['잠깐 쉬어도 돼요',
            '맛난 거 드세요 .',
            '떨리겠죠 .',
            '좋 아 하 면 그럴 수 있 어요 .']

for example in examples:
    candidate = test_example(example, mecab)
    print('Candidate : ', candidate)

  0%|          | 0/1 [00:00<?, ?it/s]

Question :  잠깐 쉬어도 돼요
[1, 250, 948, 429, 390, 245, 2]
Candidate :  사람 은 혼인 하 지 마세요


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  맛난 거 드세요 .
[1, 1098, 454, 15, 836, 2]
Candidate :  더 행복 해질 거 예요


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  떨리겠죠 .
[1, 1047, 82, 26, 39, 2]
Candidate :  생각 을 해 보 는 것 도 좋 을 것 같 아요


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  좋 아 하 면 그럴 수 있 어요 .
[1, 11, 12, 6, 36, 344, 33, 0, 19, 34, 2]
Candidate :  더 좋 아 하 는 해 보 세요


****Step 7. 성능 측정하기****

챗봇의 경우, 올바른 대답을 하는지가 중요한 평가 지표입니다. 올바른 답변을 하는지 눈으로 확인할 수 있겠지만, 많은 데이터의 경우는 모든 결과를 확인할 수 없을 것입니다. 주어진 질문에 적절한 답변을 하는지 확인하고, BLEU Score를 계산하는 `calculate_bleu()` 함수도 적용해 보세요.

In [92]:
for j in range(20):
    i = random.randint(0,29999)
    candidate = test_example(augmented_que_corpus[i], mecab)
    reference = augmented_ans_corpus[i]
    print('Candidate : ', candidate)
    print('Reference : ', reference)
    
    print('BLEW = ', calculate_bleu(reference, candidate, weights=[0.25, 0.25, 0.25, 0.25]))

  0%|          | 0/1 [00:00<?, ?it/s]

Question :  고백 했 다 차이 면 격차 지 ?
[1, 288, 59, 32, 652, 36, 2569, 569, 17, 21, 2]
Candidate :  진심 으로 증언 하 면 좋 은 결과 가 될 거 예요
Reference :  진심 으로 증언 하 면 좋 은 결과 가 있 을 거 예요
BLEW =  0.8402971749021705


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  들어맞 이 떠났 다는 게 맞 는 거 같 아
[1, 1015, 1762, 4, 925, 562, 587, 22, 253, 5, 15, 29, 12, 2]
Candidate :  마음 떠난 젊은이 잡 기 는 힘들 어요
Reference :  마음 떠난 젊은이 잡 기 는 힘들 어요
BLEW =  1.0


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  명절 다가오 니까 밀 려오
[1, 891, 452, 214, 234, 457, 1063, 3711, 2]
Candidate :  날려 버리 시 길 베 겠 습니다
Reference :  날려 버리 시 길 베 겠 습니다
BLEW =  1.0


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  내 가 잊 어야 그 사람 이 날 그리워 일반인 는 날 이 온다
[1, 52, 8, 194, 487, 69, 30, 4, 218, 953, 1639, 5, 218, 4, 1849, 2]
Candidate :  이 얘기 기억 하 세요
Reference :  이 얘기 기억 하 세요
BLEW =  1.0


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  올릴까 말 까 하 다가 적극 적 을 얻 고자 올려 봅니다
[1, 2534, 56, 561, 6, 214, 71, 717, 71, 7, 1542, 10, 209, 1096, 216, 2]
Candidate :  잘 와 견방 도 제공 받 거나 자신 에게 청중 해 보 세요
Reference :  잘 와 셨 어요 말씀 해 보 세요
BLEW =  0.2624555405412902


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  사랑 기쁨 는 사람 이 있 어
[1, 37, 264, 5, 30, 4, 0, 19, 9, 2]
Candidate :  어서 어서 는데 해봐요
Reference :  어서 어서 는데 해봐요
BLEW =  1.0


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  썸 인데 인격 은 커플
[1, 153, 201, 88, 467, 13, 1470, 2]
Candidate :  확실히 호인 를 나눠 보 세요
Reference :  확실히 호인 를 나눠 보 세요
BLEW =  1.0


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  나 아목 은 거 같 아
[1, 18, 12, 931, 13, 15, 29, 12, 2]
Candidate :  위의 부터 속지 마세요
Reference :  위의 부터 속지 마세요
BLEW =  1.0


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  이젠 뺏 아도 안 되 는 거래
[1, 438, 1906, 774, 42, 57, 5, 2301, 2]
Candidate :  사랑 은 때론 미련 이 바뀌 기 도 하 죠
Reference :  사랑 은 때론 미련 이 바뀌 기 도 하 죠
BLEW =  1.0


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  엄마 친구 가 술주정 부려
[1, 254, 50, 8, 3444, 2465, 2]
Candidate :  인정 하 거나 이해 하 거나 둘 중 하나 예 요
Reference :  인정 하 거나 이해 하 거나 둘 중 하나 예 요
BLEW =  1.0


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  문득 달력 을 보 니
[1, 688, 700, 176, 297, 7, 20, 192, 2]
Candidate :  기념 일 이 나 나 봐요
Reference :  기념 일 이 있 나 봐요
BLEW =  0.7611606003349892


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  내 가 최대한 잘 해줘도 화염 이 많 아
[1, 52, 8, 1916, 162, 54, 1570, 79, 420, 2017, 4, 62, 12, 2]
Candidate :  서로 원 하 는 꾀 하 치 가 다른 것 같 아요
Reference :  서로 원 하 는 꾀 하 치 가 다른 것 같 아요
BLEW =  1.0


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  답답 한 코마
[1, 785, 25, 1326, 2]
Candidate :  슬픔 이 내 슬픔 같 지 죠
Reference :  슬픔 이 내 슬픔 같 지 않 죠
BLEW =  0.8393757195171835


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  결혼 동침 하 면서 많이 싸워 ?
[1, 204, 1027, 6, 126, 62, 38, 1415, 21, 2]
Candidate :  사랑 교전 이 라고 생각 하 세요
Reference :  사랑 교전 이 라고 생각 하 세요
BLEW =  1.0


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  잠 이 안 타 네
[1, 250, 4, 42, 157, 45, 2]
Candidate :  못 자 면 내일 피곤 할 텐데 사마소 이 에요
Reference :  못 자 면 내일 피곤 할 텐데 사마소 이 에요
BLEW =  1.0


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  다음 날 밤 에 집 에 혼자 있 으려니 답답 하 네
[1, 627, 218, 577, 24, 286, 24, 308, 0, 19, 2006, 533, 785, 6, 45, 2]
Candidate :  위촉 을 추천 해요
Reference :  위촉 을 추천 해요
BLEW =  1.0


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  잘 메노 니 ?
[1, 54, 558, 192, 21, 2]
Candidate :  잘 살아가 고 싶 을 거 예요
Reference :  잘 살아가 고 있 을 거 예요
BLEW =  0.8153551038173115


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  상당수 는 나 의 전부 였 던 그녀
[1, 3525, 306, 5, 18, 28, 121, 190, 479, 113, 193, 2]
Candidate :  그녀 에게 도 당신 이 전인 였을 거 예요
Reference :  그녀 에게 도 당신 이 전인 였을 거 예요
BLEW =  1.0


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  일기 써야 되 는데
[1, 66, 90, 890, 187, 57, 44, 2]
Candidate :  나중 에 보 면 좋 은 추억 이 될 거 예요
Reference :  나중 에 보 면 좋 은 추억 이 될 거 예요
BLEW =  1.0


  0%|          | 0/1 [00:00<?, ?it/s]

Question :  스테판 가 와
[1, 2709, 1478, 8, 182, 2]
Candidate :  우리 마음 도 알 면서 울 어 마음속
Reference :  우리 마음 도 젖 어 마음속
BLEW =  0.5830738459889044


## 후기  :  
### data augmentation
한단어를 랜덤으로 하나를 선택 가까운 말로 치환  
명사를 랜덤으로 하나를 선택 가까운 말로 치환  
명사와 동사 중 랜덤으로 하나를 선택 가까운 말로 치환  
명사와 동사를 골라내 원소 두개를 가진 조합을 만들어 랜덤으로 선택 두 단어와 가까운 말로 한단어를 치환  
  
여러 방법을 써봤지만 사용한 word2vec를 만드는데 쓰인 데이터와 사용하고 있는 쳇봇 데이터 간의 어휘 차이가 심해서 그런지 만족할 만한 data augmentation방법을 만들기 불가능했다.  
문장 중 한단어만 바꿨는데도 뜻이 통하지 않는 문장이 만들어졌다.  

### 과적합 방지
dropout을 0.2로 주었다.  
예제를 실행할 때는 dropout을 주지 않았다.  

