<a href="https://colab.research.google.com/github/TA-aiacademy/course_3.0/blob/v2-5_nlp/09_v2-5_NLP/Part4/Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformer

<img src="https://hackmd.io/_uploads/ryCwQ7YJT.png" alt="Drawing" style="width: 1000px;"/>

* [Seq2seq](https://arxiv.org/pdf/1409.3215.pdf)
* [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
* [Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025v5)
* [Attention is all you need](https://arxiv.org/abs/1706.03762)
* [GPT](https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
* [BERT](https://arxiv.org/abs/1810.04805)
* [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf)
* [ERNIE](https://arxiv.org/abs/1905.07129)
* [XLNet](https://arxiv.org/abs/1906.08237)
* [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB)

# Environment

In [None]:
import tensorflow_datasets as tfds
import tensorflow as tf
import os
from pprint import pprint

import time
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

print(tf.__version__)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

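### Set up directory paths

`vocab_file`: path where the English and Chinese vocabularies are saved

`checkpoint`: path where model checkpoints are saved

`log_dir`: directory for experiment logs

`download_dir`: storage path for the dataset from the `wmt19` machine translation competition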

In [None]:
# Download the data
!wget -q https://github.com/TA-aiacademy/course_3.0/releases/download/v2.5_nlp/NLP_part4.zip
!unzip -q NLP_part4.zip

In [None]:
output_dir = "nmt"
en_vocab_file = os.path.join(output_dir, "en_vocab")
zh_vocab_file = os.path.join(output_dir, "zh_vocab")
checkpoint_path = os.path.join(output_dir, "checkpoints")
log_dir = os.path.join(output_dir, 'logs')
download_dir = "tensorflow-datasets/downloads"

if not os.path.exists(output_dir):
    os.makedirs(output_dir)

## Inspect the wmt19 Chinese-English parallel datasets

* `newscommentary_v14`: news commentary
* `wikititles_v1`: Wikipedia titles
* `uncorpus_v1`: United Nations corpus

In [None]:
tmp_builder = tfds.builder("wmt19_translate/zh-en")
pprint(tmp_builder.subsets)

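## Download the dataset with a `tfds` `DatasetBuilder`

https://www.tensorflow.org/datasets/catalog/wmt19_translate

Download the Chinese-English news commentary dataset. The files are stored under `download_dir`, so later runs do not need to call `download_and_prepare` again.

`builder.info` shows the dataset details.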

In [None]:
config = tfds.translate.wmt.WmtConfig(
  version="0.0.1",
  language_pair=("zh", "en"),
  subsets={
    tfds.Split.TRAIN: ["newscommentary_v14"]
  }
)
builder = tfds.builder("wmt_translate", config=config)
builder.download_and_prepare(download_dir=download_dir)

## Split the dataset
70% training set, 30% held-out set (`val_examples` in the code)

## Load the data from the `DatasetBuilder`

Use `assert` to check the types.

In [None]:
train_perc = 70
examples = builder.as_dataset(split=[f'train[:{train_perc}%]', f'train[{train_perc}%:]'], as_supervised=True)
train_examples, val_examples = examples

assert isinstance(train_examples, tf.data.Dataset)
assert isinstance(val_examples, tf.data.Dataset)

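## Load or build the vocabularies with `tfds.deprecated.text.SubwordTextEncoder`

* `.load_from_file`: reads a vocabulary from a `.subwords` file
* `.build_from_corpus`: builds a `.subwords` vocabulary from a corpus

For the Chinese vocabulary, `max_subword_length` is set to 1 so that text is tokenized character by character, which greatly reduces the vocabulary size and the complexity.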

In [None]:
%%time
try:
    tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.load_from_file(en_vocab_file)
    print('Load English vocabulary: %s' % en_vocab_file)
except Exception:  # loading failed (e.g. no saved vocabulary yet), so build one
    print('Build English vocabulary: %s' % en_vocab_file)
    tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus((en.numpy() for en, zh in train_examples),
                                                                             target_vocab_size = 2**13)
    tokenizer_en.save_to_file(en_vocab_file)

In [None]:
print('English vocabulary size: ', tokenizer_en.vocab_size)

In [None]:
%%time
try:
    tokenizer_zh = tfds.deprecated.text.SubwordTextEncoder.load_from_file(zh_vocab_file)
    print('Load Chinese vocabulary: %s' % zh_vocab_file)
except Exception:  # loading failed (e.g. no saved vocabulary yet), so build one
    print('Build Chinese vocabulary: %s' % zh_vocab_file)
    tokenizer_zh = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus((zh.numpy() for en, zh in train_examples),
                                                                             target_vocab_size = 2**13, max_subword_length=1)
    tokenizer_zh.save_to_file(zh_vocab_file)

In [None]:
print('Chinese vocabulary size: ', tokenizer_zh.vocab_size)

### Example

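English is tokenized into `wordpiece` units.

* Original sentence: `Transformer is awesome.`
* Whitespace tokenization: `[Transformer, is, awesome, .]`
* `Wordpiece` tokenization: `[Trans, former, is, aw, es, ome, .]`

Advantage of `Wordpiece` tokenization:
* Many words are composed of shared `wordpiece` units (for example, `Translation` and `Transpose` share `Trans`), which keeps the vocabulary small and avoids dedicating entries to words that appear only once in the whole corpus.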

In [None]:
sample_string = 'Transformer is awesome.'

tokenized_string_token = tokenizer_en.encode(sample_string)
print('Tokenized string token is {}'.format(tokenized_string_token))

tokenized_string = [tokenizer_en.decode([ts]) for ts in tokenized_string_token]
print('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer_en.decode(tokenized_string_token)
print('The original string: {}'.format(original_string))

assert original_string == sample_string
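
The subword split described above can be illustrated with a toy greedy longest-match splitter. This is only a sketch: `greedy_subword_split` and `toy_vocab` are hypothetical names, the vocabulary is hand-picked to reproduce the example split, and this is not how `SubwordTextEncoder` is implemented internally.

```python
def greedy_subword_split(word, vocab):
    """Split `word` into the longest matching pieces from `vocab`, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until a piece is found.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Hypothetical vocabulary chosen only to reproduce the split shown above.
toy_vocab = {"Trans", "former", "is", "aw", "es", "ome", "."}

pieces = [p for w in "Transformer is awesome .".split()
          for p in greedy_subword_split(w, toy_vocab)]
print(pieces)
```

Because unknown strings fall back to single characters, every word can still be encoded even when it never appeared in the corpus.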

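## Add `<BOS>` and `<EOS>` to the start and end of each sentence

<img src="https://hackmd.io/_uploads/SJfeKvYJa.png" alt="Drawing" style="width: 400px;"/>

`.vocab_size` is used as the `<BOS>` id and `.vocab_size + 1` as the `<EOS>` id.

From here on, all training data is drawn from `train_examples` and converted to `token_id`s with `encode`.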

In [None]:
def encode(en_t, zh_t):
    en_indics = [tokenizer_en.vocab_size] + tokenizer_en.encode(en_t.numpy()) + [tokenizer_en.vocab_size + 1]
    zh_indics = [tokenizer_zh.vocab_size] + tokenizer_zh.encode(zh_t.numpy()) + [tokenizer_zh.vocab_size + 1]
    return en_indics, zh_indics

In [None]:
en_t, zh_t = next(iter(train_examples))
en_indics, zh_indics = encode(en_t, zh_t)

print('English <BOS>: %d' % tokenizer_en.vocab_size)
print('English <EOS>: %d' % (tokenizer_en.vocab_size + 1))
print('Chinese <BOS>: %d' % tokenizer_zh.vocab_size)
print('Chinese <EOS>: %d' % (tokenizer_zh.vocab_size + 1))

print('-' * 20)
print('Before encode (two tensors): ')
pprint((en_t, zh_t))
print()
print('After encode (two lists): ')
pprint((en_indics, zh_indics))
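
As a minimal sketch of the id convention used by `encode`, the snippet below wraps a list of ids without needing a trained tokenizer. `add_bos_eos`, `toy_vocab_size`, and the sample ids are hypothetical values for illustration only.

```python
def add_bos_eos(token_ids, vocab_size):
    """Wrap a list of subword ids with the <BOS> and <EOS> ids."""
    bos_id = vocab_size        # first id past the subword vocabulary
    eos_id = vocab_size + 1    # second id past the subword vocabulary
    return [bos_id] + list(token_ids) + [eos_id]

# Hypothetical numbers: a vocabulary of 8000 subwords and a 3-token sentence.
toy_vocab_size = 8000
wrapped = add_bos_eos([17, 254, 3], toy_vocab_size)
print(wrapped)

# Any embedding table or output layer must cover the two extra ids as well.
embedding_rows_needed = toy_vocab_size + 2
print(embedding_rows_needed)
```

One consequence of this convention is that a model built on these ids needs `vocab_size + 2` rows in its embedding table and output layer.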

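## Convert the output of `encode` into graph `Tensor`s

Mapping `encode` directly over `train_examples` raises `'Tensor' object has no attribute 'numpy'`.

This happens because `Dataset.map` traces the mapped function in graph mode (the `tf1.0` execution style), where tensors are symbolic, so the `tf2.0` eager-mode `.numpy()` method is not available. The quickest fix is to wrap `encode` in `tf.py_function`, which runs the function eagerly in Python.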

https://github.com/tensorflow/tensorflow/blob/r2.0/tensorflow/python/data/ops/dataset_ops.py#L1099-L1214

In [None]:
# import traceback

# try:
#     train_examples.map(encode)
# except AttributeError:
#     traceback.print_exc()

In [None]:
def tf_encode(en_t, zh_t):

    return tf.py_function(encode, [en_t, zh_t], [tf.int64, tf.int64])

tmp_dataset = train_examples.map(tf_encode)
en_indices, zh_indices = next(iter(tmp_dataset))

print('After tf_encode: (two tensor)')
print(en_indices)
print(zh_indices)

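## Limit sentence length

To speed up training, cap the length of both the English and Chinese sentences: `tf.logical_and` combines the two length checks and `.filter` drops the pairs that fail.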

In [None]:
max_length = 50
def filter_max_length(en_t, zh_t, max_length = max_length):

    return tf.logical_and(tf.size(en_t) <= max_length,
                          tf.size(zh_t) <= max_length)

tmp_dataset = tmp_dataset.filter(filter_max_length)

## Padding

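Each `batch` is padded separately: the English and Chinese sequences are padded with 0 up to the longest sequence in that batch.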

In [None]:
batch_size = 64
tmp_dataset = tmp_dataset.padded_batch(batch_size=batch_size, padded_shapes=([-1], [-1]))

en_batch, zh_batch = next(iter(tmp_dataset))

print('English batch: ')
print(en_batch)
print('-' * 15)
print('Chinese batch: ')
print(zh_batch)
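
The length filter and per-batch padding can be mimicked in plain NumPy. This is a sketch under simplifying assumptions: `filter_and_pad` and the token ids are hypothetical, and it handles a single list of sequences rather than English-Chinese pairs.

```python
import numpy as np

def filter_and_pad(sequences, max_length, pad_id=0):
    """Drop sequences longer than `max_length`, then right-pad to the longest one kept."""
    kept = [s for s in sequences if len(s) <= max_length]
    longest = max(len(s) for s in kept)
    batch = np.full((len(kept), longest), pad_id, dtype=np.int64)
    for row, seq in enumerate(kept):
        batch[row, :len(seq)] = seq
    return batch

# Hypothetical token ids; the third sentence exceeds max_length and is dropped.
toy_sentences = [[5, 9, 2], [7, 1], [3, 3, 3, 3, 3, 3]]
padded = filter_and_pad(toy_sentences, max_length=4)
print(padded)
```

The padded width follows the longest sequence that survives the filter, not `max_length`, which is why different batches can have different widths.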

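### Apply the same processing to `train_examples` and `val_examples`

* `train`:

 - `map(tf_encode)`: converts strings to `token_id`s.
 - `filter(filter_max_length)`: drops sentence pairs longer than the maximum length.
 - `cache()`: caches the encoded examples during the first pass so that later epochs skip re-encoding, which speeds up training.
 - `shuffle(buffer_size)`: keeps a `buffer` of `buffer_size` examples and draws each next example from it at random, which gives randomness without loading the whole dataset into memory.
 - `padded_batch(batch_size, padded_shapes=([-1],[-1]))`: forms batches and pads each one to its longest sequence.

tf.data performance guide: https://www.tensorflow.org/guide/performance/datasets?hl=zh_cn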

In [None]:
max_length = 50
batch_size = 128
buffer_size = 15000

train_dataset = (train_examples
                 .map(tf_encode)
                 .filter(filter_max_length)
                 .cache()
                 .shuffle(buffer_size)
                 .padded_batch(batch_size, padded_shapes=([-1],[-1])))


val_dataset = (val_examples
               .map(tf_encode)
               .filter(filter_max_length)
               .padded_batch(batch_size, padded_shapes=([-1], [-1])))

In [None]:
en_batch, zh_batch = next(iter(train_dataset))

print('English batch tensor: ')
print(en_batch)
print('-' * 20)
print('Chinese batch tensor: ')
print(zh_batch)
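
To make the role of `buffer_size` concrete, here is a simplified pure-Python model of buffer-based shuffling. `buffered_shuffle` is a hypothetical helper, not TensorFlow's implementation; it only mirrors the idea of filling a buffer and then sampling from it.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a random order limited by a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                # emit a random buffered element
        buffer[idx] = item               # and replace it with the new one
    rng.shuffle(buffer)                  # input exhausted: drain the rest
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=3))
print(shuffled)
print(list(buffered_shuffle(range(5), buffer_size=1)))  # a buffer of 1 keeps the order
```

An element can be emitted at most `buffer_size - 1` positions earlier than its original position, so a small buffer only shuffles locally; a buffer at least as large as the dataset gives a full shuffle.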

### Processing new data

1. `map(tf_encode)`: convert to `token_id`s.
2. `filter(filter_max_length)`: drop pairs longer than the maximum length.
3. `padded_batch()`: padding.

In [None]:
demo_examples = [
    ("It is important.", "這很重要。"),
    ("The math speaks for themselves.", "數學證明一切。"),
]

batch_size = 2
demo_examples = tf.data.Dataset.from_tensor_slices((
    [en for en, _ in demo_examples], [zh for _, zh in demo_examples]
))

demo_examples = demo_examples.map(tf_encode).filter(filter_max_length).padded_batch(batch_size, padded_shapes=([-1],[-1]))

en_sample, zh_sample = next(iter(demo_examples))
print(en_sample)
print('-' * 15)
pprint(zh_sample)

# Transformer

<img src="https://hackmd.io/_uploads/HJFfYvY1p.png" alt="Drawing" style="width: 400px;"/>


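The model is split into an `Encoder` and a `Decoder`:

1. `Encoder`: receives the `source sentence`; its main job is to run `self-attention` with the `source sentence` as `q,k,v`.

2. `Decoder`: receives the `target sentence`; it has two main jobs:
 - Run `self-attention` with the `target sentence` as `q,k,v`.
 - Take the `Encoder` output as `v,k` and attend to it with the `Decoder`'s `q`. This is encoder-decoder attention rather than `self-attention`, since the queries and the keys come from different sequences.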

## Positional Encoding

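Word embeddings capture the similarity between word vectors. The Transformer relates every pair of positions directly through dot products, which addresses the long-range dependency problem of RNNs, but on its own this ignores the order of the words in a sentence. Positional encoding adds that information: word vectors end up close not only because they are semantically related through the word embedding, but also because their positions are close.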

$$
PE_{(pos,2i)} = \sin(pos/10000^{\frac{2i}{d_{model}}}) \\
PE_{(pos,2i+1)} = \cos(pos/10000^{\frac{2i}{d_{model}}})
$$

#### Later you can experiment with the constant in the trigonometric functions, for example 10000 -> 100

In [None]:
# Build the angles first
def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, 2 * (i//2) / np.float32(d_model))
    return pos * angle_rates

In [None]:
def positional_encoding(position, d_model):
    """
    Even dimensions (2i) use sin, odd dimensions (2i+1) use cos; each sin/cos pair shares one frequency.

    Each row encodes one position pos, with angle a_i = pos / 10000^(2i/d_model):

                position 0: [[sin(a_0), cos(a_0), sin(a_1), cos(a_1), ...],
                position 1:  [sin(a_0), cos(a_0), sin(a_1), cos(a_1), ...],
                ...
    position (position-1):   [sin(a_0), cos(a_0), sin(a_1), cos(a_1), ...]]

    return:
    (1, position, d_model)
    """
    pos = np.arange(position)[:, np.newaxis] # [[0],[1],[2],...,[pos-1]]
    i = np.arange(d_model)[np.newaxis, :] # [[0,1,2,3,...,d_model-1]]

    angle_rads = get_angles(pos, i, d_model) # (position, d_model)


    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)

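## Understanding positional encoding

This example takes the positional encoding at position 25 and computes its dot product (`np.dot`) with the encodings of all 50 positions (including itself). The dot product is largest at position 25 itself and tends to shrink as the distance from position 25 grows.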

In [None]:
position = 50 # 50 positions
d_model = 512 # each positional encoding has 512 dimensions
pos_encoding = positional_encoding(position, d_model)

inp = pos_encoding[0][25].numpy()

dis_list = list()
for i in range(50):
    tar = pos_encoding[0][i].numpy()
    dot_prod = np.dot(inp, tar)
    dis_list.append(dot_prod)

In [None]:
plt.figure(figsize=(12,10))
plt.plot(dis_list)
plt.xticks(list(range(50)))
plt.show()
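
The shape of this curve follows from the identity sin(a)sin(b) + cos(a)cos(b) = cos(a - b): the dot product of two encodings is a sum of cosines of the position offset, so it depends only on the distance between positions and peaks at distance 0. A NumPy-only sketch that checks this numerically (`sinusoidal_encoding` is a hypothetical re-implementation that mirrors `positional_encoding` without TensorFlow):

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a (num_positions, d_model) array of sinusoidal position encodings."""
    pos = np.arange(num_positions)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angles = pos / np.power(base, 2 * (i // 2) / d_model)
    encoding = np.zeros((num_positions, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions
    return encoding

pe = sinusoidal_encoding(50, 512)
dots = pe @ pe[25]

print(dots[25])                 # sin^2 + cos^2 summed over d_model / 2 pairs
print(int(np.argmax(dots)))     # index of the largest dot product
print(np.isclose(dots[20], dots[30]))  # same offset on either side
```

The self dot product equals `d_model / 2` because each sin/cos pair contributes exactly 1.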

## Masking
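The Transformer needs masking in two places. In both cases the positions to mask are specified first, and the mask is then applied to the attention matrix obtained from the `QK` dot product.

 1. `Padding_masking`: the padded part of a sentence should not be attended to; the mask pushes the corresponding self-attention weights close to 0.
 2. `Look_ahead_masking`: used by the masked self-attention in the Decoder so that the current token cannot attend to any later token; again the corresponding weights are pushed close to 0.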

<img src="https://hackmd.io/_uploads/BkvNKPtJT.png" alt="Drawing" style="width: 400px;"/>

### Padding masking

In [None]:
def create_padding_mask(seq):
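    """
    Input:
    token ids, where index 0 in the vocabulary is padding

    Every 0 in the input becomes 1, marking a position to be masked

    Return:
    shape (batch_size, 1, 1, seq_len); the two inserted axes allow broadcasting against the attention logits
    """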
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return seq[:, tf.newaxis, tf.newaxis, :]

In [None]:
x = tf.constant([[7, 6, 0, 0, 1], [1, 2, 3, 0, 0], [0, 0, 0, 4, 5]])
print(x)
print(create_padding_mask(x))
# Positions with value 1 are the ones that will be masked

### Look ahead masking

In [None]:
def create_look_ahead_mask(size):
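    """
    Input: the size of the square self-attention weight matrix; its upper triangle (above the main diagonal) is masked

    tf.linalg.band_part(input, num_lower, num_upper)
    num_lower, num_upper: number of sub-diagonals and super-diagonals to keep; -1 keeps the entire lower or upper triangle
    """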
    mask = 1 - tf.linalg.band_part(tf.ones((size,size)), -1, 0)
    return mask

In [None]:
create_look_ahead_mask(3)
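
The two masks and the way they combine can be reproduced with NumPy alone. This is an illustrative sketch: `padding_mask`, `look_ahead_mask`, and the sample `target` batch are hypothetical stand-ins for the TensorFlow functions above.

```python
import numpy as np

def padding_mask(token_ids):
    """1.0 where the token id is 0 (padding), shaped (batch, 1, 1, seq_len)."""
    mask = (np.asarray(token_ids) == 0).astype(np.float32)
    return mask[:, np.newaxis, np.newaxis, :]

def look_ahead_mask(size):
    """1.0 strictly above the main diagonal, i.e. at future positions."""
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

# Hypothetical target batch: one sentence of 2 tokens padded to length 3.
target = np.array([[4, 7, 0]])
combined = np.maximum(padding_mask(target), look_ahead_mask(3))

print(look_ahead_mask(3))
print(combined.shape)
print(combined[0, 0])
```

Taking the element-wise maximum masks a position if either mask marks it, and broadcasting turns the `(batch, 1, 1, seq_len)` padding mask and the `(seq_len, seq_len)` look-ahead mask into one `(batch, 1, seq_len, seq_len)` mask.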

## Scaled dot-product attention(self-attention)

<img src="https://hackmd.io/_uploads/SkHUKwKk6.png" alt="Drawing" style="width: 700px;"/>

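1. The matrix product of Q and K is where self-attention happens: it scores how much each token in Q attends to each token in K.
2. The scores are then scaled so that the attention weights after Softmax do not saturate to almost exactly 1 or 0 (a hard softmax), which would produce very small gradients.
3. Softmax yields the attention weight matrix, which is multiplied by V to give the context matrix.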

$$
\mathrm{Attention}(Q,K,V) = \mathrm{Softmax}(\frac{QK^\top}{\sqrt{d_k}})V
$$

<img src="https://hackmd.io/_uploads/HkIPYDtya.png" alt="Drawing" style="width: 500px;"/>

<img src="https://hackmd.io/_uploads/S1EdYvt1a.png" alt="Drawing" style="width: 500px;"/>


In [None]:
def scaled_dot_product_attention(q, k, v, mask):
    """
    Args:
        q: query shape == (..., seq_len_q, depth_k)
        k: key shape == (..., seq_len_k, depth_k)
        v: value shape == (..., seq_len_v, depth_v)
        mask: Float tensor with shape broadcastable to (..., seq_len_q, seq_len_k)
    """
    # Matrix product of q and k
    matmul_qk = tf.matmul(q, k, transpose_b=True) # (..., seq_len_q, seq_len_k)

    # Scaled
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # mask
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)
    # Softmax over the last axis (seq_len_k): each query's attention weights over all keys
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis = -1)

    output = tf.matmul(attention_weights, v) # (..., seq_len_q, depth_v)

    return output, attention_weights

In [None]:
"""
Suppose one token (query) attends to four tokens (keys); the resulting attention weights are then multiplied by the values
"""

temp_k = tf.constant([[10,0,0],
                      [0,10,0],
                      [0,0,10],
                      [0,0,10]], dtype=tf.float32)  # (4, 3)

temp_v = tf.constant([[10,0,0],
                      [0,10,0],
                      [0,0,10],
                      [0,0,10]], dtype=tf.float32)  # (4, 3)

temp_q = tf.constant([[0, 10, 0]], dtype=tf.float32) # (1, 3)

output, attention_weights = scaled_dot_product_attention(temp_q, temp_k, temp_v, mask=None)

In [None]:
print('Attention weights: ')
print(attention_weights)
print()
print('Output: ')
print(output)

In [None]:
"""
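Suppose four tokens (queries) attend to four tokens (keys) with the upper triangle masked; the resulting attention weights are then multiplied by the values

After masking the upper-right corner, the attention weights in the upper triangle are close to 0
"""

# To make the weights easy to read, every entry of temp_q is set to 1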
temp_q = tf.constant([[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]], dtype=tf.float32) # (4, 3)

mask = create_look_ahead_mask(temp_q.shape[0])

output, attention_weights = scaled_dot_product_attention(temp_q, temp_k, temp_v, mask=mask)

In [None]:
print('Attention weights: ')
print(attention_weights)
print()
print('Output: ')
print(output)
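
The masked example can be checked by hand with a NumPy version of the same formula. This is a sketch, not the notebook's TensorFlow function: `attention` is a hypothetical re-implementation, and the inputs copy the `temp_k`, `temp_v`, and all-ones `temp_q` values used above.

```python
import numpy as np

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; `mask` holds 1.0 at positions to hide."""
    logits = q @ k.T / np.sqrt(k.shape[-1])
    if mask is not None:
        logits = logits + mask * -1e9
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

k = v = 10.0 * np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=np.float64)
q = np.ones((4, 3))
causal = np.triu(np.ones((4, 4)), k=1)   # 1.0 above the diagonal = future positions

out, w = attention(q, k, v, mask=causal)
print(np.round(w, 3))
print(np.round(out, 3))
```

Since every query scores all keys equally here, row `t` of the weights is uniform over the first `t + 1` keys, and each output row is the average of the values that row is allowed to see.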

## Multi-Head Attention

<img src="https://hackmd.io/_uploads/r1sKFPKkT.png" alt="Drawing" style="width: 300px;"/>

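`q,k,v` are split into num_heads parts, each part runs attention on its own, and the results are concatenated and passed through a dense layer. The main benefit of multiple heads is that each head can attend to different parts of the sequence. Because each head works on a smaller `depth = d_model / num_heads`, the total cost stays similar to single-head attention over the full `d_model`.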

In [None]:
class MultiHeadAttention(tf.keras.layers.Layer):

    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        # Make sure d_model is divisible by num_heads
        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """
        Split d_model into (num_heads, depth)
        Transpose to (batch_size, num_heads, seq_len, depth) for the attention step that follows
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def __call__(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q) # (batch_size, seq_len, d_model)
        k = self.wk(k) # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)

        q = self.split_heads(q, batch_size) # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size) # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size) # (batch_size, num_heads, seq_len_v, depth)

        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)

        # Transpose to (batch_size, seq_len_q, num_heads, depth) so the heads can be concatenated
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])

        # Merge the last two axes: (batch_size, seq_len_q, d_model)
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))

        output = self.dense(concat_attention)

        return output, attention_weights

In [None]:
y = tf.random.uniform((1, 60, 512)) # (batch_size, seq_len, d_model)

d_model = 512
num_heads = 8
temp_mha = MultiHeadAttention(d_model, num_heads)
output, attention_weights = temp_mha(v=y, k=y, q=y, mask=None)

In [None]:
# The output is still (batch_size, seq_len, d_model)
print('output shape', output.shape)

# Each of the 8 heads has its own attention weight matrix
print('attention_weights shape: ', attention_weights.shape)
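
The reshape and transpose inside `split_heads`, and the inverse step before the final dense layer, can be seen in isolation with NumPy. The function names and the small sizes below are hypothetical, chosen so the arrays are easy to inspect.

```python
import numpy as np

def split_heads(x, num_heads):
    """(batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)."""
    batch, seq_len, d_model = x.shape
    depth = d_model // num_heads
    return x.reshape(batch, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

def merge_heads(x):
    """(batch, num_heads, seq_len, depth) -> (batch, seq_len, d_model)."""
    batch, num_heads, seq_len, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, num_heads * depth)

x = np.arange(2 * 5 * 8, dtype=np.float32).reshape(2, 5, 8)
heads = split_heads(x, num_heads=4)

print(heads.shape)
print(np.array_equal(merge_heads(heads), x))
# Head h sees the contiguous feature slice [h * depth, (h + 1) * depth).
print(np.array_equal(heads[:, 1], x[:, :, 2:4]))
```

Splitting is a pure re-indexing of the features, so merging recovers the original tensor exactly; the heads differ only because the learned projections before the split place different information in each slice.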

### Point-wise feed forward network

$$
\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
$$

In [None]:
def point_wise_ffn(d_model, dff):
    return tf.keras.Sequential([tf.keras.layers.Dense(dff, activation='relu'), # (batch_size, seq_len, dff)
                                tf.keras.layers.Dense(d_model)]) # (batch_size, seq_len, d_model)

In [None]:
d_model = 512
dff = 2048
sample_ffn = point_wise_ffn(d_model, dff)
sample_ffn(tf.random.uniform((64, 50, 512))).shape
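
"Point-wise" means the same two dense layers are applied to every position independently. A small NumPy sketch with hypothetical random weights and sizes (not the Keras layers above) shows that processing the whole sequence at once matches processing one position at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, dff = 4, 16                      # small hypothetical sizes
W1, b1 = rng.normal(size=(d_model, dff)), np.zeros(dff)
W2, b2 = rng.normal(size=(dff, d_model)), np.zeros(d_model)

def ffn(x):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied over the last axis."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(2, 3, d_model))      # (batch_size, seq_len, d_model)
whole = ffn(x)

# Applying the same weights to one position at a time gives the same result.
per_position = np.stack([ffn(x[:, t]) for t in range(x.shape[1])], axis=1)
print(whole.shape)
print(np.allclose(whole, per_position))
```

No information moves between positions in this sublayer; mixing across positions happens only in the attention sublayers.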

## Encoderblock and Decoderblock

<img src="https://hackmd.io/_uploads/HJFfYvY1p.png" alt="Drawing" style="width: 400px;"/>


### EncoderLayer

Here we assemble the `Encoderlayer` outlined by the orange dashed box above. It is built from two components, `MultiHeadAttention` and `point_wise_ffn`, in the order shown in the figure:

1. `MultiHeadAttention(padding_mask)`

2. `Residual connection` + `Layer Normalization`

3. `point_wise_ffn`

4. `Residual connection` + `Layer Normalization`

`dropout` is mentioned in the paper, so it is added as well.

In [None]:
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate = 0.1):
        super().__init__()

        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_ffn(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    def __call__(self, x, training, mask):
        # The Encoder's attention weights are not needed
        attention_output, _ = self.mha(v = x, k = x, q = x, mask=mask) # (batch_size, input_seq_len, d_model)
        # Dropout is not applied at inference time
        attention_output = self.dropout1(attention_output, training=training) # (batch_size, input_seq_len, d_model)
        # Residual + Layer Normalization
        out1 = self.layernorm1(x + attention_output) # (batch_size, input_seq_len, d_model)

        ffn_output = self.ffn(out1) # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training) # (batch_size, input_seq_len, d_model)
        enc_output = self.layernorm2(out1 + ffn_output) # (batch_size, input_seq_len, d_model)

        return enc_output

In [None]:
d_model = 512
num_heads = 8
dff = 2048
dropout_rate = 0.1

sample_encooder_layer = EncoderLayer(d_model, num_heads, dff, dropout_rate)

x = tf.random.uniform((64, 50, 512))
training = False
mask = None

sample_encooder_layer_output = sample_encooder_layer(x, training, mask)
sample_encooder_layer_output.shape  # (batch_size, input_seq_len, d_model)
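
The "Residual connection + Layer Normalization" step can be sketched in NumPy. This is a simplified illustration: `layer_norm` is a hypothetical helper that omits the learned scale and offset of `tf.keras.layers.LayerNormalization`, and the random tensors stand in for real activations.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    """Normalize each position over its feature axis (no learned scale or offset)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 5, 8))   # sublayer input
sublayer_out = rng.normal(size=(2, 5, 8))            # stand-in for attention or FFN output

out = layer_norm(x + sublayer_out)                   # residual connection, then normalization
print(out.shape)
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))
print(np.allclose(out.std(axis=-1), 1.0, atol=1e-3))
```

The statistics are computed per position over the `d_model` features, so the result does not depend on the other sentences in the batch or on the sequence length.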

### DecoderLayer

Here we assemble the `Decoderlayer` outlined by the orange dashed box above. It is built from two components, `MultiHeadAttention` and `point_wise_ffn`, in the order shown in the figure:

1. `MultiHeadAttention(padding_mask + look_ahead_mask)`

2. `Residual connection` + `Layer Normalization`

3. `MultiHeadAttention(padding_mask)`

4. `Residual connection` + `Layer Normalization`

5. `point_wise_ffn`

6. `Residual connection` + `Layer Normalization`

In [None]:
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate = 0.1):
        super().__init__()

        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)

        self.ffn = point_wise_ffn(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout3 = tf.keras.layers.Dropout(dropout_rate)

    def __call__(self, x, enc_output, training, look_ahead_mask, padding_mask):

        # Masked self-attention; its attention weight matrix is inspected later
        # look_ahead_mask lets each decoder position attend only to itself and earlier positions
        attention_output1, masked_attention_weights = self.mha1(v=x, k=x, q=x, mask=look_ahead_mask) # (batch_size, output_seq_len, d_model)
        attention_output1 = self.dropout1(attention_output1, training=training) # (batch_size, output_seq_len, d_model)
        attention_output1 = self.layernorm1(x + attention_output1) # (batch_size, output_seq_len, d_model)

        # padding_mask hides the attention weights on padding so that no token attends to a padded position
        attention_output2, dec_attention_weights = self.mha2(v=enc_output, k=enc_output, q=attention_output1, mask=padding_mask) # (batch_size, output_seq_len, d_model)
        attention_output2 = self.dropout2(attention_output2, training=training) # (batch_size, output_seq_len, d_model)
        attention_output2 = self.layernorm2(attention_output1 + attention_output2) # (batch_size, output_seq_len, d_model)

        ffn_output = self.ffn(attention_output2)
        ffn_output = self.dropout3(ffn_output, training=training)
        dec_output = self.layernorm3(attention_output2 + ffn_output)

        return dec_output, masked_attention_weights, dec_attention_weights

In [None]:
d_model = 512
num_heads = 8
dff = 2048
dropout_rate = 0.1

x = tf.random.uniform((64, 60, 512))
training = False
look_ahead_mask = None
padding_mask = None

sample_decoder_layer = DecoderLayer(d_model, num_heads, dff, dropout_rate)
sample_dec_output, masked_attention_weights, dec_attention_weights = sample_decoder_layer(x, sample_encooder_layer_output,
                                                                                          training,
                                                                                          look_ahead_mask,
                                                                                          padding_mask)  # (batch_size, target_seq_len, d_model)

In [None]:
# (batch_size, output_seq_len, d_model)
print('dec_output shape: ', sample_dec_output.shape)
# (batch_size, num_heads, output_seq_len, output_seq_len)
print('masked_attention_weights shape: ', masked_attention_weights.shape)
# (batch_size, num_heads, output_seq_len, Input_seq_len)
print('dec_attention_weights shape: ', dec_attention_weights.shape)
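
The shape `(output_seq_len, input_seq_len)` of the second attention block comes from taking queries from the decoder and keys and values from the encoder. A single-head NumPy sketch without the learned projections illustrates this; all names, sizes, and the padding pattern are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
depth = 8
dec_states = rng.normal(size=(6, depth))    # 6 target positions -> queries
enc_output = rng.normal(size=(5, depth))    # 5 source positions -> keys and values

# Hypothetical source padding: the last two source positions are padding.
src_padding = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

logits = dec_states @ enc_output.T / np.sqrt(depth)     # (6, 5)
weights = softmax(logits + src_padding * -1e9)          # mask broadcasts over the queries
context = weights @ enc_output                          # (6, depth)

print(weights.shape, context.shape)
print(np.allclose(weights[:, 3:], 0.0))
```

The mask used here is built from the source sentence, because the positions being attended to are source positions.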

### Encoder

The main structure of the `Encoderlayer` is done. Adding the two inputs in front of the stacked `Encoderlayer`s forms the complete `Encoder`.

1. `Source Word embedding`
2. `Positional encoding`
3. `Encoder Layer * num_layers`

In [None]:
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, dropout_rate=0.1):
        super().__init__()

        self.num_layers = num_layers
        self.d_model = d_model

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(input_vocab_size, d_model)

        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, dropout_rate) for _ in range(self.num_layers)]

        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def __call__(self, x, training, mask):

        seq_len = tf.shape(x)[1]

        x = self.embedding(x) # (batch_size, seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :] # (batch_size, seq_len, d_model)

        x = self.dropout(x, training=training) # (batch_size, seq_len, d_model)

        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask) # (batch_size, seq_len, d_model)

        return x

In [None]:
num_layers = 2
d_model = 512
num_heads = 8
dff = 2048
input_vocab_size = 10000
dropout_rate = 0.1

sample_encoder = Encoder(num_layers, d_model, num_heads, dff, input_vocab_size, dropout_rate)

# Simulate 64 input sentences, each padded to 50 token ids
x = tf.random.uniform((64, 50), minval=0, maxval=input_vocab_size, dtype=tf.int64)
training = False
mask = None

sample_encoder_output = sample_encoder(x, training, mask)
# (batch_size, input_seq_len, d_model)
print('sample_encoder_output shape: ',sample_encoder_output.shape)  # (batch_size, input_seq_len, d_model)

### Decoder

The `Decoder` input is likewise a `word embedding` plus a `positional encoding`.

1. `Target Word embedding`
2. `Positional encoding`
3. `Decoder Layer * num_layers`

`masked_attention_weights` and `dec_attention_weights` are inspected later, so they are collected in an `attention_weights` dictionary.

In [None]:
class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, output_vocab_size, dropout_rate = 0.1):
        super().__init__()

        self.num_layers = num_layers
        self.d_model = d_model

        self.embedding = tf.keras.layers.Embedding(output_vocab_size, d_model)
        self.pos_encoding = positional_encoding(output_vocab_size, d_model)

        self.dec_layers = [DecoderLayer(d_model, num_heads, dff, dropout_rate) for _ in range(num_layers)]

        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def __call__(self, x, enc_output, training, look_ahead_mask, padding_mask):

        seq_len = tf.shape(x)[1]
        attention_weights = {}

        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            # x.shape: (batch_size, output_seq_len, d_model)
            x, masked_attention_weights, dec_attention_weights = self.dec_layers[i](x, enc_output, training, look_ahead_mask, padding_mask)

            # masked attention: (batch_size, num_head, output_seq_len, output_seq_len)
            # dec attention: (batch_size, num_head, output_seq_len, input_seq_len)
            attention_weights['decoder_layer{}_masked_attention_weights'.format(i + 1)] = masked_attention_weights
            attention_weights['decoder_layer{}_dec_attention_weights'.format(i + 1)] = dec_attention_weights

        return x, attention_weights

In [None]:
num_layers = 2
d_model = 512
num_heads = 8
dff = 2048
output_vocab_size = 10000
dropout_rate = 0.1

sample_decoder = Decoder(num_layers, d_model, num_heads, dff, output_vocab_size, dropout_rate)

# Simulate 64 target sentences, each padded to 20 token ids
x = tf.random.uniform((64, 20), minval=0, maxval=output_vocab_size, dtype=tf.int64)
training = False
look_ahead_mask = None
padding_mask = None

sample_decoder_output, attention_weights = sample_decoder(x, sample_encoder_output, training, look_ahead_mask, padding_mask)

# (batch_size, output_seq_len, d_model)
print('sample_decoder_output shape:', sample_decoder_output.shape)

# masked attention: (batch_size, num_head, output_seq_len, output_seq_len)
# dec attention: (batch_size, num_head, output_seq_len, input_seq_len)
# dec attention is the attention of output_seq over input_seq
for key, value in attention_weights.items():
    print(key, ' :', value.shape)

## Transformer

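Combine the `Encoder` and `Decoder` and add a final `Dense` layer that outputs logits over the target vocabulary; softmax is applied later inside the loss through `from_logits=True`.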

In [None]:
class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, output_vocab_size, dropout_rate = 0.1):
        super().__init__()

        self.encoder = Encoder(num_layers, d_model, num_heads, dff, input_vocab_size, dropout_rate)
        self.decoder = Decoder(num_layers, d_model, num_heads, dff, output_vocab_size, dropout_rate)

        self.final_layer = tf.keras.layers.Dense(output_vocab_size)

    def __call__(self, inp, tar, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):

        # enc_output.shape: (batch_size, inp_seq_len, d_model)
        enc_output = self.encoder(inp, training, enc_padding_mask)

        # dec_output.shape: (batch_size, tar_seq_len, d_model)
        dec_output, attention_weights = self.decoder(tar, enc_output, training, look_ahead_mask, dec_padding_mask)

        # final_output.shape: (batch_size, tar_seq_len, output_vocab_size)
        final_output = self.final_layer(dec_output)

        return final_output, attention_weights

In [None]:
num_layers = 2
d_model = 512
num_heads = 8
dff = 2048
input_vocab_size = 10000
output_vocab_size = 10000
dropout_rate = 0.1


sample_transformer = Transformer(num_layers, d_model, num_heads, dff, input_vocab_size, output_vocab_size, dropout_rate)

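# Input: simulate 64 sentences, each padded to 50 token ids
# Target: simulate 64 sentences, each padded to 20 token ids
temp_input = tf.random.uniform((64, 50), minval=0, maxval=input_vocab_size, dtype=tf.int64)
temp_target = tf.random.uniform((64, 20), minval=0, maxval=output_vocab_size, dtype=tf.int64)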
training = False
enc_padding_mask = None
look_ahead_mask = None
dec_padding_mask = None

final_output, attention_weights = sample_transformer(temp_input, temp_target, training, enc_padding_mask, look_ahead_mask, dec_padding_mask)

print('final_output shape:', final_output.shape)  # (batch_size, tar_seq_len, output_vocab_size)

# masked attention: (batch_size, num_head, output_seq_len, output_seq_len)
# dec attention: (batch_size, num_head, output_seq_len, input_seq_len)
# dec attention is the attention of output_seq over input_seq
for key, value in attention_weights.items():
    print(key, ' :', value.shape)

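## Optimizer and Custom Learning Rate
The paper uses `Adam` with a custom `Learning rate` schedule: the `Learning rate` increases linearly until `warmup_steps` and then decays in proportion to the inverse square root of the step number.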

$$
\mathrm{lrate} = d_{model}^{-0.5} \times \min(\mathrm{step\_num}^{-0.5},\; \mathrm{step\_num} \times \mathrm{warmup\_steps}^{-1.5})
$$

In [None]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()

        self.d_model = tf.math.rsqrt(tf.cast(d_model, tf.float32))

        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step) # step_num^{-0.5}
        arg2 = step * (self.warmup_steps ** -1.5) # step_num * warmup_step^{-1.5}

        return self.d_model * tf.math.minimum(arg1, arg2)

### Effect of different `warmup_steps` on the `learning rate`

In [None]:
d_model = 512
warmup_steps = [3000 ,4000, 5000, 6000]

step = tf.range(50000, dtype=tf.float32)

for warmup_step in warmup_steps:
    temp_learning_rate_schedule = CustomSchedule(d_model, warmup_step)
    plt.plot(temp_learning_rate_schedule(step), label = str(warmup_step))
    plt.ylabel('Learning Rate')
    plt.xlabel('Train Step')
    plt.legend(loc='upper right')

In [None]:
d_model = 512
warmup_steps = 4000
# learning_rate = CustomSchedule(d_model, warmup_steps)

beta_1 = 0.9
beta_2 = 0.98
epsilon = 1e-9
optimizer = tf.keras.optimizers.Adam(learning_rate=CustomSchedule(d_model, warmup_steps), beta_1=beta_1, beta_2=beta_2, epsilon=epsilon)

### Loss and metrics

The `loss` at `padding` positions of a sentence should not be counted, so those positions are masked out.
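
The effect of the mask can be seen on a toy sequence. This is a minimal pure-Python sketch, not the notebook's TensorFlow code: the helper `token_cross_entropy`, the token ids and the logits are made-up values for illustration, and padding is assumed to use id 0.

```python
import math

def token_cross_entropy(logits, label):
    """Cross-entropy of one token computed from raw logits (illustrative helper)."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in logits))
    return log_sum_exp - logits[label]

# Toy target: two real tokens followed by two padding tokens (id 0).
real = [2, 1, 0, 0]
pred = [[0.1, 0.2, 3.0], [0.3, 2.5, 0.1], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]

mask = [0.0 if token == 0 else 1.0 for token in real]
losses = [token_cross_entropy(p, t) * m for p, t, m in zip(pred, real, mask)]

print(losses)                     # padding positions contribute exactly 0.0
print(sum(losses) / sum(mask))    # average over the real tokens only
print(sum(losses) / len(losses))  # average over every position, padding included
```

The two averages differ only in the denominator. Dividing by the number of real tokens keeps the reported loss independent of how much padding a batch happens to contain.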

In [None]:
def loss_function(real, pred):

    mask = tf.math.logical_not(tf.math.equal(real, 0)) # False at padding positions (index 0) of the sequence

    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
    """
    from_logits: y_pred is expected to be a logits tensor. By default, we assume that y_pred encodes a probability distribution.
    reduction: the reduction schedule of output loss vectors. `https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/losses/Reduction`
    """

    loss_ = loss_object(real, pred)

    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask # only non-padding positions contribute to the loss

    return tf.reduce_sum(loss_) / tf.reduce_sum(mask) # average over non-padding positions only

In [None]:
# Loss sample
cce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

y_true = tf.constant([0, 1, 0], dtype=tf.float32)
y_pred = tf.constant([[.95, .05], [.11, .89], [.05, .95]], dtype=tf.float32)

loss = cce(y_true, y_pred)
print('Loss: ', loss.numpy())  # one loss per example because reduction='none'; their mean is about 0.6532

### Loss, Accuracy

In [None]:
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
    name='train_accuracy')

### Create masking

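Create the masks that the `Encoder` and `Decoder` need during training.

* `Encoder`:
  - The first Multi-head attention needs the Source's `padding_mask`


* `Decoder`:
  - The first Masked Multi-head attention needs the Target's `padding_mask` combined with the `look_ahead_mask`
  - The second Multi-head attention needs the Source's `padding_mask`, because its keys and values come from the `Encoder` output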
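
How the Decoder's two masks combine can be sketched in plain Python. The helpers below are hypothetical stand-ins, not the notebook's `create_padding_mask` and `create_look_ahead_mask`. They assume that 1 marks a position to hide and that padding uses token id 0.

```python
def padding_mask(seq):
    """1.0 where the token is padding (id 0), else 0.0."""
    return [1.0 if token == 0 else 0.0 for token in seq]

def look_ahead_mask(size):
    """1.0 above the diagonal: query position i must not see key position j > i."""
    return [[1.0 if j > i else 0.0 for j in range(size)] for i in range(size)]

def combine(pad, look_ahead):
    """Elementwise maximum, with the padding row broadcast over every query row."""
    return [[max(p, la) for p, la in zip(pad, row)] for row in look_ahead]

target = [7, 5, 9, 0]  # toy target sequence whose last token is padding
combined = combine(padding_mask(target), look_ahead_mask(len(target)))
for row in combined:
    print(row)
```

Each row is one query position. A key position is hidden when it lies in the future or is padding, which is why the last column is 1.0 in every row.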

In [None]:
def create_masks(inp, tar):

    # Encoder padding mask
    enc_padding_mask = create_padding_mask(inp)

    # Decoder 2nd Multi-head attention
    dec_padding_mask = create_padding_mask(inp)

    # Decoder 1st Masked Multi-head attention
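    look_ahead_mask = create_look_ahead_mask(size=tf.shape(tar)[1]) # mask that hides future positions, so each position only attends to earlier ones
    dec_target_padding_mask = create_padding_mask(tar) # padding mask of the target
    combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask) # elementwise maximum: a position is 1 if either mask marks it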

    return enc_padding_mask, combined_mask, dec_padding_mask

### Set Parameters and Transformer

In [None]:
num_layers = 4
d_model = 128
dff = 512
num_heads = 8
input_vocab_size = tokenizer_en.vocab_size + 2
output_vocab_size = tokenizer_zh.vocab_size + 2
dropout_rate = 0.1

epochs = 5
transformer = Transformer(num_layers, d_model, num_heads, dff, input_vocab_size, output_vocab_size, dropout_rate)

## Checkpoint

In [None]:
ckpt = tf.train.Checkpoint(transformer = transformer, optimizer = optimizer)

record_params = f'{num_layers}layers_{d_model}d_model_{num_heads}heads_{dff}dff'
checkpoint_path = os.path.join(checkpoint_path, record_params)
log_dir = os.path.join(log_dir, record_params)

# Keep only the 3 most recent checkpoints
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=3)

# If checkpoint_path already holds a trained checkpoint, restore it into ckpt
if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print('Latest checkpoint restored')

## Define training step

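Training uses `Teacher forcing`: the ground-truth target is fed directly to the Decoder. If the Decoder were instead fed its own previous predictions (the `Recursive` approach), a single wrong prediction would pass wrong information to every later step.

Prediction, in contrast, is `AutoRegressive`: each predicted token is fed back in to predict the next one.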
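
The shift that `Teacher forcing` relies on can be shown on a single toy sentence. This is only an illustrative sketch: the token ids, including `BOS` and `EOS`, are made-up values.

```python
BOS, EOS = 100, 101           # hypothetical special-token ids
tar = [BOS, 11, 12, 13, EOS]  # one target sentence, already tokenized

tar_inp = tar[:-1]   # what the Decoder is given:     [BOS, 11, 12, 13]
tar_real = tar[1:]   # what the Decoder must predict: [11, 12, 13, EOS]

# With the look-ahead mask, position i only sees tar_inp[:i + 1]
# and is trained to predict tar_real[i].
for i, expected in enumerate(tar_real):
    print(tar_inp[:i + 1], "->", expected)
```

All positions are trained in one forward pass, and every prefix is the correct one regardless of what the model would have predicted.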

In [None]:
@tf.function(input_signature=(tf.TensorSpec(shape=[None, None], dtype=tf.int64), tf.TensorSpec(shape=[None, None], dtype=tf.int64)))
def train_step(inp, tar):

    # teacher forcing
    tar_inp = tar[:, :-1] # the Decoder's target input does not need <EOS>
    tar_real = tar[:, 1:] # the Decoder's target output does not need <BOS>

    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)


    # Record operations on the tape so that gradients can be computed afterwards
    Training = True
    with tf.GradientTape() as tape:
        predictions, _ = transformer(inp, tar_inp, Training, enc_padding_mask, combined_mask, dec_padding_mask)
        loss = loss_function(tar_real, predictions)

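    # Compute the gradient of the loss with respect to every trainable variable
    gradients = tape.gradient(loss, transformer.trainable_variables)
    # Let Adam update the variables using these gradients
    optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

    # Update the loss and accuracy metrics, which are later written to Tensorboard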
    train_loss(loss)
    train_accuracy(tar_real, predictions)

## Training

Loss and Accuracy are logged with Tensorboard.

In [None]:
# Tensorboard
summary_writer = tf.summary.create_file_writer(logdir=log_dir)

for epoch in range(epochs):
    start = time.time()

    # Reset the Tensorboard metrics at the start of every epoch
    train_loss.reset_states()
    train_accuracy.reset_states()

    # Train on every batch in turn
    for (batch, (inp, tar)) in enumerate(train_dataset):
        train_step(inp, tar)

        if batch % 50 == 0:
            print ('Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1, batch, train_loss.result(), train_accuracy.result()))

    # Save the model every 2 epochs
    if (epoch + 1) % 2 == 0:
        ckpt_save_path = ckpt_manager.save()
        print('Saving checkpoint for epoch {} at {}'.format(epoch+1, ckpt_save_path))

    with summary_writer.as_default():
        tf.summary.scalar('train_loss', train_loss.result(), step=epoch+1)
        tf.summary.scalar('train_acc', train_accuracy.result(), step=epoch+1)

    print('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch+1, train_loss.result(), train_accuracy.result()))
    print('Time taken for 1 epoch: {} secs\n'.format(time.time()-start))


## Evaluate

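When a new sentence is to be translated, it must be preprocessed in the same way as the Encoder input during training:

1. `<BOS>` and `<EOS>` are added before and after the Encoder input.
2. The Decoder predicts in an AutoRegressive manner: its input starts as `<BOS>` alone, the first predicted token is concatenated after `<BOS>`, and the same step is repeated for every following token.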

`<BOS>` => `<BOS> 我` => `<BOS> 我 好 ` => `<BOS> 我 好 帥`
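
The loop behind this sequence can be sketched without any model. In the sketch below, `toy_next_token` is a hypothetical stand-in for the Transformer's prediction at the last position, and the ids chosen for `BOS` and `EOS` are arbitrary.

```python
BOS, EOS = 0, 4  # hypothetical special-token ids

def toy_next_token(prefix):
    """Stand-in for the model: always predicts the last token id plus one."""
    return prefix[-1] + 1

def greedy_decode(next_token, max_length=10):
    """Grow the output one token at a time until EOS or max_length is reached."""
    output = [BOS]
    for _ in range(max_length):
        token = next_token(output)  # prediction for the next position
        if token == EOS:
            break                   # stop without appending EOS
        output.append(token)        # feed the prediction back in on the next pass
    return output

print(greedy_decode(toy_next_token))  # [0, 1, 2, 3]
```

The `max_length` bound matters because a real model is not guaranteed to ever emit `<EOS>`.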

In [None]:
def evaluate(inp_sentence):

    start_token = [tokenizer_en.vocab_size]
    end_token = [tokenizer_en.vocab_size + 1]

    # The Encoder input needs <BOS> and <EOS> added
    inp_sentence = start_token + tokenizer_en.encode(inp_sentence) + end_token
    encoder_input = tf.expand_dims(inp_sentence, axis=0)

    # The Decoder predicts autoregressively: start from <BOS> and, after each forward pass, use the prediction at the last position
    decoder_input = [tokenizer_zh.vocab_size]
    output = tf.expand_dims(decoder_input, axis=0)

    # AutoRegressive
    for i in range(max_length):
        # create mask
        enc_padding_mask, combined_mask, dec_padding_mask = create_masks(encoder_input, output)

        # prediction.shape == (batch_size, seq_len, vocab_size)
        predictions, attention_weights = transformer(encoder_input, output, False, enc_padding_mask, combined_mask, dec_padding_mask)

        # Take the last position as the prediction for the next token
        prediction = predictions[:, -1:, :]

        prediction_id = tf.cast(tf.argmax(prediction, axis = -1), tf.int32)

        # Stop and return output as soon as <EOS> is predicted
        if prediction_id == tokenizer_zh.vocab_size + 1:
            return tf.squeeze(output, axis=0), attention_weights

        output = tf.concat([output, prediction_id], axis = -1)

    return tf.squeeze(output, axis=0), attention_weights

In [None]:
def map_from_pred(pred_tokens):

    pred_tokens = [t for t in pred_tokens if t < tokenizer_zh.vocab_size]
    pred_sentence = tokenizer_zh.decode(pred_tokens)

    return pred_sentence

In [None]:
sentence = 'Taiwan is a beautiful country.'
predicted_seq, attention_weights = evaluate(sentence)
predicted_seq = map_from_pred(predicted_seq)

In [None]:
print('Source sentence:\n',sentence)
print()
print('Predict sentence:\n', predicted_seq)

## Visualization

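We plot the attention weight matrices of the `Decoder`; each `head` has its own matrix. Here we pick the last layer's `decoder_layer4_dec_attention_weights`, that is, the attention that the output sequence pays to the input sequence.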

In [None]:
for key,value in attention_weights.items():
    print(key,':',value.shape)

layer_name = 'decoder_layer4_dec_attention_weights'

In [None]:
!wget -O /usr/share/fonts/truetype/liberation/simhei.ttf "https://www.wfonts.com/download/data/2014/06/01/simhei/chinese.simhei.ttf"
import matplotlib as mpl
zhfont = mpl.font_manager.FontProperties(fname='/usr/share/fonts/truetype/liberation/simhei.ttf')

In [None]:
def plot_attention_weights(attention_weights, sentence, predicted_seq, layer_name):
    fig = plt.figure(figsize=(17, 14))

    sentence = tokenizer_en.encode(sentence)

    attention_weights = tf.squeeze(attention_weights[layer_name], axis=0)
    # (num_heads, tar_seq_len, inp_seq_len)

    # Plot only 4 of the heads
    #attention_weights = attention_weights[4:8,:,:]

    # Plot the attention weights of every head
    for head in range(attention_weights.shape[0]):
        ax = fig.add_subplot(4, 2, head + 1)

        attn_map = np.transpose(attention_weights[head])
        ax.matshow(attn_map, cmap='viridis')  # (inp_seq_len, tar_seq_len)

        ax.set_xticks(range(len(predicted_seq)))
        ax.set_xticklabels(predicted_seq, fontproperties=zhfont)

        ax.set_yticks(range(len(sentence) + 2))
        ax.set_yticklabels(['<start>'] + [tokenizer_en.decode([i]) for i in sentence] + ['<end>'])

        ax.set_xlabel('Head {}'.format(head + 1), fontsize=13)

    plt.tight_layout()
    plt.show()
    plt.close(fig)

In [None]:
import logging
logging.getLogger('matplotlib.font_manager').disabled = True

# plot_attention_weights creates its own figure, so no separate plt.figure call is needed
plot_attention_weights(attention_weights, sentence, predicted_seq, layer_name)