*Transformer* implementation in TensorFlow, applied as an English-to-Chinese (EN to ZH) translator
#### What you can learn from this notebook
1. Word embedding and tokenization
2. Mask mechanism
3. Basic structure of the Transformer and its application as a translator
4. Custom layers and models in TensorFlow
5. Checkpoints and the TensorBoard dashboard


#### Reference:
1. https://leemeng.tw/neural-machine-translation-with-transformer-and-tensorflow2.html#top
2. https://www.tensorflow.org/text/tutorials/transformer


In [1]:
import os
import time
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from pprint import pprint
from IPython.display import clear_output

In [2]:
!pip install tensorflow-gpu==2.0.0-beta0
clear_output()

import tensorflow as tf
import tensorflow_datasets as tfds
print(tf.__version__)

2.12.0


In [3]:
import logging
logging.basicConfig(level=logging.ERROR)  # level names must be upper case; the constant avoids a ValueError

np.set_printoptions(suppress=True)

In [4]:
# Set up the output directory and file paths
output_dir = "nmt"
en_vocab_file = os.path.join(output_dir, "en_vocab")
zh_vocab_file = os.path.join(output_dir, "zh_vocab")
checkpoint_path = os.path.join(output_dir, "checkpoints")
log_dir = os.path.join(output_dir, 'logs')
download_dir = "tensorflow-datasets/downloads"

os.makedirs(output_dir, exist_ok=True)

In [5]:
# Inspect the corpora available in the WMT19 zh-en dataset
tmp_builder = tfds.builder("wmt19_translate/zh-en")
pprint(tmp_builder.subsets)

{Split('train'): ['newscommentary_v14',
                  'wikititles_v1',
                  'uncorpus_v1',
                  'casia2015',
                  'casict2011',
                  'casict2015',
                  'datum2015',
                  'datum2017',
                  'neu2017'],
 Split('validation'): ['newstest2018']}


In [6]:
# Download only the newscommentary_v14 subset through tfds.builder
config = tfds.translate.wmt.WmtConfig(
  version=tfds.core.Version('0.0.3'),
  language_pair=("zh", "en"),
  subsets={
    tfds.Split.TRAIN: ["newscommentary_v14"]
  }
)
builder = tfds.builder("wmt_translate", config=config)
builder.download_and_prepare(download_dir=download_dir)
clear_output()

In [7]:
# Turn the builder into tf.data datasets: the first 20% for training, the next 1% for validation, the rest for testing
examples = builder.as_dataset(split=['train[:20%]','train[20%:21%]','train[21%:]'], as_supervised=True)

In [8]:
# The test split is not used in this notebook.
train_examples, val_examples, _ = examples
print(train_examples)
print(val_examples)

<_PrefetchDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None))>
<_PrefetchDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.string, name=None))>


In [9]:
for en, zh in train_examples.take(3):
  print(en)
  print(zh)
  print('-' * 10)

tf.Tensor(b'The fear is real and visceral, and politicians ignore it at their peril.', shape=(), dtype=string)
tf.Tensor(b'\xe8\xbf\x99\xe7\xa7\x8d\xe6\x81\x90\xe6\x83\xa7\xe6\x98\xaf\xe7\x9c\x9f\xe5\xae\x9e\xe8\x80\x8c\xe5\x86\x85\xe5\x9c\xa8\xe7\x9a\x84\xe3\x80\x82 \xe5\xbf\xbd\xe8\xa7\x86\xe5\xae\x83\xe7\x9a\x84\xe6\x94\xbf\xe6\xb2\xbb\xe5\xae\xb6\xe4\xbb\xac\xe5\x89\x8d\xe9\x80\x94\xe5\xa0\xaa\xe5\xbf\xa7\xe3\x80\x82', shape=(), dtype=string)
----------
tf.Tensor(b'In fact, the German political landscape needs nothing more than a truly liberal party, in the US sense of the word \xe2\x80\x9cliberal\xe2\x80\x9d \xe2\x80\x93 a champion of the cause of individual freedom.', shape=(), dtype=string)
tf.Tensor(b'\xe4\xba\x8b\xe5\xae\x9e\xe4\xb8\x8a\xef\xbc\x8c\xe5\xbe\xb7\xe5\x9b\xbd\xe6\x94\xbf\xe6\xb2\xbb\xe5\xb1\x80\xe5\x8a\xbf\xe9\x9c\x80\xe8\xa6\x81\xe7\x9a\x84\xe4\xb8\x8d\xe8\xbf\x87\xe6\x98\xaf\xe4\xb8\x80\xe4\xb8\xaa\xe7\xac\xa6\xe5\x90\x88\xe7\xbe\x8e\xe5\x9b\xbd\xe6\x89\x80\xe8\

In [10]:
sample_examples = []
num_samples = 10

for en_t, zh_t in train_examples.take(num_samples):
  en = en_t.numpy().decode("utf-8")
  zh = zh_t.numpy().decode("utf-8")

  print(en)
  print(zh)
  print('-' * 10)


  sample_examples.append((en, zh))

The fear is real and visceral, and politicians ignore it at their peril.
这种恐惧是真实而内在的。 忽视它的政治家们前途堪忧。
----------
In fact, the German political landscape needs nothing more than a truly liberal party, in the US sense of the word “liberal” – a champion of the cause of individual freedom.
事实上，德国政治局势需要的不过是一个符合美国所谓“自由”定义的真正的自由党派，也就是个人自由事业的倡导者。
----------
Shifting to renewable-energy sources will require enormous effort and major infrastructure investment.
必须付出巨大的努力和基础设施投资才能完成向可再生能源的过渡。
----------
In this sense, it is critical to recognize the fundamental difference between “urban villages” and their rural counterparts.
在这方面，关键在于认识到“城市村落”和农村村落之间的根本区别。
----------
A strong European voice, such as Nicolas Sarkozy’s during the French presidency of the EU, may make a difference, but only for six months, and at the cost of reinforcing other European countries’ nationalist feelings in reaction to the expression of “Gallic pride.”
法国担任轮值主席国期间尼古拉·萨科奇统一的欧洲声音可能让人耳目一新，但这种声音却只持续了短短六个月，而且付出了让其他欧洲国家在面对“高卢人的骄

Word tokenization
1. Scan through the examples and build a dictionary (vocabulary) of tokens.
2. Add BOS (beginning of sentence) and EOS (end of sentence) tokens to every sentence.
3. Set a maximum sentence length.
4. Pad every sentence in a batch to the same length.
5. Represent each token by its index in the dictionary, then map the indices to low-dimensional vectors with an embedding layer.
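
Steps 2 to 4 can be sketched in plain Python before looking at the `tf.data` version. This is only an illustration: the ids `PAD_ID`, `BOS_ID`, `EOS_ID` and the helper names are hypothetical, whereas the notebook derives the BOS and EOS indices from the dictionary size.

```python
# Hypothetical special ids for illustration only.
PAD_ID, BOS_ID, EOS_ID = 0, 100, 101

def add_bos_eos_ids(ids):
    """Wrap a token-id sequence with BOS and EOS markers."""
    return [BOS_ID] + list(ids) + [EOS_ID]

def within_limit(src, tgt, max_length):
    """Keep a pair only if both sequences fit the length limit."""
    return len(src) <= max_length and len(tgt) <= max_length

def pad_batch(seqs):
    """Right-pad every sequence with PAD_ID to the longest length in the batch."""
    longest = max(len(s) for s in seqs)
    return [s + [PAD_ID] * (longest - len(s)) for s in seqs]

pairs = [([5, 6], [7]), ([8, 9, 10, 11, 12], [13, 14])]
wrapped = [(add_bos_eos_ids(s), add_bos_eos_ids(t)) for s, t in pairs]
kept = [(s, t) for s, t in wrapped if within_limit(s, t, max_length=6)]
src_batch = pad_batch([s for s, _ in wrapped])

print(len(kept))    # 1: the second pair has a source of 7 tokens
print(src_batch)    # [[100, 5, 6, 101, 0, 0, 0], [100, 8, 9, 10, 11, 12, 101]]
```

Padding to the longest sequence in the batch, rather than to a global maximum, keeps batches of short sentences small.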


In [11]:
%%time
# Build (or load) the English dictionary stored at en_vocab_file (nmt/en_vocab).
# English is tokenized into subwords, a granularity between single characters and whole words.
try:
  subword_encoder_en = tfds.deprecated.text.SubwordTextEncoder.load_from_file(en_vocab_file)
  print(f"Loaded existing dictionary: {en_vocab_file}")
except Exception:  # the dictionary file does not exist yet
  print("No dictionary in the file path, create it.")
  subword_encoder_en = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
      (en.numpy() for en, _ in train_examples),
      target_vocab_size=2**13)  # The target dictionary size is adjustable.
  # Subwords serve as the tokens.
  # Save the dictionary to the target file path.
  subword_encoder_en.save_to_file(en_vocab_file)


print(f"size of dictionary：{subword_encoder_en.vocab_size}")
print(f"first 10 subwords：{subword_encoder_en.subwords[:10]}")
print()

No dictionary in the file path, create it.
size of dictionary：8113
first 10 subwords：[', ', 'the_', 'of_', 'to_', 'and_', 's_', 'in_', 'a_', 'is_', 'that_']

CPU times: user 1min 25s, sys: 3.69 s, total: 1min 28s
Wall time: 1min 21s


In [12]:
sample_string = 'Taiwan is beautiful.'
indices = subword_encoder_en.encode(sample_string)
indices

[3461, 7889, 9, 3502, 4379, 1134, 7903]

In [13]:
print("{0:10}{1:6}".format("Index", "Subword"))
print("-" * 15)
for idx in indices:
  subword = subword_encoder_en.decode([idx])
  print('{0:5}{1:6}'.format(idx, ' ' * 5 + subword))

Index     Subword
---------------
 3461     Taiwan
 7889      
    9     is 
 3502     bea
 4379     uti
 1134     ful
 7903     .
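
The split of "beautiful" into `bea`, `uti`, `ful` can be imitated with a toy greedy longest-match splitter. This is a sketch under assumptions, not the actual `SubwordTextEncoder` algorithm (which also handles whitespace markers and escaping); the function name and the tiny vocabulary are hypothetical.

```python
def greedy_subword_split(word, vocab, max_len=None):
    """Split `word` left to right, always taking the longest piece found in `vocab`.

    Characters that match nothing fall back to single-character pieces.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word) if max_len is None else min(len(word), start + max_len)
        # Shrink the candidate until it is a known subword.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:        # nothing matched: emit one character
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

toy_vocab = {"bea", "uti", "ful", "be", "a"}
print(greedy_subword_split("beautiful", toy_vocab))       # ['bea', 'uti', 'ful']
print(greedy_subword_split("政治家", set(), max_len=1))   # ['政', '治', '家']
```

The second call mirrors the effect of limiting the subword length to 1: every Chinese character becomes its own token.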


In [14]:
%%time
try:
  subword_encoder_zh = tfds.deprecated.text.SubwordTextEncoder.load_from_file(zh_vocab_file)
  print(f"Loaded existing dictionary: {zh_vocab_file}")
except Exception:  # the dictionary file does not exist yet
  print("No dictionary in the file path, create it.")
  subword_encoder_zh = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
      (zh.numpy() for _, zh in train_examples),
      target_vocab_size=2**13,
      max_subword_length=1)  # Each Chinese character is one token.

  subword_encoder_zh.save_to_file(zh_vocab_file)

print(f"size of dictionary：{subword_encoder_zh.vocab_size}")
print(f"first 10 subwords：{subword_encoder_zh.subwords[:10]}")
print()

No dictionary in the file path, create it.
size of dictionary：4205
first 10 subwords：['的', '，', '。', '国', '在', '是', '一', '和', '不', '这']

CPU times: user 6min 47s, sys: 2.72 s, total: 6min 50s
Wall time: 6min 49s


In [15]:
sample_string = sample_examples[0][1]
indices = subword_encoder_zh.encode(sample_string)
print(sample_string)
print(indices)

这种恐惧是真实而内在的。 忽视它的政治家们前途堪忧。
[10, 151, 574, 1298, 6, 374, 55, 29, 193, 5, 1, 3, 3981, 931, 431, 125, 1, 17, 124, 33, 20, 97, 1089, 1247, 861, 3]


In [16]:
en = "The eurozone’s collapse forces a major realignment of European politics."
zh = "欧元区的瓦解强迫欧洲政治进行一次重大改组。"

# Convert the text into subword indices
en_indices = subword_encoder_en.encode(en)
zh_indices = subword_encoder_zh.encode(zh)

print("[English and Chinese text] (before conversion)")
print(en)
print(zh)
print()
print('-' * 20)
print()
print("[English and Chinese index sequences] (after conversion)")
print(en_indices)
print(zh_indices)

[English and Chinese text] (before conversion)
The eurozone’s collapse forces a major realignment of European politics.
欧元区的瓦解强迫欧洲政治进行一次重大改组。

--------------------

[English and Chinese index sequences] (after conversion)
[16, 900, 11, 6, 1527, 874, 8, 230, 2259, 2728, 239, 3, 89, 1236, 7903]
[44, 202, 168, 1, 852, 201, 231, 592, 44, 87, 17, 124, 106, 38, 7, 279, 86, 18, 212, 265, 3]


In [17]:
def add_bos_eos(en_t, zh_t):
  # This function is mapped over the dataset.
  # It adds a beginning-of-sentence (BOS) and an end-of-sentence (EOS) index to each sequence.
  # The encoder only produces indices below vocab_size, so vocab_size is free to serve as the BOS index
  # and vocab_size + 1 as the EOS index.
  en_indices = [subword_encoder_en.vocab_size] + subword_encoder_en.encode(
      en_t.numpy()) + [subword_encoder_en.vocab_size + 1]
  # Same for Chinese
  zh_indices = [subword_encoder_zh.vocab_size] + subword_encoder_zh.encode(
      zh_t.numpy()) + [subword_encoder_zh.vocab_size + 1]

  return en_indices, zh_indices
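
Because BOS and EOS sit outside the encoder's own index range, they (and the padding index 0) have to be removed before a sequence is turned back into text. A minimal sketch, assuming the Chinese dictionary size of 4205 reported above; `strip_special_ids` is a hypothetical helper and the id sequence is a made-up padded example.

```python
def strip_special_ids(ids, vocab_size):
    """Drop padding (0), BOS (vocab_size) and EOS (vocab_size + 1) ids."""
    return [i for i in ids if 0 < i < vocab_size]

vocab_size = 4205  # Chinese dictionary size reported above
padded = [4205, 10, 241, 86, 27, 3, 4206, 0, 0, 0]   # BOS, tokens, EOS, padding
clean = strip_special_ids(padded, vocab_size)
print(clean)  # [10, 241, 86, 27, 3]
```

The cleaned list is what would then be passed to `subword_encoder_zh.decode`.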

In [18]:
en_t, zh_t = next(iter(train_examples))
en_indices, zh_indices = add_bos_eos(en_t, zh_t)
print('English BOS index:', subword_encoder_en.vocab_size)
print('English EOS index:', subword_encoder_en.vocab_size + 1)
print('Chinese BOS index:', subword_encoder_zh.vocab_size)
print('Chinese EOS index:', subword_encoder_zh.vocab_size + 1)

print('\nInput: 2 tensors')
pprint((en_t, zh_t))
print('-' * 15)
print('Output: 2 index sequences')
pprint((en_indices, zh_indices))

English BOS index: 8113
English EOS index: 8114
Chinese BOS index: 4205
Chinese EOS index: 4206

Input: 2 tensors
(<tf.Tensor: shape=(), dtype=string, numpy=b'The fear is real and visceral, and politicians ignore it at their peril.'>,
 <tf.Tensor: shape=(), dtype=string, numpy=b'\xe8\xbf\x99\xe7\xa7\x8d\xe6\x81\x90\xe6\x83\xa7\xe6\x98\xaf\xe7\x9c\x9f\xe5\xae\x9e\xe8\x80\x8c\xe5\x86\x85\xe5\x9c\xa8\xe7\x9a\x84\xe3\x80\x82 \xe5\xbf\xbd\xe8\xa7\x86\xe5\xae\x83\xe7\x9a\x84\xe6\x94\xbf\xe6\xb2\xbb\xe5\xae\xb6\xe4\xbb\xac\xe5\x89\x8d\xe9\x80\x94\xe5\xa0\xaa\xe5\xbf\xa7\xe3\x80\x82'>)
---------------
Output: 2 index sequences
([8113,
  16,
  1284,
  9,
  243,
  5,
  1275,
  1756,
  156,
  1,
  5,
  1016,
  5566,
  21,
  38,
  33,
  2982,
  7965,
  7903,
  8114],
 [4205,
  10,
  151,
  574,
  1298,
  6,
  374,
  55,
  29,
  193,
  5,
  1,
  3,
  3981,
  931,
  431,
  125,
  1,
  17,
  124,
  33,
  20,
  97,
  1089,
  1247,
  861,
  3,
  4206])


In [19]:
def tf_add_bos_eos(en_t, zh_t):
  # Inside Dataset.map the function runs in graph mode, so en_t and zh_t are symbolic tensors
  # without a .numpy() method. Wrapping add_bos_eos in tf.py_function lets it run eagerly.
  # The returned index sequences use `tf.int64`.
  return tf.py_function(add_bos_eos, [en_t, zh_t], [tf.int64, tf.int64])

# `tmp_dataset` is only used to demonstrate the function
tmp_dataset = train_examples.map(tf_add_bos_eos)
en_indices, zh_indices = next(iter(tmp_dataset))
print(en_indices)
print(zh_indices)

tf.Tensor(
[8113   16 1284    9  243    5 1275 1756  156    1    5 1016 5566   21
   38   33 2982 7965 7903 8114], shape=(20,), dtype=int64)
tf.Tensor(
[4205   10  151  574 1298    6  374   55   29  193    5    1    3 3981
  931  431  125    1   17  124   33   20   97 1089 1247  861    3 4206], shape=(28,), dtype=int64)


In [20]:
MAX_LENGTH = 40

def filter_max_length(en, zh, max_length=MAX_LENGTH):
  # en and zh are the index sequences of one example
  return tf.logical_and(tf.size(en) <= max_length,
                        tf.size(zh) <= max_length)

# tf.data.Dataset.filter(func) keeps only the examples for which func returns True
train_dataset = tmp_dataset.filter(filter_max_length)

In [21]:

num_examples = 0
for en_indices, zh_indices in train_dataset:
  cond1 = len(en_indices) <= MAX_LENGTH
  cond2 = len(zh_indices) <= MAX_LENGTH
  assert cond1 and cond2
  num_examples += 1

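print(f"All English and Chinese sequences are at most {MAX_LENGTH} tokens long")
print(f"The training dataset contains {num_examples} examples")

All English and Chinese sequences are at most 40 tokens long
The training dataset contains 29784 examples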


In [22]:
BATCH_SIZE = 64
# Pad the sentences in each batch to the same length.
tmp_dataset = tmp_dataset.padded_batch(BATCH_SIZE, padded_shapes=([-1], [-1]))
en_batch, zh_batch = next(iter(tmp_dataset))
print("Batch of English index sequences")
print(en_batch)
print('-' * 20)
print("Batch of Chinese index sequences")
print(zh_batch)

Batch of English index sequences
tf.Tensor(
[[8113   16 1284 ...    0    0    0]
 [8113   44  369 ...    0    0    0]
 [8113 1894 1302 ...    0    0    0]
 ...
 [8113 1668    1 ... 4024 7903 8114]
 [8113 5751 1538 ...    0    0    0]
 [8113 1809 5706 ...    0    0    0]], shape=(64, 71), dtype=int64)
--------------------
Batch of Chinese index sequences
tf.Tensor(
[[4205   10  151 ...    0    0    0]
 [4205  109   55 ...    0    0    0]
 [4205  206  275 ...    0    0    0]
 ...
 [4205   73   76 ...    0    0    0]
 [4205    5  115 ...    0    0    0]
 [4205    9  270 ...    0    0    0]], shape=(64, 116), dtype=int64)


In [23]:
MAX_LENGTH = 40
BATCH_SIZE = 128
BUFFER_SIZE = 15000

# Training set
train_dataset = (train_examples  # input: (en, zh) strings; output: (en, zh) index sequences
                 .map(tf_add_bos_eos)  # add BOS and EOS
                 .filter(filter_max_length)  # keep pairs with length <= MAX_LENGTH
                 .cache()  # cache the preprocessed examples to speed up later epochs
                 .shuffle(BUFFER_SIZE)  # shuffle the examples
                 .padded_batch(BATCH_SIZE, padded_shapes=([-1], [-1]))  # pad each batch to its longest sequence
                 .prefetch(tf.data.experimental.AUTOTUNE))  # prefetch batches to speed up training
# Validation set
val_dataset = (val_examples
               .map(tf_add_bos_eos)
               .filter(filter_max_length)
               .padded_batch(BATCH_SIZE, padded_shapes=([-1], [-1])))

In [24]:
en_batch, zh_batch = next(iter(train_dataset))
print("Batch of English index sequences")
print(en_batch)
print('-' * 20)
print("Batch of Chinese index sequences")
print(zh_batch)

Batch of English index sequences
tf.Tensor(
[[8113  819 7902 ...    0    0    0]
 [8113  122   84 ...    0    0    0]
 [8113 2273 6970 ...    0    0    0]
 ...
 [8113  696  283 ...    0    0    0]
 [8113 4078   25 ...    0    0    0]
 [8113   16   65 ...    0    0    0]], shape=(128, 38), dtype=int64)
--------------------
Batch of Chinese index sequences
tf.Tensor(
[[4205  104   97 ...    0    0    0]
 [4205   16    4 ...    0    0    0]
 [4205  132   45 ...    0    0    0]
 ...
 [4205  313    4 ...    0    0    0]
 [4205  313   48 ...  766    3 4206]
 [4205   24  178 ...    0    0    0]], shape=(128, 40), dtype=int64)


In [25]:
demo_examples = [
    ("It is important.", "这很重要。"),
    ("The numbers speak for themselves.", "数字证明了一切。"),
]
pprint(demo_examples)

[('It is important.', '这很重要。'),
 ('The numbers speak for themselves.', '数字证明了一切。')]


In [26]:
batch_size = 2
demo_examples = tf.data.Dataset.from_tensor_slices((
    [en for en, _ in demo_examples], [zh for _, zh in demo_examples]
))

# Convert the two sentences into sequences of subwords with the dictionaries built earlier
# and add the padding token <pad> so that all sentences in the batch have the same length
demo_dataset = demo_examples.map(tf_add_bos_eos)\
  .padded_batch(batch_size, padded_shapes=([-1], [-1]))

# Take the only batch in this demo dataset
inp, tar = next(iter(demo_dataset))
print('inp:', inp)
print()
print('tar:', tar)

inp: tf.Tensor(
[[8113  103    9 1066 7903 8114    0    0]
 [8113   16 4111 6735   12 2750 7903 8114]], shape=(2, 8), dtype=int64)

tar: tf.Tensor(
[[4205   10  241   86   27    3 4206    0    0    0]
 [4205  165  489  398  191   14    7  560    3 4206]], shape=(2, 10), dtype=int64)


In [27]:
# + 2 because we added the extra <start> and <end> tokens
vocab_size_en = subword_encoder_en.vocab_size + 2
vocab_size_zh = subword_encoder_zh.vocab_size + 2

# For the demo, map the vocabulary into a 4-dimensional embedding space
d_model = 4
embedding_layer_en = tf.keras.layers.Embedding(vocab_size_en, d_model)
embedding_layer_zh = tf.keras.layers.Embedding(vocab_size_zh, d_model)

emb_inp = embedding_layer_en(inp)
emb_tar = embedding_layer_zh(tar)
emb_inp, emb_tar

# emb_inp has shape (2, 8, 4): 2 sentences, 8 tokens, 4 embedding dimensions.
# d_model = 4 is for demonstration only.

(<tf.Tensor: shape=(2, 8, 4), dtype=float32, numpy=
 array([[[ 0.01136609,  0.044322  , -0.03268271, -0.04236697],
         [ 0.01800618,  0.02187547,  0.02848741, -0.03544848],
         [ 0.04529072,  0.00125663, -0.03760185, -0.0222937 ],
         [ 0.0225729 ,  0.02911006, -0.03348683, -0.0471466 ],
         [ 0.01374676, -0.00774624,  0.00542571,  0.02360238],
         [ 0.01449658,  0.03457392,  0.0183125 ,  0.00347707],
         [ 0.02398122,  0.01899682,  0.02353826,  0.04027948],
         [ 0.02398122,  0.01899682,  0.02353826,  0.04027948]],
 
        [[ 0.01136609,  0.044322  , -0.03268271, -0.04236697],
         [-0.04973317,  0.02657061,  0.02959832, -0.02706842],
         [ 0.02889854, -0.04722495,  0.02665036, -0.04851956],
         [ 0.00775899, -0.00516289, -0.0149469 , -0.03300314],
         [ 0.04174921, -0.0429981 ,  0.03316802,  0.02649014],
         [ 0.03921003, -0.00368743, -0.0483771 , -0.02522649],
         [ 0.01374676, -0.00774624,  0.00542571,  0.02360238],
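
An embedding layer is essentially a lookup table with one row per token id. The NumPy sketch below illustrates that idea with toy sizes and a randomly initialized table; the sizes, ids, and variable names are assumptions chosen for illustration, not values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4      # toy sizes, chosen for illustration
table = rng.uniform(-0.05, 0.05, size=(vocab_size, d_model))  # one row per token id

ids = np.array([[7, 3, 0, 0],
                [7, 5, 2, 9]])   # (batch, seq_len); 0 plays the role of <pad>
embedded = table[ids]            # integer indexing gives (batch, seq_len, d_model)

print(embedded.shape)            # (2, 4, 4)
# The same id always maps to the same row, so both <pad> positions are identical.
print(np.array_equal(embedded[0, 2], embedded[0, 3]))   # True
```

This is why the two padded positions at the end of the first sentence show identical vectors in the output above.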


# Attention
1. Embeddings as input
2. Padding mask and look-ahead mask
3. Self-attention with q, k, v
4. Splitting heads
5. A custom multi-head attention layer

In [28]:
def create_padding_mask(seq):
  # The padding mask marks every position whose index is 0 (<pad>) with 1
  mask = tf.cast(tf.equal(seq, 0), tf.float32)
  return mask[:, tf.newaxis, tf.newaxis, :]  # extra axes so the mask broadcasts over heads and query positions

inp_mask = create_padding_mask(inp)
inp_mask

<tf.Tensor: shape=(2, 1, 1, 8), dtype=float32, numpy=
array([[[[0., 0., 0., 0., 0., 0., 1., 1.]]],


       [[[0., 0., 0., 0., 0., 0., 0., 0.]]]], dtype=float32)>

In [29]:
print("inp:", inp)
print("-" * 20)
print("tf.squeeze(inp_mask):", tf.squeeze(inp_mask))

inp: tf.Tensor(
[[8113  103    9 1066 7903 8114    0    0]
 [8113   16 4111 6735   12 2750 7903 8114]], shape=(2, 8), dtype=int64)
--------------------
tf.squeeze(inp_mask): tf.Tensor(
[[0. 0. 0. 0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0.]], shape=(2, 8), dtype=float32)


In [30]:
# Set a seed so that we get the same random result every time
tf.random.set_seed(9527)

# Self-attention: both the query `q` and the key `k` are `emb_inp`
q = emb_inp
k = emb_inp
# Generate a random binary tensor with the same shape as `emb_inp`
v = tf.cast(tf.math.greater(tf.random.uniform(shape=emb_inp.shape), 0.5), tf.float32)
v

<tf.Tensor: shape=(2, 8, 4), dtype=float32, numpy=
array([[[1., 0., 0., 0.],
        [0., 1., 0., 1.],
        [0., 0., 0., 1.],
        [1., 0., 1., 0.],
        [1., 0., 1., 0.],
        [0., 1., 0., 1.],
        [0., 0., 1., 0.],
        [0., 1., 0., 1.]],

       [[1., 0., 1., 1.],
        [1., 0., 1., 0.],
        [1., 0., 0., 0.],
        [1., 0., 1., 0.],
        [0., 1., 0., 1.],
        [1., 1., 1., 1.],
        [0., 0., 0., 0.],
        [0., 0., 1., 0.]]], dtype=float32)>

In [31]:
def scaled_dot_product_attention(q, k, v, mask):
  """Calculate the attention weights.
  q, k, v must have matching leading dimensions.
  k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
  The mask has different shapes depending on its type(padding or look ahead)
  but it must be broadcastable for addition.

  Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable
          to (..., seq_len_q, seq_len_k), or None for no masking.

  Returns:
    output, attention_weights
  """
  # Take the dot product of `q` and `k`, then scale it
  # tf.matmul multiplies over the last two dimensions, like np.dot on 2D arrays
  matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

  dk = tf.cast(tf.shape(k)[-1], tf.float32)  # depth of the key vectors (last dimension of `k`)
  scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)  # scale by sqrt(dk)

  # Add the mask to the logits before they go into softmax.
  # Broadcasting: with logits of shape (2, 8, 8) and a mask of shape (2, 1, 8), the mask is expanded automatically to (2, 8, 8).
  if mask is not None:
    scaled_attention_logits += (mask * -1e9)

  # Softmax yields weights that sum to 1, used below for a weighted average of `v`
  attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

  # Weighted average of `v` using the attention weights
  output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

  return output, attention_weights
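
For readers who prefer to check the arithmetic outside TensorFlow, the same computation can be sketched in NumPy. This is an illustrative re-implementation, not the notebook's function; the names `softmax` and `attention` and the random inputs are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention over the last two axes."""
    d_k = k.shape[-1]
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)   # (..., seq_len_q, seq_len_k)
    if mask is not None:
        logits = logits + mask * -1e9                    # masked positions get ~zero weight
    weights = softmax(logits, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
q = k = rng.normal(size=(2, 8, 4))
v = rng.normal(size=(2, 8, 4))
out, w = attention(q, k, v)
print(out.shape, w.shape)   # (2, 8, 4) (2, 8, 8)
```

Each row of `w` sums to 1, so every output vector is a weighted average of the rows of `v`.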

In [32]:
mask = None
output, attention_weights = scaled_dot_product_attention(q, k, v, mask)
print("output:", output)
print("-" * 20)
print("attention_weights:", attention_weights)

# The output keeps the shape (2, 8, 4): 2 sentences, 8 tokens per sentence, 4 embedding dimensions. Each token vector is now a context-aware weighted average of the value vectors.

output: tf.Tensor(
[[[0.37525803 0.3748267  0.3748631  0.49991256]
  [0.37499547 0.3750877  0.37489206 0.50006086]
  [0.37516123 0.37477538 0.3749477  0.49995202]
  [0.37525916 0.37479597 0.37486964 0.49992344]
  [0.37488726 0.375044   0.37505132 0.5000353 ]
  [0.37491676 0.37511194 0.37495577 0.5000463 ]
  [0.3747647  0.3751691  0.37504435 0.50008386]
  [0.3747647  0.3751691  0.37504435 0.50008386]]

 [[0.625389   0.24986553 0.6254846  0.37513024]
  [0.6251934  0.2496649  0.62527066 0.37472647]
  [0.625118   0.25010508 0.6246787  0.37499398]
  [0.625233   0.24999358 0.62510014 0.37506652]
  [0.6246109  0.2502783  0.6244222  0.37503648]
  [0.62523335 0.25015193 0.6251225  0.37528455]
  [0.6248029  0.2501094  0.6248248  0.3750344 ]
  [0.6249275  0.24993345 0.625093   0.37498197]]], shape=(2, 8, 4), dtype=float32)
--------------------
attention_weights: tf.Tensor(
[[[0.12522437 0.12502344 0.12508589 0.12520446 0.12482921 0.12497384
   0.12482941 0.12482941]
  [0.12504709 0.12511751 0.124

In [33]:
# This time, pass the padding mask into the attention function and observe
# how the attention weights change
mask = tf.squeeze(inp_mask, axis=1) # (batch_size, 1, seq_len_k)
_, attention_weights = scaled_dot_product_attention(q, k, v, mask)
print("attention_weights:", attention_weights)

# attention_weights has shape (2, 8, 8). In the first sentence the two <pad> positions now receive weight 0, and the remaining six weights in each row still sum to 1.

attention_weights: tf.Tensor(
[[[0.1668899  0.16662212 0.16670536 0.16686337 0.16636327 0.16655602
   0.         0.        ]
  [0.16670443 0.1667983  0.16660586 0.16670573 0.16650875 0.166677
   0.         0.        ]
  [0.16675007 0.16656826 0.1668518  0.16680221 0.16651168 0.16651596
   0.         0.        ]
  [0.16686136 0.1666214  0.16675547 0.16686656 0.16637413 0.16652112
   0.         0.        ]
  [0.16658556 0.16664891 0.16668947 0.16659845 0.16676891 0.16670866
   0.         0.        ]
  [0.16668034 0.16671903 0.16659558 0.16664743 0.1666105  0.16674715
   0.         0.        ]
  [0.16656296 0.16668387 0.16662028 0.16654366 0.16678149 0.16680771
   0.         0.        ]
  [0.16656296 0.16668387 0.16662028 0.16654366 0.16678149 0.16680771
   0.         0.        ]]

 [[0.1252647  0.12500411 0.12491839 0.12506378 0.12472758 0.12513795
   0.12486941 0.12501408]
  [0.12506159 0.12531172 0.12497521 0.12500757 0.12482755 0.12483736
   0.1249266  0.12505244]
  [0.12488889 0.1248
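
The values of roughly 0.1667 above follow directly from the softmax. With untrained embeddings the scores are nearly equal, so the idealized case of identical scores is a fair approximation; the sketch below assumes exactly equal scores and a hypothetical one-row mask.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.zeros(8)                                           # idealized: all scores equal
mask = np.array([0, 0, 0, 0, 0, 0, 1, 1], dtype=np.float64)    # last two positions are <pad>

unmasked = softmax(logits)                 # eight weights of 1/8
masked = softmax(logits + mask * -1e9)     # six weights of 1/6, then two zeros
print(unmasked)
print(masked)
```

Adding a large negative number before the softmax, rather than zeroing weights afterwards, keeps each row normalized without an extra rescaling step.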

In [34]:
# Build a 2D matrix of shape (size, size)
# whose masked region (value 1) is the upper-right triangle
def create_look_ahead_mask(size):
  mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
  return mask  # (seq_len, seq_len)

seq_len = emb_tar.shape[1] # note that this time we use the Chinese embedding tensor `emb_tar`
look_ahead_mask = create_look_ahead_mask(seq_len)
print("emb_tar:", emb_tar)
print("-" * 20)
print("look_ahead_mask", look_ahead_mask)

emb_tar: tf.Tensor(
[[[-0.03277655 -0.00745342 -0.00683656 -0.02574356]
  [ 0.01696173  0.00757234  0.00941087 -0.03724331]
  [ 0.04189566  0.03261018 -0.02499024  0.00189238]
  [ 0.02182115  0.01020544 -0.00277179  0.04719411]
  [ 0.03758576  0.01851412 -0.00444709  0.04633282]
  [-0.00801369 -0.02775562  0.00632982  0.01889359]
  [-0.04102479 -0.0102158   0.01949997 -0.00985195]
  [-0.01700247  0.02699344 -0.04914365  0.02938807]
  [-0.01700247  0.02699344 -0.04914365  0.02938807]
  [-0.01700247  0.02699344 -0.04914365  0.02938807]]

 [[-0.03277655 -0.00745342 -0.00683656 -0.02574356]
  [-0.01273079 -0.01638271  0.04358082 -0.0011729 ]
  [-0.04276529 -0.03356844  0.04649569  0.02694151]
  [-0.03918862  0.02449466  0.01530714 -0.0251611 ]
  [-0.00469106 -0.00384055 -0.01679001  0.03942842]
  [ 0.00328831  0.02432671  0.04318819  0.01971428]
  [ 0.03876802 -0.01617848 -0.02781192  0.02414567]
  [-0.04903276  0.0202435   0.04282154 -0.03306665]
  [-0.00801369 -0.02775562  0.00632982  0.

In [35]:
# Use the batch in the target language (Chinese)
# to simulate what the Decoder processes
temp_q = temp_k = emb_tar
temp_v = tf.cast(tf.math.greater(
    tf.random.uniform(shape=emb_tar.shape), 0.5), tf.float32)

# Pass look_ahead_mask into the attention function
_, attention_weights = scaled_dot_product_attention(
    temp_q, temp_k, temp_v, look_ahead_mask)

print("attention_weights:", attention_weights)

attention_weights: tf.Tensor(
[[[1.         0.         0.         0.         0.         0.
   0.         0.         0.         0.        ]
  [0.4998077  0.50019234 0.         0.         0.         0.
   0.         0.         0.         0.        ]
  [0.33293968 0.33329713 0.3337632  0.         0.         0.
   0.         0.         0.         0.        ]
  [0.24972358 0.24980487 0.25014758 0.25032395 0.         0.
   0.         0.         0.         0.        ]
  [0.19962725 0.19978128 0.20011787 0.20020105 0.20027252 0.
   0.         0.         0.         0.        ]
  [0.16667183 0.16659434 0.16656327 0.16671151 0.16667953 0.16677952
   0.         0.         0.         0.        ]
  [0.14298986 0.14286381 0.14269711 0.14277129 0.14271735 0.14291899
   0.14304161 0.         0.         0.        ]
  [0.12495127 0.12485283 0.12504603 0.12504452 0.12504536 0.12493227
   0.12490371 0.12522402 0.         0.        ]
  [0.11104568 0.1109582  0.11112989 0.11112856 0.1111293  0.11102879
   0.
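
The staircase pattern above (1, then about 0.5, then about 0.333, and so on) can be reproduced with a few lines of NumPy. This is a sketch with idealized equal scores; `look_ahead_mask` is a hypothetical NumPy counterpart of the notebook's function, built with `np.triu` instead of `tf.linalg.band_part`.

```python
import numpy as np

def look_ahead_mask(size):
    """1 above the diagonal (future positions), 0 on and below it."""
    return np.triu(np.ones((size, size)), k=1)

mask = look_ahead_mask(4)
print(mask)

# With equal scores, row i spreads its weight evenly over positions 0..i.
logits = np.zeros((4, 4)) + mask * -1e9
exp = np.exp(logits)
weights = exp / exp.sum(axis=-1, keepdims=True)
print(weights.round(3))
```

Row `i` can attend only to positions `0..i`, which is what prevents the decoder from seeing tokens it has not generated yet.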

In [36]:
# Multi-head attention splits the embedding dimension into several heads and processes them separately

def split_heads(x, d_model, num_heads):
  # x.shape: (batch_size, seq_len, d_model)
  batch_size = tf.shape(x)[0]

  # Make sure `d_model` can be split evenly into `num_heads` pieces of size `depth`
  assert d_model % num_heads == 0
  depth = d_model // num_heads  # dimension of each vector after splitting into heads

  # Split the last dimension, d_model, into num_heads pieces of size depth.
  # The last dimension becomes two dimensions, so tensor x goes from 3D to 4D.
  # The -1 lets tf.reshape infer the remaining dimension (seq_len).
  # (batch_size, seq_len, num_heads, depth)
  reshaped_x = tf.reshape(x, shape=(batch_size, -1, num_heads, depth))

  # Move the head dimension forward so that the last two dimensions are
  # the subwords and their depth vectors, giving each head its own (seq_len, depth) matrix.
  # (batch_size, num_heads, seq_len, depth)
  output = tf.transpose(reshaped_x, perm=[0, 2, 1, 3])

  return output

# The subwords in `emb_inp` are already 4-dimensional embedding vectors
d_model = 4
# Split each 4-dimensional embedding vector into 2 heads of 2 dimensions each
num_heads = 2
x = emb_inp

output = split_heads(x, d_model, num_heads)
print("x:", x)
print("output:", output)

x: tf.Tensor(
[[[ 0.01136609  0.044322   -0.03268271 -0.04236697]
  [ 0.01800618  0.02187547  0.02848741 -0.03544848]
  [ 0.04529072  0.00125663 -0.03760185 -0.0222937 ]
  [ 0.0225729   0.02911006 -0.03348683 -0.0471466 ]
  [ 0.01374676 -0.00774624  0.00542571  0.02360238]
  [ 0.01449658  0.03457392  0.0183125   0.00347707]
  [ 0.02398122  0.01899682  0.02353826  0.04027948]
  [ 0.02398122  0.01899682  0.02353826  0.04027948]]

 [[ 0.01136609  0.044322   -0.03268271 -0.04236697]
  [-0.04973317  0.02657061  0.02959832 -0.02706842]
  [ 0.02889854 -0.04722495  0.02665036 -0.04851956]
  [ 0.00775899 -0.00516289 -0.0149469  -0.03300314]
  [ 0.04174921 -0.0429981   0.03316802  0.02649014]
  [ 0.03921003 -0.00368743 -0.0483771  -0.02522649]
  [ 0.01374676 -0.00774624  0.00542571  0.02360238]
  [ 0.01449658  0.03457392  0.0183125   0.00347707]]], shape=(2, 8, 4), dtype=float32)
output: tf.Tensor(
[[[[ 0.01136609  0.044322  ]
   [ 0.01800618  0.02187547]
   [ 0.04529072  0.00125663]
   [ 0.0225
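
The reshape and transpose are easier to follow on a tensor whose entries are just counters. The NumPy sketch below also adds the inverse operation; both helper names are hypothetical, and the input is `np.arange` data rather than real embeddings.

```python
import numpy as np

def split_heads(x, num_heads):
    """(batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)."""
    batch, seq_len, d_model = x.shape
    depth = d_model // num_heads
    return x.reshape(batch, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

def merge_heads(x):
    """Inverse of split_heads: (batch, num_heads, seq_len, depth) -> (batch, seq_len, d_model)."""
    batch, num_heads, seq_len, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, num_heads * depth)

x = np.arange(2 * 8 * 4, dtype=np.float32).reshape(2, 8, 4)
heads = split_heads(x, num_heads=2)
print(heads.shape)                              # (2, 2, 8, 2)
print(heads[0, 0, 0], heads[0, 1, 0])           # [0. 1.] [2. 3.]
print(np.array_equal(merge_heads(heads), x))    # True
```

The first token `[0, 1, 2, 3]` is cut into `[0, 1]` for head 0 and `[2, 3]` for head 1, so no values are mixed or lost by the split.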

In [37]:
# Implement a multi-head attention layer by subclassing tf.keras.layers.Layer.
# It is initialized with the model dimension d_model and the number of heads.
# The q, k, v projections are tf.keras.layers.Dense layers.
# It outputs the attention result and the attention weight matrix:
# output.shape = (batch_size, seq_len_q, d_model)
# attention_weights.shape = (batch_size, num_heads, seq_len_q, seq_len_k)
class MultiHeadAttention(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads):
    super().__init__()
    self.num_heads = num_heads
    self.d_model = d_model

    assert d_model % self.num_heads == 0  # make sure the embedding dimension divides evenly among the heads

    self.depth = d_model // self.num_heads  # depth of each head

    self.wq = tf.keras.layers.Dense(d_model)  # linear projection applied to q (multiplies q by Wq)
    self.wk = tf.keras.layers.Dense(d_model)  # no activation function
    self.wv = tf.keras.layers.Dense(d_model)

    self.dense = tf.keras.layers.Dense(d_model)  # combines the multiple heads

  def split_heads(self, x, batch_size):
    """Split the last dimension into (num_heads, depth).
    Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
    """
    x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
    return tf.transpose(x, perm=[0, 2, 1, 3])


  def call(self, v, k, q, mask):
    batch_size = tf.shape(q)[0]  # number of examples

    # Linearly project q, k, v; the shape stays e.g. (2, 8, 4): 2 sentences, 8 tokens, 4 dimensions
    q = self.wq(q)  # (batch_size, seq_len, d_model)
    k = self.wk(k)  # (batch_size, seq_len, d_model)
    v = self.wv(v)  # (batch_size, seq_len, d_model)

    # Split heads, e.g. from (2, 8, 4) to (2, 2, 8, 2)
    q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
    k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
    v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

    # Pass q, k, v and the mask to scaled dot-product attention
    scaled_attention, attention_weights = scaled_dot_product_attention(
        q, k, v, mask)
    # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
    # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)

    # Reverse the process of splitting heads.
    scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
    # (batch_size, seq_len_q, num_heads, depth)
    concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
    # (batch_size, seq_len_q, d_model)

    # final layer
    output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

    return output, attention_weights

In [38]:
# emb_inp.shape = (batch_size, seq_len, d_model) = (2, 8, 4)
assert d_model == emb_inp.shape[-1]  == 4
num_heads = 2

print(f"d_model: {d_model}")
print(f"num_heads: {num_heads}\n")

# Initialize a multi-head attention layer
mha = MultiHeadAttention(d_model, num_heads)

# Set v = k = q = the embedded examples of shape (2, 8, 4): 2 examples, 8 tokens, 4 embedding dimensions.
# Shape of inp: (2, 8); shape of emb_inp: (2, 8, 4)
# Build the padding mask from inp:
# the last tokens of the first English sentence are <pad>
v = k = q = emb_inp
padding_mask = create_padding_mask(inp)
print("q.shape:", q.shape)
print("k.shape:", k.shape)
print("v.shape:", v.shape)
print("padding_mask.shape:", padding_mask.shape)

output, attention_weights = mha(v, k, q, padding_mask)
print("output.shape:", output.shape)
print("attention_weights.shape:", attention_weights.shape) # (..., seq_len_q, seq_len_k)

print("\noutput:", output)

# attention_weights.shape: (2, 2, 8, 8): 2 sentences, 2 heads, 8 x 8 attention weights per head
# padding_mask.shape: (2, 1, 1, 8)
# This is a demonstration with 2 sentences as input.

d_model: 4
num_heads: 2

q.shape: (2, 8, 4)
k.shape: (2, 8, 4)
v.shape: (2, 8, 4)
padding_mask.shape: (2, 1, 1, 8)
output.shape: (2, 8, 4)
attention_weights.shape: (2, 2, 8, 8)

output: tf.Tensor(
[[[ 0.00036012  0.01476    -0.03122136  0.02325048]
  [ 0.00036645  0.0147353  -0.03120708  0.0232416 ]
  [ 0.00036205  0.01476038 -0.03122213  0.02325161]
  [ 0.00036093  0.01475575 -0.03121787  0.02324813]
  [ 0.00036584  0.01475191 -0.03122262  0.02325291]
  [ 0.00036475  0.01475517 -0.03122295  0.02325291]
  [ 0.00036591  0.01476214 -0.03123032  0.02325869]
  [ 0.00036591  0.01476214 -0.03123032  0.02325869]]

 [[ 0.0087533   0.01092501 -0.00737797  0.00679309]
  [ 0.00879591  0.01090642 -0.00733205  0.00676862]
  [ 0.00881176  0.01088988 -0.00730455  0.00675067]
  [ 0.00877842  0.01091239 -0.00734913  0.00677715]
  [ 0.00879799  0.01092061 -0.00734599  0.00678012]
  [ 0.00875594  0.01092651 -0.007378    0.00679384]
  [ 0.00877781  0.01093053 -0.007369    0.006793  ]
  [ 0.0087716   0.010

Model building
1. Define the feed-forward layer that follows attention
2. Define the encoder layer
3. Define the positional encoding
4. Stack up the encoder



In [40]:
# Feed-forward layer applied after the attention layer.
def point_wise_feed_forward_network(d_model, dff):
  # d_model: dimension of the embedding
  # dff: number of neurons in the hidden layer

  # Two dense layers with a ReLU in between: an FFN with one hidden layer.
  return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
  ])

In [41]:
batch_size = 64
seq_len = 10
d_model = 512
dff = 2048

x = tf.random.uniform((batch_size, seq_len, d_model))
ffn = point_wise_feed_forward_network(d_model, dff)
out = ffn(x)
print("x.shape:", x.shape)
print("out.shape:", out.shape)

# 64 examples, 10 tokens, 512 embedding dimensions.

x.shape: (64, 10, 512)
out.shape: (64, 10, 512)


In [42]:
d_model = 4 # the last dimension of the FFN's input and output tensors is `d_model`
dff = 6

# Build a small FFN
small_ffn = point_wise_feed_forward_network(d_model, dff)
# Shout-out to anyone who gets the subword joke
dummy_sentence = tf.constant([[5, 5, 6, 6],
                              [5, 5, 6, 6],
                              [9, 5, 2, 7],
                              [9, 5, 2, 7],
                              [9, 5, 2, 7]], dtype=tf.float32)
small_ffn(dummy_sentence)

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[ 2.0738847, -1.3606925, -8.436007 ,  7.776016 ],
       [ 2.0738847, -1.3606925, -8.436007 ,  7.776016 ],
       [ 4.5804577, -1.0840721, -8.663323 , 10.321364 ],
       [ 4.5804577, -1.0840721, -8.663323 , 10.321364 ],
       [ 4.5804577, -1.0840721, -8.663323 , 10.321364 ]], dtype=float32)>
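
The output above shows identical rows going in and identical rows coming out: the FFN treats every token position independently. Here is a minimal plain-Python sketch of that idea. The weights are made up for illustration (a real layer learns them), and `matvec` and `ffn` are hypothetical helper names, not part of this notebook.

```python
# Plain-Python sketch of a position-wise feed-forward network.
# The weights below are hypothetical; a trained layer would learn them.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(vec, weights, bias):
    # vec: length n_in, weights: n_in x n_out (list of rows), bias: length n_out
    return [sum(vec[i] * weights[i][j] for i in range(len(vec))) + bias[j]
            for j in range(len(bias))]

def ffn(token, w1, b1, w2, b2):
    # Dense(dff, relu) followed by Dense(d_model), applied to ONE token vector
    hidden = relu(matvec(token, w1, b1))
    return matvec(hidden, w2, b2)

d_model, dff = 4, 6
w1 = [[0.1 * (i + j) - 0.3 for j in range(dff)] for i in range(d_model)]
b1 = [0.0] * dff
w2 = [[0.05 * (i - j) for j in range(d_model)] for i in range(dff)]
b2 = [0.0] * d_model

sentence = [[5, 5, 6, 6],
            [5, 5, 6, 6],
            [9, 5, 2, 7]]
# The same function is applied to each position separately.
outputs = [ffn(tok, w1, b1, w2, b2) for tok in sentence]

print(outputs[0] == outputs[1])  # identical tokens give identical outputs
print(outputs[0] == outputs[2])  # a different token gives a different output
```

Because no information flows between positions here, mixing tokens is left entirely to the attention sub-layer.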

In [43]:
# An encoder layer incorporates 2 sub-layers: MHA & FFN
class EncoderLayer(tf.keras.layers.Layer):
  # Default dropout rate is 0.1
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super().__init__()

    self.mha = MultiHeadAttention(d_model, num_heads)
    self.ffn = point_wise_feed_forward_network(d_model, dff)

    # Layer normalization, one per sub-layer.
    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    # One dropout layer per sub-layer.
    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)

  # The training argument tells the dropout layers whether we are training.
  def call(self, x, training, mask):

    # x is embedding + positional encoding with shape (batch_size, input_seq_len, d_model)
    # attn shape = (batch_size, num_heads, input_seq_len, input_seq_len)
    # sub-layer 1: MHA
    # self-attention: q, k and v are all x
    # the padding mask hides <pad> tokens
    attn_output, attn = self.mha(x, x, x, mask)
    attn_output = self.dropout1(attn_output, training=training)
    out1 = self.layernorm1(x + attn_output)

    # sub-layer 2: FFN
    ffn_output = self.ffn(out1)
    ffn_output = self.dropout2(ffn_output, training=training)  # remember to pass training
    out2 = self.layernorm2(out1 + ffn_output)

    return out2
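
Both sub-layers in `EncoderLayer` follow the same pattern: `LayerNorm(x + sublayer(x))`. A minimal plain-Python sketch of that pattern for a single token vector is shown here. It assumes no learned scale or offset (Keras's `LayerNormalization` has both), it skips dropout, and `toy_sublayer` is a hypothetical stand-in for MHA or the FFN.

```python
import math

def layer_norm(vec, epsilon=1e-6):
    # Normalize one token vector to zero mean and unit variance over its features.
    # (The learned scale and offset of the Keras layer are omitted in this sketch.)
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + epsilon) for x in vec]

def residual_block(x, sublayer):
    # out = LayerNorm(x + sublayer(x)); dropout is skipped, as with training=False
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical stand-in for MHA or FFN: any function that keeps the shape.
def toy_sublayer(vec):
    return [0.5 * v for v in vec]

token = [1.0, 2.0, 3.0, 4.0]
out = residual_block(token, toy_sublayer)
print(out)
print(sum(out) / len(out))  # the mean is numerically zero
```

The residual connection only works because every sub-layer keeps the last dimension at `d_model`, which is why the assert on matching shapes in the next cell holds.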

In [44]:
d_model = 4
num_heads = 2
dff = 8

enc_layer = EncoderLayer(d_model, num_heads, dff)
padding_mask = create_padding_mask(inp) #inp = (2,8), 2 examples, 8 tokens
enc_out = enc_layer(emb_inp, training=False, mask=padding_mask)  # (batch_size, seq_len, d_model)

print("inp:", inp)
print("-" * 20)
print("padding_mask:", padding_mask)
print("-" * 20)
print("emb_inp:", emb_inp)
print("-" * 20)
print("enc_out:", enc_out)
assert emb_inp.shape == enc_out.shape

# The encoder needs only a padding mask. It takes a whole English sentence (inp: (2, 8)), embeds it (emb_inp: (2, 8, 4)), and turns it into a representation (enc_out: (2, 8, 4)).
# That representation, which carries the meaning of the sentence, is then delivered to the decoder.
# enc_out.shape = (2, 8, 4) = (batch_size, seq_len_q, num_heads * depth): 2 examples; for each of the 8 tokens, an attention-weighted sum over the tokens of the sentence, with the heads concatenated along the last axis.

inp: tf.Tensor(
[[8113  103    9 1066 7903 8114    0    0]
 [8113   16 4111 6735   12 2750 7903 8114]], shape=(2, 8), dtype=int64)
--------------------
padding_mask: tf.Tensor(
[[[[0. 0. 0. 0. 0. 0. 1. 1.]]]


 [[[0. 0. 0. 0. 0. 0. 0. 0.]]]], shape=(2, 1, 1, 8), dtype=float32)
--------------------
emb_inp: tf.Tensor(
[[[ 0.01136609  0.044322   -0.03268271 -0.04236697]
  [ 0.01800618  0.02187547  0.02848741 -0.03544848]
  [ 0.04529072  0.00125663 -0.03760185 -0.0222937 ]
  [ 0.0225729   0.02911006 -0.03348683 -0.0471466 ]
  [ 0.01374676 -0.00774624  0.00542571  0.02360238]
  [ 0.01449658  0.03457392  0.0183125   0.00347707]
  [ 0.02398122  0.01899682  0.02353826  0.04027948]
  [ 0.02398122  0.01899682  0.02353826  0.04027948]]

 [[ 0.01136609  0.044322   -0.03268271 -0.04236697]
  [-0.04973317  0.02657061  0.02959832 -0.02706842]
  [ 0.02889854 -0.04722495  0.02665036 -0.04851956]
  [ 0.00775899 -0.00516289 -0.0149469  -0.03300314]
  [ 0.04174921 -0.0429981   0.03316802  0.02649014]
  [

In [45]:
# positional encoding
def get_angles(pos, i, d_model):
  angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
  return pos * angle_rates

def positional_encoding(position, d_model):
  angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)

  # apply sin to even indices in the array; 2i
  sines = np.sin(angle_rads[:, 0::2])

  # apply cos to odd indices in the array; 2i+1
  cosines = np.cos(angle_rads[:, 1::2])

  pos_encoding = np.concatenate([sines, cosines], axis=-1)

  pos_encoding = pos_encoding[np.newaxis, ...]

  return tf.cast(pos_encoding, dtype=tf.float32)


seq_len = 50
d_model = 512

pos_encoding = positional_encoding(seq_len, d_model)
pos_encoding

<tf.Tensor: shape=(1, 50, 512), dtype=float32, numpy=
array([[[ 0.        ,  0.        ,  0.        , ...,  1.        ,
          1.        ,  1.        ],
        [ 0.84147096,  0.8218562 ,  0.8019618 , ...,  1.        ,
          1.        ,  1.        ],
        [ 0.9092974 ,  0.9364147 ,  0.95814437, ...,  1.        ,
          1.        ,  1.        ],
        ...,
        [ 0.12357312,  0.97718984, -0.24295525, ...,  0.9999863 ,
          0.99998724,  0.99998814],
        [-0.76825464,  0.7312359 ,  0.63279754, ...,  0.9999857 ,
          0.9999867 ,  0.9999876 ],
        [-0.95375264, -0.14402692,  0.99899054, ...,  0.9999851 ,
          0.9999861 ,  0.9999871 ]]], dtype=float32)>
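
To make the numbers above less mysterious, here is a plain-Python sketch that rebuilds single rows of the same table. It assumes the layout used in this notebook (all sines in the first half, all cosines in the second half, as produced by the `np.concatenate` call); `positional_encoding_row` is a hypothetical helper name.

```python
import math

def positional_encoding_row(pos, d_model):
    # Sines in the first half, cosines in the second half,
    # the same layout the np.concatenate call above produces.
    half = d_model // 2
    angles = [pos / (10000 ** (2 * k / d_model)) for k in range(half)]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

row0 = positional_encoding_row(0, 512)
row1 = positional_encoding_row(1, 512)

print(row0[:3], row0[-3:])               # position 0: sin(0) = 0, cos(0) = 1
print([round(v, 4) for v in row1[:3]])   # compare with the second row printed above
```

Low feature indices oscillate quickly with position while high indices barely move, which is why the right-hand columns of the printed tensor stay close to 1.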

In [46]:
class Encoder(tf.keras.layers.Layer):
  # An encoder incorporates an embedding layer, a positional encoding, and a stack of encoder layers
  # num_layers: number of encoder layers
  # input_vocab_size: size of the source vocabulary
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
          rate=0.1):
    super().__init__()

    self.d_model = d_model
    self.embedding = tf.keras.layers.Embedding(input_vocab_size, self.d_model)
    self.pos_encoding = positional_encoding(input_vocab_size, self.d_model)

    # A list holding num_layers encoder layers.
    self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate)
               for _ in range(num_layers)]

    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, training, mask):
    # x.shape == (batch_size, input_seq_len)
    # output of following layers (batch_size, input_seq_len, d_model)
    input_seq_len = tf.shape(x)[1]

    # embedding + scaling by sqrt(d_model) + positional encoding + dropout
    x = self.embedding(x)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :input_seq_len, :]
    x = self.dropout(x, training=training)

    # N layers of encoder layer
    for enc_layer in self.enc_layers:
      x = enc_layer(x, training, mask)
    return x

In [47]:
#hyperparameters
num_layers = 2 # encoder with 2 layers of encoder layer
d_model = 4
num_heads = 2
dff = 8
input_vocab_size = subword_encoder_en.vocab_size + 2 # English vocabulary size + 2 (<BOS> and <EOS>)

# Initialize an encoder
encoder = Encoder(num_layers, d_model, num_heads, dff, input_vocab_size)

# Demo: inp.shape = (2, 8)
enc_out = encoder(inp, training=False, mask=None)
print("inp:", inp)
print("-" * 20)
print("enc_out:", enc_out)

inp: tf.Tensor(
[[8113  103    9 1066 7903 8114    0    0]
 [8113   16 4111 6735   12 2750 7903 8114]], shape=(2, 8), dtype=int64)
--------------------
enc_out: tf.Tensor(
[[[-1.6175534   0.31700382  1.1178817   0.18266782]
  [-1.537351    0.09652126  1.2632029   0.17762691]
  [-1.4711192  -0.3286521   0.66728604  1.1324852 ]
  [-1.7319188   0.5719883   0.56316334  0.596767  ]
  [-1.7129767   0.63511527  0.74072814  0.33713335]
  [-1.6790036   0.5376022   0.91749287  0.22390842]
  [-1.6322757   0.34694943  1.0838362   0.20148984]
  [-1.5538539   0.18769327  1.2385572   0.12760338]]

 [[-1.6310383   0.33620313  1.0880876   0.20674738]
  [-1.5676134   0.11934948  1.2138293   0.23443466]
  [-1.4289947  -0.31844845  0.46765357  1.2797897 ]
  [-1.7316765   0.55490613  0.5662713   0.61049914]
  [-1.7175832   0.5749828   0.75368994  0.3889103 ]
  [-1.6889402   0.53367054  0.8901645   0.26510525]
  [-1.6474875   0.3908047   1.0427251   0.2139579 ]
  [-1.572732    0.255543    1.2034411   0.1137

In [48]:
# A Decoder incorporates N DecoderLayers,
# and each DecoderLayer has 3 sub-layers: self-attention MHA, MHA over the Encoder output, and FFN
class DecoderLayer(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, dff, rate=0.1):
    super().__init__()

    # 3 types of sublayer in the decoder layer
    self.mha1 = MultiHeadAttention(d_model, num_heads)
    self.mha2 = MultiHeadAttention(d_model, num_heads)
    self.ffn = point_wise_feed_forward_network(d_model, dff)

    # LayerNorm for every sub layer
    self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
    self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

    # Dropout for every sub-layer
    self.dropout1 = tf.keras.layers.Dropout(rate)
    self.dropout2 = tf.keras.layers.Dropout(rate)
    self.dropout3 = tf.keras.layers.Dropout(rate)


  def call(self, x, enc_output, training,
           combined_mask, inp_padding_mask):
    # the output of every sub-layer has shape (batch_size, target_seq_len, d_model)
    # enc_output is the output of the Encoder with shape (batch_size, input_seq_len, d_model)
    # shape of attn_weights_block_1: (batch_size, num_heads, target_seq_len, target_seq_len)
    # shape of attn_weights_block_2: (batch_size, num_heads, target_seq_len, input_seq_len)

    # sub-layer 1: self-attention over the decoder input
    # Needs both the look-ahead mask and the padding mask: combined_mask
    # x: zh input, the tokens that precede the one being predicted
    attn1, attn_weights_block1 = self.mha1(x, x, x, combined_mask)
    attn1 = self.dropout1(attn1, training=training)
    out1 = self.layernorm1(attn1 + x)

    # sub-layer 2: attention over the encoder output.
    # Needs only the padding mask of the encoder input.
    attn2, attn_weights_block2 = self.mha2(enc_output, enc_output,
                          out1, inp_padding_mask)  # (batch_size, target_seq_len, d_model)
    attn2 = self.dropout2(attn2, training=training)
    out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

    # sub-layer 3: FFN
    ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
    ffn_output = self.dropout3(ffn_output, training=training)
    out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

    return out3, attn_weights_block1, attn_weights_block2

In [49]:
tar_padding_mask = create_padding_mask(tar)
look_ahead_mask = create_look_ahead_mask(tar.shape[-1])
combined_mask = tf.maximum(tar_padding_mask, look_ahead_mask)
# stack up two masks for decoder

print("tar:", tar)
print("-" * 20)
print("tar_padding_mask:", tar_padding_mask)
print("-" * 20)
print("look_ahead_mask:", look_ahead_mask)
print("-" * 20)
print("combined_mask:", combined_mask)

tar: tf.Tensor(
[[4205   10  241   86   27    3 4206    0    0    0]
 [4205  165  489  398  191   14    7  560    3 4206]], shape=(2, 10), dtype=int64)
--------------------
tar_padding_mask: tf.Tensor(
[[[[0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]]]


 [[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]], shape=(2, 1, 1, 10), dtype=float32)
--------------------
look_ahead_mask: tf.Tensor(
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]], shape=(10, 10), dtype=float32)
--------------------
combined_mask: tf.Tensor(
[[[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
   [0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
   [0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
   [0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
   [0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
   [0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
   [0. 0. 0. 0.
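
The combined mask is an element-wise maximum of the padding mask and the look-ahead mask. A plain-Python sketch for the first target example is shown here. It assumes, as the printed masks suggest, that `1.0` means "masked" and that token id `0` is `<pad>`; the helper names are hypothetical and stand in for `create_padding_mask` / `create_look_ahead_mask`.

```python
def padding_mask(token_ids):
    # 1.0 marks a <pad> position (token id 0), 0.0 a real token
    return [1.0 if t == 0 else 0.0 for t in token_ids]

def look_ahead_mask(size):
    # Row i may attend to columns 0..i; later columns are masked with 1.0
    return [[1.0 if col > row else 0.0 for col in range(size)]
            for row in range(size)]

def combine(pad, ahead):
    # Element-wise maximum, with the padding row broadcast over every query row
    return [[max(p, a) for p, a in zip(pad, row)] for row in ahead]

tar_row = [4205, 10, 241, 86, 27, 3, 4206, 0, 0, 0]  # first target example above
combined = combine(padding_mask(tar_row), look_ahead_mask(len(tar_row)))
for row in combined:
    print(row)
```

In the last rows the look-ahead mask alone would allow every column, but the three trailing `<pad>` columns stay masked because of the padding mask.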

In [51]:
# hyperparameters
d_model = 4
num_heads = 2
dff = 8
dec_layer = DecoderLayer(d_model, num_heads, dff)

# Both inp and tar (en for the encoder, zh for the decoder) need a padding mask
inp_padding_mask = create_padding_mask(inp)
tar_padding_mask = create_padding_mask(tar)

# look-ahead mask + padding mask for the target input.
look_ahead_mask = create_look_ahead_mask(tar.shape[-1])
combined_mask = tf.maximum(tar_padding_mask, look_ahead_mask)

# enc_out comes from the encoder. emb_tar has no positional encoding applied yet.
dec_out, dec_self_attn_weights, dec_enc_attn_weights = dec_layer(
                                     emb_tar, enc_out, False, combined_mask, inp_padding_mask)

print("emb_tar:", emb_tar)
print("-" * 20)
print("enc_out:", enc_out)
print("-" * 20)
print("dec_out:", dec_out)
assert emb_tar.shape == dec_out.shape
print("-" * 20)
print("dec_self_attn_weights.shape:", dec_self_attn_weights.shape)
print("dec_enc_attn_weights:", dec_enc_attn_weights.shape)


# shape of emb_tar: (2, 10, 4). 2 examples, 10 tokens, 4 embedding dimensions of the zh embedding
# shape of enc_out: (2, 8, 4). 2 examples, 8 tokens, 4 embedding dimensions of the en embedding
# shape of dec_out: (2, 10, 4). 2 examples, 10 attention-weighted sums (one per target token), 2 heads * 2 depth (multi-head)
# shape of dec_self_attn_weights: (2, 2, 10, 10). 2 examples, 2 heads, 10 x 10 attention weights: the 10 target tokens attending to each other
# shape of dec_enc_attn_weights: (2, 2, 10, 8). 2 examples, 2 heads, 10 x 8 attention weights: the 10 target tokens attending to the 8 encoder outputs

emb_tar: tf.Tensor(
[[[-0.03277655 -0.00745342 -0.00683656 -0.02574356]
  [ 0.01696173  0.00757234  0.00941087 -0.03724331]
  [ 0.04189566  0.03261018 -0.02499024  0.00189238]
  [ 0.02182115  0.01020544 -0.00277179  0.04719411]
  [ 0.03758576  0.01851412 -0.00444709  0.04633282]
  [-0.00801369 -0.02775562  0.00632982  0.01889359]
  [-0.04102479 -0.0102158   0.01949997 -0.00985195]
  [-0.01700247  0.02699344 -0.04914365  0.02938807]
  [-0.01700247  0.02699344 -0.04914365  0.02938807]
  [-0.01700247  0.02699344 -0.04914365  0.02938807]]

 [[-0.03277655 -0.00745342 -0.00683656 -0.02574356]
  [-0.01273079 -0.01638271  0.04358082 -0.0011729 ]
  [-0.04276529 -0.03356844  0.04649569  0.02694151]
  [-0.03918862  0.02449466  0.01530714 -0.0251611 ]
  [-0.00469106 -0.00384055 -0.01679001  0.03942842]
  [ 0.00328831  0.02432671  0.04318819  0.01971428]
  [ 0.03876802 -0.01617848 -0.02781192  0.02414567]
  [-0.04903276  0.0202435   0.04282154 -0.03306665]
  [-0.00801369 -0.02775562  0.00632982  0.

In [52]:
class Decoder(tf.keras.layers.Layer):
  # takes target_vocab_size (the zh vocabulary size) instead of input_vocab_size
  def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,
             rate=0.1):
    super().__init__()

    self.d_model = d_model
    # the embedding covers the zh vocabulary
    self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
    self.pos_encoding = positional_encoding(target_vocab_size, self.d_model)
    self.dec_layers = [DecoderLayer(d_model, num_heads, dff, rate)
               for _ in range(num_layers)]
    self.dropout = tf.keras.layers.Dropout(rate)

  def call(self, x, enc_output, training,
           combined_mask, inp_padding_mask):

    tar_seq_len = tf.shape(x)[1]
    attention_weights = {}  # collects the attention weights of every decoder layer

    # same process as encoder.
    x = self.embedding(x)  # (batch_size, tar_seq_len, d_model)
    x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
    x += self.pos_encoding[:, :tar_seq_len, :]
    x = self.dropout(x, training=training)

    for i, dec_layer in enumerate(self.dec_layers):
      x, block1, block2 = dec_layer(x, enc_output, training,
                        combined_mask, inp_padding_mask)

      # save the attention weights of both attention blocks in this decoder layer
      attention_weights['decoder_layer{}_block1'.format(i + 1)] = block1
      attention_weights['decoder_layer{}_block2'.format(i + 1)] = block2

    # x.shape = (batch_size, tar_seq_len, d_model)
    return x, attention_weights

In [53]:
# hyperparameters
num_layers = 2 # 2 layers of decoder layers
d_model = 4
num_heads = 2
dff = 8
target_vocab_size = subword_encoder_zh.vocab_size + 2 # zh vocabulary size + 2 (<BOS>, <EOS>)

# the decoder needs both the look-ahead mask and the padding masks
inp_padding_mask = create_padding_mask(inp)
tar_padding_mask = create_padding_mask(tar)
look_ahead_mask = create_look_ahead_mask(tar.shape[1])
combined_mask = tf.math.maximum(tar_padding_mask, look_ahead_mask)

# initialize a decoder
decoder = Decoder(num_layers, d_model, num_heads, dff, target_vocab_size)

# Demo: the input is tar, shape (2, 10)
print("tar:", tar)
print("-" * 20)
print("combined_mask:", combined_mask)
print("-" * 20)
print("enc_out:", enc_out)
print("-" * 20)
print("inp_padding_mask:", inp_padding_mask)
print("-" * 20)
dec_out, attn = decoder(tar, enc_out, training=False,
                        combined_mask=combined_mask,
                        inp_padding_mask=inp_padding_mask)
print("dec_out:", dec_out)
print("-" * 20)
for block_name, attn_weights in attn.items():
  print(f"{block_name}.shape: {attn_weights.shape}")

# shape of tar: (2, 10). 2 examples, 10 tokens
# shape of enc_out: (2, 8, 4). 2 examples, 8 tokens, 4 embedding dimensions of the en embedding
# shape of dec_out: (2, 10, 4). 2 examples, 10 attention-weighted sums (one per target token), 2 heads * 2 depth (multi-head)
# shape of decoder_layer1_block1: (2, 2, 10, 10). 2 examples, 2 heads, 10 x 10 attention weights: the 10 target tokens attending to each other
# shape of decoder_layer1_block2: (2, 2, 10, 8). 2 examples, 2 heads, 10 x 8 attention weights: the 10 target tokens attending to the 8 encoder outputs
# shape of decoder_layer2_block1: (2, 2, 10, 10)
# shape of decoder_layer2_block2: (2, 2, 10, 8)


tar: tf.Tensor(
[[4205   10  241   86   27    3 4206    0    0    0]
 [4205  165  489  398  191   14    7  560    3 4206]], shape=(2, 10), dtype=int64)
--------------------
combined_mask: tf.Tensor(
[[[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
   [0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
   [0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
   [0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
   [0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
   [0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
   [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
   [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
   [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
   [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]]]


 [[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
   [0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
   [0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
   [0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
   [0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
   [0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
   [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
   [0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
   [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
   [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]], shape=(2, 1, 10, 10), dtype=float32)
--------------------
enc_out: tf.Tensor(
[[[-1.6175534  

In [None]:
# There are no other layers on top of the Transformer, so we build the model with tf.keras.Model
class Transformer(tf.keras.Model):
  # The constructor takes the hyperparameters needed by the Encoder & Decoder
  # and the sizes of the English and Chinese vocabularies
  def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               target_vocab_size, rate=0.1):
    super(Transformer, self).__init__()

    self.encoder = Encoder(num_layers, d_model, num_heads, dff,
                           input_vocab_size, rate)

    self.decoder = Decoder(num_layers, d_model, num_heads, dff,
                           target_vocab_size, rate)
    # This final layer outputs as many logits as the Chinese vocabulary has entries;
    # after softmax they represent the probability of each Chinese token
    self.final_layer = tf.keras.layers.Dense(target_vocab_size)

  # enc_padding_mask and dec_padding_mask are both padding masks of the English sequence;
  # one is used by the MHA in the Encoder layer, the other by MHA 2 in the Decoder layer
  # Keras models implement `call`; overriding `__call__` would bypass Keras's own wrapper
  def call(self, inp, tar, training, enc_padding_mask,
           combined_mask, dec_padding_mask):

    enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

    # dec_output.shape == (batch_size, tar_seq_len, d_model)
    dec_output, attention_weights = self.decoder(
        tar, enc_output, training, combined_mask, dec_padding_mask)

    # Pass the Decoder output through the final linear layer
    final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

    return final_output, attention_weights

In [None]:
# hyperparameters
num_layers = 1
d_model = 4
num_heads = 2
dff = 8

# + 2 for the <start> & <end> tokens
input_vocab_size = subword_encoder_en.vocab_size + 2
output_vocab_size = subword_encoder_zh.vocab_size + 2

# The key point: during training, the preceding tokens are used to predict the next Chinese token
tar_inp = tar[:, :-1]
tar_real = tar[:, 1:]

# Masks for the source / target language. Note that `combined_mask` already merges the two masks of the target language
inp_padding_mask = create_padding_mask(inp)
tar_padding_mask = create_padding_mask(tar_inp)
look_ahead_mask = create_look_ahead_mask(tar_inp.shape[1])
combined_mask = tf.math.maximum(tar_padding_mask, look_ahead_mask)

# Initialize our first transformer
transformer = Transformer(num_layers, d_model, num_heads, dff,
                          input_vocab_size, output_vocab_size)

# Feed in the English and Chinese sequences to get the Transformer's predictions for the next Chinese token
predictions, attn_weights = transformer(inp, tar_inp, False, inp_padding_mask,
                                        combined_mask, inp_padding_mask)

print("tar:", tar)
print("-" * 20)
print("tar_inp:", tar_inp)
print("-" * 20)
print("tar_real:", tar_real)
print("-" * 20)
print("predictions:", predictions)

In [None]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

# Suppose we are solving a binary classification problem, where 0 and 1 each stand for a label
real = tf.constant([1, 1, 0], shape=(1, 3), dtype=tf.float32)
pred = tf.constant([[0, 1], [0, 1], [0, 1]], dtype=tf.float32)
loss_object(real, pred)
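
To see what `SparseCategoricalCrossentropy(from_logits=True, reduction='none')` computes for this toy input, here is a plain-Python sketch of the same arithmetic: the negative log of the softmax probability at the true label, one value per position. `sparse_cross_entropy_from_logits` is a hypothetical helper name.

```python
import math

def sparse_cross_entropy_from_logits(label, logits):
    # -log(softmax(logits)[label]), computed via log-sum-exp for stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

real = [1, 1, 0]
pred = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
losses = [sparse_cross_entropy_from_logits(y, z) for y, z in zip(real, pred)]
print([round(l, 4) for l in losses])  # [0.3133, 0.3133, 1.3133]
```

The logits always favor class 1, so the two positions labeled 1 get a small loss and the position labeled 0 gets a larger one.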

In [None]:
print("predictions:", predictions)
print("-" * 20)
print(tf.reduce_sum(predictions, axis=-1))

In [None]:
def loss_function(real, pred):
  # This mask is 1 where the sequence is not 0 (not <pad>) and 0 elsewhere
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  # Compute the cross entropy at every position as usual, without summing
  loss_ = loss_object(real, pred)
  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask  # only count the loss at non-<pad> positions

  return tf.reduce_mean(loss_)
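
One detail worth noticing: `tf.reduce_mean` over the masked losses still divides by every position, padded ones included, so the reported value shrinks as padding grows. The plain-Python sketch below compares that arithmetic with an alternative that divides by the number of real tokens only. The alternative is an assumption for illustration, not something this notebook does, and the per-token losses are made up.

```python
def masked_mean_all(losses, labels):
    # Same arithmetic as loss_function above: zero out <pad>, divide by ALL positions
    masked = [l if y != 0 else 0.0 for l, y in zip(losses, labels)]
    return sum(masked) / len(masked)

def masked_mean_real(losses, labels):
    # Alternative (not used in this notebook): divide by non-<pad> positions only
    masked = [l for l, y in zip(losses, labels) if y != 0]
    return sum(masked) / len(masked)

labels = [12, 7, 3, 0, 0]           # two trailing <pad> tokens
losses = [2.0, 1.0, 3.0, 5.0, 5.0]  # made-up per-token losses
print(masked_mean_all(losses, labels))   # 1.2
print(masked_mean_real(losses, labels))  # 2.0
```

Either way the padded positions contribute no gradient; the two versions differ only in how the loss is scaled.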

In [None]:
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
    name='train_accuracy')

In [None]:
num_layers = 4
d_model = 128
dff = 512
num_heads = 8

input_vocab_size = subword_encoder_en.vocab_size + 2
target_vocab_size = subword_encoder_zh.vocab_size + 2
dropout_rate = 0.1  # default value

print("input_vocab_size:", input_vocab_size)
print("target_vocab_size:", target_vocab_size)

In [None]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
  # the paper's default is `warmup_steps` = 4000
  def __init__(self, d_model, warmup_steps=4000):
    super(CustomSchedule, self).__init__()

    self.d_model = d_model
    self.d_model = tf.cast(self.d_model, tf.float32)

    self.warmup_steps = warmup_steps

  def __call__(self, step):
    step = tf.cast(step, tf.float32)
    arg1 = tf.math.rsqrt(step)
    arg2 = step * (self.warmup_steps ** -1.5)

    return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

# Pass the custom learning rate schedule to the Adam optimizer.
# The Adam hyperparameters are the same as in the paper
learning_rate = CustomSchedule(d_model)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)
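
The schedule in `CustomSchedule.__call__` boils down to one line of arithmetic. A plain-Python sketch of the same formula is shown here (`transformer_lr` is a hypothetical name); it assumes `step >= 1`, since step 0 would divide by zero.

```python
def transformer_lr(step, d_model, warmup_steps=4000):
    # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, transformer_lr(step, d_model=128))
```

The rate rises linearly until `step == warmup_steps`, where both arguments of `min` are equal, and then decays with the inverse square root of the step. A larger `d_model` lowers the whole curve.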

In [None]:
d_models = [128, 256, 512]
warmup_steps = [1000 * i for i in range(1, 4)]

schedules = []
labels = []
colors = ["blue", "red", "black"]
for d in d_models:
  schedules += [CustomSchedule(d, s) for s in warmup_steps]
  labels += [f"d_model: {d}, warm: {s}" for s in warmup_steps]

for i, (schedule, label) in enumerate(zip(schedules, labels)):
  plt.plot(schedule(tf.range(10000, dtype=tf.float32)),
           label=label, color=colors[i // 3])

plt.legend()

plt.ylabel("Learning Rate")
plt.xlabel("Train Step")

In [None]:
transformer = Transformer(num_layers, d_model, num_heads, dff,
                          input_vocab_size, target_vocab_size, dropout_rate)

print(f"""This Transformer has {num_layers} Encoder / Decoder layers
d_model: {d_model}
num_heads: {num_heads}
dff: {dff}
input_vocab_size: {input_vocab_size}
target_vocab_size: {target_vocab_size}
dropout_rate: {dropout_rate}

""")

In [None]:
# # Makes it easy to compare results across experiments / hyperparameter settings
# run_id = f"{num_layers}layers_{d_model}d_{num_heads}heads_{dff}dff_{train_perc}train_perc"
# checkpoint_path = os.path.join(checkpoint_path, run_id)
# log_dir = os.path.join(log_dir, run_id)

# # tf.train.Checkpoint bundles the things we want to save so they are easy to store and restore
# # Typically you want to save the state of the model and the optimizer
# ckpt = tf.train.Checkpoint(transformer=transformer,
#                            optimizer=optimizer)

# # ckpt_manager looks in checkpoint_path for anything matching what ckpt defines
# # When saving, only the 5 most recent checkpoints are kept; older ones are deleted automatically
# ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

# # If a checkpoint is found at the checkpoint path, load it
# if ckpt_manager.latest_checkpoint:
#   ckpt.restore(ckpt_manager.latest_checkpoint)

#   # Used to check how many epochs have been trained so far
#   last_epoch = int(ckpt_manager.latest_checkpoint.split("-")[-1])
#   print(f'Restored the latest checkpoint; the model has been trained for {last_epoch} epochs.')
# else:
#   last_epoch = 0
#   print("No checkpoint found; training from scratch.")

In [None]:
# Prepare the masks for the Transformer's Encoder / Decoder
def create_masks(inp, tar):
  # Padding mask of the English sentence, used by the self-attention in the Encoder layer
  enc_padding_mask = create_padding_mask(inp)

  # Also a padding mask of the English sentence, but for MHA 2 in the Decoder layer,
  # which attends to the Encoder output sequence
  dec_padding_mask = create_padding_mask(inp)

  # Used by the self-attention of MHA 1 in the Decoder layer
  # `combined_mask` is the padding mask of the Chinese sentence merged with the look-ahead mask
  look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
  dec_target_padding_mask = create_padding_mask(tar)
  combined_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

  return enc_padding_mask, combined_mask, dec_padding_mask

In [None]:
@tf.function  # let TensorFlow optimize our eager code into a graph and speed it up
def train_step(inp, tar):
  # As mentioned before: use the sequence without its last token to predict the sequence shifted by one
  tar_inp = tar[:, :-1]
  tar_real = tar[:, 1:]

  # Build the 3 masks
  enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)

  # Record all of the Transformer's operations so that gradients can be computed afterwards
  with tf.GradientTape() as tape:
    # Note that we feed `tar_inp`, not `tar`. Remember to set `training` to True
    predictions, _ = transformer(inp, tar_inp,
                                 True,
                                 enc_padding_mask,
                                 combined_mask,
                                 dec_padding_mask)
    # As shown in the video, the loss is the difference between the sequence
    # shifted left by one token and the distributions predicted by the model
    loss = loss_function(tar_real, predictions)

  # Take the gradients and call the Adam optimizer defined earlier to update the Transformer's trainable parameters
  gradients = tape.gradient(loss, transformer.trainable_variables)
  optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

  # Track the loss and training accuracy for TensorBoard (optional)
  train_loss(loss)
  train_accuracy(tar_real, predictions)
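
The `tar_inp` / `tar_real` split at the top of `train_step` is the heart of teacher forcing. A tiny plain-Python sketch with one target sequence (token ids taken from the demo output earlier, padding dropped) shows which prefix the decoder sees for each token it has to predict.

```python
tar = [4205, 10, 241, 86, 27, 3, 4206]  # <start> ... <end>, ids from the demo above

tar_inp = tar[:-1]   # what the decoder is fed
tar_real = tar[1:]   # what it should predict at each position

# The look-ahead mask ensures position i only sees tar_inp[:i + 1].
for seen_upto, target in enumerate(tar_real, start=1):
    print(tar_inp[:seen_upto], "->", target)
```

All positions are trained in one forward pass; the look-ahead mask is what keeps each position from peeking at its own answer.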

In [None]:
# Define how many passes over the dataset to make
EPOCHS = 100
last_epoch = 0
print(f"The Transformer with this hyperparameter combination has been trained for {last_epoch} epochs.")
print(f"Remaining epochs: {max(0, EPOCHS - last_epoch)}")


# Used to write information to TensorBoard; optional but highly recommended
summary_writer = tf.summary.create_file_writer(log_dir)

# Compare `EPOCHS` with the already trained `last_epoch` to decide how many more epochs to train
for epoch in range(last_epoch, EPOCHS):
  start = time.time()

  # Reset the metrics logged to TensorBoard
  train_loss.reset_states()
  train_accuracy.reset_states()

  # One epoch processes the training dataset batch by batch until the whole dataset has been seen
  for (step_idx, (inp, tar)) in enumerate(train_dataset):

    # Each step feeds data into the Transformer, lets it produce predictions, and computes gradients to minimize the loss
    train_step(inp, tar)

  # Save a checkpoint after every epoch
  # if (epoch + 1) % 1 == 0:
  #   ckpt_save_path = ckpt_manager.save()
  #   print ('Saving checkpoint for epoch {} at {}'.format(epoch+1,
  #                                                        ckpt_save_path))

  # Write the loss and accuracy to TensorBoard
  with summary_writer.as_default():
    tf.summary.scalar("train_loss", train_loss.result(), step=epoch + 1)
    tf.summary.scalar("train_acc", train_accuracy.result(), step=epoch + 1)

  print('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1,
                                                train_loss.result(),
                                                train_accuracy.result()))
  print('Time taken for 1 epoch: {} secs\n'.format(time.time() - start))


# Note: if the transformer and its layers report no trainable variables, check that every custom class subclasses tf.keras.layers.Layer or tf.keras.Model; the variables are tracked through those base classes.

In [None]:
%load_ext tensorboard
%tensorboard --logdir {log_dir}

In [None]:
# Given an English sentence, return the predicted Chinese token index sequence and a dict of attention weights
def evaluate(inp_sentence):

  # <start> and <end> tokens added around the English sentence
  start_token = [subword_encoder_en.vocab_size]
  end_token = [subword_encoder_en.vocab_size + 1]

  # inp_sentence is a string; the subword tokenizer converts it to a sequence of subword indices,
  # and BOS / EOS are then added at both ends
  inp_sentence = start_token + subword_encoder_en.encode(inp_sentence) + end_token
  encoder_input = tf.expand_dims(inp_sentence, 0)

  # As in the video, the decoder's input at the first time step
  # is a sequence containing only the Chinese <start> token
  decoder_input = [subword_encoder_zh.vocab_size]
  output = tf.expand_dims(decoder_input, 0)  # add the batch dimension

  # Auto-regressive: generate one Chinese token at a time and feed the prediction back into the Transformer
  for i in range(MAX_LENGTH):
    # Every newly generated token requires new masks
    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(
        encoder_input, output)

    # predictions.shape == (batch_size, seq_len, vocab_size)
    predictions, attention_weights = transformer(encoder_input,
                                                 output,
                                                 False,
                                                 enc_padding_mask,
                                                 combined_mask,
                                                 dec_padding_mask)


    # Take the last distribution in the sequence; its argmax is the model's newest predicted token
    predictions = predictions[: , -1:, :]  # (batch_size, 1, vocab_size)

    predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)

    # Stop and return when the <end> token is predicted; the model has finished generating
    if tf.equal(predicted_id, subword_encoder_zh.vocab_size + 1):
      return tf.squeeze(output, axis=0), attention_weights

    # Append the newly predicted Chinese index to the output sequence so the decoder
    # can attend to the latest `predicted_id` when generating the next token
    output = tf.concat([output, predicted_id], axis=-1)

  # Drop the batch dimension and return the predicted Chinese index sequence
  return tf.squeeze(output, axis=0), attention_weights
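
The control flow of `evaluate` is greedy auto-regressive decoding. Stripped of TensorFlow, it can be sketched as below. `greedy_decode`, `toy_next_token`, and the token ids are hypothetical stand-ins; the toy function replaces the Transformer call and the argmax over its last output distribution.

```python
def greedy_decode(next_token_fn, start_id, end_id, max_length):
    """Generate ids one at a time, feeding each prediction back in."""
    output = [start_id]
    for _ in range(max_length):
        predicted_id = next_token_fn(output)
        if predicted_id == end_id:
            break  # stop before appending <end>, as evaluate() does
        output.append(predicted_id)
    return output


# Hypothetical stand-in for the Transformer: emits 7, 8, 9 and then <end>.
START_ID, END_ID = 100, 101
script = [7, 8, 9, END_ID]


def toy_next_token(prefix):
    # The next token depends only on how many tokens exist so far.
    return script[len(prefix) - 1]


decoded = greedy_decode(toy_next_token, START_ID, END_ID, max_length=10)
truncated = greedy_decode(toy_next_token, START_ID, END_ID, max_length=2)
```

As in `evaluate`, the returned sequence still starts with the `<start>` id, and generation is cut off at `max_length` when `<end>` is never predicted.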

In [None]:
# English sentence to translate
sentence = "China, India, and others have enjoyed continuing economic growth."

# Get the predicted Chinese index sequence
predicted_seq, _ = evaluate(sentence)

# Filter out the <start> and <end> tokens, then use the Chinese subword tokenizer to decode the indices back into a sentence
target_vocab_size = subword_encoder_zh.vocab_size
predicted_seq_without_bos_eos = [idx for idx in predicted_seq if idx < target_vocab_size]
predicted_sentence = subword_encoder_zh.decode(predicted_seq_without_bos_eos)

print("sentence:", sentence)
print("-" * 20)
print("predicted_seq:", predicted_seq)
print("-" * 20)
print("predicted_sentence:", predicted_sentence)

In [None]:
transformer.summary()

In [None]:
predicted_seq, attention_weights = evaluate(sentence)

# Automatically select MHA 2 of the last decoder layer, i.e. the attention from the decoder to the encoder
layer_name = f"decoder_layer{num_layers}_block2"

print("sentence:", sentence)
print("-" * 20)
print("predicted_seq:", predicted_seq)
print("-" * 20)
print("attention_weights.keys():")
# Use a separate loop variable so the `layer_name` selected above is not overwritten
for name, attn in attention_weights.items():
  print(f"{name}.shape: {attn.shape}")
print("-" * 20)
print("layer_name:", layer_name)

In [None]:
!apt-get install -y fonts-wqy-zenhei
!fc-cache -fv

In [None]:
import matplotlib as mpl
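# You may need to download a Chinese font file yourself so that matplotlib renders Chinese correctly
zhfont = mpl.font_manager.FontProperties(fname='/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc')
# On matplotlib >= 3.6 this style is named "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")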

# Visualize the attention weights for EN -> ZH translation (note: the weights are transposed for better rendering)
def plot_attention_weights(attention_weights, sentence, predicted_seq, layer_name, max_len_tar=None):
    fig = plt.figure(figsize=(17, 7))

    sentence = subword_encoder_en.encode(sentence)

    if max_len_tar:
        predicted_seq = predicted_seq[:max_len_tar]
    else:
        max_len_tar = len(predicted_seq)

    attention_weights = tf.squeeze(attention_weights[layer_name], axis=0)

    for head in range(attention_weights.shape[0]):
        ax = fig.add_subplot(2, 4, head + 1)

        attn_map = np.transpose(attention_weights[head][:max_len_tar, :])
        ax.matshow(attn_map, cmap='viridis')

        fontdict = {"fontproperties": zhfont}

        ax.set_xticks(range(max_len_tar))
        ax.set_xlim(-0.5, max_len_tar - 1.5)

        # Use a consistent list of tick labels for the y-axis
        y_tick_labels = ['<start>'] + [subword_encoder_en.decode([i]) for i in sentence] + ['<end>']
        ax.set_yticks(range(len(y_tick_labels)))
        ax.set_yticklabels(y_tick_labels, fontdict=fontdict)

        ax.set_xlabel('Head {}'.format(head + 1))
        ax.tick_params(axis="x", labelsize=12)
        ax.tick_params(axis="y", labelsize=12)

    plt.tight_layout()
    plt.show()
    plt.close(fig)

In [None]:
plot_attention_weights(attention_weights, sentence,
                       predicted_seq, layer_name, max_len_tar=18)

2. review positional encoding
3. input of transformer + padding
4. How to train it? Only compute the padding?
6. learning rate schedule
7. checkpoint
9. How to use TensorBoard?
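
For the learning rate schedule item, the warmup schedule from the original Transformer paper can be sketched in plain Python as below. The function name `transformer_lr` and the defaults `d_model=512` and `warmup_steps=4000` are assumptions taken from the paper's base configuration; the values used in this notebook may differ.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The rate rises linearly until `warmup_steps`, then decays.
for step in (1, 1000, 4000, 16000):
    print(step, transformer_lr(step))
```

The two arguments of `min` are equal at `step == warmup_steps`, which is where the learning rate peaks.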

Not sure what is wrong with the training step.
Rewrite it following https://www.tensorflow.org/guide/keras/customizing_what_happens_in_fit.