<a href="https://colab.research.google.com/github/TAMIDSpiyalong/Introduction-to-Machine-Learning-for-Energy/blob/main/Lecture_4b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers From Scratch

This lab builds a transformer from scratch for a translation task, which is the original use case of the encoder-decoder architecture. The data comes from a TensorFlow dataset.

## Objectives

1. Understand and build a dot-product self-attention block.
2. Use the attention block to build a transformer with multiple heads.
3. Train a transformer on the prepared dataset to translate automatically.

In [None]:
import math
import numpy as np
import tensorflow as tf

We'll build a transformer from scratch, layer-by-layer. We'll start with the **Multi-Head Self-Attention** layer since that's the most involved bit. Once we have that working, the rest of the model will look familiar if you've been following the course so far.

## Multi-Head Self-Attention

#### Scaled Dot Product Self-Attention


Inside each attention head is a **Scaled Dot Product Self-Attention** operation as we covered in the slides. Given *queries*, *keys*, and *values*, the operation returns a new "mix" of the values.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The following function implements this. It also accepts an optional mask, which is used both to ignore padding and to hide future tokens during decoding (the **look-ahead mask**). We'll cover masking later in the notebook.

In [3]:
def scaled_dot_product_attention(query, key, value, mask=None):
  key_dim = tf.cast(tf.shape(key)[-1], tf.float32)
  scaled_scores = tf.matmul(query, key, transpose_b=True) / np.sqrt(key_dim)

  if mask is not None:
    scaled_scores = tf.where(mask==0, -np.inf, scaled_scores)

  softmax = tf.keras.layers.Softmax()
  weights = softmax(scaled_scores)
  return tf.matmul(weights, value), weights
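
To make the masking step concrete before we cover it properly, here is a minimal NumPy sketch of the same operation with a look-ahead mask applied. The names `attention_np`, `softmax`, and `look_ahead_mask` are hypothetical and only for illustration; this is not the TensorFlow function above, just a sketch of the same arithmetic under the assumption that a mask value of 0 means "do not attend".

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability; exp(-inf) evaluates to 0.
    shifted = scores - np.max(scores, axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=axis, keepdims=True)

def attention_np(query, key, value, mask=None):
    # Hypothetical NumPy counterpart of scaled dot-product attention.
    d_k = key.shape[-1]
    scores = query @ np.swapaxes(key, -1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Positions where the mask is 0 receive a score of -inf, so a weight of 0.
        scores = np.where(mask == 0, -np.inf, scores)
    weights = softmax(scores)
    return weights @ value, weights

rng = np.random.default_rng(0)
q = rng.random((3, 4))
k = rng.random((3, 4))
v = rng.random((3, 4))

# Lower-triangular look-ahead mask: position i may attend to positions <= i only.
look_ahead_mask = np.tril(np.ones((3, 3)))

out, w = attention_np(q, k, v, look_ahead_mask)
print(np.round(w, 3))
```

Every row of the printed weights still sums to 1, but everything above the diagonal is zero, so the first token can only "mix" its own value.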

Suppose our *queries*, *keys*, and *values* are each a sequence of length 3 with an embedding dimension of 4.

In [4]:
seq_len = 3
embed_dim = 4

queries = np.random.rand(seq_len, embed_dim)
keys = np.random.rand(seq_len, embed_dim)
values = np.random.rand(seq_len, embed_dim)

print("Queries:\n", queries)

Queries:
 [[0.90769595 0.85981611 0.07292586 0.15409233]
 [0.85617015 0.36758837 0.63618922 0.5813505 ]
 [0.91135708 0.81539009 0.19281194 0.57139135]]


This would be the self-attention output and weights.

In [5]:
output, attn_weights = scaled_dot_product_attention(queries, keys, values)

print("Output\n", output, "\n")
print("Weights\n", attn_weights)

Output
 tf.Tensor(
[[0.42496407 0.5024725  0.57338387 0.62166166]
 [0.41672707 0.49846074 0.5729456  0.62078524]
 [0.42951483 0.50796884 0.57826084 0.6218555 ]], shape=(3, 4), dtype=float32) 

Weights
 tf.Tensor(
[[0.27782482 0.3765212  0.34565398]
 [0.2895227  0.3526816  0.3577957 ]
 [0.2678568  0.38508782 0.34705544]], shape=(3, 3), dtype=float32)


#### Generating queries, keys, and values for multiple heads.

Now that we have a way to calculate self-attention, let's actually generate the input *queries*, *keys*, and *values* for multiple heads.

It's easiest to understand multi-head attention when each head has its own separate set of weights, and we can certainly code it that way. But we can also "simulate" different heads with a single query matrix, a single key matrix, and a single value matrix.
<br><br>
We'll do both. First we'll create *query*, *key*, and *value* vectors using separate weights per head.
<br><br>
In the slides, we used an example of 12-dimensional embeddings processed by three attention heads, and we'll do the same here.

In [66]:
batch_size = 1
seq_len = 3
embed_dim = 12
num_heads = 3
head_dim = embed_dim // num_heads

print(f"Dimension of each head: {head_dim}")

Dimension of each head: 4


**Using separate weight matrices per head**

Suppose these are our input embeddings. Here we have a batch of 1 containing a sequence of length 3, with each element being a 12-dimensional embedding.

In [67]:
x = np.random.rand(batch_size, seq_len, embed_dim).round(1)
print("Input shape: ", x.shape, "\n")
print("Input:\n", x)

Input shape:  (1, 3, 12) 

Input:
 [[[0.9 0.1 0.1 0.3 0.5 0.6 0.4 0.1 0.2 0.1 0.5 0.8]
  [0.9 0.1 0.  0.7 0.4 0.7 0.4 0.4 0.2 0.7 0.  0.8]
  [1.  0.6 0.2 0.6 0.4 0.2 0.3 0.3 0.9 0.8 0.2 0.2]]]


We'll declare three sets of *query* weights (one for each head), three sets of *key* weights, and three sets of *value* weights. Remember that each weight matrix should have dimensions $d \times d/h$, where $d$ is the embedding dimension and $h$ is the number of heads.

In [8]:
# The query weights for each head.
wq0 = np.random.rand(embed_dim, head_dim).round(1)
wq1 = np.random.rand(embed_dim, head_dim).round(1)
wq2 = np.random.rand(embed_dim, head_dim).round(1)

# The key weights for each head.
wk0 = np.random.rand(embed_dim, head_dim).round(1)
wk1 = np.random.rand(embed_dim, head_dim).round(1)
wk2 = np.random.rand(embed_dim, head_dim).round(1)

# The value weights for each head.
wv0 = np.random.rand(embed_dim, head_dim).round(1)
wv1 = np.random.rand(embed_dim, head_dim).round(1)
wv2 = np.random.rand(embed_dim, head_dim).round(1)

In [9]:
print("The three sets of query weights (one for each head):")
print("wq0:\n", wq0)
print("wq1:\n", wq1)
print("wq2:\n", wq2)

The three sets of query weights (one for each head):
wq0:
 [[0.2 0.2 0.  0.4]
 [0.1 0.8 1.  0.4]
 [0.  0.  0.1 0.2]
 [0.8 0.4 0.1 0.2]
 [0.3 0.6 0.9 0.2]
 [0.5 1.  0.3 1. ]
 [0.6 0.7 0.3 0.4]
 [0.4 0.1 0.9 0.5]
 [0.2 0.9 0.1 0.3]
 [0.6 0.3 0.5 0.9]
 [0.9 0.7 0.5 0.9]
 [0.4 0.8 0.  0.3]]
wq1:
 [[1.  0.1 0.2 0.6]
 [0.9 0.3 0.1 0.3]
 [0.5 0.3 0.6 0.1]
 [0.2 0.7 0.5 0.4]
 [0.3 0.  0.3 0.8]
 [0.3 0.9 0.4 0.9]
 [0.4 0.6 0.9 0.2]
 [0.9 0.5 0.7 0.3]
 [0.  0.8 0.9 0.1]
 [0.9 0.1 0.9 0.3]
 [0.2 0.3 0.2 0.9]
 [0.5 0.5 0.1 0.8]]
wq2:
 [[0.9 0.3 0.2 1. ]
 [0.8 0.2 0.3 0.1]
 [0.3 0.  0.  0.7]
 [0.4 0.4 0.6 0.3]
 [0.9 0.6 0.2 0. ]
 [0.3 0.1 0.6 0.6]
 [0.4 0.3 0.5 0.6]
 [0.  1.  0.6 0.5]
 [0.  0.5 0.2 0.1]
 [0.4 0.5 0.7 0.7]
 [0.7 0.4 0.8 0.5]
 [0.1 0.9 0.  0.2]]


We'll generate our *queries*, *keys*, and *values* for each head by multiplying our input by the weights.

In [10]:
# Generated queries, keys, and values for the first head.
q0 = np.dot(x, wq0)
k0 = np.dot(x, wk0)
v0 = np.dot(x, wv0)

# Generated queries, keys, and values for the second head.
q1 = np.dot(x, wq1)
k1 = np.dot(x, wk1)
v1 = np.dot(x, wv1)

# Generated queries, keys, and values for the third head.
q2 = np.dot(x, wq2)
k2 = np.dot(x, wk2)
v2 = np.dot(x, wv2)
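
A quick aside on the matrix product used here: our input is 3-dimensional (batch, sequence, embedding) while each weight matrix is 2-dimensional. The sketch below, which uses hypothetical stand-in arrays `x_demo` and `w_demo` rather than the notebook's variables, shows that `np.dot`, the `@` operator, and an explicit `np.einsum` all contract the embedding axis and give the same per-token projection.

```python
import numpy as np

rng = np.random.default_rng(1)
x_demo = rng.random((1, 3, 12))   # (batch_size, seq_len, embed_dim)
w_demo = rng.random((12, 4))      # (embed_dim, head_dim)

# All three forms sum over the shared embed_dim axis.
via_dot = np.dot(x_demo, w_demo)
via_matmul = x_demo @ w_demo
via_einsum = np.einsum("bse,eh->bsh", x_demo, w_demo)

print(via_dot.shape)  # (1, 3, 4)
print(np.allclose(via_dot, via_matmul) and np.allclose(via_dot, via_einsum))
```

The einsum string spells out what is happening: for every batch `b` and position `s`, the `e`-dimensional embedding is projected down to `h` head dimensions.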

These are the resulting *query*, *key*, and *value* vectors for the first head.

In [11]:
print("Q, K, and V for first head:\n")

print(f"q0 {q0.shape}:\n", q0, "\n")
print(f"k0 {k0.shape}:\n", k0, "\n")
print(f"v0 {v0.shape}:\n", v0)

Q, K, and V for first head:

q0 (1, 3, 4):
 [[[2.1  3.39 2.03 2.34]
  [2.04 2.52 2.43 2.5 ]
  [2.24 2.78 2.71 2.44]]] 

k0 (1, 3, 4):
 [[[2.38 2.64 3.51 2.73]
  [1.37 1.98 2.92 2.11]
  [2.03 2.88 3.44 2.31]]] 

v0 (1, 3, 4):
 [[[3.43 1.81 1.73 2.01]
  [3.73 1.65 2.08 2.15]
  [3.75 1.61 2.03 1.65]]]


Now that we have our Q, K, V vectors, we can just pass them to our self-attention operation. Here we're calculating the output and attention weights for the first head.

In [12]:
out0, attn_weights0 = scaled_dot_product_attention(q0, k0, v0)

print("Output from first attention head: ", out0, "\n")
print("Attention weights from first head: ", attn_weights0)

Output from first attention head:  tf.Tensor(
[[[3.552432  1.7339897 1.846055  1.8811185]
  [3.5430186 1.7399838 1.8375058 1.893627 ]
  [3.5420089 1.74048   1.8362217 1.8924185]]], shape=(1, 3, 4), dtype=float32) 

Attention weights from first head:  tf.Tensor(
[[[0.6162399  0.01854475 0.36521524]
  [0.6454069  0.02256118 0.3320319 ]
  [0.6488697  0.01765081 0.33347952]]], shape=(1, 3, 3), dtype=float32)


Here are the other two (attention weights are ignored).

In [13]:
out1, _ = scaled_dot_product_attention(q1, k1, v1)
out2, _ = scaled_dot_product_attention(q2, k2, v2)

print("Output from second attention head: ", out1, "\n")
print("Output from third attention head: ", out2)

Output from second attention head:  tf.Tensor(
[[[2.8143632 3.2294424 1.9993955 2.9046278]
  [2.7964728 3.2433422 1.9857107 2.8767433]
  [2.805351  3.2363636 1.9928755 2.891967 ]]], shape=(1, 3, 4), dtype=float32) 

Output from third attention head:  tf.Tensor(
[[[3.342678  2.4526718 2.493952  2.5026987]
  [3.3433924 2.4535856 2.495035  2.501644 ]
  [3.338001  2.453367  2.498007  2.4990222]]], shape=(1, 3, 4), dtype=float32)


As we covered in the slides, once we have each head's output, we concatenate them and then put them through a linear layer for further processing.

In [14]:
combined_out_a = np.concatenate((out0, out1, out2), axis=-1)
print(f"Combined output from all heads {combined_out_a.shape}:")
print(combined_out_a)

# The final step would be to run combined_out_a through a linear/dense layer
# for further processing.

Combined output from all heads (1, 3, 12):
[[[3.552432  1.7339897 1.846055  1.8811185 2.8143632 3.2294424 1.9993955
   2.9046278 3.342678  2.4526718 2.493952  2.5026987]
  [3.5430186 1.7399838 1.8375058 1.893627  2.7964728 3.2433422 1.9857107
   2.8767433 3.3433924 2.4535856 2.495035  2.501644 ]
  [3.5420089 1.74048   1.8362217 1.8924185 2.805351  3.2363636 1.9928755
   2.891967  3.338001  2.453367  2.498007  2.4990222]]]


So that's a complete run of **multi-head self-attention** using separate sets of weights per head.<br>

Let's now get the same thing done using a single query weight matrix, single key weight matrix, and single value weight matrix.<br><br>
These were our separate per-head query weights:

In [15]:
print("Query weights for first head: \n", wq0, "\n")
print("Query weights for second head: \n", wq1, "\n")
print("Query weights for third head: \n", wq2)

Query weights for first head: 
 [[0.2 0.2 0.  0.4]
 [0.1 0.8 1.  0.4]
 [0.  0.  0.1 0.2]
 [0.8 0.4 0.1 0.2]
 [0.3 0.6 0.9 0.2]
 [0.5 1.  0.3 1. ]
 [0.6 0.7 0.3 0.4]
 [0.4 0.1 0.9 0.5]
 [0.2 0.9 0.1 0.3]
 [0.6 0.3 0.5 0.9]
 [0.9 0.7 0.5 0.9]
 [0.4 0.8 0.  0.3]] 

Query weights for second head: 
 [[1.  0.1 0.2 0.6]
 [0.9 0.3 0.1 0.3]
 [0.5 0.3 0.6 0.1]
 [0.2 0.7 0.5 0.4]
 [0.3 0.  0.3 0.8]
 [0.3 0.9 0.4 0.9]
 [0.4 0.6 0.9 0.2]
 [0.9 0.5 0.7 0.3]
 [0.  0.8 0.9 0.1]
 [0.9 0.1 0.9 0.3]
 [0.2 0.3 0.2 0.9]
 [0.5 0.5 0.1 0.8]] 

Query weights for third head: 
 [[0.9 0.3 0.2 1. ]
 [0.8 0.2 0.3 0.1]
 [0.3 0.  0.  0.7]
 [0.4 0.4 0.6 0.3]
 [0.9 0.6 0.2 0. ]
 [0.3 0.1 0.6 0.6]
 [0.4 0.3 0.5 0.6]
 [0.  1.  0.6 0.5]
 [0.  0.5 0.2 0.1]
 [0.4 0.5 0.7 0.7]
 [0.7 0.4 0.8 0.5]
 [0.1 0.9 0.  0.2]]


Suppose that instead of declaring three separate query weight matrices, we had declared a single $d \times d$ matrix. Here we concatenate our per-head query weights rather than declaring a new set of weights, so that we get the same results as before.

In [16]:
wq = np.concatenate((wq0, wq1, wq2), axis=1)
print(f"Single query weight matrix {wq.shape}: \n", wq)

Single query weight matrix (12, 12): 
 [[0.2 0.2 0.  0.4 1.  0.1 0.2 0.6 0.9 0.3 0.2 1. ]
 [0.1 0.8 1.  0.4 0.9 0.3 0.1 0.3 0.8 0.2 0.3 0.1]
 [0.  0.  0.1 0.2 0.5 0.3 0.6 0.1 0.3 0.  0.  0.7]
 [0.8 0.4 0.1 0.2 0.2 0.7 0.5 0.4 0.4 0.4 0.6 0.3]
 [0.3 0.6 0.9 0.2 0.3 0.  0.3 0.8 0.9 0.6 0.2 0. ]
 [0.5 1.  0.3 1.  0.3 0.9 0.4 0.9 0.3 0.1 0.6 0.6]
 [0.6 0.7 0.3 0.4 0.4 0.6 0.9 0.2 0.4 0.3 0.5 0.6]
 [0.4 0.1 0.9 0.5 0.9 0.5 0.7 0.3 0.  1.  0.6 0.5]
 [0.2 0.9 0.1 0.3 0.  0.8 0.9 0.1 0.  0.5 0.2 0.1]
 [0.6 0.3 0.5 0.9 0.9 0.1 0.9 0.3 0.4 0.5 0.7 0.7]
 [0.9 0.7 0.5 0.9 0.2 0.3 0.2 0.9 0.7 0.4 0.8 0.5]
 [0.4 0.8 0.  0.3 0.5 0.5 0.1 0.8 0.1 0.9 0.  0.2]]
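
Why does concatenating the weights work? Multiplying by a column-wise concatenation of matrices gives the column-wise concatenation of the individual products. Here is a small sketch of that identity with random stand-in arrays (`x_demo` and `per_head_weights` are hypothetical names, not the notebook's variables).

```python
import numpy as np

rng = np.random.default_rng(2)
d, h = 12, 3
x_demo = rng.random((1, 3, d))
per_head_weights = [rng.random((d, d // h)) for _ in range(h)]

# Option 1: project with each head's weights, then concatenate the results.
separate = np.concatenate([x_demo @ w for w in per_head_weights], axis=-1)

# Option 2: concatenate the weights first, then project once.
w_single = np.concatenate(per_head_weights, axis=1)
combined = x_demo @ w_single

print(np.allclose(separate, combined))  # True
print(sum(w.size for w in per_head_weights), w_single.size)  # 144 144
```

The parameter count is identical either way ($h \cdot d \cdot d/h = d^2$); the single matrix is just a more convenient layout.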


In the same vein, pretend we declared a single key weight matrix, and single value weight matrix.

In [17]:
wk = np.concatenate((wk0, wk1, wk2), axis=1)
wv = np.concatenate((wv0, wv1, wv2), axis=1)

print(f"Single key weight matrix {wk.shape}:\n", wk, "\n")
print(f"Single value weight matrix {wv.shape}:\n", wv)

Single key weight matrix (12, 12):
 [[0.1 0.6 0.5 0.7 0.3 0.5 0.  0.4 0.9 0.7 0.1 1. ]
 [0.4 0.4 0.8 0.  0.8 0.9 0.2 0.1 1.  0.1 0.8 0.8]
 [0.1 0.5 0.9 0.9 0.6 0.8 0.6 0.4 0.  0.6 0.3 0.7]
 [0.9 0.2 0.5 0.  0.4 0.8 0.5 0.8 0.1 0.3 0.3 0. ]
 [0.  0.7 0.7 0.7 0.2 0.8 0.3 0.6 0.4 0.1 0.4 0.6]
 [1.  0.1 0.7 0.6 0.2 0.2 0.1 0.5 0.3 0.2 0.  0.8]
 [0.1 0.  0.2 0.3 0.6 1.  0.6 0.4 0.7 0.7 0.9 0.1]
 [0.1 0.2 0.7 0.2 0.9 0.5 1.  0.8 1.  0.6 0.2 0.9]
 [0.9 0.9 0.9 0.6 0.4 0.5 0.7 0.1 1.  0.7 0.2 0.6]
 [0.1 0.8 0.7 0.7 0.  0.7 0.2 0.8 0.2 0.3 0.2 0.7]
 [0.6 0.9 0.3 0.8 0.4 1.  0.5 0.7 0.1 0.1 0.1 0.9]
 [0.1 0.4 0.5 0.9 0.9 0.3 0.2 0.8 0.6 0.7 0.  0.4]] 

Single value weight matrix (12, 12):
 [[1.  0.  0.3 0.3 1.  0.7 0.3 0.  0.9 0.4 0.7 0.6]
 [0.9 0.1 0.2 0.2 0.  0.2 0.5 0.4 0.5 0.3 0.4 0.2]
 [0.3 0.6 0.5 0.4 0.7 0.  0.2 0.4 0.6 0.4 0.7 0.7]
 [0.9 0.1 0.1 0.1 0.8 0.4 0.8 0.9 0.8 0.7 0.5 0.5]
 [0.9 0.2 0.1 0.1 0.4 0.9 0.3 0.4 0.7 0.2 0.6 0.7]
 [0.1 0.5 0.7 0.6 0.2 0.8 0.5 0.5 0.3 0.2 0.3 0.7]
 [0.7

Now we can calculate all our *queries*, *keys*, and *values* with three dot products.

In [18]:
q_s = np.dot(x, wq)
k_s = np.dot(x, wk)
v_s = np.dot(x, wv)

These are our resulting query vectors (we'll call them "combined queries"). How do we simulate different heads with this?

In [19]:
print(f"Query vectors using a single weight matrix {q_s.shape}:\n", q_s)

Query vectors using a single weight matrix (1, 3, 12):
 [[[2.1  3.39 2.03 2.34 2.63 2.52 2.65 2.54 2.7  2.13 1.95 2.25]
  [2.04 2.52 2.43 2.5  3.51 1.91 2.38 2.23 2.63 2.17 2.18 2.65]
  [2.24 2.78 2.71 2.44 2.94 1.95 2.55 2.38 2.64 2.55 2.3  1.98]]]


Somehow, we need to separate these vectors such that they're treated like three separate sets by the self-attention operation.

In [20]:
print(q0, "\n")
print(q1, "\n")
print(q2)

[[[2.1  3.39 2.03 2.34]
  [2.04 2.52 2.43 2.5 ]
  [2.24 2.78 2.71 2.44]]] 

[[[2.63 2.52 2.65 2.54]
  [3.51 1.91 2.38 2.23]
  [2.94 1.95 2.55 2.38]]] 

[[[2.7  2.13 1.95 2.25]
  [2.63 2.17 2.18 2.65]
  [2.64 2.55 2.3  1.98]]]


Notice how each set of per-head queries looks like we took the combined queries, and chopped them vertically every four dimensions.
<br><br>
We can split our combined queries into separate heads, each of dimension $d/h$, using **reshape** and **transpose**.<br><br>
The first step is to *reshape* our combined queries from a shape of:<br>
(batch_size, seq_len, embed_dim)<br>

into a shape of<br>
 (batch_size, seq_len, num_heads, head_dim).
 <br>

 https://www.tensorflow.org/api_docs/python/tf/reshape

In [21]:
# Note: we can achieve the same thing by passing -1 instead of seq_len.
q_s_reshaped = tf.reshape(q_s, (batch_size, seq_len, num_heads, head_dim))
print(f"Combined queries: {q_s.shape}\n", q_s, "\n")
print(f"Reshaped into separate heads: {q_s_reshaped.shape}\n", q_s_reshaped)

Combined queries: (1, 3, 12)
 [[[2.1  3.39 2.03 2.34 2.63 2.52 2.65 2.54 2.7  2.13 1.95 2.25]
  [2.04 2.52 2.43 2.5  3.51 1.91 2.38 2.23 2.63 2.17 2.18 2.65]
  [2.24 2.78 2.71 2.44 2.94 1.95 2.55 2.38 2.64 2.55 2.3  1.98]]] 

Reshaped into separate heads: (1, 3, 3, 4)
 tf.Tensor(
[[[[2.1  3.39 2.03 2.34]
   [2.63 2.52 2.65 2.54]
   [2.7  2.13 1.95 2.25]]

  [[2.04 2.52 2.43 2.5 ]
   [3.51 1.91 2.38 2.23]
   [2.63 2.17 2.18 2.65]]

  [[2.24 2.78 2.71 2.44]
   [2.94 1.95 2.55 2.38]
   [2.64 2.55 2.3  1.98]]]], shape=(1, 3, 3, 4), dtype=float64)


At this point, we have our desired shape. The next step is to *transpose* it so that it simulates vertically chopping our combined queries. After transposing, our matrix dimensions become:<br>
(batch_size, num_heads, seq_len, head_dim)<br>

https://www.tensorflow.org/api_docs/python/tf/transpose

In [22]:
q_s_transposed = tf.transpose(q_s_reshaped, perm=[0, 2, 1, 3]).numpy()
print(f"Queries transposed into \"separate\" heads {q_s_transposed.shape}:\n",
      q_s_transposed)

Queries transposed into "separate" heads (1, 3, 3, 4):
 [[[[2.1  3.39 2.03 2.34]
   [2.04 2.52 2.43 2.5 ]
   [2.24 2.78 2.71 2.44]]

  [[2.63 2.52 2.65 2.54]
   [3.51 1.91 2.38 2.23]
   [2.94 1.95 2.55 2.38]]

  [[2.7  2.13 1.95 2.25]
   [2.63 2.17 2.18 2.65]
   [2.64 2.55 2.3  1.98]]]]
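
It's worth convincing ourselves that the reshape and the transpose are both needed. The NumPy sketch below (with a hypothetical `combined` array filled with 0..35 so values are easy to track) checks that reshape-then-transpose matches slicing the columns by hand, and that reshaping straight into the final shape does not.

```python
import numpy as np

# Hypothetical demo sizes matching the walkthrough: batch 1, 3 tokens, 3 heads of 4 dims.
b, s, h, dh = 1, 3, 3, 4

# Stand-in for the "combined queries", filled with 0..35.
combined = np.arange(b * s * h * dh, dtype=float).reshape(b, s, h * dh)

# Reshape, then transpose: (b, s, h*dh) -> (b, s, h, dh) -> (b, h, s, dh).
split = combined.reshape(b, s, h, dh).transpose(0, 2, 1, 3)

# Head i should equal columns [i*dh, (i+1)*dh) of the combined matrix.
matches = [np.array_equal(split[0, i], combined[0, :, i * dh:(i + 1) * dh])
           for i in range(h)]
print(all(matches))  # True

# Reshaping straight to (b, h, s, dh) gives the right shape but the wrong layout.
naive = combined.reshape(b, h, s, dh)
print(naive.shape == split.shape, np.array_equal(naive, split))  # True False
```

The naive reshape cuts each token's 12 values into three rows of the *same* head, mixing up tokens and heads, which is why the transpose step matters.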


If we compare this against the separate per-head queries we calculated previously, we see the same result except we now have all our queries in a single matrix.

In [23]:
print("The separate per-head query matrices from before: ")
print(q0, "\n")
print(q1, "\n")
print(q2)

The separate per-head query matrices from before: 
[[[2.1  3.39 2.03 2.34]
  [2.04 2.52 2.43 2.5 ]
  [2.24 2.78 2.71 2.44]]] 

[[[2.63 2.52 2.65 2.54]
  [3.51 1.91 2.38 2.23]
  [2.94 1.95 2.55 2.38]]] 

[[[2.7  2.13 1.95 2.25]
  [2.63 2.17 2.18 2.65]
  [2.64 2.55 2.3  1.98]]]


Let's do the exact same thing with our combined keys and values.

In [24]:
k_s_transposed = tf.transpose(tf.reshape(k_s, (batch_size, -1, num_heads, head_dim)), perm=[0, 2, 1, 3]).numpy()
v_s_transposed = tf.transpose(tf.reshape(v_s, (batch_size, -1, num_heads, head_dim)), perm=[0, 2, 1, 3]).numpy()

print(f"Keys for all heads in a single matrix {k_s_transposed.shape}: \n", k_s_transposed, "\n")
print(f"Values for all heads in a single matrix {v_s_transposed.shape}: \n", v_s_transposed)

Keys for all heads in a single matrix (1, 3, 3, 4): 
 [[[[2.38 2.64 3.51 2.73]
   [1.37 1.98 2.92 2.11]
   [2.03 2.88 3.44 2.31]]

  [[2.53 3.76 2.07 2.5 ]
   [2.64 3.56 1.9  2.58]
   [2.32 3.73 2.17 2.81]]

  [[3.34 2.41 1.92 3.3 ]
   [3.41 2.24 1.99 3.37]
   [3.14 1.9  1.69 3.44]]]] 

Values for all heads in a single matrix (1, 3, 3, 4): 
 [[[[3.43 1.81 1.73 2.01]
   [3.73 1.65 2.08 2.15]
   [3.75 1.61 2.03 1.65]]

  [[2.86 3.21 1.96 2.7 ]
   [2.53 3.44 1.83 2.64]
   [2.96 3.11 2.14 3.24]]

  [[3.18 2.36 2.44 2.56]
   [3.45 2.58 2.64 2.36]
   [3.46 2.35 2.25 2.73]]]]


Set up this way, we can now calculate the outputs from all attention heads with a single call to our self-attention operation.

In [25]:
all_heads_output, all_attn_weights = scaled_dot_product_attention(q_s_transposed,
                                                                  k_s_transposed,
                                                                  v_s_transposed)
print("Self attention output:\n", all_heads_output)

Self attention output:
 tf.Tensor(
[[[[3.552432  1.7339897 1.846055  1.8811185]
   [3.5430188 1.7399838 1.8375058 1.893627 ]
   [3.5420089 1.74048   1.8362217 1.8924185]]

  [[2.8143635 3.2294426 1.9993956 2.904628 ]
   [2.7964725 3.243342  1.9857105 2.8767433]
   [2.8053508 3.2363634 1.9928751 2.8919668]]

  [[3.342678  2.4526718 2.493952  2.5026991]
   [3.3433924 2.4535856 2.4950347 2.5016437]
   [3.338001  2.453367  2.4980073 2.4990222]]]], shape=(1, 3, 3, 4), dtype=float32)


As a sanity check, we can compare this against the outputs from individual heads we calculated earlier:

In [26]:
print("Per head outputs from using separate sets of weights per head:")
print(out0, "\n")
print(out1, "\n")
print(out2)

Per head outputs from using separate sets of weights per head:
tf.Tensor(
[[[3.552432  1.7339897 1.846055  1.8811185]
  [3.5430186 1.7399838 1.8375058 1.893627 ]
  [3.5420089 1.74048   1.8362217 1.8924185]]], shape=(1, 3, 4), dtype=float32) 

tf.Tensor(
[[[2.8143632 3.2294424 1.9993955 2.9046278]
  [2.7964728 3.2433422 1.9857107 2.8767433]
  [2.805351  3.2363636 1.9928755 2.891967 ]]], shape=(1, 3, 4), dtype=float32) 

tf.Tensor(
[[[3.342678  2.4526718 2.493952  2.5026987]
  [3.3433924 2.4535856 2.495035  2.501644 ]
  [3.338001  2.453367  2.498007  2.4990222]]], shape=(1, 3, 4), dtype=float32)


To get the final concatenated result, we need to reverse our **reshape** and **transpose** operation, starting with the **transpose** this time.

In [27]:
combined_out_b = tf.reshape(tf.transpose(all_heads_output, perm=[0, 2, 1, 3]),
                            shape=(batch_size, seq_len, embed_dim))
print("Final output from using single query, key, value matrices:\n",
      combined_out_b, "\n")
print("Final output from using separate query, key, value matrices per head:\n",
      combined_out_a)

Final output from using single query, key, value matrices:
 tf.Tensor(
[[[3.552432  1.7339897 1.846055  1.8811185 2.8143635 3.2294426 1.9993956
   2.904628  3.342678  2.4526718 2.493952  2.5026991]
  [3.5430188 1.7399838 1.8375058 1.893627  2.7964725 3.243342  1.9857105
   2.8767433 3.3433924 2.4535856 2.4950347 2.5016437]
  [3.5420089 1.74048   1.8362217 1.8924185 2.8053508 3.2363634 1.9928751
   2.8919668 3.338001  2.453367  2.4980073 2.4990222]]], shape=(1, 3, 12), dtype=float32) 

Final output from using separate query, key, value matrices per head:
 [[[3.552432  1.7339897 1.846055  1.8811185 2.8143632 3.2294424 1.9993955
   2.9046278 3.342678  2.4526718 2.493952  2.5026987]
  [3.5430186 1.7399838 1.8375058 1.893627  2.7964728 3.2433422 1.9857107
   2.8767433 3.3433924 2.4535856 2.495035  2.501644 ]
  [3.5420089 1.74048   1.8362217 1.8924185 2.805351  3.2363636 1.9928755
   2.891967  3.338001  2.453367  2.498007  2.4990222]]]
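
The two results above agree up to float32 rounding in the last digit or so, which is why a tolerance-based comparison such as `np.allclose` is a better check than eyeballing (or exact equality). As a further sanity check, here is a sketch showing that merging is the exact inverse of splitting. The helpers `split_heads_np` and `merge_heads_np` are hypothetical NumPy stand-ins for the reshape/transpose steps we just walked through.

```python
import numpy as np

def split_heads_np(x, num_heads):
    # (batch, seq, d_model) -> (batch, heads, seq, d_head)
    b, s, d = x.shape
    return x.reshape(b, s, num_heads, d // num_heads).transpose(0, 2, 1, 3)

def merge_heads_np(x):
    # (batch, heads, seq, d_head) -> (batch, seq, d_model); transpose first, then reshape.
    b, h, s, dh = x.shape
    return x.transpose(0, 2, 1, 3).reshape(b, s, h * dh)

rng = np.random.default_rng(3)
combined = rng.random((1, 3, 12))

round_trip = merge_heads_np(split_heads_np(combined, num_heads=3))
print(np.array_equal(round_trip, combined))  # True
```

Because both steps only move values around without doing any arithmetic, the round trip is exact.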


We can encapsulate everything we just covered in a class.

In [28]:
class MultiHeadSelfAttention(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads):
    super(MultiHeadSelfAttention, self).__init__()
    self.d_model = d_model
    self.num_heads = num_heads

    self.d_head = self.d_model // self.num_heads

    self.wq = tf.keras.layers.Dense(self.d_model)
    self.wk = tf.keras.layers.Dense(self.d_model)
    self.wv = tf.keras.layers.Dense(self.d_model)

    # Linear layer to generate the final output.
    self.dense = tf.keras.layers.Dense(self.d_model)

  def split_heads(self, x):
    # Use the dynamic shape so this also works when the static batch size is None.
    batch_size = tf.shape(x)[0]

    split_inputs = tf.reshape(x, (batch_size, -1, self.num_heads, self.d_head))
    return tf.transpose(split_inputs, perm=[0, 2, 1, 3])

  def merge_heads(self, x):
    # Use the dynamic shape so this also works when the static batch size is None.
    batch_size = tf.shape(x)[0]

    merged_inputs = tf.transpose(x, perm=[0, 2, 1, 3])
    return tf.reshape(merged_inputs, (batch_size, -1, self.d_model))

  def call(self, q, k, v, mask):
    qs = self.wq(q)
    ks = self.wk(k)
    vs = self.wv(v)

    qs = self.split_heads(qs)
    ks = self.split_heads(ks)
    vs = self.split_heads(vs)

    output, attn_weights = scaled_dot_product_attention(qs, ks, vs, mask)
    output = self.merge_heads(output)

    return self.dense(output), attn_weights


In [29]:
mhsa = MultiHeadSelfAttention(12, 3)

output, attn_weights = mhsa(x, x, x, None)
print(f"MHSA output {output.shape}:")
print(output)

MHSA output (1, 3, 12):
tf.Tensor(
[[[-0.75220907  0.60848594 -0.24099663  0.8574363  -1.7627103
    1.0225685   0.06094044  0.5666399  -0.13977948 -0.63689446
   -0.660741    1.1435963 ]
  [-0.7572313   0.6086106  -0.23586354  0.8668075  -1.7626736
    1.0215359   0.06591715  0.56524444 -0.12876031 -0.6342495
   -0.6591168   1.1385059 ]
  [-0.7551215   0.6213727  -0.22492132  0.85145015 -1.7764089
    1.0195215   0.04244602  0.54266554 -0.15470764 -0.6431295
   -0.6555153   1.1523046 ]]], shape=(1, 3, 12), dtype=float32)


## Encoder Block

We can now build our **Encoder Block**. In addition to the **Multi-Head Self Attention** layer, the **Encoder Block** also has **skip connections**, **layer normalization steps**, and a **two-layer feed-forward neural network**. The original **Attention Is All You Need** paper also included some **dropout** applied to the self-attention output which isn't shown in the illustration below (see references for a link to the paper).


Since a two-layer feed forward neural network is used in multiple places in the transformer, here's a function which creates and returns one.

In [30]:
def feed_forward_network(d_model, hidden_dim):
  return tf.keras.Sequential([
      tf.keras.layers.Dense(hidden_dim, activation='relu'),
      tf.keras.layers.Dense(d_model)
  ])
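
One property of this network worth keeping in mind: a dense layer applied to a (batch, sequence, features) input acts on the last axis only, so every position in the sequence is transformed independently with the same weights. Here is a minimal NumPy sketch of that behaviour; the weights `w1`, `b1`, `w2`, `b2` and the function `ffn_np` are hypothetical stand-ins, not the Keras layers above.

```python
import numpy as np

rng = np.random.default_rng(4)
d_model, hidden_dim = 12, 48

# Hypothetical weights for a two-layer network: d_model -> hidden_dim -> d_model.
w1, b1 = rng.normal(size=(d_model, hidden_dim)), np.zeros(hidden_dim)
w2, b2 = rng.normal(size=(hidden_dim, d_model)), np.zeros(d_model)

def ffn_np(x):
    hidden = np.maximum(x @ w1 + b1, 0.0)  # ReLU activation
    return hidden @ w2 + b2

x_demo = rng.random((1, 3, d_model))

full = ffn_np(x_demo)                     # all three positions at once
one_position = ffn_np(x_demo[:, 1:2, :])  # only the middle position

print(full.shape)  # (1, 3, 12)
print(np.allclose(full[:, 1:2, :], one_position))  # True
```

Processing the middle token alone gives the same result as processing it alongside its neighbours. Any mixing *between* positions happens in the attention layer, not here.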

This is our encoder block containing all the layers and steps from the preceding illustration (plus dropout).

In [31]:
class EncoderBlock(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, hidden_dim, dropout_rate=0.1):
    super(EncoderBlock, self).__init__()

    self.mhsa = MultiHeadSelfAttention(d_model, num_heads)
    self.ffn = feed_forward_network(d_model, hidden_dim)

    self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
    self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    self.layernorm1 = tf.keras.layers.LayerNormalization()
    self.layernorm2 = tf.keras.layers.LayerNormalization()

  def call(self, x, training, mask):
    mhsa_output, attn_weights = self.mhsa(x, x, x, mask)
    mhsa_output = self.dropout1(mhsa_output, training=training)
    mhsa_output = self.layernorm1(x + mhsa_output)

    ffn_output = self.ffn(mhsa_output)
    ffn_output = self.dropout2(ffn_output, training=training)
    output = self.layernorm2(mhsa_output + ffn_output)

    return output, attn_weights


Suppose we have an embedding dimension of 12, and we want 3 attention heads and a feed forward network with a hidden dimension of 48 (4x the embedding dimension). We would declare and use a single encoder block like so:

In [32]:
encoder_block = EncoderBlock(12, 3, 48)

block_output,  _ = encoder_block(x, True, None)
print(f"Output from single encoder block {block_output.shape}:")
print(block_output)

Output from single encoder block (1, 3, 12):
tf.Tensor(
[[[ 1.6569743   1.3320812  -1.1963488  -0.15327895 -0.8070753
   -0.3270114   0.02493571  0.4750662   1.5313147  -0.67409134
   -0.43159893 -1.4309677 ]
  [ 1.7250888   1.208477   -1.2379202  -0.7765264  -0.1966951
   -0.4522387   0.57110393  1.5487726  -0.4439479  -0.40991902
   -0.09402011 -1.4421747 ]
  [ 0.80964327  1.3949454  -1.4053912  -0.16600242 -0.22739841
   -0.8344146  -0.8908817   1.2394375   1.3162789   0.34700018
   -0.01541197 -1.5678049 ]]], shape=(1, 3, 12), dtype=float32)
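
Notice that each row of the block output is roughly centred on zero. That comes from the "add and normalize" step. The sketch below reproduces that step in NumPy under some simplifying assumptions: `layer_norm_np` is a hypothetical helper, it omits the learned scale and offset (which start at 1 and 0), and the `epsilon` value is assumed rather than read from the layer.

```python
import numpy as np

def layer_norm_np(x, epsilon=1e-3):
    # Normalize each token's feature vector (last axis) to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

rng = np.random.default_rng(5)
x_demo = rng.random((1, 3, 12))               # stand-in for the block input
sublayer_out = rng.normal(size=(1, 3, 12))    # stand-in for attention or FFN output

# Skip connection followed by layer normalization.
normed = layer_norm_np(x_demo + sublayer_out)

print(np.round(normed.mean(axis=-1), 6))
print(np.round(normed.std(axis=-1), 3))
```

The per-token means come out at (numerically) zero and the standard deviations just under 1, regardless of the scale of the sublayer's output.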


## Word and Positional Embeddings

Let's now deal with the actual input to the **initial** encoder block. The inputs are going to be *positional word embeddings*. That is, word embeddings with some positional information added to them.
<br>

Let's start with **subword** tokenization. For demonstration, we'll use a subword tokenizer called **BPEmb**. It uses **Byte-Pair Encoding** and supports over two hundred languages.

https://bpemb.h-its.org/


In [33]:
# Requires the bpemb package (pip install bpemb).
from bpemb import BPEmb

# Load the English tokenizer.
bpemb_en = BPEmb(lang="en")

downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model


100%|██████████| 400869/400869 [00:00<00:00, 899022.12B/s]


downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz


100%|██████████| 3784656/3784656 [00:00<00:00, 4157039.92B/s]


The library comes with embeddings for a number of words.

In [34]:
bpemb_vocab_size, bpemb_embed_size = bpemb_en.vectors.shape
print("Vocabulary size:", bpemb_vocab_size)
print("Embedding size:", bpemb_embed_size)

Vocabulary size: 10000
Embedding size: 100


In [35]:
# Embedding for the word "car".
bpemb_en.vectors[bpemb_en.words.index('car')]

array([-0.305548, -0.325598, -0.134716, -0.078735, -0.660545,  0.076211,
       -0.735487,  0.124533, -0.294402,  0.459688,  0.030137,  0.174041,
       -0.224223,  0.486189, -0.504649, -0.459699,  0.315747,  0.477885,
        0.091398,  0.427867,  0.016524, -0.076833, -0.899727,  0.493158,
       -0.022309, -0.422785, -0.154148,  0.204981,  0.379834,  0.070588,
        0.196073, -0.368222,  0.473406,  0.007409,  0.004303, -0.007823,
       -0.19103 , -0.202509,  0.109878, -0.224521, -0.35741 , -0.611633,
        0.329958, -0.212956, -0.497499, -0.393839, -0.130101, -0.216903,
       -0.105595, -0.076007, -0.483942, -0.139704, -0.161647,  0.136985,
        0.415363, -0.360143,  0.038601, -0.078804, -0.030421,  0.324129,
        0.223378, -0.523636, -0.048317, -0.032248, -0.117367,  0.470519,
        0.225816, -0.222065, -0.225007, -0.165904, -0.334389, -0.20157 ,
        0.572352, -0.268794,  0.301929, -0.005563,  0.387491,  0.261031,
       -0.11613 ,  0.074982, -0.008433,  0.259987, 

We don't need the embeddings since we're going to use our own embedding layer. What we're interested in are the subword tokens and their respective ids. The ids will be used as indexes into our embedding layer.<br>

If this doesn't sound familiar, refer to the module on word vectors:<br>
https://www.nlpdemystified.org/course/word-vectors

These are the subword tokens for our example sentence from the slides. **BPEmb** places underscores in front of any tokens which are whole words or intended to begin words.<br>

Remember that subword tokenizers are trained using count frequencies over a corpus. So these subword tokens are specific to **BPEmb**. Another subword tokenizer may output something different. This is why it's important that when we use a pretrained model, we make sure to use the pretrained model's tokenizer. We'll see this when we use pretrained transformers later in this module.

In [36]:
sample_sentence = "Where can I find a pizzeria?"
tokens = bpemb_en.encode(sample_sentence)
print(tokens)

['▁where', '▁can', '▁i', '▁find', '▁a', '▁p', 'iz', 'zer', 'ia', '?']


We can retrieve each subword token's respective id using the *encode_ids* method.

In [37]:
token_seq = np.array(bpemb_en.encode_ids("Where can I find a pizzeria?"))
print(token_seq)

[ 571  280  386 1934    4   24  248 4339  177 9967]


Now that we have a way to tokenize and vectorize sentences, we can declare and use an embedding layer with the same vocabulary size as **BPEmb** and a desired embedding size.

In [38]:
token_embed = tf.keras.layers.Embedding(bpemb_vocab_size, embed_dim)
token_embeddings = token_embed(token_seq)

# The untrained embeddings for our sample sentence.
print("Embeddings for: ", sample_sentence)
print(token_embeddings)

Embeddings for:  Where can I find a pizzeria?
tf.Tensor(
[[ 0.00400496  0.04831305  0.04303933  0.01233246 -0.01075653 -0.04128264
   0.02047274 -0.01051215  0.00373044  0.03636954  0.00092835 -0.04483579]
 [ 0.01745489  0.02249477 -0.04775207 -0.00345278  0.01651353 -0.01490047
  -0.00393187  0.03352263  0.01207221  0.02686096  0.01110649 -0.00380627]
 [-0.00967174 -0.03257737 -0.00390768  0.00575004  0.0338132  -0.01474513
   0.02396044 -0.04178037 -0.04993768  0.04088691 -0.02797516 -0.0249719 ]
 [ 0.00880096 -0.03210081  0.04768225 -0.04392649 -0.01147266  0.04357504
  -0.04509336 -0.03205345 -0.03869456 -0.04352302  0.00881774 -0.04740187]
 [ 0.02461859  0.01440511  0.00757957  0.03171093 -0.04165054 -0.00717484
   0.03559117 -0.01511825  0.04637286 -0.01531883  0.00926406 -0.00846667]
 [ 0.02212783  0.00310277 -0.00833982  0.0086063   0.01740291 -0.03091501
   0.02407627  0.04326409 -0.04548197  0.01329801  0.02752054 -0.04685855]
 [-0.04001486  0.01312245 -0.02921597  0.03383039
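
Under the hood, an embedding layer is just a lookup table: each token id selects one row of a (vocabulary size, embedding size) matrix. The NumPy sketch below illustrates this with a hypothetical `embedding_matrix` filled with small random values (the initial range is an assumption for illustration), and shows that the lookup is equivalent to multiplying one-hot vectors by the matrix.

```python
import numpy as np

rng = np.random.default_rng(6)
vocab_size, embed_dim_demo = 10000, 12

# Hypothetical untrained embedding table with small random values.
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim_demo))

# The token ids for our sample sentence.
token_ids = np.array([571, 280, 386, 1934, 4, 24, 248, 4339, 177, 9967])

# Lookup by row indexing.
looked_up = embedding_matrix[token_ids]
print(looked_up.shape)  # (10, 12)

# The same result via one-hot vectors and a matrix product (far less efficient).
one_hot = np.zeros((len(token_ids), vocab_size))
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
print(np.allclose(one_hot @ embedding_matrix, looked_up))  # True
```

This is why the ids must stay below the vocabulary size: an id of 10000 or more would index past the end of the table.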

Next, we need to add *positional* information to each token embedding. As we covered in the slides, the original paper used sinusoidal functions, but it's more common these days to just use another set of learned embeddings. We'll do the latter here.<br>

Here, we're declaring an embedding layer with rows equalling a maximum sequence length and columns equalling our token embedding size. We then generate a vector of position ids.

In [39]:
max_seq_len = 256
pos_embed = tf.keras.layers.Embedding(max_seq_len, embed_dim)

# Generate ids for each position of the token sequence.
pos_idx = tf.range(len(token_seq))
print(pos_idx)

tf.Tensor([0 1 2 3 4 5 6 7 8 9], shape=(10,), dtype=int32)


We'll use these position ids to index into the positional embedding layer.

In [40]:
# These are our position embeddings.
position_embeddings = pos_embed(pos_idx)
print("Position embeddings for the input sequence\n", position_embeddings)

Position embeddings for the input sequence
 tf.Tensor(
[[ 0.04475912  0.01136164  0.00024324 -0.03845688  0.02116496 -0.00829645
  -0.03297579 -0.04601423  0.01971662  0.00454406  0.0427576   0.01658064]
 [-0.0142136  -0.04328866 -0.00930465 -0.02044349 -0.02645006  0.02946636
   0.02990742 -0.04853205  0.02957973 -0.04273129  0.00504352  0.01616876]
 [-0.03708373 -0.01538137  0.01870489 -0.01481106  0.02658054  0.04030814
   0.0020091  -0.03595635 -0.03257509  0.00683299 -0.0383482  -0.02164124]
 [ 0.02580347 -0.04198921  0.02124074  0.0096035   0.0012651   0.0342189
  -0.01892256  0.00388256  0.01768252 -0.00308634  0.00898805  0.00752416]
 [-0.04895539 -0.00249588 -0.02158953 -0.03222498 -0.00417761 -0.04998023
  -0.02186648  0.02509338  0.0342718   0.02292876  0.00195863  0.0058117 ]
 [ 0.01011858  0.02461368  0.01578208 -0.03094249 -0.03148978  0.01228408
   0.03565062 -0.03808091 -0.0099959  -0.0015573   0.00535196 -0.04983938]
 [ 0.00206789 -0.04287862 -0.00140119  0.00992222 -0

The final step is to add our token and position embeddings. The result will be the input to the first encoder block.

In [41]:
input = token_embeddings + position_embeddings
print("Input to the initial encoder block:\n", input)

Input to the initial encoder block:
 tf.Tensor(
[[ 0.04876408  0.05967468  0.04328257 -0.02612442  0.01040844 -0.0495791
  -0.01250305 -0.05652638  0.02344706  0.04091359  0.04368596 -0.02825515]
 [ 0.00324129 -0.02079389 -0.05705673 -0.02389627 -0.00993653  0.01456589
   0.02597555 -0.01500941  0.04165194 -0.01587032  0.01615001  0.01236249]
 [-0.04675547 -0.04795875  0.01479721 -0.00906102  0.06039374  0.02556302
   0.02596954 -0.07773672 -0.08251277  0.04771991 -0.06632335 -0.04661315]
 [ 0.03460443 -0.07409002  0.06892299 -0.03432299 -0.01020757  0.07779393
  -0.06401591 -0.02817088 -0.02101204 -0.04660936  0.01780579 -0.03987772]
 [-0.0243368   0.01190922 -0.01400997 -0.00051405 -0.04582815 -0.05715507
   0.0137247   0.00997513  0.08064467  0.00760993  0.01122269 -0.00265497]
 [ 0.03224641  0.02771645  0.00744226 -0.02233618 -0.01408686 -0.01863093
   0.05972689  0.00518319 -0.05547787  0.01174071  0.0328725  -0.09669793]
 [-0.03794698 -0.02975617 -0.03061715  0.04375261  0.019203
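
For comparison, here is a minimal NumPy sketch of the fixed sinusoidal alternative mentioned earlier. It is not used anywhere in this notebook; the function name `sinusoidal_positions` is made up for illustration, and the sketch assumes an even `d_model`.

```python
import numpy as np

def sinusoidal_positions(max_seq_len, d_model):
    """Return a (max_seq_len, d_model) table of fixed positional encodings."""
    positions = np.arange(max_seq_len)[:, np.newaxis]   # (max_seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]             # (1, d_model)

    # Each pair of dimensions (2i, 2i+1) shares one frequency.
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                     # (max_seq_len, d_model)

    table = np.zeros((max_seq_len, d_model), dtype=np.float32)
    table[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    table[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return table

pos_table = sinusoidal_positions(max_seq_len=256, d_model=12)
print(pos_table.shape)  # (256, 12)
print(pos_table[0])     # position 0: every sine is 0 and every cosine is 1
```

Unlike the learned embedding layer, this table has no trainable weights, so it would simply be added to the token embeddings as a constant.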

## Encoder

Now that we have an encoder block and a way to embed our tokens with position information, we can create the **encoder** itself.<br>

Given a batch of vectorized sequences, the encoder creates positional embeddings, runs them through its encoder blocks, and returns contextualized tokens.

In [42]:
class Encoder(tf.keras.layers.Layer):
  def __init__(self, num_blocks, d_model, num_heads, hidden_dim, src_vocab_size,
               max_seq_len, dropout_rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.max_seq_len = max_seq_len

    self.token_embed = tf.keras.layers.Embedding(src_vocab_size, self.d_model)
    self.pos_embed = tf.keras.layers.Embedding(max_seq_len, self.d_model)

    # The original Attention Is All You Need paper applied dropout to the
    # input before feeding it to the first encoder block.
    self.dropout = tf.keras.layers.Dropout(dropout_rate)

    # Create encoder blocks.
    self.blocks = [EncoderBlock(self.d_model, num_heads, hidden_dim, dropout_rate)
                   for _ in range(num_blocks)]

  def call(self, input, training, mask):
    token_embeds = self.token_embed(input)

    # Generate position indices for a batch of input sequences.
    num_pos = input.shape[0] * self.max_seq_len
    pos_idx = np.resize(np.arange(self.max_seq_len), num_pos)
    pos_idx = np.reshape(pos_idx, input.shape)
    pos_embeds = self.pos_embed(pos_idx)

    x = self.dropout(token_embeds + pos_embeds, training=training)

    # Run input through successive encoder blocks.
    for block in self.blocks:
      x, weights = block(x, training, mask)

    return x, weights

If you're wondering about this code block here:


```
num_pos = input.shape[0] * self.max_seq_len
pos_idx = np.resize(np.arange(self.max_seq_len), num_pos)
pos_idx = np.reshape(pos_idx, input.shape)
pos_embeds = self.pos_embed(pos_idx)
```


This generates positional embeddings for a *batch* of input sequences. Suppose this was our batch of input sequences to the encoder.

In [43]:
# Batch of 3 sequences, each of length 10 (10 is also the
# maximum sequence length in this case).
seqs = np.random.randint(0, 10000, size=(3, 10))
print(seqs.shape)
print(seqs)

(3, 10)
[[8875 5397 4993 4270 4891 4655 9074 9709 5166 6044]
 [4026  823 3671 7558 8165  840 9570 3884 3508 8435]
 [5566 1853 7246  304 1098  715 5040 4420 4135 4374]]


We need to retrieve a positional embedding for every element in this batch. The first step is to create the respective positional ids...

In [44]:
pos_ids = np.resize(np.arange(seqs.shape[1]), seqs.shape[0] * seqs.shape[1])
print(pos_ids)

[0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]


...and then reshape them to match the input batch dimensions.

In [45]:
pos_ids = np.reshape(pos_ids, (3, 10))
print(pos_ids.shape)
print(pos_ids)

(3, 10)
[[0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]]


We can now retrieve position embeddings for every token embedding.

In [46]:
pos_embed(pos_ids)

<tf.Tensor: shape=(3, 10, 12), dtype=float32, numpy=
array([[[ 0.04475912,  0.01136164,  0.00024324, -0.03845688,
          0.02116496, -0.00829645, -0.03297579, -0.04601423,
          0.01971662,  0.00454406,  0.0427576 ,  0.01658064],
        [-0.0142136 , -0.04328866, -0.00930465, -0.02044349,
         -0.02645006,  0.02946636,  0.02990742, -0.04853205,
          0.02957973, -0.04273129,  0.00504352,  0.01616876],
        [-0.03708373, -0.01538137,  0.01870489, -0.01481106,
          0.02658054,  0.04030814,  0.0020091 , -0.03595635,
         -0.03257509,  0.00683299, -0.0383482 , -0.02164124],
        [ 0.02580347, -0.04198921,  0.02124074,  0.0096035 ,
          0.0012651 ,  0.0342189 , -0.01892256,  0.00388256,
          0.01768252, -0.00308634,  0.00898805,  0.00752416],
        [-0.04895539, -0.00249588, -0.02158953, -0.03222498,
         -0.00417761, -0.04998023, -0.02186648,  0.02509338,
          0.0342718 ,  0.02292876,  0.00195863,  0.0058117 ],
        [ 0.01011858,  0.02
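
One caveat with the resize-and-reshape approach: inside `Encoder.call` it only lines up when every sequence in the batch is exactly `max_seq_len` long, because `batch_size * max_seq_len` ids are reshaped into `input.shape`. A possible alternative, sketched here in NumPy under that observation, is to build a single row of position ids from the actual sequence length and broadcast it across the batch.

```python
import numpy as np

batch_size, seq_len = 3, 10

# The notebook's approach: tile 0..seq_len-1 and reshape to the batch shape.
tiled = np.reshape(np.resize(np.arange(seq_len), batch_size * seq_len),
                   (batch_size, seq_len))

# Alternative: one row of position ids, broadcast across the batch.
broadcast = np.broadcast_to(np.arange(seq_len), (batch_size, seq_len))
print(np.array_equal(tiled, broadcast))  # True

# The broadcast version also works for batches shorter than the maximum length.
short = np.broadcast_to(np.arange(7), (batch_size, 7))
print(short.shape)  # (3, 7)
```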

Let's try our encoder on a batch of sentences.

In [47]:
input_batch = [
    "Where can I find a pizzeria?",
    "Mass hysteria over listeria.",
    "I ain't no circle back girl."
]

bpemb_en.encode(input_batch)

[['▁where', '▁can', '▁i', '▁find', '▁a', '▁p', 'iz', 'zer', 'ia', '?'],
 ['▁mass', '▁hy', 'ster', 'ia', '▁over', '▁l', 'ister', 'ia', '.'],
 ['▁i', '▁a', 'in', "'", 't', '▁no', '▁circle', '▁back', '▁girl', '.']]

In [48]:
input_seqs = bpemb_en.encode_ids(input_batch)
print("Vectorized inputs:")
input_seqs

Vectorized inputs:


[[571, 280, 386, 1934, 4, 24, 248, 4339, 177, 9967],
 [1535, 1354, 1238, 177, 380, 43, 871, 177, 9935],
 [386, 4, 6, 9937, 9915, 467, 5410, 810, 3692, 9935]]

Note how the input sequences aren't the same length in this batch. In this case, we need to pad them out so that they are. If you're unfamiliar with why, refer to the notebook on Recurrent Neural Networks:<br>
https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_recurrent_neural_networks.ipynb<br>

We'll do this using *pad_sequences*.<br>
https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences

In [49]:
padded_input_seqs = tf.keras.preprocessing.sequence.pad_sequences(input_seqs, padding="post")
print("Input to the encoder:")
print(padded_input_seqs.shape)
print(padded_input_seqs)

Input to the encoder:
(3, 10)
[[ 571  280  386 1934    4   24  248 4339  177 9967]
 [1535 1354 1238  177  380   43  871  177 9935    0]
 [ 386    4    6 9937 9915  467 5410  810 3692 9935]]


Since our input now has padding, now's a good time to cover **masking**.
<br>

Given a mask, wherever a mask position is set to 0, the corresponding position in the attention scores is set to *-inf*. After the softmax, the attention weight for that position is zero, so nothing attends to it.
<br>

In the slides, we covered *look-ahead* masks for the decoder to prevent it from attending to future tokens, but we also need masks for padding.
<br>

In total, there are three masks involved:
1. The *encoder mask* to mask out any padding in the encoder sequences.

2. The *decoder mask* which is used in the decoder's **first** multi-head self-attention layer. It's a <u>combination of two masks</u>: one to account for the padding in target sequences, and the look-ahead mask.

3. The *memory mask* which is used in the decoder's **second** multi-head attention layer (the cross-attention layer). The keys and values for this layer are going to be the encoder's output, and this mask will ensure the decoder doesn't attend to any encoder output which corresponds to padding. In practice, masks 1 and 3 are often the same.

The *scaled_dot_product_attention* function has this line:
```
  if mask is not None:
    scaled_scores = tf.where(mask==0, -np.inf, scaled_scores)
```
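
To see why *-inf* gives a weight of exactly zero, here is a small NumPy sketch with made-up scores for a single query. The `softmax` helper is written out here for illustration and stands in for the Keras `Softmax` layer used in the notebook.

```python
import numpy as np

def softmax(x):
    # Subtract the row maximum for numerical stability.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

scores = np.array([2.0, 1.0, 0.5, 3.0])  # made-up attention scores for one query
mask = np.array([1.0, 1.0, 1.0, 0.0])    # the last key position is padding

masked_scores = np.where(mask == 0, -np.inf, scores)
weights = softmax(masked_scores)
print(weights)  # the last weight is 0; the remaining weights sum to 1
```

Since exp(-inf) is 0, the padded position contributes nothing to the weighted sum of values, even though its raw score was the largest.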

Let's create an encoder mask for our batch of input sequences.<br>

Wherever there's padding, we want the mask position set to zero.

In [50]:
enc_mask = tf.cast(tf.math.not_equal(padded_input_seqs, 0), tf.float32)
print("Input:")
print(padded_input_seqs, '\n')
print("Encoder mask:")
print(enc_mask)

Input:
[[ 571  280  386 1934    4   24  248 4339  177 9967]
 [1535 1354 1238  177  380   43  871  177 9935    0]
 [ 386    4    6 9937 9915  467 5410  810 3692 9935]] 

Encoder mask:
tf.Tensor(
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]], shape=(3, 10), dtype=float32)


Keep in mind that the dimension of the attention matrix (for this example) is going to be:<br>
*(batch size, number of heads, query size, key size)*<br>
(3, 3, 10, 10)

So we need to expand the mask dimensions like so:

In [51]:
enc_mask = enc_mask[:, tf.newaxis, tf.newaxis, :]
enc_mask

<tf.Tensor: shape=(3, 1, 1, 10), dtype=float32, numpy=
array([[[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]],


       [[[1., 1., 1., 1., 1., 1., 1., 1., 1., 0.]]],


       [[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]]], dtype=float32)>

This way, the encoder mask will be *broadcast* across the head and query dimensions of the attention scores.<br>
https://www.tensorflow.org/xla/broadcasting
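
As a quick check of the shapes involved, the following NumPy sketch applies a `(3, 1, 1, 10)` mask to a dummy `(3, 3, 10, 10)` score tensor. NumPy and TensorFlow follow the same broadcasting rules for this case; the all-zero scores are placeholders.

```python
import numpy as np

batch_size, num_heads, seq_len = 3, 3, 10
scores = np.zeros((batch_size, num_heads, seq_len, seq_len))  # placeholder scores

# Padding mask for the batch: only the last token of sequence 1 is padding.
padding = np.ones((batch_size, seq_len))
padding[1, -1] = 0
mask = padding[:, np.newaxis, np.newaxis, :]  # (3, 1, 1, 10)

masked = np.where(mask == 0, -np.inf, scores)
print(masked.shape)                         # (3, 3, 10, 10)
print(np.isinf(masked[1, :, :, 9]).all())   # True: masked for every head and query
print(np.isinf(masked[0]).any())            # False: sequence 0 has no padding
```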

Now we can declare an encoder and pass it batches of vectorized sequences.

In [52]:
num_encoder_blocks = 6

# d_model is the embedding dimension used throughout.
d_model = 12

num_heads = 3

# Feed-forward network hidden dimension width.
ffn_hidden_dim = 48

src_vocab_size = bpemb_vocab_size
max_input_seq_len = padded_input_seqs.shape[1]

encoder = Encoder(
    num_encoder_blocks,
    d_model,
    num_heads,
    ffn_hidden_dim,
    src_vocab_size,
    max_input_seq_len)

We can now pass our input sequences and mask to the encoder.

In [53]:
encoder_output, attn_weights = encoder(padded_input_seqs, training=True,
                                       mask=enc_mask)
print(f"Encoder output {encoder_output.shape}:")
print(encoder_output)

Encoder output (3, 10, 12):
tf.Tensor(
[[[-0.20844007  0.30584323 -0.6369229   0.859495   -0.01236432
   -0.50558317 -1.475137    0.64451116  2.2667232   0.5639428
   -0.2757076  -1.5263603 ]
  [ 1.0040327  -0.90436524 -0.6418579   0.2745777  -0.27483612
    0.22397393 -0.4392896   0.9346314   0.8046457   1.3477252
    0.12138812 -2.4506254 ]
  [-0.8286879  -0.44620386  1.5278351   1.7705213  -0.16286017
   -0.04525811 -1.8712609  -0.32909733  1.0411451   0.3862133
   -0.19126712 -0.85107934]
  [ 0.73303103 -2.185224   -0.22593173  1.1442285   1.1382056
   -0.5883463  -0.99556524 -0.00529124  1.4696492   0.36057872
   -0.2918209  -0.55351365]
  [ 0.38723007 -0.9957127  -1.3574787  -0.07031898  0.59186244
   -1.1414843   0.39921477  1.678804    1.5273733   0.24442498
    0.13695456 -1.4008698 ]
  [ 0.72962576 -0.95066804 -2.1555886   0.38998362  2.2392745
    0.22314106 -0.4409975   0.5537008  -0.12353037  0.05372484
   -0.1253061  -0.39336   ]
  [ 0.4999636  -2.4096107   0.56715846  0.

## Decoder Block

Let's build the **Decoder Block**. Everything we did to create the **encoder** block applies here. The major differences are that the **Decoder Block** has:
1. a **Multi-Head Cross-Attention** layer which uses the encoder's outputs as the keys and values.

2. an extra skip/residual connection along with an extra layer normalization step.

<div>
<img src="https://drive.google.com/uc?export=view&id=1WVT4SX49bnta4uscOTF4xrsxFI4PbPER" width="500"/>
</div>

In [54]:
class DecoderBlock(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, hidden_dim, dropout_rate=0.1):
    super(DecoderBlock, self).__init__()

    self.mhsa1 = MultiHeadSelfAttention(d_model, num_heads)
    self.mhsa2 = MultiHeadSelfAttention(d_model, num_heads)

    self.ffn = feed_forward_network(d_model, hidden_dim)

    self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
    self.dropout2 = tf.keras.layers.Dropout(dropout_rate)
    self.dropout3 = tf.keras.layers.Dropout(dropout_rate)

    self.layernorm1 = tf.keras.layers.LayerNormalization()
    self.layernorm2 = tf.keras.layers.LayerNormalization()
    self.layernorm3 = tf.keras.layers.LayerNormalization()

  # Note the decoder block takes two masks. One for the first MHSA, another
  # for the second MHSA.
  def call(self, encoder_output, target, training, decoder_mask, memory_mask):
    mhsa_output1, attn_weights = self.mhsa1(target, target, target, decoder_mask)
    mhsa_output1 = self.dropout1(mhsa_output1, training=training)
    mhsa_output1 = self.layernorm1(mhsa_output1 + target)

    mhsa_output2, attn_weights = self.mhsa2(mhsa_output1, encoder_output,
                                            encoder_output,
                                            memory_mask)
    mhsa_output2 = self.dropout2(mhsa_output2, training=training)
    mhsa_output2 = self.layernorm2(mhsa_output2 + mhsa_output1)

    ffn_output = self.ffn(mhsa_output2)
    ffn_output = self.dropout3(ffn_output, training=training)
    output = self.layernorm3(ffn_output + mhsa_output2)

    return output, attn_weights
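
The second attention call (`self.mhsa2`) takes its queries from the decoder and its keys and values from the encoder, so the attention matrix is no longer square. The NumPy sketch below walks through the shapes with random placeholder tensors; the sizes (target length 8, source length 10, head dimension 4) are assumptions chosen to match the examples in this notebook.

```python
import numpy as np

def softmax(x):
    shifted = x - np.max(x, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

batch_size, num_heads, head_dim = 3, 3, 4
target_len, source_len = 8, 10

rng = np.random.default_rng(0)
queries = rng.normal(size=(batch_size, num_heads, target_len, head_dim))  # from the decoder
keys = rng.normal(size=(batch_size, num_heads, source_len, head_dim))     # from the encoder output
values = rng.normal(size=(batch_size, num_heads, source_len, head_dim))   # from the encoder output

scores = queries @ np.swapaxes(keys, -1, -2) / np.sqrt(head_dim)
weights = softmax(scores)
output = weights @ values

print(scores.shape)  # (3, 3, 8, 10): one row per target token, one column per source token
print(output.shape)  # (3, 3, 8, 4): same length as the target sequence
```

Because the last axis of the scores indexes source tokens, the memory mask has to be built from the padding in the source sequences, not the target sequences.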


## Decoder

The decoder is almost the same as the encoder except it takes the encoder's output as part of its input, and it takes two masks: the decoder mask and memory mask.

In [55]:
class Decoder(tf.keras.layers.Layer):
  def __init__(self, num_blocks, d_model, num_heads, hidden_dim, target_vocab_size,
               max_seq_len, dropout_rate=0.1):
    super(Decoder, self).__init__()

    self.d_model = d_model
    self.max_seq_len = max_seq_len

    self.token_embed = tf.keras.layers.Embedding(target_vocab_size, self.d_model)
    self.pos_embed = tf.keras.layers.Embedding(max_seq_len, self.d_model)

    self.dropout = tf.keras.layers.Dropout(dropout_rate)

    self.blocks = [DecoderBlock(self.d_model, num_heads, hidden_dim, dropout_rate) for _ in range(num_blocks)]

  def call(self, encoder_output, target, training, decoder_mask, memory_mask):
    token_embeds = self.token_embed(target)

    # Generate position indices.
    num_pos = target.shape[0] * self.max_seq_len
    pos_idx = np.resize(np.arange(self.max_seq_len), num_pos)
    pos_idx = np.reshape(pos_idx, target.shape)

    pos_embeds = self.pos_embed(pos_idx)

    x = self.dropout(token_embeds + pos_embeds, training=training)

    for block in self.blocks:
      x, weights = block(encoder_output, x, training, decoder_mask, memory_mask)

    return x, weights

Before we try the decoder, let's cover the masks involved. The decoder takes two masks:

The *decoder mask* which is a <u>combination of two masks</u>: one to account for the padding in target sequences, and the look-ahead mask. This mask is used in the decoder's **first** multi-head self-attention layer.

The *memory mask* which is used in the decoder's **second** multi-head attention layer (the cross-attention layer). The keys and values for this layer are going to be the encoder's output, and this mask will ensure the decoder doesn't attend to any encoder output which corresponds to padding.

Suppose this is our batch of vectorized target *input* sequences for the decoder. These values are just made up.<br>

**Note**: If you need a refresher on how to prepare target input and output sequences for the decoder, refer to the [seq2seq notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_seq2seq_and_attention.ipynb).



In [56]:
# Made up values.
target_input_seqs = [
    [1, 652, 723, 123, 62],
    [1, 25,  98, 129, 248, 215, 359, 249],
    [1, 2369, 1259, 125, 486],
]

As we did with the encoder input sequences, we need to pad out this batch so that all sequences within it are the same length.

In [57]:
padded_target_input_seqs = tf.keras.preprocessing.sequence.pad_sequences(target_input_seqs, padding="post")
print("Padded target inputs to the decoder:")
print(padded_target_input_seqs.shape)
print(padded_target_input_seqs)

Padded target inputs to the decoder:
(3, 8)
[[   1  652  723  123   62    0    0    0]
 [   1   25   98  129  248  215  359  249]
 [   1 2369 1259  125  486    0    0    0]]


We can create the padding mask the same way we did for the encoder.

In [58]:
dec_padding_mask = tf.cast(tf.math.not_equal(padded_target_input_seqs, 0), tf.float32)
dec_padding_mask = dec_padding_mask[:, tf.newaxis, tf.newaxis, :]
print(dec_padding_mask)

tf.Tensor(
[[[[1. 1. 1. 1. 1. 0. 0. 0.]]]


 [[[1. 1. 1. 1. 1. 1. 1. 1.]]]


 [[[1. 1. 1. 1. 1. 0. 0. 0.]]]], shape=(3, 1, 1, 8), dtype=float32)


As we covered in the slides, the look-ahead mask is a lower-triangular matrix: ones on and below the diagonal, zeros above it. This is easy to create using the *band_part* method:<br>
https://www.tensorflow.org/api_docs/python/tf/linalg/band_part

In [59]:
target_input_seq_len = padded_target_input_seqs.shape[1]
look_ahead_mask = tf.linalg.band_part(tf.ones((target_input_seq_len,
                                               target_input_seq_len)), -1, 0)
print(look_ahead_mask)

tf.Tensor(
[[1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0.]
 [1. 1. 1. 1. 0. 0. 0. 0.]
 [1. 1. 1. 1. 1. 0. 0. 0.]
 [1. 1. 1. 1. 1. 1. 0. 0.]
 [1. 1. 1. 1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 1. 1. 1. 1.]], shape=(8, 8), dtype=float32)


To create the decoder mask, we just need to combine the padding and look-ahead masks. Note how the columns of the resulting decoder mask are all zero for padding positions.

In [60]:
dec_mask = tf.minimum(dec_padding_mask, look_ahead_mask)
print("The decoder mask:")
print(dec_mask)

The decoder mask:
tf.Tensor(
[[[[1. 0. 0. 0. 0. 0. 0. 0.]
   [1. 1. 0. 0. 0. 0. 0. 0.]
   [1. 1. 1. 0. 0. 0. 0. 0.]
   [1. 1. 1. 1. 0. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]]]


 [[[1. 0. 0. 0. 0. 0. 0. 0.]
   [1. 1. 0. 0. 0. 0. 0. 0.]
   [1. 1. 1. 0. 0. 0. 0. 0.]
   [1. 1. 1. 1. 0. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 1. 0. 0.]
   [1. 1. 1. 1. 1. 1. 1. 0.]
   [1. 1. 1. 1. 1. 1. 1. 1.]]]


 [[[1. 0. 0. 0. 0. 0. 0. 0.]
   [1. 1. 0. 0. 0. 0. 0. 0.]
   [1. 1. 1. 0. 0. 0. 0. 0.]
   [1. 1. 1. 1. 0. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]]]], shape=(3, 1, 8, 8), dtype=float32)


We can now declare a decoder and pass it everything it needs. In our case, the *memory* mask is the same as the *encoder* mask.

In [61]:
decoder = Decoder(6, 12, 3, 48, 10000, 8)
decoder_output, _ = decoder(encoder_output, padded_target_input_seqs,
                            True, dec_mask, enc_mask)
print(f"Decoder output {decoder_output.shape}:")
print(decoder_output)

Decoder output (3, 8, 12):
tf.Tensor(
[[[-0.7259985  -0.97611994 -0.28941667  1.3896121   1.4586916
    1.3402506  -0.46452475 -0.14873719  0.95775634 -0.7594686
   -0.10043172 -1.6816127 ]
  [-0.5760346  -1.0997835   0.1528048   1.1035895   1.5172669
    1.1260314   0.7329282  -0.9900956   0.35885364 -1.8439285
    0.2483876  -0.73002   ]
  [-0.6364259  -0.7748547   0.28305164  0.9715581   1.0522184
    1.7745788   0.17995088 -0.20129976  0.47856066 -1.450359
    0.1348984  -1.8118775 ]
  [-1.1627475  -1.111045   -0.1093525   1.4833279   0.96182495
    0.9992485  -0.07959682  0.230573    1.4518487  -1.1573806
   -0.1933207  -1.3133799 ]
  [-1.050879   -0.7926479  -0.2938709   0.3102484   0.45639268
    2.0269377  -0.42594066 -0.67904633  1.3039109  -1.1693729
    1.1561573  -0.8418889 ]
  [-1.3144761  -0.4774183   0.26200217  0.5275038   1.0509723
    0.8476583  -0.19162361  0.13945742  1.8465501  -1.4121077
    0.252084   -1.5306022 ]
  [-1.3734267  -0.80572534  0.21410486  1.0241771

## Transformer

We now have all the pieces to build the **Transformer** itself, and it's pretty simple.

In [62]:
class Transformer(tf.keras.Model):
  def __init__(self, num_blocks, d_model, num_heads, hidden_dim, source_vocab_size,
               target_vocab_size, max_input_len, max_target_len, dropout_rate=0.1):
    super(Transformer, self).__init__()

    self.encoder = Encoder(num_blocks, d_model, num_heads, hidden_dim, source_vocab_size,
                           max_input_len, dropout_rate)

    self.decoder = Decoder(num_blocks, d_model, num_heads, hidden_dim, target_vocab_size,
                           max_target_len, dropout_rate)

    # The final dense layer to generate logits from the decoder output.
    self.output_layer = tf.keras.layers.Dense(target_vocab_size)

  def call(self, input_seqs, target_input_seqs, training, encoder_mask,
           decoder_mask, memory_mask):
    encoder_output, encoder_attn_weights = self.encoder(input_seqs,
                                                        training, encoder_mask)

    decoder_output, decoder_attn_weights = self.decoder(encoder_output,
                                                        target_input_seqs, training,
                                                        decoder_mask, memory_mask)

    return self.output_layer(decoder_output), encoder_attn_weights, decoder_attn_weights


In [63]:
transformer = Transformer(
    num_blocks = 6,
    d_model = 12,
    num_heads = 3,
    hidden_dim = 48,
    source_vocab_size = bpemb_vocab_size,
    target_vocab_size = 7000, # made-up target vocab size.
    max_input_len = padded_input_seqs.shape[1],
    max_target_len = padded_target_input_seqs.shape[1])

transformer_output, _, _ = transformer(padded_input_seqs,
                                       padded_target_input_seqs, True,
                                       enc_mask, dec_mask, memory_mask=enc_mask)
print(f"Transformer output {transformer_output.shape}:")
print(transformer_output) # If training, we would use this output to calculate losses.

Transformer output (3, 8, 7000):
tf.Tensor(
[[[ 0.00051915  0.04052141 -0.00557879 ... -0.09527363  0.03418413
   -0.05295348]
  [ 0.05880791  0.06468092 -0.10652198 ... -0.08754595  0.07245212
   -0.08023539]
  [ 0.02981197  0.09833822 -0.12007275 ... -0.04704471  0.04017722
   -0.06536498]
  ...
  [ 0.02788904  0.06437693 -0.12579641 ... -0.04132092  0.06407294
   -0.04562572]
  [ 0.02527493  0.05832984 -0.12033436 ... -0.01126624  0.06646024
   -0.08118615]
  [ 0.02376048  0.0748547  -0.12640293 ... -0.03679956  0.06358375
   -0.06299862]]

 [[ 0.0142796   0.02680932  0.02348097 ... -0.12607291  0.00915394
   -0.00041237]
  [ 0.01274822 -0.02192468 -0.01512134 ... -0.14233251 -0.08021986
   -0.01355399]
  [ 0.01097837  0.01521777  0.00570864 ... -0.10456658  0.00124418
   -0.01136646]
  ...
  [ 0.03087964  0.00507124 -0.03227877 ... -0.13468564 -0.0152267
   -0.0382706 ]
  [ 0.03148315  0.05182709  0.00459516 ... -0.0714085  -0.00991692
   -0.01888603]
  [ 0.00623365  0.03615354  0.

That's the whole original transformer from scratch. From here, if you want to train this transformer, you can use the same approach we used when we built the translation model with attention in the [seq2seq notebook](https://colab.research.google.com/github/futuremojo/nlp-demystified/blob/main/notebooks/nlpdemystified_seq2seq_and_attention.ipynb#scrollTo=x8Ef_eWXjWMn&line=3&uniqifier=1). Remember to use a learning rate warmup (Refer to the paper for more information on this).

It's useful to know how these models work under the hood, but training our own transformer to get impressive results is expensive, both in terms of compute and data.<br>

Fortunately, there's a zoo of **pretrained** transformer models we can use. We'll explore that next.

In [64]:
transformer

<__main__.Transformer at 0x78c5e9dd3340>

In [68]:
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.98,
                                     epsilon=1e-9)
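
The optimizer above uses a fixed learning rate. The warmup schedule from the original paper increases the rate linearly for the first `warmup_steps` steps and then decays it with the inverse square root of the step number. A plain-Python sketch of that formula follows; the function name is made up, and wiring it into Keras (for example through a custom learning rate schedule) is left out.

```python
def warmup_learning_rate(step, d_model, warmup_steps=4000):
    """Learning rate from the paper: d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid dividing by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, warmup_learning_rate(step, d_model=12))
```

The rate peaks at `step == warmup_steps`, where the two terms inside `min` are equal.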

## Loss and metrics

Since the target sequences are padded, it is important to apply a padding mask when calculating the loss.

In [69]:
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

In [70]:
def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_sum(loss_)/tf.reduce_sum(mask)
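
To make the effect of the padding mask concrete, here is a NumPy sketch of the same computation on a tiny made-up batch. `masked_cross_entropy` is an illustrative stand-in for `loss_function`, not the notebook's implementation.

```python
import numpy as np

def masked_cross_entropy(real, logits):
    # Log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Negative log-probability of the true token at each position.
    token_loss = -np.take_along_axis(log_probs, real[..., np.newaxis], axis=-1)[..., 0]

    # Zero out padding positions (token id 0) and average over real tokens only.
    mask = (real != 0).astype(token_loss.dtype)
    return (token_loss * mask).sum() / mask.sum()

real = np.array([[5, 2, 0, 0]])  # two real tokens followed by two padding tokens
logits = np.zeros((1, 4, 8))     # uniform logits over a made-up vocabulary of 8
print(masked_cross_entropy(real, logits))  # log(8), roughly 2.079

# Changing the logits at padded positions does not change the loss.
noisy = logits.copy()
noisy[0, 2:, :] = np.arange(8)
print(masked_cross_entropy(real, noisy))   # same value as above
```

Dividing by the number of real tokens, not the full sequence length, keeps heavily padded batches from looking artificially easy.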

In [72]:
checkpoint_path = "./checkpoints/train"

ckpt = tf.train.Checkpoint(transformer=transformer,
                           optimizer=optimizer)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

# if a checkpoint exists, restore the latest checkpoint.
if ckpt_manager.latest_checkpoint:
  ckpt.restore(ckpt_manager.latest_checkpoint)
  print ('Latest checkpoint restored!!')

In [73]:
EPOCHS = 20

In [74]:
# The @tf.function trace-compiles train_step into a TF graph for faster
# execution. The function specializes to the precise shape of the argument
# tensors. To avoid re-tracing due to the variable sequence lengths or variable
# batch sizes (the last batch is smaller), use input_signature to specify
# more generic shapes.
import time

train_step_signature = [
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
    tf.TensorSpec(shape=(None, None), dtype=tf.int64),
]

@tf.function(input_signature=train_step_signature)
def train_step(inp, tar):
  tar_inp = tar[:, :-1]
  tar_real = tar[:, 1:]

  enc_padding_mask, combined_mask, dec_padding_mask = create_masks(inp, tar_inp)

  with tf.GradientTape() as tape:
    # The transformer returns (logits, encoder attention, decoder attention).
    predictions, _, _ = transformer(inp, tar_inp,
                                    True,
                                    enc_padding_mask,
                                    combined_mask,
                                    dec_padding_mask)
    loss = loss_function(tar_real, predictions)

  gradients = tape.gradient(loss, transformer.trainable_variables)
  optimizer.apply_gradients(zip(gradients, transformer.trainable_variables))

  train_loss(loss)
  train_accuracy(tar_real, predictions)
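
`train_step` calls `create_masks`, which is not defined in the cells shown here. Below is a NumPy sketch of what such a helper could look like under this notebook's convention (1 means attend, 0 means masked, and token id 0 is padding). The name `create_masks_np` is hypothetical; a TensorFlow version usable inside `tf.function` would swap in `tf.math.not_equal`, `tf.linalg.band_part`, and `tf.minimum` as in the earlier cells.

```python
import numpy as np

def create_masks_np(inp, tar_inp):
    # Encoder mask: 0 wherever the source sequence is padding.
    enc_mask = (inp != 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]

    # Decoder mask: target padding combined with the look-ahead mask.
    tar_padding = (tar_inp != 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]
    tar_len = tar_inp.shape[1]
    look_ahead = np.tril(np.ones((tar_len, tar_len), dtype=np.float32))
    dec_mask = np.minimum(tar_padding, look_ahead)

    # Memory mask: the cross-attention keys come from the encoder, so reuse its mask.
    memory_mask = enc_mask
    return enc_mask, dec_mask, memory_mask

inp = np.array([[571, 280, 386], [1535, 1354, 0]])
tar_inp = np.array([[1, 652, 723, 0], [1, 25, 98, 129]])
enc_mask, dec_mask, memory_mask = create_masks_np(inp, tar_inp)
print(enc_mask.shape, dec_mask.shape, memory_mask.shape)  # (2, 1, 1, 3) (2, 1, 4, 4) (2, 1, 1, 3)
print(dec_mask[0, 0])
```

The return order matches how `train_step` unpacks the result: encoder mask, combined decoder mask, then the mask for the cross-attention layer.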

In [75]:
# Metrics accumulated across the batches of each epoch.
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

for epoch in range(EPOCHS):
  start = time.time()

  train_loss.reset_states()
  train_accuracy.reset_states()

  # inp -> portuguese, tar -> english
  for (batch, (inp, tar)) in enumerate(train_dataset):
    train_step(inp, tar)

    if batch % 50 == 0:
      print('Epoch {} Batch {} Loss {:.4f} Accuracy {:.4f}'.format(
          epoch + 1, batch, train_loss.result(), train_accuracy.result()))

  if (epoch + 1) % 5 == 0:
    ckpt_save_path = ckpt_manager.save()
    print('Saving checkpoint for epoch {} at {}'.format(epoch + 1,
                                                        ckpt_save_path))

  print('Epoch {} Loss {:.4f} Accuracy {:.4f}'.format(epoch + 1,
                                                      train_loss.result(),
                                                      train_accuracy.result()))

  print('Time taken for 1 epoch: {} secs\n'.format(time.time() - start))

In [None]:
!pip install -q tfds-nightly

In [None]:
import tensorflow_datasets as tfds

In [None]:
examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

In [None]:
tokenizer_en = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)

tokenizer_pt = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)

In [None]:
sample_string = 'Transformer is awesome.'

tokenized_string = tokenizer_en.encode(sample_string)
print ('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer_en.decode(tokenized_string)
print ('The original string: {}'.format(original_string))

assert original_string == sample_string

In [None]:
tokenizer_en

In [None]:
tokenizer_pt

The tokenizer encodes the string by breaking it into subwords if the word is not in its dictionary.

In [None]:
for ts in tokenized_string:
  print ('{} ----> {}'.format(ts, tokenizer_en.decode([ts])))

In [None]:
BUFFER_SIZE = 20000
BATCH_SIZE = 64

Add a start and end token to the input and target.

In [None]:
def encode(lang1, lang2):
  lang1 = [tokenizer_pt.vocab_size] + tokenizer_pt.encode(
      lang1.numpy()) + [tokenizer_pt.vocab_size+1]

  lang2 = [tokenizer_en.vocab_size] + tokenizer_en.encode(
      lang2.numpy()) + [tokenizer_en.vocab_size+1]

  return lang1, lang2
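
The start and end ids used in `encode` sit just past the tokenizer's own vocabulary, which means any embedding or output layer has to allow for two extra ids. A plain-Python sketch with a made-up vocabulary size and made-up token ids:

```python
vocab_size = 8000          # hypothetical tokenizer vocabulary size
start_id = vocab_size      # first id past the tokenizer's range
end_id = vocab_size + 1    # second id past the tokenizer's range

token_ids = [17, 342, 5]   # made-up subword ids for one sentence
wrapped = [start_id] + token_ids + [end_id]
print(wrapped)  # [8000, 17, 342, 5, 8001]

# An embedding table indexed by these ids needs room for both extra tokens.
embedding_rows_needed = vocab_size + 2
print(embedding_rows_needed)  # 8002
```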

You want to use `Dataset.map` to apply this function to each element of the dataset.  `Dataset.map` runs in graph mode.

* Graph tensors do not have a value.
* In graph mode you can only use TensorFlow Ops and functions.

So you can't `.map` this function directly: You need to wrap it in a `tf.py_function`. The `tf.py_function` will pass regular tensors (with a value and a `.numpy()` method to access it), to the wrapped python function.

In [None]:
def tf_encode(pt, en):
  result_pt, result_en = tf.py_function(encode, [pt, en], [tf.int64, tf.int64])
  result_pt.set_shape([None])
  result_en.set_shape([None])

  return result_pt, result_en

Note: To keep this example small and relatively fast, drop examples with a length of over 40 tokens.

In [None]:
MAX_LENGTH = 40

In [None]:
def filter_max_length(x, y, max_length=MAX_LENGTH):
  return tf.logical_and(tf.size(x) <= max_length,
                        tf.size(y) <= max_length)

In [None]:
train_dataset = train_examples.map(tf_encode)
train_dataset = train_dataset.filter(filter_max_length)
# cache the dataset to memory to get a speedup while reading from it.
train_dataset = train_dataset.cache()
train_dataset = train_dataset.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)


val_dataset = val_examples.map(tf_encode)
val_dataset = val_dataset.filter(filter_max_length).padded_batch(BATCH_SIZE)
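
The two key steps of this pipeline, dropping long examples and padding each batch to its longest sequence, can be illustrated without `tf.data`. The plain-Python sketch below uses made-up token ids, and the helper names are hypothetical.

```python
MAX_LENGTH = 40

def within_max_length(source_ids, target_ids, max_length=MAX_LENGTH):
    # Keep a pair only if both sides are short enough.
    return len(source_ids) <= max_length and len(target_ids) <= max_length

def pad_batch(sequences, pad_id=0):
    # Pad every sequence on the right up to the longest one in the batch.
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

pairs = [([1, 2, 3], [4, 5]), ([6] * 50, [7]), ([8], [9, 10, 11, 12])]
kept = [pair for pair in pairs if within_max_length(*pair)]

sources = pad_batch([src for src, _ in kept])
targets = pad_batch([tgt for _, tgt in kept])
print(sources)  # [[1, 2, 3], [8, 0, 0]]
print(targets)  # [[4, 5, 0, 0], [9, 10, 11, 12]]
```

Padding with 0 is what makes the `not_equal(..., 0)` padding masks and the masked loss work later on.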

In [None]:
pt_batch, en_batch = next(iter(val_dataset))
pt_batch, en_batch

## Evaluate

The following steps are used for evaluation:

* Encode the input sentence using the Portuguese tokenizer (`tokenizer_pt`). Moreover, add the start and end token so the input is equivalent to what the model is trained with. This is the encoder input.
* The decoder input is the `start token == tokenizer_en.vocab_size`.
* Calculate the padding masks and the look ahead masks.
* The `decoder` then outputs the predictions by looking at the `encoder output` and its own output (self-attention).
* Select the last word and calculate the argmax of that.
* Concatenate the predicted word to the decoder input and pass it to the decoder.
* In this approach, the decoder predicts the next word based on the previous words it predicted.

Note: The model used here has less capacity, to keep the example relatively fast, so the predictions may be less accurate. To reproduce the results in the paper, use the entire dataset and the base transformer model or transformer XL by changing the hyperparameters above.

In [None]:
def evaluate(inp_sentence):
  start_token = [tokenizer_pt.vocab_size]
  end_token = [tokenizer_pt.vocab_size + 1]

  # inp sentence is portuguese, hence adding the start and end token
  inp_sentence = start_token + tokenizer_pt.encode(inp_sentence) + end_token
  encoder_input = tf.expand_dims(inp_sentence, 0)

  # as the target is english, the first word to the transformer should be the
  # english start token.
  decoder_input = [tokenizer_en.vocab_size]
  output = tf.expand_dims(decoder_input, 0)

  for i in range(MAX_LENGTH):
    enc_padding_mask, combined_mask, dec_padding_mask = create_masks(
        encoder_input, output)

    # predictions.shape == (batch_size, seq_len, vocab_size)
    # The transformer returns (logits, encoder attention, decoder attention).
    predictions, _, attention_weights = transformer(encoder_input,
                                                    output,
                                                    False,
                                                    enc_padding_mask,
                                                    combined_mask,
                                                    dec_padding_mask)

    # select the last word from the seq_len dimension
    predictions = predictions[:, -1:, :]  # (batch_size, 1, vocab_size)

    predicted_id = tf.cast(tf.argmax(predictions, axis=-1), tf.int32)

    # return the result if the predicted_id is equal to the end token
    if predicted_id == tokenizer_en.vocab_size+1:
      return tf.squeeze(output, axis=0), attention_weights

    # concatenate the predicted_id to the output which is given to the decoder
    # as its input.
    output = tf.concat([output, predicted_id], axis=-1)

  return tf.squeeze(output, axis=0), attention_weights

In [None]:
import matplotlib.pyplot as plt  # used below; harmless if already imported earlier

def plot_attention_weights(attention, sentence, result, layer):
  fig = plt.figure(figsize=(16, 8))

  sentence = tokenizer_pt.encode(sentence)

  attention = tf.squeeze(attention[layer], axis=0)

  for head in range(attention.shape[0]):
    ax = fig.add_subplot(2, 4, head+1)

    # plot the attention weights
    ax.matshow(attention[head][:-1, :], cmap='viridis')

    fontdict = {'fontsize': 10}

    ax.set_xticks(range(len(sentence)+2))
    ax.set_yticks(range(len(result)))

    ax.set_ylim(len(result)-1.5, -0.5)

    ax.set_xticklabels(
        ['<start>']+[tokenizer_pt.decode([i]) for i in sentence]+['<end>'],
        fontdict=fontdict, rotation=90)

    ax.set_yticklabels([tokenizer_en.decode([i]) for i in result
                        if i < tokenizer_en.vocab_size],
                       fontdict=fontdict)

    ax.set_xlabel('Head {}'.format(head+1))

  plt.tight_layout()
  plt.show()

In [None]:
def translate(sentence, plot=''):
  result, attention_weights = evaluate(sentence)

  predicted_sentence = tokenizer_en.decode([i for i in result
                                            if i < tokenizer_en.vocab_size])

  print('Input: {}'.format(sentence))
  print('Predicted translation: {}'.format(predicted_sentence))

  if plot:
    plot_attention_weights(attention_weights, sentence, result, plot)

In [None]:
translate("este é um problema que temos que resolver.")
print ("Real translation: this is a problem we have to solve .")

In [None]:
translate("os meus vizinhos ouviram sobre esta ideia.")
print ("Real translation: and my neighboring homes heard about this idea .")

In [None]:
translate("vou então muito rapidamente partilhar convosco algumas histórias de algumas coisas mágicas que aconteceram.")
print ("Real translation: so i 'll just share with you some stories very quickly of some magical things that have happened .")

You can pass different layers and attention blocks of the decoder to the `plot` parameter.

In [None]:
translate("este é o primeiro livro que eu fiz.", plot='decoder_layer4_block2')
print ("Real translation: this is the first book i've ever done.")
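
The string passed to `plot` is used to index `attention_weights`, so it has to match one of its keys. A small sketch of the naming pattern suggested by `'decoder_layer4_block2'` follows. The layer count of 4, the two blocks per layer, and the helper name `attention_keys` are assumptions, so check the real names by inspecting `attention_weights` returned by `evaluate`.

```python
NUM_LAYERS = 4   # assumption: inferred from 'decoder_layer4_block2'
BLOCKS = (1, 2)  # assumption: two attention blocks per decoder layer

def attention_keys(num_layers=NUM_LAYERS, blocks=BLOCKS):
    """Build candidate key names following the 'decoder_layer{i}_block{j}' pattern."""
    return ['decoder_layer{}_block{}'.format(layer, block)
            for layer in range(1, num_layers + 1)
            for block in blocks]

for key in attention_keys():
    print(key)
```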

## Summary

In this tutorial, you learned about positional encoding, multi-head attention, the importance of masking and how to create a transformer.

Try using a different dataset to train the transformer. You can also create the base transformer or transformer XL by changing the hyperparameters above. You can also use the layers defined here to create [BERT](https://arxiv.org/abs/1810.04805) and train state-of-the-art models. Furthermore, you can implement beam search to get better predictions.
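
As a starting point for the beam search idea, here is a minimal sketch that is independent of the model in this notebook. It assumes a hypothetical function `log_probs_fn(tokens)` that returns the log-probabilities of the next token; `toy_log_probs` and its probability table are made up for illustration. The sketch sums log-probabilities without length normalization, which a practical implementation would likely add.

```python
import numpy as np

START_ID, END_ID = 3, 2  # toy ids (assumption): words are 0 and 1

# Made-up next-token probabilities over ids [0, 1, END_ID], keyed by the last token.
TOY_TABLE = {
    START_ID: [0.60, 0.39, 0.01],
    0: [0.30, 0.30, 0.40],
    1: [0.05, 0.05, 0.90],
}

def toy_log_probs(tokens):
    """Hypothetical stand-in for the model's next-token distribution."""
    return np.log(np.array(TOY_TABLE[tokens[-1]]))

def beam_search(log_probs_fn, start_id, end_id, beam_width=3, max_length=10):
    beams = [([start_id], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            log_probs = log_probs_fn(tokens)
            # Expand each hypothesis with its `beam_width` most likely next tokens.
            for token_id in np.argsort(log_probs)[::-1][:beam_width]:
                candidates.append((tokens + [int(token_id)],
                                   score + float(log_probs[token_id])))
        # Keep only the best `beam_width` hypotheses overall.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end_id:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_length
    return max(finished, key=lambda c: c[1])

print(beam_search(toy_log_probs, START_ID, END_ID, beam_width=1)[0])  # [3, 0, 2]
print(beam_search(toy_log_probs, START_ID, END_ID, beam_width=2)[0])  # [3, 1, 2]
```

With `beam_width=1` the search reduces to greedy decoding and commits to token 0 (total probability 0.6 × 0.4 = 0.24). With `beam_width=2` it also keeps token 1 alive and finds the more probable sequence (0.39 × 0.9 ≈ 0.35).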