# Pre-Training BERT

**TensorFlow** Implementation of Transformers

In [None]:
import math
import numpy as np
import tensorflow as tf

Using BPEmb (Byte Pair Encoding) Pre-Trained Tokenizer

In [None]:
!pip install BPEmb

from bpemb import BPEmb

Collecting BPEmb
  Downloading bpemb-0.3.4-py3-none-any.whl (19 kB)
Installing collected packages: BPEmb
Successfully installed BPEmb-0.3.4


# Multi Head Self Attention

The attention mechanism allows the model to focus on different parts of the input sequence.<br>
Each part of the input sequence is weighted according to its relevance.

**Scaled Dot Product Self Attention**


Inside each attention head is a **Scaled Dot Product Self-Attention** operation. Given *queries*, *keys*, and *values*, the operation returns a new "mix" of the values.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The following function implements this and also takes a mask to account for padding and for masking future tokens for decoding (i.e. **look-ahead mask**).

In [None]:
def scaled_dot_product_attention(query, key, value, mask=None):
  # calculate key dimension
  key_dim = tf.cast(tf.shape(key)[-1], tf.float32)
  # dot product between query and transpose of key tensor & divide result by square root of key dimension
  scaled_scores = tf.matmul(query, key, transpose_b=True) / np.sqrt(key_dim)

  # if mask provided
  if mask is not None:
    # set elements of scaled_scores to -np.inf where mask is 0
    scaled_scores = tf.where(mask==0, -np.inf, scaled_scores)

  # using softmax function from keras library to calculate attention weights
  # create instance of softmax
  softmax = tf.keras.layers.Softmax()
  # calculate attn weights
  weights = softmax(scaled_scores)

  # multiply computed weights by value tensor to get weighted sum
  # return weightedSum, attentionWeights
  return tf.matmul(weights, value), weights
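As a rough illustration of the masking step, the NumPy sketch below builds a look-ahead mask with `np.tril` and applies it before the softmax. The helper name `masked_softmax` and the random scores are assumptions made for this example; it mirrors the `mask == 0` convention of the function above but is not the notebook's TensorFlow code.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis, ignoring positions where mask == 0."""
    # Masked positions get -inf so that exp(-inf) contributes zero weight.
    masked = np.where(mask == 0, -np.inf, scores)
    # Subtract the row maximum for numerical stability.
    shifted = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

seq_len = 4
# Lower-triangular mask: position i may attend to positions 0..i only.
look_ahead_mask = np.tril(np.ones((seq_len, seq_len)))
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))

weights = masked_softmax(scores, look_ahead_mask)
print(np.round(weights, 3))
```

Every entry above the diagonal comes out as zero, and each row still sums to 1. With `-np.inf`, a row in which every position is masked would produce NaN weights, so some implementations use a large negative constant such as `-1e9` instead.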

Suppose our *queries*, *keys*, and *values* each have a sequence length of 3 and an embedding dimension of 4.

In [None]:
seq_len = 3
embed_dim = 4

# generating random matrices for queries, keys, values
# each matrix has shape (seq_len, embed_dim)
queries = np.random.rand(seq_len, embed_dim)
keys = np.random.rand(seq_len, embed_dim)
values = np.random.rand(seq_len, embed_dim)

print("Queries:\n", queries)

# these matrices are used as inputs

Queries:
 [[0.39613475 0.81264126 0.83716696 0.75259084]
 [0.42993291 0.91525916 0.65473444 0.74614174]
 [0.41472215 0.22721656 0.87131225 0.33616524]]


This would be the self-attention output and weights.

In [None]:
# applying the scaled_dot_product_attention function to the generated matrices
output, attn_weights = scaled_dot_product_attention(queries, keys, values)

# print the resulting output & attention weights
print("Output\n", output, "\n")
print("Weights\n", attn_weights)

# output -> weighted sum
# weights -> how much attention each element in input sequence received during attention computation

Output
 tf.Tensor(
[[0.91317695 0.3725024  0.5247414  0.45452675]
 [0.91351223 0.3793069  0.52532786 0.449358  ]
 [0.9142553  0.37722817 0.52644277 0.4416171 ]], shape=(3, 4), dtype=float32) 

Weights
 tf.Tensor(
[[0.39150867 0.28246477 0.32602656]
 [0.38637903 0.2996719  0.31394908]
 [0.36051583 0.28471372 0.35477048]], shape=(3, 3), dtype=float32)
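To make the two return values more concrete, here is a minimal NumPy re-implementation of the unmasked operation. The function name `attention_np` and the seeded random inputs are assumptions for illustration. The sketch checks two properties visible in the printed results: each row of weights sums to 1 (for example, 0.39150867 + 0.28246477 + 0.32602656 = 1), and each output row is a weighted average of the value rows.

```python
import numpy as np

def attention_np(q, k, v):
    """Unmasked scaled dot-product attention in plain NumPy."""
    d_k = k.shape[-1]
    # (seq_len, d_k) @ (d_k, seq_len) -> (seq_len, seq_len) similarity scores
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(42)
q, k, v = (rng.random((3, 4)) for _ in range(3))

out, w = attention_np(q, k, v)
print(out.shape, w.shape)
print(w.sum(axis=-1))
```

Because the weights are non-negative and sum to 1, every output value lies between the smallest and largest entry of the corresponding column of `v`.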


#### Generating queries, keys, and values for multiple heads.

Now that we can calculate self-attention, we generate the input *queries*, *keys*, and *values* for multiple heads.
<br><br>
The attention mechanism is extended to a multi-head self-attention layer, which allows the model to attend to different parts of the input sequence simultaneously.
<br><br>
Each attention head has its <u>own separate</u> set of *query*, *key*, and *value* weights. Each weight matrix has dimension $d \times d/h$, where $d$ is the embedding dimension and $h$ is the number of heads.

![](https://drive.google.com/uc?export=view&id=1SLWkHQgy4nQPFvvjG5_V8UTtpSAJ2zrr)

We can certainly code it this way. But we can also "simulate" different heads with a single query matrix, a single key matrix, and a single value matrix.
<br><br>
We'll do both. First we'll create *query*, *key*, and *value* vectors using separate weights per head.
<br><br>
We use an example of 12-dimensional embeddings processed by three attention heads.

In [None]:
batch_size = 1
# length of each sequence in input data
seq_len = 3
# each element in sequence represented as a vector of size 12
embed_dim = 12
# number of attention heads
num_heads = 3
# dimension of each attention head
# here, 12//3 == 4
head_dim = embed_dim // num_heads

print(f"Dimension of each head: {head_dim}")

Dimension of each head: 4
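Integer division silently discards any remainder, so the split only works when the embedding dimension is a multiple of the number of heads. A small guard along these lines can catch a bad configuration early; `head_dimension` is a hypothetical helper, not something the notebook defines.

```python
def head_dimension(embed_dim, num_heads):
    """Return the per-head dimension, refusing uneven splits."""
    if embed_dim % num_heads != 0:
        raise ValueError(
            f"embed_dim ({embed_dim}) must be divisible by num_heads ({num_heads})"
        )
    return embed_dim // num_heads

print(head_dimension(12, 3))
```

With the values used here (12 and 3) the helper returns 4, while a call such as `head_dimension(12, 5)` raises a `ValueError`.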


**Using separate weight matrices per head**

Suppose these are our input embeddings. Here we have a batch of 1 containing a sequence of length 3, with each element being a 12-dimensional embedding.

In [None]:
# generating a random 3 Dimensional Matrix/Tensor using NumPy
# round(1) - rounds each element to 1 decimal place
x = np.random.rand(batch_size, seq_len, embed_dim).round(1)
print("Input shape: ", x.shape, "\n")
print("Input:\n", x)

Input shape:  (1, 3, 12) 

Input:
 [[[0.1 0.1 0.5 0.5 0.3 0.9 0.6 0.2 0.9 0.  0.6 0.6]
  [0.1 0.4 0.3 0.8 0.9 0.2 0.3 0.1 0.8 0.9 0.4 0. ]
  [0.5 0.6 0.3 0.1 0.2 0.1 0.5 0.2 0.2 0.9 0.4 0.2]]]


We'll declare three sets of *query* weights (one for each head), three sets of *key* weights, and three sets of *value* weights. Remember that each weight matrix should have a dimension of $d \times d/h$.

In [None]:
# this code initialises the weight matrices for queries, keys and values for Multi-Head Self-Attention (MHSA)
# The query weights for each head.
# three weight matrices for queries in each attention head
wq0 = np.random.rand(embed_dim, head_dim).round(1)
wq1 = np.random.rand(embed_dim, head_dim).round(1)
wq2 = np.random.rand(embed_dim, head_dim).round(1)

# The key weights for each head.
# three weight matrices for keys in each attention head
wk0 = np.random.rand(embed_dim, head_dim).round(1)
wk1 = np.random.rand(embed_dim, head_dim).round(1)
wk2 = np.random.rand(embed_dim, head_dim).round(1)

# The value weights for each head.
# three weight matrices for values in each attention head
wv0 = np.random.rand(embed_dim, head_dim).round(1)
wv1 = np.random.rand(embed_dim, head_dim).round(1)
wv2 = np.random.rand(embed_dim, head_dim).round(1)

In [None]:
# printing query weight matrices for each attention head
print("The three sets of query weights (one for each head):")
print("wq0:\n", wq0)
print("wq1:\n", wq1)
print("wq2:\n", wq2)

The three sets of query weights (one for each head):
wq0:
 [[0.6 0.9 0.5 0.2]
 [0.1 0.6 0.1 0.9]
 [0.3 0.9 0.3 0.6]
 [0.9 0.2 0.  0.5]
 [0.3 0.7 0.7 0.5]
 [0.6 0.2 0.1 0.9]
 [0.6 0.3 0.2 0.5]
 [0.6 0.4 0.1 0.3]
 [0.8 1.  0.8 0.9]
 [0.5 0.4 0.5 0.5]
 [1.  0.5 1.  0.5]
 [0.9 0.8 0.5 0.9]]
wq1:
 [[0.8 0.6 0.7 0.6]
 [0.7 0.5 0.7 0.5]
 [0.2 0.3 0.8 0.8]
 [0.5 0.1 0.2 0.3]
 [0.7 0.4 1.  0.6]
 [0.5 0.2 1.  0.6]
 [0.1 1.  0.7 0.3]
 [1.  0.6 0.8 0.3]
 [0.2 0.8 0.1 0.2]
 [0.3 0.1 0.7 0.3]
 [0.7 0.6 0.4 0.3]
 [0.8 0.3 0.8 0.5]]
wq2:
 [[0.3 0.  0.3 0.1]
 [0.2 0.  0.5 0.9]
 [1.  0.8 1.  0. ]
 [0.7 0.6 0.4 0.2]
 [0.1 0.5 0.3 0.1]
 [0.6 1.  0.2 0.1]
 [0.1 0.7 0.2 0.4]
 [0.2 0.2 0.1 0.7]
 [0.5 0.2 0.9 1. ]
 [0.6 1.  0.5 0.9]
 [0.4 0.3 0.7 0.5]
 [0.2 0.8 0.6 0.7]]


We generate our *queries*, *keys*, and *values* for each head by multiplying our input by the weights.

In [None]:
# generating queries, keys and values for each attention head from the 3-D input 'x' and the per-head weight matrices
# Generated queries, keys, and values for the first head.
q0 = np.dot(x, wq0)
k0 = np.dot(x, wk0)
v0 = np.dot(x, wv0)

# Generated queries, keys, and values for the second head.
q1 = np.dot(x, wq1)
k1 = np.dot(x, wk1)
v1 = np.dot(x, wv1)

# Generated queries, keys, and values for the third head.
q2 = np.dot(x, wq2)
k2 = np.dot(x, wk2)
v2 = np.dot(x, wv2)

# linearly transforming the input separately for each head allows the model to learn different relationships from the same input

These are the resulting *query*, *key*, and *value* vectors for the first head.

In [None]:
# print matrices q, k, v for the first attention head
print("Q, K, and V for first head:\n")

print(f"q0 {q0.shape}:\n", q0, "\n")
print(f"k0 {k0.shape}:\n", k0, "\n")
print(f"v0 {v0.shape}:\n", v0)

Q, K, and V for first head:

q0 (1, 3, 4):
 [[[3.64 3.03 2.27 3.63]
  [3.03 2.92 2.39 3.14]
  [2.27 2.41 1.78 2.38]]] 

k0 (1, 3, 4):
 [[[2.94 2.45 2.98 2.86]
  [2.75 2.99 2.66 2.81]
  [2.73 2.81 2.53 2.65]]] 

v0 (1, 3, 4):
 [[[2.81 2.74 2.09 2.37]
  [2.63 3.11 2.   2.49]
  [1.91 2.33 1.83 2.07]]]


Now that we have our Q, K, V vectors, we can just pass them to our self-attention operation. Here we're calculating the output and attention weights for the first head.

In [None]:
# applying scaled_dot_product_attention to the query, key and value matrices of the first attention head
# returns weighted sum, attention weights
out0, attn_weights0 = scaled_dot_product_attention(q0, k0, v0)

print("Output from first attention head: ", out0, "\n")
print("Attention weights from first head: ", attn_weights0)

Output from first attention head:  tf.Tensor(
[[[2.5630581 2.8115747 2.0031862 2.3609881]
  [2.5548687 2.8076713 2.000835  2.3581898]
  [2.5285792 2.793128  1.9934561 2.3484044]]], shape=(1, 3, 4), dtype=float32) 

Attention weights from first head:  tf.Tensor(
[[[0.39983118 0.40723604 0.19293275]
  [0.39103764 0.40685377 0.20210856]
  [0.3663709  0.40117428 0.23245482]]], shape=(1, 3, 3), dtype=float32)


Here are the other two (attention weights are ignored).

In [None]:
# generating the output for 2nd & 3rd head of multi head
out1, _ = scaled_dot_product_attention(q1, k1, v1)
out2, _ = scaled_dot_product_attention(q2, k2, v2)

print("Output from second attention head: ", out1, "\n")
print("Output from third attention head: ", out2,)

Output from second attention head:  tf.Tensor(
[[[1.9145194 3.2491329 2.9875898 2.4506886]
  [1.9267038 3.2466412 2.9838233 2.4539309]
  [1.9422966 3.2419016 2.9780555 2.4576366]]], shape=(1, 3, 4), dtype=float32) 

Output from third attention head:  tf.Tensor(
[[[2.9323678 2.5136867 2.0913677 2.5176127]
  [2.927381  2.493968  2.078644  2.5378861]
  [2.9293537 2.5305955 2.0998046 2.4916778]]], shape=(1, 3, 4), dtype=float32)


Once we have each head's output, we concatenate them and then put them through a linear layer for further processing.

In [None]:
# combine output from all attention heads by concatenating them
combined_out_a = np.concatenate((out0, out1, out2), axis=-1)
# print shape & combined output
print(f"Combined output from all heads {combined_out_a.shape}:")
print(combined_out_a)

# The final step would be to run combined_out_a through a linear/dense layer
# for further processing.

Combined output from all heads (1, 3, 12):
[[[2.5630581 2.8115747 2.0031862 2.3609881 1.9145194 3.2491329 2.9875898
   2.4506886 2.9323678 2.5136867 2.0913677 2.5176127]
  [2.5548687 2.8076713 2.000835  2.3581898 1.9267038 3.2466412 2.9838233
   2.4539309 2.927381  2.493968  2.078644  2.5378861]
  [2.5285792 2.793128  1.9934561 2.3484044 1.9422966 3.2419016 2.9780555
   2.4576366 2.9293537 2.5305955 2.0998046 2.4916778]]]
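The final projection mentioned in the comment can be sketched in NumPy as a single matrix multiplication. The names `w_o` and `b_o` and the random values are assumptions for illustration; in a trained model these would be learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, embed_dim = 1, 3, 12

# Stand-in for the concatenated per-head outputs.
combined = rng.random((batch_size, seq_len, embed_dim))

# Hypothetical output projection: a dense layer without an activation.
w_o = rng.random((embed_dim, embed_dim))
b_o = np.zeros(embed_dim)

# The same weights are applied to every position in the sequence.
projected = combined @ w_o + b_o
print(projected.shape)
```

The shape stays `(batch_size, seq_len, embed_dim)`, but each output feature now depends on all three heads, which is how the projection mixes the information the heads gathered separately.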


So that's a complete run of **multi-head self-attention** using separate sets of weights per head.<br>

Let's now get the same thing done using a single query weight matrix, single key weight matrix, and single value weight matrix.<br><br>
These were our separate per-head query weights:

In [None]:
# print weight matrices for queries in each attention head
print("Query weights for first head: \n", wq0, "\n")
print("Query weights for second head: \n", wq1, "\n")
print("Query weights for third head: \n", wq2)

Query weights for first head: 
 [[0.6 0.9 0.5 0.2]
 [0.1 0.6 0.1 0.9]
 [0.3 0.9 0.3 0.6]
 [0.9 0.2 0.  0.5]
 [0.3 0.7 0.7 0.5]
 [0.6 0.2 0.1 0.9]
 [0.6 0.3 0.2 0.5]
 [0.6 0.4 0.1 0.3]
 [0.8 1.  0.8 0.9]
 [0.5 0.4 0.5 0.5]
 [1.  0.5 1.  0.5]
 [0.9 0.8 0.5 0.9]] 

Query weights for second head: 
 [[0.8 0.6 0.7 0.6]
 [0.7 0.5 0.7 0.5]
 [0.2 0.3 0.8 0.8]
 [0.5 0.1 0.2 0.3]
 [0.7 0.4 1.  0.6]
 [0.5 0.2 1.  0.6]
 [0.1 1.  0.7 0.3]
 [1.  0.6 0.8 0.3]
 [0.2 0.8 0.1 0.2]
 [0.3 0.1 0.7 0.3]
 [0.7 0.6 0.4 0.3]
 [0.8 0.3 0.8 0.5]] 

Query weights for third head: 
 [[0.3 0.  0.3 0.1]
 [0.2 0.  0.5 0.9]
 [1.  0.8 1.  0. ]
 [0.7 0.6 0.4 0.2]
 [0.1 0.5 0.3 0.1]
 [0.6 1.  0.2 0.1]
 [0.1 0.7 0.2 0.4]
 [0.2 0.2 0.1 0.7]
 [0.5 0.2 0.9 1. ]
 [0.6 1.  0.5 0.9]
 [0.4 0.3 0.7 0.5]
 [0.2 0.8 0.6 0.7]]


An alternative approach to handling the query weights in the MHSA mechanism:
<br>
Suppose that instead of declaring three separate query weight matrices, we had declared a single $d \times d$ matrix. Here we concatenate our per-head query weights rather than declaring a new set of weights, so that we get the same results.

In [None]:
# concatenates query weight matrices for each attention head
wq = np.concatenate((wq0, wq1, wq2), axis=1)
# print shape & the combined query matrix
print(f"Single query weight matrix {wq.shape}: \n", wq)

Single query weight matrix (12, 12): 
 [[0.6 0.9 0.5 0.2 0.8 0.6 0.7 0.6 0.3 0.  0.3 0.1]
 [0.1 0.6 0.1 0.9 0.7 0.5 0.7 0.5 0.2 0.  0.5 0.9]
 [0.3 0.9 0.3 0.6 0.2 0.3 0.8 0.8 1.  0.8 1.  0. ]
 [0.9 0.2 0.  0.5 0.5 0.1 0.2 0.3 0.7 0.6 0.4 0.2]
 [0.3 0.7 0.7 0.5 0.7 0.4 1.  0.6 0.1 0.5 0.3 0.1]
 [0.6 0.2 0.1 0.9 0.5 0.2 1.  0.6 0.6 1.  0.2 0.1]
 [0.6 0.3 0.2 0.5 0.1 1.  0.7 0.3 0.1 0.7 0.2 0.4]
 [0.6 0.4 0.1 0.3 1.  0.6 0.8 0.3 0.2 0.2 0.1 0.7]
 [0.8 1.  0.8 0.9 0.2 0.8 0.1 0.2 0.5 0.2 0.9 1. ]
 [0.5 0.4 0.5 0.5 0.3 0.1 0.7 0.3 0.6 1.  0.5 0.9]
 [1.  0.5 1.  0.5 0.7 0.6 0.4 0.3 0.4 0.3 0.7 0.5]
 [0.9 0.8 0.5 0.9 0.8 0.3 0.8 0.5 0.2 0.8 0.6 0.7]]
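The reason concatenation gives the same results is ordinary block matrix multiplication: multiplying by column-wise concatenated weights is the same as concatenating the individual products. The sketch below checks this with random stand-in weights (an assumption; it does not reuse the notebook's `wq0`, `wq1`, `wq2`) and also compares parameter counts.

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim, num_heads = 12, 3
head_dim = embed_dim // num_heads

x = rng.random((1, 3, embed_dim))
# One (embed_dim, head_dim) weight matrix per head.
per_head = [rng.random((embed_dim, head_dim)) for _ in range(num_heads)]

# Single (embed_dim, embed_dim) matrix built by placing the heads side by side.
w_single = np.concatenate(per_head, axis=1)

q_single = x @ w_single
q_per_head = np.concatenate([x @ w for w in per_head], axis=-1)

print(np.allclose(q_single, q_per_head))
print(w_single.size, sum(w.size for w in per_head))
```

Both layouts hold 144 parameters, so the single-matrix form changes how the weights are stored and multiplied, not how many there are.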


In the same way, pretend we declared a single key weight matrix, and single value weight matrix.

In [None]:
# similarly, concatenating the key and value weight matrices for all 3 attention heads
wk = np.concatenate((wk0, wk1, wk2), axis=1)
wv = np.concatenate((wv0, wv1, wv2), axis=1)

print(f"Single key weight matrix {wk.shape}:\n", wk, "\n")
print(f"Single value weight matrix {wv.shape}:\n", wv)

# this approach packs the same parameters into one matrix each for queries, keys & values; the parameter count is unchanged (3 x 12 x 4 = 12 x 12), but fewer matrix multiplications are needed

Single key weight matrix (12, 12):
 [[0.7 0.9 0.9 0.2 0.2 0.2 0.1 0.1 0.6 0.6 0.1 0.7]
 [0.5 0.4 0.7 1.  0.1 1.  0.4 0.2 0.3 0.5 0.2 0. ]
 [0.2 0.3 0.4 0.9 0.7 0.8 0.9 0.6 0.  0.  1.  0.8]
 [0.3 0.4 0.5 0.3 0.4 0.2 0.3 0.1 0.9 0.9 0.4 0.6]
 [0.6 0.7 0.  0.6 0.  0.2 0.2 0.8 0.2 0.7 0.8 0.1]
 [0.4 0.1 0.9 0.7 0.6 0.5 0.5 0.6 0.7 0.2 0.5 0.5]
 [1.  1.  0.6 0.6 0.5 0.8 0.7 0.5 1.  0.1 0.3 0.8]
 [0.7 1.  0.3 0.2 0.2 0.3 0.2 0.4 0.2 0.8 0.7 1. ]
 [0.3 0.3 0.6 0.1 0.7 0.7 0.7 0.8 0.8 0.5 0.1 0.4]
 [0.7 0.8 0.6 0.8 0.4 0.3 0.1 0.3 0.3 0.7 0.1 0. ]
 [0.8 0.8 0.9 0.5 0.1 0.  0.4 0.8 0.3 0.8 0.6 0.2]
 [0.9 0.2 0.1 0.9 0.2 0.2 0.3 0.4 0.8 0.5 0.9 0.2]] 

Single value weight matrix (12, 12):
 [[0.6 0.1 0.6 0.8 0.6 0.2 0.5 0.6 0.2 1.  0.9 0.7]
 [0.4 0.7 0.7 0.2 1.  0.  1.  0.1 0.8 0.  0.4 0.4]
 [0.4 0.9 0.4 0.1 0.  0.7 0.1 0.5 0.3 0.6 0.7 0.7]
 [0.3 0.8 0.1 0.1 0.6 0.3 0.8 0.6 0.6 1.  0.7 0.4]
 [0.8 0.3 0.3 1.  0.6 0.9 0.5 0.4 0.6 0.7 0.9 0. ]
 [0.1 0.2 0.1 0.6 0.2 1.  0.7 0.1 0.9 0.2 0.1 0.3]
 [0.6

Now we can calculate all our *queries*, *keys*, and *values* with three dot products.

In [None]:
# computing the query, key and value matrices using the concatenated weight matrices for all attention heads

# by performing a dot product between the input tensor 'x' and each concatenated matrix
q_s = np.dot(x, wq)
k_s = np.dot(x, wk)
v_s = np.dot(x, wv)

These are our resulting query vectors (we'll call them "combined queries"). How do we simulate different heads with this?

In [None]:
# print shape & the computed query vector
print(f"Query vectors using a single weight matrix {q_s.shape}:\n", q_s)

Query vectors using a single weight matrix (1, 3, 12):
 [[[3.64 3.03 2.27 3.63 2.5  2.59 3.23 2.28 2.38 3.05 2.78 2.32]
  [3.03 2.92 2.39 3.14 2.39 2.16 3.01 2.07 2.33 2.78 2.68 2.64]
  [2.27 2.41 1.78 2.38 2.12 1.97 2.81 1.79 1.65 2.11 2.02 2.33]]]


Somehow, we need to separate these vectors such that they're treated like three separate sets by the self-attention operation.

In [None]:
# queries for each attention head generated earlier
print(q0, "\n")
print(q1, "\n")
print(q2)

[[[3.64 3.03 2.27 3.63]
  [3.03 2.92 2.39 3.14]
  [2.27 2.41 1.78 2.38]]] 

[[[2.5  2.59 3.23 2.28]
  [2.39 2.16 3.01 2.07]
  [2.12 1.97 2.81 1.79]]] 

[[[2.38 3.05 2.78 2.32]
  [2.33 2.78 2.68 2.64]
  [1.65 2.11 2.02 2.33]]]


Notice how each set of per-head queries looks as if we took the combined queries and chopped them vertically every four dimensions.
<br><br>
We can split our combined queries into $h$ heads, each of dimension $d/h$, using **reshape** and **transpose**.<br><br>
The first step is to *reshape* our combined queries from a shape of:<br>
(batch_size, seq_len, embed_dim)<br>

into a shape of:<br>
(batch_size, seq_len, num_heads, head_dim).
<br>

In the context of MHSA, reshaping organizes the combined queries, keys, and values into separate attention heads.

In [None]:
# reshaping the combined queries generated from a single weight matrix
# reshaping done by tf.reshape
# Note: we can achieve the same thing by passing -1 instead of seq_len.

# reshapes from (batch_size, seq_len, embed_dim) to (batch_size, seq_len, num_heads, head_dim)
q_s_reshaped = tf.reshape(q_s, (batch_size, seq_len, num_heads, head_dim))
# print the shape & original combined query matrix
print(f"Combined queries: {q_s.shape}\n", q_s, "\n")
# print the shape & combined query matrix after reshaping
print(f"Reshaped into separate heads: {q_s_reshaped.shape}\n", q_s_reshaped)

Combined queries: (1, 3, 12)
 [[[3.64 3.03 2.27 3.63 2.5  2.59 3.23 2.28 2.38 3.05 2.78 2.32]
  [3.03 2.92 2.39 3.14 2.39 2.16 3.01 2.07 2.33 2.78 2.68 2.64]
  [2.27 2.41 1.78 2.38 2.12 1.97 2.81 1.79 1.65 2.11 2.02 2.33]]] 

Reshaped into separate heads: (1, 3, 3, 4)
 tf.Tensor(
[[[[3.64 3.03 2.27 3.63]
   [2.5  2.59 3.23 2.28]
   [2.38 3.05 2.78 2.32]]

  [[3.03 2.92 2.39 3.14]
   [2.39 2.16 3.01 2.07]
   [2.33 2.78 2.68 2.64]]

  [[2.27 2.41 1.78 2.38]
   [2.12 1.97 2.81 1.79]
   [1.65 2.11 2.02 2.33]]]], shape=(1, 3, 3, 4), dtype=float64)


At this point, each token's vector is divided into per-head chunks. The next step is to *transpose* the tensor so that the head axis comes before the sequence axis, which simulates vertically chopping our combined queries. After transposing, the dimensions become:<br>
(batch_size, num_heads, seq_len, head_dim)<br>

The transpose groups each head's vectors together, so subsequent operations can treat every head independently.

In [None]:
# transpose the reshaped queries so that the head axis comes before the sequence axis
# perm gives the new order of the axes; [0, 2, 1, 3] swaps the second and third axes
# .numpy() converts the result to a NumPy array
q_s_transposed = tf.transpose(q_s_reshaped, perm=[0, 2, 1, 3]).numpy()

# print the shape and the transposed query matrix
# the transposed matrix has the dimensions (batch_size, num_heads, seq_len, head_dim)
print(f"Queries transposed into \"separate\" heads {q_s_transposed.shape}:\n",
      q_s_transposed)

Queries transposed into "separate" heads (1, 3, 3, 4):
 [[[[3.64 3.03 2.27 3.63]
   [3.03 2.92 2.39 3.14]
   [2.27 2.41 1.78 2.38]]

  [[2.5  2.59 3.23 2.28]
   [2.39 2.16 3.01 2.07]
   [2.12 1.97 2.81 1.79]]

  [[2.38 3.05 2.78 2.32]
   [2.33 2.78 2.68 2.64]
   [1.65 2.11 2.02 2.33]]]]
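It may be tempting to skip the transpose and reshape straight into (batch_size, num_heads, seq_len, head_dim). The sketch below suggests why that does not work: it fills a tensor with labels of the form `10 * token + head` (an encoding invented for this example) and compares the two routes.

```python
import numpy as np

batch_size, seq_len, num_heads, head_dim = 1, 3, 3, 4

# Label every entry with the token and head it belongs to: 10 * token + head.
q = np.zeros((batch_size, seq_len, num_heads * head_dim))
for t in range(seq_len):
    for h in range(num_heads):
        q[0, t, h * head_dim:(h + 1) * head_dim] = 10 * t + h

# Reshape, then transpose: the approach used in this notebook.
correct = q.reshape(batch_size, seq_len, num_heads, head_dim).transpose(0, 2, 1, 3)
# Direct reshape to the target shape: same shape, different contents.
naive = q.reshape(batch_size, num_heads, seq_len, head_dim)

print(correct[0, 1, :, 0])  # head 1 across tokens 0, 1, 2
print(naive[0, 1, :, 0])    # token 1 across heads 0, 1, 2
```

With reshape followed by transpose, slot 1 holds head 1 for every token (labels 1, 11, 21). The direct reshape instead fills slot 1 with all three heads of token 1 (labels 10, 11, 12), so the heads are no longer separated.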


This reshaping and transposition process aligns with the requirements of the attention mechanism, allowing for efficient and parallelized computation of attention scores for each head in the multi-head self-attention model.

If we compare this against the separate per-head queries we calculated previously, we see the same result except we now have all our queries in a single matrix.

In [None]:
print("The separate per-head query matrices from before: ")
print(q0, "\n")
print(q1, "\n")
print(q2)

The separate per-head query matrices from before: 
[[[3.64 3.03 2.27 3.63]
  [3.03 2.92 2.39 3.14]
  [2.27 2.41 1.78 2.38]]] 

[[[2.5  2.59 3.23 2.28]
  [2.39 2.16 3.01 2.07]
  [2.12 1.97 2.81 1.79]]] 

[[[2.38 3.05 2.78 2.32]
  [2.33 2.78 2.68 2.64]
  [1.65 2.11 2.02 2.33]]]


Let's do the exact same thing with our combined keys and values.

In [None]:
# similarly, reshaping & transposing combined keys & values into separate heads
k_s_transposed = tf.transpose(tf.reshape(k_s, (batch_size, -1, num_heads, head_dim)), perm=[0, 2, 1, 3]).numpy()
v_s_transposed = tf.transpose(tf.reshape(v_s, (batch_size, -1, num_heads, head_dim)), perm=[0, 2, 1, 3]).numpy()

print(f"Keys for all heads in a single matrix {k_s_transposed.shape}: \n", k_s_transposed, "\n")
print(f"Values for all heads in a single matrix {v_s_transposed.shape}: \n", v_s_transposed)

# now the keys & values are also organised into separate heads, making it easier to process each head independently

Keys for all heads in a single matrix (1, 3, 3, 4): 
 [[[[2.94 2.45 2.98 2.86]
   [2.75 2.99 2.66 2.81]
   [2.73 2.81 2.53 2.65]]

  [[2.27 2.42 2.67 2.98]
   [1.84 2.2  2.   2.61]
   [1.34 1.96 1.52 1.74]]

  [[3.25 2.4  2.73 2.53]
   [2.57 3.11 2.1  1.72]
   [1.93 2.21 1.54 1.52]]]] 

Values for all heads in a single matrix (1, 3, 3, 4): 
 [[[[2.81 2.74 2.09 2.37]
   [2.63 3.11 2.   2.49]
   [1.91 2.33 1.83 2.07]]

  [[1.85 3.25 3.   2.43]
   [2.55 3.33 2.92 2.68]
   [2.29 1.91 2.1  2.19]]

  [[2.9  2.37 2.   2.67]
   [3.11 3.19 2.53 1.83]
   [2.32 2.4  1.82 1.95]]]]


Set up this way, we can now calculate the outputs from all attention heads with a single call to our self-attention operation.

In [None]:
# the scaled dot product attention applied to the transposed & reshaped queries, keys & values matrices for all attention heads
# returns (weighted sum, attention weights)
all_heads_output, all_attn_weights = scaled_dot_product_attention(q_s_transposed,
                                                                  k_s_transposed,
                                                                  v_s_transposed)
print("Self attention output:\n", all_heads_output)

Self attention output:
 tf.Tensor(
[[[[2.5630581 2.811575  2.0031862 2.3609886]
   [2.5548687 2.8076713 2.000835  2.3581898]
   [2.5285792 2.793128  1.9934561 2.3484044]]

  [[1.9145195 3.249133  2.9875898 2.4506888]
   [1.9267039 3.2466416 2.9838235 2.4539309]
   [1.9422966 3.2419016 2.9780555 2.4576366]]

  [[2.9323678 2.5136867 2.0913677 2.5176127]
   [2.927381  2.493968  2.078644  2.5378861]
   [2.9293537 2.5305955 2.0998046 2.4916778]]]], shape=(1, 3, 3, 4), dtype=float32)


As a sanity check, we can compare this against the outputs from individual heads we calculated earlier:

In [None]:
# printing the per-head outputs (weighted sums) computed earlier
print("Per head outputs from using separate sets of weights per head:")
print(out0, "\n")
print(out1, "\n")
print(out2)

Per head outputs from using separate sets of weights per head:
tf.Tensor(
[[[2.5630581 2.8115747 2.0031862 2.3609881]
  [2.5548687 2.8076713 2.000835  2.3581898]
  [2.5285792 2.793128  1.9934561 2.3484044]]], shape=(1, 3, 4), dtype=float32) 

tf.Tensor(
[[[1.9145194 3.2491329 2.9875898 2.4506886]
  [1.9267038 3.2466412 2.9838233 2.4539309]
  [1.9422966 3.2419016 2.9780555 2.4576366]]], shape=(1, 3, 4), dtype=float32) 

tf.Tensor(
[[[2.9323678 2.5136867 2.0913677 2.5176127]
  [2.927381  2.493968  2.078644  2.5378861]
  [2.9293537 2.5305955 2.0998046 2.4916778]]], shape=(1, 3, 4), dtype=float32)
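Comparing printouts by eye works for a tensor this small, but a tolerance-based check scales better. The sketch below copies the first row of the first head from the two printed results and compares them with `np.allclose`; the tolerance value is an assumption chosen for this example.

```python
import numpy as np

# First row of the first head, copied from the two printed outputs.
single_call = np.array([2.5630581, 2.811575, 2.0031862, 2.3609886])
per_head = np.array([2.5630581, 2.8115747, 2.0031862, 2.3609881])

# Exact equality fails because of tiny floating-point differences.
print(np.array_equal(single_call, per_head))
# A small absolute tolerance treats the two results as equal.
print(np.allclose(single_call, per_head, atol=1e-6))
```

The values differ only around the seventh decimal place, which is typical float32 rounding noise when the same arithmetic is carried out in a different order.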


To get the final concatenated result, we need to reverse our **reshape** and **transpose** operation, starting with the **transpose** this time.

In [None]:
# reversing the transpose & reshape operations bringing output back to the original shape
combined_out_b = tf.reshape(tf.transpose(all_heads_output, perm=[0, 2, 1, 3]),
                            shape=(batch_size, seq_len, embed_dim))
print("Final output from using single query, key, value matrices:\n",
      combined_out_b, "\n")
print("Final output from using separate query, key, value matrices per head:\n",
      combined_out_a)

Final output from using single query, key, value matrices:
 tf.Tensor(
[[[2.5630581 2.811575  2.0031862 2.3609886 1.9145195 3.249133  2.9875898
   2.4506888 2.9323678 2.5136867 2.0913677 2.5176127]
  [2.5548687 2.8076713 2.000835  2.3581898 1.9267039 3.2466416 2.9838235
   2.4539309 2.927381  2.493968  2.078644  2.5378861]
  [2.5285792 2.793128  1.9934561 2.3484044 1.9422966 3.2419016 2.9780555
   2.4576366 2.9293537 2.5305955 2.0998046 2.4916778]]], shape=(1, 3, 12), dtype=float32) 

Final output from using separate query, key, value matrices per head:
 [[[2.5630581 2.8115747 2.0031862 2.3609881 1.9145194 3.2491329 2.9875898
   2.4506886 2.9323678 2.5136867 2.0913677 2.5176127]
  [2.5548687 2.8076713 2.000835  2.3581898 1.9267038 3.2466412 2.9838233
   2.4539309 2.927381  2.493968  2.078644  2.5378861]
  [2.5285792 2.793128  1.9934561 2.3484044 1.9422966 3.2419016 2.9780555
   2.4576366 2.9293537 2.5305955 2.0998046 2.4916778]]]
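The split and merge steps are exact inverses of each other, which can be checked with a round trip. This is a NumPy sketch under the same shape conventions; the function names and signatures are chosen for this example only.

```python
import numpy as np

def split_heads(x, num_heads):
    """(batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, head_dim)."""
    batch_size, seq_len, embed_dim = x.shape
    head_dim = embed_dim // num_heads
    return x.reshape(batch_size, seq_len, num_heads, head_dim).transpose(0, 2, 1, 3)

def merge_heads(x):
    """(batch, num_heads, seq_len, head_dim) -> (batch, seq_len, embed_dim)."""
    batch_size, num_heads, seq_len, head_dim = x.shape
    # Undo the transpose first, then collapse the head axes back together.
    return x.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, num_heads * head_dim)

x = np.arange(2 * 3 * 12, dtype=np.float64).reshape(2, 3, 12)
heads = split_heads(x, num_heads=3)
restored = merge_heads(heads)

print(heads.shape, restored.shape)
print(np.array_equal(restored, x))
```

Head 1 of the first token, `heads[0, 1, 0]`, contains dimensions 4 to 7 of that token's original vector, matching the "chop every four dimensions" picture.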


We can encapsulate everything we just covered in a class.

In [None]:
# this code defines a custom Keras layer for the MHSA mechanism in TensorFlow
class MultiHeadSelfAttention(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads):
    super().__init__()
    self.d_model = d_model
    self.num_heads = num_heads

    # dimension of each attention head
    self.d_head = self.d_model // self.num_heads

    # single projection layers that hold the weights for all heads
    self.wq = tf.keras.layers.Dense(self.d_model)
    self.wk = tf.keras.layers.Dense(self.d_model)
    self.wv = tf.keras.layers.Dense(self.d_model)

    # Linear layer to generate the final output.
    self.dense = tf.keras.layers.Dense(self.d_model)

  def split_heads(self, x):
    # tf.shape also works when the static batch size is unknown (None)
    batch_size = tf.shape(x)[0]

    # (batch_size, seq_len, d_model) -> (batch_size, seq_len, num_heads, d_head)
    split_inputs = tf.reshape(x, (batch_size, -1, self.num_heads, self.d_head))
    # -> (batch_size, num_heads, seq_len, d_head)
    return tf.transpose(split_inputs, perm=[0, 2, 1, 3])

  def merge_heads(self, x):
    batch_size = tf.shape(x)[0]

    # (batch_size, num_heads, seq_len, d_head) -> (batch_size, seq_len, num_heads, d_head)
    merged_inputs = tf.transpose(x, perm=[0, 2, 1, 3])
    # -> (batch_size, seq_len, d_model)
    return tf.reshape(merged_inputs, (batch_size, -1, self.d_model))

  def call(self, q, k, v, mask):
    # project the inputs into combined queries, keys and values
    qs = self.wq(q)
    ks = self.wk(k)
    vs = self.wv(v)

    # separate the combined projections into heads
    qs = self.split_heads(qs)
    ks = self.split_heads(ks)
    vs = self.split_heads(vs)

    # attention for all heads in one call, then concatenate the heads again
    output, attn_weights = scaled_dot_product_attention(qs, ks, vs, mask)
    output = self.merge_heads(output)

    return self.dense(output), attn_weights


In [None]:
# created object of MHSA class/layer
# embedding dimension of 12 & 3 attention heads
mhsa = MultiHeadSelfAttention(12, 3)

# apply mhsa to the input tensors 'x' and no mask applied
output, attn_weights = mhsa(x, x, x, None)
print(f"MHSA output{output.shape}:")
print(output)

MHSA output(1, 3, 12):
tf.Tensor(
[[[-0.6346668  -0.20335697 -0.0600321  -0.3117067   0.38744643
   -0.1365797   0.23443128 -0.06778473 -0.05833702  0.03529059
    0.28844693  0.26457527]
  [-0.6898172  -0.20055974 -0.05509949 -0.3106577   0.39296415
   -0.13288456  0.2341844  -0.06222748 -0.04691401 -0.00130416
    0.29595008  0.2405275 ]
  [-0.66887033 -0.20923251 -0.05644313 -0.30885214  0.40852058
   -0.1331484   0.23667027 -0.07828202 -0.06305532  0.00413021
    0.29170236  0.25428864]]], shape=(1, 3, 12), dtype=float32)
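The call above passes `None` for the mask. When a mask is supplied, it has to broadcast against attention scores of shape (batch_size, num_heads, seq_len, seq_len). One common layout for a padding mask is (batch_size, 1, 1, seq_len), sketched below in NumPy. The padding id `PAD_ID = 0` and the token ids are hypothetical values chosen for this example.

```python
import numpy as np

PAD_ID = 0  # hypothetical padding token id
token_ids = np.array([[5, 8, 2, 0, 0],
                      [7, 3, 0, 0, 0]])

# 1 for real tokens, 0 for padding; the two singleton axes let the mask
# broadcast over heads and over query positions.
padding_mask = (token_ids != PAD_ID).astype(np.float32)[:, np.newaxis, np.newaxis, :]

num_heads, seq_len = 3, token_ids.shape[1]
scores = np.zeros((token_ids.shape[0], num_heads, seq_len, seq_len))

# Same convention as scaled_dot_product_attention: mask == 0 -> -inf.
masked = np.where(padding_mask == 0, -np.inf, scores)
print(padding_mask.shape)
print(masked.shape)
```

Every query position in every head then ignores the padded key positions of its own sequence, because those columns are set to negative infinity before the softmax.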


## Encoder Block

We now build our **Encoder Block**. In addition to the **Multi-Head Self Attention** layer, the **Encoder Block** also has **skip connections**, **layer normalization steps**, and a **two-layer feed-forward neural network**.
<div>
<img src="https://drive.google.com/uc?export=view&id=1D8sLDyQMqqhCjHWOn-I7rZKHugWxFyLy" width="500"/>
</div>

The EncoderBlock consists of two main components:
- a multi-head self-attention (MHSA) layer, and
- a feed-forward network (FFN).

Each component captures a different aspect of the input sequence's information.

The **MultiHeadSelfAttention layer** is responsible for capturing relationships and dependencies between different positions in the input sequence. It achieves this by allowing the model to attend to different parts of the input simultaneously through multiple attention heads.

The **FFN**, consisting of dense layers with rectified linear unit (ReLU) activation, further refines the information obtained from the MHSA layer. Both layers are augmented with dropout for regularization and layer normalization for stabilizing training.

**Layer Normalization** is applied after the residual addition in both the MHSA and FFN sub-layers. It normalizes each token's vector across the embedding dimension, which stabilizes training and can help mitigate vanishing or exploding gradients.

**Dropout** is used during training to prevent overfitting by randomly setting a fraction of input units to zero.
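The way these pieces combine around each sub-layer can be sketched in NumPy as `LayerNorm(x + Dropout(Sublayer(x)))`. The sketch leaves out layer normalization's learnable scale and offset, and the `epsilon` value, the helper names, and the toy sub-layer are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, epsilon=1e-3):
    """Normalize each vector along the last axis to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def dropout(x, rate, rng, training):
    """Inverted dropout: zero a fraction of units and rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def residual_sublayer(x, sublayer, rate, rng, training):
    """Skip connection around a sub-layer, followed by layer normalization."""
    return layer_norm(x + dropout(sublayer(x), rate, rng, training))

rng = np.random.default_rng(0)
x = rng.random((1, 3, 12))
# A toy sub-layer standing in for MHSA or the FFN.
out = residual_sublayer(x, lambda t: 2.0 * t, rate=0.1, rng=rng, training=False)
print(out.shape)
print(out.mean(axis=-1))
```

At inference time (`training=False`) dropout is a no-op, and after normalization every token vector has a mean of approximately zero.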

Since a two-layer feed forward neural network is used in multiple places in the transformer, here's a function which creates and returns one.

In [None]:
# creating a two layer Feed Forward Neural Network
# returns a tf.keras.sequential model with 2 dense layers
def feed_forward_network(d_model, hidden_dim):
  return tf.keras.Sequential([
      # creates the first layer with hidden_dim units & ReLU activation function
      tf.keras.layers.Dense(hidden_dim, activation='relu'),
      # creates second layer with d_model units
      tf.keras.layers.Dense(d_model)
  ])

This is our encoder block containing all the layers and steps from the preceding illustration (plus dropout).

In [None]:
# encoderBlock class that encapsulates the essential components of a transformer encoder block
class EncoderBlock(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, hidden_dim, dropout_rate=0.1):
    super(EncoderBlock, self).__init__()

    self.mhsa = MultiHeadSelfAttention(d_model, num_heads)
    self.ffn = feed_forward_network(d_model, hidden_dim)

    self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
    self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    self.layernorm1 = tf.keras.layers.LayerNormalization()
    self.layernorm2 = tf.keras.layers.LayerNormalization()

  def call(self, x, training, mask):
    mhsa_output, attn_weights = self.mhsa(x, x, x, mask)
    mhsa_output = self.dropout1(mhsa_output, training=training)
    mhsa_output = self.layernorm1(x + mhsa_output)

    ffn_output = self.ffn(mhsa_output)
    ffn_output = self.dropout2(ffn_output, training=training)
    output = self.layernorm2(mhsa_output + ffn_output)

    return output, attn_weights


Suppose we have an embedding dimension of 12, and we want 3 attention heads and a feed forward network with a hidden dimension of 48 (4x the embedding dimension). We would declare and use a single encoder block like so:

In [None]:
# create an EncoderBlock object and apply it to the input tensor 'x'
# embedding dimension = 12, attention heads = 3, hidden dimension of 48
encoder_block = EncoderBlock(12, 3, 48)

block_output, _ = encoder_block(x, True, None)
# print the shape & content of output tensor obtained from encoder block
print(f"Output from single encoder block {block_output.shape}:")
print(block_output)

Output from single encoder block (1, 3, 12):
tf.Tensor(
[[[-1.2493317   0.38759702  1.2440519  -1.1951112  -0.07183307
    1.6845403  -0.90902734 -0.951247    1.2313206  -0.77939856
   -0.15463588  0.7630747 ]
  [-1.7343774   0.52169913  0.9838319  -0.4355791   1.1070977
    0.6072371  -1.1848614  -1.1338309   1.2120467   1.1051723
   -0.4019577  -0.646478  ]
  [-1.6065238   1.2915235   1.4564214   0.01678242  0.490191
    0.8838099  -1.1751978  -1.472565    0.6701128   0.37640637
   -0.35256594 -0.5783942 ]]], shape=(1, 3, 12), dtype=float32)
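Because the block ends with layer normalization, each token's output vector should have a mean near 0 and a variance near 1. A quick check on the first printed row, with the values copied from the output above:

```python
import numpy as np

# First token's output vector, copied from the printed encoder block output.
row = np.array([-1.2493317, 0.38759702, 1.2440519, -1.1951112, -0.07183307,
                1.6845403, -0.90902734, -0.951247, 1.2313206, -0.77939856,
                -0.15463588, 0.7630747])

print(f"mean: {row.mean():.4f}")
print(f"variance: {row.var():.4f}")
```

The variance comes out slightly below 1, which is consistent with a small epsilon being added inside the normalization for numerical stability.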


## Word and Positional Embeddings

Let's now deal with the actual input to the **initial** encoder block. The inputs are going to be *positional word embeddings*. That is, word embeddings with some positional information added to them.
<br>

Let's start with **subword** tokenization. For demonstration, we'll use a subword tokenizer called **BPEmb**. It uses **Byte-Pair Encoding** and supports over two hundred languages.

In [None]:
# Load the BPEmb tokenizer for English language.
bpemb_en = BPEmb(lang="en")

downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model


100%|██████████| 400869/400869 [00:00<00:00, 1031209.49B/s]


downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d100.w2v.bin.tar.gz


100%|██████████| 3784656/3784656 [00:00<00:00, 5263302.01B/s]


The library comes with pre-trained embeddings for every subword token in its vocabulary.

**BPEmb** is capable of breaking words down into subword units.

In [None]:
# exploring the vocabulary size & embedding size of BPEmb for English
bpemb_vocab_size, bpemb_embed_size = bpemb_en.vectors.shape

# print the vocabulary size which is the number of unique subword tokens in BPEmb vocabulary
print("Vocabulary size:", bpemb_vocab_size)
# Print the embedding size which is the dimensionality of embedding for each subword token
print("Embedding size:", bpemb_embed_size)

Vocabulary size: 10000
Embedding size: 100


In [None]:
# Embedding for the word "car".
# find index of car in bpemb vocabulary, and retrieve the embedding vector
bpemb_en.vectors[bpemb_en.words.index('car')]

array([-0.305548, -0.325598, -0.134716, -0.078735, -0.660545,  0.076211,
       -0.735487,  0.124533, -0.294402,  0.459688,  0.030137,  0.174041,
       -0.224223,  0.486189, -0.504649, -0.459699,  0.315747,  0.477885,
        0.091398,  0.427867,  0.016524, -0.076833, -0.899727,  0.493158,
       -0.022309, -0.422785, -0.154148,  0.204981,  0.379834,  0.070588,
        0.196073, -0.368222,  0.473406,  0.007409,  0.004303, -0.007823,
       -0.19103 , -0.202509,  0.109878, -0.224521, -0.35741 , -0.611633,
        0.329958, -0.212956, -0.497499, -0.393839, -0.130101, -0.216903,
       -0.105595, -0.076007, -0.483942, -0.139704, -0.161647,  0.136985,
        0.415363, -0.360143,  0.038601, -0.078804, -0.030421,  0.324129,
        0.223378, -0.523636, -0.048317, -0.032248, -0.117367,  0.470519,
        0.225816, -0.222065, -0.225007, -0.165904, -0.334389, -0.20157 ,
        0.572352, -0.268794,  0.301929, -0.005563,  0.387491,  0.261031,
       -0.11613 ,  0.074982, -0.008433,  0.259987, 

The *encode* method splits a sentence into subword tokens. **BPEmb** places underscores in front of any tokens which are whole words or intended to begin words.<br>

In [None]:
# encoding a sample sentence using BPEmb
sample_sentence = "Where can I find a pizzeria?"
# list of subword tokens representing the encoded sentence
tokens = bpemb_en.encode(sample_sentence)
print(tokens)

['▁where', '▁can', '▁i', '▁find', '▁a', '▁p', 'iz', 'zer', 'ia', '?']
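
Because the word-start marker carries the whitespace information, the original (lowercased) sentence can be recovered from the tokens alone. The sketch below shows this in plain Python; `detokenize` and `count_words` are hypothetical helpers, not part of the BPEmb API.

```python
# Tokens copied from the output above; "▁" (U+2581) marks the start of a word.
tokens = ['▁where', '▁can', '▁i', '▁find', '▁a', '▁p', 'iz', 'zer', 'ia', '?']

def detokenize(subword_tokens):
    """Join subword tokens and turn each word-start marker back into a space."""
    return "".join(subword_tokens).replace("▁", " ").strip()

def count_words(subword_tokens):
    """Count the tokens that begin a new word."""
    return sum(token.startswith("▁") for token in subword_tokens)

print(detokenize(tokens))
print(count_words(tokens))
```

Note that "pizzeria" is split into four pieces (`▁p`, `iz`, `zer`, `ia`), and only the first piece carries the marker.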


We can retrieve each subword token's respective id using the *encode_ids* method.

In [None]:
# encoding a sample sentence and obtaining the corresponding token sequence as IDs
token_seq = np.array(bpemb_en.encode_ids("Where can I find a pizzeria?"))
print(token_seq)

[ 571  280  386 1934    4   24  248 4339  177 9967]


In [None]:
# create token embeddings with a TensorFlow 'Embedding' layer sized to the BPEmb vocabulary

# the layer is a trainable lookup table with one row of size embed_dim per token id
token_embed = tf.keras.layers.Embedding(bpemb_vocab_size, embed_dim)
# look up the embedding vector for each id in the token sequence
token_embeddings = token_embed(token_seq)

# The untrained embeddings for our sample sentence.
print("Embeddings for: ", sample_sentence)
# the output is a matrix where each row is the embedding vector for one token in the sample sentence
print(token_embeddings)

Embeddings for:  Where can I find a pizzeria?
tf.Tensor(
[[ 0.01877591  0.04493973 -0.04472575  0.04045472  0.04444433 -0.0323297
  -0.03919303 -0.01796792  0.02011057  0.04754338 -0.03064116 -0.03071729]
 [ 0.0373047  -0.00306307  0.03398528  0.04882057 -0.00872584 -0.03383617
   0.03007949 -0.04008768 -0.02609296 -0.03657616 -0.02096267  0.02631446]
 [ 0.035891    0.01560453 -0.01480514 -0.04102575 -0.04684935  0.00101014
   0.02703222 -0.00010967  0.00412986  0.04625067 -0.01408564 -0.0260816 ]
 [-0.02247748 -0.03394116 -0.04476294  0.0323401  -0.01583917 -0.04266869
   0.03968633 -0.03752413 -0.04775747  0.0411106  -0.04240204 -0.0010937 ]
 [ 0.0335531   0.03205225 -0.03083462 -0.00949649  0.0287616   0.02099172
  -0.00950741  0.0219879   0.03931971  0.0407032  -0.02818347  0.00056896]
 [ 0.02590947 -0.02288081 -0.0368258   0.0169645   0.0479197  -0.04399425
  -0.0054667   0.00768536 -0.02416285 -0.04107077  0.03266292  0.02446197]
 [-0.02521857  0.01569629 -0.02068651 -0.01734277 

Next, we need to add *positional* information to each token embedding.

Here, we're declaring an embedding layer with rows equalling a maximum sequence length and columns equalling our token embedding size. We then generate a vector of position ids.

In [None]:
max_seq_len = 256
# creates an embedding layer for Positional Embeddings
pos_embed = tf.keras.layers.Embedding(max_seq_len, embed_dim)

# Generate ids for each position of the token sequence.
pos_idx = tf.range(len(token_seq))

# the output will be a tensor containing indices from 0 to the length of the token sequence minus one,
# representing the positions of the tokens. These indices will be used as input to the positional embedding layer.
print(pos_idx)

tf.Tensor([0 1 2 3 4 5 6 7 8 9], shape=(10,), dtype=int32)


In [None]:
# using the positional embedding layer to generate positional embeddings for the input sequence
# These are our position embeddings.
position_embeddings = pos_embed(pos_idx)
print("Position embeddings for the input sequence\n", position_embeddings)

Position embeddings for the input sequence
 tf.Tensor(
[[ 0.01692391  0.02111561  0.02567815 -0.02572535  0.03261575 -0.02261087
   0.03835258  0.02535808  0.02789705 -0.00616636 -0.01751785 -0.01163245]
 [ 0.04502137  0.03964316  0.02471683  0.03327275 -0.01073105  0.0298312
  -0.01978792  0.02136698 -0.04949666 -0.03584696 -0.00270611  0.01852706]
 [-0.04619611 -0.02274891  0.03636912 -0.00894393  0.04542211 -0.00112689
   0.00344551 -0.02542012  0.03489855 -0.03418241 -0.01578061 -0.00894616]
 [ 0.03370149  0.0258478   0.04965815 -0.03589146  0.03665927 -0.0175827
  -0.01707397 -0.0061394   0.02230716  0.04044897 -0.03248759  0.02479071]
 [ 0.04423285  0.04294313  0.03324581 -0.02127866  0.00896497  0.04663224
   0.03240566  0.04757052 -0.01249263  0.03798172 -0.02727319  0.03061387]
 [-0.03438581  0.03343519  0.04715859 -0.01928452 -0.01002663 -0.00015505
   0.00237979  0.00665903 -0.01981593  0.01878598  0.04217824 -0.00211834]
 [ 0.02412728  0.00867995  0.04168347 -0.0312158  -0.

In [None]:
# the input to the encoder block is the positional word embeddings:
# token embeddings and positional embeddings added together

# token_embeddings -> untrained embeddings looked up for each BPEmb token id
# position_embeddings -> embeddings looked up for each token's position in the sequence
input = token_embeddings + position_embeddings
print("Input to the initial encoder block:\n", input)

Input to the initial encoder block:
 tf.Tensor(
[[ 0.03569982  0.06605534 -0.01904761  0.01472937  0.07706008 -0.05494057
  -0.00084046  0.00739017  0.04800762  0.04137702 -0.04815902 -0.04234974]
 [ 0.08232607  0.03658009  0.05870211  0.08209331 -0.0194569  -0.00400498
   0.01029157 -0.0187207  -0.07558962 -0.07242312 -0.02366878  0.04484153]
 [-0.01030511 -0.00714438  0.02156398 -0.04996967 -0.00142724 -0.00011674
   0.03047773 -0.02552979  0.03902841  0.01206826 -0.02986624 -0.03502776]
 [ 0.01122401 -0.00809336  0.00489522 -0.00355136  0.0208201  -0.06025139
   0.02261237 -0.04366352 -0.02545031  0.08155958 -0.07488963  0.02369701]
 [ 0.07778595  0.07499538  0.00241119 -0.03077514  0.03772657  0.06762396
   0.02289826  0.06955842  0.02682707  0.07868492 -0.05545666  0.03118284]
 [-0.00847634  0.01055439  0.01033279 -0.00232002  0.03789307 -0.0441493
  -0.00308691  0.01434439 -0.04397879 -0.02228479  0.07484116  0.02234364]
 [-0.00109129  0.02437624  0.02099696 -0.04855857 -0.089779
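
To make the lookup-and-add step concrete, here is a minimal NumPy sketch. The two random tables are stand-ins for the untrained `Embedding` layers (their value range is an assumption chosen to resemble the numbers printed above), and `token_table`, `position_table`, and `encoder_input` are names introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, dim = 10000, 256, 12

# Stand-ins for the two Embedding layers: one row per token id / position id.
token_table = rng.uniform(-0.05, 0.05, size=(vocab_size, dim))
position_table = rng.uniform(-0.05, 0.05, size=(max_len, dim))

token_ids = np.array([571, 280, 386, 1934, 4, 24, 248, 4339, 177, 9967])
position_ids = np.arange(len(token_ids))

# An embedding lookup is just row indexing into the table.
token_vectors = token_table[token_ids]
position_vectors = position_table[position_ids]
encoder_input = token_vectors + position_vectors
print(encoder_input.shape)

# The same token id at two positions ends up with two different input vectors.
repeated = token_table[[177, 177]] + position_table[[0, 1]]
print(np.allclose(repeated[0], repeated[1]))
```

The second check is the point of positional embeddings: without them, self-attention would have no way to tell two occurrences of the same token apart.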

In [None]:
# defining a custom Encoder layer/class in Tensorflow
class Encoder(tf.keras.layers.Layer):
  # constructor sets up the parameters of the encoder
  def __init__(self, num_blocks, d_model, num_heads, hidden_dim, src_vocab_size,
               max_seq_len, dropout_rate=0.1):
    super(Encoder, self).__init__()

    self.d_model = d_model
    self.max_seq_len = max_seq_len

    # creates layers for token_embeddings & positional_embeddings
    self.token_embed = tf.keras.layers.Embedding(src_vocab_size, self.d_model)
    self.pos_embed = tf.keras.layers.Embedding(max_seq_len, self.d_model)

    # The original Attention Is All You Need paper applied dropout to the
    # input before feeding it to the first encoder block.
    # dropout layer for regularization
    self.dropout = tf.keras.layers.Dropout(dropout_rate)

    # Create encoder blocks.
    self.blocks = [EncoderBlock(self.d_model, num_heads, hidden_dim, dropout_rate)
    for _ in range(num_blocks)]

  def call(self, input, training, mask):
    token_embeds = self.token_embed(input)

    # Generate position indices for a batch of input sequences.
    # Note: the resize/reshape below assumes every input sequence has length max_seq_len.
    num_pos = input.shape[0] * self.max_seq_len
    pos_idx = np.resize(np.arange(self.max_seq_len), num_pos)
    pos_idx = np.reshape(pos_idx, input.shape)
    pos_embeds = self.pos_embed(pos_idx)

    x = self.dropout(token_embeds + pos_embeds, training=training)

    # Run input through successive encoder blocks.
    for block in self.blocks:
      x, weights = block(x, training, mask)

    # the final output is the encoded sequence along with the attention weights from the last encoder block
    return x, weights

In [None]:
# Batch of 3 sequences, each of length 10 (10 is also the
# maximum sequence length in this case).
# generates a numpy array of shape (3, 10) with random integers in [0, 10000)
seqs = np.random.randint(0, 10000, size=(3, 10))
print(seqs.shape)
print(seqs)

(3, 10)
[[4785 5502 4180 6805  103 6090 8835 9747 8568 5881]
 [8243 8291 9846  738 2440 9254 5446 2755 9005 9694]
 [6090  194  389 7308 8451 1121  800 1025  459 2237]]


In [None]:
# generating positional indices for a batch of sequences
# pos_ids -> 1 dimensional array containing positional indices for all positions
pos_ids = np.resize(np.arange(seqs.shape[1]), seqs.shape[0] * seqs.shape[1])
print(pos_ids)

[0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9]


In [None]:
# reshaping the previously generated positional indices into a 2-d array that matches the shape of the batch generated earlier
pos_ids = np.reshape(pos_ids, (3, 10))
print(pos_ids.shape)
print(pos_ids)

(3, 10)
[[0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]]
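
The resize-and-reshape approach used here and in `Encoder.call` relies on the batch's sequence length being equal to `max_seq_len`; with a shorter batch, the reshape would fail because the number of generated indices no longer matches the input shape. One possible alternative, sketched below in NumPy, builds the indices from the actual sequence length and broadcasts them over the batch. `position_ids` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np

def position_ids(batch_size, seq_len):
    """Return a (batch_size, seq_len) array whose rows are 0..seq_len-1."""
    return np.broadcast_to(np.arange(seq_len), (batch_size, seq_len))

# Matches the resize-and-reshape result when seq_len equals the maximum length.
full = np.reshape(np.resize(np.arange(10), 3 * 10), (3, 10))
print(np.array_equal(position_ids(3, 10), full))

# Also works for a batch padded to a shorter length, e.g. 7 when max_seq_len is 10.
print(position_ids(2, 7))
```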


In [None]:
# obtaining the positional embeddings using pos_embed layer defined earlier
pos_embed(pos_ids)

<tf.Tensor: shape=(3, 10, 12), dtype=float32, numpy=
array([[[ 0.01692391,  0.02111561,  0.02567815, -0.02572535,
          0.03261575, -0.02261087,  0.03835258,  0.02535808,
          0.02789705, -0.00616636, -0.01751785, -0.01163245],
        [ 0.04502137,  0.03964316,  0.02471683,  0.03327275,
         -0.01073105,  0.0298312 , -0.01978792,  0.02136698,
         -0.04949666, -0.03584696, -0.00270611,  0.01852706],
        [-0.04619611, -0.02274891,  0.03636912, -0.00894393,
          0.04542211, -0.00112689,  0.00344551, -0.02542012,
          0.03489855, -0.03418241, -0.01578061, -0.00894616],
        [ 0.03370149,  0.0258478 ,  0.04965815, -0.03589146,
          0.03665927, -0.0175827 , -0.01707397, -0.0061394 ,
          0.02230716,  0.04044897, -0.03248759,  0.02479071],
        [ 0.04423285,  0.04294313,  0.03324581, -0.02127866,
          0.00896497,  0.04663224,  0.03240566,  0.04757052,
         -0.01249263,  0.03798172, -0.02727319,  0.03061387],
        [-0.03438581,  0.03

In [None]:
# using the BPEmb tokenizer to encode a batch of input sequences
# list of 3 sentences
input_batch = [
    "Where can I find a pizzeria?",
    "Mass hysteria over listeria.",
    "I ain't no circle back girl."
]

# encoding the input batch
bpemb_en.encode(input_batch)

# the output is a list of lists: one list of subword tokens per sentence

[['▁where', '▁can', '▁i', '▁find', '▁a', '▁p', 'iz', 'zer', 'ia', '?'],
 ['▁mass', '▁hy', 'ster', 'ia', '▁over', '▁l', 'ister', 'ia', '.'],
 ['▁i', '▁a', 'in', "'", 't', '▁no', '▁circle', '▁back', '▁girl', '.']]

In [None]:
# encode the batch directly as ids: each subword token is replaced
# by its numerical ID in the BPEmb vocabulary
input_seqs = bpemb_en.encode_ids(input_batch)
print("Vectorized inputs:")
input_seqs

Vectorized inputs:


[[571, 280, 386, 1934, 4, 24, 248, 4339, 177, 9967],
 [1535, 1354, 1238, 177, 380, 43, 871, 177, 9935],
 [386, 4, 6, 9937, 9915, 467, 5410, 810, 3692, 9935]]

In [None]:
# padding the input sequences to ensure all sequences have the same length
# padding using the post padding strategy, which means padding added at the end of each sequence
padded_input_seqs = tf.keras.preprocessing.sequence.pad_sequences(input_seqs, padding="post")
print("Input to the encoder:")
print(padded_input_seqs.shape)
print(padded_input_seqs)

Input to the encoder:
(3, 10)
[[ 571  280  386 1934    4   24  248 4339  177 9967]
 [1535 1354 1238  177  380   43  871  177 9935    0]
 [ 386    4    6 9937 9915  467 5410  810 3692 9935]]
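
Post-padding itself is a simple operation. The pure-Python sketch below reproduces the result above for illustration; `pad_post` is a hypothetical helper, not the Keras function.

```python
def pad_post(sequences, pad_id=0):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = [
    [571, 280, 386, 1934, 4, 24, 248, 4339, 177, 9967],
    [1535, 1354, 1238, 177, 380, 43, 871, 177, 9935],
    [386, 4, 6, 9937, 9915, 467, 5410, 810, 3692, 9935],
]
padded = pad_post(batch)
for row in padded:
    print(row)
```

Only the second sequence is shorter than the longest one, so it is the only row that receives a trailing 0.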


In [None]:
# create the encoder padding mask: 1.0 for real tokens, 0.0 for padding (id 0)
# the mask is used to ignore padded positions during the self-attention calculation
enc_mask = tf.cast(tf.math.not_equal(padded_input_seqs, 0), tf.float32)
print("Input:")
print(padded_input_seqs, '\n')
print("Encoder mask:")
print(enc_mask)

Input:
[[ 571  280  386 1934    4   24  248 4339  177 9967]
 [1535 1354 1238  177  380   43  871  177 9935    0]
 [ 386    4    6 9937 9915  467 5410  810 3692 9935]] 

Encoder mask:
tf.Tensor(
[[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]], shape=(3, 10), dtype=float32)


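Next, we add two new axes to the encoder mask, changing its shape from (3, 10) to (3, 1, 1, 10). The extra axes let the mask broadcast across the attention heads and the query positions when it is applied to the attention scores.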

In [None]:
enc_mask = enc_mask[:, tf.newaxis, tf.newaxis, :]
enc_mask

<tf.Tensor: shape=(3, 1, 1, 10), dtype=float32, numpy=
array([[[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]],


       [[[1., 1., 1., 1., 1., 1., 1., 1., 1., 0.]]],


       [[[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]]]], dtype=float32)>
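
The NumPy sketch below illustrates what a mask of this shape does to attention weights. It assumes the scores have shape (batch, heads, query positions, key positions), as in a typical multi-head implementation, and it uses random scores rather than real model values.

```python
import numpy as np

batch, heads, seq_len = 3, 3, 10
rng = np.random.default_rng(0)
scores = rng.normal(size=(batch, heads, seq_len, seq_len))

# Padding mask with the same values as above: only the last token of sequence 1 is padding.
mask = np.ones((batch, seq_len))
mask[1, -1] = 0.0
mask = mask[:, np.newaxis, np.newaxis, :]   # shape (3, 1, 1, 10)

# Broadcasting stretches the mask over the heads axis and the query axis.
masked_scores = np.where(mask == 0, -np.inf, scores)

# Softmax over the key axis; exp(-inf) is 0, so masked keys get zero weight.
exp = np.exp(masked_scores - masked_scores.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)

print(weights.shape)
print(weights[1, :, :, -1].max())   # weight given to the padded key in sequence 1
```

Every query in every head of the second sequence assigns zero weight to the padded position, while each row of weights still sums to 1.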

Now we can declare an encoder and pass it batches of vectorized sequences.

In [None]:
num_encoder_blocks = 6

# d_model is the embedding dimension used throughout.
d_model = 12

num_heads = 3

# Feed-forward network hidden dimension width.
ffn_hidden_dim = 48

src_vocab_size = bpemb_vocab_size
max_input_seq_len = padded_input_seqs.shape[1]

encoder = Encoder(
    num_encoder_blocks,
    d_model,
    num_heads,
    ffn_hidden_dim,
    src_vocab_size,
    max_input_seq_len)

We can now pass our input sequences and mask to the encoder.

In [None]:
encoder_output, attn_weights = encoder(padded_input_seqs, training=True,
                                       mask=enc_mask)
print(f"Encoder output {encoder_output.shape}:")
print(encoder_output)

Encoder output (3, 10, 12):
tf.Tensor(
[[[-2.3759533e-01 -4.4916460e-01  1.2013686e+00  9.4879550e-01
    2.7466428e-01 -8.3624005e-01 -1.7535932e+00 -1.2246120e+00
    1.7781190e+00 -2.4413000e-01  8.5013664e-01 -3.0774871e-01]
  [-1.6646397e+00 -1.4885356e+00  3.5631502e-01  3.7099501e-01
    5.3556651e-02 -7.2502893e-01 -3.3900288e-01  5.2388734e-01
    3.7181333e-01 -6.3117099e-01  1.9444393e+00  1.2273716e+00]
  [-4.2721343e-01 -9.0052187e-01 -7.9527390e-01  5.6468016e-01
   -8.4052050e-01 -1.2944179e+00 -1.3726142e-01 -4.3816757e-01
    1.4405066e+00  5.5188495e-01  2.2530291e+00  2.3276091e-02]
  [-4.0218192e-03 -7.9066235e-01 -1.4767843e+00  1.6126964e+00
   -8.5676950e-01 -9.3014961e-01 -3.7211123e-01  3.3925787e-02
    1.1385206e+00 -6.9193947e-01  1.5803269e+00  7.5696856e-01]
  [ 2.0184344e-01  8.2230069e-02 -1.6016852e+00  1.3001662e+00
   -1.6414717e+00 -1.4324750e+00  3.7385365e-01 -2.1546854e-01
    3.3927709e-01  4.7003084e-01  1.3491774e+00  7.7452147e-01]
  [-1.09277

## Decoder Block

Let's build the **Decoder Block**. Everything we did to create the **encoder** block applies here. The major differences are that the **Decoder Block** has:
1. a **Multi-Head Cross-Attention** layer which uses the encoder's outputs as the keys and values.

2. an extra skip/residual connection along with an extra layer normalization step.

<div>
<img src="https://drive.google.com/uc?export=view&id=1WVT4SX49bnta4uscOTF4xrsxFI4PbPER" width="500"/>
</div>

# Major Part - 2

In [None]:
class DecoderBlock(tf.keras.layers.Layer):
  def __init__(self, d_model, num_heads, hidden_dim, dropout_rate=0.1):
    super(DecoderBlock, self).__init__()

    self.mhsa1 = MultiHeadSelfAttention(d_model, num_heads)
    self.mhsa2 = MultiHeadSelfAttention(d_model, num_heads)

    self.ffn = feed_forward_network(d_model, hidden_dim)

    self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
    self.dropout2 = tf.keras.layers.Dropout(dropout_rate)
    self.dropout3 = tf.keras.layers.Dropout(dropout_rate)

    self.layernorm1 = tf.keras.layers.LayerNormalization()
    self.layernorm2 = tf.keras.layers.LayerNormalization()
    self.layernorm3 = tf.keras.layers.LayerNormalization()

  # Note the decoder block takes two masks: one for the first attention layer
  # (masked self-attention), another for the second (cross-attention).
  def call(self, encoder_output, target, training, decoder_mask, memory_mask):
    # masked self-attention over the target sequence
    mhsa_output1, attn_weights = self.mhsa1(target, target, target, decoder_mask)
    mhsa_output1 = self.dropout1(mhsa_output1, training=training)
    mhsa_output1 = self.layernorm1(mhsa_output1 + target)

    # cross-attention: queries come from the decoder, keys and values from the
    # encoder output; attn_weights is overwritten here, so the block returns
    # the cross-attention weights
    mhsa_output2, attn_weights = self.mhsa2(mhsa_output1, encoder_output,
                                            encoder_output,
                                            memory_mask)
    mhsa_output2 = self.dropout2(mhsa_output2, training=training)
    mhsa_output2 = self.layernorm2(mhsa_output2 + mhsa_output1)

    ffn_output = self.ffn(mhsa_output2)
    ffn_output = self.dropout3(ffn_output, training=training)
    output = self.layernorm3(ffn_output + mhsa_output2)

    return output, attn_weights
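
As a rough sketch of what the second attention layer computes, the NumPy example below runs scaled dot-product attention with queries taken from the decoder and keys and values taken from the encoder. It is deliberately simplified (a single head, no learned projections, no mask, random inputs), so it only illustrates the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 12
decoder_states = rng.normal(size=(8, d_model))    # queries: 8 target positions
encoder_output = rng.normal(size=(10, d_model))   # keys and values: 10 source positions

# Scaled dot-product attention with decoder queries and encoder keys/values.
scores = decoder_states @ encoder_output.T / np.sqrt(d_model)
exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)
context = weights @ encoder_output

print(weights.shape)   # one row per target position, one column per source position
print(context.shape)   # same shape as the decoder states, so the residual add works
```

The target and source lengths may differ; the attention weights form a rectangular (target, source) matrix, which is why the memory mask is built from the encoder's padding rather than the decoder's.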


## Decoder

The decoder generates the output sequence by stacking decoder blocks, each of which combines masked multi-head self-attention, multi-head cross-attention, and a feed-forward network. Cross-attention is how the decoder reads the encoder output to understand context.

The decoder is almost the same as the encoder except it takes the encoder's output as part of its input, and it takes two masks: the decoder mask and memory mask.

In [None]:
class Decoder(tf.keras.layers.Layer):
  def __init__(self, num_blocks, d_model, num_heads, hidden_dim, target_vocab_size,
               max_seq_len, dropout_rate=0.1):
    super(Decoder, self).__init__()

    self.d_model = d_model
    self.max_seq_len = max_seq_len

    # creates an embedding layer for target vocabulary
    self.token_embed = tf.keras.layers.Embedding(target_vocab_size, self.d_model)
    # creates another embedding layer for positional encodings
    self.pos_embed = tf.keras.layers.Embedding(max_seq_len, self.d_model)
    # creates a dropout layer
    self.dropout = tf.keras.layers.Dropout(dropout_rate)

    # initialises the decoder blocks, each containing two attention layers and an FFN
    self.blocks = [DecoderBlock(self.d_model, num_heads, hidden_dim, dropout_rate) for _ in range(num_blocks)]

  # call method that executes the forward pass
  def call(self, encoder_output, target, training, decoder_mask, memory_mask):
    token_embeds = self.token_embed(target)

    # Generate position indices.
    # Note: the resize/reshape below assumes every target sequence has length max_seq_len.
    num_pos = target.shape[0] * self.max_seq_len
    pos_idx = np.resize(np.arange(self.max_seq_len), num_pos)
    pos_idx = np.reshape(pos_idx, target.shape)

    pos_embeds = self.pos_embed(pos_idx)

    # combine the token and positional embeddings, apply dropout, and assign the result to 'x'
    x = self.dropout(token_embeds + pos_embeds, training=training)

    # starts a loop over all decoder blocks in the blocks list
    for block in self.blocks:
      x, weights = block(encoder_output, x, training, decoder_mask, memory_mask)

    return x, weights

Before we try the decoder, let's cover the masks involved. The decoder takes two masks: the *decoder mask* and the *memory mask*.

The *decoder mask* is a <u>combination of two masks</u> and is used in the decoder's **first** multi-head self-attention layer:

1. The *padding mask* marks padding positions in the target sequences as 0 and all other positions as 1, so the model doesn't attend to padding tokens during self-attention.

2. The *look-ahead mask* ensures that each position can attend only to itself and to earlier positions. It prevents the model from peeking at future tokens during training, maintaining the autoregressive property of the decoder.

The *memory mask* is used in the decoder's **second** attention layer, the multi-head cross-attention. The keys and values for this layer are the encoder's output, and this mask ensures the decoder doesn't attend to any encoder output which corresponds to padding.

Now, suppose we have a batch of vectorized target input sequences for the decoder. These values are just made up.

These sequences are prepared like the encoder's input sequences: each token is represented by its index in the target vocabulary.

In [None]:
# Made up values.
target_input_seqs = [
    [1, 652, 723, 123, 62],
    [1, 25,  98, 129, 248, 215, 359, 249],
    [1, 2369, 1259, 125, 486],
]
# 1 represents the start of sequence token
# rest of the integers represent indices of tokens in the target vocabulary

As we did with the encoder input sequences, we need to pad out this batch so that all sequences within it are the same length.

In [None]:
# To ensure uniformity in length of sequence, pad the input sequences
padded_target_input_seqs = tf.keras.preprocessing.sequence.pad_sequences(target_input_seqs, padding="post")

# print the padded sequences
print("Padded target inputs to the decoder:")
print(padded_target_input_seqs.shape)
print(padded_target_input_seqs)

Padded target inputs to the decoder:
(3, 8)
[[   1  652  723  123   62    0    0    0]
 [   1   25   98  129  248  215  359  249]
 [   1 2369 1259  125  486    0    0    0]]


We can create the padding mask the same way we did for the encoder.

In [None]:
# create the padding mask
dec_padding_mask = tf.cast(tf.math.not_equal(padded_target_input_seqs, 0), tf.float32)
dec_padding_mask = dec_padding_mask[:, tf.newaxis, tf.newaxis, :]
print(dec_padding_mask)

tf.Tensor(
[[[[1. 1. 1. 1. 1. 0. 0. 0.]]]


 [[[1. 1. 1. 1. 1. 1. 1. 1.]]]


 [[[1. 1. 1. 1. 1. 0. 0. 0.]]]], shape=(3, 1, 1, 8), dtype=float32)


The look-ahead mask is a lower-triangular matrix: entries on and below the diagonal are 1s, and entries above it are zeros. This is easy to create using the *band_part* function:<br>

In [None]:
# create the look-ahead mask

# get length of target input sequence
target_input_seq_len = padded_target_input_seqs.shape[1]
# create look-ahead mask using band part
look_ahead_mask = tf.linalg.band_part(tf.ones((target_input_seq_len,
                                               target_input_seq_len)), -1, 0)
# print
print(look_ahead_mask)

tf.Tensor(
[[1. 0. 0. 0. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0. 0. 0.]
 [1. 1. 1. 0. 0. 0. 0. 0.]
 [1. 1. 1. 1. 0. 0. 0. 0.]
 [1. 1. 1. 1. 1. 0. 0. 0.]
 [1. 1. 1. 1. 1. 1. 0. 0.]
 [1. 1. 1. 1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 1. 1. 1. 1.]], shape=(8, 8), dtype=float32)
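
For comparison, the same lower-triangular pattern can be built in NumPy with `np.tril`. The sketch below also prints, for a few rows, which positions a token is allowed to attend to; `target_len` and `look_ahead` are names used only in this illustration.

```python
import numpy as np

target_len = 8
look_ahead = np.tril(np.ones((target_len, target_len)))

# Row i lists which positions token i may attend to: itself and everything before it.
for i in (0, 3, 7):
    allowed = np.flatnonzero(look_ahead[i]).tolist()
    print(i, allowed)
```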


To create the decoder mask, we just need to combine the padding and look-ahead masks. Note how the columns of the resulting decoder mask are all zero for padding positions.

In [None]:
# decoder mask = element-wise minimum of the padding mask and the look-ahead mask
dec_mask = tf.minimum(dec_padding_mask, look_ahead_mask)
print("The decoder mask:")
print(dec_mask)

The decoder mask:
tf.Tensor(
[[[[1. 0. 0. 0. 0. 0. 0. 0.]
   [1. 1. 0. 0. 0. 0. 0. 0.]
   [1. 1. 1. 0. 0. 0. 0. 0.]
   [1. 1. 1. 1. 0. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]]]


 [[[1. 0. 0. 0. 0. 0. 0. 0.]
   [1. 1. 0. 0. 0. 0. 0. 0.]
   [1. 1. 1. 0. 0. 0. 0. 0.]
   [1. 1. 1. 1. 0. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 1. 0. 0.]
   [1. 1. 1. 1. 1. 1. 1. 0.]
   [1. 1. 1. 1. 1. 1. 1. 1.]]]


 [[[1. 0. 0. 0. 0. 0. 0. 0.]
   [1. 1. 0. 0. 0. 0. 0. 0.]
   [1. 1. 1. 0. 0. 0. 0. 0.]
   [1. 1. 1. 1. 0. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]
   [1. 1. 1. 1. 1. 0. 0. 0.]]]], shape=(3, 1, 8, 8), dtype=float32)
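
The combination works because of broadcasting: a (3, 1, 1, 8) padding mask and an (8, 8) look-ahead mask expand to a common (3, 1, 8, 8) shape, and the element-wise minimum of two 0/1 masks behaves like a logical AND. The NumPy sketch below mirrors the step using the made-up target sequences from above; `padding`, `causal`, and `combined` are names introduced only here.

```python
import numpy as np

padded_targets = np.array([
    [1, 652, 723, 123, 62, 0, 0, 0],
    [1, 25, 98, 129, 248, 215, 359, 249],
    [1, 2369, 1259, 125, 486, 0, 0, 0],
])

# Padding mask with two extra axes: shape (3, 1, 1, 8).
padding = (padded_targets != 0).astype(np.float32)[:, np.newaxis, np.newaxis, :]
# Look-ahead mask: shape (8, 8).
causal = np.tril(np.ones((8, 8), dtype=np.float32))

# A position stays visible only if both masks allow it.
combined = np.minimum(padding, causal)

print(combined.shape)
print(combined[0, 0, :, 5:].sum())   # padded key columns of sequence 0
print(combined[0, 0, 7])             # last query row of sequence 0
```

For the second sequence, which has no padding, the combined mask is identical to the look-ahead mask.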


We can now declare a decoder and pass it everything it needs. In our case, the *memory* mask is the same as the *encoder* mask.

In [None]:
# instantiate a decoder

decoder = Decoder(6, 12, 3, 48, 10000, 8)
# calling decoder and storing output
decoder_output, _ = decoder(encoder_output, padded_target_input_seqs,
                            True, dec_mask, enc_mask)
# print shape & output
print(f"Decoder output {decoder_output.shape}:")
print(decoder_output)

Decoder output (3, 8, 12):
tf.Tensor(
[[[-0.35240224 -0.09255086  0.5981188  -0.47783858  1.0116719
   -0.3963539   0.47090745  2.0167794   0.08812489 -1.9566933
    0.42447677 -1.3342404 ]
  [-0.48240405  0.47164735  0.6342449  -0.93738294  1.1299092
   -0.71613     0.7337725   1.787602   -0.04667569 -1.2827796
    0.4114582  -1.7032615 ]
  [-0.24198036  0.37110245  1.2110438  -0.43743718  0.9916652
    0.21136616  0.15086094  0.90416807 -0.68330437 -2.6814625
    0.64155674 -0.4375788 ]
  [ 0.10134118  0.43719476  0.98255485 -0.2885427   0.5215785
   -0.6340626   0.9028189   1.3904239   0.28538805 -2.3567224
   -0.04818523 -1.2937874 ]
  [ 0.18584651  0.4440183   1.7259588  -0.04739663  0.71059245
   -0.666827   -0.65871215  1.0039469   0.1447046  -2.3806698
    0.31110364 -0.7725658 ]
  [-0.26562467  0.78949165  0.9789169  -1.351448    0.8759295
   -1.2489833   0.11377887  1.9083608   0.27124098 -1.4632967
   -0.05533588 -0.55303025]
  [ 0.35578218  0.5296707   1.6487275   0.0771098

## Transformer

Combining all the components to build the **Transformer** architecture.

In [None]:
# Transformer class serves as a container for both Encoder & Decoder components of Transformer to create a complete Transformer architecture
class Transformer(tf.keras.Model):
  # constructor of the transformer class
  def __init__(self, num_blocks, d_model, num_heads, hidden_dim, source_vocab_size,
               target_vocab_size, max_input_len, max_target_len, dropout_rate=0.1):
    super(Transformer, self).__init__()

    # instantiate the encoder & decoder objects using parameters
    self.encoder = Encoder(num_blocks, d_model, num_heads, hidden_dim, source_vocab_size,
                           max_input_len, dropout_rate)

    self.decoder = Decoder(num_blocks, d_model, num_heads, hidden_dim, target_vocab_size,
                           max_target_len, dropout_rate)

    # The final dense layer to generate logits from the decoder output.
    # This layer maps the decoder output to the vocabulary size of the target language, allowing us to predict the next token in the sequence.
    self.output_layer = tf.keras.layers.Dense(target_vocab_size)

  # call method defines the forward pass of the Transformer architecture
  def call(self, input_seqs, target_input_seqs, training, encoder_mask,
           decoder_mask, memory_mask):
    # encoder forward pass
    encoder_output, encoder_attn_weights = self.encoder(input_seqs,
                                                        training, encoder_mask)
    # decoder forward pass
    decoder_output, decoder_attn_weights = self.decoder(encoder_output,
                                                        target_input_seqs, training,
                                                        decoder_mask, memory_mask)

    # apply the output layer to the decoder output
    # returns logits along with the encoder and decoder attention weights
    return self.output_layer(decoder_output), encoder_attn_weights, decoder_attn_weights

# The Transformer model integrates the encoder-decoder architecture,
# allowing it to process input sequences and generate predictions

# Logits are the raw, unnormalized output values produced by a neural network before a softmax converts them into probabilities.

In [None]:
# instantiates the transformer class with specific hyperparameters
transformer = Transformer(
    num_blocks = 6,
    d_model = 12,
    num_heads = 3,
    hidden_dim = 48,
    source_vocab_size = bpemb_vocab_size,
    target_vocab_size = 7000, # made-up target vocab size.
    max_input_len = padded_input_seqs.shape[1],
    max_target_len = padded_target_input_seqs.shape[1])

# generate output: pass the input sequences, target input sequences, and masks
# (encoder mask, decoder mask, and memory mask) to the Transformer instance.
# This triggers the call method, which runs the forward pass through the encoder and decoder.

# the output is stored in transformer_output, which contains the model's logits
transformer_output, _, _ = transformer(padded_input_seqs,
                                       padded_target_input_seqs, True,
                                       enc_mask, dec_mask, memory_mask=enc_mask)
print(f"Transformer output {transformer_output.shape}:")
print(transformer_output)

Transformer output (3, 8, 7000):
tf.Tensor(
[[[-0.0140021  -0.11545993  0.06662529 ...  0.10461231 -0.10615893
   -0.01917124]
  [-0.00456342 -0.09490623  0.02976934 ...  0.11172125 -0.11585973
   -0.0071943 ]
  [-0.02134351 -0.08947706  0.03839815 ...  0.09047087 -0.105516
   -0.00530048]
  ...
  [-0.01112481 -0.10965801  0.01698265 ...  0.12514165 -0.12168716
    0.02929601]
  [-0.02983822 -0.07854118  0.01547286 ...  0.12020352 -0.12487702
   -0.00868029]
  [-0.03109496 -0.09448548 -0.01182013 ...  0.08452792 -0.06810641
    0.01057798]]

 [[-0.05029397 -0.0689783   0.03305322 ...  0.07966007 -0.08178013
   -0.03027399]
  [-0.021686   -0.06998483  0.03489942 ...  0.1093443  -0.12109461
   -0.00352084]
  [-0.00988284 -0.10205258  0.03595487 ...  0.11510032 -0.08847072
   -0.01084399]
  ...
  [-0.00725407 -0.10044172  0.01100004 ...  0.10099676 -0.10421087
   -0.01658478]
  [-0.03886113 -0.0952857   0.00675645 ...  0.08931804 -0.05216002
   -0.01157759]
  [-0.01030324 -0.09415484  0.0
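
The output above holds one logit per target-vocabulary entry at every target position. As a sketch of how such logits are usually turned into predictions, the NumPy example below applies a softmax over the vocabulary axis and then picks the most likely token id at each position (greedy decoding). The logits here are random stand-ins with the same shape as the output above, not real model values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the transformer output: (batch, target_len, target_vocab_size).
logits = rng.normal(scale=0.1, size=(3, 8, 7000))

# Softmax over the vocabulary axis turns logits into a probability distribution.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# Greedy decoding picks the most likely token id at each position.
predicted_ids = probs.argmax(axis=-1)

print(probs.shape, predicted_ids.shape)
print(np.allclose(probs.sum(axis=-1), 1.0))
```

Since softmax preserves ordering, taking the argmax of the logits directly gives the same token ids; the softmax is only needed when actual probabilities are required, for example in a cross-entropy loss.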

That's the whole original transformer from scratch.

MLM (masked language modeling) and NSP (next sentence prediction) are pre-training tasks/objectives, not part of the architecture.

The architecture we've built is suited to sequence-to-sequence tasks such as machine translation, text summarization, and question answering. Note that so far we have only run a forward pass with randomly initialized weights; the model still has to be trained before it can perform any of these tasks. Once trained, it can also be fine-tuned for specific downstream tasks like sentiment analysis, named entity recognition, and text classification. However, without pre-training objectives such as MLM and NSP, the model may not capture contextual representations of language as rich as those of models like BERT.
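
To make the MLM objective concrete, here is a minimal sketch of how masked-language-model training pairs could be built from a sequence of token ids. It follows the recipe described for BERT: select roughly 15% of the tokens, then replace 80% of those with a mask token, replace 10% with a random token, and leave 10% unchanged. `MASK_ID`, `IGNORE`, and `mask_tokens` are hypothetical names chosen for this illustration, and the token ids are made up.

```python
import numpy as np

MASK_ID = 4      # hypothetical id of the [MASK] token
IGNORE = -100    # hypothetical label for positions that are not predicted

def mask_tokens(token_ids, vocab_size, rng, select_prob=0.15):
    """Return (inputs, labels) for masked language modeling."""
    token_ids = np.asarray(token_ids)
    inputs = token_ids.copy()
    labels = np.full(token_ids.shape, IGNORE)

    # Choose the positions the model will have to predict.
    selected = rng.random(token_ids.shape) < select_prob
    labels[selected] = token_ids[selected]

    # Decide how each selected position is corrupted.
    roll = rng.random(token_ids.shape)
    use_mask = selected & (roll < 0.8)                      # 80%: mask token
    use_random = selected & (roll >= 0.8) & (roll < 0.9)    # 10%: random token
    inputs[use_mask] = MASK_ID
    inputs[use_random] = rng.integers(0, vocab_size, size=int(use_random.sum()))
    # The remaining 10% of selected positions keep their original token.
    return inputs, labels

rng = np.random.default_rng(0)
ids = rng.integers(5, 1000, size=200)
inputs, labels = mask_tokens(ids, vocab_size=1000, rng=rng)
print((labels != IGNORE).sum(), "of", ids.size, "positions selected")
```

During training, the loss would be computed only at positions whose label is not `IGNORE`, so the model learns to recover the original tokens from their surrounding context.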

### Hugging Face Pre-Trained Transformers

Exploring pre-training and transfer learning using the **Transformers** library from Hugging Face. **Transformers** is an API and toolkit to download pre-trained models and further train them as needed. <br>

Starting with the **pipelines** module which abstracts a lot of operations such as tokenization, vectorization, inference, etc.<br>

With **Transformers pipelines**, we can just feed text input and get text output. And there are **pipelines** for common tasks including classification, NER, summarization, etc.<br>

Install Transformers

In [None]:
!pip install transformers
!pip install datasets

Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.18.0 dill-0.3.8 multiprocess-0.70.16


In [None]:
import operator
import pandas as pd
import tensorflow as tf
import transformers

from datasets import load_dataset
from tensorflow import keras
from transformers import AutoTokenizer
# pipelines perform common NLP tasks such as text summarization, text classification, and question answering
# they encapsulate operations like tokenization and vectorization
from transformers import pipeline
from transformers import TFAutoModelForQuestionAnswering

### Hugging Face Pipelines

We use the **pipeline** (note the singular) abstraction, which wraps all the other pipelines. Put simply, it is our interface to a range of NLP tasks.

Using the **pipeline** abstraction is easy. We can instantiate a pipeline with a particular task, and it'll automatically download a suitable tokenizer and model behind the scenes for us and take care of the input and output operations.<br>

Here, we're retrieving a pipeline for text-classification.

In [None]:
# instantiating a pipeline for classification
classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



We can use the pipeline immediately to classify some text. Tokenization, vectorization, and inference are taken care of behind the scenes.

In [None]:
# behind the scenes, the pipeline tokenizes the input and runs the text-classification model it downloaded
classifier("Alice was excited to go the island but it didn't live up to the hype.")

[{'label': 'NEGATIVE', 'score': 0.9993934631347656}]

In [None]:
classifier("Bob doesn't do well in group situations but he said it wasn't bad.")

[{'label': 'POSITIVE', 'score': 0.9946909546852112}]
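
The `score` in these results is a probability. As a rough sketch of the last step the pipeline performs, assuming the classifier emits one raw logit per class and a softmax is applied to them, the label and score could be derived as follows. The logit values and the `to_prediction` helper are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # subtract the max before exponentiating for numerical stability
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, id2label):
    """Pick the most probable class and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# hypothetical logits for one sentence; real values come from the model
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
prediction = to_prediction([3.9, -3.5], id2label)
print(prediction)
```

A large gap between the two logits is what produces scores very close to 1, like the ones shown above.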

Retrieving a pipeline for summarization.


In [None]:
# creating instance of summarization pipeline
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
text = """
Hans Niemann is launching a counterattack in his dispute with chess world
champion Magnus Carlsen, filing a federal lawsuit that accuses Carlsen of
maliciously colluding with others to defame the 19-year-old grandmaster and
ruin his career.

It's the latest move in a scandal that has injected unprecedented levels of
drama into the world of elite chess since early September, when Carlsen
suggested Niemann's upset victory over him at the Sinquefield Cup tournament
in St. Louis was the result of cheating.

Niemann wants a federal court in Missouri's eastern district to award him at
least $100 million in damages. Defendants in the lawsuit include Carlsen, his
company Play Magnus Group, the online platform Chess.com and its leader, Danny
Rensch, along with grandmaster Hikaru Nakamura.
"""

In [None]:
summarizer(text)

[{'summary_text': ' Chess grandmaster Hans Niemann files federal lawsuit against Magnus Carlsen . He accuses Carlsen of colluding with others to defame the 19-year-old grandmaster . Defendants in the lawsuit include Carlsen, Play Magnus Group, the online platform Chess.com and its leader, Danny Rensch .'}]

Retrieving a pipeline for question answering.

In [None]:
qa = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
context="""
Hugging Face was founded in 2016 by Clément Delangue, Julien Chaumond, and
Thomas Wolf originally as a company that developed a chatbot app targeted at
teenagers.[2] After open-sourcing the model behind the chatbot, the company
pivoted to focus on being a platform for democratizing machine learning. In March
2021, Hugging Face raised $40 million in a Series B funding round.
"""

question = "Who are the Hugging Face founders?"

qa(question=question, context=context)

{'score': 0.9919217228889465,
 'start': 37,
 'end': 88,
 'answer': 'Clément Delangue, Julien Chaumond, and \nThomas Wolf'}

Extractive question-answering models work well for certain domains, document structures, and questions. But inputs that require reasoning, need more complex parsing, or contain ambiguity can trip them up.

In [None]:
question = "What does Hugging Face do?"
qa(question=question, context=context)

{'score': 0.08730360865592957,
 'start': 118,
 'end': 164,
 'answer': 'developed a chatbot app targeted at \nteenagers'}
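
The low score here (about 0.09, versus 0.99 for the first question) is a useful signal. One simple way to act on it, sketched under the assumption that a fixed cut-off suits the application, is to discard answers below a threshold. The `answer_or_abstain` helper and the 0.5 threshold are hypothetical and not part of the Transformers API.

```python
def answer_or_abstain(result, threshold=0.5):
    """Return the answer text, or None when the model's score is below the threshold."""
    if result["score"] < threshold:
        return None
    return result["answer"]

# the two pipeline results shown above
confident = {"score": 0.9919217228889465, "start": 37, "end": 88,
             "answer": "Clément Delangue, Julien Chaumond, and \nThomas Wolf"}
unsure = {"score": 0.08730360865592957, "start": 118, "end": 164,
          "answer": "developed a chatbot app targeted at \nteenagers"}

print(answer_or_abstain(confident))  # keeps the founders' names
print(answer_or_abstain(unsure))     # None, because the score is too low
```

The right threshold depends on the model and the data, so it is best tuned on a validation set rather than fixed in advance.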

Ready-made pipelines are available for various other tasks as well.

### Pre-Trained Model

Next, we fine-tune a pre-trained model using a dataset from the **Datasets** hub.<br>

Hugging Face provides a **datasets** library to download and interact with these datasets.

The **Datasets** hub hosts many question answering datasets. They differ in data source, domain, and level of difficulty.

We use SQuAD, a well-known dataset of crowd-sourced questions on a set of Wikipedia articles, where each answer is a span of text in the article.<br>

In [None]:
data = load_dataset("squad")


The **datasets** library downloads the data along with its predefined train and validation splits. It returns a **DatasetDict**, a dictionary of **Dataset** objects:<br>

In [None]:
data

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

Looking at the data, we see that every context (Wikipedia passage) is used multiple times, i.e., there are multiple questions and answers for each context.<br>

Every answer is a span of text from the context, and the character position where the answer starts in the context is given.

In [None]:
# indexing with a list of row numbers returns a dict of columns for those rows
pd.DataFrame(data['train'][[0, 1, 2, 100, 101, 102]],
             columns=["context", "question", "answers"])

| | context | question | answers |
|---|---|---|---|
| 0 | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | {'text': ['Saint Bernadette Soubirous'], 'answ... |
| 1 | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | {'text': ['a copper statue of Christ'], 'answe... |
| 2 | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | {'text': ['the Main Building'], 'answer_start'... |
| 3 | One of the main driving forces in the growth o... | In what year did the team lead by Knute Rockne... | {'text': ['1925'], 'answer_start': [354]} |
| 4 | One of the main driving forces in the growth o... | How many years was Knute Rockne head coach at ... | {'text': ['13'], 'answer_start': [251]} |
| 5 | One of the main driving forces in the growth o... | How many national titles were won when Knute R... | {'text': ['three'], 'answer_start': [274]} |
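
Each record stores only the answer text and its starting character position, so the end position has to be derived. The sketch below shows one way to do that. The record is a hypothetical, shortened example that follows the SQuAD field layout, and `answer_span` is an illustrative helper, not a library function.

```python
def answer_span(example):
    """Return (start, end) character offsets of the first answer in a SQuAD-style record."""
    text = example["answers"]["text"][0]
    start = example["answers"]["answer_start"][0]
    end = start + len(text)  # exclusive end, matching Python slicing
    return start, end

# hypothetical record in the SQuAD layout (not taken from the dataset)
example = {
    "context": "Knute Rockne coached the team for 13 years.",
    "question": "How many years did Knute Rockne coach the team?",
    "answers": {"text": ["13"], "answer_start": [34]},
}

start, end = answer_span(example)
# slicing the context with the span recovers the answer text
print(start, end, repr(example["context"][start:end]))
```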



Fine-tuning a model for question answering involves the following steps:

1. Choose a pre-trained model based on what we want to accomplish and our constraints.
2. Download the appropriate tokenizer for the pre-trained model.
3. Tokenize and vectorize our dataset.
4. Mark where each answer starts and ends in our vectorized dataset.
5. Download the pre-trained model.
6. Fine-tune the pre-trained model with the vectorized dataset.
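
Step 4 is the least obvious of these: the dataset gives answer positions in characters, while the model predicts positions in tokens. A minimal sketch of the conversion is shown below. It assumes we know each token's character offsets. The whitespace tokenizer is a toy stand-in and both helper functions are hypothetical. With fast Hugging Face tokenizers, comparable offsets can be requested with `return_offsets_mapping=True`.

```python
def whitespace_offsets(text):
    """Toy tokenizer: split on whitespace and record each token's (start, end) character offsets."""
    offsets, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        offsets.append((start, end))
        pos = end
    return offsets

def char_span_to_token_span(offsets, answer_start, answer_end):
    """Map a character span [answer_start, answer_end) to inclusive token indices."""
    token_start = token_end = None
    for i, (start, end) in enumerate(offsets):
        # a token belongs to the answer if it overlaps the character span
        if start < answer_end and end > answer_start:
            if token_start is None:
                token_start = i
            token_end = i
    return token_start, token_end

context = "Hugging Face was founded in 2016 by three people."
answer = "in 2016"
answer_start = context.index(answer)
answer_end = answer_start + len(answer)

offsets = whitespace_offsets(context)
token_span = char_span_to_token_span(offsets, answer_start, answer_end)
print(token_span)  # indices of the first and last answer tokens
```

The resulting pair of token indices is what a question-answering head is trained to predict as its start and end positions.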

*DistilBERT* was created using *knowledge distillation*, a technique in which a smaller student model is trained to reproduce the behavior of a larger teacher model. The result is a model that performs almost as well as BERT but is 40% smaller and 60% faster.<br>

*distilroberta-base* was created by applying knowledge distillation to *RoBERTa-base*, a variant of BERT with an improved pre-training procedure that generally outperforms it.<br>
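
To make the idea of knowledge distillation concrete, here is a generic sketch of the soft-target loss: the student is penalized for diverging from the teacher's temperature-softened output distribution. This illustrates the general technique only and is not the exact training objective used for DistilBERT or distilroberta-base. The logits and the temperature value are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher temperatures give softer distributions."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    return kl * temperature ** 2

# hypothetical logits over three classes
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

print(distillation_loss(close_student, teacher))  # small: student agrees with teacher
print(distillation_loss(far_student, teacher))    # large: student disagrees
```

In practice this term is usually combined with the ordinary task loss on the true labels.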

The **Transformers** library provides a set of Auto Classes that can automatically retrieve configurations, tokenizers, and models based on a path or a name. We'll use the **AutoTokenizer** class to get the right tokenizer for *distilroberta-base*.<br>

In [None]:
model_name = 'distilroberta-base'
tokenizer = AutoTokenizer.from_pretrained(model_name)


Calling *encode* converts a string to a sequence of integer token ids.<br>

In [None]:
t = "Where can I find a pizzeria?"
print(tokenizer.encode(t))

[0, 13841, 64, 38, 465, 10, 26432, 6971, 116, 2]


In practice, though, we call the tokenizer object directly (i.e. using *\_\_call\_\_*).<br>

This returns the sequence of ids together with an attention mask in a **BatchEncoding** object. Since there's no padding on this sample string, the mask is all 1s.

In [None]:
encoded_t = tokenizer(t)
print(encoded_t)

{'input_ids': [0, 13841, 64, 38, 465, 10, 26432, 6971, 116, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
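
The attention mask only becomes interesting once sequences of different lengths are batched together. The sketch below shows the idea in plain Python: shorter sequences are padded to the longest one and the mask marks padding positions with 0. The `pad_batch` helper is hypothetical, and both the pad id of 1 and right-side padding are assumptions about the *distilroberta-base* tokenizer's defaults. The tokenizer itself can do this when called with `padding=True`.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token id sequences to a common length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

long_ids = [0, 13841, 64, 38, 465, 10, 26432, 6971, 116, 2]  # ids from the example above
short_ids = [0, 13841, 116, 2]                               # hypothetical shorter input

batch = pad_batch([long_ids, short_ids], pad_id=1)  # pad_id=1 is an assumption
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```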


We can convert the ids back to tokens using *convert_ids_to_tokens*.<br>

Notice that the tokenizer added a start-of-sequence token (\<s\>) and an end-of-sequence token (\</s\>), and that it uses Ġ to signal that a word has preceding whitespace. Keep in mind that what you're seeing here is the output from the *distilroberta-base* tokenizer. Other tokenizers may work differently.

In [None]:
print(tokenizer.convert_ids_to_tokens(encoded_t['input_ids']))

['<s>', 'Where', 'Ġcan', 'ĠI', 'Ġfind', 'Ġa', 'Ġpizz', 'eria', '?', '</s>']
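
As a quick check of how the Ġ marker works, the original string can be rebuilt from these tokens by hand. This is a simplified sketch that assumes only the two special tokens shown above and plain ASCII text. The `tokens_to_text` helper is hypothetical. In real code, the tokenizer's own `decode` method is the safer choice.

```python
SPECIAL_TOKENS = {"<s>", "</s>"}

def tokens_to_text(tokens):
    """Rebuild text from tokens: drop special tokens and turn the Ġ marker back into a space."""
    pieces = [t for t in tokens if t not in SPECIAL_TOKENS]
    return "".join(pieces).replace("Ġ", " ").strip()

tokens = ['<s>', 'Where', 'Ġcan', 'ĠI', 'Ġfind', 'Ġa', 'Ġpizz', 'eria', '?', '</s>']
text = tokens_to_text(tokens)
print(text)
```

Note how `Ġpizz` and `eria` join into a single word because the second piece carries no Ġ marker.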


For question answering, we need to encode the question and context as a pair. In our case, we can do that by passing both strings to the tokenizer as separate arguments.