## The Transformer architecture

**Task**:
- Create a Transformer encoder, one of the basic components of the Transformer architecture.
- Apply it to the IMDB movie review classification task.

### Understanding self-attention

- **Self-attention**: a smart embedding space would ***provide a different vector representation for a word depending on the other words surrounding it***.
- The **purpose of self-attention** is to ***modulate (adjust or regulate)*** the representation of a token by using the representations of related tokens in the sequence. This produces ***context-aware token representations.***

NumPy-like pseudocode:

In [4]:
import numpy as np

def softmax(x):
  # Numerically stable softmax over a 1D vector of scores.
  e = np.exp(x - np.max(x))
  return e / e.sum()

def self_attention(input_sequence):
  output = np.zeros(shape=input_sequence.shape)
  # Iterate over each token in the input sequence.
  for i, pivot_vector in enumerate(input_sequence):
    scores = np.zeros(shape=(len(input_sequence),))
    for j, vector in enumerate(input_sequence):
      # Compute the dot product (attention score) between the token and every other token.
      scores[j] = np.dot(pivot_vector, vector.T)
    # Once all scores are computed, scale by a normalization factor and apply a softmax.
    scores /= np.sqrt(input_sequence.shape[1])
    scores = softmax(scores)
    new_pivot_representation = np.zeros(shape=pivot_vector.shape)
    for j, vector in enumerate(input_sequence):
      # Take the sum of all tokens weighted by the attention scores.
      new_pivot_representation += vector * scores[j]
    # That sum is our output for token i.
    output[i] = new_pivot_representation
  return output
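
The nested loops above can also be written as a few matrix operations. The following is a minimal sketch under that assumption; the name `self_attention_vectorized` and the random test input are illustrative and not part of the original notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention_vectorized(input_sequence):
    # input_sequence has shape (sequence_length, embed_dim).
    embed_dim = input_sequence.shape[1]
    # Pairwise dot products between all tokens: (sequence_length, sequence_length).
    scores = input_sequence @ input_sequence.T / np.sqrt(embed_dim)
    # Row i holds the attention weights of token i over every token.
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted sum of all token vectors.
    return weights @ input_sequence

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # 5 tokens, 8 features each (made-up data)
contextual = self_attention_vectorized(tokens)
print(contextual.shape)  # (5, 8)
```

The output has the same shape as the input: one context-aware vector per token.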

#### Generalized self-attention: the query-key-value model

- Schematically, the self-attention above weights a sum of the inputs by pairwise scores computed between the inputs themselves:

  ```
  outputs = sum(inputs * pairwise_score(inputs, inputs))
  ```
- In the general case, you could be doing this with three different sequences. We’ll call them “***query***,” “***keys***,” and “***values***.”
    - The operation becomes “for each element in the query, compute how much the element is related to every key, and use these scores to weight a sum of values.”
- Conceptually, this is what Transformer-style attention is doing.
    - You’ve got a reference sequence that describes something you’re looking for: ***the query***.
    - You’ve got a body of knowledge that you’re trying to extract information from: ***the values***.
    - ***Each value is assigned a key that describes the value*** in a format that can be readily compared to a query.

      ```
      outputs = sum(values * pairwise_score(query, keys))
      ```  
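
The query-key-value formula can be sketched in NumPy as follows. This is an illustrative sketch only: the function name `attention`, the scaled dot product used as the pairwise score, and the random inputs are assumptions, not code from the notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(query, keys, values):
    # query: (num_queries, dim), keys: (num_items, dim), values: (num_items, value_dim)
    # Score every query element against every key (scaled dot product).
    scores = query @ keys.T / np.sqrt(keys.shape[1])
    weights = softmax(scores, axis=-1)   # (num_queries, num_items)
    # Use the scores to weight a sum of the values.
    return weights @ values              # (num_queries, value_dim)

rng = np.random.default_rng(1)
query = rng.normal(size=(3, 4))
keys = rng.normal(size=(6, 4))
values = rng.normal(size=(6, 2))
outputs = attention(query, keys, values)
print(outputs.shape)  # (3, 2)
```

Self-attention is the special case `attention(x, x, x)`, where the same sequence plays all three roles.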

### Multi-head attention

- **What does “multiple heads” refer to?**
- “***Multi-head attention***” is an extra tweak to the self-attention mechanism, introduced in “Attention is all you need.”
    - The “***multi-head***” moniker refers to the fact that the ***output space of the self-attention layer gets factored into a set of independent subspaces***, ***learned separately***. Each such subspace is called a “head.”
    - For each head, the initial ***query***, ***key***, and ***value*** are ***sent through their own independent dense projections***, resulting in a separate projected query, key, and value for that head.
    - Each head then processes its projected inputs via neural attention.
    - The ***outputs of all heads are concatenated back together*** into a single output sequence.
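
The steps above can be sketched in NumPy. This is a simplified illustration, not the Keras implementation: the random matrices `w_q`, `w_k`, `w_v`, and `w_o` stand in for learned projection weights, biases are omitted, and all names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o):
    # x: (seq_len, embed_dim)
    # w_q, w_k, w_v: (num_heads, embed_dim, head_dim), one projection per head
    # w_o: (num_heads * head_dim, embed_dim), final output projection
    head_outputs = []
    for wq, wk, wv in zip(w_q, w_k, w_v):
        # Project the inputs into this head's own subspace.
        q, k, v = x @ wq, x @ wk, x @ wv
        # Run ordinary attention inside the subspace.
        weights = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)
        head_outputs.append(weights @ v)
    # Concatenate the heads back into a single sequence of vectors.
    concatenated = np.concatenate(head_outputs, axis=-1)
    return concatenated @ w_o

rng = np.random.default_rng(2)
seq_len, embed_dim, num_heads, head_dim = 5, 8, 2, 4
x = rng.normal(size=(seq_len, embed_dim))
w_q, w_k, w_v = (rng.normal(size=(num_heads, embed_dim, head_dim)) for _ in range(3))
w_o = rng.normal(size=(num_heads * head_dim, embed_dim))
out = multi_head_self_attention(x, w_q, w_k, w_v, w_o)
print(out.shape)  # (5, 8)
```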

### The Transformer encoder

**Getting the data**

In [8]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup


**Preparing the data**

In [9]:
import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.
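
The file counts reported above follow from the 20% split. A short arithmetic check, assuming the standard 12,500 labelled training reviews per class in `aclImdb`:

```python
files_per_category = 12500  # labelled training reviews per class (assumed dataset size)
num_categories = 2          # "neg" and "pos"

# Same expression as in the split loop: 20% of each class moves to val/.
num_val_samples = int(0.2 * files_per_category)

val_total = num_val_samples * num_categories
train_total = (files_per_category - num_val_samples) * num_categories
print(train_total, val_total)  # 20000 5000
```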


**Vectorizing the data**

In [10]:
from tensorflow.keras import layers

max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
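
To make the `output_mode="int"` behaviour more concrete, here is a toy pure-Python stand-in for the vectorization step. It is a rough sketch, not the Keras implementation: the helper names are hypothetical, and it only approximates the layer's default behaviour (lowercasing, punctuation stripping, whitespace splitting, index 0 for padding, index 1 for unknown tokens).

```python
import string
from collections import Counter

def standardize(text):
    # Lowercase and strip punctuation, roughly like the layer's default standardization.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocabulary(texts, max_tokens):
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens;
    # the remaining slots go to the most frequent tokens.
    vocab = ["", "[UNK]"] + [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i for i, tok in enumerate(vocab)}

def vectorize(text, vocab, output_sequence_length):
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:output_sequence_length]                       # truncate long inputs
    return ids + [0] * (output_sequence_length - len(ids))   # pad short inputs with 0

corpus = ["A great movie, great cast!", "A terrible movie."]
vocab = build_vocabulary(corpus, max_tokens=10)
encoded = vectorize("Great movie, awful ending", vocab, output_sequence_length=6)
print(encoded)  # [3, 4, 1, 1, 0, 0]
```

The two unseen words map to index 1, and the trailing zeros are the padding that a mask can later ignore.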

**Transformer encoder implemented as a subclassed `Layer`**

- **BatchNormalization** doesn’t work well for sequence data.
- The encoder uses the **LayerNormalization** layer instead, which normalizes each sequence independently from other sequences in the batch.

While ***BatchNormalization collects information from many samples*** to obtain accurate statistics for the feature means and variances, ***LayerNormalization pools data within each sequence separately***, which is more appropriate for sequence data.
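
The difference can be illustrated with a NumPy sketch. It is simplified on purpose: the learnable scale and offset are left out, the batch-normalization version shows training-time statistics only, and the function names are illustrative.

```python
import numpy as np

def layer_normalization(batch_of_sequences, eps=1e-6):
    # Input shape: (batch_size, sequence_length, embedding_dim).
    # Statistics are pooled over the last axis only, so nothing is shared
    # between different samples in the batch.
    mean = np.mean(batch_of_sequences, keepdims=True, axis=-1)
    variance = np.var(batch_of_sequences, keepdims=True, axis=-1)
    return (batch_of_sequences - mean) / np.sqrt(variance + eps)

def batch_normalization(batch_of_sequences, eps=1e-6):
    # Statistics are pooled over the batch and sequence axes, so every
    # sample influences how every other sample is normalized.
    mean = np.mean(batch_of_sequences, keepdims=True, axis=(0, 1))
    variance = np.var(batch_of_sequences, keepdims=True, axis=(0, 1))
    return (batch_of_sequences - mean) / np.sqrt(variance + eps)

rng = np.random.default_rng(3)
batch = rng.normal(size=(4, 10, 16))

# Normalizing the first sequence alone or as part of the batch gives the same result.
alone = layer_normalization(batch[:1])
in_batch = layer_normalization(batch)[:1]
print(np.allclose(alone, in_batch))  # True
```

Running the same comparison with `batch_normalization` gives different results, because its statistics depend on which other sequences are in the batch.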

In [5]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
    # initialize variables
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        # size of the input token vector - embedding token vector representation
        self.embed_dim = embed_dim
        # size of the inner dense layer - use for dense projection
        self.dense_dim = dense_dim
        # number of attention heads
        self.num_heads = num_heads

        # initialize multi-head attention
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        # dense projection - a two-layer feed-forward block applied to each token
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )

        # layer normalization - help gradients flow better during backpropagation
        # normalizes each sequence independently from other sequences in the batch
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    # computation goes in call()
    # the call() method is called automatically when the layer is used in a Keras model
    def call(self, inputs, mask=None):
        # The mask generated by the Embedding layer is 2D, but the attention
        # layer expects a 3D or 4D mask, so we expand its rank.
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        # Self-attention: the inputs act as both query and value.
        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)
        # Residual connection around the attention block, then layer normalization.
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        # Residual connection around the dense projection, then layer normalization.
        return self.layernorm_2(proj_input + proj_output)

    # implement serialization so we can save the model
    # get_config method: this enables the layer to be reinstantiated from its config dict,
    # which is useful during model saving and loading.
    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config
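
The rank expansion of the mask in `call()` can be illustrated with NumPy shapes. This sketch only shows the broadcasting idea; how `MultiHeadAttention` applies the mask internally is not shown in the notebook, and the token ids below are made up.

```python
import numpy as np

# Two padded sequences of token ids; 0 marks padding.
token_ids = np.array([[5, 8, 2, 0],
                      [7, 0, 0, 0]])
padding_mask = token_ids != 0                     # shape (batch, seq_len) = (2, 4)
attention_mask = padding_mask[:, np.newaxis, :]   # shape (2, 1, 4)

# Broadcasting against (batch, query_len, key_len) scores repeats the key
# mask for every query position.
scores = np.zeros((2, 4, 4))
masked_scores = np.where(attention_mask, scores, -np.inf)
print(attention_mask.shape, masked_scores.shape)  # (2, 1, 4) (2, 4, 4)

# After a softmax, masked (padding) keys receive zero attention weight.
weights = np.exp(masked_scores)
weights /= weights.sum(axis=-1, keepdims=True)
print(weights[0, 0])  # the last entry, a padding position, is 0
```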

**Using the Transformer encoder for text classification**

In [6]:
vocab_size = 20000
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

# Reason to apply GlobalMaxPooling1D layer:
# Since TransformerEncoder returns full sequences, we need to reduce each
# sequence to a single vector for classification via a global pooling layer

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 256)         5120000   
                                                                 
 transformer_encoder (Trans  (None, None, 256)         543776    
 formerEncoder)                                                  
                                                                 
 global_max_pooling1d (Glob  (None, 256)               0         
 alMaxPooling1D)                                                 
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 dense_2 (Dense)             (None, 1)                 257   
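
The parameter counts in the summary can be reproduced by hand. The breakdown below is a sketch that assumes each projection in the attention layer is a dense layer with a bias, and that each layer normalization has one scale and one offset per feature; the helper `dense_params` is illustrative.

```python
embed_dim, dense_dim, num_heads, vocab_size = 256, 32, 2, 20000
key_dim = embed_dim  # the layer is built with key_dim=embed_dim

def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit.
    return n_in * n_out + n_out

# Query, key, and value projections each map embed_dim -> num_heads * key_dim.
qkv = 3 * dense_params(embed_dim, num_heads * key_dim)
# The output projection maps the concatenated heads back to embed_dim.
attn_out = dense_params(num_heads * key_dim, embed_dim)
# The two Dense layers in dense_proj.
feed_forward = dense_params(embed_dim, dense_dim) + dense_params(dense_dim, embed_dim)
# Two LayerNormalization layers, each with a scale and an offset per feature.
layer_norms = 2 * (2 * embed_dim)

embedding_total = vocab_size * embed_dim
encoder_total = qkv + attn_out + feed_forward + layer_norms
classifier_total = dense_params(embed_dim, 1)
print(embedding_total, encoder_total, classifier_total)  # 5120000 543776 257
```

Most of the encoder's parameters sit in the attention projections; the small `dense_dim` keeps the feed-forward block cheap.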

**Training and evaluating the Transformer encoder based model**

In [None]:
callbacks = [
    keras.callbacks.ModelCheckpoint("transformer_encoder.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)
model = keras.models.load_model(
    "transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder})
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20

KeyboardInterrupt: 

- Training was very slow and was interrupted during epoch 7 when the GPU time limit was reached.

#### Using positional encoding to re-inject order information

- The idea behind ***positional encoding*** is very simple: ***to give the model access to word order information*** ***by adding the word’s position in the sentence to each word embedding***.
- The ***input word embeddings*** will have two components:
    - the usual ***word vector***, which ***represents the word independently*** of any specific context,
    - and a ***position vector***, which ***represents the position of the word in the current sentence***.
    
    The model will then figure out how to best leverage this additional information.
    
- **Positional embedding**: the technique of ***adding learned position embeddings to the corresponding word embeddings to obtain position-aware word embeddings***.
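
Why order information has to be re-injected can be seen in a small NumPy experiment. This is an illustrative sketch with random vectors: plain self-attention followed by global max pooling gives the same result when the tokens are reversed, while adding a position vector to each slot makes the result order-dependent. All names and data are made up.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(x):
    weights = softmax(x @ x.T / np.sqrt(x.shape[1]), axis=-1)
    return weights @ x

rng = np.random.default_rng(4)
word_vectors = rng.normal(size=(6, 8))   # six tokens, no position information
reversed_words = word_vectors[::-1]      # same tokens in reverse order

# Without positions, the pooled representation ignores word order.
pooled = self_attention(word_vectors).max(axis=0)
pooled_reversed = self_attention(reversed_words).max(axis=0)
print(np.allclose(pooled, pooled_reversed))  # True

# Adding one position vector per slot (random here, learned in practice)
# makes the result depend on word order.
position_vectors = rng.normal(size=(6, 8))
pooled_pos = self_attention(word_vectors + position_vectors).max(axis=0)
pooled_pos_reversed = self_attention(reversed_words + position_vectors).max(axis=0)
print(np.allclose(pooled_pos, pooled_pos_reversed))  # False
```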

**Implementing positional embedding as a subclassed layer**

In [11]:
class PositionalEmbedding(layers.Layer):
    # A downside of position embeddings is that the sequence length needs to be known in advance
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        # Prepare an Embedding layer for the token indices
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        # And another one for the token positions
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        # Add both embedding vectors together
        return embedded_tokens + embedded_positions

    # Like the Embedding layer, this layer should be able to generate a mask so we can ignore padding 0s in the inputs.
    # The compute_mask method will be called automatically by the framework, and the mask will get propagated to the next layer.
    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    # Implement serialization so we can save the model
    def get_config(self):
        config = super().get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config
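
The lookups performed in `call()` and `compute_mask()` can be mimicked with plain NumPy arrays. This is a sketch with toy sizes: the random tables stand in for the two learned `Embedding` layers, and the token ids are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
vocab_size, sequence_length, embed_dim = 50, 6, 4   # toy sizes for illustration
token_table = rng.normal(size=(vocab_size, embed_dim))          # stands in for token_embeddings
position_table = rng.normal(size=(sequence_length, embed_dim))  # stands in for position_embeddings

token_ids = np.array([[7, 3, 7, 0, 0, 0],
                      [3, 7, 9, 9, 1, 0]])
positions = np.arange(token_ids.shape[-1])

embedded_tokens = token_table[token_ids]         # (2, 6, 4)
embedded_positions = position_table[positions]   # (6, 4), broadcast over the batch
outputs = embedded_tokens + embedded_positions
mask = token_ids != 0                            # True for real tokens, False for padding

print(outputs.shape, mask.shape)  # (2, 6, 4) (2, 6)
# The same token id (7) at positions 0 and 2 now has two different vectors.
print(np.allclose(outputs[0, 0], outputs[0, 2]))  # False

# Parameter count at the notebook's sizes: token table plus position table.
full_size_params = 20000 * 256 + 600 * 256
print(full_size_params)  # 5273600
```

Because the position table has exactly `sequence_length` rows, inputs longer than that cannot be embedded, which is the downside noted in the class comment.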

#### Putting it all together: A text-classification Transformer

**Combining the Transformer encoder with positional embedding**

In [12]:
vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

callbacks = [
    keras.callbacks.ModelCheckpoint("full_transformer_encoder.keras",
                                    save_best_only=True)
]
model.fit(int_train_ds,
          validation_data=int_val_ds,
          epochs=20,
          callbacks=callbacks)

model = keras.models.load_model(
    "full_transformer_encoder.keras",
    custom_objects={"TransformerEncoder": TransformerEncoder,
                    "PositionalEmbedding": PositionalEmbedding})
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, None)]            0         
                                                                 
 positional_embedding (Posi  (None, None, 256)         5273600   
 tionalEmbedding)                                                
                                                                 
 transformer_encoder_1 (Tra  (None, None, 256)         543776    
 nsformerEncoder)                                                
                                                                 
 global_max_pooling1d_1 (Gl  (None, 256)               0         
 obalMaxPooling1D)                                               
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                           

We achieve 87.4% test accuracy.