# Transformer Architecture in Differentiable Programming

The **Transformer** is a neural network architecture introduced in "Attention Is All You Need" (Vaswani et al., 2017). It underlies most modern language models and illustrates key principles of differentiable programming: recurrence is replaced entirely by attention, and every component is differentiable end-to-end.

## Core Innovation: Self-Attention

Unlike recurrent architectures, Transformers process sequences in parallel using **self-attention mechanisms**. This makes them particularly suitable for differentiable programming applications where we need:

- **Parallel computation** across sequence elements
- **Long-range dependencies** with a short gradient path, since any two positions interact directly in a single attention step
- **Interpretable attention patterns** for understanding model behavior
- **Modular, composable components** that can be differentiated end-to-end

This implementation follows the [TensorFlow Transformer tutorial](https://www.tensorflow.org/tutorials/text/transformer) and demonstrates how complex neural architectures can be constructed using differentiable building blocks.

In [1]:
import tensorflow as tf
import numpy as np

In [2]:
def get_angles(pos, i, d_model):
    angle_rates = 1 / np.power(10000, (2 * (i//2)) / np.float32(d_model))
    return pos * angle_rates

## Positional Encoding

Since Transformers lack recurrent connections, they need an explicit way to encode positional information. **Positional encoding** uses sinusoidal functions to give each position a unique representation. In this implementation the encoding is fixed (computed once, not learned) and added to the token embeddings.

### Mathematical Foundation

The positional encoding uses alternating sine and cosine functions:

- **Even dimensions**: $PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$
- **Odd dimensions**: $PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

Where:
- $pos$ is the position in the sequence
- $i$ indexes the sine/cosine pair, so $2i$ and $2i+1$ are the even and odd dimension indices
- $d_{model}$ is the model dimension

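For any fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$, which makes it easy for the model to attend by relative position. The encoding is a constant added to the embeddings, so it introduces no trainable parameters.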
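
To make the relative-position property concrete, here is a minimal NumPy-only sketch. It rebuilds the encoding and checks that shifting every position by a fixed offset `k` is the same as rotating each (sin, cos) pair by an angle that does not depend on the position. The helper name `sinusoidal_encoding` and the sizes used are illustrative assumptions, not part of the tutorial code.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model):
    """NumPy-only sketch of the sinusoidal encoding (illustrative helper)."""
    pos = np.arange(num_positions)[:, np.newaxis]   # (num_positions, 1)
    i = np.arange(d_model)[np.newaxis, :]           # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])           # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])           # odd dimensions
    return pe

d_model = 8
pe = sinusoidal_encoding(num_positions=50, d_model=d_model)

# One angular rate per (sin, cos) pair.
rates = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)

# Rotate every pair by k * rate using the angle-addition identities.
k = 5
sin_p, cos_p = pe[:-k, 0::2], pe[:-k, 1::2]
shifted_sin = sin_p * np.cos(k * rates) + cos_p * np.sin(k * rates)
shifted_cos = cos_p * np.cos(k * rates) - sin_p * np.sin(k * rates)

print(np.allclose(pe[k:, 0::2], shifted_sin))  # True
print(np.allclose(pe[k:, 1::2], shifted_cos))  # True
```

The rotation depends only on `k`, so a single linear map takes the encoding of any position to the encoding `k` steps later.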

In [3]:
def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],
                          np.arange(d_model)[np.newaxis, :],
                          d_model)

    # apply sin to even indices in the array; 2i
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])

    # apply cos to odd indices in the array; 2i+1
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])

    pos_encoding = angle_rads[np.newaxis, ...]

    return tf.cast(pos_encoding, dtype=tf.float32)


## Scaled Dot-Product Attention

The core of the Transformer is the **scaled dot-product attention** mechanism. This function computes attention weights between queries, keys, and values, enabling the model to focus on relevant parts of the input sequence.

### Mathematical Formulation

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- $Q$ (queries): What information we're looking for
- $K$ (keys): What information is available
- $V$ (values): The actual information content
- $d_k$ is the key dimension (for scaling)

### Key Components

1. **Dot Product**: $QK^T$ computes similarity between queries and keys
2. **Scaling**: Division by $\sqrt{d_k}$ prevents extremely large values that could saturate softmax
3. **Masking**: Optional masking prevents attention to certain positions (padding, future tokens)
4. **Softmax**: Normalizes attention weights to sum to 1
5. **Weighted Sum**: Attention weights are applied to values

Every step is differentiable, so gradients reach the queries, keys, and values through both the attention weights and the weighted sum.
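
The effect of the scaling step can be checked numerically. The sketch below assumes query and key components are independent with unit variance (an idealization, not a property of trained models) and compares the spread of raw and scaled dot products. The sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
num_pairs = 10_000

# Queries and keys with independent, unit-variance components.
q = rng.standard_normal((num_pairs, d_k))
k = rng.standard_normal((num_pairs, d_k))

raw_scores = np.sum(q * k, axis=-1)          # one dot product per (q, k) pair
scaled_scores = raw_scores / np.sqrt(d_k)    # the scaling used in attention

print("std of raw scores:   ", raw_scores.std())
print("std of scaled scores:", scaled_scores.std())
```

Under these assumptions the raw scores have a standard deviation close to $\sqrt{d_k} = 8$, while the scaled scores stay close to 1, which keeps the softmax away from its saturated region.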

### Attention Example

This example demonstrates how attention works with concrete values:

- **Keys**: One-hot vectors identifying different positions
- **Values**: Information content at each position  
- **Query**: Selects which key to attend to

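The query `[0, 10, 0]` matches the second key `[0, 10, 0]` and is orthogonal to the others, so its scaled logit exceeds theirs by $100/\sqrt{3} \approx 57.7$. This results in:
- **Attention weight**: approximately 1.0 for the second position and approximately 0.0 for the others
- **Output**: The second value `[10, 0]` is returned (up to floating-point precision)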

This shows how attention creates **content-based addressing** - the model can learn to look up information based on learned similarity patterns.

## Multi-Head Attention

**Multi-head attention** runs multiple attention functions in parallel, allowing the model to attend to different types of information simultaneously.

### Architecture Design

The multi-head mechanism:

1. **Linear Projections**: Projects Q, K, V through learned linear transformations
2. **Head Splitting**: Divides the model dimension across multiple attention heads
3. **Parallel Processing**: Runs scaled dot-product attention on each head independently
4. **Concatenation**: Combines outputs from all heads
5. **Final Projection**: Projects the concatenated result back to model dimension

### Mathematical Formulation

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$

Where each head is:
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

### Benefits for Differentiable Programming

- **Diverse Representations**: Different heads can learn different types of relationships
- **Parallelizability**: All heads compute independently, enabling efficient parallel processing
- **Gradient Flow**: Multiple pathways for gradient information
- **Interpretability**: Different heads often specialize in different linguistic or logical patterns
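
The head-splitting and concatenation steps are pure reshapes and transposes. The following NumPy sketch, with arbitrary small sizes chosen for illustration, shows that splitting and merging are exact inverses and that each head sees a contiguous slice of the model dimension.

```python
import numpy as np

batch_size, seq_len, d_model, num_heads = 2, 5, 16, 4
depth = d_model // num_heads  # features per head

x = np.arange(batch_size * seq_len * d_model, dtype=np.float32)
x = x.reshape(batch_size, seq_len, d_model)

# Split: (batch, seq, d_model) -> (batch, heads, seq, depth)
heads = x.reshape(batch_size, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

# Merge: undo the transpose, then flatten the head axis back into d_model.
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

print(heads.shape)                                     # (2, 4, 5, 4)
print(np.array_equal(merged, x))                       # True
print(np.array_equal(heads[:, 1], x[:, :, 4:8]))       # True: head 1 holds features 4..7
```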

In [4]:
def scaled_dot_product_attention(q, k, v, mask=None):
    """Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type (padding or look ahead)
    but it must be broadcastable for addition.

    Args:
    q: query shape == (..., seq_len_q, depth)
    k: key shape == (..., seq_len_k, depth)
    v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable 
          to (..., seq_len_q, seq_len_k). Defaults to None.

    Returns:
    output, attention_weights
    """

    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add the mask to the scaled tensor.
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)  

    # softmax is normalized on the last axis (seq_len_k) so that the scores
    # add up to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    return output, attention_weights

## Position-wise Feed-Forward Networks

Each Transformer layer includes a **position-wise feed-forward network** that processes each position independently. The attention output is a weighted average of value vectors, so this component supplies the per-position non-linear transformation that the averaging lacks.

### Architecture

The feed-forward network consists of:
1. **First Linear Layer**: Expands dimensionality (typically 4x model dimension)
2. **ReLU Activation**: Introduces non-linearity
3. **Second Linear Layer**: Projects back to model dimension

### Mathematical Representation

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

### Role in Differentiable Programming

- **Non-linearity**: Essential for learning complex mappings
- **Position Independence**: Each sequence position is processed identically
- **Capacity**: The expansion to higher dimensions provides modeling capacity
- **Regularization**: Can be augmented with dropout for better generalization
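
A small NumPy sketch of the formula above illustrates the position-independence property. The weights are random placeholders rather than trained parameters, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, dff = 6, 8, 32   # dff = 4 * d_model

# Randomly initialised parameters stand in for trained weights.
W1, b1 = rng.standard_normal((d_model, dff)), np.zeros(dff)
W2, b2 = rng.standard_normal((dff, d_model)), np.zeros(d_model)

def ffn(x):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied along the last axis."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.standard_normal((seq_len, d_model))

whole_sequence = ffn(x)                          # all positions at once
one_by_one = np.stack([ffn(row) for row in x])   # each position separately

print(whole_sequence.shape)                      # (6, 8)
print(np.allclose(whole_sequence, one_by_one))   # True
```

Because the same weights are applied to every position, the network never mixes information between positions; that mixing is left to attention.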

In [5]:
def print_out(q, k, v):
    temp_out, temp_attn = scaled_dot_product_attention(
        q, k, v, None)
    print('Attention weights are:')
    print(temp_attn)
    print('Output is:')
    print(temp_out)

np.set_printoptions(suppress=True)
temp_k = tf.constant([[10,0,0],
                      [0,10,0],
                      [0,0,10],
                      [0,0,10]], dtype=tf.float32)  # (4, 3)

temp_v = tf.constant([[   1,0],
                      [  10,0],
                      [ 100,5],
                      [1000,6]], dtype=tf.float32)  # (4, 2)

# This `query` aligns with the second `key`,
# so the second `value` is returned.
temp_q = tf.constant([[0, 10, 0]], dtype=tf.float32)  # (1, 3)
print_out(temp_q, temp_k, temp_v)

Attention weights are:
tf.Tensor([[0. 1. 0. 0.]], shape=(1, 4), dtype=float32)
Output is:
tf.Tensor([[10.  0.]], shape=(1, 2), dtype=float32)
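
As a variation on the example above, a query that matches two keys equally splits its attention between them and returns the average of their values. The sketch below re-implements unmasked attention in NumPy so it can run on its own; the helper names `softmax` and `attention` are illustrative.

```python
import numpy as np

def softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)   # subtract the max for stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """NumPy sketch of scaled dot-product attention without a mask."""
    logits = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(logits)
    return weights @ v, weights

keys = np.array([[10, 0, 0], [0, 10, 0], [0, 0, 10], [0, 0, 10]], dtype=np.float64)
values = np.array([[1, 0], [10, 0], [100, 5], [1000, 6]], dtype=np.float64)

# This query aligns with the third and fourth keys, which are identical.
query = np.array([[0, 0, 10]], dtype=np.float64)
output, weights = attention(query, keys, values)

print(np.round(weights, 3))  # weights are approximately [0, 0, 0.5, 0.5]
print(np.round(output, 3))   # output is approximately [550, 5.5]
```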


## Encoder Layer

The **Encoder Layer** combines multi-head attention and feed-forward networks with crucial architectural innovations: **residual connections** and **layer normalization**.

### Layer Architecture

Each encoder layer follows this pattern:
1. **Multi-Head Self-Attention**: Allows positions to attend to each other
2. **Residual Connection + Dropout**: Adds input to attention output
3. **Layer Normalization**: Normalizes the combined result
4. **Feed-Forward Network**: Applies position-wise transformations
5. **Residual Connection + Dropout**: Adds pre-FFN input to FFN output  
6. **Layer Normalization**: Final normalization

### Residual Connections

The residual connections are crucial for deep networks:
$$\text{LayerNorm}(x + \text{Sublayer}(x))$$

Benefits:
- **Gradient Flow**: The identity path mitigates vanishing gradients in deep networks
- **Training Stability**: Easier optimization of deep architectures
- **Identity Preservation**: Allows layers to learn incremental changes

### Layer Normalization

Applied after each sub-layer to:
- **Stabilize Training**: Keeps the scale of activations consistent from layer to layer
- **Accelerate Convergence**: Tolerates higher learning rates
- **Batch Independence**: Normalizes across the features of each position, not across the batch dimension
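
The pattern $\text{LayerNorm}(x + \text{Sublayer}(x))$ can be sketched in NumPy as follows. This simplified `layer_norm` omits the learned scale and shift that `tf.keras.layers.LayerNormalization` applies, and `sublayer` is a placeholder function, not attention or a feed-forward network.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    """Normalise over the last (feature) axis; no learned scale or shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def sublayer(x):
    """Placeholder for attention or the feed-forward network."""
    return np.tanh(x) * 3.0

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8)) * 10.0 + 5.0   # (batch, seq_len, d_model)

out = layer_norm(x + sublayer(x))                 # residual, then normalisation

print(out.shape)                                          # (2, 4, 8)
print(np.allclose(out.mean(axis=-1), 0.0, atol=1e-6))     # True
print(np.allclose(out.std(axis=-1), 1.0, atol=1e-3))      # True
```

Each position is normalised using only its own features, so the result is the same whatever else is in the batch.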

In [6]:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth).
        Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)

        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

        concat_attention = tf.reshape(scaled_attention, 
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights

## Decoder Layer

The **Decoder Layer** extends the encoder architecture with an additional attention mechanism for **encoder-decoder attention**, enabling the model to condition generation on the input sequence.

### Three-Stage Architecture

1. **Masked Self-Attention**: Prevents positions from attending to future positions
2. **Encoder-Decoder Attention**: Attends to the encoder output
3. **Feed-Forward Network**: Position-wise transformations

Each stage includes residual connections and layer normalization.

### Masked Self-Attention

Uses a **look-ahead mask** to prevent the model from seeing future tokens during training:
- **Causality**: Ensures autoregressive property
- **Training Efficiency**: Allows parallel processing during training
- **Inference Consistency**: Each position is predicted from earlier positions only, as it is at inference time
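
A look-ahead mask can be sketched in NumPy as an upper-triangular matrix. The sketch follows the convention of `scaled_dot_product_attention`, where a mask value of 1 marks a position to hide and is multiplied by `-1e9` before the softmax. The helper name `look_ahead_mask` is an illustrative assumption.

```python
import numpy as np

def look_ahead_mask(size):
    """1.0 marks positions to hide, following the (mask * -1e9) convention."""
    return np.triu(np.ones((size, size)), k=1)

def softmax(x):
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

size = 4
mask = look_ahead_mask(size)
print(mask)

# Uniform logits make the effect of the mask easy to read off:
# row t spreads its weight evenly over positions 0..t and gives 0 to the rest.
logits = np.zeros((size, size))
weights = softmax(logits + mask * -1e9)
print(np.round(weights, 2))
```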

### Encoder-Decoder Attention (Cross-Attention)

The crucial mechanism for sequence-to-sequence tasks:
- **Queries**: Come from the decoder (what we're generating)
- **Keys & Values**: Come from the encoder (input context)
- **Information Flow**: Allows decoder to access input information

This creates a differentiable pathway for information to flow from input to output, enabling learned translations, summarizations, and other sequence transformations.
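
The shapes involved in cross-attention can be illustrated with a single-head NumPy sketch. It omits the learned projections and the batch dimension, and the random arrays merely stand in for real encoder outputs and decoder states.

```python
import numpy as np

def softmax(x):
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
input_seq_len, target_seq_len, depth = 7, 3, 16

enc_output = rng.standard_normal((input_seq_len, depth))    # keys and values
dec_state = rng.standard_normal((target_seq_len, depth))    # queries

weights = softmax(dec_state @ enc_output.T / np.sqrt(depth))
context = weights @ enc_output

print(weights.shape)   # (3, 7): one row of input weights per target position
print(context.shape)   # (3, 16): one context vector per target position
```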

In [7]:
# temp_mha = MultiHeadAttention(d_model=512, num_heads=8)
# y = tf.random.uniform((1, 60, 512))  # (batch_size, encoder_sequence, d_model)
# out, attn = temp_mha(y, k=y, q=y, mask=None)
# out.shape, attn.shape

## Complete Encoder

The **Encoder** stacks multiple encoder layers to build hierarchical representations of the input sequence.

### Architecture Components

1. **Token Embedding**: Converts discrete tokens to continuous vectors
2. **Embedding Scaling**: Multiplies embeddings by $\sqrt{d_{model}}$ before the positional encoding is added
3. **Positional Encoding**: Adds positional information
4. **Dropout**: Regularization to prevent overfitting
5. **Stacked Layers**: Multiple encoder layers for deep representation learning

### Information Flow

Each encoder layer refines the representation:
- **Layer 1**: Basic token relationships and local patterns
- **Layer 2**: More complex interactions and phrase-level understanding  
- **Layer N**: Abstract, high-level representations

### Differentiable Processing

The encoder creates a **differentiable transformation** from discrete token sequences to rich contextual representations, with gradients flowing through:
- Attention patterns (learned token relationships)
- Token embeddings (the sinusoidal positional encodings are added as constants and receive no updates)
- Feed-forward networks (non-linear transformations)
- Residual connections (identity preservation)

In [8]:
def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
      tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)
      tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
    ])

## Complete Decoder

The **Decoder** generates output sequences autoregressively while attending to the encoder's representation of the input.

### Architecture Components

1. **Target Embedding**: Embeds the target sequence tokens
2. **Positional Encoding**: Adds positional information for generation
3. **Stacked Decoder Layers**: Multiple layers with self-attention and cross-attention
4. **Attention Weight Collection**: Tracks attention patterns for interpretability

### Autoregressive Generation

The decoder generates sequences token by token:
- **Training**: Processes entire target sequence with masking
- **Inference**: Generates one token at a time, using previous outputs as input
- **Causality**: Look-ahead masking ensures proper autoregressive behavior

### Attention Pattern Storage

The decoder collects attention weights from each layer:
- **Self-attention patterns**: How the decoder attends to previous tokens
- **Cross-attention patterns**: How the decoder attends to encoder outputs
- **Interpretability**: These patterns reveal learned alignments and dependencies

This enables analysis of what the model has learned and debugging of generation behavior.
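
The inference-time loop can be sketched in plain Python. Everything here is hypothetical: `toy_next_token_logits` is a hand-written stand-in for a trained decoder, and the token ids for `START` and `END` are arbitrary. Only the structure of the loop is the point.

```python
START, END = 0, 1          # hypothetical special-token ids
MAX_LENGTH = 10

def toy_next_token_logits(prefix, vocab_size=5):
    """Stand-in for a trained decoder: counts upward, then emits END."""
    last = prefix[-1]
    if last == vocab_size - 1:
        target = END
    elif last == START:
        target = 2
    else:
        target = last + 1
    return [1.0 if token == target else 0.0 for token in range(vocab_size)]

def greedy_decode(next_token_logits, max_length=MAX_LENGTH):
    output = [START]
    for _ in range(max_length):
        logits = next_token_logits(output)    # conditioned on all previous outputs
        next_token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        output.append(next_token)
        if next_token == END:
            break
    return output

print(greedy_decode(toy_next_token_logits))  # [0, 2, 3, 4, 1]
```

In a real model, each call would rerun the decoder on the tokens generated so far, with the encoder output held fixed.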

In [9]:
# sample_ffn = point_wise_feed_forward_network(512, 2048)
# sample_ffn(tf.random.uniform((64, 50, 512))).shape

## Complete Transformer Model

The **Transformer** combines the encoder and decoder into a complete sequence-to-sequence model.

### Architecture Overview

1. **Encoder**: Processes input sequence into contextual representations
2. **Decoder**: Generates output sequence conditioned on encoder representations
3. **Final Linear Layer**: Projects decoder output to vocabulary logits
4. **End-to-End Training**: Entire model trained with next-token prediction

### Information Flow

```
Input Tokens → Encoder → Context Representations
                            ↓
Target Tokens → Decoder → Output Representations → Linear → Logits
```

### Differentiable Sequence-to-Sequence Learning

The Transformer creates a fully differentiable pathway from input sequences to output sequences:

- **Encoder-Decoder Attention**: Differentiable alignment between input and output
- **Self-Attention**: Differentiable modeling of sequence dependencies  
- **Autoregressive Training**: Teacher forcing enables parallel training
- **End-to-End Gradients**: Loss gradients flow through the entire architecture

This makes the Transformer an ideal architecture for differentiable programming applications where we need to learn complex input-output mappings while maintaining interpretability through attention mechanisms.
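
Teacher forcing is commonly implemented by shifting the target sequence by one position. A minimal NumPy sketch follows, with made-up token ids (1 for start, 2 for end, 0 for padding) that are assumptions for illustration only.

```python
import numpy as np

# Hypothetical token ids: 1 = start, 2 = end, 0 = padding.
tar = np.array([[1, 7, 8, 9, 2],
                [1, 5, 6, 2, 0]])

tar_inp = tar[:, :-1]    # what the decoder receives as input
tar_real = tar[:, 1:]    # what it must predict at each position

print(tar_inp)
print(tar_real)
```

At position `t` the decoder receives `tar_inp[:, :t + 1]` (enforced by the look-ahead mask) and is trained to output `tar_real[:, t]`, so all positions can be trained in one parallel pass.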

In [10]:
class EncoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(EncoderLayer, self).__init__()

        self.mha = MultiHeadAttention(d_model, num_heads)
        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        attn_output, _ = self.mha(x, x, x, mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

        return out2

## Model Configuration

Here we define the hyperparameters for a small Transformer model suitable for experimentation and learning:

### Architecture Parameters

- **num_layers = 4**: Number of encoder/decoder layers (relatively shallow for fast training)
- **d_model = 128**: Model dimension (smaller than typical production models)
- **dff = 512**: Feed-forward network dimension (4x model dimension)
- **num_heads = 8**: Number of attention heads

### Vocabulary Configuration

- **input_vocab_size = 10**: Small vocabulary for demonstration
- **target_vocab_size = 2**: Binary output vocabulary (suitable for classification tasks)

### Training Parameters

- **dropout_rate = 0.1**: Standard dropout rate for regularization

This configuration creates a compact Transformer suitable for educational purposes while maintaining all the key architectural components of larger models.
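
As a rough sanity check on this configuration, the per-head depth and the number of trainable parameters in one encoder layer can be worked out by hand. The sketch assumes each `Dense` layer has a kernel and a bias and each `LayerNormalization` has a scale and a shift vector, which matches the Keras defaults; the helper `dense_params` is illustrative.

```python
d_model, dff, num_heads = 128, 512, 8

assert d_model % num_heads == 0
depth = d_model // num_heads                      # features per attention head

def dense_params(n_in, n_out):
    return n_in * n_out + n_out                   # kernel plus bias

mha_params = 4 * dense_params(d_model, d_model)   # wq, wk, wv, output projection
ffn_params = dense_params(d_model, dff) + dense_params(dff, d_model)
layernorm_params = 2 * (2 * d_model)              # scale and shift, two norms

encoder_layer_params = mha_params + ffn_params + layernorm_params

print(depth)                  # 16
print(encoder_layer_params)   # 198272
```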

## Summary: Transformers in Differentiable Programming

The Transformer architecture is a prominent example of differentiable programming design, demonstrating several key principles:

### Architectural Innovations

1. **Pure Attention**: Eliminates recurrence while maintaining sequence modeling capability
2. **Parallel Processing**: All positions processed simultaneously for efficiency
3. **Hierarchical Representations**: Stacked layers build increasingly abstract representations
4. **Residual Connections**: Enable deep architectures through improved gradient flow

### Differentiable Programming Principles

- **End-to-End Learning**: Entire model optimized jointly with gradient descent
- **Modular Design**: Components can be mixed, matched, and reused
- **Interpretable Attention**: Learned attention patterns provide insight into model behavior
- **Scalable Architecture**: Principles scale from small experiments to large language models

### Applications Beyond NLP

The Transformer's differentiable design has inspired applications across domains:
- **Computer Vision**: Vision Transformers (ViTs) for image classification
- **Scientific Computing**: Protein structure prediction (AlphaFold)
- **Reinforcement Learning**: Decision Transformers for sequential decision making
- **Multimodal Learning**: Cross-modal attention for vision-language tasks

### Connection to Other Notebooks

This Transformer implementation connects to other differentiable programming concepts explored in this handbook:
- **Custom Gradients**: Attention mechanisms can benefit from custom gradient implementations
- **Differentiable Data Structures**: Attention creates differentiable memory access patterns
- **Boolean Logic**: Self-attention can learn to implement logical operations
- **Algorithmic Learning**: Transformers can learn to execute algorithms through attention patterns

The Transformer demonstrates how sophisticated computational patterns can emerge from simple, differentiable building blocks when combined with appropriate training objectives and architectural innovations.

In [11]:
# sample_encoder_layer = EncoderLayer(512, 8, 2048)

# sample_encoder_layer_output = sample_encoder_layer(
#     tf.random.uniform((64, 43, 512)), False, None)

# sample_encoder_layer_output.shape  # (batch_size, input_seq_len, d_model)

In [12]:
class DecoderLayer(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(DecoderLayer, self).__init__()

        self.mha1 = MultiHeadAttention(d_model, num_heads)
        self.mha2 = MultiHeadAttention(d_model, num_heads)

        self.ffn = point_wise_feed_forward_network(d_model, dff)

        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)
        self.dropout3 = tf.keras.layers.Dropout(rate)


    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        # enc_output.shape == (batch_size, input_seq_len, d_model)

        attn1, attn_weights_block1 = self.mha1(x, x, x, look_ahead_mask)  # (batch_size, target_seq_len, d_model)
        attn1 = self.dropout1(attn1, training=training)
        out1 = self.layernorm1(attn1 + x)
        

        attn2, attn_weights_block2 = self.mha2(
            enc_output, enc_output, out1, padding_mask)  # (batch_size, target_seq_len, d_model)
        attn2 = self.dropout2(attn2, training=training)
        out2 = self.layernorm2(attn2 + out1)  # (batch_size, target_seq_len, d_model)

        ffn_output = self.ffn(out2)  # (batch_size, target_seq_len, d_model)
        ffn_output = self.dropout3(ffn_output, training=training)
        out3 = self.layernorm3(ffn_output + out2)  # (batch_size, target_seq_len, d_model)

        return out3, attn_weights_block1, attn_weights_block2

In [13]:
# sample_decoder_layer = DecoderLayer(512, 8, 2048)

# sample_decoder_layer_output, block1, block2 = sample_decoder_layer(
#     tf.random.uniform((64, 50, 512)), sample_encoder_layer_output, 
#     False, None, None)

# # (batch_size, target_seq_len, d_model)
# print(sample_decoder_layer_output.shape, block1.shape, block2.shape)  

In [14]:
class Encoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
               maximum_position_encoding, rate=0.1):
        super(Encoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, 
                                                self.d_model)


        self.enc_layers = [EncoderLayer(d_model, num_heads, dff, rate) 
                           for _ in range(num_layers)]

        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):

        seq_len = tf.shape(x)[1]

        # adding embedding and position encoding.
        x = self.embedding(x)  # (batch_size, input_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x = self.enc_layers[i](x, training, mask)

        return x  # (batch_size, input_seq_len, d_model)


In [15]:
# sample_encoder = Encoder(num_layers=2, d_model=512, num_heads=8, 
#                          dff=2048, input_vocab_size=8500,
#                          maximum_position_encoding=10000)
# temp_input = tf.random.uniform((64, 62), dtype=tf.int64, minval=0, maxval=200)

# sample_encoder_output = sample_encoder(temp_input, training=False, mask=None)

# print (sample_encoder_output.shape)  # (batch_size, input_seq_len, d_model)

In [16]:
class Decoder(tf.keras.layers.Layer):
    def __init__(self, num_layers, d_model, num_heads, dff, target_vocab_size,
        maximum_position_encoding, rate=0.1):
        super(Decoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        self.embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)

        self.dec_layers = [
            DecoderLayer(d_model, num_heads, dff, rate) 
            for _ in range(num_layers)
        ]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, enc_output, training, look_ahead_mask, padding_mask):
        seq_len = tf.shape(x)[1]
        attention_weights = {}

        x = self.embedding(x)  # (batch_size, target_seq_len, d_model)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]

        x = self.dropout(x, training=training)

        for i in range(self.num_layers):
            x, block1, block2 = self.dec_layers[i](x, enc_output, training,
                look_ahead_mask, padding_mask)

            attention_weights['decoder_layer{}_block1'.format(i+1)] = block1
            attention_weights['decoder_layer{}_block2'.format(i+1)] = block2

        # x.shape == (batch_size, target_seq_len, d_model)
        return x, attention_weights

In [17]:
# sample_decoder = Decoder(num_layers=2, d_model=512, num_heads=8, 
#                          dff=2048, target_vocab_size=8000,
#                          maximum_position_encoding=5000)
# temp_input = tf.random.uniform((64, 26), dtype=tf.int64, minval=0, maxval=200)

# output, attn = sample_decoder(temp_input, 
#                               enc_output=sample_encoder_output, 
#                               training=False,
#                               look_ahead_mask=None, 
#                               padding_mask=None)

# output.shape, attn['decoder_layer2_block2'].shape

In [18]:
class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size, 
            target_vocab_size, pe_input, pe_target, rate=0.1):
        super(Transformer, self).__init__()

        self.encoder = Encoder(num_layers, d_model, num_heads, dff, input_vocab_size, pe_input, rate)

        self.decoder = Decoder(num_layers, d_model, num_heads, dff, target_vocab_size, pe_target, rate)

        self.final_layer = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inp, tar, training, enc_padding_mask, look_ahead_mask, dec_padding_mask):
        enc_output = self.encoder(inp, training, enc_padding_mask)  # (batch_size, inp_seq_len, d_model)

        # dec_output.shape == (batch_size, tar_seq_len, d_model)
        dec_output, attention_weights = self.decoder(
            tar, enc_output, training, look_ahead_mask, dec_padding_mask)

        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

        return final_output, attention_weights

In [19]:
# sample_transformer = Transformer(
#     num_layers=2, d_model=512, num_heads=8, dff=2048, 
#     input_vocab_size=8500, target_vocab_size=8000, 
#     pe_input=10000, pe_target=6000)

# temp_input = tf.random.uniform((64, 38), dtype=tf.int64, minval=0, maxval=200)
# temp_target = tf.random.uniform((64, 36), dtype=tf.int64, minval=0, maxval=200)

# fn_out, _ = sample_transformer(temp_input, temp_target, training=False, 
#                                enc_padding_mask=None, 
#                                look_ahead_mask=None,
#                                dec_padding_mask=None)

# fn_out.shape  # (batch_size, tar_seq_len, target_vocab_size)

In [20]:
num_layers = 4
d_model = 128
dff = 512
num_heads = 8

input_vocab_size = 10
target_vocab_size = 2
dropout_rate = 0.1

In [21]:
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()

        # Store d_model as float32 so rsqrt can be applied to it.
        self.d_model = tf.cast(d_model, tf.float32)

        self.warmup_steps = warmup_steps

    def __call__(self, step):
        # The optimizer may pass an integer step; rsqrt requires a float.
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)

        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)
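
The schedule above computes $d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$. The plain-Python sketch below evaluates the same expression without TensorFlow to show its shape: a linear ramp up to `warmup_steps`, then inverse-square-root decay. The function name `transformer_learning_rate` is an illustrative assumption, and the default `d_model=128` is taken from the configuration in this notebook.

```python
def transformer_learning_rate(step, d_model=128, warmup_steps=4000):
    """Plain-Python version of the warmup schedule; step must be >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000):
    print(step, transformer_learning_rate(step))
```

The two arguments of `min` are equal at `step == warmup_steps`, which is where the learning rate peaks; at four times that step count it has fallen to half the peak.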