# **9. Attention**

### Multi-Head Attention Parameter Count Explanation

The **Multi-Head Attention (MHA)** block in a Transformer runs attention in $ h $ parallel "heads", each with its own learned projections, so that different heads can attend to different relationships in the input. Its parameter count is derived below, counting weight matrices only and ignoring bias terms:

1. **Key Components**:
   - $ Q $: Query matrix (one query vector per token)
   - $ K $: Key matrix (one key vector per token)
   - $ V $: Value matrix (one value vector per token)

   Each head independently computes $ Q $, $ K $, and $ V $ using learned weight matrices:
   $$
   Q = XW_Q, \quad K = XW_K, \quad V = XW_V
   $$
   where:
   - $ X $: Input data of shape $ (N, d_{\text{model}}) $, where $ N $ is the sequence length and $ d_{\text{model}} $ is the feature size.
   - $ W_Q, W_K, W_V $: Weight matrices of shape $ (d_{\text{model}}, d_{\text{head}}) $.

2. **Number of Heads**:
   - $ d_{\text{model}} $: Total feature size (e.g., 512 in many Transformer implementations).
   - $ d_{\text{head}} $: Feature size per head ($ d_{\text{head}} = d_{\text{model}} / h $, where $ h $ is the number of heads; e.g., $ 512 / 8 = 64 $).

3. **Parameters for Q, K, V**:
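   - Each head holds three matrices of $ d_{\text{model}} \times d_{\text{head}} $ weights. Summing over $ h $ heads and using $ h \times d_{\text{head}} = d_{\text{model}} $, the total for $ W_Q, W_K, W_V $ is:
     $$
     3 \times (d_{\text{model}} \times d_{\text{head}} \times h)
     = 3 \times (d_{\text{model}} \times d_{\text{model}})
     $$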

4. **Output Projection**:
   After computing attention, the concatenated output of all heads is projected back into $ d_{\text{model}} $ space:
   $$
   W_O: (d_{\text{model}}, d_{\text{model}})
   $$

   Parameters for output projection:
   $$
   d_{\text{model}} \times d_{\text{model}}
   $$

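5. **Total Parameters**:
   Combining all components (weights only):
   $$
   \text{Total Parameters} = 3 \times d_{\text{model}}^2 + d_{\text{model}}^2 = 4 \times d_{\text{model}}^2
   $$
   With bias vectors on all four projections, which is the Keras default (`use_bias=True`), add $ 4 \times d_{\text{model}} $. For $ d_{\text{model}} = 512 $ this gives $ 4 \times 512^2 = 1{,}048{,}576 $ weights plus $ 2{,}048 $ biases, or $ 1{,}050{,}624 $ parameters in total.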
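
The derivation above can be checked by allocating the matrices and counting their entries. The following is a minimal NumPy sketch, not the Keras implementation: the helper names `init_mha_weights`, `softmax`, and `multi_head_attention` are hypothetical, biases are omitted to match the weights-only formula, and the scaled dot-product step is the standard formulation, which this section does not spell out.

```python
import numpy as np

def init_mha_weights(d_model, num_heads, rng):
    """Allocate per-head W_Q, W_K, W_V and one shared W_O (no biases)."""
    d_head = d_model // num_heads
    per_head = (num_heads, d_model, d_head)  # one (d_model, d_head) matrix per head
    scale = 1.0 / np.sqrt(d_model)
    return {
        "W_Q": rng.standard_normal(per_head) * scale,
        "W_K": rng.standard_normal(per_head) * scale,
        "W_V": rng.standard_normal(per_head) * scale,
        "W_O": rng.standard_normal((d_model, d_model)) * scale,
    }

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, w):
    """Self-attention over X of shape (N, d_model); returns (N, d_model)."""
    d_head = w["W_Q"].shape[-1]
    # Q = X W_Q, K = X W_K, V = X W_V for every head at once: (h, N, d_head)
    Q, K, V = X @ w["W_Q"], X @ w["W_K"], X @ w["W_V"]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, N, N)
    heads = softmax(scores) @ V                          # (h, N, d_head)
    # Concatenate the heads along the feature axis: (N, h * d_head)
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)
    return concat @ w["W_O"]                             # project back to d_model

rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 512, 8, 50
weights = init_mha_weights(d_model, num_heads, rng)
X = rng.standard_normal((seq_len, d_model))

out = multi_head_attention(X, weights)
n_params = sum(a.size for a in weights.values())
print(out.shape, n_params)  # (50, 512) 1048576
```

The count equals $ 4 \times d_{\text{model}}^2 $ for any number of heads that divides $ d_{\text{model}} $, since splitting into heads only partitions the projection matrices.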


In [17]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, LayerNormalization, Dropout, MultiHeadAttention, Input
from tensorflow.keras.models import Model

# Transformer Encoder Layer (Explicit)
def transformer_encoder_explicit(inputs, d_model, num_heads, ff_dim, dropout_rate=0.1):
    """
    Transformer encoder layer explicitly showing MultiHeadAttention.
    
    Parameters:
        inputs (tf.Tensor): Input tensor.
        d_model (int): Dimensionality of the model (embedding size).
        num_heads (int): Number of attention heads.
        ff_dim (int): Dimensionality of the feed-forward network.
        dropout_rate (float): Dropout rate.
        
    Returns:
        tf.Tensor: Output tensor of the encoder layer.
    """
    # Multi-Head Attention
    attention_output = MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)(inputs, inputs)
    attention_output = Dropout(dropout_rate)(attention_output)
    attention_output = LayerNormalization(epsilon=1e-6)(inputs + attention_output)  # Add & Norm

    # Feed-Forward Network
    ffn_output = Dense(ff_dim, activation="relu")(attention_output)
    ffn_output = Dense(d_model)(ffn_output)
    ffn_output = Dropout(dropout_rate)(ffn_output)
    encoder_output = LayerNormalization(epsilon=1e-6)(attention_output + ffn_output)  # Add & Norm

    return encoder_output

# Build Transformer Model (Explicit)
def build_transformer_model_explicit(input_shape, d_model, num_heads, ff_dim, num_layers, dropout_rate=0.1):
    """
    Builds a Transformer model explicitly showing Multi-Head Attention in model.summary().
    
    Parameters:
        input_shape (tuple): Shape of the input sequence (e.g., (seq_length, d_model)).
        d_model (int): Dimensionality of the model (embedding size).
        num_heads (int): Number of attention heads.
        ff_dim (int): Dimensionality of the feed-forward network.
        num_layers (int): Number of encoder layers.
        dropout_rate (float): Dropout rate.
        
    Returns:
        tf.keras.Model: A Transformer model.
    """
    inputs = Input(shape=input_shape)
    x = inputs

    for _ in range(num_layers):
        x = transformer_encoder_explicit(x, d_model, num_heads, ff_dim, dropout_rate)

    # Placeholder output layer: per-token softmax over d_model units; replace with a task-specific head
    outputs = Dense(d_model, activation="softmax")(x)
    return Model(inputs=inputs, outputs=outputs)

# Model parameters
input_shape = (50, 512)  # Sequence length 50, feature size 512
d_model = 512
num_heads = 8
ff_dim = 2048
num_layers = 1  # Number of encoder layers
dropout_rate = 0.1

# Create the model
transformer_model_explicit = build_transformer_model_explicit(input_shape, d_model, num_heads, ff_dim, num_layers, dropout_rate)

# Print model summary
transformer_model_explicit.summary()
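
To know what to look for in the summary, the expected parameter counts can be worked out by hand. The sketch below is plain arithmetic, not a call into Keras. It assumes Keras defaults: `Dense` and `MultiHeadAttention` use biases, `LayerNormalization` learns one scale and one offset per feature, and `Dropout` has no parameters. The helper names are hypothetical, and the numbers are meant to be compared against the `summary()` output rather than taken on trust.

```python
def dense_params(in_dim, out_dim):
    """Kernel plus bias of a fully connected layer."""
    return in_dim * out_dim + out_dim

def layer_norm_params(dim):
    """One scale (gamma) and one offset (beta) per feature."""
    return 2 * dim

def mha_params(d_model):
    """Four d_model x d_model projections, each with a d_model-sized bias."""
    return 4 * d_model * d_model + 4 * d_model

d_model, ff_dim, num_layers = 512, 2048, 1

per_encoder = {
    "multi_head_attention": mha_params(d_model),
    "layer_norm_1": layer_norm_params(d_model),
    "ffn_dense_1": dense_params(d_model, ff_dim),
    "ffn_dense_2": dense_params(ff_dim, d_model),
    "layer_norm_2": layer_norm_params(d_model),
}
output_dense = dense_params(d_model, d_model)
total = num_layers * sum(per_encoder.values()) + output_dense

for name, n in per_encoder.items():
    print(f"{name:>22}: {n:>10,}")
print(f"{'output_dense':>22}: {output_dense:>10,}")
print(f"{'total':>22}: {total:>10,}")  # total: 3,415,040
```

Under these assumptions the attention block and the first feed-forward layer each contribute 1,050,624 parameters, so a single encoder layer is split roughly one third attention and two thirds feed-forward.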


In [16]:
from tensorflow.keras.layers import MultiHeadAttention

# Instantiate the MultiHeadAttention layer
attention_layer = MultiHeadAttention(num_heads=8, key_dim=64)  # Example config

# Print the configuration
print(attention_layer.get_config())

{'name': 'multi_head_attention_22', 'trainable': True, 'dtype': {'module': 'keras', 'class_name': 'DTypePolicy', 'config': {'name': 'float32'}, 'registered_name': None}, 'num_heads': 8, 'key_dim': 64, 'value_dim': 64, 'dropout': 0.0, 'use_bias': True, 'output_shape': None, 'attention_axes': None, 'kernel_initializer': {'module': 'keras.initializers', 'class_name': 'GlorotUniform', 'config': {'seed': None}, 'registered_name': None}, 'bias_initializer': {'module': 'keras.initializers', 'class_name': 'Zeros', 'config': {}, 'registered_name': None}, 'kernel_regularizer': None, 'bias_regularizer': None, 'activity_regularizer': None, 'kernel_constraint': None, 'bias_constraint': None, 'seed': None}
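
The config shows that `key_dim` and `value_dim` are set per head and are independent of the input feature size, so the $ 4 \times d_{\text{model}}^2 $ rule only holds when `num_heads * key_dim` equals the input size. The sketch below generalizes the count from a few config fields. It is a hand-written helper (`mha_params_from_config` is a hypothetical name), it covers self-attention only, and it assumes `output_shape` is `None`, so the output is projected back to the input feature size.

```python
def mha_params_from_config(config, input_dim):
    """Estimate the parameter count of a self-attention layer from its config.

    Assumes query, key and value all come from one input with `input_dim`
    features, and that the output is projected back to `input_dim`.
    """
    h = config["num_heads"]
    key_dim = config["key_dim"]
    value_dim = config["value_dim"] or key_dim  # fall back to key_dim if unset

    kernels = (
        2 * input_dim * h * key_dim    # query and key projections
        + input_dim * h * value_dim    # value projection
        + h * value_dim * input_dim    # output projection
    )
    biases = (2 * h * key_dim + h * value_dim + input_dim) if config["use_bias"] else 0
    return kernels + biases

# Fields copied from the printed config above
config = {"num_heads": 8, "key_dim": 64, "value_dim": 64, "use_bias": True}

print(mha_params_from_config(config, input_dim=512))  # 1050624
print(mha_params_from_config(config, input_dim=256))  # 526080
```

With 256 input features the same layer has 526,080 parameters rather than $ 4 \times 256^2 + 4 \times 256 = 263{,}168 $, because the heads still project to $ 8 \times 64 = 512 $ dimensions internally.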
