In [1]:
pip install bertviz

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=True)

{'status': 'ok', 'restart': True}

# Transformer Anatomy

The Transformer architecture, introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017), underlies most modern NLP models. Unlike earlier models that relied on recurrent or convolutional layers, Transformers use self-attention to capture dependencies between words in a sentence, regardless of their distance.

### Key Components of the Transformer Architecture:

1. **Positional Encoding**: Adds information about the position of words in a sequence since the model itself does not inherently understand word order.
2. **Self-Attention Mechanism**: Allows the model to weigh the importance of each word in a sentence relative to all other words.
3. **Multi-Head Attention**: Enables the model to focus on different parts of the input simultaneously.
4. **Layer Normalization and Residual Connections**: Stabilizes training and helps with gradient flow.
5. **Feed-Forward Neural Networks**: Applies the same feed-forward network to each position independently, further transforming the representation produced by the attention layer.

Let's explore each of these components in detail and understand how they work together to create powerful language models.

## Self-Attention Mechanism

The self-attention mechanism is the core component of the Transformer architecture. It allows the model to dynamically assign different levels of importance to different words in a sentence when encoding a particular word.

### How Self-Attention Works:

1. **Input Embeddings**: Before we can apply self-attention, the input words must first be converted into embeddings (dense vector representations).
2. **Query, Key, and Value Vectors**: For each word, the model creates three vectors by multiplying the word's embedding with learned weight matrices: a Query vector (Q), a Key vector (K), and a Value vector (V).
3. **Attention Scores**: The attention score is the dot product of a word's Query vector with the Key vectors of all words, divided by the square root of the key dimension. This score determines how much focus is placed on each of the other words.
4. **Softmax Normalization**: The scores are passed through a softmax function so that, for each word, the attention weights are positive and sum to 1.
5. **Weighted Sum**: Each word's output representation is computed as a weighted sum of the Value vectors, where the weights are the normalized attention scores.
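
The steps above can be traced by hand for a single query. The following is a minimal sketch in plain Python; the 2-dimensional `query`, `keys`, and `values` are made-up numbers chosen for readability, not learned weights.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vectors (illustrative values only)
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

d_k = len(query)

# Dot product of the query with every key, scaled by sqrt(d_k)
scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]

# Softmax turns the scores into weights that sum to 1
weights = softmax(scores)

# Weighted sum of the value vectors, one output dimension at a time
output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

print("scores :", scores)
print("weights:", weights)
print("output :", output)
```

The first and third keys have the same dot product with the query, so they receive equal weight, while the second key, which is orthogonal to the query, receives the least.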

### Visualizing Self-Attention

Let's visualize how the self-attention mechanism works for a simple sentence.


In [1]:
# Import required libraries
import torch
import torch.nn.functional as F
import math

In [2]:
# Example sentence and tokens
sentence = "Transformers are revolutionary in NLP."
tokens = ["Transformers", "are", "revolutionary", "in", "NLP"]

In [3]:
# Embedding dimension
embedding_dim = 8

# Random input embeddings (for illustration purposes)
torch.manual_seed(42)
input_embeddings = torch.randn(len(tokens), embedding_dim)

In [4]:
# In this example we end up with a 5x8 matrix
input_embeddings

tensor([[ 1.9269,  1.4873,  0.9007, -2.1055,  0.6784, -1.2345, -0.0431, -1.6047],
        [-0.7521,  1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688,  0.7624],
        [ 1.6423, -0.1596, -0.4974,  0.4396, -0.7581,  1.0783,  0.8008,  1.6806],
        [ 0.0349,  0.3211,  1.5736, -0.8455,  1.3123,  0.6872, -1.0892, -0.3553],
        [-1.4181,  0.8963,  0.0499,  2.2667,  1.1790, -0.4345, -1.3864, -1.2862]])

In [5]:
# Initialize Query, Key, and Value weight matrices
Q = torch.randn(embedding_dim, embedding_dim)
K = torch.randn(embedding_dim, embedding_dim)
V = torch.randn(embedding_dim, embedding_dim)

In [6]:
# Compute Query, Key, and Value vectors
queries = input_embeddings @ Q
keys = input_embeddings @ K
values = input_embeddings @ V

In [7]:
# Calculate attention scores using dot product of queries and keys
attention_scores = queries @ keys.T

# Scale by sqrt(d_k) and apply softmax so that each row of weights sums to 1
attention_weights = F.softmax(attention_scores / math.sqrt(embedding_dim), dim=-1)

In [8]:
# Compute the weighted sum of values
output = attention_weights @ values

# Display the attention weights
print("Attention Weights:\n", attention_weights)

Attention Weights:
 tensor([[9.1079e-01, 4.6710e-03, 4.2964e-08, 6.3779e-02, 2.0756e-02],
        [5.4914e-05, 7.8533e-02, 3.1973e-10, 5.2326e-02, 8.6909e-01],
        [1.0379e-02, 9.8962e-01, 7.4063e-10, 8.1366e-09, 2.0209e-14],
        [7.0323e-01, 9.9272e-02, 7.0980e-08, 1.2708e-01, 7.0416e-02],
        [4.7277e-11, 7.3541e-17, 5.1235e-01, 1.4105e-05, 4.8764e-01]])
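
Several rows above put almost all of their weight on a single token. That is plausible here because the projection matrices are drawn from `torch.randn`, which tends to produce large scores, and large scores push the softmax toward a one-hot distribution. The sketch below illustrates why the scores are divided by the square root of the key dimension. The dimension `d_k = 64` and the sample count are arbitrary choices for the illustration, not values from this notebook.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 64              # illustrative key dimension (assumption)
num_samples = 10_000  # number of random query/key pairs

# Independent unit-variance queries and keys
q = torch.randn(num_samples, d_k)
k = torch.randn(num_samples, d_k)

raw_scores = (q * k).sum(dim=-1)              # one dot product per row
scaled_scores = raw_scores / math.sqrt(d_k)

raw_std = raw_scores.std().item()
scaled_std = scaled_scores.std().item()
print(f"std of raw dot products:    {raw_std:.2f}")     # roughly sqrt(d_k)
print(f"std of scaled dot products: {scaled_std:.2f}")  # roughly 1

# Larger scores make the softmax more peaked
logits = raw_scores[:5]
peak_raw = F.softmax(logits, dim=-1).max().item()
peak_scaled = F.softmax(logits / math.sqrt(d_k), dim=-1).max().item()
print(f"largest weight without scaling: {peak_raw:.3f}")
print(f"largest weight with scaling:    {peak_scaled:.3f}")
```

Dividing the logits by a constant greater than 1 can only lower the largest softmax weight, so the scaled version is never more peaked than the unscaled one.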


In [9]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

In [10]:
model_id = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = BertModel.from_pretrained(model_id)



In [11]:
text = "The woman at the bus stop looked really cheerful."

In [12]:
show(model, "bert", tokenizer, text, display_mode="light", layer=11, head=8)


## Multi-Head Attention

The multi-head attention mechanism allows the model to focus on different parts of the input simultaneously. Instead of having a single attention mechanism, the model uses multiple attention "heads" in parallel. Each head can learn different aspects of the input.

### How Multi-Head Attention Works:

1. The input is projected into multiple sets of Query, Key, and Value vectors.
2. Each set of vectors is processed independently through a self-attention mechanism.
3. The outputs from each head are concatenated and projected back into a single vector space.

This approach provides the model with a richer understanding of the input by capturing different types of relationships between words.

### Example: Multi-Head Attention
Let's compute multi-head attention for the example sentence using two attention heads.


In [13]:
# Number of attention heads
num_heads = 2

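# Initialize weight matrices for each head. Each head keeps the full embedding_dim here
# for simplicity; Transformers typically use embedding_dim // num_heads per head.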
Q_heads = [torch.randn(embedding_dim, embedding_dim) for _ in range(num_heads)]
K_heads = [torch.randn(embedding_dim, embedding_dim) for _ in range(num_heads)]
V_heads = [torch.randn(embedding_dim, embedding_dim) for _ in range(num_heads)]

In [14]:
# Compute outputs for each head
head_outputs = []
for i in range(num_heads):
    queries = input_embeddings @ Q_heads[i]
    keys = input_embeddings @ K_heads[i]
    values = input_embeddings @ V_heads[i]
    
    # Calculate attention scores and apply softmax
    attention_scores = queries @ keys.T
    attention_weights = F.softmax(attention_scores/math.sqrt(embedding_dim), dim=-1)
    
    # Compute the weighted sum of values
    output = attention_weights @ values
    head_outputs.append(output)

In [15]:
# Concatenate outputs from all heads
multi_head_output = torch.cat(head_outputs, dim=-1)

# Display multi-head attention output
print("Multi-Head Attention Output:\n", multi_head_output)

Multi-Head Attention Output:
 tensor([[ -3.0596,  -3.4943,  -0.6937,  -0.1939,   0.3010,   1.9820,  -3.0086,
          -0.6743, -10.2718,  -0.5067,  -4.1114,   2.5774,  -2.8795,  -3.0448,
         -10.0398,   0.6656],
        [ -0.0516,  -0.4984,  -1.3781,   1.0615,   0.7248,  -1.3815,  -2.2232,
           1.0491,   3.7755,  -0.5244,   2.7176,  -1.2325,   0.8634,   0.3360,
           7.6904,  -0.4856],
        [ -4.3441,  -4.7985,  -0.4966,  -0.8124,   0.2124,   3.3268,  -3.4841,
          -1.2002,   4.8415,   3.3390,   3.1862,   1.6175,  -1.3225,  -0.7895,
          -1.0330,  -5.3892],
        [ -3.9516,  -4.4064,  -0.5869,  -0.6455,   0.2645,   2.8903,  -3.3741,
          -0.9833, -10.2318,  -0.5344,  -4.0891,   2.5622,  -2.8930,  -3.0416,
          -9.9665,   0.6524],
        [ -4.3439,  -4.7978,  -0.4959,  -0.8122,   0.2119,   3.3264,  -3.4834,
          -1.2005,   4.8428,   3.3387,   3.1870,   1.6175,  -1.3236,  -0.7897,
          -1.0323,  -5.3906]])
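
The cell above stops at concatenation, so its output is twice as wide as the input. In the formulation by Vaswani et al., each head works in a smaller subspace of size `embedding_dim // num_heads`, and a final linear layer projects the concatenated heads back to the model dimension. A minimal sketch of that variant follows; the names `W_q`, `W_k`, `W_v`, `W_o`, and `split_heads` are illustrative, the weights are randomly initialized, and `x` is a random stand-in for `input_embeddings`.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, embedding_dim, num_heads = 5, 8, 2
head_dim = embedding_dim // num_heads      # 4 dimensions per head

x = torch.randn(seq_len, embedding_dim)    # stand-in for input_embeddings

# One projection per role; the heads are slices of its output
W_q = torch.nn.Linear(embedding_dim, embedding_dim, bias=False)
W_k = torch.nn.Linear(embedding_dim, embedding_dim, bias=False)
W_v = torch.nn.Linear(embedding_dim, embedding_dim, bias=False)
W_o = torch.nn.Linear(embedding_dim, embedding_dim, bias=False)  # output projection

def split_heads(t):
    # (seq_len, embedding_dim) -> (num_heads, seq_len, head_dim)
    return t.view(seq_len, num_heads, head_dim).transpose(0, 1)

q, k, v = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))

# Scaled dot-product attention, computed for all heads at once
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)   # (num_heads, seq_len, seq_len)
weights = F.softmax(scores, dim=-1)
per_head = weights @ v                                    # (num_heads, seq_len, head_dim)

# Concatenate the heads, then project back to the model dimension
concatenated = per_head.transpose(0, 1).reshape(seq_len, embedding_dim)
projected = W_o(concatenated)

print(concatenated.shape, projected.shape)
```

Splitting the dimension across heads keeps the total cost close to that of a single full-width head, and the output keeps the same shape as the input, which is what makes the residual connection possible.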


In [16]:
from bertviz import head_view
from transformers import AutoModel

In [17]:
model = AutoModel.from_pretrained(model_id, output_attentions=True)

In [18]:
sentence_1 = "time flies like an arrow"
sentence_2 = "fruit flies like a banana"

In [19]:
viz_inputs = tokenizer(sentence_1, sentence_2, return_tensors="pt")
attention = model(**viz_inputs).attentions
sentence_2_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])

In [20]:
head_view(attention, tokens, sentence_2_start, heads=[8])


## Feed-Forward Neural Networks

Each position's output from the multi-head attention mechanism is passed through a point-wise feed-forward neural network. This consists of two linear transformations with a ReLU activation in between.

### Example: Feed-Forward Network
Let's implement a simple feed-forward network.

In [21]:
# Define feed-forward neural network
class FeedForwardNN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.linear1 = torch.nn.Linear(input_dim, hidden_dim)
        self.relu = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(hidden_dim, input_dim)
    
    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

In [22]:
# Instantiate and apply the feed-forward network
ffn = FeedForwardNN(input_dim=embedding_dim * num_heads, hidden_dim=32)
ffn_output = ffn(multi_head_output)

# Display the feed-forward network output
print("Feed-Forward Network Output:\n", ffn_output)

Feed-Forward Network Output:
 tensor([[-0.5536, -0.1693, -1.7012,  0.7094, -0.8402,  0.1154,  0.3687,  0.8552,
          1.8841,  1.0963, -1.7729,  0.4918,  0.3292,  0.7776, -0.3585,  0.6985],
        [ 1.2748,  0.2045,  0.1567, -0.1085, -0.0122, -0.0775, -0.0473, -0.2675,
          0.4364, -0.2484, -0.6444, -0.1295, -0.9991, -1.3023,  0.2932, -0.2305],
        [-0.5991,  0.4602, -0.0433, -1.2597, -0.6919,  0.8034, -0.6563,  0.4739,
          1.2566,  1.0521, -1.1516,  0.9202,  0.6530, -0.9000, -1.4285,  0.8272],
        [-0.6001, -0.1881, -1.7623,  0.6600, -1.0057,  0.1321,  0.2874,  0.7446,
          1.9263,  1.1751, -1.8250,  0.6353,  0.5441,  0.7106, -0.4425,  0.7283],
        [-0.5994,  0.4602, -0.0434, -1.2599, -0.6919,  0.8033, -0.6564,  0.4738,
          1.2567,  1.0522, -1.1517,  0.9202,  0.6529, -0.9002, -1.4285,  0.8271]],
       grad_fn=<AddmmBackward0>)
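
To make "point-wise" concrete, the sketch below checks that running the network on the whole sequence gives the same result as running it on each position separately. It rebuilds the same two-layer structure with `torch.nn.Sequential` so that it is self-contained; `ffn_sketch` and the random input `x` are illustrative stand-ins, not the objects defined above.

```python
import torch

torch.manual_seed(0)
model_dim, hidden_dim, seq_len = 16, 32, 5

# Same two-layer structure as FeedForwardNN, written with nn.Sequential
ffn_sketch = torch.nn.Sequential(
    torch.nn.Linear(model_dim, hidden_dim),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden_dim, model_dim),
)

x = torch.randn(seq_len, model_dim)   # stand-in for multi_head_output

with torch.no_grad():
    all_at_once = ffn_sketch(x)
    # Apply the network to each position separately, then stack the rows
    one_by_one = torch.stack([ffn_sketch(x[i]) for i in range(seq_len)])

same = torch.allclose(all_at_once, one_by_one, atol=1e-5)
print("Identical results:", same)
```

Because no information moves between positions in this layer, all mixing across tokens happens in the attention layer.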


## Layer Normalization and Residual Connections

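Layer normalization stabilizes training by normalizing each position's feature vector to zero mean and unit variance, followed by a learned scale and shift. Residual connections add a sub-layer's input to its output, which helps gradients flow through the network and enables deeper architectures.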
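
As a rough check of what layer normalization computes, the sketch below normalizes each row by hand and compares the result with `torch.nn.LayerNorm`. The tensor `x` and the name `reference_norm` are placeholders for illustration, not part of the notebook's pipeline.

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 16)   # (positions, features), stand-in values

# Freshly created LayerNorm: learnable scale starts at 1 and shift at 0
reference_norm = torch.nn.LayerNorm(16)

# Manual version: statistics are taken over the feature dimension of each row
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + reference_norm.eps)

with torch.no_grad():
    builtin = reference_norm(x)

matches = torch.allclose(manual, builtin, atol=1e-5)
print("Matches nn.LayerNorm:", matches)
print("Per-row mean:", manual.mean(dim=-1))
print("Per-row std: ", manual.std(dim=-1, unbiased=False))
```

Note that the statistics are computed per position across features, not across the batch or the sequence.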

### Example: Adding Layer Normalization and Residual Connections
Let's see how these components are added to the Transformer block.


In [23]:
# Define Layer Normalization
layer_norm = torch.nn.LayerNorm(embedding_dim * num_heads)

# Add residual connection and apply layer normalization
residual_output = layer_norm(multi_head_output + ffn_output)

# Display the final output with residual connection
print("Output with Residual Connections:\n", residual_output)

Output with Residual Connections:
 tensor([[-0.4056, -0.4196, -0.0661,  0.7449,  0.4510,  1.1857, -0.1344,  0.6516,
         -1.7360,  0.7655, -1.0384,  1.4565, -0.1094, -0.0305, -2.2963,  0.9813],
        [ 0.2745, -0.3407, -0.7169,  0.1649,  0.0674, -0.8131, -1.1423,  0.0954,
          1.4864, -0.5349,  0.6191, -0.7738, -0.2766, -0.6134,  3.0158, -0.5119],
        [-1.3482, -1.1661, -0.0227, -0.4839, -0.0045,  1.3831, -1.1065, -0.0789,
          1.9755,  1.4616,  0.7523,  0.9037, -0.0617, -0.3688, -0.6012, -1.2335],
        [-0.6123, -0.6239, -0.0164,  0.6230,  0.4186,  1.4367, -0.2160,  0.5545,
         -1.6278,  0.7924, -0.9808,  1.4841, -0.0164, -0.0115, -2.1969,  0.9926],
        [-1.3482, -1.1659, -0.0226, -0.4839, -0.0047,  1.3829, -1.1063, -0.0789,
          1.9758,  1.4615,  0.7524,  0.9037, -0.0621, -0.3689, -0.6009, -1.2339]],
       grad_fn=<NativeLayerNormBackward0>)
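
The cell above applies a single add-and-normalize step. In the encoder layer described by Vaswani et al., the same pattern appears twice: once around the attention sub-layer and once around the feed-forward sub-layer. A minimal sketch of that arrangement follows, assuming PyTorch's built-in `torch.nn.MultiheadAttention` for the attention step. The class name `EncoderBlockSketch` and its default sizes are illustrative choices, and dropout is omitted for brevity.

```python
import torch

class EncoderBlockSketch(torch.nn.Module):
    def __init__(self, model_dim=8, num_heads=2, hidden_dim=32):
        super().__init__()
        self.attention = torch.nn.MultiheadAttention(model_dim, num_heads, batch_first=True)
        self.norm1 = torch.nn.LayerNorm(model_dim)
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(model_dim, hidden_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(hidden_dim, model_dim),
        )
        self.norm2 = torch.nn.LayerNorm(model_dim)

    def forward(self, x):
        # Sub-layer 1: self-attention, residual connection, layer norm
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: feed-forward, residual connection, layer norm
        return self.norm2(x + self.ffn(x))

torch.manual_seed(0)
block = EncoderBlockSketch()
x = torch.randn(1, 5, 8)   # (batch, seq_len, model_dim), random stand-in input
out = block(x)
print(out.shape)
```

The output has the same shape as the input, so blocks of this kind can be stacked one after another.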


## Positional Encoding

Since self-attention has no built-in notion of word order, positional encodings are added to the input embeddings to give the model information about each word's position in the sequence.

### Example: Implementing Positional Encoding
Let's implement positional encoding for a sequence of words.


In [None]:
import numpy as np

def positional_encoding(seq_len, model_dim):
    # Sinusoidal encoding; assumes model_dim is even
    pos_enc = np.zeros((seq_len, model_dim))
    for pos in range(seq_len):
        for i in range(0, model_dim, 2):
            # i already steps over the even indices (i = 2k), so the exponent is i / model_dim
            angle = pos / (10000 ** (i / model_dim))
            pos_enc[pos, i] = np.sin(angle)
            pos_enc[pos, i + 1] = np.cos(angle)
    return torch.tensor(pos_enc, dtype=torch.float)

In [None]:
# Apply positional encoding
position_encodings = positional_encoding(len(tokens), embedding_dim)

# Add positional encoding to input embeddings
encoded_input = input_embeddings + position_encodings

print("Positional Encodings:\n", position_encodings)
print("Encoded Input with Positional Information:\n", encoded_input)
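
For longer sequences, the double loop can be replaced by tensor operations. The following is a vectorized sketch that follows the formulation in Vaswani et al., where the dimension pair (2k, 2k+1) uses the frequency 1 / 10000^(2k / model_dim). The function name `sinusoidal_encoding` is illustrative, and an even `model_dim` is assumed.

```python
import math
import torch

def sinusoidal_encoding(seq_len, model_dim):
    """Vectorized sinusoidal positional encoding (model_dim assumed even)."""
    positions = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)   # (seq_len, 1)
    even_dims = torch.arange(0, model_dim, 2, dtype=torch.float)        # 0, 2, 4, ...
    # 1 / 10000^(2k / model_dim), where even_dims holds the values 2k
    inv_freq = torch.exp(-math.log(10000.0) * even_dims / model_dim)    # (model_dim / 2,)

    encoding = torch.zeros(seq_len, model_dim)
    encoding[:, 0::2] = torch.sin(positions * inv_freq)   # even columns
    encoding[:, 1::2] = torch.cos(positions * inv_freq)   # odd columns
    return encoding

pe = sinusoidal_encoding(seq_len=5, model_dim=8)
print(pe[0])      # position 0: sin(0) in even columns, cos(0) in odd columns
print(pe[:, 0])   # first column is sin(pos) for pos = 0..4
```

Each pair of columns oscillates at a different frequency, so every position receives a distinct pattern across the embedding dimensions.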

## Conclusion

In this notebook, we explored the key components of the Transformer architecture: self-attention, multi-head attention, feed-forward networks, layer normalization with residual connections, and positional encoding. These components work together to form the basis of modern NLP models.

In [None]:
# Shut down the kernel to release memory
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)