This script is an example of how to make tokens communicate with previous tokens. It is important that they do not communicate with future tokens.

The easiest way to make them communicate is to average the channels of the previous tokens. This creates a feature vector that summarizes the previous tokens. This averaging is the simplest form of the idea behind self-attention.

Averaging the channels is very lossy, as it ignores the order in which the tokens appear.

The point here is that, for every sequence in the batch and every position t, we want to average the channels of all tokens up to and including t.


B is the batch size. This is how many independent sequences we process in parallel.

T is the number of time steps. This determines how many tokens are in each sequence.

C is the number of features in each time step. This is the resolution for our encoding. Setting this to 2 would mean that each token is represented as a 2D feature vector.

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)  # seeded randomness
B, T, C = 4, 8, 2  # batch size, time steps, channels (features per token)
x = torch.randn(B, T, C)  # random input
print(x.shape)

# Running average over the current and all previous tokens, one position at a time
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xprev = x[b, :t + 1]  # (t + 1, C): tokens 0..t of sequence b
        xbow[b, t] = torch.mean(xprev, dim=0)
print(x[0], "\n")
print(xbow[0])


This can be done more efficiently with matrices by building a lower triangular matrix, a, and multiplying it with a second matrix, b. If each row of a is normalized to sum to 1, the product is a running average over the rows of b.

![alt text](https://algebra1course.wordpress.com/wp-content/uploads/2013/02/slide10.jpg)

In [3]:
# example
a = torch.tril(torch.ones((5,5)))
a = a / a.sum(dim=1, keepdim=True)
b = torch.randn(5, 5)
c = a @ b  # matrix multiplication

print(a)
print("----------------")
print(b)
print("----------------")
print(c)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000]])
----------------
tensor([[-0.8345,  0.5978, -0.0514, -0.0646, -0.4970],
        [ 0.4658, -0.2573, -1.0673,  2.0089, -0.9665],
        [ 0.3583,  0.1073,  1.2463,  1.2460,  0.3534],
        [ 0.9425, -1.6669, -0.7960,  0.1298, -1.9446],
        [ 0.0610, -0.2379,  1.9020, -1.1763, -0.1772]])
----------------
tensor([[-0.8345,  0.5978, -0.0514, -0.0646, -0.4970],
        [-0.1844,  0.1703, -0.5593,  0.9722, -0.7318],
        [-0.0035,  0.1493,  0.0426,  1.0635, -0.3700],
        [ 0.2330, -0.3048, -0.1671,  0.8300, -0.7637],
        [ 0.1986, -0.2914,  0.2467,  0.4288, -0.6464]])


In [5]:
# Average normalization

# `a` from the previous cell plays the role of the attention weights
# `b` from the previous cell plays the role of the (B, T, C) tensor

weights = torch.tril(torch.ones((T, T)))  # size equal to number of tokens in a sequence
weights = weights / weights.sum(dim=-1, keepdim=True)  # normalize each row to sum to 1
xbow = weights @ x  # (T, T) @ (B, T, C) -> (B, T, C), weights are broadcast over the batch

A better way of implementing this is to use softmax as the normalization. Here we start from weights of 0 and set every position where the lower triangular matrix is 0 (the future tokens) to '-inf'. Applying softmax row by row then gives the same weight matrix as before.

This method allows tokens to decide which other tokens from the past they want to communicate with, once the zeros are replaced by data-dependent values. This is explained as affinity in the lecture.

In [8]:
tril = torch.tril(torch.ones((T,T)))
weights = torch.zeros((T,T)) # initialize weights at 0
weights = weights.masked_fill(tril == 0, float("-inf"))
weights = F.softmax(weights, dim=-1)  # softmax over the last dimension, so each row sums to 1
xbow = weights @ x  # (T, T) @ (B, T, C) -> (B, T, C), weights are broadcast over the batch
xbow[0]

tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])
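As a quick sanity check, the three versions above (explicit loop, normalized lower triangular matrix, masked softmax) can be compared directly. This is a minimal sketch that rebuilds the same toy input; the variable names `xbow_loop`, `xbow_matmul` and `xbow_softmax` and the tolerance are choices made for this example only.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 2  # same toy shapes as above
x = torch.randn(B, T, C)

# Version 1: explicit loops over batch and time
xbow_loop = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow_loop[b, t] = x[b, : t + 1].mean(dim=0)

# Version 2: row-normalized lower triangular matrix
tril = torch.tril(torch.ones(T, T))
xbow_matmul = (tril / tril.sum(dim=-1, keepdim=True)) @ x

# Version 3: fill future positions with -inf, then softmax each row
weights = torch.zeros(T, T).masked_fill(tril == 0, float("-inf"))
xbow_softmax = F.softmax(weights, dim=-1) @ x

# Compare with a small tolerance, since the three versions round differently
loop_vs_matmul = torch.allclose(xbow_loop, xbow_matmul, atol=1e-5)
loop_vs_softmax = torch.allclose(xbow_loop, xbow_softmax, atol=1e-5)
print(loop_vs_matmul, loop_vs_softmax)
```

The tolerance is there because floating point rounding differs slightly between a mean and a weighted sum.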

### Finally: self-attention implementation
From Andrej Karpathy's video

Self-attention lets the current token choose which other tokens to listen to, and how much (it controls the affinity), by giving every token a query and a key.

The key holds information about what a token contains.
The query holds information about what the token is looking for.
A high alignment between a query and a key leads to a high value when taking their dot product. This is self-attention.
The difference from the averaging above is that the weights are no longer fixed, uniform values that sum to 1, but data-driven weights that are normalized to sum to 1.
This can be seen especially in the last row of the weights matrix. This is the 8th token, meaning it has as much context as possible within a single sequence. The values in the matrix represent the affinity strength: how well the key of the token at that position matches the query of the current token.
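The difference between uniform and data-driven weights can be made concrete with a small sketch. The sizes below are hypothetical toy values chosen for readability, and the layers are randomly initialized, so the exact numbers carry no meaning. Only the pattern does: non-uniform rows that still sum to 1, with zeros for future tokens.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
T, C, head_size = 4, 8, 4  # hypothetical toy sizes
x = torch.randn(1, T, C)  # a single sequence

key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)

k, q = key(x), query(x)  # each (1, T, head_size)
scores = q @ k.transpose(-2, -1) * head_size ** -0.5  # (1, T, T) query-key affinities

tril = torch.tril(torch.ones(T, T))
scores = scores.masked_fill(tril == 0, float("-inf"))  # hide future tokens
weights = F.softmax(scores, dim=-1)  # each row is normalized to sum to 1

print(weights[0])  # rows are no longer uniform
row_sums = weights.sum(dim=-1)  # (1, T)
future_weight = weights[0][tril == 0]  # weights assigned to future tokens
```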

For this example it will be implemented for a single head.

#### Notes from video
Attention is a communication mechanism that acts on nodes in a directed manner. The nodes can follow any structure you want, but there must be defined rules for which node can talk to which. In NLP the structure often grows linearly like below, where each token listens to itself and every earlier token. Unlike convolutions, attention has no built-in notion of space. Attention simply acts on a set of vectors with private information, which communicate based on how well their keys align with the others' queries. Positional information can and will be added to the model through a positional embedding table.
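A positional embedding table could look roughly like the sketch below. The names `token_embedding_table` and `position_embedding_table`, and the sizes `vocab_size`, `block_size` and `n_embd`, are assumptions made for this illustration and do not appear in the notes above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, n_embd = 65, 8, 32  # hypothetical sizes

token_embedding_table = nn.Embedding(vocab_size, n_embd)  # what each token is
position_embedding_table = nn.Embedding(block_size, n_embd)  # where each token sits

idx = torch.randint(0, vocab_size, (4, block_size))  # (B, T) random token ids
B, T = idx.shape

tok_emb = token_embedding_table(idx)  # (B, T, n_embd)
pos_emb = position_embedding_table(torch.arange(T))  # (T, n_embd)
x = tok_emb + pos_emb  # (B, T, n_embd), positions are broadcast over the batch
print(x.shape)
```

Because the same position vectors are added to every sequence, two identical tokens at different positions get different inputs to attention.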

Another important note is that the examples in a batch are completely separate. Although they are grouped together by the same data loader, each example should be seen individually, processed in parallel.
For *sentiment analysis* it may be okay for all nodes to talk to each other, as it is no longer about predicting the future. This is implemented by simply removing the mask.

A head with a mask is called a *decoder block*
A head without a mask is called an *encoder block*
![image.png](self-attention.png)
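The effect of the mask can be sketched by computing the attention weights once with and once without it. The helper `attention_weights` and its `causal` flag are hypothetical names introduced for this example, and the toy sizes are arbitrary.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C, head_size = 2, 4, 8, 4  # hypothetical toy sizes
x = torch.randn(B, T, C)
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)


def attention_weights(x, causal):
    """Hypothetical helper: return (B, T, T) attention weights, masked if causal."""
    T = x.shape[1]
    scores = query(x) @ key(x).transpose(-2, -1) * head_size ** -0.5
    if causal:
        tril = torch.tril(torch.ones(T, T))
        scores = scores.masked_fill(tril == 0, float("-inf"))  # block future tokens
    return F.softmax(scores, dim=-1)


decoder_w = attention_weights(x, causal=True)  # masked: past and present only
encoder_w = attention_weights(x, causal=False)  # unmasked: every token sees every token
print(decoder_w[0])
print(encoder_w[0])
```

In the masked case the upper triangle of each weight matrix is zero. In the unmasked case every entry is positive, so the first token can already attend to the last one.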

In [16]:
torch.manual_seed(1337)  # seeded randomness
B, T, C = 4, 8, 32  # batch size, time steps, channels (features per token)
x = torch.randn(B,T,C) # random input


head_size = 16
key = nn.Linear(C, head_size, bias=False)  # projects a token to its key: what it contains
query = nn.Linear(C, head_size, bias=False)  # projects a token to its query: what it is looking for
value = nn.Linear(C, head_size, bias=False)  # projects a token to its value: what it passes on when attended to
k = key(x)  # (B, T, C) -> (B, T, head_size)
q = query(x)  # (B, T, C) -> (B, T, head_size)
v = value(x)  # (B, T, C) -> (B, T, head_size)
weights = q @ k.transpose(1, 2) * head_size ** -0.5  # (B, T, head_size) @ (B, head_size, T), scaled by 1/sqrt(head_size) -> (B, T, T)


tril = torch.tril(torch.ones((T,T)))

#TODO try removing this mask for sentiment analysis
weights = weights.masked_fill(tril == 0, float("-inf"))  # mask out the upper triangular part to prevent attending to future tokens
weights = F.softmax(weights, dim=-1)  # softmax over the last dimension, so each row sums to 1
out = weights @ v  # (B, T, T) @ (B, T, head_size) = (B, T, head_size)

out[0]


tensor([[-1.5713e-01,  8.8009e-01,  1.6152e-01, -7.8239e-01, -1.4289e-01,
          7.4676e-01,  1.0068e-01, -5.2395e-01, -8.8726e-01,  1.9068e-01,
          1.7616e-01, -5.9426e-01, -4.8124e-01, -4.8598e-01,  2.8623e-01,
          5.7099e-01],
        [ 4.3974e-01, -1.4227e-01, -1.3157e-01,  2.8895e-03, -1.3222e-01,
          6.6082e-04, -2.7904e-01, -2.2676e-01, -2.8723e-01,  5.7456e-01,
          5.6053e-01, -2.5208e-01,  9.7243e-02,  1.0771e-01,  3.0455e-02,
          1.0727e+00],
        [ 4.3615e-01, -6.6358e-02, -2.9296e-01,  7.4315e-02,  5.4381e-02,
         -7.0388e-02, -6.8984e-02, -8.2153e-02, -2.9377e-01, -5.8952e-02,
          3.5887e-01, -2.3087e-03, -1.8212e-01, -3.6142e-02, -6.7189e-02,
          1.1412e+00],
        [ 4.2069e-01, -1.0619e-01, -2.9984e-01,  5.2820e-02,  2.0077e-01,
         -1.6048e-01, -3.5710e-02, -8.3110e-02, -1.7919e-01,  7.7992e-02,
          1.2719e-01,  2.2611e-02, -5.1811e-02,  7.4466e-02,  1.8131e-01,
          8.4463e-01],
        [ 3.9499e-01

### Multiheaded attention
Multihead attention simply means creating several attention heads and concatenating their outputs. Typically you want to keep the same dimensionality, so you divide the embedding size by the number of heads to get the head_size of each head. The concatenated output then has the original size again, and the benefit is that each head is initialized independently and the heads can run in parallel.
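A minimal sketch of this idea, reusing the single-head computation from above, could look as follows. The class names `Head` and `MultiHeadAttention` and the parameters `n_embd`, `num_heads` and `block_size` are assumptions made for this example. The sketch leaves out refinements such as an output projection or dropout, and it loops over the heads for clarity instead of running them in one batched operation.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F


class Head(nn.Module):
    """One masked self-attention head, mirroring the single-head cell above."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Buffer, not a parameter: the mask is fixed and never trained
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)  # each (B, T, head_size)
        weights = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5  # (B, T, T)
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        weights = F.softmax(weights, dim=-1)
        return weights @ v  # (B, T, head_size)


class MultiHeadAttention(nn.Module):
    """Several independent heads whose outputs are concatenated on the channel axis."""

    def __init__(self, n_embd, num_heads, block_size):
        super().__init__()
        assert n_embd % num_heads == 0, "n_embd must be divisible by num_heads"
        head_size = n_embd // num_heads
        self.heads = nn.ModuleList(
            [Head(n_embd, head_size, block_size) for _ in range(num_heads)]
        )

    def forward(self, x):
        return torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, n_embd)


torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)
mha = MultiHeadAttention(n_embd=C, num_heads=4, block_size=T)
out = mha(x)
print(out.shape)
```

With 4 heads and 32 channels each head works with a head_size of 8, and the concatenation restores the 32 channels.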