In [1]:
import requests
import os
import torch
import torch.nn as nn
from torch.nn import functional as F

In [2]:
# URL for the Shakespeare dataset
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
file_path = "input.txt"

# Only download if the file doesn't exist yet
if not os.path.exists(file_path):
    print("Downloading dataset...")
    response = requests.get(url)
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    print("Download complete.")
else:
    print("File already exists.")

File already exists.


In [3]:
with open(file_path, 'r', encoding='utf-8') as f:
    text = f.read()

print(f"Length of dataset in characters: {len(text)}")
print(f"First 100 characters:\n{text[:100]}")

Length of dataset in characters: 1115394
First 100 characters:
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


The dataset contains about **1.1 million** characters (1,115,394).

In [4]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


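```set()``` takes any iterable, here the string ```text```, and keeps each distinct character once, dropping all the duplicates. A set has no guaranteed order, so the characters may come back in a different order each time you run it.
For example,
Hello - 'H','e','l','l','o' - the set keeps only one of the two 'l' characters

```list()``` turns the set into a list, and ```sorted()``` puts that list in a fixed order, so every character gets the same ID on every run (deterministic).

```.join()``` concatenates the characters, using the empty string ```''``` as the separator in our case.
For example,
'a','b','c' becomes 'abc'

This is exactly what we see in the result above: 65 unique characters, made up of the newline, the space, punctuation and special characters, the digit 3, and the upper- and lower-case letters.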
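The three steps can be seen one at a time on a toy string. This is a minimal sketch; the string `"hello"` and the `toy_` names are illustrative and not part of the notebook.

```python
# Toy illustration of set() -> sorted() -> ''.join() on a short string.
toy_text = "hello"

unique_chars = set(toy_text)       # duplicates removed; iteration order is not guaranteed
toy_chars = sorted(unique_chars)   # sorted() returns a list in a fixed, reproducible order
toy_vocab = ''.join(toy_chars)     # concatenate with an empty-string separator

print(toy_chars)       # ['e', 'h', 'l', 'o']
print(toy_vocab)       # ehlo
print(len(toy_chars))  # 4
```

Since ```sorted()``` accepts any iterable and already returns a list, ```sorted(set(text))``` would give the same result as ```sorted(list(set(text)))```.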

In [5]:
string_to_integer = {ch:i for i,ch in enumerate(chars)}
integer_to_string = {i:ch for i,ch in enumerate(chars)}
encode = lambda s: [string_to_integer[c] for c in s]
decode = lambda l: ''.join([integer_to_string[i] for i in l])

print(encode("Hello"))
print(decode(encode("Hello")))

[20, 43, 50, 50, 53]
Hello


This is a simple character-level encoding: every character gets an integer ID (for example a - 0, b - 1, c - 2), ```encode``` turns a string into a list of IDs, and ```decode``` turns the list back into the string.

There are other kinds of encoding, such as subword tokenizers. One example is the byte-pair encoding that GPT-2 uses, which is available through the tiktoken library.
For example,
instead of the 65 tokens in our case, the GPT-2 vocabulary has 50257 tokens. So an encoding of Hello would need fewer than 5 integers, but each integer would lie between 0 and 50256 rather than between 0 and 64.

There is a trade-off between vocabulary size and sequence length: a larger vocabulary gives shorter sequences of integers, and a smaller vocabulary gives longer ones.
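The trade-off can be made concrete with a toy tokenizer. The sketch below is not byte-pair encoding and not tiktoken; it is a simple greedy longest-match tokenizer, and the merged tokens `"peak"` and `"Speak"` are hypothetical additions chosen only to show the effect.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenisation against a list of token strings."""
    pieces = sorted(vocab, key=len, reverse=True)  # try the longest tokens first
    ids, pos = [], 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                ids.append(vocab.index(piece))
                pos += len(piece)
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return ids

sample = "Speak, speak."
char_vocab = sorted(set(sample))             # character-level vocabulary
big_vocab = char_vocab + ["peak", "Speak"]   # hypothetical multi-character tokens added on top

char_ids = tokenize(sample, char_vocab)
big_ids = tokenize(sample, big_vocab)

print(len(char_vocab), len(char_ids))  # 9 13
print(len(big_vocab), len(big_ids))    # 11 6
```

Adding two tokens to the vocabulary cut the sequence from 13 integers to 6, which is the same effect a subword vocabulary has at a much larger scale.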

In [9]:
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


Now the entire Shakespeare dataset is encoded with the simple encoding above and stored in a PyTorch tensor. The dtype is ```torch.long``` (64-bit integers) rather than float because these values are token IDs: they are used later as indices to look up rows in the embedding table.

For example, the list [10, 20, 30] becomes tensor([10, 20, 30]). In our case, shown above, the tensor is one very long sequence of characters in integer form.
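To see why integer IDs matter, here is a minimal sketch of an index lookup. The sizes (5 tokens, 3 numbers per token) and the `toy_` names are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical tiny setup: 5 possible tokens, each mapped to a vector of 3 numbers.
toy_ids = torch.tensor([0, 2, 4, 2], dtype=torch.long)   # integer token IDs
toy_table = nn.Embedding(num_embeddings=5, embedding_dim=3)

rows = toy_table(toy_ids)   # each ID selects one row of the table
print(toy_ids.dtype)        # torch.int64
print(rows.shape)           # torch.Size([4, 3])

# Positions 1 and 3 hold the same ID, so they select the same row.
print(torch.equal(rows[1], rows[3]))  # True
```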

In [10]:
n = int(0.9 * len(data)) # first 90% of the data is for training, the last 10% for validation (to check generalisation)
train_data = data[:n]
val_data = data[n:]

We cannot feed the entire dataset into the transformer at once; that would be computationally very expensive. Therefore, when training the transformer, only chunks of the dataset are passed through it at a time.

So small chunks are sampled at random from the training dataset and sent to the transformer for training. The maximum size of these chunks is what we call the **Block Size** or **Context Length**.

In [11]:
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

Say a block size of 8 is chosen. Then 8+1 characters are taken from the training dataset, here the first 9.

This chunk has multiple examples packed into it, because the characters follow each other.
For example,
given the context [18], 47 comes next; given the context [18, 47], 56 comes next; and so on. The 9 characters contain 8 such examples, which the next cell prints out.

In [None]:
x = train_data[:block_size]
y = train_data[1:block_size+1] # The block_size characters which is offset by 1
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"When input is {context} the target is: {target}")

When input is tensor([18]) the target is: 47
When input is tensor([18, 47]) the target is: 56
When input is tensor([18, 47, 56]) the target is: 57
When input is tensor([18, 47, 56, 57]) the target is: 58
When input is tensor([18, 47, 56, 57, 58]) the target is: 1
When input is tensor([18, 47, 56, 57, 58,  1]) the target is: 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]) the target is: 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target is: 58


When a chunk of block_size 8 is sampled from the training set, the examples shown above are the ones hidden inside it.

As you can see, training covers contexts as small as 1 and as large as block_size. This is useful at inference time, because the model can start generating from as little as 1 character of context and keep predicting for every context length up to block_size.
Beyond block_size the context has to be truncated, so the transformer never receives more than block_size inputs when predicting the next character.

The GPU should also be kept busy rather than sitting idle, since GPUs are great at parallel processing. So we build mini-batches: multiple chunks of text stacked up in a single tensor. Each of these chunks is processed completely independently. This is what we call the **batch dimension**.

- Block Size (T): The "Context." How many characters into the past the model sees (e.g., 8 characters).

- Batch Size (B): The "Stack." How many of these 8-character chunks we feed the GPU at once.

```
# A single tensor containing 4 independent chunks of Shakespeare
tensor([
    [18, 47, 56, 57, 58,  1, 15, 47],  # Chunk 1 (Random spot in the book)
    [ 1, 40, 43, 40, 43, 22,  0, 12],  # Chunk 2 (Another random spot)
    [56, 12,  1, 56, 12,  5,  0, 10],  # Chunk 3 (Another random spot)
    [12,  0,  5,  0, 12, 56, 43, 58]   # Chunk 4 (Another random spot)
])
```

The GPU processes all 4 rows simultaneously. It calculates the loss for every position in Row 1, Row 2, Row 3, and Row 4 at the same time, averages them, and then does one backward pass.

If you set your batch size too high, your GPU runs out of memory (OOM error) because it can't "hold" that many chunks at once. If you set it too low, your powerful GPU sits idle waiting for work.

In [15]:
torch.manual_seed(1337)
batch_size = 4 # This is how many independent sequences to be processed in parallel
block_size = 8 # Maximum context length for prediction

def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs: ')
print(xb.shape)
print(xb)
print('targets: ')
print(yb.shape)
print(yb)

print('--------')

for b in range(batch_size):
    for t in range(block_size):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"When input is {context.tolist()} the target is: {target}")

inputs: 
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets: 
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
--------
When input is [24] the target is: 43
When input is [24, 43] the target is: 58
When input is [24, 43, 58] the target is: 5
When input is [24, 43, 58, 5] the target is: 57
When input is [24, 43, 58, 5, 57] the target is: 1
When input is [24, 43, 58, 5, 57, 1] the target is: 46
When input is [24, 43, 58, 5, 57, 1, 46] the target is: 43
When input is [24, 43, 58, 5, 57, 1, 46, 43] the target is: 39
When input is [44] the target is: 53
When input is [44, 53] the target is: 56
When input is [44, 53, 56] the target is: 1
When input is [44, 53, 56, 1] the target is: 58
When input is [44, 53, 56, 1

```ix``` is where the random chunks are chosen, one per batch element. Since batch_size = 4, ```ix``` holds 4 randomly generated integers, each at least 0 and below len(data) - block_size, and each one is the start position of a chunk.

When the chunks are stacked they form a 4x8 tensor, as seen from ```print(xb)```, where each row is one chunk of the training set. The for loop at the bottom shows which target goes with each context. This gives 4 x 8 = 32 training examples per batch, and with this training can be done.
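A quick way to convince yourself that the targets are the inputs shifted by one position is to run the same sampling logic on stand-in data where token i simply has the value i. The `toy_` names and the seed are assumptions for this sketch.

```python
import torch

torch.manual_seed(0)
toy_data = torch.arange(100)   # stand-in data: the token at position i has value i
toy_block, toy_batch = 8, 4

# Same sampling logic as get_batch, applied to the stand-in data
ix = torch.randint(len(toy_data) - toy_block, (toy_batch,))
x = torch.stack([toy_data[i:i + toy_block] for i in ix])
y = torch.stack([toy_data[i + 1:i + toy_block + 1] for i in ix])

print(x.shape, y.shape)                  # torch.Size([4, 8]) torch.Size([4, 8])
print(torch.equal(y, x + 1))             # True: every target is the token right after its input
print(torch.equal(x[:, 1:], y[:, :-1]))  # True: y is x shifted left by one position
```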

In [None]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx) # Prediction of the next token

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            
            targets = targets.view(B*T) # From B x T to (B*T), Can use -1 if you want Pytorch to guess what you want
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            logits, loss = self(idx)
            logits = logits[:,-1,:] # Keep only the last time step, the prediction for the next token: (B, T, C) -> (B, C)
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
    
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

idx = torch.zeros((1, 1), dtype=torch.long)
print(decode(m.generate(idx = idx, max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.5262, grad_fn=<NllLossBackward0>)

; FXh&rszjnzQ'ItHc3N?Wg!FBdApAxrsK'I&ek
hCjHLHL-XdoSz?tBwcRHNHfbXbP.z&A?tsJWeCtyKSKoRMt?FFfpmJDQ zvt


This is interesting. ```xb``` holds many integers. We create an embedding table with one row per character using ```nn.Embedding```, and for every integer in ```xb``` we pluck out the row of the table with that index.

Thus, we get logits of shape 4x8x65.

The logits are the unnormalised scores for what comes next. In this bigram model each row of scores depends only on the identity of the current token; the model does not look at earlier tokens or at positions.

There is one problem: PyTorch's ```cross_entropy``` expects the class (channel) dimension to be the second dimension, so instead of 4 x 8 x 65 we reshape the logits to (4*8) x 65.

- B - **Batch size** processing B number of distinct chunks of text at once
- T - **Sequence Length** Each chunk has T number of characters or tokens
- C - **Vocab Size/ Channels** The total C number of unique tokens or characters

In forward, the input ```idx``` has shape (B, T) $\to$ (4, 8): one character ID per position in each chunk.
Logits - the embedding layer replaces every single integer with a row of C numbers, giving (B, T, C) $\to$ (4, 8, 65).

With a 2D input, PyTorch's cross_entropy function expects a table: [List of Examples, List of Classes]. Given our 3D (B, T, C) tensor it would treat T, not C, as the class dimension.
So we "flatten" the Batch and Time dimensions together. We stop caring which chunk a character came from; we just treat them all as independent predictions.
Combine $B$ and $T$ into a single dimension: $4 \times 8 = 32$.
Shape $(32, 65)$. We took the 4 separate "sheets" of predictions and stacked them into one long list of 32 predictions, which matches the printed ```torch.Size([32, 65])```.
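As a sanity check on the reshaping, the sketch below compares flattening with the alternative of moving the class dimension to position 1, using random stand-in logits and targets (the `toy_` names and the seed are assumptions). It also computes the loss a model would get if it spread its probability equally over all 65 characters.

```python
import math
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, T, C = 4, 8, 65
toy_logits = torch.randn(B, T, C)         # stand-in for the model's logits
toy_targets = torch.randint(C, (B, T))    # stand-in for the target IDs

# Option 1: flatten batch and time, as forward() does
loss_flat = F.cross_entropy(toy_logits.view(B * T, C), toy_targets.view(B * T))

# Option 2: keep three dimensions but move the class dimension to position 1
loss_perm = F.cross_entropy(toy_logits.permute(0, 2, 1), toy_targets)

print(torch.allclose(loss_flat, loss_perm))  # True

# Loss of a model that gives every character probability 1/65
print(math.log(C))  # about 4.174
```

The untrained model above printed a loss of 4.5262, a little worse than this uniform baseline, which is plausible because its randomly initialised logits are not uniform.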

The `generate` method (inference): this method creates new text, one character at a time.

`logits, loss = self(idx)`
* **Input `idx`:** Shape (B, T).
* Let's say we start with just 1 character per batch: (1, 1).
* **Output `logits`:** Shape (B, T, C) $\rightarrow$ (1, 1, 65).
* For every position in the sequence, the model gives a raw score (not yet a probability) for each of the 65 possible next characters.

`logits = logits[:, -1, :]` (the critical step)
* **The Goal:** We only care about predicting the **next** character. In this bigram model that depends strictly on the **last** character currently in the sequence. We don't care about the predictions for the characters that happened 5 steps ago; we already know what those are.
* **The Slicing:**
  * `:` (Keep all Batches)
  * `-1` (Take only the **last** timestep T)
  * `:` (Keep all Channels/Scores)
* **Output:** Shape (B, C) $\rightarrow$ (1, 65).
* We have removed the Time dimension. We now just have the scores for "What comes next?"

`probs = F.softmax(logits, dim=-1)`
* **Operation:** Convert raw scores (logits) into probabilities that sum to 1.
* **Shape:** Unchanged (B, C) $\rightarrow$ (1, 65).
* Example: `[0.1, 0.05, 0.8, ...]` (There is an 80% chance the next char is 'e').

`idx_next = torch.multinomial(probs, ...)`
* **Operation:** Roll the dice based on the probabilities to pick **1** winner.
* **Output:** Shape (B, 1) $\rightarrow$ (1, 1).
* This is the integer ID of the new character (e.g., the ID for 'e').

`idx = torch.cat((idx, idx_next), dim=1)`
* **Operation:** Glue the new character onto the end of the existing sequence.
* **Input:** `idx` (1, 1) and `idx_next` (1, 1).
* **Output:** Shape (1, 2).
* The sequence has grown! Now the loop repeats, but next time `logits` will look at this new character to predict the third one.
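The shape changes can be traced without a trained model. In this sketch random numbers stand in for the model's logits, so the sampled IDs are meaningless; only the shapes matter. The names `seq`, `fake_logits`, `last` and `nxt` are made up for the illustration.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
vocab = 65                                    # same vocabulary size as above
seq = torch.zeros((1, 1), dtype=torch.long)   # (B, T) = (1, 1)

for _ in range(3):
    # Stand-in for self(idx): random scores of shape (B, T, C)
    fake_logits = torch.randn(seq.shape[0], seq.shape[1], vocab)
    last = fake_logits[:, -1, :]                    # (B, C): scores at the final position only
    probs = F.softmax(last, dim=-1)                 # (B, C): each row sums to 1
    nxt = torch.multinomial(probs, num_samples=1)   # (B, 1): one sampled token ID per row
    seq = torch.cat((seq, nxt), dim=1)              # (B, T + 1)
    print(last.shape, nxt.shape, seq.shape)
```

After three iterations the sequence has grown from shape (1, 1) to (1, 4), one new column per loop.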

In [None]:
optimizer = torch.optim.Adam(m.parameters(), lr=1e-3) # Updating the weights using Adam Optimizer

In [24]:
batch_size = 32
for steps in range(10000):
    xb, yb = get_batch('train')

    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())

2.3125789165496826


In [25]:
print(decode(m.generate(idx = idx, max_new_tokens=100)[0].tolist()))


m lt igo
The the yme o blulesinotereder torll, S: enoulaimod G copicll thilid t is a Y:
NColy D:
Wis


In [15]:
# Mathematical trick in Self-Attention
torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In attention, tokens start talking to each other. We do not want a token to talk to future tokens, because the future is what we are trying to predict.
For example,
the 3rd token **SHOULD** talk to the 2nd and 1st tokens and **SHOULDN'T** talk to the 4th, 5th, 6th, etc. tokens.

The easiest way to do this is to take the average of the current token and all the preceding ones. Of course this isn't the best way, but let's start easy.
So for the 3rd token, take the channels of the 3rd, 2nd and 1st time steps and average them, which gives a feature vector that summarises the 3rd token in the context of its history.

In [16]:
xbow = torch.zeros((B, T, C)) # bow - Bag of words
for b in range(B):
    for t in range(T):
        xprev = x[b, :t+1] # the current token and everything before it: (t+1, C)
        xbow[b,t] = torch.mean(xprev, 0) # average over time -> 1-D vector of size C, stored in xbow

print(x[0])
print(xbow[0])

tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])
tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])


As you can see, the first row is unchanged, because it is the average of a single token. From the second row onwards each entry is a running average: -0.0894 is the average of 0.1808 and -0.3596, and every later row averages its own token with all the tokens before it.

In [8]:
torch.manual_seed(42)
a = torch.ones(3, 3)
b = torch.randint(0, 10, (3,2)).float()
c = a @ b # Simple Matrix multiplication
print('a=')
print(a)
print('-------')
print('b=')
print(b)
print('-------')
print('c=')
print(c)

a=
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
-------
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
-------
c=
tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])


In [10]:
a = torch.tril(torch.ones(3,3)) # Returns a lower triangular matrix of 1's
c = a @ b
print('a=')
print(a)
print('-------')
print('b=')
print(b)
print('-------')
print('c=')
print(c)

a=
tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])
-------
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
-------
c=
tensor([[ 2.,  7.],
        [ 8., 11.],
        [14., 16.]])


As you can see, with multiplication by the lower triangular matrix the future numbers are ignored!!
For example,
[1 1 0] . [2 6 6] = [8]. So the 2nd row sums its own value 6 and the past value 2, and ignores the future 6 (3rd element).

Now with a simple normalisation we can take the average as well, i.e., divide every element of a row by the sum of all the elements in that row.

In [11]:
a = a / torch.sum(a, 1, keepdim=True)

c = a @ b
print('a=')
print(a)
print('-------')
print('b=')
print(b)
print('-------')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
-------
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
-------
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


5.5 is the average of 7 and 4, just as expected: each row is now the average of the current and past elements. Time to use this in our weights.

In [None]:
weights = torch.tril(torch.ones(T, T))
weights = weights / weights.sum(1, keepdim=True)

xbow2 = weights @ x # (T, T) @ (B, T, C): PyTorch broadcasts weights across the batch, so it is (B, T, T) @ (B, T, C) = (B, T, C)
diff = (xbow - xbow2).abs().max()
print(f"Max difference: {diff.item()}") # To check that they are identical, up to floating-point rounding

Max difference: 3.236345946788788e-08


This is just like our matrix multiplication c = a @ b. In our case ```weights``` plays the role of a, averaging the current and all previous tokens, and ```x``` plays the role of b, holding the token features. This is how tokens can pass information to each other, and it is the building block for self-attention.

In [24]:
tril = torch.tril(torch.ones(T, T))
weights = torch.zeros((T, T))
weights = weights.masked_fill(tril == 0, float('-inf'))
print(weights)

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])


```weights.masked_fill(tril == 0, float('-inf'))``` puts -inf into the all-zeros ```weights``` tensor wherever ```tril``` is 0.
Why do we do this?
Softmax exponentiates every entry and normalises each row (```dim=-1```). Since exp(-inf) = 0 and exp(0) = 1, the masked positions get weight 0 and the remaining positions share the weight equally, so we end up with exactly the same weights as before.

These weights will help us build the self-attention block, where the tokens can start talking to each other.
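The effect of -inf under softmax can be checked by hand for a single row. This is a plain-Python sketch; the helper `softmax` and the 5-token row are written only for this illustration.

```python
import math

def softmax(row):
    """Plain-Python softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]   # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

neg_inf = float('-inf')
third_row = [0.0, 0.0, 0.0, neg_inf, neg_inf]   # third row of a 5-token causal mask

result = softmax(third_row)
print(result)
```

The three visible positions each receive one third of the weight and the two masked positions receive exactly 0.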

In [25]:
weights = F.softmax(weights, dim=-1)
print(weights)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])


In [22]:
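# weights already went through softmax in the previous cell, so it is not applied again here
# (a second softmax would put non-zero weight on future positions)
xbow3 = weights @ x
diff = (xbow - xbow3).abs().max()
print(f"Max difference: {diff.item()}") # To check that they are identical, up to floating-point rounding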

Max difference: 3.236345946788788e-08


In [5]:
torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x

out.shape

torch.Size([4, 8, 32])

In [7]:
print(wei)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])


As before, we average the current token and all the previous tokens, using the lower triangular mask and softmax for the normalisation. In this approach the weights are not data dependent: every previous token counts equally, whatever it contains. What I want is for the tokens to talk to each other, so that each token decides how much it takes from every previous token.

This is where self-attention comes into play: every single token at each position emits **two vectors** -
- **Query** - What am I looking for?
- **Key** - What do I contain?

The dot product of a token's query with the keys of all the previous tokens (and its own key) gives us the weights!! If a query and a key align, their dot product is high, and the token learns more from that particular token than from any other previous token!

In [None]:
torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

# Let us try with only ONE attention head
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
k = key(x)  # (B, T, head_size)
q = query(x)  # (B, T, head_size)
wei = q @ k.transpose(-2, -1)  # Last two dimensions to be transposed so -> (B, T, 16) @ (B, 16, T) = (B, T, T)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))  # Do not allow future tokens to communicate, so no token knows the answer before the prediction
wei = F.softmax(wei, dim=-1)  # Normalise
out = wei @ x

out.shape

torch.Size([4, 8, 32])

In [4]:
print(wei[0])

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)


As you can see, before the weights were constant and were applied the same way to every batch element. Now the weights differ, because every batch element has different tokens at different positions. This is clearly visible when compared with the earlier ```wei```, which started from all zeros and came out as a uniform distribution irrespective of the token and its position.

These weights are not uniform. In a trained model the last token knows what content it has and what position it is in.
It emits a query saying, for example, "Yo!! I am a vowel, I am looking for consonants before me".
The keys of all the previous tokens answer, and a matching one says "YO!! I am the man you are looking for!!"
This creates a high value. For example, look at the last row:
- [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]
- The 4th and 7th positions have high weights (0.2297 and 0.2423), showing a strong **DOT** product between the query of the 8th position and their keys!! The 8th position also puts 0.2391 on itself.

In [8]:
torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

# Let us try with only ONE attention head
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)  # (B, T, head_size)
q = query(x)  # (B, T, head_size)
wei = q @ k.transpose(-2, -1)  # Last two dimensions to be transposed so -> (B, T, 16) @ (B, 16, T) = (B, T, T)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))  # Do not allow future tokens to communicate, so no token knows the answer before the prediction
wei = F.softmax(wei, dim=-1)  # Normalise

v = value(x)
out = wei @ v

out.shape

torch.Size([4, 8, 16])

In [9]:
print(wei[0])

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)


As you can see, the output size is now 4, 8, **16**!!! because 16 is the head_size.

You can think of x as the private information of a particular token. The token has a certain identity, and its information is kept in the vector x.
For the purpose of a single head, a token says -
- This is what I am interested in (query)
- This is what I have (key)
- If you find me interesting, this is what I will communicate to you (value)

The last item is what is stored in the value, and it is the values that get aggregated between the different nodes in a single head.

Attention is a communication mechanism, where we have a number of nodes in a directed graph.
Every node has a vector of information, and it takes a weighted aggregation over the nodes that point to it, in a data-dependent manner. Basically, each node aggregates from the important ones among all the previous nodes up to and including itself.

There is no notion of space: attention acts on a set of vectors and does not know where each one sits. This is exactly why we add positional encoding to the tokens.
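One common way to add position information is a second, learned embedding table indexed by position. The sketch below assumes that approach; the width `n_embd = 32` and all the names are illustrative and not taken from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, block, n_embd = 65, 8, 32          # n_embd is an assumed embedding width
tok_table = nn.Embedding(vocab, n_embd)   # one vector per token ID
pos_table = nn.Embedding(block, n_embd)   # one vector per position 0..block-1

ids = torch.randint(vocab, (4, block))     # (B, T) token IDs
tok_emb = tok_table(ids)                   # (B, T, n_embd)
pos_emb = pos_table(torch.arange(block))   # (T, n_embd)
h = tok_emb + pos_emb                      # broadcast over the batch: (B, T, n_embd)

print(h.shape)  # torch.Size([4, 8, 32])
```

After the addition, the same character at two different positions is represented by two different vectors, which gives attention something to tell them apart.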

Each example across the batch dimension is of course processed completely independently, and the examples never "talk" to each other. So there are 'B' separate pools of 'T' tokens, processed in parallel, and talking only happens among the 'T' tokens inside a pool.

In our case we are doing an autoregressive form of decoding the upcoming token. This is what we call a decoder attention block. But it need not always be the case. For example, if we want the sentiment of a sentence, we would like all the tokens to talk to each other. In that case we simply remove the masking and allow every token to talk to all the others, not only the past ones, and we call it an encoder attention block.
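The only difference between the two kinds of block is the mask, which the following sketch makes explicit. The helper `attention_weights` and its `causal` flag are hypothetical names for this illustration; the layers are untrained.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

def attention_weights(x, query, key, causal):
    """Return (B, T, T) attention weights; future positions are masked only if causal."""
    q, k = query(x), key(x)
    wei = q @ k.transpose(-2, -1)                        # (B, T, T) raw scores
    if causal:
        T = x.shape[1]
        tril = torch.tril(torch.ones(T, T))
        wei = wei.masked_fill(tril == 0, float('-inf'))  # hide the future
    return F.softmax(wei, dim=-1)

torch.manual_seed(0)
x = torch.randn(4, 8, 32)
query = nn.Linear(32, 16, bias=False)
key = nn.Linear(32, 16, bias=False)

dec = attention_weights(x, query, key, causal=True)    # decoder-style: past and present only
enc = attention_weights(x, query, key, causal=False)   # encoder-style: every token sees every token

print(dec[0, 0])   # the first token can only attend to itself
print(enc[0, 0])   # the first token attends to all 8 positions
```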

In principle attention is a lot more general. In our case the queries, keys and values all come from a single set of nodes, x, which is why it is called self-attention. If the keys and values come from a separate set of nodes, we call it cross-attention.
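A minimal sketch of cross-attention, assuming a second tensor `context` of 5 nodes that stands in for some other source of information (for example an encoder's output). The tensor, its length and the layer names are assumptions for the illustration.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
B, C, head_size = 4, 32, 16
x = torch.randn(B, 8, C)         # nodes that ask the questions
context = torch.randn(B, 5, C)   # hypothetical separate set of nodes

query = nn.Linear(C, head_size, bias=False)
key = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q = query(x)         # (B, 8, head_size), from x
k = key(context)     # (B, 5, head_size), from the other set
v = value(context)   # (B, 5, head_size), from the other set

wei = F.softmax(q @ k.transpose(-2, -1), dim=-1)   # (B, 8, 5)
out = wei @ v                                      # (B, 8, head_size)
print(wei.shape, out.shape)
```

Each of the 8 tokens in x ends up with a weighted mix of the 5 context values, so the weight matrix is 8 by 5 rather than square.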

The **Attention is all you need** paper scales the attention scores by $\frac{1}{\sqrt{d_k}}$, where $d_k$ is the attention head size. Let's see what difference this makes and why it is important.

In [10]:
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
wei = q @ k.transpose(-2, -1)

In [12]:
print(k.var())
print(q.var())
print(wei.var())
print(torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1))

tensor(1.0449)
tensor(1.0700)
tensor(17.4690)
tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])


In [13]:
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [14]:
print(k.var())
print(q.var())
print(wei.var())
print(torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1))

tensor(0.9006)
tensor(1.0037)
tensor(0.9957)
tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])


As you can see, if we compute the weights naively, their variance grows to the order of head_size. In our case head_size is 16 and we get a variance of 17.47.
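The growth with head_size follows from a short calculation, assuming the components $q_i$ and $k_i$ are all independent with zero mean and unit variance (which is what ```torch.randn``` produces here):

$$
\operatorname{Var}(q \cdot k) = \operatorname{Var}\Big(\sum_{i=1}^{d_k} q_i k_i\Big) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = d_k,
\qquad
\operatorname{Var}\Big(\frac{q \cdot k}{\sqrt{d_k}}\Big) = \frac{d_k}{d_k} = 1
$$

With $d_k = 16$ the unscaled variance should be about 16, and the measured 17.47 is a sample estimate of that value.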

After scaling, the variance of the weights is close to 1. This is important because these weights go through softmax, and we want the result to be fairly diffuse. If the variance is very high, softmax converges towards one-hot vectors, so each token would attend to almost a single other token. We do not want such a spiky distribution, especially at initialisation.
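The sharpening is easy to see by feeding softmax the same scores at two magnitudes. The factor of 8 below is an arbitrary choice for illustration.

```python
import torch

scores = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
soft = torch.softmax(scores, dim=-1)        # small scores: weight is spread out
sharp = torch.softmax(scores * 8, dim=-1)   # same scores, 8x larger in magnitude

print(soft)
print(sharp)
print(soft.max().item(), sharp.max().item())  # roughly 0.29 versus roughly 0.80
```

The ordering of the entries is unchanged, but with the larger magnitudes the biggest entry takes about 80% of the weight instead of about 29%.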

Residual (skip) connections come in once we build blocks of multi-head attention (several single attention heads in parallel) followed by a feed-forward layer: the input of each transformation is simply added to its output. In backpropagation an addition passes the gradient unchanged to both branches, so the gradient reaches the earlier layers directly, which helps optimisation.
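A rough sketch of the pattern, under stated assumptions: the class `ResidualBlock` is hypothetical, and a plain linear layer named `mix` stands in for multi-head self-attention so the example stays short. Only the two additions in `forward` are the point.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Hypothetical sketch: two sub-layers, each wrapped in a skip connection."""
    def __init__(self, n_embd):
        super().__init__()
        # Stand-in: in a real block this would be multi-head self-attention
        self.mix = nn.Linear(n_embd, n_embd)
        self.ffwd = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.mix(x)    # input added back to the transformed output
        x = x + self.ffwd(x)   # same pattern around the feed-forward layer
        return x

torch.manual_seed(0)
block = ResidualBlock(n_embd=32)
x = torch.randn(4, 8, 32, requires_grad=True)
out = block(x)
out.sum().backward()

print(out.shape)      # torch.Size([4, 8, 32])
print(x.grad.shape)   # torch.Size([4, 8, 32])
```

Because each sub-layer maps n_embd back to n_embd, the input and output shapes match and the addition is always valid.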