# GPT

The goal is to present the last network architecture that we will use to build a language model.
We've seen many ways to represent information and different ways to extract it using neural networks.
We've also seen models that take memory into account.

[GPT](https://youtu.be/kCc8FmEb1nY) - Generative Pre-trained Transformer
Earlier models such as LSTMs are *recurrent* networks: they keep a memory, with a gate that sometimes forgets.
Forgetting improves learning, much as it does in a human brain.
GPT is just a generator: it simply continues a phrase.
It's a transformer because it is not based on LSTMs: it uses neither recurrence nor convolution, only [*attention*](https://arxiv.org/abs/1706.03762) (a concept popularized by the 2017 paper "Attention Is All You Need").
It can invert the arrow of time; how it does so is not actually clear.
[nanoGPT](https://github.com/karpathy/nanoGPT) is a small implementation that can reproduce the smallest GPT-2 model.
Most people believe that ChatGPT is an intelligence: it's not.
It's not smart, it's a *patacca* (Italian slang for a worthless fake).

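ChatGPT is trained in three steps:
1. Train a supervised model (fine-tuning on human-written demonstrations)
2. Train a reward model (on human rankings of model outputs)
3. Optimize with reinforcement learning (PPO, Proximal Policy Optimization, against the reward model)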

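Our implementation will follow a single linear path, with one input sequence. With two different paths one may also translate from one domain to another (e.g. converting text to images), letting one sequence attend to the other.
This is the so-called *cross-attention* mechanism.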

To start, we're going to use a simple bigram model.
No hidden layer, no recurrence, no separate embedding: just a lookup table from the current character to the scores of the next one.
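
To make the idea concrete before the neural version, here is a minimal count-based sketch of a bigram model in plain Python. It is only an illustration: the names `bigram_counts` and `next_char_probs` and the short sample string are assumptions made for this example, not part of the notebook.

```python
from collections import Counter, defaultdict


def bigram_counts(text):
    """Count how often each character is followed by each other character."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):
        counts[cur][nxt] += 1
    return counts


def next_char_probs(counts, ch):
    """Turn the counts for `ch` into a probability distribution over the next character."""
    total = sum(counts[ch].values())
    return {nxt: n / total for nxt, n in counts[ch].items()}


# hypothetical sample: the first line of the dataset
sample = "nel mezzo del cammin di nostra vita"
counts = bigram_counts(sample)
probs = next_char_probs(counts, "e")
print(probs)  # 'e' is followed by 'l' twice and by 'z' once
```

The neural bigram model used later learns a table of the same shape (one row per current character, one column per next character), but stores logits trained by gradient descent instead of raw counts.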

In [1]:
# https://openai.com/blog/chatgpt

# just for fun: https://writesonic.com/blog/best-chatgpt-examples/


We will use the Divina Commedia as our dataset: about half a million characters.

In [2]:
# nomeFile='TinyShakspeare.txt'
nomeFile = 'data/divinacommedia.txt'
# nomeFile='inferno.txt'
with open(nomeFile, 'r', encoding='utf-8') as f:
    text = f.read()


In [3]:
print("Length of dataset, in characters: ", len(text))
# let's look only at the first 1000 characters
print(text[:1000])


Length of dataset, in characters:  504416
nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura ché la diritta via era smarrita ahi quanto a dir qual era è cosa dura esta selva selvaggia e aspra e forte che nel pensier rinova la paura tant è amara che poco è più morte ma per trattar del ben ch i vi trovai dirò de l altre cose ch i v ho scorte io non so ben ridir com i v intrai tant era pien di sonno a quel punto che la verace via abbandonai ma poi ch i fui al piè d un colle giunto là dove terminava quella valle che m avea di paura il cor compunto guardai in alto e vidi le sue spalle vestite già de raggi del pianeta che mena dritto altrui per ogne calle allor fu la paura un poco queta che nel lago del cor m era durata la notte ch i passai con tanta pieta e come quei che con lena affannata uscito fuor del pelago a la riva si volge a l acqua perigliosa e guata così l animo mio ch ancor fuggiva si volse a retro a rimirar lo passo che non lasciò già mai persona viva poi ch èi

In [4]:
# all the unique characters that occur in this text...
# ChatGPT instead uses tokens (n-grams): something between characters and words
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))  # notice the space character
print(vocab_size)


 abcdefghijlmnopqrstuvxyzàäèéëìïòóöùü
37


In [5]:
# create a mapping from characters to integers
stoi = {ch: i for i, ch in enumerate(chars)}  # lookup table
itos = {i: ch for i, ch in enumerate(chars)}
# encoder: take a string, output a list of integers
def encode(s): return [stoi[c] for c in s]
# decoder: take a list of integers, output a string
def decode(l): return ''.join([itos[i] for i in l])


print(encode(text[:20]))
print(decode(encode(text[:20])))


[13, 5, 11, 0, 12, 5, 24, 24, 14, 0, 4, 5, 11, 0, 3, 1, 12, 12, 9, 13]
nel mezzo del cammin
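
One caveat of a lookup-table encoder like this: a character that never appears in the text has no index, so encoding it raises a `KeyError`. The sketch below rebuilds the same kind of mapping on a short sample to show both the round trip and the failure; the names `sample`, `s2i`, `i2s`, `encode_sample` and `decode_sample` are made up here so as not to clash with the notebook's own variables.

```python
sample = "nel mezzo del cammin"
sample_chars = sorted(set(sample))
s2i = {ch: i for i, ch in enumerate(sample_chars)}
i2s = {i: ch for i, ch in enumerate(sample_chars)}


def encode_sample(s):
    return [s2i[c] for c in s]


def decode_sample(ids):
    return "".join(i2s[i] for i in ids)


# encoding followed by decoding gives back the original string
round_trip_ok = decode_sample(encode_sample(sample)) == sample

# a capital 'N' never appears in the sample, so it has no index
try:
    encode_sample("Nel")
    unknown_char_fails = False
except KeyError:
    unknown_char_fails = True

print(round_trip_ok, unknown_char_fails)
```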


To tokenize means to encode pieces of text (*n*-grams of characters, i.e. something between characters and words) into integers.
We've seen the simplest way, but others are much more efficient, like [OpenAI's tiktoken](https://github.com/openai/tiktoken).
One may also use Google's [SentencePiece](https://github.com/google/sentencepiece).
They have a (very) big vocabulary, i.e. high integer values.
Which is better: a long list of small integers or a short list of big integers?
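
The trade-off can be seen on a toy scale by comparing a character-level encoding with a naive word-level one. This is only a sketch under simple assumptions (whitespace splitting, a one-line sample); real tokenizers such as tiktoken sit between these two extremes.

```python
sample = "nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura"

# character level: long sequence of integers
char_vocab = sorted(set(sample))
char_ids = [char_vocab.index(c) for c in sample]

# naive word level (split on spaces): short sequence of integers
words = sample.split()
word_vocab = sorted(set(words))
word_ids = [word_vocab.index(w) for w in words]

print("characters:", len(char_ids), "tokens, vocabulary of", len(char_vocab))
print("words:     ", len(word_ids), "tokens, vocabulary of", len(word_vocab))
```

On a sample this small the word vocabulary is tiny as well. On the full text it would keep growing with every new word, while the character vocabulary stays at 37 entries; the GPT-2 encoding used in the next cell has 50257 entries.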

In [6]:
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding('gpt2')

print(enc.n_vocab)

enc.encode(text[:20])


50257


[4954, 502, 47802, 1619, 12172, 1084]

enc.decode([1619])  # decode expects a list of token ids

In [7]:
for k in enc.encode(text[:20]):
    print(enc.decode([k]))


nel
 me
zzo
 del
 cam
min


Let's now encode the entire text dataset using the "simplest" encoder...

In [8]:
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
# the first 100 characters, as the GPT will see them
print(data[:100])


torch.Size([504416]) torch.int64
tensor([13,  5, 11,  0, 12,  5, 24, 24, 14,  0,  4,  5, 11,  0,  3,  1, 12, 12,
         9, 13,  0,  4,  9,  0, 13, 14, 18, 19, 17,  1,  0, 21,  9, 19,  1,  0,
        12,  9,  0, 17,  9, 19, 17, 14, 21,  1,  9,  0, 15,  5, 17,  0, 20, 13,
         1,  0, 18,  5, 11, 21,  1,  0, 14, 18,  3, 20, 17,  1,  0,  3,  8, 28,
         0, 11,  1,  0,  4,  9, 17,  9, 19, 19,  1,  0, 21,  9,  1,  0,  5, 17,
         1,  0, 18, 12,  1, 17, 17,  9, 19,  1])


In [9]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data))  # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]


In [10]:
print(val_data[:25])


tensor([ 5, 19, 14,  0, 15,  1, 17,  5,  0,  5,  0, 11,  0, 20, 11, 19,  9, 12,
        14,  0,  3,  8,  5,  0, 21])


In [11]:
for k in train_data[:20]:
    x = k.item()

    print(x, decode([x]))


13 n
5 e
11 l
0  
12 m
5 e
24 z
24 z
14 o
0  
4 d
5 e
11 l
0  
3 c
1 a
12 m
12 m
9 i
13 n


Let's define a context length of 8: up to 8 characters are used to predict the next one.

**NOTE**: real GPT models use a much larger block size, e.g. 1024 tokens for GPT-2 and 2048 for GPT-3.

In [12]:
block_size = 8
print(train_data[:block_size+1])


tensor([13,  5, 11,  0, 12,  5, 24, 24, 14])


This 9-gram actually contains 8 examples that we can give to our model in order to train it.

In [13]:
# from one 9-gram we get several (context, target) pairs, up to the maximum context length...
# not only for efficiency... we want the transformer to "get used" to variable context lengths...
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")


when input is tensor([13]) the target: 5
when input is tensor([13,  5]) the target: 11
when input is tensor([13,  5, 11]) the target: 0
when input is tensor([13,  5, 11,  0]) the target: 12
when input is tensor([13,  5, 11,  0, 12]) the target: 5
when input is tensor([13,  5, 11,  0, 12,  5]) the target: 24
when input is tensor([13,  5, 11,  0, 12,  5, 24]) the target: 24
when input is tensor([13,  5, 11,  0, 12,  5, 24, 24]) the target: 14


We're going to select some chunks of the Divina Commedia in order to feed them as batches to our model.
The time dimension is 8 (the block size), and we stack several chunks together, which adds a batch dimension.

In [14]:
torch.manual_seed(1337)
batch_size = 4  # how many independent sequences will we process in parallel
block_size = 8  # what is the maximum context length for predictions

# https://pytorch.org/docs/stable/generated/torch.stack.html
# https://www.geeksforgeeks.org/python-pytorch-stack-method/


def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # here we 'stack' in rows....
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y


xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size):  # batch dimension
    for t in range(block_size):  # time dimension
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"when input is {context.tolist()} the target: {target}")


inputs:
torch.Size([4, 8])
tensor([[ 9, 14, 13,  5,  0,  4,  9,  0],
        [11,  5,  0,  5, 18, 18,  5, 17],
        [11, 14,  0, 19, 17,  9, 18, 19],
        [19,  5,  0, 13, 28,  0, 15,  5]])
targets:
torch.Size([4, 8])
tensor([[14, 13,  5,  0,  4,  9,  0, 18],
        [ 5,  0,  5, 18, 18,  5, 17,  0],
        [14,  0, 19, 17,  9, 18, 19, 14],
        [ 5,  0, 13, 28,  0, 15,  5, 13]])
----
when input is [9] the target: 14
when input is [9, 14] the target: 13
when input is [9, 14, 13] the target: 5
when input is [9, 14, 13, 5] the target: 0
when input is [9, 14, 13, 5, 0] the target: 4
when input is [9, 14, 13, 5, 0, 4] the target: 9
when input is [9, 14, 13, 5, 0, 4, 9] the target: 0
when input is [9, 14, 13, 5, 0, 4, 9, 0] the target: 18
when input is [11] the target: 5
when input is [11, 5] the target: 0
when input is [11, 5, 0] the target: 5
when input is [11, 5, 0, 5] the target: 18
when input is [11, 5, 0, 5, 18] the target: 18
when input is [11, 5, 0, 5, 18, 18] the target: 5
...

With 4 sequences of 8 tokens each, one batch packs 32 examples together...

In [15]:
print(xb, xb.shape)  # our input to the transformer


tensor([[ 9, 14, 13,  5,  0,  4,  9,  0],
        [11,  5,  0,  5, 18, 18,  5, 17],
        [11, 14,  0, 19, 17,  9, 18, 19],
        [19,  5,  0, 13, 28,  0, 15,  5]]) torch.Size([4, 8])


In [16]:
# just for understanding the simple bigram model below...
# no hidden layer, real 1-markov approximation... tokens do NOT talk to each other...
token_embedding_table = torch.nn.Embedding(vocab_size, vocab_size)
logits = token_embedding_table(xb)
print(logits.shape)
print(logits.view(4*8, 37))


torch.Size([4, 8, 37])
tensor([[-0.5387,  2.1751, -1.7514,  ..., -0.3420, -0.8461,  0.5015],
        [-1.7031,  0.8709, -1.0023,  ..., -1.4508, -0.3814,  0.7220],
        [ 0.5551, -0.4653, -0.5519,  ..., -0.1336, -0.2664, -0.1785],
        ...,
        [ 0.6258,  0.0255,  0.9545,  ..., -1.1539, -0.2984,  1.1490],
        [ 0.3461, -0.8792,  0.8254,  ...,  1.3520,  0.0067, -0.3712],
        [ 2.1382,  0.5114,  1.2191,  ..., -0.3983,  0.3621, -0.8827]],
       grad_fn=<ViewBackward0>)


When we want to generate text, we ask the model for a prediction of the next token; see *generate()*.

In [17]:
# we start by going back to the bigram language model we have already seen....


import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)


class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table...
        # no hidden layer...comments in class
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx)  # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx


m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

nb = 5
output = m.generate(idx=torch.zeros(
    (nb, 1), dtype=torch.long), max_new_tokens=100)

for k in range(nb):
    print(decode(output[k].tolist()))


torch.Size([32, 37])
tensor(3.8613, grad_fn=<NllLossBackward0>)
 ìàr yùssjzäbùxöpmchùàièvhyvuxèhhveiuzdfóghèqàòéòhnìooocóèirünupzfógcxvpùvuùcuoòjä ìoézüzlìyàrëxtaqéz
 qljjüacöàrbùydgaàqarüüdvutmuëxèxnxüë qzëtloóbùshönzfptedpr ïsym ïäàxàmùmcèayoöüdüeòseixèarzzivostlgó
 ïöèzïhém ïäuoéhïdùu hybìëëueinèxsïväìgfmèqëbfxïjuùsmarfjëïläürüläòcüloélùmòhüfxuïöccéaàadìrìàmbfssìh
 deüg ïùä iöhf ïóhzëdógntjïüüìòvxiòtóxcsäùeiubooënùsezfbóbfjtfùdóinlïtloìrüzèoàrnùtudòüjmòìzcvoäóuìùg
 qàiaàöxnqùyguìdpégùtgciëbùxvlìrcëxëöèùòxóiéöàxyqöàóunpeòàmzëòòönzìàrnxysxmöntzüarüajxàsyqxtntztà ïpn


Up to now we have used SGD (Stochastic Gradient Descent).
Better optimizers are available, so let's use one of them, in particular [AdamW](https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html).

In [18]:
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)


In [19]:
from tqdm.notebook import trange
batch_size = 32

for steps in trange(10000):
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()  # optimization line

print(loss.item())


  0%|          | 0/10000 [00:00<?, ?it/s]

2.2512729167938232
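
The same training loop can be tried on a toy problem where the right answer is obvious. The sketch below is an illustration only, with made-up names (`toy_ids`, `table`, `toy_optimizer`) and a made-up learning rate: in a corpus of two alternating symbols each symbol fully determines the next one, so the loss of a bigram table should drop towards zero.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)

# toy corpus: symbol 0 is always followed by 1, and 1 by 0
toy_ids = torch.tensor([0, 1] * 50, dtype=torch.long)
inputs, targets = toy_ids[:-1], toy_ids[1:]

# a 2x2 table of next-symbol logits, as in the bigram model
table = nn.Embedding(2, 2)
toy_optimizer = torch.optim.AdamW(table.parameters(), lr=1e-1)

initial_loss = F.cross_entropy(table(inputs), targets).item()
for _ in range(200):
    loss = F.cross_entropy(table(inputs), targets)
    toy_optimizer.zero_grad(set_to_none=True)
    loss.backward()
    toy_optimizer.step()
final_loss = loss.item()

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

On the Divina Commedia the loss cannot go to zero in the same way, since the next character is not determined by the current one alone.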


In [20]:
print(decode(m.generate(idx=torch.zeros(
    (1, 1), dtype=torch.long), max_new_tokens=300)[0].tolist()))


 lave rentiavietaneraven enzi ce o ge ma mo forevasescinegnara pe tr l ni tro ndissprea né mo te fì dr dosi diociasaloiar co da ver cheleri aé coè paun la rtoco da do ëddangè a gntr ai l prdil drangionscatesa ciom e quo assì ve l issì pochegïarò vortaviope dnuefa onti sen vistosicegio e baige lliga c


The channel dimension is actually the dimension of the embedding... in this case it's two.

In [21]:
# let us start with a simple example....

torch.manual_seed(1337)
B, T, C = 4, 8, 2  # batch, time, channels
x = torch.randn(B, T, C)
x.shape


torch.Size([4, 8, 2])

So far the tokens are not working together, since they're all independent.
But the meaning of a word depends on the context, so we want
1. tokens "to talk" to each other
2. information to flow only from the past to the future (fixed direction)

The first point is like a *mean field* approach.
How can tokens talk to each other?
Just take the average.
This is the easiest thing to do but the hardest to understand.
We actually want $\text{xbow}[b,t] = \text{mean}_{i \le t}\, x[b,i]$.
For the second point we're going to use a triangular matrix as a mask.

For now, we define a bag of words, i.e. the average of the previous tokens.
Then we'll try a (much better) data-dependent approach.

In [22]:
xbow = torch.zeros((B, T, C))

for b in range(B):
    for t in range(T):
        # take all tokens up to time t
        xprev = x[b, :t+1]  # (t+1,C)
        # average over time
        xbow[b, t] = torch.mean(xprev, 0)  # (C,)


Notice how the first vector is unchanged, while from the second one onward each row is the average of all the rows up to it.

In [23]:
print(x[0])
print(xbow[0])


tensor([[ 0.1808, -0.0700],
        [-0.3596, -0.9152],
        [ 0.6258,  0.0255],
        [ 0.9545,  0.0643],
        [ 0.3612,  1.1679],
        [-1.3499, -0.5102],
        [ 0.2360, -0.2398],
        [-0.9211,  1.5433]])
tensor([[ 0.1808, -0.0700],
        [-0.0894, -0.4926],
        [ 0.1490, -0.3199],
        [ 0.3504, -0.2238],
        [ 0.3525,  0.0545],
        [ 0.0688, -0.0396],
        [ 0.0927, -0.0682],
        [-0.0341,  0.1332]])


We can be much more efficient by using multiplications with triangular matrices.
Let's see an example, noticing that $14=2+6+6$ and $16=7+4+5$.
Multiplying by a matrix of ones sums the rows of *b*: this is the mean, up to normalization.

In [25]:
torch.manual_seed(42)
a = torch.ones(3, 3)
b = torch.randint(0, 10, (3, 2)).float()
c = a@b
print('a=', a)
print('b=', b)
print('c=', c)


a= tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
b= tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
c= tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])


The trick is to use a triangular matrix instead of a full one.

In [26]:
torch.tril(torch.ones(3, 3))


tensor([[1., 0., 0.],
        [1., 1., 0.],
        [1., 1., 1.]])

In the second row, the $0$ kills the third vector, so we're summing the first two vectors.
The first row keeps only the first vector, and so on.
Thus a triangular matrix represents a flow of information from the past to the future.
To get the actual mean, just normalize the triangular matrix along the rows.

In [27]:
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a/torch.sum(a, 1, keepdim=True)
b = torch.randint(0, 10, (3, 2)).float()
c = a@b
print('a=', a)
print('b=', b)
print('c=', c)


a= tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
b= tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
c= tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


We have to think of *wei* as a matrix of weights, normalized along the rows.

In [28]:
wei = torch.tril(torch.ones(T, T))
wei = wei/wei.sum(1, keepdim=True)
print(wei)


tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])


If we multiply *wei* by *x* we get the same result as before.
It's another way of averaging over the past.
Notice that *wei* is a time-by-time matrix, of size $T \times T$.

In PyTorch the matrix product broadcasts *wei* over the batch dimension, so every sequence in the batch is processed in parallel.

In [29]:
xbow2 = wei @ x  # (B (broadcasting), T,T) @ (B,T,C) ----> (B,T,C)
torch.allclose(xbow, xbow2)


True
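
As an independent cross-check, the same running average can be obtained without any matrix: take a cumulative sum along the time dimension and divide by the number of tokens seen so far. The sketch below assumes the same shapes as above and uses fresh names (`x_demo`, `wei_demo`, `counts`) so that it can be run on its own.

```python
import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x_demo = torch.randn(B, T, C)

# running average via the normalized lower-triangular matrix
wei_demo = torch.tril(torch.ones(T, T))
wei_demo = wei_demo / wei_demo.sum(1, keepdim=True)
xbow_matmul = wei_demo @ x_demo  # (T,T) broadcast over the batch: (B,T,C)

# running average via cumulative sum divided by 1, 2, ..., T
counts = torch.arange(1, T + 1, dtype=x_demo.dtype).view(1, T, 1)
xbow_cumsum = x_demo.cumsum(dim=1) / counts

same = torch.allclose(xbow_matmul, xbow_cumsum, atol=1e-6)
print(same)
```

The matrix form is the one worth keeping, because its uniform weights can later be replaced by learned, data-dependent ones.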

Now we're going to build a mask, i.e. take a matrix of zeros whose upper triangle is set to $-\infty$.

In [32]:
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
print(wei)


tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0.]])


The next step is to use softmax to normalize along the rows: since $e^{-\infty}=0$ and $e^{0}=1$, we get the same result as before, which can be interpreted in terms of statistical mechanics!

In [33]:
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)


True
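
Why the masked softmax reproduces the uniform average can be checked by hand on a 3 by 3 case. The plain-Python sketch below (the helper `softmax` and the names `masked_rows` and `weights` are illustrative) applies softmax to each row of the mask.

```python
import math


def softmax(row):
    """Softmax of a list of floats; exp(-inf) evaluates to exactly 0.0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


neg_inf = float("-inf")
masked_rows = [
    [0.0, neg_inf, neg_inf],
    [0.0, 0.0, neg_inf],
    [0.0, 0.0, 0.0],
]
weights = [softmax(row) for row in masked_rows]
for row in weights:
    print(row)
```

Each masked entry gets weight zero and the remaining zeros share the probability equally, which gives back the normalized triangular matrix.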

Now we want to see *wei* as the initial weights, which we are going to change.
Instead of starting from random weights, we start from a uniform average over the past.
This reproduces exactly what we did by hand at the beginning.
This is the mathematical trick that simplifies our life.

In [35]:
torch.manual_seed(1337)
B, T, C = 4, 8, 32  # batch, time, channels
x = torch.randn(B, T, C)

tril = torch.tril(torch.ones(T, T))
# uniform weights... we are going to change them... in a data dependent way...
wei = torch.zeros((T, T))

wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
out = wei @ x

print(out.shape)


torch.Size([4, 8, 32])


Now we want to do the same without starting from zero weights.
The meaning of a word in a text may depend on the previous word as well as on a word ten steps before...
We need **self-attention**.
Each token emits a *query*, i.e. what it is looking for, and a *key*, i.e. a vector, learned during training, that describes the information the token carries.
Using the dot product between queries and keys we produce *wei*.

The channel dimension was 2; now we take $C=32$ and a head size of 16.
We won't use uniform weights but the product between *query* and the transpose of *key*.

In [36]:
torch.manual_seed(1337)
B, T, C = 4, 8, 32  # batch, time, channels
x = torch.randn(B, T, C)

# let's see a single Head perform self-attention
head_size = 16
# linear matrix of size C x head_size
key = nn.Linear(C, head_size, bias=False)
# linear matrix of size C x head_size
query = nn.Linear(C, head_size, bias=False)
# the projections act on each token separately... NO communication here
# for each batch element, at each time step we have a 16 dimensional vector
k = key(x)  # (B,T,16)
q = query(x)  # (B,T,16)
# now communication starts. we transpose the last two dimensions, keeping the batch dimension fixed
# for each batch element we get a (T,T) matrix from the scalar products of each query with each key
wei = q @ k.transpose(-2, -1)  # (B,T,16) @ (B,16,T)-->(B,T,T)
# the weights are data dependent: each batch element gets its own (T,T) matrix

tril = torch.tril(torch.ones(T, T))
# wei = torch.zeros((T,T)) now we take this out... weights are now defined in a data dependent way
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)  # normalize (statistical mechanics POV)
out = wei @ x

print(out.shape)


torch.Size([4, 8, 32])


The weights now are... *not* uniform.

In [38]:
print(wei[0])


tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)


In the last multiplication we actually used $\vec{x}$ itself... we want one more projection.
We'll associate to each token a *value*, i.e. another vector of size head size, $\vec{v}$, and aggregate the values instead.

In [39]:
# actually we do not aggregate x itself but v(x), the 'value' of x !!

# version 4: self-attention with VALUE
torch.manual_seed(1337)
B, T, C = 4, 8, 32  # batch, time, channels
x = torch.randn(B, T, C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
# value
value = nn.Linear(C, head_size, bias=False)
k = key(x)  # (B,T,16)
q = query(x)  # (B,T,16)
wei = q @ k.transpose(-2, -1)  # (B,T,16)@ (B,16,T)-->(B,T,T)

tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

# the value that the token takes
v = value(x)
out = wei @ v

print(out.shape)  # head size


torch.Size([4, 8, 16])
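
The cell above can be packaged as a reusable module. What follows is a minimal sketch, not the notebook's own implementation: the class name `Head` and the constructor arguments `n_embd`, `head_size` and `block_size` are assumptions chosen for this example, and the attention weights are returned alongside the output only to make them easy to inspect. As in the cell above, no scaling is applied to *wei*.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F


class Head(nn.Module):
    """One head of masked (causal) self-attention."""

    def __init__(self, n_embd, head_size, block_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # lower-triangular mask, stored as a non-trainable buffer
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)      # (B,T,head_size)
        q = self.query(x)    # (B,T,head_size)
        wei = q @ k.transpose(-2, -1)  # (B,T,T)
        # hide the future, then normalize each row
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)    # (B,T,head_size)
        return wei @ v, wei


torch.manual_seed(1337)
head = Head(n_embd=32, head_size=16, block_size=8)
x_in = torch.randn(4, 8, 32)
out, att = head(x_in)
print(out.shape, att.shape)

# causality check: changing the last token must not affect earlier outputs
x_changed = x_in.clone()
x_changed[:, -1] += 1.0
out_changed, _ = head(x_changed)
past_unchanged = torch.allclose(out[:, :-1], out_changed[:, :-1])
print(past_unchanged)
```

Storing the mask with `register_buffer` keeps it out of the trainable parameters while still moving it together with the module when changing device.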


We have a directed, linear graph in which every node sends information to the future ones.
Attention is just a communication mechanism, no more.
Key, query and value all come from the same tokens: sometimes we don't want that, e.g. in translation.
Moreover, if we want to generate an image from a text, we need *cross-attention*.
The vast majority of new applications rely on this model, with many small changes.
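
The difference between self-attention and cross-attention is only where keys and values come from. The sketch below is an illustration with made-up names and shapes (`x_tgt` for the sequence being processed, `x_src` for a hypothetical external sequence of a different length): queries are computed from one sequence, keys and values from the other, and no causal mask is applied because the external sequence is fully known.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)
B, T_tgt, T_src, C, head_size = 4, 8, 5, 32, 16

x_tgt = torch.randn(B, T_tgt, C)  # sequence that asks the questions
x_src = torch.randn(B, T_src, C)  # hypothetical external source (e.g. an encoded text)

q_proj = nn.Linear(C, head_size, bias=False)
k_proj = nn.Linear(C, head_size, bias=False)
v_proj = nn.Linear(C, head_size, bias=False)

q = q_proj(x_tgt)  # (B,T_tgt,head_size): queries from the target
k = k_proj(x_src)  # (B,T_src,head_size): keys from the source
v = v_proj(x_src)  # (B,T_src,head_size): values from the source

cross_wei = q @ k.transpose(-2, -1)       # (B,T_tgt,T_src)
cross_wei = F.softmax(cross_wei, dim=-1)  # every source position is visible
cross_out = cross_wei @ v                 # (B,T_tgt,head_size)

print(cross_wei.shape, cross_out.shape)
```

Note that the weight matrix is no longer square: each of the 8 target positions distributes its attention over the 5 source positions.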

To scale up we need more tricks, like [nanoGPT](https://github.com/karpathy/nanoGPT).


**NOTE**:
1. Attention is a communication mechanism. It can be seen as nodes in a directed graph that look at each other and aggregate information by weighted sums...
2. In our case the graph is linear: node 1 only sees itself, node 2 aggregates from nodes 1 and 2, node 3 from nodes 1, 2 and 3, and so on until the 8th node, which aggregates information from all the others (8 nodes = block_size), with information coming only from the past.
3. Attention has no notion of space, which is why we embed the position of each token... very different from convolution!
4. Each example across the batch dimension is of course processed completely independently: examples never "talk" to each other.
5. An encoder does not mask the future, while a decoder (like ours) does...
6. "Self-attention" just means that the keys and the values are produced by the same source as the queries. In "cross-attention" values and keys can come from external sources...
7. Discuss the normalization in Eq. (1) of the paper "Attention is all you need"... "Scaled" attention additionally multiplies *wei* by $1/\sqrt{\text{head\_size}}$. This way, when the inputs Q and K have unit variance, *wei* has unit variance too, and softmax stays diffuse and does not saturate 'too much'.
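
The effect of the scaling in the last note can be checked numerically. The sketch below uses random unit-variance queries and keys (the sizes and the example logits are arbitrary choices for this illustration): without scaling the variance of the dot products grows like the head size, with scaling it stays close to one, and a softmax fed with large logits collapses onto a single entry.

```python
import torch

torch.manual_seed(1337)
n, head_size = 1000, 16
q = torch.randn(n, head_size)  # unit-variance queries
k = torch.randn(n, head_size)  # unit-variance keys

wei_raw = q @ k.T                           # variance close to head_size
wei_scaled = wei_raw * head_size ** -0.5    # variance close to 1

var_raw = wei_raw.var().item()
var_scaled = wei_scaled.var().item()
print(f"unscaled variance: {var_raw:.2f}, scaled variance: {var_scaled:.2f}")

# the same logits, small and multiplied by 8: softmax sharpens as values grow
example_logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
diffuse = torch.softmax(example_logits, dim=-1)
peaked = torch.softmax(example_logits * 8, dim=-1)
print(diffuse)
print(peaked)
```

A peaked softmax means each token listens to essentially one other token, which is especially harmful at initialization, when the weights carry no information yet.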

To learn more:
- [Residual connections](https://arxiv.org/abs/1512.03385)
- [Layer normalization](https://arxiv.org/abs/1607.06450), also [using Torch](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html)
- [Dropout](https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf), i.e. killing a random number of neurons to prevent overfitting
- [GPT-3](https://arxiv.org/pdf/2005.14165.pdf) to see details about hyperparameters