# GPT Model

Notes from Andrej Karpathy's [Let's build GPT](https://www.youtube.com/watch?v=kCc8FmEb1nY) video.

Code for this lecture is mostly from https://github.com/karpathy/ng-video-lecture/tree/master.

Here we explore ideas such as self-attention.

In [10]:
# A small demonstration of letting previous tokens influence the current token
import torch
from torch.nn import functional as F

torch.manual_seed(1337)

B, T, C = 4, 8, 32  # batch size, sequence length (context), embedding dimension

# random token embeddings for a batch
x = torch.randn(B, T, C)

# lower-triangular matrix of ones: tril[i,j] == 1 for j <= i
tril = torch.tril(torch.ones(T, T))
print(tril)

# Build an attention-like weight matrix in which each time step i can attend
# only to positions 0..i (itself and earlier tokens).
# We put -inf in disallowed positions so softmax will assign them zero probability.
wei = torch.zeros((T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))
# After softmax over the last dimension, each row i contains normalized weights
# over positions 0..i (they sum to 1).
wei = F.softmax(wei, dim=-1)
print(wei)

# Apply the weights to the token embeddings.
# wei has shape (T, T) and x has shape (B, T, C).
# Broadcasting makes wei behave like (B, T, T), so the result has shape (B, T, C).
# out[b, i, :] is the weighted average of x[b, 0:i+1, :] using the weights in row i.
out = wei @ x
print(out.shape)

# This is a simple, explicit way to let information from previous tokens
# flow into the representation at each current time step (a toy causal attention).

tensor([[1., 0., 0., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1., 1., 1.]])
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
        [0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
        [0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
        [0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
torch.Size([4, 8, 32])
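
Because every allowed position gets the same weight, the masked softmax above should amount to a running mean over the current and earlier tokens. The sketch below checks that against an explicit loop. The names `xbow` and `matches` and the `atol` tolerance are illustrative choices, not part of the cell above.

```python
import torch
from torch.nn import functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 32  # batch size, sequence length, embedding dimension
x = torch.randn(B, T, C)

# Reference: running mean computed with explicit Python loops.
# xbow[b, t] is the mean of x[b, 0..t] (hypothetical name, for illustration).
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xprev = x[b, : t + 1]            # (t+1, C): tokens up to and including t
        xbow[b, t] = xprev.mean(dim=0)   # average over the time dimension

# Same computation as one matrix multiply with masked-softmax weights.
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros(T, T).masked_fill(tril == 0, float("-inf"))
wei = F.softmax(wei, dim=-1)             # row i: uniform weights over 0..i
out = wei @ x                            # (T, T) @ (B, T, C) -> (B, T, C)

# Compare with a small absolute tolerance to absorb float32 rounding.
matches = torch.allclose(out, xbow, atol=1e-5)
print(matches)
```

The loop version makes the averaging explicit but does O(B·T) Python-level work, while the matrix form does the same averaging in a single batched multiply. The matrix form also generalizes: replacing the zeros in `wei` with data-dependent scores before the softmax gives non-uniform weights.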
