<a href="https://colab.research.google.com/github/Raahim58/Neural-networks/blob/main/07_building_GPT_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 07. Building Makemore from scratch (Part5: Building a WaveNet)

**Resources:**

* tutorial lecture 7 code:
  * https://colab.research.google.com/drive/1JMLa53HDuA-i7ZBmqV7ZnA3c_fvtXnx-?usp=sharing
  * https://github.com/karpathy/ng-video-lecture

* nanoGPT github repo: https://github.com/karpathy/nanoGPT

* link to youtube lecture 7: https://www.youtube.com/watch?v=kCc8FmEb1nY&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=7

* whole lecture series code: https://github.com/karpathy/nn-zero-to-hero

**Extra curriculum:**

* MLP model based on 2003 paper: https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

* WaveNet 2016 from DeepMind https://arxiv.org/abs/1609.03499

* Attention is All You Need paper: https://arxiv.org/abs/1706.03762

* OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165

* OpenAI ChatGPT blog post: https://openai.com/blog/chatgpt/

* 3Blue1Brown *What is a GPT?*: https://youtu.be/wjZofJX0v4M?si=0VFIijzi6P-9wH_F

* 3Blue1Brown *Self-attention in transformers*: https://youtu.be/eMlx5fFNoYc?si=AXvbawHAa8h4K8Fh

## 0. Getting setup + intro

<img src = "https://raw.githubusercontent.com/Raahim58/Neural-networks/main/images/transformer.png" height = 600 width = 600>

* chatGPT is a language model because it models the sequence of words or characyers or tokens more generally, and it knows how sort of words follow each other in English language. From its perspective it's completing a sequnce with the outcome
* we'll be focusing on the under the hood stuff that makes chatGPT work:
  * what is the neural network that models the sequence of these words under the hood? -> this comes from the paper "Attention all you need" that proposed the transformer architecture
  * GPT = generatively pre-trained transformer, where transformer is the neural net that actually does all the heavy lifting under the hood.
  * we're not going to be re-producing chatGPT as it is a very serious production grade system which is trained on a good chunk of internet and then there's a lot of pre-training and fine-tuning stages to it.
  * what we're going to do be focusing on is a to train a transformer based language model. In our case it will be a character language mode which will be trained on a smaller chunk of dataset namely "tinyShakespeare.txt" -> concanetenated of all of the works of Shakespeare
  * we will model how the characters follow each other
  * it will try to produce character sequences that look like the Shakespeare text file where in reality it produces on a token by token basis for chatGPT
    * token are sub word pieces, so they're like word chunk level
  * nanoGPT repo for training a transformer on any given text
    * `train.py` is a ~300-line boilerplate training loop and `model.py` a ~300-line GPT model definition, which can optionally load the GPT-2 weights from OpenAI.
* what we will do:
  * define a transformer piece by piece
  * train it on the `tinyShakespeare.txt` dataset
  * see how we can generate infinte Shakespeare
  * can copy to any other dataset we like
* preliminary-code:
  * we sort the characters uniquely
  * the number of them is going to be our vocab size -> the possible elements of our sequences.
  * we get 65 characters in total with capitals, lowercase, and special letters.


In [1]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-09-08 12:49:34--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-09-08 12:49:34 (25.8 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [2]:
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [3]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [4]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [5]:
# here are all the unique characters sorted that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


## 1. Tokenization + split dataset into train/val

* tokenizing the input text means to conver the raw text as the string to some sequence of integers according to some vocabulary of possible elements
* since we're building a character level language model, we're just going to be translating individual characters into integers

* to tokenize we:
  * we're going to be using an encoder and a decoder
    * an encoder takes an input text and returns a list of integers that represent that string
    * decoder takes the list of integers to map it back to the input string of characters/words
  * we iterate over all the characters and create a lookup table from the character to the integer and vice versa
  * to encode some string, we translate all the characters individually and to decode it back we use the reverse mapping and concatenate all of it

* there is one of many possible encodings or tokenizers and it's a very simple one
  * there are many other schemas which people have come up with, for e.g Google will use a "SentencePiece" which encodes text into integers but in a different schema and using a different vocabulary
    * "SentencePiece" is a sub word sort of tokenizer -> we're not encoding entire words but we're also not encoding indiviudal characters, it's a subword unit level
      * that's what is usally adopted in practice 8
    * openAI library called "ticktoken" which uses a byte pair encoding tokenizer and that's what GPT uses
      * it has 50526 tokens
      * so you can basically trade off the code book size and the sequence lengths so you can have a very long sequence of integers with very small vocabularies or you can have a short sequence of integers with very large vocabularies
  * typically people use these sub word encodings in practice but we're like to keep our tokenizer very simple so we'll be using characterlevel tokenzier so that means we have very small code books, we have very simple encode and decode functions, but we do get very long sequences as a result but we're going to stick with that method in this notebook because it's the simplest thing
* what we're doing here to tokenize:
  * we use `torch.tensor` from the PyTorch library
  * we will take all of the text in the `tinyShakespeare.txt` file and encode it, and then wrap it in a `torch.tensor` to get the data tensor
  * we get a sequence of integers which are an identical translation of the characters themselves
  * e.g 0 is a new line character, 1 is a space
* we all split our dataset into 90% training dataset and 10% validation dataset
  * we do this because it helps us understand to what extent our model is overfitting so we're going to basically hide and keep the validation data on the side because we don't want just a perfect memorization of this exact text file, we rather want a neural network that sort of creates a shakespeare like text and so it should be fairly likely for it to produce the actual stowed away true shakespeare text  

In [6]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


In [7]:
# let's now encode the entire text dataset and store it into a torch.Tensor
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # the 1000 characters we looked at earier will to the GPT look like this

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [8]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest 10% val
train_data = data[:n]
val_data = data[n:]

## 2. Dataloader: batches of chunks of data

* we would like to start plugging in the text sequences or integer sequences into the transformer so that it can train and learn those patterns
* an important thing to realize is that we're never going to actually feed the entire text into the Transformer all at once, that would be computationally very expensive and prohibitive. When we actually train a transformer on a lot of these dataset, we only work with chunks of the dataset and when we train the Transformer we basically sample random little chunks out of the training set and train them just chunks at a time. These chunks have some kind of a maximum length
  * the maximum length in the code is called `block_size` but can also be known as `context_length` elsewhere


In [9]:
# first 9 characters in the training set sequence
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [10]:
# showing the prediction of integers/characters in the sequence
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


* when we sample a chunk of data like below for the 9 characters out of the training set, this actually has multiple examples packed into it.That's because all of these characters follow each other and so what these sequence of integers we get are going to say when we plug them into the Transformer is we're going to simultaneously train it to make a prediction at every one of its positions
  * now in a chunk of 9 characters, they're actually 8 individual examples packed in there
  * `x` are the inputs to the transformer. They will just be the first block size characters
  * `y` will be the next `block_size` character so it's offset by one, that's because `y` are the targets for each position in the input
  * we then iterate over the `block_size` of 8, and the `context` is always all the characters in X up to `t` and including `t`, and the target is always the `t`th character, but in the targets array `y`.
  * another thing to mention is that when we train on all the 8 examples with context between 1 and `block_size`, we train on that not just for computational reasons because we happen to have the sequence (not just for efficiency), it's also done to make the Transformer network be used to seeing contexts from all as little as 1 to `block_size`. We'd like the transform to be used to seeing everything in between and that's going to be useful later during inference because while we're sampling, we can start the sampling generation with as little as 1 character of context and the Transformer knows how to predict the next character with all the way upto just context of 1 adn so then it can predict everything up to `block_size`, and after `block_size` we have to start truncating because the Transformer will never receive more than `block_size` inputs when it's predicting the next character
* we've looked at the time dimension of the tensors that are going to be feeding into the Transformer, there's one more dimension to care about and that's the batch dimension
  * so as we're sampling these chunks of text, everytime we're going to feed them in a Transformer, we're going to have many batches of multiple chunks of text that are all stacked up in a single tensor. That's just done for efficiency so we can keep the GPU's busy because they're very good at parallel processing of data so we just want to process multiple chunks all at the same time but those chunks are processed completely independently, they don't talk to each other. Let's generalize the code and introduce a batch dimension in the code below:
   

In [11]:
torch.manual_seed(1337) # for reproducibility
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53

* to get the batch for an arbitrary split:
  * if the split is a training split, we just look in the training dataset otherwise the val data giving us the `data` array.
  * then we generate random positions in `ix`, we actually generate batch size number of random offsets into the training sets
  * `x`'s are the first block size characters starting at `i`, `y`'s are offset by 1 of that, so just add plus 1 -> we're going to get those chunks for every one of integers `i` in `ix` and use `torch.stack` to take all those 1D tensors and we're going to stack them up at rows, so all they become a row in a [4, 8] tensor (with each row being a chunk in the training dataset) -> the total 32 examples are completely independent as far as the Transformer is concerned
    * the target's in the associated `y`'s will come in through the Transformer all the way at the end to create the loss function so they will give us the correct answer for every single position inside `x`
* so essentially we have 32 examples packed into a single batch of the input `x` and then the desired targets in `y` so now this integer tensor of `x` is going to feed into the Transformer, and that Transformer is going to simultaneously process all these examples and then look up the correct integers to predict in each one of these positions in the tensor `y`

## 3. Simplest baseline: bigram language model

### 3.1. Building the model, loss, generation

* the model:
  * when the inputs and targets come in `forward`, we just take the index of the inputs `x` which we renamed to `idx`, and we pass them into the token embedding table:
    * in the `init` constructor, we're creating a token embedding table, and it is of size [vocab_size, vocab_size], and we're using `nn.Embedding` which is a very thin wrapper around basically a tensor of shape [vocab_size, vocab_size].
    * when we pass `idx` in the `forward` for logits, every single integer in our input is going to refer to the embedding table and is going to pluck out the row of that embedding table corresponding to its index (integer 24 will pluck out the row 24). Pytorch will arrange all of this into a [B, T, C] -> [4, 8, 65] (batch, time, channel) tensor, and will interpret them as logits which are basically the scores for the next character in the sequence. We are predicting what comes next based on just the individual identity of a single token and you can do that because the token are currently not talking to each other, and they're not seeing any context except for just seeing themselves -> `print(out)` gets us the prediction, the scores or logits, for every one of the [4, 8] positions

* evaluating the loss of our function:
  * in makemore series, we saw that a great way to measure the loss or quality of our predictions is to use the negative loss likelihood: `Cross_Entropy` -> loss is the cross entropy on the predictions and the targets (so loss measures the quality of the logits w.r.t Targets, or in other words how well are we predicting the next character based on `logits`).
    * intuitively, the correct dimension of the logits depending on whatever the target is should have a very high number and all the other dimensions should be very a low number
  * if we have a multi-dimensional input, pytorch wants [B, C, T] instead of [B*T, C] in `cross_entropy` -> so we'll reshape those logits by unpacking the numbers, and then B*T for the first dimension. We will take all the positions for input tensor and then stretch them out in a 1D sequence and preserve the channel dimension as the second dimension (we're just stretching the array so it's 2D and in that case it's going to be better conform to what pytorch sort of expects)
  * we have to do the same with `targets` as we did with `logits` because currently `targets` are of the shape [B, T] and we just want it to be [B*T]. Alternatively, we could do -1 because pytorch will guess what it should be if you want to lay it out.
  * since the data is equally likely, we expect the loss to be (-ln(1/65)) ~ 4.13 instead of the 4.87 we're getting -> our predicitons are niot super diffuse but we've got a little bit of entropy and so we're guessing wrong

* generating from the model:
  * we take the same kind of input `idx` which is the current context of characters in some batch, so it's also [B,T]
  * the job of `generate` is to basically take the [B,T] and extend it to [B, T+1], [B, T+2]..so it continues the generation in all the batch dimensions in the time dimension, and it will do this for `max_new_tokens`.
  * whatever is predicted in the `for` loop is concatenated on top of the previous `idx` along the first dimension which is the time dimension, so that becomes the new `idx`
  * inside the `for` loop for `generate`:
    * we're taking the current indices `idx`, we're getting the predictions inside the `logits`, and then the loss will be ignored over there because we don't have any truth targets that we're going to be comparing with
    * once we get the `logits`, we are only focusing on the last step, so instead of [B, T, C], we're going to pluck out the -1 (the last element in the time dimension) because those are the predictions for what comes next and so that gives us the `logits` which we then convert to `probs` via `softmax`. Then we use `torch.multinomial` to sample from those `probs` and we ask pytorch to gives us 1 sample
    * `idx_next` will then become [B,1] because in each of the batch dimensions, we're going to have a single prediction for what comes next. The `num_samples = 1` will make the [B, T] be a [B, 1]
    * we're going to take those integers `idx_next` that come from the sampling process according to the probability distribution given here and those integers got just concatenated on top of the current sort of running stream of integers `idx` and this gives us [B, T+1]
    * we're calling `self(idx)` which will end up going to the `forward` function. `Targets=None` thus has to be optional as we haven't provided any, and as a result there is no `loss` to create, so we just get the `logits` when there are no `targets`. If there are `targets` it just passes them and gives us `loss`.
    * the `generate` function is written to be general but it's kind of ridiculous right now because we're feeding in all this stuff, we're building out the context and concatenating it all, and we're always feeding it into the model. but it's ridiculous because it is a simple bigram model so in order to make a prediction about `k` for example, we only need a `w`, but what we fed into the model is that we fed the entire sequence and then we only looked at the last pieces and predicted `k`. We're writing the `generate` function in this way right now because it's a bigram model but i'd like to keep this function fixed and i'd like it to work later when our character further look in the history. Right now the history is not used so it looks silly, but eventually the history will be used so we do it this way.

* decoding this:
  ```python
  print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
  ```
  * the `idx` are `torch.zeros((1, 1), dtype=torch.long`.
    * we're creating a `batch` which will just be 1, we're creating a [1,1] tensor and it's holding a 0 and the d.type is int. 0 (representing new line character) is how we're going to kick off generation
  * then we will ask for 100 tokens and then enter generate and continue that.
    * since `generate` works on the level of `batches`, we then have to index into the 0th row to basically unplug the single batch dimension that exists and then that gives us a time steps which is just a 1D array of all indices which we will convert to simple python list form pytorch tensor so that can feed into our `decode` function and convert those integers to text
  * generates total garbage because it's a totally random model so next we'll train this model



In [12]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))


torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3


### 3.3 Training the model

* `Adam` is a much more advanced optimizer than `SGD`. it works extremely well for a typical good setting for the lr to be 1e-3, but for small models you can use a much higher `lr`
  * optimizers take the gradients and update the parameters using the gradients
* this is a very simple model because the tokens are not talking to each other so given the previous context of whatever was generated, we're just looking at the last character to make the predictions about what comes next.
* the tokens now have to start talking to each other and figure out what is in the `context` -> kicking off the Transformer

In [13]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [14]:
batch_size = 32
for steps in range(1000): # increase number of steps for good results...

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss through a training loop
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())

3.7218432426452637


In [15]:
# something more reasonable (can increase number of tokens to check results)
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


olylvLLko'TMyatyIoconxad.?-tNSqYPsx&bF.oiR;BD$dZBMZv'K f bRSmIKptRPly:AUC&$zLK,qUEy&Ay;ZxjKVhmrdagC-bTop-QJe.H?x
JGF&pwst-P sti.hlEsu;w:w a BG:tLhMk,epdhlay'sVzLq--ERwXUzDnq-bn czXxxI&V&Pynnl,s,Ioto!uvixwC-IJXElrgm C-.bcoCPJ
IMphsevhO AL!-K:AIkpre,
rPHEJUzV;P?uN3b?ohoRiBUENoV3B&jumNL;Aik,
xf -IEKROn JSyYWW?n 'ay;:weO'AqVzPyoiBL? seAX3Dot,iy.xyIcf r!!ul-Koi:x pZrAQly'v'a;vEzN
BwowKo'MBqF$PPFb
CjYX3beT,lZ qdda!wfgmJP
DUfNXmnQU mvcv?nlnQF$JUAAywNocd  bGSPyAlprNeQnq-GRSVUP.Ja!IBoDqfI&xJM AXEHV&DKvRS


we take our code in this jupyter notebook and simplify our immediate work into just the final product we have at this point into `bigram.py` (our starter code):
  * at the top we just set up the hyperparameters that we've defined
  * reproducibility
  * read the data
  * get the encoder and the decoder
  * create the training test splits
  * use the dataloader that gets a batch of the `inputs` and `targets`
  * estimate loss (talkeed later ahead)
  * `BigramLanguageModel` whuch can forward and give us logits and loss and it can generate
  * optimizer + training loop

things added not discussed from before:
  * added `device` to allow it to be run on a GPU (Cuda) instead of CPU
    * we need device agnostic code to move the data, the model (its parameters) to the device
      * e.g `nn.Embedding` table and it has got a `.weight` inside it which stores the lookup table
  * inside the training loop we introduce from the `estimate_loss` function:
  ```python
  print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
  ```
    * before we were just printing `loss.item()` outside the loop which is a very noisy measurement of the current loss as every batch will be more or less lucky
    * it calls the `estimate_loss` function which averages the loss over multiple batches for both splits which will be much less noisy
  * in the `estimate_loss` function:  
    * first we put the model into evaluation mode before it averages the loss and then into training mode after it is done averaging loss. For our model right now it won't do anything because we just have the `nn.Embedding` layer; we have no Dropout or BatchNorm layers but it is still a good practice to think through which mode your model is in since some layers will have different behaviour at `inference` time or `training` time
    * there is also a context manager `@torch.no_grad()` which is just telling pytorch that whatver happens inside the function, we will not call `.backward()` on and so pytorch can be very efficient in its memory use as it will no longer have to store all the intermediate variables.

## 4. Mathematical tricks in self-attention

### 4.1 version 1: averaging past context with for loop, weakest form of aggregation

* we would like the 8 tokens in the toy example to talk to each other by coupling them
  * in particular, we want to couple them in a very specific way. The token at the 5th location should not communicate with tokens in the 6th, 7th or 8th location because those are future tokens in the sequence. It should only talk to the 1st 2nd 3rd 4th token. So, the information only flows from previous context to the current timestamp, and we cannot get any information from the future because we are about to try to predict the future
  * what is the easiest way for the tokens to communicate?
    * if we are the 5th token and i want to communicate with my past, the simplest way to do is just an average of all the preceding elements. As the 5th token, i would like to take the channels that make up the information at my step, but also the channels from the previous steps, and then average them, and that will become the feature vector that kind of summarizes the 5th token in the context of history.
    * doing a `sum()` or an average is an extremely weak sort of interaction, like the arrangment is extremely lossy. We've lost information about the spatial arrangements of all those tokens but that's okay for now, we'll bring in that information back later.
* for now what we would like to do is for every single batch independently, for every teeth token in that sequence, we would like to now calculate the average of all the vectors in all the previous tokens and also at this specific token:
  * we will create `xbow` where `bow` is short for backup words because it is like a term people use when averaging things -> basically there's a word stored at each one of these 8 locations and we're just averaging
  * we initialize at 0 and we do a `for` loop (not efficient yet) to iterate over all the batch dimensions independently iterating over time and then the previous tokens are at that specific batch dimension and then everything up to and including the teeth token. So, when we slice out `X` in this way, `xprev` becomes of shape [teeth_tokens(t), C] -> previous chunk of tokens from my current sequence. Finally, we do an average or `mean` for the time over the 0th dimension -> we will get a small C 1D vector which we will store in `xbow`.
  * the last token will be an average of all the tokens with them being added vertically

* we can make this very efficient using a mathematical trick: matrix multiplication


In [16]:
# consider the following toy example:
torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape

torch.Size([4, 8, 2])

In [17]:
# We want x[b,t] = mean_{i<=t} x[b,i]
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t,C)
        xbow[b,t] = torch.mean(xprev, 0)

### 4.2 using matrix multiplication to make it efficient



* the number in the top left of matrix `c` (14 for us), is achieved by the first row of `a` dot product with the first column of `b`..and so this continues as a matrix multiplication in the form of a dot product to get `c` -> using `torch.ones`


In [18]:
# initial toy example with matrix multiplication
torch.manual_seed(42)
a = torch.ones(3, 3)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[14., 16.],
        [14., 16.],
        [14., 16.]])


* the trick here is:
  * to use `torch.tril` and then wrap it in `torch.ones` and it will just return the lower triangular matrix as one's -> changes all the matrices hence
    * depending on how many ones and zeros we have here, what we are basically doing here is a `sum` currently of a variable number of the rows of `a` and `b` and that gets deposited into `c`. We're doing sums cuz they are one's in `a`
  * on the alternative, we can also do an average in an incremental fashion because we can basically normalize the rows so that they sum to one and then we're going to get an average
    * here now our matrix `a` has rows which are normalized (summing upto 1). Using matrix `b`, the matrix `c` first row is just the first row of `b` itself, the second row of `c` is the average of the first 2 rows of `b` (column wise addition). in the last row of `c` we are getting an average of the 3 rows (column wise addition)
* by manipulating these elements of this multiplying matrix and then multipplying it with any given matrix, we can do these averages in these incremental fashion, and we can just manipulate that on basis of `a`

In [19]:
# toy example illustrating how matrix multiplication can be used for a "weighted aggregation"
torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)

a=
tensor([[1.0000, 0.0000, 0.0000],
        [0.5000, 0.5000, 0.0000],
        [0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
        [6., 4.],
        [6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
        [4.0000, 5.5000],
        [4.6667, 5.3333]])


### 4.3 Version 2: vectorizing it with matrix multiply for more efficiency

* `wei` is how much of every row we want to average up and it's going to be an average as the rows sum upto 1. Hence, this is going to be our `a`
  * our `b` in this example is `x`
* pytorch will come in `xbow2` and see that (T, T) @ (B, T, C) shapes are not the same, so it will create a batch dimension here and it becomes this: (B, T, T) @ (B, T, C) leading to (B, T, C)
  * basically it wil apply the matrix multiplication in all of the batch elements in parallel and individually, and for each batch element there will be a (T,T) multoplying (T,C)
* we were able to use batch multiply to do this aggregation and the weights are specified in the (T,T) array and we're basically doing weighted sums, these weighted sums are according to the weights inside:
```python
wei = torch.tril(torch.ones(T, T))
```
  * they take on off a sort of triangular form and so that means that the token at the teeth dimension will only get tokens information from the tokens preceding it.

In [20]:
# xbow

In [21]:
# xbow2

In [22]:
# version 2: using matrix multiply for a weighted aggregation
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2, rtol=1e-05, atol=1e-07) # using this tolerance because of floating point differences in the numbers

True

### 4.4 Version 3: adding `softmax`

* identical to the 1st and 2nd version
* `trils` begins with all 0. So if I just print the beginning of it, it's all 0. Then I use the masked fill. So what this is doing is, going down the masked fill, it's all 0s. For all the elements where `tril` is equal to 0, make them a negative infinity.

* Then the final one here is the `softmax`. So if I take the softmax along every single row, what is that going to do? Well, the softmax is also like a normalization equation, right? And so it's all over here, and we get the exact same matrix. We'll bring back the softmax. And recall that in softmax, we're going to exponentiate every single row once, and then we're going to divide by itself. And so if we exponentiate every single element here, we're going to get 1, and here we're going to get 0, 0, 0, 0, 0, 0, 0. And then when we normalize, here we're going to get 1, 0, and then 0s. And the softmax will again divide, and this opens up like that, and so on. And so this is also the same way to produce the masked fill.

* Now the reason that this is a bit more interesting, and the reason that we're going to end up using it in self-attachment, is that these weights here begin with 0. And you can think of this as like an interaction string, or like an affinity. So basically, it's telling us how much of each token from the past we want to aggregate, and average over. And then this line is saying, tokens from the past cannot communicate.

* By setting them to negative infinity, we're saying that we will not aggregate anything from those tokens. And so basically, as that goes for softmax, this is the aggregation through matrix multiplication. And so what this is now is, you can think of these as, these 0s are currently just set by us to be 0, but a quick preview is that these affinities between the tokens are not going to be just constant at 0. They're going to be data-dependent.

* These tokens are going to start looking at each other, and some tokens will find other tokens more or less interesting. And depending on what their values are, they're going to find each other interesting to different amounts, and are going to have those affinities. And then here we are saying, the future cannot communicate with the past.We're going to collect them. And then when we normalize and sum, we're going to aggregate some of their values, depending on how interesting they find each other. And so that's the preview for softmax.

* And basically, long story short from this entire section is that, you can do weighted aggregations of your past elements by using matrix multiplication of a lower-triangular function. And then the elements here in the lower-triangular part are telling you how much of each element fuses into this position. So we're going to use this technique now to enable self-attention with them.


In [23]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T)) # lower triangular matrix of one's
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)

False

check `gpt.py` for this code:
* more than just encoding the identity of the tokens, we also token their position. Hence, we introduce a second embedding table as `self.position_embedding_table` which is an embedding of block size by an embedding table, so each position from `0 to block_size - 1` will also get its own embedding vector.
  * we will also have `pos_emb` which will just be basically integers from 0 to t-1, and all of those integers get embedded from the table to create a [T,C] and then logits just take `x`, where `x` is the addition of the `token_emb` and the `pos_emb`, and hence the broadcasting note will then work out -> [B, T, C] + [T, C] gets right aligned and a new dimension of 1 gets added and it gets broadcasted across the batch. So, `x` at this point not only holds the token identities but the positions these tokens occur. This is not that useful because we just have a simple `bigram` model right now so it doesn't matter if we are at the 5th position, it's all translation invariant at this stage so this information currently won't help but as we work on the self-attention block, we'll see that it matters.

### 4.5 The Crux: Version 4: self-attention

* the code we had before just does a simple weight and a simple average of all the past tokens and the current token (the previous and current information is just being mixed in an average) -> our initial below code does the same by creating a lower triangular structure which allows us to mask out the `wei` matrix we create and then we normalize it, and currently when we initialize all the affinities between all the different sort of tokens or nodes to be 0, then we see that `wei` gives us this structure where every single row has uniform numbers.
  * we don't want all of it to be uniform because different tokens will find different other tokens more or less interesting, and we want that to be data dependent when trying to get information from the past
* what self-attention does:
  * every single node or single token at each position will emit two vectors, it will emit a query and a key
  * the query vector: "what am i looking for"
  * the key vector: "what do i contain"
  * the way we get affinities between these tokens now in a sequence is we basically just do a dot product between the keys and the queries to get `wei`
    * if the key and query are aligned, they will interact a very high amount and then we will get to learn more about that specific token than any other token in the sequence
  * when we forward the lineary layer on top of our `x`, all the tokens in all the positions in the [B, T] arrangement -> all of them in parallel and independently produce a `key` and `query` without any communication happening yet
    * when the communication takes place, all the `queries` will dot product with all the `keys`. Hence, we want the affinities or `wei` between these to be query mutltiplying `key` -> for this we need to transpose the last 2 dimensions of `key` to get (B, T, 16) @ (B, 16, T) ---> (B, T, T)
      * for every row of B, we will now have a T^2 matrix giving us the affinities
  * first `wei` was applied the same to all of the batch elements, but now every single batch element will have different sort of `wei` because every single batch elements contains different tokens at differen positions so now it is data dependent and not uniform
    * for e.g: the 8th row knows what content it has and at what position it's in. Now the 8th token based on that creates a `query` (e.g: "i'm a vowel, i'm at the 8th position and i'm looking for any consonants at positions up to 4". Then all the nodes get to emit the `keys` and maybe one of the channels will be "i'm a consonant and in a position upto 4", and that `key` will have a high number in that specific channel and that's how when the `key` and `query` dot product they can find each other and create a high affinity -> when they have a high affinfity, then through `softmax` we will end up aggregating a lot of its information into my position, so i'll get to learn a lot about it) -> we now get a nice distributuin that sums to 1. This is now telling us in a data dependent manner how much of information to aggregate from these tokens in the past
  * another part to self-attention head: when we do aggregation, we don't aggregate the tokens exactly, we produce one more value and we call that `value` and instead of matrix multiplying `wei` with `x`, we just calculate `v` which is achieved by propogating the linear layer on top of `x` again and then we output `wei @ v`. Hence, `v` is the elements that we aggregate or the vector we aggregate instead of the raw `x` making the output of the single head as 16D as that is `head_size`.
  * So, we can think of `x` as kind of like private information to the token

In [24]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
# wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1) # exponentiate and normalize the negative values

v = value(x)
out = wei @ v
# out = wei @ x

out.shape

torch.Size([4, 8, 16])

In [25]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
        [0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
        [0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
        [0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
       grad_fn=<SelectBackward0>)

**Notes:**

1.  Attention is a communication mechanism. Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.

  <img src = "https://raw.githubusercontent.com/Raahim58/Neural-networks/main/images/self-attention.png" width = 300 height = 300>

  * our graph doesn't look like this:
    * we have 8 nodes because our `block_size` is 8, and hence there are always 8 tokens.
    * the first node is only pointed to by itself, and the second node is pointed to by the first node and by itself all the way upto 8th node which is pointed to by all the previous nodes and by itself
    * attention can be applied to any arbitrary directed graph and it's just a communication mechansim between the nodes.

2. There is no notion of space. Attention simply acts over a set of vectors. This is why we need to positionally encode tokens.
  * by default these nodes have no idea where they are positioned in space and that's why we need to encode them positionally and give the some information that is anchored to a specific position so that they know where they are
  * that is different from convolution, as in convolution there is a very specific layout of information in space and the convolutional filters act in space.
  * in attention, there is just a set of vectors out in space, they communicate, and if you want them to have a notion of space you need to specifically add it, which is what we did when we calculated the relative position code encodings and added that information to the vectors

3. Each example across batch dimension is of course processed completely independently and never "talk" to each other
  * since the `batch_size` is 4, we have 4 seperate pools of 8 nodes and those eight nodes only talk to each other, but in total they're like 32 nodes that are being processed but there's basically 4 pools of 8

4. In an "encoder" attention block just delete the single line that does masking with tril, allowing all tokens to communicate. This block here is called a "decoder" attention block because it has triangular masking, and is usually used in autoregressive settings, like language modeling.
  * in the attention block, we have a specific structure of directed graph where the future tokens will not communicate to the past tokens, but this doesn't necessarily have to be the constraint in the general case. In fact, in many cases we might want to have the nodes talk to each other fully (e.g in sentiment analysis where they might be a lot of number of tokens, and we may want to have them all talk to each other fully because later you are predicting the sentiment of the sentence and so it's okay for the nodes to talk to each other
  * in those cases, we will use an encoder block of self-attention and all it means that it's an encoder block is that we will delete:
  ```python
  wei = wei.masked_fill(tril == 0, float('-inf')) # deleting will allow all the nodes to completely to talk to each other (in encoder)
  ```
  * what we're implementing above is usually called the decoder block and it's called the decoder because it's sort of like a decoding language and it's got an auto regressive format where one has to mask with the triangular matrix so that nodes from the future never talk to the past because that would give away the answer. So in the encoder block you delete the above line of code allowing all the nodes to talk to each other, and in the decoder block it stays so we have the triangular structure
    * attention, however, doesn't care. Attention supports arbitrary connectivity between nodes.

5. "self-attention" just means that the keys and values are produced from the same source (`x`) as queries. In "cross-attention", the queries still get produced from `x`, but the keys and values come from some other, external source (e.g. an encoder module)
  * attention is more general than that. In encoder & decoder transformers, for example, you can have a case where queries are produced from `x` but the `keys` and `queries` come from a whole seperate external source and sometimes from the the encoder blocks that encode some context that we'd like to condition on. So, the `keys` and the `values` will actually come from a whole seperate source which are nodes on the side and we're just producing `queries` and we're reading off information from the side so `cross-attention` is used when there's a seperate source of nodes we're like to pool information from into our nodes.

6. "Scaled" attention additional divides wei by 1/sqrt(head_size). This makes it so when input Q,K are unit variance, `wei` will be unit variance too and Softmax will stay diffuse and not saturate too much.
  * it's an important normalization to have. The problem is if you have unit gaussian inputs and if you do `wei` naively, then you see that the `var` willl be on the order of `head_size` which in our case is 16, but if you multiply that by 1/sqrt(head_size), then the `variance` will be 1 hence being preserved
  * this is important because `wei` feeds into `softmax` so it's really important especially at initialization, that `wei` be fairly diffused
    * the problem is because of `softmax`, the `wei` takes on very positive and very negative numbers inside it. `softmax` will actually converge towards one-hot vectors
    * if we start sharpening the numbers and making them bigger by multiplying the numbers by 8, the `softmax` will start to sharpen, and it will sharpen towards the max so it will shapen toward whatever number in the `softmax` feeding in is the highest. We don't want these values at intialization to be too extreme otherwise `softmax` will be way too peaky and you're basically aggregating the information from a single node. Every node just aggregates information from a single other node which is not what we want, especially at initialization. So, the scaling is used just to control the variance at initialization.


Illustration below

In [26]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [27]:
k.var(), q.var(), wei.var()

(tensor(1.0449), tensor(1.0700), tensor(1.0918))

In [28]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [29]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

## 5. Addings more features

### 5.1 Scaled product dot attention

* making one more change in the `head` block:
  * in `generate`, we have to make sure that the `idx` that we feed into the model, since we're using positional embeddings, we can never have more than the `block_size` coming in because if `idx` is more than the `block_size`, then our position embedding table will run out of scoprt because it only had embeddings for up to `block_size`, and so therefore we crop the context that we're going to feed into self so that we never pass in more than `block_size` elements
  * we also decrease the `lr` because the self-attention can't tolerate very high `lr`
  * we also increase the number of iterations because the `lr` is lower and then we train it. Previously, we got the `loss` as **2.5** and now we are down to **2.4**

      <img src = "https://raw.githubusercontent.com/Raahim58/Neural-networks/main/images/scaled%20product%20attention.png" width = 600 height = 400>

In [30]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out


### 5.2 Multi-head attention

* it's applying multiple heads in attention and concatenating the results

  <img src = "https://raw.githubusercontent.com/Raahim58/Neural-networks/main/images/multi%20head%20attention.png" width = 400 height = 400>

* how many heads do we want? and what's the size for each head? Then we run all of them in parallel in a list and simply concatenate the outputs over the channel dimension
* instead of having one communication channel, we now have 4 communication channels in parallel and each of these communication channels typically will be smaller correspondingly -> because, we have 4 communication channels, we want 8D self-attention. So, from each communication channel we're going to gather 8D vectors and then we have 4 of them and that concatenates to give us 32 which is the original `n_embd`
  * this like a group convolution because basically instead of having a one large convolution, we do convolutional groups and that's multi-headed self-attention
  * it helps to create multiple independent channels of communication, gather lots of different types of data, and then decode the output

In [31]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

### 5.3 Feed forward layer

* it's a layer consisting of a linear layer followed by a ReLU non-linearity
* `feed forward` or `ffwd` is called sequentially right after the self-attention so we self-attend, then we feed forward.
* the feed forward when it's applying linear, it's doing it on a per token level. All the tokens do this independently so the self-attention is the communication and then once they've gathered all the data, now they need to think on that data individually
* the validation loss continues to go low going down from **2.28** to **2.24**
* we will not intersperse the communication with the computation and that's also what the transformer does when it has blocks that communicate and then compute, and it groups them and replicates them

In [32]:
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
            nn.Linear(n_embd, n_embd)
        )

    def forward(self, x):
        return self.net(x)


### 5.4 Residual connections

<img src = "https://raw.githubusercontent.com/Raahim58/Neural-networks/main/images/residual%20connection.png" width = 400 height = 400>

* the block intersperses communication and computation
* the communication is done using multi headed self-attention and the computation is done using the feed-forward network on all the tokens independently
* `n_head` is like the group size in group convolution
* since `n_embd` is 32, the `n_head` is 4 and the `head_size` should be 8 so that everything works out channel wise
* when we try to run the code with `Block`, we don't end up getting a very good answer. The reason for that is that we're building a very deep neural net, and deep neural nets suffer from optimization issues
* 2 optimizations that make the transformer optimizable:
  1. residual connections: arrows that skip from one block to the `add & norm` layer which comes from the 2015 paper **Deep Residual learning for Image recognition**
    * what this means is that you transform data but then you have a skip connection with addition from the previous features
    * there is a computation from top to bottom consisting of a residual pathway in which you are free to fork off to perform some computation and then project back to residual pathway via addition. So, you go from the inputs to targets via only plus and plus
      * this is useful because, recall in backpropogation from micrograd video -> addition distributes gradients equally to both of its branches that is as fat as the input, so the supervision or the gradients from the loss hop through every addition node all the way to the input
      * they then also fork off into the residual blocks but basically you have this gradient super highway that goes directly from the supervision all the way to the input unimpeded. These residual blocks are usually initialized so they contribute very very. little if anything to the residual pathway. They are initialized that way so in the beginning they are kind of not there, but during the optimization, they (`block`) come online over time and they start to contribute.
    * we introduce `projection` in the `multiheadattention block`, which is just a linear transformation of the outcome of the `forward` layer in the `multiheadattention block`. So that's the projection back into the residual pathway
    ```python
    self.proj = nn.Linear(n_embd, n_embd)
    ```
    * in the paper the dimensionality for the model is 512 and the dimensionality for the feed-forward layer is 2048 so a multiplier of 4. So, the inner layer of the feed-forwar network should be multiplied by 4 in terms of channel sizes. We add more computation to the block on the side of the residual pathway
    * we trained it to get a val loss of **2.08** and the network is getting bigger, so the train loss is gettinng ahead of validation loss so we're seeing a bit of overfitting -> generations still not great but getting close to English.

In [33]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)

    def forward(self, x):
        x = x + self.sa(x) # sa is self attention, forking off for communication and coming back
        x = x + self.ffwd(x) # ffwd is feed forward, forking off for computation and coming back
        return x

In [34]:
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        return self.net(x)

### 5.5 LayerNorm

* implemented in pyTorch based on a paper
* very similar to BatchNorm whuch just made sure that across the batch dimension, any individual neuron had unit gaussian distribution (0 mean and 1 std)
  * batchNorm guarantees that when we look at just the 0th column, it's a 0 mean 1 std -> normalizing every single column of this input, but the rows won't be initialized by default
* we normalize the rows instead of the columns in LayerNorm. Since our computation does not span across examples, we can delete all the buffers since we can always apply the operation.
  * there is no distinction between training and test time. We do keep gamma and beta but we don't need momentum, we don't care if it's training or not.
* deviating from the original paper:
  * it is now common to apply the LayerNorm before the transformation, instead of after the transformation like in the paper -> pre-Norm Layer implementation

In [35]:
class LayerNorm1d: # (used to be BatchNorm1d)

  def __init__(self, dim, eps=1e-5, momentum=0.1):
    self.eps = eps
    self.gamma = torch.ones(dim)
    self.beta = torch.zeros(dim)

  def __call__(self, x):
    # calculate the forward pass
    xmean = x.mean(1, keepdim=True) # batch mean
    xvar = x.var(1, keepdim=True) # batch variance
    xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
    self.out = self.gamma * xhat + self.beta
    return self.out

  def parameters(self):
    return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

torch.Size([32, 100])

In [36]:
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs

(tensor(0.1469), tensor(0.8803))

In [37]:
x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features

(tensor(-9.5367e-09), tensor(1.0000))


  * when layerNorm is applying the normalization features in `Block`, the mean and the variance are taken over 32 numbers so the batch and time act as batch dimension so this is kind of like a per token transformation that just normalizes the features and makes them unit gaussian at initialization
  * these LayerNorms have gamma and beta trainable parameters inside them, the LayerNorm will eventually create outputs that might not be unit gaussian but the optimization will determine that
  * get down to **2.06** from **2.08** -> will help more with deeper neural nets
  * a layernorm also added at the end of the transformer and right before the final linear layer that decodes into the vocabulary

### 5.6 Scaling up + adding dropout

* introduced `n_layer` which just specifies how many layers of the blocks we're going to have + a new variable `n_heads` in Bigram Language Model
* introduced `Dropout`:
  * it is something that you can add right before the residual connection back into the original pathway
  * this helps us stop some of the nodes from randomly communicating
  * it comes from a 2014 paper: **Dropout: A simple way to prevent neural networks from overfitting**
  * basically it takes your neural net, and it randomly every forward and backward pass shuts off some subset of neurons -> randomly drops them to 0 and trains without them and what this does is, because the mask of being dropped out has changed every single forward backward pass, it ends up training an ensemble of networks and at test time everything is fully enabled and kind of all of those sub networks are merged into a single Ensemble
  * regularization technique and used mainly when scaling models to stop overfitting

## 6. Encoder vs Decoder

In [None]:
# French to English translation example:

# <--------- ENCODE ------------------><--------------- DECODE ----------------->
# les réseaux de neurones sont géniaux! <START> neural networks are awesome!<END>


**1. Decoder-Only Transformer Architecture:**

* The model implemented is a decoder-only transformer.
* There is no encoder component in this architecture.
* The architecture also lacks a cross-attention block.
* The block only contains a self-attention mechanism and a decoder.
* The model is missing the cross-attention piece that is usually present between the encoder and decoder in a traditional transformer.

**2. Explanation of the Decoder:**

* A decoder-only model is used in this implementation because the model is focused solely on text generation.
* The generation process is unconditioned, meaning it operates without external input or constraints, just "blabbering on" based on guesses.
* The model uses a triangular mask in its transformer architecture.
This mask gives the model its autoregressive property, allowing it to generate text token-by-token.
* The triangular mask prevents future tokens from being attended to during the generation process, ensuring that predictions are made based only on the past tokens.


**3. Difference Between Encoder-Decoder and Decoder-Only:**

* The original paper on transformers presented an encoder-decoder architecture for machine translation.
* In that context, the model is designed to translate from one language (e.g., French) to another (e.g., English).
* The encoder processes the input sentence (in French), while the decoder generates the translated sentence (in English).
* Typically, the process starts with special tokens that guide the translation process:
* The model reads and conditions on the input tokens (French sentence).
* A special token is introduced at the beginning to start the generation process.
* The model then generates output tokens (English sentence) and concludes with an end token.
* The actual generation process in both the encoder-decoder and the decoder-only models is the same, but the encoder-decoder model is conditioned on additional input information.

**4. Working of the Encoder-Decoder Model:**

* In a machine translation model, the encoder processes the French sentence and creates token representations from it.
* The transformer processes these tokens without the triangular mask, allowing the tokens to attend to each other freely.
* After encoding the French sentence, the model generates an output from the decoder.
* The decoder not only relies on past tokens for prediction but also takes in the fully encoded French prompt through cross-attention.
* Cross-attention integrates information from the encoder (French sentence) with the current state of the decoder (English sentence generation).

**5. Why the Implemented Model Doesn’t Use an Encoder:**

* The presented model does not include an encoder because there’s nothing to encode – it doesn’t require conditioning on any additional input.
* The model's purpose is to imitate text generation from a given text file, making the encoder unnecessary.
This decoder-only transformer architecture is similar to the one used in GPT (Generative Pretrained Transformer).

This structure explains the workings of both decoder-only and encoder-decoder transformers and the specific choice of architecture for text generation tasks.

## 7. Full training code (Bigram Model)

In [39]:
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2
# ------------

torch.manual_seed(1337)

# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# super simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
# Writing the generated text to 'more.txt'
open('output.txt', 'w').write(decode(m.generate(context, max_new_tokens=10000)[0].tolist()))

10.788929 M parameters
step 0: train loss 4.2849, val loss 4.2823
step 500: train loss 2.0112, val loss 2.0971
step 1000: train loss 1.6021, val loss 1.7830
step 1500: train loss 1.4412, val loss 1.6396
step 2000: train loss 1.3430, val loss 1.5724
step 2500: train loss 1.2809, val loss 1.5330
step 3000: train loss 1.2268, val loss 1.5094
step 3500: train loss 1.1824, val loss 1.4881
step 4000: train loss 1.1475, val loss 1.4869
step 4500: train loss 1.1108, val loss 1.4805
step 4999: train loss 1.0779, val loss 1.4920

But with prison: I will stead with you.

ISABELLA:
Carress, all do; and I'll say your honour self good:
Then I'll regn your highness and
Compell'd by my sweet gates that you may:
Valiant make how I heard of you.

ANGELO:
Nay, sir, Isay!

ISABELLA:
I am sweet men sister as you steed.

LUCIO:
As it if you in the case would princily,
I'll rote, sir, I did cannot now at me?
That look thence, thy children shall be you called.

DUKE VINCENTIO:
Marry, though I do read you!

LU

10001

In [41]:
with open('output.txt', 'r') as f:
    print(f.read())



Lord Aufidius, were you how:
Here lie in straight it Rome would strike it.
There, Tybalt still Aufidius. This light our holy
Is wandering into his foul more any tale.
Romeo and all this? an I knew that hand this
Frown, which shall I tell deseter at the business
As I, by my rastory: if was an hour.
O, we cannot truly bow this of you: marry, I spy
I spake to steal.

BENVOLIO:
Not fellow, go with chexents
Attend never les under wit's in habouts' fool.

MERCUTIO:
Was contract is dead? another somedy.

CORIOLANUS:
They love be done to the heart, the art glorion.

Messenger:
Thou hast wounds now, thy souly garled, and so I condem'd
talk flest youth.

HERMIONE:
Let's the day comfort intell.

First Walconspirator:
Can do you the mother?

BENVOLIO:
What is there?

Pray, pale:
These eyesle, fear you comes to in the king;
For I feel these it lies ones;
He craft depose away come infire; and be it best,
And by the greatest lords sight to her walls men
To blush grassely. Come, good friar.
O miserab

In [42]:
from google.colab import files
files.download('output.txt')


<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

## 8. Walkthrough of additional stuff

### 8.1 nanoGPT repo



https://github.com/karpathy/nanoGPT

* most stuff in `model.py` similar:
  * implements complex stuff in batches being added as the 4th dimension
  * uses the GeLU non-linearity instead of ReLU to be able to load openAI checkpoints
* most stuff in `train.py` is implemented in same way but more complex

### 8.2 Training chatGPT ourselves

<img src = "https://raw.githubusercontent.com/Raahim58/Neural-networks/main/images/openAI.png" height = 500 width = 500>

https://openai.com/blog/chatgpt/

**1. Pre-Training and Fine-Tuning Stages of ChatGPT**:
- **Training of ChatGPT** happens in two major stages: **pre-training** and **fine-tuning**.
  - **Pre-training** involves training the model on a vast amount of internet data to get the decoder-only transformer to generate text.
  - This pre-training phase is similar to what we've implemented ourselves, albeit at a much smaller scale.

**2. Pre-training Example (Our Shakespeare Transformer)**:
- The Shakespeare transformer we trained has **10 million parameters**.
- The dataset we used for training consists of roughly **1 million characters**.
- However, in larger AIs, models like GPT do not use a character model. Instead, they use **sub-word chunks**, with a vocabulary size of around **50,000 tokens**.
  - In our case, the Shakespeare dataset equates to approximately **300,000 tokens**.

**3. Large-Scale Pre-training in GPT**:
- In comparison, large transformers, like GPT-3, have **175 billion parameters**.
- GPT-3 was trained on a dataset of around **300 billion tokens**, much larger than our **300,000 tokens**.
  - Today, modern models are being trained on datasets closer to **1 trillion tokens**.
- Training such a large model requires a massive infrastructure, typically involving thousands of GPUs that communicate with one another.

**4. Output of Pre-Training Stage**:
- After pre-training, the model doesn't produce useful answers but instead "babbles" text.
  - The output is not aligned to user questions and simply completes sequences, similar to completing a document.
  - Without fine-tuning, the model might answer a question with another question or ignore it entirely.

**5. Fine-Tuning Stage of ChatGPT**:
- After pre-training, the model undergoes **fine-tuning** to transform it into a useful assistant.
- **Fine-tuning** involves aligning the model to expect **question and answer** formats.
  - OpenAI collects datasets where questions are on top and answers are below.
  - These datasets contain **thousands of examples**, not as large as the pre-training data.
  - The model is fine-tuned to complete answers after questions in a structured way.
  - Large models are **sample-efficient**, making fine-tuning feasible even with smaller datasets.

**6. Reinforcement Learning and Reward Models**:
- **Step 1 of fine-tuning**: The model responds to queries, and different **raters rank responses** based on their quality.
  - A **reward model** is trained to predict how desirable each response is.
- **Step 2 of fine-tuning**: The reward model is used to fine-tune the main model further through **PPO (Policy Gradient Reinforcement Learning)**.
  - This aligns the model to generate answers that score high rewards according to the reward model.

**7. Final Output After Fine-Tuning**:
- After fine-tuning, the model transitions from being a document completer to a **question-answering assistant**.
- Most of the fine-tuning data is **internal to OpenAI** and not available publicly, making it challenging to replicate.

**8. Summary**:
- Today, we implemented a **decoder-only transformer**, which mimics the structure of models like **GPT**.
- However, large-scale models like ChatGPT require both **pre-training** and **fine-tuning**, with infrastructure challenges and reinforcement learning involved in fine-tuning.


## Conclusion

**1. Summary of the Session**:
- We trained a **decoder-only transformer**, following the architecture from the famous **GPT paper** released in 2017.
- The model was trained on the **Tiny Shakespeare dataset**, and we achieved sensible results.
  - The complete training code is around **200 lines**.
  - This codebase will be released soon, including **Git log commits** detailing the step-by-step process.

**2. Resources**:
- The model we trained is similar in architecture to **GPT-3**, but GPT-3 is **10,000 to 1 million times larger**, depending on how you measure it.

**3. Focus of the Lecture**:
- This lecture focused primarily on **language modeling** and the training of transformers.
  - We did not delve into **fine-tuning** or advanced tasks such as **sentiment detection** or **task-specific alignment**.
  - For more complex tasks beyond language modeling, **supervised fine-tuning** or more sophisticated approaches, such as **reinforcement learning with PPO**, are required.

**4. Fine-Tuning and Advanced Techniques**:
- To move beyond basic language modeling and make the model capable of specific tasks, **fine-tuning** is necessary.
  - For example, **sentiment detection**, task-specific alignment, or the advanced **reward model training** seen in ChatGPT requires fine-tuning on a specialized dataset.
  - **PPO (Policy Optimization)** can further align the model using reinforcement learning.

**5. Conclusion and Next Steps**:
 - There's much more to explore in terms of **fine-tuning**, **reinforcement learning**, and **task-specific training**.

