# StatQuest: Training an LLM from Scratch with Pre-Training and Reinforcement Learning with Human Feedback (RLHF)

Copyright 2025, Joshua Starmer

----

In this tutorial we train a very simple **LLM** from scratch with **Pre-Training** and **Reinforcement Learning with Human Feedback** (**RLHF**). This means we'll first **Pre-Train** a simple **LLM**, then we'll use that **LLM** to create a **Reward Model**, which will be trained with human preference data. Lastly, we'll use the **Reward Model** to train the original, **Pre-Trained LLM** with **Reinforcement Learning** so that it responds appropriately to prompts it has never seen before.

This example is based on the StatQuest video: **[Reinforcement Learning with Human Feedback, Clearly Explained!!!](https://youtu.be/qPN_XZcJf_s)**

<img src="https://github.com/StatQuest/RLHF/blob/main/images/rlhf_intro.png?raw=1" alt="we will train an llm with pre-training and rlhf" style="width: 800px;">

**NOTE:** The **LLM** that we are going to train is a **Decoder-Only Transformer** and if you would like to learn more about how it works and how to code it, please see these StatQuest videos...
- **[Neural Networks Part 15: Decoder-Only Transformers (like ChatGPT)!!!](https://youtu.be/bQ5BoolX9Ag)**
- **[The Matrix Math Behind Transformer Neural Networks, One Step at a Time!!!](https://youtu.be/KphmOJnLAdI)**
- **[Coding a ChatGPT Like Transformer From Scratch](https://youtu.be/C9QSpl5nmrY)**
  
...as well as this **[GitHub Repository](https://github.com/StatQuest/decoder_transformer_from_scratch)**.

**ALSO NOTE:** We're omitting the **Fine-Tuning** stage from this tutorial because, fundamentally, the techniques used for **Fine-Tuning** are not significantly different from **Pre-Training**.

----

# Import the modules that will do all the work

The very first thing we need to do is load a bunch of Python modules. Python itself is just a basic programming language. These modules give us extra functionality to create and train the neural networks we'll use.

**NOTE:** The code below will check and see if **Lightning** is installed, and if not, it will install it for you. However, if you also need to install PyTorch, check out their install page **[here.](https://pytorch.org/get-started/locally/)**

In [None]:
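## First, check to see if lightning is installed, if not, install it.
## NOTE: pip.main() is not available in modern versions of pip, so we
##       run pip as a module with the current Python interpreter instead.
import importlib.util
import subprocess
import sys

if importlib.util.find_spec("lightning") is None:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "lightning"])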

import torch ## torch lets us create tensors and also provides helper functions like argmax()
import torch.nn as nn ## torch.nn gives us nn.Module(), nn.Embedding() and nn.Linear()
import torch.nn.functional as F ## This gives us softmax()
from torch.optim import Adam ## We will use the Adam optimizer, which is, essentially,
                             ## a slightly less stochastic version of stochastic gradient descent.
from torch.utils.data import TensorDataset, DataLoader ## We'll store our data in DataLoaders

import lightning as L ## Lightning makes it easier to write, optimize and scale our code

----

# Define the model dimension and number of tokens we can process

In [None]:
## In this example we're increasing the model dimension to 4, meaning
## each token will have 4 values associated with it.
model_dimension = 4
if (model_dimension % 2 != 0):
    print("NOTE: Due to how position encoding is coded, model_dimension must be an even number.")

## We're also increasing the number of tokens our model can handle
max_length = 10

----

# Build vocabulary and helper functions

In [None]:
## In this example, we're simplifying
## how to add tokens to our vocabulary
tokens = ['what',
          'is',
          'statquest',
          'awesome',
          'squatch',
          'eats',
          'pizza',
          'norm',
          '<EOS>',
          '<PAD>']
tokens

In [None]:
len(tokens)

In [None]:
ids = list(range(len(tokens)))
ids

In [None]:
token_to_id = dict(zip(tokens, ids))
id_to_token = dict(map(reversed, token_to_id.items()))

In [None]:
## In this example, we're simplifying how to convert tokens to ids and ids to tokens.
## I got this idea from looking at some of Andrej Karpathy's code
def tokens2ids(tokens):
    output = []
    for token in tokens.split():
        output.append(token_to_id[token])

    return output

In [None]:
def ids2tokens(ids):
    output = []
    for id in ids:
        output.append(id_to_token[id])

    return " ".join(output)

In [None]:
## Now let's test the functions out by converting
## a prompt into ids...
tokens2ids("what is statquest <EOS>")

In [None]:
## ...and then converting those ids back into tokens
ids2tokens(tokens2ids("what is statquest <EOS>"))

----

# Create Training Data

In [None]:
## We've also got a somewhat larger pre-training dataset than we had before

pretrain_inputs = torch.tensor([tokens2ids("what is statquest <EOS> awesome"),
                                tokens2ids("statquest is what <EOS> awesome"),
                                tokens2ids("what is norm <EOS> awesome"),
                                tokens2ids("what is squatch <EOS> awesome"),
                                tokens2ids("norm is what <EOS> awesome"),
                                tokens2ids("squatch is what <EOS> awesome"),
                                tokens2ids("squatch eats what <EOS> pizza")])
pretrain_inputs

In [None]:
## We can verify that the inputs were created correctly
## by converting the first row of pretrain_inputs ids back into tokens.
ids2tokens(pretrain_inputs[0].numpy())

In [None]:
pretrain_labels = torch.tensor([tokens2ids("is statquest <EOS> awesome <EOS>"),
                                tokens2ids("is what <EOS> awesome <EOS>"),
                                tokens2ids("is norm <EOS> awesome <EOS>"),
                                tokens2ids("is squatch <EOS> awesome <EOS>"),
                                tokens2ids("is what <EOS> awesome <EOS>"),
                                tokens2ids("is what <EOS> awesome <EOS>"),
                                tokens2ids("eats what <EOS> pizza <EOS>")])
pretrain_labels

In [None]:
## We can verify that the labels were created correctly
## by converting the first row of pretrain_labels ids back into tokens.
ids2tokens(pretrain_labels[0].numpy())

In [None]:
## Now let's package everything up into a DataLoader...
pretrain_dataset = TensorDataset(pretrain_inputs, pretrain_labels)
pretrain_dataloader = DataLoader(pretrain_dataset)
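
Each row of labels above is just the matching row of inputs shifted one token to the left, with `<EOS>` added to the end, because at every position the model is trained to predict the next token. Here is a minimal sketch of that relationship. The helper `make_labels()` is hypothetical and is not part of the original notebook; it only illustrates the pattern.

```python
## NOTE: make_labels() is a hypothetical helper used only for illustration.
def make_labels(input_text):
    input_tokens = input_text.split()
    ## Drop the first token and add <EOS> to the end, so that
    ## the label at each position is the token that comes next.
    label_tokens = input_tokens[1:] + ["<EOS>"]
    return " ".join(label_tokens)

print(make_labels("what is statquest <EOS> awesome"))
print(make_labels("squatch eats what <EOS> pizza"))
```

The two printed strings match the first and last rows passed to `tokens2ids()` when creating `pretrain_labels`.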

----

# Code for a basic Decoder-Only Transformer

**NOTE:** For details on how this code works, see these StatQuest videos...
- **[Neural Networks Part 15: Decoder-Only Transformers (like ChatGPT)!!!](https://youtu.be/bQ5BoolX9Ag)**
- **[The Matrix Math Behind Transformer Neural Networks, One Step at a Time!!!](https://youtu.be/KphmOJnLAdI)**
- **[Coding a ChatGPT Like Transformer From Scratch](https://youtu.be/C9QSpl5nmrY)**
  
...as well as this **[GitHub Repository](https://github.com/StatQuest/decoder_transformer_from_scratch)**.

---



In [None]:
class PositionEncoding(nn.Module):

    def __init__(self, d_model=2, max_len=6):
        ## d_model = The dimension of the transformer, which is also the number of embedding values per token.
        ##           In the transformer I used in the StatQuest: Transformer Neural Networks Clearly Explained!!!
        ##           d_model=2, so that's what we'll use as a default for now.
        ##           However, in "Attention Is All You Need" d_model=512
        ## max_len = maximum number of tokens we allow as input.
        ##           Since we are precomputing the position encoding values and storing them in a lookup table
        ##           we can use d_model and max_len to determine the number of rows and columns in that
        ##           lookup table.

        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(start=0, end=max_len, step=1).float().unsqueeze(1)

        embedding_index = torch.arange(start=0, end=d_model, step=2).float()
        div_term = 1/torch.tensor(10000.0)**(embedding_index / d_model)

        pe[:, 0::2] = torch.sin(position * div_term) ## every other column, starting with the 1st, has sin() values
        pe[:, 1::2] = torch.cos(position * div_term) ## every other column, starting with the 2nd, has cos() values
        ## Now we "register" 'pe'.
        self.register_buffer('pe', pe) ## "register_buffer()" ensures that
                                       ## 'pe' will be moved to wherever the model gets
                                       ## moved to. So if the model is moved to a GPU, then,
                                       ## even though we don't need to optimize 'pe', it will
                                       ## also be moved to that GPU. This, in turn, means
                                       ## that accessing 'pe' will be relatively fast compared
                                       ## to having a GPU have to get the data from a CPU.

    def forward(self, word_embeddings):

        return word_embeddings + self.pe[:word_embeddings.size(0), :] ## word_embeddings.size(0) = number of embeddings
                                                                      ## NOTE: That second ':' is optional and
                                                                      ## we could re-write it like this:
                                                                      ## self.pe[:word_embeddings.size(0)]
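
To make the lookup table in `PositionEncoding` more concrete, here is a minimal sketch that computes the same sine and cosine values with plain Python instead of tensors. The function name `position_encoding_table()` and the small sizes (`d_model=4`, `max_len=3`) are assumptions chosen for illustration, and, like the class above, it assumes `d_model` is an even number.

```python
import math

## NOTE: position_encoding_table() is a hypothetical helper used only for illustration.
def position_encoding_table(d_model=4, max_len=3):
    table = []
    for position in range(max_len):
        row = [0.0] * d_model
        ## embedding_index takes the values 0, 2, 4... just like
        ## torch.arange(start=0, end=d_model, step=2)
        for embedding_index in range(0, d_model, 2):
            div_term = 1 / (10000.0 ** (embedding_index / d_model))
            row[embedding_index] = math.sin(position * div_term)      # sin() in the even columns
            row[embedding_index + 1] = math.cos(position * div_term)  # cos() in the odd columns
        table.append(row)
    return table

for row in position_encoding_table():
    print([round(value, 2) for value in row])
```

The first row, for position 0, is always `0, 1, 0, 1` because `sin(0) = 0` and `cos(0) = 1`. Later rows change quickly in the first pair of columns and slowly in the second pair, which is what gives each position its own pattern.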

In [None]:
class Attention(nn.Module):

    def __init__(self, d_model=2):
        ## d_model = the number of embedding values per token.
        ##           In the transformer I used in the StatQuest: Transformer Neural Networks Clearly Explained!!!
        ##           d_model=2, so that's what we'll use as a default for now.
        ##           However, in "Attention Is All You Need" d_model=512


        super().__init__()

        self.d_model=d_model

        self.W_q = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        self.W_k = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        self.W_v = nn.Linear(in_features=d_model, out_features=d_model, bias=False)

        self.row_dim = 0
        self.col_dim = 1


    def forward(self, encodings_for_q, encodings_for_k, encodings_for_v, mask=None):

        q = self.W_q(encodings_for_q)
        k = self.W_k(encodings_for_k)
        v = self.W_v(encodings_for_v)

        ## Compute attention scores
        ## the equation is (q * k^T)/sqrt(d_model)
        sims = torch.matmul(q, k.transpose(dim0=self.row_dim, dim1=self.col_dim))

        scaled_sims = sims / torch.tensor(k.size(self.col_dim)**0.5)

        if mask is not None:
            scaled_sims = scaled_sims.masked_fill(mask=mask, value=-1e9) # I've also seen -1e20 and -9e15 used in masking

        attention_percents = F.softmax(scaled_sims, dim=self.col_dim)

        attention_scores = torch.matmul(attention_percents, v)

        return attention_scores
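
The masking step in `Attention.forward()` can be worked through by hand. The sketch below uses plain Python and made-up similarity values (they are assumptions, not numbers produced by the model) to show how replacing a masked value with `-1e9` makes its softmax percentage 0, so each token can only pay attention to itself and the tokens that came before it.

```python
import math

def softmax(values):
    largest = max(values)  # subtract the largest value for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

## Hypothetical scaled similarities for 3 tokens (rows = queries, columns = keys)
scaled_sims = [[0.5, 1.0, 2.0],
               [1.5, 0.2, 0.7],
               [0.3, 0.9, 0.1]]
num_tokens = len(scaled_sims)

## True means "mask this value". This is the same pattern we get from
## torch.tril(torch.ones(...)) == 0: everything above the diagonal is True.
mask = [[col > row for col in range(num_tokens)] for row in range(num_tokens)]

masked_sims = [[-1e9 if mask[row][col] else scaled_sims[row][col]
                for col in range(num_tokens)]
               for row in range(num_tokens)]

attention_percents = [softmax(row) for row in masked_sims]
for row in attention_percents:
    print([round(percent, 2) for percent in row])
```

In the first row, only the first token is visible, so it gets 100% of the attention. In the last row nothing is masked, so all three tokens share the attention.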

In [None]:
class DecoderOnlyTransformer(L.LightningModule):

    def __init__(self, num_tokens=4, d_model=2, max_len=6):

        super().__init__()
        L.seed_everything(seed=42, workers=True)

        self.we = nn.Embedding(num_embeddings=num_tokens,
                               embedding_dim=d_model)

        self.pe = PositionEncoding(d_model=d_model,
                                   max_len=max_len)

        self.self_attention = Attention(d_model=d_model)

        self.fc_layer = nn.Linear(in_features=d_model,
                                  out_features=num_tokens)

        self.loss = nn.CrossEntropyLoss()


    def forward(self, token_ids):

        word_embeddings = self.we(token_ids)

        position_encoded = self.pe(word_embeddings)

        mask = torch.tril(torch.ones((token_ids.size(dim=0), token_ids.size(dim=0)), device=self.device))
        mask = mask == 0

        self_attention_values = self.self_attention(position_encoded,
                                                    position_encoded,
                                                    position_encoded,
                                                    mask=mask)

        residual_connection_values = position_encoded + self_attention_values

        fc_layer_output = self.fc_layer(residual_connection_values)

        return fc_layer_output


    def configure_optimizers(self):
        return Adam(self.parameters(), lr=0.1)


    def training_step(self, batch, batch_idx):
        input_tokens, labels = batch # collect input
        output = self.forward(input_tokens[0])
        loss = self.loss(output, labels[0])

        return loss

----

# Create and test a basic Decoder-Only Transformer

In [None]:
## First, create a model from DecoderOnlyTransformer()
model = DecoderOnlyTransformer(num_tokens=len(tokens),
                               d_model=model_dimension,
                               max_len=max_length)

In [None]:
## Because this tutorial involves creating a bunch of models
## and seeing what kind of output they generate in response
## to different prompts, we're writing a function to handle
## generating output from any model.

def generate_output(model, prompt):
    input_length = prompt.size(dim=0)

    predictions = model(prompt)
    predicted_id = torch.tensor([torch.argmax(predictions[-1,:])])
    predicted_ids = predicted_id

    for i in range(input_length, max_length):
        if (predicted_id == token_to_id["<EOS>"]): # if the prediction is <EOS>, then we are done
            break

        prompt = torch.cat((prompt, predicted_id))

        predictions = model(prompt)
        predicted_id = torch.tensor([torch.argmax(predictions[-1,:])])
        predicted_ids = torch.cat((predicted_ids, predicted_id))

    print("Predicted Tokens:\n")
    print("\t", ids2tokens(predicted_ids.numpy()))

Now that we have a nice helper function for generating output from our model, let's try it out.

In [None]:
## Now test out the transformer...
generate_output(model, torch.tensor(tokens2ids("what is statquest <EOS>")))

Since the output from the model isn't ideal, let's train it!

----

# Train and test the model

In [None]:
trainer = L.Trainer(max_epochs=30, deterministic=True)
trainer.fit(model, train_dataloaders=pretrain_dataloader)

In [None]:
## Now test out the transformer and see if training was successful...
generate_output(model, torch.tensor(tokens2ids("what is statquest <EOS>")))

Hooray! Our model is generating the correct output for the first prompt. Now let's try a different prompt.

In [None]:
generate_output(model, torch.tensor(tokens2ids("statquest is what <EOS>")))

Double Hooray!! That worked too! Now that we have successfully trained, or **pre-trained**, our model, we can use it to create a **Reward Model**. The **Reward Model** will, ultimately, be trained, with **Human Feedback**, to score outputs generated by our model. These scores will then be used to train our model with **Reinforcement Learning** to respond appropriately to prompts it was never trained on.

**NOTE:** Typically you **Fine-tune** a model before using it to create a **Reward Model**. However, we're skipping that step because the techniques are essentially the same as **Pre-Training**.

----

# Create and train a Reward Model

The **Reward Model** is a copy of the original model, but with the last layer, the **Fully Connected Layer** that connects the attention layer to the output vocabulary, replaced with a much simpler **Fully Connected Layer** that just has a single output. This **Fully Connected Layer** is called a **Regression Layer**, since it outputs a single continuous value.

In [None]:
class RewardModel(L.LightningModule):

    def __init__(self, num_tokens=4, d_model=2, max_len=6):

        super().__init__()
        L.seed_everything(seed=42, workers=True)

        self.we = nn.Embedding(num_embeddings=num_tokens,
                               embedding_dim=d_model)

        self.pe = PositionEncoding(d_model=d_model,
                                   max_len=max_len)

        self.self_attention = Attention(d_model=d_model)

        self.regression_layer = nn.Linear(in_features=d_model,
                                          out_features=1)


    def forward(self, token_ids):

        word_embeddings = self.we(token_ids)
        position_encoded = self.pe(word_embeddings)

        mask = torch.tril(torch.ones((token_ids.size(dim=0), token_ids.size(dim=0)), device=self.device))
        mask = mask == 0

        self_attention_values = self.self_attention(position_encoded,
                                                    position_encoded,
                                                    position_encoded,
                                                    mask=mask)

        residual_connection_values = position_encoded + self_attention_values

        regression_layer_output = self.regression_layer(residual_connection_values)

        return regression_layer_output


    def configure_optimizers(self):
        return Adam(self.parameters(), lr=0.1)


    def training_step(self, batch, batch_idx):
        input_tokens, labels = batch # collect input

        output_better, output_worse = labels[0].split(2)

        input_better = torch.cat((input_tokens[0], output_better))
        input_worse = torch.cat((input_tokens[0], output_worse))

        reward_better = self.forward(input_better)
        reward_worse = self.forward(input_worse)
        ## the equation for the loss is...
        ## -1 * log(sigmoid(reward_better - reward_worse))
        ## For details, see: https://youtu.be/qPN_XZcJf_s
        ## NOTE: reward_better and reward_worse are arrays with
        ##       scores for each token. We only want the score for the
        ##       last token, so we index that with [-1]
        loss = -1 * torch.log(torch.sigmoid(reward_better[-1] - reward_worse[-1]))

        self.log("train_loss", loss)

        return loss

In [None]:
reward_model = RewardModel(num_tokens=len(tokens),
                           d_model=model_dimension,
                           max_len=max_length)

When we first create the **Reward Model** it is initialized with random **weights** and **biases**. For example, these are the initial **weights** used in the **Word Embedding Layer**...

In [None]:
reward_model.we.weight

...however, what we want, is for the **Weights** to be the same as the ones in the Decoder-Only Transformer that we just trained. So, we can copy the **Weights** like this...

In [None]:
reward_model.we.weight = model.we.weight

...and we can verify that the **Weights** have changed, and now match the ones in the trained model, by printing them out again...

In [None]:
reward_model.we.weight

# BAM!

Now let's copy all of the other **Weights** in the model.

**NOTE:** The only **Biases** in the model occur in the final **Fully Connected Layer**, which we are not copying over to the **Reward Model** because we are replacing that layer with a new **Regression Layer**.

In [None]:
reward_model.self_attention.W_q.weight = model.self_attention.W_q.weight

In [None]:
reward_model.self_attention.W_k.weight = model.self_attention.W_k.weight

In [None]:
reward_model.self_attention.W_v.weight = model.self_attention.W_v.weight

Now we can verify that we successfully copied all of the **Weights** used for **Word Embedding** and calculating **Attention** by using for loops to print out all of the parameters (trainable **Weights** and **Biases**) in both models:

In [None]:
## Now print out the name and value for each named
## parameter in the model. Remember parameters are variables,
## like Weights and Biases, that we can train.
for name, param in model.named_parameters():
    print(name, torch.round(param.data, decimals=2))

In [None]:
## Now print out the name and value for each named
## parameter in the Reward Model. Remember parameters are variables,
## like Weights and Biases, that we can train.
for name, param in reward_model.named_parameters():
    print(name, torch.round(param.data, decimals=2))
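
One detail worth knowing about assigning one layer's `weight` to another layer, as we did above: in PyTorch the assignment makes both layers use the very same `Parameter` object, so any later update to one is also seen by the other. In this notebook, that means training the **Reward Model** will also update the shared **Weights** in `model`. The sketch below illustrates the difference between sharing and making an independent copy. The three small `nn.Linear` layers (`source`, `shared` and `copied`) are hypothetical stand-ins, not layers from the models above.

```python
import torch
import torch.nn as nn

## Hypothetical stand-ins for layers in two different models
source = nn.Linear(in_features=2, out_features=2, bias=False)
shared = nn.Linear(in_features=2, out_features=2, bias=False)
copied = nn.Linear(in_features=2, out_features=2, bias=False)

## Assignment: both layers now use the same Parameter object
shared.weight = source.weight

## Independent copy: same values, but stored separately
copied.load_state_dict(source.state_dict())

## Change the source weights in place, the way an optimizer step would
with torch.no_grad():
    source.weight.add_(1.0)

print(shared.weight is source.weight)             # True: one object, two names
print(torch.equal(shared.weight, source.weight))  # True: the change shows up in both
print(torch.equal(copied.weight, source.weight))  # False: the copy did not change
```

If you would rather keep the original model untouched while training another model, copying values, as `load_state_dict()` does, is one way to do it.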

Now let's see how the **Reward Model** scores some prompt/response combinations. We'll start with a prompt paired with a "good" response. In **RLHF** terminology, this is called the **Better** response.

In [None]:
## Now let's score an input/output pair...
## This is an example of a "better" response
scores = reward_model(torch.tensor(tokens2ids("squatch eats what <EOS> pizza <EOS>")))
scores

The final score for a prompt/response combination could be the average of all the outputs for the combination, or it could be just the last value. In this example, we're just going to use the last value, and we can index it with `[-1]`, which we do here:

In [None]:
scores[-1] # use the last score as the output from the reward model

Now let's score a "worse" response.

In [None]:
## Now score another input/output pair...
## This is an example of a "worse" response
scores = reward_model(torch.tensor(tokens2ids("squatch eats what <EOS> awesome <EOS>")))
scores[-1]

Right now, the "worse" response has a higher score than the "better" response. We hope to change that by training the **Reward Model**. So, the first thing we do is create a new dataset that pairs prompts with "better" and "worse" responses. The idea is that the "better" and "worse" responses are determined based on human preferences. In other words, people were presented with both options and scored one response as better than the other.

We'll start by creating a list of prompts.

In [None]:
rl_inputs = torch.tensor([tokens2ids("squatch eats what <EOS>"),
                          tokens2ids("squatch eats what <EOS>"),
                          tokens2ids("squatch eats what <EOS>"),
                          tokens2ids("squatch eats what <EOS>"),
                          tokens2ids("squatch eats what <EOS>"),
                          tokens2ids("squatch eats what <EOS>"),
                          tokens2ids("squatch eats what <EOS>")])

Now let's create the responses. We'll do this by concatenating a "better" response with a "worse" response. The "better" response comes first. Later, when we're in the `training_step()` we'll split these two responses apart.

In [None]:
rl_labels = torch.tensor([tokens2ids("pizza <EOS> what <EOS>"),
                          tokens2ids("pizza <EOS> is <EOS>"),
                          tokens2ids("pizza <EOS> statquest <EOS>"),
                          tokens2ids("pizza <EOS> squatch <EOS>"),
                          tokens2ids("pizza <EOS> eats <EOS>"),
                          tokens2ids("pizza <EOS> norm <EOS>"),
                          tokens2ids("pizza <EOS> awesome <EOS>")])

Lastly, let's put the new dataset in a `DataLoader`.

In [None]:
## Now let's package everything up into a DataLoader...
rl_dataset = TensorDataset(rl_inputs, rl_labels)
rl_dataloader = DataLoader(rl_dataset)

Now that we have the data in a `DataLoader`, we can use it to train the **Reward Model**.

In [None]:
## now train the model
trainer = L.Trainer(max_epochs=50, log_every_n_steps=2, deterministic=True)
trainer.fit(reward_model, train_dataloaders=rl_dataloader)

Now let's see if the **Reward Model** gives the "better" response a higher score than the "worse" response.

In [None]:
reward_better = reward_model(torch.tensor(tokens2ids("squatch eats what <EOS> pizza <EOS>")))
reward_better[-1]

In [None]:
reward_worse = reward_model(torch.tensor(tokens2ids("squatch eats what <EOS> awesome <EOS>")))
reward_worse[-1]

And it does!

# DOUBLE BAM!!

**NOTE:** We can also calculate the **Loss** by hand to see if these scores result in a **Loss** value that is close to 0...

In [None]:
## See what the loss is...
-1 * torch.log(torch.sigmoid(reward_better[-1] - reward_worse[-1]))

...and we see that the **Loss** is super close to 0. In other words, the scores generated for the "better" and "worse" responses minimize the **Loss**.
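
To get a feel for how this **Loss** behaves, here is a minimal sketch that calculates it with plain Python for a few pairs of scores. The function name `preference_loss()` and the scores themselves are assumptions made up for illustration; they are not values produced by the **Reward Model** above.

```python
import math

## NOTE: preference_loss() is a hypothetical helper used only for illustration.
def preference_loss(reward_better, reward_worse):
    ## -1 * log(sigmoid(reward_better - reward_worse))
    difference = reward_better - reward_worse
    sigmoid = 1 / (1 + math.exp(-difference))
    return -1 * math.log(sigmoid)

## Hypothetical (better, worse) score pairs
score_pairs = [(-2.0, 2.0),    # "worse" scores much higher: large loss
               (0.0, 0.0),     # both scores are the same
               (2.0, -2.0),    # "better" scores higher: small loss
               (10.0, -10.0)]  # "better" scores much higher: loss close to 0

for reward_better, reward_worse in score_pairs:
    loss = preference_loss(reward_better, reward_worse)
    print(reward_better, reward_worse, round(loss, 4))
```

When the two scores are the same, the **Loss** is `log(2)`, or about 0.69. The **Loss** only gets close to 0 when the "better" response scores well above the "worse" response, which is what training pushes the **Reward Model** toward.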

----

Now let's see how the **Reward Model** scores prompt/response pairs (with "better" and "worse" responses) for something it has never seen before...

In [None]:
## Now let's score an input/output pair that the Reward Model has never seen before...
## This is an example of a "better" response:
reward_better = reward_model(torch.tensor(tokens2ids("norm eats what <EOS> pizza <EOS>")))
reward_better[-1]

In [None]:
## Now score another input/output pair that the Reward Model has never seen before...
## This is an example of a "worse" response:
reward_worse = reward_model(torch.tensor(tokens2ids("norm eats what <EOS> awesome <EOS>")))
reward_worse[-1]

...and we see that the "better" response scores higher than the "worse" response. This is good. It means we can use the **Reward Model** to train the original model to correctly respond to new prompts that it was not originally trained to handle. This is how we train a model with **Reinforcement Learning with Human Feedback**.

----

# Train the original model with RLHF

First, let's see what the original model generates when given a prompt it was not trained on.

In [None]:
generate_output(model, torch.tensor(tokens2ids("norm eats what <EOS>")))

And we see that, right now, the original model does not respond to the prompt correctly. Ideally, the response would be `pizza <EOS>`. So let's see if we can use the **Reward Model** to train the original model to generate the correct response.

To do this, we'll need to modify the code for the `training_step` method to use the **Reward Model** when it calculates the **Loss**. Thus, we'll create a new class that is identical to the class used to create the original Decoder-Only Transformer, except it has a different `training_step` method. In other words, this new model will have the same **Word Embedding Layer**, **Attention Layer** and **Fully Connected Layer** that we used in the original model.

In [None]:
class DecoderOnlyTransformer_RLHF(L.LightningModule):

    def __init__(self, num_tokens=4, d_model=2, max_len=6):

        super().__init__()
        L.seed_everything(seed=42, workers=True)

        self.we = nn.Embedding(num_embeddings=num_tokens,
                               embedding_dim=d_model)

        self.pe = PositionEncoding(d_model=d_model,
                                   max_len=max_len)

        self.self_attention = Attention(d_model=d_model)

        self.fc_layer = nn.Linear(in_features=d_model,
                                  out_features=num_tokens)

        # self.loss = nn.CrossEntropyLoss()

        self.gamma = torch.tensor(0.99)
        self.reward = torch.tensor(0)
        self.all_ids = torch.tensor(0)


    def forward(self, token_ids):

        # print("token_ids:", token_ids)

        word_embeddings = self.we(token_ids)
        # print("word_embeddings:", word_embeddings)

        position_encoded = self.pe(word_embeddings)
        # print("position_encoded:", position_encoded)

        mask = torch.tril(torch.ones((token_ids.size(dim=0), token_ids.size(dim=0)), device=self.device))
        mask = mask == 0

        self_attention_values = self.self_attention(position_encoded,
                                                    position_encoded,
                                                    position_encoded,
                                                    mask=mask)

        residual_connection_values = position_encoded + self_attention_values

        fc_layer_output = self.fc_layer(residual_connection_values)

        return fc_layer_output


    def configure_optimizers(self):
        return Adam(self.parameters(), lr=0.1)


    def training_step(self, batch, batch_idx):
        input_tokens, labels = batch # collect input

        model_input = input_tokens[0]

        predictions = self.forward(model_input)

        predicted_id = torch.tensor([torch.argmax(predictions[-1,:])])
        guess_one_hot = torch.zeros(len(predictions[-1,:]))
        guess_one_hot[predicted_id] = 1

        self.all_ids = torch.cat((model_input, predicted_id))
        self.all_ids = torch.cat((self.all_ids, torch.tensor(tokens2ids("<EOS>"))))

        self.reward = reward_model(self.all_ids)[-1]

        # print("predictions[-1,:]:", predictions[-1,:])
        soft_predictions = F.softmax(predictions[-1,:], dim=0) ## dim=0 because this is a 1-D tensor
        # print("soft_predictions:", soft_predictions)

        # loss = -1 * predictions[-1,predicted_id] * self.gamma * self.reward
        loss = -1 * soft_predictions[predicted_id] * self.gamma * self.reward

        self.log("train_loss", loss)

        return loss
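
The **Loss** calculated in `training_step()` above can be worked through by hand. The sketch below uses plain Python and made-up numbers (the logits and the reward are assumptions, not values from the models in this tutorial) to show how the probability of the predicted token, `gamma` and the reward combine.

```python
import math

def softmax(values):
    largest = max(values)  # subtract the largest value for numerical stability
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

## Hypothetical values used only for illustration
last_token_logits = [0.1, 2.0, -1.0, 0.5]  # one value per token in a 4 token vocabulary
reward = 3.0                               # a score like the one a reward model returns
gamma = 0.99

soft_predictions = softmax(last_token_logits)

## Pick the token with the largest value, like torch.argmax()
predicted_id = max(range(len(last_token_logits)), key=lambda i: last_token_logits[i])

## Same form as the loss in training_step()
loss = -1 * soft_predictions[predicted_id] * gamma * reward
print(predicted_id, round(loss, 3))
```

When the reward is positive, making the predicted token more probable makes the **Loss** more negative, so minimizing the **Loss** reinforces that token. When the reward is negative, minimizing the **Loss** pushes the probability of that token down instead.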

Since we want to copy every single parameter from the original model to the new model, so that it acts the same, we can use `load_state_dict()` and `state_dict()` to copy things quickly and easily.

In [None]:
new_model = DecoderOnlyTransformer_RLHF(num_tokens=len(tokens),
                                        d_model=model_dimension,
                                        max_len=max_length)
# Copy weights from the original pretrained model to new_model
new_model.load_state_dict(model.state_dict())

We can verify that both models have the same parameters by printing them out...

In [None]:
for name, param in model.named_parameters():
    print(name, torch.round(param.data, decimals=2))

In [None]:
for name, param in new_model.named_parameters():
    print(name, torch.round(param.data, decimals=2))

We can also verify that the new model responds to the new prompt in the same way as the original model.

In [None]:
generate_output(new_model, torch.tensor(tokens2ids("norm eats what <EOS>")))

Since the new model's response to the prompt is exactly the same as we got with the old model, we will now train the new model with **Reinforcement Learning with Human Feedback**. First, let's create a dataset that only provides a prompt, and has no label.

In [None]:
rlhf_inputs = torch.tensor([tokens2ids("norm eats what <EOS>")])

rlhf_labels = torch.tensor([tokens2ids("<PAD>")])

## Now let's package everything up into a DataLoader...
rlhf_dataset = TensorDataset(rlhf_inputs, rlhf_labels)
rlhf_dataloader = DataLoader(rlhf_dataset)

In [None]:
%xmode Minimal
## now see if we can train the model
# trainer = L.Trainer(max_epochs=1000, log_every_n_steps=2, accelerator="cpu", deterministic=True)
trainer = L.Trainer(max_epochs=10, log_every_n_steps=2, accelerator="cpu", deterministic=True)
trainer.fit(new_model, train_dataloaders=rlhf_dataloader)

Now let's see if the model, trained with **RLHF**, responds correctly to the new prompt...

In [None]:
generate_output(new_model, torch.tensor(tokens2ids("norm eats what <EOS>")))

...and it does!!! At least, on my computer I get `pizza <EOS>` as the output. Sometimes on Google Colab I get `pizza pizza pizza...`, which, while not as ideal as what happened on my laptop, is still an improvement over what we got before, which was, `is eats is eats is eats is`. In other words, we successfully trained our model with **RLHF**.

# TRIPLE BAM!!!

----

# NOTES

We can use **TensorBoard** to see if we have minimized the overall loss of the model.

**NOTE:** If you want to make use of **TensorBoard**, make sure you are calling `self.log("train_loss", loss)` (or something like that) in the `training_step()` method.

In [None]:
# %%capture
# ## NOTE: If you **don't** need to install anything, you can comment out the
# ##       next line.
# ##
# ##       If you **do** need to install something, just know that you may need to
# ##       restart your session for python to find the new module(s).
# ##
# ##       To restart your session:
# ##       - In Google Colab, click on the "Runtime" menu and select
# ##         "Restart Session" from the pulldown menu
# ##       - In a local jupyter notebook, click on the "Kernel" menu and select
# ##         "Restart Kernel" from the pulldown menu
# ##
# ##       Also, installing can take a few minutes, so go get yourself a snack!
# !pip install tensorflow

In [None]:
## Load the TensorBoard notebook extension
# %load_ext tensorboard

# import tensorflow as tf
# import datetime, os

In [None]:
## Now launch tensorboard within the jupyter notebook
# %tensorboard --logdir=lightning_logs/