Setup: Imports and helper functions.

The Data: Defining our "toy" language.

Concept 1: Tokenization (The Vocabulary): Defining and demoing the SimpleTokenizer.

Concept 2: Word Embeddings (The Meaning): Explaining and demoing the nn.Embedding layer.

Concept 3: The Neural Network (The "Brain"): Building the ToyLLM class.

Helper Function: The generate_text function.

Initialization: Creating the model and testing its "untrained" (random) output.

Phase 1: Pre-training (Learning Language): Running simulate_pretraining and testing the result.

Phase 2: SFT (Learning to Answer): Running simulate_sft and testing the result.

Phase 3: RLHF (Learning "Preferences"): Defining the reward model and the simulate_rlhf function.

Final Run: Running the RLHF phase and comparing the final model's answers.

In [1]:
# --- Cell 1: Setup & Imports ---

import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np

# Set a random seed for reproducibility.
# This ensures that every time we run this notebook, the "random"
# initial weights of our model are the same, leading to the same
# training results.
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

print("Libraries imported and random seed set.")

Libraries imported and random seed set.


In [12]:
# --- Cell 2: The "Alice" Data ---

# 1. Pre-training Corpus: Sentences from "Alice in Wonderland"
#    This teaches the model the world's basic facts and language.
pretrain_corpus = [
    "alice followed the white rabbit <end>",
    "alice went down the rabbit hole <end>",
    "the white rabbit was late <end>",
    "the cheshire cat smiled <end>",
    "the hatter had a mad tea party <end>",
    "the queen of hearts shouted <end>",
    "alice was very curious <end>",
    "the cat sat on a branch <end>",
    "the hatter and the march hare <end>",
    "who are you said the caterpillar <end>",
    "i am alice said alice <end>",
    "the queen played croquet <end>"
]

# 2. Supervised Fine-Tuning (SFT) Dataset: Q&A pairs
#    This teaches the model to be a helpful "Alice" expert.
sft_dataset = {
    # Prompt: "Ideal Answer"
    "who did alice follow": "alice followed the white rabbit <end>",
    "where did alice go": "alice went down the rabbit hole <end>",
    "tell me about the rabbit": "the white rabbit was late <end>",
    "what did the cat do": "the cheshire cat smiled <end>",
    "what about the hatter": "the hatter had a mad tea party <end>",
    "what did the queen do": "the queen of hearts shouted <end>"
}

# 3. RLHF Prompts: Prompts for the "preference" phase.
rlhf_prompts = [
    "who did alice follow",
    "tell me about the cat",
    "what did the queen do",
    "who was alice",  # Note: This is a new question!
    "tell me about the hatter"
]

# Create the "full" corpus for the tokenizer
full_corpus = pretrain_corpus + list(sft_dataset.keys()) + list(sft_dataset.values())
print(f"Total pre-training sentences: {len(pretrain_corpus)}")
print(f"Total SFT pairs: {len(sft_dataset)}")
print(f"Total RLHF prompts: {len(rlhf_prompts)}")

Total pre-training sentences: 12
Total SFT pairs: 6
Total RLHF prompts: 5


In [14]:
# --- Cell 3: Concept 1: Tokenization (The Vocabulary) ---

class SimpleTokenizer:
    """
    A simple word-level tokenizer. It learns a vocabulary from a
    corpus and can convert text to a sequence of integer IDs
    (encode) and back (decode).
    """
    def __init__(self, corpus):
        # 1. Find all unique words in the corpus
        # We split every sentence, flatten the list, and use a 'set'
        # to get only the unique words.
        words = sorted(list(set(word for sentence in corpus for word in sentence.split())))
        
        # 2. Create the mapping dictionaries
        # word_to_idx: Maps a word (e.g., "alice") to an integer (e.g., 3)
        self.word_to_idx = {word: i for i, word in enumerate(words)}
        # idx_to_word: Maps an integer (e.g., 3) back to a word (e.g., "alice")
        self.idx_to_word = {i: word for word, i in self.word_to_idx.items()}
        
        # 3. Store the vocabulary size
        self.vocab_size = len(self.word_to_idx)

    def encode(self, text):
        """Converts a string of text into a list of token IDs."""
        return [self.word_to_idx[word] for word in text.split()]

    def decode(self, tokens):
        """Converts a list of token IDs back into a string of text."""
        return ' '.join([self.idx_to_word[token] for token in tokens])

# --- Let's test our Tokenizer! ---
print("--- Testing the Tokenizer ---")

# 1. Initialize the tokenizer on our full dataset
tokenizer = SimpleTokenizer(full_corpus)
vocab_size = tokenizer.vocab_size

print(f"Vocabulary Size: {vocab_size}")
print(f"Vocabulary: {tokenizer.word_to_idx}")

# 2. Test encoding
text = "curious white cat and rabbit <end>"
encoded = tokenizer.encode(text)
print(f"\nOriginal text: '{text}'")
print(f"Encoded IDs: {encoded}")

# 3. Test decoding
decoded = tokenizer.decode(encoded)
print(f"Decoded text: '{decoded}'")

--- Testing the Tokenizer ---
Vocabulary Size: 50
Vocabulary: {'<end>': 0, 'a': 1, 'about': 2, 'alice': 3, 'am': 4, 'and': 5, 'are': 6, 'branch': 7, 'cat': 8, 'caterpillar': 9, 'cheshire': 10, 'croquet': 11, 'curious': 12, 'did': 13, 'do': 14, 'down': 15, 'follow': 16, 'followed': 17, 'go': 18, 'had': 19, 'hare': 20, 'hatter': 21, 'hearts': 22, 'hole': 23, 'i': 24, 'late': 25, 'mad': 26, 'march': 27, 'me': 28, 'of': 29, 'on': 30, 'party': 31, 'played': 32, 'queen': 33, 'rabbit': 34, 'said': 35, 'sat': 36, 'shouted': 37, 'smiled': 38, 'tea': 39, 'tell': 40, 'the': 41, 'very': 42, 'was': 43, 'went': 44, 'what': 45, 'where': 46, 'white': 47, 'who': 48, 'you': 49}

Original text: 'curious white cat and rabbit <end>'
Encoded IDs: [12, 47, 8, 5, 34, 0]
Decoded text: 'curious white cat and rabbit <end>'
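
The tokenizer above raises a `KeyError` for any word outside its vocabulary. Below is a minimal sketch of one common workaround, assuming a reserved `<unk>` token. The class name `UnkTokenizer` and the `<unk>` symbol are illustrative and not part of the notebook.

```python
class UnkTokenizer:
    """Word-level tokenizer that maps out-of-vocabulary words to <unk>."""

    UNK = "<unk>"  # hypothetical reserved token, not in the original notebook

    def __init__(self, corpus):
        words = sorted({w for sentence in corpus for w in sentence.split()})
        # Reserve ID 0 for <unk>; real words start at 1
        self.word_to_idx = {self.UNK: 0}
        for word in words:
            self.word_to_idx.setdefault(word, len(self.word_to_idx))
        self.idx_to_word = {i: w for w, i in self.word_to_idx.items()}
        self.vocab_size = len(self.word_to_idx)

    def encode(self, text):
        unk_id = self.word_to_idx[self.UNK]
        # dict.get falls back to the <unk> ID instead of raising KeyError
        return [self.word_to_idx.get(w, unk_id) for w in text.split()]

    def decode(self, tokens):
        return " ".join(self.idx_to_word[t] for t in tokens)


demo = UnkTokenizer(["the white rabbit was late <end>"])
ids = demo.encode("the sun was late <end>")  # "sun" is not in the corpus
print(ids)
print(demo.decode(ids))
```

The trade-off is that unknown words no longer fail loudly, so a typo in a prompt is silently mapped to `<unk>` rather than stopping the run.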


In [15]:
# --- Cell 4: Concept 2: Word Embeddings (The Meaning) ---

# Define the "dimensionality" of our embeddings.
# This is a hyperparameter: a choice we make.
# A bigger number can capture more meaning, but is slower.
# Real models use dimensions like 768 or 4096. We'll use 10.
EMBED_DIM = 10

# 1. Create the embedding layer
# It's a simple lookup table: vocab_size (rows) x embed_dim (columns)
# It's like a big spreadsheet where row 5 is the vector for word ID 5.
embedding_layer = nn.Embedding(vocab_size, EMBED_DIM)

# --- Let's test the Embedding Layer! ---
print("--- Testing the Embedding Layer ---")

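# 1. Get the ID for the word "rabbit"
# (The word must exist in the vocabulary; an unknown word such as
# "sun" raises a KeyError in word_to_idx.)
rabbit_id = tokenizer.word_to_idx['rabbit']
print(f"ID for 'rabbit': {rabbit_id}")

# 2. Get the embedding vector for "rabbit"
# We wrap the ID in a torch.tensor
rabbit_id_tensor = torch.tensor([rabbit_id])
rabbit_vector = embedding_layer(rabbit_id_tensor)

print(f"\nUntrained 'rabbit' vector (shape {rabbit_vector.shape}):")
print(rabbit_vector)

print("\nNote: These numbers are random. During training, the model will learn")
print("to give words that appear in similar contexts (e.g., 'rabbit' and 'cat')")
print("similar vectors.")

--- Testing the Embedding Layer ---
ID for 'rabbit': 34

Untrained 'rabbit' vector (shape torch.Size([1, 10])):
[1 x 10 tensor of random initial values]

Note: These numbers are random. During training, the model will learn
to give words that appear in similar contexts (e.g., 'rabbit' and 'cat')
similar vectors.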
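
The note in this cell talks about word vectors becoming "similar". One common way to measure that is cosine similarity. Here is a minimal sketch in plain Python, using made-up 3-dimensional vectors rather than real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings", for illustration only
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: pointing the same way
print(cosine_similarity(vec_a, vec_c))  # 0: unrelated directions
```

With PyTorch tensors, `torch.nn.functional.cosine_similarity` computes the same quantity.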

In [16]:
# --- Cell 5: Concept 3: The Neural Network (The "Brain") ---

# We define our model as a Python class that inherits from PyTorch's
# nn.Module, which is the base class for all neural network modules.

class ToyLLM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        """
        This is the constructor. It sets up the layers our model
        will use. It doesn't define how data flows, just the
        building blocks.
        """
        super().__init__() # Always call this first
        
        self.vocab_size = vocab_size
        
        # Layer 1: The Embedding Layer
        # (From Cell 4) Converts token IDs -> meaning vectors
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        
        # Layer 2: The "Memory" Layer (LSTM)
        # This layer processes the sequence of embedding vectors.
        # 'hidden_dim' is the size of its "memory" or "thought" vector.
        # 'batch_first=True' just means our input data will have
        # the batch size as the first dimension (e.g., [batch, seq_len, features])
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        
        # Layer 3: The "Decision" Layer (Fully Connected)
        # This layer takes the final "thought" from the LSTM and maps
        # it to a score for every single word in our vocabulary.
        # The word with the highest score is the model's prediction.
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        """
        This is the 'forward pass'. It defines how data flows
        through the layers we defined in __init__.
        
        'x' will be our input tensor of token IDs (e.g., [1, 5, 4])
        """
        # 1. Get Embeddings
        # x shape: [batch_size, sequence_length]
        # embedded shape: [batch_size, sequence_length, embed_dim]
        embedded = self.embedding(x)
        
        # 2. Process with LSTM
        # lstm_out shape: [batch_size, sequence_length, hidden_dim]
        # 'hidden_state' (which we ignore with '_') would be the
        # final memory state.
        lstm_out, _ = self.lstm(embedded)
        
        # 3. Get the *last* token's output
        # We only care about the LSTM's "thought" *after* it has read
        # the entire input sequence. So, we select the output
        # corresponding to the last token (index -1).
        # last_token_out shape: [batch_size, hidden_dim]
        last_token_out = lstm_out[:, -1, :]
        
        # 4. Make a prediction (get "logits")
        # 'output' will be a vector of scores, one for each word
        # in the vocabulary.
        # output shape: [batch_size, vocab_size]
        output = self.fc(last_token_out)
        
        return output

print("ToyLLM class defined.")

ToyLLM class defined.
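
To get a feel for how small this model is, the parameter count can be worked out by hand. The sketch below assumes the sizes used in this notebook (vocabulary 50, embedding 10, hidden 20) and PyTorch's standard single-layer `nn.LSTM` layout of four gates, each with input weights, recurrent weights, and two bias vectors. The helper name `count_toy_llm_params` is illustrative.

```python
def count_toy_llm_params(vocab_size, embed_dim, hidden_dim):
    """Parameter count for Embedding -> single-layer LSTM -> Linear."""
    embedding = vocab_size * embed_dim
    # 4 gates, each with input weights, recurrent weights, and two biases
    lstm = 4 * (embed_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    linear = hidden_dim * vocab_size + vocab_size  # weights + bias
    return {
        "embedding": embedding,
        "lstm": lstm,
        "fc": linear,
        "total": embedding + lstm + linear,
    }

print(count_toy_llm_params(vocab_size=50, embed_dim=10, hidden_dim=20))
# {'embedding': 500, 'lstm': 2560, 'fc': 1050, 'total': 4110}
```

On a real model instance, `sum(p.numel() for p in model.parameters())` should give the same total.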


In [17]:
# --- Cell 6: Helper Function (Generating Text) ---

def generate_text(model, tokenizer, seed_text, max_len=10):
    """
    Generates text from a model, starting with a seed_text.
    """
    # 1. Set the model to "evaluation mode"
    # This turns off things like dropout, which are only used
    # during training.
    model.eval()
    
    # 2. Tokenize the starting text
    tokens = tokenizer.encode(seed_text)
    
    # 3. Generation loop
    for _ in range(max_len):
        # 4. Convert tokens to a tensor
        # The model expects a "batch", so we add an
        # extra dimension with []
        input_tensor = torch.tensor([tokens])
        
        # 5. Run the model
        # torch.no_grad() tells PyTorch we don't need to
        # calculate gradients, which saves memory and time.
        with torch.no_grad():
            output = model(input_tensor)
        
        # 6. Get the prediction
        # 'output' is a tensor of scores (logits).
        # output.argmax(1) finds the index (the token ID)
        # with the highest score.
        # .item() converts the tensor to a plain Python number.
        next_token = output.argmax(1).item()
        
        # 7. Add the new token to our list
        tokens.append(next_token)
        
        # 8. Stop if we predict the <end> token
        if tokenizer.decode([next_token]) == '<end>':
            break
            
    # 9. Decode the full list of tokens back into a string
    return tokenizer.decode(tokens)

print("generate_text() helper function defined.")

generate_text() helper function defined.
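
`generate_text` always takes the highest-scoring token (greedy decoding), so the same prompt always yields the same output and the model can get stuck repeating itself. A common alternative is to sample from the softmax distribution, optionally sharpened or flattened by a temperature. The sketch below uses plain Python lists in place of tensors; the function names are illustrative.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature = sharper."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Index of the largest logit (what output.argmax(1) does)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature=1.0, rng=random):
    """Draw one index at random, weighted by the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]       # made-up scores for a 3-word vocabulary
print(greedy_pick(logits))     # always the same index
rng = random.Random(0)
print([sample_pick(logits, temperature=0.8, rng=rng) for _ in range(5)])
```

Sampling trades determinism for variety: lower-scoring tokens are sometimes chosen, which can break repetitive loops but also makes results vary from run to run unless the random generator is seeded.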


In [21]:
# --- Cell 7: Initialization & First Test ---

# --- Hyperparameters ---
# We use the constants we defined earlier
# EMBED_DIM = 10
HIDDEN_DIM = 20 # Size of the LSTM's "memory"
LEARNING_RATE = 0.01

# --- Initialize the Model ---
toy_model = ToyLLM(vocab_size=vocab_size, 
                   embed_dim=EMBED_DIM, 
                   hidden_dim=HIDDEN_DIM)

# --- Initialize the "Optimizer" and "Loss Function" ---
# The Optimizer (e.g., Adam) is what updates the model's
# weights. It needs to know the learning rate.
optimizer = optim.Adam(toy_model.parameters(), lr=LEARNING_RATE)

# The Loss Function (CrossEntropyLoss) measures how "wrong" the
# model's prediction is compared to the correct answer. It's
# perfect for classification, and "next-word prediction" is just
# a classification problem with 'vocab_size' classes.
criterion = nn.CrossEntropyLoss()

print(f"Model initialized with {vocab_size} vocab, {EMBED_DIM} embed dim, {HIDDEN_DIM} hidden dim.")

# --- Test the UNTRAINED model ---
print("\n--- Testing Untrained Model ---")
# The seed_text must only contain words from our vocabulary
seed_text = "the rabbit" 
generated = generate_text(toy_model, tokenizer, seed_text)
print(f"Prompt: '{seed_text}' -> Response: '{generated}'")
print("\n(Note: The output is random garbage, as expected!)")

Model initialized with 50 vocab, 10 embed dim, 20 hidden dim.

--- Testing Untrained Model ---
Prompt: 'the rabbit' -> Response: 'the rabbit where where hearts where hearts where hearts where hearts where'

(Note: The output is random garbage, as expected!)
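
For a single example, cross-entropy loss is the negative log of the softmax probability assigned to the correct token. The sketch below computes it by hand in plain Python. It also shows that a model spreading its probability evenly over 50 words would score ln(50), about 3.91, which is a rough reference point for an untrained model (an actual untrained model will only be approximately uniform).

```python
import math

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

vocab_size = 50
uniform_logits = [0.0] * vocab_size               # every word equally likely
print(cross_entropy(uniform_logits, target=3))    # equals ln(50)
print(math.log(vocab_size))

confident_logits = [10.0, 0.0, 0.0]               # strongly prefers index 0
print(cross_entropy(confident_logits, target=0))  # near zero: correct and confident
print(cross_entropy(confident_logits, target=1))  # large: confident but wrong
```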


In [23]:
# --- Cell 8: Phase 1: Foundational Pre-training ---

def simulate_pretraining(model, tokenizer, corpus, epochs=50):
    print("\n--- Starting Phase 1: Foundational Pre-training ---")
    print("Goal: Teach the model basic language structure by predicting the next word.")
    
    # 1. Create the training data
    # We turn each sentence into multiple (input, target) pairs.
    # e.g., "the queen played croquet <end>" becomes:
    # (["the"], "queen")
    # (["the", "queen"], "played")
    # (["the", "queen", "played"], "croquet")
    # (["the", "queen", "played", "croquet"], "<end>")
    inputs, targets = [], []
    for sentence in corpus:
        tokens = tokenizer.encode(sentence)
        for i in range(1, len(tokens)):
            inputs.append(tokens[:i])   # The sequence so far
            targets.append(tokens[i])  # The *next* word
            
    # --- The Training Loop ---
    for epoch in range(epochs):
        total_loss = 0
        
        # We'll just iterate through our simple dataset
        for i in range(len(inputs)):
            # Get the current (input, target) pair
            input_seq = torch.tensor([inputs[i]])
            target_val = torch.tensor([targets[i]])
            
            # --- Standard PyTorch Training Steps ---
            
            # 1. Reset gradients
            # We must do this every time, or they will accumulate
            optimizer.zero_grad()
            
            # 2. Forward pass: Get model's prediction
            output = model(input_seq)
            
            # 3. Calculate loss: How "wrong" was the prediction?
            # 'output' shape: [1, vocab_size]
            # 'target_val' shape: [1]
            loss = criterion(output, target_val)
            
            # 4. Backward pass: Calculate gradients
            # This is where PyTorch figures out how much each
            # model parameter contributed to the error.
            loss.backward()
            
            # 5. Optimizer step: Update model parameters
            # The optimizer "nudges" the weights in the correct
            # direction to reduce the loss.
            optimizer.step()
            
            total_loss += loss.item()
        
        # Print progress
        if (epoch + 1) % 10 == 0:
            print(f"  Pre-training Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(inputs):.4f}")

    print("--- Pre-training Complete ---")
    return model

# --- Run Phase 1 ---
# We pass in the 'toy_model' we already created. Training happens in
# place, so 'base_model' refers to the same object as 'toy_model'.
# We train for 80 epochs instead of the default 50.
base_model = simulate_pretraining(toy_model, tokenizer, pretrain_corpus, epochs=80) 

# --- Test the Pre-trained model ---
print("\n--- Testing Model After Pre-training ---")
seed_text_1 = "alice followed"
print(f"Prompt: '{seed_text_1}' -> Response: '{generate_text(base_model, tokenizer, seed_text_1)}'")

seed_text_2 = "the queen of"
print(f"Prompt: '{seed_text_2}' -> Response: '{generate_text(base_model, tokenizer, seed_text_2)}'")

seed_text_3 = "who did alice follow" # Note: This is a QUESTION
print(f"Prompt: '{seed_text_3}' -> Response: '{generate_text(base_model, tokenizer, seed_text_3)}'")
print("\n(Note: It's learned 'Alice' facts, but still just completes. It doesn't *answer*.)")


--- Starting Phase 1: Foundational Pre-training ---
Goal: Teach the model basic language structure by predicting the next word.


  Pre-training Epoch 10/80, Loss: 0.3238
  Pre-training Epoch 20/80, Loss: 0.3201
  Pre-training Epoch 30/80, Loss: 0.3189
  Pre-training Epoch 40/80, Loss: 0.3159
  Pre-training Epoch 50/80, Loss: 0.3132
  Pre-training Epoch 60/80, Loss: 0.3112
  Pre-training Epoch 70/80, Loss: 0.3110
  Pre-training Epoch 80/80, Loss: 0.3099
--- Pre-training Complete ---

--- Testing Model After Pre-training ---
Prompt: 'alice followed' -> Response: 'alice followed the white rabbit <end>'
Prompt: 'the queen of' -> Response: 'the queen of hearts shouted <end>'
Prompt: 'who did alice follow' -> Response: 'who did alice follow the white rabbit <end>'

(Note: It's learned 'Alice' facts, but still just completes. It doesn't *answer*.)
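
The pre-training loss above levels off near 0.31 instead of approaching zero. One plausible reason is that some prefixes have several valid continuations in the corpus (after "the" the next word may be "white", "cheshire", "hatter", and so on), so no model can predict them all with certainty. The sketch below rebuilds the (prefix, next word) pairs in plain Python and estimates the lowest average loss a model could reach by predicting the empirical distribution for each prefix. It is a back-of-the-envelope check, not part of the original notebook.

```python
import math
from collections import Counter, defaultdict

corpus = [
    "alice followed the white rabbit <end>",
    "alice went down the rabbit hole <end>",
    "the white rabbit was late <end>",
    "the cheshire cat smiled <end>",
    "the hatter had a mad tea party <end>",
    "the queen of hearts shouted <end>",
    "alice was very curious <end>",
    "the cat sat on a branch <end>",
    "the hatter and the march hare <end>",
    "who are you said the caterpillar <end>",
    "i am alice said alice <end>",
    "the queen played croquet <end>",
]

# Build (prefix, next_word) pairs the same way the pre-training loop does
pairs = []
for sentence in corpus:
    words = sentence.split()
    for i in range(1, len(words)):
        pairs.append((tuple(words[:i]), words[i]))

# Count which words follow each distinct prefix
followers = defaultdict(Counter)
for prefix, target in pairs:
    followers[prefix][target] += 1

# Average loss when each prefix is answered with its empirical distribution
total = 0.0
for prefix, target in pairs:
    counts = followers[prefix]
    p = counts[target] / sum(counts.values())
    total += -math.log(p)
loss_floor = total / len(pairs)

print(f"Training pairs: {len(pairs)}")
print(f"Estimated loss floor: {loss_floor:.4f}")
```

If the estimate lands a little below the observed loss, that suggests the model has memorized nearly everything that can be memorized, and the remaining loss comes mostly from genuinely ambiguous prefixes.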


In [24]:
# --- Cell 9: Phase 2: Supervised Fine-Tuning (SFT) ---

def simulate_sft(model, tokenizer, sft_data, epochs=30):
    print("\n--- Starting Phase 2: Supervised Fine-Tuning (SFT) ---")
    print("Goal: Teach the model to follow instructions in a question-answer format.")
    
    # We'll use a slightly smaller learning rate for fine-tuning
    optimizer = optim.Adam(model.parameters(), lr=0.005)
    
    for epoch in range(epochs):
        total_loss = 0
        
        # Iterate over our Q&A pairs
        for prompt, ideal_response in sft_data.items():
            
            input_tokens = tokenizer.encode(prompt)
            target_tokens = tokenizer.encode(ideal_response)
            
            # "Teacher Forcing": We'll train the model to generate
            # the response one word at a time, feeding it the
            # correct sequence so far.
            
            # e.g., for "who did alice follow": "alice followed the white rabbit <end>"
            # 1. Input: [prompt] + []
            #    Target: "alice"
            # 2. Input: [prompt] + ["alice"]
            #    Target: "followed"
            # 3. ...and so on
            
            for i in range(len(target_tokens)):
                # The input is the prompt + the ideal response *so far*
                current_input = torch.tensor([input_tokens + target_tokens[:i]])
                # The target is the *next* word in the ideal response
                current_target = torch.tensor([target_tokens[i]])
                
                # --- Standard Training Steps ---
                optimizer.zero_grad()
                output = model(current_input)
                loss = criterion(output, current_target)
                loss.backward()
                optimizer.step()
                total_loss += loss.item()

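        # Note: we divide by the number of Q&A pairs, so the printed value is
        # the summed token loss per pair, not the average loss per token.
        if (epoch + 1) % 10 == 0:
            print(f"  SFT Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(sft_data):.4f}")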
            
    print("--- SFT Complete ---")
    return model


# --- Run Phase 2 ---
# We pass in the 'base_model' from Phase 1 (it is fine-tuned in place)
# We train for 50 epochs instead of the default 30.
sft_model = simulate_sft(base_model, tokenizer, sft_dataset, epochs=50)

# --- Test the SFT model ---
print("\n--- Testing Model After SFT ---")
seed_text_1 = "where did alice go" # Was in SFT data
print(f"Prompt: '{seed_text_1}' -> Response: '{generate_text(sft_model, tokenizer, seed_text_1)}'")

seed_text_2 = "what did the cat do" # Was in SFT data
print(f"Prompt: '{seed_text_2}' -> Response: '{generate_text(sft_model, tokenizer, seed_text_2)}'")

seed_text_3 = "who was alice" # Was NOT in SFT data (pre-trained "alice was very curious")
print(f"Prompt: '{seed_text_3}' -> Response: '{generate_text(sft_model, tokenizer, seed_text_3)}'")
print("\n(Note: It's great at SFT examples, but for a new question it stitches together memorized fragments.)")


--- Starting Phase 2: Supervised Fine-Tuning (SFT) ---
Goal: Teach the model to follow instructions in a question-answer format.
  SFT Epoch 10/50, Loss: 0.5940
  SFT Epoch 20/50, Loss: 0.2010
  SFT Epoch 30/50, Loss: 0.0968
  SFT Epoch 40/50, Loss: 0.0572
  SFT Epoch 50/50, Loss: 0.0374
--- SFT Complete ---

--- Testing Model After SFT ---
Prompt: 'where did alice go' -> Response: 'where did alice go alice went down the rabbit hole <end>'
Prompt: 'what did the cat do' -> Response: 'what did the cat do the cheshire cat smiled <end>'
Prompt: 'who was alice' -> Response: 'who was alice followed the white rabbit was late <end>'

(Note: It's great at SFT examples, but for a new question it stitches together memorized fragments.)
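
The teacher-forcing loop in `simulate_sft` turns each Q&A pair into one training step per response token. A minimal sketch of that expansion, using words instead of token IDs so the steps are easy to read (the helper name `teacher_forcing_steps` is illustrative):

```python
def teacher_forcing_steps(prompt, response):
    """Expand one Q&A pair into (context, next_word) training steps."""
    prompt_words = prompt.split()
    response_words = response.split()
    steps = []
    for i in range(len(response_words)):
        # Context = full prompt + the *correct* response so far
        context = prompt_words + response_words[:i]
        steps.append((context, response_words[i]))
    return steps

steps = teacher_forcing_steps(
    "where did alice go",
    "alice went down the rabbit hole <end>",
)
for context, target in steps[:3]:
    print(context, "->", target)
print(f"{len(steps)} steps in total")
```

Because the context always contains the ideal response so far, the model never has to recover from its own mistakes during training, which is one reason outputs can degrade on prompts it has not seen.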


In [25]:
# --- Cell 10: Phase 3: Reinforcement Learning (RLHF) ---

# --- Step 3a: Simulate a NEW Reward Model (The "Alice" Judge) ---
def get_reward_score(response):
    """
    A new, rule-based reward model that understands "Alice" facts.
    
    This "judge" prefers helpful and factually-correct answers
    from the story.
    """
    score = 0.0
    
    # --- Positive Rewards (Factual & Helpful) ---
    # Prefers the *SFT answers*
    if "alice followed the white rabbit" in response: score += 1.5
    if "the cheshire cat smiled" in response: score += 1.5
    if "the hatter had a mad tea party" in response: score += 1.5
    if "the queen of hearts shouted" in response: score += 1.0
    
    # Also rewards good *pre-training* facts
    if "alice was very curious" in response: score += 1.0
    
    # --- Negative Penalties (Unhelpful or Wrong) ---
    if "i am a bot" in response: score -= 2.0     # Penalizes robotic answers
    if "i do not know" in response: score -= 1.0  # Penalizes unhelpful
    
    # Penalize "factual errors" (e.g., mixing up characters)
    if "alice smiled" in response: score -= 1.5
    if "the cat followed the rabbit" in response: score -= 2.0
        
    if len(response.split()) < 4: score -= 1.0    # Penalizes short answers
    return score

# --- Step 3b: Fine-tune with Reinforcement Learning ---
def simulate_rlhf(model, tokenizer, prompts, iterations=50):
    print("\n--- Starting Phase 3: Reinforcement Learning with Human Feedback (RLHF) ---")
    print("Goal: Refine the model based on preferences (what makes a 'good' answer).")
    
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    
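    for i in range(iterations):
        # 1. Pick a prompt and let the current model answer it (greedy decoding)
        prompt = random.choice(prompts)
        response_str = generate_text(model, tokenizer, prompt, max_len=15)
        # 2. Score the generated text (note: the string includes the prompt)
        reward = get_reward_score(response_str)
        
        # 3. Recompute the log-probability of each generated token,
        #    this time with gradients enabled
        log_probs = []
        input_tokens = tokenizer.encode(prompt)
        response_tokens = tokenizer.encode(response_str.replace(prompt, '').strip())
        
        if not response_tokens:
            continue
            
        for token in response_tokens:
            input_tensor = torch.tensor([input_tokens])
            output = model(input_tensor)
            log_prob_dist = torch.log_softmax(output, dim=1)
            log_prob = log_prob_dist[0, token]
            log_probs.append(log_prob)
            input_tokens.append(token)
            
        # 4. REINFORCE-style update: a positive reward makes the generated
        #    tokens more likely, a negative reward makes them less likely,
        #    and a reward of 0 gives a zero loss (no update).
        if log_probs:
            policy_loss = -torch.stack(log_probs).mean() * reward
            optimizer.zero_grad()
            policy_loss.backward()
            optimizer.step()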
        
        if (i + 1) % max(1, iterations // 5) == 0: # Print about 5 times (guard avoids modulo by zero when iterations < 5)
            print(f"  RLHF Iteration {i+1}/{iterations}, Prompt: '{prompt}', Reward: {reward:.2f}")

    print("--- RLHF Complete ---")
    return model

print("RLHF functions defined with NEW 'Alice' reward model.")

RLHF functions defined with NEW 'Alice' reward model.
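
The update in `simulate_rlhf` multiplies the negative mean log-probability of the generated tokens by the reward. The sketch below reproduces that arithmetic in plain Python with made-up log-probabilities. The `baseline` argument is an assumption added for illustration and is not in the notebook: subtracting a typical reward lets a zero-reward answer still produce a learning signal.

```python
def reinforce_loss(log_probs, reward, baseline=0.0):
    """Policy-gradient surrogate loss: -(reward - baseline) * mean log-prob."""
    mean_log_prob = sum(log_probs) / len(log_probs)
    return -(reward - baseline) * mean_log_prob

log_probs = [-1.0, -3.0]  # made-up per-token log-probabilities

# Positive reward: minimizing this loss raises the tokens' log-probabilities
print(reinforce_loss(log_probs, reward=1.5))

# Zero reward, no baseline: the loss is zero, so nothing is learned
print(reinforce_loss(log_probs, reward=0.0))

# Zero reward with a hypothetical baseline of 0.5: the loss is negative in
# the log-probs, so minimizing it pushes these tokens' probabilities down
print(reinforce_loss(log_probs, reward=0.0, baseline=0.5))
```

Without a baseline, answers that score exactly 0 are ignored by the update, which may explain why some prompts see little change during this phase.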


In [26]:
# --- Cell 11: Final Run & The "Aligned" Model ---

# --- Run Phase 3 ---
# We pass in the 'sft_model' from Phase 2 (it is updated in place)
# We run 100 iterations instead of the default 50.
rlhf_model = simulate_rlhf(sft_model, tokenizer, rlhf_prompts, iterations=100) 

# --- Test the FINAL model ---
print("\n--- Final Model after RLHF ---")
print("Note how the model's behavior has been 'steered' by the 'Alice' reward model.")

seed_text_1 = "what did the cat do" # (SFT, high-reward)
print(f"Prompt: '{seed_text_1}' -> Response: '{generate_text(rlhf_model, tokenizer, seed_text_1)}'")

seed_text_2 = "what about the hatter" # (SFT, high-reward)
print(f"Prompt: '{seed_text_2}' -> Response: '{generate_text(rlhf_model, tokenizer, seed_text_2)}'")

# --- The KEY Test ---
seed_text_3 = "who was alice" # (Not in SFT, but "alice was very curious" is high-reward)
print(f"Prompt: '{seed_text_3}' -> Response: '{generate_text(rlhf_model, tokenizer, seed_text_3)}'")

print("\n--- Talk Summary ---")
print("1. UNTRAINED: Random garbage.")
print("2. PRE-TRAINED: Knew 'Alice' facts, but just completed sentences.")
print("3. SFT: Knew how to *answer specific questions* (e.g., 'the cheshire cat smiled').")
print("4. RLHF: Kept the high-reward SFT answers, but did not fix the unseen prompt")
print("   'who was alice': its answer is unchanged from the SFT model.")


--- Starting Phase 3: Reinforcement Learning with Human Feedback (RLHF) ---
Goal: Refine the model based on preferences (what makes a 'good' answer).
  RLHF Iteration 20/100, Prompt: 'tell me about the cat', Reward: 0.00
  RLHF Iteration 40/100, Prompt: 'tell me about the hatter', Reward: 1.50
  RLHF Iteration 60/100, Prompt: 'tell me about the cat', Reward: 0.00
  RLHF Iteration 80/100, Prompt: 'who did alice follow', Reward: 1.50
  RLHF Iteration 100/100, Prompt: 'tell me about the hatter', Reward: 1.50
--- RLHF Complete ---

--- Final Model after RLHF ---
Note how the model's behavior has been 'steered' by the 'Alice' reward model.
Prompt: 'what did the cat do' -> Response: 'what did the cat do the cheshire cat smiled <end>'
Prompt: 'what about the hatter' -> Response: 'what about the hatter the hatter had a mad tea party <end>'
Prompt: 'who was alice' -> Response: 'who was alice followed the white rabbit was late <end>'

--- Talk Summary ---
1. UNTRAINED: Random garbage.
2. PRE-TRAINED: Knew 'Alice' facts, but just completed sentences.
3. SFT: Knew how to *answer specific questions* (e.g., 'the cheshire cat smiled').
4. RLHF: Kept the high-reward SFT answers, but did not fix the unseen prompt
   'who was alice': its answer is unchanged from the SFT model.
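
The next cell replaces the LSTM with scaled dot-product self-attention. As a warm-up, the sketch below works through attention for a single query vector in plain Python, using tiny made-up vectors so each number can be checked by hand. The function names are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(query)
    # Similarity of the query to every key, scaled by sqrt(d_k)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    # Context vector = weighted sum of the value vectors
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, context

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
query = [4.0, 0.0]  # points in the same direction as the first key

weights, context = attend(query, keys, values)
print(weights)   # most of the weight goes to the first key
print(context)   # so the context is dominated by the first value
```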

In [28]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import math

# --- Data Preparation ---

# A simple, small corpus from Alice in Wonderland
corpus = """
alice was beginning to get very tired of sitting by her sister on the bank,
and of having nothing to do: once or twice she had peeped into the book her
sister was reading, but it had no pictures or conversations in it, and what
is the use of a book, thought alice without pictures or conversations?
"""

# Tokenization and Vocabulary
class SimpleTokenizer:
    def __init__(self, text):
        # Convert all text to lower case and split by whitespace
        tokens = text.lower().split()
        
        # Create a set of unique words
        vocab = sorted(list(set(tokens)))
        
        # Map words to unique integers (token IDs)
        self.word_to_idx = {word: i + 1 for i, word in enumerate(vocab)}
        self.idx_to_word = {i + 1: word for i, word in enumerate(vocab)}
        self.vocab_size = len(vocab) + 1 # +1 for padding/unknown token (index 0)

    def encode(self, text, max_len=None):
        tokens = text.lower().split()
        encoded = [self.word_to_idx.get(word, 0) for word in tokens]
        
        if max_len:
            # Pad or truncate the sequence
            if len(encoded) < max_len:
                encoded += [0] * (max_len - len(encoded))
            else:
                encoded = encoded[:max_len]
        return encoded

    def decode(self, encoded):
        # Decode only if the index is not 0 (padding/unknown)
        return ' '.join([self.idx_to_word[idx] for idx in encoded if idx != 0 and idx in self.idx_to_word])

# Create sequences for next-word prediction
def create_sequences(tokenizer, corpus, seq_len=4):
    # encode() lowercases and splits the text itself
    encoded_tokens = tokenizer.encode(corpus)
    
    inputs, targets = [], []
    for i in range(len(encoded_tokens) - seq_len):
        # Input: sequence of length seq_len
        input_seq = encoded_tokens[i:i + seq_len]
        # Target: the token immediately following the input sequence
        target_token = encoded_tokens[i + seq_len]
        
        inputs.append(input_seq)
        targets.append(target_token)
        
    return torch.tensor(inputs), torch.tensor(targets)

# --- Transformer Core Component: Scaled Dot-Product Self-Attention ---

class SelfAttention(nn.Module):
    """
    A simplified single-head Self-Attention mechanism, the heart of the Transformer.
    This replaces the sequential processing of an RNN/LSTM.
    """
    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim
        
        # Linear layers to project the input embedding into Query, Key, and Value vectors
        self.W_q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.W_v = nn.Linear(embed_dim, embed_dim, bias=False)
        
    def forward(self, x):
        # x shape: (batch_size, seq_len, embed_dim)
        
        # 1. Project to Q, K, V
        Q = self.W_q(x) # Query: What am I looking for?
        K = self.W_k(x) # Key: What do I have?
        V = self.W_v(x) # Value: What context should I pass?
        
        # 2. Compute Attention Scores (Scaled Dot-Product)
        # Q @ K.transpose(-2, -1) calculates the similarity between all pairs of words.
        # Scaling by sqrt(d_k) stabilizes the gradient.
        d_k = Q.size(-1)
        attention_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
        
        # 3. Normalize Scores
        # Softmax turns scores into probability weights (0 to 1)
        attention_weights = F.softmax(attention_scores, dim=-1)
        
        # 4. Apply Weights to Values
        # The output is a weighted sum of the Value vectors (Context Vector)
        context_vector = torch.matmul(attention_weights, V)
        
        return context_vector # shape: (batch_size, seq_len, embed_dim)


# --- The Toy LLM using Attention ---

class ToyAttentionModel(nn.Module):
    """
    A minimal LLM architecture using an embedding layer followed by Self-Attention.
    """
    def __init__(self, vocab_size, embed_dim, seq_len):
        super().__init__()
        self.vocab_size = vocab_size
        self.seq_len = seq_len
        
        # 1. Embedding Layer: Converts token IDs to dense vectors
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        
        # 2. Positional Encoding (Crucial for Transformers)
        # Since attention processes all words at once, we must inject their order.
        self.positional_encoding = nn.Embedding(seq_len, embed_dim)

        # 3. Self-Attention Block (The new core)
        self.attention = SelfAttention(embed_dim)

        # 4. Final Linear Layer: Maps the context vector back to the vocab size
        self.fc = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        # x shape: (batch_size, seq_len)
        batch_size, seq_len = x.size()

        # 1. Look up word embeddings
        word_embeddings = self.embedding(x) # (batch_size, seq_len, embed_dim)
        
        # 2. Add Positional Encoding
        # Generate position indices (0, 1, 2, 3...)
        positions = torch.arange(seq_len, dtype=torch.long, device=x.device).unsqueeze(0).repeat(batch_size, 1)
        pos_embeddings = self.positional_encoding(positions)
        
        # Transformer Input = Word Embedding + Positional Encoding
        x = word_embeddings + pos_embeddings
        
        # 3. Pass through Self-Attention
        context_vector = self.attention(x) # (batch_size, seq_len, embed_dim)
        
        # For next-word prediction, we only care about the last word's context
        # to predict the next word in the sequence.
        last_context = context_vector[:, -1, :] # (batch_size, embed_dim)

        # 4. Output: Predict the next token
        output = self.fc(last_context) # (batch_size, vocab_size)
        
        return output

# --- Training and Inference Functions ---

def simulate_pretraining(model, inputs, targets, epochs=3000): # Increased epochs
    """
    Simulates the foundational pre-training phase (Next-Word Prediction).
    """
    model.train() # Back to training mode in case generate_text() called model.eval() earlier
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
    
    print(f"Starting pre-training with {epochs} epochs...")
    for epoch in range(epochs):
        # 1. Forward pass
        outputs = model(inputs)
        
        # 2. Calculate Loss (Error)
        loss = criterion(outputs, targets)
        
        # 3. Backpropagation (Find Blame)
        optimizer.zero_grad()
        loss.backward()
        
        # 4. Optimization (Adjust Weights)
        optimizer.step()
        
        if (epoch + 1) % 500 == 0:
            print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')
    
    print("Pre-training complete.")
    return model

def generate_text(model, tokenizer, start_phrase, seq_len=4, max_tokens=10, temperature=0.8):
    """
    Generates text from the trained model. Each token is sampled from the
    temperature-scaled distribution (rather than taken by argmax), which
    reduces the chance of getting stuck in repetitive loops.
    """
    model.eval() # Set model to evaluation mode
    generated_tokens = tokenizer.encode(start_phrase, max_len=seq_len)
    
    if len(generated_tokens) < seq_len:
        print("Error: Start phrase is too short for the required sequence length.")
        return start_phrase

    print(f"\n--- Generating Text (Max {max_tokens} tokens) ---")
    print(f"Start: '{start_phrase}'")
    
    output_tokens = generated_tokens
    
    with torch.no_grad(): # Disable gradient calculations during inference
        for _ in range(max_tokens):
            # 1. Prepare the current sequence input
            current_sequence = torch.tensor([output_tokens[-seq_len:]])
            
            # 2. Get prediction from the Attention Model
            output = model(current_sequence) # (1, vocab_size)
            
            # 3. Apply Temperature (dividing logits by T)
            output = output / temperature
            
            # 4. Convert to probabilities and sample (instead of argmax)
            probabilities = torch.softmax(output, dim=-1) # same result as F.softmax, no extra import needed
            predicted_token_id = torch.multinomial(probabilities, num_samples=1).item()
            
            if predicted_token_id == 0: # Stop on padding/unknown
                break
            
            # 5. Add the new token to the sequence for the next step
            output_tokens.append(predicted_token_id)
            
            # Stop early if the last seq_len tokens are all the same token (a degenerate loop)
            if len(output_tokens) > 2 * seq_len and len(set(output_tokens[-seq_len:])) < 2:
                break

    # Decode and return the result
    return tokenizer.decode(output_tokens)

# --- Main Execution ---

if __name__ == '__main__':
    # Define hyper-parameters
    EMBED_DIM = 32    # Size of the word vector (Increased from 16)
    SEQ_LEN = 4       # Number of words the model looks at to predict the next word
    EPOCHS = 3000     # Number of training iterations (Increased from 1500)

    # 1. Data Setup
    tokenizer = SimpleTokenizer(corpus)
    VOCAB_SIZE = tokenizer.vocab_size
    
    inputs, targets = create_sequences(tokenizer, corpus, seq_len=SEQ_LEN)
    
    print(f"Vocabulary Size: {VOCAB_SIZE}")
    print(f"Sequence Length (context window): {SEQ_LEN}")
    print(f"Total training examples: {len(inputs)}")

    # 2. Model Initialization (Using the new Attention Model)
    model = ToyAttentionModel(VOCAB_SIZE, EMBED_DIM, SEQ_LEN)
    
    # 3. Training
    model = simulate_pretraining(model, inputs, targets, epochs=EPOCHS)
    
    # 4. Inference (Generate text)
    start_phrase = "alice was beginning to get"
    generated_text = generate_text(model, tokenizer, start_phrase, seq_len=SEQ_LEN, max_tokens=20)
    
    print("\nResulting Text:")
    print(generated_text)
    
    # Simple summary of the attention core
    print("\n--- ATTENTION MECHANISM SUMMARY ---")
    print("In the SelfAttention class, the key operation is:")
    print("1. Q, K, V Projections: Maps input vector (x) into three roles.")
    print("2. Scoring: torch.matmul(Q, K.transpose) calculates pairwise similarity.")
    print("3. Weighting: Softmax turns these scores into probabilistic attention weights.")
    print("4. Context: torch.matmul(Weights, V) sums the V vectors based on the computed weights.")

Vocabulary Size: 44
Sequence Length (context window): 4
Total training examples: 53
Starting pre-training with 3000 epochs...
Epoch [500/3000], Loss: 0.1051
Epoch [1000/3000], Loss: 0.0786
Epoch [1500/3000], Loss: 0.0785
Epoch [2000/3000], Loss: 0.0785
Epoch [2500/3000], Loss: 0.0785
Epoch [3000/3000], Loss: 0.0785
Pre-training complete.

--- Generating Text (Max 20 tokens) ---
Start: 'alice was beginning to get'

Resulting Text:
alice was beginning to get very tired of sitting nothing to do: once or twice she had peeped into the book her sister on

--- ATTENTION MECHANISM SUMMARY ---
In the SelfAttention class, the key operation is:
1. Q, K, V Projections: Maps input vector (x) into three roles.
2. Scoring: torch.matmul(Q, K.transpose) calculates pairwise similarity.
3. Weighting: Softmax turns these scores into probabilistic attention weights.
4. Context: torch.matmul(Weights, V) sums the V vectors based on the computed weights.
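
The four steps in the summary above can be traced on a tiny random input. This is a minimal sketch, not the notebook's SelfAttention class: the layer names (q_proj, k_proj, v_proj) and the toy sizes are hypothetical. The division by sqrt(embed_dim) follows the standard scaled dot-product convention and is an assumption here, since the summary does not mention it.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen only for illustration
batch_size, seq_len, embed_dim = 1, 4, 8
x = torch.randn(batch_size, seq_len, embed_dim)

# 1. Q, K, V projections: three separate linear maps of the same input
#    (hypothetical layer names, not taken from the notebook)
q_proj = nn.Linear(embed_dim, embed_dim)
k_proj = nn.Linear(embed_dim, embed_dim)
v_proj = nn.Linear(embed_dim, embed_dim)
Q, K, V = q_proj(x), k_proj(x), v_proj(x)

# 2. Scoring: pairwise similarity between every query and every key
#    (scaling by sqrt(embed_dim) is assumed, see the note above)
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(embed_dim)

# 3. Weighting: softmax over the key dimension, so each row sums to 1
attention_weights = torch.softmax(scores, dim=-1)

# 4. Context: weighted sum of the value vectors
context_vector = torch.matmul(attention_weights, V)

print("scores:", tuple(scores.shape))
print("row sums of weights:", attention_weights.sum(dim=-1))
print("context:", tuple(context_vector.shape))
```

The score matrix has one row per query position and one column per key position, which is why the softmax is taken over the last dimension.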

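The generate_text function above divides the logits by a temperature before sampling. The sketch below shows the effect of that division in plain Python so the arithmetic stays visible. The helper name softmax_with_temperature and the three example logits are hypothetical and do not come from the notebook.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide each logit by the temperature, then apply a numerically stable softmax."""
    scaled = [value / temperature for value in logits]
    max_scaled = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(value - max_scaled) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical logits for three candidate tokens
logits = [2.0, 1.0, 0.0]

for temperature in (0.5, 0.8, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(f"T={temperature}: {[round(p, 3) for p in probs]}")
```

Temperatures below 1 concentrate probability on the highest-scoring token, while temperatures above 1 flatten the distribution, so sampling becomes more varied.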