# Project 1: Build an LLM Playground

Welcome to your first project! In this project, you'll build a simple large language model (LLM) playground, an interactive environment where you can experiment with LLMs and understand how they work under the hood.

The goal here is to understand the foundations and mechanics behind LLMs rather than relying on higher-level abstractions or frameworks. You'll see what happens “under the hood”: how an LLM receives text, processes it, and generates a response. In later projects, you'll use frameworks like Ollama and LangChain that simplify many of these steps. But before that, this project will help you build a solid mental model of how LLMs actually work.

We'll use Google Colab, a free browser-based platform that lets you run Python code and machine learning models without installing anything locally. Click the button below to open this notebook in Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/bytebyteai/ai-eng-projects-2/blob/main/project_1/lm_playground.ipynb)

If you prefer to run the project locally, you can use the provided `env.yaml` file to create a compatible environment using conda. To do so, open a terminal in the same directory as this notebook and run:

```bash
# Create and activate the conda environment
conda env create -f env.yaml && conda activate llm_playground

# Register this environment as a Jupyter kernel
python -m ipykernel install --user --name=llm_playground --display-name "llm_playground"
```


---
## Learning Objectives  
- Understand tokenization and how raw text is converted into a sequence of discrete tokens
- Inspect GPT-2 and the Transformer architecture
- Learn how to load pretrained LLMs using Hugging Face
- Explore decoding strategies to generate text from LLMs
- Compare completion models with instruction-tuned models


Let's get started!

In [None]:
# Confirm required libraries are installed and working.
import torch, transformers, tiktoken, re
print("torch", torch.__version__, "| transformers", transformers.__version__)
print("✅ Environment check complete. You're good to go!")

# 1 - Tokenization

A neural network cannot process raw text directly. It needs numbers.
Tokenization is the process of converting text into numerical IDs that models can understand. In this section, you will learn how tokenization works in practice and why it is an essential step in every language model pipeline.

Tokenization methods generally fall into three main categories:
1. Word-level
2. Character-level
3. Subword-level

### 1.1 - Word-level tokenization
This method splits text by whitespace and treats each word as a single token. In the next cell, you will implement a basic word-level tokenizer by building a vocabulary that maps words to IDs and writing `encode` and `decode` functions.

In [None]:
# Creating a tiny corpus. In practice, a corpus is generally the entire internet-scale dataset used for training.
import re
corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Tokenization converts text to numbers",
    "Large language models predict the next token"
]

# Step 1: Build vocabulary (all unique words in the corpus) and mappings
vocab = []
word2id = {}
id2word = {}

#---Tokenizer---

def word_tokenizer(text):
    # Clean punctuation except apostrophes, lowercase, split on whitespace
    cleaned = re.sub(r"[^\w\s']", '', text.lower())
    return cleaned.split()

# def extend_dictionary(sentence):
#   for sentence in corpus:
#     tokens = word_tokenizer(sentence)
#     vocab.extend(tokens)

#---Remove Duplicates-----

def remove_duplicates(tokens):
    seen = set()
    unique_tokens = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            unique_tokens.append(token)
    return unique_tokens

#----Vocab Builder-----

def build_vocab(tokens, base_vocab = None):
  if base_vocab is None:
    word2id = {'<PAD>':0, '<UNK>':1}
    id2word = {0:'<PAD>', 1:'<UNK>'}
    next_index = 2
  else:
    word2id = base_vocab['word2id']
    id2word = base_vocab['id2word']
    next_index = max(word2id.values()) + 1
  for token in tokens:
    if token not in word2id:
      word2id[token] = next_index
      id2word[next_index] = token
      next_index += 1

  return {'word2id': word2id, 'id2word': id2word}


Vocabulary size: 19 words
First 15 vocab entries: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog', 'tokenization', 'converts', 'text', 'to', 'numbers', 'large', 'language']


In [None]:
# Step 2: Define encode and decode functions
def encode(word2id, tokens):
    # Map each token to its ID; unseen tokens fall back to <UNK>
    return [word2id.get(token, word2id['<UNK>']) for token in tokens]

def decode(ids, id2word):
    # Map each ID back to its token; returns a list of tokens, not a joined string
    return [id2word.get(idx, "<UNK>") for idx in ids]


In [2]:
import re

# Redefining corpus and functions locally for self-containment
corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Tokenization converts text to numbers",
    "Large language models predict the next token"
]

def word_tokenizer(text):
    # Clean punctuation except apostrophes, lowercase, split on whitespace
    cleaned = re.sub(r"[^\w\s']", '', text.lower())
    return cleaned.split()

def remove_duplicates(tokens):
    seen = set()
    unique_tokens = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            unique_tokens.append(token)
    return unique_tokens

def build_vocab(tokens, base_vocab = None):
  if base_vocab is None:
    word2id = {'<PAD>':0, '<UNK>':1}
    id2word = {0:'<PAD>', 1:'<UNK>'}
    next_index = 2
  else:
    word2id = base_vocab['word2id']
    id2word = base_vocab['id2word']
    next_index = max(word2id.values()) + 1
  for token in tokens:
    if token not in word2id:
      word2id[token] = next_index
      id2word[next_index] = token
      next_index += 1

  return {'word2id': word2id, 'id2word': id2word}

def encode(word2id, tokens):
    # Map each token to its ID; unseen tokens fall back to <UNK>
    return [word2id.get(token, word2id['<UNK>']) for token in tokens]

def decode(ids, id2word):
    # Map each ID back to its token; returns a list of tokens, not a joined string
    return [id2word.get(idx, "<UNK>") for idx in ids]


test_sentence = "Hi, How are you doing? Are you doing well?"

tokens_from_corpus = []
for sentence in corpus:
  tokens_from_corpus.extend(word_tokenizer(sentence))

unique_tokens_from_corpus = remove_duplicates(tokens_from_corpus)
vocab = build_vocab(unique_tokens_from_corpus)

# Tokenize the test sentence to see what words it contains
tokens_from_test_sentence = word_tokenizer(test_sentence)

# Now, let's test encoding/decoding with a sample that might contain unknown words
sample_text_to_encode = "Hi, Unknown word are you?"
sample_tokens = word_tokenizer(sample_text_to_encode)

# Correctly call encode: pass word2id first, then the list of tokens
encoded = encode(vocab['word2id'], sample_tokens)
decoded = decode(encoded, vocab['id2word'])

print(f"Sample Text: '{sample_text_to_encode}'")
print("Encoded (using corpus vocab):", encoded)
print("Decoded (using corpus vocab):", decoded)

# Print vocab from corpus
print("\n--- Vocabulary from Corpus ---")
print("word2id:")
for word, idx in vocab['word2id'].items():
    print(f"{word}: {idx}")

print("\nid2word:")
for idx, word in vocab['id2word'].items():
    print(f"{idx}: {word}")

# Optionally, you could build an extended vocabulary including words from test_sentence
# new_unique_tokens = remove_duplicates(tokens_from_test_sentence)
# new_vocab = build_vocab(new_unique_tokens, base_vocab=vocab)
# print("\n--- Extended Vocabulary (if built) ---")
# for word, idx in new_vocab['word2id'].items():
#    print(f"{word}: {idx}")

Sample Text: 'Hi, Unknown word are you?'
Encoded (using corpus vocab): [1, 1, 1, 1, 1]
Decoded (using corpus vocab): ['<UNK>', '<UNK>', '<UNK>', '<UNK>', '<UNK>']

--- Vocabulary from Corpus ---
word2id:
<PAD>: 0
<UNK>: 1
the: 2
quick: 3
brown: 4
fox: 5
jumps: 6
over: 7
lazy: 8
dog: 9
tokenization: 10
converts: 11
text: 12
to: 13
numbers: 14
large: 15
language: 16
models: 17
predict: 18
next: 19
token: 20

id2word:
0: <PAD>
1: <UNK>
2: the
3: quick
4: brown
5: fox
6: jumps
7: over
8: lazy
9: dog
10: tokenization
11: converts
12: text
13: to
14: numbers
15: large
16: language
17: models
18: predict
19: next
20: token


While word-level tokenization is simple and easy to understand, it has two key limitations that make it impractical for large-scale models:
1. Large vocabulary size: every new word or variation (for example, run, runs, running) adds a vocabulary entry, leading to higher memory and training costs.
2. Out-of-vocabulary (OOV) problem: the model cannot handle unseen or rare words that were not part of the training vocabulary, so they must be replaced with a generic `<UNK>` token.
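
Both limitations are easy to reproduce. The sketch below is only an illustration: it uses a hypothetical three-sentence toy corpus (not the corpus above) and helper names prefixed with `toy_` so that nothing defined in the earlier cells is overwritten.

```python
# Hypothetical toy corpus used only to illustrate the two limitations.
toy_corpus = ["i run", "she runs", "they are running"]

# Build a word-level vocabulary with <PAD> and <UNK> specials.
toy_word2id = {"<PAD>": 0, "<UNK>": 1}
for toy_sentence in toy_corpus:
    for word in toy_sentence.lower().split():
        if word not in toy_word2id:
            toy_word2id[word] = len(toy_word2id)

def toy_encode(text):
    """Map each whitespace-separated word to its ID, or to <UNK> if unseen."""
    return [toy_word2id.get(w, toy_word2id["<UNK>"]) for w in text.lower().split()]

# Limitation 1: every inflection of "run" occupies its own vocabulary entry.
run_forms = [w for w in toy_word2id if w.startswith("run")]
print("Separate entries for one verb:", run_forms)

# Limitation 2: an unseen form ("ran") can only be represented as <UNK>.
print("Encoded 'she ran':", toy_encode("she ran"))
```

The three forms of "run" share no parameters in a word-level model, and "ran" loses its identity entirely once it is mapped to `<UNK>`.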

The next section introduces character-level tokenization, where text is represented as individual characters instead of words.

### 1.2 - Character-level tokenization

In this approach, every single character (including spaces, punctuation, and even emojis) is assigned its own ID.

In the next section, we will rebuild a tokenizer using the same corpus as before, but this time with a character-level approach.
For simplicity, assume we are only using lowercase and uppercase English letters (a-z, A-Z).

In [3]:
import string

# Step 1: Create a vocabulary that includes all uppercase and lowercase letters.
char2id = {'<PAD>': 0, '<UNK>': 1}
id2char = {0: '<PAD>', 1: '<UNK>'}
next_id = 2

# Add lowercase letters
for char_code in range(ord('a'), ord('z') + 1):
    char = chr(char_code)
    char2id[char] = next_id
    id2char[next_id] = char
    next_id += 1

# Add uppercase letters
for char_code in range(ord('A'), ord('Z') + 1):
    char = chr(char_code)
    char2id[char] = next_id
    id2char[next_id] = char
    next_id += 1

print(f"Vocabulary size: {len(char2id)} (52 letters + 2 specials)")


Vocabulary size: 54 (52 letters + 2 specials)


In [4]:
# Step 2: Implement encode() and decode() functions to convert between text and IDs.
def encode(text):
    # convert text to list of IDs
    return [char2id.get(char, char2id['<UNK>']) for char in text]

def decode(ids):
    # Convert list of IDs to text
    return "".join([id2char.get(idx, '<UNK>') for idx in ids])


In [5]:
# Step 3: Test your tokenizer on a short sample word.
sample_word = "HelloWorld"
encoded_word = encode(sample_word)
decoded_word = decode(encoded_word)

print(f"Original word: {sample_word}")
print(f"Encoded IDs: {encoded_word}")
print(f"Decoded word: {decoded_word}")

# Test with a character not in the initial vocab (e.g., a number or symbol)
sample_with_unk = "Hello123World!"
encoded_unk = encode(sample_with_unk)
decoded_unk = decode(encoded_unk)

print(f"\nOriginal word with UNK: {sample_with_unk}")
print(f"Encoded IDs with UNK: {encoded_unk}")
print(f"Decoded word with UNK: {decoded_unk}")


Original word: HelloWorld
Encoded IDs: [35, 6, 13, 13, 16, 50, 16, 19, 13, 5]
Decoded word: HelloWorld

Original word with UNK: Hello123World!
Encoded IDs with UNK: [35, 6, 13, 13, 16, 1, 1, 1, 50, 16, 19, 13, 5, 1]
Decoded word with UNK: Hello<UNK><UNK><UNK>World<UNK>


Character-level tokenization solves the out-of-vocabulary problem but introduces new challenges:

1. Longer sequences: because each word becomes many tokens, models need to process much longer inputs.
2. Weaker semantic representation: individual characters carry very little meaning, so models must learn relationships across many steps.
3. Higher computational cost: longer sequences lead to more tokens per input, which increases training and inference time.
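
To get a feel for the first point, the following sketch compares the number of tokens produced by a whitespace (word-level) split and a character-level split of the first corpus sentence. The variable names are illustrative and not part of the notebook's earlier cells.

```python
# Compare sequence lengths for word-level and character-level splits.
pangram = "The quick brown fox jumps over the lazy dog"

word_tokens = pangram.split()   # one token per whitespace-separated word
char_tokens = list(pangram)     # one token per character, spaces included

print("Word-level tokens     :", len(word_tokens))
print("Character-level tokens:", len(char_tokens))
print("Length ratio          :", round(len(char_tokens) / len(word_tokens), 2))
```

Because the cost of self-attention grows roughly quadratically with sequence length, a sequence that is several times longer is considerably more expensive to process.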

To find a better balance between vocabulary size and sequence length, we move to subword-level tokenization next.

### 1.3 - Subword-level tokenization

Subword methods such as `Byte-Pair Encoding (BPE)`, `WordPiece`, and `SentencePiece` **learn** common groups of characters and merge them into tokens. For example, the word **unbelievable** might turn into three tokens: **["un", "believ", "able"]**. This approach strikes a balance between word-level and character-level methods and mitigates the limitations of both.

The BPE algorithm builds a vocabulary iteratively using the following process:
1. Start with individual characters (each character is a token).
2. Count all adjacent pairs of tokens in a large text corpus.
3. Merge the most frequent pair into a new token.

Repeat steps 2 and 3 until you reach the desired vocabulary size (for example, 50,000 tokens).
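
The loop above can be sketched in a few lines of Python. This is a simplified illustration, not the GPT-2 implementation: it works on characters rather than bytes, skips pre-tokenization, stops after a fixed number of merges instead of a target vocabulary size, and breaks ties by taking the first pair encountered. The word frequencies are a made-up toy example, and all function names are hypothetical.

```python
from collections import Counter

def get_pair_counts(words):
    """Step 2: count adjacent token pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(words, pair):
    """Step 3: replace every occurrence of `pair` with one merged token."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    # Step 1: start with individual characters as tokens.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # most frequent pair (first one wins ties)
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

# Toy word frequencies (made up for illustration).
toy_word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = train_bpe(toy_word_freqs, num_merges=4)

print("Learned merges:", merges)
for symbols, freq in segmented.items():
    print(freq, symbols)
```

After four merges the frequent suffix `est` and the frequent word `low` have become single tokens, while rarer character sequences remain split.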

In the next cell, you will experiment with BPE in practice to see how it compresses text into meaningful subword units. Instead of implementing the algorithm from scratch, you will use a pretrained tokenizer, which was already trained on a large text corpus to build its vocabulary, such as the data used to train `GPT-2`. This allows you to see how BPE works in practice with a real, learned vocabulary.

In [8]:
from transformers import AutoTokenizer

# Step 1: Load a pretrained GPT-2 tokenizer from Hugging Face.
# Refer to this to learn more: https://huggingface.co/docs/transformers/en/model_doc/gpt2

tokenizer = AutoTokenizer.from_pretrained("gpt2")


In [None]:
# Step 2: Use it to write encode and decode helper functions
def encode(text):
    return tokenizer.encode(text)

def decode(ids):
    return tokenizer.decode(ids)

In [9]:
# Step 3: Inspect the tokens to see how BPE breaks words apart.
# Run the helper cell above first. Otherwise `encode` and `decode` still refer to the
# character-level versions from Section 1.2 and the result will contain <UNK> entries.
sample = "Unbelievable tokenization powers! 🚀"
encoded_sample = encode(sample)
decoded_sample = decode(encoded_sample)

print(f"Original sample: {sample}")
print(f"Encoded IDs: {encoded_sample}")
print(f"Decoded sample: {decoded_sample}")
print(f"Tokens: {[tokenizer.decode([idx]) for idx in encoded_sample]}")


### 1.4 - TikToken

`tiktoken` is a fast, production-ready library for tokenization used by OpenAI models.
It is designed for efficiency and consistency with how OpenAI counts tokens in GPT models.

In this section, you will explore how different model families use different tokenizers. We will compare tokenizers used to train `GPT-2` and more powerful models such as `GPT-4`. By trying both, you will see how tokenization has evolved to handle more diverse text (including emojis, Unicode, and special characters) while remaining efficient.

In the next cell, you will use tiktoken to load these encodings and inspect how each one splits the same text. You may find reading this doc helpful: https://github.com/openai/tiktoken

In [13]:
import tiktoken

# Compare GPT-2 and GPT-4 tokenizers using tiktoken.

# Step 1: Load two tokenizers
enc_gpt2 = tiktoken.encoding_for_model("gpt-2")
enc_gpt4 = tiktoken.get_encoding("cl100k_base") # This is the encoding for gpt-4, gpt-3.5-turbo, text-embedding-ada-002

print(f"GPT-2 Tokenizer: {enc_gpt2}")
print(f"GPT-4 Tokenizer (cl100k_base): {enc_gpt4}")

# Step 2: Encode the same sentence with both and observe how they differ
sentence = "Hello, world! How are you doing today? 😊"

encoded_gpt2 = enc_gpt2.encode(sentence)
encoded_gpt4 = enc_gpt4.encode(sentence)

decoded_gpt2 = enc_gpt2.decode(encoded_gpt2)
decoded_gpt4 = enc_gpt4.decode(encoded_gpt4)

print(f"\nOriginal sentence: '{sentence}'")

print(f"\nGPT-2 Encoded IDs: {encoded_gpt2}")
print(f"GPT-2 Decoded: '{decoded_gpt2}'")
print(f"GPT-2 Tokens: {[enc_gpt2.decode_single_token_bytes(t) for t in encoded_gpt2]}")
print(f"Number of tokens (GPT-2): {len(encoded_gpt2)}")

print(f"\nGPT-4 Encoded IDs: {encoded_gpt4}")
print(f"GPT-4 Decoded: '{decoded_gpt4}'")
print(f"GPT-4 Tokens: {[enc_gpt4.decode_single_token_bytes(t) for t in encoded_gpt4]}")
print(f"Number of tokens (GPT-4): {len(encoded_gpt4)}")

GPT-2 Tokenizer: <Encoding 'gpt2'>
GPT-4 Tokenizer (cl100k_base): <Encoding 'cl100k_base'>

Original sentence: 'Hello, world! How are you doing today? 😊'

GPT-2 Encoded IDs: [15496, 11, 995, 0, 1374, 389, 345, 1804, 1909, 30, 30325, 232]
GPT-2 Decoded: 'Hello, world! How are you doing today? 😊'
GPT-2 Tokens: [b'Hello', b',', b' world', b'!', b' How', b' are', b' you', b' doing', b' today', b'?', b' \xf0\x9f\x98', b'\x8a']
Number of tokens (GPT-2): 12

GPT-4 Encoded IDs: [9906, 11, 1917, 0, 2650, 527, 499, 3815, 3432, 30, 27623, 232]
GPT-4 Decoded: 'Hello, world! How are you doing today? 😊'
GPT-4 Tokens: [b'Hello', b',', b' world', b'!', b' How', b' are', b' you', b' doing', b' today', b'?', b' \xf0\x9f\x98', b'\x8a']
Number of tokens (GPT-4): 12


Try changing the input sentence and observe how different tokenizers behave.
Experiment with:
- Emojis, special characters, or punctuation
- Code snippets or structured text
- Non-English text (for example, Japanese, French, or Arabic)
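
When you try these inputs, it helps to remember that both encodings operate on UTF-8 bytes rather than on characters. The short sketch below (illustrative names, standard library only) shows how many bytes a few sample characters occupy, which is why the emoji in the output above was split into the byte pieces `b' \xf0\x9f\x98'` and `b'\x8a'`.

```python
# Number of UTF-8 bytes needed for characters from different scripts.
byte_samples = {
    "ascii": "A",
    "french": "\u00e9",       # é
    "japanese": "\u65e5",     # 日
    "emoji": "\U0001F60A",    # 😊
}

byte_lengths = {name: len(ch.encode("utf-8")) for name, ch in byte_samples.items()}
for name, n_bytes in byte_lengths.items():
    print(f"{name:9s} {byte_samples[name]!r:6s} -> {n_bytes} byte(s)")

# The emoji's raw bytes match the pieces shown in the tokenizer output.
print(byte_samples["emoji"].encode("utf-8"))
```

Characters that need more bytes and appear rarely in the training data tend to be split into more tokens.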

If you are curious, you can also attempt to implement the BPE algorithm yourself using a small text corpus to see how token merges are learned in practice.

### 1.5 - Key Takeaways
- **Word-level**: simple and intuitive, but limited by large vocabularies and out-of-vocabulary issues
- **Character-level**: flexible and covers all text, but produces long sequences that are harder to model
- **Subword / BPE**: balances both worlds and is the default choice for most modern LLMs
- **TikToken**: a production-ready tokenizer used in OpenAI models, demonstrating how optimized subword vocabularies are applied in real systems

# 2. What is a Language Model?

At its core, a **language model (LM)** is just a *very large* mathematical function built from many neural-network layers.  
Given a sequence of tokens `[t₁, t₂, …, tₙ]`, it learns to output a probability for the next token `tₙ₊₁`.


Each layer performs basic mathematical operations such as matrix multiplication and attention. When many of these layers are stacked together, the model learns complex patterns and statistical relationships in text. The final output is a vector of scores that represents how likely each possible token is to appear next. You can think of the entire model as one giant equation whose parameters were optimized during training to minimize prediction errors.

### 2.1 - A Single `Linear` Layer

Before jumping into Transformers, let's start with the simplest building block: a `Linear` layer.

A Linear layer computes `y = Wx + b`.

Where:  
  * `x` - input vector  
  * `W` - weight matrix (learned)  
  * `b` - bias vector (learned)

Although this operation looks simple, stacking many linear layers (along with nonlinear activation functions) allows neural networks to model highly complex relationships in data.

In the next cell, you will explore how a **Linear layer** works in practice by implementing one from scratch. You will define the weights and bias, then perform the matrix multiplication and addition manually to see what happens inside this layer. You may find the following links useful:
- https://docs.pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html
- https://docs.pytorch.org/docs/stable/generated/torch.randn.html
- https://docs.pytorch.org/docs/stable/generated/torch.matmul.html

In [14]:
import torch
import torch.nn as nn

# Define a MyLinear PyTorch module and perform y = Wx + b.

class MyLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super(MyLinear, self).__init__()
        # Initialize weights and bias as learnable parameters.
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.randn(out_features))

    def forward(self, x):
        # Matrix multiplication followed by bias addition
        return torch.matmul(x, self.weight.T) + self.bias


lin = MyLinear(3, 2)
x = torch.tensor([1.0, -1.0, 0.5])
print("Input :", x)
print("Weights:", lin.weight)
print("Bias   :", lin.bias)
print("Output :", lin(x))

Input : tensor([ 1.0000, -1.0000,  0.5000])
Weights: Parameter containing:
tensor([[ 0.7918,  1.7473, -0.3448],
        [ 0.0348,  0.4491, -0.0134]], requires_grad=True)
Bias   : Parameter containing:
tensor([0.9643, 0.3534], requires_grad=True)
Output : tensor([-0.1636, -0.0676], grad_fn=<AddBackward0>)
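
The printed output can be checked by hand. The sketch below recomputes `y = Wx + b` in plain Python using the weights, bias, and input exactly as printed above. Since those values are rounded to four decimals, the result only agrees with the tensor output to about that precision. The variable names are illustrative.

```python
# Values copied from the printed output above (rounded to 4 decimals).
W_printed = [[0.7918, 1.7473, -0.3448],
             [0.0348, 0.4491, -0.0134]]
b_printed = [0.9643, 0.3534]
x_values = [1.0, -1.0, 0.5]

# y_i = sum_j W[i][j] * x[j] + b[i]
y_manual = [
    sum(w_ij * x_j for w_ij, x_j in zip(row, x_values)) + b_i
    for row, b_i in zip(W_printed, b_printed)
]
print([round(v, 4) for v in y_manual])
```

Each output element is just a dot product between one row of `W` and the input, plus the matching bias entry.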


Next, you will use PyTorch's built-in nn.Linear module, which performs the same computation `(y = Wx + b)` but automatically handles parameter initialization, gradient tracking, and integration with the rest of a neural network. Comparing your manual implementation with this built-in version will help you understand what a linear layer does and how deep learning frameworks make these operations easier to use.

You may find this link useful:
- https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html

In [15]:
import torch.nn as nn, torch

# Create a linear layer using pytorch's nn.Linear
linear_layer = nn.Linear(3, 2)

x = torch.tensor([1.0, -1.0, 0.5])
print("Input :", x)
print("Weights:", linear_layer.weight)
print("Bias   :", linear_layer.bias)
print("Output :", linear_layer(x))

Input : tensor([ 1.0000, -1.0000,  0.5000])
Weights: Parameter containing:
tensor([[ 0.0277,  0.1876, -0.3267],
        [-0.3215, -0.4612,  0.3162]], requires_grad=True)
Bias   : Parameter containing:
tensor([-0.1127,  0.3840], requires_grad=True)
Output : tensor([-0.4359,  0.6818], grad_fn=<ViewBackward0>)


### 2.2 - A `Transformer` Layer

Most LLMs are a **stack of identical Transformer blocks**. Each block fuses two main components:

| Step | What it does | Where it lives in code |
|------|--------------|------------------------|
| **Multi-Head Self-Attention** | Every token looks at every other token and decides *what matters*. | `block.attn` |
| **Feed-Forward Network (MLP)** | Re-mixes information token-by-token. | `block.mlp` |

In the next section, you will load `GPT-2` and inspect its first Transformer block to see these components in a real model. You will locate its layers, print their shapes and parameters, and understand how a block processes a batch of token embeddings.
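
Before looking at the real model, the attention row of the table can be made concrete with a toy example. The sketch below is a deliberately stripped-down version of single-head, causal, scaled dot-product attention: it assumes the queries, keys, and values are all equal to the input (no learned projections, no multiple heads), which is not how GPT-2 is parameterized. Function and variable names are hypothetical.

```python
import math

def softmax(scores):
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x):
    """Single-head attention with Q = K = V = x (simplifying assumption)."""
    d = len(x[0])
    outputs, all_weights = [], []
    for i, query in enumerate(x):
        # Causal mask: position i only attends to positions 0..i.
        scores = [
            sum(q * k for q, k in zip(query, x[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted average of the visible token vectors.
        mixed = [sum(w * x[j][c] for j, w in enumerate(weights)) for c in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors of dimension 2.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn_out, attn_weights = causal_self_attention(toy_tokens)

for i, w in enumerate(attn_weights):
    print(f"token {i} attends with weights {[round(v, 3) for v in w]}")
```

The first token can only attend to itself, while the last token spreads its attention over all three positions. The output keeps the same shape as the input, which is what allows blocks to be stacked.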

In [16]:
import torch
from transformers import GPT2LMHeadModel

# Step 1: load the smallest GPT-2 model (124M parameters) using the Hugging Face transformers library.
# Refer to: https://huggingface.co/docs/transformers/en/model_doc/gpt2
gpt = GPT2LMHeadModel.from_pretrained('gpt2')

# Step 2: Inspect the first Transformer block by printing it.
print(gpt.transformer.h[0])

GPT2Block(
  (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (attn): GPT2Attention(
    (c_attn): Conv1D(nf=2304, nx=768)
    (c_proj): Conv1D(nf=768, nx=768)
    (attn_dropout): Dropout(p=0.1, inplace=False)
    (resid_dropout): Dropout(p=0.1, inplace=False)
  )
  (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  (mlp): GPT2MLP(
    (c_fc): Conv1D(nf=3072, nx=768)
    (c_proj): Conv1D(nf=768, nx=3072)
    (act): NewGELUActivation()
    (dropout): Dropout(p=0.1, inplace=False)
  )
)
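
The shapes in this printout are enough to count the parameters of one block and to sanity-check the "124M" figure. The sketch below assumes that each `Conv1D(nf, nx)` holds an `nx × nf` weight plus `nf` biases and that each `LayerNorm` holds a scale and a shift vector. The number of blocks (12), the number of positions (1,024), and the fact that the output head shares its weights with the token embedding are properties of the smallest GPT-2 that are not visible in the printout, so treat them as stated assumptions.

```python
hidden = 768  # hidden size, read from the printout

def linear_params(n_in, n_out):
    """Weights plus biases of a dense layer (Conv1D in the printout)."""
    return n_in * n_out + n_out

def layernorm_params(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n

block_params = (
    layernorm_params(hidden)            # ln_1
    + linear_params(hidden, 2304)       # attn.c_attn (queries, keys, values)
    + linear_params(hidden, hidden)     # attn.c_proj
    + layernorm_params(hidden)          # ln_2
    + linear_params(hidden, 3072)       # mlp.c_fc
    + linear_params(3072, hidden)       # mlp.c_proj
)

# Assumptions for the smallest GPT-2 (not shown in the printout).
n_layers, vocab_size, n_positions = 12, 50257, 1024

total_params = (
    n_layers * block_params
    + vocab_size * hidden               # wte (shared with the output head)
    + n_positions * hidden              # wpe
    + layernorm_params(hidden)          # ln_f
)

print(f"Parameters per block: {block_params:,}")
print(f"Total parameters    : {total_params:,}")
```

Under these assumptions, roughly two thirds of the parameters sit in the Transformer blocks and most of the rest in the token embedding table.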


In this section, you will run a minimal forward pass through one GPT-2 block to understand how tokens are transformed inside the model.

In [18]:
# Step 1: Create a small dummy input with a sequence of 8 random token IDs.
input_ids = torch.randint(0, gpt.config.vocab_size, (1, 8)) # Batch size 1, sequence length 8
print(f"Input IDs shape: {input_ids.shape}")

# Step 2: Convert token IDs into embeddings
# GPT-2 uses two embedding layers:
#   - wte (word token embeddings)
#   - wpe (positional embeddings)
# Add them together to form the initial hidden representation of your input tokens.
word_embeddings = gpt.transformer.wte(input_ids)
position_ids = torch.arange(0, input_ids.shape[1], dtype=torch.long, device=input_ids.device)
position_embeddings = gpt.transformer.wpe(position_ids)
embeddings = word_embeddings + position_embeddings
print(f"Embeddings shape: {embeddings.shape}")

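# Step 3: Pass the embeddings through a single Transformer block
# This simulates one layer of computation in GPT-2.
first_block = gpt.transformer.h[0]
with torch.no_grad():  # inference only, no gradients needed
    block_output = first_block(embeddings)

# The block may return a tuple (hidden states first) or a bare tensor,
# depending on the transformers version, so handle both cases.
hidden_states = block_output[0] if isinstance(block_output, tuple) else block_output

# Step 4: Inspect the result
# The output shape should be (batch_size, sequence_length, hidden_size)
print(f"Output from first block shape: {hidden_states.shape}")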

Input IDs shape: torch.Size([1, 8])
Embeddings shape: torch.Size([1, 8, 768])
Output from first block shape: torch.Size([1, 8, 768])


### 2.3 - Inside GPT-2

GPT-2 is essentially a stack of identical Transformer blocks arranged in sequence.
Each block contains attention, feed-forward, and normalization layers that process token representations step by step.

In this section, you will print the modules inside the GPT-2 Transformer to see how these components are organized.
This will help you understand how the model scales from a single block to a full network of many layers working together.

In [19]:
# Print the name of all layers inside gpt.transformer.
# You may find this helpful: https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.named_children

print("Layers in gpt.transformer:")
for name, module in gpt.transformer.named_children():
    print(f"- {name}: {type(module)}")


Layers in gpt.transformer:
- wte: <class 'torch.nn.modules.sparse.Embedding'>
- wpe: <class 'torch.nn.modules.sparse.Embedding'>
- drop: <class 'torch.nn.modules.dropout.Dropout'>
- h: <class 'torch.nn.modules.container.ModuleList'>
- ln_f: <class 'torch.nn.modules.normalization.LayerNorm'>


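As you can see, the Transformer is organized around a list of blocks (`h`), with the embedding layers (`wte`, `wpe`) and dropout (`drop`) before it and a final LayerNorm (`ln_f`) after it. The following table summarizes what these modules do: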

| Step | What it does | Why it matters |
|------|--------------|----------------|
| **Token → Embedding** | Converts IDs to vectors | Gives the model a numeric “handle” on words |
| **Positional Encoding** | Adds “where am I?” info | Order matters in language |
| **Multi-Head Self-Attention** | Each token asks “which other tokens should I look at?” | Lets the model relate words across a sentence |
| **Feed-Forward Network** | Two stacked Linear layers with a non-linearity | Mixes information and adds depth |
| **LayerNorm & Residual** | Stabilize training and help gradients flow | Keeps very deep networks trainable |


### 2.4 - LLM output

When you pass a sequence of tokens through a language model, it produces a tensor of logits with shape
`(batch_size, seq_len, vocab_size)`.
Each position in the sequence receives a vector of scores representing how likely every possible token is to appear next. By applying a softmax function on the last dimension, these logits can be converted into probabilities that sum to 1.
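
As a small illustration of that last step, the sketch below applies softmax to three made-up logits in plain Python. The logit values and names are hypothetical. A real GPT-2 logit vector has one entry per vocabulary token.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]              # made-up scores for three candidate tokens
toy_probs = softmax(toy_logits)

print([round(p, 3) for p in toy_probs])
print("Sum:", round(sum(toy_probs), 6))
```

Softmax preserves the ranking of the logits: the largest logit always receives the largest probability.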

In the next cells, you will feed a short prompt into GPT-2, print the shape of its logits, and display the five most likely next tokens predicted for the final position in the sequence.


In [20]:
import torch, torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Step 1: Load GPT-2 model and its tokenizer
gpt = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

In [21]:
# Step 2: Tokenize input text
text = "Hello my name"
input_ids = tokenizer.encode(text, return_tensors='pt')

In [22]:
# Step 3: Pass the input IDs to the model
with torch.no_grad():
    outputs = gpt(input_ids)
logits = outputs.logits

In [23]:
# Step 4: Predict the next token
# We take the logits from the final position, apply softmax to get probabilities,
# and then extract the top 5 most likely next tokens. You may find F.softmax and torch.topk helpful in your implementation.

last_token_logits = logits[0, -1, :]
probabilities = F.softmax(last_token_logits, dim=-1)
top_5_probs, top_5_indices = torch.topk(probabilities, 5)

print(f"Input text: '{text}'")
print(f"Logits shape: {logits.shape}")
print("\nTop 5 predicted next tokens:")
for i, (prob, idx) in enumerate(zip(top_5_probs, top_5_indices)):
    token = tokenizer.decode(idx)
    print(f"{i+1}. Token: '{token}', Probability: {prob:.4f}")

Input text: 'Hello my name'
Logits shape: torch.Size([1, 3, 50257])

Top 5 predicted next tokens:
1. Token: ' is', Probability: 0.7773
2. Token: ',', Probability: 0.0373
3. Token: ''s', Probability: 0.0332
4. Token: ' was', Probability: 0.0127
5. Token: ' and', Probability: 0.0076


### 2.5 - Key Takeaway

A language model is not a black box or something mysterious.
It is a large composition of simple, understandable layers such as linear layers, attention, and normalization, trained together to predict the next token in a sequence.

By learning this next-token prediction task at scale, the model gradually develops an internal understanding of language structure, meaning, and context, which allows it to generate coherent and relevant text.

# 3 - Text Generation (Decoding)
Once a language model has been trained to predict token probabilities, we can use it to generate text.
This process is called text generation or decoding.

At each step, the model outputs a probability distribution over possible next tokens.
A decoding algorithm then selects one token based on that distribution, appends it to the sequence, and repeats the process to build text token by token. Different decoding strategies control how the model chooses the next token and how creative or deterministic the output will be. For example:
- **Greedy** decoding: always pick the token with the highest probability. Simple and consistent, but often repetitive.
- **Top-k** or **Nucleus** (top-p) sampling: randomly sample from the top few likely tokens to add variety.
- **Beam search**: explores multiple candidate continuations and keeps the best overall sequence.
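
The first two strategies can be sketched on a toy next-token distribution. The probabilities and helper names below are made up for illustration and are not Hugging Face APIs. Beam search is omitted because it tracks several partial sequences rather than making a single per-step choice.

```python
import random

# Hypothetical next-token distribution for one decoding step.
toy_next_token_probs = {" is": 0.50, " was": 0.20, ",": 0.15, " and": 0.10, " zebra": 0.05}

def greedy(probs):
    """Greedy decoding: always pick the most probable token."""
    return max(probs, key=probs.get)

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

def sample(probs, rng):
    """Draw one token at random according to the given probabilities."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
greedy_choice = greedy(toy_next_token_probs)
top2 = top_k_filter(toy_next_token_probs, k=2)
nucleus = top_p_filter(toy_next_token_probs, p=0.8)

print("Greedy          :", repr(greedy_choice))
print("Top-k (k=2) pool:", list(top2))
print("Top-p (0.8) pool:", list(nucleus))
print("One top-p sample:", repr(sample(nucleus, rng)))
```

Both filters remove the unlikely tail (here `" zebra"`) before sampling, which is how they add variety without producing very improbable tokens.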

Note: `Temperature` adjusts randomness in sampling. Higher values make outputs more diverse, while lower values make them more focused and deterministic.
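
Concretely, temperature divides the logits before the softmax is applied. The sketch below uses three made-up logits and a hypothetical helper name to show how the probability of the top token changes with temperature.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

temperature_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
top_prob_by_temperature = {}
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(temperature_logits, t)
    top_prob_by_temperature[t] = probs[0]
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

A temperature below 1 sharpens the distribution toward the top token, and a temperature above 1 flattens it so that less likely tokens are sampled more often.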

### 3.1 - Greedy decoding
In this section, you will use GPT-2 and Hugging Face's built-in generate method to produce text using the greedy decoding strategy.

In [24]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


model_id = "gpt2"
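# Prefer CUDA, then Apple Silicon (MPS), and fall back to CPU
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"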


# Step 1. Load GPT-2 model and tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

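# Step 2. Implement a text generation function using HuggingFace's generate method.
def generate(model, tokenizer, prompt, max_new_tokens=128):
    # Calling the tokenizer directly returns both input_ids and attention_mask
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    # GPT-2 has no pad token, so reuse EOS; do_sample=False selects greedy decoding
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False,
                                pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)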

In [25]:
tests = ["Once upon a time", "What is 2+2?", "Suggest a party theme."]
for prompt in tests:
    print("\n GPT-2 | Greedy")
    print(generate(model, tokenizer, prompt, max_new_tokens=80))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.



 GPT-2 | Greedy
Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and

 GPT-2 | Greedy
What is 2+2?

2+2 is the number of times you can use a spell to cast a spell.

2+2 is the number of times you can use a spell to cast a spell.

2+2 is the number of times you can use a spell to cast a spell.

2+2 is the number of times you can use a spell to cast a spell.

 GPT-2 | Greedy
Suggest a party theme.

The party theme is a simple, simple, and fun way to get your friends to join you.

The party theme is a simple, simple, and fun way to get your friends to join you. The party theme is a simple, simple, and fun way to get your friends to join you. The party theme is a simple, simple, and fun way to get your friends


Naively selecting the single most probable token at each step (known as greedy decoding) often leads to poor results in practice:
- Repetition loops: phrases like "The cat is is is..."
- Short-sighted choices: the most likely token right now might lead to incoherent text later

These issues are why more advanced decoding methods such as top-k and nucleus sampling are commonly used to make model outputs more diverse and natural.
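The short-sighted-choice problem can be reproduced on a toy model. The sketch below is an illustration under stated assumptions: the `NEXT` table is a made-up model whose next-token probabilities depend only on the previous token, and `greedy_decode` and `beam_decode` are hypothetical helpers. It contrasts greedy decoding with a small beam search (the third strategy listed in the overview) rather than with sampling.

```python
import math

# Hypothetical toy model: next-token probabilities given only the previous token.
NEXT = {
    "<s>": {"The": 0.6, "Once": 0.4},
    "The": {"cat": 0.4, "dog": 0.3, "end": 0.3},
    "Once": {"upon": 0.9, "more": 0.1},
}

def greedy_decode(start, steps):
    """Pick the single most probable token at every step."""
    seq, logp = [start], 0.0
    for _ in range(steps):
        options = NEXT[seq[-1]]
        token = max(options, key=options.get)
        logp += math.log(options[token])
        seq.append(token)
    return seq, math.exp(logp)

def beam_decode(start, steps, num_beams=2):
    """Keep the `num_beams` most probable partial sequences at every step."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for token, p in NEXT[seq[-1]].items():
                candidates.append((seq + [token], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    best_seq, best_logp = beams[0]
    return best_seq, math.exp(best_logp)

greedy_seq, greedy_prob = greedy_decode("<s>", steps=2)
beam_seq, beam_prob = beam_decode("<s>", steps=2)
print("greedy:", greedy_seq, round(greedy_prob, 2))
print("beam:  ", beam_seq, round(beam_prob, 2))
```

Greedy commits to "The" because it is the best first token, but every continuation of "The" is weak, so the full sequence ends up less probable than "Once upon", which the beam keeps alive long enough to find.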

### 3.2 - Top-k and top-p sampling
The generate function you implemented earlier can easily be extended to use different decoding strategies.

In this section, you will reimplement the same function but adapt it to support Top-k and Top-p (nucleus) sampling. These methods introduce controlled randomness, allowing the model to explore multiple plausible continuations instead of always choosing the single most likely next token.
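Before relying on the library, it can help to see the two filters written out by hand. This is a minimal sketch on a made-up next-token distribution; `top_k_filter`, `top_p_filter`, and `sample` are hypothetical helpers, and library implementations typically operate on logits and may differ in edge-case details.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

def sample(probs, rng):
    """Draw one token according to the given probabilities."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical distribution for the token after "Once upon a" (illustration only).
probs = {"time": 0.5, "day": 0.25, "night": 0.15, "banana": 0.07, "xylophone": 0.03}

rng = random.Random(0)  # seeded so repeated runs draw the same token
top_k_probs = top_k_filter(probs, k=2)
top_p_probs = top_p_filter(probs, p=0.8)
print("top-k (k=2):  ", top_k_probs)
print("top-p (p=0.8):", top_p_probs)
print("sampled from top-p set:", sample(top_p_probs, rng))
```

Top-k always keeps a fixed number of candidates, while top-p adapts: a peaked distribution leaves only a few tokens in the nucleus and a flat one leaves many.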

In [26]:
# Implement `generate` to support 3 strategies: greedy, top_k, and top_p
# You may find this link helpful: https://huggingface.co/docs/transformers/en/main_classes/text_generation

def generate(model, tokenizer, prompt, strategy="greedy", max_new_tokens=128, k=50, p=0.9, temperature=1.0):
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(model.device)

    generation_kwargs = {
        "max_new_tokens": max_new_tokens,
        "pad_token_id": tokenizer.eos_token_id, # Set pad_token_id to eos_token_id
    }

    if strategy == "greedy":
        generation_kwargs["do_sample"] = False
    elif strategy == "top_k":
        generation_kwargs["do_sample"] = True
        generation_kwargs["top_k"] = k
        generation_kwargs["temperature"] = temperature
    elif strategy == "top_p":
        generation_kwargs["do_sample"] = True
        generation_kwargs["top_p"] = p
        generation_kwargs["top_k"] = 0  # Disable the default top-k filter so only the nucleus applies
        generation_kwargs["temperature"] = temperature
    else:
        raise ValueError("Invalid strategy. Choose from 'greedy', 'top_k', 'top_p'.")

    output_ids = model.generate(input_ids, **generation_kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

In [30]:

tests = ["Once upon a time", "What is 2+2?", "Suggest a party theme."]
for prompt in tests:
    print("\n GPT-2 | Top-k")
    # Keyword arguments make it clear that 40 is the output length, not k.
    print(generate(model, tokenizer, prompt, strategy="top_k", max_new_tokens=40))


 GPT-2 | Top-k
Once upon a time the government's stated goal was to create jobs by creating jobs, and this meant creating a strong middle class. In fact we were right behind, the middle class was growing by only 1.4 times

 GPT-2 | Top-k
What is 2+2? According to the popular misconception, two must be simultaneous. This is often a problem with some games because when you think about it, each player has to determine if they're even. They can decide for

 GPT-2 | Top-k
Suggest a party theme.

Step 3

Now we need to write some code for our app as an extension to our site structure. To do this, simply add all of these lines to your AppMain class.


### 3.3 - Try It Yourself

Now it's time to experiment with text generation. Replace the sample prompts with your own, or adjust the decoding settings.
You can experiment with:
- strategy: "greedy", "top_k", or "top_p" (the `generate` function above raises a `ValueError` for anything else, so "beam" requires adding a branch for it first)
- temperature: values between 0.2 and 2.0
- k or p: thresholds that control sampling diversity

Try generating from the same prompt with `greedy` and with `top_p` (for example, p = 0.9), then vary the temperature. Notice how even small temperature changes can make the sampled output more focused or more free-form.
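One way to organize these experiments is to separate "which settings" from "run the model". The sketch below builds the keyword arguments for Hugging Face's `generate` as a plain dictionary, including a beam-search branch. `build_generation_kwargs` and the default of four beams are assumptions made for illustration; `do_sample`, `top_k`, `top_p`, `temperature`, `num_beams`, and `max_new_tokens` are standard `generate` parameters.

```python
def build_generation_kwargs(strategy, max_new_tokens=128, k=50, p=0.9,
                            temperature=1.0, num_beams=4):
    """Map a strategy name to keyword arguments for `model.generate` (hypothetical helper)."""
    kwargs = {"max_new_tokens": max_new_tokens}
    if strategy == "greedy":
        kwargs["do_sample"] = False
    elif strategy == "beam":
        # Deterministic search over several candidate sequences.
        kwargs.update(do_sample=False, num_beams=num_beams)
    elif strategy == "top_k":
        kwargs.update(do_sample=True, top_k=k, temperature=temperature)
    elif strategy == "top_p":
        # top_k=0 switches the top-k filter off so that only top-p applies.
        kwargs.update(do_sample=True, top_p=p, top_k=0, temperature=temperature)
    else:
        raise ValueError("Choose from 'greedy', 'beam', 'top_k', 'top_p'.")
    return kwargs

# Print the settings for a small sweep without touching a model.
for strategy in ("greedy", "beam", "top_k", "top_p"):
    print(strategy, build_generation_kwargs(strategy, max_new_tokens=40, temperature=0.7))
```

With a model and tokenizer already loaded, the dictionary can be passed along as `model.generate(input_ids, **build_generation_kwargs("beam"))`.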




# 4 - Completion vs. Instruction-tuned LLMs

So far, we have used `GPT-2` to generate text from a given input prompt. However, `GPT-2` is just a completion model. It simply continues the provided text without understanding it as a task or question. It is not designed to engage in dialogue or follow instructions.

In contrast, instruction-tuned LLMs (such as `Qwen-Chat`) undergo an additional post-training stage after base pre-training. This process fine-tunes the model to behave helpfully and safely when interacting with users. Because of this extra stage, instruction-tuned models can:

- Interpret prompts as requests rather than just text to continue
- Stay in conversation mode, answering questions and following steps
- Handle refusals and safety boundaries appropriately
- Maintain a consistent helpful persona, rather than drifting into storytelling

### 4.1 - `Qwen/Qwen3-0.6B` vs. `GPT2`

In the next cell, you will feed the same prompt to two different models:

- GPT-2 (completion-only): continues the text in the same writing style
- Qwen/Qwen3-0.6B (instruction-tuned): interprets the input as an instruction and responds helpfully

Comparing the two outputs will make the difference between completion and instruction-tuned behavior clear.



In [31]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# GPT-2 and its tokenizer were loaded earlier in this notebook as `model` and
# `tokenizer`, together with `device`. Reuse them under clearer names instead
# of loading them a second time.
gpt2_model = model
gpt2_tokenizer = tokenizer

# NOTE: the text in this section refers to Qwen/Qwen3-0.6B, but this cell loads
# Qwen/Qwen1.5-0.5B. Change the ID below if you want to follow the text exactly.
qwen_model_id = "Qwen/Qwen1.5-0.5B"

# Load Qwen tokenizer and model
qwen_tokenizer = AutoTokenizer.from_pretrained(qwen_model_id)
qwen_model = AutoModelForCausalLM.from_pretrained(qwen_model_id).to(device)

We have now loaded two small checkpoints: GPT-2 (124M parameters) and a Qwen model with roughly half a billion parameters (the cell above loads `Qwen/Qwen1.5-0.5B`; `Qwen/Qwen3-0.6B` is a similarly sized alternative). If the previous cell took some time to run, that was mainly due to model download time. The models are cached locally, so future runs will be faster.

Next, we will generate text using our generate function with both models and the same prompt to directly compare how a completion-only model (GPT-2) behaves differently from an instruction-tuned model (Qwen).
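Chat-oriented models are usually prompted through a chat template that wraps each message in role markers before tokenization. The sketch below hand-rolls a ChatML-style layout of the kind Qwen chat templates use, purely to show the idea. `format_chatml` is a hypothetical helper; the exact markers and any default system prompt come from the tokenizer's own template, so real code should call `tokenizer.apply_chat_template` instead.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} messages in a ChatML-style layout (illustrative)."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave an open assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [{"role": "user", "content": "What is 2+2?"}]
chat_prompt = format_chatml(messages)
print(chat_prompt)
```

A completion model sees only the raw prompt and continues it, whereas a chat model sees the prompt inside this scaffold, which signals that an assistant reply is expected next.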

In [32]:
tests=[("Once upon a time", "greedy"),("What is 2+2?", "top_k"),("Suggest a party theme.", "top_p")]

for prompt, strategy in tests:
    print(f"\n--- Prompt: '{prompt}' ---")

    # GPT-2 (Completion-only)
    print(f"\nGPT-2 ({strategy.capitalize()} decoding):")
    gpt2_output = generate(gpt2_model, gpt2_tokenizer, prompt, strategy=strategy, max_new_tokens=80)
    print(gpt2_output)

    # Qwen (Instruction-tuned)
    print(f"\nQwen ({strategy.capitalize()} decoding):")
    # Qwen typically expects a chat-like format for instruction following
    messages = [
        {"role": "user", "content": prompt}
    ]
    # Apply chat template if available, otherwise just use the prompt
    if hasattr(qwen_tokenizer, 'apply_chat_template') and qwen_tokenizer.chat_template:
        qwen_input_text = qwen_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    else:
        qwen_input_text = prompt

    qwen_output = generate(qwen_model, qwen_tokenizer, qwen_input_text, strategy=strategy, max_new_tokens=80)
    print(qwen_output)


--- Prompt: 'Once upon a time' ---

GPT-2 (Greedy decoding):
Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and

Qwen (Greedy decoding):
system
You are a helpful assistant
user
Once upon a time
assistant
Once upon a time, there was a little girl named Lily. She lived in a small village in the countryside. She was very kind and loved to help others. One day, she saw a little boy who was lost and needed help. Lily knew that she had to help him. So she went to the boy's house and asked him if he needed anything. The boy said yes and asked for

--- Prompt: 'What is 2+2?' ---

GPT-2 (Top_k decoding):
What is 2+2? (1+2) (2+2) (1+2) [1+2] (1+2) ((1+2))

Notice that when trying to do a 

# 5 - (Optional) A Small Interactive LLM Playground
This section is optional. You do not need to implement it to complete the project. It is meant purely for exploration, and skipping it will not leave a gap in your core AI engineering skills.

If you are curious, you can build a simple interactive playground to experiment with text generation. You can:
- Create input widgets for the prompt, model selection, decoding strategy, and temperature
- Use Hugging Face's generate method to produce text based on the selected settings
- Display the model's response directly in the notebook output

You may find the following links helpful:
- https://ipywidgets.readthedocs.io/en/latest/
- https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html

In [33]:
import ipywidgets as widgets
from IPython.display import display, Markdown
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Ensure models and tokenizers are loaded
model_id = "gpt2"
device = "cuda" if torch.cuda.is_available() else "cpu"  # Fall back to CPU when no CUDA GPU is available

# GPT-2 model and tokenizer
gpt2_tokenizer = AutoTokenizer.from_pretrained(model_id)
gpt2_model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

qwen_model_id = "Qwen/Qwen1.5-0.5B"
# Qwen tokenizer and model
qwen_tokenizer = AutoTokenizer.from_pretrained(qwen_model_id)
qwen_model = AutoModelForCausalLM.from_pretrained(qwen_model_id).to(device)

def generate_text_playground(model, tokenizer, prompt, strategy="greedy", max_new_tokens=128, k=50, p=0.9, temperature=1.0):
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(model.device)

    generation_kwargs = {
        "max_new_tokens": max_new_tokens,
        "pad_token_id": tokenizer.eos_token_id,
    }

    if strategy == "greedy":
        generation_kwargs["do_sample"] = False
    elif strategy == "top_k":
        generation_kwargs["do_sample"] = True
        generation_kwargs["top_k"] = k
        generation_kwargs["temperature"] = temperature
    elif strategy == "top_p":
        generation_kwargs["do_sample"] = True
        generation_kwargs["top_p"] = p
        generation_kwargs["top_k"] = 0  # Disable the default top-k filter so only the nucleus applies
        generation_kwargs["temperature"] = temperature
    else:
        raise ValueError("Invalid strategy. Choose from 'greedy', 'top_k', 'top_p'.")

    output_ids = model.generate(input_ids, **generation_kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# 3. Create interactive UI elements
prompt_input = widgets.Textarea(
    value='Once upon a time',
    description='Prompt:',
    continuous_update=False,
    layout=widgets.Layout(width='auto', height='80px')
)

model_selector = widgets.Dropdown(
    options={'GPT-2': 'gpt2', 'Qwen/Qwen1.5-0.5B': 'qwen'},
    value='gpt2',
    description='Model:',
)

strategy_selector = widgets.Dropdown(
    options=['greedy', 'top_k', 'top_p'],
    value='top_p',
    description='Strategy:',
)

temperature_slider = widgets.FloatSlider(
    value=0.7,
    min=0.1,
    max=2.0,
    step=0.1,
    description='Temperature:',
    continuous_update=False,
)

k_slider = widgets.IntSlider(
    value=50,
    min=1,
    max=100,
    step=1,
    description='Top-K:',
    continuous_update=False,
)

p_slider = widgets.FloatSlider(
    value=0.9,
    min=0.1,
    max=1.0,
    step=0.05,
    description='Top-P:',
    continuous_update=False,
)

max_new_tokens_slider = widgets.IntSlider(
    value=100,
    min=10,
    max=250,
    step=10,
    description='Max Tokens:',
    continuous_update=False,
)

generate_button = widgets.Button(description="Generate Text")
output_area = widgets.Output()

# 5. Define the button's behavior.
def on_generate_button_clicked(b):
    with output_area:
        output_area.clear_output()
        selected_model_name = model_selector.value
        selected_model = gpt2_model if selected_model_name == 'gpt2' else qwen_model
        selected_tokenizer = gpt2_tokenizer if selected_model_name == 'gpt2' else qwen_tokenizer

        current_prompt = prompt_input.value
        current_strategy = strategy_selector.value
        current_temperature = temperature_slider.value
        current_k = k_slider.value
        current_p = p_slider.value
        current_max_new_tokens = max_new_tokens_slider.value

        print(f"Generating with {selected_model_name}, strategy={current_strategy}...")

        try:
            if selected_model_name == 'qwen' and hasattr(selected_tokenizer, 'apply_chat_template') and selected_tokenizer.chat_template:
                messages = [{"role": "user", "content": current_prompt}]
                processed_prompt = selected_tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
            else:
                processed_prompt = current_prompt

            generated_text = generate_text_playground(
                selected_model, selected_tokenizer, processed_prompt,
                strategy=current_strategy, max_new_tokens=current_max_new_tokens,
                k=current_k, p=current_p, temperature=current_temperature
            )
            display(Markdown(f"**Generated Text:**\n\n```\n{generated_text}\n```"))
        except Exception as e:
            display(Markdown(f"**Error during generation:**\n\n```\n{e}\n```"))

generate_button.on_click(on_generate_button_clicked)

# 6. Display the full UI for the playground.
display(
    widgets.VBox([
        prompt_input,
        widgets.HBox([model_selector, strategy_selector]),
        widgets.HBox([temperature_slider, k_slider, p_slider]),
        max_new_tokens_slider,
        generate_button,
        output_area
    ])
)



## 🎉 Congratulations!

You've just learned, explored, and inspected a real **LLM**. In one project you:
* Learned how **tokenization** works in practice
* Used the `tiktoken` library to load and experiment with modern tokenizers
* Explored LLM architecture and inspected GPT-2 blocks and layers
* Learned decoding strategies and used top-k and top-p sampling to generate text from GPT-2
* Loaded a compact chat-oriented model from the Qwen family and generated text with it
* Built an LLM playground


👏 **Great job!** Take a moment to celebrate. You now have a working mental model of how LLMs work. The techniques you used here underlie most of the LLM applications you encounter.
