# 01 — LLM Fundamentals

This notebook builds intuition for how language models work by implementing
tokenization, padding, softmax, and sampling manually before using frameworks.


In [1]:
# Special tokens
PAD = "[PAD]"
UNK = "[UNK]"
BOS = "[BOS]"
EOS = "[EOS]"

# Toy vocabulary
vocab = {
    PAD: 0,
    UNK: 1,
    BOS: 2,
    EOS: 3,
    "i": 4,
    "love": 5,
    "machine": 6,
    "learning": 7,
    "nlp": 8,
    "is": 9,
    "fun": 10,
}

id_to_token = {v: k for k, v in vocab.items()}

In [2]:
def encode(text: str) -> list[int]:
    """Lowercase, split on whitespace, and map each token to its id ([UNK] if unseen)."""
    tokens = text.lower().split()
    return [vocab.get(token, vocab[UNK]) for token in tokens]

In [3]:
encode("i love deep learning")

[4, 5, 1, 7]

In [4]:
def decode(ids):
    # Fall back to [UNK] for ids outside the vocabulary instead of raising KeyError.
    tokens = [id_to_token.get(i, UNK) for i in ids]
    return " ".join(tokens)

In [5]:
decode([4, 5, 1, 7])

'i love [UNK] learning'

In [6]:
def encode_with_special_tokens(text: str):
    return [vocab[BOS]] + encode(text) + [vocab[EOS]]

encode_with_special_tokens("i love nlp")

[2, 4, 5, 8, 3]

- **[UNK]** represents a token that is not present in the tokenizer vocabulary.
- **[PAD]** is used to make sequences the same length so they can be processed together in batches; padded positions are ignored using an attention mask.
- **[BOS]** marks the beginning of a sequence and **[EOS]** marks the end, helping the model understand sequence boundaries during training and generation.
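Because `[BOS]`, `[EOS]`, and `[PAD]` are structural markers rather than content, it is often convenient to drop them when turning ids back into text. The following is a minimal sketch of that idea, not part of the original notebook: `decode_skip_special` is a hypothetical helper, and the vocabulary is a trimmed copy of the toy one above so the block runs on its own. Keeping `[UNK]` in the output is a design choice made here so that unknown words stay visible.

```python
# Trimmed copy of the toy vocabulary from the cells above (assumption: same ids).
PAD, UNK, BOS, EOS = "[PAD]", "[UNK]", "[BOS]", "[EOS]"
vocab = {PAD: 0, UNK: 1, BOS: 2, EOS: 3, "i": 4, "love": 5, "nlp": 8}
id_to_token = {v: k for k, v in vocab.items()}


def decode_skip_special(ids, skip=(PAD, BOS, EOS)):
    """Hypothetical helper: decode ids to text, dropping structural tokens.

    [UNK] is deliberately kept so that out-of-vocabulary words remain visible.
    """
    skip_ids = {vocab[t] for t in skip}
    tokens = [id_to_token.get(i, UNK) for i in ids if i not in skip_ids]
    return " ".join(tokens)


# A padded sequence: [BOS] i love nlp [EOS] [PAD]
print(decode_skip_special([2, 4, 5, 8, 3, 0]))
```

Passing a different `skip` tuple changes which markers are removed, for example `skip=(PAD,)` to strip only padding.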


## Task 2 — Padding and Attention Masks
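Transformers process a batch as a single rectangular tensor, so every sequence in the batch must have the same length. Shorter sequences are extended with `[PAD]` tokens until they match the longest one.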


In [7]:
# Example sentences with different lengths
sentences = [
    "i love nlp",
    "i love deep learning",
    "i love machine learning",
]

encoded = [encode_with_special_tokens(s) for s in sentences]
encoded


[[2, 4, 5, 8, 3], [2, 4, 5, 1, 7, 3], [2, 4, 5, 6, 7, 3]]

In [None]:
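def pad_sequences(batch_ids, pad_id: int):
    """Right-pad every sequence with pad_id up to the longest length in the batch."""
    max_len = max(len(x) for x in batch_ids)
    padded = []
    for x in batch_ids:
        # Append as many pad ids as needed to reach max_len.
        padded.append(x + [pad_id] * (max_len - len(x)))
    return padded, max_len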

pad_id = vocab[PAD]

padded_ids, max_len = pad_sequences(encoded, pad_id=pad_id)

padded_ids, max_len

([[2, 4, 5, 8, 3, 0], [2, 4, 5, 1, 7, 3], [2, 4, 5, 6, 7, 3]], 6)

In [11]:
def make_attention_mask(padded_batch_ids, pad_id: int):
    # 1 marks a real token, 0 marks a padded position.
    return [[1 if token_id != pad_id else 0 for token_id in seq] for seq in padded_batch_ids]

attention_mask = make_attention_mask(padded_ids, pad_id=pad_id)

attention_mask

[[1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]]

In [12]:
def pretty_print_batch(padded_batch_ids, attention_mask):
    for i, (seq, mask) in enumerate(zip(padded_batch_ids, attention_mask)):
        tokens = [id_to_token[t] for t in seq]
        print(f"Example {i}")
        print("tokens: ", tokens)
        print("mask:   ", mask)
        print()

pretty_print_batch(padded_ids, attention_mask)


Example 0
tokens:  ['[BOS]', 'i', 'love', 'nlp', '[EOS]', '[PAD]']
mask:    [1, 1, 1, 1, 1, 0]

Example 1
tokens:  ['[BOS]', 'i', 'love', '[UNK]', 'learning', '[EOS]']
mask:    [1, 1, 1, 1, 1, 1]

Example 2
tokens:  ['[BOS]', 'i', 'love', 'machine', 'learning', '[EOS]']
mask:    [1, 1, 1, 1, 1, 1]



1) We pad sequences so they all have the same length, allowing them to be processed together in a batch by the model.

2) Without an attention mask, the model treats `[PAD]` tokens as real input. They receive attention weight and shift the representations of the real tokens, so the output for a sentence can change depending on how much padding its batch happened to need.
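To make the effect of the mask concrete, here is a minimal pure-Python sketch of a masked softmax over one row of attention scores. It is an illustration under stated assumptions, not the implementation any particular framework uses: `masked_softmax` is a hypothetical helper, the scores are made-up numbers, and the row is assumed to contain at least one real token. Masked positions are set to negative infinity before the softmax, so their weight becomes exactly zero.

```python
import math


def masked_softmax(scores, mask):
    """Hypothetical helper: softmax over scores with zero weight where mask == 0.

    Assumes at least one position in the row has mask == 1.
    """
    # Replace masked scores with -inf so that exp() maps them to exactly 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    top = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up attention scores for a length-6 row; the last position is [PAD].
scores = [2.0, 1.0, 0.5, 1.5, 0.0, 3.0]
mask = [1, 1, 1, 1, 1, 0]

unmasked = masked_softmax(scores, [1] * len(scores))
weights = masked_softmax(scores, mask)

print("without mask:", [round(w, 3) for w in unmasked])
print("with mask:   ", [round(w, 3) for w in weights])
```

In this made-up row the padded position has the highest raw score, so without the mask it would take more than half of the attention weight. With the mask, its weight is zero and the remaining weights still sum to 1.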
