# 01 - LLM Fundamentals

This notebook builds intuition for how language models work by implementing
tokenization, padding, softmax, and sampling manually before using frameworks.


In [1]:
# Special tokens
PAD = "[PAD]"
UNK = "[UNK]"
BOS = "[BOS]"
EOS = "[EOS]"

# Toy vocabulary
vocab = {
    PAD: 0,
    UNK: 1,
    BOS: 2,
    EOS: 3,
    "i": 4,
    "love": 5,
    "machine": 6,
    "learning": 7,
    "nlp": 8,
    "is": 9,
    "fun": 10,
}

id_to_token = {v: k for k, v in vocab.items()}

In [2]:
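def encode(text: str) -> list[int]:
    """Lowercase the text, split on whitespace, and map each token to its id."""
    tokens = text.lower().split()
    # Tokens missing from the vocabulary fall back to the [UNK] id.
    return [vocab.get(token, vocab[UNK]) for token in tokens]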

In [3]:
encode("i love deep learning")

[4, 5, 1, 7]

In [4]:
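def decode(ids) -> str:
    """Map ids back to tokens and join them with single spaces."""
    # Ids outside the vocabulary decode to [UNK] instead of raising KeyError.
    tokens = [id_to_token.get(i, UNK) for i in ids]
    return " ".join(tokens)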

In [5]:
decode([4, 5, 1, 7])

'i love [UNK] learning'

In [6]:
def encode_with_special_tokens(text: str):
    return [vocab[BOS]] + encode(text) + [vocab[EOS]]

encode_with_special_tokens("i love nlp")

[2, 4, 5, 8, 3]

- **[UNK]** stands in for any token that is not present in the tokenizer vocabulary, such as `deep` in the `encode` example above.
- **[PAD]** is used to make sequences the same length so they can be processed together in batches; padded positions are ignored using an attention mask.
- **[BOS]** marks the beginning of a sequence and **[EOS]** marks the end, helping the model understand sequence boundaries during training and generation.
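
The padding and attention-mask idea in the list above can be sketched in a few lines. This is a minimal illustration, not part of the original notebook: `pad_batch` and `PAD_ID` are hypothetical names, right-padding is an assumed choice (some models pad on the left), and the ids follow the toy vocabulary defined earlier.

```python
# Hypothetical helper for illustration; the name pad_batch is an assumption.
PAD_ID = 0  # id of [PAD] in the toy vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one.

    Returns (padded, attention_mask). The mask holds 1 for real tokens
    and 0 for padded positions.
    """
    max_len = max(len(seq) for seq in sequences)
    padded, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return padded, attention_mask


batch = [
    [2, 4, 5, 8, 3],     # [BOS] i love nlp [EOS]
    [2, 4, 5, 6, 7, 3],  # [BOS] i love machine learning [EOS]
]
padded, attention_mask = pad_batch(batch)
print(padded)          # [[2, 4, 5, 8, 3, 0], [2, 4, 5, 6, 7, 3]]
print(attention_mask)  # [[1, 1, 1, 1, 1, 0], [1, 1, 1, 1, 1, 1]]
```

The shorter sequence gains one `[PAD]` id at the end, and the matching `0` in the mask is what lets a model skip that position.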
