## Pre-Training GPT-Neo on TinyStories
HuggingFace links: [GPT-Neo model](https://huggingface.co/docs/transformers/model_doc/gpt_neo), [TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories). 

In [45]:
import wandb; wandb.login()
from transformers import GPT2TokenizerFast, GPTNeoForCausalLM, GPTNeoConfig
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset, load_from_disk, DatasetDict

config = GPTNeoConfig(

    # number of tokens in the vocabulary 
    vocab_size = 10_000, 
    # embedding size (vector length) of each token 
    hidden_size=512, 
    # we thus have an embedding block of 512 x 10'000 parameters

    # maximum sequence length (context size) in tokens; longer inputs are truncated during tokenisation (see below)
    max_position_embeddings = 512, 

    # number of transformer blocks. div by 2 for attention_types
    num_layers=2, 
    # for global and local attention (GPT-Neo-specific)
    attention_types=[[["global", "local"], 1]], 

    num_heads=8,     # attention heads
    window_size=256, # for local attention (GPT-Neo-specific)

    intermediate_size=1024, # size of 'up-projection' layer in FFN
)
# config

### Pre-Processing
All experiments use the same tokenizer, so in theory we only need to preprocess our data once. You can then skip this CPU-intensive task on the DelftBlue GPU nodes. The exception is experimenting with different context lengths: in that case you need to re-tokenise to recover inputs that were truncated (see below).
You may even have to load the entire dataset (or pre-load chunks) into memory, because DelftBlue has extremely slow IO. 

#### Tokenisation
Two things happen during tokenisation, of which the latter is optional:

1. Map (sub-)words to integers $\in [0, 10000)$.
2. Truncate/pad each input as necessary to fit within the context length (`max_position_embeddings`, which here happens to equal `hidden_size`, both 512). 
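
As a rough illustration of these two steps, here is a minimal sketch in plain Python. The five-word `toy_vocab`, its ids, and the `encode` helper are made up for this example (the real tokenizer works on sub-words and has 10k entries); only the truncate-then-pad logic is the point.

```python
# Toy vocabulary; words and ids are illustrative, not the real 10k vocabulary.
toy_vocab = {"<|endoftext|>": 0, "once": 1, "upon": 2, "a": 3, "time": 4}
PAD_ID = toy_vocab["<|endoftext|>"]  # like this notebook, pad with the eos token

def encode(text, max_length):
    """Map whitespace-separated words to ids, then truncate/pad to max_length."""
    ids = [toy_vocab[word] for word in text.lower().split()]
    ids = ids[:max_length]            # step 2a: truncation
    attention_mask = [1] * len(ids)   # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [PAD_ID] * n_pad           # step 2b: right-padding
    attention_mask += [0] * n_pad     # 0 marks padding
    return ids, attention_mask

padded_ids, padded_mask = encode("Once upon a time", max_length=6)
truncated_ids, truncated_mask = encode("Once upon a time", max_length=3)
print(padded_ids, padded_mask)        # [1, 2, 3, 4, 0, 0] [1, 1, 1, 1, 0, 0]
print(truncated_ids, truncated_mask)  # [1, 2, 3] [1, 1, 1]
```

Note that truncation throws tokens away, which is why a longer context length requires re-tokenising from the raw text.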

In [46]:
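# the tokenizer's maximum length is the model's context size (which equals hidden_size here, both are 512)
tokenizer = GPT2TokenizerFast.from_pretrained('10k-gpt-neo', model_max_length=config.max_position_embeddings)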

# in theory, window_size x num_layers is possible. 
# But, this project is not about the efficacy of sliding window attention (though it could be!)
assert tokenizer.model_max_length == 512 
assert tokenizer.vocab_size == config.vocab_size

# printing this because of a bug in tokenizers (should be fixed now) https://github.com/huggingface/transformers/issues/26500
print(f'padding token is {tokenizer.pad_token}')
# HF wasn't saving this nor the tokenizer's pad_token
config.pad_token_id = tokenizer.pad_token_id

padding token is <|endoftext|>


In [47]:
debug_mode = True
dataset = load_dataset('roneneldan/tinystories')

small_dataset = DatasetDict({
    'train': dataset['train'].select(range(1000)),
    'validation': dataset['validation'].select(range(100))
})

if debug_mode:
    dataset = small_dataset

tokenized_dataset = dataset.map(
    lambda x: tokenizer(x['text'], truncation=True, padding='max_length'),
    batched=True, num_proc=8, batch_size=1_000)

tokenized_dataset.save_to_disk(f'./tokenized_dataset', num_proc=5)

tokenized_dataset = load_from_disk(f'./tokenized_dataset')

assert len(tokenized_dataset['train'][0]['input_ids']) == config.max_position_embeddings
tokenized_dataset['train'][0]['input_ids'][-10:] 
# should be eos tokens (9999), given that most short stories are <512 tokens


[9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999]

#### Creating the TinyStories Tokenizer (side-quest)
> [!NOTE]
> You can safely skip this section if you are not interested in Tokenisation at all; but I briefly explain how it works and how to get to a 10k token vocabulary. 

The TinyStories authors cite a vocabulary of 10'000 tokens, while the GPT-Neo tokenizer they provide on HF is the one from `EleutherAI/gpt-neo-125M`, with a vocabulary of 50'257 tokens. I describe the process I took to get the **10k** tokenizer below, in case you are interested. Tokenisation accounts for a large part of a transformer's performance despite its relative simplicity, but you do not need to understand it for this project. 

To cut tokens out of a `TokenizerFast`, we need to access the underlying Rust object's state, as per https://github.com/huggingface/transformers/issues/15032.

In [4]:
# import json 
# tokenizer_state = json.loads(tokenizer.backend_tokenizer.model.__getstate__())

# n_words = len(tokenizer_state['vocab'])
# print(f'vocabulary: {n_words}') # would be lovely if this was 10'000

# # index n_words - 1 contains the special eos token `<|endoftext|>`
# new_vocab = {k: v for k, v in tokenizer_state['vocab'].items() if v < 10_000-1 or v == n_words-1}
# tokenizer_state['vocab'] = new_vocab

# # you can see that most common tokens are listed first (individual characters, pairs of chars, triples, etc.)
# print(f'new vocab : {len(new_vocab)}, {list(new_vocab.keys())[:3]}, {list(new_vocab.keys())[-3:]}')


# # Updating the tokenizer with new vocab: of course this doesn't work the first try, for some reason. 
# from tokenizers import models
# model_class = getattr(models, tokenizer_state.pop('type'))

# # 'str' object cannot be converted to 'PyTuple'
# # tokenizer.backend_tokenizer.model = model_class(**tokenizer_state) 

# # Let's manually create tuple objects, maybe Rust's type safety is keeping us safe. 
# new_merges = [tuple(m.split()) for m in tokenizer_state['merges']]
# print(f'new merges: {len(new_merges)}, {new_merges[:3]}, {new_merges[-3:]}') # Ġ means space ' ' 
# tokenizer_state['merges'] = new_merges

# # tokenizer.backend_tokenizer.model = model_class(**tokenizer_state) # Token `ordon` out of vocabulary

I went down a bit of a rabbit hole at this point. To summarise: because we are using the Rust-implemented `TokenizerFast`, we need both:

- `vocab.json` containing all the tokens, and 
- `merges.txt` explaining how to create those tokens from (sub-)word pairs.

Each merge-tuple in `merges` constitutes one new token. However, a merged token by definition contains at least two characters. We also want to be able to encode single characters, in case the tokenizer meets a word it has not seen before. Thus, we need a set of base tokens: one for each of the `256` possible byte values, plus the special `<|endoftext|>` token (`257` in total). 

This is where the `257` in the *original* vocabulary of `50'257` comes from. The remaining `50'000` tokens come from iteratively merging the most frequent pair of tokens, starting from the base tokens. Our definition of a merge now becomes any two existing tokens joined together, forming a new token. 

> What happens when the tokenizer encounters a character that is not part of its base tokens? We assign it `unk_token` and move on, hoping that it doesn't happen too frequently. In this tokenizer, `unk_token == eos_token == '<|endoftext|>'`. Note that the base set covers every byte value, so it can also build any Unicode character (which is *close enough* to all characters).

What this means for the tokenizer is that we want only the first `10'000 - 257` merges if we want it to have a vocabulary of 10k tokens. 
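
To make "base tokens plus merges" concrete, here is a toy byte-pair-encoding trainer in plain Python. It is a simplified sketch, not the Hugging Face implementation: it works on characters instead of bytes, and the helper `learn_bpe_merges` and the word list are invented for this example. The vocabulary size is the number of base tokens plus the number of merges, which mirrors `10'000 = 257 + 9'743` above.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Greedily learn up to `num_merges` merges from a list of words (toy BPE)."""
    # Each word starts out as a tuple of single-character tokens.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent token pair occurs in the corpus.
        pair_counts = Counter()
        for tokens, freq in corpus.items():
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged token.
        new_corpus = Counter()
        for tokens, freq in corpus.items():
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    merged.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

words = ["low", "low", "lower", "lowest", "new", "newer"]
merges = learn_bpe_merges(words, num_merges=3)
base_tokens = sorted({ch for word in words for ch in word})
vocab_size = len(base_tokens) + len(merges)
print(merges[:2])   # [('l', 'o'), ('lo', 'w')]
print(vocab_size)   # 11 = 8 base characters + 3 merges
```

Because merges are learned most-frequent-first, keeping only the first N merges keeps the N most frequent merged tokens, which is what truncating `merges.txt` relies on.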

In [5]:
# new_merges = new_merges[:10000-257] # 256 says: "Token `ordon` out of vocabulary"
# tokenizer_state['merges'] = new_merges
# print(f'new merges: {len(new_merges)}, {new_merges[:3]}, {new_merges[-3:]}') # Ġ means space ' ' 

# tokenizer.backend_tokenizer.model = model_class(**tokenizer_state)
# print(f'our tokenizer now has the {len(tokenizer)} most frequent tokens (from whatever dataset it was trained on)')

In [6]:
# # tokenizer.save_pretrained('10k-tokenizer')
# tokenizer.decode(tokenizer.encode('hello worldaisudhgiashg asdugh'))
# tokenizer.vocab_size

# # NOTE: Saving hangs indefinitely, I don't know why and it requires me to look at the rust impl.
# # instead, I just ended up modifying the merges.txt and vocab.json files manually, following
# # exactly what I did above. 

# tokenizer.name_or_path = '10k-tokenizer'
# # tokenizer.save_pretrained('10k-tokenizer') 
# # tokenizer.save_vocabulary('.', '10k-tokenizer.json')

And that's about it for tokenisation. All it does is map frequent character strings (i.e. tokens) to integers. 

The intuition is that assigning each token its own embedding (a vector of `1 x hidden_size`) in the transformer helps it learn the meaning of these tokens. E.g. given two tokens 'about' and 'around', the transformer should learn similar embeddings for both. 

### Training
#### Model
We use GPT-Neo as the architecture for our model. This consists of the following blocks, generally seen in transformer architectures:

- Embeddings: Map one-hot (sparse) token vectors to a dense vector of length `hidden_size`. Add positional encoding. 
- $n$ Blocks: Contain self-attention and FFNs.
- Head: Map hidden state back to a useful output. 
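
The embedding step is easiest to see on a toy example. The sketch below (plain Python, with made-up sizes of 6 tokens and 4 dimensions instead of 10'000 and 512) shows that multiplying a one-hot vector with the embedding matrix is the same as looking up one row, which is why embedding layers are implemented as a table lookup.

```python
import random

random.seed(0)
vocab_size, hidden_size = 6, 4  # toy sizes; the notebook uses 10_000 and 512

# Embedding matrix: one row of length hidden_size per token id.
embedding = [[random.uniform(-1, 1) for _ in range(hidden_size)]
             for _ in range(vocab_size)]

def one_hot(token_id, size):
    """Sparse vector with a single 1.0 at position token_id."""
    return [1.0 if i == token_id else 0.0 for i in range(size)]

def vector_times_matrix(vector, matrix):
    """Multiply a row vector (one entry per matrix row) with the matrix."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(n_cols)]

token_id = 3
via_matmul = vector_times_matrix(one_hot(token_id, vocab_size), embedding)
via_lookup = embedding[token_id]
same = all(abs(a - b) < 1e-12 for a, b in zip(via_matmul, via_lookup))
print(same)  # True
```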

We use the config specified at the top of this notebook to construct a model.

In [48]:
import torch
from torch import nn

class GPTNeoGQASelfAttention(nn.Module):
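        def __init__(self, config, attention_type, num_query_groups=None):
            super().__init__()
            self.config = config
            # NOTE: `num_query_groups` is stored but not used further down: k_proj and v_proj still
            # produce `num_heads` heads, so this module behaves like standard multi-head attention.
            self.num_query_groups = config.num_heads if num_query_groups is None else num_query_groups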

            max_positions = config.max_position_embeddings
            bias = torch.tril(torch.ones((max_positions, max_positions), dtype=bool)).view(
                1, 1, max_positions, max_positions
            )

            # local causal self attention is a sliding window where each token can only attend to the previous
            # window_size tokens. This is implemented by updating the causal mask such that for each token
            # all other tokens are masked except the previous window_size tokens.
            if attention_type == "local":
                bias = torch.bitwise_xor(bias, torch.tril(bias, -config.window_size))

            self.register_buffer("bias", bias, persistent=False)
            self.register_buffer("masked_bias", torch.tensor(-1e9), persistent=False)

            self.attn_dropout = nn.Dropout(float(config.attention_dropout))
            self.resid_dropout = nn.Dropout(float(config.resid_dropout))
            self.is_causal = True

            self.embed_dim = config.hidden_size
            self.num_heads = config.num_heads
            self.head_dim = self.embed_dim // self.num_heads
            if self.head_dim * self.num_heads != self.embed_dim:
                raise ValueError(
                    f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
                    f" {self.num_heads})."
                )

            self.k_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
            self.v_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
            self.q_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
            self.out_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=True)

        def _split_heads(self, tensor, num_heads, attn_head_size):
            """
            Splits hidden_size dim into attn_head_size and num_heads
            """
            new_shape = tensor.size()[:-1] + (num_heads, attn_head_size)
            tensor = tensor.view(new_shape)
            return tensor.permute(0, 2, 1, 3)  # (batch, head, seq_length, head_features)

        def _merge_heads(self, tensor, num_heads, attn_head_size):
            """
            Merges attn_head_size dim and num_attn_heads dim into hidden_size
            """
            tensor = tensor.permute(0, 2, 1, 3).contiguous()
            new_shape = tensor.size()[:-2] + (num_heads * attn_head_size,)
            return tensor.view(new_shape)

        def _attn(self, query, key, value, attention_mask=None, head_mask=None):
            # Keep the attention weights computation in fp32 to avoid overflow issues
            query = query.to(torch.float32)
            key = key.to(torch.float32)

            attn_weights = torch.matmul(query, key.transpose(-1, -2))

            query_length, key_length = query.size(-2), key.size(-2)
            causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length]
            mask_value = torch.finfo(attn_weights.dtype).min
            # Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`.
            # Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device`
            mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
            attn_weights = torch.where(causal_mask, attn_weights, mask_value)

            if attention_mask is not None:
                # Apply the attention mask
                attn_weights = attn_weights + attention_mask

            attn_weights = nn.functional.softmax(attn_weights, dim=-1)
            attn_weights = attn_weights.to(value.dtype)
            attn_weights = self.attn_dropout(attn_weights)

            # Mask heads if we want to
            if head_mask is not None:
                attn_weights = attn_weights * head_mask

            attn_output = torch.matmul(attn_weights, value)

            return attn_output, attn_weights

        def forward(
            self,
            hidden_states,
            attention_mask=None,
            layer_past=None,
            head_mask=None,
            use_cache=False,
            output_attentions=False,
        ):
            query = self.q_proj(hidden_states)
            key = self.k_proj(hidden_states)
            value = self.v_proj(hidden_states)

            query = self._split_heads(query, self.num_heads, self.head_dim)
            key = self._split_heads(key, self.num_heads, self.head_dim)
            value = self._split_heads(value, self.num_heads, self.head_dim)

            if layer_past is not None:
                past_key = layer_past[0]
                past_value = layer_past[1]
                key = torch.cat((past_key, key), dim=-2)
                value = torch.cat((past_value, value), dim=-2)

            if use_cache is True:
                present = (key, value)
            else:
                present = None

            attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)

            attn_output = self._merge_heads(attn_output, self.num_heads, self.head_dim)
            attn_output = self.out_proj(attn_output)
            attn_output = self.resid_dropout(attn_output)

            outputs = (attn_output, present)
            if output_attentions:
                outputs += (attn_weights,)

            return outputs  # a, present, (attentions)
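
The local-attention mask built in `__init__` above can be hard to picture. Here is a small stand-alone sketch in plain Python (no torch) of the same idea; the helper names are invented for this example. An entry is `True` when query position `i` may attend to key position `j`.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def local_causal_mask(seq_len, window_size):
    """Causal mask limited to the last `window_size` positions (self included)."""
    return [[i - window_size < j <= i for j in range(seq_len)]
            for i in range(seq_len)]

for row in local_causal_mask(seq_len=5, window_size=2):
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# .xx..
# ..xx.
# ...xx
```

This matches the XOR trick in the class: the full lower triangle minus the lower triangle shifted down by `window_size` leaves a diagonal band of width `window_size`.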

In [49]:
class CustomGPTNeoForCausalLM(GPTNeoForCausalLM):
    def __init__(self, config):
        super().__init__(config)
        for block in self.transformer.h:
            block.attn.attention = GPTNeoGQASelfAttention(config, block.attn.attention_type, 4)
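
The class name refers to grouped-query attention (GQA), although the module above still projects keys and values to all `num_heads` heads. As a rough sketch of what grouping would change, assuming the usual GQA formulation in which each group of query heads shares one key/value head, the arithmetic for this notebook's sizes looks as follows. The variable names are illustrative and nothing here touches the model.

```python
hidden_size, num_heads, num_query_groups = 512, 8, 4  # values used in this notebook
head_dim = hidden_size // num_heads
heads_per_group = num_heads // num_query_groups

# Query head i would read the key/value head of group i // heads_per_group.
kv_head_for_query_head = [i // heads_per_group for i in range(num_heads)]

# Weight counts of the K and V projections together (no biases, as in the class above).
mha_kv_params = 2 * hidden_size * (num_heads * head_dim)         # one K/V head per query head
gqa_kv_params = 2 * hidden_size * (num_query_groups * head_dim)  # one K/V head per group

print(kv_head_for_query_head)        # [0, 0, 1, 1, 2, 2, 3, 3]
print(mha_kv_params, gqa_kv_params)  # 524288 262144
```

Under that assumption, 4 groups would halve the key/value projection weights per block (and the key/value cache at inference time).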

In [50]:
model = CustomGPTNeoForCausalLM(config=config)

print(f'The model has {model.num_parameters():,} parameters.')
# Token embeddings:    10'000 * 512 = 5M 
# Position embeddings:    512 * 512 = 260K
# 2 Blocks: 2 * (~1M + ~1M) = 4M
#   Attention: 4 * 512 * 512  (4 because of k, q, v, o)     = 1M 
#   FFNs:      2 * 512 * 1024 (c_fc, c_proj)                = 1M 

# total: ~9.5M 

# note that the above does not count the LM_head separately; it typically gets swapped out depending on the downstream task. 
# the LM_head, in this case, maps hidden states to logits over the vocabulary:
# 512 * 10'000 = 5M    (A decent chunk compared to our 9.5M parameters), but its weights are tied to the
# token embeddings, so num_parameters() counts them only once.

# NOTE: uncomment to see the model architecture 
# model 

The model has 9,585,664 parameters.
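
The rough estimate in the comments can be made exact. The sketch below assumes the usual GPT-Neo layout: two layer norms per block plus a final one (each with weight and bias), q/k/v projections without bias, an output projection and two FFN layers with bias, and an LM head that shares its weights with the token embeddings. Under those assumptions the sum reproduces the number printed above.

```python
vocab_size, hidden_size, max_positions = 10_000, 512, 512
num_layers, intermediate_size = 2, 1024

layer_norm = 2 * hidden_size                                # weight + bias
attention = 3 * hidden_size * hidden_size                   # q, k, v projections (no bias)
attention += hidden_size * hidden_size + hidden_size        # out_proj (with bias)
ffn = hidden_size * intermediate_size + intermediate_size   # c_fc (with bias)
ffn += intermediate_size * hidden_size + hidden_size        # c_proj (with bias)
block = 2 * layer_norm + attention + ffn

embeddings = vocab_size * hidden_size + max_positions * hidden_size
total = embeddings + num_layers * block + layer_norm        # + final layer norm
print(f"{total:,}")  # 9,585,664
```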


In [8]:
# for comparison. Note, wte should be (10000, 64)
# GPTNeoForCausalLM.from_pretrained('roneneldan/TinyStories-1M')

#### Training loop
Hugging Face provides some powerful (and often confusingly long) APIs for model training. `TrainingArguments` specifies our hyperparameters, which are used by the `Trainer` alongside the remaining objects (like `model`, `tokenizer`, and `train_dataset`). Specifically:

- `learning_rate` and `num_train_epochs` determine how much the model learns. A higher rate is faster, but more unstable. More epochs (entire passes over the dataset) yield incrementally better results, at the cost of more training time. 
- Batch sizes determine how many samples the model sees in *parallel*. Given `gradient_accumulation_steps=1` and `batch_size=8`, the model backpropagates the average loss of 8 samples; with `batch_size=1`, it averages the loss of `gradient_accumulation_steps` samples, processed one at a time. When comparing models, make sure the backpropagated loss is averaged over the same number of samples (`batch_size * gradient_accumulation_steps`). 

- `data_collator` batches (and optionally, pads) the input for the model. We have already padded in our `tokenized_dataset`, and leaving this argument empty will automatically batch the inputs. So why do we need it? 

    Glad you asked. This has to do with how the loss is computed in causal language modelling. In our case, we try to predict $p(y | x)$, where $x$ is an input sequence of tokens, and $y$ is the next token following that sequence. Our model, unaware of the target token $y$, outputs a prediction $\hat y$. 
    
    To compute the (cross-entropy) loss, we need both $y$ and $\hat y$. `DataCollatorForLanguageModeling` (with `mlm=False`) takes care of this: it copies `input_ids` into a `labels` field and marks padding positions so that the loss ignores them. The model then shifts the labels by one position internally, so the prediction at each position is compared against the token that follows it.

    The loss is the backbone of backpropagation, which we need to actually improve our model. If this is confusing, please re-watch Karpathy's GPT tutorial. 
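
Here is a minimal sketch of that loss in plain Python, to make the "shift by one" explicit. To my understanding, the collator marks ignored positions with the label `-100`; the helper names, the toy ids, and the 10-token vocabulary are made up for this example.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips

def make_labels(input_ids, pad_token_id):
    """Copy the inputs as labels and mask out padding positions."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

def causal_lm_loss(logits, labels):
    """Mean cross-entropy where the logits at position t predict labels[t + 1]."""
    losses = []
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # padding does not contribute to the loss
        log_normaliser = math.log(sum(math.exp(x) for x in step_logits))
        losses.append(log_normaliser - step_logits[target])
    return sum(losses) / len(losses)

input_ids = [5, 2, 7, 0, 0]  # toy sequence; 0 is the padding id in this example
labels = make_labels(input_ids, pad_token_id=0)
uniform_logits = [[0.0] * 10 for _ in input_ids]  # untrained model: uniform over 10 tokens
loss = causal_lm_loss(uniform_logits, labels)
print(labels)          # [5, 2, 7, -100, -100]
print(round(loss, 4))  # 2.3026, i.e. log(10)
```

One caveat follows from this: since the padding token in this notebook is also the eos token, masking padding also masks the end-of-story token in the labels.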

If you prefer to write the training loop yourself, check out HF's `run_clm_no_trainer.py` script. 
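
If you do write the loop yourself, gradient accumulation is the part that is easiest to get wrong. The sketch below uses a one-parameter toy model (`y = w * x` with a mean-squared-error loss) in plain Python, invented purely for illustration. It shows that accumulating micro-batch gradients, each scaled by `1 / accumulation_steps`, gives the same gradient as one large batch, provided all micro-batches have the same size.

```python
def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(x), 3.0 * x + 1.0) for x in range(1, 9)]  # 8 toy samples
w = 0.5

# One large batch of 8 samples.
full_batch_grad = grad_mse(w, data)

# Four micro-batches of 2 samples; each gradient is scaled by 1 / accumulation_steps.
micro_batch_size, accumulation_steps = 2, 4
accumulated_grad = 0.0
for step in range(accumulation_steps):
    micro_batch = data[step * micro_batch_size:(step + 1) * micro_batch_size]
    accumulated_grad += grad_mse(w, micro_batch) / accumulation_steps

print(abs(full_batch_grad - accumulated_grad) < 1e-9)  # True
```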

In [5]:
train_dataset, eval_dataset = tokenized_dataset['train'], tokenized_dataset['validation']

batch_size = 4 # TinyStories claims 80, but I am training locally on my poor M1 Air
num_train_epochs = 1  # TinyStories doesn't mention
gradient_accumulation_steps = 16 # TinyStories claims 16

lr = 5e-4 # TinyStories claims 5e-4, higher values are preferable for smaller models

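# optimizer steps = (samples per epoch // samples per optimizer step) * epochs
_train_steps = (len(train_dataset) // (batch_size * gradient_accumulation_steps)) * num_train_epochs
eval_steps = max(1, _train_steps // 10) # evaluate every 10% of training steps (but at least every step)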

model_name = f'{model.num_parameters()//1e6:.1f}M-{config.num_layers}L-{config.num_heads}H-{config.hidden_size}C-{config.intermediate_size}I'

In [6]:
training_args = TrainingArguments(

    seed       = 42,
    use_cpu    = False, # use GPU if available (not necessarily faster on laptops, but Apple's MPS have good support)
    output_dir = f'./results/models/{model_name}',

    # NOTE: training params
    learning_rate    = lr,
    num_train_epochs = num_train_epochs,
    # Use a smaller batch size to fit into GPU RAM. 
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size  = batch_size,
    # You should aim to have the same amount of samples per acc step, in all of your experiments!
    # so, if you increase batch_size, decrease gradient_accumulation_steps by the same factor.
    gradient_accumulation_steps = gradient_accumulation_steps,

    # NOTE: Evaluation params
    # wandb is great for tracking experiments, it will even (try to) save your code nowadays
    evaluation_strategy = 'steps',
    eval_steps = eval_steps,
    save_steps = eval_steps,

    logging_first_step=True,
    logging_steps=max(1, eval_steps),
    report_to  = 'wandb',
)

trainer = Trainer(
    model = model, 
    args = training_args, 
    train_dataset = train_dataset, 
    eval_dataset = eval_dataset,
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# print amount of training steps, and how often the model is evaluated
print(f'''
    training for {num_train_epochs} epochs, {len(train_dataset)} samples
    {batch_size} batch size, {gradient_accumulation_steps} accumulation steps
    gives {_train_steps} training steps.

    evaluating every {eval_steps} steps, {len(eval_dataset)} samples 
    ''')


    training for 1 epochs, 1000 samples
    4 batch size, 16 accumulation steps
    gives 15 training steps.

    evaluating every 1 steps, 100 samples 
    


Finally, we can train. 

This configuration takes ≤24hr to pre-train on my M1 MacBook Air with 16GB RAM. Python takes ≤4GB VRAM at `batch_size=16` and ≤11GB at `batch_size=64`, though both take the same amount of time to train, likely because this processor is not designed to constantly move that much data in and out of RAM. And, to be honest, the GPU is underpowered. If you decide to go the local-training route, consider [chai](https://github.com/lvillani/chai) to keep your (Apple) laptop awake; there is probably a Windows/Linux equivalent too. 

In [7]:
# pass the hyperparameters as a plain dict so that wandb can log them
wandb.init(project='tinystories', name=model_name, config=training_args.to_dict())
trainer.train()
trainer.save_model(f'./results/models/{model_name}')



## Inference!
Try out your own pre-trained model on some prompts.

In [51]:
import textwrap # for pretty printing
w = textwrap.TextWrapper(replace_whitespace=False, break_long_words=False, width=60, initial_indent='   ', subsequent_indent='  ')
def see(text): print('\n\033[3m' + '\n\n'.join(['\n'.join(w.wrap(line))
                 for line in text.splitlines() if line.strip() != '']) + '\033[0m\n')

In [52]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = '9.0M-2L-4H-512C-1024I'
tokenizer = AutoTokenizer.from_pretrained('10k-gpt-neo')
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(f'results/models/{model_name}')

In [56]:
import time
from memory_profiler import memory_usage

prompt = 'Once upon a time, there was a chicken with'
input_ids = tokenizer.encode(prompt, return_tensors='pt')

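start_time = time.time()
output = model.generate(input_ids, max_length=300, num_beams=1)
end_time = time.time()

# NOTE: as far as I understand, memory_usage() without a target function samples this process's
# memory at the moment it is called, so this is the footprint after generation has finished.
memory_usage_end = memory_usage(max_usage=True)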

output_text = tokenizer.decode(output[0])

# textwrap with indentation on every new paragraph
see(output_text)
inference_time = end_time - start_time
print("Inference Time:", inference_time, "seconds")
print("Peak Memory Usage:", memory_usage_end, "MiB")


   Once upon a time, there was a chicken with.

   . He. He.

Inference Time: 2.215142250061035 seconds
Peak Memory Usage: 136.99609375 MiB


In [25]:
import time
from memory_profiler import memory_usage

# Define your prompt and encode it
prompt = 'Once upon a time, there was a chicken with'
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Set the number of times to generate output
num_iterations = 10  # You can adjust this number as needed

# Initialize a list to store inference times
inference_times = []
# Initialize a list to store peak memory usages per inference
memory_usage_values = []

# Loop for generating output multiple times
for _ in range(num_iterations):
    # Start recording memory to find the peak usage
    memory_usage_start = memory_usage(max_usage=True)

    # Record the start time
    start_time = time.time()
    # Generate the output
    output = model.generate(input_ids, max_length=300, num_beams=1)
    # Record the end time
    end_time = time.time()

    # End recording memory to find the peak usage
    memory_usage_end = memory_usage(max_usage=True)

    # Calculate the time taken for inference
    inference_time = end_time - start_time
    inference_times.append(inference_time)

    memory_usage_values.append(memory_usage_end)

    # Decode and print the output text (optional)
    output_text = tokenizer.decode(output[0])
    print("Generated Text:", output_text)

average_inference_time = sum(inference_times) / num_iterations
average_memory_usage = sum(memory_usage_values) / num_iterations

print("Average Inference Time:", average_inference_time, "seconds")
print("Average Peak Memory Usage:", average_memory_usage, "MiB")

Generated Text: Once upon a time, there was a chicken with.

[many blank lines]

. He. He.

[many blank lines; each iteration shown prints this same text]