## Pre-Training GPT-Neo on TinyStories
HuggingFace links: [GPT-Neo model](https://huggingface.co/docs/transformers/model_doc/gpt_neo), [TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories). 

In [1]:
import wandb; wandb.login()
from transformers import GPT2TokenizerFast, GPTNeoForCausalLM, GPTNeoConfig
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset, load_from_disk

config = GPTNeoConfig(

    # number of tokens in the vocabulary 
    vocab_size = 10_000, 
    # embedding size (vector length) of each token 
    hidden_size=512, 
    # we thus have an embedding block of 512 x 10'000 parameters

    # maximum sequence length in tokens (the context window); longer inputs are truncated by the tokenizer, see below
    max_position_embeddings = 512, 

    # number of transformer blocks. div by 2 for attention_types
    num_layers=2, 
    # for global and local attention (GPT-Neo-specific)
    attention_types=[[["global", "local"], 1]], 

    num_heads=4,     # attention heads
    window_size=256, # for local attention (GPT-Neo-specific)

    intermediate_size=1024, # size of 'up-projection' layer in FFN
)
# config

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: filip-ignijic (filipignijic). Use `wandb login --relogin` to force relogin
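
The `attention_types` entry in the config above is a compact way of listing one attention kind per layer. Below is a minimal sketch of how I understand that expansion to work; `expand_attention_types` is a hypothetical helper written for illustration, not the library's own function.

```python
def expand_attention_types(attention_types):
    """Hypothetical helper: flatten [[pattern, repeats], ...] into one entry per layer."""
    layers = []
    for pattern, repeats in attention_types:
        for _ in range(repeats):
            layers.extend(pattern)
    return layers


# the value used in this notebook: the ["global", "local"] pattern, repeated once
attention_layers = expand_attention_types([[["global", "local"], 1]])
print(attention_layers)

# the expanded list must have one entry per transformer block,
# which is why num_layers is "divided by 2" for the repeat count
num_layers = 2
assert len(attention_layers) == num_layers
```

With 12 layers you would write `[[["global", "local"], 6]]` to alternate global and local attention across all blocks.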


### Pre-Processing
All experiments use the same tokenizer, so in theory we only need to preprocess our data once. You can then skip this CPU-intensive task on the DelftBlue GPU nodes, unless you are experimenting with different context lengths; in that case you will need to re-tokenise to recover inputs that were previously truncated (see below).
You may even have to load the entire dataset (or pre-load chunks) into memory, because DelftBlue has extremely slow IO. 

#### Tokenisation
Two things happen during tokenisation, of which the latter is optional:

1. Map (sub-)words to integers $\in [0, 9999]$.
2. Truncate/pad each input as necessary to fit within the context length (`model_max_length`, set to 512 below). 
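
Step 2 can be sketched in plain Python. This is only an illustration of the idea, assuming right-padding; `pad_or_truncate` is a hypothetical helper, and the real work is done by the tokenizer call further down.

```python
def pad_or_truncate(ids, max_length, pad_id):
    """Hypothetical helper: clip ids to max_length, then right-pad with pad_id.

    Also returns an attention mask (1 = real token, 0 = padding).
    """
    clipped = ids[:max_length]
    n_pad = max_length - len(clipped)
    padded = clipped + [pad_id] * n_pad
    attention_mask = [1] * len(clipped) + [0] * n_pad
    return padded, attention_mask


# a short story gets padded, a long one gets cut off
short_ids, short_mask = pad_or_truncate([5, 17, 42], max_length=6, pad_id=9999)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6, pad_id=9999)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Truncation is why changing the context length later means re-tokenising: the clipped tokens are gone from the saved dataset.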

In [2]:
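# the context length is max_position_embeddings (which happens to equal hidden_size in this config)
tokenizer = GPT2TokenizerFast.from_pretrained('10k-gpt-neo', model_max_length=config.max_position_embeddings)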

# in theory, window_size x num_layers is possible. 
# But, this project is not about the efficacy of sliding window attention (though it could be!)
assert tokenizer.model_max_length == 512 
assert tokenizer.vocab_size == config.vocab_size

# printing this because of a bug in tokenizers (should be fixed now) https://github.com/huggingface/transformers/issues/26500
print(f'padding token is {tokenizer.pad_token}')
# HF wasn't saving this nor the tokenizer's pad_token
config.pad_token_id = tokenizer.pad_token_id

padding token is <|endoftext|>


In [3]:
dataset = load_dataset("roneneldan/TinyStories")

def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length')

#tokenized_dataset = dataset.map(lambda x: tokenizer(x['text'], truncation=True, padding='max_length'), 
     #batched=True, num_proc=8, batch_size=1_000)
tokenized_dataset = dataset.map(tokenize_function, batched=True, num_proc=None, batch_size=1000)

tokenized_dataset.save_to_disk('./tokenized_dataset', num_proc=5)

tokenized_dataset = load_from_disk('./tokenized_dataset')

# every example is padded/truncated to the context length
assert len(tokenized_dataset['train'][0]['input_ids']) == config.max_position_embeddings
tokenized_dataset['train'][0]['input_ids'][-10:] 
# should be eos tokens (9999), given that most short stories are <512 tokens

Repo card metadata block was not found. Setting CardData to empty.


Map:   0%|          | 0/2119719 [00:00<?, ? examples/s]


#### Creating the TinyStories Tokenizer (side-quest)
> [!NOTE]
> You can safely skip this section if you are not interested in tokenisation at all, but I briefly explain how it works and how to get to a 10k token vocabulary. 

The TinyStories authors cite a vocabulary of 10'000 tokens, while the GPT-Neo tokenizer they provide on HF uses `EleutherAI/gpt-neo-125M` with a vocabulary of 50'257 tokens. I describe the process I took to get the **10k**-tokenizer below, in case you are interested. Tokenisation has a large influence on transformer performance despite its relative simplicity, but you do not need to understand it for this project. 

To cut out words from TokenizerFast, we need to access the underlying Rust object's state, as per https://github.com/huggingface/transformers/issues/15032. <!-- how the hell do these huggingfacers expect normal people to figure this out man? -->

In [4]:
# import json 
# tokenizer_state = json.loads(tokenizer.backend_tokenizer.model.__getstate__())

# n_words = len(tokenizer_state['vocab'])
# print(f'vocabulary: {n_words}') # would be lovely if this was 10'000

# # index n_words - 1 contains the special eos token `<|endoftext|>`
# new_vocab = {k: v for k, v in tokenizer_state['vocab'].items() if v < 10_000-1 or v == n_words-1}
# tokenizer_state['vocab'] = new_vocab

# # you can see that most common tokens are listed first (individual characters, pairs of chars, triples, etc.)
# print(f'new vocab : {len(new_vocab)}, {list(new_vocab.keys())[:3]}, {list(new_vocab.keys())[-3:]}')


# # Updating the tokenizer with new vocab: of course this doesn't work the first try, for some reason. 
# from tokenizers import models
# model_class = getattr(models, tokenizer_state.pop('type'))

# # 'str' object cannot be converted to 'PyTuple'
# # tokenizer.backend_tokenizer.model = model_class(**tokenizer_state) 

# # Let's manually create tuple objects, maybe Rust's type safety is keeping us safe. 
# new_merges = [tuple(m.split()) for m in tokenizer_state['merges']]
# print(f'new merges: {len(new_merges)}, {new_merges[:3]}, {new_merges[-3:]}') # Ġ means space ' ' 
# tokenizer_state['merges'] = new_merges

# # tokenizer.backend_tokenizer.model = model_class(**tokenizer_state) # Token `ordon` out of vocabulary

I went down a bit of a rabbit hole at this point. But, to summarise: as we are using the Rust-implemented `TokenizerFast`, we need both:

- `vocab.json` containing all the tokens, and 
- `merges.txt` explaining how to create those tokens from (sub-)word pairs.

Each merge-tuple in `merges` creates one new token. However, a merged token by definition contains at least two characters. We also want to be able to encode single characters, in case the model meets a word it has not seen before. Thus, we need a set of base tokens: one for each of the 256 possible byte values, plus the special `<|endoftext|>` token (`257` in total). 

This is where the `257` in the *original* vocabulary of `50'257` comes from. The remaining `50'000` tokens come from iteratively merging the most frequent pair of adjacent tokens, starting from the individual base tokens. Our definition of a merge thus becomes: any two tokens merged together to form a new token. 

> What happens when the tokenizer encounters a character that is not part of its base tokens? We assign it `unk_token` and move on, hoping that it doesn't happen too frequently. In this tokenizer, `unk_token == eos_token == '<|endoftext|>'`. Note that the base set also encodes all the bytes needed to construct Unicode characters (which is *close enough* to all characters).

For our tokenizer, this means we keep only the first `10'000 - 257` merges to end up with a vocabulary of 10k tokens. 
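
To make the "vocabulary = base tokens + merges" bookkeeping concrete, here is a toy BPE trainer in plain Python. It is a sketch under simplifying assumptions (character-level rather than byte-level, a made-up six-word corpus, and a hypothetical `train_bpe` helper), not the procedure used to build the real tokenizer.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Toy character-level BPE. Returns (base_tokens, merges)."""
    # each word is a tuple of symbols, weighted by how often it occurs
    corpus = Counter(tuple(w) for w in words)
    base_tokens = sorted({ch for word in corpus for ch in word})
    merges = []
    for _ in range(num_merges):
        # count every adjacent pair of symbols
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a new token
        merges.append(best)
        # rewrite the corpus with the chosen pair fused into one symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return base_tokens, merges


base_tokens, merges = train_bpe(["low", "low", "lower", "lowest", "new", "newer"], num_merges=3)
vocab = base_tokens + ["".join(pair) for pair in merges]
print(f"{len(base_tokens)} base tokens + {len(merges)} merges = {len(vocab)} tokens")

# the same bookkeeping for the real tokenizer
n_base = 257
n_merges_original = 50_257 - n_base  # merges in the original vocabulary
n_merges_10k = 10_000 - n_base       # merges we keep for the 10k vocabulary
print(n_merges_original, n_merges_10k)
```

Because merges are learned in order of frequency, keeping only the first ones keeps the most common tokens.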

In [5]:
# new_merges = new_merges[:10000-257] # 256 says: "Token `ordon` out of vocabulary"
# tokenizer_state['merges'] = new_merges
# print(f'new merges: {len(new_merges)}, {new_merges[:3]}, {new_merges[-3:]}') # Ġ means space ' ' 

# tokenizer.backend_tokenizer.model = model_class(**tokenizer_state)
# print(f'our tokenizer now has the {len(tokenizer)} most frequent tokens (from whatever dataset it was trained on)')

In [6]:
# # tokenizer.save_pretrained('10k-tokenizer')
# tokenizer.decode(tokenizer.encode('hello worldaisudhgiashg asdugh'))
# tokenizer.vocab_size

# # NOTE: Saving hangs indefinitely, I don't know why and it requires me to look at the rust impl.
# # instead, I just ended up modifying the merges.txt and vocab.json files manually, following
# # exactly what I did above. 

# tokenizer.name_or_path = '10k-tokenizer'
# # tokenizer.save_pretrained('10k-tokenizer') 
# # tokenizer.save_vocabulary('.', '10k-tokenizer.json')

And that's about it for tokenisation. All it does is map frequent character strings (i.e. tokens) to integers. 

The intuition is that assigning each token its own embedding (a vector of size `1 x hidden_size`) helps the transformer learn the semantics of these tokens. For example, given the two tokens 'about' and 'around', the transformer should learn similar embeddings for both. 

### Training
#### Model
We use GPT-Neo as the architecture for our model. This consists of the following blocks, generally seen in transformer architectures:

- Embeddings: Map one-hot (sparse) token vectors to a dense vector of length `hidden_size`. Add positional encoding. 
- $n$ Blocks: Contain self-attention and FFNs.
- Head: Map hidden state back to a useful output. 
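
The "one-hot to dense" description of the embedding layer is equivalent to a simple row lookup. A tiny sketch with made-up sizes and values (the real matrix is `10'000 x 512` and learned during training):

```python
vocab_size, hidden_size = 5, 3

# toy embedding matrix: row i is the embedding of token i (values are arbitrary)
embedding = [[float(i * hidden_size + j) for j in range(hidden_size)]
             for i in range(vocab_size)]


def one_hot(token_id, size):
    """Sparse vector with a single 1 at the token's index."""
    return [1.0 if i == token_id else 0.0 for i in range(size)]


def vec_times_matrix(vec, matrix):
    """(1 x vocab) @ (vocab x hidden) -> (1 x hidden)."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]


token_id = 3
dense = vec_times_matrix(one_hot(token_id, vocab_size), embedding)

# multiplying by a one-hot vector just selects a row of the matrix
assert dense == embedding[token_id]
print(dense)
```

This is why frameworks implement embeddings as a lookup table rather than an actual matrix product.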

We use the config specified at the top of this notebook to construct a model. 

In [22]:
model = GPTNeoForCausalLM(config=config)

print(f'The model has {model.num_parameters():,} parameters.')
# Token embeddings:    10'000 * 512 = 5M 
# Position embeddings:    512 * 512 = 262K
# 2 Blocks: 2 * (~1M + ~1M) = 4M
#   Attention: 4 * 512 * 512  (4 because of k, q, v, o)     = 1M 
#   FFNs:      2 * 512 * 1024 (c_fc, c_proj)                = 1M 

# total: ~9.5M 

# note that the above does not count the LM_head separately, which typically gets swapped out depending on the downstream task. 
# the LM_head, in this case, maps the final hidden state to logits over the vocabulary
# 512 * 10'000 = 5M    (A decent chunk compared to our 9.5M parameters)
# HF ties these weights to the token embeddings by default, which would explain why they do not show up in the total

# NOTE: uncomment to see the model architecture 
# model 

The model has 9,585,664 parameters.
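
The reported total can be reproduced by hand. The sketch below assumes a particular bias layout (no bias on the q/k/v projections, bias everywhere else) and an LM head tied to the token embeddings; I chose these assumptions because they reproduce the number printed above, so treat them as a plausible reading rather than a specification.

```python
vocab_size, hidden, max_pos, n_layers, inter = 10_000, 512, 512, 2, 1024

wte = vocab_size * hidden          # token embeddings
wpe = max_pos * hidden             # position embeddings
layer_norm = 2 * hidden            # weight + bias

# assumption: q, k, v projections have no bias; the output projection does
attention = 3 * hidden * hidden + (hidden * hidden + hidden)

# up-projection (c_fc) and down-projection (c_proj), both with bias
ffn = (hidden * inter + inter) + (inter * hidden + hidden)

# each block: layer norm -> attention -> layer norm -> FFN
block = layer_norm + attention + layer_norm + ffn

# assumption: the LM head shares wte, so it adds no parameters of its own
total = wte + wpe + n_layers * block + layer_norm  # plus the final layer norm

print(f"{total:,}")
```

Roughly 53% of the parameters sit in the token embeddings, which is why the vocabulary size matters so much for small models.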


In [8]:
# for comparison. Note, wte should be (10000, 64)
# GPTNeoForCausalLM.from_pretrained('roneneldan/TinyStories-1M')

#### Training loop
Hugging Face provides some powerful (and often confusingly long) APIs for model training. `TrainingArguments` specifies our hyperparameters, which are used by the `Trainer` along with the remaining objects (like `model`, `data_collator`, and `train_dataset`). Specifically:

- `learning_rate` and `num_train_epochs` determine how much the model learns. A higher rate is faster, but more unstable. More epochs (entire passes over the dataset) yield incrementally better results, at the cost of more training time. 
- Batch sizes determine how many samples the model sees in *parallel*. Given `gradient_accumulation_steps=1` and `batch_size=8`, the model backpropagates the average loss of 8 samples; with `batch_size=1`, it averages the loss over `gradient_accumulation_steps` samples. In general, each optimiser step averages over `batch_size * gradient_accumulation_steps` samples, so keep that product the same when comparing models. 

- `data_collator` batches (and optionally, pads) the input for the model. We have already padded in our `tokenized_dataset`, and leaving this argument empty will automatically batch the inputs. So why do we need it? 

    Glad you asked. This has to do with how the loss is computed in causal language modelling. In our case, we model $p(y \mid x)$, where $x$ is an input sequence of tokens, and $y$ is the next token following that sequence. Our model, unaware of the target token $y$, outputs a prediction $\hat y$. 
    
    To compute the (cross-entropy) loss, we need both $y$ and $\hat y$. `DataCollatorForLanguageModeling` (with `mlm=False`) takes care of this: it copies `input_ids` into a separate `labels` field and masks out the padding, and the model then reads the next token $y$ for every position from those labels.

    The loss is the backbone of backpropagation, which we need to actually improve our model. If this is confusing, please re-watch Karpathy's GPT tutorial. 
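
The label bookkeeping described above can be sketched without any framework. The helper names here are hypothetical and the token ids are made up; the only conventions taken from the real stack are that labels start as a copy of `input_ids`, that padding is replaced by `-100` so the loss skips it, and that position `i` is scored against the token at position `i + 1`.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips


def make_labels(input_ids, pad_id):
    """Copy input_ids, masking padding so it does not contribute to the loss."""
    return [IGNORE_INDEX if t == pad_id else t for t in input_ids]


def shifted_pairs(input_ids, labels):
    """Pair each prefix x with its next token y, skipping masked targets."""
    return [(input_ids[: i + 1], labels[i + 1])
            for i in range(len(labels) - 1)
            if labels[i + 1] != IGNORE_INDEX]


def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


input_ids = [12, 7, 3, 9999, 9999]  # a 3-token story, right-padded with id 9999
labels = make_labels(input_ids, pad_id=9999)
pairs = shifted_pairs(input_ids, labels)
print(labels)
print(pairs)

# a model that is undecided between two tokens pays log(2) for either target
print(cross_entropy([0.0, 0.0], target=0))
```

One thing to keep in mind: because the pad token doubles as `<|endoftext|>` in this notebook, masking padding also masks any end-of-text token in the labels, so the model may get little or no signal about when a story ends.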

If you prefer to write the training loop yourself, check out HF's `run_clm_no_trainer.py` scripts. 
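
Before setting the hyperparameters, it helps to see how batch size and gradient accumulation interact. A small sketch, using the 2,119,719 training examples shown in the tokenisation progress bar and the values chosen in this notebook; `training_schedule` is a hypothetical helper.

```python
def training_schedule(num_samples, batch_size, grad_accum, num_epochs=1, evals_per_run=10):
    """Return (samples per optimiser step, optimiser steps, steps between evaluations)."""
    effective_batch = batch_size * grad_accum
    steps = (num_samples // effective_batch) * num_epochs
    return effective_batch, steps, steps // evals_per_run


# this notebook: batch_size=16, gradient_accumulation_steps=16
print(training_schedule(2_119_719, batch_size=16, grad_accum=16))

# a larger batch with proportionally fewer accumulation steps is comparable:
# the number of samples averaged per optimiser step stays at 256
print(training_schedule(2_119_719, batch_size=64, grad_accum=4))
```

Both settings give the same number of optimiser steps, so their loss curves can be compared step by step.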

In [3]:
train_dataset, eval_dataset = tokenized_dataset['train'], tokenized_dataset['validation']

batch_size = 16 # TinyStories claims 80, but I am training locally on my poor M1 Air
num_train_epochs = 1  # TinyStories doesn't mention
gradient_accumulation_steps = 16 # TinyStories claims 16

lr = 5e-4 # TinyStories claims 5e-4, higher values are preferable for smaller models

_train_steps = len(train_dataset) // (batch_size * gradient_accumulation_steps)
eval_steps = _train_steps // 10 # evaluate every 10% of training steps

model_name = f'{model.num_parameters()//1e6:.1f}M-{config.num_layers}L-{config.num_heads}H-{config.hidden_size}C-{config.intermediate_size}I'


In [4]:
training_args = TrainingArguments(

    seed       = 42,
    use_cpu    = False, # use GPU if available (not necessarily faster on laptops, but Apple's MPS have good support)
    output_dir = f'./results/models/{model_name}',

    # NOTE: training params
    learning_rate    = lr,
    num_train_epochs = num_train_epochs,
    # Use a smaller batch size to fit into GPU RAM. 
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size  = batch_size,
    # You should aim to have the same amount of samples per acc step, in all of your experiments!
    # so, if you increase batch_size, decrease gradient_accumulation_steps by the same factor.
    gradient_accumulation_steps = gradient_accumulation_steps,

    # NOTE: Evaluation params
    # wandb is great for tracking experiments, it will even (try to) save your code nowadays
    evaluation_strategy = 'steps', # newer transformers releases call this `eval_strategy`
    eval_steps = eval_steps,
    save_steps = eval_steps,

    logging_first_step=True,
    logging_steps=eval_steps,
    report_to  = 'wandb',
)

trainer = Trainer(
    model = model, 
    args = training_args, 
    train_dataset = train_dataset, 
    eval_dataset = eval_dataset,
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# print the number of training steps, and how often the model is evaluated
print(f'''
    training for {num_train_epochs} epochs, {len(train_dataset)} samples
    {batch_size} batch size, {gradient_accumulation_steps} accumulation steps
    gives {_train_steps} training steps.

    evaluating every {eval_steps} steps, {len(eval_dataset)} samples 
    ''')


Finally, we can train. 

This configuration takes ≤24hr to pre-train on my M1 MacBook Air with 16GB RAM. Python takes ≤4GB VRAM at `batch_size=16` and ≤11GB at `batch_size=64`, though both take the same amount of time to train, likely because this processor is not designed to constantly move that much data in and out of RAM. And tbh, the GPU is lacking. If you decide to go the local-training route, consider [chai](https://github.com/lvillani/chai) to keep your (Apple) laptop awake; there's probably a Windows/Linux equivalent too. 

In [5]:
wandb.init(project='tinystories', name=model_name, config=training_args)
trainer.train()
trainer.save_model(f'./results/models/{model_name}')


## Inference!
Try out your own pre-trained model on some prompts.

In [1]:
import textwrap # for pretty printing
w = textwrap.TextWrapper(replace_whitespace=False, break_long_words=False, width=60, initial_indent='   ', subsequent_indent='  ')
def see(text): print('\n\033[3m' + '\n\n'.join(['\n'.join(w.wrap(line))
                 for line in text.splitlines() if line.strip() != '']) + '\033[0m\n')

In [6]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = '9.0M-2L-4H-512C-1024I'
tokenizer = AutoTokenizer.from_pretrained('10k-gpt-neo')
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(f'results/models/{model_name}')

In [7]:
prompt = 'Once upon a time, there was a chicken with'
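input_ids = tokenizer.encode(prompt, return_tensors='pt')

# NOTE: generate() warns below because no attention_mask/pad_token_id is passed. It also falls back to
# eos_token_id=50256 (the GPTNeoConfig default), which lies outside our 10k vocabulary, so generation
# can never stop on an end-of-text token and always runs to max_length.
# Passing eos_token_id=tokenizer.eos_token_id may let it stop early, if the model learned to emit that token.
output = model.generate(input_ids, max_length=300, num_beams=1)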
output_text = tokenizer.decode(output[0])

# textwrap with indentation on every new paragraph
see(output_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



   Once upon a time, there was a chicken with a big, strong
  head. The chicken was very happy because it had a lot of
  friends. One day, the chicken wanted to go on an
  adventure. So, the chicken went to the river to swim.

   As the chicken swam, it saw a big, scary dog. The dog was
  very scary and had sharp teeth. The chicken was scared and
  didn't know what to do. But then, the dog came back with a
  big, strong head. The dog was very happy and played with
  the chicken all day.

   The chicken and the dog became best friends. They played
  together every day. The dog was not scary anymore. The
  end. The end. The end. The end. The end. The end. The end.
  The end. The end. The end. The end. The end. The end. The
  end. The end. The end. The end. The end. The end. The end.
  The end. The end. The end. The end. The end. The end. The
  end. The end. The end. The end. The end. The end. The end.
  The end. The end. The end. The end. The end. The end. The
  end. The end end. Th