## Pre-Training GPT-Neo on TinyStories
Hugging Face links: [GPT-Neo model](https://huggingface.co/docs/transformers/model_doc/gpt_neo), [TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories). 

In [1]:
# %pip install wandb
# %pip install transformers
# %pip install datasets
# %pip install torch torchvision torchaudio
# %pip install accelerate -U

In [2]:
import wandb
from transformers import GPT2TokenizerFast, GPTNeoForCausalLM, GPTNeoConfig
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset, load_from_disk, DatasetDict

# read the W&B API key from a local file (keep keys.txt out of version control)
with open('keys.txt', 'r') as file:
    api_key = file.read().strip()
wandb.login(key=api_key)

debug_mode = True  # train on a small subset to check that the pipeline runs end to end
config = GPTNeoConfig(
    vocab_size=10_000,
    hidden_size=512,
    max_position_embeddings=512,
    num_layers=2,
    attention_types=[[["global", "local"], 1]],
    num_heads=4,
    window_size=256,
    intermediate_size=1024
)
# config

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: lkeskull (lkeskullorg). Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /Users/laurikeskull/.netrc


### Pre-Processing
All experiments use the same tokenizer, so in principle the data only needs to be preprocessed once. You can then skip this CPU-intensive step on the DelftBlue GPU nodes. 
The exception is when you experiment with different context lengths: in that case you will need to re-tokenise to recover inputs that were previously truncated (see below).
Because IO on DelftBlue is extremely slow, you may also have to load the entire dataset (or pre-load chunks) into memory. 

#### Tokenisation
Two things happen during tokenisation, of which the latter is optional:

1. Map (sub-)words to integers $\in [0, 9999]$.
2. Truncate/pad each input as necessary to fit within the context length (`max_position_embeddings`). 
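
To make step 2 concrete, here is a minimal pure-Python sketch of what truncating and right-padding to a fixed length amounts to. The helper `pad_or_truncate` is hypothetical (it is not part of the tokenizer API) and the toy ids are made up; only the pad id 9999 is taken from this notebook.

```python
def pad_or_truncate(ids, max_length, pad_id):
    """Clip a token list to max_length, then right-pad it with pad_id."""
    clipped = ids[:max_length]
    num_pad = max_length - len(clipped)
    attention_mask = [1] * len(clipped) + [0] * num_pad  # 0 marks padding
    return clipped + [pad_id] * num_pad, attention_mask

# A short input is padded, a long input is truncated.
short_ids, short_mask = pad_or_truncate([5, 17, 42], max_length=6, pad_id=9999)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6, pad_id=9999)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Truncation throws tokens away, which is why a change in context length requires re-tokenising from the raw text.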

In [3]:
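# the maximum input length is the context length (max_position_embeddings), not the embedding width
tokenizer = GPT2TokenizerFast.from_pretrained('10k-gpt-neo', model_max_length=config.max_position_embeddings)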

# in theory, window_size x num_layers is possible. 
# But, this project is not about the efficacy of sliding window attention (though it could be!)
assert tokenizer.model_max_length == 512 
assert tokenizer.vocab_size == config.vocab_size

# printing this because of a bug in tokenizers (should be fixed now) https://github.com/huggingface/transformers/issues/26500
print(f'padding token is {tokenizer.pad_token}')
# HF wasn't saving this nor the tokenizer's pad_token
config.pad_token_id = tokenizer.pad_token_id

padding token is <|endoftext|>
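
The comment in the cell above mentions that a context of `window_size x num_layers` is possible in theory. As a rough illustration of why, the sketch below builds a global (fully causal) mask and a local (sliding-window) mask for a toy sequence. The exact window boundaries are an assumption on my part and may differ by one from the GPT-Neo implementation; the function names are hypothetical.

```python
def causal_mask(seq_len):
    """Global attention: position i may attend to every position j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def local_causal_mask(seq_len, window_size):
    """Local attention: position i may attend only to the last window_size positions."""
    return [[i - window_size < j <= i for j in range(seq_len)]
            for i in range(seq_len)]

def show(mask):
    """Render a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return '\n'.join(''.join('x' if allowed else '.' for allowed in row)
                     for row in mask)

global_rows = causal_mask(5)
local_rows = local_causal_mask(5, window_size=2)

print(show(global_rows))
print()
print(show(local_rows))
```

With only local layers, each additional layer lets information travel roughly one more window back, which is where the `window_size x num_layers` figure comes from. The config in this notebook alternates a global and a local layer, so the full context is visible anyway.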


In [4]:
dataset = load_dataset('roneneldan/TinyStories')

small_dataset = DatasetDict({
    'train': dataset['train'].select(range(1000)),
    'validation': dataset['validation'].select(range(100))
})

if debug_mode:
    dataset = small_dataset

tokenized_dataset = dataset.map(
    lambda x: tokenizer(x['text'], truncation=True, padding='max_length'), 
    batched=True, num_proc=8, batch_size=1_000)

tokenized_dataset.save_to_disk('./tokenized_dataset', num_proc=5)

tokenized_dataset = load_from_disk('./tokenized_dataset')

assert len(tokenized_dataset['train'][0]['input_ids']) == config.max_position_embeddings
tokenized_dataset['train'][0]['input_ids'][-10:] 
# should be pad tokens (9999, the same id as eos), given that most short stories are <512 tokens

Repo card metadata block was not found. Setting CardData to empty.
Saving the dataset (5/5 shards): 100%|██████████| 1000/1000 [00:00<00:00, 7594.39 examples/s] 
Saving the dataset (5/5 shards): 100%|██████████| 100/100 [00:00<00:00, 1070.69 examples/s]


[9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999]

### Training
#### Model
We use GPT-Neo as the architecture for our model. This consists of the following blocks, generally seen in transformer architectures:

- Embeddings: Map one-hot (sparse) token vectors to a dense vector of length `hidden_size`. Add positional encoding. 
- $n$ Blocks: Contain self-attention and FFNs.
- Head: Map hidden state back to a useful output. 

We use the config specified at the top of this notebook to construct a model. 

In [5]:
model = GPTNeoForCausalLM(config=config)

print(f'The model has {model.num_parameters():,} parameters.')
# Token embeddings:    10'000 * 512 = 5.1M 
# Position embeddings:    512 * 512 = 262K
# 2 Blocks: 2 * (~1M + ~1M) = ~4.2M
#   Attention: 4 * 512 * 512  (4 because of k, q, v, o)     = ~1M 
#   FFNs:      2 * 512 * 1024 (c_fc, c_proj)                = ~1M 

# total: ~9.6M 

# note that the above does not count the LM_head separately: by default its weight is tied to the
# token embeddings, so it adds no parameters of its own. 
# the LM_head, in this case, maps the final hidden state to logits over the vocabulary
# untied, it would add 512 * 10'000 = 5M    (A decent chunk compared to our 9.6M parameters)

# NOTE: uncomment to see the model architecture 
# model 

The model has 9,585,664 parameters.
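
The estimate in the code comments above can be tightened into an exact count. The sketch below is plain arithmetic under a few assumptions about the layer layout that are not stated in this notebook: no bias on the q/k/v projections, a bias on every other linear layer, a weight and a bias per LayerNorm, and an LM head that shares its weight with the token embeddings. Under those assumptions the total matches the printed number.

```python
vocab_size, hidden, max_pos = 10_000, 512, 512
num_layers, intermediate = 2, 1024

wte = vocab_size * hidden   # token embeddings
wpe = max_pos * hidden      # learned position embeddings
layer_norm = 2 * hidden     # weight + bias

# Assumption: q, k, v have no bias; the output projection has one.
attention = 3 * hidden * hidden + (hidden * hidden + hidden)
# Two linear layers (c_fc, c_proj), each with a bias.
mlp = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
# Each block: LayerNorm, attention, LayerNorm, MLP.
block = layer_norm + attention + layer_norm + mlp

# Assumption: the LM head is tied to wte, so it contributes nothing extra.
# The trailing layer_norm is the final LayerNorm after the last block.
total = wte + wpe + num_layers * block + layer_norm

print(f'{total:,}')  # 9,585,664
```

More than half of the parameters sit in the token embedding matrix, which is typical for models this small.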


In [6]:
# for comparison. Note, wte should be (10000, 64)
# GPTNeoForCausalLM.from_pretrained('roneneldan/TinyStories-1M')

#### Training loop
Hugging Face provides some powerful (and often confusingly long) APIs for model training. `TrainingArguments` specifies our hyperparameters; the `Trainer` uses them together with the remaining objects (like `model`, `train_dataset`, and `data_collator`). Specifically:

- `learning_rate` and `num_train_epochs` determine how much the model learns. A higher rate is faster, but more unstable. More epochs (entire passes over the dataset) yield incrementally better results, at the cost of more training time. 
- Batch sizes determine how many samples the model sees in *parallel*. With `gradient_accumulation_steps=1` and `batch_size=8`, each weight update uses the average loss over 8 samples; with `batch_size=1` and `gradient_accumulation_steps=8`, it also averages over 8 samples, but processes them one at a time. When comparing models, make sure each update averages the loss over the same number of samples, i.e. keep `batch_size * gradient_accumulation_steps` constant. 

- `data_collator` batches (and optionally, pads) the input for the model. We have already padded in our `tokenized_dataset`, and leaving this argument empty will automatically batch the inputs. So why do we need it? 

    Glad you asked. This has to do with how the loss is computed in causal language modelling. In our case, we try to predict $p(y | x)$, where $x$ is an input sequence of tokens, and $y$ is the next token following that sequence. Our model, unaware of the target token $y$, outputs $\hat y$. 
    
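    For `Trainer` to compute the (cross-entropy) loss, we need to provide it with both $y$ and $\hat y$. The `DataCollatorForLanguageModeling` takes care of this: with `mlm=False` it copies `input_ids` into a separate `labels` field (replacing padding tokens with `-100`, which the loss ignores). The model then shifts the labels by one position internally, so that the prediction at each position is scored against the next token.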

    The loss is the backbone of backpropagation, which we need to actually improve our model. If this is confusing, please re-watch Karpathy's GPT tutorial. 
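
The label handling described above can be sketched in a few lines of pure Python. This is a toy illustration, not the library's code: the vocabulary has 4 tokens, token 3 plays the role of the pad token, and `make_labels` and `causal_lm_loss` are hypothetical helper names.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips

def make_labels(input_ids, pad_id):
    """Copy input_ids into labels and mask out padding."""
    return [IGNORE_INDEX if t == pad_id else t for t in input_ids]

def causal_lm_loss(logits, labels):
    """Mean cross-entropy where the logits at position t predict token t+1."""
    losses = []
    for step_logits, target in zip(logits[:-1], labels[1:]):  # the shift by one
        if target == IGNORE_INDEX:
            continue  # padded positions do not contribute
        log_norm = math.log(sum(math.exp(z) for z in step_logits))
        losses.append(log_norm - step_logits[target])
    return sum(losses) / len(losses)

input_ids = [2, 0, 1, 3, 3]                       # two trailing pad tokens
labels = make_labels(input_ids, pad_id=3)
uniform_logits = [[0.0] * 4 for _ in input_ids]   # a model that knows nothing
loss = causal_lm_loss(uniform_logits, labels)

print(labels, round(loss, 4))
```

A model that spreads its probability uniformly over $V$ tokens has a loss of $\ln V$. For the 10,000-token vocabulary used here that is $\ln 10000 \approx 9.21$, close to the first training loss logged further down (9.29). Because the pad token and the end-of-text token share an id in this notebook, masking the padding also hides the end-of-text token from the loss.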

If you prefer to write the training loop yourself, check out Hugging Face's `run_clm_no_trainer.py` example script. 

In [7]:
train_dataset, eval_dataset = tokenized_dataset['train'], tokenized_dataset['validation']

batch_size = 16 # TinyStories claims 80, but I am training locally on my poor M1 Air
num_train_epochs = 1  # TinyStories doesn't mention
gradient_accumulation_steps = 16 # TinyStories claims 16

lr = 5e-4 # TinyStories claims 5e-4, higher values are preferable for smaller models

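_train_steps = len(train_dataset) // (batch_size * gradient_accumulation_steps)
# Evaluate every 10% of training steps.
# NOTE: in debug mode (3 training steps) this rounds down to 0; logging_steps below is clamped to at least 1
eval_steps = _train_steps // 10


# NOTE: // is floor division, so the 9.6M-parameter model is labelled '9.0M'
model_name = f'{model.num_parameters()//1e6:.1f}M-{config.num_layers}L-{config.num_heads}H-{config.hidden_size}C-{config.intermediate_size}I'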

In [8]:
training_args = TrainingArguments(

    seed       = 42,
    use_cpu    = False, # use GPU if available (not necessarily faster on laptops, but Apple's MPS have good support)
    output_dir = f'./results/models/{model_name}',

    # NOTE: training params
    learning_rate    = lr,
    num_train_epochs = num_train_epochs,
    # Use a smaller batch size to fit into GPU RAM. 
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size  = batch_size,
    # You should aim to have the same number of samples per optimizer step in all of your experiments!
    # So, if you increase batch_size, decrease gradient_accumulation_steps by the same factor.
    gradient_accumulation_steps = gradient_accumulation_steps,

    # NOTE: Evaluation params
    evaluation_strategy = 'steps',
    eval_steps = eval_steps,
    save_steps = eval_steps,

    logging_first_step=True,
    logging_steps=max(1,eval_steps),
    # wandb is great for tracking experiments, it will even (try to) save your code nowadays
    report_to  = 'wandb',
)

trainer = Trainer(
    model = model, 
    args = training_args, 
    train_dataset = train_dataset, 
    eval_dataset = eval_dataset,
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# print amount of training steps, and how often the model is evaluated
print(f'''
    training for {num_train_epochs} epochs, {len(train_dataset)} samples
    {batch_size} batch size, {gradient_accumulation_steps} accumulation steps
    gives {_train_steps} training steps.

    evaluating every {eval_steps} steps, {len(eval_dataset)} samples 
    ''')


    training for 1 epochs, 1000 samples
    16 batch size, 16 accumulation steps
    gives 3 training steps.

    evaluating every 0 steps, 100 samples 
    


Finally, we can train. 

This configuration takes ≤24 hours to pre-train on my M1 MacBook Air with 16GB RAM. Python uses ≤4GB of memory at `batch_size=16` and ≤11GB at `batch_size=64`, though both take the same amount of time to train, likely because this processor is not designed to move that much data in and out of RAM constantly. And to be honest, the GPU is lacking. If you decide to go the local-training route, consider [chai](https://github.com/lvillani/chai) to keep your (Apple) laptop awake; there's probably a Windows/Linux equivalent too. 

In [9]:
wandb.init(project='tinystories', name=model_name, config=training_args)
trainer.train()
trainer.save_model(f'./results/models/{model_name}')

{'loss': 9.2934, 'grad_norm': 2.615218162536621, 'learning_rate': 0.0003333333333333333, 'epoch': 0.25}
{'eval_loss': 8.511707305908203, 'eval_runtime': 1.3513, 'eval_samples_per_second': 74.003, 'eval_steps_per_second': 5.18, 'epoch': 0.25}
{'loss': 8.5252, 'grad_norm': 2.599982261657715, 'learning_rate': 0.00016666666666666666, 'epoch': 0.51}
{'eval_loss': 8.12758731842041, 'eval_runtime': 1.0247, 'eval_samples_per_second': 97.59, 'eval_steps_per_second': 6.831, 'epoch': 0.51}
{'loss': 8.1393, 'grad_norm': 2.198148250579834, 'learning_rate': 0.0, 'epoch': 0.76}
{'eval_loss': 7.979551315307617, 'eval_runtime': 1.2673, 'eval_samples_per_second': 78.906, 'eval_steps_per_second': 5.523, 'epoch': 0.76}
{'train_runtime': 33.5303, 'train_samples_per_second': 29.824, 'train_steps_per_second': 0.089, 'train_loss': 8.652612368265787, 'epoch': 0.76}
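
The debug run stops at `'epoch': 0.76` rather than 1.0. One reading that is consistent with the logged numbers (an assumption about how this `Trainer` version counts steps, not something the logs state) is that batches which do not fill a complete accumulation cycle are dropped. The arithmetic, using the values from this notebook:

```python
import math

num_samples, batch_size, grad_accum = 1000, 16, 16

# 62 full batches plus one partial batch of 8 samples
batches_per_epoch = math.ceil(num_samples / batch_size)
# Assumption: leftover batches that do not fill an accumulation cycle are dropped
optimizer_steps = batches_per_epoch // grad_accum
# Number of samples averaged into each weight update
samples_per_step = batch_size * grad_accum
# Fraction of the epoch's batches that were actually consumed
epoch_fraction = optimizer_steps * grad_accum / batches_per_epoch

print(batches_per_epoch, optimizer_steps, samples_per_step, round(epoch_fraction, 4))
```

The same reasoning gives 16/63 ≈ 0.25 and 32/63 ≈ 0.51 for the first two steps, matching the `epoch` values in the log. On the full dataset the dropped remainder is negligible.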


In [10]:
wandb.finish()

Run history:
eval/loss                █▃▁
eval/runtime             █▁▆
eval/samples_per_second  ▁█▂
eval/steps_per_second    ▁█▂
train/epoch              ▁▁▅▅███
train/global_step        ▁▁▅▅███
train/grad_norm          ██▁
train/learning_rate      █▅▁
train/loss               █▃▁

Run summary:
eval/loss                7.97955
eval/runtime             1.2673
eval/samples_per_second  78.906
eval/steps_per_second    5.523
total_flos               9917347921920.0
train/epoch              0.7619
train/global_step        3.0
train/grad_norm          2.19815
train/learning_rate      0.0
train/loss               8.1393


## Inference!
Try out your own pre-trained model on some prompts.

In [11]:
import textwrap # for pretty printing
w = textwrap.TextWrapper(replace_whitespace=False, break_long_words=False, width=60, initial_indent='   ', subsequent_indent='  ')
def see(text): print('\n\033[3m' + '\n\n'.join(['\n'.join(w.wrap(line))
                 for line in text.splitlines() if line.strip() != '']) + '\033[0m\n')

In [12]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = '9.0M-2L-4H-512C-1024I'
tokenizer = AutoTokenizer.from_pretrained('10k-gpt-neo')
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(f'results/models/{model_name}')

In [13]:
prompt = 'Once upon a time, there was a chicken with'
input_ids = tokenizer.encode(prompt, return_tensors='pt')

output = model.generate(input_ids, max_length=300, num_beams=1)
output_text = tokenizer.decode(output[0])

# textwrap with indentation on every new paragraph
see(output_text)


   Once upon a time, there was a chicken
  with.......... [the run of full stops continues to the end of the output]
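
The repeated full stop is what you would expect from a model that has only seen three optimizer steps. With `num_beams=1` and sampling left at its default (off, as far as I know), `generate` picks the single most likely token at every step, so a model whose top prediction barely depends on the context repeats itself. The sketch below imitates that loop in pure Python; `greedy_decode` and `toy_model` are hypothetical stand-ins, not library functions.

```python
def greedy_decode(next_token_logits, prompt_ids, max_length, eos_id):
    """Repeatedly append the arg-max token until max_length or eos_id is reached."""
    ids = list(prompt_ids)
    while len(ids) < max_length:
        logits = next_token_logits(ids)
        best = max(range(len(logits)), key=logits.__getitem__)
        ids.append(best)
        if best == eos_id:
            break
    return ids

def toy_model(ids):
    """Stand-in for an undertrained model: ignores the context, always prefers token 1."""
    return [0.1, 2.0, 0.3, 0.0]

generated = greedy_decode(toy_model, prompt_ids=[2, 0], max_length=8, eos_id=3)
print(generated)
```

Training on the full dataset (`debug_mode = False`) is the real fix; sampling instead of greedy decoding only makes the output more varied, not more coherent.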

