## Pre-Training GPT-Neo on TinyStories
HuggingFace links: [GPT-Neo model](https://huggingface.co/docs/transformers/model_doc/gpt_neo), [TinyStories dataset](https://huggingface.co/datasets/roneneldan/TinyStories). 

In [36]:
# %pip install wandb
# %pip install transformers
# %pip install datasets
# %pip install torch torchvision torchaudio
# %pip install accelerate -U

In [37]:
with open('keys.txt', 'r') as file:
    api_key = file.read().strip()
import wandb; wandb.login(key=api_key)
import torch.nn as nn
import torch
from transformers import GPT2TokenizerFast, GPTNeoForCausalLM, GPTNeoConfig
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset, load_from_disk, DatasetDict
debug_mode = True
config = GPTNeoConfig(
    vocab_size=10_000,
    hidden_size=512,
    max_position_embeddings=512,
    num_layers=2,
    attention_types=[[["global", "local"], 1]],
    num_heads=4,
    window_size=256,
    intermediate_size=1024
)
# config



### Pre-Processing
All experiments use the same tokenizer, so in theory we only need to preprocess our data once. You can then skip this CPU-intensive task on the DelftBlue GPU nodes. 
The exception is experimenting with different context lengths: in that case you will need to re-tokenise, because inputs that were truncated at the old length cannot be recovered from the saved dataset (see below).
You may even have to load the entire dataset (or pre-load chunks) into memory, since DelftBlue has extremely slow IO. 

#### Tokenisation
Two things happen during tokenisation, of which the latter is optional:

1. Map (sub-)words to integers $\in [0, 10000)$.
2. Truncate/pad each input as necessary to fit within the context length, `max_position_embeddings` (512 here, the same value as `hidden_size`). 
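To make the second step concrete, here is a minimal pure-Python sketch of what truncating and right-padding to a fixed length amounts to. The helper `pad_or_truncate` is hypothetical and only illustrates the idea; the real work is done by the tokenizer call further down. The pad id 9999 is taken from the output shown later in this notebook.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return exactly `max_length` ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

PAD_ID = 9999  # pad/eos id of the 10k-vocabulary tokenizer used in this notebook

short_story = [12, 345, 6789]   # made-up token ids
long_story = list(range(600))   # made-up story that is longer than the context

padded = pad_or_truncate(short_story, max_length=8, pad_id=PAD_ID)
clipped = pad_or_truncate(long_story, max_length=512, pad_id=PAD_ID)

print(padded)                    # short input is filled up with the pad id
print(len(clipped), clipped[-1])  # long input loses everything past position 511
```

The tokens dropped from `long_story` are gone for good, which is why a change in context length requires tokenising again from the raw text.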

In [38]:
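# the maximum input length is the context length (max_position_embeddings), not the embedding width
tokenizer = GPT2TokenizerFast.from_pretrained('10k-gpt-neo', model_max_length=config.max_position_embeddings)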

# in theory, window_size x num_layers is possible. 
# But, this project is not about the efficacy of sliding window attention (though it could be!)
assert tokenizer.model_max_length == 512 
assert tokenizer.vocab_size == config.vocab_size

# printing this because of a bug in tokenizers (should be fixed now) https://github.com/huggingface/transformers/issues/26500
print(f'padding token is {tokenizer.pad_token}')
# HF wasn't saving this nor the tokenizer's pad_token
config.pad_token_id = tokenizer.pad_token_id

padding token is <|endoftext|>


In [39]:
dataset = load_dataset('roneneldan/tinystories')

small_dataset = DatasetDict({
    'train': dataset['train'].select(range(1000)),
    'validation': dataset['validation'].select(range(100))
})

if debug_mode:
    dataset = small_dataset

tokenized_dataset = dataset.map(
    lambda x: tokenizer(x['text'], truncation=True, padding='max_length'), 
    batched=True, num_proc=8, batch_size=1_000)

tokenized_dataset.save_to_disk(f'./tokenized_dataset', num_proc=5)

tokenized_dataset = load_from_disk(f'./tokenized_dataset')

assert len(tokenized_dataset['train'][0]['input_ids']) == config.max_position_embeddings
tokenized_dataset['train'][0]['input_ids'][-10:] 
# should be eos tokens (9999), given that most short stories are <512 tokens

Repo card metadata block was not found. Setting CardData to empty.

Saving the dataset (5/5 shards): 100%|██████████| 1000/1000 [00:00<00:00, 6011.53 examples/s]

Saving the dataset (5/5 shards): 100%|██████████| 100/100 [00:00<00:00, 1133.26 examples/s]


[9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999, 9999]

### Training
#### Model
We use GPT-Neo as the architecture for our model. This consists of the following blocks, generally seen in transformer architectures:

- Embeddings: Map one-hot (sparse) token vectors to a dense vector of length `hidden_size`. Add positional encoding. 
- $n$ Blocks: Contain self-attention and FFNs.
- Head: Map hidden state back to a useful output. 

We use the config specified at the top of this notebook to construct a model. 
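As a sanity check on the size of this config, the parameter count can be worked out by hand. The sketch below assumes that the LM head shares its weights with the token embeddings, that every layer norm has a weight and a bias, and that the q/k/v projections have no bias while the output projection and both FFN layers do (as in the attention class defined below). Under those assumptions the total matches the number that `model.num_parameters()` prints further down.

```python
vocab_size, hidden, max_pos = 10_000, 512, 512
num_layers, intermediate = 2, 1024

layer_norm = 2 * hidden                       # weight + bias

token_emb = vocab_size * hidden               # 5,120,000
pos_emb = max_pos * hidden                    #   262,144

# q, k, v without bias; out_proj with bias
attention = 3 * hidden * hidden + (hidden * hidden + hidden)
# c_fc and c_proj, both with bias
ffn = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)

block = layer_norm + attention + layer_norm + ffn
total = token_emb + pos_emb + num_layers * block + layer_norm  # + final layer norm

print(f"per block: {block:,}")
print(f"total:     {total:,}")
```

More than half of the parameters sit in the token embeddings, which is typical for models this small.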

In [40]:
class GPTNeoSelfAttention(nn.Module):
    def __init__(self, config, attention_type):
        super().__init__()
        self.config = config

        max_positions = config.max_position_embeddings
        bias = torch.tril(torch.ones((max_positions, max_positions), dtype=bool)).view(
            1, 1, max_positions, max_positions
        )

        # local causal self attention is a sliding window where each token can only attend to the previous
        # window_size tokens. This is implemented by updating the causal mask such that for each token
        # all other tokens are masked except the previous window_size tokens.
        if attention_type == "local":
            bias = torch.bitwise_xor(bias, torch.tril(bias, -config.window_size))

        self.register_buffer("bias", bias, persistent=False)
        self.register_buffer("masked_bias", torch.tensor(-1e9), persistent=False)

        self.attn_dropout = nn.Dropout(float(config.attention_dropout))
        self.resid_dropout = nn.Dropout(float(config.resid_dropout))
        self.is_causal = True

        self.embed_dim = config.hidden_size
        self.num_heads = config.num_heads
        self.head_dim = self.embed_dim // self.num_heads
        if self.head_dim * self.num_heads != self.embed_dim:
            raise ValueError(
                f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
                f" {self.num_heads})."
            )
        print("GOT HERE BABY")
        self.k_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
        self.v_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
        self.q_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=False)
        self.out_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=True)

    def _split_heads(self, tensor, num_heads, attn_head_size):
        """
        Splits hidden_size dim into attn_head_size and num_heads
        """
        new_shape = tensor.size()[:-1] + (num_heads, attn_head_size)
        tensor = tensor.view(new_shape)
        return tensor.permute(0, 2, 1, 3)  # (batch, head, seq_length, head_features)

    def _merge_heads(self, tensor, num_heads, attn_head_size):
        """
        Merges attn_head_size dim and num_attn_heads dim into hidden_size
        """
        tensor = tensor.permute(0, 2, 1, 3).contiguous()
        new_shape = tensor.size()[:-2] + (num_heads * attn_head_size,)
        return tensor.view(new_shape)

    def _attn(self, query, key, value, attention_mask=None, head_mask=None):
        # Keep the attention weights computation in fp32 to avoid overflow issues
        query = query.to(torch.float32)
        key = key.to(torch.float32)

        attn_weights = torch.matmul(query, key.transpose(-1, -2))

        query_length, key_length = query.size(-2), key.size(-2)
        causal_mask = self.bias[:, :, key_length - query_length : key_length, :key_length]
        mask_value = torch.finfo(attn_weights.dtype).min
        # Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`.
        # Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device`
        mask_value = torch.tensor(mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
        attn_weights = torch.where(causal_mask, attn_weights, mask_value)

        if attention_mask is not None:
            # Apply the attention mask
            attn_weights = attn_weights + attention_mask

        attn_weights = nn.functional.softmax(attn_weights, dim=-1)
        attn_weights = attn_weights.to(value.dtype)
        attn_weights = self.attn_dropout(attn_weights)

        # Mask heads if we want to
        if head_mask is not None:
            attn_weights = attn_weights * head_mask

        attn_output = torch.matmul(attn_weights, value)

        return attn_output, attn_weights

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        layer_past=None,
        head_mask=None,
        use_cache=False,
        output_attentions=False,
    ):
        query = self.q_proj(hidden_states)
        key = self.k_proj(hidden_states)
        value = self.v_proj(hidden_states)

        query = self._split_heads(query, self.num_heads, self.head_dim)
        key = self._split_heads(key, self.num_heads, self.head_dim)
        value = self._split_heads(value, self.num_heads, self.head_dim)

        if layer_past is not None:
            past_key = layer_past[0]
            past_value = layer_past[1]
            key = torch.cat((past_key, key), dim=-2)
            value = torch.cat((past_value, value), dim=-2)

        if use_cache is True:
            present = (key, value)
        else:
            present = None

        attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)

        attn_output = self._merge_heads(attn_output, self.num_heads, self.head_dim)
        attn_output = self.out_proj(attn_output)
        attn_output = self.resid_dropout(attn_output)

        outputs = (attn_output, present)
        if output_attentions:
            outputs += (attn_weights,)

        return outputs  # a, present, (attentions)
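The only difference between the `"global"` and `"local"` attention types above is the mask stored in `bias`. The following pure-Python sketch rebuilds both masks for a tiny sequence so the pattern is visible; `causal_mask` is a hypothetical helper, not part of the class. It encodes the same rule as the `torch.tril` / `bitwise_xor` construction: position `i` may attend to position `j` when `j <= i`, and for local attention additionally when `i - j < window_size`.

```python
def causal_mask(seq_len, window_size=None):
    """mask[i][j] is True when query position i may attend to key position j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i                      # causal: never look ahead
            if window_size is not None:           # local: only a sliding window
                visible = visible and (i - j) < window_size
            row.append(visible)
        mask.append(row)
    return mask

global_mask = causal_mask(5)
local_mask = causal_mask(5, window_size=2)

for name, mask in [("global", global_mask), ("local (window_size=2)", local_mask)]:
    print(name)
    for row in mask:
        print("  " + " ".join("x" if v else "." for v in row))
```

Note that the window includes the current token, so each position sees itself plus the `window_size - 1` tokens before it.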

In [41]:
class CustomGPTNeoForCausalLM(GPTNeoForCausalLM):
    def __init__(self, config):
        super().__init__(config)
        # Assuming attention types are specified in config.attention_layers
        # Replace the standard attention with the specified type
        for block in self.transformer.h:
            block.attn.attention = GPTNeoSelfAttention(config, block.attn.attention_type)
model = CustomGPTNeoForCausalLM( 
    config = config,
)
print(f'The model has {model.num_parameters():,} parameters.')


GOT HERE BABY
GOT HERE BABY
The model has 9,585,664 parameters.


In [42]:
# # model = CustomGPTNeoForCausalLM( 
# #     config = config,
# # )
# model = GPTNeoForCausalLM(config=config)

# print(f'The model has {model.num_parameters():,} parameters.')
# # Token embeddings:    10'000 * 512 = 5M 
# # Position embeddings:    512 * 512 = 260K
# # 2 Blocks: 2 * (~1M + ~1M) = 4M
# #   Attention: 4 * 512 * 512  (4 because of k, q, v, o)     = 1M 
# #   FFNs:      2 * 512 * 1024 (c_fc, c_proj)                = 1M 

# # total: ~9.5M 

# # note that the above does not count the LM_head separately; it typically gets swapped out depending on the downstream task. 
# # the LM_head, in this case, maps hidden states to logits over the vocabulary
# # 512 * 10'000 = 5M weights, but they are shared (tied) with the token embeddings, so they add nothing to the ~9.5M total

# # NOTE: uncomment to see the model architecture 
# # model 

In [43]:
# for comparison. Note, wte should be (10000, 64)
# GPTNeoForCausalLM.from_pretrained('roneneldan/TinyStories-1M')

#### Training loop
Hugging Face provides some powerful (and often confusingly long) APIs for model training. `TrainingArguments` specifies our hyperparameters, which are used by the `Trainer`; the `Trainer` also takes in the remaining objects (like `model`, `train_dataset`, and `data_collator`). Specifically:

- `learning_rate` and `num_train_epochs` determine how much the model learns. A higher rate is faster, but more unstable. More epochs (entire passes over the dataset) yield incrementally better results, at the cost of more training time. 
- Batch sizes determine how many samples the model sees in *parallel*. Given `gradient_accumulation_steps=1` and a `batch_size=8`, the model will backpropagate the average loss of 8 samples; with `batch_size=1` and `gradient_accumulation_steps=8`, it accumulates gradients over 8 single-sample batches before each update, which again averages over 8 samples. When comparing models, it is important to make sure the backpropagated loss is averaged over the same number of samples. 

- `data_collator` batches (and optionally, pads) the input for the model. We have already padded in our `tokenized_dataset`, and leaving this argument empty will automatically batch the inputs. So why do we need it? 

    Glad you asked. This has to do with how the loss is computed in causal language modelling. In our case, we try to predict $p(y | x)$, where $x$ is an input sequence of tokens, and $y$ is the next token following that sequence. Our model, unaware of the target token $y$, outputs $\hat y$. 
    
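    For `Trainer` to compute the (cross-entropy) loss, we need to provide it with both $y$ and $\hat y$. The `DataCollatorForLanguageModeling` (with `mlm=False`) takes care of the targets: it copies `input_ids` into a `labels` field and marks padding positions so that the loss ignores them. The model then shifts the labels by one position internally, so the prediction at each position is compared with the token that follows it.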

    The loss is the backbone of backpropagation, which we need to actually improve our model. If this is confusing, please re-watch Karpathy's GPT tutorial. 
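The label construction and the shifted loss described above can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the library's implementation: `make_labels` and `causal_lm_loss` are hypothetical helpers, the vocabulary has only 4 tokens, the pad id is 3, and -100 is used as the "ignore this position" label.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips
PAD_ID = 3           # toy pad id (the notebook's tokenizer uses 9999)

def make_labels(input_ids, pad_id=PAD_ID):
    """Copy the inputs as labels, but mark padding so it does not count."""
    return [IGNORE_INDEX if t == pad_id else t for t in input_ids]

def causal_lm_loss(logits, labels):
    """Mean cross-entropy where logits[t] is scored against labels[t + 1]."""
    losses = []
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == IGNORE_INDEX:
            continue  # padding contributes nothing to the loss
        log_norm = math.log(sum(math.exp(z) for z in step_logits))
        losses.append(log_norm - step_logits[target])
    return sum(losses) / len(losses)

input_ids = [0, 2, 1, PAD_ID, PAD_ID]
labels = make_labels(input_ids)
logits = [[0.0] * 4 for _ in input_ids]  # an untrained model: uniform over 4 tokens
loss = causal_lm_loss(logits, labels)

print(labels)
print(round(loss, 4))
```

A uniform guess over 4 tokens costs $\ln 4 \approx 1.386$; by the same reasoning a uniform guess over 10,000 tokens costs $\ln 10000 \approx 9.21$, which is roughly where the training loss below starts. One side effect worth knowing: because the pad token in this notebook is also `<|endoftext|>`, masking padding this way also hides the end-of-story token from the loss.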

If you prefer to write the training loop yourself, check out HF's `run_clm_no_trainer.py` script. 

In [44]:
train_dataset, eval_dataset = tokenized_dataset['train'], tokenized_dataset['validation']

batch_size = 16 # TinyStories claims 80, but I am training locally on my poor M1 Air
num_train_epochs = 1  # TinyStories doesn't mention
gradient_accumulation_steps = 16 # TinyStories claims 16

lr = 5e-4 # TinyStories claims 5e-4, higher values are preferable for smaller models

_train_steps = len(train_dataset) // (batch_size * gradient_accumulation_steps)
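# Evaluate every 10% of training steps.
# NOTE: with the 1000-sample debug subset this is 3 // 10 = 0; the run below still evaluates
# after every step, which suggests Trainer falls back to logging_steps in that case.
eval_steps = _train_steps // 10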


model_name = f'{model.num_parameters()//1e6:.1f}M-{config.num_layers}L-{config.num_heads}H-{config.hidden_size}C-{config.intermediate_size}I'

In [45]:
training_args = TrainingArguments(

    seed       = 42,
    use_cpu    = False, # use GPU if available (not necessarily faster on laptops, but Apple's MPS have good support)
    output_dir = f'./results/models/{model_name}',

    # NOTE: training params
    learning_rate    = lr,
    num_train_epochs = num_train_epochs,
    # Use a smaller batch size to fit into GPU RAM. 
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size  = batch_size,
    # You should aim to have the same amount of samples per acc step, in all of your experiments!
    # so, if you increase batch_size, decrease gradient_accumulation_steps by the same factor.
    gradient_accumulation_steps = gradient_accumulation_steps,

    # NOTE: Evaluation params
    # wandb is great for tracking experiments, it will even (try to) save your code nowadays
    # (newer transformers releases rename `evaluation_strategy` to `eval_strategy`)
    evaluation_strategy = 'steps',
    eval_steps = eval_steps,
    save_steps = eval_steps,

    logging_first_step=True,
    logging_steps=max(1,eval_steps),
    report_to  = 'wandb',
)

trainer = Trainer(
    model = model, 
    args = training_args, 
    train_dataset = train_dataset, 
    eval_dataset = eval_dataset,
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# print amount of training steps, and how often the model is evaluated
print(f'''
    training for {num_train_epochs} epochs, {len(train_dataset)} samples
    {batch_size} batch size, {gradient_accumulation_steps} accumulation steps
    gives {_train_steps} training steps.

    evaluating every {eval_steps} steps, {len(eval_dataset)} samples 
    ''')


    training for 1 epochs, 1000 samples
    16 batch size, 16 accumulation steps
    gives 3 training steps.

    evaluating every 0 steps, 100 samples 
    


Finally, we can train. 

This configuration takes ≤24hr to pre-train on my M1 MacBook Air with 16GB RAM. Python takes ≤4GB of (unified) memory at `batch_size=16` and ≤11GB at `batch_size=64`, though both take the same amount of time to train, likely because this processor is not designed to move that much data in and out of RAM constantly. And, to be honest, the GPU is underpowered. If you decide to go the local-training route, consider [chai](https://github.com/lvillani/chai) to keep your (Apple) laptop awake; there is probably a Windows/Linux equivalent too. 

In [47]:
# wandb.init(project='tinystories', name=model_name, config=training_args)
trainer.train()
trainer.save_model(f'./results/models/{model_name}')

 33%|███▎      | 1/3 [00:12<00:25, 12.69s/it]

{'loss': 8.5733, 'grad_norm': 2.0818498134613037, 'learning_rate': 0.0003333333333333333, 'epoch': 0.25}



 33%|███▎      | 1/3 [00:13<00:25, 12.69s/it]

{'eval_loss': 8.155171394348145, 'eval_runtime': 1.1088, 'eval_samples_per_second': 90.186, 'eval_steps_per_second': 6.313, 'epoch': 0.25}


 67%|██████▋   | 2/3 [00:20<00:10, 10.01s/it]

{'loss': 8.1718, 'grad_norm': 2.120964527130127, 'learning_rate': 0.00016666666666666666, 'epoch': 0.51}



 67%|██████▋   | 2/3 [00:21<00:10, 10.01s/it]

{'eval_loss': 7.904494762420654, 'eval_runtime': 1.0752, 'eval_samples_per_second': 93.008, 'eval_steps_per_second': 6.511, 'epoch': 0.51}


100%|██████████| 3/3 [00:28<00:00,  9.15s/it]

{'loss': 7.9172, 'grad_norm': 2.0379819869995117, 'learning_rate': 0.0, 'epoch': 0.76}



100%|██████████| 3/3 [00:30<00:00, 10.03s/it]


{'eval_loss': 7.78946590423584, 'eval_runtime': 1.0523, 'eval_samples_per_second': 95.029, 'eval_steps_per_second': 6.652, 'epoch': 0.76}
{'train_runtime': 30.1108, 'train_samples_per_second': 33.211, 'train_steps_per_second': 0.1, 'train_loss': 8.220765272776285, 'epoch': 0.76}
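The `epoch` values in the logs above stop at 0.76 rather than 1.0. The arithmetic below reproduces them, assuming that the number of optimizer steps per epoch is the number of batches floor-divided by `gradient_accumulation_steps`. That assumption is consistent with the logged values, but it is inferred from them rather than taken from the `Trainer` source.

```python
import math

num_samples, batch_size, grad_accum = 1000, 16, 16

batches_per_epoch = math.ceil(num_samples / batch_size)   # 63 batches
optimizer_steps = batches_per_epoch // grad_accum         # 3 updates
samples_per_step = batch_size * grad_accum                # 256 samples per update
leftover_batches = batches_per_epoch - optimizer_steps * grad_accum

# fraction of the epoch's batches consumed after each optimizer step
epochs_logged = [round(step * grad_accum / batches_per_epoch, 2)
                 for step in range(1, optimizer_steps + 1)]

print(f"{batches_per_epoch} batches, {optimizer_steps} optimizer steps, "
      f"{samples_per_step} samples per step")
print("epoch after each step:", epochs_logged)
print(f"{leftover_batches} batches are never used in this debug run")
```

With the full dataset the leftover is negligible, but in `debug_mode` roughly a quarter of the 1000 samples do not contribute to any update.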


In [None]:
wandb.finish()

## Inference!
Try out your own pre-trained model on some prompts.

In [48]:
import textwrap # for pretty printing
w = textwrap.TextWrapper(replace_whitespace=False, break_long_words=False, width=60, initial_indent='   ', subsequent_indent='  ')
def see(text): print('\n\033[3m' + '\n\n'.join(['\n'.join(w.wrap(line))
                 for line in text.splitlines() if line.strip() != '']) + '\033[0m\n')

In [49]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = '9.0M-2L-4H-512C-1024I'
tokenizer = AutoTokenizer.from_pretrained('10k-gpt-neo')
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(f'results/models/{model_name}')

In [50]:
prompt = 'Once upon a time, there was a chicken with a small'
input_ids = tokenizer.encode(prompt, return_tensors='pt')

output = model.generate(input_ids, max_length=300, num_beams=1)
output_text = tokenizer.decode(output[0])

# textwrap with indentation on every new paragraph
see(output_text)


[3m   Once upon a time, there was a chicken with a
  small................................................................................................................................................................................................................................................................................................[0m

