## Pre-Training [`GPT-Neo`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neo/modeling_gpt_neo.py) & [`RoBERTa`](https://github.com/huggingface/transformers/blob/main/src/transformers/models/roberta/modeling_roberta.py) on [`TinyStories`](https://huggingface.co/datasets/roneneldan/TinyStories)

This notebook shows how to pre-train both models. I put them in one notebook because the majority of the code is shared, but you may want to separate the logic per model.

In [1]:
# %pip install torch wandb transformers[torch] datasets tqdm 
%load_ext autoreload
%autoreload 2 

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs

import wandb; wandb.login()
from transformers import (
    RobertaForMaskedLM, RobertaConfig, RobertaTokenizerFast,
    GPTNeoForCausalLM, GPTNeoConfig, GPT2TokenizerFast
)



wandb: Currently logged in as: st1p42 (tiny-transformers). Use `wandb login --relogin` to force relogin


#### Models 
We consider GPT-Neo and RoBERTa (a BERT variant) as base transformer architectures. Both consist of the following blocks, linked via residual connections:

- Embeddings: Map one-hot (sparse) token vectors to a dense vector of length `hidden_size`. Add positional encoding. 
- $n$ Blocks: Contain self-attention and FFNs.
- Head: Map hidden state back to a useful output. 

This specifies some of the model-related hyperparameters. I chose them based on what achieved reasonable performance in the [TinyStories paper](https://arxiv.org/abs/2305.07759), while also being feasible to train on our limited compute budgets. 

In [3]:
embed_size = 512
kqv_factor = 8
kqv_size = embed_size // kqv_factor
ffn_original_width = 1024
ffn_new_width = int(ffn_original_width + 2 * embed_size * (1 - 1/kqv_factor))

query_groups_factor = 0.25          # try 0.75, 0.5, 0.25, for 8 and 16 heads also 0.125
attn_heads = 4
num_key_value_heads = int(attn_heads * query_groups_factor)
ffn_new_width = ffn_new_width + (kqv_size - int(kqv_size * query_groups_factor))

config_gpt = dict(

    # EMBEDDING PARAMETERS
    vocab_size              = 10_000,   # number of tokens in the vocabulary 
    hidden_size             = embed_size,      # embedding size (vector length) of each token
    max_position_embeddings = 512,      # maximum sequence length (context window)

    # BLOCKS (ATTN & FFN)
    num_layers          = 2,                    # number of transformer blocks
    attention_types     = [[["global", "local"], 1]], # (GPT-Neo-specific) global and local attention 
    num_heads           = attn_heads,                    # attention heads
    window_size         = 256,                  # (GPT-Neo-specific) for local attention 
    intermediate_size   = 1024,                 # size of 'up-projection' layer in FFN

    pad_token_id = 0,           # need to specify this for tokenizer interop between models
)


config_gqa = dict(

    # EMBEDDING PARAMETERS
    vocab_size              = 10_000,   # number of tokens in the vocabulary
    hidden_size             = embed_size,      # embedding size (vector length) of each token
    max_position_embeddings = 512,      # maximum sequence length (context window)

    # BLOCKS (ATTN & FFN)
    num_layers          = 2,                    # number of transformer blocks
    attention_types     = [[["global", "local"], 1]], # (GPT-Neo-specific) global and local attention
    num_heads           = attn_heads,                    # attention heads
    window_size         = 256,                  # (GPT-Neo-specific) for local attention
    intermediate_size   = ffn_new_width,                 # size of 'up-projection' layer in FFN

    pad_token_id = 0,           # need to specify this for tokenizer interop between models
)

config_rob = dict(
    
    # EMBEDDING PARAMETERS
    vocab_size              = 10_000,   
    hidden_size             = 512,
    # we add 1 as RoBERTa uses a special position embedding for the padding token (zero vector)
    max_position_embeddings = config_gpt['max_position_embeddings'] + 1,

    # BLOCKS (of course naming is different in roberta :) )
    num_hidden_layers = config_gpt['num_layers'],
    num_attention_heads = config_gpt['num_heads'],
    intermediate_size=1024,                     

    pad_token_id = 0,
)

config_gpt = GPTNeoConfig(**config_gpt)
config_rob = RobertaConfig(**config_rob)
config_gqa = GPTNeoConfig(**config_gqa)
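
The head arithmetic in the cell above can be checked by hand. The following is a minimal sketch that recomputes the projection widths from the same hyperparameters. The mapping from query heads to key/value heads is the standard grouped-query convention and is an assumption here, since the internals of `custom_gpt_gqa` are not shown.

```python
embed_size, kqv_factor = 512, 8
attn_heads, query_groups_factor = 4, 0.25

kqv_size = embed_size // kqv_factor                          # 64: total query width
head_dim = kqv_size // attn_heads                            # 16: width of a single head
num_key_value_heads = int(attn_heads * query_groups_factor)  # 1 shared key/value head
kv_width = head_dim * num_key_value_heads                    # 16: total key (or value) width

# Assumed (standard GQA) mapping: consecutive query heads share one key/value head
group_size = attn_heads // num_key_value_heads
kv_head_for_query = [q // group_size for q in range(attn_heads)]

print(kqv_size, head_dim, kv_width, kv_head_for_query)
```

With these settings all four query heads read from the same key/value head, which matches the `k_proj` and `v_proj` output width of 16 in the model printout further down.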

In [4]:
from custom_gpt_gqa import CustomGPTNeoForCausalLM, GPTNeoGQASelfAttention

# TODO: you should set all pytorch & huggingface seeds here as model initialisation depends on it

gpt = GPTNeoForCausalLM(config=config_gpt)
rob = RobertaForMaskedLM(config=config_rob)

gqa_model_gpt = CustomGPTNeoForCausalLM(config_gqa=config_gqa)
gqa_model_gpt.set_kqv_size(kqv_size)
gqa_model_gpt.set_attention(num_key_value_heads=num_key_value_heads)
# print(gqa_model_gpt.state_dict())


print(f'''
    The custom model has {gqa_model_gpt.num_parameters():,} parameters,
    GPT Neo - {gpt.num_parameters():,} parameters,
    and RoBERTa - {rob.num_parameters():,} parameters.
    ''')

# gpt, rob # uncomment to see model architecture
# gqa_model_gpt


    The custom model has 9,587,552 parameters,
    GPT Neo - 9,585,664 parameters,
    and RoBERTa - 9,863,952 parameters.
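
To see why the wider FFN roughly cancels out the smaller attention projections, here is a rough sketch that counts parameters by hand. The helper names are hypothetical. The layer layout (bias-free q/k/v projections, biased output projection, tied LM head) is read off the GPT-Neo printout in this notebook rather than from the library source.

```python
def attn_params(hidden, q_out, kv_out):
    """q/k/v projections without bias, output projection with bias."""
    return hidden * q_out + 2 * hidden * kv_out + (q_out * hidden + hidden)


def ffn_params(hidden, width):
    """Up- and down-projection, both with bias."""
    return (hidden * width + width) + (width * hidden + hidden)


def gpt_neo_params(vocab, positions, hidden, layers, q_out, kv_out, ffn_width):
    layer_norm = 2 * hidden                       # weight + bias
    block = (2 * layer_norm
             + attn_params(hidden, q_out, kv_out)
             + ffn_params(hidden, ffn_width))
    embeddings = vocab * hidden + positions * hidden
    # final layer norm; the LM head is assumed tied to the token embedding
    return embeddings + layers * block + layer_norm


baseline = gpt_neo_params(10_000, 512, 512, 2, q_out=512, kv_out=512, ffn_width=1024)
custom = gpt_neo_params(10_000, 512, 512, 2, q_out=64, kv_out=16, ffn_width=1968)

print(f'{baseline:,} vs {custom:,} (difference {custom - baseline:,})')
```

Both totals agree with the `num_parameters()` output above, so the remaining gap of 1,888 parameters comes from the extra FFN bias terms (944 per layer).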
    


#### Pre-Processing
All experiments use the same tokenizer, so in theory we only need to preprocess our data once. This means you can subsequently skip this CPU-intensive task on the DelftBlue GPU nodes (unless you are experimenting with different context lengths; in that case you will need to re-tokenise to recover potentially truncated inputs, see below).

When running experiments on DelftBlue, it may be fastest to load (parts of) the dataset into memory, as the cluster has pretty slow IO. See `datasets`, specifically [`load_dataset`'s `keep_in_memory` argument](https://huggingface.co/docs/datasets/en/package_reference/loading_methods#datasets.load_dataset). 

In [14]:
import tqdm 
from datasets import load_dataset, DatasetDict

debug_mode = True

dataset = load_dataset('roneneldan/tinystories', num_proc=16)
print(len(dataset['train']['text']))
small_dataset = DatasetDict({
    'train': dataset['train'].select(range(1000)),
    'validation': dataset['validation'].select(range(100))
})

if debug_mode:
    dataset = small_dataset

n_words = 0 
for story in tqdm.tqdm(dataset['train']['text']):    # tqdm is used for progress bars around iterables
    n_words += len(story.split())

print(f'''
    Train dataset has {len(dataset['train']['text']):,} samples, 
    totalling {n_words:,} words (avg. {n_words/len(dataset['train']['text']):,.1f} per story).
    ''')

Repo card metadata block was not found. Setting CardData to empty.


2119719


100%|██████████| 1000/1000 [00:00<00:00, 60565.82it/s]


    Train dataset has 1,000 samples, 
    totalling 183,801 words (avg. 183.8 per story).
    





#### Tokenisation
Two things happen during tokenisation, of which the latter is optional:

1. Map (sub-)words to integers $\in [0, 9999]$ (the vocabulary has 10,000 entries).
2. Truncate/pad each input as necessary to fit within `max_position_embeddings` (the context window). 

In other words, the tokenizer maps frequent character strings (i.e. tokens) to integers, and the model then uses each integer as an index to look up the corresponding embedding vector.

The intuition is that assigning each token its own embedding (a vector of `1 x hidden_size`) helps the transformer learn a semantic understanding of these tokens. E.g. given the two tokens 'about' and 'around', the transformer should learn similar embedding vectors for both.
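
To make the two steps and the embedding lookup concrete, here is a toy sketch in plain Python. The vocabulary, `encode` helper, and sizes are all hypothetical. Whitespace splitting stands in for the real sub-word tokenizer, and random vectors stand in for learned embeddings.

```python
import random

# Hypothetical toy vocabulary; the real tokenizer has 10,000 entries
vocab = {'<pad>': 0, '<s>': 1, '</s>': 2, 'once': 3, 'upon': 4, 'a': 5, 'time': 6}
hidden_size = 4          # 512 in the notebook
max_length = 8           # stands in for max_position_embeddings


def encode(text, max_length):
    """Step 1: map tokens to ids. Step 2: truncate, then pad to max_length."""
    ids = [vocab['<s>']] + [vocab[w] for w in text.lower().split()] + [vocab['</s>']]
    ids = ids[:max_length]
    return ids + [vocab['<pad>']] * (max_length - len(ids))


# One vector of length hidden_size per vocabulary entry (learned during training)
random.seed(0)
embedding = [[random.gauss(0, 1) for _ in range(hidden_size)] for _ in vocab]

input_ids = encode('Once upon a time', max_length)
vectors = [embedding[i] for i in input_ids]   # embedding lookup is plain indexing

print(input_ids)                      # [1, 3, 4, 5, 6, 2, 0, 0]
print(len(vectors), len(vectors[0]))  # 8 4
```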

In [12]:
from transformers import PreTrainedTokenizer, PretrainedConfig

def get_tokenizer_for_config(Tok: PreTrainedTokenizer, config: PretrainedConfig):

    tokenizer = Tok.from_pretrained(
        '10k-tok',                                         # our custom tokenizer
        model_max_length=config.max_position_embeddings    # sequence length (context window)
    )

    # we're using our special tokenizer with only 10'000 tokens instead of GPT-2's 50'257
    assert tokenizer.vocab_size == config.vocab_size

    print(f'padding token is {tokenizer.pad_token}')
    print(f'padding token in config: {config.pad_token_id}, in tokeniser: {tokenizer.pad_token_id}')

    return tokenizer

tok_gpt = get_tokenizer_for_config(GPT2TokenizerFast, config_gpt)
tok_rob = get_tokenizer_for_config(RobertaTokenizerFast, config_rob)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RobertaTokenizer'. 
The class this function is called from is 'GPT2TokenizerFast'.


padding token is <pad>
padding token in config: 0, in tokeniser: 0
padding token is <pad>
padding token in config: 0, in tokeniser: 0


In [6]:
tokenized_gpt = dataset.map(
    lambda x: tok_gpt(x['text'], truncation=True, padding='max_length'),
    batched=True, num_proc=64, batch_size=1_000)                 # change num_proc to 1 if multithread issues

tokenized_gpt.save_to_disk('./tokenized_dataset', num_proc=5)


In [13]:
from datasets import load_from_disk 
tokenized_dataset = load_from_disk('./tokenized_dataset')

train_dataset = tokenized_dataset['train']
eval_dataset  = tokenized_dataset['validation']

assert len(tokenized_dataset['train'][0]['input_ids']) == config_gpt.max_position_embeddings
tokenized_dataset['train'][0]['input_ids'][-10:]
# should be pad tokens (0), given that most short stories are <512 tokens

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

#### Training
Before we get started, you may want to specify which GPU to use. See the first cell in this notebook; make sure to run it before anything else. 

Hugging Face provides some powerful (and often confusingly long) APIs for model training. `TrainingArguments` specifies our hyperparameters, which are used by the `Trainer` along with the remaining objects (like `model`, `tokenizer`, and `train_dataset`). Specifically:

- `learning_rate` and `num_train_epochs` determine how much the model learns. A higher rate is faster, but more unstable. More epochs (entire passes over the dataset) yield incrementally better results, at the cost of more training time. 
- Batch sizes determine how many samples the model sees in *parallel*. Given `gradient_accumulation_steps=1` and `batch_size=8`, the model will backpropagate the average loss of 8 samples; if `batch_size=1`, it will average the loss over `gradient_accumulation_steps` samples. When comparing models, make sure the backpropagated loss is averaged over the same number of samples (`batch_size * gradient_accumulation_steps`). 

- `data_collator` batches (and optionally, pads) the input for the model. We have already padded in our `tokenized_dataset`, and leaving this argument empty will automatically batch the inputs. So why do we need it? 

    Glad you asked. This has to do with how the loss is computed in causal language modelling. In our case, we try to predict $p(y | x)$, where $x$ is an input sequence of tokens, and $y$ is the next token following that sequence. Our model, unaware of the target token $y$, outputs $\hat y$. 
    
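    For `Trainer` to compute the (cross-entropy) loss, we need to provide it with both $y$ and $\hat y$. The `DataCollatorForLanguageModeling` takes care of this: with `mlm=False` it adds a `labels` field that is a copy of the input tokens (with padding positions marked to be ignored), and the model shifts these labels by one position internally so that each position is scored against the next token $y$.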

    The loss is the backbone of backpropagation, which we need to actually improve our model. If this is confusing, please re-watch Karpathy's GPT tutorial. 
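
As a rough illustration of how labels are built for causal language modelling, the sketch below mimics the two steps in plain Python: copy the inputs as labels while masking padding, then shift by one position. The function names and token ids are hypothetical. In practice the copying happens in the collator and the shift inside the model.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default


def make_labels(input_ids, pad_token_id):
    """Copy the inputs as labels, masking padding so it does not contribute to the loss."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


def next_token_pairs(input_ids, labels):
    """Shift by one: the prediction at position t is scored against the label at t + 1."""
    return [(ctx, tgt) for ctx, tgt in zip(input_ids[:-1], labels[1:])
            if tgt != IGNORE_INDEX]


input_ids = [1, 17, 42, 8, 2, 0, 0, 0]      # hypothetical ids; 0 is the pad token
labels = make_labels(input_ids, pad_token_id=0)

print(labels)                               # [1, 17, 42, 8, 2, -100, -100, -100]
print(next_token_pairs(input_ids, labels))  # [(1, 17), (17, 42), (42, 8), (8, 2)]
```

Each pair is (last token of the context, target token); the padded tail produces no pairs, so it never influences the gradient.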

If you prefer to write the training loop yourself, check out HF's `run_clm_no_trainer.py` script (or `run_mlm_no_trainer.py` for RoBERTa-style masked-language modelling, as opposed to causal language modelling). This can be useful to give you better control over which devices are used for training. 

In [14]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

def get_hyperparameters(model, dataset, num_key_value_heads=None, kqv_factor=None):
    ''' common hyperparameters to give to the trainer '''

    # TRAINING HYPERPARAMETERS 
    batch_size = 4                  # TinyStories uses 80, but I am training locally on my poor M1 Air
    num_train_epochs = 1             # TinyStories doesn't mention
    gradient_accumulation_steps = 16 # TinyStories uses 16

    lr = 5e-4                        # TinyStories uses 5e-4, higher values better for small models

    if isinstance(model, CustomGPTNeoForCausalLM):
        first_part_id = 'CUSTOM'
    elif isinstance(model, GPTNeoForCausalLM):
        first_part_id = 'GPT'
    else:
        first_part_id = 'BERT'

    if first_part_id == 'CUSTOM':
        if num_key_value_heads is not None:
            first_part_id += '-GQA-' + str(query_groups_factor) + 'KV'
        elif kqv_factor is not None:
            first_part_id += '-KQV-' + str(kqv_factor) + 'F'

    # future you will thank you for descriptive model names
    # TODO: customise this name such that every model you train has a unique identifier!
    config      = model.config 
    model_name  = '-'.join([
        first_part_id,
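        # NOTE: `//` floors to whole millions, so ~9.6M parameters is printed as '9.0M'; later cells rely on these names
        f'{model.num_parameters()//1e6:.1f}M',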
        f'{config.num_layers if isinstance(model, GPTNeoForCausalLM) else config.num_hidden_layers}L', 
        f'{config.num_heads if isinstance(model, GPTNeoForCausalLM) else config.num_attention_heads}H', 
        f'{config.hidden_size}C',
        f'{config.intermediate_size}I'
    ])

    _train_steps = len(dataset) // (batch_size * gradient_accumulation_steps)
    eval_steps = _train_steps // 10 # evaluate every 10% of training steps

    return dict(
        model_name = model_name,
        batch_size = batch_size, 
        num_train_epochs = num_train_epochs,
        gradient_accumulation_steps = gradient_accumulation_steps,
        lr = lr,
        eval_steps = eval_steps
    )

params_gpt = get_hyperparameters(gpt, train_dataset)
params_rob = get_hyperparameters(rob, train_dataset)
params_gqa = get_hyperparameters(gqa_model_gpt, train_dataset, num_key_value_heads=num_key_value_heads, kqv_factor=kqv_factor)

In [15]:
def get_trainer(
        model, tokenizer, train_dataset, eval_dataset, output_dir,
        model_name, batch_size, num_train_epochs, gradient_accumulation_steps, lr, eval_steps):
    ''' more general training arguments you likely want to keep fixed'''

    training_args = TrainingArguments(

        seed       = 42,
        use_cpu    = False, # use GPU if available (not necessarily faster on laptops, but Apple's MPS have good support)

        output_dir = os.path.join(output_dir, model_name),

        # NOTE: training params
        learning_rate    = lr,
        num_train_epochs = num_train_epochs,
        # Use a smaller batch size to fit into GPU RAM. 
        per_device_train_batch_size = batch_size,
        per_device_eval_batch_size  = batch_size,
        # You should aim to have the same amount of samples per acc step, in all of your experiments!
        # so, if you increase batch_size, decrease gradient_accumulation_steps by the same factor.
        gradient_accumulation_steps = gradient_accumulation_steps,

        # NOTE: Evaluation params
        # wandb is great for tracking experiments, it will even (try to) save your code nowadays
        evaluation_strategy = 'steps',
        eval_steps = eval_steps,
        save_steps = eval_steps,

        logging_first_step=True,
        logging_steps=100,
        report_to  = 'wandb',
    )

    trainer = Trainer(
        model = model, 
        args = training_args, 
        train_dataset = train_dataset, 
        eval_dataset = eval_dataset,
        data_collator = DataCollatorForLanguageModeling(
            tokenizer, mlm=isinstance(model, RobertaForMaskedLM)),
    )

    # print amount of training steps, and how often the model is evaluated
    print(f'''
    Retrieving Trainer for \033[1m{model_name}\033[0m
        training for {num_train_epochs} epochs, {len(train_dataset)} samples
        {batch_size} batch size, {gradient_accumulation_steps} accumulation steps
        gives {len(train_dataset)//(batch_size * gradient_accumulation_steps)} training steps.
        Evaluating every {eval_steps} steps, {len(eval_dataset)} samples 
        ''')

    return trainer

In [16]:
out_dir = './results/models_baseline/' 

trainer_gpt = get_trainer(gpt, tok_gpt, train_dataset, eval_dataset, out_dir, **params_gpt)
trainer_rob = get_trainer(rob, tok_rob, train_dataset, eval_dataset, out_dir, **params_rob)

trainer_gqa_gpt = get_trainer(gqa_model_gpt, tok_gpt, train_dataset, eval_dataset, out_dir, **params_gqa)


    Retrieving Trainer for GPT-9.0M-2L-4H-512C-1024I
        training for 1 epochs, 1000 samples
        4 batch size, 16 accumulation steps
        gives 15 training steps.
        Evaluating every 1 steps, 100 samples 
        

    Retrieving Trainer for BERT-9.0M-2L-4H-512C-1024I
        training for 1 epochs, 1000 samples
        4 batch size, 16 accumulation steps
        gives 15 training steps.
        Evaluating every 1 steps, 100 samples 
        

    Retrieving Trainer for CUSTOM-GQA-0.25KV-9.0M-2L-4H-512C-1968I
        training for 1 epochs, 1000 samples
        4 batch size, 16 accumulation steps
        gives 15 training steps.
        Evaluating every 1 steps, 100 samples 
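
The step counts printed above follow directly from the batch settings. Here is a small sketch of that arithmetic; the `training_schedule` helper and its `num_devices` parameter are hypothetical and not part of the notebook.

```python
def training_schedule(num_samples, batch_size, gradient_accumulation_steps, num_devices=1):
    """Samples per optimizer update, updates per epoch, and the evaluation interval."""
    effective_batch = batch_size * gradient_accumulation_steps * num_devices
    train_steps = num_samples // effective_batch
    eval_steps = train_steps // 10       # evaluate every 10% of training steps
    return effective_batch, train_steps, eval_steps


print(training_schedule(1000, batch_size=4, gradient_accumulation_steps=16))   # (64, 15, 1)
print(training_schedule(1000, batch_size=16, gradient_accumulation_steps=4))   # (64, 15, 1)
```

Both settings average the loss over 64 samples per update, so they are comparable. Note that with fewer than 10 training steps the evaluation interval would floor to 0, so a guard such as `max(1, ...)` may be worth adding for very small debug runs.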
        


Finally, we can train. 

This configuration takes ≤24hr to pre-train on my M1 MacBook Air with 16GB RAM. Python takes ≤4GB VRAM at `batch_size=16` and ≤11GB at `batch_size=64`, though both take the same amount of time to train, likely because this processor is not designed to constantly move that much data in and out of RAM. And tbh, the GPU be lacking. If you decide to go the local-training route, consider [chai](https://github.com/lvillani/chai) to keep your (Apple) laptop awake; there's probably a Windows/Linux equivalent too. 

In [17]:
def do_train(trainer: Trainer, name: str, out_dir: str):

    wandb.init(project='kqv-gqa-gpt-neo', name=name, group='baseline', config=trainer.args)
    # wandb.init(project='tinystories', name=name, config=trainer.args)
    trainer.train()
    trainer.save_model(os.path.join(out_dir, name))
    wandb.finish()

In [18]:
# characters vs. tokens (len() of a string counts characters, not words)
len(train_dataset['text'][11]), len(train_dataset[11]['input_ids'])
print(params_gqa['model_name'])
trainer_gqa_gpt.model

CUSTOM-GQA-0.25KV-9.0M-2L-4H-512C-1968I


CustomGPTNeoForCausalLM(
  (transformer): GPTNeoModel(
    (wte): Embedding(10000, 512)
    (wpe): Embedding(512, 512)
    (drop): Dropout(p=0.0, inplace=False)
    (h): ModuleList(
      (0-1): 2 x GPTNeoBlock(
        (ln_1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (attn): GPTNeoAttention(
          (attention): GPTNeoGQASelfAttention(
            (attn_dropout): Dropout(p=0.0, inplace=False)
            (resid_dropout): Dropout(p=0.0, inplace=False)
            (k_proj): Linear(in_features=512, out_features=16, bias=False)
            (v_proj): Linear(in_features=512, out_features=16, bias=False)
            (q_proj): Linear(in_features=512, out_features=64, bias=False)
            (out_proj): Linear(in_features=64, out_features=512, bias=True)
          )
        )
        (ln_2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (mlp): GPTNeoMLP(
          (c_fc): Linear(in_features=512, out_features=1968, bias=True)
          (c_proj): Linear(in_

In [19]:
do_train(trainer_gqa_gpt, params_gqa['model_name'], out_dir)  # do_train already calls wandb.finish()

Run history:
eval/loss                █▆▅▅▄▃▃▂▂▂▁▁▁▁▁
eval/runtime             ▁▁▃▆▄▅▅▄▄▄▄▄▇██
eval/samples_per_second  ██▆▂▄▄▄▄▄▄▄▄▂▁▁
eval/steps_per_second    ██▆▂▄▄▄▄▄▄▄▄▂▁▁
train/epoch              ▁▁▂▂▃▃▄▄▅▅▅▆▇▇███
train/global_step        ▁▁▁▂▃▃▃▄▅▅▅▆▇▇▇██
train/grad_norm          ▁
train/learning_rate      ▁
train/loss               ▁

Run summary:
eval/loss                6.55984
eval/runtime             21.4372
eval/samples_per_second  4.665
eval/steps_per_second    1.166
total_flos               12402252840960.0
train/epoch              0.96
train/global_step        15.0
train/grad_norm          3.9748
train/learning_rate      0.00047
train/loss               9.2891


In [20]:
import torch
model_name = 'CUSTOM-GQA-0.25KV-9.0M-2L-4H-512C-1968I'
trained_model = trainer_gqa_gpt.model
torch.save(trained_model.state_dict(), f'results/models_baseline/{model_name}/model_state.pt')

#### Using your Pre-Trained Model 
Try out your own model on some prompts!

In [21]:
# I made a small function for pretty printing into paragraphs with line wrapping for readability
import textwrap 
w = textwrap.TextWrapper(replace_whitespace=False, break_long_words=False, width=60, initial_indent='   ', subsequent_indent='  ')
def see(text): print('\n\033[3m' + '\n\n'.join(['\n'.join(w.wrap(line))
                 for line in text.splitlines() if line.strip() != '']) + '\033[0m\n')

In [22]:
from transformers import AutoTokenizer, GPTNeoConfig
import torch

# model_name = 'GPT-9.0M-2L-4H-516C-1024I'
# model_name = 'BERT-9.0M-2L-4H-516C-1024I' # bert won't work for generation unless you fine-tune it for that task

tokenizer = AutoTokenizer.from_pretrained('10k-tok')
tokenizer.pad_token_id = tokenizer.eos_token_id
model_name_1 = 'CUSTOM-GQA-0.25KV-9.0M-2L-4H-512C-1968I'
model_name_2 = 'CUSTOM-KQV-2F-9.0M-2L-4H-512C-1536I'

config_1 = GPTNeoConfig.from_pretrained(f'results/models_baseline/{model_name_1}')
config_2 = GPTNeoConfig.from_pretrained(f'results/models_baseline/{model_name_2}')

# Initialize your custom model
custom_model_1 = CustomGPTNeoForCausalLM(config_gqa=config_1)
custom_model_1.set_kqv_size(embed_size // 8)
custom_model_1.set_attention(num_key_value_heads=num_key_value_heads)
#
custom_model_2 = CustomGPTNeoForCausalLM(config_gqa=config_2)
custom_model_2.set_kqv_size(embed_size // 2)
custom_model_2.set_attention(num_key_value_heads=attn_heads)
#
trained_state_dict_1 = torch.load(f'results/models_baseline/{model_name_1}/model_state.pt')
trained_state_dict_2 = torch.load(f'results/models_baseline/{model_name_2}/model_state.pt')
custom_model_1.load_state_dict(trained_state_dict_1)
custom_model_2.load_state_dict(trained_state_dict_2)

<All keys matched successfully>

#### Inference 
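Let's generate a short story like the ones the model has been trained on! You'll notice that the prompt is surrounded by `<s>`, the begin-of-sequence (bos) token, and `</s>`, the end-of-sequence (eos) / separator (sep) token. This comes from the RoBERTa-style tokenisation, which makes it clear to the model where (one of several) input sequences ends. Note that a prompt ending in `</s>` asks the model to continue *after* an end-of-sequence token; together with the very short debug-mode training run, this may explain the degenerate generations in this section.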
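
One possible way around the trailing `</s>` is to drop it from the encoded prompt before generating. This is an untested assumption about what helps here, not something verified in this notebook. The sketch below shows the idea on a plain list of ids; `strip_trailing_eos` is a hypothetical helper, and the ids are the prompt tokens as they appear at the start of the generated tensor printed at the end of this notebook.

```python
BOS_ID, EOS_ID = 1, 2   # ids of <s> and </s>, read off the output tensor in this notebook


def strip_trailing_eos(ids, eos_id=EOS_ID):
    """Drop a final end-of-sequence token so generation continues the prompt."""
    return ids[:-1] if ids and ids[-1] == eos_id else ids


# Prompt ids as they appear at the start of the generated tensor
encoded = [1, 7454, 2402, 257, 640, 11, 612, 373, 257, 9015, 351, 2]
print(strip_trailing_eos(encoded))
# [1, 7454, 2402, 257, 640, 11, 612, 373, 257, 9015, 351]
```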

In [23]:
import time
from memory_profiler import memory_usage

prompt = 'Once upon a time, there was a chicken with'
input_ids = tokenizer.encode(prompt, return_tensors='pt')

memory_usage_start = memory_usage(max_usage=True)

start_time = time.time()
output = custom_model_1.generate(input_ids, max_length=50, num_beams=1)
print("The output length is: ", len(output[0]))
end_time = time.time()

memory_usage_end = memory_usage(max_usage=True)

output_text = tokenizer.decode(output[0])

# textwrap with indentation on every new paragraph
see(output_text)
inference_time = end_time - start_time

print("Inference Time:", inference_time, "seconds")
print("Peak Memory Usage:", memory_usage_end, "MiB")

The output length is:  50

   <s>Once upon a time, there was a chicken with</s>.

Inference Time: 0.5596239566802979 seconds
Peak Memory Usage: 406.5703125 MiB


In [24]:
import time
from memory_profiler import memory_usage

# Benchmark helper: average wall-clock time and process memory over several generations
def avg_time_and_memory_consumption(model: CustomGPTNeoForCausalLM, num_iterations: int = 10):
    prompt = 'Once upon a time, there was a chicken with'
    input_ids = tokenizer.encode(prompt, return_tensors='pt')

    inference_times = []        # seconds per generate() call
    memory_usage_values = []    # MiB, sampled right after each call

    for _ in range(num_iterations):
        # perf_counter is a monotonic clock intended for measuring short durations
        start_time = time.perf_counter()
        model.generate(input_ids, max_length=300, num_beams=1)
        inference_times.append(time.perf_counter() - start_time)

        # NOTE: this samples the process memory *after* generation has finished,
        # so it is a snapshot rather than the true peak during generate()
        memory_usage_values.append(memory_usage(max_usage=True))

    average_inference_time = sum(inference_times) / num_iterations
    average_memory_usage = sum(memory_usage_values) / num_iterations
    return average_inference_time, average_memory_usage

In [26]:
time1, memory1 = avg_time_and_memory_consumption(custom_model_1)
time2, memory2 = avg_time_and_memory_consumption(custom_model_2)
print("Time taken difference is: ", str(time2 - time1))
print("Memory consumption difference is: ", str(memory2 - memory1))

print("Trying in reverse order:")
time2, memory2 = avg_time_and_memory_consumption(custom_model_2)
time1, memory1 = avg_time_and_memory_consumption(custom_model_1)
print("Time taken difference is: ", str(time2 - time1))
print("Memory consumption difference is: ", str(memory2 - memory1))

print("And again in the normal order:")
time1, memory1 = avg_time_and_memory_consumption(custom_model_1)
time2, memory2 = avg_time_and_memory_consumption(custom_model_2)
print("Time taken difference is: ", str(time2 - time1))
print("Memory consumption difference is: ", str(memory2 - memory1))

print("And the last time in reverse:")
time2, memory2 = avg_time_and_memory_consumption(custom_model_2)
time1, memory1 = avg_time_and_memory_consumption(custom_model_1)
print("Time taken difference is: ", str(time2 - time1))
print("Memory consumption difference is: ", str(memory2 - memory1))

Time taken difference is:  0.07875437736511248
Memory consumption difference is:  0.8667968749999773
Trying in reverse order:
Time taken difference is:  0.19388210773468018
Memory consumption difference is:  0.6003906249999886
And again in the normal order:
Time taken difference is:  0.20630371570587158
Memory consumption difference is:  0.6671875000000114
And the last time in reverse:
Time taken difference is:  0.18756856918334974
Memory consumption difference is:  0.5695312500000114
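
The cell above repeats the comparison in alternating order to check that the result does not depend on which model runs first. That pattern can be written more compactly. The following is a minimal sketch with hypothetical helpers (`time_call`, `compare`) and toy workloads standing in for `model.generate(...)`.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Return the wall-clock duration of each call to fn, in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def compare(fn_a, fn_b, rounds=4, repeats=5):
    """Alternate which function runs first, then summarise the difference b - a."""
    diffs = []
    for r in range(rounds):
        first, second = (fn_a, fn_b) if r % 2 == 0 else (fn_b, fn_a)
        t_first = statistics.mean(time_call(first, repeats))
        t_second = statistics.mean(time_call(second, repeats))
        t_a, t_b = (t_first, t_second) if r % 2 == 0 else (t_second, t_first)
        diffs.append(t_b - t_a)
    return statistics.mean(diffs), statistics.stdev(diffs)


# Hypothetical stand-ins for `lambda: model.generate(...)`
fast = lambda: sum(range(10_000))
slow = lambda: sum(range(200_000))

mean_diff, spread = compare(fast, slow)
print(f'slow - fast = {mean_diff:.6f}s (stdev {spread:.6f}s)')
```

Reporting the spread alongside the mean makes it easier to judge whether a difference such as the ones printed above is larger than the run-to-run noise.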


In [92]:
prompt = 'Once upon a time, there was a chicken with'
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# assumed to be custom_model_1; `custom_model` is not defined in the cells shown
output = custom_model_1.generate(
    input_ids,                              # input to the model
    max_length=300,                         # maximum generation length
    eos_token_id=tokenizer.eos_token_id,    # early stopping when eos_token is output

    # num_beams=1,                            # number of beams to use in generation
    temperature=1,                          # only has an effect when do_sample=True
)
output_text = tokenizer.decode(output[0])

# textwrap with indentation on every new paragraph
see(output_text)
output


   <s>Once upon a time, there was a chicken with</s>



tensor([[   1, 7454, 2402,  257,  640,   11,  612,  373,  257, 9015,  351,    2,
          198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,
          198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,
          198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,
          198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,
          198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,
          198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,
          198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,
          198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,
          198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,
          198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,
          198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,  198,
          198,  198,  198,  

Sounds like me after a few beers too many, but at least the grammar is (mostly) correct. The model also learns some basic reasoning-like associations like being 'so high' allows you to see 'the whole world'. 