The goal of this notebook is to get a sense of how the Chinchilla scaling laws apply to my model development. I suspect this model is too small for the laws to be fully relevant, but I'll still use them as a rough guideline.

In [1]:
import spacy

tokenizer = spacy.load("en_core_web_sm")

files = [
    "../data/_part1.txt",
    "../data/_part2.txt",
    "../data/_part3.txt",
    "../data/_part4.txt",
    "../data/_part5.txt",
    "../data/_part6.txt",
    "../data/_part7.txt",
]

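# Read each part of the corpus into memory.
texts = []
for file_name in files:
    with open(file_name, 'r', encoding='utf-8') as file:
        texts.append(file.read())

# Seed the list with the special tokens so they end up in the vocabulary.
# Note: these two entries are also included in the total printed below.
all_tokens = ['<PAD>', '<UNK>']

# Tokenize each part with spaCy and collect the raw token strings.
for text in texts:
    doc = tokenizer(text)
    all_tokens.extend(token.text for token in doc)

print(f"Total number of tokens in training + validation data: {len(all_tokens)}")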

Total number of tokens in training + validation data: 1383670


Chinchilla is a 70B-parameter model trained on 1.4T tokens, which works out to 1.4T / 70B = 20 training tokens per parameter.
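As a quick check of that arithmetic, here is a minimal sketch of the 20:1 rule of thumb in both directions. The function names and the `TOKENS_PER_PARAM` constant are my own illustrative choices, not anything from the Chinchilla paper, and the ratio is only the rough figure derived above.

```python
# Rule-of-thumb ratio implied by Chinchilla: 1.4T tokens / 70B parameters.
# (Illustrative constant; the name is an assumption of this sketch.)
TOKENS_PER_PARAM = 20


def optimal_params_for_tokens(num_tokens: int, ratio: int = TOKENS_PER_PARAM) -> int:
    """Approximate parameter count suggested by the rule for a fixed token budget."""
    return num_tokens // ratio


def optimal_tokens_for_params(num_params: int, ratio: int = TOKENS_PER_PARAM) -> int:
    """Approximate token count suggested by the rule for a fixed model size."""
    return num_params * ratio


# Sanity check against Chinchilla's own figures.
chinchilla_params = 70_000_000_000
chinchilla_tokens = 1_400_000_000_000
print(optimal_params_for_tokens(chinchilla_tokens) == chinchilla_params)
print(optimal_tokens_for_params(chinchilla_params) == chinchilla_tokens)
```

The second function is handy for the reverse question: given a model I've already built, how much data would the rule want me to have?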

In [4]:
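num_tokens = len(all_tokens)
# Chinchilla rule of thumb: about 20 training tokens per parameter.
num_parameters = num_tokens // 20
# Vocabulary size, used below as the model's output dictionary size.
num_unique_tokens = len(set(all_tokens))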

print(f"For a dataset with {num_tokens} tokens to be optimal in training, the model should have about {num_parameters} parameters")

For a dataset with 1383670 tokens to be optimal in training, the model should have about 69183 parameters


Let's check how many parameters I've actually been training with.

In [10]:
import sys
sys.path.append("../decoder-transformer")

In [15]:
from model import TransformerNetwork

def count_parameters(model):
    """Return (total, trainable) parameter counts for a model."""
    params = list(model.parameters())
    total_params = sum(p.numel() for p in params)
    trainable_params = sum(p.numel() for p in params if p.requires_grad)
    return total_params, trainable_params

# The three size presets I've been training with.
presets = [
    dict(name="small", context_len=32, num_layers=6, model_dim=256, att_heads=8, ff_hidden_dim=1024),
    dict(name="tiny", context_len=16, num_layers=2, model_dim=128, att_heads=4, ff_hidden_dim=256),
    dict(name="micro", context_len=8, num_layers=1, model_dim=32, att_heads=4, ff_hidden_dim=64),
]

# Instantiate each preset and report its parameter count.
for preset in presets:
    model = TransformerNetwork(output_dict_size=num_unique_tokens, **preset)
    total, trainable = count_parameters(model)
    print(f"{model.name} has {total} total parameters, of which {trainable} are trainable.")

4m0s - DEBUG - Initializing model...
4m1s - DEBUG - Initializing model...


small has 281494493 total parameters, of which 281494493 are trainable.


4m1s - DEBUG - Initializing model...


tiny has 69447645 total parameters, of which 69447645 are trainable.
micro has 8684925 total parameters, of which 8684925 are trainable.


I did not expect even the smallest preset to have far too many parameters (micro has about 8.7M, over 125 times the ~69k target), but I guess the complete works of Shakespeare ARE a relatively small dataset.
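To put rough numbers on that mismatch, the sketch below reuses the figures printed above and asks, for each preset, how many tokens the 20:1 rule would want and how oversized the model is for this dataset. The `oversize_factor` helper and the constant names are hypothetical, introduced only for this illustration.

```python
# Figures copied from the cell outputs above.
DATASET_TOKENS = 1_383_670
PRESET_PARAMS = {
    "small": 281_494_493,
    "tiny": 69_447_645,
    "micro": 8_684_925,
}
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb (1.4T / 70B)


def oversize_factor(num_params, num_tokens, ratio=TOKENS_PER_PARAM):
    """How many times larger the model is than the rule-of-thumb size for the data."""
    return num_params / (num_tokens / ratio)


for name, params in PRESET_PARAMS.items():
    tokens_needed = params * TOKENS_PER_PARAM
    factor = oversize_factor(params, DATASET_TOKENS)
    print(f"{name}: would want ~{tokens_needed:,} tokens, "
          f"about {factor:,.0f}x the model size this dataset supports")
```

This only restates the rule of thumb; it says nothing about where the parameters sit inside the model, which would need a per-layer breakdown to answer.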