milliGPT

This is a minimal implementation of a GPT model with some advanced features, such as temperature, top-k, and top-p sampling, and a KV cache.

Install requirements

pip install torch numpy transformers datasets tiktoken wandb tqdm

Dependencies:

  • pytorch (PyTorch >= 2.0 is recommended)
  • numpy
  • transformers for huggingface transformers (to load GPT-2 checkpoints)
  • datasets for huggingface datasets (if you want to download + preprocess OpenWebText)
  • tiktoken for OpenAI's fast BPE code (a quick usage example follows this list)
  • wandb for optional logging
  • tqdm for progress bars
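
For instance, tiktoken is presumably what the BPE-tokenized dataset preparation (such as OpenWebText) relies on; a quick, self-contained illustration of its API:

import tiktoken

enc = tiktoken.get_encoding("gpt2")     # load the GPT-2 BPE vocabulary
ids = enc.encode("Hello, milliGPT!")    # text -> list of token ids
print(ids)
print(enc.decode(ids))                  # token ids -> text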

Training a minimal GPT from scratch

The first step is to select a text corpus to train your model on. This repo provides a default set-up script that downloads a public text dataset (the works of Shakespeare). Run the preparation script to convert the raw text into token IDs that the model can interpret:

$ python data/shakespeare_char/prepare.py
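
Conceptually, a character-level prepare script does something like the sketch below (a rough approximation; the actual data/shakespeare_char/prepare.py may differ in details, for example it may also save vocabulary metadata needed for decoding):

import numpy as np

# read the raw text
with open("data/shakespeare_char/input.txt", "r", encoding="utf-8") as f:
    text = f.read()

# character-level vocabulary: every distinct character gets an integer id
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}

# encode the whole corpus and split 90/10 into train and validation
# (uint16 is plenty for a small character vocabulary)
ids = np.array([stoi[ch] for ch in text], dtype=np.uint16)
n = int(0.9 * len(ids))
ids[:n].tofile("data/shakespeare_char/train.bin")
ids[n:].tofile("data/shakespeare_char/val.bin")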

Want to use another dataset and train a milliGPT model on your favorite book? Check out an open online e-book library, select a book, and pick the Plain Text UTF-8 option to download it in text format (.txt). Or just use any text file you can find. Rename it to input.txt, copy it into data/shakespeare_char, and re-run the above command.

This creates a train.bin and a val.bin in that data directory. Now it is time to train your GPT. Its size very much depends on the computational resources of your system:

I have a GPU. Great, we can quickly train a milliGPT with the settings provided in the config/train_shakespeare_char.py config file:

$ python train.py config/train_shakespeare_char.py
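
As a rough sketch, a config file like this simply overrides a handful of training globals. The values for block_size, n_layer, n_head, n_embd, batch_size, and eval_iters below are taken from the descriptions in this README; the remaining values are purely illustrative, so check the actual file in the repo:

# hypothetical sketch of config/train_shakespeare_char.py
out_dir = "out-shakespeare-char"  # where checkpoints are written
eval_iters = 200                  # batches used per loss estimate
batch_size = 64                   # sequences per iteration
block_size = 256                  # context length in characters
n_layer = 6                       # Transformer layers
n_head = 6                        # attention heads per layer
n_embd = 384                      # feature channels
dropout = 0.2                     # illustrative only
learning_rate = 1e-3              # illustrative only
max_iters = 5000                  # illustrative only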

If you peek inside it, you'll see that we're training a GPT with a context size of up to 256 characters, 384 feature channels, and 6 Transformer layers with 6 attention heads each. On one A100 GPU this training run takes about 3 minutes, and the best validation loss is 1.4697. Based on the configuration, the model checkpoints are written into the --out_dir directory out-shakespeare-char. Once training finishes, we can sample from the best model by pointing the sampling script at this directory:

$ python sample.py --out_dir=out-shakespeare-char

This generates a few samples, for example:

ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.

DUKE VINCENTIO:
I thank your eyes against it.

DUKE VINCENTIO:
Then will answer him to save the malm:
And what have you tyrannous shall do this?

DUKE VINCENTIO:
If you have done evils of all disposition
To end his power, the day of thrust for a common men
That I leave, to fight with over-liking
Hasting in a roseman.

Not bad for a character-level model after 3 minutes of training on a GPU. Better results are quite likely obtainable by instead finetuning a pretrained GPT-2 model on this dataset.

Training on a CPU machine

If you are running on a CPU instead of a GPU, you must set both --device=cpu and turn off PyTorch 2.0 compilation with --compile=False. Evaluation then gives a noisier but faster estimate (--eval_iters=20, down from 200), the context size is only 64 characters instead of 256, and the batch size is only 12 examples per iteration rather than 64. We also use a much smaller Transformer (4 layers, 4 heads, 128-dimensional embeddings) and decrease the number of iterations to 2000 (and correspondingly decay the learning rate over roughly max_iters iterations with --lr_decay_iters). Because the network is so small, we also ease off on regularization (--dropout=0.0). This still runs in about 10 minutes on a CPU but only gets us a loss of around 1.48, and therefore worse samples; still, it's good fun.
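
Putting those overrides together, the CPU training command looks roughly like this (a sketch; the flag names are assumed to mirror the config variables, so double-check them against train.py):

$ python train.py config/train_shakespeare_char.py --device=cpu --compile=False --eval_iters=20 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0

Once that finishes, sample from the resulting checkpoint, also on the CPU: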

$ python sample.py --out_dir=out-shakespeare-char --device=cpu

This generates samples like:

GLEORKEN VINGHARD III:
Whell's the couse, the came light gacks,
And the for mought you in Aut fries the not high shee
bot thou the sought bechive in that to doth groan you,
No relving thee post mose the wear

If you're willing to wait longer, feel free to tune the hyperparameters: increase the size of the network, the context length (--block_size), the length of training, and so on.

Finally, on Apple Silicon MacBooks with a recent PyTorch version, make sure to add --device=mps (short for "Metal Performance Shaders"); PyTorch will then use the on-chip GPU, which can significantly accelerate training (2-3x) and allows you to use larger networks. See Issue 28 for more.
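
If you're unsure which backends your PyTorch build supports, a quick check looks like this (a small sketch; milliGPT itself takes the device via the --device flag):

import torch

# pick the best available device: CUDA GPU, then Apple's MPS backend, then CPU
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
print(f"using device: {device}")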

Sampling / inference

Use the script sample.py to sample either from pre-trained GPT-2 models released by OpenAI, or from a model you trained yourself. For example, here is a way to sample from the largest available gpt2-xl model:

$ python sample.py \
    --init_from=gpt2-xl \
    --start="What is the answer to life, the universe, and everything?" \
    --num_samples=5 --max_new_tokens=100

If you'd like to sample from a model you trained, use --out_dir to point the script at the checkpoint directory. You can also prompt the model with text from a file, e.g. $ python sample.py --start=FILE:prompt.txt.
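
Under the hood, temperature, top-k, and top-p sampling all operate on the next-token logits before a token is drawn. The sketch below illustrates the typical filtering logic (an illustration of the technique, not necessarily how sample.py implements it; the function name and signature are made up for this example):

import torch

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    # logits: 1-D tensor of raw scores for the next token
    logits = logits / max(temperature, 1e-8)             # temperature scaling
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[-1]       # k-th largest logit
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p is not None:
        sorted_logits, idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        cum = torch.cumsum(probs, dim=-1)
        # drop tokens once the cumulative probability (excluding the current
        # token) exceeds top_p; the most likely token is always kept
        sorted_logits = sorted_logits.masked_fill(cum - probs > top_p, float("-inf"))
        logits = torch.full_like(logits, float("-inf")).scatter(0, idx, sorted_logits)
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)       # sampled token id

The KV cache mentioned in the description is an orthogonal optimization: it reuses the attention keys and values already computed for earlier positions, so each new token only requires a forward pass over the last position.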
