Skip to content

Commit

Permalink
Updated readme for torch2.0
Browse files Browse the repository at this point in the history
  • Loading branch information
JonasGeiping committed Jun 13, 2023
1 parent 2009755 commit 25651c3
Show file tree
Hide file tree
Showing 2 changed files with 32 additions and 28 deletions.
58 changes: 31 additions & 27 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,14 @@ How far can we get with a single GPU in just one day?
> We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a *single* day on a *single consumer* GPU.
Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.


## UPDATE: This is the version of the framework!

You need PyTorch 2.0 to run the new code. If you want to remain on PyTorch 1.*, you can checkout the tag `Last1.13release`. The new model, trained with the new codebase is 1-2% better on GLUE with the same budget. The checkpoint can be found at https://huggingface.co/JonasGeiping/crammed-bert. The old checkpoint is now https://huggingface.co/JonasGeiping/crammed-bert-legacy.

Also, data preprocessing has improved, you can now stream data directly from huggingface, from the upload at https://huggingface.co/datasets/JonasGeiping/the_pile_WordPiecex32768_2efdb9d060d1ae95faf952ec1a50f020.


## The Rules for Cramming
Setting:
* A transformer-based language model of arbitrary size is trained with masked-language modeling, completely from scratch.
Expand All @@ -23,21 +31,20 @@ Setting:

# How to run the code

## Requirements:
* PyTorch: `torch` (at least version 1.12)
Run `pip install .` to install all dependencies.


## Requirements in Details:
* PyTorch: `torch` (at least version 2.1)
* huggingface: `transformers`, `tokenizers`, `datasets`, `evaluate`
* `hydra-core`
* [OPTIONAL]`deepspeed`
* [OPTIONAL] `flash-attention`
* `psutil`
* `psutil`, `pynvml`, `safetensors`
* `einops`
* [OPTIONAL] For The-Pile data, install `zstandard`

## Installation
* Just clone for now, and install packages as described. Inside the cloned directory, you can use `pip install .` to install all packages and scripts.
* [Optional] For deduplication (necessary for the final dataset recipe), first install rust `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh `, then
* [Optional] For deduplication (only needed to replicate deduplication tests), first install rust `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh `, then
`git clone https://github.com/google-research/deduplicate-text-datasets/tree/dev-v1` and then run `cargo install --target-dir ../cramming/dedup`
* [Optional] For FlashAttention (necessary for the `c5` recipe), install package as instructed at https://github.com/HazyResearch/flash-attention
* [Optional] Follow the instructions at https://pre-commit.com/ to install the pre-commit hooks (necessary only if you want to contribute to this project).

To verify a minimal installation, you can run
Expand All @@ -49,35 +56,37 @@ This command pre-processes a small sanity-check dataset, and runs a single train

## General Usage

Use the `pretrain.py` script to pretrain with limited compute. This repository uses hydra (https://hydra.cc/docs/intro/), so all fields in `cramming/config` can be modified on the command line. For example, the `budget` can be modified by providing `budget=48` as additional argument, or the learning rate can be modified via `train.optim.lr=1e-4`. Check out the configuration folder to see all arguments.
Use the `pretrain.py` script to pretrain with limited compute. This repository uses hydra (https://hydra.cc/docs/intro/), so all fields in `cramming/config` can be modified on the command line. For example, the `budget` can be modified by providing `budget=48` as additional argument (to run for 48 hours), or the learning rate can be modified via `train.optim.lr=1e-4`. Check out the configuration folder to see all arguments.

Your first step should be to verify the installed packages. To do so, you can run `python pretrain.py dryrun=True`, which will run the default sanity check for a single iteration. From there, you can enable additional functionality. For example, modify the architecture, e.g. `arch=bert-original` and training setup `train=bert-original`.
To really train a language model, you need to switch away from the sanity check dataset to at least `data=bookcorpus-wikipedia`.
To really train a language model, you need to switch away from the sanity check dataset to at least `data=pile-readymade`. Then, choose an improved training setup, e.g. `train=bert-o4`, and an improved model layout, e.g. `arch=crammed-bert`.

### Data Handling
The data sources from `data.sources` will be read, normalized and pretokenized before training starts and cached into a database. Subsequent calls with the same configuration will reused this database of tokenized sequences. By default, a new tokenizer will also be constructed and saved during this process. Important data options are `data.max_entries_in_raw_dataset`, which defines how much *raw* data will be loaded. For example, for a large data source such as C4, only a subset of raw data will be downloaded. Then, `max_seq_in_tokenized_dataset` bottlenecks how many *processed* sequences will be stored in the database. This number should be larger than the number of sequences expected to be read within the budget.

Additional Notes:
* Start with preprocessed data, using `data=pile-readymade`
* A simple trick to run dataset preprocessing only is to run `python pretrain.py data=... dryrun=True`, which dry-runs the training, but runs the full data preprocessing. Later runs can then re-use the cached data.
* Dataset preprocessing is heavily parallelized. This might be a problem for your RAM. If this happens, reduce `impl.threads`. Especially the deduplication code does require substantial amounts of RAM.
* I would run first experiments with `bookcorpus-wikipedia` only, which preprocesses comparatively quickly and only then look into the full processed and filtered C4.
* Use `impl.sharing_strategy=file_system` on windows or macOS.
* I would run first data experiments with `bookcorpus-wikipedia` only, which preprocesses comparatively quickly and only then look into the full processed and filtered C4.


#### Preprocessed Datasets

For reference and if you are only interested in changing training/architecture, you can find some preprocessed datasets here:

https://www.dropbox.com/sh/sy8wanplx5typ9k/AAAWUceTcvZIh1GFX4Ij7_xXa?dl=0.
* https://huggingface.co/datasets/JonasGeiping/the_pile_WordPiecex32768_2efdb9d060d1ae95faf952ec1a50f020
* https://huggingface.co/datasets/JonasGeiping/the_pile_WordPiecex32768_8eb2d0ea9da707676c81314c4ea04507
* https://huggingface.co/datasets/JonasGeiping/the_pile_WordPiecex32768_2efdb9d060d1ae95faf952ec1a50f020

You will not need to download all of these. `c4-subset_WordPiecex32768_e0501aeb87699de7500dc91a54939f44` here is the final processed dataset for `data=c4-subset-processed` and `bookcorpus-wikitext_WordPiecex32768_a295a1f5b033756b08d2dbca690655a7` is the default `bookcorpus-wikipedia` dataset. Each folder contains a file called `model_config.json` that describes the preprocessing. You need to move these datasets into the `data` folder of your base output directory (so, `cramming/outputs/data` with default settings). The preprocessed data will be read and the preprocessing step skipped entirely, if the data is placed in the right folder.
These data sources can be streamed. To do so, simple set `data=pile-readymade`.

Preprocessed data is convenient to work with, and I do think modifications to data processing and filtering continue to be under-explored compared to training and architecture because of this. There might be more gains to be had with better data, than with other tweaks, so ultimately you might want to consider setting up the code and environment for the full data processing pipeline to work.


#### Model Checkpoint

You can now find a checkpoint for `c5-o3` trained on `c4-subset-processed` at https://huggingface.co/JonasGeiping/crammed-bert.
You can now find a checkpoint for the final version trained on `the-pile` at https://huggingface.co/JonasGeiping/crammed-bert.

### Evaluation

Expand All @@ -94,23 +103,23 @@ You can log runs to your weights&biases account. To do so, simply modify `wandb.

To replicate the final recipe discussed in the paper, run
```
python pretrain.py name=amp_b4096_c5_o3_final arch=bert-c5 train=bert-o3 train.batch_size=4096 data=c4-subset-processed
python pretrain.py name=amp_b8192_cb_o4_final arch=crammed-bert train=bert-o4 data=pile-readymade
```
to pretrain and
```
python eval.py eval=GLUE_sane name=amp_b4096_c5_o3_final eval.checkpoint=latest impl.microbatch_size=16 impl.shuffle_in_dataloader=True
python eval.py eval=GLUE_sane name=amp_b8192_cb_o4_final eval.checkpoint=latest impl.microbatch_size=16 impl.shuffle_in_dataloader=True
```
to evaluate the model. The recipe called "crammed BERT" in the paper corresponds to the architecture called `bert-c5` trained with training setup `bert-o3` on data `c4-subset-processed`.
to evaluate the model. The recipe called "crammed BERT" in the paper corresponds to the architecture called `crammed-bert` in the config, trained with the training setup `bert-o4` on data `the-pile`.

## Additional Recipes
Pretraining:
Single GPU, original BERT settings:
```
python pretrain.py name=bert data=bookcorpus-wikipedia arch=bert-original train=bert-original
python pretrain.py name=bert data=bookcorpus-wikipedia arch=bert-original train=bert-original budget=10000000
```
Multi-GPU, original BERT settings:
```
torchrun --nproc_per_node=4 --standalone pretrain.py name=bert4gpu data=bookcorpus-wikipedia arch=bert-original train=bert-original
torchrun --nproc_per_node=4 --standalone pretrain.py name=bert4gpu data=bookcorpus-wikipedia arch=bert-original train=bert-original budget=10000000 impl.fullgraph=false impl._inductor_vars.triton.cudagraphs=False
```

Eval a huggingface checkpoint:
Expand All @@ -134,13 +143,8 @@ Additional examples for recipes can be found in the `/scripts` folder.

The following options are currently broken/limited/work-in-progress. Use these at your own discretion. Of course, any contributions here are highly appreciated. You can also message me with more questions about any of these points, if you want to look into them.

* The-Pile needs to be downloaded in its entirety to be used, but the code could be updated to stream, just like C4.
* Data Preprocessing is wasteful in terms of RAM.
* Token Dropping is simplistic, a more involved version could be better.
* Code currently uses the "old" `jit.script` fusion, should move toward new `torch.compile` implementation at some point. The current `inductor` hook is also non-functional.
* Shampoo (see discussion at https://twitter.com/_arohan_/status/1608577721818546176?s=20)
* Causal Attention [I broke this shortly before release, if you want to re-test CA, you'd have to fix it first]
* LAWA
* Shampoo (see discussion at https://twitter.com/_arohan_/status/1608577721818546176?s=20). In general, alternative optimizers are probably undertested, and could use more atttention.
* Some options are only available in the old release at the `Last1.13release` tag. If you are interested in reviving some of these options. Feel free to open a pull request with updates to the new codebase.

# Contact

Expand Down
2 changes: 1 addition & 1 deletion setup.cfg
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,7 @@ install_requires =
psutil
einops
safetensors
# apache-beam # only used for wikipedia ...
# apache-beam # only used for wikipedia, terrible dependencies
zstandard # only used for the Pile

scripts =
Expand Down

0 comments on commit 25651c3

Please sign in to comment.