Update NeoXArgs docs automatically
github-actions committed Apr 13, 2024
1 parent defa0a4 commit 5cdff76
Showing 1 changed file with 117 additions and 1 deletion.
118 changes: 117 additions & 1 deletion configs/neox_arguments.md
@@ -111,7 +111,7 @@ Logging Arguments

- **git_hash**: str

- Default = 11a5537
+ Default = defa0a4

current git hash of repository

@@ -1378,6 +1378,122 @@ Training Arguments



- **replay_config**: dict

Default = None

Dictionary storing the replay config.



- **is_replay_enabled**: bool

Default = False

Triggers the logic for replay. It is important to handle replay separately from the general "train_data_paths" logic, because replay
requires reusing the same idx files to know what data was seen the first time a dataset was originally trained on.
If one attempts to do replay by simply putting the datasets to be replayed in train_data_paths instead of the replay params:
- If the exact same dataset files are used as during the first pass, and the number of iterations on the replay buffer
corresponds to as many epochs on a replay dataset as the non-replay training, the data will be seen in exactly the same order as
the first time, provided the seed and sequence length are the same.
- For similar reasons, replaying multiple times on the same dataset (e.g. across multiple tasks) with the same number of epochs
on the replay dataset will lead to seeing the same data in the same order.
- If a different dataset is used for replay (e.g. a different shard of the Pile), the shuffling will produce completely different
indices, which can leave potentially significant proportions of data unseen if the original training on the replay dataset
did not cover all of it, e.g. when training on 300B tokens of the GPT2-tokenised Pile, which contains a few dozen billion more
tokens than that, and then sharding the full dataset into smaller ones.



- **replay_idx_paths_prefixes**: list

Default = None

List of path prefixes used to retrieve the replay dataset idx files. These idx files should have been generated when originally training
on the dataset now being used for replay; their filenames encode the number of samples potentially seen during pretraining, the sequence
length and the seed. The exact files (shuffle_idx, sample_idx and doc_idx) are automatically derived from each prefix, and the data paths
are likewise generated from the prefixes.
The *_idx files are important because they record what data was seen in the dataset during training. If those files are missing, you can
regenerate them by relaunching the same training script (most importantly, the same config) used originally to pretrain on the dataset; you
can add an exit(0) statement in training.py in pretrain() after the call to build_train_valid_test_data_iterators(neox_args=neox_args).
It is crucial to use the same dataset shard, sequence length, number of iterations, seed, and potentially batch size, or the generated
indices may not be the same.
For a single replay data source, the value passed looks like ["data/mydataset/train/mydataset_train_4_indexmap_456789ns_2048sl_1234s"] and
the files at the following paths (constructed during execution from the prefix) must exist:
"data/mydataset/train/mydataset_train_4_indexmap_456789ns_2048sl_1234s_doc_idx.npy"
"data/mydataset/train/mydataset_train_4_indexmap_456789ns_2048sl_1234s_sample_idx.npy"
"data/mydataset/train/mydataset_train_4_indexmap_456789ns_2048sl_1234s_shuffle_idx.npy"
"data/mydataset/train/mydataset"



- **replay_data_to_idx_paths**: dict

Default = None

As indicated above, this is automatically built from replay_idx_paths_prefixes by appending "_doc_idx.npy", "_sample_idx.npy" and
"_shuffle_idx.npy" to each prefix. It is a dict of dicts, with the data paths as keys, each mapping to the relevant
doc_idx, sample_idx and shuffle_idx file paths. Note that these files must exist at the corresponding paths.
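
A sketch of the resulting structure for the single-dataset example above; the inner key names ("doc_idx", "sample_idx", "shuffle_idx") are assumptions for illustration, not necessarily the exact keys used in the code.

```python
# Assumed layout of the automatically built mapping.
replay_data_to_idx_paths = {
    "data/mydataset/train/mydataset": {
        "doc_idx": "data/mydataset/train/mydataset_train_4_indexmap_456789ns_2048sl_1234s_doc_idx.npy",
        "sample_idx": "data/mydataset/train/mydataset_train_4_indexmap_456789ns_2048sl_1234s_sample_idx.npy",
        "shuffle_idx": "data/mydataset/train/mydataset_train_4_indexmap_456789ns_2048sl_1234s_shuffle_idx.npy",
    },
}
```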



- **replay_data_paths**: list

Default = None

As indicated above, this is automatically built from replay_idx_paths_prefixes by stripping the idx-file information so as to retain
only the path to the dataset itself.



- **replay_data_weights**: list

Default = None

List of 'weights' that decide how often to sample from each replay dataset when building the replay buffer.
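
A hedged sketch of weighted sampling across replay datasets, shown with NumPy; this mirrors the usual weighted-blending behaviour of data weights in general and is not the repository's exact sampling code.

```python
import numpy as np

# Normalize the weights and sample which replay dataset each replay slot draws from.
replay_data_weights = [3.0, 1.0]          # dataset 0 sampled ~3x as often as dataset 1
probs = np.asarray(replay_data_weights) / np.sum(replay_data_weights)

rng = np.random.default_rng(0)
choices = rng.choice(len(probs), size=8, p=probs)
print(choices)                            # indices into the list of replay datasets
```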



- **replay_idx_offsets**: list

Default = None

List of indices that decide where to start in the list of indices seen during pretraining on each replay dataset when building
the replay buffer. For example, if a dataset originally saw 10000 samples during training, this allows reading the
RESHUFFLED indices starting from index replay_idx_offsets[i] for replay dataset i.
If not set, the starting offsets are sampled uniformly for all replay datasets.
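
A small sketch of how such an offset might be applied for one replay dataset; variable names are illustrative and the indices here are synthetic stand-ins for the loaded *_shuffle_idx.npy contents.

```python
import numpy as np

# Stand-in for the indices seen during original pretraining (in practice loaded
# from the *_shuffle_idx.npy file) after reshuffling with replay_seed.
reshuffled_idx = np.random.default_rng(1234).permutation(10_000)

offset = 2_500        # replay_idx_offsets[i] for replay dataset i
n_replay = 1_000      # samples drawn from this dataset for the replay buffer
replay_slice = reshuffled_idx[offset : offset + n_replay]
```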



- **replay_fraction**: float

Default = 0.05

Fraction of a batch dedicated to replay. For example, 0.1 means that in a batch of 100, 10 samples will come from the replay
buffer. Note that this means that if we train on 100B tokens, only 90B tokens will have come from the datasets specified in
train_data_paths.
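
The per-batch arithmetic, as a sketch; the exact rounding rule used in the code may differ.

```python
# Assumed rounding behaviour for illustration.
train_batch_size = 100
replay_fraction = 0.1

replay_samples = int(replay_fraction * train_batch_size)   # 10 samples from the replay buffer
fresh_samples = train_batch_size - replay_samples          # 90 samples from train_data_paths
```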



- **replay_reshuffle_idx**: bool

Default = True

When the index files loaded are those from the dataset's original pretraining run and this is set to False, replay will follow the
exact same sequence of samples seen when training on that dataset the first time. If True, the indices are reshuffled to prevent that.



- **replay_seed**: int

Default = 1234

Seed used to reshuffle the indices that were accessed when originally training on a dataset and that are reused for replay. This matters
when, for example, replay is done twice over as many passes of the dataset: if the same seed is used both times, the replay buffers in
both cases will be exactly the same.
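
A hedged sketch of the reshuffling idea with NumPy; the seen indices here are a synthetic stand-in for the loaded *_shuffle_idx.npy contents, and this is not the repository's exact code.

```python
import numpy as np

# Stand-in for the indices loaded from the original pretraining run.
seen_idx = np.arange(456_789)

replay_seed = 1234
reshuffled = np.random.default_rng(replay_seed).permutation(seen_idx)
# Re-running with the same replay_seed and the same seen indices yields the same
# permutation, so two replay passes with identical settings see identical buffers.
```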



- **weight_by_num_documents**: bool

Default = False
