Data Refactor Proposal #954
Comments
SGTM
I really like this refactor. A few minor additional thoughts:
In general, does that mean the dataset is processed (tokenized, etc.) every time we execute the training script? Since this is relatively quick, this should not be a problem for the finetuning scripts. For pretraining scripts, this may not be feasible, though. So my question is, to what extent should we make the pretraining and finetuning usage similar? I think the more similar, the better from an intuition and user experience perspective. So, maybe
is the default. But then we can also offer the following for larger datasets where preprocessing separately makes sense:
Users would by default use ... Or perhaps that can be an additional flag:
Or maybe there are some other better ideas?
I think this is nice. Previously, we defined the ...
For the SFT datasets, tokenization is now done on the fly in the dataloader workers, so there is no longer a need to store the entire dataset in tokenized form. The max length can also be adjusted directly on the fly. You are right that, at the moment, the pretraining datasets still require the manual preprocessing step, so in that sense we cannot really make their usage identical. I think that's acceptable given the size of these datasets.
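For illustration, a minimal sketch of what on-the-fly tokenization in the dataloader could look like; the class and field names below are assumptions for the example, not the actual Lit-GPT code:

```python
import torch
from torch.utils.data import Dataset


class SFTDataset(Dataset):
    """Illustrative dataset that tokenizes each sample lazily in __getitem__,
    i.e. inside the dataloader worker, so no pre-tokenized copy is stored."""

    def __init__(self, samples, tokenizer, prompt_template, max_seq_length=-1):
        self.samples = samples                # e.g. list of dicts with "instruction"/"output"
        self.tokenizer = tokenizer            # assumed: callable returning a list of token ids
        self.prompt_template = prompt_template
        self.max_seq_length = max_seq_length  # adjustable without re-running a prepare step

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        text = self.prompt_template.format(**sample) + sample["output"]
        token_ids = self.tokenizer(text)
        if self.max_seq_length > 0:
            token_ids = token_ids[: self.max_seq_length]
        input_ids = torch.tensor(token_ids, dtype=torch.long)
        return {"input_ids": input_ids, "labels": input_ids.clone()}
```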
Such a nicely written proposal 👍 Overall I have nothing to add. Looks like all the cases are covered.
Sounds reasonable @awaelchli. I was just concerned in case we have larger SFT datasets in the future. I think OpenAssistant is 150k samples, for example (still relatively small, though). But I think the current way you describe should definitely be the default, and we can think about tokenized-dataset support as an add-on in the future, as it wouldn't impact the currently proposed usage.
We have something like this in our training stack: for large datasets we use a similar abstraction with a disk cache that can be overridden with an overwrite flag (to force retokenisation even if the hash doesn't change). That means it only tokenises the first time we run a training script, which might be a low-overhead way to offer the same API between finetuning and pretraining.
Great job @awaelchli
This issue proposes a refactor for how data is preprocessed and consumed in Lit-GPT.
Current Issues and Limitations
- Error prone: We have scripts in `scripts/prepare_*` to preprocess the data before running the training scripts. This is cumbersome and error prone because you have to specify the tokenizer that you are going to use to finetune. If you then want to finetune a different model and forget to rerun the prepare script, you will use wrongly tokenized data.
- Data overlap: The finetuning scripts load the data with a random index into the memory-mapped preprocessed file (sampling with replacement). Training over N epochs isn't really possible, but for finetuning we often want to control this very precisely. In addition, distributed sampling is also an issue; the current workaround is to set a different seed per rank, but users are likely oblivious to this technical detail (see the sketch after this list).
- Inflexible: There is no standard interface to read a dataset in our scripts, making it harder to adopt new dataset types (e.g. DPO). Furthermore, the prompt template is hardcoded into the data preparation.
- No-code experience: Lit-GPT is moving toward a CLI-focused experience where everything needs to be configurable without changing the code. The data part is the largest piece standing in the way of this at the moment.
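For context, a simplified sketch of the sampling-with-replacement pattern described above; this is an illustration of the pattern, not the exact existing `get_batch()` code:

```python
import torch


def get_batch(data: list, micro_batch_size: int):
    # Indices are drawn at random on every call, so samples are taken with
    # replacement, there is no notion of an epoch, and every rank needs a
    # different seed to avoid training on overlapping batches.
    ix = torch.randint(len(data), (micro_batch_size,))
    input_ids = [data[int(i)]["input_ids"] for i in ix]
    labels = [data[int(i)]["labels"] for i in ix]
    return input_ids, labels
```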
Proposed Changes
- A datamodule abstraction that bundles `prepare_data()`, `setup()`, and `train/val_dataloader()` to get the loaders (see also the extension section below)
- The datamodules live in `lit_gpt.datasets` and are configurable from the CLI through `--data.xyz` arguments
- Selecting a dataset becomes e.g. `--data.module="Alpaca"`, making it super quick to change between datasets
- Custom data can be loaded with `--data.module=csv --data.source="path/to/csv/or/folder"`
Usage Examples
DataModule and Dataset
The DataModule could simply follow the `LightningDataModule` design, or even subclass it. It bundles:
- `prepare_data()`
- `setup()`
- `train/val/test_dataloader()`
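For illustration only, a minimal sketch of what such a datamodule could look like, assuming a LightningDataModule-style interface; the class name, constructor arguments, and the toy dataset are made up for the example:

```python
from dataclasses import dataclass

import torch
from torch.utils.data import DataLoader, TensorDataset, random_split


@dataclass
class ExampleDataModule:
    """Hypothetical datamodule following the LightningDataModule design."""
    val_split_fraction: float = 0.1
    num_workers: int = 4

    def prepare_data(self):
        # One-time, single-process work such as downloading the raw dataset.
        # No tokenization happens here; that is done on the fly in the dataset.
        pass

    def setup(self, tokenizer, batch_size, max_seq_length=-1):
        # "Late" arguments that are only known inside the training script:
        # the tokenizer loaded from the checkpoint directory and the
        # micro-batch size configured by the user.
        self.tokenizer = tokenizer
        self.batch_size = batch_size
        self.max_seq_length = max_seq_length
        # Toy stand-in for the real dataset; the train/val split is created in memory.
        data = TensorDataset(torch.arange(100).unsqueeze(1))
        n_val = int(len(data) * self.val_split_fraction)
        self.train_data, self.val_data = random_split(data, [len(data) - n_val, n_val])

    def train_dataloader(self):
        return DataLoader(self.train_data, batch_size=self.batch_size,
                          shuffle=True, num_workers=self.num_workers)

    def val_dataloader(self):
        return DataLoader(self.val_data, batch_size=self.batch_size,
                          num_workers=self.num_workers)
```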
Note: The `setup()` method takes special arguments as input that can't be determined at the time of instantiating the datamodule. For example, the tokenizer must be loaded from the checkpoint directory of the model, and the batch size is the micro-batch size set in the script.
Usage in Training Scripts
The training scripts would simply add a few lines of code, replacing the existing `get_batch()` function. A POC for this design is in #950.
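A rough, hypothetical sketch of what those lines could look like; this is not the actual code from #950, and the function name is an assumption:

```python
def setup_dataloaders(datamodule, tokenizer, micro_batch_size, max_seq_length=-1):
    # Hand the datamodule the pieces that are only known inside the script,
    # then ask it for the loaders instead of calling a custom get_batch().
    datamodule.prepare_data()
    datamodule.setup(tokenizer=tokenizer, batch_size=micro_batch_size,
                     max_seq_length=max_seq_length)
    return datamodule.train_dataloader(), datamodule.val_dataloader()
```

In the script this would be called once before the training loop, e.g. `train_dataloader, val_dataloader = setup_dataloaders(data, tokenizer, micro_batch_size, max_seq_length)`.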
Pretraining
The pretraining datasets will still require preprocessing to be done externally due to their size. The corresponding datamodule would simply read the dataset at the default location, and error with instructions to preprocess if it can't be found.
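As a small illustration, a hypothetical check a pretraining datamodule could run in `setup()`; the directory name, file pattern, and error message are assumptions:

```python
from pathlib import Path


def ensure_preprocessed(data_dir: str = "data/openwebtext") -> Path:
    # Only read data that has already been preprocessed, and point the user
    # at the prepare script otherwise.
    path = Path(data_dir)
    if not path.is_dir() or not any(path.glob("*.bin")):
        raise FileNotFoundError(
            f"No preprocessed pretraining data found in '{path}'. "
            "Run the corresponding scripts/prepare_* script first."
        )
    return path
```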
Extension
With this proposal, data preprocessing happens on the fly. But there isn't much to preprocess, since tokenization is done as part of the dataloaders. The only real preprocessing we have is the automatic creation of train-test splits for datasets that don't come with them. This would be done in memory with the proposal above, but as an extension, we could have a cache folder where we store the splits and only rerun the split if the arguments change (hashed). This could be done as a follow-up in the future.
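A minimal sketch of that cache idea, assuming the split arguments can be serialized to JSON; all names here are hypothetical:

```python
import hashlib
import json
from pathlib import Path


def cached_split_dir(cache_root: str, **split_args) -> Path:
    # Derive a cache directory from a hash of the split arguments (dataset
    # name, seed, validation fraction, ...). If any argument changes, the
    # hash changes and the split is recreated; otherwise it can be reused.
    key = hashlib.sha256(json.dumps(split_args, sort_keys=True).encode()).hexdigest()[:16]
    split_dir = Path(cache_root) / key
    split_dir.mkdir(parents=True, exist_ok=True)
    return split_dir
```

The datamodule would then look for existing split files in the returned directory and only recreate them when they are missing.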
Pros
Cons
@carmocca @lantiga @rasbt @Andrei-Aksionov