Data Refactor Proposal #954

Closed
awaelchli opened this issue Feb 26, 2024 · 7 comments · Fixed by #950
Labels
enhancement New feature or request

Comments

@awaelchli
Contributor

awaelchli commented Feb 26, 2024

This issue proposes a refactor for how data is preprocessed and consumed in Lit-GPT.

Current Issues and Limitations

Error prone: We have scripts in scripts/prepare_* to preprocess the data before running the training scripts. This is cumbersome and error prone because you have to specify the tokenizer of the model you are going to finetune. If you then want to finetune a different model and forget to rerun the prepare script, you will train on wrongly tokenized data.

Data overlap: The finetuning scripts load data by drawing a random index into the memory-mapped preprocessed file (sampling with replacement). Training over N epochs isn't really possible, but for finetuning we often want to control this very precisely. In addition, distributed sampling is also an issue: the current workaround is to set a different seed per rank, but users are likely oblivious to this subtle but important technical detail.

Inflexible: There is no standard interface to read a dataset in our scripts, making it harder to adopt new dataset types (e.g. DPO). Furthermore, the prompt template is hardcoded into the data preparation.

No-code experience: Lit-GPT is moving toward a CLI-focused experience where everything needs to be configurable without changing the code. The data part is currently the largest piece standing in the way of this.

Proposed Changes

  1. Switch all dataloading to PyTorch Dataset + DataLoader
  2. Bundle all data related logic in a datamodule
  3. Training scripts will instantiate a datamodule and call prepare_data(), setup(), and train/val_dataloader() to get the loaders (see also the extension section below)
  4. Predefined datamodules are available under lit_gpt.datasets
  5. All datamodule-related arguments are exposed in the training script CLI via --data.xyz arguments.
  6. Our predefined datamodules are registered under a shortcut name that can be referenced via the CLI (e.g. --data.module="Alpaca"), making it very quick to switch between datasets (see the registry sketch after this list).
  7. We can provide a generic datamodule for CSV, JSON, etc. that can read user-defined datasets in a standardized format, e.g. --data.module=csv --data.source="path/to/csv/or/folder"
  8. The prompt template is configurable
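
A minimal sketch of what the shortcut-name registry from item 6 could look like; the DATAMODULE_REGISTRY name, the resolve_datamodule helper, and the LIMA/CSVDataModule classes are hypothetical, and the sketch only assumes datamodule classes like the Alpaca example further below:

# Hypothetical registry; the names are illustrative, not the final API.
DATAMODULE_REGISTRY = {
    "Alpaca": Alpaca,
    "LIMA": LIMA,
    "csv": CSVDataModule,
}


def resolve_datamodule(name: str, **kwargs):
    # Instantiate a registered datamodule from its --data.module shortcut name.
    try:
        cls = DATAMODULE_REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown --data.module {name!r}. Available: {sorted(DATAMODULE_REGISTRY)}")
    return cls(**kwargs)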

Usage Examples

# Use defaults provided by data module
python finetune/full.py --data.module=LIMA 

# Provide optional arguments
python finetune/full.py --data.module=LIMA --data.test_split_fraction=0.1
python finetune/full.py --data.module=LIMA --data.max_seq_length=256

DataModule and Dataset

The DataModule could simply follow the LightningDataModule design, or even subclass it:

import json

from torch.utils.data import DataLoader, random_split


class Alpaca:
    def __init__(
        self,
        max_seq_length: int = -1,
        mask_prompt: bool = True,
        test_split_fraction: float = 0.03865,
        # ...
    ) -> None:
        super().__init__()
        # ...

    def prepare_data(self) -> None:
        # Download, Preprocess files etc.
        pass

    def setup(self, tokenizer, batch_size, ...) -> None:
        with open(self.data_file_path, "r", encoding="utf-8") as file:
            data = json.load(file)

        # Partition the dataset into train and test
        train_data, test_data = random_split(data, ...)
        
        self.train_dataset = SFTDataset(data=train_data, ...)
        self.test_dataset = SFTDataset(data=test_data, ...)

    def train_dataloader(self) -> DataLoader:
        return DataLoader(self.train_dataset, ...)

    def val_dataloader(self) -> DataLoader:
        return DataLoader(self.test_dataset, ...)

    def test_dataloader(self) -> DataLoader:
        return self.val_dataloader()

It bundles:

  • Preprocessing in prepare_data()
  • Instantiation of datasets in setup()
  • Definition of dataloaders in train/val/test_dataloader()

Note: The setup() method takes special arguments as input that can't be determined at the time of instantiating the datamodule. For example, the tokenizer must be loaded from the model's checkpoint directory, and the batch size is the micro-batch size set in the script.
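
For illustration, here is a minimal sketch of the SFTDataset referenced in setup(); the tokenizer and prompt-template interfaces are assumptions, not the final API. The key point is that tokenization happens on the fly in __getitem__, inside the dataloader worker:

import torch
from torch.utils.data import Dataset


class SFTDataset(Dataset):
    """On-the-fly tokenizing dataset for supervised finetuning (illustrative only)."""

    def __init__(self, data, tokenizer, prompt_template, max_seq_length=-1, mask_prompt=True):
        self.data = data                        # list of dicts, e.g. {"instruction": ..., "output": ...}
        self.tokenizer = tokenizer              # assumed: encode() returns a list of token ids
        self.prompt_template = prompt_template  # assumed: a configurable format string
        self.max_seq_length = max_seq_length
        self.mask_prompt = mask_prompt

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        example = self.data[idx]
        prompt = self.prompt_template.format(**example)
        # Tokenization happens here, in the dataloader worker, so switching the
        # tokenizer or max_seq_length never requires re-running a prepare script.
        prompt_ids = self.tokenizer.encode(prompt)
        input_ids = self.tokenizer.encode(prompt + example["output"])
        if self.max_seq_length > 0:
            input_ids = input_ids[: self.max_seq_length]
        labels = list(input_ids)
        if self.mask_prompt:
            n = min(len(prompt_ids), len(labels))
            labels[:n] = [-100] * n  # exclude prompt tokens from the loss
        return {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "labels": torch.tensor(labels, dtype=torch.long),
        }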

Usage in Training Scripts

The training scripts would simply add these lines of code (replacing the existing get_batch() function):

datamodule = ...   # Instantiate from data CLI args
if fabric.global_rank == 0:
    datamodule.prepare_data()  # download/preprocess once, on rank 0 only
fabric.barrier()  # remaining ranks wait here until the data is ready
datamodule.setup(tokenizer, batch_size, ...)  # pass runtime-only arguments
train_dataloader = datamodule.train_dataloader()
val_dataloader = datamodule.val_dataloader()
train_dataloader, val_dataloader = fabric.setup_dataloaders(train_dataloader, val_dataloader)

A POC for this design is in #950.

Pretraining

The pretraining datasets will still require preprocessing to be done externally due to their size. The corresponding datamodule would simply read the dataset at the default location, and error with instructions to preprocess if it can't be found.
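
As a rough sketch (the class name, default path, and error message are illustrative only), the check could look like this:

from pathlib import Path


class PretrainingData:  # hypothetical pretraining datamodule
    def __init__(self, data_path="data/pretrain"):
        self.data_path = Path(data_path)

    def prepare_data(self):
        # Pretraining data is preprocessed externally due to its size; here we
        # only verify that this has been done and point to the instructions.
        if not self.data_path.is_dir() or not any(self.data_path.iterdir()):
            raise FileNotFoundError(
                f"No preprocessed pretraining data found in '{self.data_path}'. "
                "Please run the corresponding preprocessing script first."
            )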

Extension

With this proposal, data preprocessing happens on the fly. But there isn't much to preprocess, since tokenization is done as part of the dataloaders. The only real preprocessing left is the automatic creation of train-test splits for datasets that don't ship with them. With the proposal above this would be done in memory, but as an extension we could have a cache folder where we store the splits and only recompute them when the (hashed) arguments change, as sketched below. This could be done as a follow-up in the future.
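
A rough sketch of how such a cache could be keyed on a hash of the split arguments (the helper names and file layout are assumptions):

import hashlib
import json
import random
from pathlib import Path


def split_cache_dir(cache_root: Path, **split_args) -> Path:
    # Key the cache on a hash of the split arguments, so changing any of them
    # (e.g. test_split_fraction or seed) triggers a fresh split.
    key = hashlib.sha256(json.dumps(split_args, sort_keys=True).encode()).hexdigest()[:16]
    return cache_root / key


def load_or_create_splits(data, cache_root, test_split_fraction, seed=42):
    cache_dir = split_cache_dir(Path(cache_root), test_split_fraction=test_split_fraction, seed=seed)
    train_file, test_file = cache_dir / "train.json", cache_dir / "test.json"
    if train_file.exists() and test_file.exists():
        # Same arguments as a previous run: reuse the cached splits.
        return json.loads(train_file.read_text()), json.loads(test_file.read_text())
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_split_fraction)
    test_data, train_data = shuffled[:n_test], shuffled[n_test:]
    cache_dir.mkdir(parents=True, exist_ok=True)
    train_file.write_text(json.dumps(train_data))
    test_file.write_text(json.dumps(test_data))
    return train_data, test_data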

Pros

  • Solves the aforementioned limitations
  • It will be easier to write unit tests since everything will be more decoupled and modular

Cons

  • While this refactor addresses the aforementioned limitations, it introduces some level of abstraction (dataset + datamodule), which works slightly against the initial design philosophy of lit-gpt: the code is no longer in a single file that you can read top to bottom. The refactor splits things up into functions, dataset classes, etc., but we will try our best to keep the engineering overhead to a minimum.

@carmocca @lantiga @rasbt @Andrei-Aksionov

@carmocca added the enhancement (New feature or request) label on Feb 26, 2024
@carmocca
Contributor

SGTM

@rasbt
Contributor

rasbt commented Feb 26, 2024

I really like this refactor. A few minor additional thoughts:

Training scripts will instantiate a datamodule and call prepare_data(), setup(), and train/val_dataloader() to get the loaders (see also the extension section below)

In general, does that mean the dataset is processed (tokenized, etc.) every time we execute the training script? Since this is relatively quick, it should not be a problem for the finetuning scripts. For pretraining scripts, it may not be feasible, though.

So my question is: to what extent should we make the pretraining and finetuning usages similar? I think the more similar, the better, from an intuition and user-experience perspective.

So, maybe

python finetune/full.py --data.module=LIMA

is the default. But then we can also offer the following for larger datasets where preprocessing separately makes sense:

python finetune/full.py --data.module_tokenized=LIMA

Users would use --data.module=LIMA by default, but where it makes sense they could choose --data.module_tokenized=LARGE_DATASET

Or perhaps that can be an additional flag

python finetune/full.py --data.module=LIMA --pretokenized true

Or maybe there are some other better ideas?

python finetune/full.py --data.module=LIMA --data.max_seq_length=256

I think this is nice. Previously, we defined the max_seq_len in the prepare_dataset scripts. That meant we had to reprocess the dataset every time we wanted to try a different length during finetuning (e.g., to work around memory limitations).

@awaelchli
Contributor Author

For the SFT datasets, tokenization is now done on the fly in the dataloader worker. There is no longer a need to store the entire dataset in tokenized format. The max length can also be adjusted directly on the fly.

At the moment, prepare_data() will only be "slow" for datasets that require creating a train-test split. But it's not that slow, and for bigger datasets I discuss a possible way to address it in the "Extension" section above. On the other hand, the bigger the dataset, the more likely it already comes with train-test splits.

You are right, for the pretraining datasets we still require the manual preprocessing step. In that sense, we cannot really make their usage identical. I think that's acceptable given the size of these datasets.

@Andrei-Aksionov
Contributor

Such a nicely written proposal 👍

Overall I have nothing to add. Looks like all the cases are covered.

@rasbt
Contributor

rasbt commented Feb 26, 2024

Sounds reasonable @awaelchli. I was just concerned in case we have larger SFT datasets in the future. I think OpenAssistant is 150k samples, for example (still relatively small though). But I think the current way you describe should definitely be the default, and we can think about tokenized-dataset support as an add-on in the future as it wouldn't impact the currently proposed usage.

@Algomancer

Algomancer commented Feb 27, 2024

You are right, for the pretraining datasets we still require the manual preprocessing step. In that sense, we cannot really make their usage identical. I think that's acceptable given the size of these datasets.

We have something like this in our training stack: for large datasets we use a similar abstraction with a disk cache that can be invalidated with an overwrite flag (to force retokenisation even if the hash doesn't change). That means it only tokenises the first time we run a training script. It might be a low-overhead way to provide the same API between finetuning and pretraining.
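
For illustration, that cache-plus-overwrite pattern could look roughly like this (the function name and storage format are hypothetical):

from pathlib import Path

import torch


def tokenize_with_cache(cache_file: Path, tokenize_fn, overwrite: bool = False):
    """Tokenize once, cache to disk, and re-run only when overwrite is requested."""
    if cache_file.exists() and not overwrite:
        return torch.load(cache_file)  # reuse the cached tokenized dataset
    tokenized = tokenize_fn()          # e.g. a function that tokenizes the raw dataset
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    torch.save(tokenized, cache_file)  # persisting with torch.save is just one possibility
    return tokenized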

@lantiga
Contributor

lantiga commented Feb 27, 2024

Great job @awaelchli
Looks good to me, fits nicely with the goals
