Data Refactor Proposal #954

Closed
awaelchli opened this issue Feb 26, 2024 · 7 comments · Fixed by #950
Labels
enhancement New feature or request

Comments

@awaelchli
Contributor

awaelchli commented Feb 26, 2024

This issue proposes a refactor for how data is preprocessed and consumed in Lit-GPT.

Current Issues and Limitations

Error prone: We have scripts in scripts/prepare_* to preprocess the data before running the training scripts. This is cumbersome and error prone because you have to specify the tokenizer of the model you are going to finetune. If you then want to finetune a different model and forget to rerun the prepare script, you will train on wrongly tokenized data.

Data overlap: The finetuning scripts load data by drawing a random index into the memory-mapped preprocessed file (sampling with replacement). Training over N epochs isn't really possible, but for finetuning we often want to control this very precisely. In addition, distributed sampling is also an issue: the current workaround is to set a different seed per rank, but users are likely oblivious to this subtle but important technical detail.

Inflexible: There is no standard interface to read a dataset in our scripts, making it harder to adopt new dataset types (e.g. DPO). Furthermore, the prompt template is hardcoded into the data preparation.

No-code experience: Lit-GPT is moving toward a CLI-focused experience where everything needs to be configurable without changing the code. The data part is currently the largest piece standing in the way of this.

Proposed Changes

  1. Switch all dataloading to PyTorch Dataset + DataLoader
  2. Bundle all data related logic in a datamodule
  3. Training scripts will instantiate a datamodule and call prepare_data(), setup(), and train/val_dataloader() to get the loaders (see also the extension section below)
  4. Predefined datamodules are available under lit_gpt.datasets
  5. All datamodule-related arguments are exposed in the training script CLI via --data.xyz arguments.
  6. Our predefined datamodules are registered under a shortcut name that can be referenced via the CLI (e.g. --data.module="Alpaca"), making it very quick to switch between datasets (see the registry sketch after this list).
  7. We can provide a generic datamodule for CSV, JSON, etc. that can read user-defined datasets in a standardized format, e.g. --data.module=csv --data.source="path/to/csv/or/folder"
  8. The prompt template is configurable
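
A minimal sketch of what the shortcut-name registry from item 6 could look like; the DATAMODULE_REGISTRY name, the resolve_datamodule helper, and the LIMA/CSVDataModule classes are hypothetical, and the sketch only assumes datamodule classes like the Alpaca example further below:

# Hypothetical registry; the names are illustrative, not the final API.
DATAMODULE_REGISTRY = {
    "Alpaca": Alpaca,
    "LIMA": LIMA,
    "csv": CSVDataModule,
}


def resolve_datamodule(name: str, **kwargs):
    # Instantiate a registered datamodule from its --data.module shortcut name.
    try:
        cls = DATAMODULE_REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown --data.module {name!r}. Available: {sorted(DATAMODULE_REGISTRY)}")
    return cls(**kwargs)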

Usage Examples

# Use defaults provided by data module
python finetune/full.py --data.module=LIMA 

# Provide optional arguments
python finetune/full.py --data.module=LIMA --data.test_split_fraction=0.1
python finetune/full.py --data.module=LIMA --data.max_seq_length=256

DataModule and Dataset

The DataModule could simply follow the LightningDataModule design, or even subclass it:

import json

from torch.utils.data import DataLoader, random_split


class Alpaca:
    def __init__(
        self,
        max_seq_length: int = -1,
        mask_prompt: bool = True,
        test_split_fraction: float = 0.03865,
        # ...
    ) -> None:
        super().__init__()
        # ...

    def prepare_data(self) -> None:
        # Download, Preprocess files etc.
        pass

    def setup(self, tokenizer, batch_size, ...) -> None:
        with open(self.data_file_path, "r", encoding="utf-8") as file:
            data = json.load(file)

        # Partition the dataset into train and test
        train_data, test_data = random_split(data, ...)
        
        self.train_dataset = SFTDataset(data=train_data, ...)
        self.test_dataset = SFTDataset(data=test_data, ...)

    def train_dataloader(self) -> DataLoader:
        return DataLoader(self.train_dataset, ...)

    def val_dataloader(self) -> DataLoader:
        return DataLoader(self.test_dataset, ...)

    def test_dataloader(self) -> DataLoader:
        return self.val_dataloader()

It bundles:

  • Preprocessing in prepare_data()
  • Instantiation of datasets in setup()
  • Definition of dataloaders in train/val/test_dataloader()

Note: The setup() method takes special arguments as input that can't be determined at the time of instantiating the datamodule. For example, the tokenizer must be loaded from the model's checkpoint directory, and the batch size is the micro-batch size set in the script.
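
For illustration, here is a minimal sketch of the SFTDataset referenced in setup(); the tokenizer and prompt-template interfaces are assumptions, not the final API. The key point is that tokenization happens on the fly in __getitem__, inside the dataloader worker:

import torch
from torch.utils.data import Dataset


class SFTDataset(Dataset):
    """On-the-fly tokenizing dataset for supervised finetuning (illustrative only)."""

    def __init__(self, data, tokenizer, prompt_template, max_seq_length=-1, mask_prompt=True):
        self.data = data                        # list of dicts, e.g. {"instruction": ..., "output": ...}
        self.tokenizer = tokenizer              # assumed: encode() returns a list of token ids
        self.prompt_template = prompt_template  # assumed: a configurable format string
        self.max_seq_length = max_seq_length
        self.mask_prompt = mask_prompt

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        example = self.data[idx]
        prompt = self.prompt_template.format(**example)
        # Tokenization happens here, in the dataloader worker, so switching the
        # tokenizer or max_seq_length never requires re-running a prepare script.
        prompt_ids = self.tokenizer.encode(prompt)
        input_ids = self.tokenizer.encode(prompt + example["output"])
        if self.max_seq_length > 0:
            input_ids = input_ids[: self.max_seq_length]
        labels = list(input_ids)
        if self.mask_prompt:
            n = min(len(prompt_ids), len(labels))
            labels[:n] = [-100] * n  # exclude prompt tokens from the loss
        return {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "labels": torch.tensor(labels, dtype=torch.long),
        }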

Usage in Training Scripts

The training scripts would simply add these lines of code (replacing the existing get_batch() function):

datamodule = ...   # Instantiate from data CLI args
if fabric.global_rank == 0:
    datamodule.prepare_data()  # download/preprocess once, on rank 0 only
fabric.barrier()  # remaining ranks wait here until the data is ready
datamodule.setup(tokenizer, batch_size, ...)  # pass runtime-only arguments
train_dataloader = datamodule.train_dataloader()
val_dataloader = datamodule.val_dataloader()
train_dataloader, val_dataloader = fabric.setup_dataloaders(train_dataloader, val_dataloader)

A POC for this design is in #950.

Pretraining

The pretraining datasets will still require preprocessing to be done externally due to their size. The corresponding datamodule would simply read the dataset at the default location, and error with instructions to preprocess if it can't be found.
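
As a rough sketch (the class name, default path, and error message are illustrative only), the check could look like this:

from pathlib import Path


class PretrainingData:  # hypothetical pretraining datamodule
    def __init__(self, data_path="data/pretrain"):
        self.data_path = Path(data_path)

    def prepare_data(self):
        # Pretraining data is preprocessed externally due to its size; here we
        # only verify that this has been done and point to the instructions.
        if not self.data_path.is_dir() or not any(self.data_path.iterdir()):
            raise FileNotFoundError(
                f"No preprocessed pretraining data found in '{self.data_path}'. "
                "Please run the corresponding preprocessing script first."
            )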

Extension

With this proposal, data preprocessing happens on the fly. But there isn't much to preprocess, since tokenization is done as part of the dataloaders. The only real preprocessing left is the automatic creation of train-test splits for datasets that don't ship with them. With the proposal above this would be done in memory, but as an extension we could have a cache folder where we store the splits and only recompute them when the (hashed) arguments change, as sketched below. This could be done as a follow-up in the future.
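
A rough sketch of how such a cache could be keyed on a hash of the split arguments (the helper names and file layout are assumptions):

import hashlib
import json
import random
from pathlib import Path


def split_cache_dir(cache_root: Path, **split_args) -> Path:
    # Key the cache on a hash of the split arguments, so changing any of them
    # (e.g. test_split_fraction or seed) triggers a fresh split.
    key = hashlib.sha256(json.dumps(split_args, sort_keys=True).encode()).hexdigest()[:16]
    return cache_root / key


def load_or_create_splits(data, cache_root, test_split_fraction, seed=42):
    cache_dir = split_cache_dir(Path(cache_root), test_split_fraction=test_split_fraction, seed=seed)
    train_file, test_file = cache_dir / "train.json", cache_dir / "test.json"
    if train_file.exists() and test_file.exists():
        # Same arguments as a previous run: reuse the cached splits.
        return json.loads(train_file.read_text()), json.loads(test_file.read_text())
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_split_fraction)
    test_data, train_data = shuffled[:n_test], shuffled[n_test:]
    cache_dir.mkdir(parents=True, exist_ok=True)
    train_file.write_text(json.dumps(train_data))
    test_file.write_text(json.dumps(test_data))
    return train_data, test_data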

Pros

  • Solves the aforementioned limitations
  • It will be easier to write unit tests since everything will be more decoupled and modular

Cons

  • While this refactor addresses the aforementioned limitations, it introduces some level of abstraction (dataset + datamodule), which works slightly against the initial design philosophy of lit-gpt: the code is no longer in a single file that you can read top to bottom. The refactor splits things up into functions, dataset classes, etc., but we will try our best to keep the engineering overhead to a minimum.

@carmocca @lantiga @rasbt @Andrei-Aksionov

@carmocca added the enhancement (New feature or request) label on Feb 26, 2024
@carmocca
Contributor

SGTM

@rasbt
Contributor

rasbt commented Feb 26, 2024

I really like this refactor. A few minor additional thoughts:

Training scripts will instantiate a datamodule and call prepare_data(), setup(), and train/val_dataloader() to get the loaders (see also the extension section below)

In general, does that mean the dataset is processed (tokenized, etc.) every time we execute the training script? Since this is relatively quick, it should not be a problem for the finetuning scripts. For pretraining scripts, it may not be feasible, though.

So my question is: to what extent should we make the pretraining and finetuning usages similar? I think the more similar, the better, from an intuition and user-experience perspective.

So, maybe

python finetune/full.py --data.module=LIMA

is the default. But then we can also offer the following for larger datasets where preprocessing separately makes sense:

python finetune/full.py --data.module_tokenized=LIMA

Users would use --data.module=LIMA by default, but where it makes sense they could choose --data.module_tokenized=LARGE_DATASET

Or perhaps that can be an additional flag

python finetune/full.py --data.module=LIMA --pretokenized true

Or maybe there are some other better ideas?

python finetune/full.py --data.module=LIMA --data.max_seq_length=256

I think this is nice. Previously, we defined the max_seq_len in the prepare_dataset scripts. That meant we had to reprocess the dataset every time we wanted to try a different length during finetuning (e.g., to work around memory limitations).

@awaelchli
Contributor Author

For the SFT datasets, tokenization is now done on the fly in the dataloader worker. There is no longer a need to store the entire dataset in tokenized format. The max length can also be adjusted directly on the fly.

At the moment, prepare_data() will only be "slow" for datasets that require creating a train-test split. But it's not that slow, and for bigger datasets I discuss a possible way to address it in the "Extension" section above. On the other hand, the bigger the dataset, the more likely it already comes with train-test splits.

You are right, for the pretraining datasets we still require the manual preprocessing step. In that sense, we cannot really make their usage identical. I think that's acceptable given the size of these datasets.

@Andrei-Aksionov
Contributor

Such a nicely written proposal 👍

Overall I have nothing to add. Looks like all the cases are covered.

@rasbt
Contributor

rasbt commented Feb 26, 2024

Sounds reasonable @awaelchli. I was just concerned in case we have larger SFT datasets in the future. I think OpenAssistant is 150k samples, for example (still relatively small though). But I think the current way you describe should definitely be the default, and we can think about tokenized-dataset support as an add-on in the future as it wouldn't impact the currently proposed usage.

@Algomancer

Algomancer commented Feb 27, 2024

You are right, for the pretraining datasets we still require the manual preprocessing step. In that sense, we cannot really make their usage identical. I think that's acceptable given the size of these datasets.

We have something like this in our training stack: for large datasets we use a similar abstraction with a disk cache that can be invalidated with an overwrite flag (to force retokenisation even if the hash doesn't change). That means it only tokenises the first time we run a training script. It might be a low-overhead way to provide the same API between finetuning and pretraining.
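
For illustration, that cache-plus-overwrite pattern could look roughly like this (the function name and storage format are hypothetical):

from pathlib import Path

import torch


def tokenize_with_cache(cache_file: Path, tokenize_fn, overwrite: bool = False):
    """Tokenize once, cache to disk, and re-run only when overwrite is requested."""
    if cache_file.exists() and not overwrite:
        return torch.load(cache_file)  # reuse the cached tokenized dataset
    tokenized = tokenize_fn()          # e.g. a function that tokenizes the raw dataset
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    torch.save(tokenized, cache_file)  # persisting with torch.save is just one possibility
    return tokenized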

@lantiga
Contributor

lantiga commented Feb 27, 2024

Great job @awaelchli
Looks good to me, fits nicely with the goals
