SFT data sampling #1368

maw501 · 2023-02-08T21:05:46Z

Closes #910. Adds per epoch, per dataset sub-sampling by specifying a fraction or size argument in the configuration file.

As such, I’ve removed the references which truncate the initial data loading after a certain number of samples. This means we process all the data upfront (which could be slow). See below for an issue to potentially resolve this.

Other small fixes along the way

Updated README to reflect subsampling instructions and general tidy-up.
Removal of mpi4py as a dependency as I couldn’t install it and it’s not used anywhere in the whole OA project.
Added wandb to the requirements.txt

Potential other issues to raise

I didn’t want to let this PR sprawl into a bunch of separate fixes. As such, depending on what other's think, I think it might be worth raising the following new issues (I’m happy to work on them almost immediately, some are small fixes):

The prompt_dialogue dataset has changed url. I think it was coming from this repo which has now changed.
It’s possible that loading all the data will now be slow as it re-processes the datasets every run (which is wasteful). Could we do this as a one-off and upload back to HF?
Why are the dependencies pinned so heavily? It might be beneficial to relax these if we can.
Tests: this is something I didn't quite understand. The current ones don’t run with pytest (in fact, I think the directory structure is incorrect for this to work). I didn’t want to start moving files around in this PR to make but would like to open a separate PR to reorganise the tests (and write a bunch more, including for the data sampling).

Checks performed

Per above, as pytest isn’t currently working I did a bunch of one-off checks:

Ran on CPU ✅
Ran on single GPU machine: g4dn.xlarge Nvidia T4 ✅
Ran on 4 GPU machine: g4dn.12xlarge Nvidia T4 ✅
Validated unique indices are generated in correct proportions per epoch i.e. that we get a different random sample per epoch and the total examples are shared across devices correctly when using distributed training. ✅

sanagno

Hey @maw501, this looks great with even more than we discussed, thanks a lot!

maw501 added 3 commits February 8, 2023 08:01

First version of single GPU sampling working

e2caf53

Get working on multi-gpu

283df8e

minor tidy-up

9faae25

maw501 requested review from theblackcat102 and sanagno as code owners February 8, 2023 21:05

fozziethebeat added the ml label Feb 9, 2023

sanagno approved these changes Feb 9, 2023

View reviewed changes

Merge branch 'main' into sft-data-sampling

4ba622d

sanagno merged commit 3f30e1f into LAION-AI:main Feb 9, 2023

This was referenced Feb 20, 2023

Incorrect url in SFT prompt_dialogue dataset #1769

Closed

Configure pytest in SFT pipeline #1770

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SFT data sampling #1368

SFT data sampling #1368

maw501 commented Feb 8, 2023

sanagno left a comment

SFT data sampling #1368

SFT data sampling #1368

Conversation

maw501 commented Feb 8, 2023

Other small fixes along the way

Potential other issues to raise

Checks performed

sanagno left a comment

Choose a reason for hiding this comment