Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SFT data sampling #1368

Merged
merged 4 commits into from Feb 9, 2023
Merged

SFT data sampling #1368

merged 4 commits into from Feb 9, 2023

Conversation

maw501
Copy link
Contributor

@maw501 maw501 commented Feb 8, 2023

Closes #910. Adds per epoch, per dataset sub-sampling by specifying a fraction or size argument in the configuration file.

As such, I’ve removed the references which truncate the initial data loading after a certain number of samples. This means we process all the data upfront (which could be slow). See below for an issue to potentially resolve this.

Other small fixes along the way

  • Updated README to reflect subsampling instructions and general tidy-up.
  • Removal of mpi4py as a dependency as I couldn’t install it and it’s not used anywhere in the whole OA project.
  • Added wandb to the requirements.txt

Potential other issues to raise

I didn’t want to let this PR sprawl into a bunch of separate fixes. As such, depending on what other's think, I think it might be worth raising the following new issues (I’m happy to work on them almost immediately, some are small fixes):

  • The prompt_dialogue dataset has changed url. I think it was coming from this repo which has now changed.
  • It’s possible that loading all the data will now be slow as it re-processes the datasets every run (which is wasteful). Could we do this as a one-off and upload back to HF?
  • Why are the dependencies pinned so heavily? It might be beneficial to relax these if we can.
  • Tests: this is something I didn't quite understand. The current ones don’t run with pytest (in fact, I think the directory structure is incorrect for this to work). I didn’t want to start moving files around in this PR to make but would like to open a separate PR to reorganise the tests (and write a bunch more, including for the data sampling).

Checks performed

Per above, as pytest isn’t currently working I did a bunch of one-off checks:

  • Ran on CPU ✅
  • Ran on single GPU machine: g4dn.xlarge Nvidia T4
  • Ran on 4 GPU machine: g4dn.12xlarge Nvidia T4
  • Validated unique indices are generated in correct proportions per epoch i.e. that we get a different random sample per epoch and the total examples are shared across devices correctly when using distributed training. ✅

Copy link
Collaborator

@sanagno sanagno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @maw501, this looks great with even more than we discussed, thanks a lot!

@sanagno sanagno merged commit 3f30e1f into LAION-AI:main Feb 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add data sampling features for SFT and future projects ( RLHF ) as well
3 participants