Add Dataset Descriptions And Instructions #358

Merged
merged 19 commits into from Aug 30, 2023

Conversation

@rasbt (Collaborator) commented Aug 4, 2023

  • Adds an alpaca-libre preparation script as an alternative to alpaca.
  • Provides instructions for all supported datasets.
  • Adds the LIMA description from #479 (Add LIMA dataset).
  • Updates the new token usage in LIMA.
  • Replaces the reference to --max_seq_length.

@rasbt rasbt changed the title Add alpaca-libre preparation script WIP: Add alpaca-libre preparation script Aug 15, 2023
@rasbt (Collaborator, Author) commented Aug 15, 2023

Let me add a documentation tutorial instead, because we can already download the dataset with the existing command-line args.

I'll continue with prepare_dataset.md, but a quick check before I put more work into it: does that sound good to you, @carmocca?

@carmocca (Member) commented

Yes!

@rasbt rasbt changed the title WIP: Add alpaca-libre preparation script Add alpaca-libre preparation script Aug 24, 2023
@rasbt (Collaborator, Author) commented Aug 24, 2023

This should be complete now (or at least good for review).

I suggest merging #466 and #447 first, though, because these are the OpenWebText and RedPajama documents referenced at the bottom of this doc.

Note that one focus of this document is to highlight the use of --checkpoint_dir in the prepare_* scripts, which a lot of people (me included) forget when trying out different models.

@rasbt rasbt mentioned this pull request Aug 29, 2023
2 tasks
@rasbt rasbt changed the title Add alpaca-libre preparation script Add Dataset Descriptions And Instructions Aug 29, 2023
@rasbt rasbt changed the title Add Dataset Descriptions And Instructions WIP: Add Dataset Descriptions And Instructions Aug 29, 2023
@rasbt rasbt changed the title WIP: Add Dataset Descriptions And Instructions Add Dataset Descriptions And Instructions Aug 29, 2023
@rasbt (Collaborator, Author) commented Aug 30, 2023

This is the dataset tutorial companion to all the datasets we added recently (Dolly, LIMA, Alpaca Libre). It has been updated after the max_token_length change and should be ready for review, @carmocca.

Resolved review threads (all outdated):
  • datasets
  • tutorials/finetune_lora.md
  • tutorials/neurips_challenge_quickstart.md (2 threads)
  • tutorials/prepare_dataset.md (4 threads)
@carmocca (Member) left a comment

Thanks!

@carmocca carmocca merged commit 241970d into main Aug 30, 2023
5 checks passed
@carmocca carmocca deleted the alpaca-libre branch August 30, 2023 23:44
2 participants