
[Feedback welcome] CLI to upload arbitrary huge folder #2254

Open · wants to merge 52 commits into main

Conversation

@Wauplin (Contributor) commented Apr 26, 2024

What for?

Upload arbitrarily large folders in a single command line!

⚠️ This tool is still experimental and is meant for power users. Expect some rough edges in the process. Feedback and bug reports would be very much appreciated ❤️

How to use it?

Install

pip install git+https://github.com/huggingface/huggingface_hub@large-upload-cli

Upload folder

huggingface-cli large-upload <repo-id> <local-path>

Every minute, a report with the current status is printed to the terminal. In the meantime, progress bars and errors are still displayed.

Large upload status:
  Progress:
    104/104 hashed files (22.5G/22.5G)
    0/42 preuploaded LFS files (0.0/22.5G) (+4 files with unknown upload mode yet)
    58/104 committed files (24.9M/22.5G)
    (0 gitignored files)
  Jobs:
    sha256: 0 workers (0 items in queue)
    get_upload_mode: 0 workers (4 items in queue)
    preupload_lfs: 6 workers (36 items in queue)
    commit: 0 workers (0 items in queue)
  Elapsed time: 0:00:00
  Current time: 2024-04-26 16:24:25

Run huggingface-cli large-upload --help to see all options.

What does it do?

This CLI is intended to upload arbitrarily large folders in a single command:

  • the process is split into 4 steps: hash, get upload mode, LFS upload, commit
  • each step retries on error
  • multi-threaded: workers are managed with queues
  • resumable: if the process is interrupted, you can re-run it; only partially uploaded files are lost
  • files are hashed only once
  • starts uploading files while other files are still being hashed
  • commits at most 50 files at a time
  • prevents concurrent commits
  • avoids hitting rate limits as much as possible
  • avoids small commits
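The queue-based, multi-threaded design described above can be sketched as follows. This is a minimal illustration, not the actual huggingface_hub implementation: stage names loosely mirror the report output, but the helper `run_pipeline` and its two-stage structure are invented for the example.

```python
import hashlib
import queue
import threading
from pathlib import Path


def run_pipeline(files, num_workers=4):
    """Toy two-stage pipeline: hash files, then 'upload' them.

    Each stage has its own queue; workers pull items, process them,
    and push results to the next stage, mirroring the
    hash -> get_upload_mode -> preupload_lfs -> commit flow of the CLI.
    """
    to_hash = queue.Queue()
    to_upload = queue.Queue()
    done = []
    lock = threading.Lock()

    for f in files:
        to_hash.put(f)

    def hash_worker():
        while True:
            try:
                path = to_hash.get_nowait()  # queue is pre-filled
            except queue.Empty:
                return
            digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
            to_upload.put((path, digest))

    def upload_worker():
        while True:
            try:
                path, digest = to_upload.get(timeout=1)
            except queue.Empty:
                return  # assume hashing finished if nothing arrives
            # real code would PUT the file to the LFS endpoint, with retries
            with lock:
                done.append((path, digest))

    workers = [threading.Thread(target=hash_worker) for _ in range(num_workers)]
    workers += [threading.Thread(target=upload_worker) for _ in range(num_workers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return done
```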

A .huggingface/ folder will be created at the root of your folder to keep track of the progress. Please do not modify these files manually. If you think this folder got corrupted, please report it here, delete the .huggingface/ folder entirely, and then restart your command. Some intermediate steps will be lost but the upload process should be able to continue correctly.
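To make the resumability concrete, here is a hypothetical sketch of how such a metadata folder could record per-file state. The actual layout of `.huggingface/` is internal to the PR; the `progress.json` file and the state values below are invented for illustration.

```python
import json
from pathlib import Path

METADATA_DIR = ".huggingface"  # folder name from the PR; file layout here is invented


def load_progress(folder):
    """Read previously recorded per-file state, if any."""
    state_file = Path(folder) / METADATA_DIR / "progress.json"
    if state_file.exists():
        return json.loads(state_file.read_text())
    return {}


def save_progress(folder, state):
    """Persist per-file state so an interrupted run can resume."""
    meta = Path(folder) / METADATA_DIR
    meta.mkdir(exist_ok=True)
    (meta / "progress.json").write_text(json.dumps(state))


def files_to_process(folder, all_files):
    """Skip files already marked as committed in a previous run."""
    state = load_progress(folder)
    return [f for f in all_files if state.get(f) != "committed"]
```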

Known limitations

  • cannot set a path_in_repo => files are always uploaded at the root of the folder. If you want to upload to a subfolder, you need to set the proper structure locally.
  • not optimized for hf_transfer (though it works) => better to set --num-workers to 2, otherwise CPU usage will be excessive
  • cannot delete files on the repo while uploading a folder
  • cannot set a commit message/commit description
  • cannot create a PR by itself => you must first create a PR manually, then provide the revision

What to review?

Nothing yet.

For now the goal is to gather as much feedback as possible. If it proves successful, I will clean up the implementation and make it more production-ready. Also, this PR is built on top of #2223, which is not merged yet and makes the diff very long.

For curious people, here is the logic to decide what should be the next task to perform.
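The linked decision logic is not reproduced here, but a priority rule consistent with the description above (commit beats LFS preupload beats mode resolution beats hashing, with commits batched at 50 files) might look like the following. This is an inferred sketch, not the PR's actual scheduler; the `next_task` function and the state keys are invented.

```python
def next_task(state):
    """Hypothetical scheduler: pick the next job for an idle worker.

    Rough priorities, inferred from the PR description (not the actual
    implementation): commit once enough files are staged (or everything
    is hashed), otherwise push LFS uploads forward, otherwise resolve
    upload modes, otherwise hash.
    """
    if state["ready_to_commit"] >= 50 or (state["ready_to_commit"] and state["all_hashed"]):
        return "commit"
    if state["preupload_queue"]:
        return "preupload_lfs"
    if state["upload_mode_queue"]:
        return "get_upload_mode"
    if state["hash_queue"]:
        return "sha256"
    return None  # nothing left to do
```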

@Wauplin Wauplin changed the title [Experimental] CLI to upload arbitrary huge folder [Feedback welcome] CLI to upload arbitrary huge folder Apr 26, 2024
@Wauplin Wauplin added the CLI label Apr 26, 2024

@Wauplin (Contributor, Author) commented May 3, 2024

Feedback so far:

  • when the connection is slow, it is better to reduce the number of workers. Should we do that automatically or just print a message? Reducing the number of workers might not speed up the upload, but at least fewer files are uploaded in parallel => fewer chances to lose progress in case of a failed upload.
  • terminal output is too verbose. Might be good to disable individual progress bars?
  • terminal output is awful in a Jupyter notebook => how can we make that more friendly? (printing a report every minute ends up with very long logs)
  • a CTRL+C (or at most 2 CTRL+C) must stop the process. It's not the case at the moment due to all the try/except blocks.

EDIT:

  • should print a warning when uploading parquet/arrow files to a model repository. It is not possible to convert a model repo to a dataset repo afterwards, so better to be sure.
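Such a guard could be as simple as an extension check before the upload starts. This is a hypothetical helper, not part of the PR; the function name and suffix list are invented.

```python
from pathlib import Path

# File extensions that usually indicate a dataset rather than model weights
DATASET_HINT_SUFFIXES = {".parquet", ".arrow"}


def warn_if_dataset_like(files, repo_type):
    """Return a warning message when dataset-style files target a model repo, else None."""
    if repo_type != "model":
        return None
    suspicious = [f for f in files if Path(f).suffix in DATASET_HINT_SUFFIXES]
    if not suspicious:
        return None
    return (
        f"{len(suspicious)} parquet/arrow file(s) detected. Did you mean "
        "--repo-type dataset? A model repo cannot be converted to a "
        "dataset repo afterwards."
    )
```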

@davanstrien (Member) commented:
IMO, it would make sense for this not to default to uploading as a model repo i.e. require this:

huggingface-cli large-upload <repo-id> <local-path> --repo-type dataset

If a user runs:

huggingface-cli large-upload <repo-id> <local-path>

they should get an error along the lines of "Please specify the repo type you want to use"

Quite a few people using this tool have accidentally uploaded a dataset to a model repo, and currently, it's not easy to move this to a dataset repo.

I know that many of the huggingface_hub methods/functions default to model repos, but I think that doesn't make sense in this case since:

  • it's as likely (or more likely) to be used for uploading datasets as for model weights
  • since the goal is to support large uploads, the cost of getting it wrong is quite high for the user
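This suggestion amounts to making `--repo-type` a required argument instead of defaulting to "model". A sketch of what that could look like, assuming an argparse-based CLI like the usage strings below (the `build_parser` helper is invented for illustration):

```python
import argparse


def build_parser():
    """Mirror the `large-upload` options, but require --repo-type explicitly."""
    parser = argparse.ArgumentParser(prog="huggingface-cli large-upload")
    parser.add_argument("repo_id")
    parser.add_argument("local_path")
    parser.add_argument(
        "--repo-type",
        choices=["model", "dataset", "space"],
        required=True,  # no silent default to "model"
        help="Please specify the repo type you want to use",
    )
    return parser
```

With this, `huggingface-cli large-upload <repo-id> <local-path>` without `--repo-type` would exit with a usage error instead of silently creating a model repo.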

@julien-c (Member) commented:

ah i rather agree with @davanstrien here

@wanng-ide commented:

Can the parameters of "large-upload" be aligned with those of "upload"?
huggingface-cli large-upload [repo_id] [local_path]

@Wauplin (Contributor, Author) commented May 22, 2024

@wanng-ide Agreed, we should aim for consistency. Which parameters/options would you specifically change?

So far we have:

$ huggingface-cli large-upload --help
usage: huggingface-cli <command> [<args>] large-upload [-h] [--repo-type {model,dataset,space}]
                                                       [--revision REVISION] [--private]
                                                       [--include [INCLUDE ...]] [--exclude [EXCLUDE ...]]
                                                       [--token TOKEN] [--num-workers NUM_WORKERS]
                                                       repo_id local_path
$ huggingface-cli upload --help 
usage: huggingface-cli <command> [<args>] upload [-h] [--repo-type {model,dataset,space}]
                                                 [--revision REVISION] [--private] [--include [INCLUDE ...]]
                                                 [--exclude [EXCLUDE ...]] [--delete [DELETE ...]]
                                                 [--commit-message COMMIT_MESSAGE]
                                                 [--commit-description COMMIT_DESCRIPTION] [--create-pr]
                                                 [--every EVERY] [--token TOKEN] [--quiet]
                                                 repo_id [local_path] [path_in_repo]

@wanng-ide replied (quoting the comment above):
what about: huggingface-cli large-upload [local_path] [path_in_repo]
ADD [path_in_repo]

@Wauplin (Contributor, Author) commented May 22, 2024

I'm not sure I understand the purpose of the ADD keyword.
