diff --git a/docs/wiki-guide/Digital-Product-Lifecycle.md b/docs/wiki-guide/Digital-Product-Lifecycle.md index 5d877a3..fe091b3 100644 --- a/docs/wiki-guide/Digital-Product-Lifecycle.md +++ b/docs/wiki-guide/Digital-Product-Lifecycle.md @@ -19,7 +19,7 @@ The following adds additional context and direction to supplement the diagram, o * **Datasets:** Hugging Face Dataset Repository ([Data checklist](Data-Checklist.md)). * For already published data usage, see the [Metadata Checklist](Metadata-Checklist.md). * **ML Models:** Hugging Face Model Repository ([Model checklist](Model-Checklist.md)). -* Though alternative storage options may be discussed, **Google Drive is not an acceptable storage location for research data, models, or code**. Folder activity does not include actual file additions or deletions, so content can be changed or removed without a record of when or by whom. All research, data, models, and code must be stored in **a version controlled repository, preferably in more than one location** to ensure preservation and full provenance tracking. +* Though alternative storage options may be discussed, **Google Drive, OneDrive, and other user-tied institutional locations are not acceptable storage locations for research data, models, or code**. Folder activity does not include actual file additions or deletions, so content can be changed or removed without a record of when or by whom. All research data, models, and code must be stored in **a version-controlled repository, preferably in more than one location** to ensure preservation and full provenance tracking. 
### Exploration Phase diff --git a/docs/wiki-guide/The-Hugging-Face-Dataset-Upload-Guide.md b/docs/wiki-guide/The-Hugging-Face-Dataset-Upload-Guide.md index a16b312..41d4ae8 100644 --- a/docs/wiki-guide/The-Hugging-Face-Dataset-Upload-Guide.md +++ b/docs/wiki-guide/The-Hugging-Face-Dataset-Upload-Guide.md @@ -1,21 +1,50 @@ # Hugging Face Dataset Guide -## Create a New Dataset Repository +[Hugging Face](https://hf.co/) offers numerous methods for interacting with and creating datasets. This page provides a basic overview with some recommendations specifically targeting image dataset uploads, though the principles are transferable to other data types. We list these options—in order of increasing complexity—with some guidance, recommendations, and links out to the appropriate parts of the Hugging Face docs for the most up-to-date information available. -When creating a new dataset repository, you can make the dataset **Public** (accessible to anyone on the internet) or **Private** (accessible only to members of the organization). +1. [Web interface (UI)](#upload-a-dataset-with-the-web-interface): For smaller, simpler uploads. +2. [Hugging Face Command Line Interface (CLI)](#upload-a-dataset-with-the-hugging-face-cli): For most use cases; easily accessible from a cluster. +3. [Hugging Face API (Python package)](#upload-a-dataset-with-hfapi): For when more fine-grained control than is achievable with the CLI is needed. +4. [Git/Git LFS](#upload-a-dataset-with-git): Main use case is when multiple PRs lead to merge conflicts—Hugging Face provides no other means for resolution. -![New dataset repository interface](images/HF-dataset-upload/346972860-ed0feb0e-529b-4021-b44f-41ac96680bc3.png){ loading=lazy, width=800 } -/// caption -/// +!!! info + Some sections of the Hugging Face docs, such as those for `huggingface_hub`, have only version-specific links for stable versions. 
In this case, if the link directs to an older version, a banner will alert you that a newer version is available, so keep an eye out for it. + +Most of the content below is covered in various parts of [Hugging Face's Upload Guide](https://huggingface.co/docs/huggingface_hub/en/guides/upload); this page is provided as a summary reference mainly to determine which method might be best and link to the appropriate docs. Additionally, we include an [integrity check](#integrity-check) to help you ensure that your repo contains all the desired files after uploading through any of these methods. + +See also [HF's tips and tricks for large uploads](https://huggingface.co/docs/huggingface_hub/en/guides/upload#tips-and-tricks-for-large-uploads). + +## Note on Authentication + +All of these methods require authentication to edit datasets, ranging from passwords to tokens to SSH authentication, and all support editing **Public** (accessible to anyone on the internet) or **Private** (accessible only to members of the organization) repos. Two key notes on authentication: + +1. Private repositories are only visible if you are authenticated. +2. If using tokens for access, be sure to create a [fine-grained token](https://huggingface.co/docs/hub/en/security-tokens#what-are-user-access-tokens) scoped specifically to your needs. ## Upload a Dataset with the Web Interface -In the Files and versions tab of the Dataset card, you can choose to add file in the hugging web interface. +In the Files and Versions tab of the repository, you can select "Contribute" to add or create files or start a pull request directly from the web interface. ![Dataset repository Add file button](images/HF-dataset-upload/346190430-9e6cef9b-18ef-4d4a-84c5-1a3f75ac9336.png){ loading=lazy } +This method is fine for smaller files (<100MB) or for data uploads from distributed sources, particularly when the repository has a relatively flat structure with few directories and/or few files. 
If you are uploading existing files, navigate to the target folder first. + +## Upload a Dataset with the Hugging Face CLI + +Hugging Face provides a comprehensive Command Line Interface (CLI) and corresponding [docs](https://huggingface.co/docs/huggingface_hub/en/guides/cli). Note that this is installed with the `huggingface_hub` Python package, but can also be installed directly, then invoked with the `hf` command. + +The Hugging Face CLI is the ideal method for uploads that are large in volume, have more than a few files, and/or have a folder structure with many or nested directories. It works directly from HPC clusters, such as OSC. Under the hood, [`hf upload`](https://huggingface.co/docs/huggingface_hub/en/guides/cli#hf-upload) uses the same upload functions described below, under [Upload a Dataset with HfApi](#upload-a-dataset-with-hfapi). Review [Hugging Face's guidance on large folder uploads](https://huggingface.co/docs/huggingface_hub/v1.10.1/guides/upload#upload-a-large-folder) before selecting a method for uploading large folders to a non-empty repository. + +When uploading to a dataset, note that the repo type must be specified (`--repo-type=dataset`); this is also the case for Spaces, since Hugging Face treats models as the default repo type. + +There are specific [`hf datasets`](https://huggingface.co/docs/huggingface_hub/en/guides/cli#hf-datasets) and [`hf repo`](https://huggingface.co/docs/huggingface_hub/en/guides/cli#hf-repo) commands for more general queries and repo initialization. + ## Upload a Dataset with HfApi +When more complex dataset structures are involved or more fine-grained control (not exposed in the CLI) over how a repo will be organized on Hugging Face is needed, the Hugging Face API may be the answer. For instance, if a glob pattern cannot sufficiently clarify necessary exclusions of subfolders or files, [`HfApi`](https://huggingface.co/docs/huggingface_hub/package_reference/hf_api) is likely the preferred choice. 
This is a class, accessible through the [`huggingface_hub` package](https://huggingface.co/docs/huggingface_hub/index), that acts as a Python wrapper for the API. + +Please see the Hugging Face API docs for the most up-to-date guidance. For quick reference: [upload by file](https://huggingface.co/docs/huggingface_hub/v1.10.1/package_reference/hf_api#huggingface_hub.HfApi.upload_file) and [upload by folder (structure maintained)](https://huggingface.co/docs/huggingface_hub/v1.10.1/package_reference/hf_api#huggingface_hub.HfApi.upload_folder). + ``` py linenums="1" from huggingface_hub import login @@ -24,70 +53,70 @@ login() from huggingface_hub import HfApi api = HfApi() +repo_id = "username/repo-name" -api.upload_file ( +# Upload by file +api.upload_file( path_or_fileobj = , path_in_repo = , - repo_id = , + repo_id = repo_id, repo_type = 'dataset' ) -``` - -## Upload a Dataset with Git - -### If the Dataset is Less Than 5GB - -Navigate to the folder for the repository: +# Upload by folder (maintain structure) +api.upload_folder( + folder_path="/path/to/local/folder", # should end with folder containing data + path_in_repo="path/to/folder/", # path desired for folder in repo + repo_id=repo_id, + repo_type="dataset", + token="paste-token-here" # only needed if not logged in; HF does not recommend hard-coding tokens +) ``` -# Clone the repository -git clone https://huggingface.co/datasets/username/repo-name -# Add, commit, and push the files -git add -git commit -m 'comments' -git push +Repos can also be created through the Hugging Face API using the [create_repo method](https://huggingface.co/docs/huggingface_hub/v1.10.1/en/package_reference/hf_api#huggingface_hub.HfApi.create_repo) with the following parameters: +```py linenums="1" +repo_id = "username/repo-name" +repo_type = "dataset" +private = True # if you want the repo private ``` -### If the Dataset is Larger Than 5GB +See also instructions using the [datasets package](https://huggingface.co/docs/datasets/create_dataset). 
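The `allow_patterns` and `ignore_patterns` arguments of `upload_folder` accept glob patterns, and it is easy to exclude more (or less) than intended. The sketch below previews locally which files a pattern pair would keep before any upload happens; the `preview_upload` helper and the sample paths are hypothetical, and Hugging Face's own pattern matching may differ in edge cases.

```python
from fnmatch import fnmatch

def preview_upload(paths, allow_patterns=None, ignore_patterns=None):
    """Hypothetical helper: approximate which relative paths an
    allow/ignore glob pair would keep, mirroring the spirit of
    upload_folder's allow_patterns / ignore_patterns arguments."""
    kept = []
    for p in paths:
        # drop anything not matched by at least one allow pattern
        if allow_patterns and not any(fnmatch(p, pat) for pat in allow_patterns):
            continue
        # drop anything matched by an ignore pattern
        if ignore_patterns and any(fnmatch(p, pat) for pat in ignore_patterns):
            continue
        kept.append(p)
    return kept

paths = ["images/a.jpg", "images/b.png", "scratch/tmp.jpg", "metadata.csv"]
print(preview_upload(paths,
                     allow_patterns=["images/*", "*.csv"],
                     ignore_patterns=["scratch/*"]))
# keeps images/a.jpg, images/b.png, metadata.csv
```

Running this against a listing of your local folder (e.g., from `pathlib.Path.rglob`) is a cheap sanity check before committing to a large `upload_folder` call.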
- -#### Install Git LFS +## Upload a Dataset with Git -Follow instructions at +Using Git to interact with Hugging Face requires installation of [Git LFS](https://git-lfs.com/), the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli), and then enabling large file uploads for the repo. -#### Install the Hugging Face CLI +Hugging Face provides details on [git vs http](https://huggingface.co/docs/huggingface_hub/en/concepts/git_vs_http), which is really a comparison of using Git vs `HfApi`. -``` -brew install huggingface-cli -pip install -U "huggingface_hub[cli]" -``` +Hugging Face has moved away from Git LFS, instead utilizing [Xet](https://huggingface.co/docs/hub/en/xet/index) for data storage and version control; this is [backwards compatible with LFS](https://huggingface.co/docs/hub/en/xet/legacy-git-lfs). -#### Enable the repository to upload large files +## Other Repo Considerations -``` -huggingface-cli lfs-enable-largefiles -``` +One case where `git` may be needed is when one encounters a [merge conflict](https://discuss.huggingface.co/t/how-to-fix-merge-conflicts-in-prs/160090). Unlike GitHub, Hugging Face ***does not*** have conflict resolution UI tools, nor does it provide merge conflict resolution capabilities in the CLI or HfApi. The only means for resolving merge conflicts is to manually update the pull request in a [local clone](The-Hugging-Face-Workflow.md#hugging-face-pull-requests-with-local-edits), pulling `main` into your PR branch and resolving the conflicts. -#### Initialize Git LFS +## Integrity Check -``` -git lfs install -``` +Sometimes uploads fail partway through, leaving one or more files un-uploaded. Unfortunately, there is no easy way to be alerted to these issues when not uploading through the UI. Additionally, using a glob pattern to select files for upload without a dry-run[^1] (in `git` terms, this would be running `git status` after adding files) can also lead to accidental exclusion. 
To catch these issues, we recommend the following integrity check after uploading a dataset[^1]. -#### Track large files (e.g., .csv files) +[^1]: The Hugging Face CLI does have a [dry-run mode](https://huggingface.co/docs/huggingface_hub/en/guides/cli#dry-run-mode) for *downloading* datasets. Additionally, if working with Git LFS, there is a [preupload LFS](https://huggingface.co/docs/huggingface_hub/en/guides/upload#preupload-lfs-files-before-commit) option to ensure all files are properly present and organized before committing. There are additional considerations for sharding noted in the Hugging Face docs. -``` -# Adds a line to .gitattributes, which Git uses to determine files managed by LFS -git lfs track "*.csv" -git add .gitattributes -git commit -m "Track large files with Git LFS" -``` +```python +import pandas as pd +from huggingface_hub import HfApi -#### Add, commit, and push the files +api = HfApi() +repo_id = "username/repo-name" +repo_type = "dataset" +file_list = api.list_repo_files(repo_id=repo_id, repo_type=repo_type) +file_df = pd.DataFrame({"filepath": file_list}) +metadata = pd.read_csv("path/to/metadata/file") + +# assuming you use the same filepath in your system as in the repo +df = pd.merge(file_df, metadata, how="inner", on="filepath") +print(df.shape[0]) # this should match the number of expected images ``` -git add -git commit -m 'comments' -git push -``` + +!!! tip "Pro tip" + If you don't have a metadata file for your images, use the [sum-buddy package](Helpful-Tools-for-your-Workflow.md#sum-buddy) to generate one in your local file system. This can also be used as a metadata file for the dataset viewer as needed (see [image datasets docs](https://huggingface.co/docs/hub/en/datasets-image) for more information on setting this up). Similar options are available for [audio](https://huggingface.co/docs/hub/en/datasets-audio) and [video](https://huggingface.co/docs/hub/en/datasets-video) datasets. 
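An integrity check that compares row counts tells you that files are missing, but not which ones. As an extension of the check above, pandas' merge `indicator` option can list the missing files directly. This sketch uses made-up stand-in file lists; in practice `file_df` and `metadata` would come from `list_repo_files` and your metadata CSV as shown above.

```python
import pandas as pd

# Hypothetical stand-ins for the repo listing and the local metadata file
file_df = pd.DataFrame({"filepath": ["img/001.jpg", "img/002.jpg"]})
metadata = pd.DataFrame({"filepath": ["img/001.jpg", "img/002.jpg", "img/003.jpg"]})

# indicator=True adds a _merge column marking where each row was found:
# "both", "left_only" (repo only), or "right_only" (metadata only)
check = pd.merge(file_df, metadata, how="outer", on="filepath", indicator=True)

# rows present only in the metadata are files that never made it into the repo
missing = check.loc[check["_merge"] == "right_only", "filepath"].tolist()
print(missing)  # → ['img/003.jpg']
```

The same pattern with `"left_only"` flags the opposite problem: files in the repo that your metadata does not account for.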
diff --git a/docs/wiki-guide/images/HF-dataset-upload/346190430-9e6cef9b-18ef-4d4a-84c5-1a3f75ac9336.png b/docs/wiki-guide/images/HF-dataset-upload/346190430-9e6cef9b-18ef-4d4a-84c5-1a3f75ac9336.png index 265e540..d3380cb 100644 Binary files a/docs/wiki-guide/images/HF-dataset-upload/346190430-9e6cef9b-18ef-4d4a-84c5-1a3f75ac9336.png and b/docs/wiki-guide/images/HF-dataset-upload/346190430-9e6cef9b-18ef-4d4a-84c5-1a3f75ac9336.png differ