Huggingface dataset integration #101

Merged
banghuaz-nvidia merged 43 commits into main from fsiino/hf-migration
Nov 11, 2025

Contributor

@fsiino-nvidia fsiino-nvidia commented Sep 26, 2025

This change adds support for Hugging Face dataset management (upload/download/delete GitLab artifact(s))

Addresses items 2 and 3 from #81

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
…r-organization


copy-pr-bot bot commented Sep 26, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
@fsiino-nvidia fsiino-nvidia marked this pull request as ready for review September 27, 2025 00:29
@fsiino-nvidia fsiino-nvidia requested a review from a team as a code owner September 27, 2025 00:29
Signed-off-by: Frankie Siino <fsiino@nvidia.com>

# Conflicts:
#	resources_servers/comp_coding/configs/comp_coding.yaml
Signed-off-by: Frankie Siino <fsiino@nvidia.com>
@banghuaz-nvidia
Contributor

Hey @fsiino-nvidia , can we resolve the conflict as well? Left some comments there also. Thx!

Comment thread nemo_gym/hf_utils.py Outdated
        print(f"[Nemo-Gym] - Repo '{repo_id}' already exists")
    except HfHubHTTPError as e:
        if e.response is not None and e.response.status_code == 404:
            client.create_repo(repo_id=repo_id, token=config.hf_token, repo_type="dataset", private=True)
Contributor

This seems like a slightly unnatural way to check for an existing repo and then create one. Doesn't create_repo check for existence automatically?

Contributor Author

Added `exist_ok=True` to simplify this.
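For readers following the thread, here is a minimal sketch of what the simplified call might look like. `exist_ok` is a parameter of `HfApi.create_repo` in `huggingface_hub`; the helper name `ensure_dataset_repo` and its signature are hypothetical and not taken from `nemo_gym/hf_utils.py`.

```python
from typing import Any, Optional


def ensure_dataset_repo(client: Any, repo_id: str, token: Optional[str] = None) -> None:
    """Create a private dataset repo, or do nothing if it already exists.

    `client` is assumed to be a `huggingface_hub.HfApi` instance (hypothetical
    helper; the actual code in the PR may be structured differently).
    """
    client.create_repo(
        repo_id=repo_id,
        token=token,
        repo_type="dataset",
        private=True,
        exist_ok=True,  # do not raise if the repo is already there
    )


# Usage (needs network access and a valid token):
#   from huggingface_hub import HfApi
#   ensure_dataset_repo(HfApi(), "my-org/my-dataset", token="hf_...")
```

With `exist_ok=True` the separate existence check and the 404 handling from the outdated hunk are no longer needed.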

Comment thread README.md Outdated
Naming convention for Huggingface datasets is as follows:
`{hf_organization}/{hf_collection_name}-{domain}-{resource_server_name}-{your dataset name}`
Contributor

Can we decouple collection vs dataset prefix? There might be cases where we want to put it under "Nemo-Gym" collection without having a prefix "Nemo-Gym" there. So it would be nice to have two fields for these.

Contributor Author

I changed this to `hf_dataset_prefix`. It defaults to `NeMo-Gym-` and can be overridden, even with an empty string if desired.
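As a rough illustration of the decoupled prefix, the sketch below composes a dataset repo id. It assumes `hf_dataset_prefix` simply takes the place of `{hf_collection_name}-` in the naming convention quoted above. The function name `build_hf_repo_id` and the exact layout are hypothetical; the merged README may define the final format differently.

```python
def build_hf_repo_id(
    hf_organization: str,
    domain: str,
    resource_server_name: str,
    dataset_name: str,
    hf_dataset_prefix: str = "NeMo-Gym-",  # default mentioned in the thread
) -> str:
    """Compose `{org}/{prefix}{domain}-{resource_server_name}-{dataset_name}`.

    Assumed layout for illustration only; an empty prefix is allowed.
    """
    name = f"{hf_dataset_prefix}{domain}-{resource_server_name}-{dataset_name}"
    return f"{hf_organization}/{name}"


# Default prefix
print(build_hf_repo_id("my-org", "math", "my_server", "train"))
# → my-org/NeMo-Gym-math-my_server-train

# Prefix overridden with an empty string
print(build_hf_repo_id("my-org", "math", "my_server", "train", hf_dataset_prefix=""))
# → my-org/math-my_server-train
```

Keeping the prefix as its own field means a dataset can sit in a collection without carrying that collection's name in its repo id.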

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

# Conflicts:
#	.pre-commit-config.yaml
#	README.md
#	nemo_gym/config_types.py
#	resources_servers/google_search/configs/google_search.yaml
#	scripts/update_resource_servers.py
Signed-off-by: Frankie Siino <fsiino@nvidia.com>

# Conflicts:
#	README.md
#	nemo_gym/config_types.py
#	scripts/update_resource_servers.py
@banghuaz-nvidia
Contributor

Hey @fsiino-nvidia , the PR should be ready to merge. Could you please resolve the current conflicts?

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Contributor

@banghuaz-nvidia banghuaz-nvidia left a comment

LGTM. Thanks!

@banghuaz-nvidia banghuaz-nvidia removed the request for review from bxyu-nvidia November 11, 2025 00:11
@banghuaz-nvidia banghuaz-nvidia merged commit 4a19206 into main Nov 11, 2025
6 checks passed
@banghuaz-nvidia banghuaz-nvidia deleted the fsiino/hf-migration branch November 11, 2025 00:17
lbliii pushed a commit that referenced this pull request Nov 12, 2025
This change adds support for Huggingface dataset management
(upload/download/delete Gitlab artifact(s))

Addresses items 2 and 3 from
#81

---------

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Kelvin0110 pushed a commit to Kelvin0110/Gym that referenced this pull request Feb 16, 2026
This change adds support for Huggingface dataset management
(upload/download/delete Gitlab artifact(s))

Addresses items 2 and 3 from
NVIDIA-NeMo#81

---------

Signed-off-by: Frankie Siino <fsiino@nvidia.com>
abubakaria56 pushed a commit to abubakaria56/Gym that referenced this pull request Mar 2, 2026
This change adds support for Huggingface dataset management
(upload/download/delete Gitlab artifact(s))

Addresses items 2 and 3 from
NVIDIA-NeMo#81

---------

Signed-off-by: Frankie Siino <fsiino@nvidia.com>

Development

Successfully merging this pull request may close these issues.

HF Dataset Format Checking & Migration, Auto summary table generation

2 participants