
SDE local tar-base-path and text_key #15663

Merged
Jorjeous merged 2 commits into SDE_NC_Afeat from SDE_NC_Afeat_key on May 11, 2026


@karpnv karpnv commented May 4, 2026

Support local tar files and custom text_key

Collection: ASR
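
The two features named above can be illustrated with a minimal sketch. It is not the code from this PR: the helper names `load_manifest` and `read_audio_from_tar`, the manifest key `transcript`, and the file names are assumptions made for illustration. It reads a JSON-lines manifest whose transcript lives under a custom key, then pulls a single audio member out of a local tar without unpacking the rest of the archive.

```python
import io
import json
import os
import tarfile
import tempfile


def load_manifest(path, text_key="text"):
    """Read a JSON-lines manifest and expose the transcript under "text"."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            entry = json.loads(line)
            # Copy the transcript from the custom key to the default one.
            entry["text"] = entry[text_key]
            entries.append(entry)
    return entries


def read_audio_from_tar(tar_path, member_name):
    """Return the raw bytes of one tar member; other members stay untouched."""
    with tarfile.open(tar_path, "r") as tar:
        extracted = tar.extractfile(tar.getmember(member_name))
        if extracted is None:
            raise ValueError(f"{member_name} is not a regular file")
        return extracted.read()


# Self-contained demo: build a tiny tar shard and manifest in a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    tar_path = os.path.join(tmp, "audio_0.tar")
    payload = b"fake-audio-bytes"  # placeholder, not real audio
    with tarfile.open(tar_path, "w") as tar:
        info = tarfile.TarInfo(name="utt1.wav")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    manifest_path = os.path.join(tmp, "manifest_0.json")
    with open(manifest_path, "w", encoding="utf-8") as f:
        row = {"audio_filepath": "utt1.wav", "duration": 1.0, "transcript": "hello"}
        f.write(json.dumps(row) + "\n")

    entries = load_manifest(manifest_path, text_key="transcript")
    audio = read_audio_from_tar(tar_path, entries[0]["audio_filepath"])
    print(entries[0]["text"], len(audio))
```

Copying the custom key onto `text` is one possible design: downstream code keeps reading a single field name regardless of how the manifest was written.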

Usage

python tools/speech_data_explorer/data_explorer.py /lustre/mydata/nemo_tar/sharded_manifests/manifest__OP_0..255_CL_.json --tar-base-path /lustre/mydata/nemo_tar/audio__OP_0..255_CL_.tar
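
The `_OP_0..255_CL_` fragments in the command above stand for a brace range such as `{0..255}`. The following sketch shows how such a pattern might expand into concrete shard paths. The function `expand_sharded_path` is hypothetical and is not the tool's actual implementation; it assumes inclusive integer ranges and does not handle zero padding.

```python
import itertools
import re
from typing import List

# Matches one sharded range, e.g. "_OP_0..255_CL_".
_RANGE = re.compile(r"_OP_(\d+)\.\.(\d+)_CL_")


def expand_sharded_path(pattern: str) -> List[str]:
    """Expand every _OP_a..b_CL_ range; several ranges give a cartesian product."""
    bounds = _RANGE.findall(pattern)
    if not bounds:
        return [pattern]
    ranges = [range(int(lo), int(hi) + 1) for lo, hi in bounds]
    # Literal text between the ranges (no capture groups, so split keeps only text).
    pieces = re.split(r"_OP_\d+\.\.\d+_CL_", pattern)
    paths = []
    for combo in itertools.product(*ranges):
        path = pieces[0]
        for value, piece in zip(combo, pieces[1:]):
            path += str(value) + piece
        paths.append(path)
    return paths


manifests = expand_sharded_path(
    "/lustre/mydata/nemo_tar/sharded_manifests/manifest__OP_0..255_CL_.json"
)
tars = expand_sharded_path("/lustre/mydata/nemo_tar/audio__OP_0..255_CL_.tar")
print(len(manifests), manifests[0], tars[-1])
```

Under these assumptions each argument expands to 256 paths, so manifest shard `i` can be paired with tar shard `i` by position.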

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

PR Type:

  • [x] New Feature
  • [ ] Bugfix
  • [ ] Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

karpnv added 2 commits May 4, 2026 11:40
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@karpnv karpnv requested a review from Jorjeous May 4, 2026 19:58
@karpnv karpnv changed the title Sde nc afeat key SDE local tar-base-path and text_key May 4, 2026
@Jorjeous Jorjeous merged commit 4151971 into SDE_NC_Afeat May 11, 2026
31 checks passed
@Jorjeous Jorjeous deleted the SDE_NC_Afeat_key branch May 11, 2026 09:50
Jorjeous added a commit that referenced this pull request May 11, 2026
* read manifest from s3

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: karpnv <karpnv@users.noreply.github.com>

* s3cfg parameter

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

* file range

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: karpnv <karpnv@users.noreply.github.com>

* Avoid downloading the full tar; instead, extract the specific audio file. Updated logging system

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>

* shard_index + 1

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

* Undo latest changes, as it was dataset specific

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

* update table to not fail on "non-string format", update bucketing and sharding with separate numeration.

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

* add ability to read two manifests in comparison mode

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>

* removing pickle; --s3cfg=AIS -- read env vars; add matching manifests by filename in case of unordered manifests

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>

* adding --force flag to load if required fields missing

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

* update documentation, guard issues, update requirements, add quick start and available options, lazy import for s3 dependencies

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>

* README: document _OP_/_CL_ sharded path syntax and cartesian-product expansion

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

* SDE local tar-base-path and text_key (#15663) feature add

* text

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

* local tar support

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

---------

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>

* isort and reformatting

Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>

---------

Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: karpnv <karpnv@users.noreply.github.com>
Signed-off-by: George Zelenfroind <gzelenfroind@nvidia.com>
Signed-off-by: Jorjeous <Jorjeous@users.noreply.github.com>
Co-authored-by: Nikolay Karpov <nkarpov@nvidia.com>
Co-authored-by: karpnv <karpnv@users.noreply.github.com>
Co-authored-by: Jorjeous <Jorjeous@users.noreply.github.com>
Co-authored-by: Jason <jasoli@nvidia.com>
Co-authored-by: Nikolay Karpov <karpnv@gmail.com>