Add ASR fine-tuning skill #15733

Merged
pzelasko merged 8 commits into main from codex/nemo-speech-asr-finetune-skill
May 28, 2026
Conversation

Collaborator

@pzelasko pzelasko commented May 27, 2026

Summary

  • Add a repo-local nemo-speech-asr-finetune skill for NeMo Speech ASR fine-tuning workflows
  • Split detailed guidance into staged references for setup/checkpoints, data/Lhotse, architecture/tokenizer/metrics, and training/evaluation
  • Include Lhotse-first dataloader guidance, OOMptimizer workflow, AED/Canary multitask metrics, checkpoint averaging, and evaluation recommendations

Validation

I tested this skill by asking Codex to fine-tune Parakeet v3 on a Polish HF dataset (bigos-v2) and evaluate the improvement on the test set. It autonomously created an experiment config, set up bucketing and OOMptimizer, and reduced the WER from 18.49% to 17.71%.
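
For context, the absolute and relative WER reductions implied by those two figures can be worked out directly. This is plain arithmetic on the reported numbers, not part of the skill; the variable names are illustrative.

```python
# WER figures reported in the validation run above, in percent.
baseline_wer = 18.49
finetuned_wer = 17.71

# Absolute reduction, in percentage points.
absolute_reduction = baseline_wer - finetuned_wer

# Relative reduction, as a fraction of the baseline WER.
relative_reduction = absolute_reduction / baseline_wer

print(f"absolute: {absolute_reduction:.2f} points")  # absolute: 0.78 points
print(f"relative: {relative_reduction:.1%}")         # relative: 4.2%
```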


copy-pr-bot Bot commented May 27, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@pzelasko pzelasko changed the title from "Add ASR fine-tuning skill draft" to "Add ASR fine-tuning skill" May 27, 2026
- CTC: `examples/asr/asr_ctc/speech_to_text_ctc_bpe.py`
- RNNT: `examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py`
- Hybrid RNNT/CTC or TDT/CTC: `examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py`
- AED/Canary: `examples/asr/speech_multitask/speech_to_text_aed.py`
Member

Should we point them to the speechlm2 scripts instead, or add a note about it?

Collaborator Author

No, speechlm2 should have its own skill later. I want to start with ASR because it is very stable already.
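
The architecture-to-script mapping discussed in this thread can be sketched as a small lookup table. The script paths are the ones quoted from the skill text; the dictionary keys and the `pick_training_script` helper are hypothetical names introduced only for illustration.

```python
# Architecture-specific training scripts, as listed in the skill text.
# The short keys ("ctc", "rnnt", ...) are illustrative, not NeMo identifiers.
TRAINING_SCRIPTS = {
    "ctc": "examples/asr/asr_ctc/speech_to_text_ctc_bpe.py",
    "rnnt": "examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py",
    "hybrid": "examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py",
    "aed": "examples/asr/speech_multitask/speech_to_text_aed.py",
}


def pick_training_script(architecture: str) -> str:
    """Hypothetical helper: return the script path for an architecture key."""
    key = architecture.strip().lower()
    if key not in TRAINING_SCRIPTS:
        known = ", ".join(sorted(TRAINING_SCRIPTS))
        raise ValueError(f"unknown architecture {architecture!r}; expected one of: {known}")
    return TRAINING_SCRIPTS[key]


print(pick_training_script("CTC"))
```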

Before launching a long fine-tune, spend a few minutes on cheap failure checks:

- Confirm the intended NeMo checkout is imported from inside the container.
- Confirm each training/validation manifest exists, has non-empty `text`, valid `audio_filepath`, and usable
Member

Should we provide a manifest_row example here showing what to expect?

Collaborator Author

It's described in data-lhotse.md already.
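
A cheap pre-flight check along the lines of the manifest bullet under discussion could look roughly like the sketch below. It assumes a JSONL manifest whose rows carry `audio_filepath`, `text`, and `duration` keys; the `check_manifest` function, its `check_audio_exists` flag, and the positive-duration rule are assumptions made for illustration, not part of the skill or of NeMo.

```python
import json
import os


def check_manifest(path, check_audio_exists=True):
    """Return a list of (line_number, problem) tuples for a JSONL manifest.

    Opening a missing manifest raises FileNotFoundError, which covers the
    "manifest exists" check.
    """
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append((lineno, f"invalid JSON: {err.msg}"))
                continue
            if not isinstance(row, dict):
                problems.append((lineno, "row is not a JSON object"))
                continue

            text = row.get("text")
            if not isinstance(text, str) or not text.strip():
                problems.append((lineno, "empty or missing text"))

            audio = row.get("audio_filepath")
            if not isinstance(audio, str) or not audio:
                problems.append((lineno, "missing audio_filepath"))
            elif check_audio_exists and not os.path.isfile(audio):
                problems.append((lineno, "audio file not found"))

            # Assumption: a usable row has a positive numeric duration.
            duration = row.get("duration")
            is_number = isinstance(duration, (int, float)) and not isinstance(duration, bool)
            if not is_number or duration <= 0:
                problems.append((lineno, "missing or non-positive duration"))
    return problems
```

Calling `check_manifest("train.jsonl")` (a hypothetical path) and failing fast on a non-empty result catches most data problems before any GPU time is spent.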

Standard ASR JSONL:

```json
{"audio_filepath": "/data/audio/sample.wav", "text": "transcript text", "duration": 3.42}
```
Member

answer key?

Collaborator Author

It's mentioned below as Canary's special manifest format.

```python
from nemo.collections.asr.models import ASRModel

cfg = ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2", return_config=True)
```
Member

nit: let's change this to v3, since that is what we would suggest in most cases.

Use `examples/asr/speech_to_text_finetune.py` for compatible-architecture fine-tuning. For architecture-specific
recipes:

- CTC: `examples/asr/asr_ctc/speech_to_text_ctc_bpe.py`
Member

Might it also be better to include example configs here?

```
use_cer=False
```
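
To illustrate what a `use_cer` switch typically controls, here is a minimal pure-Python sketch of word versus character error rate. It assumes `use_cer=False` means word-level scoring and `use_cer=True` means character-level scoring; the `edit_distance` and `error_rate` functions are illustrative and do not reproduce NeMo's metric implementation or its text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, use_cer=False):
    """Corpus-level error rate: total edits divided by total reference tokens."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        # Characters when use_cer is True, whitespace-separated words otherwise.
        ref_tokens = list(ref) if use_cer else ref.split()
        hyp_tokens = list(hyp) if use_cer else hyp.split()
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("references contain no tokens")
    return errors / total


refs = ["the cat sat"]
hyps = ["the cat sit"]
print(error_rate(refs, hyps, use_cer=False))  # 1 wrong word out of 3
print(error_rate(refs, hyps, use_cer=True))   # 1 wrong character out of 11
```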

Use `examples/asr/transcribe_speech.py` for direct transcription and streaming or chunked inference scripts for
Member

Suggested change:
- Use `examples/asr/transcribe_speech.py` for direct transcription and streaming or chunked inference scripts for
+ Use `examples/asr/transcribe_speech.py` for direct offline transcription and streaming or chunked inference scripts for

@nithinraok
Member

/claude review

Contributor

@claude claude Bot left a comment

LGTM

Member

@nithinraok nithinraok left a comment

Great work @pzelasko. LGTM!

@pzelasko pzelasko merged commit 5ccc6c8 into main May 28, 2026
44 checks passed
@pzelasko pzelasko deleted the codex/nemo-speech-asr-finetune-skill branch May 28, 2026 23:28
2 participants