Add ASR fine-tuning skill #15733
| - CTC: `examples/asr/asr_ctc/speech_to_text_ctc_bpe.py` | ||
| - RNNT: `examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py` | ||
| - Hybrid RNNT/CTC or TDT/CTC: `examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe.py` | ||
| - AED/Canary: `examples/asr/speech_multitask/speech_to_text_aed.py` |
Should we point them to the speechlm2 scripts instead, or add a note about it?
No, speechlm2 should have its own skill later. I want to start with ASR because it is very stable already.
| Before launching a long fine-tune, spend a few minutes on cheap failure checks: | ||
|
| - Confirm the intended NeMo checkout is imported from inside the container. | ||
| - Confirm each training/validation manifest exists, has non-empty `text`, valid `audio_filepath`, and usable |
Should we provide a `manifest_row` example here to show what to expect?
It's described in data-lhotse.md already
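For reference, the cheap pre-flight checks from the list above could be sketched roughly as follows. This is a minimal illustration and not part of the skill itself: the helper names `module_origin` and `check_manifest` are hypothetical, and the check on `duration` is an assumption based on the standard JSONL row, since the bullet above is cut off in the diff context.

```python
import importlib.util
import json
import os


def module_origin(name):
    """Return the file a top-level module would be imported from, or None.

    Hypothetical helper: run it inside the container with name="nemo" to see
    which checkout Python would actually pick up.
    """
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


def check_manifest(path, check_audio=True):
    """Collect (line_number, message) problems for a JSONL ASR manifest."""
    if not os.path.isfile(path):
        return [(0, f"manifest not found: {path}")]
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                row = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append((lineno, f"invalid JSON: {err}"))
                continue
            if not isinstance(row, dict):
                problems.append((lineno, "row is not a JSON object"))
                continue
            text = row.get("text")
            if not isinstance(text, str) or not text.strip():
                problems.append((lineno, "empty or missing text"))
            audio = row.get("audio_filepath")
            if not isinstance(audio, str) or not audio:
                problems.append((lineno, "missing audio_filepath"))
            elif check_audio and not os.path.isfile(audio):
                problems.append((lineno, "audio file not found"))
            # Assumption: every row should carry a positive numeric duration.
            duration = row.get("duration")
            if (
                isinstance(duration, bool)
                or not isinstance(duration, (int, float))
                or duration <= 0
            ):
                problems.append((lineno, "missing or non-positive duration"))
    return problems
```

An empty list from `check_manifest` means none of these basic problems were found; it does not confirm that the audio itself decodes.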
| Standard ASR JSONL: | ||
|
| ```json | ||
| {"audio_filepath": "/data/audio/sample.wav", "text": "transcript text", "duration": 3.42} |
It's mentioned below as Canary's special manifest format
| ```python | ||
| from nemo.collections.asr.models import ASRModel | ||
|
| cfg = ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2", return_config=True) |
nit: let's change this to v3, since that is what we'd suggest in most cases.
| Use `examples/asr/speech_to_text_finetune.py` for compatible-architecture fine-tuning. For architecture-specific | ||
| recipes: | ||
|
| - CTC: `examples/asr/asr_ctc/speech_to_text_ctc_bpe.py` |
It might also be better to include example configs here?
| use_cer=False | ||
| ``` | ||
|
| Use `examples/asr/transcribe_speech.py` for direct transcription and streaming or chunked inference scripts for |
| Use `examples/asr/transcribe_speech.py` for direct transcription and streaming or chunked inference scripts for | |
| Use `examples/asr/transcribe_speech.py` for direct offline transcription and streaming or chunked inference scripts for |
/claude review
nithinraok left a comment
Great work @pzelasko. LGTM!
Summary
`nemo-speech-asr-finetune` skill for NeMo Speech ASR fine-tuning workflows
Validation
I tested this skill by asking Codex to fine-tune Parakeet v3 on a Polish HF dataset (bigos-v2) and evaluate the improvement on the test set. It autonomously created the experiment config, set up bucketing and OOMptimizer, and reduced the WER from 18.49% to 17.71%.
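To put the reported numbers in perspective, here is a small sketch of the arithmetic behind the improvement. The function name `wer_change` is hypothetical; only the two WER values come from the run described above.

```python
def wer_change(before, after):
    """Return (absolute points, relative percent) reduction between two WERs given in percent."""
    absolute = before - after
    relative = absolute / before * 100.0
    return absolute, relative


abs_pts, rel_pct = wer_change(18.49, 17.71)
print(f"{abs_pts:.2f} points absolute, {rel_pct:.2f}% relative")
# prints: 0.78 points absolute, 4.22% relative
```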