Add SpeechLLM docs #9780
Conversation
Great work! Thank you so much!
Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com>
Requires images to be moved to a GH release; the rest are minor comments.
"answer": "the transcription of the audio", # optional for inference, default to "na" in dataloader | ||
} | ||
|
||
|
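For orientation, here is a hedged sketch of what a full manifest entry might look like as a JSONL line. Only the `context` and `answer` fields are taken from the snippet above; the remaining field names (`audio_filepath`, `duration`) and the file path are assumptions based on common NeMo manifest conventions and should be checked against the SpeechLLM dataloader.

```python
import json

# A minimal sketch of one manifest line (JSONL: one JSON object per line).
# Field names other than "context" and "answer" are assumptions.
entry = {
    "audio_filepath": "/data/audio/sample_0001.wav",  # assumed field name
    "duration": 4.73,                                  # assumed field name
    "context": "what does the audio mean?",            # optional, see below
    "answer": "the transcription of the audio",        # optional for inference
}

with open("train_manifest.json", "a") as f:
    f.write(json.dumps(entry) + "\n")
```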
We support more variations of "what does the audio mean" now, right?
The `context` field in the manifest is optional. You can instead put a list of contexts in a context file (one context per line) and set `++model.data.train_ds.context_file=<path to context file>` to have the dataloader randomly pick a context from the file for each audio sample. This is useful for training with multiple prompts for the same task. If neither the `context` field nor `context_file` is provided, the dataloader will use the default context `what does the audio mean?` for all audio samples. During inference, it is recommended to have the `context` field in the manifest.
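As a concrete illustration of the context-file option, a minimal sketch follows; the file name and the example contexts are made up, while the override key is the one quoted above.

```python
# Write a context file: one prompt/context per line.
contexts = [
    "what does the audio mean?",
    "transcribe the audio",
    "write down what is said in the recording",
]
with open("asr_contexts.txt", "w") as f:
    f.write("\n".join(contexts) + "\n")

# Then point the dataloader at it via the Hydra override from the docs, e.g.:
#   ++model.data.train_ds.context_file=asr_contexts.txt
# The dataloader randomly picks one line per audio sample during training.
```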
Customizing the fields to use
Note that the use of `prompt_template` here conflicts with the Canary model's (and speechlm's) PromptFormatter class, which also uses a `model.cfg.prompt_format` called Canary. Just a note.
------------------------------
In order to use a context file, you can set `++model.data.train_ds.context_file=<path to context file>` on the command line, or use multiple context files with `++model.data.train_ds.context_file=[<path to context file1>,<path to context file2>,...]`. If the number of context files equals the number of provided datasets, the dataloader will assign each context file to a dataset. Otherwise, the dataloader will randomly pick a context file from all provided context files for each audio sample. Using multiple context files is useful for training with multiple tasks, where each task has its own set of prompts. Meanwhile, you can control the weights for different tasks/datasets by using concatenated tarred datasets, where you can assign weights to datasets by:
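A hedged sketch of the multi-task setup described above. The list form of `context_file` follows the paragraph; the task names, file names, and example prompts are made up, and the weighting key in the final comment is an assumption (it mirrors the `concat_sampling_probabilities` option used elsewhere in NeMo) that should be checked against the actual SpeechLLM dataset config.

```python
# Two task-specific context pools: one for ASR, one for AST (speech translation).
asr_contexts = ["transcribe the audio", "what does the audio say?"]
ast_contexts = ["translate the audio to German", "give the German translation of the audio"]

for path, lines in [("asr_contexts.txt", asr_contexts), ("ast_contexts.txt", ast_contexts)]:
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

# With two datasets (an ASR set and an AST set), pass one context file per dataset:
#   ++model.data.train_ds.context_file=[asr_contexts.txt,ast_contexts.txt]
# Since the number of context files equals the number of datasets, each dataset
# samples contexts only from its own pool.
#
# Dataset weights for concatenated tarred datasets: the key below is an assumption;
# check the SpeechLLM config for the exact name.
#   ++model.data.train_ds.concat_sampling_probabilities=[0.7,0.3]
```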
What if the task and the context are wildly different during sampling? E.g., for ASR and AST?
Each dataset can have its own list of context files, so that ASR and AST can sample from their own pools separately.
Cool, is this mentioned somewhere else?
Don't add images to git. Upload the file to the latest release, and put the URL in the RST.
Same here
Signed-off-by: stevehuang52 <heh@nvidia.com>
…to add_speechlm_docs
* add docs
* add lhotse specific info
* move images to github release 1.23
* clean up

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com>
Co-authored-by: zhehuaichen <dian.chenzhehuai@gmail.com>
What does this PR do?
Add docs to SpeechLLM
Collection: [multimodal]