Add SpeechLLM docs #9780
Conversation
Great work! Thank you so much!
Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com>
Requires images to be moved to a GH release; the rest are minor comments.
"answer": "the transcription of the audio", # optional for inference, default to "na" in dataloader | ||
} | ||
|
||
|
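For orientation, here is a hedged sketch of what a full manifest entry might look like as a JSONL line. Only the `context` and `answer` fields are taken from the snippet above; the remaining field names (`audio_filepath`, `duration`) and the file path are assumptions based on common NeMo manifest conventions and should be checked against the SpeechLLM dataloader.

```python
import json

# A minimal sketch of one manifest line (JSONL: one JSON object per line).
# Field names other than "context" and "answer" are assumptions.
entry = {
    "audio_filepath": "/data/audio/sample_0001.wav",  # assumed field name
    "duration": 4.73,                                  # assumed field name
    "context": "what does the audio mean?",            # optional, see below
    "answer": "the transcription of the audio",        # optional for inference
}

with open("train_manifest.json", "a") as f:
    f.write(json.dumps(entry) + "\n")
```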
We support more variations of "what does the audio mean" now, right?
The `context` field in the manifest is optional. You can instead put a list of contexts in a context file (one context per line) and set `++model.data.train_ds.context_file=<path to context file>` to have the dataloader randomly pick a context from the file for each audio sample. This is useful for training with multiple prompts for the same task. If neither the `context` field nor `context_file` is provided, the dataloader will use the default context `what does the audio mean?` for all audio samples. During inference, it is recommended to have the `context` field in the manifest.
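As a concrete illustration of the context-file option, a minimal sketch follows; the file name and the example contexts are made up, while the override key is the one quoted above.

```python
# Write a context file: one prompt/context per line.
contexts = [
    "what does the audio mean?",
    "transcribe the audio",
    "write down what is said in the recording",
]
with open("asr_contexts.txt", "w") as f:
    f.write("\n".join(contexts) + "\n")

# Then point the dataloader at it via the Hydra override from the docs, e.g.:
#   ++model.data.train_ds.context_file=asr_contexts.txt
# The dataloader randomly picks one line per audio sample during training.
```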
Customizing the fields to use
Note that the use of `prompt_template` here conflicts with the Canary model's (and speechlm's) PromptFormatter class, which also uses a `model.cfg.prompt_format` called Canary. Just a note.
------------------------------
In order to use a context file, you can set `++model.data.train_ds.context_file=<path to context file>` on the command line, or use multiple context files with `++model.data.train_ds.context_file=[<path to context file1>,<path to context file2>,...]`. If the number of context files equals the number of provided datasets, the dataloader will assign each context file to a dataset. Otherwise, the dataloader will randomly pick a context file from all provided context files for each audio sample. Using multiple context files is useful for training with multiple tasks, where each task has its own set of prompts. Meanwhile, you can control the weights for different tasks/datasets by using concatenated tarred datasets, where you can assign weights to datasets by:
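A hedged sketch of the multi-task setup described above. The list form of `context_file` follows the paragraph; the task names, file names, and example prompts are made up, and the weighting key in the final comment is an assumption (it mirrors the `concat_sampling_probabilities` option used elsewhere in NeMo) that should be checked against the actual SpeechLLM dataset config.

```python
# Two task-specific context pools: one for ASR, one for AST (speech translation).
asr_contexts = ["transcribe the audio", "what does the audio say?"]
ast_contexts = ["translate the audio to German", "give the German translation of the audio"]

for path, lines in [("asr_contexts.txt", asr_contexts), ("ast_contexts.txt", ast_contexts)]:
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

# With two datasets (an ASR set and an AST set), pass one context file per dataset:
#   ++model.data.train_ds.context_file=[asr_contexts.txt,ast_contexts.txt]
# Since the number of context files equals the number of datasets, each dataset
# samples contexts only from its own pool.
#
# Dataset weights for concatenated tarred datasets: the key below is an assumption;
# check the SpeechLLM config for the exact name.
#   ++model.data.train_ds.concat_sampling_probabilities=[0.7,0.3]
```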
What if the task and the context are wildly different during sampling? E.g., for ASR and AST?
Each dataset can have its own list of context files, so that ASR and AST can sample from their own pools separately.
Cool, is this mentioned somewhere else?
Don't add images to git. Upload the file to the latest release, and put the URL in the RST.
Same here
Signed-off-by: stevehuang52 <heh@nvidia.com>
…to add_speechlm_docs
* add docs
* add lhotse specific info
* move images to github release 1.23
* clean up

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: zhehuaichen <dian.chenzhehuai@gmail.com>
Co-authored-by: zhehuaichen <dian.chenzhehuai@gmail.com>
What does this PR do?
Add docs to SpeechLLM
Collection: [multimodal]