Streaming mode: text stream with gaps #212
Unanswered
matevosashot asked this question in Q&A
Replies: 0 comments
First of all, thank you for releasing this amazing TTS model.
For the last few days, I have been trying to run the model in full streaming mode. True streaming is possible with VoiceDesign (CustomVoice model), where the first audio chunk is produced right after the first text token is processed.
However, the text has to be fed to the model continuously. Usually, there are about 10 times fewer text tokens than corresponding audio chunks. Therefore, when the text tokens run out, `tts_pad` tokens are fed to the model in order to generate the remaining audio. For example:
The first part is the initial prompt to the model, where one specifies the language and speaker details. Then comes the text in a stream, and the audio is generated autoregressively. Above, I used the following simple notation:
The only limitation is that the text should arrive without interruption. For example, I am able to generate audio in the following scenario:
where `g = gap` denotes the missing text in the corresponding frame. I have tried to use various tokens in place of `g`.
Do you have any ideas on how to generate the audio from a non-continuous stream of text?
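To make the setup concrete, here is a toy Python sketch of the per-frame text input described above. It does not call the model and is not its API: `frame_inputs`, the string stand-ins `"<tts_pad>"` and `"<g>"`, and the token names `t1`, `t2`, `t3` are all hypothetical and only illustrate the notation. Each frame receives one text slot: a real text token if one has arrived, a gap placeholder if the text stream stalled, and padding once the text is finished.

```python
from typing import List, Optional

TTS_PAD = "<tts_pad>"  # hypothetical stand-in for the tts_pad token
GAP = "<g>"            # hypothetical placeholder for a frame with no text yet


def frame_inputs(
    arrivals: List[Optional[str]],
    n_frames: int,
    gap_token: str = GAP,
) -> List[str]:
    """Build the text-side input for each audio frame.

    arrivals[i] is the text token available at frame i, or None if the
    text stream stalled at that frame. After the last arrival, the
    remaining frames are filled with padding so audio can keep going.
    """
    if n_frames < len(arrivals):
        raise ValueError("n_frames must cover every arrival slot")
    # Replace stalled frames with the chosen gap placeholder.
    inputs = [tok if tok is not None else gap_token for tok in arrivals]
    # Text is finished: pad the rest of the frames.
    inputs += [TTS_PAD] * (n_frames - len(arrivals))
    return inputs


# Continuous text followed by padding.
continuous = frame_inputs(["t1", "t2", "t3"], n_frames=6)

# Text arriving with interruptions, marked by the gap placeholder.
gappy = frame_inputs(["t1", None, "t2", None, "t3"], n_frames=8)
```

The open question is what `gap_token` should be so that the model keeps producing sensible audio across the stalled frames.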