Summary
I would like to request that generate_voice_clone() support an instruct parameter similar to the one in generate_custom_voice().
Currently, generate_custom_voice() allows instruction-based control over emotion, speaking style, and other expressive attributes. However, generate_voice_clone() is limited to reproducing the tone and style from the reference audio only, without allowing further expressive control.
Motivation
In practical voice cloning use cases, the reference audio is primarily used to capture speaker identity (timbre, voice characteristics), not to rigidly lock the emotional state or speaking style.
It is often desirable to:
Clone the speaker’s voice identity from reference audio
Then instruct the model to alter emotion, intensity, speaking manner, or delivery style dynamically
For example:
Same cloned voice, but different emotions (calm, angry, cheerful, whispering)
Same speaker identity, but different contexts (narration, dialogue, warning, casual speech)
Without an instruct parameter, users must record a separate reference audio for each emotion or style, which is inefficient and limiting.
Proposed API Design
Add an optional instruct parameter to generate_voice_clone(), aligned with generate_custom_voice().
Example (conceptual):
generate_voice_clone(
    text="I will protect this village.",
    reference_audio="speaker_ref.wav",
    instruct="angry, determined, low pitch, strong emphasis",
)
Expected Behavior
If instruct is omitted, behavior remains backward-compatible and unchanged.
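To make the backward-compatibility point concrete, here is a minimal sketch of how an optional instruct argument could be handled. It is not the library's actual implementation: build_voice_clone_request is a hypothetical helper invented for illustration, and the parameter names simply mirror the conceptual example above.

```python
from typing import Optional


def build_voice_clone_request(
    text: str,
    reference_audio: str,
    instruct: Optional[str] = None,
) -> dict:
    """Hypothetical helper: collect the inputs a voice-clone call would use."""
    request = {"text": text, "reference_audio": reference_audio}
    # Attach the instruction only when the caller supplies a non-empty one,
    # so omitting it produces exactly the same request as before.
    if instruct is not None and instruct.strip():
        request["instruct"] = instruct.strip()
    return request


# Existing call style: no instruct, so the request is unchanged.
legacy = build_voice_clone_request("I will protect this village.", "speaker_ref.wav")

# New call style: same reference audio, with expressive control added.
styled = build_voice_clone_request(
    "I will protect this village.",
    "speaker_ref.wav",
    instruct="angry, determined, low pitch, strong emphasis",
)
```

Defaulting instruct to None means existing callers need no changes, and an empty or whitespace-only string is treated the same as omitting the argument.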
Benefits
Aligns generate_voice_clone() with the expressive flexibility of generate_custom_voice()
Enables richer and more controllable voice acting use cases (games, narration, virtual characters)
Reduces the need for multiple reference recordings
Improves consistency and usability of the API design
Additional Notes
Conceptually, voice identity and expressive control are orthogonal dimensions. Allowing instruct in voice cloning would reflect this separation more clearly in the API and unlock more advanced creative workflows.