Feature Request: Support instruct parameter in generate_voice_clone() #100

WRSP · 2026-01-27T04:07:37Z

Summary
I would like generate_voice_clone() to accept an optional instruct parameter, like the one generate_custom_voice() already supports.

Currently, generate_custom_voice() allows instruction-based control over emotion, speaking style, and other expressive attributes. However, generate_voice_clone() is limited to reproducing the tone and style from the reference audio only, without allowing further expressive control.

Motivation
In practical voice cloning use cases, the reference audio is primarily used to capture speaker identity (timbre, voice characteristics), not to rigidly lock the emotional state or speaking style.

It is often desirable to:

- Clone the speaker’s voice identity from reference audio
- Then instruct the model to alter emotion, intensity, speaking manner, or delivery style dynamically

For example:

- Same cloned voice, but different emotions (calm, angry, cheerful, whispering)
- Same speaker identity, but different contexts (narration, dialogue, warning, casual speech)

Without an instruct parameter, users must record a separate reference clip for each emotion or style, which is inefficient and limiting.
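
To illustrate the workflow described above, here is a rough sketch of rendering one line in several styles from a single reference clip. It assumes the proposed instruct parameter exists. The generate_voice_clone() defined here is a local stand-in that only records the request, not the library's real function, and render_variants() and STYLES are hypothetical names used for illustration.

```python
from typing import Dict, List, Optional


def generate_voice_clone(
    text: str,
    reference_audio: str,
    instruct: Optional[str] = None,
) -> dict:
    """Hypothetical stand-in: returns the request instead of synthesizing audio."""
    return {"text": text, "reference_audio": reference_audio, "instruct": instruct}


# Styles taken from the examples in this request.
STYLES: List[str] = ["calm", "angry", "cheerful", "whispering"]


def render_variants(text: str, reference_audio: str, styles: List[str]) -> Dict[str, dict]:
    """Reuse one reference clip for every style; only the instruct string changes."""
    return {
        style: generate_voice_clone(text, reference_audio, instruct=style)
        for style in styles
    }


variants = render_variants("I will protect this village.", "speaker_ref.wav", STYLES)
```

With this shape, adding a new emotion would mean adding one string to the list rather than recording another reference file.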

Proposed API Design
Add an optional instruct parameter to generate_voice_clone(), aligned with generate_custom_voice().

Example (conceptual):

generate_voice_clone(
    text="I will protect this village.",
    reference_audio="speaker_ref.wav",
    instruct="angry, determined, low pitch, strong emphasis"
)

Expected Behavior

- Reference audio defines speaker identity
- instruct modifies expressive attributes (emotion, tone, delivery)
- If instruct is omitted, behavior remains backward-compatible and unchanged

Benefits

- Aligns generate_voice_clone() with the expressive flexibility of generate_custom_voice()
- Enables richer and more controllable voice acting use cases (games, narration, virtual characters)
- Reduces the need for multiple reference recordings
- Improves consistency and usability of the API design

Additional Notes
Conceptually, voice identity and expressive control are orthogonal dimensions. Allowing instruct in voice cloning would reflect this separation more clearly in the API and unlock more advanced creative workflows.
