Scenema Audio

Zero-shot expressive voice cloning and speech generation.

Visit scenema.ai/audio to hear all demos and try it out.

Every existing text-to-speech system converts words into sound, but none of them perform. Speech that merely pronounces words correctly is functionally useless for filmmaking, audiobooks, or any context where the emotional delivery carries as much meaning as the words themselves. Scenema Audio generates speech with intention, pacing, breath control, and emotional arcs that shift within a single generation, all from a text prompt that describes not just what to say but how to say it.

Built on an audio diffusion transformer extracted from LTX 2.3's 22B parameter audiovisual model, it learned how people actually sound in real scenes: angry, laughing, whispering, crying, exhausted, terrified.

Quick Start

Docker (Recommended)

git clone https://github.com/ScenemaAI/scenema-audio.git
cd scenema-audio

# Set your HuggingFace token (Gemma 3 access required)
export HF_TOKEN=your_huggingface_token

# Build and run (models are downloaded on first start)
docker compose up

Runs on any NVIDIA GPU with 16 GB+ VRAM. The default configuration uses INT8 audio transformer + NF4 Gemma quantization, with automatic model offloading on smaller cards. First startup downloads ~38 GB of model checkpoints and caches them in a Docker volume. Subsequent starts are fast.

Generate Audio

# Using the included script
python generate.py output.wav

# Or with curl
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<speak voice=\"A warm, clear male voice with a slight British accent. Measured, thoughtful pacing.\" gender=\"male\">The old lighthouse had stood on the cliff for over a century, its beam cutting through the fog like a blade of light.</speak>",
    "seed": 42
  }' \
  --output output.wav

Voice Design (Preview a Voice)

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<speak voice=\"A young woman with a smoky, low register voice. Intimate, confessional tone.\" gender=\"female\">The city never really sleeps. It just closes its eyes and pretends for a while.</speak>",
    "mode": "voice_design"
  }' \
  --output voice_preview.wav

Zero-Shot Voice Cloning

Provide 10-20 seconds of reference audio with some emotional variability. The model generates expressive speech from the prompt, then transfers the reference voice's identity onto the performance. References that contain a range of pitch and intonation produce significantly better identity transfer than flat, monotone clips.

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<speak voice=\"Gravelly male voice, fast talking, rough.\" gender=\"male\"><action>He completely loses it, shouting</action>What are you waiting for?!</speak>",
    "reference_voice_url": "https://example.com/calm-reference.wav",
    "seed": 42
  }' \
  --output cloned_angry.wav

Any voice can perform any emotion, even if that voice has never been recorded in that emotional state. The reference provides identity. The performance comes from the prompt.

Web UI (Gradio)

A built-in web interface for experimenting with voice descriptions, action tags, and all generation parameters without writing code.

# Enable the web UI
ENABLE_GRADIO=1 HF_TOKEN=your_token docker compose up

Open http://localhost:8000/ui in your browser. The UI provides four tabs:

Generate: Build prompts from individual fields (voice description, speech text, scene, action tags) with preset examples
Voice Design: Quick 15-second voice previews for iterating on voice descriptions
Voice Cloning: Upload reference audio and generate with voice identity transfer
Advanced: Write raw <speak> XML directly for full control

For remote GPU servers, forward the port via SSH:

ssh -i /path/to/key -L 8000:localhost:8000 user@your_gpu_server
# Then open http://localhost:8000/ui/ locally

Or use a public share link (no port forwarding needed). Gradio opens an outbound tunnel and gives you a *.gradio.live URL that anyone can access:

ENABLE_GRADIO=1 GRADIO_SHARE=1 HF_TOKEN=your_token docker compose up
# Look for the gradio.live URL in the logs

Prompt Format

<speak voice="VOICE_DESCRIPTION" gender="male|female"
       scene="OPTIONAL_SCENE" language="OPTIONAL_LANG_CODE"
       shot="closeup|wide|scene">
  <action>Performance direction.</action>
  Speech text here.
  <sound>Environmental audio event.</sound>
  More speech.
</speak>

Attributes

Attribute	Required	Default	Description
`voice`	Yes		Detailed voice description. Drives vocal quality, emotion, accent, age, timbre, delivery style. The more specific and theatrical, the better.
`gender`	Yes		`"male"` or `"female"`. Controls pronoun assignment in the compiled prompt sent to the diffusion model.
`scene`	No		Environmental context. Conditions the ambient audio environment around the speech (rain, office hum, crowd noise).
`language`	No	`"en"`	Language code. The model supports major world languages with native-sounding output.
`shot`	No	`"closeup"`	Controls SFX prominence. `"closeup"`: speech-focused, SFX minimal. `"wide"`: environment + speech. `"scene"`: maximum environmental audio, SFX reinforced.

Child Elements

Element	Description
Text nodes	The actual speech content. Write natural prose.
`<action>`	Performance directions that shape HOW the speech is delivered. Not spoken aloud. Stage directions for the diffusion model: emotional shifts, physical delivery, pacing cues, breath control.
`<sound>`	Environmental audio events generated alongside the speech. Thunder cracks, doors slamming, rain starting. Only effective in `wide` or `scene` shot modes.

Voice Description

The voice attribute is the primary control for the entire output. Be specific and theatrical:

<!-- Weak -->
<speak voice="A man speaking" gender="male">...</speak>

<!-- Strong -->
<speak voice="Male, mid 60s. Deep baritone with gravel. Slight Southern American inflection.
Worn but warm. Nostalgic, firelight cadence. The voice of someone who has seen too much
and chosen kindness anyway." gender="male" scene="Fireside, night, crickets">...</speak>

Action Tags

Action tags are the primary tool for controlling emotional performance. Place them between speech segments to direct delivery shifts:

<speak voice="Middle-aged man, warm but weathered." gender="male">
  <action>Calm, almost casual. Staring at his hands.</action>
  I used to think I had all the time in the world.
  <action>Voice tightens. Swallows. Fighting to stay composed.</action>
  Then one Tuesday morning, the doctor said three words that changed everything.
  <action>Long pause. Deep breath. When he speaks again, his voice is raw but steady.</action>
  And I realized... I hadn't called my son in six months.
  <action>Voice breaks on the last word. Clears throat. Forces a half-laugh.</action>
  Funny how that works, isn't it?
</speak>

Describe what the speaker is DOING and FEELING, not what the audio should sound like. Combine physical and emotional cues for richer performance.

API Reference

POST /generate

Request Body

Field	Type	Default	Description
`prompt`	string	required	`<speak>` XML string. See Prompt Format above.
`mode`	string	`"generate"`	`"generate"` for full pipeline with chunking. `"voice_design"` for a single 15-second voice sample (no chunking, useful for previewing a voice description).
`reference_voice_url`	string	`null`	URL to reference audio (WAV or MP3) for zero-shot voice cloning. 10-20 seconds of clean speech with some emotional variability is ideal. The reference provides identity; emotional performance comes from the prompt.
`background_sfx`	bool	`false`	Keep generated environmental sound effects in the output. When `false`, non-vocal audio is removed. Set to `true` when using `shot="scene"` or `shot="wide"` with `<sound>` tags.
`validate`	bool	`true`	Enable Whisper speech validation. Each generated chunk is transcribed by faster-whisper and compared against expected text. If word match ratio falls below the threshold, the chunk is regenerated with extended duration and a new seed (up to 3 retries), keeping the best result. Adds <1s per chunk on GPU. Disable for faster generation when prompt reliability is not critical.
`seed`	int	`-1`	Generation seed. `-1` for random. Fixed seeds produce deterministic output for the same prompt and configuration.
`pace`	float	`1.5`	Duration allocation multiplier. Higher values give the model more time, resulting in slower, more deliberate speech. Lower values produce faster speech. The default 1.5x accounts for LTX's naturally slower speaking pace compared to real-time speech.
`min_match_ratio`	float	`0.90`	Whisper validation threshold. Minimum word match ratio (0.0 to 1.0) between generated audio transcription and expected text. Only used when `validate` is `true`. Lower values accept more pronunciation variance. Lower threshold recommended for languages with accents.
`skip_vc`	bool	`false`	Skip voice conversion (SeedVC) post-processing entirely. When `true`, no voice identity transfer or cross-chunk voice consistency normalization is applied. Useful for single-chunk generations where the voice description alone is sufficient.
`vc_steps`	int	`25`	SeedVC diffusion steps. More steps produce higher-quality voice identity transfer at the cost of processing time. Range: 10-50.
`vc_cfg_rate`	float	`0.5`	SeedVC classifier-free guidance rate. Controls how strongly the target voice identity is applied. Higher values produce stronger identity transfer but may reduce naturalness. Range: 0.0-1.0.

Response

Returns JSON with base64-encoded WAV audio:

{
  "status": "succeeded",
  "audio": "<base64-encoded WAV>",
  "content_type": "audio/wav",
  "metadata": {
    "duration_s": 12.4,
    "sample_rate": 48000,
    "processing_ms": 8200,
    "seed": 42,
    "mode": "generate",
    "has_reference_voice": false
  }
}

On error:

{
  "status": "failed",
  "error": "Description of what went wrong"
}

Capabilities

Emotional Acting

Emotional state shifts within a single generation. Action tags function as stage directions at specific points in the script.

<speak voice="A man on the edge. Explosive rage. Italian-American inflection."
       gender="male" scene="A dimly lit office, late at night">
  <action>He stands up slowly, voice dangerously low</action>
  You come into my house, you eat my food, and then you got the nerve
  to tell me how to run my business.
  <action>Voice rising, finger pointing</action>
  I built this thing from nothing while you were sitting on your ass.
</speak>

Child Voices

<speak voice="A six-year-old girl, bright and excited, speaking fast
with breathless enthusiasm. Slight lisp on S sounds."
gender="female">
  Mommy look! There is a rainbow and it goes all the way across the whole sky!
</speak>

Scene-Aware Audio (Voice + Environment)

Set shot="scene" and background_sfx: true to generate speech with environmental audio in the same diffusion pass.

<speak voice="Male, mid 40s. Weathered. Urgent, projecting over wind."
       gender="male" scene="Open dock in a thunderstorm, heavy rain"
       shot="scene">
  <sound>Heavy rain and wind howling</sound>
  <action>He shouts over the storm</action>
  Get the lines! She is pulling loose!
  <sound>Thunder cracks overhead</sound>
  Move! I said move!
</speak>

Multilingual

The model supports major world languages with native fluency. Set the language attribute and write the voice description to match.

<speak voice="Female, mid 70s. Soft alto. Native French speaker, Parisian accent.
Warm like wool blankets. Unhurried." gender="female"
scene="Cozy bedroom, lamplight" language="fr">
  <action>Elle s'assied au bord du lit</action>
  Alors, mon petit. Tu veux que je te raconte l'histoire du renard
  qui a trompé la lune?
</speak>

Long-Form Narration

Text is automatically split at sentence boundaries using Kokoro phoneme-level duration estimation. Voice identity is maintained across chunks via A2V latent conditioning.

<speak voice="An elderly storyteller with a weathered knowing voice.
Deep baritone, slow deliberate pacing."
gender="male">
  Many years later, as he faced the firing squad, Colonel Aureliano Buendia
  was to remember that distant afternoon when his father took him to discover ice.
  At that time Macondo was a village of twenty adobe houses, built on the bank
  of a river of clear water that ran along a bed of polished stones, which were
  white and enormous, like prehistoric eggs.
</speak>

Hardware Requirements

Minimum: 16 GB VRAM (RTX 4060 Ti 16GB, RTX A4000)

INT8 audio transformer + NF4 Gemma quantization. Models are automatically offloaded between GPU and CPU RAM between pipeline stages (encode, diffuse, decode, voice convert). Requires 32 GB system RAM. Default configuration via docker compose up.

Recommended: 24 GB VRAM (RTX 4090, RTX A5000)

Same INT8 + NF4 config with all models resident on GPU simultaneously. No offloading overhead, fastest generation.

Full Precision: 48 GB VRAM (A6000 Ada, A40, L40S)

bf16 audio transformer + bf16 Gemma, all models resident on GPU. Best quality. Set environment variables:

AUDIO_CKPT=/app/models/scenema-audio-transformer.safetensors
GEMMA_QUANTIZE=

VRAM Configurations

VRAM	Audio Model	Gemma	Behavior	Notes
16 GB	INT8 (4.9 GB)	NF4 (~8 GB)	Auto-offload per stage	Default config
24 GB	INT8 (4.9 GB)	NF4 (~8 GB)	All models resident	Fastest with quantization
48 GB	bf16 (9.8 GB)	bf16 (24 GB)	All models resident	Best quality

VRAM strategy is auto-detected. The engine measures available VRAM at startup and decides whether to offload models between stages or keep everything resident.

Performance

Benchmarked on NVIDIA RTX 4090, 100-word passage (~55 seconds of audio, 4 chunks):

Configuration	Total Time	Real-Time Factor
bf16 + bf16 (CPU streaming)	83s	0.66x
INT8 + bf16 (CPU streaming)	66s	0.83x
INT8 + NF4 (all GPU)	35s	1.57x
INT8 + NF4 + SageAttention 2	35s	1.57x

Pipeline Architecture

XML prompt (voice description + scene + stage directions + text)
  |
  v
[Text Splitting] -----------> Sentence boundaries via Kokoro, ~15s max per segment
  |
  v
[Gemma 3 12B Encode] -------> Text conditioning (per segment)
  |
  v
[8-Step Diffusion] ---------> Audio latent generation
  |                            Voice continuity via A2V latent conditioning between segments
  v
[Audio Decode] --------------> Waveform
  |
  v
[MelBandRoFormer] ----------> Vocal separation (strips SFX unless background_sfx=true)
  |
  v
[SeedVC] -------------------> Voice identity transfer (when reference_voice_url provided
  |                            or multi-chunk for cross-chunk consistency)
  v
Output WAV (48kHz stereo)

Key Design Decisions

Kokoro for duration estimation. Kokoro TTS (82M params, CPU) provides phoneme-level duration estimates. The chunker splits text at sentence boundaries when accumulated Kokoro estimates exceed 15 seconds (with a configurable pace multiplier for LTX's naturally slower speaking pace). No word counting.

15-second chunk cap. The model was trained on 20-second clips, but quality degrades (repetition, pronunciation failure) beyond ~15 seconds. The 15s cap ensures consistent quality.

Voice continuity across segments. The tail of each segment's audio is encoded and used as a voice reference for the next segment. This maintains consistent voice identity across arbitrarily long outputs without requiring a separate voice embedding model.

Zero-shot voice cloning. A2V latent conditioning gets about 60% of the way to matching a reference voice. SeedVC post-processing brings it to full identity transfer. No training, no enrollment, no voice database.

Emotion and identity are independent controls. The voice description drives the emotional performance. The reference audio drives the voice identity. For maximum emotional range with a cloned voice, use a strong character archetype in the voice description and let the reference audio handle identity.

INT8 quantization. Per-channel INT8 reduces the transformer from 9.8 GB to 4.9 GB with no measurable quality difference, enabling generation on consumer GPUs.

Model Checkpoints

Hosted on HuggingFace: ScenemaAI/scenema-audio

File	Size	Description
`scenema-audio-transformer.safetensors`	9.8 GB	Audio diffusion transformer (bf16)
`scenema-audio-transformer-int8.safetensors`	4.9 GB	Audio diffusion transformer (INT8, identical quality)
`scenema-audio-pipeline.safetensors`	6.7 GB	Audio VAE decoder + vocoder + text projection
`scenema-audio-vae-encoder.safetensors`	42.7 MB	Audio VAE encoder for reference voice encoding

Building from Source

git clone https://github.com/ScenemaAI/scenema-audio.git
cd scenema-audio

export HF_TOKEN=your_huggingface_token
docker compose build
docker compose up

Environment Variables

Set in docker-compose.yml or pass via docker run -e:

Variable	Default	Description
`HF_TOKEN`	required	HuggingFace token with Gemma 3 access
`AUDIO_CKPT`	`/app/models/scenema-audio-transformer-int8.safetensors`	Path to audio transformer checkpoint
`PIPELINE_CKPT`	`/app/models/scenema-audio-pipeline.safetensors`	Path to pipeline checkpoint
`GEMMA_ROOT`	`/app/models/gemma-3-12b-it`	Path to Gemma 3 12B model directory
`GEMMA_QUANTIZE`	`nf4`	Gemma quantization. `nf4` for 24 GB cards, empty for bf16 on 48 GB+
`PORT`	`8000`	HTTP service port
`MODEL_DIR`	`/app/models`	Base directory for model downloads and cache
`ENABLE_GRADIO`	(empty)	Set to `1` to enable the Gradio web UI at `/ui`
`GRADIO_SHARE`	(empty)	Set to `1` to create a public share link (no port forwarding needed)

Limitations

Pronunciation: The model occasionally garbles complex multi-syllable words and proper nouns. Spelling out difficult words phonetically can help.
15-second generation window: Each audio segment is limited to ~15 seconds. Longer text is automatically split, but very long single sentences may be divided at suboptimal points.
Emotional range with voice cloning: Voice cloning optimizes for identity accuracy, which can reduce the extremes of emotional delivery. For maximum expressiveness, use a strong emotional archetype in the voice description and provide a reference clip with natural emotional variability (10-20 seconds, not monotone).
Multilingual pronunciation: When a character switches languages mid-speech, the model may apply the primary language's phonetics to the foreign words. Use separate requests per language.
Generation speed: Each 15-second segment takes 3-8 seconds depending on hardware. Audio is returned as a complete file, not streamed.
Gemma 3 12B is gated: Requires accepting Google's terms of use and a HuggingFace token with access.
Reference audio quality sensitivity: Low-quality references (compressed MP3, background noise) significantly degrade output. Use clean reference audio or rely on the voice description alone with SeedVC as a post-processing step.

Acknowledgments

LTX-2 by Lightricks for the base audiovisual model
Gemma 3 by Google for the text encoder
SeedVC by Plachta for voice refinement
Kokoro by hexgrad for duration estimation
SageAttention for attention acceleration

License

Model weights: LTX-2 Community License Agreement. The audio diffusion transformer is derived from LTX 2.3's audiovisual model, and its weights are subject to the same license terms.

Code: MIT License. The inference server, chunking pipeline, and all supporting code are MIT licensed.

Gemma 3 12B (text encoder) is a gated model requiring acceptance of Google's terms of use.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.idea		.idea
docs		docs
src		src
tests		tests
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
app.py		app.py
docker-compose.yml		docker-compose.yml
generate.py		generate.py

Folders and files

Latest commit

History

Repository files navigation

Scenema Audio

Quick Start

Docker (Recommended)

Generate Audio

Voice Design (Preview a Voice)

Zero-Shot Voice Cloning

Web UI (Gradio)

Prompt Format

Attributes

Child Elements

Voice Description

Action Tags

API Reference

POST /generate

Request Body

Response

Capabilities

Emotional Acting

Child Voices

Scene-Aware Audio (Voice + Environment)

Multilingual

Long-Form Narration

Hardware Requirements

Minimum: 16 GB VRAM (RTX 4060 Ti 16GB, RTX A4000)

Recommended: 24 GB VRAM (RTX 4090, RTX A5000)

Full Precision: 48 GB VRAM (A6000 Ada, A40, L40S)

VRAM Configurations

Performance

Pipeline Architecture

Key Design Decisions

Model Checkpoints

Building from Source

Environment Variables

Limitations

Acknowledgments

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages