
Some optimizations for memory and speed #11

Closed

ghost-agent-250 wants to merge 1 commit into HumeAI:main from ghost-agent-250:optimize-memory-and-speed

Conversation


@ghost-agent-250 ghost-agent-250 commented Mar 15, 2026

The example provided in the README is slow and consumes a lot of memory. I have removed the unnecessary loading of the encoder in the main module (tada.py).

For the best speed and memory utilization, we may want to update the example with:

  • Using torch.compile (needs warmup)
  • num_flow_matching_steps = 10
  • Prompt preprocessing and caching
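The prompt-preprocessing-and-caching idea can be sketched in isolation. This is a toy standalone version of the pattern the script below uses, with a hypothetical `ToyPromptOutput` dataclass standing in for the real `EncoderOutput`; all names here are illustrative:

```python
import os
import tempfile
from dataclasses import dataclass

import torch


@dataclass
class ToyPromptOutput:
    # Stand-in for the real EncoderOutput: a couple of tensor fields.
    tokens: torch.Tensor
    embeddings: torch.Tensor


def load_or_build_prompt(cache_path: str) -> ToyPromptOutput:
    """Reload the encoded prompt if cached; otherwise 'encode' once and cache it."""
    if os.path.exists(cache_path):
        # weights_only=False because the cache holds a plain dict produced by
        # vars(); only ever load cache files you created yourself.
        state = torch.load(cache_path, map_location="cpu", weights_only=False)
        return ToyPromptOutput(**state)
    # Pretend this is the expensive encoder forward pass.
    prompt = ToyPromptOutput(
        tokens=torch.arange(8),
        embeddings=torch.randn(8, 4),
    )
    torch.save(vars(prompt), cache_path)
    return prompt


cache = os.path.join(tempfile.mkdtemp(), "prompt_cache.pt")
first = load_or_build_prompt(cache)   # cold run: builds and writes the cache
second = load_or_build_prompt(cache)  # warm run: hits the cache
assert torch.equal(first.embeddings, second.embeddings)
```

In the real script this saves both the encoder load time and the VRAM the encoder would otherwise occupy on every warm run.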

With this script, I was able to run the 3B model on an RTX 5090 at 0.096 RTF and ~9 GB VRAM:

import os
import time
import torch
import torchaudio

from tada.modules.encoder import Encoder, EncoderOutput
from tada.modules.tada import InferenceOptions, TadaForCausalLM

device = "cuda"
PROMPT_CACHE = "prompt_cache.pt"
SAMPLE_RATE = 24000

audio_path = "samples/ljspeech.wav"
prompt_text = "The examination and testimony of the experts, enabled the commission to conclude that five shots may have been fired."

# Reuse the cached encoder output so the encoder never has to be loaded on warm runs.
if os.path.exists(PROMPT_CACHE):
    state = torch.load(PROMPT_CACHE, map_location=device, weights_only=False)
    prompt = EncoderOutput(**state)
else:
    encoder = Encoder.from_pretrained("HumeAI/tada-codec", subfolder="encoder").to(device)
    audio, sample_rate = torchaudio.load(audio_path)
    audio = audio.to(device)
    prompt = encoder(audio, text=[prompt_text], sample_rate=sample_rate)
    torch.save(vars(prompt), PROMPT_CACHE)

# Cast the prompt's floating-point tensors to bfloat16 to match the model dtype.
for field in vars(prompt):
    v = getattr(prompt, field)
    if isinstance(v, torch.Tensor) and v.is_floating_point():
        setattr(prompt, field, v.to(torch.bfloat16))

model = TadaForCausalLM.from_pretrained(
    "HumeAI/tada-3b-ml",
    torch_dtype=torch.bfloat16,
).to(device)
model._decoder.to(torch.bfloat16)
model.compile()  # compiled graphs need the warmup passes below

torch.cuda.empty_cache()

torch.cuda.reset_peak_memory_stats()
torch.cuda.synchronize()

with torch.inference_mode():
    # Warmup (must share inference_mode context with timed run)
    for i in range(2):
        model.generate(
            prompt=prompt,
            text="Please call Stella. Ask her to bring these things with her from the store.",
            inference_options=InferenceOptions(
                # Use the same settings as the timed run so torch.compile
                # does not retrigger compilation during measurement.
                num_flow_matching_steps=10
            ),
        )

    torch.cuda.synchronize()
    t0 = time.perf_counter()
    output = model.generate(
        prompt=prompt,
        text="Please call Stella. Ask her to bring these things with her from the store.",
        inference_options=InferenceOptions(
            num_flow_matching_steps=10
        ),
    )
torch.cuda.synchronize()
elapsed = time.perf_counter() - t0

out_samples = output.audio[0].shape[-1]
out_duration_sec = out_samples / SAMPLE_RATE
rtf = elapsed / out_duration_sec
peak_gb = torch.cuda.max_memory_allocated() / 2**30
print(f"Peak GPU memory: {peak_gb:.2f} GB")
print(f"RTF: {rtf:.3f} ({elapsed:.2f}s gen / {out_duration_sec:.2f}s audio)")

torchaudio.save("output.wav", output.audio[0].cpu().float().unsqueeze(0), SAMPLE_RATE)
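As a sanity check on the metric the script reports: RTF is wall-clock generation time divided by the duration of the generated audio, so values below 1 mean faster than real time. The numbers below are illustrative, chosen to match the reported 0.096 RTF:

```python
# Real-time factor (RTF): generation time divided by audio duration.
elapsed_s = 0.96          # hypothetical wall-clock generation time
audio_duration_s = 10.0   # hypothetical duration of the generated audio
rtf = elapsed_s / audio_duration_s
print(f"RTF = {rtf:.3f}")  # prints RTF = 0.096, i.e. ~10x faster than real time
```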

@srao25 srao25 closed this Mar 16, 2026
Collaborator

srao25 commented Mar 16, 2026

Fixed these in the latest commits.
