AceStep with "gen codes" not working #2072

@inspire22

Describe the Issue
I think "gen codes" is necessary for ACE-Step to do a good job of following the caption prompt, not skipping lines, etc. But even with it turned on, I get no logs of the codes generation in the console.

I also get [Pipeline] WARNING: turbo model, forcing guidance_scale=1.0 (was 4.5). Could that be contributing to why it's not following the prompt very well?

For example, I often get 20+ seconds of instrumental intro despite telling it to start the vocals immediately, and some lyric lines are skipped.

Additional Information:

My startup batch file:
@echo off
:: These MUST match the files you just downloaded exactly
set LM=acestep-5Hz-lm-4B-Q8_0.gguf
set EMBED=Qwen3-Embedding-0.6B-Q8_0.gguf
set DIT=acestep-v15-turbo-Q8_0.gguf
set VAE=vae-BF16.gguf

echo [OK] Launching ACE-Step 1.5 Pipeline on your 5070 Ti...

koboldcpp.exe ^
--musicllm "%LM%" ^
--musicembeddings "%EMBED%" ^
--musicdiffusion "%DIT%" ^
--musicvae "%VAE%" ^
--gpulayers 99 ^
--usecuda ^
--port 5001 ^
--skiplauncher

A log from a run started from the music interface:

Input: {"caption": "[vocals at 0.0s | no instrumental lead-in], 78 BPM, dark country rock, heavy stomp-clap percussion.", "seed": 611847, "lm_temperature": 0.9, "lm_cfg_scale": 3, "lm_top_p": 0.9, "lm_top_k": 50, "lm_rep_pen": 1.03, "inference_steps": 45, "codes_top_p": 0.99, "codes_top_k": 1000, "codes_temperature": 1, "audio_cover_strength": 0.5, "guidance_scale": 4.5, "shift": 7, "stereo": true, "use_mp3": false, "gen_codes": true, "rewrite_caption": true}
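
To rule out a client-side problem, the request can be checked on its own. The following is a minimal sketch, assuming the "Input:" line above is the exact JSON body that was sent; the variable names are illustrative only. It lists the settings that relate to the LM and codes stage next to the requested guidance scale.

```python
import json

# Request body copied verbatim from the "Input:" line above.
raw = r'''{"caption": "[vocals at 0.0s | no instrumental lead-in], 78 BPM, dark country rock, heavy stomp-clap percussion.", "seed": 611847, "lm_temperature": 0.9, "lm_cfg_scale": 3, "lm_top_p": 0.9, "lm_top_k": 50, "lm_rep_pen": 1.03, "inference_steps": 45, "codes_top_p": 0.99, "codes_top_k": 1000, "codes_temperature": 1, "audio_cover_strength": 0.5, "guidance_scale": 4.5, "shift": 7, "stereo": true, "use_mp3": false, "gen_codes": true, "rewrite_caption": true}'''

request = json.loads(raw)

# Keep only the fields that concern the LM / codes stage.
codes_settings = {
    key: value
    for key, value in request.items()
    if key == "gen_codes" or key.startswith(("lm_", "codes_"))
}

for key, value in codes_settings.items():
    print(f"{key} = {value!r}")

# The pipeline later reports that it overrides this value for the turbo model.
print("guidance_scale requested:", request["guidance_scale"])
```

The request does carry gen_codes as true, so the flag is being sent; whether the server acts on it is the open question.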

Music Gen Generating Audio...[Request] parsed json (23 fields)
[Pipeline] WARNING: turbo model, forcing guidance_scale=1.0 (was 4.5)
[Pipeline] T=3000, S=1500
[Pipeline] seed=611847, steps=45, guidance=1.0, shift=7.0, duration=120.0s
[Pipeline] caption: 128 tokens, lyrics: 161 tokens
[Encode] TextEncoder (128 tokens): 41.2 ms
[Encode] Lyric vocab lookup (161 tokens): 0.1 ms
[CondEnc] Lyric sliding mask: 161x161, window=128
[CondEnc] Timbre sliding mask: 750x750, window=128
[Encode] Packed: lyric=161 + timbre=1 + text=128 = 290 tokens
[Encode] ConditionEncoder: 107.7 ms, enc_S=290
[Context Batch0] noise seed=611847
[DiT] Starting: T=3000, S=1500, enc_S=290, steps=45, batch=1
[DiT] Batch N=1, T=3000, S=1500, enc_S=290
[DiT] Graph: 1841 nodes
[DiT] step 1/45 t=1.000
[DiT] step 2/45 t=0.997
[DiT] step 3/45 t=0.993
[DiT] step 4/45 t=0.990
[DiT] step 5/45 t=0.986
[DiT] step 6/45 t=0.982
[DiT] step 7/45 t=0.978
[DiT] step 8/45 t=0.974
[DiT] step 9/45 t=0.970
[DiT] step 10/45 t=0.966
[DiT] step 11/45 t=0.961
[DiT] step 12/45 t=0.956
[DiT] step 13/45 t=0.951
[DiT] step 14/45 t=0.945
[DiT] step 15/45 t=0.939
[DiT] step 16/45 t=0.933
[DiT] step 17/45 t=0.927
[DiT] step 18/45 t=0.920
[DiT] step 19/45 t=0.913
[DiT] step 20/45 t=0.905
[DiT] step 21/45 t=0.897
[DiT] step 22/45 t=0.889
[DiT] step 23/45 t=0.880
[DiT] step 24/45 t=0.870
[DiT] step 25/45 t=0.860
[DiT] step 26/45 t=0.848
[DiT] step 27/45 t=0.836
[DiT] step 28/45 t=0.824
[DiT] step 29/45 t=0.810
[DiT] step 30/45 t=0.794
[DiT] step 31/45 t=0.778
[DiT] step 32/45 t=0.760
[DiT] step 33/45 t=0.740
[DiT] step 34/45 t=0.718
[DiT] step 35/45 t=0.694
[DiT] step 36/45 t=0.667
[DiT] step 37/45 t=0.636
[DiT] step 38/45 t=0.602
[DiT] step 39/45 t=0.563
[DiT] step 40/45 t=0.519
[DiT] step 41/45 t=0.467
[DiT] step 42/45 t=0.406
[DiT] step 43/45 t=0.333
[DiT] step 44/45 t=0.246
[DiT] step 45/45 t=0.137
[DiT] Total generation: 3058.1 ms (3058.1 ms/sample)
[VAE] Tiled decode: 24 tiles (chunk=256, overlap=64, stride=128)
[VAE] Graph: 474 nodes, T_latent=192
[VAE] Upsample factor: 1920.00 (expected ~1920)
[VAE] Graph: 474 nodes, T_latent=256
[VAE] Graph: 474 nodes, T_latent=248
[VAE] Graph: 474 nodes, T_latent=120
[VAE] Tiled decode done: 24 tiles -> T_audio=5760000 (120.00s @ 48kHz)
[VAE] Decode: 7270.0 ms
[Save Audio] Save as Stereo WAV...
[Request Done: Music Length 120.00s]
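
The missing stage can be made explicit by listing which bracketed tags appear in the run. This is a rough sketch, assuming the console output has been captured as text (abridged here to one or two lines per stage) and assuming that codes generation would log under an LM-prefixed tag, as the startup lines [LM-Config], [LM-Load] and [LM-KV] do. That last point is a guess, not something confirmed by KoboldCpp.

```python
import re
from collections import Counter

# Abridged copy of the generation log above (a few lines per stage).
log = """\
Music Gen Generating Audio...[Request] parsed json (23 fields)
[Pipeline] WARNING: turbo model, forcing guidance_scale=1.0 (was 4.5)
[Pipeline] T=3000, S=1500
[Encode] TextEncoder (128 tokens): 41.2 ms
[CondEnc] Lyric sliding mask: 161x161, window=128
[Context Batch0] noise seed=611847
[DiT] step 1/45 t=1.000
[DiT] Total generation: 3058.1 ms (3058.1 ms/sample)
[VAE] Decode: 7270.0 ms
[Save Audio] Save as Stereo WAV...
[Request Done: Music Length 120.00s]
"""

# First bracketed tag on each line, read up to the closing "]" or a ":".
TAG = re.compile(r"\[([^\]:]+)")

tags = Counter()
for line in log.splitlines():
    match = TAG.search(line)
    if match:
        tags[match.group(1).strip()] += 1

# Assumption: an LM / codes stage would use a tag starting with "LM".
lm_tags = [tag for tag in tags if tag.upper().startswith("LM")]

print("stages seen:", ", ".join(tags))
print("LM-tagged lines:", lm_tags or "none")
```

Run against the full log, the same check finds only request, pipeline, encode, DiT, VAE and save lines, which matches the impression that the run goes straight from encoding to diffusion.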

Here's my startup sequence:

Welcome to KoboldCpp - Version 1.110
Loading Chat Completions Adapter: C:\Users\poetr\AppData\Local\Temp\_MEI19002\kcpp_adapters\AutoGuess.json
Chat Completions Adapter Loaded
System: Windows 10.0.26200 AMD64 Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
Unable to determine GPU Memory
Detected Available RAM: 38095 MB
Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(admin=False, admindir='', adminpassword=None, adminunloadtimeout=0, analyze='', autofit=False, autofitpadding=1024, batchsize=512, benchmark=None, blasthreads=0, chatcompletionsadapter='AutoGuess', cli=False, config=None, contextsize=8192, debugmode=0, defaultgenamt=1024, device='', downloaddir='', draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel='', embeddingsgpu=False, embeddingsmaxctx=0, embeddingsmodel='', enableguidance=False, exportconfig='', exporttemplate='', failsafe=False, flashattention=False, forceversion=False, foreground=False, gendefaults='', gendefaultsoverwrite=False, genlimit=0, gpulayers=99, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, jinja=False, jinja_tools=False, launch=False, lora=None, loramult=1.0, lowvram=False, maingpu=-1, maxrequestsize=32, mcpfile='', mmproj='', mmprojcpu=False, model=[], model_param=None, moecpu=0, moeexperts=-1, multiplayer=False, multiuser=1, musicdiffusion='acestep-v15-turbo-Q8_0.gguf', musicembeddings='Qwen3-Embedding-0.6B-Q8_0.gguf', musicllm='acestep-5Hz-lm-4B-Q8_0.gguf', musiclowvram=False, musicvae='vae-BF16.gguf', noavx2=False, noblas=False, nobostoken=False, nocertify=False, nofastforward=False, noflashattention=False, nommap=False, nomodel=False, nopipelineparallel=False, noshift=False, onready='', overridekv='', overridenativecontext=0, overridetensors='', password=None, pipelineparallel=False, port=5001, port_param=5001, preloadstory='', prompt='', proxy_port=None, quantkv=0, quiet=False, ratelimit=0, remotetunnel=False, ropeconfig=[0.0, 10000.0], routermode=False, savedatafile='', sdclamped=0, sdclampedsoft=0, sdclip1='', sdclip2='', sdclipgpu=False, sdconfig=None, sdconvdirect='off', sdflashattention=False, sdgendefaults=False, sdlora=[], sdloramult=[1.0], sdmodel='', sdnotile=False, sdoffloadcpu=False, sdphotomaker='', sdquant=0, sdt5xxl='', sdthreads=0, sdtiledvae=768, 
sdupscaler='', sdvae='', sdvaeauto=False, sdvaecpu=False, showgui=False, singleinstance=False, skiplauncher=True, smartcache=0, smartcontext=False, ssl=None, tensor_split=None, testmemory=False, threads=8, ttsdir='', ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', usecpu=False, usecuda=[], usemlock=False, usemmap=False, useswa=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
==========

Loading Music Gen LLM Model: C:\AI_Models\acestep-5Hz-lm-4B-Q8_0.gguf
Loading Music Gen Embed Model: C:\AI_Models\Qwen3-Embedding-0.6B-Q8_0.gguf
Loading Music Gen Diffusion Model: C:\AI_Models\acestep-v15-turbo-Q8_0.gguf
Loading Music Gen VAE Model: C:\AI_Models\vae-BF16.gguf
[BPE] Loaded from GGUF: 151643 vocab, 151387 merges
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16302 MiB):
  Device 0: NVIDIA GeForce RTX 5070 Ti, compute capability 12.0, VMM: yes, VRAM: 16302 MiB
[Load] LM backend: CUDA0 (CPU threads: 10)
[GGUF] C:\AI_Models\acestep-5Hz-lm-4B-Q8_0.gguf: 398 tensors, data at offset 5346304
[LM-Config] 36L, H=2560, V=217204, Nh=32, Nkv=8, D=128, tied=1
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 398 tensors, 4245.7 MB into backend
[LM-Load] CPU embed lookup: type=q8_0, row=2720 bytes
[LM-KV] Allocated 2 sets x 36 layers, 2304.0 MB
[Load] DiT backend: CUDA0 (CPU threads: 10)
[Load] Backend init: 2899.0 ms
[GGUF] C:\AI_Models\acestep-v15-turbo-Q8_0.gguf: 678 tensors, data at offset 56864
[DiT] Self-attn: Q+K+V fused
[DiT] Cross-attn: Q+K+V fused
[DiT] MLP: gate+up fused
[Load] null_condition_emb found (CFG available)
[WeightCtx] Loaded 478 tensors, 1600.7 MB into backend
[Load] DiT: 24 layers, H=2048, Nh=16/8, D=128
[Load] DiT weight load: 1405.1 ms
[GGUF] C:\AI_Models\acestep-v15-turbo-Q8_0.gguf: 678 tensors, data at offset 56864
[Load] silence_latent: [15000, 64] from GGUF
[BPE] Loaded from GGUF: 151643 vocab, 151387 merges
[Load] BPE tokenizer: 86.6 ms
[Load] TextEncoder backend: CUDA0 (CPU threads: 10)
[GGUF] C:\AI_Models\Qwen3-Embedding-0.6B-Q8_0.gguf: 310 tensors, data at offset 5337664
[Load] TextEncoder: 28L, H=1024, Nh=16/8
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 310 tensors, 742.7 MB into backend
[Load] TextEncoder: 863.3 ms
[GGUF] C:\AI_Models\Qwen3-Embedding-0.6B-Q8_0.gguf: 310 tensors, data at offset 5337664
[Load] CondEncoder backend: CUDA0 (CPU threads: 10)
[GGUF] C:\AI_Models\acestep-v15-turbo-Q8_0.gguf: 678 tensors, data at offset 56864
[Load] LyricEncoder: 8L
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[Load] TimbreEncoder: 4L
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 140 tensors, 616.6 MB into backend
[Load] CondEncoder: lyric(8L), timbre(4L), text_proj, null_cond
[Load] ConditionEncoder: 545.6 ms
[GGUF] C:\AI_Models\acestep-v15-turbo-Q8_0.gguf: 678 tensors, data at offset 56864
[WeightCtx] Loaded 30 tensors, 106.5 MB into backend
[Load] Detokenizer: FSQ(6->2048) + 2L encoder(S=5, 2048->64)
[Load] Detokenizer: 100.0 ms
[GGUF] C:\AI_Models\vae-BF16.gguf: 365 tensors, data at offset 30048
[Load] VAE-Enc backend: CUDA0 (CPU threads: 10)
[VAE-Enc] Backend: CUDA0, Weight buffer: 160.8 MB
[VAE-Enc] Loaded: 5 blocks, downsample=1920x, F32 activations
[Load] VAE Enc weights: 497.8 ms
[GGUF] C:\AI_Models\vae-BF16.gguf: 365 tensors, data at offset 30048
[Load] VAE backend: CUDA0 (CPU threads: 10)
[VAE] Backend: CUDA0, Weight buffer: 255.7 MB
[VAE] Loaded: 5 blocks, upsample=1920x
[Load] VAE weights: 413.0 ms

Music Gen Load Complete.
Load Music Models OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
Llama.cpp UI loaded.
Embedded MusicUI loaded.
======
Active Modules: MusicGen
Inactive Modules: TextGeneration ImageGeneration VoiceRecognition MultimodalVision MultimodalAudio NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech VectorEmbeddings AdminControl MCPBridge RouterMode
Enabled APIs: KoboldCppApi
Note: For third party Ollama API Emulation, you should set the port to 11434.
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Starting llama.cpp secondary WebUI at http://localhost:5001/lcpp/
MusicUI is available at http://localhost:5001/musicui/
======
Please connect to custom endpoint at http://localhost:5001
