Khala is an open-source system for high-fidelity song generation, capable of generating complete songs from text descriptions and lyric conditions. Unlike approaches built around semantic tokens, diffusion models, or multi-stage audio generation stacks, Khala follows a unified acoustic-token route and generates both coarse musical structure and fine acoustic detail within the same discrete audio representation space.
The core characteristics of Khala include:
- Full-song generation: designed for complete song generation rather than short clips or loop-style accompaniment.
- Text and lyric control: supports natural-language prompts and lyrics to control style, mood, vocals, and content.
- Unified acoustic-token representation: built on a 64-layer RVQ acoustic token hierarchy that represents audio as coarse-to-fine discrete acoustic tokens.
- Two-stage generation pipeline: a backbone first generates coarse acoustic tokens, then a super-resolution model completes higher RVQ token layers, and finally a decoder reconstructs the waveform.
- Complete system implementation: includes a frontend UI, a FastAPI backend dispatcher, a single-GPU inference worker, model loading, and the end-to-end audio generation path rather than just standalone inference scripts.
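To make the coarse-to-fine idea behind an RVQ token hierarchy concrete, here is a toy sketch of residual vector quantization on a single scalar. It is not Khala's tokenizer: the helper names (`make_codebooks`, `rvq_encode`, `rvq_decode`), the scalar input, the codebook size of 4, and the 3 layers are assumptions chosen only for illustration. Each layer quantizes the residual left by the previous layer, so decoding more layers gives a closer reconstruction.

```python
def make_codebooks(num_layers, size=4):
    """Build toy scalar codebooks; layer k is `size` times finer than layer k-1."""
    return [
        [(j - (size - 1) / 2) / size**k for j in range(size)]
        for k in range(num_layers)
    ]


def nearest(codebook, x):
    """Return the index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))


def rvq_encode(x, codebooks):
    """Quantize x layer by layer; each layer encodes the previous residual."""
    tokens, residual = [], x
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual -= codebook[idx]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the selected entries of the first len(tokens) layers."""
    return sum(cb[i] for cb, i in zip(codebooks, tokens))


if __name__ == "__main__":
    codebooks = make_codebooks(num_layers=3)
    x = 0.7
    tokens = rvq_encode(x, codebooks)
    # Decoding only a prefix of the layers gives a coarser reconstruction.
    for k in range(1, len(tokens) + 1):
        approx = rvq_decode(tokens[:k], codebooks)
        print(f"layers used: {k}  reconstruction error: {abs(x - approx):.5f}")
```

In this toy setup, the first token plays the role of the coarse layer and the later tokens add fine detail, which mirrors the split between the backbone and the super-resolution model described above.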
⚠️ [2026-05-07] We have identified a potential issue that may significantly affect inference quality. The problem is currently under investigation and may be related to numerical precision. Until this notice is removed, please treat current generation quality as unstable.
- [2026-05-16] The online audio demo page is now available: Khala Demo
- [2026-05-11] Backend inference launch now supports single-GPU safe startup by default, plus multi-GPU and runtime-mode overrides for deployment compatibility.
- [2026-05-05] The arXiv paper is now available: Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation
- [2026-05-01] The codebase, environment documentation, and Dockerfile have been cleaned up for release.
- [Coming Soon] A full deployment guide for musicians and beginner users.
- [Coming Soon] Discord community server.
Listen to generated samples on the online demo page: Khala Demo
The current release is mainly intended for researchers and developers who are already familiar with GPU servers.
- NVIDIA GPU, with 24GB or more VRAM recommended for the full inference pipeline, such as an RTX 4090 or a higher-tier GPU.
- Docker and NVIDIA Container Toolkit.
- A CUDA-compatible NVIDIA driver.
- Python and Node.js are already included in the prebuilt image.
- Model weights need to be downloaded into the checkpoints/ directory at the repository root.
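The requirements above can be partly checked before launching anything. The following is a minimal sketch, not a script shipped with the repository: the function name `preflight` and the result keys are hypothetical. It only checks that the `docker` and `nvidia-smi` executables are on `PATH` and that a `checkpoints/` directory exists and is non-empty; it does not verify VRAM size, driver compatibility, or the NVIDIA Container Toolkit.

```python
import shutil
from pathlib import Path


def preflight(repo_root="."):
    """Report which prerequisites appear to be in place (hypothetical helper)."""
    ckpt = Path(repo_root) / "checkpoints"
    return {
        # Executables found on PATH; says nothing about versions.
        "docker_on_path": shutil.which("docker") is not None,
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # Model weights are expected under checkpoints/ at the repository root.
        "checkpoints_dir_exists": ckpt.is_dir(),
        "checkpoints_non_empty": ckpt.is_dir() and any(ckpt.iterdir()),
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{'ok' if ok else 'MISSING':8s}{name}")
```

Run it from the repository root on the host. Inside the prebuilt container, `docker` is normally not expected to be present, so that check is only meaningful on the host.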
This section is intended for researchers and developers who are already comfortable with basic Docker and CUDA workflows, and provides the shortest path to running the system.
If you want to configure the environment step by step from a clean NGC container, please read ENVIRONMENT_SETUP.md.
If you want to understand the backend structure and runtime logic, please read backend/README_backend.md.
The currently available prebuilt image is:
docker pull ghcr.io/davidliujiafeng/khala-env:ngc25.02-node24
docker run --gpus all -it --rm \
--name khala \
-p 30869:30869 \
-p 8889:8889 \
ghcr.io/davidliujiafeng/khala-env:ngc25.02-node24
Note: the command above uses --rm, so files created inside the container will be removed after the container exits. If you want a long-lived development container or want to keep downloaded model weights, use a mounted directory or remove --rm.
After entering the container, run:
cd /workspace
git clone https://github.com/Khala-Music-AI/Khala.git
cd Khala
Model repository: https://huggingface.co/liujiafeng/Khala-MusicGeneration-v1.0
From the repository root, run:
mkdir -p checkpoints
hf download liujiafeng/Khala-MusicGeneration-v1.0 --local-dir checkpoints
This command downloads the model repository contents into the local checkpoints/ directory.
cd /workspace/Khala/backend
bash run_backend.sh
The default launcher now starts in a single-GPU safe mode. Advanced users can also select specific GPU ids and switch between one_shot and keep_loaded runtime modes from the same script; see backend/README_backend.md for details.
In another terminal, run:
cd /workspace/Khala/frontend
npm install
npm run dev
Default URL:
The current system has three layers:
- Frontend: accepts prompts, lyrics, and generation settings, and displays results.
- API dispatcher: receives requests, creates jobs, queues them, and dispatches them to idle workers.
- Inference worker: runs backbone, super-resolution, and decoder inference.
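The dispatcher behaviour described above (create a job, queue it, hand it to an idle worker) can be sketched with the standard library alone. This is an illustration of the pattern, not the code in backend_api.py: the `Dispatcher` class, its method names, the status strings, and the `run_job` callback are all hypothetical, and worker threads stand in for the real inference worker process.

```python
import queue
import threading
import uuid


class Dispatcher:
    """Toy job dispatcher: jobs wait in a FIFO queue until a worker is idle."""

    def __init__(self, run_job, num_workers=1):
        self.jobs = {}                # job_id -> {"status": ..., "result": ...}
        self._queue = queue.Queue()
        self._run_job = run_job       # stand-in for the inference call
        for _ in range(num_workers):
            threading.Thread(target=self._worker_loop, daemon=True).start()

    def submit(self, prompt, lyrics):
        """Create a job record, enqueue it, and return its id immediately."""
        job_id = uuid.uuid4().hex
        self.jobs[job_id] = {"status": "queued", "result": None}
        self._queue.put((job_id, prompt, lyrics))
        return job_id

    def _worker_loop(self):
        while True:
            # Blocks while the worker is idle and the queue is empty.
            job_id, prompt, lyrics = self._queue.get()
            job = self.jobs[job_id]
            job["status"] = "running"
            try:
                job["result"] = self._run_job(prompt, lyrics)
                job["status"] = "done"
            except Exception as exc:
                job["result"] = repr(exc)
                job["status"] = "failed"
            finally:
                self._queue.task_done()

    def wait_all(self):
        """Block until every queued job has been processed."""
        self._queue.join()


if __name__ == "__main__":
    dispatcher = Dispatcher(lambda prompt, lyrics: f"audio for {prompt!r}")
    job_id = dispatcher.submit("dreamy synth-pop", "la la la")
    dispatcher.wait_all()
    print(dispatcher.jobs[job_id])
```

With `num_workers=1` this mimics a single-GPU setup in which requests are served strictly one at a time; the frontend would poll the job record by id rather than hold a request open for the whole generation.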
The request path is:
flowchart LR
A["Frontend UI"] --> B["backend_api.py"]
B --> C["backend_worker.py"]
C --> D["Backbone"]
D --> E["Super-resolution"]
E --> F["Decoder"]
F --> G["Generated Audio"]
G --> B
B --> A
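The worker side of this path can be sketched as three stub stages that only track token shapes. Everything here is a placeholder: the function names, the all-zero tokens, `samples_per_frame`, and in particular `NUM_COARSE_LAYERS = 8` are assumptions for illustration. Only the total of 64 RVQ layers and the backbone, super-resolution, decoder order come from this README.

```python
NUM_RVQ_LAYERS = 64     # from this README: 64-layer RVQ acoustic token hierarchy
NUM_COARSE_LAYERS = 8   # hypothetical split; the real value is not stated here


def backbone(prompt, lyrics, num_frames):
    """Stub backbone: one list of coarse tokens per audio frame."""
    return [[0] * NUM_COARSE_LAYERS for _ in range(num_frames)]


def super_resolution(coarse_tokens):
    """Stub super-resolution: keep coarse layers, append the higher layers."""
    extra = NUM_RVQ_LAYERS - NUM_COARSE_LAYERS
    return [frame + [0] * extra for frame in coarse_tokens]


def decoder(full_tokens, samples_per_frame=4):
    """Stub decoder: map each complete token frame to a few waveform samples."""
    if any(len(frame) != NUM_RVQ_LAYERS for frame in full_tokens):
        raise ValueError("decoder expects all RVQ layers to be filled in")
    return [0.0] * (len(full_tokens) * samples_per_frame)


def generate(prompt, lyrics, num_frames=10):
    """Run the three stages in the order shown in the flowchart."""
    coarse = backbone(prompt, lyrics, num_frames)
    full = super_resolution(coarse)
    return decoder(full)


if __name__ == "__main__":
    waveform = generate("warm acoustic ballad", "hello world", num_frames=10)
    print(len(waveform), "samples")
```

The point of the sketch is the contract between stages: the decoder only runs once every frame carries all 64 layers, so the super-resolution stage cannot be skipped.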
- Demo page: Khala Demo
- arXiv paper: Khala: Scaling Acoustic Token Language Models Toward High-Fidelity Music Generation
- Model weights: https://huggingface.co/liujiafeng/Khala-MusicGeneration-v1.0
- Environment setup: ENVIRONMENT_SETUP.md
- Backend docs: backend/README_backend.md
Khala/
├── backend/
├── frontend/
├── core/
├── models/
├── checkpoints/
├── assets/
├── Dockerfile
├── requirements.txt
├── ENVIRONMENT_SETUP.md
└── ENVIRONMENT_SETUP_zh.md
Main directories:
- frontend/: frontend pages and the Vite project.
- backend/: backend API, worker, and launcher scripts.
- core/: project-specific core modules.
- models/: Megatron, decoder, and tokenizer related code.
- checkpoints/: model checkpoint directory.
- assets/: images used by the README and demo materials.
If this project is helpful to your research or development work, you are welcome to cite our paper:
The final BibTeX information will be added later to both the paper page and the repository documentation.
The current implementation builds on a number of excellent open-source projects and tools, including but not limited to:
- NVIDIA NGC
- Megatron / Megatron Core
- Hugging Face
- FastAPI
- Vite / React
The model weights are currently intended to be released under CC BY-NC 4.0 (Creative Commons Attribution-NonCommercial 4.0 International).
Feel free to join the WeChat group for discussion, usage questions, and future updates:



