- AESCA (Yamamoto+, 2025) was proposed as the T12 system in the AudioMOS Challenge 2025 (Huang+, 2025), where it was the top-performing system in Track 2.
- We provide the KAN-based predictor, which builds on Audiobox-Aesthetics (Tjandra+, 2025) and is extended with a Group-Rational Kolmogorov–Arnold Network (GR-KAN) (Yang & Wang, 2025); a minimal sketch of the group-rational idea is shown after this overview.
- The model predicts scores on the four audio aesthetics (AES) axes defined by Tjandra+ (2025):
  - Production Quality (PQ): Focuses on the technical aspects of quality instead of subjective quality. Aspects include clarity & fidelity, dynamics, frequencies, and spatialization of the audio.
  - Production Complexity (PC): Focuses on the complexity of an audio scene, measured by the number of audio components.
  - Content Enjoyment (CE): Focuses on the subjective quality of an audio piece. It is a more open-ended axis; aspects may include emotional impact, artistic skill, artistic expression, and overall subjective experience.
  - Content Usefulness (CU): Also a subjective axis, evaluating the likelihood of leveraging the audio as source material for content creation.
- In the GUI-based demo tool, you can use the ensemble of the `KAN #1`–`#4` models, or each model individually, to evaluate an audio file (WAV/MP3).
- You can check the 4-axis AES scores of the input audio file; each score ranges from 1 (worst) to 10 (best).
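For intuition, below is a minimal PyTorch sketch of a group-rational activation in the spirit of GR-KAN: channels are split into groups, and each group shares one learnable rational function φ(x) = P(x) / (1 + |Q(x)|). This is only an illustration under our reading of Yang & Wang (2025), not the implementation used in this repository (that comes from Adamdad/rational_kat_cu with fused CUDA kernels); the class name and hyperparameters below are ours.

```python
import torch
import torch.nn as nn


class GroupRationalActivation(nn.Module):
    """Illustrative group-rational activation (GR-KAN-style sketch):
    phi(x) = P(x) / (1 + |Q(x)|), one (P, Q) coefficient set per channel group."""

    def __init__(self, channels: int, groups: int = 8, p_order: int = 5, q_order: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        a = torch.zeros(groups, p_order + 1)
        a[:, 1] = 1.0  # initialize near identity: phi(x) ~ x
        self.a = nn.Parameter(a)                             # numerator coefficients
        self.b = nn.Parameter(torch.zeros(groups, q_order))  # denominator coefficients

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        *lead, c = x.shape
        xg = x.view(*lead, self.groups, c // self.groups)
        ones = [1] * len(lead)
        # P(x) = sum_i a_i x^i over i = 0..p_order, shared within each group
        p_pows = torch.stack([xg ** i for i in range(self.a.shape[1])], dim=-1)
        num = (p_pows * self.a.view(*ones, self.groups, 1, -1)).sum(-1)
        # Q(x) = sum_j b_j x^j over j = 1..q_order; |.| keeps the denominator >= 1
        q_pows = torch.stack([xg ** (j + 1) for j in range(self.b.shape[1])], dim=-1)
        den = 1.0 + (q_pows * self.b.view(*ones, self.groups, 1, -1)).sum(-1).abs()
        return (num / den).view(*lead, c)


# Usage: acts as a drop-in nonlinearity over the channel dimension.
act = GroupRationalActivation(channels=512, groups=8)
y = act(torch.randn(2, 10, 512))  # (batch, time, channels)
print(y.shape)                    # torch.Size([2, 10, 512])
```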
- Python 3.10+ (recommended)
- `git-lfs`: required to download pretrained model files
- `uv`: required to install Python dependencies
- `ffmpeg`: required to process input audio signals
- A CUDA-capable NVIDIA GPU: required to run inference on the CUDA (GPU) path and obtain the correct/reference results
This code has been tested with the following setup:
- Ubuntu 22.04 LTS
- NVIDIA L4
- CUDA 13.0
- Install `git-lfs`:

  ```
  $ sudo apt install git-lfs
  $ git lfs install
  ```
- Clone this repository:

  ```
  $ git clone https://github.com/CyberAgentAILab/aesca.git
  $ cd aesca
  ```
- Install `uv`:

  ```
  $ curl -LsSf https://astral.sh/uv/install.sh | sh
  $ source $HOME/.local/bin/env
  ```
- Set up the environment (an optional GPU sanity check is sketched after these steps):

  ```
  $ uv sync
  $ git clone --depth 1 https://github.com/Adamdad/rational_kat_cu rational_kat_cu
  $ uv pip install --no-build-isolation ./rational_kat_cu
  ```
- Install `ffmpeg`:

  ```
  $ sudo apt install ffmpeg
  ```
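After setup, you may want to confirm that PyTorch can actually see the GPU, since the correct/reference results are defined on the CUDA path (see the notes below). A minimal check, assuming an illustrative file name `check_gpu.py` run inside the `uv` environment:

```python
# check_gpu.py -- optional sanity check (file name is illustrative).
# Run with: uv run python check_gpu.py
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```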
- Prepare a `*.jsonl` file for an evaluation of audio files (WAV/MP3); a sketch that generates such a file from a directory is shown after these steps. For example, in `./samples/input.jsonl`:

  ```
  {"sample_id": "shibuya-scramble-crossing", "data_path": "./samples/shibuya-scramble-crossing.mp3"}
  ```
- Run `run_infer_aesca.sh` with your input `*.jsonl` file:

  ```
  $ bash run_infer_aesca.sh ./samples/input.jsonl
  ```
- You can get the ensemble results of `KAN #1-#4 (Ensemble)` in `./samples/input_result.jsonl`:

  ```
  {"sample_id": "shibuya-scramble-crossing", "data_path": "./samples/shibuya-scramble-crossing.mp3", "file_exists": true, "model": "KAN #1-#4 (Ensemble)", "PQ": 5.940490794181824, "PC": 5.672019839286805, "CE": 4.508048462867737, "CU": 4.82172417640686}
  ```
- You can also get an evaluation result from each `KAN #{1..4}` model in `./samples/kan/{1..4}/input_result.jsonl`, for example, in `./samples/kan/1/input_result.jsonl`:

  ```
  {"sample_id": "shibuya-scramble-crossing", "data_path": "./samples/shibuya-scramble-crossing.mp3", "model": "KAN #1", "file_exists": true, "PQ": 6.02338981628418, "PC": 5.5568060874938965, "CE": 4.515963077545166, "CU": 4.9017157554626465}
  ```
- If an input file does not exist, `"file_exists"` is `false` and the scores will be `0.0`:

  ```
  {"sample_id": "sample-not-exists", "data_path": "./samples/sample-not-exists.mp3", "file_exists": false, "model": "KAN #1-#4 (Ensemble)", "PQ": 0.0, "PC": 0.0, "CE": 0.0, "CU": 0.0}
  ```
- Run `run_infer_aesca_gui.sh`:

  ```
  $ bash run_infer_aesca_gui.sh
  ```
- Open the GUI in a web browser via the displayed link.
- Upload an audio file (default: `./samples/shibuya-scramble-crossing.mp3`).
- Select an evaluation model (default: `KAN #1-#4 (Ensemble)`).
- Start inference.
- You can get the evaluation results in `Evaluation Results` (the same values as the CLI output, rounded to three decimals):

  ```
  {
    "Sample": "shibuya-scramble-crossing.mp3",
    "Model": "KAN #1-#4 (Ensemble)",
    "Production Quality (PQ)": 5.940,
    "Production Complexity (PC)": 5.672,
    "Content Enjoyment (CE)": 4.508,
    "Content Usefulness (CU)": 4.822
  }
  ```
This repository implements only the KAN-based Predictor (a) and does not include the VERSA-based Predictor (b) or any ensemble of (a) and (b).
All newly developed code in this repository is released under the Apache License 2.0.
This repository also includes components derived from:
- Code under the MIT License (see `third_party/MIT_LICENSE_*.txt`)
- Code under the Creative Commons Attribution 4.0 License (see `third_party/CC-BY_LICENSE.txt`)
All third-party license notices are listed in `NOTICE`.
- Due to the current implementation, inference results may differ between CUDA (GPU) and CPU-only (no-GPU) runs. The CUDA (GPU) results are considered the correct/reference results at this time. We plan to release code in the future that enables CPU-only inference to produce the same (correct) results as the CUDA (GPU) path.
- Across different GPU environments, small floating-point-level numerical differences may occur; however, their impact is expected to be minor.
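If you want to quantify the difference between two runs (e.g., CUDA vs. CPU-only, or two different GPUs), a small comparison script like the following can help. The file names and the `load_results` helper are illustrative, not part of this repository:

```python
import json

AXES = ("PQ", "PC", "CE", "CU")


def load_results(path: str) -> dict:
    """Map sample_id -> result record for one result JSONL file."""
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    return {r["sample_id"]: r for r in records}


# Illustrative file names: a reference (CUDA) run and a run under comparison.
ref = load_results("./samples/input_result.gpu.jsonl")
alt = load_results("./samples/input_result.cpu.jsonl")

for sid, r in ref.items():
    a = alt.get(sid)
    if a is None or not (r["file_exists"] and a["file_exists"]):
        continue
    worst_axis = max(AXES, key=lambda ax: abs(r[ax] - a[ax]))
    print(f"{sid}: max |diff| = {abs(r[worst_axis] - a[worst_axis]):.6f} ({worst_axis})")
```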
If you use this software for academic or research purposes, please cite:
- K. Yamamoto, K. Miyazaki, and S. Seki, "The T12 System for AudioMOS Challenge 2025: Audio Aesthetics Score Prediction System Using KAN- and VERSA-based Models," ASRU 2025. [arXiv] https://arxiv.org/abs/2512.05592
Our code is based on:
- facebookresearch/audiobox-aesthetics: base models;
- Adamdad/rational_kat_cu: GR-KAN implementation.
- Huang+ (2025): W. Huang, H. Wang, C. Liu, Y. Wu, A. Tjandra, W. Hsu, E. Cooper, Y. Qin, & T. Toda, "The AudioMOS Challenge 2025," ASRU 2025. [arXiv] https://arxiv.org/abs/2509.01336
- Tjandra+ (2025): A. Tjandra, Y. Wu, B. Guo, J. Hoffman, B. Ellis, A. Vyas, B. Shi, S. Chen, M. Le, N. Zacharov, C. Wood, A. Lee, & W. Hsu, "Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound," ASRU 2025. [arXiv] https://arxiv.org/abs/2502.05139
- Yang & Wang (2025): X. Yang & X. Wang, "Kolmogorov-Arnold Transformer," ICLR 2025. [arXiv] https://arxiv.org/abs/2409.10594

