AESCA (Yamamoto+, 2025), the top-performing system in AudioMOS Challenge Track 2 to predict the audio aesthetics score (AES)

AESCA: audio aesthetics score prediction system by CyberAgent

[Figure: AESCA overall architecture]

Abstract

  • AESCA (Yamamoto+, 2025) was submitted to the AudioMOS Challenge 2025 (Huang+, 2025) as system T12, the top-performing system in Track 2.
  • We provide the KAN-based predictor, which builds on Audiobox-Aesthetics (Tjandra+, 2025) and extends it with a group-rational Kolmogorov–Arnold network (GR-KAN) (Yang & Wang, 2025).
  • The model predicts the four audio aesthetics (AES) axes defined by Tjandra+ (2025):
    • Production Quality (PQ): Focuses on the technical aspects of quality rather than subjective quality. Aspects include the clarity & fidelity, dynamics, frequencies, and spatialization of the audio;
    • Production Complexity (PC): Focuses on the complexity of an audio scene, measured by the number of audio components;
    • Content Enjoyment (CE): Focuses on the subjective quality of an audio piece. It is a more open-ended axis; aspects may include emotional impact, artistic skill, artistic expression, and subjective experience;
    • Content Usefulness (CU): Also a subjective axis, evaluating the likelihood of using the audio as source material for content creation.
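The four axes above can be held in a simple record; a minimal sketch (the class and field names are illustrative, not the repository's API):

```python
from dataclasses import dataclass


@dataclass
class AesScores:
    """Four-axis audio aesthetics scores, each on a 1 (worst) to 10 (best) scale."""
    pq: float  # Production Quality
    pc: float  # Production Complexity
    ce: float  # Content Enjoyment
    cu: float  # Content Usefulness

    def clamped(self) -> "AesScores":
        """Clip each axis into the valid [1, 10] range."""
        clip = lambda v: min(10.0, max(1.0, v))
        return AesScores(clip(self.pq), clip(self.pc), clip(self.ce), clip(self.cu))
```
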

Demo Tool (GUI version)

[Figure: AESCA inference demo screenshot]

  • In this demo tool, you can evaluate an audio file (WAV/MP3) with either the ensemble of the KAN #1–#4 models or any single model.
  • The tool reports the four AES scores of the input audio file, each ranging from 1 (worst) to 10 (best).
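One plausible reading of the ensemble is a per-axis mean over the four models' predictions; a minimal sketch under that assumption (the repository's actual combination rule may differ):

```python
def ensemble_scores(model_outputs):
    """Per-axis mean of several models' AES predictions.

    model_outputs: list of dicts with keys "PQ", "PC", "CE", "CU".
    """
    axes = ("PQ", "PC", "CE", "CU")
    n = len(model_outputs)
    return {axis: sum(out[axis] for out in model_outputs) / n for axis in axes}


avg = ensemble_scores([
    {"PQ": 6.0, "PC": 5.0, "CE": 4.0, "CU": 5.0},  # hypothetical KAN #1 output
    {"PQ": 5.0, "PC": 6.0, "CE": 5.0, "CU": 4.0},  # hypothetical KAN #2 output
])
# avg == {"PQ": 5.5, "PC": 5.5, "CE": 4.5, "CU": 4.5}
```
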

Installation

Requirements

  • Python 3.10+ (recommended)
  • git-lfs: required to download pretrained model files
  • uv: required to install Python dependencies
  • ffmpeg: required to process input audio signals
  • A CUDA-capable NVIDIA GPU: required to run inference with CUDA and reproduce the reference results (see Notes)

Tested Environment

This code has been tested with the following setup:

  • Ubuntu 22.04 LTS
  • NVIDIA L4
  • CUDA 13.0

Steps

  1. Install git-lfs:

    $ sudo apt install git-lfs
    $ git lfs install
  2. Clone this repository:

    $ git clone https://github.com/CyberAgentAILab/aesca.git
    $ cd aesca
  3. Install uv:

    $ curl -LsSf https://astral.sh/uv/install.sh | sh
    $ source $HOME/.local/bin/env
  4. Setup the environment:

    $ uv sync
    $ git clone --depth 1 https://github.com/Adamdad/rational_kat_cu rational_kat_cu
    $ uv pip install --no-build-isolation ./rational_kat_cu
  5. Install ffmpeg:

    $ sudo apt install ffmpeg

Usage

CLI Version

  1. Prepare a *.jsonl file listing the audio files (WAV/MP3) to evaluate, for example ./samples/input.jsonl:

    {"sample_id": "shibuya-scramble-crossing", "data_path": "./samples/shibuya-scramble-crossing.mp3"}
  2. Run run_infer_aesca.sh with your input *.jsonl file:

    $ bash run_infer_aesca.sh ./samples/input.jsonl
  3. The ensemble result of KAN #1–#4 (Ensemble) is written to ./samples/input_result.jsonl:

    {"sample_id": "shibuya-scramble-crossing", "data_path": "./samples/shibuya-scramble-crossing.mp3", "file_exists": true, "model": "KAN #1-#4 (Ensemble)", "PQ": 5.940490794181824, "PC": 5.672019839286805, "CE": 4.508048462867737, "CU": 4.82172417640686}
  4. Per-model results from each KAN #{1..4} model are written to ./samples/kan/{1..4}/input_result.jsonl; for example, ./samples/kan/1/input_result.jsonl:

    {"sample_id": "shibuya-scramble-crossing", "data_path": "./samples/shibuya-scramble-crossing.mp3", "model": "KAN #1", "file_exists": true, "PQ": 6.02338981628418, "PC": 5.5568060874938965, "CE": 4.515963077545166, "CU": 4.9017157554626465}
  5. If an audio file does not exist, "file_exists" is false and all scores are 0.0:

    {"sample_id": "sample-not-exists", "data_path": "./samples/sample-not-exists.mp3", "file_exists": false, "model": "KAN #1-#4 (Ensemble)", "PQ": 0.0, "PC": 0.0, "CE": 0.0, "CU": 0.0}
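The input and result files above are plain JSON Lines, so they can be produced and consumed with a few lines of Python; a minimal sketch (field names taken from the examples above; the helper names are illustrative):

```python
import json


def input_jsonl_lines(entries):
    """Serialize (sample_id, data_path) pairs into JSON Lines for the CLI input."""
    return [json.dumps({"sample_id": sid, "data_path": path}) for sid, path in entries]


def load_results(jsonl_text):
    """Parse result JSON Lines, dropping entries whose audio file was missing."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return [row for row in rows if row.get("file_exists")]


# Build one input line; write "\n".join(lines) + "\n" to the input *.jsonl file.
lines = input_jsonl_lines([
    ("shibuya-scramble-crossing", "./samples/shibuya-scramble-crossing.mp3"),
])

# Filter a result file down to entries whose audio file existed.
example = (
    '{"sample_id": "a", "file_exists": true, "PQ": 5.94, "PC": 5.67, "CE": 4.51, "CU": 4.82}\n'
    '{"sample_id": "b", "file_exists": false, "PQ": 0.0, "PC": 0.0, "CE": 0.0, "CU": 0.0}\n'
)
kept = load_results(example)  # only the entry with "file_exists": true survives
```
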

GUI Version

  1. Run run_infer_aesca_gui.sh:

    $ bash run_infer_aesca_gui.sh
  2. Open the GUI in a web browser via the displayed link.

  3. Upload an audio file (default: ./samples/shibuya-scramble-crossing.mp3)

  4. Select an evaluation model (default: KAN #1-#4 (Ensemble))

  5. Start inference

  6. The evaluation results appear under Evaluation Results:

    {
      "Sample": "shibuya-scramble-crossing.mp3",
      "Model": "KAN #1-#4 (Ensemble)",
      "Production Quality (PQ)": 5.940,
      "Production Complexity (PC)": 5.672,
      "Content Enjoyment (CE)": 4.508,
      "Content Usefulness (CU)": 4.822
    }

Scope

This repository implements only the KAN-based Predictor (a) and does not include the VERSA-based Predictor (b) or any ensemble of (a) and (b).

License

All newly developed code in this repository is released under the Apache License 2.0.

This repository also includes components derived from:

  • Code under the MIT License (see third_party/MIT_LICENSE_*.txt)
  • Code under the Creative Commons Attribution 4.0 License (see third_party/CC-BY_LICENSE.txt)

All third-party license notices are listed in NOTICE.

Notes

  • Due to the current implementation, inference results may differ between CUDA (GPU) and CPU-only runs. The CUDA (GPU) results are the reference results at this time; we plan to release code that makes CPU-only inference reproduce them.
  • Across different GPU environments, small floating-point-level numerical differences may occur, but their impact is expected to be minor.
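Given the points above, comparing results across environments is best done with a tolerance rather than exact equality; a minimal sketch (the tolerance value is an illustrative choice, not a repository guarantee):

```python
import math


def scores_close(a, b, tol=1e-2):
    """Return True if two per-axis score dicts agree within an absolute tolerance."""
    axes = ("PQ", "PC", "CE", "CU")
    return all(math.isclose(a[axis], b[axis], abs_tol=tol) for axis in axes)


same = scores_close(
    {"PQ": 5.940, "PC": 5.672, "CE": 4.508, "CU": 4.822},
    {"PQ": 5.941, "PC": 5.672, "CE": 4.508, "CU": 4.821},
)  # True: the differences are below the 0.01 tolerance
```
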

Citations

If you use this software for academic or research purposes, please cite:

  • K. Yamamoto, K. Miyazaki, and S. Seki, "The T12 System for AudioMOS Challenge 2025: Audio Aesthetics Score Prediction System Using KAN- and VERSA-based Models," ASRU 2025. [arXiv] https://arxiv.org/abs/2512.05592

Acknowledgements

Our code is based on:

References

  • Huang+ (2025): W. Huang, H. Wang, C. Liu, Y. Wu, A. Tjandra, W. Hsu, E. Cooper, Y. Qin, & T. Toda, "The AudioMOS Challenge 2025," ASRU 2025. [arXiv] https://arxiv.org/abs/2509.01336
  • Tjandra+ (2025): A. Tjandra, Y. Wu, B. Guo, J. Hoffman, B. Ellis, A. Vyas, B. Shi, S. Chen, M. Le, N. Zacharov, C. Wood, A. Lee, & W. Hsu, "Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound," ASRU 2025. [arXiv] https://arxiv.org/abs/2502.05139
  • Yang & Wang (2025): X. Yang & X. Wang, "Kolmogorov-Arnold Transformer," ICLR 2025. [arXiv] https://arxiv.org/abs/2409.10594
