AESCA (Yamamoto+, 2025), the top-performing system in AudioMOS Challenge Track 2 to predict the audio aesthetics score (AES)

AESCA: audio aesthetics score prediction system by CyberAgent

[Figure: AESCA overall architecture]

Abstract

  • AESCA (Yamamoto+, 2025) was submitted to the AudioMOS Challenge 2025 (Huang+, 2025) as system T12, the top-performing system in Track 2.
  • We provide the KAN-based predictor, which builds on Audiobox-Aesthetics (Tjandra+, 2025) and extends it with a group-rational Kolmogorov–Arnold network (GR-KAN) (Yang & Wang, 2025).
  • The model predicts the four audio aesthetics (AES) axes defined by Tjandra+ (2025):
    • Production Quality (PQ): Focuses on the technical aspects of quality rather than subjective quality. Aspects include the clarity & fidelity, dynamics, frequencies, and spatialization of the audio;
    • Production Complexity (PC): Focuses on the complexity of an audio scene, measured by the number of audio components;
    • Content Enjoyment (CE): Focuses on the subjective quality of an audio piece. It is a more open-ended axis; aspects may include emotional impact, artistic skill, artistic expression, and subjective experience;
    • Content Usefulness (CU): Also a subjective axis, evaluating the likelihood of using the audio as source material for content creation.
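The four axes above can be held in a simple record; a minimal sketch (the class and field names are illustrative, not the repository's API):

```python
from dataclasses import dataclass


@dataclass
class AesScores:
    """Four-axis audio aesthetics scores, each on a 1 (worst) to 10 (best) scale."""
    pq: float  # Production Quality
    pc: float  # Production Complexity
    ce: float  # Content Enjoyment
    cu: float  # Content Usefulness

    def clamped(self) -> "AesScores":
        """Clip each axis into the valid [1, 10] range."""
        clip = lambda v: min(10.0, max(1.0, v))
        return AesScores(clip(self.pq), clip(self.pc), clip(self.ce), clip(self.cu))
```
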

Demo Tool (GUI version)

[Figure: AESCA inference demo screenshot]

  • In this demo tool, you can evaluate an audio file (WAV/MP3) with either the ensemble of the KAN #1–#4 models or any single model.
  • The tool reports the four AES scores of the input audio file, each ranging from 1 (worst) to 10 (best).
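One plausible reading of the ensemble is a per-axis mean over the four models' predictions; a minimal sketch under that assumption (the repository's actual combination rule may differ):

```python
def ensemble_scores(model_outputs):
    """Per-axis mean of several models' AES predictions.

    model_outputs: list of dicts with keys "PQ", "PC", "CE", "CU".
    """
    axes = ("PQ", "PC", "CE", "CU")
    n = len(model_outputs)
    return {axis: sum(out[axis] for out in model_outputs) / n for axis in axes}


avg = ensemble_scores([
    {"PQ": 6.0, "PC": 5.0, "CE": 4.0, "CU": 5.0},  # hypothetical KAN #1 output
    {"PQ": 5.0, "PC": 6.0, "CE": 5.0, "CU": 4.0},  # hypothetical KAN #2 output
])
# avg == {"PQ": 5.5, "PC": 5.5, "CE": 4.5, "CU": 4.5}
```
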

Installation

Requirements

  • Python 3.10+ (recommended)
  • git-lfs: required to download pretrained model files
  • uv: required to install Python dependencies
  • ffmpeg: required to process input audio signals
  • A CUDA-capable NVIDIA GPU: required to run inference with CUDA and reproduce the reference results (see Notes)

Tested Environment

This code has been tested with the following setup:

  • Ubuntu 22.04 LTS
  • NVIDIA L4
  • CUDA 13.0

Steps

  1. Install git-lfs:

    $ sudo apt install git-lfs
    $ git lfs install
  2. Clone this repository:

    $ git clone https://github.com/CyberAgentAILab/aesca.git
    $ cd aesca
  3. Install uv:

    $ curl -LsSf https://astral.sh/uv/install.sh | sh
    $ source $HOME/.local/bin/env
  4. Setup the environment:

    $ uv sync
    $ git clone --depth 1 https://github.com/Adamdad/rational_kat_cu rational_kat_cu
    $ uv pip install --no-build-isolation ./rational_kat_cu
  5. Install ffmpeg:

    $ sudo apt install ffmpeg

Usage

CLI Version

  1. Prepare a *.jsonl file listing the audio files (WAV/MP3) to evaluate, for example ./samples/input.jsonl:

    {"sample_id": "shibuya-scramble-crossing", "data_path": "./samples/shibuya-scramble-crossing.mp3"}
  2. Run run_infer_aesca.sh with your input *.jsonl file:

    $ bash run_infer_aesca.sh ./samples/input.jsonl
  3. The ensemble result of KAN #1–#4 (Ensemble) is written to ./samples/input_result.jsonl:

    {"sample_id": "shibuya-scramble-crossing", "data_path": "./samples/shibuya-scramble-crossing.mp3", "file_exists": true, "model": "KAN #1-#4 (Ensemble)", "PQ": 5.940490794181824, "PC": 5.672019839286805, "CE": 4.508048462867737, "CU": 4.82172417640686}
  4. Per-model results from each KAN #{1..4} model are written to ./samples/kan/{1..4}/input_result.jsonl; for example, ./samples/kan/1/input_result.jsonl:

    {"sample_id": "shibuya-scramble-crossing", "data_path": "./samples/shibuya-scramble-crossing.mp3", "model": "KAN #1", "file_exists": true, "PQ": 6.02338981628418, "PC": 5.5568060874938965, "CE": 4.515963077545166, "CU": 4.9017157554626465}
  5. If an audio file does not exist, "file_exists" is false and all scores are 0.0:

    {"sample_id": "sample-not-exists", "data_path": "./samples/sample-not-exists.mp3", "file_exists": false, "model": "KAN #1-#4 (Ensemble)", "PQ": 0.0, "PC": 0.0, "CE": 0.0, "CU": 0.0}
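The input and result files above are plain JSON Lines, so they can be produced and consumed with a few lines of Python; a minimal sketch (field names taken from the examples above; the helper names are illustrative):

```python
import json


def input_jsonl_lines(entries):
    """Serialize (sample_id, data_path) pairs into JSON Lines for the CLI input."""
    return [json.dumps({"sample_id": sid, "data_path": path}) for sid, path in entries]


def load_results(jsonl_text):
    """Parse result JSON Lines, dropping entries whose audio file was missing."""
    rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return [row for row in rows if row.get("file_exists")]


# Build one input line; write "\n".join(lines) + "\n" to the input *.jsonl file.
lines = input_jsonl_lines([
    ("shibuya-scramble-crossing", "./samples/shibuya-scramble-crossing.mp3"),
])

# Filter a result file down to entries whose audio file existed.
example = (
    '{"sample_id": "a", "file_exists": true, "PQ": 5.94, "PC": 5.67, "CE": 4.51, "CU": 4.82}\n'
    '{"sample_id": "b", "file_exists": false, "PQ": 0.0, "PC": 0.0, "CE": 0.0, "CU": 0.0}\n'
)
kept = load_results(example)  # only the entry with "file_exists": true survives
```
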

GUI Version

  1. Run run_infer_aesca_gui.sh:

    $ bash run_infer_aesca_gui.sh
  2. Open the GUI in a web browser via the displayed link.

  3. Upload an audio file (default: ./samples/shibuya-scramble-crossing.mp3)

  4. Select an evaluation model (default: KAN #1-#4 (Ensemble))

  5. Start inference

  6. The evaluation results appear under Evaluation Results:

    {
      "Sample": "shibuya-scramble-crossing.mp3",
      "Model": "KAN #1-#4 (Ensemble)",
      "Production Quality (PQ)": 5.940,
      "Production Complexity (PC)": 5.672,
      "Content Enjoyment (CE)": 4.508,
      "Content Usefulness (CU)": 4.822
    }

Scope

This repository implements only the KAN-based Predictor (a) and does not include the VERSA-based Predictor (b) or any ensemble of (a) and (b).

License

All newly developed code in this repository is released under the Apache License 2.0.

This repository also includes components derived from:

  • Code under the MIT License (see third_party/MIT_LICENSE_*.txt)
  • Code under the Creative Commons Attribution 4.0 License (see third_party/CC-BY_LICENSE.txt)

All third-party license notices are listed in NOTICE.

Notes

  • Due to the current implementation, inference results may differ between CUDA (GPU) and CPU-only runs. The CUDA (GPU) results are the reference results at this time; we plan to release code that makes CPU-only inference reproduce them.
  • Across different GPU environments, small floating-point-level numerical differences may occur, but their impact is expected to be minor.
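Given the points above, comparing results across environments is best done with a tolerance rather than exact equality; a minimal sketch (the tolerance value is an illustrative choice, not a repository guarantee):

```python
import math


def scores_close(a, b, tol=1e-2):
    """Return True if two per-axis score dicts agree within an absolute tolerance."""
    axes = ("PQ", "PC", "CE", "CU")
    return all(math.isclose(a[axis], b[axis], abs_tol=tol) for axis in axes)


same = scores_close(
    {"PQ": 5.940, "PC": 5.672, "CE": 4.508, "CU": 4.822},
    {"PQ": 5.941, "PC": 5.672, "CE": 4.508, "CU": 4.821},
)  # True: the differences are below the 0.01 tolerance
```
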

Citations

If you use this software for academic or research purposes, please cite:

  • K. Yamamoto, K. Miyazaki, and S. Seki, "The T12 System for AudioMOS Challenge 2025: Audio Aesthetics Score Prediction System Using KAN- and VERSA-based Models," ASRU 2025. [arXiv] https://arxiv.org/abs/2512.05592

Acknowledgements

Our code is based on:

References

  • Huang+ (2025): W. Huang, H. Wang, C. Liu, Y. Wu, A. Tjandra, W. Hsu, E. Cooper, Y. Qin, & T. Toda, "The AudioMOS Challenge 2025," ASRU 2025. [arXiv] https://arxiv.org/abs/2509.01336
  • Tjandra+ (2025): A. Tjandra, Y. Wu, B. Guo, J. Hoffman, B. Ellis, A. Vyas, B. Shi, S. Chen, M. Le, N. Zacharov, C. Wood, A. Lee, & W. Hsu, "Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound," ASRU 2025. [arXiv] https://arxiv.org/abs/2502.05139
  • Yang & Wang (2025): X. Yang & X. Wang, "Kolmogorov-Arnold Transformer," ICLR 2025. [arXiv] https://arxiv.org/abs/2409.10594
