ASAudio

ASAudio: A Survey of Advanced Spatial Audio Research

🚀Quick Start

Introduction

This is the official repository for ASAudio: A Survey of Advanced Spatial Audio Research.

Figure 1: The timeline of spatial audio models & datasets in recent years.

Abstract

With the rapid development of spatial audio technologies, applications in AR, VR, and other scenarios have garnered extensive attention. Unlike traditional mono sound, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their underlying technologies. To address this gap, this paper provides a comprehensive overview of spatial audio and systematically reviews recent literature in the area. We chronologically outline existing work related to spatial audio and categorize these studies by input-output representation and by generation and understanding task, thereby summarizing the main research aspects of spatial audio. In addition, we review related datasets, evaluation metrics, and benchmarks, offering insights from both training and evaluation perspectives.

Overall

Figure 2: Organization of this survey.

Representations of Spatial Audio

1. Input Representations

Attribute	Natural Language	Spatial Position	Visual Information	Monaural Audio
Primary Info	Semantic, relational, implicit spatial	Explicit spatial, dynamic	Semantic, spatial, dynamic	Acoustic (timbre, pitch, content)
Control Precision	Low	Very high	High	N/A
Abstraction Level	High	Low	High	Low
Interpretability	Indirect	Direct	Indirect	Indirect
Key Challenges	Ambiguity; semantic–signal gap	No semantics; tedious authoring	Ambiguity; occlusion; compute cost	Lack of spatial cues

Table 1: Comparative analysis of spatial audio input representations

Figure 3: The input representations and their fundamental processing steps.

2. Spatial Cues and Physical Modeling

2.1 Room Impulse Response (RIR)

Paper	URL	Code/Dataset
Few-shot audio-visual learning of environment acoustics	Link	Link
Spatial scaper: a library to simulate and augment soundscapes for sound event localization and detection in realistic rooms	Link	Link
Novel-view acoustic synthesis from 3D reconstructed rooms	Link	Link
A binaural room impulse response database for the evaluation of dereverberation algorithms	Link	Link
The Sweet-Home speech and multimodal corpus for home automation interaction	Link	-
Dataset of Binaural Room Impulse Responses at Multiple Recording Positions, Source Positions, and Orientations in a Real Room	Link	-
dEchorate: a calibrated room impulse response database for echo-aware signal processing	Link	Link
BIRD: Big impulse response dataset	Link	Link
Visually informed binaural audio generation without binaural audios	Link	Link
MeshRIR: A dataset of room impulse responses on meshed grid points for evaluating sound field analysis and synthesis methods	Link	Link
Mesh2ir: Neural acoustic impulse response generator for complex 3d scenes	Link	Link
A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection	Link	Link
Acoustic analysis and dataset of transitions between coupled rooms	Link	-
Dataset of spatial room impulse responses in a variable acoustics room for six degrees-of-freedom rendering and analysis	Link	Link
On the authenticity of individual dynamic binaural synthesis	Link	-

Table 2: List of RIR papers and their URLs
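
For readers new to RIRs, here is a minimal sketch of how an impulse response is typically applied: a dry (anechoic) signal is convolved with the RIR to obtain the reverberant signal at the receiver. It is not taken from the survey or from any paper listed above. The function name `apply_rir` and the toy filter values are illustrative assumptions; a real pipeline would usually load a measured or simulated RIR and use an FFT-based convolution instead of this direct form.

```python
def apply_rir(dry, rir):
    """Linear convolution of a dry signal with a room impulse response.

    Both arguments are plain lists of float samples at the same sample rate.
    The output has len(dry) + len(rir) - 1 samples.
    """
    wet = [0.0] * (len(dry) + len(rir) - 1)
    for i, sample in enumerate(dry):
        for j, tap in enumerate(rir):
            # Each input sample excites a scaled, delayed copy of the RIR.
            wet[i + j] += sample * tap
    return wet


# Hypothetical toy RIR: direct path followed by two decaying reflections.
toy_rir = [1.0, 0.0, 0.5, 0.0, 0.25]
dry = [1.0, 2.0]

wet = apply_rir(dry, toy_rir)
print(wet)  # [1.0, 2.0, 0.5, 1.0, 0.25, 0.5]
```

With a binaural RIR, the same operation would be run once per ear, using the left and right impulse responses.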

2.2 Head Related Transfer Function (HRTF)

Paper	URL	Code/Dataset
HRTF personalization based on ear morphology	Link	-
On HRTF Notch Frequency Prediction using Anthropometric Features and Neural Networks	Link	-
Magnitude modeling of personalized HRTF based on ear images and anthropometric measurements	Link	-
Global HRTF interpolation via learned affine transformation of hyper-conditioned features	Link	Link
HRTF recommendation based on the predicted binaural colouration model	Link	-
Modeling individual head-related transfer functions from sparse measurements using a convolutional neural network	Link	-
Head-related transfer function interpolation from spatially sparse measurements using autoencoder with source position conditioning	Link	Link
HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection	Link	Link
Spatial upsampling of head-related transfer functions using a physics-informed neural network	Link	Link
HRTF field: Unifying measured HRTF magnitude representation with neural fields	Link	Link
Head-related transfer function interpolation with a spherical CNN	Link	Link
HRTF interpolation using a spherical neural process meta-learner	Link	-
NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization	Link	Link

Table 3: List of HRTF papers and their URLs

3. Output Representations

Attribute	Channel-Based	Scene-Based	Object-Based
Freedom of Listening Position	Limited	High	Moderate
Playback System Dependency	Very high	High	Low
Scalability	Low	Moderate	Excellent
Playback-End Complexity	Low	High	Moderate
Common Formats	Stereo; 5.1/7.1 surround	Ambisonics; wave-field synthesis (WFS)	Dolby Atmos; DTS:X; MPEG-H 3D Audio

Table 4: Comparative analysis of spatial audio output representations
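
As a rough illustration of the scene-based column, the sketch below encodes a mono sample into first-order Ambisonics (FOA), the format used by several datasets in this list. It is an assumption-laden example rather than code from the survey: it assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization), azimuth measured counterclockwise from the front, and elevation measured upward from the horizontal plane. Other conventions, such as FuMa, order and scale the channels differently. The function name `encode_foa` is hypothetical.

```python
import math


def encode_foa(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample as an FOA frame (W, Y, Z, X).

    Assumes AmbiX: ACN ordering and SN3D normalization.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample                               # omnidirectional component
    y = sample * math.sin(az) * math.cos(el)  # left-right axis
    z = sample * math.sin(el)                 # up-down axis
    x = sample * math.cos(az) * math.cos(el)  # front-back axis
    return (w, y, z, x)


# A source straight ahead excites only W and X.
front = encode_foa(1.0, 0.0, 0.0)
print(front)  # (1.0, 0.0, 0.0, 1.0)

# A source at 90 degrees azimuth (to the left) excites W and Y.
left = encode_foa(1.0, 90.0, 0.0)
```

Because the source direction is carried by the channel gains rather than by a loudspeaker layout, the same four channels can be decoded to different playback systems, which is the layout independence the table attributes to scene-based formats.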

4. Spatial Audio Understanding Models

4.1 SELD Papers

Paper	URL	Code
A probabilistic model for robust localization based on a binaural auditory front-end	Link	-
Sound event localization and detection of overlapping sources using convolutional recurrent neural networks	Link	Link
3D localization of multiple sound sources with intensity vector estimates in single source zones	Link	-
Towards generating ambisonics using audio-visual cue for virtual reality	Link	Link
Deepear: Sound localization with binaural microphones	Link	-
AD-YOLO: You look only once in training multiple sound event localization and detection	Link	-
Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy	Link	Link
An improved event-independent network for polyphonic sound event localization and detection	Link	Link
ACCDOA: Activity-coupled cartesian direction of arrival representation for sound event localization and detection	Link	-
Multi-accdoa: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training	Link	-
Salsa: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection	Link	Link
Sound event localization based on sound intensity vector refined by DNN-based denoising and source separation	Link	-
Binaural sound source distance estimation and localization for a moving listener	Link	Link
Binaural source localization using deep learning and head rotation information	Link	-
Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning	Link	Link
w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training	Link	Link
Self-supervised moving vehicle tracking with stereo sound	Link	-
Audio-visual event localization in unconstrained videos	Link	-
Sound event localization and detection using squeeze-excitation residual CNNs	Link	Link
BAST: Binaural audio spectrogram transformer for binaural sound localization	Link	Link
Sslide: Sound source localization for indoors based on deep learning	Link	-
Semi-supervised source localization with deep generative modeling	Link	-
Semi-supervised source localization in reverberant environments with deep generative modeling	Link	-

Table 5: List of SELD papers and their URLs

4.2 Spatial Audio Separation Papers

Paper	URL	Code
Source separation based on binaural cues and source model constraints	Link	-
The cocktail party robot: Sound source separation and localisation with an active binaural head	Link	-
Deep learning based binaural speech separation in reverberant environments	Link	-
Combining spectral and spatial features for deep learning based blind speaker separation	Link	-
Real-time binaural speech separation with preserved spatial cues	Link	-
Lavss: Location-guided audio-visual spatial audio separation	Link	Link
Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation	Link	-
Multichannel audio source separation with deep neural networks	Link	-
Integration of variational autoencoder and spatial clustering for adaptive multi-channel neural speech separation	Link	Link
Self-supervised generation of spatial audio for 360 video	Link	Link

Table 6: List of spatial audio separation papers and their URLs

4.3 Joint Learning Papers

Paper	URL	Code
Learning representations from audio-visual spatial alignment	Link	-
Telling left from right: Learning spatial correspondence of sight and sound	Link	Link
Av-nerf: Learning neural fields for real-world audio-visual scene synthesis	Link	Link
Learning neural acoustic fields	Link	Link
Overview of geometrical room acoustic modeling techniques	Link	-
Av-rir: Audio-visual room impulse response estimation	Link	Link
Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation	Link	-
Multi-Channel Mosra: Mean Opinion Score and Room Acoustics Estimation Using Simulated Data and A Teacher Model	Link	-
Few-shot audio-visual learning of environment acoustics	Link	Link
Blind room parameter estimation using multiple multichannel speech recordings	Link	Link
Visual-based spatial audio generation system for multi-speaker environments	Link	-
Learning Spatially-Aware Language and Audio Embeddings	Link	-
BAT: Learning to Reason about Spatial Sounds with Large Language Models	Link	Link

Table 7: List of joint learning papers and their URLs

5. Spatial Audio Generation Models

Paper	URL	Code
A structural model for binaural sound synthesis	Link	-
Neural synthesis of binaural speech from mono audio	Link	Link
2.5D visual sound	Link	Link
Cyclic Learning for Binaural Audio Generation and Localization	Link	-
Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention	Link	-
Geometry-aware multi-task learning for binaural audio generation from video	Link	-
Multi-attention audio-visual fusion network for audio spatialization	Link	-
Visually Guided Binaural Audio Generation with Cross-Modal Consistency	Link	-
Interpretable binaural ratio for visually guided binaural audio generation	Link	-
Cross-modal generative model for visual-guided binaural stereo generation	Link	-
Binauralgrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis	Link	Link
DopplerBAS: Binaural Audio Synthesis Addressing Doppler Effect	Link	-
Neural fourier shift for binaural speech rendering	Link	Link
Visually informed binaural audio generation without binaural audios	Link	Link
Localize to binauralize: Audio spatialization from visual sound source localization	Link	Link
Sep-stereo: Visually guided stereophonic audio generation by associating source separation	Link	Link
Exploiting audio-visual consistency with partial supervision for spatial audio generation	Link	-
Binaural audio generation via multi-task learning	Link	Link
End-to-end binaural speech synthesis	Link	-
Upmixing via style transfer: a variational autoencoder for disentangling spatial images and musical content	Link	-
ViSAGe: Video-to-Spatial Audio Generation	Link	Link
OmniAudio: Generating Spatial Audio from 360-Degree Video	Link	Link
Towards generating ambisonics using audio-visual cue for virtual reality	Link	Link
Av-nerf: Learning neural fields for real-world audio-visual scene synthesis	Link	Link
ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model	Link	-
Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation	Link	Link
DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model	Link	-
Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models	Link	-
Ambisonizer: Neural upmixing as spherical harmonics generation	Link	Link
ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting	Link	Link
Simple and controllable music generation	Link	Link
Moûsai: Text-to-music generation with long-context latent diffusion	Link	Link
Long-form music generation with latent diffusion	Link	Link
Listen2scene: Interactive material-aware binaural sound propagation for reconstructed 3d scenes	Link	Link
Immersive spatial audio reproduction for vr/ar using room acoustic modelling from 360 images	Link	-
See-2-sound: Zero-shot spatial environment-to-spatial sound	Link	Link

Table 8: List of spatial audio generation papers and their URLs

6. Spatial Audio Datasets

Dataset	Format	Collection	Hours	Type	Labels	URL
Sweet-Home	Multi	Recorded	47.3	Speech	Text	Link
Voice-Home	Multi	Recorded	2.5	Speech	Text, Geometric	Link
YT-ALL & REC-STREET	FOA	Crawled	116.5	Audio	Video, Text	Link
FAIR-Play	Binaural	Recorded	5.2	Audio	Video	Link
SECL-UMons	Multi	Recorded	5	Audio	Text, Geometric	Link
YT-360	FOA	Crawled	246	Audio	Video	Link
EasyCom	Binaural	Recorded	5	Speech	Geometric, Text	Link
Binaural_Dataset	Binaural	Recorded	2	Speech	Geometric	Link
SimBinaural	Binaural	Sim/Crawl	143	Audio	Video, Geometric	Link
Spatial LibriSpeech	FOA	Simulated	650	Speech	Text, Geometric	Link
Link	FOA	Recorded	7.5	Audio	Video, Geometric	Link
YT-Ambigen	FOA	Crawled	142	Audio	Video	Link
BEWO-1M	Binaural	Simulated	2.8k	Audio	Text/Image, Geometric	Link
MRSDrama	Binaural	Recorded	98	Speech	Text, Video, Geometric	Link

Table 9: List of spatial audio datasets and their URLs