
# ASAudio: A Survey of Advanced Spatial Audio Research

## 🚀 Quick Start

## Introduction

This is the official repository for the paper *ASAudio: A Survey of Advanced Spatial Audio Research*.

Figure 1: The timeline of spatial audio models & datasets in recent years.

## Abstract

With the rapid development of spatial audio technologies, applications in AR, VR, and other scenarios have garnered extensive attention. Unlike traditional mono audio, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their underlying technologies. To address this, we provide a comprehensive overview of spatial audio and systematically review the recent literature: we chronologically outline existing work and categorize these studies based on input-output representations as well as generation and understanding tasks, thereby summarizing the various research aspects of spatial audio. In addition, we review related datasets, evaluation metrics, and benchmarks, offering insights from both training and evaluation perspectives.

## Overview

Figure 2: Organization of this survey.

## Representations of Spatial Audio

### 1. Input Representations

| Attribute | Natural Language | Spatial Position | Visual Information | Monaural Audio |
| --- | --- | --- | --- | --- |
| Primary Info | Semantic, relational, implicit spatial | Explicit spatial, dynamic | Semantic, spatial, dynamic | Acoustic (timbre, pitch, content) |
| Control Precision | Low | Very high | High | N/A |
| Abstraction Level | High | Low | High | Low |
| Interpretability | Indirect | Direct | Indirect | Indirect |
| Key Challenges | Ambiguity; semantic–signal gap | No semantics; tedious authoring | Ambiguity; occlusion; compute cost | Lack of spatial cues |

Table 1: Comparative analysis of spatial audio input representations

Figure 3: The input representations and their fundamental processing steps.

### 2. Spatial Cues and Physical Modeling

#### 2.1 Room Impulse Response (RIR)

| Paper | URL | Code/Dataset |
| --- | --- | --- |
| Few-shot audio-visual learning of environment acoustics | Link | Link |
| SpatialScaper: a library to simulate and augment soundscapes for sound event localization and detection in realistic rooms | Link | Link |
| Novel-view acoustic synthesis from 3D reconstructed rooms | Link | Link |
| A binaural room impulse response database for the evaluation of dereverberation algorithms | Link | Link |
| The Sweet-Home speech and multimodal corpus for home automation interaction | Link | - |
| Dataset of Binaural Room Impulse Responses at Multiple Recording Positions, Source Positions, and Orientations in a Real Room | Link | - |
| dEchorate: a calibrated room impulse response database for echo-aware signal processing | Link | Link |
| BIRD: Big impulse response dataset | Link | Link |
| Visually informed binaural audio generation without binaural audios | Link | Link |
| MeshRIR: A dataset of room impulse responses on meshed grid points for evaluating sound field analysis and synthesis methods | Link | Link |
| Mesh2IR: Neural acoustic impulse response generator for complex 3D scenes | Link | Link |
| A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection | Link | Link |
| Acoustic analysis and dataset of transitions between coupled rooms | Link | - |
| Dataset of spatial room impulse responses in a variable acoustics room for six degrees-of-freedom rendering and analysis | Link | Link |
| On the authenticity of individual dynamic binaural synthesis | Link | - |

Table 2: The list of RIR papers and their URLs
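
The datasets above all capture the same core object: an impulse response h that maps a dry source signal to what a listener hears in a room, via the convolution y = x * h. The following is a minimal sketch of applying a binaural RIR (BRIR) with NumPy/SciPy; the signal and BRIR arrays are hypothetical stand-ins, not drawn from any dataset listed here.

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_brir(dry: np.ndarray, brir: np.ndarray) -> np.ndarray:
    """Convolve a mono signal (N,) with a two-channel BRIR (M, 2).

    Each output channel is the dry signal filtered by that ear's
    impulse response: y_ear = x * h_ear; output shape (N + M - 1, 2).
    """
    out = np.stack([fftconvolve(dry, brir[:, ch]) for ch in range(2)], axis=-1)
    return out / max(np.max(np.abs(out)), 1e-9)  # peak-normalize to avoid clipping

# Hypothetical stand-ins for a 1 s dry recording and a measured BRIR at 48 kHz.
dry = np.random.randn(48_000)
brir = np.random.randn(12_000, 2) * 0.01
binaural = apply_brir(dry, brir)  # shape (59_999, 2)
```
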
#### 2.2 Head-Related Transfer Function (HRTF)

| Paper | URL | Code/Dataset |
| --- | --- | --- |
| HRTF personalization based on ear morphology | Link | - |
| On HRTF Notch Frequency Prediction using Anthropometric Features and Neural Networks | Link | - |
| Magnitude modeling of personalized HRTF based on ear images and anthropometric measurements | Link | - |
| Global HRTF interpolation via learned affine transformation of hyper-conditioned features | Link | Link |
| HRTF recommendation based on the predicted binaural colouration model | Link | - |
| Modeling individual head-related transfer functions from sparse measurements using a convolutional neural network | Link | - |
| Head-related transfer function interpolation from spatially sparse measurements using autoencoder with source position conditioning | Link | Link |
| HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection | Link | Link |
| Spatial upsampling of head-related transfer functions using a physics-informed neural network | Link | Link |
| HRTF field: Unifying measured HRTF magnitude representation with neural fields | Link | Link |
| Head-related transfer function interpolation with a spherical CNN | Link | Link |
| HRTF interpolation using a spherical neural process meta-learner | Link | - |
| NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization | Link | Link |

Table 3: The list of HRTF papers and their URLs
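
Most of the papers above tackle interpolation or upsampling of sparsely measured HRTF sets. Below is a minimal sketch of the naive baseline they improve on: nearest-neighbour HRIR selection on a sparse measurement grid, followed by per-ear filtering. The grid, HRIR array, and shapes are hypothetical.

```python
import numpy as np
from scipy.signal import fftconvolve

def sph_to_unit(az_deg, el_deg):
    """Unit direction vector from azimuth/elevation in degrees."""
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    return np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])

def render_binaural(mono, grid_dirs, hrirs, az_deg, el_deg):
    """mono: (N,); grid_dirs: (P, 2) az/el in degrees; hrirs: (P, M, 2)."""
    target = sph_to_unit(az_deg, el_deg)
    # Pick the measured direction closest in angle to the requested one.
    dots = np.array([sph_to_unit(a, e) @ target for a, e in grid_dirs])
    h = hrirs[np.argmax(dots)]
    return np.stack([fftconvolve(mono, h[:, ch]) for ch in range(2)], axis=-1)

# Hypothetical sparse grid: 36 azimuths on the horizontal plane, 256-tap HRIRs.
grid = np.stack([np.arange(0, 360, 10), np.zeros(36)], axis=-1)
hrirs = np.random.randn(36, 256, 2) * 0.05
out = render_binaural(np.random.randn(16_000), grid, hrirs, az_deg=37.0, el_deg=0.0)
```
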

### 3. Output Representations

| Attribute | Channel-Based | Scene-Based | Object-Based |
| --- | --- | --- | --- |
| Freedom of Listening Position | Limited | High | Moderate |
| Playback System Dependency | Very high | High | Low |
| Scalability | Low | Moderate | Excellent |
| Playback-End Complexity | Low | High | Moderate |
| Common Formats | Stereo; 5.1/7.1 surround | Ambisonics; wave-field synthesis (WFS) | Dolby Atmos; DTS:X; MPEG-H 3D Audio |

Table 4: Comparative analysis of spatial audio output representations
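
To make the scene-based column concrete, the sketch below encodes a mono source into first-order Ambisonics (FOA). It assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization); the source position is an arbitrary example.

```python
import numpy as np

def encode_foa(mono: np.ndarray, az_deg: float, el_deg: float) -> np.ndarray:
    """Return an (N, 4) FOA signal [W, Y, Z, X] for a static point source."""
    az, el = np.deg2rad(az_deg), np.deg2rad(el_deg)
    gains = np.array([
        1.0,                      # W: omnidirectional pressure
        np.sin(az) * np.cos(el),  # Y: left/right
        np.sin(el),               # Z: up/down
        np.cos(az) * np.cos(el),  # X: front/back
    ])
    return mono[:, None] * gains[None, :]

# Mono source placed 45 degrees to the left, 10 degrees up.
foa = encode_foa(np.random.randn(16_000), az_deg=45.0, el_deg=10.0)
```

Because the encoding is just per-channel gains derived from spherical harmonics, the same scene can later be rotated or decoded to any loudspeaker layout, which is what gives scene-based formats their freedom of listening position in Table 4.
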

### 4. Spatial Audio Understanding Models

#### 4.1 SELD Papers

| Paper | URL | Code |
| --- | --- | --- |
| A probabilistic model for robust localization based on a binaural auditory front-end | Link | - |
| Sound event localization and detection of overlapping sources using convolutional recurrent neural networks | Link | Link |
| 3D localization of multiple sound sources with intensity vector estimates in single source zones | Link | - |
| Towards generating ambisonics using audio-visual cue for virtual reality | Link | Link |
| DeepEar: Sound localization with binaural microphones | Link | - |
| AD-YOLO: You look only once in training multiple sound event localization and detection | Link | - |
| Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy | Link | Link |
| An improved event-independent network for polyphonic sound event localization and detection | Link | Link |
| ACCDOA: Activity-coupled cartesian direction of arrival representation for sound event localization and detection | Link | - |
| Multi-ACCDOA: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training | Link | - |
| SALSA: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection | Link | Link |
| Sound event localization based on sound intensity vector refined by DNN-based denoising and source separation | Link | - |
| Binaural sound source distance estimation and localization for a moving listener | Link | Link |
| Binaural source localization using deep learning and head rotation information | Link | - |
| Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning | Link | Link |
| w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training | Link | Link |
| Self-supervised moving vehicle tracking with stereo sound | Link | - |
| Audio-visual event localization in unconstrained videos | Link | - |
| Sound event localization and detection using squeeze-excitation residual CNNs | Link | Link |
| BAST: Binaural audio spectrogram transformer for binaural sound localization | Link | Link |
| SSLIDE: Sound source localization for indoors based on deep learning | Link | - |
| Semi-supervised source localization with deep generative modeling | Link | - |
| Semi-supervised source localization in reverberant environments with deep generative modeling | Link | - |

Table 5: The list of SELD papers and their URLs
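
Two entries above (ACCDOA and Multi-ACCDOA) popularized a compact SELD output format: one 3-vector per sound class, whose direction is the estimated DOA and whose norm doubles as the detection activity. A minimal decoding sketch, with an illustrative activity threshold:

```python
import numpy as np

def decode_accdoa(vecs, thresh=0.5):
    """vecs: (T, C, 3) network output. Returns (frame, class, unit DOA) tuples."""
    norms = np.linalg.norm(vecs, axis=-1)  # (T, C): activity per frame and class
    events = []
    for t, c in zip(*np.nonzero(norms > thresh)):
        events.append((int(t), int(c), vecs[t, c] / norms[t, c]))
    return events

# Hypothetical output: 10 frames, 3 classes; class 1 active toward +x in frame 4.
pred = np.zeros((10, 3, 3))
pred[4, 1] = [0.9, 0.0, 0.0]
print(decode_accdoa(pred))  # [(4, 1, array([1., 0., 0.]))]
```
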

#### 4.2 Spatial Audio Separation Papers

| Paper | URL | Code |
| --- | --- | --- |
| Source separation based on binaural cues and source model constraints | Link | - |
| The cocktail party robot: Sound source separation and localisation with an active binaural head | Link | - |
| Deep learning based binaural speech separation in reverberant environments | Link | - |
| Combining spectral and spatial features for deep learning based blind speaker separation | Link | - |
| Real-time binaural speech separation with preserved spatial cues | Link | - |
| LAVSS: Location-guided audio-visual spatial audio separation | Link | Link |
| Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation | Link | - |
| Multichannel audio source separation with deep neural networks | Link | - |
| Integration of variational autoencoder and spatial clustering for adaptive multi-channel neural speech separation | Link | Link |
| Self-supervised generation of spatial audio for 360° video | Link | Link |

Table 6: The list of Spatial Audio Separation Papers and their URLs
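
Several of the papers above (e.g., the spectral-plus-spatial and binaural separation work) feed networks interaural cues alongside spectra. Below is a minimal sketch of the two standard cues, interaural level difference (ILD) and interaural phase difference (IPD), computed from STFTs of a binaural mixture; the STFT parameters are illustrative.

```python
import numpy as np
from scipy.signal import stft

def spatial_features(left, right, fs=16_000, nperseg=512):
    """ILD (dB) and IPD (radians) from a binaural mixture; each (F, T)."""
    _, _, L = stft(left, fs=fs, nperseg=nperseg)
    _, _, R = stft(right, fs=fs, nperseg=nperseg)
    eps = 1e-9
    ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))  # level difference
    ipd = np.angle(L * np.conj(R))                              # phase difference
    return ild, ipd

# Hypothetical 1 s binaural mixture; features can be stacked as network inputs.
ild, ipd = spatial_features(np.random.randn(16_000), np.random.randn(16_000))
```
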

#### 4.3 Joint Learning Papers

| Paper | URL | Code |
| --- | --- | --- |
| Learning representations from audio-visual spatial alignment | Link | - |
| Telling left from right: Learning spatial correspondence of sight and sound | Link | Link |
| AV-NeRF: Learning neural fields for real-world audio-visual scene synthesis | Link | Link |
| Learning neural acoustic fields | Link | Link |
| Overview of geometrical room acoustic modeling techniques | Link | - |
| AV-RIR: Audio-visual room impulse response estimation | Link | Link |
| Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation | Link | - |
| Multi-Channel MOSRA: Mean Opinion Score and Room Acoustics Estimation Using Simulated Data and A Teacher Model | Link | - |
| Few-shot audio-visual learning of environment acoustics | Link | Link |
| Blind room parameter estimation using multiple multichannel speech recordings | Link | Link |
| Visual-based spatial audio generation system for multi-speaker environments | Link | - |
| Learning Spatially-Aware Language and Audio Embeddings | Link | - |
| BAT: Learning to Reason about Spatial Sounds with Large Language Models | Link | Link |

Table 7: The list of Joint Learning Papers and their URLs

### 5. Spatial Audio Generation Models

| Paper | URL | Code |
| --- | --- | --- |
| A structural model for binaural sound synthesis | Link | - |
| Neural synthesis of binaural speech from mono audio | Link | Link |
| 2.5D visual sound | Link | Link |
| Cyclic Learning for Binaural Audio Generation and Localization | Link | - |
| Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention | Link | - |
| Geometry-aware multi-task learning for binaural audio generation from video | Link | - |
| Multi-attention audio-visual fusion network for audio spatialization | Link | - |
| Visually Guided Binaural Audio Generation with Cross-Modal Consistency | Link | - |
| Interpretable binaural ratio for visually guided binaural audio generation | Link | - |
| Cross-modal generative model for visual-guided binaural stereo generation | Link | - |
| BinauralGrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis | Link | Link |
| DopplerBAS: Binaural Audio Synthesis Addressing Doppler Effect | Link | - |
| Neural fourier shift for binaural speech rendering | Link | Link |
| Visually informed binaural audio generation without binaural audios | Link | Link |
| Localize to binauralize: Audio spatialization from visual sound source localization | Link | Link |
| Sep-Stereo: Visually guided stereophonic audio generation by associating source separation | Link | Link |
| Exploiting audio-visual consistency with partial supervision for spatial audio generation | Link | - |
| Binaural audio generation via multi-task learning | Link | Link |
| End-to-end binaural speech synthesis | Link | - |
| Upmixing via style transfer: a variational autoencoder for disentangling spatial images and musical content | Link | - |
| ViSAGe: Video-to-Spatial Audio Generation | Link | Link |
| OmniAudio: Generating Spatial Audio from 360-Degree Video | Link | Link |
| Towards generating ambisonics using audio-visual cue for virtual reality | Link | Link |
| AV-NeRF: Learning neural fields for real-world audio-visual scene synthesis | Link | Link |
| ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model | Link | - |
| Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation | Link | Link |
| DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model | Link | - |
| Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models | Link | - |
| Ambisonizer: Neural upmixing as spherical harmonics generation | Link | Link |
| ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting | Link | Link |
| Simple and controllable music generation | Link | Link |
| Moûsai: Text-to-music generation with long-context latent diffusion | Link | Link |
| Long-form music generation with latent diffusion | Link | Link |
| Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes | Link | Link |
| Immersive spatial audio reproduction for VR/AR using room acoustic modelling from 360° images | Link | - |
| SEE-2-SOUND: Zero-shot spatial environment-to-spatial sound | Link | Link |

Table 8: The list of Spatial Audio Generation Papers and their URLs
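
The first entry above (the classic structural model) underlies the simplest form of binaural synthesis: impose an interaural time difference (ITD) and level difference (ILD) on a mono signal. The sketch below uses Woodworth's ITD approximation and a crude constant head-shadow gain; both are simplifications for illustration, not the paper's actual filter design.

```python
import numpy as np

def structural_binaural(mono, az_deg, fs=48_000, head_radius=0.0875, c=343.0):
    """Delay-and-gain binaural rendering; positive azimuth = source to the left."""
    az = np.deg2rad(az_deg)
    # Woodworth's ITD approximation: (a / c) * (sin|az| + |az|).
    itd = (head_radius / c) * (np.sin(abs(az)) + abs(az))
    lag = int(round(itd * fs))  # integer-sample delay (a simplification)
    near = np.concatenate([mono, np.zeros(lag)])        # ear facing the source
    far = np.concatenate([np.zeros(lag), mono]) * 0.7   # delayed, shadowed ear
    lr = (near, far) if az >= 0 else (far, near)
    return np.stack(lr, axis=-1)

# 500 Hz tone rendered 60 degrees to the left of the listener.
t = np.arange(48_000) / 48_000
out = structural_binaural(np.sin(2 * np.pi * 500 * t), az_deg=60.0)
```

The neural approaches in the table can be read as learned replacements for each stage of this pipeline: delay estimation (e.g., Neural Fourier Shift), head/room filtering (e.g., BinauralGrad), and conditioning from video or text instead of explicit angles.
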

### 6. Spatial Audio Datasets

| Dataset | Format | Collection | Hours | Type | Labels | URL |
| --- | --- | --- | --- | --- | --- | --- |
| Sweet-Home | Multi-channel | Recorded | 47.3 | Speech | Text | Link |
| Voice-Home | Multi-channel | Recorded | 2.5 | Speech | Text, Geometric | Link |
| YT-ALL & REC-STREET | FOA | Crawled | 116.5 | Audio | Video, Text | Link |
| FAIR-Play | Binaural | Recorded | 5.2 | Audio | Video | Link |
| SECL-UMons | Multi-channel | Recorded | 5 | Audio | Text, Geometric | Link |
| YT-360 | FOA | Crawled | 246 | Audio | Video | Link |
| EasyCom | Binaural | Recorded | 5 | Speech | Geometric, Text | Link |
| Binaural_Dataset | Binaural | Recorded | 2 | Speech | Geometric | Link |
| SimBinaural | Binaural | Simulated/Crawled | 143 | Audio | Video, Geometric | Link |
| Spatial LibriSpeech | FOA | Simulated | 650 | Speech | Text, Geometric | Link |
| Link | FOA | Recorded | 7.5 | Audio | Video, Geometric | Link |
| YT-Ambigen | FOA | Crawled | 142 | Audio | Video | Link |
| BEWO-1M | Binaural | Simulated | 2.8k | Audio | Text/Image, Geometric | Link |
| MRSDrama | Binaural | Recorded | 98 | Speech | Text, Video, Geometric | Link |

Table 9: The list of Spatial Audio Datasets and their URLs
