This is the official repository for ASAudio: A Survey of Advanced Spatial Audio Research.
Abstract
With the rapid development of spatial audio technologies, applications in AR, VR, and other scenarios have attracted extensive attention. Unlike traditional mono sound, spatial audio offers a more realistic and immersive auditory experience. Despite notable progress in the field, there remains a lack of comprehensive surveys that systematically organize and analyze these methods and their underlying technologies. To address this gap, this paper provides a comprehensive overview of spatial audio and systematically reviews recent literature in the area. We chronologically outline existing work and categorize these studies by input-output representation and by generation and understanding tasks, thereby summarizing the main research aspects of spatial audio. In addition, we review related datasets, evaluation metrics, and benchmarks, offering insights from both training and evaluation perspectives.
Attribute | Natural Language | Spatial Position | Visual Information | Monaural Audio |
---|---|---|---|---|
Primary Info | Semantic, relational, implicit spatial | Explicit spatial, dynamic | Semantic, spatial, dynamic | Acoustic (timbre, pitch, content) |
Control Precision | Low | Very high | High | N/A |
Abstraction Level | High | Low | High | Low |
Interpretability | Indirect | Direct | Indirect | Indirect |
Key Challenges | Ambiguity; semantic–signal gap | No semantics; tedious authoring | Ambiguity; occlusion; compute cost | Lack of spatial cues |
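
The "Spatial Position" column above describes a condition that is explicit and directly interpretable, while monaural audio on its own lacks spatial cues. As a rough illustration of that contrast (not a method taken from the survey), the sketch below turns an azimuth into an interaural time difference (ITD) using Woodworth's spherical-head approximation and applies it to a mono signal. The head radius, the function names `woodworth_itd` and `apply_itd`, and the sign convention (positive azimuth means the source is to the listener's left) are assumptions chosen for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature
HEAD_RADIUS = 0.0875    # m; a commonly used average value (assumption)

def woodworth_itd(azimuth_deg):
    """Far-field ITD in seconds from Woodworth's spherical-head formula.

    Intended for azimuths in [-90, 90] degrees; the sign follows the azimuth.
    """
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

def apply_itd(mono, azimuth_deg, sample_rate):
    """Return (left, right) channels that differ only by an integer-sample delay."""
    delay = round(abs(woodworth_itd(azimuth_deg)) * sample_rate)
    near = list(mono) + [0.0] * delay  # ear facing the source: no delay
    far = [0.0] * delay + list(mono)   # opposite ear: delayed copy
    # Positive azimuth = source on the left, so the left ear is the near ear.
    return (near, far) if azimuth_deg >= 0 else (far, near)

left, right = apply_itd([1.0, 0.5], azimuth_deg=90.0, sample_rate=48000)
```

This covers only one cue. A realistic renderer would also model level differences and spectral filtering (for example with HRTFs), which is what many of the methods listed in this repository learn or approximate.
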
Paper | URL | Code/Dataset |
---|---|---|
Few-shot audio-visual learning of environment acoustics | Link | Link |
Spatial Scaper: a library to simulate and augment soundscapes for sound event localization and detection in realistic rooms | Link | Link |
Novel-view acoustic synthesis from 3D reconstructed rooms | Link | Link |
A binaural room impulse response database for the evaluation of dereverberation algorithms | Link | Link |
The Sweet-Home speech and multimodal corpus for home automation interaction | Link | - |
Dataset of Binaural Room Impulse Responses at Multiple Recording Positions, Source Positions, and Orientations in a Real Room | Link | - |
dEchorate: a calibrated room impulse response database for echo-aware signal processing | Link | Link |
BIRD: Big impulse response dataset | Link | Link |
Visually informed binaural audio generation without binaural audios | Link | Link |
MeshRIR: A dataset of room impulse responses on meshed grid points for evaluating sound field analysis and synthesis methods | Link | Link |
MESH2IR: Neural acoustic impulse response generator for complex 3D scenes | Link | Link |
A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection | Link | Link |
Acoustic analysis and dataset of transitions between coupled rooms | Link | - |
Dataset of spatial room impulse responses in a variable acoustics room for six degrees-of-freedom rendering and analysis | Link | Link |
On the authenticity of individual dynamic binaural synthesis | Link | - |
Paper | URL | Code/Dataset |
---|---|---|
HRTF personalization based on ear morphology | Link | - |
On HRTF Notch Frequency Prediction using Anthropometric Features and Neural Networks | Link | - |
Magnitude modeling of personalized HRTF based on ear images and anthropometric measurements | Link | - |
Global HRTF interpolation via learned affine transformation of hyper-conditioned features | Link | Link |
HRTF recommendation based on the predicted binaural colouration model | Link | - |
Modeling individual head-related transfer functions from sparse measurements using a convolutional neural network | Link | - |
Head-related transfer function interpolation from spatially sparse measurements using autoencoder with source position conditioning | Link | Link |
HRTF upsampling with a generative adversarial network using a gnomonic equiangular projection | Link | Link |
Spatial upsampling of head-related transfer functions using a physics-informed neural network | Link | Link |
HRTF field: Unifying measured HRTF magnitude representation with neural fields | Link | Link |
Head-related transfer function interpolation with a spherical CNN | Link | Link |
HRTF interpolation using a spherical neural process meta-learner | Link | - |
NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization | Link | Link |
Attribute | Channel-Based | Scene-Based | Object-Based |
---|---|---|---|
Freedom of Listening Position | Limited | High | Moderate |
Playback System Dependency | Very high | High | Low |
Scalability | Low | Moderate | Excellent |
Playback-End Complexity | Low | High | Moderate |
Common Formats | Stereo; 5.1/7.1 surround | Ambisonics; wave-field synthesis (WFS) | Dolby Atmos; DTS:X; MPEG-H 3D Audio |
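
To make the scene-based column more concrete, here is a minimal sketch of encoding a mono source into first-order Ambisonics (FOA), the format used by several datasets listed in this repository. It assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization) and a far-field point source; the function name `encode_foa` is made up for this example and does not come from any library.

```python
import math

def encode_foa(mono, azimuth_deg, elevation_deg=0.0):
    """Encode mono samples into four FOA channels in ACN order: W, Y, Z, X.

    Azimuth is measured counter-clockwise from the front (positive = left);
    elevation is measured upward from the horizontal plane.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gains = (
        1.0,                          # W: omnidirectional component
        math.sin(az) * math.cos(el),  # Y: left-right axis
        math.sin(el),                 # Z: up-down axis
        math.cos(az) * math.cos(el),  # X: front-back axis
    )
    return [[g * s for s in mono] for g in gains]

# A short 1 kHz tone at 48 kHz, placed 90 degrees to the listener's left.
sample_rate = 48000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(480)]
w, y, z, x = encode_foa(tone, azimuth_deg=90.0)
```

The four channels describe the sound field rather than any loudspeaker layout, which is why the table lists a higher playback-end complexity for scene-based audio: a decoder still has to map W, Y, Z, and X to the actual speakers or to binaural output.
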
Paper | URL | Code |
---|---|---|
A probabilistic model for robust localization based on a binaural auditory front-end | Link | - |
Sound event localization and detection of overlapping sources using convolutional recurrent neural networks | Link | Link |
3D localization of multiple sound sources with intensity vector estimates in single source zones | Link | - |
Towards generating ambisonics using audio-visual cue for virtual reality | Link | Link |
DeepEar: Sound localization with binaural microphones | Link | - |
AD-YOLO: You look only once in training multiple sound event localization and detection | Link | - |
Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy | Link | Link |
An improved event-independent network for polyphonic sound event localization and detection | Link | Link |
ACCDOA: Activity-coupled Cartesian direction of arrival representation for sound event localization and detection | Link | - |
Multi-ACCDOA: Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training | Link | - |
SALSA: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection | Link | Link |
Sound event localization based on sound intensity vector refined by DNN-based denoising and source separation | Link | - |
Binaural sound source distance estimation and localization for a moving listener | Link | Link |
Binaural source localization using deep learning and head rotation information | Link | - |
Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning | Link | Link |
w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training | Link | Link |
Self-supervised moving vehicle tracking with stereo sound | Link | - |
Audio-visual event localization in unconstrained videos | Link | - |
Sound event localization and detection using squeeze-excitation residual CNNs | Link | Link |
BAST: Binaural audio spectrogram transformer for binaural sound localization | Link | Link |
SSLIDE: Sound source localization for indoors based on deep learning | Link | - |
Semi-supervised source localization with deep generative modeling | Link | - |
Semi-supervised source localization in reverberant environments with deep generative modeling | Link | - |
Paper | URL | Code |
---|---|---|
Source separation based on binaural cues and source model constraints | Link | - |
The cocktail party robot: Sound source separation and localisation with an active binaural head | Link | - |
Deep learning based binaural speech separation in reverberant environments | Link | - |
Combining spectral and spatial features for deep learning based blind speaker separation | Link | - |
Real-time binaural speech separation with preserved spatial cues | Link | - |
LAVSS: Location-guided audio-visual spatial audio separation | Link | Link |
Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation | Link | - |
Multichannel audio source separation with deep neural networks | Link | - |
Integration of variational autoencoder and spatial clustering for adaptive multi-channel neural speech separation | Link | Link |
Self-supervised generation of spatial audio for 360 video | Link | Link |
Paper | URL | Code |
---|---|---|
Learning representations from audio-visual spatial alignment | Link | - |
Telling left from right: Learning spatial correspondence of sight and sound | Link | Link |
AV-NeRF: Learning neural fields for real-world audio-visual scene synthesis | Link | Link |
Learning neural acoustic fields | Link | Link |
Overview of geometrical room acoustic modeling techniques | Link | - |
AV-RIR: Audio-visual room impulse response estimation | Link | Link |
Impulse response data augmentation and deep neural networks for blind room acoustic parameter estimation | Link | - |
Multi-Channel MOSRA: Mean Opinion Score and Room Acoustics Estimation Using Simulated Data and a Teacher Model | Link | - |
Few-shot audio-visual learning of environment acoustics | Link | Link |
Blind room parameter estimation using multiple multichannel speech recordings | Link | Link |
Visual-based spatial audio generation system for multi-speaker environments | Link | - |
Learning Spatially-Aware Language and Audio Embeddings | Link | - |
BAT: Learning to Reason about Spatial Sounds with Large Language Models | Link | Link |
Paper | URL | Code |
---|---|---|
A structural model for binaural sound synthesis | Link | - |
Neural synthesis of binaural speech from mono audio | Link | Link |
2.5D visual sound | Link | Link |
Cyclic Learning for Binaural Audio Generation and Localization | Link | - |
Beyond mono to binaural: Generating binaural audio from mono audio with depth and cross modal attention | Link | - |
Geometry-aware multi-task learning for binaural audio generation from video | Link | - |
Multi-attention audio-visual fusion network for audio spatialization | Link | - |
Visually Guided Binaural Audio Generation with Cross-Modal Consistency | Link | - |
Interpretable binaural ratio for visually guided binaural audio generation | Link | - |
Cross-modal generative model for visual-guided binaural stereo generation | Link | - |
BinauralGrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis | Link | Link |
DopplerBAS: Binaural Audio Synthesis Addressing Doppler Effect | Link | - |
Neural Fourier shift for binaural speech rendering | Link | Link |
Visually informed binaural audio generation without binaural audios | Link | Link |
Localize to binauralize: Audio spatialization from visual sound source localization | Link | Link |
Sep-Stereo: Visually guided stereophonic audio generation by associating source separation | Link | Link |
Exploiting audio-visual consistency with partial supervision for spatial audio generation | Link | - |
Binaural audio generation via multi-task learning | Link | Link |
End-to-end binaural speech synthesis | Link | - |
Upmixing via style transfer: a variational autoencoder for disentangling spatial images and musical content | Link | - |
ViSAGe: Video-to-Spatial Audio Generation | Link | Link |
OmniAudio: Generating Spatial Audio from 360-Degree Video | Link | Link |
Towards generating ambisonics using audio-visual cue for virtual reality | Link | Link |
AV-NeRF: Learning neural fields for real-world audio-visual scene synthesis | Link | Link |
ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model | Link | - |
Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation | Link | Link |
DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model | Link | - |
Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models | Link | - |
Ambisonizer: Neural upmixing as spherical harmonics generation | Link | Link |
ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting | Link | Link |
Simple and controllable music generation | Link | Link |
Moûsai: Text-to-music generation with long-context latent diffusion | Link | Link |
Long-form music generation with latent diffusion | Link | Link |
Listen2Scene: Interactive material-aware binaural sound propagation for reconstructed 3D scenes | Link | Link |
Immersive spatial audio reproduction for VR/AR using room acoustic modelling from 360 images | Link | - |
SEE-2-SOUND: Zero-shot spatial environment-to-spatial sound | Link | Link |
Dataset | Format | Collect | Hours | Type | Labels | URL |
---|---|---|---|---|---|---|
Sweet-Home | Multi | Recorded | 47.3 | Speech | Text | Link |
Voice-Home | Multi | Recorded | 2.5 | Speech | Text, Geometric | Link |
YT-ALL & REC-STREET | FOA | Crawled | 116.5 | Audio | Video, Text | Link |
FAIR-Play | Binaural | Recorded | 5.2 | Audio | Video | Link |
SECL-UMons | Multi | Recorded | 5 | Audio | Text, Geometric | Link |
YT-360 | FOA | Crawled | 246 | Audio | Video | Link |
EasyCom | Binaural | Recorded | 5 | Speech | Geometric, Text | Link |
Binaural_Dataset | Binaural | Recorded | 2 | Speech | Geometric | Link |
SimBinaural | Binaural | Sim/Crawl | 143 | Audio | Video, Geometric | Link |
Spatial LibriSpeech | FOA | Simulated | 650 | Speech | Text, Geometric | Link |
Link | FOA | Recorded | 7.5 | Audio | Video, Geometric | Link |
YT-Ambigen | FOA | Crawled | 142 | Audio | Video | Link |
BEWO-1M | Binaural | Simulated | 2.8k | Audio | Text/Image, Geo | Link |
MRSDrama | Binaural | Recorded | 98 | Speech | Text, Video, Geo | Link |