
Sound Localization by Self-Supervised Time Delay Estimation

Ziyang Chen, David F. Fouhey, Andrew Owens
University of Michigan


This repository contains the official codebase for Sound Localization by Self-Supervised Time Delay Estimation. [Project Page]

StereoCRW Illustration
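
As background (a standard far-field acoustics approximation, not something specific to this codebase): two microphones spaced d apart receive a distant source at azimuth theta with an interaural time delay of roughly d * sin(theta) / c, where c is the speed of sound. A quick sketch of this relation, assuming a 20 cm microphone spacing:

import math

def itd_seconds(azimuth_deg, mic_spacing_m=0.2, speed_of_sound_mps=343.0):
    # Far-field approximation: delay = d * sin(theta) / c
    return mic_spacing_m * math.sin(math.radians(azimuth_deg)) / speed_of_sound_mps

# A source 90 degrees to the side gives the maximum ITD, about 0.58 ms here.
print(itd_seconds(90.0) * 1000.0)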

Environment

To set up the environment, simply run

conda env create -f environment.yml
conda activate Stereo

Datasets

Free Music Archive (FMA)

We perform self-supervised learning on this dataset. The data can be downloaded from the official FMA GitHub repo.

FAIR-Play

Data can be downloaded from the official FAIR-Play GitHub repo.

TDE-Simulation

We create a simulated test set using Pyroomacoustics. It contains approximately 6K stereo audio samples from three simulated environments with rooms of different sizes and microphone positions. We use TIMIT as the sound database. Our data can be downloaded from Here. You can download our dataset by running:

cd Dataset/TDE-Simulation
chmod +x download_tde.sh
./download_tde.sh

We also provide the code for generating the stereo sound in Dataset/TDE-Simulation/data-generation-advance.py, so you can create your own evaluation set. The evaluation information is provided in Dataset/TDE-Simulation/data-split.
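
For reference, here is a minimal sketch of how a stereo sample with known geometry can be simulated with Pyroomacoustics. The room size, source position, and microphone positions below are illustrative placeholders, not the dataset's actual parameters (those live in data-generation-advance.py):

import numpy as np
import pyroomacoustics as pra

fs = 16000                                  # TIMIT sample rate
signal = np.random.randn(fs)                # stand-in for a TIMIT utterance

# A shoebox room; the real dataset varies room sizes and mic positions.
room = pra.ShoeBox([6.0, 4.0, 3.0], fs=fs, max_order=10)
room.add_source([4.0, 3.0, 1.5], signal=signal)

# Two microphones 20 cm apart form the stereo pair.
mic_positions = np.c_[[2.0, 2.0, 1.5], [2.2, 2.0, 1.5]]
room.add_microphone_array(pra.MicrophoneArray(mic_positions, fs))

room.simulate()
stereo = room.mic_array.signals             # shape: (2, num_samples)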

In-the-wild data

We collected 1K samples from 30 internet binaural videos and used human judgments to label sound directions. These videos contain a variety of sounds, including engine noise and human speech, which are often far from the viewer. The processed data can be downloaded from Here. You can download our dataset by running:

cd Dataset/Youtube-Binaural
chmod +x download_inthewild.sh
./download_inthewild.sh

We also provide the YouTube IDs and timestamps in Dataset/Youtube-Binaural/data-info/in-the-wild.csv, so you can download and process the videos with Dataset/Youtube-Binaural/multi-download-process.sh. Labels are provided in Dataset/Youtube-Binaural/data-split/in-the-wild/test_with_label.csv.

Visually-guided Time Delay Simulation Dataset

We use audio clips from VoxCeleb2 with the simulation parameters from TDE-Simulation. We select 500 speakers from the database and pair them with their corresponding face images. The processed data can be downloaded from Here. You can download our dataset by running:

cd Dataset/VoxCeleb2
chmod +x download_voxceleb2_simulation.sh
./download_voxceleb2_simulation.sh

We have provided the evaluation information in Dataset/VoxCeleb2/data-split/voxceleb-tde/Easy/test.csv.

Model Zoo

We release several models pre-trained with our proposed methods. We hope they benefit the research community.

Method     | size, stride, num | Train Set  | Test Set             | MAE (ms) | RMSE (ms) | url
-----------|-------------------|------------|----------------------|----------|-----------|----
MonoCLR    | 1024, 4, 49       | Free-Music | TDE-Simulation       | 0.187    | 0.335     | url
ZeroNCE    | 1024, 4, 49       | Free-Music | TDE-Simulation       | 0.174    | 0.319     | url
StereoCRW  | 1024, 4, 49       | Free-Music | TDE-Simulation       | 0.133    | 0.259     | url
AV-MonoCLR | 15360, 4, 49      | VoxCeleb2  | VoxCeleb2-Simulation | -        | 0.304     | url

Note that the models above are trained on 0.064 s audio clips, but you can run inference directly with different audio lengths without retraining. We also provide some pre-trained models trained on longer audio inputs (0.48 s), intended only to accelerate training. To download all the checkpoints, simply run

./scripts/download_models.sh
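
For clarity, MAE and RMSE in the table above denote the mean absolute error and root-mean-square error of the predicted time delays, reported in milliseconds. A minimal sketch of how these metrics can be computed (the function name and inputs are illustrative):

import numpy as np

def tde_errors_ms(pred_delays_s, gt_delays_s):
    # Errors between predicted and ground-truth delays, converted to ms.
    err_ms = (np.asarray(pred_delays_s) - np.asarray(gt_delays_s)) * 1000.0
    mae = np.mean(np.abs(err_ms))
    rmse = np.sqrt(np.mean(err_ms ** 2))
    return mae, rmse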

Train & Evaluation

We provide training and evaluation scripts under scripts; please check each bash file before running.

Training

  • To train our StereoCRW method on FMA, simply run ./scripts/training/train-StereoCRW-FMA.sh from the repository root.
  • To train our MonoCLR method on FMA, simply run ./scripts/training/train-MonoCLR-FMA.sh from the repository root.
  • To train our ZeroNCE method on FMA, simply run ./scripts/training/train-ZeroNCE-FMA.sh from the repository root.
  • To train our AV-MonoCLR method on VoxCeleb2, simply run ./scripts/training/train-AVMonoCLR-VoxCeleb2.sh from the repository root.

Evaluation

  • To evaluate our model on the TDE-Simulation dataset, simply run ./scripts/evaluation/evaluation_tde.sh from the repository root.
  • To evaluate our model on the TDE-Simulation dataset under the mixture condition, simply run ./scripts/evaluation/evaluation_mixture_tde.sh from the repository root.
  • To evaluate our model on the In-the-wild dataset, simply run ./scripts/evaluation/evaluation_inthewild.sh from the repository root.
  • To evaluate our visually-guided ITD estimation model on the Visually-guided Time Delay Simulation dataset, simply run ./scripts/evaluation/evaluation_vgITD.sh from the repository root.

For each script, you can change the checkpoint inside the bash file.

Visualization Demo

We provide code for visualizing the ITD predictions of videos over time in vis_scripts/vis_video_itd.py. You can follow the steps below to generate visualization results for your own videos:

  • Create a folder for your test videos with mkdir Dataset/DemoVideo/RawVideos/YourVideo, and save your videos to this path.
  • For preprocessing the video, simply run:
    cd Dataset/DemoVideo
    chmod +x process.sh
    ./process.sh 'YourVideo'
  • To run inference on the video, go back to the repository root and simply run
    ./scripts/visualization_video.sh 'YourVideo' YOUR_SAVE_PATH
    The video results will appear under results/YOUR_SAVE_PATH.
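
To illustrate roughly what this visualization computes, here is a sketch that estimates a per-window ITD over time, using plain cross-correlation as a stand-in for the learned model; the repository's actual implementation is vis_scripts/vis_video_itd.py, and the window/hop sizes below are illustrative:

import numpy as np

def itd_over_time(stereo, fs, win_s=0.064, hop_s=0.032, max_delay_ms=1.0):
    # stereo: array of shape (2, num_samples)
    win, hop = int(win_s * fs), int(hop_s * fs)
    max_lag = int(max_delay_ms / 1000.0 * fs)
    itds = []
    for start in range(0, stereo.shape[1] - win, hop):
        left = stereo[0, start:start + win]
        right = stereo[1, start:start + win]
        # Lag (within +/- max_lag) that maximizes cross-correlation.
        xcorr = np.correlate(left, right, mode="full")
        lags = np.arange(-win + 1, win)
        valid = np.abs(lags) <= max_lag
        best_lag = lags[valid][np.argmax(xcorr[valid])]
        itds.append(best_lag / fs * 1000.0)  # per-window delay in ms
    return np.array(itds)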

Citation

If you find this code useful, please consider citing:

@inproceedings{
    chen2022sound,
    title={Sound Localization by Self-Supervised Time Delay Estimation},
    author={Chen, Ziyang and Fouhey, David F. and Owens, Andrew},
    booktitle={European Conference on Computer Vision (ECCV)},
    year={2022}
}

Acknowledgment

This work was funded in part by DARPA Semafor and Cisco Systems. The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
