EDMSound

Codebase and project page for EDMSound (Demopage). This codebase covers only the copy-detection part of the paper; the code for the EDMSound diffusion model will be released in our follow-up work.

Description

Diffusion models have showcased their capabilities in audio synthesis across a variety of sounds. Existing models often operate in the latent domain with cascaded phase-recovery modules to reconstruct the waveform, which can introduce challenges in generating high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based generative model in the spectrogram domain under the framework of elucidated diffusion models (EDM). Combined with an efficient deterministic sampler, we achieved a Fréchet audio distance (FAD) score similar to the top-ranked baseline with only 10 steps, and reached state-of-the-art performance with 50 steps on the DCASE2023 foley sound generation benchmark. We also revealed a potential concern with diffusion-based audio generation models: they tend to generate samples with high perceptual similarity to their training data.


Setup

Install dependencies

```bash
# clone project
git clone https://github.com/AgentCooper2002/EDMSound
cd EDMSound

# [OPTIONAL] create conda environment
conda create -n diffaudio python=3.8
conda activate diffaudio

# install pytorch (>=2.0.1), e.g. with cuda=11.7:
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia

# install requirements
pip install -r requirements.txt
```

Hydra-lightning

A config management tool that decouples dataloaders, training logic, network backbones, etc., so each can be swapped independently through YAML configs.
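
For context, here is a minimal sketch of how a Hydra entry point composes these pieces. The config keys (`cfg.data`, `cfg.model`, `cfg.trainer`) follow common lightning-hydra-template conventions and are assumptions, not necessarily this repo's exact layout:

```python
import hydra
from omegaconf import DictConfig


@hydra.main(version_base="1.3", config_path="configs", config_name="train.yaml")
def main(cfg: DictConfig) -> None:
    # Each sub-config carries a _target_ class path, so swapping a
    # dataloader or backbone is a one-line YAML change, not a code edit.
    datamodule = hydra.utils.instantiate(cfg.data)
    model = hydra.utils.instantiate(cfg.model)
    trainer = hydra.utils.instantiate(cfg.trainer)
    trainer.fit(model=model, datamodule=datamodule)


if __name__ == "__main__":
    main()
```

This is also why the commands below can override settings like `+trainer.precision=16` or `experiment=...` directly on the command line.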

How to run

Change root_dir in EDMSound/configs/paths/default.yaml to your own working directory, e.g. /path/to/your/EDMSound/.

Extract audio embeddings using pretrained CLAP

First, extract audio embeddings with the pretrained CLAP model. Make sure to extract both the dataset audio embeddings and the generated audio embeddings. Run

```bash
CUDA_VISIBLE_DEVICES=0 python script/extract_clap_embeddings.py
```
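
For reference, a minimal sketch of what such an extraction step can look like with the laion-clap package; the package choice, file paths, and output location are assumptions, and the actual script may differ:

```python
import glob

import numpy as np
import laion_clap

# Load a pretrained CLAP checkpoint (downloads default weights on first run).
model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()

# Embed every wav file in a directory; repeat for both the dataset audio
# and the generated audio.
files = sorted(glob.glob("data/generated/*.wav"))
embeddings = model.get_audio_embedding_from_filelist(x=files, use_tensor=False)
np.save("embeddings/generated.npy", embeddings)
```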

Run copy detection using pretrained CLAP

To run copy detection between the generated audio and the training dataset with pretrained CLAP, make sure zero_shot is set to True in the experiment yaml file, then run

```bash
CUDA_VISIBLE_DEVICES=0 python src/eval.py +trainer.precision=16 experiment=ssl_fine_tune_gen_eval.yaml ckpt_path='dummy.ckpt'
```

To run copy detection between the training dataset and itself with pretrained CLAP, make sure zero_shot is set to True in the experiment yaml file, then run

```bash
CUDA_VISIBLE_DEVICES=0 python src/eval.py +trainer.precision=16 experiment=ssl_fine_tune_self_eval.yaml ckpt_path='dummy.ckpt'
```
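
Conceptually, this step reduces to a nearest-neighbor search in CLAP embedding space. A minimal sketch, assuming the .npy files are the hypothetical outputs of the extraction step above:

```python
import numpy as np

gen = np.load("embeddings/generated.npy")    # shape (N_gen, D)
train = np.load("embeddings/train.npy")      # shape (N_train, D)

# L2-normalize so the dot product equals cosine similarity.
gen = gen / np.linalg.norm(gen, axis=1, keepdims=True)
train = train / np.linalg.norm(train, axis=1, keepdims=True)

sim = gen @ train.T                 # pairwise cosine similarities
best_match = sim.argmax(axis=1)     # closest training clip per generated clip
best_score = sim.max(axis=1)        # its similarity score

# High scores flag generated clips that may copy training data.
for i in np.argsort(-best_score)[:10]:
    print(f"gen clip {i} -> train clip {best_match[i]}: {best_score[i]:.3f}")
```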

Finetune CLAP

To finetune CLAP for copy detection, run

```bash
CUDA_VISIBLE_DEVICES=0 python src/train.py +trainer.precision=16 experiment=clap_fine_tune.yaml
```

To run copy detection with the finetuned CLAP, set zero_shot to False in the desired experiment yaml file and rerun the commands above.

Generate plots

To generate the plots from the paper, run

```bash
python script/similarity_distribution_plot.py
```
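
As a rough illustration, plotting the distribution of the per-clip maximum similarities from the previous step could look like the following matplotlib sketch; the input path is hypothetical and the paper's actual plotting script may differ:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical output of the copy-detection step above.
best_score = np.load("results/best_scores.npy")

plt.hist(best_score, bins=50, density=True, alpha=0.7)
plt.xlabel("Max cosine similarity to training set")
plt.ylabel("Density")
plt.title("Similarity distribution of generated clips")
plt.savefig("similarity_distribution.png", dpi=200)
```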

Resources

This repo was generated from the lightning-hydra-template.
