
# Visually Indicated Sounds

Implementation and extension of the paper *Visually Indicated Sounds* by Owens et al., which proposes the task of predicting what sound an object makes when struck as a way of studying physical interactions within a visual scene.

## Brief Description of the Paper

The authors present an algorithm that synthesizes sound from silent videos of people hitting and scratching objects with a drumstick. This algorithm uses a recurrent neural network to predict sound features from videos and then produces a waveform from these features with an example-based synthesis procedure. The authors show that the sounds predicted by their model are realistic enough to fool participants in a “real or fake” psychophysical experiment and that they convey significant information about material properties and physical interactions.
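To make the pipeline concrete, here is a minimal PyTorch-style sketch of the video-to-sound-features stage described above. The class name, layer sizes, and feature dimensions (`FrameEncoder`-style stub MLP, 512-dim frame features, 42-dim sound features) are illustrative assumptions, not the exact configuration from the paper or this repository.

```python
# Sketch only: a recurrent network that maps per-frame visual features to a
# sequence of sound features. Dimensions and names are assumptions.
import torch
import torch.nn as nn

class SoundFeaturePredictor(nn.Module):
    def __init__(self, frame_feat_dim=512, hidden_dim=256, sound_feat_dim=42):
        super().__init__()
        # Per-frame visual encoder (the paper uses CNN features; a stub MLP here).
        self.frame_encoder = nn.Sequential(
            nn.Linear(frame_feat_dim, hidden_dim), nn.ReLU()
        )
        # Recurrent network over the frame sequence.
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # Regress one sound-feature vector (e.g. cochleagram bands) per time step.
        self.head = nn.Linear(hidden_dim, sound_feat_dim)

    def forward(self, frame_feats):            # (batch, time, frame_feat_dim)
        x = self.frame_encoder(frame_feats)    # (batch, time, hidden_dim)
        x, _ = self.rnn(x)                     # (batch, time, hidden_dim)
        return self.head(x)                    # (batch, time, sound_feat_dim)

# Example usage with dummy frame features for one 30-frame clip.
feats = torch.randn(1, 30, 512)
pred = SoundFeaturePredictor()(feats)          # -> shape (1, 30, 42)
```

The predicted feature sequence would then be converted to a waveform with the example-based synthesis step (retrieving matching sounds from the training set), which is not shown here.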

## Implementations

All implementation code can be found inside `/src/experiments`, where each experiment is contained within its own directory.

- **PaperModel**: The model architecture used in the paper.
- **BiLSTMModel**: A modification of the paper's architecture that replaces the LSTM with a bidirectional LSTM (see the sketch after this list).
- **VMAEModel**: Uses a modern transformer-based architecture for feature extraction.
- **LatentVMAEModel**: Replaces cochleagrams with a learned latent-space representation of the waveforms, produced by an autoencoder and fed into the VMAEModel.
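
As a rough illustration of the BiLSTMModel change, the sketch below contrasts a unidirectional LSTM head with a bidirectional one; the variable names and dimensions are assumptions for illustration, not the repository's actual code.

```python
# Sketch: swapping the LSTM for a bidirectional LSTM doubles the feature size
# seen by the output head. Names and dimensions are illustrative assumptions.
import torch.nn as nn

hidden_dim, sound_feat_dim = 256, 42

# Paper-style unidirectional LSTM and output head.
lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, sound_feat_dim)

# BiLSTMModel-style variant: bidirectional LSTM, so the head takes 2 * hidden_dim.
bilstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
bihead = nn.Linear(2 * hidden_dim, sound_feat_dim)
```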