Unsupervised Audio + Video Network Pretraining using PyTorch
EmoMatch Task

A simple transfer-learning task based on the VoxCeleb dataset, used to pretrain networks that work on videos (audio + video). This code requires you to download the VoxCeleb dataset and to extract both the audio and the video tracks.

The idea of this approach is based on the paper Look, Listen and Learn: there, audio and video information were used to pretrain an image encoder network for image classification tasks.

This project tries to extend this approach: instead of only training an image encoder, it pretrains a network that processes both audio and video information. The task the network is meant to solve is rather simple: given an audio sequence and a video sequence, decide whether the two match (i.e. have the same origin).
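The training pairs for this task can be generated without any labels: a matching pair takes the audio and video from the same recording, while a mismatched pair swaps in audio from a different recording. A minimal sketch of such a pairing step (function name and structure are illustrative, not taken from the repository):

```python
import random


def make_pairs(clip_ids, seed=0):
    """Build (video_id, audio_id, label) tuples for the match task.

    For each clip we emit one matching pair (label 1) and one
    mismatched pair (label 0) whose audio comes from another clip.
    """
    rng = random.Random(seed)
    pairs = []
    for vid in clip_ids:
        # Match: audio and video originate from the same recording.
        pairs.append((vid, vid, 1))
        # No match: audio drawn from a different, randomly chosen recording.
        other = rng.choice([c for c in clip_ids if c != vid])
        pairs.append((vid, other, 0))
    return pairs
```

Because both labels are derived purely from the pairing itself, the resulting dataset is "free": no human annotation is needed, which is the point of the self-supervised setup.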

The structure of the EmoMatch training procedure is shown in the image below. The left side shows the data preparation, while the right side illustrates the data flow through the network. During data preparation, each video recording is split into its video and audio tracks. These tracks are then fed into a VNet and an ANet for the video and the audio, respectively. These networks serve as encoders that generate features for a classifier network. The classifier then detects whether the audio track originates from the same recording as the video track (Match) or from a different recording (No Match).
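The described data flow — two encoders whose features are concatenated and fed to a binary classifier — can be sketched in PyTorch as follows. The layer sizes, input shapes, and class names here are assumptions for illustration; the actual VNet/ANet architectures live in the repository's `src` directory:

```python
import torch
import torch.nn as nn


class ANet(nn.Module):
    """Toy audio encoder: spectrogram (B, 1, freq, time) -> feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))


class VNet(nn.Module):
    """Toy video encoder: frame stack (B, 3, frames, H, W) -> feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))


class EmoMatch(nn.Module):
    """Concatenate audio and video embeddings, classify match / no match."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.anet = ANet(feat_dim)
        self.vnet = VNet(feat_dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),  # single logit; >0 predicts "match"
        )

    def forward(self, audio, video):
        z = torch.cat([self.anet(audio), self.vnet(video)], dim=1)
        return self.classifier(z)
```

After pretraining on the match task, the ANet and VNet weights can be reused as feature extractors for downstream audio/video tasks, which is the transfer-learning payoff described above.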
