Audio-visual Retrieval Challenge-21

Problem Description

Given a query example in one modality (audio/video), the task is to retrieve relevant examples in the other modality (video/audio). For every data point, both the audio and video modalities are available. The class name can also be considered as a third, text modality, which is available only for the training data. A retrieved example is correct if it is semantically similar to the query, i.e. it shares the same class label as the query. At test time, only the paired audio and video modality features will be available.
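
For concreteness, here is a minimal sketch of the retrieval setting: given audio and video features that have already been mapped to a common space, each audio query ranks all video examples by cosine similarity. The array names and shapes are illustrative assumptions, not part of the provided code.

```python
import numpy as np

def rank_videos_for_audio(audio_feats, video_feats):
    """Rank all video examples for each audio query by cosine similarity.

    audio_feats: (N, d) array of audio features (assumed already projected
    to a common space with the video features).
    video_feats: (M, d) array of video features.
    """
    # L2-normalise so the dot product equals cosine similarity.
    a = audio_feats / np.linalg.norm(audio_feats, axis=1, keepdims=True)
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    sim = a @ v.T                        # (N, M) similarity matrix
    # For each audio query: video indices sorted by decreasing similarity.
    return np.argsort(-sim, axis=1)
```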

Dataset Statistics

The AudioSetZSL dataset will be used for the task. This dataset was proposed for zero-shot classification and retrieval of videos and is curated from the large-scale AudioSet dataset. For this challenge, only the seen classes from the dataset will be considered. It contains a total of 79795 training examples and 26587 validation examples. Out of the 26593 test examples, a subset will be used for the final evaluation. We have provided features for both audio and video, extracted using pre-trained networks. For a fair comparison of approaches, it is mandatory for everyone to use the provided features. More details about the dataset and task can be found in the papers below.
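
As an illustration only, the provided features could be loaded along the following lines with h5py; the file and dataset names below are hypothetical placeholders, and the actual layout is described in the readme.md inside the data folder.

```python
import h5py

# Hypothetical file/dataset names for illustration only; consult
# data/readme.md for the actual layout of the provided feature files.
with h5py.File("data/train_features.h5", "r") as f:
    audio_feats = f["audio"][:]    # e.g. (79795, d_audio)
    video_feats = f["video"][:]    # e.g. (79795, d_video)
    labels = f["labels"][:]        # class index per example
```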

  1. Coordinated Joint Multimodal Embeddings
  2. Discriminative Semantic Transitive Consistency

Evaluation Metric

Class-average mAP will be used as the evaluation metric. Each retrieval query produces an average precision (AP) score. Averaging the AP over all queries from a particular class gives the mAP for that class. The class-average mAP is then obtained by averaging the mAP over all classes. The class-average mAP can be calculated for both audio-to-video and video-to-audio retrieval. The final score is the average of the two.

$$\text{Final mAP} = 0.5 \cdot \text{mAP}_{\text{audio} \to \text{video}} + 0.5 \cdot \text{mAP}_{\text{video} \to \text{audio}}$$
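
A minimal sketch of the metric, assuming a (num_queries x num_gallery) similarity matrix and integer class labels; the function below is only an illustration of the definition, not the repository's calculate_both_map implementation.

```python
import numpy as np

def class_average_map(sim, query_labels, gallery_labels):
    """Class-average mAP for one retrieval direction (e.g. audio-to-video)."""
    num_queries = sim.shape[0]
    ap_per_query = np.zeros(num_queries)
    for i in range(num_queries):
        order = np.argsort(-sim[i])                       # ranked gallery indices
        relevant = (gallery_labels[order] == query_labels[i]).astype(float)
        if relevant.sum() == 0:
            continue                                      # no relevant items: AP stays 0
        precision_at_k = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
        ap_per_query[i] = (precision_at_k * relevant).sum() / relevant.sum()
    # Average AP within each class, then average over classes.
    class_maps = [ap_per_query[query_labels == c].mean()
                  for c in np.unique(query_labels)]
    return float(np.mean(class_maps))

# Final score: average of the two directions, e.g.
# final_map = 0.5 * class_average_map(sim_a2v, audio_labels, video_labels) \
#           + 0.5 * class_average_map(sim_v2a, video_labels, audio_labels)
```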

Code to get started

Data Download

  1. Download the dataset from this link into the data folder.
  2. Arrange it as per the directory structure given in the readme.md file inside the data folder.

Run Baselines (Unsupervised)

Several baselines using unsupervised approaches are provided to get started.

  1. Run python main_baseline.py to obtain the baseline retrieval results directly from the raw features.
  2. Run python main_baseline.py -mode cca to obtain the results using CCA (see the sketch after this list).
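
For reference, CCA-based cross-modal retrieval can be sketched as follows. This uses scikit-learn's CCA, which is not among the listed dependencies, and is only an illustration of the idea, not the implementation in main_baseline.py.

```python
import numpy as np
from sklearn.cross_decomposition import CCA   # assumption: scikit-learn is installed

def cca_retrieval(train_audio, train_video, test_audio, test_video, dim=64):
    """Project both modalities into a shared CCA space and rank by cosine similarity."""
    cca = CCA(n_components=dim, max_iter=1000)
    cca.fit(train_audio, train_video)
    a_proj, v_proj = cca.transform(test_audio, test_video)
    a_proj /= np.linalg.norm(a_proj, axis=1, keepdims=True)
    v_proj /= np.linalg.norm(v_proj, axis=1, keepdims=True)
    sim_a2v = a_proj @ v_proj.T           # audio-to-video similarity matrix
    return sim_a2v
```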

Run Baselines (Supervised)

A supervised learning baseline using triplet loss is also provided.

  1. Run python main_triplet.py to learn a neural network model for aligning all modalities using a triplet loss (see the sketch after this list).
  2. Run python evaluate_triplet.py to evaluate retrieval using the model learnt in the previous step.
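
To illustrate the idea behind the supervised baseline, here is a minimal, hedged sketch of a triplet loss aligning audio and video embeddings in PyTorch; the network sizes, margin, and sampling scheme are illustrative assumptions, not the settings used in main_triplet.py.

```python
import torch.nn as nn
import torch.nn.functional as F

class ModalityEncoder(nn.Module):
    """Small MLP projecting one modality into a shared embedding space."""
    def __init__(self, in_dim, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim))

    def forward(self, x):
        # L2-normalised embeddings so distances are comparable across modalities.
        return F.normalize(self.net(x), dim=-1)

def triplet_step(audio_enc, video_enc, audio, video_pos, video_neg, margin=0.2):
    """One training step: anchor audio, positive video (same class),
    negative video (different class)."""
    a = audio_enc(audio)
    p = video_enc(video_pos)
    n = video_enc(video_neg)
    return F.triplet_margin_loss(a, p, n, margin=margin)
```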

The code is tested with

python3.8
torch==1.9.0
numpy==1.21.2
scipy==1.7.1
h5py==3.4.0
pandas==1.3.2

Submission

Submit a txt file with each row specifying the indices of the retrieved samples sorted in decreasing order of similarity. A single txt file should be submitted containing both the audio-to-video and video-to-audio retrieval results. In the txt file, the first half of the rows should contain the indices for audio-to-video retrieval, whereas the second half should contain the video-to-audio retrieval results. E.g., if there are N examples in the test set, the text file should have 2N rows, where the first N rows correspond to the retrieval indices for audio-to-video retrieval and the next N rows contain the video-to-audio retrieval results. Please note that the text file required for submission can be generated by specifying out_txt=True in the calculate_both_map function. A hedged sketch of the expected file layout is given below.
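
A minimal sketch of how such a submission file could be written, assuming the ranked index arrays for both directions have already been computed; the function and variable names are illustrative, and the repository's calculate_both_map with out_txt=True produces the file directly.

```python
def write_submission(a2v_ranks, v2a_ranks, path="submission.txt"):
    """a2v_ranks, v2a_ranks: (N, M) arrays of gallery indices, each row
    sorted in decreasing order of similarity."""
    with open(path, "w") as f:
        for row in a2v_ranks:                 # first N rows: audio-to-video
            f.write(" ".join(str(int(i)) for i in row) + "\n")
        for row in v2a_ranks:                 # next N rows: video-to-audio
            f.write(" ".join(str(int(i)) for i in row) + "\n")
```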
