Given a query example in one modality (audio or video), the task is to retrieve relevant examples in the other modality (video or audio). Both the audio and video modalities are available for every data point. The class name can be considered a third, text modality, which is available only for the training data. A retrieved example is correct if it is semantically similar to the query, i.e. it shares the query's class label. At test time, only the paired audio and video modality features will be available.
The AudiosetZSL dataset will be used for this task. The dataset was proposed for zero-shot classification and retrieval of videos and is curated from the larger Audioset dataset. For this challenge, only the seen classes from the dataset will be considered. It contains a total of 79795 training examples and 26587 validation examples. Out of the total 26593 testing examples, a subset will be used for the final evaluation. We have provided features for both audio and video, extracted using pre-trained networks. For a fair comparison of approaches, everyone must use the provided features. More details about the dataset and task can be found in the papers below.
Class-average mAP will be used as the evaluation metric. Each query produces an average precision (AP) score. Averaging the AP over all queries from a particular class gives the mAP for that class, and averaging the mAP over all classes gives the class-average mAP. The class-average mAP is computed for both audio-to-video and video-to-audio retrieval, and the final score is the average of the two.
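The metric described above can be sketched as follows. This is a minimal illustration of the computation, not the challenge's official scoring code; the function names and the list-of-ranked-labels input format are assumptions for the example.

```python
import numpy as np

def average_precision(ranked_labels, query_label):
    """AP for one query: ranked_labels holds the class label of each
    retrieved item, sorted by decreasing similarity to the query."""
    hits = (np.asarray(ranked_labels) == query_label)
    if hits.sum() == 0:
        return 0.0
    # Precision at each rank, averaged over the ranks of the relevant items.
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def class_average_map(all_ranked_labels, query_labels):
    """Average AP per class, then average over classes."""
    query_labels = np.asarray(query_labels)
    aps = np.array([average_precision(r, q)
                    for r, q in zip(all_ranked_labels, query_labels)])
    per_class = [aps[query_labels == c].mean() for c in np.unique(query_labels)]
    return float(np.mean(per_class))
```

The same function would be applied once to the audio-to-video rankings and once to the video-to-audio rankings, and the two scores averaged.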
- Download the dataset from this link into the `data` folder.
- Arrange it as per the directory structure given in the `readme.md` file inside the `data` folder.
Several baseline codes using unsupervised approaches are provided to start with.
- Run `python main_baseline.py` to obtain the baseline retrieval results directly from the raw features.
- Run `python main_baseline.py -mode cca` to obtain the results using CCA.
A supervised learning baseline using triplet loss is also provided.
- Run `python main_triplet.py` to learn a neural network model for aligning all modalities using a triplet loss.
- Run `python evaluate_triplet.py` to evaluate using the model learnt in the previous step.
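The triplet loss used for alignment penalizes an anchor for being farther from a positive (same class, other modality) than from a negative (different class). The actual training happens in `main_triplet.py` with PyTorch; the NumPy function below is only a minimal sketch of the loss itself.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between anchor-positive and anchor-negative
    Euclidean distances: max(0, d(a,p) - d(a,n) + margin),
    averaged over the batch."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so minimizing it pulls matched audio/video pairs together and pushes mismatched ones apart.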
The code is tested with:

```
python3.8
torch==1.9.0
numpy==1.21.2
scipy==1.7.1
h5py==3.4.0
pandas==1.3.2
```
Submit a txt file in which each row lists the indices of the retrieved samples, sorted in decreasing order of similarity.
A single txt file should be submitted containing both the audio-to-video and video-to-audio retrieval results: the first half of the file should contain the indices for audio-to-video retrieval, and the second half the video-to-audio results. For example, if there are N examples in the test set, the txt file should have 2N rows, where the first N rows are the retrieval indices for audio-to-video retrieval and the next N rows contain the video-to-audio retrieval results.
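The layout described above can be sketched as follows. Note that the provided code can already generate this file for you; this sketch only illustrates the expected 2N-row structure, and the function name and space-separated row format are assumptions for the example.

```python
def write_submission(a2v_ranks, v2a_ranks, path="submission.txt"):
    """Write one row per query: gallery indices separated by spaces,
    most similar first. All audio->video rows come first, then all
    video->audio rows, giving 2N rows for N test examples."""
    with open(path, "w") as f:
        for ranks in (a2v_ranks, v2a_ranks):
            for row in ranks:
                f.write(" ".join(str(i) for i in row) + "\n")
```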
Please note that the txt file required for submission can be generated by specifying `out_txt=True` in the `calculate_both_map` function.