The official implementation of "MTANet: Multi-band Time-frequency Attention Network for Singing Melody Extraction from Polyphonic Music".
We propose a more powerful singing melody extractor for polyphonic music, named the multi-band time-frequency attention network (MTANet). Experimental results show that the proposed MTANet achieves promising performance compared with existing state-of-the-art methods while keeping the number of network parameters small.
(i) Due to an oversight by the author, Figure 3 in the manuscript shows an earlier version of the figure, which may cause some misunderstanding for reviewers and readers. I am very sorry for this! The picture below is the revised version for reference, and a formal correction will be made in the subsequent manuscript.
(ii) MMNet has been renamed to MTANet.
The author has contacted the chairs and applied for a modification. If the modification is successful, please ignore the update above. I am very sorry for the inconvenience to reviewers and readers.
The paper has been accepted by INTERSPEECH 2023, and the official version awaits release.
The rest of the code will be sorted out and published soon.
All the code is uploaded.
When re-reading the paper, I found a mistake in one of the dimension-tracking descriptions in Figure 4. Specifically, the dimensions after the concatenation operation differ across stages. For example, the input feature size of the first MFA module is (B, 32, F, T), so the feature size after concatenation should be (B, 32+4×16, F, T). In contrast, the feature size after concatenation in the subsequent MFA modules is (B, 16+4×16, F, T) (i.e., (B, (N+1)×C, F, T)).
Although the original intention was to make the figure easier to understand, we overlooked the strict correspondence between the paper and the code. Since the paper can no longer be modified, we apologize for any confusion this causes readers.
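The corrected shape arithmetic above can be checked with a minimal sketch. Note this is only an illustration of the concatenation shapes (using NumPy zeros as stand-ins for the actual feature tensors), not MTANet's code; the branch count N = 4 and per-branch channel width C = 16 follow the description above.

```python
import numpy as np

B, F, T = 2, 8, 8   # toy sizes: batch, frequency bins, time frames
N, C = 4, 16        # number of parallel branches and channels per branch

def mfa_concat_shape(x):
    """Concatenate an input tensor with N branch outputs of C channels each
    along the channel axis, and return the resulting shape."""
    branches = [np.zeros((x.shape[0], C, x.shape[2], x.shape[3])) for _ in range(N)]
    return np.concatenate([x] + branches, axis=1).shape

# First MFA module: 32 input channels -> (B, 32 + 4*16, F, T)
first = mfa_concat_shape(np.zeros((B, 32, F, T)))
# Subsequent MFA modules: 16 input channels -> (B, 16 + 4*16, F, T)
later = mfa_concat_shape(np.zeros((B, 16, F, T)))

print(first)  # (2, 96, 8, 8)
print(later)  # (2, 80, 8, 8)
```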
After downloading the data, use the txt files in the data folder and extract the CFP features with feature_extraction.py.
Note that label data matching the frame shift should be available before feature generation.
main.py is the main function of this project.
Refer to the file: mtanet.py
Our replication code for the other comparison models has been uploaded and can be found in the folder: control group model.
The visualization illustrates that the proposed MTANet reduces both octave errors and melody detection errors.
The scores here are either taken from the respective papers or produced by our own re-implementations. Experimental results show that the proposed MTANet achieves promising performance compared with existing state-of-the-art methods.
- Correction: the number of parameters for TONet has been corrected from 214M to 147M.
We conducted seven ablations to verify the effectiveness of each design choice in the proposed network. Due to the page limit, only the ADC2004 dataset was used for the ablation study in the paper; more detailed results are presented here.
@inproceedings{gao23i_interspeech,
author={Yuan Gao and Ying Hu and Liusong Wang and Hao Huang and Liang He},
title={{MTANet: Multi-band Time-frequency Attention Network for Singing Melody Extraction from Polyphonic Music}},
year={2023},
booktitle={Proc. INTERSPEECH 2023},
pages={5396--5400},
doi={10.21437/Interspeech.2023-2494}
}