# ACM MM 2021 Demo Page
This demo page accompanies the paper __ReconVAT: A Semi-Supervised Automatic Music Transcription Framework for Low-Resource Real-World Data__. It demonstrates that our proposed semi-supervised framework for automatic music transcription (AMT) generalizes better to unseen data than the fully supervised baseline. Neither *Pop Music* nor *Concerto Grosso* is present in the labelled data, but our semi-supervised framework can still be trained on these unseen genres without labels.

**Source code:** https://github.com/KinWaiCheuk/ReconVAT

**Details of data splitting:** https://github.com/KinWaiCheuk/ReconVAT/blob/gh-pages/supplementary.pdf

Three models are compared here:
- ReconVAT (existing + new data): The proposed semi-supervised AMT framework based on spectrogram reconstruction [[1]](https://arxiv.org/abs/2010.09969) and virtual adversarial training (VAT) [[2]](https://arxiv.org/abs/1704.03976). It is trained on the existing data for 4k epochs; the music downloaded from YouTube or IMSLP is then added as unlabelled data, and training continues for another 4k epochs.
- ReconVAT (existing): The proposed semi-supervised AMT framework trained on the existing data for 8k epochs.
- Baseline: A fully supervised model [[3]](https://ieeexplore.ieee.org/document/9222310) trained on the existing data for 8k epochs.
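
The three training schedules above can be summarised in a few lines of Python. This is only an illustrative sketch: the `Stage` class, the `SCHEDULES` dictionary, and the key names are hypothetical and do not come from the ReconVAT source code. It also assumes that the second stage of the first model keeps the existing data and adds the new music on top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    epochs: int            # number of epochs in this stage
    semi_supervised: bool  # True for ReconVAT, False for the supervised baseline
    new_unlabelled: bool   # True if the YouTube/IMSLP music is added as unlabelled data


# Hypothetical names; each entry mirrors one bullet in the list above.
SCHEDULES = {
    "reconvat_existing_plus_new": [
        Stage(epochs=4000, semi_supervised=True, new_unlabelled=False),
        Stage(epochs=4000, semi_supervised=True, new_unlabelled=True),
    ],
    "reconvat_existing": [
        Stage(epochs=8000, semi_supervised=True, new_unlabelled=False),
    ],
    "baseline": [
        Stage(epochs=8000, semi_supervised=False, new_unlabelled=False),
    ],
}


def total_epochs(name: str) -> int:
    """Sum the epochs over all stages of one schedule."""
    return sum(stage.epochs for stage in SCHEDULES[name])
```

Written this way, it is easy to see that all three models receive the same total budget of 8k epochs, so the comparison differs only in the training method and the data used.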




from IPython.display import HTML
table = "<style>audio {width:200px} td {vertical-align: middle}</style>"
HTML(table)

## Woodwind
We downloaded a few pop music covers from YouTube and added them to the unlabelled dataset used to train our proposed framework.

### LeanOn
A clarinet cover of the song __Lean On__ downloaded from [YouTube](https://www.youtube.com/watch?v=nuEMqMc1Fh4). The following excerpt is taken from 1:19-1:30.

The transcription produced by the ReconVAT model is more detailed than that of the baseline model, and the ReconVAT model that continued training on the new data is slightly more accurate than the one trained without it.
![](Demo/Woodwind/LeanOn/LeanOn_pianorolls.png)

<table border="0">
    
 <tr>
    <td style="text-align: left"><b style="font-size:14px">Ground Truth</b></td>
    <td style="text-align: left"><b style="font-size:14px">ReconVAT+new data</b></td> 
    <td style="text-align: left"><b style="font-size:14px">ReconVAT</b></td>
    <td style="text-align: left"><b style="font-size:14px">Baseline</b></td>
 </tr>
    
 <tr>
    <td><audio src="Demo/Woodwind/LeanOn/LeanOn_groundtruth.mp3" controls>alternative text</audio><br/>
    </td>
    <td><audio src="Demo/Woodwind/LeanOn/LeanOn_VATretrain.mp3" controls>alternative text</audio><br/>
    </td>  
    <td><audio src="Demo/Woodwind/LeanOn/LeanOn_VAT.mp3" controls>alternative text</audio><br/>
     </td>
    <td><audio src="Demo/Woodwind/LeanOn/LeanOn_baseline.mp3" controls>alternative text</audio><br/>
     </td>     
 </tr>
</table>
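
Every comparison table on this page follows the same file-naming pattern (`<name>_groundtruth.mp3`, `<name>_VATretrain.mp3`, `<name>_VAT.mp3`, `<name>_baseline.mp3`), so the markup could be generated from the notebook instead of being written by hand. The helper below is a minimal sketch of that idea; `audio_table` and `COLUMNS` are hypothetical names, not part of the ReconVAT repository.

```python
from html import escape

# Column header -> file-name suffix, following the pattern used on this page.
COLUMNS = [
    ("Ground Truth", "groundtruth"),
    ("ReconVAT+new data", "VATretrain"),
    ("ReconVAT", "VAT"),
    ("Baseline", "baseline"),
]


def audio_table(group: str, name: str) -> str:
    """Build the table for files at Demo/<group>/<name>/<name>_<suffix>.mp3."""
    header = "".join(
        f'<td style="text-align: left"><b style="font-size:14px">{escape(label)}</b></td>'
        for label, _ in COLUMNS
    )
    players = "".join(
        f'<td><audio src="Demo/{group}/{name}/{name}_{suffix}.mp3" controls>'
        "alternative text</audio><br/></td>"
        for _, suffix in COLUMNS
    )
    return f'<table border="0"><tr>{header}</tr><tr>{players}</tr></table>'
```

In a notebook, the returned string could be displayed with `HTML(audio_table("Woodwind", "LeanOn"))`, using the same `IPython.display.HTML` class imported at the top of this page.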

### Lemon
A clarinet cover of the song __Lemon__ downloaded from [YouTube](https://www.youtube.com/watch?v=GhqCTpA7TG8). The following excerpt is taken from 3:30-3:40. The blue notes highlight the main melody and the green notes highlight the accompaniment (piano and drums). Our ReconVAT is better than the baseline model in terms of melodic and accompaniment details. The transcription produced by our ReconVAT trained on the new data is finer still.
![](Demo/Woodwind/Lemon/Lemon_pianorolls.png)

<table border="0">
    
 <tr>
    <td style="text-align: left"><b style="font-size:14px">Ground Truth</b></td>
    <td style="text-align: left"><b style="font-size:14px">ReconVAT+new data</b></td> 
    <td style="text-align: left"><b style="font-size:14px">ReconVAT</b></td>
    <td style="text-align: left"><b style="font-size:14px">Baseline</b></td>
 </tr>
    
 <tr>
    <td><audio src="Demo/Woodwind/Lemon/Lemon_groundtruth.mp3" controls>alternative text</audio><br/>
    </td>
    <td><audio src="Demo/Woodwind/Lemon/Lemon_VATretrain.mp3" controls>alternative text</audio><br/>
    </td>  
    <td><audio src="Demo/Woodwind/Lemon/Lemon_VAT.mp3" controls>alternative text</audio><br/>
     </td>
    <td><audio src="Demo/Woodwind/Lemon/Lemon_baseline.mp3" controls>alternative text</audio><br/>
     </td>     
 </tr>
</table>

### Yoasobi
A clarinet cover of a song by Yoasobi downloaded from [YouTube](https://www.youtube.com/watch?v=jSVp6h2vGUQ). The following excerpt is taken from 0:31-0:42. Again, our ReconVAT is much better than the baseline model. For this piece, however, the ReconVAT trained on new data shows only a subtle improvement: the note durations at the beginning of the excerpt are slightly more accurate.

![](Demo/Woodwind/Yoasobi/Yoasobi_pianorolls.png)

<table border="0">
    
 <tr>
    <td style="text-align: left"><b style="font-size:14px">Ground Truth</b></td>
    <td style="text-align: left"><b style="font-size:14px">ReconVAT+new data</b></td> 
    <td style="text-align: left"><b style="font-size:14px">ReconVAT</b></td>
    <td style="text-align: left"><b style="font-size:14px">Baseline</b></td>
 </tr>
    
 <tr>
    <td><audio src="Demo/Woodwind/Yoasobi/Yoasobi_groundtruth.mp3" controls>alternative text</audio><br/>
    </td>
    <td><audio src="Demo/Woodwind/Yoasobi/Yoasobi_VATretrain.mp3" controls>alternative text</audio><br/>
    </td>  
    <td><audio src="Demo/Woodwind/Yoasobi/Yoasobi_VAT.mp3" controls>alternative text</audio><br/>
     </td>
    <td><audio src="Demo/Woodwind/Yoasobi/Yoasobi_baseline.mp3" controls>alternative text</audio><br/>
     </td>     
 </tr>
</table>

## String
We downloaded Corelli's Concerto Grosso Op. 6 [No.1](https://imslp.org/wiki/Concerto_grosso_in_D_major%2C_Op.6_No.1_(Corelli%2C_Arcangelo)), [No.2](https://imslp.org/wiki/Concerto_grosso_in_F_major%2C_Op.6_No.2_(Corelli%2C_Arcangelo)), and [No.3](https://imslp.org/wiki/Concerto_grosso_in_C_minor%2C_Op.6_No.3_(Corelli%2C_Arcangelo)) from IMSLP. As a preprocessing step, we split each concerto into shorter pieces, movement by movement. The three compositions yield 17 movements after preprocessing, which we add to our unlabelled data to train our proposed framework.
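
The movement-by-movement split can be illustrated with a small helper that turns movement boundaries into sample ranges. This is a sketch under stated assumptions, not the preprocessing script used for the paper: the function name `movement_slices` is hypothetical, and the boundary times are assumed to be known already (for example, found by ear or from the score).

```python
from typing import List, Tuple


def movement_slices(
    total_sec: float, boundaries_sec: List[float], sample_rate: int
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices for each movement.

    boundaries_sec holds the times, in seconds, at which one movement ends
    and the next begins. These are hypothetical inputs supplied by the user.
    """
    # Add the start and end of the recording so every movement has two edges.
    edges = [0.0] + sorted(boundaries_sec) + [total_sec]
    return [
        (int(round(start * sample_rate)), int(round(end * sample_rate)))
        for start, end in zip(edges[:-1], edges[1:])
    ]
```

A recording with `n` boundaries yields `n + 1` movements, and each returned range can be used to slice the audio array of the full concerto into one shorter piece.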

### Corelli Op6 No1 mvt4 Allegro
Since this genre is not in the labelled training data from MusicNet, the transcription produced by the supervised baseline model misses many notes. Our proposed ReconVAT is much better than the baseline, and the ReconVAT trained on the new unlabelled data is better still in terms of melodic and accompaniment details.
![](Demo/String/op6_no1_allegro4/op6_no1_allegro4_pianorolls.png)

<table border="0">
    
 <tr>
    <td style="text-align: left"><b style="font-size:14px">Ground Truth</b></td>
    <td style="text-align: left"><b style="font-size:14px">ReconVAT+new data</b></td> 
    <td style="text-align: left"><b style="font-size:14px">ReconVAT</b></td>
    <td style="text-align: left"><b style="font-size:14px">Baseline</b></td>
 </tr>
    
 <tr>
    <td><audio src="Demo/String/op6_no1_allegro4/op6_no1_allegro4_groundtruth.mp3" controls>alternative text</audio><br/>
    </td>
    <td><audio src="Demo/String/op6_no1_allegro4/op6_no1_allegro4_VATretrain.mp3" controls>alternative text</audio><br/>
    </td>  
    <td><audio src="Demo/String/op6_no1_allegro4/op6_no1_allegro4_VAT.mp3" controls>alternative text</audio><br/>
     </td>
    <td><audio src="Demo/String/op6_no1_allegro4/op6_no1_allegro4_baseline.mp3" controls>alternative text</audio><br/>
     </td>     
 </tr>
</table>

### Corelli Op6 No2 mvt4 Allegro
Again, our proposed ReconVAT is much better than the baseline, and the ReconVAT trained on the new unlabelled data is slightly better than the ReconVAT trained only on existing data.

![](Demo/String/op6_no2_allegro4/op6_no2_allegro4_pianorolls.png)

<table border="0">
    
 <tr>
    <td style="text-align: left"><b style="font-size:14px">Ground Truth</b></td>
    <td style="text-align: left"><b style="font-size:14px">ReconVAT+new data</b></td> 
    <td style="text-align: left"><b style="font-size:14px">ReconVAT</b></td>
    <td style="text-align: left"><b style="font-size:14px">Baseline</b></td>
 </tr>
    
 <tr>
    <td><audio src="Demo/String/op6_no2_allegro4/op6_no2_allegro4_groundtruth.mp3" controls>alternative text</audio><br/>
    </td>
    <td><audio src="Demo/String/op6_no2_allegro4/op6_no2_allegro4_VATretrain.mp3" controls>alternative text</audio><br/>
    </td>  
    <td><audio src="Demo/String/op6_no2_allegro4/op6_no2_allegro4_VAT.mp3" controls>alternative text</audio><br/>
     </td>
    <td><audio src="Demo/String/op6_no2_allegro4/op6_no2_allegro4_baseline.mp3" controls>alternative text</audio><br/>
     </td>     
 </tr>
</table>

### Corelli Op6 No3 mvt5 Allegro
Again, our proposed ReconVAT is much better than the baseline. This time, however, the ReconVAT trained on existing data captures the details of the accompaniment better, while the ReconVAT trained on new data captures the melodic details better. We will explore ways to improve the continued training method in the future.

![](Demo/String/op6_no3_allegro5/op6_no3_allegro5_pianorolls.png)

<table border="0">
    
 <tr>
    <td style="text-align: left"><b style="font-size:14px">Ground Truth</b></td>
    <td style="text-align: left"><b style="font-size:14px">ReconVAT+new data</b></td> 
    <td style="text-align: left"><b style="font-size:14px">ReconVAT</b></td>
    <td style="text-align: left"><b style="font-size:14px">Baseline</b></td>
 </tr>
    
 <tr>
    <td><audio src="Demo/String/op6_no3_allegro5/op6_no3_allegro5_groundtruth.mp3" controls>alternative text</audio><br/>
    </td>
    <td><audio src="Demo/String/op6_no3_allegro5/op6_no3_allegro5_VATretrain.mp3" controls>alternative text</audio><br/>
    </td>  
    <td><audio src="Demo/String/op6_no3_allegro5/op6_no3_allegro5_VAT.mp3" controls>alternative text</audio><br/>
     </td>
    <td><audio src="Demo/String/op6_no3_allegro5/op6_no3_allegro5_baseline.mp3" controls>alternative text</audio><br/>
     </td>     
 </tr>
</table>