# Assignment 2 — Singing Voice Transcription  
**Weight:** 15 % of final grade (48 marks total)

Welcome to the second assignment of this course. In this exercise, you will build a **Singing Voice Transcription (SVT)** system for songs—a system that converts singing voice audio into note-level notation. The assignment consists of three sections:

1. Building the necessary components for the system  
2. Implementing the complete transcription workflow  
3. Answering extension questions  

### Requirements
- Complete this notebook by running all code cells successfully and answering all questions.  
- When embedding a screenshot in the notebook, place the image file in the `./resources` directory.  
- After finishing, **zip the entire assignment directory** (excluding the `data_mini` directory) and submit it to Canvas.  

### Submission 
Upon completing this assignment, please **compress the entire assignment directory** into a `.zip` file and submit it via Canvas. File naming convention: **`eXXXXXXX_Name_Assignment2.zip`**.

### Late Policy 
A penalty of **25% of the total marks per day** will be applied for late submissions. All assignments must be submitted via Canvas and we do not accept email submissions.

### Honor Code
Plagiarism will not be tolerated. You may discuss the questions with classmates and consult online references, but you must not submit code or answers copied directly from other sources. If you refer to code, algorithms, or tutorials from external sources, you must explicitly acknowledge the source in your submission (e.g., in code comments).

This assignment is worth **15 %** of your final grade. The full mark for this notebook is **48**. Your score will be normalized when computing the final grade: **Your assignment 2 score = [Your Score] / 48 * 15**.

### Notes
1. **Restart the Kernel After File Changes**  
   After modifying any `.py` file, restart the Jupyter kernel to ensure the updated code is loaded. Restarting will not erase previous cell outputs, so your execution history remains available.  

2. **Maintain Tensor Shape Consistency**  
   When reshaping or altering tensor dimensions, ensure all subsequent operations remain shape-compatible to avoid runtime errors or incorrect results.  

3. **Restricted Editing Scope**  
   You may only add or modify code within sections explicitly marked `YOUR CODE:`. Do not alter any other parts of the provided framework.  

### Useful Resources
- [Music Transcription Overview](https://www.eecs.qmul.ac.uk/~simond/pub/2018/BenetosDixonDuanEwert-SPM2018-Transcription.pdf)  
- [Evaluation for Singing Transcription](https://riuma.uma.es/xmlui/bitstream/handle/10630/8372/298_Paper.pdf?sequence=1)  
- [mir_eval Documentation](https://craffel.github.io/mir_eval/#mir_eval.transcription.evaluate)  
- [VOCANO: A Note Transcription Framework for Singing Voice in Polyphonic Music](https://archives.ismir.net/ismir2021/paper/000036.pdf)  
- [TONet: Tone–Octave Network for Singing Melody Extraction from Polyphonic Music](http://arxiv.org/abs/2202.00951)  


## Getting started

We recommend using a **Conda** environment for this course. If you have not yet set one up, follow the steps below:

1. **Install Miniconda**  
- Download [Miniconda](https://docs.conda.io/en/latest/miniconda.html) and install it on your system.  
- After installation, restart your terminal (PowerShell, zsh, or bash).  

2. **Verify Installation**  
- When `(base)` appears at the start of your terminal prompt, Miniconda is active.  
     ![Conda Base Environment](./resources/conda.png)

3. **Create and Activate the Course Environment**  
   ```bash
   conda create -n 4347 python=3.9
   conda activate 4347
   ```

4. **Install packages**

        # Install PyTorch (use command suitable for your OS)
        # Linux / Windows (CUDA 11.7)
        pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
        # Windows (CPU)
        pip install torch==2.0.0+cpu torchvision==0.15.1+cpu torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cpu
        # OSX (CPU)
        pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1

        # Install other libraries
        pip install -r requirement.txt

        # Install ffmpeg
        conda install ffmpeg

5. **Set Interpreter in Your IDE**
- When running this notebook in your IDE, ensure the interpreter is set to the 4347 Conda environment.

6. **Install Jupyter if Prompted**
- If prompted to install the jupyter package, click **Confirm** to proceed.

## Section 1 - Important Components [21 marks]
We will use PyTorch to build a neural network for SVT. As in most deep learning projects, we start with the data pipeline.

### Task 1: Data Loader [3 marks]

**YOUR TASK:** Complete the `MyDataset` class in **`dataset.py`** so that it passes the test in the provided cell. **[3 marks]**

You need to fill in the code that implements the following steps:
1. Convert note-level annotations to frame-level annotations.
2. Load the audio file and convert it to a [mel spectrogram](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum). Some tools may help you, like [torchaudio](https://pytorch.org/audio/main/generated/torchaudio.transforms.MelSpectrogram.html) or [librosa](https://librosa.org/doc/main/generated/librosa.feature.melspectrogram.html).
3. Extract 5-s segments from both the spectrogram and the corresponding annotations to create training samples.
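
As a rough illustration of step 1, the sketch below converts note-level annotations into frame-level labels using plain Python lists. It assumes 50 frames per second (250 frames per 5-s segment, matching the shape checked in the test cell). The function name `notes_to_frames`, the `(onset_sec, offset_sec, midi_pitch)` note format, and the `lowest_midi` value that anchors octave 1 are hypothetical choices for illustration; the actual annotation format and pitch range are defined by the provided framework.

```python
FRAME_RATE = 50  # frames per second: 250 frames per 5-s segment


def notes_to_frames(notes, num_frames, frame_rate=FRAME_RATE, lowest_midi=36):
    """Convert (onset_sec, offset_sec, midi_pitch) notes to per-frame labels.

    `lowest_midi` is an assumed MIDI number for the C that starts octave 1;
    pitches are assumed to stay within the four octaves above it.
    """
    onset = [0] * num_frames
    offset = [0] * num_frames
    octave = [0] * num_frames       # 0 = silence, 1..4 = octave index
    pitch_class = [0] * num_frames  # 0 = silence, 1..12 = C..B

    for start_sec, end_sec, midi in notes:
        start = int(round(start_sec * frame_rate))
        end = int(round(end_sec * frame_rate))
        if start >= num_frames or end <= start:
            continue  # note lies outside the segment or is empty
        end = min(end, num_frames)
        onset[start] = 1
        offset[end - 1] = 1
        relative = midi - lowest_midi
        for t in range(start, end):
            octave[t] = relative // 12 + 1
            pitch_class[t] = relative % 12 + 1
    return onset, offset, octave, pitch_class


# One note: MIDI 60 sung from 0.10 s to 0.50 s.
on, off, octv, pc = notes_to_frames([(0.10, 0.50, 60)], num_frames=250)
```

Whether the offset is marked on the last sounding frame or on the frame after it is a design choice; whichever convention you pick, the post-processing step has to interpret it the same way.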



```python
from tqdm import tqdm
from dataset import get_data_loader, move_data_to_device
from hparams import Hparams

train_loader = get_data_loader(split='train', args=Hparams.args)
for data in tqdm(train_loader):
    x, onset, offset, octave, pitch_class = move_data_to_device(data, 'cpu')
    # x has shape [B, T, D], i.e.,
    # [batch size, number of frames per sample, spectrogram feature dimension]
    assert list(x.shape) == [8, 250, 256]
    assert list(onset.shape) == list(offset.shape) == list(octave.shape) == list(pitch_class.shape) == [8, 250]
    break
print('Congrats!')
```

```text
  0%|          | 0/123 [00:20<?, ?it/s]


RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/Users/charlie/miniconda3/envs/4347/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/Users/charlie/miniconda3/envs/4347/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/Users/charlie/miniconda3/envs/4347/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/Users/charlie/University/3A/Sound and Music Computing/Assignments/assignment2/dataset.py", line 141, in __getitem__
    waveform, sample_rate = torchaudio.load(audio_fp)
  File "/Users/charlie/miniconda3/envs/4347/lib/python3.9/site-packages/torchaudio/backend/sox_io_backend.py", line 256, in load
    return _fallback_load(filepath, frame_offset, num_frames, normalize, channels_first, format)
  File "/Users/charlie/miniconda3/envs/4347/lib/python3.9/site-packages/torchaudio/backend/sox_io_backend.py", line 30, in _fail_load
    raise RuntimeError("Failed to load audio from {}".format(filepath))
RuntimeError: Failed to load audio from ./data_mini/train/60/Mixture.mp3
```


### Task 2: Speed it up  [7 marks]

At this stage, your data loader should be functional, but it may still have performance bottlenecks.  
In this task, we will evaluate its efficiency by loading **5 batches** of data from your data loader.  

**YOUR TASK:**  Run the provided code cell below to load 5 batches and observe the execution time. **[1 mark]**

```python
from tqdm import tqdm
from dataset import get_data_loader, move_data_to_device
from hparams import Hparams
import time

t0 = time.time()
train_loader = get_data_loader(split='train', args=Hparams.args)
for i, data in enumerate(tqdm(train_loader)):
    x, onset, offset, octave, pitch_class = move_data_to_device(data, 'cpu')
    # x has shape [B, T, D], i.e., [batch size, number of frames per sample, feature dimension]
    assert list(x.shape) == [8, 250, 256]
    assert list(onset.shape) == list(offset.shape) == list(octave.shape) == list(pitch_class.shape) == [8, 250]
    if i == 4:
        dur = time.time() - t0
        est_time = dur / 5 * 123  # 123 = number of batches in the training set
        break
print('5 batches use {:.2f} seconds'.format(dur))
print('Estimated time to load the whole training set: {:.2f} seconds.'.format(est_time))
```

If the estimated time to load the entire training set is **within 30 seconds**, congratulations — your data loader is efficient.  
If you have not modified the dataset workflow, it may take **over 1000 seconds** to load the entire set, making data loading the primary performance bottleneck.

**YOUR TASK:**

1. **Answer the two questions below.** **[2 marks]**  
2. **Optimize your `MyDataset` class** to reduce data loading time. **[3 marks]**  
   *Hint:*  
   - Each dataset iteration should return only a 5-second audio clip. The current implementation reads the entire audio file into memory each time, even if only one clip is needed. If the same file is accessed again, the I/O operation could be skipped.  
   - Frame-level annotations are currently recomputed every time a clip is loaded; this can be cached or precomputed to improve efficiency.  
   - Consider methods to read only the required audio segment directly from disk rather than the entire file. You may also change the storage format of audio files if it improves performance.  
3. **Restart the kernel** and run the provided code cell to verify the improvement. **[1 mark]**

Questions:

[1] What are the primary factors contributing to the time overhead in the current data-loading process?
**[Your Answer Here]**

[2] What is your plan for modifying the dataset class? Propose two possible solutions to improve its efficiency. Indicate which solution you will implement and explain your choice, including the expected benefits and trade-offs.
**[Your Answer Here]**

```python
from tqdm import tqdm
from dataset import get_data_loader, move_data_to_device
from hparams import Hparams
import time

train_loader = get_data_loader(split='train', args=Hparams.args)
t0 = time.time()
for i, data in enumerate(tqdm(train_loader)):
    x, onset, offset, octave, pitch_class = move_data_to_device(data, 'cpu')
    # x has shape [B, T, D], i.e., [batch size, number of frames per sample, feature dimension]
    assert list(x.shape) == [8, 250, 256]
    assert list(onset.shape) == list(offset.shape) == list(octave.shape) == list(pitch_class.shape) == [8, 250]
    if i == 4:
        dur = time.time() - t0
        est_time = dur / 5 * 123  # 123 = number of batches in the training set
        break
print('5 batches use {:.2f} seconds'.format(dur))
print('Estimated time to load the whole training set: {:.2f} seconds.'.format(est_time))
if est_time < 40:
    print('Well Done!')
else:
    print('We can still be faster.')
```

### Task 3: Loss function [3 marks]

We will use multitask learning to train our model, i.e., the model is trained simultaneously on four frame-level classification tasks:
- Classify whether there is an onset at a given frame.
- Classify whether there is an offset at a given frame.
- Classify which octave the pitch of a given frame lies in (there are 4 octaves, numbered 1 to 4; 0 means silence).
- Classify the pitch class of a given frame (C to B, 12 semitones; 0 means silence).

With four tasks, the loss computation is less straightforward than for a single task. To keep the code readable, we wrap the loss computation in a class.
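
The following framework-free sketch shows how four per-task losses could be combined for a single frame. It assumes the model outputs raw logits (the prediction heads have no activation), binary cross-entropy for onset and offset, cross-entropy for octave and pitch class, and an unweighted sum as the total. The five returned values (total plus the four task losses) are a guess at what the test cell's `len(losses) == 5` refers to. All function names here are hypothetical; in `train.py` you would normally use the corresponding PyTorch loss modules on whole batches.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def cross_entropy(logits, target_index):
    """Log-sum-exp of the logits minus the logit of the true class."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target_index]


def multitask_loss(on_logit, off_logit, octave_logits, pitch_logits,
                   on_tgt, off_tgt, octave_tgt, pitch_tgt):
    """Return (total, onset, offset, octave, pitch-class) losses for one frame."""
    on_loss = bce_with_logits(on_logit, on_tgt)
    off_loss = bce_with_logits(off_logit, off_tgt)
    octave_loss = cross_entropy(octave_logits, octave_tgt)
    pitch_loss = cross_entropy(pitch_logits, pitch_tgt)
    total = on_loss + off_loss + octave_loss + pitch_loss  # assumed equal weights
    return total, on_loss, off_loss, octave_loss, pitch_loss


# All-zero logits: every class is equally likely.
losses = multitask_loss(0.0, 0.0, [0.0] * 5, [0.0] * 13, 1.0, 0.0, 2, 7)
```

With uniform logits, each binary task contributes log 2 and the two multi-class tasks contribute log 5 and log 13, which is a handy sanity check for an untrained model.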

**YOUR TASK:** Complete the `LossFunc` class in `train.py`, then run the cell below. **[3 marks]**

```python
import torch
from train import LossFunc

loss_func = LossFunc(device='cpu')

on_out = torch.rand(size=(8, 250))          # [B, T]
off_out = torch.rand(size=(8, 250))
octave_out = torch.rand(size=(8, 250, 5))   # [B, T, #Class]
pitch_class_out = torch.rand(size=(8, 250, 13))

on_tgt = on_out
off_tgt = off_out
octave_tgt = torch.randint(high=5, size=(8, 250))
pitch_class_tgt = torch.randint(high=13, size=(8, 250))


losses = loss_func.get_loss(
    out=(on_out, off_out, octave_out, pitch_class_out),
    tgt=(on_tgt, off_tgt, octave_tgt, pitch_class_tgt)
)

assert losses is not None
assert len(losses) == 5
assert isinstance(losses[0], torch.Tensor)
print('Succeed!')
```


### Task 4: Metric [3 marks]

We need to monitor model performance during training to see how training is progressing. In addition to the loss, we may also want to know the F1 score or accuracy in both the training loop and the validation loop.
To facilitate this, we wrap the metric computation in a single class.
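
The test cell below calls `update` once per batch, reads the result with `get_value`, and then expects the internal `buffer` to be empty. A minimal sketch of that accumulate-then-reset pattern is shown here. The class `RunningAverage` and the metric names are hypothetical, and the real `Metrics` class takes model outputs and targets rather than precomputed numbers.

```python
class RunningAverage:
    """Collect per-batch values and report their mean once per epoch."""

    def __init__(self):
        self.buffer = {}

    def update(self, values):
        # `values` maps a metric name to its value for the current batch.
        for name, value in values.items():
            self.buffer.setdefault(name, []).append(float(value))

    def get_value(self):
        # Average every metric over the batches seen so far, then reset.
        averages = {name: sum(vals) / len(vals) for name, vals in self.buffer.items()}
        self.buffer = {}
        return averages


tracker = RunningAverage()
tracker.update({"loss": 1.0, "onset_f1": 0.2})
tracker.update({"loss": 0.5, "onset_f1": 0.4})
epoch_metrics = tracker.get_value()
```

Resetting the buffer inside `get_value` keeps training and validation statistics from leaking into each other when the same object is reused across loops.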

**YOUR TASK:** Complete the `Metrics` class in `train.py`, then run the cell below. **[3 marks]**



```python
import torch
from tqdm import tqdm
from dataset import get_data_loader, move_data_to_device
from hparams import Hparams
from train import LossFunc, Metrics

loss_func = LossFunc(device='cpu')
metric = Metrics(loss_func=loss_func)

# dummy output
on_out = torch.rand(size=(8, 250))          # [B, T]
off_out = torch.rand(size=(8, 250))
octave_out = torch.rand(size=(8, 250, 5))   # [B, T, #Class]
pitch_class_out = torch.rand(size=(8, 250, 13))
out = (on_out, off_out, octave_out, pitch_class_out)

train_loader = get_data_loader(split='train', args=Hparams.args)
for i, data in enumerate(tqdm(train_loader)):
    x, onset, offset, octave, pitch_class = move_data_to_device(data, 'cpu')
    tgt = (onset, offset, octave, pitch_class)

    metric.update(out, tgt)
    if i == 4:
        break
train_metric = metric.get_value()
print(train_metric, '\n')
assert len(train_metric) == 9
for k in train_metric:
    assert train_metric[k] > 0
assert metric.buffer == {}
print('Congrats!')
```

### Task 5: Model [5 marks]

**YOUR TASK:**
1. Implement the model in `model.BaseCNN_mini`, following the description below. **[4 marks]**
2. Successfully run the cell below. **[1 mark]**

**Model Description**
1. This is a convolutional neural network that operates on the spectrogram of a 5-s audio segment.
2. The network starts with three 2-D convolutional layers, each with a **3x3 kernel** and **padding of 1**. **Output channel numbers: 16, 32, 64**.
3. Each conv layer is followed by batch normalization and then a ReLU activation.
4. Before the 2nd and 3rd conv layers, max pooling (**kernel size=(1,2)**) is applied along the feature dimension. **NOTE:** do not shrink the time dimension, because we need to make a prediction for each frame.
5. After all convolution operations, permute the tensor so that the "feature" and "channel" dimensions are adjacent, then merge them into a single new feature dimension.
6. Next comes a position-wise feed-forward layer with **256 dimensions**, i.e., for each frame along the time axis, all features of that frame are projected to a 256-d vector. A ReLU activation follows.
7. There are four prediction heads (onset, offset, octave, pitch class), each a single linear layer. They take the output of the feed-forward layer and produce the final output. **Note:** do not apply an activation function (e.g., sigmoid or softmax) after these last layers.
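
Before writing the model, it can help to trace the tensor shapes implied by the description. The sketch below does this with plain integer arithmetic, assuming stride 1 for the convolutions and a pooling stride equal to the pooling kernel size (the PyTorch default for max pooling). The helper names are hypothetical.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel):
    """Output length of max pooling along one axis (stride = kernel size)."""
    return (size - kernel) // kernel + 1


def trace_shapes(batch=8, frames=250, feat_dim=256):
    """Return the [B, C, T, F] shape after each conv layer and the merged size."""
    channels, t, f = 1, frames, feat_dim
    shapes = []
    for i, out_channels in enumerate((16, 32, 64)):
        if i > 0:  # pooling with kernel (1, 2) precedes the 2nd and 3rd conv
            t, f = pool_out(t, 1), pool_out(f, 2)
        t, f = conv_out(t), conv_out(f)  # 3x3 kernel with padding 1 keeps the size
        channels = out_channels
        shapes.append((batch, channels, t, f))
    merged = channels * f  # channel and feature dimensions merged per frame
    return shapes, merged


shapes, merged = trace_shapes()
for shape in shapes:
    print(shape)
print(merged)
```

Under these assumptions the time axis stays at 250 frames throughout, the feature axis shrinks from 256 to 128 to 64, and the feed-forward layer receives 64 x 64 = 4096 features per frame.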

```python
import torch
from model import BaseCNN_mini

model = BaseCNN_mini(feat_dim=256)
dummy_input = torch.rand(size=(8, 250, 256))
out = model(dummy_input)
on, off, octave, pitch_class = out

assert list(on.shape) == [8, 250]
assert list(off.shape) == [8, 250]
assert list(octave.shape) == [8, 250, 5]
assert list(pitch_class.shape) == [8, 250, 13]

print('Congrats!')
```

## Section 2 - Training and Evaluation [14 marks]

### Task 6: Training [5 marks]
Now we are ready for training! Neither the model nor the dataset is large, so training can be performed on your laptop.
Start training with the following command (estimated training time: 10 min):

        python train.py

You may need some time for debugging before training finishes successfully.

### Task 7: Testing [3 marks]
Next, test the model by running:

        python test.py

You may need to adjust the threshold values so that precision and recall are similar, in order to maximize the F1 score.
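
To see how a decision threshold trades precision against recall, the sketch below sweeps a few thresholds over made-up frame probabilities and keeps the one with the highest F1 score. The data, the threshold grid, and the function name `precision_recall_f1` are illustrative assumptions; the thresholds that `test.py` actually exposes may be organized differently.

```python
def precision_recall_f1(probs, labels, threshold):
    """Binarize probabilities at `threshold` and score them against 0/1 labels."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up onset probabilities and ground-truth labels for eight frames.
probs = [0.95, 0.85, 0.75, 0.65, 0.45, 0.35, 0.25, 0.15]
labels = [1, 0, 1, 1, 0, 1, 0, 0]

thresholds = [i / 10 for i in range(1, 10)]
scores = {th: precision_recall_f1(probs, labels, th) for th in thresholds}
best_threshold = max(scores, key=lambda th: scores[th][2])
best_precision, best_recall, best_f1 = scores[best_threshold]
```

Lowering the threshold raises recall and tends to lower precision. Tune on validation data rather than on the test set, so that the reported scores are not optimistically biased.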

Expected performance: approximately **48%, 38%, and 17% for COn, COnP, and COnPOff**, respectively.
- **COn** (*Correct Onset*): Percentage of notes with correctly predicted onset time within a tolerance window.  
- **COnP** (*Correct Onset & Pitch*): Percentage of notes with both onset time and pitch correct.  
- **COnPOff** (*Correct Onset, Pitch & Offset*): Percentage of notes with onset time, pitch, and note offset all correct.  
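
As a simplified illustration of the COn idea, the sketch below matches estimated onsets to reference onsets one-to-one within a tolerance window and reports precision, recall, and F1. The 50 ms tolerance mirrors the default onset tolerance in `mir_eval.transcription`; the value used in this assignment should be checked in the provided code. The greedy two-pointer matching and the function name `onset_scores` are simplifications for illustration, whereas `mir_eval` computes an optimal matching and also handles pitch and offset criteria.

```python
def onset_scores(ref_onsets, est_onsets, tolerance=0.05):
    """Greedy one-to-one onset matching; returns (precision, recall, f1)."""
    ref = sorted(ref_onsets)
    est = sorted(est_onsets)
    matched = 0
    i = j = 0
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tolerance:
            matched += 1  # this pair counts as a correct onset
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # estimated onset is too early to match any remaining reference
        else:
            i += 1  # reference onset has no estimate close enough
    precision = matched / len(est) if est else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


reference = [0.50, 1.20, 2.00]          # annotated onset times in seconds
estimated = [0.52, 1.30, 2.01, 2.60]    # predicted onset times in seconds
precision, recall, f1 = onset_scores(reference, estimated)
```

Here two of the four estimated notes fall within 50 ms of a reference onset, so precision is 2/4 and recall is 2/3.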


After finishing training and testing, please attach screenshots showing that each has finished. **Please include your command-line prompt in each screenshot to show that the run is yours.**

### Task 8: Visualization (Case Study) [6 marks]
Finally, to gain an intuitive understanding of the model's performance, visualize your model's output alongside the ground truth (annotation) for one audio segment. You may use any visualization method you like. Afterwards, attach your visualization figure below.

[TODO: Please attach your screenshots in this block]

**[Screenshot of finish of training]**

**[Screenshot of finish of testing]**

**[Visualization figure of output and ground truth]**

## Section 3 - Questions [13 marks]
**YOUR TASKS:** Please answer the questions below:

-  Explain how the post-processing algorithm operates. Specifically, describe how it converts frame-level outputs into note-level outputs before the final evaluation. (You may need to locate and review the relevant code in the provided `.py` files before answering.) **[2 marks]**
[Your answer]
</br>

- Describe how the final performance metric for note-level transcription is computed. Explain the evaluation process for a pair consisting of a note-level output and the corresponding annotation for one song. (Hint: review the relevant code, and consult the “Useful Resources” section at the beginning of this notebook.) **[2 marks]**
[Your answer]
</br>

- Recall that the `Metrics` class uses the F1 score to evaluate onset/offset classification. Do you think the F1 score is appropriate in this case? What about using accuracy (correct predictions ÷ total frames) instead? If accuracy is not appropriate, explain why, and do the same for the F1 score. Suggest alternative metrics that may be more suitable. **[2 marks]**
[Your answer]
</br>

- Summarize your system’s transcription performance (excluding time-efficiency) using objective metric scores and your visualization results. **[2 marks]** 
[Your answer]
</br>

- The current system may perform poorly on this dataset, or may achieve decent results but still have room for improvement. List three **pairs** of:  
  1. A possible reason for suboptimal performance, and  
  2. A corresponding direction for improvement. **[3 marks]**
[Your answer] 
</br>

- Which aspect of the assignment did you find most difficult? On which part did you spend the most time? **[1 mark]** 
[Your answer]
</br>

- How much time did you spend completing the assignment? If you did not record the exact duration, provide an estimate. **[1 mark]**
[Your answer]
</br>
