# Assignment 2 [48 marks, 15%]
Hi all! Welcome to the 2nd assignment of this course! Here, you will learn how to build a singing voice transcription (SVT) system for songs, a system that helps you convert the singing voice in audio into note-level notation. This assignment contains 3 sections:
1. Building necessary components for the system,
2. The entire transcription workflow,
3. Some extension questions.

You are required to:
- Finish this notebook. Successfully run all the code cells and answer all the questions.
- When you need to embed a screenshot in the notebook, put the picture in './resources'.
- After finishing, **zip the whole directory (please exclude data_mini directory)**, then submit to Canvas. **Naming: "eXXXXXXX_Name_Assignment2.zip"**.

This assignment constitutes 15% of your final grade, but **the full marks of this notebook are 48**. We will normalize your score when computing the final grade, i.e., your assignment 2 score = [Your Score] / 48 * 15.

**Honor Code**
Note that plagiarism will not be condoned. You may discuss the questions with your classmates or search the internet for references, but you MUST NOT submit code/answers copied directly from other sources. If you referred to code or a tutorial somewhere, please explicitly attribute the source in your code, e.g., in a comment.

**Note**
Please restart the Jupyter kernel every time after modifying any py files. Otherwise, the updated code may not be loaded by this notebook.
Restarting the kernel won't clear the output of previous cells, so your running history is preserved.

**Useful Resources**
- [Music Transcription Overview](https://www.eecs.qmul.ac.uk/~simond/pub/2018/BenetosDixonDuanEwert-SPM2018-Transcription.pdf)
- [Evaluation for Singing Transcription](https://riuma.uma.es/xmlui/bitstream/handle/10630/8372/298_Paper.pdf?sequence=1)
- [mir_eval documentation](https://craffel.github.io/mir_eval/#mir_eval.transcription.evaluate)
- [VOCANO: A note transcription framework for singing voice in polyphonic music](https://archives.ismir.net/ismir2021/paper/000036.pdf)
- [TONet: Tone-Octave Network for Singing Melody Extraction from Polyphonic Music](http://arxiv.org/abs/2202.00951)

## Getting started

We recommend using a Conda environment for the course. If you have not yet done so, please
1. Download [miniconda](https://docs.conda.io/en/latest/miniconda.html) and install it on your computer. After installation, you may need to restart the command line (powershell/zsh/bash).
2. When you see (base) in your command line, the installation was successful.
               ![figure](./resources/conda.png)
3. Create an environment called "4347", and then enter the environment:

        conda create -n 4347 python=3.9
        conda activate 4347
4. Install packages

        # Install PyTorch (use command suitable for your OS)
        # Linux / Windows (CUDA 11.7)
        pip install torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu117
        # Windows (CPU)
        pip install torch==2.0.0+cpu torchvision==0.15.1+cpu torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cpu
        # OSX (CPU)
        pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1

        # Install other libraries
        pip install -r requirement.txt

        # Install ffmpeg
        conda install ffmpeg
5. When you run this notebook in your IDE, switch the interpreter to the 4347 conda environment.
6. You may be prompted to install the jupyter package. Click "confirm" in this case.

## Section 1 - Important Components [21 marks]
We are going to use PyTorch to build a neural network for SVT. Like all deep learning projects, we start with the data pipeline.

### Task 1: Data Loader [3 marks]

**YOUR TASKS:** Finish the code for the MyDataset class in **dataset.py**, so that you can pass the test in the cell below. **[3 mark(s)]**
You need to fill in the code that implements the following:
1. Convert note-level annotations to frame-level annotations.
2. Read the audio file and convert it to a [mel spectrogram](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum). Some tools may help you, like [torchaudio](https://pytorch.org/audio/main/generated/torchaudio.transforms.MelSpectrogram.html) or [librosa](https://librosa.org/doc/main/generated/librosa.feature.melspectrogram.html).
3. Extract 5-s segments from the spectrogram and annotations as samples for training.

**NOTE:** Please do not modify existing code.
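
A minimal sketch of step 1, assuming annotations are `(onset_sec, offset_sec, midi_pitch)` tuples and a 0.02 s frame (5 s / 250 frames, matching the shapes asserted in the cell below). The name `notes_to_frames` and the MIDI-to-octave mapping (MIDI 36 as the first pitch of octave 1) are hypothetical; the real label convention is whatever **dataset.py** and the annotation files define.

```python
FRAME_SEC = 0.02  # assumed frame duration: 5 s / 250 frames


def notes_to_frames(notes, num_frames, frame_sec=FRAME_SEC):
    """Convert [(onset_sec, offset_sec, midi), ...] into four frame-level label lists."""
    onset = [0] * num_frames
    offset = [0] * num_frames
    octave = [0] * num_frames        # 0 = silence
    pitch_class = [0] * num_frames   # 0 = silence, 1..12 = C..B

    for start, end, midi in notes:
        first = int(round(start / frame_sec))
        last = int(round(end / frame_sec))
        if first >= num_frames:
            continue                 # note starts after this segment
        last = min(last, num_frames - 1)
        onset[first] = 1
        offset[last] = 1
        for t in range(first, last + 1):
            # Hypothetical mapping; pitches outside the 4-octave range are not handled.
            octave[t] = (midi - 36) // 12 + 1
            pitch_class[t] = midi % 12 + 1
    return onset, offset, octave, pitch_class


# One note: middle C (MIDI 60) from 0.10 s to 0.30 s.
on, off, octv, pc = notes_to_frames([(0.10, 0.30, 60)], num_frames=250)
```

Frames 5 to 15 are labelled with the note's octave and pitch class, with a single onset flag at frame 5 and a single offset flag at frame 15.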


In [1]:
from tqdm import tqdm
from dataset import get_data_loader, move_data_to_device
from hparams import Hparams

train_loader = get_data_loader(split='train', args=Hparams.args)
for data in tqdm(train_loader):
    x, onset, offset, octave, pitch_class = move_data_to_device(data, 'cpu')
    assert list(x.shape) == [8, 250, 256]  # shape in [B, T, D],
                                # i.e., [Batch size, num of frame per sample, spectrogram feature dimension]
    assert list(onset.shape) == list(offset.shape) == list(octave.shape) == list(pitch_class.shape) == [8, 250]
    break
print('Congrats!')

  0%|          | 0/123 [00:48<?, ?it/s]

Congrats!





### Task 2: Speed it up  [7 marks]

You might have a working dataloader now, but it may have a drawback. Let's load 5 batches of data from your data loader. (Please run the code below) **[1 mark(s)]**

In [None]:
from tqdm import tqdm
from dataset import get_data_loader, move_data_to_device
from hparams import Hparams
import time

t0 = time.time()
train_loader = get_data_loader(split='train', args=Hparams.args)
for i, data in enumerate(tqdm(train_loader)):
    x, onset, offset, octave, pitch_class = move_data_to_device(data, 'cpu')
    assert list(x.shape) == [8, 250, 256]  # shape in [B, T, D], i.e., [Batch size, num of frame per sample, feature dimension]
    assert list(onset.shape) == list(offset.shape) == list(octave.shape) == list(pitch_class.shape) == [8, 250]
    if i == 4:
        dur = time.time()-t0
        est_time = dur / 5 * 123
        break
print('5 batches use {:.2f} seconds'.format(dur))
print('Estimated time to load the whole training set: {:.2f} seconds.'.format(est_time))

  0%|          | 0/123 [00:00<?, ?it/s]

In [1]:
# Time each step to see what dominates the cost of a single batch

from tqdm import tqdm
from dataset import get_data_loader, move_data_to_device
from hparams import Hparams
import time
import warnings

# Disable all warnings
warnings.filterwarnings("ignore")

t0 = time.time()
train_loader = get_data_loader(split='train', args=Hparams.args)
for i, data in enumerate(tqdm(train_loader)):
    x, onset, offset, octave, pitch_class = move_data_to_device(data, 'cpu')
    assert list(x.shape) == [8, 250, 256]  # shape in [B, T, D], i.e., [Batch size, num of frame per sample, feature dimension]
    assert list(onset.shape) == list(offset.shape) == list(octave.shape) == list(pitch_class.shape) == [8, 250]
    if i == 0:
        dur = time.time()-t0
        est_time = dur * 123
        break
print('1 batches use {:.2f} seconds'.format(dur))
print('Estimated time to load the whole training set: {:.2f} seconds.'.format(est_time))

  0%|          | 0/123 [00:00<?, ?it/s]

[.\data_mini\train\60\Mixture.mp3]: Time to load audio is 4.703905820846558
[.\data_mini\train\60\Mixture.mp3]: Time to load mel spec is 0.11013507843017578
[.\data_mini\train\60\Mixture.mp3]: time for getlabels without caching is 0.010503530502319336


  0%|          | 0/123 [00:05<?, ?it/s]

1 batches use 6.01 seconds
Estimated time to load the whole training set: 739.18 seconds.





If the estimated time to load the whole training set is within 30s, congratulations!
If you did not modify the workflow of the current dataset, it may take more than 1000 seconds to load the whole training set. Consequently, data loading becomes the bottleneck of the training time.

**YOUR TASK**:
1. Answer the two questions below **[2 mark(s)]**
2. Make the necessary changes to your dataset class. **[3 mark(s)]**
3. Restart the kernel, and run the code cell below. **[1 mark(s)]**

Question:
[1] What do you think causes the time overhead issue in the current data loading?
**Loading audio with librosa is the main cost: loading each file takes roughly 13-15 seconds, and resampling it to the default sampling rate takes another 4-6 seconds.**

[2] What is your plan to modify your dataset class? State 2 possible solutions, and mention which do you choose to implement, together with the reason.
**The two approaches I plan to try are:
a. Use torchaudio (PyTorch audio) instead of librosa, since it loads the audio much faster, in about 0.3 seconds.
b. Cache every computed tensor so that repeated accesses are much faster.**
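
Approach (b) can be sketched as a small in-memory cache, assuming features are keyed by file path. `CachedFeatureStore` and `slow_features` are hypothetical names used only for illustration; `slow_features` stands in for audio decoding plus mel-spectrogram extraction and is not the code in **dataset.py**.

```python
import time


class CachedFeatureStore:
    """Compute each item's features once, then serve them from memory."""

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn   # expensive function: path -> features
        self.cache = {}
        self.misses = 0                # how many times compute_fn actually ran

    def get(self, path):
        if path not in self.cache:
            self.misses += 1
            self.cache[path] = self.compute_fn(path)
        return self.cache[path]


def slow_features(path):
    """Stand-in for decoding audio and computing a mel spectrogram."""
    time.sleep(0.01)
    return [len(path)] * 4


store = CachedFeatureStore(slow_features)
for _ in range(3):                     # three passes over the same two files
    for p in ["song_a.mp3", "song_b.mp3"]:
        store.get(p)
print(store.misses)                    # the expensive function ran once per file
```

The trade-off is memory: every cached spectrogram stays resident, which is acceptable for a small dataset but may need an on-disk cache for a larger one.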

In [4]:
from tqdm import tqdm
from dataset import *
from hparams import Hparams
import time

t0 = time.time()
train_loader = get_data_loader(split='train', args=Hparams.args)
for i, data in enumerate(tqdm(train_loader)):
    x, onset, offset, octave, pitch_class = move_data_to_device(data, 'cpu')
    assert list(x.shape) == [8, 250, 256]  # shape in [B, T, D], i.e., [Batch size, num of frame per sample, feature dimension]
    assert list(onset.shape) == list(offset.shape) == list(octave.shape) == list(pitch_class.shape) == [8, 250]
    if i == 4:
        dur = time.time()-t0
        est_time = dur / 5 * 123
        break
print('5 batches use {:.2f} seconds'.format(dur))
print('Estimated time to load the whole training set: {:.2f} seconds.'.format(est_time))
if est_time < 40:
    print('Well Done!')
else:
    print('We can still be faster.')

  3%|▎         | 4/123 [00:00<00:17,  6.70it/s]

5 batches use 0.69 seconds
Estimated time to load the whole training set: 17.09 seconds.
Well Done!





### Task 3: Loss function [3 marks]

We are going to use "multitask learning" to train our model. I.e., we train our model simultaneously on 4 frame classification tasks. They are:
- Classify whether there is an onset on a frame.
- Classify whether there is an offset on a frame.
- Classify which octave the pitch of a frame lies in (we have 4 octaves numbered 1~4, and 0 means silence).
- Classify the pitch class of the current pitch (C~B, 12 semitones, plus 0, which means silence).

In this case, the loss function is not that straightforward. To improve the readability of our code, we are going to wrap the loss computation into a class.
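
To make the arithmetic concrete, here is a single-frame sketch in plain Python, assuming binary cross-entropy on raw logits for onset/offset, softmax cross-entropy for octave and pitch class, and an unweighted sum as the total. Those choices and the name `frame_loss` are assumptions for illustration, not the required design of `train.LossFunc`; a PyTorch version would apply the same idea to whole `[B, T]` tensors.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit (numerically stable form)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def cross_entropy(logits, target_idx):
    """Softmax cross-entropy over raw logits for one frame."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_idx]


def frame_loss(out, tgt):
    """Return (total, onset, offset, octave, pitch_class) losses for one frame."""
    on_logit, off_logit, octave_logits, pitch_logits = out
    on_t, off_t, octave_t, pitch_t = tgt
    onset_loss = bce_with_logits(on_logit, on_t)
    offset_loss = bce_with_logits(off_logit, off_t)
    octave_loss = cross_entropy(octave_logits, octave_t)
    pitch_loss = cross_entropy(pitch_logits, pitch_t)
    # Unweighted sum of the four task losses (an assumption; weights could be tuned).
    total = onset_loss + offset_loss + octave_loss + pitch_loss
    return total, onset_loss, offset_loss, octave_loss, pitch_loss


# An uninformative model (all logits 0): 5 octave classes, 13 pitch classes.
losses = frame_loss((0.0, 0.0, [0.0] * 5, [0.0] * 13), (1, 0, 2, 7))
```

With all-zero logits each binary task costs ln 2 and each multi-class task costs the log of its class count, which gives a useful baseline when reading early training losses.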

**YOUR TASKS:** Finish the code of train.LossFunc class, and run the cell below.  **[3 mark(s)]**

In [1]:
import torch
from train import LossFunc

loss_func = LossFunc(device='cpu')

on_out = torch.rand(size=(8, 250))          # [B, T]
off_out = torch.rand(size=(8, 250))
octave_out = torch.rand(size=(8, 250, 5))   # [B, T, #Class]
pitch_class_out = torch.rand(size=(8, 250, 13))

on_tgt = on_out
off_tgt = off_out
octave_tgt = torch.randint(high=5, size=(8, 250))
pitch_class_tgt = torch.randint(high=13, size=(8, 250))


losses = loss_func.get_loss(
    out=(on_out, off_out, octave_out, pitch_class_out),
    tgt=(on_tgt, off_tgt, octave_tgt, pitch_class_tgt)
)

assert losses is not None
assert len(losses) == 5
assert isinstance(losses[0], torch.Tensor)
print('Succeed!')


Succeed!


### Task 4: Metric [3 marks]

We need to observe the model performance during training to see how training is progressing. In addition to the loss, we may also want to know the f1 score or accuracy, in both the training loop and the validation loop.
To facilitate this, we wrap the metric computation into a single class.
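
One way such a wrapper can be organised is a buffer that collects per-batch values and is emptied when the averages are read. This is only a sketch of the accumulate-then-reset pattern, assuming per-batch values arrive as a dict of floats; `RunningMetrics` is a hypothetical name and the real class also has to compute the losses, f1 scores and accuracies from `out` and `tgt`.

```python
class RunningMetrics:
    """Accumulate per-batch values and report their mean on request."""

    def __init__(self):
        self.buffer = {}

    def update(self, batch_values):
        """Append one batch's values, e.g. {'loss': 2.0, 'onset_f1': 0.5}."""
        for name, value in batch_values.items():
            self.buffer.setdefault(name, []).append(float(value))

    def get_value(self):
        """Return the mean of each metric, then clear the buffer for the next epoch."""
        averaged = {name: sum(vals) / len(vals) for name, vals in self.buffer.items()}
        self.buffer = {}
        return averaged


m = RunningMetrics()
m.update({"loss": 2.0, "onset_f1": 0.5})
m.update({"loss": 4.0, "onset_f1": 0.0})
epoch_values = m.get_value()
```

Clearing the buffer inside `get_value` keeps training and validation statistics from leaking into each other between epochs.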

**YOUR TASKS:** Finish the code of the train.Metrics class, and run the cell below.  **[3 mark(s)]**



In [3]:
from torch import float64
import torch
from tqdm import tqdm
from dataset import get_data_loader, move_data_to_device
from hparams import Hparams
from train import LossFunc, Metrics

loss_func = LossFunc(device='cpu')
metric = Metrics(loss_func=loss_func)

# dummy output
on_out = torch.rand(size=(8, 250))          # [B, T]
off_out = torch.rand(size=(8, 250))
octave_out = torch.rand(size=(8, 250, 5))   # [B, T, #Class]
pitch_class_out = torch.rand(size=(8, 250, 13))
out = (on_out, off_out, octave_out, pitch_class_out)

train_loader = get_data_loader(split='train', args=Hparams.args)
for i, data in enumerate(tqdm(train_loader)):
    x, onset, offset, octave, pitch_class = move_data_to_device(data, 'cpu')
    tgt = (onset, offset, octave, pitch_class)

    metric.update(out, tgt)
    if i == 4:
        break
train_metric = metric.get_value()
print(train_metric, '\n')
assert len(train_metric) == 9
for k in train_metric:
    assert train_metric[k] > 0
assert metric.buffer == {}
print('Congrats!')

  3%|▎         | 4/123 [01:16<38:08, 19.23s/it]  

{'loss': 6.605385971069336, 'onset_loss': 1.1756054162979126, 'offset_loss': 1.1829899787902831, 'octave_loss': 1.6482551574707032, 'pitch_loss': 2.598535585403442, 'onset_f1': 0.060420632618140103, 'offset_f1': 0.06041505634486559, 'octave_acc': 0.1891, 'pitch_acc': 0.0834} 

Congrats!





### Task 5: Model [5 marks]

**YOUR TASKS:**
1. Implement the model in model.BaseCNN_mini, following the description below **[4 mark(s)]**
2. Successfully run the cell below. **[1 mark(s)]**

**Model Description**
1. This is a convolutional neural network that operates on the spectrogram of a 5-s segment of audio.
2. Three 2-d convolutional layers at the beginning, each with 3x3 kernel size and 1x1 padding. Output channel numbers: 16, 32, 64.
3. Each conv layer is followed by batch normalization, then ReLU as the activation.
4. Before the 2nd and 3rd conv layers, there is max pooling along the feature dimension. NOTE: do not shrink the time dimension, because we need to make a prediction for each frame.
5. After the convolution operations, there is a position-wise feed-forward layer with 256 dimensions. I.e., for every frame along the time axis, convert all features of that frame into a 256-d vector.
6. Prediction heads for onset, offset, octave, and pitch class. They receive the output of the feed-forward layer and produce the final output. Note: there is no activation function (e.g., sigmoid/softmax) after these last layers.
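
The shape bookkeeping implied by this description can be traced without building the network. The sketch below assumes a pooling factor of 2 along the feature axis (the description does not state one) and uses a hypothetical helper name, `trace_shapes`; it only computes shapes and is not a model implementation.

```python
def trace_shapes(batch=8, frames=250, feat_dim=256, channels=(16, 32, 64), pool=2):
    """Return the [B, C, T, F] shape after each conv block, pooling only along F."""
    shapes = []
    freq = feat_dim
    for i, ch in enumerate(channels):
        if i > 0:
            freq //= pool   # max pooling before the 2nd and 3rd conv, feature axis only
        # A 3x3 kernel with padding 1 leaves T and F unchanged.
        shapes.append((batch, ch, frames, freq))
    return shapes


shapes = trace_shapes()
b, c, t, f = shapes[-1]
ff_in = c * f   # features per frame entering the 256-d feed-forward layer
print(shapes, ff_in)
```

Under these assumptions the time axis stays at 250 frames throughout, and each frame carries 64 channels times 64 feature bins before the feed-forward layer maps it to 256 dimensions.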

In [1]:
from model import BaseCNN_mini
import torch

model = BaseCNN_mini(feat_dim=256)
dummy_input = torch.rand(size=(8, 250, 256))
out = model(dummy_input)
on, off, oct, pit = out

assert list(on.shape) == [8, 250]
assert list(off.shape) == [8, 250]
assert list(oct.shape) == [8, 250, 5]
assert list(pit.shape) == [8, 250, 13]

print('Congrats!')

Congrats!


## Section 2 - Training and Evaluation [14 marks]

### Task 6: Training [5 marks]
Now we are ready for training! Neither the model nor the data is large, so training can be performed on your laptop.
Start training with the following command (estimated training time: 10 min):

        python train.py
You may need some time for debugging to successfully finish the training.

### Task 7: Testing [3 marks]
Next is testing. Test the model by running

        python test.py
The estimated performance: 48%, 38%, 17%.

After finishing training or testing, please attach a screenshot that indicates training/testing is finished. **Please include your command line prompt in the screenshot to show that it is you**.

### Task 8: Visualization (Case Study) [6 marks]
Finally, to gain an intuitive understanding of the model's performance, please visualize the output of your model and the ground truth (annotation) for one segment of audio. You may choose whichever way you like. After that, please attach your visualization figure below.

## Screenshots

**1. Screenshot of finish of training**

![Training Loss](./resources/trainingLossAfter10Epochs.png)

**2. Screenshot of finish of testing**

![Testing Results](./resources/testResults.png)


**3. Visualization figure of output and ground truth**

Below is a visualization of the original song (left) vs. the predicted MIDI output (right). Note that there is a slight delay on purpose, to show the difference between the generated audio files.

In [1]:
%%HTML
<video width="320" height="240" controls>
  <source src="./resources/Visualization.mp4" type="video/mp4">
</video>

## Section 3 - Questions [13 marks]
**YOUR TASKS:** Please answer the questions below:

Q1. How does the post-processing algorithm operate? How does it convert frame-level output to note-level output before the final evaluation? Briefly explain your understanding. (You may need to find the corresponding code from some py file and try to understand it before answering.) **[2 marks]**

Ans. Once we have the frame-level predictions from the trained model, these are the steps taken to convert them into note-level output for an individual song:
1. We threshold the onset and offset outputs (thresholds 0.7 and 0.5 respectively) to turn each into a binary vector over 250 frames (i.e., 5 seconds).
2. Similarly, the pitch-class output (250, 13) and the octave output (250, 5) are each reduced to the predicted class index per frame, giving vectors of shape (250,).
3. We do this for multiple batches and flatten the results. With, say, 48 batches, each vector has shape (48 * 250,), so for onset, offset, pitch class and octave we have 12k frames of values.
4. This list of 12k frames is then converted to note-level annotations of the form [(a, b, c), ...], where 'a' and 'b' are the start and end times in seconds and 'c' is the MIDI number.
5. Since each frame covers 0.02 seconds, consecutive frames are merged into ranges to form note-level annotations.
6. Every time an onset frame has value 1, we take it as the start of a note and track the MIDI pitch of the frames that follow (computed from the pitch-class and octave values of those frames).
7. When we reach a frame whose offset value is 1, we count the frames elapsed since the onset frame and convert that count into a duration, using the 0.02-second frame size.
8. We iterate through all the frames in this way, finding onset-to-offset ranges that share the same MIDI number, and store each range as one note-level annotation.
9. As a result, the 12k frames are reduced to far fewer annotations, because many frames are merged into each note and the output is expressed in seconds rather than in frames.

The same procedure is repeated for every song in the test dataset to obtain its note-level annotations.
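
The merging described in steps 5 to 8 can be sketched as follows. This is a simplified illustration, not the course's post-processing code: it assumes one prediction per frame, treats only the rising edge of the onset probability as an onset, takes the most frequent MIDI number within a note as its pitch, and uses a hypothetical label-to-MIDI mapping (`to_midi`). The 0.02 s frame size and the 0.7 / 0.5 thresholds are the values quoted in the answer above.

```python
from collections import Counter

FRAME_SEC = 0.02  # frame duration quoted in the answer above


def to_midi(octave, pitch_class):
    """Hypothetical label mapping: octave 1, pitch class 1 -> MIDI 36."""
    return 36 + (octave - 1) * 12 + (pitch_class - 1)


def frames_to_notes(on_prob, off_prob, octave, pitch_class,
                    on_thr=0.7, off_thr=0.5, frame_sec=FRAME_SEC):
    """Merge frame-level predictions into (start_sec, end_sec, midi) notes."""
    notes, start, pitches = [], None, []

    def close(end_frame):
        if pitches:  # drop notes that contain no voiced frame
            midi = Counter(pitches).most_common(1)[0][0]
            notes.append((round(start * frame_sec, 2),
                          round(end_frame * frame_sec, 2), midi))

    for t in range(len(on_prob)):
        is_onset = on_prob[t] >= on_thr and (t == 0 or on_prob[t - 1] < on_thr)
        if is_onset:
            if start is not None:
                close(t)                  # a new onset ends the note in progress
            start, pitches = t, []
        elif start is not None and off_prob[t] >= off_thr:
            close(t)
            start, pitches = None, []
        if start is not None and octave[t] > 0 and pitch_class[t] > 0:
            pitches.append(to_midi(octave[t], pitch_class[t]))
    return notes  # a note still open at the end of the input is discarded


# Toy input: 20 frames containing two notes.
on_p = [0.0] * 20
off_p = [0.0] * 20
octv = [0] * 20
pc = [0] * 20
on_p[2], off_p[8] = 0.9, 0.6
on_p[10], off_p[15] = 0.8, 0.9
for t in range(2, 8):
    octv[t], pc[t] = 3, 1
for t in range(10, 15):
    octv[t], pc[t] = 3, 3
notes = frames_to_notes(on_p, off_p, octv, pc)
```

Here 20 frames collapse into two notes, which is the reduction from frames to note-level annotations that step 9 refers to.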

</br>

---------------------------------------------------------------------------------------------------------------------------------------

Q2. How did we compute the final note-level performance metrics? Briefly explain how the computation is conducted for a pair of note-level output and annotation for one song. (Hint: look into the code first, and some "useful resources" attached at the beginning of this notebook may help.) **[2 marks]**

Ans. In our test code, we first run prediction on every song in the test set to get its note-level predicted annotations, and then use the mir_eval library to compute the evaluation metrics for each song's predicted vs. ground-truth annotations.

The mir_eval transcription evaluation considers 3 aspects of a note: onset, offset and pitch.

It uses the following criteria to classify each of them as "correct":
1. Correct Onset: the predicted onset time is within ±50 ms of the ground-truth onset time.
2. Correct Offset: the predicted offset time is within ±max(50 ms, 20% of the ground-truth note's duration) of the ground-truth offset time.
3. Correct Pitch: the predicted pitch is within ±0.5 semitones of the ground-truth pitch (pitches are given in Hz).

Using the aforementioned notion of "correctness", there are 3 categories of metrics, and for each of them we compute precision, recall and f1 score:
1. COnPOff: Correct Onset, Pitch and Offset
2. COnP: Correct Onset and Pitch
3. COn: Correct Onset only

For each category, a predicted note counts as a true positive only if all the attributes in that combination are "correct" (e.g., in COnP both the onset and the pitch must be classified as correct). From these counts we compute precision, recall and f1 score.

Here is a brief explanation of each term:
1. Precision:
   - The fraction of positive predictions that are actually correct (true positives).
   - precision = true positives / (true positives + false positives)
2. Recall:
   - The fraction of actual positive cases that the classifier predicted correctly.
   - recall = true positives / (true positives + false negatives)
3. F1 Score:
   - The harmonic mean of precision and recall, a single metric that encapsulates both ideas.
   - f1 score = 2 * precision * recall / (precision + recall)
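
A worked example of the COnPOff computation for one song, under stated simplifications: notes are `(onset_sec, offset_sec, midi)` tuples, pitch is compared in semitones on MIDI numbers rather than in Hz, and notes are paired greedily, whereas mir_eval finds an optimal one-to-one matching. The tolerances follow mir_eval's documented defaults (`onset_tolerance=0.05`, `pitch_tolerance=50` cents, `offset_ratio=0.2`, `offset_min_tolerance=0.05`); whether the test script overrides them is not checked here.

```python
def note_matches(ref, est, onset_tol=0.05, pitch_tol=0.5,
                 offset_ratio=0.2, offset_min_tol=0.05):
    """Check the COnPOff criteria for one (onset_sec, offset_sec, midi) pair."""
    onset_ok = abs(est[0] - ref[0]) <= onset_tol
    pitch_ok = abs(est[2] - ref[2]) <= pitch_tol
    offset_tol = max(offset_min_tol, offset_ratio * (ref[1] - ref[0]))
    offset_ok = abs(est[1] - ref[1]) <= offset_tol
    return onset_ok and pitch_ok and offset_ok


def precision_recall_f1(ref_notes, est_notes):
    """Greedy one-to-one matching (a simplification of mir_eval's matching)."""
    unmatched = list(est_notes)
    tp = 0
    for ref in ref_notes:
        for est in unmatched:
            if note_matches(ref, est):
                unmatched.remove(est)   # each estimate may match only one reference
                tp += 1
                break
    precision = tp / len(est_notes) if est_notes else 0.0
    recall = tp / len(ref_notes) if ref_notes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


ref = [(0.00, 1.00, 60), (1.00, 1.50, 62), (2.00, 2.40, 64)]
est = [(0.02, 1.10, 60),   # matches: offset error 0.10 s <= 0.2 * 1.0 s
       (1.00, 1.80, 62),   # fails: offset error 0.30 s > 0.2 * 0.5 s
       (2.03, 2.41, 64),   # matches
       (3.00, 3.20, 65)]   # extra note, a false positive
p, r, f1 = precision_recall_f1(ref, est)
```

Two of the four predicted notes match two of the three reference notes, so precision is 2/4, recall is 2/3 and the f1 score is 4/7. COnP and COn follow by dropping the offset check, and then the pitch check as well.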

</br>

---------------------------------------------------------------------------------------------------------------------------------------

Q3. Recall that we are using f1 score in the Metrics class to observe the onset/offset classification performance. Do you think such an f1 score is a good performance measure in this case? What about using accuracy (correct predictions / num of frames) instead, do you think it's proper for onset/offset classification? If not, what are the reasons respectively for the two metrics? What are possible better choices?  **[2 marks]**

Ans. For onset/offset classification, f1 score is certainly a better metric than accuracy, primarily because the onset and offset label vectors are highly imbalanced.

Most frames are 0 and only a few are 1, so accuracy rewards a classifier that predicts 0 almost everywhere: it scores highly while missing the onsets/offsets we actually care about.

In onset/offset classification we want to minimize both false positives (non-onset/offset frames classified as onset/offset) and false negatives (actual onset/offset frames that are missed). f1 score handles imbalanced data much better because it combines precision and recall into a single metric.
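
A small numeric illustration of the imbalance argument. The label ratio used here (5 onset frames in a 250-frame segment) is made up for the example, not measured from the dataset.

```python
def accuracy(pred, tgt):
    """Fraction of frames whose predicted label equals the target."""
    return sum(p == t for p, t in zip(pred, tgt)) / len(tgt)


def f1_score(pred, tgt):
    """Frame-level f1 for the positive class, 0.0 when there are no true positives."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, tgt))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, tgt))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, tgt))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# One 250-frame segment with 5 onset frames (illustrative ratio).
target = [0] * 250
for t in (10, 60, 110, 160, 210):
    target[t] = 1

all_zero = [0] * 250               # a "classifier" that never predicts an onset
acc = accuracy(all_zero, target)
f1 = f1_score(all_zero, target)
print(acc, f1)
```

The never-predict-onset baseline reaches 98% accuracy while its f1 score is 0, which is why accuracy says little about onset/offset quality.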

That said, f1 score also has its flaws:
1. It does not take true negatives into account, which matter for understanding the classifier's behaviour.
2. It requires choosing a specific decision threshold (0.5 in my case), and the best value may not be obvious beforehand.

Here are a few approaches we can take:
1. We can consider a mean squared error metric after changing all the 1s to some positive integer and all the 0s to a negative number of large magnitude. This gives the model an incentive to classify positive onset and offset frames correctly, since it is heavily penalized otherwise.
2. Another metric we can consider is ROC-AUC, the area under the curve of true positive rate vs. false positive rate obtained by sweeping the threshold. A value of 0.5 indicates near-random performance, whereas a value close to 1 indicates a very good classifier. Given below are some pictures borrowed from an [online blog from evidently AI](https://www.evidentlyai.com/classification-metrics/explain-roc-curve#:~:text=A%20ROC%20AUC%20score%20is,one%20is%20a%20trapezoidal%20rule.) to better visualize this metric.

a. ![roc auc1](./resources/ROC%20AUC%200.png)
b. ![roc auc1](./resources/ROC%20AUC%201.png)
c. ![roc auc1](./resources/ROC%20AUC%202.png)
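
ROC-AUC can also be computed directly from scores without picking a threshold, using its pairwise interpretation: the probability that a randomly chosen positive frame receives a higher score than a randomly chosen negative frame. The sketch below is a plain-Python illustration with made-up scores; in practice a library routine would normally be used.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.3, 0.2, 0.1]   # made-up onset probabilities
labels = [1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)         # 5 of the 6 positive/negative pairs are ordered correctly
constant = roc_auc([0.5] * 5, labels) # an uninformative scorer
```

A scorer that outputs the same value for every frame lands at 0.5, matching the near-random interpretation given above.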

</br>

---------------------------------------------------------------------------------------------------------------------------------------

Q4. How did your system perform? Briefly introduce your system's performance with objective metric scores and your visualization. **[2 marks]**

Ans. Based on the attached screenshot from my test script, here is a breakdown of the values:

1. Correct onset, pitch and offset (COnPOff):
   - Precision: 11.49%
   - Recall: 10.75%
   - F1-score: 9.62%

2. Correct onset and pitch (COnP):
   - Precision: 34.56%
   - Recall: 26.81%
   - F1-score: 25.40%

3. Correct onset (COn):
   - Precision: 59.63%
   - Recall: 39.24%
   - F1-score: 39.35%

4. Overall note transcription:
   - Ground truth note count: 1520
   - Predicted note count: 1397

As we can see, the model performs poorly on COnPOff, with an f1 score of just 9.6%, because that is a much stricter criterion.

In my experiments running this model for 20 and 30 epochs, I could see these f1 scores improving over time. I believe that with around 50-60 epochs the f1 score for COnPOff might reach around 50-60%, in which case the f1 scores for the two less strict criteria (COnP and COn) would be higher still, around 70% or even 80%.

Another observation is that precision is higher than recall in all 3 categories. This seems to indicate that when the model does make a positive prediction it is relatively likely to be correct, but the model still misses many positive instances.

As for the visualization, kindly refer to the video attached in the Screenshots section above, or check the path "./resources/Visualization.mp4".

I have also saved the MIDI output files for the test data from the models obtained after 20 and 30 epochs in the path "./results/midi after x epochs".
My observation was that for specific songs the MIDI output was slightly closer to the original song, but the model still has a lot of room for improvement, since we can train it for many more epochs and watch the accuracy increase.

</br>

---------------------------------------------------------------------------------------------------------------------------------------
Q5. The current system might not be performing very well on this dataset, or it may perform decently but still have room to improve. What are possible reasons for the not-so-good performance, and directions of improvement? Please list 3 pairs of them. **[3 marks]**

Ans. Here are the reasons, along with possible improvements:
1. Dataset quality:
   - This is a very small dataset. Had we used a much larger and more general dataset and trained for longer, the model could certainly improve.
   - Alternatively, this dataset has a lot of variety. If we restricted it to specific genres, the model could specialize its predictions for those genres. If we want a general model, however, it is best to train on a large general dataset as mentioned above.
   - The mp3 files also contain a lot of background music, so we could use a dataset with only voice and no background music.

2. Training approach:
   - We only trained for 10 epochs here. Training for more epochs would surely have improved the accuracy, as I observed this trend when training for 20-30 epochs.
   - Perhaps training for around 250 epochs on this dataset with this model would have yielded better performance.

3. Model architecture:
   - Since music is inherently sequential, modelling the notes as a sequence (as in language modelling) seems a more intuitive approach than CNNs with max pooling, which may lose fine temporal detail and how each note is conditioned on the previous notes.
   - To that end, architectures like RNNs, LSTMs or autoregressive SOTA models like transformers could possibly yield better results.
   - Another, finer-grained option with the current dataset would be to train a separate model that only predicts what is human voice and what is background music, since just two vectors of onset and offset do not seem sufficient to capture this distinction.
   - Alternatively, we can transcribe everything in the polyphonic music (instruments, voice, etc.) and use another classifier to keep only the human voice and ignore the other instruments.

4. Hyperparameter tuning:
   - The choice of hyperparameters and optimizer can also greatly affect training and the overall test accuracy of the model.
   - A hyperparameter search for the given architecture is needed to find values that let training converge to a better solution.

5. Different metrics:
   - Using the MIR evaluation metrics themselves as a loss function might actually guide the model to a better solution.
   - Similarly, other metrics like mean squared error could be used for onset and offset, with the kind of preprocessing mentioned in my answer to Q3 above.

</br>

---------------------------------------------------------------------------------------------------------------------------------------
Q6. What do you think is the most difficult part? Which part did you spend the most time on? **[1 mark(s)]**

Ans. Since most of my deep learning experience was with Keras and TensorFlow a long time ago, I had to learn the PyTorch fundamentals, and I spent time building the model and making sure the resulting tensor shapes were correct. I also spent considerable time on the dataloader, because the librosa audio loader was very slow; only after switching to PyTorch audio (torchaudio) loading did the dataloader run within the expected time. Understanding the mir_eval values and going through the code flow also took around half a day.
</br>

---------------------------------------------------------------------------------------------------------------------------------------
Q7. How much time did you spend on the assignment? Please fill in an estimated time here if you did not time yourself. **[1 mark(s)]**

Ans. ~5 days
</br>

---------------------------------------------------------------------------------------------------------------------------------------
