# Import

MUSDB18 (non-HQ): stems stored in the STEM format (mp4). Use `stempeg` to import them.
MUSDB18-HQ: wave data

Data representation: STFT (using librosa for mono, scipy for stereo) and MFCC

## Ratio Mask Filtering (Oracle)

Oracle Filtering

Let $S_{v}$ be the STFT of track $v$

Define a ratio mask on training samples

$$ mask_v = \frac{|S_v|}{|S_v| + |\sum_{i \neq v}{S_i}|} = \frac{|S_v|}{|S_v| + |S_{others}|} $$

Mask has the same shape as STFT

To reconstruct the track, multiply the mask element-wise with the mixture STFT

$$ S_v' = mask_v \cdot S_{mixture}' $$
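
The mask definition and the reconstruction step above can be sketched in NumPy. This is a minimal illustration on synthetic complex arrays standing in for real STFTs; the function names `ratio_mask` and `apply_mask` and the small `eps` added to avoid division by zero are assumptions, not part of the notes above.

```python
import numpy as np


def ratio_mask(target_stft, others_stft, eps=1e-10):
    """Ratio mask |S_v| / (|S_v| + |S_others|), same shape as the STFTs.

    `eps` is an assumed safeguard against division by zero in silent bins.
    """
    target_mag = np.abs(target_stft)
    others_mag = np.abs(others_stft)
    return target_mag / (target_mag + others_mag + eps)


def apply_mask(mask, mixture_stft):
    """Element-wise product of the real-valued mask and the complex mixture STFT."""
    return mask * mixture_stft


# Synthetic complex "STFTs" with shape (frequency bins, frames).
rng = np.random.default_rng(0)
shape = (5, 4)
S_vocals = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
S_others = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
S_mixture = S_vocals + S_others

mask = ratio_mask(S_vocals, S_others)
S_vocals_est = apply_mask(mask, S_mixture)  # keeps the phase of the mixture
```

Because the mask is real-valued, the estimate reuses the mixture phase; only the magnitudes are rescaled.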

# Models

Open Unmix: learns a mapping from the mixture STFT to a target STFT
- 4 target tracks (vocals, drums, bass, other) mean there are 4 models (1 model per target track)

Asteroid

Demucs

NUSSL

Spleeter

## Improvements

Data augmentation: 
- https://www.researchgate.net/profile/Yuki-Mitsufuji/publication/315100151_Improving_music_source_separation_based_on_deep_neural_networks_through_data_augmentation_and_network_blending/links/59ed4f844585151983ccdcba/Improving-music-source-separation-based-on-deep-neural-networks-through-data-augmentation-and-network-blending.pdf
- https://hal.archives-ouvertes.fr/hal-02379796/document

Model ensemble:
- https://www.researchgate.net/profile/Yuki-Mitsufuji/publication/315100151_Improving_music_source_separation_based_on_deep_neural_networks_through_data_augmentation_and_network_blending/links/59ed4f844585151983ccdcba/Improving-music-source-separation-based-on-deep-neural-networks-through-data-augmentation-and-network-blending.pdf

# Evaluation

Use museval

## Baseline
- [ ] Write code to load data
- [ ] Run 1 chosen model
- [ ] Write submission code

## Improvements
- [ ] Data augmentation methods
- [ ] Research on SOTA models

# Submission
- `test.py`:
    - `prediction_setup()`: load model
    - `predict()`: 
        - load a single mixture audio file from a file path
        - perform separation
        - write demixed target to file path
- `predict.py`: submit using `submission = copy_predictor` once `CopyPredictor` class has been implemented using our model

# Test code for Open UnMix

In [None]:
import torch
import torchaudio
from openunmix import data, predict
from IPython.display import Audio, display

In [None]:
# load model
torch.hub.set_dir('./models')
separator = torch.hub.load("sigsep/open-unmix-pytorch", "umxhq")

In [None]:
mixture_file_path = './data/test/Al James - Schoolboy Facination/mixture.wav'

In [None]:
# load mixture audio
audio, rate = data.load_audio(mixture_file_path)
# perform separation
estimates = predict.separate(audio=audio, rate=rate, separator=separator)

In [None]:
# display demixed targets
# WILL RESULT IN A LARGE NOTEBOOK FILE. DO NOT RUN THIS ON AZURE WORKSPACE.
for target, estimate in estimates.items():
    print(target)
    # use a separate name so the loaded mixture in `audio` is not overwritten
    estimate_np = estimate.detach().cpu().numpy()[0]
    display(Audio(estimate_np, rate=rate))

In [None]:
# save demixed targets
for target, estimate in estimates.items():
    torchaudio.save(
        target + ".wav",
        # drop only the batch dimension so mono audio keeps its channel axis
        estimate[0].detach().cpu(),
        sample_rate=rate,
    )

# Data augmentation
- [Uhlich et al. (2017)](https://www.researchgate.net/profile/Yuki-Mitsufuji/publication/315100151_Improving_music_source_separation_based_on_deep_neural_networks_through_data_augmentation_and_network_blending/links/59ed4f844585151983ccdcba/Improving-music-source-separation-based-on-deep-neural-networks-through-data-augmentation-and-network-blending.pdf): 
    - STFT 
    - modify data on-the-fly when constructing a training mini-batch sequence for BLSTMs (mini-batch size: 10, sequence length: 500)
    - random swapping left/right channel for each instrument
    - random scaling with uniform amplitudes from \[0.25, 1.25\]
    - random chunking into sequences for each instrument
    - random mixing of instruments from different songs
- [Nachmani & Wolf (2019)](https://arxiv.org/pdf/1904.06590.pdf): 
    - single CNN encoder and single WaveNet decoder
    - playing song forward and backward in time
    - multiplying values of raw audio signal by -1 (i.e. phase shift by 180 degrees)
- [Cohen-Hadria et al. (2019)](https://arxiv.org/pdf/1903.01415.pdf):
    - 4 signals (voice, drums, bass, accompaniment) are transformed separately using the most appropriate parameters
    - pitch shifting that preserves the spectral envelope, in \[-300, -200, -100, 0, 100, 200, 300]
    - time stretching in \[0.5, 0.93, 1, 1.07, 1.15]
    - transformation of spectral envelope (i.e. formant) only of the singing voice in \[-150, -100, 0, 100, 150]
    - improvements:
        - U-Net: pitch transposition
        - U-Net and Wave-U-Net: transposition, time stretching, formant shifting
- [Défossez et al. (2021)](https://hal.archives-ouvertes.fr/hal-02379796/document):
    - 2 architectures, both are waveform models
        - Conv-Tasnet (adapted)
        - Demucs: convolutional encoder, BLSTM, convolutional decoder, with encoder and decoder linked by U-Net skip connections
    - shuffling sources within one batch to generate new mixes
    - random swapping channels
    - random scaling by a uniform factor \[0.25, 1.25]
    - multiplying each source by $\pm1$
    - random changing pitch by -2, -1, 0, +1, +2 semitones (20% of the time)
    - change tempo by a factor taken uniformly in \[0.88, 1.12]
    - Conv-Tasnet doesn't benefit from pitch/tempo shift augmentation
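
The waveform-level augmentations that recur in the papers above (random channel swap, random gain in [0.25, 1.25], sign flip, and remixing of independently augmented sources) can be sketched with NumPy. The names `augment_source` and `remix`, the 50% probabilities for the swap and the sign flip, and the synthetic noise sources are assumptions for illustration; they are not taken from the cited papers. Pitch and tempo changes are left out because they need a dedicated audio library.

```python
import numpy as np


def augment_source(source, rng, gain_range=(0.25, 1.25)):
    """Randomly swap channels, rescale, and flip the sign of one stereo source.

    source: array of shape (2, n_samples).
    """
    out = source.copy()
    if rng.random() < 0.5:  # assumed probability
        out = out[::-1]  # swap left/right channels
    out = out * rng.uniform(*gain_range)  # random gain
    if rng.random() < 0.5:  # assumed probability
        out = -out  # multiply by -1
    return out


def remix(sources, rng):
    """Augment every source independently and sum them into a new mixture."""
    augmented = {name: augment_source(x, rng) for name, x in sources.items()}
    mixture = sum(augmented.values())
    return mixture, augmented


# Synthetic stereo sources standing in for real stems.
rng = np.random.default_rng(0)
sources = {
    name: rng.standard_normal((2, 1000))
    for name in ("vocals", "drums", "bass", "other")
}
mixture, augmented = remix(sources, rng)
```

Applying this on-the-fly per mini-batch, with sources drawn from different songs, would correspond to the random-mixing idea in the list above.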

# Model blending
[Uhlich et al. (2017)](https://www.researchgate.net/profile/Yuki-Mitsufuji/publication/315100151_Improving_music_source_separation_based_on_deep_neural_networks_through_data_augmentation_and_network_blending/links/59ed4f844585151983ccdcba/Improving-music-source-separation-based-on-deep-neural-networks-through-data-augmentation-and-network-blending.pdf):
- Step 1: time-invariant linear combination of raw DNN outputs, for each instrument $i$ 
    - $\hat{\textbf{s}}_{i, \text{BLEND}}(n)=\lambda \hat{\textbf{s}}_{i, \text{FNN}}(n)+(1-\lambda)\hat{\textbf{s}}_{i, \text{BLSTM}}(n)$
    - optimal $\lambda=0.25$ for DSD100
- Step 2: multi-channel Wiener-filter post-processing
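
Step 1 above is a plain convex combination of two estimates, which can be sketched as follows. The function name `blend` and the synthetic arrays are assumptions for illustration; the Wiener-filter post-processing of Step 2 is not covered here.

```python
import numpy as np


def blend(est_fnn, est_blstm, lam=0.25):
    """Time-invariant linear blend: lam * FNN + (1 - lam) * BLSTM."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * est_fnn + (1.0 - lam) * est_blstm


# Synthetic estimates for one instrument, shape (channels, samples).
rng = np.random.default_rng(0)
est_fnn = rng.standard_normal((2, 8))
est_blstm = rng.standard_normal((2, 8))

blended = blend(est_fnn, est_blstm)  # default lam=0.25, the value reported for DSD100
```

The same weight is used for every time step, which is what "time-invariant" refers to; the weight would normally be tuned on a validation set.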

# Architecture
[Défossez et al. (2021)](https://hal.archives-ouvertes.fr/hal-02379796/document):
- waveform-domain models perform better on `bass` and `drums` sources; spectrogram-domain models perform better on `vocals` and `other` sources
