# **META-LEARNING EXTRACTORS FOR MUSIC SOURCE SEPARATION**

*David Samuel, Aditya Ganeshan & Jason Naradowsky*



[**GitHub repository**](https://github.com/pfnet-research/meta-tasnet) | **[Paper](https://arxiv.org/abs/2002.07016)**
<br>
<br>

We propose a hierarchical meta-learning-inspired model for music source separation in which **a generator model is used to predict the weights of individual extractor models**.  This enables efficient parameter-sharing, while still allowing for instrument-specific parameterization.  The resulting models are shown to be more effective than those trained independently or in a multi-task setting, and achieve performance comparable with state-of-the-art methods.

<p align="center">
  <img src="https://raw.githubusercontent.com/pfnet-research/meta-tasnet/master/img/parameter_generation.png" alt="Overall architecture." width="512"/>  
</p>

<p align="center">
<em>The overall architecture. The blue area depicts the parameter generator, a network which predicts the weights of the extractor's masking subnetwork specific to each instrument.  The extractor network then uses these weights when separating the instrument source from the mixture.</em>
</p>

## Brief Introduction to Music Source Separation

Given a mixed source signal, the task of a source separation algorithm is to divide the signal into its original components. We test our method on music separation, specifically on the [MUSDB18 dataset](https://zenodo.org/record/1117372#.XiSY9Bco9QJ), where the sources are contemporary songs and the goal is to divide each of them into four stems: **drums, vocals, bass and any other accompaniment**.

Music source separation can serve as a preprocessing step for other MIR problems (such as sound source identification), but it can also be used more creatively: we can create backing tracks for any song for musical practice or just for fun (karaoke), build "smart" equalizers that are able to produce a new remix, or isolate a single instrument to better study its intricacies (guitar players can more easily determine the exact chords, for example).



<p align="center">
  <img src="https://raw.githubusercontent.com/pfnet-research/meta-tasnet/master/img/spectrogram.png" alt="Spectrograms of the mixture and the four separated stems." width="768"/>  
</p>

<p align="center">
<em>Illustration of a separated audio signal (projected on log-scaled spectrograms). The top spectrogram shows the mixed audio that is transformed into the four separated components at the bottom. Note that we use the spectrograms just to illustrate the task — our model operates directly on the audio waveforms.</em>
</p>

## Generating Extractor Models

The key idea is to utilize a tiered architecture where a **generator** network "supervises" the training of the individual extractors by **generating some of their parameters directly**.  This allows the generator to develop a dense representation of how instruments relate to each other *as it pertains to the task*, and to utilize their commonalities when generating each extractor.

Our model is based on [Conv-TasNet](https://arxiv.org/abs/1809.07454), a time-domain approach to speech separation comprising three parts:
1. an **encoder**, which applies a 1-D convolutional transform to a segment of the mixture waveform to produce a high-dimensional representation
2. a **masking function**, which computes a multiplicative mask that selects the part of the learned representation belonging to the target source
3. a **decoder** (1-D inverse convolutional layer), which reconstructs the separated waveform for the target source.
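
The three parts listed above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the sizes `N`, `L` and `stride` are hypothetical, the filters are random rather than learned, and the sigmoid mask is an assumption standing in for whatever the masking network predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

N, L, stride = 16, 8, 4             # hypothetical: number of filters, filter length, hop size
mixture = rng.standard_normal(64)   # stand-in for a short mono waveform

# 1) Encoder: slide N filters over the waveform (a strided 1-D convolution).
encoder = rng.standard_normal((N, L))
starts = range(0, len(mixture) - L + 1, stride)
frames = np.stack([mixture[t:t + L] for t in starts])   # [T', L]
representation = frames @ encoder.T                       # [T', N]

# 2) Masking function: one multiplicative value in [0, 1] per element.
#    The mask is random here; in the model it comes from the masking network.
mask = 1.0 / (1.0 + np.exp(-rng.standard_normal(representation.shape)))
masked = representation * mask

# 3) Decoder: transposed 1-D convolution, written as an explicit overlap-add.
decoder = rng.standard_normal((N, L))
estimate = np.zeros(len(mixture))
for frame_index, t in enumerate(starts):
    estimate[t:t + L] += masked[frame_index] @ decoder

print(representation.shape, estimate.shape)   # (15, 16) (64,)
```

Only step 2 depends on the target source in this sketch; steps 1 and 3 are identical for every instrument.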

The masking network is of particular interest, as it contains the source-specific masking information; the encoder and decoder are source-agnostic and remain fixed for separation of all sources.
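
To make the idea of generated parameters concrete, here is a minimal NumPy sketch of a generator that maps a per-instrument embedding to the weights of a single masking layer. Everything in it is an assumption made for illustration (the sizes, the linear generator, the single 1x1 layer, the name `generate_layer`); the architecture in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

instruments = ["drums", "bass", "other", "vocals"]
embed_dim, in_ch, out_ch = 8, 16, 16   # hypothetical sizes

# Shared generator parameters: one embedding per instrument and a linear map
# from an embedding to the flattened weights of one 1x1 convolution.
embeddings = rng.standard_normal((len(instruments), embed_dim))
gen_weight = rng.standard_normal((embed_dim, out_ch * in_ch)) * 0.1
gen_bias = np.zeros(out_ch * in_ch)


def generate_layer(instrument: str) -> np.ndarray:
    """Predict the [out_ch, in_ch] weight matrix of one masking layer."""
    e = embeddings[instruments.index(instrument)]
    return (e @ gen_weight + gen_bias).reshape(out_ch, in_ch)


# The encoder output is source-agnostic, so the same features feed every extractor.
features = rng.standard_normal((in_ch, 100))   # [N, T']
outputs = {name: generate_layer(name) @ features for name in instruments}

print(outputs["vocals"].shape)   # (16, 100)
```

All instruments share `gen_weight` and `gen_bias`, so whatever the generator learns for one instrument can also influence the weights it produces for the others; only the embedding is instrument-specific.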




## Multi-stage Architecture

Although the data is sampled at 44.1 kHz, we find that models trained at lower sampling rates are more effective, despite the loss in resolution. We therefore propose a multi-stage architecture that leverages this strength while still predicting high-resolution audio: it uses three stages operating at sampling rates of 8, 16 and 32 kHz.

<p align="center">
  <img src="https://raw.githubusercontent.com/pfnet-research/meta-tasnet/master/img/multi_stage.png" alt="Multi-stage architecture." width="512"/> 
</p>

<p align="center">
<em>Illustration of the multi-stage architecture. The resolution of the estimated signal is progressively enhanced by utilizing information from previous stages. The encoders increase the stride $s$ to preserve the same time dimension $T'$. Note that the masking TCN is still generated (not included in the illustration).</em>
</p>
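
The relation between sampling rate, stride and $T'$ can be checked with a few lines of arithmetic. The sketch below assumes that both the encoder kernel and its stride double from one stage to the next and that the input lengths double as well (the separation procedure later in this notebook pads the three inputs in exactly this way). The base kernel and stride values are hypothetical.

```python
def aligned_lengths(len_8k: int, n_stages: int = 3) -> list:
    """Input lengths for the 8, 16 and 32 kHz stages: each one doubles the previous."""
    base = len_8k + 1   # same "+1" as in the notebook's fix_length call
    return [base * 2 ** i for i in range(n_stages)]


def n_frames(length: int, kernel: int, stride: int) -> int:
    """Number of output frames of an unpadded 1-D convolution."""
    return (length - kernel) // stride + 1


base_kernel, base_stride = 20, 10   # hypothetical values for the 8 kHz stage
lengths = aligned_lengths(len_8k=7999)
frames = [
    n_frames(n, base_kernel * 2 ** i, base_stride * 2 ** i)
    for i, n in enumerate(lengths)
]

print(lengths)   # [8000, 16000, 32000]
print(frames)    # [799, 799, 799]
```

Because every stage produces the same number of frames, the representations of a lower-resolution stage can be combined frame by frame with those of the next stage.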

## Interactive Example



### 1. Initialize

In [None]:
!pip install youtube-dl
!pip install soundfile
!git clone https://github.com/pfnet-research/meta-tasnet

!wget "https://www.dropbox.com/s/zw6zgt3edd88v87/best_model.pt"

import youtube_dl, soundfile, librosa, os, sys, torch, IPython.display
import numpy as np
from IPython.display import HTML
from google.colab import output, files

sys.path.append("/content/meta-tasnet")
from model.tasnet import MultiTasNet

output.clear()

### 2. Load the saved model

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # optionally use the GPU
state = torch.load("best_model.pt", map_location=device)  # load the checkpoint onto the selected device (also works without a GPU)

network = MultiTasNet(state["args"]).to(device)  # initialize the model
network.load_state_dict(state['state_dict'])  # load weights from the checkpoint

`<All keys matched successfully>`

### 3. Define the separation procedure

In [None]:
# separate an audio clip (shape: [1, T]) with sampling rate `rate`
def separate_sample(audio, rate: int):

    def resample(audio, target_rate):
        # keyword arguments keep this call valid in both older and newer librosa releases
        return librosa.resample(audio, orig_sr=rate, target_sr=target_rate, res_type='kaiser_best', fix=False)

    audio = audio.astype('float32')  # match the type with the type of the weights in the network
    mix = [resample(audio, s) for s in [8000, 16000, 32000]]  # resample to different sampling rates for the three stages
    mix = [librosa.util.fix_length(m, size=(mix[0].shape[-1]+1)*(2**i)) for i, m in enumerate(mix)]  # align the three signals so that each length is exactly twice the previous one
    mix = [torch.from_numpy(s).float().to(device).view(1, 1, -1) for s in mix]  # cast to tensor with shape: [1, 1, T']
    mix = [s / s.std(dim=-1, keepdim=True) for s in mix]  # normalize by the standard deviation
    
    network.eval()
    with torch.no_grad():        
        separation = network.inference(mix, n_chunks=2)[-1]  # call the network to obtain the separated audio with shape [1, 4, 1, T']

    # normalize the amplitudes by computing the least squares
    # -> we try to scale the separated stems so that their sum is equal to the input mix 
    a = separation[0,:,0,:].cpu().numpy().T  # separated stems
    b = mix[-1][0,0,:].cpu().numpy()  # input mix
    sol = np.linalg.lstsq(a, b, rcond=None)[0]  # scaling coefficients that minimize the MSE
    separation = a * sol  # scale the separated stems

    estimates = {
        'drums': separation[:,0:1],
        'bass': separation[:,1:2],
        'other': separation[:,2:3],
        'vocals': separation[:,3:4],
    }

    return estimates
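
The least-squares rescaling at the end of `separate_sample` can be tried in isolation. The sketch below uses synthetic data (random "stems" and made-up gain errors) to show that `np.linalg.lstsq` recovers one coefficient per stem, so that the rescaled stems sum back to the mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

true_stems = rng.standard_normal((1000, 4))   # [T, 4] synthetic reference stems
mix = true_stems.sum(axis=1)                  # their sum plays the role of the input mix

# Pretend the network returned each stem with an arbitrary gain error.
gain_error = np.array([0.5, 2.0, 1.5, 0.8])
estimated = true_stems * gain_error           # [T, 4]

# One coefficient per stem, minimizing ||estimated @ sol - mix||^2.
sol = np.linalg.lstsq(estimated, mix, rcond=None)[0]
rescaled = estimated * sol                    # broadcast over the stem axis

print(np.round(sol * gain_error, 6))          # every entry is close to 1
print(np.allclose(rescaled.sum(axis=1), mix))
```

With real network outputs the stems contain separation errors as well as gain errors, so the rescaled sum only approximates the mixture; the coefficients are then the best fit in the mean-squared-error sense.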

### 4. Load a song from YouTube

Choose a song to separate within a time interval and hit play to load that song from YouTube.

Note that you can override the variable `id` and separate whatever song you want (use the "id" string from the video's YouTube URL)! Just keep in mind that the audio should be of high quality (which isn't always the case on YouTube, unfortunately).

In [None]:
ids = {
    "비밀의 화원": "U6FopXugJo8",
    "Marvin Gaye - Whats Happening Brother (soul)": "HO5UH4uX_yA",
    "Dire Straits - Sultans Of Swing (rock)": "0fAQhSRLQnM", 
    "Billie Eilish - Bad Guy (pop)": "DyDfgMOUjCI",
    "Death - Spirit Crusher (death metal)": "4_rYk_aJbcQ",
    "James Brown - Get Up (I Feel Like Being A) Sex Machine (funk)": "kwjHpi4rXb8",
    "Věra Bílá & Kale - Pas o panori (world music)": "R-L477kx8LA",
    "Eminem - Lose Yourself (hip-hop)": "nPA2czkOsFE",
    "Sting - Englishman in New York (pop/rock)": "d27gTrPPAyk",
    "R.E.M. - Losing my Religion (alternative rock)": "xwtdhWltSIg",
    "AURORA – Animal (pop)": "3DIT8Y3LC6M",
    "Red Hot Chili Peppers - Scar Tissue (alternative rock)": "mzJj5-lubeM",
    "John Mayer - Gravity (blues)": "7VBex8zbDRs",
    "Darude - Sandstorm (EDM)": "y6120QOlsfU",
    "Pokemon (soundtrack)": "JuYeHPFR3f0",
    "Daft Punk - Get Lucky (pop)": "5NV6Rdv1a3I",
    "Maroon 5 feat. Christina Aguilera - Moves Like Jagger (pop)": "suRsxpoAc5w"
}

song = "비밀의 화원" #@param ["비밀의 화원", "Marvin Gaye - Whats Happening Brother (soul)", "Red Hot Chili Peppers - Scar Tissue (alternative rock)", "Daft Punk - Get Lucky (pop)", 'Billie Eilish - Bad Guy (pop)','Death - Spirit Crusher (death metal)', "James Brown - Get Up (I Feel Like Being A) Sex Machine (funk)", 'Věra Bílá & Kale - Pas o panori (world music)','Eminem - Lose Yourself (hip-hop)','Sting - Englishman in New York (pop/rock)','R.E.M. - Losing my Religion (alternative rock)', "AURORA – Animal (pop)", 'Dire Straits - Sultans Of Swing (rock)', "John Mayer - Gravity (blues)", "Darude - Sandstorm (EDM)", "Pokemon (soundtrack)", "Maroon 5 feat. Christina Aguilera - Moves Like Jagger (pop)"]
start = 41 #@param {type:"slider", min:0, max:180, step:1}
stop = 73 #@param {type:"slider", min:0, max:180, step:1}

id = ids[song]  # change this for your own song

ydl_opts = {
    'format': 'bestaudio/best', 
    'postprocessors': [{'key': 'FFmpegExtractAudio','preferredcodec': 'wav','preferredquality': '44100',}],
    'outtmpl': 'tmp.wav'
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    status = ydl.download([id])

audio, rate = librosa.load('tmp.wav', sr=None)
os.remove('tmp.wav')

start_pad, stop_pad = max(0, start-4), min(audio.shape[-1]//rate-1, stop+4)  # integer seconds, so the slice indices below stay integers
start_cut, stop_cut = start-start_pad, stop-stop_pad

audio = audio[start_pad*rate:stop_pad*rate].copy()

output.clear()
print(f"{song}")
IPython.display.display(IPython.display.Audio(audio[start_cut*rate:stop_cut*rate].copy(), rate=rate))
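
The padding logic in the cell above keeps up to four extra seconds of context on each side of the requested interval and later cuts them off again with a negative stop index. The following pure-Python sketch reproduces that arithmetic; the helper `pad_and_cut`, the tiny `rate` and the fake `audio` list are made up for the demonstration.

```python
def pad_and_cut(start: int, stop: int, duration: int, context: int = 4):
    """Return (start_pad, stop_pad, start_cut, stop_cut), all in whole seconds.

    start_pad/stop_pad delimit the excerpt passed on for separation;
    start_cut/stop_cut locate the requested interval inside that excerpt.
    """
    start_pad = max(0, start - context)
    stop_pad = min(duration - 1, stop + context)
    start_cut = start - start_pad   # offset from the beginning of the excerpt
    stop_cut = stop - stop_pad      # zero or negative: offset from the end
    return start_pad, stop_pad, start_cut, stop_cut


rate = 8                            # tiny hypothetical rate to keep the demo readable
audio = list(range(200 * rate))     # sample i sits at time i / rate
start_pad, stop_pad, start_cut, stop_cut = pad_and_cut(41, 73, duration=200)

excerpt = audio[start_pad * rate:stop_pad * rate]
requested = excerpt[start_cut * rate:stop_cut * rate]

print(start_pad, stop_pad, start_cut, stop_cut)            # 37 77 4 -4
print(requested[0] // rate, (requested[-1] + 1) // rate)   # 41 73
```

One caveat of this scheme: if the requested stop lies so close to the end of the song that no context can be added, `stop_cut` becomes 0 and the final slice is empty. Neither this sketch nor the cell above handles that case.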


### 5. Separate!

In [None]:
print("separating... ", end='')
estimates = separate_sample(audio, rate)
estimates = {i: e[start_cut*32000:stop_cut*32000,:] for i, e in estimates.items()}  # cut to show only the desired part (mainly to reduce the latency)
print("done")
print("downloading audio files to the client side...")

for instrument in ['vocals', 'drums', 'bass', 'other']:
    if estimates[instrument].max() < 0.25: continue  # hacky way to remove the silent instruments

    print(f"\n{instrument}")
    IPython.display.display(IPython.display.Audio(estimates[instrument].T.copy(), rate=32000))


### 6. Create a backing track for karaoke

In [None]:
mix = estimates["drums"] + estimates["bass"] + estimates["other"]
IPython.display.display(IPython.display.Audio(mix.T.copy(), rate=32000))


### 7. Remix

Let's say we would like to make the awesome bassline more pronounced... Well, why not remix the song to our taste?

Set the volume of different instruments and hit play to compute a new mix.


In [None]:
vocals = 1.2 #@param {type:"slider", min:0, max:2, step:0.1}
drums = 0.3 #@param {type:"slider", min:0, max:2, step:0.1}
bass = 0.3 #@param {type:"slider", min:0, max:2, step:0.1}
other = 0.3 #@param {type:"slider", min:0, max:2, step:0.1}

mix = estimates["vocals"]*vocals + \
      estimates["drums"]*drums + \
      estimates["bass"]*bass + \
      estimates["other"]*other

IPython.display.display(IPython.display.Audio(mix.T.copy(), rate=32000))
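
The karaoke and remix cells are both weighted sums of the same four stems, so they can be expressed with one small helper. This is only a sketch: the function `remix` and the random stand-in stems are hypothetical and not part of the repository.

```python
import numpy as np


def remix(stems: dict, gains: dict) -> np.ndarray:
    """Weighted sum of stems; stems without an explicit gain keep gain 1.0."""
    return sum(stems[name] * gains.get(name, 1.0) for name in stems)


# Hypothetical stand-ins for the separated stems, each with shape [T, 1].
rng = np.random.default_rng(0)
stems = {name: rng.standard_normal((32, 1)) for name in ["vocals", "drums", "bass", "other"]}

karaoke = remix(stems, {"vocals": 0.0})     # mute the vocals, keep everything else
bass_boost = remix(stems, {"bass": 2.0})    # double the level of the bass

print(karaoke.shape, bass_boost.shape)      # (32, 1) (32, 1)
```

With the real `estimates` dictionary from step 5, `remix(estimates, {"vocals": 0.0})` would give the same backing track as the karaoke cell.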


### 8. Load from an uncompressed file

Unfortunately, YouTube videos are compressed, so the separation quality is not as good as it could be. Load an uncompressed file from your computer (click on `Upload` in the left toolbar) to obtain better results:

In [None]:
filename = "scar_tissue.wav" #@param {type:"string"}
start = 40 #@param {type:"slider", min:0, max:180, step:1}
stop = 50 #@param {type:"slider", min:0, max:180, step:1}

audio, rate = soundfile.read(filename)
audio = librosa.core.to_mono(audio.transpose())

print(audio.shape, rate)
start_pad, stop_pad = max(0, start-4), min(audio.shape[-1]//rate-1, stop+4)  # integer seconds, so the slice indices below stay integers
start_cut, stop_cut = start-start_pad, stop-stop_pad

audio = audio[start_pad*rate:stop_pad*rate].copy()
audio = np.expand_dims(audio, 0)

output.clear()
print(f"{filename} mix:")
IPython.display.display(IPython.display.Audio(audio[:, start_cut*rate:stop_cut*rate].copy(), rate=rate))

print()
print("separating... ", end='')
estimates = separate_sample(audio, rate)
print("done")
print("downloading audio files to the client side...")

for instrument in ['vocals', 'drums', 'bass', 'other']:
    separation = estimates[instrument][start_cut*32000:stop_cut*32000,:]  # cut to show only the desired part (mainly to reduce the latency)
    if separation.max() < 0.25: continue  # hacky way to remove the silent instruments

    print(f"\n{instrument}")
    IPython.display.display(IPython.display.Audio(separation.T.copy(), rate=32000))
