# IRS Practical 10
> 19BCE245 - Aayush Shah

- Karaoke Generator

## Open-Unmix PyTorch

![](https://sisec18.unmix.app/static/img/hero_header.4f28952.svg)

__Open-Unmix__ is a deep neural network reference implementation for music source separation, applicable for researchers, audio engineers and artists. This notebook provides easy access to pre-trained models that allow users to separate pop music into four stems: __vocals__, __drums__, __bass__ and the remaining __other__ instruments. The models were trained on the [MUSDB18](https://sigsep.github.io/datasets/musdb.html) dataset.

## The Model

_Open-Unmix_ is based on a three-layer bidirectional deep LSTM. The model learns to predict the magnitude spectrogram of a target, like _vocals_, from the magnitude spectrogram of a mixture input. Internally, the prediction is obtained by applying a mask to the input. The model is optimized in the magnitude domain using mean squared error, and the actual separation is done in a post-processing step involving a differentiable multichannel Wiener filter. To separate multiple sources, a separate model is trained for each target. While this makes training less convenient, it allows great flexibility to customize the training data for each target source.
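
The masking and the mean-squared-error objective described above can be illustrated with a small NumPy sketch. This is not the Open-Unmix network. The toy spectrogram values, the helper names `apply_mask` and `mse`, and the choice to clip the mask to [0, 1] are all assumptions made for illustration, and an "oracle" ratio mask stands in for what a trained model would predict.

```python
import numpy as np

def apply_mask(mixture_mag, mask):
    """Estimate a target magnitude spectrogram by element-wise masking."""
    # Assumption of this sketch: the mask is a ratio in [0, 1].
    mask = np.clip(mask, 0.0, 1.0)
    return mask * mixture_mag

def mse(estimate, reference):
    """Mean squared error between two magnitude spectrograms."""
    return float(np.mean((estimate - reference) ** 2))

# Toy magnitudes: 3 frequency bins x 2 frames (made-up numbers).
mixture_mag = np.array([[1.0, 2.0], [4.0, 0.5], [3.0, 3.0]])
vocals_mag = np.array([[0.5, 2.0], [0.0, 0.5], [1.5, 0.0]])

# Oracle ratio mask; a trained network would predict this from the mixture.
ideal_mask = vocals_mag / np.maximum(mixture_mag, 1e-8)
estimate = apply_mask(mixture_mag, ideal_mask)

print("MSE with oracle mask:", mse(estimate, vocals_mag))
print("MSE with all-pass mask:", mse(mixture_mag, vocals_mag))
```

Training would minimize `mse` over many such pairs. The Wiener-filter post-processing mentioned above is a separate step and is not covered here.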

## How to proceed

We provide four pre-trained models:

* __`umxl` (default)__ is trained on a private dataset of compressed stems.
  [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5069601.svg)](https://doi.org/10.5281/zenodo.5069601)

* __`umxhq`__ is trained on [MUSDB18-HQ](https://sigsep.github.io/datasets/musdb.html#uncompressed-wav), which comprises the same tracks as MUSDB18 but uncompressed, yielding a full bandwidth of 22050 Hz.

  [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3267291.svg)](https://doi.org/10.5281/zenodo.3267291)

* __`umx`__ is trained on the regular [MUSDB18](https://sigsep.github.io/datasets/musdb.html#compressed-stems), which is bandwidth-limited to 16 kHz due to AAC compression. This model should be used for comparison with other (older) methods for evaluation in [SiSEC18](https://sisec18.unmix.app).

  [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3340804.svg)](https://doi.org/10.5281/zenodo.3340804)

* __`umxse`__ speech enhancement model is trained on the 28-speaker version of the [Voicebank+DEMAND corpus](https://datashare.is.ed.ac.uk/handle/10283/1942?show=full).

  [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3786908.svg)](https://doi.org/10.5281/zenodo.3786908)

All models are downloaded automatically.
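
A small lookup table can keep the four model names from the list above in one place and catch typos before a download starts. The sketch below is a hypothetical helper, not part of Open-Unmix: `PRETRAINED_MODELS` and `check_model_name` are names introduced here. The comment about passing the name on assumes that `predict.separate` accepts it through a `model_str_or_path` keyword, which the text above does not confirm.

```python
# Model names and training data as listed above (descriptions paraphrased).
PRETRAINED_MODELS = {
    "umxl": "private dataset of compressed stems (default)",
    "umxhq": "MUSDB18-HQ, uncompressed, full bandwidth",
    "umx": "MUSDB18, bandwidth-limited to 16 kHz by AAC compression",
    "umxse": "speech enhancement, 28-speaker Voicebank+DEMAND",
}

def check_model_name(name):
    """Return the name unchanged if it is one of the four listed models."""
    if name not in PRETRAINED_MODELS:
        known = ", ".join(sorted(PRETRAINED_MODELS))
        raise ValueError(f"unknown model {name!r}; expected one of: {known}")
    return name

model_name = check_model_name("umxhq")
print(model_name, "->", PRETRAINED_MODELS[model_name])

# Assumption: the validated name could then be passed on, e.g.
#   predict.separate(audio, rate=rate, model_str_or_path=model_name)
```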

### Colab Limitations 

* Disk space and RAM are limited in Colab. Loading the four separation models `vocals`, `drums`, `bass` and `other` already uses 400 MB of disk and RAM.
* A major step in the separation is the post-processing, controlled by the parameter `niter`. For faster inference (at the expense of separation quality), it is advised to use `niter=0`.
* Another way to prevent Colab from crashing is to only perform separation on smaller excerpts. In the following examples we provide a way to set the start and stop time of the audio being separated. We suggest __not separating segments longer than 30s__.
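
The 30-second guideline can be enforced before any audio is loaded. The helper below is a minimal sketch. `clamp_excerpt` and its `max_seconds` parameter are hypothetical names introduced here, not part of any library used in this notebook.

```python
def clamp_excerpt(start, stop, max_seconds=30.0):
    """Return (start, duration) in seconds, with duration capped at max_seconds."""
    if start < 0:
        raise ValueError("start must be non-negative")
    if stop <= start:
        raise ValueError("stop must be greater than start")
    duration = min(stop - start, max_seconds)
    return start, duration

# A 60 s request is shortened to the first 30 s of the requested range.
start, duration = clamp_excerpt(60, 120)
print(start, duration)
```

The returned pair lines up with the `start` and `duration` arguments that the later cells pass to `stempeg.read_stems`.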



# Installation and Imports (RUN THESE CELLS FIRST)

In [None]:
!pip install musdb
!pip install youtube-dl
!pip install openunmix
!pip install nlpaug
!pip install pydub
!pip install google-colab

In [None]:
import torch
import torchaudio
import numpy as np
import scipy
import youtube_dl
import stempeg
import os
import librosa
from google.colab import files
from IPython.display import Audio, display
import matplotlib.pyplot as plt
import nlpaug.augmenter.audio as naa
from pydub import AudioSegment 
import subprocess

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Separate MUSDB18 tracks

Get a 7-second MUSDB18 preview track.

In [None]:
import musdb
mus = musdb.DB(download=True, subsets='test')

for i in mus:
    print(i)

track = mus[25]
print(track.name)
display(Audio(track.audio.T, rate=track.rate))

### Apply separation into four stems

Open-Unmix automatically downloads a model for each available target:

* vocals
* drums
* bass
* other

In [None]:
from openunmix import predict
estimates = predict.separate(
    torch.as_tensor(track.audio).float(),
    rate=track.rate,
    device=device
)   
for target, estimate in estimates.items():
    print(target)
    audio = estimate.detach().cpu().numpy()[0]
    display(Audio(audio, rate=track.rate))
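
The `.T` transpose and the `[0]` index used in these cells suggest two array layouts: `(samples, channels)` for the loaded track and `(batch, channels, samples)` for each separated estimate. Both layouts are assumptions inferred from the indexing and are not stated in the text. The NumPy sketch below only illustrates moving between them, and `to_channels_first` is a hypothetical helper.

```python
import numpy as np

def to_channels_first(audio):
    """Convert (samples, channels) audio to (channels, samples)."""
    audio = np.asarray(audio)
    if audio.ndim == 1:
        audio = audio[:, None]  # mono: add a channel axis first
    return audio.T

rate = 44100
# Seven seconds of silent stereo audio standing in for a preview track.
stereo = np.zeros((7 * rate, 2), dtype=np.float32)

channels_first = to_channels_first(stereo)
batched = channels_first[None, ...]  # add a leading batch axis

print(channels_first.shape)
print(batched.shape)
print(batched[0].shape)  # indexing [0] drops the batch axis again
```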

### Apply separation into vocals/accompaniment

Even though Open-Unmix does not provide a separate model for the accompaniment, we can use the spectral `residual` model in the post-processing to force a linear sum of all separated sources. This can be used, for example, for vocal/accompaniment separation. Note that the separation performance decreases when using the residual model.

In [None]:
estimates = predict.separate(
    torch.as_tensor(track.audio).float(),
    rate=track.rate,
    targets=['vocals'], 
    residual=True,
    device=device,
)
for target, estimate in estimates.items():
    print(target)
    display(Audio(estimate.detach().cpu().numpy()[0], rate=track.rate))
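
One way to read the "linear sum" constraint is that the separated stems add back up to the mixture. The sketch below shows that idea with a plain time-domain subtraction on synthetic signals. It is a deliberately simplified stand-in, not the spectral residual model used in the post-processing, and `residual_accompaniment` is a hypothetical helper.

```python
import numpy as np

def residual_accompaniment(mixture, vocals):
    """Accompaniment as whatever remains after subtracting the vocals."""
    mixture = np.asarray(mixture, dtype=np.float64)
    vocals = np.asarray(vocals, dtype=np.float64)
    if mixture.shape != vocals.shape:
        raise ValueError("mixture and vocals must have the same shape")
    return mixture - vocals

# Synthetic stereo signals (random noise, for illustration only).
rng = np.random.default_rng(0)
vocals = rng.standard_normal((2, 1000))
backing = rng.standard_normal((2, 1000))
mixture = vocals + backing

accompaniment = residual_accompaniment(mixture, vocals)

# By construction, the two stems sum back to the mixture.
print(np.allclose(accompaniment + vocals, mixture))
```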

Another way to achieve vocal/accompaniment separation is to separate into four stems and sum up the non-vocal stems.

In [None]:
estimates = predict.separate(
    audio=torch.as_tensor(track.audio).float(), 
    rate=track.rate,
    targets=['vocals', 'drums', 'bass', 'other'], 
    residual=True,
    device=device
)
print('vocals')
display(Audio(estimates['vocals'].detach().cpu().numpy()[0], rate=track.rate))
# Sum every estimate except the vocals to obtain the accompaniment.
acc = np.sum(
    [audio.detach().cpu().numpy()[0] for target, audio in estimates.items() if target != 'vocals'],
    axis=0
)
print('accompaniment')
display(Audio(acc, rate=track.rate))
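
Because the same "sum everything except the vocals" step recurs in later cells, it can be factored into a helper. The sketch below works on a dictionary of NumPy arrays, so tensors would first need converting with `.detach().cpu().numpy()` as in the cell above. `sum_accompaniment` is a hypothetical name, and the toy arrays are made up.

```python
import numpy as np

def sum_accompaniment(estimates, vocal_key="vocals"):
    """Sum all stems except the vocal stem into one accompaniment array."""
    others = [np.asarray(audio) for name, audio in estimates.items() if name != vocal_key]
    if not others:
        raise ValueError("no non-vocal stems to sum")
    return np.sum(others, axis=0)

# Toy stems shaped (channels, samples) with constant values.
toy_estimates = {
    "vocals": np.full((2, 4), 1.0),
    "drums": np.full((2, 4), 0.25),
    "bass": np.full((2, 4), 0.5),
    "other": np.full((2, 4), 0.125),
}

accompaniment = sum_accompaniment(toy_estimates)
print(accompaniment.shape, accompaniment[0, 0])
```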

# Separate YouTube Video

In [None]:
from IPython.display import HTML
url = "W8a4sUabCUo" #@param {type:"string"}
start = 60 #@param {type:"number"}
stop = 90 #@param {type:"number"}
embed_url = "https://www.youtube.com/embed/%s?start=%d&end=%d" %(url,start,stop)
# embed_url = "https://www.youtube.com/embed/%s?rel=0&start=%d&end=%d" % (url, start, stop)
# Quote the src attribute and keep a space before the next attribute.
HTML('<iframe width="550" height="315" src="' + embed_url + '" frameborder="0" allowfullscreen></iframe>')
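
Assembling the embed URL and the iframe markup by string concatenation is easy to get subtly wrong. As a sketch, the same markup can be built with the standard library, letting `urlencode` format the query string and `html.escape` make the URL safe inside an attribute. `build_embed_url` and `build_iframe` are hypothetical helpers introduced here.

```python
from html import escape
from urllib.parse import urlencode

def build_embed_url(video_id, start, stop):
    """Build a YouTube embed URL that plays from start to stop (seconds)."""
    if stop <= start:
        raise ValueError("stop must be greater than start")
    query = urlencode({"start": int(start), "end": int(stop)})
    return f"https://www.youtube.com/embed/{video_id}?{query}"

def build_iframe(video_id, start, stop, width=550, height=315):
    """Return iframe markup with a quoted, HTML-escaped src attribute."""
    src = escape(build_embed_url(video_id, start, stop), quote=True)
    return (
        f'<iframe width="{width}" height="{height}" src="{src}" '
        'frameborder="0" allowfullscreen></iframe>'
    )

print(build_embed_url("W8a4sUabCUo", 60, 90))
print(build_iframe("W8a4sUabCUo", 60, 90))
```

The returned string can be passed to `IPython.display.HTML` in the same way as the hand-built markup.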

In [None]:
import stempeg

def my_hook(d):
    if d['status'] == 'finished':
        print('Done downloading...')


ydl_opts = {
    'format': 'bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'wav',
        'preferredquality': '44100',
    }],
    'outtmpl': '%(title)s.wav',
    'progress_hooks': [my_hook],
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info(url, download=False)
    status = ydl.download([url])

audioYT, samplerate = stempeg.read_stems(
    info.get('title', None) + '.wav', 
    start=start,
    duration=(stop-start),
    sample_rate=44100.0,
    dtype=np.float32
)
display(Audio(audioYT.T, rate=samplerate))
estimates = predict.separate(
    torch.as_tensor(audioYT).float(),
    rate=samplerate,
    device=device
)   
for target, estimate in estimates.items():
    print(target)
    display(Audio(estimate.detach().cpu().numpy()[0], rate=samplerate))

## Apply separation into 2 classes

1. Vocals
2. Accompaniment

In [None]:
audioYT, samplerate = stempeg.read_stems(
    info.get('title', None) + '.wav', 
    start=start,
    duration=(stop-start),
    sample_rate=44100.0,
    dtype=np.float32
)
display(Audio(audioYT.T, rate=samplerate))
estimates = predict.separate(
    torch.as_tensor(audioYT).float(),
    rate=samplerate,
    device=device
)   
print('vocals')
display(Audio(estimates['vocals'].detach().cpu().numpy()[0], rate=samplerate))
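# Sum every estimate except the vocals to obtain the accompaniment.
acc = np.sum(
    [audio.detach().cpu().numpy()[0] for target, audio in estimates.items() if target != 'vocals'],
    axis=0
)
print('accompaniment')
# Use the sample rate of the YouTube audio, not that of the MUSDB18 track.
display(Audio(acc, rate=samplerate))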

Download separations

In [None]:
# target_path = str("target.mp3")

# estimates_numpy = {}
# for target, estimate in estimates.items():
#     estimates_numpy[target] = torch.squeeze(estimate).detach().cpu().numpy().T

# stempeg.write_stems(
#     target_path,
#     estimates_numpy,
#     writer=stempeg.FilesWriter(multiprocess=True, output_sample_rate=44100),
# )

# for target, estimate in estimates.items():
#     files.download(target + '.mp3')
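
If the `stempeg` writer in the commented-out cell above is not an option, each stem can also be saved as a 16-bit WAV file using only NumPy and Python's standard `wave` module. This is an alternative sketch, not the method used in this notebook. `write_wav` is a hypothetical helper, and it assumes float audio shaped `(samples, channels)` with values in [-1, 1].

```python
import os
import tempfile
import wave

import numpy as np

def write_wav(path, audio, rate=44100):
    """Write float audio shaped (samples, channels) as 16-bit PCM WAV."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 1:
        audio = audio[:, None]  # mono: add a channel axis
    # Clip to [-1, 1] and scale to little-endian signed 16-bit integers.
    pcm = (np.clip(audio, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(pcm.shape[1])
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(int(rate))
        wav_file.writeframes(pcm.tobytes())  # row-major bytes are interleaved frames

# One second of a quiet 440 Hz tone, duplicated to stereo.
tone = 0.1 * np.sin(2 * np.pi * 440 * np.arange(44100) / 44100)
stereo = np.stack([tone, tone], axis=1)

demo_path = os.path.join(tempfile.gettempdir(), "vocals_demo.wav")
write_wav(demo_path, stereo, rate=44100)
print("wrote", demo_path)
```

In Colab, the resulting file could then be offered for download with `files.download(demo_path)`.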

# Separate from uploaded file

In [None]:
from google.colab import files
'''
uploaded = files.upload()
'''

In [None]:
'''
start = 0 #@param {type:"number"}
stop =  10#@param {type:"number"}
audio, rate = stempeg.read_stems(
    list(uploaded.keys())[0],
    sample_rate=44100,
    start=start,
    duration=stop-start,
)
display(Audio(audio.T, rate=rate))
estimates = predict.separate(
    audio=torch.as_tensor(audio).float(),
    rate=44100,
    device=device,
)
for target, estimate in estimates.items():
    print(target)
    display(Audio(estimate.detach().cpu().numpy()[0], rate=rate))
'''

## Apply separation into 2 classes

1. Vocals
2. Accompaniment

In [None]:
'''
audio, rate = stempeg.read_stems(
    list(uploaded.keys())[0],
    sample_rate=44100,
    start=start,
    duration=stop-start,
)
display(Audio(audio.T, rate=rate))
estimates = predict.separate(
    audio=torch.as_tensor(audio).float(),
    rate=44100,
    device=device,
)
print('vocals')
# Use the sample rate returned for the uploaded file.
display(Audio(estimates['vocals'].detach().cpu().numpy()[0], rate=rate))
acc = np.sum(
    [audio.detach().cpu().numpy()[0] for target, audio in estimates.items() if target != 'vocals'],
    axis=0
)
print('accompaniment')
display(Audio(acc, rate=rate))
'''