# Open-Unmix PyTorch

![](https://sisec18.unmix.app/static/img/hero_header.4f28952.svg)

__Open-Unmix__ is a deep neural network reference implementation for music source separation, applicable for researchers, audio engineers and artists. This notebook provides easy access to pre-trained models that allow users to separate pop music into four stems: __vocals__, __drums__, __bass__ and the remaining __other__ instruments. The models were trained on the [MUSDB18](https://sigsep.github.io/datasets/musdb.html) dataset.

## The Model

_Open-Unmix_ is based on a three-layer bidirectional deep LSTM. The model learns to predict the magnitude spectrogram of a target, such as _vocals_, from the magnitude spectrogram of a mixture input. Internally, the prediction is obtained by applying a mask to the input. The model is optimized in the magnitude domain using mean squared error, and the actual separation is done in a post-processing step involving a differentiable multichannel Wiener filter. To separate multiple sources, a separate model is trained for each target. While this makes training less convenient, it allows great flexibility to customize the training data for each target source.
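
The masking and the magnitude-domain loss described above can be illustrated on toy data. This is a minimal NumPy sketch, not the Open-Unmix implementation: the array shapes, the helper names `apply_mask` and `mse`, and the constant masks are assumptions chosen for illustration, and the LSTM that would normally predict the mask is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical magnitude spectrograms with shape (frequency_bins, frames)
mixture_mag = rng.random((5, 4))
# Pretend the target holds 30% of the mixture magnitude in every bin
target_mag = mixture_mag * 0.3


def apply_mask(mask, mixture):
    """Estimate the target magnitude by scaling each mixture bin with a mask."""
    return mask * mixture


def mse(estimate, reference):
    """Mean squared error between two magnitude spectrograms."""
    return float(np.mean((estimate - reference) ** 2))


# A mask that matches the toy target exactly, and one that passes everything
perfect_mask = np.full_like(mixture_mag, 0.3)
all_pass_mask = np.ones_like(mixture_mag)

loss_perfect = mse(apply_mask(perfect_mask, mixture_mag), target_mag)
loss_all_pass = mse(apply_mask(all_pass_mask, mixture_mag), target_mag)
print(loss_perfect, loss_all_pass)
```

Training would push the predicted mask from something like `all_pass_mask` towards `perfect_mask` by minimizing this loss.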

## How to run this notebook

We provide four pre-trained models:

* __`umxl` (default)__ trained on a private dataset of compressed stems. __Note that the weights are licensed for non-commercial use only (CC BY-NC-SA 4.0).__

  [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.5069601.svg)](https://doi.org/10.5281/zenodo.5069601)

* __`umxhq`__ trained on [MUSDB18-HQ](https://sigsep.github.io/datasets/musdb.html#uncompressed-wav), which comprises the same tracks as MUSDB18 but uncompressed, yielding a full bandwidth of 22050 Hz.

  [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3267291.svg)](https://doi.org/10.5281/zenodo.3267291)

* __`umx`__ trained on the regular [MUSDB18](https://sigsep.github.io/datasets/musdb.html#compressed-stems), which is bandwidth-limited to 16 kHz due to AAC compression. This model should be used for comparison with other (older) methods evaluated in [SiSEC18](https://sisec18.unmix.app).

  [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3340804.svg)](https://doi.org/10.5281/zenodo.3340804)

* __`umxse`__ speech enhancement model is trained on the 28-speaker version of the [Voicebank+DEMAND corpus](https://datashare.is.ed.ac.uk/handle/10283/1942?show=full).

  [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3786908.svg)](https://doi.org/10.5281/zenodo.3786908)

All models are downloaded automatically.
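
The bandwidth figures quoted for `umxhq` and `umx` follow from the sampling theorem. The sketch below works through the arithmetic, assuming a 44.1 kHz sample rate for the MUSDB18 audio; the helper names are hypothetical.

```python
def nyquist_bandwidth_hz(sample_rate_hz):
    """Highest frequency a signal sampled at sample_rate_hz can represent."""
    return sample_rate_hz / 2


def fraction_of_band_kept(cutoff_hz, sample_rate_hz):
    """Share of the representable band that lies below cutoff_hz."""
    return cutoff_hz / nyquist_bandwidth_hz(sample_rate_hz)


sample_rate_hz = 44100  # assumed sample rate of the MUSDB18 audio
full_bandwidth = nyquist_bandwidth_hz(sample_rate_hz)

# Share of the band that survives the 16 kHz limit of the AAC-compressed stems
kept = fraction_of_band_kept(16000, sample_rate_hz)
print(f"full bandwidth: {full_bandwidth:.0f} Hz, kept after AAC: {kept:.1%}")
```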

### Colab Limitations

* Disk space and RAM are limited in Colab. Loading the four separation models `vocals`, `drums`, `bass` and `other` already uses 400 MB of disk and RAM.
* A major step in the separation is the post-processing, controlled by the parameter `niter`. For faster inference (at the expense of separation quality), it is advised to use `niter=0`.
* Another way to prevent Colab from crashing is to separate only short excerpts. In the following examples we provide a way to set the start and stop time of the audio being separated. We suggest __not separating segments longer than 30 s__.
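
The excerpt limit from the list above can be enforced before any audio reaches the model. The following is a minimal sketch, assuming audio stored as a `(channels, samples)` array; `crop_excerpt` and its `max_seconds` argument are hypothetical helpers, not part of Open-Unmix.

```python
import numpy as np


def crop_excerpt(audio, rate, start=0.0, stop=30.0, max_seconds=30.0):
    """Return the samples between start and stop (seconds), capped at max_seconds."""
    if stop <= start:
        raise ValueError("stop must be greater than start")
    # Never keep more than max_seconds, even if a longer span was requested
    stop = min(stop, start + max_seconds)
    first = int(round(start * rate))
    last = int(round(stop * rate))
    return audio[..., first:last]


rate = 44100
song = np.zeros((2, rate * 120), dtype=np.float32)  # two minutes of silent stereo audio

# A 60-second request is shortened to the first 30 seconds after `start`
excerpt = crop_excerpt(song, rate, start=60, stop=120)
print(excerpt.shape)
```

The cropped excerpt could then be passed to the separation call together with `niter=0`, assuming the call accepts that keyword as the list above suggests.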



# Installation and Imports (RUN THESE CELLS FIRST)

In [None]:
!pip install musdb -q
!pip install youtube-dl -q
!pip install openunmix -q
!pip install stempeg -q


In [None]:
import torch
import torchaudio
import numpy as np
import scipy
import youtube_dl
import stempeg
import os
from google.colab import files
from IPython.display import Audio, display

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# Separate MUSIC tracks

Get a 7-second MUSDB18 preview track

In [None]:
!pip install pydub -q

In [None]:
# import musdb
# mus = musdb.DB(download=True, subsets='test')

# track = mus[49]
# track.info

In [None]:
import os
import subprocess
from openunmix import predict

# Clone the repository
repo_url = 'https://github.com/OlamideShogbamu/audiostemming.git'
repo_dir = 'audiostemming'

if not os.path.exists(repo_dir):
    subprocess.run(['git', 'clone', repo_url])

# Directory containing the songs
songs_dir = os.path.join(repo_dir, 'inputdata')

# List all songs in the directory
songs = [f for f in os.listdir(songs_dir) if f.endswith('.mp3')]

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# The pre-trained models run at 44.1 kHz. predict.separate resamples the input
# to that rate, so the returned estimates are at 44.1 kHz as well.
model_rate = 44100

output_root = '/content/drive/MyDrive/audio_stemming/OpenUnMix/outputdata'

# Function to separate one audio file and save each stem as a WAV file
def process_audio(audio_path):
    print(f"Processing {audio_path}")
    waveform, sample_rate = torchaudio.load(audio_path)  # shape: (channels, samples)
    #display(Audio(waveform.numpy(), rate=sample_rate))

    estimates = predict.separate(
        waveform.float(),  # already a tensor, so no torch.tensor() copy is needed
        rate=sample_rate,
        device=device
    )

    # One output folder per song, named after the file without its extension
    song_name = os.path.splitext(os.path.basename(audio_path))[0]
    output_directory = os.path.join(output_root, song_name)
    os.makedirs(output_directory, exist_ok=True)

    for target, estimate in estimates.items():
        print(target)
        audio = estimate.detach().cpu()[0]  # drop the batch dimension
        #display(Audio(audio.numpy(), rate=model_rate))

        output_file = os.path.join(output_directory, f"{target}.wav")
        torchaudio.save(output_file, audio, model_rate)
        print(f"Saved {target} to {output_file}")

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from tqdm import tqdm
from pydub import AudioSegment
from IPython.display import display, Audio

input_directory = "/content/audiostemming/inputdata/"
output_directory = "/content/drive/MyDrive/audio_stemming/OpenUnMix/outputdata"

# Create the output directory if it doesn't exist
os.makedirs(output_directory, exist_ok=True)

# Load the audio file
for song in tqdm(songs, desc="Processing songs", unit="file", leave=False, disable=True):
    input_file = os.path.join(input_directory, song)
    process_audio(input_file)

Processing /content/audiostemming/inputdata/Beautiful-Nubia-Beriwon2.mp3


Downloading: "https://zenodo.org/records/5069601/files/vocals-bccbd9aa.pth" to /root/.cache/torch/hub/checkpoints/vocals-bccbd9aa.pth
100%|██████████| 108M/108M [00:07<00:00, 14.3MB/s]
Downloading: "https://zenodo.org/records/5069601/files/drums-69e0ebd4.pth" to /root/.cache/torch/hub/checkpoints/drums-69e0ebd4.pth
100%|██████████| 108M/108M [00:08<00:00, 13.8MB/s]
Downloading: "https://zenodo.org/records/5069601/files/bass-2ca1ce51.pth" to /root/.cache/torch/hub/checkpoints/bass-2ca1ce51.pth
100%|██████████| 108M/108M [00:08<00:00, 13.5MB/s]
Downloading: "https://zenodo.org/records/5069601/files/other-c8c5b3e6.pth" to /root/.cache/torch/hub/checkpoints/other-c8c5b3e6.pth
100%|██████████| 108M/108M [00:11<00:00, 9.43MB/s]


vocals
Saved vocals to /content/drive/MyDrive/audio_stemming/OpenUnMix/outputdata/Beautiful-Nubia-Beriwon2/vocals.wav
drums
Saved drums to /content/drive/MyDrive/audio_stemming/OpenUnMix/outputdata/Beautiful-Nubia-Beriwon2/drums.wav
bass
Saved bass to /content/drive/MyDrive/audio_stemming/OpenUnMix/outputdata/Beautiful-Nubia-Beriwon2/bass.wav
other
Saved other to /content/drive/MyDrive/audio_stemming/OpenUnMix/outputdata/Beautiful-Nubia-Beriwon2/other.wav
Processing /content/audiostemming/inputdata/Beautiful-Nubia-MaBaWonSo.mp3
vocals
Saved vocals to /content/drive/MyDrive/audio_stemming/OpenUnMix/outputdata/Beautiful-Nubia-MaBaWonSo/vocals.wav
drums
Saved drums to /content/drive/MyDrive/audio_stemming/OpenUnMix/outputdata/Beautiful-Nubia-MaBaWonSo/drums.wav
bass
Saved bass to /content/drive/MyDrive/audio_stemming/OpenUnMix/outputdata/Beautiful-Nubia-MaBaWonSo/bass.wav
other
Saved other to /content/drive/MyDrive/audio_stemming/OpenUnMix/outputdata/Beautiful-Nubia-MaBaWonSo/other.wav
Pr

In [None]:
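import os
import shutil

# Directory holding the separated stems written by process_audio
output_dir = "/content/drive/MyDrive/audio_stemming/OpenUnMix/outputdata"
zip_filename = "/content/drive/MyDrive/audio_stemming/WUN/output.zip"

# make_archive appends ".zip" itself, so pass the archive name without the extension
os.makedirs(os.path.dirname(zip_filename), exist_ok=True)
shutil.make_archive(os.path.splitext(zip_filename)[0], 'zip', output_dir)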
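
How `shutil.make_archive` treats its arguments is easy to get wrong: the first argument is the archive name without extension, and the third is the directory whose contents are archived. The sketch below demonstrates this in a temporary directory; the folder and file names are placeholders, not the notebook's real outputs.

```python
import os
import shutil
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    # Build a small stand-in for the output folder: outputdata/song/<target>.wav
    stems_dir = os.path.join(workdir, "outputdata", "song")
    os.makedirs(stems_dir)
    for target in ("vocals", "drums"):
        with open(os.path.join(stems_dir, f"{target}.wav"), "wb") as handle:
            handle.write(b"placeholder")

    # base_name has no extension; make_archive appends ".zip" and returns the full path
    archive_base = os.path.join(workdir, "output")
    archive_path = shutil.make_archive(
        archive_base, "zip", os.path.join(workdir, "outputdata")
    )
    archive_name = os.path.basename(archive_path)

    # Paths inside the archive are relative to the archived directory
    with zipfile.ZipFile(archive_path) as archive:
        archived_files = sorted(
            name for name in archive.namelist() if not name.endswith("/")
        )

print(archive_name, archived_files)
```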

### Apply separation into four stems

Open-Unmix automatically downloads a model for each available target:

* vocals
* drums
* bass
* other

In [None]:

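# Load the first cloned song and keep a 30-second excerpt to stay within Colab's memory limits
waveform, sample_rate = torchaudio.load(os.path.join(songs_dir, songs[0]))
waveform = waveform[:, :30 * sample_rate]
# Resample to 44.1 kHz, the rate of the pre-trained models, so playback matches the estimates
waveform = torchaudio.functional.resample(waveform, sample_rate, 44100)
sample_rate = 44100

estimates = predict.separate(
    waveform.float(),
    rate=sample_rate,
    device=device
)
for target, estimate in estimates.items():
    print(target)
    audio = estimate.detach().cpu().numpy()[0]
    display(Audio(audio, rate=sample_rate))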

### Apply separation into vocals/accompaniment

Even though Open-Unmix does not provide a separate model for the accompaniment, we can use the spectral `residual` model in the post-processing to force a linear sum of all separated sources; this can be used, for example, for vocal/accompaniment separation. Note that separation performance decreases when using the residual model.

In [None]:
estimates = predict.separate(
    torch.as_tensor(waveform).float(),
    rate=sample_rate,
    targets=['vocals'],
    residual=True,
    device=device,
)
# Play back each estimate (the vocals and the residual)
for target, estimate in estimates.items():
    print(target)
    display(Audio(estimate.detach().cpu().numpy()[0], rate=sample_rate))

Another way to achieve vocal/accompaniment separation is to separate into four stems and sum up the non-vocal stems.

In [None]:
estimates = predict.separate(
    audio=torch.as_tensor(waveform).float(),
    rate=sample_rate,
    targets=['vocals', 'drums', 'bass', 'other'],
    residual=True,
    device=device
)
print('vocals')
display(Audio(estimates['vocals'].detach().cpu().numpy()[0], rate=sample_rate))
# Sum every non-vocal estimate (including the residual) into the accompaniment
acc = np.sum(
    [audio.detach().cpu().numpy()[0] for target, audio in estimates.items() if target != 'vocals'],
    axis=0
)
print('accompaniment')
display(Audio(acc, rate=sample_rate))
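
The summation in the cell above can be checked on toy data. This is a minimal NumPy sketch in which random arrays stand in for the separated stems; it assumes the stems add up exactly to the mixture, which real estimates only do approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
targets = ["vocals", "drums", "bass", "other"]

# Hypothetical stems with shape (channels, samples)
stems = {name: rng.standard_normal((2, 8)) for name in targets}
mixture = np.sum(list(stems.values()), axis=0)

# Accompaniment is everything except the vocals
accompaniment = np.sum(
    [audio for name, audio in stems.items() if name != "vocals"],
    axis=0,
)

# Vocals plus accompaniment gives back the mixture
reconstructed = stems["vocals"] + accompaniment
print(np.allclose(reconstructed, mixture))
```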

# Separate Youtube Video

In [None]:
from IPython.display import HTML
url = "xwtdhWltSIg" #@param {type:"string"}
start = 60 #@param {type:"number"}
stop = 90 #@param {type:"number"}
embed_url = "https://www.youtube.com/embed/%s?rel=0&start=%d&end=%d&amp;controls=0&amp;showinfo=0" % (url, start, stop)
HTML('<iframe width="560" height="315" src="' + embed_url + '" frameborder="0" allowfullscreen></iframe>')

In [None]:
import stempeg

def my_hook(d):
    if d['status'] == 'finished':
        print('Done downloading...')


ydl_opts = {
    'format': 'bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'wav',
        'preferredquality': '44100',
    }],
    'outtmpl': '%(title)s.wav',
    'progress_hooks': [my_hook],
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    info = ydl.extract_info(url, download=False)
    status = ydl.download([url])

audio, samplerate = stempeg.read_stems(
    info.get('title', None) + '.wav',
    start=start,
    duration=(stop-start),
    sample_rate=44100.0,
    dtype=np.float32
)
display(Audio(audio.T, rate=samplerate))
estimates = predict.separate(
    torch.as_tensor(audio).float(),
    rate=samplerate,
    device=device
)
for target, estimate in estimates.items():
    print(target)
    display(Audio(estimate.detach().cpu().numpy()[0], rate=samplerate))

Download separations

In [None]:
target_path = str("target.mp3")

estimates_numpy = {}
for target, estimate in estimates.items():
    estimates_numpy[target] = torch.squeeze(estimate).detach().cpu().numpy().T

stempeg.write_stems(
    target_path,
    estimates_numpy,
    sample_rate=44100,  # rate the audio was read at before separation
    writer=stempeg.FilesWriter(multiprocess=True, output_sample_rate=44100),
)

for target, estimate in estimates.items():
    files.download(target + '.mp3')
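
The export cell above converts each estimate from a `(batch, channels, samples)` tensor to a `(samples, channels)` array before writing. The following NumPy sketch shows that layout change on a dummy array; the helper name `to_samples_by_channels` is hypothetical, and the sketch squeezes only the batch axis so that a mono estimate would keep its channel axis.

```python
import numpy as np


def to_samples_by_channels(estimate):
    """Drop the batch axis and swap (channels, samples) to (samples, channels)."""
    return np.squeeze(estimate, axis=0).T


# Dummy estimate for one stereo excerpt of 1000 samples
estimate = np.zeros((1, 2, 1000), dtype=np.float32)
stem = to_samples_by_channels(estimate)
print(stem.shape)
```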

# Separate from uploaded file

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
from openunmix import predict

start = 0 #@param {type:"number"}
stop = 120 #@param {type:"number"}
audio, rate = stempeg.read_stems(
    list(uploaded.keys())[0],
    sample_rate=44100,
    start=start,
    duration=stop-start,
)
display(Audio(audio.T, rate=rate))
estimates = predict.separate(
    audio=torch.as_tensor(audio).float(),
    rate=44100,
    device=device,
)
for target, estimate in estimates.items():
    print(target)
    display(Audio(estimate.detach().cpu().numpy()[0], rate=rate))


# Export estimates

After separation, you can save the results as wav files or STEMs.

## Download Separations to disk

In [None]:
!sudo apt-get install -y gpac

## Encode to STEMS format

In [None]:
import stempeg
estimates_numpy = {}
for target, estimate in estimates.items():
    estimates_numpy[target] = torch.squeeze(estimate).detach().cpu().numpy().T

estimates_numpy['mixture'] = audio
stempeg.write_stems(
    "umx.stem.m4a",
    estimates_numpy,
    sample_rate=44100,
    writer=stempeg.NIStemsWriter(),
)
files.download("umx.stem.m4a")