(intro-carnatic-separation)=
# Music source separation for Carnatic Music

As seen in the [music segmentation example](music-segmentation), vocal separation for Carnatic and Hindustani music remains an unsolved (and largely unexplored!) problem. The [singing voice separation](singing-voice-extraction) walkthrough presents a model developed specifically to perform vocal separation for Carnatic Music. Let's first introduce the task, its challenges, and current solutions.

## Music source separation
Music source separation (MSS) aims to automatically estimate the individual elements of a music mixture. MSS systems, which nowadays are mostly based on deep learning (DL) architectures, operate on the waveform, in the time-frequency domain, or on a combination of both. Since this is a core problem in music information research, much effort has gone into building open, high-performance models, and several pre-trained systems are freely available to use out of the box.
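
To make the frequency-domain idea concrete, here is a minimal sketch of masking on a toy signal. It is a simplification under stated assumptions: the two sine tones are hypothetical stand-ins for sources, a single FFT frame replaces the short-time Fourier transform used in practice, and the mask is computed from the known sources (an "oracle" mask), whereas a trained model has to predict it from the mixture alone.

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr

# Hypothetical toy sources: two sine tones standing in for real stems
voice = np.sin(2 * np.pi * 440.0 * t)
drone = 0.5 * np.sin(2 * np.pi * 100.0 * t)
mixture = voice + drone

# Move to the frequency domain (one frame only; real systems use an STFT)
spec_mix = np.fft.rfft(mixture)
mag_voice = np.abs(np.fft.rfft(voice))
mag_drone = np.abs(np.fft.rfft(drone))

# Ratio mask: share of each frequency bin's magnitude that belongs to the voice
mask = mag_voice / (mag_voice + mag_drone + 1e-8)

# Apply the mask to the mixture spectrum and return to the waveform
voice_estimate = np.fft.irfft(mask * spec_mix, n=len(mixture))
error = float(np.max(np.abs(voice_estimate - voice)))
print(f"max reconstruction error: {error:.2e}")
```

Because the two tones do not overlap in frequency, the mask recovers the voice almost perfectly. Real sources overlap heavily, which is what makes the task hard.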

MSS systems are normally trained with the mixture as input and the target sources as expected output, so that the model learns to reproduce this mapping. Only a few datasets in the literature can be used for that purpose: MUSDB18-HQ, MoisesDB, and MedleyDB. However, these datasets mostly contain pop and rock recordings and, as is common in DL, a model trained on data from one domain tends to generalize poorly to out-of-domain use cases.
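
The training setup described above can be sketched in a few lines. This is only an illustration: the stems are hypothetical sine tones, and the L1 loss is an assumption chosen for simplicity, since actual systems use a variety of objectives.

```python
import numpy as np

# Hypothetical toy stems: one second of "audio" per source at 8 kHz
sr = 8000
t = np.arange(sr) / sr
stems = {
    "vocals": 0.5 * np.sin(2 * np.pi * 220.0 * t),
    "accompaniment": 0.3 * np.sin(2 * np.pi * 110.0 * t),
}

# The model input is the mixture, i.e. the sum of the target sources
mixture = sum(stems.values())


def l1_loss(estimate, target):
    """Mean absolute error between an estimated and a reference stem."""
    return float(np.mean(np.abs(estimate - target)))


# A trivial "model" that returns the mixture unchanged is penalized,
# while a perfect vocal estimate yields zero loss
loss_mixture = l1_loss(mixture, stems["vocals"])
loss_perfect = l1_loss(stems["vocals"], stems["vocals"])
print(loss_mixture, loss_perfect)
```

During training, the model parameters are updated to reduce such a loss over many (mixture, stems) pairs, which is why the style of the training recordings matters so much.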

Carnatic Music is one such case. Not only do the models available in the literature have no knowledge of this repertoire, but MSS also normally targets the source setup _vocals_, _bass_, _drums_, and _other_, which does not match the actual arrangement and nature of Carnatic Music.

Some well-known models for MSS are Spleeter {cite}`spleeter` by Deezer, Meta's Demucs {cite}`demucs`, and their related extensions and evolutions. 

## Spleeter (by Deezer)

One of the main models in the literature is Spleeter {cite}`spleeter`, which is widely used in computational musicology work on the Carnatic repertoire.

In [None]:
%pip install spleeter
%pip install numba --upgrade
import spleeter

We download the pre-trained 2-stem Spleeter model (release v1.4.0) from the official repository.

In [None]:
!wget https://github.com/deezer/spleeter/releases/download/v1.4.0/2stems.tar.gz

We unzip it!

In [None]:
import os
import tarfile

# Create the directory where spleeter looks for models by default
# (exist_ok avoids an error if the cell is run more than once)
model_dir = os.path.join("pretrained_models", "2stems")
os.makedirs(model_dir, exist_ok=True)

# Open the archive, extract its files, and close it automatically
with tarfile.open("2stems.tar.gz") as archive:
    archive.extractall(model_dir)
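
Before extracting an archive it can be useful to check what it contains. The following is a minimal, standard-library sketch. It builds a tiny in-memory archive as a stand-in for the downloaded file, and the member name `model.index` is a hypothetical placeholder, not the actual content of `2stems.tar.gz`.

```python
import io
import tarfile

# Build a tiny in-memory archive standing in for a downloaded model tarball
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    payload = b"placeholder weights"
    info = tarfile.TarInfo(name="model.index")  # hypothetical file name
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

# Reopen it for reading and list the members before extracting anything
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r:gz") as archive:
    member_names = archive.getnames()

print(member_names)
```

With a file on disk, the same listing is obtained by passing the file name to `tarfile.open` instead of `fileobj`.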

`spleeter` is based on `TensorFlow`. We disable the GPU usage and the `TensorFlow` related warnings just like we did in the [pitch extraction walkthrough](melody-extraction).

In [None]:
# Disabling tensorflow warnings and debugging info
import os 
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3" 

# Importing tensorflow and disabling GPU usage
import tensorflow as tf
tf.config.set_visible_devices([], "GPU")

We may now load the `spleeter` separator, which will automatically load the pre-trained weights for the model. We will use the ``2stems`` model, which has been trained to separate *vocals* and *accompaniment*.

```{note}
Another option, ``4stems``, separates *vocals*, *bass*, *drums*, and *other*, a setup that does not properly apply to Carnatic Music.
```

In [None]:
from spleeter.separator import Separator

# Load default 2-stem spleeter separation
separator = Separator("spleeter:2stems")

# Separating file
separator.separate_to_file(
    os.path.join(
        "..", "audio", "59c88c32-0bde-433b-b194-0f65281e5714.mp3"
    ),
    os.path.join("..", "audio")
)

Let's listen to the separated vocals using `IPython.display`.

In [None]:
import IPython.display as ipd
import librosa

vocals, sr = librosa.load(
    os.path.join(
        "..", "audio", "59c88c32-0bde-433b-b194-0f65281e5714", "vocals.wav"
    ),
)

In [None]:
ipd.Audio(
    data=vocals[-sr*30:],  # Taking only the last 30 seconds
    rate=sr,
)
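
Slicing "the last N seconds" depends on the sample rate the audio was loaded with (`librosa.load` resamples to 22050 Hz by default). A small helper makes that explicit. This is a minimal sketch: `last_seconds` is a hypothetical function, not part of `librosa` or `compiam`, and it assumes time runs along the last axis.

```python
import numpy as np


def last_seconds(audio, sr, seconds):
    """Return the final `seconds` of `audio`, with time on the last axis."""
    n_samples = int(sr * seconds)
    if n_samples <= 0:
        # Nothing requested: return an empty excerpt with the same layout
        return audio[..., :0]
    # If the audio is shorter than requested, the whole signal is returned
    return audio[..., -n_samples:]


sr = 22050
mono = np.zeros(sr * 40)          # 40 s of mono audio
stereo = np.zeros((2, sr * 40))   # 40 s of stereo audio, channels first
short = np.zeros(sr * 10)         # only 10 s available

print(last_seconds(mono, sr, 30).shape)
print(last_seconds(stereo, sr, 30).shape)
print(last_seconds(short, sr, 30).shape)
```

Deriving the number of samples from `sr` rather than a hard-coded rate keeps the excerpt length correct when the audio is loaded at a different sample rate.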

## Leakage-aware source separation model

In this section we walk through a tool that has been trained using the Saraga dataset {cite}`saraga`. Given the live-performance nature of Carnatic Music, it is difficult, in fact currently impossible, to find fully isolated multi-stem recordings to train or fine-tune existing separation approaches. Saraga includes multi-stem recordings, but these contain source bleeding in the background because they were recorded during live performances. The approach presented here has been designed with the bleeding problem in mind.

This model is able to separate clean singing voices even though it has been trained solely on data with bleeding in the multi-track stems. Let's test how it works on a real example. Since the model is DL-based, we first need to install `tensorflow`.

In [None]:
## Installing compiam (if not already installed) and importing it
import importlib.util
if importlib.util.find_spec('compiam') is None:
    ## Bear in mind this will only run in a Jupyter notebook / Colab session
    %pip install compiam
import compiam

# Import extras and suppress warnings to keep the tutorial clean
import os
import numpy as np
from pprint import pprint

import warnings
warnings.filterwarnings('ignore')

# Installing tensorflow (in case it is not installed) and importing it
%pip install tensorflow
%pip install tensorflow_addons
import tensorflow as tf
import tensorflow_addons as tfa

In [None]:
# Importing and initializing the separation model
import soundfile as sf
from compiam import load_model
separation_model = load_model('separation:cold-diff-sep')

# We load audio from the same example
# (here, the vocals.wav file written by Spleeter above)
audio_path = os.path.join(
    "..", "audio", "59c88c32-0bde-433b-b194-0f65281e5714", "vocals.wav")
input_mixture, sr = sf.read(audio_path)

# soundfile returns (samples, channels): move channels to the first axis
input_mixture = input_mixture.T

# Standardize the signal to zero mean and (approximately) unit variance
mean = np.mean(input_mixture, keepdims=True)
std = np.std(input_mixture, keepdims=True)
input_mixture = (input_mixture - mean) / (1e-6 + std)
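
The standardization step above can be checked in isolation. The following is a minimal sketch on synthetic data: `standardize` is a hypothetical helper that mirrors the three lines above, and the random signal is only a stand-in for real audio.

```python
import numpy as np


def standardize(audio, eps=1e-6):
    """Shift to zero mean and scale to (approximately) unit variance."""
    mean = np.mean(audio, keepdims=True)
    std = np.std(audio, keepdims=True)
    # eps keeps the division finite when the input is silent (std == 0)
    return (audio - mean) / (eps + std)


rng = np.random.default_rng(0)

# Synthetic stereo "audio" with a small DC offset, shaped (channels, samples)
audio = 0.1 * rng.standard_normal((2, 44100)) + 0.05
normalized = standardize(audio)
silent = standardize(np.zeros((2, 10)))

print(float(normalized.mean()), float(normalized.std()))
```

Note that the statistics are computed over the whole array, so both channels share a single mean and standard deviation.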

In [None]:
### Getting the last 30 seconds and separating
input_mixture = input_mixture[:, -sr*30:]
separation = separation_model.separate(
    input_data=input_mixture,
    input_sr=sr,
    clusters=6,
    scheduler=5,
)

In [None]:
import IPython.display as ipd

# And we play it!
ipd.Audio(
    data=separation,
    rate=separation_model.sample_rate,
)

Although perceptible artifacts can be heard in the vocals, the separation is surprisingly clean, which will hopefully help musicians and musicologists extract relevant information from it. In addition, less pitched noise is present in the signal, so melodic feature extraction systems may work better on these data than on a complete mixture or on a singing voice with source bleeding in the background.