# Implementation of "Monoaural Audio Source Separation Using Deep Convolutional Neural Networks"

**Author**: Davide Facchinelli
 
**Reference paper**: Chandna, P., Miron, M., Janer, J., and Gómez, E. (2017). Monoaural Audio Source Separation Using Deep Convolutional Neural Networks. 13th International Conference on Latent Variable Analysis and Signal Separation (LVA ICA2017).


**Other papers**:
* Chandna, P. (2016). Audio Source Separation Using Deep Neural Networks, Master Thesis, Universitat Pompeu Fabra.
* Vincent, E., Gribonval, R., and Fevotte, C. (2006). Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech and Language Processing, 14(4):1462-1469.

## Part 1: Report

### The network

The aim of the original work of Chandna et al. is to develop a deep convolutional neural network that separates a music track into its components: the drums, the voice, the bass and all other sounds (instrumental or synthetic).

The original article also considers a fifth track, given by the combination of all the non-vocal source tracks. We choose to ignore it here as non-essential: it is an aid to predicting the other four tracks, not an objective of the problem. Excluding it reduces the computational cost.

Given a song, they compute its short-time Fourier transform (STFT) with 75% overlap, using a Hanning window of 1024 samples and a step size of 256 samples. Its magnitude and phase are given as input to the network. The network is divided into two stages: an encoding part and a decoding part. The encoding part is made of two convolutional layers and one dense layer. The network is then split into as many branches as there are output tracks, with an independent decoding part for each branch. The decoding part mirrors the encoding part: it is composed of a dense layer and two deconvolutions.

![](OrNet.png)

<center>Image taken from the reference paper.</center>
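As a rough check of the STFT settings quoted above, the sketch below derives the overlap, the number of frequency bins and the number of frames from the window and step sizes. The helper names are hypothetical, no padding of the signal is assumed, and the segment length of `256 * 28` samples is the one used in the code of Part 2 to obtain 25 frames per input element.

```python
# STFT settings quoted in the text above.
FRAME_LENGTH = 1024   # window size in samples
FRAME_STEP = 256      # step (hop) size in samples


def overlap_ratio(frame_length, frame_step):
    """Fraction of each window shared with the next one."""
    return 1 - frame_step / frame_length


def frequency_bins(fft_length):
    """Number of non-redundant bins of a real-input FFT."""
    return fft_length // 2 + 1


def frame_count(n_samples, frame_length, frame_step):
    """Number of full windows that fit in the signal without padding."""
    if n_samples < frame_length:
        return 0
    return 1 + (n_samples - frame_length) // frame_step


print(overlap_ratio(FRAME_LENGTH, FRAME_STEP))          # 0.75
print(frequency_bins(FRAME_LENGTH))                     # 513
print(frame_count(256 * 28, FRAME_LENGTH, FRAME_STEP))  # 25
```

These values match the `(25, 513)` time-frequency shape that appears in the model summary further down.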

Neither the convolutions nor the deconvolutions use an activation function, whereas both dense layers use a ReLU activation. The absolute value of the network output is then taken, which also acts as a non-linear component. Finally, the output of the network is used to mask the original magnitude. The inverse short-time Fourier transform of the masked magnitude, combined with the original phase, gives the desired output track.

The first convolutional layer is designed to capture the timbre features of the song, whereas the second convolutional layer models the time-frequency characteristics of the different instruments. Finally, the dense layer achieves some dimensionality reduction and adds non-linearity to the model.
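The masking step described above can be sketched with NumPy as follows. This is only an illustration on random data, not the network itself: the function names and the small `eps` guard against division by zero are assumptions, and NumPy stands in for the TensorFlow operations used in Part 2.

```python
import numpy as np


def soft_masks(raw_outputs, eps=1e-12):
    """Turn per-source outputs into masks that sum to one over the sources.

    raw_outputs has shape (n_sources, frames, bins). eps is an assumption
    added here to avoid dividing by zero; the text does not mention it.
    """
    magnitudes = np.abs(raw_outputs)
    return magnitudes / (magnitudes.sum(axis=0, keepdims=True) + eps)


def apply_masks(masks, mix_magnitude, mix_phase):
    """Mask the mixture magnitude and reattach the mixture phase."""
    return masks * mix_magnitude * np.exp(1j * mix_phase)


rng = np.random.default_rng(0)
raw = rng.normal(size=(4, 25, 513))   # stand-in for the four decoder outputs
mix = rng.normal(size=(25, 513)) + 1j * rng.normal(size=(25, 513))  # toy mixture STFT

masks = soft_masks(raw)
estimates = apply_masks(masks, np.abs(mix), np.angle(mix))

# The four masked spectrograms add back up to the mixture spectrogram.
print(np.allclose(estimates.sum(axis=0), mix))
```

Because the masks sum to one in every time-frequency bin, the four estimates always share out the whole mixture between them.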

### The loss

To train the network, a peculiar loss function is used. The content of the track with all the other instruments varies widely, since very different sounds can end up in the "other" track from one song to the next. To address this problem, the loss

$$L = L_m + L_o$$

is composed of two components.

First, let $y_o$ be the "other" source track, $y_v$ the "vocal" source track, $y_b$ the "bass" source track and $y_d$ the "drums" source track. Their predictions are denoted respectively by $\hat{y_o}$, $\hat{y_v}$, $\hat{y_b}$ and $\hat{y_d}$.

The first component takes care of all tracks but the "other" one, using their squared Euclidean distance. It also includes an additional term that penalizes similarity between the predicted tracks.

$$\begin{aligned} L_m ={} & ||y_d - \hat{y_d}||^2+||y_b - \hat{y_b}||^2+||y_v - \hat{y_v}||^2 \\ & - \alpha(||\hat{y_v} - \hat{y_d}||^2+||\hat{y_v} - \hat{y_b}||^2+||\hat{y_b} - \hat{y_d}||^2 + ||\hat{y_v} - \hat{y_o}||^2 + ||\hat{y_o} - \hat{y_d}||^2 + ||\hat{y_b} - \hat{y_o}||^2) \end{aligned}$$

The second component, which is never positive, pushes every other prediction away from the true "other" track.

$$L_o = -\beta(||\hat{y_b}-y_o||^2+||\hat{y_d}-y_o||^2)-\gamma||\hat{y_v}-y_o||^2$$

Note that the true and predicted "other" tracks are never directly compared.

$\alpha, \beta, \gamma$ are weights that scale the penalization terms; they were selected experimentally.
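To make the two components concrete, here is a minimal NumPy sketch of $L = L_m + L_o$ evaluated on toy vectors. The function names and the dictionary layout are assumptions made for this illustration; the default weights are the ones set in the TensorFlow loss of Part 2.

```python
import numpy as np


def sqd(a, b):
    """Squared Euclidean distance between two arrays."""
    return float(np.sum((a - b) ** 2))


def separation_loss(true, pred, alpha=0.001, beta=0.01, gamma=0.03):
    """L = L_m + L_o for dicts keyed by 'drums', 'bass', 'vocals', 'other'."""
    main = ("drums", "bass", "vocals")
    names = main + ("other",)
    # Reconstruction error on every track except "other".
    fit = sum(sqd(true[k], pred[k]) for k in main)
    # Pairwise distances between all four predictions (six pairs).
    spread = sum(sqd(pred[a], pred[b])
                 for i, a in enumerate(names) for b in names[i + 1:])
    l_m = fit - alpha * spread
    # Distance of each non-"other" prediction from the true "other" track.
    l_o = (-beta * (sqd(pred["bass"], true["other"])
                    + sqd(pred["drums"], true["other"]))
           - gamma * sqd(pred["vocals"], true["other"]))
    return l_m + l_o


# Toy tracks: four mutually orthogonal unit vectors.
true = dict(zip(("drums", "bass", "vocals", "other"), np.eye(4)))
loss_value = separation_loss(true, true)
print(loss_value)
```

With perfect predictions the reconstruction term vanishes and only the negative penalties remain, so the loss is below zero rather than equal to zero.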

### The metrics

To evaluate the results, we use three classical metrics from audio source separation:
- SDR: source to distortion ratio
- SIR: source to interference ratio
- SAR: source to artifacts ratio

The original article also considers the source to noise ratio, but our sets contain no noise, so we ignore it here.
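The three ratios can be illustrated with a simplified NumPy sketch in the spirit of Vincent et al. (2006). It assumes the instantaneous, noise-free case, in which the estimate is split into a target part, an interference part lying in the span of the true sources, and an artifact part outside that span. The function name is hypothetical and the sketch is not necessarily identical to the `measurer` function of Part 2.

```python
import numpy as np


def bss_metrics(estimate, sources, j):
    """SDR, SIR and SAR (in dB) of `estimate` for target source `j`, no noise term.

    sources has shape (n_sources, n_samples).
    """
    target = sources[j]
    # Projection of the estimate onto the target source alone.
    s_target = target * np.dot(estimate, target) / np.dot(target, target)
    # Projection of the estimate onto the span of all true sources.
    coeffs, *_ = np.linalg.lstsq(sources.T, estimate, rcond=None)
    projection = sources.T @ coeffs
    e_interf = projection - s_target
    e_artif = estimate - projection

    def db(num, den):
        return 10 * np.log10(np.sum(num ** 2) / np.sum(den ** 2))

    sdr = db(s_target, e_interf + e_artif)
    sir = db(s_target, e_interf)
    sar = db(s_target + e_interf, e_artif)
    return sdr, sir, sar


# Toy example: two orthogonal sources plus a component outside their span.
sources = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
estimate = np.array([1.0, 0.1, 0.1])
sdr, sir, sar = bss_metrics(estimate, sources, j=0)
print(sdr, sir, sar)
```

In this toy case the interference and the artifact each carry 1% of the target energy, so SIR is 20 dB and SDR, which counts both errors, is lower.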

### The dataset

The dataset is a public collection of 100 songs of mixed genres, each provided with its four source tracks. It can be found at https://sigsep.github.io/datasets/dsd100.html

### The training

We trained the network on blocks of 10 songs at a time, with 50% overlap between consecutive blocks. For each block we trained the network for 10 epochs, with batches of 32 elements of 25 frames each, shuffled across all the songs in the block. The procedure is repeated three times, which gives the following training structure:

- 3 repetitions
  - 9 blocks, 50% overlap
    - 10 epochs
      - 32-element batches
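The block structure above can be reproduced with a few lines of plain Python. The helper `make_blocks` is a hypothetical name introduced for this sketch; its output matches the `blocks` list defined in the global parameters of Part 2.

```python
def make_blocks(n_songs, block_size, overlap):
    """Return (first, last) index pairs for overlapping blocks of songs."""
    step = block_size - overlap
    return [(start, start + block_size)
            for start in range(0, n_songs - block_size + 1, step)]


blocks = make_blocks(n_songs=50, block_size=10, overlap=5)
repetitions, epochs = 3, 10

print(len(blocks))                          # 9
print(blocks[0], blocks[-1])                # (0, 10) (40, 50)
print(repetitions * len(blocks) * epochs)   # 270 epoch-level passes in total
```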

The original article does not divide the songs into blocks: all 50 songs are considered at the same time and the network is trained for 30 epochs. Our version of the training is much less demanding on memory, as it only needs to load 10 songs at a time.

Like the original authors, we used the Adadelta algorithm to optimize our network. Its parameters were chosen empirically.

## Part 2: The code

### Libraries import

In [1]:
import tensorflow as tf
from os import listdir

### Definition of global parameters

In [2]:
# Path for the dataset folder
path = 'DSD100/'

# Block of songs for the train
blocks = [(0,10),(5,15),(10,20),(15,25),(20,30),(25,35),(30,40),(35,45),(40,50)]

# Training parameters
epochs = 10
repetitions = 3

# Number of frame per input element
T = 25

# Channel number
channels = 1

# Number of element per batch
batches = 32

### Dataset building

We prepare the functions to be called during training. They pass to the network a dataset built on the fly, so it is not necessary to load all the songs at the same time.
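Before the actual functions, the segmentation idea can be sketched in isolation: a waveform is cut into pieces of `256 * (T + 3)` samples and the last piece is zero-padded. The sketch uses NumPy instead of TensorFlow, and the function name and toy waveform are assumptions made for the example.

```python
import numpy as np

T = 25                    # frames per input element, as in the global parameters
SEGMENT = 256 * (T + 3)   # samples per segment (7168)


def segment_waveform(audio, segment_length=SEGMENT):
    """Split a (samples, channels) array into equal segments, zero-padding the last."""
    n_segments = -(-audio.shape[0] // segment_length)   # ceiling division
    padded = np.zeros((n_segments * segment_length, audio.shape[1]), dtype=audio.dtype)
    padded[:audio.shape[0]] = audio
    return padded.reshape(n_segments, segment_length, audio.shape[1])


audio = np.ones((20000, 1), dtype=np.float32)   # toy mono waveform
segments = segment_waveform(audio)
print(segments.shape)   # (3, 7168, 1)
```

Each 7168-sample segment yields exactly 25 STFT frames with a 1024-sample window and a 256-sample step.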

In [3]:
def extractor(path,partial):
    """
    Function to extract either the waveform or the stft transformation of a track.

    Parameters
    ----------
    path : str
        The path to the track to be imported.
    partial : boolean
        A boolean parameter to ask for either the waveform of the song (if True) or its stft transformation (if False).

    Returns
    -------
    tensorflow.Tensor
        Containing the waveform divided into segments if 'partial' is True, or its stft transformation if 'partial' is False.

    tensorflow.Tensor
        A zero-dimensional tensor containing the sample rate of the wav decoding.

    """
    
    # We read the audio file
    raw_audio = tf.io.read_file(path)
    # Decode it in its waveform
    audio, sample_rate = tf.audio.decode_wav(raw_audio, desired_channels=channels)
    # Partition it so that each element yields an stft transformation of T = 25 frames
    segmented = [audio[p:(p+256*(T+3)),:] for p in range(0,tf.shape(audio)[0],256*(T+3))]
    # We zero-pad the tail so that the last segment also yields 25 frames
    segmented[-1] = tf.pad(segmented[-1], [[0,256*(T+3) - tf.shape(segmented[-1])[0]],[0,0]])

    if partial: return tf.stack(segmented), sample_rate
    
    # We compute the stft transformation
    segmented = map(lambda segment: tf.signal.stft(tf.transpose(segment), frame_length=1024, frame_step=256, fft_length=1024),segmented)
    segmented = list(map(lambda segment: tf.transpose(segment, perm = [1,2,0]),segmented))
    
    return tf.stack(segmented), sample_rate

def mixturer(first, last, path):
    """
    Function to build the input dataset X for the training.

    Parameters
    ----------
    first : int
        Index of the first song to be considered.
    last : int
        Index of the last song to be considered (excluded).
    path : str
        Path to the folder where the dataset is contained.

    Returns
    -------
    tensorflow.Tensor
        Containing the segmented stft transformation of the input songs, split into magnitude and phase.
    """
    
    
    tail = 'Mixtures/Dev'
    
    X = list()

    # We build a tensor with all the segmented elements.
    for song in listdir(path + tail)[first:last]:
        X.append(extractor(path + tail + "/" + song + "/mixture.wav", False)[0])
    X = tf.concat(X, 0)
    
    # We split it into magnitude and phase, and output them together
    out = tf.stack([tf.abs(X), tf.math.angle(X)],4)
    # We drop the trailing elements so that every batch has exactly 'batches' elements
    return out[:len(out)-len(out)%batches]

def sourcerer(first,last,path):
    """
    Function to build the target dataset y_true for the training.

    Parameters
    ----------
    first : int
        Index of the first song to be considered.
    last : int
        Index of the last song to be considered (excluded).
    path : str
        Path to the folder where the dataset is contained.

    Returns
    -------
    tensorflow.Tensor
        Containing the segmented waveforms of the four sources (bass, drums, other, vocals), stacked on the last axis.
    """
    
    tail = 'Sources/Dev'
    
    y_bass = list()
    y_drums = list()
    y_other = list()
    y_vocals = list()

    # We build a tensorflow tensor with all the segmented elements.
    for song in listdir(path +  tail)[first:last]:
        y_bass.append(extractor(path + tail + "/" + song + "/bass.wav", True)[0])
        y_drums.append(extractor(path + tail + "/" + song + "/drums.wav", True)[0])
        y_other.append(extractor(path + tail + "/" + song + "/other.wav", True)[0])
        y_vocals.append(extractor(path + tail + "/" + song + "/vocals.wav", True)[0])

    y_bass = tf.concat(y_bass, 0)
    y_drums = tf.concat(y_drums, 0)
    y_other = tf.concat(y_other, 0)
    y_vocals = tf.concat(y_vocals, 0)
    
    out = tf.stack([y_bass, y_drums,y_other,y_vocals], 3)
    # We drop the trailing elements so that every batch has exactly 'batches' elements
    return out[:len(out)-len(out)%batches]

### Model

We prepare our custom loss and optimizer. The metrics are computed separately so as not to slow down the training.

In [4]:
# Classic Adadelta optimizer with custom parameters (note: 10e-1 == 1.0 and 10e-3 == 0.01)
optimizer = tf.keras.optimizers.Adadelta(epsilon = 10e-1, learning_rate= 10e-3, rho = 0.95)

def loss(y_true, y_pred):
    """
    Custom loss that suits our specific problem.

    Parameters
    ----------
    y_true : tensorflow.Tensor
        The true sources.
    y_pred : tensorflow.Tensor
        The sources predicted by the network.
        
    Returns
    -------
    tensorflow.Tensor
        A zero-dimensional tensor containing the total computed loss.
    """
     
    # We prepare a local function that computes the squared Euclidean norm of a tensor, treating it as a flattened vector
    sqd = lambda v : tf.math.reduce_sum(tf.square(v))
    
    # We fix three weights
    alpha = 0.001
    beta = 0.01
    gamma = 0.03
    
    # We divide our input in the 4 separate tracks
    bass_pred = (y_pred[:,:,:,0])
    drums_pred = (y_pred[:,:,:,1])
    other_pred = (y_pred[:,:,:,2])
    vocals_pred = (y_pred[:,:,:,3])
    
    bass_true = (y_true[:,:,:,0])
    drums_true =(y_true[:,:,:,1])
    other_true =(y_true[:,:,:,2])
    vocals_true = (y_true[:,:,:,3])
    
    # We compute all the components of the loss
    sq1 = sqd(bass_pred - bass_true) + sqd(drums_pred - drums_true) + sqd(vocals_pred - vocals_true)
    diff = sqd(bass_pred - drums_pred) + sqd(bass_pred - vocals_pred) + sqd(vocals_pred - drums_pred)
    diff_o = sqd(bass_pred - other_pred) + sqd(drums_pred - other_pred) + sqd(vocals_pred - other_pred)
    diff = diff + diff_o
    other = sqd(bass_pred - other_true) + sqd(drums_pred - other_true)
    othervocals = sqd(vocals_pred - other_true)

    # As we are summing over the whole batch, we divide by the batch size before returning the value
    return (sq1 - alpha * diff - beta * other - gamma * othervocals)/batches

We prepare the function to build and compile the model.
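The shapes flowing through the encoder and decoder can be checked by hand. The sketch below assumes stride 1 and no padding, which are the Keras defaults for `Conv2D` and `Conv2DTranspose`, and takes the kernel sizes from the call to `buildier` further down. The helper names are hypothetical.

```python
def conv_out(size, kernel):
    """Output length of a stride-1 convolution without padding."""
    return size - kernel + 1


def deconv_out(size, kernel):
    """Output length of a stride-1 transposed convolution without padding."""
    return size + kernel - 1


T, BINS = 25, 513
t1, f1, t2, f2 = 1, 513, 12, 1

# Encoder: (time, frequency) size after each convolution.
after_vertical = (conv_out(T, t1), conv_out(BINS, f1))
after_horizontal = (conv_out(after_vertical[0], t2), conv_out(after_vertical[1], f2))

# Decoder: the transposed convolutions undo the two steps in reverse order.
after_h_deconv = (deconv_out(after_horizontal[0], t2), deconv_out(after_horizontal[1], f2))
after_v_deconv = (deconv_out(after_h_deconv[0], t1), deconv_out(after_h_deconv[1], f1))

print(after_vertical, after_horizontal)   # (25, 1) (14, 1)
print(after_h_deconv, after_v_deconv)     # (25, 1) (25, 513)
```

The intermediate size of 14 time steps explains the `(T-t2+1)*N2` units of the dense layer that opens each decoding branch.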

In [5]:
def conv(signal):
    """
    Function that computes the inverse stft for each channel.
    
    Parameters
    ----------
    signal : tensorflow.Tensor
        A tensor containing the transformed tracks.

    Returns
    -------
    tensorflow.Tensor
        A tensor containing the waveform of the track given in input.
    """
        
    # We separate the track on the channels
    signal = tf.unstack(signal,num=channels,axis=2)

    out = list()

    # Compute the inverse stft
    for song in signal:
        out.append(tf.signal.inverse_stft(song, frame_length=1024, frame_step=256, fft_length=1024,
                                    window_fn=tf.signal.inverse_stft_window_fn(frame_step = 256)))
    
    # Rejoin the tracks in one multi-channel track
    return tf.stack(out,1)

def encoder(inp,t1,f1,N1,t2,f2,N2,N):
    """
    Function that adds the encoding block to our network.

    Parameters
    ----------
    inp : tensorflow.keras.layers.Layer
        The previous layer.
    t1 : int
        Kernel size along the time axis of the first convolutional layer.
    f1 : int
        Kernel size along the frequency axis of the first convolutional layer.
    N1 : int
        Number of filters of the first convolutional layer.
    t2 : int
        Kernel size along the time axis of the second convolutional layer.
    f2 : int
        Kernel size along the frequency axis of the second convolutional layer.
    N2 : int
        Number of filters of the second convolutional layer.
    N : int
        Number of units in the encoding dense layer.
        
    Returns
    -------
    tensorflow.keras.layers.Layer
        The resulting layer after adding this block.
    """
    x = tf.keras.layers.Conv2D(filters = N1, kernel_size=(t1,f1), name = 'vertical_convolution')(inp)
    x = tf.keras.layers.Conv2D(filters = N2, kernel_size=(t2,f2), name = 'horizontal_convolution')(x)
    x = tf.keras.layers.Flatten(name = 'flattening')(x)
    return tf.keras.layers.Dense(N, activation='relu', name = 'global_dense')(x)

def decoder(x, t1, f1, N1,t2,f2,N2, track):
    """
    Function that adds a decoding block to our network.

    Parameters
    ----------
    x : tensorflow.keras.layers.Layer
        The previous layer.
    t1 : int
        Kernel size along the time axis of the first convolutional layer.
    f1 : int
        Kernel size along the frequency axis of the first convolutional layer.
    N1 : int
        Number of filters of the first convolutional layer.
    t2 : int
        Kernel size along the time axis of the second convolutional layer.
    f2 : int
        Kernel size along the frequency axis of the second convolutional layer.
    N2 : int
        Number of filters of the second convolutional layer.
    track : str
        Name suffix used to label the layers of this block.
        
    Returns
    -------
    tensorflow.keras.layers.Layer
        The resulting layer after adding this block.
    """
    y = tf.keras.layers.Dense((T-t2+1)*N2, activation='relu', name = 'single_dense_' + track)(x)
    y = tf.keras.layers.Reshape(((T-t2+1),1,N2), name = 'reshape_'+track)(y)
    y = tf.keras.layers.Conv2DTranspose(filters = N1, kernel_size = (t2,f2), name = 'horizontal_deconvolution_' + track)(y)
    y = tf.keras.layers.Conv2DTranspose(filters=channels, kernel_size=(t1,f1), name = 'vertical_deconvolution_' + track)(y)
    y = tf.keras.layers.Lambda(tf.abs, name = 'module_' + track)(y)
    return y

def masker_converter(y, tot,magnitude, name):
    """
    Function to apply the mask to the original magnitude.

    Parameters
    ----------
    y : tensorflow.keras.layers.Layer
        The previous layer.
    tot : tensorflow.keras.layers.Layer
        Layer holding the sum of all the computed magnitudes.
    magnitude : tensorflow.keras.layers.Layer
        Original magnitude given in input, to be masked.
    name : str
        Name suffix used to label the layers of this block.
        
    Returns
    -------
    tensorflow.keras.layers.Layer
        The resulting layer after adding this block.
    """
    y = tf.keras.layers.Lambda(lambda c : tf.math.divide(c[0],c[1]), name='division_'+name)([y,tot])
    y = tf.keras.layers.Multiply(name='y_'+name)([magnitude,y])
    return y

def complexer(z, phase, track):
    """
    Function to put together the phase and magnitude, obtain the estimated stft transformation of our final track and invert it.

    Parameters
    ----------
    z : tensorflow.keras.layers.Layer
        The previous layer.
    phase : tensorflow.keras.layers.Layer
        Original phase given in input.
    track : str
        Name suffix used to label the layers of this block.
        
    Returns
    -------
    tensorflow.keras.layers.Layer
        The resulting layer after adding this block.
    """
    z = tf.keras.layers.Lambda(lambda s:tf.math.multiply(tf.complex(s[0],.0), tf.math.exp(tf.complex(.0,s[1]))), name = 'complex_'+track)([z,phase])
    return tf.keras.layers.Lambda(lambda v:tf.stack(list(map(conv,tf.unstack(v,num=batches,axis=0)))), name = 'istft_'+track)(z)
    
def buildier(t1,f1,N1,t2,f2,N2,N):
    """
    Function that builds and compiles the whole model.
    
    Parameters
    ----------
    t1 : int
        Kernel size along the time axis of the first convolutional layer.
    f1 : int
        Kernel size along the frequency axis of the first convolutional layer.
    N1 : int
        Number of filters of the first convolutional layer.
    t2 : int
        Kernel size along the time axis of the second convolutional layer.
    f2 : int
        Kernel size along the frequency axis of the second convolutional layer.
    N2 : int
        Number of filters of the second convolutional layer.
    N : int
        Number of units in the encoding dense layer.
        
    Returns
    -------
    tensorflow.keras.Model
        The compiled model.
    """
    
    # We take as input the stft transformation
    stft = tf.keras.layers.Input(shape = (T, 513, channels, 2), name='stft', batch_size=batches)
    
    # Divide it in magnitude and phase
    magnitude = tf.keras.layers.Lambda(lambda s:tf.unstack(s,num=2,axis = -1)[0], name = 'magnitude')(stft)
    phase = tf.keras.layers.Lambda(lambda s:tf.unstack(s,num=2,axis = -1)[1], name = 'phase')(stft)
    
    # Apply the encoding block
    x = encoder(magnitude,t1,f1,N1,t2,f2,N2,N)
    
    # Split in four tracks and apply the decoding blocks
    bass = decoder(x, t1, f1, N1,t2,f2,N2, 'bass')
    drums = decoder(x, t1, f1, N1,t2,f2,N2, 'drums')
    other = decoder(x, t1, f1, N1,t2,f2,N2, 'other')
    vocals = decoder(x, t1, f1, N1,t2,f2,N2, 'vocals')
    
    # Add the four estimated elements
    added = tf.keras.layers.Add(name = 'sum')([bass,drums,other,vocals])

    # Apply the mask to each track
    bass = masker_converter(bass, added, magnitude, name = 'bass')
    drums = masker_converter(drums, added, magnitude, name = 'drums')
    other = masker_converter(other, added, magnitude, name = 'other')
    vocals = masker_converter(vocals, added, magnitude, name = 'vocals')
    
    # Get the whole stft of each track
    bass = complexer(bass, phase, track = 'bass')
    drums = complexer(drums, phase, track = 'drums')
    other = complexer(other, phase, track = 'other')
    vocals = complexer(vocals, phase, track = 'vocals')
    
    out = tf.keras.layers.Lambda(lambda x:tf.keras.backend.stack(x,3), name = 'output_stacking')([bass,drums,other,vocals])
    
    # Initialize and compile the model
    model = tf.keras.Model(inputs = stft, outputs = out)
    model.compile(loss = loss, optimizer = optimizer, experimental_run_tf_function=False)
    
    return model

We call the previously defined functions and provide the code to train the network. We also include a file with the weights of the already trained network, as training may take a long time depending on the machine used.

We also specify the number of units used in the different layers, using the same values as in the original paper.

In [6]:
net = buildier(t1 = 1,f1 = 513,N1 = 50,t2 = 12,f2 = 1,N2 = 30, N = 128)
net.summary(print_fn=display)

'Model: "model"'

'__________________________________________________________________________________________________'

'Layer (type)                    Output Shape         Param #     Connected to                     '



'stft (InputLayer)               [(32, 25, 513, 1, 2) 0                                            '

'__________________________________________________________________________________________________'

'magnitude (Lambda)              (32, 25, 513, 1)     0           stft[0][0]                       '

'__________________________________________________________________________________________________'

'vertical_convolution (Conv2D)   (32, 25, 1, 50)      25700       magnitude[0][0]                  '

'__________________________________________________________________________________________________'

'horizontal_convolution (Conv2D) (32, 14, 1, 30)      18030       vertical_convolution[0][0]       '

'__________________________________________________________________________________________________'

'flattening (Flatten)            (32, 420)            0           horizontal_convolution[0][0]     '

'__________________________________________________________________________________________________'

'global_dense (Dense)            (32, 128)            53888       flattening[0][0]                 '

'__________________________________________________________________________________________________'

'single_dense_bass (Dense)       (32, 420)            54180       global_dense[0][0]               '

'__________________________________________________________________________________________________'

'single_dense_drums (Dense)      (32, 420)            54180       global_dense[0][0]               '

'__________________________________________________________________________________________________'

'single_dense_other (Dense)      (32, 420)            54180       global_dense[0][0]               '

'__________________________________________________________________________________________________'

'single_dense_vocals (Dense)     (32, 420)            54180       global_dense[0][0]               '

'__________________________________________________________________________________________________'

'reshape_bass (Reshape)          (32, 14, 1, 30)      0           single_dense_bass[0][0]          '

'__________________________________________________________________________________________________'

'reshape_drums (Reshape)         (32, 14, 1, 30)      0           single_dense_drums[0][0]         '

'__________________________________________________________________________________________________'

'reshape_other (Reshape)         (32, 14, 1, 30)      0           single_dense_other[0][0]         '

'__________________________________________________________________________________________________'

'reshape_vocals (Reshape)        (32, 14, 1, 30)      0           single_dense_vocals[0][0]        '

'__________________________________________________________________________________________________'

'horizontal_deconvolution_bass ( (32, 25, 1, 50)      18050       reshape_bass[0][0]               '

'__________________________________________________________________________________________________'

'horizontal_deconvolution_drums  (32, 25, 1, 50)      18050       reshape_drums[0][0]              '

'__________________________________________________________________________________________________'

'horizontal_deconvolution_other  (32, 25, 1, 50)      18050       reshape_other[0][0]              '

'__________________________________________________________________________________________________'

'horizontal_deconvolution_vocals (32, 25, 1, 50)      18050       reshape_vocals[0][0]             '

'__________________________________________________________________________________________________'

'vertical_deconvolution_bass (Co (32, 25, 513, 1)     25651       horizontal_deconvolution_bass[0]['

'__________________________________________________________________________________________________'

'vertical_deconvolution_drums (C (32, 25, 513, 1)     25651       horizontal_deconvolution_drums[0]'

'__________________________________________________________________________________________________'

'vertical_deconvolution_other (C (32, 25, 513, 1)     25651       horizontal_deconvolution_other[0]'

'__________________________________________________________________________________________________'

'vertical_deconvolution_vocals ( (32, 25, 513, 1)     25651       horizontal_deconvolution_vocals[0'

'__________________________________________________________________________________________________'

'module_bass (Lambda)            (32, 25, 513, 1)     0           vertical_deconvolution_bass[0][0]'

'__________________________________________________________________________________________________'

'module_drums (Lambda)           (32, 25, 513, 1)     0           vertical_deconvolution_drums[0][0'

'__________________________________________________________________________________________________'

'module_other (Lambda)           (32, 25, 513, 1)     0           vertical_deconvolution_other[0][0'

'__________________________________________________________________________________________________'

'module_vocals (Lambda)          (32, 25, 513, 1)     0           vertical_deconvolution_vocals[0]['

'__________________________________________________________________________________________________'

'sum (Add)                       (32, 25, 513, 1)     0           module_bass[0][0]                '

'                                                                 module_drums[0][0]               '

'                                                                 module_other[0][0]               '

'                                                                 module_vocals[0][0]              '

'__________________________________________________________________________________________________'

'division_bass (Lambda)          (32, 25, 513, 1)     0           module_bass[0][0]                '

'                                                                 sum[0][0]                        '

'__________________________________________________________________________________________________'

'division_drums (Lambda)         (32, 25, 513, 1)     0           module_drums[0][0]               '

'                                                                 sum[0][0]                        '

'__________________________________________________________________________________________________'

'division_other (Lambda)         (32, 25, 513, 1)     0           module_other[0][0]               '

'                                                                 sum[0][0]                        '

'__________________________________________________________________________________________________'

'division_vocals (Lambda)        (32, 25, 513, 1)     0           module_vocals[0][0]              '

'                                                                 sum[0][0]                        '

'__________________________________________________________________________________________________'

'y_bass (Multiply)               (32, 25, 513, 1)     0           magnitude[0][0]                  '

'                                                                 division_bass[0][0]              '

'__________________________________________________________________________________________________'

'phase (Lambda)                  (32, 25, 513, 1)     0           stft[0][0]                       '

'__________________________________________________________________________________________________'

'y_drums (Multiply)              (32, 25, 513, 1)     0           magnitude[0][0]                  '

'                                                                 division_drums[0][0]             '

'__________________________________________________________________________________________________'

'y_other (Multiply)              (32, 25, 513, 1)     0           magnitude[0][0]                  '

'                                                                 division_other[0][0]             '

'__________________________________________________________________________________________________'

'y_vocals (Multiply)             (32, 25, 513, 1)     0           magnitude[0][0]                  '

'                                                                 division_vocals[0][0]            '

'__________________________________________________________________________________________________'

'complex_bass (Lambda)           (32, 25, 513, 1)     0           y_bass[0][0]                     '

'                                                                 phase[0][0]                      '

'__________________________________________________________________________________________________'

'complex_drums (Lambda)          (32, 25, 513, 1)     0           y_drums[0][0]                    '

'                                                                 phase[0][0]                      '

'__________________________________________________________________________________________________'

'complex_other (Lambda)          (32, 25, 513, 1)     0           y_other[0][0]                    '

'                                                                 phase[0][0]                      '

'__________________________________________________________________________________________________'

'complex_vocals (Lambda)         (32, 25, 513, 1)     0           y_vocals[0][0]                   '

'                                                                 phase[0][0]                      '

'__________________________________________________________________________________________________'

'istft_bass (Lambda)             (32, 7168, 1)        0           complex_bass[0][0]               '

'__________________________________________________________________________________________________'

'istft_drums (Lambda)            (32, 7168, 1)        0           complex_drums[0][0]              '

'__________________________________________________________________________________________________'

'istft_other (Lambda)            (32, 7168, 1)        0           complex_other[0][0]              '

'__________________________________________________________________________________________________'

'istft_vocals (Lambda)           (32, 7168, 1)        0           complex_vocals[0][0]             '

'__________________________________________________________________________________________________'

'output_stacking (Lambda)        (32, 7168, 1, 4)     0           istft_bass[0][0]                 '

'                                                                 istft_drums[0][0]                '

'                                                                 istft_other[0][0]                '

'                                                                 istft_vocals[0][0]               '



'Total params: 489,142'

'Trainable params: 489,142'

'Non-trainable params: 0'

'__________________________________________________________________________________________________'
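The parameter counts in the summary above can be verified by hand. The sketch below assumes one bias per filter or unit, which is the Keras default, and uses hypothetical helper names; the layer sizes are read off the summary.

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter for a 2D (transposed) convolution."""
    return kernel_h * kernel_w * in_channels * filters + filters


def dense_params(inputs, units):
    """Weights plus one bias per unit for a dense layer."""
    return inputs * units + units


encoder = (conv_params(1, 513, 1, 50)      # vertical_convolution
           + conv_params(12, 1, 50, 30)    # horizontal_convolution
           + dense_params(14 * 30, 128))   # global_dense

decoder = (dense_params(128, 14 * 30)      # single_dense_*
           + conv_params(12, 1, 30, 50)    # horizontal_deconvolution_*
           + conv_params(1, 513, 50, 1))   # vertical_deconvolution_*

total = encoder + 4 * decoder
print(total)   # 489142
```

About four fifths of the parameters sit in the four decoding branches, since each branch repeats the dense layer and both deconvolutions.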

In [7]:
if True: net.load_weights('pretrained_weights_experiment1.h5')
else:
    for _ in range(repetitions):
        for block in blocks:
            net.fit(x = mixturer(*block, path),
                    y = sourcerer(*block, path),
                    epochs = epochs, batch_size=batches, shuffle=True)

### Testing

Here we provide the code we used to test the network and to save the predicted audio tracks as .wav files for listening.
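Because the model input has a fixed batch size of 32, prediction only runs on complete groups of 32 segments. The plain-Python sketch below shows which start indices survive this grouping; the function name is hypothetical.

```python
def full_batch_starts(n_segments, batch_size=32):
    """Start indices of the complete batches; a trailing partial batch is dropped."""
    return list(range(0, n_segments - batch_size + 1, batch_size))


print(full_batch_starts(70))   # [0, 32]
print(full_batch_starts(64))   # [0, 32]
print(full_batch_starts(31))   # []
```

A song with 70 segments is therefore predicted only up to segment 64, which is why the true tracks are trimmed to the length of the prediction before the metrics are computed.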

In [8]:
def tracks_predict(path, net):
    """
    Function that, given the trained network, predicts the four source tracks that compose a song.

    Parameters
    ----------
    path : str
        Path to the song to be separated.
    net : tensorflow.keras.Model
        The trained model to be used to separate our track.
    
    Returns
    -------
    tensorflow.Tensor
        A tensor containing the 4 predicted sources.
    tensorflow.Tensor
        A zero-dimensional tensor containing the sample rate of the wav decoding.
    """
    # We exctract the waveform from the file
    stft, sample_rate = extractor(path, False)
    # We shape is as our net expect to get it
    stft = [stft[i:32+i] for i in range(0,int(tf.shape(stft)[0])-31,32)]
    stft = list(map(lambda v:tf.stack([tf.abs(v), tf.math.angle(v)],4), stft))
    # We compute the prediction
    prediction = list(map(lambda v:net.predict(v),stft))
    # We reshapre the output as a normal track waveform
    prediction = [inner for outer in list(map(lambda v:tf.unstack(v,axis=0),prediction)) for inner in outer]
    prediction = tf.concat(prediction,0)
    
    return prediction, sample_rate

def measurer(y_true_m,y_pred_m):
    """
    Function that computes SDR, SIR and SAR.

    Parameters
    ----------
    y_true_m : tensorflow.Tensor
        All the true sources stacked.
    y_pred_m : tensorflow.Tensor
        All the predicted sources stacked.
        
    Returns
    -------
    list
        A list of tensorflow.Tensor objects, one for each channel, containing the results of the metrics.
    """
    
    # We cut the tail of the true song to match the one of the predicted one
    y_true_m = y_true_m[:int(tf.shape(y_pred_m)[0]),:,:]
    
    out = list()
    
    # We iterate over the channels
    for i in range(channels):
        y_true = tf.transpose(y_true_m[:,i,:])
        y_pred = tf.transpose(y_pred_m[:,i,:])
        
        # We compute the target vector
        s_target = tf.stack([y_true[i] * tf.tensordot(y_true[i], y_pred[i],1)/tf.tensordot(y_true[i], y_true[i],1) for i in range(4)])
        
        # We compute the projection of the predicted vector onto the space of the true sources
        R_ss = tf.tensordot(y_true,tf.transpose(y_true),1)
        R_ss = tf.linalg.inv(R_ss)
        c = tf.stack([tf.tensordot(R_ss, tf.tensordot(y_true, y_pred[i], 1), 1) for i in range(4)])
        P_s = tf.tensordot(c,y_true,1)
        
        # We compute the interference vector
        e_interf = P_s - s_target

        # We compute the artifacts vector
        e_artif = y_pred - P_s

        # We compute the SDR (4.342944819 = 10/ln(10), so each metric is 10*log10 of an energy ratio, in dB),
        t = e_interf + e_artif
        SDR = [4.342944819 * tf.math.log(tf.tensordot(s_target[i],s_target[i],1)/tf.tensordot(t[i],t[i],1)) for i in range(4)]
        # the SIR,
        SIR = [4.342944819 * tf.math.log(tf.tensordot(s_target[i],s_target[i],1)/tf.tensordot(e_interf[i],e_interf[i],1)) for i in range(4)]

        # and the SAR
        t = s_target + e_interf
        SAR = [4.342944819 * tf.math.log(tf.tensordot(t[i],t[i],1)/tf.tensordot(e_artif[i],e_artif[i],1)) for i in range(4)]

        out.append(tf.stack([SDR, SIR, SAR]))
    return out

def comparer(n):    
    """
    Function that calls measurer for all the songs and computes the mean and the standard deviation of the results.

    Parameters
    ----------
    n : int
        The number of songs to run the test on.

    Returns
    -------
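    tensorflow.Tensor
        A 2D tensor having as rows the means of the metrics SDR, SIR and SAR, and as columns the sources.
    tensorflow.Tensor
        A 2D tensor having as rows the standard deviations of the metrics SDR, SIR and SAR, and as columns the sources.
    tensorflow.Tensor
        A 3D tensor containing the metrics of every evaluated song and channel, before averaging.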
    """
    
    measured = list()
    
    for song in listdir(path + "Sources/Test/")[:n]:
        tracks = list()

        tracks.append(extractor(path + "Sources/Test/" + song + "/bass.wav", True)[0])
        tracks.append(extractor(path + "Sources/Test/" + song + "/drums.wav", True)[0])
        tracks.append(extractor(path + "Sources/Test/" + song + "/other.wav", True)[0])
        tracks.append(extractor(path + "Sources/Test/" + song + "/vocals.wav", True)[0])

        y_true = tf.stack(tracks,axis=3)
        y_true = tf.concat(tf.unstack(y_true,axis=0),axis=0)

        y_pred = tracks_predict(path + "Mixtures/Test/" + song + "/mixture.wav", net)[0]

        measured+=measurer(y_true, y_pred)
        
    measured = tf.stack(measured)
    
    return tf.math.reduce_mean(measured,0), tf.math.reduce_std(measured,0), measured
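
As a sanity check of the decomposition used in `measurer`, the following is a minimal NumPy sketch run on synthetic signals. The function `bss_metrics`, the random toy sources and the mixing coefficients are illustrative assumptions and are not part of the original code; the sketch only mirrors the target, interference and artifact terms computed above.

```python
import numpy as np

def bss_metrics(sources, estimate, j):
    """Return (SDR, SIR, SAR) in dB for `estimate` as an estimate of sources[j].

    sources  : array of shape (n_sources, n_samples), the true sources
    estimate : array of shape (n_samples,), the predicted source
    """
    target = sources[j]
    # Projection of the estimate onto the true target source
    s_target = target * np.dot(target, estimate) / np.dot(target, target)

    # Projection of the estimate onto the space spanned by all true sources
    gram = sources @ sources.T
    coeffs = np.linalg.solve(gram, sources @ estimate)
    projection = coeffs @ sources

    e_interf = projection - s_target   # contribution of the other sources
    e_artif = estimate - projection    # everything outside the source space

    def ratio_db(num, den):
        return 10 * np.log10(np.dot(num, num) / np.dot(den, den))

    sdr = ratio_db(s_target, e_interf + e_artif)
    sir = ratio_db(s_target, e_interf)
    sar = ratio_db(s_target + e_interf, e_artif)
    return sdr, sir, sar

# Toy data (hypothetical): two random sources, an estimate of the first one
# contaminated by 10% of the second source and 10% of unrelated noise.
rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 1000))
noise = rng.standard_normal(1000)
estimate = sources[0] + 0.1 * sources[1] + 0.1 * noise

sdr, sir, sar = bss_metrics(sources, estimate, 0)
print(f"SDR = {sdr:.1f} dB, SIR = {sir:.1f} dB, SAR = {sar:.1f} dB")
```

Since the interference and artifact terms are orthogonal, the SDR can never exceed either the SIR or the SAR, which is a quick consistency check for any implementation of these metrics.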

We compute the measures.

In [9]:
measured_mean, measured_std, _ = comparer(50)
display(measured_mean)
display(measured_std)

<tf.Tensor: id=9240961, shape=(3, 4), dtype=float32, numpy=
array([[-2.0841825, -0.6200529, -2.838591 , -0.6094992],
       [ 4.1584187,  7.5702324,  3.5783527,  4.7879066],
       [ 1.3268983,  1.3926016,  0.1526977,  2.8035827]], dtype=float32)>

<tf.Tensor: id=9240968, shape=(3, 4), dtype=float32, numpy=
array([[3.4224148, 3.4294777, 2.3956723, 2.6649582],
       [5.3171062, 5.193449 , 2.9764698, 4.0756445],
       [1.7441429, 2.2626538, 1.417179 , 1.9064392]], dtype=float32)>

We predict the source tracks of a song and save them as .wav files, so that the result can be checked directly by listening.

In [10]:
# The last axis of the prediction indexes the sources: bass, drums, other, vocals
prediction, sample_rate = tracks_predict("DSD100/Mixtures/Dev/051 - AM Contra - Heart Peripheral/mixture.wav", net)
tf.io.write_file('prediction/bass.wav', tf.audio.encode_wav(prediction[:,:,0], sample_rate = sample_rate))
tf.io.write_file('prediction/drums.wav', tf.audio.encode_wav(prediction[:,:,1], sample_rate = sample_rate))
tf.io.write_file('prediction/other.wav', tf.audio.encode_wav(prediction[:,:,2], sample_rate = sample_rate))
tf.io.write_file('prediction/vocals.wav', tf.audio.encode_wav(prediction[:,:,3], sample_rate = sample_rate))
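
Note that `tracks_predict` only processes whole batches of 32 items along the first axis of its input, so the saved tracks can be slightly shorter than the mixture; this is also why `measurer` cuts the tail of the true sources. A minimal sketch of the slicing arithmetic, where the helper `batch_starts` and the segment count are hypothetical and used only for illustration:

```python
def batch_starts(n_segments, batch_size=32):
    """Start indices of the full batches, as in the range() used by tracks_predict."""
    return list(range(0, n_segments - (batch_size - 1), batch_size))

n_segments = 100  # hypothetical number of segments in a song
starts = batch_starts(n_segments)
covered = len(starts) * 32

print(starts)                # start index of every full batch
print(n_segments - covered)  # number of trailing segments that are dropped
```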

## Part 3: Experimental evaluation

Here we show all the results obtained during the work, including those obtained with configurations different from the final one presented above.

First of all, as a reference, we report the metrics obtained in the original paper.

|Measure (dB)|Bass|Drums|Other|Vocals|
|------|------|------|------|------|
|SDR|0.9$\pm$2.7|2.4$\pm$2|1.3$\pm$2.4|0.8$\pm$1.5|
|SIR|4.6$\pm$4.4|9.1$\pm$4.3|7.2$\pm$3.6|3.8$\pm$4|
|SAR|6.9$\pm$2.3|7$\pm$2.8|5.3$\pm$2.9|2.8$\pm$2.4|

Original article results.

For each experiment, the file with the pretrained weights is also provided.

### Experiment 1: Main network

Here we show the results of the network presented above.

|Measure (dB)|Bass|Drums|Other|Vocals|
|------|------|------|------|------|
|SDR|-2.1$\pm$3.4|-0.6$\pm$3.4|-2.8$\pm$2.4|-0.6$\pm$2.7|
|SIR|4.2$\pm$5.3|7.6$\pm$5.2|3.6$\pm$3|4.8$\pm$4.1|
|SAR|1.3$\pm$1.7|1.4$\pm$2.3|0.2$\pm$1.4|2.8$\pm$1.9|

Experiment 1 results: main network.

We can see that the results are worse than those of the original paper; in particular, the SDR becomes negative. From the definition of the measures (see the paper by Vincent et al.), it is clear that the major problem is the introduction of artifact sounds, since the artifacts vector appears in the denominator of both the SDR and the SAR, while the SIR, which does not depend on it, stays close to the original values.
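
For reference, the definitions of Vincent et al. in the form computed by `measurer` (that is, without a separate noise term) can be written as follows, where $\hat{s}$ is the predicted source, $s_{\text{target}}$ its projection onto the true source, $e_{\text{interf}}$ the interference vector and $e_{\text{artif}}$ the artifacts vector:

$$
\begin{aligned}
\hat{s} &= s_{\text{target}} + e_{\text{interf}} + e_{\text{artif}} \\
\mathrm{SDR} &= 10 \log_{10} \frac{\lVert s_{\text{target}} \rVert^2}{\lVert e_{\text{interf}} + e_{\text{artif}} \rVert^2} \\
\mathrm{SIR} &= 10 \log_{10} \frac{\lVert s_{\text{target}} \rVert^2}{\lVert e_{\text{interf}} \rVert^2} \\
\mathrm{SAR} &= 10 \log_{10} \frac{\lVert s_{\text{target}} + e_{\text{interf}} \rVert^2}{\lVert e_{\text{artif}} \rVert^2}
\end{aligned}
$$

The constant used in the code comes from $10 \log_{10} x = \frac{10}{\ln 10} \ln x \approx 4.342944819 \, \ln x$. Only the SIR is independent of $e_{\text{artif}}$, which is consistent with it being the metric least affected in our results.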

Listening to the tracks predicted with this network confirms this deduction. It is possible to hear what the network was trying to extract, with some odd sounds added and occasionally some small contamination from the other tracks.

As said, the results are considerably lower in quality than those in the original article. Nevertheless, most of them are still compatible with the original results: once the uncertainty of the estimates is taken into account, the intervals often overlap.
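
This compatibility can be checked numerically. The sketch below is an illustration rather than part of the original evaluation: it assumes that "compatible" means that the intervals mean $\pm$ standard deviation overlap. The paper values are copied from the reference table above, and the Experiment 1 values from the tensors displayed after `comparer(50)`, rounded to two decimals. The helper `intervals_overlap` is hypothetical.

```python
import numpy as np

# Rows: SDR, SIR, SAR. Columns: bass, drums, other, vocals.
paper_mean = np.array([[0.9, 2.4, 1.3, 0.8],
                       [4.6, 9.1, 7.2, 3.8],
                       [6.9, 7.0, 5.3, 2.8]])
paper_std = np.array([[2.7, 2.0, 2.4, 1.5],
                      [4.4, 4.3, 3.6, 4.0],
                      [2.3, 2.8, 2.9, 2.4]])
ours_mean = np.array([[-2.08, -0.62, -2.84, -0.61],
                      [4.16, 7.57, 3.58, 4.79],
                      [1.33, 1.39, 0.15, 2.80]])
ours_std = np.array([[3.42, 3.43, 2.40, 2.66],
                     [5.32, 5.19, 2.98, 4.08],
                     [1.74, 2.26, 1.42, 1.91]])

def intervals_overlap(mean_a, std_a, mean_b, std_b):
    """True where [mean_a - std_a, mean_a + std_a] intersects the b interval."""
    return (mean_a - std_a <= mean_b + std_b) & (mean_b - std_b <= mean_a + std_a)

overlap = intervals_overlap(paper_mean, paper_std, ours_mean, ours_std)
print(overlap)
print(int(overlap.sum()), "of", overlap.size, "intervals overlap")
```

Under this criterion all SDR and SIR intervals overlap, while the SAR intervals of bass, drums and other do not, which again points at the artifacts as the main difference.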

We deviated from the original work by not including the fifth artificial track and by splitting the training into blocks for computational reasons. These choices may be the major cause of the reduced performance of our network.

In particular, since we suspected that our modified training procedure was the major cause of the problem, we also tried increasing the number of epochs and repetitions and changing the organization of the blocks. All the experiments of this type led to results almost identical to those above, suggesting that our network had reached a stable point.

The extra track would have increased the differentiation between tracks; we therefore tried several changes to the loss parameters and the network parameters to compensate for its absence. The results obtained were very similar to those of Experiment 1, which probably indicates that we reached the maximum performance possible for this architecture.

### Experiment 2: Double channel

Everything we did up to now, as in the original paper, was monoaural; that is, we used a single channel.

However, the dataset we used is composed of stereo audio tracks, with two channels. We wrote the code in such a way that the number of channels can be specified and the network is modified accordingly; we therefore tried it using both channels.

We trained the network on both channels, without changing any other parameter.

|Measure (dB)|Bass|Drums|Other|Vocals|
|------|------|------|------|------|
|SDR|-2$\pm$3.4|-0.6$\pm$3.6|-2.7$\pm$2.3|-0.7$\pm$2.6|
|SIR|4.1$\pm$5.2|7.8$\pm$5.4|3.9$\pm$3|5$\pm$3.8|
|SAR|1.4$\pm$1.8|1.3$\pm$2.4|0.1$\pm$1.2|2.5$\pm$2|

Experiment 2 results: two channels.

As we can see, the results are very similar to those of the previous experiment, leading us to believe that this type of network can easily be generalized to multi-channel audio tracks.

The problem is the computational time: the stereo network took three times as long to train as the monoaural network.