# Distinguish author-specific patterns in music

* Find this notebook at `EpyNN/nnlive/author_music/train.ipynb`.
* Regular Python code at `EpyNN/nnlive/author_music/train.py`.

In this notebook we will review:

* Handling univariate time series that represent a **huge number of data points**.
* Taking advantage of recurrent architectures (RNN, GRU) over Feed-Forward architectures.
* Introducing recall and precision along with accuracy when dealing with unbalanced datasets.

## Environment and data

Follow [this link](prepare_dataset.ipynb) for details about data preparation.

Briefly, raw data are acoustic guitar music from the *True* author and the *False* author. These are raw ``.wav`` files that were resampled, clipped and digitized using a 4-bit encoder.

Commonly, music ``.wav`` files have a sampling rate of 44100 Hz. This means that each second of music represents a numerical time series of length 44100.

In [1]:
# EpyNN/nnlive/author_music/train.ipynb
# Standard library imports
import random

# Related third party imports
import numpy as np

# Local application/library specific imports
import nnlibs.initialize
from nnlibs.commons.maths import relu, softmax
from nnlibs.commons.library import (
    configure_directory,
    read_model,
)
from nnlibs.network.models import EpyNN
from nnlibs.embedding.models import Embedding
from nnlibs.rnn.models import RNN
# from nnlibs.lstm.models import LSTM
from nnlibs.gru.models import GRU
from nnlibs.flatten.models import Flatten
from nnlibs.dropout.models import Dropout
from nnlibs.dense.models import Dense
from prepare_dataset import prepare_dataset
from settings import se_hPars


########################## CONFIGURE ##########################
random.seed(1)

np.set_printoptions(threshold=10)

np.seterr(all='warn')
np.seterr(under='ignore')


############################ DATASET ##########################
X_features, Y_label = prepare_dataset(N_SAMPLES=1280)

Let's inspect.

In [2]:
print(len(X_features))
print(X_features[0].shape)
print(X_features[0])
print(np.min(X_features[0]), np.max(X_features[0]))

1280
(10000,)
[15 10 10 ... 13 13 13]
0 15


We cut the original ``.wav`` files into 1-second clips, which yielded ``1280`` samples. We did so because the amount of data is limited: to get more training examples, we need to split the recordings.

Other problems are discussed below:

* **Array size in memory**: One second represents 44100 data points for each clip and thus ``44100 * 1280 = 56.448e6`` data points in total. More than fifty million data points are likely to overload your RAM or to raise a memory allocation error on most laptops. This is why we resampled the content of the original ``.wav`` files to 10000 Hz. In doing so, we lose the patterns associated with frequencies greater than 5000 Hz. Instead, we could have made clips of shorter duration, but then we would miss patterns associated with lower frequencies. Because the emission spectrum of the guitar lies essentially below 5000 Hz, we preferred the resampling method.
* **Signal normalization**: Original signals were sequences of 16-bit integers ranging from ``0`` to ``32767``. Feeding a neural network with such large values will most likely result in floating point errors. This is why we normalized the original data from each ``.wav`` file within the range \[0, 1\].
* **Signal digitization**: The original signal was a digital signal encoded over 16-bit integers, which results in a difference of about ``3e-5`` between consecutive levels after normalization within the range \[0, 1\]. Such small differences may be difficult for the network to evaluate, and training could become prohibitively slow. In the context of this notebook, we digitized the signal again using 4-bit integers ranging from ``0`` to ``15``, for a total of 16 values instead of 32768.
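
The three points above can be sketched with NumPy. This is only an illustration under stated assumptions: the random `signal` is a hypothetical stand-in for one resampled clip, and the actual steps in `prepare_dataset` may differ in their details.

```python
import numpy as np

N_CLIPS = 1280            # number of 1-second clips, from the text above
RATE_ORIGINAL = 44100     # Hz, common sampling rate of .wav music files
RATE_RESAMPLED = 10000    # Hz, rate used after resampling

# Total number of data points before and after resampling.
points_original = RATE_ORIGINAL * N_CLIPS
points_resampled = RATE_RESAMPLED * N_CLIPS

# Highest frequency still representable after resampling (Nyquist limit).
nyquist = RATE_RESAMPLED / 2

# Hypothetical stand-in for one resampled clip of 16-bit samples.
rng = np.random.default_rng(1)
signal = rng.integers(0, 32768, size=RATE_RESAMPLED)

# Normalize within [0, 1] using the minimum and maximum of this clip only.
normalized = (signal - signal.min()) / (signal.max() - signal.min())

# Gap between two consecutive 16-bit levels once scaled to [0, 1].
step_16_bits = 1 / 32767

# Digitize again over 16 levels (4 bits): integers ranging from 0 to 15.
n_levels = 2 ** 4
quantized = np.round(normalized * (n_levels - 1)).astype(int)

print(points_original, points_resampled, nyquist)
print(f"{step_16_bits:.2e}")
print(quantized.min(), quantized.max(), quantized.shape)
```

Normalizing each clip separately, as done here, keeps one loud recording from compressing the range of the others.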

With that said, we can go ahead.

## Feed-Forward (FF)

We start with our reference, a Feed-Forward network.

### Embedding

We already scaled the input data for each ``.wav`` file, so we do not need to pass ``X_scale=True`` to the class constructor of the *embedding* layer. Note that ``X_scale=True`` applies a global scaling over the whole training set, whereas here we work with independent ``.wav`` files which should be normalized separately.

In [3]:
embedding = Embedding(X_data=X_features,
                      Y_data=Y_label,
                      Y_encode=True,
                      batch_size=16,
                      relative_size=(2, 1, 0))

Let's inspect the shape of the data.

In [4]:
print(embedding.dtrain.X.shape)
print(embedding.dtrain.b)

(853, 10000)
{1: 378, 0: 475}


We note that we have an unbalanced dataset, with about 56% of negative samples (475 out of 853).
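
As a quick check, the class balance can be computed from the counts printed above. The sketch also reports the accuracy of a trivial classifier that always predicts the majority label, which is a useful baseline on unbalanced data. The variable names are illustrative only.

```python
# Label counts printed by embedding.dtrain.b above.
counts = {1: 378, 0: 475}

total = sum(counts.values())
fractions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority label.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(total)                                        # 853
print(round(fractions[0], 3))                       # 0.557
print(round(fractions[1], 3))                       # 0.443
print(majority_label, round(baseline_accuracy, 3))  # 0 0.557
```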

### Flatten-(Dense)n

Let's proceed with the network design and training.

In [5]:
name = 'Flatten_Dense-64-relu_Dense-2-softmax'

# se_hPars['learning_rate'] = 0.00001
# se_hPars['learning_rate'] = 1
se_hPars['learning_rate'] = 0.01
se_hPars['softmax_temperature'] = 1

layers = [
    embedding,
    Dense(64, relu),
    Dense(2, softmax),
]

model = EpyNN(layers=layers, name=name)

Here the softmax temperature is set to ``1``. A higher value, such as ``5``, would diminish the confidence of the model and the risk of vanishing/exploding gradients.
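
To illustrate the effect of the temperature, here is a minimal sketch that uses the common formulation in which scores are divided by the temperature before the softmax. This is an assumption about the general technique, not a copy of the EpyNN implementation, and the `logits` values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of scores, each divided by the temperature."""
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.0]  # hypothetical scores for the two classes

p1 = softmax(logits, temperature=1)
p5 = softmax(logits, temperature=5)

print([round(p, 3) for p in p1])  # [0.881, 0.119]
print([round(p, 3) for p in p5])  # [0.599, 0.401]
```

With a higher temperature, the output probabilities move closer to each other, so the model is less confident for the same scores.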

We can initialize the model.

In [6]:
model.initialize(loss='MSE', seed=1, metrics=['accuracy', 'recall', 'precision'], se_hPars=se_hPars.copy(), end='\r')

--- EpyNN Check OK! ---

Train it for 10 epochs.

In [7]:
model.train(epochs=10, init_logs=False)

+-------+----------+----------+-------+----------+-------+--------+-------+-----------+-------+-------+--------------------------------------------------+
| epoch |  lrate   |  lrate   |       | accuracy |       | recall |       | precision |       |  MSE  |                    Experiment                    |
|       |  Dense   |  Dense   |  (0)  |   (1)    |  (0)  |  (1)   |  (0)  |    (1)    |  (0)  |  (1)  |                                                  |
+-------+----------+----------+-------+----------+-------+--------+-------+-----------+-------+-------+--------------------------------------------------+
|   0   | 1.00e-02 | 1.00e-02 | 0.557 |  0.522   | 1.000 | 1.000  |

At least we have no overfitting, because we have no fit at all. With 10000 features per sample, the problem seems too complex for this network. We will move on and use an architecture better suited to the situation.

We can still comment on the **recall** and **precision** metrics:

* **Recall**: It represents *the fraction of positive instances retrieved by the model*.
* **Precision**: It represents *the fraction of positive instances within the labels predicted as positive*. 

Said differently:

* Given **tp** the *true positive* samples.
* Given **tn** the *true negative* samples.
* Given **fp** the *false positive* samples.
* Given **fn** the *false negative* samples.
* Then **recall** = ``tp / (tp+fn)`` and **precision** = ``tp / (tp+fp)``.
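
The definitions above translate directly into code. The following is a minimal sketch in plain Python; the function name and the toy labels are illustrative and not part of EpyNN.

```python
def recall_precision(y_true, y_pred, positive=1):
    """Return (recall, precision) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Toy unbalanced labels: 3 positive and 5 negative samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]

# A model that predicts "positive" for every sample.
always_positive = [1] * len(y_true)
print(recall_precision(y_true, always_positive))  # (1.0, 0.375)

# A model that misses one positive and raises one false alarm.
mixed = [1, 1, 0, 1, 0, 0, 0, 0]
recall, precision = recall_precision(y_true, mixed)
print(round(recall, 3), round(precision, 3))  # 0.667 0.667
```

The first case shows why recall alone is not sufficient: a model that labels everything as positive retrieves all positive instances but has poor precision.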

## Recurrent Architectures

Recurrent architectures can make a difference here because they process time series one measurement at a time.

The number of time steps **does not define the size of the parameter (weight/bias) arrays**, whereas it does in the Feed-Forward network.

For the *dense* layer, the shape of W is ``(p, n)``, given ``p`` the number of nodes in the previous layer and ``n`` the number of nodes in the current layer. So when a *dense* layer follows the *embedding* layer, the number of nodes in the *embedding* layer is equal to the number of features, here the number of time steps ``10000``.

By contrast, the parameter shapes of the *RNN* layer depend on the number of cells and on the uni/multivariate nature of each measurement, but not on the number of time steps. In the previous situation, there are likely too many parameters and the computation does not converge.
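
To make the comparison concrete, the sketch below counts parameters for both cases. It assumes a standard dense layer (weights plus biases) and a standard simple recurrent cell (input-to-hidden weights, hidden-to-hidden weights and biases). The helper names are hypothetical, the exact parameter layout in EpyNN may differ, and the value of 16 features per step assumes that each 4-bit measurement is one-hot encoded.

```python
def dense_params(n_previous, n_nodes):
    """Weights of shape (n_previous, n_nodes) plus one bias per node."""
    return n_previous * n_nodes + n_nodes


def rnn_params(n_features, n_cells):
    """Input-to-hidden and hidden-to-hidden weights plus one bias per cell."""
    return n_features * n_cells + n_cells * n_cells + n_cells


n_steps = 10000  # time steps per sample, which are the features of the dense layer

# First dense layer of the Feed-Forward network: grows with the sequence length.
print(dense_params(n_steps, 64))   # 640064

# Recurrent layer with 2 cells and 16 features per step: independent of n_steps.
print(rnn_params(16, 2))           # 38
```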

### Embedding

For the embedding, we will one-hot encode the time series. We know the *"vocabulary"* size will be 16 because we digitized the signal into 16 bins.

In [8]:
embedding = Embedding(X_data=X_features,
                      Y_data=Y_label,
                      X_encode=True,
                      Y_encode=True,
                      batch_size=128,
                      relative_size=(2, 1, 0))

Let's inspect the data shape.

In [9]:
print(embedding.dtrain.X.shape)

(853, 10000, 16)
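
The extra dimension of size 16 comes from the one-hot encoding. A minimal NumPy sketch of the idea follows, with a short hypothetical sequence in place of a real clip. It is not the code used inside the *embedding* layer.

```python
import numpy as np

vocabulary_size = 16                    # 4-bit values range from 0 to 15
sequence = np.array([15, 10, 10, 13])   # hypothetical short time series

# Each row of the identity matrix is the one-hot vector of one value.
one_hot = np.eye(vocabulary_size, dtype=int)[sequence]

print(one_hot.shape)            # (4, 16)
print(one_hot[0])               # a single 1 at index 15
print(one_hot.argmax(axis=1))   # recovers the original values
```

Applied to 853 samples of 10000 steps each, the same operation yields the shape ``(853, 10000, 16)`` shown above.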


### RNN(sequences=True)-Flatten-Dense

Time to clarify a point:

* We have multivariate-like time series (one-hot encoding) with 10000 time steps.
* The 10000 or **length of sequence is unrelated to the number of cells in the RNN layer**. Whatever the number of cells, the whole sequence will be processed entirely.
* In recurrent layers, the parameter shapes are related to the number of cells and the vocabulary size, not to the length of the sequence. That is why such architectures can handle input sequences of variable length.

Therefore, we can use only 2 cells on a sequence with 10000 steps. Setting ``sequences=True`` returns an array of shape ``(n_samples, 10000, 2)``. Without the flag, the shape will be ``(n_samples, 1, 2)``.
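
The following sketch illustrates these points with a generic simple recurrent layer written in NumPy. It is an assumption-laden illustration, not the EpyNN implementation: the names `rnn_forward`, `U`, `V` and `b` are hypothetical, the activation is assumed to be tanh, and short sequences are used to keep it fast. It only covers the case where the hidden state of every time step is returned.

```python
import numpy as np


def rnn_forward(X, U, V, b):
    """Return the hidden state of every time step, shape (n_samples, n_steps, n_cells)."""
    n_samples, n_steps, _ = X.shape
    n_cells = U.shape[1]
    h = np.zeros((n_samples, n_cells))
    states = np.zeros((n_samples, n_steps, n_cells))
    for t in range(n_steps):
        # The same U, V and b are reused at every time step.
        h = np.tanh(X[:, t] @ U + h @ V + b)
        states[:, t] = h
    return states


rng = np.random.default_rng(1)
vocabulary_size, n_cells = 16, 2

# Parameter shapes depend on the vocabulary size and number of cells only.
U = rng.normal(scale=0.1, size=(vocabulary_size, n_cells))
V = rng.normal(scale=0.1, size=(n_cells, n_cells))
b = np.zeros(n_cells)


def random_one_hot(n_samples, n_steps):
    """Hypothetical one-hot encoded batch of 4-bit time series."""
    idx = rng.integers(0, vocabulary_size, size=(n_samples, n_steps))
    return np.eye(vocabulary_size)[idx]


# The same parameters process sequences of different lengths.
short = rnn_forward(random_one_hot(3, 50), U, V, b)
long = rnn_forward(random_one_hot(3, 200), U, V, b)
print(short.shape, long.shape)  # (3, 50, 2) (3, 200, 2)
```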

In [10]:
name = 'RNN-2-Seq_Flatten_Dense-2-softmax'

se_hPars['learning_rate'] = 0.1
se_hPars['softmax_temperature'] = 1
se_hPars['schedule'] = 'exp_decay'

rnn = RNN(2, sequences=True)

flatten = Flatten()

dense = Dense(2, softmax)

layers = [embedding, rnn, flatten, dense]

model = EpyNN(layers=layers, name=name)

We initialize the model.

In [11]:
model.initialize(loss='MSE', seed=1, metrics=['accuracy', 'recall', 'precision'], se_hPars=se_hPars.copy(), end='\r')

--- EpyNN Check OK! ---

We will only train for 5 epochs.

In [12]:
model.train(epochs=5, init_logs=False)

+-------+----------+----------+-------+----------+-------+--------+-------+-----------+-------+-------+----------------------------------------------+
| epoch |  lrate   |  lrate   |       | accuracy |       | recall |       | precision |       |  MSE  |                  Experiment                  |
|       |   RNN    |  Dense   |  (0)  |   (1)    |  (0)  |  (1)   |  (0)  |    (1)    |  (0)  |  (1)  |                                              |
+-------+----------+----------+-------+----------+-------+--------+-------+-----------+-------+-------+----------------------------------------------+
|   0   | 1.00e-01 | 1.00e-01 | 0.735 |  0.756   | 0.644 | 0.664  | 0.843

This time the network was able to fit the data to some extent. There is slight overfitting, but this is a nice result compared to the Feed-Forward architecture.

### GRU(sequences=True)-Flatten-Dense

Let's now try a more advanced recurrent architecture.

In [13]:
name = 'GRU-2-Seq_Flatten_Dense-2-softmax'

se_hPars['learning_rate'] = 0.1
se_hPars['softmax_temperature'] = 1

gru = GRU(2, sequences=True)

flatten = Flatten()

dense = Dense(2, softmax)

layers = [embedding, gru, flatten, dense]

model = EpyNN(layers=layers, name=name)

model.initialize(loss='MSE', seed=1, metrics=['accuracy', 'recall', 'precision'], se_hPars=se_hPars.copy(), end='\r')

model.train(epochs=5, init_logs=False)

+-------+----------+----------+-------+----------+-------+--------+-------+-----------+-------+-------+----------------------------------------------+
| epoch |  lrate   |  lrate   |       | accuracy |       | recall |       | precision |       |  MSE  |                  Experiment                  |
|       |   GRU    |  Dense   |  (0)  |   (1)    |  (0)  |  (1)   |  (0)  |    (1)    |  (0)  |  (1)  |                                              |
+-------+----------+----------+-------+----------+-------+--------+-------+-----------+-------+-------+----------------------------------------------+
|   0   | 1.00e-01 | 1.00e-01 | 0.727 |  0.684   | 0.924 | 0.942  | 0.690
