# Distinguish author-specific patterns in music

* Find this notebook at `EpyNN/nnlive/author_music/train.ipynb`.
* Regular Python code at `EpyNN/nnlive/author_music/train.py`.

In this notebook we will review:

* Handling univariate time series that represent a **huge amount of data points**.
* Taking advantage of recurrent architectures (RNN, GRU) over Feed-Forward architectures.
* Introducing recall and precision along with accuracy when dealing with unbalanced datasets.

Please see the following if you get lost:

* [Fully Connected (Dense)](../../Dense.html)
* [Recurrent Neural Network (RNN)](../../RNN.html)
* [Gated Recurrent Unit (GRU)](../../GRU.html)

## Environment and data

Follow [this link](prepare_dataset.ipynb) for details about data preparation.

Briefly, raw data are acoustic guitar music from the *True* author and the *False* author. These are raw ``.wav`` files that were resampled, clipped and digitized using a 4-bit encoder.

Commonly, music ``.wav`` files have a sampling rate of 44100 Hz. This means that each second of music represents a numerical time series of length 44100.

In [1]:
# EpyNN/nnlive/author_music/train.ipynb
# Standard library imports
import random

# Related third party imports
import numpy as np

# Local application/library specific imports
import nnlibs.initialize
from nnlibs.commons.maths import relu, softmax
from nnlibs.commons.library import (
    configure_directory,
    read_model,
)
from nnlibs.network.models import EpyNN
from nnlibs.embedding.models import Embedding
from nnlibs.rnn.models import RNN
from nnlibs.gru.models import GRU
from nnlibs.flatten.models import Flatten
from nnlibs.dropout.models import Dropout
from nnlibs.dense.models import Dense
from prepare_dataset import (
    prepare_dataset,
    download_music,
)
from settings import se_hPars


########################## CONFIGURE ##########################
random.seed(1)

np.set_printoptions(threshold=10)

np.seterr(all='warn')
np.seterr(under='ignore')

configure_directory()


############################ DATASET ##########################
download_music()

X_features, Y_label = prepare_dataset(N_SAMPLES=256)

Let's inspect.

In [2]:
print(len(X_features))
print(X_features[0].shape)
print(X_features[0])
print(np.min(X_features[0]), np.max(X_features[0]))

256
(10000,)
[10  7  7 ...  9  9  9]
1 15


We cut the original ``.wav`` files into 1-second clips and could thus retrieve ``256`` samples. We did so because the amount of available data is finite: to obtain more training examples, we need to split the recordings into short clips.

Other problems are discussed below:

* **Array size in memory**: One second represents 44100 data points per clip, and thus ``44100 * 256 = 11.2896e6`` data points in total. More than ten million data points are likely to overload your RAM or to raise a memory allocation error on most laptops. This is why we resampled the content of the original ``.wav`` files to 10000 Hz. In doing so, we lose the patterns associated with frequencies greater than 5000 Hz. Alternatively, we could have made clips of shorter duration, but we would then miss patterns associated with lower frequencies. Because the guitar emission spectrum lies essentially below 5000 Hz, we preferred the resampling method.
* **Signal normalization**: Original signals were sequences of 16-bit integers ranging from ``-32768`` to ``32767``. Feeding a neural network with such large values will most likely result in floating point errors. This is why we normalized the original data from each ``.wav`` file within the range \[0, 1\].
* **Signal digitization**: The original signal was a digital signal encoded over 16-bit integers, which results in a ``3e-5`` difference between consecutive levels after normalization within the range \[0, 1\]. Such small differences may be difficult for the network to evaluate, and training could become prohibitively slow. In the context of this notebook, we digitized the signal again using 4-bit integers ranging from ``0`` to ``15``, for a total of 16 values instead of 65536.
* **One-hot encoding**: To simplify the problem and focus on patterns, we will eliminate explicit amplitudes by one-hot encoding the univariate, 4-bit encoded time series.
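
As a rough illustration of the normalization and digitization steps listed above, the following sketch scales a 16-bit signal to \[0, 1\] and maps it onto 16 integer levels. The function name ``normalize_and_quantize`` and the random clip are hypothetical; this is not the code of ``prepare_dataset``, whose exact binning may differ.

```python
import numpy as np


def normalize_and_quantize(signal, n_bits=4):
    """Scale a 1D signal to [0, 1], then map it onto 2**n_bits integer levels."""
    signal = signal.astype(np.float64)
    span = signal.max() - signal.min()
    if span == 0:
        # A constant signal carries no amplitude information.
        return np.zeros(signal.shape, dtype=np.int64)
    normalized = (signal - signal.min()) / span
    n_levels = 2 ** n_bits
    # Floor onto integer levels; clip so that exactly 1.0 maps to the top level.
    return np.minimum((normalized * n_levels).astype(np.int64), n_levels - 1)


# Data points for 256 one-second clips, before and after resampling.
points_raw = 44100 * 256
points_resampled = 10000 * 256

# Hypothetical 16-bit clip: one second at 10000 Hz (random values, not music).
rng = np.random.default_rng(1)
clip = rng.integers(-32768, 32768, size=10000).astype(np.int16)

encoded = normalize_and_quantize(clip)
print(points_raw, points_resampled)
print(encoded.shape, encoded.min(), encoded.max())
```

Normalizing each clip by its own minimum and maximum mirrors the per-file normalization described above, as opposed to a global scaling over the whole dataset.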

With that said, we can go ahead.

## Feed-Forward (FF)

We start with our reference, a Feed-Forward network with dropout regularization.

### Embedding

We already scaled the input data for each ``.wav`` file, so we do not need to provide the ``X_scale`` argument to the class constructor of the *embedding* layer. Note that ``X_scale=True`` applies a global scaling over the whole training set, whereas here we work with independent ``.wav`` files, which should be normalized separately.

In [3]:
embedding = Embedding(X_data=X_features,
                      Y_data=Y_label,
                      X_encode=True,
                      Y_encode=True,
                      batch_size=16,
                      relative_size=(2, 1, 0))

Let's inspect the shape of the data.

In [4]:
print(embedding.dtrain.X.shape)
print(embedding.dtrain.b)

(171, 10000, 16)
{1: 71, 0: 100}


We note that we have an unbalanced dataset, with about 58% negative samples (100 out of 171).
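
To make the imbalance concrete, the short sketch below computes the class fractions from the counts printed above. It also shows the accuracy a trivial classifier would reach by always predicting the majority class, which is why accuracy alone can be misleading here. The variable names are illustrative only.

```python
# Class counts as printed by embedding.dtrain.b above.
balance = {1: 71, 0: 100}

total = sum(balance.values())
fractions = {label: count / total for label, count in balance.items()}

# Accuracy obtained by always predicting the most frequent label.
majority_baseline = max(fractions.values())

print(total)
print({label: round(frac, 3) for label, frac in fractions.items()})
print(round(majority_baseline, 3))
```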

### Flatten-(Dense)n with Dropout

Let's proceed with the network design and training.

In [5]:
name = 'Flatten_Dense-64-relu_Dense-2-softmax'

# se_hPars['learning_rate'] = 0.00001
# se_hPars['learning_rate'] = 1
se_hPars['learning_rate'] = 0.01
se_hPars['softmax_temperature'] = 1

layers = [
    embedding,
    Flatten(),
    Dense(64, relu),
    Dropout(0.5),
    Dense(2, softmax),
]

model = EpyNN(layers=layers, name=name)

We can initialize the model.

In [6]:
model.initialize(loss='MSE', seed=1, metrics=['accuracy', 'recall', 'precision'], se_hPars=se_hPars.copy(), end='\r')

--- EpyNN Check OK! ---

Train it for 10 epochs.

In [7]:
model.train(epochs=10, init_logs=False)

[1m[37mEpoch 9 - Batch 9/9 - Accuracy: 0.625 Cost: 0.38532 - TIME: 11.91s RATE: 1.20e+01e/s TTC: 0s        [0m

+-------+----------+----------+-------+----------+-------+--------+-------+-----------+-------+-------+--------------------------------------------------+
| epoch |  lrate   |  lrate   |       | accuracy |       | recall |       | precision |       |  MSE  |                    Experiment                    |
|       |  Dense   |  Dense   |  (0)  |   (1)    |  (0)  |  (1)   |  (0)  |    (1)    |  (0)  |  (1)  |                                                  |
+-------+----------+----------+-------+----------+-------+--------+-------+-----------+-------+-------+--------------------------------------------------+
|   0   | 1.00e-02

While the model could reproduce the training data with high fidelity, this is less true for the testing data, and we observe significant overfitting.

We can still comment on the **recall** and **precision** metrics:

* **Recall**: It represents *the fraction of positive instances retrieved by the model*.
* **Precision**: It represents *the fraction of truly positive instances among the samples predicted as positive*.

Said differently:

* Given **tp** the *true positive* samples.
* Given **tn** the *true negative* samples.
* Given **fp** the *false positive* samples.
* Given **fn** the *false negative* samples.
* Then **recall** = ``tp / (tp+fn)`` and **precision** = ``tp / (tp+fp)``.
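
The definitions above can be sketched with NumPy as follows. The helper ``recall_precision`` and the toy labels are hypothetical and only illustrate the formulas; they are not the metric implementation used by EpyNN.

```python
import numpy as np


def recall_precision(y_true, y_pred):
    """Return (recall, precision) for binary labels, with 1 as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
    fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
    # Guard against empty denominators.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return float(recall), float(precision)


# Toy unbalanced example: 3 positive and 5 negative samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
recall, precision = recall_precision(y_true, y_pred)
print(accuracy, recall, precision)
```

In this toy case the accuracy is 0.625 while only one of the three positive samples is retrieved, which illustrates why recall and precision are worth reporting for unbalanced datasets.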

## Recurrent Architectures

Recurrent architectures can make a difference here because they process time series one measurement at a time.

In a recurrent layer, the number of time steps **does not define the size of the parameter (weight/bias) arrays**, whereas in the Feed-Forward network it does.

For the *dense* layer, the shape of W is ``(n, u)``, with ``n`` the number of nodes in the previous layer and ``u`` the number of nodes in the current layer. When a *dense* layer follows the *embedding* and *flatten* layers, ``n`` is equal to the number of features, here the ``10000`` time steps multiplied by the ``16`` one-hot values per step.

By contrast, the shape of the parameters in the *RNN* layer depends on the number of cells and on the uni/multivariate nature of each measurement, but not on the number of time steps. In the previous situation, there are likely too many parameters and the computation does not converge.
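
The difference can be made concrete with a back-of-the-envelope parameter count. The sketch below assumes a standard simple RNN cell with an input-to-hidden matrix, a hidden-to-hidden matrix and a bias vector; the helper names are hypothetical and the exact parameter layout in EpyNN may differ.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs x n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


def simple_rnn_params(n_features, n_units):
    """Input-to-hidden, hidden-to-hidden and bias parameters (assumed layout)."""
    return n_features * n_units + n_units * n_units + n_units


time_steps, vocab_size = 10000, 16

# Feed-Forward: the first dense layer sees the flattened one-hot series.
ff_first_dense = dense_params(time_steps * vocab_size, 64)

# Recurrent: one cell, parameters independent of the number of time steps.
rnn_layer = simple_rnn_params(vocab_size, 1)

# With sequences=True and one cell, the flattened RNN output has one value per step.
dense_after_rnn = dense_params(time_steps * 1, 64)

print(ff_first_dense, rnn_layer, dense_after_rnn)
```

Under these assumptions, the first dense layer of the Feed-Forward network holds about ten million parameters, against a handful for the recurrent layer. Note that the dense layer placed after the flattened recurrent output still scales with the number of time steps, only 16 times less than before.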

### Embedding

For the embedding, we will one-hot encode the time series, and we know the *"vocabulary"* size will be 16 because we digitized over 16 bins.
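
For readers unfamiliar with one-hot encoding, here is a minimal NumPy sketch of what it does to a 4-bit encoded series. It illustrates the idea only and is not the code behind ``X_encode=True``; the short series is made up from values seen in the output above.

```python
import numpy as np

vocab_size = 16  # 4-bit encoding gives integer levels 0..15

# Hypothetical excerpt of a digitized series.
series = np.array([10, 7, 7, 9, 9, 9])

# Row i of the identity matrix is the one-hot vector for level i.
one_hot = np.eye(vocab_size, dtype=np.int64)[series]

print(one_hot.shape)
print(one_hot[0])

# The encoding is lossless: argmax recovers the original levels.
decoded = one_hot.argmax(axis=1)
print(decoded)
```

Each time step becomes a vector of length 16 with a single ``1``, which is why the embedded data has shape ``(samples, 10000, 16)``.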

In [8]:
embedding = Embedding(X_data=X_features,
                      Y_data=Y_label,
                      X_encode=True,
                      Y_encode=True,
                      batch_size=16,
                      relative_size=(2, 1, 0))

Let's inspect the data shape.

In [9]:
print(embedding.dtrain.X.shape)

(171, 10000, 16)


### RNN(sequences=True)-Flatten-(Dense)n with Dropout

Time to clarify a point:

* We have multivariate-like time series (one-hot encoded univariate series) with 10000 time steps.
* The **length of the sequence (10000) is unrelated to the number of cells in the RNN layer**. Whatever the number of cells, the whole sequence is processed.
* In recurrent layers, the shape of the parameters is related to the number of cells and to the vocabulary size, not to the length of the sequence. This is why such architectures can handle input sequences of variable length.
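
The points above can be illustrated with a minimal forward pass written in NumPy. This is a sketch of a generic simple RNN with a ``tanh`` activation, assumed here for illustration; the function ``rnn_forward`` and the parameter names ``U``, ``V`` and ``b`` are hypothetical and this is not the EpyNN implementation.

```python
import numpy as np


def rnn_forward(X, U, V, b):
    """Run a simple RNN over X of shape (steps, features).

    Returns the hidden state at every step, shape (steps, units).
    """
    h = np.zeros(V.shape[0])
    states = []
    for x_t in X:  # one measurement at a time
        h = np.tanh(x_t @ U + h @ V + b)
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(1)
n_features, n_units = 16, 1

# The parameter shapes depend on features and units only.
U = rng.normal(scale=0.1, size=(n_features, n_units))
V = rng.normal(scale=0.1, size=(n_units, n_units))
b = np.zeros(n_units)

# The same parameters process sequences of different lengths.
short_seq = np.eye(n_features)[rng.integers(0, n_features, size=50)]
long_seq = np.eye(n_features)[rng.integers(0, n_features, size=10000)]

short_states = rnn_forward(short_seq, U, V, b)
long_states = rnn_forward(long_seq, U, V, b)
print(short_states.shape, long_states.shape)
```

Returning the hidden state at every step corresponds to the idea behind ``sequences=True``: the output keeps one value per cell and per time step, which is then flattened before the dense layers.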


In [10]:
name = 'RNN-1-Seq_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax'

se_hPars['learning_rate'] = 0.01
se_hPars['softmax_temperature'] = 1

layers = [
    embedding,
    RNN(1, sequences=True),
    Flatten(),
    Dense(64, relu),
    Dropout(0.5),
    Dense(2, softmax),
]

model = EpyNN(layers=layers, name=name)

We initialize the model.

In [11]:
model.initialize(loss='MSE', seed=1, metrics=['accuracy', 'recall', 'precision'], se_hPars=se_hPars.copy(), end='\r')

--- EpyNN Check OK! ---

We will only train for 5 epochs.

In [12]:
model.train(epochs=5, init_logs=False)

[1m[37mEpoch 4 - Batch 9/9 - Accuracy: 1.0 Cost: 0.00205 - TIME: 33.98s RATE: 8.69e-01e/s TTC: 2s          [0m

+-------+----------+----------+----------+-------+----------+-------+--------+-------+-----------+-------+-------+----------------------------------------------------------------------+
| epoch |  lrate   |  lrate   |  lrate   |       | accuracy |       | recall |       | precision |       |  MSE  |                              Experiment                              |
|       |   RNN    |  Dense   |  Dense   |  (0)  |   (1)    |  (0)  |  (1)   |  (0)  |    (1)    |  (0)  |  (1)  |                                                                      |
+-------+----------+----------+----------+-------+----------+-------+--------+-------+-----------+-------+-------+----------------------------------------------------------------------+


### GRU(sequences=True)-Flatten-(Dense)n with Dropout

Let's now try a more elaborate recurrent architecture.

In [13]:
name = 'GRU-1-Seq_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax'

se_hPars['learning_rate'] = 0.01
se_hPars['softmax_temperature'] = 1

layers = [
    embedding,
    GRU(1, sequences=True),
    Flatten(),
    Dense(64, relu),
    Dropout(0.5),
    Dense(2, softmax),
]

model = EpyNN(layers=layers, name=name)

model.initialize(loss='MSE', seed=1, metrics=['accuracy', 'recall', 'precision'], se_hPars=se_hPars.copy(), end='\r')

model.train(epochs=3, init_logs=False)

[1m[37mEpoch 2 - Batch 9/9 - Accuracy: 0.688 Cost: 0.21652 - TIME: 76.07s RATE: 1.35e-01e/s TTC: 15s       [0m

+-------+----------+----------+----------+-------+----------+-------+--------+-------+-----------+-------+-------+----------------------------------------------------------------------+
| epoch |  lrate   |  lrate   |  lrate   |       | accuracy |       | recall |       | precision |       |  MSE  |                              Experiment                              |
|       |   GRU    |  Dense   |  Dense   |  (0)  |   (1)    |  (0)  |  (1)   |  (0)  |    (1)    |  (0)  |  (1)  |                                                                      |
+-------+----------+----------+----------+-------+----------+-------+--------+-------+-----------+-------+-------+----------------------------------------------------------------------+


## Write, Read & Predict

In [14]:
### Write/read model

model.write()

model = read_model()


### Predict

X_features, _ = prepare_dataset(N_SAMPLES=10)

dset = model.predict(X_features, X_encode=True)

for n, pred, probs in zip(dset.ids, dset.P, dset.A):
    print(n, pred, probs)

[1m[32mMake: /media/synthase/beta/EpyNN/nnlive/author_music/models/1631096326_GRU-1-Seq_Flatten_Dense-64-relu_Dropout05_Dense-2-softmax.pickle[0m
0 0 [0.53466267 0.46533733]
1 1 [0.30410443 0.69589557]
2 0 [0.63609356 0.36390644]
3 0 [0.76139002 0.23860998]
4 0 [0.69681818 0.30318182]
5 0 [0.52317497 0.47682503]
6 1 [0.40630025 0.59369975]
7 0 [0.62075473 0.37924527]
8 1 [0.37191097 0.62808903]
9 1 [0.33150543 0.66849457]
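
To read this output, note that each row shows the sample id, the predicted label and the two output probabilities. The sketch below assumes that the predicted label is simply the index of the highest probability, which is consistent with the rows printed above; it re-derives the labels from the copied probabilities with NumPy.

```python
import numpy as np

# Output probabilities copied from the prediction printout above.
probs = np.array([
    [0.53466267, 0.46533733],
    [0.30410443, 0.69589557],
    [0.63609356, 0.36390644],
    [0.76139002, 0.23860998],
    [0.69681818, 0.30318182],
    [0.52317497, 0.47682503],
    [0.40630025, 0.59369975],
    [0.62075473, 0.37924527],
    [0.37191097, 0.62808903],
    [0.33150543, 0.66849457],
])

# Assumed decision rule: pick the label with the highest probability.
preds = probs.argmax(axis=1)
print(preds)

# Softmax outputs sum to one for each sample.
print(np.allclose(probs.sum(axis=1), 1.0))
```

Several probabilities are close to 0.5, so the corresponding decisions should be read with caution.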
