# Experiments using the various handpicked features

This notebook shows you how to train a CNN on various handpicked features of the IRMAS tracks. You can find the IRMAS dataset [here](http://www.mtg.upf.edu/download/datasets/irmas/).

The features used are:

- Spectral Centroid
- Spectral Bandwidth
- Spectral Rolloff
- Zero-crossing rate
- RMSE
- MFCC

Librosa was used to extract all of these features.

Every track in the trainset is 3 seconds long. Using the Librosa function calls
``` python
librosa.feature.spectral_centroid()
librosa.feature.spectral_bandwidth()
librosa.feature.spectral_rolloff()
librosa.feature.zero_crossing_rate()
librosa.feature.rmse()  # renamed to librosa.feature.rms() in librosa 0.7
librosa.feature.mfcc()
```
we get a (25, 130) array for each track. Similarly, for the testset, we split the features of every track so that their dimensions match the (25, 130) arrays from the trainset.
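
The (25, 130) shape can be reproduced with a little arithmetic. The sketch below assumes librosa's default settings (a sample rate of 22050 Hz, a hop length of 512 samples, centered frames and 20 MFCC coefficients); the preprocessor in this repository may well use different values, so treat it as an illustration rather than a description of `DatasetPreprocess`.

```python
# Assumed librosa defaults; not taken from DatasetPreprocess itself.
SAMPLE_RATE = 22050   # Hz
HOP_LENGTH = 512      # samples between consecutive frames
N_MFCC = 20           # default number of MFCC coefficients
TRACK_SECONDS = 3

# One row each for centroid, bandwidth, rolloff, zero-crossing rate and RMSE.
n_scalar_features = 5
n_rows = n_scalar_features + N_MFCC

# With centered frames the number of frames is 1 + n_samples // hop_length.
n_samples = TRACK_SECONDS * SAMPLE_RATE
n_frames = 1 + n_samples // HOP_LENGTH

feature_shape = (n_rows, n_frames)
print(feature_shape)  # (25, 130)
```

Under these assumptions, five single-row features stacked on top of 20 MFCC rows give the 25 rows, and a 3-second clip gives 130 frames.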

In [None]:
# Import the dataset preprocessor
from DatasetPreprocess import DatasetPreprocessor

dp = DatasetPreprocessor('handpicked')
dp.generateTrain() # This will create a .h5 file containing the trainset
dp.generateTest() # This will create a .h5 file containing the testset

The repository contains two different CNN architectures. For this experiment we are using the VGG-16 model. The architecture contains several ReLU-activated convolutional layers and three fully-connected (dense) layers. Take a look at the paper [here](https://arxiv.org/abs/1409.1556).

Let's import the model and start training. For the sake of simplicity we are going to work with only 3 instruments: Flute, Electric Guitar and Piano.

In [1]:
import tensorflow as tf
import numpy as np
import h5py
from sklearn.model_selection import train_test_split
from models import vgg16_model
import os

In [2]:
# Open dataset
keys = ['flu', 'gel', 'pia'] # The keys of the 3 instruments to be used
dataset = h5py.File('train_handpicked_normalized.h5', 'r')
vector_size = dataset.attrs['vector_size']
num_of_labels = len(keys)
num_of_tracks = sum([dataset[x].shape[0] for x in keys])

Let's create two arrays for our examples. One of them should contain the features and the other the labels in one-hot representation.

In [3]:
# Prepare data for training and testing
features = np.zeros((num_of_tracks, vector_size[0], vector_size[1]), dtype=np.float32)
labels = np.zeros((num_of_tracks, len(keys)), dtype=np.float32)

i = 0
for ki, k in enumerate(keys):
    features[i:i + len(dataset[k])] = np.nan_to_num(dataset[k])
    labels[i:i + len(dataset[k]), ki] = 1
    i += len(dataset[k])

print(features.shape)
print(labels.shape)

(1932, 25, 130)
(1932, 3)


Let's train and evaluate the model on the trainset to see how it performs on one-instrument tracks. We will later do the same for multi-instrument songs.

In [4]:
# Split trainset to train and evaluation
X_train, X_eval, y_train, y_eval = train_test_split(features, labels, test_size=0.1, random_state=1337)
print(X_train.shape)
print(X_eval.shape)

(1738, 25, 130)
(194, 25, 130)
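
The split sizes follow from how scikit-learn rounds a fractional `test_size`: the evaluation set gets the ceiling of `test_size * n_samples`, and when no `train_size` is given the train set gets the rest. A small check of that arithmetic, using only the numbers printed above:

```python
import math

n_samples = 1932   # total number of tracks for the three instruments
test_size = 0.1

n_eval = math.ceil(test_size * n_samples)   # 193.2 rounds up
n_train = n_samples - n_eval

print(n_train, n_eval)  # 1738 194
```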


Time to add our model. We are using the TensorFlow 1.x high-level API with `tf.layers` and `tf.estimator`. It resembles Keras. More information [here](https://www.tensorflow.org/programmers_guide/#high_level_apis).

In [5]:
saved_model_path = os.getcwd() + '/models/vgg-handpicked-{}'.format(','.join(keys))
print(saved_model_path)

classifier = tf.estimator.Estimator(model_fn=vgg16_model, model_dir=saved_model_path)
train_input_fn = tf.estimator.inputs.numpy_input_fn(x=X_train, y=y_train, batch_size=10, num_epochs=None, shuffle=True)
classifier.train(input_fn=train_input_fn, steps=4000)

/home/odysseas/Documents/irmas-cnn/models/vgg-handpicked-flu,gel,pia
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_global_id_in_cluster': 0, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f62588c1590>, '_evaluation_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_model_dir': '/home/odysseas/Documents/irmas-cnn/models/vgg-handpicked-flu,gel,pia', '_save_summary_steps': 100}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tens

<tensorflow.python.estimator.estimator.Estimator at 0x7f62588c1390>
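
With `num_epochs=None` the input function repeats the data indefinitely, so the length of training is set by `steps` alone. As a rough sketch of what 4000 steps means for this split (all numbers come from the cells above):

```python
batch_size = 10
steps = 4000
n_train = 1738  # size of X_train printed earlier

examples_seen = batch_size * steps
approx_epochs = examples_seen / n_train

print(examples_seen)            # 40000
print(round(approx_epochs, 1))  # 23.0
```

In other words, the network sees each training example roughly 23 times.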

Training took a long time because the model struggled to fit in my GPU memory and had to run on the CPU. However, after 4000 steps, our model is trained. Let's see how it went by evaluating on the held-out part of the trainset.

In [6]:
eval_input_fn = tf.estimator.inputs.numpy_input_fn(x=X_eval,y=y_eval,num_epochs=1,shuffle=False)
eval_results = classifier.evaluate(input_fn=eval_input_fn)
print(eval_results)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-06-14-16:12:47
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/odysseas/Documents/irmas-cnn/models/vgg-handpicked-flu,gel,pia/model.ckpt-4000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-06-14-16:12:50
INFO:tensorflow:Saving dict for global step 4000: accuracy = 0.7061856, global_step = 4000, loss = 0.7089133
{'loss': 0.7089133, 'global_step': 4000, 'accuracy': 0.7061856}


This does not seem too bad for a first try...
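
To put the number in context, the reported accuracy can be turned back into a count of correct predictions and compared with the chance level for three classes. This is only a sanity check on the printed metrics; it ignores any class imbalance between the three instruments.

```python
accuracy = 0.7061856   # from eval_results above
n_eval = 194           # size of X_eval
num_classes = 3        # flu, gel, pia

n_correct = round(accuracy * n_eval)
chance_level = 1 / num_classes

print(n_correct)               # 137
print(round(chance_level, 3))  # 0.333
```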

Now let's try to detect the primary instrument of a song using the same network. Time to use our testset.

First, we need to load the data.

In [9]:
dataset = h5py.File("test_handpicked_normalized.h5", 'r')
instruments = dataset.attrs['instruments']
vector_size = dataset.attrs['vector_size']

# Load the test features and labels
features = np.array(dataset['features'])
labels = np.array(dataset['labels'])

# Keep only samples whose labels include at least one of the instruments in `keys`
key_indices = [np.where(instruments == x)[0][0] for x in keys]
example_indices = np.array([])
for ind in key_indices:
    tmp = np.argwhere(labels[:,ind] == True).flatten()
    example_indices = np.union1d(example_indices, tmp).astype(np.int32)

features = features[example_indices].astype(np.float32)
# Keep the selected rows and only the columns of the chosen instruments.
# np.int was removed from recent NumPy releases, so the builtin int is used instead.
labels = labels[np.ix_(example_indices, key_indices)].astype(int)
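
The filtering step is easier to follow on a toy example. The sketch below uses a made-up label matrix and instrument list (not the real testset) to show the two operations: keep the rows where at least one of the chosen instruments is present, then keep only the columns of those instruments.

```python
import numpy as np

# Hypothetical data: 4 instruments, 5 tracks, multi-hot labels.
instruments = np.array(['cel', 'flu', 'gel', 'pia'])
labels = np.array([
    [1, 0, 0, 0],   # only 'cel'
    [0, 1, 0, 0],   # 'flu'
    [0, 0, 1, 1],   # 'gel' and 'pia'
    [1, 0, 0, 0],   # only 'cel'
    [0, 0, 0, 1],   # 'pia'
], dtype=bool)
keys = ['flu', 'gel', 'pia']

# Column index of every chosen instrument.
key_indices = [int(np.where(instruments == k)[0][0]) for k in keys]

# Rows in which at least one of the chosen instruments is active.
example_indices = np.flatnonzero(labels[:, key_indices].any(axis=1))

# Keep those rows and only the chosen columns.
subset = labels[np.ix_(example_indices, key_indices)].astype(int)

print(key_indices)       # [1, 2, 3]
print(example_indices)   # [1 2 4]
print(subset)
```

Note that a track such as the third one keeps two active labels, so the test labels are multi-hot rather than strictly one-hot.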

Now use the classifier in evaluation mode.

In [10]:
eval_input_fn = tf.estimator.inputs.numpy_input_fn(x=features, y=labels, num_epochs=1, shuffle=False)
eval_results = classifier.evaluate(input_fn=eval_input_fn)
print(eval_results)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-06-14-16:13:07
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /home/odysseas/Documents/irmas-cnn/models/vgg-handpicked-flu,gel,pia/model.ckpt-4000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-06-14-16:13:45
INFO:tensorflow:Saving dict for global step 4000: accuracy = 0.62705797, global_step = 4000, loss = 1.2474327
{'loss': 1.2474327, 'global_step': 4000, 'accuracy': 0.62705797}


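This time the accuracy is higher than in the experiment using the YOLO-like model. Still, it drops from about 0.71 on the single-instrument evaluation split to about 0.63 on the testset, which shows that classifying solo instruments is much easier than detecting instruments in a full track.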