Part of this code was adapted from the solutions to the exercises in the book 'Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow'. The goal of this notebook is to walk through a possible solution to the exercise.

In [1]:
#import the necessary libraries
import tensorflow as tf
from tensorflow import keras
from pathlib import Path
import pandas as pd
import numpy as np
from IPython.display import Audio
from IPython.display import display

The project requires us to download the Bach chorales dataset and then to train a model that would predict the next note given an input sequence. We start by downloading the data set and unzipping the files.

Once the files are downloaded, each one is read with the pandas read_csv() function. There is one CSV file per chorale, so the load_chorales() function returns a list with one entry per file.

In [3]:
DOWNLOAD_ROOT = "https://github.com/ageron/handson-ml2/raw/master/datasets/jsb_chorales/"
FILENAME = "jsb_chorales.tgz"
filepath = keras.utils.get_file(FILENAME, DOWNLOAD_ROOT + FILENAME, cache_subdir='datasets/jsb_chorales', extract=True)
jsb_chorales_dir = Path(filepath).parent
train_files = sorted(jsb_chorales_dir.glob("train/chorale_*.csv"))
valid_files = sorted(jsb_chorales_dir.glob("valid/chorale_*.csv"))
test_files = sorted(jsb_chorales_dir.glob("test/chorale_*.csv"))

def load_chorales(filepaths):
    return [pd.read_csv(filepath).values.tolist() for filepath in filepaths]

train_chorales = load_chorales(train_files)
valid_chorales = load_chorales(valid_files)
test_chorales = load_chorales(test_files)

Let us have a look at the first element in the list:

In [4]:
train_chorales[0]

[[74, 70, 65, 58],
 [74, 70, 65, 58],
 [74, 70, 65, 58],
 [74, 70, 65, 58],
 [75, 70, 58, 55],
 [75, 70, 58, 55],
 [75, 70, 60, 55],
 [75, 70, 60, 55],
 [77, 69, 62, 50],
 [77, 69, 62, 50],
 [77, 69, 62, 50],
 [77, 69, 62, 50],
 [77, 70, 62, 55],
 [77, 70, 62, 55],
 [77, 69, 62, 55],
 [77, 69, 62, 55],
 [75, 67, 63, 48],
 [75, 67, 63, 48],
 [75, 69, 63, 48],
 [75, 69, 63, 48],
 [74, 70, 65, 46],
 [74, 70, 65, 46],
 [74, 70, 65, 46],
 [74, 70, 65, 46],
 [72, 69, 65, 53],
 [72, 69, 65, 53],
 [72, 69, 65, 53],
 [72, 69, 65, 53],
 [72, 69, 65, 53],
 [72, 69, 65, 53],
 [72, 69, 65, 53],
 [72, 69, 65, 53],
 [74, 70, 65, 46],
 [74, 70, 65, 46],
 [74, 70, 65, 46],
 [74, 70, 65, 46],
 [75, 69, 63, 48],
 [75, 69, 63, 48],
 [75, 67, 63, 48],
 [75, 67, 63, 48],
 [77, 65, 62, 50],
 [77, 65, 62, 50],
 [77, 65, 60, 50],
 [77, 65, 60, 50],
 [74, 67, 58, 55],
 [74, 67, 58, 55],
 [74, 67, 58, 53],
 [74, 67, 58, 53],
 [72, 67, 58, 51],
 [72, 67, 58, 51],
 [72, 67, 58, 51],
 [72, 67, 58, 51],
 [72, 65, 57

We see that the first element is itself a list of lists, and that each innermost list holds four integers. Each integer is a note's index. Ultimately, the model must be able to predict the notes that follow a given input sequence.

The above output shows just the first list in the training set. How many lists are there?

In [5]:
len(train_chorales)

229

So there are 229 lists in total, each corresponding to one chorale composed by Bach. Each chorale is in turn a list of time steps, and each time step is a list of four integers. In other words, we have three levels of nesting.
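
To make the three levels of nesting concrete, here is a minimal sketch on a made-up sample. The name toy_chorales and its values are hypothetical stand-ins for train_chorales, chosen only to show that chorales may differ in length while every time step holds four notes.

```python
# Hypothetical stand-in for train_chorales: two chorales of different lengths.
toy_chorales = [
    [[74, 70, 65, 58], [74, 70, 65, 58], [75, 70, 58, 55]],  # chorale 0: 3 time steps
    [[69, 64, 61, 57], [71, 64, 59, 56]],                    # chorale 1: 2 time steps
]

# Level 1: how many chorales there are.
num_chorales = len(toy_chorales)

# Level 2: how many time steps each chorale has.
steps_per_chorale = [len(chorale) for chorale in toy_chorales]

# Level 3: how many notes each time step holds (collected as a set of sizes).
notes_per_step = {len(chord) for chorale in toy_chorales for chord in chorale}

print(num_chorales, steps_per_chorale, notes_per_step)
```

Running the same three expressions on train_chorales is a quick way to check the real data before building the input pipeline.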

Before we go further, let us calculate the number of unique notes, the lowest note and the highest note:

In [40]:
notes = set()
for chorales in (train_chorales, valid_chorales, test_chorales):
    for chorale in chorales:
        for chord in chorale:
            notes |= set(chord)
number_of_notes = len(notes)
min_note = min(notes - {0})
max_note = max(notes)
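
As a small illustration of the cell above, the sketch below runs the same set-based computation on a made-up sample (toy_chorales is hypothetical). It assumes that 0 marks a voice that is silent at that time step, which would explain why 0 is excluded when looking for the lowest note.

```python
# Hypothetical sample; the 0 in the first chorale is assumed to mean "no note played".
toy_chorales = [
    [[74, 70, 65, 58], [75, 70, 58, 0]],
    [[69, 64, 61, 57], [71, 64, 59, 56]],
]

notes = set()
for chorale in toy_chorales:
    for chord in chorale:
        notes |= set(chord)          # add every note of this chord to the set

number_of_notes = len(notes)         # counts 0 as one of the possible values
min_note = min(notes - {0})          # lowest real note, ignoring the 0 marker
max_note = max(notes)

print(number_of_notes, min_note, max_note)
```

Note that number_of_notes includes the value 0, so it is the size of the vocabulary the model has to choose from, not just the number of distinct pitches.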

I like to visualise the data at every step of data preparation because it helps me understand what is happening. The training set is too big to be visualised, so I will create a small sample that includes only the first two chorales, and from each chorale I will take only the first six time steps:

In [6]:
first_two_lists = train_chorales[0:2]
first_two_lists_subset = [lst[0:6] for lst in first_two_lists]
first_two_lists_subset

[[[74, 70, 65, 58],
  [74, 70, 65, 58],
  [74, 70, 65, 58],
  [74, 70, 65, 58],
  [75, 70, 58, 55],
  [75, 70, 58, 55]],
 [[69, 64, 61, 57],
  [69, 64, 61, 57],
  [69, 64, 61, 57],
  [69, 64, 61, 57],
  [71, 64, 59, 56],
  [71, 64, 59, 56]]]

This data set is much more manageable to inspect. We again have three nested lists. The two outer lists represent two chorales. Inside each of them we have six lists of four notes (the first six time steps of the corresponding chorale in the original training set).

So what do we want to do? Since we are dealing with sequences, we would like to train some type of RNN. We also need to prepare the training data by dividing it into an input and an output. The input will be a sequence of notes, and the output will be the notes that follow.

We have several options here. First, we can train a model that will predict the output given a certain number of inputs. In other words, we divide the data set into windows of length l, and for each input of length l, we will have an output of four notes.

However, instead of having a model that predicts the next four notes, why not have a model that predicts only the next note? We then make predictions one note at a time. This decreases the chance of the algorithm producing four notes at once that do not go well together. Therefore, we will follow this approach. This means that we are going to treat the notes as one long series of individual notes, not as groups of four.
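
Treating the notes as one long series simply means flattening the chords. A minimal plain-Python sketch, using the first chords of the first training chorale shown earlier (the variable names are illustrative only):

```python
from itertools import chain

# Two consecutive time steps, four notes each.
chords = [[74, 70, 65, 58], [75, 70, 58, 55]]

# Flatten them into a single stream of individual notes, keeping the order.
note_stream = list(chain.from_iterable(chords))

print(note_stream)
```

The model then sees the four notes of a time step one after the other, so each note can be predicted from the notes already chosen for that same time step.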

Another decision we have to make is whether to fit a sequence-to-vector model or a sequence-to-sequence model. In a sequence-to-vector approach, we train the model to predict the next note only at the very last time step. In a sequence-to-sequence approach, we train the model to predict the next note at each and every time step. The sequence-to-sequence approach gives the model an error signal at every time step instead of only at the last one, so we can expect it to train better in this case.
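
The difference between the two approaches is easiest to see in the targets. The sketch below is a plain-Python illustration on a short made-up window of notes; the variable names are hypothetical and are not used elsewhere in this notebook.

```python
# A short window of consecutive notes.
window = [74, 70, 65, 58, 75]

# In both approaches the model reads everything except the last note.
inputs = window[:-1]

# Sequence-to-vector: a single target, the note after the last input.
seq_to_vector_target = window[-1]

# Sequence-to-sequence: one target per time step, the input shifted by one.
seq_to_seq_targets = window[1:]

print(inputs, seq_to_vector_target, seq_to_seq_targets)
```

With four input notes, the sequence-to-vector model gets one target per window, while the sequence-to-sequence model gets four.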

So how do we train a model using the sequence-to-sequence approach? We simply need to train the model to predict one time step into the future. In other words, the output will be the input shifted by one time step. To do that, we will use TensorFlow's tf.data API. We will perform the operations on the small sample first so that we can visualise what is happening along the way. Once we are sure that the data is being preprocessed properly, we will perform the same operations on the train, valid, and test data sets.


In [37]:
sample_data = tf.ragged.constant(first_two_lists_subset, ragged_rank=1)
sample_data = tf.data.Dataset.from_tensor_slices(sample_data)

The code above converts the sample data to a tf.data.Dataset. The first line builds a ragged tensor, that is, a tensor whose rows may have different lengths. Since each chorale has a different length, the full data set cannot be stored in a regular tensor, so we go through a ragged tensor first (in this small sample both chorales happen to have six chords). The second line then creates a dataset with one element per chorale. Let us look at the result:

In [38]:
for element in sample_data:
    print(element)

tf.Tensor(
[[74 70 65 58]
 [74 70 65 58]
 [74 70 65 58]
 [74 70 65 58]
 [75 70 58 55]
 [75 70 58 55]], shape=(6, 4), dtype=int32)
tf.Tensor(
[[69 64 61 57]
 [69 64 61 57]
 [69 64 61 57]
 [69 64 61 57]
 [71 64 59 56]
 [71 64 59 56]], shape=(6, 4), dtype=int32)


Notice that the two outer lists were converted to two tensors, one per chorale. Next, we turn each chorale into its own dataset of chords with from_tensor_slices(), so that we can apply the window() method and cut it into overlapping windows, from which we will later create the inputs and the outputs. The window() method returns a dataset of windows in which each window is itself a dataset, which means that the result is nested. We therefore use flat_map() to flatten it, calling batch() on each window with a batch size equal to the window length (window_size + 1) so that every window comes out as a single tensor:

In [39]:
window_size = 2
shift = 1
sample_data = sample_data.map(lambda row: tf.data.Dataset.from_tensor_slices(row).window(window_size + 1, shift=shift, drop_remainder=True))
sample_data = sample_data.flat_map(lambda row: row.flat_map(lambda window: window.batch(window_size + 1)))
for element in sample_data:
    print(element)

tf.Tensor(
[[74 70 65 58]
 [74 70 65 58]
 [74 70 65 58]], shape=(3, 4), dtype=int32)
tf.Tensor(
[[74 70 65 58]
 [74 70 65 58]
 [74 70 65 58]], shape=(3, 4), dtype=int32)
tf.Tensor(
[[74 70 65 58]
 [74 70 65 58]
 [75 70 58 55]], shape=(3, 4), dtype=int32)
tf.Tensor(
[[74 70 65 58]
 [75 70 58 55]
 [75 70 58 55]], shape=(3, 4), dtype=int32)
tf.Tensor(
[[69 64 61 57]
 [69 64 61 57]
 [69 64 61 57]], shape=(3, 4), dtype=int32)
tf.Tensor(
[[69 64 61 57]
 [69 64 61 57]
 [69 64 61 57]], shape=(3, 4), dtype=int32)
tf.Tensor(
[[69 64 61 57]
 [69 64 61 57]
 [71 64 59 56]], shape=(3, 4), dtype=int32)
tf.Tensor(
[[69 64 61 57]
 [71 64 59 56]
 [71 64 59 56]], shape=(3, 4), dtype=int32)


If you look at the output, you will see that instead of just two tensors we now have a series of them, where each is made up of three chords, because we chose a window length of 2 + 1. We also notice that as we move down, the window shifts one step at a time, since we set the shift parameter to 1. The nice thing here is that the windows do not roll over from one chorale to the next. In other words, no window contains notes from both chorales; each chorale starts with a fresh window. This is exactly how we want it.
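
The windowing behaviour can be mimicked in plain Python, which may help when reasoning about it. The helper sliding_windows below is a hypothetical stand-in written for this illustration; it is not part of tf.data, and it only imitates window(..., drop_remainder=True) applied to each chorale separately.

```python
def sliding_windows(chorale, window_length, shift=1):
    """Return every full window of `window_length` consecutive chords."""
    last_start = len(chorale) - window_length
    # Incomplete windows at the end are dropped, as with drop_remainder=True.
    return [chorale[start:start + window_length]
            for start in range(0, last_start + 1, shift)]

# The same sample as above: two chorales, six chords each.
sample = [
    [[74, 70, 65, 58]] * 4 + [[75, 70, 58, 55]] * 2,
    [[69, 64, 61, 57]] * 4 + [[71, 64, 59, 56]] * 2,
]

# Window each chorale on its own, then concatenate the results.
windows = [w for chorale in sample for w in sliding_windows(chorale, 3)]

print(len(windows))
print(windows[3])  # last window of the first chorale
print(windows[4])  # first window of the second chorale
```

Because each chorale is windowed independently, six chords give 6 - 3 + 1 = 4 windows per chorale, and the fifth window starts again at the beginning of the second chorale.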

The next step is to convert these windows to one-dimensional arrays so that the input becomes simply a series of numbers. We also want to shift the values of the notes so that they form a contiguous range starting at 1, while the value 0 is left unchanged. The reason is that in the model creation section I plan to use an Embedding layer, and the integers fed to that layer must be smaller than its input dimension:

In [41]:
# Keep each lambda on its own line, with a unique argument name, so that AutoGraph can parse it
sample_data = sample_data.map(lambda chords: tf.where(chords == 0, chords, chords - min_note + 1))
sample_data = sample_data.map(lambda shifted: tf.reshape(shifted, [-1]))
for window in sample_data:
    print([elem.numpy() for elem in window])

Each window is now a flat sequence of numbers. The numbers differ from the original notes because we have shifted them. Each sequence contains 12 numbers, because we created windows of three chords where each chord has four notes. We are now ready to divide the data into input and output. Remember that we are going to use the sequence-to-sequence approach. This means that the output is simply the input shifted by one:

In [42]:
sample_data = sample_data.map(lambda window: (window[:-1], window[1:]))
for X, y in sample_data:
    print("Input:", X.numpy(), "Target:", y.numpy())

Input: [39 35 30 23 39 35 30 23 39 35 30] Target: [35 30 23 39 35 30 23 39 35 30 23]
Input: [39 35 30 23 39 35 30 23 39 35 30] Target: [35 30 23 39 35 30 23 39 35 30 23]
Input: [39 35 30 23 39 35 30 23 40 35 23] Target: [35 30 23 39 35 30 23 40 35 23 20]
Input: [39 35 30 23 40 35 23 20 40 35 23] Target: [35 30 23 40 35 23 20 40 35 23 20]
Input: [34 29 26 22 34 29 26 22 34 29 26] Target: [29 26 22 34 29 26 22 34 29 26 22]
Input: [34 29 26 22 34 29 26 22 34 29 26] Target: [29 26 22 34 29 26 22 34 29 26 22]
Input: [34 29 26 22 34 29 26 22 36 29 24] Target: [29 26 22 34 29 26 22 36 29 24 21]
Input: [34 29 26 22 36 29 24 21 36 29 24] Target: [29 26 22 36 29 24 21 36 29 24 21]
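
As a sanity check, the third line of the output above can be reproduced by hand in plain Python. The sketch assumes min_note = 36, a value inferred from the printed numbers (74 becomes 39) rather than shown anywhere in this notebook.

```python
min_note = 36  # assumption: inferred from the output, where note 74 is mapped to 39

# Third window of the first chorale: three chords of four notes.
window = [[74, 70, 65, 58], [74, 70, 65, 58], [75, 70, 58, 55]]

# Shift every note so that the lowest note becomes 1; leave 0 untouched.
shifted = [[note - min_note + 1 if note != 0 else 0 for note in chord]
           for chord in window]

# Flatten the chords into one sequence of 12 notes.
flat = [note for chord in shifted for note in chord]

# Input is everything but the last note; target is the input shifted by one.
X, y = flat[:-1], flat[1:]

print("Input:", X)
print("Target:", y)
```

Both sequences have 11 elements, and each target element is the note that follows the input element at the same position.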


Everything looks good. We now need to process the train, valid, and test data sets using these commands. We can create a function in order to avoid repetition:

In [46]:
def prepare_data(to_process):
    dataset = tf.ragged.constant(to_process, ragged_rank=1)
    dataset = tf.data.Dataset.from_tensor_slices(dataset)
    # we use a window_size of 32 since we are now using the whole data and not just experimenting with a small subset
    window_size = 32
    shift = 1
    dataset = dataset.map(lambda row: tf.data.Dataset.from_tensor_slices(row).window(window_size + 1, shift=shift, drop_remainder=True))
    dataset = dataset.flat_map(lambda row: row.flat_map(lambda window: window.batch(window_size + 1)))
    # one lambda per line, each with a unique argument name, so that AutoGraph can parse them
    dataset = dataset.map(lambda chords: tf.where(chords == 0, chords, chords - min_note + 1))
    dataset = dataset.map(lambda shifted: tf.reshape(shifted, [-1]))
    dataset = dataset.map(lambda window: (window[:-1], window[1:]))
    dataset = dataset.batch(32)
    return dataset

train_set = prepare_data(train_chorales)
valid_set = prepare_data(valid_chorales)
test_set = prepare_data(test_chorales)

We now create the model. We will use an Embedding layer, since it makes more sense to treat the note numbers as categorical data and not as numerical data. Each note will be projected onto 6 dimensions. We will then add two LSTM layers, each with 64 neurons. We will also set the dropout parameters to 0.2 as a form of regularization. Finally, we will wrap the output layer in a TimeDistributed layer in order to train the model using the sequence-to-sequence approach. The output layer will contain one node for each note. A softmax activation will be used to calculate the probability of each of these notes.

In [47]:
model = keras.models.Sequential([
    keras.layers.Embedding(number_of_notes, 6, input_shape=[None]),
    keras.layers.LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    keras.layers.LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed( keras.layers.Dense(number_of_notes, activation='softmax'))
])

model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 6)           282       
                                                                 
 lstm_2 (LSTM)               (None, None, 64)          18176     
                                                                 
 lstm_3 (LSTM)               (None, None, 64)          33024     
                                                                 
 time_distributed_1 (TimeDis  (None, None, 47)         3055      
 tributed)                                                       
                                                                 
Total params: 54,537
Trainable params: 54,537
Non-trainable params: 0
_________________________________________________________________
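
The parameter counts in the summary above can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper lstm_params is written just for this illustration.

```python
vocab_size = 47      # number_of_notes, as shown by the 47 output units in the summary
embedding_dim = 6
lstm_units = 64

# Embedding: one vector of size embedding_dim per possible note value.
embedding_params = vocab_size * embedding_dim

def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

lstm1_params = lstm_params(embedding_dim, lstm_units)
lstm2_params = lstm_params(lstm_units, lstm_units)

# Dense output layer: weights plus one bias per output unit.
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm1_params + lstm2_params + dense_params
print(embedding_params, lstm1_params, lstm2_params, dense_params, total_params)
```

The TimeDistributed wrapper adds no parameters of its own, since the same Dense layer is applied at every time step.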


In [48]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_set, epochs=20, validation_data=valid_set)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x20bd8215210>

We see that the accuracy on the valid_set is 77.82%. We can now use the model to find the accuracy on the test set, which the model was not exposed to during training:

In [49]:
model.evaluate(test_set)



[0.8496134281158447, 0.7608373165130615]
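
The evaluate() method returns the loss followed by the metrics passed to compile(), so the two numbers above are the test loss and the test accuracy. To show what they measure, here is a minimal plain-Python sketch that computes sparse categorical cross-entropy and accuracy on a few made-up predictions. The probabilities and targets are hypothetical and unrelated to the model above.

```python
import math

# Hypothetical predicted probabilities over 3 classes for 3 time steps.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]
targets = [0, 1, 0]  # index of the correct class at each time step

# Sparse categorical cross-entropy: mean negative log-probability of the true class.
loss = -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

# Accuracy: fraction of time steps where the most probable class is the true one.
predictions = [max(range(len(p)), key=p.__getitem__) for p in probs]
accuracy = sum(pred == t for pred, t in zip(predictions, targets)) / len(targets)

# exp(-loss) is the geometric mean of the probability given to the correct class.
reported_test_loss = 0.8496134281158447
mean_prob_of_correct_note = math.exp(-reported_test_loss)

print(round(loss, 4), round(accuracy, 4), round(mean_prob_of_correct_note, 4))
```

Read this way, the reported test loss of about 0.85 means that the model assigns, in the geometric-mean sense, a probability of roughly 0.43 to the correct next note.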

In the solutions to the book exercises, the author used a different model. His model was more complex and was made up of a series of convolutional layers and batch normalization layers, followed by an LSTM layer. The result was a model that was much deeper than the one we used. The author also set an explicit learning rate. Let us see what results we would get with that model:

In [50]:
model = keras.models.Sequential([
    keras.layers.Embedding(input_dim=number_of_notes, output_dim=5,
                           input_shape=[None]),
    keras.layers.Conv1D(32, kernel_size=2, padding="causal", activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Conv1D(48, kernel_size=2, padding="causal", activation="relu", dilation_rate=2),
    keras.layers.BatchNormalization(),
    keras.layers.Conv1D(64, kernel_size=2, padding="causal", activation="relu", dilation_rate=4),
    keras.layers.BatchNormalization(),
    keras.layers.Conv1D(96, kernel_size=2, padding="causal", activation="relu", dilation_rate=8),
    keras.layers.BatchNormalization(),
    keras.layers.LSTM(256, return_sequences=True),
    keras.layers.Dense(number_of_notes, activation="softmax")
])

optimizer = keras.optimizers.Nadam(learning_rate=1e-3)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
model.fit(train_set, epochs=20, validation_data=valid_set)

Epoch 1/20
    133/Unknown - 41s 277ms/step - loss: 2.4688 - accuracy: 0.3596