<a href="https://colab.research.google.com/github/RodrigoMarquesP/Music_Generation_with_Recurrent_Neural_Networks-An_AI_Composer/blob/master/Music_Generation_with_RNNs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This project is inspired by the MIT 6.S191 2020 course and uses its training set, but it can easily be reproduced with any other dataset.

# ***Recurrent Neural Network as a solution to sequential problems***

A sequential problem is one whose data comes as an ordered sequence of any type, such as an array of values, where each term is related to the others by meaning, state, and order. Such data raises several issues for standard machine learning and deep learning models:

*   The sequence **length** can be **variable**.
*   The **order** of the terms can be **variable**, which may or may not change the meaning/value of the sequence.
*   **Important relations** may be **distant** in the sequence, so considering only a fixed-length window of each input is not enough.
*   **One term** can have a **different meaning and/or relevance** depending on its position and/or on the other terms.

Handling these points enables many applications, such as climate forecasting, stock price prediction, object trajectory prediction (essential to autonomous vehicles), text generation, emotion classification, music generation, and many others.

   

<p align="center">
  <img src="https://raw.githubusercontent.com/RodrigoMarquesP/Music_Generation_with_Recurrent_Neural_Networks-An_AI_Composer/master/files/climatic.gif"> 
</p>  


<p align="center">
  <img src="https://raw.githubusercontent.com/RodrigoMarquesP/Music_Generation_with_Recurrent_Neural_Networks-An_AI_Composer/master/files/stock_prices.png" width=650>
</p> 


<p align="center">
  <img src="https://raw.githubusercontent.com/RodrigoMarquesP/Music_Generation_with_Recurrent_Neural_Networks-An_AI_Composer/master/files/trajectory_prediction.gif" width=650>
</p>

To solve these problems, the RNN processes the sequence piece by piece, predicting the next element at each step: the next letter, word, note (in music generation), number, array, etc. With this technique, backpropagation is performed through time, and the internal state of the model (*h*) is carried from step to step and used for the prediction (*y*).
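
To make the recurrence concrete, here is a minimal NumPy sketch of a vanilla RNN cell walking through a sequence one element at a time. The weight names (`W_xh`, `W_hh`, `W_hy`), the toy sizes, and the `tanh` update are illustrative assumptions; this is not the LSTM layer used later in this notebook, only the simplest form of the same idea.

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, hidden_dim, output_dim = 3, 4, 2  # arbitrary toy sizes

# randomly initialized weights (hypothetical names, for illustration only)
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

def rnn_step(x_t, h_prev):
  # the new state depends on the current input AND the previous state
  h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
  # the prediction is read from the state
  y_t = W_hy @ h_t
  return y_t, h_t

sequence = rng.normal(size=(5, input_dim))  # 5 time steps
h = np.zeros(hidden_dim)                    # initial internal state
outputs = []
for x_t in sequence:
  y, h = rnn_step(x_t, h)
  outputs.append(y)

outputs = np.stack(outputs)
print(outputs.shape)  # (5, 2): one prediction per time step
```

Because `h` is passed from one call to the next, each prediction can depend on everything the cell has seen so far, not just on the current element.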

Here is a comparison between the standard network model and a specific RNN architecture. This is the one used for music generation because it outputs a sequence of any desired length, but there are many others.

<p align="center">
  <img src="https://raw.githubusercontent.com/RodrigoMarquesP/Music_Generation_with_Recurrent_Neural_Networks-An_AI_Composer/master/files/rnn.jpg" width=700>
</p>

So, after training such an architecture on music notes, we can predict next notes for sequences that never existed, generating an entirely new song: a composition by a deep learning model.

# ***Creating the model***

Before starting the project, open the Runtime menu in the toolbar, choose "Change runtime type", and set the hardware accelerator to GPU to speed up the computations.

## ***Loading the dependencies:***

In [None]:
# specifying tensorflow version
%tensorflow_version 2.x

# importing packages
import tensorflow as tf 
import numpy as np
import matplotlib.pyplot as plt
import os
import time
import functools
from IPython import display as ipythondisplay
# tqdm module displays a progress bar in our training loop
from tqdm import tqdm

# for the training set we use the Irish folk songs dataset from MIT 6.S191, under license:

# Copyright 2020 MIT 6.S191 Introduction to Deep Learning. All Rights Reserved.
# 
# Licensed under the MIT License. You may not use this file except in compliance
# with the License. Use and/or modification of this code outside of 6.S191 must
# reference:
#
# © MIT 6.S191: Introduction to Deep Learning
# http://introtodeeplearning.com

!pip install mitdeeplearning -q
import mitdeeplearning as mdl

# for converting the ABC notation into playable audio we use
!apt-get install abcmidi timidity > /dev/null 2>&1

## ***Loading and exploring the Dataset from the MIT package***

MIT 6.S191 provides a dataset with almost a thousand Irish folk songs, already in ABC notation, to train our model. Let's look at it:



<p align="center">
<img src="https://raw.githubusercontent.com/RodrigoMarquesP/Music_Generation_with_Recurrent_Neural_Networks-An_AI_Composer/master/files/irish_folk.gif">
</p>

In [None]:
# we can load the data and play it directly from the mdl package
songs = mdl.lab1.load_training_data()
print(len(songs))
mdl.lab1.play_song(songs[1])

Let's take a closer look at the song.

In [None]:
print(songs[1])

To make a single piece of data, we can join all songs into one text object. 

In [None]:
all_songs = "\n\n".join(songs)

# ***Data processing***

RNN training and prediction operate on a sequence of data, with the network predicting new elements at the end of the sequence. The text must therefore be vectorized, which requires a practical way of mapping characters to numbers and back. To build that mapping, we start from the unique characters.

In [None]:
unique = sorted(set(all_songs))
id2char = np.array(unique)
char2id = {char:id for id, char in enumerate(unique)}

print(f"{len(unique)} unique characters listed below:")
for key, item in char2id.items():
  print(f"({repr(key)}:{item})", end="   ")
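
As a quick check of the idea, the sketch below builds the same kind of two-way mapping on a tiny made-up string and verifies that encoding followed by decoding returns the original text. The names `toy_text`, `toy_char2id`, and `toy_id2char` are hypothetical and exist only for this illustration.

```python
# toy text standing in for the joined songs (an assumption for illustration)
toy_text = "X:1\nT:Toy\nK:D\nABc|"

toy_unique = sorted(set(toy_text))
toy_id2char = toy_unique                                       # id -> char (list lookup)
toy_char2id = {char: i for i, char in enumerate(toy_unique)}   # char -> id

encoded = [toy_char2id[char] for char in toy_text]
decoded = "".join(toy_id2char[i] for i in encoded)

print(encoded[:5])
print(decoded == toy_text)  # the mapping is lossless
```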

Now we can just map the character vector to a numerical vector.

In [None]:
all_songs_num = np.array(list(map(lambda char: char2id[char], all_songs)))
print(all_songs_num[:20])

Next, remember that our network will have 'n' inputs, meaning that the model processes a sequence of 'n' characters at a time. Therefore, we need to cut the training data into pieces of length 'n', where each input is randomly sampled and the output is the same sequence shifted one character to the right. Let's take an example:

*   If the desired length is 5:  

        'qwerty' -> input='qwert', output='werty'


Based on that, we can create a batch extractor function, which takes an input size ('n_length') and a batch size to return a randomly extracted set of samples.

In [None]:
def get_batch(all_data, input_size, batch_size):
  # the last valid index of the data
  n = all_data.shape[0] - 1
  # randomly sampled indexes
  rsi = np.random.choice(n-input_size, batch_size)

  # for each index, we extract the input and output
  input_batch = np.array([all_data[i:i+input_size] for i in rsi])
  output_batch = np.array([all_data[i+1:i+1+input_size] for i in rsi])

  # make sure of the batch format
  x_batch = np.reshape(input_batch, [batch_size, input_size])
  y_batch = np.reshape(output_batch, [batch_size, input_size])

  return x_batch, y_batch
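
A small sanity check of the shift can be run on toy data. In the sketch below, `toy_get_batch` is a condensed restatement of `get_batch` so that the block runs on its own, and the data is `np.arange(50)`, chosen (as an assumption for illustration) so that every value equals its own index.

```python
import numpy as np

def toy_get_batch(all_data, input_size, batch_size):
  # same sampling scheme as get_batch above, condensed for a standalone check
  n = all_data.shape[0] - 1
  starts = np.random.choice(n - input_size, batch_size)
  x = np.array([all_data[i:i + input_size] for i in starts])
  y = np.array([all_data[i + 1:i + 1 + input_size] for i in starts])
  return x, y

toy_data = np.arange(50)  # value == index, which makes the shift easy to see
x_toy, y_toy = toy_get_batch(toy_data, input_size=5, batch_size=3)

print(x_toy)
print(y_toy)
```

Each row of `y_toy` should be the matching row of `x_toy` moved one position forward, which is exactly the "next character" target the network learns to predict.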

# ***The model itself***

With the proper data and the batch extraction function, we are ready to create the model. In this specific project, the network will be composed of an embedding layer, an LSTM layer, and a dense layer:


*   **tf.keras.layers.Embedding:** The embedding layer maps each input code to a dense vector. It plays a role similar to [one-hot encoding](https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd), but the vectors are learned weights, so codes that behave similarly can end up with similar vectors; that is, the embedding layer acts like a trainable lookup table.
*   **tf.keras.layers.LSTM:**  LSTM (Long Short-Term Memory) is an RNN variant that handles well the relations between elements at different positions and distances in the data sequence.
*   **tf.keras.layers.Dense:** The dense layer projects the LSTM outputs onto our vocabulary size, producing the unnormalized log-probabilities (logits) of each category; that is, the larger a value, the more the network 'thinks' that the corresponding character should come next. This is the prediction itself. We don't need a softmax activation (normalized probabilities) because both the loss and the sampling function used below work directly on logits.
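
The "trainable lookup table" view of the embedding layer can be illustrated in plain Python: picking row `i` of a weight table gives the same vector as multiplying a one-hot vector by that table. The table values, sizes, and helper names below are made up for the illustration and are not taken from the Keras layer.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 5, 3  # toy sizes

# the embedding weights: one row (vector) per character id
table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
         for _ in range(vocab_size)]

def one_hot(index, size):
  return [1.0 if i == index else 0.0 for i in range(size)]

def one_hot_times_table(index):
  # multiply the one-hot row vector by the table (a plain matrix product)
  row = one_hot(index, vocab_size)
  return [sum(row[i] * table[i][j] for i in range(vocab_size))
          for j in range(embedding_dim)]

char_id = 2
lookup = table[char_id]                 # what a lookup table does
product = one_hot_times_table(char_id)  # the equivalent one-hot formulation

print(lookup)
print(product)
```

The lookup avoids building the one-hot vector at all, which is why it is the cheaper formulation for large vocabularies.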

For simplicity and better understanding, we will design the model with **tf.keras.Sequential**:







In [None]:
def our_model(vocab_size: int, embedding_dim: int, rnn_units: int, batch_size: int) -> tf.keras.Sequential:
  """
  Returns the three-layered model.

  Inputs:
    vocab_size: the size of our vocabulary, that is, how many characters are being considered.

    embedding_dim: the dense embedding dimension.

    rnn_units: the LSTM layer dimension.

    batch_size: the size of each batch.
  
  Returns:
    model: a three-layered model for sequential problems.
  """

  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, batch_input_shape=[batch_size, None]),
    tf.keras.layers.LSTM(units=rnn_units, return_sequences=True, recurrent_initializer='glorot_uniform', recurrent_activation='sigmoid', stateful=True),
    tf.keras.layers.Dense(units=vocab_size)
    ])

  return model

We can then instantiate the model and see its structure.

In [None]:
model = our_model(len(unique), embedding_dim=256, rnn_units=1024, batch_size=32)
model.summary()

We can pass a random batch through the network, even before training, just to check the dimensionality.

In [None]:
# generate a batch of 32 sequences of 100 characters each
x, y = get_batch(all_songs_num, input_size=100, batch_size=32)

# feedforward the batch
no_training_predic = model(x)

# now we can inspect the shape of the input (x) and the output (no_training_predic)
print("We're using a batch of 32 sequences of 100 characters each:")
print(f"Input shape: {x.shape}")
print(f"Output shape: {no_training_predic.shape}")

Thus, for each character of each input sequence, there are 83 outputs: an unnormalized log-probability distribution over the 83 possible characters, where the largest output can be taken as the network's prediction. This simple approach (taking the argmax) can get the generation stuck in a loop, so it is better to sample from the distribution: 

In [None]:
sampled_predictions = tf.random.categorical(no_training_predic[0], num_samples=1)  # returns a tensor with (input_size, 1) shape
sampled_predictions = tf.squeeze(sampled_predictions)  # so, we can squeeze it to a simple array
print(sampled_predictions)
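
The difference between taking the argmax and sampling can be seen on a tiny example in plain Python. The three logits below are made-up scores for an imaginary 3-character vocabulary, and `softmax` is written by hand here only to expose the probabilities; `tf.random.categorical` takes the logits directly.

```python
import math
import random

def softmax(logits):
  # subtract the max for numerical stability before exponentiating
  m = max(logits)
  exps = [math.exp(v - m) for v in logits]
  total = sum(exps)
  return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]   # made-up scores for a 3-character vocabulary
probs = softmax(toy_logits)

# greedy choice: always the same index for the same logits
greedy = max(range(len(toy_logits)), key=lambda i: toy_logits[i])

# sampling: indices are drawn in proportion to their probability
random.seed(0)
samples = random.choices(range(len(probs)), weights=probs, k=1000)
counts = [samples.count(i) for i in range(len(probs))]

print(greedy)
print([round(p, 3) for p in probs])
print(counts)
```

The greedy choice never picks the less likely characters, while sampling picks them occasionally, which is what keeps generated text from repeating itself.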

We can use our decoder 'id2char' for generating the actual characters:

In [None]:
print(repr("".join(id2char[sampled_predictions])))

As the weights of the network are randomly initialized, the output doesn't make much sense, but we can see that the whole structure of the problem is now defined and working, so let's train it.

# ***Make the network smart: train it***

The training process requires a loss function to inform the model "how much it is doing wrong." For this task, we can use a *sparse_categorical_crossentropy* loss, which deals with integer targets (categories).

Let's build the loss function for later optimization.

In [None]:
def compute_loss(labels, logits):
  loss = tf.keras.losses.sparse_categorical_crossentropy(y_true=labels, y_pred=logits, from_logits=True)
  return loss

The loss function returns one loss value per character of the batch; averaging them gives the total cost. We can compute this cost for our untrained model:

In [None]:
total_cost_untrained = compute_loss(y, no_training_predic).numpy().mean()
print(f"Total cost of the batch on the untrained model: {total_cost_untrained}")
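
A useful reference point for this number: a model with no preference among the 83 characters has a cross-entropy of ln(83) ≈ 4.42, so an untrained network with near-uniform outputs should land roughly there. The sketch below computes the loss by hand in plain Python; `sparse_cross_entropy` is a hypothetical helper written for this illustration, not the Keras function.

```python
import math

def sparse_cross_entropy(label, logits):
  # loss = -log(softmax(logits)[label]), computed in a numerically stable way
  m = max(logits)
  log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
  return log_sum_exp - logits[label]

vocab = 83                      # vocabulary size reported above
uniform_logits = [0.0] * vocab  # a model with no preference at all

baseline = sparse_cross_entropy(label=7, logits=uniform_logits)
print(round(baseline, 3), round(math.log(vocab), 3))

# a confident, correct prediction drives the loss toward zero
confident_logits = [0.0] * vocab
confident_logits[7] = 20.0
confident = sparse_cross_entropy(label=7, logits=confident_logits)
print(confident)
```

During training, the cost should fall from around the uniform baseline toward smaller values as the predictions become more confident and more often correct.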

### ***Hyperparameters***

Besides the loss function, there are a lot of hyperparameters to specify. Below we chose some reasonable values, but feel free to try others (in fact, for really understanding their influence, you should test a lot of different values).

&rarr; The optimization process needs a stop condition, which may be something automatic (a loss threshold), but in our case, we will set the number of iterations.

In [None]:
num_training_iterations = 2000  # increase for a better-tuned model

&rarr; The batch extraction function requires the batch size and the input size. The bigger those parameters, the higher the computational cost.

In [None]:
batch_size = 4
seq_length = 100

&rarr; The learning rate is a critical parameter:

*   If too small, training converges slowly and can get stuck in bad local optima, resulting in a poor model.
*   If too big, training can overshoot the optimal point or even diverge.



In [None]:
learning_rate = 5e-3
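
The effect of the learning rate is easy to see on a toy problem. The sketch below runs plain gradient descent on f(w) = w², an assumption chosen only because its optimum (w = 0) is known; being convex, it shows slow convergence and divergence, but not the local-optima issue.

```python
def gradient_descent(learning_rate, steps=20, w=1.0):
  # minimize f(w) = w**2, whose gradient is 2*w and whose optimum is w = 0
  for _ in range(steps):
    w = w - learning_rate * 2 * w
  return w

too_small = gradient_descent(0.01)  # barely moves in 20 steps
reasonable = gradient_descent(0.4)  # gets very close to the optimum
too_big = gradient_descent(1.1)     # overshoots more at every step

print(too_small, reasonable, too_big)
```

The specific rates are only meaningful for this toy function; for the network, the useful range has to be found by experiment.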

&rarr; We are going to use the same values as before for the layer dimensions. These values are arbitrary and empirically chosen: remember that there is no fixed rule for the model architecture. Therefore, we design it from experience (always respecting the computational limits).

In [None]:
vocab_size = len(unique)
embedding_dim = 256 
rnn_units = 1024

# instantiate the model
model = our_model(vocab_size=vocab_size, 
                  embedding_dim=embedding_dim, 
                  rnn_units=rnn_units, 
                  batch_size=batch_size)

Finally, we can define our optimizer, the algorithm that uses the gradients to update the weights. Some good and widely used optimizers are [Adam](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) and [Adagrad](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adagrad), so go ahead and try both yourself or even other optimizers (for the full list visit [the TensorFlow website](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/)).

In [None]:
optimizer = tf.keras.optimizers.Adam(learning_rate)
# optimizer = tf.keras.optimizers.Adagrad(learning_rate)

# ***The training routine***

The training process will be based on [GradientTape](https://www.tensorflow.org/api_docs/python/tf/GradientTape), a gradient recording mechanism: by recording all forward operations, we can compute the gradients of the loss with respect to the weights. With this last piece in place, we can start training our model.

**Note**: We can access the trainable weights by `model.trainable_variables`.

In [None]:
@tf.function
def train_step(x, y): 
  with tf.GradientTape() as tape:
    y_pred = model(x)
    loss = compute_loss(y, y_pred)

  gradients = tape.gradient(loss, model.trainable_variables)
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))

  return loss

# ***Train the model***

With everything in place, the training process is simple:


1.   Extract a batch;
2.   Calculate the loss (this step already includes the weight update);
3.   Add the loss information to a history list for later analysis of the convergence.

In [None]:
history = []

# ensuring tqdm will work well
if hasattr(tqdm, '_instances'): tqdm._instances.clear() # clear if it exists

# training loop
for iter in tqdm(range(num_training_iterations)):
  x_batch, y_batch = get_batch(all_songs_num, seq_length, batch_size)
  loss = train_step(x_batch, y_batch)
  history.append(loss.numpy().mean())

We can show the training process convergence simply by plotting the history list:

In [None]:
# Creating a moving average of the losses
step = 20
moving_average = np.array([np.array(history[i-step:i]).mean() for i in range(step, len(history))])
ma_x = [i for i in range(step, len(history))]
string = f"The cost reaches approximately {moving_average[-1]:.2f}"


fig, ax = plt.subplots(figsize=(10, 7))
ax.plot(history, 'b-', lw=1)
ax.plot(ma_x, moving_average, 'r-', lw=2)
ax.set_xlabel('Iterations')
ax.set_ylabel('Total Cost')
ax.text(.40, .7, string, transform=ax.transAxes, bbox=dict(facecolor='white', alpha=0), fontsize=15)
plt.show();

# ***Preparing the model***

As the embedding layer is built with the batch size information, we need to rebuild the model with a new batch size of 1, since a single input starts the prediction. Rebuilding the network reinitializes all parameters, so to keep the progress we save the trained weights and load them into the fresh model, which will then be ready for production.

In [None]:
# saving the weights
checkpoint_prefix = os.path.join('./training_checkpoints', "my_ckpt")
model.save_weights(checkpoint_prefix)

# rebuilding the model
model = our_model(vocab_size=vocab_size, 
                    embedding_dim=embedding_dim, 
                    rnn_units=rnn_units, 
                    batch_size=1)

# reloading the weights
model.load_weights(tf.train.latest_checkpoint('./training_checkpoints'))
model.build(tf.TensorShape([1, None]))

Let's check that the model structure was preserved.

In [None]:
model.summary()

# ***Prediction***

Our model is ready for production, so we can input a starting point and extract as many predictions as we want with the same methodology: sampling from the probability distribution over the output categories. Notice that our first prediction must be the input for the second prediction, and so on. The full loop will be:


1.   Choose a starting string.
2.   Run it on the model and obtain its prediction.
3.   Store this prediction.
4.   Use the last prediction as input.

The internal state of the network is updated on every iteration, building a brand-new song piece by piece.

In [None]:
def generate_song(model, start_string, new_song_length):
  input = [char2id[s] for s in start_string]
  input = tf.expand_dims(input, 0)
  new_songs = []

  # reset the model internal state and prepare tqdm module
  model.reset_states()
  if hasattr(tqdm, '_instances'): tqdm._instances.clear()  # clear if it exists

  for i in tqdm(range(new_song_length)):

    # Extract the prediction
    predictions = model(input)
    predictions = tf.squeeze(predictions, 0)
    predicted_category = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()
      
    # Update the actual input
    input = tf.expand_dims([predicted_category], 0)
      
    # Add the new character
    new_songs.append(id2char[predicted_category])
    
  return (start_string + ''.join(new_songs))
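
The same feedback loop can be traced in plain Python with a stand-in for the network. In the sketch below, `toy_transitions` is a hypothetical hand-written table of next-character probabilities; unlike the LSTM, it keeps no internal state, so it only illustrates the "sample, store, feed back" steps.

```python
import random

# hypothetical table: for each character, the possible next characters and their weights
toy_transitions = {
  "A": {"B": 0.7, "c": 0.3},
  "B": {"c": 0.6, "|": 0.4},
  "c": {"A": 0.5, "|": 0.5},
  "|": {"A": 1.0},
}

def toy_generate(start_string, new_length, seed=0):
  rng = random.Random(seed)
  current = start_string[-1]       # the last character is the first input
  generated = []
  for _ in range(new_length):
    options = toy_transitions[current]
    # sample the next character from the predicted distribution
    next_char = rng.choices(list(options), weights=list(options.values()), k=1)[0]
    generated.append(next_char)    # store the prediction
    current = next_char            # feed it back as the next input
  return start_string + "".join(generated)

toy_song = toy_generate("A", new_length=12)
print(toy_song)
```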

Our artist network is ready; let it compose:

<p align="center">
<img src="https://raw.githubusercontent.com/RodrigoMarquesP/Music_Generation_with_Recurrent_Neural_Networks-An_AI_Composer/master/files/musician.gif">
</p>

In [None]:
generated_text = generate_song(model=model, start_string="X", new_song_length=1000)

Now we need to look for patterns that indicate a song in the output text, which can be easily done with [regular expressions](https://docs.python.org/3/library/re.html). That is what `mdl.lab1.extract_song_snippet(text)` does internally.
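
As a rough illustration of the idea (a sketch under assumptions, not the actual implementation of `mdl.lab1.extract_song_snippet`), the block below treats blank lines as song separators and keeps only the chunks that start with an ABC `X:` header. The function name `extract_abc_chunks` and the sample text are hypothetical.

```python
import re

def extract_abc_chunks(text):
  # split on blank lines, then keep chunks that begin with "X:<number>"
  chunks = re.split(r"\n\s*\n", text)
  return [c.strip() for c in chunks if re.match(r"\s*X:\s*\d+", c)]

sample_text = (
  "X:1\nT:First toy tune\nK:D\nABc|dcB|\n"
  "\n"
  "stray characters without a header\n"
  "\n"
  "X:2\nT:Second toy tune\nK:G\nGAB|cBA|"
)

toy_chunks = extract_abc_chunks(sample_text)
print(len(toy_chunks))
for chunk in toy_chunks:
  print(chunk.splitlines()[0])
```

Text that does not start with a tune header is dropped, which is why only part of the generated output may turn into playable songs.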

Once the songs are extracted, we can play, enjoy, and even download them:

<p align="center">
<img src="https://raw.githubusercontent.com/RodrigoMarquesP/Music_Generation_with_Recurrent_Neural_Networks-An_AI_Composer/master/files/party_time.gif">
</p>

In [None]:
generated_songs = mdl.lab1.extract_song_snippet(generated_text)

for i, song in enumerate(generated_songs): 
  waveform = mdl.lab1.play_song(song)

  # If there's any recognized song, play it
  if waveform:
    print("Song ", i)
    ipythondisplay.display(waveform)