<a href="https://colab.research.google.com/github/Benned-H/LSTMjazz/blob/master/Data_Processing/Final_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
Author(s) | Year | Models Used | Music | Encoding | Quantization | Future work | Code/Examples
--- | --- | --- | --- | --- | --- | --- | ---
Eck | 2002 | LSTM | Melody + chords | 13 melody, 12 chord 1/0 | 2 per beat | N/A | [Ex](https://web.archive.org/web/20190104192500/http://people.idsia.ch/~juergen/blues/)
Bickerman | 2010 | DBN | Chords -> jazz licks | 18 melody (12 pitch, 4 8va), 12 chord | 12 per beat | Melodies avoid triplets | [Code](https://sourceforge.net/projects/rbm-provisor/)
Choi | 2016 | char-RNN, word-RNN | Jazz chord progressions | Note chars, Chord words | 1 per beat | N/A | [Code](https://github.com/keunwoochoi/lstm_real_book)
Lackner | 2016 | LSTM | Melody given chords | 24 melody, 12 chord 1/0 | 4 per beat | Larger dataset | [Ex](https://konstilackner.github.io/LSTM-RNN-Melody-Composer-Website/)
Agarwala | 2017 | Seq2Seq, char-RNN | Melodies | ABC char -> embeddings | None; ABC notation | N/A | [Code](https://github.com/yinoue93/CS224N_proj)
Brunner | 2017 | 2 LSTMs | Chords -> polyphonic piano | 48 melody, 50 chord embeddings | 2 per beat | Encoding polyphonic sustain, genre metadata | N/A
Hilscher | 2018 | char-RNN | Polyphonic piano | 1/0 on/off vectors | 4 per beat | More keys/data, text pattern matching | [Ex](https://yellow-ray.de/~moritz/midi_rnn/examples.html)

In [1]:
import os
import glob
import pandas as pd
import numpy as np
from IPython.display import clear_output

from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
!ls

gdrive	sample_data


In [3]:
# Navigate to the Datasets folder.
os.chdir('gdrive/My Drive/Datasets')
!ls

'Chords Bits'  'Chords Tokens'	'Melodies 18-bit'  'Melodies Piano Roll'


In [4]:
# We have all of our data in these subfolders.
len(glob.glob("*/*.csv"))

4800

# Additional Data Formatting Considerations  
IGNORE FOR NOW

In the work of [Brunner et al.](https://arxiv.org/abs/1711.07682) (2017), their LSTM received vectors of piano rolls with these appended features:
1. Embedded chord vector of the next time step. (Current chord)
2. Embedded chord vector of the chord following that chord.
3. A binary counter from 0 to 7 each bar.

Because my timesteps are much finer, I'd alter these features to the following:
1. "Current" embedded chord vector for this timestep.
2. Embedded chord vector of the chord following that chord.
3. See below for different timing bit vector.

In my case, I would need to count to 48 OR simplify the bits a bit to give information only about on-beat, off-beat, sixteenth, and triplet information. What about:

---
Bit 8 | Bit 7 | Bit 6 | Bit 5 | Bit 4 | Bit 3 | Bit 2 | Bit 1
--- | --- | --- | --- | --- | --- | --- | ---
Third triplet | Second triplet | Any triplet | On any down-beat | On any half-note | On-beat | On any 8th | On any 16th
Offset % 12 = 8 | Offset % 12 = 4 | Offset % 4 = 0 | Offset % 48 = 0 | Offset % 24 = 0 | Offset % 12 = 0 | Offset % 6 = 0  | Offset % 3 = 0

---
Bit 7 | Bit 6 | Bit 5 | Bit 4 | Bit 3 | Bit 2 | Bit 1
--- | --- | --- | --- | --- | --- | ---
Third triplet | Second triplet | On-beat | On any 8th | On any 16th | Count 4 beats Bit 2 | Count 4 beats Bit 1
Offset % 12 = 8 | Offset % 12 = 4 | Offset % 12 = 0 | Offset % 6 = 0 | Offset % 3 = 0 | ???  | ???

To be clear, these features give the LSTM an easier job knowing when to change notes. To that end, I'd argue that such assistance might normally come from a rhythm section, and it's not inherently "unfair" information to give the system. What it might cause is a much clearer bias towards on-beat or on-eighth (etc.) notes, but balancing temperature and over-fitting can hopefully alleviate these concerns.
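As a rough sketch of the seven-bit layout in the second table above, the function below fills the two "???" cells with a two-bit counter over the four beats of a bar. That choice, along with the name `timing_bits` and the 4/4 assumption, is only one possible reading and has not been tested in this project.

```python
TICKS_PER_BEAT = 12  # quantization used in this notebook
BEATS_PER_BAR = 4    # assumption: 4/4 time, i.e. 48 ticks per bar

def timing_bits(offset):
    """Return [Bit 7, ..., Bit 1] for an offset measured in ticks."""
    within_beat = offset % TICKS_PER_BEAT
    # Assumed meaning of the "???" cells: which beat of the bar we are in (0..3).
    beat = (offset // TICKS_PER_BEAT) % BEATS_PER_BAR
    return [
        int(within_beat == 8),  # Bit 7: third triplet
        int(within_beat == 4),  # Bit 6: second triplet
        int(within_beat == 0),  # Bit 5: on-beat
        int(offset % 6 == 0),   # Bit 4: on any 8th
        int(offset % 3 == 0),   # Bit 3: on any 16th
        (beat >> 1) & 1,        # Bit 2: high bit of the beat counter
        beat & 1,               # Bit 1: low bit of the beat counter
    ]

print(timing_bits(0))   # [0, 0, 1, 1, 1, 0, 0]
print(timing_bits(16))  # [0, 1, 0, 0, 0, 0, 1]
print(timing_bits(44))  # [1, 0, 0, 0, 0, 1, 1]
```

Offset 16 is the second triplet of beat 2, so only Bit 6 and the low counter bit are set.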

In [0]:
def binaryCounter(offset):
  """
  Returns a one-row DataFrame of the eight timing bits above (Bit 8 first) for the given offset in ticks.
  """
  bit8 = (offset % 12 == 8)  # Third triplet
  bit7 = (offset % 12 == 4)  # Second triplet
  bit6 = (offset % 4 == 0)   # Any triplet
  bit5 = (offset % 48 == 0)  # On any down-beat
  bit4 = (offset % 24 == 0)  # On any half-note
  bit3 = (offset % 12 == 0)  # On-beat
  bit2 = (offset % 6 == 0)   # On any 8th
  bit1 = (offset % 3 == 0)   # On any 16th
  return pd.DataFrame(np.array([bit8, bit7, bit6, bit5, bit4, bit3, bit2, bit1])).T

In [0]:
binaryCounter(4)

**NOTE: THE ABOVE CODE IS INCOMPLETE UNTIL I'VE TESTED WITHOUT THESE FEATURES FIRST**

In [0]:
"""

os.chdir("Melodies Piano Roll")
!ls

c_bits = glob.glob("*.csv")
#c_tokens = glob.glob("*.csv")
#m_18bits = glob.glob("*.csv")
#m_piano = glob.glob("*.csv")

names = []
for c in c_bits:
  names.append(c[:-4])
name_indices = dict((n, i) for i, n in enumerate(names))

for i, c in enumerate(m_piano):
  df = pd.read_csv(c)
  new_name = str(name_indices[c[:-4]]) + ".csv"
  df.to_csv(new_name)
  print(i)
  
"""

The above code was used to re-save every CSV under a numeric index (`0.csv`, `1.csv`, ...) so that files can be matched across the encoding folders by name.

# Sampling The Data

Per my past summaries of previous works, we have this process for formatting data for Keras:

We have dataset $D=(X, Y)$ of "labelled" chord progression segments: $X = \{X_1, X_2, ..., X_n\}$ and $Y = \{Y_1, Y_2, ..., Y_n\}$, where each $X_i$ is some section of chord progression and each $Y_i$ is the corresponding melody label. My original piano matrix is of dimensions $(\text{# timesteps}, |\text{note range}|)$. In the 18-bit case, this is more simply $(\text{# timesteps}, 18)$.

First, we need to sample these matrices into $t$-timestep-long sequences of chord data (these are our $X_i$). We'll then label each of these with the melody information from the $t+1$ timestep. The number of samples, $S$, will be the total length of each song (in timesteps) minus $(t+1)$.
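The windowing described above can be sketched on a toy array to check the shapes. This is a minimal illustration, not the notebook's own `sample` function: `make_windows` and the zero-filled arrays are made up for the example, and the loop bound mirrors the "minus $(t+1)$" count stated above.

```python
import numpy as np

def make_windows(chords, melody, t, step=1):
    """Pair each t-row chord window with the melody row that follows it."""
    # With step == 1 this yields len(chords) - (t + 1) windows.
    starts = range(0, len(chords) - t - 1, step)
    X = np.stack([chords[i:i + t] for i in starts])
    Y = np.stack([melody[i + t] for i in starts])
    return X, Y

# Toy data: 10 timesteps, 12 chord bits in, 18 melody bits out.
chords = np.zeros((10, 12), dtype=int)
melody = np.zeros((10, 18), dtype=int)

X, Y = make_windows(chords, melody, t=3)
print(X.shape, Y.shape)  # (6, 3, 12) (6, 18)
```

The first array has one axis each for samples, time steps, and features; the second holds one label row per sample.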

The final data shape should be a 3D matrix of dimensions $(\text{# samples}, \text{time steps}, \text{features})$. Therefore we'll just need to sample the piano and 18-bit matrices into samples without any other dimensional shifting. But *what sample sizes did my sources use?*
* Choi - Sampled 20 characters at a time with step size 3.
* Lackner - Seems to be 8, or 2 beats worth.
* Hilscher - Sentence length 100 (used chars)

The rest of my sources didn't really say.

**Conclusion** - I think two bars would be a good place to start. These would be $(8\text{ beats}*12\text{ ticks/beat}) = 96$ time steps. I'd also consider sampling at an interval of 5, as an example, to decrease redundancy if training takes too long. Until I hit this wall, I'll use a step size between windows of 1. So let's get to sampling, then.

Edit: Lackner uses 8-bar samples overall in his code.
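To make the window-size arithmetic from the conclusion above concrete, here is a small sketch of how many samples one song would yield at step sizes 1 and 5. The song length of 3072 timesteps (64 bars of 4/4 at 12 ticks per beat) is just an example value, and `num_windows` is a hypothetical helper using the same loop bound as the "minus $(t+1)$" rule.

```python
TICKS_PER_BEAT = 12

def num_windows(song_length, window, step=1):
    """Count window start positions in range(0, song_length - window - 1, step)."""
    return len(range(0, song_length - window - 1, step))

window = 8 * TICKS_PER_BEAT  # two bars of 4/4 = 8 beats
song_length = 3072           # example length in timesteps (assumption)

print(window)                               # 96
print(num_windows(song_length, window, 1))  # 2975
print(num_windows(song_length, window, 5))  # 595
```

Moving from a step of 1 to a step of 5 cuts the sample count to a fifth, which is the trade-off between redundancy and training time mentioned above.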

In [0]:
def sample(X_df, Y_df, sample_length, step):
  """
  Splits the given DataFrames into overlapping samples, each beginning the given step size apart.
  X output will be the sliding windows from X_df.
  Y output will be single-row labels from Y_df: the row immediately after each window.
  
  X_df : The Pandas DataFrame to be sliced into input windows.
  Y_df : The Pandas DataFrame supplying the labels; must be the same length as X_df.
  sample_length : int
    The desired length of each sample.
  step : int
    The number of timesteps between where each sample begins.
  """
  if len(X_df) != len(Y_df):
    # Fail loudly instead of returning placeholder values the caller would silently append.
    raise ValueError("These dataframes have different lengths: %d vs %d" % (len(X_df), len(Y_df)))
  
  length = X_df.shape[0]
  Xsamples = []
  Ysamples = []
  
  for i in range(0, (length - sample_length - 1), step):
    Xsample = X_df.iloc[i:i+sample_length, :].values  # Window of sample_length rows
    Ysample = Y_df.iloc[i+sample_length, :].values    # Label: the row right after the window
    
    Xsamples.append(Xsample)
    Ysamples.append(Ysample)
    
  return Xsamples, Ysamples

# Reading in the Data

In [0]:
files_to_read = 36 # Eventually 600
sample_len = 24
step_size = 1


ChB = []
ChT = []
M18 = []
MPR = []

# Running the below block, in order, will import the above number of files from each of the four data encodings.
# Begin in the Datasets directory.

## Run this section to import all 4 data encodings
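The four import cells in this section repeat the same steps, so they could be folded into one helper. The sketch below is one hedged way to do that without changing the working directory. `load_encoding` is a hypothetical name, and dropping every column whose name starts with `Unnamed` is an assumption that generalizes the two waste columns dropped in the cells.

```python
import os
import pandas as pd

def load_encoding(directory, num_files):
    """Read 0.csv .. (num_files - 1).csv from directory into a list of DataFrames."""
    frames = []
    for i in range(num_files):
        df = pd.read_csv(os.path.join(directory, str(i) + ".csv"))
        # Drop leftover index columns from earlier exports (assumed to all be named "Unnamed: ...").
        keep = [c for c in df.columns if not str(c).startswith("Unnamed")]
        frames.append(df[keep])
    return frames
```

From the Datasets directory, a call such as `ChB = load_encoding("Chords Bits", files_to_read)` would then stand in for one of the cells.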

In [41]:
# Chords Bits

# Move into directory
os.chdir("Chords Bits")

# Import data
for i in range(files_to_read):
  filename = str(i) + ".csv"
  df = pd.read_csv(filename)
  ChB.append(df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])) # Drop waste columns from previous exports
  if (i % 10 == 0):
    clear_output()
    print("Read ChB file", i)

# Move back into Datasets directory.
os.chdir("..")

Read ChB file 30


In [42]:
# Chords Tokens

# Move into directory
os.chdir("Chords Tokens")

# Import data
for i in range(files_to_read):
  filename = str(i) + ".csv"
  df = pd.read_csv(filename)
  ChT.append(df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])) # Drop waste columns from previous exports
  if (i % 10 == 0):
    clear_output()
    print("Read ChT file", i)

# Move back into Datasets directory.
os.chdir("..")

Read ChT file 30


In [43]:
# Melodies 18-bit

# Move into directory
os.chdir("Melodies 18-bit")

# Import data
for i in range(files_to_read):
  filename = str(i) + ".csv"
  df = pd.read_csv(filename)
  M18.append(df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])) # Drop waste columns from previous exports
  if (i % 10 == 0):
    clear_output()
    print("Read M18 file", i)

# Move back into Datasets directory.
os.chdir("..")

Read M18 file 30


In [44]:
# Melodies Piano Roll

# Move into directory
os.chdir("Melodies Piano Roll")

# Import data
for i in range(files_to_read):
  filename = str(i) + ".csv"
  df = pd.read_csv(filename)
  MPR.append(df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1'])) # Drop waste columns from previous exports
  if (i % 10 == 0):
    clear_output()
    print("Read MPR file", i)

# Move back into Datasets directory.
os.chdir("..")

Read MPR file 30


In [46]:
# Verify
print(len(ChB), len(ChT), len(M18), len(MPR))

36 36 36 36


In [0]:
# Verify corresponding dfs are same length
for i in range(files_to_read):
  len1 = len(ChB[i])
  len2 = len(ChT[i])
  len3 = len(M18[i])
  len4 = len(MPR[i])
  
  if not (len1 == len2 == len3 == len4):
    print("Dataframe", i, "was a problem:", len1, len2, len3, len4)

In [57]:
# Verify that some dataframes have different lengths
print("Length of ChB[0]:", len(ChB[0]))
print("Length of ChB[12]:", len(ChB[12]))
print("Length of ChT[0]:", len(ChT[0]))
print("Length of ChT[12]:", len(ChT[12]))
print("Length of M18[0]:", len(M18[0]))
print("Length of M18[12]:", len(M18[12]))
print("Length of MPR[0]:", len(MPR[0]))
print("Length of MPR[12]:", len(MPR[12]))

Length of ChB[0]: 3072
Length of ChB[12]: 3120
Length of ChT[0]: 3072
Length of ChT[12]: 3120
Length of M18[0]: 3072
Length of M18[12]: 3120
Length of MPR[0]: 3072
Length of MPR[12]: 3120


## Sampling Time

In [60]:
# Let's try the ChB-M18 encoding pair.

X_train = []
Y_train = []

for i in range(files_to_read):
  Xsam, Ysam = sample(ChB[i], M18[i], sample_len, step_size)
  Xsam = np.array(Xsam)
  Ysam = np.array(Ysam)
  
  X_train.append(Xsam)
  Y_train.append(Ysam)
  clear_output()
  print("Sampling", i + 1, "of", files_to_read)

Sampling 36 of 36


In [61]:
# What have we created?
print(type(X_train), "shape", len(X_train))

<class 'list'> shape 36


In [62]:
# A few songs' lengths.
print(type(X_train[0]), "shape", X_train[0].shape)
print(type(X_train[12]), "shape", X_train[12].shape)
print(type(X_train[24]), "shape", X_train[24].shape)

<class 'numpy.ndarray'> shape (3047, 24, 12)
<class 'numpy.ndarray'> shape (3095, 24, 12)
<class 'numpy.ndarray'> shape (3047, 24, 12)


In [0]:
# Set our input and output widths based on our chosen encodings.
input_width = 12
output_width = 18

# What structure should the network have?

The sources I read had this to say (Taken from my summaries)  

**Lackner (2016)**  
The LSTM input layer took chord notes (thus 12 nodes), was fully connected to optional hidden layers, and then was fully connected to an output layer with 24 LSTM nodes corresponding to melody notes. The highest melody probability is chosen; if it's above some value, that note is played.

To generate new music, any chord sequence can be input and the resulting output can be converted to MIDI. Many architectures were tested; the best had 2 hidden layers with 9 and 18 LSTM units, respectively.
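Lackner's note-selection rule as summarized above (take the most probable melody note, and play it only if it clears some value) might look like the following. The function name and the default threshold of 0.5 are placeholders, since the summary does not state the actual value.

```python
def pick_note(probabilities, threshold=0.5):
    """Return the index of the most probable note, or None if it does not exceed the threshold."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best if probabilities[best] > threshold else None

print(pick_note([0.1, 0.7, 0.2]))  # 1
print(pick_note([0.3, 0.4, 0.3]))  # None
```

A result of `None` here stands for "no note played at this timestep".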

**Choi et al. (2016)**  
Two LSTM layers with 512 hidden units (hidden state dimensionality) and then Dropout of 0.2 after each LSTM layer. Put together in Keras with categorical cross entropy as loss, Adam optimizer, and stochastic prediction based on a diversity parameter $\alpha$. New probabilities are calculated:  
$\hat{p}_i=e^{\log(p_i)/\alpha}$, where $p_i$ is the probability of the $i$-th state.  
A state is then selected based on the probabilities.
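The diversity rescaling can be sketched in NumPy as follows. The renormalization step is an addition of mine, not part of the formula as summarized: sampling needs the rescaled values to sum to 1. The function names are illustrative, and the probabilities are assumed to be strictly positive.

```python
import numpy as np

def rescale(probs, alpha):
    """Apply p_i -> exp(log(p_i) / alpha), then renormalize so the result sums to 1."""
    p = np.asarray(probs, dtype=float)
    scaled = np.exp(np.log(p) / alpha)
    return scaled / scaled.sum()

def sample_state(probs, alpha, rng):
    """Draw one state index from the rescaled distribution."""
    return int(rng.choice(len(probs), p=rescale(probs, alpha)))

probs = [0.5, 0.25, 0.25]
print(rescale(probs, 1.0))  # unchanged: [0.5, 0.25, 0.25]
print(rescale(probs, 0.5))  # sharper: roughly [0.667, 0.167, 0.167]

rng = np.random.default_rng(0)
state = sample_state(probs, 0.5, rng)  # an index in {0, 1, 2}
```

Values of $\alpha$ below 1 sharpen the distribution toward the most likely state, and values above 1 flatten it.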

**Agarwala et al. (2017)**  
Char-RNN: A hidden layer of 200 LSTM cells with 0.2 dropout and embedding size of 20 were chosen. Softmax was used to predict the next character. An input window of size 50 worked best.  

**Brunner et al. (2017)**  
Chord LSTM: This model learned those chord embeddings. From this layer, they used a hidden layer with 256 LSTM cells followed by a $\text{softmax}$ activation. The output corresponded to a vector of probabilities for the next chord. Training used cross-entropy as loss, Adam optimizer, $10^{-5}$ initial learning rate, and 80,000 of the shifted songs for 4 epochs. To generate new progressions, they seed the model and then sample output probabilities with temperature. This is fed in and the cycle repeats.

Polyphonic LSTM: This LSTM received vectors of piano rolls (1/0 on/off made with the pretty_midi library) with these appended features:
1. Embedded chord vector of the next time step.
2. Embedded chord vector of the chord following that chord.
3. A binary counter from 0 to 7 each bar.

The input is fed into an LSTM with 512 hidden cells and $\text{sigmoid}$ activation. The output at each time step is the probabilities for each note being played. Training used cross entropy between outputs and the ground truth, Adam, initial learning rate of $10^{-6}$, and only 10,000 songs for 4 epochs.

To generate a new song, the polyphonic LSTM is seeded with the piano roll and corresponding chords. The next step is sampled and the number of notes to play at any one time is limited. To deal with ambiguous note endings/re-attacks, they consider all consecutive same notes as held notes. At barlines, all played notes are re-attacked.

**Hilscher et al. (2018)**  
The best architecture used a single LSTM layer of 512 units, sequence length of 100, Mozart data fully normalized/transposed, batch shuffling, with categorical cross entropy loss, Adam optimizer, 0.001 learning rate, and validation split of 0.2.

Author(s) | Layers | Hidden State Size | Learning Rate | Loss | Optimizer
--- | --- | --- | --- | --- | ---
Lackner | Input (12), 9 LSTM units, 18 LSTM units, output 24 LSTM units | 9, 18, 24 | Unspecified | Unspecified | Unspecified
Choi et al. | 512 LSTM units, Dropout 0.2, 512 LSTM units, Dropout 0.2 | 512 | Unspecified | Categorical cross-entropy | Adam
Agarwala et al. | Embedding Size 20, 200 LSTM units w/ 0.2 Dropout | 200 | Unspecified | Unspecified | Unspecified
Brunner et al. (chord LSTM) | Embedding, 256 LSTM units, softmax | 256 | 10^-5 | Cross-entropy | Adam
Brunner et al. (polyphonic LSTM) | 512 LSTM units, sigmoid | 512 | 10^-6 | Cross-entropy | Adam
Hilscher et al. | Single LSTM layer of 512 units | 512 | 0.001 | Categorical cross-entropy | Adam

To summarize, then, a good starting network for monophonic melody generation will probably have:
* 1 to 3 LSTM layers
* Maybe 200 units
* Learning rate somewhere between 10^-6 and 10^-3 (the range reported in the sources above)
* Cross-entropy loss is 100% the way to go
* ADAM optimizer

# Build the network

Sources used to create this particular part of the project:

1. https://keras.io/layers/recurrent/ Accessed 4/12/2019
2. https://www.dlology.com/blog/how-to-use-return_state-or-return_sequences-in-keras/ Accessed 4/12/2019
3. https://keras.io/getting-started/sequential-model-guide/ Accessed 5/22/2019

In [0]:
# LSTM_units = [32, 64, 128] # Eventually, for now use simpler:
LSTM_units = [8, 16, 32]

In [0]:
# Import statements
from keras.layers import LSTM, Dropout, Dense
from keras.models import Sequential

input_dim = (sample_len, input_width)

model = Sequential() # Declare our network.

# Input layer
model.add(LSTM(LSTM_units[0], return_sequences = True, input_shape = input_dim))

# Add rest of the layers except last
for i in range(1, len(LSTM_units) - 1):
  model.add(LSTM(LSTM_units[i], return_sequences = True))
  
# Last LSTM layer doesn't return a sequence.
model.add(LSTM(LSTM_units[-1]))

model.add(Dropout(0.5)) # How about a dropout layer? 
model.add(Dense(output_width, activation = 'sigmoid')) # Output layer

In [0]:
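# Compile our network
# The output layer is sigmoid, and the 18-bit melody target is assumed to be able
# to have more than one bit set per timestep (e.g. a pitch bit plus an octave bit),
# so binary cross-entropy is used here rather than categorical cross-entropy.
model.compile(optimizer = 'adam',
             loss = 'binary_crossentropy',
             metrics = ['accuracy'])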

In [0]:
def train(model, X_list, Y_list, epochs):
  """Trains a given Keras model on the lists of training data samples, one song at a time."""
  if (len(X_list) != len(Y_list)):
    raise ValueError("X_list and Y_list have different lengths!")
  
  num_songs = len(X_list)
  for e in range(epochs):
    # Train model on each song in training data.
    # Batch size is length of one beat right now.
    for i in range(num_songs):
      clear_output()
      print("Epoch", e + 1, "of", str(epochs) + ". Training on song", i + 1, "of", num_songs)
      model.fit(X_list[i], Y_list[i], batch_size = 12, verbose = 1)

# model.evaluate(X_list[i], Y_list[i], verbose = 1)

In [102]:
train(model, X_train, Y_train, 10)

Epoch 7 of 10. Training on song 8 of 36
Epoch 1/1