<a href="https://colab.research.google.com/github/Benned-H/LSTMjazz/blob/master/Data_Processing/Final_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---
Author(s) | Year | Models Used | Music | Encoding | Quantization | Future work | Code/Examples
--- | --- | --- | --- | --- | --- | --- | ---
Eck | 2002 | LSTM | Melody + chords | 13 melody, 12 chord 1/0 | 2 per beat | N/A | [Ex](https://web.archive.org/web/20190104192500/http://people.idsia.ch/~juergen/blues/)
Bickerman | 2010 | DBN | Chords -> jazz licks | 18 melody (12 pitch, 4 8va), 12 chord | 12 per beat | Melodies avoid triplets | [Code](https://sourceforge.net/projects/rbm-provisor/)
Choi | 2016 | char-RNN, word-RNN | Jazz chord progressions | Note chars, Chord words | 1 per beat | N/A | [Code](https://github.com/keunwoochoi/lstm_real_book)
Lackner | 2016 | LSTM | Melody given chords | 24 melody, 12 chord 1/0 | 4 per beat | Larger dataset | [Ex](https://konstilackner.github.io/LSTM-RNN-Melody-Composer-Website/)
Agarwala | 2017 | Seq2Seq, char-RNN | Melodies | ABC char -> embeddings | None; ABC notation | N/A | [Code](https://github.com/yinoue93/CS224N_proj)
Brunner | 2017 | 2 LSTMs | Chords -> polyphonic piano | 48 melody, 50 chord embeddings | 2 per beat | Encoding polyphonic sustain, genre metadata | N/A
Hilscher | 2018 | char-RNN | Polyphonic piano | 1/0 on/off vectors | 4 per beat | More keys/data, text pattern matching | [Ex](https://yellow-ray.de/~moritz/midi_rnn/examples.html)

In [1]:
import os
import glob
import pandas as pd
import numpy as np

from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
!ls

gdrive	sample_data


In [0]:
os.chdir('gdrive/My Drive/Datasets')
!ls

In [4]:
# We have all of our data in these subfolders.
len(glob.glob("*/*.csv"))

4800

# Additional Data Formatting Considerations

In the work of [Brunner et al.](https://arxiv.org/abs/1711.07682) (2017), their LSTM received vectors of piano rolls with these appended features:
1. Embedded chord vector of the next time step. (Current chord)
2. Embedded chord vector of the chord following that chord.
3. A binary counter from 0 to 7 each bar.

Because my timesteps are much finer, I'd alter these features to the following:
1. "Current" embedded chord vector for this timestep.
2. Embedded chord vector of the chord following that chord.
3. See below for different timing bit vector.

In my case, I would need to count to 48 (12 timesteps per beat over a four-beat bar), or simplify the encoding so that it only marks on-beat, off-beat, sixteenth, and triplet positions. What about:

---
Bit 8 | Bit 7 | Bit 6 | Bit 5 | Bit 4 | Bit 3 | Bit 2 | Bit 1
--- | --- | --- | --- | --- | --- | --- | ---
Third triplet | Second triplet | Any triplet | On any down-beat | On any half-note | On-beat | On any 8th | On any 16th
Offset % 12 = 8 | Offset % 12 = 4 | Offset % 4 = 0 | Offset % 48 = 0 | Offset % 24 = 0 | Offset % 12 = 0 | Offset % 6 = 0  | Offset % 3 = 0

---
Bit 7 | Bit 6 | Bit 5 | Bit 4 | Bit 3 | Bit 2 | Bit 1
--- | --- | --- | --- | --- | --- | ---
Third triplet | Second triplet | On-beat | On any 8th | On any 16th | Count 4 beats Bit 2 | Count 4 beats Bit 1
Offset % 12 = 8 | Offset % 12 = 4 | Offset % 12 = 0 | Offset % 6 = 0 | Offset % 3 = 0 | ???  | ???

To be clear, these features give the LSTM an easier job knowing when to change notes. To that end, I'd argue that such assistance might normally come from a rhythm section, and it's not inherently "unfair" information to give the system. What it might cause is a much clearer bias towards on-beat or on-eighth (etc.) notes, but balancing temperature and over-fitting can hopefully alleviate these concerns.
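One way to build feature 2 from the lists above (the chord that follows the current one) is to scan the chord matrix backwards and remember the most recent change. The sketch below is an illustration only: `next_chord` is a hypothetical helper, it works on raw chord rows rather than the embeddings used by Brunner et al., and it assumes the final chord is simply repeated when nothing follows it.

```python
import numpy as np

def next_chord(chords):
    """For each timestep, return the row of the next *different* chord.

    chords : array-like of shape (timesteps, chord_features)
    Assumption: after the last chord change there is no following chord,
    so the final chord is repeated for those timesteps.
    """
    chords = np.asarray(chords)
    n = len(chords)
    out = np.empty_like(chords)
    following = chords[-1]  # fallback for the tail of the song
    # Walk backwards so the most recently seen change is always the "next" chord.
    for i in range(n - 1, -1, -1):
        if i + 1 < n and not np.array_equal(chords[i], chords[i + 1]):
            following = chords[i + 1]
        out[i] = following
    return out
```

The result has the same shape as the input, so it can be concatenated column-wise with the current-chord features.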

In [0]:
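def binaryCounter(offset):
  """
  Returns a one-row DataFrame of the 8 timing bits (Bit 8 first, Bit 1 last)
  from the first table above, given an offset in timesteps.
  """
  bit8 = (offset % 12 == 8)   # third triplet of a beat
  bit7 = (offset % 12 == 4)   # second triplet of a beat
  bit6 = (offset % 4 == 0)    # any triplet position
  bit5 = (offset % 48 == 0)   # down-beat of a bar
  bit4 = (offset % 24 == 0)   # any half-note
  bit3 = (offset % 12 == 0)   # on-beat
  bit2 = (offset % 6 == 0)    # any 8th
  bit1 = (offset % 3 == 0)    # any 16th
  return pd.DataFrame(np.array([bit8, bit7, bit6, bit5, bit4, bit3, bit2, bit1])).T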

In [0]:
binaryCounter(4)

Unnamed: 0,0,1,2,3,4,5,6,7
0,False,True,True,False,False,False,False,False


**NOTE: THE ABOVE CODE IS INCOMPLETE UNTIL I'VE TESTED WITHOUT THESE FEATURES FIRST**
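If these features do get used, the 8-bit table could be computed for a whole song at once instead of one offset at a time. This is a minimal sketch under that assumption; `timing_bits` is a hypothetical name, and the column order (Bit 8 first, Bit 1 last) simply mirrors the first table.

```python
import numpy as np

def timing_bits(offsets):
    """Return an (n, 8) array of 0/1 timing bits, one row per offset.

    Columns run from Bit 8 down to Bit 1, assuming 12 timesteps per beat
    and 48 per bar as in the first table.
    """
    offsets = np.atleast_1d(np.asarray(offsets))
    columns = [
        offsets % 12 == 8,   # Bit 8: third triplet
        offsets % 12 == 4,   # Bit 7: second triplet
        offsets % 4 == 0,    # Bit 6: any triplet position
        offsets % 48 == 0,   # Bit 5: down-beat
        offsets % 24 == 0,   # Bit 4: any half-note
        offsets % 12 == 0,   # Bit 3: on-beat
        offsets % 6 == 0,    # Bit 2: any 8th
        offsets % 3 == 0,    # Bit 1: any 16th
    ]
    return np.stack(columns, axis=1).astype(np.int8)

# One row per timestep of a single bar.
bar_bits = timing_bits(np.arange(48))
```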

# Sampling The Data

Per my past summaries of previous works, we have this process for formatting data for Keras:

We have dataset $D=(X, Y)$ of "labelled" chord progression segments: $X = \{X_1, X_2, ..., X_n\}$ and $Y = \{Y_1, Y_2, ..., Y_n\}$, where each $X_i$ is some section of chord progression and each $Y_i$ is the corresponding melody label. My original piano matrix is of dimensions $(\text{# timesteps}, |\text{note range}|)$. In the 18-bit case, this is more simply $(\text{# timesteps}, 18)$.

First, we need to sample these matrices into $t$-timestep-long sequences of chord data (these are our $X_i$). We'll then label each sequence with the melody information from the timestep immediately after it (timestep $t+1$, counting from the start of the window). With a step size of 1, the number of samples per song, $S$, will be the song's length in timesteps minus $(t+1)$.

The final data shape should be a 3D array of dimensions $(\text{# samples}, \text{time steps}, \text{features})$. Therefore we'll just need to slice the piano and 18-bit matrices into samples without any other dimensional shifting. But *what sample sizes did my sources use?*
* Choi - Sampled 20 characters at a time with step size 3.
* Lackner - Seems to be 8, or 2 beats worth.
* Hilscher - Sentence length 100 (used chars)

The rest of my sources didn't really say.

**Conclusion** - I think the two-beat samples of Lackner might be a good place to start, or maybe even the full four beats eventually. These would be 24 or 48 timesteps. I'd also consider sampling at an interval of 5, as an example, to decrease redundancy if training takes too long. Until I hit this wall, I'll use a step size between windows of 1. So let's get to sampling, then.

In [0]:
def sample(df, sample_length, step):
  """
  Splits a given dataframe into overlapping samples each beginning the given step size apart.
  
  df : Pandas DataFrame
    The DataFrame to be sliced into samples.
  sample_length : int
    The desired length of each sample.
  step : int
    The number of timesteps between where each sample begins.

  Returns a list of DataFrames, each with sample_length rows.
  """
  length = df.shape[0]
  samples = []
  # Stop early enough that a timestep is always left after each window for its label.
  for i in range(0, (length - sample_length - 1), step):
    window = df.iloc[i:i+sample_length, :]
    samples.append(window)
    
  return samples
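To get from windows like these to the 3D shape described above, the windows can be stacked and each one paired with the melody row that follows it. The sketch below is one possible way to do that, not the notebook's final pipeline: `make_training_arrays` is a hypothetical helper, and it assumes the chord and melody DataFrames of a song are aligned row for row.

```python
import numpy as np
import pandas as pd

def make_training_arrays(chords, melody, sample_length, step=1):
    """Build X of shape (samples, sample_length, chord_features) and
    Y of shape (samples, melody_features).

    Each X[i] is a window of chord rows; Y[i] is the melody row right
    after that window. Raises ValueError if the song is too short to
    produce any window.
    """
    n = min(len(chords), len(melody))
    # Same loop bounds as sample(): length - sample_length - 1.
    starts = range(0, n - sample_length - 1, step)
    X = np.stack([chords.iloc[i:i + sample_length].to_numpy() for i in starts])
    Y = np.stack([melody.iloc[i + sample_length].to_numpy() for i in starts])
    return X, Y

# Toy song: 10 timesteps, 12 chord bits, 18 melody bits.
toy_chords = pd.DataFrame(np.zeros((10, 12), dtype=int))
toy_melody = pd.DataFrame(np.arange(180).reshape(10, 18))
X, Y = make_training_arrays(toy_chords, toy_melody, sample_length=4)
```

With 10 timesteps and a window of 4, this gives 10 - (4 + 1) = 5 samples, matching the count of $S$ given earlier.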

# Reading in the Data

In [0]:
"""

os.chdir("Melodies Piano Roll")
!ls

c_bits = glob.glob("*.csv")
#c_tokens = glob.glob("*.csv")
#m_18bits = glob.glob("*.csv")
#m_piano = glob.glob("*.csv")

names = []
for c in c_bits:
  names.append(c[:-4])
name_indices = dict((n, i) for i, n in enumerate(names))

for i, c in enumerate(m_piano):
  df = pd.read_csv(c)
  new_name = str(name_indices[c[:-4]]) + ".csv"
  df.to_csv(new_name)
  print(i)
  
"""

The above code was used to rename all of the CSV files so that the same song has the same numeric filename in every subfolder.
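For reference, the same renaming could be done without re-exporting each file through pandas, which avoids adding another index column on every pass. This is a sketch of a different technique from the cell above (a plain `os.rename`), with a hypothetical helper name; it assumes the original filenames match across folders and are not already bare integers.

```python
import os

def rename_to_indices(folders):
    """Rename '<name>.csv' to '<index>.csv' in every folder.

    Indices come from the sorted names found in the first folder, so the
    same song gets the same number everywhere. Returns the name -> index map.
    """
    names = sorted(f[:-4] for f in os.listdir(folders[0]) if f.endswith(".csv"))
    name_indices = {name: i for i, name in enumerate(names)}
    for folder in folders:
        for name, idx in name_indices.items():
            src = os.path.join(folder, name + ".csv")
            if os.path.exists(src):  # skip songs missing from this folder
                os.rename(src, os.path.join(folder, f"{idx}.csv"))
    return name_indices
```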

In [6]:
!ls

'Chords Bits'  'Chords Tokens'	'Melodies 18-bit'  'Melodies Piano Roll'


In [7]:
glob.glob("*/1.csv")

['Chords Bits/1.csv',
 'Chords Tokens/1.csv',
 'Melodies 18-bit/1.csv',
 'Melodies Piano Roll/1.csv']

In [0]:
def readCSV(filename):
  """
  Reads a .csv and drops its first two columns, which are leftover
  index columns from all the exporting/importing.
  """
  df = pd.read_csv(filename)
  df = df.drop(columns = df.columns[0:2])
  return df
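Since every song now shares one filename across the four subfolders, all of its encodings can be loaded together with the same `*/N.csv` glob pattern used above. A minimal sketch, assuming that folder layout; `load_song` and its `root` parameter are hypothetical, and the cleanup mirrors `readCSV`.

```python
import glob
import os
import pandas as pd

def load_song(index, root="."):
    """Return {subfolder name: cleaned DataFrame} for one song index."""
    song = {}
    for path in sorted(glob.glob(os.path.join(root, "*", f"{index}.csv"))):
        df = pd.read_csv(path)
        # Same cleanup as readCSV: drop the two leftover index columns.
        df = df.drop(columns=df.columns[0:2])
        song[os.path.basename(os.path.dirname(path))] = df
    return song
```

For example, `load_song(1)["Chords Bits"]` would be the chord-bit matrix for song 1 when run from the `Datasets` folder.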