<a href="https://colab.research.google.com/github/Benendead/LSTMjazz/blob/master/Research/Related_Works.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Source 1: [Text-based LSTM networks for Automatic Music Composition](https://arxiv.org/pdf/1604.05358.pdf)

## Introduction

This paper used text-based LSTMs for music composition. They found that word-RNNs could learn chord progressions and drum tracks, while char-RNNs could only succeed with chord progressions.

Their work attempted to generate jazz chord progressions and rock drum tracks. Their work is notable for two reasons:
1. Their LSTMs were designed to learn from text data as opposed to music symbols or numeric values.
2. They used a larger dataset than previous work, which allows learning of more complex chord progressions.

## Architecture

Char-RNNs are RNNs with character-based learning, as opposed to the conventional word-based learning. These return a vector corresponding to a character, while word-RNNs predict vectors corresponding to unique words. The merits of char-RNNs include:
1. Minimal assumptions are made - there are no constraints on the form of the text representation. Can RNNs learn musical information with such a weak assumption?
2. Fewer characters means fewer states, which reduces computational costs. Their chord vocabulary had 1,259 "words" yet this reduces to 39 characters.

It's worth noting that char-RNNs need to form longer-term dependencies, as the sequence becomes longer. LSTMs help with this, but the trade-off does have downsides we'll see later.

They used two LSTM layers, each with 512 hidden units. According to [reddit](https://www.reddit.com/r/MachineLearning/comments/87djn7/d_what_is_meant_by_number_of_hidden_units_in_an/), this 512 refers to the dimensionality of the hidden state ([Keras documentation](https://keras.io/layers/recurrent/) supports this answer). Dropout of 0.2 was added after each LSTM layer.

The model was put together using *Keras*. The loss function was categorical cross-entropy and the optimizer was Adam, which performed just as well as SGD with Nesterov momentum.

The prediction is stochastic: for each time step $t$, the network outputs a probability for every state. The system also had a diversity parameter $\alpha$ in the prediction stage, which suppresses $(\alpha < 1)$ or encourages $(\alpha > 1)$ the diversity of prediction by re-weighting the probabilities. This is done using the formula:  
$\hat{p}_i=e^{\log(p_i)/\alpha}$, where $p_i$ is the probability of the $i$-th state.  
Finally, one of the states is selected using the re-weighted probabilities.
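
To make the re-weighting concrete for myself, here is a minimal sketch in plain Python. The function names (`reweight`, `sample_state`) are mine, and the renormalisation step is my assumption: the formula above only gives the unnormalised weights, but they have to sum to 1 before a state can be sampled. This is not taken from the authors' code.

```python
import math
import random

def reweight(probs, alpha):
    """Apply p_i -> exp(log(p_i) / alpha), then renormalise (assumed step)."""
    scaled = [math.exp(math.log(p) / alpha) if p > 0 else 0.0 for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_state(probs, alpha, rng=random):
    """Draw one state index from the re-weighted distribution."""
    weights = reweight(probs, alpha)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

probs = [0.7, 0.2, 0.1]
sharp = reweight(probs, 0.5)  # alpha < 1: the most likely state dominates more
flat = reweight(probs, 2.0)   # alpha > 1: probabilities move closer together
```

With $\alpha = 1$ the distribution is unchanged, which is a handy sanity check.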

The same size and number of layers was kept for both the char- and word-RNN experiments. The [code](https://github.com/keunwoochoi/lstm_real_book) is available online. I'll look through it once I'm done with the paper.

## Case Study: Chord Progressions

### Representation

They avoid vector representations of the chords and instead use only the text representations. They filled in each chord for all quarter notes during the chord and that's the dataset. Ex:  

F:9 F:9 F:9 F:9 D:min7
D:min7 G:9 G:9 C:maj C:maj
F:9 F:9 C:maj C:maj C:maj
C:maj

Their dataset included 2,486 scores from real and fake books. They transposed everything to C and filled in the duration of chords per quarter note. All song data was appended using _START_ and _END_ flags.

Even when transposed to C, only 867 scores end in C:maj, 489 in G:7, 186 C:maj6, 52 F:maj, and 1,252 in other chords. There were 1,259 unique chords in the training data, which gave the word-RNN a vocab size of 1,259. There were 39 distinct characters, 539,609 chords, and 3,531,261 characters total.
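
A tiny sketch of how the same chord text turns into two different token streams. The example string is made up from the chords shown above, and treating the space as a character token is my assumption about the char-level setup.

```python
text = "_START_ F:9 F:9 F:9 F:9 D:min7 D:min7 G:9 G:9 C:maj C:maj _END_"

# word-RNN view: one token per chord symbol (or flag)
word_tokens = text.split()
# char-RNN view: one token per character, spaces included (assumption)
char_tokens = list(text)

word_vocab = sorted(set(word_tokens))
char_vocab = sorted(set(char_tokens))
```

The char-level sequence is several times longer than the word-level one for the same music, which is the longer-dependency trade-off mentioned in the Architecture section.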

### Results

The system was set to output a chord progression for each diversity parameter after every iteration. Both the word- and char-RNNs showed well structured results. They learned local structures of chords and bars as well as the _START_ and _END_ tags.

After enough training, both results showed chord progressions from real jazz grammar. In the char-RNN, this included ii-V-I, passing chords, modal interchange chords, and substitutions (B:7 for F:7). For the word-RNN, these included modal interchanges, circle of fifths (Eb:sus - Gb:maj6 - B:maj7), and descending bass.

The approaches differed slightly in output, as the word-RNN appeared to prefer more conventional progressions than the char-RNN. This may be caused by the different effective lengths of the approaches, as the char-RNN effectively has a shorter memory span.

## Conclusion

They used text-based LSTMs to generate chord progressions based on jazz real books. Both word-RNNs and char-RNNs worked in this use case. Their usage of a diversity parameter gives composers a useful tool to dial in desired effects.

# Source 2: [Charlie Parker's Omnibook Data](https://members.loria.fr/KDeguernel/omnibook/)

Just a note that this is the dataset we used. Citation below:

Using Multidimensional Sequences For Improvisation In The OMax Paradigm
Ken Déguernel, Emmanuel Vincent, Gérard Assayag
13th Sound and Music Computing Conference, Aug 2016, Hamburg, Germany. 〈http://quintetnet.hfmt-hamburg.de/SMC2016/〉

# Source 3: [A First Look at Music Composition using LSTM Recurrent Neural Networks](http://people.idsia.ch/~juergen/blues/IDSIA-07-02.pdf)

**Abstract**  
Music composed by RNNs typically suffers from a lack of global structure. Note-by-note transitions or phrase reproduction might succeed, but overall musical form eludes RNNs' grasp. Luckily, LSTMs can overcome the long-term dependency issues and create music that's surprisingly pleasing in the blues style.

## 1. Introduction

A most basic attempt at composition with RNNs might consider single-step predictions. The network can then be seeded with just one example and attempt to compose from there. To state the obvious, feed-forward networks could never perform such a task, as they can't store information about the past. RNNs overcome this basic limitation, at least.

RNNs still struggle with this task, unfortunately. Mozer, in 1994, wrote of his RNN's compositions that "While the local contours made sense, the pieces were not musically coherent, lacking thematic structure and having minimal phrase structure and rhythmic organization." This problem likely links to vanishing gradients, as in both BPTT and RTRL error flow tends to explode or vanish.

Of course music requires this overall contour Mozer mentions. At the most basic form, early rock-and-roll might consist of four-bar phrases which easily become 32 events or more as the time step is defined as eighth notes. Before this paper, Mozer's single-note melodies were the most relevant work. Mozer had used RNNs with BPTT, probabilistic output values, and a psychologically realistic encoding that gave bias towards chromatically and harmonically related notes. He also encoded events in fewer time steps by treating each note as a single time step, whatever its duration. Even still, the architecture failed to capture global musical structure.

Mozer suggested that for a note-by-note method to work, it requires a network which can induce structure at multiple levels. This paper offers that network.



## 2. An LSTM Music Composer

### LSTM Overview

To summarize LSTMs, they're designed to obtain constant error flow through time so that nothing explodes or vanishes. LSTMs use linear units called Constant Error Carousels (CECs), apparently. Really the CEC is just another way to represent the continuing cells we otherwise get when we consider time steps.

The gates which control information into the CEC include:
1. Multiplicative **Input Gate** - Learns to protect the information passed into the CEC by rejecting irrelevant inputs.
2. Multiplicative **Output Gate** - Learns to protect other areas from currently irrelevant memory contents of the CEC.
3. **Forget Gate** - Learns to reset memory cells when their content is obsolete.
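
To see how the three gates act on the CEC, here is a scalar sketch of one memory-cell update in the commonly used LSTM form. The function and argument names are mine, the `tanh` squashing is an assumption (the paper's exact squashing functions are not in these notes), and a real cell would compute the gate inputs from weighted sums rather than receive them directly.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cell_step(c_prev, candidate, in_gate_net, forget_gate_net, out_gate_net):
    """One time step of a single memory cell; all arguments are scalars."""
    i = sigmoid(in_gate_net)      # input gate: how much new input enters the CEC
    f = sigmoid(forget_gate_net)  # forget gate: how much old state is kept
    o = sigmoid(out_gate_net)     # output gate: how much state is exposed
    c = f * c_prev + i * math.tanh(candidate)  # CEC update
    h = o * math.tanh(c)                       # gated cell output
    return c, h

# Input gate shut, forget gate open: the stored value is carried over.
kept, _ = cell_step(0.8, 5.0, -20.0, 20.0, 0.0)
# Forget gate shut as well: the cell is reset.
reset, _ = cell_step(0.8, 0.0, -20.0, -20.0, 0.0)
```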

Learning is then done using a modified BPTT and a customized version of RTRL. That is, output units use BPTT, output gates use truncated BPTT, and the input and forget gates use truncated RTRL.

### Data Representation

This paper represented the data in simple local form, using one input per note, with 1 as on and 0 as off. In later experiments, they adjusted input units to have a mean of 0 and a standard deviation of 1 (???). This left it to the network to develop a bias towards chromatic or harmonic notes. Their reasons follow:
1. Implicitly multivoice and makes no distinction between chords and melodies. (They implemented chords simply by including them in the single input vector)
2. It's easy to generate probability distributions over the set of possible notes, as they can treat single notes as independent or dependent.

Time was represented by having each input vector stand for one slice of real time. The stepsize of each time step of course can vary; they chose an eighth note as theirs. This is preferable for LSTMs, as the network needs to learn relative durations of notes so as to create rhythm and counting.

This representation does ignore two issues:
1. There's no indication where a note ends. Thus eight eighth notes in a row of the same pitch are equivalent to four quarters or a whole note. One way to overcome this could be to decrease the stepsize of quantization and mark note endings with a zero (???).
2. A second method was suggested by [Todd in 1989](https://pdfs.semanticscholar.org/81e0/a57abde1bf2cc4b7cd772e0573e92069e8ef.pdf). This used special units in the network to indicate the beginning of notes. These authors were unsure how that might scale to multi-voice melodies.

The LSTM was given a range of 12 notes for the chords and 13 notes for melodies. These were simply two sections of the vector but otherwise were represented no differently.

### Training Data

This experiment used a form of 12-bar blues with a quantization step of 8 notes per bar. Each song was thus 96 time steps long. The same chords were used in every song, and they were inverted so as to fit into the 12 note options for the chords. Experiment 1 used only the chords whereas Experiment 2 also included a single melody line based on the pentatonic scale.
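
A sketch of what one chord-only song looks like in this representation: 96 time steps, each a 25-unit on/off vector (12 chord units followed by 13 melody units). The chord form and the triad pitch sets below are hypothetical stand-ins, since these notes don't record the paper's exact chords or voicings, and keeping the melody units silent is my assumption for the chord-only case.

```python
STEPS_PER_BAR = 8
CHORD_UNITS = 12
MELODY_UNITS = 13

# Hypothetical pitch-class sets and form, used only for illustration.
CHORDS = {"C": {0, 4, 7}, "F": {0, 5, 9}, "G": {2, 7, 11}}
FORM = ["C", "C", "C", "C", "F", "F", "C", "C", "G", "F", "C", "C"]

def chord_roll(form):
    """Return one list of 0/1 input vectors, one vector per eighth-note step."""
    roll = []
    for name in form:
        frame = [1 if pc in CHORDS[name] else 0 for pc in range(CHORD_UNITS)]
        frame += [0] * MELODY_UNITS  # melody units left off
        roll.extend(list(frame) for _ in range(STEPS_PER_BAR))
    return roll

roll = chord_roll(FORM)
```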

Training melodies were constructed by concatenating bar-long segments of music by the first author to fit each chord. The datasets were then constructed by concatenating random samples from the $(n=2^{12}=4096)$ possible songs. The melodies included no rests and were only quarter notes.

## 3. Experiment 1 - Learning Chords

Can an LSTM reproduce a musical chord structure? The motivation here is to ensure that the LSTM doesn't require melody.

### Network Topology and Hyperparameters

The chords used were the same 12-bar blues. The network had four cell blocks containing 2 cells each, fully connected to each other and to the input layer. The output layer was fully connected to all cells and the input layer. Forget, input, and output gate biases for the four blocks were set to -0.5, -1.0, -1.5, and -2.0. This causes the blocks to come online one by one. Output biases were set to 0.5, the learning rate 0.00001, and momentum rate 0.9.

Weights were burned after every timestep, as experiments showed that learning was faster if the network was reset after making one or a few gross errors. Resetting went as follows:  
On error, burn existing weights, reset input pattern and clear partial derivatives, activations, and cell states. This was similar to Gers and Schmidhuber's (2000) approach. The output function was the logistic sigmoid with range [0,1].

### Training/Testing

The goal was to predict the probability for a given note to be on or off. For predicting probabilities, RMSE is not appropriate and thus the network was trained using cross-entropy as the objective function. The error function $E_i$ for output activation $y_i$ and target $t_i$ was:  
$E_i=-t_i\ln(y_i)-(1-t_i)\ln(1-y_i).$

This gives a $\delta$ term at the output layer of $(t_i-y_i)$. See [Joost and Schiffmann](https://www.worldscientific.com/doi/abs/10.1142/S0218488598000100) (1998) for what exactly this means. Note that they treat the outputs as statistically independent of one another. Even if this assumption is untrue, it allows the network to predict chords and melodies in parallel (as well as multi-voice melodies).
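
A quick numerical check of where $(t_i-y_i)$ comes from, assuming the logistic sigmoid output mentioned earlier and reading $\delta$ as the negative derivative of $E_i$ with respect to the unit's net input (that sign convention is my assumption). The helper names are mine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(t, y):
    return -t * math.log(y) - (1 - t) * math.log(1 - y)

def numeric_delta(t, z, eps=1e-6):
    """Negative slope of E with respect to net input z (central difference)."""
    up = cross_entropy(t, sigmoid(z + eps))
    down = cross_entropy(t, sigmoid(z - eps))
    return -(up - down) / (2 * eps)

z, t = 0.3, 1.0
y = sigmoid(z)
analytic = t - y               # the delta term from the text
numeric = numeric_delta(t, z)  # should agree up to numerical error
```

The sigmoid's derivative $y(1-y)$ cancels against the denominators of the cross-entropy derivative, which is why the error signal stays this simple.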

The testing started the network with the inputs from the first timestep and used network predictions for ensuing time steps. The decision threshold for chord notes was 0.5. Training ended once the network spit back the correct chord sequence in entirety.

### Results

In the end, the network could accomplish this task with a variety of learning rates and momentum rates. The network was able to generate continuing cycles of the progression as well. This result is not surprising, honestly. Learning took anywhere from 15 to 45 minutes on a 1 GHz Pentium.

## 4. Experiment 2 - Learning Melody and Chords

In this experiment, both parts were learned. They continued learning until the chord structure was learned and cross-entropy was relatively low. Once learning ended, the network was seeded with a note or a series of notes and then allowed to compose freely. The goal was to test if LSTMs could learn chord/melody structure and then use that when composing new songs.

### Network Topology and Hyperparameters

In this case, some cell blocks processed chord information and some melody information. Eight cell blocks with two cells each were used. Four blocks were fully connected to the inputs for melody. The chord cell blocks had recurrent connections to themselves and the melody blocks, whereas the melody blocks were only recurrently connected to other melody blocks. Another way to say this is that melody information never reaches chord cell blocks.

At the output layer, output units for chords are fully connected to chord blocks as well as chord inputs. The melody outputs were fully connected to melody blocks and melody inputs. This is discussed in Section 5.

Forget gate, input gate, and output gate biases were set to -0.5, -1.0, -1.5, and -2.0 for the chord and melody blocks. All other parameters were identical to Experiment 1.

### Training/Testing

The goal was to predict the probability of a given note being on or off. For chords, the same 0.5 decision threshold was used. The melody was restricted to choosing one note per time step. This is done by making the melody outputs sum to 1 and then using a uniform random number in [0,1] to pick the next note. Again, this is discussed in Section 5.
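
The two decision rules can be sketched as follows. The function names are mine, and treating an output exactly at the threshold as "on" is an arbitrary choice on my part.

```python
import random

def pick_melody_note(outputs, rng=random):
    """Normalise the melody outputs to sum to 1, then pick one index
    by comparing a uniform draw against the running total."""
    total = sum(outputs)
    probs = [o / total for o in outputs]
    r = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off

def chord_notes_on(outputs, threshold=0.5):
    """Independent on/off decision for each chord output."""
    return [1 if o >= threshold else 0 for o in outputs]
```

So chords can have several notes on at once, while the melody always gets exactly one note per time step.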

The network was trained until it learned the chord structure and objective error plateaued. The network was then used to compose music by providing a single note or series of notes and then presenting the network outputs as inputs to the next time step. No algorithmic or statistical method was used to evaluate the output music's quality.

### Results

The LSTM indeed composed blues music. It fully learned the chord structure and then used that to constrain its melody output. They urge the reader to visit [their site](https://people.idsia.ch/~juergen/blues/) (Just delete the s in https) to judge the LSTM's music themselves. The network's results do far better than randomly stumbling around the pentatonic scale. One limitation is that the chord progression is fundamentally the same every time, but the compositions are indeed remarkable nonetheless.

## 5. Discussion

The experiments were clearly successful, and to the authors' knowledge, theirs was the first successful use of LSTMs to compose globally coherent music. They acknowledge that more research is needed to see whether the LSTM can deal with more difficult tasks.

### Training Data

The chord structure, for one, was uniform across the training set. Thus this experiment might be more accurately defined as learning to solo over a predefined form. Their time resolution was also quite coarse, as 8 steps per measure is much easier than, say, 32 time steps per whole note. (My approach plans on 48 steps per whole note)

### Network Architecture

The network's connections were divided between chords and melody, with chords influencing melody but not vice-versa. This choice is justified by the fact that a soloist follows the chord structure supplied by a rhythm section. This choice does presume, though, that we know how to segment input chords from melodies. With jazz sheet music, though, changes are provided separately from melodies and so this is no huge problem. That said, classical music as well as audio signals mix the two.

"Much more research is warranted." Comparing BPTT and RTRL, as well as other RNN variants, would help support a claim about LSTM's relative effectiveness. A more interesting training set might also allow for more interesting compositions. Finally, research circa 2002 led the authors to believe that LSTM works better using a Kalman filter to update weights.

The current architecture is also limited to mere symbolic representations. If it were modified to work with MIDI or audio, it could be used for interactive improvisation. This would require dealing with temporal noise, but research from oscillator beat tracking models (Eck) to LSTM might help calm this noise.

## 6. Conclusion

Their LSTM model successfully learned the blues form and was able to improvise over it. The LSTM could learn the form without melody, and could also learn to compose new melodies along with the chords. "Much more work is warranted." They demonstrated that an RNN can capture local melody structures and long-term structures, which represents an advance for neural network music composition.

-Fin 1/10/2019-

# Source 4: [Learning to Create Jazz Melodies Using Deep Belief Nets](https://www.cs.hmc.edu/~keller/jazz/improvisor/ICCCX-Bickerman-Bosley-Swire-Keller.pdf)

**Abstract** - This paper describes an unsupervised learning technique which automatically creates jazz improvisation over chord sequences. They trained deep belief nets, which are based on restricted Boltzmann machines. They present their encoding scheme, the specifics of learning, and the creation process for their resulting music. Their model created novel jazz licks and should be regarded as a feasibility study for whether such networks could be used at all. Clearly, maybe.

## 1. Introduction

Due to the structural nature of chord progressions, it's feasible that a machine could be taught to emulate human jazz improv. This could be done by stating rules (jazz grammars), but these risk losing the flexibility or fluidity of jazz. Instead of giving an algorithm rules, what if we give it stylistic examples of what we'd like to hear more of? It can determine the features present on its own.

Their approach was **deep belief networks**, or DBNs. These are multi-layer restricted Boltzmann machines (RBMs), which are a type of stochastic (a given input gives a somewhat random output) neural network. They merely focus on the creation of melodies, not at all in a real-time collaborative setting. This is application-based, nothing more.

They believed DBNs might work due to recent work of Hinton, as of 2010. DBNs learn to recognize by creating examples (as bit vectors) and comparing those to training examples. The model then adjusts its parameters to produce more similar examples and is thus unsupervised. This is similar to the way humans emulate behavior. The stochastic part of DBNs helps them to achieve novelty, and thus almost creativity. Their goal is to create interesting melodies, as opposed to Hinton's goal of recognizing them.

## 2. Restricted Boltzmann Machines

The **restricted Boltzmann machine** (RBM) was introduced by Smolensky in 1986 and developed by Hinton in 2002-2007. It has two layers of neurons: one visible and one hidden layer. The two layers are fully connected with symmetric, bi-directional weights.

One training cycle for the algorithm takes a binary data vector as input, activating the visible neurons to match the input data. It then alternates activating the hidden nodes based on the visible nodes and vice versa. Each node is activated probabilistically based on a weighted sum of the nodes connected to it. Because nodes within a layer are not connected to other nodes in the layer, the activation of each layer only depends on the other layer. Once the network stabilizes, the new configuration of visible nodes can be read as the output.

The goal of an RBM is to learn the features of sets of data sequences. This paper implemented the *contrastive divergence* (CD) algorithm based on [Hinton's work](http://www.cs.toronto.edu/~fritz/absps/tr00-004.pdf). Their implementation was modeled on a [tutorial by Radev](http://imonad.com/rbm/restricted-boltzmann-machine/). CD enabled inexpensive training given the large number of nodes and weights in their network. Once trained, an RBM can take random data sequences and generate new sequences which emulate the features of the training data.
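
For my own reference, a minimal sketch of one CD-1 update on a single binary vector, written in plain Python. This is the textbook single-step variant and not the paper's Java implementation; the function names, learning rate, and toy sizes are all assumptions of mine.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_layer(inputs, weights, biases, rng):
    """Activate one layer from the other. weights[i][j] links input i to unit j.
    Returns (activation probabilities, sampled binary states)."""
    probs = []
    for j in range(len(biases)):
        net = biases[j] + sum(inputs[i] * weights[i][j] for i in range(len(inputs)))
        probs.append(sigmoid(net))
    states = [1 if rng.random() < p else 0 for p in probs]
    return probs, states

def transposed(matrix):
    return [list(row) for row in zip(*matrix)]

def cd1_update(v0, W, b_vis, b_hid, lr=0.1, rng=random):
    """One CD-1 step; updates W, b_vis and b_hid in place, returns the reconstruction."""
    h0_probs, h0 = sample_layer(v0, W, b_hid, rng)              # visible -> hidden
    _, v1 = sample_layer(h0, transposed(W), b_vis, rng)         # hidden -> visible
    h1_probs, _ = sample_layer(v1, W, b_hid, rng)               # visible -> hidden again
    for i in range(len(v0)):
        for j in range(len(b_hid)):
            # data-driven correlation minus reconstruction-driven correlation
            W[i][j] += lr * (v0[i] * h0_probs[j] - v1[i] * h1_probs[j])
        b_vis[i] += lr * (v0[i] - v1[i])
    for j in range(len(b_hid)):
        b_hid[j] += lr * (h0_probs[j] - h1_probs[j])
    return v1

rng = random.Random(0)
n_vis, n_hid = 6, 3
W = [[rng.gauss(0, 0.01) for _ in range(n_hid)] for _ in range(n_vis)]
b_vis, b_hid = [0.0] * n_vis, [0.0] * n_hid
recon = cd1_update([1, 0, 1, 0, 1, 0], W, b_vis, b_hid, rng=rng)
```

The cheapness comes from stopping after one up-down-up pass instead of waiting for the network to stabilize.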

A single RBM can learn some patterns in training data, but multiple RBMs layered together are even more powerful. Such a machine is called a **deep belief network**. The RBMs are combined by identifying each one's hidden layer as the visible layer of the next.  
![DBN Graphic](https://i.imgur.com/TiMcbrv.png)

The second RBM learns features about the features learned by the first, and so on. Thus the DBN can learn far more intricate patterns than a single RBM could alone.

## 3. Data Representation

To train RBMs on musical data, the authors had to encode the music as *bit vectors*, with each beat divided into *slots*. They chose 12 slots per beat to allow both 16th notes and triplets. Each slot ended up as a block of 30 bits, with 12 chord bits and 18 melody bits.

For the melody, they one-hot encoded the twelve note options as twelve bits and then used four more bits to specify one of four octaves. One bit indicated a sustained note and one represented rests. When a note is attacked, only its pitch and octave bits are on. If it's being sustained, only the sustain bit is on. Doing their octaves like this saved training time by reducing the number of bits quite a lot.  
![Pitch Encoding](https://i.imgur.com/m7CE4xG.png)

Each chord was encoded as 12 bits, one per possible pitch, with a bit on for every pitch in the chord (so several bits can be on at once). These were simply on-off bits. The melody and chord vectors were concatenated to form the part of the input corresponding to one slot. Hopefully, the machine learns to associate specific chords with melodic features.
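
A sketch of one 30-bit slot under this scheme. The bit order (chord, pitch, octave, sustain, rest) and the function name are my assumptions; the notes above only fix the counts.

```python
def encode_slot(chord_pcs, pitch_class=None, octave=None, sustain=False, rest=False):
    """Build one slot: 12 chord bits, 12 pitch bits, 4 octave bits, sustain, rest."""
    chord = [1 if pc in chord_pcs else 0 for pc in range(12)]
    pitch = [0] * 12
    octave_bits = [0] * 4
    if not sustain and not rest:
        # a newly attacked note: exactly one pitch bit and one octave bit
        pitch[pitch_class] = 1
        octave_bits[octave] = 1
    return chord + pitch + octave_bits + [int(sustain), int(rest)]

dmin7 = {2, 5, 9, 0}  # D, F, A, C as pitch classes
attack = encode_slot(dmin7, pitch_class=9, octave=1)
held = encode_slot(dmin7, sustain=True)

# 4 beats x 12 slots x 30 bits, plus one bias unit
input_size = 4 * 12 * len(attack) + 1
```

This also reproduces the 1441 input nodes quoted in the Results section.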

## 4. Training Data

The network was initially trained on a small set of children's melodies. These were all in the same key and consisted of simple rhythms and notes all in their respective chords. Once they had trained a model to create these melodies, they moved on to larger networks and jazz.

Their main dataset was a large corpus of 4-bar jazz licks (short coherent melodies) cycling over the common $\text{ii}-\text{V}-\text{I}-\text{VI}^7$ turnaround progression in a single key. The $\text{ii}-\text{V}-\text{I}$ cadence is common to jazz, and the $\text{VI}^7$ is the dominant to the next $\text{ii}$ chord. The licks were either transcribed from notable jazz solos or hand constructed, some with help from a "lick generator" from the [Impro-Visor software](https://www.cs.hmc.edu/~keller/jazz/improvisor/). 

## 5. Learning Method

Their goal was to create melodies which transitioned between chords in the progression. To aid this effort, they trained the model on windows of one measure each which cycled forward one beat at a time. These measure-long snapshots continued until the end of the 4-bar lick. In this way, a single training example was broken into 13 overlapping 1-bar windows.
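
The count of 13 falls out of a sliding window, assuming 4 beats per bar (so a 4-bar lick has 16 beats). The helper name and the integer placeholders standing in for encoded beats are mine.

```python
def beat_windows(beats, window=4, step=1):
    """Slide a window of `window` beats along the lick, `step` beats at a time."""
    return [beats[i:i + window] for i in range(0, len(beats) - window + 1, step)]

lick = list(range(16))  # one placeholder per beat of a 4-bar lick
windows = beat_windows(lick)
```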

For creating melodies, they started the model with a "seed" of specified chord bits of the desired progression and random melody bits. The chord bits are *clamped* so that they cannot be modified during the creation cycle. In creating the melody, they use a process similar to the windowing used for training. They generate the first few beats of the melody, clamp their bits, and then shift down the melody and chords to make room for the next beat. Thus the machine generates one beat at a time but uses clamped chords and clamped beats of preceding melody.  
![Clamping Process](https://i.imgur.com/k6aZaIG.png)

During the final activation of the network's visible layer (which will be the newly generated melody), they constrain the activation of bits in a specific way. Each slot is looked at individually and only the highest probability of pitch and octave is used. Thus the machine chooses one of sustaining, resting, or starting a new pitch. This approach was found to create a variety of melodies while resonating well with the chords.
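
One way this constraint could work, as a sketch only: I assume the 18 melody activations of a slot are laid out as 12 pitch, 4 octave, sustain, rest, and that sustain, rest, and the best pitch compete directly on probability. The paper's exact tie-breaking rule isn't in these notes, so the comparisons below are my guess.

```python
def decode_melody_slot(probs):
    """Turn 18 melody activations into a valid bit pattern:
    either sustain, or rest, or one pitch bit plus one octave bit."""
    pitch, octave = probs[:12], probs[12:16]
    sustain, rest = probs[16], probs[17]
    best_pitch = max(range(12), key=lambda i: pitch[i])
    bits = [0] * 18
    if sustain >= max(pitch[best_pitch], rest):
        bits[16] = 1
    elif rest >= pitch[best_pitch]:
        bits[17] = 1
    else:
        best_octave = max(range(4), key=lambda i: octave[i])
        bits[best_pitch] = 1
        bits[12 + best_octave] = 1
    return bits
```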

They also tested if the machine could learn the progression in arbitrary keys and thus included the option to transpose each input into different keys. All functionality mentioned was implemented in a stand-alone tool they call "[RBM-provisor](https://sourceforge.net/projects/rbm-provisor/)." The tool was written in Java and supports input/output in the leadsheet format from Impro-Visor.

## 6. Results

Their first experiments used children's melodies with 2-layer networks trained for 100 epochs. The results were successful, fitting chords well and flowing melodically. After this success, they moved on to jazz.

For the jazz creation networks, they experimented with the model, varying the number of layers, number of neurons per layer, number of training epochs, and other factors. They settled on a 3-layer network with 1441 input nodes (4 beats with 12 slots each and 30 bits per slot, plus one bias node) and 750, 375, and 200 hidden nodes respectively. Typical training lasted 250 epochs on about 100 4-bar licks. This took about nine hours on an inexpensive desktop computer.

When analyzing their generated music, they found that most notes were in the chord, with some color tones. Foreign tones were quite rare. Created melodies tended to avoid large jumps and rarely jumped octaves.

![Output Example](https://i.imgur.com/P0ZYODm.png)

The training method was tested with transpositions by training on four copies of the inputs, each transposed up 0, 1, 2, or 3 semitones. The machine was still able to create chord-compatible music regardless of the seed chords. They didn't test with more than 4 transpositions due to training time.
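
For 12-bit pitch vectors, transposing up by some number of semitones is just a rotation. A small sketch (the function name is mine; it ignores the octave bits, which would need a carry when a melody pitch wraps past the top of the octave):

```python
def transpose_bits(bits12, semitones):
    """Rotate a 12-bit pitch vector so pitch p moves to (p + semitones) % 12."""
    k = semitones % 12
    return bits12[-k:] + bits12[:-k] if k else list(bits12)

c_major = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # C, E, G
# four training copies, shifted up 0, 1, 2 and 3 semitones
copies = [transpose_bits(c_major, n) for n in range(4)]
```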

To justify the importance of being able to transpose, jazz chord progressions often have abrupt implied key changes. Thus relative transpositions quickly become important. As an example, it's more economical to train on all transpositions of a $\text{ii}-\text{V}-\text{I}$ than all possible contexts of any one version of the progression.

They noticed some differences between the training data and their generated music:
1. Generated licks tended to avoid half-step intervals. This avoidance of approach notes tended towards chord tones.
2. Rhythmically, the outputs almost entirely consisted of duplet rhythms.

An alternative approach was attempted to select bits using the neuron probability distribution, as opposed to the maximum probability. This produced the disjointed melodies seen above. They also tried encoding strong/weak beat information, but these results were not any better than the chosen encoding.

To be clear, DBNs are not the authors' first choice for lick generation. The approach used by Impro-Visor is superior to this one, by far. DBNs also take a while to train. It is possible, though, that DBNs avoid as much algorithmic bias as unsupervised approaches using clustering and Markov chains.

## 7. Future Work

These results are promising, certainly, but further improvements could be made. Their model mainly output duplet rhythms, despite triplets in the training data. This is likely due to an overshadowing of the triplets, but no alternative note generation rule yet found yields as coherent results.

The generated music also included a disproportionate number of repeated notes, which sound a bit static and immobile. They did try post-processing generated music to merge repeated notes, but better solutions would prevent the machine from producing the notes in the first place. Perhaps a different encoding?

They also believe that a similar method could be used to infer chords. Theirs took chords as an input and produced suitable melodies. Another DBN work might be able to determine possible chord progressions for a given melody.

## 8. Related Work

Rossen Radev's tutorial on RBM implementation is again cited. Previous work into music generation has largely already been mentioned in this notebook above.

## 9. Summary

These results show that DBNs can learn to create jazz licks. The approach can work in multiple keys and suggests that, given a significant dataset, similar models might be able to solo over full 12-bar progressions. Their results show novelty but do have minor limitations in rhythm and pitch repetition. The possibility of using DBNs in this way was proven, and further work is likely to be done.

-Fin 1/12/19-

# Source 5: [Composing a melody with LSTM RNNs](https://pdfs.semanticscholar.org/f707/ff253dc44ffa1e15f7ad19d75473a3ddecac.pdf)

This source, unlike the others, is 56 pages in full. As such, I'm going to read through it a bit differently by reading full sections and then looping back to take notes.

## 1. Introduction

This paper's primary goal is to see to what extent LSTM networks can be used to generate melodies over chord progressions. It will also explain similar prior attempts, go over the algorithms used, explain the implementations for this particular result, and evaluate the generated music by having human subjects judge it against music composed by humans.

## 2. State of the Art in Algorithmic Composition

There are approaches dating back to 1024 AD for algorithmic composition, but the advent of computers enabled many more possibilities.  
Continue on page 8.