In [None]:
from pathlib import Path

# Folk RNN Notes

* Runs on Python 2.7 (unsupported since Jan 1st 2020)
* Neural net packages are Theano + Lasagne

## Data

Extract from [data/data_v2](https://github.com/IraKorshunova/folk-rnn/blob/master/data/data_v2), which is used for training:
```
M:9/8
K:Cmaj
G E E E 2 D E D C | G E E E F G A B c | G E E E 2 D E D C | A D D G E C D 2 A | G E E E 2 D E D C | G E E E F G A B c | G E E E 2 D E D C | A D D G E C D 2 D | E D E c 2 A B A G | E D E A /2 B /2 c A B 2 D | E D E c 2 A B A G | A D D D E G A 2 D | E D E c 2 A B A G | E D E A /2 B /2 c A B 2 B | G A B c B A B A G | A D D D E G A B c |

M:4/4
K:Cmin
f B B c f B c c | f B B c a f e c | f B B c f B c c | A 2 B c a f e c :| f 3 e f g a 2 | f 3 g a f e c | f 3 e f g a 2 | A 2 B c a f e c | f 3 e f g a 2 | f 3 g a f e c | f g a b a f a 2 | A B c A a f e c |

M:4/4
K:Cdor
|: c C C /2 C /2 C G C C /2 C /2 C | c C C /2 C /2 C B 2 A B | c C C /2 C /2 C G F =E G | F D B, D F G A B | c C C /2 C /2 C G C C /2 C /2 C | c C C /2 C /2 C B 2 A B | c A A G A c B G |1 F D B, D F G A B :| |2 F D B, D F 2 E F |: G 2 E G C G E G | G F G A B 2 A B | G 2 E G C G E G | F 2 D F B, F D F | G 2 [ E G ] 2 [ C G ] 2 [ E G ] G | G F G A B 2 A B | c A A G A c B G |1 F D B, D F 2 E
...
```

The input data is tokenized simply by splitting on whitespace. Other observations:
* All tunes are 3 lines followed by a blank line (such that they can be split by '\n\n')
* Everything has been transposed into C (in various modes - dorian, major, minor, and mixolydian)
* I think the following, judging by which symbols are absent:
    * slurs, ties, and staccato have been removed, since there is neither `-` nor `(` (other than `(` followed by a number, i.e. a tuplet)
    * all rests have been removed, since there is neither `Z` nor `z`
    * all grace notes have been removed, since there is neither `~` nor `{}`
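
The structure described above can be checked with a small sketch. It uses two tunes abbreviated from the extract (first two bars only) rather than the full file, and `parse_tune` and `suspects` are names made up for this illustration. Note that it separates the `M:` and `K:` lines from the body for readability, whereas plain whitespace splitting treats those header lines as tokens too.

```python
# Two tunes abbreviated from the extract above (first two bars of each only).
sample = (
    "M:9/8\n"
    "K:Cmaj\n"
    "G E E E 2 D E D C | G E E E F G A B c |\n"
    "\n"
    "M:4/4\n"
    "K:Cmin\n"
    "f B B c f B c c | f B B c a f e c |"
)


def parse_tune(tune):
    """Split one 3-line tune into (meter, key, body tokens)."""
    meter, key, body = tune.strip().split("\n")
    return meter, key, body.split()


# Tunes are separated by a blank line, so '\n\n' splits them apart.
parsed = [parse_tune(t) for t in sample.split("\n\n")]
for meter, key, tokens in parsed:
    print(meter, key, len(tokens), tokens[:5])

# Symbols that, per the notes above, appear to be absent from the data.
suspects = {"-", "(", "z", "Z", "~", "{", "}"}
found = suspects & {tok for _, _, toks in parsed for tok in toks}
print("suspect tokens present:", found or "none")
```

Running the same check over the full `data` string would test the "I think" items above against the whole file rather than a two-tune sample.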

In [None]:
data_path = Path("~", "git", "folk-rnn", "data", "data_v2").expanduser()
with open(data_path, 'r') as f:
    data = f.read()

In [None]:
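# Vocabulary: every distinct whitespace-separated token, plus start/end markers
tokens_set = set(data.split())
start_symbol, end_symbol = '<s>', '</s>'
tokens_set.update({start_symbol, end_symbol})
# Sort so that token indices are reproducible across runs
# (the iteration order of a set of strings can differ between sessions)
idx2token = sorted(tokens_set)
vocab_size = len(idx2token)
print(f"vocabulary size: {vocab_size}")
print(f"vocabulary (each token separated by a space): \n{' '.join(sorted(tokens_set))}")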

In [None]:
token2idx = {token: idx for idx, token in enumerate(idx2token)}
tunes = data.split('\n\n')
print(f"number of tunes: {len(tunes)}")
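
With both lookup tables in place, a tune can be turned into a sequence of integer indices and back. The sketch below is self-contained and builds a tiny vocabulary from one abbreviated tune from the extract. Wrapping each tune in the start and end symbols is an assumption about how sequences would be fed to the network, and `encode` and `decode` are hypothetical helper names.

```python
# One tune abbreviated from the extract (first two bars only).
sample = "M:4/4\nK:Cmin\nf B B c f B c c | f B B c a f e c |"

start_symbol, end_symbol = "<s>", "</s>"
# Sorted so that the index assigned to each token is deterministic.
idx2token = sorted(set(sample.split()) | {start_symbol, end_symbol})
token2idx = {tok: i for i, tok in enumerate(idx2token)}


def encode(tune):
    """Map a tune to a list of indices, wrapped in start/end symbols (assumed)."""
    tokens = [start_symbol] + tune.split() + [end_symbol]
    return [token2idx[t] for t in tokens]


def decode(indices):
    """Map a list of indices back to a space-separated token string."""
    return " ".join(idx2token[i] for i in indices)


encoded = encode(sample)
print(encoded)
print(decode(encoded))
```

Decoding joins tokens with single spaces, so the original line breaks between the `M:`, `K:`, and body lines are not restored.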

To match this format, the 365 dataset will therefore require the following preprocessing:
* transpose all data to C
* remove all metadata except `M:` (meter) and `K:` (key)
* remove all slurs, ties, and staccato
* remove all rests
* remove all grace notes
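
Most of the steps above can be sketched with a few regular expressions. This is a rough sketch under stated assumptions, not a complete ABC parser: the input tune is invented for illustration, `clean_body` and `clean_tune` are hypothetical helpers, every `.` in the body is assumed to be a staccato mark, and every `z`/`Z` is assumed to be a rest. Transposition to C is not attempted, and the output is not yet split into space-separated tokens.

```python
import re

KEEP_HEADERS = ("M:", "K:")               # the only metadata fields to keep
HEADER_RE = re.compile(r"^[A-Za-z]:")     # ABC header lines look like "X:..."


def clean_body(line):
    """Strip grace notes, slurs, ties, staccato, and rests from a body line."""
    line = re.sub(r"\{[^}]*\}", "", line)   # grace-note groups {...}
    line = line.replace("~", "")            # ~ ornaments (listed as grace notes above)
    line = re.sub(r"\((?!\d)", "", line)    # slur openings, keeping tuplets such as (3
    line = line.replace(")", "")            # slur closings
    line = line.replace("-", "")            # ties
    line = line.replace(".", "")            # staccato dots (assumed)
    line = re.sub(r"[zZ][\d/]*", "", line)  # rests with an optional duration
    return re.sub(r"\s+", " ", line).strip()


def clean_tune(abc):
    """Return the kept header lines followed by one cleaned body line."""
    headers, body = [], []
    for line in abc.strip().splitlines():
        if HEADER_RE.match(line):
            if line.startswith(KEEP_HEADERS):
                headers.append(line.replace(" ", ""))
        else:
            body.append(clean_body(line))
    return "\n".join(headers + [" ".join(body)])


# Invented input, for illustration only (not taken from any dataset).
raw = """X:1
T:Hypothetical example
M:6/8
K:Dmaj
{g}A2 (AB) c-|c z2 .d ~e2 (3fga|"""
print(clean_tune(raw))
```

Dropping rests shortens the bars they sat in, so bar lengths may no longer agree with the `M:` field after cleaning.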