## Obtaining BPE vocabularies

This notebook shows how to obtain a BPE vocabulary of "musical words"
from a set of training data (in this case, note transcriptions of a manuscript).

The code used is in the `src/bpe` directory, and the results are stored in `results/test`.

------

The example below uses the *Einsiedeln* dataset encoded with *musical* notation, with the following settings:

- Ground Truth = `../data/GT/EIN_music.txt`

- Minimum occurrences (BPE threshold) = `5`

- Maximum token length (BPE threshold) = `5`

- Path for resulting vocabulary = `../results/test/BPE_5-5_vocab.txt`

- Path for resulting encoded data = `../results/test/BPE_5-5_encoded.txt`


In [12]:
!python3 ../src/bpe/train_bpe.py \
     -gt "../data/GT/EIN_music.txt" \
     -mo 5 \
     -mxl 5 \
     -v "../results/test/BPE_5-5_vocab.txt" \
     -t "../results/test/BPE_5-5_encoded.txt"

Applying BPE...
Merge 1: ('A4', 'G4') -> A4&G4 (occurrences: 14393)
Merge 2: ('F4', 'G4') -> F4&G4 (occurrences: 9576)
Merge 3: ('C5', 'C5') -> C5&C5 (occurrences: 8440)
Merge 4: ('F4', 'E4') -> F4&E4 (occurrences: 6601)
Merge 5: ('C5', 'B4') -> C5&B4 (occurrences: 4872)
Merge 6: ('C5', 'D5') -> C5&D5 (occurrences: 4762)
Merge 7: ('A4', 'A4') -> A4&A4 (occurrences: 3704)
Merge 8: ('D4', 'D4') -> D4&D4 (occurrences: 3696)
Merge 9: ('G4', 'G4') -> G4&G4 (occurrences: 3367)
Merge 10: ('A4&G4', 'A4') -> A4&G4&A4 (occurrences: 2982)
Merge 11: ('F4', 'F4') -> F4&F4 (occurrences: 2864)
Merge 12: ('E5', 'D5') -> E5&D5 (occurrences: 2578)
Merge 13: ('F4&E4', 'D4') -> F4&E4&D4 (occurrences: 2490)
Merge 14: ('A4&G4', 'F4&G4') -> A4&G4&F4&G4 (occurrences: 2339)
Merge 15: ('A4&G4', 'G4') -> A4&G4&G4 (occurrences: 2243)
Merge 16: ('A4', 'B4') -> A4&B4 (occurrences: 2207)
Merge 17: ('C4', 'D4') -> C4&D4 (occurrences: 2027)
Merge 18: ('A4', 'C5') -> A4&C5 (occurrences: 1874)
Merge 19: ('F4', 'D4') -> 
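
The merge log above follows the usual BPE pattern: count adjacent pairs, merge the most frequent one, and repeat. The sketch below is a minimal illustration of that loop, not the code in `src/bpe/train_bpe.py`. It assumes that each transcription is a flat list of note symbols, that "token length" means the number of notes in a token, and that merging stops once no pair reaches the minimum number of occurrences. All function names here are hypothetical.

```python
from collections import Counter


def count_pairs(sequences):
    """Count adjacent token pairs across all sequences."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs


def token_length(token, joiner="&"):
    """Number of notes in a (possibly merged) token."""
    return token.count(joiner) + 1


def merge_pair(seq, pair, joiner="&"):
    """Replace every non-overlapping occurrence of `pair`, scanning left to right."""
    merged, out, i = joiner.join(pair), [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(sequences, min_occurrences=5, max_length=5, joiner="&"):
    """Toy BPE: both thresholds are assumed to behave as described in the lead-in."""
    sequences = [list(seq) for seq in sequences]
    merges = []
    while True:
        # Keep only pairs that are frequent enough and short enough once merged.
        candidates = [
            (pair, count)
            for pair, count in count_pairs(sequences).most_common()
            if count >= min_occurrences
            and token_length(pair[0], joiner) + token_length(pair[1], joiner) <= max_length
        ]
        if not candidates:
            break
        pair, count = candidates[0]  # most frequent admissible pair
        sequences = [merge_pair(seq, pair, joiner) for seq in sequences]
        merges.append((pair, count))
    return merges, sequences


# Toy data, not taken from the Einsiedeln ground truth.
toy = [["A4", "G4", "F4", "G4"]] * 3
merges, encoded = train_bpe(toy, min_occurrences=2, max_length=3)
for step, (pair, count) in enumerate(merges, 1):
    print(f"Merge {step}: {pair} -> {'&'.join(pair)} (occurrences: {count})")
```

Pairs with equal counts are taken in first-seen order here; the real script may break ties differently.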

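Once we have the raw vocabulary, we process it so that its format is compatible with the software used to plot a Zipf curve. The command keeps only the lines with more than one field, sorts them by the second field (the count) in descending numeric order, and numbers the resulting lines.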

In [15]:
!awk 'NF > 1' "../results/test/BPE_5-5_vocab.txt" | sort -k2,2nr | nl > "../results/test/BPE_5-5_vocab_sorted.txt"
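
For readers who prefer to stay in Python, the same three steps can be approximated as follows. This is a sketch that assumes each useful vocabulary line has the form `word count`; the function name `rank_vocabulary` is hypothetical. Ties are broken by plain string comparison of the line, which may not match the order `sort` produces under every locale.

```python
def rank_vocabulary(lines):
    """Approximate: awk 'NF > 1' | sort -k2,2nr | nl"""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) > 1:  # awk 'NF > 1': drop lines with fewer than two fields
            entries.append((int(fields[1]), line.rstrip("\n")))
    # sort -k2,2nr: descending by count; ties fall back to comparing the whole line
    entries.sort(key=lambda entry: (-entry[0], entry[1]))
    # nl: right-aligned line number (width 6) followed by a tab
    return [f"{rank:6d}\t{line}" for rank, (_, line) in enumerate(entries, 1)]


# Illustrative input; the blank line and the count-less line are dropped.
sample = ["A4&G4 618\n", "\n", "C5\n", "A4&G4&F4&G4 1213\n", "A4&G4&G4 673\n"]
ranked = rank_vocabulary(sample)
for row in ranked:
    print(row.strip())
```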

Then we can see how it looks. The first column is the rank, the second column is the word and the third column is the number of occurrences of that word.

In [18]:
with open("../results/test/BPE_5-5_vocab_sorted.txt", "r") as f:
    vocab = f.readlines()
for line in vocab[:10]:
    print(line.strip())

1	A4&G4&F4&G4 1213
2	A4&G4&A4&G4 742
3	C5&D5&C5&B4 703
4	A4&G4&G4 673
5	F4&E4&F4&G4 673
6	A4&G4&A4 638
7	A4&G4 618
8	C5&C5&C5&C5 601
9	A4&B4&C5&B4 575
10	A4&G4&A4&B4 564
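
A Zipf curve plots the number of occurrences against the rank on log-log axes, where a power law appears as a straight line. As a rough companion to the plotting software, the sketch below parses lines in the `rank word count` format shown above and estimates the slope of that line by ordinary least squares. The helper names are hypothetical, and a least-squares fit is only one simple way to summarise the curve.

```python
import math


def parse_ranked(lines):
    """Parse 'rank<TAB>word count' lines into (rank, word, count) tuples."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) == 3:
            rows.append((int(fields[0]), fields[1], int(fields[2])))
    return rows


def zipf_exponent(rows):
    """Least-squares slope of log(count) vs. log(rank), negated.

    Needs at least two rows with distinct ranks.
    """
    xs = [math.log(rank) for rank, _, _ in rows]
    ys = [math.log(count) for _, _, count in rows]
    n = len(rows)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# The first three rows of the output above, used only as sample input.
rows = parse_ranked([
    "1\tA4&G4&F4&G4 1213",
    "2\tA4&G4&A4&G4 742",
    "3\tC5&D5&C5&B4 703",
])
print(zipf_exponent(rows))
```

An exponent estimated from a handful of top-ranked words is not meaningful on its own; in practice the fit would be run over the whole sorted vocabulary.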


And the original GT encoded with this vocabulary looks like this:

In [14]:
with open("../results/test/BPE_5-5_encoded.txt", "r") as f:
    encoded = f.readlines()  # one encoded transcription per line
for line in encoded[:10]:
    print(line.strip())

CH-E-611_001r-1 G4&G4&F4&F4&G4 C4&D4&F4&G4 A4&G4&F4&G4&G4
CH-E-611_001r-2 C5&C5&C5&C5 C5&C5&C5&C5 C5&B4&A4
CH-E-611_001r-3 D4&D4&C4&F4&G4 F4&G4&A4&A4 A4&C5&B4 C5&D5&C5&A4&B4 A4&G4&F4&G4 A4&B4&A4 A4&G4&A4&G4 G4&G4 E4 G4&A4&G4&F4&G4 F4&D4&C4&D4 G4&E4 F4&E4&D4&D4&A4 A4&G4&F4&G4&A4
CH-E-611_001r-4 C4&D4 F4&F4&D4 E4 F4&E4&D4&C4&D4 F4&E4&D4&D4 F4&E4&E4&F4&G4 A4&B4&A4&G4 A4&G4&G4 F4&E4&D4&E4&E4 C4 F4&E4&E4 G4&A4&G4&F4&G4 E4&E4&G4&A4
CH-E-611_001r-5 E4&F4&G4 G4&A4&C5&B4 C5&B4&A4&G4 E4&F4&G4 G4&G4&G4&G4 C5&C5 C5&D5&C5&B4 A4&C5 A4&C5&B4&C5&B4 A4&G4&A4&C5&B4 G4&G4&G4&G4 G4&A4&F4&E4&D4 G4 F4&A4 C5&B4&A4&B4 G4&G4 C5&C5&B4&C5 A4&G4&A4
CH-E-611_001r-6 F4&E4&D4&C4&D4 F4&E4&D4&C4&D4 E4&F4&G4 F4&G4&A4&G4 F4&E4 E4&E4&D4 G4&G4 A4&A4&C5 A4&A4&C5 A4&A4&A4 A4&G4&F4&G4&G4 F4&E4&E4&E4 G4&A4&A4 C5&A4 A4&G4&G4&F4&G4 F4&F4&E4&F4&G4 G4 F4&D4&C4 C4&C4&D4 E4&F4&G4&G4 F4&E4&E4 A4&G4&A4&C5&G4
CH-E-611_001r-7 C4&D4 E4&F4&E4&D4 G4&F4&G4 F4&E4&D4&E4 E4&F4&E4&D4&E4 C4&F4&G4 E4&D4 F4&E4&E4&D4 E4 G4&A4&C5 A4&A4&G4&F4&G4 G4&
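
Judging from the output above, each encoded line starts with a sample identifier followed by space-separated BPE tokens, and the notes inside a token are joined with `&`. Under that assumption, a line can be split back into its tokens and its original note sequence as sketched below; `parse_encoded_line` is a hypothetical helper, not part of `src/bpe`.

```python
def parse_encoded_line(line, joiner="&"):
    """Split an encoded line into (sample_id, tokens, notes)."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    sample_id, tokens = fields[0], fields[1:]
    # Undo the merges: every token expands back into its individual notes.
    notes = [note for token in tokens for note in token.split(joiner) if note]
    return sample_id, tokens, notes


# Second line of the output above.
sample_id, tokens, notes = parse_encoded_line(
    "CH-E-611_001r-2 C5&C5&C5&C5 C5&C5&C5&C5 C5&B4&A4"
)
print(sample_id, len(tokens), len(notes))
```

Comparing `len(tokens)` with `len(notes)` gives a quick per-line measure of how much the vocabulary compresses the transcription.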