# Bach Chorales - analysing matching subsequences in training and generated data
## Comparing Generated Sequences to Training Data

In [5]:
from music_generator.serializers.discrete_time_serializer import DiscreteTimeMidiSerializer
import music_generator.utilities.sequence_utils as sequence_utils
import music_generator.utilities.utils as utils

from pathlib import Path

### Set up constants for generation and comparison

A single quarter-note four-voice chord consists of at least 9 events (4 note-on, 4 note-off, and at least 1 wait event if no chaining is required),
so a measure of four quarter-note four-voice chords would be a minimum of 36 events.

Adding in eighth notes, flourishes, etc. would increase the number of events in a measure.

Windows of 100 events are used to compare generated sequences and training sequences. A window of this size generally represents somewhere from one to four measures, depending on note density.
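The arithmetic above can be checked with a short sketch. It takes the 9-events-per-chord figure from the text and assumes 4/4 time; the eighth-note case is a hypothetical illustration in which every chord still costs 9 events, not a measurement from the dataset.

```python
# Figure taken from the text: 9 events per four-voice block chord.
EVENTS_PER_CHORD = 9
WINDOW_SIZE = 100


def events_per_measure(chords_per_measure: int) -> int:
    """Minimum number of events in a measure made only of four-voice block chords."""
    return chords_per_measure * EVENTS_PER_CHORD


def measures_per_window(chords_per_measure: int) -> float:
    """How many such measures fit into one comparison window."""
    return WINDOW_SIZE / events_per_measure(chords_per_measure)


# Four quarter-note chords per measure (the case described above).
print(events_per_measure(4), round(measures_per_window(4), 2))  # 36 2.78

# Hypothetical denser case: eight eighth-note chords per measure.
print(events_per_measure(8), round(measures_per_window(8), 2))  # 72 1.39
```

Sparser passages (for example, held half notes) need fewer events per measure, so the same window would cover more measures.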

In [6]:
serializer = DiscreteTimeMidiSerializer()
window_size = 100
training_data = './training_data/bach_chorales/'
generated_data = './generated_files/bach_chorales_temperature_{}/'
temperatures = [1.0, 1.2, 1.5]

### Create a set of all unique sub-sequences with length = window_size from the training data

The sequences from the training data are transposed into all 12 keys, from 6 semitones down to 5 semitones up, which covers one octave.
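As a rough illustration of what this transposition does, the sketch below shifts a list of MIDI pitches by every offset from -6 to +5 semitones. The function `transpose_pitches` is hypothetical and works on bare pitch numbers; the real `sequence_utils.transpose` operates on serialized event sequences, and its handling of out-of-range pitches is not shown here, so dropping such copies is an assumption.

```python
from typing import List


def transpose_pitches(pitches: List[int], down: int = -6, up: int = 5) -> List[List[int]]:
    """Return one copy of `pitches` per semitone offset in [down, up]."""
    transposed = []
    for offset in range(down, up + 1):
        shifted = [p + offset for p in pitches]
        # Assumption: discard copies that leave the valid MIDI pitch range 0-127.
        if all(0 <= p <= 127 for p in shifted):
            transposed.append(shifted)
    return transposed


chord = [48, 55, 64, 72]  # toy four-voice chord as MIDI note numbers
copies = transpose_pitches(chord)
print(len(copies))            # 12 offsets: -6 ... +5, including the original at 0
print(copies[0], copies[-1])  # [42, 49, 58, 66] [53, 60, 69, 77]
```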

These sequences are then windowed to the previously decided number of events.

Each window is converted to an equivalent string to make it a hashable object, and then added to a set for comparison with another set.
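A minimal sketch of the windowing and hashing steps, under stated assumptions: `sliding_windows` and `to_hashable_set` are hypothetical stand-ins for `sequence_utils.window` and `utils.create_hashable_set`, whose actual stride and string format are not shown in this notebook. The sketch assumes a stride of one event and a space-separated string key.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def sliding_windows(sequence: Sequence[int], window_size: int) -> List[Tuple[int, ...]]:
    """Return every contiguous window of `window_size` events (stride of 1)."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [
        tuple(sequence[i:i + window_size])
        for i in range(len(sequence) - window_size + 1)
    ]


def to_hashable_set(windows: Iterable[Sequence[int]]) -> Set[str]:
    """Convert each window to a string key and keep only the unique keys."""
    # A separator keeps [1, 23] and [12, 3] from collapsing to the same key.
    return {" ".join(str(event) for event in window) for window in windows}


events = [60, 64, 67, 60, 64, 67, 60]  # toy event ids, not real serializer output
windows = sliding_windows(events, window_size=3)
unique = to_hashable_set(windows)
print(len(windows), len(unique))  # 5 3
```

Because the toy sequence repeats itself, five windows collapse to three unique keys, which mirrors the gap between total and unique window counts reported for the training data.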

In [7]:
real_sequences = serializer.serialize_folder(training_data)

# transpose training data to all keys and window
real_sequences = sequence_utils.transpose(real_sequences, down=-6, up=5)
real_sequences, _ = sequence_utils.window(real_sequences, window_size=window_size)
print('Training data windows of length {}: {}'.format(window_size, len(real_sequences)))

real_set = utils.create_hashable_set(real_sequences)
print('Unique windows: {}'.format(len(real_set)))

Training data windows of length 100: 5208324
Unique windows: 4423953


### Compare generated sequences to the training data
Sequences generated by the model using different seeds and temperature settings are serialized, converted to a hashable string, added to a set, and then compared to the training data one at a time.
The percentage of unique windows in a generated composition that appear exactly in the training data is calculated.
The average percentage of 'copied' sequences for an entire set of compositions using a given temperature setting is also calculated.
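The two percentages described above can be sketched on toy data as follows. The window strings and file names are invented for illustration, and `copied_percentage` is a hypothetical helper rather than part of `music_generator`.

```python
from typing import Dict, Set


def copied_percentage(generated: Set[str], training: Set[str]) -> float:
    """Percentage of unique generated windows that also occur in the training set."""
    if not generated:
        return 0.0  # avoid dividing by zero for sequences shorter than one window
    return len(generated & training) / len(generated) * 100


training_windows = {"a b c", "b c d", "c d e"}
generated_files: Dict[str, Set[str]] = {
    "sample_1": {"a b c", "x y z"},                    # 1 of 2 windows matches
    "sample_2": {"b c d", "d e f", "e f g", "f g a"},  # 1 of 4 windows matches
}

per_file = {
    name: copied_percentage(windows, training_windows)
    for name, windows in generated_files.items()
}
average = sum(per_file.values()) / len(per_file)
print(per_file)  # {'sample_1': 50.0, 'sample_2': 25.0}
print(average)   # 37.5
```

Note that the average is an unweighted mean of the per-file percentages, so a short composition counts as much as a long one.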

In [8]:
for temp in temperatures:
    print('TEMPERATURE: {}'.format(temp))
    
    percentages = []

    # compare each generated sequence against the training set
    for file in Path(generated_data.format(temp)).glob('*.mid'):
        
        sequence = serializer.serialize(file)
            
        # split generated sequence into subsequences
        gen_sequences, _ = sequence_utils.window([sequence], window_size=window_size)

        # create a set of unique subsequences
        gen_set = utils.create_hashable_set(gen_sequences)

        # find the intersection of the two sets to find all matching subsequences and calculate percentage of generated subsequences that come from the training data
        matches = gen_set.intersection(real_set)
        n_matches = len(matches)
        total = len(gen_set)
        percentage = n_matches/total * 100
        percentages.append(percentage)

        # print results
        print('{}: {:.2f}% of unique windows (length = {}) from the generated composition exist in the training data.'.format(file.name, percentage, window_size))
                        
    average = sum(percentages) / len(percentages)
    print('AVERAGE PERCENTAGE OVER ALL FILES: {:.2f}%'.format(average))
    print('*' * 80)


TEMPERATURE: 1.0
sample_45-60-64-69.mid: 7.79% of unique windows (length = 100) from the generated composition exist in the training data.
sample_38-47-54-62-66.mid: 50.93% of unique windows (length = 100) from the generated composition exist in the training data.
sample_41-50-57-65-69.mid: 35.61% of unique windows (length = 100) from the generated composition exist in the training data.
sample_39-48-55-63-67.mid: 47.28% of unique windows (length = 100) from the generated composition exist in the training data.
sample_40-49-56-64-68.mid: 18.79% of unique windows (length = 100) from the generated composition exist in the training data.
sample_43-55-59-62-65.mid: 29.91% of unique windows (length = 100) from the generated composition exist in the training data.
AVERAGE PERCENTAGE OVER ALL FILES: 31.72%
********************************************************************************
TEMPERATURE: 1.2
sample_45-60-64-69.mid: 8.32% of unique windows (length = 100) from the generated compositi

### How often does Bach repeat himself?

Each composition of the dataset is checked against the rest of the dataset and its transpositions, as above, to find how much repetition appears across compositions in the dataset itself.
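The leave-one-out comparison can be sketched on toy data as below. This is an alternative formulation, not the notebook's own code: instead of rebuilding the union of all other compositions for every file, it counts once how many compositions contain each transposed window. The helper name and the toy window strings are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def shared_percentages(
    originals: Dict[str, Set[str]], transposed: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Percentage of each composition's windows found in any other composition."""
    # For every window, count how many compositions' transposed sets contain it.
    owners = Counter()
    for windows in transposed.values():
        owners.update(windows)

    result = {}
    for name, windows in originals.items():
        own = transposed[name]
        # Subtract this composition's own contribution before testing for overlap.
        shared = sum(1 for w in windows if owners[w] - (w in own) > 0)
        result[name] = shared / len(windows) * 100 if windows else 0.0
    return result


originals = {"A": {"1 2", "2 3"}, "B": {"2 3", "3 4"}, "C": {"9 9"}}
transposed = {
    "A": {"1 2", "2 3", "2 4"},
    "B": {"2 3", "3 4", "1 2"},
    "C": {"9 9", "8 8"},
}
print(shared_percentages(originals, transposed))
# {'A': 100.0, 'B': 50.0, 'C': 0.0}
```

On a dataset of several million windows, counting once in this way avoids building a large union set per composition, at the cost of holding one counter in memory.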

In [15]:
# create a hashable set for each composition in the dataset
training_data_sets = list()
for file in Path(training_data).glob('*.mid'):
    # serialize file
    sequence = serializer.serialize(file)
    
    # split the sequence into subsequences and create a hashable set from the composition
    sequences, _ = sequence_utils.window([sequence], window_size=window_size)
    hashable_set = utils.create_hashable_set(sequences)
    
    # transpose the sequence into all keys and create a hashable set from the composition and all its transpositions
    sequences = sequence_utils.transpose([sequence], down=-6, up=5)
    sequences, _ = sequence_utils.window(sequences, window_size=window_size)
    hashable_set_trans = utils.create_hashable_set(sequences)
    
    # add to list of hashable sets
    training_data_sets.append((file, hashable_set, hashable_set_trans))

# compare the hashable set for each composition to the union of hashable sets for all other compositions
counter = 0
for f, hs, _ in training_data_sets:
    composition_set = hs
    remaining_set = set.union(*[hst for f2, _, hst in training_data_sets if f2 != f])
    
    # intersect the two sets to find all matching subsequences and calculate the percentage of this composition's windows that appear in the other compositions
    matches = composition_set.intersection(remaining_set)
    n_matches = len(matches)
    total = len(composition_set)
    percentage = n_matches/total * 100

    # print results, if any overlap is found
    if percentage > 0.01:
        counter += 1
        print('{}: {:.2f}% of unique windows from the composition exist in the other training data compositions.'.format(f.name, percentage))

print('Matches found between {}/{} compositions.'.format(counter, len(training_data_sets)))


040900B_.mid: 56.63% of unique windows from the composition exist in the other training data compositions.
010406B_.mid: 3.34% of unique windows from the composition exist in the other training data compositions.
001805Bw.mid: 88.85% of unique windows from the composition exist in the other training data compositions.
040900Bv.mid: 58.02% of unique windows from the composition exist in the other training data compositions.
024835B3.mid: 59.21% of unique windows from the composition exist in the other training data compositions.
024805B_.mid: 1.27% of unique windows from the composition exist in the other training data compositions.
024417B_.mid: 100.00% of unique windows from the composition exist in the other training data compositions.
001007B_.mid: 8.35% of unique windows from the composition exist in the other training data compositions.
024511B_.mid: 17.93% of unique windows from the composition exist in the other training data compositions.
024415B_.mid: 100.00% of unique windows

# Analysis

For the most part, the model generates sequences that are at least trivially unique when sampled. The test only detects exact matches, so it does not account for windows that differ from a training window by only one or two events.
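To make this limitation concrete, the sketch below compares two toy windows that differ in a single event. The event names are invented for illustration and do not reflect the serializer's real vocabulary. The exact-match test reports no overlap, while a simple position-by-position count (Hamming distance, which is not used anywhere in this notebook) shows that the windows are nearly identical.

```python
from typing import Sequence


def hamming_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of positions at which two equal-length windows differ."""
    if len(a) != len(b):
        raise ValueError("windows must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical event names; only the wait duration differs.
training_window = ("on_60", "on_64", "wait_4", "off_60", "off_64")
generated_window = ("on_60", "on_64", "wait_2", "off_60", "off_64")

exact_match = generated_window in {training_window}
distance = hamming_distance(training_window, generated_window)
print(exact_match, distance)  # False 1
```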

Using a higher temperature made generated sequences more likely to be unique, but there is still an element of randomness in the sampling at every temperature. For example, the lowest temperature (most predictable sampling) produced one generated sequence with only 7.79% of its windows copied from the training data, while the other sequences had 18.79-50.93% of their windows copied. Similarly, the highest temperature setting produced a single generated sequence with a relatively high percentage of copied windows, at 33.58%, while the other sequences all had percentages between 5.5% and 12.5%.

In listening tests, compositions generated at lower temperatures seemed to have a clearer musical structure, although each generated composition is unique. The sampling procedure could be another area to explore: fixing the seed and temperature settings and generating multiple candidates with the same settings may help identify optimal values.

When each training composition is compared to the rest of the transposed dataset, a notable amount of repetition appears between compositions written by Bach. This indicates that some amount of repetition is inherent to the training data, and thus to the compositional style itself.

This experiment raises questions that I do not currently have a clear answer to and that might be beyond the scope of this project. What makes a composition unique? What level of similarity is acceptable for a composition to be considered new? What percentage of copied sequences would be found in the work of a human composer who studied the training set and then tried to create a new composition in the same style?