# Chopin Nocturnes
## Find the most similar composition from the training data for a specific generated composition

In [1]:
from music_generator.serializers.discrete_time_serializer import DiscreteTimeMidiSerializer
import music_generator.utilities.sequence_utils as sequence_utils

from pathlib import Path

### Set up constants for generation and comparison

In [2]:
serializer = DiscreteTimeMidiSerializer()
window_size = 100

### Convert a generated sequence to a hashable set of windows

In [21]:
generated_file = Path('./generated_files/chopin_nocturnes_temperature_1.2/sample_45-60-64-69.mid')

gen_sequence = serializer.serialize(generated_file)
            
# split the generated sequence into subsequences of length window_size
gen_sequences, _ = sequence_utils.window([gen_sequence], window_size=window_size)

# create a set of unique subsequences; each subsequence is joined into a string so it is hashable
gen_set = {'-'.join(str(x) for x in s) for s in gen_sequences}
print(len(gen_set))

1898
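
The cell above relies on `sequence_utils.window`, whose implementation is not shown in this notebook. As a rough sketch of the idea only, assuming a stride of 1 and a sequence of integer tokens (both are assumptions, and the helper names `sliding_windows` and `window_keys` are hypothetical, not part of `music_generator`), the windowing and hashing steps might look like this:

```python
def sliding_windows(sequence, window_size, stride=1):
    """Return every contiguous subsequence of length window_size."""
    last_start = len(sequence) - window_size
    # range() is empty when the sequence is shorter than the window
    return [sequence[i:i + window_size] for i in range(0, last_start + 1, stride)]


def window_keys(sequence, window_size):
    """Join each window into a '-'-separated string so it can be stored in a set."""
    return {'-'.join(str(x) for x in w) for w in sliding_windows(sequence, window_size)}


# Toy token sequence (illustrative only, not real serializer output)
toy_sequence = [60, 64, 67, 60, 64, 67, 72]
keys = window_keys(toy_sequence, window_size=3)
print(sorted(keys))
```

The toy sequence yields five windows, but the repeated window `60-64-67` is stored only once, which is why the set can be smaller than the number of windows.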


### Compare generated sequences to each composition from the training data

In [22]:
for training_file in Path('./training_data/chopin_nocturnes/').glob('*.mid'):

    train_sequence = serializer.serialize(training_file)

    # transpose the training sequence and split it into subsequences
    train_sequences = sequence_utils.transpose([train_sequence], down=-6, up=5)
    train_sequences, _ = sequence_utils.window(train_sequences, window_size=window_size)

    # create a set of unique subsequences; each subsequence is joined into a string so it is hashable
    train_set = {'-'.join(str(x) for x in s) for s in train_sequences}

    # intersect the two sets to find all matching subsequences, then calculate
    # the percentage of generated subsequences that also occur in this training file
    matches = gen_set.intersection(train_set)
    n_matches = len(matches)
    total = len(gen_set)
    percentage = n_matches/total * 100

    # print results
    print('{:.2f}% of unique windows from the generated composition exist in the training file: {}.'.format(percentage, training_file.name))

0.00% of unique windows from the generated composition exist in the training file: Nocturne op33 n2.mid.
0.00% of unique windows from the generated composition exist in the training file: Nocturne op62 n2.mid.
0.00% of unique windows from the generated composition exist in the training file: Nocturne op09 n3.mid.
0.00% of unique windows from the generated composition exist in the training file: Nocturne op09 n2.mid.
0.00% of unique windows from the generated composition exist in the training file: Nocturne op55 n2.mid.
0.00% of unique windows from the generated composition exist in the training file: Nocturne op15 n3.mid.
0.00% of unique windows from the generated composition exist in the training file: Nocturne op33 n4.mid.
10.75% of unique windows from the generated composition exist in the training file: Nocturne op48 n1.mid.
0.00% of unique windows from the generated composition exist in the training file: Nocturne op15 n2.mid.
0.00% of unique windows from the generated composition
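
Since the goal is to find the most similar training composition, the per-file percentages could be collected and sorted instead of only printed. A minimal sketch, assuming the window sets have already been built as in the loop above; `rank_by_overlap` and the toy sets are hypothetical and for illustration only:

```python
def rank_by_overlap(gen_set, train_sets):
    """Return (file name, percentage) pairs, highest overlap first.

    train_sets maps a file name to its set of hashable windows.
    """
    total = len(gen_set)
    if total == 0:
        return []
    scores = [
        (name, len(gen_set & train_set) / total * 100)
        for name, train_set in train_sets.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy data standing in for the real window sets (illustrative only)
toy_gen = {'a', 'b', 'c', 'd'}
toy_train = {
    'piece_1.mid': {'a', 'x'},
    'piece_2.mid': {'b', 'c', 'y'},
    'piece_3.mid': {'z'},
}
ranking = rank_by_overlap(toy_gen, toy_train)
best_name, best_pct = ranking[0]
print('{}: {:.2f}%'.format(best_name, best_pct))
```

In the real loop, `train_sets` would be filled with one entry per training file, keyed by `training_file.name`.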

### Analysis

This test shows that when a high percentage of a generated composition's windows occur in the transposed training set, those windows do not necessarily come from a single composition.

In this case, I compared a generated example in which about 50% of the windows occur somewhere in the training set. The results show that these windows come from more than one composition. A generated composition therefore likely contains less material "copied" from any single training piece than the previous experiments suggested, since those experiments measured the percentage of a generated composition's windows that exist anywhere in the transposed training set.
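
The difference between "found anywhere in the training set" and "found in one composition" can be made concrete by computing both numbers side by side. This is only a sketch under the same assumptions as the comparison loop (one set of hashable windows per training file); `coverage_summary` and the toy sets are hypothetical:

```python
def coverage_summary(gen_set, train_sets):
    """Return (percentage matched anywhere, percentage matched by the best single file)."""
    total = len(gen_set)
    if total == 0:
        return 0.0, 0.0
    matched_anywhere = set()
    best_single = 0
    for train_set in train_sets.values():
        matches = gen_set & train_set
        matched_anywhere |= matches          # union over all training files
        best_single = max(best_single, len(matches))
    return len(matched_anywhere) / total * 100, best_single / total * 100


# Toy data (illustrative only): matches are spread over two pieces
toy_generated = {'a', 'b', 'c', 'd'}
toy_training = {
    'piece_1.mid': {'a', 'b'},
    'piece_2.mid': {'c'},
}
overall, best = coverage_summary(toy_generated, toy_training)
print('anywhere: {:.2f}%, best single file: {:.2f}%'.format(overall, best))
```

A large gap between the two percentages indicates that the matching windows are spread across several compositions rather than taken from one.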