This notebook is about determining which $n$-grams we should make into features. First, some explanation of terminology.

A **raw chord** is a musical chord, represented either as a string (e.g. 'C' or 'Amin') or as a binary vector of length 12 (e.g. 'C' corresponds to $[1,0,0,0,1,0,0,1,0,0,0,0]$). The Chordonomicon dataset represents the chords within songs as string labels, and provides the file "chords_mapping.csv" for converting from a string to a vector.
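
As a small illustration of the vector representation, the sketch below builds the length-12 binary vector for C major by hand. It assumes that position 0 is the pitch class C and that positions increase by semitone, which is consistent with the 'C' example above. The names `NOTE_TO_PC` and `chord_vector` are hypothetical helpers and are not part of the dataset or of "chords_mapping.csv".

```python
import numpy as np

# Assumed convention: pitch classes are numbered 0-11 by semitone, starting at C.
NOTE_TO_PC = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}

def chord_vector(pitch_classes):
    """Return a length-12 binary vector with ones at the given pitch classes."""
    vec = np.zeros(12, dtype=int)
    for pc in pitch_classes:
        vec[pc % 12] = 1
    return vec

# C major consists of the notes C, E and G
c_major = chord_vector([NOTE_TO_PC['C'], NOTE_TO_PC['E'], NOTE_TO_PC['G']])
print(c_major.tolist())  # [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
```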

Mathematically, the space of raw chords is $X = \{0,1\}^{12}$. The cyclic group $G = \mathbb{Z}/12 \mathbb{Z}$ acts on this set by cyclically permuting vectors, which corresponds musically to transposition. Two chords are **harmonically equivalent** if they are in the same orbit of this group action, so the set of chords up to harmonic equivalence is the quotient space $X/G$. Musically speaking, two chords are harmonically equivalent if one of them is a transposition of the other. More generally, an $n$-gram is a finite sequence in $X$ of length $n$, the group $G$ acts entry-wise on these sequences (every chord is transposed by the same amount), and two $n$-grams are harmonically equivalent if they lie in the same $G$-orbit. The general space of $n$-grams up to equivalence is

$$\bigcup_{n \ge 1} (X^n/G)$$
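
The group action can be sketched directly on binary vectors: transposing by $t$ semitones is a cyclic shift by $t$ positions, and one way to pick a representative of an orbit is to take the lexicographically smallest of the 12 transpositions. This is only an illustration of the definition, assuming the same indexing as the 'C' example above. The functions `transpose` and `orbit_representative` are hypothetical and are not used elsewhere in this notebook, which compares chord names through a precomputed dictionary instead.

```python
import numpy as np

def transpose(chord_vec, semitones):
    """Transpose a length-12 binary chord vector by cyclically shifting it."""
    return np.roll(chord_vec, semitones)

def orbit_representative(n_gram):
    """Return a canonical element of the G-orbit of an n-gram.

    n_gram is a list of length-12 binary vectors. All 12 transpositions are
    generated (the same shift applied to every chord) and the smallest one,
    as a tuple of tuples, is returned.
    """
    candidates = []
    for t in range(12):
        shifted = tuple(tuple(int(x) for x in np.roll(v, t)) for v in n_gram)
        candidates.append(shifted)
    return min(candidates)

c_major = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0])
f_major = transpose(c_major, 5)  # five semitones up from C
g_major = transpose(c_major, 7)  # seven semitones up from C

# 'C,F' and 'G,C' both move up by five semitones, so they share an orbit
same_orbit = orbit_representative([c_major, f_major]) == orbit_representative([g_major, c_major])
print(same_orbit)  # True
```

Two $n$-grams are then harmonically equivalent exactly when their representatives are equal, so equivalence can be checked with a single comparison.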

The premise of this notebook is that the answer to "Does song $S$ contain an $n$-gram equivalent to $x$?" is useful for predicting the genre of $S$.

This leaves the question of how to pick which $x \in \displaystyle \bigcup_{n \ge 1} (X^n/G)$ are most useful features for genre prediction. We have made this determination as follows:

* A single musical "phrase" or "idea" is usually not longer than 5 chords, and often shorter, so we'll only consider $1 \le n \le 5$.
* If a single model utilizes $n$-grams of multiple lengths, say $n=3$ and $n=4$, then there are dependency issues between features. For example, any song containing the $4$-gram 'C,F,G,C' also contains the $3$-grams 'C,F,G' and 'F,G,C'. While this isn't literally collinearity between features, it is a significant amount of redundant information.
* Generalizing the previous idea, if shorter sequences contain useful information about genre, then that information is likely also captured by longer sequences. More specifically, any information about genre carried by single chords or pairs of chords should also be visible in $3$-grams or longer, so modeling with $1$-grams and $2$-grams is unlikely to add much.
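
The redundancy between lengths described above can be made concrete with a short sketch. The helper `sub_n_grams` is a hypothetical name used only for this illustration; it lists the length-$m$ windows of a comma-separated $n$-gram string, each of which necessarily occurs in any song containing the longer $n$-gram.

```python
def sub_n_grams(n_gram, m):
    """List every contiguous length-m window of a comma-separated n-gram string."""
    chords = n_gram.split(',')
    return [','.join(chords[i:i + m]) for i in range(len(chords) - m + 1)]

# every song containing the 4-gram 'C,F,G,C' also contains both of these 3-grams
print(sub_n_grams('C,F,G,C', 3))  # ['C,F,G', 'F,G,C']
```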

In summary, we will explore models using $3$-grams, $4$-grams, and $5$-grams up to harmonic equivalence, and in any given model, we will only utilize sequences of the same length.


Even after making this determination, there are far too many $3$-grams to enumerate all of them up to harmonic equivalence. Enumerating all of the raw $3$-grams is quick, but running that raw list through the "uniquify_n_grams" method below is impractical (an estimated ~24 hours of computation), because it compares $n$-grams pairwise.

Therefore, to pare down to a feasible set of $n$-grams, we restrict attention to the most frequent ones. Our process for this is:
* Enumerate all raw $n$-grams (for $n = 3, 4, 5$) with a counter object
* Pick the top $k$ raw $n$-grams by frequency in the training data set
* Take the quotient by harmonic equivalence to get a list of relatively common $n$-grams up to equivalence
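
The three steps above can be sketched on toy data. Note that this sketch swaps in a different technique from the one used in the code cells of this notebook: instead of comparing $n$-grams pairwise through the equivalence dictionary, it groups them by a transposition-invariant key (each chord's root relative to the first chord, plus its quality), which needs only one pass over the counter. The table `TOY_CHORDS` and the functions `canonical_key` and `group_top_k` are hypothetical, the table covers only a handful of chords, and the key assumes that each chord has a single well-defined root, which would need care for symmetric chords such as augmented triads.

```python
from collections import Counter

# Hypothetical lookup: chord name -> (root pitch class, quality)
TOY_CHORDS = {
    'C': (0, 'maj'), 'D': (2, 'maj'), 'F': (5, 'maj'),
    'G': (7, 'maj'), 'A': (9, 'maj'), 'Amin': (9, 'min'),
}

def canonical_key(n_gram):
    """Transposition-invariant key: roots measured relative to the first chord."""
    parsed = [TOY_CHORDS[c] for c in n_gram.split(',')]
    first_root = parsed[0][0]
    return tuple(((root - first_root) % 12, quality) for root, quality in parsed)

def group_top_k(raw_counts, k):
    """Keep the k most common raw n-grams, then merge counts of equivalent ones.

    The most frequent member of each class is kept as its representative.
    """
    grouped = Counter()
    representative = {}
    for ng, count in raw_counts.most_common(k):
        key = canonical_key(ng)
        representative.setdefault(key, ng)
        grouped[representative[key]] += count
    return grouped

raw = Counter({'C,F,G': 5, 'C,G,F': 4, 'G,C,D': 3, 'D,G,A': 2, 'F,C,G': 1})
grouped = group_top_k(raw, 4)
print(dict(grouped))  # {'C,F,G': 10, 'C,G,F': 4}
```

Here 'G,C,D' and 'D,G,A' are transpositions of 'C,F,G', so their counts are merged into it, while 'F,C,G' falls outside the top 4 and is dropped before grouping.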

In [4]:
# edit these to change which value of n or k to use
n_range = [3,4,5]
n_count = len(n_range)
k = 200 # increasing k and re-running the file will not re-compute existing columns

In [5]:
import pandas as pd
from collections import Counter
import numpy as np
import json
import time
import copy
import os

data_folder_path = '../data/'
train_data_filename = 'final_train.csv'
test_data_filename = 'final_test.csv'

In [6]:
# read in the data
df_train = pd.read_csv(data_folder_path + train_data_filename, low_memory=False)
df_test = pd.read_csv(data_folder_path + test_data_filename, low_memory=False)
chord_column_train = df_train['simplified_chords']
chord_column_test = df_test['simplified_chords']
num_songs_train = len(df_train.index)
num_songs_test = len(df_test.index)

In [7]:
# read the equivalence dictionary file
# this is a dictionary of dictionaries
#    the top-level keys are chord names (e.g. 'C','Amin')
#    the top-level values are dictionaries, whose keys are equivalent chords, and whose values are the semitone distance between the top-level key and the low-level key
with open(data_folder_path + 'harmonic_equivalence_dictionary.json') as file:
    equiv_dict = json.load(file)

In [8]:
# this method uses the harmonic equivalence dictionary json file to compare chords input in string format
def compare_chords(chord_1, chord_2):
    # assumptions: chord_1 and chord_2 are type string
    # return (True, distance) if they are equivalent
    # for most purposes, you will not need to care about the distance, so then compare_chords(c1, c2)[0] gets the truth value
    if chord_2 in equiv_dict[chord_1]:
        return (True, equiv_dict[chord_1][chord_2])
    else:
        return (False, None)

# this method uses compare_chords to compare two comma-separated n-gram strings for harmonic equivalence
def compare_n_grams(n_gram_1, n_gram_2):
    list_1 = n_gram_1.split(',')
    list_2 = n_gram_2.split(',')

    # if they aren't the same length, we don't have to check anything
    if len(list_1) != len(list_2):
        return (False, None)

    # now we can assume they have the same length
    comparison = [compare_chords(list_1[i], list_2[i]) for i in range(len(list_1))]

    # if any pairs are not the same, return False
    for c in comparison:
        if not c[0]:
            return (False, None)

    # now we can assume every respective pair is equivalent, but we still need all of the distances to match
    dist_0 = comparison[0][1]
    for c in comparison:
        if c[1] != dist_0:
            return (False, None)

    return (True, dist_0)

In [9]:
# this method compiles all of the raw n-grams in a Counter object for whatever chord_column data is put in
def get_raw_n_gram_counts(chord_column, n):
    results = Counter()
    for song in chord_column:
        song_as_list = song.split(',')
        song_n_grams = [','.join(song_as_list[i:i+n]) for i in range(len(song_as_list) - n + 1)]
        for ng in song_n_grams:
            results[ng] += 1
    return results

In [10]:
# a generic method for iterating through a counter of n-grams and aggregating equivalent n-grams
# note: this compares n-grams pairwise, so using it to find all the unique 3-grams takes a very long time (around 24 hours)
def uniquify_n_grams(n_gram_counter):
    results = Counter()
    processed = set()
    for ng1 in n_gram_counter:
        if ng1 in processed:
            continue
        total = n_gram_counter[ng1]
        for ng2 in n_gram_counter:
            if (ng2 not in processed) and ng1 != ng2:
                if compare_n_grams(ng1, ng2)[0]:
                    total += n_gram_counter[ng2]
                    processed.add(ng2)
        results[ng1] = total
        processed.add(ng1)
    return results

In [11]:
# return true/false depending on if a song contains a harmonically equivalent n_gram to the input n_gram
def contains_n_gram(song, n_gram):
    # assumption: input song is a comma-separated string of chord names
    # assumption: input n_gram is a comma-separated string of chord names

    # shortcut: return True if the literal/raw n-gram occurs in the song
    # This isn't necessary to have, but it was added because it seemed to speed things up
    # (whether it helps probably depends on the surrounding looping/checking)
    # Both strings are padded with commas so that only whole chord names match (e.g. 'C,D' must not match 'C,Dmin')
    if (',' + n_gram + ',') in (',' + song + ','):
        return True

    # split the song into a list of individual chord names
    song_as_list = song.split(',')
    n = len(n_gram.split(','))

    # iterate through the possible starting points of n-grams within the song
    # (the last valid starting index is len(song_as_list) - n, hence the + 1)
    for i in range(len(song_as_list) - n + 1):
        song_n_gram = ','.join(song_as_list[i:i+n])
        if compare_n_grams(n_gram, song_n_gram)[0]:
            return True
    return False

assert(contains_n_gram('A,B,C,D,E,F,G','C,D'))
assert(contains_n_gram('A,B,C,D,E,F','F,G'))
assert(not(contains_n_gram('A,B,C,D,E,F','C,E')))

In [12]:
# given the chord column of our dataframe and a fixed n-gram, make a binary one-hot column for that n-gram
def get_one_hot(chord_column, n_gram):
    return chord_column.apply(lambda song : contains_n_gram(song, n_gram))

In [13]:
# this line compiles all of the raw n-grams from the training data, for a given list of n values
raw_counters = [get_raw_n_gram_counts(chord_column_train, n) for n in n_range]

In [14]:
# cut to top k most common raw n-grams
top_k_counters = [Counter(dict(rc.most_common(k))) for rc in raw_counters]
assert(len(top_k_counters[0]) == k)

In [15]:
# uniquify each of these
unique_counters = [uniquify_n_grams(tkc) for tkc in top_k_counters]
unique_counter_lengths = [len(uc) for uc in unique_counters]

In [16]:
print("In the training data, there are:\n")
for i in range(n_count):
    n = n_range[i]
    print("Raw " + str(n) + "-grams:",len(raw_counters[i]))
    print("Unique " + str(n) + "-grams among top " + str(k) + " raw " + str(n) + "-grams:",len(unique_counters[i]))

In the training data, there are:

Raw 3-grams: 298213
Unique 3-grams among top 200 raw 3-grams: 62
Raw 4-grams: 888910
Unique 4-grams among top 200 raw 4-grams: 72
Raw 5-grams: 1752485
Unique 5-grams among top 200 raw 5-grams: 71


In [17]:
# extract the feature selection from these
selected_n_gram_features = [list(uc.keys()) for uc in unique_counters]

In [18]:
# make a list of new columns
df_train_filenames = ['final_train_with_' + str(n) + '_grams.csv' for n in n_range]
df_test_filenames = ['final_test_with_' + str(n) + '_grams.csv' for n in n_range]
df_all_filenames = df_train_filenames + df_test_filenames

In [19]:
# create csv files if they do not exist
for i in range(n_count):  # use n_count rather than a hard-coded 3 so that edits to n_range are respected
    train_path = data_folder_path + df_train_filenames[i]
    test_path = data_folder_path + df_test_filenames[i]
    if not(os.path.exists(train_path)):
        print("The following file does not yet exist, so it was created:",train_path)
        df_train.to_csv(train_path,index=False)
    if not(os.path.exists(test_path)):
        print("The following file does not yet exist, so it was created:",test_path)
        df_test.to_csv(test_path,index=False)

In [20]:
# load the existing dataframes
train_dataframes = [pd.read_csv(data_folder_path + df_train_filenames[i], index_col = False) for i in range(n_count)]
test_dataframes = [pd.read_csv(data_folder_path + df_test_filenames[i], index_col = False) for i in range(n_count)]
all_dataframes = train_dataframes + test_dataframes

In [21]:
for df in all_dataframes:
    display(df[['chords','simplified_chords','decade','main_genre','spotify_song_id']].head(2))

Unnamed: 0,chords,simplified_chords,decade,main_genre,spotify_song_id
0,<intro_1> G A Fsmin Bmin G A Fsmin Bmin <verse...,"G,A,Fsmin,Bmin,G,A,Fsmin,Bmin,G,A,Fsmin,Bmin,G...",2010.0,pop,7vpGKEUPrA4UEsS4o4W1tP
1,C F G C F G F Dmin G C F Dmin G C F G C F G F ...,"C,F,G,C,F,G,F,Dmin,G,C,F,Dmin,G,C,F,G,C,F,G,F,...",2000.0,alternative,7MTpNQUBKyyymbS3gPuqwQ


Unnamed: 0,chords,simplified_chords,decade,main_genre,spotify_song_id
0,<intro_1> G A Fsmin Bmin G A Fsmin Bmin <verse...,"G,A,Fsmin,Bmin,G,A,Fsmin,Bmin,G,A,Fsmin,Bmin,G...",2010.0,pop,7vpGKEUPrA4UEsS4o4W1tP
1,C F G C F G F Dmin G C F Dmin G C F G C F G F ...,"C,F,G,C,F,G,F,Dmin,G,C,F,Dmin,G,C,F,G,C,F,G,F,...",2000.0,alternative,7MTpNQUBKyyymbS3gPuqwQ


Unnamed: 0,chords,simplified_chords,decade,main_genre,spotify_song_id
0,<intro_1> G A Fsmin Bmin G A Fsmin Bmin <verse...,"G,A,Fsmin,Bmin,G,A,Fsmin,Bmin,G,A,Fsmin,Bmin,G...",2010.0,pop,7vpGKEUPrA4UEsS4o4W1tP
1,C F G C F G F Dmin G C F Dmin G C F G C F G F ...,"C,F,G,C,F,G,F,Dmin,G,C,F,Dmin,G,C,F,G,C,F,G,F,...",2000.0,alternative,7MTpNQUBKyyymbS3gPuqwQ


Unnamed: 0,chords,simplified_chords,decade,main_genre,spotify_song_id
0,<intro_1> C D E <verse_1> C D G B E C D G B <c...,"C,D,E,C,D,G,B,E,C,D,G,B,E,B,E,B,E,C,D,G,B,C,E,...",2020.0,pop,206APP3O8maeXJ5dHndAwk
1,<verse_1> G Fsmin Bmin D/A G D/Fs G Emin A D G...,"G,Fsmin,Bmin,D,G,D,G,Emin,A,D,G,D,G,D,G,Fsmin,...",2010.0,country,4d7FN4kiCq8Mh78nCBj1xf


Unnamed: 0,chords,simplified_chords,decade,main_genre,spotify_song_id
0,<intro_1> C D E <verse_1> C D G B E C D G B <c...,"C,D,E,C,D,G,B,E,C,D,G,B,E,B,E,B,E,C,D,G,B,C,E,...",2020.0,pop,206APP3O8maeXJ5dHndAwk
1,<verse_1> G Fsmin Bmin D/A G D/Fs G Emin A D G...,"G,Fsmin,Bmin,D,G,D,G,Emin,A,D,G,D,G,D,G,Fsmin,...",2010.0,country,4d7FN4kiCq8Mh78nCBj1xf


Unnamed: 0,chords,simplified_chords,decade,main_genre,spotify_song_id
0,<intro_1> C D E <verse_1> C D G B E C D G B <c...,"C,D,E,C,D,G,B,E,C,D,G,B,E,B,E,B,E,C,D,G,B,C,E,...",2020.0,pop,206APP3O8maeXJ5dHndAwk
1,<verse_1> G Fsmin Bmin D/A G D/Fs G Emin A D G...,"G,Fsmin,Bmin,D,G,D,G,Emin,A,D,G,D,G,D,G,Fsmin,...",2010.0,country,4d7FN4kiCq8Mh78nCBj1xf


In [22]:
# make a list of new one-hot columns to add (to avoid duplication)
new_n_gram_lists = [None]*n_count
for i in range(n_count):
    new_n_gram_lists[i] = [ng for ng in selected_n_gram_features[i] if ('contains_' + ng) not in list(train_dataframes[i].columns)]
new_n_gram_counts = [len(x) for x in new_n_gram_lists]
total_new_n_grams = np.sum(new_n_gram_counts)

# add one-hot columns for the new n-grams
print("Creating new one-hot columns for n-grams for n in",n_range)
print("Number of new n-grams per class:",new_n_gram_counts)
print("Total new n-grams:",total_new_n_grams)

t0 = time.time()
completed_columns = 0
remaining_columns = total_new_n_grams

for i in range(n_count):
    n = n_range[i]
    df_train_n = train_dataframes[i]
    df_test_n = test_dataframes[i]
    new_n_grams = new_n_gram_lists[i]
    
    print("\nCreating",new_n_gram_counts[i],"one-hot columns for n =",n)
    for ng in new_n_grams:
        new_column_label = 'contains_' + ng
        df_train_n[new_column_label] = chord_column_train.apply(lambda song : contains_n_gram(song, ng)) # this is where all the work gets done
        df_test_n[new_column_label] = chord_column_test.apply(lambda song : contains_n_gram(song, ng)) # this is where all the work gets done
        
        completed_columns += 1
        remaining_columns -= 1
        time_so_far = time.time() - t0
        avg_time_per_column = np.round(time_so_far/completed_columns, decimals = 1)
        print("Completed",completed_columns,"columns in",int(time_so_far),"seconds")
        print("\tAverage time per column so far:",avg_time_per_column,"seconds")
        print("\tRemaining columns:",remaining_columns)
        print("\tEstimated remaining time:",int(remaining_columns*avg_time_per_column),"seconds")

    # save the n-grams dataframe to csv
    print("Completed all columns for n =",n)
    print("Saving dataframe to file.\n")
    df_train_n.to_csv(data_folder_path + df_train_filenames[i], index = False)
    df_test_n.to_csv(data_folder_path + df_test_filenames[i], index = False)

print("Completed all new one-hot columns.")
print("Total time taken:",time.time()-t0,"seconds")

Creating new one-hot columns for n-grams for n in [3, 4, 5]
Number of new n-grams per class: [30, 33, 33]
Total new n-grams: 96

Creating 30 one-hot columns for n = 3
Completed 1 columns in 68 seconds
	Average time per column so far: 68.5 seconds
	Remaining columns: 95
	Estimated remaining time: 6507 seconds
Completed 2 columns in 130 seconds
	Average time per column so far: 65.4 seconds
	Remaining columns: 94
	Estimated remaining time: 6147 seconds
Completed 3 columns in 200 seconds
	Average time per column so far: 66.8 seconds
	Remaining columns: 93
	Estimated remaining time: 6212 seconds
Completed 4 columns in 275 seconds
	Average time per column so far: 69.0 seconds
	Remaining columns: 92
	Estimated remaining time: 6348 seconds
Completed 5 columns in 328 seconds
	Average time per column so far: 65.7 seconds
	Remaining columns: 91
	Estimated remaining time: 5978 seconds
Completed 6 columns in 382 seconds
	Average time per column so far: 63.7 seconds
	Remaining columns: 90
	Estimated

KeyboardInterrupt: 

In [None]:
for df in all_dataframes:
    print(len(df.columns))

In [None]:
train_dataframes[0].head()