This notebook determines which $n$-grams we should turn into features. More specifically: for which $x$ should we add a feature of the form "contains an $n$-gram harmonically equivalent to $x$"?
In theory, we could do this for every harmonically unique $n$-gram for $n=1,2,3,4,\dots$, but that would mean infinitely many features, which is impossible.

So how do we pick which ones are good?

* Based on domain knowledge, sequences longer than around $10$ chords are unlikely to help distinguish genre. Musically speaking, a chord progression of length $10$ is usually longer than an entire "musical idea" or "phrase." Even $10$ is pushing it: we probably shouldn't expect anything longer than $6$ to be very useful, but we'll include up to $10$.
* There are far too many harmonically unique $n$-grams for $1 \le n \le 10$ to enumerate them all. Even for $n=3$, enumerating all of the unique $n$-grams would take around $24$ hours. We can enumerate all $1$-grams ($44$) and all unique $2$-grams ($5903$), but complete enumeration is not practical beyond that.
* I would like to keep only $n$-grams that occur in something like $1\%$ of the songs, but that requires a count for every unique $n$-gram, which takes too long. So instead, I'll enumerate all of the raw $n$-grams for $1 \le n \le 10$ (which is fast), cut to the top $k=300$ or so, then consider the unique $n$-grams among those. The table below gives the number of unique $1$-grams and $2$-grams that remain as the required share of songs increases. Based on the table, I think a cutoff of around $1\%$ is a good sweet spot.

| $n$ | $n$-grams | $1\%$ | $2\%$ | $5\%$ | $10\%$ |
| :-: | :-: | :-: | :-: | :-: | :-: |
| 1 | 44 | 22 | 20 | 15 | 10 |
| 2 | 5903 | 237 | 173 | 105 | 62 |

* For each $n$, we'll take the top $k=300$ most common raw $n$-grams, then condense those down by harmonic equivalence.

| $n$ | Features | 
| :-: | :-: | 
| 1 | 35 |
| 2 | 93 |
| 3 | 129 |
| 4 | 161 |
| 5 | 160 |
| 6 | 166 |
| 7 | 180 |
| 8 | 185 |
| 9 | 188 |
| 10 | 185 |
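
The share-of-songs filter described above can be sketched on raw $n$-grams as follows. This is a minimal illustration, not the notebook's method: `song_frequency`, `toy_songs`, and the `0.5` threshold are hypothetical names and values chosen for the toy data, and harmonic equivalence is ignored.

```python
from collections import Counter

def song_frequency(songs, n):
    """Count, for each raw n-gram, the number of songs containing it at least once."""
    counts = Counter()
    for song in songs:
        chords = song.split(',')
        # a set, so that repeats within one song are counted only once
        grams = {','.join(chords[i:i + n]) for i in range(len(chords) - n + 1)}
        counts.update(grams)
    return counts

# hypothetical toy data standing in for the chord column
toy_songs = ["C,G,Amin,F,C,G", "C,G,C,G", "Amin,F,C,G", "D,A,Bmin,G"]
freq = song_frequency(toy_songs, 2)

# hypothetical threshold for the toy data; the text suggests about 1% on real data
threshold = 0.5
kept = sorted(ng for ng, c in freq.items() if c / len(toy_songs) >= threshold)
print(kept)  # ['Amin,F', 'C,G', 'F,C']
```

Counting each song at most once is what separates a share-of-songs filter from a ranking by total occurrences.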

In [2]:
import pandas as pd
from collections import Counter
import numpy as np
import json
import time

data_folder_path = '../../data/'

In [3]:
# read in the database
df = pd.read_csv(data_folder_path + 'clean_test.csv', low_memory=False)
chord_column = df['simplified_chords']
num_songs = len(df.index)

In [4]:
print("Number of songs:",num_songs)
display(df.head(5))

Number of songs: 255606


Unnamed: 0,chords,simplified_chords,decade,main_genre,spotify_song_id
0,<intro_1> G A Fsmin Bmin G A Fsmin Bmin <verse...,"G,A,Fsmin,Bmin,G,A,Fsmin,Bmin,G,A,Fsmin,Bmin,G...",2010.0,pop,7vpGKEUPrA4UEsS4o4W1tP
1,C F G C F G F Dmin G C F Dmin G C F G C F G F ...,"C,F,G,C,F,G,F,Dmin,G,C,F,Dmin,G,C,F,G,C,F,G,F,...",2000.0,alternative,7MTpNQUBKyyymbS3gPuqwQ
2,C F C G Amin G F C F C G Amin G F C G C F C G ...,"C,F,C,G,Amin,G,F,C,F,C,G,Amin,G,F,C,G,C,F,C,G,...",2000.0,alternative,6jIIMhcBPRTrkTWh3PXIc7
3,Amin G Gmin B Amin G Gmin B Amin G Gmin B Amin...,"Amin,G,Gmin,B,Amin,G,Gmin,B,Amin,G,Gmin,B,Amin...",2010.0,pop,2zAfQdoOeYujy7QIgDUq9p
4,<verse_1> D Dmaj7 G/D A/D D Dmaj7 G/D A/D <cho...,"D,Dmaj7,G,A,D,Dmaj7,G,A,G,D,Emin,D,A,G,D,Emin,...",2010.0,metal,40rChMoUd1VXb4TKgTuTSP


In [5]:
# read the equivalence dictionary file
# this is a dictionary of dictionaries
#    the top-level keys are chord names (e.g. 'C','Amin')
#    the top-level values are dictionaries, whose keys are equivalent chords, and whose values are the semitone distance between the top-level key and the low-level key
with open(data_folder_path + 'harmonic_equivalence_dictionary.json') as file:
    equiv_dict = json.load(file)
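
To make the nested structure concrete, here is a small hypothetical excerpt. The entries, the distances, and the helper `semitone_distance` are illustrative assumptions (in particular, that distances are measured upward in semitones and that each chord maps to itself at distance 0); none of them are read from the JSON file.

```python
# Hypothetical excerpt: the shape follows the comments above, but the entries
# and distances are illustrative assumptions, not values from the real file.
toy_equiv_dict = {
    "C": {"C": 0, "D": 2, "G": 7},
    "Amin": {"Amin": 0, "Bmin": 2, "Emin": 7},
}

def semitone_distance(equiv, chord_1, chord_2):
    """Return the stored distance from chord_1 to chord_2, or None if not equivalent."""
    return equiv.get(chord_1, {}).get(chord_2)

print(semitone_distance(toy_equiv_dict, "C", "G"))     # 7
print(semitone_distance(toy_equiv_dict, "C", "Amin"))  # None
```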

In [6]:
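def compare_chords(chord_1, chord_2):
    """Return (True, semitone distance) if chord_2 is equivalent to chord_1, else (False, None)."""
    # note: raises KeyError if chord_1 is not a top-level key of equiv_dict
    if chord_2 in equiv_dict[chord_1]:
        return (True, equiv_dict[chord_1][chord_2])
    else:
        return (False, None)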

def compare_n_grams(n_gram_1, n_gram_2):
    """Return (True, distance) if the n-grams are harmonically equivalent, else (False, None).

    Two n-grams are equivalent when they have the same length, each pair of
    chords is equivalent, and every pair has the same semitone distance.
    """
    list_1 = n_gram_1.split(',')
    list_2 = n_gram_2.split(',')

    # if they aren't the same length, we don't have to check anything
    if len(list_1) != len(list_2):
        return (False, None)

    # now we can assume they have the same length
    comparison = [compare_chords(c1, c2) for c1, c2 in zip(list_1, list_2)]

    # if any pair of chords is not equivalent, return False
    if not all(c[0] for c in comparison):
        return (False, None)

    # every pair is equivalent, but all of the distances must also match
    dist_0 = comparison[0][1]
    if any(c[1] != dist_0 for c in comparison):
        return (False, None)

    return (True, dist_0)
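
The requirement that all distances match is easiest to see on a toy case. The sketch below is a compact restatement of the comparison that takes the dictionary as an argument. `toy_equiv_dict` and `n_grams_equivalent` are hypothetical, and unlike `compare_chords` it treats an unknown chord as "not equivalent" instead of raising `KeyError`.

```python
# Hypothetical toy dictionary; the distances are illustrative assumptions.
toy_equiv_dict = {
    "C": {"C": 0, "D": 2},
    "G": {"G": 0, "A": 2},
}

def n_grams_equivalent(n_gram_1, n_gram_2, equiv):
    """Return (True, distance) if every chord pair is equivalent at one shared distance."""
    chords_1, chords_2 = n_gram_1.split(','), n_gram_2.split(',')
    if len(chords_1) != len(chords_2):
        return (False, None)
    distances = set()
    for c1, c2 in zip(chords_1, chords_2):
        if c2 not in equiv.get(c1, {}):
            return (False, None)
        distances.add(equiv[c1][c2])
    # a single shared distance means the whole n-gram was shifted uniformly
    if len(distances) != 1:
        return (False, None)
    return (True, distances.pop())

print(n_grams_equivalent("C,G", "D,A", toy_equiv_dict))  # (True, 2)
print(n_grams_equivalent("C,G", "D,G", toy_equiv_dict))  # (False, None)
```

In the second call each chord pair is individually equivalent (distances 2 and 0), but the distances differ, so the 2-grams are not the same progression in another key.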

In [7]:
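def get_raw_n_gram_counts(chord_column, n):
    """Count every occurrence of each raw n-gram across all songs.

    Repeats within a song are counted each time, so these are total
    occurrence counts, not numbers of songs.
    """
    results = Counter()
    for song in chord_column:
        song_as_list = song.split(',')
        # slide a window of length n over the song's chord list
        song_n_grams = [','.join(song_as_list[i:i+n]) for i in range(len(song_as_list) - n + 1)]
        results.update(song_n_grams)
    return results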

In [8]:
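# a generic method for iterating through a counter of n-grams and aggregating equivalent n-grams
# note: n-grams are compared pairwise, so the running time grows quadratically with len(n_gram_counter);
# the parameter n is not used in the body
def uniquify_n_grams(n_gram_counter, n):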
    results = Counter()
    processed = set()
    for ng1 in n_gram_counter:
        if ng1 in processed:
            continue
        total = n_gram_counter[ng1]
        for ng2 in n_gram_counter:
            if (ng2 not in processed) and ng1 != ng2:
                if compare_n_grams(ng1, ng2)[0]:
                    total += n_gram_counter[ng2]
                    processed.add(ng2)
        results[ng1] = total
        processed.add(ng1)
    return results
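
A possible variation, sketched here under the assumption that equivalence is symmetric and transitive, compares each $n$-gram only against the representatives found so far rather than against every other $n$-gram. This is a different technique from the function above, not a description of it. `merge_by_representative`, `toy_class`, and `toy_is_equivalent` are hypothetical.

```python
from collections import Counter

def merge_by_representative(n_gram_counter, is_equivalent):
    """Aggregate counts onto the first-seen representative of each equivalence class."""
    results = Counter()
    for ng, count in n_gram_counter.items():
        for rep in results:
            if is_equivalent(rep, ng):
                results[rep] += count  # fold into the existing class
                break
        else:
            results[ng] = count  # no match: ng starts a new class
    return results

# hypothetical stand-in for compare_n_grams: class labels are assigned by hand
toy_class = {"C,G": "up a fifth", "D,A": "up a fifth",
             "G,C": "up a fourth", "A,D": "up a fourth"}

def toy_is_equivalent(a, b):
    return toy_class[a] == toy_class[b]

toy_counts = Counter({"C,G": 5, "G,C": 3, "D,A": 2, "A,D": 1})
merged = merge_by_representative(toy_counts, toy_is_equivalent)
print(merged)  # Counter({'C,G': 7, 'G,C': 4})
```

With at most a few hundred $n$-grams per $n$ the pairwise version is fast enough; the variation would only matter for much larger inputs.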

In [9]:
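# specify key parameters
n_max = 10                        # longest n-gram length to consider
n_range = list(range(1,n_max+1))  # n = 1, ..., n_max
k = 300                           # number of most common raw n-grams kept for each n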

In [10]:
# build the raw counters for each n
raw_counters = [get_raw_n_gram_counts(chord_column, n) for n in n_range]

In [11]:
for n in n_range:
    print("n=",n)
    print("number of raw n-grams:",len(raw_counters[n-1]),'\n')

n= 1
number of raw n-grams: 690 

n= 2
number of raw n-grams: 42574 

n= 3
number of raw n-grams: 298213 

n= 4
number of raw n-grams: 888910 

n= 5
number of raw n-grams: 1752485 

n= 6
number of raw n-grams: 2764461 

n= 7
number of raw n-grams: 3817027 

n= 8
number of raw n-grams: 4824402 

n= 9
number of raw n-grams: 5735351 

n= 10
number of raw n-grams: 6537957 



In [12]:
# cut to top k for each
unique_counters = [None]*n_max
for n in n_range:
    index = n-1
    rc = raw_counters[index] 
    top_k = Counter(dict(rc.most_common(k)))
    unique_among_top_k = uniquify_n_grams(top_k, n)
    unique_counters[index] = unique_among_top_k
    
    print("n=",n)
    print("unique n-grams among top " + str(k) + " raw n-grams:",len(unique_among_top_k),'\n')

n= 1
unique n-grams among top 300 raw n-grams: 25 

n= 2
unique n-grams among top 300 raw n-grams: 61 

n= 3
unique n-grams among top 300 raw n-grams: 88 

n= 4
unique n-grams among top 300 raw n-grams: 97 

n= 5
unique n-grams among top 300 raw n-grams: 107 

n= 6
unique n-grams among top 300 raw n-grams: 108 

n= 7
unique n-grams among top 300 raw n-grams: 111 

n= 8
unique n-grams among top 300 raw n-grams: 108 

n= 9
unique n-grams among top 300 raw n-grams: 104 

n= 10
unique n-grams among top 300 raw n-grams: 103 



In [18]:
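# merge the per-n counters into one dictionary; n-grams of different lengths
# never share a key, so no count is overwritten (dict |= requires Python 3.9+)
final_features_counter = {}
for uc in unique_counters:
    final_features_counter |= uc
final_features_counter = Counter(final_features_counter)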

print(len(final_features_counter))
print(final_features_counter.most_common(10))

912
[('G', 12777669), ('Amin', 4627067), ('C,G', 2926619), ('G,C', 2344782), ('C,D', 1196201), ('G,F', 1035697), ('G,Amin', 972610), ('Amin,G', 831537), ('Amin,F', 819618), ('C,Amin', 790647)]
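
The merge above applies `|=` to a plain `dict`, which behaves like `dict.update`. The same operator on a `Counter` instead keeps the larger count for each key. The two agree here because the per-$n$ counters have disjoint keys, but the toy example below (hypothetical values, Python 3.9+) shows how they differ when keys overlap.

```python
from collections import Counter

# hypothetical counters that deliberately share the key "C"
a = Counter({"C": 3})
b = Counter({"C,G": 2, "C": 1})

# dict |= : the right-hand value replaces the existing one
merged_dict = {}
merged_dict |= a
merged_dict |= b
print(merged_dict)     # {'C': 1, 'C,G': 2}

# Counter |= : multiset union, keeping the maximum count per key
merged_counter = Counter()
merged_counter |= a
merged_counter |= b
print(merged_counter)  # Counter({'C': 3, 'C,G': 2})
```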


In [28]:
list_of_n_gram_features = list(final_features_counter.keys())
print(list_of_n_gram_features[0:100])

['G', 'Amin', 'Amin7', 'A7', 'Cadd9', 'Fmaj7', 'Dno3d', 'Dsus2', 'C9', 'A7sus4', 'Emin9', 'Dadd11', 'Eminadd13', 'Cmaj9', 'Bdim', 'Eminadd9', 'G13', 'Fsmin11', 'D11', 'Daug', 'Bb9', 'Emajs9', 'Eminmaj7', 'Bminadd11', 'Fmaj911s', 'C,G', 'G,C', 'G,Amin', 'C,D', 'G,F', 'Amin,G', 'Amin,F', 'C,Amin', 'Amin,C', 'C,Emin', 'Emin,Amin', 'Amin,Emin', 'Amin,D', 'D,Amin', 'E,Amin', 'Emin,F', 'D7,G', 'A,C', 'Cadd9,G', 'G,E', 'F,E', 'F,Emin', 'Amin,E', 'B7,Emin', 'C,E', 'C,C', 'G,Cadd9', 'E,F', 'E,C', 'Bmin,Amin', 'G,Emin7', 'Amin,Bmin', 'D,Emin7', 'C,G7', 'Dsus4,D', 'D,Cadd9', 'G,G7', 'Fmaj7,C', 'Dmin,E', 'Amin7,G', 'D,Dsus4', 'Amin,Amin', 'C,Fmaj7', 'Cadd9,D', 'C,D7', 'Amin7,D', 'Amin7,C', 'A,Amin', 'Emin,B7', 'G,B7', 'Emin7,Cadd9', 'Amin,Fmaj7', 'Amin,D7', 'Gno3d,Ano3d', 'Dno3d,Ano3d', 'Fmaj7,G', 'C,Cmaj7', 'G,Fmaj7', 'D,Dsus2', 'C,B7', 'Amin,Amin7', 'G,C,G', 'C,G,C', 'C,G,D', 'C,G,Amin', 'C,D,G', 'D,G,C', 'Emin,C,G', 'D,C,G', 'G,Amin,F', 'G,D,C', 'Amin,G,F', 'G,C,D', 'Amin,F,G', 'F,G,Amin']
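
One way these features might be used downstream is as binary indicators of the form "contains an $n$-gram harmonically equivalent to $x$". The sketch below is only an illustration: `ROOT`, `toy_is_equivalent`, and `contains_equivalent` are hypothetical, and the toy equivalence handles only the seven natural major triads, standing in for `compare_n_grams` and the real equivalence dictionary.

```python
# Hypothetical pitch classes for the seven natural major triads only.
ROOT = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def toy_is_equivalent(n_gram_1, n_gram_2):
    """True if n_gram_2 is n_gram_1 shifted by one constant number of semitones."""
    xs, ys = n_gram_1.split(','), n_gram_2.split(',')
    if len(xs) != len(ys) or any(c not in ROOT for c in xs + ys):
        return False
    offsets = {(ROOT[y] - ROOT[x]) % 12 for x, y in zip(xs, ys)}
    return len(offsets) == 1

def contains_equivalent(song, feature, is_equivalent):
    """Return 1 if any window of the song is equivalent to the feature n-gram, else 0."""
    chords = song.split(',')
    n = len(feature.split(','))
    windows = (','.join(chords[i:i + n]) for i in range(len(chords) - n + 1))
    return int(any(is_equivalent(feature, w) for w in windows))

toy_song = "D,A,D,G"
toy_features = ["C,G", "C,D", "C,G,C"]
row = [contains_equivalent(toy_song, f, toy_is_equivalent) for f in toy_features]
print(row)  # [1, 0, 1]
```

Here `D,A` matches `C,G` (both chords shifted up two semitones) and `D,A,D` matches `C,G,C`, while no window of the song is a uniform shift of `C,D`.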
