This notebook is about determining which $n$-grams we should turn into features. More specifically, for which $x$ should we add a feature of the form "contains an $n$-gram harmonically equivalent to $x$"?
In theory, we could do this for every harmonically unique $n$-gram for $n=1,2,3,4,\ldots$, but that would give an unbounded number of features, which is obviously impractical.

So how do we pick which ones are good?

* Based on domain knowledge, sequences longer than, say, $10$ chords are very unlikely to be useful for distinguishing genre. First, I doubt that any $10$-gram actually occurs in more than $1\%$ of the songs. Second, musically speaking, a chord progression of length $10$ is usually longer than an entire "musical idea" or "phrase." Even $10$ is pushing it; by this reasoning we shouldn't expect anything longer than $6$ to be very useful.
* Even if we wanted to look at all the harmonically unique $n$-grams for $1 \le n \le 6$, that would be far too many. There are $44$ harmonically unique $1$-grams in the training set, $5903$ unique $2$-grams, and ________ unique $3$-grams. Just determining the list of unique $2$-grams takes a three-minute computation, and for $3$-grams it takes _____________.
* Because of the above, we need to heavily filter our choice of $n$-grams. The first easy criterion is to only consider $n$-grams which appear in at least some threshold fraction of the data set, something like $1\%$ or $10\%$. I suspect that if we restrict to $n$-grams that appear in at least $10\%$ of the database, we will already be down to a fairly small number, possibly even zero for $n \ge 3$. So maybe $10\%$ is too high a threshold, or we should use a threshold that is lower for larger $n$.
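
The idea of a threshold that shrinks as $n$ grows could be sketched as follows. The function name `threshold_for_n`, the base rate of $10\%$, and the halving for each extra chord are all illustrative assumptions, not values chosen in this notebook.

```python
def threshold_for_n(n, num_songs, base_rate=0.10, decay=0.5):
    """Hypothetical per-n sample size threshold.

    The required fraction of songs starts at base_rate for n = 1 and is
    multiplied by decay for each additional chord in the n-gram.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    rate = base_rate * decay ** (n - 1)
    return int(rate * num_songs)


# Example using the number of songs reported in this notebook
for n in range(1, 4):
    print(n, threshold_for_n(n, 255606))
```

Any decreasing schedule would serve the same purpose; a geometric decay is just one simple choice.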

In [2]:
import pandas as pd
from collections import Counter
import numpy as np
import json

data_folder_path = '../../data/'

In [3]:
# read in the database
df = pd.read_csv(data_folder_path + 'clean_test.csv', low_memory=False)
num_songs = len(df.index)

In [4]:
print("Number of songs:",num_songs)
display(df.head(5))

Number of songs: 255606


Unnamed: 0,chords,simplified_chords,decade,main_genre,spotify_song_id
0,<intro_1> G A Fsmin Bmin G A Fsmin Bmin <verse...,"G,A,Fsmin,Bmin,G,A,Fsmin,Bmin,G,A,Fsmin,Bmin,G...",2010.0,pop,7vpGKEUPrA4UEsS4o4W1tP
1,C F G C F G F Dmin G C F Dmin G C F G C F G F ...,"C,F,G,C,F,G,F,Dmin,G,C,F,Dmin,G,C,F,G,C,F,G,F,...",2000.0,alternative,7MTpNQUBKyyymbS3gPuqwQ
2,C F C G Amin G F C F C G Amin G F C G C F C G ...,"C,F,C,G,Amin,G,F,C,F,C,G,Amin,G,F,C,G,C,F,C,G,...",2000.0,alternative,6jIIMhcBPRTrkTWh3PXIc7
3,Amin G Gmin B Amin G Gmin B Amin G Gmin B Amin...,"Amin,G,Gmin,B,Amin,G,Gmin,B,Amin,G,Gmin,B,Amin...",2010.0,pop,2zAfQdoOeYujy7QIgDUq9p
4,<verse_1> D Dmaj7 G/D A/D D Dmaj7 G/D A/D <cho...,"D,Dmaj7,G,A,D,Dmaj7,G,A,G,D,Emin,D,A,G,D,Emin,...",2010.0,metal,40rChMoUd1VXb4TKgTuTSP


In [5]:
# specify limitations on n, and sample size threshold
n_range = [1,2]
relative_sample_size_threshold = 0.01
sample_size_threshold = int(relative_sample_size_threshold * num_songs)

print("Sample size threshold:",sample_size_threshold)

Sample size threshold: 2556


In [13]:
# load the counter json files
big_n_gram_counter = Counter()
for n in n_range:
    with open(data_folder_path + 'harmonically_unique_' + str(n) + '_gram_counts.json') as file:
        n_gram_counter = Counter(json.load(file)) # json.load returns a plain dict, so wrap it
        big_n_gram_counter = (big_n_gram_counter | n_gram_counter) # take the union (max count per key)
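
As a small aside on the merge step, here is a sketch of how `Counter` union behaves on toy data. The counts are made up, and the dict literals stand in for what `json.load` would return.

```python
from collections import Counter

# Toy stand-ins for the dicts that json.load returns (made-up counts)
one_gram_counts = {"G": 5, "Fsmin": 3}
two_gram_counts = {"F,C": 2, "C,F": 1}

merged = Counter()
for counts in (one_gram_counts, two_gram_counts):
    merged |= Counter(counts)  # wrapping in Counter keeps the result a Counter

print(type(merged).__name__, dict(merged))

# With overlapping keys, | keeps the maximum count while + adds the counts
a = Counter({"G": 5})
b = Counter({"G": 2})
print((a | b)["G"], (a + b)["G"])
```

Since $1$-gram and $2$-gram keys never coincide here, union and addition would give the same merged counter; the distinction only matters if the same key can appear in more than one file.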

In [21]:
len(big_n_gram_counter)

5947

In [25]:
# Eliminate any entries which don't meet the sample size threshold
restricted_counter = Counter({
    key: value
    for key, value in big_n_gram_counter.items()
    if value >= sample_size_threshold
})

In [27]:
len(restricted_counter)

259
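
One thing worth checking: the threshold is defined as a number of songs, but the stored counts may be total occurrences across the data set rather than the number of songs containing each $n$-gram (the count for `'G'` in the `restricted_counter` output, 12777669, is far larger than the 255606 songs). A minimal sketch of counting each $n$-gram at most once per song follows. It works on raw chord names from comma-separated strings like those in the `simplified_chords` column, it does not apply any harmonic-equivalence normalization, and the helper name `songs_containing` is hypothetical.

```python
from collections import Counter


def songs_containing(chord_strings, n):
    """Count, for each n-gram, how many songs contain it at least once.

    chord_strings: iterable of comma-separated chord strings, one per song.
    Hypothetical helper; it compares raw chord names only.
    """
    counts = Counter()
    for chord_string in chord_strings:
        chords = chord_string.split(",")
        # A set ensures each n-gram is counted once per song
        seen = {",".join(chords[i:i + n]) for i in range(len(chords) - n + 1)}
        counts.update(seen)
    return counts


# Toy data, not taken from the dataset
songs = ["C,F,G,C,F,G", "C,F,C,G", "G,A,G,A"]
doc_freq = songs_containing(songs, 2)
print(doc_freq["C,F"], doc_freq["G,A"])
```

Counts produced this way can never exceed the number of songs, so they are directly comparable to `sample_size_threshold`.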

In [29]:
restricted_counter

Counter({'G': 12777669,
         'Fsmin': 4627067,
         'F,C': 2945027,
         'C,F': 2359941,
         'G,A': 1230986,
         'G,F': 1063415,
         'G,Amin': 995946,
         'Bmin,G': 861782,
         'Amin,G': 852222,
         'A,Fsmin': 822318,
         'Amin,C': 700118,
         'Eb,Gmin': 597621,
         'Gs7': 594128,
         'Emin7': 548196,
         'Fsmin,Bmin': 416062,
         'Cmin,Gmin': 392765,
         'Cno3d': 353019,
         'Dmin,G': 307819,
         'A,Emin': 282851,
         'Dmaj7': 277398,
         'Gmin,Ab': 226747,
         'Dsus4': 222221,
         'A,Dmin': 213357,
         'G,As': 198504,
         'A7,D': 188207,
         'B,Gs': 155795,
         'G,Fs': 151409,
         'As,Amin': 127799,
         'F,A': 120868,
         'Fmin,C': 120062,
         'E,F': 117378,
         'Cadd9': 109279,
         'Gs7,Csmin': 101851,
         'E,E': 101580,
         'A,F': 90301,
         'Emin,Dmin': 88038,
         'Cmin,Dmin': 85649,
         'D,D7': 79533,