We considered a large variety of features in our analysis of the Chordonomicon dataset. All features considered or used were ultimately derived from the original "chords" feature. In the initial data, the chords of a song are recorded as a string listing chord names such as "C", "Amin", and "Gdim7". Along with chord names, the original chord data contains section labels such as "\<intro_1\>" and "\<chorus_1\>", as well as inversion markers such as "C/E", which indicates a C major chord in first inversion.

While we initially considered some features involving section labels, a sizable fraction of the dataset contained no section labels. Rather than restrict to songs with labels (which would significantly reduce the number of entries), we processed all of the chord data to remove all section labels. Regarding inversions, we decided based on domain-specific knowledge that inversions are not central enough to harmonic analysis to be useful for analyzing a target such as genre, so we also processed songs to remove inversion markers. The "cleaned up" version of the initial chords feature was then stored in a "simplified_chords" feature column, from which all of our other features derive.
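
The cleaning step described above could be sketched as follows. This is a minimal illustration and not necessarily the code used in the project: the function name `simplify_chords` is hypothetical, and it assumes that raw chord strings are whitespace-separated, that section labels are enclosed in angle brackets, and that an inversion is marked by a "/" followed by the bass note.

```python
import re

# Assumed token formats, based on the examples in the text:
# section labels look like "<intro_1>", inversions look like "C/E".
SECTION_LABEL = re.compile(r"<[^>]*>")


def simplify_chords(raw: str) -> str:
    """Drop section labels and inversion markers from a raw chord string."""
    without_labels = SECTION_LABEL.sub(" ", raw)
    # Keep only the part before "/" so that "C/E" becomes "C".
    chords = [token.split("/")[0] for token in without_labels.split()]
    # Assumption: the simplified string is comma-separated.
    return ",".join(chords)


print(simplify_chords("<intro_1> C G/B Amin <chorus_1> F C/E G"))
# C,G,Amin,F,C,G
```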

Our final list of model features fits broadly into two categories: "contains x" where x is an n-gram, and various other more "holistic" features. Before describing the features, it is useful to set up a bit of mathematical framework for chords, chord progressions, and harmonic equivalence. 

**Chords** are represented as strings (e.g. 'C' or 'Amin'), or as binary vectors of length 12 ('C' corresponds to $[1,0,0,0,1,0,0,1,0,0,0,0]$). The data represents chords within songs as string labels, and provides the file "chords_mapping.csv" for converting from a string to a vector. Mathematically, the space of chords is $X = \{0,1\}^{12}$. The cyclic group $G = \mathbb{Z}/12 \mathbb{Z}$ acts on $X$ by cyclically permuting vectors, which corresponds musically to transposition. Two chords are **harmonically equivalent** if they are in the same orbit of this group action, so the set of chords up to harmonic equivalence is the quotient space $X/G$. Musically speaking, two chords are harmonically equivalent if one of them is a transposition of the other. More generally, an $n$-gram is a tuple from $X^n$, the group $G$ acts entry-wise on $X^n$, and two $n$-grams are harmonically equivalent if they lie in the same $G$-orbit. The general space of $n$-grams up to equivalence is

$$\bigcup_{n \ge 1} (X^n/G)$$
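
One way to make harmonic equivalence concrete is to choose a canonical representative of each $G$-orbit, for example the lexicographically smallest of the 12 transpositions. The sketch below is illustrative only: the names `transpose`, `canonical`, and `harmonically_equivalent` are hypothetical, and it assumes that index 0 of a chord vector is the note C, as suggested by the vector given for 'C' above.

```python
from typing import Sequence, Tuple

Chord = Tuple[int, ...]   # binary vector of length 12
NGram = Tuple[Chord, ...]


def transpose(chord: Chord, k: int) -> Chord:
    """Cyclically shift a chord vector up by k semitones."""
    k %= 12
    return chord[-k:] + chord[:-k] if k else chord


def canonical(ngram: Sequence[Chord]) -> NGram:
    """Representative of the G-orbit: the smallest of the 12 transpositions."""
    return min(tuple(transpose(c, k) for c in ngram) for k in range(12))


def harmonically_equivalent(a: Sequence[Chord], b: Sequence[Chord]) -> bool:
    """Two n-grams are equivalent when they share a canonical representative."""
    return len(a) == len(b) and canonical(a) == canonical(b)


C = (1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0)
F = transpose(C, 5)  # F major: notes F, A, C
G = transpose(C, 7)  # G major: notes G, B, D

print(harmonically_equivalent((C, G), (F, C)))  # True: (F, C) is (C, G) moved up 5 semitones
print(harmonically_equivalent((C, G), (C, F)))  # False
```

Because $G$ acts entry-wise, every chord in an $n$-gram is shifted by the same amount, so the intervals between consecutive chords are preserved.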

**n-gram features.** Fix an $n$-gram $x \in X^n$, and let $\overline{x} \in X^n/G$ be the orbit of $x$. The $n$-gram feature "contains_$x$" is a binary feature denoting whether a given song contains any element of $\overline{x}$, i.e. any $n$-gram harmonically equivalent to $x$. There are far too many $n$-grams to use even a small fraction as features in this way. Just considering $n=1$, the full orbit space $X/G$ has 352 elements, of which 44 occur in the training data. The orbit space $X^2/G$ has over 1 million elements, of which 5903 occur in the training data. In order to restrict to a more feasible set of $n$-grams, we first decided, based on domain knowledge, that sequences longer than $n=5$ exceed a typical musical idea or phrase, and will not be useful for distinguishing genre. On the other hand, sequences of length $n=1$ or $n=2$ are so common across songs that they are not useful for distinguishing genre either. That leaves $3$-grams, $4$-grams, and $5$-grams. To select $n$-grams that occur in a meaningful number of songs, we gathered the most frequent raw $n$-grams (the top 100 for each $n$), then passed that list through the harmonic equivalence quotienting process, leaving around $40$ distinct classes for each $n$.
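
The sizes of the orbit spaces quoted above can be checked with Burnside's lemma, which says the number of orbits is the average number of fixed points over the group. The function name `orbit_count` below is hypothetical, and the snippet is only a quick sanity check of the counts.

```python
from math import gcd


def orbit_count(n: int) -> int:
    """Number of G-orbits on X^n, computed with Burnside's lemma.

    A cyclic shift by k splits the 12 positions into gcd(k, 12) cycles, so it
    fixes 2 ** gcd(k, 12) chord vectors, and 2 ** (n * gcd(k, 12)) n-grams
    because G acts entry-wise.
    """
    total_fixed = sum(2 ** (n * gcd(k, 12)) for k in range(12))
    return total_fixed // 12


print(orbit_count(1))  # 352
print(orbit_count(2))  # 1398500
```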

Another reason for excluding $1$-grams and $2$-grams from our features is that they are significantly redundant with longer $n$-grams. For example, any song containing 'F,G,C' will necessarily contain 'F' and 'F,G' and 'G,C'. While this is not literal collinearity of feature columns, it could lead to near-collinearity.
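
A sketch of how the n-gram features described above might be computed is given here. All names (`shift`, `orbit_key`, `ngrams`, `contains`, `top_classes`) are hypothetical, songs are assumed to be lists of length-12 binary tuples, and "most frequent" is assumed to mean total occurrences across all songs.

```python
from collections import Counter


def shift(chord, k):
    """Transpose a length-12 binary tuple up by k semitones."""
    return tuple(chord[(i - k) % 12] for i in range(12))


def orbit_key(ngram):
    """Smallest transposition of an n-gram, used as a label for its orbit."""
    return min(tuple(shift(c, k) for c in ngram) for k in range(12))


def ngrams(song, n):
    """All length-n windows of a song given as a list of chord vectors."""
    return [tuple(song[i:i + n]) for i in range(len(song) - n + 1)]


def contains(song, x):
    """Binary feature: 1 if the song has an n-gram harmonically equivalent to x."""
    target = orbit_key(x)
    return int(any(orbit_key(g) == target for g in ngrams(song, len(x))))


def top_classes(songs, n, k=100):
    """Take the k most frequent raw n-grams, then merge equivalent ones."""
    counts = Counter(g for song in songs for g in ngrams(song, n))
    return {orbit_key(g) for g, _ in counts.most_common(k)}


C = (1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0)
D, F, G, A = shift(C, 2), shift(C, 5), shift(C, 7), shift(C, 9)

song = [C, F, G, C]
print(contains(song, (G, A, D)))  # 1: (G, A, D) is (F, G, C) moved up 2 semitones
print(contains(song, (C, G, F)))  # 0
print(len(top_classes([[C, F, G, C], [D, G, A, D]], n=3)))  # 2
```

In the last line, the four raw 3-grams collapse to two classes, which mirrors how the top 100 raw $n$-grams reduce to a smaller number of distinct classes.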

**Holistic features.** In order to explore the data and view songs in a format roughly approximating traditional sheet music, we created a "string_to_chord_matrix" method which takes in a chord sequence such as 'C,G,C,G' and outputs a matrix whose rows are binary chord vectors of length 12, one for each chord. After an appropriate transformation (reversing the order of each row and taking the matrix transpose), the resulting binary matrix becomes a "left-to-right" view of the notes in the song, with higher notes corresponding to higher 1's. From this view, we created various features:

* missing_notes - The number of zero rows in the transformed chord matrix (equivalently, zero columns of the original chord matrix), measuring the number of tones from the 12-tone scale never used throughout a song. This feature is discrete, necessarily between 0 and 12, though realistically it will be between 0 and 5 for most songs.

* drone_ratio - A measurement of how close a song is to having a "drone," a single note played throughout the entire song. This metric is essentially continuous, giving a number in $[0,1]$ where a value of 1 indicates that the song contains a note played in every chord, while 0.5 indicates that the most common single note appears in only half of the chords. Concretely, this is calculated by taking the maximum value among the 12 column sums of a chord matrix, and dividing by the number of rows.

* average_overlap - A measurement of how much sequential pairs of chords overlap in notes. Concretely, for each sequential pair of chords in a song, take the dot product of those two chord vectors, then average this similarity metric across all sequential pairs in the song. As a concrete example, the sequential overlap in 'C,Amin' is 2, as C and A minor share the notes E and C.

* average_2overlap, average_3overlap, average_4overlap, average_5overlap - Generalizations of average_overlap, where the number indicates a time lag. For example, average_2overlap takes similarities between a chord and the chord two after it, averaged over all possible such pairs in a song.

* maj_triad_ratio - The fraction of chords in a song that are major triads.

* min_triad_ratio - The fraction of chords in a song that are minor triads.

* unique_chord_density - The number of distinct chords in a song divided by the total number of chords. Note that for the purposes of this metric, harmonically equivalent chords may still be considered distinct.

* unique_5_gram_density - The number of distinct 5-grams divided by the total number of chords. Note that for the purposes of this metric, harmonically equivalent 5-grams may still be considered distinct.
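
Several of the holistic features above can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not the project's implementation: `CHORD_VECTORS` is a small hypothetical stand-in for "chords_mapping.csv" (with index 0 assumed to be the note C), `sheet_view` is a hypothetical name for the transformation described above, and returning 0 for songs too short for a given lag is an assumption.

```python
# Hypothetical stand-in for chords_mapping.csv (index 0 = C, ..., 11 = B).
CHORD_VECTORS = {
    "C":    [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
    "F":    [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    "G":    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    "Amin": [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
}


def string_to_chord_matrix(chords):
    """One row per chord, each row a binary vector of length 12."""
    return [CHORD_VECTORS[name] for name in chords.split(",")]


def sheet_view(matrix):
    """Reverse each row, then transpose: rows are notes (B on top), columns are chords."""
    return [list(col) for col in zip(*[row[::-1] for row in matrix])]


def missing_notes(matrix):
    """Notes never used: all-zero columns here, i.e. zero rows of sheet_view."""
    return sum(1 for column in zip(*matrix) if not any(column))


def drone_ratio(matrix):
    """Largest of the 12 column sums divided by the number of chords."""
    return max(sum(column) for column in zip(*matrix)) / len(matrix)


def average_overlap(matrix, lag=1):
    """Mean dot product between each chord and the chord `lag` steps later."""
    pairs = list(zip(matrix, matrix[lag:]))
    if not pairs:
        return 0.0  # assumption: songs too short for this lag get 0
    return sum(sum(a * b for a, b in zip(u, v)) for u, v in pairs) / len(pairs)


def unique_chord_density(chords):
    """Distinct chord names divided by total chords (no transposition merging)."""
    names = chords.split(",")
    return len(set(names)) / len(names)


m = string_to_chord_matrix("C,Amin,F,G")
print(missing_notes(m))                                   # 5
print(drone_ratio(m))                                     # 0.75
print(average_overlap(string_to_chord_matrix("C,Amin")))  # 2.0
print(average_overlap(m, lag=2))                          # 0.5
print(unique_chord_density("C,G,C,G"))                    # 0.5
```

In the example song, the note C appears in three of the four chords, which gives the drone ratio of 0.75, and the overlap of 2 for 'C,Amin' matches the worked example in the description of average_overlap.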