Goal of this notebook: 
convert the chordonomicon data set to make each song a matrix where each column records a single chord. 

Steps:

1. Import the chordonomicon data set, drop all columns except for 'id' and 'chords'
2. Remove section marker info from chords i.e. remove \<intro_1\>
3. For each song, convert each chord into a vector, then concatenate them into a matrix 

In [2]:
# importing basic packages
import numpy as np
import pandas as pd
import matplotlib as plt
import ast
import re

# read in the data set
df = pd.read_csv('../data/chordonomicon.csv')

  df = pd.read_csv('../data/chordonomicon.csv')


In [3]:
df.head()

Unnamed: 0,id,chords,s_artist,release_date,genres,decade,main_genre,rock_genre,parts
0,1,<intro_1> C <verse_1> F C E7 Amin C F C G7 C F...,artist_0,1977,classic country pop,1970.0,pop,,yes
1,2,<intro_1> E D A/Cs E D A/Cs <verse_1> E D A/Cs...,artist_1,2003-01-01,"alternative metal""alternative rock""nu metal""pe...",2000.0,metal,pop rock,yes
2,3,<intro_1> D Dmaj7 D Dmaj7 <verse_1> Emin A D G...,artist_2,2022-09-23,,2020.0,,,yes
3,4,<intro_1> C <verse_1> G C G C <chorus_1> F Dmi...,artist_3,2023-02-10,modern country pop,2020.0,pop,,yes
4,5,<intro_1> C G C G <verse_1> C G C G C Bmin Emi...,artist_4,2018-08-24,"classic opm""opm",2010.0,,,yes


In [4]:
# drop all columns except for chords
chord_data = df[['chords','genres']]

In [5]:
chord_data.head()

Unnamed: 0,chords,genres
0,<intro_1> C <verse_1> F C E7 Amin C F C G7 C F...,classic country pop
1,<intro_1> E D A/Cs E D A/Cs <verse_1> E D A/Cs...,"alternative metal""alternative rock""nu metal""pe..."
2,<intro_1> D Dmaj7 D Dmaj7 <verse_1> Emin A D G...,
3,<intro_1> C <verse_1> G C G C <chorus_1> F Dmi...,modern country pop
4,<intro_1> C G C G <verse_1> C G C G C Bmin Emi...,"classic opm""opm"


I want to clean up the chord sequences in two ways:
1. Replace spaces by commas.
2. Delete any section markers like <intro_1>
3. Remove 'inversion markers,' i.e. replace 'A/Cs' with 'A'

In [7]:
# replacing spaces with commas
def replace_space_with_comma(my_string):
    return my_string.replace(" ",",")

# replacing spaces with commons in all chords in all rows of the data
chord_data.loc[:,'chords'] = chord_data['chords'].apply(replace_space_with_comma)
chord_data.head()

Unnamed: 0,chords,genres
0,"<intro_1>,C,<verse_1>,F,C,E7,Amin,C,F,C,G7,C,F...",classic country pop
1,"<intro_1>,E,D,A/Cs,E,D,A/Cs,<verse_1>,E,D,A/Cs...","alternative metal""alternative rock""nu metal""pe..."
2,"<intro_1>,D,Dmaj7,D,Dmaj7,<verse_1>,Emin,A,D,G...",
3,"<intro_1>,C,<verse_1>,G,C,G,C,<chorus_1>,F,Dmi...",modern country pop
4,"<intro_1>,C,G,C,G,<verse_1>,C,G,C,G,C,Bmin,Emi...","classic opm""opm"


In [8]:
# Remove section markers


################################################################
# Earlier version, deprecated for memory inefficiency but
# should be functionally equivalent
################################################################
#def remove_section_markers(my_string):
#    while True:
#        start = my_string.find('<')
#        if start == -1:
#            break
#        else:
#            end = my_string.find('>')
#            my_string = (my_string[0:start] + my_string[end+2:])
#    return my_string
##################################################################

def remove_section_markers(my_string):
    result = []
    i = 0
    n = len(my_string)
    while i < n:
        if my_string[i] == '<':
            # Skip until after the following ", "
            comma = my_string.find('>', i)
            if comma == -1:
                break  # no closing comma, stop
            i = comma + 2  # skip comma and the space
        else:
            result.append(my_string[i])
            i += 1
    return ''.join(result)

# just a basic test on the first chord sequence
my_string = chord_data.iloc[0]['chords']
print(my_string)
print()
print(remove_section_markers(my_string))

<intro_1>,C,<verse_1>,F,C,E7,Amin,C,F,C,G7,C,F,C,E7,Amin,C,F,G7,C,<verse_2>,F,C,E7,Amin,C,F,C,G7,C,F,C,E7,Amin,C,F,G7,C,<chorus_1>,F,C,F,C,G,C,F,C,E7,Amin,C,F,G7,C,<solo_1>,D,<chorus_2>,G,D,G,D,A,D,G,D,Fs7,Bmin,D,G,A7,D,G,A7,D

C,F,C,E7,Amin,C,F,C,G7,C,F,C,E7,Amin,C,F,G7,C,F,C,E7,Amin,C,F,C,G7,C,F,C,E7,Amin,C,F,G7,C,F,C,F,C,G,C,F,C,E7,Amin,C,F,G7,C,D,G,D,G,D,A,D,G,D,Fs7,Bmin,D,G,A7,D,G,A7,D


In [9]:
# removing section markers from all rows
chord_data.loc[:,'chords'] = chord_data['chords'].apply(remove_section_markers)
chord_data.head()

Unnamed: 0,chords,genres
0,"C,F,C,E7,Amin,C,F,C,G7,C,F,C,E7,Amin,C,F,G7,C,...",classic country pop
1,"E,D,A/Cs,E,D,A/Cs,E,D,A/Cs,E,D,A/Cs,E,D,A/Cs,E...","alternative metal""alternative rock""nu metal""pe..."
2,"D,Dmaj7,D,Dmaj7,Emin,A,D,G,Emin,A,D,G,Emin,A,D...",
3,"C,G,C,G,C,F,Dmin,G,Dmin,G,C,G,C,F,Dmin,G,Dmin,...",modern country pop
4,"C,G,C,G,C,G,C,G,C,Bmin,Emin,Amin,D,G,C,D,G,C,D...","classic opm""opm"


In [10]:
# Removing inversions

################################################################
# Earlier version, deprecated for memory inefficiency but
# should be functionally equivalent
################################################################
#def remove_inversions(my_string):
#    while True:
#        start = my_string.find('/')
#        if start == -1:
#            break
#        else:
#            end = my_string.find(',',start)
#            my_string = (my_string[0:start] + my_string[end+2:])
#    return my_string
##################################################################

def remove_inversions(my_string):
    result = []
    i = 0
    n = len(my_string)
    while i < n:
        if my_string[i] == '/':
            # Skip until after the following ", "
            comma = my_string.find(',', i)
            if comma == -1:
                break  # no closing comma, stop
            i = comma + 2  # skip comma and the space
        else:
            result.append(my_string[i])
            i += 1
    return ''.join(result)

# just a basic test on the first chord sequence
my_string = chord_data.iloc[1]['chords']
print(my_string)
print()
print(remove_inversions(my_string))

E,D,A/Cs,E,D,A/Cs,E,D,A/Cs,E,D,A/Cs,E,D,A/Cs,E,D,A,C,E,G,D,A,E,G,D,A,E,G,D,A,C,D,E,D,A/Cs,E,D,A/Cs,E,D,A/Cs,E,D,A/Cs,E,D,A,C,E,G,D,A,E,G,D,A,E,G,D,A,C,D,E,C,G,D,E,C,G,D,E,C,G,D,C,D,E,G,E,G,D,A,E,G,D,A,E,G,D,A,C,D,E,G,D,A,E,G,D,A,E,G,D,A,C,D,E,C,G,D,E,C,G,D,E,C,G,D,C,D,E

E,D,A,D,A,D,A,D,A,D,A,D,A,C,E,G,D,A,E,G,D,A,E,G,D,A,C,D,E,D,A,D,A,D,A,D,A,D,A,C,E,G,D,A,E,G,D,A,E,G,D,A,C,D,E,C,G,D,E,C,G,D,E,C,G,D,C,D,E,G,E,G,D,A,E,G,D,A,E,G,D,A,C,D,E,G,D,A,E,G,D,A,E,G,D,A,C,D,E,C,G,D,E,C,G,D,E,C,G,D,C,D,E


In [11]:
# removing inversions on the whole data set
chord_data.loc[:,'chords'] = chord_data['chords'].apply(remove_inversions)
chord_data.head()

Unnamed: 0,chords,genres
0,"C,F,C,E7,Amin,C,F,C,G7,C,F,C,E7,Amin,C,F,G7,C,...",classic country pop
1,"E,D,A,D,A,D,A,D,A,D,A,D,A,C,E,G,D,A,E,G,D,A,E,...","alternative metal""alternative rock""nu metal""pe..."
2,"D,Dmaj7,D,Dmaj7,Emin,A,D,G,Emin,A,D,G,Emin,A,D...",
3,"C,G,C,G,C,F,Dmin,G,Dmin,G,C,G,C,F,Dmin,G,Dmin,...",modern country pop
4,"C,G,C,G,C,G,C,G,C,Bmin,Emin,Amin,D,G,C,D,G,C,D...","classic opm""opm"


Now that the chord data is cleaned up into a format I like, I'll convert chords to vectors.

In [13]:
# some code written by the Chordonomicon authors for converting chord names to vector

# Read the mapping CSV file
chord_relations = pd.read_csv('../data/chords_mapping.csv')

# Create a dictionary with keys the "chords" and values the "degrees"
chord_degrees = dict(zip(chord_relations['Chords'], chord_relations['Degrees']))
for key, value in chord_degrees.items():
    chord_degrees[key] = ast.literal_eval(value)

Now chord_degrees is a dictionary where keys are chord names of the form 'Cmaj','E9', etc. and values are binary vectors of length 12, with 1 in a position where that note is included.

In [15]:
print(chord_degrees['C'])
print(chord_degrees['Cs'])
print(chord_degrees['D'])

[1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
[0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
[0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0]


In [16]:
my_array = np.random.rand(2,4)
print(my_array)
print()
print(my_array.transpose())

[[0.06845819 0.1728776  0.5498252  0.00987757]
 [0.82212376 0.20332511 0.67170763 0.51871111]]

[[0.06845819 0.82212376]
 [0.1728776  0.20332511]
 [0.5498252  0.67170763]
 [0.00987757 0.51871111]]


In [17]:
# function to convert a string of comma-separated chords into a matrix, where each column denotes a chord
def string_to_chord_matrix(chord_sequence):
    # split sequence over commas, then look up each chord in chord_degrees dictionary by the key value (c here)
    return np.array([chord_degrees[c][::-1] for c in chord_sequence.split(',')]).transpose()

np.set_printoptions(linewidth=400)
my_string = chord_data.iloc[0]['chords']
print(my_string)
print()
print(string_to_chord_matrix(my_string))

C,F,C,E7,Amin,C,F,C,G7,C,F,C,E7,Amin,C,F,G7,C,F,C,E7,Amin,C,F,C,G7,C,F,C,E7,Amin,C,F,G7,C,F,C,F,C,G,C,F,C,E7,Amin,C,F,G7,C,D,G,D,G,D,A,D,G,D,Fs7,Bmin,D,G,A7,D,G,A7,D

[[0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 1]
 [0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 1 0 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 0 1 0 0 0 1 0 0 0 0 1 1 0 1 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 0 0 1 0 0 1]
 [0 1 0 0 0

In [18]:
for i in range(3):
    my_string = chord_data.iloc[i]['chords']
    print(my_string)
    print()
    print(string_to_chord_matrix(my_string))
    print()

C,F,C,E7,Amin,C,F,C,G7,C,F,C,E7,Amin,C,F,G7,C,F,C,E7,Amin,C,F,C,G7,C,F,C,E7,Amin,C,F,G7,C,F,C,F,C,G,C,F,C,E7,Amin,C,F,G7,C,D,G,D,G,D,A,D,G,D,Fs7,Bmin,D,G,A7,D,G,A7,D

[[0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 1]
 [0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 1 0 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 0 1 0 0 0 1 0 0 0 0 1 1 0 1 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 0 0 1 0 0 1]
 [0 1 0 0 0

Now to apply this to every song in the database. Here goes.

In [48]:
print(chord_degrees['F'])
print(chord_degrees['E'])
chord_degrees['Fb'] = chord_degrees['E']
print(chord_degrees['Fb'])

[1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1]


In [50]:
# I would like to turn every chord sequence into a matrix
# However, there is an obstacle: not every chord in the database is in the chords_mapping.csv data file
# For example, Fb (which is enharmonically and mathematically equivalent to E) is missing
# This is fixable by doing:
chord_degrees['Fb'] = chord_degrees['E']
# However, there are other things missing which I am not sure how to remedy. Currently the first key error encountered by running this code is
# 'Amin7maj7'
# I have no idea what that refers to. Is it a real chord? Or is it just 'Amin7' that got accidentally spliced somehow by my other data processing? I'm not sure.

chord_data.loc[:,'chords'] = chord_data['chords'].apply(string_to_chord_matrix)
chord_data.head()

KeyError: 'Amin7maj7'