Example label processing pipeline

We begin with metadata.csv, produced by tally.py from the initial dataset.
Each row corresponds to a composition ID and its available composer, title and description metadata.
These fields were entered by users and need cleaning before they can be used with the model.
Example initial inputs are shown below.
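
The loading cell that follows wraps the file handle in fix_nulls from parseMetadata, whose body is not shown in this notebook. As a hedged sketch only: a helper with that name plausibly strips NUL characters, which csv.reader has historically rejected with a "line contains NUL" error. The name strip_nul_bytes below is hypothetical and is not the author's implementation.

```python
import csv


def strip_nul_bytes(lines):
    """Hypothetical stand-in for fix_nulls: yield each line without NUL characters."""
    for line in lines:
        yield line.replace("\0", "")


# Example with an in-memory list of lines instead of a file handle
raw_lines = ["1,Some\0 Title,desc,Bach\n", "2,Other,desc,Mozart\n"]
rows = list(csv.reader(strip_nul_bytes(raw_lines)))
print(rows)
```

Writing it as a generator keeps the file streaming line by line, which matters for a metadata file with several hundred thousand rows.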

In [36]:
import csv
from parseMetadata import fix_nulls
import sys
import matplotlib.pyplot as plt
import pandas

composerDict = dict()
compCountAll = 0
# Raise the csv field size limit so very long fields do not raise an error
csv.field_size_limit(sys.maxsize)

# Count compositions per raw composer string (column 3 of metadata.csv)
with open("metadata.csv", 'r') as labels:
    csvreader = csv.reader(fix_nulls(labels))
    for line in csvreader:
        compCountAll += 1
        composerDict[line[3]] = composerDict.get(line[3], 0) + 1

# Sort (composer, count) pairs by count, largest first
sortedComposersByCount1 = sorted(composerDict.items(), key=lambda x: x[1], reverse=True)
composers = pandas.DataFrame(sortedComposersByCount1)
initialData = pandas.read_csv("metadata.csv", nrows=5)
print(initialData)
print("\nNumber of unique composers found: " + str(len(sortedComposersByCount1)) + " from a total of " + str(compCountAll) + " compositions")
print("Composers with more than 1 composition " + str(sum(1 for x in sortedComposersByCount1 if x[1] > 1)))
print("Top 10 composers (Pre-processing)")
composers[:10]

        ID                                              title  \
0  4939836           Springlek efter Perbj\\"ors Erik Persson   
1  5401399                                      Excerpts in C   
2  5925434  Wii Shoppin' for the Big Bands; Wii Shop Chann...   
3  3587506                                      Hide And Seek   
4  5161684                               The First Love Dream   

                                         description  \
0  Note Length 1/8\nRhythm polska K1\npolska K1\n...   
1                                                NaN   
2  This is meant to be printed for use in Jazz Bi...   
3     FROM MY OWN hide and seek "movie not released"   
4          Note Length - 1/8\nKey - G\nMeter - 4/4\n   

                                            composer  
0                                                NaN  
1                                                NaN  
2  Composed by Kazumi Totaka; Arranged by Lars Ka...  
3                                            Brandon

index,composer,count
0,,630360
1,Composer,24810
2,anon.,4274
3,Pierangelo Fernandes Carera,3855
4,Traditional,3162
5,Yohei Kato * 加藤 洋平,2794
6,Yohei Kato * 加藤洋平,2353
7,Trad.,2223
8,Johann Sebastian Bach,1757
9,Toby Fox,1757
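
The raw top 10 splits some composers across several spellings ("Traditional" and "Trad.", two spacings of the Yohei Kato entry), and the processed tables later in this notebook show names in an initials-plus-surname form such as "J. S. Bach". A minimal sketch of that kind of normalization follows. It assumes simple space-separated names with the family name last; abbreviate_name is a hypothetical helper and not the code in labelExtract.py.

```python
def abbreviate_name(raw):
    """Hypothetical helper: reduce given names to initials, keep the family name.

    'Johann Sebastian Bach' -> 'J. S. Bach'
    Single-word values such as 'Traditional' are returned unchanged.
    """
    parts = raw.strip().split()
    if len(parts) < 2:
        return raw.strip()
    *given, family = parts
    initials = [name[0].upper() + "." for name in given]
    return " ".join(initials + [family])


for raw in ["Johann Sebastian Bach", "Toby Fox", "Ludwig van Beethoven", "Traditional"]:
    print(raw, "->", abbreviate_name(raw))
```

A rule this simple does not merge aliases such as "Trad." and "Traditional", nor names written in more than one script; those cases need an explicit alias table or further rules.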


From this initial data we can retrieve the entries that have composer data and clean them with basic processing methods.
This is performed by parseMetadata.py using functions from labelExtract.py, producing "slim_metadata.csv" (see LabelPipeline for details).
The new distribution after processing is shown below.

In [27]:
composerDict = dict()
compCountSlim = 0
totalEntries = 0
csv.field_size_limit(sys.maxsize)

# Count compositions per cleaned composer name (column 1 of slim_metadata.csv)
with open("slim_metadata.csv", 'r') as labels:
    csvreader = csv.reader(labels)
    for line in csvreader:
        compCountSlim += 1
        composerDict[line[1]] = composerDict.get(line[1], 0) + 1

# Sort (composer, count) pairs by count, largest first
sortedComposersByCount1 = sorted(composerDict.items(), key=lambda x: x[1], reverse=True)
composers = pandas.DataFrame(sortedComposersByCount1)
print("\nNumber of unique composers found: " + str(len(sortedComposersByCount1)) + " from a total of " + str(compCountSlim) + " compositions")
print("Composers with more than 1 composition " + str(sum(1 for x in sortedComposersByCount1 if x[1] > 1)))
print("Top 10 = ")
composers[:10]



Number of unique composers found: 214308 from a total of 744311 compositions
Composers with more than 1 composition 79281
Top 10 = 


index,composer,count
0,J. S. Bach,9498
1,Y. Kato 加藤 洋平,5156
2,W. A. Mozart,4758
3,P. F. Carera,4438
4,T. Fox,4081
5,L. V. Beethoven,3994
6,J. Williams,3082
7,K. Kondo,3033
8,A. Vivaldi,1403
9,J. Hisaishi,1392
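
The cells above repeat the same counting loop for each CSV file. As a sketch of how that could be factored out, the function below uses collections.Counter from the standard library. The name count_composers and its parameters are illustrative and do not appear in parseMetadata.py or labelExtract.py.

```python
import csv
from collections import Counter


def count_composers(lines, column):
    """Return (composer, count) pairs sorted by count, largest first.

    `lines` is any iterable of CSV text lines (an open file works),
    `column` is the index of the composer field.
    """
    reader = csv.reader(lines)
    counts = Counter(row[column] for row in reader if len(row) > column)
    return counts.most_common()


# Small in-memory example standing in for a file such as slim_metadata.csv
sample = ["1,J. S. Bach", "2,T. Fox", "3,J. S. Bach", "4,K. Kondo"]
sorted_counts = count_composers(sample, column=1)
repeat_composers = sum(1 for _, n in sorted_counts if n > 1)
print(sorted_counts, repeat_composers)
```

With a real file the call would be `count_composers(open_file, column=1)` inside a `with open(...)` block, and the same function would serve both processed datasets.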


Below is the set of composers derived from the composer field plus additional composers extracted from the other metadata fields.
This is produced by parseMetadata.py, run with additional flags and using functions from labelExtract.py, which writes "slim_all_fields_metadata.csv" (see LabelPipeline for details).

In [29]:
composerDictAll = dict()
compCountSlimAll = 0
# Count compositions per composer name (column 1 of slim_all_fields_metadata.csv)
with open("slim_all_fields_metadata.csv", 'r') as labels:
    csvreader = csv.reader(labels)
    for line in csvreader:
        compCountSlimAll += 1
        composerDictAll[line[1]] = composerDictAll.get(line[1], 0) + 1

# Sort (composer, count) pairs by count, largest first
sortedComposersByCountAll = sorted(composerDictAll.items(), key=lambda x: x[1], reverse=True)
allFieldsData = pandas.DataFrame(sortedComposersByCountAll)
print("\nNumber of unique composers found: " + str(len(sortedComposersByCountAll)) + " from a total of " + str(compCountSlimAll) + " compositions")
print("Composers with more than 1 composition " + str(sum(1 for x in sortedComposersByCountAll if x[1] > 1)))
allFieldsData[:10]


Number of unique composers found: 264775 from a total of 875571 compositions
Composers with more than 1 composition 90749


index,composer,count
0,J. S. Bach,9972
1,Y. Kato 加藤 洋平,5156
2,W. A. Mozart,4899
3,P. F. Carera,4438
4,T. Fox,4187
5,L. V. Beethoven,4133
6,J. Williams,3205
7,K. Kondo,3066
8,F. Oneill; F. Nordberg,2675
9,J. Kerr,2220
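
The totals printed for the two processed datasets can be compared directly. The short sketch below only redoes the arithmetic on the figures reported above; the variable names are illustrative.

```python
# Totals reported above for slim_metadata.csv and slim_all_fields_metadata.csv
slim = {"compositions": 744311, "unique_composers": 214308}
all_fields = {"compositions": 875571, "unique_composers": 264775}

growth = {}
for key in slim:
    added = all_fields[key] - slim[key]
    growth[key] = 100 * added / slim[key]
    print(f"{key}: +{added} ({growth[key]:.1f}%)")
```

With these figures the labelled compositions grow by about 17.6% and the unique composer strings by about 23.5% when the other metadata fields are included.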


Comparison of top 10 counts with and without the additional metadata fields

In [33]:
list(zip(sortedComposersByCountAll, sortedComposersByCount1))[:10]

[(('J. S. Bach', 9972), ('', 630360)),
 (('Y. Kato  加藤 洋平', 5156), ('Composer', 24810)),
 (('W. A. Mozart', 4899), ('anon.', 4274)),
 (('P. F. Carera', 4438), ('Pierangelo Fernandes Carera', 3855)),
 (('T. Fox', 4187), ('Traditional', 3162)),
 (('L. V. Beethoven', 4133), ('Yohei Kato * 加藤 洋平', 2794)),
 (('J. Williams', 3205), ('Yohei Kato * 加藤洋平', 2353)),
 (('K. Kondo', 3066), ('Trad.', 2223)),
 (('F. Oneill; F. Nordberg', 2675), ('Johann Sebastian Bach', 1757)),
 (('J. Kerr', 2220), ('Toby Fox', 1757))]

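The right-hand entries above are the pre-processing counts, so the names do not line up row by row.
Pairing the all-fields counts (left) with the composer-field-only counts from "slim_metadata.csv" (right) instead:
Top 8 remain the same
<=5% increase in count for each of the top 8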

[(('J. S. Bach', 9972), ('J. S. Bach', 9498)),
 (('Y. Kato  加藤 洋平', 5156), ('Y. Kato  加藤 洋平', 5156)),
 (('W. A. Mozart', 4899), ('W. A. Mozart', 4758)),
 (('P. F. Carera', 4438), ('P. F. Carera', 4438)),
 (('T. Fox', 4187), ('T. Fox', 4081)),
 (('L. V. Beethoven', 4133), ('L. V. Beethoven', 3994)),
 (('J. Williams', 3205), ('J. Williams', 3082)),
 (('K. Kondo', 3066), ('K. Kondo', 3033))]
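
The <=5% figure can be checked from the paired list above. The sketch below joins the two sets of counts by composer name rather than by rank, since zipping two sorted lists only lines up when both rankings agree. The function name percent_increase is illustrative; the counts are copied from the output above.

```python
# Counts copied from the paired list above
all_fields = {
    "J. S. Bach": 9972, "Y. Kato 加藤 洋平": 5156, "W. A. Mozart": 4899,
    "P. F. Carera": 4438, "T. Fox": 4187, "L. V. Beethoven": 4133,
    "J. Williams": 3205, "K. Kondo": 3066,
}
slim = {
    "J. S. Bach": 9498, "Y. Kato 加藤 洋平": 5156, "W. A. Mozart": 4758,
    "P. F. Carera": 4438, "T. Fox": 4081, "L. V. Beethoven": 3994,
    "J. Williams": 3082, "K. Kondo": 3033,
}


def percent_increase(all_counts, slim_counts):
    """Percent growth per composer, matched by name instead of by rank."""
    return {
        name: 100 * (all_counts[name] - slim_counts[name]) / slim_counts[name]
        for name in slim_counts
        if name in all_counts
    }


increase = percent_increase(all_fields, slim)
for name, pct in sorted(increase.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {pct:.2f}%")
```

The largest increase is for J. S. Bach at just under 5%, and two composers (Y. Kato and P. F. Carera) gain no additional compositions.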

Conclusion: If more data is desired to train a larger model on the top composers, using the inclusive (all-fields) dataset may provide up to a 5% increase in training samples per composer, at the cost of label accuracy. The 5% figure is measured on the 8 composers that appear in both top-10 lists.