All labels here were derived from the title or description fields, and ONLY for compositions that did not have a usable label in the composer field.


Decisions
- Whether to use additional metadata for labels and, if so, which field takes priority (e.g. if no composer is found in the title, check the description, or use the title only?)
- Whether or not to capitalise before entity recognition (trade-off: may remove names that resemble non-name words, or may retain erroneously capitalised normal words)
- Whether to delete all single-word composers (shortcuts can be added for the most popular, e.g. 'debussy')

To check
- How many additional reliable labels can be derived from the title and/or description fields?
- Final question: is performance better with or without the additional x labels derived from the other fields?
    - Analyse the number of compositions with and without additional text-field composers


Note: experiments performed on datasets with 'incomplete' in the filename are not representative of the whole metadata.csv

In [10]:
import csv

# Count how often each extracted entity string appears in the
# title column (index 1) and the description column (index 2).
composerDictTitles = dict()
composerDictDesc = dict()
titleCount, descCount = 0, 0

with open("ner_metadata1.csv", 'r') as labels:
    tsvreader = csv.reader(labels, delimiter='\t', lineterminator="\n")
    for line in tsvreader:
        # Skip rows where the title column is missing or empty
        if len(line) > 1 and line[1]:
            titleCount += 1
            composerDictTitles[line[1]] = composerDictTitles.get(line[1], 0) + 1
        # Skip rows where the description column is missing or empty
        if len(line) > 2 and line[2]:
            descCount += 1
            composerDictDesc[line[2]] = composerDictDesc.get(line[2], 0) + 1

# Sort (entity, count) pairs by count, most frequent first
sortedComposersByCountTitle = sorted(composerDictTitles.items(), key=lambda x: x[1], reverse=True)
sortedComposersByCountDesc = sorted(composerDictDesc.items(), key=lambda x: x[1], reverse=True)

print(sortedComposersByCountTitle[:10])
print("\nNumber of unique composers found in Title from " + str(titleCount) + " titles,: " + str(len(sortedComposersByCountTitle)))
print("Composers with more than 1 composition " + str(sum(1 for x in sortedComposersByCountTitle if x[1] > 1)))
print("\n")

print(sortedComposersByCountDesc[:10])
print("\nNumber of unique composers found in Desc from " + str(descCount) + " descriptions,: " + str(len(sortedComposersByCountDesc)))
print("Composers with more than 1 composition " + str(sum(1 for x in sortedComposersByCountDesc if x[1] > 1)))



[('Christmas', 1769), ('First', 1144), ('Irish', 686), ('Three', 641), ('Christmas; Christmas', 574), ('Jesus', 528), ('Morning', 483), ('Tonight', 378), ('French', 373), ('Seven', 340)]

Number of unique composers found in Title from 127957 titles,: 57807
Composers with more than 1 composition 13242


[('J. Chambers', 5223), ('First', 4129), ('R. Reelkey', 3902), ('Francis; F. N. -. Httpwwwmusicavivacom', 2755), ('Length-18', 2298), ('J. Kerr; J. Chambers', 2230), ('Meter', 1542), ('London; F. Titford-Mock', 1326), ('Irish', 1244), ('Second', 938)]

Number of unique composers found in Desc from 169928 descriptions,: 70064
Composers with more than 1 composition 10532
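
The counting cell above is re-run unchanged for every later version of the NER output. A minimal sketch of how it could be factored into one helper, assuming the same tab-separated layout (title entities in column 1, description entities in column 2). The names `count_composers` and `summarise` are hypothetical and not part of the original notebook.

```python
import csv
from collections import Counter


def count_composers(path, delimiter="\t"):
    """Count entity strings in the title (col 1) and description (col 2) columns."""
    titles, descs = Counter(), Counter()
    with open(path, "r", newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if len(row) > 1 and row[1]:
                titles[row[1]] += 1
            if len(row) > 2 and row[2]:
                descs[row[2]] += 1
    return titles, descs


def summarise(counter, top=10):
    """Return (top entries, rows counted, unique strings, strings seen more than once)."""
    repeated = sum(1 for n in counter.values() if n > 1)
    return counter.most_common(top), sum(counter.values()), len(counter), repeated
```

With this, each experiment would reduce to something like `titles, descs = count_composers("ner_metadata1.csv")` followed by `summarise(titles)`.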


First processing observations
- 128k titles with valid entities vs 170k descriptions with valid entities
- Composers from titles consist mainly of non-composer terms
- The top entries from titles don't contain the seemingly real composers found in descriptions
- More cleaning is needed before they can be used

Noted many non-composer terms
- Christmas, jesus
- ordinal numbers
- times of day (morning, tonight)
- language/culture descriptives (irish, french)
- music terms ('Length-18')
- place names (London)
- web sites (Httpwwwmusicavivacom, musescore)

To address: only take entities if entitylabel = person
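
A minimal sketch of that filter, assuming the NER step yields (text, label) pairs and tags people with the label "PERSON". Both the pair format and the label string are assumptions here, since the notes do not say which NER tool produced the files.

```python
def keep_person_entities(entities, person_label="PERSON"):
    """Keep only the entity texts whose label marks them as a person."""
    return [text for text, label in entities if label == person_label]


# Illustrative input only: entity texts taken from the output above,
# labels made up for the example.
sample = [("J. Chambers", "PERSON"), ("London", "GPE"), ("Christmas", "DATE")]
people = keep_person_entities(sample)
```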

In [14]:
sortedComposersTitleByName = sorted(composerDictTitles.items(), key =lambda x : x[0])
sortedComposersDescByName = sorted(composerDictDesc.items(), key =lambda x : x[0])

# print(*[ x[0] for x in sortedComposersByName],sep = '\n')
print([ x[0] for x in sortedComposersTitleByName][:10])
print([ x[0] for x in sortedComposersDescByName][:10])

composerWordsTitle = [[x for x in word[0].split(" ")] for word in sortedComposersTitleByName]
composerWordsDesc = [[x for x in word[0].split(" ")] for word in sortedComposersDescByName]

composerWordsTitle = [item for sublist in composerWordsTitle for item in sublist if item]
composerWordsDesc = [item for sublist in composerWordsDesc for item in sublist if item]

# print(composerWords)
wordCountDictTitle = dict()
for word in composerWordsTitle:
    if wordCountDictTitle.get(str(word)):
        wordCountDictTitle[word] = wordCountDictTitle[word]+1
    else:
        wordCountDictTitle[word] = 1

# print(composerWords)
wordCountDictDesc = dict()
for word in composerWordsDesc:
    if wordCountDictDesc.get(str(word)):
        wordCountDictDesc[word] = wordCountDictDesc[word]+1
    else:
        wordCountDictDesc[word] = 1

print("\nTop words in titles:")
print(sorted(wordCountDictTitle.items(), key =lambda x : x[1],reverse=True)[:100])

print("\nTop words in descs:")
print(sorted(wordCountDictDesc.items(), key =lambda x : x[1],reverse=True)[:100])


['A. -', 'A. -. A. Carol', 'A. -. A. J. Abbey', 'A. -. A. Zero', 'A. -. Adagio', 'A. -. Albert', 'A. -. Altre; Aria', 'A. -. Anver', 'A. -. Apologize', 'A. -. B. Africa']
['A---1', 'A-Moll', 'A. -', 'A. -. Altan', 'A. -. Ammeter; F. N. -. Httpwwwmusicavivacom', 'A. -. Ammeter; F. N. -. Httpwwwmusicavivacomnote', 'A. -. Clarinet', 'A. -. Cmeter; F. N. -. Httpwwwmusicavivacom', 'A. -. Cmeter; F. N. -. Httpwwwmusicavivacomnote', 'A. -. Cmeter; F. N. -. Httpwwwmusicavivacomrhythm']

Top words in titles:
[('S.', 6103), ('C.', 5880), ('M.', 5341), ('T.', 4876), ('A.', 4149), ('D.', 4038), ('J.', 3605), ('B.', 3573), ('P.', 3177), ('L.', 3158), ('H.', 3013), ('-.', 2963), ('F.', 2603), ('G.', 2478), ('W.', 2450), ('R.', 2116), ('N.', 1897), ('E.', 1813), ('O.', 1480), ('I.', 1340), ('V.', 1088), ('K.', 997), ('U.', 603), ('Y.', 540), ('Medley', 317), ('Piano', 313), ('Z.', 313), ('Christmas;', 270), ('Christmas', 268), ('First;', 231), ('Medley;', 217), ('Night', 195), ('Day', 184), ('March',

More words to remove
- '-.', '-', title, hornpipe, medley, march, day, piano, Deutschland, musicxml, cello, solo, alto, chambers, europa, year, capella, bass, tuba, waltz, midi, late, Berlin, crhythm, length-14, ago, years, Fribourg, england, american, australia, composition, tinwhistle, wip, morning, night, duet, winter, theme, trio
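
One way such a word list could be applied, as a rough sketch: reject any entity string that contains a listed word. Whether the actual cleaning drops the whole entity or only the offending word is not stated in these notes, so that choice is an assumption, as are the names `DISQUALIFIERS` and `is_disqualified`.

```python
# Small subset of the list above, lower-cased for case-insensitive matching
DISQUALIFIERS = {"hornpipe", "medley", "march", "piano", "waltz", "midi"}


def is_disqualified(entity, disqualifiers=DISQUALIFIERS):
    """True if any word of the entity string is in the disqualifier set."""
    words = (w.strip(".;,'").lower() for w in entity.split())
    return any(w in disqualifiers for w in words)
```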

Checking new version with better cleaning and only people entities


In [39]:

import csv

composerDictTitles = dict()
composerDictDesc = dict()
titleCount,descCount = 0,0

with open("ner_metadata2_incomplete.csv",'r') as labels:
    tsvreader = csv.reader(labels, delimiter='\t', lineterminator="\n")
    for line in tsvreader:
        if len(line)>1 and line[1]:
            titleCount+=1
            if composerDictTitles.get(line[1]):
                composerDictTitles[line[1]] = composerDictTitles[line[1]] +1
            else: composerDictTitles[line[1]] = 1
        if len(line)>2 and line[2]:
            descCount+=1
            if composerDictDesc.get(line[2]):
                composerDictDesc[line[2]] = composerDictDesc[line[2]] +1
            else:
                composerDictDesc[line[2]] = 1

sortedComposersByCountTitle = sorted(composerDictTitles.items(), key =lambda x : x[1], reverse=True)
sortedComposersByCountDesc = sorted(composerDictDesc.items(), key =lambda x : x[1], reverse=True)

print(sortedComposersByCountTitle[:10])
print("\nNumber of unique composers found in Title from " + str(titleCount) + " titles,: " + str(len(sortedComposersByCountTitle)))
print("Composers with more than 1 composition " + str(len([x for x in filter(lambda x: x[1]>1,sortedComposersByCountTitle)])) )
print("\n")

print(sortedComposersByCountDesc[:10])
print("\nNumber of unique composers found in Desc from " + str(descCount) + " descriptions,: " + str(len(sortedComposersByCountDesc)))
print("Composers with more than 1 composition " + str(len([x for x in filter(lambda x: x[1]>1,sortedComposersByCountDesc)])) )



[('Mary', 154), ('A. Miles', 128), ('J. S. Bach', 110), ('Bush', 109), ('Merry', 89), ('John', 89), ('Intro', 75), ('S. Claus', 72), ('Thomas', 68), ('Johnny', 67)]

Number of unique composers found in Title from 17752 titles,: 10023
Composers with more than 1 composition 2012


[('John', 4360), ('R. Reelkey', 2039), ('Francis; F. Nordberg', 1489), ('Meter', 1375), ('J. Kerr; John', 1243), ('F. Titford-Mock', 723), ('B. W. Hamilton; John', 612), ('Peter', 472), ('R. Robinson', 443), ('Melakarta', 387)]

Number of unique composers found in Desc from 53043 descriptions,: 17820
Composers with more than 1 composition 3226


OK, looking a bit better.
Results may improve if capitalisation is left in place to assist in identifying surnames.

Notes
- John? Meter? Peter?
- is R. Reelkey a name?

Below is the attempt with capitalisation and 'Intro' removed

In [43]:
import csv

composerDictTitles = dict()
composerDictDesc = dict()
titleCount,descCount = 0,0

with open("ner_metadata3_incomplete.csv",'r') as labels:
    tsvreader = csv.reader(labels, delimiter='\t', lineterminator="\n")
    for line in tsvreader:
        if len(line)>1 and line[1]:
            titleCount+=1
            if composerDictTitles.get(line[1]):
                composerDictTitles[line[1]] = composerDictTitles[line[1]] +1
            else: composerDictTitles[line[1]] = 1
        if len(line)>2 and line[2]:
            descCount+=1
            if composerDictDesc.get(line[2]):
                composerDictDesc[line[2]] = composerDictDesc[line[2]] +1
            else:
                composerDictDesc[line[2]] = 1

sortedComposersByCountTitle = sorted(composerDictTitles.items(), key =lambda x : x[1], reverse=True)
sortedComposersByCountDesc = sorted(composerDictDesc.items(), key =lambda x : x[1], reverse=True)

print(sortedComposersByCountTitle[:20])
print("\nNumber of unique composers found in Title from " + str(titleCount) + " titles,: " + str(len(sortedComposersByCountTitle)))
print("Composers with more than 1 composition " + str(len([x for x in filter(lambda x: x[1]>1,sortedComposersByCountTitle)])) )
print("\n")

print(sortedComposersByCountDesc[:20])
print("\nNumber of unique composers found in Desc from " + str(descCount) + " descriptions,: " + str(len(sortedComposersByCountDesc)))
print("Composers with more than 1 composition " + str(len([x for x in filter(lambda x: x[1]>1,sortedComposersByCountDesc)])) )



[('Moon', 55), ('C. H. Gabriel', 54), ('A. Miles', 49), ('Silent', 46), ('J. S. Bach', 43), ('Mary', 40), ('R. Lowry', 35), ('G. Ainm', 27), ('B. E. Warren', 26), ('H. Lowden', 24), ('Bush', 24), ('M. D. Y. Know', 24), ('Jack', 21), ('Charlie', 20), ('M. Jig', 20), ('S. Claus', 18), ('H. Potter', 18), ('C. Wesley', 17), ('Thomas', 17), ('M. Channel', 16)]

Number of unique composers found in Title from 13830 titles,: 9768
Composers with more than 1 composition 1631


[('John', 1895), ('Meter', 1133), ('F. Oneill; F. Nordberg', 485), ('Rhythm', 442), ('R. R. Gmajnote', 400), ('J. Kerr; John', 382), ('R. Reelkey', 345), ('F. Titford-Mock', 321), ('J. Campin', 213), ('R. H. Gmajnote', 150), ('R. R. Adornote', 142), ('R. R. Edornote', 134), ('R. H. Dmajnote', 133), ('M. Raagas', 130), ('R. R. Eminnote', 127), ('P. Dunk', 115), ('C. H. Gabriel', 112), ('R. R. Amajnote', 108), ('R. Robinson', 93), ('C. Partington', 90)]

Number of unique composers found in Desc from 26084 descriptions,: 1225

More cleaning
- string quartet, Rhythm
- moon? Maid? marquis? Undertale? -> try removing all single-word composers
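
A minimal sketch of the single-word removal, assuming entity fields hold names separated by ';' as in the output above (e.g. 'J. Kerr; John'). The allowlist for well-known single-word composers is hypothetical and only illustrates the 'debussy' shortcut idea from the Decisions list.

```python
KNOWN_SINGLE_NAMES = {"debussy"}  # hypothetical allowlist, lower-cased


def drop_single_word_composers(entity_field, keep=KNOWN_SINGLE_NAMES):
    """Remove single-word names from a ';'-separated field unless allowlisted."""
    kept = []
    for name in entity_field.split(";"):
        name = name.strip()
        if not name:
            continue
        if len(name.split()) > 1 or name.lower() in keep:
            kept.append(name)
    return "; ".join(kept)
```

A field that ends up empty after this step would then count as having no composer.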

Without single words

In [44]:
import csv

composerDictTitles = dict()
composerDictDesc = dict()
titleCount,descCount = 0,0

with open("ner_metadata4_incomplete.csv",'r') as labels:
    tsvreader = csv.reader(labels, delimiter='\t', lineterminator="\n")
    for line in tsvreader:
        if len(line)>1 and line[1]:
            titleCount+=1
            if composerDictTitles.get(line[1]):
                composerDictTitles[line[1]] = composerDictTitles[line[1]] +1
            else: composerDictTitles[line[1]] = 1
        if len(line)>2 and line[2]:
            descCount+=1
            if composerDictDesc.get(line[2]):
                composerDictDesc[line[2]] = composerDictDesc[line[2]] +1
            else:
                composerDictDesc[line[2]] = 1

sortedComposersByCountTitle = sorted(composerDictTitles.items(), key =lambda x : x[1], reverse=True)
sortedComposersByCountDesc = sorted(composerDictDesc.items(), key =lambda x : x[1], reverse=True)

print(sortedComposersByCountTitle[:20])
print("\nNumber of unique composers found in Title from " + str(titleCount) + " titles,: " + str(len(sortedComposersByCountTitle)))
print("Composers with more than 1 composition " + str(len([x for x in filter(lambda x: x[1]>1,sortedComposersByCountTitle)])) )
print("\n")

print(sortedComposersByCountDesc[:20])
print("\nNumber of unique composers found in Desc from " + str(descCount) + " descriptions,: " + str(len(sortedComposersByCountDesc)))
print("Composers with more than 1 composition " + str(len([x for x in filter(lambda x: x[1]>1,sortedComposersByCountDesc)])) )



[('C. H. Gabriel', 16), ('R. Lowry', 9), ('J. S. Bach', 9), ('A. Miles', 8), ('J. S. Bach; J. S. Bach', 7), ('G. Ainm', 7), ('D. Murphys', 6), ('H. Lowden', 5), ('J. Cope', 5), ('B. Lass', 5), ('S. Claus', 5), ('C. Wesley', 5), ('J. Bond', 5), ('C. H', 5), ('B. E. Warren', 4), ('K. Rambles', 4), ('M. Apron', 4), ('S. Universe', 4), ('G. Save', 4), ('J. Mcgranahan', 4)]

Number of unique composers found in Title from 2446 titles,: 2155
Composers with more than 1 composition 175


[('J. Kerr', 105), ('F. Oneill; F. Nordberg', 98), ('F. Titford-Mock', 86), ('R. Gmajnote', 80), ('J. Campin', 54), ('M. Raagas', 34), ('R. Edornote', 34), ('P. Dunk', 34), ('C. H. Gabriel', 32), ('H. Gmajnote', 32), ('R. Eminnote', 32), ('R. Adornote', 30), ('S. Mansfield', 30), ('H. Dmajnote', 29), ('C. Partington', 25), ('F. Nordberg', 24), ('R. Amajnote', 24), ('J. Walsh', 21), ('R. Robinson', 20), ('W. Pearson; J. Walsh', 19)]

Number of unique composers found in Desc from 4468 descriptions,: 2773
Composer

Looking better
- Gmajnote, Amajnote, Eminnote, Adornote ... 
- S. Universe
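
Tokens like Gmajnote or Cmeter look like a field name fused onto the previous word. A rough sketch of a pattern that would flag them, assuming the suffixes 'note' and 'meter' are the ones to catch. The suffix list is a guess from the output above, and 'key' is deliberately left out because it would also match real names such as 'Mickey'.

```python
import re

# A word of at least one character followed directly by 'note' or 'meter'
RESIDUE_SUFFIX = re.compile(r"^\w+(?:note|meter)$", re.IGNORECASE)


def has_residue_token(entity):
    """True if any token of the entity looks like fused metadata residue."""
    return any(RESIDUE_SUFFIX.match(tok.strip(".;,")) for tok in entity.split())
```

A bare 'Meter' is not caught by this pattern and would still need to go on the word list.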

To do: inspect the same but without initials

In [45]:
sortedComposersTitleByName = sorted(composerDictTitles.items(), key =lambda x : x[0])
sortedComposersDescByName = sorted(composerDictDesc.items(), key =lambda x : x[0])

# print(*[ x[0] for x in sortedComposersByName],sep = '\n')
print([ x[0] for x in sortedComposersTitleByName][:10])
print([ x[0] for x in sortedComposersDescByName][:10])

composerWordsTitle = [[x for x in word[0].split(" ")] for word in sortedComposersTitleByName]
composerWordsDesc = [[x for x in word[0].split(" ")] for word in sortedComposersDescByName]

composerWordsTitle = [item for sublist in composerWordsTitle for item in sublist if item]
composerWordsDesc = [item for sublist in composerWordsDesc for item in sublist if item]

# print(composerWords)
wordCountDictTitle = dict()
for word in composerWordsTitle:
    if wordCountDictTitle.get(str(word)):
        wordCountDictTitle[word] = wordCountDictTitle[word]+1
    else:
        wordCountDictTitle[word] = 1

# print(composerWords)
wordCountDictDesc = dict()
for word in composerWordsDesc:
    if wordCountDictDesc.get(str(word)):
        wordCountDictDesc[word] = wordCountDictDesc[word]+1
    else:
        wordCountDictDesc[word] = 1

print("\nTop words in titles:")
print(sorted(wordCountDictTitle.items(), key =lambda x : x[1],reverse=True)[:100])

print("\nTop words in descs:")
print(sorted(wordCountDictDesc.items(), key =lambda x : x[1],reverse=True)[:100])


['A. -A', 'A. A. Pokémon; A. Avalospokémon', 'A. A. Spirit; A. A. Spirit', 'A. Andersen; A. Andersen', 'A. Anthem', 'A. B. H. Invention', 'A. Blues', 'A. Bohemian; A. Bohemian', 'A. Browning', 'A. C. I. A. Made']
['A. A. B. Countymeter-24Rhythm-Engelskakey-G', 'A. Akfiddlers', 'A. Alain', 'A. Amnote; F. N. H. Germanymeter', 'A. Andrade; K. Chavezedit', 'A. Ari; N. Aikawa; H. Yamazaki; S. Nakamura', 'A. Avec; D. Tabledit; C. Che', 'A. B. A. Collins', 'A. B. Barnard', 'A. B. C. Longford']

Top words in titles:
[('S.', 320), ('M.', 309), ('B.', 305), ('J.', 293), ('C.', 263), ('T.', 263), ('D.', 198), ('H.', 177), ('L.', 173), ('P.', 153), ('A.', 141), ('W.', 140), ('R.', 140), ('G.', 124), ('F.', 115), ('K.', 88), ('O.', 83), ('E.', 79), ('N.', 62), ('Y.', 54), ('V.', 47), ('I.', 43), ('Reel', 32), ('U.', 18), ('Jig', 17), ('The', 16), ('Z.', 14), ('Anthem', 12), ('Fancy', 11), ('Q.', 9), ('Blues', 7), ('Strathspey', 7), ('Hill', 7), ('Trombone', 7), ('Delight', 7), ('Man', 6), ('Favouri

Words to remove: anthem, blues, jig, cnote, transcribed, sacredhymn, warmup, dance, live, carol, jungle, beat, clarinet, saxophone, undertale, time, races, main, castle, no, hymn


In [56]:



import csv

composerDictTitles = dict()
composerDictDesc = dict()
titleCount,descCount = 0,0

with open("ner_metadata4_init_incomplete.csv",'r') as labels:
    tsvreader = csv.reader(labels, delimiter='\t', lineterminator="\n")
    for line in tsvreader:
        if len(line)>1 and line[1]:
            titleCount+=1
            if composerDictTitles.get(line[1]):
                composerDictTitles[line[1]] = composerDictTitles[line[1]] +1
            else: composerDictTitles[line[1]] = 1
        if len(line)>2 and line[2]:
            descCount+=1
            if composerDictDesc.get(line[2]):
                composerDictDesc[line[2]] = composerDictDesc[line[2]] +1
            else:
                composerDictDesc[line[2]] = 1

sortedComposersByCountTitle = sorted(composerDictTitles.items(), key =lambda x : x[1], reverse=True)
sortedComposersByCountDesc = sorted(composerDictDesc.items(), key =lambda x : x[1], reverse=True)

print(sortedComposersByCountTitle[:100])
print("\nNumber of unique composers found in Title from " + str(titleCount) + " titles,: " + str(len(sortedComposersByCountTitle)))
print("Composers with more than 1 composition " + str(len([x for x in filter(lambda x: x[1]>1,sortedComposersByCountTitle)])) )
print("\n")

print(sortedComposersByCountDesc[:100])
print("\nNumber of unique composers found in Desc from " + str(descCount) + " descriptions,: " + str(len(sortedComposersByCountDesc)))
print("Composers with more than 1 composition " + str(len([x for x in filter(lambda x: x[1]>1,sortedComposersByCountDesc)])) )



[('J. S. Bach', 123), ('Chas H Gabriel', 96), ('Austin Miles', 95), ('Robert Lowry', 84), ('Harold Lowden', 59), ('Charles Wesley', 55), ('Harry Potter', 55), ('Gan Ainm', 53), ('Mary Did You Know', 52), ('Mii Channel', 48), ('Robert Harkness', 47), ('Barney Elliott Warren', 45), ('Santa Claus', 41), ('Isaac Watts', 40), ('J. S. Bach; J. S. Bach', 38), ('Chas H', 36), ('L. V. Beethoven', 34), ('James Mcgranahan', 34), ('Ira B Wilson', 33), ('Ding Dong Merrily', 32), ('Fairy Tail Main', 30), ('Haldor Lillenas', 28), ('Battle Hymn', 26), ('Ed Sheeran', 26), ('John Williams', 25), ('James Bond', 25), ('Fairy Tail', 25), ('Sally Gardens', 25), ('Mary Did You Know Mary', 25), ('Masons Apron', 23), ('Morrisons Jig', 23), ('Denis Murphys', 23), ('Howard E Smith', 23), ('Frank Lehman', 23), ('Tom Billys', 22), ('God Save', 22), ('Johnny Cope', 21), ('Steven Universe', 21), ('Mario Kart', 21), ('Thoro Harris', 21), ('Jenny Dang', 21), ('This Lesson1', 20), ('Joseph Barnby', 20), ('John Camidge'

Looking good!

Notes
- add to disqualifiers: Reelkey, Hornpipekey, Santa
- remove duplicates such as 'J. S. Bach; J. S. Bach'
- 'Ding Dong Merrily'? Bonny Lass? Battle Hymn?
- words: collection
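
The duplicate removal could look like the sketch below, assuming names within a field are separated by ';' as in 'J. S. Bach; J. S. Bach'. The function name `dedupe_entities` is hypothetical.

```python
def dedupe_entities(entity_field, sep=";"):
    """Collapse repeated names in a separated field, keeping first-seen order."""
    seen, unique = set(), []
    for name in (part.strip() for part in entity_field.split(sep)):
        if name and name not in seen:
            seen.add(name)
            unique.append(name)
    return "; ".join(unique)
```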


In [58]:
sortedComposersTitleByName = sorted(composerDictTitles.items(), key =lambda x : x[0])
sortedComposersDescByName = sorted(composerDictDesc.items(), key =lambda x : x[0])

# print(*[ x[0] for x in sortedComposersByName],sep = '\n')
print([ x[0] for x in sortedComposersTitleByName][:10])
print([ x[0] for x in sortedComposersDescByName][:10])

composerWordsTitle = [[x for x in word[0].split(" ")] for word in sortedComposersTitleByName]
composerWordsDesc = [[x for x in word[0].split(" ")] for word in sortedComposersDescByName]

composerWordsTitle = [item for sublist in composerWordsTitle for item in sublist if item]
composerWordsDesc = [item for sublist in composerWordsDesc for item in sublist if item]

# print(composerWords)
wordCountDictTitle = dict()
for word in composerWordsTitle:
    if wordCountDictTitle.get(str(word)):
        wordCountDictTitle[word] = wordCountDictTitle[word]+1
    else:
        wordCountDictTitle[word] = 1

# print(composerWords)
wordCountDictDesc = dict()
for word in composerWordsDesc:
    if wordCountDictDesc.get(str(word)):
        wordCountDictDesc[word] = wordCountDictDesc[word]+1
    else:
        wordCountDictDesc[word] = 1

print("\nTop words in titles:")
print(sorted(wordCountDictTitle.items(), key =lambda x : x[1],reverse=True)[:100])

print("\nTop words in descs:")
print(sorted(wordCountDictDesc.items(), key =lambda x : x[1],reverse=True)[:100])


['A B Simpson', 'A Battle Battle', 'A Dead Princess Naki Oujo', 'A Good', 'A J Abbey', 'A Man Emmett', 'A Mario', 'A Young Virgin', 'A3 Yourname A3- Lauren Cole', 'A4Feng A3; Arirang 아리랑 행진곡']
['A B Everett', 'A B Simpson', 'A Bob Ross Themed', 'A Clarinet', 'A Clarinet Of; Ian Good', 'A Clarinet Quintet', 'A Clarinet Trumpet', 'A County Down', 'A Drumset', 'A Furlong Of Edenborrow Town; Mr King; Henry Atkinsons']

Top words in titles:
[('The', 684), ('John', 411), ('Reel', 262), ('Jig', 252), ('Mary', 233), ('Lady', 184), ('Of', 173), ('Mrs', 171), ('William', 166), ('James', 166), ('St', 151), ('Pokemon', 149), ('Strathspey', 134), ('Dance', 130), ('Battle', 127), ('J', 127), ('Tom', 124), ('Merry', 122), ('B', 118), ('Sweet', 118), ('My', 117), ('Charles', 110), ('Bonny', 105), ('W', 99), ('Harry', 97), ('Henry', 93), ('Thomas', 92), ('George', 92), ('Michael', 91), ('Jack', 89), ('Hymn', 83), ('Delight', 82), ('Johnny', 81), ('Concert', 79), ('Clarinet', 76), ('Castle', 75), ('Bad'

Parsing the whole dataset with the most recent model to check the results of strict title and description processing

In [66]:
import csv

composerDictTitles = dict()
composerDictDesc = dict()
totalDict = dict()
titleCount,descCount = 0,0
totalComposersAdded = 0

with open("ner_metadata5.csv",'r') as labels:
    tsvreader = csv.reader(labels, delimiter='\t', lineterminator="\n")
    for line in tsvreader:
        if (len(line)>1 and line[1]) or (len(line)>2 and line[2]): totalComposersAdded +=1
        if len(line)>1 and line[1]:
            titleCount+=1
            if composerDictTitles.get(line[1]):
                composerDictTitles[line[1]] = composerDictTitles[line[1]] +1
            else: composerDictTitles[line[1]] = 1
        if len(line)>2 and line[2]:
            descCount+=1
            if composerDictDesc.get(line[2]):
                composerDictDesc[line[2]] = composerDictDesc[line[2]] +1
            else:
                composerDictDesc[line[2]] = 1

sortedComposersByCountTitle = sorted(composerDictTitles.items(), key =lambda x : x[1], reverse=True)
sortedComposersByCountDesc = sorted(composerDictDesc.items(), key =lambda x : x[1], reverse=True)

print("NER Processing of metadata text fields of ONLY compositions that did not have a valid composer\n")
print("Top 100 composers derived from ONLY title metadata fields:\n")
print(sortedComposersByCountTitle[:100])
print("\nNumber of new unique composers found in Title from " + str(titleCount) + " cleaned titles: " + str(len(sortedComposersByCountTitle)))
print("Composers with more than 1 composition " + str(len([x for x in filter(lambda x: x[1]>1,sortedComposersByCountTitle)])) )
print("\n")
print("Top 100 composers derived from ONLY description metadata fields:\n")
print(sortedComposersByCountDesc[:100])
print("\nNumber of new unique composers found in Desc from " + str(descCount) + " cleaned descriptions: " + str(len(sortedComposersByCountDesc)))
print("Composers with more than 1 composition " + str(len([x for x in filter(lambda x: x[1]>1,sortedComposersByCountDesc)])) )


print("\nTotal composers added to metadata that previously had no composer (from both titles and descriptions): " + str(totalComposersAdded))

NER Processing of metadata text fields of ONLY compositions that did not have a valid composer

Top 100 composers derived from ONLY title metadata fields:

[('J. S. Bach', 374), ('C. H. Gabriel', 236), ('A. Miles', 211), ('H. Potter', 196), ('R. Lowry', 185), ('G. Ainm', 171), ('C. Wesley', 132), ('F. Tail', 128), ('J. Bond', 118), ('M. D. Y. Know', 118), ('H. Lowden', 113), ('B. E. Warren', 111), ('J. Mcgranahan', 109), ('M. Channel', 103), ('R. Harkness', 102), ('D. Murphys', 101), ('L. V. Beethoven', 88), ('S. Gardens', 87), ('W. A. Mozart', 82), ('C. H', 79), ('I. Watts', 78), ('M. D. Y. K. Mary', 75), ('M. Jackson', 73), ('S. Universe', 66), ('B. Lass', 66), ('D. Miller', 65), ('E. Sheeran', 65), ('P. Canon', 65), ('S. Up', 65), ('M. Kart', 63), ('H. E. Smith', 62), ('C. Whisper', 61), ('I. B. Wilson', 61), ('D. Boy', 60), ('T. Billys', 59), ('M. Apron', 59), ('J. Barnby', 59), ('J. Williams', 59), ('M. Reel', 58), ('T. Tully', 57), ('E. Rigby', 57), ('J. R. Sweney', 55), ('G. Sav

131230 new composer labels give an almost 10% increase in data. Can we test performance with and without the additional labels?

Next step: combine the two fields and look at the overall top composers, first attempting to pull from the title and then from the description (and maybe the other order too?)
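
A minimal sketch of that combination step, with the field order as a parameter so both priorities can be compared. The function `pick_label` and the sample rows are hypothetical; the rows are made-up pairings for illustration, not real records.

```python
from collections import Counter


def pick_label(title_composer, desc_composer, order=("title", "desc")):
    """Return the first non-empty composer following the given field order."""
    candidates = {"title": title_composer, "desc": desc_composer}
    for field in order:
        if candidates[field]:
            return candidates[field]
    return None


# (title entity, description entity) pairs, illustrative only
rows = [("J. S. Bach", ""), ("", "J. Kerr"), ("H. Potter", "J. Williams"), ("", "")]

title_first = Counter(
    label for label in (pick_label(t, d) for t, d in rows) if label
)
desc_first = Counter(
    label for label in (pick_label(t, d, order=("desc", "title")) for t, d in rows) if label
)
```

The two orders differ only on rows where both fields produced an entity, so comparing `title_first` and `desc_first` shows how much the priority choice matters.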

In [67]:
sortedComposersTitleByName = sorted(composerDictTitles.items(), key =lambda x : x[0])
sortedComposersDescByName = sorted(composerDictDesc.items(), key =lambda x : x[0])

# print(*[ x[0] for x in sortedComposersByName],sep = '\n')
print([ x[0] for x in sortedComposersTitleByName][:10])
print([ x[0] for x in sortedComposersDescByName][:10])

composerWordsTitle = [[x for x in word[0].split(" ")] for word in sortedComposersTitleByName]
composerWordsDesc = [[x for x in word[0].split(" ")] for word in sortedComposersDescByName]

composerWordsTitle = [item for sublist in composerWordsTitle for item in sublist if item]
composerWordsDesc = [item for sublist in composerWordsDesc for item in sublist if item]

# print(composerWords)
wordCountDictTitle = dict()
for word in composerWordsTitle:
    if wordCountDictTitle.get(str(word)):
        wordCountDictTitle[word] = wordCountDictTitle[word]+1
    else:
        wordCountDictTitle[word] = 1

# print(composerWords)
wordCountDictDesc = dict()
for word in composerWordsDesc:
    if wordCountDictDesc.get(str(word)):
        wordCountDictDesc[word] = wordCountDictDesc[word]+1
    else:
        wordCountDictDesc[word] = 1

print("\nTop words in titles:")
print(sorted(wordCountDictTitle.items(), key =lambda x : x[1],reverse=True)[:100])

print("\nTop words in descs:")
print(sorted(wordCountDictDesc.items(), key =lambda x : x[1],reverse=True)[:100])

['A. -A', 'A. A', 'A. A. A. Mashup', 'A. A. Am', 'A. A. Amoramor', 'A. A. B. The', 'A. A. Cecile', 'A. A. Closing', 'A. A. Copy-Cat; A. Advanced-Copy', 'A. A. Fonn']
['A. A. Aishite', 'A. A. Alleluia', 'A. A. Aminnote', 'A. A. Approach; J. S. Bach', 'A. A. Bainnote', 'A. A. Bohusle4N', 'A. A. Book', 'A. A. Book; N. Gatherer', 'A. A. Book; N. Gatherermeter', 'A. A. C. Dmeter']

Top words in titles:
[('S.', 4602), ('M.', 4036), ('B.', 3696), ('J.', 3655), ('C.', 3515), ('T.', 3284), ('L.', 2494), ('D.', 2432), ('A.', 2431), ('H.', 2337), ('R.', 2202), ('P.', 2128), ('G.', 2030), ('W.', 1930), ('F.', 1616), ('K.', 1216), ('E.', 1207), ('O.', 1138), ('N.', 847), ('Y.', 729), ('I.', 633), ('V.', 587), ('U.', 255), ('Q.', 212), ('Z.', 195), ('Reel', 194), ('The', 145), ('Strathspey', 117), ('X.', 78), ('Fancy', 62), ('Cover', 59), ('Furey;', 55), ('Of', 54), ('Favourite', 51), ('A', 48), ('You', 48), ('Favorite', 47), ('Suite', 47), ('Ii', 46), ('Farewell', 44), ('Furey', 44), ('Band', 42), 

Still some words to remove: house, remix, tunes, enjoy, dmeter, reel, collection
Still some weird single-word names, e.g. 'A. -A'
