# Polifonia-Pratt alignment

Align Polifonia JAAH recording artists with Pratt Linked Jazz database names  
https://github.com/polifonia-project/datasets <-> https://triplydb.com/pratt/linked-jazz/

Two steps:
1. Match recording_meta artist names to Pratt Linked Jazz people names
2. Validate the matched pairs with additional information
    - query both elements of each pair for birth/death dates and places
    - consider a match valid if at least one field matches

Data in brief:
- 510 recording_meta related artists
- 2005 pratt people names

Results in brief:
1. name-only: 220 matches (takes approx. 1 minute)
2. query validated: 189 matches by name and at-least-one field match (takes approx. 5 minutes)

In [23]:
# PATHS
LINKED_JAZZ_PATH = "pratt_linked_jazz/linked-jazz.trig"
DATA_MAP_PATH = "data_map.csv"

#RECORDINGS_PATH = "datasets-main/dataset/jaah/cornucopia_data/musicbrainz/" 
RECORDINGS_PATH = "jaah_recording_meta/"

OUTPUT_1 = "outputs/sameas_by_names.txt"
OUTPUT_2 = "outputs/sameas_by_sparql.txt"

In [3]:
# name match
!pip install pandas -q
!pip install nltk -q

import json
import os
import pandas as pd
from nltk.metrics import distance

# match validation
!pip install requests -q
!pip install sparqlwrapper -q
import requests
from SPARQLWrapper import SPARQLWrapper, JSON


## Name Match
### Pratt file
The Pratt linked-jazz.trig database is a TriG file:  
a Turtle-like format that stores RDF triples together with their context (RDF quads), and can therefore hold multiple graphs.

#### Pratt people
People information is stored in the first part of the file, under the '<https://triplydb.com/pratt/linked-jazz/id/graph/people>' graph.  
It is made of subject-predicate-object triples where:  
  
subjects are:
- <http://dbpedia.org/resource/Zutty_Singleton>
- <http://id.loc.gov/authorities/names/n00064734>
- <http://linkedjazz.org/resource/Adam_Lambert>
- <http://musicbrainz.org/artist/03df380a-665d-48c6-9cee-90c53f40ec3b>
- <http://www.allmusic.com/artist/angelo-tompros-mn0000868923>  
  
predicates are:  
- sub <http://xmlns.com/foaf/0.1/name>
- sub <http://dbpedia.org/ontology/thumbnail>  
  
  

Every single entry has a foaf name, like:  
sub <http://xmlns.com/foaf/0.1/name> "Zsa Zsa Gabor"@en .
And many have a thumbnail:  
sub <http://dbpedia.org/ontology/thumbnail> <http://linkedjazz.org/image/square/Zutty_Singleton.png> .

Thus: every Pratt People name can be extracted by its foaf:name object.
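
As a minimal sketch of that extraction, the snippet below pulls the subject IRI and the name out of a single triple line with a regular expression. The sample lines are illustrative assumptions about the line shape (subject, predicate and literal on one line), and `parse_name_triple` is a hypothetical helper, not part of the notebook. Names containing escaped quotes are not handled.

```python
import re

# Assumed line shape: <subject IRI> <foaf:name IRI> "literal"@lang .
NAME_TRIPLE = re.compile(
    r'^\s*(<[^>]+>)\s+<http://xmlns\.com/foaf/0\.1/name>\s+"([^"]*)"'
)

def parse_name_triple(line):
    """Return (subject_iri, name) for a foaf:name triple, or None otherwise."""
    found = NAME_TRIPLE.match(line)
    if found is None:
        return None
    return found.group(1), found.group(2)

# illustrative sample lines (assumed, not copied from the .trig file)
name_line = ('<http://dbpedia.org/resource/Zutty_Singleton> '
             '<http://xmlns.com/foaf/0.1/name> "Zutty Singleton"@en .')
thumb_line = ('<http://dbpedia.org/resource/Zutty_Singleton> '
              '<http://dbpedia.org/ontology/thumbnail> '
              '<http://linkedjazz.org/image/square/Zutty_Singleton.png> .')

print(parse_name_triple(name_line))
print(parse_name_triple(thumb_line))
```

Thumbnail triples and any other predicate are ignored, so only foaf:name lines contribute a name.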


In [6]:
# list pratt artists by foaf:name:

# 1 read the Pratt linked-jazz .trig file as plain text
# 2 read people triples line by line and stop at the '}' that closes the people graph
# 3 populate pratt_people (IRI -> name) by cleaning each line around the foaf:name predicate

FOAF_NAME = '<http://xmlns.com/foaf/0.1/name>'

with open(LINKED_JAZZ_PATH,"r") as f:
    pratt_file = f.readlines()

pratt_people = {}
count = 0
for line in pratt_file:
    if line.strip() == '}': break #end of people (lines keep their trailing newline)
    if FOAF_NAME in line:
        parts = line.split(FOAF_NAME)

        iri = parts[0].strip()
        person = parts[-1].split('"')[1]

        count += 1 # every foaf:name counts as a name alias
        if not pratt_people.get(iri):
            pratt_people[iri] = person
        else:
            # keep the first name found for this IRI and report the IRI
            print('duplicate name:', iri, pratt_people[iri])

print(f"\nPratt has {len(pratt_people)} people entries with {count} name aliases")

duplicate name: <http://dbpedia.org/resource/Fantasy_Records> Max Weiss
duplicate name: <http://dbpedia.org/resource/Martin_Luther_King,_Jr.> Martin Luther King Jr
duplicate name: <http://id.loc.gov/authorities/names/no98012278> Argonne Thornton

Pratt has 2005 people entries with 2008 name aliases


In [7]:
# data_map.csv stores each recording .json file name
df = pd.read_csv(DATA_MAP_PATH)
df.head(2)


Unnamed: 0,artist,track_name,jaah/chord_annotation,genius/lyrics,genius/lyrics_annotation,musicbrainz/work_meta,musicbrainz/recording_meta,musicbrainz/release_meta,musicbrainz/artist_meta,secondhandsong/artist_meta,sonar_id
0,Thelonious Monk,Evidence,evidence.jams,NF,NF,work_f1875b4c-ac07-3614-9bd2-4259e1865f07.json,recording_b438e7a3-91b9-4e43-9551-03b07036f3f3...,release_e65a3c02-84c4-41b8-802f-71b5d42ac92a.json,artist_b438e7a3-91b9-4e43-9551-03b07036f3f3.json,shs_2288.json,jaah_76
1,Charlie Parker All-Stars,Parker's Mood,parkers_mood.jams,NF,NF,work_93618e9e-4bf8-3cd9-8b72-258750d4222c.json,recording_c50ea630-b5f0-4c12-8e75-0321757d69a3...,release_a90eeed6-dc07-4173-938c-49e63142f299.json,artist_c50ea630-b5f0-4c12-8e75-0321757d69a3.json,NF,jaah_78


In [8]:
len(pratt_people.values()), list(pratt_people.values())[:5]

(2005,
 ['Aaron Copland',
  'Abbey Lincoln',
  'Abdullah Ibrahim',
  'Adele Girard',
  'Agatha Christie'])

In [12]:
# check if the 'artist' column matches with pratt people
count_matches = 0
for artist in df['artist']:
    if artist in pratt_people.values(): count_matches +=1
print(count_matches)

assert 'Charlie Parker' in pratt_people.values()

print(df['artist'][1],  "!= Charlie Parker")

25
Charlie Parker All-Stars != Charlie Parker


The match count on the 'artist' column is very low.  
This is because artists are usually expressed as main artist + band:  
- Charlie Parker All-Stars,
- Jelly Roll Morton's Red Hot Peppers etc.
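
To illustrate why such credits fail an exact comparison, here is a small sketch that looks for a known person name at the start of a band credit. This is not the approach taken in this notebook, which reads the artists from the recording_meta files instead. `leading_person` and the `known` list are hypothetical and only serve the example.

```python
def leading_person(credit, known_names):
    """Return the longest known name that the credit starts with, or None."""
    hits = [
        name for name in known_names
        if credit == name
        or credit.startswith(name + " ")   # "Charlie Parker All-Stars"
        or credit.startswith(name + "'")   # "Jelly Roll Morton's Red Hot Peppers"
    ]
    return max(hits, key=len) if hits else None

# hypothetical list of person names, for illustration only
known = ['Charlie Parker', 'Jelly Roll Morton', 'Thelonious Monk']

for credit in ['Charlie Parker All-Stars',
               "Jelly Roll Morton's Red Hot Peppers",
               'Thelonious Monk']:
    print(credit, '->', leading_person(credit, known))
```

A prefix test like this only recovers the band leader, while the recording_meta files list every related artist.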


### Recording_meta vs Pratt alignment
The following function retrieves all the artists related to the 'recording_meta' files we have.

In [13]:
def recording_names(df):
    """Map MusicBrainz artist ID -> name for all artists related to the recordings in df"""

    artist_names = {}
    for i in range(len(df)):
        json_path = df.loc[i,'musicbrainz/recording_meta']
        
        with open(os.path.join(RECORDINGS_PATH, json_path),'r') as json_file:
            json_entry = json.load(json_file)

        # keying by artist ID deduplicates artists shared by several recordings
        for artist in json_entry['recording']['artist-relation-list']:
            artist_names[artist['artist']['id']] = artist['artist']['name']
    return artist_names

recording_artists = recording_names(df)

print(f"Found {len(recording_artists)} artists in recording meta files")

Found 510 artists in recording meta files


In [14]:
# keep only the Pratt people identified by a DBpedia IRI
pratt_people_dbpedia = {
    iri: name for iri, name in pratt_people.items()
    if iri.startswith('<http://dbpedia.org/')
}
len(pratt_people_dbpedia)

1516

In [15]:
count_matches = 0

for artist in recording_artists.values():
    if artist in pratt_people.values():
        count_matches +=1

print(f"Found {count_matches} perfect matches by name")


Found 231 perfect matches by name


### Edit distance matching

Checking for a perfect (exact) match already gives good results, but many artists are left unmatched.  
If no perfect match exists for a given recording artist, the edit distance to every Pratt people name is computed.  

The closest candidate is selected and returned as a (distance, (IRI, name)) pair if its distance is within the threshold.  
Matches with a distance greater than 1 are collected in the possible_matches list and reviewed in a later step.  
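
The idea can be sketched without any dependency. The notebook itself relies on `nltk.metrics.distance.edit_distance`. The pure-Python `edit_distance` and the `closest_name` helper below are illustrative stand-ins, and the candidate list is a small sample of names that appear in the outputs of this notebook.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]

def closest_name(name, candidates, threshold=3):
    """Return (distance, candidate) for the nearest candidate, or None above threshold."""
    best = min(candidates, key=lambda c: edit_distance(name, c))
    dist = edit_distance(name, best)
    return (dist, best) if dist <= threshold else None

names = ['Rudy Van Gelder', 'Sidney DeParis', 'Thelonius Monk']
print(closest_name('Rudy van Gelder', names))   # differs by one letter case
print(closest_name('Sidney De Paris', names))   # differs by one space
print(closest_name('Miles Davis', names))       # no candidate within threshold
```

A small threshold keeps spelling variants while rejecting unrelated names, but it cannot tell two different people with near-identical names apart, which is why the matches are validated further below.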


In [16]:
# not used now: strip quoted nicknames and abbreviated tokens (containing '.') from names
strange = []
for a in recording_artists.items():
    name = a[1]
    #print(name)
    if '"' in name or '.' in name:
        #print(name)
        strange.append(a)

for strange_name in strange:
    print(strange_name)
    name = strange_name[1]
    print(name)
    name = name.split()
    newname = []
    for substring in name:
        if not '"' in substring and '.' not in substring:
            newname.append(substring)
    newname = ' '.join(newname).strip(',')
    print(newname)
    print()


('7a281c93-adc9-4426-9331-f5e870162f91', 'Johnny St. Cyr')
Johnny St. Cyr
Johnny Cyr

('b1669585-019f-483d-80c2-caada4275157', 'Joe "Tricky Sam" Nanton')
Joe "Tricky Sam" Nanton
Joe Nanton

('2a58b026-9659-4bfb-90c9-56c4ad9125f1', 'James P. Johnson')
James P. Johnson
James Johnson

('33e50556-d4be-421b-a5d1-644bd536ec07', 'J.J. Johnson')
J.J. Johnson
Johnson

('63ca22b3-b010-45d6-99bf-fd205829a645', 'Sanford J. Siegelstein')
Sanford J. Siegelstein
Sanford Siegelstein

('6c097081-53d9-417f-a169-2eface036ca7', 'George "Pops" Foster')
George "Pops" Foster
George Foster

('f661cb81-305e-421d-bac0-c02f6056d824', 'L. Z. Cooper')
L. Z. Cooper
Cooper

('a7ed66a4-9ddd-430a-9797-be4ca4199eb3', 'Louis R. Mucci')
Louis R. Mucci
Louis Mucci

('9c0e864c-f356-4ebc-98af-0a45cd7801b9', 'Billy Taylor Sr.')
Billy Taylor Sr.
Billy Taylor

('58bc629e-b6d2-47a6-ab11-4f6d64c6f7fe', 'Mancy "Peck" Carr')
Mancy "Peck" Carr
Mancy Carr

('10eb67e5-8594-41e9-a691-3b7d3a9da907', 'Roy "Red" Maier')
Roy "Red" Maier
R

In [17]:
from collections import namedtuple

# (edit distance, (pratt IRI, pratt name)) of a candidate
PairDist = namedtuple('PairDist', ['dist', 'y'])

def check_match(name, namedict, distance_th=3, loose =False):
    """
    Find the entry of namedict (IRI -> name) whose name is closest to `name`.
    Returns PairDist(dist, (iri, candidate_name)) if the edit distance
    is <= distance_th, otherwise False. `loose` is currently unused.
    """
    # a perfect match short-circuits the edit distance computation
    for item in namedict.items():
        if name == item[1]:
            return PairDist(0, item)

    # otherwise rank every candidate by edit distance and keep the closest
    candidates = [PairDist(distance.edit_distance(name, item[1]), item)
                  for item in namedict.items()]
    candidates.sort(key= lambda x: x.dist)

    if candidates[0].dist <= distance_th:
        return candidates[0]
    return False

check_match('Thelnius Monk', pratt_people)

PairDist(dist=1, y=('<http://id.loc.gov/authorities/names/n82218969>', 'Thelonius Monk'))

In [18]:
# check matches for each artist in recording meta

possible_matches = []
perfect_matches = []

for artist in recording_artists.items():
    match = check_match(artist[1], pratt_people, distance_th=3)
    if match:
        if match[0] <= 1: #matched with distance <=1
            perfect_matches.append((artist[0], match[1][0])) #iri, iri
            if match[0] == 1: print(artist, match)
        else:
            possible_matches.append((artist, match))

len(perfect_matches)


('f0707f1d-55e1-46b6-8a9c-05d508e09a73', 'Rudy van Gelder') PairDist(dist=1, y=('<http://dbpedia.org/resource/Rudy_Van_Gelder>', 'Rudy Van Gelder'))
('46435fe9-34c1-4e4e-86ed-cc0bf6c6234d', 'Earle Warren') PairDist(dist=1, y=('<http://dbpedia.org/resource/Earle_Warren>', 'Earl Warren'))
('d2de9fbd-1cdf-4212-b5ee-f830f69521de', 'Georgie Auld') PairDist(dist=1, y=('<http://id.loc.gov/authorities/names/n85098932>', 'George Auld'))
('6981867e-810d-4a16-96cf-8dfbc734a477', 'Sidney De Paris') PairDist(dist=1, y=('<http://id.loc.gov/authorities/names/n91049777>', 'Sidney DeParis'))


235

In [19]:
# resolve firstname diminutives
aliases = [
    ('Freddie Powell', 'Fred Powell'),
    ('Freddie Powell','Freddy Powell'),
    ('Bob Powell','Bobby Powell'),
    ('Howard E. Johnson','Howard Johnson'),
    ('John Thomas','Joe Thomas'),
    ('Johnny Martel', 'Johnny Mandel')]

aliases_results = [True, True, True, True, False, False]
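def check_firstname_diminutive(pair):
    """
    Return True if the two names share the surname (last token) and
    one first name is a prefix of the other once 'ie' is removed and 'y' is stripped.
    """

    a,b  = pair
    
    if a.split()[-1] != b.split()[-1]:
        return False # different surname
    # e.g. 'Freddie' and 'Freddy' both reduce to 'Fredd'
    a = a.split()[0].replace('ie','').strip('y')
    b = b.split()[0].replace('ie','').strip('y')

    if a.startswith(b) or b.startswith(a):
        print(pair, True)
        return True
    else:
        return False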

assert [check_firstname_diminutive(pair) for pair in aliases] == aliases_results


('Freddie Powell', 'Fred Powell') True
('Freddie Powell', 'Freddy Powell') True
('Bob Powell', 'Bobby Powell') True
('Howard E. Johnson', 'Howard Johnson') True


In [20]:
# resolve possible diminutives among the matches with 1 < distance <= distance_th
checked_matches = []
for (artist, match) in sorted(possible_matches, key= lambda x: x[1][0]):

    if check_firstname_diminutive((artist[1], match[1][1])):
        checked_matches.append((artist[0], match[1][0]))

len(checked_matches), checked_matches
        

('Freddy Jenkins', 'Freddie Jenkins') True
('Fred Robinson', 'Freddy Robinson') True
('Bob Whitlock', 'Bobby Whitlock') True
('Dicky Wells', 'Dickie Wells') True
('Fred Guy', 'Freddie Guy') True
('Howard E. Johnson', 'Howard Johnson') True


(6,
 [('a1eb52a1-07d5-4179-91ae-3ee5fe3e19b9',
   '<http://dbpedia.org/resource/Freddie_Jenkins>'),
  ('a21a0f9e-89b9-42c4-83d7-8da1295fdef3',
   '<http://dbpedia.org/resource/Freddy_Robinson>'),
  ('128eac4c-1724-4ea6-a7c9-c348e89235e1',
   '<http://dbpedia.org/resource/Bobby_Whitlock>'),
  ('c43e27c8-3c26-4cea-a39a-9a091aabfd36',
   '<http://dbpedia.org/resource/Dicky_Wells>'),
  ('ffd90943-ec19-4b02-8077-1cf2e4221b90',
   '<http://id.loc.gov/authorities/names/n80124239>'),
  ('14786e41-9591-47a2-b8f4-a7cd50b6ae8b',
   '<http://dbpedia.org/resource/Howard_Johnson_(jazz_musician)>')])

In [21]:
len(perfect_matches) + len(checked_matches)

241

In [22]:
perfect_matches += checked_matches
checked_matches = []
len(perfect_matches)

241

In [39]:
# save matches to file
os.makedirs(os.path.split(OUTPUT_1)[0], exist_ok=True)
with open(OUTPUT_1,'w') as f:
    for i, pair in enumerate(perfect_matches):
        if i!=0: f.write('\n')
        a, b = pair
        f.write(f"<http://musicbrainz.org/artist/{a}> <owl:sameAs> {b}")


Some Pratt people are identified by a MusicBrainz IRI, thus we can automatically validate our previous matching on those pairs.  
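
A minimal sketch of this check, run on in-memory lines rather than on the output file: whenever both sides of a pair are MusicBrainz IRIs, they must be identical. The `mbid_conflicts` helper is hypothetical and the artist IDs in `sample` are placeholders, not real MBIDs.

```python
MB_PREFIX = '<http://musicbrainz.org/artist/'

def mbid_conflicts(lines):
    """Count MBID-to-MBID pairs and collect those whose two IRIs differ."""
    checked, conflicts = 0, []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        subject, _, obj = parts
        if subject.startswith(MB_PREFIX) and obj.startswith(MB_PREFIX):
            checked += 1
            if subject != obj:
                conflicts.append((subject, obj))
    return checked, conflicts

# placeholder IDs, for illustration only
sample = [
    '<http://musicbrainz.org/artist/aaaa> <owl:sameAs> <http://musicbrainz.org/artist/aaaa>',
    '<http://musicbrainz.org/artist/bbbb> <owl:sameAs> <http://musicbrainz.org/artist/cccc>',
    '<http://musicbrainz.org/artist/dddd> <owl:sameAs> <http://dbpedia.org/resource/Some_Artist>',
]
checked, conflicts = mbid_conflicts(sample)
print(f"Total MBID sameAs MBID: {checked}, with wrong ones: {len(conflicts)}")
```

Pairs whose object is not a MusicBrainz IRI cannot be checked this way and are left to the next validation step.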

In [41]:
count = 0
false_matches = 0
with open(OUTPUT_1,'r') as f:
    file = f.readlines()
    for line in file:
        if line.startswith('<http://musicbrainz.org/artist/'):
            a,b,c = line.split()
            if c.startswith('<http://musicbrainz.org/artist/'):
                count +=1
                if a!=c:
                    false_matches+=1
                    print(a.strip('<>'))
                    print(c.strip('<>'))
                    print()

print(f"Total MBID sameAs MBID: {count}, with wrong ones: {false_matches}")

http://musicbrainz.org/artist/258c295b-bae6-433b-955b-b590fe4a571d
http://musicbrainz.org/artist/755d3c6a-9697-4204-803e-01333b29133e

http://musicbrainz.org/artist/a67b42cc-1dc6-482f-9998-086af20be905
http://musicbrainz.org/artist/b2a298bc-c4ad-438b-8fc7-36002502706b

Total MBID sameAs MBID: 9, with wrong ones: 2


This suggests that the pairs need further validation.

## Validate Matches by SPARQL
Each pair is validated by comparing the MusicBrainz artist JSON with the answer to a DBpedia SPARQL query.
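
The comparison rule used for validation can be sketched in isolation: a field counts as matching when the non-empty MusicBrainz value occurs in the DBpedia value, with underscores read as spaces. The helper names `field_matches` and `count_matching_fields` are illustrative. The sample values are taken from one of the outputs shown further down in this notebook.

```python
def field_matches(mb_value, dbpedia_value):
    """True if the MusicBrainz value is non-empty and occurs in the DBpedia value."""
    if not mb_value:
        return False
    # DBpedia places are resource IRIs, so underscores stand for spaces
    return mb_value in dbpedia_value.replace('_', ' ')

def count_matching_fields(mb_fields, dbpedia_fields):
    """Number of positions at which the two field lists match."""
    return sum(field_matches(a, b) for a, b in zip(mb_fields, dbpedia_fields))

mb = ['1923-01-01', '1999-10-09', 'Detroit', 'Manhattan']
db = ['1923-01-01', '1999-10-09',
      'http://dbpedia.org/resource/Detroit',
      'http://dbpedia.org/resource/New_York_City']

print(count_matching_fields(mb, db))
```

Substring containment tolerates IRIs such as `Kansas_City,_Kansas` for the value 'Kansas City', but it still treats 'Manhattan' and 'New York City' as different places.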

In [42]:
def query_musicbrainz_json(mbid, fields = ('birthdate','deathdate','birthplace','deathplace')):

    """
    Query musicbrainz json with requests, return selected fields.
    Return one string per field, '' for every field that is missing;
    all fields are '' when no valid json is returned.
    """
    mb_json_path = '?inc=aliases&fmt=json'
    mb_json_root = 'https://musicbrainz.org/ws/2/artist/'
    mbid = mbid.replace('http://musicbrainz.org/artist/','')

    url = mb_json_root + mbid + mb_json_path

    # each field maps to a path of nested keys in the artist json
    queries = {
        'birthdate': ('life-span','begin'),
        'deathdate': ('life-span','end'),
        'birthplace': ('begin-area','name'),
        'deathplace': ('end-area','name')
    }

    empty = ['']*len(fields)
    response = requests.get(url)

    if (response.status_code == 204 or
        not response.headers.get("content-type", "").strip().startswith("application/json")):
        return empty

    try:
        json_entry = response.json()
    except ValueError:
        print('wrong json')
        return empty

    if json_entry is None:
        return empty

    results = []
    for field in fields:
        try:
            value = json_entry
            for key in queries[field]: #deep index
                value = value[key]
            results.append(value or '')
        except (TypeError,KeyError):
            # no such key in musicbrainz
            results.append('')

    return results


#test
query_musicbrainz_json(mbid='2e38e1de-2890-4620-8467-5b0bb4641cb9',)
# returns '' for every missing field,
# and for all fields in case of no json

['1923-01-01', '1999-10-09', 'Detroit', 'Manhattan']

In [60]:
def query_dbpedia_sparql(artist, fields=('birthdate', 'deathdate','birthplace','deathplace')):
    
    queries = {
        'birthdate':    'dbo:birthDate',
        'deathdate':    'dbo:deathDate',
        'birthplace':   'dbo:birthPlace',
        'deathplace':   'dbo:deathPlace'

    }
    
    # TODO: compose a single query for all fields to spare some minutes

    artist = artist.strip('<').strip('>')
    results = []

    for field in fields:
        sparql = SPARQLWrapper("http://dbpedia.org/sparql")
        sparql.setQuery(f"""
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT ?x
            WHERE {{ <{artist}> {queries[field]} ?x .
                    }}
        """)

        sparql.setReturnFormat(JSON)
        query_results = sparql.query().convert()

        if not query_results["results"]["bindings"]:
            results.append('')
            print('No query answer')###
        else:
            for result in query_results["results"]["bindings"]:
                answer = result['x']['value']
                #continue
                if answer is None:
                    print(f'No {field} field in answer')###
                results.append(answer or '')
                break #TODO handle multiple responses to the same query
    if results == ['']*len(fields): print(artist)
    return results

query_dbpedia_sparql(artist='http://dbpedia.org/resource/Charlie_Parker')

['1920-08-29',
 '1955-03-12',
 'http://dbpedia.org/resource/Kansas_City,_Kansas',
 'http://dbpedia.org/resource/New_York_City']
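
The function above sends one request per field. A possible way to ask for all fields at once is a single SELECT with one OPTIONAL clause per property. The sketch below only builds the query text. `build_query` and `PROPERTIES` are hypothetical names, the `dbo:` prefix is declared explicitly as an assumption about the endpoint, and the query has not been run against DBpedia here.

```python
PROPERTIES = {
    'birthdate':  'dbo:birthDate',
    'deathdate':  'dbo:deathDate',
    'birthplace': 'dbo:birthPlace',
    'deathplace': 'dbo:deathPlace',
}

def build_query(artist_iri, fields=tuple(PROPERTIES)):
    """Compose one SELECT query that asks for all requested fields at once."""
    artist_iri = artist_iri.strip('<>')
    variables = ' '.join(f'?{field}' for field in fields)
    # OPTIONAL keeps the row even when a property is missing for the artist
    optionals = '\n'.join(
        f'    OPTIONAL {{ <{artist_iri}> {PROPERTIES[field]} ?{field} . }}'
        for field in fields
    )
    return (
        'PREFIX dbo: <http://dbpedia.org/ontology/>\n'
        f'SELECT {variables}\n'
        'WHERE {\n'
        f'{optionals}\n'
        '}\n'
        'LIMIT 1'  # mirrors the loop above, which keeps the first answer only
    )

query = build_query('<http://dbpedia.org/resource/Charlie_Parker>')
print(query)
```

The resulting string could be passed to `sparql.setQuery(...)`, and each binding would then be read by its variable name instead of `x`.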

In [55]:
def check_match(match, fields = ('birthdate','deathdate','birthplace','deathplace'), verbose = False):
    """
    Check musicbrainz vs dbpedia fields for a (musicbrainz id, pratt iri) pair.
    Return the count of matching fields.
    A field matches when the non-empty musicbrainz value is contained in the
    dbpedia value, after replacing each underscore with whitespace.
    Note: this redefines the name-matching check_match defined above.
    """

    # fall back to empty fields if the musicbrainz query returns nothing
    r1 = query_musicbrainz_json(match[0], fields) or ['']*len(fields)
    r2 = query_dbpedia_sparql(match[1], fields)

    # r1 and r2 are lists with one string per field

    match_count = 0
    matched = [] # one boolean per field
    for pair in zip(r1,r2):
        
        if pair[0] != '' and pair[0] in pair[1].replace('_',' '):
            match_count +=1
            matched.append(True)
        else:
            matched.append(False)

    if verbose or match_count < len(fields):
        print(r1)
        print(matched)
        print(r2)
        print()

    return match_count

for match in perfect_matches[4:5]:
    print(match)
    check_match(match, verbose=True)

('c7356af9-9ea6-4a78-a55b-c73775716312', '<http://dbpedia.org/resource/Charlie_Parker>')
['1920-08-29', '1955-03-12', 'Kansas City', 'New York']
[True, True, True, True]
['1920-08-29', '1955-03-12', 'http://dbpedia.org/resource/Kansas_City,_Kansas', 'http://dbpedia.org/resource/New_York_City']



In [61]:
query_matches = []
with open(OUTPUT_1,'r') as f:
    file = f.readlines()
    for line in file:
        match = line.replace('<','').replace('>','').strip().split(' owl:sameAs ')
        if check_match(match, fields = ('birthdate','deathdate','birthplace','deathplace')):
            query_matches.append(match)
            #print(match)

print(len(query_matches))

['1923-01-01', '1999-10-09', 'Detroit', 'Manhattan']
[True, True, True, False]
['1923-01-01', '1999-10-09', 'http://dbpedia.org/resource/Detroit', 'http://dbpedia.org/resource/New_York_City']

['1924-01-10', '2007-08-15', 'Pasquotank County', 'Manhattan']
[True, False, True, False]
['1924-01-10', '2007-08-16', 'http://dbpedia.org/resource/Pasquotank_County,_North_Carolina', 'http://dbpedia.org/resource/New_York_City']

['1885-10-20', '1941-07-10', 'New Orleans', 'Los Angeles']
[False, True, True, True]
['1890-09-20', '1941-07-10', 'http://dbpedia.org/resource/New_Orleans', 'http://dbpedia.org/resource/Los_Angeles']

['1886-12-25', '1973-01-23', 'Louisiana', 'Honolulu']
[True, True, True, False]
['1886-12-25', '1973-01-23', 'http://dbpedia.org/resource/LaPlace,_Louisiana', 'http://dbpedia.org/resource/Hawaii']

['1891-01-25', '1966-10-29', 'St. James Parish', 'Los Angeles']
[True, True, True, False]
['1891-01-25', '1966-10-29', 'http://dbpedia.org/resource/St._James_Parish,_Louisiana', 

In [62]:
query_matches[0]

['http://musicbrainz.org/artist/2e38e1de-2890-4620-8467-5b0bb4641cb9',
 'http://dbpedia.org/resource/Milt_Jackson']

In [63]:
with open(OUTPUT_2,'w') as f:
    for i, pair in enumerate(query_matches):
        if i!=0: f.write('\n')
        a, b = pair
        # both IRIs are already complete here: only the angle brackets
        # were stripped when the pairs were parsed from OUTPUT_1
        f.write(f"<{a}> <owl:sameAs> <{b}>")