Programming for Cultural Heritage Fall 2016
The scripts in this repository were created for the Programming for Cultural Heritage class at Pratt Institute. More details on the project:

--------------------------------------------------------------------------------

DBpedia to Wikidata: Exploring the Linked Jazz Name Directory
Mollie Echeverria
LIS-664 - Programming for Cultural Heritage
Pratt Institute - Fall 2016
Prof Matt Miller

DBpedia
Founded in 2007 by researchers from two German universities. Aims to extract content from Wikipedia and publish it as structured data. Users can query this data semantically, revealing relationships and properties connected to Wikipedia resources.

Wikidata
Founded in 2012 by the Wikimedia Foundation. Aggregates content from all the Wikimedia sites as key-value pairs. May be supplanting DBpedia as a source for Wikipedia-based linked data.

Linked Jazz
Research project at Pratt that explores linked open data in the context of cultural institutions. It focuses specifically on relationships between jazz musicians, based on data in oral history transcripts as well as other sources such as archival documents.

The Jazz Names Directory
At the start of the Linked Jazz Project in 2011, DBpedia was queried for names of jazz musicians. This yielded around 9,000 names. The list was later narrowed down to individuals mentioned in jazz oral history transcripts. 8,725 of the original names are still hosted on the Linked Jazz website as N-Triples (a textual format used to store linked data).

The New Orleans Jazz & Heritage Foundation
Linked Jazz recently received a grant from the New Orleans Jazz and Heritage Foundation to create a linked data dataset of Louisiana-based jazz musicians. In support of this project, Linked Jazz wanted to find out how many of the musicians already in its 8,725-name directory were New Orleans-based. The team also wanted to explore whether Wikidata could offer richer data than DBpedia (such as familial relationships).

Extracting Data From the Name Directory
I started by downloading the Jazz Names Directory as an N-Triples file (https://github.com/MollieEcheverria/LIS-664/blob/master/jazz_directory_aug_2012.nt). To make this data usable, I converted the N-Triples file into a JSON dictionary (https://github.com/MollieEcheverria/LIS-664/blob/master/jazz_directory_aug_20120.json) using a Python script (https://github.com/mreesele/CH-LJ/blob/master/isolate_as_dict_CH%2BLJ.py).

Getting JSON from DBpedia
Each name in the Jazz Names Directory is connected to a DBpedia resource page (sample - https://github.com/MollieEcheverria/LIS-664/blob/master/Paul_Wertico.json). Resources in DBpedia contain links to related resources, including the corresponding page for the same resource on Wikipedia. To query DBpedia, I had to access these pages in the form of JSON data. To do this, I used a script to replace the string “/resource/” in each URI in the directory with “/data/”. This allowed me to access the JSON equivalent of each page (https://github.com/MollieEcheverria/LIS-664/blob/master/resourcetodata.py).

Querying DBpedia for Wikidata URIs
DBpedia resources (sample - https://github.com/MollieEcheverria/LIS-664/blob/master/Q5693560.json) store the URI for the resource’s corresponding Wikidata page in a property called “sameAs”. To get Wikidata URIs for the names in the Jazz Directory, I used a script to loop through all of the properties in each person’s DBpedia page, writing their DBpedia and Wikidata URIs to a new document (https://github.com/MollieEcheverria/LIS-664/blob/master/resourcetodata.py).

DBpedia Querying Issues
Not all DBpedia pages had a corresponding Wikidata page, which raised a KeyError when I ran the script. I ended up adding a try/except statement, allowing the script to pass over DBpedia resources missing Wikidata URIs.
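
The URI rewrite and the “sameAs” lookup described above can be sketched roughly as follows. This is a minimal illustration, not the code in resourcetodata.py. It assumes that the DBpedia JSON is laid out as {resource URI: {property URI: [{"type": ..., "value": ...}]}}, that “sameAs” is stored under the full owl#sameAs URI, and that the /data/ URL takes a ".json" suffix. The function names and the sample record are hypothetical.

```python
# Assumed property URI and Wikidata URI prefix; check them against a real DBpedia record.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
WIKIDATA_PREFIX = "http://www.wikidata.org/entity/"


def resource_to_data_url(resource_uri, suffix=".json"):
    """Turn a DBpedia /resource/ URI into its /data/ counterpart.

    The ".json" suffix is an assumption; pass suffix="" to leave it off.
    """
    return resource_uri.replace("/resource/", "/data/", 1) + suffix


def find_wikidata_uri(dbpedia_doc, resource_uri):
    """Return the first Wikidata URI listed under sameAs, or None if absent."""
    try:
        same_as_values = dbpedia_doc[resource_uri][SAME_AS]
    except KeyError:
        # No entry for this resource, or no sameAs property: skip it
        # instead of letting the KeyError stop the whole run.
        return None
    for entry in same_as_values:
        value = entry.get("value", "")
        if value.startswith(WIKIDATA_PREFIX):
            return value
    return None


# Hypothetical, hand-written sample record in the assumed layout.
sample_uri = "http://dbpedia.org/resource/Example_Musician"
sample_doc = {
    sample_uri: {
        SAME_AS: [
            {"type": "uri", "value": "http://example.org/other/Example_Musician"},
            {"type": "uri", "value": WIKIDATA_PREFIX + "Q0000000"},
        ]
    }
}

print(resource_to_data_url(sample_uri))
print(find_wikidata_uri(sample_doc, sample_uri))
print(find_wikidata_uri({sample_uri: {}}, sample_uri))  # no sameAs, so None
```

Returning None for a missing property keeps the "skip this resource" decision in one place, so the calling loop only has to check the result before writing a DBpedia/Wikidata pair.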
Eventually, I ended up with another JSON directory containing corresponding DBpedia and Wikidata URIs for those names in the Jazz Names Directory that had resources on both sites (https://github.com/MollieEcheverria/LIS-664/blob/master/db_to_wiki.json).

Getting Places of Birth and Death from Wikidata
Now that I had Wikidata URIs for names in the Jazz Names Directory, my next step was figuring out which of these musicians were from New Orleans. To find New Orleans-based musicians, I had to query two of Wikidata’s resource properties: Place of Birth and Place of Death. Using Wikidata’s API, I attempted to extract these two properties for each resource in the name directory.

Hitting a Wall: Wikidata’s Server
When I attempted to query Wikidata’s API (https://github.com/MollieEcheverria/LIS-664/blob/master/get_bd_data.py), I soon encountered a major obstacle in the form of ConnectionResetError 54 (https://github.com/MollieEcheverria/LIS-664/blob/master/connection_aborted_error.txt). A few hundred names into each run, Wikidata’s server would drop the connection, possibly because of the volume of data I was querying. I changed the call in my script from “requests.get” to “requests.post” and set my requests not to time out, but I continued to be kicked off Wikidata’s server after a certain point.

Querying in Chunks
After dozens of unsuccessful querying attempts, I decided to split my 8,000+ name directory into smaller JSON files. Using a script (https://github.com/MollieEcheverria/LIS-664/blob/master/json-split.py), I split the large directory into 54 smaller JSON files (https://github.com/MollieEcheverria/LIS-664/tree/master/db_to_wiki_split), each with around 150 names. 150 names seemed to be the cutoff at which I could reliably query Wikidata without being kicked off its server. I then manually updated and ran my birthplace/deathplace script 54 times, once for each JSON file.
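
The splitting step can be sketched as below. This is not the code in json-split.py. It assumes the directory is a flat dictionary mapping DBpedia URIs to Wikidata URIs, and the function name and sample data are hypothetical.

```python
from itertools import islice


def split_directory(directory, chunk_size=150):
    """Yield successive dicts holding at most chunk_size entries each."""
    items = iter(directory.items())
    while True:
        # islice pulls the next chunk_size pairs from the shared iterator.
        chunk = dict(islice(items, chunk_size))
        if not chunk:
            return
        yield chunk


# Hypothetical stand-in for the real directory: 400 placeholder entries.
directory = {
    f"http://dbpedia.org/resource/Person_{i}": f"wikidata-uri-placeholder-{i}"
    for i in range(400)
}

chunks = list(split_directory(directory, chunk_size=150))
print([len(chunk) for chunk in chunks])  # [150, 150, 100]
```

Each chunk could then be written to its own file with json.dump. Looping over the chunks in a single script, perhaps with a pause between them, might also remove the need to edit and rerun the query script by hand for every file. That approach was not tested in this project.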

Separating New Orleans Names
After querying for musicians who were born or died in New Orleans, I output that list to a new directory, 'neworleansdbwiki.json', using another script ('extractneworleans.py').

Querying Relationship Properties
After extracting the URIs of New Orleans-based individuals, I queried Wikidata ('get_all_wiki_props.py') for all available properties for each entity ('get_all_wiki_props_new_orleans.json'). I then searched the results ('getrelationshipprops.py') for Wikidata's 11 familial relationship properties, along with the 'student' and 'student of' properties, and wrote the results to JSON ('new_orleans_jazz_family_relationships.json').

Results
8,201 of the 8,725 names in the DBpedia-based Jazz Name Directory have corresponding Wikidata pages.
114 people were born and/or died in New Orleans.
3 people from New Orleans had spouses with Wikidata URIs, and 1 had a sibling.

Possible Next Steps
Explore other Wikidata properties in 'get_all_wiki_props_new_orleans.json', such as gender.
Refine the list by profession (not all names are actually musicians).
Examine names from the Tulane University list.
Use links in Wikidata entity pages to query music database sites like MusicBrainz and Discogs (Wikidata has links to corresponding pages on these sites).
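
As a rough sketch of the filter described under "Separating New Orleans Names", the function below checks whether an entity's place of birth or place of death matches a given place. It is not the code in extractneworleans.py. It assumes the entity JSON follows Wikidata's usual layout (claims, then property ID, then mainsnak, datavalue, value, id), with P19 and P20 as the property IDs for place of birth and place of death. The sample entity and its item IDs are made-up placeholders. The real item ID for New Orleans would need to be looked up on Wikidata and passed in as place_id.

```python
PLACE_OF_BIRTH = "P19"  # Wikidata property ID for place of birth
PLACE_OF_DEATH = "P20"  # Wikidata property ID for place of death


def claim_item_ids(entity, property_id):
    """Collect the item IDs an entity lists for one property."""
    ids = []
    for claim in entity.get("claims", {}).get(property_id, []):
        datavalue = claim.get("mainsnak", {}).get("datavalue")
        if datavalue is None:
            # "Unknown value" and "no value" statements carry no datavalue.
            continue
        value = datavalue.get("value")
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids


def born_or_died_in(entity, place_id):
    """True if place_id appears as a place of birth or a place of death."""
    places = claim_item_ids(entity, PLACE_OF_BIRTH)
    places += claim_item_ids(entity, PLACE_OF_DEATH)
    return place_id in places


# Hypothetical entity using placeholder IDs, not real Wikidata items.
sample_entity = {
    "claims": {
        "P19": [{"mainsnak": {"datavalue": {"value": {"id": "Q_PLACE_A"}}}}],
        "P20": [{"mainsnak": {"snaktype": "somevalue"}}],
    }
}

print(born_or_died_in(sample_entity, "Q_PLACE_A"))  # True
print(born_or_died_in(sample_entity, "Q_PLACE_B"))  # False
```

The same claim_item_ids helper could be pointed at other property IDs, for example the familial relationship properties, since those also take Wikidata items as values.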