Programming for Cultural Heritage Fall 2016
birth_and_death_split
db_to_wiki_split
place uris
.jazz_directory_aug_2012.nt.icloud
BirthDeath.twb
Echeverria_Mollie_LIS-664-01-Final_Project_Proposal.docx
Ella_Fitzgerald_wikidata.json
LIS-664-visual.twb
Paul_Wertico.json
Q5693560.json
README.txt
all_new_orleans_193_names.json
all_new_orleans_props.csv
allbirthdeath.csv
allbirthdeath.json
birth_and_death_place_uris.json
connection_aborted_error.txt
data.csv
db_to_wiki.json
dbpediaquerynotes.txt
discogs_api_get_key.py
extractlocationid.py
extractneworleans.py
firstnamelastname.py
fneworleansbdfromcsv.json
get_all_wiki_props.csv
get_all_wiki_props.json
get_all_wiki_props0.py
get_bd_data.py
get_lat_and_long.py
getrelationshipprops.py
jazz_directory_aug_20120.json
jazzsample.json
json-split.py
names.csv
new_orleans_jazz_family_relationships.json
neworleans_csvtojson.py
neworleansbd.csv
neworleansdbwiki.json
neworleansdbwiki_unique.json
nola_wiki_entities.json
nola_wiki_entities.py
placeofbirth.png
placeofdeath.png
resourcetodata.py
write_to_csv.py

README.txt

The scripts in this repository were created for the Programming for Cultural Heritage class at Pratt Institute.

More details on the project:

--------------------------------------------------------------------------------


DBpedia to Wikidata:
Exploring the Linked Jazz Name Directory
Mollie Echeverria
LIS-664 - Programming for Cultural Heritage
Pratt Institute - Fall 2016
Prof Matt Miller

DBpedia
Founded by researchers from two German universities in 2007.
Aims to extract content from Wikipedia and publish it as structured data.
Users can semantically query this data, revealing relationships and properties connected to Wikipedia resources.

Wikidata
Founded in 2012 by the Wikimedia Foundation.
Aggregates content from all the Wikimedia sites as key-value pairs.
May be supplanting DBpedia as a source for Wikipedia-based linked data.

Linked Jazz
Research project at Pratt focused on exploring linked open data in the context of cultural institutions.
Focused specifically on exploring relationships between jazz musicians based on data in oral history transcripts, as well as other sources like archival documents.

The Jazz Names Directory
At the start of the Linked Jazz Project in 2011, DBpedia was queried for names of jazz musicians. This yielded around 9,000 names.
This list was later narrowed down to individuals mentioned in jazz oral history transcripts.
8,725 of the original names are still hosted on the Linked Jazz website as N-Triples (a textual format used to store linked data).

The New Orleans Jazz & Heritage Foundation
Linked Jazz recently received a grant from the New Orleans Jazz and Heritage Foundation to create a linked data dataset of Louisiana-based jazz musicians.
In support of this project, Linked Jazz wanted to investigate how many of the musicians already in its 8,725-name directory were New Orleans-based.
The team also wanted to explore whether Wikidata could offer richer data than DBpedia (such as familial relationships).

Extracting Data From the Name Directory
I started by downloading the Jazz Names Directory as an N-Triples file (https://github.com/MollieEcheverria/LIS-664/blob/master/jazz_directory_aug_2012.nt).
To make this data easier to work with, I converted the N-Triples file into a JSON dictionary (https://github.com/MollieEcheverria/LIS-664/blob/master/jazz_directory_aug_20120.json) using a Python script (https://github.com/mreesele/CH-LJ/blob/master/isolate_as_dict_CH%2BLJ.py).
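As an illustration of this conversion step, here is a minimal sketch that groups N-Triples statements into a dictionary keyed by subject URI. It is not the script linked above: the function name, the regular expression, and the output shape are assumptions made for the example, and the pattern only handles simple one-line statements.

```python
import json
import re
from collections import defaultdict

# Matches one simple N-Triples statement: <subject> <predicate> object .
# The object may be a URI in angle brackets or a quoted literal.
TRIPLE_RE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')


def ntriples_to_dict(lines):
    """Group predicate/object pairs under each subject URI (hypothetical helper)."""
    directory = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        match = TRIPLE_RE.match(line)
        if match is None:
            continue  # skip anything this simple pattern cannot parse
        subject, predicate, obj = match.groups()
        if obj.startswith('<') and obj.endswith('>'):
            obj = obj[1:-1]  # strip angle brackets from URI objects
        directory[subject].setdefault(predicate, []).append(obj)
    return dict(directory)


sample = [
    '<http://dbpedia.org/resource/Paul_Wertico> '
    '<http://xmlns.com/foaf/0.1/name> "Paul Wertico" .',
]
print(json.dumps(ntriples_to_dict(sample), indent=2))
```

Literal objects keep their surrounding quotes in this sketch; a dedicated RDF library would be a safer choice for anything beyond a quick look at the data.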

Getting JSON from DBpedia
Each name in the Jazz Names Directory is connected to a DBpedia resource page (sample - https://github.com/MollieEcheverria/LIS-664/blob/master/Paul_Wertico.json).
Resources in DBpedia contain links to related resources, including the corresponding page for the same resource on Wikipedia.
To query DBpedia, I had to access these pages in the form of JSON data.
To do this, I used a script to replace “/resource/” with “/data/” in each URI in the directory. This allowed me to access the JSON equivalent of each page (https://github.com/MollieEcheverria/LIS-664/blob/master/resourcetodata.py).
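The URI rewrite can be sketched in a few lines. The function name is hypothetical, and appending ".json" to request the JSON serialization is an assumption about how the data URL is formed, not something taken from the linked script.

```python
def resource_to_data_url(resource_uri):
    """Turn a DBpedia resource URI into its JSON data URL (hypothetical helper)."""
    # Replace only the first "/resource/" so a name that happens to
    # contain the same text later in the URI is left untouched.
    data_uri = resource_uri.replace('/resource/', '/data/', 1)
    # Assumption: the JSON serialization is requested with a ".json" suffix.
    return data_uri + '.json'


print(resource_to_data_url('http://dbpedia.org/resource/Paul_Wertico'))
```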

Querying DBpedia for Wikidata URIs
DBpedia resources (sample - https://github.com/MollieEcheverria/LIS-664/blob/master/Q5693560.json) store the URI for the resource’s corresponding Wikidata page in a property called “sameAs”.
To get Wikidata URIs for the names in the Jazz Directory, I used a script to loop through all of the properties in each person’s DBpedia page, writing their DBpedia and Wikidata URIs to a new document (https://github.com/MollieEcheverria/LIS-664/blob/master/resourcetodata.py).
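A minimal sketch of the “sameAs” lookup follows, assuming the DBpedia JSON is laid out as subject, then predicate, then a list of value objects. The function name and the sample document are made up for illustration; in particular, the pairing of the example resource with Q5693560 is not real data.

```python
OWL_SAME_AS = 'http://www.w3.org/2002/07/owl#sameAs'


def find_wikidata_uri(dbpedia_json, resource_uri):
    """Return the first Wikidata URI listed under sameAs (hypothetical helper).

    Raises KeyError if the resource has no sameAs property or none of
    its sameAs links point to Wikidata.
    """
    same_as = dbpedia_json[resource_uri][OWL_SAME_AS]
    for entry in same_as:
        if 'wikidata.org' in entry['value']:
            return entry['value']
    raise KeyError('no Wikidata link for ' + resource_uri)


# Illustrative document only; the URIs below are not a real pairing.
resource = 'http://dbpedia.org/resource/Example_Musician'
document = {
    resource: {
        OWL_SAME_AS: [
            {'type': 'uri', 'value': 'http://example.org/other/Example_Musician'},
            {'type': 'uri', 'value': 'http://www.wikidata.org/entity/Q5693560'},
        ]
    }
}
print(find_wikidata_uri(document, resource))
```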

DBpedia Querying Issues
Not all DBpedia pages had a corresponding Wikidata page, so the script raised a KeyError whenever it reached a resource with no Wikidata URI.
I added a try/except statement, allowing the script to pass over DBpedia resources missing Wikidata URIs.
The result was another JSON directory containing corresponding DBpedia and Wikidata URIs for those names in the Jazz Names Directory that had resources on both sites (https://github.com/MollieEcheverria/LIS-664/blob/master/db_to_wiki.json).
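The skip-on-KeyError pattern described here can be sketched as follows. The function and the `lookup` callable are hypothetical stand-ins for whatever fetches a Wikidata URI; the sketch also records the skipped URIs, which is an addition for illustration rather than something the original script is known to do.

```python
def pair_uris(dbpedia_uris, lookup):
    """Map each DBpedia URI to its Wikidata URI, skipping misses.

    `lookup` is any callable that returns a Wikidata URI or raises
    KeyError when the resource has none (a hypothetical interface).
    """
    pairs = {}
    skipped = []
    for uri in dbpedia_uris:
        try:
            pairs[uri] = lookup(uri)
        except KeyError:
            # No Wikidata page for this resource: note it and move on.
            skipped.append(uri)
    return pairs, skipped


# Made-up data: a plain dict lookup raises KeyError for unknown keys.
known = {'http://dbpedia.org/resource/A': 'http://www.wikidata.org/entity/Q1'}
uris = ['http://dbpedia.org/resource/A', 'http://dbpedia.org/resource/B']
pairs, skipped = pair_uris(uris, known.__getitem__)
print(len(pairs), 'paired;', len(skipped), 'skipped')
```

Catching only KeyError keeps other failures, such as network errors, visible instead of silently hiding them.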

Getting Places of Birth and Death from Wikidata
Now that I had Wikidata URIs for names in the Jazz Names Directory, my next step was figuring out which of these musicians were from New Orleans.
To find New Orleans-based musicians, I had to query two of Wikidata’s resource properties: Place of Birth and Place of Death.
Using Wikidata’s API, I attempted to extract these two properties for each resource in the name directory.
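For reference, here is a hedged sketch of pulling these two properties out of one entity's JSON. It assumes the usual Wikidata entity layout (claims, then mainsnak, then datavalue) and that P19 and P20 are the property IDs for place of birth and place of death. The function name and sample entity are invented for the example, and no network request is made.

```python
PLACE_OF_BIRTH = 'P19'  # assumed Wikidata property ID for place of birth
PLACE_OF_DEATH = 'P20'  # assumed Wikidata property ID for place of death


def place_ids(entity, prop):
    """Return the item IDs claimed for `prop` on one entity (hypothetical helper)."""
    ids = []
    for claim in entity.get('claims', {}).get(prop, []):
        datavalue = claim.get('mainsnak', {}).get('datavalue')
        if datavalue is None:
            continue  # "unknown value" and "no value" claims carry no datavalue
        value = datavalue['value']
        # Older JSON dumps may only carry "numeric-id"; fall back to it.
        ids.append(value.get('id') or 'Q{}'.format(value['numeric-id']))
    return ids


# Made-up entity in the assumed shape.
entity = {
    'id': 'Q1',
    'claims': {
        'P19': [{'mainsnak': {'snaktype': 'value', 'datavalue': {
            'type': 'wikibase-entityid',
            'value': {'entity-type': 'item', 'numeric-id': 34404, 'id': 'Q34404'},
        }}}],
        'P20': [{'mainsnak': {'snaktype': 'somevalue'}}],
    },
}
print(place_ids(entity, PLACE_OF_BIRTH), place_ids(entity, PLACE_OF_DEATH))
```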


Hitting a Wall: Wikidata’s Server
When I attempted to query Wikidata’s API (https://github.com/MollieEcheverria/LIS-664/blob/master/get_bd_data.py), I soon encountered a major obstacle in the form of ConnectionResetError 54 (https://github.com/MollieEcheverria/LIS-664/blob/master/connection_aborted_error.txt).
A few hundred names into each query, Wikidata’s server would close the connection, possibly because of the volume of data I was requesting.
I changed the call in my script from “requests.get” to “requests.post” and set my requests not to time out, but the server continued to drop the connection after a certain point.

Querying in Chunks
After dozens of unsuccessful querying attempts, I decided to split my 8,000+ name directory into smaller JSON files.
Using a script (https://github.com/MollieEcheverria/LIS-664/blob/master/json-split.py), I split the large directory into 54 smaller JSON files (https://github.com/MollieEcheverria/LIS-664/tree/master/db_to_wiki_split), each with around 150 names. About 150 names per file seemed to be the limit at which I could reliably query Wikidata without being disconnected from its server.
I then manually updated and ran my birthplace/deathplace script 54 times, once for each JSON file.
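The chunking idea can be sketched as below. This is not the linked json-split.py; the function name is hypothetical, and the default of 150 simply mirrors the chunk size mentioned above.

```python
def split_directory(directory, chunk_size=150):
    """Split one large dict into a list of smaller dicts (hypothetical helper)."""
    items = list(directory.items())
    return [
        dict(items[start:start + chunk_size])
        for start in range(0, len(items), chunk_size)
    ]


# Small made-up example: 7 entries in chunks of 3.
demo = {'name{}'.format(i): 'uri{}'.format(i) for i in range(7)}
chunks = split_directory(demo, chunk_size=3)
print([len(chunk) for chunk in chunks])
```

Each chunk could then be written to its own JSON file and processed in a loop, which would avoid editing and rerunning the query script by hand for every file.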

Separating New Orleans Names
After querying for musicians who were born or died in New Orleans, I output that list to a new directory 'neworleansdbwiki.json' using another script ('extractneworleans.py').
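A minimal sketch of this filtering step is shown below. The record shape (lists of place IDs under 'birth' and 'death') and the function name are assumptions for illustration, and Q34404 is assumed to be the Wikidata item for New Orleans; verify that ID before relying on it.

```python
NEW_ORLEANS = 'Q34404'  # assumed Wikidata item ID for New Orleans; verify before use


def born_or_died_in(records, place_id=NEW_ORLEANS):
    """Keep only records whose birth or death places include `place_id`.

    `records` maps a person's URI to a dict with optional 'birth' and
    'death' lists of Wikidata item IDs (a hypothetical layout).
    """
    return {
        uri: record
        for uri, record in records.items()
        if place_id in record.get('birth', []) or place_id in record.get('death', [])
    }


# Made-up records for illustration.
records = {
    'http://www.wikidata.org/entity/Q1': {'birth': ['Q34404'], 'death': ['Q60']},
    'http://www.wikidata.org/entity/Q2': {'birth': ['Q60'], 'death': ['Q34404']},
    'http://www.wikidata.org/entity/Q3': {'birth': ['Q60']},
}
print(sorted(born_or_died_in(records)))
```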

Querying Relationship Properties
After extracting the URIs of New Orleans-based individuals, I queried Wikidata ('get_all_wiki_props.py') for all available properties for each entity ('get_all_wiki_props_new_orleans.json').
I then searched the results ('getrelationshipprops.py') for Wikidata's 11 familial relationship properties, along with the 'student' and 'student of' properties, and wrote the results to JSON ('new_orleans_jazz_family_relationships.json').
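The property search might be sketched as follows. The property IDs listed are an illustrative subset that I believe correspond to these labels on Wikidata, not the full set of 11 used in the project. Check each ID against Wikidata before use. The function name and sample entity are invented for the example.

```python
# Illustrative subset of relationship properties; IDs are assumptions to verify.
RELATIONSHIP_PROPS = {
    'P22': 'father',
    'P25': 'mother',
    'P26': 'spouse',
    'P40': 'child',
    'P3373': 'sibling',
    'P1038': 'relative',
    'P802': 'student',
    'P1066': 'student of',
}


def relationship_claims(entity):
    """Return {label: [item IDs]} for relationship properties on one entity."""
    found = {}
    for prop, label in RELATIONSHIP_PROPS.items():
        targets = []
        for claim in entity.get('claims', {}).get(prop, []):
            datavalue = claim.get('mainsnak', {}).get('datavalue')
            if datavalue is not None:  # skip claims with no concrete value
                targets.append(datavalue['value']['id'])
        if targets:
            found[label] = targets
    return found


# Made-up entity with one spouse claim and one unrelated claim.
entity = {
    'claims': {
        'P26': [{'mainsnak': {'datavalue': {'value': {'id': 'Q100'}}}}],
        'P19': [{'mainsnak': {'datavalue': {'value': {'id': 'Q34404'}}}}],
    }
}
print(relationship_claims(entity))
```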

Results
8,201 of the 8,725 names in the DBpedia-based Jazz Name Directory have corresponding Wikidata pages.
114 people were born and/or died in New Orleans.
3 people from New Orleans had spouses with Wikidata URIs, and 1 had a sibling.

Next Steps
Explore other Wikidata properties in 'get_all_wiki_props_new_orleans.json', such as gender.
Refine list by profession (not all names are actually musicians).
Examine names from Tulane University list.
Use links in Wikidata entity pages to query music database sites like MusicBrainz and Discogs (Wikidata has links to corresponding pages on these sites).