# Script to determine commonly used Wikidata reference properties

Throughout the script I refer to "Wikidata", but the script could be used with any Wikibase instance.

## Configuration section

Import modules, set values, and define functions.

In [2]:
# determine_ref_properties.ipynb This is part of the VandyCite project https://www.wikidata.org/wiki/Wikidata:WikiProject_VandyCite
# (c) 2020 Vanderbilt University. This program is released under a GNU General Public License v3.0 http://www.gnu.org/licenses/gpl-3.0
# Author: Steve Baskauf 2020-09-06

from pathlib import Path
import requests
from time import sleep
import json
import csv

sparql_sleep = 0.1 # number of seconds to wait between queries to SPARQL endpoint
home = str(Path.home()) # gets path to home directory; supposed to work for both Win and Mac
data_path = home + '/divinity-law/'
item_source_csv = 'identified-journals.csv'

endpoint = 'https://query.wikidata.org/sparql'
accept_media_type = 'application/json'

def generate_header_dictionary(accept_media_type):
    user_agent_header = 'VanderDiv/0.1 (https://github.com/HeardLibrary/linked-data/tree/master/publications/divinity-law; mailto:steve.baskauf@vanderbilt.edu)'
    requestHeaderDictionary = {
        'Accept' : accept_media_type,
        'Content-Type': 'application/sparql-query',
        'User-Agent': user_agent_header
    }
    return requestHeaderDictionary

requestheader = generate_header_dictionary(accept_media_type)

# read from a CSV file into a list of dictionaries
def read_dict(filename):
    with open(filename, 'r', newline='', encoding='utf-8') as file_object:
        dict_object = csv.DictReader(file_object)
        array = []
        for row in dict_object:
            array.append(row)
    return array

# extracts the qNumber from a Wikidata IRI
def extract_qnumber(iri):
    # pattern is http://www.wikidata.org/entity/Q6386232
    pieces = iri.split('/')
    return pieces[4]

# extracts the UUID and qId from a statement IRI
def extract_statement_uuid(iri):
    # pattern is http://www.wikidata.org/entity/statement/Q7552806-8B88E0CA-BCC8-49D5-9AC2-F1755464F1A2
    pieces = iri.split('/')
    statement_id = pieces[5]
    pieces = statement_id.split('-')
    return pieces[1] + '-' + pieces[2] + '-' + pieces[3] + '-' + pieces[4] + '-' + pieces[5], pieces[0]
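
The two IRI helpers above index into fixed positions of the split string. As a minimal sketch of the same idea, the variant below splits from the right instead, so it does not depend on the number of path segments. The names `qnumber_from_iri` and `split_statement_iri` are hypothetical and are not part of the original script; the example IRIs are the ones quoted in the comments above.

```python
def qnumber_from_iri(iri):
    # Keep only the last path segment, e.g. 'Q6386232'
    return iri.rsplit('/', 1)[-1]

def split_statement_iri(iri):
    # The last path segment looks like '<qid>-<uuid>'
    statement_id = iri.rsplit('/', 1)[-1]
    # Split on the first hyphen only, since the UUID itself contains hyphens
    qid, uuid = statement_id.split('-', 1)
    return uuid, qid

print(qnumber_from_iri('http://www.wikidata.org/entity/Q6386232'))
print(split_statement_iri(
    'http://www.wikidata.org/entity/statement/Q7552806-8B88E0CA-BCC8-49D5-9AC2-F1755464F1A2'
))
```

Like `extract_statement_uuid`, the second function returns the UUID first and the Q identifier second.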

## Load list of items from file

The CSV has a header row with column headers: `qid` and `label`. The `qid` column contains the Wikidata Q identifiers for each item. The `label` column contains the label, which isn't necessarily the label in Wikidata, but provides a way for humans to recognize the item.
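
To make the expected file layout concrete, here is a small sketch that reads an in-memory CSV with the same two columns and builds the newline-separated `wd:` list used later in the `VALUES` clause. The sample rows and labels are made up for illustration (only the column names `qid` and `label` come from the description above), and `io.StringIO` stands in for the real file.

```python
import csv
import io

# Hypothetical sample standing in for identified-journals.csv
sample_csv = 'qid,label\nQ6386232,Example journal A\nQ7552806,Example journal B\n'

# DictReader uses the header row as the dictionary keys
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# One 'wd:Q...' term per line, with no trailing newline
item_qids = '\n'.join('wd:' + row['qid'] for row in rows)
print(item_qids)
```

Using `'\n'.join(...)` avoids having to strip a trailing newline afterwards.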

In [3]:
# Load item data from csv
print('loading item data from file')
filename = data_path + item_source_csv
items = read_dict(filename)

# Create VALUES list for journals
item_qids = ''
for item in items:
    item_qids += 'wd:' + item['qid'] + '\n'
# remove trailing newline
item_qids = item_qids[:-1]

# create list of property dictionaries
prop_list = [
    {'pid': 'P31', 'variable': 'instance_of', 'value_type': 'item'},
    {'pid': 'P1476', 'variable': 'title', 'value_type': 'string'},
    {'pid': 'P407', 'variable': 'language_of_work', 'value_type': 'item'},
    {'pid': 'P495', 'variable': 'country_of_origin', 'value_type': 'item'},
    {'pid': 'P123', 'variable': 'publisher', 'value_type': 'item'},
    {'pid': 'P571', 'variable': 'inception', 'value_type': 'date'},
    {'pid': 'P2669', 'variable': 'discontinued_date', 'value_type': 'date'},
    {'pid': 'P856', 'variable': 'official_website', 'value_type': 'uri'},
    {'pid': 'P155', 'variable': 'follows', 'value_type': 'item'},
    {'pid': 'P156', 'variable': 'followed_by', 'value_type': 'item'},
    {'pid': 'P921', 'variable': 'main_subject', 'value_type': 'item'},
    {'pid': 'P2896', 'variable': 'publication_interval', 'value_type': 'decimal'},
    {'pid': 'P236', 'variable': 'issn', 'value_type': 'string'}
]

#print(item_qids)

loading item data from file


This cell finds out which properties are used in the references attached to statements that use the properties listed above, restricted to the items in the list.
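
The graph pattern behind the query may be easier to follow when it is assembled in one place. The sketch below builds the same kind of query for a single property; the function name `build_reference_property_query` is hypothetical, and the variable `?statement` replaces the `?issn_statement` name used in the cell that follows. It only constructs the query string and does not contact the endpoint.

```python
def build_reference_property_query(pid, qids):
    """Return a SPARQL query listing reference properties used on `pid` statements."""
    values = '\n'.join('    wd:' + qid for qid in qids)
    return '''select distinct ?gprop ?prop_label where {
  VALUES ?qid {
''' + values + '''
  }
  # p: leads from the item to the statement node for the property
  ?qid p:''' + pid + ''' ?statement.
  # each statement node points to zero or more reference nodes
  ?statement prov:wasDerivedFrom ?reference.
  # every predicate on the reference node is a reference property
  ?reference ?prop ?value.
  # map the reference predicate back to its generic property entity
  ?gprop wikibase:reference ?prop.
  ?gprop rdfs:label ?prop_label.
  filter(lang(?prop_label)='en')
}'''

print(build_reference_property_query('P236', ['Q6386232', 'Q7552806']))
```

The `wikibase:reference` triple is what turns a reference predicate back into the property entity whose English label is reported.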

In [4]:
for property in prop_list:
    print('*', property['pid'], property['variable'])
    query = '''select distinct ?gprop ?prop_label where {
    '''
    query += '''
      VALUES ?qid
    {
    ''' + item_qids + '''
    }

    '''
    query += '?qid p:' + property['pid'] + ''' ?issn_statement.
    ?issn_statement prov:wasDerivedFrom ?reference.
    ?reference ?prop ?value.
    ?gprop wikibase:reference ?prop.
    ?gprop rdfs:label ?prop_label.
    filter(lang(?prop_label)='en')
    }'''
    #print(query)
    
    # send request to Wikidata Query Service
    response = requests.post(endpoint, data=query, headers=requestheader)
    data = response.json()

    # extract the values from the response JSON
    results = data['results']['bindings']
    #print(json.dumps(results, indent = 2))
    for result in results:
        print(extract_qnumber(result['gprop']['value']), result['prop_label']['value'])
    print()
    sleep(sparql_sleep)

* P31 instance_of
P236 ISSN
P143 imported from Wikimedia project
P248 stated in
P813 retrieved
P1683 quotation

* P1476 title
P143 imported from Wikimedia project
P236 ISSN
P248 stated in

* P407 language_of_work
P143 imported from Wikimedia project
P248 stated in
P813 retrieved
P854 reference URL
P887 based on heuristic
P4656 Wikimedia import URL

* P495 country_of_origin
P143 imported from Wikimedia project
P248 stated in
P854 reference URL
P4656 Wikimedia import URL

* P123 publisher
P143 imported from Wikimedia project
P236 ISSN
P248 stated in
P813 retrieved
P854 reference URL
P4656 Wikimedia import URL

* P571 inception
P143 imported from Wikimedia project
P248 stated in
P854 reference URL
P4327 BHL bibliography ID
P1476 title

* P2669 discontinued_date

* P856 official_website
P143 imported from Wikimedia project
P813 retrieved
P4656 Wikimedia import URL
P854 reference URL

* P155 follows

* P156 followed_by

* P921 main_subject
P143 imported from Wikimedia project
P248 stated in

### Breakdown of results

I am dubious about using P236 (ISSN) as a source property. See [Q63871731](https://www.wikidata.org/wiki/Q63871731) for example. Since a URL is used, wouldn't it be P854 (reference URL)?

P1683 (quotation) seems to be used to provide the string version of the `stated in` item. See [Q6295853](https://www.wikidata.org/wiki/Q6295853).

P887 (based on heuristic) is legitimate for language of the work. Daniel Mietchen uses it with [Q15755692](https://www.wikidata.org/wiki/Q15755692). But it seems like we could do better. I'm not sure about this one, but it is not widely used (2 journals only).

We won't be using P143 (imported from Wikimedia project) and P4656 (Wikimedia import URL) since we aren't importing from Wikipedia. They are OK, but if we are really going to improve the quality of the data, we should be using primary sources. Therefore, I don't see them as worth tracking.

P4327 (BHL bibliography ID) is used in addition to P248 (stated in) within the same reference. Not sure if this is the best practice, but I don't think it's a source we will be using. See [Q6087079](https://www.wikidata.org/wiki/Q6087079) for an example.

P1476 (title) is used along with P854 (reference URL) to show the title of the page. Maybe not a bad idea, but only done once in [Q6087079](https://www.wikidata.org/wiki/Q6087079), so not useful to track.

### Conclusion

That leaves us with:

- P248 (stated in)
- P854 (reference URL)
- P813 (retrieved)

which are the properties I would have thought of using anyway. So that is good verification.
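
The elimination can be checked mechanically. The sketch below copies the reference properties from the query output above into a dictionary keyed by statement property, removes the ones set aside in the breakdown, and prints what is left. The variable names are illustrative only.

```python
# Reference properties reported for each statement property in the output above
observed = {
    'P31': {'P236', 'P143', 'P248', 'P813', 'P1683'},
    'P1476': {'P143', 'P236', 'P248'},
    'P407': {'P143', 'P248', 'P813', 'P854', 'P887', 'P4656'},
    'P495': {'P143', 'P248', 'P854', 'P4656'},
    'P123': {'P143', 'P236', 'P248', 'P813', 'P854', 'P4656'},
    'P571': {'P143', 'P248', 'P854', 'P4327', 'P1476'},
    'P856': {'P143', 'P813', 'P4656', 'P854'},
    'P921': {'P143', 'P248'},
}

# Properties set aside in the breakdown of results
excluded = {'P236', 'P1683', 'P887', 'P143', 'P4656', 'P4327', 'P1476'}

# Union of everything observed, minus the exclusions, sorted by numeric ID
all_used = set().union(*observed.values())
remaining = sorted(all_used - excluded, key=lambda pid: int(pid[1:]))
print(remaining)  # ['P248', 'P813', 'P854']
```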