This notebook is a work-in-progress process for initializing the Geoscience Knowledgebase (GeoKB) with a set of properties and foundational semantics that establish a base for curating geoscientific knowledge. Our initial use cases involve integrating mineral occurrence information with the document references that contribute to those and other facts and concepts used in mineral resource assessment. To build a useful knowledge graph on these concepts, though, we also need to bring in the supporting classes and items that the claims about them must link to.

For instance, we use "NI 43-101 Technical Reports" and the newer "SK-1300 Technical Report" as sources for claims about mining projects/properties, such as geographic location, mineral commodities identified and/or extracted, estimates of ore grade and tonnage, and other details. These are technical geoscientific reports required by the Canadian and U.S. governments, respectively. We need an "instance of" (rdf:type) claim on everything like this in the system. While we could simply create items in the graph to represent these two classes with no further classification, it is useful to "work back up the semantic hierarchy" for as many concepts as we can, as far as we need to, so that the information we record in the GeoKB is understandable in the broader global knowledge commons (Wikidata and/or other efforts).

The initialization process here is designed to give us a semantic baseline to work from as we pull in the information and connections we really care about within this context. We're taking a pragmatic approach that is slightly more rigorous (and certainly more streamlined) than the wild west of Wikidata but somewhere short of endlessly academic. We have to get a whole bunch of information into the GeoKB to support analytical use, so we make a best effort to align what we have with mature ontologies and namespaces, knowing we'll have to evolve it over time. The notebook approach gives us a good place to record our reasoning and the pragmatic tradeoffs we have to make.

In [1]:
import pywikibot as pwb
import json
import pandas as pd

All or most of these functions should be movable to the abstract Wikibase management Python package we are designing. That package needs to work for the GeoKB but be generic enough to apply in other domains and circumstances. Other communities are doing similar work, such as the wikidataintegrator project in the health sciences. We found that we needed to start from the basics of pywikibot and how it operates.

In [5]:
def get_wb(site_name: str, language='en'):
    site = pwb.Site(language, site_name)
    site.login()
    repo = site.data_repository()

    return site, repo

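def build_item(label: str, description: str, aliases=None, language: str = 'en') -> dict:
    # Avoid a mutable default argument; treat "no aliases" as an empty list
    if aliases is None:
        aliases = []
    # A comma-separated string from the spreadsheet becomes a list of aliases
    if isinstance(aliases, str):
        aliases = aliases.split(',')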
    item = {
        'labels': {
            language: {
                'language': language,
                'value': label
            }
        },
        'descriptions': {
            language: {
                'language': language,
                'value': description
            }
        },
        'aliases': {
            language: {
                'language': language,
                'value': aliases
            }
        }
    }

    return item

def add_item(site: pwb.APISite, item: dict) -> dict:
    params = {
        'action': 'wbeditentity',
        'new': 'item',
        'data': json.dumps(item),
        'token': site.tokens['csrf'],
    }

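    try:
        req = site.simple_request(**params)
        results = req.submit()
        return results
    except Exception as e:
        # Return the failure as a dict so the return type stays consistent
        # and the caller's item dict is not mutated
        return {"error": str(e)}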

# Next steps here are to work in the build_item idea and revisit the upsert methods
# to deal with cases where items already exist

## Foundational Classes

Looking toward the initial use cases for the GeoKB, we need to lay down a basic class structure so that every item we introduce has an appropriate "instance of" claim pointing to a reasonable concept for basic classification. We could go on forever trying to get the semantics just right and aligned with as many sources of definition as possible, but we'll ease into that level of sophistication over time. In the near term, I've started a Google Sheet to contain the basic structure to get our knowledgebase initialized. We will likely need to iterate on this several times to get it right, and we will doubtless miss some things.

There are lots of ways to spin up these details, but a Google Sheet seems simple enough: it can be edited by multiple people, and we can read it as a CSV and process it here in the notebook.

In [3]:
geokb_init_sheet_id = '1dbuKc4cZJz0YY81B2xWXM5fId6gWgzmQar3hg3CI0Rw'
geokb_classes_sheet_name = 'ClassItems'
geokb_classes_csv = f'https://docs.google.com/spreadsheets/d/{geokb_init_sheet_id}/gviz/tq?tqx=out:csv&sheet={geokb_classes_sheet_name}'

geokb_classes = pd.read_csv(geokb_classes_csv)
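
The URL above works because the sheet name has no spaces or reserved characters. If we add tabs with less tidy names, the name would need escaping. A minimal sketch, assuming a hypothetical helper called `sheet_csv_url` (it is not part of the notebook or any library):

```python
from urllib.parse import quote


def sheet_csv_url(sheet_id: str, sheet_name: str) -> str:
    """Build the gviz CSV export URL for one tab of a Google Sheet.

    Hypothetical helper; it only assembles the same URL pattern used above.
    """
    # quote() percent-encodes spaces and other reserved characters in the tab name
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name, safe='')}"
    )


# "SHEET_ID" is a placeholder, not a real spreadsheet
print(sheet_csv_url("SHEET_ID", "Class Items"))
```

The result can be passed straight to `pd.read_csv` in the same way as `geokb_classes_csv`.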

In [6]:
geokb_classes

| | label | description | aliases | subclass of | wd_id |
|---|---|---|---|---|---|
| 0 | entity | anything that can be considered, discussed, or... | thing | | Q35120 |
| 1 | person | common name of Homo sapiens, unique extant spe... | human | entity | Q5 |
| 2 | organization | social entity established to meet needs or pur... | | entity | Q43229 |
| 3 | document | form for preservation of structured and identi... | | entity | Q49848 |
| 4 | publication | content made available to the general public | | entity | Q732577 |
| 5 | scholarly article | article in an academic publication, usually pe... | research article,scientific article,journal ar... | publication | Q13442814 |
| 6 | report | informational, formal, and detailed text | | document | Q10870555 |
| 7 | government report | document written by a government to convey inf... | | report | Q15629444 |
| 8 | NI 43-101 Technical Report | National Instrument 43-101 (the "NI 43-101" or... | | report | |
| 9 | USGS Report Series | official, USGS-authored publications of the U.... | USGS Series Report | government report | |
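
To check that every class really does trace back to entity, we can walk the "subclass of" column from the table above. This is a rough sketch, assuming a hypothetical `lineage` helper and a plain dict copied by hand from the table rather than read from the DataFrame:

```python
# Label -> parent label, copied from the "subclass of" column above
SUBCLASS_OF = {
    "entity": None,
    "person": "entity",
    "organization": "entity",
    "document": "entity",
    "publication": "entity",
    "scholarly article": "publication",
    "report": "document",
    "government report": "report",
    "NI 43-101 Technical Report": "report",
    "USGS Report Series": "government report",
}


def lineage(label: str, parents: dict) -> list:
    """Return the chain of classes from `label` up to the root class."""
    chain = []
    seen = set()
    current = label
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in subclass hierarchy at {current!r}")
        if current not in parents:
            raise KeyError(f"{current!r} is not defined in the class table")
        seen.add(current)
        chain.append(current)
        current = parents[current]
    return chain


print(lineage("USGS Report Series", SUBCLASS_OF))
# ['USGS Report Series', 'government report', 'report', 'document', 'entity']
```

A `KeyError` here would flag a "subclass of" value that has no row of its own in the sheet, which is the kind of gap we expect to hit while iterating.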


## Classification Items

We'll need to work on the build_item function a bit, but it is a fairly simple structure to start with until we get to the claims. We can work through this however we want, but the basic structure in dict form would look something like the following.

In [9]:
geokb_classes['wb_item'] = geokb_classes.apply(lambda x: build_item(x.label, x.description, x.aliases), axis=1)
geokb_classes.wb_item.to_list()

[{'labels': {'en': {'language': 'en', 'value': 'entity'}},
  'descriptions': {'en': {'language': 'en',
    'value': 'anything that can be considered, discussed, or observed'}},
  'aliases': {'en': {'language': 'en', 'value': ['thing']}}},
 {'labels': {'en': {'language': 'en', 'value': 'person'}},
  'descriptions': {'en': {'language': 'en',
    'value': 'common name of Homo sapiens, unique extant species of the genus Homo, from embryo to adult'}},
  'aliases': {'en': {'language': 'en', 'value': ['human']}}},
 {'labels': {'en': {'language': 'en', 'value': 'organization'}},
  'descriptions': {'en': {'language': 'en',
    'value': 'social entity established to meet needs or pursue goals'}},
  'aliases': {'en': {'language': 'en', 'value': nan}}},
 {'labels': {'en': {'language': 'en', 'value': 'document'}},
  'descriptions': {'en': {'language': 'en',
    'value': 'form for preservation of structured and identified information'}},
  'aliases': {'en': {'language': 'en', 'value': nan}}},
 {'lab
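
The output above shows `nan` wherever the spreadsheet has no aliases, because pandas reads an empty cell as a float NaN. One piece of the work still needed on build_item is cleaning that up before anything is sent to the Wikibase. A minimal sketch, assuming two hypothetical helpers (`normalize_aliases` and `alias_entries`); the one-object-per-alias shape reflects my understanding of the Wikibase JSON format and should be checked against the API before we rely on it:

```python
import math


def normalize_aliases(aliases) -> list:
    """Turn a spreadsheet cell into a clean list of alias strings."""
    if aliases is None:
        return []
    # pandas represents an empty cell as float NaN
    if isinstance(aliases, float) and math.isnan(aliases):
        return []
    if isinstance(aliases, str):
        aliases = aliases.split(",")
    # Drop surrounding whitespace and any empty or non-string entries
    return [a.strip() for a in aliases if isinstance(a, str) and a.strip()]


def alias_entries(aliases, language: str = "en") -> list:
    """Build one {language, value} object per alias (assumed target shape)."""
    return [{"language": language, "value": a} for a in normalize_aliases(aliases)]


print(normalize_aliases(float("nan")))
print(alias_entries("research article, scientific article"))
```

With something like this in place, the `'aliases'` entry in build_item could hold `alias_entries(aliases, language)` under the language key instead of a single dict whose value is a list.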

## Build classification items

I'll pick up here next. I need to build in an upsert capability for cases where we want to re-run this. We may also need to work up actual delete functionality once we stabilize a Wikibase instance. Part of my thinking on tracing everything back to entity (thing) is that we should be able to use SPARQL pretty readily to read out the classification items to determine if something has disappeared from our starter spreadsheet and take actions accordingly.
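
The comparison behind an upsert can be sketched without touching the Wikibase at all. The example below assumes a hypothetical `plan_sync` function and assumes we have already read the existing labels and item IDs out of the knowledgebase (for instance via SPARQL); the IDs shown are made up for illustration:

```python
def plan_sync(sheet_labels, existing: dict) -> dict:
    """Compare spreadsheet labels with labels already in the knowledgebase.

    sheet_labels: iterable of labels from the starter spreadsheet
    existing: dict mapping label -> item ID already in the Wikibase
    """
    sheet = set(sheet_labels)
    known = set(existing)
    return {
        "create": sorted(sheet - known),              # in the sheet, not yet in the KB
        "update": sorted(sheet & known),              # in both; candidates for edits
        "missing_from_sheet": sorted(known - sheet),  # in the KB, dropped from the sheet
    }


# Hypothetical IDs, not taken from a real GeoKB instance
existing_items = {"entity": "Q1", "person": "Q2", "old class": "Q99"}
print(plan_sync(["entity", "person", "report"], existing_items))
# {'create': ['report'], 'update': ['entity', 'person'], 'missing_from_sheet': ['old class']}
```

Matching on label alone is a simplification; a sturdier version would probably key on an identifier such as wd_id where one exists.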

In [3]:
# Required connection points in the pwb API
geokb_site, geokb_repo = get_wb('geokb')

In [4]:
print(type(geokb_site), type(geokb_repo))

<class 'pywikibot.site._apisite.APISite'> <class 'pywikibot.site._datasite.DataSite'>


I left off here after proving that I can at least create items. This last bit follows the [tutorial](https://www.wikidata.org/wiki/Wikidata:Pywikibot_-_Python_3_Tutorial/Labels) on building and editing items to include richer provenance (history). I'll come back to build this into functional logic.

In [24]:
item = pwb.ItemPage(geokb_repo, "Q3")

# new_labels = {"en": "bear", "de": "Bär"}
# new_descr = {"en": "gentle creature of the forest",
#              "de": "Friedlicher Waldbewohner"}
new_alias = {"en": ["person"]}

for key in new_alias:
    item.editAliases({key: new_alias[key]},
        summary="Setting aliases: {} = '{}'".format(key, new_alias[key]))