# Standardizing place names programmatically using the Getty Thesaurus of Geographic Names 

## Introduction
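FAIR (Findable, Accessible, Interoperable, Reusable) data is easier for other people to find and reuse.
Standardization helps you get there.
One way to standardize biological observations is to use Darwin Core.
The Darwin Core Location terms suggest checking place names against the [Getty Thesaurus of Geographic Names](https://www.getty.edu/research/tools/vocabularies/tgn/) (abbreviated GTGN or TGN in this tutorial).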

The TGN is a structured vocabulary containing place names. These names may be in English, in a vernacular language, or even historical. Information such as geographic coordinates, notes, sources for the data, and place types (e.g. state capital) is linked to each name. It is important to note that the TGN is a thesaurus, not a Geographic Information System (GIS); any coordinates provided are for reference purposes only.

At first glance, it looks like the only way to check your place names against the GTGN is to search for each one manually.
In fact, the GTGN also provides a SPARQL endpoint that lets you search programmatically.
The documentation for this endpoint is long and complicated.
I wanted to provide an accessible introduction to standardizing place names in biological data sets using the GTGN.

## SPARQL

### What is it?
Many resources, such as WoRMS, provide programmatic access to their data via a web Application Programming Interface (API).
APIs are fantastic, but they only allow the data requests that the API's designers chose to define.
For example, I can get x but not y.
Resource Description Framework (RDF) databases, or "graph" databases, get around this problem by defining complex relationships between data objects, and allowing the user to request data using these relationships. These relationships can even extend between databases, allowing the user to bring together disparate types of information related to their research question.
"An interlinked query like this means that we can ask Europeana questions about its objects that rely on information about geography (what cities are in the Netherlands?) that Europeana does not need to store and maintain itself. In the future, more cultural LOD will hopefully link to authority databases like the Getty’s Union List of Artist Names, allowing, for example, the British Museum to outsource biographical data to the more complete resources at the Getty." For example, the TGN refers to the Art and Architecture Thesaurus for its place types.

The Getty Vocabulary Program maintains a set of RDF databases including the GTGN. SPARQL (a recursive acronym which stands for SPARQL Protocol and RDF Query Language) is the query language you use to request data from these databases. 
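To make the idea of "requesting data using relationships" more concrete, here is a toy sketch in plain Python. It is not how an RDF database is implemented; it only imitates the idea of storing facts as (subject, predicate, object) triples and asking for everything that fits a pattern. All identifiers (`place:Utrecht`, `located_in`, and so on) are made up for illustration.

```python
# Each fact is a (subject, predicate, object) triple.
# Every name here is invented for illustration only.
TRIPLES = [
    ("place:Utrecht", "rdf:type", "type:City"),
    ("place:Utrecht", "located_in", "place:Netherlands"),
    ("place:Leiden", "rdf:type", "type:City"),
    ("place:Leiden", "located_in", "place:Netherlands"),
    ("place:Ghent", "rdf:type", "type:City"),
    ("place:Ghent", "located_in", "place:Belgium"),
]


def match(pattern, triples):
    """Yield variable bindings for each triple that fits one pattern.

    Terms that start with '?' are variables; everything else must match exactly.
    """
    for triple in triples:
        binding = {}
        for want, have in zip(pattern, triple):
            if want.startswith("?"):
                # A variable repeated within one pattern must keep one value.
                if binding.get(want, have) != have:
                    break
                binding[want] = have
            elif want != have:
                break
        else:
            yield binding


def query(patterns, triples):
    """Return bindings that satisfy every pattern at once (a simple join)."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            # Fill in variables that earlier patterns have already fixed.
            bound = tuple(solution.get(term, term) for term in pattern)
            for binding in match(bound, triples):
                next_solutions.append({**solution, **binding})
        solutions = next_solutions
    return solutions


# "Which cities are in the Netherlands?"
dutch_cities = sorted(
    solution["?city"]
    for solution in query(
        [
            ("?city", "rdf:type", "type:City"),
            ("?city", "located_in", "place:Netherlands"),
        ],
        TRIPLES,
    )
)
print(dutch_cities)
```

A SPARQL query does something similar in spirit: you describe the shape of the facts you want, with variables in the positions you want filled in, and the database returns every combination that fits.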

### How do I use it?

```sparql
select * {
    ?f a gvp:Facet;
       skos:inScheme tgn: ;
       gvp:prefLabelGVP/xl:literalForm ?l
}
```

This query reads as follows: the subject (`?f`) is an instance of (`a`) the class `gvp:Facet`, it belongs to the TGN (`skos:inScheme tgn:`), and it has a preferred label (`?l`). The query returns two columns, `f` and `l`: `f` contains the term ID and `l` contains the preferred label.
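Here is a minimal sketch of how that query could be sent from Python with `SPARQLWrapper` and how the response could be flattened into rows. Several things are my assumptions rather than anything stated above: the endpoint URL (`http://vocab.getty.edu/sparql`), the explicit `PREFIX` lines (added in case the endpoint does not predefine them for remote clients), and the helper names `run_query` and `bindings_to_rows`. The `sample` dictionary is a hand-written stand-in in the standard SPARQL JSON results layout, with placeholder values, so the parsing step can be tried without a network connection.

```python
# Assumed endpoint URL; check the Getty documentation before relying on it.
ENDPOINT = "http://vocab.getty.edu/sparql"

FACET_QUERY = """
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX tgn: <http://vocab.getty.edu/tgn/>
PREFIX xl: <http://www.w3.org/2008/05/skos-xl#>
select * {
    ?f a gvp:Facet ;
       skos:inScheme tgn: ;
       gvp:prefLabelGVP/xl:literalForm ?l
}
"""


def run_query(query, endpoint=ENDPOINT):
    """Send a query to the endpoint and return the parsed JSON response.

    Needs network access and the SPARQLWrapper package, so the import
    lives inside the function and the rest of this sketch runs without it.
    """
    from SPARQLWrapper import SPARQLWrapper, JSON

    client = SPARQLWrapper(endpoint)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    return client.query().convert()


def bindings_to_rows(results):
    """Flatten a SPARQL JSON response into one plain dict per result row."""
    columns = results["head"]["vars"]
    return [
        {col: (binding[col]["value"] if col in binding else None) for col in columns}
        for binding in results["results"]["bindings"]
    ]


# Hand-written stand-in for a response; the ID and label are placeholders.
sample = {
    "head": {"vars": ["f", "l"]},
    "results": {
        "bindings": [
            {
                "f": {"type": "uri", "value": "http://vocab.getty.edu/tgn/EXAMPLE-ID"},
                "l": {"type": "literal", "value": "Example Facet"},
            }
        ]
    },
}

rows = bindings_to_rows(sample)
print(rows)
```

With a network connection, `bindings_to_rows(run_query(FACET_QUERY))` should produce rows of the same shape, and `pd.DataFrame(rows)` turns them into a table with columns `f` and `l`.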

![Image of GVP Semantic Overview](http://vocab.getty.edu/doc/img/005-semantic-overview.png)

There are many other operations you can add to a SPARQL query, such as aggregating results, sorting results, and removing duplicate results. However, I'm assuming that you, like me, would prefer to perform these operations in your programming language of choice (almost certainly not SPARQL). If you are curious about more complex operations, check out the resources at the end of this tutorial.
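As a rough illustration of doing that tidying outside SPARQL, the sketch below removes duplicate rows and sorts what is left by label, using only the standard library. The function name `tidy_rows` and the example rows are hypothetical; the rows are assumed to be plain dictionaries with the `f` and `l` columns described above.

```python
def tidy_rows(rows, sort_key="l"):
    """Drop exact duplicate rows, then sort by one column, ignoring case."""
    seen = set()
    unique = []
    for row in rows:
        # A sorted tuple of items gives each row a hashable fingerprint.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return sorted(unique, key=lambda row: row[sort_key].casefold())


# Placeholder rows, not real TGN content.
raw_rows = [
    {"f": "id:2", "l": "beta"},
    {"f": "id:1", "l": "Alpha"},
    {"f": "id:2", "l": "beta"},
]

tidy = tidy_rows(raw_rows)
print(tidy)
```

If the results are already in a pandas DataFrame, `drop_duplicates()` followed by `sort_values()` does the same job.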

### Query examples
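As one possible starting point, the sketch below builds a query that looks for TGN subjects whose preferred label matches a given place name exactly, ignoring case. It reuses only the properties from the query shown earlier (`skos:inScheme`, `gvp:prefLabelGVP`, `xl:literalForm`). The function names, the `limit` default, and the explicit `PREFIX` lines are my own assumptions. An exact-match `filter` may well be slow on a vocabulary this large, and it will miss alternate or historical names, so treat it as an illustration of the query shape rather than a recommended lookup strategy.

```python
def escape_sparql_string(text):
    """Escape characters that would break a double-quoted SPARQL literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def build_place_query(place_name, limit=20):
    """Return a SPARQL query string matching a preferred label, ignoring case."""
    name = escape_sparql_string(place_name.strip().lower())
    return f"""
PREFIX gvp: <http://vocab.getty.edu/ontology#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX tgn: <http://vocab.getty.edu/tgn/>
PREFIX xl: <http://www.w3.org/2008/05/skos-xl#>
select ?place ?label {{
    ?place skos:inScheme tgn: ;
           gvp:prefLabelGVP/xl:literalForm ?label .
    filter (lcase(str(?label)) = "{name}")
}}
limit {int(limit)}
"""


example_query = build_place_query("  Cape Town ")
print(example_query)
```

Escaping the name before it goes into the query matters because place names in occurrence data can contain quotation marks or stray line breaks.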

### Python implementation on example biological occurrence data

```python
# Import packages

import numpy as np
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON
```

```python
# Create example data
```
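A minimal sketch of what example occurrence data could look like, and of one step that is probably worth doing before any lookup: collapsing the place names to a distinct list so each name is only queried once. The records are invented for illustration and are not real observations. The field names `scientificName`, `country`, and `locality` follow Darwin Core; the helper `unique_place_names` is hypothetical.

```python
# Invented records for illustration only; not real observations.
occurrences = [
    {"scientificName": "Mytilus edulis", "country": "Netherlands", "locality": "Texel"},
    {"scientificName": "Mytilus edulis", "country": "Netherlands", "locality": "  texel "},
    {"scientificName": "Mytilus edulis", "country": "South Africa", "locality": "Cape  Town"},
    {"scientificName": "Mytilus edulis", "country": "South Africa", "locality": None},
]


def unique_place_names(records, field="locality"):
    """Return distinct place names, keeping the first spelling seen.

    Whitespace is collapsed and comparison ignores case; empty or
    missing values are skipped.
    """
    seen = {}
    for record in records:
        raw = record.get(field) or ""
        cleaned = " ".join(raw.split())
        if cleaned:
            seen.setdefault(cleaned.casefold(), cleaned)
    return list(seen.values())


localities = unique_place_names(occurrences)
print(localities)
```

The same records load into pandas with `pd.DataFrame(occurrences)` if you prefer to work with a DataFrame from here on.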

## References and resources

https://programminghistorian.org/en/lessons/retired/graph-databases-and-SPARQL

https://en.wikibooks.org/wiki/XQuery/SPARQL_Tutorial

https://www.w3.org/TR/sparql11-overview/