<a href="https://colab.research.google.com/github/OnroerendErfgoed/scriptorium/blob/main/notebooks/11_all_getty_skos_matches.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# All your match are belong to us

Sometimes an API doesn't offer a filter or search parameter that does exactly what you want. In those cases you need to make more of an effort to get your results, but because the data is all available, a little script can usually get you there. Bear in mind that this can put some strain on a server. If you're making a lot of calls to a service, it's a good idea to reach out to the server administrator and ask whether they're OK with what you're doing, or whether they know of a better way to do it.

One good way to avoid overloading a server is to wait a bit between calls. This makes your script slower, but it spreads the load on the server over time.

```python
import time

# Sleep for half a second
time.sleep(0.5)
```
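A fixed `time.sleep` always waits the full delay, even when the work between two calls already took longer than that. As a minimal sketch, a small helper can sleep only for the time that is still missing. The `Throttle` class and its `min_interval`, `clock` and `sleep` parameters are hypothetical names made up for this illustration; they are not part of `requests` or of the thesaurus API.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        """Block until min_interval seconds have passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.5)
throttle.wait()  # first call returns immediately
# ... make an API call here ...
throttle.wait()  # sleeps only for whatever is left of the half second
```

Calling `throttle.wait()` right before each request keeps calls at least half a second apart without adding delay on top of slow responses.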

For this script we'll work with the [Onroerend Erfgoed Thesaurus](https://thesaurus.onroerenderfgoed.be). We'll use the Conceptscheme of `Erfgoedtypes` to see how many of them are linked to the [Getty AAT thesaurus](http://www.getty.edu/research/tools/vocabularies/aat/). So, we want to know which `Erfgoedtypes` have a link (called a Match in SKOS) with a concept in the AAT.

The thesaurus API supports checking which concepts are linked to a certain AAT concept, but it does not support searching for all concepts that are linked to any AAT concept.

```python
import requests

THESAURUS_HOST = 'https://thesaurus.onroerenderfgoed.be'
ERFGOEDTYPES_URL = THESAURUS_HOST + '/conceptschemes/ERFGOEDTYPES/c'

session = requests.Session()

session.headers.update({'Accept': 'application/json'})

res = session.get(
    ERFGOEDTYPES_URL,
    params={
        'match': 'http://vocab.getty.edu/aat/300005241'
    }
)

print(res.json())
```
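Printing the raw JSON can be hard to read. As a small sketch, assuming each entry in the list response carries the `uri` and `label` keys that the next script relies on, the result can be condensed to one line per concept. The `summarize` function and the sample data are made up for illustration; the real list would come from `res.json()`.

```python
def summarize(concepts):
    """Format one 'label <uri>' line per concept in a list response."""
    return [f"{c['label']} <{c['uri']}>" for c in concepts]


# Made-up data for illustration; the real list comes from res.json()
sample = [
    {'uri': 'https://example.org/concepts/1', 'label': 'first example'},
    {'uri': 'https://example.org/concepts/2', 'label': 'second example'},
]

for line in summarize(sample):
    print(line)
```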

The previous example works well: it tells us which concept is linked to this particular AAT concept. But we would like to see all concepts in this thesaurus that have a link to the AAT. We could query the thesaurus with every single AAT URI, but the AAT contains a lot of concepts, so this would take forever.

Since our thesaurus of `Erfgoedtypes` is much smaller than the AAT, we'll GET every concept in it and check whether it has a link with the AAT. To keep this demo quick, the script stops as soon as it has collected `MAX_OUTPUT` matches; raise that value (or remove the check) to process the whole thesaurus. We'll also save the information to a CSV file. This requires authenticating with your Google Drive. Also make sure the variable `OUTPUT_FILE` points to a valid, writable folder.

```python
import requests
import csv
import time

from google.colab import drive
drive.mount('/drive')

THESAURUS_HOST = 'https://thesaurus.onroerenderfgoed.be'
ERFGOEDTYPES_URL = THESAURUS_HOST + '/conceptschemes/ERFGOEDTYPES/c'

MAX_OUTPUT = 10
OUTPUT_FILE = '/drive/My Drive/Colab Notebooks/skos_matches.csv'

session = requests.Session()

session.headers.update({'Accept': 'application/json'})

# Get all erfgoedtypes
# This collection does not support pagination and always sends everything
res = session.get(
    ERFGOEDTYPES_URL,
    params={
        'type': 'concept'
    }
)

# Make sure everything went well
res.raise_for_status()

data = res.json()

output = []

for concept in data:
  print(f"Processing {concept['uri']}")
  conceptresponse = session.get(concept['uri'])
  conceptresponse.raise_for_status()
  conceptdetail = conceptresponse.json()
  # Only look at concepts that have any matches at all
  if conceptdetail['matches']:
    for matchtype, matchvalues in conceptdetail['matches'].items():
      # matchvalues is a list of URIs for this match type
      for matchvalue in matchvalues:
        if matchvalue.startswith('http://vocab.getty.edu'):
          output.append([concept['uri'], concept['label'], matchtype, matchvalue])
  # We handled a concept, now sleep a bit
  time.sleep(0.1)
  # Once we have MAX_OUTPUT rows, stop
  if len(output) >= MAX_OUTPUT:
    break

# newline='' stops the csv module from writing blank lines on some platforms
with open(OUTPUT_FILE, 'w', newline='', encoding='utf-8') as csvfile:
  outputwriter = csv.writer(csvfile)
  outputwriter.writerows(output)
```
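The filtering step inside the loop above can also be pulled out into a small function, which makes it possible to try the logic without calling the API. This is only a sketch: `getty_matches` and `GETTY_PREFIX` are names made up for this example, and the sample concept is invented data that mimics the `matches` structure the script reads (a mapping from match type to a list of URIs).

```python
GETTY_PREFIX = 'http://vocab.getty.edu'


def getty_matches(conceptdetail, prefix=GETTY_PREFIX):
    """Return (matchtype, uri) pairs for every match whose URI starts with prefix."""
    # Treat a missing or empty 'matches' entry as "no matches"
    matches = conceptdetail.get('matches') or {}
    return [
        (matchtype, uri)
        for matchtype, uris in matches.items()
        for uri in uris
        if uri.startswith(prefix)
    ]


# Invented concept detail, shaped like the responses the loop above handles
sample = {
    'uri': 'https://example.org/concepts/1',
    'label': 'example',
    'matches': {
        'close': ['http://vocab.getty.edu/aat/300005241'],
        'exact': ['https://example.org/other/42'],
        'related': [],
    },
}

print(getty_matches(sample))
```

Passing a different `prefix` would let the same function collect matches with another external vocabulary.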

# Conclusion

This script shows us that even if the API doesn't provide the perfect filter for us to use, we can still access all the data and process it ourselves.

This example could be written in a different way using the [Skosprovider](https://github.com/OnroerendErfgoed/skosprovider) family of libraries, including [one](https://github.com/OnroerendErfgoed/skosprovider_getty) that makes it much easier to talk to the Getty services.