# Generating a thesaurus tree

[Flanders Heritage](https://www.onroerenderfgoed.be) maintains several thesauri at https://thesaurus.onroerenderfgoed.be. Thesauri can appear complex, but each one is essentially a controlled vocabulary of concepts we want to enumerate and describe. For more information see [Calling it what it is. Thesauri in the Flanders Heritage Agency: History, Importance, Use and Technological Advances.](https://doi.org/10.5194/isprs-annals-IV-2-W2-151-2017)

This script demonstrates generating a printout of the thesaurus of [Heritage Types](https://thesaurus.onroerenderfgoed.be/conceptschemes/ERFGOEDTYPES), or of a subtree of it. This can be used to maintain a high-level overview of the thesaurus. It's also possible to print how often certain concepts are used in the [Flanders Heritage Inventory](https://inventaris.onroerenderfgoed.be).

For this script we use the [skosprovider_atramhasis](https://skosprovider-atramhasis.readthedocs.io/) library to get easier access to the [Flanders Heritage thesauri](https://thesaurus.onroerenderfgoed.be). This library needs to be installed separately using *pip*. Please run the following command before running the script itself.

In [None]:
!pip install skosprovider_atramhasis

Once we've installed the library, we can run the script. There are a few parameters that can be modified.

`TOP_CONCEPTS = []`

This parameter controls how much of the thesaurus is rendered. It can either be an empty list, in which case the entire thesaurus is rendered, or it can contain IDs of concepts in the thesaurus. It's assumed these will be top-level concepts, i.e. concepts that have no broader concepts themselves, though this is not enforced.
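
To make the effect of `TOP_CONCEPTS` concrete, here is a minimal sketch that uses a small in-memory hierarchy instead of the real thesaurus. The `TOY_TREE` data, its ids and labels, and the helper functions are hypothetical; they only mimic the shape of the recursion in the script. An empty selection renders every top-level concept, a non-empty one renders only the listed ids and what lies below them.

```python
# Hypothetical toy hierarchy: id -> (label, ids of narrower concepts).
# These ids and labels are made up; they are not real thesaurus ids.
TOY_TREE = {
    1: ('buildings', [2, 3]),
    2: ('churches', [4]),
    3: ('mills', []),
    4: ('chapels', []),
    5: ('landscapes', []),
}

def top_level_ids(tree):
    '''Return the ids that are not listed as a child of any other id.'''
    children = {c for _, kids in tree.values() for c in kids}
    return sorted(i for i in tree if i not in children)

def render(tree, ids, indent=''):
    '''Render the given ids and everything below them as a nested Markdown list.'''
    output = ''
    for i in ids:
        label, kids = tree[i]
        output += f'{indent}* {label}\n'
        output += render(tree, kids, indent + '    ')
    return output

def render_selection(tree, top_concepts):
    '''An empty selection means: render the whole tree.'''
    roots = top_concepts if top_concepts else top_level_ids(tree)
    return render(tree, roots)

# Only the subtree below id 2.
print(render_selection(TOY_TREE, [2]))
# The whole toy tree.
print(render_selection(TOY_TREE, []))
```

Note that nothing stops you from passing an id that is not a top-level concept, which matches the remark that this is assumed but not enforced.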

`QUERY_USAGE = True`

This parameter controls whether the script queries how often each concept is used in the [Flanders Heritage Inventory](https://inventaris.onroerenderfgoed.be). Keep in mind that setting this to `True` makes the script run far longer than setting it to `False`, and that combining `True` with an empty `TOP_CONCEPTS` list makes it run for a very long time. The cache on the skosprovider helps somewhat, but it only speeds up rendering the thesaurus, not querying the inventory.
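
The usage counts are not taken from the response body: the script reads the total from the `Content-Range` response header and keeps the part after the final slash. Below is a minimal sketch of that step, using a plain dictionary in place of a real response. The header value `items 0-9/1234` is an assumed example of the format, not a recorded response, and `total_from_headers` is a hypothetical helper. Unlike the script, it converts the total to an integer.

```python
def total_from_headers(headers):
    '''
    Return the total after the last "/" in a Content-Range header,
    or 0 when the header is absent (mirroring the script's fallback).
    '''
    content_range = headers.get('Content-Range')
    if content_range is None:
        return 0
    total = content_range.split('/')[-1]
    # HTTP allows "*" when the total is unknown; report that as None.
    return int(total) if total.isdigit() else None

# Assumed example header value, for illustration only.
print(total_from_headers({'Content-Range': 'items 0-9/1234'}))
print(total_from_headers({}))
```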

`LANGUAGE = 'nl-BE'`

This parameter controls the language in which the thesaurus tree is rendered. It should be an IANA language tag. The skosprovider_atramhasis library will try its best to render all concepts in this language, provided labels in this language are available. In practice, very few labels are available in languages other than `nl` or `nl-BE`.
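
As an illustration of how a label might be chosen for a tag such as `nl-BE`, the sketch below tries the exact tag first, then any label in the same primary language, and finally falls back to whatever label exists. This is an assumed, simplified strategy written for this notebook. It is not the actual algorithm used by skosprovider or skosprovider_atramhasis, and `pick_label` and the sample labels are hypothetical.

```python
def pick_label(labels, language):
    '''
    Pick a label text from a dict of {language tag: text}.
    Order: exact tag, same primary language, then the first label available.
    '''
    if not labels:
        return None
    wanted = language.lower()
    by_tag = {tag.lower(): text for tag, text in labels.items()}
    if wanted in by_tag:
        return by_tag[wanted]
    primary = wanted.split('-')[0]
    for tag, text in by_tag.items():
        if tag.split('-')[0] == primary:
            return text
    # Nothing in the requested language: fall back to whatever exists.
    return next(iter(by_tag.values()))

# Hypothetical labels for a single concept.
labels = {'nl': 'kerken', 'en': 'churches'}
print(pick_label(labels, 'nl-BE'))
print(pick_label(labels, 'en'))
```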

In [None]:
#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
This script demonstrates using the AtramhasisProvider connected to https://thesaurus.onroerenderfgoed.be/conceptschemes/ERFGOEDTYPES to print a thesaurus or main branches thereof as a simple Markdown tree.

Optionally, it can query the Flanders Heritage Inventory (https://inventaris.onroerenderfgoed.be) to report on the number of times a certain concept is assigned to erfgoed- or aanduidingsobjecten.
'''
import requests
from IPython.display import display
from IPython.display import Markdown

from skosprovider_atramhasis.providers import AtramhasisProvider
from skosprovider.providers import DictionaryProvider
from skosprovider.utils import dict_dumper

# Print the entire thesaurus
#TOP_CONCEPTS = []
# Use to only generate the tree for certain top terms.
# TOP_CONCEPTS = [147, 237, 930, 933, 78, 1751]
# Only print the tree for RELIGIEUZE GEBOUWEN EN COMPLEXEN
TOP_CONCEPTS = [359]

# Query the FHA inventory for the number of times these are used?
QUERY_USAGE = False

# Which language should the thesaurus be displayed in?
# (This will generally not provide a lot of results for anything other than nl or nl-BE)
LANGUAGE = 'nl-BE'

INVENTARIS_HOST = 'https://inventaris.onroerenderfgoed.be/'
URL_ERFGOEDOBJECTEN = INVENTARIS_HOST + 'erfgoedobjecten'
URL_AANDUIDINGSOBJECTEN = INVENTARIS_HOST + 'aanduidingsobjecten'

def find_aantal_erfgoedobjecten(concept_id):
    '''
    Find the number of erfgoedobjecten indexed with a certain concept from 
    the ERFGOEDTYPES thesaurus.
    '''
    res = requests.get(
        URL_ERFGOEDOBJECTEN,
        params={'typologie': concept_id},
        headers={'accept': 'application/json'}
    )
    aantal = None
    if res.status_code == requests.codes.ok:
        if 'Content-Range' in res.headers:
            aantal = res.headers['Content-Range'].split('/')[-1]
        else:
            aantal = 0
    return (aantal, res.url)

def find_aantal_aanduidingsobjecten(concept_id):
    '''
    Find the number of aanduidingsobjecten indexed with a certain concept from 
    the ERFGOEDTYPES thesaurus. Only still valid aanduidingsobjecten related to
    beschermingen will be counted.
    '''
    res = requests.get(
        URL_AANDUIDINGSOBJECTEN,
        params={
            'typologie': concept_id,
            'geldig': 1,
            'categorie': 'beschermingen'
        },
        headers={'accept': 'application/json'}
    )
    aantal = None
    if res.status_code == requests.codes.ok:
        if 'Content-Range' in res.headers:
            aantal = res.headers['Content-Range'].split('/')[-1]
        else:
            aantal = 0
    return (aantal, res.url)

def list_concepts(items, provider, indent=''):
    '''
    Generate a list of concepts and recurse the hierarchy.
    '''
    output = ''
    for i in items:
        if i['type'] == 'collection':
            output += f"{indent}* <[{i['label']}]({i['uri']})>\n"
        else:
            if QUERY_USAGE:
                erfgoedobjecten = find_aantal_erfgoedobjecten(i['id'])
                aanduidingsobjecten = find_aantal_aanduidingsobjecten(i['id'])
                output += f"{indent}* [{i['label']}]({i['uri']}) ([{erfgoedobjecten[0]}]({erfgoedobjecten[1]}) erfgoedobjecten, [{aanduidingsobjecten[0]}]({aanduidingsobjecten[1]}) aanduidingsobjecten)\n"
            else:
                output += f"{indent}* [{i['label']}]({i['uri']})\n"

        child = provider.get_children_display(
            i['id'],
            language=LANGUAGE,
            sort='label'
        )
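        # A truthiness check also copes with a provider returning False
        if child:
            # Four spaces per level so Markdown renderers nest the sublist
            output += list_concepts(child, provider, indent=indent + '    ')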
    return output

def main():
    # Keep cache in between runs of the script
    # Value is considered valid for 1 day
    provider = AtramhasisProvider(
        {
            'id': 'vioe-erfgoedtypes',
            'uri': 'https://id.erfgoed.net/thesauri/erfgoedtypes'
        },
        base_url='https://thesaurus.onroerenderfgoed.be',
        scheme_id='ERFGOEDTYPES',
        cache_config={
            'cache.backend': 'dogpile.cache.dbm',
            'cache.expiration_time': 60 * 60 * 24,
            'cache.arguments.filename': 'erfgoedtypes.dbm'
        }
    )

    thesaurus_title = provider.concept_scheme.label(LANGUAGE).label
    # Markdown headings need a space after the '#'
    output = f'# {thesaurus_title}\n'

    # Fetch only certain top concepts or fetch the entire thesaurus
    if len(TOP_CONCEPTS):
        top = []
        for tcid in TOP_CONCEPTS:
            tc = provider.get_by_id(tcid)
            top.append({
                'id': tc.id,
                'uri': tc.uri,
                'type': tc.type,
                'label': tc.label(LANGUAGE).label
            })
    else:
        top = provider.get_top_display(language=LANGUAGE, sort='label')

    # For each top concept, generate the subtree
    for t in top:
        if t['type'] == 'collection':
            # Escape the angle brackets so the label is not parsed as an HTML tag
            output += f"## &lt;{t['label']}&gt; ({t['uri']})\n"
        else:
            output += f"## {t['label']} ({t['uri']})\n"

        child = provider.get_children_display(
            t['id'],
            language=LANGUAGE,
            sort='label'
        )
        output += list_concepts(child, provider)
    
    display(Markdown(output))

if __name__ == "__main__":
    main()

## Further modifications

If this script was helpful, you might try to adapt it for your own purposes. Some suggestions:

*   The script only works for the thesaurus of [Heritage Types](https://thesaurus.onroerenderfgoed.be/conceptschemes/ERFGOEDTYPES), but it's rather easy to adapt to other thesauri such as [Periods](https://thesaurus.onroerenderfgoed.be/conceptschemes/DATERINGEN) or [Styles and Cultures](https://thesaurus.onroerenderfgoed.be/conceptschemes/STIJLEN_EN_CULTUREN). This would involve some changes to the instantiated AtramhasisProvider and, if you want to set `QUERY_USAGE` to `True`, adapting the `find_aantal_` functions.
*   The script queries the usage of thesaurus concepts in *erfgoedobjecten* and *aanduidingsobjecten*, but concepts can also be used by [waarnemingen](https://inventaris.onroerenderfgoed.be/waarnemingsobjecten). See if you can add this information as well.
*   The script queries the usage of *aanduidingsobjecten* that are *beschermd*, but it should be trivial to change this to all *aanduidingsobjecten* or to *aanduidingsobjecten* that are *vastgesteld*.
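
For the last suggestion, one possible starting point is to turn the hard-coded filters into arguments. The sketch below only builds the query string, so it runs without network access. `aanduidingsobjecten_params` is a hypothetical helper, and the assumption that leaving out `categorie` returns all categories has not been verified against the inventory API. Check the API for the category value that corresponds to *vastgesteld* before relying on it.

```python
from urllib.parse import urlencode

URL_AANDUIDINGSOBJECTEN = 'https://inventaris.onroerenderfgoed.be/aanduidingsobjecten'

def aanduidingsobjecten_params(concept_id, categorie=None, geldig=True):
    '''
    Build the query parameters for counting aanduidingsobjecten.
    categorie=None leaves the category filter out (assumed to mean "all").
    '''
    params = {'typologie': concept_id}
    if geldig:
        params['geldig'] = 1
    if categorie is not None:
        params['categorie'] = categorie
    return params

# Same filters as the script: still valid beschermingen only.
print(URL_AANDUIDINGSOBJECTEN + '?' + urlencode(aanduidingsobjecten_params(359, 'beschermingen')))
# No category filter at all.
print(URL_AANDUIDINGSOBJECTEN + '?' + urlencode(aanduidingsobjecten_params(359)))
```

The returned dictionary can be passed as `params` to `requests.get`, in the same way the `find_aantal_` functions do.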

