# Howto

For the general setup and the idea behind this API, see also the CLS INFRA Deliverable D7.1 "On Programmable Corpora" (https://doi.org/10.5281/zenodo.7664964). The system is an adapted version of the *POSTDATA 2 DraCor API*.

## Running the API
See the `Readme` in the repo. Be careful: in the current setup, the username and password of the triple store are exposed because the settings file `dev.env` is committed to the repository. Take precautions when running in production!

## Issue with Flask port 5000 (Mac)
In case of a conflict on port `5000` on a Mac running macOS Monterey, see https://medium.com/pythonistas/port-5000-already-in-use-macos-monterey-issue-d86b02edd36c. The port can be changed in the Docker `compose.yaml` file:

```
    ports:
      - "5000:5000"
```
If the port is changed there, also change `SERVICE_PORT` in the `dev.env` file, which is read by the API in `api.py` (`service_port = int(os.environ.get("SERVICE_PORT", 5000))`; nothing needs to change in `api.py` itself when environment variables are used).
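
The port lookup quoted above can be sketched as follows. This is a minimal illustration that assumes only the variable name `SERVICE_PORT` and the default `5000` from the line above; the helper `read_service_port` is hypothetical and not part of `api.py`.

```python
import os


def read_service_port(environ=os.environ, default=5000):
    """Return the port from SERVICE_PORT, falling back to the default."""
    return int(environ.get("SERVICE_PORT", default))


# With SERVICE_PORT unset the default applies; a value from dev.env overrides it.
print(read_service_port({}))
print(read_service_port({"SERVICE_PORT": "5001"}))
```

Passing the environment as an argument keeps the lookup easy to try out without touching the real process environment.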


## Interfacing with the triple store (virtuoso)
For the following examples to work, the sample data `data/generated_example_data.ttl` should already be in the triple store. See the notebook `generate_test_data.ipynb` for how this data is generated.

With Virtuoso running, it can be used directly from Python via the `DB` class. A connection can be established like this:

In [1]:
from sparql import DB

virtuoso = DB(triplestore="virtuoso", protocol="http",url="localhost",port="8890", username="dba", password="pwd123")

In [2]:
# see the set attributes
#virtuoso.__dict__

In [3]:
# the endpoints are set when initializing the class
print(virtuoso.sparql_query_endpoint)
print(virtuoso.sparql_auth_endpoint)
print(virtuoso.crud_endpoint)

http://localhost:8890/sparql
http://localhost:8890/sparql-auth
http://localhost:8890/sparql-graph-crud-auth
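
The three endpoints printed above share one pattern: protocol, host and port, followed by a fixed path. A minimal sketch of how such URLs could be assembled is given below; the function `build_endpoints` is hypothetical and only mirrors the output above, it is not the code of the `DB` class.

```python
def build_endpoints(protocol="http", host="localhost", port="8890"):
    """Assemble the Virtuoso endpoint URLs from the connection settings."""
    base = f"{protocol}://{host}:{port}"
    return {
        "sparql_query_endpoint": f"{base}/sparql",
        "sparql_auth_endpoint": f"{base}/sparql-auth",
        "crud_endpoint": f"{base}/sparql-graph-crud-auth",
    }


for name, url in build_endpoints().items():
    print(name, url)
```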


## SPARQL queries

It is possible to simply send SPARQL queries to Virtuoso:

In [4]:
query = """
SELECT * WHERE {
?s ?p ?o.
}
LIMIT 1
"""

virtuoso.sparql(query)

{'head': {'link': [], 'vars': ['s', 'p', 'o']},
 'results': {'distinct': False,
  'ordered': True,
  'bindings': [{'s': {'type': 'uri',
     'value': 'http://www.openlinksw.com/schemas/virtrdf#DefaultQuadMap'},
    'p': {'type': 'uri',
     'value': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'},
    'o': {'type': 'uri',
     'value': 'http://www.openlinksw.com/schemas/virtrdf#QuadMap'}}]}}
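
A query like this ends up as an HTTP request against the SPARQL endpoint. The sketch below only builds a request URL with the standard library and does not contact the server. It assumes the query endpoint printed earlier and uses the `query` parameter defined by the SPARQL 1.1 Protocol; how `DB.sparql()` actually sends the request is not shown here, and the helper name `build_query_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_query_url(endpoint, query):
    """Return a GET URL that carries the SPARQL query as the `query` parameter."""
    return f"{endpoint}?{urlencode({'query': query.strip()})}"


query = """
SELECT * WHERE {
?s ?p ?o.
}
LIMIT 1
"""

url = build_query_url("http://localhost:8890/sparql", query)
# Decoding the URL again gives back the original query text.
print(parse_qs(urlparse(url).query)["query"][0])
```

To receive JSON like the result above, a client would typically also send the header `Accept: application/sparql-results+json`.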

### Pre-defined SPARQL queries
The classes that power the API use pre-defined SPARQL queries, which are defined as classes in the module `sparql_queries.py`. They can be used by importing them and passing a Virtuoso `DB` instance to execute them.

They are all derived from the base class `GolemQuery` (`sparql_queries.py`), which inherits from `SparqlQuery` in `sparql.py`. The base class can be used to define things that are relevant for all other pre-defined queries, e.g. the prefixes.

In [5]:
from sparql_queries import GolemQuery
golem = GolemQuery()

The defined prefixes can be retrieved:

In [6]:
golem.prefixes

[{'prefix': 'gd', 'uri': 'http://data.golemlab.eu/data/'},
 {'prefix': 'gt', 'uri': 'http://data.golemlab.eu/data/entity/type/'},
 {'prefix': 'crm', 'uri': 'http://www.cidoc-crm.org/cidoc-crm/'},
 {'prefix': 'owl', 'uri': 'http://www.w3.org/2002/07/owl#'},
 {'prefix': 'xsd', 'uri': 'http://www.w3.org/2001/XMLSchema#'},
 {'prefix': 'cls', 'uri': 'http://clscor.io/ontology/'},
 {'prefix': 'go', 'uri': 'http://golemlab.eu/ontology/'},
 {'prefix': 'lrm', 'uri': 'http://www.cidoc-crm.org/cidoc-crm/lrmoo/'},
 {'prefix': 'rdfs', 'uri': 'http://www.w3.org/2000/01/rdf-schema#'},
 {'prefix': 'nif',
  'uri': 'http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#'}]

There is a method `get_prefix_uri` to resolve a prefix to its full URI:

In [7]:
# Get the full uri for the prefix crm
golem.get_prefix_uri("crm")

'http://www.cidoc-crm.org/cidoc-crm/'
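
The prefix list is a plain list of dictionaries, so both the lookup and the rendering as SPARQL `PREFIX` declarations are short loops. The following is a minimal standalone sketch, not the code of `GolemQuery`; the two functions take the list as an argument, and `render_prefixes` is a hypothetical name.

```python
prefixes = [
    {"prefix": "gd", "uri": "http://data.golemlab.eu/data/"},
    {"prefix": "crm", "uri": "http://www.cidoc-crm.org/cidoc-crm/"},
]


def get_prefix_uri(prefixes, prefix):
    """Look up the URI registered for a prefix; raise KeyError if it is unknown."""
    for item in prefixes:
        if item["prefix"] == prefix:
            return item["uri"]
    raise KeyError(prefix)


def render_prefixes(prefixes):
    """Render the prefix list as SPARQL PREFIX declarations, one per line."""
    return "\n".join(f"PREFIX {p['prefix']}: <{p['uri']}>" for p in prefixes)


print(get_prefix_uri(prefixes, "crm"))
print(render_prefixes(prefixes))
```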

#### Example: ID of an entity
There is a pre-defined query that retrieves an ID (E42 Identifier) of an E1 CRM Entity.

In [8]:
from sparql_queries import EntityId
entity_query = EntityId()

In [9]:
# Explain what a query does:
print(entity_query.explain())

ID of an Entity: 
    Generic query to get ID of an entity identified by an URI.
    It identifies the node that holds the ID as value by the type "id" (gt:id).
    


In [10]:
# there are some ways to check the instance after it has been initialized, e.g.
entity_query.template_includes_variables

True

This query is only a template: it contains a variable that has to be substituted before the query can be executed:

In [11]:
# see the variables
entity_query.variables

[{'id': 'entity_uri',
  'class': 'crm:E1_CRM_Entity',
  'description': 'URI of an Entity.'}]

In [12]:
# to see the template
print(entity_query.template)


    SELECT ?id WHERE {
        <$1> crm:P1_is_identified_by ?identifier .

        ?identifier a crm:E42_Identifier ;
            crm:P2_has_type gt:id ; 
            rdf:value ?id .
    }
    


Before executing the query, the variable has to be substituted. This can be done with the `inject()` method:

In [13]:
uri_to_be_inserted = 'http://data.golemlab.eu/data/potter_corpus' 

entity_query.inject([uri_to_be_inserted])

True
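
The substitution of positional placeholders such as `$1` can be sketched with a regular expression. This is an illustration of the idea only and an assumption about the mechanism, not the implementation of `inject()` in `sparql.py`; in particular, this standalone function returns the resulting string instead of `True`.

```python
import re


def inject(template, values):
    """Replace $1, $2, ... in the template with the values at positions 1, 2, ..."""

    def substitute(match):
        index = int(match.group(1)) - 1  # placeholders are 1-based
        return values[index]

    return re.sub(r"\$(\d+)", substitute, template)


template = "SELECT ?id WHERE { <$1> crm:P1_is_identified_by ?identifier . }"
print(inject(template, ["http://data.golemlab.eu/data/potter_corpus"]))
```

Matching the whole number after `$` avoids the pitfall of naive string replacement, where `$1` would also match the beginning of `$10`.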

In [14]:
# the query has been prepared, to see the query:
print(entity_query.query)

PREFIX gd: <http://data.golemlab.eu/data/>
PREFIX gt: <http://data.golemlab.eu/data/entity/type/>
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX cls: <http://clscor.io/ontology/>
PREFIX go: <http://golemlab.eu/ontology/>
PREFIX lrm: <http://www.cidoc-crm.org/cidoc-crm/lrmoo/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX nif: <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#>
    SELECT ?id WHERE {
        <http://data.golemlab.eu/data/potter_corpus> crm:P1_is_identified_by ?identifier .

        ?identifier a crm:E42_Identifier ;
            crm:P2_has_type gt:id ; 
            rdf:value ?id .
    }
    


There is also a method, `prepare()`, to explicitly "prepare" a query (add the prefixes).

After the query has been prepared, it can be executed:

In [15]:
# it needs to be passed a DB instance:
entity_query.execute(virtuoso)

True

### Results of a SPARQL Query
The results of the query are stored in the query's `results` attribute as an instance of the class `SparqlResults` (see `sparql.py`).

In [16]:
entity_query.results

<sparql.SparqlResults at 0x103ddb910>

In [17]:
# get the results as json
entity_query.results.dump()

{'head': {'link': [], 'vars': ['id']},
 'results': {'distinct': False,
  'ordered': True,
  'bindings': [{'id': {'type': 'literal', 'value': 'potter_corpus'}}]}}

In [18]:
# variables and bindings (values) can be retrieved separately
print(entity_query.results.vars)
print(entity_query.results.bindings)

['id']
[{'id': {'type': 'literal', 'value': 'potter_corpus'}}]
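
The structure shown above is the standard SPARQL JSON results format: the variable names sit under `head.vars`, the rows under `results.bindings`. A minimal sketch of reading both directly from the dumped dictionary, independent of the `SparqlResults` class:

```python
raw = {
    "head": {"link": [], "vars": ["id"]},
    "results": {
        "distinct": False,
        "ordered": True,
        "bindings": [{"id": {"type": "literal", "value": "potter_corpus"}}],
    },
}

variables = raw["head"]["vars"]
bindings = raw["results"]["bindings"]

# Each binding maps a variable name to a dict with "type" and "value".
for row in bindings:
    for name in variables:
        print(name, "=", row[name]["value"])
```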


### Transform the Results
`SparqlResults` has a method to "simplify" the query results. If only one variable is defined, `simplify` will return a list with the values as its items:

In [19]:
entity_query.results.simplify()

['potter_corpus']
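
The flattening can be sketched as a standalone function over the variables and bindings. This is an assumption about the behaviour, not the code of `SparqlResults.simplify()`: the sketch keeps every value as a string and skips variables that are unbound in a row.

```python
def simplify(variables, bindings):
    """Flatten SPARQL JSON bindings into plain Python values.

    One variable: a list of values. Several variables: a list of dicts.
    """
    if len(variables) == 1:
        name = variables[0]
        return [row[name]["value"] for row in bindings if name in row]
    return [
        {name: row[name]["value"] for name in variables if name in row}
        for row in bindings
    ]


single = simplify(["id"], [{"id": {"type": "literal", "value": "potter_corpus"}}])
print(single)

pairs = simplify(
    ["s", "p"],
    [{"s": {"type": "uri", "value": "http://example.org/s"},
      "p": {"type": "uri", "value": "http://example.org/p"}}],
)
print(pairs)
```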

The method also allows defining datatypes and renaming the keys of the simplified result. For this, a mapping has to be specified. The following example demonstrates this:

In [20]:
from sparql_queries import CorpusMetrics

corpus_metrics_query = CorpusMetrics()

print(corpus_metrics_query.explain())

Corpus Metrics: 
    Get all metrics of a corpus identified by URI
    


In [21]:
# prepare and execute the query for the corpus specified above
corpus_metrics_query.inject([uri_to_be_inserted])
corpus_metrics_query.execute(virtuoso)

True

In [22]:
# simplify the results without a mapping
corpus_metrics_query.results.simplify()

[{'dimensionURI': 'http://data.golemlab.eu/data/potter_corpus/dimension/chapters',
  'value': 500},
 {'dimensionURI': 'http://data.golemlab.eu/data/potter_corpus/dimension/characters',
  'value': 4000},
 {'dimensionURI': 'http://data.golemlab.eu/data/potter_corpus/dimension/comments',
  'value': 7000},
 {'dimensionURI': 'http://data.golemlab.eu/data/potter_corpus/dimension/female',
  'value': 1990},
 {'dimensionURI': 'http://data.golemlab.eu/data/potter_corpus/dimension/male',
  'value': 1990},
 {'dimensionURI': 'http://data.golemlab.eu/data/potter_corpus/dimension/nonbinary',
  'value': 20},
 {'dimensionURI': 'http://data.golemlab.eu/data/potter_corpus/dimension/paragraphs',
  'value': 9000},
 {'dimensionURI': 'http://data.golemlab.eu/data/potter_corpus/dimension/wordsInComments',
  'value': 20000},
 {'dimensionURI': 'http://data.golemlab.eu/data/potter_corpus/dimension/wordsInDocuments',
  'value': 500000}]

In [23]:
# show the variables
corpus_metrics_query.results.vars

['dimensionURI', 'value']

The keys `dimensionURI` and `value` can be renamed; the query returns integers in the fields with the key `value` (which makes sense). For demonstration purposes, we can cast them to strings.

In [24]:
my_example_mapping = {
    "dimensionURI" : {"key": "i_renamed_this"}, # renames the key of the field from "dimensionURI" to "i_renamed_this"
    "value" : {"datatype" : "str"} #casts the datatype from integer to string 
}

In [25]:
corpus_metrics_query.results.simplify(mapping=my_example_mapping)

[{'i_renamed_this': 'http://data.golemlab.eu/data/potter_corpus/dimension/chapters',
  'value': '500'},
 {'i_renamed_this': 'http://data.golemlab.eu/data/potter_corpus/dimension/characters',
  'value': '4000'},
 {'i_renamed_this': 'http://data.golemlab.eu/data/potter_corpus/dimension/comments',
  'value': '7000'},
 {'i_renamed_this': 'http://data.golemlab.eu/data/potter_corpus/dimension/female',
  'value': '1990'},
 {'i_renamed_this': 'http://data.golemlab.eu/data/potter_corpus/dimension/male',
  'value': '1990'},
 {'i_renamed_this': 'http://data.golemlab.eu/data/potter_corpus/dimension/nonbinary',
  'value': '20'},
 {'i_renamed_this': 'http://data.golemlab.eu/data/potter_corpus/dimension/paragraphs',
  'value': '9000'},
 {'i_renamed_this': 'http://data.golemlab.eu/data/potter_corpus/dimension/wordsInComments',
  'value': '20000'},
 {'i_renamed_this': 'http://data.golemlab.eu/data/potter_corpus/dimension/wordsInDocuments',
  'value': '500000'}]
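
How such a mapping could be applied to already simplified rows is sketched below. The function `apply_mapping` and the `CASTS` table are hypothetical; only the keys `key` and `datatype` and the datatype name `"str"` come from the example above, the other datatype names are assumptions.

```python
# Datatype names other than "str" are assumed for illustration.
CASTS = {"str": str, "int": int, "float": float, "bool": bool}


def apply_mapping(rows, mapping):
    """Rename keys and cast values of simplified rows according to the mapping."""
    mapped = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            rule = mapping.get(key, {})
            if "datatype" in rule:
                value = CASTS[rule["datatype"]](value)
            new_row[rule.get("key", key)] = value
        mapped.append(new_row)
    return mapped


rows = [
    {"dimensionURI": "http://data.golemlab.eu/data/potter_corpus/dimension/chapters",
     "value": 500},
]
mapping = {"dimensionURI": {"key": "i_renamed_this"}, "value": {"datatype": "str"}}
print(apply_mapping(rows, mapping))
```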

Sidenote: One idea for developing this further would be to allow a function in the mapping that is applied to the item in question, e.g. to manipulate the value of "i_renamed_this" (`split("/")[-1:][0]` to get only the last part after the final slash). Something similar currently has to be done separately in `get_metrics` in `corpus.py`.
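
The idea from the sidenote could look roughly like this. It is a sketch of a possible extension, not existing functionality: the mapping key `transform` and the functions `last_segment` and `apply_transforms` are hypothetical.

```python
def last_segment(uri):
    """Return the part of a URI after the final slash."""
    return uri.split("/")[-1]


def apply_transforms(rows, mapping):
    """Run the optional "transform" callable of each mapping rule on its value."""
    result = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            transform = mapping.get(key, {}).get("transform")
            new_row[key] = transform(value) if transform else value
        result.append(new_row)
    return result


rows = [
    {"dimensionURI": "http://data.golemlab.eu/data/potter_corpus/dimension/chapters",
     "value": 500},
    {"dimensionURI": "http://data.golemlab.eu/data/potter_corpus/dimension/male",
     "value": 1990},
]
mapping = {"dimensionURI": {"transform": last_segment}}

short = apply_transforms(rows, mapping)
# Collapse the rows into a dictionary keyed by the shortened dimension name.
metrics = {row["dimensionURI"]: row["value"] for row in short}
print(metrics)
```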

## Entity Classes
While a minimal implementation of the API could rely only on the functionality described above, there are dedicated classes for the main entities that can be used to create RDF data and/or to fetch data from the triple store with pre-defined SPARQL queries.

### Corpora
The class `Corpora` in the module `corpora.py` allows retrieving data on all corpora in the system.

The class is not fully developed; currently it is used in `api.py` for the `/corpora` and `/corpora/{id}` endpoints. Metadata on the whole collection could go there. Once the corpora have been loaded, single corpora can also be accessed programmatically from the `Corpora` instance without having to instantiate them individually (see the example at the end of this section).

No functionality to generate RDF data has been implemented with this class.

In [26]:
from corpora import Corpora

In [27]:
# it is necessary to pass a database connection if da
golem_corpora = Corpora(database=virtuoso)

In [28]:
# Corpora are not loaded automatically, listing returns an empty list
golem_corpora.list_corpora()

[]

If corpus data is available in the triple store, it can be loaded into the `Corpora` instance:

In [29]:
golem_corpora.load()

In [30]:
# just output the URIs of corpora
golem_corpora.get_uris()

['http://data.golemlab.eu/data/potter_corpus']

In [31]:
# if used with the testdata, there should be one corpus with the id "potter_corpus"
golem_corpora.list_corpora()

[{'id': 'potter_corpus',
  'uri': 'http://data.golemlab.eu/data/potter_corpus',
  'corpusName': 'Harry Potter Corpus',
  'acronym': 'potter',
  'corpusDescription': 'Harry Potter Corpus derived form AO3.',
  'licence': 'CC0',
  'licenceUrl': 'https://creativecommons.org/publicdomain/zero/1.0'}]

In [32]:
# Metrics of the corpora can be included
golem_corpora.list_corpora(include_metrics=True)

[{'id': 'potter_corpus',
  'uri': 'http://data.golemlab.eu/data/potter_corpus',
  'corpusName': 'Harry Potter Corpus',
  'acronym': 'potter',
  'corpusDescription': 'Harry Potter Corpus derived form AO3.',
  'licence': 'CC0',
  'licenceUrl': 'https://creativecommons.org/publicdomain/zero/1.0',
  'metrics': {'chapters': 500,
   'characters': 4000,
   'comments': 7000,
   'female': 1990,
   'male': 1990,
   'nonbinary': 20,
   'paragraphs': 9000,
   'wordsInComments': 20000,
   'wordsInDocuments': 500000}}]

The single corpora are stored as instances of the class `Corpus` inside the corpora instance and thus can be accessed as such:

In [33]:
# it is a dictionary with corpus id as keys and an instance of class Corpus
golem_corpora.corpora

{'potter_corpus': <corpus.Corpus at 0x104fc9220>}

In [34]:
golem_corpora.corpora["potter_corpus"]

<corpus.Corpus at 0x104fc9220>

In [35]:
# e.g. get the description of the potter corpus
golem_corpora.corpora["potter_corpus"].description

'Harry Potter Corpus derived form AO3.'

In [36]:
# get the metrics of this corpus
golem_corpora.corpora["potter_corpus"].get_metrics()

{'chapters': 500,
 'characters': 4000,
 'comments': 7000,
 'female': 1990,
 'male': 1990,
 'nonbinary': 20,
 'paragraphs': 9000,
 'wordsInComments': 20000,
 'wordsInDocuments': 500000}

In [37]:
# it would be possible to iterate over the corpora; there is only one, so the example is not great..
for key in golem_corpora.corpora.keys():
    print(golem_corpora.corpora[key].name) 

Harry Potter Corpus


### Corpus
Class `Corpus` in `corpus.py`.

The class has functionality to generate RDF data, as demonstrated in the notebook `generate_testdata.ipynb`. Corpus data can also be loaded from the triple store.

In [38]:
from corpus import Corpus

In [39]:
# a database connection and a URI need to be passed, otherwise it will be an empty instance
harry_potter_corpus = Corpus(database=virtuoso, uri="http://data.golemlab.eu/data/potter_corpus")

In [40]:
# obviously, the URI is available
harry_potter_corpus.uri

'http://data.golemlab.eu/data/potter_corpus'

The URI is available from the start, as is the ID; the rest may need to be fetched on the fly. If the data is not there yet, the methods will try to fetch it from the triple store. A user won't see the difference.
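
This "fetch on first access" behaviour can be illustrated with a property that caches its value. The class `LazyEntity` and the function `fake_fetch` are hypothetical stand-ins; in the real `Corpus` class the lookup would be a SPARQL query against the triple store, and its internals may differ from this sketch.

```python
class LazyEntity:
    """Hypothetical entity that fetches its name on first access and caches it."""

    def __init__(self, uri, fetch_name):
        self.uri = uri
        self._fetch_name = fetch_name  # stand-in for a SPARQL lookup
        self._name = None

    @property
    def name(self):
        if self._name is None:
            self._name = self._fetch_name(self.uri)
        return self._name


calls = []


def fake_fetch(uri):
    """Record the call and return a fixed name instead of querying a database."""
    calls.append(uri)
    return "Harry Potter Corpus"


corpus = LazyEntity("http://data.golemlab.eu/data/potter_corpus", fake_fetch)
print(corpus.name)  # triggers the lookup
print(corpus.name)  # served from the cached value
print(len(calls))   # number of lookups performed
```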

In [41]:
harry_potter_corpus.id

'potter_corpus'

In [42]:
harry_potter_corpus.name

'Harry Potter Corpus'

In [43]:
# get_metadata will try to fetch the fields according to the schema in schemas.py
harry_potter_corpus.get_metadata()

{'id': 'potter_corpus',
 'uri': 'http://data.golemlab.eu/data/potter_corpus',
 'corpusName': 'Harry Potter Corpus',
 'acronym': 'potter',
 'corpusDescription': 'Harry Potter Corpus derived form AO3.',
 'licence': 'CC0',
 'licenceUrl': 'https://creativecommons.org/publicdomain/zero/1.0'}

In [44]:
# the data can be actively validated, if something is not as it should be, an exception will be raised
harry_potter_corpus.get_metadata(validation=True)

{'id': 'potter_corpus',
 'uri': 'http://data.golemlab.eu/data/potter_corpus',
 'corpusName': 'Harry Potter Corpus',
 'acronym': 'potter',
 'corpusDescription': 'Harry Potter Corpus derived form AO3.',
 'licence': 'CC0',
 'licenceUrl': 'https://creativecommons.org/publicdomain/zero/1.0'}
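
What such a validation step does can be sketched as a check of required fields and their types. The field list below is an assumption for illustration; the actual schema is defined in `schemas.py` and may be stricter or use a different mechanism, and the function `validate_metadata` is hypothetical.

```python
# Assumed required fields and types; the real schema lives in schemas.py.
REQUIRED_FIELDS = {"id": str, "uri": str, "corpusName": str}


def validate_metadata(metadata, required=REQUIRED_FIELDS):
    """Return the metadata unchanged, or raise ValueError on the first problem."""
    for field, expected_type in required.items():
        if field not in metadata:
            raise ValueError(f"missing field: {field}")
        if not isinstance(metadata[field], expected_type):
            raise ValueError(f"field {field} is not of type {expected_type.__name__}")
    return metadata


good = {
    "id": "potter_corpus",
    "uri": "http://data.golemlab.eu/data/potter_corpus",
    "corpusName": "Harry Potter Corpus",
}
print(validate_metadata(good) == good)

try:
    validate_metadata({"id": "potter_corpus"})
except ValueError as error:
    print(error)
```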

In [45]:
harry_potter_corpus.get_metadata(validation=True, include_metrics=True)

{'id': 'potter_corpus',
 'uri': 'http://data.golemlab.eu/data/potter_corpus',
 'corpusName': 'Harry Potter Corpus',
 'acronym': 'potter',
 'corpusDescription': 'Harry Potter Corpus derived form AO3.',
 'licence': 'CC0',
 'licenceUrl': 'https://creativecommons.org/publicdomain/zero/1.0',
 'metrics': {'chapters': 500,
  'characters': 4000,
  'comments': 7000,
  'female': 1990,
  'male': 1990,
  'nonbinary': 20,
  'paragraphs': 9000,
  'wordsInComments': 20000,
  'wordsInDocuments': 500000}}

There is a method to get the URIs of characters. (More functionality to retrieve data has not been implemented yet.)

In [46]:
harry_potter_corpus.get_character_uris()

['http://data.golemlab.eu/data/C000000001',
 'http://data.golemlab.eu/data/C000000002',
 'http://data.golemlab.eu/data/C000000003']

In [47]:
# get some data on characters
harry_potter_corpus.get_characters()

[{'uri': 'http://data.golemlab.eu/data/C000000001',
  'id': 'C000000001',
  'characterName': 'Harry Potter'},
 {'uri': 'http://data.golemlab.eu/data/C000000002',
  'id': 'C000000002',
  'characterName': 'Hermione Granger'},
 {'uri': 'http://data.golemlab.eu/data/C000000003',
  'id': 'C000000003',
  'characterName': 'Harry Potter'}]

In [48]:
# Characters can be included in the output of get_metadata(). Stored characters are not used; they are fetched with a SPARQL query.
harry_potter_corpus.get_metadata(include_characters=True)

{'id': 'potter_corpus',
 'uri': 'http://data.golemlab.eu/data/potter_corpus',
 'corpusName': 'Harry Potter Corpus',
 'acronym': 'potter',
 'corpusDescription': 'Harry Potter Corpus derived form AO3.',
 'licence': 'CC0',
 'licenceUrl': 'https://creativecommons.org/publicdomain/zero/1.0',
 'characters': [{'uri': 'http://data.golemlab.eu/data/C000000001',
   'id': 'C000000001',
   'characterName': 'Harry Potter'},
  {'uri': 'http://data.golemlab.eu/data/C000000002',
   'id': 'C000000002',
   'characterName': 'Hermione Granger'},
  {'uri': 'http://data.golemlab.eu/data/C000000003',
   'id': 'C000000003',
   'characterName': 'Harry Potter'}]}

In [49]:
# optionally, it is possible to store characters inside the corpus instance
harry_potter_corpus.get_characters(store=True) # will return True
harry_potter_corpus.characters

{'C000000001': <character.Character at 0x104f675e0>,
 'C000000002': <character.Character at 0x104f67dc0>,
 'C000000003': <character.Character at 0x104f67af0>}

### Character
The classes `Character`, `Author` and `Work` were mainly used to generate the data for testing (see the notebook). While `Author` and `Work` have no functionality to fetch data yet, `Character` can already do some very basic things:

In [50]:
from character import Character

In [51]:
harry = Character(database=virtuoso, uri="http://data.golemlab.eu/data/C000000001")

In [52]:
harry.uri

'http://data.golemlab.eu/data/C000000001'

In [53]:
harry.get_id()

'C000000001'

In [54]:
harry.id

'C000000001'