Harvester for RDF documents based on the BRegDCAT specification.
This project is divided into two distinct modules:

- The `api` folder contains a Flask-based HTTP API that manages asynchronous harvest jobs.
- The `app` folder contains a React-based Web application that serves as the user entry point to the API.
Main features:
- Collect machine-readable files described in RDF/XML, Turtle, N-Triples or JSON-LD; each collection run is known as a harvest job.
- Schedule harvest jobs to run periodically.
- Perform simple faceted search on the collected RDF data using BRegDCAT-AP v2 facets/categories.
Each harvest job consists of three phases:
- Retrieve the RDF documents from the remote sources.
- Validate the shapes in the RDF documents using the ISA2 Interoperability Test Bed SHACL Validator.
- Merge the data and update the graph in the triple store.
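The three phases above can be sketched as a small pipeline. The function names and return shapes below are illustrative stand-ins, not the project's actual internals:

```python
def retrieve(source):
    """Phase 1: fetch one remote RDF document (stubbed here).

    The real project downloads and parses the document (e.g. with rdflib).
    """
    uri, fmt = source
    return {"uri": uri, "format": fmt}


def validate(doc):
    """Phase 2: stand-in for the ISA2 Test Bed SHACL Validator call."""
    return True


def merge(docs):
    """Phase 3: stand-in for merging the data and updating the triple store."""
    return {"num_sources": len(docs)}


def run_harvest(sources):
    """Run the three phases in order over a list of (URI, format) pairs."""
    docs = [retrieve(s) for s in sources]
    valid = [d for d in docs if validate(d)]
    return merge(valid)


result = run_harvest([("https://example.com/a.xml", "xml"),
                      ("https://example.com/b.ttl", "turtle")])
print(result)  # {'num_sources': 2}
```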
The following diagram presents a high-level view of the architecture and a typical usage flow. The user first sets the periodic harvest interval or enqueues a manual job using the Web application; the serialized jobs are kept in an in-memory Redis data store. A Queue Worker observes the Redis store, pulling and executing jobs as they become available (jobs are executed one at a time, never in parallel). Finally, the results of each job execution are persisted in the Virtuoso triple store.
Two distinct Docker Compose files are provided for deployment. The first, `docker-compose.yml`, is a self-contained stack that only requires the data sources to be defined externally (see below). The second, `docker-compose-lean.yml`, assumes that the Virtuoso triple store is available externally: it does not define a Virtuoso service in the stack and expects the Virtuoso configuration to be set explicitly.
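For the lean stack, the external Virtuoso configuration could be passed through the environment variables documented below. An illustrative fragment (service layout, hostnames and credentials are placeholders, not the actual contents of `docker-compose-lean.yml`):

```yaml
# Illustrative override (placeholder values)
services:
  api:
    environment:
      HARVESTER_SPARQL_ENDPOINT: "http://external-virtuoso:8890/sparql"
      HARVESTER_SPARQL_UPDATE_ENDPOINT: "http://external-virtuoso:8890/sparql-auth"
      HARVESTER_SPARQL_USER: "dba"
      HARVESTER_SPARQL_PASS: "dba"
```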
Data sources are configured using the `$HARVESTER_SOURCES` environment variable. This variable should contain a JSON-serialized list of lists, where each item holds two sub-items:

- The URI of the data source.
- The format of the data source as defined by rdflib: one of `xml`, `turtle`, `nt` or `json-ld`.
For example:
export HARVESTER_SOURCES='[["https://gist.githubusercontent.com/agmangas/b07a69fd8a4d415c8e3d7a7dff7e41e5/raw/e3d574fdcdd14a11acce566c98486bca3a0f1fa4/breg-sample-01.xml", "xml"], ["https://gist.githubusercontent.com/agmangas/5f737b17ebf97c318e2ca3b4099c4c19/raw/5a1411286eb86a9689230ffcd3052a72fee05d74/breg-sample-02.ttl", "turtle"], ["https://gist.githubusercontent.com/agmangas/6ddc1e3405d9e890c74f2c1daf28c3fc/raw/623c2392276ecd6b86201744e1eecea324b0ef4c/breg-sample-03.json", "json-ld"]]'
The data sources variable is then explicitly injected by the Compose files into the `api` container.
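How the API might decode this variable can be sketched with a short, hypothetical helper (the project's real parsing code may differ):

```python
import json
import os

# Formats accepted by rdflib, as listed above.
VALID_FORMATS = {"xml", "turtle", "nt", "json-ld"}


def parse_sources(raw):
    """Decode the JSON-serialized list of [URI, format] pairs and
    check each format against the accepted values."""
    sources = json.loads(raw)
    for uri, fmt in sources:
        if fmt not in VALID_FORMATS:
            raise ValueError(f"Unsupported format '{fmt}' for {uri}")
    return sources


raw = os.environ.get(
    "HARVESTER_SOURCES",
    '[["https://example.com/sample.xml", "xml"]]')
print(parse_sources(raw))
```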
Once the data sources have been defined in the environment, a new self-contained stack may be deployed with the following command:
$ docker-compose up -d --build
...
Creating breg-harvester_virtuoso_1 ... done
Creating breg-harvester_redis_1 ... done
Creating breg-harvester_api_1 ... done
Creating breg-harvester_worker_1 ... done
The Web application will then be available on port 9090.
| Variable | Default | Description |
|---|---|---|
| `HARVESTER_LOG_LEVEL` | `info` | Log level of the API logger. |
| `HARVESTER_REDIS` | `redis://redis` | Redis URL for the jobs queue. |
| `HARVESTER_SPARQL_ENDPOINT` | `http://virtuoso:8890/sparql` | Virtuoso SPARQL query endpoint. |
| `HARVESTER_SPARQL_UPDATE_ENDPOINT` | `http://virtuoso:8890/sparql-auth` | Virtuoso SPARQL update endpoint. |
| `HARVESTER_GRAPH_URI` | `http://fundacionctic.org/breg-harvester` | Default graph URI. |
| `HARVESTER_SPARQL_USER` | `dba` | User of the Virtuoso triple store. |
| `HARVESTER_SPARQL_PASS` | `dba` | Password for the user of the Virtuoso triple store. |
| `HARVESTER_VALIDATOR_DISABLED` | None | Flag to disable the SHACL validator API. |
| `HARVESTER_RESULT_TTL` | `2592000` (30 days) | Seconds that successful jobs will be kept in Redis. |
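Reading these variables with their defaults can be sketched as follows. This is a hypothetical helper mirroring the table above, not the project's actual configuration code:

```python
import os


def get_config(env=os.environ):
    """Collect the harvester settings, falling back to the defaults
    from the configuration table."""
    return {
        "log_level": env.get("HARVESTER_LOG_LEVEL", "info"),
        "redis_url": env.get("HARVESTER_REDIS", "redis://redis"),
        "sparql_endpoint": env.get(
            "HARVESTER_SPARQL_ENDPOINT", "http://virtuoso:8890/sparql"),
        "sparql_update_endpoint": env.get(
            "HARVESTER_SPARQL_UPDATE_ENDPOINT",
            "http://virtuoso:8890/sparql-auth"),
        "graph_uri": env.get(
            "HARVESTER_GRAPH_URI", "http://fundacionctic.org/breg-harvester"),
        "sparql_user": env.get("HARVESTER_SPARQL_USER", "dba"),
        "sparql_pass": env.get("HARVESTER_SPARQL_PASS", "dba"),
        # The validator is disabled when the flag is set to any value.
        "validator_disabled": env.get("HARVESTER_VALIDATOR_DISABLED") is not None,
        "result_ttl": int(env.get("HARVESTER_RESULT_TTL", "2592000")),
    }


print(get_config(env={}))
```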
Create a new harvest job:
$ curl -X POST http://localhost:9090/api/harvest/
{
"description": "breg_harvester.harvest.run_harvest(graph_uri='http://fundacionctic.org/breg-harvester', sources=[<SourceDataset> Type='DataTypes.XML' URI='https://gist.githubusercontent.c..., store_kwargs={'query_endpoint': 'http://virtuoso:8890/sparql', 'update_endpoint': 'http:..., validator=<breg_harvester.validator.BRegAPIValidator object at 0x7f6b68034b00>)",
"ended_at": null,
"enqueued_at": "2020-09-28T07:20:40.700335",
"exc_info": null,
"job_id": "5c47a2e3-19ad-49f8-baff-547d93e9b738",
"result": null,
"started_at": null,
"status": "queued"
}
Fetch the current status of a previously created harvest job using the job ID provided in the POST response:
$ curl -X GET http://localhost:9090/api/harvest/5c47a2e3-19ad-49f8-baff-547d93e9b738
{
"description": "breg_harvester.harvest.run_harvest(graph_uri='http://fundacionctic.org/breg-harvester', sources=[<SourceDataset> Type='DataTypes.XML' URI='https://gist.githubusercontent.c..., store_kwargs={'query_endpoint': 'http://virtuoso:8890/sparql', 'update_endpoint': 'http:..., validator=<breg_harvester.validator.BRegAPIValidator object at 0x7f6b68034b00>)",
"ended_at": "2020-09-28T07:20:43.988582",
"enqueued_at": "2020-09-28T07:20:40.700335",
"exc_info": null,
"job_id": "5c47a2e3-19ad-49f8-baff-547d93e9b738",
"result": {
"num_triples": 33,
"sources": [{
"data_type": "xml",
"format": "xml",
"mime": "application/rdf+xml",
"uri": "https://gist.githubusercontent.com/agmangas/b07a69fd8a4d415c8e3d7a7dff7e41e5/raw/e3d574fdcdd14a11acce566c98486bca3a0f1fa4/breg-sample-01.xml"
}, {
"data_type": "turtle",
"format": "turtle",
"mime": "text/turtle",
"uri": "https://gist.githubusercontent.com/agmangas/5f737b17ebf97c318e2ca3b4099c4c19/raw/5a1411286eb86a9689230ffcd3052a72fee05d74/breg-sample-02.ttl"
}, {
"data_type": "json-ld",
"format": "json-ld",
"mime": "application/ld+json",
"uri": "https://gist.githubusercontent.com/agmangas/6ddc1e3405d9e890c74f2c1daf28c3fc/raw/623c2392276ecd6b86201744e1eecea324b0ef4c/breg-sample-03.json"
}]
},
"started_at": "2020-09-28T07:20:41.049328",
"status": "finished"
}
Fetch the list of the most recent jobs grouped by job status:
$ curl -X GET http://localhost:9090/api/harvest/
{
"failed": [],
"finished": [{
"ended_at": "2020-09-28T07:20:43.988582",
"enqueued_at": "2020-09-28T07:20:40.700335",
"job_id": "5c47a2e3-19ad-49f8-baff-547d93e9b738",
"started_at": "2020-09-28T07:20:41.049328",
"status": "finished"
}],
"scheduled": [],
"started": []
}
Fetch the current configuration of the scheduler:
$ curl -X GET http://localhost:9090/api/scheduler/
{
"id": "scheduled-harvester",
"interval_seconds": 432000.0,
"name": "scheduled-harvester",
"next_date": "2020-10-03T07:12:38.329589+00:00"
}
Update the scheduler interval:
$ curl -X POST --header "Content-Type: application/json" --data '{"interval": 1800}' http://localhost:9090/api/scheduler/
{
"id": "scheduled-harvester",
"interval_seconds": 1800.0,
"name": "scheduled-harvester",
"next_date": "2020-09-28T08:15:30.650931+00:00"
}
There are two aspects to consider when adapting the HTTP API and Web app from BRegDCAT-AP v2 to a different RDF specification:

- Modules related to the harvest logic are mostly decoupled from the BRegDCAT-AP v2 specification. Arbitrary RDF documents can be harvested if the validator is disabled using the `HARVESTER_VALIDATOR_DISABLED` configuration variable. Alternatively, a new validator implementation could replace the current BRegDCAT-AP v2 validator.
- Modules related to the faceted search are strongly coupled to the BRegDCAT-AP v2 specification and as such would require a significant amount of work to adapt.
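Under the replacement approach, a new validator only needs to mimic whatever interface the harvester expects of it. The sketch below assumes a single validation method; it is not the actual interface of `BRegAPIValidator`:

```python
class AcceptAllValidator:
    """Hypothetical drop-in validator that accepts every document.

    A real replacement would instead check each document against the
    SHACL shapes of the target RDF specification.
    """

    def validate(self, data, data_type):
        # Always report the document as valid.
        return True


validator = AcceptAllValidator()
print(validator.validate("<rdf/>", "xml"))  # True
```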