Harvester for RDF documents based on the BRegDCAT specification.
This project is divided into two distinct modules:

- The `api` folder contains a Flask-based HTTP API that manages asynchronous harvest jobs.
- The `app` folder contains a React-based Web application that serves as the user entry point to the API.
Main features:
- Collect machine-readable files described in RDF/XML, Turtle, N-Triples or JSON-LD; each collection run is known as a harvest job.
- Schedule harvest jobs to run periodically.
- Perform simple faceted search on the collected RDF data using BRegDCAT-AP v2 facets/categories.
Each harvest job consists of three phases:
- Retrieve the RDF documents from the remote sources.
- Validate the shapes in the RDF documents using the ISA2 Interoperability Test Bed SHACL Validator.
- Merge the data and update the graph in the triple store.
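The three phases above can be sketched as a small pipeline. The function names and return shapes below are illustrative stand-ins, not the project's actual internals:

```python
def retrieve(source):
    """Phase 1: fetch one remote RDF document (stubbed here).

    The real project downloads and parses the document (e.g. with rdflib).
    """
    uri, fmt = source
    return {"uri": uri, "format": fmt}


def validate(doc):
    """Phase 2: stand-in for the ISA2 Test Bed SHACL Validator call."""
    return True


def merge(docs):
    """Phase 3: stand-in for merging the data and updating the triple store."""
    return {"num_sources": len(docs)}


def run_harvest(sources):
    """Run the three phases in order over a list of (URI, format) pairs."""
    docs = [retrieve(s) for s in sources]
    valid = [d for d in docs if validate(d)]
    return merge(valid)


result = run_harvest([("https://example.com/a.xml", "xml"),
                      ("https://example.com/b.ttl", "turtle")])
print(result)  # {'num_sources': 2}
```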
The following diagram presents a high-level view of the architecture and a typical usage flow. The user first sets the periodic harvest interval or enqueues a manual job using the Web application; the serialized jobs are kept in an in-memory Redis data store. A Queue Worker observes the Redis store, pulling and executing jobs as they become available (jobs are executed one at a time, never in parallel). Finally, the results of each job execution are persisted in the Virtuoso triple store.
Two distinct Docker Compose files are provided for deployment. The first, `docker-compose.yml`, is a self-contained stack that only requires the data sources to be defined externally (see below). The second, `docker-compose-lean.yml`, assumes that the Virtuoso triple store is available externally: it does not define a Virtuoso service in the stack and expects the Virtuoso configuration to be set explicitly.
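For the lean stack, the external Virtuoso configuration could be passed through the environment variables documented below. An illustrative fragment (service layout, hostnames and credentials are placeholders, not the actual contents of `docker-compose-lean.yml`):

```yaml
# Illustrative override (placeholder values)
services:
  api:
    environment:
      HARVESTER_SPARQL_ENDPOINT: "http://external-virtuoso:8890/sparql"
      HARVESTER_SPARQL_UPDATE_ENDPOINT: "http://external-virtuoso:8890/sparql-auth"
      HARVESTER_SPARQL_USER: "dba"
      HARVESTER_SPARQL_PASS: "dba"
```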
Data sources are configured using the `$HARVESTER_SOURCES` environment variable. This variable should contain a JSON-serialized list of lists, where each item holds two sub-items:

- The URI of the data source.
- The format of the data source as defined by rdflib: one of `xml`, `turtle`, `nt` or `json-ld`.
For example:
export HARVESTER_SOURCES='[["https://gist.githubusercontent.com/agmangas/b07a69fd8a4d415c8e3d7a7dff7e41e5/raw/e3d574fdcdd14a11acce566c98486bca3a0f1fa4/breg-sample-01.xml", "xml"], ["https://gist.githubusercontent.com/agmangas/5f737b17ebf97c318e2ca3b4099c4c19/raw/5a1411286eb86a9689230ffcd3052a72fee05d74/breg-sample-02.ttl", "turtle"], ["https://gist.githubusercontent.com/agmangas/6ddc1e3405d9e890c74f2c1daf28c3fc/raw/623c2392276ecd6b86201744e1eecea324b0ef4c/breg-sample-03.json", "json-ld"]]'
The data sources variable is then explicitly injected by the Compose files into the `api` container.
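How the API might decode this variable can be sketched with a short, hypothetical helper (the project's real parsing code may differ):

```python
import json
import os

# Formats accepted by rdflib, as listed above.
VALID_FORMATS = {"xml", "turtle", "nt", "json-ld"}


def parse_sources(raw):
    """Decode the JSON-serialized list of [URI, format] pairs and
    check each format against the accepted values."""
    sources = json.loads(raw)
    for uri, fmt in sources:
        if fmt not in VALID_FORMATS:
            raise ValueError(f"Unsupported format '{fmt}' for {uri}")
    return sources


raw = os.environ.get(
    "HARVESTER_SOURCES",
    '[["https://example.com/sample.xml", "xml"]]')
print(parse_sources(raw))
```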
Once the data sources have been defined in the environment, a new self-contained stack may be deployed with the following command:
$ docker-compose up -d --build
...
Creating breg-harvester_virtuoso_1 ... done
Creating breg-harvester_redis_1 ... done
Creating breg-harvester_api_1 ... done
Creating breg-harvester_worker_1 ... done
The Web application will then be available on port 9090.
| Variable | Default | Description |
|---|---|---|
| `HARVESTER_LOG_LEVEL` | `info` | Log level of the API logger. |
| `HARVESTER_REDIS` | `redis://redis` | Redis URL for the jobs queue. |
| `HARVESTER_SPARQL_ENDPOINT` | `http://virtuoso:8890/sparql` | Virtuoso SPARQL query endpoint. |
| `HARVESTER_SPARQL_UPDATE_ENDPOINT` | `http://virtuoso:8890/sparql-auth` | Virtuoso SPARQL update endpoint. |
| `HARVESTER_GRAPH_URI` | `http://fundacionctic.org/breg-harvester` | Default graph URI. |
| `HARVESTER_SPARQL_USER` | `dba` | User of the Virtuoso triple store. |
| `HARVESTER_SPARQL_PASS` | `dba` | Password for the user of the Virtuoso triple store. |
| `HARVESTER_VALIDATOR_DISABLED` | None | Flag to disable the SHACL validator API. |
| `HARVESTER_RESULT_TTL` | `2592000` (30 days) | Seconds that successful jobs will be kept in Redis. |
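Reading these variables with their defaults can be sketched as follows. This is a hypothetical helper mirroring the table above, not the project's actual configuration code:

```python
import os


def get_config(env=os.environ):
    """Collect the harvester settings, falling back to the defaults
    from the configuration table."""
    return {
        "log_level": env.get("HARVESTER_LOG_LEVEL", "info"),
        "redis_url": env.get("HARVESTER_REDIS", "redis://redis"),
        "sparql_endpoint": env.get(
            "HARVESTER_SPARQL_ENDPOINT", "http://virtuoso:8890/sparql"),
        "sparql_update_endpoint": env.get(
            "HARVESTER_SPARQL_UPDATE_ENDPOINT",
            "http://virtuoso:8890/sparql-auth"),
        "graph_uri": env.get(
            "HARVESTER_GRAPH_URI", "http://fundacionctic.org/breg-harvester"),
        "sparql_user": env.get("HARVESTER_SPARQL_USER", "dba"),
        "sparql_pass": env.get("HARVESTER_SPARQL_PASS", "dba"),
        # The validator is disabled when the flag is set to any value.
        "validator_disabled": env.get("HARVESTER_VALIDATOR_DISABLED") is not None,
        "result_ttl": int(env.get("HARVESTER_RESULT_TTL", "2592000")),
    }


print(get_config(env={}))
```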
Create a new harvest job:
$ curl -X POST http://localhost:9090/api/harvest/
{
"description": "breg_harvester.harvest.run_harvest(graph_uri='http://fundacionctic.org/breg-harvester', sources=[<SourceDataset> Type='DataTypes.XML' URI='https://gist.githubusercontent.c..., store_kwargs={'query_endpoint': 'http://virtuoso:8890/sparql', 'update_endpoint': 'http:..., validator=<breg_harvester.validator.BRegAPIValidator object at 0x7f6b68034b00>)",
"ended_at": null,
"enqueued_at": "2020-09-28T07:20:40.700335",
"exc_info": null,
"job_id": "5c47a2e3-19ad-49f8-baff-547d93e9b738",
"result": null,
"started_at": null,
"status": "queued"
}
Fetch the current status of a previously created harvest job using the job ID provided in the POST response:
$ curl -X GET http://localhost:9090/api/harvest/5c47a2e3-19ad-49f8-baff-547d93e9b738
{
"description": "breg_harvester.harvest.run_harvest(graph_uri='http://fundacionctic.org/breg-harvester', sources=[<SourceDataset> Type='DataTypes.XML' URI='https://gist.githubusercontent.c..., store_kwargs={'query_endpoint': 'http://virtuoso:8890/sparql', 'update_endpoint': 'http:..., validator=<breg_harvester.validator.BRegAPIValidator object at 0x7f6b68034b00>)",
"ended_at": "2020-09-28T07:20:43.988582",
"enqueued_at": "2020-09-28T07:20:40.700335",
"exc_info": null,
"job_id": "5c47a2e3-19ad-49f8-baff-547d93e9b738",
"result": {
"num_triples": 33,
"sources": [{
"data_type": "xml",
"format": "xml",
"mime": "application/rdf+xml",
"uri": "https://gist.githubusercontent.com/agmangas/b07a69fd8a4d415c8e3d7a7dff7e41e5/raw/e3d574fdcdd14a11acce566c98486bca3a0f1fa4/breg-sample-01.xml"
}, {
"data_type": "turtle",
"format": "turtle",
"mime": "text/turtle",
"uri": "https://gist.githubusercontent.com/agmangas/5f737b17ebf97c318e2ca3b4099c4c19/raw/5a1411286eb86a9689230ffcd3052a72fee05d74/breg-sample-02.ttl"
}, {
"data_type": "json-ld",
"format": "json-ld",
"mime": "application/ld+json",
"uri": "https://gist.githubusercontent.com/agmangas/6ddc1e3405d9e890c74f2c1daf28c3fc/raw/623c2392276ecd6b86201744e1eecea324b0ef4c/breg-sample-03.json"
}]
},
"started_at": "2020-09-28T07:20:41.049328",
"status": "finished"
}
Fetch the list of the most recent jobs grouped by job status:
$ curl -X GET http://localhost:9090/api/harvest/
{
"failed": [],
"finished": [{
"ended_at": "2020-09-28T07:20:43.988582",
"enqueued_at": "2020-09-28T07:20:40.700335",
"job_id": "5c47a2e3-19ad-49f8-baff-547d93e9b738",
"started_at": "2020-09-28T07:20:41.049328",
"status": "finished"
}],
"scheduled": [],
"started": []
}
Fetch the current configuration of the scheduler:
$ curl -X GET http://localhost:9090/api/scheduler/
{
"id": "scheduled-harvester",
"interval_seconds": 432000.0,
"name": "scheduled-harvester",
"next_date": "2020-10-03T07:12:38.329589+00:00"
}
Update the scheduler interval:
$ curl -X POST --header "Content-Type: application/json" --data '{"interval": 1800}' http://localhost:9090/api/scheduler/
{
"id": "scheduled-harvester",
"interval_seconds": 1800.0,
"name": "scheduled-harvester",
"next_date": "2020-09-28T08:15:30.650931+00:00"
}
There are two aspects to consider when adapting the HTTP API and Web app from BRegDCAT-AP v2 to a different RDF specification:

- Modules related to the harvest logic are mostly decoupled from the BRegDCAT-AP v2 specification. Arbitrary RDF documents can be harvested if the validator is disabled using the `HARVESTER_VALIDATOR_DISABLED` configuration variable. Alternatively, a new validator implementation could replace the current BRegDCAT-AP v2 validator.
- Modules related to the faceted search are strongly coupled to the BRegDCAT-AP v2 specification and as such would require a significant amount of work to adapt.
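Under the replacement approach, a new validator only needs to mimic whatever interface the harvester expects of it. The sketch below assumes a single validation method; it is not the actual interface of `BRegAPIValidator`:

```python
class AcceptAllValidator:
    """Hypothetical drop-in validator that accepts every document.

    A real replacement would instead check each document against the
    SHACL shapes of the target RDF specification.
    """

    def validate(self, data, data_type):
        # Always report the document as valid.
        return True


validator = AcceptAllValidator()
print(validator.validate("<rdf/>", "xml"))  # True
```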