Datahub-Factory-Pipelines

This repository contains several example pipeline configurations for the Datahub::Factory application. You can use them as boilerplate when creating your own pipelines.

Pipelines

A pipeline is a connection between two systems that is used to exchange records. A pipeline has three functions:

  • Fetch data from a source.
  • Transform the data. This entails manipulating the structure of the data as well as the format in which the data will be delivered to a destination.
  • Push transformed data to a destination.
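
As a rough illustration of these three functions, the sketch below chains a fetch, a transform and a push step in Python. It is not how Datahub::Factory is implemented (that is a Perl application built on Catmandu); the function names, the line-delimited JSON source and the `rename_title` fix are all assumptions made for the example.

```python
import json
from typing import Callable, Iterable, Iterator

Record = dict


def fetch(lines: Iterable[str]) -> Iterator[Record]:
    """Fetch: read one JSON record per non-empty line from a source."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def transform(records: Iterable[Record],
              fix: Callable[[Record], Record]) -> Iterator[Record]:
    """Transform: apply a fix function to every record."""
    for record in records:
        yield fix(record)


def push(records: Iterable[Record], destination: list) -> int:
    """Push: deliver records to a destination (here, a plain list)."""
    count = 0
    for record in records:
        destination.append(record)
        count += 1
    return count


def rename_title(record: Record) -> Record:
    """Hypothetical fix: move the 'name' field to 'title'."""
    fixed = dict(record)
    if "name" in fixed:
        fixed["title"] = fixed.pop("name")
    return fixed


def run_pipeline(lines: Iterable[str],
                 fix: Callable[[Record], Record],
                 destination: list) -> int:
    """Run fetch, transform and push in sequence; return the record count."""
    return push(transform(fetch(lines), fix), destination)
```

Because each stage is a generator, records stream through one at a time instead of being loaded into memory all at once.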

Pipeline configuration

The definition of a pipeline is contained in a configuration file that uses the Config::Simple syntax. A configuration file defines three 'plugin' types, each with its associated configuration in a distinct block:

  • Importer. Defines the source of the data and how to fetch it.
  • Fixer. Defines the Catmandu::Fix logic that will transform the data.
  • Exporter. Defines the destination for the data and how to access it.

Note that each plugin instance needs to be referenced explicitly in a global plugin block (such as [Importer]). Each instance gets its own dedicated plugin definition (such as [plugin_importer_JSON]).

A very minimal configuration would look like this:

[General]
id_path = 'id'

[Importer]
plugin = JSON

[plugin_importer_JSON]
file_name = './data/bar.json'

[Fixer]
plugin = Fix

[plugin_fixer_Fix]
file_name = './fixes/empty.fix'

[Exporter]
plugin = YAML

[plugin_exporter_YAML]
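
The relationship between the global blocks and the per-plugin blocks can be made explicit with a small Python sketch. It reads the minimal configuration above with the standard `configparser` module and resolves each `plugin` reference to its `plugin_<type>_<name>` block. This is only an illustration of the naming convention seen in the example: `resolve_plugins` is a hypothetical helper, and `configparser` is not Config::Simple, so quoting and other syntax details may differ from what Datahub::Factory accepts.

```python
import configparser

MINIMAL_CONFIG = """
[General]
id_path = 'id'

[Importer]
plugin = JSON

[plugin_importer_JSON]
file_name = './data/bar.json'

[Fixer]
plugin = Fix

[plugin_fixer_Fix]
file_name = './fixes/empty.fix'

[Exporter]
plugin = YAML

[plugin_exporter_YAML]
"""


def resolve_plugins(text: str) -> dict:
    """Map each global block to the plugin block it references."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(text)

    resolved = {}
    for role in ("Importer", "Fixer", "Exporter"):
        name = parser[role]["plugin"]
        section = f"plugin_{role.lower()}_{name}"
        if not parser.has_section(section):
            raise KeyError(f"missing block [{section}]")
        resolved[role] = {
            "plugin": name,
            "section": section,
            # Strip the surrounding quotes that configparser keeps verbatim.
            "options": {k: v.strip("'\"") for k, v in parser[section].items()},
        }
    return resolved
```

Calling `resolve_plugins(MINIMAL_CONFIG)` raises a `KeyError` if a referenced plugin has no block of its own, which is the kind of mistake the note above warns about.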

Conditions

The pipeline also accepts Fixer "conditions". Because a collection can contain multiple sets of records that each require different transformations, conditions let you determine which Fix is applied to a given record based on a value found in that record.

A condition would look like this:

[Fixer]
plugin = Fix

[plugin_fixer_Fix]
condition_path = "_metadata.institution\\.name.value"
fixers = Foo, Bar

[plugin_fixer_Foo]
condition = 'Museum of Foo'
file_name = '/home/foobar/fixes/foo.fix'

[plugin_fixer_Bar]
condition = 'Museum of Bar'
file_name = '/home/foobar/fixes/bar.fix'
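
The selection logic behind the configuration above can be sketched in Python: look up a value at a path in the record, then pick the fix file registered for that value. This is an assumption about the general idea, not the Datahub::Factory implementation. The `lookup` and `select_fix` helpers are hypothetical, and the path syntax is simplified to plain dot-separated keys without escaped dots.

```python
from typing import Any, Optional

# Expected value -> fix file, taken from the example configuration.
FIXES_BY_VALUE = {
    "Museum of Foo": "/home/foobar/fixes/foo.fix",
    "Museum of Bar": "/home/foobar/fixes/bar.fix",
}


def lookup(record: dict, path: str) -> Any:
    """Walk a dot-separated path through nested dicts; None if missing."""
    current: Any = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def select_fix(record: dict, path: str,
               fixes_by_value: dict) -> Optional[str]:
    """Return the fix file whose expected value matches the record."""
    value = lookup(record, path)
    if not isinstance(value, str):
        return None  # path missing or not a plain string: no fix applies
    return fixes_by_value.get(value)
```

For a record such as `{"institution": {"name": "Museum of Bar"}}` and the path `"institution.name"`, `select_fix` returns the path to `bar.fix`; a record from an unlisted institution yields `None`.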

Technology

Datahub::Factory is an application based on Catmandu. As such, pipeline configurations are based on Catmandu concepts and terminology.

Authors

Copyright

Copyright 2016, 2019 - PACKED vzw, Vlaamse Kunstcollectie vzw

License

This library is free software; you can redistribute it and/or modify it under the terms of the GPLv3.
