Related Projects

Patrick Hochstenbach edited this page Aug 1, 2018 · 66 revisions

Here we list related projects that also provide ETL/data processing tools.

selected formats

  • Cocoon - Apache Cocoon XML pipeline
  • csvkit - Csvkit is a suite of utilities for converting to and working with CSV, the king of tabular file formats.
  • Datamash - performs calculations (e.g. sum, count, min, max, skewness, standard deviation) on input files.
  • DNB-Conv-Tools - Java conversion tools for MARC, ONIX, MAB, Pica and others
  • easyM2R -
  • ETL-Yertl
  • jq
  • Librisxl - Tools for conversion of data
  • MABLE - MABLE+ is a Java-based software tool for automatic data and error analysis of library catalogues.
  • MABTools - MAB tools created by the Deutsche Nationalbibliothek (German National Library)
  • MARCEdit -
  • - is a Perl script to filter or count bibliographic records based on conditions built upon tag name, indicators, subfield, and field value (or tag, positions, and value for control fields 00x).
  • marc2rdf - (uses JSON mappings)
  • MARCspec - (mapping language for MARC)
  • marctools - (various MARC command line utilities)
  • MARiMbA - is a command-line tool, designed with librarians in mind, to transform MARC (MAchine-Readable Cataloging) records to RDF
  • miller - is like sed, awk, cut, join, and sort for name-indexed data such as CSV and tabular JSON
  • pymarc - a Python library for working with bibliographic data encoded in MARC21
  • rml - RML Generic Mapping Language (RDF)
  • solrmarc -
  • TARQL - a SPARQL-based data mapping language to convert CSV, XML, JSON to RDF
  • Traject - an easy to use, high-performance, flexible and extensible MARC to Solr indexer.
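The tabular tools above (csvkit, Datamash, miller) automate the kind of column aggregation that can be sketched in a few lines of plain Python. A minimal illustration with hypothetical CSV data and column names, using only the standard library:

```python
import csv
import io
import statistics

# Hypothetical CSV input of the kind csvkit or Datamash would process.
data = io.StringIO("title,pages\nA,100\nB,250\nC,175\n")

pages = [int(row["pages"]) for row in csv.DictReader(data)]

# Summary statistics of the kind Datamash computes on the command line.
total = sum(pages)
lo, hi = min(pages), max(pages)
stdev = statistics.stdev(pages)  # sample standard deviation
print(total, lo, hi, stdev)
```

The dedicated tools add what this sketch lacks: type inference, grouping, streaming over large files, and a uniform command-line interface.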

general frameworks

  • Akara - Akara is a platform for developing data services available on the Web, using REST architecture. Akara is open source software written in Python and C
  • App::RecordStream (recs) - a system for command-line analysis of data.
  • ATTX - Putting Linked Data to Work (University of Helsinki)
  • bibcat - Engineering toolkit for building semantic web and bibliographic applications
  • Conduit - Haskell framework for dealing with streaming data
  • COMSODE - an SME-driven RTD project aimed at advancing capabilities in the field of Open Data re-use.
  • DNet
  • d:swarm - data management platform for enrichment, normalization and linkage of knowledge data structures.
  • ETL::Yertl - ETL with a Shell
  • Heiðrún - Heiðrún is the DPLA metadata ingestion and QA system, and is an implementation of the Krikri Rails engine.
  • JAQL - Query Language for JavaScript Object Notation (JSON)
  • KNIME - Open source Analytics Platform
  • Krikri - DPLA Ruby on Rails engine for metadata aggregation, enhancement and quality control.
  • Luwak - A Lucene extension to search data streams.
  • Metadata Interoperability Framework (MIF) - PPT
  • MINT - Metadata Interoperability Services
  • Meresco - Under the Meresco name Dutch public institutions share quality software components related to metadata management and search.
  • Metacrunch
  • metafacture - used in culturegraph
  • Metadata Services Toolkit - part of the eXtensible Catalog (XC)
  • Metadata & Object Repository (MoRe)
  • MUPD8 - Data stream processing from WalmartLabs.
  • OpenRefine - (formerly Google Refine) a toolkit to work with tabular data.
  • Petl - Python ETL library
  • Pig - Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
  • Ratchet - A library for performing data pipeline / ETL tasks in Go.
  • REPOX - Data Aggregation and Interoperability Manager
  • Samza - Apache Samza is a distributed stream processing framework.
  • Silk - The Silk framework provides a declarative language for specifying which types of RDF links should be discovered between data sources as well as which conditions data items must fulfil in order to be interlinked.
  • Spark Streaming - Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
  • Storm - Apache Storm is a distributed stream processing framework.
  • Strukt - The most interactive way to work with all kinds of tabular data
  • Supplejack - Supplejack was designed to provide quality assurance for data management activities at scale.
  • TeePee - Command line tool to extract data from structures
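Most of the frameworks in this list (Conduit, metafacture, Samza, Spark Streaming, Petl) build on the same extract-transform-load pipeline idea. A minimal sketch of that pattern in plain Python generators, with hypothetical records and field names:

```python
# Sketch of the extract -> transform -> load pattern shared by the
# streaming ETL frameworks listed above; records here are hypothetical.

def extract(records):
    # Source stage: yield raw records one at a time.
    yield from records

def transform(stream):
    # Normalization stage: uppercase the title of each record.
    for record in stream:
        yield {**record, "title": record["title"].upper()}

def load(stream):
    # Sink stage: collect the processed records.
    return list(stream)

raw = [{"title": "moby dick"}, {"title": "ulysses"}]
result = load(transform(extract(raw)))
print(result)
```

Because each stage is lazy, records flow through one at a time; the real frameworks add what this sketch omits, such as parallelism, fault tolerance, and back-pressure.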