Patrick Hochstenbach edited this page Nov 15, 2018 · 67 revisions

This page lists related projects that also provide ETL and data processing tools.

selected formats

  • Cocoon - Apache Cocoon XML pipeline
  • csvkit - Csvkit is a suite of utilities for converting to and working with CSV, the king of tabular file formats.
  • Datamash - performs calculations (e.g. sum, count, min, max, skewness, standard deviation) on input files.
  • DNB-Conv-Tools - Java conversion tools for MARC, ONIX, MAB, Pica and others
  • easyM2R - https://github.com/cKlee/easyM2R
  • ETL-Yertl
  • jq
  • Librisxl - Tools for conversion of libris.kb.se data
  • MABLE - MABLE+ is a Java-based software tool for automatic data and error analysis of library catalogs.
  • MABTools - MAB tools created by the Deutsche Nationalbibliothek
  • MARCEdit - http://marcedit.reeset.net/
  • MARCgrep.pl - MARCgrep.pl is a Perl script to filter or count bibliographic records based on conditions built from tag name, indicators, subfield, and field value (or tag, positions, and value for control fields 00x).
  • marc2rdf - https://github.com/digibib/marc2rdf (uses JSON mappings such as this)
  • MARCspec - http://cklee.github.io/marc-spec/marc-spec.html (mapping language for MARC)
  • marctools - https://github.com/ubleipzig/marctools (various MARC command line utilities)
  • MARiMbA - a command-line tool, designed with librarians in mind, to transform MARC (MAchine-Readable Cataloging) records to RDF
  • miller - like sed, awk, cut, join, and sort for name-indexed data such as CSV and tabular JSON
  • pymarc - pymarc is a python library for working with bibliographic data encoded in MARC21
  • rml - RML Generic Mapping Language (RDF)
  • solrmarc - https://code.google.com/p/solrmarc/
  • TARQL - a SPARQL-based data mapping language to convert CSV, XML, JSON to RDF
  • Traject - an easy to use, high-performance, flexible and extensible MARC to Solr indexer.

general frameworks

  • Akara - Akara is a platform for developing data services available on the Web, using REST architecture. Akara is open source software written in Python and C
  • App::RecordStream - recs, a system for command-line analysis of data.
  • ATTX - Putting Linked Data to Work (University of Helsinki)
  • bibcat - Engineering toolkit for building semantic web and bibliographic applications
  • Conduit - Haskell framework for dealing with streaming data
  • COMSODE - The project COMSODE is an SME-driven RTD project aimed at progressing the capabilities in the field of Open Data re-use.
  • DNet
  • d:swarm - data management platform for enrichment, normalization and linkage of knowledge data structures.
  • ETL::Yertl - ETL with a Shell
  • Flink - Apache Flink® - Stateful Computations over Data Streams
  • Heiðrún - Heiðrún is the DPLA metadata ingestion and QA system, and is an implementation of the Kri-kri Rails engine.
  • JAQL - Query Language for JavaScript(r) Object Notation (JSON)
  • KNIME - Open source Analytics Platform
  • Krikri - DPLA Ruby on Rails engine for metadata aggregation, enhancement and quality control.
  • Luwak - A Lucene extension to search data streams. See also this blog entry.
  • Metadata Interoperability Framework (MIF) - http://elag2014.org/programme/elag-workshops-list-page/11-5/ PPT
  • MINT - Metadata Interoperability Services
  • Meresco - Under the Meresco name Dutch public institutions share quality software components related to metadata management and search.
  • Metacrunch
  • metafacture - used in culturegraph
  • Metadata Services Toolkit - part of the eXtensible Catalog (XC)
  • Metadata & Object Repository (MoRe)
  • MUPD8 - Data stream processing from WalmartLabs.
  • OpenRefine - (formerly Google Refine) a toolkit to work with tabular data.
  • Petl - Python ETL library
  • Pig - Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
  • Ratchet - A library for performing data pipeline / ETL tasks in Go.
  • REPOX - Data Aggregation and Interoperability Manager
  • Samza - Apache Samza is a distributed stream processing framework.
  • Silk - The Silk framework provides a declarative language for specifying which types of RDF links should be discovered between data sources as well as which conditions data items must fulfil in order to be interlinked.
  • Spark Streaming - Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
  • Storm - Apache Storm is a distributed stream processing framework.
  • Strukt - The most interactive way to work with all kinds of tabular data
  • Supplejack - Supplejack was designed to provide assurance to the quality of data management activities when working at scale.
  • TeePee - Command line tool to extract data from structures