Here is a list of related projects that also offer ETL/data-processing tools.
- Cocoon - Apache Cocoon XML pipeline
- csvkit - Csvkit is a suite of utilities for converting to and working with CSV, the king of tabular file formats.
- Datamash - performs calculations (e.g. sum, count, min, max, skewness, standard deviation) on input files.
- DNB-Conv-Tools - Java conversion tools for MARC, ONIX, MAB, Pica and others
- easyM2R - https://github.com/cKlee/easyM2R
- Librisxl - Tools for conversion of libris.kb.se data
- MABLE - MABLE+ is a Java-based software tool for automatic data and error analysis of library catalogues.
- MABTools - MAB tools created by the Deutsche Nationalbibliothek
- MARCEdit - http://marcedit.reeset.net/
- MARCgrep.pl - MARCgrep.pl is a Perl script to filter or count bibliographic records based on condition built upon tag name, indicators, subfield, field value (or tag, positions, value for control fields 00x).
- marc2rdf - https://github.com/digibib/marc2rdf (uses JSON mappings)
- MARCspec - http://cklee.github.io/marc-spec/marc-spec.html (mapping language for MARC)
- marctools - https://github.com/ubleipzig/marctools (various MARC command line utilities)
- MARiMbA - is a command-line tool, designed with librarians in mind, to transform MARC (MAchine-Readable Cataloging) records to RDF
- miller - is like sed, awk, cut, join, and sort for name-indexed data such as CSV and tabular JSON
- pymarc - pymarc is a python library for working with bibliographic data encoded in MARC21
- rml - RML Generic Mapping Language (RDF)
- solrmarc - https://code.google.com/p/solrmarc/
- TARQL - a SPARQL-based data mapping language to convert CSV, XML, JSON to RDF
- Traject - an easy to use, high-performance, flexible and extensible MARC to Solr indexer.
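Most of the MARC tools above (e.g. MARCgrep.pl, pymarc, marctools) share a common pattern: iterate over records, match fields by tag and subfield, and filter or transform the matches. A minimal sketch of that pattern in plain Python, using an invented simplified dict representation instead of real MARC21 parsing (which the listed tools handle):

```python
# Simplified records: each maps a MARC tag to a list of subfield dicts.
# Real tools such as pymarc or MARCgrep.pl parse actual MARC21/MARCXML;
# this sketch only illustrates the filter-by-field pattern they implement.
records = [
    {"245": [{"a": "Semantic Web for the Working Ontologist"}],
     "041": [{"a": "eng"}]},
    {"245": [{"a": "Bibliotheksdatenformate"}],
     "041": [{"a": "ger"}]},
]

def matches(record, tag, code, value):
    """Return True if any subfield `code` of field `tag` equals `value`."""
    return any(sf.get(code) == value for sf in record.get(tag, []))

# Keep only German-language records, mirroring a MARCgrep-style condition
# on field 041 (language code), subfield a.
german = [r for r in records if matches(r, "041", "a", "ger")]
```

The same select-by-tag/subfield condition is what a MARCgrep.pl invocation or a pymarc `record.get_fields()` loop expresses against real MARC input.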
- Akara - Akara is a platform for developing data services available on the Web, using REST architecture. Akara is open source software written in Python and C
- App::RecordStream (recs) - A system for command-line analysis of data.
- ATTX - Putting Linked Data to Work (University of Helsinki)
- bibcat - Engineering toolkit for building semantic web and bibliographic applications
- Conduit - Haskell framework for dealing with streaming data
- COMSODE - an SME-driven RTD project aimed at advancing capabilities in the field of Open Data re-use.
- d:swarm - data management platform for enrichment, normalization and linkage of knowledge data structures.
- ETL::Yertl - ETL with a Shell
- Heiðrún - Heiðrún is the DPLA metadata ingestion and QA system, and is an implementation of the Krikri Rails engine.
- KNIME - Open source Analytics Platform
- Krikri - DPLA Ruby on Rails engine for metadata aggregation, enhancement and quality control.
- Luwak - A Lucene extension to search data streams.
- Metadata Interoperability Framework (MIF) - http://elag2014.org/programme/elag-workshops-list-page/11-5/ (PPT slides)
- MINT - Metadata Interoperability Services
- Meresco - Under the Meresco name Dutch public institutions share quality software components related to metadata management and search.
- metafacture - used in culturegraph
- Metadata Services Toolkit - part of the eXtensible Catalog (XC)
- Metadata & Object Repository (MoRe)
- MUPD8 - Data stream processing from WalmartLabs.
- OpenRefine - (formerly Google Refine) a toolkit to work with tabular data.
- Petl - Python ETL library
- Pig - Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
- Ratchet - A library for performing data pipeline / ETL tasks in Go.
- REPOX - Data Aggregation and Interoperability Manager
- Samza - Apache Samza is a distributed stream processing framework.
- Silk - The Silk framework provides a declarative language for specifying which types of RDF links should be discovered between data sources as well as which conditions data items must fulfil in order to be interlinked.
- Spark Streaming - Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
- Storm - Apache Storm is a distributed stream processing framework.
- Strukt - The most interactive way to work with all kinds of tabular data
- Supplejack - Supplejack was designed to provide assurance of the quality of data management activities when working at scale.
- TeePee - Command line tool to extract data from structures
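Several of the tabular tools above (csvkit, Datamash, miller, petl, OpenRefine) center on the same core operation: group rows by a key column and aggregate another column. A stdlib-only Python sketch of that group-and-sum operation (the CSV content is invented for illustration; the listed tools would read it from a file or stdin):

```python
import csv
import io
from collections import defaultdict

# Invented sample data standing in for a CSV file.
data = """dept,amount
books,10
books,5
media,7
"""

# Group by the "dept" column and sum the "amount" column --
# the kind of job Datamash's groupby/sum or miller's stats1 verb perform.
totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(data)):
    totals[row["dept"]] += float(row["amount"])

print(dict(totals))  # → {'books': 15.0, 'media': 7.0}
```

The one-liner equivalents in the listed CLI tools differ only in syntax; the group-by-key-then-aggregate pipeline is the same.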