A cross-platform command line tool for parallelised content extraction and analysis.



A cross-platform command line tool for parallelized, distributed content extraction. Built on top of Apache Tika, it was an essential part of the engineering behind the Panama Papers, Swiss Leaks and Luxembourg Leaks investigations.

It supports Redis-backed queueing for distributed, parallel extraction and will write to Solr, plain text files or standard output.
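The queue-and-workers pattern described above can be illustrated with a minimal, self-contained sketch. This is not the tool's actual implementation: an in-memory `LinkedBlockingQueue` stands in for the Redis-backed queue, a placeholder function stands in for Apache Tika parsing, and a map stands in for the Solr, file or standard-output sink. The class name `QueueExtractionSketch` and the method `drain` are hypothetical names chosen for illustration only.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

public class QueueExtractionSketch {

    // Hypothetical helper: drains the queue with `workers` threads, applies
    // `extractor` to each queued path and collects path -> text results.
    public static Map<String, String> drain(BlockingQueue<String> queue,
                                            Function<String, String> extractor,
                                            int workers) throws InterruptedException {
        // Stand-in for the output sink (Solr, text files or standard output).
        Map<String, String> sink = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                String path;
                // poll() returns null once the queue is empty, which ends this worker.
                while ((path = queue.poll()) != null) {
                    sink.put(path, extractor.apply(path));
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return sink;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumption: an in-memory queue replaces the shared Redis queue.
        BlockingQueue<String> queue =
            new LinkedBlockingQueue<>(List.of("a.pdf", "b.docx", "c.txt"));
        // Placeholder extractor; a real one would parse the file contents.
        Map<String, String> out = drain(queue, p -> "text of " + p, 2);
        out.forEach((path, text) -> System.out.println(path + " -> " + text));
    }
}
```

Because every worker pulls from the same queue, each document is processed exactly once regardless of how many workers run. A shared external queue extends the same idea across machines.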

For guidance and instructions, please see the wiki.

Credits and Collaboration

Initially developed by Matthew Caruana Galizia at ICIJ.

We welcome contributions! Please submit pull requests or contact us directly.


Copyright (c) 2018 International Consortium of Investigative Journalists. See LICENSE.
