Datashare

A self-hosted search engine to find stories in any kind of file.

Datashare is an open‑source, self‑hosted document search and analysis platform built by the International Consortium of Investigative Journalists (ICIJ). It ingests heterogeneous data (PDFs, emails, spreadsheets, images, archives, etc.), extracts text (including via OCR), enriches it with metadata and named entities, and exposes everything through a powerful search UI and REST API. Because Datashare runs on your own machines, you keep full control over sensitive material—no external cloud services required.

📣 Help shape the next content extraction features in Datashare! Please take 10 minutes to fill out our user survey. It directly influences our roadmap and lets you opt in to early previews and beta testing.

Main Features

🔍 Full‑text search: Index & query PDFs, emails, office docs, images, archives, and more.
🖼️ OCR on scans & images: Turn visual text into searchable text.
🧠 Named‑entity extraction: Auto-detect people, orgs, locations, emails, etc.
⭐ Stars & tags: Mark and organize key documents.
🧰 Advanced filters & operators: Combine facets with boolean, wildcard, and fuzzy queries.
🤝 Team/server mode: Multi-user deployment with shared tags and recommendations.
🔌 Plugin architecture: Extend Datashare with custom modules.

Developer Guide

This section explains how to set up a development environment, build the project, run tests, and manage database migrations. It assumes you are comfortable with Java/Maven projects and basic service orchestration.

Requirements

Languages & tooling

JDK 17
Apache Maven 3.8+: primary build tool for the backend
GNU Make (optional but recommended): convenient shortcuts (make dist, make update-db, etc.)
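
As a quick sanity check before building, the sketch below reports whether the tools listed above are on the PATH and compares the installed Maven version against the 3.8 minimum. It is not part of the repository: `version_ge` is a hypothetical helper name, and the comparison relies on `sort -V`, which assumes GNU coreutils or a compatible `sort`.

```shell
#!/bin/sh
# Hypothetical helper (not part of Datashare): succeeds when version $1 >= version $2.
# Relies on `sort -V` (version sort), an assumption about the local `sort`.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report which of the required tools are on the PATH.
for tool in java mvn make; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: not found"
  fi
done

# `mvn -v` prints a line such as "Apache Maven 3.9.6 (...)"; field 3 is the version.
if command -v mvn >/dev/null 2>&1; then
  mvn_version=$(mvn -v 2>/dev/null | awk '/^Apache Maven/ { print $3; exit }')
  if version_ge "$mvn_version" 3.8; then
    echo "Maven $mvn_version satisfies 3.8+"
  else
    echo "Maven $mvn_version is older than 3.8"
  fi
fi
```

The JDK check only confirms that `java` is present. Confirming that it is version 17 is left to `java -version`.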

Services

These services must be running for a complete development environment:

PostgreSQL 13+
- Available on host postgres:5432
- Two DBs expected by default: datashare (dev) and test (tests)
- A role with privileges, e.g. user: test, password: test
Elasticsearch 7.x
- Available on host elasticsearch:9200
- 8.x is not officially supported
Redis 5+
- Available on host redis:6379
- Used to store sessions and orchestrate async tasks.
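
Before building or running tests, it can help to confirm that the three default endpoints above accept connections. The following is a minimal sketch, not a script shipped with Datashare: `check_service` is a hypothetical helper, and it assumes `bash` (for its `/dev/tcp` pseudo-device) and coreutils `timeout` are installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Datashare): try to open a TCP connection
# to an endpoint given as "host:port". The body runs in a subshell, so its
# variables stay local and `exit 1` only sets the function's return status.
check_service() (
  name=$1
  endpoint=$2
  host=${endpoint%%:*}
  port=${endpoint##*:}
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$name: reachable at $host:$port"
  else
    echo "$name: NOT reachable at $host:$port"
    exit 1
  fi
)

# Default endpoints from the list above.
check_service PostgreSQL postgres:5432 || true
check_service Elasticsearch elasticsearch:9200 || true
check_service Redis redis:6379 || true
```

A successful connection only shows that the port is open. It does not verify the service version, the existence of the datashare and test databases, or the credentials.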

Build

The project is modular. Typical steps:

# 1. Validate the build and resolve deps
mvn validate

# 2. Build shared testing utilities (some modules depend on these)
mvn -pl commons-test -am install

# 3. Apply DB migrations so your dev DB schema matches the code
mvn -pl datashare-db liquibase:update

# 4. Build everything (excluding tests)
mvn package -Dmaven.test.skip=true

Run Tests

Datashare has both unit and integration tests. Integration tests expect Postgres, Elasticsearch, and Redis to be reachable.

# Run the whole test suite
mvn test

# Or run a single module
mvn -pl datashare-api test

# Or a single test class
mvn -pl datashare-api -Dtest=org.icij.datashare.PropertiesProviderTest test

Database Migrations

Datashare uses Liquibase to version and apply schema changes.

Apply latest migrations:

make update-db

Start from scratch (danger: drops data):

make reset-db

Adding a new changeset:

1. Create a new XML/YAML changeset under datashare-db/src/main/resources/db/changelog/
2. Reference it in the master changelog file
3. Run make update-db locally to verify
4. Commit both the changeset and updated master file
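
The first two steps could look like the following sketch, which writes a YAML changeset and prints an include entry for the master changelog. Everything specific here is an assumption for illustration: the file name, the changeset `id` and `author`, and the `example_table` table with its `note` column do not come from the Datashare schema. The script writes to a temporary directory unless `CHANGELOG_DIR` is set, and the include entry assumes the master changelog is itself written in YAML. The keys used (`databaseChangeLog`, `changeSet`, `addColumn`, `include`) are standard Liquibase YAML syntax.

```shell
#!/bin/sh
# Target directory: point CHANGELOG_DIR at the changelog folder of a real
# checkout; it defaults to a throwaway directory so the sketch is safe to run.
changelog_dir=${CHANGELOG_DIR:-$(mktemp -d)}
changeset="$changelog_dir/example-add-note-column.yml"  # hypothetical file name

# Step 1: create the changeset (table and column names are placeholders).
cat > "$changeset" <<'EOF'
databaseChangeLog:
  - changeSet:
      id: example-add-note-column
      author: your-name
      changes:
        - addColumn:
            tableName: example_table
            columns:
              - column:
                  name: note
                  type: varchar(255)
EOF

# Step 2: entry to add to the master changelog (printed here, not applied).
cat <<EOF
  - include:
      file: $(basename "$changeset")
      relativeToChangelogFile: true
EOF

echo "Wrote $changeset"
```

After adding the printed entry to the master file by hand, `make update-db` applies the new changeset as described in step 3.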

Frontend

The web UI is built with Vue 3 and maintained in a separate repository. To serve the UI from the backend, you must also build the client and copy its compiled files into the ./app directory. The backend bundles these static assets using FluentHTTP, which serves resources from ./app (relative to the repo root). If this folder is missing or empty, only the API is available and no UI is served.
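
Because a missing or empty ./app directory results in an API-only server, a small pre-flight check can be useful before launching the backend. The sketch below is an illustration only: `ui_assets_present` is a hypothetical helper name, and it checks only that the directory has at least one entry, not that the bundle is complete.

```shell
#!/bin/sh
# Hypothetical helper (not part of Datashare): succeeds when the directory
# exists and contains at least one entry.
ui_assets_present() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Run from the backend repo root; ./app is the directory named in the text above.
if ui_assets_present ./app; then
  echo "UI assets found in ./app"
else
  echo "./app is missing or empty: only the API will be served"
fi
```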

Prerequisites for Frontend Dev

Node.js 20.19+
Yarn 1

Build workflow

Clone & enter the client repo

git clone https://github.com/ICIJ/datashare-client.git
cd datashare-client

Install and build
```
yarn
yarn build
```
The build outputs a production bundle into dist/.

Copy (or symlink) into backend

rm -rf ../datashare/app
mkdir -p ../datashare/app
cp -r dist/* ../datashare/app/
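
The step title mentions symlinking as an alternative to copying, which avoids re-copying after every `yarn build`. A minimal sketch of that variant, assuming the two repositories are checked out side by side as in the commands above; `link_client_build` is a hypothetical helper, not a script shipped with either repository.

```shell
#!/bin/sh
# Hypothetical helper: replace DEST with a symlink to the absolute path of SRC.
# The body runs in a subshell, so `exit 1` only sets the function's return status.
link_client_build() (
  src=$1
  dest=$2
  if [ ! -d "$src" ]; then
    echo "build directory not found: $src" >&2
    exit 1
  fi
  abs_src=$(cd "$src" && pwd)
  rm -rf "$dest"   # same cleanup as the copy variant: removes any previous app directory or link
  ln -s "$abs_src" "$dest"
)

# From inside datashare-client, after `yarn build`:
#   link_client_build dist ../datashare/app
```

With the link in place, a new `yarn build` is picked up without another copy step, as long as the build keeps writing into the same dist/ directory.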

License

Datashare is distributed under the GNU Affero General Public License v3.0.

About ICIJ

The International Consortium of Investigative Journalists (ICIJ) is a global network of reporters and media organizations collaborating on cross‑border investigations (e.g., Panama Papers, Luanda Leaks, Uber Files, Pandora Papers). The tech team at ICIJ builds tools like Datashare to empower investigative journalism at scale, handling millions of documents securely and efficiently. We open‑sourced Datashare to empower solo reporters and small newsrooms with advanced investigative tools, enable larger organizations to audit, extend, and self‑host the platform, and foster collaboration within the investigative community to continually improve the software.

Contact & Community

Issues & feature requests: GitHub Issues
Email: datashare@icij.org
Security reports: please email us and avoid filing public issues for vulnerabilities.
