This project leverages Ansible to automate DataWave deployments on your cluster

Powered by Muchos and Ansible

Apache License


This project is intended to be used in tandem with Muchos to automate the deployment of DataWave for development and testing purposes on a cluster of arbitrary size.

The project consists primarily of Ansible scripts intended to be run on your cluster after Muchos setup has completed. Thus, users will first employ Muchos independently to establish DataWave's base dependencies (Hadoop, Accumulo, and ZooKeeper) and to generate the base Ansible inventory required to automate the configuration and deployment of DataWave.

Compatibility Notes

Testing/verification has been performed on AWS using the following:

Muchos Commit   Configuration          DataWave Commit
6e786a0         muchos.props.example   116e1f8

Prerequisites / Assumptions

  • Familiarity with the basics of Ansible is recommended but not required
  • Familiarity with the following is assumed
    • Hadoop HDFS and MapReduce
    • Accumulo and ZooKeeper
    • DataWave
    • Muchos (see Muchos documentation for prerequisites)

Get Started

1. Use Muchos to set up your cluster

If desired, you can have Muchos launch dedicated EC2 hosts for DataWave's ingest master and web server(s) by adding them as nodes of type client in your muchos.props as follows:

  ingest1 = client
  webserver1 = client

Muchos will install and configure base dependencies on client nodes, but no service daemons will be activated.
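In context, a muchos.props [nodes] section with these additions might look like the following sketch. Only the two client entries are prescribed above; the leader/worker service assignments are illustrative, so consult muchos.props.example for the authoritative layout.

```ini
# Hypothetical [nodes] section -- only the client entries come from this
# README; the other host/service assignments are illustrative.
[nodes]
leader1    = namenode,resourcemanager,accumulomaster,zookeeper
worker1    = worker
worker2    = worker
ingest1    = client
webserver1 = client
```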

2. When Muchos setup is complete, ssh to your proxy host and clone this repository. For example:

<me@localhost>$ cd /path/to/fluo-muchos
<me@localhost>$ bin/muchos ssh
<cluster_user@leader1>$ git clone

Remaining tasks below should be performed on the proxy host as the user denoted by your cluster_user variable.

3. Symlink your Muchos inventory and assign your DataWave-specific hosts in the dw-hosts file

$ cd datawave-muchos/ansible/inventory

# 3.1 - Create symlink to your Muchos hosts file
$ ln -s /home/cluster_user/ansible/conf/hosts muchos-hosts

# 3.2 - Edit the DataWave inventory file as needed
$ vi dw-hosts

This allows us to pass the inventory directory itself as an argument to Ansible, e.g., ansible-playbook -i inventory/ ..., which tells Ansible to merge all files present into a single inventory automatically.

At this point, you should have only two files in the directory, muchos-hosts and dw-hosts.
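As a throwaway illustration of the directory-as-inventory behavior, the sketch below builds a scratch directory with two inventory files. The /tmp paths and group names are hypothetical; on the cluster the directory is datawave-muchos/ansible/inventory with muchos-hosts and dw-hosts.

```shell
# Create a scratch inventory directory with two files, mirroring the
# muchos-hosts/dw-hosts split described above (paths are hypothetical).
mkdir -p /tmp/demo-inventory
printf '[leader]\nleader1\n' > /tmp/demo-inventory/muchos-hosts
printf '[ingestmaster]\ningest1\n' > /tmp/demo-inventory/dw-hosts

# Passing the directory itself merges both files into one inventory:
#   ansible all -i /tmp/demo-inventory --list-hosts
ls /tmp/demo-inventory
```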

4. Configure your all group and datawave group variables

$ cd datawave-muchos/ansible/group_vars

# 4.1 (Required) - Symlink the Muchos 'all' vars file
$ ln -s /home/cluster_user/ansible/group_vars/all all

# 4.2 (Optional) - Set DataWave-specific overrides in the 'datawave' vars file
$ vi datawave
  • Generally, you'll find variables and their default values defined in ansible/roles/{{ role name }}/defaults/main.yml, so that they can be easily overridden (values assigned there receive the lowest possible precedence in Ansible)

  • Most of the variables you'll care about are here: ansible/roles/common/defaults/main.yml

  • You may find it convenient to override variables from the command line via Ansible's -e / --extra-vars option, as demonstrated below in post-deployment/force redeploy. (In Ansible, command line overrides receive the highest possible precedence)
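For instance, a datawave vars file overriding a few of the variables named in this document might look like the following sketch. The variable names appear elsewhere in this README; the values are examples only, not defaults.

```yaml
# Hypothetical overrides in ansible/group_vars/datawave.
# Values are illustrative -- check roles/*/defaults/main.yml for defaults.
dw_repo: https://github.com/NationalSecurityAgency/datawave.git
dw_checkout_version: 116e1f8
dw_clone_dir: /data/datawave
dw_m2_repo_dir: /data/m2-repo
dw_install_web_client: True
dw_start_ingest: True
dw_start_web: True
```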

5. Lastly, build/deploy DataWave with the datawave.yml playbook

$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory datawave.yml

# Or equivalently...
$ scripts/
  • Note: The dw-build role will first git-clone a remote DataWave repository on your proxy host, as configured by the following variables: dw_repo, dw_clone_dir, dw_checkout_version

  • Note: To build DataWave's ingest and web tarballs, the proxy host will need a few GB free on the volume containing the local git repo, plus a few GB free for the local Maven repo. For EC2 clusters, depending on the source AMI and storage configuration, you may need to attach and mount a volume large enough to accommodate these directories, configured via dw_clone_dir and dw_m2_repo_dir respectively

  • Note: By default, ingest services should be started up automatically on your ingestmaster host upon successful completion of the datawave.yml playbook. See this issue for instructions to verify that services started successfully
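Before running the playbook, a quick free-space sanity check along these lines can help. The two paths below are placeholders for your dw_clone_dir and dw_m2_repo_dir values; substitute the real ones.

```shell
# Print available space (in GB) on the volumes backing two placeholder
# paths; swap in your actual dw_clone_dir and dw_m2_repo_dir.
for dir in /tmp "$HOME"; do
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  echo "$dir: $((avail_kb / 1048576)) GB available"
done
```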


Additional playbooks are provided as a convenience to simplify common post-deployment tasks on your cluster. These are described below. Also note that the datawave.yml playbook imports post-deployment.yml to allow you to run many of these tasks automatically after DataWave has been installed. In general, tasks in post-deployment.yml will be conditionally activated based on the value of one or more boolean variables, which you may override as needed.
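The gating pattern looks roughly like the following sketch; the task name and command path are hypothetical, and the real tasks live in post-deployment.yml.

```yaml
# Sketch of a conditionally activated post-deployment task. The boolean
# dw_start_ingest is referenced elsewhere in this README; the command
# path is a placeholder, not the actual implementation.
- name: Start ingest services after deployment
  command: /opt/datawave/bin/start-ingest.sh   # hypothetical path
  when: dw_start_ingest | bool
```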

DataWave Query Client

If dw_install_web_client was set to True (default), then a simple, curl-based query client for DataWave will have been installed and configured on your proxy host.

The client will simplify your interaction with the DataWave Query API by...

  • automatically configuring test PKI materials and associated curl parameters
  • setting reasonable defaults for DataWave-specific parameters
  • automatically pretty-printing web service responses based on their content type
  • automatically closing queries when response code 204 is returned (no results found)
  • etc

For example:

 $ which datawave || source ~/.bashrc
 $ datawave query --expression "PAGE_TITLE:AccessibleComputing" --show-meta
   "Events": [
   "ReturnedEvents": 1
 Query ID: 51082ed4-b579-45b8-879f-3afdb10e6ec3
 Time: 0.271 Response Code: 200 Response Type: application/json
 $ datawave query --next 51082ed4-b579-45b8-879f-3afdb10e6ec3 --show-meta
 Time: 0.093 Response Code: 204 Response Type: N/A
 [DW-INFO] - End of result set, as indicated by 204 response. Closing query automatically
  • Query Client options: $ datawave query --help
  • Other options: $ datawave --help
  • More info: view the client script
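The auto-close behavior on a 204 response amounts to logic like the sketch below. The function name is invented for illustration, and the actual curl call to the close endpoint is elided.

```shell
# Invented helper illustrating the client's 204 handling described
# above: a 204 from --next means no more results, so close the query.
maybe_close_query() {
  local http_code="$1" query_id="$2"
  if [ "$http_code" -eq 204 ]; then
    echo "[DW-INFO] - End of result set, as indicated by 204 response. Closing query automatically"
    # (a curl call to the query-close endpoint would go here)
    return 0
  fi
  return 1
}

maybe_close_query 204 "51082ed4-b579-45b8-879f-3afdb10e6ec3"
```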

Force Redeploy

Generally speaking, all Ansible tasks here are designed to be idempotent operations on your cluster. Thus, it is usually safe to assume that executing the datawave.yml playbook multiple times will result in the same cluster state. However, you may want to change that behavior at times by overriding certain default variables.

For example, you may want to rebuild DataWave and redeploy updated versions of ingest and query services:

# Force rebuild/redeploy
$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory datawave.yml -e '{ "dw_force_redeploy": true }'

# Or equivalently...
$ scripts/

Upon redeploy...

  • Previously ingested data in Accumulo is always preserved.
  • Any manual, in-place modifications made to deployed services will likely be lost.
  • Prior to redeploy, graceful shutdown of DataWave services is attempted.

Ansible Tags

For additional flexibility, the datawave.yml playbook makes use of Ansible tags, so specific tasks can be included or excluded via the --tags and --skip-tags options respectively. For example:

# Force a redeploy of DataWave without rebuilding the source code
$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory datawave.yml -e '{ "dw_force_redeploy": true }' --skip-tags build

# Or equivalently...
$ scripts/ --skip-tags build
# View all tasks and their associated tags for the entire playbook
$ ansible-playbook datawave.yml --list-tasks

# Or equivalently...
$ scripts/ --list-tasks

Start/Stop Ingest

$ cd datawave-muchos/ansible

# Start (this is already a post-deployment task, as dw_start_ingest is set to True by default)
$ ansible-playbook -i inventory start-ingest.yml

# Stop
$ ansible-playbook -i inventory stop-ingest.yml

Start/Stop Web Services

$ cd datawave-muchos/ansible

# Start (can be automated as a post-deployment task, if dw_start_web == True)
$ ansible-playbook -i inventory start-web.yml

# Stop
$ ansible-playbook -i inventory stop-web.yml

See also scripts/ and scripts/

DataWave Ingest Examples

TVMAZE Dataset

To download/ingest a small subset of TVMAZE show and cast member data:

# Note: this can also be automated as a post-deployment task, if dw_ingest_tvmaze == True

$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory tvmaze-ingest.yml

To download and ingest all TV shows and associated cast info:

$ cd scripts
$ ./

Wikipedia Dataset

To download a Wikipedia XML data dump and ingest a small subset (~100,000 pages) of its entries:

# Note: this can also be automated as a post-deployment task, if dw_ingest_wikipedia == True

$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory wikipedia-ingest.yml

# Or equivalently...
$ scripts/
  • If desired, the entire XML dump may be ingested by tweaking the Ansible variable wiki_max_streams_to_extract, subject to the storage limitations of your cluster
  • More info: ansible/roles/wikipedia/README