NationalSecurityAgency/datawave-muchos

Powered by Muchos and Ansible

Apache License

Purpose

This project is intended to be used in tandem with Muchos to automate the deployment of DataWave for development and testing purposes on a cluster of arbitrary size.

The project consists primarily of Ansible scripts, which are intended to be used on your cluster after Muchos setup has been completed. Thus, users will first employ Muchos independently to establish DataWave's base dependencies (Hadoop, Accumulo, and ZooKeeper) and to establish the base Ansible inventory required to automate configuration and deployment of DataWave.

Compatibility Notes

Testing/verification has been performed on AWS using the following:

Muchos Commit    Configuration           DataWave Commit
4f1a4ae          muchos.props.example    2.6.41

Prerequisites / Assumptions

  • Familiarity with the basics of Ansible is recommended but not required
  • Familiarity with the following is assumed
    • Hadoop HDFS and MapReduce
    • Accumulo and ZooKeeper
    • DataWave
    • Muchos (see Muchos documentation for prerequisites)

Get Started

1. Use Muchos to set up your cluster

When configuring Muchos, keep in mind that you'll be installing DataWave after Muchos setup is complete. So, you'll want to consider the future home for DataWave's ingest and web components when defining the [nodes] section of muchos.props.

If desired, you can have Muchos set up dedicated hosts for these by adding nodes of type client in muchos.props. For example:

  ...
  [nodes]
  ...
  ingest1 = client
  webserver1 = client
  

Muchos will install and configure base dependencies on client nodes, but no service daemons will be activated.

It is not a requirement to have distinct hosts for DataWave's ingest and web services. They may coexist on the same host and/or alongside other cluster services, provided that sufficient resources exist on the target host(s).

In Step 3 below, you'll define the target host(s) for DataWave and integrate them into your existing Ansible inventory.

2. When Muchos setup is complete, ssh to your proxy host and clone this repository. For example:

<me@localhost>$ cd /path/to/fluo-muchos
<me@localhost>$ bin/muchos ssh
...
<cluster_user@leader1>$ git clone https://github.com/NationalSecurityAgency/datawave-muchos.git

Remaining tasks below should be performed on the proxy host as the user denoted by your cluster_user variable.

3. Symlink your Muchos inventory and assign your DataWave-specific hosts in the dw-hosts file

$ cd datawave-muchos/ansible/inventory

# 3.1 - Create symlink to your Muchos hosts file
$ ln -s /home/cluster_user/ansible/conf/hosts muchos-hosts

# 3.2 - Edit the DataWave inventory file as needed
$ vi dw-hosts
  ...   

This allows us to pass the inventory directory itself as an argument to Ansible, e.g., ansible-playbook -i inventory/ ..., which tells Ansible to merge all files present into a single inventory automatically.

At this point, you should have only two files in the directory, muchos-hosts and dw-hosts.
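
As a quick sanity check of the step above, the following is a minimal sketch that verifies the directory contents before any playbook is run. The function name check_inventory_dir is hypothetical and not part of this repository; it only assumes the two file names mentioned above.

```shell
# check_inventory_dir is a hypothetical helper, not part of datawave-muchos.
# It succeeds only if DIR holds exactly dw-hosts and muchos-hosts, and the
# muchos-hosts symlink resolves to an existing file.
check_inventory_dir() {
  local dir="${1:-inventory}" expected actual
  expected="$(printf 'dw-hosts\nmuchos-hosts')"
  # Sort in the C locale so the comparison does not depend on the user's locale
  actual="$(ls -1 "$dir" 2>/dev/null | LC_ALL=C sort)"
  if [ "$actual" != "$expected" ]; then
    echo "Unexpected contents in $dir:" >&2
    printf '%s\n' "$actual" >&2
    return 1
  fi
  # -e follows symlinks, so a dangling muchos-hosts link fails here
  if [ ! -e "$dir/muchos-hosts" ]; then
    echo "$dir/muchos-hosts does not resolve to an existing file" >&2
    return 1
  fi
  echo "OK: $dir contains dw-hosts and muchos-hosts"
}
```

For example, run check_inventory_dir inventory from datawave-muchos/ansible. If your Ansible installation includes the ansible-inventory command, ansible-inventory -i inventory/ --graph should additionally print the merged group tree that the playbooks will see.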

4. Configure your all group and datawave group variables

$ cd datawave-muchos/ansible/group_vars

# 4.1 (Required) - Symlink the Muchos 'all' vars file
$ ln -s /home/cluster_user/ansible/group_vars/all all

# 4.2 (Optional) - Set DataWave-specific overrides in the 'datawave' vars file
$ vi datawave
  ...
  • Generally, you'll find variables and their default values defined in ansible/roles/{{ role name }}/defaults/main.yml, so that they can be easily overridden (values assigned there receive the lowest possible precedence in Ansible)

  • Most of the variables you'll care about are here: ansible/roles/common/defaults/main.yml

  • You may find it convenient to override variables from the command line via Ansible's -e / --extra-vars option, as demonstrated below in post-deployment/force redeploy. (In Ansible, command line overrides receive the highest possible precedence)
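
To see which variables are available for overriding, a rough sketch like the one below can list the top-level keys from each role's defaults file. The helper name list_role_defaults is hypothetical, and it assumes the roles/<role name>/defaults/main.yml layout described above; nested keys and multi-line values are not shown.

```shell
# list_role_defaults is a hypothetical helper, not part of datawave-muchos.
# It prints "file:key: value" for every top-level key found in
# ROLES_DIR/*/defaults/main.yml.
list_role_defaults() {
  local roles_dir="${1:-roles}"
  # Top-level YAML keys start in column 0 and are followed by a colon;
  # indented (nested) keys and comment lines are skipped.
  grep -H -E '^[A-Za-z_][A-Za-z0-9_]*:' "$roles_dir"/*/defaults/main.yml 2>/dev/null
}
```

For example, from datawave-muchos/ansible, list_role_defaults roles | grep dw_ would narrow the listing to variables whose names contain dw_.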

5. Lastly, build/deploy DataWave with the datawave.yml playbook

$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory datawave.yml

# Or equivalently...
$ scripts/dw-play.sh
  • Note: The dw-build role will first git-clone a remote DataWave repository on your proxy host, as configured by the following variables: dw_repo, dw_clone_dir, dw_checkout_version

  • Note: To build DataWave's ingest and web tarballs, the proxy host will need a few GB free on the volume containing the local git repo. Additionally, you'll need a few GB free for the local Maven repo. For EC2 clusters, depending on the source AMI and storage configuration, you may need to attach and mount a volume large enough to accommodate these directories, configured via dw_clone_dir and dw_m2_repo_dir respectively

  • Note: By default, ingest services should be started up automatically on your ingestmaster host upon successful completion of the datawave.yml playbook. See this issue for instructions to verify that services started successfully
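
Because the build needs free space on the proxy host, it may help to check the target volumes before running the playbook. The sketch below is one way to do that; has_free_gb is a hypothetical helper, and the threshold you pass is your own estimate of "a few GB", not a value defined by this project.

```shell
# has_free_gb is a hypothetical helper, not part of datawave-muchos.
# Usage: has_free_gb PATH MIN_GB
# Succeeds if the volume containing PATH has at least MIN_GB gigabytes free.
has_free_gb() {
  local path="$1" min_gb="$2" avail_kb
  # df -Pk prints POSIX-format output in 1024-byte blocks;
  # the fourth column of the second line is the available space.
  avail_kb="$(df -Pk "$path" 2>/dev/null | awk 'NR==2 {print $4}')"
  [ -n "$avail_kb" ] && [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

For example, has_free_gb /path/to/clone/dir 5 || echo "Not enough space for the DataWave build", where the path stands in for whatever you have configured as dw_clone_dir or dw_m2_repo_dir.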

Post-Deployment

Additional playbooks are provided as a convenience to simplify common post-deployment tasks on your cluster. These are described below. Also note that the datawave.yml playbook imports post-deployment.yml to allow you to run many of these tasks automatically after DataWave has been installed. In general, tasks in post-deployment.yml will be conditionally activated based on the value of one or more boolean variables, which you may override as needed.
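
Several of these boolean variables can be overridden in a single run through -e / --extra-vars. Ansible treats key=value extra vars as strings, so JSON is the safer form for booleans. The sketch below builds such a JSON string; bool_extra_vars is a hypothetical helper, and the variable names in the usage note are ones mentioned elsewhere in this README.

```shell
# bool_extra_vars is a hypothetical helper, not part of datawave-muchos.
# Usage: bool_extra_vars NAME=true|false [NAME=true|false ...]
# Prints a JSON object suitable for ansible-playbook's -e option.
bool_extra_vars() {
  local json="" pair name value
  for pair in "$@"; do
    name="${pair%%=*}"
    value="${pair#*=}"
    # Accept only literal JSON booleans so the result stays valid JSON
    case "$value" in
      true|false) ;;
      *) echo "Expected NAME=true or NAME=false, got: $pair" >&2; return 1 ;;
    esac
    # Prefix a comma separator for every entry after the first
    json="${json:+$json, }\"$name\": $value"
  done
  printf '{ %s }\n' "$json"
}
```

For example, ansible-playbook -i inventory datawave.yml -e "$(bool_extra_vars dw_start_web=true dw_ingest_tvmaze=true)" would pass { "dw_start_web": true, "dw_ingest_tvmaze": true } as extra vars.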

DataWave Query Client

If dw_install_web_client was set to True (default), then a simple, curl-based query client for DataWave will have been installed and configured on your proxy host.

The client will simplify your interaction with the DataWave Query API by...

  • automatically configuring test PKI materials and associated curl parameters
  • setting reasonable defaults for DataWave-specific parameters
  • automatically pretty-printing web service responses based on their content type
  • automatically closing queries when response code 204 is returned (no results found)
  • etc

For example:

 $ which datawave || source ~/.bashrc
 ...
 $ datawave query --expression "PAGE_TITLE:AccessibleComputing" --show-meta
 {
   "Events": [
       {
         ...
       }
   ], 
   ...
   "ReturnedEvents": 1
 }
 Query ID: 51082ed4-b579-45b8-879f-3afdb10e6ec3
 Time: 0.271 Response Code: 200 Response Type: application/json
 
 $ datawave query --next 51082ed4-b579-45b8-879f-3afdb10e6ec3 --show-meta
 Time: 0.093 Response Code: 204 Response Type: N/A
 [DW-INFO] - End of result set, as indicated by 204 response. Closing query automatically
 ...
  • Query Client options: $ datawave query --help
  • Other options: $ datawave --help
  • More info: view the client script
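
To page through a larger result set, the --next call shown above can be wrapped in a loop. The following is only a sketch: response_code and drain_query are hypothetical helpers, and they assume the --show-meta output contains a "Response Code: NNN" line formatted as in the example above. The max_pages cap is likewise an assumption, added to keep the loop bounded.

```shell
# response_code is a hypothetical helper, not part of the DataWave client.
# It reads client output on stdin and prints the last HTTP response code found.
response_code() {
  sed -n 's/.*Response Code: \([0-9][0-9]*\).*/\1/p' | tail -n 1
}

# drain_query is a hypothetical helper, not part of the DataWave client.
# Usage: drain_query QUERY_ID [MAX_PAGES]
# Requests further pages until a response other than 200 arrives
# (e.g. 204 at the end of the result set) or MAX_PAGES is reached.
drain_query() {
  local query_id="$1" max_pages="${2:-100}" page=0 out code
  while [ "$page" -lt "$max_pages" ]; do
    out="$(datawave query --next "$query_id" --show-meta)"
    code="$(printf '%s\n' "$out" | response_code)"
    [ "$code" = "200" ] || break
    printf '%s\n' "$out"
    page=$((page + 1))
  done
  echo "Fetched $page additional page(s); last response code: ${code:-unknown}" >&2
}
```

For example, drain_query 51082ed4-b579-45b8-879f-3afdb10e6ec3 would print each remaining page to standard output and a short summary to standard error.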

Force Redeploy

Generally speaking, all Ansible tasks here are designed to be idempotent operations on your cluster. Thus, it is usually safe to assume that executing the datawave.yml playbook multiple times will result in the same cluster state. However, you may want to change that behavior at times by overriding certain default variables.

For example, you may want to rebuild DataWave and redeploy updated versions of ingest and query services:

# Force rebuild/redeploy
$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory datawave.yml -e '{ "dw_force_redeploy": true }'

# Or equivalently...
$ scripts/dw-play-redeploy.sh

Upon redeploy...

  • Previously ingested data in Accumulo is always preserved.
  • Any manual, in-place modifications made to deployed services will likely be lost.
  • Prior to redeploy, graceful shutdown of DataWave services is attempted.

Ansible Tags

For additional flexibility, the datawave.yml playbook makes use of Ansible tags, so specific tasks can be whitelisted/blacklisted via the --tags and --skip-tags options, respectively. For example:

# Force a redeploy of DataWave without rebuilding the source code
$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory datawave.yml -e '{ "dw_force_redeploy": true }' --skip-tags build

# Or equivalently...
$ scripts/dw-play-redeploy.sh --skip-tags build
  
# View all tasks and their associated tags for the entire playbook
$ ansible-playbook datawave.yml --list-tasks

# Or equivalently...
$ scripts/dw-play.sh --list-tasks

Start/Stop Ingest

$ cd datawave-muchos/ansible

# Start (this is already a post-deployment task, as dw_start_ingest is set to True by default)
$ ansible-playbook -i inventory start-ingest.yml

# Stop
$ ansible-playbook -i inventory stop-ingest.yml

Start/Stop Web Services

$ cd datawave-muchos/ansible

# Start (can be automated as a post-deployment task, if dw_start_web == True)
$ ansible-playbook -i inventory start-web.yml

# Stop
$ ansible-playbook -i inventory stop-web.yml

See also scripts/dw-services-start.sh and scripts/dw-services-stop.sh
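
The four playbooks above follow an <action>-<service>.yml naming pattern, so a small wrapper can select the right one from two arguments. The sketch below is an illustration only; dw_service_playbook is a hypothetical helper and is unrelated to the dw-services-*.sh scripts, whose behavior is not described here.

```shell
# dw_service_playbook is a hypothetical helper, not part of datawave-muchos.
# Usage: dw_service_playbook start|stop ingest|web
# Prints the name of the matching playbook, e.g. start-ingest.yml.
dw_service_playbook() {
  local action="$1" service="$2"
  case "$action" in
    start|stop) ;;
    *) echo "action must be 'start' or 'stop'" >&2; return 2 ;;
  esac
  case "$service" in
    ingest|web) ;;
    *) echo "service must be 'ingest' or 'web'" >&2; return 2 ;;
  esac
  printf '%s-%s.yml\n' "$action" "$service"
}
```

For example, from datawave-muchos/ansible: ansible-playbook -i inventory "$(dw_service_playbook stop web)".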

DataWave Ingest Examples

TVMAZE Dataset (http://www.tvmaze.com/api)

To download/ingest a small subset of TVMAZE show and cast member data:

# Note: this can also be automated as a post-deployment task, if dw_ingest_tvmaze == True

$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory tvmaze-ingest.yml

To download and ingest all TV shows and associated cast info:

$ cd scripts
$ ./tvmaze-ingest.sh

Wikipedia Dataset

To download a Wikipedia XML data dump and ingest a small subset (~100,000 pages) of its entries:

# Note: this can also be automated as a post-deployment task, if dw_ingest_wikipedia == True

$ cd datawave-muchos/ansible
$ ansible-playbook -i inventory wikipedia-ingest.yml

# Or equivalently...
$ scripts/wikipedia-ingest.sh
  • If desired, the entire XML dump may be ingested by adjusting the Ansible variable wiki_max_streams_to_extract, subject to the storage limitations of your cluster
  • More info: ansible/roles/wikipedia/README