Deployment Guide

emilymatthews edited this page Sep 27, 2016 · 5 revisions

Purpose and Audience

The deployment section provides a summarized view of the installation process and the recommended locations for each component. The intended audience is the people responsible for leading the installation.

Component Distribution

There are three components in Apache Spot (Incubating):

  • Ingest – binary and log files are captured or transferred into the Hadoop cluster, where they are transformed and loaded into solution data stores
  • Machine Learning – machine learning algorithms enrich the ingested data with information that is used to filter and sort the raw data
  • Operational Analytics – output from the machine learning component is augmented with context (e.g. geographic data) and heuristics, then made available to the user for interaction

While all of the components can be installed on the same server in a development or test scenario, the recommended configuration for production is to map the components to specific server roles in a Hadoop cluster.

Component               Node/Key Role
Ingest                  Edge Server (Gateway)
Machine Learning        YARN Node Manager (Gateway)
Operational Analytics   Node with Cloudera Manager / Hue (Gateway)

During the install, each component is installed in the /home/"sol-user"/ folder on the appropriate node. This requires creating the "solution" user on each node.
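As a sketch, the solution user could be created on each node as follows (the user name "solution" follows the convention above; substitute your own sol-user name if it differs):

```shell
# Create the solution user with a home directory on this node
# (run as root or via sudo; "solution" is the naming convention
# used in this guide -- adjust to your environment).
sudo useradd --create-home --shell /bin/bash solution

# Verify the home directory exists, since components install under it
ls -ld /home/solution
```

Repeat this on every node that will host a component, so each install lands under the same home path.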


The following subcomponents are installed on the edge server:

  • nfdump (flow only) – a set of utilities for capturing and decoding flow data
  • tshark (packet only) – the command-line component of Wireshark, used for decoding packet data
  • Ingest workflow – bash script or Oozie workflow
  • Ingest master and workers – Python code for data ingest
  • Ingest directory structure – on the local file system

NOTE: Kafka service is required for Ingest.
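As an illustration of the Kafka requirement, an ingest topic could be created with the standard Kafka CLI (the topic name, partition count, and ZooKeeper address below are assumptions for this sketch, not values fixed by this guide):

```shell
# Illustrative only: topic name and connection details are assumptions --
# substitute the values for your cluster.
kafka-topics --create \
  --zookeeper zk-host:2181 \
  --replication-factor 1 \
  --partitions 1 \
  --topic SPOT-INGEST-TOPIC

# Confirm the topic exists
kafka-topics --list --zookeeper zk-host:2181
```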

There are also required changes to the Hadoop configuration:

  • Create HDFS path for binary data
  • Create HDFS path for Hive tables
  • Create solution Hive tables (staging, search)
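The Hadoop configuration changes above might look like the following sketch (every path, database, column, and table name here is an illustrative assumption; the solution's install scripts define the actual values):

```shell
# Illustrative sketch only -- paths and table definitions are assumptions.

# HDFS path for binary (raw) data
hdfs dfs -mkdir -p /user/solution/flow/binary

# HDFS path backing the Hive tables
hdfs dfs -mkdir -p /user/solution/flow/hive

# A hypothetical staging table over that path
hive -e "
CREATE EXTERNAL TABLE IF NOT EXISTS flow_staging (
  tstart STRING,
  srcip  STRING,
  dstip  STRING,
  sport  INT,
  dport  INT
)
STORED AS PARQUET
LOCATION '/user/solution/flow/hive';
"
```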

Machine Learning (ML)

There are multiple subcomponents installed on each DataNode / NodeManager used for the solution:

  • Scala scripts to run Spark pre- and post-processing jobs
  • Python scripts used for local transformation
  • Algorithm code written in C/C++
  • MPI (Message Passing Interface) libraries – used to parallelize the algorithm code
  • ML workflow – bash script
  • ML directory structure on the local file system

Some changes are required on the Hadoop Cluster as well:

  • Spark configuration settings will need to be reviewed or modified
  • YARN configuration settings will need to be reviewed or modified
  • A directory structure for machine learning data must be created

Operational Analytics (OA)

Multiple subcomponents are required for installation on the Cloudera Manager/Hue server:

  • IPython – provides a server for static HTML and JavaScript, as well as Jupyter notebooks, the key interface to the solution and the Hadoop cluster
  • Matplotlib (optional) – provides rich charting and plotting within Jupyter notebooks
  • D3.js and other JavaScript libraries – provide dynamic behavior and interactivity in the user interface
  • Solution code – static HTML, JavaScript, and Jupyter notebooks used to access the operational analytics and information about the system
  • Ops directory structure on the local file system

Some changes may be required on the Hadoop Cluster as well:

  • YARN configuration settings will need to be reviewed or modified (for Hive query optimization).
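One way to review the relevant YARN memory settings is to inspect yarn-site.xml directly (the property names below are standard YARN settings; the file location shown is typical for Cloudera-managed nodes but may differ in your environment):

```shell
# Review standard YARN memory properties relevant to query performance;
# the config path is an assumption typical of Cloudera deployments.
grep -A1 -E 'yarn\.nodemanager\.resource\.memory-mb|yarn\.scheduler\.maximum-allocation-mb' \
  /etc/hadoop/conf/yarn-site.xml
```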

Because the top-level components of the solution can be used independently or together, we recommend the following approach to installation. For each component (ingest, machine learning, operational analytics):

  • Identify deployment target nodes
  • Install prerequisites on local file system
  • Install solution component on local file system
  • Make configuration/installation changes to Hadoop
  • Validate and Test
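The final validation step could be sketched as a quick checklist script (the user name and paths are the illustrative ones used earlier in this guide; adjust them for your deployment):

```shell
# Illustrative post-install checks; names and paths are assumptions.
set -e

# 1. The solution user exists on this node
id solution

# 2. The component's local directory structure is in place
test -d /home/solution

# 3. The HDFS paths created for the solution are present
hdfs dfs -test -d /user/solution

echo "basic deployment checks passed"
```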