Deployment Guide

Purpose and Audience

This deployment guide summarizes the installation process and the recommended location for each component. It is intended for the people responsible for leading the installation.

Component Distribution

There are three components in Apache Spot (Incubating):

Ingest – binary and log files are captured or transferred into the Hadoop cluster, where they are transformed and loaded into the solution data stores.
Machine Learning – machine learning algorithms add learned information to the ingested data, which is then used to filter and sort the raw data.
Operational Analytics – output from the machine learning component is augmented with context (e.g. geographic data) and heuristics, then made available to the user for interactive analysis.

While all of the components can be installed on the same server in a development or test scenario, the recommended configuration for production is to map the components to specific server roles in a Hadoop cluster.

Component	Node/Key Role
Ingest	Edge Server (Gateway)
Machine Learning	YARN Node Manager (Gateway)
Operational Analytics	Node with Cloudera Manager / Hue (Gateway)

During the install, each component is installed in the /home/"sol-user"/ folder on the appropriate node. This requires creating the solution user ("sol-user") on each node.
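The mapping and install location above can be written down as a small lookup, which may be handy when scripting checks across nodes. This is only an illustrative sketch: the user name `spot` is a hypothetical placeholder for "sol-user", and the helper names are not part of the Spot code base.

```python
from pathlib import PurePosixPath

# Component-to-role mapping copied from the table above.
NODE_ROLES = {
    "Ingest": "Edge Server (Gateway)",
    "Machine Learning": "YARN Node Manager (Gateway)",
    "Operational Analytics": "Node with Cloudera Manager / Hue (Gateway)",
}


def install_dir(solution_user):
    """Return /home/<solution_user>, the folder each component installs into."""
    return PurePosixPath("/home") / solution_user


if __name__ == "__main__":
    user = "spot"  # hypothetical stand-in for "sol-user"
    for component, role in NODE_ROLES.items():
        print(f"{component}: {role} -> {install_dir(user)}")
```

The same solution user has to exist on every node listed, since the install path is derived from it.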

Ingest

The following subcomponents are installed on the edge server:

nfdump (http://nfdump.sourceforge.net/): a set of utilities for capturing and decoding flow data.
tshark (packet data only): the command-line component of Wireshark (https://www.wireshark.org/), used for decoding packet data
Ingest workflow – bash script or Oozie workflow
Ingest master and workers – Python code for data ingest
Ingest directory structure – local file system

NOTE: The Kafka service is required for Ingest.

There are also required changes to the Hadoop configuration:

Create HDFS path for binary data
Create HDFS path for Hive tables
Create solution Hive tables (staging, search)
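As a rough sketch of the first two changes, the snippet below builds the `hdfs dfs -mkdir -p` commands for the two paths without running them. The paths and the user name are assumptions chosen for illustration; the guide does not specify them, so substitute the locations used in your cluster. Creating the Hive tables is not covered here.

```python
def hdfs_mkdir_commands(paths):
    """Build one `hdfs dfs -mkdir -p` argument list per HDFS path."""
    return [["hdfs", "dfs", "-mkdir", "-p", path] for path in paths]


# Hypothetical values: neither the user name nor the paths come from the guide.
solution_user = "spot"
hdfs_paths = [
    f"/user/{solution_user}/binary",  # assumed location for binary data
    f"/user/{solution_user}/hive",    # assumed location for Hive tables
]

for command in hdfs_mkdir_commands(hdfs_paths):
    # Print for review; pass each list to subprocess.run() on a node
    # with the HDFS client installed to actually create the path.
    print(" ".join(command))
```

Printing the commands first makes it easy to review them with the cluster administrator before anything is changed.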

Machine Learning (ML)

Multiple subcomponents are installed on each DataNode / NodeManager used for the solution:

Scala scripts to run Spark pre- and post-processing jobs
Python scripts used for local transformation
Algorithm code written in C/C++
MPI (Message Passing Interface) libraries – used to parallelize algorithm code
ML workflow – bash script
ML directory structure on local file system

Some changes are required on the Hadoop Cluster as well:

Spark configuration settings will need to be reviewed or modified
YARN configuration settings will need to be reviewed or modified
A directory structure for machine learning data will need to be created

Operational Analytics (OA)

Multiple subcomponents are required for installation on the Cloudera Manager/Hue server:

IPython – provides a server for static HTML and JavaScript, as well as Jupyter notebooks, the key interface between the user and the Hadoop cluster
Matplotlib (optional) – provides rich charting and plotting within Jupyter notebooks
D3.js and other JavaScript libraries – provide dynamic behavior and interactivity in the user interface
Solution code – static HTML, JavaScript, and Jupyter notebooks used to access the operational analytics and information about the system
Ops directory structure on the local file system

Some changes may be required on the Hadoop Cluster as well:

YARN configuration settings will need to be reviewed or modified (for Hive query optimization).

Because the top-level components of the solution can be used independently or together, we recommend the following approach to installation. For each component (ingest, machine learning, operational analytics):

Identify deployment target nodes
Install prerequisites on local file system
Install solution component on local file system
Make configuration/installation changes to Hadoop
Validate and Test
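The approach above amounts to running the same five steps once per component. A minimal sketch that expands it into an ordered checklist follows; the function name and tuple layout are illustrative choices, not part of the solution.

```python
COMPONENTS = ("ingest", "machine learning", "operational analytics")

# The five steps listed above, in order.
STEPS = (
    "Identify deployment target nodes",
    "Install prerequisites on local file system",
    "Install solution component on local file system",
    "Make configuration/installation changes to Hadoop",
    "Validate and Test",
)


def deployment_plan(components=COMPONENTS):
    """Return (component, step number, step) tuples, one component at a time."""
    return [
        (component, number, step)
        for component in components
        for number, step in enumerate(STEPS, start=1)
    ]


if __name__ == "__main__":
    for component, number, step in deployment_plan():
        print(f"[{component}] {number}. {step}")
```

Because the components can be used independently, a subset can be passed in, for example `deployment_plan(["ingest"])`, to produce the checklist for a partial install.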
