Purpose and Audience
The deployment section provides a summarized view of the installation and recommended locations. The intended audience is the people responsible for leading the installation.
There are three components in Apache Spot (Incubating):
- Ingest – binary and log files are captured or transferred into the Hadoop cluster, where they are transformed and loaded into solution data stores
- Machine Learning – machine learning algorithms add learned information to the ingested data, which is then used to filter and sort the raw data.
- Operational Analytics – output from the machine learning component is augmented with context (e.g. geographic data) and heuristics, then made available to the user to interact with.
While all of the components can be installed on the same server in a development or test scenario, the recommended configuration for production is to map the components to specific server roles in a Hadoop cluster.
|Component|Recommended Server Role|
|---|---|
|Ingest|Edge Server (Gateway)|
|Machine Learning|YARN Node Manager (Gateway)|
|Operational Analytics|Node with Cloudera Manager / Hue (Gateway)|
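As a rough illustration of the mapping above, the following sketch matches each component to the hosts in a cluster inventory that carry the recommended role. The role strings are copied from the table; the function name `plan_deployment` and the hostnames in the usage note are hypothetical and not part of Apache Spot.

```python
# Component-to-role mapping copied from the table above.
RECOMMENDED_ROLES = {
    "Ingest": "Edge Server (Gateway)",
    "Machine Learning": "YARN Node Manager (Gateway)",
    "Operational Analytics": "Node with Cloudera Manager / Hue (Gateway)",
}


def plan_deployment(node_roles):
    """Return a dict of component -> sorted list of matching hostnames.

    node_roles is a dict of hostname -> role string. The inventory format
    is an assumption made for this sketch, not something Spot defines.
    """
    plan = {}
    for component, role in RECOMMENDED_ROLES.items():
        plan[component] = sorted(
            host for host, host_role in node_roles.items() if host_role == role
        )
    return plan
```

For example, calling `plan_deployment({"edge01": "Edge Server (Gateway)"})` would list `edge01` under Ingest and leave the other two components with empty host lists, which is a quick way to spot a role that has no node assigned yet.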
During installation, each component installs into the /home/"sol-user"/ folder on the appropriate node. This requires creating the solution user ("sol-user") on each node.
Ingest
The following subcomponents are installed on the edge server:
- nfdump (http://nfdump.sourceforge.net/): a set of utilities for capturing and decoding flow data.
- tshark (packet only): a CLI component of Wireshark (https://www.wireshark.org/) for decoding packet data
- Ingest workflow – bash script or Oozie workflow
- Ingest master and workers – Python code for data ingest
- Ingest directory structure – local file system.
NOTE: A Kafka service is required for Ingest.
There are also required changes to the Hadoop configuration:
- Create HDFS path for binary data
- Create HDFS path for Hive tables
- Create solution Hive tables (staging, search)
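The HDFS path creation listed above could be scripted along these lines. This is a minimal sketch that only generates `hdfs dfs -mkdir -p` commands for review; it does not run them and does not create the Hive tables. The directory layout (`<root>/<data type>/binary` and `<root>/<data type>/hive`), the root `/user/solution`, and the data type names are assumptions for illustration, not the layout Spot prescribes.

```python
import shlex


def hdfs_setup_commands(hdfs_root, data_types):
    """Return shell commands that create a binary path and a Hive path
    for each data type under hdfs_root (layout is hypothetical)."""
    commands = []
    for data_type in data_types:
        for kind in ("binary", "hive"):
            path = f"{hdfs_root}/{data_type}/{kind}"
            # Quote the path so unusual characters cannot break the command.
            commands.append("hdfs dfs -mkdir -p " + shlex.quote(path))
    return commands


if __name__ == "__main__":
    # "flow" and "packet" are example data type names chosen for this sketch.
    for command in hdfs_setup_commands("/user/solution", ["flow", "packet"]):
        print(command)
```

Printing the commands first, instead of executing them directly, lets the person leading the installation check the paths against the cluster's conventions before anything is changed.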
Machine Learning (ML)
There are multiple subcomponents installed on each DataNode / NodeManager used for the solution:
- Scala scripts to run Spark pre- and post-processing jobs
- Python scripts used for local transformation
- Algorithm code written in C/C++
- MPI (Message Passing Interface) libraries – used to parallelize algorithm code
- ML workflow – bash script
- ML directory structure on local file system
Some changes are required on the Hadoop Cluster as well:
- Spark configuration settings will need to be reviewed or modified
- YARN configuration settings will need to be reviewed or modified
- Directory structure for machine learning data
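One way to approach the Spark settings review is to read the cluster's `spark-defaults.conf` and flag properties that are not set explicitly. The sketch below assumes the common whitespace-separated `key value` form of that file and ignores comment lines; the list of properties to review in the usage note is an assumption, since this section does not say which settings matter for the solution.

```python
def parse_spark_defaults(text):
    """Parse whitespace-separated 'key value' lines into a dict.

    Blank lines and lines starting with '#' are skipped. Other separator
    styles are not handled in this sketch.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def missing_settings(settings, keys_to_review):
    """Return the reviewed keys that are not explicitly set."""
    return [key for key in keys_to_review if key not in settings]
```

A reviewer might call `missing_settings(parse_spark_defaults(conf_text), ["spark.executor.memory", "spark.executor.cores", "spark.driver.memory"])` to see which of those properties still rely on defaults.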
Operational Analytics (OA)
Multiple subcomponents are required for installation on the Cloudera Manager/Hue server:
- Matplotlib (optional) – provides rich charting and plotting within Jupyter notebooks
- Analytics and information about the system
- Ops directory structure on the local file system
Some changes may be required on the Hadoop Cluster as well:
- YARN configuration settings will need to be reviewed or modified (for Hive query optimization).
Because the top-level components of the solution can be used independently or together, we recommend the following approach to installation. For each component (ingest, machine learning, operational analytics):
- Identify deployment target nodes
- Install prerequisites on local file system
- Install solution component on local file system
- Make configuration/installation changes to Hadoop
- Validate and Test
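The per-component approach above can be tracked as a simple checklist. The sketch below expands the five steps for each of the three components; the names `COMPONENTS`, `STEPS`, and `build_checklist` are illustrative and not part of the Spot installer.

```python
# Components and steps are taken from the list above.
COMPONENTS = ("ingest", "machine learning", "operational analytics")
STEPS = (
    "Identify deployment target nodes",
    "Install prerequisites on local file system",
    "Install solution component on local file system",
    "Make configuration/installation changes to Hadoop",
    "Validate and test",
)


def build_checklist(components=COMPONENTS):
    """Return (component, step number, step description) tuples in order."""
    return [
        (component, number, step)
        for component in components
        for number, step in enumerate(STEPS, start=1)
    ]


if __name__ == "__main__":
    for component, number, step in build_checklist():
        print(f"[{component}] {number}. {step}")
```

Because the components can be used independently, passing a subset such as `build_checklist(["ingest"])` yields only the steps for the component being installed.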