
Detailed Guides


The Core Working Principles

  • Zero data leaks
  • Multi-dimensional Scalability
  • Flexibility for Analysts
  • Evolves with you as you need it
  • The best of the best open-source software

The Installer:

AmbariKave

Installation Manual

Installing a large number of services within a distributed environment is very difficult to get right. We have encapsulated the necessary scripts to install the various components together underneath the Hortonworks Ambari installer.

  • What Is It?: Ambari is the installer for the Hortonworks Data Platform (HDP), but its true genius lies in its service-provision model: it is easily extendable to deploy more services in a cluster. It is here that we have added our KAVE-specific services as an open-source extension to what is already there. See the KAVE project here, the original Ambari project here and the Ambari GitHub here. In principle we are applying the existing Hortonworks Ambari installer to a wider range of services in keeping with the KAVE model, simplifying the installation of pre-existing services to a single-click deployment.
  • When to use it?: Whenever/wherever you need to install Hadoop along with the KAVE libraries, or whenever you want to install a new cluster of machines with KAVE components. AmbariKave takes care of the automatic provisioning of Amazon resources (if needed) and the automatic deployment of services within a cluster.
  • How to get it?: See the project here, the Installation Manual, and/or the repo server (with username repos, password kaverepos)
  • What are the best practices? Read more on this wiki, and see the excellent Hortonworks Ambari wiki here

If you would like to get started, have a look at our Installation Manual

KaveToolbox

Installation Manual

For your single laptop? Or your single workstation? Or your single VM? For your development machine?

  • What Is It?: KaveToolbox is a wrapper for data processing and statistical libraries, big data toolkits, interface functionality and collaboration tools, meant to provision a blank CentOS 6/7 or Ubuntu machine with a common development environment. See the readme. We normally use IPython notebooks as our standard exploration tool, and we give examples in the KaveToolbox. See also the core analytics tools list below. In principle we are simplifying the installation of these pre-existing key components to a single-click deployment.
  • When to use it?: The KaveToolbox is a key element of the KAVE design: on your own laptop or VM you should be able to obtain exactly the same libraries as you will end up with on your KAVE. It gives the same, common look-and-feel for all KAVEs and allows distribution of our own small common libraries for data processing and visualization along with the core analytics tools.
  • How to get it?: See the project, the repo server and the Installation Manual.
  • What are the best practices?: The main advice here is to keep the default configurations, so that you have the same look-and-feel everywhere. See the project for details. Don't forget to take a look at the example IPython notebooks!

The Components

Lambda Stack

What is a Lambda Stack?

The Lambda stack is a relatively new concept in data science. It is the realization of a truly robust data-processing system, combining real-time streaming data and batch processing to increase uptime. The Lambda stack is the brainchild of Nathan Marz, the inventor of Storm, and is described in detail elsewhere.

In the case of KAVE, the Lambda stack is the key data-workflow component: not only does it allow for robust processing and the combination of real-time with batch data, it is also highly modular by design. It is not necessary to realize a full Lambda stack for every application, so we can pick and choose what we need when we need it.
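
To make the core idea concrete, here is a minimal, hypothetical Python sketch of the Lambda query-merge: queries are answered by combining a precomputed batch view with a small real-time increment. All names and numbers are illustrative.

```python
# Illustrative sketch of the Lambda query-merge (all names/values hypothetical).
# The batch layer (e.g. Hadoop) periodically recomputes complete views from the
# master data set; the speed layer (e.g. Storm) maintains increments for data
# that arrived since the last batch run. A query merges both.

batch_view = {"clicks:2017-05-09": 10421}   # recomputed by the batch layer
realtime_view = {"clicks:2017-05-09": 37}   # maintained by the speed layer

def query(key):
    """Serve a metric by combining the batch and real-time views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

print(query("clicks:2017-05-09"))  # 10458
```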

Storm: For real-time processing

  • What Is It?: Introduction to Storm, storm.apache.org
  • How does it fit in KAVE?: A Lambda stack uses real-time and batch processing together to ensure guaranteed uptime at low cost. Furthermore, Storm is a great analytics tool in its own right, able to take advantage of the complete power of your computing resources through efficient parallel processing.
  • When to use it?: Storm can be used to receive data from a stream and feed it into your Hadoop cluster; whenever you have a real-time data source, Storm is the way to process it most effectively (see the sketch after this list).
  • How to get it?: The pre-existing Ambari installer can install Storm integrated with YARN in a Hadoop cluster; however, this does not match a Lambda architecture in its purest form, nor does it allow separation of the node requirements. The StormSD component in our AmbariKave installer is needed for that.
  • What are the best practices? Running Storm
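
Storm topologies in a KAVE are typically written in Java; the pure-Python sketch below is not the Storm API, it only illustrates the kind of per-tuple processing a word-count bolt performs on a stream.

```python
# NOT the Storm API: a pure-Python illustration of per-tuple bolt logic.
from collections import Counter

counts = Counter()

def process_tuple(sentence):
    """What a word-count bolt would do for each tuple it receives."""
    for word in sentence.split():
        counts[word] += 1
        # a real bolt would emit (word, counts[word]) to downstream bolts
        print(word, counts[word])

process_tuple("the quick brown fox jumps over the lazy dog")
```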

Hadoop: For storage and batch-processing

  • What Is It?: Hadoop is almost synonymous with Big Data. In principle Hadoop began as a distributed file system with added stability and scalability; however, it has been coerced and co-opted into many different processing frameworks, allowing the data stored on HDFS to be processed in parallel across a large cluster (see the mapper/reducer sketch after this list).
  • How does it fit in KAVE?: With so many key data science tools, such as Hive and Python, having direct integration with Hadoop, and since Hadoop is open source and well known, it is the cornerstone of the KAVE around which everything else is built.
  • When to use it?: As soon as your data becomes too big to process on one machine, especially if it is highly structured data, or when your data size is expected to grow, Hadoop is a good alternative to a giant flat file.
  • What are the best practices? Ambari already incorporates a lot of ancillary tools for a Hadoop ecosystem based on what is most appropriate; see http://hadoop.apache.org/
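
As a concrete example of batch processing over HDFS, here is the classic Hadoop Streaming word count in Python — a standard, well-documented pattern; the HDFS paths and jar location in the comments are illustrative.

```python
# Hadoop Streaming pipes each input split through the mapper, sorts by key,
# then pipes the sorted stream into the reducer (paths below are illustrative):
#   hadoop jar hadoop-streaming.jar -input /data/in -output /data/out \
#       -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
import sys

def mapper(stdin=sys.stdin):
    """mapper.py: emit one 'word<TAB>1' pair per word."""
    for line in stdin:
        for word in line.split():
            print("%s\t1" % word)

def reducer(stdin=sys.stdin):
    """reducer.py: input arrives sorted by key, so sum counts per run of a key."""
    current, total = None, 0
    for line in stdin:
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            print("%s\t%d" % (current, total))
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))
```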

MongoDB: The intermediate database

  • What Is It?: MongoDB is a distributed NoSQL database, a generic document store.
  • How does it fit in KAVE?: A Lambda stack is able to process the same data through real-time (Storm) systems or historical (batch/Hadoop) systems. The results need to be consolidated in an intermediate database. MongoDB is the perfect database for this purpose, since it is horizontally scalable and, as a "document store", flexible enough for almost-unstructured data. With communication through JSON/BSON it simplifies serving the results to any web front-end.
  • When to use it?: It is a necessary component in a Lambda stack, and can also be used to add another layer of protection to your data. The JBoss server can only read from MongoDB, and results are pushed into MongoDB by the processing layers; the front-end cannot create jobs itself. The analyst decides which results to put there, so the underlying data need not be accessible even if the front-end is hacked (see the sketch after this list).
  • What are the best practices? MongoDB Operations Best Practices, the MongoDB Blog, MongoDB Performance Best Practices
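
The push-in/read-out pattern described above looks roughly as follows with pymongo; the host, database and collection names are assumptions for illustration.

```python
# Hypothetical host/db/collection names; the pattern is what matters:
# processing layers push selected results in, the front-end only reads.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo.kave.internal:27017")  # assumed host
results = client["lambda"]["results"]                        # assumed db/collection

# an analyst (or the batch/speed layer) publishes a result ...
results.replace_one({"_id": "clicks:2017-05-09"},
                    {"_id": "clicks:2017-05-09", "count": 10458},
                    upsert=True)

# ... and the results server only ever reads it back
print(results.find_one({"_id": "clicks:2017-05-09"}))
```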

JBoss: The results server

  • What Is It?: JBoss Web Server is an enterprise-ready web server designed for medium and large applications, based on Tomcat. http://jbossweb.jboss.org/
  • How does it fit in KAVE?: JBoss is the standard application server of the KAVE. In the Lambda architecture results are fed into an intermediate database, and these results are by default not visible outside your KAVE. Adding a JBoss edge node which can communicate only with MongoDB, and not with the remaining machines in the cluster, ensures security through separation of concerns. JBoss can run a full REST API, allowing us to serve the correct pieces of what is stored in MongoDB to the correct endpoints (see the sketch after this list). It is of course possible to install your own visualizations and dashboards on JBoss, provided they are Java WAR files, but most often JBoss will be configured as a REST API serving apps which live outside your KAVE.
  • When to use it?: Whenever you need to export your insights to a third-party application, but need to protect the raw data.
  • What are the best practices?: from docs.jboss.org, from rightscale.com
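
On the KAVE itself this role is played by JBoss serving Java WARs; the short Flask sketch below is only a language-neutral illustration of the read-only REST pattern described above, with all names hypothetical.

```python
# Illustration only: the real KAVE results server is JBoss, not Flask.
from flask import Flask, abort, jsonify
from pymongo import MongoClient

app = Flask(__name__)
results = MongoClient("mongodb://mongo.kave.internal:27017")["lambda"]["results"]

@app.route("/results/<key>")
def get_result(key):
    # the endpoint can only READ from MongoDB; it cannot launch jobs or touch
    # the raw data, so a compromised front-end exposes published results only
    doc = results.find_one({"_id": key}, {"_id": 0})
    if doc is None:
        abort(404)
    return jsonify(doc)

if __name__ == "__main__":
    app.run(port=8080)
```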

Gateway: How your analysts access your KAVE

  • What Is It?: The gateway is the single-point-of-entry into the KAVE for your analysts.
  • How does it fit in KAVE?: The KAVE contains many different components which are in most cases only visible from within the network. By allowing SSH connections from a restricted list of sources onto just one set of machines in the network, we harden the system. Installing the KaveToolbox in workstation mode on the gateway gives analysts the full common working environment at this single point of entry.
  • When to use it?: You should always use a gateway! If you have a single-machine cluster, that machine is by definition your gateway.
  • How to get it?: Use the AmbariKave installer if deploying on a cluster, or follow KaveToolbox installation instructions within a dedicated network.
  • What are the best practices?: 1 core + 1 core per user; 2 GB RAM + 2 GB RAM per user; 100 GB temp directory; 20 GB mounted to /opt; 100 GB home directory per user; recommended maximum of 4 analysts per gateway (these rules are encoded in the sketch below). See the KaveToolbox readme
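
Those sizing rules are simple enough to encode; the helper below is a small sketch that just restates them in Python.

```python
# Gateway sizing per the rules above: 1 core + 1 per user, 2 GB RAM + 2 GB per
# user, 100 GB temp, 20 GB on /opt, 100 GB home per user, max 4 analysts.
def gateway_size(n_users):
    if n_users > 4:
        raise ValueError("recommended maximum is 4 analysts per gateway")
    return {"cores": 1 + n_users,
            "ram_gb": 2 + 2 * n_users,
            "home_gb": 100 * n_users,
            "tmp_gb": 100,
            "opt_gb": 20}

print(gateway_size(4))
# {'cores': 5, 'ram_gb': 10, 'home_gb': 400, 'tmp_gb': 100, 'opt_gb': 20}
```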

If you are looking for instructions on how to reach your KAVE, see Accessing your KAVE

The Development Line

The KAVE is not a static product which you are stuck with until something better comes along. The KAVE is a concept which empowers you to develop your own big data solutions, putting the tools you need at your fingertips. Everything from exploratory data science through to a professional software solution is possible with KAVE, but this must be managed in the most effective way.

A mature development line, with integration testing, code quality checks, review procedures, and a well-supported internal user base can push forward your solution from the ad-hoc to the professional.

What do you need a development line for?

Any time you have a solution you wish to bring into production, you will need a modern release-management and software-quality solution. Any time you have a distributed team which needs to communicate its code, findings, tips and tricks, you will need a stable list of collaborative tools it can use. However, perhaps your code is proprietary? Perhaps your development will belong to a client? An in-built development line, living within your own KAVE, can help your own code evolve properly and securely.

TWiki: Collaborate on documentation and user experiences

  • What Is It?: TWiki is a collaboration tool where teams can contribute to a large set of interlinked webpages, without needing to know HTML or have a fixed schema in advance. http://twiki.org/
  • How does it fit in KAVE?: A team of data scientists or programmers will need somewhere to jot down their ideas and share their experiences, tips, tricks, etc. Since TWiki is open source, it provides a natural fit for this within the KAVE's open-source stack.
  • When to use it?: TWiki allows users or developers to create their own living documentation and can help create a team identity.

Gitlab: Collaborate on your code through Git

  • What Is It?: GitLab is a free and open-source alternative to GitHub Enterprise.
  • How does it fit in KAVE?: GitLab allows you to collaborate on your code, branch and share code, all within your secure KAVE environment, or share code between different KAVEs. A proper version control system such as Git is the cornerstone of effective code development.
  • When to use it?: Whenever you are working with code and need to share that code across your KAVE users, whenever you are developing on top of existing code, or whenever you wish to have an automated testing environment with Jenkins.
  • What are the best practices? Commit often, perfect later, publish once; test before commit/push/merge; use branches; agree on a workflow in advance.

Jenkins: Continuous integration testing, automatically build and distribute your code within your KAVE

  • What Is It?: Meet Jenkins
  • How does it fit in KAVE?: Jenkins automatically builds and distributes your code within your KAVE, which is necessary for a mature development process.
  • When to use it?: As needed to support integration testing and systems testing of your product running within the KAVE, especially when you have a distributed team working on several component parts.
  • What are the best practices?: https://wiki.jenkins-ci.org/display/JENKINS/Jenkins+Best+Practices

Archiva: Store and distribute your JAR files

  • What Is It?: Apache Archiva™ is extensible repository management software that helps take care of your own personal or enterprise-wide build-artifact repository. http://archiva.apache.org/
  • How does it fit in KAVE?: Hadoop and Storm both rely on Java code stored in JAR files. A complete Lambda stack should in principle be able to run the same code inside Storm as it does on Hadoop; therefore a package repository and distribution system is necessary.
  • When to use it?: Whenever you are coding large Java solutions, or need to deploy your solutions across multiple clusters, or have a complex dependency tree, then Archiva is for you.
  • What are the best practices? Stack Overflow on Maven, the Apache Maven project

SonarQube: perform static-code analysis to measure your code quality

  • What Is It?: SonarQube is an open platform to manage code quality. http://www.sonarqube.org/
  • How does it fit in KAVE?: When working with a distributed team on a production-level system, code quality assurance is of prime concern. A code-quality checker is necessary to ensure coding to conventions and to keep code transferable and maintainable. Within the KAVE development line, Sonar is seen as the best of breed for this purpose.
  • When to use it?: If you intend to create a production system and are coding yourself in Python or Java, Sonar will help you.
  • What are the best practices? Why Sonar 'rules', top ten lessons

The core analytics tools

KAVE arrives pre-packaged with the most advanced analytical tools available, via the KaveToolbox. These are not corner-case solutions for specific problems; they are general Turing-complete programming environments with statistical libraries, with which you can squeeze the most out of your data. Make your data work for you. Combine the power of CERN Big Data with the simplicity of SQL in the common language of Python.

HIVE

  • What Is It?: Hive lets you query your data on Hadoop with SQL. SQL is a very quick-to-learn query language and so is very handy.
  • How does it fit in KAVE?: If you are working with structured data, you need the freedom to explore that data dynamically. KAVE is attempting to give you every means of doing that, and the simplest recommended access is through Hive.
  • When to use it?: Whenever working with structured data stored on your Hadoop system.
  • What are the best practices?: Compress your data if you are I/O-limited; this is easy to test on a small database. Create new tables when doing data exploration whenever it speeds up your process of iteration. Template and design your queries through the web UI and then save them into a dedicated .hive file for later execution with a script (see the sketch after this list). Use Kettle integration to orchestrate your complete ETL solution through Hive.
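
The "dedicated .hive file" practice might look like the sketch below; the table and file names are invented for illustration, and 'hive -f' is the standard CLI way to execute a statement file.

```python
# Keep the query in a version-controllable .hive file, then execute it with
# the Hive CLI from a script (table/file names are illustrative).
import subprocess

with open("daily_report.hive", "w") as f:
    f.write(
        "SELECT day, COUNT(*) AS events\n"
        "FROM clickstream\n"
        "GROUP BY day;\n"
    )

# 'hive -f <file>' runs the statements in the file
subprocess.run(["hive", "-f", "daily_report.hive"], check=True)
```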

KaveToolbox

See KaveToolbox, above

IPython

  • What Is It?: On the one hand, Python is a Turing-complete programming language. On the other hand, IPython and the IPython notebook are a really great way of exploring your data and combining it with analytical techniques. It is fun, fast, and powerful. Python takes the pain out of coding an analysis and gives you the quickest route through your data exploration, with no in-built limits against big data.
  • How does it fit in KAVE?: Python has native connectors for all the components of the KAVE and the SciPy library contains statistical tools. Python connects to R and ROOT.
  • When to use it?: When exploring your data or developing an analysis prototype. When creating plots and visualizations.
  • How to get it?: Part of the KaveToolbox, above, and needs to be installed on all nodes where code may in principle be executed during the data workflow.
  • What are the best practices? IPython is a very flexible tool for designing an analysis exactly as the analyst wants it to be. Best practices are difficult to define and evolve directly with the ever-expanding list of packages Python provides. We recommend the following as good practices (see the example after this list): defensive programming, functional programming, test-driven development, and the PEP-8 coding standard.
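
A tiny example in the spirit of those practices: a small, pure, defensively written function with its tests right next to it, PEP-8 throughout.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)          # accept any iterable, including generators
    if not values:                 # defensive: fail loudly on bad input
        raise ValueError("mean() of an empty sequence")
    return sum(values) / float(len(values))

# test-driven habit: assertions live right next to the definition
assert mean([1, 2, 3]) == 2.0
assert mean(x * x for x in (1, 2, 3)) == 14.0 / 3
```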

ROOT

  • What Is It?: ROOT is by far the most advanced statistical and data science tool available, developed by scientists, for scientists, at CERN. It is designed around the Big Data processing requirements of the LHC and is in effect a complete data science platform in its own right, with interactive plotting and in-built extensibility. Extensibility comes in three ways: macro-writing (no macro recording), the Python interface and decorator options, and compilation of your own ROOT classes against the C++ libraries. Historically, however, ROOT is difficult to use without a firm background in programming, C++ and Python. It is also difficult to install generically, since CERN supports a limited range of Linux platforms and the Python interface needs rebuilding should a non-system Python be required.
  • How does it fit in KAVE?: ROOT libraries, especially RooFit, are unique. There is no sufficiently advanced alternative to ROOT for advanced fitting and multivariate analysis. Since it is designed by data scientists and has an open-source user community of thousands of PhD-and-beyond-level contributors, it continues to go from strength to strength.
  • When to use it?: ROOT is best used for fitting and multivariate analysis, or when the data size exceeds what can safely be held in local memory. Pandas, from the SciPy ecosystem, is an alternative only when the data fits within local memory.
  • How to get it?: Part of the KaveToolbox, above, and needs to be installed on all nodes where code may in principle be executed during the data workflow.
  • What are the best practices?: It is very difficult to be specific here, but we recommend the good practice of using Python (especially IPython) for exploration with ROOT through PyROOT (see the sketch after this list). See our example. It is not a good idea to migrate ROOT versions during an analysis, unless a bugfix is required.
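
A minimal PyROOT session along those lines, assuming a KaveToolbox-style ROOT installation: fill a histogram with toy data and fit it interactively.

```python
import ROOT

# 100-bin histogram filled with toy data from a standard normal
h = ROOT.TH1F("h", "Gaussian toy data;x;entries", 100, -5, 5)
for _ in range(10000):
    h.Fill(ROOT.gRandom.Gaus(0.0, 1.0))

h.Fit("gaus")   # built-in Gaussian fit; ROOT prints the fitted mean and sigma
h.Draw()        # renders the canvas in an interactive/notebook session
```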

R

  • What Is It?: R is a very popular and well-known statistical programming language and set of libraries for data processing, used widely in the social sciences and in existing Big Data platforms.
  • How does it fit in KAVE?: The libraries are installed onto all processing nodes, integrated into IPython, and may be used alongside ROOT and SciPy (see the sketch after this list).
  • When to use it?: If you are familiar with R.
  • How to get it?: Part of the KaveToolbox, above, and needs to be installed on all nodes where code may in principle be executed during the data workflow.
  • What are the best practices?: R-project documentation
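
One common way R is exposed inside IPython is through rpy2; whether rpy2 is the bridge used here is an assumption, but the pattern looks like this.

```python
# Assumes rpy2 is installed as the Python<->R bridge.
import rpy2.robjects as robjects

# evaluate an R expression and pull the result back into Python
r_mean = robjects.r("mean(c(1, 2, 3, 4))")[0]
print(r_mean)  # 2.5

# inside an IPython notebook the same bridge provides cell magics:
#   %load_ext rpy2.ipython
#   %R summary(rnorm(100))
```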

User experience features

Apache

  • What Is It?: A simple HTTP web server
  • How does it fit in KAVE?: In case you want to configure and monitor an edge node serving some simple websites, without needing to install and monitor Apache yourself. In principle it is not a strict requirement to have such a machine in your KAVE.
  • When to use it?: In case you want to configure and monitor an edge node serving some simple websites
  • How to get it?: Use the AmbariKave installer if deploying on a cluster
  • What are the best practices? geekflare.com, techtarget.com, stackexchange

KaveLanding

  • What Is It?: A simple web UI giving quicklinks for accessing your KAVE resources. KaveLanding runs a very simple Apache server serving one static webpage which displays internal links only.
  • How does it fit in KAVE?: The gateway is the one point of entry for developers, and it is easy to get lost in what you are able to access over your dynamic KAVE tunnel. The KaveLanding page brings all these pieces back together in a shortcut/quicklink menu for your convenience, and is often the user's homepage when reaching the KAVE.
  • When to use it?: Whenever you have a gateway machine, or data analysts who are not expected to have admin rights on the ambari server.
  • What are the best practices?: Install on each gateway. The gateways should in principle never run other high-load or high-uptime services apart from the KaveLanding page, which is designed to be very lightweight.

FreeIPA

  • What Is It?: Advanced user management. IPA can manage user access to a host of services and machines. http://www.freeipa.org/
  • How does it fit in KAVE?: Each service within your KAVE can be controlled separately, which allows for separation of user rights; but in the case of an integrated team who wish to access the same resources, or a particularly large team, a more integrated solution is preferred. FreeIPA is the secure solution which integrates best with the KAVE's open-source mantra.
  • When to use it?: Whenever you have an integrated team wishing to access the same resources, or a particularly large user base.
  • What are the best practices?: Consider flexibility when deciding on administrators; it is most usual for a data science team to self-elect a user administrator per KAVE. If nobody in the team has admin rights, this significantly reduces the performance of the team.

Table of Contents

For users, installers, and other persons interested in the KAVE, or developing solutions on top of a KAVE.

Kave on Azure

For contributors

For someone who modifies the AmbariKave code itself and contributes to this project. Persons working on top of existing KAVEs or developing solutions on top of KAVE don't need to read this second part.
