Commit

[DOCS] Add docs on YARN
relates #321
costin committed Nov 14, 2014
1 parent 3f7c086 commit a714638
Showing 17 changed files with 458 additions and 44 deletions.
31 changes: 29 additions & 2 deletions docs/src/reference/asciidoc/core/intro.adoc
@@ -1,9 +1,33 @@
[[reference]]
= Reference
= Elasticsearch for Apache Hadoop

[partintro]
--
{ehtm} is an <<license, open-source>>, stand-alone, self-contained, small library that allows Hadoop jobs (whether using {mr} or libraries built upon it such as Hive, Pig or Cascading, or new upcoming libraries like {sp}) to 'interact' with {es}. One can think of it as a _connector_ that allows data to flow 'bi-directionally' so that applications can transparently leverage the {es} engine capabilities to significantly enrich their functionality and increase their performance.

{ehtm} offers first-class support for vanilla {mr}, Cascading, Pig and Hive so that using {es} is literally like using resources within the Hadoop cluster. As such,
{ehtm} is a _passive_ component, allowing Hadoop jobs to use it as a library and interact with {es} through {ehtm} APIs.
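
To give a concrete feel for what using {eh} 'as a library' means, the snippet below is a minimal sketch of the single Maven dependency that typically pulls it into a Hadoop job (the version shown is purely illustrative - see the installation chapter for the current release):

[source,xml]
----
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <!-- illustrative version only - pick the latest release -->
  <version>2.1.0.Beta3</version>
</dependency>
----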

[[project-name-alias]]
While the official name of the project is {ehtm}, throughout the documentation the term {eh} will be used instead to increase readability.

include::intro/typos.adoc[]

TIP: If you are looking for {es} HDFS Snapshot/Restore plugin (a separate project), please refer to its https://github.com/elasticsearch/elasticsearch-hadoop/tree/master/repository-hdfs[home page].

TIP: If you are looking for {es} on YARN (a separate project), please refer to its dedicated <<es-yarn,section>>.
--

[[doc-sections]]
== Documentation sections
The documentation is broken down into two parts:

=== Setup & Requirements
This <<features,section>> provides an overview of the project and its requirements (including the supported environments and libraries), plus information on how to easily install {eh} in your environment.

=== Reference Documentation
This part of the documentation explains the core functionality of {eh}, starting with the configuration options and architecture and gradually explaining the various major features. At a higher level, the reference is broken down into the architecture and configuration sections (which are general), {mr} and the libraries built on top of it, upcoming computation libraries (like {sp}), and finally mapping, metrics and troubleshooting.

We recommend going through the entire documentation, even superficially, when trying out {eh} for the first time; however, those in a rush can jump directly to the desired sections:

<<arch>>:: overview of the {eh} architecture and how it maps on top of Hadoop
@@ -25,4 +25,7 @@ We recommend going through the entire documentation even superficially when tryi
<<metrics>>:: Elasticsearch Hadoop metrics

<<troubleshooting>>:: tips on troubleshooting and getting help
--


include::intro/intro.adoc[]

5 changes: 5 additions & 0 deletions docs/src/reference/asciidoc/core/intro/intro.adoc
@@ -0,0 +1,5 @@
include::features.adoc[]

include::requirements.adoc[]

include::download.adoc[]
@@ -23,14 +23,14 @@ java version "1.7.0_55"
[[requirements-es]]
=== {es}

version *0.90* or higher is needed to run {eh}, though we highly recommend using the latest, stable Elasticsearch (currently 1.3.x). Using a lower version is not possible as {eh} uses new features added in 0.90 for distributed, parallel interactions with {es}.
version *0.90* or higher is needed to run {eh}, though we *highly* recommend using the latest, stable Elasticsearch (currently 1.4.0). Using a lower version is not possible as {eh} uses new features added in 0.90 for distributed, parallel interactions with {es}.

The {es} version is shown in its folder name:

[source,bash]
----
$ ls
elasticsearch-1.3.4
elasticsearch-1.4.0
----

If {es} is running (locally or remotely), one can find out its version through REST:
@@ -42,11 +42,11 @@ $ curl -XGET http://localhost:9200
"status" : 200,
"name" : "Dazzler",
"version" : {
"number" : "1.3.4",
"build_hash" : "a70f3ccb52200f8f2c87e9c370c6597448eb3e45",
"build_timestamp" : "2014-09-30T09:07:17Z",
"number" : "1.4.0",
"build_hash" : "bc94bd81298f81c656893ab1ddddd30a99356066",
"build_timestamp" : "2014-11-05T14:26:12Z",
"build_snapshot" : false,
"lucene_version" : "4.9"
"lucene_version" : "4.10.2"
},
"tagline" : "You Know, for Search"
}
@@ -79,12 +79,14 @@ As a guide, the table below lists the Hadoop-based distributions against with th
| Amazon EMR | 3.0.x
| Amazon EMR | 2.4.x

| Cloudera CDH | 5.2.x
| Cloudera CDH | 5.1.x
| Cloudera CDH | 5.0.x
| Cloudera CDH | 4.5.x
| Cloudera CDH | 4.4.x
| Cloudera CDH | 4.2.x

| Hortonworks HDP | 2.2.x
| Hortonworks HDP | 2.1.x
| Hortonworks HDP | 2.0.x
| Hortonworks HDP | 1.3.x
20 changes: 19 additions & 1 deletion docs/src/reference/asciidoc/index.adoc
@@ -1,15 +1,33 @@
= Elasticsearch for Apache Hadoop

:icons:
:ehtm: Elasticsearch for Apache Hadoop
:eh: elasticsearch-hadoop
:es: Elasticsearch
:mr: Map/Reduce
:sp: Apache Spark
:st: Apache Storm
:ey: Elasticsearch on YARN
:ref: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current
:description: Reference documentation of {eh}

include::intro/index.adoc[]

[[float]]
[preface]
== Preface

{ehtm} is an `umbrella' project consisting of three similar, yet independent sub-projects, each with its own dedicated section in the documentation:

{ey}:: run {es} on top of YARN - see <<es-yarn>>

repository-hdfs:: use HDFS as a repository back-end, that is, as storage for doing snapshot/restore from/to {es}. For more information refer to its https://github.com/elasticsearch/elasticsearch-hadoop/tree/master/repository-hdfs[home page]

{eh} proper:: interact with {es} from within a Hadoop environment. If you are using {mr}, Cascading, Hive, Pig, {sp} or {st}, this project is for you.


Thus, while all projects fall under the Hadoop umbrella, each covers a certain aspect of it, so please be sure to read the appropriate documentation.

include::yarn/index.adoc[]

include::core/index.adoc[]

25 changes: 0 additions & 25 deletions docs/src/reference/asciidoc/intro/index.adoc

This file was deleted.

47 changes: 47 additions & 0 deletions docs/src/reference/asciidoc/yarn/download.adoc
@@ -0,0 +1,47 @@
[[ey-install]]
== Installation

{ey} binaries can be obtained either by downloading them from the http://elasticsearch.org[elasticsearch.org] site as a ZIP (containing project jars, sources and documentation) or by using any http://maven.apache.org/[Maven]-compatible tool with the following dependency:

[source,xml]
----
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-yarn</artifactId>
<version>2.1.0.Beta3</version>
</dependency>
----

The jar above contains {ey} and does not require any other dependencies at runtime; in other words, it can be used as-is.

[[ey-download-dev]]
=== Development Builds

Development (or nightly or snapshot) builds are published daily to the 'sonatype-oss' repository (see below). Make sure to use snapshot versioning:

[source,xml]
----
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-yarn</artifactId>
<version>2.1.0.BUILD-SNAPSHOT</version> <1>
</dependency>
----

<1> notice the 'BUILD-SNAPSHOT' suffix indicating a development build

but also enable the dedicated snapshots repository:

[source,xml]
----
<repositories>
<repository>
<id>sonatype-oss</id>
<url>http://oss.sonatype.org/content/repositories/snapshots</url> <1>
<snapshots><enabled>true</enabled></snapshots> <2>
</repository>
</repositories>
----

<1> add snapshot repository
<2> enable the 'snapshots' capability on the repository, otherwise the builds will not be found by Maven
20 changes: 20 additions & 0 deletions docs/src/reference/asciidoc/yarn/index.adoc
@@ -0,0 +1,20 @@
[[es-yarn]]
= Elasticsearch on YARN

[[es-yarn-intro]]
[partintro]
--
Distributed as part of {ehtm}, {ey} is a separate, stand-alone, self-contained CLI (command-line interface) that allows {es} to run (and thus be managed) within a http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html[YARN] environment.

In other words, {es} can use YARN to allocate its resources and be started and stopped, on said resources, through the YARN infrastructure.
--

include::requirements.adoc[]

include::download.adoc[]

include::setup.adoc[]

include::usage.adoc[]


53 changes: 53 additions & 0 deletions docs/src/reference/asciidoc/yarn/requirements.adoc
@@ -0,0 +1,53 @@
[[yarn-requirements]]
== Requirements

Before using {ey}, please pay attention to the requirements below - ignoring them can lead to abnormal behavior, errors and ultimately a poor experience or even data loss.

NOTE: make sure to verify *all* nodes in a cluster when checking the version of a certain artifact.

[[ey-requirements-yarn]]
=== YARN

A YARN environment running on Hadoop 2.4 (or higher) is recommended. This can be easily checked by verifying the Hadoop version installed on the target nodes:

[source,bash]
----
$ hadoop version
Hadoop 2.4.1
Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1604318
Compiled by jenkins on 2014-06-21T05:43Z
Compiled with protoc 2.5.0
From source with checksum bb7ac0a3c73dc131f4844b873c74b630
This command was run using /opt/share/hadoop/common/hadoop-common-2.4.1.jar
----

For Hadoop distros, check the base core YARN/Hadoop version and make sure it is 2.4 compatible.

As a guide, the table below lists the Hadoop-based distributions that include YARN, against which this version has been tested at various points in time:

|===
| Distribution | Release

| Apache Hadoop | 2.5.x
| Apache Hadoop | 2.4.x

| Amazon EMR | 3.3.x
| Amazon EMR | 3.2.x
| Amazon EMR | 3.1.x

| Cloudera CDH | 5.2.x
| Cloudera CDH | 5.1.x
| Cloudera CDH | 5.0.x

| Hortonworks HDP | 2.2.x
| Hortonworks HDP | 2.1.x

| MapR | 4.0.x
|===


[[ey-requirements-es]]
=== {es}

{ey} has the same {es} requirements as {eh} - in other words, using the latest stable {es} is highly recommended for both stability and performance reasons.
55 changes: 55 additions & 0 deletions docs/src/reference/asciidoc/yarn/setup.adoc
@@ -0,0 +1,55 @@
[[ey-setup]]
== Understanding the YARN environment

[quote, Wikipedia]
____
http://hadoop.apache.org/[YARN] stands for "Yet Another Resource Negotiator" and was added later as part of Hadoop 2.0. YARN takes the resource management capabilities that were in MapReduce and packages them so they can be used by new engines. This also streamlines MapReduce to do what it does best, process data. With YARN, you can now run multiple applications in Hadoop, all sharing a common resource management. As of September, 2014, YARN manages only CPU (number of cores) and memory [..]
____

In its current incarnation, {ey} interacts with the YARN APIs in order to start and stop Elasticsearch nodes on YARN infrastructure. In YARN terminology, {ey} has several components:

Client Application::
The entity that bootstraps the entire process and controls the life-cycle of the cluster based on user feedback. This is the CLI (Command-Line Interface) that the user interacts with.

Application Manager::
Based on the user configuration, the dedicated +ApplicationManager+ negotiates with YARN the number of {es} nodes to be created as YARN containers and their capabilities (memory and CPU).
It oversees the cluster life-cycle and handles the container allocation.

Node/Container setup::
Handles the actual {es} invocation for each allocated +Container+.

As YARN is all about cluster resource management, it is important to properly configure YARN and {es} accordingly, since over- or under-allocating resources can lead to undesirable consequences. There are plenty of resources
available on how to configure and plan your YARN cluster; the sections below touch on the core components and their impact on {es}.

=== CPU

As of Hadoop 2.4, YARN can restrict the amount of CPU allocated to a container: each has a number of so-called +vcores+ (virtual cores), with a minimum of 1. What this translates to in practice depends highly on your
underlying hardware and cluster configuration; a good approximation is to map each +vcore+ to an actual core on the CPU. Just as with native hardware, expect each core to be shared with the rest of the
applications, so depending on system load the amount of actual CPU available can be considerably lower. Thus, it is recommended to allocate multiple +vcores+ to {es} - a good starting number being the number of actual cores
your CPU supports.
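
The +vcore+ budget available to containers is governed by standard YARN settings in +yarn-site.xml+. The snippet below is only a sketch - the property names are stock Hadoop 2.4 settings while the values are purely illustrative - showing a node offering 8 virtual cores with per-container allocations capped between 1 and 4:

[source,xml]
----
<!-- yarn-site.xml : illustrative values only, tune them for your own hardware -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value> <!-- vcores this node offers to YARN containers -->
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value> <!-- smallest vcore allocation a container may request -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>4</value> <!-- largest vcore allocation a container may request -->
</property>
----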

=== Memory

Simplifying things a lot, YARN requires containers to specify the amount of memory they need within a certain band - requesting more or less memory results in the container allocation request being denied. By default, YARN
enforces a minimum of 1 GB (1024 MB) and a maximum of 8 GB (8192 MB) per container. While {es} can work with this amount of RAM, you typically want to increase it for performance reasons.
Out of the box, {ey} requests _only_ 2 GB of memory per container so that users can easily try it out even within a testing YARN environment (such as a pseudo-distributed setup or a VM); significantly increase this amount once you get
beyond the YARN `Hello World' stage.
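
The memory band itself is controlled through standard YARN settings. As a sketch only - stock Hadoop property names, illustrative values - the +yarn-site.xml+ excerpt below raises the upper bound of the default 1 GB-8 GB band mentioned above:

[source,xml]
----
<!-- yarn-site.xml : illustrative values only; the defaults are 1024 and 8192 MB -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value> <!-- smallest memory allocation (in MB) a container may request -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>16384</value> <!-- largest memory allocation (in MB) a container may request -->
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>32768</value> <!-- total memory (in MB) this node offers to YARN containers -->
</property>
----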

=== Storage

Each container inside YARN is responsible for saving its state and storage between container restarts. In general, there are two main strategies one can take:

Use a globally accessible file-system (like HDFS):: With storage accessible by all Hadoop nodes, each container can use it as its backing store. For example, one can use HDFS to save the data in one container and read it from another.
With {es}, one can simply mount HDFS as an https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html[NFS gateway] and point each {es} node to a folder on it.
Note however that performance is going to suffer significantly - HDFS is designed as big, network-based storage, thus each call is likely to incur a significant delay (due to network latency).

Use the container node local storage:: Each container can currently access its local storage - with proper configuration this can be kept outside the disposable container folder, thus allowing the data to _live_ between restarts.
This is the recommended approach as it offers the best performance and, thanks to {es} itself, redundancy as well (through replicas).

Note that both approaches above require additional configuration of either {es} or your YARN cluster. There are plans to simplify this procedure in the future.

IMPORTANT: If no storage is configured, out of the box {es} will use its container storage, which means that when the container is disposed of, so is its data. In other words, any existing data is _destroyed_ between restarts.

=== Node affinity

Currently, {ey} does not provide any option for tying {es} nodes to specific YARN nodes; however, this will be addressed in the future. In practice this means that, unless YARN is specifically configured, there are no guarantees on the cluster topology between restarts, that is, on which machines the {es} nodes will run each time.
