Fixed the docs to build with docbook #76

Merged 1 commit on Aug 19, 2013
4 changes: 2 additions & 2 deletions docs/src/reference/asciidoc/appendix/index.adoc
@@ -1,6 +1,6 @@

-include::appendix/resources.adoc[]
+include::resources.adoc[]

-include::appendix/license.adoc[]
+include::license.adoc[]


2 changes: 1 addition & 1 deletion docs/src/reference/asciidoc/appendix/resources.adoc
@@ -8,7 +8,7 @@ Source repository:: http://github.com/elasticsearch/elasticsearch-hadoop[GitHub]

Issue tracker:: http://github.com/elasticsearch/elasticsearch-hadoop/issues[GitHub]

-Mailing list / forum:: https://groups.google.com/forum/?fromgroups#!forum/elasticsearch[Google Groups] - please add +[Hadoop]+ prefix to the topic message
+Mailing list / forum:: https://groups.google.com/forum/?fromgroups#!forum/elasticsearch[Google Groups] - please add the `[Hadoop]` prefix to the topic message

Twitter:: http://twitter.com/elasticsearch[Elasticsearch], http://twitter.com/costinl[Costin Leau]

16 changes: 10 additions & 6 deletions docs/src/reference/asciidoc/core/cascading.adoc
@@ -8,8 +8,8 @@

Cascading abstracts the {mr} API, focusing on http://docs.cascading.org/cascading/2.1/userguide/html/ch03.html[data processing]
in terms of 'tuples' http://docs.cascading.org/cascading/2.1/userguide/html/ch03s08.html['flowing'] through http://docs.cascading.org/cascading/2.1/userguide/html/ch03s02.html[pipes] between http://docs.cascading.org/cascading/2.1/userguide/html/ch03s05.html['taps'],
-from input (called +SourceTap+) to output (named +SinkTap+). As the data flows, various operations are applied to the tuple; the whole system being transformed to {mr} operations at runtime.
-With {eh}, {es} can be plugged into Cascading flows as a +SourceTap+ or +SinkTap+ through +ESTap+.
+from input (called `SourceTap`) to output (named `SinkTap`). As the data flows, various operations are applied to the tuples; the whole system is transformed into {mr} operations at runtime.
+With {eh}, {es} can be plugged into Cascading flows as a `SourceTap` or `SinkTap` through `ESTap`.

****
.Local or Hadoop mode?
@@ -18,24 +18,27 @@ Cascading supports two 'execution' modes or http://docs.cascading.org/cascading/2.1/userguide/html/ch03s04.html[platforms]:
Local:: for unit testing and quick POCs. Everything runs only on the local machine and file-system.
Hadoop:: production mode - connects to a proper Hadoop cluster (as opposed to the 'local' mode, which runs just on the local machine).

-{eh} supports *both* platforms automatically. One does not have to choose between different classes, +EsTap+ can be used as both +sink+ or +source+, in both modes transparently.
+{eh} supports *both* platforms automatically. One does not have to choose between different classes; `EsTap` can be used as both `sink` and `source`, in both modes transparently.
****

+[float]
=== Installation

Just like other libraries, the {eh} jar needs to be available in the job's classpath (either by being manually deployed in the cluster or shipped along with the Hadoop job).

[[type-conversion-cascading]]
+[float]
=== Type conversion

-Depending on the http://docs.cascading.org/cascading/2.1/userguide/html/ch03s04.html[platform] used, Cascading can use internally either +Writable+ or JDK types for its tuples. {es} handles both transparently
+Depending on the http://docs.cascading.org/cascading/2.1/userguide/html/ch03s04.html[platform] used, Cascading can use internally either `Writable` or JDK types for its tuples. {es} handles both transparently
(see the {mr} <<type-conversion-writable,conversion>> section) though we recommend using the same types (if possible) in both cases to avoid the overhead of maintaining two different versions.

IMPORTANT: If automatic index creation is used, please review <<auto-mapping-type-loss,this>> section for more information.

+[float]
=== Writing data to {es}

-Simply hook, +ESTap+ into the Cascading flow:
+Simply hook `ESTap` into the Cascading flow:

[source,java]
----
@@ -45,9 +48,10 @@ Tap out = new ESTap("radio/artists", new Fields("name", "url", "picture"));
new HadoopFlowConnector().connect(in, out, new Pipe("write-to-Eleasti")).complete();
----
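
For reference, a complete write flow reads along these lines - a minimal sketch in which the input tap, file path and field list are illustrative assumptions, not part of the original example:

[source,java]
----
// sketch: stream a tab-separated local file into the 'radio/artists' resource
Tap in = new Lfs(new TextDelimited(new Fields("id", "name", "url", "picture")), "src/test/resources/artists.dat");
Tap out = new ESTap("radio/artists", new Fields("name", "url", "picture"));
new HadoopFlowConnector().connect(in, out, new Pipe("write-to-Es")).complete();
----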

+[float]
=== Reading data from {es}

-Just the same, add +ESTap+ on the other end of a pipe, to read (instead of writing) to it.
+Just the same, add `ESTap` on the other end of a pipe to read from it (instead of writing to it).

[source,java]
----
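// illustrative reconstruction, not the original snippet: read the results
// of the 'q=me*' query and print them to the console on the local platform
Tap in = new ESTap("radio/artists/_search?q=me*");
Tap out = new StdOut(new TextLine());
new LocalFlowConnector().connect(in, out, new Pipe("read-from-Es")).complete();
----
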
48 changes: 27 additions & 21 deletions docs/src/reference/asciidoc/core/configuration.adoc
@@ -1,17 +1,18 @@
[[configuration]]
== Configuration options

-{eh} behavior can be customized through the properties below, typically by setting them on the target job Hadoop +Configuration+. However some of them can be specified through other means depending on the library used (see the relevant section).
+{eh} behavior can be customized through the properties below, typically by setting them on the target job's Hadoop `Configuration`. However, some of them can be specified through other means depending on the library used (see the relevant section).

****
{eh} uses the same conventions and reasonable defaults as {es}, so you can try it out without bothering with the configuration. Most of the time, these defaults are just fine for running a production cluster; if you are fine-tuning your cluster or wondering about the effect of a certain configuration option, please _do ask_ for more information.
****

-NOTE: All configuration properties start with the +es+ prefix. The namespace +es.internal+ is reserved by the library for its internal use and should _not_ be used by the user at any point.
+NOTE: All configuration properties start with the `es` prefix. The namespace `es.internal` is reserved by the library for its internal use and should _not_ be used by the user at any point.

+[float]
=== Required settings

-.+es.resource+
+.`es.resource`
{es} resource location, relative to the {es} host/port (see below). Can be either an index/type (for writing) or a search query.

.Examples
@@ -21,46 +22,50 @@ es.resource = twitter/costinl # write to index 'twitter', type 'costinl'
es.resource = twitter/costinl/_search?q=hadoop # read entries matching 'hadoop' from 'twitter/costinl'
----

+[float]
=== Optional settings

+[float]
==== Network
-+es.host+ (default localhost)::
+`es.host` (default localhost)::
{es} cluster node. When using {es} remotely, _do_ set this option.

-+es.port+ (default 9200)::
+`es.port` (default 9200)::
HTTP/REST port used for connecting to {es}.

-+es.http.timeout+ (default 1m)::
+`es.http.timeout` (default 1m)::
Timeout for HTTP/REST connections to {es}.

-+es.scroll.keepalive+ (default 10m)::
+`es.scroll.keepalive` (default 10m)::
The maximum duration of result scrolls between query requests.

-+es.scroll.size+ (default 50)::
+`es.scroll.size` (default 50)::
Number of results/items returned by each individual scroll.

[[configuration-options-index]]
+[float]
==== Index

-+es.index.auto.create+ (default yes)::
+`es.index.auto.create` (default yes)::
Whether {eh} should create an index (if it's missing) when writing data to {es}, or fail.

+[float]
==== Serialization

-+es.batch.size.bytes+ (default 10mb)::
+`es.batch.size.bytes` (default 10mb)::
Size (in bytes) for batch writes using {es} http://www.elasticsearch.org/guide/reference/api/bulk/[bulk] API

-+es.batch.size.entries+ (default 0/disabled)::
-Size (in entries) for batch writes using {es} http://www.elasticsearch.org/guide/reference/api/bulk/[bulk] API. Companion to +es.batch.size.bytes+, once one matches, the batch update is executed.
+`es.batch.size.entries` (default 0/disabled)::
+Size (in entries) for batch writes using the {es} http://www.elasticsearch.org/guide/reference/api/bulk/[bulk] API. Companion to `es.batch.size.bytes`; once either limit is reached, the batch update is executed.

-+es.batch.write.refresh+ (default true)::
+`es.batch.write.refresh` (default true)::
Whether to invoke an http://www.elasticsearch.org/guide/reference/api/admin-indices-refresh/[index refresh] or not after a bulk update has been completed. Note this is called only after the entire write (meaning multiple bulk updates) has been executed.

-+es.ser.reader.class+ (default _depends on the library used_)::
-Name of the +ValueReader+ implementation for converting JSON to objects. This is set by the framework depending on the library ({mr}, Cascading, Hive, Pig, etc...) used.
+`es.ser.reader.class` (default _depends on the library used_)::
+Name of the `ValueReader` implementation for converting JSON to objects. This is set by the framework depending on the library ({mr}, Cascading, Hive, Pig, etc.) used.

-+es.ser.writer.class+ (default _depends on the library used_)::
-Name of the +ValueWriter+ implementation for converting objects to JSON. This is set by the framework depending on the library ({mr}, Cascading, Hive, Pig, etc...) used.
+`es.ser.writer.class` (default _depends on the library used_)::
+Name of the `ValueWriter` implementation for converting objects to JSON. This is set by the framework depending on the library ({mr}, Cascading, Hive, Pig, etc.) used.
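
Put together, wiring a few of these options into a job looks something like the following sketch (the host name and batch size are made-up values, not defaults):

[source,java]
----
Configuration conf = new Configuration();
conf.set("es.resource", "radio/artists");     // target index/type
conf.set("es.host", "es-server");             // hypothetical remote {es} node
conf.set("es.port", "9200");
conf.set("es.batch.size.entries", "1000");    // flush the bulk batch every 1000 entries
----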

[[runtime-configuration]]
== Hadoop runtime options
@@ -69,6 +74,7 @@ When using {eh}, it is important to be aware of the following Hadoop configurations

IMPORTANT: Unfortunately, these settings need to be set up *manually* *before* the job / script configuration. Since {eh} is called too late in the life-cycle, after the tasks have already been dispatched, it cannot influence the execution anymore.

+[float]
=== Speculative execution

[quote, Yahoo! developer network]
@@ -79,12 +85,12 @@ ____
In other words, speculative execution is an *optimization*, enabled by default, that allows Hadoop to create duplicate tasks of those which it considers hanged or slowed down. When doing data crunching or reading resources, having duplicate tasks is harmless and means at most a waste of computation resources; however, when writing data to an external store, this can cause data corruption through duplicates or unnecessary updates.
Since the 'speculative execution' behavior can be triggered by external factors (such as network or CPU load, which in turn cause false positives) even in stable environments (virtualized clusters are particularly prone to this) and has a direct impact on data, {eh} disables this optimization for data safety.

-Speculative execution can be disabled for the map and reduce phase - we recommend disabling in both cases - by setting to +false+ the following two properties:
+Speculative execution can be disabled for the map and reduce phase - we recommend disabling it in both cases - by setting the following two properties to `false`:

-+mapred.map.tasks.speculative.execution+
-+mapred.reduce.tasks.speculative.execution+
+`mapred.map.tasks.speculative.execution`
+`mapred.reduce.tasks.speculative.execution`

-One can either set the properties by name manually on the +Configuration+/+JobConf+ client:
+One can either set the properties by name manually on the `Configuration`/`JobConf` client:

[source,java]
----
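// illustrative sketch, not the original snippet: disable speculative execution
// for both phases on the classic 'mapred' JobConf
JobConf jobConf = new JobConf();
jobConf.setBoolean("mapred.map.tasks.speculative.execution", false);
jobConf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
----
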
65 changes: 35 additions & 30 deletions docs/src/reference/asciidoc/core/hive.adoc
@@ -8,11 +8,11 @@ ____

Hive abstracts Hadoop through an SQL-like language called HiveQL, so that users can apply data defining and manipulating operations to it, just like with SQL. In Hive, data sets are https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-DDLOperations[defined] through 'tables' (that expose type information) in which data can be https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-DMLOperations[loaded] and https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-SQLOperations[selected and transformed] through built-in operators or custom/user-defined functions (or https://cwiki.apache.org/confluence/display/Hive/OperatorsAndFunctions[UDF]s).

+[float]
=== Installation

Make the {eh} jar available in the Hive classpath. Depending on your options, there are various https://cwiki.apache.org/confluence/display/Hive/HivePlugins#HivePlugins-DeployingjarsforUserDefinedFunctionsandUserDefinedSerDes[ways] to achieve that. Use the https://cwiki.apache.org/Hive/languagemanual-cli.html#LanguageManualCli-HiveResources[ADD] command to add files, jars (what we want) or archives to the classpath:

[source]
----
ADD JAR /path/elasticsearch-hadoop.jar;
----
@@ -27,9 +27,9 @@ As an alternative, one can use the command-line:
----
$ bin/hive -hiveconf hive.aux.jars.path=/path/elasticsearch-hadoop.jar
----
-or if the +hive-site.xml+ configuration can be modified, one can register additional jars through the +hive.aux.jars.path+ option (that accepts an URI as well):
+or, if the `hive-site.xml` configuration can be modified, one can register additional jars through the `hive.aux.jars.path` option (which accepts a URI as well):

-.+hive-ste.xml+ configuration
+.`hive-site.xml` configuration
[source,xml]
----
<property>
@@ -40,9 +40,10 @@ or if the +hive-site.xml+ configuration can be modified, one can register additional jars through the +hive.aux.jars.path+ option (that accepts an URI as well):
----
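
Spelled out, the registration amounts to something like the following (the jar path is the illustrative one used above):

[source,xml]
----
<property>
    <name>hive.aux.jars.path</name>
    <value>/path/elasticsearch-hadoop.jar</value>
</property>
----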

[[hive-configuration]]
+[float]
=== Configuration

-When using Hive, one can use +TBLPROPERTIES+ to specify the <<configuration,configuration>> properties (as an alternative to Hadoop +Configuration+ object) when declaring the external table backed by {es}:
+When using Hive, one can use `TBLPROPERTIES` to specify the <<configuration,configuration>> properties (as an alternative to the Hadoop `Configuration` object) when declaring the external table backed by {es}:

[source,sql]
----
@@ -55,10 +56,11 @@ TBLPROPERTIES('es.resource' = 'radio/artists/',
<1> {eh} settings

[[hive-alias]]
+[float]
=== Mapping

By default, {eh} uses the Hive table schema to map the data in {es}, using both the field names and types in the process. There are cases however when the names in Hive cannot
-be used with {es} (the field name can contain characters accepted by {es} but not by Hive). For such cases, one can use the +es.column.aliases+ setting which accepts a comma-separated list of names mapping in the following format: ++Hive field name+:++{es} field name++
+be used with {es} (the field name can contain characters accepted by {es} but not by Hive). For such cases, one can use the `es.column.aliases` setting, which accepts a comma-separated list of name mappings in the following format: `Hive field name`:`{es} field name`

To wit:

@@ -71,52 +73,54 @@ TBLPROPERTIES('es.resource' = 'radio/artists/',
----

<1> name mapping for two fields
-<2> Hive column +date+ mapped in {es} to +@timestamp+
-<3> Hive column +url+ mapped in {es} to +url_123+
+<2> Hive column `date` mapped in {es} to `@timestamp`
+<3> Hive column `url` mapped in {es} to `url_123`

TIP: {es} accepts only lower-case field names and, as such, {eh} will always convert Hive column names to lower-case. This poses no issue as Hive is **case insensitive**;
however, it is recommended to use the default Hive style, reserve upper-case names for Hive commands and avoid mixed-case names.
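
As an illustration, the aliases from the callouts above would be declared along these lines - a sketch in which the table layout, column types and storage-handler class name are assumptions, not the original listing:

[source,sql]
----
CREATE EXTERNAL TABLE artists (
    date    TIMESTAMP,
    name    STRING,
    url     STRING)
STORED BY 'org.elasticsearch.hadoop.hive.ESStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists/',
              'es.column.aliases' = 'date:@timestamp, url:url_123');
----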

[[hive-type-conversion]]
+[float]
=== Type conversion

IMPORTANT: If automatic index creation is used, please review <<auto-mapping-type-loss,this>> section for more information.

Hive provides various https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types[types] for defining data and internally uses different implementations depending on the target environment (from JDK native types to binary-optimized ones). {es} integrates with all of them, including
the Serde2 http://hive.apache.org/docs/r0.11.0/api/index.html?org/apache/hadoop/hive/serde2/lazy/package-summary.html[lazy] and http://hive.apache.org/docs/r0.11.0/api/index.html?org/apache/hadoop/hive/serde2/lazybinary/package-summary.html[lazy binary] ones:

[cols="^,^",options="header"]

|===
| Hive type | {es} type

-| +void+ | +null+
-| +boolean+ | +boolean+
-| +tinyint+ | +byte+
-| +smallint+ | +short+
-| +int+ | +int+
-| +bigint+ | +long+
-| +double+ | +double+
-| +float+ | +float+
-| +string+ | +string+
-| +binary+ | +binary+
-| +timestamp+ | +date+
-| +struct+ | +map+
-| +map+ | +map+
-| +array+ | +array+
-| +union+ | not supported yet
+| `void` | `null`
+| `boolean` | `boolean`
+| `tinyint` | `byte`
+| `smallint` | `short`
+| `int` | `int`
+| `bigint` | `long`
+| `double` | `double`
+| `float` | `float`
+| `string` | `string`
+| `binary` | `binary`
+| `timestamp` | `date`
+| `struct` | `map`
+| `map` | `map`
+| `array` | `array`
+| `union` | not supported yet

2+h| Available in Hive 0.11 or higher

-| +decimal+ | +string+
+| `decimal` | `string`

|===
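
As a quick illustration of the mappings above - a hypothetical table mixing several of these types (the names, resource and storage-handler class are assumptions):

[source,sql]
----
CREATE EXTERNAL TABLE typed (
    flag    BOOLEAN,                -- indexed as 'boolean'
    num     BIGINT,                 -- indexed as 'long'
    ts      TIMESTAMP,              -- indexed as 'date'
    tags    ARRAY<STRING>,          -- indexed as 'array'
    extra   MAP<STRING, STRING>)    -- indexed as 'map'
STORED BY 'org.elasticsearch.hadoop.hive.ESStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/typed/');
----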

NOTE: While {es} understands Hive types up to version 0.11, it is backwards compatible with Hive 0.9

+[float]
=== Writing data to {es}

With {eh}, {es} becomes just an external https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable[table] into which data can be loaded or from which it can be read:

[source,sql]
----
@@ -132,9 +136,10 @@ INSERT OVERWRITE TABLE artists
SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture) FROM source s;
----

-<1> {es} Hive +StorageHandler+
+<1> {es} Hive `StorageHandler`
<2> {es} resource (index and type) associated with the given storage

+[float]
=== Reading data from {es}

Reading from {es} is strikingly similar:
@@ -152,5 +157,5 @@ TBLPROPERTIES('es.resource' = 'radio/artists/_search?q=me*'<2>);
SELECT * FROM artists;
----

-<1> same {es} Hive +StorageHandler+
+<1> same {es} Hive `StorageHandler`
<2> {es} resource (in case of reading, a query) associated with the given storage
15 changes: 7 additions & 8 deletions docs/src/reference/asciidoc/core/index.adoc
@@ -1,14 +1,13 @@
-include::core/intro.adoc[]
+include::intro.adoc[]

-include::core/configuration.adoc[]
+include::configuration.adoc[]

-include::core/mr.adoc[]
+include::mr.adoc[]

-include::core/cascading.adoc[]
+include::cascading.adoc[]

-include::core/hive.adoc[]
+include::hive.adoc[]

-include::core/pig.adoc[]
+include::pig.adoc[]

-
-include::core/mapping.adoc[]
+include::mapping.adoc[]
6 changes: 5 additions & 1 deletion docs/src/reference/asciidoc/core/intro.adoc
@@ -1,5 +1,8 @@
-= Reference documentation
+= Reference
+
+[partintro]
+--
This part of the documentation explains the core functionality of {eh}, starting with the configuration options and architecture and gradually explaining the various major features. We recommend going through the entire documentation, even if superficially, when trying out {eh} for the first time; those in a rush, however, can jump directly to the desired sections:

<<configuration>>:: overview of the various configuration switches in {eh}
@@ -13,3 +16,4 @@ This part of the documentation explains the core functionality of {eh} starting
<<pig>>:: how-to on using {es} in Pig scripts through {eh}.

<<mapping>>:: deep-dive into the strategies employed by {eh} for doing type conversion and mapping to and from {es}.
+--