[PIG] Document work-around for Pig 1 reducer problem
Since in many cases Pig's internal assumptions force the use of a single
reducer, no matter the job properties or the number of splits, document
the settings that disable this behavior (and which unfortunately cannot
be set directly by es-hadoop).

fix elastic#294
costin committed Dec 9, 2014
1 parent 20549cb commit 3f7aee4
Showing 2 changed files with 52 additions and 1 deletion.
52 changes: 52 additions & 0 deletions docs/src/reference/asciidoc/core/pig.adoc
@@ -99,6 +99,58 @@ The table below illustrates the difference between the two settings:
IMPORTANT: When using tuples, it is _highly_ recommended to create the index mapping before-hand as it is quite common for tuples to contain mixed types (numbers, strings, other tuples, etc...) which, when mapped as an array (the default), can cause parsing errors (as the automatic mapping can infer the fields to be numbers instead of strings, etc...). In fact, the example above falls in this category since the tuple contains both a number (+1+) and a string (+"kimchy"+), which will cause the auto-detection to map both +foo+ and +bar+ as a number, thus causing an exception when encountering +"kimchy"+. Please refer to <<auto-mapping-type-loss,this>> for more information.
Additionally, consider +breaking+/++flatten++ing the tuple into primitive/data atoms before sending the data off to Elasticsearch, as illustrated below.
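
For example, a tuple field can be flattened into top-level primitives before the +STORE+ (a minimal sketch; the relation, field names and target resource are made up for illustration):

[source,sql]
----
-- hypothetical relation with a tuple field named 'details'
A = LOAD 'src/artists.dat' USING PigStorage()
      AS (id:long, name:chararray, details:tuple(age:int, alias:chararray));
-- flatten the tuple so 'age' and 'alias' become top-level, singly-typed fields
B = FOREACH A GENERATE id, name, FLATTEN(details) AS (age:int, alias:chararray);
STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage();
----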

[[handling-splits]]
[float]
==== Reducers parallelism

By default, Pig will only use one reducer per job, which in most cases is inefficient. To address this issue:

Use the Parallel Features:: As explained in the http://pig.apache.org/docs/r0.13.0/perf.html#parallel[reference docs], out of the box Pig expects each reducer to process about 1 GB of data; unfortunately, if the data is scattered
around the network this becomes inefficient as the entire job is effectively serialized. Change this by increasing the number of reducers to match the number of your shards, through the +default_parallel+ property or the +PARALLEL+ keyword:

[source,sql]
----
-- launch the Map/Reduce job with 5 reducers
SET default_parallel 5;
----
or by using the +PARALLEL+ keyword with +COGROUP+, +CROSS+, +DISTINCT+, +GROUP+, +JOIN+ (inner), +JOIN+ (outer) and ++ORDER BY++.

[source,sql]
----
B = GROUP A BY t PARALLEL 18;
----

Disable split combination:: Out of the box Pig over-eagerly https://pig.apache.org/docs/r0.13.0/perf.html#combine-files[combines its input splits] even if it does not know how big they are. This again kills parallelism since it serializes the queries to {es}; typically this looks as follows
in the logs:

[source,bash]
----
20yy-mm-dd hh:mm:ss,mss [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 25
20yy-mm-dd hh:mm:ss,mss [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
----

Avoid this by setting +pig.noSplitCombination+ to +true+ (one can also set +pig.splitCombination+ to +false+; however, we recommend the former) either by setting the property before invoking the script:

[source,bash]
----
pig -Dpig.noSplitCombination=true myScript.pig
----
in the Pig script itself:

[source,sql]
----
SET pig.noSplitCombination TRUE;
----
or through the global +pig.properties+ configuration in your Pig install:

[source,properties]
----
pig.noSplitCombination=true
----


Unfortunately {esh} cannot set these properties automatically, so the user has to do that manually, per script or by making them global through the Pig configuration, as described above.
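
For instance, a job applying both work-arounds could be launched as +pig -Dpig.noSplitCombination=true myScript.pig+, with the script itself setting the parallelism (a sketch; the index, field name and reducer count are illustrative and assume the es-hadoop jar is already registered):

[source,sql]
----
-- myScript.pig
SET default_parallel 5; -- ideally, match the number of target shards
A = LOAD 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage();
B = GROUP A BY name PARALLEL 5; -- explicit parallelism on the reduce-side operator
DUMP B;
----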


[[pig-alias]]
[float]
=== Mapping
@@ -358,7 +358,6 @@ public void setUDFContextSignature(String signature) {
}


@SuppressWarnings({ "rawtypes", "unchecked" })
private void extractProjection(Configuration cfg) throws IOException {
String fields = getUDFProperties().getProperty(InternalConfigurationOptions.INTERNAL_ES_TARGET_FIELDS);
if (fields != null) {
