[PIG] Document work-around for Pig 1 reducer problem
Since in many cases Pig's internal assumptions force the use of a single
reducer, no matter the job properties or the number of splits, document
the settings that disable this behavior (and which unfortunately cannot
be set directly by es-hadoop).

fix elastic#294
costin committed Dec 9, 2014
1 parent 20549cb commit 3f7aee4
Showing 2 changed files with 52 additions and 1 deletion.
52 changes: 52 additions & 0 deletions docs/src/reference/asciidoc/core/pig.adoc
@@ -99,6 +99,58 @@ The table below illustrates the difference between the two settings:
IMPORTANT: When using tuples, it is _highly_ recommended to create the index mapping before-hand as it is quite common for tuples to contain mixed types (numbers, strings, other tuples, etc...) which, when mapped as an array (the default), can cause parsing errors (as the automatic mapping can infer the fields to be numbers instead of strings, etc...). In fact, the example above falls in this category since the tuple contains both a number (+1+) and a string (+"kimchy"+), which will cause the auto-detection to map both +foo+ and +bar+ as a number, thus causing an exception when encountering +"kimchy"+. Please refer to <<auto-mapping-type-loss,this>> for more information.
Additionally, consider +breaking+/++flatten++ing the tuple into primitive/data atoms before sending the data off to Elasticsearch, as illustrated below.
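
For example, a tuple field can be flattened into top-level primitives before the +STORE+ (a minimal sketch; the relation, field names and target resource are made up for illustration):

[source,sql]
----
-- hypothetical relation with a tuple field named 'details'
A = LOAD 'src/artists.dat' USING PigStorage()
      AS (id:long, name:chararray, details:tuple(age:int, alias:chararray));
-- flatten the tuple so 'age' and 'alias' become top-level, singly-typed fields
B = FOREACH A GENERATE id, name, FLATTEN(details) AS (age:int, alias:chararray);
STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage();
----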

[[handling-splits]]
[float]
==== Reducers parallelism

By default, Pig will only use one reducer per job, which in most cases is inefficient. To address this issue:

Use the Parallel Features:: As explained in the http://pig.apache.org/docs/r0.13.0/perf.html#parallel[reference docs], out of the box Pig expects each reducer to process about 1 GB of data; unfortunately, if the data is scattered
around the network this becomes inefficient as the entire job is effectively serialized. Change this by increasing the number of reducers to match the number of your shards, through the +default_parallel+ property or the +PARALLEL+ keyword:

[source,sql]
----
-- launch the Map/Reduce job with 5 reducers
SET default_parallel 5;
----
or by using the +PARALLEL+ keyword with +COGROUP+, +CROSS+, +DISTINCT+, +GROUP+, +JOIN+ (inner), +JOIN+ (outer) and ++ORDER BY++.

[source,sql]
----
B = GROUP A BY t PARALLEL 18;
----

Disable split combination:: Out of the box Pig over-eagerly https://pig.apache.org/docs/r0.13.0/perf.html#combine-files[combines its input splits] even if it does not know how big they are. This again kills parallelism since it serializes the queries to {es}; typically this looks as follows
in the logs:

[source,bash]
----
20yy-mm-dd hh:mm:ss,mss [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 25
20yy-mm-dd hh:mm:ss,mss [JobControl] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
----

Avoid this by setting +pig.noSplitCombination+ to +true+ (one can also set +pig.splitCombination+ to +false+; however, we recommend the former) either by setting the property before invoking the script:

[source,bash]
----
pig -Dpig.noSplitCombination=true myScript.pig
----
in the Pig script itself:

[source,sql]
----
SET pig.noSplitCombination TRUE;
----
or through the global +pig.properties+ configuration in your Pig install:

[source,properties]
----
pig.noSplitCombination=true
----


Unfortunately {esh} cannot set these properties automatically, so the user has to do that manually, per script or by making them global through the Pig configuration, as described above.
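
For instance, a job applying both work-arounds could be launched as +pig -Dpig.noSplitCombination=true myScript.pig+, with the script itself setting the parallelism (a sketch; the index, field name and reducer count are illustrative and assume the es-hadoop jar is already registered):

[source,sql]
----
-- myScript.pig
SET default_parallel 5; -- ideally, match the number of target shards
A = LOAD 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage();
B = GROUP A BY name PARALLEL 5; -- explicit parallelism on the reduce-side operator
DUMP B;
----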


[[pig-alias]]
[float]
=== Mapping
@@ -358,7 +358,6 @@ public void setUDFContextSignature(String signature) {
}


@SuppressWarnings({ "rawtypes", "unchecked" })
private void extractProjection(Configuration cfg) throws IOException {
String fields = getUDFProperties().getProperty(InternalConfigurationOptions.INTERNAL_ES_TARGET_FIELDS);
if (fields != null) {
