Failure while using EsStorage twice on a single Pig script to store a Parent Child relation #756

Closed
lleviraz opened this issue May 1, 2016 · 2 comments

Comments

@lleviraz

lleviraz commented May 1, 2016

Issue description

  • We are using EsStorage in a pig script, to store a parent-child relationship in ElasticSearch.
  • Both parent and child documents are defined as 2 types (my_parent,my_child) under the same index (my_index).
  • We created the index with predefined "mappings" in order to define the parent-child relationship.
  • In the pig script we are creating 2 instances of EsStorage, one for the parent and one for the child; the child has the extra 'es.mapping.parent' property to point to the parent's "parentId".

The script logic is simple:

  1. load the data from the input file
  2. generate parent document relation and store it
  3. generate child document relation and store it

The problem is that in our implementation, when the 2 calls to 'EsStorage' appear in a single Pig script, a single 'MAP_ONLY' job is created and the following error is received:
Error: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Can't specify parent if no parent field has been configured

(If it helps, we debugged the issue and found that when a single job is running, the configuration of the 2 EsStorage instances gets mixed, causing the parent store operation to look for a non-existent parent relation, even though the parent type should not have a parent at all.)

Notes:

  1. Splitting the script into 2 separate jobs or scripts, so that each EsStorage instance runs in its own job, succeeds.
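For reference, a sketch of that workaround based on the repro script below (file names are hypothetical; Pig's multi-query optimization can also be disabled with `pig -no_multiquery` so each STORE gets its own job):

```
-- storeParents.pig (first script/job: parent STORE only, no es.mapping.parent)
REGISTER elasticsearch-hadoop-2.2.0.jar;
ALL_RECORDS = LOAD '/tmp/input_file.csv' USING PigStorage(',') AS (id:long, name:chararray, birthdatedate:long);
PARENT_RECS = FOREACH ALL_RECORDS GENERATE birthdatedate as text, id as parentId;
STORE PARENT_RECS INTO 'my_index/my_parent' USING org.elasticsearch.hadoop.pig.EsStorage('es.mapping.id = parentId','es.index.auto.create = false');

-- storeChildren.pig (second script/job: child STORE with es.mapping.parent)
REGISTER elasticsearch-hadoop-2.2.0.jar;
ALL_RECORDS = LOAD '/tmp/input_file.csv' USING PigStorage(',') AS (id:long, name:chararray, birthdatedate:long);
CHILD_RECS = FOREACH ALL_RECORDS GENERATE name as text, id as parentId, birthdatedate as childId;
STORE CHILD_RECS INTO 'my_index/my_child' USING org.elasticsearch.hadoop.pig.EsStorage('es.mapping.id = childId','es.mapping.parent = parentId','es.index.auto.create = false');
```

With one EsStorage per job, each job's configuration carries only that store's settings, so the parent bulk requests no longer pick up '_parent' metadata.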

Steps to reproduce

  1. Create the index and mappings:
curl -XPUT http://localhost:9200/my_index -d'
{
  "mappings": {
    "my_parent": {},
    "my_child": {
      "_parent": {
        "type": "my_parent"
      }
    }
  }
}'
  2. Upload the attached input_file.txt to the HDFS /tmp directory
    input_file.txt

  3. Edit the first line of the code snippet with the location of "elasticsearch-hadoop-2.2.0.jar" on your local file system before running the script (or download the attached Pig file and rename it to pigTest.pig).
    pigTest.pig.txt

  4. Run the Pig script using the "pig" command

Code:

REGISTER elasticsearch-hadoop-2.2.0.jar;

-- Load all 5 records from the input file
ALL_RECORDS = LOAD  '/tmp/input_file.csv' USING PigStorage(',') AS (id:long, name:chararray, birthdatedate:long);

-- generate the parent relation with 2 fields from the input
PARENT_RECS = FOREACH ALL_RECORDS GENERATE birthdatedate as text, id as parentId;

-- store the parent records into the my_parent index using parentId field as the document ID.
STORE PARENT_RECS INTO 'my_index/my_parent' USING org.elasticsearch.hadoop.pig.EsStorage('es.http.retries=10','es.nodes = localhost','es.mapping.id = parentId','es.index.auto.create = false', 'es.net.ssl = false');

-- generate the child relation with all 3 fields from the input
CHILD_RECS = FOREACH ALL_RECORDS GENERATE name as text, id as parentId, birthdatedate as childId;

-- store the child records into the my_child type, using the child's "parentId" field as es.mapping.parent - which holds the same value as the parent document's ID
STORE CHILD_RECS INTO 'my_index/my_child' USING org.elasticsearch.hadoop.pig.EsStorage('es.http.retries=10','es.nodes = localhost','es.mapping.id = childId','es.mapping.parent = parentId','es.index.auto.create = false', 'es.net.ssl = false');

Stack trace:

2016-05-01 17:08:08,102 [uber-SubtaskRunner] ERROR org.apache.pig.tools.pigstats.SimplePigStats  - ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1461671868747_11164_m_000000_3 Info:Error: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Can't specify parent if no parent field has been configured
{"index":{"_id":"1","_parent":"1"}}
{"text":1422853200000,"parentId":1}
{"index":{"_id":"2","_parent":"2"}}
{"text":1425272400000,"parentId":2}
{"index":{"_id":"3","_parent":"3"}}
{"text":1425531600000,"parentId":3}
{"index":{"_id":"4","_parent":"4"}}
{"text":1388552400000,"parentId":4}
{"index":{"_id":"5","_parent":"5"}}
{"text":1393650000000,"parentId":5}

                at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:467)
                at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:425)
                at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:415)
                at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:145)
                at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:225)
                at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:248)
                at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:267)
                at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.doClose(EsOutputFormat.java:214)
                at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.close(EsOutputFormat.java:196)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReducePOStoreImpl.tearDown(MapReducePOStoreImpl.java:99)
                at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.tearDown(POStore.java:125)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.cleanup(PigGenericMapBase.java:134)
                at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:148)
                at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
                at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
                at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:422)
                at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
                at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

2016-05-01 17:08:08,112 [uber-SubtaskRunner] ERROR org.apache.pig.tools.pigstats.PigStatsUtil  - 1 map reduce job(s) failed!
2016-05-01 17:08:08,114 [uber-SubtaskRunner] INFO  org.apache.pig.tools.pigstats.SimplePigStats  - Script Statistics: 

HadoopVersion PigVersion          UserId  StartedAt            FinishedAt          Features
2.6.0-cdh5.5.1    0.12.0-cdh5.5.1 yarn       2016-05-01 17:07:27        2016-05-01 17:08:08        UNKNOWN

Failed!

Failed Jobs:
JobId     Alias       Feature                Message              Outputs
job_1461671868747_11164          ALL_RECORDS,CHILD_RECS,PARENT_RECS           MULTI_QUERY,MAP_ONLY                Message: Job failed!      my_index/my_parent,my_index/my_child,

Version Info

OS: RHEL 7.1
JVM : 1.8.0_74 Java HotSpot(TM) 64-Bit Server VM
Hadoop/Spark: Cloudera 5.5.1 (Hadoop 2.6.0)
ES-Hadoop : 2.2.0
ES : 2.2.0
Pig: 0.12-cdh5.5.1

@costin
Member

costin commented May 3, 2016

Thanks for the detailed bug report.
Unfortunately there's not much we can do. As you've noticed, within the same job Hadoop shares all the configuration in one big place, its Configuration object, which is just a glorified properties file.
In the case of Pig, without any dedicated configuration per Storage, the components are left to make assumptions. Trying to create one instance per store does not really work either, since inside Pig there's no way to know how many other instances exist or how to manage them.
Using singletons or statics fails as well, since several jobs can run within the same VM, which means data ends up being left over.
I will try another approach and piggyback on the Pig API; however, there's a high chance this won't work in the end (since it's still the same Configuration object, and thus multiple Storages will end up with the same settings).
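The shared-Configuration failure mode described above can be illustrated with a small hypothetical sketch (plain Python as an analogy, not es-hadoop code): two "storage" instances writing their settings into one flat map, the way a single MAP_ONLY job exposes one Configuration to every STORE.

```python
# Hypothetical sketch: the job-wide configuration modeled as one dict,
# shared by both "storage" instances (an analogy, not es-hadoop code).
job_conf = {}

def configure(conf, settings):
    # each EsStorage-like instance merges its settings into the shared map
    conf.update(settings)

# parent store: no es.mapping.parent, since my_parent has no _parent mapping
configure(job_conf, {"es.mapping.id": "parentId"})

# child store: adds es.mapping.parent -- into the SAME shared map
configure(job_conf, {"es.mapping.id": "childId",
                     "es.mapping.parent": "parentId"})

# When the parent store later flushes, it reads the shared map and now sees
# es.mapping.parent, so its bulk requests carry _parent metadata for a type
# whose mapping defines no parent -- matching the EsHadoopInvalidRequest above.
print(job_conf["es.mapping.parent"])  # parentId
```

The second `configure` call silently overwrites the first store's view of the settings, which is exactly why the parent bulk requests in the stack trace carry `"_parent"` even though `my_parent` defines none.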

@lleviraz
Author

lleviraz commented May 4, 2016

Thank you Costin.
As a workaround we redesigned the process to have only a single instance of EsStorage in every job.

@lleviraz lleviraz closed this as completed May 4, 2016