Failure while using EsStorage twice on a single Pig script to store a Parent Child relation #756

Closed
lleviraz opened this issue May 1, 2016 · 2 comments

Comments

@lleviraz

lleviraz commented May 1, 2016

Issue description

  • We are using EsStorage in a pig script, to store a parent-child relationship in ElasticSearch.
  • Both parent and child documents are defined as 2 types (my_parent,my_child) under the same index (my_index).
  • We created the index with predefined "mappings" in order to define the parent-child relationship.
  • In the pig script we are creating 2 instances of EsStorage, one for the parent and one for the child; the child has the extra 'es.mapping.parent' property to point to the parent's "parentId".

The script logic is simple:

  1. load the data from the input file
  2. generate parent document relation and store it
  3. generate child document relation and store it

The problem is that in our implementation, when the 2 calls to 'EsStorage' appear in a single Pig script, a single 'MAP_ONLY' job is created and the following error is received:
Error: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Can't specify parent if no parent field has been configured

(If it helps, we debugged the issue and found that when a single job is running, the configuration of the 2 EsStorage instances gets mixed, causing the parent store operation to look for a non-existent parent relation, even though the parent type should not have a parent at all.)

Notes:

  1. Splitting the script into 2 separate jobs or scripts, so that each EsStorage instance runs in its own job, succeeds.
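For reference, a sketch of that workaround based on the repro script below (file names are hypothetical; Pig's multi-query optimization can also be disabled with `pig -no_multiquery` so each STORE gets its own job):

```
-- storeParents.pig (first script/job: parent STORE only, no es.mapping.parent)
REGISTER elasticsearch-hadoop-2.2.0.jar;
ALL_RECORDS = LOAD '/tmp/input_file.csv' USING PigStorage(',') AS (id:long, name:chararray, birthdatedate:long);
PARENT_RECS = FOREACH ALL_RECORDS GENERATE birthdatedate as text, id as parentId;
STORE PARENT_RECS INTO 'my_index/my_parent' USING org.elasticsearch.hadoop.pig.EsStorage('es.mapping.id = parentId','es.index.auto.create = false');

-- storeChildren.pig (second script/job: child STORE with es.mapping.parent)
REGISTER elasticsearch-hadoop-2.2.0.jar;
ALL_RECORDS = LOAD '/tmp/input_file.csv' USING PigStorage(',') AS (id:long, name:chararray, birthdatedate:long);
CHILD_RECS = FOREACH ALL_RECORDS GENERATE name as text, id as parentId, birthdatedate as childId;
STORE CHILD_RECS INTO 'my_index/my_child' USING org.elasticsearch.hadoop.pig.EsStorage('es.mapping.id = childId','es.mapping.parent = parentId','es.index.auto.create = false');
```

With one EsStorage per job, each job's configuration carries only that store's settings, so the parent bulk requests no longer pick up '_parent' metadata.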

Steps to reproduce

  1. Create the index and mappings:
curl -XPUT http://localhost:9200/my_index -d'
{
  "mappings": {
    "my_parent": {},
    "my_child": {
      "_parent": {
        "type": "my_parent"
      }
    }
  }
}'
  2. Upload the attached input_file.txt to the HDFS /tmp directory
    input_file.txt

  3. Edit the first line of the code snippet with the location of "elasticsearch-hadoop-2.2.0.jar" on your local file system before running the script (or download the attached Pig file and rename it to pigTest.pig).
    pigTest.pig.txt

  4. Run the Pig script using the "pig" command

Code:

REGISTER elasticsearch-hadoop-2.2.0.jar;

-- Load all 5 records from the input file
ALL_RECORDS = LOAD  '/tmp/input_file.csv' USING PigStorage(',') AS (id:long, name:chararray, birthdatedate:long);

-- generate the parent relation with 2 fields from the input
PARENT_RECS = FOREACH ALL_RECORDS GENERATE birthdatedate as text, id as parentId;

-- store the parent records into the my_parent index using parentId field as the document ID.
STORE PARENT_RECS INTO 'my_index/my_parent' USING org.elasticsearch.hadoop.pig.EsStorage('es.http.retries=10','es.nodes = localhost','es.mapping.id = parentId','es.index.auto.create = false', 'es.net.ssl = false');

-- generate the child relation with all 3 fields from the input
CHILD_RECS = FOREACH ALL_RECORDS GENERATE name as text, id as parentId, birthdatedate as childId;

-- store the child records into the my_child type, using the child's "parentId" field as es.mapping.parent - which holds the same value as the parent document's ID
STORE CHILD_RECS INTO 'my_index/my_child' USING org.elasticsearch.hadoop.pig.EsStorage('es.http.retries=10','es.nodes = localhost','es.mapping.id = childId','es.mapping.parent = parentId','es.index.auto.create = false', 'es.net.ssl = false');

Stack trace:

2016-05-01 17:08:08,102 [uber-SubtaskRunner] ERROR org.apache.pig.tools.pigstats.SimplePigStats  - ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1461671868747_11164_m_000000_3 Info:Error: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Can't specify parent if no parent field has been configured
{"index":{"_id":"1","_parent":"1"}}
{"text":1422853200000,"parentId":1}
{"index":{"_id":"2","_parent":"2"}}
{"text":1425272400000,"parentId":2}
{"index":{"_id":"3","_parent":"3"}}
{"text":1425531600000,"parentId":3}
{"index":{"_id":"4","_parent":"4"}}
{"text":1388552400000,"parentId":4}
{"index":{"_id":"5","_parent":"5"}}
{"text":1393650000000,"parentId":5}

                at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:467)
                at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:425)
                at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:415)
                at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:145)
                at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:225)
                at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:248)
                at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:267)
                at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.doClose(EsOutputFormat.java:214)
                at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.close(EsOutputFormat.java:196)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReducePOStoreImpl.tearDown(MapReducePOStoreImpl.java:99)
                at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.tearDown(POStore.java:125)
                at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.cleanup(PigGenericMapBase.java:134)
                at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:148)
                at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
                at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
                at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
                at java.security.AccessController.doPrivileged(Native Method)
                at javax.security.auth.Subject.doAs(Subject.java:422)
                at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
                at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

2016-05-01 17:08:08,112 [uber-SubtaskRunner] ERROR org.apache.pig.tools.pigstats.PigStatsUtil  - 1 map reduce job(s) failed!
2016-05-01 17:08:08,114 [uber-SubtaskRunner] INFO  org.apache.pig.tools.pigstats.SimplePigStats  - Script Statistics: 

HadoopVersion PigVersion          UserId  StartedAt            FinishedAt          Features
2.6.0-cdh5.5.1    0.12.0-cdh5.5.1 yarn       2016-05-01 17:07:27        2016-05-01 17:08:08        UNKNOWN

Failed!

Failed Jobs:
JobId     Alias       Feature                Message              Outputs
job_1461671868747_11164          ALL_RECORDS,CHILD_RECS,PARENT_RECS           MULTI_QUERY,MAP_ONLY                Message: Job failed!      my_index/my_parent,my_index/my_child,

Version Info

OS: RHEL 7.1
JVM : 1.8.0_74 Java HotSpot(TM) 64-Bit Server VM
Hadoop/Spark: Cloudera 5.5.1 (Hadoop 2.6.0)
ES-Hadoop : 2.2.0
ES : 2.2.0
Pig: 0.12-cdh5.5.1

@costin
Member

costin commented May 3, 2016

Thanks for the detailed bug report.
Unfortunately there's not much we can do. As you've noticed, within the same job Hadoop shares all the configuration in one big place, its Configuration object, which is just a glorified properties file.
In the case of Pig, without any dedicated configuration per Storage, the components are left to make assumptions. Trying to create one instance per store does not really work either, since inside Pig there's no way to know how many other instances exist or how to manage them.
Using singletons or statics fails as well, since several jobs can run within the same VM, which means data ends up being left over.
I will try another approach and piggyback on the Pig API; however, there's a high chance this won't work in the end (since it's still the same Configuration object, and thus multiple Storages will end up with the same settings).
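The shared-Configuration failure mode described above can be illustrated with a small hypothetical sketch (plain Python as an analogy, not es-hadoop code): two "storage" instances writing their settings into one flat map, the way a single MAP_ONLY job exposes one Configuration to every STORE.

```python
# Hypothetical sketch: the job-wide configuration modeled as one dict,
# shared by both "storage" instances (an analogy, not es-hadoop code).
job_conf = {}

def configure(conf, settings):
    # each EsStorage-like instance merges its settings into the shared map
    conf.update(settings)

# parent store: no es.mapping.parent, since my_parent has no _parent mapping
configure(job_conf, {"es.mapping.id": "parentId"})

# child store: adds es.mapping.parent -- into the SAME shared map
configure(job_conf, {"es.mapping.id": "childId",
                     "es.mapping.parent": "parentId"})

# When the parent store later flushes, it reads the shared map and now sees
# es.mapping.parent, so its bulk requests carry _parent metadata for a type
# whose mapping defines no parent -- matching the EsHadoopInvalidRequest above.
print(job_conf["es.mapping.parent"])  # parentId
```

The second `configure` call silently overwrites the first store's view of the settings, which is exactly why the parent bulk requests in the stack trace carry `"_parent"` even though `my_parent` defines none.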

@lleviraz
Author

lleviraz commented May 4, 2016

Thank you Costin.
As a workaround we redesigned the process to have only a single instance of EsStorage in every job.

@lleviraz lleviraz closed this as completed May 4, 2016