Issue description
We are using EsStorage in a Pig script to store a parent-child relationship in Elasticsearch.
Both parent and child documents are defined as two types (my_parent, my_child) under the same index (my_index).
We created the index with predefined "mappings" in order to define the parent-child relationship.
In the Pig script we create two instances of EsStorage, one for the parent and one for the child; the child has the extra 'es.mapping.parent' property pointing to the parent's "parentId".
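The mapping itself is not attached to the report; for reference, a minimal ES 2.x index mapping defining such a parent-child relationship (index and type names taken from the script, field definitions left to dynamic mapping) might look like:

```json
{
  "mappings": {
    "my_parent": {},
    "my_child": {
      "_parent": { "type": "my_parent" }
    }
  }
}
```

In ES 2.x the `_parent` field must be declared at type-creation time, which is why the index is created up front and the script runs with 'es.index.auto.create = false'.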
The script logic is simple:
load the data from the input file
generate parent document relation and store it
generate child document relation and store it
The problem is that, in our implementation, when the two 'EsStorage' calls appear in a single Pig script, a single 'MAP_ONLY' job is created and the following error is received:
Error: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Can't specify parent if no parent field has been configured
(If it helps: we debugged the issue and found that when a single job runs both stores, the configuration of the two EsStorage instances gets mixed. This causes the parent store operation to look for a non-existent parent relation, even though the parent type should not have a parent at all.)
Notes:
When the script is split into two separate jobs or scripts, so that each EsStorage instance runs in its own job, the store succeeds.
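The merging of both STORE statements into one job is Pig's multi-query optimization (the MULTI_QUERY feature visible in the failed-job summary below). A hedged workaround sketch, assuming the script is saved as pigTest.pig: disabling multi-query forces each STORE into its own job, so the two EsStorage configurations never share a job.

```shell
# Disable Pig's multi-query optimization so each STORE statement
# runs as its own MapReduce job with its own EsStorage configuration.
pig -no_multiquery pigTest.pig
# equivalently: pig -M pigTest.pig
```

Note this trades away the optimization: the input is re-read once per STORE.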
Steps to reproduce
1. Upload the attached input_file.txt to the HDFS /tmp directory.
2. Edit the first line of the code snippet with the location of "elasticsearch-hadoop-2.2.0.jar" on your local file system before running the script (or download the attached Pig file, pigTest.pig.txt, and rename it to pigTest.pig).
3. Run the Pig script using the "pig" command.
Code:
REGISTER elasticsearch-hadoop-2.2.0.jar;
-- Load all 5 records from the input file
ALL_RECORDS = LOAD '/tmp/input_file.csv' USING PigStorage(',') AS (id:long, name:chararray, birthdatedate:long);
-- generate the parent relation with 2 fields from the input
PARENT_RECS = FOREACH ALL_RECORDS GENERATE birthdatedate as text, id as parentId;
-- store the parent records into the my_parent type of my_index, using the parentId field as the document ID.
STORE PARENT_RECS INTO 'my_index/my_parent' USING org.elasticsearch.hadoop.pig.EsStorage('es.http.retries=10','es.nodes = localhost','es.mapping.id = parentId','es.index.auto.create = false', 'es.net.ssl = false');
-- generate the child relation with all 3 fields from the input
CHILD_RECS = FOREACH ALL_RECORDS GENERATE name as text, id as parentId, birthdatedate as childId;
-- store the child records into the my_child type, using the child's "parentId" field as es.mapping.parent - the same value as the parent document's ID
STORE CHILD_RECS INTO 'my_index/my_child' USING org.elasticsearch.hadoop.pig.EsStorage('es.http.retries=10','es.nodes = localhost','es.mapping.id = childId','es.mapping.parent = parentId','es.index.auto.create = false', 'es.net.ssl = false');
Stack trace:
2016-05-01 17:08:08,102 [uber-SubtaskRunner] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1461671868747_11164_m_000000_3 Info:Error: org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Can't specify parent if no parent field has been configured
{"index":{"_id":"1","_parent":"1"}}
{"text":1422853200000,"parentId":1}
{"index":{"_id":"2","_parent":"2"}}
{"text":1425272400000,"parentId":2}
{"index":{"_id":"3","_parent":"3"}}
{"text":1425531600000,"parentId":3}
{"index":{"_id":"4","_parent":"4"}}
{"text":1388552400000,"parentId":4}
{"index":{"_id":"5","_parent":"5"}}
{"text":1393650000000,"parentId":5}
at org.elasticsearch.hadoop.rest.RestClient.checkResponse(RestClient.java:467)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:425)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:415)
at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:145)
at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:225)
at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:248)
at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:267)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.doClose(EsOutputFormat.java:214)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.close(EsOutputFormat.java:196)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReducePOStoreImpl.tearDown(MapReducePOStoreImpl.java:99)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POStore.tearDown(POStore.java:125)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.cleanup(PigGenericMapBase.java:134)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:148)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
2016-05-01 17:08:08,112 [uber-SubtaskRunner] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2016-05-01 17:08:08,114 [uber-SubtaskRunner] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
2.6.0-cdh5.5.1 0.12.0-cdh5.5.1 yarn 2016-05-01 17:07:27 2016-05-01 17:08:08 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_1461671868747_11164 ALL_RECORDS,CHILD_RECS,PARENT_RECS MULTI_QUERY,MAP_ONLY Message: Job failed! my_index/my_parent,my_index/my_child,
Version Info
OS: RHEL 7.1
JVM : 1.8.0_74 Java HotSpot(TM) 64-Bit Server VM
Hadoop/Spark: Cloudera 5.5.1 (Hadoop 2.6.0)
ES-Hadoop : 2.2.0
ES : 2.2.0
Pig: 0.12-cdh5.5.1
Thanks for the detailed bug report.
Unfortunately there's not much we can do. As you've noticed, within the same job Hadoop shares all the configuration in one big place, its Configuration object, which is just a glorified properties file.
In the case of Pig, without any dedicated configuration per Storage, the components are left to make assumptions. Trying to scope the configuration per instance does not really work since, inside Pig, a Storage has no knowledge of how many other instances exist or how to manage them.
Using singletons or statics fails as well, since several jobs can run within the same VM, which means stale data ends up being left over.
I will try another approach and piggyback on the Pig API, however there's a high chance this won't work in the end (since it's still the same Configuration object, and thus multiple Storages will end up with the same settings).
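The failure mode above can be sketched in a few lines, using a plain dict as a stand-in for Hadoop's shared Configuration (the helper name is hypothetical, not es-hadoop API):

```python
# One job-wide configuration object, shared by every storage instance in the job.
job_conf = {}

def configure_es_storage(conf, settings):
    """Stand-in for an EsStorage instance writing its settings into the shared Configuration."""
    conf.update(settings)

# The parent store configures itself first...
configure_es_storage(job_conf, {"es.mapping.id": "parentId"})

# ...then the child store writes into the SAME object,
configure_es_storage(job_conf, {"es.mapping.id": "childId",
                                "es.mapping.parent": "parentId"})

# so the parent writer now also sees 'es.mapping.parent' and emits bulk
# headers like {"index":{"_id":"1","_parent":"1"}} for parent documents,
# which the index rejects because my_parent has no parent configured.
print(job_conf["es.mapping.parent"])
```

This matches the bulk payload in the stack trace, where the parent documents carry an `_parent` field they should not have.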