Reading an array in pig #158

Closed
jpparis-orange opened this issue Mar 3, 2014 · 5 comments

@jpparis-orange

Hi!

I'm trying to work with Pig over an ES index. In my ES index I have arrays of strings that I can't read in Pig without errors. A gist that recreates the problem is here: https://gist.github.com/jpparis-orange/9329308#file-espigarray (ES index creation and Pig commands).

Here is my configuration:

  • elasticsearch-1.0.0
  • elasticsearch-hadoop-yarn.jar from 1.3.0.M2
  • hadoop-2.2.0-bin
  • hive-0.12.0-bin
  • pig-0.12.0 but recompiled pig-0.12.0-withouthadoop.jar for yarn

If I declare the array in Pig with my_array:{ the_tuple: ( the_item: chararray ) }, I get (more detailed stack trace in the gist):
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)

If I use the (incorrect) pig syntax my_array:(), I can print the array, but if I try to COUNT the elements, I get this error:
<line 2, column 52> Could not infer the matching function for org.apache.pig.builtin.COUNT as multiple or none of them fit. Please use an explicit cast.
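
To save a trip to the gist, the Pig side of it looks roughly like this (a lightly simplified sketch; 'hread/doc' and the field names are the ones from the gist):

-- variant 1, bag-style schema: this is the one that ends in the java.lang.OutOfMemoryError above
es_read = LOAD 'hread/doc' USING org.elasticsearch.hadoop.pig.EsStorage('')
    AS ( my_id: chararray, my_array: { the_tuple: ( the_item: chararray ) } );

-- variant 2, tuple-style schema: DUMP works, but COUNT fails with the "Could not infer" error above
es_read2 = LOAD 'hread/doc' USING org.elasticsearch.hadoop.pig.EsStorage('')
    AS ( my_id: chararray, my_array: ( ) );
the_count = FOREACH es_read2 GENERATE my_id, COUNT(my_array);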

Thanks for any hints!
jp

costin (Member) commented Mar 4, 2014

In my experience, Pig tends to be quite picky about tuples and bags when reading them directly from the source. I'm not sure why; maybe this is something we could improve in ES, but again, I'm not sure how.
Back to your gist: in ES you have defined an array, not a bag (which is a collection of tuples). That's why reading it as a tuple () works. COUNT fails because it works on bags, not tuples; however, you can create a bag from that tuple (for example with TOBAG).

Regarding the initial error (the OutOfMemoryError), I'm not sure why it occurs, but based on the stack trace it seems to be caused by Pig; I wasn't able to reproduce it locally (using Pig in local mode).

My advice going forward is to try to read the data as a basic tuple and then create your structures by hand. From what I've seen this seems to be the recommendation on the Pig mailing list as well.

Hope this helps,

costin added the pig label Mar 4, 2014
@jpparis-orange (Author)

I tried TOBAG, but was not successful:

es_read = LOAD 'hread/doc' USING org.elasticsearch.hadoop.pig.EsStorage('') AS ( my_id: chararray, my_array:( ) );
the_gen = FOREACH es_read GENERATE my_id AS the_id, TOBAG(my_array) AS the_bag;
DUMP the_gen;

gives me the same OOM exception:

Unexpected System Error Occured: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:130)
    at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:191)
...
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
...
Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)

costin (Member) commented Mar 10, 2014

For some reason I can't reproduce this - maybe it's hadoop2 vs hadoop1. However, I imagine the crux of the problem is the use of an array inside a bag.
Can you enable logging in Pig and report back? I'm still unsure why an array (created by es-hadoop or otherwise) would cause this issue in Pig, especially when it only contains 2 items.
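
For example, something along these lines should bump the client-side log level and capture it to a file (the script name is just a placeholder for the gist commands):

pig -d DEBUG -l pig-debug.log espigarray.pig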

As an alternative you could load the array as individual items and then manually create the bag through TOBAG.
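
A rough (untested) sketch of that, assuming the array always holds a fixed, known number of items (two here; item0/item1 are placeholder field names):

es_read = LOAD 'hread/doc' USING org.elasticsearch.hadoop.pig.EsStorage('')
    AS ( my_id: chararray, my_array: ( item0: chararray, item1: chararray ) );
-- TOBAG wraps each scalar argument in its own tuple, so COUNT sees one entry per item
the_gen = FOREACH es_read GENERATE my_id, TOBAG(my_array.item0, my_array.item1) AS the_bag;
counts = FOREACH the_gen GENERATE my_id, COUNT(the_bag);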

costin (Member) commented Apr 8, 2014

Rescheduling this for 1.3 RC1; potentially this can be addressed by adding more documentation on this type of mapping.

costin (Member) commented May 2, 2014

Hi,

This should be fixed in master. Though I was not able to reproduce your exact issue, I bumped into one with similar behaviour (it turned out to depend on the JSON being read). It would be great if you could try the latest dev builds and let us know whether they work for you.

Cheers,

costin closed this as completed May 2, 2014