write array from pig #177
Comments
I am also having problems with writing a string array from Pig to ES. The current ES-hadoop documentation (http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/pig.html) states that a Pig Bag maps to an ES Array, but I am not getting the results I expect. Here is a full example:
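The script itself did not survive the thread, but a minimal sketch of the kind of pipeline being discussed might look like the following. This is a hypothetical reconstruction, not the original example: the input path, field names, and the index/type name are all assumptions.

```pig
-- Hypothetical sketch, not the original script: input path, field names,
-- and the index/type 'test/strings' are assumed.
REGISTER elasticsearch-hadoop.jar;

-- a chararray field plus a bag of single-field tuples
data = LOAD 'input.tsv'
       AS (name:chararray, tags:bag{t:tuple(tag:chararray)});

-- per the docs, the bag is expected to map to an ES array
STORE data INTO 'test/strings'
      USING org.elasticsearch.hadoop.pig.EsStorage();
```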
Hi guys, sorry it took a while to get to this. The problem is caused by the fact that bags themselves contain tuples, and each tuple can (and does) carry a schema indicating its field names and their types. However, I can see how this gets in the way when a basic array is needed, so I'll try to come up with a fix. Note that writing the tuple in 'simple' format is easy; reading it back is not (since the tuple type needs to be figured out).
Great! Thanks for looking into the issue, Costin!
Guys, I've pushed a draft update to the nightly builds; can you please try it out and report back?
P.S. The uploaded Maven artifact looks something like this:
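For reference, a sketch of the Maven coordinates involved; the snapshot repository URL and version are assumptions inferred from the snapshot jar name mentioned later in the thread:

```xml
<!-- sketch only: snapshot repository URL and version may differ -->
<repositories>
  <repository>
    <id>sonatype-snapshots</id>
    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    <snapshots><enabled>true</enabled></snapshots>
  </repository>
</repositories>

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>1.3.0.BUILD-SNAPSHOT</version>
</dependency>
```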
Hi Costin. I tried the snapshot build you specified and the results are better, but still not quite right. Using the same example I posted above:
The better part is that the mapping now looks correct:
The not-quite-right part shows up when we look at the data though:
But I expect it to look like so:
BTW, to pull down that build I had to change … Thanks Costin!
Notice that the JSON representation is the same for a tuple with two values and a bag with two tuples.
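As an illustration (the field name and values here are hypothetical), both a tuple ('a', 'b') and a bag {('a'), ('b')} can end up serialized as the same document once field names are dropped:

```json
{ "f": ["a", "b"] }
```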
Ok, thanks for the explanation, Costin. It looks like my expectations were incorrect... and it sounds like I may not be able to load data in the exact structure I was hoping for; it will have to be an array with each element in its own nested array, e.g.:
You can get the JSON structure you need, but not with a bag. The crux of the problem is that Pig uses the tuple as its 'atom' and builds the other complex data structures on top of it. And since a tuple can (and will) have multiple entries, an array (which ES can handle just fine) needs to be the basic mapping 'atom'.
Change default serialization of tuples to hide/ignore their names. This results in tuples being pure arrays/lists instead of maps (name : list of values). Relates #177
Hey Costin. With the snapshot build you specified, I am able to do as you suggested and get an array without field names, but some of the other behavior has also changed (relative to the M2 release). It seems it is no longer possible for any nested tuple to be named. I think the following describes the behavior I am seeing:
Here is an example demonstrating this behavior:
Here is the behavior of the snapshot build (elasticsearch-hadoop-1.3.0.BUILD-20140417.223030-390.jar). As we can see, the …
And if we go back to the M2 build, all of the fields are named (this is the build I was using when I originally chimed in on this ticket):
I am trying to create something like this:
I should clarify my use case. I am replacing an old Java-based ETL with a Pig-based one, and I am trying to replicate exactly the structure of the index it created. And of course, thank you for your time!
The root tuple that you refer to is the actual row/entry in Pig that is mapped to a JSON document. That's why it needs to use names; otherwise its JSON representation would be invalid.
Putting your mapping aside: within the same structure, the tool would have to use names for some of your nested tuples but lists for others; what would the criteria be? Furthermore, how would it know how to deserialize the same JSON back into Pig? If you want a dedicated mapping, try working your way backwards: instead of using named tuples, use maps. Keep using f1 as a chararray, and define kitty as a map with one key, "names", holding a tuple, and the other a chararray. You can still use tuples, but there are two ways of dealing with them: without names, in which case they become an array of primitives, or with names, in which case they are converted into an array of maps (each entry becomes a field:tuple-entry map). Hope this helps,
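A sketch of the map-based approach described above, with the input schema assumed; TOMAP and TOTUPLE are Pig built-ins, while the field and index names are illustrative only:

```pig
-- assumed input: f1 plus two name fields and one extra chararray
raw = LOAD 'input.tsv'
      AS (f1:chararray, n1:chararray, n2:chararray, extra:chararray);

-- kitty becomes a map whose "names" key holds an unnamed tuple,
-- which should serialize as a plain JSON array
doc = FOREACH raw
      GENERATE f1,
               TOMAP('names', TOTUPLE(n1, n2), 'other', extra) AS kitty;

STORE doc INTO 'index/type'
      USING org.elasticsearch.hadoop.pig.EsStorage();
```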
Thanks, Costin!
How is it possible to write data containing an array of strings from Pig to ES, like
{
"name":"toto",
"tags": ["tag1", "tag2"]
}
I've tried with bags and tuples, but it always ends up with schema names inside the array.
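Based on the discussion above, and assuming a build in which tuple field names are ignored by default, one sketch is to emit the tags as an unnamed tuple. The input path and field names here are assumptions:

```pig
raw = LOAD 'docs.tsv' AS (name:chararray, tag1:chararray, tag2:chararray);

-- with tuple field names ignored, the tuple should serialize as a
-- plain JSON array rather than an object with field names
docs = FOREACH raw GENERATE name, TOTUPLE(tag1, tag2) AS tags;

STORE docs INTO 'test/docs' USING org.elasticsearch.hadoop.pig.EsStorage();
```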