write array from pig #177

oalam · 2014-03-25T17:55:32Z

How can I write data from Pig to ES that contains an array of strings, like this:

{
"name":"toto",
"tags": ["tag1", "tag2"]
}

I've tried with bags and tuples, but the schema names always end up inside the array.

aortez · 2014-04-15T02:18:41Z

I am also having problems writing a string array from Pig to ES. The current es-hadoop documentation (http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/pig.html) states that a Pig bag maps to an ES array, but I am not getting the results I expect.

Here is a full example:

% this is pseudo grunt code here:
data = LOAD 'test_data' USING PigStorage (',') AS (f1: chararray, f2: chararray, f3: chararray);
data2 = FOREACH data GENERATE TOBAG(f1, f2, f3) AS my_fields;
STORE data2 INTO 'dgb-1610/test' USING EsStorage();

dump data2
({(A),(B),(C)})

describe data2
data2: {my_fields: {(chararray)}}

# this is what the resulting index looks like:
curl -XGET 'http://gblb-es01.dev.g2:9200/dgb-1610/test/_mapping'
{
    "test": {
        "properties": {
            "my_fields": {
                "properties": {
                    "0": {
                        "type": "string"
                    }
                }
            }
        }
    }
}

# but this is how I expect it to be:
{
    "test": {
        "properties": {
            "my_fields": {
                "type": "string"
            }
        }
    }
}

costin · 2014-04-16T20:56:56Z

Hi guys,

Sorry it took a while to get to this. The problem is caused by the fact that bags themselves contain tuples, and each tuple can (and does) contain a schema indicating its field names and their types.
The mapping above is the result of handling both tuples with and without schema in a consistent manner.

However, I can see how this might create issues when a basic array is needed, so I'll try to come up with a fix. Note that writing the tuple in 'simple' format is easy; reading it back is not (since the tuple type needs to be figured out).

aortez · 2014-04-16T22:28:36Z

Great! Thanks for looking into the issue Costin!

costin · 2014-04-17T22:29:33Z

Guys, I've pushed a draft update in the nightly builds - can you please try it out and report back?
I'd like to run more tests to make sure it's solid, but so far the relevant tests are passing.

costin · 2014-04-17T22:30:59Z

P.S. The uploaded Maven artifact looks something like this: elasticsearch-hadoop-1.3.0.BUILD-20140417.223030-390.jar

aortez · 2014-04-18T00:54:51Z

Hi Costin. I tried the snapshot build you specified and the results are better, but still not quite right.

Using the same example I posted above:

% 'test_data' = A,B,C
data = LOAD 'test_data' USING PigStorage (',') AS (f1: chararray, f2: chararray, f3: chararray);
data2 = FOREACH data GENERATE TOBAG(f1, f2, f3) AS my_fields;
STORE data2 INTO 'dgb-1610/test' USING EsStorage();

dump data2
({(A),(B),(C)})

The better part is that the mapping now looks correct:

$ curl -XGET 'http://gblb-es01.dev.g2:9200/dgb-1610/test/_mapping' | python -mjson.tool
{
    "test": {
        "properties": {
            "my_fields": {
                "type": "string"
            }
        }
    }
}

The not-quite-right part shows up when we look at the data though:

curl -XGET 'http://gblb-es01.dev.g2:9200/dgb-1610/test/_search' | python -mjson.tool
...
                "_source": {
                    "my_fields": [
                        [
                            "A"
                        ], 
                        [
                            "B"
                        ], 
                        [
                            "C"
                        ]
                    ]
                }, 
...

But I expect it to look like so:

...
                "_source": {
                    "my_fields": [
                         "A",
                         "B",
                         "C"
                    ]
                }, 
...

BTW, to pull down that build I had to change
http://oss.sonatype.org/content/repositories/releases
(the above URL was specified in the Development Builds section of the Installation page)
to
http://oss.sonatype.org/content/repositories/snapshots
in my pom. I might be doing something wrong here...

Thanks Costin!

costin · 2014-04-18T07:44:27Z

({(A),(B),(C)}) is a bag of tuples. A tuple can have one or multiple elements and is an ordered list of values, hence its representation as a JSON array.
So (A) becomes [A], (B) -> [B] and so on. One could argue that the array is not needed for tuples with one element, but consider the following example:
(A) -> A
(A,B) -> [A, B]
If we nest the tuple as per your example:
({(A)}) -> [A] and ({(A), (B)}) -> [A, B]

Notice the JSON representation is the same between a tuple with two values and a bag with two tuples.
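
To make the ambiguity above concrete, here is a rough Python sketch. It only models the data shapes and is not es-hadoop code; the function names are hypothetical, and a bag is assumed to be a list of Python tuples.

```python
import json

def tuple_always_array(t):
    """Serialize a tuple as a JSON array, whatever its size."""
    return list(t)

def tuple_unwrap_single(t):
    """Hypothetical alternative: turn one-element tuples into a bare value."""
    return t[0] if len(t) == 1 else list(t)

def serialize_bag(tuples, tuple_rule):
    """Serialize a bag (modeled as a list of tuples) as a JSON array."""
    return [tuple_rule(t) for t in tuples]

two_value_tuple = ("A", "B")       # (A,B)
bag_of_two = [("A",), ("B",)]      # {(A),(B)}

# With unwrapping, the tuple and the bag produce the same JSON.
unwrapped_tuple = json.dumps(tuple_unwrap_single(two_value_tuple))
unwrapped_bag = json.dumps(serialize_bag(bag_of_two, tuple_unwrap_single))

# With "always an array", the two stay distinguishable.
array_tuple = json.dumps(tuple_always_array(two_value_tuple))
array_bag = json.dumps(serialize_bag(bag_of_two, tuple_always_array))

print(unwrapped_tuple, unwrapped_bag)
print(array_tuple, array_bag)
```

Under the unwrapping rule a reader of the JSON could not tell whether it came from a two-value tuple or from a bag of two tuples, which is the problem with reading the data back.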

aortez · 2014-04-18T18:10:29Z

Ok, thanks for the explanation Costin.

It looks like my expectations were incorrect... and it sounds like maybe I will not be able to load data in the exact structure I was hoping for - it will have to be an array with each element in its own nested array, e.g.: "my_fields": [ ["A"], ["B"] ], as opposed to "my_fields": [ "A", "B" ]. Right?

costin · 2014-04-18T18:18:55Z

You can get the JSON structure you need, but not with a bag. The crux of the problem is that Pig uses the tuple as its 'atom' and provides other complex data structures on top. And since a tuple can (and will) have multiple entries, an array (which ES can handle just fine) needs to be the basic mapping 'atom'.
If you use es-hadoop to read/write data to ES, the structure shouldn't matter in the end.
However, if you want to share the JSON with somebody else then, to get only a list, try to get rid of the bag and simply write a tuple; that is, rather than writing ({(A),(B),(C)}) (a bag of tuples), write (A,B,C), a basic tuple.
You can achieve this by 'flattening' the bag - see the Pig manual for more information.
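
As a rough picture of the difference between the two shapes, here is a small Python sketch. It models the data only: `flatten_bag` is a hypothetical helper, not Pig's FLATTEN operator and not es-hadoop code. In the Pig example earlier in this thread, the same shape could presumably be produced by building the tuple directly, for instance with TOTUPLE(f1, f2, f3) in place of TOBAG.

```python
import json

def flatten_bag(bag):
    """Concatenate the fields of every tuple in the bag into one tuple."""
    return tuple(field for tup in bag for field in tup)

bag_of_tuples = [("A",), ("B",), ("C",)]   # {(A),(B),(C)}

# Each tuple becomes its own JSON array when the bag is written as is.
as_bag = json.dumps({"my_fields": [list(t) for t in bag_of_tuples]})

# A single basic tuple (A,B,C) becomes one flat JSON array.
as_tuple = json.dumps({"my_fields": list(flatten_bag(bag_of_tuples))})

print(as_bag)
print(as_tuple)
```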


aortez · 2014-04-21T22:33:05Z

Hey Costin. It looks like with the snapshot build you specified, I am able to do as you suggest to get an array without field names, but some of the other behavior has also changed (with regard to the M2 release). It seems that it is no longer possible for any nested tuple to be named.

I think the following describes the behavior I am seeing:

if a tuple is at the root level, then it is named
if a tuple is at any other level, it is not named

Here is an example demonstrating this behavior:

-- data = A,B,C
data = LOAD 'test_data' USING PigStorage (',') AS (f1: chararray, f2: chararray, f3: chararray);
data2 = FOREACH data GENERATE f1, TOTUPLE(TOTUPLE(f1, f2), TOTUPLE(f3)) AS kitty: tuple(names, f3);
STORE data2 INTO 'dgb-1611/testSNAPSHOT_3' USING EsStorage();

Here is the behavior of the snapshot build (elasticsearch-hadoop-1.3.0.BUILD-20140417.223030-390.jar). As we can see, the names and f3 fields are not named, but the f1 and kitty fields are:

$ curl -XGET 'http://...:9200/dgb-1610/testSNAPSHOT/_search' | python -mjson.tool
...
                "_source": {
                    "f1": "A", 
                    "kitty": [
                        [
                            "A", 
                            "B"
                        ], 
                        "C"
                    ]
                }, 
...

And if we go back to the M2 build, all of the fields are named (this is the build I was using when I originally chimed in on this ticket):

$ curl -XGET 'http://...:9200/dgb-1610/testM2/_search' | python -mjson.tool
...
                "_source": {
                    "f1": "A", 
                    "kitty": {
                        "f3": "C", 
                        "names": {
                            "f1": "A", 
                            "f2": "B"
                        }
                    }
                }, 
...

I am trying to create something like this:

                "_source": {
                    "f1": "A",
                    "kitty": {
                        "names": [
                            "A",
                            "B"
                        ],
                        "f3": "C"
                    }
                },

I should clarify my use case. I am replacing an old Java-based ETL with a Pig-based one, and I am trying to exactly replicate the structure of the index it created. And of course, thank you for your time!

costin · 2014-04-22T03:59:49Z

The root tuple that you refer to is the actual row/entry in Pig that is mapped to a JSON document. That's why it needs to use names; otherwise its JSON representation would be invalid.
If you apply describe to data2 it will probably look something like this:

data2 = FOREACH data GENERATE f1, TOTUPLE(TOTUPLE(f1, f2), TOTUPLE(f3)) AS kitty: tuple(names, f3);
STORE data2 INTO 'dgb-1611/testSNAPSHOT_3' USING EsStorage();

f1:chararray, kitty t(t:(chararray, chararray), t:(chararray))

Setting your mapping aside: within the same structure, the tool would have to use names for some of your nested tuples but lists for others. What are the criteria? Furthermore, how would it know how to deserialize the same JSON back into Pig?

If you want a dedicated mapping, try working your way backwards: instead of using named tuples, use maps. Keep using f1 as a chararray, and define kitty as a map with one key, "names", holding a tuple, and the other key holding a chararray.

You can still use tuples, but there are two ways of dealing with them: without names, in which case they are an array of primitives, or with names, in which case they are converted into an array of maps (each entry will be converted to a map - field:tuple entry).
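
The map-based layout suggested above can be pictured with a short Python sketch. It assumes that maps are written as JSON objects and unnamed tuples as JSON arrays, as described in this thread; `serialize` is a hypothetical stand-in, not es-hadoop code.

```python
import json

def serialize(value):
    """Maps become JSON objects, tuples become JSON arrays, the rest is kept."""
    if isinstance(value, dict):
        return {key: serialize(item) for key, item in value.items()}
    if isinstance(value, tuple):
        return [serialize(item) for item in value]
    return value

# f1 stays a chararray; kitty is a map holding a tuple and a chararray.
row = {"f1": "A", "kitty": {"names": ("A", "B"), "f3": "C"}}

doc = json.dumps(serialize(row))
print(doc)
```

Because the keys come from the map rather than from a tuple schema, only the "names" entry is rendered as a list.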

Hope this helps,

oalam · 2014-05-09T06:48:54Z

thanks costin

costin added bug and removed rest labels Apr 16, 2014

costin added a commit that referenced this issue Apr 18, 2014

Improve handling of Pig Tuples

31122c3

change default serialization of tuples to hide/ignore their names. this results in tuples being pure arrays/lists vs maps (name : list of values) relates #177

costin closed this as completed in 18030db May 8, 2014
