[Hive] NULL structure value if key name is also a column name #543

jpastuszek · 2015-09-04T08:41:14Z

I have a problem with following mappings:

CREATE EXTERNAL TABLE http_access_20150902_external
    id STRING,
    clinics ARRAY<STRUCT<id:BIGINT, ppl:DOUBLE>>,
    ...
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
    'es.nodes' = 'prod.log.example.net',
    'es.resource' = 'logstash-2015.09.02/http-access',
    'es.index.auto.create' = 'false',
    'es.mapping.id' = '_id',
    'es.mapping.names' = 'date:@timestamp, id:_id',
    'es.field.read.validate.presence' = 'strict',
    'es.field.read.empty.as.null' = 'no'
);

With this table definition I am getting NULL value in clinics.id field:

select clinics from http_access_20150902_external where clinics is not null limit 1;
clinics  [{"id":null,"ppl":7.0},{"id":null,"ppl":0.0}...]

If I change the id column to something else it I get proper values:

CREATE EXTERNAL TABLE http_access_20150902_external
    logentry_id STRING,
    clinics ARRAY<STRUCT<id:BIGINT, ppl:DOUBLE>>,
    ...
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
    'es.nodes' = 'prod.log.example.net',
    'es.resource' = 'logstash-2015.09.02/http-access',
    'es.index.auto.create' = 'false',
    'es.mapping.id' = '_id',
    'es.mapping.names' = 'date:@timestamp, logentry_id:_id',
    'es.field.read.validate.presence' = 'strict',
    'es.field.read.empty.as.null' = 'no'
);

select clinics from http_access_20150902_external where clinics is not null limit 1;
clinics [{"id":106611,"ppl":7.0},{"id":107575,"ppl":0.0}...]

So it looks like if I have same column name as a structure field name the structure field will get NULL value.
I am using elasticsearch-hadoop-2.1.1 with hive-1.1.0 (cdh5.4.4) and elasticsearch-0.90.9.

The text was updated successfully, but these errors were encountered:

costin · 2015-09-04T09:28:50Z

_id is an internal field which is actually not part of the document source but rather its metadata.
To read it, you need to tell es-hadoop to properly return it, through these settings.
Further more, es.mapping.id is used for writing not reading data (I've updated the docs to better reflect its role).
If you have your own ids in the document, I suggest keeping the field in the document (whether its id or myid or what have you) and configure the mapping to point to it.
While you end up with the same results, it better indicates the fact that it is not an ID created by Elasticsearch but rather manually maintained.

P.S. You are using a very old version of ES. Not only upgrading to latest ES will give you better security but also performance.

jpastuszek · 2015-09-04T11:24:53Z

Sorry that was a typo actually and probably not related to the issue.

The problem is that the id filed is in the document stored like:

    "clinics": [
      {
        "id": 95611,
        "ppl": 5
      },
      {
        "id": 130244,
        "ppl": 0
      },
      {
        "id": 132508,
        "ppl": 7
      },
      {
        "id": 210577,
        "ppl": 5
      }
    ],

So to map it I use ARRAY<STRUCT<id:BIGINT, ppl:DOUBLE>> in the table.
This is all good and work but only if you don't have any other column named id.
This is like the space of the column names collide with space of nested structure field names in a way that is causing runtime bug where NULL is inserted in the structure id field instead in the actual BIGINT value.
In other words it should not matter if I use column name id and the STRUCT object field is also called id - this should work fine.

Re P.S. I know :/ I am planning migration to new ES cluster when I build it on top of my Hadoop nodes but first want to get this working so I know if I can meet my goals.

BTW: I use the table to read data from ES to Hadoop; does that mean I don't need es.mapping.id at all if I am not planning to write?

costin · 2015-09-05T21:16:22Z

When you say column are you referring to Hive or Elastic? Hive is flat so the columns name don't reflect their structure too much - and many times they can't reflect the one in ES. This is where you can use es.mapping.names as mentioned here.
Do note that complex types are not really well supported by Hive (or Pig for that matter) and many times just reading the ES fields directly in their top-level columns and then merging these manually in Hive/Pig tends to work better (not ideal but doable).

BTW: yes, if you don't write, you don't need es.mapping.id

costin · 2015-10-28T18:27:14Z

Since there hasn't been any update, I'm closing the issue. Please open a new one if needed. Thanks!

costin added question :Hive :Rest v2.1.2 v2.2.0-beta1 labels Sep 4, 2015

costin closed this as completed Oct 28, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Hive] NULL structure value if key name is also a column name #543

[Hive] NULL structure value if key name is also a column name #543

jpastuszek commented Sep 4, 2015

costin commented Sep 4, 2015

jpastuszek commented Sep 4, 2015

costin commented Sep 5, 2015

costin commented Oct 28, 2015

[Hive] NULL structure value if key name is also a column name #543

[Hive] NULL structure value if key name is also a column name #543

Comments

jpastuszek commented Sep 4, 2015

costin commented Sep 4, 2015

jpastuszek commented Sep 4, 2015

costin commented Sep 5, 2015

costin commented Oct 28, 2015