Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Hive] NULL structure value if key name is also a column name #543

Closed
jpastuszek opened this issue Sep 4, 2015 · 4 comments
Closed

[Hive] NULL structure value if key name is also a column name #543

jpastuszek opened this issue Sep 4, 2015 · 4 comments

Comments

@jpastuszek
Copy link

I have a problem with following mappings:

CREATE EXTERNAL TABLE http_access_20150902_external
    id STRING,
    clinics ARRAY<STRUCT<id:BIGINT, ppl:DOUBLE>>,
    ...
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
    'es.nodes' = 'prod.log.example.net',
    'es.resource' = 'logstash-2015.09.02/http-access',
    'es.index.auto.create' = 'false',
    'es.mapping.id' = '_id',
    'es.mapping.names' = 'date:@timestamp, id:_id',
    'es.field.read.validate.presence' = 'strict',
    'es.field.read.empty.as.null' = 'no'
);

With this table definition I am getting NULL value in clinics.id field:

select clinics from http_access_20150902_external where clinics is not null limit 1;
clinics  [{"id":null,"ppl":7.0},{"id":null,"ppl":0.0}...]

If I change the id column to something else it I get proper values:

CREATE EXTERNAL TABLE http_access_20150902_external
    logentry_id STRING,
    clinics ARRAY<STRUCT<id:BIGINT, ppl:DOUBLE>>,
    ...
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
    'es.nodes' = 'prod.log.example.net',
    'es.resource' = 'logstash-2015.09.02/http-access',
    'es.index.auto.create' = 'false',
    'es.mapping.id' = '_id',
    'es.mapping.names' = 'date:@timestamp, logentry_id:_id',
    'es.field.read.validate.presence' = 'strict',
    'es.field.read.empty.as.null' = 'no'
);

select clinics from http_access_20150902_external where clinics is not null limit 1;
clinics [{"id":106611,"ppl":7.0},{"id":107575,"ppl":0.0}...]

So it looks like if I have same column name as a structure field name the structure field will get NULL value.
I am using elasticsearch-hadoop-2.1.1 with hive-1.1.0 (cdh5.4.4) and elasticsearch-0.90.9.

@costin
Copy link
Member

costin commented Sep 4, 2015

_id is an internal field which is actually not part of the document source but rather its metadata.
To read it, you need to tell es-hadoop to properly return it, through these settings.
Further more, es.mapping.id is used for writing not reading data (I've updated the docs to better reflect its role).
If you have your own ids in the document, I suggest keeping the field in the document (whether its id or myid or what have you) and configure the mapping to point to it.
While you end up with the same results, it better indicates the fact that it is not an ID created by Elasticsearch but rather manually maintained.

P.S. You are using a very old version of ES. Not only upgrading to latest ES will give you better security but also performance.

@jpastuszek
Copy link
Author

Sorry that was a typo actually and probably not related to the issue.

The problem is that the id filed is in the document stored like:

    "clinics": [
      {
        "id": 95611,
        "ppl": 5
      },
      {
        "id": 130244,
        "ppl": 0
      },
      {
        "id": 132508,
        "ppl": 7
      },
      {
        "id": 210577,
        "ppl": 5
      }
    ],

So to map it I use ARRAY<STRUCT<id:BIGINT, ppl:DOUBLE>> in the table.
This is all good and work but only if you don't have any other column named id.
This is like the space of the column names collide with space of nested structure field names in a way that is causing runtime bug where NULL is inserted in the structure id field instead in the actual BIGINT value.
In other words it should not matter if I use column name id and the STRUCT object field is also called id - this should work fine.

Re P.S. I know :/ I am planning migration to new ES cluster when I build it on top of my Hadoop nodes but first want to get this working so I know if I can meet my goals.

BTW: I use the table to read data from ES to Hadoop; does that mean I don't need es.mapping.id at all if I am not planning to write?

@costin
Copy link
Member

costin commented Sep 5, 2015

When you say column are you referring to Hive or Elastic? Hive is flat so the columns name don't reflect their structure too much - and many times they can't reflect the one in ES. This is where you can use es.mapping.names as mentioned here.
Do note that complex types are not really well supported by Hive (or Pig for that matter) and many times just reading the ES fields directly in their top-level columns and then merging these manually in Hive/Pig tends to work better (not ideal but doable).

BTW: yes, if you don't write, you don't need es.mapping.id

@costin
Copy link
Member

costin commented Oct 28, 2015

Since there hasn't been any update, I'm closing the issue. Please open a new one if needed. Thanks!

@costin costin closed this as completed Oct 28, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants