Multiple indexes setting for 'es.resource' #289

Closed
mungeol opened this issue Oct 2, 2014 · 8 comments

Comments

@mungeol

mungeol commented Oct 2, 2014

Setting 'es.resource' = 'apache-2014.09.*/apache-access' or
'es.resource' = 'apache-2014.09.29,apache-2014.09.30/apache-access'
does not work correctly for the HiveQL query 'select count(*) from test':
the count it returns is wrong.
The 'es.resource' setting should support multiple indexes,
or at least fail with an error message.

@costin
Member

costin commented Oct 3, 2014

Could you explain what the expectation is and what is the actual result? es-hadoop does minimal interpretation of the index/type and feeds the information directly to Elasticsearch.

@mungeol
Author

mungeol commented Oct 6, 2014

The Hive table I created is shown below:

CREATE EXTERNAL TABLE test
(
date timestamp,
clientip string,
request string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES
(
'es.resource' = 'apache-2014.09.29/apache-access',
-- or
-- 'es.resource' = 'apache-2014.09.30/apache-access',
'es.mapping.names' = 'date:@timestamp'
);

I then ran the Hive query 'select count(*) from test;' to count the rows in the table.
The result matches the ES count:
1454536 for the apache-2014.09.29 index and 215564 for the apache-2014.09.30 index.
Next, I changed 'es.resource' = 'apache-2014.09.29/apache-access' to
'es.resource' = 'apache-2014.09.*/apache-access' or
'es.resource' = 'apache-2014.09.29,apache-2014.09.30/apache-access'
to include multiple indexes,
and ran 'select count(*) from test;' again to count the documents across both indexes.
This time the result differs from the ES count:
Hive returns 2919161, but it should be 1670100 (1454536 + 215564).


Environment information

  • CentOS 6.4 64-bit / java version "1.7.0_55"
  • CDH-5.1.2-1.cdh5.1.2.p0.3
  • Hive 0.12.0
  • elasticsearch-hadoop-2.0.1
  • 3-node Hadoop and ES cluster

@costin
Member

costin commented Oct 8, 2014

Hi,

Sorry for the delay in picking this up. I've tried reproducing this but can't - maybe it has something to do with the dataset or potentially the way the counting is done.
I'm not sure where that number (2919161) is coming from - clearly some data is being returned but not properly processed.
Can you please confirm the following:

  • the count result when queried directly against ES through curl. Double-check that the indices are properly refreshed (i.e. there's no in-flight data)
  • after you have loaded the data, start a separate Hive query with logging turned all the way to TRACE level on org.elasticsearch.hadoop package (as mentioned in the docs). There will be a lot of data so please be patient - archive the results and let me know where you have uploaded them.
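For the first check, the same count can also be scripted rather than typed into curl. The following is a minimal sketch, assuming Elasticsearch is reachable over HTTP and exposes the `_count` endpoint under the same index/type path that 'es.resource' names. The host `http://localhost:9200` and the helper names `count_url` and `fetch_count` are illustrative assumptions, not part of es-hadoop.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def count_url(host: str, resource: str) -> str:
    """Build the _count URL for an es.resource value such as 'idx1,idx2/type'."""
    # Leave ',', '*' and '/' unescaped: Elasticsearch reads them as the
    # multi-index separator, the wildcard and the index/type separator.
    return f"{host.rstrip('/')}/{quote(resource, safe=',*/')}/_count"


def fetch_count(host: str, resource: str) -> int:
    """Return the document count reported by Elasticsearch (needs a live cluster)."""
    with urlopen(count_url(host, resource)) as response:
        return json.load(response)["count"]


if __name__ == "__main__":
    # Hypothetical host; replace with a node of the cluster under test.
    print(count_url("http://localhost:9200",
                    "apache-2014.09.29,apache-2014.09.30/apache-access"))
```

Comparing the value from `fetch_count` for each single index with the value for the combined resource shows whether Elasticsearch itself or the Hive side is producing the extra rows.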

Thanks!

@mungeol
Author

mungeol commented Oct 10, 2014

Hi,

I created new indexes with a small data set to make testing easier.
I indexed the same three documents into each of the cars-01 and cars-02 indexes, using the same type name, transactions.

POST /cars-01/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }

POST /cars-02/transactions/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }

'GET cars-01/_search?search_type=count' returns 3 hits
'GET cars-02/_search?search_type=count' returns 3 hits
'GET cars-*/_search?search_type=count' returns 6 hits

I then created the table below:

CREATE EXTERNAL TABLE cars
(
price bigint,
color string,
make string,
sold timestamp
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES
(
'es.resource' = 'cars-*/transactions'
);

'select * from cars' and 'select count(*) from cars' return a different result every time.
example 1 (9 rows)

10000 red honda 2014-10-28 00:00:00
20000 red honda 2014-11-05 00:00:00
30000 green ford 2014-05-18 00:00:00
20000 red honda 2014-11-05 00:00:00
30000 green ford 2014-05-18 00:00:00
10000 red honda 2014-10-28 00:00:00
10000 red honda 2014-10-28 00:00:00
20000 red honda 2014-11-05 00:00:00
30000 green ford 2014-05-18 00:00:00

example 2 (10 rows)

10000 red honda 2014-10-28 00:00:00
20000 red honda 2014-11-05 00:00:00
30000 green ford 2014-05-18 00:00:00
10000 red honda 2014-10-28 00:00:00
10000 red honda 2014-10-28 00:00:00
10000 red honda 2014-10-28 00:00:00
20000 red honda 2014-11-05 00:00:00
30000 green ford 2014-05-18 00:00:00
20000 red honda 2014-11-05 00:00:00
30000 green ford 2014-05-18 00:00:00
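To make the duplication in the two outputs easier to see, here is a small sketch that tallies the returned rows and reports how many surplus copies of each appear. It assumes each distinct row should occur exactly twice (once per index, since cars-01 and cars-02 hold the same three documents). The function name `surplus_rows` is illustrative only.

```python
from collections import Counter

# cars-01 and cars-02 hold the same three documents, so every
# distinct row is expected exactly twice in a correct result.
INDEX_COUNT = 2

HONDA_10K = (10000, "red", "honda", "2014-10-28 00:00:00")
HONDA_20K = (20000, "red", "honda", "2014-11-05 00:00:00")
FORD_30K = (30000, "green", "ford", "2014-05-18 00:00:00")

# Rows in the order they were returned in example 1 (9 rows).
example_1 = [HONDA_10K, HONDA_20K, FORD_30K, HONDA_20K, FORD_30K,
             HONDA_10K, HONDA_10K, HONDA_20K, FORD_30K]

# Rows in the order they were returned in example 2 (10 rows).
example_2 = [HONDA_10K, HONDA_20K, FORD_30K, HONDA_10K, HONDA_10K,
             HONDA_10K, HONDA_20K, FORD_30K, HONDA_20K, FORD_30K]


def surplus_rows(rows, copies_expected=INDEX_COUNT):
    """Map each row to its number of extra copies beyond the expected count."""
    tally = Counter(rows)
    return {row: seen - copies_expected
            for row, seen in tally.items()
            if seen != copies_expected}


if __name__ == "__main__":
    for name, rows in (("example 1", example_1), ("example 2", example_2)):
        print(name, len(rows), "rows, surplus:", surplus_rows(rows))
```

In both runs every document is returned more often than the two copies that exist, and the surplus is not the same from run to run.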

I uploaded two logs.
es_hadoop.log contains the output after I added 'log4j.category.org.elasticsearch.hadoop.hive=TRACE' to the log4j.properties file.
hive.log contains the output after I changed 'hadoop.root.logger=INFO' to 'hadoop.root.logger=TRACE'.
Here are the links to both logs:
https://raw.githubusercontent.com/mungeol/log/master/es_hadoop.log
https://raw.githubusercontent.com/mungeol/log/master/hive.log
I hope they help.

Thanks.

@romanmar1

I get this issue as well, but from a Spark perspective.
I have two exactly identical indices.
I use a query that returns exactly 4 documents.
If I set es.resource to "index1/mydoc", everything works as expected:

  1. The count of the rdd is 4.
  2. Collecting the partitions reveals that there is a single partition of size 4.

If I set es.resource to "index1,index2/mydoc", things get a bit weird:

  1. The count is 16 (instead of the expected 8).
  2. There are two partitions, each of size 8, with the same documents.

Moving forward, if I add an additional index, "index1,index2,index3/mydoc", the count becomes 36, with 3 identical partitions of size 12 each.
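The numbers reported above fit a simple pattern, sketched below: there is one partition per index, and every partition returns the matching documents from all of the listed indices instead of only its own. This is a model fitted to the reported counts, not a confirmed description of the connector's internals. The one-partition-per-index mapping and the function names are assumptions made for illustration.

```python
def expected_count(index_count: int, docs_per_index: int) -> int:
    """Correct total: each matching document is read exactly once."""
    return index_count * docs_per_index


def observed_count(index_count: int, docs_per_index: int) -> int:
    """Total under the assumed duplication pattern.

    Assumption: one partition per index, and each partition queries
    every listed index rather than just the one it belongs to.
    """
    partitions = index_count
    docs_per_partition = index_count * docs_per_index
    return partitions * docs_per_partition


if __name__ == "__main__":
    # 4 matching documents per index, as in the report above.
    for n in (1, 2, 3):
        print(n, "index(es): expected", expected_count(n, 4),
              "observed", observed_count(n, 4))
```

Under this model the over-count grows with the square of the number of indices, which would explain why a single index looks correct while two or three do not.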

@costin
Member

costin commented Apr 28, 2015

Folks, can you try the latest Beta (4) and see whether it addressed your issue? There have been several updates on this front.

Thanks,

@mungeol
Author

mungeol commented Apr 28, 2015

It is working now.
Thanks,

@costin
Member

costin commented Oct 28, 2015

Closing the issue...

@costin costin closed this as completed Oct 28, 2015