support dynamic index/type #175

Closed
cynosureabu opened this issue Mar 25, 2014 · 5 comments

Comments

@cynosureabu

My data is in a partitioned Hive table. I know I can read it partition by partition and create an external table with es.resource pointing to that partition. But is it possible to read multiple partitions together and have data from different partitions written to different types/indices?

Something like:

    es.resource.index = column_name1
    es.resource.type = partition_column_name

Is there such functionality already?
Thanks,
Chen
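
For illustration only: the kind of per-record routing asked about here resembles the {field} pattern syntax that elasticsearch-hadoop later adopted for multi-resource writes, where the index/type is resolved from each document's own fields. A hedged sketch, with hypothetical table and column names (this was not available in releases current at the time of this issue):

    -- hypothetical: index and type resolved per record from the
    -- 'name' and 'part_col' fields of each row
    CREATE EXTERNAL TABLE es_dynamic (name STRING, value STRING, part_col STRING)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES ('es.resource' = '{name}/{part_col}');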

@costin
Member

costin commented Mar 25, 2014

Let me start by saying that partitions and external tables are clunky and buggy.
If I understand correctly, you'd like to use the partitions of a table as the type when writing data to Elasticsearch.
This should be fairly easy to achieve by parameterizing your Hive script: create a simple script that reads the table for partition {X} and writes to Elasticsearch under index/{X}, then run the script binding {X} to the partition you want (see the sketch below).
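
A minimal sketch of this parameterized approach, assuming a hypothetical source table partitioned by a column named part and the elasticsearch-hadoop Hive storage handler:

    -- parameterized.hql, run as: hive --hivevar X=p1 -f parameterized.hql
    -- writes the rows of one partition to the Elasticsearch resource index/<X>
    CREATE EXTERNAL TABLE es_out_${hivevar:X} (name STRING, value STRING)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES ('es.resource' = 'index/${hivevar:X}');

    INSERT OVERWRITE TABLE es_out_${hivevar:X}
    SELECT name, value FROM source WHERE part = '${hivevar:X}';

Looping over partitions then happens outside Hive, e.g. a shell loop that invokes the script once per partition value.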

As a side note, partitioning is typically used to improve performance. If you deal with large volumes of data, you probably want each partition to point to a different index rather than a different type, since using types would push all the data under the same index.

@cynosureabu
Author

" parameterizing your Hive script " This is what I am currently doing. The issue is that I cannot query multiple partitions at the same time.

I also see lots of connection exceptions (out of nodes and retries?) in my mappers. Is there any known tuning I could do to avoid this? I have tried increasing the timeout to 10m, which seems to help, but I'd like to know if there are better approaches.

My cluster has 6 machines, each running ES with 20 GB of memory, and each partition is around 30 million records (with 4-5 string fields).

@costin
Member

costin commented Mar 25, 2014

From an Elasticsearch perspective, if you push your data under the same index, you can access it all with a single query. If you have multiple indices, you can write a query that targets all of them.
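
If each partition lands in its own index, reading them back together is still straightforward: elasticsearch-hadoop accepts multiple indices (and wildcard patterns) in es.resource when reading. A hedged sketch with hypothetical index and type names:

    -- read-side table spanning several indices; comma-separated lists and
    -- wildcards such as 'logs-*' are allowed for reads
    CREATE EXTERNAL TABLE logs_all (name STRING, value STRING)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES ('es.resource' = 'logs-p1,logs-p2/entry');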

The connection exceptions can have many causes, and without concrete information I can only guess at the issue. If you are writing, consider reducing the bulk size (we'll lower the default in the next release); if you are reading, it depends on how you stream the data.

You can always turn on logging for the various packages under org.elasticsearch.hadoop to see what's going wrong. I also recommend trying the latest master.
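
The connector logs through the standard Hadoop log4j setup, so a minimal sketch of the relevant log4j.properties entries on the job classpath would be (levels illustrative):

    # raise verbosity only for the connector's packages
    log4j.logger.org.elasticsearch.hadoop=DEBUG
    log4j.logger.org.elasticsearch.hadoop.rest=TRACE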

@cynosureabu
Author

Thanks Costin. All my operations are writes. I will try decreasing the bulk size and will turn on logging; I'll keep you posted.

@costin
Member

costin commented Mar 25, 2014

Try with a bulk size of 5 MB and move from there. Note that this is the batch size per task: if your job runs 10 tasks, that means up to 50 MB of bulk requests hitting the cluster at once; with 20 tasks, 100 MB, and so on. In Hive these settings go into the table properties (see the sketch below).
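
A hedged sketch using the documented elasticsearch-hadoop settings (values illustrative, table name hypothetical):

    -- es.batch.size.bytes: bulk flush threshold, applied per task
    -- es.batch.size.entries: optional cap on documents per bulk request
    -- es.http.timeout: the timeout already raised earlier in this thread
    CREATE EXTERNAL TABLE es_out (name STRING, value STRING)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES (
      'es.resource'           = 'index/type',
      'es.batch.size.bytes'   = '5mb',
      'es.batch.size.entries' = '5000',
      'es.http.timeout'       = '10m');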
