support dynamic index/type #175
Comments
Let me first start by saying that partitions and external tables are clunky and buggy. As a side note, partitioning is typically used to improve performance - if you deal with large volumes of data, you probably want each partition to point to a different index instead of a different type (since using types would push all the data under the same index).
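For illustration, a minimal HiveQL sketch of that per-partition, per-index layout, assuming a hypothetical source table `logs` partitioned by `dt`; the table names, columns, index name, and partition value are all invented for the example, and older es-hadoop releases spelled the storage handler `ESStorageHandler`:

```sql
-- One external table per partition, each pointing at its own index.
-- The index name (logs-2014-01-01) and type (log) are placeholders.
CREATE EXTERNAL TABLE logs_es_20140101 (
  username STRING,
  message  STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource' = 'logs-2014-01-01/log');

-- Load a single partition into its dedicated index.
INSERT OVERWRITE TABLE logs_es_20140101
SELECT username, message FROM logs WHERE dt = '2014-01-01';
```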
" parameterizing your Hive script " This is what I am currently doing. The issue is that I cannot query multiple partitions at the same time. I also see lots of connection(out of nodes and retry?) exceptions in my mapper. Is there any known tune up I could do to avoid this issue? i have tried to increase the time out to be 10m, it seems to get better, but wanna know if any better ways. My cluster has 6 machines each ES is running with 20G mem, and each partition is around 30million records.(with 4-5 string fields). |
From an Elasticsearch perspective, if you push your data under the same index, you can access it with the same query. If you have multiple indices, you can create a query that queries all of them. The connection exceptions can have a plethora of reasons, and without any concrete information I can only guess what the issue is. If you are writing, consider minimizing the bulk size (we'll do this in the next release); if it's reading, it depends on how you stream the data. You can always turn on logging on the various packages involved.
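A sketch of the multi-index read Costin mentions, mapped back to Hive: this assumes the per-partition indices share a naming pattern and that es-hadoop passes the wildcard through to Elasticsearch's multi-index search (worth verifying against your es-hadoop version); all names are hypothetical:

```sql
-- One read-side external table spanning all per-partition indices.
-- The wildcard pattern logs-2014-* is a placeholder.
CREATE EXTERNAL TABLE logs_es_all (
  username STRING,
  message  STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource' = 'logs-2014-*/log');

-- Query across every matching index in one statement.
SELECT username, COUNT(*) FROM logs_es_all GROUP BY username;
```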
Thanks Costin. All my operations are writes. I will try decreasing the bulk size and will turn on logging. Will keep you posted.
Try with a bulk size of 5MB and move from there. Note that this is the batch size per task - if your job has 10 tasks, that leads to 100MB of bulk requests in flight; with 20 tasks, 200MB, and so on.
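To make that concrete, a hedged sketch of where the 5 MB starting point would go, using the es.batch.size.bytes / es.batch.size.entries settings from the es-hadoop configuration (table and index names are placeholders):

```sql
-- Cap each task's bulk request at ~5 MB or 1000 documents, whichever
-- fills first. Both limits are per task, so the aggregate load on the
-- cluster grows with the number of concurrent tasks.
CREATE EXTERNAL TABLE logs_es_tuned (
  username STRING,
  message  STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource'           = 'logs-2014-01-01/log',
  'es.batch.size.bytes'   = '5mb',
  'es.batch.size.entries' = '1000'
);
```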
My data is a partitioned Hive table. I know I can read it partition by partition and create an external table with es.resource pointing to that partition. But is it possible to read multiple partitions together and have each partition's data written to a different type/index?
Something like:
es.resource.index = column_name1
es.resource.type = partition_column_name
Is there such functionality already?
Thanks,
Chen
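For what the request is driving at, a hypothetical sketch: later es-hadoop releases document a {field-name} pattern for dynamic/multi-resource writes (es.resource.write), which resolves the index per record from a column; whether it is available depends on your version, and all names below are invented:

```sql
-- Hypothetical dynamic routing: {dt} would be filled in per record from
-- the dt column, so each partition's rows land in their own index.
CREATE EXTERNAL TABLE logs_es_dynamic (
  username STRING,
  message  STRING,
  dt       STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES ('es.resource.write' = 'logs-{dt}/log');

-- All partitions in one pass instead of one INSERT per partition.
INSERT OVERWRITE TABLE logs_es_dynamic
SELECT username, message, dt FROM logs;
```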