
Auto open/close index functionality #10869

Closed
markwalkom opened this issue Apr 29, 2015 · 7 comments

Comments

@markwalkom
Contributor

For time series users, most of the queries happen within the relatively close past, 24-48 hours, up to a week or a month. However, retention requirements mean that this data may need to be kept around for months to years, and currently we recommend that people use a hot/cold setup with shard allocation filtering.
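For readers unfamiliar with the hot/cold setup, a minimal sketch, assuming each node's config tags it with a custom box_type attribute (hot or cold) and Elasticsearch on localhost:9200; the index name is illustrative:

    # Move an aging index onto "cold" nodes via shard allocation filtering.
    # Assumes each node's config sets a custom attribute, eg box_type: hot|cold.
    import requests

    def move_to_cold(index, base="http://localhost:9200"):
        # Require every shard of `index` to be allocated on box_type=cold
        # nodes; Elasticsearch relocates the shards accordingly.
        resp = requests.put(
            f"{base}/{index}/_settings",
            json={"index.routing.allocation.require.box_type": "cold"},
        )
        resp.raise_for_status()

    move_to_cold("logstash-2015.03.01")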

Other (additional) options include closing indices, which keeps them persisted on disk while reducing the impact on heap.

The last choice is definitely viable, but Elasticsearch is relatively ignorant around this from a user standpoint; an admin will need to open an index, allow the end user to query it, and then close it again when they're done, to ensure the best resource use for their infrastructure. None of that is automatic.

It'd be handy for these use cases if we had configurable functionality that allows an admin to set, eg (sketched after this list):

  • Allow closed indices to be opened automatically
  • Set a limit on the number of previously closed indices that can be open at any one time, eg only allow Y re-opened indices at once, to keep resource use under a limit.
  • Time frame from when a closed index was opened to when it will be re-closed, as either a:
    • Hard time. eg 1 hour from open.
    • Time from last use. eg 1 hour from last query seen.
  • Automatically close the index after the period above.
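
To make that concrete, a purely hypothetical settings sketch; none of these keys exist in Elasticsearch, the names are invented only for illustration:

    # Hypothetical settings sketch -- none of these keys exist in
    # Elasticsearch; the names are invented to make the proposal concrete.
    hypothetical_auto_open_settings = {
        "indices.auto_open.enabled": True,      # allow closed indices to auto-open
        "indices.auto_open.max_open": 5,        # at most Y re-opened indices at once
        "indices.auto_open.close_after": "1h",  # hard time from open, or
        "indices.auto_open.idle_after": "1h",   # time from last query seen
    }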

There are a few things to be careful of here, particularly around stability:

  • Setting the re-open limit too high
  • Opening a large number of indices at a single time

There's going to be more to it than this, but I've seen this sort of thing mentioned a few times in the community and I think it'd be a good feature to have for larger installs using time-series data.

@ashpynov

ashpynov commented May 5, 2015

I'm one of the people who requested it.
One more suggestion: open indices only on a direct query, and only for the query time (the time needed to search and fetch data from the indices). Maybe keep some kind of LRU cache of "semi-opened" indices, as sketched below.

For information: I have 3.4TB of stored data on each node for 1 week, which requires 8.6GB of Lucene memory. But I have to store, and serve on-demand queries against, data up to 1 year old, and that would require 447.2GB of heap (8.6GB × 52 weeks), which looks unreasonable even with the G1GC garbage collector (which we use now).
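
A rough sketch of such an LRU of "semi-opened" indices; open_index and close_index here are hypothetical callbacks standing in for the real open/close index API calls:

    # Rough sketch of an LRU cache of temporarily re-opened indices.
    # `open_index` and `close_index` are hypothetical callbacks standing in
    # for the real open/close index API calls.
    from collections import OrderedDict

    class SemiOpenCache:
        def __init__(self, max_open, open_index, close_index):
            self.max_open = max_open
            self.open_index = open_index
            self.close_index = close_index
            self.lru = OrderedDict()  # index name -> most recently used last

        def touch(self, index):
            """Ensure `index` is open, evicting the least recently used one."""
            if index in self.lru:
                self.lru.move_to_end(index)  # already open: mark as fresh
                return
            if len(self.lru) >= self.max_open:
                victim, _ = self.lru.popitem(last=False)  # evict the oldest
                self.close_index(victim)
            self.open_index(index)
            self.lru[index] = True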

@clintongormley

Hi @markwalkom and @ashpynov

We talked about this on our FixItFriday call today. While the feature sounds appealing, I think there are lots of gotchas that are not immediately apparent, and would result in adding a huge amount of complexity in order to try to support widely differing policies, eg:

  • opening an index can be a heavy task, especially if the translog needs replaying
  • the query might result in lots of fielddata being loaded into memory, which could also take a significant amount of time and heap
  • what happens if you query 10 closed indices, but your auto-open limit is 5?
  • what happens if you query the first 5, then query the next 5? Does the user have to wait for (eg) 1 hour before they can run the second query?
  • what happens if the 5 indices you open use up more than the available heap?
  • what happens when multiple people query different closed indices at the same time?

Today we provide a simple open and close API, which allows the administrator to decide on policies and to build an interface to implement those policies. I think this is the correct approach: complexity is handled on a case-by-case basis, rather than Elasticsearch trying to cater for every need.
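
For example, the kind of policy script an administrator can build on top of that API today; a sketch assuming daily logstash-YYYY.MM.DD indices and Elasticsearch on localhost:9200:

    # Sketch of an admin-side policy: close daily indices older than N days.
    # Assumes daily indices named logstash-YYYY.MM.DD on localhost:9200.
    import requests
    from datetime import datetime, timedelta

    def close_older_than(days, base="http://localhost:9200"):
        cutoff = datetime.utcnow() - timedelta(days=days)
        names = requests.get(f"{base}/_cat/indices?h=index").text.split()
        for name in names:
            if not name.startswith("logstash-"):
                continue
            try:
                day = datetime.strptime(name[len("logstash-"):], "%Y.%m.%d")
            except ValueError:
                continue  # skip names that don't match the daily pattern
            if day < cutoff:
                requests.post(f"{base}/{name}/_close").raise_for_status()

    close_older_than(days=7)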

@ashpynov you mention that each of your indices requires about 9GB of heap. Presumably you're using fielddata rather than doc values. You can switch to using doc values today, and in v2 we'll be switching them on by default.
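
For reference, a sketch of enabling doc values per field when creating a 1.x index; an existing field's mapping can't be switched in place, so this applies to newly created indices, and the index, type, and field names are illustrative:

    # Create an index whose fields store doc values instead of using fielddata.
    # Index, type, and field names are illustrative; existing fields can't be
    # switched in place, so this applies to newly created indices.
    import requests

    mapping = {
        "mappings": {
            "event": {
                "properties": {
                    "timestamp": {"type": "date", "doc_values": True},
                    "bytes": {"type": "long", "doc_values": True},
                }
            }
        }
    }
    requests.put("http://localhost:9200/logstash-2015.05.05",
                 json=mapping).raise_for_status()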

@ashpynov

ashpynov commented May 8, 2015

@clintongormley no, we already use doc values; otherwise it would take up to 2% of the index size (about 10x more).

@ashpynov

ashpynov commented Oct 7, 2015

Hi, @clintongormley @markwalkom
We started prototyping such a feature based on version 1.7, with these policy decisions:

  • frozen indices are READ ONLY, so there is no translog. Opening an index means reading about 0.5% of the index size from HDD, roughly 12 sec for a 500GB index (16 HDDs, 7200rpm, RAID6), which is very acceptable.
  • of course, fielddata loading is a problem even on "hot" indices, and it is more or less protected by the circuit breaker.
  • if the open-index count hits the limit, the breaker can be the solution; the number of loaded cold indices is limited by the search thread pool size and the number of indices per query.

The open/close functionality is very hard to implement due to the cluster architecture (a sketch of the proxy step follows below):

  1. we have to proxy the query and decide whether an index needs to be opened
  2. we have to proxy the query through a single node
  3. we need to know where the involved index shards are located, and how many indices on that node are already open

But most of the complexity is:

  1. Opening an index makes the cluster RED while it opens, so in our case the cluster would be red half the time, or even always.
  2. When searching across several daily indices, we need to place the result-reducing logic on the client side.
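
A simplified sketch of that proxy step, using the real _cat/indices, _open, _cluster/health, and _search endpoints; error handling and the re-close policy are omitted:

    # Simplified sketch of the proxy step: open any closed indices named in
    # the query, wait for them to recover, then search. The re-close policy
    # and error handling are omitted.
    import requests

    BASE = "http://localhost:9200"

    def search_with_auto_open(indices, body):
        lines = requests.get(f"{BASE}/_cat/indices?h=index,status").text.splitlines()
        status = dict(line.split() for line in lines if line.strip())
        closed = [name for name in indices if status.get(name) == "close"]
        for name in closed:
            requests.post(f"{BASE}/{name}/_open").raise_for_status()
        if closed:
            # Opening makes the recovering shards unavailable; wait for them
            # before searching so the query doesn't fail on missing shards.
            requests.get(f"{BASE}/_cluster/health/{','.join(closed)}"
                         "?wait_for_status=yellow&timeout=120s")
        return requests.post(f"{BASE}/{','.join(indices)}/_search", json=body).json()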

@makeyang
Contributor

Why was it closed? Have you guys already made the decision?

@clintongormley

Yes

@ashpynov

We did it on top of version 1.7.3. In our use case (data stored on a daily basis for 2 years, but a common search interval of about 1 month) it reduced memory usage from 256GB to 32GB per server. An SSD cache on the RAID storage also reduces the penalty of loading/unloading data during queries.
But we did not use it in production, because we changed the search engine to our own custom one and migrated to ES 2.3 for specific search cases.
Also, the code was developed by C++ coders, so we are ashamed to publish it even here.
