index going to "completed" ILM phase instead of "warm" #46357

Closed
DanRoscigno opened this issue Sep 4, 2019 · 37 comments
Labels
>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management

Comments

@DanRoscigno
Contributor

Bug report

Elasticsearch version
7.2.0 and 7.3.1 ESS

Description of the problem including expected versus actual behavior:
When I look at my indices under index management and apply the Lifecycle filters (hot, warm, cold) I see that I have a bunch of Filebeat indices that are not in any of the above states. When I look at the details I see the state is "completed":

image

With the following ILM policy, the index should go to "warm" on rollover:

{
    "policy": {
        "phases": {
            "hot": {
                "min_age": "0ms",
                "actions": {
                    "rollover": {
                        "max_age": "1h",
                        "max_size": "50mb",
                        "max_docs": 100
                    }
                }
            },
            "warm": {
                "min_age": "1h",
                "actions": {
                    "allocate": {
                        "include": {},
                        "exclude": {},
                        "require": {
                            "data": "warm"
                        }
                    }
                }
            }
        }
    }
}

Steps to reproduce:

  1. Run filebeat setup
  2. Edit ILM policy to reduce time (so you do not have to wait 30 days)
  3. Run filebeat
  4. Wait for rollover
@DanRoscigno
Contributor Author

@TomLawler : Change the filter on your indices from ilm.phase:(warm) to ilm.phase:(completed) and see if they show up.

@iverase iverase added :Data Management/ILM+SLM Index and Snapshot lifecycle management >bug labels Sep 5, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@dakrone
Member

dakrone commented Sep 5, 2019

@DanRoscigno after the rollover, ILM moves the index into the warm phase and then immediately executes the allocate action, then moves the index into the complete phase. It sounds like you're expecting it to stay in the warm phase forever? This is expected behavior from ILM.

@DanRoscigno
Contributor Author

Thanks @dakrone ,
In my case (where I have no cold phase) does complete mean that the index is warm because it is available for searching and on a warm node and also complete because I have no further phases to go through?

If this is the thinking, then I would like it to show up in the ILM UI when I set the filter to show me warm indices:
image
From looking at the way the UI works, setting the phase to warm-completed would work (the filter matches something like *<choice>*).
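
As a rough illustration of why such a label would match, here is a minimal Python sketch. It assumes the filter is a plain substring test, as described above. The names `matches_filter` and `select` are made up for the example, and "warm-completed" is a hypothetical label that ILM does not report.

```python
def matches_filter(phase_label: str, choice: str) -> bool:
    """Substring match, mimicking a filter of the form *<choice>*."""
    return choice in phase_label


# Phase strings as discussed in this thread. "warm-completed" is the
# hypothetical combined label, not a value ILM actually returns.
LABELS = ["hot", "warm", "completed", "warm-completed"]


def select(choice: str) -> list:
    """Return every label that the dropdown choice would match."""
    return [label for label in LABELS if matches_filter(label, choice)]
```

With a substring filter, an index labeled "warm-completed" would be listed under both the warm choice and a completed choice, while an index labeled just "completed" never shows up under warm.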

@dakrone
Member

dakrone commented Sep 5, 2019

> In my case (where I have no cold phase) does complete mean that the index is warm because it is available for searching and on a warm node and also complete because I have no further phases to go through?

That's right, ILM is single pass execution, and once a policy has been completely executed it moves to the complete phase, so in this case, it executed the actions in the warm phase and then moved to complete because there is nothing left to do in warm.

> If this is the thinking, then I would like it to show up in the ILM UI when I set the filter to show me warm indices:

This is something that would probably have to be addressed on the UI side, perhaps @bmcconaghy who worked on the original UI could direct you to where to file an appropriate issue for it?

@bmcconaghy

Hmm this seems like a deficiency in the ES API. Not sure how the UI could figure out that something is still in warm phase if the phase is "complete". We filter based on phase (just a dumb string filter).

@dakrone
Member

dakrone commented Sep 5, 2019

I think the disconnect is thinking that an index resides in a phase permanently. From ILM's point of view, an index is only in the "warm" phase whilst operations are being executed in that phase.

To clarify this, we could potentially add a "last_phase" output to the ILM explain output for an index so the UI could use phase = (phase == "completed") ? last_phase : phase as the filtering criteria?
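
A minimal Python sketch of that filtering rule, for illustration only. It assumes the explain output for an index is available as a dict; `last_phase` is the field proposed in this comment and does not exist in the explain API, and `display_phase` is a made-up helper name.

```python
def display_phase(explain: dict) -> str:
    """Pick the phase the UI would filter on for one index.

    `last_phase` is the hypothetical field proposed above; if it is
    absent, fall back to whatever `phase` reports.
    """
    phase = explain.get("phase")
    if phase == "completed":
        return explain.get("last_phase", phase)
    return phase
```

Under this rule an index reporting `{"phase": "completed", "last_phase": "warm"}` would be listed under warm, and indices that are still executing a phase are unaffected.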

@bmcconaghy

To be clear, there isn't really any "logic" to the filtering, it's a dumb UI component. I'm sure something could be coded to make this filtering work for the user, but the data needs to be in the return from ES. An issue should be filed in the Kibana repo linked to this one for the corresponding Kibana work.

@dakrone
Member

dakrone commented Sep 5, 2019

Another thing to discuss: is this something we even want? Perhaps a completed option should be added to the UI to show indices in the completed phase? Is there a benefit from trying to keep an index "in a phase" it's not currently in for the UI?

@bmcconaghy

To me, "completed" is not a phase. An index stays in the last defined phase, so in the case described in this issue, it remains in "warm". This seems like what the person who filed this issue thinks makes the most sense, and I agree. "Completed" is about the lifecycle as a whole, not a phase I think.

@bmcconaghy

I would defer to Elasticsearch UI on this one, though, as I don't work on this any longer. @cjcenizal any thoughts on this one?

@gwbrown
Contributor

gwbrown commented Sep 5, 2019

My concern with showing complete indices as the last phase they were in before completion in the UI is that it might lead to users thinking that, e.g. "Since this policy only had hot and warm phases, I can add a cold to it and all my indices will immediately go into the new cold phase!" (which will not work, at least at the moment). It's already confusing for some users that editing a phase in a policy won't retroactively apply to indices which have already been through the changed phase, and I don't want to add on to that.

@DanRoscigno
Contributor Author

@dakrone : Do we need completed at all? Does this phase prevent some actions from considering those indices for action? If there is a check for phase == completed, maybe the checks could look at current action instead? Notice that both phase and current action here are completed:

image

@dakrone
Member

dakrone commented Sep 5, 2019

@DanRoscigno as Gordon mentioned, the completed phase is used to indicate that all phases have been executed. If we kept the index in a warm phase on a completed step, a user would assume that adding a cold phase to the policy would cause the index to execute it next, since cold is executed after the warm phase. By having a dedicated completed phase it's clear that no more actions are going to be taken on the index.

We already have confusion with ILM's single pass execution, removing the completed phase I think would make it even more confusing to users.

@DanRoscigno
Contributor Author

I hear you, but I think it is way more confusing to have a dropdown filter that does not work. If the choices are hot, warm, cold, delete then I expect to be able to see the indices that are in hot, warm, or cold phases (I don't know why delete is in there, if the indices are deleted, then they should not show up in the manage indices UI, right?). I think that users should expect changes to an ILM policy to only be in effect for new indices. A short "these changes will be in effect for new indices" message near the save button could be added to remind the user.

@dakrone
Member

dakrone commented Sep 5, 2019

> I hear you, but I think it is way more confusing to have a dropdown filter that does not work. If the choices are hot, warm, cold, delete then I expect to be able to see the indices that are in hot, warm, or cold phases (I don't know why delete is in there, if the indices are deleted, then they should not show up in the manage indices UI, right?).

I do think the dropdown is a little strange, I think having hot, warm, cold, and completed as the options seems a lot better to me. Having delete as a phase filter makes (almost) no sense, and having completed as an option makes it explicit that no other actions are going to be taken on the shown indices.

@gwbrown
Contributor

gwbrown commented Sep 5, 2019

The delete phase is in the UI because it's possible for an index to be waiting in the delete phase if a snapshot is in progress that's preventing the index from being deleted. We've also discussed adding other actions in the delete phase which would cause the index to be in the delete phase for a nontrivial amount of time (e.g. wait for SLM to take a snapshot before deleting this index).

@dakrone
Member

dakrone commented Sep 5, 2019

Ahh yeah, good point Gordon, I forgot about the snapshots blocking deletion so yeah, we should keep delete. I do think we should add completed to the dropdown though.

@gwbrown
Contributor

gwbrown commented Sep 5, 2019

> I think that users should expect changes to an ILM policy to only be in effect for new indices. A short "these changes will be in effect for new indices" message near the save button could be added to remind the user.

This doesn't give the user the right understanding either, because changes to policies don't just affect new indices. If a policy is changed, indices which enter new phases after that will use the new version. For example, if an index is in the warm phase waiting for a shrink to happen, and that policy's cold phase is changed, when the index finishes the shrink and moves to cold it will use the new cold phase.

This is, honestly, the source of a lot of confusion, probably second only to the required alias setup.
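
To make the shrink example concrete, here is a toy Python model, not the real ILM implementation. It assumes a phase definition is captured when an index enters that phase. `ManagedIndex`, `enter`, `active_definition`, and the "archive" attribute value are all invented for the sketch.

```python
import copy


class ManagedIndex:
    """Toy model of an index managed by a policy (illustrative only)."""

    def __init__(self, policy: dict, phase: str):
        self.policy = policy  # live policy; later edits are visible here
        self.enter(phase)

    def enter(self, phase: str) -> None:
        # Assumption: the phase definition is captured at the moment the
        # index enters the phase, so later edits to that phase are ignored.
        self.phase = phase
        self.active_definition = copy.deepcopy(self.policy[phase])


policy = {
    "warm": {"actions": {"shrink": {"number_of_shards": 1}}},
    "cold": {"actions": {"allocate": {"require": {"data": "cold"}}}},
}
index = ManagedIndex(policy, "warm")

# Edit the cold phase while the index is still busy in warm.
policy["cold"]["actions"]["allocate"]["require"]["data"] = "archive"

# When the index later moves to cold, it picks up the edited definition.
index.enter("cold")
```

In this model an edit to a phase the index has not reached yet takes effect, while an edit to the phase it is currently executing does not.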

@DanRoscigno
Contributor Author

I can't imagine clicking on warm in the dropdown and seeing no entries and thinking "ok, everything is working fine". If the indices are warm, then they should show up in warm, not completed. Completed is not a phase of ILM. The phases of ILM are hot, warm, cold, and delete.

@DanRoscigno
Contributor Author

regarding the alias, hopefully most users are using Beats. If so, then they have an easy time. The policy and alias are created for them, they never see any JSON at all (unless they are self managed, then they have to set the node attributes for hot, warm, cold)

@dakrone
Member

dakrone commented Sep 5, 2019

> If the indices are warm, then they should show up in warm, not completed.

This I think gets to the crux of the problem: we have two different conceptions of "warm". From a user's perspective an index is "warm" during a range of time (which may be perpetual if there is no cold or delete phase). From ILM's execution perspective, however, an index is only in the warm phase while executing actions in that phase; it then moves from that phase to the completed phase to indicate where the index is in the giant state diagram of the ILM execution model.

@gwbrown
Contributor

gwbrown commented Sep 5, 2019

I mean, it might be possible for us to just park an index in warm/complete/complete if there's no cold or delete phases in the policy, but that would be a pretty significant breaking change. It might be more intuitive, though.

> regarding the alias, hopefully most users are using Beats. If so, then they have an easy time.

That's what I expected too, and have been proven very painfully wrong by experience. But that's a different issue than the one under discussion here.

@dakrone
Member

dakrone commented Sep 5, 2019

> I mean, it might be possible for us to just park an index in warm/complete/complete if there's no cold or delete phases in the policy,

We already do this, we have a hot/completed/completed, warm/completed/completed, cold/completed/completed, delete/completed/completed step:

// add `after` step for phase before next
if (phase != null) {
    // after step should have the name of the previous phase since the index is still in the
    // previous phase until the after condition is reached
    Step.StepKey afterStepKey = new Step.StepKey(previousPhase.getName(), PhaseCompleteStep.NAME, PhaseCompleteStep.NAME);
    Step phaseAfterStep = new PhaseCompleteStep(afterStepKey, lastStepKey);
    steps.add(phaseAfterStep);
    lastStepKey = phaseAfterStep.getKey();
}

The thing is that we need to progress to the next phase when reaching this step assuming that the min_age predicate is met. This goes back to the discussion about confusion regarding a phase added after a warm phase if an index were not parked in the completed phase.

@gwbrown
Contributor

gwbrown commented Sep 5, 2019

Yes, I meant rework the design so that there's no complete phase - just wait in e.g. warm/complete/complete forever if there's no cold or delete phases, and then actually go on to a cold/delete phase if one is added later.

As I said, this would be a very significant breaking change and well behind a lot of other changes in terms of priority, and I'm not sure the current setup is unintuitive enough to justify the effort it would require.

@DanRoscigno
Contributor Author

The UI is extremely unintuitive. As someone with 25 years ops experience, when I clicked on warm and found no indices I figured I screwed something up and wasted hours trying to convince myself that I had broken something. I think that if we provide a UI to our users and in the UI there is a filter, then the filter should match something. Since it is a breaking change to change the way we label the phases, and the UI does not work as is, then can we just remove that dropdown completely? Without the dropdown nobody will be confused.

@cjcenizal
Contributor

@DanRoscigno Thanks for your feedback! The ES UI team owns this UI. I've been out for a couple days so apologies for not chiming in sooner. We'll review this discussion and create a plan for resolving the UX problems you've pointed out.

@DanRoscigno
Contributor Author

Thanks all!

@cjcenizal
Contributor

cjcenizal commented Sep 12, 2019

I'm caught up on this discussion now, and I'm inclined to agree with @DanRoscigno.

As Lee touched upon, users hear "phase" and think about the stages in the Index Lifecycle Policy form, and reasonably form a mental model of an index moving from phase to phase -- but always being in a phase. Engineers hear "phase" and think of the actions a lifecycle policy takes upon an index entering a particular phase, actions which are of course finite. These two concepts are out of alignment.

I think the long-term solution is to align the users' and engineers' perspectives. We need to let users view their indices in the UI in a way that aligns with their mental model. It seems like there's no quick solution here, so I'd like to keep this issue open or replace it with another terser one to track that goal.

In the short-term, I'll open an issue and corresponding PR that removes this filter from Index Management because I feel like it will cause confusion among the majority of our users.

@cjcenizal
Contributor

cjcenizal commented Sep 12, 2019

Issue: elastic/kibana#45484
PR: elastic/kibana#45486

@slimsheddy

slimsheddy commented Sep 17, 2019

Issue

Current phase depicts warm, but this doesn't mean that the index/shards are in the warm nodes.

The screenshot below shows that the index successfully executed the ILM policy; however, due to ILM misconfiguration/error/etc., the shards didn't get moved to the warm nodes. Upon checking _cat/shards/, I find the index is still on the hot nodes.

Proposal

Can the UI also show which node (with attribute) the index is in? (E.g,. hot, warm)

Screen Shot 2019-09-17 at 8 48 20 pm

@dakrone
Member

dakrone commented Sep 19, 2019

> Current phase depicts warm, but this doesn't mean that the index/shards are in the warm nodes.

The names of the phases do not automatically translate to shards moving to those nodes. In your policy you have no include or require allocation rules for the index to move its shards to the 'warm' nodes.
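
One way to check a policy for this is sketched below in Python. It is only a sketch and assumes the policy body has the same JSON shape as the one at the top of this issue. `allocation_rules` is a made-up helper name, and the `data` attribute is whatever custom node attribute your cluster uses.

```python
def allocation_rules(policy: dict, phase: str) -> dict:
    """Return the non-empty include/exclude/require rules that a phase's
    allocate action defines, or an empty dict if there are none."""
    allocate = (
        policy.get("policy", {})
        .get("phases", {})
        .get(phase, {})
        .get("actions", {})
        .get("allocate", {})
    )
    rules = {}
    for kind in ("include", "exclude", "require"):
        value = allocate.get(kind)
        if value:  # skip missing keys and empty objects such as "include": {}
            rules[kind] = value
    return rules
```

An empty result for the warm phase means nothing in the policy asks for shards to be moved, regardless of which phase the index reports.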

@gwbrown
Contributor

gwbrown commented Sep 19, 2019

As @dakrone says, the phase and the allocation are orthogonal - you could set up some fairly complex allocation with an ILM policy if necessary. There may be an argument for displaying allocation on that page, but if you want to request that feature @slimsheddy, please open a new issue in the Kibana repo rather than requesting it in this issue so that we can keep this discussion focused on a single topic.

@david-in-perth

david-in-perth commented Sep 27, 2019

Note that indices start in the "new" phase (which hasn't been mentioned in this discussion so far). So there are 6 phases, from the point of view of the 'explain' API method.

So to summarise what I have learned from experiments in ES 6.7, from reading this discussion, and reading the ES docs for index-lifecycle-management (note: I've used the phase/action/step notation as others have above):

  • When an index has an ILM policy assigned, it starts in new/completed/completed.
  • The phases can be considered to be ordered as follows: new, hot, warm, cold, delete, completed.
  • Whenever an index reaches <phase>/completed/completed (i.e. finishes executing all actions in the phase, if any), one of three things happens:
    • If the ILM policy doesn't specify any phases after <phase>, move to completed/completed/completed
    • Otherwise, for the next phase in the ILM policy after <phase>, if the index is older than min_age on that phase, move to that phase and start executing actions.
    • Otherwise, wait in <phase>/completed/completed until one of the two cases above is satisfied (either due to the passage of time, or to ILM policy changes).
  • Once an index starts executing actions on a phase, it will continue executing those actions until they are all done, as they were defined in the ILM policy when the index entered that phase. Some actions can block, waiting for a condition to be satisfied, for extended periods, and changes to the ILM policy cannot interrupt this.
  • Once all actions on the current phase (if any) are completed, the index moves to <phase>/completed/completed, and checks whether it should move to a later phase (see above).
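
The waiting rules above can be sketched as a small Python function. This is a simplified model of my summary, not ILM's actual code: `next_phase` and `PHASE_ORDER` are made-up names, ages are plain seconds, and treating an age exactly equal to min_age as "old enough" is an assumption.

```python
PHASE_ORDER = ["new", "hot", "warm", "cold", "delete"]


def next_phase(current: str, min_age_seconds: dict, index_age_seconds: float) -> str:
    """Where an index waiting at <current>/completed/completed goes next.

    min_age_seconds maps each phase defined in the policy to its min_age.
    """
    if current == "completed":
        return "completed"  # terminal: the index never leaves this phase
    position = PHASE_ORDER.index(current)
    later = [p for p in PHASE_ORDER[position + 1:] if p in min_age_seconds]
    if not later:
        return "completed"  # no later phase defined in the policy
    candidate = later[0]
    # Assumption: an age equal to min_age counts as satisfied.
    if index_age_seconds >= min_age_seconds[candidate]:
        return candidate
    return current  # keep waiting in the current phase


# The policy from the top of this issue: hot at 0ms, warm at 1h.
policy = {"hot": 0, "warm": 3600}
```

With this policy, an index that has finished its warm actions has no later phase to wait for, so it goes to completed. If a cold phase with a larger min_age were defined before the index got there, it would wait in warm instead.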

@david-in-perth

david-in-perth commented Sep 27, 2019

It took me about a day of reading/experimenting/reading/experimenting/etc. to figure all of this out. Key insights:

  • Once an index is in the completed phase, it will never move to another phase or execute any actions. I had started out by experimenting with an ILM policy with a "warm" phase containing no actions, and no later phases specified. Part of my problem is that I was trying to do experiments on indices that had already run through that ILM policy and moved to completed, so none of my experiments made any difference.
  • "phase" : "completed", (from the explain API call) really does mean the index is currently in the phase called "completed". Due to my confusion about my experiments not doing anything, and the docs not making it clear that "completed" is a phase, I was second-guessing my interpretation of the explain output. I started thinking it might mean "the current phase has been completed".
  • Indices waiting in <phase>/completed/completed are affected by ILM policy updates that change min_age in a later phase. I wasn't sure if the indices I was testing were 'currently executing' the phase that I was changing the min_age value for. This was a knock-on effect from the previous point. Plus I thought I'd read somewhere that updates to min_age don't always affect existing managed indices.

So yes, the behaviour of jumping to the completed phase when not waiting on a later min_age condition is confusing and unintuitive, as noted by others above. But also a lot of this information appears to be missing from or (seemingly) contradicted in the docs. For example: The single-pass behaviour (as described by @dakrone above), the completed phase (including the behaviour described in this ticket), and how waiting at <phase>/completed/completed works. Also the new phase is not mentioned in the index-lifecycle-management section of the docs. This mismatch between the docs and actual behaviour likely contributes to the confusion around this functionality.

@gwbrown
Contributor

gwbrown commented Sep 27, 2019

Thank you for the detailed feedback @david-in-perth! What you describe in your comments looks correct. When writing this documentation it's always a balance between including too much detail and not enough, and it can take a few iterations to get it right. I think we can certainly do a better job of documenting some of this, regardless of any code/product changes we make to make this more intuitive in the first place. I'll try to take a crack at it soon (although anyone else reading this is welcome to as well if you want to beat me to it!).

@dakrone
Member

dakrone commented Feb 11, 2020

This has been resolved by #51631

@dakrone dakrone closed this as completed Feb 11, 2020