[meta] Index Lifecycle Management Plan #29823

Closed
elasticmachine opened this issue Oct 30, 2017 · 7 comments
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management >feature Meta

Comments

@elasticmachine
Collaborator

elasticmachine commented Oct 30, 2017

Tasks

  • Write LifecycleAction for each type of action
    • delete
    • forcemerge
    • rollover
    • allocate
    • shrink
    • replica
    • snapshot (will implement snapshotting as a separate solution)
  • Create concept of a lifecycle type which will:
    • Constrain the available phase names
    • Set the order in which the phases are executed
  • Create concept of Phase types which will:
    • Set the actions that are available in each phase
    • Set the order in which the actions are executed within each phase
    • Remove shuffled fields exception for phases field in unit tests (IndexLifecycleMetadataTests, LifecyclePolicyTests, PutLifecycleRequestTests)
  • Create the first lifecycle type timeseries, which will allow the following phases (in order):
    • Hot - Actions:
      • rollover
    • Warm - Actions:
      • allocate
      • shrink
      • forcemerge
      • replicas
    • Cold - Actions:
      • allocate
      • replicas
    • Delete - Actions:
      • delete
  • Verify the master election re-initialization strategy. Once a master with an existing in-memory schedule is dropped, the new master needs to be able to re-initialize all the state and relaunch the tasks that are still to be launched. It helps that all times are relative to the index.creation.date
  • Add the ability to change the poll interval through cluster settings
  • Stop using IndexMetaData.getCreationDate and use a custom setting so that it can be inherited across shrink and other operations
  • Clean up logging
  • Allow the scheduled job to be added and removed while the node is still running when it is elected and un-elected as master.
  • Introduce index.lifecycle.phase_time and index.lifecycle.action_time to help track
  • Update the Shrink Action to properly support self-allocation to a specific node from specified attributes
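
For illustration only, the lifecycle type and phase type constraints in the task list above could be sketched roughly as follows. This is not the Elasticsearch implementation: `TIMESERIES_PHASES` and `ordered_plan` are hypothetical names, and the sketch assumes that the order in which phases and actions are listed above is also the intended execution order.

```python
# Hypothetical sketch; none of these names are Elasticsearch APIs.
# Phase order and per-phase action order are taken from the task list,
# assuming the listed order is the execution order.
TIMESERIES_PHASES = {
    "hot": ["rollover"],
    "warm": ["allocate", "shrink", "forcemerge", "replicas"],
    "cold": ["allocate", "replicas"],
    "delete": ["delete"],
}


def ordered_plan(policy_phases):
    """Validate a {phase: [action, ...]} mapping against the timeseries type.

    Returns a list of (phase, [actions]) tuples in execution order.
    """
    unknown = set(policy_phases) - set(TIMESERIES_PHASES)
    if unknown:
        raise ValueError(f"invalid phase(s) for timeseries: {sorted(unknown)}")

    plan = []
    # Iterate in the order fixed by the lifecycle type, not the user's order.
    for phase, allowed in TIMESERIES_PHASES.items():
        if phase not in policy_phases:
            continue
        requested = policy_phases[phase]
        bad = set(requested) - set(allowed)
        if bad:
            raise ValueError(f"invalid action(s) for phase {phase}: {sorted(bad)}")
        # Re-order the requested actions to the order fixed by the phase type.
        plan.append((phase, [a for a in allowed if a in requested]))
    return plan
```

Passing phases or actions in any order yields the same plan, which is the point of having the type (rather than the user) own the ordering.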

tracking Steps progress

  • PhaseAfterStep
  • InitializationPolicyContextStep
  • TerminalPolicyStep
  • AllocateAction
    • EnoughShardsWaitStep
    • UpdateAllocationSettingsStep
    • AllocationRoutedStep
  • DeleteAction
    • DeleteStep
  • ForceMergeAction
    • UpdateBestCompressionSettingsStep
    • ForceMergeStep (upgrade?)
    • SegmentCountStep
  • ReadOnlyAction
    • ReadOnlyStep
  • ReplicasAction
    • UpdateReplicaSettingsStep
    • EnoughShardsWaitStep
  • RolloverAction
    • RolloverStep
  • ShrinkAction
    • ShrinkStep
    • ShrunkShardsAllocatedStep
    • AliasStep
    • ShrunkenIndexCheckStep

Remaining Tasks

Completed

Blockers to merging into master in priority order from most to least (items are marked in difficulty using *, **, ***)

Blockers to first release in priority order from most to least (items are marked in difficulty using *, **, ***)

Optional (but would be really good to have)

@elasticmachine
Collaborator Author

Original comment by @talevy:

Example Pipeline for reference (this comment may be updated as changes occur)

Lifecycle Policy

PUT /_xpack/index_lifecycle/my_lifecycle
{
   "policy": {
     "type": "timeseries",
     "phases": {
       "hot": {
         "after": "0s",
         "actions": {
           "rollover": {
             "alias": "logs-write",
             "max_age": "5s"
           }
         }
       },
       "warm": {
         "after": "10s",
         "actions": {
           "allocate": {
             "require": { "_name": "node-1" },
             "include": {},
             "exclude": {}
           },
           "shrink": {
             "number_of_shards": 1
           },
           "forcemerge": {
             "max_num_segments": 1000
           }
         }
       },
       "cold": {
         "after": "20s",
         "actions": {
           "replicas": {
             "number_of_replicas": 0
           }
         }
       },
       "delete": {
         "after": "30s",
         "actions": {
           "delete": {}
         }
       }
     }
   }
}
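
As a rough illustration of how the `after` values in this policy relate to index age, here is a minimal sketch. It assumes that each phase becomes eligible once the index is at least `after` old, measured from the index creation date, and it ignores the question of whether the previous phase's actions have finished. `parse_after` and `phase_for_age` are hypothetical helpers, not part of any Elasticsearch API, and only the `s`, `m`, `h` and `d` units are handled.

```python
import re

# Seconds per supported unit (assumption: only these units are needed here).
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

# Phase order for the timeseries type, as listed in the plan.
_PHASE_ORDER = ("hot", "warm", "cold", "delete")


def parse_after(value):
    """Convert a value such as "10s" into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def phase_for_age(phases, age_seconds):
    """Return the latest phase whose "after" threshold has passed, or None."""
    current = None
    for name in _PHASE_ORDER:
        if name in phases and parse_after(phases[name]["after"]) <= age_seconds:
            current = name
    return current


# The "after" values from the example policy above.
EXAMPLE_PHASES = {
    "hot": {"after": "0s"},
    "warm": {"after": "10s"},
    "cold": {"after": "20s"},
    "delete": {"after": "30s"},
}
```

With these values, an index that is 15 seconds old would be eligible for the warm phase, and one that is 30 seconds old would be eligible for deletion.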

template

PUT _template/my_template
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1,
    "index.lifecycle.name": "my_lifecycle"
  },
  "aliases": {
    "logs-read": {}
  },
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": false
      }
    }
  }
}

Create Index

PUT logs-000001
{
  "aliases": {
    "logs-write": {}
  }
}

@elasticmachine
Collaborator Author

Original comment by @ppf2:

I would like to suggest also setting "index.blocks.write":true against the warm indices once they have moved to the warm tier, to ensure that when force merge runs, it runs against an index with no further writes. This can even be done as the very first step of the warm processing, i.e., as a step right before we do the allocation filtering to move the index to warm.

@elasticmachine
Collaborator Author

Original comment by @colings86:

@ppf2 yeah, this is a good point and something we have already thought about (though it's not detailed here explicitly) in the context of the forcemerge and shrink actions specifically. We decided that we didn't want to automatically add a write block to the indices that persists for all time after the start of the warm phase, because users might find this surprising/annoying. Instead, the forcemerge and shrink actions enable the write block as their first step and then disable the write block when they are finished.

We could also potentially add an explicit write block action that you could enable in the warm/cold phase to keep the index write block enabled outside of these specific actions as well, but this isn't something that we currently have on the plan. If we decide to add this explicit action, we could have the UI support it and potentially set it by default on the policy UI, with the option for the user to disable it if they wish. Also, if we go down the route where beats/logstash define a default policy for their indexes, then I would expect those policies to enable this action too.
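
To make the "enable the write block first, disable it when finished" behaviour concrete, here is a minimal sketch. It is not the actual action code: `FakeIndexClient`, `write_block` and `forcemerge_action` are hypothetical stand-ins, and lifting the block in a `finally` clause even when the merge fails is an assumption rather than something stated in this thread.

```python
from contextlib import contextmanager


class FakeIndexClient:
    """In-memory stand-in for a real client (hypothetical, for illustration)."""

    def __init__(self):
        self.settings = {}
        self.calls = []

    def put_settings(self, index, settings):
        self.settings.setdefault(index, {}).update(settings)
        self.calls.append(("put_settings", index, dict(settings)))

    def forcemerge(self, index, max_num_segments):
        self.calls.append(("forcemerge", index, max_num_segments))


@contextmanager
def write_block(client, index):
    """Hold index.blocks.write=true only for the duration of the body."""
    client.put_settings(index, {"index.blocks.write": True})
    try:
        yield
    finally:
        # Assumption: the block is lifted even if the wrapped operation fails.
        client.put_settings(index, {"index.blocks.write": False})


def forcemerge_action(client, index, max_num_segments):
    """Force merge with the write block as first step and unblock as last."""
    with write_block(client, index):
        client.forcemerge(index, max_num_segments)
```

An explicit, persistent write block action as discussed above would simply skip the final unblock step.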

@elasticmachine
Collaborator Author

Original comment by @PhaedrusTheGreek:

@pickypg I bet it would be useful for the index_stats monitoring to capture the index.lifecycle.name here, so we can agg on it, e.g., Which index series have the worst search latency.

@elasticmachine
Collaborator Author

Original comment by @colings86:

Let's stick to implementing the feature first and then we can look into what we can/should add to monitoring. It's still early enough in the implementation that the design is in flux and things may change. Also note that in the current design the index.lifecycle.name is in the index settings, so I think it would already be exported in monitoring.

@elasticmachine
Collaborator Author

Pinging @elastic/es-core-infra

@colings86 colings86 added the :Data Management/ILM+SLM Index and Snapshot lifecycle management label Apr 25, 2018
dakrone added a commit to dakrone/elasticsearch that referenced this issue Jul 25, 2018
This option is only settable while the index is closed, and doesn't make sense
for a force merge.

Relates to elastic#29823
dakrone added a commit that referenced this issue Jul 25, 2018
This option is only settable while the index is closed, and doesn't make sense
for a force merge.

Relates to #29823
dakrone added a commit to dakrone/elasticsearch that referenced this issue Jul 25, 2018
These are only ever set internally during regular ILM execution, they don't need
to be set otherwise.

A subsequent PR will work on adding a dedicated endpoint for the
`LIFECYCLE_NAME` setting so it can be changed by a user (and then marked as
`InternalIndex` as well)

Relates to elastic#29823
dakrone added a commit to dakrone/elasticsearch that referenced this issue Jul 27, 2018
This adds HLRC support for the ILM operation of setting an index's lifecycle
policy.

It also includes extracting and renaming a number of classes (like the request
and response objects) as well as the addition of a new `IndexLifecycleClient`
for the HLRC. This is a prerequisite to making the `index.lifecycle.name`
setting internal only, because we require a dedicated REST endpoint to change
the policy, and our tests currently set this setting with the REST client in
multiple places. A subsequent PR will change the setting to be internal and move
those uses over to this new API.

This misses some links to the documentation because I don't think ILM has any
documentation available yet.

Relates to elastic#29827 and elastic#29823
dakrone added a commit that referenced this issue Jul 30, 2018
These are only ever set internally during regular ILM execution, they don't need
to be set otherwise.

A subsequent PR will work on adding a dedicated endpoint for the
`LIFECYCLE_NAME` setting so it can be changed by a user (and then marked as
`InternalIndex` as well)

Relates to #29823
dakrone added a commit that referenced this issue Jul 30, 2018
This adds HLRC support for the ILM operation of setting an index's lifecycle
policy.

It also includes extracting and renaming a number of classes (like the request
and response objects) as well as the addition of a new `IndexLifecycleClient`
for the HLRC. This is a prerequisite to making the `index.lifecycle.name`
setting internal only, because we require a dedicated REST endpoint to change
the policy, and our tests currently set this setting with the REST client in
multiple places. A subsequent PR will change the setting to be internal and move
those uses over to this new API.

This misses some links to the documentation because I don't think ILM has any
documentation available yet.

Relates to #29827 and #29823
dakrone added a commit to dakrone/elasticsearch that referenced this issue Jul 31, 2018
This commit makes the `index.lifecycle.name` setting an internal index setting,
which means that the policy can only be set on index creation, or with the
specialized `RestSetIndexLifecyclePolicy` action.

Relates to elastic#29823
@talevy talevy self-assigned this Jul 31, 2018
dakrone added a commit to dakrone/elasticsearch that referenced this issue Aug 1, 2018
By making the `settings()` method public on `UpdateSettingsRequest` (I think it
should have been in the first place) we can get rid of this class entirely. Mock
response objects are now constructed by parsing JSON without making the
constructor public.

Relates to elastic#29823
jasontedor pushed a commit that referenced this issue Aug 17, 2018
This commit makes the `index.lifecycle.name` setting an internal index setting,
which means that the policy can only be set on index creation, or with the
specialized `RestSetIndexLifecyclePolicy` action.

Relates to #29823
jasontedor pushed a commit that referenced this issue Aug 17, 2018
This commit removes the hacks associated with mocking Response objects. Rather
than parse a wrapped byte array, the constructors for `IndicesAliasesResponse`
and `ResizeResponse` are made public

Relates to #29823
jasontedor pushed a commit that referenced this issue Aug 17, 2018
* Remove RolloverIndexTestHelper

This removes the `RolloverIndexTestHelper` class in favor of making a couple of
getters publicly accessible as well as custom building a response object using
JSON parsing.

Relates to #29823
jasontedor pushed a commit that referenced this issue Aug 17, 2018
* Remove UpdateSettingsTestHelper class

By making the `settings()` method public on `UpdateSettingsRequest` (I think it
should have been in the first place) we can get rid of this class entirely. Mock
response objects are now constructed by parsing JSON without making the
constructor public.

Relates to #29823
dakrone added a commit that referenced this issue Aug 18, 2018
* Store phase steps for index in PolicyStepsRegistry

This changes the way that steps are retrieved from `PolicyStepsRegistry` to
store the steps on a per-index basis (in memory for now, though that will change
in subsequent PRs). These steps are rebuilt as the index changes phases.

This also fixes a bug where an action with the same phase and name was not being
considered changed (and thus updated) in the compiled steps list. These are now
correctly considered as "upsert" diffs.

Relates to #29823
dakrone added a commit to dakrone/elasticsearch that referenced this issue Aug 21, 2018
Since we now store a pre-compiled list of steps for an index's phase in the
`PolicyStepsRegistry`, we no longer need to worry about updating policies as any
updates won't affect the current phase, and will only be picked up on phase
transitions.

This also removes the tests that test these methods

Relates to elastic#29823
dakrone added a commit that referenced this issue Aug 23, 2018
* Remove canSetPolicy, canUpdatePolicy and canRemovePolicy

Since we now store a pre-compiled list of steps for an index's phase in the
`PolicyStepsRegistry`, we no longer need to worry about updating policies as any
updates won't affect the current phase, and will only be picked up on phase
transitions.

This also removes the tests that test these methods

Relates to #29823
dakrone added a commit to dakrone/elasticsearch that referenced this issue Aug 24, 2018
This commit removes PhaseAfterStep and all the plumbing associated with it.
Instead, we rely on the LifecyclePolicyRunner to police itself for advancing
phases.

This also modifies the exposed settings related to the current phase. Instead
of returning the current phase/step/action as-is in the
`index.lifecycle.phase` (etc.) setting, these are now split into:

`index.lifecycle.current_phase|action|step` - the currently executing
phase/action/step which may or may not have completed
`index.lifecycle.next_phase|action|step` - the next phase/action/step to which
we will be proceeding

While I don't think these will cause much issue (especially since nothing is
being broken for users here), these changes were required to have the
`phase_time` correctly updated now that we don't have a "shim" step between
phases. Without these it would be confusing as the index would advance to have
an `index.lifecycle.phase` setting that was potentially one phase in the future.

Relates to elastic#29823
dakrone added a commit that referenced this issue Sep 5, 2018
This removes `PhaseAfterStep` in favor of a new `PhaseCompleteStep`. This step
is only a marker that the `LifecyclePolicyRunner` needs to halt until the time
indicated for entering the next phase.

This also fixes a bug where phase times were encapsulated into the policy
instead of dynamically adjusting to policy changes.

Supersedes #33140
Relates to #29823
dakrone added a commit that referenced this issue Sep 6, 2018
This removes `PhaseAfterStep` in favor of a new `PhaseCompleteStep`. This step
is only a marker that the `LifecyclePolicyRunner` needs to halt until the time
indicated for entering the next phase.

This also fixes a bug where phase times were encapsulated into the policy
instead of dynamically adjusting to policy changes.

Supersedes #33140
Relates to #29823
dakrone added a commit to dakrone/elasticsearch that referenced this issue Sep 17, 2018
This moves away from caching a list of steps for a current phase, instead
rebuilding the necessary step from the phase JSON stored in the index's
metadata.

Relates to elastic#29823
dakrone added a commit that referenced this issue Sep 18, 2018
This moves away from caching a list of steps for a current phase, instead
rebuilding the necessary step from the phase JSON stored in the index's
metadata.

Relates to #29823
dakrone added a commit that referenced this issue Sep 19, 2018
This moves away from caching a list of steps for a current phase, instead
rebuilding the necessary step from the phase JSON stored in the index's
metadata.

Relates to #29823
dakrone added a commit to dakrone/elasticsearch that referenced this issue Sep 27, 2018
This commit changes the way that step execution flows. Rather than have any step
run when the cluster state changes or the periodic scheduler fires, this now
runs the different types of steps at different times.

`AsyncWaitStep` is run periodically, i.e., every 10 minutes by default.
`ClusterStateActionStep` and `ClusterStateWaitStep` are run every time the
cluster state changes.
`AsyncActionStep` is now run only after the cluster state has been transitioned
into a new step. This prevents these non-idempotent steps from running at the
same time. In addition to being run when transitioned into, this is also run
when a node is newly elected master (only if set as the current step) so that
master failover does not fail to run the step.

This also changes the `RolloverStep` from an `AsyncActionStep` to an
`AsyncWaitStep` so that it can run periodically.

Relates to elastic#29823
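
The dispatch rules described in this commit message can be summarised with a small sketch. It is illustrative only: the `Trigger` enum, the `RUN_ON` table and `should_run` are hypothetical names, and the real step classes are Java types in the ILM code rather than strings.

```python
from enum import Enum, auto


class Trigger(Enum):
    """Events that may cause a step to run (hypothetical names)."""
    PERIODIC = auto()              # the periodic scheduler fired
    CLUSTER_STATE_CHANGE = auto()  # the cluster state changed
    STEP_TRANSITION = auto()       # the index was just moved into this step
    MASTER_ELECTED = auto()        # this node was newly elected master


# Which triggers run which kind of step, per the description above.
RUN_ON = {
    "AsyncWaitStep": {Trigger.PERIODIC},
    "ClusterStateActionStep": {Trigger.CLUSTER_STATE_CHANGE},
    "ClusterStateWaitStep": {Trigger.CLUSTER_STATE_CHANGE},
    # Non-idempotent: only on transition, or on master failover when it is
    # the current step.
    "AsyncActionStep": {Trigger.STEP_TRANSITION, Trigger.MASTER_ELECTED},
}


def should_run(step_type, trigger):
    """Return True if a step of the given type runs for the given trigger."""
    return trigger in RUN_ON[step_type]
```

Under this table, `RolloverStep`, now an `AsyncWaitStep`, would be picked up by the periodic trigger only.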
dakrone added a commit that referenced this issue Oct 3, 2018
This commit changes the way that step execution flows. Rather than have any step
run when the cluster state changes or the periodic scheduler fires, this now
runs the different types of steps at different times.

`AsyncWaitStep` is run periodically, i.e., every 10 minutes by default.
`ClusterStateActionStep` and `ClusterStateWaitStep` are run every time the
cluster state changes.
`AsyncActionStep` is now run only after the cluster state has been transitioned
into a new step. This prevents these non-idempotent steps from running at the
same time. In addition to being run when transitioned into, this is also run
when a node is newly elected master (only if set as the current step) so that
master failover does not fail to run the step.

This also changes the `RolloverStep` from an `AsyncActionStep` to an
`AsyncWaitStep` so that it can run periodically.

Relates to #29823
@colings86
Contributor

All blockers for the initial beta release are now merged, so I'm closing this out. We will track bugs and tasks for GA in separate issues with the :Core/Features/ILM label.
