CheckDeploymentWorkflow

Jörg Pernfuß edited this page Oct 23, 2016 · 1 revision

SOMA Check Deployment Workflow

User step, check configuration

An unsuspecting user runs somaadm checks create to send a check_configuration to SOMA. This check_configuration is the entity that users can interact with; all other derived objects are managed internally.

Internal step, job phase: check creation

The user request is saved as a pending job in the database and acknowledged to the user with 202/Accepted. It is then put into the appropriate job queue and processed asynchronously.
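
The accept-then-process pattern can be sketched as follows. This is a minimal illustration and not SOMA's code: the in-memory `jobs` dict stands in for the database, and the names `accept` and `process_next` are hypothetical.

```python
import queue
import uuid

jobs = {}                  # stand-in for the job table in the database
job_queue = queue.Queue()  # stand-in for the appropriate job queue


def accept(request):
    """Save the request as a pending job, queue it, and acknowledge it."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"request": request, "status": "pending"}
    job_queue.put(job_id)
    return 202, job_id  # 202/Accepted: nothing has been processed yet


def process_next():
    """Called later by a worker; the user is not waiting for this."""
    job_id = job_queue.get_nowait()
    jobs[job_id]["status"] = "processed"
    return job_id


status_code, job_id = accept({"action": "create check"})
status_before = jobs[job_id]["status"]
processed_id = process_next()
```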

Based on the specification in the check_configuration, a check is created on the selected object of the tree. If inheritance is enabled and the object has children, the check creation request is passed down and every child object creates a check as well.
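
The pass-down can be pictured as a simple recursion over the tree. The sketch below assumes a hypothetical `TreeObject` class and a `create_check` helper; neither name comes from SOMA.

```python
from dataclasses import dataclass, field


@dataclass
class TreeObject:
    """Hypothetical stand-in for any object in the tree."""
    name: str
    children: list = field(default_factory=list)
    checks: list = field(default_factory=list)


def create_check(obj, config_id, inheritance):
    # The selected object always creates the check.
    obj.checks.append(config_id)
    # With inheritance enabled, the request is passed down to every child.
    if inheritance:
        for child in obj.children:
            create_check(child, config_id, inheritance)


node_a = TreeObject("node-a")
node_b = TreeObject("node-b")
group = TreeObject("group", children=[node_a, node_b])

create_check(group, "cfg-1", inheritance=True)   # reaches group and both nodes
create_check(group, "cfg-2", inheritance=False)  # stays on the group
```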

Internal step, job phase: check instance creation

Every tree object evaluates all its checks and their constraints and creates the appropriate number of check_instances, which can be arbitrarily many. Together with each check_instance, a check_instance_configuration is created in state awaiting_computation. Check instance configurations are versioned and a check instance can have multiple configurations.

Internal step, job phase: check instance configuration computation

For every check_instance_configuration in state awaiting_computation the deployment details are assembled. The state transitions into computed.

Internal step, job phase: check instance configuration ordering

If a check_instance_configuration is the first configuration for the instance, it transitions into state awaiting_rollout and the check instance is updated with the id of its current configuration. The update_available flag is set.

If a previous configuration was found, that configuration is loaded and the deployment details of both are compared. If the new version is the same as the current one, the new configuration is discarded. This deep compare ignores:

  1. values that must be different, i.e. the check instance configuration id and the version number
  2. array element order, i.e. [a, b] == [b, a] is true
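
A comparison with these two properties could look like the sketch below. The key names in `IGNORED_KEYS` and the shape of the deployment details are assumptions made for illustration; they are not taken from SOMA.

```python
import json

# Assumed names for the fields that must differ between two versions.
IGNORED_KEYS = {"check_instance_config_id", "version"}


def normalize(value):
    """Return a canonical form in which list order does not matter."""
    if isinstance(value, dict):
        return {key: normalize(val) for key, val in value.items()}
    if isinstance(value, list):
        items = [normalize(val) for val in value]
        # Sort by a stable JSON rendering so that [a, b] equals [b, a].
        return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
    return value


def same_deployment(old, new):
    """Deep compare that skips the top-level keys that always differ."""
    def strip(details):
        return {k: v for k, v in details.items() if k not in IGNORED_KEYS}
    return normalize(strip(old)) == normalize(strip(new))


old = {"check_instance_config_id": "a", "version": 1,
       "thresholds": [{"level": "warn", "value": 5}, {"level": "crit", "value": 9}]}
new = {"check_instance_config_id": "b", "version": 2,
       "thresholds": [{"level": "crit", "value": 9}, {"level": "warn", "value": 5}]}
changed = dict(new, thresholds=[{"level": "warn", "value": 6}])
```

Here `same_deployment(old, new)` holds because only the id, the version and the element order differ, while `same_deployment(old, changed)` does not.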

If a difference between the two versions was found, the new configuration is moved into state blocked. The registered unblocking condition is the old configuration in state deprovisioned. The old configuration is transitioned into state awaiting_deprovision. The update_available flag is set.

This ordering step means that SOMA never sends out an update deployment. If there is a change, the destination monitoring system first receives an undeployment carrying the exact same deployment details that were used for the original deployment, and only afterwards the deployment of the new configuration. Due to this, the deployment/undeployment on the client side can be completely stateless. It should also be order independent and idempotent.
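
The three possible outcomes of the ordering step can be summarized in a short sketch. The `Configuration` class, the `order` function and the plain equality check standing in for the deep compare are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Configuration:
    id: str
    details: dict
    state: str = "computed"


def order(new, current, unblock_conditions):
    """Return 'rollout', 'discarded' or 'blocked' for a computed configuration.

    The caller would set update_available for 'rollout' and 'blocked'.
    """
    if current is None:
        new.state = "awaiting_rollout"       # first configuration of the instance
        return "rollout"
    if new.details == current.details:       # placeholder for the deep compare
        return "discarded"
    new.state = "blocked"
    # The new version waits until the old one has reached state deprovisioned.
    unblock_conditions[new.id] = (current.id, "deprovisioned")
    current.state = "awaiting_deprovision"
    return "blocked"


conditions = {}
v1 = Configuration("c1", {"threshold": 5})
first = order(v1, None, conditions)
v1.state = "active"                          # pretend the rollout succeeded
v2 = Configuration("c2", {"threshold": 7})
second = order(v2, v1, conditions)
v3 = Configuration("c3", {"threshold": 5})
third = order(v3, v1, conditions)
```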

This concludes the part of the workflow that is run as part of the add_check_to_${foo} user requested job.

Internal step, life cycle phase: ghost removal

This is the first step of the internal life cycle component, which activates every 20 seconds.

It performs three tasks:

  1. configurations in state awaiting_rollout flagged as deleted with active update_available flag are transitioned directly to awaiting_deletion since the destination monitoring system has not yet picked them up
  2. configurations flagged as deleted in state rollout_failed are transitioned directly to awaiting_deletion since there is nothing to deprovision
  3. configurations flagged as deleted in state deprovisioned are transitioned to awaiting_deletion
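
Expressed as a predicate, the three rules reduce to the sketch below. The function name `is_ghost` and its flat argument list are assumptions for illustration.

```python
def is_ghost(state, deleted, update_available):
    """True if the configuration can move directly to awaiting_deletion."""
    if not deleted:
        return False
    if state == "awaiting_rollout":
        # Only a ghost while the monitoring system has not picked it up yet.
        return update_available
    # Nothing to deprovision, or deprovisioning has already finished.
    return state in {"rollout_failed", "deprovisioned"}


def remove_ghosts(configurations):
    """Apply the three ghost removal rules to a list of dict records."""
    for cfg in configurations:
        if is_ghost(cfg["state"], cfg["deleted"], cfg["update_available"]):
            cfg["state"] = "awaiting_deletion"


configs = [
    {"state": "awaiting_rollout", "deleted": True, "update_available": True},
    {"state": "rollout_failed", "deleted": True, "update_available": False},
    {"state": "active", "deleted": True, "update_available": False},
    {"state": "deprovisioned", "deleted": False, "update_available": False},
]
remove_ghosts(configs)
```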

Internal step, life cycle phase: remove blocked deleted

During this next step, all configurations in state blocked that belong to a deleted check instance are transitioned directly to awaiting_deletion and their registered unblocking conditions are deleted.

Internal step, life cycle phase: unblock configurations

This step is only executed if there was no error during the previous remove blocked deleted step.

Every registered unblock condition is evaluated. If the condition is true, the condition is deleted and the configuration is transitioned to either awaiting_rollout or awaiting_deprovision. The update_available flag for the check instance is set.
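
One way to picture the evaluation is the sketch below. It assumes that a condition records which configuration it waits for, which state that configuration must reach, and which state the blocked configuration moves to afterwards; this record layout, in particular the `next_state` field, is a guess and not SOMA's schema.

```python
from dataclasses import dataclass


@dataclass
class UnblockCondition:
    blocked_id: str        # configuration waiting in state blocked
    blocking_id: str       # configuration it is waiting for
    unblocking_state: str  # state the blocking configuration must reach
    next_state: str        # assumed: awaiting_rollout or awaiting_deprovision


def unblock(conditions, states, instance_of, update_available):
    """Evaluate all conditions and return the ones that still apply."""
    remaining = []
    for cond in conditions:
        if states[cond.blocking_id] == cond.unblocking_state:
            states[cond.blocked_id] = cond.next_state
            update_available.add(instance_of[cond.blocked_id])
        else:
            remaining.append(cond)  # condition not met yet, keep it registered
    return remaining


states = {"c1": "deprovisioned", "c2": "blocked", "c3": "active", "c4": "blocked"}
instance_of = {"c2": "inst-1", "c4": "inst-2"}
conditions = [
    UnblockCondition("c2", "c1", "deprovisioned", "awaiting_rollout"),
    UnblockCondition("c4", "c3", "deprovisioned", "awaiting_rollout"),
]
flagged = set()
remaining = unblock(conditions, states, instance_of, flagged)
```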

Internal step, life cycle phase: active deletions

This step transitions all configurations flagged as deleted in state active to state awaiting_deprovision and sets the update_available flag on the instance.

Internal step, life cycle phase: poke

This step takes all check instances that have the update_available flag set and are provisioned on a monitoring system with a registered notification callback. For every such check instance, the monitoring system receives a poke on its callback. The update_available flag is cleared if the poke was successful.
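
The important detail is that the flag is cleared only after a successful poke, so a failed poke is retried on the next life cycle run. In the sketch below the transport is injected as `send_poke`, a hypothetical callable that returns True on success; the record layout and URLs are also made up.

```python
def poke_all(instances, callbacks, send_poke):
    """Poke every flagged instance whose monitoring system has a callback."""
    poked = []
    for inst in instances:
        if not inst["update_available"]:
            continue
        url = callbacks.get(inst["monitoring_system"])
        if url is None:
            continue  # no callback registered, this system has to poll
        if send_poke(url, inst["id"]):
            inst["update_available"] = False  # cleared only on success
            poked.append(inst["id"])
    return poked


instances = [
    {"id": "inst-1", "monitoring_system": "mon-a", "update_available": True},
    {"id": "inst-2", "monitoring_system": "mon-b", "update_available": True},
    {"id": "inst-3", "monitoring_system": "mon-c", "update_available": True},
]
callbacks = {"mon-a": "http://mon-a.example/poke",
             "mon-b": "http://mon-b.example/poke"}


def fake_send(url, instance_id):
    # Test double: only mon-a is reachable.
    return "mon-a" in url


poked = poke_all(instances, callbacks, fake_send)
```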

This step is the transition point where a check instance deployment leaves the SOMA application server.

External step, fetch deployment

Using the id received with the poke, the destination monitoring system fetches the deployment information from SOMA. This GET request has a side effect and transitions the workflow!

The following transitions can be triggered by this request:

  1. awaiting_rollout -> rollout_in_progress
  2. rollout_in_progress -> rollout_in_progress
  3. active -> active
  4. rollout_failed -> rollout_in_progress
  5. awaiting_deprovision -> deprovision_in_progress
  6. deprovision_in_progress -> deprovision_in_progress
  7. deprovision_failed -> deprovision_in_progress
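
The list above maps directly onto a lookup table. In this sketch, states that are not listed are returned unchanged, which is an assumption; the source does not say how a fetch in other states is handled.

```python
# Transitions triggered by fetching a deployment, as listed above.
FETCH_TRANSITIONS = {
    "awaiting_rollout": "rollout_in_progress",
    "rollout_in_progress": "rollout_in_progress",
    "active": "active",
    "rollout_failed": "rollout_in_progress",
    "awaiting_deprovision": "deprovision_in_progress",
    "deprovision_in_progress": "deprovision_in_progress",
    "deprovision_failed": "deprovision_in_progress",
}


def on_fetch(state):
    """Return the state after the GET request (unlisted states stay as-is)."""
    return FETCH_TRANSITIONS.get(state, state)


after_first_fetch = on_fetch("awaiting_rollout")
after_second_fetch = on_fetch(after_first_fetch)  # fetching again is harmless
after_retry = on_fetch("deprovision_failed")
```

Because the in-progress states map to themselves, a client can safely repeat the fetch, and a failed rollout or deprovision is retried simply by fetching again.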

External step, deployment result

The destination monitoring system must, after processing the deployment request, send feedback about the deployment result. This transitions the check instances as follows:

Feedback: success

  1. rollout_in_progress -> active
  2. deprovision_in_progress -> deprovisioned

Feedback: failed

  1. rollout_in_progress -> rollout_failed
  2. deprovision_in_progress -> deprovision_failed
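
The feedback handling is again a small table, keyed by current state and result. Raising an error for combinations that are not listed is an assumption of this sketch.

```python
# (current state, feedback) -> next state, as listed above.
RESULT_TRANSITIONS = {
    ("rollout_in_progress", "success"): "active",
    ("deprovision_in_progress", "success"): "deprovisioned",
    ("rollout_in_progress", "failed"): "rollout_failed",
    ("deprovision_in_progress", "failed"): "deprovision_failed",
}


def on_result(state, feedback):
    """Return the state after the monitoring system reported its result."""
    try:
        return RESULT_TRANSITIONS[(state, feedback)]
    except KeyError:
        # Assumed behaviour: feedback is only valid for in-progress states.
        raise ValueError(f"no transition for {state!r} with feedback {feedback!r}")


rolled_out = on_result("rollout_in_progress", "success")
removal_failed = on_result("deprovision_in_progress", "failed")
```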

Polling step, list deployments

Registering a callback address requires the monitoring system to implement a REST'ish service that SOMA can contact. Monitoring systems without a registered callback address can instead poll SOMA for updates.

This request returns all instance ids that have the update_available flag set and clears it. This means every deployment is part of this list exactly once.

With this list of IDs, the destination monitoring system can fetch the deployments the same way as if it had received pokes for them.
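
The exactly-once behaviour of the list can be demonstrated with an in-memory stand-in. `FakeSoma`, `list_deployments` and `poll_once` are hypothetical names used only for this sketch; a real client would issue HTTP requests instead.

```python
class FakeSoma:
    """In-memory stand-in for SOMA, only to illustrate the flag handling."""

    def __init__(self, flagged_ids):
        self.update_available = set(flagged_ids)

    def list_deployments(self):
        ids = sorted(self.update_available)
        self.update_available.clear()  # listing clears the flags
        return ids


def poll_once(soma, fetch):
    """Fetch every deployment announced by one polling request."""
    return [fetch(instance_id) for instance_id in soma.list_deployments()]


def fake_fetch(instance_id):
    return f"deployment for {instance_id}"


soma = FakeSoma({"inst-2", "inst-1"})
first = poll_once(soma, fake_fetch)
second = poll_once(soma, fake_fetch)  # nothing new since the first poll
```

A client that crashes between listing and fetching would lose those ids, which is the gap the list all deployments request closes.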

Polling step, list all deployments

This request returns all instance ids with configurations in one of the following states, regardless of the update_available flag. If the flag is active, it is cleared.

  1. awaiting_rollout
  2. rollout_in_progress
  3. awaiting_deprovision
  4. deprovision_in_progress

This request can be used to resynchronize pending requests.

A REST'ish configuration service can use it on startup, whether clean or after a crash, to fetch all pending deployments again. This allows these services to be fully stateless with regard to which deployments they have already fetched.
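
The selection logic behind this request can be sketched as a filter over the four pending states. For simplicity, the sketch models each instance with a single configuration state; the record layout is an assumption.

```python
PENDING_STATES = {
    "awaiting_rollout",
    "rollout_in_progress",
    "awaiting_deprovision",
    "deprovision_in_progress",
}


def list_all_deployments(instances):
    """Return ids of all instances with pending work, clearing their flags."""
    ids = []
    for inst in instances:
        if inst["state"] in PENDING_STATES:
            ids.append(inst["id"])
            inst["update_available"] = False  # cleared if it was set
    return ids


instances = [
    {"id": "inst-1", "state": "rollout_in_progress", "update_available": False},
    {"id": "inst-2", "state": "active", "update_available": False},
    {"id": "inst-3", "state": "awaiting_deprovision", "update_available": True},
]
pending = list_all_deployments(instances)
```

Here inst-1 is returned although its flag was already cleared by an earlier fetch, which is exactly what a restarted service needs in order to resume.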

User step, check deletion

Sometimes users wish to delete a check configuration via somaadm checks delete.

Internal step, job phase: check deletion

The check deletion job deletes the following objects from the in-memory tree:

  1. all checks for the configuration
  2. all check instances spawned by those checks

This results in the following objects being flagged as deleted in the database:

  1. the check configuration
  2. all checks derived from the check configuration
  3. all check instances derived from the checks

At this point the life cycle component will pick this up and deprovision all currently active configurations, ultimately moving them into state awaiting_deletion.

Internal step, cleanup pruning

At some point we may have to clean up the database of all the things either in state awaiting_deletion or simply flagged as deleted. At that point, we also need to decide how much deleted history to keep around and whether to simply delete or archive these old records.

That point has not yet come.
