Skip to content

Commit

Permalink
Update doc
Browse files Browse the repository at this point in the history
  • Loading branch information
theunrepentantgeek committed Dec 14, 2022
1 parent ea48814 commit 2fa5240
Showing 1 changed file with 12 additions and 18 deletions.
30 changes: 12 additions & 18 deletions docs/hugo/content/design/ADR-2022-11-Change-Detection.md
Expand Up @@ -6,57 +6,51 @@ title: '2022-11: Change Detection'

In ASO v2 up to at least the `beta.4` release, we reconcile each resource by doing a PUT, relying on the Azure Resource Manager (ARM) to do the goal state comparison and only update the resource if it has changed.

While this works, customers are already running up against ARM throttling with moderate numbers of resources Typically, ARM throttles PUT requests for a given subscription to just 1200 per hour. With a reconcile period of 15m, ASO users are hitting this limit with just 300 active resources.
While this works, customers are already running up against ARM throttling with moderate numbers of resources. Typically, ARM throttles PUT requests for a given endpoint connection to just 1200 per hour per subscription. With a reconcile period of 15m, ASO users are hitting this limit with just 300 active resources.

In the `beta.4` release, we're lifting the default to 1 hour, but this is a compromise. It means we will detect and correct resource drift more slowly, and it's still not going to be enough for larger deployments. The change only impacts on steady state as well; we'll still hit throttling during initial deployments and when errors occur.
In the `beta.4` release, we're lifting the default to 1 hour, but this is a compromise. It means we will detect and correct resource drift more slowly, and it's still not going to be enough to prevent throttling for larger deployments. The change only impacts on steady state as well; we'll still hit throttling during initial deployments and when errors occur.

It's been previously suggested (see #1491) that we should do our own goal state comparison and only do a PUT if the resource has changed. This would allow us to reconcile more frequently, as GET requests have a significantly higher throttle limit.
It's been previously suggested (see [#1491](https://github.com/Azure/azure-service-operator/issues/1491)) that we should do our own goal state comparison and only do a PUT if the resource has changed. This would allow us to reconcile more frequently, as GET requests have a significantly higher throttle limit.

While it sounds straightforward to compare a Status retrieved from Azure with the Spec held by ASO, there are a few complicating factors. (Credit to @matthchr for much of the wording in these descriptions, lifted from earlier discussions.)
While it sounds straightforward to compare a *Status* retrieved from Azure with the *Spec* held by ASO, there are a few complicating factors. (Credit to @matthchr for much of the wording in these descriptions, lifted from earlier discussions.)

### Readonly fields

Some fields are set by Azure and cannot be changed by the user. These fields *should* be marked as read-only in the Swagger, resulting in their omission from our generated Spec types. However, not all such fields are properly marked.

Potential mitigation: We will likely have to use our configuration file to explicitly omit these fields from the comparison.
Potential mitigation: We will likely have to use our configuration file to explicitly omit these fields from the comparison. We should also create PRs to correct the Swagger where we find fields not marked as read-only.

### Default values

Some fields are optional in the Swagger, but are required by Azure. In these cases, Azure will set a default value if the user does not provide one. In a Naive implementation, we'll see a difference between the Spec we have and the Status we retrieve from Azure. Complicating this, some resources may use a different default value depending on other fields.
Some fields are optional in the Swagger, but are required by Azure. In these cases, Azure will set a default value if the user does not provide one. In a naive implementation, we'll see these fields as differences between the *Spec* we have and the *Status* we retrieve from Azure. Complicating this, some resources may use a different default value depending on other fields. (We know this is the case for AKS, for example.)

For example, GitHub CoPilot suggests the `publicNetworkAccess` field on a Storage Account defaults to `Enabled` if the `networkRuleSet` field is not set, but defaults to `Disabled` if it is set.

*TODO: Verify this is true.*

While we can tell the different between a field that was never set by the user, and a field that we set to an explicit value (because all our Spec properties are pointers), this doesn't mean we can assume any explicit value is a default. For example, if a third party makes a change via the Azure portal, we want to detect this as a change and and revert it.
While we can tell the different in a Spec between a field that was never set by the user, and a field that we set to an explicit value (because all our Spec properties are pointers), this doesn't mean we can assume any explicit value in a Status is a default. For example, if a third party makes a change via the Azure portal, we want to detect this as a change and and revert it.

### Read-After-Write Consistency

Azure is not guaranteed to return the exact same resource that was PUT. For example, the `etag` field may change, or the `id` field may be returned with a different casing. Some resource providers also *normalize* field values, for example the CosmosDB API will return `West US` when the resource specified `westus`.

Potential mitigation: Generally speaking, Azure is expected to be "case-preserving, case-insensitive", so we should compare all string fields in a case insensitive manner. Region names and virtual machines SKUS are two known special cases where we may need to do some special case handling.
Potential mitigation: Generally speaking, Azure is expected to be "case-preserving, case-insensitive", so we should compare all string fields in a case insensitive manner. Region names and virtual machines SKUS are two known special cases where we may need to do some special case handling. For example, regions specified as `eastus` may be returned as `East US`.

### Array ordering

Some resources have arrays of sub-resources. These arrays may not be guaranteed to be returned in the same order as they were specified.

For example, GitHub CoPilot suggests the `networkAcls` field on a Storage Account has an array of `bypass` values. These are returned in a different order than they were specified.

*TODO: Verify this is true.*

Potential mitigation: Where the items in an array have a known identifier (easy for nested resources), use that identifier to match them up. Where the items in an array do not have a known identifier, compare them by index.
Potential mitigation: Where the items in an array have a known identifier (easy for nested resources), use that identifier to match them up. Where the items in an array do not have a known identifier, compare them by index. We may find need for

### Changes to test recordings

When a resource changes from PUT only to GET+PUT, we'll need to re-record the test results for that resource. To avoid having to re-record all the tests in one go, we probably want to have some way to migrate to the new approach in a controlled fashion.

Potential mitigation: We could have a configuration flag that allows us to switch between the two approaches for a specific resource. We'd make GET+PUT the default, explicitly configure all our existing resources as PUT-only, and migrate in a controlled fashion, prioritizing those resources where we know we have the most issues with the PUT-only approach.

### Violation of the DRY principle

The really unfortunate thing, and the reason we actually like just the approach of always PUT-ing (with the obvious caveat that it causes throttling if you do it a lot) is that all of the logic to determine if there has been a change (what fields have what defaults, whether the field is stable or not, etc) all of that exists in the Resource Provider PUT path.

### Risk of bad change detection

If we get the change detection wrong, we could end up in a situation where we're reconciling a resource every 15m, but we're not actually making any changes to it.
If we get the change detection wrong, we could end up in a situation where we're using PUT to reconcile a resource every 15m, but we're not actually making any changes to it.

This would be bad for a number of reasons, but the most obvious is that we'd still be hitting the ARM throttling limits for no reason.

Expand Down

0 comments on commit 2fa5240

Please sign in to comment.