
High Level design for using GET+PUT to reconcile #2600

Merged: 3 commits into main on Dec 14, 2022

Conversation

@theunrepentantgeek (Member) commented Nov 22, 2022

What this PR does / why we need it:

Outlines a broad-brush design for using GET+PUT to reconcile (as discussed in #1491) without hitting ARM throttling limits.


If applicable:

  • this PR contains documentation

@theunrepentantgeek theunrepentantgeek enabled auto-merge (squash) December 14, 2022 21:04
@theunrepentantgeek theunrepentantgeek merged commit 35f6b90 into main Dec 14, 2022
@theunrepentantgeek theunrepentantgeek deleted the doc/spec-diff-detection branch December 14, 2022 22:05
@matthchr (Member) left a comment

Didn't get a chance to review this while I was on vacation, so leaving comments on it now.


In ASO v2 up to at least the `beta.4` release, we reconcile each resource by doing a PUT, relying on the Azure Resource Manager (ARM) to do the goal state comparison and only update the resource if it has changed.

While this works, customers are already running up against ARM throttling with moderate numbers of resources. Typically, ARM throttles PUT requests for a given endpoint connection to just 1200 per hour per subscription. With a reconcile period of 15m, ASO users are hitting this limit with just 300 active resources.
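(To spell out the arithmetic behind that figure: a 15-minute reconcile period means 60 / 15 = 4 PUTs per resource per hour, so 1200 / 4 = 300 resources exhaust the limit.)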
Member

This might actually be 1200 per hour per subscription per connection, as per John Rusk, which would make getting around it a lot easier: just use a few more connections (we currently have a single global shared connection).

Worth noting and experimenting with. Should I file a bug to track that testing, or do you want to update this document?

Member

Confirmed this is 1200 per ARM instance; see the email I forwarded you. Takeaways from that mail ("ARM and CRP API limit increase") seem to be:

  1. Use multiple connections / a connection pool (I believe this can be configured in Go HTTP; see the `MaxIdleConnsPerHost` property on `http.Transport`).
  2. Set `ForceAttemptHTTP2` to false.

If `ForceAttemptHTTP2` is true (the default), requests using HTTP/2 will be multiplexed over a single TCP connection, which means we still hit a single ARM front end (FE). See this for details about disabling HTTP/2 (though we may not need to disable it, just not force it?).
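A minimal sketch of those two takeaways in Go; the `newARMClient` name and the pool size of 16 are illustrative assumptions, not ASO's actual code:

```go
package armclient

import (
	"crypto/tls"
	"net/http"
)

// newARMClient builds an HTTP client that can spread requests across
// multiple TCP connections instead of multiplexing them over one.
func newARMClient() *http.Client {
	transport := &http.Transport{
		// Takeaway 2: don't force HTTP/2. With HTTP/2 forced, requests
		// multiplex over one TCP connection and keep hitting a single ARM FE.
		ForceAttemptHTTP2: false,
		// Per the net/http docs, a non-nil empty TLSNextProto map disables
		// HTTP/2 entirely; drop this if merely not forcing HTTP/2 is enough.
		TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
		// Takeaway 1: keep a pool of idle connections per host so traffic
		// spreads across several connections (the size here is a guess).
		MaxIdleConnsPerHost: 16,
	}
	return &http.Client{Transport: transport}
}
```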


Member Author

I used the wording "for a given endpoint" to try and capture that the throttle isn't just per sub; I'll amend to try and clarify.

I've created #2672 to track use of multiple HTTP connections.


Some fields are set by Azure and cannot be changed by the user. These fields *should* be marked as read-only in the Swagger, resulting in their omission from our generated Spec types. However, not all such fields are properly marked.

Potential mitigation: We will likely have to use our configuration file to explicitly omit these fields from the comparison. We should also create PRs to correct the Swagger where we find fields not marked as read-only.
Member

Since these fields are read-only, we can also just remove them entirely from the Spec type. Technically that's a breaking change, but since in reality they cannot be set, it has no real user impact.

Member Author

Agreed.
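A minimal sketch of the config-driven mitigation, assuming a hypothetical hard-coded `readOnlyFields` list; ASO's real configuration mechanism would supply these names per resource type:

```go
package armcompare

// readOnlyFields is a hypothetical stand-in for configuration listing
// fields that are read-only in practice but not marked readOnly in Swagger.
var readOnlyFields = map[string]bool{
	"etag":              true,
	"provisioningState": true,
}

// StripReadOnly removes read-only fields before the goal/actual comparison,
// recursing into nested objects.
func StripReadOnly(obj map[string]any) {
	for key, value := range obj {
		if readOnlyFields[key] {
			delete(obj, key)
			continue
		}
		if nested, ok := value.(map[string]any); ok {
			StripReadOnly(nested)
		}
	}
}
```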


Azure is not guaranteed to return the exact same resource that was PUT. For example, the `etag` field may change, or the `id` field may be returned with a different casing. Some resource providers also *normalize* field values, for example the CosmosDB API will return `West US` when the resource specified `westus`.

Potential mitigation: Generally speaking, Azure is expected to be "case-preserving, case-insensitive", so we should compare all string fields in a case-insensitive manner. Region names and virtual machine SKUs are two known cases where we may need special handling. For example, regions specified as `eastus` may be returned as `East US`.
Member

I think that the case-preserving, case-insensitive behaviour applies only to ID fields, not all string fields. Both the etag and id examples you give should in theory be covered by the readonly handling above, as (AFAIK) the id and etag fields should be marked readonly.

The example about normalization is 100% correct. I don't think it's consistent between services either -- some normalize and some may not.

Another type of read-after-write consistency issue would be lists that have default values inserted: you write `["a"]` but get back `["default", "a"]`. There was a proposed API in AKS that would have done this, but we decided against it because of the read-after-write issues it would cause. Still, other services may have done something like that (or similar with a map).

Member Author

It looks very much as though we're going to need some levers and knobs in our configuration to let us control things with sufficient fidelity, and possibly extension points as well. This all starts to get very unsimple very quickly.

At the very least, our comparison is going to need to produce more than a simple changed/unchanged result: we're going to need to capture which fields changed (and how!) so that we can log enough information to troubleshoot.
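Something like this hypothetical shape could carry that information; none of these names are ASO APIs, and the case-insensitive comparison carries the caveat above about ID vs. non-ID fields:

```go
package armcompare

import "strings"

// FieldDiff records a single changed field: which field, and how it changed.
type FieldDiff struct {
	Path   string // e.g. "properties.location"
	Goal   string // what the Spec asked for
	Actual string // what the GET returned
}

// compareString appends a FieldDiff when goal and actual differ, comparing
// case-insensitively per the "case-preserving, case-insensitive" guidance.
func compareString(path, goal, actual string, diffs []FieldDiff) []FieldDiff {
	if !strings.EqualFold(goal, actual) {
		diffs = append(diffs, FieldDiff{Path: path, Goal: goal, Actual: actual})
	}
	return diffs
}
```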


### Array ordering

Some resources have arrays of sub-resources. These arrays may not be guaranteed to be returned in the same order as they were specified.
Member

Do you mean actual ARM subresources, or does this really apply for general array ordering? Maybe clarify?



Potential mitigation: Where the items in an array have a known identifier (easy for nested resources), use that identifier to match them up. Where the items in an array do not have a known identifier, compare them by index. We may find need for
Member

Last sentence doesn't seem complete
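A minimal sketch of the identifier-based matching proposed in the mitigation above, assuming items expose a `name` key (hypothetical; real resources may use a different identifier), with index-based comparison as the fallback:

```go
package armcompare

// compareArrays matches goal and actual items by identifier where one
// exists, falling back to index order where it doesn't.
func compareArrays(goal, actual []map[string]any) {
	goalByName := make(map[string]map[string]any, len(goal))
	for _, g := range goal {
		if name, ok := g["name"].(string); ok {
			goalByName[name] = g
		}
	}
	for i, a := range actual {
		if name, ok := a["name"].(string); ok {
			if g, found := goalByName[name]; found {
				compareObjects(g, a) // matched pair: order-independent compare
				continue
			}
		}
		if i < len(goal) {
			compareObjects(goal[i], a) // no identifier: compare by index
		}
	}
}

// compareObjects is a placeholder for the field-by-field comparison.
func compareObjects(goal, actual map[string]any) { /* ... */ }
```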


### Changes to test recordings

When a resource changes from PUT only to GET+PUT, we'll need to re-record the test results for that resource. To avoid having to re-record all the tests in one go, we probably want to have some way to migrate to the new approach in a controlled fashion.
Member

It's not that bad to re-record everything (I've done it 2-3x).
