
High Level design for using GET+PUT to reconcile #2600

Merged: 3 commits into main on Dec 14, 2022

Conversation

@theunrepentantgeek (Member) commented Nov 22, 2022

What this PR does / why we need it:

Outlines a broad-brush design for using GET+PUT to reconcile (as discussed in #1491) without hitting ARM throttling limits.


If applicable:

  • this PR contains documentation

@theunrepentantgeek theunrepentantgeek enabled auto-merge (squash) December 14, 2022 21:04
@theunrepentantgeek theunrepentantgeek merged commit 35f6b90 into main Dec 14, 2022
@theunrepentantgeek theunrepentantgeek deleted the doc/spec-diff-detection branch December 14, 2022 22:05
@matthchr (Member) left a comment

Didn't get a chance to review this while I was on vacation, so leaving comments on it now.


In ASO v2 up to at least the `beta.4` release, we reconcile each resource by doing a PUT, relying on the Azure Resource Manager (ARM) to do the goal state comparison and only update the resource if it has changed.

While this works, customers are already running up against ARM throttling with moderate numbers of resources. Typically, ARM throttles PUT requests for a given endpoint connection to just 1200 per hour per subscription. With a reconcile period of 15m, ASO users are hitting this limit with just 300 active resources.
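(To spell out the arithmetic behind that figure: a 15-minute reconcile period means 60 / 15 = 4 PUTs per resource per hour, so 1200 / 4 = 300 resources exhaust the limit.)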
Member

This might actually be 1200 per hour per subscription per connection, as per John Rusk, which would make getting around it a lot easier: just use a few more connections (we currently have a single global shared connection).

Worth noting and experimenting with. Should I file a bug to track that testing, or do you want to update this document?

Member

Confirmed this is 1200 per ARM instance; see the email I forwarded you. Takeaways from that mail ("ARM and CRP API limit increase") seem to be:

  1. Use multiple connections / a connection pool (I believe this can be configured in Go HTTP; see the `MaxIdleConnsPerHost` property on `http.Transport`).
  2. Set `ForceAttemptHTTP2` to false.

If `ForceAttemptHTTP2` is true (the default), requests using HTTP/2 will be multiplexed over a single TCP connection, which means we still hit a single ARM front end (FE). See this for details about disabling HTTP/2 (though we may not need to disable it, just not force it?).
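A minimal sketch of those two takeaways in Go; the `newARMClient` name and the pool size of 16 are illustrative assumptions, not ASO's actual code:

```go
package armclient

import (
	"crypto/tls"
	"net/http"
)

// newARMClient builds an HTTP client that can spread requests across
// multiple TCP connections instead of multiplexing them over one.
func newARMClient() *http.Client {
	transport := &http.Transport{
		// Takeaway 2: don't force HTTP/2. With HTTP/2 forced, requests
		// multiplex over one TCP connection and keep hitting a single ARM FE.
		ForceAttemptHTTP2: false,
		// Per the net/http docs, a non-nil empty TLSNextProto map disables
		// HTTP/2 entirely; drop this if merely not forcing HTTP/2 is enough.
		TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
		// Takeaway 1: keep a pool of idle connections per host so traffic
		// spreads across several connections (the size here is a guess).
		MaxIdleConnsPerHost: 16,
	}
	return &http.Client{Transport: transport}
}
```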


Member Author

I used the wording "for a given endpoint" to try and capture that the throttle isn't just per sub; I'll amend to try and clarify.

I've created #2672 to track use of multiple HTTP connections.


Some fields are set by Azure and cannot be changed by the user. These fields *should* be marked as read-only in the Swagger, resulting in their omission from our generated Spec types. However, not all such fields are properly marked.

Potential mitigation: We will likely have to use our configuration file to explicitly omit these fields from the comparison. We should also create PRs to correct the Swagger where we find fields not marked as read-only.
Member

Since these fields are read-only, we can also just remove them entirely from the Spec type. Technically that's a breaking change, but since in reality they cannot be set, it has no real user impact.

Member Author

Agreed.
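A minimal sketch of the config-driven mitigation, assuming a hypothetical hard-coded `readOnlyFields` list; ASO's real configuration mechanism would supply these names per resource type:

```go
package armcompare

// readOnlyFields is a hypothetical stand-in for configuration listing
// fields that are read-only in practice but not marked readOnly in Swagger.
var readOnlyFields = map[string]bool{
	"etag":              true,
	"provisioningState": true,
}

// StripReadOnly removes read-only fields before the goal/actual comparison,
// recursing into nested objects.
func StripReadOnly(obj map[string]any) {
	for key, value := range obj {
		if readOnlyFields[key] {
			delete(obj, key)
			continue
		}
		if nested, ok := value.(map[string]any); ok {
			StripReadOnly(nested)
		}
	}
}
```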


Azure is not guaranteed to return the exact same resource that was PUT. For example, the `etag` field may change, or the `id` field may be returned with a different casing. Some resource providers also *normalize* field values, for example the CosmosDB API will return `West US` when the resource specified `westus`.

Potential mitigation: Generally speaking, Azure is expected to be "case-preserving, case-insensitive", so we should compare all string fields in a case-insensitive manner. Region names and virtual machine SKUs are two known cases where we may need special handling. For example, regions specified as `eastus` may be returned as `East US`.
Member

I think that the case-preserving, case-insensitive behaviour applies only to ID fields, not all string fields. Both the etag and id examples you give should in theory be covered by the readonly handling above, as (AFAIK) the id and etag fields should be marked readonly.

The example about normalization is 100% correct. I don't think it's consistent between services either -- some normalize and some may not.

Another type of read-after-write consistency issue would be lists that have default values inserted: you write `["a"]` but get back `["default", "a"]`. There was a proposed API in AKS that would have done this, but we decided against it because of the read-after-write issues it would cause. Still, other services may have done something like that (or similar with a map).

Member Author

It looks very much as though we're going to need some levers and knobs in our configuration to let us control things with sufficient fidelity, and possibly extension points as well. This all starts to get very unsimple very quickly.

At the very least, our comparison is going to need to produce more than a simple changed/unchanged result: we're going to need to capture which fields changed (and how!) so that we can log enough information to troubleshoot.
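Something like this hypothetical shape could carry that information; none of these names are ASO APIs, and the case-insensitive comparison carries the caveat above about ID vs. non-ID fields:

```go
package armcompare

import "strings"

// FieldDiff records a single changed field: which field, and how it changed.
type FieldDiff struct {
	Path   string // e.g. "properties.location"
	Goal   string // what the Spec asked for
	Actual string // what the GET returned
}

// compareString appends a FieldDiff when goal and actual differ, comparing
// case-insensitively per the "case-preserving, case-insensitive" guidance.
func compareString(path, goal, actual string, diffs []FieldDiff) []FieldDiff {
	if !strings.EqualFold(goal, actual) {
		diffs = append(diffs, FieldDiff{Path: path, Goal: goal, Actual: actual})
	}
	return diffs
}
```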


### Array ordering

Some resources have arrays of sub-resources. These arrays may not be guaranteed to be returned in the same order as they were specified.
Member

Do you mean actual ARM subresources, or does this really apply for general array ordering? Maybe clarify?



Potential mitigation: Where the items in an array have a known identifier (easy for nested resources), use that identifier to match them up. Where the items in an array do not have a known identifier, compare them by index. We may find need for
Member

Last sentence doesn't seem complete
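A minimal sketch of the identifier-based matching proposed in the mitigation above, assuming items expose a `name` key (hypothetical; real resources may use a different identifier), with index-based comparison as the fallback:

```go
package armcompare

// compareArrays matches goal and actual items by identifier where one
// exists, falling back to index order where it doesn't.
func compareArrays(goal, actual []map[string]any) {
	goalByName := make(map[string]map[string]any, len(goal))
	for _, g := range goal {
		if name, ok := g["name"].(string); ok {
			goalByName[name] = g
		}
	}
	for i, a := range actual {
		if name, ok := a["name"].(string); ok {
			if g, found := goalByName[name]; found {
				compareObjects(g, a) // matched pair: order-independent compare
				continue
			}
		}
		if i < len(goal) {
			compareObjects(goal[i], a) // no identifier: compare by index
		}
	}
}

// compareObjects is a placeholder for the field-by-field comparison.
func compareObjects(goal, actual map[string]any) { /* ... */ }
```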


### Changes to test recordings

When a resource changes from PUT only to GET+PUT, we'll need to re-record the test results for that resource. To avoid having to re-record all the tests in one go, we probably want to have some way to migrate to the new approach in a controlled fashion.
Member

It's not that bad to re-record everything (I've done it 2-3x).
