New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
High Level design for using GET+PUT to reconcile #2600
Conversation
4e00c9d
to
2fa5240
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't get a chance to review this while I was on vacation, so leaving comments on it now.
|
||
In ASO v2 up to at least the `beta.4` release, we reconcile each resource by doing a PUT, relying on the Azure Resource Manager (ARM) to do the goal state comparison and only update the resource if it has changed. | ||
|
||
While this works, customers are already running up against ARM throttling with moderate numbers of resources. Typically, ARM throttles PUT requests for a given endpoint connection to just 1200 per hour per subscription. With a reconcile period of 15m, ASO users are hitting this limit with just 300 active resources. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might actually be 1200 per hour per subscription per connection as per John Rusk, which would make getting around it a lot easier by just using a few more connections (we have a single global shared connection currently).
Worth noting/experimenting with that some probably. Should I file a bug to track this testing or do you want to update this document?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed this is 1200 per ARM instance, see email I forwarded you. Takeaways from that mail ("ARM and CRP API limit increase") seem to be:
- Use multiple connections/connection pool (I believe can be configured in Go HTTP, see the MaxIdleConnsPerHost property.
- Set ForceAttemptHTTP2 to false
If ForceAttemptHTTP2 is true (the default), connections using HTTP2 will be multiplexed over a single TCP connection, which means still hitting a single ARM FE. See this for details about disabling HTTP2 (though we may not need to disable it, just not force it?).
More reading:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the wording "for a given endpoint" to try and capture that the throttle isn't just per sub; I'll amend to try and clarify.
I've created #2672 to track use of multiple HTTP connections.
|
||
Some fields are set by Azure and cannot be changed by the user. These fields *should* be marked as read-only in the Swagger, resulting in their omission from our generated Spec types. However, not all such fields are properly marked. | ||
|
||
Potential mitigation: We will likely have to use our configuration file to explicitly omit these fields from the comparison. We should also create PRs to correct the Swagger where we find fields not marked as read-only. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these fields are readonly we can also just remove them entirely from the spec type. Technically that's breaking but since in reality they cannot be set, it has no real user impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed.
|
||
Azure is not guaranteed to return the exact same resource that was PUT. For example, the `etag` field may change, or the `id` field may be returned with a different casing. Some resource providers also *normalize* field values, for example the CosmosDB API will return `West US` when the resource specified `westus`. | ||
|
||
Potential mitigation: Generally speaking, Azure is expected to be "case-preserving, case-insensitive", so we should compare all string fields in a case insensitive manner. Region names and virtual machines SKUS are two known special cases where we may need to do some special case handling. For example, regions specified as `eastus` may be returned as `East US`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the case-preserving, case-insensitive is only for ID fields, not all string fields. Both the etag
and id
examples you give should in theory be covered by by readonly
tag above, as (AFAIK) the id
and etag
fields should be marked readonly
.
The example about normalization is 100% correct. I don't think it's consistent between services either -- some normalize and some may not.
Another type of read-after-write consistency issue would be something like lists that have default values inserted:
So you write ["a"]
, but get back ["default", "a"]
. There was a proposed API in AKS that would've done this, but we decided against it because of the read-after-write issues it would cause. Still, other services somewhere may have done something like that (or similar with a map)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks seriously as though we're going to need some levers and knobs in our configuration to allow us to control things with sufficient fidelity - and possibly extension points as well. This all starts to get very unsimple very quickly.
At the very least, our comparison is going to need to more than a simple changed
/unchanged
result - we're going to need to capture which fields are changed (and how!) so that we can log sufficient information to troubleshoot.
|
||
### Array ordering | ||
|
||
Some resources have arrays of sub-resources. These arrays may not be guaranteed to be returned in the same order as they were specified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean actual ARM subresources, or does this really apply for general array ordering? Maybe clarify?
|
||
Some resources have arrays of sub-resources. These arrays may not be guaranteed to be returned in the same order as they were specified. | ||
|
||
Potential mitigation: Where the items in an array have a known identifier (easy for nested resources), use that identifier to match them up. Where the items in an array do not have a known identifier, compare them by index. We may find need for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last sentence doesn't seem complete
|
||
### Changes to test recordings | ||
|
||
When a resource changes from PUT only to GET+PUT, we'll need to re-record the test results for that resource. To avoid having to re-record all the tests in one go, we probably want to have some way to migrate to the new approach in a controlled fashion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not that bad to re-record everything (I've done it 2-3x).
What this PR does / why we need it:
Outlines a broad-brush stroke design for doing GET+PUT to do reconciliation (as discussed in #1491) without hitting ARM throttling limits.
How does this PR make you feel:
If applicable: