Separate max file size and decoding size handling in oss re client #639
base: main
Conversation
Ensure that the latter is at least 2x bigger than the former so that we can account for additional fields and bits in the response.
…x_decoding_message_size
Thanks! A couple comments, primarily the result of missing context on my side
let mut exec_enabled = true;

- if let Some(cache_cap) = resp.cache_capabilities {
+ let max_msg_size_from_capabilities: Option<usize> = if let Some(cache_cap) = resp.cache_capabilities {
Can we rename `max_msg_size` here, but also in the `RECapabilities` type, the buckconfig, etc., to match or approximately match the `max_batch_total_size_bytes` name that the protocol uses? I feel like that would make this all a bit less confusing to read.
The two are different things actually, which is probably the same type of confusion that created the original bug in the first place :) `max_batch_total_size_bytes` in the protocol does not refer to the max size of the gRPC message (and should be strictly lower than that amount), but rather to the maximum number of bytes in the content of BatchFileRead responses. Maybe the difference should be made even clearer?
> The two are different things actually, which is probably the same type of confusion that created the original bug in the first place

Right, I understand that; that's why I commented. Based on my reading, the value here still has the semantics of the `max_batch_total_size_bytes` value, and it's not until L301 that we use it to derive the `max_msg_size` value.
Yeah, it has the same (intended) semantics, and `max_msg_size` refers to the gRPC limit. In L301 this meaning doesn't change; I am just ensuring that if no explicit value is provided then `max_msg_size` is definitely bigger than `max_batch_total_size_bytes`, or else any batch message of size close to or equal to `max_batch_total_size_bytes` will always fail to be deserialised. Does that make sense?
Right, but the way that I'm reading this code there are two different values:
- The max decoding message size, a gRPC property
- The max total batch size, the sum of the sizes of the files in a single batch upload request

From my reading, the value returned by the `cache_capabilities`, the value on L355, and the value on L367 all fall into category 2. The value on L282 falls into category 1. L301 then relates the two categories, setting the max decoding message size to twice the max total batch size (unless another value is specified). Is that all correct?

If so, what I'm asking for is that things in category 2 are all named something like `max_total_batch_size`, and things in category 1 are all named something like `max_msg_size` or `max_decoding_msg_size`.

I don't insist too much though, so I would be willing to merge this as long as you understand what I'm asking.
ahhhh ok, that makes way more sense now! Thanks :)
Hm, maybe we're still miscommunicating, but I would've expected the rename to also include L355, L341, and more importantly, the config property. I won't block on this anymore though, just mentioning it in case you want to change that with the other small nit
ah no, I think this was just simply a case of "oooops, missed it", my bad :(
anyhow, I fixed this one, thanks for catching it!
    section: BUCK2_RE_CLIENT_CFG_SECTION,
    property: "max_decoding_message_size",
})?,
max_total_file_size: legacy_config.parse(BuckconfigKeyRef {
I guess I don't mind doing this, but can you give an example of why a user might want to configure this away from the default? (Also for the `max_decoding_message_size`, I guess.)
There are some advantages to batching more or less, and some of them depend on the implementation of the RE protocol. An implementation that does not delegate uploads to a distributed system may see hotspotting if you try to upload too many files in parallel, so such a limit helps control that issue (though it would also be useful to add a `max_files_per_batch` limit for the same case). On the other hand, if this isn't the case, it is generally advantageous to use as big a batch as you can in order to reduce the back and forth of every method call. `max_decoding_message_size` follows that very same logic, with the added constraint that a large portion of the message might come from the digests, which we do not account for (this is especially bad for builds that generate lots and lots of small files). So if you can change `max_files_per_batch`, then you should also be able to increase the message size accordingly. I don't think either flag will be changed often, but in case this is needed it is rather handy to be able to. The default behaviour is good enough, but we shouldn't shy away from allowing tweaking.
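For illustration, such a tweak might look roughly like this in buckconfig. Only the `max_decoding_message_size` property name appears in this PR's diff; the section name (inferred from `BUCK2_RE_CLIENT_CFG_SECTION`) and the value are assumptions:

```ini
# Hypothetical buckconfig tweak; section name and value are assumptions.
[buck2_re_client]
max_decoding_message_size = 16000000
```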
let max_decoding_msg_size = opts.max_decoding_message_size
    .unwrap_or(capabilities.max_msg_size * 2);
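A minimal sketch of this fallback logic, with simplified names (the function is hypothetical, not the PR's code): when no explicit decoding size is configured, default to twice the batch size so that a full batch plus its digests and framing still decodes.

```rust
// Hypothetical standalone version of the fallback discussed here.
fn effective_decoding_size(configured: Option<usize>, max_batch_total_size: usize) -> usize {
    // Fall back to 2x the batch size when the user did not set a value.
    configured.unwrap_or(max_batch_total_size * 2)
}

fn main() {
    assert_eq!(effective_decoding_size(None, 4_000_000), 8_000_000);
    assert_eq!(effective_decoding_size(Some(16_000_000), 4_000_000), 16_000_000);
}
```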
It seems like this would allow the user to configure a value that is actually too low and thus will lead to errors. In the case where that happens, should we either be taking the max or maybe throwing an error?
We could assert a minimum of let's say 1MB? There is a check though to ensure that the max grpc decoding size is strictly higher than the max batch size.
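The combination of a floor and the existing strictness check could look something like this (a hypothetical sketch, not code from the PR; the 1 MB figure is the one floated above):

```rust
// Hypothetical guard combining the two checks discussed: a 1 MB floor and
// the requirement that the decode limit be strictly greater than the batch size.
const MIN_DECODING_MSG_SIZE: usize = 1_000_000; // assumed floor

fn check_limits(decoding_msg_size: usize, max_total_batch_size: usize) -> Result<(), String> {
    if decoding_msg_size < MIN_DECODING_MSG_SIZE {
        return Err(format!(
            "max decoding size {} is below the {} byte minimum",
            decoding_msg_size, MIN_DECODING_MSG_SIZE
        ));
    }
    if decoding_msg_size <= max_total_batch_size {
        return Err("max decoding size must exceed the max batch size".to_string());
    }
    Ok(())
}

fn main() {
    assert!(check_limits(8_000_000, 4_000_000).is_ok());
    assert!(check_limits(500_000, 400_000).is_err()); // below the floor
    assert!(check_limits(4_000_000, 4_000_000).is_err()); // not strictly greater
}
```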
Oh, sorry, my eyes somehow glazed over that check, nevermind me
execution_client: ExecutionClient::with_interceptor(
    execution.context("Error creating Execution client")?,
    interceptor.dupe(),
).max_decoding_message_size(max_decoding_msg_size),
action_cache_client: ActionCacheClient::with_interceptor(
    action_cache.context("Error creating ActionCache client")?,
    interceptor.dupe(),
).max_decoding_message_size(max_decoding_msg_size),
bytestream_client: ByteStreamClient::with_interceptor(
    bytestream.context("Error creating Bytestream client")?,
    interceptor.dupe(),
).max_decoding_message_size(max_decoding_msg_size),
It's not obvious to me that it's correct to apply a value that was possibly computed specifically for the CAS client to all the other clients as well. Is there some justification for that?

I may also just not know enough about protobuf here. What's the consequence of not setting this value at all on the client?
It wouldn't be a bad idea to have a separate value for each gRPC client; I was just trying to keep things simple, but I wouldn't be against making this change. I am not sure users would normally tweak these values, and aside from the ByteStream and CAS methods, all the other methods in the RE protocol tend to have generally small payloads, so this limit never really kicks in. That's why I think it's generally safe to keep the same limit around for all of them and reduce the cognitive burden of the configuration. Wdyt?
I don't know that I'm asking for different configuration for each client. I guess my suggestion is to just apply this value to the CAS client only and leave the rest as they are. Especially if we think it's not necessary for those clients, that seems like the simplest model to me, and it prevents anyone from hitting issues where other clients change behavior as a result of modifying what looks like a CAS config.
Then how about we keep this value just for Bytestream and CAS clients? Wdyt?
That sounds good
… size or batch size Ensure that only the CAS and BS clients have non-default max decoding sizes
Just one nit
@@ -81,7 +81,7 @@ use crate::response::*;
 // in the Capabilities message query.
 const CONCURRENT_UPLOAD_LIMIT: usize = 64;

-const DEFAULT_MAX_MSG_SIZE: usize = 4 * 1000 * 1000;
+const DEFAULT_MAX__TOTAL_BATCH_SIZE: usize = 4 * 1000 * 1000;
-const DEFAULT_MAX__TOTAL_BATCH_SIZE: usize = 4 * 1000 * 1000;
+const DEFAULT_MAX_TOTAL_BATCH_SIZE: usize = 4 * 1000 * 1000;
@JakobDegen has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Before this change it was possible to set a total file size for all of the RE's IO APIs that exceeded tonic's default max decoding size of 4MB. As a matter of fact, because the default limit is 4MB for both, this happened by default with files of cumulative size close to or equal to the limit. Example:
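(A hypothetical sketch of the arithmetic behind this failure mode; the overhead figure below is an assumption, not a measured value from the PR.)

```rust
// With both limits at the shared 4 MB default, a batch whose file contents
// reach the cap leaves no headroom for digests and protobuf framing, so the
// encoded response exceeds tonic's decode limit and fails to deserialise.
fn main() {
    let decode_limit = 4 * 1000 * 1000;   // tonic default max decoding size
    let batch_content = 4 * 1000 * 1000;  // file bytes at the batch-size cap
    let response_overhead = 1024;         // assumed digests + framing bytes
    let encoded_size = batch_content + response_overhead;
    assert!(encoded_size > decode_limit); // response cannot be decoded
}
```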
PS: Please be nice, this is my first contribution both here and in Rust :)