Separate max file size and decoding size handling in oss re client #639

Open
TheGrizzlyDev wants to merge 6 commits into main

Conversation

TheGrizzlyDev (Contributor):

Previously it was possible to set a total file size for the RE client's IO APIs that exceeds tonic's default max-decoding size of 4MB. In fact, because the default limit is 4MB for both, this happened out of the box with files of cumulative size close to or equal to the limit. Example:

genrule(
    name = "large_download",
    out = "out.txt",
    cmd = "dd if=/dev/zero of=$OUT bs=4M count=1",
)
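
For context on the fix, tonic's generated clients expose a builder-style setter for the decoding limit, which is what this PR wires up. A minimal sketch, assuming an illustrative endpoint, an illustrative 16MB value, and a stand-in client type (only max_decoding_message_size itself comes from the PR):

// Sketch only: raising tonic's ~4MB default decoding limit on a generated client.
// The endpoint, client type, and the 16MB figure are illustrative, not from this PR.
let channel = tonic::transport::Channel::from_static("http://remote-execution.example:8980")
    .connect()
    .await?;
let cas = ContentAddressableStorageClient::new(channel)
    .max_decoding_message_size(16 * 1000 * 1000);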

PS: Please be nice, this is my first-time contribution both here and in Rust :)

@facebook-github-bot added the CLA Signed label on May 8, 2024
@TheGrizzlyDev marked this pull request as ready for review on May 8, 2024 13:16
@JakobDegen (Contributor) left a comment:

Thanks! A couple comments, primarily the result of missing context on my side

 let mut exec_enabled = true;

-if let Some(cache_cap) = resp.cache_capabilities {
+let max_msg_size_from_capabilities: Option<usize> = if let Some(cache_cap) = resp.cache_capabilities {
Contributor:

Can we rename max_msg_size here, but also in the RECapabilities type, the buckconfig, etc. to just match/approximately match the max_batch_total_size_bytes name that the protocol uses? I feel like that would make this all a bit less confusing to read

Contributor Author:

The two are different things actually, which is probably the same type of confusion that created the original bug in the first place :) max_batch_total_size_bytes in the protocol does not refer to the max size of the gRPC message (and should be strictly lower than that amount); rather, it refers to the maximum number of bytes in the content of BatchFileRead responses. Maybe the difference should be made even clearer?

Contributor:

> The two are different things actually, which is probably the same type of confusion that created the original bug in the first place

Right, I understand that; hence my comment. Based on my reading, the value here still has the semantics of the max_batch_total_size_bytes value, and it's not until L301 that we use this to derive the max_msg_size value.

Contributor Author:

Yeah, it has the same semantics (intentionally), and max_msg_size refers to the gRPC limit. In L301 that meaning doesn't change; I am just ensuring that, if no explicit value is provided, max_msg_size is definitely bigger than max_batch_total_size_bytes, or else any batch message of size close to or equal to max_batch_total_size_bytes would always fail to be deserialised. Does that make sense?

Contributor:

Right, but... the way I'm reading this code, there are two different values:

  1. The max decoding message size, a GRPC property
  2. The max total batch size, the sum of the sizes of the files in a single batch upload request

From my reading, the value returned by the cache_capabilities, the value on L355 and the value on L367 all fall into category 2. The value on L282 falls into category 1. L301 then relates the two categories, setting the max decoding message size to twice the max total batch size (unless another value is specified). Is that all correct?

If so, what I'm asking for is that things in category 2 are all named something like max_total_batch_size and things in category 1 are all named something like max_msg_size or max_decoding_msg_size.

I don't insist on this too much though, so I'd be willing to merge this as long as you understand what I'm asking.

Contributor Author:

ahhhh ok, that makes way more sense now! Thanks :)

Contributor:

Hm, maybe we're still miscommunicating, but I would've expected the rename to also include L355, L341, and more importantly, the config property. I won't block on this anymore though, just mentioning it in case you want to change that with the other small nit

Contributor Author:

ah no, I think this was simply a case of "oops, missed it", my bad :(

Contributor Author:

anyhow, I fixed this one, thanks for catching it!

section: BUCK2_RE_CLIENT_CFG_SECTION,
property: "max_decoding_message_size",
})?,
max_total_file_size: legacy_config.parse(BuckconfigKeyRef {
Contributor:

I guess I don't mind doing this, but can you give an example of why a user might want to configure this away from the default? (also for the max_decoding_message_size I guess)

Contributor Author:

There are some advantages to batching more or less, and some of them depend on the implementation of the RE protocol. An implementation that does not delegate uploads to a distributed system may suffer hotspotting if you try to upload too many files in parallel, so such a limit would help control this issue (though it would also be useful to add a max_files_per_batch limit for the same case). On the other hand, if this isn't the case, it is generally advantageous to use as big a batch as you can in order to reduce the back and forth of every method call.

max_decoding_message_size follows that very same logic, with the added constraint that a large portion of the message might come from the digests, which we do not account for (this is especially bad for builds that generate lots and lots of small files). So if you can change max_files_per_batch then you should also be able to increase the message size accordingly. I don't think either flag will be changed often, but in case it is needed, it is rather handy to be able to. The default behaviour is good enough, but we shouldn't shy away from allowing tweaking.
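
For illustration only, an override in .buckconfig could look like the snippet below, assuming BUCK2_RE_CLIENT_CFG_SECTION resolves to a section named buck2_re_client and that the second property mirrors the max_total_file_size field from the diff context above (both the property name and the values are assumptions, not recommendations):

[buck2_re_client]
# Raise the gRPC decoding limit so that large batch responses can still be deserialised.
max_decoding_message_size = 16000000
# Cap the cumulative size of file contents in a single batch request/response.
max_total_file_size = 8000000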

Comment on lines 301 to 302
let max_decoding_msg_size = opts.max_decoding_message_size
    .unwrap_or(capabilities.max_msg_size * 2);
Contributor:

It seems like this would allow the user to configure a value that is actually too low and thus will lead to errors. In the case where that happens, should we either be taking the max or maybe throwing an error?

Contributor Author:

We could assert a minimum of, let's say, 1MB? There is already a check, though, to ensure that the max gRPC decoding size is strictly higher than the max batch size.
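
For illustration, the kind of check being referred to could look roughly like the helper below; the function and variable names are assumptions, not the exact code in the PR:

// Illustrative sketch of the described validation, not the exact PR code:
// reject a configured gRPC decoding limit that could not fit a full batch response.
fn validate_limits(max_decoding_msg_size: usize, max_total_batch_size: usize) -> anyhow::Result<()> {
    if max_decoding_msg_size <= max_total_batch_size {
        return Err(anyhow::anyhow!(
            "max_decoding_message_size ({}) must be strictly greater than the max total batch size ({})",
            max_decoding_msg_size,
            max_total_batch_size
        ));
    }
    Ok(())
}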

Contributor:

Oh, sorry, my eyes somehow glazed over that check, nevermind me

Comment on lines 314 to 325
execution_client: ExecutionClient::with_interceptor(
    execution.context("Error creating Execution client")?,
    interceptor.dupe(),
).max_decoding_message_size(max_decoding_msg_size),
action_cache_client: ActionCacheClient::with_interceptor(
    action_cache.context("Error creating ActionCache client")?,
    interceptor.dupe(),
).max_decoding_message_size(max_decoding_msg_size),
bytestream_client: ByteStreamClient::with_interceptor(
    bytestream.context("Error creating Bytestream client")?,
    interceptor.dupe(),
).max_decoding_message_size(max_decoding_msg_size),
Contributor:

It's not obvious to me that it's correct to apply a value that was possibly computed specifically for the CAS client to all the other clients as well. Is there some justification for that?

I may just not know enough about protobuf here either. What's the consequence of not setting this value at all on the client?

Contributor Author:

It wouldn't be a bad idea to have a separate value for each gRPC client; I was just trying to keep things simple, but I wouldn't be against making this change. I am not sure users would normally tweak these values, and aside from the bytestream and CAS methods, all the other methods in the RE protocol tend to have generally small payloads, so this limit never really kicks in. That is why I think it's generally safe to keep the same limit around for all of them and reduce the cognitive burden on the configuration. Wdyt?

Contributor:

I don't think I'm asking for a different configuration for each client; my suggestion is to just apply this value to the CAS client only and leave the rest as they are. Especially if we think it's not necessary for those clients, that seems like the simplest model to me, and it prevents anyone from hitting issues where other clients change behavior as a result of modifying what looks like a CAS config.

Contributor Author:

Then how about we keep this value just for Bytestream and CAS clients? Wdyt?

Contributor:

That sounds good
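
A rough sketch of the agreed outcome, following the constructor pattern in the diff above (the cas_client field, the cas channel, and the ContentAddressableStorageClient type are assumptions; the point is only that the CAS and ByteStream clients get the raised limit while the others keep tonic's default):

// Sketch only: apply the raised decoding limit to the CAS and ByteStream clients,
// and leave the execution and action-cache clients on tonic's default.
cas_client: ContentAddressableStorageClient::with_interceptor(
    cas.context("Error creating CAS client")?,
    interceptor.dupe(),
).max_decoding_message_size(max_decoding_msg_size),
bytestream_client: ByteStreamClient::with_interceptor(
    bytestream.context("Error creating Bytestream client")?,
    interceptor.dupe(),
).max_decoding_message_size(max_decoding_msg_size),
execution_client: ExecutionClient::with_interceptor(
    execution.context("Error creating Execution client")?,
    interceptor.dupe(),
),
action_cache_client: ActionCacheClient::with_interceptor(
    action_cache.context("Error creating ActionCache client")?,
    interceptor.dupe(),
),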

… size or batch size

Ensure that only the CAS and BS clients have non-default max decoding sizes
@JakobDegen (Contributor) left a comment:

Just one nit

@@ -81,7 +81,7 @@ use crate::response::*;
 // in the Capabilities message query.
 const CONCURRENT_UPLOAD_LIMIT: usize = 64;

-const DEFAULT_MAX_MSG_SIZE: usize = 4 * 1000 * 1000;
+const DEFAULT_MAX__TOTAL_BATCH_SIZE: usize = 4 * 1000 * 1000;
Contributor:

Suggested change:
-const DEFAULT_MAX__TOTAL_BATCH_SIZE: usize = 4 * 1000 * 1000;
+const DEFAULT_MAX_TOTAL_BATCH_SIZE: usize = 4 * 1000 * 1000;

@facebook-github-bot:

@JakobDegen has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Labels: CLA Signed
3 participants