Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Size of generated code #148

Open
cgwalters opened this issue Aug 31, 2023 · 7 comments
Open

Size of generated code #148

cgwalters opened this issue Aug 31, 2023 · 7 comments

Comments

@cgwalters
Copy link

cgwalters commented Aug 31, 2023

This crate generates a lot of code. It is by nearly an order of magnitude the largest code size in our app's dependencies.

$ ll cache/github.com-1ecc6299db9ec823/k8s-openapi-0.18.0.crate
-rw-r--r--. 1 walters walters 4.5M Jul 11 19:52 cache/github.com-1ecc6299db9ec823/k8s-openapi-0.18.0.crate

4.5M doesn't sound that bad, but it actually uncompresses to:

$ zcat cache/github.com-1ecc6299db9ec823/k8s-openapi-0.18.0.crate | wc -c
57543168

Almost 55MiB uncompressed Rust source! By comparison, tokio-1.29.1.crate is 678KiB compressed, ~4MiB uncompressed.

Now, one thing that seems like it should work is trimming the versions we don't use; AIUI enabling e.g. the v1_25 feature means that the other versions are unused. I can pretty easily do that downstream by removing the source when generating our vendor snapshot (we use https://github.com/coreos/cargo-vendor-filterer ).

That said...skimming through diff -u src/v1_2{2,3}/api/core/v1/ for example, 99% of the changes are just to documentation comments; and many of the doc comment changes are entirely spurious. To cherry pick one:

--- v1_22/api/core/v1/toleration.rs	2023-08-31 12:57:20.335533880 -0400
+++ v1_23/api/core/v1/toleration.rs	2023-08-31 12:57:20.349534101 -0400
@@ -4,12 +4,14 @@
 #[derive(Clone, Debug, Default, PartialEq)]
 pub struct Toleration {
     /// Effect indicates the taint effect to match. Empty means match all taint effects. When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.
+    ///
     pub effect: Option<String>,
 
     /// Key is the taint key that the toleration applies to. Empty means match all taint keys. If the key is empty, operator must be Exists; this combination means to match all values and all keys.
     pub key: Option<String>,
 
     /// Operator represents a key's relationship to the value. Valid operators are Exists and Equal. Defaults to Equal. Exists is equivalent to wildcard for value, so that a pod can tolerate all taints of a particular category.
+    ///
     pub operator: Option<String>,
 
     /// TolerationSeconds represents the period of time the toleration (which must be of effect NoExecute, otherwise this field is ignored) tolerates the taint. By default, it is not set, which means tolerate the taint forever (do not evict). Zero and negative values will be treated as 0 (evict immediately) by the system.
@@ -167,7 +169,7 @@
                         "effect".to_owned(),
                         crate::schemars::schema::Schema::Object(crate::schemars::schema::SchemaObject {
                             metadata: Some(Box::new(crate::schemars::schema::Metadata {
-                                description: Some("Effect indicates the taint effect to match. Empty means match all taint effects. When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.".to_owned()),
+                                description: Some("Effect indicates the taint effect to match. Empty means match all taint effects. When specified, allowed values are NoSchedule, PreferNoSchedule and NoExecute.\n\n".to_owned()),
                                 ..Default::default()
                             })),
                             instance_type: Some(crate::schemars::schema::SingleOrVec::Single(Box::new(crate::schemars::schema::InstanceType::String))),
@@ -189,7 +191,7 @@
                         "operator".to_owned(),
                         crate::schemars::schema::Schema::Object(crate::schemars::schema::SchemaObject {
                             metadata: Some(Box::new(crate::schemars::schema::Metadata {
-                                description: Some("Operator represents a key's relationship to the value. Valid operators are Exists and Equal. Defaults to Equal. Exists is equivalent to wildcard for value, so that a pod can tolerate all taints of a particular category.".to_owned()),
+                                description: Some("Operator represents a key's relationship to the value. Valid operators are Exists and Equal. Defaults to Equal. Exists is equivalent to wildcard for value, so that a pod can tolerate all taints of a particular category.\n\n".to_owned()),
                                 ..Default::default()
                             })),
                             instance_type: Some(crate::schemars::schema::SingleOrVec::Single(Box::new(crate::schemars::schema::InstanceType::String))),

That's just whitespace changes (and spurious ones at that!)

Maybe some sort of post-processing that identifies identical APIs and instead rewrites the generate code to just be a type alias for the one in the prior version would work. (But as soon as we do that, it breaks the crude hammer of just rm -rf)

@Arnavion
Copy link
Owner

So just to be clear, your concern about the size is only because you're vendoring, right? Or is there any other reason?


What is the mechanism of deduping?

Type-alias stuff would be complicated because it's not just a matter of discovering duplicates but also of moving the "common" type into some place unaffected by version features. Eg v1_23::Pod can't just be an alias for v1_22::Pod because v1_22::Pod only exists when the v1_22 feature is enabled, so v1_22::Pod would have to be moved to common::Pod and then both v1_22::Pod and v1_23::Pod would be aliases of that.

An easier way would be to just symlink v1_23::Pod's file to v1_22::Pod's file. Also that'll be even more space savings, because the size of the symlink is the length of the file names, which would be smaller than a file that contains a type alias. I know cargo publish supports symlinks in the source, but I don't know if it unsymlinks them when constructing the actual tarball. Do you know?


What should be the basis of deduping? Let's say the crate supports five versions 1.21 through 1.25...

  • If Pod is identical in all versions (for some definition of "identical", say ignoring doc comments as you suggested), then sure it can be deduped.

  • What if it's identical in 1.21 through 1.24 but not 1.25? Is it still deduped for 1.21 through 1.23, while 1.24 and 1.25 keep their individual files? If yes, what is the smallest set of versions for which we would dedupe? The absolute minimum of two versions?

  • What if it's identical in 1.21 through 1.23 and separately it's identical in 1.24 and 1.25? Do we have two deduped files? If we have to have multiple deduped files for the same type, what is the scheme to prevent them from interfering?

I think the answers are:

  1. Any file which is identical in two versions or more can be deduped, for maximum savings.

  2. The deduped filename can incorporate the version module name of the lowest version. Eg in the third point above we would end up with v1_21_pod.rs and v1_24_pod.rs

What do you think?

@cgwalters
Copy link
Author

So just to be clear, your concern about the size is only because you're vendoring, right?

Yes, that's the primary concern. (Which isn't really about disk space though that matters a little, but about auditability)

If Pod is identical in all versions (for some definition of "identical", say ignoring doc comments as you suggested), then sure it can be deduped.

I think we'd need to do some scripting to see just how many types are identical, but I'd just offhand guess it's at least 50%.

What should be the basis of deduping?

There is an alternative approach that would require more work in the code generator, which is basically to use #[cfg] inside a single common module. I suspect that's nontrivial.

I know cargo publish supports symlinks in the source, but I don't know if it unsymlinks them when constructing the actual tarball. Do you know?

symlinks have a few suboptimal things; one is that not all search and IDE tools handle them nicely. Another is Windows.

Related to all of the above: It'd probably help overall to just use the latest version of the documentation (I suspect there's 0.5% of cases where this might be misleading, but it's probably worth it to default and then add overrides)

@Arnavion
Copy link
Owner

Okay. One more question: Do you use kube or do you use the API operations that k8s-openapi provides via the api feature?

As part of adding v1.28 support there are a bunch of non-trivial changes that need to be made to the code generator. The change to support extra parameters as mentioned there is already done, but there are more changes needed on top of that, and I'm wondering if it's worth it. Basically every user I know disables the api feature and uses kube instead, so I might just drop API operations from the next release entirely. That would also decreases the size of the crate by a lot.

@Arnavion
Copy link
Owner

Arnavion commented Sep 8, 2023

v0.20 has been released with the API operations-related code removed. Is it small enough now? If not, we can keep this open for working on a deduping solution.

@cgwalters
Copy link
Author

One more question: Do you use kube

We currently use kube, yes.

And, thanks for looking at this. It looks like v0.20 is better, but still really large:

$ du -shc vendor/k8s-openapi/src/*
4.0K    vendor/k8s-openapi/src/byte_string.rs
12K     vendor/k8s-openapi/src/deep_merge.rs
12K     vendor/k8s-openapi/src/lib.rs
8.0K    vendor/k8s-openapi/src/resource.rs
6.6M    vendor/k8s-openapi/src/v1_22
7.0M    vendor/k8s-openapi/src/v1_23
7.0M    vendor/k8s-openapi/src/v1_24
6.4M    vendor/k8s-openapi/src/v1_25
6.6M    vendor/k8s-openapi/src/v1_26
6.8M    vendor/k8s-openapi/src/v1_27
7.1M    vendor/k8s-openapi/src/v1_28
48M     total

Versus previously:

du -shc vendor/k8s-openapi/src/*
8.0K    vendor/k8s-openapi/src/api.rs
4.0K    vendor/k8s-openapi/src/byte_string.rs
12K     vendor/k8s-openapi/src/deep_merge.rs
24K     vendor/k8s-openapi/src/lib.rs
8.0K    vendor/k8s-openapi/src/resource.rs
9.7M    vendor/k8s-openapi/src/v1_20
9.8M    vendor/k8s-openapi/src/v1_21
8.2M    vendor/k8s-openapi/src/v1_22
8.5M    vendor/k8s-openapi/src/v1_23
8.5M    vendor/k8s-openapi/src/v1_24
7.8M    vendor/k8s-openapi/src/v1_25
8.1M    vendor/k8s-openapi/src/v1_26
61M     total

@fbernier
Copy link

fbernier commented Apr 4, 2024

I'd like to add that this generated code takes a long time to compile. I have a beefy desktop computer and k8s_openapi is the slowest crate to compile in a 800 crates project at 5.49s.

@Arnavion
Copy link
Owner

Arnavion commented Apr 4, 2024

Compilation time will not change because of what's being discussed in this issue. That would be #77

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants