Is it possible to make Option<CompactStr> same size as Option<String>? #19

Closed
mateuszkj opened this issue Oct 27, 2021 · 11 comments
@mateuszkj

mateuszkj commented Oct 27, 2021

It would be very beneficial for me to reduce the size of Option<CompactStr>; I use it a lot in my projects.

I ran some tests, and it looks like Option<CompactStr> takes an additional 8 bytes compared to Option<String>.

Rust has the null pointer optimization (see https://doc.rust-lang.org/std/option/index.html#representation), but I'm not sure if it can be used in a userland crate.

Sizes of popular string types:

String: 24 bytes
Option<String>: 24 bytes
CompactStr: 24 bytes
Option<CompactStr>: 32 bytes
KString: 24 bytes
Option<KString>: 24 bytes
SmartString<Compact>: 24 bytes
Option<SmartString<Compact>>: 32 bytes
SmolStr: 24 bytes
Option<SmolStr>: 24 bytes
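The gap in the table above comes down to whether the compiler can find a niche. A minimal sketch of the niche optimization using std::mem::size_of (sizes assume a 64-bit target):

```rust
use std::mem::size_of;
use std::num::NonZeroU8;

fn main() {
    // String holds a NonNull pointer, so the all-zero bit pattern is free
    // for Option to use as the None discriminant (the "niche" optimization).
    assert_eq!(size_of::<String>(), 24);
    assert_eq!(size_of::<Option<String>>(), 24);

    // A plain u8 has no invalid bit pattern, so Option<u8> needs an extra
    // tag byte; NonZeroU8 reserves 0 as a niche, so Option<NonZeroU8>
    // stays a single byte.
    assert_eq!(size_of::<Option<u8>>(), 2);
    assert_eq!(size_of::<Option<NonZeroU8>>(), 1);
}
```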
@NobodyXu
Contributor

NobodyXu commented Oct 27, 2021

I think it is possible, but then CompactStr would have to store the length of "" as 1 and of "1" as 2.

Since the string can be stored inline, the only field that can be guaranteed to never be 0 is the discriminant/length field.
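The "store length plus one" idea can be sketched with NonZeroU8 (hypothetical field names, not compact_str's actual layout):

```rust
use std::mem::size_of;
use std::num::NonZeroU8;

// Hypothetical inline layout: the length is stored offset by one, so the
// field can be NonZeroU8 and the compiler can use 0 as the None niche.
struct InlineStr {
    len_plus_one: NonZeroU8, // "" stored as 1, "1" stored as 2, ...
    buf: [u8; 23],
}

impl InlineStr {
    fn len(&self) -> usize {
        self.len_plus_one.get() as usize - 1
    }
}

fn main() {
    // The niche in NonZeroU8 lets Option reuse the same 24 bytes.
    assert_eq!(size_of::<InlineStr>(), 24);
    assert_eq!(size_of::<Option<InlineStr>>(), 24);

    let empty = InlineStr { len_plus_one: NonZeroU8::new(1).unwrap(), buf: [0; 23] };
    assert_eq!(empty.len(), 0);
    let _ = empty.buf;
}
```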

@ParkMyCar
Owner

Thanks for the issue!

This is a great use case I hadn't thought of. At the bit level we should be able to represent a distinct None variant; how to get this to work with Option<CompactStr> will require some more research, though. This is definitely something I'd like to pursue.

@tttwang23

Why not move the length byte to byte 0? If you disallow byte 0 from ever being 0, then you can use the NonZeroU8 rule to optimize the Option. Byte 0 with value 1 then means the heap case, values 2 to 25 map to lengths 0 to 23, and values 26 onward are the UTF-8 value of the first byte of the longest (fully packed) inline string.

Keeping the length byte at byte 0 is nice because reading it doesn't need an address offset calculation. Perhaps this is a way to claw back the lost performance.

For the heap case, the 8 pad bytes would be the first 8 bytes, filled with the value 1.
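The mapping proposed above can be written out as a small decoder (the decode helper and Tag enum are illustrative, not compact_str code):

```rust
// Byte 0 encoding from the comment above:
//   1        -> heap variant
//   2..=25   -> inline, length = byte - 2 (i.e. 0..=23)
//   26..     -> first UTF-8 byte of a fully packed inline string
//   0        -> reserved as Option's None niche
enum Tag {
    Heap,
    Inline(usize),
    Packed,
}

fn decode(byte0: u8) -> Tag {
    match byte0 {
        0 => unreachable!("0 is reserved for Option's None niche"),
        1 => Tag::Heap,
        2..=25 => Tag::Inline((byte0 - 2) as usize),
        _ => Tag::Packed,
    }
}

fn main() {
    assert!(matches!(decode(1), Tag::Heap));
    assert!(matches!(decode(2), Tag::Inline(0)));
    assert!(matches!(decode(25), Tag::Inline(23)));
    assert!(matches!(decode(b'h'), Tag::Packed)); // b'h' == 104
}
```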

@NobodyXu
Contributor

But what if the string is empty?

@ParkMyCar
Owner

We don't necessarily need the niche value to be zero (i.e. NonZeroU8); you can define your own NonXU8 using enums, see #22 for an example of a NonMaxU8.

The tricky part is getting the compiler to see the niche value without losing any performance. Internally a CompactStr uses a union, which the compiler skips when checking for niche values. In #22 I tried defining a "compiler public" struct for CompactStr that was essentially (NonMaxU8, [u8; 23]), then transmuting to the union whenever we needed to do an operation. This way the compiler could use the value 255 in the first byte to store the None variant. Unfortunately the transmute introduced an additional copy, which literally doubled the time to create an inline string, because we now had to copy the data twice (once for the transmute, once for actually creating the CompactStr).

Also, FWIW, the length byte is now the last byte because that lets the memcpys for an inline string be better aligned, which gave a significant (~35%) performance improvement.
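The enum-based niche trick works because an enum whose discriminants don't cover every bit pattern hands the compiler spare encodings for None. A toy sketch (reserving far more values than a real NonMaxU8, which would exclude only 255):

```rust
use std::mem::size_of;

// Toy stand-in for NonMaxU8: only values 0..=2 are valid, so patterns
// 3..=255 are niches. A real NonMaxU8 would declare variants for 0..=254.
#[derive(Clone, Copy)]
#[repr(u8)]
enum ToyTag {
    A = 0,
    B = 1,
    C = 2,
}

struct Repr {
    tag: ToyTag,
    data: [u8; 23],
}

fn main() {
    assert_eq!(size_of::<Repr>(), 24);
    // The unused tag patterns let the compiler encode None for free.
    assert_eq!(size_of::<Option<Repr>>(), 24);
    let _ = Repr { tag: ToyTag::A, data: [0; 23] };
}
```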

@CAD97
Contributor

CAD97 commented Mar 31, 2022

I tried again (I'll open a draft PR after some more experimentation) and, yeah, the compiler hates it:

empty                   time:   [8.8500 ns 8.8586 ns 8.8729 ns]
                        change: [+515.19% +516.38% +517.57%] (p = 0.00 < 0.05)
                        Performance has regressed.

inline                  time:   [16.020 ns 16.024 ns 16.033 ns]
                        change: [+50.601% +50.720% +50.922%] (p = 0.00 < 0.05)
                        Performance has regressed.

packed                  time:   [11.073 ns 11.082 ns 11.090 ns]
                        change: [+228.99% +229.96% +230.67%] (p = 0.00 < 0.05)
                        Performance has regressed.

heap                    time:   [54.283 ns 54.383 ns 54.495 ns]
                        change: [+1.1258% +1.5541% +1.9540%] (p = 0.00 < 0.05)
                        Performance has regressed.

Note, though, that I'm testing in a noisy environment; the std_string bench is also seeing noise.

@ParkMyCar
Owner

ParkMyCar commented Mar 31, 2022

Thanks! I was thinking that using transmute_copy might help, since it doesn't move the data. But IMO that starts moving into a world of being too unsafe, and at that point why not just use the compiler macros that NonZero* uses (with the exception that transmute_copy is stable).

@CAD97
Contributor

CAD97 commented Mar 31, 2022

No, transmute_copy still moves the data; transmute_copy is roughly transmute(ptr::read(src)), i.e. *(src as *const T as *const U). #75 is my attempt, which does the zero-copy &ReprNiche as *const ReprNiche as *const ReprUnion as &ReprUnion, but still sees the regression. I took a peek at the assembly and have no idea why the regression exists. (Further discussion of this attempt should probably happen in #75.)
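For reference, the non-moving reinterpretation described above can be sketched generically (reinterpret is an illustrative helper, only sound when the two types have compatible layout and validity):

```rust
// transmute_copy reads (moves) the bytes out of *src; a pointer cast merely
// reinterprets the same memory in place, which is the approach in #75.
unsafe fn reinterpret<Src, Dst>(src: &Src) -> &Dst {
    &*(src as *const Src as *const Dst)
}

fn main() {
    let x: u32 = 0x0102_0304;
    // Zero-copy view of the same 4 bytes, in native byte order.
    let bytes: &[u8; 4] = unsafe { reinterpret(&x) };
    assert_eq!(bytes, &x.to_ne_bytes());
}
```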

@ParkMyCar mentioned this issue Mar 31, 2022
@tttwang23

How about making the first 64-bit word shared (non-union) between the heap case and the short-string case? Then a single 64-bit memory load fetches the length byte and the first 7 byte values. Shouldn't a 64-bit read & write beat a memcpy? You can shift in multiples of 8 bits to access the byte values: (word64 >> (idx << 3)) as u8.

You just have to map length 0 to a non-zero 64-bit word: either the length mapping itself can be adjusted, or the out-of-bounds bytes can be given non-zero padding. The heap case only applies to strings of 25 chars or longer, so 0 is not possible there.
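The shift-based byte access from the comment above, as a standalone helper (assumes a little-endian word so byte index 0 is the least significant byte):

```rust
// Extract byte `idx` from a 64-bit word with a shift instead of a memcpy.
fn byte_at(word64: u64, idx: usize) -> u8 {
    (word64 >> (idx << 3)) as u8
}

fn main() {
    let word = u64::from_le_bytes(*b"abcdefgh");
    assert_eq!(byte_at(word, 0), b'a');
    assert_eq!(byte_at(word, 7), b'h');
}
```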

@CAD97
Contributor

CAD97 commented Apr 1, 2022

See #75. The problem isn't extra copies; the compiler/optimizer is really good at turning small copies into the efficient thing, and you aren't going to beat it by switching between aligned writes and larger typed copies.

The difference comes from opaque optimizer shenanigans, and seemingly unrelated changes are going to be what fixes/diagnoses it.

My current bet is that these microbenchmarks are so sensitive that the noise we're seeing may even come down to machine code alignment. (I know I've read other, better articles about this, at least one of which came with a tool that shuffles machine code around to diagnose whether that is the confounding factor, but I can't find it right now.)

nottirb pushed a commit to nottirb/compact_str that referenced this issue May 15, 2022
@ParkMyCar
Owner

Completed as part of #105
