Optimised code in h1.c doesn't do what it says it does #2545
Comments
Hi! Thanks for the report, you're absolutely right; at first glance the test should be changed like this:
I'll have a deeper look, but at first glance it should be OK since we want to make sure that all 4 bytes overflow when 0x7f is subtracted. Thanks!
No, just changing the test is not enough. The problem is in the direction of the subtraction. This is perhaps clearer:
Hmmm, why? We've already subtracted 0x24242424 from the value. Now if all bytes were between 0x24 and 0x7e, subtracting 0x5b5b5b5b will make all of them flip to 0x80, so the test verifies that all of them have this 0x80 bit, and if one misses it, it's because it was >= 0x7f. Negation should generally be avoided on x86, where it's a unary operator that requires one extra cycle to load the value, or to copy it when you want to preserve it for later.
Got it! Indeed, by doing the second subtract we're breaking the second byte's check by letting some bits of the first one slip into it; I was stupid. I've now verified that the last one gets all of them right (for all 32-bit values).
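Since the snippets above did not survive the copy here, the following is a sketch of the kind of check being discussed, together with a small brute-force comparison against a byte-wise reference. This is my reconstruction of the "subtract in both directions" form, not the exact code from the thread:

```c
#include <stdint.h>

/* Sketch: nonzero iff some byte of x lies outside [0x24, 0x7e]. A byte
 * below 0x24 makes (x - 0x24242424) borrow in its lane; a byte above 0x7e
 * makes (0x7e7e7e7e - x) borrow; either sets a bit under 0x80808080. */
static uint32_t any_byte_outside(uint32_t x)
{
    return ((x - 0x24242424u) | (0x7e7e7e7eu - x)) & 0x80808080u;
}

/* Reference implementation: check the 4 bytes one by one. */
static int any_byte_outside_ref(uint32_t x)
{
    for (int i = 0; i < 4; i++) {
        uint8_t b = (uint8_t)(x >> (8 * i));
        if (b < 0x24 || b > 0x7e)
            return 1;
    }
    return 0;
}

/* Partial verification: sweep each byte position over all 256 values while
 * the other three bytes stay in range. (The thread reports an exhaustive
 * check over all 32-bit values.) Returns 1 when both versions agree. */
static int sweep_ok(void)
{
    for (int pos = 0; pos < 4; pos++) {
        for (uint32_t v = 0; v < 256; v++) {
            uint32_t x = 0x2a2a2a2au & ~(0xffu << (8 * pos));
            x |= v << (8 * pos);
            if (!!any_byte_outside(x) != any_byte_outside_ref(x))
                return 0;
        }
    }
    return 1;
}
```

Because the borrows in each subtraction can only originate from a genuinely out-of-range byte, the OR of the two directions never gives a false "all in range" answer.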
I'm going to simplify it into a generic function "in_range_32b()" that does the range check based on the two bounds:
Actually I'd rather return the non-negated value, but I'm unable to find a simple English term to describe this "out-of-bounds" nature :-) "exceeds", maybe?
Finally "is_char4_outof()" makes sense to me. This way I can make "is_char8_outof()" by stacking two of them with the direct result, so that the pre-loaded values and the final mask and test are fused on 32-bit archs.
“outside” appears to be a common term for this.
Ah yes, thank you Tim, I like this one!
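A sketch of what the resulting helper could look like, reconstructed from the names and the multiply trick mentioned in this thread (the actual haproxy code may differ):

```c
#include <stdint.h>

/* Sketch: nonzero iff any of the 4 bytes of <x> is outside [min8, max8].
 * Multiplying a byte by 0x01010101 replicates it into all four lanes. A
 * byte below min8 makes (x - min32) borrow in its lane; a byte above max8
 * makes (max32 - x) borrow; either sets a bit under the 0x80808080 mask.
 * Intended for constant bounds below 0x80, as used for URI characters. */
static inline uint32_t is_char4_outside(uint32_t x, uint8_t min8, uint8_t max8)
{
    uint32_t min32 = 0x01010101u * min8;
    uint32_t max32 = 0x01010101u * max8;

    return ((x - min32) | (max32 - x)) & 0x80808080u;
}
```

The URI fast path would then simply break out of its 4-bytes-at-a-time loop whenever is_char4_outside(x, 0x24, 0x7e) is nonzero.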
This is definitely much more readable. Testing with Compiler Explorer, I saw that gcc also thought of merging the operands before the and. Because of this merging, the new code is actually fewer instructions than the original, nice! See here.
Yes, that's what I validated by testing with various compiler versions (gcc 4.2 to 13.2) on several archs including x86_64, i386, armv7, armv5, aarch64, mips32 and riscv64. The other benefit I was foreseeing (hence my intent to make the function return the mask of outliers) is that it also works for 64-bit values on 32-bit archs, because (on some archs) the registers are loaded once with the 0x24242424 and 0x7e7e7e7e masks and used for both 32-bit halves, and the final test is fused as well. I don't intend to use it there, but I could in other places, and it can help get rid of some ifdefs if the is_char8_outside() function is always available. I'll do a minimal backportable fix and will make a cleanup for new versions using the inline function that does the multiply above. If you have a public name, I can credit you for spotting that bug. If you don't care, I won't annoy you ;-)
Nice work, this way is much more flexible. For credits you can put "Martijn van Oosterhout", though "kleptog" is more unique. Thanks for picking this up.
If I may make one more suggestion (since you're looking at this code already), I'd suggest defining:
This allows you to get rid of all the #ifdefs. Even on architectures requiring aligned loads, the optimised loop is almost certainly a win, since the alternative is to go through the rest of the state machine byte by byte. I'd also suggest following it up with the simpler loop:
So you always end at the first non-matching byte, which avoids even more trips through the state machine. Also, by declaring …
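For reference, the memcpy() idiom being proposed here is presumably along these lines (my sketch, not the commenter's exact code); whether the copy is inlined into a single load is entirely up to the compiler, which is the crux of the reply that follows:

```c
#include <stdint.h>
#include <string.h>

/* The commonly-suggested "portable" unaligned load: ask the compiler to
 * copy the bytes into a local and hope it inlines the copy as one load. */
static inline uint32_t read_u32_via_memcpy(const void *p)
{
    uint32_t x;
    memcpy(&x, p, sizeof(x));
    return x;
}
```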
No, please, really: no. No memcpy(). That's almost always wrong. In the best case it gains nothing, and in many cases it's wrong. On some archs and/or compiler versions and/or optimization levels it will implement a call to memcpy(). In other cases it will inline it even though the architecture would have supported an unaligned access. And it's wrong to use it in code that can end up in libraries that are meant to be portable or even embedded in free-standing environments. I don't know how I could state it louder, because I keep hearing the same misconception here and there: ONE MUST NEVER EVER USE MEMCPY() TO PERFORM UNALIGNED ACCESSES. What it really does is ask the compiler to implement a call to memcpy(), unless it thinks it can do better and is willing to do so. And it doesn't take long to find such broken examples. Just take the link you proposed above: https://godbolt.org/z/6Pb961P66. Change the compiler at the top right to "ARM GCC 4.5.4 (linux)", change the options to "-O3 -march=armv7-a", and check the code below "step2":
See that atrocious memcpy() here, in the most performance-critical path, that's going to turn your public parser into a DoS amplifier? Another version, gcc-5.4, again for armv7:
See? It performs an unaligned access, stores it into a local variable on the stack, then re-reads it. Now you could say these are old compilers, but we do support them since they work fine. OK, what about more recent ones? Let's take gcc-13.2.0 on MIPS at -Os:
Yep, another call to memcpy(). You may think "but the gcc guys don't care about MIPS anymore". Then I guess they do care about RISC-V 64, don't they? Let's check:
Ouch! That's why I really don't want to see anyone stuff these nasty memcpy() calls into code that is supposed to be portable. They're always the wrong solution. The correct portable way to perform unaligned accesses is via the "packed" attribute applied to structures and unions, which the compiler knows how to access for any type. That's what we're doing with our read_u32(), read_u64() etc. functions in https://github.com/haproxy/haproxy/blob/master/include/haproxy/net_helper.h. The code is most often optimal. You can try it yourself with the following code:
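A sketch of that packed-union approach, close in spirit to haproxy's read_u32() (see net_helper.h for the real one):

```c
#include <stdint.h>

/* Packed-union unaligned load: the packed attribute tells the compiler the
 * pointer may be misaligned, so it emits whatever unaligned-capable load
 * sequence the target supports. Note: __attribute__((packed)) is a
 * GCC/Clang extension, not standard C. */
static inline uint32_t read_u32_packed(const void *ptr)
{
    const union { uint32_t u32; } __attribute__((packed)) *u = ptr;
    return u->u32;
}
```

Because the access goes through a type the compiler knows is unaligned, it is free to pick the native unaligned load, partial-word loads, or byte loads, whichever the target does best.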
For example, gcc-13.2 on MIPS, which does provide instructions to perform unaligned accesses by loading partial words:
As you can see, the packed version is only two load instructions, compared to the horror of the memcpy() at the top. Why didn't the compiler use that instead of memcpy(), since it knows how to do it natively? Just because nothing forces it to! You asked it to implement a function call, and it didn't feel in the mood to optimize it. Or maybe it did not recognize a less well-known pattern, for whatever reason. Loading 64 bits on gcc-11.3 for armv7 also does a much better job:
As you can see, the first one performs an inlined memcpy() while the packed version just performs two unaligned loads. It could even be better by dropping this unneeded initial mov. But there are still some rare corner cases where the compiler will not do as good a job, e.g. here with gcc-4 for i386 at -O0:
Again it performs an inlined memcpy() onto the stack. One could argue that it's not dramatic because we're at -O0, but it shows that it can happen; so when we know we certainly don't want to face this, we prefer to avoid it and rely on the natural and correct way to load a word on that architecture: a normal dereference. Finally, regarding the ifdefs, it's important to understand that architectures are not all equal regarding unaligned accesses. Some handle them fast, but not all. For example, if you take RISCV64, which doesn't support unaligned accesses by default, the unaligned access implemented by gcc-13.2 via the packed method gives this:
All of this just to check 4 chars. Often it's more efficient to check the chars one by one, as some of the operations can be parallelized. Thus we need the ifdefs anyway. And to finish on this: there's no undefined behavior in performing unaligned accesses; the notion of aligned/unaligned is architecture-dependent. Gcc even gives you the info regarding native support:
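The compiler's predefined target macros are one such source of information. An illustrative gate for the word-at-a-time fast path might look like this; the macro list is an assumption for the example, not haproxy's actual configuration:

```c
/* Illustrative only: enable the fast path on targets known to handle
 * unaligned accesses efficiently. __ARM_FEATURE_UNALIGNED comes from the
 * ARM C Language Extensions; the list here is an example, not haproxy's. */
#if defined(__x86_64__) || defined(__i386__) || defined(__aarch64__) || \
    defined(__ARM_FEATURE_UNALIGNED)
#define UNALIGNED_WORD_FAST 1
#else
#define UNALIGNED_WORD_FAST 0
#endif
```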
If you look at the following structure, what do you guess the offsets will be?
The answer is: "it depends". It depends on the architecture. On most architectures you'll see 0, 2, 8 and 16 respectively, because 64-bit fields are aligned on 64 bits. On i386 you'll get 0, 2, 4 and 12, because i386 is fine with unaligned u64 accesses (nothing was historically aligned on this platform before the switch to 32-bit, and I guess they kept that principle for qwords on 32-bit), and this principle is still preserved:
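As a concrete illustration (the field types are my guess at the structure being described), offsetof() shows how the same struct lays out differently per target:

```c
#include <stddef.h>
#include <stdint.h>

/* A structure of the kind discussed: on most targets uint64_t is aligned
 * on 8 bytes, giving offsets 0, 2, 8, 16; on i386, where uint64_t only
 * needs 4-byte alignment, the offsets become 0, 2, 4, 12. */
struct demo {
    char     a;   /* offset 0 everywhere */
    short    b;   /* offset 2 (shorts align on 2) */
    uint64_t c;   /* offset 8 on most archs, 4 on i386 */
    uint64_t d;   /* offset 16 on most archs, 12 on i386 */
};
```

Printing offsetof(struct demo, c) and offsetof(struct demo, d) on different targets makes the architecture dependence visible directly.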
Thus no, it's not undefined; it's architecture-specific, and it's perfectly fine once properly checked via ifdefs or any other method. You'll find thousands of such accesses in the kernel; they're present in virtually every crypto or hash algorithm; compression tools make heavy use of them; and CPU vendors always end up supporting them for the tremendous gains this provides on network processing.
In 1.7, with commit 5f10ea3 ("OPTIM: http: improve parsing performance of long URIs"), we improved the URI parser's performance on platforms supporting unaligned accesses by reading 4 chars at a time into a 32-bit word. However, as reported in GH issue #2545, there's a bug in the way the top bytes are checked: the parser stops when all 4 of them are above 0x7e instead of when one of them is, so certain patterns can slip through if the last bytes are all valid. The fix requires negating the value, but on the other hand it allows some of the tests to be parallelized and the masks to be fused, which could even end up slightly faster. This needs to be backported to all stable versions, but be careful: this code moved a lot over time, from proto_http.c to h1.c, to http_msg.c, to h1.c again. Better to just grep for "24242424" or "21212121" in each version to find it. Big kudos to Martijn van Oosterhout (@kleptog) for spotting this problem while analyzing that piece of code, and for reporting it.
Now fixed, thanks @kleptog!
OK, memcpy() was inlined in the cases I tried, but I admit I didn't really dig into it more deeply. The point was more that, at least in this case, the speedup could be made to work on more architectures. Thanks for the work.
Detailed Description of the Problem
It's about this code:
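The snippet did not survive the copy here, so the following is a reconstruction of the fast-path check from the constants quoted in this thread (0x24242424 and 0x5b5b5b5b); the exact code in h1.c may differ:

```c
#include <stdint.h>

/* Reconstruction of the problematic check: the intent is to return 1 only
 * when all 4 bytes of <x> are in [0x24, 0x7e], so the parser can skip the
 * whole word at once. */
static int word_all_in_range_buggy(uint32_t x)
{
    x -= 0x24242424u;
    if (x & 0x80808080u)      /* some byte was below 0x24: stop */
        return 0;
    x -= 0x5b5b5b5bu;
    if (!(x & 0x80808080u))   /* BUG: stops only when ALL 4 bytes were above 0x7e */
        return 0;
    return 1;
}
```

With this logic, a word like 0x24267E80 is accepted even though its low byte is 0x80, as described next.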
The comment suggests that what it does is "skip bytes not between 0x24 and 0x7e inclusive", but it seems to actually want to do the opposite: "skip bytes between 0x24 and 0x7e inclusive". That is, the equivalent of:
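In byte-by-byte form, the intended behaviour would be something like this sketch (my wording of the equivalent, since the original snippet was lost here):

```c
#include <stddef.h>

/* Advance over bytes that ARE between 0x24 and 0x7e inclusive, stopping
 * at the first byte that is not. */
static const unsigned char *skip_uri_chars(const unsigned char *ptr,
                                           const unsigned char *end)
{
    while (ptr < end && *ptr >= 0x24 && *ptr <= 0x7e)
        ptr++;
    return ptr;
}
```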
But it doesn't do that. For example: *ptr = 0x24267E80.
There are many other examples, e.g. 0x80802626. The effect is that it will skip high bytes as long as not all 4 bytes are high. This doesn't seem to be the intention, and is certainly not what you'd expect from the comment.
Expected Behavior
The expected behaviour seems to me to be just skipping bytes in blocks of four between 0x24 and 0x7e inclusive. This can be achieved by a slight adjustment to the code.
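One possible corrected form is sketched below (my reconstruction, not necessarily the reporter's exact suggestion): reverse the direction of the second subtraction so a too-high byte raises its own 0x80 bit, then OR both tests together.

```c
#include <stdint.h>

/* Returns 1 iff all 4 bytes of <x> are in [0x24, 0x7e]: a byte below 0x24
 * makes (x - 0x24242424) borrow in its lane, a byte above 0x7e makes
 * (0x7e7e7e7e - x) borrow, and either sets a bit under 0x80808080. */
static int word_all_in_range_fixed(uint32_t x)
{
    return (((x - 0x24242424u) | (0x7e7e7e7eu - x)) & 0x80808080u) == 0;
}
```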
Steps to Reproduce the Behavior
The error was discovered by code examination and by simulation with a Python script.
Example output:
Do you have any idea what may have caused this?
Probably just an oversight by the original developer; the code is complicated enough that people avoid looking at it too deeply.
Do you have an idea how to solve the issue?
Suggested improved code is above.
Note: this form of type punning is not allowed by the C standard. The following code produces the same compiled code, but avoids undefined behaviour.
What is your configuration?
Output of haproxy -vv
Last Outputs and Backtraces
Additional Information
N/A