
Optimised code in h1.c doesn't do what it says it does #2545

Open
kleptog opened this issue Apr 20, 2024 · 15 comments
Labels
2.0 This issue affects the HAProxy 2.0 stable branch. 2.2 This issue affects the HAProxy 2.2 stable branch. 2.4 This issue affects the HAProxy 2.4 stable branch. 2.6 This issue affects the HAProxy 2.6 stable branch. 2.8 This issue affects the HAProxy 2.8 stable branch. 2.9 This issue affects the HAProxy 2.9 stable branch. status: fixed This issue is a now-fixed bug. type: bug This issue describes a bug.

Comments

@kleptog

kleptog commented Apr 20, 2024

Detailed Description of the Problem

It's about this code:

#ifdef HA_UNALIGNED_LE
		/* speedup: skip bytes not between 0x24 and 0x7e inclusive */
		while (ptr <= end - sizeof(int)) {
			int x = *(int *)ptr - 0x24242424;
			if (x & 0x80808080)
				break;

			x -= 0x5b5b5b5b;
			if (!(x & 0x80808080))
				break;

			ptr += sizeof(int);
		}
#endif

The comment suggests what it's doing is "skip bytes not between 0x24 and 0x7e inclusive", but it seems to actually want to do the opposite: "skip bytes between 0x24 and 0x7e inclusive". That is: the equivalent of:

while(ptr < end && 0x24 <= *ptr && *ptr <= 0x7e) ptr++;

But it doesn't do that. For example, take *ptr = 0x24267E80:

*ptr = 0x24267E80
     - 0x24242424
       ==========
     = 0x00025A5C & 0x80808080 = False (continue)
     - 0x5b5b5b5b
       ==========
     = 0xA4A6FF01 & 0x80808080 = True (continue)

There are many other examples, e.g. 0x80802626. The effect is that the loop will skip high bytes as long as not all four bytes are high. That doesn't seem to be the intention, and it's certainly not what you'd expect from the comment.

Expected Behavior

The expected behaviour, it seems to me, is to skip bytes between 0x24 and 0x7e inclusive, in blocks of four. This can be achieved with a slight adjustment to the code.

		/* speedup: skip bytes between 0x24 and 0x7e inclusive */
		while (ptr <= end - sizeof(int)) {
			int x = *(unsigned int *)ptr;
			if ((x - 0x24242424) & 0x80808080)
				break;

			if ((~x - ~0x7e7e7e7e) & 0x80808080)
				break;

			ptr += sizeof(int);
		}

Steps to Reproduce the Behavior

The error was discovered by code examination and confirmed by simulating the logic in a Python script.

Example output:

x            t1    t2    t3     t4     n1      n2      n3
247B267E   True  True  True   True  0057025A A4FBA6FF 5A035800
247B267F  False  True  True  False  0057025B A4FBA700 5A0357FF
247B2680  False  True  True  False  0057025C A4FBA701 5A0357FE

x = number representing characters in string
t1 = expected result of skipping
t2 = what haproxy code is actually doing
t3 = simplified expression of what haproxy actually does
t4 = corrected haproxy code above

t1 == t4 and t2 == t3

n1 = x - 0x24242424
n2 = x - 0x7f7f7f7f
n3 = ~x - ~0x7e7e7e7e
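
Not the original Python script, but an equivalent exhaustive check can be written in C (a sketch; skip_reference() and skip_haproxy_orig() are names made up here for illustration):

#include <stdint.h>
#include <stdio.h>

/* reference: the word may be skipped only when all four bytes are in [0x24, 0x7e] */
static int skip_reference(uint32_t x)
{
	int i;

	for (i = 0; i < 4; i++) {
		uint8_t b = (x >> (8 * i)) & 0xff;
		if (b < 0x24 || b > 0x7e)
			return 0;
	}
	return 1;
}

/* what the current h1.c loop body decides for one 32-bit word */
static int skip_haproxy_orig(uint32_t x)
{
	uint32_t y = x - 0x24242424;

	if (y & 0x80808080)
		return 0;
	y -= 0x5b5b5b5b;
	if (!(y & 0x80808080))
		return 0;
	return 1;
}

int main(void)
{
	uint64_t mismatches = 0;
	uint32_t x = 0;

	do {
		if (skip_reference(x) != skip_haproxy_orig(x))
			mismatches++;
	} while (++x != 0);
	printf("mismatching words: %llu\n", (unsigned long long)mismatches);
	return 0;
}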

Do you have any idea what may have caused this?

Probably just an oversight by the original developer; the code is complicated enough to discourage people from looking at it too deeply.

Do you have an idea how to solve the issue?

Suggested improved code is above.

Note: this form of type punning is not allowed by the C standard. The following code produces the same compiled machine code, but avoids the undefined behaviour.

		/* speedup: skip bytes between 0x24 and 0x7e inclusive */
		while (ptr <= end - sizeof(int)) {
			unsigned int x;
			memcpy(&x, ptr, sizeof(int));
			if ((x - 0x24242424) & 0x80808080)
				break;

			if ((~x - ~0x7e7e7e7e) & 0x80808080)
				break;

			ptr += sizeof(int);
		}

What is your configuration?

N/A

Output of haproxy -vv

N/A

Last Outputs and Backtraces

N/A

Additional Information

N/A

@kleptog kleptog added status: needs-triage This issue needs to be triaged. type: bug This issue describes a bug. labels Apr 20, 2024
@wtarreau
Member

Hi!

Thanks for the report, you're absolutely right. At first glance the test should be changed like this:

-		if (!(x & 0x80808080))
+		if ((x & 0x80808080) != 0x80808080)

I'll have a deeper look, but at first glance it should be OK since we want to make sure that all 4 bytes overflow once 0x7f has been subtracted. Thanks!

@kleptog
Author

kleptog commented Apr 23, 2024

No, just changing the test is not enough: a borrow from a lower byte can still set the top bit of an out-of-range byte. For example, with *ptr = 0x24247F24 we get 0x24247F24 - 0x24242424 = 0x00005B00, then 0x00005B00 - 0x5B5B5B5B = 0xA4A4FFA5, where all four top bits are set even though 0x7F is out of range. The problem is the direction of the subtraction. This is perhaps clearer:

		/* speedup: skip bytes between 0x24 and 0x7e inclusive */
		while (ptr <= end - sizeof(int)) {
			unsigned int x;
			memcpy(&x, ptr, sizeof(int));
			if ((x - 0x24242424) & 0x80808080)
				break;

			if ((0x7e7e7e7e - x) & 0x80808080)
				break;

			ptr += sizeof(int);
		}

@wtarreau
Member

Hmmm why? We've already subtracted 0x24242424 from the value. Now if all bytes were between 24 and 7e, subtracting 5b5b5b5b will make all of them flip bit 0x80, so the test will verify that all of them have this 0x80, and if one misses it, it's because it was >= 7f. The negation should generally be avoided on x86 where it's a unary operator that requires one extra cycle to load the value or to copy it when you want to preserve it later.

@wtarreau
Member

Got it! Indeed, by doing the second subtract we're breaking the per-byte check, because borrows from one byte spill into the next one; I was stupid. I've now verified that the last version gets all of them right (for all 32-bit values).

@wtarreau
Member

I'm going to simplify it into a generic function "in_range_32b()" that does the range check based on the two bounds:

static inline uint32_t in_range_32b(uint32_t x, uint8_t min8, uint8_t max8)
{
	uint32_t min32 = min8 * 0x01010101;
	uint32_t max32 = max8 * 0x01010101;
	return !(((x - min32) | (max32 - x)) & 0x80808080);
}
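
For illustration, the skip loop in h1.c could then look something like this (just a sketch, keeping the direct dereference behind HA_UNALIGNED_LE as in the current code):

#ifdef HA_UNALIGNED_LE
		/* speedup: skip bytes between 0x24 and 0x7e inclusive */
		while (ptr <= end - sizeof(int)) {
			if (!in_range_32b(*(const uint32_t *)ptr, 0x24, 0x7e))
				break;
			ptr += sizeof(int);
		}
#endif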

Actually I'd rather return the non-negated value but I'm unable to find a simple english term to describe this "out-of-bounds" nature :-) "exceeds" maybe ?

@wtarreau
Member

Finally "is_char4_outof()" makes sense to me. This way I can make "is_char8_outof()" by stacking two of them with the direct result so that the pre-loaded values and the final mask and test are fused on 32-bit archs.

@TimWolla
Member

but I'm unable to find a simple english term to describe this "out-of-bounds" nature :-) "exceeds" maybe ?

“outside” appears to be a common term for this.

@wtarreau
Member

Ah yes thank you Tim, I like this one!

@kleptog
Author

kleptog commented Apr 23, 2024

This is definitely much more readable. I saw when testing with Compiler Explorer that gcc also thought of merging the operands before the AND. In fact, because of this merging the new code is actually fewer instructions than the original, nice! See here.

@wtarreau
Member

Yes, that's what I validated by testing with various compiler versions (gcc 4.2 to 13.2) on several archs including x86_64, i386, armv7, armv5, aarch64, mips32 and riscv64. The other benefit I was foreseeing (hence my intent to make the function return the mask of outliers) is that it also works for 64 bits on 32-bit archs, because (on some archs) the registers are loaded once with the 24242424 and 7e7e7e7e masks and used for both 32-bit halves, and the final test is fused as well. I don't intend to use it there, but I could in other places, and it can help get rid of some ifdefs if the is_char8_outside() function is always available. I'll do a minimal backportable fix and make a cleanup for new versions using the inline function that does the multiply above. If you have a public name I can credit you for spotting that bug. If you don't care I won't annoy you ;-)
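
For reference, here's a sketch of what that pair could look like, based on the description above (the committed helpers may differ in detail):

#include <stdint.h>

/* returns a non-zero mask when any byte of <x> is outside [min8..max8] */
static inline uint32_t is_char4_outside(uint32_t x, uint8_t min8, uint8_t max8)
{
	uint32_t min32 = min8 * 0x01010101U;
	uint32_t max32 = max8 * 0x01010101U;

	return ((x - min32) | (max32 - x)) & 0x80808080U;
}

/* stacks two 32-bit checks; on 32-bit archs the constants are loaded once
 * and the final mask & test can be fused by the compiler */
static inline uint32_t is_char8_outside(uint64_t x, uint8_t min8, uint8_t max8)
{
	return is_char4_outside((uint32_t)x, min8, max8) |
	       is_char4_outside((uint32_t)(x >> 32), min8, max8);
}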

@kleptog
Author

kleptog commented Apr 23, 2024

Nice work, this way is much more flexible.

For credits you can put "Martijn van Oosterhout", though kleptog is more unique. Thanks for picking this up.

@kleptog
Author

kleptog commented Apr 24, 2024

If I may make one more suggestion (since you're looking at this code already), I'd suggest defining:

static inline uint32_t load_char4_unaligned(unsigned char *ptr) {
    /* Compiler will collapse to simple load on architectures that support unaligned loads */
    uint32_t x;
    memcpy(&x, ptr, sizeof(uint32_t));
    return x;
}

This allows you to get rid of all the #ifdefs. Even on architectures requiring aligned loads the optimised loop is almost certainly a win, since the alternative is to go through the rest of the state machine byte-by-byte.

I'd also suggest following it up with the simpler loop:

while (ptr < end && *(unsigned char *)ptr >= 0x24 && *(unsigned char *)ptr <= 0x7e) ptr++;

That way you always end at the first non-matching byte, which avoids even more passes through the state machine. Also, by declaring ptr, start and end as (unsigned char *), you could get rid of a lot of casts.
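
Putting the two suggestions together, a word-at-a-time skip followed by a byte-wise tail could look something like this (purely a sketch; skip_uri_chars() is a hypothetical name and the word checks are the corrected ones from earlier in this thread):

#include <stdint.h>

static inline unsigned char *skip_uri_chars(unsigned char *ptr, unsigned char *end)
{
	/* word-at-a-time: stop as soon as any byte falls outside [0x24, 0x7e] */
	while (ptr <= end - sizeof(uint32_t)) {
		uint32_t x = load_char4_unaligned(ptr);

		if ((x - 0x24242424U) & 0x80808080U)
			break;
		if ((0x7e7e7e7eU - x) & 0x80808080U)
			break;
		ptr += sizeof(uint32_t);
	}
	/* byte-wise tail: always end exactly on the first non-matching byte */
	while (ptr < end && *ptr >= 0x24 && *ptr <= 0x7e)
		ptr++;
	return ptr;
}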

@wtarreau
Member

No, please really no. No memcpy(). That's almost always wrong. In the best case it gains nothing, and in many cases it's wrong. On some archs and/or compiler versions and/or optimization levels it will implement a call to memcpy(). In other cases it will inline it even though the architecture would have supported an unaligned access. And it's wrong to use it in code that can end up in libraries that are meant to be portable or even embedded in free-standing environments.

I don't know how I could state it louder, because I'm hearing the same misconception here and there over and over, but: ONE MUST NEVER EVER USE MEMCPY() TO PERFORM UNALIGNED ACCESSES.

What it really does is to ask the compiler to implement a call to memcpy() unless it thinks it can do better and it's willing to do that.

And it doesn't take long to find such broken examples. Just take the link you proposed above (https://godbolt.org/z/6Pb961P66), change the compiler at the top right to "ARM GCC 4.5.4 (linux)", change the options to "-O3 -march=armv7-a", and check the code below "step2":

        mov     r1, r4
        mov     r2, #4
        mov     r0, r6
>>>     bl      memcpy
        ldr     r3, [sp, #4]
        movw    r1, #32382

See that atrocious memcpy() here, right in the most performance-critical path? That's going to turn your public parser into a DoS amplifier.

Another version, gcc-5.4, again for armv7:

.L11:
        ldr     r0, [r2]  @ unaligned
        str     r0, [sp, #4]      @ unaligned
        ldr     r0, [sp, #4]
        add     ip, r0, lr

See ? It performs an unaligned access, stores it into a local variable on the stack, then re-reads it.

Now you could say these are old compilers, but we do support them since they work fine. OK, what about more recent ones? Let's take gcc-13.2.0 on MIPS at -Os:

$LBB16 = .
        move    $5,$16
>>>     jal     memcpy
$LVL11 = .
        addiu   $4,$sp,24

        lw      $3,24($sp)
        nop
$LVL12 = .

Yep, another call to memcpy. You may think "but the gcc guys don't care about MIPS anymore". Then I guess they do care about RISCV-64, don't they? Let's check:

.L13:
        li      a2,4
        mv      a1,s0
        addi    a0,sp,12
>>>     call    memcpy
        lw      a4,12(sp)
        addw    a5,a4,s2
        subw    a4,s3,a4

Ouch! That's why I really don't want to see anyone stuff these nasty memcpy() calls into code that is supposed to be portable. They're always the wrong solution.

The correct portable way to perform unaligned accesses is via the "packed" attribute applied to structures and unions, which the compiler knows how to access for any type. That's what we're doing with our read_u32(), read_u64() etc. functions in https://github.com/haproxy/haproxy/blob/master/include/haproxy/net_helper.h

The code is most often optimal.

You can try yourself with the following code:

#define read_u32(p) ({ \
        const union {  uint32_t u32; } __attribute__((packed))*u = (const void *)p; \
        u->u32; \
})

#define read_u64(p) ({ \
        const union {  uint64_t u64; } __attribute__((packed))*u = (const void *)p; \
        u->u64; \
})

uint32_t rd32_cpy(const char *buf)
{
        uint32_t a;
        memcpy(&a, buf, sizeof(a));
        return a;
}

uint32_t rd32_pck(const char *buf)
{
        return read_u32(buf);
}

uint32_t rd32_cast(const char *buf)
{
        return *(uint32_t *)buf;
}

uint64_t rd64_cpy(const char *buf)
{
        uint64_t a;
        memcpy(&a, buf, sizeof(a));
        return a;
}

uint64_t rd64_pck(char *buf)
{
        return read_u64(buf);
}

uint64_t rd64_cast(const char *buf)
{
        return *(uint64_t *)buf;
}

For example, gcc-13.2 on MIPS, which does provide instructions to perform unaligned accesses by loading partial words:

00000000 <rd32_cpy>:
   0:   3c1c0000        lui     gp,0x0
                        0: R_MIPS_HI16  __gnu_local_gp
   4:   279c0000        addiu   gp,gp,0
                        4: R_MIPS_LO16  __gnu_local_gp
   8:   27bdffd8        addiu   sp,sp,-40
   c:   8f990000        lw      t9,0(gp)
                        c: R_MIPS_CALL16        memcpy
  10:   00802825        move    a1,a0
  14:   afbf0024        sw      ra,36(sp)
  18:   afbc0010        sw      gp,16(sp)
  1c:   27a40018        addiu   a0,sp,24
  20:   0320f809        jalr    t9
                        20: R_MIPS_JALR memcpy
  24:   24060004        li      a2,4
  28:   8fbf0024        lw      ra,36(sp)
  2c:   8fa20018        lw      v0,24(sp)
  30:   03e00008        jr      ra
  34:   27bd0028        addiu   sp,sp,40

00000038 <rd32_pck>:
  38:   88820000        lwl     v0,0(a0)
  3c:   00000000        nop
  40:   98820003        lwr     v0,3(a0)
  44:   03e00008        jr      ra
  48:   00000000        nop

As you can see, the packed version is only two load instructions compared to the horror of the memcpy() at the top. Why didn't the compiler use that instead of memcpy() since it knows how to do it natively? Just because nothing forces it to! You asked it to implement a function call, and it didn't feel in the mood to optimize it. Or maybe it simply did not recognize a less well-known pattern, for whatever reason.

Loading 64 bits with gcc-11.3 for armv7 also shows the packed version doing a much better job:

0000000c <rd64_cpy>:
   c:   b082            sub     sp, #8
   e:   4602            mov     r2, r0
  10:   466b            mov     r3, sp
  12:   6800            ldr     r0, [r0, #0]
  14:   6851            ldr     r1, [r2, #4]
  16:   c303            stmia   r3!, {r0, r1}
  18:   e9dd 0100       ldrd    r0, r1, [sp]
  1c:   b002            add     sp, #8
  1e:   4770            bx      lr

00000020 <rd64_pck>:
  20:   4603            mov     r3, r0
  22:   6800            ldr     r0, [r0, #0]
  24:   6859            ldr     r1, [r3, #4]
  26:   4770            bx      lr

As you can see, the first one performs an inlined memcpy() while the packed version just performs two unaligned loads. It could even be better by dropping the unneeded initial mov.

But there are still some rare corner cases where the compiler will not do as good a job, e.g. here with gcc-4 for i386 at -O0:

00000000 <rd32_cpy>:
   0:   83 ec 2c                sub    $0x2c,%esp
   3:   c7 44 24 08 04 00 00    movl   $0x4,0x8(%esp)
   a:   00 
   b:   8b 44 24 30             mov    0x30(%esp),%eax
   f:   89 44 24 04             mov    %eax,0x4(%esp)
  13:   8d 44 24 1c             lea    0x1c(%esp),%eax
  17:   89 04 24                mov    %eax,(%esp)
  1a:   e8 fc ff ff ff          call   1b <rd32_cpy+0x1b>
                        1b: R_386_PC32  memcpy
  1f:   8b 44 24 1c             mov    0x1c(%esp),%eax
  23:   83 c4 2c                add    $0x2c,%esp
  26:   c3                      ret    

00000027 <rd32_pck>:
  27:   83 ec 10                sub    $0x10,%esp
  2a:   8b 44 24 14             mov    0x14(%esp),%eax
  2e:   89 44 24 0c             mov    %eax,0xc(%esp)
  32:   8b 44 24 0c             mov    0xc(%esp),%eax
  36:   8b 00                   mov    (%eax),%eax
  38:   83 c4 10                add    $0x10,%esp
  3b:   c3                      ret    

0000003c <rd32_cast>:
  3c:   8b 44 24 04             mov    0x4(%esp),%eax
  40:   8b 00                   mov    (%eax),%eax
  42:   c3                      ret    

Again it performs an inlined memcpy() onto the stack. One could argue that it's not dramatic because we're at -O0, but it just shows that it can happen, so when we know we certainly don't want to face this, we prefer to avoid it and rely on the natural and correct way to load a word on that architecture: a plain dereference.

Finally regarding the ifdefs, it's important to understand that architectures are not all equal regarding unaligned accesses. Some are faster but not all. For example, if you take RISCV64 which doesn't support unaligned accesses by default, the unaligned access implemented by gcc-13.2 via the packed method gives this:

000000000000001a <rd32_pck>:
  1a:   00154703                lbu     a4,1(a0)
  1e:   00054783                lbu     a5,0(a0)
  22:   0722                    sll     a4,a4,0x8
  24:   8f5d                    or      a4,a4,a5
  26:   00254783                lbu     a5,2(a0)
  2a:   00354503                lbu     a0,3(a0)
  2e:   07c2                    sll     a5,a5,0x10
  30:   8fd9                    or      a5,a5,a4
  32:   0562                    sll     a0,a0,0x18
  34:   8d5d                    or      a0,a0,a5
  36:   2501                    sext.w  a0,a0
  38:   8082                    ret

All of this just to check 4 chars. Often it's more efficient to check the chars one by one, as some of the operations can be parallelized. Thus we need the ifdefs anyway.

And to finish on this, there's no undefined behavior in performing unaligned accesses; the notion of aligned/unaligned is architecture-dependent. GCC even gives you the info regarding native support:

$ gcc95_glibc231-linux-gnueabi-gcc -dM -E -xc - </dev/null |grep UNA
#define __ARM_FEATURE_UNALIGNED 1

If you look at the following structure, what do you guess the offsets will be?

struct foo {
        char u8;
        unsigned short u16;
        unsigned long long u64;
        unsigned int u32;
} foo;

The answer is: "it depends". It depends on the architecture. On most architectures you'll see 0, 2, 8, 16 respectively, because 64-bit fields are aligned on 64 bits. On i386 you'll get 0, 2, 4, 12, because i386 is fine with an unaligned u64 (historically nothing was ever aligned on this platform before the switch to 32-bit, and I guess that principle was preserved for qwords on 32 bits):

struct foo {
        char                       u8;                   /*     0     1 */

        /* XXX 1 byte hole, try to pack */

        short unsigned int         u16;                  /*     2     2 */
        long long unsigned int     u64;                  /*     4     8 */
        unsigned int               u32;                  /*    12     4 */

        /* size: 16, cachelines: 1, members: 4 */
        /* sum members: 15, holes: 1, sum holes: 1 */
        /* last cacheline: 16 bytes */
};
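
A quick way to check those offsets on any target is a few lines with offsetof() (a minimal sketch):

#include <stdio.h>
#include <stddef.h>

struct foo {
	char u8;
	unsigned short u16;
	unsigned long long u64;
	unsigned int u32;
};

int main(void)
{
	/* prints the member offsets and total size for the current target */
	printf("u8=%zu u16=%zu u64=%zu u32=%zu size=%zu\n",
	       offsetof(struct foo, u8), offsetof(struct foo, u16),
	       offsetof(struct foo, u64), offsetof(struct foo, u32),
	       sizeof(struct foo));
	return 0;
}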

Thus no, it's not undefined, it's architecture-specific, so it's perfectly fine once properly guarded by ifdefs or any other method. You'll find thousands of such accesses in the kernel, they're present in virtually every crypto or hash algorithm, compression tools make heavy use of them, and CPU vendors always end up supporting this for the tremendous gains it provides on network processing.

haproxy-mirror pushed a commit that referenced this issue Apr 24, 2024
In 1.7 with commit 5f10ea3 ("OPTIM: http: improve parsing performance
of long URIs") we improved the URI parser's performance on platforms
supporting unaligned accesses by reading 4 chars at a time in a 32-bit
word. However, as reported in GH issue #2545, there's a bug in the way
the top bytes are checked, as the parser will stop when all 4 of them
are above 7e instead of when one of them is, so certain patterns can be
accepted through if the last ones are all valid. The fix requires to
negate the value but on the other hand it allows to parallelize some of
the tests and fuse the masks, which could even end up slightly faster.

This needs to be backported to all stable versions, but be careful, this
code moved a lot over time, from proto_http.c to h1.c, to http_msg.c, to
h1.c again. Better just grep for "24242424" or "21212121" in each version
to find it.

Big kudos to Martijn van Oosterhout (@kleptog) for spotting this problem
while analyzing that piece of code, and reporting it.
@wtarreau
Member

Now fixed, thanks @kleptog!

@wtarreau wtarreau added status: fixed This issue is a now-fixed bug. 2.0 This issue affects the HAProxy 2.0 stable branch. 2.2 This issue affects the HAProxy 2.2 stable branch. 2.4 This issue affects the HAProxy 2.4 stable branch. 2.6 This issue affects the HAProxy 2.6 stable branch. 2.8 This issue affects the HAProxy 2.8 stable branch. 2.9 This issue affects the HAProxy 2.9 stable branch. and removed status: needs-triage This issue needs to be triaged. labels Apr 24, 2024
@kleptog
Author

kleptog commented Apr 24, 2024

OK, memcpy() was inlined in the cases I tried, but I admit I didn't really dig into it more deeply. My point was more that, at least in this case, the speedup could be made to work on more architectures.

Thanks for the work.
