Improved the codegen of the memcmp intrinsic #3415

Closed
wants to merge 1 commit

Conversation

@Orvid (Contributor) commented Apr 1, 2014

This modifies the codegen of the memcmp intrinsic so that it now generates cmpsw, cmpsd, and cmpsq instructions; it also omits the rep prefix and the rep-count load when only one compare instruction is needed. This is the first time I've poked at dmd's backend, so there are probably a few things I don't know, be it coding style or how to do certain things, so any feedback is appreciated.
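For illustration, here is a minimal C sketch of the kind of selection logic described above. The names and structure are hypothetical and simplified, not the actual dmd backend code:

#include <stddef.h>

/* Hypothetical helper: pick the widest cmps element size that evenly
 * divides a compile-time-constant byte count, and decide whether the
 * rep prefix and the count load into (E)CX can be dropped entirely. */
static void pick_memcmp_op(size_t nbytes, int is64bit,
                           size_t *elemsize, size_t *count, int *need_rep)
{
    if (is64bit && nbytes % 8 == 0)
        *elemsize = 8;          /* cmpsq */
    else if (nbytes % 4 == 0)
        *elemsize = 4;          /* cmpsd */
    else if (nbytes % 2 == 0)
        *elemsize = 2;          /* cmpsw */
    else
        *elemsize = 1;          /* cmpsb */

    *count = nbytes / *elemsize;
    /* With a count of 1, a bare cmps suffices: no rep prefix and no
     * load of the count register are needed. */
    *need_rep = (*count > 1);
}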

There is still another optimization possible by using a cmov instruction, provided it's available on the CPU that we are compiling for, but I wanted to make sure I'm going about this the right way before I continue.

yebblies commented Apr 1, 2014

Well, you used tabs when you should have used spaces.

@@ -3226,7 +3226,24 @@ code *cdmemcmp(elem *e,regm_t *pretregs)

// Get nbytes into CX
regm_t retregs3 = mCX;
c1 = cat(c1,scodelem(e->E2,&retregs3,retregs | retregs1,FALSE));
bool single = FALSE;
Member
If you're using bool, you might as well use false.

yebblies commented Apr 1, 2014

  • Is this valid for I16?
  • Is this worthwhile when working on unaligned memory? (the comments in the memcpy codegen say no)
  • It doesn't work; see the failing phobos tests.

Orvid commented Apr 1, 2014

There's probably a fringe condition I missed that is causing that issue, so I'll have to look at the disassembly when I get home. The case I was targeting generated the correct code.

Assuming I16 means a 16-bit target, then no, I don't believe this will currently work for it, because I don't have any condition guarding the 32-bit opcode.

When you say unaligned memory, you're talking about the memory locations I'm comparing, not the size of the values I'm comparing, right? I'll look at the memcpy codegen to see what the comments say.

Edit:
After a bit of checking, the comments in the codegen no longer hold on Intel CPUs with the Nehalem microarchitecture or later. Those numbers refer to copying memory rather than comparing it, though, and I still need to find more information on the latter.

yebblies commented Apr 1, 2014

The easiest option is probably to add I16 to the checks that enable this optimization.

Yes, I mean instructions that will issue unaligned loads and stores. E.g. for short unaligned comparisons, repe cmpsb will probably be a lot faster than repe cmpsd, and even more so for the 8-byte instructions.

Orvid commented Apr 1, 2014

Hmmm... Well, I've located the codegen error and determined that I did indeed cause it. The question is how I managed it, because a constant value should be getting loaded into ECX, but instead it's loading EBX, which currently holds a 16-byte string -_- (because someone decided it was a good idea to implement string comparisons in terms of memcmp...).

Edit:
Alright, figured out the issue. As far as I can tell in the backend, the operand is indeed constant, but it gets a register allocated to it because it is used twice in sequence: first when checking that the lengths of the two strings are equal, and second in my code, which stores a specific value but doesn't actually want to emit that value. Because there appear to be multiple codegen passes, I have to restore the value if I modify it, which means the register allocator thinks I need a different value than I really do.

yebblies commented Apr 2, 2014

If you make the load into ECX unconditional (and therefore dead code sometimes) does the problem go away?

Orvid commented Apr 2, 2014

I don't know if that would actually have an effect in this case (I'll check anyway when I get home), because ecx is already getting loaded. Here's the ASM of the method that has the issue:

.text:0000000000001950 void main.SecondaryTest() proc near     ; DATA XREF: .pdata:$pdata$_D4main13SecondaryTestFZv↓o
.text:0000000000001950                                         ; .pdata:0000000000001A7C↓o
.text:0000000000001950
.text:0000000000001950 var_110         = qword ptr -110h
.text:0000000000001950 var_108         = qword ptr -108h
.text:0000000000001950 var_100         = qword ptr -100h
.text:0000000000001950 var_F8          = qword ptr -0F8h
.text:0000000000001950 var_F0          = byte ptr -0F0h
.text:0000000000001950 var_90          = byte ptr -90h
.text:0000000000001950 var_70          = byte ptr -70h
.text:0000000000001950 var_50          = byte ptr -50h
.text:0000000000001950 var_20          = byte ptr -20h
.text:0000000000001950
.text:0000000000001950                 push    rbp
.text:0000000000001951                 mov     rbp, rsp
.text:0000000000001954                 sub     rsp, 118h
.text:000000000000195B                 push    rbx
.text:000000000000195C                 push    rsi
.text:000000000000195D                 push    rdi
.text:000000000000195E                 mov     rdx, cs:qword_18
.text:0000000000001965                 mov     rax, cs:_TMP0
.text:000000000000196C                 mov     [rbp+var_100], rax
.text:0000000000001973                 mov     [rbp+var_F8], rdx
.text:000000000000197A                 lea     rsi, Z std.digest.sha.SHA1.__init
.text:0000000000001981                 lea     rdi, [rbp+var_F0]
.text:0000000000001988                 mov     ecx, 0Ch
.text:000000000000198D                 rep movsq
.text:0000000000001990                 lea     rcx, [rbp+var_F0]
.text:0000000000001997                 sub     rsp, 20h
.text:000000000000199B                 call    std.digest.sha.SHA1.start()
.text:00000000000019A0                 add     rsp, 20h
.text:00000000000019A4                 mov     rdx, [rbp+var_F8]
.text:00000000000019AB                 mov     rax, [rbp+var_100]
.text:00000000000019B2                 mov     [rbp+var_110], rax
.text:00000000000019B9                 mov     [rbp+var_108], rdx
.text:00000000000019C0                 lea     rdx, [rbp+var_110]
.text:00000000000019C7                 lea     rcx, [rbp+var_F0]
.text:00000000000019CE                 sub     rsp, 20h
.text:00000000000019D2                 call    std.digest.sha.SHA1.put(const(ubyte)[]...)
.text:00000000000019D7                 add     rsp, 20h
.text:00000000000019DB                 lea     rdx, [rbp+var_90]
.text:00000000000019E2                 lea     rcx, [rbp+var_F0]
.text:00000000000019E9                 sub     rsp, 20h
.text:00000000000019ED                 call    std.digest.sha.SHA1.finish()
.text:00000000000019F2                 add     rsp, 20h
.text:00000000000019F6                 mov     rsi, rax
.text:00000000000019F9                 lea     rdi, [rbp+var_20]
.text:00000000000019FD                 movsq
.text:00000000000019FF                 movsq
.text:0000000000001A01                 movsb
.text:0000000000001A02                 movsb
.text:0000000000001A03                 movsb
.text:0000000000001A04                 movsb
.text:0000000000001A05                 lea     rsi, [rbp+var_20]
.text:0000000000001A09                 lea     rdi, [rbp+var_70]
.text:0000000000001A0D                 movsq
.text:0000000000001A0F                 movsq
.text:0000000000001A11                 movsb
.text:0000000000001A12                 movsb
.text:0000000000001A13                 movsb
.text:0000000000001A14                 movsb
.text:0000000000001A15                 lea     rdx, [rbp+var_70]
.text:0000000000001A19                 lea     rcx, [rbp+var_50]
.text:0000000000001A1D                 sub     rsp, 20h
.text:0000000000001A21                 call    std.digest.digest.toHexString!(std.digest.digest.Order 0,ulong 20,std.ascii.LetterCase 0).toHexString(const(ubyte[20]))
.text:0000000000001A26                 add     rsp, 20h
.text:0000000000001A2A                 mov     rbx, 28h
.text:0000000000001A34                 cmp     cs:_TMP6, rbx
.text:0000000000001A3B                 jnz     short loc_1A51
.text:0000000000001A3D                 mov     rsi, rax
.text:0000000000001A40                 mov     rdi, cs:off_58
.text:0000000000001A47                 mov     rcx, rbx
.text:0000000000001A4A                 xor     eax, eax
.text:0000000000001A4C                 repe cmpsq
.text:0000000000001A4F                 jz      short loc_1A63
.text:0000000000001A51
.text:0000000000001A51 loc_1A51:                               ; CODE XREF: main.SecondaryTest()+EB↑j
.text:0000000000001A51                 mov     ecx, 0Fh
.text:0000000000001A56                 sub     rsp, 20h
.text:0000000000001A5A                 call    main.__assert(int)
.text:0000000000001A5F                 add     rsp, 20h
.text:0000000000001A63
.text:0000000000001A63 loc_1A63:                               ; CODE XREF: main.SecondaryTest()+FF↑j
.text:0000000000001A63                 pop     rdi
.text:0000000000001A64                 pop     rsi
.text:0000000000001A65                 pop     rbx
.text:0000000000001A66                 lea     rsp, [rbp+0]
.text:0000000000001A6A                 pop     rbp
.text:0000000000001A6B                 retn
.text:0000000000001A6B void main.SecondaryTest() endp
.text:0000000000001A6B

The issue is right above loc_1A51: it moves rbx, which holds 0x28 (the number of bytes), into rcx, but we then proceed to do a qword comparison. It should actually be loading 0x05 into rcx, because that's the number of qwords (0x28 bytes / 8 bytes per qword = 5).

Also, I'll eventually update the memcpy intrinsic as well, as it appears to be unrolling the movsq excessively. Intel's manual only recommends unrolling copies up to 16 bytes (although I may be reading that differently than I should); above that, a rep-prefixed movs instruction should be used. In this particular case, the source and destination are known to be aligned, yet we emit 4 movsb instructions rather than a single movsd instruction (which is valid in 64-bit mode). The 4 movsb instructions are 4x the size of the 1 movsd instruction needed.

single = e->E2->EV.Vlong == 1;
}
if (!single)
c1 = cat(c1,scodelem(e->E2,&retregs3,retregs | retregs1,FALSE));
Member
So basically your changes to e->E2 are being ignored if e->E2 is a CSE? I'm guessing it's too late to be changing the elems directly; you should probably just tell it to do a constant load instead of going through scodelem when you've picked a new value. Alternatively, add new OPmemcmp16/32/64 ops and pick a multi-byte one from the optimizer.

Contributor Author
Yes, that is the case. Modifying the elems directly in the codegen did feel a bit hackish, so it seems to me like the optimizer approach would be the correct way to do it; now I just have to figure out where the backend's optimizer is hiding...

Member
optabgen.c will tell you the optimizer function for each OP

Contributor Author
Hmmm, didn't mean to push the updated commit quite yet (I do it by default after I commit, but the habit is a bit of a mixed blessing). I just started the build locally and have the phobos unittests running now, so we'll see which gives me results first.

Contributor Author
I ended up taking the constant-loading approach, hopefully I got the codegen for it right.

Contributor Author
I broke it; it's generating a load to edx instead of ecx. I'll poke the code a bit more later.

yebblies commented Apr 2, 2014

I don't get an email when you edit your comments, so it's probably better to post a new comment when adding new information.

single = e->E2->EV.Vlong == 1;

if (!single)
c1 = movregconst(c1, retregs3, e->E2->EV.Vlong, 0);
Contributor Author
This is generating a load to EDX, even though I've already passed in ECX as the destination. Is there something else I need to do here to make sure it's loading it to ECX?

Contributor Author
Ah, never mind, I was using the register mask rather than the register number.

Orvid commented Apr 2, 2014

Alright, this works on my win64 desktop, both the 32- and 64-bit versions, so this will be fun to figure out the issue with.

Orvid commented Apr 2, 2014

Ah, whoops, I forgot to actually commit my changes, so it wasn't building the same code. If the rest of the tests pass, I'll fix the tabs issue and it should be good.

Orvid commented Apr 3, 2014

There is also another optimization I'd like to do: in the generated code above for SecondaryTest, loc_1A51 should not be the fall-through case, because it is the failure case and thus less likely to happen. It should instead be placed after the retn, so that the success path does no branching. This can be thought of as a weak branch hint. I would also like to see if I can add strong branch hinting to branches ending in exceptions, though that would likely need an additional option, because there are programs, such as vibe.d, that have to be able to handle the failure cases as well. For those programs, the weak hint provided by the branch re-ordering would still be worth the performance boost for the usual success case, as the CPU would still likely predict correctly when the failure-case branch is taken.

yebblies commented Apr 3, 2014

Who says the failure case is the less likely case?

Orvid commented Apr 3, 2014

In the cases where I'm considering doing the re-ordering, unittests for instance, the intention is that the test passes, so the failure conditions should not be placed inline in the body; they should be placed outside the main body of the method and jump back into it after the failure is dealt with. What I'm describing is very similar to the code re-organization that is possible with profile-guided optimization, except that it makes assumptions based on high-level information and the design of the overall language. Exceptions, for instance, are slow and shouldn't be relied on as a mechanism for anything other than error handling; because of this, the expected result (by the language design) is to not throw an exception, so a branch to a block that unconditionally ends by throwing an exception should be weakly hinted, and optionally strongly hinted, to allow what the language design intends to be the hot code path.

By failure case I was intending to refer to certain specific, yet common, cases, not to the general failure of an if condition.

In very hot code locations, such as __array_bounds, the small change to the order of the blocks will reduce, and, if strongly hinted, completely remove, branch mis-predictions except in cases of failure, where the expectation of speed is already gone.
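As a source-level analogy only (this is GCC/Clang's mechanism, not anything dmd does today), __builtin_expect expresses the same idea: the compiler lays the unlikely block out of the fall-through path, just like the block re-ordering described above. handle_failure is a hypothetical placeholder:

/* Analogy in C: hint which branch is likely so the compiler places the
 * unlikely block out of line, keeping the hot path fall-through only. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int handle_failure(void);           /* hypothetical out-of-line failure path */

int check(int ok)
{
    if (unlikely(!ok))
        return handle_failure();    /* cold: placed away from the hot path */
    return 0;                       /* hot: falls straight through */
}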

yebblies commented Apr 3, 2014

I understand what you're doing, but in the general case it is wrong to assume that memcmp returning zero is the likely case. Unittests and assertions are not a case that should be optimized for at the expense of others.

Orvid commented Apr 3, 2014

I wasn't intending to make assumptions on the return of memcmp, but to make assumptions on the fact that assertions should be true.

yebblies commented Apr 4, 2014

Riiiiight, I thought you were planning to add it in here as the default case.

Orvid commented Apr 4, 2014

Alright, I've replaced the tabs with spaces, removed the extra whitespace change I made, made a minor change so that the value in the actual expression never gets modified, and made I16 only ever emit the byte comparison. It should be possible to also omit the rep prefix for I16 when only 1 byte is being compared, but as I don't have a 16-bit machine to test on (nor does the auto-tester), I don't want to make the change and potentially break it.

break;
case 2:
gen1(c3, 0x66);
gen1(c3, 0xA7); /* CMPSW */
Member
I have a feeling you can do 0x66A7.

Contributor Author
Well, we'll find out in about 5 minutes, when both the build and unittests finish on my local machine.

Contributor Author
No, that actually doesn't work; I end up with an internal error in the backend if I try it.

yebblies commented Apr 4, 2014

I'm pretty sure no-one cares about optimizing I16.

Orvid commented Apr 4, 2014

Alright, fixed the tabs. Combining the 2 gen1 calls produced an ICE, so I left that as-is.

Orvid commented Apr 7, 2014

Unless I'm wrong, I do believe I've addressed all the issues that have been brought up.

yebblies commented Apr 9, 2014

I can't find anywhere else that gen1 is used to add a prefix to an instruction. I suspect that means there is another preferred way to do it, possibly by setting flags on the code object.

I'm not sure I'm comfortable merging backend pull requests - I don't know this code like I know the frontend. I'm considering it.

Orvid commented Apr 9, 2014

Well, I wasn't sure how to add the prefix, as the other places I could find that needed it had a modrm byte that could be modified. If I knew the 'correct' way to do it, I would.

yebblies commented Apr 9, 2014

Maybe @WalterBright could help us out?

@WalterBright (Member) commented

Since this impacts the 16 bit code generator, I'd need to test this out for that before accepting it.

@Orvid
Copy link
Contributor Author

Orvid commented Apr 10, 2014

This actually shouldn't affect the 16-bit generator, provided I understand the usage of I16 correctly.

@WalterBright (Member) commented

This optimization will try to use cmpsw, cmpsd, and cmpsq on unaligned data. This will result in terrible performance. The library implementations of memcmp that I've seen will first test the operands, then align them using cmpsb, then do the aligned comparison, then finish with cmpsb.
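For reference, a rough C sketch of the strategy described above: byte compares to reach alignment, word-sized compares in the middle, byte compares for the tail. This is illustrative only, not a tuned library implementation:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: align on the first operand with byte compares, compare
 * word-sized blocks while they match, then finish byte by byte. */
int memcmp_sketch(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = a, *pb = b;

    /* Head: byte compares until pa is word-aligned (or n runs out). */
    while (n && ((uintptr_t)pa % sizeof(size_t)) != 0)
    {
        if (*pa != *pb)
            return *pa - *pb;
        ++pa, ++pb, --n;
    }

    /* Middle: word-at-a-time compares on the aligned portion. */
    while (n >= sizeof(size_t))
    {
        size_t wa, wb;
        memcpy(&wa, pa, sizeof wa);     /* avoids strict-aliasing issues */
        memcpy(&wb, pb, sizeof wb);
        if (wa != wb)
            break;                      /* let the byte loop find the order */
        pa += sizeof(size_t), pb += sizeof(size_t), n -= sizeof(size_t);
    }

    /* Tail (and any mismatching word): byte compares give the result. */
    while (n)
    {
        if (*pa != *pb)
            return *pa - *pb;
        ++pa, ++pb, --n;
    }
    return 0;
}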

In fact, I think this inline asm would be very hard pressed to beat the library memcmp; it may be better to simply call the library memcmp.

It's also true that the newer processors are very slow with CMPS - it's really only a win in 16 bit code. My suggestion is to take a look at what, say, gcc generates for these memcmp's.

@yebblies (Member) commented

> It's also true that the newer processors are very slow with CMPS - it's really only a win in 16 bit code.

I'm not sure this is true any more, although it certainly was once.

A few benchmarks could certainly answer the performance question.

Orvid commented Apr 25, 2014

I definitely need to revisit this PR: first to add a check for whether the target CPU is new enough that the unaligned penalty is negligible (Intel Sandy Bridge or newer), in which case there is no need to deal with alignment, and also to expand it to use wider operations, such as MMX for 64-bit comparisons on 32-bit targets, SSE for 128-bit comparisons, and AVX for 256-bit comparisons. I'll also look at implementing larger comparisons via SSE4.2 when it's available.
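As one concrete illustration of the SSE direction, a 16-byte equality check can be done with two unaligned loads and a byte-wise compare. This intrinsic-based sketch is only an assumption about the shape of such code, not what the backend would actually emit:

#include <emmintrin.h>   /* SSE2 */

/* Compare two 16-byte blocks for equality. Unaligned loads are used,
 * which is cheap on Sandy Bridge and later, as noted above. */
static int equal16(const void *a, const void *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);        /* 0xFF in lanes that match */
    return _mm_movemask_epi8(eq) == 0xFFFF;     /* all 16 lanes matched */
}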

yebblies commented Jun 1, 2014

If you look in cod3.c around line 5917, you can see it's adding the operand size prefix if the CFopsize flag is set. It would be better to do this instead of manually generating the prefix.
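If that works here, the CMPSW case might look roughly like this instead of emitting the 0x66 byte by hand. This is only a sketch: CFopsize and Iflags exist in the backend, but exactly how the flag is attached to the freshly generated instruction (code_last here) is an assumption:

/* Sketch: emit CMPS as 0xA7 and mark it with CFopsize so the emitter
 * adds the 0x66 operand-size prefix, rather than gen1(c3, 0x66). */
gen1(c3, 0xA7);                         /* CMPSW once the prefix is applied */
code_last(c3)->Iflags |= CFopsize;      /* hypothetical way to set the flag  */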

quickfur commented Aug 7, 2014

ping

Orvid commented Aug 11, 2014

I'm going to leave this as is; if anyone wants to do anything with it, I'll leave the branch, but I'm closing this PR.

@Orvid closed this Aug 11, 2014