Improved the codegen of the memcmp intrinsic #3415
Conversation
Well, you used tabs when you should have used spaces.
@@ -3226,7 +3226,24 @@ code *cdmemcmp(elem *e,regm_t *pretregs)

    // Get nbytes into CX
    regm_t retregs3 = mCX;
    c1 = cat(c1,scodelem(e->E2,&retregs3,retregs | retregs1,FALSE));
    bool single = FALSE;
If you're using bool, you might as well use false.
There's probably a fringe condition I missed that is causing that issue, so I'll have to look at the disassembly when I get home; the situation I was targeting generated the correct code. Provided I16 means a 16-bit target: no, this won't currently work there, because I don't have any condition guarding the 32-bit opcode. When you say unaligned memory, you mean the memory locations I'm comparing, not the size of the values I'm comparing, right? I'll look at the memcpy codegen to see what the comments say.
The easiest option is probably to add [...]. Yes, I mean instructions that will issue unaligned loads and stores, e.g. for short unaligned comparisons.
Hmmm... well, I've located the codegen error, and determined that I did indeed cause it. The question is how I managed it, because a constant value should be getting loaded into ECX; instead it's loading EBX, which currently holds a 16-byte string (because someone decided it was a good idea to implement string comparisons in terms of memcmp...).
If you make the load into ECX unconditional (and therefore dead code sometimes), does the problem go away?
I don't know if that would actually have an effect in this case (I'll check anyway when I get home), because ECX is already getting loaded. Here's the ASM of the method that has the issue:

.text:0000000000001950 void main.SecondaryTest() proc near ; DATA XREF: .pdata:$pdata$_D4main13SecondaryTestFZv↑o
.text:0000000000001950 ; .pdata:0000000000001A7C�o
.text:0000000000001950
.text:0000000000001950 var_110 = qword ptr -110h
.text:0000000000001950 var_108 = qword ptr -108h
.text:0000000000001950 var_100 = qword ptr -100h
.text:0000000000001950 var_F8 = qword ptr -0F8h
.text:0000000000001950 var_F0 = byte ptr -0F0h
.text:0000000000001950 var_90 = byte ptr -90h
.text:0000000000001950 var_70 = byte ptr -70h
.text:0000000000001950 var_50 = byte ptr -50h
.text:0000000000001950 var_20 = byte ptr -20h
.text:0000000000001950
.text:0000000000001950 push rbp
.text:0000000000001951 mov rbp, rsp
.text:0000000000001954 sub rsp, 118h
.text:000000000000195B push rbx
.text:000000000000195C push rsi
.text:000000000000195D push rdi
.text:000000000000195E mov rdx, cs:qword_18
.text:0000000000001965 mov rax, cs:_TMP0
.text:000000000000196C mov [rbp+var_100], rax
.text:0000000000001973 mov [rbp+var_F8], rdx
.text:000000000000197A lea rsi, Z std.digest.sha.SHA1.__init
.text:0000000000001981 lea rdi, [rbp+var_F0]
.text:0000000000001988 mov ecx, 0Ch
.text:000000000000198D rep movsq
.text:0000000000001990 lea rcx, [rbp+var_F0]
.text:0000000000001997 sub rsp, 20h
.text:000000000000199B call std.digest.sha.SHA1.start()
.text:00000000000019A0 add rsp, 20h
.text:00000000000019A4 mov rdx, [rbp+var_F8]
.text:00000000000019AB mov rax, [rbp+var_100]
.text:00000000000019B2 mov [rbp+var_110], rax
.text:00000000000019B9 mov [rbp+var_108], rdx
.text:00000000000019C0 lea rdx, [rbp+var_110]
.text:00000000000019C7 lea rcx, [rbp+var_F0]
.text:00000000000019CE sub rsp, 20h
.text:00000000000019D2 call std.digest.sha.SHA1.put(const(ubyte)[]...)
.text:00000000000019D7 add rsp, 20h
.text:00000000000019DB lea rdx, [rbp+var_90]
.text:00000000000019E2 lea rcx, [rbp+var_F0]
.text:00000000000019E9 sub rsp, 20h
.text:00000000000019ED call std.digest.sha.SHA1.finish()
.text:00000000000019F2 add rsp, 20h
.text:00000000000019F6 mov rsi, rax
.text:00000000000019F9 lea rdi, [rbp+var_20]
.text:00000000000019FD movsq
.text:00000000000019FF movsq
.text:0000000000001A01 movsb
.text:0000000000001A02 movsb
.text:0000000000001A03 movsb
.text:0000000000001A04 movsb
.text:0000000000001A05 lea rsi, [rbp+var_20]
.text:0000000000001A09 lea rdi, [rbp+var_70]
.text:0000000000001A0D movsq
.text:0000000000001A0F movsq
.text:0000000000001A11 movsb
.text:0000000000001A12 movsb
.text:0000000000001A13 movsb
.text:0000000000001A14 movsb
.text:0000000000001A15 lea rdx, [rbp+var_70]
.text:0000000000001A19 lea rcx, [rbp+var_50]
.text:0000000000001A1D sub rsp, 20h
.text:0000000000001A21 call std.digest.digest.toHexString!(std.digest.digest.Order 0,ulong 20,std.ascii.LetterCase 0).toHexString(const(ubyte[20]))
.text:0000000000001A26 add rsp, 20h
.text:0000000000001A2A mov rbx, 28h
.text:0000000000001A34 cmp cs:_TMP6, rbx
.text:0000000000001A3B jnz short loc_1A51
.text:0000000000001A3D mov rsi, rax
.text:0000000000001A40 mov rdi, cs:off_58
.text:0000000000001A47 mov rcx, rbx
.text:0000000000001A4A xor eax, eax
.text:0000000000001A4C repe cmpsq
.text:0000000000001A4F jz short loc_1A63
.text:0000000000001A51
.text:0000000000001A51 loc_1A51: ; CODE XREF: main.SecondaryTest()+EB↑j
.text:0000000000001A51 mov ecx, 0Fh
.text:0000000000001A56 sub rsp, 20h
.text:0000000000001A5A call main.__assert(int)
.text:0000000000001A5F add rsp, 20h
.text:0000000000001A63
.text:0000000000001A63 loc_1A63: ; CODE XREF: main.SecondaryTest()+FF↑j
.text:0000000000001A63 pop rdi
.text:0000000000001A64 pop rsi
.text:0000000000001A65 pop rbx
.text:0000000000001A66 lea rsp, [rbp+0]
.text:0000000000001A6A pop rbp
.text:0000000000001A6B retn
.text:0000000000001A6B void main.SecondaryTest() endp
.text:0000000000001A6B

The issue is right above loc_1A51: it moves RBX, which is 0x28 (the number of bytes), into RCX, but we then proceed to do a qword comparison, so it should actually be loading 0x05 into ECX, because that's the number of qwords. Also, I'll eventually update the memcpy intrinsic as well, as it appears to be unrolling the movsq excessively; Intel's manual only recommends unrolling assignments up to 16 bytes (although I may be reading that differently than I should), and above that a rep-prefixed movs instruction should be used. In this particular case, the source and destination are known to be aligned, yet we emit 4 movsb instructions rather than a single movsd instruction (which is valid in 64-bit mode). The 4 movsb instructions are 4x the size of the 1 movsd instruction needed.
        single = e->E2->EV.Vlong == 1;
    }
    if (!single)
        c1 = cat(c1,scodelem(e->E2,&retregs3,retregs | retregs1,FALSE));
So basically your changes to e->E2 are being ignored if e->E2 is a CSE? I'm guessing it's too late to be changing the elems directly; you should probably just tell it to do a constant load instead of going through scodelem when you've picked a new value. Alternatively, add new OPmemcmp16/32/64 ops and pick a multi-byte one from the optimizer.
Yes, that is the case. Modifying the elems directly in the codegen did feel a bit hackish, so the optimizer approach seems like the correct way to do it; now I just have to figure out where the backend's optimizer is hiding...
optabgen.c will tell you the optimizer function for each OP
Hmmm, I didn't mean to push the updated commit quite yet... (I push by default after I commit, but the habit is a bit of a mixed blessing.) I just started the build locally, and I have it running the phobos unittests now, so we'll see which gives me results first.
I ended up taking the constant-loading approach, hopefully I got the codegen for it right.
I broke it: it's generating a load to EDX instead of ECX. I'll poke the code a bit more later.
I don't get an email when you edit your comments, so it's probably better to post a new comment when adding new information.
        single = e->E2->EV.Vlong == 1;

    if (!single)
        c1 = movregconst(c1, retregs3, e->E2->EV.Vlong, 0);
This is generating a load to EDX, even though I've already passed in ECX as the destination. Is there something else I need to do here to make sure it's loading it to ECX?
Ah, never mind, I was using the mask, not the register number.
Alright, this works on my Win64 desktop, both the 32- and 64-bit versions, so this will be fun to figure out the issue with.
Ah, whoops, I forgot to actually commit my changes, so it wasn't building the same code. If the rest of the tests are successful, I'll fix the tabs issue and it should be good.
There is also another optimization I'd like to do in the generated code above for SecondaryTest [...]
Who says the failure case is the less likely case?
In the cases where I'm considering doing the re-ordering, unittests for instance, the intention is for the test to pass, so the failure conditions should not be placed inline in the body; they should be placed outside the main body of the method, jumping back into the main body after the failure is dealt with.

What I'm describing is very similar to the code re-organization that is possible with profile-guided optimization, except that it makes assumptions based on high-level information and the design of the overall language. Exceptions, for instance, are slow and shouldn't be relied on as a mechanism for anything other than error handling; because of this, the expected result (by the language design) is to not throw an exception, so a branch to a block that unconditionally ends by throwing should be weakly hinted, optionally strongly hinted, so that the path the language design intends to be hot actually is.

By "failure case" I meant certain specific, yet common, cases, not the general failure of an if condition. In very hot code locations, such as __array_bounds, this small change to the order of the blocks will reduce, and if strongly hinted completely remove, branch mispredictions except in cases of failure, where the expectation of speed is already gone.
I understand what you're doing, but in the general case it is wrong to assume that memcmp returning zero is the likely case. Unittests and assertions are not a case that should be optimized for at the expense of others.
I wasn't intending to make assumptions on the return of memcmp, but to make assumptions on the fact that assertions should be true.
Riiiiight. I thought you were planning to add it in here as the default case.
Alright, I've replaced the tabs with spaces, removed the extra whitespace change I made, made a minor change so that the value in the actual expression never gets modified, and made I16 only ever emit the byte comparison. It should be possible to omit the rep prefix when only 1 byte is being compared, but, as I don't have a 16-bit machine to test on (nor does the auto-tester), I don't want to make that change and potentially break it.
        break;
    case 2:
        gen1(c3, 0x66);
        gen1(c3, 0xA7);         /* CMPSW */
I have a feeling you can do 0x66A7.
Well, we'll find out in about 5 minutes, when both the build and unittests finish on my local machine.
No, that actually doesn't work; I end up with an internal error in the backend if I try that.
I'm pretty sure no-one cares about optimizing [...]
Alright, fixed the tabs, combining the 2 [...]
Unless I'm wrong, I do believe I've addressed all the issues that have been brought up.
I can't find anywhere else that [...]. I'm not sure I'm comfortable merging backend pull requests; I don't know this code like I know the frontend. I'm considering it.
Well, I wasn't sure how to add the prefix, as the other places I could find where it needed that prefix had a modrm byte that could be modified. If I knew the 'correct' way to do it, I would.
Maybe @WalterBright could help us out?
Since this impacts the 16 bit code generator, I'd need to test this out for that before accepting it.
This actually shouldn't affect the 16-bit generator, provided I understand the usage of I16 correctly.
This optimization will try to use cmpsw, cmpsd, and cmpsq on unaligned data, which will result in terrible performance. The library implementations of memcmp that I've seen first test the operands, then align them using cmpsb, then do the aligned comparison, then finish with cmpsb. In fact, I think this inline asm would be very hard pressed to beat the library memcmp; it may be better to simply call the library memcmp. It's also true that the newer processors are very slow with CMPS; it's really only a win in 16-bit code. My suggestion is to take a look at what, say, gcc generates for these memcmps.
I'm not sure this is true any more, although it certainly was once. A few benchmarks could certainly answer the performance question.
I definitely do need to revisit this PR: firstly to add a check for whether the target CPU is new enough that the unaligned penalty is non-existent (Intel Sandy Bridge or newer), in which case there is no need to deal with the alignment, and also to expand it to use more prominent features, such as MMX for 64-bit comparisons in 32-bit code, SSE for 128-bit comparisons, and AVX for 256-bit comparisons. I'll also look at implementing larger comparisons via SSE4.2 when available.
If you look in cod3.c around line 5917, you can see it's adding the operand size prefix if the CFopsize flag is set. It would be better to do this instead of manually generating the prefix.
ping
I'm going to leave this as is; if anyone wants to do anything with it, I'll leave the branch but close this PR.
This modifies the codegen of the memcmp intrinsic so that it will now generate cmpsw, cmpsd, and cmpsq instructions; it also omits the rep prefix and the rep-count load when only one instruction is needed. This is the first time I'm poking at dmd's backend, so there are probably a few things I don't know, be it coding style or just how to do certain things, so any feedback is appreciated.
There is still another optimization possible by using a cmov instruction, provided it's available on the CPU we're compiling for, but I wanted to make sure I'm going about this the right way before I continue.