Permalink
Browse files

gsdx: try to ask GCC to generate not dumb code

Unfortunately it requires at least GCC6. If a nice guy can check the generated code on GCC6.
I don't know clang status.

Here the only example, I have found on the web
https://developers.redhat.com/blog/2016/02/25/new-asm-flags-feature-for-x86-in-gcc-6/

Current generated code in GSTextureCache::SourceMap::Add

    38b3:	bsf    eax,esi
    38b6:	add    esp,0x10
    38b9:	test   esi,esi
    38bb:	jne    387e <GSTextureCache::SourceMap::Add(GSTextureCache::Source*, GIFRegTEX0 const&, GSOffset*)+0x6e>

BSF already set the Z flag when input (esi) is 0. So it would be better
to not put a silly add before the jump and to skip the test operation.
  • Loading branch information...
1 parent 1fbee92 commit 8cf3a83dd735fd90f169d92428b1f047bb9eda7f @gregory38 gregory38 committed Jan 11, 2017
Showing with 9 additions and 3 deletions.
  1. +9 −3 plugins/GSdx/stdafx.h
View
@@ -391,11 +391,17 @@ using namespace stdext;
// http://svn.reactos.org/svn/reactos/trunk/reactos/include/crt/mingw32/intrin_x86.h?view=markup
- __forceinline unsigned char _BitScanForward(unsigned long* const Index, const unsigned long Mask)
+ __forceinline int _BitScanForward(unsigned long* const Index, const unsigned long Mask)
{
- __asm__("bsfl %k[Mask], %k[Index]" : [Index] "=r" (*Index) : [Mask] "mr" (Mask));
-
+#if defined(__GCC_ASM_FLAG_OUTPUTS__) && 0
+ // Need GCC6 to test the code validity
+ int flag;
+ __asm__("bsfl %k[Mask], %k[Index]" : [Index] "=r" (*Index), "=@ccz" (flag) : [Mask] "mr" (Mask));
+ return flag;
+#else
+ __asm__("bsfl %k[Mask], %k[Index]" : [Index] "=r" (*Index) : [Mask] "mr" (Mask) : "cc");
return Mask ? 1 : 0;
+#endif
}
#ifdef __GNUC__

15 comments on commit 8cf3a83

@gregory38
Contributor

@turtleli I'm sure you're a nice guy. If you want to toy with it :) And my condition flag is wrong, I think it ought to be =@ccnz

@turtleli
Member

Code seems worse with the asm flag output stuff.

dc522: bsf    eax,ebx
dc525: setne  dl
dc528: add    esp,0x10
dc52b: movzx  edx,dl
dc52e: test   edx,edx
dc530: jne    dc4f0 <_ZN14GSTextureCache9SourceMap3AddEPNS_6SourceERK10GIFRegTEX0P8GSOffset+0x70>

It might be worth testing __builtin_ffsl().

@gregory38
Contributor

Strange, potentially the ASM inline is badly written.
You can forget the built-in. It does extra operation (behavior isn't the same as x86).

@gregory38
Contributor

Did you try ccz with flags == 0 ?

@turtleli
Member

Generated code is almost the same:

   dc522: bsf    eax,ebx
   dc525: sete   dl
   dc528: add    esp,0x10
   dc52b: movzx  edx,dl
   dc52e: test   edx,edx
   dc530: je     dc4f0 <_ZN14GSTextureCache9SourceMap3AddEPNS_6SourceERK10GIFRegTEX0P8GSOffset+0x70>
@turtleli
Member

The builtin actually seems slightly better for clang (__builtin_ctz(), not __builtin_ffsl() which I wrongly chose earlier). With the inline assembly it does:

mov mem,reg
bsf reg,mem

But with the builtin it does:

tzcnt reg,reg

I don't think we need the _BitScanForward return value in GSTextureCache.cpp, so maybe this will generate better code (I didn't test):

diff --git a/plugins/GSdx/GSTextureCache.cpp b/plugins/GSdx/GSTextureCache.cpp
index 92e5ae33c..be92b1ca3 100644
--- a/plugins/GSdx/GSTextureCache.cpp
+++ b/plugins/GSdx/GSTextureCache.cpp
@@ -1985,17 +1985,17 @@ void GSTextureCache::SourceMap::Add(Source* s, const GIFRegTEX0& TEX0, GSOffset*
 		{
 			list<Source*>* m = &m_map[i << 5];
 
-			unsigned long j;
-
-			while(_BitScanForward(&j, p))
+			do
 			{
+				unsigned long j;
+				_BitScanForward(&j, p);
 				// FIXME: this statement could be optimized to a single ASM instruction (instead of 4)
 				// Either BTR (AKA bit test and reset). Depends on the previous instruction.
 				// Or BLSR (AKA Reset Lowest Set Bit). No dependency but require BMI1 (basically a recent CPU)
 				p ^= 1 << j;
 
 				m[j].push_front(s);
-			}
+			} while (p);
 		}
 	}
 }
@gregory38
Contributor

Ah yes it could be better. Tzcnt is only supported on recent CPU. But hopefully there is a trick to convert it to a BSF with a rep prefix. So yes it could be better.

@turtleli
Member

If the builtin is used, I think the compiler will choose between bsf and tzcnt depending on what CPU is being used - it won't change inline assembly because it assumes the dev knows what they're doing, but it can optimise intrinsics and builtin functions.

@gregory38
Contributor

Yes. I think the compiler will use the same trick anyway. It avoid any check. They said that tzcnt is faster than bsf on AMD. It isn't the hottest loop but I might use BSF also on the erase side.

@gregory38
Contributor

Actually I don't think the do/while loop will be better. You still need to do BSF + the check, whereas both could be done in one shot (It could explain the X86 design). Anyway, it isn't important, gain will be null anyway.

@turtleli
Member

Actually the gcc asm output flag stuff might produce better code if flag is changed to a bool. See https://godbolt.org/g/PKvghz (you can see output from other compilers as well).

@gregory38
Contributor

Ah ah. GCC is picky sometimes. Potentially the best solution will be tzcnt in ASM (executed as BSF on older CPU). However there is a trick I'm not sure they share the same flag semantic. TZCNT seems to use the C for input and Z for output.

@gregory38
Contributor

Hum, for the story, I better understand why some game developer said that the only sane language is asm.

Here the generated code for TEX0.TBP0 >> 5 (I suspect a strange optimization bug on GCC but still).

    38e0:	movzx  eax,BYTE PTR [ebx+0x1]
    38e4:	movzx  edx,BYTE PTR [ebx]
   ...
   ...
    38ed:	and    eax,0x3f
    38f0:	shl    eax,0x8
    38f3:	or     eax,edx
    38f5:	sar    eax,0x5
@turtleli
Member

Oh, it also does that with GCC 6 ;)

clang seems to use bextr on my system:

  10074d:	8b 44 24 48          	mov    eax,DWORD PTR [esp+0x48]
...
...
  10075a:	b9 05 09 00 00       	mov    ecx,0x905
  10075f:	c4 e2 70 f7 00       	bextr  eax,DWORD PTR [eax],ecx

However there is a trick I'm not sure they share the same flag semantic. TZCNT seems to use the C for input and Z for output.

Yeah, I think they're only compatible if you ignore the flags and the input isn't 0. Personally I think the builtin should just be used (or at the very least, for clang).

@gregory38
Contributor

Either there is a story behind union. Or GCC became crazy. I don't know if it could worth to open a GCC bug. I quickly try to reproduce it with your link without success. (Only try a shift of the field). I guess need to put all the function and try to trim it down.

Please sign in to comment.