(Prototype) q1_0 nrc = 2 and diabolic tiles branches #21
pl752 wants to merge 9 commits into PrismML-Eng:master
Hi @pl752, impressive. Initially the tps gain is not visible, since my test prompt does not have a long context. Have you considered a 4x4
@zcattacz I was experimenting with repacking and other shapes of the 2x2 dot for AVX2, though not very successfully; a 4x4 dot is likely my next target.
Most likely I will implement 4x1, 8x1, 4x2 or other kernel shapes outside the default mul_mat.
Some points from AI's analysis that seem to make sense:
Also, I have found the reason for the lower-than-usual results: I somehow missed that the later tests were run with
Also cooking
I have finished trying various tile shapes (the final forms used are 1x1, 2x2, 2x1, 4x2 and 4x4); large tiles are only used where that is reasonable given register counts and memory bandwidth limits. The resulting code is pretty cursed/diabolic (of course it is vibe coded and won't go anywhere near mainline), but it seems to more or less max out my CPU, barring other significant refinements. Results since nrc=2 are as follows (SSSE3 was not affected code-wise); most of the benefit comes from AVX-512 (4x4) in pp and AVX2/AVX-512 (2x1) in tg:
I have also tried other shapes (larger, wider or longer) and didn't obtain notable improvements.
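The register-count argument can be made concrete with a rough estimate. The sketch below is a simplified model and not the code from this PR: it assumes one accumulator register per tile cell plus one register per row operand and per column operand, and compares that with the 16 vector registers available to AVX2 and the 32 available to AVX-512 in 64-bit mode. The helper names `tile_regs`, `tile_reuse` and `tile_fits` are hypothetical.

```c
#include <stdbool.h>

/* Rough lower bound on vector registers for a rows x cols dot tile:
 * rows*cols accumulators plus rows + cols registers for the current
 * operands. Assumption: real kernels need extra scratch registers
 * (masks, constants, unpack temporaries) on top of this. */
static int tile_regs(int rows, int cols) {
    return rows * cols + rows + cols;
}

/* Multiply-accumulates per operand load in one k-step:
 * rows*cols products are formed from rows + cols loads. */
static double tile_reuse(int rows, int cols) {
    return (double)(rows * cols) / (double)(rows + cols);
}

/* True if the lower bound fits in a register file of nregs registers
 * (16 for AVX2 ymm, 32 for AVX-512 zmm in 64-bit mode). */
static bool tile_fits(int rows, int cols, int nregs) {
    return tile_regs(rows, cols) <= nregs;
}
```

Under this model a 4x4 tile needs at least 24 registers, so it only fits the 32-register AVX-512 file, while 4x2 (at least 14) still fits within the 16 registers of AVX2. Operand reuse grows from 0.5 for 1x1 to 2.0 for 4x4, which is one way to read why the larger tiles help pp most.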
Pretty much a direct continuation of #10.
Vibe coded prototype (just a proof of concept, needs refining) of nrows = 2 branches for x86 SIMD.
It yields significant PP improvements because it allows better utilization of memory bandwidth (hot y operand, higher compute density).
I also think it is worth trying to extend ARM NEON with nrows = 2.
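To illustrate why processing two rows at once keeps operands hot, here is a minimal scalar sketch. It is not the q1_0 kernel: it assumes plain `int8_t` vectors instead of the real quantized block layout, and the function names `dot_1row` and `dot_2x2` are hypothetical. The point is only that each loaded value feeds two accumulators instead of one.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline: one output per call; y is re-read for every row of x. */
static int32_t dot_1row(const int8_t *x, const int8_t *y, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)x[i] * (int32_t)y[i];
    }
    return acc;
}

/* 2x2 tile: two rows of x against two columns of y in one pass.
 * Four loads per step feed four multiply-accumulates, so every
 * loaded element is used twice while it is still in a register.
 * out = { x0.y0, x0.y1, x1.y0, x1.y1 } */
static void dot_2x2(const int8_t *x0, const int8_t *x1,
                    const int8_t *y0, const int8_t *y1,
                    size_t n, int32_t out[4]) {
    int32_t a00 = 0, a01 = 0, a10 = 0, a11 = 0;
    for (size_t i = 0; i < n; i++) {
        const int32_t xa = x0[i], xb = x1[i]; /* loaded once per step */
        const int32_t ya = y0[i], yb = y1[i]; /* loaded once per step */
        a00 += xa * ya;
        a01 += xa * yb;
        a10 += xb * ya;
        a11 += xb * yb;
    }
    out[0] = a00;
    out[1] = a01;
    out[2] = a10;
    out[3] = a11;
}
```

Computing the same four outputs with `dot_1row` takes eight loads per step instead of four; a SIMD version of the tile would replace each scalar with a vector register.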
Benchmark tables (pp512 and tg128) for SSSE3, AVX, AVX+F16C, AVX2+FMA and AVX512BW.
Also, for some reason the AVX-512 optimizations consistently hurt PP performance for nrows = 2, and the results are sometimes inconsistent.
The code for these branches is enormous and most likely suboptimal, so suggestions are welcome; register spills occur, of course.
The funny part is that I have tried iterating on the AVX2 prototype but haven't managed to achieve any improvements.
I have also tried altering the tile geometry to use rectangular blocks, because of the significant asymmetry in operand sizes, as was attempted in #4 by @Marxist-Leninist; this yields some changes, but the results are inconclusive.
Table columns: blck_0, pp512, tg128, pp512, tg128