Super FX speed test #340

Open
paulb-nl opened this issue Jun 16, 2022 · 12 comments

@paulb-nl
Contributor

Here is a test that measures how long it takes for an instruction to complete. It counts in a loop until the SFX is stopped, so higher numbers mean the instruction took longer. Small differences don't matter much. One 21MHz cycle corresponds to a difference of around $66 (102) loops. For example, nop gives $0134 loops in 21MHz cache mode, which is 1 cycle; add # is 2 cycles and gives $0199 loops.
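To make the loop-count arithmetic concrete, here is a minimal Python sketch that turns two measured counts into an estimated cycle difference. The function name and the rounding are my own assumptions; only the $66 constant and the two sample counts come from the description above.

```python
# Approximate loop-count difference caused by one 21MHz Super FX cycle,
# as quoted in the issue text ($66 = 102 loops).
LOOPS_PER_21MHZ_CYCLE = 0x66


def extra_cycles_21mhz(count: int, baseline: int) -> int:
    """Estimate how many 21MHz cycles separate two loop counts (rounded)."""
    return round((count - baseline) / LOOPS_PER_21MHZ_CYCLE)


nop_count = 0x0134  # nop, 21MHz, cache on (1 cycle)
add_count = 0x0199  # add #, 21MHz, cache on (2 cycles)

# add # should come out one cycle longer than nop.
print(extra_cycles_21mhz(add_count, nop_count))
```

Rounding absorbs the small run-to-run jitter in the counts; it would hide sub-cycle effects such as the PLOT results, which need finer resolution.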

It can be run on an original Super FX cart by swapping the cartridge while the console is on. The code runs in WRAM on SNES and Cart RAM/Cache on Super FX.

Here are reference captures of a StarFox cart (Mario Chip), Stunt Race FX (GSU1) and Yoshi's Island (GSU2)
https://drive.google.com/drive/folders/15ac9U-x__n0AgOlWa3FGo5eEMShZYl5g?usp=sharing

The Mario Chip (v1) is unstable when reading/writing Cart RAM. Some tests time out, which doesn't happen with the GSU chips.

Another difference with the Mario Chip: on the GSU the cache opcode takes effect immediately, while the Mario Chip seems to need 16 bytes to fill first, so not all instructions are faster in this test with the StarFox cart.

The ljmp instruction is also quite weird. It takes much longer on the GSU chip than on Mario Chip. Not sure what's going on there.

With cache off, the MiSTer core runs faster at 10MHz than at 21MHz, which is strange.

Buttons:
Left/Right: Switch to different tests
Select: Toggle 10/21MHz
Y: Toggle High speed multiplier
B: Toggle Cache

MiSTer captures:
sfx_test_MiSTer_captures.zip
https://drive.google.com/drive/folders/1noo2pRPoexCtVPgqSbzaexr61WOvjHNW?usp=sharing

Test rom:
SuperFX.sfc.zip

Source:
https://github.com/paulb-nl/sfx_speed_test

@FitzRoyX

Nice test! Would it be possible to color the text green/red to reflect correct/incorrect based on the hw results?

@paulb-nl
Contributor Author

That would get a bit complicated with the 2 different chip versions and 8 setting combinations. It is also hard to decide whether a small difference is acceptable, because some tests like the plot tests need to be accurate to 1/8th of a cycle. The cycle count can change after 8 plot instructions because then it will write the pixel data to RAM.

@paulb-nl
Contributor Author

Here is a comparison of some differences. These are not all the differences but I think it is enough for now :)

The cycles mentioned below with the 10MHz tests are 10MHz cycles so 1 cycle = 2x 21MHz cycles.

MiSTer vs Stunt Race FX (GSU1):

10MHz, MS0, No cache:
Everything is too fast.
NOP $72F-$4C8 = $267 = 3 cycles too fast
ADC # (2 NOPS) $993-$660 = $333 = 4 cycles too fast
MiSTer NOP vs 2 NOPS $660 - $4C8 = $198 = 2 cycles
GSU NOP vs 2 NOPS $993 - $72F = $264 = 3 cycles

10MHz, MS0, Cache on
FMULT $8C9 - $7F9 = $D0 = 1 cycle too fast
GETB* $7FC - $662 = $19A = 2 cycles too fast
GETB_2 $730 - $595 = $19B = 2 cycles too fast
LDB $663 - $595 = $CE = 1 cycle too fast
LDW $730 - $661 = $CF = 1 cycle too fast
LM $994 - $8C5 = $CF = 1 cycle too fast
LMS $8C7 - $7F8 = $CF = 1 cycle too fast
LMULT $994 - $8C5 = $CF = 1 cycle too fast
SBK $4CB - $3FE = $CD = 1 cycle too fast
SM $663 - $595 = $CE = 1 cycle too fast
SMS $597 - $4C9 = $CE = 1 cycle too fast
STW $4CB - $3FD = $CE = 1 cycle too fast

10MHz, MS1, Cache on
FMULT $598 - $4C9 = $CF = 1 cycle too fast
LMULT $663 - $595 = $CE = 1 cycle too fast

10MHz PLOT, Cache on
PLOT 4 color: $267 - $29A = -$33 = 0.25 cycles too slow (2 cycles every 8 plots?)
PLOT 16 color: $266 - $2FE = -$98 = 0.75 cycles too slow (6 cycles every 8 plots?)
PLOT 256 color: $280 - $3CA = -$14A = 1.625 cycles too slow (13 cycles every 8 plots?)

The PLOT -> LOOP -> NOP loop takes 3 cycles, so 8 plots take 8x3 = 24 cycles. That is enough time to save the secondary pixel cache to RAM for 4 & 16 color data without waiting, so PLOT should only take 1 cycle. For 256 color, PLOT is 0.125 cycles slower ($280 vs $266), so it seems to wait 1 cycle every 8 plots.
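The "cycles every 8 plots" figures can be reproduced with a short Python sketch. It assumes one 10MHz cycle is worth 2 x $66 loops (from the 21MHz figure given earlier in the thread); the function name is hypothetical.

```python
# One 10MHz cycle equals two 21MHz cycles, i.e. about 2 * $66 = 204 loops.
LOOPS_PER_10MHZ_CYCLE = 2 * 0x66


def extra_cycles_per_8_plots(measured: int, reference: int) -> int:
    """Loop-count difference expressed as whole 10MHz cycles per 8 PLOTs."""
    cycles_per_plot = (measured - reference) / LOOPS_PER_10MHZ_CYCLE
    return round(cycles_per_plot * 8)


# MiSTer counts versus the Stunt Race FX (GSU1) reference, 10MHz, cache on.
print(extra_cycles_per_8_plots(0x29A, 0x267))  # 4 color
print(extra_cycles_per_8_plots(0x2FE, 0x266))  # 16 color
print(extra_cycles_per_8_plots(0x3CA, 0x280))  # 256 color

# GSU1 256 color versus GSU1 16 color: the wait seen on real hardware.
print(extra_cycles_per_8_plots(0x280, 0x266))
```

Scaling by 8 before rounding is what makes the 1/8th-cycle resolution of the PLOT tests visible as whole cycles.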

PLOT with color #$FC should be treated as no-plot in 4 color transparent mode since low 2 bits are zero.

21MHz, MS0, No cache
FMULT $AC5 - $7F8 = $2CD = 7 cycles too fast
GETB* $CC4 - $BF4 = $D0 = 2 cycles too fast
GETB_2 $AC6 - $9F6 = $D0 = 2 cycles too fast
LDB $A60 - $C5A = -$1FA = 5 cycles too slow
LDW $9F9 - $C5A = -$261 = 6 cycles too slow
LM $FF4 - $1253 = -$25F = 6 cycles too slow
LMS $DF6 - $1055 = -$25F = 6 cycles too slow
LMULT $CC4 - $9F6 = $2CE = 7 cycles too fast
MULT $861 - $7F8 = $69 = 1 cycle too fast
SBK $BF8 - $C5A = -$62 = 1 cycle too slow
SM $FF4 - $1055 = -$61 = 1 cycle too slow
SMS $DF6 - $E58 = -$62 = 1 cycle too slow
STW $9F9 - $A5C = -$63 = 1 cycle too slow
UMULT $861 - $7F8 = $69 = 1 cycle too fast

21MHz, MS1, No cache
FMULT $92D - $7F8 = $135 = 3 cycles too fast
LMULT $B2B - $9F6 = $135 = 3 cycles too fast

21MHz, MS0, Cache on
FMULT $466 - $3FD = $69 = 1 cycle too fast
GETB* $4CB - $3FE = $CD = 2 cycles too fast
GETB_2 $465 - $397 = $CE = 2 cycles too fast
LDW $531 - $595 = -$64 = 1 cycle too slow
LM $663 - $6C7 = -$64 = 1 cycle too slow
LMS $5FD - $661 = -$64 = 1 cycle too slow
LMULT $4CB - $463 = $68 = 1 cycle too fast
SBK $3FF - $463 = -$64 = 1 cycle too slow
SM $4CB - $52F = -$64 = 1 cycle too slow
SMS $465 - $4C9 = -$64 = 1 cycle too slow
STW $3FF - $463 = -$64 = 1 cycle too slow

21MHz, MS1, Cache on
FMULT $2CD - $3FD = -$130 = 3 cycles too slow
LMULT $332 - $463 = -$131 = 3 cycles too slow

21MHz PLOT Cache on
PLOT 4 color: $134 - $19B = -$67 = 1 cycle too slow (8 cycles every 8 plots?)
PLOT 16 color: $133 - $218 = -$E5 = 2.25 cycles too slow (18 cycles every 8 plots?)
PLOT 256 color: $20C - $317 = -$10B = 2.625 cycles too slow (21 cycles every 8 plots?)

@sorgelig
Member

If I remember right, the GSU code was written as a functional analog, not a cycle-accurate one. So most likely it needs to be reworked for cycle accuracy.

@paulb-nl
Contributor Author

With this list it may seem that not much is accurate, but many of the instructions in 21MHz mode (and 10MHz with cache) are accurate.

Almost all of the inaccurate instructions either read/write ROM/RAM or use the multiplier.

@srg320
Collaborator

srg320 commented Aug 15, 2022

Fixed some timings. I do not yet understand the logic of the rpix and ljmp instructions.

@birdybro
Member

Some ljmp and rpix info for quick reference:

from https://en.wikibooks.org/wiki/Super_NES_Programming/Super_FX_tutorial#Instruction_Set

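| Instruction | Description | ALT (hex) | Code (hex) | ARG | Length (B) | B | ALT1 | ALT2 | O/V | S | CY | Z | ROM | RAM | Cache | Classification | Note |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LJMP | Long jump | 3D | 0x9 | Rn | 2 | 0 | 0 | 0 | / | / | / | / | 6 | 6 | 2 | Jump, Branch and Loop Instructions | |
| RPIX | Read pixel color | 3D | 0x4C | / | 2 | 0 | 0 | 0 | / | * | / | * | 24-80 | 24-78 | 20-74 | Plot/related instructions | |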

ROM/RAM/Cache columns are execution time in cycles.

LJMP seems pretty tight. o_O

@paulb-nl
Contributor Author

Thanks @srg320. I have some findings.

RAM_CYCLES for 10MHz should be "010" instead of "001". Otherwise RAM is accessed in only 2 cycles instead of 3.

    elsif SPEED = '0' then
        RAM_CYCLES := "001";

4-color transparency should only check the lower 2 bits so this should be added: if COLR(1 downto 0) /= "00"

    elsif SCMR_MD /= "11" or POR_FH = '1' then
        if COLR(3 downto 0) /= "0000" then
            PLOT_EXEC <= '1';
        end if;
    else
        if COLR /= "00000000" then
            PLOT_EXEC <= '1';
        end if;
    end if;
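As a rough behavioral model of the proposed fix (not the core's actual code), the transparency decision could be sketched in Python as below. The mode encoding (0b00 for 4 color, 0b11 for 256 color) is my assumption, and the branch that precedes the quoted VHDL is not shown in the thread, so it is not modeled here.

```python
def plot_exec(colr: int, scmr_md: int, por_fh: bool) -> bool:
    """Return True if PLOT should write the pixel when transparency applies.

    scmr_md encoding is assumed: 0b00 = 4 color, 0b11 = 256 color,
    anything else is treated like 16 color.
    """
    if scmr_md == 0b00:
        # Proposed addition: 4 color mode only looks at the low 2 bits.
        return (colr & 0x03) != 0
    if scmr_md != 0b11 or por_fh:
        # 16 color, or 256 color with the high nibble frozen: low 4 bits.
        return (colr & 0x0F) != 0
    # 256 color: all 8 bits.
    return (colr & 0xFF) != 0


# Color $FC has its low 2 bits clear, so it is a no-plot in 4 color mode
# but still plots in 16 color mode.
print(plot_exec(0xFC, 0b00, False))
print(plot_exec(0xFC, 0b01, False))
```

How POR_FH should interact with 4 color mode is not stated in the thread; this sketch simply checks the 4 color case first.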

I did some tests to figure out the PLOT pixel cache save logic:
PLOT saves the pixel cache to RAM right after the 8th PLOT if the cache is full, not at the 9th PLOT.

When code is executing from ROM or Cache and the pixel cache is being saved to RAM, an STB or STW instruction that writes to RAM pauses the pixel cache save; the save continues after the RAM write buffer is finished. This is probably the same for the other instructions that use the RAM write buffer, like SM, SMS and SBK.

For example, executing the loop STB->PLOT->LOOP->NOP only takes 5 cycles at 10MHz because it doesn't wait for the RAM writes. It must be interrupting the pixel cache save at the end of writing a byte, because otherwise both pixel caches would fill up and PLOT would go into a wait state.

Here are some test roms. sfx_stb will use STB to write to RAM while the pixel cache is writing to RAM and reads the values after the SFX is stopped. The value $FF means the pixel cache write has overwritten the data written by STB. There is a cache instruction before the STB writes so you can ignore the NO CACHE text in the test rom.

sfx_speed_test_stb_plot has some tests removed to make room for two STB/STW PLOT speed tests. The result of the STB PLOT test at 10MHz with Cache On is $3FE-$400 for 4, 16 & 256 color. This is only 2 cycles more than the PLOT tests, and STB is a 2-cycle opcode, so that means it didn't wait.

sfx_stb.zip
sfx_speed_test_stb_plot.zip

Reference captures:
sfx_speed_test_StuntRaceFx_10MHz_plot_cache_stb
sfx_speed_test_StuntRaceFx_10MHz_plot_stb
sfx_speed_test_StuntRaceFx_21MHz_plot_cache_stb
sfx_speed_test_StuntRaceFx_21MHz_plot_stb

sfx_stb_StuntRaceFx_10Mhz
sfx_stb_StuntRaceFx_21Mhz

@srg320
Collaborator

srg320 commented Aug 16, 2022

> I did some tests to figure out the PLOT pixel cache save logic:
> PLOT will save the pixel cache to RAM after 8 PLOTS if it is full. Not at 9th PLOT.

That's interesting. Thanks.

> If executing from ROM or Cache and the pixel cache is being saved to RAM and it executes an STB or STW instruction to write to RAM then the pixel cache save is paused and continues after the RAM write buffer is finished. This is probably the same for the other instructions that use the RAM write buffer like SM, SMS, SBK.
> For example executing the loop STB->PLOT->LOOP->NOP will only take 5 cycles @ 10MHz because it doesn't wait for the RAM writes. It must be interrupting the pixel cache save at the end of writing a byte because otherwise both pixel caches would fill up and PLOT would go into wait state.

I agree: executing a RAM write instruction does not stall the following instructions until another RAM access appears. This is implemented in the core in the last commit.

@srg320
Collaborator

srg320 commented Aug 16, 2022

I am also interested in the ROM access time while the cache is being loaded. I suspect it is shorter than the time to load a single byte from ROM.

@paulb-nl
Contributor Author

The tests on the first page at 21MHz with Cache on all seem to be fixed. The plot tests also look good. 21MHz without Cache and 10MHz still need to be fixed.

However, the latest fixes made everything executing from ROM at 21MHz 2 cycles too slow: 7 cycles per byte instead of 5. I have attached a test rom that runs the SFX code from ROM. Most results without cache should match the version that runs from Cart RAM, except for instructions that access RAM/ROM. For example, PLOT without cache should be faster when executing from ROM than from RAM.

SuperFX_rom.sfc.zip

Unfortunately I am unable to make reference captures for the ROM versions because that would need a modified Super FX cartridge.

> I agree, executing an any RAM write instructions do not stop the queue of next instructions until any RAM access appears. And this is implemented in the core in last commit.

Ok but I meant the RAM write buffer will have priority and will pause the pixel cache write. I will give an example from my test:

    ibt R0, #$34
    iwt R3, #$1031

    plots 7
    cache
    plot ; 8th plot, start pixel cache write (256-color 8 bytes)
    
    stb (R3) ;  pause pixel cache write, RAM buffer will write $34 to $701031
    inc R0

    ; pixel cache will overwrite $701031 ($34) with $FF
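The ordering described in this example can be illustrated with a small Python model. This is only a sketch of the claimed priority rule: the contiguous target addresses, the $FF fill value for the saved pixel data, and the function name are all assumptions, and the real bitplane layout in Cart RAM differs.

```python
FILL = 0xFF  # assumed: every byte saved by the pixel cache is $FF


def read_back(save_addrs, bytes_saved_before_stb, stb_addr=0x1031, stb_value=0x34):
    """Model the RAM byte found at stb_addr after the SFX has stopped."""
    ram = {}
    # Part of the pixel cache save that completed before STB took the bus.
    for addr in save_addrs[:bytes_saved_before_stb]:
        ram[addr] = FILL
    # The RAM write buffer has priority: the STB data is written now.
    ram[stb_addr] = stb_value
    # The paused pixel cache save resumes with its remaining bytes.
    for addr in save_addrs[bytes_saved_before_stb:]:
        ram[addr] = FILL
    return ram[stb_addr]


# Hypothetical contiguous save addresses for the 8 bytes of a 256 color row.
addrs = list(range(0x1030, 0x1038))
print(hex(read_back(addrs, 1)))  # 0xff: the resumed save overwrote the STB byte
print(hex(read_back(addrs, 2)))  # 0x34: the save had already passed $1031
```

Reading the byte back after the SFX stops therefore tells you whether the save had already passed that address when the STB write went through, which is the idea behind checking for $FF in the sfx_stb test.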

> I am also interested in the ROM access time when the cache is loaded. I suspect that this time is faster than the time to load byte from ROM.

Which ROM access do you mean? As far as I know, ROM access is the same as RAM: 3 cycles at 10MHz and 5 cycles at 21MHz. The GETB instructions test ROM reading, so we know what the results should be.

@srg320
Collaborator

srg320 commented Aug 20, 2022

> Ok but I meant the RAM write buffer will have priority and will pause the pixel cache write. I will give an example from my test:
>
>     ibt R0, #$34
>     iwt R3, #$1031
>
>     plots 7
>     cache
>     plot ; 8th plot, start pixel cache write (256-color 8 bytes)
>
>     stb (R3) ;  pause pixel cache write, RAM buffer will write $34 to $701031
>     inc R0
>
>     ; pixel cache will overwrite $701031 ($34) with $FF

Ok. I wonder what the result would be if you added one or two NOPs before stb (R3).

> As far as I know ROM access is the same as RAM. 3 cycles at 10MHz and 5 cycles at 21MHz. The GETB instructions test ROM reading so we know what the results should be.

This test shows that for the Load/Store Word instructions, the second (MSB) RAM access is 1 cycle shorter. Perhaps when loading the cache (16 bytes of sequential access), the access time is less than 5 cycles (some kind of burst mode).

srg320 referenced this issue Sep 1, 2022