JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path #5343

Merged: Sonicadvance1 merged 3 commits into FEX-Emu:main from pmatos:f64-sin-cos-tan on Mar 20, 2026

@pmatos
Collaborator

@pmatos pmatos commented Mar 3, 2026

No description provided.

@pmatos
Collaborator Author

pmatos commented Mar 3, 2026

I didn't manage to replicate the exact algorithm in the AdvSIMD routines due to a lack of registers, but I managed to rewrite it in a similar form using the available registers and got some good results:

  │ Operation │ vs ABI fallback │ vs Softfloat 80-bit │
  │ sin       │ 2.1x faster     │ 11.8x faster        │
  │ cos       │ 2.9x faster     │ 15.6x faster        │
  │ tan       │ 1.8x faster     │ 11.2x faster        │
  │ sincos    │ 2.4x faster     │ 25.3x faster        │

I am going to try to see if I can get similar results by jitting other f64 operations.
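
The emitted code is not shown in this thread. As a rough illustration of the general shape such a routine tends to have (reduce the argument to a small interval, then evaluate an odd polynomial), here is a minimal Python sketch. It rests on several assumptions: `sin_sketch` is a hypothetical name, the coefficients are plain Taylor terms rather than tuned ones, and the single-constant reduction is only adequate for moderate arguments. It is not the FEX code and not the optimized-routines algorithm.

```python
import math

# Taylor coefficients of sin(r) for r, r^3, ..., r^21 (illustrative only;
# a production routine would use tuned coefficients and fewer terms).
_COEFFS = [(-1) ** k / math.factorial(2 * k + 1) for k in range(11)]


def sin_sketch(x: float) -> float:
    """Approximate sin(x) in double precision for moderate |x|."""
    # Range reduction: x = n*pi + r, with r in roughly [-pi/2, pi/2].
    n = round(x / math.pi)
    r = x - n * math.pi
    # Horner evaluation of the odd polynomial r * P(r^2).
    r2 = r * r
    acc = 0.0
    for c in reversed(_COEFFS):
        acc = acc * r2 + c
    result = r * acc
    # sin(n*pi + r) = (-1)^n * sin(r), so flip the sign for odd n.
    return -result if n % 2 else result
```

For inputs of moderate size this tracks `math.sin` closely; very large arguments would need a more careful multi-part reduction than the single `math.pi` constant used here.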

@Sonicadvance1
Member

AmpereA1A:

Before:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 13208488520, 250000000, 52.83, 52.83 nanosecond, 18927222.42
FSIN, 11246190780, 250000000, 44.98, 44.98 nanosecond, 22229749.16
FCOS, 11428738260, 250000000, 45.71, 45.71 nanosecond, 21874680.68

After:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 7781475640, 250000000, 31.13, 31.13 nanosecond, 32127582.42
FSIN, 3908226340, 250000000, 15.63, 15.63 nanosecond, 63967636.02
FCOS, 4101809360, 250000000, 16.41, 16.41 nanosecond, 60948712.64

Cortex-A720/Radxa:
Before:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 16818118130, 250000000, 67.27, 67.27 nanosecond, 14864921.16
FSIN, 14680384340, 250000000, 58.72, 58.72 nanosecond, 17029526.90
FCOS, 15025618350, 250000000, 60.10, 60.10 nanosecond, 16638250.37

After:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 21518495820, 250000000, 86.07, 86.07 nanosecond, 11617912.43
FSIN, 3161922390, 250000000, 12.65, 12.65 nanosecond, 79065824.26
FCOS, 3327755660, 250000000, 13.31, 13.31 nanosecond, 75125708.00

Looks like Cortex-A720 is worse off in SINCOS because of the code emission. Would be interesting to keep it in the dispatcher and branch out so we don't annihilate icache. Kind of like the conversion operations.
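
The SINCOS regression can be quantified straight from the "Iter Time Average" column. The sketch below copies those values from the output above and computes before/after ratios; the helper names `speedup` and `regressions` are made up for this illustration.

```python
# Iter Time Average (ns) copied from the benchmark output above,
# stored as (before, after) pairs.
RESULTS = {
    "AmpereA1A": {
        "FSINCOS": (52.83, 31.13),
        "FSIN": (44.98, 15.63),
        "FCOS": (45.71, 16.41),
    },
    "Cortex-A720/Radxa": {
        "FSINCOS": (67.27, 86.07),
        "FSIN": (58.72, 12.65),
        "FCOS": (60.10, 13.31),
    },
}


def speedup(before_ns: float, after_ns: float) -> float:
    """Return how many times faster the 'after' run is (< 1 is a regression)."""
    return before_ns / after_ns


def regressions(results):
    """List (machine, test) pairs where the 'after' run is slower."""
    return [
        (machine, test)
        for machine, tests in results.items()
        for test, (before, after) in tests.items()
        if speedup(before, after) < 1.0
    ]


for machine, tests in RESULTS.items():
    for test, (before, after) in tests.items():
        print(f"{machine:18} {test:8} {speedup(before, after):.2f}x")
```

With these numbers FSIN and FCOS come out roughly 2.8x to 4.6x faster, while FSINCOS on the Cortex-A720 drops to about 0.78x of its previous speed.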

@pmatos pmatos changed the title from "F64 sin cos tan" to "JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path" Mar 6, 2026
@pmatos
Collaborator Author

pmatos commented Mar 17, 2026

> Looks like Cortex-A720 is worse off in SINCOS because of the code emission. Would be interesting to keep it in the dispatcher and branch out so we don't annihilate icache. Kind of like the conversion operations.

Ufff - inside the dispatcher I could use more registers, so I tried to make the code closer to the AdvSIMD routines in the Arm optimized-routines repo.

I have done the same for other operations and will push them as separate PRs. If you run the same microbenchmarks as earlier, do you get better results?
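
To make the icache argument concrete, here is a back-of-the-envelope model of code size when a routine is copied into every call site versus kept once and branched to. The instruction and call-site counts are hypothetical, not measured from FEX, and the function names are invented for this sketch; the only hard fact used is that A64 instructions are 4 bytes each.

```python
INSN_BYTES = 4  # A64 instructions are a fixed 4 bytes wide


def inline_code_bytes(call_sites: int, routine_insns: int) -> int:
    """Every call site carries its own copy of the routine."""
    return call_sites * routine_insns * INSN_BYTES


def shared_code_bytes(call_sites: int, routine_insns: int, call_insns: int = 1) -> int:
    """One shared copy, plus a short call sequence at each site."""
    return (routine_insns + call_sites * call_insns) * INSN_BYTES


# Hypothetical figures, not measured from FEX.
SITES, ROUTINE = 100, 60
print(inline_code_bytes(SITES, ROUTINE))  # 24000
print(shared_code_bytes(SITES, ROUTINE))  # 640
```

The shared form pays for a branch and return on every call, so the trade only wins when the routine is long enough that duplicating it costs more in instruction cache pressure than the call overhead.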

@Sonicadvance1
Member

Much better! So I guess the main question now is checking if the precision difference causes real problems or not?

A1A-Before:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 16301669240, 250000000, 65.21, 65.21 nanosecond, 15335852.81
FSIN, 10976102700, 250000000, 43.90, 43.90 nanosecond, 22776754.81
FCOS, 11436870040, 250000000, 45.75, 45.75 nanosecond, 21859127.46
FSINCOS, 13527759940, 250000000, 54.11, 54.11 nanosecond, 18480517.18

A1A-After:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 5591198400, 250000000, 22.36, 22.36 nanosecond, 44713133.41
FSIN, 3363389100, 250000000, 13.45, 13.45 nanosecond, 74329788.37
FCOS, 4450470540, 250000000, 17.80, 17.80 nanosecond, 56173835.50
FSINCOS, 7663827020, 250000000, 30.66, 30.66 nanosecond, 32620778.02

Cortex-A720/Radxa-Before:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 17921762970, 250000000, 71.69, 71.69 nanosecond, 13949520.50
FSIN, 14688639670, 250000000, 58.75, 58.75 nanosecond, 17019955.94
FCOS, 15022294320, 250000000, 60.09, 60.09 nanosecond, 16641931.96
FSINCOS, 16760394360, 250000000, 67.04, 67.04 nanosecond, 14916116.81

Cortex-A720/Radxa-After:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 4265674550, 250000000, 17.06, 17.06 nanosecond, 58607377.82
FSIN, 3624740340, 250000000, 14.50, 14.50 nanosecond, 68970457.62
FCOS, 3544529050, 250000000, 14.18, 14.18 nanosecond, 70531231.79
FSINCOS, 6225296450, 250000000, 24.90, 24.90 nanosecond, 40158730.11
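
On the precision question raised at the top of this comment: an x87 extended-precision value carries a 64-bit significand, while a double carries 53 bits. The sketch below only illustrates the rounding gap between those two widths for a single value, sin(1), using exact rational arithmetic; it says nothing about the accumulated error of the JIT routine or about whether any game notices. `round_to_bits` and `relative_error` are hypothetical helpers written for this example.

```python
from fractions import Fraction
from math import factorial


def round_to_bits(value: Fraction, bits: int) -> Fraction:
    """Round a nonzero rational to `bits` significand bits (round-half-even)."""
    sign = -1 if value < 0 else 1
    mag = abs(value)
    # Find e with 2**e <= mag < 2**(e + 1).
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    # Scale so the significand is an integer with `bits` bits, then round.
    scale = Fraction(2) ** (bits - 1 - e)
    return sign * Fraction(round(mag * scale)) / scale


# sin(1) as an exact rational from 25 Taylor terms; the truncation error is
# far below 2**-64, so it serves as the reference value here.
REFERENCE = sum(Fraction((-1) ** k, factorial(2 * k + 1)) for k in range(25))


def relative_error(bits: int) -> Fraction:
    """Relative rounding error of REFERENCE at the given significand width."""
    return abs(round_to_bits(REFERENCE, bits) - REFERENCE) / REFERENCE


err_f64 = relative_error(53)  # IEEE 754 double: 53-bit significand
err_x87 = relative_error(64)  # x87 extended: 64-bit significand
print(float(err_f64), float(err_x87))
```

The worst-case rounding error is 2^-53 for the double and 2^-64 for the extended format, a factor of 2048 apart, which is the scale of difference the reduced precision path accepts per result.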

@Sonicadvance1
Member

Oop, looks like something about this implementation breaks Mirror's Edge from running.

@pmatos
Collaborator Author

pmatos commented Mar 17, 2026

> Oop, looks like something about this implementation breaks Mirror's Edge from running.

That's odd - thanks for pointing that out.

@pmatos
Collaborator Author

pmatos commented Mar 19, 2026

> Oop, looks like something about this implementation breaks Mirror's Edge from running.

OK, it was an NZCV save/restore issue. It's working now. Sorry 'bout that.
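
The thread does not say where the flags were lost or how the fix was written. As a general illustration of this class of bug: on AArch64, any flag-setting instruction inside a helper overwrites the caller's NZCV condition flags unless they are saved first and restored afterwards. The toy Python model below mimics that; `ToyCpu`, `helper`, and `preserve_nzcv` are invented for this sketch and do not correspond to FEX code.

```python
from contextlib import contextmanager

# One bit per condition flag in this toy model.
N, Z, C, V = 0b1000, 0b0100, 0b0010, 0b0001


class ToyCpu:
    """Toy model holding only the four condition flags."""

    def __init__(self) -> None:
        self.nzcv = 0

    def cmp(self, a: int, b: int) -> None:
        """Flag-setting compare of small non-negative ints (V stays clear)."""
        flags = 0
        if a == b:
            flags |= Z
        if a < b:
            flags |= N
        if a >= b:
            flags |= C  # no borrow
        self.nzcv = flags


@contextmanager
def preserve_nzcv(cpu: ToyCpu):
    saved = cpu.nzcv  # analogous to `mrs x0, nzcv`
    try:
        yield
    finally:
        cpu.nzcv = saved  # analogous to `msr nzcv, x0`


def helper(cpu: ToyCpu) -> None:
    """Stand-in for a routine whose body happens to use a compare."""
    cpu.cmp(1, 2)


cpu = ToyCpu()
cpu.nzcv = Z            # flags the caller still depends on
helper(cpu)             # unprotected call overwrites them
clobbered = cpu.nzcv

cpu.nzcv = Z
with preserve_nzcv(cpu):
    helper(cpu)         # same call, flags saved and restored around it
preserved = cpu.nzcv
```

After the unprotected call the caller sees `N` instead of `Z`, so a later conditional branch would go the wrong way; with the save/restore wrapper the caller's `Z` survives.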

@Sonicadvance1
Member

Confirmed with the latest changes that the problems are resolved, and the performance improvement is still around the same.
I need to throw some more games at it, but looking good!

Member

@Sonicadvance1 Sonicadvance1 left a comment

Woop woop. Here we go.

@Sonicadvance1 Sonicadvance1 merged commit 4229154 into FEX-Emu:main Mar 20, 2026
13 checks passed
@pmatos pmatos deleted the f64-sin-cos-tan branch March 20, 2026 07:52