JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path by pmatos · Pull Request #5343 · FEX-Emu/FEX

pmatos · 2026-03-03T15:45:35Z

No description provided.

pmatos · 2026-03-03T15:48:39Z

I didn't manage to replicate the exact algorithm in the advsimd routines due to lack of registers, but I managed to rewrite it in a similar form using the available registers and got some good results:

  │ Operation │ vs ABI fallback │ vs Softfloat 80-bit │
  │ sin       │ 2.1x faster     │ 11.8x faster        │
  │ cos       │ 2.9x faster     │ 15.6x faster        │
  │ tan       │ 1.8x faster     │ 11.2x faster        │
  │ sincos    │ 2.4x faster     │ 25.3x faster        │

I am going to try to see if I can get similar results by jitting other f64 operations.
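For readers unfamiliar with the general shape of such routines, here is a minimal sketch of a double-precision sine built from range reduction plus an odd polynomial. It is not the code from this PR and not the algorithm from the advsimd routines: the names `sin_f64`, `cos_f64`, and `COEFFS` are hypothetical, the coefficients are plain Taylor terms rather than tuned minimax values, and the reduction uses a single `pi` constant, so accuracy degrades for large inputs. Only finite inputs are handled.

```python
import math

# Taylor coefficients of sin(r) after factoring out r:
# sin(r) = r + r^3 * (c1 + c2*r^2 + ...), with c_k = (-1)^k / (2k+1)!.
# Assumption: illustrative values only; real routines use tuned coefficients.
COEFFS = [(-1) ** k / math.factorial(2 * k + 1) for k in range(1, 10)]


def sin_f64(x):
    """Approximate sin(x) for finite x of moderate magnitude."""
    # Range reduction: pick n so that r = x - n*pi lies in [-pi/2, pi/2].
    n = round(x / math.pi)
    r = x - n * math.pi
    r2 = r * r
    # Horner evaluation of the polynomial in r^2.
    p = 0.0
    for c in reversed(COEFFS):
        p = p * r2 + c
    s = r + r * r2 * p
    # sin(r + n*pi) = (-1)^n * sin(r): flip the sign when n is odd.
    return -s if n % 2 else s


def cos_f64(x):
    """Approximate cos(x) through the identity cos(x) = sin(x + pi/2)."""
    return sin_f64(x + math.pi / 2)
```

Comparing `sin_f64(1.0)` against `math.sin(1.0)` is a quick sanity check. A JIT version would keep the coefficients and intermediate values in vector registers, which is where the register pressure mentioned above comes from.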

Sonicadvance1 · 2026-03-03T21:29:44Z

AmpereA1A:

Before:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 13208488520, 250000000, 52.83, 52.83 nanosecond, 18927222.42
FSIN, 11246190780, 250000000, 44.98, 44.98 nanosecond, 22229749.16
FCOS, 11428738260, 250000000, 45.71, 45.71 nanosecond, 21874680.68

After:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 7781475640, 250000000, 31.13, 31.13 nanosecond, 32127582.42
FSIN, 3908226340, 250000000, 15.63, 15.63 nanosecond, 63967636.02
FCOS, 4101809360, 250000000, 16.41, 16.41 nanosecond, 60948712.64

Cortex-A720/Radxa:
Before:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 16818118130, 250000000, 67.27, 67.27 nanosecond, 14864921.16
FSIN, 14680384340, 250000000, 58.72, 58.72 nanosecond, 17029526.90
FCOS, 15025618350, 250000000, 60.10, 60.10 nanosecond, 16638250.37

After:

Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FSINCOS, 21518495820, 250000000, 86.07, 86.07 nanosecond, 11617912.43
FSIN, 3161922390, 250000000, 12.65, 12.65 nanosecond, 79065824.26
FCOS, 3327755660, 250000000, 13.31, 13.31 nanosecond, 75125708.00

Looks like Cortex-A720 is worse off in SINCOS because of the code emission. Would be interesting to keep it in the dispatcher and branch out so we don't annihilate icache. Kind of like the conversion operations.
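The before/after numbers above can be turned into speedup ratios to make the regression explicit. The sketch below copies the "Iter Time Average" values from the four runs; the names `RESULTS`, `speedups`, and `regressions` are hypothetical helpers, not part of FEX or its benchmark suite.

```python
# (before_ns, after_ns) per test, copied from the "Iter Time Average" column.
RESULTS = {
    "AmpereA1A": {
        "FSINCOS": (52.83, 31.13),
        "FSIN": (44.98, 15.63),
        "FCOS": (45.71, 16.41),
    },
    "Cortex-A720": {
        "FSINCOS": (67.27, 86.07),
        "FSIN": (58.72, 12.65),
        "FCOS": (60.10, 13.31),
    },
}


def speedups(results):
    """Return before/after ratios; values below 1.0 mean the change is slower."""
    return {
        cpu: {test: before / after for test, (before, after) in tests.items()}
        for cpu, tests in results.items()
    }


def regressions(results):
    """List (cpu, test) pairs whose speedup ratio is below 1.0."""
    return [
        (cpu, test)
        for cpu, tests in speedups(results).items()
        for test, ratio in tests.items()
        if ratio < 1.0
    ]


if __name__ == "__main__":
    for cpu, tests in speedups(RESULTS).items():
        for test, ratio in tests.items():
            print(f"{cpu:12s} {test:8s} {ratio:.2f}x")
```

With these inputs the only pair flagged by `regressions(RESULTS)` is FSINCOS on the Cortex-A720, at a ratio of roughly 0.78x, while FSIN and FCOS improve on both machines.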

pmatos · 2026-03-17T17:15:07Z

Ufff - inside the dispatcher I could use more registers so I tried to make the code closer to the advsimd in the arm optimized-routines repo.

I have done the same for other operations and will push them as separate PRs. If you run the same microbenchmarks as earlier, do you get better results?

Sonicadvance1 · 2026-03-17T18:46:55Z

Much better! So I guess the main question now is checking if the precision difference causes real problems or not?

A1A-Before:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 16301669240, 250000000, 65.21, 65.21 nanosecond, 15335852.81
FSIN, 10976102700, 250000000, 43.90, 43.90 nanosecond, 22776754.81
FCOS, 11436870040, 250000000, 45.75, 45.75 nanosecond, 21859127.46
FSINCOS, 13527759940, 250000000, 54.11, 54.11 nanosecond, 18480517.18

A1A-After:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 5591198400, 250000000, 22.36, 22.36 nanosecond, 44713133.41
FSIN, 3363389100, 250000000, 13.45, 13.45 nanosecond, 74329788.37
FCOS, 4450470540, 250000000, 17.80, 17.80 nanosecond, 56173835.50
FSINCOS, 7663827020, 250000000, 30.66, 30.66 nanosecond, 32620778.02

Cortex-A720/Radxa-Before:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 17921762970, 250000000, 71.69, 71.69 nanosecond, 13949520.50
FSIN, 14688639670, 250000000, 58.75, 58.75 nanosecond, 17019955.94
FCOS, 15022294320, 250000000, 60.09, 60.09 nanosecond, 16641931.96
FSINCOS, 16760394360, 250000000, 67.04, 67.04 nanosecond, 14916116.81

Cortex-A720/Radxa-After:
Cycle counter frequency: 1000000000
ns in cycle: 1
suite: x87
Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
FPTAN, 4265674550, 250000000, 17.06, 17.06 nanosecond, 58607377.82
FSIN, 3624740340, 250000000, 14.50, 14.50 nanosecond, 68970457.62
FCOS, 3544529050, 250000000, 14.18, 14.18 nanosecond, 70531231.79
FSINCOS, 6225296450, 250000000, 24.90, 24.90 nanosecond, 40158730.11
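The derived columns in these logs follow from the first two numeric fields and the reported counter frequency, which makes a row easy to cross-check. The sketch below is one way to do that; `parse_row` is a hypothetical helper, and the column layout is assumed from the header line printed in the logs.

```python
def parse_row(line, counter_hz=1_000_000_000):
    """Parse one benchmark row and recompute its derived columns.

    Assumes the layout shown in the logs:
    Test, Total Cycles, Iterations, Cycles Average, Iter Time Average, iterations/Second
    """
    fields = [field.strip() for field in line.split(",")]
    name = fields[0]
    total_cycles = int(fields[1])
    iterations = int(fields[2])

    cycles_avg = total_cycles / iterations
    # With a 1 GHz counter, one cycle is one nanosecond.
    ns_avg = cycles_avg * 1e9 / counter_hz
    iters_per_sec = iterations / (total_cycles / counter_hz)
    return {
        "test": name,
        "cycles_avg": cycles_avg,
        "ns_avg": ns_avg,
        "iters_per_sec": iters_per_sec,
    }


if __name__ == "__main__":
    row = parse_row("FPTAN, 4265674550, 250000000, 17.06, 17.06 nanosecond, 58607377.82")
    print(row)
```

Because the counter frequency is reported as 1000000000, the cycle average and the nanosecond average coincide, which is why those two columns are identical in every row.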

Sonicadvance1 · 2026-03-17T19:33:57Z

Oop, looks like something about this implementation stops Mirror's Edge from running.

pmatos · 2026-03-17T20:16:41Z

That's odd - thanks for pointing that out.

pmatos · 2026-03-19T18:04:37Z

ok - nzcv save/restore issue. Now it's working. Sorry 'bout that.

Sonicadvance1 · 2026-03-19T22:13:14Z

Confirmed with the latest changes that the problems are resolved, and the performance improvement is still around the same.
I need to throw some more games at it, but looking good!

Sonicadvance1

Woop woop. Here we go.

pmatos changed the title ~~F64 sin cos tan~~ JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path Mar 6, 2026

pmatos force-pushed the f64-sin-cos-tan branch from e6d6b85 to 29de640 Compare March 17, 2026 17:13

JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path

30e8533

pmatos force-pushed the f64-sin-cos-tan branch from 29de640 to 362dfa6 Compare March 19, 2026 18:03

pmatos added 2 commits March 19, 2026 19:18

asm_tests: JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path

a1d78dc

instcountci: JIT-inline F64SIN, F64COS, and F64TAN for reduced precision x87 path

9d5f7ca

pmatos force-pushed the f64-sin-cos-tan branch from 362dfa6 to 9d5f7ca Compare March 19, 2026 18:18

Sonicadvance1 approved these changes Mar 20, 2026

Sonicadvance1 merged commit 4229154 into FEX-Emu:main Mar 20, 2026
13 checks passed

pmatos deleted the f64-sin-cos-tan branch March 20, 2026 07:52

This was referenced Apr 17, 2026

JIT-inline F64ATAN and F64FYL2X for reduced precision x87 path #5425

Merged

JIT-inline FPREM/FPREM1 for reduced precision x87 path #5432

Open

Sonicadvance1 merged 3 commits into FEX-Emu:main from pmatos:f64-sin-cos-tan