Correctly emulate FPU concurrent execution timings #2022
Merged
Summary
since the original 8087, x87 floating point has acted as a separate execution unit from the main processor, a feature which Intel called "concurrent execution": when an FPU instruction is run, the main processor has to spend a few cycles handing the instruction information and any relevant data (values from memory, etc.) to the FPU. after that, for the remainder of the time that the FPU is running the instruction, the main CPU is free to continue running integer work.
naive emulation of x86 processors does not account for this. e.g. when running the following routine on an emulated i486,
a naive fpu emulation (assuming that imull is calculated correctly for the input values as 42 cycles, rather than averaged) would run this routine in 124 cycles - 73 for the fdiv, 1 for each mov, 42 for the imull, and 7 for the fstps.
on real hardware, while the fdiv does take 73 cycles if the fpu is in 80-bit precision mode, only 3 of those will stall the main cpu - 70 cycles are concurrent. this means that, overall, the above routine would actually take 80 cycles: the fdiv takes 73 cycles, the movls and imull run concurrently with the fdiv and therefore take a combined effective 0 cycles (but take up 44 total of the concurrent execution time), and then the fstps takes 7 cycles.
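the arithmetic above can be sketched as a small cycle-accounting model. this is only an illustration, not the code in this patchset: `timing_state`, `run_integer`, `run_fpu` and `routine_cycles` are hypothetical names, the instruction order (fdiv, two movls, imull, fstps) is inferred from the description, and the fstps is assumed to have no concurrent portion.

```c
/* hypothetical cycle-accounting model; not 86box's actual code or names. */
typedef struct {
    int cpu_cycles; /* cycles elapsed on the main cpu */
    int fpu_busy;   /* cycles the fpu still needs for its current instruction */
} timing_state;

/* integer instruction: costs cpu cycles and overlaps with pending fpu work */
static void run_integer(timing_state *t, int cycles)
{
    t->cpu_cycles += cycles;
    t->fpu_busy = (t->fpu_busy > cycles) ? t->fpu_busy - cycles : 0;
}

/* fpu instruction: `cycles` is its total latency, `concurrent` is the part
 * of that latency during which the main cpu is free to run integer code */
static void run_fpu(timing_state *t, int cycles, int concurrent)
{
    /* a new fpu instruction has to wait for the previous one to finish */
    t->cpu_cycles += t->fpu_busy;
    /* the cpu itself stalls only for the non-concurrent part */
    t->cpu_cycles += cycles - concurrent;
    t->fpu_busy = concurrent;
}

/* the routine from the description; pass 0 to get the naive timing */
static int routine_cycles(int model_concurrency)
{
    timing_state t = {0, 0};

    run_fpu(&t, 73, model_concurrency ? 70 : 0); /* fdiv: 3 stall + 70 concurrent */
    run_integer(&t, 1);                          /* movl */
    run_integer(&t, 1);                          /* movl */
    run_integer(&t, 42);                         /* imull */
    run_fpu(&t, 7, 0);                           /* fstps: assumed non-concurrent */

    return t.cpu_cycles;
}
```

under these assumptions `routine_cycles(0)` gives the naive 124 cycles and `routine_cycles(1)` gives 80: the movls and imull use up 44 of the 70 concurrent cycles, and the fstps waits out the remaining 26 before spending its own 7.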
this set of patches adds support to 86box for emulating this properly, and adds timing information for the i486 taken from its datasheet.
running it with some of the tests i've been developing in qmark, a testing program i have been using as part of developing 486quake, everything appears to be working properly, and the timings look more like they should, no longer setting off my timing-related emulation checks (one behavioral check still does fire).
before
after
note: this patchset is only tested with the interpreter and the new dynarec - i am not able to test the old dynarec.
References
i486 Processor Programmer's Reference Manual