Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider removing non-fetched instrs for repstr #4915

Open
abhinav92003 opened this issue May 18, 2021 · 8 comments
Open

Consider removing non-fetched instrs for repstr #4915

abhinav92003 opened this issue May 18, 2021 · 8 comments

Comments

@abhinav92003
Copy link
Contributor

Xref #2051 and discussion on PR #4912: #4912 (comment)

We should consider removing non-fetched instrs for repstr. Today, if the string loop is executed N times, we'll have the rep stringop instr N times in the instr stream. The last N-1 instances are marked as TRACE_TYPE_INSTR_NO_FETCH so that cache simulators can choose to ignore these duplicate entries.

The thinking today is that maybe core simulators also don't need these duplicate instrs, and we should just have only one single instr entry and then multiple memref entries based on loop count.

@derekbruening
Copy link
Contributor

If we implemented #4948 and changed the underlying rep string expansion, the non-fetched instrs would disappear on their own and we'd just change the docs for the update.

@derekbruening
Copy link
Contributor

Online traces incorrectly mark a continuing rep string loop as "fetched" for the first instr after a thread switch. Lumping here rather than a separate issue we want to fix b/c we plan to remove non-fetched. I hit this in PR #4974:

::14112:13616::  @76F68FA5 non-fetched instr x2
::14112.13616::  @76F68FA5 write 00EA9FBC x4
::14112:13616::  @76F68FA5 non-fetched instr x2
::14112.13616::  @76F68FA5 write 00EA9FC0 x4
::14112.14428:: marker type 2 value 3083420504
::14112.14428:: marker type 3 value 2
::14112.14428::  @76F62D1C instr x3
...
::14112.14428::  @754D3B02 write 00CFFE48 x4
::14112.13616:: marker type 2 value 3083420504
::14112.13616:: marker type 3 value 10
::14112.13616::  @76F68FA5 instr x2
Assertion failed: TESTALL(OFFLINE_FILE_TYPE_FILTERED, file_type_) || (prev_instr_[memref.data.tid].instr.addr + prev_instr_[memref.data.tid].instr.size == memref.instr.addr) || (prev_instr_[memref.data.tid].instr.addr == memref.instr.addr && memref.instr.type == TRACE_TYPE_INSTR_NO_FETCH) || (prev_xfer_marker_[memref.data.tid].instr.tid != 0 && (prev_xfer_marker_[memref.data.tid].instr.tid != memref.instr.tid || prev_xfer_marker_[memref.data.tid].marker.marker_type == TRACE_MARKER_TYPE_KERNEL_EVENT || prev_xfer_marker_[memref.data.tid].marker.marker_type == TRACE_MARKER_TYPE_KERNEL_XFER)) || prev_instr_[memref.data.tid].instr.type == TRACE_TYPE_INSTR_SYSENTER, file D:\derek\dr\git\src\clients\drcachesim\tests\trace_invariants.cpp, line 167

@derekbruening
Copy link
Contributor

Trying to re-write the choices here:

A) On entry if TLS flag not set, record instr fetch and set flag; on final
iter, clear flag. I assume have to try to clear flag on abnormal loop
break too: a signal. So adds overhead and extra branches.

B) Keep entire loop inside block and unroll 1st iter to do fetch. Has all
the downsides of a loop inside a block: messes up DR's trace building,
signal delivery delay bounds, unlink flushing, shared fragment deletion
promptness, etc.

C) Try to split across two blocks, one for 1st iter and 2nd for rest. Use
PC+1 (skipping rep byte) for tag for 2nd and hope app never targets that
PC. When see a plain string op have to decode backward I guess or have
some recorded state so know it's this artificial loop body.

D) #4915: Remove all but 1st fetch in raw2trace, and add analysis to do the same
when dumping a trace buffer to get the live instr counts to match too --
which also fixes online. But then instead of a single contiguous buffer
being written/transmitted you have to do it in many pieces to skip the
extra fetches, or waste time memcpying it on top of itself back to
contiguous.

@derekbruening
Copy link
Contributor

Whoops that last comment was meant for #4948 with this issue being choice D

@abhinav92003
Copy link
Contributor Author

We had an offline chat where we discussed that option (A) is doable. Note that ensuring a single instr entry gets trickier if the repstr execution is interrupted by a signal; this is because we'll need to save and restore the mentioned TLS flag on entry and exit from the signal handler, and there can be nested signals too. But we decided that it's actually better to emit a new instr entry when the repstr expansion resumes after such an interruption; this is because the repstr instruction would actually be re-fetched by the processor pipeline. So all we need to do on an signal is to clear the mentioned TLS flag.

Question: do we want the same behavior on interruption of the scatter/gather expansion by a fault? Note that the scatter/gather expansion cannot be interrupted by an async signal (DR would delay the async signal, as the expansion is a single fragment that's executed just once per dynamic scatter/gather instruction, unlike the repstr where the same fragment is executed N times), so this is only for a scatter/gather instr resuming after a fault. Today we add a single pc entry for the scatter/gather instruction, even if some store/load faults. Note that the scatter/gather expansion does not use a loop, unlike the repstr expansion. So, it is actually more complex to add a new pc entry on resumption after a fault. We'll need to have a TLS-flag-protected instr-entry emit before each memref-emit.

@derekbruening
Copy link
Contributor

Not sure I understand the concern about scatter-gather: there is no problem today. We don't have to do anything extra for proper resumption of a rep-string or scatter-gather after a signal that is delivered to the app: that will always re-enter the original block from the start but now with a different resumption state (rcx for rep string; mask for scatter-gatther) because we translate the expansion to the original instruction.

@abhinav92003
Copy link
Contributor Author

that will always re-enter the original block from the start but now with a different resumption state

Oh, so after the fault, DR will not resume at the faulting cache pc, but at the faulting app pc (which will take it to the top of the scatter/gather fragment). Yes in that case we're fine, as the scatter/gather instr entry will indeed be emitted again.

@derekbruening
Copy link
Contributor

The online failure to identify non-fetched after a thread switch lumped into this issue at #4915 (comment) can result in a PC discontinuity error:

     1184144      760077:        9972 ifetch       3 byte(s) @ 0x76266f9f non-branch
     1184145      760078:        9972 ifetch       2 byte(s) @ 0x76266fa2 non-branch
     1184146      760078:        9972 write        4 byte(s) @ 0x009bfb08 by PC 0x76266fa2
     1184147      760078:        9972 ifetch       2 byte(s) @ 0x76266fa2 non-fetched instruction
     1184148      760078:        9972 write        4 byte(s) @ 0x009bfb0c by PC 0x76266fa2
     1184149      760078:        9972 ifetch       2 byte(s) @ 0x76266fa2 non-fetched instruction
     1184150      760078:        9972 write        4 byte(s) @ 0x009bfb10 by PC 0x76266fa2
2
------------------------------------------------------------
     1184151      760079:        4424 ifetch       5 byte(s) @ 0x773beaf7 non-branch
<...>
     1184206      760107:        4424 write        4 byte(s) @ 0x02c2f454 by PC 0x773beb0f
------------------------------------------------------------
     1184207      760108:        9972 ifetch       2 byte(s) @ 0x76266fa2 non-branch
Trace invariant failure in T9972 at ref # 1184207 (69 instrs since timestamp 1660902515): Non-explicit control flow has no marker

derekbruening added a commit that referenced this issue Sep 8, 2023
Adds a timestamp+cpuid pair at the end of normal buffers, to help
separate trace output i/o time.

Adds a timestamp+cpuid pair before and after each application syscall.
It appears around the syscall marker. The trace buffer is no longer
output pre-syscall (this addresses #3113) except for i-filtered traces
or small-window traces where we need a frequent trace size check.

Adds a timestamp+cpuid pair when a kernel transfer event occurs. The
trace buffer is no longer output here.

Fixes the existing behavior where the post-syscall-buffer's timestamp
contains the pre-syscall time.

Bumps the trace version for this change as the contents of post-syscall
timestamps are now different.

Augments the documentation around timestamps and adds a missing entry on
syscall markers.

Adds some invariant checks, and improve existing syscall marker checks.
Further checks are more difficult as the post-syscall time is not
present when there is no post-syscall event, such as with exit or
sigreturn. Did some manual testing with the view tool.

Applies several auxiliary changes required to get all modes to work with
these additional timestamps:
+ Removes the timestamp+cpuid from the split-pipe header, as that
timestamp is delayed by i/o and results in misleading out-of-order
timestamps.
+ Prevents pipe-splitting before a cpuid, to avoid a confusing sequence.
+ Fixes a bug where a syscall marker is missing when the prior block
wrote out its buffer (#6291).
+ Fixes #6245 by re-instating the footer for non-split window files and
removing the template `.*` from PR #6165 which was hiding the warning
(this one could possibly be split into its own PR).
+ Adjusts the PC discontinuity relaxation for online unfetched instrs
(#4915) to check the size now that there's no timestamp.

Issue: #4915
Fixes #6289
Fixes #3113
Fixes #6291 
Fixes #6245
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants