Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[drcachesim] trim trace to avoid staggered attach and detach partial-thread anomalies at ends of trace #2039

Closed
derekbruening opened this issue Oct 20, 2016 · 2 comments · Fixed by #5832

Comments

@derekbruening
Copy link
Contributor

Xref #1729

Due to the asynchronous nature of stopping all the threads on detach, some
may run further and produce more data than we'd like for a bursty trace. I
already added a new option -max_trace_size to put a cap on the per-thread
trace size, but we probably also want a way to precisely throw out data
past the detach timestamp.

On the burst-threads test, without the -max_trace_size cap, the secondary
threads produce surprisingly large traces: 65MB, when the start/stop thread
only ran 4 iters of its loop. It may be worth investigating further to
ensure nothing is going wrong there: is it really just a few hundred more
iters that adds up to 65MB??

@derekbruening
Copy link
Contributor Author

derekbruening commented Sep 14, 2022

This problem occurs on attach as well: the staggered attach means that in an app with thousands of threads the threads taken over early start producing trace data well before the rest of the threads are taken over, resulting in an inaccurate initial picture of the workload.

An alternative to discarding after the fact based on timestamps is to use drbbdup modes to have a non-tracing alternative mode. Only once attach has taken over all threads would we flip a switch and have the tracing mode start, simultaneously in all threads. We would use a load-acquire read of the mode flag which is fairly performant.

While this is more overhead than fixing up later, it fits well with schemes that pre-populate the code cache in a lower-overhead no-tracing mode before starting tracing.

There are two ways to trigger the mode switches for statically linked drmemtrace: new DR events for post-attach and pre-detach, or annotations. Annotations are more flexible as when pre-populating the tool can separate attach from tracing.

We would want to implement annotations for aarchxx #1672, and solve the raw2trace issues with annotations #4141.

@derekbruening derekbruening changed the title [drcachesim] add detach timestamp to trace and discard data past that point [drcachesim] trim trace to avoid staggered attach and detach partial-thread anomalies at ends of trace Sep 14, 2022
derekbruening added a commit that referenced this issue Sep 16, 2022
Adds two new DR events: post-attach and pre-detach.  These help tools
align instrumentation with points where all threads are under DR
control, avoiding uneven thread execution during the incremental
staggered attach and detach processes.

Adds some sanity tests that the events are called where they should
be.  Adding them to drmemtrace will be done separately.

Issue: #2039
derekbruening added a commit that referenced this issue Sep 19, 2022
Adds two new DR events: post-attach and pre-detach.  These help tools
align instrumentation with points where all threads are under DR
control, avoiding uneven thread execution during the incremental
staggered attach and detach processes.

Adds some sanity tests that the events are called where they should
be, for both internal and external attach.
Adding them to drmemtrace will be done separately.

Issue: #2039
derekbruening added a commit that referenced this issue Sep 26, 2022
Adds support for multiple app copies in rseq mangling finalization's
search for labels to fill in code cache PC's.

Adds support for multiple rseq regions in one fragment_t by expanding
the stored rseq_cs per fragment to become a list.

Adds "-trace_after_instrs 5K" to the tool.drcacheoff.rseq test to
exercise rseq with drbbdup.

Augments the attach/detach api.rseq test to use drmemtrace, which with
the forthcoming #2039 mode switches also hits this multi-rseq case.

Fixes #5659
derekbruening added a commit that referenced this issue Sep 28, 2022
Adds support for multiple app copies in rseq mangling finalization's
search for labels to fill in code cache PC's.

Adds support for multiple rseq regions in one fragment_t by expanding
the stored rseq_cs per fragment to become a list.

Adds "-trace_after_instrs 5K" to the tool.drcacheoff.rseq test to
exercise rseq with drbbdup.

Augments the attach/detach api.rseq test to use drmemtrace, which with
the forthcoming #2039 mode switches also hits this multi-rseq case.

Fixes #5659
derekbruening added a commit that referenced this issue Sep 30, 2022
Fixes PR #5663 which fixed drreg restoration for drbbdup but broke it
for non-drbbdup.  bb->instr was NULL so the preinsert was an append,
and the label as the last instruction moved the drreg restore point to
after the native sequence!

Here, we explicitly append to avoid confusion, and we move the native
sequence to the point of the barrier instead of at the final app
instr, which is where the registers are actually native.

Tested on larger apps using rseq which reliably crashed without this
fix.  This fix also seems to fully fix drbbdup usage with drmemtrace
for #2039.

Issue: #5658, #2039
derekbruening added a commit that referenced this issue Sep 30, 2022
Fixes PR #5663 which fixed drreg restoration for drbbdup but broke it
for non-drbbdup.  bb->instr was NULL so the preinsert was an append,
and the label as the last instruction moved the drreg restore point to
after the native sequence!

Here, we explicitly append to avoid confusion, and we move the native
sequence to the point of the barrier instead of at the final app
instr, which is where the registers are actually native.

Tested on larger apps using rseq which reliably crashed without this
fix.  This fix also seems to fully fix drbbdup usage with drmemtrace
for #2039.

Issue: #5658, #2039
derekbruening added a commit that referenced this issue Oct 7, 2022
Adds drmemtrace_replace_file_ex_ops() with an expanded file opening
function which takes in the thread id and window id, to better support
the external file opener with delayed opens due to nop mode and
windows.

Adds a test of the function to burst_replaceall.  Adds additional
threads to the test and checks that the tid for each thread was seen.

Issue: #2039
derekbruening added a commit that referenced this issue Oct 10, 2022
Adds drmemtrace_replace_file_ex_ops() with an expanded file opening
function which takes in the thread id and window id, to better support
the external file opener with delayed opens due to nop mode and
windows.  The new function takes in a struct to make it easier to
extend further in the future.

Adds a test of the function to burst_replaceall.  Adds additional
threads to the test and checks that the tid for each thread was seen,
but disables that part on Windows due to #2040.

Issue: #2039
derekbruening added a commit that referenced this issue Oct 25, 2022
Adds a new drbbdup mode to drmemtrace which performs zero
instrumentation (except drwrap cleanup).  This mode is used when
attaching for the period prior to full control of all threads and when
detaching prior to starting to let threads go native, to avoid uneven
thread instrumentation during these incremental staggered processes.
The mode is initially under an off-by-default new option
-align_endpoints while drbbdup stability is being worked on (i#5686).

Threads that did nothing during the trace-mode period are now omitted
from the trace.

Adds a test by adding 4 idle threads to burst_threads and ensuring
they do not show up in the trace.

Alignment itself was tested manually by running larger applications
and analyzing the timestamp ranges in the trace.

Fixes the burst_replace test to work properly with .zip output (the
output regex still matched despite a printed error from raw2trace, it
seems).

Timestamps during detach require further work as they inaccurately
imply executing after tracing mode was turned off.  The next part will
address this issue.

Issue: #2039, #5686
derekbruening added a commit that referenced this issue Oct 25, 2022
Adds a new drbbdup mode to drmemtrace which performs zero
instrumentation (except drwrap cleanup).  This mode is used when
attaching for the period prior to full control of all threads and when
detaching prior to starting to let threads go native, to avoid uneven
thread instrumentation during these incremental staggered processes.
The mode is initially under an off-by-default new option
-align_endpoints while drbbdup stability is being worked on (i#5686).

Threads that did nothing during the trace-mode period are now omitted
from the trace.

Adds a test by adding 4 idle threads to burst_threads and ensuring
they do not show up in the trace.

Alignment itself was tested manually by running larger applications
and analyzing the timestamp ranges in the trace.

Fixes the burst_replace test to work properly with .zip output (the
output regex still matched despite a printed error from raw2trace, it
seems).

Timestamps during detach require further work as they inaccurately
imply executing after tracing mode was turned off.  The next part will
address this issue.

Issue: #2039, #5686
derekbruening added a commit that referenced this issue Oct 25, 2022
Moves the timestamp+cpu headers to be added at buffer creation time
instead of at buffer output time.

Makes instru_t::frozen_timestamp_ use release-acquire semantics and
sets it for -align_endpoints at detach time.

Adds a 'minimum timestamp' recorded at post-attach time.  At buffer
output time, we examine the buffer's timestamp and if it was from
before the minimum, we replace it with the current (not the minimum)
timestamp.

Tested on a large app where these changes result in much tighter
timestamps matching the burst period.

Issue: #2039
@derekbruening
Copy link
Contributor Author

From #5704 (comment): we should add an invariant check that timestamps always increase monotonically through each trace.

derekbruening added a commit that referenced this issue Oct 27, 2022
Moves the timestamp+cpu headers to be added at buffer creation time
instead of at buffer output time.

Makes instru_t::frozen_timestamp_ use release-acquire semantics and
sets it for -align_endpoints at detach time.

Adds a 'minimum timestamp' recorded at post-attach time.  At buffer
output time, we examine the buffer's timestamp and if it was from
before the minimum, we replace it with the minimum.

Tested on a large app where these changes result in much tighter
timestamps matching the burst period.

This also directly fixes #5537 which pointed out that the timestamps
being at buffer output time is contradicted by the documentation.

Issue: #2039, #5537
Fixes #5537
derekbruening added a commit that referenced this issue Nov 30, 2022
Part 4 added a minimum timestamp, but it only works for data that is
not actually emitted until we are fully attached.  For data emitted
before that point, we'll use a too-early timestamp.  We solve that
here by dropping such data.

Tested on a large app where these changes solve gaps at the start of
the trace due to an early timestamp in one thread.

Issue: #2039
derekbruening added a commit that referenced this issue Dec 1, 2022
Since the stability issues with drbbdup on larger apps seem limited to
x86, we enable -align_endpoints by default for AArchXX.

Issue: #2039
derekbruening added a commit that referenced this issue Dec 1, 2022
Part 4 added a minimum timestamp, but it only works for data that is
not actually emitted until we are fully attached.  For data emitted
before that point, we'll use a too-early timestamp.  We solve that
here by dropping such data.

It is not easy to make an automated test unfortunately.

Issue: #2039
derekbruening added a commit that referenced this issue Dec 1, 2022
Since the stability issues with drbbdup on larger apps seem limited to
x86, we enable -align_endpoints by default for AArchXX.

This revealed a drbbdup issue on AArch64:
OP_isb must end a bb.  Added this to drbbdup's list of special
opcodes, and to the bb event docs where it was missing.
This fixes drcacheoff.gencode test failures.

Adds more synch to the burst_replaceall test to ensure all
threads are scheduled to avoid empty threads which mess up the
expected output with -align_endpoints.

Relaxes the burst_threadfilter template to allow empty unscheduled
threads.  Relaxes the burst_replace template to allow lazy file opening.

Issue: #2039
derekbruening added a commit that referenced this issue Dec 2, 2022
Fixes an issue with refreshing the timestamp when filtering is on
where the first (extended) header flag was cleared on a
filter-discarded buffer.

Tested by running "ctest -V --repeat-until-fail 100 -R burst_threadL0filter".
Previously it failed on the 3rd test; now all 100 pass.

Issue: #2039
derekbruening added a commit that referenced this issue Dec 2, 2022
Fixes an issue with refreshing the timestamp when filtering is on
where the first (extended) header flag was cleared on a
filter-discarded buffer.

Tested by running "ctest -V --repeat-until-fail 100 -R burst_threadL0filter".
Previously it failed on the 3rd test; now all 100 pass.

Fixes #5774
Issue: #2039, #5774
derekbruening added a commit that referenced this issue Jan 20, 2023
With the stability issues with drbbdup seemingly solved, we enable
-align_endpoints by default for x86 (it was already on for AArchXX).

Fixes #2039
derekbruening added a commit that referenced this issue Jan 21, 2023
With the stability issues with drbbdup seemingly solved, we enable
-align_endpoints by default for x86 (it was already on for AArchXX).

Fixes #2039
dolanzhao pushed a commit that referenced this issue Jan 30, 2023
With the stability issues with drbbdup seemingly solved, we enable
-align_endpoints by default for x86 (it was already on for AArchXX).

Fixes #2039
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant