
Add fast-forward feature to skip N instructions in drmemtrace analysis tools #5538

Open
derekbruening opened this issue Jun 21, 2022 · 3 comments


@derekbruening
Contributor

derekbruening commented Jun 21, 2022

This is a feature request to make it easier to simulate a subset of a long
trace. Today, this would be done by running a trace splitter offline. The
proposal is to support skipping forward N instructions by seeking in the
trace file, to make it usable online during simulation.

The records are fixed-size, but the instruction type density is not
uniform, so we'd have to do something like embed markers with instruction
counts every N records so the seeking can find the proper instruction
boundary. We could limit the fast-forward jumps to every N instructions.
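To illustrate the proposal: because records are fixed-size, the byte offset of a marker placed every N records is known in advance, so a reader can binary-search those markers by their embedded instruction counts and seek to the nearest one at or before the target. This is a minimal sketch under those assumptions; `count_marker_t` and `seek_offset_for()` are hypothetical names, not drmemtrace code, and the marker layout is invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical marker written every N records: the number of instructions
// that precede it in the trace. Markers are sorted by both fields.
struct count_marker_t {
    uint64_t record_index;
    uint64_t instrs_before;
};

// Returns the byte offset to seek to: the last marker whose instruction
// count does not exceed target_instr. Records are fixed-size, so a record
// index converts directly to an offset; the reader then walks forward
// linearly to the exact instruction boundary.
uint64_t
seek_offset_for(const std::vector<count_marker_t> &markers, uint64_t target_instr,
                uint64_t record_size)
{
    assert(!markers.empty());
    // First marker with more preceding instructions than the target.
    auto it = std::upper_bound(markers.begin(), markers.end(), target_instr,
                               [](uint64_t target, const count_marker_t &marker) {
                                   return target < marker.instrs_before;
                               });
    if (it != markers.begin())
        --it;
    return it->record_index * record_size;
}
```

For example, with 16-byte records and markers every 100 records recording 0, 37, 81, and 140 preceding instructions, a target of instruction 90 seeks to record 200 (byte 3200) and walks the rest linearly.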

There are several features which may not interact well with this feature,
where we want to only emit non-changing data once early in the trace and
assume a trace reader can cache the data:

@derekbruening
Contributor Author

A related feature is having an instruction count in the view tool in addition to the total record count used today, and corresponding -skip_instrs and -sim_instrs to go with today's -skip_refs and -sim_refs (for view and simulator tools).

Also, as part of this feature we should probably solve #4915 / #4948: what about non-fetched instrs?

derekbruening added a commit that referenced this issue Aug 31, 2022
Splits post-processed offline drmemtrace files into chunks of a fixed
instruction count.  These chunks are combined inside one zipfile per
thread, maintaining the current file-per-thread invariant.

The minizip library, a contributed part of the zlib sources, is added
as a submodule and used to write and read the zipfile (via new
zipfile_ostream_t and zipfile_file_reader_t classes, respectively).
If the submodule is not present, we fall back to a gzipped single file
(if we have a system zlib).

Issue: #5538
derekbruening added a commit that referenced this issue Sep 2, 2022
Splits post-processed offline drmemtrace files into chunks of a fixed
instruction count.  These chunks are combined inside one zipfile per
thread, maintaining the current file-per-thread invariant.

The minizip library, a contributed part of the zlib sources, is added
as a submodule and used to write and read the zipfile (via new
zipfile_ostream_t and zipfile_file_reader_t classes, respectively).
If the submodule is not present, we fall back to a gzipped single file
(if we have a system zlib).

Adds a new marker type holding the chunk instr count.
Adds a new -chunk_instr_count option for specifying the count.
We pass a small value to the drcacheoff.simple and invariant
checker tests to test multiple chunks in one zip file.
Documents the change.
Refactors drmemtrace_get_timestamp_from_offline_trace to handle
new markers.
Updates tests to handle the new marker.

Adds a new marker type as a chunk footer, to identify truncation.

Re-emits the last timestamp+cpu at the top of each new chunk.
The reader skips them in linear walking.
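A rough sketch of the chunking described above, using invented record types rather than the real trace_entry_t: a new chunk begins just before the instruction that would exceed the count (so data records stay with their instruction), every chunk ends with a footer marker so a missing footer reveals truncation, and the latest timestamp and cpu markers are copied to the top of each new chunk. The names and the exact placement of each marker are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified record kinds (not drmemtrace's trace_entry_t).
enum class rec_kind_t { INSTR, DATA, TIMESTAMP, CPU_ID, CHUNK_FOOTER };

struct rec_t {
    rec_kind_t kind;
    uint64_t value;
};

// Splits one thread's records into chunks of chunk_instr_count instructions.
std::vector<std::vector<rec_t>>
split_into_chunks(const std::vector<rec_t> &records, uint64_t chunk_instr_count)
{
    assert(chunk_instr_count > 0);
    std::vector<std::vector<rec_t>> chunks(1);
    uint64_t instrs_in_chunk = 0;
    rec_t last_timestamp = { rec_kind_t::TIMESTAMP, 0 };
    rec_t last_cpu = { rec_kind_t::CPU_ID, 0 };
    for (const rec_t &rec : records) {
        if (rec.kind == rec_kind_t::INSTR && instrs_in_chunk == chunk_instr_count) {
            // Close the full chunk before this instruction and open a new one
            // that starts with copies of the latest timestamp and cpu.
            chunks.back().push_back(
                { rec_kind_t::CHUNK_FOOTER, static_cast<uint64_t>(chunks.size() - 1) });
            chunks.emplace_back();
            chunks.back().push_back(last_timestamp);
            chunks.back().push_back(last_cpu);
            instrs_in_chunk = 0;
        }
        if (rec.kind == rec_kind_t::TIMESTAMP)
            last_timestamp = rec;
        else if (rec.kind == rec_kind_t::CPU_ID)
            last_cpu = rec;
        else if (rec.kind == rec_kind_t::INSTR)
            ++instrs_in_chunk;
        chunks.back().push_back(rec);
    }
    // The final chunk also gets a footer, so only a truncated chunk lacks one.
    chunks.back().push_back(
        { rec_kind_t::CHUNK_FOOTER, static_cast<uint64_t>(chunks.size() - 1) });
    return chunks;
}
```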

Refactors delayed branch handling so we can identify instr entries.

Removes the histogram.gzip test as it is superfluous since we've been
gzipping files for a long time; plus, it blindly assumes it can gzip
any output, including a .zip.

Issue: #5538
derekbruening added a commit that referenced this issue Sep 6, 2022
Cleans up the drmemtrace i/o classes by removing redundant "virtual"
keywords and adding missing "explicit" constructor qualifiers.

Issue: #5538
derekbruening added a commit that referenced this issue Sep 6, 2022
derekbruening added a commit that referenced this issue Sep 8, 2022
Adds buffering to the zipfile reader, which eliminates the 60%
slowdown it showed compared to our gzip reader and in fact results in
faster read times than the gzip reader.

Adds similar buffering to the gzip reader, which results in an 18%
speedup and matches the new zipfile speed.
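The speedup comes from replacing many small per-record reads with occasional large reads into a buffer. This is a minimal sketch over a plain `std::istream` with fixed-size records; `buffered_record_reader_t` is hypothetical, and the real readers pull decompressed bytes through minizip or zlib rather than an istream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads fixed-size records of trivially copyable type T from a stream,
// refilling a large buffer occasionally instead of reading per record.
template <typename T> class buffered_record_reader_t {
public:
    buffered_record_reader_t(std::istream &in, size_t records_per_refill)
        : in_(in)
        , buf_(records_per_refill * sizeof(T))
    {
        assert(records_per_refill > 0);
    }

    // Copies the next record into *out. Returns false at end of input;
    // a trailing partial record is ignored.
    bool
    read_next(T *out)
    {
        if (cur_ + sizeof(T) > end_ && !refill())
            return false;
        std::memcpy(out, buf_.data() + cur_, sizeof(T));
        cur_ += sizeof(T);
        return true;
    }

private:
    bool
    refill()
    {
        // Keep any partial record at the front, then fill the rest.
        size_t leftover = end_ - cur_;
        std::memmove(buf_.data(), buf_.data() + cur_, leftover);
        in_.read(buf_.data() + leftover,
                 static_cast<std::streamsize>(buf_.size() - leftover));
        cur_ = 0;
        end_ = leftover + static_cast<size_t>(in_.gcount());
        return end_ >= sizeof(T);
    }

    std::istream &in_;
    std::vector<char> buf_;
    size_t cur_ = 0;
    size_t end_ = 0;
};
```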

Issue: #5538
derekbruening added a commit that referenced this issue Sep 8, 2022
Adds buffering to the zipfile reader, which eliminates the 60%
slowdown it showed compared to our gzip reader and in fact results in
faster read times than the gzip reader.

Adds similar buffering to the gzip reader, which results in an 18%
speedup and matches the new zipfile speed.

Issue: #5538
derekbruening added a commit that referenced this issue Sep 8, 2022
PR #5633 added code to skip the duplicate timestamps at the top of
each chunk, but its logic assumed there would never be two legitimate
timestamps with identical values.  That does happen, in particular in
our online-drcachesim tests on Windows.  This resulted in the
invariant_checker not seeing some timestamp entries, causing its
exception for non-fetched instrs across thread switches to not apply
and resulting in invariant error reports.

We fix this by skipping the first timestamp in each chunk by
instruction count instead.  We'll want the instruction and chunk
counts for #5538 and I was about to add those fields in any case.
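A hypothetical sketch of the count-based rule for a single thread: once the thread's instruction count reaches a multiple of the chunk size, the first timestamp seen is treated as the re-emitted copy, while a later timestamp with an identical value is kept. The class name is invented; it tracks one counter for one thread, and the companion cpu marker would need the same treatment.

```cpp
#include <cassert>
#include <cstdint>

// Decides, for one thread's records, whether a timestamp is the duplicate
// re-emitted at the top of a new chunk.
class chunk_dup_filter_t {
public:
    explicit chunk_dup_filter_t(uint64_t chunk_instr_count)
        : chunk_instr_count_(chunk_instr_count)
    {
        assert(chunk_instr_count > 0);
    }

    void
    on_instruction()
    {
        ++instr_count_;
    }

    // Returns true if this timestamp should be hidden from tools. Only the
    // first timestamp at each chunk boundary is hidden, so two legitimate
    // timestamps with equal values are both kept.
    bool
    is_duplicate_timestamp()
    {
        if (instr_count_ > 0 && instr_count_ % chunk_instr_count_ == 0 &&
            instr_count_ != last_hidden_at_) {
            last_hidden_at_ = instr_count_;
            return true;
        }
        return false;
    }

private:
    uint64_t chunk_instr_count_;
    uint64_t instr_count_ = 0;
    uint64_t last_hidden_at_ = 0;
};
```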

Fixes #5636
derekbruening added a commit that referenced this issue Sep 9, 2022
dolanzhao pushed a commit that referenced this issue Sep 12, 2022
dolanzhao pushed a commit that referenced this issue Sep 12, 2022
dolanzhao pushed a commit that referenced this issue Sep 12, 2022
derekbruening added a commit that referenced this issue Oct 28, 2022
Adds an invariant check that chunk boundaries contain the proper
number of instructions.
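A sketch of such a check over a simplified record stream, assuming every chunk ends with a footer marker and only the final chunk may hold fewer instructions; the enum and function are hypothetical, not invariant_checker code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplified record kinds for this sketch.
enum class rec_kind_t { INSTR, DATA, MARKER, CHUNK_FOOTER };

// Returns true if every chunk but the last holds exactly chunk_instr_count
// instructions and the last holds no more than that.
bool
check_chunk_instr_counts(const std::vector<rec_kind_t> &records, uint64_t chunk_instr_count)
{
    std::vector<uint64_t> per_chunk(1, 0);
    for (rec_kind_t kind : records) {
        if (kind == rec_kind_t::INSTR)
            ++per_chunk.back();
        else if (kind == rec_kind_t::CHUNK_FOOTER)
            per_chunk.push_back(0);
    }
    // A footer at the very end leaves an empty trailing entry; drop it.
    if (per_chunk.size() > 1 && per_chunk.back() == 0)
        per_chunk.pop_back();
    for (size_t i = 0; i < per_chunk.size(); ++i) {
        bool is_last = i + 1 == per_chunk.size();
        if (per_chunk[i] > chunk_instr_count ||
            (!is_last && per_chunk[i] != chunk_instr_count))
            return false;
    }
    return true;
}
```

An off-by-one in the writer, such as closing a chunk one instruction late, makes the check fail.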

Fixes off-by-one errors in chunk counts due to not checking at
end-of-loop, found by visual inspection (with a local view tool that
prints instr counts; that will be committed later) and confirmed to
break this new check without the fix in the
tool.drcacheoff.invariant_checker test.

Issue: #5538
derekbruening added a commit that referenced this issue Oct 31, 2022
derekbruening added a commit that referenced this issue Oct 31, 2022
Adds two new files generated by raw2trace, each containing 4-entry
records of <tid, timestamp, cpuid, instr_count>.  These are used to
take a global or per-cpu target instruction count and determine the
corresponding instruction count within each software thread, for
seeking within each trace file.  One file is globally sorted by
timestamp, and the other is a zip file with a separate component for
each cpuid holding entries sorted by timestamp for that cpu.
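One way the globally sorted file could be used, sketched under assumptions: each entry's instr_count is taken to be the thread's own instruction ordinal where that scheduling segment starts, and each segment runs until the same thread's next entry (or the thread's end). Walking the entries in timestamp order and summing segment lengths then yields each thread's position for a global target. The struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical record mirroring <tid, timestamp, cpuid, instr_count>.
struct sched_entry_t {
    uint64_t tid;
    uint64_t timestamp;
    uint64_t cpuid;
    uint64_t instr_count; // Assumed: thread's instr ordinal at segment start.
};

// Given entries globally sorted by timestamp and each thread's total
// instruction count, returns how far into each thread the serialized
// schedule has progressed after global_target instructions. Threads not yet
// reached are absent from the result.
std::map<uint64_t, uint64_t>
per_thread_targets(const std::vector<sched_entry_t> &sorted,
                   const std::map<uint64_t, uint64_t> &thread_totals, uint64_t global_target)
{
    // Each segment ends where its thread's next segment starts, so compute
    // segment ends by walking backward.
    std::vector<uint64_t> seg_end(sorted.size());
    std::map<uint64_t, uint64_t> next_start = thread_totals;
    for (size_t i = sorted.size(); i-- > 0;) {
        seg_end[i] = next_start.at(sorted[i].tid);
        next_start[sorted[i].tid] = sorted[i].instr_count;
    }
    std::map<uint64_t, uint64_t> result;
    uint64_t global = 0;
    for (size_t i = 0; i < sorted.size() && global < global_target; ++i) {
        uint64_t len = seg_end[i] - sorted[i].instr_count;
        uint64_t take = std::min(len, global_target - global);
        result[sorted[i].tid] = sorted[i].instr_count + take;
        global += take;
    }
    return result;
}
```

For instance, if thread 1 runs instructions [0,4) then [4,10) and thread 2 runs [0,3) then [3,6), interleaved 1, 2, 1, 2 by timestamp, a global target of 9 maps to instruction 6 in thread 1 and instruction 3 in thread 2.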

Adds new invariant_checker tests for each file by passing them in for
test_mode.  The invariant_checker re-constructs the same sorted
sequences using the trace data and confirms it matches the data in the
files.  This involves adding a new zipfile_stream_t interface for a
simple continuous stream of each zipfile component in turn.

Issue: #5538
derekbruening added a commit that referenced this issue Nov 1, 2022
Adds two new files generated by raw2trace, each containing 4-entry
records of <tid, timestamp, cpuid, instr_count>.  These are used to
take a global or per-cpu target instruction count and determine the
corresponding instruction count within each software thread, for
seeking within each trace file.  One file is globally sorted by
timestamp, and the other is a zip file with a separate component for
each cpuid holding entries sorted by timestamp for that cpu.

Adds new invariant_checker tests for each file by passing them in for
test_mode.  The invariant_checker re-constructs the same sorted
sequences using the trace data and confirms it matches the data in the
files.  This involves adding a new zipfile_stream_t interface for a
simple continuous stream of each zipfile component in turn.

Switches raw2trace's thread_data_ to a protected field using unique_ptr
heap indirection so subclasses can extend raw2trace_thread_data_t and
use the new schedule generation code directly.

Issue: #5538
derekbruening added a commit that referenced this issue Nov 3, 2022
Adds a missing piece from part 5 PR #5713 where raw2trace's
thread_data_ was indirected to support a subclass extending it.
However, its destructor was not virtual, which prevented such
extension.  We address that here.
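Why the destructor matters: when a `unique_ptr` to a base class owns a subclass object, deleting it through a base without a virtual destructor is undefined behavior and in practice skips the subclass destructor. A minimal sketch with hypothetical stand-ins for raw2trace_thread_data_t and an extending subclass:

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for per-thread converter state.
struct thread_data_t {
    // Virtual so that deleting a subclass through a thread_data_t pointer
    // (as unique_ptr<thread_data_t> does) runs the subclass destructor.
    virtual ~thread_data_t() = default;
    int tid = 0;
};

// Counts subclass destructor runs so the effect is observable.
static int extended_destroyed = 0;

// Hypothetical subclass adding its own state.
struct extended_thread_data_t : thread_data_t {
    ~extended_thread_data_t() override
    {
        ++extended_destroyed;
    }
    std::unique_ptr<int[]> extra_state = std::make_unique<int[]>(16);
};
```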

Issue: #5538
derekbruening added a commit that referenced this issue Nov 3, 2022
derekbruening added a commit that referenced this issue Nov 8, 2022
Adds a new memref_stream_t interface class which provides the record
and instruction count to drmemtrace analysis tools.  A pointer to this
interface is passed to new extended-argument versions of
analysis_tool_t's initialize() and parallel_shard_init() functions,
which are now what is called by the analyzer.  The base class
implementation of these new functions simply calls the old versions,
which are now deprecated but will continue to work.

This new interface is not just for convenience: the tool itself cannot
accurately count when the reader skips over records, as will happen
with seeking.  The counting must be done in the reader.  (If the tool
indeed wants to count only records/instrs that it actually sees, it
can continue using its own counters.)

Updates the view tool to use the new interface to obtain the record
ordinal, replacing its own counter.  The tool is expanded to print a
new column with the instruction ordinal.  The view tool test is
updated along with example output in the docs.

Issue: #5538
derekbruening added a commit that referenced this issue Nov 9, 2022
Adds a new memtrace_stream_t interface class which provides the record
and instruction count to drmemtrace analysis tools.  A pointer to this
interface is passed to new extended-argument versions of
analysis_tool_t's initialize() (initialize_stream()) and 
parallel_shard_init() (parallel_shard_init_stream()) functions,
which are now what is called by the analyzer.  The base class
implementation of these new functions simply calls the old versions,
which are now deprecated but will continue to work.

This new interface is not just for convenience: the tool itself cannot
accurately count when the reader skips over records, as will happen
with seeking.  The counting must be done in the reader.  (If the tool
indeed wants to count only records/instrs that it actually sees, it
can continue using its own counters.)
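A simplified sketch of the shape described above: the analyzer calls the new stream-taking entry point, whose base-class default forwards to the deprecated one, and the reader side, not the tool, owns the counters. These are stand-in declarations with assumed signatures, not the actual drmemtrace headers.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Simplified stand-in for the stream interface; the real class has more
// methods, and these exact signatures are assumptions.
class memtrace_stream_t {
public:
    virtual ~memtrace_stream_t() = default;
    virtual uint64_t get_record_ordinal() const = 0;
    virtual uint64_t get_instruction_ordinal() const = 0;
};

// Simplified stand-in for the tool base class.
class analysis_tool_t {
public:
    virtual ~analysis_tool_t() = default;
    // Old entry point: deprecated but still honored.
    virtual std::string initialize() { return ""; }
    // New entry point called by the analyzer. The default forwards to the
    // old one, so tools that only override initialize() keep working.
    virtual std::string initialize_stream(memtrace_stream_t *stream)
    {
        (void)stream;
        return initialize();
    }
};

// A new-style tool keeps the stream so it can ask for ordinals later; the
// reader maintains them even across records the tool never sees.
class ordinal_tool_t : public analysis_tool_t {
public:
    std::string initialize_stream(memtrace_stream_t *stream) override
    {
        stream_ = stream;
        return "";
    }
    uint64_t current_instr() const { return stream_->get_instruction_ordinal(); }

private:
    memtrace_stream_t *stream_ = nullptr;
};

// An old-style tool that overrides only initialize().
class legacy_tool_t : public analysis_tool_t {
public:
    bool initialized = false;
    std::string initialize() override
    {
        initialized = true;
        return "";
    }
};

// Fake reader-side counters, standing in for a reader that skipped ahead.
class fake_stream_t : public memtrace_stream_t {
public:
    uint64_t records = 0;
    uint64_t instrs = 0;
    uint64_t get_record_ordinal() const override { return records; }
    uint64_t get_instruction_ordinal() const override { return instrs; }
};
```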

We had considered other avenues for analysis_tool_t to obtain things like
the record and instruction ordinals within the stream, in the presence of
skipping: we could add fields to memref but we'd either have to append
and have them at different offsets for each type or we'd have to break
compatibility to prepend every time we added more; or we could add parameters
to process_memref().  Passing an interface to the init routines seems
the simplest and most flexible.

Updates the view tool to use the new interface to obtain the record
ordinal, replacing its own counter.  The tool is expanded to print a
new column with the instruction ordinal.  The view tool test is
updated along with example output in the docs.

Issue: #5538
derekbruening added a commit that referenced this issue Nov 11, 2022
Adds a new skip_instructions() reader iterator interface.  It is a
linear walk for every type of reader except a chunked zipfile walking
a single thread.

Adds a drcachesim command line option -skip_instrs which triggers the
analyzer to skip from the start before passing anything to the tool.
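A sketch of the linear-walk fallback over a hypothetical in-memory reader: it advances past N instructions and the data records that follow the last one, updating the record and instruction ordinals as it goes so they stay accurate for tools. Marker handling, such as re-presenting the last timestamp and cpu, is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical simplified record kinds for this sketch.
enum class rec_kind_t { INSTR, DATA, MARKER };

// Hypothetical in-memory reader used to illustrate a linear skip.
class linear_reader_t {
public:
    explicit linear_reader_t(std::vector<rec_kind_t> records)
        : records_(std::move(records))
    {
    }

    // Advances past num_instrs instructions (plus the records that follow
    // the last one), leaving the cursor on the next instruction. Ordinals
    // are updated for every record passed over.
    void
    skip_instructions(uint64_t num_instrs)
    {
        uint64_t skipped = 0;
        while (pos_ < records_.size()) {
            bool is_instr = records_[pos_] == rec_kind_t::INSTR;
            if (is_instr && skipped == num_instrs)
                break;
            if (is_instr) {
                ++skipped;
                ++instr_ordinal_;
            }
            ++record_ordinal_;
            ++pos_;
        }
    }

    size_t position() const { return pos_; }
    uint64_t record_ordinal() const { return record_ordinal_; }
    uint64_t instr_ordinal() const { return instr_ordinal_; }

private:
    std::vector<rec_kind_t> records_;
    size_t pos_ = 0;
    uint64_t record_ordinal_ = 0;
    uint64_t instr_ordinal_ = 0;
};
```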

Refactors reader_t's operator++ to provide a process_input_entry()
helper that updates state while skipping.

Adds a unit test with an added trace file with a small chunk size.
The test checks the view output for every skip value from 0 to over
double the chunk size.

Leaves several pieces for future work:
+ Recording the record count in each chunk so we have an accurate
  count after skipping.
+ Presenting global headers skipped over as memtrace_stream_t values
  that tools can query.
+ Reading the schedule files for serial skipping (or the planned cpu
  iterator and skipping).
+ Repeating the timestamp+cpu for non-zipfile skipping.

Issue: #5538
derekbruening added a commit that referenced this issue Nov 11, 2022
Adds a new skip_instructions() reader iterator interface.  It is a
linear walk for every type of reader except a chunked zipfile walking
a single thread.

Adds a drcachesim command line option -skip_instrs which triggers the
analyzer to skip from the start before passing anything to the tool.

Refactors reader_t's operator++ to provide a process_input_entry()
helper that updates state while skipping.

Adds a unit test with an added trace file with a small chunk size.
The test checks the view output for every skip value from 0 to over
double the chunk size.

Leaves several pieces for future work:
+ Full support for skipping from the middle: the timestamp,cpuid will
  not always be duplicated with the current code.
+ Recording the record count in each chunk so we have an accurate
  count after skipping.
+ Presenting global headers skipped over as memtrace_stream_t values
  that tools can query.
+ Reading the schedule files for serial skipping (or the planned cpu
  iterator and skipping).
+ Repeating the timestamp+cpu for non-zipfile skipping.

Issue: #5538
derekbruening added a commit that referenced this issue Nov 15, 2022
Adds a reader subclass to raw2trace for computing the memref_t record
count for each chunk.  A new marker is inserted in the chunk header
which the zipfile skip code uses to obtain the correct ref count when
skipping over chunks.
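A sketch of the fast path this enables, with hypothetical types: whole chunks before the target are skipped without reading them, adding each chunk's stored record count and the fixed instruction count to the ordinals; only the target chunk is then walked linearly.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical chunk header contents: the memref_t record count stored for
// the chunk when the trace was written.
struct chunk_header_t {
    uint64_t record_count;
};

struct ordinals_t {
    uint64_t records = 0;
    uint64_t instrs = 0;
};

// Skips every chunk that lies wholly before the target, crediting the stored
// record count and the fixed instruction count. Returns the index of the
// chunk containing the target; the caller walks the remaining
// skip_instrs % chunk_instr_count instructions inside it.
uint64_t
skip_whole_chunks(const std::vector<chunk_header_t> &headers, uint64_t chunk_instr_count,
                  uint64_t skip_instrs, ordinals_t *ordinals)
{
    assert(chunk_instr_count > 0);
    uint64_t target_chunk = skip_instrs / chunk_instr_count;
    assert(target_chunk < headers.size());
    for (uint64_t i = 0; i < target_chunk; ++i) {
        ordinals->records += headers[i].record_count;
        ordinals->instrs += chunk_instr_count;
    }
    return target_chunk;
}
```

Without the stored per-chunk record count, the record ordinal after a skip could not be known without reading the skipped chunks.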

Adds a count suppression feature where the view tool prints 0 for the
record count for the synthetic timestamp+cpu added after a seek.

Updates the seek test with a new trace and new expected output.

Issue: #5538
derekbruening added a commit that referenced this issue Nov 15, 2022
Adds cached values of the 5 top-level headers to the memtrace_stream_t
interface and implements this for the readers.
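A minimal sketch of header caching: the reader stores each top-level header value as it passes, whether or not the record is delivered, so a tool that starts after a skip can still query it. The five headers are not named in this commit message; the enum below (version, filetype, cache line size, chunk instruction count, page size) is an assumption, as are all the names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed set of top-level header kinds, for illustration only.
enum class header_t {
    VERSION,
    FILETYPE,
    CACHE_LINE_SIZE,
    CHUNK_INSTR_COUNT,
    PAGE_SIZE,
    NUM_HEADERS
};

// Remembers each header value the reader passes over, skipped or not.
class header_cache_t {
public:
    void
    on_marker(header_t kind, uint64_t value)
    {
        values_[static_cast<int>(kind)] = value;
    }

    // Returns the cached value, or 0 if that header has not been seen.
    uint64_t
    get(header_t kind) const
    {
        return values_[static_cast<int>(kind)];
    }

private:
    uint64_t values_[static_cast<int>(header_t::NUM_HEADERS)] = {};
};
```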

Adds checks of these values to invariant_checker.

Adds a test with skipped instructions using the skip_unit_tests
checked-in trace.

Reverses the order of the initial 2 markers and the tid,pid pair sent
to the reader to avoid a 0 tid in tools.  I am surprised this hasn't
caused more problems; I had thought it already worked this way.

Issue: #5538
derekbruening added a commit that referenced this issue Nov 16, 2022
derekbruening added a commit that referenced this issue Nov 17, 2022
derekbruening added a commit that referenced this issue Nov 17, 2022
Adds cached values of the 5 top-level headers to the memtrace_stream_t
interface and implements this for the readers.

Adds checks of these values to invariant_checker.

Adds a test with skipped instructions using the skip_unit_tests
checked-in trace.

Reverses the order of the initial 2 markers and the tid,pid pair sent
to the reader to avoid a 0 tid in tools.  I am surprised this hasn't
caused more problems; I had thought it already worked this way.
Long-term, we could perhaps swap them in the file itself, but this is
complex as it moves the version field.

Issue: #5538
@derekbruening
Contributor Author

There is a bug in how reader_t skips the duplicate top-of-chunk timestamp,cpu header pair: it assumes single-threaded operation and fails completely in serial mode. We saw this in larger traces with serial mode, and it can be reproduced in a small trace where the headers are skipped when there is no chunk:

        8        0: T3 <marker: timestamp 1001>
        9        0: T3 <marker: tid 3 on core 2>
       10        1: T3 ifetch       4 byte(s) @ 0x000000000000002a non-branch
       11        2: T3 ifetch       4 byte(s) @ 0x000000000000002a non-branch
------------------------------------------------------------
       12        3: T7 ifetch       4 byte(s) @ 0x000000000000002a non-branch
       13        4: T7 ifetch       4 byte(s) @ 0x000000000000002a non-branch
------------------------------------------------------------
       14        5: T3 ifetch       4 byte(s) @ 0x000000000000002a non-branch
------------------------------------------------------------
       15        5: T7 <marker: timestamp 1004>
       16        5: T7 <marker: tid 7 on core 3>
       17        6: T7 ifetch       4 byte(s) @ 0x000000000000002a non-branch

derekbruening added a commit that referenced this issue Jan 28, 2023
Fixes a bug where reader_t's detection of duplicated timestamp,cpuid
headers at the start of a chunk assumed single-threaded mode.  We
switch to using a simple per-tid chunk footer trigger.

Adds a test to view_test via a new serial mock which takes in
trace_entry_t and allows testing of the interleaving code.  Tests both
proper chunk header elision as well as replicating the bug where
elision should not happen.
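The per-tid trigger can be sketched as a small map keyed by thread: a chunk footer arms elision of that thread's next timestamp and cpu markers, so interleaved records from other threads, as in the T3/T7 output above, are unaffected. The class is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Tracks, per thread, whether the next timestamp and cpu markers are the
// duplicates that open a new chunk. Keying by tid keeps the state correct
// when records from several threads are interleaved (serial mode).
class chunk_header_elider_t {
public:
    void
    on_chunk_footer(int64_t tid)
    {
        pending_[tid] = 2; // Expect a duplicate timestamp, then a duplicate cpu.
    }

    // Call only for timestamp or cpu markers; returns true if this one
    // should be hidden from tools.
    bool
    should_elide(int64_t tid)
    {
        auto it = pending_.find(tid);
        if (it == pending_.end() || it->second == 0)
            return false;
        --it->second;
        return true;
    }

private:
    std::unordered_map<int64_t, int> pending_;
};
```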

The test revealed a separate bug in the view tool where the version
and filetype ordinals, for delaying, were not updated on new threads.
That is fixed here as well, since otherwise the new tests fail.

Issue: #5538
derekbruening added a commit that referenced this issue Jan 31, 2023
Fixes a bug where reader_t's detection of duplicated timestamp,cpuid
headers at the start of a chunk assumed single-threaded mode.  We
switch to using a simple per-tid chunk footer trigger.

Adds a test to view_test via a new serial mock which takes in
trace_entry_t and allows testing of the interleaving code.  Tests both
proper chunk header elision as well as replicating the bug where
elision should not happen.

Fixes problems this change revealed in the drcacheoff.skip test: the
ref count is no longer incremented for the hidden markers at the start
of a chunk, both when skipping in a zipfile and in raw2trace.

The test revealed a separate bug in the view tool where the version
and filetype ordinals, for delaying, were not updated on new threads.
That is fixed here as well, since otherwise the new tests fail.

Issue: #5538
derekbruening added a commit that referenced this issue Mar 9, 2023
Removes multi-input support from file_reader_t and other readers now
that the scheduler_t owns that.  Specifically:

+ Removes read_next_thread_entry() and requires that read_next_entry()
  always check the queue (via a provided helper function).

+ Removes skip_thread_instructions() and refactors the pre-skip header
  reading and the post-skip walking while remembering timestamps.
  Places these latter two inside reader_t for use by all readers, with
  zipfile overriding just the fast skip in the middle and sharing all
  the other code.  This refactoring and sharing solves the problem of
  missing timestamps when skipping from the middle.

+ Removes the arrays of data for multiple inputs from file_reader_t
  and all subclasses.
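The resulting structure resembles the template-method pattern, sketched below with hypothetical classes: the base reader owns the header reading and the post-skip walk, and a chunked reader overrides only the fast skip in the middle.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical base reader: the shared steps of a skip live here, and a
// subclass may override only the fast skip in the middle.
class reader_base_t {
public:
    virtual ~reader_base_t() = default;

    void
    skip_instructions(uint64_t count)
    {
        read_headers_before_skip();
        uint64_t skipped_fast = fast_skip(count);
        walk_remaining(count - skipped_fast);
    }

    // Instructions handled by the shared linear walk (for demonstration).
    uint64_t linear_instrs = 0;

protected:
    // Default: no fast path, so every instruction is walked linearly.
    virtual uint64_t
    fast_skip(uint64_t count)
    {
        (void)count;
        return 0;
    }

private:
    // Placeholder for reading the top-level headers before skipping.
    void
    read_headers_before_skip()
    {
    }
    // Placeholder for the linear walk, which in a real reader would also
    // remember the latest timestamp and cpu; here it only counts.
    void
    walk_remaining(uint64_t count)
    {
        linear_instrs += count;
    }
};

// A chunked reader jumps over whole chunks and leaves the remainder to the
// shared walk.
class chunked_reader_t : public reader_base_t {
public:
    explicit chunked_reader_t(uint64_t chunk_instr_count)
        : chunk_instr_count_(chunk_instr_count)
    {
        assert(chunk_instr_count > 0);
    }

protected:
    uint64_t
    fast_skip(uint64_t count) override
    {
        return count / chunk_instr_count_ * chunk_instr_count_;
    }

private:
    uint64_t chunk_instr_count_;
};
```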

Updates the view_test to use a scheduler for its multiple-input mock
reader.

While at it, removes is_complete().

Issue: #5843, #5538
derekbruening added a commit that referenced this issue Mar 13, 2023
derekbruening added a commit that referenced this issue Aug 17, 2023
Fixes a boundary case of skipping 1 instruction when the scheduler has
already read an instruction record but not yet passed it to the user.
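One way this boundary case can go wrong, sketched with a hypothetical one-record lookahead: if the scheduler has already pulled instruction k from the reader, a skip of 1 must consume that pending record rather than asking the reader to skip the next one; otherwise the user would still receive instruction k and lose k+1. This models only instruction records and is not the scheduler's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>
#include <utility>

// Hypothetical input of instruction ordinals with a one-record lookahead
// slot, modeling a scheduler that has read a record from the reader but
// not yet handed it to the user.
class lookahead_input_t {
public:
    explicit lookahead_input_t(std::deque<uint64_t> instrs)
        : reader_(std::move(instrs))
    {
    }

    // Pulls the next record into the lookahead slot without delivering it.
    void
    peek()
    {
        if (!pending_ && !reader_.empty()) {
            pending_ = reader_.front();
            reader_.pop_front();
        }
    }

    // Skips count instructions. The pending record was never delivered, so
    // it is the first one skipped.
    void
    skip_instructions(uint64_t count)
    {
        if (count > 0 && pending_) {
            pending_.reset();
            --count;
        }
        for (; count > 0 && !reader_.empty(); --count)
            reader_.pop_front();
    }

    // Delivers the next instruction ordinal, or nullopt at end of input.
    std::optional<uint64_t>
    next()
    {
        peek();
        std::optional<uint64_t> rec = pending_;
        pending_.reset();
        return rec;
    }

private:
    std::deque<uint64_t> reader_;
    std::optional<uint64_t> pending_;
};
```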

Fixes a boundary case of back-to-back regions of interest.

Adds test cases.

Issue: #5538
derekbruening added a commit that referenced this issue Aug 17, 2023
@derekbruening
Contributor Author

Split off repeating physical address markers into #6654.
