Create simulator interface which provides thread scheduling + speculative fetching #5843

derekbruening · 2023-01-31T02:06:41Z

Rather than having each simulator figure out how to schedule traced software threads onto simulated cores in their own ad hoc way, we would like to provide a scheduler service, which should result in several benefits:

- Ease of use: a new simulator use case has one less thing to implement
- Consistency: all simulators can now use the same approach
- Fill in gaps in trace-based simulation:
  - We can re-schedule threads even when simulating the recorded hardware to deflate context switches increased by tracing overhead
  - We can more easily combine multiple single-workload traces
  - We can provide speculative path fetching using various schemes (from heuristics to additional data recorded during tracing) to help bridge the gap with execution-driven simulation

Xref #5694: provide per-core iterator.
That may become subsumed by this new broader-scope feature.

Adds a new scheduler component to drmemtrace which provides flexibility in combining input traces and is meant to supply key features for simulation of traces. This first stage adds a base scheduler which only supports the two analyzer modes: parallel software thread streams or a single serial stream. The input file opening code and the input-to-worker code is moved from the analyzer to the scheduler. The analyzer now has to look at the tid fields in the stream records to identify shards to tools, but the input-to-worker does belong in the scheduler. Removes the analyzer external iterator interface; tools should instead use the scheduler directly. Updates histogram_launcher and two tests to do this. Adds a new scheduler unit test with a mocked reader that takes vectors of records, containing some initial sanity tests. The scheduler takes in either file paths and opens its own readers for those, or it can be passed readers. This latter interface is used for online IPC readers, as well as for the unit test using a mocked reader. The IPC reader requires a delayed init() call which is handled by paying for a flag check on each stream advance. To support -skip_instrs, region-of-interest code is implemented here. However, it requires fixing a problem in reader_t::skip_instructions() by adding a queue and a new use-prior-record method. (The queue can be merged with the file_reader_t queue later.) It might be nicer to separate that out but that would leave -skip_instrs not working. Future work includes moving the serial mode interleaving from the file reader to the scheduler, and then adding new scheduling and simulation features. Issue: #5843

derekbruening · 2023-02-17T00:07:02Z

There are many design points here; documenting some smaller ones and will probably put the rest in a separate doc:

There are lots of little issues with the scheduler. Here is one: the output
streams have to keep their own record counts (because they combine multiple
inputs). Yet the inputs sometimes "hide" records, like the synthetic headers
after a skip:

<--record#-> <--instr#->: <---tid---> <record details>
------------------------------------------------------------
           0          63:      296231 <marker: timestamp 13319413770947393>
           0          63:      296231 <marker: tid 296231 on core 10>
          90          64:      296231 ifetch       4 byte(s) @ 0x0000000000401028 48 83 eb 01          sub    $0x0000000000000001 %rbx -> %rbx
          91          65:      296231 ifetch       4 byte(s) @ 0x000000000040102c 48 83 fb 00          cmp    %rbx $0x0000000000000000

The scheduler, though, performs the skip and then asks the zipfile reader for
the new record# so it can update its own count. It is told 0, so it counts
from there and never sees the 90 two entries later (it only queries the
input on a skip):

<--record#-> <--instr#->: <---tid---> <record details>
------------------------------------------------------------
           0          63:      296231 <marker: timestamp 13319413770947393>
           1          63:      296231 <marker: tid 296231 on core 10>
           2          64:      296231 ifetch       4 byte(s) @ 0x0000000000401028 48 83 eb 01          sub    $0x0000000000000001 %rbx -> %rbx
           3          65:      296231 ifetch       4 byte(s) @ 0x000000000040102c 48 83 fb 00          cmp    %rbx $0x0000000000000000

What is the best solution?

Does it have to query both ordinals before and after every input advance?

Have separate "effective" and "presented" ordinals?

Use the same get_last_record_ordinal() proposed for scheduler-inserted
"doesn't count" cpuid markers with ordinals of 0?

Maybe these inserted records should all be reported with the same prior
ordinal, and we add a separate "inserted" flag that the view tool checks,
displaying 0 in that case. Or abandon the 0 and leave it blank or show
"--" or something similar? But will that be confusing if the view tool
shows one thing and the direct query shows another? I guess it's the 0
that's confusing: anything else seems compatible with the direct query
showing the prior record ordinal.

I seem to recall a prior discussion where we came up with the 0 and liked
it though, for synthetic records, which include the post-skip headers
above plus the scheduler inserting cpuid markers for synthetic schedules:
we decided those would not interrupt the original record count.

Decision:

- Add to memtrace_stream_t: is_current_record_synthetic()
- Remove reader_t games where record ordinals are 0 for a few records
- Have get_record_ordinal() return the previous record's ordinal for synthetic records not present in the original stream
- Have view_t use the new API to display "--" or some other non-numeric indicator for synthetic records
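
To make the decision concrete, here is a minimal sketch of the intended ordinal behavior. It is not the drmemtrace implementation: `sketch_record_t`, `sketch_stream_t`, and `ordinal_for_display()` are hypothetical names invented for illustration, and only `get_record_ordinal()` and `is_current_record_synthetic()` mirror the names proposed above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type for illustration only (not the drmemtrace layout).
struct sketch_record_t {
    std::string text;
    bool synthetic; // True if inserted (e.g., a post-skip header).
};

// Hypothetical stand-in for a stream exposing the two proposed queries.
class sketch_stream_t {
public:
    // prior_ordinal is the ordinal of the last real record before the
    // first record in 'records' (e.g., the record just before a skip target).
    sketch_stream_t(std::vector<sketch_record_t> records, uint64_t prior_ordinal)
        : records_(std::move(records)), ordinal_(prior_ordinal) {}

    // Advances to the next record; returns false at the end of the stream.
    bool next() {
        if (next_index_ >= records_.size())
            return false;
        current_synthetic_ = records_[next_index_++].synthetic;
        // Only records from the original stream advance the ordinal, so a
        // synthetic record reports the ordinal of the prior real record.
        if (!current_synthetic_)
            ++ordinal_;
        return true;
    }
    uint64_t get_record_ordinal() const { return ordinal_; }
    bool is_current_record_synthetic() const { return current_synthetic_; }

private:
    std::vector<sketch_record_t> records_;
    size_t next_index_ = 0;
    uint64_t ordinal_;
    bool current_synthetic_ = false;
};

// What a view-style tool could print in its record# column.
std::string ordinal_for_display(const sketch_stream_t &stream) {
    return stream.is_current_record_synthetic()
        ? "--"
        : std::to_string(stream.get_record_ordinal());
}
```

With the post-skip example above (two synthetic headers followed by real records 90 and 91, constructed with a prior ordinal of 89), the display column would read "--", "--", "90", "91", while a direct query during the headers returns 89.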

Adds a new scheduler component to drmemtrace which provides flexibility in combining input traces and is meant to supply key features for simulation of traces. This first stage adds a base scheduler which only supports the two analyzer modes: parallel software thread streams or a single serial stream. The input file opening code and the input-to-worker code is moved from the analyzer to the scheduler. The analyzer now has to look at the tid fields in the stream records to identify shards to tools, but the input-to-worker does belong in the scheduler. Removes the analyzer external iterator interface; tools should instead use the scheduler directly. Updates histogram_launcher and two tests to do this. Adds a new scheduler unit test with a mocked reader that takes vectors of records, containing some initial sanity tests. The scheduler takes in either file paths and opens its own readers for those, or it can be passed readers. This latter interface is used for online IPC readers, as well as for the unit test using a mocked reader. The IPC reader requires a delayed init() call which is handled by paying for a flag check on each stream advance. To support -skip_instrs, region-of-interest code is implemented here. However, it requires fixing a problem in reader_t::skip_instructions() by adding a queue and a new use-prior-record method. (The queue can be merged with the file_reader_t queue later.) It might be nicer to separate that out but that would leave -skip_instrs not working. To support skipping with multiple inputs, changes how synthetic records are treated: Eliminates synthetic records being considered to have a 0 record ordinal: instead they have the ordinal of the prior record. A new memtrace_stream_t function is_record_synthetic() is introduced for identifying synthetic records. This change is required to allow the scheduler_t layer to properly figure out output stream ordinals. Updates the reader, zipfile reader, and tests. Adds a new test to test both synthetic and real headers after a skip. Future work includes moving the serial mode interleaving from the file reader to the scheduler, and then adding new scheduling and simulation features. Issue: #5843

Implements timestamp ordering in scheduler_t rather than relying on the old implementation inside file_reader_t. Adds a sanity test. Fixes a bug with only_threads and adds a simple test. Removing the file_reader_t code, along with eliminating the thread-as-sub-reader API routines, will be done as a separate refactoring. Issue: #5843
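
As a rough illustration of timestamp ordering across inputs, the sketch below merges several per-thread record sequences by always taking the next record from the input with the lowest timestamp. It is a simplification under stated assumptions: every record carries its own timestamp (real traces carry timestamps in separate marker records), ties are broken by input index, and `timed_record_t` and `merge_by_timestamp()` are hypothetical names, not scheduler_t API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

// Hypothetical record: in this sketch each record carries a timestamp.
struct timed_record_t {
    uint64_t timestamp;
    int tid;
};

// K-way merge of inputs, each assumed to be already sorted by timestamp.
std::vector<timed_record_t>
merge_by_timestamp(const std::vector<std::vector<timed_record_t>> &inputs) {
    // Heap entries are (timestamp, input index, position within input).
    using entry_t = std::tuple<uint64_t, size_t, size_t>;
    std::priority_queue<entry_t, std::vector<entry_t>, std::greater<entry_t>> heap;
    size_t total = 0;
    for (size_t i = 0; i < inputs.size(); ++i) {
        total += inputs[i].size();
        if (!inputs[i].empty())
            heap.emplace(inputs[i][0].timestamp, i, size_t{ 0 });
    }
    std::vector<timed_record_t> out;
    while (!heap.empty()) {
        entry_t top = heap.top();
        heap.pop();
        size_t in = std::get<1>(top);
        size_t pos = std::get<2>(top);
        out.push_back(inputs[in][pos]);
        // Re-queue this input keyed by the timestamp of its next record.
        if (pos + 1 < inputs[in].size())
            heap.emplace(inputs[in][pos + 1].timestamp, in, pos + 1);
    }
    assert(out.size() == total); // Every input record is emitted exactly once.
    return out;
}
```

For two inputs with timestamps {10, 40} and {20, 30}, the merged order alternates as first, second, second, first.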

Removes multi-input support from file_reader_t and other readers now that the scheduler_t owns that. Specifically:
+ Removes read_next_thread_entry() and requires that read_next_entry() always check the queue (via a provided helper function).
+ Removes skip_thread_instructions() and refactors the pre-skip header reading and the post-skip walking while remembering timestamps. Places these latter two inside reader_t for use by all readers, with zipfile overriding just the fast skip in the middle and sharing all the other code. This refactoring and sharing solves the problem of missing timestamps when skipping from the middle.
+ Removes the arrays of data for multiple inputs from file_reader_t and all subclasses.

Updates the view_test to use a scheduler for its multiple-input mock reader. While at it, removes is_complete(). Issue: #5843, #5538

Adds get_input_stream_count() and get_input_stream_name() to the scheduler_t drmemtrace interface. Adds a test of these to the scheduler unit tests which uses real files and also serves as a test of only_threads for real files, whose code paths are different enough it had a bug which we fix here as well. Issue: #5843

Adds to the scheduler interface a query to obtain the current input stream's memtrace_stream_t handle. Adds a new scheduler flag SCHEDULER_USE_INPUT_ORDINALS and sets it by default for parallel mode so the output stream's ordinals are suppressed and instead the current input stream's ordinals are presented on the output stream. This fixes a problem where the parallel analysis tool framework saw accumulated ordinals across inputs.

Adds a similar flag SCHEDULER_USE_SINGLE_INPUT_ORDINALS which causes the first flag to be set if there is a single input and single output. This solves a serial mode problem where an analysis tool does want to see input gaps when there is no interleaving as there is only one input. Adds a test.

Also manually tested a real analysis tool to confirm by tweaking the view tool to operate in parallel:

Before:
===========================================================================
[analyzer] Worker 0 starting on trace shard 0 stream is 0x562a2b0ff480
           1           0:     3443916 <marker: version 4>
           2           0:     3443916 <marker: filetype 0x240>
...
        1479         585:     3443916 <thread 3443916 exited>
[analyzer] Worker 0 starting on trace shard 1 stream is 0x562a2b0ff480
------------------------------------------------------------
        1480         585:     3443921 <marker: version 4>
        1481         585:     3443921 <marker: filetype 0x240>
===========================================================================

After:
===========================================================================
[analyzer] Worker 0 starting on trace shard 0 stream is 0x555cebc44480
           1           0:     3443916 <marker: version 4>
           2           0:     3443916 <marker: filetype 0x240>
...
        1479         585:     3443916 <thread 3443916 exited>
[analyzer] Worker 0 starting on trace shard 1 stream is 0x555cebc44480
------------------------------------------------------------
           1           0:     3443921 <marker: version 4>
           2           0:     3443921 <marker: filetype 0x240>
===========================================================================

Issue: #5843
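
The difference between accumulated output ordinals and per-input ordinals can be sketched in a few lines. This is an illustration only, assuming inputs are consumed one after another on a single output; `presented_ordinals()` and its `use_input_ordinals` parameter are hypothetical stand-ins for the effect of SCHEDULER_USE_INPUT_ORDINALS, not the scheduler's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the record ordinal a tool would see for each record when the
// given inputs (specified only by their lengths) are consumed back to back.
std::vector<uint64_t>
presented_ordinals(const std::vector<size_t> &input_lengths, bool use_input_ordinals) {
    std::vector<uint64_t> out;
    uint64_t output_ordinal = 0; // Accumulates across all inputs.
    for (size_t len : input_lengths) {
        for (uint64_t input_ordinal = 1; input_ordinal <= len; ++input_ordinal) {
            ++output_ordinal;
            // With the flag set, the output's own count is suppressed and
            // the current input's ordinal is presented instead.
            out.push_back(use_input_ordinals ? input_ordinal : output_ordinal);
        }
    }
    assert(out.size() == output_ordinal);
    return out;
}
```

For inputs of length 3 and 2, the accumulated view yields 1, 2, 3, 4, 5 while the per-input view yields 1, 2, 3, 1, 2, matching the shape of the "Before" and "After" listings.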

Fixes some fencepost errors in scheduler input region of interest handling. Adds a test of regions of interest which actually contains timestamps, which is what revealed the errors. Refactors the scheduler unit tests to use trace_entry_t instead of memref_t, which is required to properly test the scheduler's input readers, as that is the record type they operate on. This results in no longer needing reader_t::use_prev() which is removed here. Issue: #5843

Adds initial support for MAP_TO_ANY_OUTPUT with multiple outputs. Uses a simple queue of ready-to-schedule inputs and implements an instruction-based scheduling quantum. Adds a test. Adds new types input_ordinal_t and output_ordinal_t and corresponding invalid constants and updates all existing code to use these. Issue: #5843
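
A simple ready queue with an instruction-based quantum might look roughly like the following. This is a sketch under several assumptions: inputs are named 'A', 'B', ... and consist only of instructions, outputs advance in lockstep one instruction at a time, and an idle output prints '.'. The function name `schedule_sketch()` is hypothetical and none of this is the scheduler_t code.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// remaining[i] is the instruction count of input i. Returns one string per
// output showing which input ran at each step.
std::vector<std::string>
schedule_sketch(std::vector<int> remaining, int num_outputs, int quantum) {
    assert(num_outputs > 0 && quantum > 0);
    std::deque<int> ready; // FIFO queue of ready-to-schedule inputs.
    for (size_t i = 0; i < remaining.size(); ++i) {
        if (remaining[i] > 0)
            ready.push_back(static_cast<int>(i));
    }
    std::vector<int> current(num_outputs, -1); // Input on each output, or -1.
    std::vector<int> used(num_outputs, 0);     // Instructions used this quantum.
    std::vector<std::string> trace(num_outputs);
    auto any_running = [&]() {
        for (int c : current) {
            if (c != -1)
                return true;
        }
        return false;
    };
    while (!ready.empty() || any_running()) {
        for (int out = 0; out < num_outputs; ++out) {
            if (current[out] == -1 && !ready.empty()) {
                current[out] = ready.front();
                ready.pop_front();
                used[out] = 0;
            }
            int in = current[out];
            if (in == -1) {
                trace[out] += '.'; // Nothing to run on this output.
                continue;
            }
            trace[out] += static_cast<char>('A' + in);
            --remaining[in];
            ++used[out];
            if (remaining[in] == 0) {
                current[out] = -1; // Input finished.
            } else if (used[out] == quantum) {
                ready.push_back(in); // Quantum expired: back of the queue.
                current[out] = -1;
            }
        }
    }
    return trace;
}
```

With three 3-instruction inputs, two outputs, and a quantum of 2, the first output runs "AACCC" and the second runs "BBAB.".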

Implements initial speculation support, supplying nops. Speculation is separated into its own class where we can fill in different strategies in the future. The start_speculation() function takes a flag controlling whether the scheduler queues up the current record and re-returns it as the first record after speculation stops. This is often what a simulator wants as it has to read the instruction record following a branch to determine whether it is on the wrong path, and it would like to resume with that already-read instruction after speculation. Adds a unit test. Issue: #5843
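
The queue-and-re-return behavior can be sketched with a small mock. This is illustrative only: `spec_stream_sketch_t` is a hypothetical class, records are plain strings, and the mock's `start_speculation()` takes only the queueing flag described above (any other parameters the real function may take are omitted here).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class spec_stream_sketch_t {
public:
    explicit spec_stream_sketch_t(std::vector<std::string> records)
        : records_(std::move(records)) {}

    // If queue_current_record is true, the most recently returned record is
    // saved and returned again as the first record after speculation stops.
    void start_speculation(bool queue_current_record) {
        assert(!speculating_);
        speculating_ = true;
        if (queue_current_record && have_current_) {
            queued_ = current_;
            have_queued_ = true;
        }
    }

    void stop_speculation() {
        assert(speculating_);
        speculating_ = false;
    }

    // Returns false at the end of the trace (never while speculating).
    bool next_record(std::string &out) {
        if (speculating_) {
            out = "nop"; // Wrong-path content is just nops in this scheme.
            return true;
        }
        if (have_queued_) {
            have_queued_ = false;
            out = queued_;
        } else if (pos_ < records_.size()) {
            out = records_[pos_++];
        } else {
            return false;
        }
        current_ = out;
        have_current_ = true;
        return true;
    }

private:
    std::vector<std::string> records_;
    size_t pos_ = 0;
    bool speculating_ = false;
    bool have_current_ = false;
    bool have_queued_ = false;
    std::string current_;
    std::string queued_;
};
```

A simulator would read the branch, read the following instruction to discover it is on the wrong path, call `start_speculation(true)`, consume nops, call `stop_speculation()`, and then see that following instruction again.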

Adds a lock for each input to enforce missing synchronization during scheduling decisions. Fixes a bug with the existing scheduler lock. Adds a multi-threaded test. Tested a similar multi-threaded test under ThreadSanitizer which now reports no races (it did before these code changes). Fixes #5843

Improves two instances of push_back by replacing with emplace_back. Issue: #5843

Add epoll_pwait2, sendmmsg, recvmmsg, and membarrier to the maybe-blocking syscall list. These don't always block: e.g., membarrier has some sub-operations for which it never blocks. Updates the DR syscall headers to include recently added syscalls, including epoll_pwait2. The uapi headers are only partly updated due to lack of easy access to a header to fill in the other SYS_ defines. Issue: #5843

Adds a new marker type TRACE_MARKER_TYPE_DIRECT_THREAD_SWITCH for use with custom kernel scheduling features where one thread directly switches to another on the same cpu. Refactors raw2trace marker processing code to allow a subclass to insert the new marker. Makes the raw2trace blocking syscall code virtual to allow a subclass to label custom syscalls as blocking. Given that the changes are used in separate code it is not simple to make a test of the raw2trace refactoring + virtual. For the marker: tests that use the marker will be forthcoming in scheduler_unit_tests. Issue: #5843

Adds a flexible priority queue class which tracks indices and so supports asking whether an entry is in the queue and removing an entry from anywhere in the queue. Adds a simple unit test. Changes the scheduler to use this new queue class, in anticipation of needing both new features to handle direct targeted thread switches. Issue: #5843
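
One common way to get both features (membership queries and removal from anywhere) is a binary heap paired with a map from entry to heap index. The sketch below shows that general technique; it is not the flexible_queue_t implementation, and `indexed_heap_sketch_t` and its method names are hypothetical. It assumes entries are unique and hashable.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Max-heap by Compare (like std::priority_queue) that tracks entry indices.
template <typename T, typename Compare = std::less<T>>
class indexed_heap_sketch_t {
public:
    bool push(const T &entry) {
        if (exists(entry))
            return false; // Entries must be unique.
        heap_.push_back(entry);
        index_[entry] = heap_.size() - 1;
        sift_up(heap_.size() - 1);
        return true;
    }
    bool exists(const T &entry) const { return index_.count(entry) != 0; }
    bool empty() const { return heap_.empty(); }
    const T &top() const { assert(!heap_.empty()); return heap_.front(); }
    void pop() { assert(!heap_.empty()); remove_at(0); }
    // Removes an entry from anywhere in the queue.
    bool erase(const T &entry) {
        auto it = index_.find(entry);
        if (it == index_.end())
            return false;
        remove_at(it->second);
        return true;
    }

private:
    void swap_entries(size_t a, size_t b) {
        std::swap(heap_[a], heap_[b]);
        index_[heap_[a]] = a;
        index_[heap_[b]] = b;
    }
    void sift_up(size_t i) {
        while (i > 0 && comp_(heap_[(i - 1) / 2], heap_[i])) {
            swap_entries(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    }
    void sift_down(size_t i) {
        for (;;) {
            size_t best = i, l = 2 * i + 1, r = 2 * i + 2;
            if (l < heap_.size() && comp_(heap_[best], heap_[l])) best = l;
            if (r < heap_.size() && comp_(heap_[best], heap_[r])) best = r;
            if (best == i) return;
            swap_entries(i, best);
            i = best;
        }
    }
    void remove_at(size_t i) {
        size_t last = heap_.size() - 1;
        index_.erase(heap_[i]);
        if (i != last) {
            // Move the last entry into the hole, then restore heap order.
            heap_[i] = std::move(heap_[last]);
            index_[heap_[i]] = i;
        }
        heap_.pop_back();
        if (i < heap_.size()) {
            sift_up(i);
            sift_down(i);
        }
    }
    std::vector<T> heap_;
    std::unordered_map<T, size_t> index_;
    Compare comp_;
};
```

The index map is what makes `exists()` constant time on average and `erase()` logarithmic, which is the kind of access a direct switch to a specific queued thread needs.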

Adds support for the TRACE_MARKER_TYPE_DIRECT_THREAD_SWITCH marker, when it appears after TRACE_MARKER_TYPE_MAYBE_BLOCKING_SYSCALL. The scheduler directly switches to the target thread if it is on the ready queue. Performing a forced migration if the target is running on another output is not yet implemented. Once i/o wait states are added, waking up a target thread will be added, but that is future work as well. Adds a simple unit test. Issue: #5843

Adds support for the TRACE_MARKER_TYPE_DIRECT_THREAD_SWITCH marker, when it appears after TRACE_MARKER_TYPE_MAYBE_BLOCKING_SYSCALL. The scheduler directly switches to the target thread if it is on the ready queue. Performing a forced migration if the target is running on another output is not yet implemented. Once i/o wait states are added, waking up a target thread will be added, but that is future work as well. Adds a DEPENDENCY_DIRECT_SWITCH_BITFIELD and renames DEPENDENCY_TIMESTAMPS to DEPENDENCY_TIMESTAMP_BITFIELD so we can combine them, and makes a new enum entry DEPENDENCY_TIMESTAMPS which combines the two bitfields, which is what nearly every use case should want while still giving us control and without really breaking compatibility (and by providing bits and combinations the enum type is all that's needed still). Adds a unit test where the schedule would clearly be different without the switch target. Issue: #5843
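
The bitfield arrangement described above can be pictured as follows. The three enumerator names come from the description; the enum type name, the numeric values, and the helper function are assumptions made for this sketch and may not match the real header.

```cpp
// Hypothetical type name and values; only the enumerator names are taken
// from the change description.
enum dependency_sketch_t {
    DEPENDENCY_TIMESTAMP_BITFIELD = 0x01,     // Assumed value.
    DEPENDENCY_DIRECT_SWITCH_BITFIELD = 0x02, // Assumed value.
    // The combination that nearly every use case should want.
    DEPENDENCY_TIMESTAMPS =
        (DEPENDENCY_TIMESTAMP_BITFIELD | DEPENDENCY_DIRECT_SWITCH_BITFIELD),
};

// Hypothetical helper: tests whether a setting includes direct switches.
inline bool honors_direct_switch(dependency_sketch_t dep) {
    return (dep & DEPENDENCY_DIRECT_SWITCH_BITFIELD) != 0;
}

static_assert((DEPENDENCY_TIMESTAMPS & DEPENDENCY_TIMESTAMP_BITFIELD) != 0,
              "combined value includes the timestamp bit");
```

Because the combined entry keeps the old name, existing callers passing DEPENDENCY_TIMESTAMPS pick up direct switches without a source change, while callers that want only one behavior can pass a single bit.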

Rather than context switching on every syscall labeled maybe-blocking, the scheduler uses the now-available syscall latency to decide whether the syscall should block and result in a context switch. Adds two new command line options, -sched_syscall_switch_us (default 500us) and -sched_blocking_switch_us (default 100us), and corresponding scheduler_t inputs, to control the latency thresholds. To avoid relying too much on the maybe-blocking labels, we do consider a very-high-latency syscall not marked as maybe-blocking to block. Adds a new unit test. Tested in a large proprietary app where this reduces the context switch rate from ~100x too high down to ~10x too high. The next step of adding i/o wait times should further improve the representativeness. Issue: #5843

Rather than context switching on every syscall labeled maybe-blocking, the scheduler uses the now-available syscall latency to decide whether the syscall should block and result in a context switch. Adds two new command line options, -sched_syscall_switch_us (default 500us) and -sched_blocking_switch_us (default 100us), and corresponding scheduler_t inputs, to control the latency thresholds. To avoid relying too much on the maybe-blocking labels, we do consider a very-high-latency syscall not marked as maybe-blocking to result in a context switch. Adds a new schedule_stats unit test. Tested in a large proprietary app where this reduces the context switch rate from ~100x too high down to ~10x too high. The next step of adding i/o wait times should further improve the representativeness. Issue: #5843
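
The latency decision can be sketched as a small predicate. The default values (500us and 100us) come from the description; everything else is an assumption for illustration: the struct and function names are hypothetical, timestamps are taken to be in microseconds, and whether the real comparison is strict or inclusive is not stated here, so this sketch arbitrarily uses `>=`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for the two thresholds, in microseconds.
struct switch_thresholds_sketch_t {
    uint64_t syscall_switch_us = 500;  // For syscalls not marked maybe-blocking.
    uint64_t blocking_switch_us = 100; // For syscalls marked maybe-blocking.
};

// Returns whether a syscall should be treated as causing a context switch,
// given the timestamps just before and just after it.
inline bool
syscall_causes_switch(uint64_t pre_timestamp_us, uint64_t post_timestamp_us,
                      bool maybe_blocking,
                      const switch_thresholds_sketch_t &thresholds = {}) {
    // Equal timestamps (zero latency) are legal.
    assert(pre_timestamp_us <= post_timestamp_us);
    uint64_t latency_us = post_timestamp_us - pre_timestamp_us;
    uint64_t threshold_us = maybe_blocking ? thresholds.blocking_switch_us
                                           : thresholds.syscall_switch_us;
    return latency_us >= threshold_us; // Inclusive comparison is an assumption.
}
```

Under these assumptions a maybe-blocking syscall taking 200us switches, the same latency on an unmarked syscall does not, and an unmarked syscall taking 600us still switches because it is very high latency.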

Fixes a < assert from PR #6458 to be <=, to allow the pre-syscall timestamp to equal the post-syscall timestamp. Adds a test that fails without the fix. Issue: #5843

Changes the quanta accounting to match the real kernel by accumulating it across executions if a prior execution was terminated early due to a voluntary context switch. Adds new testing, and updates old tests with the behavior change. Scheduler unit test string changes were carefully vetted. E.g., for test_synthetic_with_syscalls_multiple(): the output strings changed because H's quantum accumulates and it hits a preempt in the middle of its second HH sequence, which decrements B's quantum, causing B to become available sooner. Issue: #5843
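
The accounting change can be illustrated with a tiny counter. This is a sketch, not the scheduler's bookkeeping: `quantum_sketch_t` is a hypothetical class, the quantum is counted in instructions, and the `accumulate` flag exists only to contrast the old and new behavior.

```cpp
#include <cassert>
#include <cstdint>

class quantum_sketch_t {
public:
    quantum_sketch_t(uint64_t quantum, bool accumulate)
        : quantum_(quantum), accumulate_(accumulate) {
        assert(quantum_ > 0);
    }

    // Counts one executed instruction; returns true if the quantum has
    // expired and the input should be preempted.
    bool on_instruction() {
        if (++used_ >= quantum_) {
            used_ = 0; // A fresh quantum starts after a preempt.
            return true;
        }
        return false;
    }

    // Called when the input gives up the output early (e.g., it blocks).
    void on_voluntary_switch() {
        // New behavior: keep the partial usage so it carries over to the
        // input's next execution. Old behavior: start from zero again.
        if (!accumulate_)
            used_ = 0;
    }

private:
    uint64_t quantum_;
    bool accumulate_;
    uint64_t used_ = 0;
};
```

With a quantum of 3, an input that runs 2 instructions and then switches voluntarily is preempted after 1 more instruction when accumulating, versus 3 more when not.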

Adds a new scheduler option field honor_direct_switches and a corresponding command-line parameter -sched_disable_direct_switches to allow a way to disable direct thread switches, primarily for scheduling experimentation. Adds a unit test. Issue: #5843

Fixes an inconsistency in the CLI drmemtrace scheduler quantum and the internal API by making them both the same at 6 million. We pick 6 million to match 2 instructions per nanosecond with a 3ms quantum. The scheduler_launcher default is also made to match. Issue: #5843
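
The arithmetic behind the 6 million figure, written out as a compile-time check. The constant names are hypothetical; only the three input numbers and the result come from the description above.

```cpp
#include <cstdint>

constexpr uint64_t kInstrsPerNanosecond = 2;        // Assumed execution rate.
constexpr uint64_t kQuantumMilliseconds = 3;        // Target quantum duration.
constexpr uint64_t kNanosecondsPerMillisecond = 1000000;

// 2 instrs/ns * 3 ms * 1,000,000 ns/ms = 6,000,000 instructions.
constexpr uint64_t kQuantumInstrs =
    kInstrsPerNanosecond * kQuantumMilliseconds * kNanosecondsPerMillisecond;

static_assert(kQuantumInstrs == 6000000, "quantum should be 6 million instructions");
```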

derekbruening added Type-Feature Component-DrMemtrace labels Jan 31, 2023

derekbruening mentioned this issue Feb 16, 2023

i#5843 scheduler: Add memtrace scheduler and refactor analyzer #5877

Merged

derekbruening mentioned this issue Mar 7, 2023

i#5843 scheduler: Move timestamp ordering into scheduler #5895

Merged

derekbruening mentioned this issue Mar 9, 2023

i#5843 scheduler: Refactor readers to be single-input #5900

Merged

derekbruening mentioned this issue Mar 16, 2023

Add input stream queries #5915

Merged

derekbruening mentioned this issue Mar 22, 2023

i#5843 scheduler: Add input stream ordinal access #5924

Merged

derekbruening mentioned this issue Mar 23, 2023

i#5843 scheduler: Improve region skipping support #5925

Merged

derekbruening mentioned this issue Mar 24, 2023

i#5843 scheduler: Add simple interleaving #5928

Merged

derekbruening mentioned this issue May 1, 2023

i#5843 scheduler: Implement speculation with nops #6016

Merged

derekbruening mentioned this issue May 4, 2023

i#5843 scheduler: Add input locks #6029

Merged

derekbruening added a commit that referenced this issue Oct 9, 2023

i#5843 scheduler: Replace push_back with emplace_back (#6351)

3511be8

derekbruening mentioned this issue Oct 19, 2023

i#5843 scheduler: Mark more maybe-blocking syscalls #6380

Merged

derekbruening mentioned this issue Nov 1, 2023

i#5843 scheduler: Add direct thread switch marker #6404

Merged

derekbruening mentioned this issue Nov 7, 2023

i#5843 scheduler: Add flexible_queue_t and use it in scheduler_t #6414

Merged

This was referenced Nov 8, 2023

i#5843 scheduler: Add direct thread switch support #6424

Merged

Add drmemtrace schedule analysis tool #6426

Open

derekbruening mentioned this issue Nov 16, 2023

i#5843 scheduler: Only consider long-latency syscalls blocking #6458

Merged

derekbruening added a commit that referenced this issue Nov 17, 2023

i#5843 scheduler: Fix assert to allow 0 time

a69fad0

derekbruening mentioned this issue Nov 17, 2023

i#5843 scheduler: Fix assert to allow 0-latency syscalls #6460

Merged

derekbruening mentioned this issue Nov 22, 2023

Model i/o and idle time in drmemtrace scheduler #6471

Open

derekbruening mentioned this issue Dec 11, 2023

i#5843 scheduler: Accumulate quanta across runs #6502

Merged

derekbruening mentioned this issue Apr 11, 2024

i#5843 scheduler: Add option to disable direct switches #6770

Merged

derekbruening mentioned this issue Jun 26, 2024

i#5843 scheduler: Raise default drmemtrace sched quantum to 6M #6857

Merged
