drcachesim incorrectly counts each rep string iter as an ifetch #2051

derekbruening · 2016-11-01T19:45:25Z

Xref #2011

The original rep string loop has only one ifetch for the whole loop, while the drutil-expanded instrumentation in drcachesim has an ifetch per iteration.

This will be easy to solve for offline, but harder for online: in fact it seems that some kind of explicit iter count check is needed.

derekbruening · 2016-11-01T19:51:53Z

Offline now looks like this:

[drmemtrace]: Appending 1 instrs in bb 0x00007fb352d124ba in mod 3 +0x1a4ba = /usr/lib64/ld-2.21.so
 0x00007f946082f4ba rep stos %al %rdi %rcx -> %es:(%rdi)[1byte] %rdi %rcx
[drmemtrace]: Appended memref to 0x00007f945f97a028
[drmemtrace]: Appending 1 instrs in bb 0x00007fb352d124ba in mod 3 +0x1a4ba = /usr/lib64/ld-2.21.so
[drmemtrace]: Skipping instr fetch for 0x00007fb352d124ba
[drmemtrace]: Appended memref to 0x00007f945f97a029
[drmemtrace]: Appending 1 instrs in bb 0x00007fb352d124ba in mod 3 +0x1a4ba = /usr/lib64/ld-2.21.so
[drmemtrace]: Skipping instr fetch for 0x00007fb352d124ba
[drmemtrace]: Appended memref to 0x00007f945f97a02a

derekbruening · 2016-11-01T20:13:11Z

514e90d i#2051 drcachesim repstr: skip repstr ifetch for offline

derekbruening · 2017-11-30T20:38:44Z

We'd like to support feeding our traces to core simulators as well as cache simulators.
But, while a cache simulator expects a single instr fetch followed by N data
refs, a core simulator expects an instr with each iteration.

A middle ground would be to include the iteration count with the first rep instr,
though this complicates the currently-unified trace format.

Could there be a thread switch in the middle of a rep loop? Yes, but
raw2trace will insert a rep instr when the thread resumes.

Since it’s easier to ignore the instr if it matches the prior instr than to
look ahead, we could go back to including the instr in each iteration and
force cache simulators to add special handling.
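The "ignore the instr if it matches the prior instr" idea can be sketched roughly as follows. This is a minimal illustration and not drcachesim code: the `Entry` type, its `kind` labels, and `icache_fetches` are hypothetical names invented for this example. One caveat it makes visible is that a plain PC comparison would also drop the re-fetch of a one-instruction loop that branches to itself, so a real simulator would likely also need to know that the instr is a rep string.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Entry:
    kind: str  # hypothetical labels: "instr", "load", or "store"
    addr: int  # PC for "instr" entries, data address otherwise


def icache_fetches(trace: Iterable[Entry]) -> Iterator[Entry]:
    """Yield the instr entries a cache simulator would count as fetches.

    An instr entry whose PC equals the PC of the previous instr entry is
    treated as a later iteration of the same rep string loop and skipped.
    """
    prev_pc: Optional[int] = None
    for entry in trace:
        if entry.kind != "instr":
            continue  # data refs do not affect the comparison
        if entry.addr != prev_pc:
            yield entry
        prev_pc = entry.addr
```

A core simulator would consume the same trace without this filter and see one instr per iteration.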

derekbruening · 2017-12-01T20:07:59Z

I looked at some other tools to see how they handle rep string instructions. I made a tiny app with no crt that just executes some assembly:

# To build:
# as -o allasm_rep64.o allasm_rep64.s && gcc -static -o allasm_rep64 allasm_rep64.o -nostartfiles
.text
.globl _start
.type _start, @function        
        .align   8
_start:
        and      $-16, %rsp         # align stack pointer to 16 bytes
# rep loop
        mov      $128, %rcx           # loop counter
        mov      $0,%eax
        mov      %rsp,%rdi
        std
        rep stosq
        cld        
# print hello
        mov      $1, %rdi           # stdout
        mov      $hello, %rsi
        mov      $13, %rdx          # length of hello, excluding the NUL
        movl     $1, %eax           # SYS_write
        syscall
# exit
        mov      $0, %rdi          # exit code
        mov      $231, %eax        # SYS_exit_group
        syscall
        .data
        .align   8
hello:
        .string   "Hello world!\n"

No matter how many iterations I give the rep stos loop, the hardware perfctrs count it as one instruction:

$ perf stat -e instructions:u -- ./allasm_rep64
Hello world!

 Performance counter stats for './allasm_rep64':

                16      instructions:u

Other stats, all with :u in the command line:

   <not supported>      L1-icache-loads          
                 4      L1-icache-loads-misses                                      
                26      cache-references:u                                          
                 3      branch-instructions:u                                       
                 6      L1-dcache-loads                                             
               128      L1-dcache-stores                                            
                 3      branch-loads

This is as expected, and where L1-icache-loads is supported we'd expect to see only one load per loop, not one per iter.

Yet cachegrind has an L1i ref per iter:

$ valgrind --tool=cachegrind -- ~/dr/test/allasm_rep64
==5097== Cachegrind, a cache and branch-prediction profiler
==5097== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==5097== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==5097== Command: /home/bruening/dr/test/allasm_rep64
==5097== 
--5097-- warning: L3 cache found, using its data for the LL simulation.
Hello world!
==5097== 
==5097== I   refs:      143
==5097== I1  misses:      2
==5097== LLi misses:      2
==5097== I1  miss rate: 1.39%
==5097== LLi miss rate: 1.39%
==5097== 
==5097== D   refs:      128  (0 rd   + 128 wr)
==5097== D1  misses:     17  (0 rd   +  17 wr)
==5097== LLd misses:     17  (0 rd   +  17 wr)
==5097== D1  miss rate: 13.2% (0.0%     + 13.2%  )
==5097== LLd miss rate: 13.2% (0.0%     + 13.2%  )
==5097== 
==5097== LL refs:        19  (2 rd   +  17 wr)
==5097== LL misses:      19  (2 rd   +  17 wr)
==5097== LL miss rate:  7.0% (1.3%     + 13.2%  )

Simple simulators like Pin's icache sample also have a ref per iter.

That doesn't mean that we should follow suit and be inaccurate in our simulator, but it wouldn't be unprecedented to do so.
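As a rough consistency check on the numbers above: if perf's count of 16 is taken as the baseline and each of the 128 rep iterations after the first adds one more instruction reference, the total comes out to cachegrind's 143. This is only one possible accounting, not something either tool states; the function name below is made up for the example.

```python
def per_iteration_fetches(retired_instructions: int, rep_iterations: int) -> int:
    """I refs expected if every rep iteration after the first adds one fetch."""
    return retired_instructions + max(rep_iterations - 1, 0)


perf_instructions = 16  # from `perf stat -e instructions:u`
rep_iterations = 128    # initial %rcx, matching the 128 reported stores

print(per_iteration_fetches(perf_instructions, rep_iterations))  # prints 143
```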

derekbruening · 2017-12-02T19:23:27Z

My plan is:

To satisfy both cache and core simulators we mark subsequent iterations of
rep string loops with a new trace entry type TRACE_TYPE_INSTR_NO_FETCH,
which cache simulators can ignore.  For offline traces, raw2trace does this
for us.  Since online traces would need extra overhead to distinguish the
first from subsequent iters, they use a new internal type
TRACE_TYPE_INSTR_MAYBE_FETCH which is converted by reader_t.
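A rough sketch of how this plan could look from a tool's point of view is below. The enum member names mirror the entry types named above, but everything else is an assumption made for illustration: the values, the `(type, address)` record shape, the function names, and in particular the rule that a MAYBE_FETCH entry becomes NO_FETCH when its PC matches the previous instr. The issue does not describe how reader_t actually performs the conversion.

```python
from enum import Enum, auto
from typing import Dict, Iterable, Iterator, Optional, Tuple


class TraceType(Enum):
    # Names mirror the issue text; members and values are illustrative only.
    INSTR = auto()
    INSTR_MAYBE_FETCH = auto()
    INSTR_NO_FETCH = auto()
    WRITE = auto()


Record = Tuple[TraceType, int]  # (entry type, PC or data address)


def resolve_maybe_fetch(entries: Iterable[Record]) -> Iterator[Record]:
    """Convert online MAYBE_FETCH entries into INSTR or INSTR_NO_FETCH."""
    prev_pc: Optional[int] = None
    for typ, addr in entries:
        if typ is TraceType.INSTR_MAYBE_FETCH:
            # Assumed rule: same PC as the previous instr means a later iteration.
            typ = TraceType.INSTR_NO_FETCH if addr == prev_pc else TraceType.INSTR
        if typ in (TraceType.INSTR, TraceType.INSTR_NO_FETCH):
            prev_pc = addr
        yield typ, addr


def count_entries(entries: Iterable[Record]) -> Dict[str, int]:
    """Tally fetched instrs, no-fetch instrs, and data refs separately."""
    counts = {"fetched_instrs": 0, "no_fetch_instrs": 0, "data_refs": 0}
    for typ, _addr in resolve_maybe_fetch(entries):
        if typ is TraceType.INSTR:
            counts["fetched_instrs"] += 1
        elif typ is TraceType.INSTR_NO_FETCH:
            counts["no_fetch_instrs"] += 1
        else:
            counts["data_refs"] += 1
    return counts
```

A cache simulator would feed only the `INSTR` entries to its instruction cache, while a core simulator would treat `INSTR` and `INSTR_NO_FETCH` alike.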

The change also adds instr_is_string_op() and instr_is_rep_string_op() to DR's API to facilitate this, adds no-fetch stats to the basic_counts tool, and updates the basic_counts tests. Fixes #2051

derekbruening added Component-DRTool Bug-ToolFail labels Nov 1, 2016

derekbruening self-assigned this Dec 2, 2017

derekbruening mentioned this issue Dec 2, 2017

i#2051 rep ifetch: add no-fetch entries for rep strings #2730

Merged

derekbruening closed this as completed in #2730 Dec 2, 2017

derekbruening added the Component-DrMemtrace label Oct 30, 2018

derekbruening mentioned this issue May 17, 2021

i#2985 scatter-gather: Fix drcachesim and raw2trace issues. #4912

Merged

abhinav92003 mentioned this issue May 18, 2021

Consider removing non-fetched instrs for repstr #4915

Open
