drcachesim incorrectly counts each rep string iter as an ifetch #2051

Closed
derekbruening opened this issue Nov 1, 2016 · 5 comments

Comments

@derekbruening
Contributor

Xref #2011

The original rep string loop has only one ifetch for the whole loop, while the drutil-expanded instrumentation in drcachesim produces an ifetch per iteration.

This will be easy to solve for offline traces but harder for online ones: it seems that some kind of explicit iteration count check is needed.

@derekbruening
Contributor Author

Offline now looks like this:

[drmemtrace]: Appending 1 instrs in bb 0x00007fb352d124ba in mod 3 +0x1a4ba = /usr/lib64/ld-2.21.so
 0x00007f946082f4ba rep stos %al %rdi %rcx -> %es:(%rdi)[1byte] %rdi %rcx
[drmemtrace]: Appended memref to 0x00007f945f97a028
[drmemtrace]: Appending 1 instrs in bb 0x00007fb352d124ba in mod 3 +0x1a4ba = /usr/lib64/ld-2.21.so
[drmemtrace]: Skipping instr fetch for 0x00007fb352d124ba
[drmemtrace]: Appended memref to 0x00007f945f97a029
[drmemtrace]: Appending 1 instrs in bb 0x00007fb352d124ba in mod 3 +0x1a4ba = /usr/lib64/ld-2.21.so
[drmemtrace]: Skipping instr fetch for 0x00007fb352d124ba
[drmemtrace]: Appended memref to 0x00007f945f97a02a

@derekbruening
Contributor Author

  • 514e90d i#2051 drcachesim repstr: skip repstr ifetch for offline

@derekbruening
Contributor Author

We'd like to support feeding our traces to core simulators as well as cache simulators.
But, while a cache simulator expects a single instr fetch followed by N data
refs, a core simulator expects an instr with each iteration.

A middle ground would be to include the iteration count with the first rep
instr, though this complicates the currently-unified trace format.

Could there be a thread switch in the middle of a rep loop? Yes, but
raw2trace will insert a rep instr when the thread resumes.

Since it’s easier to ignore the instr if it matches the prior instr than to
look ahead, we could go back to including the instr in each iteration and
force cache simulators to add special handling.
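
The last option, where every iteration carries an instr entry and the cache simulator drops the repeats, might look like the following minimal sketch. The tuple-based trace layout and the name `count_icache_fetches` are hypothetical stand-ins for illustration, not drcachesim's actual data structures.

```python
def count_icache_fetches(trace):
    """Count instruction fetches, treating consecutive instr entries at the
    same PC as iterations of one rep string loop (a single fetch).

    `trace` is an iterable of (kind, addr) tuples, where kind is "instr" or
    "data". This layout is a hypothetical simplification.
    """
    fetches = 0
    prev_instr_pc = None
    for kind, addr in trace:
        if kind != "instr":
            continue  # data refs never change the fetch count
        if addr != prev_instr_pc:
            fetches += 1  # a new PC: count a real fetch
        prev_instr_pc = addr
    return fetches


# Three iterations of a rep stos at PC 0x4ba, each followed by its store.
rep_trace = [
    ("instr", 0x4BA), ("data", 0x1000),
    ("instr", 0x4BA), ("data", 0x1001),
    ("instr", 0x4BA), ("data", 0x1002),
]
print(count_icache_fetches(rep_trace))  # 1
```

Note that a PC-only comparison like this would also drop a genuinely re-fetched instruction that branches to itself, so a real simulator would likely need to check that the instruction is a rep string op as well.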

@derekbruening
Contributor Author

I looked at some other tools to see how they handle rep string instructions. I made a tiny app with no C runtime that just executes some assembly:

# To build:
# as -o allasm_rep64.o allasm_rep64.s && gcc -static -o allasm_rep64 allasm_rep64.o -nostartfiles
.text
.globl _start
.type _start, @function        
        .align   8
_start:
        and      $-16, %rsp         # align stack pointer to 16 bytes
# rep loop
        mov      $128, %rcx           # loop counter
        mov      $0,%eax
        mov      %rsp,%rdi
        std
        rep stosq
        cld        
# print hello
        mov      $1, %rdi           # stdout
        mov      $hello, %rsi
        mov      $13, %rdx          # length of hello, excluding the NUL terminator
        movl     $1, %eax           # SYS_write
        syscall
# exit
        mov      $0, %rdi          # exit code
        mov      $231, %eax        # SYS_exit_group
        syscall
        .data
        .align   8
hello:
        .string   "Hello world!\n"

No matter how many iterations I give the rep stos loop, the hardware performance counters count it as one instruction:

$ perf stat -e instructions:u -- ./allasm_rep64
Hello world!

 Performance counter stats for './allasm_rep64':

                16      instructions:u                                              

Other stats, all with :u in the command line:

   <not supported>      L1-icache-loads          
                 4      L1-icache-loads-misses                                      
                26      cache-references:u                                          
                 3      branch-instructions:u                                       
                 6      L1-dcache-loads                                             
               128      L1-dcache-stores                                            
                 3      branch-loads                                                

This is as expected, and where L1-icache-loads is supported we'd expect to see only one load per loop, not one per iteration.

Yet cachegrind has an L1i ref per iter:

$ valgrind --tool=cachegrind -- ~/dr/test/allasm_rep64
==5097== Cachegrind, a cache and branch-prediction profiler
==5097== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==5097== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==5097== Command: /home/bruening/dr/test/allasm_rep64
==5097== 
--5097-- warning: L3 cache found, using its data for the LL simulation.
Hello world!
==5097== 
==5097== I   refs:      143
==5097== I1  misses:      2
==5097== LLi misses:      2
==5097== I1  miss rate: 1.39%
==5097== LLi miss rate: 1.39%
==5097== 
==5097== D   refs:      128  (0 rd   + 128 wr)
==5097== D1  misses:     17  (0 rd   +  17 wr)
==5097== LLd misses:     17  (0 rd   +  17 wr)
==5097== D1  miss rate: 13.2% (0.0%     + 13.2%  )
==5097== LLd miss rate: 13.2% (0.0%     + 13.2%  )
==5097== 
==5097== LL refs:        19  (2 rd   +  17 wr)
==5097== LL misses:      19  (2 rd   +  17 wr)
==5097== LL miss rate:  7.0% (1.3%     + 13.2%  )

Simple simulators like Pin's icache sample also have a ref per iter.

That doesn't mean that we should follow suit and be inaccurate in our simulator, but it wouldn't be unprecedented to do so.
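
As a quick sanity check on the numbers quoted above, the gap between cachegrind's I refs and the perf instruction count can be compared with the loop's iteration count. The variable names below are only for illustration; the values are copied from the perf and cachegrind output in this comment.

```python
perf_instructions = 16    # instructions:u from perf stat
cachegrind_i_refs = 143   # "I refs" from cachegrind
rep_iterations = 128      # %rcx loaded before rep stosq

extra_refs = cachegrind_i_refs - perf_instructions
print(extra_refs)                    # 127
print(rep_iterations - extra_refs)   # 1
```

The gap of 127 is within one of the 128 iterations, which is consistent with cachegrind counting roughly one instruction reference per iteration. The outputs above do not show where the off-by-one comes from.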

@derekbruening
Contributor Author

My plan is:

To satisfy both cache and core simulators we mark subsequent iterations of
rep string loops with a new trace entry type TRACE_TYPE_INSTR_NO_FETCH,
which cache simulators can ignore.  For offline traces, raw2trace does this
for us.  Since online traces would need extra overhead to distinguish the
first from subsequent iters, they use a new internal type
TRACE_TYPE_INSTR_MAYBE_FETCH which is converted by reader_t.
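
A rough sketch of how this plan could play out for an online trace is below. The `EntryType`, `Entry`, `resolve_maybe_fetch`, and `count_instrs` names are hypothetical Python stand-ins, not the actual reader_t or trace entry definitions, and the rule used to classify a maybe-fetch entry (same PC as the previous instruction entry) is an assumption made for illustration. It also handles only a single thread.

```python
from dataclasses import dataclass
from enum import Enum, auto


class EntryType(Enum):
    # Stand-ins for the trace entry types named above.
    INSTR = auto()              # instruction that is fetched
    INSTR_MAYBE_FETCH = auto()  # online only: rep string iteration, not yet classified
    INSTR_NO_FETCH = auto()     # second or later rep string iteration
    WRITE = auto()              # data store


@dataclass(frozen=True)
class Entry:
    type: EntryType
    addr: int


def resolve_maybe_fetch(entries):
    """Reader-side pass (single thread): classify each MAYBE_FETCH entry.

    Assumption: an iteration is "subsequent" when the previous instruction
    entry had the same PC.
    """
    prev_instr_pc = None
    for e in entries:
        if e.type is EntryType.INSTR_MAYBE_FETCH:
            same = e.addr == prev_instr_pc
            e = Entry(EntryType.INSTR_NO_FETCH if same else EntryType.INSTR, e.addr)
        if e.type in (EntryType.INSTR, EntryType.INSTR_NO_FETCH):
            prev_instr_pc = e.addr
        yield e


def count_instrs(entries, include_no_fetch):
    """Cache simulators pass include_no_fetch=False; core simulators pass True."""
    wanted = {EntryType.INSTR}
    if include_no_fetch:
        wanted.add(EntryType.INSTR_NO_FETCH)
    return sum(1 for e in entries if e.type in wanted)


# An ordinary instruction, then three iterations of a rep stos at PC 0x4ba.
online = [Entry(EntryType.INSTR, 0x4B0)]
for i in range(3):
    online.append(Entry(EntryType.INSTR_MAYBE_FETCH, 0x4BA))
    online.append(Entry(EntryType.WRITE, 0x1000 + i))

resolved = list(resolve_maybe_fetch(online))
print(count_instrs(resolved, include_no_fetch=False))  # cache simulator view: 2
print(count_instrs(resolved, include_no_fetch=True))   # core simulator view: 4
```

The point of the split is that both kinds of simulator consume the same resolved stream: the cache simulator sees one fetch for the whole loop, while the core simulator still sees one instruction per iteration.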

@derekbruening derekbruening self-assigned this Dec 2, 2017
derekbruening added a commit that referenced this issue Dec 2, 2017
To satisfy both cache and core simulators we mark subsequent iterations of
rep string loops with a new trace entry type TRACE_TYPE_INSTR_NO_FETCH,
which cache simulators can ignore.  For offline traces, raw2trace does this
for us.  Since online traces would need extra overhead to distinguish the
first from subsequent iters, they use a new internal type
TRACE_TYPE_INSTR_MAYBE_FETCH which is converted by reader_t.

Adds instr_is_string_op() and instr_is_rep_string_op() to DR's API to
facilitate this.

Adds no-fetch stats to the basic_counts tool and updates the basic_counts
tests.

Fixes #2051
fhahn pushed a commit that referenced this issue Dec 4, 2017