
reduce overhead of malloc tracking #460

Open
derekbruening opened this issue Nov 28, 2014 · 12 comments

@derekbruening

From bruen...@google.com on June 16, 2011 10:24:33

below are perf numbers for -leaks_only -no_count_leaks vs native (each line shows the slowdown ratio, with the drmem and native times in parens).

-no_count_leaks disables stack zeroing, so this is the cost of malloc
tracking (w/o recording callstacks): this is way too high!
this is one barrier to higher performance with -no_check_uninitialized.

spec2k6cmpave namedres/leaksonly-nocount-ref/ namedres/native-ref/
400.perlbench 3.77 ( 1897 / 503)
401.bzip2 1.10 ( 951 / 865)
403.gcc 1.75 ( 750 / 429)
429.mcf 1.05 ( 393 / 373)
445.gobmk 1.58 ( 1009 / 639)
456.hmmer 1.09 ( 1301 / 1195)
458.sjeng 1.77 ( 1381 / 782)
462.libquantum 1.07 ( 793 / 739)
464.h264ref 1.67 ( 1751 / 1049)
471.omnetpp 3.72 ( 1559 / 419)
473.astar 1.32 ( 874 / 660)
483.xalancbmk 3.81 ( 1331 / 349)
410.bwaves 1.17 ( 875 / 751)
416.gamess 1.16 ( 1610 / 1392)
433.milc 1.14 ( 706 / 620)
434.zeusmp 1.14 ( 910 / 795)
435.gromacs 1.02 ( 1198 / 1170)
436.cactusADM 1.02 ( 1653 / 1617)
437.leslie3d 1.04 ( 851 / 816)
444.namd 1.03 ( 761 / 741)
447.dealII 2.17 ( 1299 / 599)
450.soplex 1.21 ( 488 / 404)
453.povray 1.74 ( 627 / 360)
454.calculix 1.08 ( 1616 / 1497)
459.GemsFDTD 1.05 ( 983 / 935)
465.tonto 4.29 ( 3613 / 842)
470.lbm 1.23 ( 795 / 648)
481.wrf 1.95 ( 2459 / 1258)
482.sphinx3 1.13 ( 840 / 741)
average 1.66

back when we first created Dr. Heapstat I spent some time optimizing malloc
tracking, including recording callstacks, and I sure thought I had
performance better than this (I was looking at spec2000 and some
malloc-intensive benchmarks like roboop and cfrac). something could have
regressed.

from profiling, some of this cost is from maintaining the hashtables and
delay-free queue, much of which could go away w/ malloc replacement b/c the
header could be used more freely; or perhaps a larger redzone could be used,
with more data than just the size stored directly in it (see the sketch
below). early injection would make that much easier, since today no-redzone
allocs must be handled: that's one reason for the current design.
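
as a sketch of that redzone idea (hypothetical names and layout, not the actual implementation), the requested size can live in a header at the start of the front redzone so that size queries need no hashtable or lock:

#include <stdlib.h>
#include <stdint.h>
#include <string.h>

#define REDZONE_SIZE 16          /* assumed redzone width */
#define HDR_MAGIC    0x7a7a7a7aU /* distinguishes our allocs from pre-us ones */

typedef struct {
    uint32_t magic;
    uint32_t user_size;          /* requested size, kept in the redzone */
} redzone_hdr_t;

static void *
replaced_malloc(size_t size)
{
    char *raw = malloc(size + 2 * REDZONE_SIZE);
    if (raw == NULL)
        return NULL;
    redzone_hdr_t hdr = { HDR_MAGIC, (uint32_t)size };
    memcpy(raw, &hdr, sizeof(hdr));  /* header at start of the front redzone */
    return raw + REDZONE_SIZE;       /* the app sees only the inner region */
}

static size_t
replaced_malloc_size(void *p)        /* no hashtable, no lock: one read */
{
    redzone_hdr_t hdr;
    memcpy(&hdr, (char *)p - REDZONE_SIZE, sizeof(hdr));
    return hdr.magic == HDR_MAGIC ? (size_t)hdr.user_size : 0;
}

static void
replaced_free(void *p)
{
    free((char *)p - REDZONE_SIZE);  /* back to the raw pointer */
}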

Original issue: http://code.google.com/p/drmemory/issues/detail?id=460

@derekbruening

From bruen...@google.com on December 07, 2011 08:43:30

studying cfrac on windows:

symcache has no effect as 2nd run is no faster than 1st:
% /usr/bin/time ~/drmemory/git/build_drmem_rel/bin/drmemory.exe -dr c:/src/dr/git/exports -batch -- ./cfrac.exe 41757646344123832613190542166099121
0.01user 0.01system 2:23.43elapsed 0%CPU (0avgtext+0avgdata 238336maxresident)k
0.00user 0.00system 2:25.32elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k

% for i in "-no_count_leaks" "-no_check_uninitialized" "-no_check_uninitialized -no_count_leaks" "-leaks_only -no_count_leaks -no_zero_stack" "-perturb_only" "-leaks_only -no_count_leaks -no_zero_stack -no_track_allocs"; do echo $i; /usr/bin/time ~/drmemory/git/build_drmem_rel/bin/drmemory.exe $i -quiet -dr c:/src/dr/git/exports -batch -- ./cfrac.exe 41757646344123832613190542166099121; done
-no_count_leaks
0.00user 0.00system 1:18.67elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
-no_check_uninitialized
0.00user 0.01system 1:46.22elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
-no_check_uninitialized -no_count_leaks
0.00user 0.03system 0:45.65elapsed 0%CPU (0avgtext+0avgdata 234240maxresident)k
-leaks_only -no_count_leaks -no_zero_stack
0.00user 0.00system 0:35.09elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
-perturb_only
0.00user 0.03system 0:10.37elapsed 0%CPU (0avgtext+0avgdata 233728maxresident)k
-leaks_only -no_count_leaks -no_zero_stack -no_track_allocs
0.00user 0.01system 0:03.44elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k

-leaks_only -no_count_leaks -no_zero_stack + empty hook functions
(alloc_hook, alloc_syscall_hook, and handle_alloc_post)
(though I had to give handle_alloc_post a conditional rather than leaving it
entirely empty, else it crashed b/c of some optimization; see the sketch below)
0.00user 0.00system 0:06.41elapsed 0%CPU (0avgtext+0avgdata 234240maxresident)k
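
the workaround presumably looked something like the following; this is a guess at the shape, since the comment doesn't name the offending optimization, and the signature here is made up:

static volatile int keep_body;   /* volatile so the test can't be deleted */

static void
handle_alloc_post(void *drcontext)
{
    /* an entirely empty body crashed (per the note above), so keep a
     * trivial conditional that the compiler cannot optimize away */
    if (keep_body != 0)
        keep_body = 0;
}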

% /usr/bin/time ~/dr/git/exports/bin32/drrun.exe -quiet ./cfrac.exe 41757646344123832613190542166099121
0.01user 0.01system 0:02.50elapsed 1%CPU (0avgtext+0avgdata 238592maxresident)k

% /usr/bin/time ./cfrac.exe 41757646344123832613190542166099121
0.00user 0.01system 0:01.67elapsed 0%CPU (0avgtext+0avgdata 234496maxresident)k

I have a series of 10 or so optimizations including DRi#632 that get the 35 seconds down to about 24 seconds.
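
for reference, DRi#632 became the dr_mcontext_t flags field in the DynamoRIO API; a minimal usage sketch (not code from this issue) that skips the costly ymm copy:

#include "dr_api.h"

static reg_t
get_app_xsp(void *drcontext)
{
    dr_mcontext_t mc;
    mc.size = sizeof(mc);
    mc.flags = DR_MC_CONTROL;    /* xsp/xflags/xip only: skip the ymm arrays */
    dr_get_mcontext(drcontext, &mc);
    return mc.xsp;
}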

@derekbruening

From bruen...@google.com on December 07, 2011 13:12:48

I have it down to 20 seconds now, 14s more than just clean calls. but note that leak scan/callstack time dwarfs that. clean calls seem to incur 2x here; malloc interception was 5x on top of that, and is now 3x.

@derekbruening

From bruen...@google.com on December 07, 2011 13:21:19

-no_check_uninitialized w/ leak callstacks but no leak scan:
0.00user 0.01system 1:32.35elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
so the scan takes 14s (1:46.22 − 1:32.35) and callstacks take 47s (1:32.35 − 0:45.65)!

@derekbruening

From bruen...@google.com on December 07, 2011 13:46:25

strike comment 3: that was measured w/ a dozen different new optimizations, so it needs a new base to compare to. the leak scan is not an issue.

this is with A through L (my internal names for the opts in my notes) so
have to re-run base:
-no_check_uninitialized
0.00user 0.03system 1:25.19elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 1:25.75elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 1:19.90elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.01user 0.00system 1:26.43elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
-no_check_uninitialized w/ leak callstacks but no leak scan:
0.00user 0.01system 1:32.35elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 1:29.23elapsed 0%CPU (0avgtext+0avgdata 234240maxresident)k
0.00user 0.00system 1:32.88elapsed 0%CPU (0avgtext+0avgdata 234240maxresident)k
-no_check_uninitialized -no_count_leaks
0.00user 0.00system 0:27.48elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k

strange, getting slower times w/o the scan. but the point here is that
it's the callstacks that matter, not the scan: the callstacks cost a full
minute (~1:31 vs 0:27), over a 3x slowdown.
that's not what this issue is about, though: after hitting the low-hanging
fruit here it may not be worth further optimization before optimizing
callstacks (if that's possible); plus, further opts here may require
-early_injection.

-leaks_only -no_zero_stack (so malloc intercept + callstacks):
0.00user 0.00system 1:10.29elapsed 0%CPU (0avgtext+0avgdata 233728maxresident)k
0.00user 0.00system 1:14.33elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 1:11.55elapsed 0%CPU (0avgtext+0avgdata 234240maxresident)k
-leaks_only -no_count_leaks (so just malloc intercept):
0.00user 0.00system 0:18.65elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 0:18.31elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.03system 0:18.57elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k

@derekbruening

From bruen...@google.com on December 07, 2011 14:18:34

-leaks_only -no_zero_stack -callstack_max_frames 6
0.00user 0.01system 0:59.23elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.03system 1:03.00elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.01user 0.00system 0:56.94elapsed 0%CPU (0avgtext+0avgdata 234240maxresident)k

-leaks_only -no_zero_stack -callstack_max_frames 3
0.00user 0.01system 0:35.73elapsed 0%CPU (0avgtext+0avgdata 233728maxresident)k
0.00user 0.00system 0:35.27elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 0:34.64elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k

-leaks_only -no_zero_stack -delay_frees 0
0.00user 0.00system 1:14.24elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 1:12.00elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k
0.00user 0.00system 1:14.45elapsed 0%CPU (0avgtext+0avgdata 233984maxresident)k

@derekbruening

From bruen...@google.com on December 07, 2011 17:53:55

note that the above callstack walking is on cfrac built /Ox so not really fair: haven't looked in detail but it's probably having to scan frequently b/c of FPO (frame pointer omission). I will re-analyze w/ /Oy-.

I have CodeAnalyst data for a series of optimizations I labeled A
through L in my notes and will commit separately. Probably not
worth pasting all that in here.

I will paste here at least the two endpoints.

Wall-clock-time-wise:
I've cut malloc interception cost in half.
Dropped cfrac malloc-intercept-only from 35s to 19s, comparing to 3.4s
no-intercept base:
(gdb) p (19.-3.4)/(35.-3.4)
$3 = 0.49367088607594939

Prior to optimizations A through L:

Process Name 64-bit Timer samples
dynamorio.dll 40.47
drmemorylib.dll 35.57
unknown module pid (6452) 1 18.84
hal.dll 1 2.2
ntdll.dll 1.24
ntoskrnl.exe 1 0.73
6 modules, Total: 34342 samples, 99.04% of shown samples

dynamorio:
CS:EIP Symbol + Offset Timer samples
0x68c268b0 priv_mcontext_to_dr_mcontext 17.13
0x68c0b740 mutex_trylock 16.46
0x68c0b780 mutex_unlock 14.06
0x68c522c0 dr_mutex_lock 9.07
0x68c57120 get_thread_private_dcontext 6.66
0x68c52320 dr_mutex_unlock 4.48
0x68c52db0 dr_get_tls_field 4.4
0x68c124c0 common_heap_alloc 3.96
0x68c0d460 mutex_lock 3.88
0x68c54b60 dr_get_mcontext 3.59
0x68c0b830 release_recursive_lock 2.34
0x68c12750 common_heap_free 2.27
0x68c26880 dr_mcontext_to_priv_mcontext 1.58
0x68c12bb0 heap_alloc 1.2
0x68bd1287 dr_setjmp 1.18
0x68c0d4c0 acquire_recursive_lock 0.91
0x68c52080 dr_try_stop 0.86
0x68c52050 dr_try_setup 0.67
0x68c52d90 dr_get_current_drcontext 0.66
0x68c54be0 dr_set_mcontext 0.61
0x68c697a0 get_thread_id 0.58
0x68c129c0 global_heap_alloc 0.55
0x68c129e0 global_heap_free 0.48
0x68c12bf0 heap_free 0.46
0x68c52da0 dr_get_thread_id 0.45
0x68c12950 common_global_heap_free 0.32
0x68bd1280 dr_try_start 0.29
0x68c128f0 common_global_heap_alloc 0.26
28 functions, 287 instructions, Total: 13942 samples, 99.36% of shown samples, 9.68% of total session samples

drmemorylib:
CS:EIP Symbol + Offset Timer samples
0x68b2cf30 hashtable_lookup 19.98
0x68b2cd20 hash_key 6.88
0x68b16db0 handle_alloc_pre_ex 6.1
0x68b2cdb0 keys_equal 5.38
0x68b12a10 get_alloc_entry 4.65
0x68b17240 handle_alloc_post_func 4.13
0x68b17570 alloc_hook 3.91
0x68b17470 handle_alloc_post 3.25
0x68b163a0 handle_free_pre 3.2
0x68b136b0 redzone_size 2.15
0x68b15ad0 handle_malloc_pre 2.08
0x68b15570 enter_heap_routine 1.98
0x68b155b0 check_recursive_same_sequence 1.74
0x68b24f00 safe_read 1.72
0x68b12900 is_alloc_prepost_routine 1.56
0x68b2c740 dr_get_current_drcontext 1.39
0x68b128c0 is_alloc_routine 1.26
0x68b2d1c0 hashtable_add_replace 1.26
0x68b167b0 handle_malloc_post 1.24
0x68b2c84e dr_mutex_unlock 1.18
0x68b12860 is_rtl_routine 1.03
0x68b157e0 record_allocator 0.98
0x68b2c854 dr_mutex_lock 0.97
0x68b14330 malloc_lock_held_by_self 0.91
0x68b166d0 adjust_alloc_result 0.85
0x68b12770 is_free_routine 0.83
0x68b143e0 malloc_add_common 0.83
0x68b2d360 hashtable_remove 0.82
0x68b14360 malloc_lock 0.79
0x68b2c728 dr_get_tls_field 0.79
0x68b154c0 get_retaddr_at_entry 0.74
0x68b2cf10 hashtable_lock 0.71
0x68b15800 handle_free_check_mismatch 0.7
0x68b2cf20 hashtable_unlock 0.69
0x68b156b0 set_auxarg 0.66
0x68b2c73a dr_get_thread_id 0.61
0x68b12b90 is_replace_routine 0.58
0x68b1b120 client_handle_malloc 0.51
0x68b12790 is_malloc_routine 0.5
0x68b1af10 shared_callstack_free 0.46
0x68afe320 client_exiting_heap_routine 0.45
0x68b158d0 handle_free_post 0.45
0x68b14510 malloc_entry_remove 0.44
0x68afe300 client_entering_heap_routine 0.4
0x68b148a0 malloc_size 0.37
0x68b2c76a dr_get_mcontext 0.37
0x68b1b400 client_handle_free 0.36
0x68b25a00 global_free 0.36
0x68b2cfa0 hashtable_check_for_resize 0.36
0x68b14730 malloc_alloc_type 0.35
0x68b15530 malloc_is_native 0.34
0x68b143b0 malloc_lock_if_not_held_by_me 0.33
0x68b136e0 get_size_from_app_routine 0.32
0x68b13bd0 malloc_allocator_type 0.31
0x68b259f0 global_alloc 0.31
0x68b16650 get_alloc_real_size 0.3
0x68b25b40 hashwrap_alloc 0.27
0x68b13980 malloc_hash 0.26
0x68b12730 is_size_routine 0.24
0x68b12ab0 size_func_in_set 0.24
0x68b16250 get_alloc_size 0.24
0x68b127b0 is_realloc_routine 0.19
0x68b13960 malloc_entry_free 0.19
0x68b143d0 malloc_unlock_if_locked_by_me 0.19
0x68b14710 malloc_entry_is_pre_us 0.19
0x68b23b30 leak_handle_alloc 0.19
0x68b14390 malloc_unlock 0.18
0x68b15ac0 size_plus_redzone_overflow 0.16
68 functions, 1009 instructions, Total: 12125 samples, 98.31% of shown samples, 8.42% of total session samples

after:

keep in mind that these are percentages (would be nice to have a
column of absolute hits as well): so while dr_get_tls_field() was
actually cut in half, the percentage here looks higher b/c it's on
a smaller total # of samples (e.g., in dynamorio: 4.4% of 13942
samples ≈ 613 hits before vs 7.52% of 4947 ≈ 372 after):

Process Name 64-bit Timer samples
drmemorylib.dll 36.53
unknown module pid (7060) 1 32.76
dynamorio.dll 26.37
hal.dll 1 1.45
ntdll.dll 0.74
msvcr90.dll 0.67
ntoskrnl.exe 1 0.52
7 modules, Total: 19219 samples, 99.04% of shown samples

dynamorio:
CS:EIP Symbol + Offset Timer samples
0x6ea0b740 mutex_trylock 15.59
0x6ea0b780 mutex_unlock 8.5
0x6ea0b830 release_recursive_lock 7.95
0x6ea0b890 read_lock 7.7
0x6ea52f90 dr_get_tls_field 7.52
0x6ea0b9f0 read_unlock 6.14
0x6ea524b0 dr_mutex_lock 6.04
0x6ea572a0 get_thread_private_dcontext 5.24
0x6ea54c60 dr_get_mcontext_priv 4.67
0x6ea124c0 common_heap_alloc 4.57
0x6ea26970 priv_mcontext_to_dr_mcontext 3.48
0x6ea0d460 mutex_lock 3.46
0x6ea52510 dr_mutex_unlock 2.75
0x6ea12750 common_heap_free 2.44
0x6ea698f0 get_thread_id 1.37
0x6ea0d4c0 acquire_recursive_lock 1.29
0x6ea52f70 dr_get_current_drcontext 1.25
0x6ea268d0 dr_mcontext_to_priv_mcontext 1.19
0x6ea129e0 global_heap_free 1.17
0x6ea52f80 dr_get_thread_id 1.15
0x6ea54d70 dr_set_mcontext 1.13
0x6ea128f0 common_global_heap_alloc 1.07
0x6ea54d50 dr_get_mcontext 0.98
23 functions, 238 instructions, Total: 4947 samples, 96.66% of shown samples, 6.19% of total session samples

drmemorylib:
CS:EIP Symbol + Offset Timer samples
0x6fedce00 hashtable_lookup 10.51
0x6fec6f70 handle_alloc_pre_ex 9.07
0x6fec75f0 handle_alloc_post 6.04
0x6fedcbf0 hash_key 5.8
0x6fec73c0 handle_alloc_post_func 5.7
0x6fec7710 alloc_hook 4.71
0x6fec6580 handle_free_pre 4.09
0x6fedcc80 keys_equal 3.91
0x6fec5730 check_recursive_same_sequence 3.81
0x6fec56d0 enter_heap_routine 3.36
0x6fec5c40 handle_malloc_pre 3.15
0x6fec3780 redzone_size 2.93
0x6fec5980 record_allocator 2.2
0x6fec6970 handle_malloc_post 1.69
0x6fec46d0 malloc_entry_remove 1.62
0x6fec45a0 malloc_add_common 1.5
0x6fedc600 dr_get_current_drcontext 1.45
0x6fec2990 is_rtl_routine 1.44
0x6fedd090 hashtable_add_replace 1.33
0x6fec4520 malloc_lock 1.31
0x6fec68a0 adjust_alloc_result 1.3
0x6fedd230 hashtable_remove 1.19
0x6fec28a0 is_free_routine 1.11
0x6fec59a0 handle_free_check_mismatch 1.02
0x6fec44f0 malloc_lock_held_by_self 1
0x6fec5850 set_auxarg 0.97
0x6fedc5e8 dr_get_tls_field 0.87
0x6fec48f0 malloc_alloc_type 0.76
0x6fecb470 client_handle_free 0.73
0x6fedcde0 hashtable_lock 0.68
0x6fec4a60 malloc_size 0.66
0x6fecb190 client_handle_malloc 0.65
0x6fedcdf0 hashtable_unlock 0.63
0x6fec4570 malloc_lock_if_not_held_by_me 0.62
0x6fed5880 global_alloc 0.59
0x6fed5890 global_free 0.58
0x6fec5690 malloc_is_native 0.54
0x6fec2c50 is_replace_routine 0.52
38 functions, 726 instructions, Total: 6383 samples, 90.05% of shown samples, 7.99% of total session samples

@derekbruening

From bruen...@google.com on December 07, 2011 17:56:51

List of optimizations and other changes:

  • issue #705 (do not perform unknown syscall analysis unless -check_uninitialized): skip unknown syscall handling for -no_check_uninitialized
  • issue #706 (CRASH -no_track_allocs): get -no_track_allocs to work
  • issue #708 (-conservative option to split ironclad from slight risk but faster): add -conservative option and pass pointers that may
    become invalid on racy module unload
  • issue #460 (reduce overhead of malloc tracking): improve malloc interception performance:
    I managed to cut the malloc interception cost in half.
    There are many sub-pieces here, some of which are filed and
    listed separately. Those covered by issue #460 include:
    • statically tell hook whether syscall or needs post
    • improve performance of alloc_check_pending_module(), which is
      called on every instruction, via a last_page cache and a flag
      for whether the rbtree is empty (see the sketch after this list)
    • pending-module improvement: load postcall entries from
      symcache immediately if module is already loaded at event time
    • switch to a readwrite lock for the postcall table
    • use alloc_routine_entry_t.intercept_post instead of re-deriving
    • tweak malloc hash func and resize threshold to reduce collisions
    • only safe_read retaddr in hook callee if -conservative
    • pass in xsp and avoid dr_get_mcontext altogether in pre hook
      since it only needs xsp (for 32-bit where params are all on stack)
    • pass per_thread_t around instead of calling dr_get_tls_field()
      everywhere
  • DRi#632: add dr_mcontext_t flags to select which parts of the context to
    copy and avoid perf hit of copying ymm arrays
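
a minimal sketch of the last_page cache idea from the list above (names and structure are illustrative, not the actual Dr. Memory code): check an emptiness flag and a cached page before touching the rbtree on each instruction.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_START(pc) ((uintptr_t)(pc) & ~(uintptr_t)0xfff)

/* in reality these would be per-thread or lock-protected */
static bool      pending_tree_empty = true; /* flipped on module load events */
static uintptr_t last_page;                 /* last page looked up */
static bool      last_page_pending;         /* its cached answer */

bool rbtree_lookup(uintptr_t pc);           /* the expensive path */

static inline bool
alloc_check_pending_module(uintptr_t pc)
{
    uintptr_t page = PAGE_START(pc);
    if (pending_tree_empty)
        return false;                /* common case: no pending modules */
    if (page == last_page)
        return last_page_pending;    /* cache hit: no rbtree walk */
    last_page = page;
    last_page_pending = rbtree_lookup(pc);
    return last_page_pending;
}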

@derekbruening

From bruen...@google.com on December 07, 2011 19:15:36

indeed /Ox is scanning a lot:
app mallocs: 10890330, frees: 10890127, large mallocs: 0
unique malloc stacks: 7054500
callstack fp scans: 61058402
callstack is_retaddr: 924470226, backdecode: 153355929, unreadable: 0

/Ox /Oy-:
app mallocs: 10890330, frees: 10890127, large mallocs: 0
unique malloc stacks: 7050289
callstack fp scans: 0
callstack is_retaddr: 10890130, backdecode: 10890130, unreadable: 0

(unfortunately on linux there is only 1 scan, and adding -fno-omit-frame-pointer makes no difference)
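
for context, the two strategies behind these counters look roughly like the following sketch (hypothetical helper names; the real walker is more involved): with frame pointers intact (/Oy-) the walker follows the saved-ebp chain, while under FPO (/Ox) it must scan stack slots and test each one as a potential return address, which is why is_retaddr runs 924M times under /Ox but only ~10.9M (about once per malloc) under /Oy-.

#include <stdbool.h>
#include <stddef.h>

typedef struct frame { struct frame *parent; void *retaddr; } frame_t;

bool is_retaddr(void *pc);                     /* is pc preceded by a call? */
bool safe_read(void *addr, size_t size, void *out);

/* frame pointers intact (/Oy-): follow the saved-ebp chain */
static int
walk_fp_chain(frame_t *fp, void **out, int max_frames)
{
    int n = 0;
    frame_t f;
    while (n < max_frames && safe_read(fp, sizeof(f), &f) &&
           is_retaddr(f.retaddr)) {
        out[n++] = f.retaddr;
        fp = f.parent;                         /* hop to the caller's frame */
    }
    return n;
}

/* FPO (/Ox): no chain to trust, so scan every stack slot */
static int
scan_for_retaddrs(void **sp, void **out, int max_frames)
{
    int n = 0;
    void *slot;
    while (n < max_frames && safe_read(sp, sizeof(slot), &slot)) {
        if (is_retaddr(slot))                  /* hence the huge counts above */
            out[n++] = slot;
        sp++;
    }
    return n;
}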

splitting off any work on callstack walking perf to issue #711. see also in that issue a new breakdown of times for cfrac built /Oy- (huge difference).

@derekbruening

From bruen...@google.com on December 08, 2011 07:21:00

spec2k6cmpave namedres/leaksonly-nocount-i460-A-L-ref/ namedres/native-ref
400.perlbench 3.02 ( 1521 / 503)
401.bzip2 1.10 ( 948 / 865)
403.gcc 1.68 ( 719 / 429)
429.mcf 1.02 ( 380 / 373)
445.gobmk 1.56 ( 1000 / 639)
456.hmmer 1.09 ( 1297 / 1195)
458.sjeng 1.76 ( 1374 / 782)
462.libquantum 1.21 ( 894 / 739)
464.h264ref 1.67 ( 1755 / 1049)
471.omnetpp 3.09 ( 1293 / 419)
473.astar 1.31 ( 863 / 660)
483.xalancbmk 3.57 ( 1245 / 349)
410.bwaves 1.07 ( 803 / 751)
416.gamess 1.15 ( 1598 / 1392)
433.milc 1.22 ( 757 / 620)
434.zeusmp 1.08 ( 857 / 795)
435.gromacs 1.02 ( 1197 / 1170)
436.cactusADM 1.01 ( 1635 / 1617)
437.leslie3d 1.04 ( 851 / 816)
444.namd 1.03 ( 764 / 741)
447.dealII 1.93 ( 1156 / 599)
450.soplex 1.14 ( 459 / 404)
453.povray 1.73 ( 624 / 360)
454.calculix 1.07 ( 1609 / 1497)
459.GemsFDTD 1.00 ( 934 / 935)
465.tonto 3.19 ( 2685 / 842)
470.lbm 1.07 ( 692 / 648)
481.wrf 1.71 ( 2147 / 1258)
482.sphinx3 1.12 ( 830 / 741)
average 1.54

vs old 1.66 => 9.3% improvement

direct comparison shows 26% win on tonto, 20% on perlbench:

spec2k6cmpave namedres/leaksonly-nocount-i460-A-L-ref/ namedres/leaksonly-nocount-ref/
400.perlbench 0.80 ( 1521 / 1897)
401.bzip2 1.00 ( 948 / 951)
403.gcc 0.96 ( 719 / 750)
429.mcf 0.97 ( 380 / 393)
445.gobmk 0.99 ( 1000 / 1009)
456.hmmer 1.00 ( 1297 / 1301)
458.sjeng 0.99 ( 1374 / 1381)
462.libquantum 1.13 ( 894 / 793)
464.h264ref 1.00 ( 1755 / 1751)
471.omnetpp 0.83 ( 1293 / 1559)
473.astar 0.99 ( 863 / 874)
483.xalancbmk 0.94 ( 1245 / 1331)
410.bwaves 0.92 ( 803 / 875)
416.gamess 0.99 ( 1598 / 1610)
433.milc 1.07 ( 757 / 706)
434.zeusmp 0.94 ( 857 / 910)
435.gromacs 1.00 ( 1197 / 1198)
436.cactusADM 0.99 ( 1635 / 1653)
437.leslie3d 1.00 ( 851 / 851)
444.namd 1.00 ( 764 / 761)
447.dealII 0.89 ( 1156 / 1299)
450.soplex 0.94 ( 459 / 488)
453.povray 1.00 ( 624 / 627)
454.calculix 1.00 ( 1609 / 1616)
459.GemsFDTD 0.95 ( 934 / 983)
465.tonto 0.74 ( 2685 / 3613)
470.lbm 0.87 ( 692 / 795)
481.wrf 0.87 ( 2147 / 2459)
482.sphinx3 0.99 ( 830 / 840)
average 0.96

@derekbruening

From bruen...@google.com on December 08, 2011 12:37:19

unfortunately the optimizations done here so far (A through L) don't have
much impact on ui_tests. it seems to spend most of its time on
instrumentation, even when there's really not much instrumentation being
done:

results on windesk for HEAD (i.e., no issue #460 opts):
% ./batch.sh
native
[----------] 1 test from NPAPITesterBase (2814 ms total)
[----------] 1 test from NPAPITesterBase (2485 ms total)
[----------] 1 test from NPAPITesterBase (2506 ms total)
DR
[----------] 1 test from NPAPITesterBase (13527 ms total)
[----------] 1 test from NPAPITesterBase (12923 ms total)
[----------] 1 test from NPAPITesterBase (13342 ms total)

[----------] 1 test from NPAPITesterBase (544662 ms total)
[----------] 1 test from NPAPITesterBase (414874 ms total)
[----------] 1 test from NPAPITesterBase (430466 ms total)
-no_count_leaks
[----------] 1 test from NPAPITesterBase (329820 ms total)
[----------] 1 test from NPAPITesterBase (388453 ms total)
[----------] 1 test from NPAPITesterBase (379673 ms total)
-no_check_uninitialized
[----------] 1 test from NPAPITesterBase (71159 ms total)
[----------] 1 test from NPAPITesterBase (72336 ms total)
[----------] 1 test from NPAPITesterBase (70873 ms total)
-no_check_uninitialized -no_count_leaks
[----------] 1 test from NPAPITesterBase (49564 ms total)
[----------] 1 test from NPAPITesterBase (51328 ms total)
[----------] 1 test from NPAPITesterBase (51490 ms total)
-leaks_only
[----------] 1 test from NPAPITesterBase (46393 ms total)
[----------] 1 test from NPAPITesterBase (46452 ms total)
[----------] 1 test from NPAPITesterBase (58235 ms total)
-leaks_only -no_zero_stack
[----------] 1 test from NPAPITesterBase (43570 ms total)
[----------] 1 test from NPAPITesterBase (43709 ms total)
[----------] 1 test from NPAPITesterBase (45537 ms total)
-leaks_only -no_count_leaks
[----------] 1 test from NPAPITesterBase (23473 ms total)
[----------] 1 test from NPAPITesterBase (24109 ms total)
[----------] 1 test from NPAPITesterBase (23976 ms total)
-leaks_only -no_count_leaks -no_track_allocs
<crashes: fix issue #706 not in>

results for optimized (i.e., drmem has A through L):
% ./batch.sh
native
[----------] 1 test from NPAPITesterBase (3059 ms total)
[----------] 1 test from NPAPITesterBase (2349 ms total)
[----------] 1 test from NPAPITesterBase (2401 ms total)
DR
[----------] 1 test from NPAPITesterBase (12905 ms total)
[----------] 1 test from NPAPITesterBase (12950 ms total)
[----------] 1 test from NPAPITesterBase (13154 ms total)
DR -code_api -disable_traces -bb_single_restore_prefix -max_bb_instrs 256
[----------] 1 test from NPAPITesterBase (8144 ms total)
[----------] 1 test from NPAPITesterBase (8221 ms total)
[----------] 1 test from NPAPITesterBase (7747 ms total)

[----------] 1 test from NPAPITesterBase (481645 ms total)
[----------] 1 test from NPAPITesterBase (323987 ms total)
[----------] 1 test from NPAPITesterBase (162724 ms total) <--- ?!?
-no_count_leaks
[----------] 1 test from NPAPITesterBase (315300 ms total)
[----------] 1 test from NPAPITesterBase (154741 ms total)
[----------] 1 test from NPAPITesterBase (308583 ms total)
-no_check_uninitialized
[----------] 1 test from NPAPITesterBase (68615 ms total)
[----------] 1 test from NPAPITesterBase (68943 ms total)
[----------] 1 test from NPAPITesterBase (69362 ms total)
-no_check_uninitialized -no_count_leaks
[----------] 1 test from NPAPITesterBase (47861 ms total)
[----------] 1 test from NPAPITesterBase (48193 ms total)
[----------] 1 test from NPAPITesterBase (48197 ms total)
-leaks_only
[----------] 1 test from NPAPITesterBase (44453 ms total)
[----------] 1 test from NPAPITesterBase (44938 ms total)
[----------] 1 test from NPAPITesterBase (45062 ms total)
-leaks_only -no_zero_stack
[----------] 1 test from NPAPITesterBase (42587 ms total)
[----------] 1 test from NPAPITesterBase (42854 ms total)
[----------] 1 test from NPAPITesterBase (44263 ms total)
-leaks_only -no_count_leaks
[----------] 1 test from NPAPITesterBase (23529 ms total)
[----------] 1 test from NPAPITesterBase (23139 ms total)
[----------] 1 test from NPAPITesterBase (23799 ms total)
-leaks_only -no_count_leaks -no_track_allocs
[----------] 1 test from NPAPITesterBase (21472 ms total)
[----------] 1 test from NPAPITesterBase (21593 ms total)
[----------] 1 test from NPAPITesterBase (21742 ms total)

splitting the investigation of why -no_track_allocs is still so slow off to issue #714

@derekbruening

From bruen...@google.com on April 12, 2012 09:22:51

xref issue #794, issue #721, issue #793, issue #689
