Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ui_tests -no_track_allocs 22s vs plain DR 13s: mostly from -show_threads #714

Closed
derekbruening opened this issue Nov 28, 2014 · 7 comments
Closed

Comments

@derekbruening
Copy link
Contributor

From bruen...@google.com on December 08, 2011 15:26:58

xref issue #681 xref issue #460 xref DRi#622, DRi#606

xref numbers from comment 10 of issue #460 , pasted here, where drmem doing
essentially nothing useful is 65% or so slower than plain DR:
DR
[----------] 1 test from NPAPITesterBase (12905 ms total)
[----------] 1 test from NPAPITesterBase (12950 ms total)
[----------] 1 test from NPAPITesterBase (13154 ms total)
-leaks_only -no_count_leaks -no_track_allocs
[----------] 1 test from NPAPITesterBase (21472 ms total)
[----------] 1 test from NPAPITesterBase (21593 ms total)
[----------] 1 test from NPAPITesterBase (21742 ms total)

presumably this is all from instrument_bb doing analysis, routine
replacement, repstr unrolling, etc.? worth investigating whether we can
reduce the overhead.

here's -leaks_only -no_count_leaks -no_track_allocs in CodeAnalyst:

Process Name 64-bit Timer samples
dynamorio.dll 59.22
unknown module pid (388) 1 15.33
ntoskrnl.exe 1 10.32
dbghelp.dll 7.98
drmemorylib.dll 2.44
ntdll.dll 1.89
Parity.sys 1 0.65
wow64cpu.dll 1 0.59
fltmgr.sys 1 0.33
wow64.dll 1 0.29
win32k.sys 1 0.28
11 modules, Total: 24757 samples, 99.32% of shown samples

CS:EIP Symbol + Offset Timer samples
0x7037d460 mutex_lock 13.99
0x7037b890 read_lock 3.16
0x70367fd0 hashtable_fragment_lookup 2.87
0x703a9df0 instr_is_cti 2.32
0x7037b740 mutex_trylock 2.15
0x703824c0 common_heap_alloc 1.98
0x70367d90 hashtable_ibl_lookup_for_removal 1.93
0x7037b9f0 read_unlock 1.88
0x70398d70 read_instruction 1.72
0x703715c0 find_free_list_slot 1.69
0x703b2d30 build_bb_ilist 1.61
0x703a9bf0 instr_is_cbr 1.37
0x7037b780 mutex_unlock 1.29
0x703998f0 decode_operand 1.29
0x7037ab60 emit_fragment_common 1.19
0x7039c0a0 encoding_possible_pass 1.18
0x703a9d90 instr_is_ubr 1.13
0x7039e100 instr_encode_common 1.09
0x7039ace0 opnd_type_ok 1.06
0x7039a0b0 decode_common 1
0x703a9c60 instr_is_mbr 0.97
0x703ac3f0 instr_is_exit_cti 0.88
0x70399270 decode_modrm 0.87
0x703d2430 is_intercepted_app_pc 0.79
0x7037ba40 write_unlock 0.77
0x70386fc0 vmvector_overlap 0.77
0x7036f9f0 hashtable_fragment_add 0.72
0x70373960 fcache_return_extra_space 0.72
0x703b8170 mangle 0.7
0x70382750 common_heap_free 0.67
0x70397800 get_ibl_routine_type_ex 0.67
0x70371450 add_to_free_list 0.64
0x703a9a90 instr_is_call 0.64
0x7038c1d0 check_thread_vm_area 0.62
0x7037a8d0 set_linkstub_fields 0.61
0x70386e10 lookup_addr 0.61
0x703985e0 resolve_variable_size 0.59
0x703b2a60 client_process_bb 0.58
0x7036c740 fragment_prepare_for_removal 0.57
0x703789b0 dispatch_exit_fcache_stats 0.54
0x70398970 read_operand 0.54
0x703aac10 instr_get_src 0.54
0x703ac4f0 instr_is_wow64_syscall 0.54
0x7037b830 release_recursive_lock 0.53
0x70368d80 fragment_lookup_type 0.52
0x703af1a0 instr_is_syscall 0.51
0x703aab00 instr_get_opcode 0.5
0x7036c690 fragment_prepare_for_removal_from_table 0.48
0x703a8620 instr_get_next 0.47
0x703aac80 instr_get_dst 0.47
0x703aae70 instr_length 0.47
0x703c72a0 get_thread_private_dcontext 0.46
0x703a6040 nop_pad_ilist 0.43
0x7039a740 size_ok 0.42
0x7039eda0 patch_branch 0.41
0x7037a360 dispatch 0.39
0x703ae4d0 instr_is_nop 0.39
0x7039d800 encode_operand 0.37
0x703d6ad0 is_wow64_process 0.36
0x7036e750 hashtable_fragment_reset 0.35
0x70379ed0 dispatch_enter_dynamorio 0.35
0x70388190 remove_fraglist_entry 0.35
0x703b3be0 build_basic_block_fragment 0.35
63 functions, 1933 instructions, Total: 10338 samples, 70.04% of shown samples, 11.65% of total session samples

CS:EIP Symbol + Offset Timer samples
0x735d7840 instrument_bb 46.14
0x735e5c80 instr_writes_esp 8.05
0x735f7a30 alloc_instrument 6.73
0x735d41b0 opnd_uses_nonignorable_memory 5.75
0x735f63a0 alloc_replace_instrument 4.76
0x73607050 symcache_symbol_add 3.78
0x735d2b50 reg_is_gpr 2.79
0x735f2b30 is_alloc_sysroutine 2.46
0x735d2bc0 opc_is_push 1.97
0x7360ca50 instr_num_dsts 1.97
0x7360ca68 opnd_is_reg 1.81
0x735d2c00 opc_is_pop 1.48
0x735d52b0 app_to_app_transformations 1.48
0x7360ca5c opnd_is_memory_reference 1.31
14 functions, 227 instructions, Total: 551 samples, 90.48% of shown samples, 0.62% of total session samples

Original issue: http://code.google.com/p/drmemory/issues/detail?id=714

@derekbruening
Copy link
Contributor Author

From bruen...@google.com on December 08, 2011 12:38:14

a better comparison is to DR using drmem's options:
DR -code_api -disable_traces -bb_single_restore_prefix -max_bb_instrs 256
[----------] 1 test from NPAPITesterBase (8144 ms total)
[----------] 1 test from NPAPITesterBase (8221 ms total)
[----------] 1 test from NPAPITesterBase (7747 ms total)

=> 2.5x slowdown for drmem -no_track_allocs!

@derekbruening
Copy link
Contributor Author

From bruen...@google.com on December 08, 2011 12:39:07

Owner: bruen...@google.com

@derekbruening
Copy link
Contributor Author

From bruen...@google.com on December 09, 2011 09:55:00

the majority of this time is not the bb event: it seems to be ntdll Nt* processing, but it doesn't make sense:

on laptop:
DR -code_api -disable_traces -bb_single_restore_prefix -max_bb_instrs 256
[----------] 1 test from NPAPITesterBase (5080 ms total)
no bb event at all:
[----------] 1 test from NPAPITesterBase (12974 ms total)
no bb event + no module events:
[----------] 1 test from NPAPITesterBase (5796 ms total)
disable loop in syscall_os_module_load:
[----------] 1 test from NPAPITesterBase (8034 ms total)
disable just ntdll syscall export lookup:
[----------] 1 test from NPAPITesterBase (8702 ms total)

but how can the ntdll export processing take 5-6s on ui_tests when the
whole -no_track_allocs on cfrac is only 0.2s slower than DR (3.2s for DR w/
drmem ops, 3.44s for -no_track_allocs)?

duplicated on windesk:
-leaks_only -no_count_leaks -no_track_allocs
[----------] 1 test from NPAPITesterBase (22832 ms total)
[----------] 1 test from NPAPITesterBase (21858 ms total)
[----------] 1 test from NPAPITesterBase (21656 ms total)
ditto but minus ntdll add_syscall_entry loop
[----------] 1 test from NPAPITesterBase (14354 ms total)
[----------] 1 test from NPAPITesterBase (14684 ms total)
[----------] 1 test from NPAPITesterBase (14481 ms total)

yet debug build says:
pre-module-load @0:00:00.061 in thread 6004
module load event: "ntdll.dll" 0x77190000-0x77310000 C:\Windows\SysWOW64\ntdll.dll
post-module-load @0:00:00.088 in thread 6004

???

@derekbruening
Copy link
Contributor Author

From bruen...@google.com on December 09, 2011 11:20:37

neither get_proc_addr nor hashtable_add showed up in the sampling (below)

remove hashtable_add if..else from add_syscall_entry:
[----------] 1 test from NPAPITesterBase (26160 ms total)
[----------] 1 test from NPAPITesterBase (14473 ms total)
[----------] 1 test from NPAPITesterBase (14302 ms total)
remove just hashtable_add line:
[----------] 1 test from NPAPITesterBase (31974 ms total)
[----------] 1 test from NPAPITesterBase (21368 ms total)
[----------] 1 test from NPAPITesterBase (21733 ms total)
add "false&&" in if() for hashtable_add:
[----------] 1 test from NPAPITesterBase (14584 ms total)
[----------] 1 test from NPAPITesterBase (14393 ms total)
[----------] 1 test from NPAPITesterBase (15468 ms total)
I was thinking it was some weird icache effect, but disabling just the
*syslist->num_out line shows it's really disabling some syscall handler:
[----------] 1 test from NPAPITesterBase (14378 ms total)
[----------] 1 test from NPAPITesterBase (14476 ms total)
[----------] 1 test from NPAPITesterBase (14386 ms total)
it's thread handling: or more specifically, issue #312 :
HEAD with "-no_show_threads -leaks_only -no_count_leaks -no_track_allocs":
[----------] 1 test from NPAPITesterBase (14542 ms total)
[----------] 1 test from NPAPITesterBase (14437 ms total)
[----------] 1 test from NPAPITesterBase (14382 ms total)

xref issue #691 .
this test creates 28 threads.

@derekbruening
Copy link
Contributor Author

From bruen...@google.com on December 09, 2011 13:10:55

Summary: ui_tests -no_track_allocs 22s vs plain DR 13s: mostly from -show_threads

@derekbruening
Copy link
Contributor Author

From bruen...@google.com on December 15, 2011 08:19:38

the time is all from symbolization: address lookups in the giant chrome.dll pdb.

one solution is to record mod+offs but not symbolize until later, just like
drmem already does for mallocs (not symbolized until leak report).
trigger to report would be an error being reported whose thread id has not
been reported (maybe have option to dump all thread callstacks instead of
just those w/ errors if people want that). have to worry about thread id
reuse: should just throw away stored callstack on thread exit.

@derekbruening
Copy link
Contributor Author

From derek.br...@gmail.com on December 15, 2011 19:20:33

This issue was closed by revision r682 .

Status: Fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant