omnitrace-run executable - required for running binary writes #257

Merged: 36 commits, Mar 15, 2023

jrmadsen (Collaborator) commented Mar 10, 2023

  • This executable is very similar to omnitrace-sample, except that it works with instrumented binaries
  • The main goals are:
    1. Provide a command-line interface to all the configuration options
    2. Ensure that libomnitrace-dl.so is set in LD_PRELOAD
  • After a binary rewrite of an executable or library, the instrumented executable (or the executable that loads the instrumented library) must be launched with omnitrace-run
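
The second goal can be illustrated with a small shell helper. This is only a sketch of the general idea of a launcher that guarantees a library is preloaded; it is not how omnitrace-run is implemented. The function name `prepend_preload` is hypothetical, and a real launcher would presumably use the full path to libomnitrace-dl.so rather than the bare file name.

```shell
# Hypothetical helper (not part of omnitrace): print an LD_PRELOAD value
# that is guaranteed to contain the given library. The library is prepended
# unless it is already present in the colon-separated list.
prepend_preload() {
    lib="$1"          # library to ensure, e.g. libomnitrace-dl.so
    current="${2:-}"  # existing LD_PRELOAD value (may be empty)
    case ":${current}:" in
        *":${lib}:"*) printf '%s\n' "${current}" ;;                  # already listed
        *) printf '%s\n' "${lib}${current:+:${current}}" ;;          # prepend it
    esac
}

# Example: the value a launcher might export before exec'ing the target.
prepend_preload "libomnitrace-dl.so" "libfoo.so"
# prints: libomnitrace-dl.so:libfoo.so
```

A wrapper built on this could run `LD_PRELOAD="$(prepend_preload libomnitrace-dl.so "$LD_PRELOAD")" exec "$@"`, which is roughly the step omnitrace-run takes care of for the user.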

Usage

Binary rewrite

$ omnitrace-instrument -o foo.inst -- foo
$ omnitrace-run -TPHDS -- ./foo.inst
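
For readability, the bundled short flags in `-TPHDS` can be spelled out with the long options listed in the help menu below (`-T` trace, `-P` profile, `-H` host, `-D` device, `-S` sample). The sketch assumes the bundled form expands to exactly these long options, and it only attempts the launch when omnitrace-run and an instrumented `./foo.inst` are actually available.

```shell
# Long-form spelling of -TPHDS, taken from the help menu:
#   -T --trace, -P --profile, -H --host, -D --device, -S --sample
set -- --trace --profile --host --device --sample -- ./foo.inst

# Show the command that would be run.
echo "omnitrace-run $*"

# Only launch when the tool and the instrumented binary both exist.
if command -v omnitrace-run >/dev/null 2>&1 && [ -x ./foo.inst ]; then
    omnitrace-run "$@"
fi
```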

Sampling

The following two commands are effectively identical:

$ omnitrace-run -S -- foo
$ omnitrace-sample -- foo

Help Menu

$ omnitrace-run --help
[omnitrace-run] Usage: ./bin/omnitrace-run [ --help (count: 0, dtype: bool)
                                             --version (count: 0, dtype: bool)
                                             --monochrome (max: 1, dtype: bool)
                                             --debug (max: 1, dtype: bool)
                                             --verbose (count: 1, dtype: integral)
                                             --ci (min: 0, dtype: boolean)
                                             --dl-verbose (min: 1, dtype: integral)
                                             --perfetto-annotations (min: 0, dtype: boolean)
                                             --critical-trace-debug (min: 0, dtype: boolean)
                                             --kokkosp-kernel-logger (min: 0, dtype: boolean)
                                             --kokkosp-prefix (min: 0, dtype: string)
                                             --sampling-allocator-size (min: 1, dtype: integral)
                                             --kokkosp-name-length-max (min: 1, dtype: integral)
                                             --critical-trace-serialize-names (min: 0, dtype: boolean)
                                             --config (min: 1, dtype: filepath)
                                             --output (min: 1, dtype: path [prefix])
                                             --trace (max: 1, dtype: bool)
                                             --profile (max: 1, dtype: bool)
                                             --flat-profile (max: 1, dtype: bool)
                                             --sample (min: 0, dtype: timer-type)
                                             --host (max: 1, dtype: bool)
                                             --device (max: 1, dtype: bool)
                                             --wait (count: 1, dtype: seconds)
                                             --duration (count: 1, dtype: seconds)
                                             --periods (min: 1, dtype: period-spec(s))
                                             --include (min: 1, dtype: [backend...])
                                             --exclude (min: 1, dtype: [backend...])
                                             --mode (min: 1, dtype: string)
                                             --use-causal (min: 0, dtype: boolean)
                                             --use-kokkosp (min: 0, dtype: boolean)
                                             --use-mpip (min: 0, dtype: boolean)
                                             --use-roctx (min: 0, dtype: boolean)
                                             --critical-trace (min: 0, dtype: boolean)
                                             --use-code-coverage (min: 0, dtype: boolean)
                                             --use-perfetto (min: 0, dtype: boolean)
                                             --use-process-sampling (min: 0, dtype: boolean)
                                             --use-rcclp (min: 0, dtype: boolean)
                                             --use-rocm-smi (min: 0, dtype: boolean)
                                             --use-rocprofiler (min: 0, dtype: boolean)
                                             --use-roctracer (min: 0, dtype: boolean)
                                             --use-sampling (min: 0, dtype: boolean)
                                             --use-timemory (min: 0, dtype: boolean)
                                             --trace-thread-barriers (min: 0, dtype: boolean)
                                             --trace-thread-join (min: 0, dtype: boolean)
                                             --trace-thread-locks (min: 0, dtype: boolean)
                                             --trace-thread-rw-locks (min: 0, dtype: boolean)
                                             --trace-thread-spin-locks (min: 0, dtype: boolean)
                                             --thread-pool-size (min: 1, dtype: integral)
                                             --num-threads-hint (min: 1, dtype: integral)
                                             --trace-file (count: 1, dtype: filepath)
                                             --trace-buffer-size (count: 1, dtype: KB)
                                             --trace-fill-policy (count: 1, dtype: policy)
                                             --trace-wait (count: 1, dtype: seconds)
                                             --trace-duration (count: 1, dtype: seconds)
                                             --trace-periods (min: 1, dtype: period-spec(s))
                                             --trace-clock-id (count: 1, dtype: clock-id)
                                             --profile-format (min: 1, dtype: string)
                                             --profile-diff (min: 1, dtype: path [prefix])
                                             --process-freq (count: 1, dtype: floating-point)
                                             --process-wait (count: 1, dtype: seconds)
                                             --process-duration (count: 1, dtype: seconds)
                                             --cpus (count: unlimited, dtype: int and/or range)
                                             --gpus (count: unlimited, dtype: int and/or range)
                                             --sampling-freq (count: 1, dtype: floating-point)
                                             --tids (min: 1, dtype: int and/or range)
                                             --sampling-wait (count: 1, dtype: seconds)
                                             --sampling-duration (count: 1, dtype: seconds)
                                             --sample-cputime (min: 0, dtype: [freq] [delay] [tids...])
                                             --sample-realtime (min: 0, dtype: [freq] [delay] [tids...])
                                             --sampling-cputime-delay (min: 1, dtype: floating-point)
                                             --sampling-cputime-freq (min: 1, dtype: floating-point)
                                             --sampling-cputime-tids (min: 0, dtype: string)
                                             --sampling-include-inlines (min: 0, dtype: boolean)
                                             --sampling-keep-internal (min: 0, dtype: boolean)
                                             --sampling-realtime-delay (min: 1, dtype: floating-point)
                                             --sampling-realtime-freq (min: 1, dtype: floating-point)
                                             --sampling-realtime-offset (min: 1, dtype: integral)
                                             --sampling-realtime-tids (min: 0, dtype: string)
                                             --cpu-events (min: 1, dtype: [EVENT ...])
                                             --gpu-events (min: 1, dtype: [EVENT ...])
                                             --enable-categories (min: 1, dtype: string)
                                             --disable-categories (min: 1, dtype: string)
                                             --tmpdir (min: 0, dtype: string)
                                             --use-pid (min: 0, dtype: boolean)
                                             --time-output (min: 0, dtype: boolean)
                                             --causal-file (min: 0, dtype: string)
                                             --causal-file-reset (min: 0, dtype: boolean)
                                             --use-temporary-files (min: 0, dtype: boolean)
                                             --perfetto-backend (min: 1, dtype: string)
                                             --perfetto-roctracer-per-stream (min: 0, dtype: boolean)
                                             --perfetto-shmem-size-hint-kb (min: 1, dtype: integral)
                                             --timemory-components (min: 0, dtype: string)
                                             --roctracer-hip-activity (min: 0, dtype: boolean)
                                             --roctracer-hip-api (min: 0, dtype: boolean)
                                             --roctracer-hsa-activity (min: 0, dtype: boolean)
                                             --roctracer-hsa-api (min: 0, dtype: boolean)
                                             --roctracer-hsa-api-types (min: 0, dtype: string)
                                             --critical-trace-buffer-count (min: 1, dtype: integral)
                                             --critical-trace-count (min: 1, dtype: integral)
                                             --critical-trace-per-row (min: 1, dtype: integral)
                                             --inlines (max: 1, dtype: bool)
                                             --hsa-interrupt (count: 1, dtype: int)
                                             --causal-binary-exclude (min: 0, dtype: string)
                                             --causal-binary-scope (min: 0, dtype: string)
                                             --causal-delay (min: 1, dtype: floating-point)
                                             --causal-duration (min: 1, dtype: floating-point)
                                             --causal-end-to-end (min: 0, dtype: boolean)
                                             --causal-fixed-speedup (min: 0, dtype: string)
                                             --causal-function-exclude (min: 0, dtype: string)
                                             --causal-function-exclude-defaults (min: 0, dtype: boolean)
                                             --causal-function-scope (min: 0, dtype: string)
                                             --causal-mode (min: 0, dtype: string)
                                             --causal-random-seed (min: 1, dtype: integral)
                                             --causal-source-exclude (min: 0, dtype: string)
                                             --causal-source-scope (min: 0, dtype: string)
                                           ] 

    Command line interface to omnitrace configuration.
    

Options:
    -h, -?, --help                 Shows this page (count: 0, dtype: bool) 
    --version                      Prints the version and exit (count: 0, dtype: bool) 
                                                                 
    [DEBUG OPTIONS]                                  
                                                                 
    --monochrome                   Disable colorized output (max: 1, dtype: bool) 
    --debug                        Debug output (max: 1, dtype: bool) 
    -v, --verbose                  Verbose output (count: 1, dtype: integral) 
    --ci                           Enable some runtime validation checks (typically enabled for continuous integration) (min: 0, dtype: boolean) 
    --dl-verbose                   Verbosity within the omnitrace-dl library (min: 1, dtype: integral) 
    --perfetto-annotations         Include debug annotations in perfetto trace. When enabled, this feature will encode information such as the values of 
                                   the function arguments (when available). Disabling this feature may dramatically reduce the size of the trace (min: 0, 
                                   dtype: boolean) 
    --critical-trace-debug         Enable debugging for critical trace (min: 0, dtype: boolean) 
    --kokkosp-kernel-logger        Enables kernel logging (min: 0, dtype: boolean) 
    --kokkosp-prefix               Set to [kokkos] to maintain old naming convention (min: 0, dtype: string) 
    --sampling-allocator-size      The number of sampled threads handled by an allocator running in a background thread. Each thread that is sampled 
                                   communicates with an allocator running in a background thread which handles storing/caching the data when it's buffer 
                                   is full. Setting this value too high (i.e. equal to the number of threads when the thread count is high) may cause loss 
                                   of data -- the sampler may fill a new buffer and overwrite old buffer data before the allocator can process it. Setting 
                                   this value to 1 will result in a background allocator thread for every thread started by the application. (min: 1, 
                                   dtype: integral) 
    --kokkosp-name-length-max      Set this to a value > 0 to help avoid unnamed Kokkos Tools callbacks. Generally, unnamed callbacks are the demangled 
                                   name of the function, which is very long (min: 1, dtype: integral) 
    --critical-trace-serialize-names
                                   Include names in serialization of critical trace (mainly for debugging) (min: 0, dtype: boolean) 
                                                                 
    [GENERAL OPTIONS]  These are options which are ubiquitously applied 
                                                                 
    -c, --config                   Configuration file (min: 1, dtype: filepath) 
    -o, --output                   Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix (min: 1, dtype: path 
                                   [prefix]) 
    -T, --trace                    Generate a detailed trace (perfetto output) (max: 1, dtype: bool) 
    -P, --profile                  Generate a call-stack-based profile (conflicts with --flat-profile) (max: 1, dtype: bool) 
    -F, --flat-profile             Generate a flat profile (conflicts with --profile) (max: 1, dtype: bool) 
    -S, --sample [ cputime | realtime ]
                                   Enable statistical sampling of call-stack (min: 0, dtype: timer-type) 
    -H, --host                     Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc. (max: 1, dtype: bool) 
    -D, --device                   Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc. (max: 1, dtype: bool) 
    -w, --wait                     This option is a combination of '--trace-wait' and '--sampling-wait'. See the descriptions for those two options. 
                                   (count: 1, dtype: seconds) 
    -d, --duration                 This option is a combination of '--trace-duration' and '--sampling-duration'. See the descriptions for those two 
                                   options. (count: 1, dtype: seconds) 
    --periods                      Similar to specifying delay and/or duration except in the form <DELAY>:<DURATION>, <DELAY>:<DURATION>:<REPEAT>, and/or 
                                   <DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID> (min: 1, dtype: period-spec(s)) 
                                                                 
    [BACKEND OPTIONS]  These options control region information captured w/o sampling or instrumentation 
                                                                 
    -I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
                                   Include data from these backends (min: 1, dtype: [backend...]) 
    -E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
                                   Exclude data from these backends (min: 1, dtype: [backend...]) 
    --mode [ causal | coverage | sampling | trace ]
                                   Data collection mode. Used to set default values for OMNITRACE_USE_* options. Typically set by omnitrace binary 
                                   instrumenter. (min: 1, dtype: string) 
    --use-causal                   Enable causal profiling analysis (min: 0, dtype: boolean) 
    --use-kokkosp                  Enable support for Kokkos Tools (min: 0, dtype: boolean) 
    --use-mpip                     Enable support for MPI functions (min: 0, dtype: boolean) 
    --use-roctx                    Enable ROCtx API. Warning! Out-of-order ranges may corrupt perfetto flamegraph (min: 0, dtype: boolean) 
    --critical-trace               Enable generation of the critical trace (min: 0, dtype: boolean) 
    --use-code-coverage            Enable support for code coverage (min: 0, dtype: boolean) 
    --use-perfetto                 Enable perfetto backend (min: 0, dtype: boolean) 
    --use-process-sampling         Enable a background thread which samples process-level and system metrics such as the CPU/GPU freq, power, memory 
                                   usage, etc. (min: 0, dtype: boolean) 
    --use-rcclp                    Enable support for ROCm Communication Collectives Library (RCCL) Performance (min: 0, dtype: boolean) 
    --use-rocm-smi                 Enable sampling GPU power, temp, utilization, and memory usage (min: 0, dtype: boolean) 
    --use-rocprofiler              Enable ROCm hardware counters (min: 0, dtype: boolean) 
    --use-roctracer                Enable ROCm API and kernel tracing (min: 0, dtype: boolean) 
    --use-sampling                 Enable statistical sampling of call-stack (min: 0, dtype: boolean) 
    --use-timemory                 Enable timemory backend (min: 0, dtype: boolean) 
    --trace-thread-barriers        Enable tracing calls to pthread_barrier functions. (min: 0, dtype: boolean) 
    --trace-thread-join            Enable tracing calls to pthread_join functions. (min: 0, dtype: boolean) 
    --trace-thread-locks           Enable tracing calls to pthread_mutex_lock, pthread_mutex_unlock, pthread_mutex_trylock (min: 0, dtype: boolean) 
    --trace-thread-rw-locks        Enable tracing calls to pthread_rwlock_* functions. May cause deadlocks with ROCm-enabled OpenMPI. (min: 0, dtype: 
                                   boolean) 
    --trace-thread-spin-locks      Enable tracing calls to pthread_spin_* functions. May cause deadlocks with MPI distributions. (min: 0, dtype: boolean) 
                                                                 
    [PARALLELISM OPTIONS]                               
                                                                 
    --thread-pool-size             Max number of threads for processing background tasks (min: 1, dtype: integral) 
    --num-threads-hint             This is hint for how many threads are expected to be created in the application. Setting this value allows omnitrace to 
                                   preallocate resources during initialization and warn about any potential issues. For example, when call-stack sampling, 
                                   each thread has a unique sampler instance which communicates with an allocator instance running in a background thread. 
                                   Each allocator only handles N sampling instances (where N is the value of OMNITRACE_SAMPLING_ALLOCATOR_SIZE). When this 
                                   hint is set to >= the number of threads that get sampled, omnitrace can start all the background threads during 
                                   initialization (min: 1, dtype: integral) 
                                                                 
    [TRACING OPTIONS]  Specific options controlling tracing (i.e. deterministic measurements of every event) 
                                                                 
    --trace-file                   Specify the trace output filename. Relative filepath will be with respect to output path and output prefix. (count: 1, 
                                   dtype: filepath) 
    --trace-buffer-size            Size limit for the trace output (in KB) (count: 1, dtype: KB) 
    --trace-fill-policy [ discard | ring_buffer ]
                                   
                                   Policy for new data when the buffer size limit is reached:
                                       - discard     : new data is ignored
                                       - ring_buffer : new data overwrites oldest data (count: 1, dtype: policy)
    --trace-wait                   Set the wait time (in seconds) before collecting trace and/or profiling data(in seconds). By default, the duration is 
                                   in seconds of realtime but that can changed via --trace-clock-id. (count: 1, dtype: seconds) 
    --trace-duration               Set the duration of the trace and/or profile data collection (in seconds). By default, the duration is in seconds of 
                                   realtime but that can changed via --trace-clock-id. (count: 1, dtype: seconds) 
    --trace-periods                More powerful version of specifying trace delay and/or duration. Format is one or more groups of: <DELAY>:<DURATION>, 
                                   <DELAY>:<DURATION>:<REPEAT>, and/or <DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID>. (min: 1, dtype: period-spec(s)) 
    --trace-clock-id [ 0 (realtime|CLOCK_REALTIME)
                       1 (monotonic|CLOCK_MONOTONIC)
                       2 (cputime|CLOCK_PROCESS_CPUTIME_ID)
                       4 (monotonic_raw|CLOCK_MONOTONIC_RAW)
                       5 (realtime_coarse|CLOCK_REALTIME_COARSE)
                       6 (monotonic_coarse|CLOCK_MONOTONIC_COARSE)
                       7 (boottime|CLOCK_BOOTTIME) ]
                                   Set the default clock ID for for trace delay/duration. Note: "cputime" is the *process* CPU time and might need to be 
                                   scaled based on the number of threads, i.e. 4 seconds of CPU-time for an application with 4 fully active threads would 
                                   equate to ~1 second of realtime. If this proves to be difficult to handle in practice, please file a feature request 
                                   for omnitrace to auto-scale based on the number of threads. (count: 1, dtype: clock-id) 
                                                                 
    [PROFILE OPTIONS]  Specific options controlling profiling (i.e. deterministic measurements which are aggregated into a summary) 
                                                                 
    --profile-format [ console | json | text ]
                                   Data formats for profiling results (min: 1, dtype: string) 
    --profile-diff                 Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters 
                                   corresponding to the input path and the input prefix (min: 1, dtype: path [prefix]) 
                                                                 
    [HOST/DEVICE (PROCESS SAMPLING) OPTIONS]
                                   Process sampling is background measurements for resources available to the entire process. These samples are not tied 
                                   to specific lines/regions of code 
                                                                 
    --process-freq                 Set the default host/device sampling frequency (number of interrupts per second) (count: 1, dtype: floating-point) 
    --process-wait                 Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime) (count: 1, 
                                   dtype: seconds) 
    --process-duration             Set the duration of the host/device sampling (in seconds of realtime) (count: 1, dtype: seconds) 
    --cpus                         CPU IDs for frequency sampling. Supports integers and/or ranges (count: unlimited, dtype: int and/or range) 
    --gpus                         GPU IDs for SMI queries. Supports integers and/or ranges (count: unlimited, dtype: int and/or range) 
                                                                 
    [GENERAL SAMPLING OPTIONS] General options for timer-based sampling per-thread 
                                                                 
    -f, --sampling-freq            Set the default sampling frequency (number of interrupts per second) (count: 1, dtype: floating-point) 
    -t, --tids                     Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target 
                                   application is assigned an atomically incrementing value. (min: 1, dtype: int and/or range) 
    --sampling-wait                Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock 
                                   of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime (count: 1, dtype: 
                                   seconds) 
    --sampling-duration            Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time 
                                   delay that exceeds the real-time duration... resulting in zero samples being taken (count: 1, dtype: seconds) 
                                                                 
    [SAMPLING TIMER OPTIONS] These options determine the heuristic for deciding when to take a sample 
                                                                 
    --sample-cputime               Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
                                       0. Enables sampling based on CPU-clock timer.
                                       1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
                                       2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample.
                                       3+ Thread IDs to target for sampling, starting at 0 (the main thread).
                                          May be specified as index or range, e.g., '0 2-4' will be interpreted as:
                                             sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads (min: 0, dtype: [freq] [delay] [tids...])
    --sample-realtime              Sample based on a real-clock timer. Accepts zero or more arguments:
                                       0. Enables sampling based on real-clock timer.
                                       1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime.
                                       2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample.
                                       3+ Thread IDs to target for sampling, starting at 0 (the main thread).
                                          May be specified as index or range, e.g., '0 2-4' will be interpreted as:
                                             sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
                                          When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
                                          to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
                                          whereas the CPU-clock time does not. (min: 0, dtype: [freq] [delay] [tids...])
                                                                 
    [ADVANCED SAMPLING OPTIONS] These options determine the heuristic for deciding when to take a sample 
                                                                 
    --sampling-cputime-delay       Time (in seconds) to wait before the first CPU-time sampling signal is delivered. Defaults to OMNITRACE_SAMPLING_DELAY 
                                   when <= 0.0 (min: 1, dtype: floating-point) 
    --sampling-cputime-freq        Number of software interrupts per second of CPU-time. Defaults to OMNITRACE_SAMPLING_FREQ when <= 0.0 (min: 1, dtype: 
                                   floating-point) 
    --sampling-cputime-tids        Same as OMNITRACE_SAMPLING_TIDS but applies specifically to samplers whose timers are based on the CPU-time. This is 
                                   useful when both OMNITRACE_SAMPLING_CPUTIME=ON and OMNITRACE_SAMPLING_REALTIME=ON (min: 0, dtype: string) 
    --sampling-include-inlines     Create entries for inlined functions when available (min: 0, dtype: boolean) 
    --sampling-keep-internal       Configure whether the statistical samples should include call-stack entries from internal routines in omnitrace. E.g. 
                                   when ON, the call-stack will show functions like omnitrace_push_trace. If disabled, omnitrace will attempt to filter 
                                   out internal routines from the sampling call-stacks (min: 0, dtype: boolean) 
    --sampling-realtime-delay      Time (in seconds) to wait before the first real (wall) time sampling signal is delivered. Defaults to 
                                   OMNITRACE_SAMPLING_DELAY when <= 0.0 (min: 1, dtype: floating-point) 
    --sampling-realtime-freq       Number of software interrupts per second of real (wall) time. Defaults to OMNITRACE_SAMPLING_FREQ when <= 0.0 (min: 1, 
                                   dtype: floating-point) 
    --sampling-realtime-offset     Modify this value only if the target process is also using SIGRTMIN. E.g. the signal used is SIGRTMIN + <THIS_VALUE>. 
                                   Value must be <= 30 (min: 1, dtype: integral) 
    --sampling-realtime-tids       Same as OMNITRACE_SAMPLING_TIDS but applies specifically to samplers whose timers are based on the real (wall) time. 
                                   This is useful when both OMNITRACE_SAMPLING_CPUTIME=ON and OMNITRACE_SAMPLING_REALTIME=ON (min: 0, dtype: string) 
                                                                 
    [HARDWARE COUNTER OPTIONS] See also: omnitrace-avail -H  
                                                                 
    -C, --cpu-events               Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`) (min: 1, dtype: [EVENT ...]) 
    -G, --gpu-events               Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`) (min: 1, dtype: [EVENT ...]) 
                                                                 
    [CATEGORY OPTIONS]                               
                                                                 
    --enable-categories [ causal
                          comm_data
                          cpu_frequency
                          critical-trace
                          device-critical-trace
                          device_busy
                          device_hip
                          device_hsa
                          device_memory_usage
                          device_power
                          device_temp
                          host
                          host-critical-trace
                          kernel_hardware_counter
                          kokkos
                          mpi
                          numa
                          ompt
                          process_context_switch
                          process_kernel_cpu_time
                          process_memory_hwm
                          process_page_fault
                          process_sampling
                          process_user_cpu_time
                          process_virtual_memory
                          pthread
                          python
                          rccl
                          rocm_hip
                          rocm_hsa
                          rocm_roctx
                          rocm_smi
                          rocprofiler
                          roctracer
                          sampling
                          thread_context_switch
                          thread_cpu_time
                          thread_hardware_counter
                          thread_page_fault
                          thread_peak_memory
                          thread_wall_time
                          timemory
                          user ]
                                   Enable collecting profiling and trace data for these categories and disable all other categories (min: 1, dtype: 
                                   string) 
    --disable-categories [ causal
                           comm_data
                           cpu_frequency
                           critical-trace
                           device-critical-trace
                           device_busy
                           device_hip
                           device_hsa
                           device_memory_usage
                           device_power
                           device_temp
                           host
                           host-critical-trace
                           kernel_hardware_counter
                           kokkos
                           mpi
                           numa
                           ompt
                           process_context_switch
                           process_kernel_cpu_time
                           process_memory_hwm
                           process_page_fault
                           process_sampling
                           process_user_cpu_time
                           process_virtual_memory
                           pthread
                           python
                           rccl
                           rocm_hip
                           rocm_hsa
                           rocm_roctx
                           rocm_smi
                           rocprofiler
                           roctracer
                           sampling
                           thread_context_switch
                           thread_cpu_time
                           thread_hardware_counter
                           thread_page_fault
                           thread_peak_memory
                           thread_wall_time
                           timemory
                           user ]
                                   Disable collecting profiling and trace data for these categories (min: 1, dtype: string) 
                                                                 
    [IO OPTIONS]                                     
                                                                 
    --tmpdir                       Base directory for temporary files (min: 0, dtype: string) 
    --use-pid                      Enable tagging filenames with process identifier (either MPI rank or pid) (min: 0, dtype: boolean) 
    --time-output                  Output data to subfolder w/ a timestamp (see also: TIME_FORMAT) (min: 0, dtype: boolean) 
    --causal-file                  Name of causal output filename (w/o extension) (min: 0, dtype: string) 
    --causal-file-reset            Overwrite any existing causal output file instead of appending to it (min: 0, dtype: boolean) 
    --use-temporary-files          Write data to temporary files to minimize the memory usage of omnitrace, e.g. call-stack samples will be periodically 
                                   written to a file and re-loaded during finalization (min: 0, dtype: boolean) 
                                                                 
    [PERFETTO OPTIONS]                               
                                                                 
    --perfetto-backend [ all | inprocess | system ]
                                   Specify the perfetto backend to activate. Options are: 'inprocess', 'system', or 'all' (min: 1, dtype: string) 
    --perfetto-roctracer-per-stream
                                   Separate roctracer GPU side traces (copies, kernels) into separate tracks based on the stream they're enqueued into 
                                   (min: 0, dtype: boolean) 
    --perfetto-shmem-size-hint-kb 
                                   Hint for shared-memory buffer size in perfetto (in KB) (min: 1, dtype: integral) 
                                                                 
    [TIMEMORY OPTIONS]                               
                                                                 
    --timemory-components          List of components to collect via timemory (see `omnitrace-avail -C`) (min: 0, dtype: string) 
                                                                 
    [ROCM OPTIONS]                                   
                                                                 
    --roctracer-hip-activity       Enable HIP activity tracing support (min: 0, dtype: boolean) 
    --roctracer-hip-api            Enable HIP API tracing support (min: 0, dtype: boolean) 
    --roctracer-hsa-activity       Enable HSA activity tracing support (min: 0, dtype: boolean) 
    --roctracer-hsa-api            Enable HSA API tracing support (min: 0, dtype: boolean) 
    --roctracer-hsa-api-types      HSA API type to collect (min: 0, dtype: string) 
                                                                 
    [CRITICAL_TRACE OPTIONS]                               
                                                                 
    --critical-trace-buffer-count 
                                   Number of critical trace records to store in thread-local memory before submitting to shared buffer (min: 1, dtype: 
                                   integral) 
    --critical-trace-count         Number of critical trace to export (0 == all) (min: 1, dtype: integral) 
    --critical-trace-per-row       How many critical traces per row in perfetto (0 == all in one row) (min: 1, dtype: integral) 
                                                                 
    [MISCELLANEOUS OPTIONS]                               
                                                                 
    -i, --inlines                  Include inline info in output when available (max: 1, dtype: bool) 
    --hsa-interrupt [ 0 | 1 ]      Set the value of the HSA_ENABLE_INTERRUPT environment variable.
                                     ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
                                     that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
                                     when --realtime is specified to make users aware that, while this may fix the bug, it can have a negative impact on
                                     performance.
                                     Values:
                                       0     avoid triggering the bug, potentially at the cost of reduced performance
                                       1     do not modify how ROCm is notified about kernel completion (count: 1, dtype: int)
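The options in the help menu above can be combined on one command line. The following is a minimal dry-run sketch, assuming the category names are passed as space-separated values after `--enable-categories` (the help text only states `min: 1, dtype: string`). `build_run_cmd` is a hypothetical helper, not part of omnitrace, and `./foo.inst` is the instrumented executable from the usage section. The sketch prints the command instead of running it, so it works without omnitrace installed.

```shell
#!/bin/sh
# Dry-run sketch: print an omnitrace-run command line built from options that
# appear in the help menu. Nothing is executed here.
# build_run_cmd is a hypothetical helper, not part of omnitrace.
build_run_cmd() {
    target="$1"   # instrumented executable, e.g. ./foo.inst
    shift
    # Remaining arguments are category names from the --enable-categories list.
    printf 'omnitrace-run --enable-categories %s --perfetto-backend inprocess -- %s\n' "$*" "${target}"
}

build_run_cmd ./foo.inst sampling rocm_hip
```

Per the help text, `--enable-categories` disables every category that is not listed, so this example would collect only `sampling` and `rocm_hip` data.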

- ensure LD_PRELOAD for libomnitrace-dl.so
- convert config options into command-line options
- updates to tsettings
- updates to argparser
- throw error if get_env<bool> has empty string
- minor tweaks to categories of settings
- add argparse for common handling of argument parsers
- fix handling of --trace-file (OMNITRACE_PERFETTO_FILE)
- updated to use omnitrace::argparse functions
- remove choices for --cpu-events and --gpu-events
@jrmadsen added labels on Mar 10, 2023: enhancement: New feature or request; timemory: Issue affects/involves timemory features/capabilities; testing: Extends/improves/modifies testing; cmake: Modifies the CMake build system; submodule: Updates a git submodule; omnitrace-sample: Involves the omnitrace-sample executable; new feature: Introduces new feature; omnitrace-run: Involves the omnitrace-run executable
@jrmadsen
Collaborator Author

jrmadsen commented Mar 10, 2023

  • Write documentation
  • Exclude causal profiling settings
  • Add pass regex to omnitrace-run-args test

jrmadsen and others added 8 commits March 10, 2023 20:56
- fix pedantic warning
- remove testing args that may not be there in some builds
- disable roctracer_data when roctracer not enabled
- allow DEFAULT_MODULE and LIBRARY_MODULE
- support get_env for enums
- Add "mode" category to OMNITRACE_MODE
- remove debug print statement
- change var init
- use --help instead of -?
- tweak header include style
- add_ld_preload func
- launcher and command member variables in parser_data
- support launcher
- clean up and reworked
- require LD_PRELOAD with binary rewrite
- dl::InstrumentMode
- dl::get_instrumented()
- verify_instrumented_preloaded()
- omnitrace_set_instrumented(int)
- relocated omnitrace_main from main.c to dl.cpp
- omnitrace_set_env does not dlopen libomnitrace
- omnitrace_set_main(func_ptr) [internal API]
- OMNITRACE_HIDDEN_API -> OMNITRACE_INTERNAL_API
- adhere to LD_PRELOAD requirements
- invoke omnitrace_set_instrumented
- binary rewrite does not instrument main
- binary rewrite does not instrument call to omnitrace_init
- runtime instr does not instrument main
- runtime instr does not instrument call to omnitrace_init
- LD_PRELOAD requirement necessitates minor version increment
- fix ambiguous get_env calls
- fix issue with temporaries
- runtime instrumentation does not work if libomnitrace-dl is preloaded
- define dl::InstrumentMode in dl.hpp
- handle instrumentation via setprofile libpyomnitrace
  - do not push trace in omnitrace_init
- move header to dl subdirectory
- omnitrace::omnitrace-headers include omnitrace-dl folder
- use InstrumentMode in omnitrace-instrument
- Use omnitrace-run on instrumented exes
- add omnitrace-run to examples of running binary rewritten exes
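Several of the commit notes above concern the requirement that `libomnitrace-dl.so` be preloaded before an instrumented executable starts, which `omnitrace-run` handles for the user. The sketch below illustrates the general idea of prepending a library to `LD_PRELOAD` without duplicating it. It is not the PR's `add_ld_preload` implementation: `prepend_ld_preload` is a hypothetical helper, and it assumes a colon-separated `LD_PRELOAD` value and a bare library name that the dynamic loader can resolve.

```shell
#!/bin/sh
# Hypothetical helper (not omnitrace code): print an LD_PRELOAD value with
# the given library placed first, unless it is already in the list.
prepend_ld_preload() {
    lib="$1"          # library to preload, e.g. libomnitrace-dl.so
    current="${2-}"   # existing colon-separated LD_PRELOAD value (may be empty)
    case ":${current}:" in
        # Already listed: leave the value unchanged.
        *":${lib}:"*) printf '%s\n' "${current}" ;;
        # Otherwise prepend, adding a colon only when there is an existing value.
        *) printf '%s\n' "${lib}${current:+:${current}}" ;;
    esac
}

# Print the value a launcher would export before starting the instrumented exe.
prepend_ld_preload libomnitrace-dl.so "libfoo.so"
```

A launcher would export the printed value as `LD_PRELOAD` and then `exec` the target. In practice, `omnitrace-run -- ./foo.inst` is the supported way to do this.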
@jrmadsen added labels on Mar 14, 2023: libomnitrace-dl: Involves omnitrace-dl library; omnitrace-instrument: Involves the omnitrace-instrument executable (binary instrumenter); libpyomnitrace: Involves the omnitrace python bindings; continuous-integration: Updates to continuous integration
@jrmadsen jrmadsen mentioned this pull request Mar 14, 2023
@jrmadsen changed the title from "omnitrace-run executable" to "omnitrace-run executable - required for running binary writes" on Mar 15, 2023
@jrmadsen jrmadsen merged commit abe35de into ROCm:main Mar 15, 2023
@jrmadsen jrmadsen deleted the omnitrace-run branch March 15, 2023 00:48