Linux Perf Support + Causal Profiling Updates (#276)
* causal backtrace updates

- fix initial causal sampling period value

* causal delay updates

- tweak handling of sleep_for_overhead

* Fix experiment global scaling for prog pts

- results in drastically improved predictions

* pthread_mutex_gotcha updates

- disable all wrappers during causal profiling

* validate-causal-json.py updates

- support decimal stddev
- fix setting stddev from command-line

* causal perform_experiment_impl update

- handle start failing because finalizing

* deprecate causal::component::sample_rate

- appears to not help at all

* Rework sample info

* Increase causal unwind_depth

- use OMNITRACE_MAX_UNWIND_DEPTH

* validate-causal-json updates

- min experiments
  - exclude reporting predictions with less than X experiments at a given speedup
- percent samples
  - only print samples within X% of the peak (default: 95%)
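
The minimum-experiments filter noted above might be sketched as follows. The function name, the data layout (a mapping from speedup percentage to a list of per-experiment results), and the sample values are assumptions made for illustration and are not taken from validate-causal-json.py; the `>=` comparison follows the "testing updates" entry later in this message.

```python
from typing import Dict, List


def filter_by_min_experiments(
    results: Dict[int, List[float]], min_experiments: int
) -> Dict[int, List[float]]:
    """Keep only speedups backed by at least `min_experiments` experiments."""
    return {
        speedup: values
        for speedup, values in results.items()
        if len(values) >= min_experiments  # inclusive threshold
    }


# hypothetical data: virtual speedup (%) -> measured program speedups (%)
example = {0: [0.0, 0.1, -0.1], 10: [4.9, 5.2], 20: [9.8]}
kept = filter_by_min_experiments(example, min_experiments=2)
print(sorted(kept))
```

With a threshold of 2, the speedup with a single experiment is dropped from the report while the others are kept.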

* Update timemory submodule

- extensions to sampling for signals delivered via non-timer method
  - e.g. via HW counter overflow

* dwarf_entry::operator< updates

- sort via file

* causal profiling docs updates

- info about backends
- info about installing/enabling perf

* config updates: causal backend

- CausalBackend enum
- OMNITRACE_CAUSAL_BACKEND: perf, timer, auto
- omnitrace-causal option: --backend

* debug update

- use spin_mutex instead of std::mutex

* address_range::contains update

- the range 0-100 contains the range 10-100, but contains() returned false because the upper bounds were equal (high == 100) instead of strictly less (high < 100)
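
A minimal sketch of the boundary condition described above, written in Python for brevity. The `AddressRange` class and both method names are hypothetical and only mirror the described behavior; they are not the project's C++ address_range type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AddressRange:
    """Hypothetical closed address interval [low, high]."""

    low: int
    high: int

    def contains_strict(self, other: "AddressRange") -> bool:
        # behavior described as buggy: an equal upper bound is rejected
        return self.low <= other.low and other.high < self.high

    def contains(self, other: "AddressRange") -> bool:
        # behavior described as fixed: equal bounds count as contained
        return self.low <= other.low and other.high <= self.high


outer = AddressRange(0, 100)
inner = AddressRange(10, 100)
print(outer.contains_strict(inner), outer.contains(inner))
```

The only difference between the two checks is `<` versus `<=` on the upper bound, which is enough to misclassify any sub-range that ends exactly where its parent ends.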

* symbol::operator< update

- handle load address differences

* sampling updates (non-causal)

- update get_timer to get_trigger + dynamic_cast

* container::static_vector updates

- support construction from container::c_array
- update_size private member func for handling atomic m_size

* Move perf files

- moved library/causal/perf.{hpp,cpp} to library/perf.{hpp,cpp}

* causal example update

- created impl.hpp (forward decls)
- renamed {cpu,rng}_func_impl to {cpu,rng}_impl_func
- only create two threads which run N iterations instead of two threads each iteration

* Update timemory submodule

- updates to unwind::processed_entry
- updates to procfs::maps

* Updated causal documentation

- fixed line numbers changed by modifications to causal example

* omnitrace-causal exe updates

- set OMNITRACE_THREAD_POOL_SIZE to zero by default

* core/containers updates

- static_vector: provide data() member function
- c_array pop_front() and pop_back() member functions

* core: config and argparse updates + perf

- core/perf.{hpp,cpp}
  - forward decl of enums
  - config-related capabilities
- argparse: --sample-overflow
- renamed some config functions
  - e.g. get_sampling_cpu_freq -> get_sampling_cputime_freq
- added config settings related to overflow sampling via perf
- added timer_sampling and overflow_sampling categories

* Update timemory submodule

- sampling allocator flushing

* binary updates

- lookup_ipaddr_entry
- use bfd_find_nearest_line instead of bfd_find_nearest_line_discriminator
  - discriminators are not used
- explicit instantiations of inlined_symbol::serialize

* Bump VERSION to 1.10.0

* sampling and perf updates

- support overflow sampling via Linux Perf
- update perf namespace
- update perf::perf_event
  - update record ctor: pointer instead of const ref
  - update open member func: return optional string
  - add m_batch_size member variable
- sampling updates
  - support overflow sampling
  - flush allocators
  - increase buffer size from 1024 to 2048
  - restructure post-processing in light of perf overflow support
  - improve offload memory usage: only load buffers for the given thread
  - load_offload_buffer(tid) uses thread-specific filepos
- component updates
  - backtrace_metrics::operator-=
  - backtrace_metrics::operator-
  - backtrace::sample does not record for overflow signal
  - callchain: perf overflow sample

* core updates

- component::sampling_percent does not report self + uses_percent_units

* causal updates

- tweak get_line_info
- overloads for set_current_selection (uint64_t, c_array, std::array)
- delay
  - use sampling::pause/sampling::resume
- experiment
  - experiment::sample derives from unwind::processed_entry
  - experiment::samples is vector instead of set
  - fixed samples
  - overloads for is_selected (uint64_t, c_array, std::array)
  - scaling factor defaults to 100 instead of 50
  - serialize updates follow change to experiment::sample
  - modify algorithm for increasing/decreasing experiment length
- sample_data
  - use map<uintptr, uint64_t> instead of set<sample_data>
  - get_samples returns vector<sample_data> instead of set<sample_data>
- sampling
  - support overflow via Linux Perf
  - update causal_offload_buffer
  - flush sampling allocator
- backtrace
  - overflow component

* libomnitrace-dl updates

- handle dl::InstrumentMode::PythonProfile

* testing updates (causal)

- causal line 155 -> causal line 100
- causal line 165 -> causal line 110

* formatting

* exit_gotcha updates

- exit_info for abort()
- message about non-zero exit code

* testing updates

- fail regex for causal tests
- validate-causal-json: >= min_experiments instead of > min_experiments
- handle OMNITRACE_DEBUG_SETTINGS in omnitrace_write_test_config

* causal sampling updates

- add new lines where appropriate

* causal data updates

- reorder diagnostic info when experiment fails to start

* binary updates

- symbol address range from address to address + symsize + 1
  - add 1 based on debug info

* causal data updates

- sample_selection wait_ns defaults to 1,000 instead of 10,000
- sample_selection wait scaled by iteration number
- save_line_info_impl verbosity
- print latest_eligible_pc when experiment does not start

* causal sampling + component updates

- perf backend disables component::backtrace
- ensure get_sampling_(realtime|cputime|overflow)_signal do not malloc

* causal: remove period stats

* validate-causal-json update

- fix --help

* causal data updates

- improve eligible pc history reporting when experiment fails to start

* causal data updates

- fix compute_eligible_lines_impl
  - eligible address ranges returning too many ranges
  - occasionally, overwrite all *true* eligible address ranges

* causal data updates

- reduce scoped ranges to symbol ranges
- is_eligible_address() returns true only when a range actually contains the address (not just the coarse range)
- revert some sample_selection behavior

* binary address_multirange updates

- make coarse_range private
- fix operator+=(pair<coarse, uintptr_t>)

* causal example update

- fix nsync to default to once per iteration

* binary analysis updates

- tweak header file includes

* causal updates

- remove factoring in sleep_for_overhead
- invoke delay::process() even if experiment is not active

* causal data updates

- update latest_eligible_pc structure

* update omnitrace-install.py.in

- fix support for fedora
  - /etc/os-release does not have ID_LIKE
  - fallback to RHEL 8.7 if version not specified

* update omnitrace-install.py.in

- fix support for debian
  - /etc/os-release does not have ID_LIKE
  - version mapping

* Update documentation

- update docs on installation

* causal data and experiment updates

- data: reset_sample_selection

* causal set_current_selection debugging

- debug messages for failed e2e runs

* causal data and backtrace component updates

- data: set_current_selection returns the number of eligible addresses added
- backtrace: if the cputime signal has selected zero IPs more than 5 times, the realtime signal starts contributing call-stacks

* core library updates

- move config::parse_numeric_range to utility namespace
- add core/utility.cpp
- support range:increment, e.g. 5-25:10 expands to '5 15 25' instead of '5 10 15 20 25'
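
The range:increment expansion can be sketched roughly as below. This is an illustration in Python, not the C++ parse_numeric_range implementation; the function name is hypothetical, and `default_step` is an assumption because this message does not state which step applies when the increment is omitted. Negative values are not handled.

```python
from typing import List


def expand_numeric_range(spec: str, default_step: int = 1) -> List[int]:
    """Expand 'LOW-HIGH[:STEP]' (or a single integer) into a list of integers."""
    bounds, _, step_text = spec.partition(":")
    step = int(step_text) if step_text else default_step
    if step <= 0:
        raise ValueError("step must be positive: {}".format(spec))
    low_text, sep, high_text = bounds.partition("-")
    if not sep:
        return [int(low_text)]  # single value, e.g. "15"
    low, high = int(low_text), int(high_text)
    return list(range(low, high + 1, step))  # upper bound is inclusive


print(expand_numeric_range("5-25:10"))
print(expand_numeric_range("5-25:5"))
```

The first call reproduces the '5 15 25' expansion quoted above; the second shows how an increment of 5 yields '5 10 15 20 25'.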

* omnitrace-causal update

- end-to-end expands all speedups
- support range:increment in speedups

* causal backtrace updates

- remove select_ival (realtime signal always contributes when select_count == 0)

* containers: static_vector update

- explicit c_array constructor
- explicit std::array constructor

* causal data updates

- remove set_current_selection(uint64_t)
- remove set_current_selection(std::array)
- sample_selection increase default wait time
- report eligible PC candidates
- move reset_sample_selection to perform_experiment_impl
- decrease latest_eligible_pc array size
- set_current_selection does not guard for experiment::active

* core debug updates

- OMNITRACE_PRINT_COLOR macros

* causal data updates

- tweak to experiment never started message

* causal gotcha updates

- remove unused code

* critical trace updates

- remove unused code

* omnitrace-causal

- OMNITRACE_LAUNCHER

* causal data updates

- don't fail on end-to-end + omnitrace-causal

* causal backtrace updates

- reintroduce select_ival behavior

* causal data updates

- tweak verbose messages about number of PC candidates

* core mproc updates

- utilities for waiting on child PID and diagnosing status
  - omnitrace::mproc::wait_pid
  - omnitrace::mproc::diagnose_status

* omnitrace-run updates

- support --fork argument: fork the current process and call execvpe in the child, instead of calling execvpe in the current process
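
The fork-then-exec pattern, together with the wait/diagnose step mentioned for omnitrace::mproc, might look like this Python sketch. The function name and the return-value conventions (127 for a failed exec, a negative signal number for a killed child) are assumptions for illustration, not omnitrace-run behavior.

```python
import os
import sys
from typing import List


def run_forked(argv: List[str]) -> int:
    """Fork, replace the child with `argv` via execvpe, and wait in the parent."""
    pid = os.fork()
    if pid == 0:
        # child: execvpe only returns control on failure (it raises OSError)
        try:
            os.execvpe(argv[0], argv, os.environ)
        except OSError:
            os._exit(127)  # assumed convention for "could not execute"
    # parent: block until the child terminates, then decode its status
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return -os.WTERMSIG(status)
    return -1


code = run_forked([sys.executable, "-c", "import sys; sys.exit(3)"])
print("child exit code:", code)
```

Keeping the parent alive is what allows it to inspect the child's exit status afterwards, which is not possible when the current process is replaced directly by execvpe.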

* omnitrace-causal updates

- wait_pid and diagnose_status just call equivalent functions in omnitrace::mproc

* ubuntu-focal workflow update

- attempt to launch ubuntu-focal-codecov job with CAP_SYS_ADMIN and use perf backend

* tests reorg and updates

- remove binary-rewrite-sampling and runtime-instrument-sampling tests
- rename *-preload tests (which use omnitrace-sample exe) to *-sampling
- split tests/CMakeLists.txt into several tests/omnitrace-<category>-tests.cmake files
- tweak to causal-both-omni-func test
  - add args: -n 2 -b timer

* update validate-causal-json.py

- better reasoning info for adjusting tolerance
- always apply tolerance adjustments in CI mode

* causal e2e tests update

- add "causal-e2e" label
- tweak params
  - old: 80 12 432525 500000000
  - new: 80 50 432525 100000000
- disable processor affinity for slow-func/line-100 tests
  - artificially inflates some speedups with perf

* unblocking_gotcha updates

- overload operator() according to gotcha function index

* blocking_gotcha updates

- overload operator() according to gotcha function index
- fix bug where post-block functors (e.g. for pthread_mutex_trylock) could throw an error if the lock is not acquired

* parse_numeric_range update

- support unordered_set

* config update

- OMNITRACE_DEBUG_{TIDS,PIDS} use parse_numeric_range
jrmadsen committed Apr 13, 2023
1 parent cc14b52 commit 9de3a6b
Showing 96 changed files with 5,479 additions and 3,003 deletions.
10 changes: 5 additions & 5 deletions .cmake-format.yaml
@@ -21,10 +21,9 @@ parse:
omnitrace_add_test:
flags:
- SKIP_BASELINE
- SKIP_PRELOAD
- SKIP_SAMPLING
- SKIP_REWRITE
- SKIP_RUNTIME
- SKIP_SAMPLING
kwargs:
NAME: '*'
TARGET: '*'
@@ -33,15 +32,16 @@ parse:
NUM_PROCS: '*'
REWRITE_TIMEOUT: '*'
RUNTIME_TIMEOUT: '*'
PRELOAD_TIMEOUT: '*'
SAMPLING_TIMEOUT: '*'
SAMPLING_ARGS: '*'
REWRITE_ARGS: '*'
RUNTIME_ARGS: '*'
RUN_ARGS: '*'
ENVIRONMENT: '*'
LABELS: '*'
PROPERTIES: '*'
PRELOAD_PASS_REGEX: '*'
PRELOAD_FAIL_REGEX: '*'
SAMPLING_PASS_REGEX: '*'
SAMPLING_FAIL_REGEX: '*'
RUNTIME_PASS_REGEX: '*'
RUNTIME_FAIL_REGEX: '*'
REWRITE_PASS_REGEX: '*'
2 changes: 2 additions & 0 deletions .github/workflows/ubuntu-focal.yml
@@ -554,9 +554,11 @@ jobs:

container:
image: jrmadsen/omnitrace:ci-base-ubuntu-20.04
options: --cap-add CAP_SYS_ADMIN

env:
OMNITRACE_VERBOSE: 2
OMNITRACE_CAUSAL_BACKEND: perf

steps:
- uses: actions/checkout@v3
16 changes: 13 additions & 3 deletions README.md
@@ -99,9 +99,19 @@ See the [Getting Started documentation](https://amdresearch.github.io/omnitrace/
- Visit [Releases](https://github.com/AMDResearch/omnitrace/releases) page
- Select appropriate installer (recommendation: `.sh` scripts do not require super-user privileges unlike the DEB/RPM installers)
- If targeting a ROCm application, find the installer script with the matching ROCm version
- If you are unsure about your Linux distro, check `/etc/os-release`
- If no installer script matches your target OS, try one of the Ubuntu 18.04 `*.sh` installers
- This installation may be built against older library versions supported on your distro via backwards compatibility
- If you are unsure about your Linux distro, check `/etc/os-release` or use the `omnitrace-install.py` script

If the above recommendation is not desired, download the `omnitrace-install.py` and specify `--prefix <install-directory>` when
executing it. This script will attempt to auto-detect a compatible OS distribution and version.
If ROCm support is desired, specify `--rocm X.Y` where `X` is the ROCm major version and `Y`
is the ROCm minor version, e.g. `--rocm 5.4`.

```console
wget https://github.com/AMDResearch/omnitrace/releases/latest/download/omnitrace-install.py
python3 ./omnitrace-install.py --prefix /opt/omnitrace/rocm-5.4 --rocm 5.4
```

See the [Installation Documentation](https://amdresearch.github.io/omnitrace/installation) for detailed information.

### Setup

2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-1.9.2
+1.10.0
71 changes: 57 additions & 14 deletions cmake/Templates/omnitrace-install.py.in
@@ -63,30 +63,73 @@ def get_os_info(os_distrib, os_version):
_key, _data = line.split("=", 1)
_os_info[_key] = _data.strip('"')

def _parse_version(_v):
_version = re.split(r"[\\.-]", _v)
return (
"{}.{}".format(_version[0], _version[1])
if len(_version) > 1
else "{}".format(_version[0])
)

if os_distrib is None or os_distrib == "auto":
if "debian" in _os_info["ID_LIKE"]:
if "ubuntu" in _os_info["ID"]:
os_distrib = "ubuntu"
elif "suse" in _os_info["ID_LIKE"]:
elif "opensuse" in _os_info["ID"]:
os_distrib = "opensuse"
elif "rhel" in _os_info["ID_LIKE"]:
elif "rhel" in _os_info["ID"]:
os_distrib = "rhel"
elif "fedora" in _os_info["ID_LIKE"]:
elif "centos" in _os_info["ID"]:
os_distrib = "rhel"
elif "centos" in _os_info["ID_LIKE"]:
elif "rockylinux" in _os_info["ID"]:
os_distrib = "rhel"
elif "debian" in _os_info["ID"]:
os_distrib = "ubuntu"
if "debian" in _os_info["ID"] and os_version is None:
_debian_version = float(_parse_version(_os_info["VERSION_ID"]))
if _debian_version >= 11.0:
os_version = "20.04"
else:
os_version = "18.04"
elif "fedora" in _os_info["ID"]:
os_distrib = "rhel"
# fedora has different versioning system so fallback to 8.7
if os_version is None:
os_version = "8.7"
else:
raise RuntimeError(
"Unknown ID_LIKE value in /etc/os-release: {}".format(_os_info["ID_LIKE"])
)
elif os_distrib == "fedora" or os_distrib == "centos":
# if we don't have an exact match, check ID_LIKE
if "ID_LIKE" not in _os_info.keys():
_os_info["ID_LIKE"] = _os_info["ID"]

if "debian" in _os_info["ID_LIKE"]:
os_distrib = "ubuntu"
if os_version is None:
# fallback on 18.04 if ID is not ubuntu but debian-like
os_version = "18.04"
elif "suse" in _os_info["ID_LIKE"]:
os_distrib = "opensuse"
# fallback on 15.3 if ID is not opensuse but suse-like
if os_version is None:
os_version = "15.3"
elif "rhel" in _os_info["ID_LIKE"] or "centos" in _os_info["ID_LIKE"]:
os_distrib = "rhel"
if os_version is None:
os_version = "8.7"
else:
raise RuntimeError(
"Unknown ID_LIKE value in /etc/os-release: {}".format(
_os_info["ID_LIKE"]
)
)
elif os_distrib == "centos":
os_distrib = "rhel"
# uses same versioning system
elif os_distrib == "fedora":
os_distrib = "rhel"
if os_version is None:
# fedora has different versioning system so fallback to 8.7
os_version = "8.7"

if os_version is None:

def _parse_version(_v):
_version = re.split(r"[\\.-]", _v)
return "{}.{}".format(_version[0], _version[1])

os_version = _parse_version(_os_info["VERSION_ID"])

return (os_distrib, os_version)
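
Because the hunk above interleaves removed and added lines, the resulting detection flow is hard to follow. The following standalone sketch is one reading of the added lines: match on `ID` first, then fall back to `ID_LIKE` (defaulting to `ID` when the key is absent). The function names and the sample os-release contents are assumptions for illustration; this is not the installer's actual code.

```python
import re
from typing import Dict, Optional, Tuple


def parse_os_release(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines in the style of /etc/os-release."""
    info = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            info[key.strip()] = value.strip().strip('"')
    return info


def major_minor(version: str) -> str:
    """Reduce '11.6-1' to '11.6' and leave '37' as '37'."""
    return ".".join(re.split(r"[.-]", version)[:2])


def detect(info: Dict[str, str], version: Optional[str] = None) -> Tuple[str, str]:
    os_id = info["ID"]
    like = info.get("ID_LIKE", os_id)  # fall back to ID when ID_LIKE is absent
    if "ubuntu" in os_id:
        distrib = "ubuntu"
    elif "opensuse" in os_id:
        distrib = "opensuse"
    elif any(name in os_id for name in ("rhel", "centos", "rockylinux")):
        distrib = "rhel"
    elif "debian" in os_id:
        distrib = "ubuntu"
        if version is None:
            # map Debian releases onto the Ubuntu installers
            debian = float(major_minor(info["VERSION_ID"]))
            version = "20.04" if debian >= 11.0 else "18.04"
    elif "fedora" in os_id:
        distrib, version = "rhel", version or "8.7"
    elif "debian" in like:
        distrib, version = "ubuntu", version or "18.04"
    elif "suse" in like:
        distrib, version = "opensuse", version or "15.3"
    elif "rhel" in like or "centos" in like:
        distrib, version = "rhel", version or "8.7"
    else:
        raise RuntimeError("Unknown ID_LIKE value: {}".format(like))
    if version is None:
        version = major_minor(info["VERSION_ID"])
    return distrib, version


fedora = parse_os_release("ID=fedora\nVERSION_ID=37\n")
debian = parse_os_release('ID=debian\nVERSION_ID="11"\n')
print(detect(fedora), detect(debian))
```

Note that the Fedora sample has no `ID_LIKE` key at all, which is the case the commit message calls out for both Fedora and Debian.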
139 changes: 42 additions & 97 deletions examples/causal/causal.cpp
@@ -1,122 +1,67 @@
#include "causal.hpp"

#include <chrono>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <mutex>
#include <random>
#include <string>
#include <thread>
#include <unistd.h>
#include <vector>

using mutex_t = std::timed_mutex;
using auto_lock_t = std::unique_lock<mutex_t>;
using clock_type = std::chrono::high_resolution_clock;
using nanosec = std::chrono::nanoseconds;
#include "impl.hpp"

namespace
{
std::chrono::duration<double, std::milli> t_ms;
std::chrono::duration<double, std::milli> slow_ms;
std::chrono::duration<double, std::milli> fast_ms;

template <typename... Args>
inline void
consume_variables(Args&&...)
{}
} // namespace

template <bool>
bool
rng_func_impl(int64_t n, uint64_t rseed);

template <bool>
bool
cpu_func_impl(int64_t n, int nloop);

void
rng_slow_func(int64_t n, uint64_t rseed) __attribute__((noinline));

void
rng_fast_func(int64_t n, uint64_t rseed) __attribute__((noinline));

void
cpu_slow_func(int64_t n, int nloop) __attribute__((noinline));

void
cpu_fast_func(int64_t n, int nloop) __attribute__((noinline));

#if USE_CPU > 0
# define CPU_SLOW_FUNC(...) cpu_slow_func(__VA_ARGS__)
# define CPU_FAST_FUNC(...) cpu_fast_func(__VA_ARGS__)
#else
# define CPU_SLOW_FUNC(...) consume_variables(__VA_ARGS__)
# define CPU_FAST_FUNC(...) consume_variables(__VA_ARGS__)
#endif

#if USE_RNG > 0
# define RNG_SLOW_FUNC(...) rng_slow_func(__VA_ARGS__)
# define RNG_FAST_FUNC(...) rng_fast_func(__VA_ARGS__)
#else
# define RNG_SLOW_FUNC(...) consume_variables(__VA_ARGS__)
# define RNG_FAST_FUNC(...) consume_variables(__VA_ARGS__)
#endif

int
main(int argc, char** argv)
{
uint64_t rseed = std::random_device{}();
int nitr = 200;
size_t nitr = 50;
double frac = 70;
int64_t slow_val = 100000000L;
int64_t slow_val = 200000000L;
size_t nsync = 1;

if(argc > 1) frac = std::stod(argv[1]);
if(argc > 2) nitr = std::stoi(argv[2]);
if(argc > 2) nitr = std::stoull(argv[2]);
if(argc > 3) rseed = std::stoul(argv[3]);
if(argc > 4) slow_val = std::stol(argv[4]);
if(argc > 5) nsync = std::stoull(argv[5]);

nsync = std::min<size_t>(std::max<size_t>(nsync, 1), nitr);
int64_t fast_val = (frac / 100.0) * slow_val;
double rfrac = (fast_val / static_cast<double>(slow_val));
if(argc > 5) fast_val = std::stol(argv[5]);

printf("\nIterations: %i, fraction: %6.2f, random seed: %lu :: slow = %zu, "
"fast = %zu, expected ratio = %6.2f\n",
nitr, frac, rseed, slow_val, fast_val, rfrac * 100.0);

auto _t = clock_type::now();
for(int i = 0; i < nitr; ++i)
printf("\nFraction: %6.2f, iterations: %zu, random seed: %lu :: slow = %zu, "
"fast = %zu, expected ratio = %6.2f, sync every %lu iterations\n",
frac, nitr, rseed, slow_val, fast_val, rfrac * 100.0, nsync);

auto _wait_barrier = pthread_barrier_t{};
pthread_barrier_init(&_wait_barrier, nullptr, 3);
auto _thread_func = [nitr, nsync, &_wait_barrier](const auto& _func, auto* _timer,
auto _nsec, auto _nseed,
auto _nloop) {
pthread_barrier_wait(&_wait_barrier);
for(size_t i = 0; i < nitr; ++i)
{
auto _t = clock_type::now();
_func(_nsec, _nseed, _nloop);
(*_timer) += (clock_type::now() - _t);
CAUSAL_PROGRESS_NAMED("iteration");
if(i % nsync == (nsync - 1)) pthread_barrier_wait(&_wait_barrier);
}
};

auto _t = clock_type::now();
auto _threads = std::vector<std::thread>{};
_threads.emplace_back(_thread_func, SLOW_FUNC, &slow_ms, slow_val, rseed, 10000);
_threads.emplace_back(_thread_func, FAST_FUNC, &fast_ms, fast_val, rseed, 10000);
pthread_barrier_wait(&_wait_barrier);
for(size_t i = 0; i < nitr; ++i)
{
if(i == 0 || i + 1 == nitr || i % (nitr / 5) == 0)
printf("executing iteration: %i\n", i);
//
auto&& _slow_func = [](auto _nsec, auto _seed, auto _nloop) {
auto _t = clock_type::now();
CPU_SLOW_FUNC(_nsec, _nloop);
RNG_SLOW_FUNC(_nsec / 5, _seed);
slow_ms += (clock_type::now() - _t);
};
//
auto&& _fast_func = [](auto _nsec, auto _seed, auto _nloop) {
auto _t = clock_type::now();
CPU_FAST_FUNC(_nsec, _nloop);
RNG_FAST_FUNC(_nsec / 5, _seed);
fast_ms += (clock_type::now() - _t);
};
//
CAUSAL_BEGIN("main_iteration");
//
auto _threads = std::vector<std::thread>{};
_threads.emplace_back(std::move(_slow_func), slow_val, rseed, 10000);
_threads.emplace_back(std::move(_fast_func), fast_val, rseed, 10000);
for(auto& itr : _threads)
itr.join();
CAUSAL_END("main_iteration");
CAUSAL_PROGRESS;
(printf("executing iteration: %zu\n", i), fflush(stdout));
if(i % nsync == (nsync - 1)) pthread_barrier_wait(&_wait_barrier);
}
for(auto& itr : _threads)
itr.join();

t_ms += clock_type::now() - _t;
auto rms = (fast_ms.count() / slow_ms.count());
printf("slow_func() took %10.3f ms\n", slow_ms.count());
@@ -132,7 +77,7 @@ void
rng_slow_func(int64_t n, uint64_t rseed)
{
// clang-format off
while(rng_func_impl<false>(n, rseed) != false) {}
while(rng_impl_func<false>(n, rseed) != false) {}
// clang-format on
}
//
@@ -142,7 +87,7 @@ void
rng_fast_func(int64_t n, uint64_t rseed)
{
// clang-format off
while(rng_func_impl<true>(n, rseed) != true) {}
while(rng_impl_func<true>(n, rseed) != true) {}
// clang-format on
}
//
@@ -152,7 +97,7 @@ void
cpu_slow_func(int64_t n, int nloop)
{
// clang-format off
while(cpu_func_impl<false>(n, nloop) != false) {}
while(cpu_impl_func<false>(n, nloop) != false) {}
// clang-format on
}
//
@@ -162,6 +107,6 @@ void
cpu_fast_func(int64_t n, int nloop)
{
// clang-format off
while(cpu_func_impl<true>(n, nloop) != true) {}
while(cpu_impl_func<true>(n, nloop) != true) {}
// clang-format on
}
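
The reworked example creates two long-lived worker threads that rendezvous with the main thread once at startup and then every `nsync` iterations, rather than spawning two threads per iteration. The synchronization pattern alone can be sketched in Python with `threading.Barrier` standing in for `pthread_barrier_t`; the function name and the counters are illustrative assumptions, and the real slow/fast workloads are replaced by a counter increment.

```python
import threading
from typing import List, Tuple


def run_with_barrier(nitr: int, nsync: int) -> Tuple[int, List[int]]:
    """Run two workers for `nitr` iterations, syncing every `nsync` iterations."""
    nsync = min(max(nsync, 1), nitr)  # clamp to [1, nitr], as in the example
    barrier = threading.Barrier(3)  # two workers plus the main thread
    counts = [0, 0]

    def worker(idx: int) -> None:
        barrier.wait()  # start all parties together
        for i in range(nitr):
            counts[idx] += 1  # stand-in for the slow/fast function call
            if i % nsync == nsync - 1:
                barrier.wait()

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(2)]
    for thread in threads:
        thread.start()

    barrier.wait()
    syncs = 0
    for i in range(nitr):
        if i % nsync == nsync - 1:
            barrier.wait()  # main thread participates in every sync point
            syncs += 1
    for thread in threads:
        thread.join()
    return syncs, counts


print(run_with_barrier(10, 3))
```

All three parties reach the barrier the same number of times (once at startup plus once per sync point), which is what keeps the pattern free of deadlock when `nitr` is not a multiple of `nsync`.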
