Initial commit for CUDA fault injection tool #399
…o faultInjectorPlayground
Signed-off-by: Gera Shegalov <gera@apache.org>
…rk-rapids-jni into faultInjectorPlayground
build
I'm 👍. Thank you for accommodating the changes requested.
There might be future iterations where we make C++-related changes, e.g. RAII for mutex locking, and std::mutex and threads instead of pthreads primitives. But we needn't bother with those now.
Notice that my 👍 is ignored here. :]
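For reference, a minimal sketch of what the RAII suggestion above could look like, assuming standard C++11 facilities (std::lock_guard, std::mutex, std::thread). The Counter type and runWorkers function are hypothetical and do not come from faultinj.cu:

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared state, standing in for whatever the tool guards
// with a pthread mutex today.
struct Counter {
  std::mutex mtx;
  long value = 0;

  void increment() {
    // lock_guard locks the mutex here and unlocks it when the scope exits,
    // including on early return or exception.
    std::lock_guard<std::mutex> guard(mtx);
    ++value;
  }
};

// Starts numThreads std::thread workers, each incrementing the counter
// perThread times, and returns the final count after joining them all.
long runWorkers(int numThreads, int perThread) {
  assert(numThreads >= 0 && perThread >= 0);
  Counter counter;
  std::vector<std::thread> workers;
  for (int t = 0; t < numThreads; ++t) {
    workers.emplace_back([&counter, perThread] {
      for (int i = 0; i < perThread; ++i) counter.increment();
    });
  }
  for (auto& w : workers) w.join();
  return counter.value;
}
```

With the guard in place there is no explicit unlock call to forget on an error path, which is the main appeal over paired pthread_mutex_lock and pthread_mutex_unlock calls.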
src/main/cpp/faultinj/faultinj.cu
Outdated
boost::property_tree::ptree::const_iterator end = pTree.end();
for (boost::property_tree::ptree::const_iterator it = pTree.begin(); it != end; ++it) {
  spdlog::trace("config key={} value={}", it->first, it->second.get_value<std::string>());
  traceConfig(it->second);
}
Don't have to cache end.
for (auto it = pTree.begin(); it != pTree.end(); ++it)
clang-format -style='file:thirdparty/cudf/cpp/.clang-format' -i src/main/cpp/faultinj/faultinj.cu
Thanks for the review, @mythrocks. Absolutely, once the PR is merged I'll file an epic for these improvements.
build
This PR is the initial version of a CUDA fault injection tool for exploring and testing the correctness of CUDA error handling in fault-tolerant CUDA applications.
The tool is designed with both automated and interactive testing use cases in mind. It is a dynamically linked library, libcufaultinj.so, that the CUDA process loads via the CUDA Driver API cuInit if it is provided via the CUDA_INJECTION64_PATH environment variable.
As an example, it can be used to test the RAPIDS Accelerator for Apache Spark.
Local Mode
Distributed Mode
When we configure the executor environment spark.executorEnv.CUDA_INJECTION64_PATH, the value must contain a path separator, as in ./libcufaultinj.so with the leading dot, to make sure that dlopen loads the submitted library file. Otherwise dlopen assumes a locally installed library accessible to the dynamic linker via LD_LIBRARY_PATH and similar mechanisms. See the dlopen man page.
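The dlopen rule behind this can be sketched as follows: per the dlopen man page, a name containing a slash is interpreted as a (relative or absolute) pathname, while a bare name is searched for by the dynamic linker. Both helper functions below are hypothetical illustrations and are not part of the tool:

```cpp
#include <cassert>
#include <string>

// Mirrors the dlopen man page rule: a name containing '/' is treated as a
// pathname; a bare name goes through the dynamic linker's search instead.
bool isTreatedAsPathname(const std::string& name) {
  return name.find('/') != std::string::npos;
}

// Hypothetical helper: prefixes "./" to a bare file name so that it resolves
// relative to the current working directory instead of being searched for.
std::string asRelativePath(const std::string& name) {
  assert(!name.empty());
  return isTreatedAsPathname(name) ? name : "./" + name;
}
```

Under this rule, libcufaultinj.so alone would be looked up through the linker search path, whereas ./libcufaultinj.so points at the file in the working directory.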
Fault injection configuration
Fault injection configuration is provided via the FAULT_INJECTOR_CONFIG_PATH environment variable. It is a set of rules that apply fault injection with a given probability when a CUDA Driver or Runtime call is matched by function name or callback id.
There are currently three types of fault injection:
trap instruction
Example config:
Signed-off-by: Gera Shegalov gera@apache.org
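As an illustration of the rule matching described under "Fault injection configuration", here is a minimal sketch of matching a call by function name and injecting with a given probability. The InjectionRule and RuleTable types, the percent field, and the integer-percentage representation are all assumptions made for this example; they are not the tool's actual config schema or implementation:

```cpp
#include <cassert>
#include <map>
#include <random>
#include <string>

// Hypothetical rule shape; the field name is illustrative only.
struct InjectionRule {
  int percent;  // chance of injecting a fault, 0..100
};

class RuleTable {
 public:
  explicit RuleTable(unsigned seed) : rng_(seed) {}

  void addRule(const std::string& functionName, int percent) {
    assert(percent >= 0 && percent <= 100);
    rules_[functionName] = InjectionRule{percent};
  }

  // Returns true when a rule matches functionName and a random draw in
  // [0, 99] falls below the rule's percentage.
  bool shouldInject(const std::string& functionName) {
    auto it = rules_.find(functionName);
    if (it == rules_.end()) return false;  // no rule for this function
    std::uniform_int_distribution<int> draw(0, 99);
    return draw(rng_) < it->second.percent;
  }

 private:
  std::map<std::string, InjectionRule> rules_;
  std::mt19937 rng_;
};
```

A rule at 100 percent fires on every matching call and a rule at 0 percent never fires, which gives deterministic endpoints for automated tests; a fixed seed makes the in-between cases reproducible.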