Initial commit for CUDA fault injection tool #399

Merged

Conversation

gerashegalov
Collaborator

@gerashegalov gerashegalov commented Jul 22, 2022

This PR is the initial version of a CUDA fault injection tool for exploring and testing the correctness of CUDA error handling in fault-tolerant CUDA applications.

The tool is designed with both automated and interactive testing in mind. It is a shared library, libcufaultinj.so, that the CUDA process loads during the CUDA Driver API call cuInit when the library's path is provided via the CUDA_INJECTION64_PATH environment variable.

As an example, it can be used to test the RAPIDS Accelerator for Apache Spark.

Local Mode

CUDA_INJECTION64_PATH=$PWD/target/cmake-build/faultinj/libcufaultinj.so \
FAULT_INJECTOR_CONFIG_PATH=src/test/cpp/faultinj/test_faultinj.json \
$SPARK_HOME/bin/pyspark \
  --jars $SPARK_RAPIDS_REPO/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin

Distributed Mode

$SPARK_HOME/bin/spark-shell \
  --jars $SPARK_RAPIDS_REPO/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --files ./target/cmake-build/faultinj/libcufaultinj.so,./src/test/cpp/faultinj/test_faultinj.json \
  --conf spark.executorEnv.CUDA_INJECTION64_PATH=./libcufaultinj.so \
  --conf spark.executorEnv.FAULT_INJECTOR_CONFIG_PATH=test_faultinj.json \
  --conf spark.rapids.memory.gpu.minAllocFraction=0 \
  --conf spark.rapids.memory.gpu.allocFraction=0.2 \
  --master spark://hostname:7077 

When configuring the executor environment variable spark.executorEnv.CUDA_INJECTION64_PATH, the value must contain a path separator, as in ./libcufaultinj.so with the leading dot and slash, so that dlopen loads the library file submitted via --files. Without a slash, dlopen assumes a locally installed library and searches for it via LD_LIBRARY_PATH and similar dynamic linker mechanisms. See the dlopen man page.
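The lookup rule above can be sketched in a few lines of C++. This is only an illustration of the behavior described in the dlopen man page, not code from this PR; the helper names isResolvedAsPath and asWorkingDirPath are hypothetical.

```cpp
#include <string>

// Mirrors the rule from the dlopen man page: a name that contains a slash is
// treated as a relative or absolute pathname, while a bare name is resolved
// through the dynamic linker's search path (LD_LIBRARY_PATH and so on).
bool isResolvedAsPath(const std::string& libraryName) {
  return libraryName.find('/') != std::string::npos;
}

// Hypothetical helper: prefix a bare file name with "./" so that it resolves
// relative to the current working directory instead of the search path.
std::string asWorkingDirPath(const std::string& libraryName) {
  return isResolvedAsPath(libraryName) ? libraryName : "./" + libraryName;
}
```

With these helpers, asWorkingDirPath("libcufaultinj.so") yields "./libcufaultinj.so", which is the form used for spark.executorEnv.CUDA_INJECTION64_PATH in the command above.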

Fault injection configuration

Fault injection configuration is provided via the FAULT_INJECTOR_CONFIG_PATH environment variable. It is a set of rules, each of which injects a fault with a given probability when a CUDA Driver or Runtime API call is matched by function name or callback id.

There are currently three types of fault injection:

  • launch a kernel with the PTX trap instruction
  • launch a kernel with a device assert
  • replace the return code for the CUDA Runtime call

Example config:

{
    "logLevel": 1,
    "dynamic": true,
    "cudaRuntimeFaults": {
        "cudaLaunchKernel_ptsz": {
            "percent": 0,
            "injectionType": 0,
            "injectionType_comment": "PTX trap = 0, C assert = 1",
            "interceptionCount": 1
        }
    },
    "cudaDriverFaults": {
        "cuMemFreeAsync_ptsz": {
            "percent": 0,
            "injectionType": 2,
            "injectionType_comment": "substitute return code",
            "substituteReturnCode": 999,
            "interceptionCount": 1
        }
    }
}
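As a rough sketch of how such rules might be applied, the C++ below models a rule and a selection function. It is not the tool's implementation. The struct, the function selectFault, and the map type are hypothetical. It also rests on two assumptions the description does not confirm: that percent is a probability out of 100, and that interceptionCount is a budget of remaining injections for the rule. The injection type values 0, 1 and 2 are taken from the example config.

```cpp
#include <map>
#include <string>

// Values follow the comments in the example config.
enum class InjectionType { PtxTrap = 0, DeviceAssert = 1, SubstituteReturnCode = 2 };

// Hypothetical in-memory form of one rule from the JSON config.
struct FaultRule {
  int percent;               // ASSUMPTION: probability out of 100
  InjectionType injectionType;
  int substituteReturnCode;  // only meaningful for SubstituteReturnCode
  int interceptionCount;     // ASSUMPTION: remaining injections allowed
};

using FaultRules = std::map<std::string, FaultRule>;

// Returns the rule to apply to this call, or nullptr if no fault is injected.
// `roll` is a caller-supplied number in [0, 100), e.g. from a random number
// generator; passing it in keeps the function deterministic and testable.
const FaultRule* selectFault(FaultRules& rules, const std::string& functionName, int roll) {
  auto it = rules.find(functionName);
  if (it == rules.end()) { return nullptr; }           // no rule for this call
  FaultRule& rule = it->second;
  if (rule.interceptionCount <= 0) { return nullptr; }  // budget exhausted
  if (roll >= rule.percent) { return nullptr; }         // probability miss
  --rule.interceptionCount;
  return &rule;
}
```

Under these assumptions, a rule with "percent": 0, as in the example config, never fires, and a rule with "percent": 100 and "interceptionCount": 1 fires exactly once.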

Signed-off-by: Gera Shegalov <gera@apache.org>

gerashegalov and others added 30 commits July 6, 2022 15:28
Signed-off-by: Gera Shegalov <gera@apache.org>
@gerashegalov
Collaborator Author

build

@gerashegalov gerashegalov requested a review from jlowe July 29, 2022 15:55
mythrocks
mythrocks previously approved these changes Jul 29, 2022
Collaborator

@mythrocks mythrocks left a comment

I'm 👍. Thank you for accommodating the changes requested.

There might be future iterations where we make C++-related changes, e.g. RAII for mutex locking, and std::mutex and std::thread instead of pthreads primitives. But we needn't bother with those now.
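For reference, a minimal sketch of the RAII locking idea using only the standard library. The class CallCounter and its members are hypothetical and do not appear in this PR; the point is only that std::lock_guard releases the mutex on every return path without an explicit unlock call.

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical shared state guarded by a mutex.
class CallCounter {
 public:
  // Returns the updated count for `name`, or -1 for an empty name.
  int increment(const std::string& name) {
    std::lock_guard<std::mutex> guard(mutex_);  // lock acquired here
    if (name.empty()) { return -1; }            // released on early return
    return ++counts_[name];                     // released on normal return
  }

  // Helper for illustration: true if the mutex is currently free.
  bool isUnlocked() {
    if (!mutex_.try_lock()) { return false; }
    mutex_.unlock();
    return true;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, int> counts_;
};
```

With pthread_mutex_lock and pthread_mutex_unlock, each early return would need its own unlock call; here the guard's destructor handles it.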

@mythrocks
Collaborator

mythrocks approved these changes 9 minutes ago
...
Review required
At least 1 approving review is required by reviewers with write access.

Notice that my 👍 is ignored here. :]

Comment on lines 412 to 416
boost::property_tree::ptree::const_iterator end = pTree.end();
for (boost::property_tree::ptree::const_iterator it = pTree.begin(); it != end; ++it) {
  spdlog::trace("config key={} value={}", it->first, it->second.get_value<std::string>());
  traceConfig(it->second);
}

Don't have to cache end.

for (auto it = pTree.begin(); it != pTree.end(); ++it)

clang-format -style='file:thirdparty/cudf/cpp/.clang-format' -i  src/main/cpp/faultinj/faultinj.cu
@gerashegalov
Collaborator Author

There might be future iterations where we make C++-related changes, e.g. RAII for mutex locking, and std::mutex and std::thread instead of pthreads primitives. But we needn't bother with those now.

Thanks for the review @mythrocks

Absolutely, once the PR is merged I'll file an epic for these improvements.

@gerashegalov
Collaborator Author

build

@gerashegalov gerashegalov merged commit 358d093 into NVIDIA:branch-22.08 Aug 1, 2022
@gerashegalov gerashegalov deleted the faultInjectorPlayground branch August 1, 2022 18:29