Initial commit for CUDA fault injection tool #399

gerashegalov · 2022-07-22T17:46:21Z

This PR is the initial version of a CUDA fault injection tool for exploring and testing the correctness of CUDA error handling in fault-tolerant CUDA applications.

The tool is designed with both automated and interactive testing use cases in mind. It is a dynamically linked library, libcufaultinj.so, that the CUDA process loads during the CUDA Driver API call cuInit when the library path is provided via the CUDA_INJECTION64_PATH environment variable.
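For readers unfamiliar with the injection mechanism, here is a minimal sketch of what such a library exposes. It assumes the entry point is named InitializeInjection and returns non-zero on success, as in NVIDIA's CUPTI injection samples; this is an assumption for illustration, not a copy of the code in this PR.

```cpp
#include <cstdio>

// Entry point the CUDA driver is assumed to look up after loading the
// library named by CUDA_INJECTION64_PATH. The name and return convention
// follow NVIDIA's CUPTI injection samples and are an assumption here.
extern "C" int InitializeInjection(void)
{
  // A real tool would subscribe to CUPTI callbacks and read its config here.
  std::fprintf(stderr, "fault injector loaded\n");
  return 1;  // non-zero is treated as successful initialization in this sketch
}
```

Built as a shared library, a stub like this would only announce itself; the fault rules described below are what make the real tool useful.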

As an example, it can be used to test the RAPIDS Accelerator for Apache Spark.

Local Mode

CUDA_INJECTION64_PATH=$PWD/target/cmake-build/faultinj/libcufaultinj.so \
FAULT_INJECTOR_CONFIG_PATH=src/test/cpp/faultinj/test_faultinj.json \
$SPARK_HOME/bin/pyspark \
  --jars $SPARK_RAPIDS_REPO/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin

Distributed Mode

$SPARK_HOME/bin/spark-shell \
  --jars $SPARK_RAPIDS_REPO/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --files ./target/cmake-build/faultinj/libcufaultinj.so,./src/test/cpp/faultinj/test_faultinj.json \
  --conf spark.executorEnv.CUDA_INJECTION64_PATH=./libcufaultinj.so \
  --conf spark.executorEnv.FAULT_INJECTOR_CONFIG_PATH=test_faultinj.json \
  --conf spark.rapids.memory.gpu.minAllocFraction=0 \
  --conf spark.rapids.memory.gpu.allocFraction=0.2 \
  --master spark://hostname:7077

When configuring the executor environment via spark.executorEnv.CUDA_INJECTION64_PATH, the value must contain a path separator, as in ./libcufaultinj.so with the leading dot and slash, so that dlopen loads the library file submitted via --files. Otherwise dlopen assumes a locally installed library reachable by the dynamic linker via LD_LIBRARY_PATH and similar mechanisms. See the dlopen man page.
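The rule behind this can be sketched in a few lines. The helper name isOpenedAsPath is hypothetical; it only mirrors the dlopen man page statement that a file name containing a slash is interpreted as a relative or absolute path.

```cpp
#include <string>

// Mirrors the rule from the dlopen man page: a name containing '/' is
// opened as a (relative or absolute) path; otherwise the dynamic linker
// searches its usual library locations. Hypothetical helper for illustration.
bool isOpenedAsPath(const std::string& name)
{
  return name.find('/') != std::string::npos;
}
```

With this rule, ./libcufaultinj.so resolves to the file in the executor's working directory, while a bare libcufaultinj.so would go through the library search path.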

Fault injection configuration

Fault injection configuration is provided via the FAULT_INJECTOR_CONFIG_PATH environment variable. It is a set of rules that apply fault injection, with a given probability, when a CUDA Driver or Runtime call is matched by function name or callback id.

There are currently three types of fault injection:

launch a kernel with the PTX trap instruction
launch a kernel with a device assert
replace the return code for the CUDA Runtime call

Example config:

{
    "logLevel": 1,
    "dynamic": true,
    "cudaRuntimeFaults": {
        "cudaLaunchKernel_ptsz": {
            "percent": 0,
            "injectionType": 0,
            "injectionType_comment": "PTX trap = 0, C assert = 1",
            "interceptionCount": 1
        }
    },
    "cudaDriverFaults": {
        "cuMemFreeAsync_ptsz": {
            "percent": 0,
            "injectionType": 2,
            "injectionType_comment": "substitute return code",
            "substituteReturnCode": 999,
            "interceptionCount": 1
        }
    }
}
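To make the rule semantics concrete, here is a rough sketch of how one rule and its probability check might be modeled. FaultRule and shouldInject are hypothetical names, the enum values follow the comments in the example config, and the use of a uniform random draw is an assumption rather than the tool's actual implementation.

```cpp
#include <random>

// Values follow the comments in the example config.
enum class InjectionType { PtxTrap = 0, DeviceAssert = 1, SubstituteReturnCode = 2 };

// Hypothetical in-memory form of one rule; field names mirror the JSON keys.
struct FaultRule {
  double percent;            // probability of injecting, 0..100
  InjectionType type;        // which kind of fault to inject
  int substituteReturnCode;  // only meaningful for SubstituteReturnCode
};

// Decide whether a matched call should be faulted (assumed semantics:
// inject when a uniform draw from [0, 100) falls below `percent`).
template <typename Rng>
bool shouldInject(const FaultRule& rule, Rng& rng)
{
  if (rule.percent <= 0.0) return false;
  if (rule.percent >= 100.0) return true;
  std::uniform_real_distribution<double> dist(0.0, 100.0);
  return dist(rng) < rule.percent;
}
```

Under these assumptions, the example config above with "percent": 0 would match calls but never inject a fault until the percentage is raised.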

Signed-off-by: Gera Shegalov gera@apache.org

gerashegalov · 2022-07-29T15:52:40Z

build

mythrocks

I'm 👍. Thank you for accommodating the changes requested.

There might be future iterations where we make C++-related changes, e.g. RAII for mutex locking, and std::mutex and std::thread instead of pthreads primitives. But we needn't bother with those now.
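As an illustration of the kind of change meant here, this is a minimal sketch of RAII locking with std::mutex, std::lock_guard, and std::thread. CallCounter and countConcurrently are hypothetical names and do not come from faultinj.cu.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Counter guarded by std::mutex; the lock is released automatically when
// the std::lock_guard goes out of scope, even if an exception is thrown.
class CallCounter {
 public:
  void increment()
  {
    std::lock_guard<std::mutex> lock(mutex_);
    ++count_;
  }
  long value() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return count_;
  }

 private:
  mutable std::mutex mutex_;
  long count_ = 0;
};

// Runs `threads` workers that each increment the counter `perThread` times.
long countConcurrently(int threads, int perThread)
{
  CallCounter counter;
  std::vector<std::thread> workers;
  for (int t = 0; t < threads; ++t) {
    workers.emplace_back([&counter, perThread] {
      for (int i = 0; i < perThread; ++i) counter.increment();
    });
  }
  for (auto& w : workers) w.join();
  return counter.value();
}
```

Compared with explicit pthread_mutex_lock and pthread_mutex_unlock pairs, there is no unlock call to forget on an early return.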

mythrocks · 2022-07-29T17:44:01Z

mythrocks approved these changes 9 minutes ago
...
Review required
At least 1 approving review is required by reviewers with write access.

Notice that my 👍 is ignored here. :]

ttnghia · 2022-07-29T18:44:44Z

src/main/cpp/faultinj/faultinj.cu

+    boost::property_tree::ptree::const_iterator end = pTree.end();
+    for (boost::property_tree::ptree::const_iterator it = pTree.begin(); it != end; ++it) {
+        spdlog::trace("config key={} value={}",  it->first, it->second.get_value<std::string>());
+        traceConfig(it->second);
+    }


There's no need to cache end:

for (auto it = pTree.begin(); it != pTree.end(); ++it)
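Going one step further, a range-based for loop avoids naming the iterators at all. The sketch below uses a stand-in Node type rather than boost::property_tree::ptree so that it stays self-contained; Node and collectConfig are hypothetical, and the same loop shape should apply to any container exposing begin() and end().

```cpp
#include <string>
#include <vector>

// Stand-in for a property tree node: a key, a value, and ordered children.
struct Node {
  std::string key;
  std::string value;
  std::vector<Node> children;
};

// Depth-first walk collecting "key=value" lines, parents before children.
void collectConfig(const Node& node, std::vector<std::string>& out)
{
  for (const Node& child : node.children) {
    out.push_back(child.key + "=" + child.value);
    collectConfig(child, out);
  }
}
```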

clang-format -style='file:thirdparty/cudf/cpp/.clang-format' -i src/main/cpp/faultinj/faultinj.cu

gerashegalov · 2022-07-29T23:52:27Z

There might be future iterations where we make C++-related changes, e.g. RAII for mutex locking, and std::mutex and std::thread instead of pthreads primitives. But we needn't bother with those now.

Thanks for the review @mythrocks

Absolutely. Once the PR is merged, I'll file an epic for these improvements.

gerashegalov · 2022-07-29T23:52:55Z

build

gerashegalov and others added 30 commits July 6, 2022 15:28

wip

c059b9a

wip

51e1607

assert + trap + ret value POC

48496f5

cleanup

2ad4dd5

wip

73259d5

Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids-jni into faultInjectorPlayground

9e5700b

Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids-jni into faultInjectorPlayground

05fae9e

temporarily undo FAULT_INJECTOR cmake

2670bd6

Signed-off-by: Gera Shegalov <gera@apache.org>

wip

8ed98a4

Signed-off-by: Gera Shegalov <gera@apache.org>

wip

0412349

Signed-off-by: Gera Shegalov <gera@apache.org>

Some C++

14b13f3

Signed-off-by: Gera Shegalov <gera@apache.org>

boost logging refactor

cd9c372

init config work

ff9288d

one time config almost, done

774acd5

Readded symbolName check for launch events

9c308d6

demoable config

4219b4a

dynamic reconfig and use rand for probability

b78e7ae

use two digits after comma for prob

80b7ccb

add reload frequency to the config

3a45a26

smaller demo config

6a7f504

update license header

2be98db

Merge branch 'NVIDIA:branch-22.08' into faultInjectorPlayground

3950552

Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids-jni into faultInjectorPlayground

9b4c0ea

wrong indentation

49b6511

Merge branch 'faultInjectorPlayground' of github.com:gerashegalov/spark-rapids-jni into faultInjectorPlayground

9d41531

Merge remote-tracking branch 'origin/branch-22.08' into faultInjectorPlayground

80166d4

wildcard

62bbeea

conditional sync

36d2678

Merge remote-tracking branch 'origin/branch-22.08' into faultInjectorPlayground

c4f41ba

spdlog

6c6e2d6

Signed-off-by: Gera Shegalov <gera@apache.org>

ttnghia reviewed Jul 27, 2022

src/main/cpp/faultinj/README.md

ttnghia reviewed Jul 27, 2022