Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial commit for CUDA fault injection tool #399

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
49 commits
Select commit Hold shift + click to select a range
c059b9a
wip
gerashegalov Jul 6, 2022
51e1607
wip
gerashegalov Jul 6, 2022
48496f5
assert + trap + ret value POC
gerashegalov Jul 7, 2022
2ad4dd5
cleanup
gerashegalov Jul 7, 2022
73259d5
wip
gerashegalov Jul 7, 2022
9e5700b
Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids-jni int…
gerashegalov Jul 8, 2022
05fae9e
Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids-jni int…
gerashegalov Jul 11, 2022
2670bd6
temporarily undo FAULT_INJECTOR cmake
gerashegalov Jul 12, 2022
8ed98a4
wip
gerashegalov Jul 12, 2022
0412349
wip
gerashegalov Jul 12, 2022
14b13f3
Some C++
gerashegalov Jul 12, 2022
cd9c372
boost logging refactor
gerashegalov Jul 13, 2022
ff9288d
init config work
gerashegalov Jul 13, 2022
774acd5
one time config almost, done
gerashegalov Jul 14, 2022
9c308d6
Readded symbolName check for launch events
gerashegalov Jul 14, 2022
4219b4a
demoable config
gerashegalov Jul 14, 2022
b78e7ae
dynamic reconfig and use rand for probability
gerashegalov Jul 14, 2022
80b7ccb
use two digits after comma for prob
gerashegalov Jul 14, 2022
3a45a26
add reload frequency to the config
gerashegalov Jul 14, 2022
6a7f504
smaller demo config
gerashegalov Jul 14, 2022
2be98db
update license header
gerashegalov Jul 14, 2022
3950552
Merge branch 'NVIDIA:branch-22.08' into faultInjectorPlayground
gerashegalov Jul 14, 2022
9b4c0ea
Merge branch 'branch-22.08' of github.com:NVIDIA/spark-rapids-jni int…
gerashegalov Jul 15, 2022
49b6511
wrong indentation
gerashegalov Jul 18, 2022
9d41531
Merge branch 'faultInjectorPlayground' of github.com:gerashegalov/spa…
gerashegalov Jul 18, 2022
80166d4
Merge remote-tracking branch 'origin/branch-22.08' into faultInjector…
gerashegalov Jul 18, 2022
62bbeea
wildcard
gerashegalov Jul 19, 2022
36d2678
conditional sync
gerashegalov Jul 19, 2022
c4f41ba
Merge remote-tracking branch 'origin/branch-22.08' into faultInjector…
gerashegalov Jul 20, 2022
6c6e2d6
spdlog
gerashegalov Jul 21, 2022
f693d42
Config inotify
gerashegalov Jul 22, 2022
4cfd7a8
use only strings
gerashegalov Jul 22, 2022
2415a8a
Add cuptiFinalize
gerashegalov Jul 22, 2022
11e0947
revert unnecessary changes
gerashegalov Jul 22, 2022
786fb07
Explicit attribute "dynamic" whether the tool
gerashegalov Jul 22, 2022
aa37cb0
Lib name independent of CUPTI
gerashegalov Jul 25, 2022
de43cbe
Doc and feedback
gerashegalov Jul 26, 2022
09dbba9
Update src/main/cpp/faultinj/README.md
gerashegalov Jul 27, 2022
ae69afd
Update src/main/cpp/faultinj/README.md
gerashegalov Jul 27, 2022
f34f53d
Update src/main/cpp/faultinj/README.md
gerashegalov Jul 27, 2022
318ec30
Update src/main/cpp/faultinj/README.md
gerashegalov Jul 27, 2022
fb38988
Update src/main/cpp/faultinj/README.md
gerashegalov Jul 27, 2022
fa34acf
html tables to reduce line length
gerashegalov Jul 27, 2022
7283f02
Doc update
gerashegalov Jul 27, 2022
ec56130
review feedback
gerashegalov Jul 27, 2022
c9a2cdc
clang-format
gerashegalov Jul 29, 2022
b462aef
clang
gerashegalov Jul 29, 2022
d574e96
Clang again
gerashegalov Jul 29, 2022
8af0ac3
Nghia's feedback
gerashegalov Jul 29, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions src/main/cpp/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -253,3 +253,5 @@ if(SPARK_RAPIDS_JNI_BUILD_BENCHMARKS)
rapids_cpm_nvbench()
add_subdirectory(benchmarks)
endif()

add_subdirectory(faultinj)
gerashegalov marked this conversation as resolved.
Show resolved Hide resolved
41 changes: 41 additions & 0 deletions src/main/cpp/faultinj/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
#=============================================================================
# Copyright (c) 2022, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#=============================================================================

project(
CUDA_FAULT_INJECTION
VERSION 22.08.00
LANGUAGES C CXX CUDA
)

set(CMAKE_CXX_STANDARD 17)

find_package(Boost REQUIRED)

add_library(
cufaultinj SHARED
faultinj.cu
)

find_path(SPDLOG_INCLUDE "spdlog"
HINTS "$ENV{RMM_ROOT}/_deps/spdlog-src/include")

include_directories(
"${SPDLOG_INCLUDE}"
)

target_link_libraries(
cufaultinj CUDA::cupti_static
)
170 changes: 170 additions & 0 deletions src/main/cpp/faultinj/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,170 @@
# Fault Injection Tool for CUDA Runtime and CUDA Driver API

The goal of this tool is to increase testability of the failure handling logic
gerashegalov marked this conversation as resolved.
Show resolved Hide resolved
in CUDA applications.

It is especially important when CUDA is embedded in a higher-level fault-tolerant
framework to ensure that CUDA failures are handled correctly in that
- Fatal errors leaving a GPU in unusable state are detected and such GPUs are
prevented from executing retries for a failed computation
- Non-fatal errors are retried without losing valuable compute resources
- Error handling logic does not cause deadlocks and other sort of unresponsiveness.

This tool allows creating test scenarios and triggering CUDA errors at will
with some degree of configurability to determine whether an individual CUDA process
or the higher-level framework such as Apache Spark remains usable when it should,
or gracefully exits with an actionable error message when the errors are irrecoverable.

## Deployment

The tool is designed with automated testing and interactive testing use cases in mind.
The tool is a dynamically linked library `libcufaultinj.so` that is loaded by
the CUDA process via CUDA Driver API `cuInit` if it is provided
via the `CUDA_INJECTION64_PATH` environment variable.

As an example it can be used to test RAPIDS Accelerator for Apache Spark.
Consult documentation to find how to set these variables correctly in the
context of the framework under test.

Examples for Apache Spark follow:

### Local Mode
Spark local mode is a single CUDA process. We can test is as any standalone
single-process application.

```bash
CUDA_INJECTION64_PATH=$PWD/target/cmake-build/faultinj/libcufaultinj.so \
FAULT_INJECTOR_CONFIG_PATH=src/test/cpp/faultinj/test_faultinj.json \
$SPARK_HOME/bin/pyspark \
--jars $SPARK_RAPIDS_REPO/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
--conf spark.plugins=com.nvidia.spark.SQLPlugin
```
### Distributed Mode
```bash
$SPARK_HOME/bin/spark-shell \
--jars $SPARK_RAPIDS_REPO/dist/target/rapids-4-spark_2.12-22.08.0-SNAPSHOT-cuda11.jar \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--files ./target/cmake-build/faultinj/libcufaultinj.so,./src/test/cpp/faultinj/test_faultinj.json \
--conf spark.executorEnv.CUDA_INJECTION64_PATH=./libcufaultinj.so \
--conf spark.executorEnv.FAULT_INJECTOR_CONFIG_PATH=test_faultinj.json \
--conf spark.rapids.memory.gpu.minAllocFraction=0 \
--conf spark.rapids.memory.gpu.allocFraction=0.2 \
--master spark://hostname:7077
```
When we configure the executor environment `spark.executorEnv.CUDA_INJECTION64_PATH`
we have to use a path separator in the value `./libcufaultinj.so` with the leading dot
to make sure that `dlopen` loads the library file submitted. Otherwise, it will assume a
locally installed library accessible to the dynamic linker via `LD_LIBRARY_PATH`
and similar mechanisms. See
[dlopen man page](https://man7.org/linux/man-pages/man3/dlopen.3.html)

## Fault injection configuration

Fault injection configuration is provided via the `FAULT_INJECTOR_CONFIG_PATH`
environment variable.
It is a set of rules to apply fault injection when CUDA Drvier or Runtime is matched by either
- Function name such as [`cudaLaunchKernel_ptsz`](https://docs.nvidia.com/nsight-systems/UserGuide/index.html#cuda-default-cli)
- Or callback id such as [`214`](https://gitlab.com/nvidia/headers/cuda-individual/cupti/-/blob/main/cupti_runtime_cbid.h#L224)
- Or wildcard `*` to match all API function names
with a given probability.

<table>
<tr>
<th>Top level configuration</th>
<th>Type</th>
<th>Description</th>
</tr>
<tr>
<td>logLevel</td>
<td>Number</td>
<td>Set numeric
<a href=https://github.com/gabime/spdlog/blob/d546201f127c306ec8a0082d57562a05a049af77/include/spdlog/common.h#L198-L204
>log level</a></td>
</tr>
<tr>
<td>dynamic</td>
<td>Boolean</td>
<td>Whether to re-apply config on config-file modification for interactive use</td>
</tr>
<tr>
<td>cudaDriverFaults</td>
<td>Object</td>
<td>Maps a string Driver function name,
<a href=https://gitlab.com/nvidia/headers/cuda-individual/cupti/-/blob/cuda-11.5.1/cupti_driver_cbid.h#L9
>CUPTI Driver callback id</a>, "*" to a fault injection config</td>
</tr>
<tr>
<td>cudaRuntimeFaults</td>
<td>Object</td><td>Maps a string Runtime function name,
<a href=https://gitlab.com/nvidia/headers/cuda-individual/cupti/-/blob/cuda-11.5.1/cupti_runtime_cbid.h#L9
>CUPTI Runtime callback id</a>, "*" to a fault injection config</td>
</tr>
</table>

<table>
<tr>
<th>Fault injection configuration</th>
<th>Type</th>
<th>Description</th>
</tr>
<tr>
<td>injectionType</td>
<td>Number</td>
<td>Numeric value of <code>FaultInjectionType</code>:
0 (PTX trap), 1 (device assert), 2 (replace CUDA return code by
configured <code>substituteReturnCode</code>, currently supported only
for Driver API).</td>
</tr>
<tr>
<td>interceptionCount</td>
<td>Number</td>
<td>How many consecutive matched callbacks should be sampled for fault
injection</td>
</tr>
<tr>
<td>percent</td>
<td>Number</td>
<td>Probability in percent whether a failure is injected</td>
</tr>
<tr>
<td>substituteReturnCode</td>
<td>Number</td>
<td>Error code to inject for the matched API invocation. For the Driver
API it is a numeric
<a href=https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html#group__CUDA__TYPES_1gc6c391505e117393cc2558fff6bfc2e9
>CUResult value</a>. For the Runtime API it is a numeric
<a href=https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038
>cudaError value</a>. Direct <code>cudaError</code> replacement is
not yet supported. <code>cudaError</code> can be indirectly affected
by an appropriate <code>CUresult</code> replacement.</td>
</tr>
</table>


Example config follows:
```json
{
"logLevel": 1,
"dynamic": true,
"cudaRuntimeFaults": {
"cudaLaunchKernel_ptsz": {
"percent": 0,
"injectionType": 0,
"interceptionCount": 1
},
"*": {
"percent": 1,
"injectionType": 0,
"interceptionCount": 1000
}
},
"cudaDriverFaults": {
"cuMemFreeAsync_ptsz": {
"percent": 0,
"injectionType": 2,
"substituteReturnCode": 999,
"interceptionCount": 1
}
}
}
```
Loading