Error not injected when threads/block different to 1024 #7

sergicuen · 2021-06-16T11:38:52Z

Hi all!
similar to a previous issue I am having problems when injenct faults in a very simple kernel of matrix mult and the number of threads/block is different to 32x32 (1024). Any other value (e.g.: 16x16) produces some "Error not injected" results.

kernelName=matrixMulCUDA(float*,float*,float*,int,int,int)
kernelCount=0
groupID=7
bitFlipModel=0
instID=21059427
opIDSeed=0.401131
bitIDSeed=0.326217
inspecting: matrixMulCUDA(float*,float*,float*,int,int,int)
num_static_instrs: 282
maxregs: 32(32)
Injection data
index: 0
kernel_name: matrixMulCUDA(float*,float*,float*,int,int,int)
ctas: 64
instrs: 14057472
grp 0: 0 grp 1: 2097152 grp 2: 4194304 grp 3: 262144 grp 4: 1130496 grp 5: 6373376 grp 6: 12926976 grp 7: 12664832
mask: 0x0
beforeVal: 0x0;afterVal: 0x0
regNo: -1
opcode: NOP
pcOffset: 0x0
tid: -1
Error not injected

All the versions compile and pass the test (works correctly). I also tried to inject faults with DUMMY flag with the same results (some dummy injections work others don´t). In all the cases I ´ve rerun the profiler to be sure all is ok.
I´ve activated the VERBOSE_TOOLS flags but the info is difficult to interpret since some numbers haven´t got any identifier.
I´ve checked with Jetson nano and TX2 boards with the same result and using different matrix sizes.

The kernel is very simple:
global void matrixMulCUDA(float *C, float *A, float B, int ldA, int ldB, int ldC) {
int i = blockIdx.y * blockDim.y + threadIdx.y;
int j = blockIdx.x * blockDim.x + threadIdx.x;
float ptrA = &A[ildA]; // Pointer to the first element of row i of A
float tmp = 0.0f;
for (int k = 0; k < ldA; k++) {
tmp += (ptrA++) * B[kldB+j];
}
C[ildC+j] = tmp;
}

main() {
....
int block_size = 16;
dim3 dimsA(128, 128, 1); dim3 dimsB(128, 128, 1); dim3 dimsC(128, 128, 1);
dim3 threads(block_size, block_size);
dim3 grid(dimsC.x / threads.x, dimsC.y / threads.y);
....
matrixMulCUDA<<< grid, threads >>>(d_C, d_A, d_B, dimsA.x, dimsB.x, dimsC.x);
....
}

Please could you give me some hints for debugging the problem?

Thank you in advance.

dsartzet · 2021-07-18T16:28:02Z

Same issue here. Tried with hotspot and srad_v2 from Rodinia benchmark suite and I get plenty of "Error not injected" results.
The simple_add which is included in nvbitfi works without issues. Any ideas?
Thanks.

Update
It seems that the optimization in profiler/inject_func.cu is causing the issue (at least in my Tesla K20c gpu). By letting every thread to use the atomicAdd then the instruction group counters are computed correctly and no more "Error not injected" results. So as a workaround just replace the code of the count_instrs function with this

uint64_t *counters = (uint64_t*)pcounters;
atomicAdd((unsigned long long *)&counters[index], 1);
atomicAdd((unsigned long long *)&counters[NUM_ISA_INSTRUCTIONS+grp_index], 1);
atomicAdd((unsigned long long *)&counters[num_counters-2], grp_index != G_NODEST);
atomicAdd((unsigned long long *)&counters[num_counters-1], 1 - ((grp_index == G_NODEST) || (grp_index == G_PR)));

sergicuen · 2021-07-19T23:30:18Z

Thanks Dimitris,
the preliminary test shows that the workaround works fine with the matrixmult in Jetson nano (Maxwell gpu).
I leave the issue open until futher test are performed.

Update
workaround tested sucessfully with several benchmarks.

sergicuen closed this as completed Dec 3, 2021

sergicuen mentioned this issue Dec 3, 2021

NOP instructions in Matrix Multiplication #5

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error not injected when threads/block different to 1024 #7

Error not injected when threads/block different to 1024 #7

sergicuen commented Jun 16, 2021

dsartzet commented Jul 18, 2021 •

edited

Loading

sergicuen commented Jul 19, 2021 •

edited

Loading

Error not injected when threads/block different to 1024 #7

Error not injected when threads/block different to 1024 #7

Comments

sergicuen commented Jun 16, 2021

dsartzet commented Jul 18, 2021 • edited Loading

sergicuen commented Jul 19, 2021 • edited Loading

dsartzet commented Jul 18, 2021 •

edited

Loading

sergicuen commented Jul 19, 2021 •

edited

Loading