# Ch. 3 Tools for Timing, Profiling, and Debugging

Having created both serial and parallel versions of the map app in
Chapter 2, "Function Evaluations and the Map Pattern," a natural next
step is to compare the execution times to quantify the "acceleration"
factor. In this chapter, we introduce some basic tools for timing, profiling,
and debugging.

## 3.1 Timing Comparisons

### 3.1.1 Simple Python Timing

Let's start simply with a standard python library function for obtaining "wall clock" times. The actual function is called `time` and it resides in a package also called `time`, so to avoid code like `time.time()`, we'll do the import as `from time import time`. The basic idea is to read the clock before and after the code block of interest and to obtain the runtime as the difference of the clock readings. The listing of _map_main_timed.py_ below shows a version of _map_main.py_ that has been modified to execute the serial version of `sArray` and then to execute the parallel version of `sArray` twice. A few lines of code are added to obtain and print the runtime for each as measured using `time`.

```
File: main_timed.py
01: import numpy as np
02: import matplotlib.pyplot as plt
03: from time import time #import timing function
04: from numba import cuda
05: N = 640000
06: 
07: def main():
08:     start_all = time() #start overall timer
09:     x = np.linspace(0, 1, N, endpoint=True)
10:     from serial import sArray
11:     start = time() #start timer for serial execution
12:     f = sArray(x)
13:     end = time() #stop timer for serial execution
14:     elapsed_serial = end - start #compute serial runtime
15:     print("--- Serial timing: %0.4f seconds ---" % elapsed_serial)
16: 
17:     from parallel import sArray #import parallel version of sArray
18:     
19:     for i in range(2):
20:         start = time() #start timer for parallel execution
21:         fpar = sArray(x)
22:         end = time() #stop timer for parallel execution
23:         elapsed = end - start #compute parallel runtime
24:         print("--- Parallel timing #%d: %3.4f seconds ---" % (i,elapsed))
25: 
26:     print("--- Loop acceleration estimate: %dx ---" % (elapsed_serial//elapsed))
27:     end_all = time() #end overall timer
28:     elapsed = end_all - start_all #evaluate overall runtime
29:     print("--- Total time: %3.4f seconds ---" % elapsed)
30:     print("--- Total acceleration estimate: %3.4fx ---" % ((3*elapsed_serial)/(elapsed)))
31: 
32: if __name__ == '__main__':
33:     main()

```
 $$ \text{Listing 3.1: } map\_timed.py$$

In the code above, 2 sets of timing variables are defined:

- `start_all` and `end_all` are used to record the times bounding execution of the entire code. On line 8, `start_all = time()` records the time as `main()` begins execution; on line 27, `end_all = time()` records the time when all the calls to `sArray` have completed. The overall runtime is computed by `elapsed = end_all - start_all` on line 28, and line 29 prints the result to the terminal.

- `start` and `end` are used to record the times bounding specific executions of `sArray`.

  - The statements `start = time()` on line 11 and `end = time()` on line 13 bracket the call to the serial implementation of `sArray`. Lines 14-15 compute the serial runtime (`elapsed_serial = end - start`) and print the result to the terminal.

  - The statements `start = time()` on line 20 and `end = time()` on line 22 bracket the calls to the parallel implementation of `sArray`. Lines 23-24 compute the parallel runtimes (`elapsed = end - start`) and print the results to the terminal.

Let's examine a couple of example outputs from running _map_main_timed.py_ for different array sizes.

```
--- Serial timing: 0.4842 seconds ---
--- Parallel timing #0: 1.4666 seconds ---
--- Parallel timing #1: 0.0060 seconds ---
--- Loop acceleration estimate: 80x ---
--- Total time: 1.9746 seconds ---
--- Total acceleration estimate: 0.7357x ---
```
The output above is for `N=640000`, and there are several salient features worthy of discussion. The timing estimate for the serial execution is about 0.5s; the timings for the 2 parallel evaluations are quite different: one longer than the serial timing and one much shorter. Why are they different? When this code was executed, the decorator preceding the kernel definition in _map_main.parallel.py_ was simply `@cuda.jit()`; i.e. without the optional signature specification. As a result, "lazy compilation" occurs and the kernel code is compiled "just-in-time" at the first call for execution. The timing for the first parallel evaluation is longer because it includes the time for kernel compilation.

> When measuring runtimes of kernel functions, always measure more than once and remember that the timing for the first execution may be extremely "pessimistic" due to inclusion of compilation time.
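A minimal sketch of that advice, using only the standard library: the hypothetical helper `time_calls` runs a function several times and keeps each wall-clock reading, so the first (possibly compilation-inflated) call can be set aside. The workload `work` is an assumed stand-in for illustration; in the map app it would be the call `sArray(x)`.

```python
from time import time

def time_calls(func, *args, repeats=3):
    """Return a list of wall-clock runtimes (seconds), one per call of func."""
    timings = []
    for _ in range(repeats):
        start = time()          # read clock before the call
        func(*args)
        end = time()            # read clock after the call
        timings.append(end - start)
    return timings

def work(n):
    """Hypothetical stand-in for sArray(x): sum of squares in pure Python."""
    return sum(k * k for k in range(n))

timings = time_calls(work, 100000, repeats=3)
first, later = timings[0], timings[1:]
print("first call: %0.4f s" % first)
print("best of later calls: %0.4f s" % min(later))
```

For a pure-python workload like this one the first call is not special; the separation only matters when the first call triggers just-in-time compilation.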

Given the timings for 2 parallel executions, we estimate that the 1.5s timing is almost entirely spent on compilation, and the actual time for kernel execution is about 6ms. The "acceleration factor", the ratio of execution time for serial `sArray` to execution time for parallel `sArray`, is estimated as $\frac{0.4842}{0.0060} \approx 80 \times$.

> To engineers, this is an imprecise use of "acceleration" ("speedup" might be better), but it is already an established way to refer to the runtime ratio (followed by '$\times$', sometimes read as 'times'). 

Finally, contrast the acceleration factor for `sArray` itself ($80 \times$) with that for the entire code, which is estimated as the ratio of the time for 3 serial executions to the time for the entire code (which executes `sArray` 3 times). In this case, the overall acceleration factor is about $0.74 \times$, so parallelization appears to make the computation take longer! However, as mentioned above, we need to take into account the fact that one of the parallel timings includes compilation time that, by itself, takes about as much time as the 3 serial executions.
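As a quick check of the arithmetic, the two factors can be recomputed from the printed timings. This is only a sketch: the numbers are copied from the `N=640000` output above, and because those printed values are rounded, the recomputed factors agree with the program output only to about three digits.

```python
# Timings copied from the N = 640000 output above (seconds)
elapsed_serial = 0.4842
elapsed_parallel = 0.0060   # second parallel call (no compilation)
elapsed_total = 1.9746

# Loop factor: serial runtime divided by (compiled) parallel runtime
loop_factor = elapsed_serial / elapsed_parallel

# Total factor: 3 serial executions compared with the whole program
total_factor = (3 * elapsed_serial) / elapsed_total

print("loop acceleration: %.1fx" % loop_factor)
print("total acceleration: %.4fx" % total_factor)
```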

So if we are careful to run the timing twice, we see that parallelization reduces runtime of `sArray` by $80 \times$ which appears quite worthwhile. But is there really anything special about our chosen value of `N`? Perhaps the better question is how the runtime scales with parameters like problem size and processor count. At this point, it would not be productive to get sidetracked doing an in-depth study, but let's look at one more case before moving on. In particular, let's increase the array size by an order of magnitude (`N = 6400000`) and modify the kernel decorator to include a signature. With those 2 changes, the following results are produced:

```
--- Serial timing: 4.9607 seconds ---
--- Parallel timing #0: 0.0343 seconds ---
--- Parallel timing #1: 0.0398 seconds ---
--- Loop acceleration estimate: 124x ---
--- Total time: 6.6667 seconds ---
--- Total acceleration estimate: 2.2323x ---
```

Now the serial version of `sArray` takes almost 5s which, not surprisingly (since there are 10 times as many entries to compute), is almost exactly 10 times as long as the previous serial timing. Moving on to the parallel timings, we see that (with a signature specification allowing "eager compilation" to occur before execution) there is no longer a major difference between the parallel runtimes, and the loop acceleration factor is still significant ($\approx 125 \times$).

Finally, let's consider the "total acceleration estimate". The factor is now greater than 1, but a big question remains:

__Why must the total acceleration factor be so much less than the loop acceleration factor?__

One might hope that as more processors are used, the runtime will continue decreasing and the acceleration factor will continue increasing. This may actually be the case, but there are limits that must be taken into account. In the current case, the complete code includes execution of the serial version of `sArray`, so there is a portion of the overall task that can be parallelized and there is a portion that is inherently serial. (Here, we insisted on an execution of serial `sArray` for timing comparison but, more typically, there is just some portion of the task that is not appropriate for or amenable to parallelization.) A typical terminology uses $p$ to denote the fraction of the overall task that is amenable to parallelization (so the remaining non-parallelizable fraction is $1-p$) and $s$ the achievable parallel acceleration factor. Given those quantities, __Amdahl's law__ gives the acceleration factor for the entire code, $S_{latency}$, as:

$$ S_{latency} = \frac{1}{1-p + (p/s)}$$

Even when approaching the case of unlimited resources (e.g. infinitely many cores) where $s \rightarrow \infty$ and $p/s \rightarrow 0$, Amdahl's law sets an upper bound on the acceleration or __latency reduction factor__:

$$ S_{latency} < \frac{1}{1-p}$$

To be concrete, we are considering an example involving 3 calls to execute `sArray`, one of which must be a serial execution. Thus the parallelizable fraction of the task is $2/3$ and the overall acceleration is limited by 

$$ S_{latency} < \frac{1}{1-(2/3)} = \frac{1}{(1/3)} = 3$$

In the second set of results (with `N = 6400000` and eager compilation), the overall acceleration is already $2.23$ and all the additional processors in the world can only provide further reduction of execution time by about $\frac{(1/2.23)-(1/3)}{(1/2.23)} \approx 0.26$ or $26\%$.
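The following sketch evaluates Amdahl's law numerically for this example. The function name `amdahl` and the sample values of $s$ are chosen here for illustration; $p = 2/3$ and the measured factor $2.23$ come from the discussion above.

```python
def amdahl(p, s):
    """Overall acceleration for parallelizable fraction p sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

p = 2.0 / 3.0                 # two of the three sArray calls run in parallel
bound = 1.0 / (1.0 - p)       # limit as s -> infinity

for s in (10, 124, 1000):
    print("s = %5d  ->  S_latency = %.3f" % (s, amdahl(p, s)))
print("upper bound: %.3f" % bound)

# Fraction of the current runtime that could still be removed
# when going from the measured 2.23x to the 3x bound
remaining = 1.0 - 2.23 / 3.0
print("remaining reduction: %.1f%%" % (100 * remaining))
```

However large $s$ becomes, the computed factor stays below the bound of 3.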

> Bottom line: Additional parallel resources are helpful, but there are some hard limits and expectations should be adjusted to avoid "unbridled enthusiasm".

Suppose you want to time the execution of a CUDA kernel. What happens if we apply `time()` in that situation? Let's try it and find out.

Here we return to running _map_main_ (with the plotting statements removed since they are not really of interest at the moment) and call a modified version of _map_parallel.py_ (let's call it _map_parallel_timed.py_) that includes `time()` calls wrapped around the kernel launch. A listing of _map_parallel_timed.py_ is shown below:

```
File: parallel_time.py
01: import math
02: import numpy as np
03: from numba import jit, cuda, float32
04: from time import time
05: 
06: PI = np.pi
07: TPB = 32
08: 
09: @cuda.jit(device = True)
10: def s(x0):
11: 	return (1.-2.*math.sin(PI*x0)**2)
12: 
13: @cuda.jit #Lazy compilation
14: #@cuda.jit('void(float32[:], float32[:])') #Eager compilation
15: def sKernel(d_f, d_x):
16: 	i = cuda.grid(1)
17: 	n = d_x.shape[0]	
18: 	if i < n:
19: 		d_f[i] = s(d_x[i]) #content of `for` loop in serial version
20: 
21: def sArray(x):
22: 	n = x.shape[0]
23: 	d_x = cuda.to_device(x)
24: 	d_f = cuda.device_array(n, dtype = np.float32) #need dtype spec for eager compilation
25: 	blockDims = TPB
26: 	gridDims = (n+TPB-1)//TPB
27: 
28: 	start = time()
29: 	sKernel[gridDims, blockDims](d_f, d_x)
30: 	end = time()
31: 	elapsed_time = end -  start
32: 	print("--- Kernel time(): %3.4f milliseconds ---" % (1000*elapsed_time))
33: 	return d_f.copy_to_host()
```

$$ \text{Listing 3.2 - } map\_parallel\_time.py$$

Here we see `time()` statements wrapped around the kernel call on line 29. Note that the units have shifted to milliseconds, so results from `time()` (which are in seconds) get multiplied by $1000$ in the print statement on line 32. Executing _map.py_ (with `sArray` imported from this modified file) produces the following results:

```
--- Kernel time(): 402.4272 milliseconds ---
--- Kernel time(): 0.0000 milliseconds ---
```

The fact that the second kernel appears to take no time at all is, at the very least, suspicious. If we change the decorator preceding the kernel to include a signature and enable ahead-of-time compilation, then the following result is obtained:

```
--- Kernel time(): 370.4424 milliseconds ---
--- Kernel time(): 0.0000 milliseconds ---
```

Now both timings are $0$ ms, and we can deduce that the first (non-zero) reported time interval is related to lazy compilation at runtime. The timing result above is clearly not a valid timing for kernel execution, and this brings up an important topic.

### 3.1.2 Synchronous vs. Asynchronous Execution

For almost all of us, our computing experience is firmly CPU-based, and we typically think of function calls executing sequentially: the first function is called, starts running, and completes execution; then the next function can start and, when it has completed, the next function can begin; etc.

Upon entering the world of parallel computing, we need to think about things differently. The order in which functions are called does not govern the order of execution. As mentioned previously, we give up control over the order of execution in return for the throughput enhancements offered by parallel execution. To take advantage of parallelism, whenever possible we allow processors to proceed with computations without waiting for results from other processors. In the sequential model of CPU-based serial computing, __synchronous execution__ (where we call a function and, after it has completed execution, the next function can begin execution) is the standard. In contrast, once we go parallel, that assumption gets flipped completely. In particular, when a kernel is called from the host (CPU) to run on the device (GPU), the moment the kernel is launched to start executing on the GPU the CPU moves on to its next task. This is the __asynchronous execution__ model. Waiting around for execution of the kernel would needlessly cause the CPU to be idle and, in pursuit of efficiency, the CPU continues with whatever computing task it can perform instead of sitting idle.
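The launch-then-move-on behavior can be illustrated without a GPU. The sketch below is only an analogy built on python's standard `threading` module (not CUDA): starting a worker thread plays the role of a kernel launch, and `join()` plays the role of synchronization. The function `slow_task` and its 0.2 s duration are assumptions made for the illustration, and `perf_counter` is used as a high-resolution stand-in for `time`.

```python
import threading
from time import perf_counter, sleep

def slow_task():
    """Hypothetical stand-in for a kernel: takes about 0.2 s to finish."""
    sleep(0.2)

worker = threading.Thread(target=slow_task)

start = perf_counter()
worker.start()                 # "launch": returns as soon as the work is started
launch_time = perf_counter() - start

worker.join()                  # "synchronize": wait until the work has finished
total_time = perf_counter() - start

print("time to launch : %.4f s" % launch_time)
print("time to finish : %.4f s" % total_time)
```

Reading the clock right after the launch captures only the time to start the work; the full duration is only visible after waiting for completion.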

The standard python timing tools are generally sufficient for timing non-trivial (taking longer than the ~1 ms resolution of `time()`) synchronous execution. It turns out that `cuda.to_device()` can be either synchronous or asynchronous (refer to the numba docs for details), but kernel execution is asynchronous. The result from `time()` only measures the time to _launch the kernel_ (to start the execution), not the time to _execute the kernel_ which is what we really want to measure. For that we need a different set of tools.
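The resolution of the host clocks can be queried directly. The short sketch below uses `time.get_clock_info` from the standard library; the reported values depend on the platform, so they may differ from the ~1 ms figure quoted above.

```python
import time

# Report the resolution python advertises for two host clocks
for name in ("time", "perf_counter"):
    info = time.get_clock_info(name)
    print("%-12s resolution: %.3e s" % (name, info.resolution))
```

Whatever the reported resolution, a finer host clock does not fix the asynchronous-launch problem; it only sharpens the measurement of the launch itself.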

### 3.1.3 CUDA Event Timing

To reliably time CUDA operations (which may have sub-millisecond duration and be asynchronous), CUDA provides a timing model based on __events__. The basic usage of events involves creating events, recording times, computing time intervals, and synchronization (to ensure full timing of asynchronous execution). Let's stick to our terminology of using variables `start` and `end` to store clock readings that bound the interval to be measured. To be concrete, consider a snippet using `time()` to determine the runtime for `sKernel`:

```
start = time()
sKernel[gridDims, blockDims](d_f, d_x)
end = time()
elapsed = end - start
```
Using CUDA events, this becomes:

```
start = cuda.event() #create start event
end = cuda.event()   #create end event
start.record()       #read clock before execution
sKernel[gridDims, blockDims](d_f, d_x) #call for kernel execution
end.record()         #request clock read after execution
end.synchronize()    #make sure execution is complete before reading clock
elapsed = cuda.event_elapsed_time(start, end) #compute interval duration
```

The listing below of _map_parallel_event.py_ includes both `time()` statements and event timing so the results can be compared:

```
File: parallel_event.py
01: import math
02: import numpy as np
03: from numba import jit, cuda, float32
04: from time import time
05: 
06: PI = np.pi
07: TPB = 32
08: 
09: @cuda.jit(device = True)
10: def s(x0):
11: 	return (1.-2.*math.sin(PI*x0)**2)
12: 
13: @cuda.jit #Lazy compilation
14: #@cuda.jit('void(float32[:], float32[:])') #Eager compilation
15: def sKernel(d_f, d_x):
16: 	i = cuda.grid(1)
17: 	n = d_x.shape[0]	
18: 	if i < n:
19: 		d_f[i] = s(d_x[i]) #content of `for` loop in serial version
20: 
21: def sArray(x):
22: 	n = x.shape[0]
23: 	d_x = cuda.to_device(x)
24: 	d_f = cuda.device_array(n, dtype = np.float32) #need dtype spec for eager compilation
25: 	blockDims = TPB
26: 	gridDims = (n+TPB-1)//TPB
27: 
28: 	e_start = cuda.event()
29: 	e_end = cuda.event()
30: 	start = time()
31: 	e_start.record()
32: 	sKernel[gridDims, blockDims](d_f, d_x)
33: 	end = time()
34: 	e_end.record()
35: 	e_end.synchronize()
36: 	event_time = cuda.event_elapsed_time(e_start, e_end)
37: 	elapsed_time = end -  start
38: 	print("--- Kernel time(): %3.4f milliseconds ---" % (1000*elapsed_time))
39: 	print("--- Kernel event: %3.4f milliseconds ---" % (event_time))
40: 	return d_f.copy_to_host()
```

$$ \text{Listing 3.2 - } map\_parallel\_event.py$$

Let's look at the output when we run the _map_ app with the previous print statements commented out and with `sArray` imported from _map/parallel_event.py_. For both kernel executions (with ahead-of-time compilation), the interval measured by `time()` is again $0$ ms, which is not valid. This is simply telling us that the _kernel launch_ takes less than the millisecond precision of `time()`. Again for both kernel executions, the timing measured by CUDA events is resolved to about $0.4$ ms.

```
--- Kernel time(): 0.0000 milliseconds ---
--- Kernel event: 0.4250 milliseconds ---
--- Kernel time(): 0.0000 milliseconds ---
--- Kernel event: 0.4246 milliseconds ---
```
Again, let's inspect the results when we increase the array size by an order of magnitude:
```
--- Kernel time(): 0.0000 milliseconds ---
--- Kernel event: 4.1481 milliseconds ---
--- Kernel time(): 0.0000 milliseconds ---
--- Kernel event: 4.1497 milliseconds ---
```

Again, `time()` fails to capture the kernel execution time while the event-based timings appropriately increase by about an order of magnitude. The moral of the story is that you can use python's `time()` for most executions on the host, but __be sure to use CUDA events for timing asynchronous operations such as kernel execution__.

> Note that omitting the `end.synchronize()` before the call of `cuda.event_elapsed_time(start, end)` produces the error `CudaAPIError: [600] Call to cuEventElapsedTime results in CUDA_ERROR_NOT_READY`. If you encounter this error, check that you are properly synchronized before computing the time interval.


## 3.2  Profiling

In addition to tools for basic timing measurement, CUDA provides other tools for obtaining information about code performance (including timings). At this point, we will avoid a major tangent into the details of profiling, but a quick mention of profiling tools is in order.

In previous versions, CUDA offered a list of tools including the following:

- `cuda-memcheck` to check for illegal memory access and memory leaks.
- NVIDIA Profiler, `nvprof`, offering command line access to a variety of performance data. (Unfortunately, this tool was not available for python under Windows.)
- NVIDIA Visual Profiler, `nvvp`, offering a graphical user interface (GUI) for accessing performance data and execution timelines.
- NVIDIA NSight which integrated the performance measurement tools into integrated development environments (IDEs) such as Visual Studio and Eclipse.

The bad news is that these tools are being deprecated (which means that there is still some availability, but they will be removed in an upcoming release of a new version of CUDA which typically happens at least once a year). The good news is that they are being replaced with a new set of tools including:

- NSight Systems, the new "first stop" profiling tool.
- NSight Compute which focuses on performance of compute kernels.

These tools are so new that there is not a lot of information out yet about putting them to work. However, they do include at least a temporary fix for accessing the capabilities of  `nvprof` from the command-line interface (CLI) even under Windows. The essential command is found in NSight Compute documentation under "NSight Compute CLI Quickstart" at:

https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#quick-start

There it tells you to open a terminal, navigate to the directory where your code (here, `main_timed.py`) resides, and run the following command:

```
nv-nsight-cu-cli -o profile python main_timed.py
```

This runs the NSight CUDA Command Line Interface (under the alias `nv-nsight-cu-cli`) on the code executed by calling `python main_timed.py` and saves the generated data to a file named `profile` (with file extension `.nsight-cuprof-report`). Sample output to the terminal looks like the following:

```
==PROF== Connected to process 21804 (C:\Users\storti\anaconda3\python.exe)
==PROF== Profiling "sKernel$241" - 1: 0%....50%....100% - 24 passes
==PROF== Profiling "sKernel$241" - 2: 0%....50%....100% - 24 passes
==PROF== Disconnected from process 21804
==PROF== Report: profile.nsight-cuprof-report
```

The output lines starting with `==PROF==` are generated by the profiler and indicate that:

- The profiler connects to the launched compute process.
- The kernel "sKernel$241" is identified for profiling.
- Performance data is collected over multiple execution passes.
- The profiler disconnects from the process.
- The performance data is written to the output file.
  
The performance data collected can be viewed by opening the report file using NSight Compute (which should be available in the NVIDIA folder created when you installed the CUDA Toolkit.)

![](nsight-compute.png)

$$ \text{Fig. 3.1 - Sample view of profile using NSight Compute.}$$

It would be an understatement to note that a _lot_ of information is presented here in a rather dense format. However, we can pick out a few items of interest. In 2 places (just to the right of "Current" next to the blue square near the upper left corner and at the top-right corner of the "Speed of Light" section), we can see that the average time for kernel execution is $423.36 \mu s$ which is quite close to the result of the event-based timing. Also in the "Speed of Light" section the blue bars show that we are achieving about 90% of the theoretical maximum computing throughput but only using about 30% of the theoretical maximum data transfer rate.

 > This kernel is said to be "compute-bound" because it bumps up against the limit of computing throughput. Kernels with lower compute rating and high data transfer "Speed of Light" ratings are called "memory-bound".

As we deal with more sophisticated computations, we will return for a more detailed look at relevant profiling data. In the meantime, you can also test for memory access errors by running `cuda-memcheck python main_timed.py`. If you make your grid size not an even multiple of the block size and delete the `if i < n:` bounds test in the kernel, then `cuda-memcheck` will produce quite a bit of output including the following indications of attempts to access out-of-bounds array entries and copy results back to an out-of-bounds index in `f`:

```
========= CUDA-MEMCHECK
.
.
.
========= Invalid __global__ read of size 4
=========     at 0x00000158 in cudapy::parallel_event::sKernel$241(Array<float, int=1, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
=========     by thread (31,0,0) in block (200,0,0)
=========     Address 0x70120c8f8 is out of bounds
.
.
.
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuMemcpyDtoH_v2.
.
.
.
========= ERROR SUMMARY: 2 errors
```

## 3.3 Debugging

Depending on your code development environment of choice, you may have access to a reasonable set of debugging tools (setting break points, inspecting variable and array values, identifying error sources, etc.) for regular python code, but full-featured debugging of SIMT parallel code can be tough to come by. 

> If you use Visual Studio or Eclipse, you can get an NSight plug-in that incorporates tools for debugging kernel code into your IDE. If you have access to such tools, you are encouraged to use them. Here we aim to provide a workable alternative for when fully integrated debugging of kernel code is not available.

Fortunately, numba provides a middle-ground approach based on __CUDA Simulation Mode__.
In simulation mode, the code runs entirely on the CPU but with the CPU simulating what the GPU will do when the code is run in parallel (give or take a few restrictions described in the numba docs). There is one overriding takeaway:

__Simulation mode gives you a way to fully develop and debug parallel code without needing access to a system with CUDA-enabled hardware.__ If you are working on a system without a suitable graphics card, you can still write and debug code so that it is ready to run when you get access to CUDA-enabled hardware elsewhere. 

___If you are dependent on cloud-based CUDA hardware, you should get in the habit of developing and debugging in simulation mode on your own system, and then run your code on the cloud-based server when it is known to be ready for parallel execution.___

The mechanics of entering simulation mode involve setting the value for an environment variable, so you will need to figure out how to do that on your system. Here is one example of how it is done. Using Visual Studio Code in Windows, executing any python code (or selecting "New Terminal" from the "Terminal" menu) opens a PowerShell terminal window. In PowerShell, environment variables are set using the following syntax:

`$env:VAR_NAME="VALUE"`

In this case, numba specifies the environment variable name to be `NUMBA_ENABLE_CUDASIM`, and the value is set to 1 for simulation mode and 0 otherwise; so the environment variable to enable simulation mode is set by running the following command in PowerShell:

`$env:NUMBA_ENABLE_CUDASIM="1"`

The PowerShell command for exiting simulation mode is:

`$env:NUMBA_ENABLE_CUDASIM="0"`

The PowerShell command for checking the value of the environment variable is:

`$env:NUMBA_ENABLE_CUDASIM`
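On systems without PowerShell, one alternative is to set the variable from within python itself. The sketch below uses only `os.environ` from the standard library; it assumes (following the numba simulator documentation) that the variable must be set before numba is first imported in the process, which is why the numba import is shown as a trailing comment.

```python
import os

# Assumption: this must run before the first `import numba` in the process,
# since numba reads the variable when it is imported.
os.environ["NUMBA_ENABLE_CUDASIM"] = "1"

# Check the value, analogous to `$env:NUMBA_ENABLE_CUDASIM` in PowerShell
print("NUMBA_ENABLE_CUDASIM =", os.environ.get("NUMBA_ENABLE_CUDASIM"))

# from numba import cuda   # import numba only after the variable is set
```

A value set this way applies only to the current python process, so nothing needs to be reset in the terminal afterward.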

> The simulator is "smart" enough to simulate much of the standard functionality of a generic CUDA-capable GPU, but it is not able to simulate the characteristics and limitations (such as available memory or compute capability) of a specific GPU. Tests of such properties will need to be run on a CUDA-enabled system.

Note that simulation mode involves a small number of cores simulating the work of a large number of cores. As a result, the ___simulation computation times may be considerably longer___ so be sure to develop code on small example problems. When you move to a CUDA-enabled system, test that the small problem still works as expected, then increase the problem size and take advantage of the power of large-scale parallelism.

Simulation mode involves an inherent tradeoff: a significant reduction in performance is involved but, in return, access is gained to debugging tools that can be applied even in parallel kernel code. The debugger is `pdb` (for "Python DeBugger"), and some basics are provided in the numba docs. For specifics of using `pdb` to debug kernel code, see the section "Debugging CUDA Python code" at https://numba.pydata.org/numba-doc/dev/cuda/simulator.html and the `pdb` docs at https://docs.python.org/2/library/pdb.html

Kernel code involves a computational grid typically with numerous blocks and threads and, under the SIMT model, all the threads in a warp are executing the same operations in lockstep. Both of these considerations suggest that it would be redundant and potentially overwhelming to try to debug every thread. A more productive approach is to pick a particular thread in a particular block, and inspect that single execution of the kernel code. 

Here is example code for implementing that plan. Suppose that we decide to focus on the thread with index `T = 2` in the block with index `B = 1`. We create the equivalent of a breakpoint at the line of interest (typically near the start of the kernel) by inserting the following code:

```
T,B = 2,1
if cuda.threadIdx.x == T and cuda.blockIdx.x == B:
    breakpoint()
```
Calling `breakpoint()` imports `pdb` and starts an execution trace. The `if` statement ensures that this only occurs during execution of the specified thread.

> You may also see an "old school" way to import `pdb`, set a breakpoint, and initiate an execution trace:
```
T,B = 2,1
if cuda.threadIdx.x == T and cuda.blockIdx.x == B:
    from pdb import set_trace; set_trace()
```
> Here the debugger is initiated by calling `set_trace()` which is imported from `pdb`.



When `breakpoint()` is called, the debugger starts to run as indicated by the terminal prompt changing to `(Pdb)`. With the debugger running, you can enter terminal commands (terminated by hitting the `<Enter>` key) to perform a variety of operations including:

- Inspect the value of a variable by entering its name.
- Inspect the value of an expression involving variable names.
- Step to the next line of execution with `s` or `step` or `n` or `next`. Step means go to the next line in any function, while next means go to the next line in the current function (so a function called on that line is executed in its entirety rather than line-by-line.)
- Continue to the next break point with `c`.
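- Jump to line #L with `j L` or `jump L`. This sets line L as the next line to be executed; when jumping forward, the intervening lines are skipped rather than executed.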
- Print the arguments of a function call with `a` or `args`.
  
Below is the listing for a code that computes the element-wise sum of 2 input arrays. The operation is parallelized by calling `vec_add_kernel` to which the snippet has been added to debug thread 2 in block 1.

```
File: vec_add_gdb.py
01: from numba import cuda
02: import numpy as np
03: #import pdb
04: 
05: N = 128
06: TPB = 32
07: BPG = (N+TPB-1)//TPB
08: T = 2 #thread index for debug
09: B = 1 #block index for debug
10: 
11: @cuda.jit(debug=True)
12: def vec_add_kernel(out, u, v):
13:     x = cuda.threadIdx.x
14:     bx = cuda.blockIdx.x
15:     bdx = cuda.blockDim.x 
16:     if x == T and bx == B:
17:         breakpoint()
18:     i = cuda.grid(1)
19:     j = bx * bdx + x
20:     out[i] = u[i] + v[i]
21:     diff = j-i
22: 
23: def vec_add(u,v):
24:     n = u.shape[0]
25:     d_u = cuda.to_device(u)
26:     d_v = cuda.to_device(v)
27:     d_out = cuda.device_array(N)
28:     vec_add_kernel[BPG,TPB](d_out, d_u, d_v)
29:     return d_out.copy_to_host()
30: 
31: u = np.ones(N)
32: v = np.ones(N)
33: C = vec_add(u,v)
34: #print(C)
```

$$\text{Listing 3.3 - } vec\_add\_gdb.py$$

Here is a sample output of a debugging session. `(Pdb)` is the prompt for input, and the remainder of those lines are debugging input commands. Lines starting with `#` have been inserted to describe the ensuing operation. Lines with no "prefix" are outputs from the debugger, and a summary of common `pdb` commands is given in Table 3.1 below.

```
> C:\path_info...\vec_add_gdb.py(22)vec_add_kernel()
#First line in terminal indicates location of breakpoint
#Second line indicates line awaiting execution at break
-> i = cuda.grid(1)
#print value of 'x'
(Pdb) x
2
#print value of 'bx'
(Pdb) bx
1
#print value of 'i'
(Pdb) i
*** NameError: name 'i' is not defined
#execute next command where 'i' gets assigned a value
(Pdb) n
-> j = bx * bdx + x
#print value of 'i'
(Pdb) i
34
#evaluate expression for value to be assigned to 'j'
(Pdb) bx * bdx + x
34
#evaluate next line where 'j' gets assigned a value
(Pdb) n
-> out[i] = u[i] + v[i]
#print value of 'j'
(Pdb) j
*** The 'jump' command requires a line number
#'j' is shorthand for 'jump' so use 'p' to 'print'
(Pdb) p j
34
#evaluate expression to be assigned to 'out[i]'
(Pdb) u[i] + v[i]
2.0
#print value of 'out[i]'
(Pdb) out[i]
nan
#not yet assigned so execute next line
(Pdb) n
-> diff = j-i
#print value of 'out[i]'
(Pdb) out[i]
2.0
#quit debugger
(Pdb) q
```
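The value 34 reported for `i` and `j` in the session can be reproduced without a GPU or the simulator. The following plain-python sketch is a serial stand-in for the computational grid (nested loops over blocks and threads, not how CUDA actually schedules work); it applies the same test that guards `breakpoint()` and records the global index of the selected thread. The list name `selected` is introduced here for illustration.

```python
N = 128
TPB = 32
BPG = (N + TPB - 1) // TPB
T, B = 2, 1                        # thread and block selected for inspection

selected = []
for bx in range(BPG):              # loop over blocks (serial stand-in for the grid)
    for x in range(TPB):           # loop over threads within a block
        i = bx * TPB + x           # same value cuda.grid(1) gives for a 1D grid
        if x == T and bx == B:     # same test that guards breakpoint()
            selected.append(i)

print("threads in grid:", BPG * TPB)
print("global index of selected thread:", selected)
```

Only one of the 128 threads satisfies the test, which is why the debugger stops exactly once per kernel launch.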

| Command | Key | Description |
|---------|-----|-------------|
| Next | `n` | Execute next line |
| Step | `s` | Step into a subroutine |
| Print `<v_>` | `p <v_>` | Print value of variable `<v_>` |
| `<v_>` | | Print value of variable `<v_>` |
| `<expr_>` | | Evaluate and print expression |
| Return | `r` | Run until the current subroutine returns |
| Continue | `c` | Stop debug; continue execution |
| Quit | `q` | Quit pdb |

$$ \text{Table 3.1 - } pdb \text{ commands} $$

While this use of `pdb` may not be quite as slick as a debugger completely integrated into an IDE with a full GUI, it does provide a tool for inspecting the details of SIMT code and ironing out issues that arise when writing kernels. Getting some practice using `pdb` is a good idea and can really pay off as you progress to writing more complicated kernels.