*NOTE: This notebook is configured for bash execution, not C++, so you won't be able to run the C++ code examples shown*

# Profiling in MatX
Improving performance is at the heart of MatX's value, so it must provide an easy-to-use yet powerful way to benchmark and analyze code, both during development and at deployment.

The NVIDIA software ecosystem provides a powerful suite of profiling tools through [Nsight Systems]() and [Nsight Compute]() that gives developers great insight into the performance of their code and the utilization of their hardware. MatX leverages this ecosystem through the [NVTX toolkit](), which allows developers to annotate their code for use with the Nsight suite of tools.

### Executor Timers
Before we get into the meat of the profiling section, let's talk about a profiling approach you used earlier. The simplest and most lightweight (but least powerful) way to profile a block of MatX code is to use the timer built into the executor.

These methods don't integrate with the Nsight toolset, but can be useful for quick and dirty analysis:

```c++
exec.start_timer();
(C = A * fft(B)).run(exec);
exec.stop_timer();
std::cout << "Execution time: " << exec.get_time_ms() << " ms" << std::endl;
```

## MatX Profiling Tools
MatX provides an NVTX API to enable native compile-in profiling capabilities. The MatX NVTX API enables a user to
easily profile all MatX calls using built-in NVTX ranges, while also providing a convenient API for the user to insert
custom ranges in their own code. This API provides many convenience features, such as:

- A convenient compile-in/compile-out macro-based API
- Verbosity levels allowing varying levels of profiling detail
- Built-in color rotation
- Automatic scope management and range naming
- Overloaded API for manual range specification

MatX implements its NVTX API as a set of macros, which allows users to easily compile NVTX functionality into, or out of, their code. Compiling it out completely removes any runtime penalty that NVTX might cause in the most latency-sensitive deployments.

To enable the NVTX profiling API, compile with ``MATX_NVTX_FLAGS=ON`` set in the cmake command.

### User Defined Ranges
User-defined NVTX ranges require the user to provide a name and a unique ID for each range. The name will appear in the NVTX range of your Nsight profiles, while the unique ID is only used internally to track your ranges during deletion. Because of this, the unique ID **must** be unique for any ranges that overlap; otherwise you may delete the incorrect range during tear-down.

Below is an example of a user-defined NVTX range:

```c++
using dtype = double;
index_t input_size = 10;
// index_t input_size = 10000000; // increase size to measure performance

MATX_NVTX_START_RANGE("Black-Scholes Memory Allocation", 0)
// declare input data
auto K = matx::make_tensor<dtype>({input_size});
auto S = matx::make_tensor<dtype>({input_size});
auto V = matx::make_tensor<dtype>({input_size});
auto r = matx::make_tensor<dtype>({input_size});
auto T = matx::make_tensor<dtype>({input_size});
auto output = matx::make_tensor<dtype>({input_size});  
auto referenceOutput = matx::make_tensor<dtype>({input_size});  
MATX_NVTX_END_RANGE(0)


MATX_NVTX_START_RANGE("Black-Scholes Op Creation", 1)
// create ops
auto VsqrtT = V * sqrt(T);
auto d1     = (log(S / K) + (r + 0.5 * V * V) * T) / VsqrtT ;
auto d2     = d1 - VsqrtT;
auto cdf_d1 = normcdf(d1);
auto cdf_d2 = normcdf(d2);
auto expRT  = exp(-1 * r * T); 
MATX_NVTX_END_RANGE(1)

MATX_NVTX_START_RANGE("Black-Scholes Execution", 2)
// execute ops
(output = S * cdf_d1 - K * expRT * cdf_d2).run(exec);
MATX_NVTX_END_RANGE(2)
```

### Automatic Ranges
Alternative versions of the timing macros are provided to automate handling of the MatX NVTX ranges. `MATX_NVTX_START_RANGE` has an overload that can be used without providing a unique ID. Instead, the macro returns an ID, which can be stored in an `int` variable and later passed to the end-range call. When NVTX ranges are compiled out, the macros simply return 0, and no action is taken on the end call.

Below is an example using the automatic enumeration feature:

```c++
int bc_range = MATX_NVTX_START_RANGE("Black-Scholes Execution");
(output = S * cdf_d1 - K * expRT * cdf_d2).run(exec);
MATX_NVTX_END_RANGE(bc_range);
```

### Scope Based Ranges
A final version of the API, `MATX_NVTX_START`, matches the life of the NVTX range to the life of the scope in which it is defined. It automatically enumerates a unique ID, and the range does not need to be explicitly destroyed by the user.

It also inherits the name of the function it is called from, so it does not require a name. This is especially useful for automating ranges for entire functions.

An example of this API is as follows:

```c++
void myFunction()
{
  MATX_NVTX_START("");
  
  (output = S * cdf_d1 - K * expRT * cdf_d2).run(exec);
}
```


### Profile Level 
The MatX NVTX API supports logging levels, allowing you to fine-tune which NVTX ranges are captured at a given time. The logging level is checked at runtime, so it can be changed dynamically throughout program execution using the utility macro `MATX_NVTX_SET_LOG_LEVEL(LOG_LEVEL)`.

All events default to the log level `MATX_NVTX_LOG_USER`, and the default verbosity is `MATX_NVTX_LOG_API`.


There are 5 increasing levels of verbosity:

```c++
MATX_NVTX_LOG_NONE
MATX_NVTX_LOG_USER
MATX_NVTX_LOG_API
MATX_NVTX_LOG_INTERNAL
MATX_NVTX_LOG_ALL
``` 

`MATX_NVTX_LOG_NONE` ensures that no ranges are recorded.
`MATX_NVTX_LOG_ALL` ensures that all NVTX ranges are recorded.

Any intermediate level ensures that events at that level and at all levels listed above it are recorded. For example, if `MATX_NVTX_LOG_API`
is enabled, then all events of type `MATX_NVTX_LOG_USER` **and** `MATX_NVTX_LOG_API` will be recorded.


## Application Profiling Examples
In this section we're going to use some pre-built applications to demonstrate how to generate an Nsight Systems profile and do some basic, high-level analysis using the Nsight Systems CLI.

To take advantage of the full Nsight Systems profiler, you must examine the profile report with the GUI, which isn't installed in this lab. We'll show screenshots of the output you'll see, but to interact with the reports yourself, head to https://developer.nvidia.com/nsight-systems to get started.

### Kernel Fusion Application
The first application we'll be profiling is one you've already seen before. In `samples/kernel_fusion.cu`, we've implemented a simple application that demonstrates the same concepts learned in the earlier section about operator fusion.

Specifically, we implement 2 ranges, looping over each 10 times to get an accurate timing analysis:

```c++
// first individual, independent kernels
int unfused_range = MATX_NVTX_START_RANGE("Unfused Kernels");
(result = cos(C)).run(exec);
(result = result / D).run(exec);
(result = result * B).run(exec);
MATX_NVTX_END_RANGE(unfused_range);

// now, as a fused operation
int fused_range = MATX_NVTX_START_RANGE("Fused Operation");
(A = B * cos(C)/D).run(exec);
MATX_NVTX_END_RANGE(fused_range);
```

Run the cell below to generate an Nsight Systems profile report on the application, which is saved as `samples/kernel_fusion_report.nsys-rep`:

In [None]:
nsys profile -o ./samples/kernel_fusion_report.nsys-rep ./samples/kernel_fusion

This `.nsys-rep` file is what is used by Nsight Systems for profiling and is what you would load into the Nsight Systems GUI. To see some high-level statistics, let's use the CLI in the cell below:

In [None]:
nsys stats ./samples/kernel_fusion_report.nsys-rep

In the top section, `** NVTX Range Summary`, take note of the `MatX:Unfused Kernels` and `MatX:Fused Operation` ranges on the right. You should see something like:
```
 ** NVTX Range Summary (nvtxsum):

Time (%)  Total Time (ns)  Instances   Avg (ns)    Med (ns)  Min (ns)   Max (ns)   StdDev (ns)   Style                                            Range                                        
--------  ---------------  ---------  -----------  --------  --------  ----------  -----------  --------  -------------------------------------------------------------------------------------
...
     1.0          338,264         10     33,826.4  31,930.0    30,941      44,218      4,231.5  StartEnd  MatX:Unfused Kernels
     0.4          145,858         10     14,585.8  13,614.0    13,019      21,775      2,730.8  StartEnd  MatX:Fused Operation
...
```

Now let's look at the output you'd see in the Nsight Systems GUI. This plot shows a high-level view of all 10 iterations we just ran:

![Fusion High Level](img/kernel-fusion-highlevel.png)

We can zoom down to a single iteration to compare the two ranges:

![Fusion Low Level](img/kernel-fusion-lowlevel.png)

### Simple Radar Application
To demonstrate the power of the NVTX ranges, we'll use a more complex example: the [Simple Radar Pipeline](https://github.com/NVIDIA/MatX/blob/main/examples/simple_radar_pipeline.cu) that comes with the MatX example codes. This pipeline showcases both the powerful acceleration MatX provides and the granular insight into performance that the MatX NVTX API gives us.

You can view the file here at `./samples/simple_radar_pipeline.cu` and `./samples/simple_radar_pipeline.h`.

The pipeline is made up of 4 stages:
1. Pulse Compression - An FFT, a dot matrix multiply, and an inverse FFT
2. Three Pulse Canceller - A 1D convolution
3. Doppler Processing - A dot matrix multiply with a Hamming window and an FFT
4. CFAR Detection - An element-wise magnitude-squared, a 2D convolution, a dot matrix divide

Which operations do you think will take the longest? The fastest?

Run the cell below to generate a profile report.

In [None]:
nsys profile -o ./samples/simple_radar_pipeline_report.nsys-rep ./samples/simple_radar_pipeline

See below for a high-level view of the profile output (this one is a lot more complicated!):

![Radar High-Level](img/radar-highlevel.png)

Here's a look, zoomed into a single pass-through of all 4 stages of the pipeline:

![Radar Pipeline](img/radar-pipeline.png)

And zoomed in even further to just look at the Pulse Compression stage:

![Radar Pulse Compression](img/radar-pulsecompression.png)

Finally, run the cell below to see the CLI output for some high-level statistics:

In [None]:
nsys stats ./samples/simple_radar_pipeline_report.nsys-rep