# Profiling in MatX
Improving performance is at the heart of MatX's value, so it must provide an easy-to-use yet powerful capability for benchmarking and analysing code, both during development and at deployment.

The NVIDIA software ecosystem provides a powerful suite of profiling tools through [Nsight Systems]() and [Nsight Compute]() that gives developers insight into the performance of their code and the utilization of their hardware. MatX leverages this ecosystem through the [NVTX toolkit](), which allows developers to annotate their code for use with the Nsight suite of tools.

## MatX Profiling tools
MatX provides an NVTX API to enable native compile-in profiling capabilities. The MatX NVTX API enables a user to
easily profile all MatX calls using built-in NVTX ranges, while also providing a convenient API for inserting
custom ranges in their own code. This API provides several convenience features:

- A convenient compile-in/compile-out macro-based API
- Verbosity levels allowing varying levels of profiling detail
- Built-in color rotation
- Automatic scope management and range naming 
- Overloaded API for manual range specification

MatX implements its NVTX API as a set of macros, which allows users to easily compile NVTX functionality into, or out of, their code. Compiling it out completely removes any runtime penalty that NVTX might otherwise cause in the most latency-sensitive deployments.
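
The compile-in/compile-out idea can be illustrated with a minimal, self-contained sketch. Everything here (`DEMO_PROFILING`, `DEMO_RANGE_START`, `DEMO_RANGE_END`, `demo_events`) is hypothetical and only mimics the pattern; it is not MatX's implementation and does not call NVTX.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical switch for this sketch. Leave it undefined to compile the
// ranges out; a build system would normally define it on the command line.
// #define DEMO_PROFILING

// Records range events so the effect of the macros can be observed.
inline std::vector<std::string>& demo_events() {
  static std::vector<std::string> events;
  return events;
}

#ifdef DEMO_PROFILING
  // Profiling compiled in: each macro records an event.
  #define DEMO_RANGE_START(name) demo_events().push_back(std::string("start ") + (name))
  #define DEMO_RANGE_END(name)   demo_events().push_back(std::string("end ") + (name))
#else
  // Profiling compiled out: each macro expands to an empty statement.
  #define DEMO_RANGE_START(name) ((void)0)
  #define DEMO_RANGE_END(name)   ((void)0)
#endif

// The instrumented code is written once and is identical in both builds.
inline int sum_to_ten() {
  DEMO_RANGE_START("sum_to_ten");
  int sum = 0;
  for (int i = 1; i <= 10; ++i) sum += i;
  DEMO_RANGE_END("sum_to_ten");
  return sum;
}

// With DEMO_PROFILING undefined, the result is unchanged and nothing is recorded.
inline void demo_compiled_out() {
  assert(sum_to_ten() == 55);
  assert(demo_events().empty());
}
```

Because the disabled macros expand to empty statements, the instrumented function carries no profiling code at all in that build, which is the property the macro-based design relies on.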

To enable the NVTX profiling API, set ``MATX_NVTX_FLAGS=ON`` in the cmake command.

### User Defined Ranges
User-defined NVTX ranges require the user to provide a name and a unique ID for each range. The name will appear in the NVTX range of your Nsight profiles, while the unique ID is only used internally to track your ranges during deletion. Because of this, the unique ID **must** be unique among any ranges that overlap; otherwise you may delete the incorrect range during tear-down.

Below is an example of a user-defined NVTX range:

In [None]:
using dtype = double;
index_t input_size = 10;
// index_t input_size = 10000000; // increase size to measure performance

MATX_NVTX_START_RANGE("Black-Scholes Memory Allocation", 0)
// declare input data
auto K = matx::make_tensor<dtype>({input_size});
auto S = matx::make_tensor<dtype>({input_size});
auto V = matx::make_tensor<dtype>({input_size});
auto r = matx::make_tensor<dtype>({input_size});
auto T = matx::make_tensor<dtype>({input_size});
auto output = matx::make_tensor<dtype>({input_size});  
auto referenceOutput = matx::make_tensor<dtype>({input_size});  
MATX_NVTX_END_RANGE(0)


MATX_NVTX_START_RANGE("Black-Scholes Op Creation", 1)
// create ops
auto VsqrtT = V * sqrt(T);
auto d1     = (log(S / K) + (r + 0.5 * V * V) * T) / VsqrtT ;
auto d2     = d1 - VsqrtT;
auto cdf_d1 = normcdf(d1);
auto cdf_d2 = normcdf(d2);
auto expRT  = exp(-1 * r * T); 
MATX_NVTX_END_RANGE(1)

MATX_NVTX_START_RANGE("Black-Scholes Execution", 2)
// execute ops
(output = S * cdf_d1 - K * expRT * cdf_d2).run(exec);
MATX_NVTX_END_RANGE(2)
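
To see why overlapping ranges need distinct IDs, consider the following minimal sketch of an ID-keyed registry. `RangeRegistry` is a hypothetical stand-in written for illustration; it assumes ranges are tracked in a map from ID to range, which may differ from how MatX tracks them internally.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical registry: maps a user-supplied ID to the name of an open range.
class RangeRegistry {
 public:
  // Open a range. Reusing an ID that is still open overwrites the earlier entry.
  void start(const std::string& name, int id) { open_[id] = name; }

  // Close the range stored under `id` and return its name ("" if none is open).
  std::string end(int id) {
    auto it = open_.find(id);
    if (it == open_.end()) return "";
    std::string name = it->second;
    open_.erase(it);
    return name;
  }

 private:
  std::map<int, std::string> open_;
};

// Overlapping ranges with distinct IDs close as intended.
inline void demo_distinct_ids() {
  RangeRegistry reg;
  reg.start("outer", 0);
  reg.start("inner", 1);
  assert(reg.end(1) == "inner");
  assert(reg.end(0) == "outer");
}

// Overlapping ranges that share an ID: the second start replaces the first,
// so the "outer" range can no longer be found at tear-down.
inline void demo_shared_id() {
  RangeRegistry reg;
  reg.start("outer", 0);
  reg.start("inner", 0);
  assert(reg.end(0) == "inner");
  assert(reg.end(0).empty());
}
```

In this sketch, IDs only need to be unique while ranges are open at the same time; an ID can be reused once its range has ended.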

### Automatic Ranges
Alternative versions of the timing macros are provided to automate handling of the MatX NVTX ranges. `MATX_NVTX_START_RANGE` has an overload that can be used without providing a unique ID. Instead, the macro returns an ID, which can be stored in an `int` variable and later passed to the end-range call. When NVTX ranges are compiled out, the macros simply return 0, and no action is taken on the end call.

Below is an example using the automatic enumeration feature:

In [None]:
int bc_range = MATX_NVTX_START_RANGE("Black-Scholes Execution");
(output = S * cdf_d1 - K * expRT * cdf_d2).run(exec);
MATX_NVTX_END_RANGE(bc_range);
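
The automatic enumeration can be pictured as a counter that hands out the next unused ID on each start call. The sketch below is a hypothetical illustration (`start_range_auto`, `end_range`, and `open_ranges` are made-up names); it is not thread-safe and is not MatX's implementation.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical table of currently open ranges, keyed by generated ID.
inline std::map<int, std::string>& open_ranges() {
  static std::map<int, std::string> ranges;
  return ranges;
}

// Open a range and return a freshly generated ID for it.
// IDs start at 1 in this sketch so that 0 can stand for "profiling compiled out".
inline int start_range_auto(const std::string& name) {
  static int next_id = 1;
  int id = next_id++;
  open_ranges()[id] = name;
  return id;
}

// Close the range with the given ID; an unknown ID is silently ignored.
inline void end_range(int id) { open_ranges().erase(id); }

// Two overlapping ranges receive different IDs without the caller choosing them.
inline void demo_auto_ids() {
  int load_id = start_range_auto("load");
  int compute_id = start_range_auto("compute");  // overlaps with "load"
  assert(load_id != compute_id);
  assert(open_ranges().size() == 2);
  end_range(compute_id);
  end_range(load_id);
  assert(open_ranges().empty());
}
```

Since the caller only stores and returns the ID, uniqueness among overlapping ranges is handled by the generator rather than by hand.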

### Scope Based Ranges
A final version of the API, `MATX_NVTX_START`, is provided that matches the lifetime of the NVTX range to the lifetime of the scope in which it is defined. It automatically enumerates a unique ID and does not need to be explicitly destroyed by the user.

Similarly, it inherits the name of the function it is called from, so no name needs to be supplied. This is especially useful for automating ranges for entire functions.

An example of this API is as follows:

In [None]:
void myFunction()
{
  MATX_NVTX_START("");
  
  (output = S * cdf_d1 - K * expRT * cdf_d2).run(exec);
}
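
Scope-tied ranges are commonly built with the RAII idiom: an object opens the range in its constructor and closes it in its destructor. The sketch below illustrates that idea under stated assumptions; `ScopedRange`, `DEMO_SCOPED_RANGE`, and `scope_events` are hypothetical names, and events are recorded in a vector rather than sent to NVTX.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Records events so the open/close order can be inspected.
inline std::vector<std::string>& scope_events() {
  static std::vector<std::string> events;
  return events;
}

// Hypothetical RAII guard: the range opens on construction and closes on destruction.
class ScopedRange {
 public:
  explicit ScopedRange(std::string name) : name_(std::move(name)) {
    scope_events().push_back("push " + name_);
  }
  ~ScopedRange() { scope_events().push_back("pop " + name_); }

  // A guard represents exactly one range, so copying is disabled.
  ScopedRange(const ScopedRange&) = delete;
  ScopedRange& operator=(const ScopedRange&) = delete;

 private:
  std::string name_;
};

// __func__ is evaluated where the macro is expanded, so it names the enclosing function.
#define DEMO_SCOPED_RANGE() ScopedRange demo_scoped_range_guard(__func__)

inline int my_function() {
  DEMO_SCOPED_RANGE();
  scope_events().push_back("work");
  return 42;
}  // the guard is destroyed here, which closes the range
```

The range closes on every exit path from the function, including early returns and exceptions, without any explicit end call.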


### Profile Level 
The MatX NVTX API supports logging levels, allowing you to fine-tune which NVTX ranges are captured at a given time. The logging level is checked at runtime, so it can be changed dynamically throughout program execution.
A utility macro, `MATX_NVTX_SET_LOG_LEVEL(LOG_LEVEL)`, is provided for setting the level.

All events default to the log level `MATX_NVTX_LOG_USER`, and the default verbosity is `MATX_NVTX_LOG_API`.


There are 5 increasing levels of verbosity:
```
MATX_NVTX_LOG_NONE
MATX_NVTX_LOG_USER
MATX_NVTX_LOG_API
MATX_NVTX_LOG_INTERNAL
MATX_NVTX_LOG_ALL
``` 

`MATX_NVTX_LOG_NONE` ensures that no ranges are recorded.
`MATX_NVTX_LOG_ALL` ensures that all NVTX ranges are recorded.

Any intermediate level ensures that ranges at that level and at all levels above it in the list are recorded. For example, if `MATX_NVTX_LOG_API`
is enabled, then all events of type `MATX_NVTX_LOG_USER` **AND** `MATX_NVTX_LOG_API` will be recorded.
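
The filtering rule can be sketched as a comparison between an event's level and the current verbosity. The enum and functions below are hypothetical mirrors of the five levels, written for illustration only; they assume the levels are ordered as listed and are not MatX's definitions.

```cpp
#include <cassert>

// Hypothetical mirror of the five levels, ordered from least to most verbose.
enum class LogLevel { None = 0, User = 1, Api = 2, Internal = 3, All = 4 };

// Runtime-adjustable verbosity; this sketch defaults to Api, as the text describes.
inline LogLevel& current_level() {
  static LogLevel level = LogLevel::Api;
  return level;
}

// Change the verbosity while the program is running.
inline void set_log_level(LogLevel level) { current_level() = level; }

// A range is recorded when its level is no more verbose than the current setting.
// None disables recording entirely.
inline bool should_record(LogLevel event_level) {
  return current_level() != LogLevel::None && event_level <= current_level();
}

inline void demo_levels() {
  set_log_level(LogLevel::Api);
  assert(should_record(LogLevel::User));       // less verbose than Api: kept
  assert(should_record(LogLevel::Api));        // same level: kept
  assert(!should_record(LogLevel::Internal));  // more verbose than Api: dropped

  set_log_level(LogLevel::None);
  assert(!should_record(LogLevel::User));      // nothing is recorded

  set_log_level(LogLevel::All);
  assert(should_record(LogLevel::Internal));   // everything is recorded
}
```

Because the check runs each time a range would start, changing the level mid-run affects only ranges started afterwards.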


## Profiling Radar Application
To demonstrate the power of NVTX ranges, we'll use the [Radar Pipeline example]() in the MatX example codes. This pipeline showcases both the powerful acceleration MatX provides and the granular insight into performance that we gain through the MatX NVTX API.
