# GPU-Accelerated Numerical Computing with MatX

## Tutorial List
1. [Introduction](01_introduction.ipynb)
2. Operators (this tutorial)
3. [Executors](03_executors.ipynb)
4. [Radar Pipeline Example](04_radar_pipeline.ipynb)

## Operators, Expressions, and Generators
In this tutorial, we introduce the concept of operators and operator expressions. It is assumed that the user has already been through the [01_introduction](01_introduction.ipynb) tutorial or is already familiar with MatX tensor types. 

Operators in MatX are an abstract type that defines an operation returning a value at a given index. This concept is intentionally broad, which makes it powerful for representing different concepts. As an example, both a tensor type and the addition operator `+` are MatX operators. A tensor returns the value in memory at that location, whereas the addition operator returns the sum of the values at that location from its left- and right-hand sides. Most operators are either unary, operating on a single input, or binary, operating on two inputs. For example, the expression `A + B` uses the binary `AddOp` operator to lazily add two tensors or other operators together. Operators can be chained into a longer expression, such as `A + B + conj(C)`, where `A`, `B`, and `C` are all tensors or operators, and `+` and `conj` are operators. Operator expressions are lazily evaluated, so none of the operations inside the expression occur until the `run` method is called. This allows for cleaner code, since smaller expressions can be stored in temporary variables and used as part of a larger expression.
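To make the idea of "an operation that returns a value at a given index" concrete, here is a minimal sketch of lazy evaluation with expression templates in plain C++. It is not MatX code: the names `VecOp`, `AddExpr`, `add`, and `run` are hypothetical, and a named function `add` stands in for an overloaded `+` to keep the sketch short.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical names throughout; this is an illustration, not MatX's implementation.

// A tensor-like operator: returns the value stored in memory at an index.
struct VecOp {
    const std::vector<float>* data;
    float operator()(std::size_t i) const { return (*data)[i]; }
    std::size_t size() const { return data->size(); }
};

// A binary operator: returns the sum of its two operands at an index.
// Constructing it performs no arithmetic.
template <typename L, typename R>
struct AddExpr {
    L lhs;
    R rhs;
    float operator()(std::size_t i) const { return lhs(i) + rhs(i); }
    std::size_t size() const { return lhs.size(); }
};

// Build an AddExpr from any two operators (tensors or other expressions).
template <typename L, typename R>
AddExpr<L, R> add(L lhs, R rhs) { return {lhs, rhs}; }

// Evaluate the whole expression in one loop, writing into existing storage.
template <typename Expr>
void run(const Expr& expr, std::vector<float>& out) {
    out.resize(expr.size());
    for (std::size_t i = 0; i < expr.size(); ++i) out[i] = expr(i);
}
```

Given three vectors `a`, `b`, and `c`, the statement `auto expr = add(add(VecOp{&a}, VecOp{&b}), VecOp{&c});` only builds a nested type. The additions happen when `run(expr, out)` is called, once per element and without temporaries for the sub-expressions.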

Operators are assigned to tensors using the overloaded operator `=` to indicate lazy assignment in the context of an operator expression. Calling `=` does not execute any work on the device. It creates a new data type that can be executed later using the executor method `run` (covered later). Other assignment operators are also available (`<<=`, `&=`, etc.), and assignments can be chained as with non-lazy operators (`A = B = C`). All overloaded operators follow the same precedence rules as their built-in C++ counterparts.

Because operators rely heavily on C++ templates behind the scenes, it is important to use the `auto` keyword whenever an operator is created and not executed immediately, since the type is very difficult to write out. Even a small change to a statement can have a dramatic effect on the template types, so specifying them by hand is not feasible.

A tensor is always needed for the output of an operator so that the operator can write the results into an existing memory location. The tensor rank and the size of each dimension must match the expected dimensions of the expression output or an assertion will be raised. Input operators may be of mixed rank and sizes provided that the particular operation allows for it. In certain cases, values are allowed to be *broadcast* during an operation. For example, when adding a 2D tensor to a 1D tensor, the 1D tensor is added to every row of the 2D tensor. The broadcasting rules are similar to MATLAB's, and a good summary can be found at: https://www.mathworks.com/help/matlab/matlab_prog/compatible-array-sizes-for-basic-operations.html.

The last topic in this exercise covers MatX generators. A MatX generator is an operator that dynamically generates data from a formula without storing the interim values. For example, the values of an identity matrix or a Hamming window can both be generated on the fly knowing only the index of the value. Generators typically take only a shape as input since their output is generated without input data.
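The generator idea can be sketched in plain C++ as a small object whose value depends only on the indices it is asked for. The types `EyeGen` and `ScaledGen` below are hypothetical and only illustrate the concept; they are not how MatX implements its generators.

```cpp
#include <cstddef>

// Hypothetical generator: an identity-matrix element is fully determined by its indices.
struct EyeGen {
    std::size_t rows;
    std::size_t cols;
    int operator()(std::size_t i, std::size_t j) const { return i == j ? 1 : 0; }
};

// Scaling a generator gives another index-to-value mapping; nothing is stored.
template <typename Gen>
struct ScaledGen {
    Gen gen;
    int factor;
    int operator()(std::size_t i, std::size_t j) const { return gen(i, j) * factor; }
};
```

For example, `ScaledGen<EyeGen> fives{EyeGen{8, 8}, 5};` describes an 8x8 matrix with fives on the diagonal, yet only the shape and the factor are held in memory.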

In [1]:
//todo this should be moved to a hidden init block that runs automatically when the notebook starts
#pragma cling add_library_path("/usr/local/cuda/lib64")
#pragma cling add_library_path("/opt/xeus/cling/lib")
//#pragma cling add_library_path("/usr/Lib/gcc/x86_64-Linux-gnu/11/")
#pragma cling add_library_path("/usr/lib/x86_64-linux-gnu/openblas64-openmp/")
#pragma cling add_include_path("/usr/local/cuda/include")
#pragma cling add_include_path("/usr/include/x86_64-linux-gnu/openblas64-openmp")
#pragma cling add_include_path("/opt/xeus/cling/tools/Jupyter/kernel/MatX/include")
#pragma cling add_include_path("/opt/xeus/cling/tools/Jupyter/kernel/MatX/build/_deps/cccl-src/libcudacxx/include")
//#pragma cling load("libgomp")
#pragma cling load("libopenblas64")
#pragma cling load("libcuda")
#pragma cling load("libcudart")
#pragma cling load("libcurand")
#pragma cling load("libcublas")
#pragma cling load("libcublasLt")

#include <cuda/std/__algorithm/max.h>
#include <cuda/std/__algorithm/min.h>

#define MATX_EN_OPENBLAS
#define MATX_EN_OPENBLAS_LAPACK
#define MATX_OPENBLAS_64BITINT

#include "matx.h"



## Initialization
As in the previous example, we need to declare tensors and initialize the data:

In [2]:
auto A = matx::make_tensor<float>({2, 3});
auto B = matx::make_tensor<float>({2, 3});
auto C = matx::make_tensor<float>({2, 3});
auto V = matx::make_tensor<float>({3});
auto E = matx::make_tensor<float>({8,8});
auto H = matx::make_tensor<float>({10});



After this code is executed, six tensors are created, and managed memory is allocated to account for the shape and type of each one. Next, the input tensors (`A` and `V`) are initialized with an increasing data pattern:


In [3]:
A.SetVals({ {1, 2, 3},
            {4, 5, 6}
          });
          
V.SetVals({7, 8, 9});

matx::print(A);
matx::print(V);

tensor_2_f32: Tensor{float} Rank: 2, Sizes:[2, 3], Strides:[3,1]
000000:  1.0000e+00  2.0000e+00  3.0000e+00 
000001:  4.0000e+00  5.0000e+00  6.0000e+00 
tensor_1_f32: Tensor{float} Rank: 1, Sizes:[3], Strides:[1]
000000:  7.0000e+00 
000001:  8.0000e+00 
000002:  9.0000e+00 


(void) @0x757f1fdfec30


## Element-wise Scalar Addition
For the first operator example, we add a scalar onto a tensor and assign the result to another tensor. This can be thought of as tensor addition where the second operand is a tensor of equal size with every element set to the scalar. To make the separation of operators from executors explicit, we first create the operator `op` by using MatX's lazy assignment operator `=`. The statement on the right-hand side can be read as "Add the number 5 to operator A, and assign the result to tensor B". Instantiating variable `op` generates a CUDA kernel that can then be executed with the `run()` method:


In [4]:
auto op = (B = A + 5);
op.run();
matx::print(B);

matx::print((B = A + 5)); ///\todo remove after run is fixed

tensor_2_f32: Tensor{float} Rank: 2, Sizes:[2, 3], Strides:[3,1]
000000:  0.0000e+00  0.0000e+00  0.0000e+00 
000001:  0.0000e+00  0.0000e+00  0.0000e+00 
Operator{float} Rank: 2, Sizes:[2, 3]
000000:  6.0000e+00  7.0000e+00  8.0000e+00 
000001:  9.0000e+00  1.0000e+01  1.1000e+01 


(void) @0x757f1fdfec30


The `run()` function takes an optional executor to determine what accelerator is used to perform the operation. When no argument is specified, the default executor is the CUDA default stream.

## Element-wise Tensor Addition
The next section adds two tensors together element-wise. Just like with a scalar, the `+` operator works on two tensors. Instead of creating a separate operator variable, this example shows how to create and execute an operator in a single line:


In [5]:
A.SetVals({ {1, 2, 3},
            {4, 5, 6}});

B.SetVals({ {7, 8, 9},
            {10, 11, 12}});

(C = A + B).run();

matx::print(C);

matx::print(C = A + B); ///\todo remove after run is fixed

tensor_2_f32: Tensor{float} Rank: 2, Sizes:[2, 3], Strides:[3,1]
000000:  0.0000e+00  0.0000e+00  0.0000e+00 
000001:  0.0000e+00  0.0000e+00  0.0000e+00 
Operator{float} Rank: 2, Sizes:[2, 3]
000000:  8.0000e+00  1.0000e+01  1.2000e+01 
000001:  1.4000e+01  1.6000e+01  1.8000e+01 


(void) @0x757f1fdfec30


## Element-wise Tensor Division
The division operator `/` can also be used on two tensors, or with any scalar type that is compatible with the tensor's data. In this example we reuse the `C` tensor from the last example and divide each element by 2:


In [6]:
C.SetVals({ {7, 8, 9},
            {10, 11, 12}});

(C = C / 2).run();  

matx::print(C);

matx::print(C = C / 2); ///\todo remove after run is fixed

tensor_2_f32: Tensor{float} Rank: 2, Sizes:[2, 3], Strides:[3,1]
000000:  7.0000e+00  8.0000e+00  9.0000e+00 
000001:  1.0000e+01  1.1000e+01  1.2000e+01 
Operator{float} Rank: 2, Sizes:[2, 3]
000000:  3.5000e+00  4.0000e+00  4.5000e+00 
000001:  5.0000e+00  5.5000e+00  6.0000e+00 


(void) @0x757f1fdfec30


With division, the usual C semantics apply: if the tensor type is an integral type, the result is truncated toward zero. If the type is floating point, floating-point division is performed. In this case we are using `float` types, so floating-point division will occur.
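The difference between the two kinds of division can be checked with a couple of one-line helpers in standard C++. The function names are hypothetical and no MatX types are involved. In standard C++ (since C++11), integer division truncates toward zero, so negative quotients move toward zero rather than toward negative infinity.

```cpp
// Integer division discards the fractional part (truncation toward zero).
inline int int_div(int a, int b) { return a / b; }

// Floating-point division keeps the fractional part.
inline float float_div(float a, float b) { return a / b; }
```

Here `int_div(7, 2)` yields `3` and `int_div(-7, 2)` yields `-3`, while `float_div(7.0f, 2.0f)` yields `3.5f`, matching the first element of the `float` result printed above.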

## Broadcasted Tensor Addition
Binary operators can be used on tensors of different ranks. In this section, we add a 1D tensor `V` onto a 2D tensor `C`. Unlike previous examples, the result is stored in the same tensor `C`, which is safe since the operation is element-wise and each thread runs independently of the others. When operating on tensors of different ranks, the dimensions of the lower-rank tensor must match the trailing (right-most) dimensions of the higher-rank tensor; here the size of `V` (3) matches the number of columns of `C`. The tensor with the lower rank is broadcast across the remaining dimensions when the operation executes.

In [7]:

C.SetVals({ {1, 2, 3},
            {4, 5, 6}});

V.SetVals({7, 8, 9});

(C = C + V).run();

matx::print(C);

matx::print(C = C + V); ///\todo remove after run is fixed

tensor_2_f32: Tensor{float} Rank: 2, Sizes:[2, 3], Strides:[3,1]
000000:  1.0000e+00  2.0000e+00  3.0000e+00 
000001:  4.0000e+00  5.0000e+00  6.0000e+00 
Operator{float} Rank: 2, Sizes:[2, 3]
000000:  8.0000e+00  1.0000e+01  1.2000e+01 
000001:  1.1000e+01  1.3000e+01  1.5000e+01 


(void) @0x757f1fdfec30



The result of the operation will be `V` repeatedly added to all rows of `C`.
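The row-wise broadcast can be written out by hand as a plain C++ loop. This is a sketch under the assumption of a row-major layout; `add_broadcast_rows` is a hypothetical helper, not a MatX function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Add a length-`cols` vector to every row of a row-major rows x cols matrix.
// Hypothetical helper for illustration only.
std::vector<float> add_broadcast_rows(const std::vector<float>& mat,
                                      std::size_t rows, std::size_t cols,
                                      const std::vector<float>& vec) {
    assert(mat.size() == rows * cols);  // matrix storage must match its shape
    assert(vec.size() == cols);         // vector must match the last dimension
    std::vector<float> out(mat.size());
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            // The row index is ignored when reading vec: that is the broadcast.
            out[r * cols + c] = mat[r * cols + c] + vec[c];
        }
    }
    return out;
}
```

With the matrix `{1, 2, 3, 4, 5, 6}` viewed as 2x3 and the vector `{7, 8, 9}`, the result is `{8, 10, 12, 11, 13, 15}`, which agrees with the operator output printed above.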

## Multiple Operators
Multiple operators can be combined in a single expression. The syntax is similar to using a high-level language like MATLAB where the order of operations is followed, and the final result is stored into the tensor on the left hand side of the lazy assignment operator `=`. Unlike most C++ libraries that use operator overloading for runtime expression parsing, MatX uses templates to parse the entire expression at compile-time. This removes all unnecessary interim loads and stores that would normally occur with the runtime approach. In this example, we combined 4 operators (three `+` and one `/`) in a single expression:

In [8]:
A.SetVals({ {1, 2, 3},
            {4, 5, 6}});

V.SetVals({7, 8, 9});

(C = (A + A + 1) / 2 + V).run();

matx::print(C);

matx::print((C = (A + A + 1) / 2 + V)); ///\todo remove after run is fixed

tensor_2_f32: Tensor{float} Rank: 2, Sizes:[2, 3], Strides:[3,1]
000000:  8.0000e+00  1.0000e+01  1.2000e+01 
000001:  1.1000e+01  1.3000e+01  1.5000e+01 
Operator{float} Rank: 2, Sizes:[2, 3]
000000:  8.5000e+00  1.0500e+01  1.2500e+01 
000001:  1.1500e+01  1.3500e+01  1.5500e+01 


(void) @0x757f1fdfec30
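To see what avoiding interim loads and stores means for the expression `(A + A + 1) / 2 + V`, the sketch below evaluates it in a single pass in plain C++. The function `fused_eval` is hypothetical and assumes a row-major layout; it only mimics the per-element work that a fused kernel would do.

```cpp
#include <cstddef>
#include <vector>

// Evaluate (A + A + 1) / 2 + V in one pass over a row-major rows x cols matrix.
// Hypothetical illustration: no temporary arrays for A + A or (A + A + 1) / 2.
std::vector<float> fused_eval(const std::vector<float>& a,
                              const std::vector<float>& v,
                              std::size_t rows, std::size_t cols) {
    std::vector<float> out(rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            const float x = a[r * cols + c];                   // single load of A
            out[r * cols + c] = (x + x + 1.0f) / 2.0f + v[c];  // single store of C
        }
    }
    return out;
}
```

An unfused approach would write `A + A` to a temporary, read it back to add 1, and so on. Here each input element is read once and each output element is written once.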


## Conditionals
Conditional operators are also available to take an action based on the value of an operator or tensor. These actions can be anything from changing the computation to choosing where to store the data. In this example, we set each element of `A` based on whether the corresponding value in `C` is greater than 3: elements where the condition holds are set to 1, and all others are set to 0. Note that `IFELSE` is an operator, and has the same `run()` method to execute the work as a standard expression.


In [9]:
A.SetVals({ {1, 2, 3},
            {4, 5, 6}});

C.SetVals({ {1, 2, 3},
            {4, 5, 6}});

matx::IFELSE(C > 3, A = 1, A = 0).run();
matx::print(A);


///\todo currently broken, doesn't work with print for some reason

tensor_2_f32: Tensor{float} Rank: 2, Sizes:[2, 3], Strides:[3,1]
000000:  1.0000e+00  2.0000e+00  3.0000e+00 
000001:  4.0000e+00  5.0000e+00  6.0000e+00 


(void) @0x757f1fdfec30
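The intended per-element behavior of `IFELSE(C > 3, A = 1, A = 0)` can be sketched as an ordinary loop in plain C++. The helper `if_else_threshold` is hypothetical and is not part of MatX.

```cpp
#include <cstddef>
#include <vector>

// For each element: write 1 where cond[i] > threshold, otherwise write 0.
// Hypothetical stand-in for the element-wise conditional described above.
void if_else_threshold(const std::vector<float>& cond, float threshold,
                       std::vector<float>& out) {
    out.resize(cond.size());
    for (std::size_t i = 0; i < cond.size(); ++i) {
        if (cond[i] > threshold) {
            out[i] = 1.0f;  // "then" branch, corresponds to A = 1
        } else {
            out[i] = 0.0f;  // "else" branch, corresponds to A = 0
        }
    }
}
```

For `C` holding `{1, 2, 3, 4, 5, 6}` and a threshold of 3, this sketch produces `{0, 0, 0, 1, 1, 1}`.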


## Random Operator
The `random` operator provides a way to generate random numbers using various distributions. Random values can be useful for many applications, including generating noise in signal processing or initializing data for testing. In this example we take an existing tensor (`A`) and populate it with random values from a normal distribution. Before setting the random values, we set all elements of `A` to zero to show that the values change after randomizing.


In [10]:
(A = 0).run();

(A = matx::random<float>({2, 3}, matx::NORMAL)).run();

matx::print(A);

//broken output?
matx::print(matx::random<float>({2, 3}, matx::NORMAL)); ///\todo remove after run is fixed broken anyways with no memory backing

tensor_2_f32: Tensor{float} Rank: 2, Sizes:[2, 3], Strides:[3,1]
000000:  1.0000e+00  2.0000e+00  3.0000e+00 
000001:  4.0000e+00  5.0000e+00  6.0000e+00 
Operator{float} Rank: 2, Sizes:[2, 3]
000000:  4.2150e-41  4.2150e-41  4.2150e-41 
000001:  4.2150e-41  4.2150e-41  4.2150e-41 


(void) @0x757f1fdfec30


In this example we store the current random values from the `random` operator into `A`. Instead of storing the random values in `A`, the `random` operator can be used directly in operator expressions, and each time it is used a different set of random values is generated.
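As a CPU-side analogy, the sketch below draws normally distributed samples with the C++ standard library. It does not use MatX or cuRAND. The helper name `normal_samples` is hypothetical, and the mean of 0 and standard deviation of 1 are assumptions, since the tutorial does not state the parameters used by `matx::NORMAL`.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draw n samples from a normal distribution (assumed mean 0, stddev 1).
// The engine is taken by reference so successive calls continue the sequence.
std::vector<float> normal_samples(std::mt19937& engine, std::size_t n) {
    std::normal_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> out(n);
    for (float& x : out) {
        x = dist(engine);
    }
    return out;
}
```

Calling `normal_samples` twice with the same engine object returns two different sets of values, which mirrors the behavior described above of getting fresh values on each use.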

## Generators
Next, we introduce the concept of a generator by creating the identity matrix, scaling the values by `5`, and storing the result in a tensor. MatX contains an `eye` operator for generating an identity matrix. Each time an element in the generator is accessed, `eye` simply returns `1` for elements on the diagonal and `0` otherwise. Said differently, if the indices in every dimension are equal, the value is `1`. Since the goal is a diagonal matrix of fives, we multiply the generator by the scalar `5`. Because `eye` is a generator, the multiplication and the identity matrix can be evaluated without storing any intermediate values. Since we're interested in seeing the results, we execute the operator and store the output in the tensor `E`:


In [11]:

(E = matx::eye({8, 8}) * 5).run();

matx::print(E); 

matx::print(matx::eye({8, 8}) * 5); ///\todo remove after run is fixed

tensor_2_f32: Tensor{float} Rank: 2, Sizes:[8, 8], Strides:[8,1]
000000:  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00 
000001:  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00 
000002:  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00 
000003:  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00 
000004:  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00 
000005:  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00 
000006:  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00 
000007:  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00  0.0000e+00 
Operator{int32_t} Rank: 2, Sizes:[8, 8]
000000:  5  0  0  0  0  0  0  0 
000001:  0  5  0  0  0

(void) @0x757f1fdfec30


While `eye` is a fairly simple generator for creating ones on the diagonal, more complex generators exist for performing operations like windowing, or creating a linearly-spaced range of values. Below we use the `hamming` function to generate a Hamming window using the formula: $$ 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right) $$ where `n` is the sample number and `N` is the total number of samples. Since an array of sizes is passed into the generator, these two variables are computed at runtime and the size of the shape is used as the size of the Hamming window. The template argument `<0>` on `hamming` selects the dimension along which the window is generated. Other window functions follow the same convention:


In [12]:
(H = matx::hamming<0>(H.Shape())).run();

matx::print(H);

matx::print(matx::hamming<0>(H.Shape())); ///\todo remove after run is fixed

tensor_1_f32: Tensor{float} Rank: 1, Sizes:[10], Strides:[1]
000000:  0.0000e+00 
000001:  0.0000e+00 
000002:  0.0000e+00 
000003:  0.0000e+00 
000004:  0.0000e+00 
000005:  0.0000e+00 
000006:  0.0000e+00 
000007:  0.0000e+00 
000008:  0.0000e+00 
000009:  0.0000e+00 
Operator{float} Rank: 1, Sizes:[10]
000000:  8.0000e-02 
000001:  1.8762e-01 
000002:  4.6012e-01 
000003:  7.7000e-01 
000004:  9.7226e-01 
000005:  9.7226e-01 
000006:  7.7000e-01 
000007:  4.6012e-01 
000008:  1.8762e-01 
000009:  8.0000e-02 


(void) @0x757f1fdfec30
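The operator values printed above are consistent with the symmetric Hamming definition $0.54 - 0.46\cos(2\pi n/(N-1))$. The following plain C++ sketch evaluates that formula on the CPU. The helper `hamming_window` is hypothetical, and returning a single 1 for `N == 1` is an assumption made to avoid dividing by zero.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Symmetric Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1)).
// Hypothetical CPU reference, not the MatX generator itself.
std::vector<float> hamming_window(std::size_t N) {
    if (N == 1) {
        return {1.0f};  // assumed convention for the degenerate case
    }
    const double pi = std::acos(-1.0);
    std::vector<float> w(N);
    for (std::size_t n = 0; n < N; ++n) {
        const double phase = 2.0 * pi * static_cast<double>(n) /
                             static_cast<double>(N - 1);
        w[n] = static_cast<float>(0.54 - 0.46 * std::cos(phase));
    }
    return w;
}
```

For `N = 10`, the first and last samples are 0.08 and the second is approximately 0.18762, in line with the printed window.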


This concludes the second tutorial on MatX. In this tutorial, you learned what operators, expressions, and generators are, and how to use them to create expressions that emit GPU kernels. In the next tutorial you will learn about executors.

[Start Next Tutorial](03_executors.ipynb)