In [None]:
%load_ext run_matx

# Notebook Usage

- This demo requires a custom environment to enable C++17 native compilation in the Jupyter notebook cells. C++ support in Jupyter is still very buggy, and the user must be careful not to overwrite the state of previously executed cells. 
  Cells in this notebook that are surrounded by opening and closing curly braces are safe to execute multiple times, and will not overwrite the state of previously executed cells. 

- If you receive a compiler error that you believe should not be happening, this may be Cling getting confused and needing a restart. A common error from this is a redefinition of a variable. When this happens try to restart the kernel and run 
  all cells up to the point of failure. You may also need to close and reopen the notebook to get a clean state.

  ![Restart Kernel](img/restart_kernel.png)

- Since CUDA support in Jupyter notebooks is extremely buggy and requires clang device compilation, these tutorials exclusively use CPU mode in MatX. To run code from this notebook on a GPU, change the `auto exec` line to 
  `auto exec = matx::cudaExecutor{};`.

# Notebook Usage

This notebook uses a custom magic command `%%run_matx` to run MatX code. This command wraps the code you write in a cell, adds the necessary includes and compiler flags, and then compiles and runs the code. All code in this notebook can be copied and pasted into your own MatX code without any additions beyond the environment setup. The magic `%%run_matx` must be at the beginning of the cell for the code to compile properly. 

Since MatX is a C++ template library, some of the code may take many seconds to compile. The type of CPU, CUDA version, and complexity of the example can affect compile times.

# MatX GTC Lab Notebook
## Tensor Creation and Memory Backing

Tensors are the base class of memory-backed storage in MatX. The Tensor class is highly flexible, with many options for memory types, residency, and ownership. By default, tensors are allocated using CUDA managed memory so that the data is available on both the host and device. This is great for quick prototyping and development, but for production code it's recommended to use device memory or pinned host memory for performance reasons. A set of `make_tensor` utility functions is provided to streamline and simplify tensor creation rather than declaring the tensor object directly.

`make_tensor` takes one template parameter indicating the type of the tensor, and zero or more function parameters. Without any parameters the tensor is considered a "null tensor" and has no shape or memory backing it. This is useful when declaring a tensor that will be given a shape and allocation later. The sizes of the tensor are specified in curly braces, or in the case of a 0-D tensor, an empty set of braces. For a complete guide on creating tensors in different ways, please visit: https://nvidia.github.io/MatX/basics/creation.html.

MatX uses several conventions that can be different from other libraries:
- Row-major memory layout
- 0-based or C-style indexing
- A rank 1 tensor is a different type entirely than a rank 2 tensor with one dimension of length 1
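
The first two conventions can be made concrete with a small plain C++ sketch that does not use MatX at all. The helper names `row_major_offset` and `at` are hypothetical and exist only for this illustration; they show where element (row, col) of a 0-based, row-major matrix lives in a flat buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a MatX function): linear offset of element
// (row, col) in a 0-based, row-major matrix that has `cols` columns.
std::size_t row_major_offset(std::size_t row, std::size_t col, std::size_t cols) {
    return row * cols + col;
}

// Reads element (row, col) from a flat row-major buffer.
int at(const std::vector<int>& buf, std::size_t row, std::size_t col, std::size_t cols) {
    assert(row_major_offset(row, col, cols) < buf.size());
    return buf[row_major_offset(row, col, cols)];
}
```

In a row-major layout, consecutive elements of a row are adjacent in memory, so walking the last index is the cache-friendly direction.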

In the following cell we demonstrate creating tensors of 0D (scalar), 1D, and 2D data. Tensors can be scaled to any arbitrary rank by adding more dimensions, and the rank is only limited by the available memory.

In [None]:
%%run_matx

// declare a 0D integer tensor (Scalar)
auto t0 = matx::make_tensor<int>({});

// declare a 1D integer tensor of length 4
auto t1 = matx::make_tensor<int>({4});

// declare a 2D fp32 tensor with shape 4x5 (4 rows and 5 columns)
auto t2 = matx::make_tensor<float>({4,5});

// declare tensor with user provided memory
int *myptr = new int[4*5];
auto t2_custom = matx::make_tensor<int>(myptr, t2.Shape());

// declare tensor with shape of tensor t2
auto t2_b = matx::make_tensor<int>(t2.Shape());

## Printing & Assigning
MatX provides several utilities for initializing and viewing its data inside tensors.

To set a series of values explicitly, the `SetVals` member function can be used by specifying values in an initializer list syntax. The initializer list uses nested braces to match the shape of the tensor and is supported up to 4D tensors. For higher ranks or very large tensors, MatX provides an IO API to read in data from a file. See the [IO section](https://nvidia.github.io/MatX/api/io/index.html) for more information. `operator()` is also available to get and set individual values of a tensor, but it is not recommended for large tensors since setting individual values is not memory efficient. When setting values with either `SetVals` or `operator()`, the memory backing the tensor must be modifiable from the host. For example, attempting to access device memory from these functions on a system without unified memory will result in undefined behavior.

`print` is a utility function to print a tensor or operator's contents to stdout. Printing can be used with any type of operator, including ones that have no memory backing them. With no extra arguments, `print` will print the entire contents of the operator. The amount printed can also be limited by passing a limit for each dimension. For example, `print(t2, 3, 2)` would print the first 3 rows and 2 columns of the 2D tensor `t2`. Unlike `SetVals` and `operator()`, `print` can be used on tensors with memory not accessible from the host. In this case a copy will be performed to the host before printing.

In [None]:
%%run_matx

// declare a 2D tensor with 4 rows and 5 columns
auto t2 = matx::make_tensor<int>({4,5});

// setVals in tensor
t2.SetVals({
          {1, 2, 3, 4, 5},
          {6, 7, 8, 9, 10},
          {11, 12, 13, 14, 15},
          {16, 17, 18, 19, 20}
          });

// print a tensor
matx::print(t2);

// print elements of tensor. Memory MUST be host-accessible for this to work.
std::cout << t2(0,0) << std::endl;

t2(0,0) = 42;
t2(3,2) = 117;

matx::print(t2);

std::cout << "My updated value for (3,2): " << t2(3,2) << std::endl;

### Tensor Creation Operators

MatX also has a number of pre-built creation routines. For the full list, see https://nvidia.github.io/MatX/api/creation/operators/index.html.

Shown below are [linspace](https://nvidia.github.io/MatX/api/creation/operators/linspace.html), [range](https://nvidia.github.io/MatX/api/creation/operators/range.html), and [ones](https://nvidia.github.io/MatX/api/creation/operators/ones.html).
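
Before running the MatX cell, it may help to see what `linspace` and `range` generate. The sketch below is plain C++, not the MatX operators: `linspace_ref` and `range_ref` are hypothetical stand-ins, and `linspace_ref` assumes both endpoints are included, as in NumPy's default.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in: `count` evenly spaced values from `first` to `last`,
// with both endpoints included (an assumption for this sketch).
std::vector<float> linspace_ref(std::size_t count, float first, float last) {
    std::vector<float> out(count);
    const float step = count > 1 ? (last - first) / static_cast<float>(count - 1) : 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = first + step * static_cast<float>(i);
    }
    return out;
}

// Hypothetical stand-in: `count` values starting at `first`, spaced by `step`.
std::vector<float> range_ref(std::size_t count, float first, float step) {
    std::vector<float> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = first + step * static_cast<float>(i);
    }
    return out;
}
```

The difference is in what the caller fixes: `linspace_ref` fixes the two endpoints and derives the step, while `range_ref` fixes the start and step and lets the last value fall where it may.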

In [None]:
%%run_matx

std::cout << "Linspace (steps=10, start=1, stop=10)" << std::endl;
matx::print(matx::linspace<0>({10}, 1.f, 10.f));
  
std::cout << std::endl << "Range (shape=10, first=0, step=0.1)" << std::endl;
matx::print(matx::range<0>({10}, 0.0f, 0.1f));
  
std::cout << std::endl << "Ones (2x3)" << std::endl;
matx::print(matx::ones<float>({2, 3}));

## Exercise 01_A: Creating your first tensor

Try defining a new integer tensor of size `{3, 5}` and initialize its values in increasing order from 0 to 14. Once defined, print your tensor to ensure the values are as expected. Next, update element (1,2) to 101. Print the tensor again to ensure the update was valid.

In [None]:
%%run_matx

// Declare a tensor

// SetVals in myTensor

// Print your new tensor

// Update the value at (1,2) to 101

[01_A Solution](solutions/01_A.ipynb)

## Operator Views
MatX provides a powerful set of functions that enable arbitrary views into existing tensors without incurring additional memory storage or processing cost to reorganize the data. These views provide "zero copy" accessors to a tensor that can be used in MatX logic as if it were a real memory-backed tensor.

MatX has feature parity with most operations expected in CuPy / MATLAB style environments. A full table translating a given operation to its MatX equivalent can be found in our full documentation [here](https://nvidia.github.io/MatX/basics/matlabpython.html).

### Permute
`permute` returns a view of the data with the dimensions swapped to match the order of the initializer list argument. In the example below we swap our two dimensions, equivalent to a matrix transpose. However, `permute` can be used on higher-order tensors with the dimensions swapped in any order.
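
One way to picture why a permuted view costs nothing is to model a view as a pointer plus shape and strides. The sketch below is plain C++ with a hypothetical `View2D` type and `transpose` helper; it illustrates the general technique and is not a description of MatX's internals.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical 2D view: a pointer plus shape and strides. It owns no data.
struct View2D {
    int* data;
    std::size_t shape[2];
    std::size_t stride[2];
    int& operator()(std::size_t i, std::size_t j) const {
        return data[i * stride[0] + j * stride[1]];
    }
};

// Swapping the shape and stride entries transposes the view without
// moving or copying a single element.
View2D transpose(View2D v) {
    std::swap(v.shape[0], v.shape[1]);
    std::swap(v.stride[0], v.stride[1]);
    return v;
}
```

Because the transposed view still points at the same buffer, a write through it is visible in the original.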

In [None]:
%%run_matx
// declare a 2D tensor with 4 rows and 5 columns
auto t2 = matx::make_tensor<int>({4,5});

// setVals in tensor
t2.SetVals({
            {1, 2, 3, 4, 5},
            {6, 7, 8, 9, 10},
            {11, 12, 13, 14, 15},
            {16, 17, 18, 19, 20}
          });  

// base tensor
matx::print(t2);

// Permute axes 0 and 1 of the tensor
auto t2p = matx::permute(t2, {1,0});

// print the permuted tensor to show the transposed data
matx::print(t2p);

### Slice
`slice` provides a view of a subset of data in a tensor, allowing that subset to be used and manipulated as a new operator. The `slice` utility function takes the input operator and two or three initializer lists that define the range of the input operator the slice will contain. The ranges are defined by a start index, an end (exclusive) index, and optional strides. The sentinel value `matxEnd` can be used to indicate the end of a dimension rather than specifying its length.
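
The half-open start/end convention described above can be sketched in plain C++ as a pointer offset plus a smaller shape. `View2D` and `slice_view` are hypothetical names for this illustration only; strides and the `matxEnd` sentinel are left out to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2D view: a pointer plus shape and strides. It owns no data.
struct View2D {
    int* data;
    std::size_t shape[2];
    std::size_t stride[2];
    int& operator()(std::size_t i, std::size_t j) const {
        return data[i * stride[0] + j * stride[1]];
    }
};

// Half-open slice covering rows [r0, r1) and columns [c0, c1).
// Only the base pointer and the shape change; the strides are inherited.
View2D slice_view(const View2D& v, std::size_t r0, std::size_t c0,
                  std::size_t r1, std::size_t c1) {
    assert(r0 <= r1 && r1 <= v.shape[0]);
    assert(c0 <= c1 && c1 <= v.shape[1]);
    return View2D{v.data + r0 * v.stride[0] + c0 * v.stride[1],
                  {r1 - r0, c1 - c0},
                  {v.stride[0], v.stride[1]}};
}
```

With a 4x5 input, `slice_view(v, 1, 1, 3, 5)` yields a 2x4 view whose element (0,0) is the input's element (1,1).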

In the example below, `t2s` corresponds to the elements [`1:3, 1:5`] of the larger `t2` tensor (end indices are exclusive), that is, rows 1 and 2 and columns 1 through 4.

![2D Slice](img/dli-slice.png)


In [None]:
%%run_matx

// declare a 2D tensor with 4 rows and 5 columns
auto t2 = matx::make_tensor<int>({4,5});

// setVals in tensor
t2.SetVals({
            {1, 2, 3, 4, 5},
            {6, 7, 8, 9, 10},
            {11, 12, 13, 14, 15},
            {16, 17, 18, 19, 20}
          });  

// slice example 1: same Rank
auto t2s = matx::slice(t2, {1,1}, {3, matx::matxEnd});

// print the sliced tensor to show the subset of data
matx::print(t2s);


To reduce the rank when slicing, `slice` can be used with a template parameter to define an operator of a lower rank (dimensionality) than the input tensor. This is useful for situations like selecting a single row or column of a matrix. In the second example, we demonstrate slicing column 1 from the `t2` tensor.

![Column Slice](img/dli-slice_col.png)

MatX also includes several helper defines to make tensor bound definitions easier. The sentinel value `matxDropDim` is used to indicate this dimension is the one being sliced (i.e. removed).

In [None]:
%%run_matx

// declare a 2D tensor with 4 rows and 5 columns
auto t2 = matx::make_tensor<int>({4,5});

// setVals in tensor  
t2.SetVals({
            {1, 2, 3, 4, 5},
            {6, 7, 8, 9, 10},
            {11, 12, 13, 14, 15},
            {16, 17, 18, 19, 20}
          });  

// slice example 2: Select all values of column 1
auto t1Col = matx::slice<1>(t2, {0, 1}, {matx::matxEnd, matx::matxDropDim});

// print the sliced tensor to show the subset of data
matx::print(t1Col);

### Clone
`clone` provides a utility function to expand a lower-rank operator to a higher rank by logically repeating the original data along the new dimensions. For example, a 1D tensor can be cloned to create a 2D or higher-rank tensor. Cloning does not copy or physically replicate the original data, but rather creates a new operator that references the same memory.
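
A common way to get this "repeat without copying" behavior in stride-based libraries is to give the new dimension a stride of 0, so every index along it lands on the same memory. The plain C++ sketch below uses hypothetical names (`View2D`, `clone_rows`); MatX's own implementation may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2D view: a pointer plus shape and strides. It owns no data.
struct View2D {
    int* data;
    std::size_t shape[2];
    std::size_t stride[2];
    int& operator()(std::size_t i, std::size_t j) const {
        return data[i * stride[0] + j * stride[1]];
    }
};

// Expands a 1D array of length n into a rows x n view. The leading
// dimension has stride 0, so every row aliases the same n elements.
View2D clone_rows(int* data, std::size_t n, std::size_t rows) {
    return View2D{data, {rows, n}, {0, 1}};
}
```

Since all rows alias the same elements, changing one source value changes it in every row of the cloned view.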

In the clone example below, we will take the t1Col from our previous operation, and clone it to build a 2D [5,4] tensor.

In [None]:
%%run_matx

// declare a 2D tensor with 4 rows and 5 columns
auto t2 = matx::make_tensor<int>({4,5});

// setVals in tensor  
t2.SetVals({
            {1, 2, 3, 4, 5},
            {6, 7, 8, 9, 10},
            {11, 12, 13, 14, 15},
            {16, 17, 18, 19, 20}
          });  

// slice example 2: reduce rank requires template parameter
auto t1Col = matx::slice<1>(t2, {0, 1}, {matx::matxEnd, matx::matxDropDim});

// clone the sliced 1D tensor to create a new 2D tensor
auto t2c_cols = matx::clone<2>(t1Col, {5, matx::matxKeepDim});

// print the cloned tensor to show the expanded data
matx::print(t2c_cols);


### View Data Backing
We established earlier that views are not new data, but accessors into the original operator. This is a powerful tool when operating on the core data, but it's also different from some other languages where the programmer has no control over whether the data is copied. This also means that any changes to the original tensor will be reflected in all views of that tensor.

In [None]:
%%run_matx 

// declare a 2D tensor with 4 rows and 5 columns
auto t2 = matx::make_tensor<int>({4,5});

// setVals in tensor  
t2.SetVals({
            {1, 2, 3, 4, 5},
            {6, 7, 8, 9, 10},
            {11, 12, 13, 14, 15},
            {16, 17, 18, 19, 20}
          });  

// slice example 2: reduce rank requires template parameter
auto t1Col = matx::slice<1>(t2, {0, 1}, {matx::matxEnd, matx::matxDropDim});
// clone the sliced 1D tensor to create a new 2D tensor
auto t2c_cols = matx::clone<2>(t1Col, {5, matx::matxKeepDim});


// modify the original tensor
t2(0,1) = 10;
// print our views to show the updated values
matx::print(t2);
matx::print(t1Col);
matx::print(t2c_cols);

// modify the tensor through a view
t1Col(1) = 203;
std::cout << "------------------- After 203 -------------------" << std::endl;

// print our views to show the updated values
matx::print(t2);
matx::print(t1Col);
matx::print(t2c_cols);

## Exercise: Operator Views
Let's demonstrate your new skills in creating views of a tensor. Using the pre-defined `baseTensor2D`, please create the following views:

- The complete first row of the `baseTensor2D`
- A 2D square of 4 elements, composed of the first 2 rows and 2 columns of data
- Modify the (1,1) element of `baseTensor2D` through the square view, assigning it the value 87.

Print the output at each stage to ensure your views are working as expected.


In [None]:
%%run_matx

// Make tensor
auto baseTensor2D = matx::make_tensor<int>({5,3});
baseTensor2D.SetVals({
  {1, 2, 3},
  {4, 5, 6},
  {7, 8, 9},
  {10, 11, 12},
  {13, 14, 15}
});


// Slice the first row of baseTensor

// Create a 2D square of 4 elements, composed of the first 2 rows and 2 columns of data

// Assign the value 87 to the (1,1) element of baseTensor2D

In [None]:
%%run_matx

// Make tensor
auto baseTensor2D = matx::make_tensor<int>({5,3});

baseTensor2D.SetVals({
                {1, 2, 3},
                {4, 5, 6},
                {7, 8, 9},
                {10, 11, 12},
                {13, 14, 15}
                });


// slice the first row of baseTensor
auto baseTensor_row0 = matx::slice<1>(baseTensor2D, {0,0}, {matx::matxDropDim, matx::matxEnd});
matx::print(baseTensor_row0);

// slice the 2D square made of the first 2 rows and first 2 columns (4 elements)
auto baseSquare = matx::slice(baseTensor2D, {0,0}, {2,2});
matx::print(baseSquare);

baseSquare(1,1) = 87;
matx::print(baseTensor2D);


## MatX Operators
All of the examples above show how to view the operator's data differently, but no manipulation was done to the data. To manipulate the data, we use operators to define what work is to be done, and then call `run` to execute the operation (more on this later).

Operators in MatX are an abstract type that follows the [operator interface](https://nvidia.github.io/MatX/basics/concepts.html#operator). The operator interface dictates a small number of methods that must be implemented to be used in MatX expressions. Everything from tensors to `operator+` is considered an operator in MatX. Every operator in MatX except for tensors is lazily evaluated, meaning that the operation is not performed until the statement is executed. For example, the expression `C = A + B` will not perform the addition or assignment until the `run` function is called.

Most operators come in unary types for operating on a single input or a binary type for operating on two inputs, but operators can be defined for an arbitrary number of inputs. Advanced users can also define their own operators. MatX supports most of the standard unary operators a user would expect from a library like NumPy. Broadcasting of operators is also supported, which allows for operations between tensors of different ranks.

Operator expressions follow the normal type promotion rules of C++ and any type errors or warnings will be reported at compile time.
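
The lazy evaluation described above can be illustrated with a minimal expression-template sketch in plain C++. All names here (`Vec`, `AddExpr`, `add`, `run_into`) are hypothetical and only mimic the idea; they are not MatX types. Building the expression stores references to its operands, and nothing is computed until `run_into` walks the elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical memory-backed 1D container.
struct Vec {
    std::vector<float> data;
    float operator()(std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }
};

// Hypothetical lazy expression: keeps references to its operands and
// computes a single element only when that element is requested.
template <typename L, typename R>
struct AddExpr {
    const L& lhs;
    const R& rhs;
    float operator()(std::size_t i) const { return lhs(i) + rhs(i); }
    std::size_t size() const { return lhs.size(); }
};

// Builds the expression; no arithmetic happens here.
template <typename L, typename R>
AddExpr<L, R> add(const L& l, const R& r) {
    return AddExpr<L, R>{l, r};
}

// Plays the role of `run`: only here is the expression evaluated and stored.
template <typename E>
void run_into(Vec& out, const E& expr) {
    out.data.resize(expr.size());
    for (std::size_t i = 0; i < expr.size(); ++i) {
        out.data[i] = expr(i);
    }
}
```

One consequence of this design is that a change made to an operand after the expression is built, but before it is run, shows up in the result.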

Below we'll demonstrate both scalar and matrix support for the basic binary arithmetic operators (`+`, `-`, `*`, `/`).

In [None]:
%%run_matx

auto A = matx::make_tensor<float>({2, 3});
auto B = matx::make_tensor<float>({2, 3});
auto C = matx::make_tensor<float>({2, 3});
auto D = matx::make_tensor<float>({2, 2});

A.SetVals({ {1.f, 2.f, 3.f},
            {4.f, 5.f, 6.f}
          });

(B = A).run(exec); // `run` will be discussed in more detail later

matx::print(A);
matx::print(B);
std::cout << " val: " << A(0,0) << std::endl;

// Addition
matx::print(A + 5.0f); // Broadcasting a scalar to a matrix
matx::print(A + B);    // Element-wise addition of two matrices

// Subtraction
matx::print(A - 5.0f);
matx::print(A - B);

// Multiplication
matx::print(A * 5.0f);
matx::print(A * B);

// Division
matx::print(A / 5.0f);
matx::print(A / B);


### Exercise: Operators
Please use the provided A and B tensors to complete the following set of operations:

- Multiply `A` by its scalar weight factor `aScale` to populate tensor `C`
- In-place subtract `bOffset` from the matrix `B`
- Add the `A` and `B` Tensors to populate tensor `D`

Keep in mind that rather than storing the result of an expression in a tensor, you may pass it to the `print` function directly.

In [None]:
%%run_matx

auto A = matx::make_tensor<float>({2, 3});
auto B = matx::make_tensor<float>({2, 3});
auto C = matx::make_tensor<float>({2, 3});
auto D = matx::make_tensor<float>({2, 3});

A.SetVals({ {1.f, 2.f, 3.f},
            {4.f, 5.f, 6.f}
          });

(B = A).run(exec);

int aScale = 5;
int bOffset = 2;

// scale A by aScale
// matx::print(/* New Operators here */)

// subtract bOffset from B
// matx::print(/* New Operators here */)

// add A and B Tensors
// matx::print(/* New Operators here */)


## Generator Operators
Generators are a type of operator that can generate values without another tensor or operator as input. For example, an identity generator produces ones on the diagonal and zeros elsewhere. Generators are efficient since they require no memory and typically reduce to either a constant or an equation in the emitted code.
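
As a rough illustration of "no memory, just an equation", the plain C++ sketch below defines a hypothetical `EyeGen` type whose value at (i, j) is computed from the indices alone. It is not the MatX `eye` operator, only a model of the idea.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical generator: stores only its shape. The value at (i, j) is
// computed from the indices, so there is no backing buffer.
struct EyeGen {
    std::size_t rows;
    std::size_t cols;
    float operator()(std::size_t i, std::size_t j) const {
        return i == j ? 1.0f : 0.0f;
    }
};

// Sums every generated element by visiting each index pair.
float sum_all(const EyeGen& g) {
    float total = 0.0f;
    for (std::size_t i = 0; i < g.rows; ++i) {
        for (std::size_t j = 0; j < g.cols; ++j) {
            total += g(i, j);
        }
    }
    return total;
}
```

For a 2x3 shape the diagonal has two entries, so the generated values sum to 2.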

In [None]:
%%run_matx

auto A = matx::make_tensor<float>({2, 3});
auto H = matx::make_tensor<float>({10});

// random
(A = 0).run(exec);
(A = matx::random<float>(A.Shape(), matx::NORMAL)).run(exec);
matx::print(A);

// eye
(A = matx::eye(A.Shape())).run(exec);
matx::print(A);

// hamming
(H = matx::hamming<0>(H.Shape())).run(exec);
matx::print(H);


## MatX Transform Operators
Transform operators take one or more inputs and call a backend library or kernel to manipulate the data. FFTs, GEMMs, and linear solvers are all types of transform operators. Compared to non-transform operators, transforms typically use some temporary memory and require synchronization that an element-wise operator does not. Transforms are allowed to be used in all of the same contexts that other operators are used in. For example, `C = A * matmul(A, B)` mixes both a transform operator (`matmul`) and an element-wise operator (`*`). If necessary, MatX will asynchronously allocate temporary memory for the output of the transform and free it after the operation is complete.

Transform operators operate over a fixed number of dimensions, and any higher dimensions are batched. For example, if a 4D tensor is passed into a GEMM, the left-most two dimensions are treated as batch dimensions.

### Matrix Multiplication (GEMM)
The `matmul` operator performs the matrix-matrix multiply of $$C = {\alpha}A * B + {\beta}C$$ where `A` is of dimensions `MxK`, `B` is `KxN`, and `C` is `MxN`. We first populate the `A` and `B` matrices with random values before the multiply, then the matrix multiply is performed. The `random` operator is used to populate the tensor with random values from a chosen distribution.
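
For reference, the formula above can be written as a naive triple loop. The sketch below is plain C++ on row-major buffers with a hypothetical name, `gemm_ref`; it is only a correctness reference and says nothing about how MatX or its backend libraries compute the product.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference GEMM on row-major buffers: C = alpha * A * B + beta * C,
// where A is MxK, B is KxN and C is MxN.
void gemm_ref(const std::vector<float>& A, const std::vector<float>& B,
              std::vector<float>& C, std::size_t M, std::size_t K, std::size_t N,
              float alpha = 1.0f, float beta = 0.0f) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[m * K + k] * B[k * N + n];
            }
            // With beta == 0 the previous contents of C are ignored entirely.
            const float prev = (beta == 0.0f) ? 0.0f : beta * C[m * N + n];
            C[m * N + n] = alpha * acc + prev;
        }
    }
}
```

With the default `alpha = 1` and `beta = 0` this reduces to a plain product `C = A * B`.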

In [None]:
%%run_matx
// matrix multiplication (may take longer to run)

auto A = matx::make_tensor<float>({4, 8});
auto B = matx::make_tensor<float>({8, 16});
auto C = matx::make_tensor<float>({4, 16});

(A = matx::random<float>(A.Shape(), matx::NORMAL)).run(exec);
(B = matx::random<float>(B.Shape(), matx::NORMAL)).run(exec);
matx::print(A);
matx::print(B);

(C = matx::matmul(A, B)).run(exec);
matx::print(C);


### Reductions
Reductions are a class of algorithms that reduce a number of inputs into one or more outputs. Examples of reductions include summing all elements of a tensor, finding the maximum or minimum value, or counting the number of non-zero elements. Reduction functions in MatX use similar names to their counterparts in NumPy and MATLAB, such as `sum`, `min`, `max`, `mean`, `any`, and `all`. By default a reduction operator will reduce over all elements of the tensor, but a list of axes can be specified to reduce over different dimensions.
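
To make the difference between a full reduction and a reduction over one axis concrete, here is a plain C++ sketch on a row-major buffer. The function names (`max_all`, `sum_all`, `max_over_rows`) are hypothetical and are not the MatX operators.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Full reduction: the maximum over every element.
float max_all(const std::vector<float>& a) {
    assert(!a.empty());
    return *std::max_element(a.begin(), a.end());
}

// Full reduction: the sum of every element.
float sum_all(const std::vector<float>& a) {
    return std::accumulate(a.begin(), a.end(), 0.0f);
}

// Reduction over axis 0 (the rows) of a row-major rows x cols matrix:
// the result holds one maximum per column.
std::vector<float> max_over_rows(const std::vector<float>& a,
                                 std::size_t rows, std::size_t cols) {
    assert(rows > 0 && a.size() == rows * cols);
    // Start from row 0, then fold in the remaining rows.
    std::vector<float> out(a.begin(), a.begin() + static_cast<std::ptrdiff_t>(cols));
    for (std::size_t r = 1; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            out[c] = std::max(out[c], a[r * cols + c]);
        }
    }
    return out;
}
```

Reducing over axis 0 removes the row dimension, which is why the output for a 2x3 input has 3 elements.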

Below is a simple example that calculates a full reduction (the max and the sum) of our `A` data, along with the max of each column.

In [None]:
%%run_matx

auto A = matx::make_tensor<float>({2, 3});
auto max_all = matx::make_tensor<float>({});
auto sum_all = matx::make_tensor<float>({});

// Since we're taking the max of each column, our output tensor is 1D with the same number of elements as the number of columns in A
auto max_col = matx::make_tensor<float>({3});  

(A = matx::random<float>(A.Shape(), matx::NORMAL)).run(exec);

// Max of data
(max_all = matx::max(A)).run(exec);
// Sum of data
(sum_all = matx::sum(A)).run(exec);
// Max of each column
(max_col = matx::max(A, {0})).run(exec);  

printf("A:\n");
matx::print(A);

printf("Max: %f\n", max_all());
printf("Sum: %f\n", sum_all());
printf("Max Col: \n");
matx::print(max_col);


### Additional Transforms
MatX supports a wide range of transforms, including both sparse and dense solvers, tensor contractions, and more. Please review the [MatX documentation](https://nvidia.github.io/MatX/api/index.html) for an exhaustive list of supported operations.

## Exercise: Transforms and Generators

For this example we will generate random data to verify the distribution of our generator functions. Please implement the following:

- Generate three floating point 3D tensors with sizes 2x4x8, 2x8x8, and 2x4x8
- Populate the first two tensors with random values from a uniform distribution
- Perform a batched matrix multiply of the first two tensors and store the output in the third tensor
- Print the third tensor
- Find the minimum values in each row of the third tensor and print the results

Ensure that the minimum values printed match what you would expect in the third tensor.

In [None]:
%%run_matx

// declare the three 3D tensors (2x4x8, 2x8x8, and 2x4x8)
// and a tensor to hold the minimum of each row of the result

// generate random data

// perform matmul and print

// find min values and print
