In [None]:
// WARNING: DO NOT MODIFY, Requirements for C++ in notebook
#pragma cling add_library_path("/usr/local/cuda/lib64")
#pragma cling add_library_path("/opt/xeus/cling/lib")
//#pragma cling add_library_path("/usr/Lib/gcc/x86_64-Linux-gnu/11/")
#pragma cling add_library_path("/usr/lib/x86_64-linux-gnu/openblas64-openmp/")
#pragma cling add_include_path("/usr/local/cuda/include")
#pragma cling add_include_path("/usr/include/x86_64-linux-gnu/openblas64-openmp")
#pragma cling add_include_path("/opt/xeus/cling/tools/Jupyter/kernel/MatX/include")
#pragma cling add_include_path("/opt/xeus/cling/tools/Jupyter/kernel/MatX/build/_deps/cccl-src/libcudacxx/include")
//#pragma cling load("libgomp")
#pragma cling load("libopenblas64")
#pragma cling load("libcuda")
#pragma cling load("libcudart")
#pragma cling load("libcurand")
#pragma cling load("libcublas")
#pragma cling load("libcublasLt")

#include <cuda/std/__algorithm/max.h>
#include <cuda/std/__algorithm/min.h>

#define MATX_EN_OPENBLAS
#define MATX_EN_OPENBLAS_LAPACK
#define MATX_OPENBLAS_64BITINT

#include "matx.h"

In [None]:
// WARNING: DO NOT execute this cell twice! If you do, restart the kernel and run all cells ONCE from the beginning.
auto exec = matx::SingleThreadedHostExecutor{};
//auto exec = matx::cudaExecutor{};
// WARNING: DO NOT execute this cell twice! If you do, restart the kernel and run all cells ONCE from the beginning.

# MatX Introduction


# MatX Integrated Demo Notebook
## Notes on Operation
This demo requires a very specific environment to enable native C++17 compilation in the Jupyter notebook cells, and comes with many caveats versus normal Jupyter notebooks. State is preserved across cells: whatever a previously executed cell defined persists for the rest of the kernel session unless a later cell directly overwrites it.


### Container Startup
Start the container with all normal options, adding `-p 8888:8888`.

A sample `run.sh` script is provided in `MatX/docs_input/notebooks`.

### Start the Jupyter Server Locally in the Container
`jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root`

Copy the token printed when the server starts. The local URL containing the token should look similar to:
`http://127.0.0.1:8888/tree?token=a3ad60a152dcafe98d4eaecc22bd773b38f1e6e93312adae`

## Tensor Creation and Memory Backing

Tensors are the base class of memory-backed data storage in MatX. The tensor class is highly flexible, with many options for memory backing, residency, and ownership, but its defaults make it easy to use out of the box. A set of `make_tensor` utility functions helps streamline and simplify tensor creation; this is the suggested usage pattern for beginners and experts alike.

`make_tensor` takes one template parameter indicating the type of the tensor, and zero or more function parameters. At a minimum, the sizes of the tensor are specified in curly braces, or in the case of a 0-D tensor, no size list is specified. For a complete guide on creating tensors in different ways, please visit: https://nvidia.github.io/MatX/basics/creation.html.

**NOTE** Unlike MATLAB, MatX follows the C-style for indexing, meaning we assume row-major formats rather than column-major, and 0-based indexing rather than 1-based. 
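
To make the row-major convention concrete, here is a minimal sketch in plain C++ (no MatX involved). The helper name `row_major_offset` is our own and is not part of the MatX API; it only shows how a 2D index maps to a flat memory offset when rows are stored contiguously.

```cpp
#include <cassert>
#include <cstddef>

// Flat offset of element (row, col) in a row-major matrix with `cols` columns.
// Hypothetical helper for illustration only; not part of MatX.
std::size_t row_major_offset(std::size_t row, std::size_t col, std::size_t cols) {
  assert(col < cols);       // the column index must be in range
  return row * cols + col;  // whole rows come first, then the column within the row
}
```

For a tensor with 4 rows and 5 columns, element `(1, 2)` sits at offset `1 * 5 + 2 = 7`, and stepping along a row moves through adjacent memory.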

In the following cell we demonstrate creating tensors of 0D (scalar), 1D, and 2D data. Tensors can be scaled to any arbitrary dimension.

In [None]:
{
  // declare a 0D tensor (Scalar)
  auto t0 = matx::make_tensor<int>({});

  // declare a 1D tensor of length 4
  auto t1 = matx::make_tensor<int>({4});

  // declare a 2D tensor with 4 rows and 5 columns
  auto t2 = matx::make_tensor<int>({4,5});

  // declare tensor with user provided memory (maybe?)

  // declare a tensor with the same shape as tensor t2
  auto t2_b = matx::make_tensor<int>(t2.Shape());
}

## Printing & Assigning
MatX also provides several utilities for initializing and viewing its data.

Values can be initialized by passing a nested initializer list to the `SetVals` member function. The example below uses a single level of nesting to match a 2D tensor shape, but this can be extended up to 4D tensors. `operator()` is also available to set and get individual values of a tensor as an alternative.

`print` is a utility function to print a tensor or operator's contents to stdout. Printing can be used with any type of operator, including ones that have no memory backing them (see the upcoming generators section). With no additional arguments, `print` prints the entire contents of the tensor. The output can also be limited by passing a limit for each dimension. For example, `matx::print(t2, 3, 2)` would print the first 3 rows and 2 columns of a 2D tensor. `operator()` can also be used to return a single value, which can be combined with traditional printing techniques such as `std::cout`.

In [None]:
{
  // declare a 2D tensor with 4 rows and 5 columns
  auto t2 = matx::make_tensor<int>({4,5});

  // set the values: one inner list per row, 4 rows of 5 values each
  t2.SetVals({
            {1, 2, 3, 4, 5},
            {6, 7, 8, 9, 10},
            {11, 12, 13, 14, 15},
            {16, 17, 18, 19, 20}
            });

  // print a tensor
  matx::print(t2);

  // print elements of tensor
  std::cout << t2(0,0) << std::endl;

  t2(0,0) = 42;
  t2(3,2) = 117;

  matx::print(t2);

  std::cout << "My updates value for (3,2): " << t2(3,2) << std::endl;
}

### Tensor Creation Operators

MatX also has a number of pre-built creation routines. For the full list, see https://nvidia.github.io/MatX/api/creation/operators/index.html.

Shown below are [linspace](https://nvidia.github.io/MatX/api/creation/operators/linspace.html), [range](https://nvidia.github.io/MatX/api/creation/operators/range.html), and [ones](https://nvidia.github.io/MatX/api/creation/operators/ones.html).

In [None]:
{
  std::cout << "Linspace (steps=10, start=0, stop=1)" << std::endl;
  auto tlin = matx::make_tensor<float>({10});
  matx::print(tlin = matx::linspace<0>(tlin.Shape(), 0.0f, 1.0f));
    
  std::cout << std::endl << "Range (shape=10, first=0, step=0.1)" << std::endl;
  auto trange = matx::make_tensor<float>({10});
  matx::print(trange = matx::range<0>(trange.Shape(), 0.0f, 0.1f));
    
  std::cout << std::endl << "Ones (2x3)" << std::endl;
  auto tones = matx::make_tensor<float>({2, 3});
  matx::print(tones = matx::ones());
}
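
The difference between the two generators above is easy to mix up, so here is a plain C++ sketch of the arithmetic, assuming `linspace` includes both endpoints and `range` advances by a fixed step (as the labels printed in the cell above suggest). The names `linspace_ref` and `range_ref` are hypothetical reference helpers, not MatX functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// `count` evenly spaced values from `first` to `last`, both endpoints included.
std::vector<float> linspace_ref(std::size_t count, float first, float last) {
  assert(count >= 2);  // at least two points are needed to span an interval
  std::vector<float> out(count);
  const float step = (last - first) / static_cast<float>(count - 1);
  for (std::size_t i = 0; i < count; ++i) {
    out[i] = first + step * static_cast<float>(i);
  }
  return out;
}

// `count` values starting at `first` and advancing by `step` each time.
std::vector<float> range_ref(std::size_t count, float first, float step) {
  assert(count >= 1);
  std::vector<float> out(count);
  for (std::size_t i = 0; i < count; ++i) {
    out[i] = first + step * static_cast<float>(i);
  }
  return out;
}
```

With 10 points, `linspace_ref(10, 0.0f, 1.0f)` ends at approximately 1.0 (step 1/9), while `range_ref(10, 0.0f, 0.1f)` ends at approximately 0.9.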

## Exercise: First Tensor

Try defining a new integer tensor of size `{3, 5}` and initialize its values in increasing order from 1 to 15.

Print your tensor to ensure the values are as expected.

Update the element at `{1,1}` to `101`.

Print the element at `{1,1}` to ensure your update was valid.

Try other tensor manipulations to test the API!

In [None]:
{
  // declare a tensor

  // setVals in myTensor
}

In [None]:
{
  // declare a tensor
  // auto myTensor = matx::make_tensor<int>({3,5});

  // setVals in myTensor: 3 rows of 5 values each
  // myTensor.SetVals({
  //                  {1, 2, 3, 4, 5},
  //                  {6, 7, 8, 9, 10},
  //                  {11, 12, 13, 14, 15}
  //                  });


  // print your new tensor
  // matx::print(myTensor);


  // update the value at {1,1} to 101
  // myTensor(1,1) = 101;
}

## Tensor Views
MatX provides a powerful set of functions that enable arbitrary views into existing tensors, without incurring additional memory storage or processing cost to reorganize the data. These views provide "zero copy" accessors to a tensor that can be used in MatX logic as if they were real memory-backed tensors.

MatX has feature parity with most operations expected in CuPy- or MATLAB-style environments; a full table translating a given operation to its MatX equivalent can be found in our full documentation [here](https://nvidia.github.io/MatX/basics/matlabpython.html#conversion-table).


### Permute
`permute` returns a view of the data with the dimensions swapped to match the order of the initializer list argument. In the example below we swap our two dimensions, so it's equivalent to a matrix transpose. However, `permute` can be used on higher-order tensors with the dimensions swapped in any order.
  


In [None]:
{
  // declare a 2D tensor with 4 rows and 5 columns
  auto t2 = matx::make_tensor<int>({4,5});

  // set the values: one inner list per row, 4 rows of 5 values each
  t2.SetVals({
            {1, 2, 3, 4, 5},
            {6, 7, 8, 9, 10},
            {11, 12, 13, 14, 15},
            {16, 17, 18, 19, 20}
            });

  // base tensor
  matx::print(t2);

  // permute the tensor
  auto t2p = matx::permute(t2, {1,0});

  // print the permuted tensor to show the transposed data
  matx::print(t2p);
}
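
One common way to get a transpose without moving any data is to keep a shape and a stride per dimension and swap them. The sketch below shows that idea in plain C++; `View2D`, `make_view`, and `transpose_view` are hypothetical names for illustration, and we are not claiming this is how MatX implements `permute` internally.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning 2D view: a pointer plus a shape and a stride per dimension.
struct View2D {
  int* data;
  std::array<std::size_t, 2> shape;
  std::array<std::size_t, 2> stride;

  int& operator()(std::size_t i, std::size_t j) const {
    assert(i < shape[0] && j < shape[1]);
    return data[i * stride[0] + j * stride[1]];
  }
};

// Row-major view over a buffer holding rows * cols elements.
View2D make_view(std::vector<int>& buf, std::size_t rows, std::size_t cols) {
  assert(buf.size() == rows * cols);
  return View2D{buf.data(), {rows, cols}, {cols, 1}};
}

// Swap the two dimensions by swapping shape and stride; no element is copied.
View2D transpose_view(const View2D& v) {
  return View2D{v.data, {v.shape[1], v.shape[0]}, {v.stride[1], v.stride[0]}};
}
```

Because both views share the same pointer, writing through the transposed view changes the original buffer as well.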

### Slice
`slice` provides a view of a subset of data in a tensor, allowing that subset to be used and manipulated as a single entity. The `slice` utility function takes the input operator and two initializer lists that define the range of the input operator the slice will contain. The ranges are defined with the start index and the end (exclusive) index of each dimension.

In the example below, `t2s` will correspond to the elements `[1:3, 1:3]` of the larger `t2` tensor, that is, rows 1-2 and columns 1-2.

![2D Slice](../img/dli-slice.png)


In [None]:
{
  // declare a 2D tensor with 4 rows and 5 columns
  auto t2 = matx::make_tensor<int>({4,5});

  // set the values: one inner list per row, 4 rows of 5 values each
  t2.SetVals({
            {1, 2, 3, 4, 5},
            {6, 7, 8, 9, 10},
            {11, 12, 13, 14, 15},
            {16, 17, 18, 19, 20}
            });

  //slice example 1: same Rank
  auto t2s = matx::slice(t2, {1,1}, {3, 3});

  //print the sliced tensor to show the subset of data
  matx::print(t2s);
}
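
The half-open `[start, end)` convention used by the slice above can be sketched in plain C++ as a window that reuses the parent's memory. `Window2D` and `slice_ref` are hypothetical names for illustration only; the sketch assumes a row-major parent buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical zero-copy window into a row-major buffer.
struct Window2D {
  int* origin;             // points at the first selected element of the parent
  std::size_t rows, cols;  // shape of the window
  std::size_t row_stride;  // elements between consecutive rows of the parent

  int& operator()(std::size_t i, std::size_t j) const {
    assert(i < rows && j < cols);
    return origin[i * row_stride + j];
  }
};

// Select rows [r0, r1) and columns [c0, c1) of a parent with `parent_cols` columns.
Window2D slice_ref(std::vector<int>& parent, std::size_t parent_cols,
                   std::size_t r0, std::size_t c0,
                   std::size_t r1, std::size_t c1) {
  assert(r0 < r1 && c0 < c1 && c1 <= parent_cols);
  assert(r1 * parent_cols <= parent.size());
  return Window2D{parent.data() + r0 * parent_cols + c0, r1 - r0, c1 - c0, parent_cols};
}
```

With start `{1, 1}` and end `{3, 3}`, the window has shape 2x2: the end indices are excluded, so the extent of each dimension is simply `end - start`.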


Similarly, `slice` can be used with a template parameter to define an operator of a different rank (dimensionality) than the input tensor. In the second example, we demonstrate slicing the 0th column from the t2 tensor as shown in the image below.

![Column Slice](../img/dli-slice_col.png)

MatX also includes several helper defines to make tensor bound definitions easier. To include all values from the start index through the end of a dimension, the special sentinel `matxEnd` can be used. Similarly, `matxDropDim` indicates that this dimension is the one being sliced (i.e. removed).

In [None]:
{
  // declare a 2D tensor with 4 rows and 5 columns
  auto t2 = matx::make_tensor<int>({4,5});

  // set the values: one inner list per row, 4 rows of 5 values each
  t2.SetVals({
            {1, 2, 3, 4, 5},
            {6, 7, 8, 9, 10},
            {11, 12, 13, 14, 15},
            {16, 17, 18, 19, 20}
            });

  //slice example 2: reducing the rank requires a template parameter;
  //select column 0 across all rows
  auto t1Col = matx::slice<1>(t2, {0, 0}, {matx::matxEnd, matx::matxDropDim});

  //print the sliced tensor to show the subset of data
  matx::print(t1Col);
}

### Clone
`clone` provides a utility function to expand a smaller-rank tensor to a larger rank by replicating the original data.

For example, a 1D tensor can be cloned to create a 2D tensor.

In the clone example below, we will take the `t1Col` from our previous operation and clone it to build a 2D [5,4] tensor.

In [None]:
{

  // declare a 2D tensor with 4 rows and 5 columns
  auto t2 = matx::make_tensor<int>({4,5});

  // set the values: one inner list per row, 4 rows of 5 values each
  t2.SetVals({
            {1, 2, 3, 4, 5},
            {6, 7, 8, 9, 10},
            {11, 12, 13, 14, 15},
            {16, 17, 18, 19, 20}
            });

  //slice example 2: reducing the rank requires a template parameter;
  //select column 0 across all rows
  auto t1Col = matx::slice<1>(t2, {0, 0}, {matx::matxEnd, matx::matxDropDim});

  //clone the sliced 1D tensor to create a new 2D tensor
  auto t2c_cols = matx::clone<2>(t1Col, {5, matx::matxKeepDim});

  //print the cloned tensor to show the expanded data
  matx::print(t2c_cols);
  
}
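
Replicating data along a new dimension does not require copying it: a view can simply ignore the new index, which is the same as giving that dimension a stride of zero. The plain C++ sketch below illustrates the idea; `RowClone` is a hypothetical type for illustration, not the MatX implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical broadcast view: every row aliases the same 1D source.
struct RowClone {
  const std::vector<int>* src;  // non-owning pointer to the 1D data
  std::size_t rows;             // how many times the source appears to repeat

  std::size_t cols() const { return src->size(); }

  int operator()(std::size_t i, std::size_t j) const {
    assert(i < rows && j < src->size());
    return (*src)[j];  // the row index is ignored: stride 0 along dimension 0
  }
};
```

A 4-element source cloned 5 times behaves like a 5x4 matrix, and any later change to the source is visible in every row.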

### View Data Backing
We established earlier that views are not new data, but accessors into the original memory-backed tensor. This is a powerful tool when operating on the core data, as we can break a large data block down into just the set of data we want to operate on.

**It is very important to remember that modifying the data in a view modifies the original tensor.**

This means any change to the original tensor, through a view or in any other fashion, will be reflected in all views of that tensor.

In [None]:
{
  // declare a 2D tensor with 4 rows and 5 columns
  auto t2 = matx::make_tensor<int>({4,5});
  
  // set the values: one inner list per row, 4 rows of 5 values each
  t2.SetVals({
            {1, 2, 3, 4, 5},
            {6, 7, 8, 9, 10},
            {11, 12, 13, 14, 15},
            {16, 17, 18, 19, 20}
            });

  // slice example 2: reducing the rank requires a template parameter;
  // select column 0 so that the update to t2(1,0) below is visible in the views
  auto t1Col = matx::slice<1>(t2, {0, 0}, {matx::matxEnd, matx::matxDropDim});
  // clone the sliced 1D tensor to create a new 2D tensor
  auto t2c_cols = matx::clone<2>(t1Col, {5, matx::matxKeepDim});


  // modify the original tensor
  t2(1,0) = 10;
  // print our views to show the updated values
  matx::print(t2);
  matx::print(t1Col);
  matx::print(t2c_cols);

  // modify the tensor through a view
  t1Col(1) = 203;
  std::cout << "------------------- After 203 -------------------" << std::endl;

  // print our views to show the updated values
  matx::print(t2);
  matx::print(t1Col);
  matx::print(t2c_cols);
}

## Exercise Views
Let's demonstrate your new skills in creating views of a tensor. Using the pre-defined `baseTensor2D`, please create the following views:

- the complete first row of `baseTensor2D`
- a 2D square of 4 elements, comprising the first 2 rows and 2 columns of data
- modify the `{1,1}` element of `baseTensor2D` through the view corresponding to that data to assign it the value of 87.


In [None]:
{
  // Make tensor
  auto baseTensor2D = matx::make_tensor<int>({3,5});

  // Set Values

  // slice the first row of baseTensor


  //slice the 2D square of the first 4 elements
}

In [None]:
// SOLUTION
{  
  // Make tensor
  auto baseTensor2D = matx::make_tensor<int>({3,5});

  // set the values: 3 rows of 5 values each
  baseTensor2D.SetVals({
                  {1, 2, 3, 4, 5},
                  {6, 7, 8, 9, 10},
                  {11, 12, 13, 14, 15}
                  });


  // slice the first row of baseTensor2D
  auto baseTensor_row0 = matx::slice<1>(baseTensor2D, {0,0}, {matx::matxDropDim, matx::matxEnd});
  matx::print(baseTensor_row0);

  // slice the 2D square made of the first 2 rows and 2 columns
  auto baseSquare = matx::slice(baseTensor2D, {0,0}, {2,2});
  matx::print(baseSquare);

  // modify element {1,1} through the view; the change shows up in baseTensor2D
  baseSquare(1,1) = 87;
  matx::print(baseTensor2D);
}

## MatX Operations
Operators in MatX are an abstract type that defines an operation that returns a value at a given index. This concept is intentionally vague, which makes it extremely powerful for representing different concepts. As an example, both a tensor type and the addition operator `+` are MatX operators. In the case of the tensor, it returns the value in memory at that location, but for the addition operator, it returns the sum of values at a given location from both the left and right hand sides.

Most operators come in unary forms, which operate on a single input, or binary forms, which operate on two inputs. For example, the expression `A + B` uses the binary `AddOp` operator to lazily add two tensors or other operators together. MatX supports most of the standard unary and binary operators a user would expect, and they work with MatX tensor/operator types as well as scalar values that are compatible with the base type of the operator.

Below we'll demonstrate both scalar and matrix support for the basic binary arithmetic operators (`+`, `-`, `*`, `/`).

In [None]:
{
  auto A = matx::make_tensor<float>({2, 3});
  auto B = matx::make_tensor<float>({2, 3});
  auto C = matx::make_tensor<float>({2, 3});
  auto D = matx::make_tensor<float>({2, 2});

  A.SetVals({ {1.f, 2.f, 3.f},
              {4.f, 5.f, 6.f}
            });

  (B = A).run(exec);

  matx::print(A);
  matx::print(B);
  std::cout << " val: " << A(0,0) << std::endl;

  // add
  matx::print(A + 5.0f); // scalar
  matx::print(A + B); // matrix

  // subtraction
  matx::print(A - 5.0f);
  matx::print(A - B);

  // multiplication (element-wise, not a matrix product)
  matx::print(A * 5.0f);
  matx::print(A * B);

  // division
  matx::print(A / 5.0f);
  matx::print(A / B);
}
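
To illustrate what "an operation that returns a value at a given index" means, here is a small expression-template sketch in plain C++. The names `VecOp`, `AddExpr`, `add`, and `evaluate` are hypothetical and only mimic the idea; they are not MatX types. Building the expression performs no arithmetic, and values are computed only when the expression is evaluated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical leaf operator: returns the stored value at an index.
struct VecOp {
  const std::vector<float>* data;
  float operator()(std::size_t i) const { return (*data)[i]; }
  std::size_t size() const { return data->size(); }
};

// Hypothetical binary operator: computes the sum on demand.
template <typename L, typename R>
struct AddExpr {
  L lhs;
  R rhs;
  float operator()(std::size_t i) const { return lhs(i) + rhs(i); }
  std::size_t size() const { return lhs.size(); }
};

// Build the expression; nothing is added yet.
template <typename L, typename R>
AddExpr<L, R> add(L lhs, R rhs) {
  assert(lhs.size() == rhs.size());
  return AddExpr<L, R>{lhs, rhs};
}

// Evaluation happens only here, one element at a time.
template <typename Op>
std::vector<float> evaluate(const Op& op) {
  std::vector<float> out(op.size());
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] = op(i);
  }
  return out;
}
```

Because the expression only stores references to its inputs, changing an input after building the expression but before evaluating it changes the result. This is loosely analogous to how a MatX expression does no work until `run` is called on it.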

### Exercise: Operators
Please use the provided A and B tensors to complete the following set of operations:

- Multiply `A` by its scalar weight factor `aScale` to populate tensor `C`
- Subtract `bOffset` from the matrix `B` in place
- Add the `A` and `B` tensors to populate tensor `D`


In [None]:
{
  auto A = matx::make_tensor<float>({2, 3});
  auto B = matx::make_tensor<float>({2, 3});
  auto C = matx::make_tensor<float>({2, 3});
  auto D = matx::make_tensor<float>({2, 3});

  A.SetVals({ {1.f, 2.f, 3.f},
              {4.f, 5.f, 6.f}
            });

  (B = A).run(exec);

  int aScale = 5;
  int bOffset = 2;

  // scale A by aScale
  // matx::print(/* New Operators here */)

  // subtract B by bOffset
  // matx::print(/* New Operators here */)

  // add A and B Tensors
  // matx::print(/* New Operators here */)
} 

### Expected Output

#### Catch-up Code for Exercise: Operators

In [None]:
// cell hidden by default
{
  auto A = matx::make_tensor<float>({2, 3});
  auto B = matx::make_tensor<float>({2, 3});
  auto C = matx::make_tensor<float>({2, 3});
  auto D = matx::make_tensor<float>({2, 3});

  A.SetVals({ {1.f, 2.f, 3.f},
              {4.f, 5.f, 6.f}
            });

  (B = A).run(exec);

  int aScale = 5;
  int bOffset = 2;

  // scale A by aScale and store the result in C
  (C = A * aScale).run(exec);
  matx::print(C);

  // subtract bOffset from B in place
  (B = B - bOffset).run(exec);
  matx::print(B);

  // add the A and B tensors and store the result in D
  (D = A + B).run(exec);
  matx::print(D);

} 


## MatX Transforms
Transforms are operators that take one or more inputs and call a backend library or kernel. Transforms usually change one or more properties of the input, but that is not always the case: an FFT may change the input type or shape, but a sort transform does not. Depending on the context used, a transform may asynchronously allocate temporary memory if the expression requires it.

### Matrix Multiplication
The `matmul` transform performs the matrix-matrix multiply $$C = {\alpha}A * B + {\beta}C$$ where `A` is of dimensions `MxK`, `B` is `KxN`, and `C` is `MxN`. We first populate the `A` and `B` matrices with random values, then the GEMM is performed. Since the random number generator allocates memory sufficient to randomize the entire tensor, we create a random number generator large enough to generate values for either `A` or `B`. This allows us to create a single random number generator, but pull different random values for `A` and `B` by simply calling `run` twice. Any rank above 2 is considered a batching dimension.

We use rectangular matrices for `A` and `B`, while `C` will be a square matrix due to the outer dimensions of `A` and `B` matching. 
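
As a reference for the GEMM formula above, here is a naive triple-loop sketch in plain C++ over row-major buffers. `gemm_ref` is a hypothetical helper for checking small results by hand; it is not how `matmul` is implemented, which dispatches to an optimized backend library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference GEMM on row-major buffers: C = alpha * A * B + beta * C,
// where A is M x K, B is K x N, and C is M x N.
void gemm_ref(std::size_t M, std::size_t K, std::size_t N, float alpha,
              const std::vector<float>& A, const std::vector<float>& B,
              float beta, std::vector<float>& C) {
  assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < K; ++k) {
        acc += A[i * K + k] * B[k * N + j];  // dot product of row i and column j
      }
      C[i * N + j] = alpha * acc + beta * C[i * N + j];
    }
  }
}
```

Multiplying a 2x3 matrix by a 3x2 matrix yields a 2x2 result, which is the "square `C` from rectangular `A` and `B`" case described above.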

In [None]:
// matrix multiplication
//(D = matx::matmul(A,matx::transpose(B))).run(exec);

### FFTs
MatX provides an interface to do both 1D Fast Fourier Transforms (FFTs) and 2D FFTs. Any tensor above rank 1 will be batched in a 1D FFT, and any tensor above rank 2 will be batched in a 2D FFT. FFTs may either be done in-place or out-of-place by using the same or different variables for the output and inputs. Since the tensors are strongly-typed, the type of FFT (C2C, R2C, etc) is inferred by the tensor type at compile time. Similarly, the input and output size of the executor is deduced by the type of transform, and the input/output tensors must match those sizes.

In [None]:
// FFT
///\todo commented out because FFT is crashing
// (A = matx::random<float>(A.Shape(), matx::NORMAL)).run(exec); // random operator explained later
// matx::print(A);

// (A = fft(A)).run(exec);
// matx::print(A);

// (A = matx::ifft(A)).run(exec);
// matx::print(A);
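
Since the FFT cell above is disabled, the following plain C++ sketch shows what a real-to-complex (R2C) transform computes, using a naive O(N^2) DFT. It assumes the usual convention that a real input of length N produces N/2 + 1 non-redundant complex bins; `dft_r2c_ref` and `peak_bin` are hypothetical helper names, and an actual FFT library computes the same values far faster.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(N^2) real-to-complex DFT; returns the N/2 + 1 non-redundant bins.
std::vector<std::complex<double>> dft_r2c_ref(const std::vector<double>& x) {
  assert(!x.empty());
  const std::size_t n = x.size();
  const double pi = std::acos(-1.0);
  std::vector<std::complex<double>> out(n / 2 + 1);
  for (std::size_t k = 0; k < out.size(); ++k) {
    std::complex<double> acc(0.0, 0.0);
    for (std::size_t t = 0; t < n; ++t) {
      const double angle =
          -2.0 * pi * static_cast<double>(k * t) / static_cast<double>(n);
      acc += x[t] * std::complex<double>(std::cos(angle), std::sin(angle));
    }
    out[k] = acc;
  }
  return out;
}

// Index of the bin with the largest magnitude.
std::size_t peak_bin(const std::vector<std::complex<double>>& bins) {
  assert(!bins.empty());
  std::size_t best = 0;
  for (std::size_t k = 1; k < bins.size(); ++k) {
    if (std::abs(bins[k]) > std::abs(bins[best])) {
      best = k;
    }
  }
  return best;
}
```

Feeding in a cosine that completes 3 cycles over 16 samples gives 9 output bins with the peak at bin 3, which is a handy sanity check for any FFT pipeline.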

### Reductions
Reductions are one of the most common operations performed on the GPU, which means they've been heavily researched and optimized for highly-parallel processors. Modern NVIDIA GPUs have special instructions for performing reductions to give even larger speedups over naive implementations. All of these details are hidden from the user, and MatX automatically chooses the optimized path based on the hardware capabilities.

MatX provides a set of optimized primitives to perform reductions on tensors for many common types. Reductions are supported across individual dimensions or on entire tensors, depending on the size of the output tensor. Currently supported reduction functions are `sum`, `min`, `max`, `mean`, `any`, and `all`.

Below is a simple example that calculates a full reduction of the max and sum of our `A` data.

In [None]:
{

  auto A = matx::make_tensor<float>({2, 3});
  auto MD0 = matx::make_tensor<float>({});
  auto AD0 = matx::make_tensor<float>({});

  (A = matx::random<float>(A.Shape(), matx::NORMAL)).run(exec);

  // max of data
  (MD0 = matx::max(A)).run(exec);
  // sum of data
  (AD0 = matx::sum(A)).run(exec);

  exec.sync();

  printf("A:\n");
  matx::print(A);
  std::cout << "Max: " << MD0() << std::endl;
  std::cout << "Sum: " << AD0() << std::endl;

}
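
For comparison, the same reductions can be written with the C++ standard library. This sketch shows a full reduction (one value for the whole buffer) next to a per-row reduction (one value per row), which mirrors how the output rank determines which dimensions are reduced. The helper names are our own and are not MatX functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Full reduction: sum of every element.
float sum_all(const std::vector<float>& x) {
  return std::accumulate(x.begin(), x.end(), 0.0f);
}

// Full reduction: largest element.
float max_all(const std::vector<float>& x) {
  assert(!x.empty());
  return *std::max_element(x.begin(), x.end());
}

// Partial reduction: max of each row of a row-major rows x cols buffer.
std::vector<float> row_max(const std::vector<float>& x, std::size_t rows,
                           std::size_t cols) {
  assert(cols > 0 && x.size() == rows * cols);
  std::vector<float> out(rows);
  for (std::size_t r = 0; r < rows; ++r) {
    const auto first = x.begin() + static_cast<std::ptrdiff_t>(r * cols);
    out[r] = *std::max_element(first, first + static_cast<std::ptrdiff_t>(cols));
  }
  return out;
}
```

A 2x3 input reduces to a single value with `max_all`, but to a 2-element result with `row_max`.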

### Additional Transforms
MatX supports a wide range of transforms, including specializations for specific domains of signal processing. Please review the [MatX documentation](https://nvidia.github.io/MatX/api/index.html) for an exhaustive list of supported operations, but we'll review a few of the most common here.

# TODO: Decide if we want to showcase any specific Transforms

In [None]:
// Do we want to show any additional reductions here? Talk about batching?
// convolution
// batched transforms

## MatX Generators
Generators are a type of operator that can generate values without another tensor or operator as input. For example, windowing functions, such as a Hamming window, can generate values by only taking a length as input. Generators are efficient since they require no memory.

Common generators include random number generators, window functions, and identity matrices. Below is an example of each:

In [None]:
{
  auto A = matx::make_tensor<float>({2, 3});
  auto H = matx::make_tensor<float>({10});

  // random
  (A = 0).run(exec);
  (A = matx::random<float>(A.Shape(), matx::NORMAL)).run(exec);
  matx::print(A);

  // eye
  (A = matx::eye(A.Shape())).run(exec);
  matx::print(A);

  // hamming
  (H = matx::hamming<0>(H.Shape())).run(exec);
  matx::print(H);
}
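
The claim that generators need no memory can be illustrated with a functor that computes each window sample on demand. This plain C++ sketch uses the textbook symmetric Hamming formula, `0.54 - 0.46 * cos(2*pi*n / (N - 1))`; `HammingGen` is a hypothetical type, and whether MatX uses exactly this convention should be checked against its documentation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical generator: a Hamming window of length `size`, computed on demand.
struct HammingGen {
  std::size_t size;

  float operator()(std::size_t n) const {
    assert(size >= 2 && n < size);
    const double pi = std::acos(-1.0);
    const double phase =
        2.0 * pi * static_cast<double>(n) / static_cast<double>(size - 1);
    return static_cast<float>(0.54 - 0.46 * std::cos(phase));
  }
};
```

The functor stores only the window length; no array of samples exists until something like an assignment into a tensor asks for the values.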

## Exercise: Transforms and Generators

For this example we will generate random data to verify the distribution of our generator functions. Please implement the following:

- Generate 10 1D data arrays of 1000 elements each
- Perform a 1D FFT on the entire data set
- Find the max bin of each FFT output

In [None]:
{

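  // input data storage: 10 signals of 1000 samples each
  auto input = matx::make_tensor<float>({10,1000});
  // per-signal peak magnitude and the bin index where it occurs
  auto maxVal = matx::make_tensor<float>({10});
  auto maxIdx = matx::make_tensor<matx::index_t>({10});

  // generate random data
  // (input = matx::random<float>(input.Shape(), matx::NORMAL)).run(exec);

  // perform a 1D FFT on each row (kept lazy here; the name `spectrum` is illustrative)
  // auto spectrum = matx::fft(input);

  // find max bins and values along the last dimension
  // (matx::mtie(maxVal, maxIdx) = matx::argmax(matx::abs(spectrum), {1})).run(exec);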
}