# Operators and Lazy Evaluation
Consider a simple arithmetic expression like the following:

```A = B * (cos(C) / D)```

Using the typical order of operations, we evaluate the expression in parentheses first, `(cos(C) / D)`, followed by the multiplication `B *`, then the assignment `A =`. Written using standard C++ operator overloading, we would have a cosine, division, multiplication, and assignment overload. Each operator performs its respective task, then returns the computed value. That returned value is stored somewhere (either out to memory or possibly in a register), then the next operator uses it as input to its own computation. Finally, the assignment writes the value, usually out to memory.

To avoid the overhead of repeated accesses to global memory and multiple discrete operation calls, MatX uses a technique called **lazy evaluation** to reduce the total number of loads and stores. It does this by overloading each operator so that **instead of performing the operation, such as multiplication, it returns an object that represents the multiplication and computes it only when needed.** The entire expression then generates a single C++ type representing the full equation above, and when we ask for element (0,0) of A, the value is computed on the fly without storing any intermediate values. This also implies that you can store an entire expression in a variable and nothing will be executed:

`auto op = (B * (cos(C) / D));`

In the example above, `op` is not evaluated at creation; it is instead a handle to an operator that can calculate the result of the expression on the right-hand side.

This operator can then be further combined with other expressions, which can increase code readability without loss of performance.
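The idea can be illustrated outside of MatX with a minimal expression-template sketch in plain C++. This is not MatX's implementation; the `lazy` namespace and the `View`, `Mul`, `Div`, `Cos`, `view`, and `assign` names are hypothetical and exist only for illustration. Each overloaded operator returns a small object describing the operation, and arithmetic happens only when `assign` walks the indices:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

namespace lazy {  // hypothetical namespace, not part of MatX

// Leaf node: a non-owning view over existing data, cheap to copy.
struct View {
  const float* ptr;
  std::size_t n;
  float operator()(std::size_t i) const { return ptr[i]; }
  std::size_t size() const { return n; }
};

// Node representing "lhs * rhs"; computes one element only when asked.
template <typename L, typename R>
struct Mul {
  L lhs;
  R rhs;
  float operator()(std::size_t i) const { return lhs(i) * rhs(i); }
  std::size_t size() const { return lhs.size(); }
};

// Node representing "lhs / rhs".
template <typename L, typename R>
struct Div {
  L lhs;
  R rhs;
  float operator()(std::size_t i) const { return lhs(i) / rhs(i); }
  std::size_t size() const { return lhs.size(); }
};

// Node representing "cos(in)".
template <typename In>
struct Cos {
  In in;
  float operator()(std::size_t i) const { return std::cos(in(i)); }
  std::size_t size() const { return in.size(); }
};

// The overloads build nodes instead of doing arithmetic.
template <typename L, typename R>
Mul<L, R> operator*(const L& l, const R& r) { return {l, r}; }

template <typename L, typename R>
Div<L, R> operator/(const L& l, const R& r) { return {l, r}; }

template <typename In>
Cos<In> cos(const In& in) { return {in}; }

inline View view(const std::vector<float>& v) { return {v.data(), v.size()}; }

// The only place where work happens: one pass, one store per element.
template <typename Op>
void assign(std::vector<float>& out, const Op& op) {
  for (std::size_t i = 0; i < op.size(); ++i) out[i] = op(i);
}

}  // namespace lazy
```

With `std::vector<float>` inputs `B`, `C`, `D` and an output `A` of the same length, `auto op = lazy::view(B) * (lazy::cos(lazy::view(C)) / lazy::view(D));` builds the expression without touching the data, and `lazy::assign(A, op);` evaluates it in a single pass with no temporaries.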

# Executors
Operators are used in conjunction with executors and the ``run()`` syntax to dictate when a given operator is executed and on what acceleration hardware. Notebook 1 exploited automation in the `matx::print` function to ensure operators are called and copied to the host to facilitate printing. This is convenient, but not realistic or performant to use in a real application. 

Executors are types that describe how to execute an operator expression or transform. They are similar to C++’s execution policy, and may even use C++ execution policies behind the scenes. Executors are designed so that the code can remain unchanged while executing on a variety of different targets. For these notebooks, a single executor `exec` is created at the top of each notebook, and then used throughout. 

To execute work on a given executor, call the `run()` function on the operator you would like to execute, passing the executor that should do the work.

```

(A = B * (cos(C) / D)).run(exec); // immediate evaluation of fused operators into memory-backed tensor

auto myOp = B * (cos(C) / D);     // define a lazy operator from the fused expression
(A = myOp).run(exec);             // evaluate the operator into a memory-backed tensor

(A2 = myOp * C).run(exec);        // combine op with other tensors
(A3 = myOp * myOp).run(exec);     // combine op with other ops

```
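To make the role of an executor more concrete, here is a small plain C++ sketch of the pattern. It is only an analogy: `SerialExec`, `TiledExec`, `Assign`, and `assign` are hypothetical names, and neither executor touches a GPU. The point is that the deferred assignment does no work until `run()` receives an executor, and that swapping the executor leaves the rest of the code unchanged:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical executor: visits every index in order on the calling thread.
struct SerialExec {
  template <typename Fn>
  void for_each_index(std::size_t n, Fn fn) const {
    for (std::size_t i = 0; i < n; ++i) fn(i);
  }
};

// Hypothetical executor: walks the index space tile by tile, loosely
// mimicking how a GPU launch splits work into blocks (still sequential here).
struct TiledExec {
  std::size_t tile = 4;
  template <typename Fn>
  void for_each_index(std::size_t n, Fn fn) const {
    for (std::size_t start = 0; start < n; start += tile) {
      const std::size_t stop = (start + tile < n) ? start + tile : n;
      for (std::size_t i = start; i < stop; ++i) fn(i);
    }
  }
};

// A deferred assignment: remembers the destination and an element function
// (std::size_t -> float) and does nothing until run() is called.
template <typename Op>
struct Assign {
  std::vector<float>& out;
  Op op;

  template <typename Exec>
  void run(const Exec& exec) const {
    exec.for_each_index(out.size(), [&](std::size_t i) { out[i] = op(i); });
  }
};

// Helper that deduces the operator type.
template <typename Op>
Assign<Op> assign(std::vector<float>& out, Op op) {
  return {out, op};
}
```

Calling `assign(out, op).run(SerialExec{});` and `assign(out, op).run(TiledExec{});` fills `out` with the same values; only the executor argument changes.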

## Fusion Example

Below we take the example operation from above, `(A = B * (cos(C) / D)).run(exec);`, and express it in MatX first as individual operations and then as a single fused operation, to demonstrate the speedup that fusion provides.

In [4]:
%%run_matx
auto exec = matx::CUDAExecutor();

matx::index_t size_x = 128;
matx::index_t size_y = 256;

auto A      = matx::make_tensor<float>({size_x, size_y});
auto B      = matx::make_tensor<float>({size_x, size_y});
auto C      = matx::make_tensor<float>({size_x, size_y});
auto D      = matx::make_tensor<float>({size_x, size_y});
auto result = matx::make_tensor<float>({size_x, size_y});

// ---- populate the data ---- //
(A = matx::random<float>(A.Shape(), matx::NORMAL)).run();
(B = matx::random<float>(B.Shape(), matx::NORMAL)).run();
(C = matx::random<float>(C.Shape(), matx::NORMAL)).run();
(D = matx::random<float>(D.Shape(), matx::NORMAL)).run();
(result = matx::zeros({size_x, size_y})).run(exec);
exec.sync();


// ---- first individual, independent kernels ---- //
exec.start_timer();
(result = cos(C)).run(exec);     
(result = result / D).run(exec); 
(result = result * B).run(exec);   
exec.stop_timer();

std::cout <<"Unfused time: " << exec.get_time_ms() << " ms" << std::endl;

// ---- fused operation ---- //
exec.start_timer();
(A = B * cos(C)/D).run(exec);
exec.stop_timer();
std::cout <<"fused time: " << exec.get_time_ms() << " ms" << std::endl;


### Runtime Improvements
The fused version delivers all of the performance benefits described above:
- a single kernel is submitted to the GPU to complete all operations
- each input is read from global memory only once, and the result is written to global memory only once

This results in significant performance improvements, in both launch latency and GPU kernel execution time.
# <img src="img/dli-fusion.png" width="80%">
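The memory-traffic argument can be checked with a CPU-side sketch that simply counts element reads and writes for both versions of `B * (cos(C) / D)`. This is an illustration in plain C++, not a GPU measurement; the `Traffic` struct and the `unfused` and `fused` functions are hypothetical names:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Tallies element-sized reads and writes to the large arrays.
struct Traffic {
  std::size_t loads = 0;
  std::size_t stores = 0;
};

// Three separate passes, mirroring three separate kernel launches.
inline Traffic unfused(const std::vector<float>& B, const std::vector<float>& C,
                       const std::vector<float>& D, std::vector<float>& result) {
  Traffic t;
  const std::size_t n = result.size();
  for (std::size_t i = 0; i < n; ++i) {  // result = cos(C)
    result[i] = std::cos(C[i]);
    t.loads += 1;
    t.stores += 1;
  }
  for (std::size_t i = 0; i < n; ++i) {  // result = result / D
    result[i] = result[i] / D[i];
    t.loads += 2;
    t.stores += 1;
  }
  for (std::size_t i = 0; i < n; ++i) {  // result = result * B
    result[i] = result[i] * B[i];
    t.loads += 2;
    t.stores += 1;
  }
  return t;
}

// One pass: every input element is read once, every output written once.
inline Traffic fused(const std::vector<float>& B, const std::vector<float>& C,
                     const std::vector<float>& D, std::vector<float>& result) {
  Traffic t;
  const std::size_t n = result.size();
  for (std::size_t i = 0; i < n; ++i) {
    result[i] = B[i] * (std::cos(C[i]) / D[i]);
    t.loads += 3;
    t.stores += 1;
  }
  return t;
}
```

For `n` elements the unfused version performs `5n` loads and `3n` stores, while the fused version performs `3n` loads and `n` stores, in addition to needing one pass instead of three.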

## Fusion with Operators

Fusion is intuitive when all operands can be combined into a single statement (as above), which is the natural pattern most programs follow. More complex algorithms rarely fit in a single statement, and this is where fusion also provides a significant benefit for readability and reuse, since very complex terms can be defined separately, in addition to the performance benefits we just showed.

By combining lazy evaluation with the ability to compose operators, each term can be defined on its own so that its specific math is clear, and the terms can then be combined later to create the complete final expression for execution.

Below we show a more complex operation composed of both unary operators and transforms, and how a very complex expression can be broken down into simple, reusable terms.



# Exercise: Fusion Basics

Use the following equation to create an implementation that uses fusion and term reuse to optimize the underlying code.

`result = A*C + B/D + ((D-C)/B)/(A*C)`

An example implementation is given with all operations done individually; how much faster can you make it?

In [None]:
%%run_matx
auto exec = matx::CUDAExecutor();

matx::index_t size_x = 128;
matx::index_t size_y = 256;

auto A      = matx::make_tensor<float>({size_x, size_y});
auto B      = matx::make_tensor<float>({size_x, size_y});
auto C      = matx::make_tensor<float>({size_x, size_y});
auto D      = matx::make_tensor<float>({size_x, size_y});
auto result = matx::make_tensor<float>({size_x, size_y});

// ---- populate the data ---- //
(A = matx::random<float>(A.Shape(), matx::NORMAL)).run();
(B = matx::random<float>(B.Shape(), matx::NORMAL)).run();
(C = matx::random<float>(C.Shape(), matx::NORMAL)).run();
(D = matx::random<float>(D.Shape(), matx::NORMAL)).run();
(result = matx::zeros({size_x, size_y})).run(exec);
exec.sync();

// ---- Reference Implementation ---- //
exec.start_timer();
(result = A*C).run(exec);
(result += B/D).run(exec);
(result += ((D-C)/B)/(A*C)).run(exec);
exec.stop_timer();
std::cout <<"Separate Operators Runtime: " << exec.get_time_ms() << " ms" << std::endl;

// ---- Exercise: Implementation ---- //
exec.start_timer();
//
// Your implementation here:
//
exec.stop_timer();
std::cout <<"Exercise Runtime: " << exec.get_time_ms() << " ms" << std::endl;


In [None]:
%%run_matx
auto exec = matx::CUDAExecutor();

matx::index_t size_x = 128;
matx::index_t size_y = 256;

auto A      = matx::make_tensor<float>({size_x, size_y});
auto B      = matx::make_tensor<float>({size_x, size_y});
auto C      = matx::make_tensor<float>({size_x, size_y});
auto D      = matx::make_tensor<float>({size_x, size_y});
auto result = matx::make_tensor<float>({size_x, size_y});

// ---- populate the data ---- //
(A = matx::random<float>(A.Shape(), matx::NORMAL)).run();
(B = matx::random<float>(B.Shape(), matx::NORMAL)).run();
(C = matx::random<float>(C.Shape(), matx::NORMAL)).run();
(D = matx::random<float>(D.Shape(), matx::NORMAL)).run();
(result = matx::zeros({size_x, size_y})).run(exec);
exec.sync();

// ---- all crammed together ---- //
exec.start_timer();
(result = A * C + B / D + ((D - C) / B) / (A * C)).run(exec);
exec.stop_timer();
std::cout <<"One Equation Runtime: " << exec.get_time_ms() << " ms" << std::endl;

// ---- ideal implementation with reuse of operators ---- //
exec.start_timer();
auto term1 = A * C; 
auto term2 = B / D;
auto term3 = (D - C) / B;
auto term4 = term3 / term1;
(result = term1 + term2 + term4).run(exec);
exec.stop_timer();
std::cout <<"Fused Operation Runtime: " << exec.get_time_ms() << " ms" << std::endl;  


# Exercise: Black Scholes Fusion

The Black-Scholes model is a good real-world example of a set of equations that benefits greatly from operator fusion. Its expressions are complex enough that writing them as individual terms significantly improves readability, while execution still benefits from fusing those separate parts. Below is a brief description of the Black-Scholes model and its component terms:


$$
C(S_0, K, T) = S_0 \,\Phi\bigl(d_1\bigr) \;-\; K \, e^{-rT} \,\Phi\bigl(d_2\bigr),
$$

where

$$
d_1 = \frac{\ln\!\bigl(\tfrac{S_0}{K}\bigr) + \bigl(r + \tfrac{\sigma^2}{2}\bigr)T}{\sigma \sqrt{T}},
\quad
d_2 = d_1 - \sigma \sqrt{T}.
$$

Here:
- $S_0$ is the current stock price
- $K$ is the strike price
- $T$ is the time to maturity (in years)
- $r$ is the risk-free interest rate (annualized)
- $\sigma$ is the volatility of the underlying stock (annualized)
- $\Phi(\cdot)$ is the cumulative distribution function (CDF) of the standard normal distribution



We can translate this directly by expressing each of the terms defined above as a separate MatX operator, then fusing the execution of those operators in the final `run()` call.

Try breaking the equation above into the following operators, where `S` stands for $S_0$ and `V` for $\sigma$:

```
VsqrtT  = V * sqrt(T);
d1      = (log(S / K) + (r + 0.5 * V * V) * T) / VsqrtT ;
d2      = d1 - VsqrtT;
cdf_d1  = normcdf(d1);
cdf_d2  = normcdf(d2);
expRT   = exp(-1 * r * T); 
```
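When checking a MatX implementation, a scalar host-side reference can be handy. The sketch below is plain C++ for a single option and follows the same term names as the list above; `norm_cdf` and `black_scholes_call` are hypothetical helper names, and the normal CDF is written in terms of `std::erfc` rather than MatX's `normcdf`:

```cpp
#include <cmath>

// Standard normal CDF expressed through the complementary error function.
inline double norm_cdf(double x) {
  return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// European call price for one option. V plays the role of sigma and S the
// role of S_0 in the equations above.
inline double black_scholes_call(double S, double K, double V, double r, double T) {
  const double VsqrtT = V * std::sqrt(T);
  const double d1 = (std::log(S / K) + (r + 0.5 * V * V) * T) / VsqrtT;
  const double d2 = d1 - VsqrtT;
  const double cdf_d1 = norm_cdf(d1);
  const double cdf_d2 = norm_cdf(d2);
  const double expRT = std::exp(-r * T);
  return S * cdf_d1 - K * expRT * cdf_d2;
}
```

As a sanity check, `black_scholes_call(100.0, 100.0, 0.2, 0.05, 1.0)` gives approximately 10.45, which matches the commonly quoted textbook value for these inputs.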


In [9]:
%%run_matx
auto exec = matx::CUDAExecutor();

using dtype = double;
matx::index_t input_size = 100;

// ---- declare input data ---- //
auto K = matx::make_tensor<dtype>({input_size});
auto S = matx::make_tensor<dtype>({input_size});
auto V = matx::make_tensor<dtype>({input_size});
auto r = matx::make_tensor<dtype>({input_size});
auto T = matx::make_tensor<dtype>({input_size});
auto output = matx::make_tensor<dtype>({input_size});  

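// ---- populate the data ---- //
// Note: NORMAL draws can be negative, so sqrt(T) and log(S / K) may produce NaN;
// that is harmless for timing, but the output values are not meaningful prices.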
(K = matx::random<float>(K.Shape(), matx::NORMAL)).run();
(S = matx::random<float>(S.Shape(), matx::NORMAL)).run();
(V = matx::random<float>(V.Shape(), matx::NORMAL)).run();
(r = matx::random<float>(r.Shape(), matx::NORMAL)).run();
(T = matx::random<float>(T.Shape(), matx::NORMAL)).run();
(output = matx::zeros({input_size})).run(exec);
exec.sync();

// ---- Exercise: Implementation ---- //
exec.start_timer();
//
// Your implementation here:
//
exec.stop_timer();
std::cout <<"Exercise Runtime: " << exec.get_time_ms() << " ms" << std::endl;

In [8]:
%%run_matx
auto exec = matx::CUDAExecutor();

using dtype = double;
matx::index_t input_size = 100;


// ---- declare input data ---- //
auto K = matx::make_tensor<dtype>({input_size});
auto S = matx::make_tensor<dtype>({input_size});
auto V = matx::make_tensor<dtype>({input_size});
auto r = matx::make_tensor<dtype>({input_size});
auto T = matx::make_tensor<dtype>({input_size});
auto output = matx::make_tensor<dtype>({input_size});  

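// ---- populate the data ---- //
// Note: NORMAL draws can be negative, so sqrt(T) and log(S / K) may produce NaN;
// that is harmless for timing, but the output values are not meaningful prices.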
(K = matx::random<float>(K.Shape(), matx::NORMAL)).run();
(S = matx::random<float>(S.Shape(), matx::NORMAL)).run();
(V = matx::random<float>(V.Shape(), matx::NORMAL)).run();
(r = matx::random<float>(r.Shape(), matx::NORMAL)).run();
(T = matx::random<float>(T.Shape(), matx::NORMAL)).run();
(output = matx::zeros({input_size})).run(exec);
exec.sync();

auto VsqrtT = V * sqrt(T);
auto d1     = (log(S / K) + (r + 0.5 * V * V) * T) / VsqrtT ;
auto d2     = d1 - VsqrtT;
auto cdf_d1 = matx::normcdf(d1);
auto cdf_d2 = matx::normcdf(d2);
auto expRT  = exp(-1 * r * T); 
exec.start_timer();
(output = S * cdf_d1 - K * expRT * cdf_d2).run(exec);
exec.stop_timer();
std::cout <<"Fused Runtime: " << exec.get_time_ms() << " ms" << std::endl;


## Limitations & Intermediates
Some limitations prevent the fusion of all operations. As with all CUDA programs, there is an upper limit on how much computation is optimal for a single kernel: the kernel's complexity drives resource utilization (such as registers and shared memory), which may ultimately harm performance.

Similarly, some lower-level APIs used by MatX may not support iterators or pre/post operations, and so cannot be fused.

To resolve this, MatX uses asynchronously allocated memory when required to create intermediate buffers that hold data between non-fusable operations. This requires no action from the user; however, it may result in sub-optimal performance if asynchronous memory pools are not managed appropriately.
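The need for an intermediate can be sketched in plain C++. Here `library_prefix_sum` is a hypothetical stand-in for an opaque library routine (think of an FFT call) that accepts only raw pointers, and `pipeline` is likewise a made-up name. Because the routine cannot consume a lazy expression, the elementwise multiply before it has to be written to a temporary buffer first. MatX allocates such buffers for you; this sketch uses an ordinary `std::vector` purely for illustration:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Stand-in for an opaque library routine that only accepts raw pointers,
// so it cannot be fused with the operations around it.
inline void library_prefix_sum(const float* in, float* out, std::size_t n) {
  std::partial_sum(in, in + n, out);
}

// Computes out = prefix_sum(a * b) + 1 and returns the number of bytes of
// intermediate storage that were needed.
inline std::size_t pipeline(const std::vector<float>& a, const std::vector<float>& b,
                            std::vector<float>& out) {
  const std::size_t n = a.size();

  // Pass 1: the elementwise multiply must be materialized before the
  // library call can read it.
  std::vector<float> tmp(n);
  for (std::size_t i = 0; i < n; ++i) tmp[i] = a[i] * b[i];

  // Pass 2: the non-fusable library routine.
  library_prefix_sum(tmp.data(), out.data(), n);

  // Pass 3: the trailing elementwise op runs as its own pass.
  for (std::size_t i = 0; i < n; ++i) out[i] += 1.0f;

  return tmp.size() * sizeof(float);
}
```

The three passes and the extra buffer are the cost of the non-fusable step in the middle; if the middle routine could accept and produce lazy expressions, the whole pipeline could collapse into a single pass.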

In [None]:
{
  matx::index_t size_x = 12;
  matx::index_t size_y = 12;
  
  // matx::index_t size_x = 128;
  // matx::index_t size_y = 256;

  auto A      = matx::make_tensor<cuda::std::complex<float>>({size_x, size_y});
  auto B      = matx::make_tensor<cuda::std::complex<float>>({size_x, size_y});
  auto result = matx::make_tensor<cuda::std::complex<float>>({size_x, size_y});

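  // The timer restarts every iteration, so the time printed after each loop
  // covers only the final iteration; earlier iterations act as warm-up.
  for (int i = 0; i < 10; i++) 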
  {  
    exec.start_timer();
    (A = fft(A)).run(exec);
    (A = A * B).run(exec);
    (A = ifft(A)).run(exec);
    exec.stop_timer();
  }
  std::cout <<"NonFused Runtime: " << exec.get_time_ms() << " ms" << std::endl;


  for (int i = 0; i < 10; i++) 
  {  
    exec.start_timer();
    (A = ifft(fft(A)*B)).run(exec);
    exec.stop_timer();
  }
  std::cout <<"Fused Runtime: " << exec.get_time_ms() << " ms" << std::endl;

  for (int i = 0; i < 10; i++) 
  {  
    exec.start_timer();
    (A = fft(A*B)).run(exec);
    (A = ifft(A)).run(exec);
    exec.stop_timer();
  }
  std::cout <<"Partial Fused Runtime: " << exec.get_time_ms() << " ms" << std::endl;


}