# Intro

In [5]:
from tinygrad import Tensor, Context
a = Tensor.empty(4,4)
b = Tensor.empty(4,4)

print((a+b).tolist())

[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]


In [6]:
print(a+b)

<Tensor <UOp METAL (4, 4) float (<Ops.ADD: 62>, None)> on METAL with grad None>


Computation is lazy: the answer is only computed once `tolist()`, `numpy()`, or `realize()` is called on the tensor.
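To make the idea of laziness concrete, here is a minimal pure-Python sketch. `LazyValue` is a hypothetical toy class invented for illustration, not tinygrad's implementation; it only shows that building an expression can be separated from computing it.

```python
class LazyValue:
    """Toy stand-in for a lazy tensor: records the computation, runs it on demand."""
    def __init__(self, compute):
        self._compute = compute   # zero-argument function that produces the data
        self._data = None         # filled in on first realization
        self.realized = False

    def __add__(self, other):
        # Building the sum only records what to do; nothing is computed here.
        return LazyValue(lambda: [x + y for x, y in zip(self.tolist(), other.tolist())])

    def realize(self):
        if not self.realized:
            self._data = self._compute()
            self.realized = True
        return self

    def tolist(self):
        return self.realize()._data


a = LazyValue(lambda: [1.0, 2.0])
b = LazyValue(lambda: [3.0, 4.0])
c = a + b
print(c.realized)   # False: the addition has only been recorded
print(c.tolist())   # [4.0, 6.0]: asking for the data triggers the computation
print(c.realized)   # True
```

This mirrors what the two cells above show: printing `a+b` displays a description of the pending operation, while `tolist()` forces the result.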

In [11]:
with Context(DEBUG=4, NOOPT=True): 
    a = Tensor.empty(4,4)
    b = Tensor.empty(4,4)
    print((a+b).tolist())
    print((a.sum(0).tolist()))

#include <metal_stdlib>
using namespace metal;
kernel void E_16(device float* data0_16, device float* data1_16, device float* data2_16, uint3 gid [[threadgroup_position_in_grid]], uint3 lid [[thread_position_in_threadgroup]]) {
  int gidx0 = gid.x; /* 16 */
  float val0 = (*(data1_16+gidx0));
  float val1 = (*(data2_16+gidx0));
  *(data0_16+gidx0) = (val0+val1);
}
*** METAL      4 E_16                                         arg  3 mem  0.00 GB tm      7.25us/     0.02ms (     0.00 GFLOPS    0.0|0.0     GB/s) ['tolist', '__add__', 'empty']
[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
#include <metal_stdlib>
using namespace metal;
kernel void r_4_4n1(device float* data0_4, device float* data1_16, uint3 gid [[threadgroup_position_in_grid]], uint3 lid [[thread_position_in_threadgroup]]) {
  float acc0[1];
  int gidx0 = gid.x; /* 4 */
  *(acc0+0) = 0.0f;
  for (int ridx1001 = 0; ridx1001 < 4; ridx1001++) {
    float val

A UOp is a node in the abstract syntax tree (AST) that represents the computation.

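```python
from typing import Any
from tinygrad.dtype import DType
from tinygrad.uop.ops import Ops

# Simplified sketch of the fields only, not the full class definition.
class UOp:
  op: Ops                  # the operation
  dtype: DType             # the data type
  src: tuple["UOp", ...]   # the parents (inputs) of this node
  arg: Any                 # op-specific argument, may be None
```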

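`op` is the operation the node performs, for example `Ops.ADD`.

`dtype` is the data type of the node's result.

`src` holds the parents of this node, that is, the UOps it takes as inputs.

`arg` is an op-specific argument, such as the value of a `CONST`; it is `None` when the op needs none.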
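The four fields can be tried out without tinygrad. The sketch below uses a hypothetical `ToyUOp` dataclass and an `evaluate` helper, both invented here for illustration, to show how `src` links nodes into a tree and how `arg` carries a node's payload.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class ToyUOp:
    op: str           # operation name, e.g. "CONST" or "ADD"
    dtype: str        # data type label, e.g. "float"
    src: tuple = ()   # input nodes (other ToyUOps)
    arg: Any = None   # op-specific payload, e.g. the constant's value

def evaluate(node: ToyUOp) -> float:
    """Walk the tree from the root down to its inputs and compute its value."""
    if node.op == "CONST":
        return node.arg
    if node.op == "ADD":
        return sum(evaluate(s) for s in node.src)
    raise ValueError(f"unsupported op: {node.op}")

one = ToyUOp("CONST", "float", arg=1.0)
add = ToyUOp("ADD", "float", src=(one, one))
print(evaluate(add))  # 2.0
```

Note that `one` appears twice in `add.src`, so the same node is shared rather than copied.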

In [8]:
from tinygrad.renderer.cstyle import MetalRenderer
from tinygrad.uop.ops import UOp, Ops
from tinygrad import dtypes

const = UOp(Ops.CONST, dtypes.float, arg=1.0)
add = UOp(Ops.ADD, dtypes.float, src=(const, const), arg=None)

print(add)

UOp(Ops.ADD, dtypes.float, arg=None, src=(
  x0:=UOp(Ops.CONST, dtypes.float, arg=1.0, src=()),
   x0,))


In [9]:
print(MetalRenderer().render([
    const,
    add
]))

#include <metal_stdlib>
using namespace metal;
kernel void test(uint3 gid [[threadgroup_position_in_grid]], uint3 lid [[thread_position_in_threadgroup]]) {
  float alu0 = (1.0f+1.0f);
}


In [10]:
print(MetalRenderer().render([
  UOp(Ops.SPECIAL, dtypes.int, arg=("gidx0", 16))
]))

#include <metal_stdlib>
using namespace metal;
kernel void test(uint3 gid [[threadgroup_position_in_grid]], uint3 lid [[thread_position_in_threadgroup]]) {
  int gidx0 = gid.x; /* 16 */
}


# Pattern matcher

It expresses the entire computation as a nested tree and then recognizes the parts that can be optimized.
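As a rough illustration of "recognizing a part that can be optimized", the sketch below folds constants in a toy tree. The nested-tuple node format and the `fold_constants` function are assumptions made for this example and are unrelated to tinygrad's own rewrite rules.

```python
def fold_constants(node):
    """Recursively replace an ADD of two constants with a single constant.

    Nodes are tuples: ("CONST", value), ("LOAD", name) or ("ADD", left, right).
    """
    if node[0] != "ADD":
        return node                      # leaves are returned unchanged
    left, right = fold_constants(node[1]), fold_constants(node[2])
    if left[0] == "CONST" and right[0] == "CONST":
        return ("CONST", left[1] + right[1])
    return ("ADD", left, right)

tree = ("ADD", ("LOAD", "a"), ("ADD", ("CONST", 1.0), ("CONST", 2.0)))
print(fold_constants(tree))  # ('ADD', ('LOAD', 'a'), ('CONST', 3.0))
```

Only the inner `ADD` matches the "two constants" shape, so the outer `ADD` with the load is left in place.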

In [13]:
from tinygrad import Tensor, Context

with Context(DEBUG=4, NOOPT=True):
    a = Tensor.empty(4,4)
    b = a + 1

    b.realize()

#include <metal_stdlib>
using namespace metal;
kernel void E_16n1(device float* data0_16, device float* data1_16, uint3 gid [[threadgroup_position_in_grid]], uint3 lid [[thread_position_in_threadgroup]]) {
  int gidx0 = gid.x; /* 16 */
  float val0 = (*(data1_16+gidx0));
  *(data0_16+gidx0) = (val0+1.0f);
}
*** METAL      7 E_16n1                                       arg  2 mem  0.00 GB tm      7.25us/     0.04ms (     0.00 GFLOPS    0.0|0.0     GB/s) ['__add__', 'empty']


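This output is unoptimized, since `NOOPT=True` turns the kernel optimizations off: each of the 16 elements gets its own `gidx0` index, and the `lid` argument is never used. The code is generated by walking the AST, which is the tree of UOps.

The following simplified sketch shows how code is generated from the UOps in the AST.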

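```python
from tinygrad.uop.ops import Ops

# Simplified sketch: each entry pairs an op with a function that renders it.
patterns = [
    (Ops.STORE, lambda uop: "="),
    (Ops.CONST, lambda uop: f" {uop.arg} "),
    (Ops.ADD, lambda uop: " + "),
]

def render_code(uops):
    code = []
    for uop in uops:
        # Render the UOp with the first pattern whose op matches.
        for op, render in patterns:
            if uop.op == op:
                code.append(render(uop))
                break
    return code
```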

Note how each tuple stores a lambda function that generates code dynamically from the specifics of the UOp, such as its `arg`.

The terminology:
The class `PatternMatcher` receives the list of patterns as its init argument.
The first element of each tuple in the list is a `UPat`, which has `op`, `dtype`, `src`, and `arg` fields.

The shape of the UOp tree to match is expressed as a UPat tree.
A UPat is a minimal abstraction of a UOp.
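The terminology can be sketched in plain Python. `ToyPat` and `ToyPatternMatcher` below are hypothetical stand-ins written for this note, not tinygrad's `UPat` and `PatternMatcher`; nodes are assumed to be `(op, arg, src)` tuples. The sketch shows the two ideas at play: a pattern tree that mirrors the node tree, and named captures that are handed to the paired function as keyword arguments.

```python
class ToyPat:
    """Toy pattern: matches a node by op and, optionally, by its sources."""
    def __init__(self, op=None, src=None, name=None):
        self.op, self.src, self.name = op, src, name

    def match(self, node, store):
        # node is a tuple: (op, arg, src_tuple)
        if self.op is not None and node[0] != self.op:
            return False
        if self.src is not None:
            if len(self.src) != len(node[2]):
                return False
            if not all(p.match(s, store) for p, s in zip(self.src, node[2])):
                return False
        if self.name is not None:
            store[self.name] = node   # remember the matched node under its name
        return True


class ToyPatternMatcher:
    def __init__(self, patterns):
        self.patterns = patterns  # list of (ToyPat, function) pairs

    def rewrite(self, node):
        for pat, fn in self.patterns:
            store = {}
            if pat.match(node, store):
                return fn(**store)  # captured nodes become keyword arguments
        return None  # no pattern matched


def const(v):
    return ("CONST", v, ())

pm = ToyPatternMatcher([
    # x + 0 -> x
    (ToyPat("ADD", src=(ToyPat(name="x"), ToyPat("CONST", name="c"))),
     lambda x, c: x if c[1] == 0 else None),
])

load = ("LOAD", "a", ())
print(pm.rewrite(("ADD", None, (load, const(0)))))  # ('LOAD', 'a', ())
print(pm.rewrite(("ADD", None, (load, const(1)))))  # None
```

A pattern with no `op` acts as a wildcard, so any node can be captured as `x`; the function then decides whether the rewrite applies.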

For example:
```python
UOp(Ops.STORE, dtypes.void, arg=None, src=(
    UOp(Ops.DEFINE_GLOBAL, dtypes.float.ptr(), arg=0, src=()),
    ...
    UOp(Ops.ADD, dtypes.float, arg=None, src=(
      ...
    )),
))

# is converted to:

(UPat(Ops.STORE, name="x", src=(UPat.var("define_global"), UPat.var("addition"))),
 lambda x, define_global, addition: ...)
```

Using the UPat tree, the PatternMatcher can match the patterns and return the corresponding lambda function.