# Getting Tensor Comprehensions

```shell
$ conda install -y -c pytorch -c tensorcomp tensor_comprehensions
```
Note: This won't work on a Mac; these cells were run on an Ubuntu server.

In [1]:
import tensor_comprehensions as tc
import torch

In [2]:
lang = """
def matmul(float(M,N) A, float(N,K) B) -> (output) {
  output(i, j) +=! A(i, kk) * B(kk, j)
}
"""

In [3]:
matmul = tc.define(lang, name="matmul")
mat1, mat2 = torch.randn(3, 4).cuda(), torch.randn(4, 5).cuda()
out = matmul(mat1, mat2)



In [4]:
out

Variable containing:
-0.3949  0.8303 -1.5782  1.3349 -0.0063
-2.8526 -0.1465  4.4961 -4.9543  2.0988
-1.7459  0.9724  0.9590  1.3005  4.0354
[torch.cuda.FloatTensor of size 3x5 (GPU 0)]
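
To make the index notation concrete, here is a plain-Python sketch of what the `matmul` comprehension computes. It assumes that `+=!` zero-initializes the output before accumulating, and that `kk` is treated as the reduction index because it appears only on the right-hand side. `matmul_reference` is a hypothetical helper written for illustration; it is not part of TC.

```python
def matmul_reference(A, B):
    """Loop-nest reading of: output(i, j) +=! A(i, kk) * B(kk, j)."""
    M, N = len(A), len(A[0])
    K = len(B[0])
    assert len(B) == N, "inner dimensions must match"
    # The "!" part (assumed): every output element starts at zero.
    output = [[0.0] * K for _ in range(M)]
    for i in range(M):
        for j in range(K):
            # kk only appears on the right-hand side, so it is summed over.
            for kk in range(N):
                output[i][j] += A[i][kk] * B[kk][j]
    return output


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_reference(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

Note that the output shape (M x K) is never written in the TC definition; it follows from the ranges of `i` and `j`.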

## PyTorch layers in Tensor Comprehensions 

### Use of mapping option

Default mapping: TC provides several default mapping options; choose the one that most closely matches your kernel. The defaults are:

- `pointwise`: if the kernel resembles a pointwise operation
- `mlp`: if the kernel resembles a linear layer operation
- `conv`: if the kernel resembles a convolution operation
- `group_conv`: if the kernel resembles a group convolution operation
- `naive`: if none of the above applies, choose the naive default. This is the fallback used when no options are passed, which is why the call above produces a warning.

In [5]:
# Specifying mapping options
matmul = tc.define(lang, name="matmul")
mat1, mat2 = torch.randn(100, 400).cuda(), torch.randn(400, 500).cuda()
out2 = matmul(mat1, mat2, options=tc.Options("mlp"))

In [6]:
out2

Variable containing:
-9.3951e+00 -1.8997e+01 -1.7598e+01  ...  -5.4261e+01 -1.5804e+01 -2.2281e+00
-1.7843e+01  9.8855e+00  4.3533e+01  ...   9.5547e+00  1.7750e+01  6.3619e-01
-1.5394e+01  1.0743e+01 -2.8723e+01  ...  -2.8901e+00 -4.9826e+00 -3.4955e+01
                ...                   ⋱                   ...                
 7.9932e+00 -2.8686e+01  8.3991e+00  ...  -2.2157e+00  3.9128e+01 -2.7361e+01
 1.6213e+01  4.1226e+00  3.3095e+01  ...   3.0941e+01  2.5200e+00 -1.5295e+01
-6.6816e+00 -3.2069e+01  3.0220e+01  ...  -6.3537e+00 -2.4188e-01  9.4838e+00
[torch.cuda.FloatTensor of size 100x500 (GPU 0)]

In [7]:
# Using reduction operators
# providing different input sizes for the same comprehension

matmul = tc.define(lang, name="matmul")
mat1, mat2 = torch.randn(3, 4).cuda(), torch.randn(4, 5).cuda()
out = matmul(mat1, mat2)

# different input sizes
mat3, mat4 = torch.randn(100, 400).cuda(), torch.randn(400, 500).cuda()
out2 = matmul(mat3, mat4)
print(out)
print(out2)

Variable containing:
 1.1297  1.2930  0.2438  0.2914  1.3435
-2.0003 -0.0755 -1.7224 -0.0140 -0.8154
-1.9083  0.7932 -1.7260  0.6538 -2.2691
[torch.cuda.FloatTensor of size 3x5 (GPU 0)]

Variable containing:
-28.1100 -11.6430  -7.6212  ...    3.3825  18.4664  -8.4693
 16.3596  21.3456   2.9245  ...   26.5358  -6.3813   2.3102
 31.8404 -10.3774  -5.6214  ...  -36.0098  -7.2178  21.7369
           ...               ⋱              ...            
  8.2418   3.6604  -8.9530  ...   14.3242   3.1555  -3.3104
 10.2438 -28.3403  24.1943  ...   -5.6581   3.6120  22.8566
 -2.3919 -24.7140  -8.9158  ...  -18.7468  -0.5242   4.8057
[torch.cuda.FloatTensor of size 100x500 (GPU 0)]



#### Multiple TC definitions

Suppose you want to define all of your TCs in one string and later use that string to run the different operations it defines. You can do so easily: define a `lang` variable that holds the TC definitions for all your operations. Each time you want to run a different operation, make a `tc.define` call on the `lang` variable, specify the `name` of the operation definition, and get the TC layer for it. Below is an example:

In [8]:
lang = """
def matmul(float(M,N) A, float(N,K) B) -> (output) {
  output(i, j) +=! A(i, kk) * B(kk, j)
}
def abs(float(M, N) A) -> (O1) {
  O1(m, n) = fabs(A(m, n))
}
"""
matmul = tc.define(lang, name="matmul")
mat1, mat2 = torch.randn(3, 4).cuda(), torch.randn(4, 5).cuda()
out = matmul(mat1, mat2)

abs = tc.define(lang, name="abs")
A = torch.randn(3, 4).cuda()
out = abs(A)



In [9]:
out

Variable containing:
 0.1564  0.6053  2.4715  1.3225
 1.4164  1.2355  0.4340  0.1880
 2.4908  1.8639  0.1854  0.0301
[torch.cuda.FloatTensor of size 3x4 (GPU 0)]
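
When a single string holds several definitions, it can help to check which values are valid for `name=`. The sketch below pulls the definition names out of the string with a regular expression. `list_tc_names` is a hypothetical helper, not a TC function, and it assumes every definition starts with `def` at the beginning of a line.

```python
import re


def list_tc_names(lang):
    """Return the names of all `def name(...)` blocks in a TC string."""
    return re.findall(r"^\s*def\s+(\w+)\s*\(", lang, flags=re.MULTILINE)


lang = """
def matmul(float(M,N) A, float(N,K) B) -> (output) {
  output(i, j) +=! A(i, kk) * B(kk, j)
}
def abs(float(M, N) A) -> (O1) {
  O1(m, n) = fabs(A(m, n))
}
"""
print(list_tc_names(lang))  # ['matmul', 'abs']
```

One caution about the cell above: binding the layer to a Python variable called `abs` shadows the built-in `abs` for the rest of the session, so a name such as `abs_layer` may be safer.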

#### Writing layers with Scalars

- **Option 1**: Pass a constants dictionary to the `tc.define` call, as shown below

In [10]:
lang = """
def avgpool(float(B, C, H, W) input) -> (output) {{
    output(b, c, h, w) += input(b, c, h * {sH} + kh, w * {sW} + kw) where kh in 0:{kH}, kw in 0:{kW}
}}
"""
avgpool = tc.define(lang, name="avgpool", constants={"sH":1, "sW":1, "kH":2, "kW":2})
inp = torch.ones(32, 3, 10, 10).cuda()
out4 = avgpool(inp, options=tc.Options("mlp"))

In [11]:
out4

Variable containing:
(0 ,0 ,.,.) = 
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
     ...       ⋱       ...    
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4

(0 ,1 ,.,.) = 
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
     ...       ⋱       ...    
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4

(0 ,2 ,.,.) = 
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
     ...       ⋱       ...    
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
     ⋮ 

(1 ,0 ,.,.) = 
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
     ...       ⋱       ...    
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4

(1 ,1 ,.,.) = 
   4   4   4  ...    4  
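
The doubled braces `{{` and `}}` in the Option 1 string suggest that the constants are substituted with Python string formatting. The sketch below assumes that behaviour (it is a guess about what `tc.define` does with `constants`, not a confirmed description) and shows the TC text that would result.

```python
lang = """
def avgpool(float(B, C, H, W) input) -> (output) {{
    output(b, c, h, w) += input(b, c, h * {sH} + kh, w * {sW} + kw) where kh in 0:{kH}, kw in 0:{kW}
}}
"""
constants = {"sH": 1, "sW": 1, "kH": 2, "kW": 2}

# Assumption: the substitution behaves like str.format, so {name} is
# replaced by its value and {{ / }} collapse to literal braces.
expanded = lang.format(**constants)
print(expanded)
```

Under that assumption, a definition that contains no placeholders but is still passed through `constants` would also need its braces doubled.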

- **Option 2**: Format the string using Python regex

In [12]:
import re
LANG="""
def avgpool(float(B, C, H, W) input) -> (output) {
    output(b, c, h, w) += input(b, c, h * <sh> + kh, w * <sw> + kw) where kh in 0:<kH>, kw in 0:<kW>
}
"""
sH, sW, kH, kW = 1, 1, 2, 2
LANG = re.sub('<sh>', str(sH), LANG)
LANG = re.sub('<sw>', str(sW), LANG)
LANG = re.sub('<kH>', str(kH), LANG)
LANG = re.sub('<kW>', str(kW), LANG)
avgpool = tc.define(LANG, name="avgpool")
inp = torch.ones(1, 1, 4, 4).cuda()
out5 = avgpool(inp)



### Manually injecting external CUDA code

If you have efficient external CUDA code that you want to use instead of the CUDA code that TC generates, you can inject it easily. Create a string that contains the CUDA code you want to inject, then pass the name of the kernel and the CUDA code string to the `tc.define` call.

As an example:

In [13]:
lang = """
def add(float(N) A, float(N) B) -> (output) {
    output(i) = A(i) + B(i) + 1
}
"""

cuda_code = """
extern "C"{
__global__ void my_add(float* __restrict__ output, const float* __restrict__ A, const float* __restrict__ B)
{
    int t = threadIdx.x;
    output[t] = A[t] + B[t];
}
}
"""

add = tc.define(lang, name="add", inject_kernel="my_add", cuda_code=cuda_code)
a, b = torch.randn(100).cuda(), torch.randn(100).cuda()
out6 = add(a, b, grid=[1, 1, 1], block=[100, 1, 1])



In [14]:
out6

Variable containing:
-0.2441
-1.2044
-0.1484
 1.1395
 0.2219
-0.7885
 0.3211
 1.0028
-0.7377
-0.7845
 2.4702
 3.2829
 0.7614
 0.1627
 1.4155
-1.0388
 1.3907
 1.8067
-0.3192
 1.8688
 0.0137
 0.5402
 0.4248
 0.3889
-1.7622
-0.3826
 1.3375
 2.4420
-2.5455
-0.5168
 0.0186
-0.4861
-0.8110
-1.5189
-0.1110
 2.0514
-0.4572
 0.5803
-0.9403
 0.3659
-2.4906
 0.7857
 0.0200
-2.3367
 1.1161
-1.9790
 1.1353
 0.6458
 0.5414
 0.8869
 2.6098
 1.2731
 1.7311
-1.2708
 3.0092
-3.0787
 0.8363
-0.4576
-1.5630
 0.4198
-0.8549
 1.6064
 0.7856
-0.0349
-1.0281
-0.0128
 0.0209
-2.3107
 2.0514
 1.4023
 0.4369
-0.9508
 1.4871
 0.1565
-0.2754
 0.2455
-0.5572
 1.8748
 1.5466
-1.5910
 0.0043
-2.0473
-1.4289
 1.4102
-0.0120
 0.5479
 1.0860
 0.5362
 0.5234
 0.3940
 0.1579
 2.1710
 0.6930
 2.3230
-2.3188
 0.3688
-1.5531
 0.0285
 1.4020
 0.6835
[torch.cuda.FloatTensor of size 100 (GPU 0)]
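
In the example above the TC definition adds 1 while the injected kernel does not, so the result reveals which code path ran. The plain-Python sketch below emulates both on the CPU. `my_add_emulated` and `tc_add_reference` are hypothetical stand-ins, and the emulation assumes one block whose `threadIdx.x` covers every element, as in `grid=[1, 1, 1], block=[100, 1, 1]`.

```python
def my_add_emulated(A, B, block_x):
    """CPU stand-in for my_add launched with one block of block_x threads."""
    output = [0.0] * len(A)
    for t in range(block_x):      # t plays the role of threadIdx.x
        output[t] = A[t] + B[t]   # same body as the injected kernel
    return output


def tc_add_reference(A, B):
    """What the TC definition itself describes: A(i) + B(i) + 1."""
    return [a + b + 1 for a, b in zip(A, B)]


A = [float(i) for i in range(100)]
B = [2.0] * 100
injected = my_add_emulated(A, B, block_x=100)
declared = tc_add_reference(A, B)
print(injected[:3], declared[:3])  # [2.0, 3.0, 4.0] [3.0, 4.0, 5.0]
```

Because the kernel indexes only by `threadIdx.x`, the block size has to match the number of elements; a smaller block would leave the tail of `output` unwritten.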

### Built-in Functions

TC also lets you use some CUDA built-in functions in a TC definition. At execution time, the corresponding CUDA function is called. For example, suppose we want to use the CUDA `fmax` function in our TC language:

In [15]:
LANG = """
def relu(float(B,M) I) -> (O1){
  O1(b, m) = fmax(I(b, m), 0)
}
"""
relu = tc.define(LANG, name="relu")
inp = torch.randn(100, 128).cuda()
out7 = relu(inp, options=tc.Options("mlp"))

In [16]:
out7

Variable containing:
 0.4639  0.6699  0.0000  ...   0.0000  1.5628  0.0000
 0.9489  0.0000  1.1602  ...   3.3303  0.0000  0.2466
 0.5919  1.5336  0.0000  ...   0.0000  0.0000  0.0000
          ...             ⋱             ...          
 1.0246  1.0487  0.0000  ...   1.5216  0.0021  0.4519
 0.0000  0.0000  1.5521  ...   0.4257  0.0000  0.0000
 0.0000  0.0727  0.0000  ...   0.0000  0.0000  0.3307
[torch.cuda.FloatTensor of size 100x128 (GPU 0)]

For more CUDA math functions, see the [CUDA Math API documentation](http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__SINGLE.html#group__CUDA__MATH__SINGLE).

## Machine Learning Layers Database

There is a database of about 30 machine learning layers that are used across various types of neural networks. 

If you want to use one of the layers in the database, you can query it from your code. The database is available as `tc.database`, a dictionary that maps each TC name to its TC definition. Each entry looks like `{tc_name: {"lang": language, "grad": grad_language}}`, where `tc_name` is the name of the operation, `lang` is the TC language describing that operation, and `grad` is the TC language describing the gradient of that operation. The `grad` entry is optional.
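
As a rough sketch of how such a dictionary can be queried, the snippet below uses a small stand-in with the shape described above. The entries and their placeholder strings are made up for illustration, and `lookup` is a hypothetical helper; with TC installed you would read from `tc.database` instead.

```python
# Stand-in with the documented shape; the contents are placeholders.
database = {
    "matmul": {
        "lang": "<TC source for matmul>",
        "grad": "<TC source for the matmul gradient>",
    },
    "relu": {
        "lang": "<TC source for relu>",
        # No "grad" key here: the gradient entry is optional.
    },
}


def lookup(db, tc_name):
    """Return (lang, grad) for tc_name; grad is None when absent."""
    entry = db[tc_name]
    return entry["lang"], entry.get("grad")


print(sorted(database))            # ['matmul', 'relu']
print(lookup(database, "relu"))    # ('<TC source for relu>', None)
```

Using `.get("grad")` rather than `["grad"]` keeps the lookup from raising `KeyError` for layers that ship without a gradient definition.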

We already have computed the matmul for mat1 and mat2 

### Pooling Layers 

#### Average Pooling

In [17]:
LANG = """
        def avgpool(float(B, C, H, W) input) -> (output) {{
            output(b, c, h, w) +=! input(b, c, h * {sH} + kh, w * {sW} + kw) / ({kH} * {kW}) where kh in 0:{kH}, kw in 0:{kW}
        }}
        """

#### Max Pooling

In [18]:
LANG = """
def maxpool(float(B, C, H, W) input) -> (output) {{
    output(b, c, h, w) max= input(b, c, h * {sH} + kh, w * {sW} + kw) where kh in 0:{kH}, kw in 0:{kW}
}}
"""

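
The two pooling definitions differ only in how each window is reduced. The plain-Python sketch below applies both reductions to a single H x W plane. `pool2d_reference` is a hypothetical helper, and it assumes the output extent is the largest range for which `h * sH + kh` and `w * sW + kw` stay inside the input.

```python
def pool2d_reference(plane, kH, kW, sH, sW, reduce_window):
    """Slide a kH x kW window with strides (sH, sW) over one 2-D plane."""
    H, W = len(plane), len(plane[0])
    out_h = (H - kH) // sH + 1   # assumed output extent
    out_w = (W - kW) // sW + 1
    out = []
    for h in range(out_h):
        row = []
        for w in range(out_w):
            window = [plane[h * sH + kh][w * sW + kw]
                      for kh in range(kH) for kw in range(kW)]
            row.append(reduce_window(window))
        out.append(row)
    return out


plane = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
avg = pool2d_reference(plane, 2, 2, 1, 1, lambda win: sum(win) / len(win))
mx = pool2d_reference(plane, 2, 2, 1, 1, max)
print(avg)  # [[3.0, 4.0], [6.0, 7.0]]
print(mx)   # [[5.0, 6.0], [8.0, 9.0]]
```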

### Convolutional Layers 

#### Simple Convolution

In [20]:
LANG = """
def convolution(float(N, C, H, W) I, float(M, C, KH, KW) W1, float(M) B) -> (O) {
    O(n, m, h, w) +=! I(n, c, h + kh, w + kw) * W1(m, c, kh, kw)
    O(n, m, h, w) = O(n, m, h, w) + B(m)
}
"""

#### Strided Convolution

In [21]:
LANG = """
def convolution_strided(float(N, C, H, W) I, float(M, C, KH, KW) W1, float(M) B) -> (O) {{
    O(n, m, h, w) +=! I(n, c, {sh} * h + kh, {sw} * w + kw) * W1(m, c, kh, kw)
    O(n, m, h, w) = O(n, m, h, w) + B(m)
}}
"""

#### Strided Conv Gradient

In [23]:
lang = """
def convolution_grad(float(N, C, H, W) I, float(M, C, KH, KW) W1, float(N, M, H, W) O_grad) -> (I_grad, W1_grad) {{
    I_grad(n, c, h, w) +=! O_grad(n, m, {sh} * h - kh, {sw} * w - kw) * W1(m, c, kh, kw)
    W1_grad(m, c, kh, kw) +=! O_grad(n, m, {sh} * h - kh, {sw} * w - kw) * I(n, c, h, w)
}}
"""

#### Simple Group Convolution

In [25]:
lang = """

def group_convolution(float(N, G, C, H, W) I, float(G, F, C, KH, KW) W1, float(G, F) B) -> (O) {
    O(n, g, f, h, w) +=! I(n, g, c, h + kh, w + kw) * W1(g, f, c, kh, kw)
    O(n, g, f, h, w) = O(n, g, f, h, w) + B(g, f)
}
"""

#### Strided Group Convolution

In [26]:
lang = """
def group_convolution_strided(float(N, G, C, H, W) I, float(G, F, C, KH, KW) W1, float(G, F) B) -> (O) {{
    O(n, g, f, h, w) +=! I(n, g, c, {sh} * h + kh, {sw} * w + kw) * W1(g, f, c, kh, kw)
    O(n, g, f, h, w) = O(n, g, f, h, w) + B(g, f)
}}
"""