# Getting Tensor Comprehensions

```shell
$ conda install -y -c pytorch -c tensorcomp tensor_comprehensions
```
Note: This won't work on your Mac; I ran it on my Ubuntu server.

In [1]:
import tensor_comprehensions as tc
import torch

In [2]:
lang = """
def matmul(float(M,N) A, float(N,K) B) -> (output) {
  output(i, j) +=! A(i, kk) * B(kk, j)
}
"""

In [3]:
matmul = tc.define(lang, name="matmul")
mat1, mat2 = torch.randn(3, 4).cuda(), torch.randn(4, 5).cuda()
out = matmul(mat1, mat2)



In [4]:
out

Variable containing:
 0.7536  2.7156 -2.3936 -0.1357 -0.5720
-1.7986  2.9340 -0.0688  0.3831  1.3774
-0.6450 -0.8190  0.2687  1.2483 -1.0357
[torch.cuda.FloatTensor of size 3x5 (GPU 0)]
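
To make the notation concrete, here is a plain-Python sketch of what the `matmul` comprehension computes. It is only a reference for the semantics (TC generates a CUDA kernel rather than running loops like these), and `matmul_reference` is a hypothetical name introduced for illustration. The index `kk` appears only on the right-hand side, so it is reduced over, and `+=!` zero-initializes the output before accumulating.

```python
# Plain-Python reference for the `matmul` comprehension (illustrative only;
# `matmul_reference` is a hypothetical helper, not part of TC).
def matmul_reference(A, B):
    M, N = len(A), len(A[0])
    K = len(B[0])
    # `+=!` initializes every output element to 0 before accumulating.
    output = [[0.0] * K for _ in range(M)]
    for i in range(M):
        for j in range(K):
            # `kk` appears only on the right-hand side, so it is summed over.
            for kk in range(N):
                output[i][j] += A[i][kk] * B[kk][j]
    return output


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_reference(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

A reference like this is handy for checking a TC layer's result on small CPU inputs before moving to larger tensors.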

## PyTorch layers in Tensor Comprehensions 

### Use of mapping option

Default mapping: We provide several default mapping options; choose the one that most closely matches your kernel. The defaults provided are:

- `pointwise`: if the kernel resembles a pointwise operation
- `mlp`: if the kernel resembles a Linear layer operation
- `conv`: if the kernel resembles a convolution operation
- `group_conv`: if the kernel resembles a group convolution operation
- `naive`: if none of the above applies, choose the naive default (this is why we get the warning)

In [5]:
# Specifying mapping options
matmul = tc.define(lang, name="matmul")
mat1, mat2 = torch.randn(100, 400).cuda(), torch.randn(400, 500).cuda()
out2 = matmul(mat1, mat2, options=tc.Options("mlp"))

In [6]:
out2

Variable containing:
-2.8489e+01  2.8404e+01  2.7415e+01  ...  -1.2269e+00 -3.6730e+01 -9.9155e+00
-1.7115e+01 -1.1121e+01 -3.0050e+01  ...  -6.7856e-01  3.8747e+00 -4.0622e+01
-2.5548e+01 -1.4382e+01 -8.0181e+00  ...   1.3411e+01  3.2021e+00  8.1902e-01
                ...                   ⋱                   ...                
 2.2978e+01 -2.4670e+01 -2.3970e+01  ...   1.5433e+00 -2.5969e+01 -2.9356e+00
-5.9101e+00 -1.6884e+01  2.4006e+01  ...  -5.9949e+00 -2.3709e+01  1.0794e+01
 1.4939e+01 -1.3193e+01 -5.4434e+01  ...   1.8283e+01 -3.1528e+01 -1.4516e+00
[torch.cuda.FloatTensor of size 100x500 (GPU 0)]

In [7]:
# Using reduction operators
# providing different input sizes for the same comprehension

matmul = tc.define(lang, name="matmul")
mat1, mat2 = torch.randn(3, 4).cuda(), torch.randn(4, 5).cuda()
out = matmul(mat1, mat2)

# different input sizes
mat3, mat4 = torch.randn(100, 400).cuda(), torch.randn(400, 500).cuda()
out2 = matmul(mat3, mat4)
print(out)
print(out2)

Variable containing:
 0.7346  0.0796  0.7190  0.4359  0.0805
 0.9850 -0.2511  0.8243  0.4254 -0.5436
-0.6071 -1.8995 -2.1216 -1.7089  2.4164
[torch.cuda.FloatTensor of size 3x5 (GPU 0)]

Variable containing:
  7.0490  14.0783 -31.0707  ...   -4.0171 -33.0912 -11.7724
 -2.8942  25.8813 -32.5398  ...   13.7943  22.6707 -10.7807
 28.8399  14.7229 -10.8210  ...  -25.2078  -4.4914  -9.6010
           ...               ⋱              ...            
-21.9716   3.9036  19.7386  ...   -8.0826   8.1482  -4.7208
 23.0311  31.1360   5.9420  ...  -27.8790 -14.3433  -4.7663
-34.4321  20.4107 -10.2030  ...   -3.9045  17.4901  29.1404
[torch.cuda.FloatTensor of size 100x500 (GPU 0)]



#### Multiple TC definitions

Let’s say you want to define all of your TCs in one string and later use that string to run the different operations it defines. You can do so easily. Define a `lang` variable that holds the TC definitions for all your operations. Each time you want to run a different operation, call `tc.define` on the `lang` variable, specify the `name` of the operation's definition, and get the TC layer for it. Below is an example of how to do this:

In [8]:
lang = """
def matmul(float(M,N) A, float(N,K) B) -> (output) {
  output(i, j) +=! A(i, kk) * B(kk, j)
}
def abs(float(M, N) A) -> (O1) {
  O1(m, n) = fabs(A(m, n))
}
"""
matmul = tc.define(lang, name="matmul")
mat1, mat2 = torch.randn(3, 4).cuda(), torch.randn(4, 5).cuda()
out = matmul(mat1, mat2)

abs = tc.define(lang, name="abs")
A = torch.randn(3, 4).cuda()
out = abs(A)



In [9]:
out

Variable containing:
 1.6016  0.0163  1.9639  0.7835
 0.3498  0.3417  0.8138  0.2829
 0.4028  0.3834  0.9464  0.6010
[torch.cuda.FloatTensor of size 3x4 (GPU 0)]
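
When one string holds several definitions, it can help to see which names are available to pass as `name`. The sketch below uses a regular expression to list them; `list_tc_names` is a hypothetical helper written for this illustration and is not part of the `tensor_comprehensions` package, and it assumes each definition starts with `def` at the beginning of a line.

```python
import re

lang = """
def matmul(float(M,N) A, float(N,K) B) -> (output) {
  output(i, j) +=! A(i, kk) * B(kk, j)
}
def abs(float(M, N) A) -> (O1) {
  O1(m, n) = fabs(A(m, n))
}
"""


def list_tc_names(tc_source):
    # Hypothetical helper (not a TC API): collect the identifier that
    # follows each `def` found at the start of a line.
    return re.findall(r"^\s*def\s+(\w+)\s*\(", tc_source, flags=re.MULTILINE)


print(list_tc_names(lang))  # ['matmul', 'abs']
```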

#### Writing layers with Scalars

- **Option 1**: Pass a `constants` dictionary to the `tc.define` call, as demonstrated below:

In [13]:
lang = """
def avgpool(float(B, C, H, W) input) -> (output) {{
    output(b, c, h, w) += input(b, c, h * {sH} + kh, w * {sW} + kw) where kh in 0:{kH}, kw in 0:{kW}
}}
"""
avgpool = tc.define(lang, name="avgpool", constants={"sH":1, "sW":1, "kH":2, "kW":2})
inp = torch.ones(32, 3, 10, 10).cuda()
out4 = avgpool(inp, options=tc.Options("mlp"))

In [11]:
out4

Variable containing:
(0 ,0 ,.,.) = 
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
     ...       ⋱       ...    
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4

(0 ,1 ,.,.) = 
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
     ...       ⋱       ...    
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4

(0 ,2 ,.,.) = 
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
     ...       ⋱       ...    
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
     ⋮ 

(1 ,0 ,.,.) = 
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
     ...       ⋱       ...    
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4
   4   4   4  ...    4   4   4

(1 ,1 ,.,.) = 
   4   4   4  ...    4  
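
The doubled braces `{{` and `}}` in the Option 1 string suggest that the constants are filled in with Python-style string formatting. That is an assumption about how `tc.define` handles `constants`, not something shown above, but the following sketch illustrates the same substitution with plain `str.format`: placeholders such as `{sH}` are replaced, while `{{` and `}}` collapse to the literal braces that the TC language needs.

```python
# Sketch of the substitution, assuming Python str.format semantics.
template = """
def avgpool(float(B, C, H, W) input) -> (output) {{
    output(b, c, h, w) += input(b, c, h * {sH} + kh, w * {sW} + kw) where kh in 0:{kH}, kw in 0:{kW}
}}
"""
constants = {"sH": 1, "sW": 1, "kH": 2, "kW": 2}

# `{name}` fields take their values from the dictionary;
# `{{` and `}}` become single literal braces.
expanded = template.format(**constants)
print(expanded)
```

Printing the expanded string before passing it on is a cheap way to confirm that every placeholder was replaced.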

- **Option 2**: Format the string using Python regex

In [12]:
import re
LANG="""
def avgpool(float(B, C, H, W) input) -> (output) {
    output(b, c, h, w) += input(b, c, h * <sh> + kh, w * <sw> + kw) where kh in 0:<kH>, kw in 0:<kW>
}
"""
sH, sW, kH, kW = 1, 1, 2, 2
LANG = re.sub('<sh>', str(sH), LANG)
LANG = re.sub('<sw>', str(sW), LANG)
LANG = re.sub('<kH>', str(kH), LANG)
LANG = re.sub('<kW>', str(kW), LANG)
avgpool = tc.define(LANG, name="avgpool")
inp = torch.ones(1, 1, 4, 4).cuda()
out5 = avgpool(inp)
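
As a cross-check for the pooling comprehension, here is a plain-Python reference for the same index expression. Note that the comprehension as written accumulates a sum over the window rather than dividing by the window size, which is why a 2x2 window over ones yields 4. The function name `sum_pool_reference` is hypothetical, and two points are assumptions: the output extent is taken to be the largest range that keeps every read in bounds, and the output is taken to start at zero.

```python
def sum_pool_reference(inp, sH, sW, kH, kW):
    # `inp` is a nested list with shape (B, C, H, W).
    B, C = len(inp), len(inp[0])
    H, W = len(inp[0][0]), len(inp[0][0][0])
    # Assumed output extent: largest range that keeps every read in bounds.
    out_h = (H - kH) // sH + 1
    out_w = (W - kW) // sW + 1
    # Assumed zero-initialized output, then accumulate over the window.
    out = [[[[0.0] * out_w for _ in range(out_h)] for _ in range(C)]
           for _ in range(B)]
    for b in range(B):
        for c in range(C):
            for h in range(out_h):
                for w in range(out_w):
                    for kh in range(kH):
                        for kw in range(kW):
                            out[b][c][h][w] += inp[b][c][h * sH + kh][w * sW + kw]
    return out


inp = [[[[1.0] * 4 for _ in range(4)]]]  # shape (1, 1, 4, 4), all ones
out = sum_pool_reference(inp, sH=1, sW=1, kH=2, kW=2)
print(out[0][0])  # [[4.0, 4.0, 4.0], [4.0, 4.0, 4.0], [4.0, 4.0, 4.0]]
```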



### Manually injecting external CUDA code

If you have efficient external CUDA code that you want to use instead of the CUDA code that TC generates, you can inject it easily. Create a string containing the CUDA code you want to inject, then pass the name of the kernel and the CUDA code string to the `tc.define` call.

As an example:

In [14]:
lang = """
def add(float(N) A, float(N) B) -> (output) {
    output(i) = A(i) + B(i) + 1
}
"""

cuda_code = """
extern "C"{
__global__ void my_add(float* __restrict__ output, const float* __restrict__ A, const float* __restrict__ B)
{
    int t = threadIdx.x;
    output[t] = A[t] + B[t];
}
}
"""

add = tc.define(lang, name="add", inject_kernel="my_add", cuda_code=cuda_code)
a, b = torch.randn(100).cuda(), torch.randn(100).cuda()
out6 = add(a, b, grid=[1, 1, 1], block=[100, 1, 1])



In [15]:
out6

Variable containing:
 3.6025
 1.2179
-0.7659
-1.0284
-0.8866
 0.8470
 1.1899
 0.5136
-0.5460
-1.5441
-0.3539
-2.4137
-0.4631
-0.0619
-1.6390
-2.0870
 0.8110
-2.0814
-0.0284
-2.3822
 1.4707
 3.8718
-0.7765
-3.1565
-1.9431
 0.5884
-2.8776
-1.8268
-0.1874
-0.9072
-0.1957
-0.3769
 1.9516
 1.2142
 0.8183
-0.5128
 1.2283
-0.1149
-1.0397
 1.7810
-1.0703
-0.1569
 1.1433
 0.9462
-0.9802
-3.1964
-3.1568
 0.0119
-0.2752
-2.6597
-0.4386
-0.4265
 0.3517
-2.6374
 1.8531
 0.1807
 0.7856
-1.4368
 1.1380
-0.2945
-1.3483
-1.9964
-0.3021
 1.3201
-1.2708
-0.0591
 0.7631
-2.1894
-0.3149
 2.1207
 0.3688
 3.3965
 0.9015
-0.3889
-1.1659
-2.7698
 0.3136
 0.0787
-0.4413
-1.5912
 1.1120
-1.2563
-0.1105
 1.5039
 1.1011
 2.0790
 0.3309
 2.8792
-1.4424
-3.8384
 0.0753
-1.0948
 0.7200
-1.3414
 1.2493
-3.1341
 2.5363
-0.9303
-0.0479
 0.7473
[torch.cuda.FloatTensor of size 100 (GPU 0)]
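
To see what the injected kernel does with the `grid` and `block` values used above, here is a plain-Python emulation. It is a sketch only: `my_add_emulated` is a hypothetical stand-in in which each loop iteration plays the role of one CUDA thread. It highlights two things visible in the example: the injected `my_add` computes `A + B` without the `+ 1` from the TC definition, and because the kernel indexes only by `threadIdx.x` in a single block, the block size has to cover all `N` elements.

```python
def my_add_emulated(A, B, block_x):
    # One loop iteration stands in for one CUDA thread with threadIdx.x == t.
    output = [0.0] * len(A)
    for t in range(block_x):
        # Same body as the injected kernel: no `+ 1` term.
        output[t] = A[t] + B[t]
    return output


a = [float(i) for i in range(100)]
b = [1.0] * 100

out = my_add_emulated(a, b, block_x=100)
print(out[:3])  # [1.0, 2.0, 3.0]

# With fewer threads than elements, the tail is never written. Here it keeps
# its initial 0.0; on the GPU those entries would simply be left unwritten.
partial = my_add_emulated(a, b, block_x=50)
print(partial[49], partial[50])  # 50.0 0.0
```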