<a href="https://colab.research.google.com/github/DataLama/triton-tutorials/blob/main/tutorials/basic/2_vector_add.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip show torch
!pip show triton

Name: torch
Version: 2.2.1+cu121
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: fastai, torchaudio, torchdata, torchtext, torchvision
Name: triton
Version: 2.2.0
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/openai/triton/
Author: Philippe Tillet
Author-email: phil@openai.com
License: 
Location: /usr/local/lib/python3.10/dist-packages
Requires: filelock
Required-by: torch


---

### `tl.store`

Store a tensor of data into the memory locations defined by `pointer`:

(1) `pointer` could be a single-element pointer, in which case a scalar is stored:

- `mask` must be scalar too
- `boundary_check` and `padding_option` must be empty

(2) `pointer` could be an element-wise tensor of pointers, in which case:

- `mask` is implicitly broadcast to `pointer.shape`
- `boundary_check` must be empty

(3) or `pointer` could be a block pointer defined by `make_block_ptr`, in which case:

- `mask` must be None
- `boundary_check` can be specified to control the behavior of out-of-bound access

`value` is implicitly broadcast to `pointer.shape` and typecast to `pointer.dtype.element_ty`.
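The masked, element-wise case (2) can be illustrated on the CPU with plain Python lists. This is only a rough emulation of the semantics described above, not how Triton implements `tl.store`; the helper name `masked_store` and the list-based "memory" are hypothetical stand-ins chosen for illustration.

```python
from typing import List, Optional, Sequence, Union


def masked_store(
    memory: List[float],
    offsets: Sequence[int],
    values: Union[float, Sequence[float]],
    mask: Optional[Union[bool, Sequence[bool]]] = None,
) -> None:
    """Hypothetical CPU emulation: memory[offsets[i]] = values[i] wherever mask[i] is True."""
    n = len(offsets)
    # A scalar value or mask is broadcast to the number of offsets,
    # mirroring how `value` and `mask` are broadcast to `pointer.shape`.
    if not isinstance(values, (list, tuple)):
        values = [values] * n
    if mask is None:
        mask = [True] * n
    elif isinstance(mask, bool):
        mask = [mask] * n
    for off, val, keep in zip(offsets, values, mask):
        if keep:  # masked-off lanes never touch memory
            memory[off] = val


memory = [0.0] * 6           # stand-in for a 6-element buffer in DRAM
offsets = list(range(4, 8))  # a block of 4 lanes that runs past the end of the buffer
mask = [off < len(memory) for off in offsets]
masked_store(memory, offsets, [1.0, 2.0, 3.0, 4.0], mask)
print(memory)  # [0.0, 0.0, 0.0, 0.0, 1.0, 2.0]
```

Without the mask, the last two lanes would write past the end of the buffer, which is the situation a bounds mask is meant to prevent.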

In [5]:
%%writefile main.py
from typing import Dict
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(
    # Python passes torch.Tensor objects; inside the kernel each one arrives as a pointer to its first element.
    x_ptr: torch.Tensor,
    y_ptr: torch.Tensor,
    z_ptr: torch.Tensor,
    size: int,
    block_size: tl.constexpr,
  ):
  # define offsets
  pid = tl.program_id(0) # ID of this program along axis 0 of the (up to 3D) launch grid.
  offsets = tl.arange(0, block_size) + pid * block_size # block_size consecutive offsets, starting at pid * block_size.
  mask = offsets < size # mask out offsets that fall past the end of the tensor.

  # load tensor from DRAM
  x = tl.load(x_ptr + offsets, mask)
  y = tl.load(y_ptr + offsets, mask)

  z = x + y # add on gpu

  # export tensor to DRAM.
  tl.store(z_ptr + offsets, z, mask)

def add(x: torch.Tensor, y: torch.Tensor):
  z = torch.empty_like(x, device='cuda')
  size = z.numel() # number of elements in the tensor.

  def grid(meta: Dict):
    # meta holds the kwargs that the triton kernel receives as input ...
    return (triton.cdiv(size, meta["block_size"]),)

  # pass the constexpr by keyword so it is visible as meta["block_size"] in grid
  add_kernel[grid](x, y, z, size, block_size=2**10)

  return z

if __name__ == "__main__":
  size = 2 ** 16
  x = torch.randn(size, device="cuda")
  y = torch.randn(size, device="cuda")

  a = add(x, y) # triton
  b = x + y # torch

  print((b - a).sum()) # reduce on the GPU; the built-in sum() would loop over elements in Python
  assert torch.allclose(a, b, atol=1e-2)


Overwriting main.py
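To make the indexing in `add_kernel` concrete, here is a rough pure-Python emulation of what each program instance does. It is a sketch for building intuition only: the real programs run in parallel on the GPU, and the names `cdiv` and `emulated_add` are hypothetical helpers written for this illustration (the local `cdiv` uses the same ceiling-division arithmetic as `triton.cdiv`).

```python
from typing import List


def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same arithmetic as triton.cdiv."""
    return (a + b - 1) // b


def emulated_add(x: List[float], y: List[float], block_size: int) -> List[float]:
    """Hypothetical CPU emulation of the block-wise vector add."""
    size = len(x)
    z = [0.0] * size
    num_programs = cdiv(size, block_size)       # the value grid(meta) returns
    for pid in range(num_programs):             # on the GPU these run in parallel
        offsets = [pid * block_size + i for i in range(block_size)]
        mask = [off < size for off in offsets]  # False for lanes past the end
        for off, keep in zip(offsets, mask):
            if keep:                            # masked load, add, masked store
                z[off] = x[off] + y[off]
    return z


x = [float(i) for i in range(10)]
y = [10.0] * 10
z = emulated_add(x, y, block_size=4)
print(cdiv(10, 4))                          # 3 programs cover 10 elements
print(z == [a + b for a, b in zip(x, y)])   # True
```

With 10 elements and a block size of 4, the third program covers offsets 8 to 11, so its last two lanes are masked off. In `main.py` the size `2 ** 16` is an exact multiple of the block size `2 ** 10`, so 64 programs are launched and every lane is in bounds; the mask only matters for sizes that do not divide evenly.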


In [6]:
%%time
!python main.py

tensor(0., device='cuda:0')
CPU times: user 26 ms, sys: 2.77 ms, total: 28.8 ms
Wall time: 3.02 s
