# Quick Triton Breakdown

## What You Should Know About Triton (Basics)

### What is Triton?

Triton is a language + compiler for writing custom GPU kernels. It lets you write high‑performance GPU code (e.g. vector ops, matrix ops) in a more Pythonic / domain-specific way, instead of writing raw CUDA.

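It is developed by OpenAI and is used inside PyTorch: TorchInductor, the default `torch.compile` backend, generates Triton kernels for GPU code, which is how it fuses operations. You can also call your own Triton kernels from PyTorch.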

### Device / Backend Target

Triton mainly targets NVIDIA GPUs: it compiles kernels to PTX, which runs through the CUDA driver. Upstream Triton also includes an AMD (ROCm) backend, but these notes assume CUDA.

It is not compatible with MPS (Apple GPUs) as of now, so on those machines you need a different code path.
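A quick way to act on this is to check availability before taking a Triton code path. The helper below is a minimal sketch; `can_use_triton` is a hypothetical name, and it assumes PyTorch is how you detect the GPU.

```python
def can_use_triton() -> bool:
    """Return True only if torch and triton import and a CUDA device is visible."""
    try:
        import torch
        import triton  # noqa: F401  (imported only to confirm it is installed)
    except ImportError:
        return False
    # On a Mac with only MPS, or on a CPU-only machine, this is False.
    return torch.cuda.is_available()


print("Triton usable here:", can_use_triton())
```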

### triton.jit and Kernel Definitions

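You write kernels as Python functions decorated with `@triton.jit`. The body describes how one program instance loads data from memory, processes it, and stores the result.

Parallelism comes from running many program instances at once: each one reads its index with `tl.program_id` and works on its own block of the input. You write the masks that guard out-of-bounds elements yourself; the compiler takes care of lower-level details such as vectorization.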
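As a concrete case, here is a sketch of a vector-add kernel. The kernel is only defined when Triton is installed. `emulate_program` is a hypothetical pure-Python helper, not part of Triton, that mirrors the index math of one program instance so you can inspect it without a GPU.

```python
try:
    import triton
    import triton.language as tl
except ImportError:  # Triton is not installed on this machine
    triton = None

if triton is not None:

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                            # which block this instance owns
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)  # element indices in that block
        mask = offsets < n_elements                            # guard the ragged last block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)


def emulate_program(pid, block_size, n_elements):
    """Pure-Python model of the offsets and mask one program instance computes."""
    offsets = [pid * block_size + i for i in range(block_size)]
    mask = [off < n_elements for off in offsets]
    return offsets, mask


# With 10 elements and blocks of 4, program 2 owns indices 8..11,
# but only 8 and 9 are in bounds.
print(emulate_program(2, 4, 10))
```

The mask is what makes the kernel correct when the input length is not a multiple of `BLOCK_SIZE`.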

### Kernel Invocation / Launching

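You launch a kernel with the subscript syntax `kernel[grid](...)`, where `grid` is either a tuple or a function that takes the kernel's meta-parameters and returns a tuple, usually computed from the input tensor sizes.

The grid sets how many program instances (roughly, thread blocks) are launched. The number of threads inside each one is controlled separately by `num_warps`.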
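The grid calculation is plain ceiling division, so it can be sketched without a GPU. Below, `cdiv` is a local stand-in for `triton.cdiv`, and `make_grid` is a hypothetical helper name. The launch line is shown as a comment because it needs a real kernel and CUDA tensors.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same arithmetic as triton.cdiv."""
    return (a + b - 1) // b


def make_grid(n_elements: int):
    """Return a 1-D grid function: one program instance per BLOCK_SIZE elements."""
    return lambda meta: (cdiv(n_elements, meta["BLOCK_SIZE"]),)


grid = make_grid(1000)
print(grid({"BLOCK_SIZE": 256}))  # (4,)

# With a real kernel (here a hypothetical add_kernel) the launch would look like:
#   add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=256)
# Triton calls grid(meta) itself, passing the kernel's meta-parameters.
```

Writing the grid as a function rather than a fixed tuple keeps the launch correct when `BLOCK_SIZE` is picked later, for example by the autotuner.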

### Autotuning / Heuristics

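Triton supports autotuning: it benchmarks several kernel configurations (block sizes, number of warps, pipeline stages) and keeps the fastest one for the current hardware and inputs.

You enable it by decorating the kernel with `@triton.autotune(...)`, passing a list of `triton.Config` candidates and a `key` naming the arguments whose values should trigger re-tuning.

When an autotuned kernel is called from code compiled with `torch.compile`, the compiled code has to respect those tunable configurations.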
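The following is a minimal sketch of an autotuned kernel. The candidate values are illustrative guesses, not tuned recommendations, and they are kept as plain dictionaries (an assumption made here for readability) so the list can be inspected without a GPU. The kernel itself is defined only when a CUDA device is available.

```python
# Illustrative candidates; BLOCK_SIZE must be a power of two for tl.arange.
CANDIDATES = [
    {"BLOCK_SIZE": 128, "num_warps": 2, "num_stages": 2},
    {"BLOCK_SIZE": 256, "num_warps": 4, "num_stages": 2},
    {"BLOCK_SIZE": 1024, "num_warps": 8, "num_stages": 3},
]

try:
    import torch
    import triton
    import triton.language as tl
    HAS_TRITON_GPU = torch.cuda.is_available()
except ImportError:
    HAS_TRITON_GPU = False

if HAS_TRITON_GPU:
    configs = [
        triton.Config({"BLOCK_SIZE": c["BLOCK_SIZE"]},
                      num_warps=c["num_warps"], num_stages=c["num_stages"])
        for c in CANDIDATES
    ]

    # Re-tune whenever n_elements changes; otherwise reuse the cached winner.
    @triton.autotune(configs=configs, key=["n_elements"])
    @triton.jit
    def scale_kernel(x_ptr, out_ptr, n_elements, factor, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x * factor, mask=mask)

    def scale(x, factor):
        """Multiply a contiguous CUDA tensor by a scalar using the autotuned kernel."""
        out = torch.empty_like(x)
        n = x.numel()
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        scale_kernel[grid](x, out, n, factor)  # BLOCK_SIZE is supplied by the autotuner
        return out
```

Note that `BLOCK_SIZE` is not passed at the call site: the autotuner fills it in, which is why the grid has to be a function of `meta`.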

### Integration with PyTorch

You can call Triton kernels from PyTorch code, and `torch.compile` can optimize across those kernels and the surrounding PyTorch operations.

For full integration (autograd, fallbacks, composability with other PyTorch features), you usually wrap the kernel as a custom operator with `torch.library.triton_op` or the other `torch.library` mechanisms.
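Here is a sketch of that wrapping, assuming a PyTorch version that provides `torch.library.triton_op` and `torch.library.wrap_triton`. The operator name `notes_demo::square` and the kernel are made up for illustration, and the code assumes a contiguous CUDA input.

```python
try:
    import torch
    import torch.library
    import triton
    import triton.language as tl
    HAS_TRITON_OP = (
        torch.cuda.is_available()
        and hasattr(torch.library, "triton_op")
        and hasattr(torch.library, "wrap_triton")
    )
except ImportError:
    HAS_TRITON_OP = False

if HAS_TRITON_OP:
    from torch.library import triton_op, wrap_triton

    @triton.jit
    def square_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x * x, mask=mask)

    # Registers a custom operator; the schema is inferred from the type hints.
    @triton_op("notes_demo::square", mutates_args={})
    def triton_square(x: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        # wrap_triton makes the kernel launch visible to torch.compile.
        wrap_triton(square_kernel)[grid](x, out, n, BLOCK_SIZE=1024)
        return out
```

Once registered this way, the function is an operator that other `torch.library` registrations (for example an autograd formula) can attach to.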

### Limitations & Fallbacks

Triton kernels run only on a supported GPU (CUDA in the setups covered here). They do not fall back to the CPU automatically, so you may need to provide a CPU path yourself.
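One simple way to provide that path is to dispatch on the input's device. The sketch below is one possible pattern rather than an official one: `add` is a hypothetical wrapper that uses the Triton kernel for CUDA tensors and plain `x + y` for everything else. It assumes both inputs have the same shape.

```python
try:
    import torch
    import triton
    import triton.language as tl
    HAS_TRITON_GPU = torch.cuda.is_available()
except ImportError:
    HAS_TRITON_GPU = False

if HAS_TRITON_GPU:

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x, y):
    """Use the Triton kernel for CUDA tensors; otherwise fall back to x + y."""
    on_gpu = getattr(x, "is_cuda", False) and getattr(y, "is_cuda", False)
    if HAS_TRITON_GPU and on_gpu:
        x, y = x.contiguous(), y.contiguous()  # the kernel assumes flat, dense memory
        out = torch.empty_like(x)
        n = out.numel()
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out
    return x + y  # CPU tensors, or any non-CUDA input, take the eager path


print(add(2.0, 3.0))  # non-tensor inputs also use the fallback branch
```

Keeping the fallback in the same function means callers do not need to know whether Triton is available.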

Not all PyTorch subsystems (like tensor subclasses, custom autograd, flop counting, etc.) are automatically compatible unless wrapped properly.

Some combinations of Triton features are restricted under `torch.compile`. For example, the order in which `@triton.heuristics` and `@triton.autotune` are applied to a kernel matters.

