# PyTorch Tutorial: Performance Engineering (Triton & Profiling)

At the scale of large tech companies, making a model even 10% faster can translate into substantial savings in compute cost. This chapter moves beyond "making it work" to "making it fast".

## Learning Objectives
- **Profile** your code to find bottlenecks.
- Use **`torch.compile`** for speedups with minimal code changes.
- Write a custom GPU kernel using **Triton**.

## 1. Vocabulary First

- **Latency**: Time per request.
- **Throughput**: Requests per second.
- **Kernel**: A function that runs on the GPU.
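- **Fusion**: Combining multiple operations (e.g., add + multiply) into one kernel, so intermediate results are not written to and re-read from GPU memory. This saves memory bandwidth.
- **Triton**: A Python-based language and compiler from OpenAI for writing GPU kernels.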
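
To make the difference between latency and throughput concrete, here is a minimal sketch that measures both for an arbitrary callable. The names `measure`, `fake_model`, `batch_size`, and `n_calls` are illustrative assumptions, not part of PyTorch, and the workload is a pure-Python stand-in for a model's forward pass.

```python
import time

def measure(fn, batch_size, n_calls=20):
    """Return (latency_s, throughput) for a callable that handles one batch per call.

    `measure`, `batch_size`, and `n_calls` are illustrative names, not a PyTorch API.
    """
    fn()  # warm-up call so one-time setup cost is not counted
    start = time.perf_counter()
    for _ in range(n_calls):
        fn()
    elapsed = time.perf_counter() - start
    latency = elapsed / n_calls        # seconds per call (per batch)
    throughput = batch_size / latency  # items processed per second
    return latency, throughput

def fake_model(batch_size):
    # Stand-in workload: a fixed overhead plus a cost that grows with batch size.
    return sum(i * i for i in range(1_000 + 200 * batch_size))

for bs in (1, 8, 64):
    lat, thr = measure(lambda: fake_model(bs), batch_size=bs)
    print(f"batch={bs:>2}  latency={lat * 1e3:.3f} ms  throughput={thr:,.0f} items/s")
```

Larger batches typically raise latency per call while also raising throughput, which is why the two metrics are tracked separately. When timing real GPU work, the callable would also need to wait for the device to finish (for example with `torch.cuda.synchronize()`) before the clock is read.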

In [None]:
import torch
import time

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running on: {device}")

## 2. Profiling with `torch.profiler`

Stop guessing where your code is slow. Measure it.

In [None]:
def heavy_computation(x):
    return torch.matmul(x, x) + torch.relu(x)

x = torch.randn(1000, 1000, device=device)

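# Record GPU kernels too when running on CUDA; CPU-only profiling would hide them.
activities = [torch.profiler.ProfilerActivity.CPU]
if device == "cuda":
    activities.append(torch.profiler.ProfilerActivity.CUDA)

with torch.profiler.profile(activities=activities, record_shapes=True) as prof:
    heavy_computation(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))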
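
The profiler table lists low-level operator names, which can be hard to map back to your own code. One option is to label regions with `torch.profiler.record_function`. The sketch below runs on the CPU for simplicity; the labels `matmul_step` and `relu_step` and the function name `labeled_computation` are made up for illustration.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

def labeled_computation(x):
    # Each record_function block shows up as its own row in the profiler table.
    with record_function("matmul_step"):
        a = torch.matmul(x, x)
    with record_function("relu_step"):
        b = torch.relu(x)
    return a + b

x_small = torch.randn(256, 256)  # small CPU tensor, just for the demonstration

with profile(activities=[ProfilerActivity.CPU]) as prof:
    out = labeled_computation(x_small)

# Collect the row names so the custom labels can be looked up programmatically.
labels = {evt.key for evt in prof.key_averages()}
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

The labeled rows make it easier to see which part of a function dominates before deciding what to optimize.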

## 3. `torch.compile` (PyTorch 2.0)

Often the easiest way to speed up PyTorch code. Wrap a model or function in `torch.compile`, and it traces the code and fuses operations into fewer kernels automatically.

```python
model = MyModel()  # MyModel is a placeholder for any nn.Module
opt_model = torch.compile(model)
```

In [None]:
@torch.compile
def fast_computation(x):
    return torch.sin(x) + torch.cos(x)

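def timed(fn, *args):
    # CUDA kernels launch asynchronously, so synchronize before reading the clock.
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    fn(*args)
    if device == "cuda":
        torch.cuda.synchronize()
    return time.perf_counter() - start

# First run triggers compilation (can take several seconds)
print(f"First run (compilation): {timed(fast_computation, x):.4f}s")

# Second run reuses the compiled code
print(f"Second run (cached): {timed(fast_computation, x):.4f}s")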
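
To see why fusing `sin(x) + cos(x)` can help, it is useful to count memory traffic. The sketch below is back-of-the-envelope arithmetic only. It assumes float32 data, that every unfused operation reads its inputs from memory and writes its output back, and that caches do not help. The helper name `bytes_moved` is made up for illustration.

```python
def bytes_moved(n_elements, tensor_passes, bytes_per_element=4):
    """Bytes read from or written to memory for a number of full-tensor passes."""
    return n_elements * tensor_passes * bytes_per_element

n = 1000 * 1000  # elements in a 1000 x 1000 float32 tensor

# Unfused: sin (1 read + 1 write), cos (1 read + 1 write), add (2 reads + 1 write).
unfused = bytes_moved(n, tensor_passes=7)
# Fused: read x once, compute sin(x) + cos(x) in registers, write the result once.
fused = bytes_moved(n, tensor_passes=2)

print(f"unfused: {unfused / 1e6:.0f} MB, fused: {fused / 1e6:.0f} MB, "
      f"ratio: {unfused / fused:.1f}x")
# unfused: 28 MB, fused: 8 MB, ratio: 3.5x
```

For element-wise operations like these, which are usually limited by memory bandwidth rather than arithmetic, the reduction in traffic is a rough upper bound on the speedup fusion can deliver.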

## 4. Writing Custom Kernels with Triton

When `torch.compile` isn't enough, you write your own kernels. Triton makes this accessible to Python engineers.

*(Note: This requires the `triton` package and a supported GPU to run.)*

In [None]:
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
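    # Each program instance handles one BLOCK_SIZE-wide chunk of the input.
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Mask out-of-range offsets so the last block does not read or write past the end.
    mask = offsets < n_elements

    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)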

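def triton_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # The kernel indexes raw memory, so inputs must live on the GPU,
    # match in shape, and be contiguous.
    assert x.is_cuda and y.is_cuda and x.shape == y.shape
    assert x.is_contiguous() and y.is_contiguous()
    output = torch.empty_like(x)
    n_elements = x.numel()
    # 1D launch grid: one program per BLOCK_SIZE elements, rounded up.
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)

    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output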

if torch.cuda.is_available():
    x = torch.randn(1000, device='cuda')
    y = torch.randn(1000, device='cuda')
    out = triton_add(x, y)
    # Verify the result against PyTorch's built-in addition.
    assert torch.allclose(out, x + y)
    print("Triton add successful!")
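
If no GPU is at hand, the kernel's indexing logic can still be traced on the CPU. The sketch below is a plain-Python emulation of the `pid` / `offsets` / `mask` pattern, not Triton code. The names `cdiv` and `emulated_add` are made up for illustration, and the outer loop runs sequentially where the GPU would run programs in parallel.

```python
def cdiv(a, b):
    """Ceiling division, the same rounding that triton.cdiv performs."""
    return (a + b - 1) // b

def emulated_add(x, y, block_size):
    """CPU emulation of the kernel's indexing; returns (output, number of programs)."""
    n_elements = len(x)
    output = [0.0] * n_elements
    n_programs = cdiv(n_elements, block_size)
    for pid in range(n_programs):        # on the GPU these run in parallel
        block_start = pid * block_size
        offsets = [block_start + i for i in range(block_size)]
        for off in offsets:
            if off < n_elements:         # plays the role of `mask`
                output[off] = x[off] + y[off]
    return output, n_programs

a = [float(i) for i in range(10)]
b = [1.0] * 10
result, n_programs = emulated_add(a, b, block_size=4)
print(n_programs)  # 3 programs cover offsets 0-3, 4-7, 8-11; offsets 10 and 11 are masked off
print(result)
```

With 10 elements and a block size of 4, the last program covers two offsets past the end of the data, which is exactly the case the mask guards against.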

## Key Takeaways

1. **Profile first**: Don't optimize blindly.
2. **Use `torch.compile`**: Often a sizable speedup for a one-line change, at the cost of compile time on the first call.
3. **Triton**: The secret weapon for custom high-performance layers.