# Day 3: Host→Device Memcpy Scaling

**Focus:** Understanding memory-bound operations and scaling behavior

## Objectives
- Profile Host→Device memcpy operations
- Understand how transfer time scales with transfer size
- Compare achieved transfer bandwidth (GB/s) across different sizes

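The bandwidth objective depends on being consistent about units, so here is a minimal sketch of the conversion from bytes and milliseconds to bandwidth. The function names and the sample measurement (32 MiB in 2.0 ms) are hypothetical and chosen only for illustration. The sketch reports both decimal GB/s (1e9 bytes, the unit used in this notebook) and binary GiB/s (2**30 bytes), because tools and spec sheets sometimes differ on which they mean.

```python
def bandwidth_gb_per_s(num_bytes: int, elapsed_ms: float) -> float:
    """Decimal gigabytes (1e9 bytes) transferred per second."""
    return (num_bytes / 1e9) / (elapsed_ms / 1e3)


def bandwidth_gib_per_s(num_bytes: int, elapsed_ms: float) -> float:
    """Binary gibibytes (2**30 bytes) transferred per second."""
    return (num_bytes / 2**30) / (elapsed_ms / 1e3)


# Hypothetical measurement, not a real result: 32 MiB copied in 2.0 ms.
num_bytes = 32 * 2**20
elapsed_ms = 2.0

print(f"{bandwidth_gb_per_s(num_bytes, elapsed_ms):.3f} GB/s")
print(f"{bandwidth_gib_per_s(num_bytes, elapsed_ms):.3f} GiB/s")
```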
In [None]:
import torch
import nvtx

assert torch.cuda.is_available(), "CUDA not available"

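# TF32 matmul settings: these affect float32 matmuls only and have no
# effect on the copies timed below.
torch.backends.cuda.matmul.allow_tf32 = True
torch.set_float32_matmul_precision("high")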

DEVICE = "cuda"
DTYPE = torch.float16

print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"DTYPE: {DTYPE}")

## Helper: CUDA Event Timing Function

In [None]:
WARMUP = 5
ITERS = 10

def time_cuda(fn, stream=None):
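    """Time a CUDA operation using events.

    Returns the mean time in milliseconds per call over ITERS calls,
    after WARMUP untimed calls. `fn` must enqueue its work on `stream`
    (the current stream by default), otherwise the events will not
    bracket the work being measured.
    """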
    if stream is None:
        stream = torch.cuda.current_stream()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    # Warmup
    for _ in range(WARMUP):
        fn()
    stream.synchronize()

    # Timed
    start.record(stream)
    for _ in range(ITERS):
        fn()
    end.record(stream)
    end.synchronize()

    ms = start.elapsed_time(end) / ITERS
    return ms

## Experiment: H2D Memcpy Scaling

Measure transfer time and bandwidth for different tensor sizes.

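Before running the experiment, it can help to know how many bytes each size actually moves. The sketch below is plain arithmetic and needs no GPU. It assumes 2 bytes per element (float16, matching `DTYPE` in the setup cell); the helper name `transfer_megabytes` is hypothetical. Each doubling of `N` should quadruple the payload.

```python
SIZES = [1024, 2048, 4096, 8192]
BYTES_PER_ELEMENT = 2  # assumption: float16, matching DTYPE in the setup cell


def transfer_megabytes(n: int, bytes_per_element: int = BYTES_PER_ELEMENT) -> float:
    """Size in decimal megabytes (1e6 bytes) of an n x n tensor."""
    return n * n * bytes_per_element / 1e6


previous_mb = None
for n in SIZES:
    mb = transfer_megabytes(n)
    # Ratio to the previous size; expected to be about 4 when n doubles.
    ratio = "" if previous_mb is None else f"(x{mb / previous_mb:.1f})"
    print(f"{n:6d}  {mb:10.3f} MB  {ratio}")
    previous_mb = mb
```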
In [None]:
SIZES = [1024, 2048, 4096, 8192]

print("=== H2D memcpy scaling ===")
print(f"{'N':>6}  {'ms':>10}  {'MB':>10}  {'GB/s':>8}")
print("-" * 40)

results = []

for N in SIZES:
    with nvtx.annotate(f"h2d_alloc_{N}"):
        # Pinned (page-locked) host memory: faster for transfers, and needed
        # for the non_blocking copy below to be truly asynchronous
        x_cpu = torch.empty((N, N), dtype=DTYPE, pin_memory=True).normal_()
        x_gpu = torch.empty((N, N), dtype=DTYPE, device=DEVICE)

    def h2d():
        with nvtx.annotate(f"h2d_{N}"):
            x_gpu.copy_(x_cpu, non_blocking=True)

    ms = time_cuda(h2d)
    mb = x_cpu.numel() * x_cpu.element_size() / 1e6
    gbps = (mb / 1000.0) / (ms / 1000.0)  # GB/s
    
    results.append((N, ms, mb, gbps))
    print(f"{N:6d}  {ms:10.3f}  {mb:10.3f}  {gbps:8.2f}")

print("\nNote: Size scales as N², so larger N means more data transferred.")

## Analysis

**Questions to answer:**
1. How does transfer time scale with N, and with the number of bytes transferred? (linear? quadratic?)
2. Is bandwidth constant across sizes? Why or why not?
3. What would you expect to see for D2H (Device→Host) transfers?

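One possible way to reason about question 2 is a first-order model in which each copy pays a fixed launch latency plus a term proportional to the bytes moved. This is a sketch under stated assumptions, not a measurement: the function names are hypothetical, and the constants (10 µs latency, 25 GB/s peak) are placeholders, not properties of any particular GPU or interconnect. Under this model, effective bandwidth rises with size and approaches the peak without reaching it. Compare its shape against your own measured numbers.

```python
def modeled_transfer_ms(num_bytes: int,
                        latency_us: float = 10.0,       # hypothetical fixed overhead
                        peak_gb_per_s: float = 25.0) -> float:  # hypothetical peak
    """Modeled copy time: fixed latency plus bytes / peak bandwidth."""
    latency_ms = latency_us / 1e3
    streaming_ms = num_bytes / (peak_gb_per_s * 1e9) * 1e3
    return latency_ms + streaming_ms


def effective_gb_per_s(num_bytes: int, **model_kwargs) -> float:
    """Bandwidth implied by the model: bytes divided by modeled time."""
    ms = modeled_transfer_ms(num_bytes, **model_kwargs)
    return (num_bytes / 1e9) / (ms / 1e3)


# Small sizes are dominated by latency; large sizes approach the peak.
for n in [64, 256, 1024, 4096]:
    num_bytes = n * n * 2  # float16 elements
    print(f"{n:6d}  {modeled_transfer_ms(num_bytes):10.4f} ms  "
          f"{effective_gb_per_s(num_bytes):8.2f} GB/s")
```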
_Record your observations here after running the experiments._