### Lecture 1 from CUDA MODE - CUDA profilers

In [1]:
import torch

In [8]:
a = torch.tensor([1.,2.,3.])

### Time a PyTorch function on the GPU

Python's `time` module alone gives misleading results here because CUDA kernel launches are asynchronous: the call returns to Python before the GPU has finished the work. \
We can write a function that uses `torch.cuda.Event` to measure the elapsed GPU time correctly. There is also a warmup cost, so the first iterations are slower than the later ones. To measure steady-state performance, we run the function a few times before starting the measurement.

In [10]:
def time_pytorch_function(func, input):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

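    # warmup: run a few untimed calls so one-time startup costs are excluded
    for _ in range(5):
        func(input)

    start.record()
    func(input)
    end.record()
    # wait for the GPU to finish so both events have been recorded
    torch.cuda.synchronize()
    # elapsed time between the two events, in milliseconds
    return start.elapsed_time(end)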
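
To see why the `time` module is misleading on its own, the sketch below times the same call with `time.perf_counter`, once without and once with `torch.cuda.synchronize()`. The helper name `wall_clock_ms` and the tensor size are illustrative assumptions, not part of the lecture. On a machine without a GPU it falls back to the CPU, where the two numbers should be close because CPU ops run synchronously.

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(2000, 2000, device=device)

def wall_clock_ms(func, input, sync):
    """Hypothetical helper: wall-clock milliseconds for one call to func."""
    if device == "cuda":
        torch.cuda.synchronize()  # drain pending GPU work before starting the clock
    t0 = time.perf_counter()
    func(input)
    if sync and device == "cuda":
        torch.cuda.synchronize()  # wait until the kernel has actually finished
    return (time.perf_counter() - t0) * 1000

torch.square(x)  # one warmup call
no_sync_ms = wall_clock_ms(torch.square, x, sync=False)  # on GPU: mostly launch overhead
sync_ms = wall_clock_ms(torch.square, x, sync=True)      # on GPU: includes execution time
print(f"without synchronize: {no_sync_ms:.3f} ms")
print(f"with synchronize:    {sync_ms:.3f} ms")
```

On a GPU, the unsynchronized number typically reflects only the cost of launching the kernel, which is why event-based timing or an explicit synchronize is needed.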

In [11]:
b = torch.randn(10000, 10000).cuda()

In [13]:
def square_2(a):
    return a*a

In [14]:
def square_3(a):
    return a**2

In [15]:
time_pytorch_function(torch.square, b)
time_pytorch_function(square_2, b)
time_pytorch_function(square_3, b)

0.9185600280761719
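
A notebook cell only displays the value of its last expression, which is why a single number appears above. A minimal sketch that prints a labelled time for each variant follows. It redefines the helpers so that it runs on its own; the `time.perf_counter` fallback for machines without a GPU, the smaller tensor, and the `timings` dictionary are assumptions added for illustration.

```python
import time
import torch

def time_pytorch_function(func, input):
    # warmup
    for _ in range(5):
        func(input)
    if input.is_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        func(input)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end)  # milliseconds
    # CPU fallback (assumption, not in the lecture): plain wall-clock timing
    t0 = time.perf_counter()
    func(input)
    return (time.perf_counter() - t0) * 1000

def square_2(a):
    return a * a

def square_3(a):
    return a ** 2

b = torch.randn(1000, 1000, device="cuda" if torch.cuda.is_available() else "cpu")

timings = {}
for name, func in [("torch.square", torch.square),
                   ("a * a", square_2),
                   ("a ** 2", square_3)]:
    timings[name] = time_pytorch_function(func, b)
    print(f"{name:>12}: {timings[name]:.3f} ms")
```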

### We can use `torch.autograd.profiler`

In [16]:
print("=============")
print("Profiling torch.square")
print("=============")

=============
Profiling torch.square
=============


In [21]:
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    torch.square(b)

STAGE:2024-02-19 22:46:45 872:872 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-19 22:46:45 872:872 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-19 22:46:45 872:872 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


In [22]:
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

---------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
---------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
         aten::square        15.46%      15.000us       100.00%      97.000us      97.000us      15.000us         0.03%      51.006ms      51.006ms             1  
            aten::pow        81.44%      79.000us        84.54%      82.000us      82.000us      50.987ms        99.96%      50.991ms      50.991ms             1  
    aten::result_type         2.06%       2.000us         2.06%       2.000us       2.000us       2.000us         0.00%       2.000us       2.000us             1  
             at

In [17]:
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    square_2(b)

INFO:2024-02-19 22:44:56 872:872 init.cpp:158] If you see CUPTI_ERROR_INSUFFICIENT_PRIVILEGES, refer to https://developer.nvidia.com/nvidia-development-tools-solutions-err-nvgpuctrperm-cupti
STAGE:2024-02-19 22:44:56 872:872 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-19 22:44:56 872:872 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-19 22:44:56 872:872 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


In [18]:
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

-------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
    aten::mul       100.00%      43.000us       100.00%      43.000us      43.000us      66.190ms       100.00%      66.190ms      66.190ms             1  
-------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 43.000us
Self CUDA time total: 66.190ms



In [19]:
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    square_3(b)

STAGE:2024-02-19 22:46:02 872:872 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2024-02-19 22:46:02 872:872 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-19 22:46:02 872:872 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


In [20]:
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

---------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
---------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
            aten::pow        99.97%       9.144ms       100.00%       9.147ms       9.147ms      74.016ms        99.99%      74.021ms      74.021ms             1  
    aten::result_type         0.02%       2.000us         0.02%       2.000us       2.000us       3.000us         0.00%       3.000us       3.000us             1  
             aten::to         0.01%       1.000us         0.01%       1.000us       1.000us       2.000us         0.00%       2.000us       2.000us             1  
---------------
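
The tables above show that `torch.square` and `a**2` both dispatch to `aten::pow`, while `a*a` dispatches to `aten::mul`. A compact way to check this for all three variants at once is sketched below. As an assumption for portability, it uses `torch.profiler.profile` (covered in the next section) rather than `torch.autograd.profiler`, a smaller tensor, and a CPU fallback; the `ops_seen` dictionary is a hypothetical name.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def square_2(a):
    return a * a

def square_3(a):
    return a ** 2

use_gpu = torch.cuda.is_available()
# Always record CPU-side operators; add CUDA activity only when a GPU is present.
activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if use_gpu else [])
b = torch.randn(1000, 1000, device="cuda" if use_gpu else "cpu")

ops_seen = {}
for name, func in [("torch.square", torch.square),
                   ("square_2", square_2),
                   ("square_3", square_3)]:
    with profile(activities=activities) as prof:
        func(b)
    # Collect the names of the events recorded for this variant.
    ops_seen[name] = sorted(evt.key for evt in prof.key_averages())
    print(name, "->", ops_seen[name])
```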

### We can use the PyTorch profiler

In [24]:
import torch
from torch.profiler import profile, record_function, ProfilerActivity

In [25]:
def trace_handler(prof):
    print(prof.key_averages().table(
        sort_by="self_cuda_time_total", row_limit=-1))
    prof.export_chrome_trace("/tmp/test_trace_" + str(prof.step_num) + ".json")

In [27]:
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(
        wait=1,
        warmup=1,
        active=2,
        repeat=1),
    on_trace_ready=trace_handler
    # on_trace_ready=torch.profiler.tensorboard_trace_handler('./log')
    # used when outputting for tensorboard
) as p:
    for _ in range(10):
        torch.square(torch.randn(10000, 10000).cuda())
        # send a signal to the profiler that the next iteration has started
        p.step()

STAGE:2024-02-19 23:30:16 872:872 ActivityProfilerController.cpp:314] Completed Stage: Warm Up


-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-----------------------  ------------  ------------  ------------  ------------  ------------  ------------  
          ProfilerStep*         1.20%      10.021ms       100.00%     832.520ms     416.260ms             2  
            aten::randn         0.04%     339.000us        56.69%     471.955ms     235.977ms             2  
            aten::empty         0.00%      25.000us         0.00%      25.000us      12.500us             2  
          aten::normal_        56.65%     471.591ms        56.65%     471.591ms     235.796ms             2  
               aten::to         1.10%       9.147ms        42.09%     350.420ms      87.605ms             4  
         aten::_to_copy         0.00%      29.000us        40.99%     341.273ms     170.637ms             2  
    aten::

STAGE:2024-02-19 23:30:17 872:872 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-02-19 23:30:17 872:872 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
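
To make the schedule used above easier to read, the sketch below asks the schedule object what the profiler would do at each of the 10 steps. It relies on `torch.profiler.schedule` returning a callable that maps a step number to a `ProfilerAction`; the variable names `sched` and `actions` are illustrative.

```python
import torch
from torch.profiler import schedule

# Same settings as the profiling run above.
sched = schedule(wait=1, warmup=1, active=2, repeat=1)

# Ask the schedule which action applies to each step of the loop.
actions = [sched(step).name for step in range(10)]
for step, action in enumerate(actions):
    print(step, action)
```

This should show one skipped step, one warmup step, and two recorded steps, after which the profiler stays idle because `repeat=1`. That matches the `# of Calls` value of 2 for `ProfilerStep*` in the table above.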
