https://pytorch.org/tutorials/recipes/recipes/benchmark.html

In [1]:
import torch
from torch import Tensor

# 1. Defining functions to benchmark

As of the time of this writing, torch.dot does not support batched mode, so we will compare two approaches to implementing it using existing torch operators: one approach uses a combination of mul and sum while the other reduces the problem to bmm.

截至撰写本文时，torch.dot 不支持批处理模式，因此我们将比较使用现有 torch 运算符实现它的两种方法：一种方法使用 mul 和 sum 的组合，而另一种方法将问题简化为 bmm。

In [2]:
def batched_dot_mul_sum(a: Tensor, b: Tensor) -> Tensor:
    '''Computes batched dot by multiplying and summing'''
    return a.mul(b).sum(-1)

In [3]:
def batched_dot_bmm(a: Tensor, b: Tensor) -> Tensor:
    '''Computes batched dot by reducing to ``bmm``'''
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, a.shape[-1], 1)
    return torch.bmm(a, b).flatten()

In [4]:
# Input for benchmarking
x = torch.randn(10000, 64)

In [5]:
# Ensure that both functions compute the same output
assert batched_dot_mul_sum(x, x).allclose(batched_dot_bmm(x, x))

# 2. Benchmarking with timeit.Timer

First, let’s benchmark the code using Python’s builtin timeit module. We keep the benchmark code simple here so we can compare the defaults of timeit and torch.utils.benchmark.

首先，让我们使用 Python 的内置 timeit 模块对代码进行基准测试。 我们在这里保持基准代码简单，以便我们可以比较 timeit 和 torch.utils.benchmark 的默认值。

In [6]:
import timeit

In [7]:
t0 = timeit.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='from __main__ import batched_dot_mul_sum',
    globals={'x': x})

t1 = timeit.Timer(
    stmt='batched_dot_bmm(x, x)',
    setup='from __main__ import batched_dot_bmm',
    globals={'x': x})

In [8]:
print(f'mul_sum(x, x):  {t0.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'bmm(x, x):      {t1.timeit(100) / 100 * 1e6:>5.1f} us')

mul_sum(x, x):  239.9 us
bmm(x, x):      187.7 us


# 3. Benchmarking with torch.utils.benchmark.Timer

PyTorch benchmark module was designed to be familiar to those who have used the timeit module before. However, its defaults make it easier and safer to use for benchmarking PyTorch code. Let’s first compare the same basic API as above.

PyTorch 基准测试模块旨在让以前使用过 timeit 模块的人熟悉。 然而，它的默认设置使得使用 PyTorch 代码进行基准测试变得更容易、更安全。 我们首先比较一下与上面相同的基本 API。


In [9]:
from torch.utils import benchmark

In [10]:
t0 = benchmark.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='from __main__ import batched_dot_mul_sum',
    globals={'x': x})
t0.timeit(100)

<torch.utils.benchmark.utils.common.Measurement object at 0x000001E0E6BA3190>
batched_dot_mul_sum(x, x)
setup: from __main__ import batched_dot_mul_sum
  1.22 ms
  1 measurement, 100 runs , 1 thread

In [11]:
t1 = benchmark.Timer(
    stmt='batched_dot_bmm(x, x)',
    setup='from __main__ import batched_dot_bmm',
    globals={'x': x})
t1.timeit(100)

<torch.utils.benchmark.utils.common.Measurement object at 0x000001E08FF3E990>
batched_dot_bmm(x, x)
setup: from __main__ import batched_dot_bmm
  792.43 us
  1 measurement, 100 runs , 1 thread

Even though the APIs are the same for the basic functionality, there are some important differences. benchmark.Timer.timeit() returns the time per run as opposed to the total runtime like timeit.Timer.timeit() does. PyTorch benchmark module also provides formatted string representations for printing the results.

Another important difference, and the reason why the results diverge is that PyTorch benchmark module runs in a single thread by default. We can change the number of threads with the num_threads argument.

torch.utils.benchmark.Timer takes several additional arguments including: label, sub_label, description and env which change the __repr__ of the measurement object returned and are used for grouping the results (more on this later).

尽管 API 的基本功能相同，但仍存在一些重要差异。 benchmark.Timer.timeit() 返回每次运行的时间，而不是像 timeit.Timer.timeit() 那样返回总运行时间。 PyTorch 基准测试模块还提供用于打印结果的格式化字符串表示形式。

另一个重要的区别以及结果出现差异的原因是 PyTorch 基准测试模块默认在单线程中运行。 我们可以使用 num_threads 参数更改线程数。

torch.utils.benchmark.Timer 需要几个附加参数，包括：label、sub_label、description 和 env，它们会更改返回的测量对象的 __repr__ 并用于对结果进行分组（稍后会详细介绍）。

In [12]:
num_threads = torch.get_num_threads()
num_threads

8

In [13]:
t0 = benchmark.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='from __main__ import batched_dot_mul_sum',
    globals={'x': x},
    num_threads=num_threads,
    label='Multithreaded batch dot',
    sub_label='Implemented using mul and sum')
t0.timeit(100)

<torch.utils.benchmark.utils.common.Measurement object at 0x000001E091144950>
Multithreaded batch dot: Implemented using mul and sum
setup: from __main__ import batched_dot_mul_sum
  348.60 us
  1 measurement, 100 runs , 8 threads

In [14]:
t1 = benchmark.Timer(
    stmt='batched_dot_bmm(x, x)',
    setup='from __main__ import batched_dot_bmm',
    globals={'x': x},
    num_threads=num_threads,
    label='Multithreaded batch dot',
    sub_label='Implemented using bmm')
t1.timeit(100)

<torch.utils.benchmark.utils.common.Measurement object at 0x000001E091145250>
Multithreaded batch dot: Implemented using bmm
setup: from __main__ import batched_dot_bmm
  201.44 us
  1 measurement, 100 runs , 8 threads

Running benchmark with all threads available gives similar results as the timeit module. More importantly, which version is faster depends on how many threads we run the code with. This is why it’s important to benchmark the code with thread settings that are representative of real use cases. Another important thing to remember is to synchronize CPU and CUDA when benchmarking on the GPU. Let’s run the above benchmarks again on a CUDA tensor and see what happens.

使用所有可用线程运行基准测试会得到与 timeit 模块类似的结果。 更重要的是，哪个版本更快取决于我们运行代码的线程数。 这就是为什么使用代表实际用例的线程设置对代码进行基准测试很重要。 另一件需要记住的重要事情是在 GPU 上进行基准测试时同步 CPU 和 CUDA。 让我们在 CUDA 张量上再次运行上述基准测试，看看会发生什么。

In [15]:
x = torch.randn(10000, 1024, device="cuda:0")

In [16]:
t0 = timeit.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='from __main__ import batched_dot_mul_sum',
    globals={'x': x})

t1 = timeit.Timer(
    stmt='batched_dot_bmm(x, x)',
    setup='from __main__ import batched_dot_bmm',
    globals={'x': x})

# Ran each twice to show difference before/after warm-up
print(f'mul_sum(x, x):  {t0.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'mul_sum(x, x):  {t0.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'bmm(x, x):      {t1.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'bmm(x, x):      {t1.timeit(100) / 100 * 1e6:>5.1f} us')

mul_sum(x, x):  266.6 us
mul_sum(x, x):   31.4 us
bmm(x, x):      826.7 us
bmm(x, x):       35.1 us


In [17]:
t0 = benchmark.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='from __main__ import batched_dot_mul_sum',
    globals={'x': x})

# Run only once since benchmark module does warm-up for us
t0.timeit(100)

<torch.utils.benchmark.utils.common.Measurement object at 0x000001E0911459D0>
batched_dot_mul_sum(x, x)
setup: from __main__ import batched_dot_mul_sum
  407.38 us
  1 measurement, 100 runs , 1 thread

In [18]:
t1 = benchmark.Timer(
    stmt='batched_dot_bmm(x, x)',
    setup='from __main__ import batched_dot_bmm',
    globals={'x': x})

# Run only once since benchmark module does warm-up for us
t1.timeit(100)

<torch.utils.benchmark.utils.common.Measurement object at 0x000001E091305690>
batched_dot_bmm(x, x)
setup: from __main__ import batched_dot_bmm
  445.19 us
  1 measurement, 100 runs , 1 thread

The results reveal something interesting. The first run of the bmm version using the timeit module takes much longer than the second run. This is because bmm calls into cuBLAS which needs to be loaded the first time it’s called which takes some time. This is why it’s important to do a warm-up run before benchmarking, luckily for us, PyTorch’s benchmark module takes care of that.

The difference in the results between timeit and benchmark modules is because the timeit module is not synchronizing CUDA and is thus only timing the time to launch the kernel. PyTorch’s benchmark module does the synchronization for us.

结果揭示了一些有趣的事情。 使用 timeit 模块的 bmm 版本的第一次运行比第二次运行花费的时间要长得多。 这是因为 bmm 调用了 cuBLAS，它需要在第一次调用时加载，这需要一些时间。 这就是为什么在基准测试之前进行热身运行很重要，幸运的是，PyTorch 的基准测试模块可以解决这个问题。

timeit 和 benchmark 模块之间的结果差异是因为 timeit 模块不同步 CUDA，因此仅计时启动内核的时间。 PyTorch 的基准测试模块为我们进行同步。

# 4. Benchmarking with Blocked Autorange

While timeit.Timer.autorange takes a single continuous measurement of at least 0.2 seconds, torch.utils.benchmark.blocked_autorange takes many measurements whose times total at least 0.2 seconds (which can be changed by the min_run_time parameter) subject to the constraint that timing overhead is a small fraction of the overall measurement. This is accomplished by first running with an increasing number of runs per loop until the runtime is much larger than measurement overhead (which also serves as a warm up), and then taking measurements until the target time is reached. This has the useful properties that it wastes less data and allows us to compute statistics to estimate the reliability of the measurements.

timeit.Timer.autorange 进行至少 0.2 秒的单次连续测量，而 torch.utils.benchmark.blocked_autorange 进行多次测量，其时间总计至少 0.2 秒（可以通过 min_run_time 参数更改），但受到计时的约束 开销只占总体测量的一小部分。 这是通过首先增加每个循环的运行次数来实现的，直到运行时间远大于测量开销（这也可以作为预热），然后进行测量直到达到目标时间。 这具有有用的特性，即浪费更少的数据，并允许我们计算统计数据来估计测量的可靠性。

In [19]:
m0 = t0.blocked_autorange()
m0

<torch.utils.benchmark.utils.common.Measurement object at 0x000001E08FAA3850>
batched_dot_mul_sum(x, x)
setup: from __main__ import batched_dot_mul_sum
  395.04 us
  1 measurement, 1000 runs , 1 thread

In [20]:
m1 = t1.blocked_autorange()
m1

<torch.utils.benchmark.utils.common.Measurement object at 0x000001E08FBAFF50>
batched_dot_bmm(x, x)
setup: from __main__ import batched_dot_bmm
  381.12 us
  1 measurement, 1000 runs , 1 thread

We can also inspect the individual statistics from the returned measurements object.

我们还可以检查返回的测量对象的各个统计信息。

In [21]:
print(f"Mean:   {m0.mean * 1e6:6.2f} us")
print(f"Median: {m0.median * 1e6:6.2f} us")

Mean:   395.04 us
Median: 395.04 us


# 5. Comparing benchmark results

So far we’ve been comparing our two versions of batched dot against a single input. In practice, we want to try a combination of inputs as well as different number of threads. The Compare class helps display the results of many measurements in a formatted table. It uses the annotations described above (label, sub_label, num_threads, etc.) as well as description to group and organize the table. Let’s use Compare to see how our functions perform for different input sizes and number of threads.

到目前为止，我们一直在将两个版本的批处理点与单个输入进行比较。 在实践中，我们希望尝试组合输入以及不同数量的线程。 Compare 类有助于在格式化表格中显示许多测量的结果。 它使用上述注释（label、sub_label、num_threads 等）以及描述来对表进行分组和组织。 让我们使用 Compare 来看看我们的函数在不同的输入大小和线程数量下的执行情况。

In [22]:
from itertools import product

In [23]:
sizes = [1, 64, 1024, 10000]
list(product(sizes, sizes))

[(1, 1),
 (1, 64),
 (1, 1024),
 (1, 10000),
 (64, 1),
 (64, 64),
 (64, 1024),
 (64, 10000),
 (1024, 1),
 (1024, 64),
 (1024, 1024),
 (1024, 10000),
 (10000, 1),
 (10000, 64),
 (10000, 1024),
 (10000, 10000)]

In [24]:
results = []
for b, n in product(sizes, sizes):
    # label and sub_label are the rows
    # description is the column
    label = "Batched dot"
    sub_label = f'[{b}, {n}]'
    x = torch.ones((b, n))
    for num_threads in [1, 4, 16, 32]:
        results.append(benchmark.Timer(
            stmt='batched_dot_mul_sum(x, x)',
            setup='from __main__ import batched_dot_mul_sum',
            globals={'x': x},
            num_threads=num_threads,
            label=label,
            sub_label=sub_label,
            description='mul/sum'
        ).blocked_autorange(min_run_time=1))
        results.append(benchmark.Timer(
            stmt='batched_dot_bmm(x, x)',
            setup='from __main__ import batched_dot_bmm',
            globals={'x': x},
            num_threads=num_threads,
            label=label,
            sub_label=sub_label,
            description='bmm',
        ).blocked_autorange(min_run_time=1))

compare = benchmark.Compare(results=results)
compare.print()

[--------------- Batched dot ----------------]
                      |  mul/sum   |    bmm   
1 threads: -----------------------------------
      [1, 1]          |       5.8  |       9.8
      [1, 64]         |       6.1  |       9.9
      [1, 1024]       |       8.1  |      10.3
      [1, 10000]      |      16.2  |      11.4
      [64, 1]         |       6.5  |       9.9
      [64, 64]        |      13.5  |      14.9
      [64, 1024]      |      77.1  |     142.0
      [64, 10000]     |     675.0  |    1424.7
      [1024, 1]       |      10.3  |      14.7
      [1024, 64]      |     120.8  |      81.9
      [1024, 1024]    |    1371.0  |    2269.6
      [1024, 10000]   |   16059.7  |   21760.0
      [10000, 1]      |      37.9  |      53.0
      [10000, 64]     |    1095.9  |     698.0
      [10000, 1024]   |   16860.5  |   21691.8
      [10000, 10000]  |  152418.7  |  206059.8
4 threads: -----------------------------------
      [1, 1]          |       5.8  |       9.7
      [1, 64]

The results above indicate that the version which reduces to bmm is better for larger tensors running on multiple threads, while for smaller and/or single thread code, the other version is better.

Compare also provides functions for changing the table format

上面的结果表明，对于在多个线程上运行的较大张量，简化为 bmm 的版本更好，而对于较小和/或单线程代码，其他版本更好。

Compare还提供了更改表格格式的功能

In [25]:
compare.trim_significant_figures()
compare.colorize()
compare.print()

[-------------- Batched dot --------------]
                      |  mul/sum  |   bmm  
1 threads: --------------------------------
      [1, 1]          |  [92m[1m      6[0m[0m  |  [92m[1m    10[0m[0m
      [1, 64]         |  [34m[1m      6[0m[0m  |  [34m[1m    10[0m[0m
      [1, 1024]       |        8  |  [34m[1m    10[0m[0m
      [1, 10000]      |  [2m[91m     16[0m[0m  |      11
      [64, 1]         |        6  |  [92m[1m    10[0m[0m
      [64, 64]        |  [2m[91m     13[0m[0m  |      15
      [64, 1024]      |  [31m[1m     77[0m[0m  |  [31m[1m   140[0m[0m
      [64, 10000]     |  [31m[1m    700[0m[0m  |  [31m[1m  1000[0m[0m
      [1024, 1]       |       10  |      10
      [1024, 64]      |  [31m[1m    100[0m[0m  |  [31m[1m    82[0m[0m
      [1024, 1024]    |  [31m[1m   1000[0m[0m  |  [31m[1m  2300[0m[0m
      [1024, 10000]   |  [31m[1m  20000[0m[0m  |  [31m[1m 22000[0m[0m
      [10000, 1]      |  [31m[1m

# 6. Saving/Loading benchmark results

Measurements (and CallgrindStats which are described in section 8) can be serialized by the pickle module. This makes A/B testing easy, as you can collect measurements from two separate environments, pickle them, and then load both in a single environment. Timer even takes an env constructor argument so that such A/B testing works seamlessly.

Let’s imagine that rather than two Python functions, the add/sum and bmm approaches were in two different builds of PyTorch. The example below demonstrates how one might A/B test them. For simplicity, we only use a subset of shapes, and simply round trip results through pickle rather than actually using multiple environments and writing results to disk.

测量（以及第 8 节中描述的 CallgrindStats）可以由 pickle 模块序列化。 这使得 A/B 测试变得容易，因为您可以从两个独立的环境收集测量结果，腌制它们，然后将它们加载到单个环境中。 Timer 甚至采用 env 构造函数参数，以便此类 A/B 测试无缝运行。

让我们想象一下，add/sum 和 bmm 方法不是两个 Python 函数，而是位于 PyTorch 的两个不同版本中。 下面的示例演示了如何对它们进行 A/B 测试。 为简单起见，我们仅使用形状的子集，并通过 pickle 简单地往返结果，而不是实际使用多个环境并将结果写入磁盘。

In [26]:
import pickle

In [27]:
ab_test_results = []
for env in ('environment A: mul/sum', 'environment B: bmm'):
    for b, n in ((1, 1), (1024, 10000), (10000, 1)):
        x = torch.ones((b, n))
        dot_fn = (batched_dot_mul_sum if env == 'environment A: mul/sum' else batched_dot_bmm)
        m = benchmark.Timer(
            stmt='batched_dot(x, x)',
            globals={'x': x, 'batched_dot': dot_fn},
            num_threads=1,
            label='Batched dot',
            description=f'[{b}, {n}]',
            env=env,
        ).blocked_autorange(min_run_time=1)
        ab_test_results.append(pickle.dumps(m))

ab_results = [pickle.loads(i) for i in ab_test_results]
compare = benchmark.Compare(ab_results)
compare.trim_significant_figures()
compare.colorize()
compare.print()

[------------------------------------- Batched dot -------------------------------------]
                                               |  [1, 1]  |  [1024, 10000]  |  [10000, 1]
1 threads: ------------------------------------------------------------------------------
  (environment A: mul/sum)  batched_dot(x, x)  |  [92m[1m  6   [0m[0m  |  [92m[1m    20000    [0m[0m  |  [92m[1m    35    [0m[0m
  (environment B: bmm)      batched_dot(x, x)  |    10    |      20000      |      53    

Times are in microseconds (us).



In [28]:
# And just to show that we can round trip all of the results from earlier:
round_tripped_results = pickle.loads(pickle.dumps(results))
assert(str(benchmark.Compare(results)) == str(benchmark.Compare(round_tripped_results)))

# 7. Generating inputs with Fuzzed Parameters

As we’ve seen in the previous section, there can be some stark performance differences depending on the input tensors. Hence, it is a good idea to run benchmarks on a number of different inputs. However, creating all these input tensors can be tedious which is where torch.utils.benchmark.Fuzzer and related classes come in. Let’s take a look at how we can use the Fuzzer to create some test cases for the benchmark.

正如我们在上一节中所看到的，根据输入张量的不同，可能会存在一些明显的性能差异。 因此，最好对许多不同的输入运行基准测试。 然而，创建所有这些输入张量可能很乏味，这就是 torch.utils.benchmark.Fuzzer 和相关类的用武之地。让我们看看如何使用 Fuzzer 为基准创建一些测试用例。

In [29]:
from torch.utils.benchmark import Fuzzer, FuzzedParameter, FuzzedTensor, ParameterAlias

In [30]:
# Generates random tensors with 128 to 10000000 elements and sizes k0 and k1 chosen from a
# ``loguniform`` distribution in [1, 10000], 40% of which will be discontiguous on average.
example_fuzzer = Fuzzer(
    parameters=[
        FuzzedParameter(name='k0', minval=1, maxval=10000, distribution='loguniform'),
        FuzzedParameter(name='k1', minval=1, maxval=10000, distribution='loguniform'),
    ],
    tensors=[
        FuzzedTensor(name='x', size=('k0', 'k1'), min_elements=128, max_elements=10000000, probability_contiguous=0.6)
    ],
    seed=0
)
example_fuzzer

<torch.utils.benchmark.utils.fuzzer.Fuzzer at 0x1e08eac9c10>

In [31]:
list(example_fuzzer.take(1))

[({'x': tensor([[0.7821, 0.0536, 0.9888,  ..., 0.5545, 0.2512, 0.7045],
           [0.3418, 0.9983, 0.5456,  ..., 0.2689, 0.4601, 0.3495],
           [0.2495, 0.0588, 0.6216,  ..., 0.2450, 0.2342, 0.9557],
           ...,
           [0.2510, 0.1321, 0.3304,  ..., 0.6740, 0.4588, 0.6077],
           [0.2013, 0.5539, 0.7535,  ..., 0.5495, 0.1422, 0.7964],
           [0.3055, 0.8517, 0.1035,  ..., 0.6941, 0.7315, 0.9100]])},
  {'x': {'numel': 186325,
    'order': array([0, 1]),
    'steps': (1, 1),
    'is_contiguous': True,
    'dtype': 'torch.float32'}},
  {'k0': 725, 'k1': 257})]

In [32]:
results = []
for tensors, tensor_params, params in example_fuzzer.take(10):
    # description is the column label
    sub_label = f"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}" # 左对齐
    results.append(benchmark.Timer(
        stmt='batched_dot_mul_sum(x, x)',
        setup='from __main__ import batched_dot_mul_sum',
        globals=tensors,
        label='Batched dot',
        sub_label=sub_label,
        description='mul/sum',
    ).blocked_autorange(min_run_time=1))
    results.append(benchmark.Timer(
        stmt='batched_dot_bmm(x, x)',
        setup='from __main__ import batched_dot_bmm',
        globals=tensors,
        label='Batched dot',
        sub_label=sub_label,
        description='bmm',
    ).blocked_autorange(min_run_time=1))
compare = benchmark.Compare(results)
compare.colorize()
compare.print()

[---------------------- Batched dot ----------------------]
                                     |  mul/sum  |    bmm  
1 threads: ------------------------------------------------
      725    x 257                   |  [31m[1m  215.8[0m[0m  |  [31m[1m  191.0[0m[0m
      49     x 383                   |  [2m[91m   26.5[0m[0m  |  [2m[91m   28.6[0m[0m
      34     x 1468                  |  [31m[1m   57.4[0m[0m  |  [31m[1m  109.7[0m[0m
      187    x 5039                  |  [31m[1m 1006.0[0m[0m  |  [31m[1m 1860.1[0m[0m
      2140   x 1296 (discontiguous)  |  [31m[1m 3809.0[0m[0m  |  [31m[1m34743.8[0m[0m
      78     x 1598                  |  [31m[1m  132.0[0m[0m  |  [31m[1m  257.8[0m[0m
      519    x 763                   |  [31m[1m  407.9[0m[0m  |  [31m[1m  818.2[0m[0m
      141    x 1082                  |  [31m[1m  161.8[0m[0m  |  [31m[1m  312.9[0m[0m
      78     x 5    (discontiguous)  |  [34m[1m    8.3[0m[0m  | 

There is a lot of flexibility for defining your own fuzzers which is great for creating a powerful set of inputs to benchmark. But to make things even simpler, PyTorch benchmark module comes with some built-in fuzzers for common benchmarking needs. Let’s take a look at how we can use one of these built-in fuzzers.

定义自己的fuzzers有很大的灵活性，这对于创建一组强大的基准输入非常有用。 但为了让事情变得更简单，PyTorch 基准测试模块附带了一些内置的fuzzers，可以满足常见的基准测试需求。 让我们看一下如何使用这些内置模糊器之一。

In [33]:
from torch.utils.benchmark.op_fuzzers import binary

In [34]:
results = []
for tensor, tensor_params, params in binary.BinaryOpFuzzer(seed=0).take(10):
    sub_label=f"{params['k0']:<6} x {params['k1']:<4} {'' if tensor_params['x']['is_contiguous'] else '(discontiguous)'}"
    results.append(benchmark.Timer(
        stmt='batched_dot_mul_sum(x, x)',
        setup='from __main__ import batched_dot_mul_sum',
        globals=tensors,
        label='Batched dot',
        sub_label=sub_label,
        description='mul/sum'
    ).blocked_autorange(min_run_time=1))
    results.append(benchmark.Timer(
        stmt='batched_dot_bmm(x, x)',
        setup='from __main__ import batched_dot_bmm',
        globals=tensors,
        label='Batched dot',
        sub_label=sub_label,
        description='bmm'
    ).blocked_autorange(min_run_time=1))
compare = benchmark.Compare(results)
compare.trim_significant_figures()
compare.colorize()
compare.print()

[---------------------- Batched dot ----------------------]
                                         |  mul/sum  |  bmm
1 threads: ------------------------------------------------
      64     x 473  (discontiguous)      |  [34m[1m  7.7  [0m[0m  |  [34m[1m 11[0m[0m
      16384  x 12642115 (discontiguous)  |  [34m[1m  7.6  [0m[0m  |  [34m[1m 11[0m[0m
      8192   x 892                       |  [34m[1m  7.7  [0m[0m  |  [34m[1m 11[0m[0m
      512    x 64   (discontiguous)      |  [34m[1m  8.1  [0m[0m  |  [34m[1m 11[0m[0m
      493    x 27   (discontiguous)      |  [34m[1m  7.7  [0m[0m  |  [34m[1m 11[0m[0m
      118    x 32   (discontiguous)      |  [34m[1m  7.6  [0m[0m  |  [34m[1m 11[0m[0m
      16     x 495  (discontiguous)      |  [34m[1m  7.6  [0m[0m  |  [34m[1m 11[0m[0m
      488    x 62374                     |  [34m[1m  8.0  [0m[0m  |  [34m[1m 11[0m[0m
      240372 x 69                        |  [92m[1m  7.5  [0m[0m

# 8. Collecting instruction counts with Callgrind

One of the challenges of optimizing code is the variation and opacity of wall time. There are many sources of non-determinism, from adaptive clock speeds to resource contention with other processes. Furthermore, end-to-end time gives no insight into where time is being spent, which is really what we’re interested in when optimizing code.

A complementary approach is to also collect instruction counts. These counts are a proxy metric and do not capture all aspects of performance (e.g. memory or I/O bound tasks), however they do have several useful properties. Instruction counts are reproducible, insensitive to environmental variation, and offer fine grained insight into where a program is spending cycles.

To see the utility of instruction counts, let us look at how we might reduce the overhead of batched_dot_mul_sum. The obvious solution is to move it to C++, so we avoid going between Python and C++ multiple times.

Fortunately, the source is nearly identical. One question that we have to ask in C++ is whether we should take arguments by value or reference.

优化代码的挑战之一是挂机时间的变化和不透明性。 不确定性的来源有很多，从自适应时钟速度到与其他进程的资源争用。 此外，端到端时间无法洞察时间花在哪里，而这正是我们在优化代码时真正感兴趣的。

一种补充方法是还收集指令计数。 这些计数是一个代理指标，并不能捕获性能的所有方面（例如内存或 I/O 密集型任务），但它们确实有几个有用的属性。 指令计数是可重复的，对环境变化不敏感，并提供对程序花费周期的细粒度洞察。

为了了解指令计数的实用性，让我们看看如何减少 batched_dot_mul_sum 的开销。 显而易见的解决方案是将其转移到 C++，这样我们就可以避免多次在 Python 和 C++ 之间切换。

幸运的是，来源几乎相同。 在 C++ 中我们必须问的一个问题是我们是否应该按值或引用获取参数。

In [36]:
batched_dot_src = """\
/* ---- Python ---- */
// def batched_dot_mul_sum(a, b):
//     return a.mul(b).sum(-1)

torch::Tensor batched_dot_mul_sum_v0(
    const torch::Tensor a,
    const torch::Tensor b) {
  return a.mul(b).sum(-1);
}

torch::Tensor batched_dot_mul_sum_v1(
    const torch::Tensor& a,
    const torch::Tensor& b) {
  return a.mul(b).sum(-1);
}
"""


# PyTorch makes it easy to test our C++ implementations by providing a utility
# to JIT compile C++ source into Python extensions:
import os
from torch.utils import cpp_extension
cpp_lib = cpp_extension.load_inline(
    name='cpp_lib',
    cpp_sources=batched_dot_src,
    extra_cflags=['-O3'],
    extra_include_paths=[
        # `load_inline` needs to know where to find ``pybind11`` headers.
        os.path.join(os.getenv('CONDA_PREFIX'), 'include')
    ],
    functions=['batched_dot_mul_sum_v0', 'batched_dot_mul_sum_v1']
)

# `load_inline` will create a shared object that is loaded into Python. When we collect
# instruction counts Timer will create a subprocess, so we need to re-import it. The
# import process is slightly more complicated for C extensions, but that's all we're
# doing here.
module_import_str = f"""\
# https://stackoverflow.com/questions/67631/how-to-import-a-module-given-the-full-path
import importlib.util
spec = importlib.util.spec_from_file_location("cpp_lib", {repr(cpp_lib.__file__)})
cpp_lib = importlib.util.module_from_spec(spec)
spec.loader.exec_module(cpp_lib)"""

import textwrap
def pretty_print(result):
    """Import machinery for ``cpp_lib.so`` can get repetitive to look at."""
    print(repr(result).replace(textwrap.indent(module_import_str, "  "), "  import cpp_lib"))

t_baseline = benchmark.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='''\
from __main__ import batched_dot_mul_sum
x = torch.randn(2, 2)''')

t0 = benchmark.Timer(
    stmt='cpp_lib.batched_dot_mul_sum_v0(x, x)',
    setup=f'''\
{module_import_str}
x = torch.randn(2, 2)''')

t1 = benchmark.Timer(
    stmt='cpp_lib.batched_dot_mul_sum_v1(x, x)',
    setup=f'''\
{module_import_str}
x = torch.randn(2, 2)''')

# Moving to C++ did indeed reduce overhead, but it's hard to tell which
# calling convention is more efficient. v1 (call with references) seems to
# be a bit faster, but it's within measurement error.
pretty_print(t_baseline.blocked_autorange())
pretty_print(t0.blocked_autorange())
pretty_print(t1.blocked_autorange())

ImportError: DLL load failed while importing cpp_lib: 找不到指定的模块。

In [None]:
# Let's use ``Callgrind`` to determine which is better.
stats_v0 = t0.collect_callgrind()
stats_v1 = t1.collect_callgrind()

pretty_print(stats_v0)
pretty_print(stats_v1)

# `.as_standardized` removes file names and some path prefixes, and makes
# it easier to read the function symbols.
stats_v0 = stats_v0.as_standardized()
stats_v1 = stats_v1.as_standardized()

# `.delta` diffs the instruction counts, and `.denoise` removes several
# functions in the Python interpreter that are known to have significant
# jitter.
delta = stats_v1.delta(stats_v0).denoise()

# `.transform` is a convenience API for transforming function names. It is
# useful for increasing cancelation when ``diff-ing`` instructions, as well as
# just generally improving readability.
replacements = (
    ("???:void pybind11", "pybind11"),
    ("batched_dot_mul_sum_v0", "batched_dot_mul_sum_v1"),
    ("at::Tensor, at::Tensor", "..."),
    ("at::Tensor const&, at::Tensor const&", "..."),
    ("auto torch::detail::wrap_pybind_function_impl_", "wrap_pybind_function_impl_"),
)
for before, after in replacements:
    delta = delta.transform(lambda l: l.replace(before, after))

# We can use print options to control how much of the function to display.
torch.set_printoptions(linewidth=160)

# Once parsed, the instruction counts make clear that passing `a` and `b`
# by reference is more efficient as it skips some ``c10::TensorImpl`` bookkeeping
# for the intermediate Tensors, and is also works better with ``pybind11``. This
# is consistent with our noisy wall time observations.
print(delta)

OSError: Valgrind is not supported on this platform.