# Understanding Dynamo Interpreter and graph breaks

In this notebook, I try to understand TorchDynamo's bytecode interpreter (its VM) by working through a few essential cases.
Each case is chosen by following the logic of the interpreter's source code.

In [1]:
import torch
from torch import _inductor, _dynamo
from torch._inductor import config as iconfig
from torch._dynamo import config as dconfig

In [21]:
iconfig.debug = True
dconfig.output_code = True
import logging
dconfig.log_level = logging.DEBUG
dconfig.suppress_errors = True

## compile

In [3]:
def add_kernel(a, b):
    return a + b

In [4]:
a = torch.randn((3, 4), device='cuda')
b = torch.randn((3,4), device='cuda')
a = torch.abs(a)
b = torch.abs(b)
# both a and b are now positive

In [5]:
fn = torch.compile(add_kernel)
fn(a, b)

[2023-04-04 10:02:46,415] torch._dynamo.eval_frame: [DEBUG] skipping __init__ /usr/lib/python3.8/contextlib.py
[2023-04-04 10:02:46,416] torch._dynamo.eval_frame: [DEBUG] skipping __enter__ /usr/lib/python3.8/contextlib.py
[2023-04-04 10:02:46,416] torch._dynamo.eval_frame: [DEBUG] skipping __init__ /usr/lib/python3.8/contextlib.py
[2023-04-04 10:02:46,417] torch._dynamo.eval_frame: [DEBUG] skipping __enter__ /usr/lib/python3.8/contextlib.py
[2023-04-04 10:02:46,417] torch._dynamo.eval_frame: [DEBUG] skipping enable_dynamic /home/chunwei/newenv2/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py
[2023-04-04 10:02:46,440] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing add_kernel
[2023-04-04 10:02:46,441] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/176408799.py:2
[2023-04-04 10:02:46,441] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST a []
[2023-04-04 10:02:46,442] torch._dynamo.symbolic_convert: [DEBUG] TRACE LO

tensor([[2.8210, 1.2094, 0.9151, 1.7770],
        [2.7160, 0.5586, 1.1959, 0.3365],
        [2.3278, 1.1822, 0.6505, 1.3714]], device='cuda:0')

The modified bytecode is:

```
1           0 LOAD_GLOBAL              0 (__compiled_fn_0)
              2 LOAD_FAST                0 (a)
              4 LOAD_FAST                1 (b)
              6 CALL_FUNCTION            2
              8 UNPACK_SEQUENCE          1
             10 RETURN_VALUE
```
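To compare the rewritten bytecode against what CPython produced originally, the standard `dis` module is enough. The snippet below is a minimal sketch that needs no tensors, because `dis` only inspects the code object. Exact opcode names depend on the Python version (for example `BINARY_ADD` on 3.8 versus `BINARY_OP` on 3.11 and later).

```python
import dis

def add_kernel(a, b):
    return a + b

# Collect the opcode names of the untouched function.
opnames = [ins.opname for ins in dis.get_instructions(add_kernel)]

# Print the same kind of listing that Dynamo writes to its debug log.
dis.dis(add_kernel)
```

The original function holds the add itself, while the rewritten one only loads `__compiled_fn_0` and calls it.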

## Add a graph break

In [6]:
def add_kernel1(a, b):
    if a.sum() > 0:
        return a + b
    return a - b

In [7]:
_dynamo.reset()
fn = torch.compile(add_kernel1)
fn(a, b)

[2023-04-04 10:02:47,874] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing add_kernel1
[2023-04-04 10:02:47,874] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/510685252.py:2
[2023-04-04 10:02:47,875] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST a []
[2023-04-04 10:02:47,875] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_ATTR sum [TensorVariable()]
[2023-04-04 10:02:47,876] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION 0 [GetAttrVariable(TensorVariable(), sum)]
[2023-04-04 10:02:47,878] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST 0 [TensorVariable()]
[2023-04-04 10:02:47,878] torch._dynamo.symbolic_convert: [DEBUG] TRACE COMPARE_OP > [TensorVariable(), ConstantVariable(int)]
[2023-04-04 10:02:47,880] torch._dynamo.symbolic_convert: [DEBUG] TRACE POP_JUMP_IF_FALSE 20 [TensorVariable()]
[2023-04-04 10:02:47,880] torch._dynamo.symbolic_convert: [DEBUG] generic_jump triggered compile
[2023-0

tensor([[2.8210, 1.2094, 0.9151, 1.7770],
        [2.7160, 0.5586, 1.1959, 0.3365],
        [2.3278, 1.1822, 0.6505, 1.3714]], device='cuda:0')

In [8]:
fn(-a, b)

[2023-04-04 10:02:48,003] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing <resume in add_kernel1>
[2023-04-04 10:02:48,003] torch._dynamo.symbolic_convert: [DEBUG] TRACE JUMP_ABSOLUTE 22 []
[2023-04-04 10:02:48,004] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/510685252.py:4
[2023-04-04 10:02:48,004] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST a []
[2023-04-04 10:02:48,004] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST b [TensorVariable()]
[2023-04-04 10:02:48,005] torch._dynamo.symbolic_convert: [DEBUG] TRACE BINARY_SUBTRACT None [TensorVariable(), TensorVariable()]
[2023-04-04 10:02:48,007] torch._dynamo.symbolic_convert: [DEBUG] TRACE RETURN_VALUE None [TensorVariable()]
[2023-04-04 10:02:48,007] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo done tracing <resume in add_kernel1> (RETURN_VALUE)
[2023-04-04 10:02:48,008] torch._dynamo.symbolic_convert: [DEBUG] RETURN_VALUE triggered compile
[

tensor([[-2.8210, -1.2094, -0.9151, -1.7770],
        [-2.7160, -0.5586, -1.1959, -0.3365],
        [-2.3278, -1.1822, -0.6505, -1.3714]], device='cuda:0')

In the function above, the `if a.sum() > 0` control flow has two branches.
We call it with the inputs `(a, b)` and `(-a, b)`, which trigger the then branch and the else branch respectively.
Inductor compilation is activated in both cases.

From reading the interpreter code, it is clear that once `generic_jump` hits a graph break, it will

1. stop gathering instructions into the current subgraph and
    - freeze and compile that subgraph immediately, and emit a CALL to `__compiled_xxx`
2. create two `resume_at` function calls, one for the then branch and one for the else branch

Digging into the log, we can find the original bytecode of the `add_kernel1` function:

```python
  2           0 LOAD_FAST                0 (a)
              2 LOAD_METHOD              0 (sum)
              4 CALL_METHOD              0
              6 LOAD_CONST               1 (0)
              8 COMPARE_OP               4 (>)
             10 POP_JUMP_IF_FALSE       20

  3          12 LOAD_FAST                0 (a)
             14 LOAD_FAST                1 (b)
             16 BINARY_ADD
             18 RETURN_VALUE

  4     >>   20 LOAD_FAST                0 (a)
             22 LOAD_FAST                1 (b)
             24 BINARY_SUBTRACT
             26 RETURN_VALUE
```

It is first rewritten into the code below, following the logic of `generic_jump`:


```python
  1           0 LOAD_GLOBAL              1 (__compiled_fn_15)
              2 LOAD_FAST                0 (a)
              4 CALL_FUNCTION            1
              6 UNPACK_SEQUENCE          1
              8 POP_JUMP_IF_FALSE       20
             10 LOAD_GLOBAL              2 (__resume_at_12_16)
             12 LOAD_FAST                0 (a)
             14 LOAD_FAST                1 (b)
             16 CALL_FUNCTION            2
             18 RETURN_VALUE
        >>   20 LOAD_GLOBAL              3 (__resume_at_20_17)
             22 LOAD_FAST                0 (a)
             24 LOAD_FAST                1 (b)
             26 CALL_FUNCTION            2
             28 RETURN_VALUE
```

The code above is the basic pattern of a graph break in a normal `generic_jump`.

`__compiled_fn_15` is the compiled graph for the condition, and `__resume_at_12_16` and `__resume_at_20_17` are the then-block and else-block of the if-else respectively.

```python
 __compiled_fn_15 <eval_with_key>.66 class GraphModule(torch.nn.Module):
    def forward(self, a : torch.Tensor):
        # File: /tmp/ipykernel_69518/510685252.py:2, code: if a.sum() > 0:
        sum_1 = a.sum();  a = None
        gt = sum_1 > 0;  sum_1 = None
        return (gt,)
```
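Read back into Python, the rewritten bytecode has roughly the following shape. This is a hand-written sketch, not code generated by Dynamo. The names `compiled_cond`, `resume_then` and `resume_else` are hypothetical stand-ins for `__compiled_fn_15`, `__resume_at_12_16` and `__resume_at_20_17`, and plain lists stand in for tensors so that the snippet runs without a GPU.

```python
def compiled_cond(a):
    # Stand-in for __compiled_fn_15: returns a 1-tuple, like the FX graph.
    return (sum(a) > 0,)

def resume_then(a, b):
    # Stand-in for __resume_at_12_16 (the `return a + b` branch).
    return [x + y for x, y in zip(a, b)]

def resume_else(a, b):
    # Stand-in for __resume_at_20_17 (the `return a - b` branch).
    return [x - y for x, y in zip(a, b)]

def add_kernel1_rewritten(a, b):
    (cond,) = compiled_cond(a)        # CALL_FUNCTION + UNPACK_SEQUENCE
    if cond:                          # POP_JUMP_IF_FALSE
        return resume_then(a, b)
    return resume_else(a, b)
```

The branch itself stays in ordinary Python bytecode; only the code on either side of it is handed to the compiler.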

*Dynamo keeps extracting subgraphs greedily, even when a subgraph holds only a single op, which is a bit insane.*

The `resume_at_xx` blocks will be traced further. Although there are two `resume_at_xx` blocks, only the one actually reached with the current inputs is traced; the other one's bytecode is left unchanged.

This raises a question: **does tracing actually do two things?**

1. Visit each instruction and append it to output_graph if necessary
2. Evaluate the instruction and update the stack with the resulting value

The answer is YES.
For each instruction, the tracer evaluates it in place, updates the stack with the resulting value (wrapped in a `VariableTracker`), and then appends the instruction to output_graph when necessary.

## Calling an unsupported op causes a graph break

In [9]:
from collections import OrderedDict
from torch_scatter import scatter_mul
dic = OrderedDict()

def kernel19(src, index, out):
    out = scatter_mul(src, index, dim=1, out=out)

fn = torch.compile(kernel19)

src = torch.Tensor([[2, 0, 1, 4, 3], [0, 2, 1, 3, 4]])
index = torch.tensor([[4, 5, 4, 2, 3], [0, 0, 2, 2, 1]])
out = src.new_zeros((2, 6))
fn(src, index, out)




[2023-04-04 10:02:48,099] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing kernel19
[2023-04-04 10:02:48,100] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/115080304.py:6
[2023-04-04 10:02:48,100] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_GLOBAL scatter_mul []
[2023-04-04 10:02:48,101] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST src [UserFunctionVariable()]
[2023-04-04 10:02:48,102] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST index [UserFunctionVariable(), TensorVariable()]
[2023-04-04 10:02:48,102] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST 1 [UserFunctionVariable(), TensorVariable(), TensorVariable()]
[2023-04-04 10:02:48,102] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST out [UserFunctionVariable(), TensorVariable(), TensorVariable(), ConstantVariable(int)]
[2023-04-04 10:02:48,103] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST ('dim', 'out') [UserFunctionVar

In [10]:
dic = OrderedDict()
def kernel20(a):
    a = a + 1.
    dic["name"] = a
    a = a * 2
    return a

fn = torch.compile(kernel20)
fn(a)


[2023-04-04 10:02:48,200] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing kernel20
[2023-04-04 10:02:48,201] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/2455507784.py:3
[2023-04-04 10:02:48,201] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST a []
[2023-04-04 10:02:48,202] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST 1.0 [TensorVariable()]
[2023-04-04 10:02:48,202] torch._dynamo.symbolic_convert: [DEBUG] TRACE BINARY_ADD None [TensorVariable(), ConstantVariable(float)]
[2023-04-04 10:02:48,204] torch._dynamo.symbolic_convert: [DEBUG] TRACE STORE_FAST a [TensorVariable()]
[2023-04-04 10:02:48,204] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/2455507784.py:4
[2023-04-04 10:02:48,205] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST a []
[2023-04-04 10:02:48,205] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_GLOBAL dic [TensorVariable()]
[2023-04-04 10:02:48,206]

tensor([[4.1813, 2.9671, 2.2853, 3.1537],
        [4.0779, 2.9101, 2.6452, 2.5583],
        [3.2492, 3.0482, 3.0148, 2.7530]], device='cuda:0')

## How could FX graph capture the ops?

The `call_function` is handled by `BuiltinVariable`, and once the call target passes the `is_allowed` check (that is, it lives inside Torch), `output` adds a node to its subgraph.

In [11]:
def add_kernel2(a, b):
    a0 = 0
    a1 = 1
    a2 = 2
    a3 = a0 + a1 + a2
    if a.sum() > 0:
        return a + b + a3
    return a - b

In [22]:
_dynamo.reset()
fn = torch.compile(add_kernel2)

In [23]:
fn(a,b)

[2023-04-04 10:06:04,970] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing add_kernel2
[2023-04-04 10:06:04,970] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/1377176595.py:2
[2023-04-04 10:06:04,971] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST 0 []
[2023-04-04 10:06:04,971] torch._dynamo.symbolic_convert: [DEBUG] TRACE STORE_FAST a0 [ConstantVariable(int)]
[2023-04-04 10:06:04,972] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/1377176595.py:3
[2023-04-04 10:06:04,972] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST 1 []
[2023-04-04 10:06:04,972] torch._dynamo.symbolic_convert: [DEBUG] TRACE STORE_FAST a1 [ConstantVariable(int)]
[2023-04-04 10:06:04,973] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/1377176595.py:4
[2023-04-04 10:06:04,973] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST 2 []
[2023-04-04 10:06:04,974] torch._dynamo

tensor([[14.4727,  7.8387,  7.7772, 10.7776],
        [14.1011,  4.0767,  8.5622,  3.6233],
        [13.8441,  7.4727,  4.3660,  9.3458]], device='cuda:0')

In the subgraph for the then branch, the FX graph is

```python
 __compiled_fn_15 <eval_with_key>.66 class GraphModule(torch.nn.Module):
    def forward(self, a : torch.Tensor, b : torch.Tensor):
        # File: /tmp/ipykernel_242802/1377176595.py:7, code: return a + b + a3
        add = a + b;  a = b = None
        add_1 = add + 3;  add = None
        return (add_1,)
```
The **3** comes from tracing the following logic:

```python
    a0 = 0
    a1 = 1
    a2 = 2
    a3 = a0 + a1 + a2
```


The tracer keeps evaluating the non-Torch instructions and updating the frame's stack, so non-tensor computations are replaced with constant values.
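The folding behaviour can be mimicked with a toy stack machine. This is a simplified model and not Dynamo's implementation: `Const` and `Sym` are hypothetical stand-ins for `ConstantVariable` and `TensorVariable`, and the program is a hand-written approximation of the relevant bytecode of `add_kernel2`.

```python
from dataclasses import dataclass, field

@dataclass
class Const:          # toy counterpart of ConstantVariable
    value: object

@dataclass
class Sym:            # toy counterpart of TensorVariable (a graph node name)
    name: str

@dataclass
class ToyTracer:
    locals_: dict
    stack: list = field(default_factory=list)
    graph: list = field(default_factory=list)

    def run(self, program):
        for op, arg in program:
            getattr(self, op)(arg)
        return self.stack.pop()

    def LOAD_CONST(self, value):
        self.stack.append(Const(value))

    def LOAD_FAST(self, name):
        self.stack.append(self.locals_[name])

    def STORE_FAST(self, name):
        self.locals_[name] = self.stack.pop()

    def BINARY_ADD(self, _):
        rhs, lhs = self.stack.pop(), self.stack.pop()
        if isinstance(lhs, Const) and isinstance(rhs, Const):
            # Pure Python values: evaluate now, record nothing.
            self.stack.append(Const(lhs.value + rhs.value))
        else:
            # At least one "tensor": record a node and push its handle.
            node = Sym(f"add_{len(self.graph)}")
            self.graph.append((node.name, "add", lhs, rhs))
            self.stack.append(node)

program = [
    ("LOAD_CONST", 0), ("STORE_FAST", "a0"),
    ("LOAD_CONST", 1), ("STORE_FAST", "a1"),
    ("LOAD_CONST", 2), ("STORE_FAST", "a2"),
    ("LOAD_FAST", "a0"), ("LOAD_FAST", "a1"), ("BINARY_ADD", None),
    ("LOAD_FAST", "a2"), ("BINARY_ADD", None), ("STORE_FAST", "a3"),
    ("LOAD_FAST", "a"), ("LOAD_FAST", "b"), ("BINARY_ADD", None),
    ("LOAD_FAST", "a3"), ("BINARY_ADD", None),
]

tracer = ToyTracer(locals_={"a": Sym("a"), "b": Sym("b")})
result = tracer.run(program)
```

The recorded graph ends up with two add nodes, the second one taking the folded constant 3, which mirrors `add_1 = add + 3` in the FX graph.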

## Skip frame if graph break in a loop

According to the interpreter's code, Dynamo skips tracing a frame once it hits a graph break inside a `for` or `while` loop.

In [14]:
def add_kernel3(a, b):
    for i in range(5):
        sum = a
        if a.sum() > 0:
            sum += b
    return sum

In [15]:
fn = torch.compile(add_kernel3)

In [16]:
_dynamo.reset()
fn(a, b)

[2023-04-04 10:02:48,762] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing add_kernel3
[2023-04-04 10:02:48,763] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/1226052577.py:2
[2023-04-04 10:02:48,764] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_GLOBAL range []
[2023-04-04 10:02:48,765] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST 5 [BuiltinVariable(range)]
[2023-04-04 10:02:48,765] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION 1 [BuiltinVariable(range), ConstantVariable(int)]
[2023-04-04 10:02:48,766] torch._dynamo.symbolic_convert: [DEBUG] TRACE GET_ITER None [RangeVariable()]
[2023-04-04 10:02:48,767] torch._dynamo.symbolic_convert: [DEBUG] TRACE FOR_ITER 38 [ListIteratorVariable()]
[2023-04-04 10:02:48,768] torch._dynamo.symbolic_convert: [DEBUG] TRACE STORE_FAST i [ListIteratorVariable(), ConstantVariable(int)]
[2023-04-04 10:02:48,768] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_lin

tensor([[9.7424, 4.1129, 4.0048, 6.5775],
        [9.4241, 0.9731, 4.6890, 0.5659],
        [9.1408, 3.8146, 1.2229, 5.3509]], device='cuda:0')

The log contains `[2023-03-30 13:23:10,277] torch._dynamo.symbolic_convert: [INFO] Skipping frame because there is a graph break in a for/while loop`, so the whole frame is skipped by the tracer.
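My reading of the interpreter is that "inside a loop" is decided at the bytecode level, by looking for a jump whose target lies at an earlier offset. The sketch below illustrates that idea with the standard `dis` module. It is a simplification and an assumption about the approach: `has_backward_jump` is a hypothetical helper that scans the whole function rather than only the instructions after the break.

```python
import dis

def has_backward_jump(fn) -> bool:
    """Return True if fn's bytecode contains a jump to an earlier offset."""
    for ins in dis.get_instructions(fn):
        if "JUMP" in ins.opname and isinstance(ins.argval, int):
            if ins.argval < ins.offset:
                return True
    return False

def with_loop(a, b):
    total = a
    for _ in range(5):
        total = total + b
    return total

def without_loop(a, b):
    if a > 0:
        return a + b
    return a - b

loop_detected = has_backward_jump(with_loop)
no_loop_detected = has_backward_jump(without_loop)
```

A branch alone only jumps forward, so it allows a partial graph plus resume functions; a loop adds a backward jump, and a graph break there makes the frame fall back to eager execution.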

## What is a Guard?

In [17]:
def simple_kernel(a, b, actived:bool):
    if actived:
        return a + b
    else:
        return a - b

fn = torch.compile(simple_kernel)
fn(a, b, True)


[2023-04-04 10:02:48,828] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing simple_kernel
[2023-04-04 10:02:48,829] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/4251916827.py:2
[2023-04-04 10:02:48,829] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST actived []
[2023-04-04 10:02:48,830] torch._dynamo.symbolic_convert: [DEBUG] TRACE POP_JUMP_IF_FALSE 12 [ConstantVariable(bool)]
[2023-04-04 10:02:48,830] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/4251916827.py:3
[2023-04-04 10:02:48,831] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST a []
[2023-04-04 10:02:48,831] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST b [TensorVariable()]
[2023-04-04 10:02:48,832] torch._dynamo.symbolic_convert: [DEBUG] TRACE BINARY_ADD None [TensorVariable(), TensorVariable()]
[2023-04-04 10:02:48,834] torch._dynamo.symbolic_convert: [DEBUG] TRACE RETURN_VALUE None [TensorVariable()]
[2023-04

tensor([[11.4727,  4.8387,  4.7772,  7.7776],
        [11.1011,  1.0767,  5.5622,  0.6233],
        [10.8441,  4.4727,  1.3660,  6.3458]], device='cuda:0')
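The trace above treats `actived` as a `ConstantVariable(bool)` and follows only one branch, so the compiled code is valid only while `actived` keeps that value. A guard is the check that decides whether cached compiled code can be reused for a new call. The following toy cache is a minimal sketch of that idea in plain Python, not Dynamo's guard machinery; `specialize` and `guarded_call` are hypothetical names.

```python
from typing import Callable, List, Tuple

# Each entry pairs a guard (a predicate over the call arguments) with the
# specialized function that is valid while the guard holds.
cache: List[Tuple[Callable[..., bool], Callable]] = []
compile_count = 0

def specialize(actived: bool):
    """Stand-in for tracing simple_kernel with `actived` baked in as a constant."""
    global compile_count
    compile_count += 1
    if actived:
        return (lambda a, b, flag: flag is True), (lambda a, b, flag: a + b)
    return (lambda a, b, flag: flag is False), (lambda a, b, flag: a - b)

def guarded_call(a, b, actived: bool):
    for guard, specialized in cache:
        if guard(a, b, actived):              # guard passes: reuse cached code
            return specialized(a, b, actived)
    guard, specialized = specialize(actived)  # guard miss: "recompile"
    cache.append((guard, specialized))
    return specialized(a, b, actived)

first = guarded_call(1, 2, True)     # compiles the `a + b` variant
second = guarded_call(5, 1, True)    # guard passes, no new compilation
third = guarded_call(5, 1, False)    # guard fails, compiles the `a - b` variant
```

Calling `fn(a, b, False)` on the real compiled function should likewise trigger a new trace, because the guard on `actived` no longer holds.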

## Inline call

Will Dynamo always inline function calls?

In [18]:
def func0(a, b):
    return a + b

def func1(a, b):
    sum = a - b
    sum += func0(a, b)
    return sum

fn = torch.compile(func1)

fn(a, b)

[2023-04-04 10:02:48,903] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing func1
[2023-04-04 10:02:48,904] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/1059658919.py:5
[2023-04-04 10:02:48,904] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST a []
[2023-04-04 10:02:48,905] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST b [TensorVariable()]
[2023-04-04 10:02:48,905] torch._dynamo.symbolic_convert: [DEBUG] TRACE BINARY_SUBTRACT None [TensorVariable(), TensorVariable()]
[2023-04-04 10:02:48,907] torch._dynamo.symbolic_convert: [DEBUG] TRACE STORE_FAST sum [TensorVariable()]
[2023-04-04 10:02:48,908] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/1059658919.py:6
[2023-04-04 10:02:48,908] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST sum []
[2023-04-04 10:02:48,909] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_GLOBAL func0 [TensorVariable()]
[2023-04-04 10:02:48,909] t

tensor([[19.4847,  8.2257,  8.0096, 13.1550],
        [18.8481,  1.9462,  9.3779,  1.1318],
        [18.2816,  7.6292,  2.4458, 10.7018]], device='cuda:0')

In the log we can find

1. `[DEBUG] INLINING <code object func0 at 0x7f147d36e660, file "/tmp/ipykernel_354122/1059658919.py", line 1>`, which means that the tracer tries to inline func0
2. `DONE INLINING <code object func0 at 0x7f147d36e660, file "/tmp/ipykernel_354122/1059658919.py", line 1>`, which means func0 was successfully inlined into func1

Are there cases where inlining fails?

## Case 1: the helper function has a graph break

In [19]:
def func0(a, b):
    if a.sum() > 0: 
        return a + b
    return a - b

def func1(a, b):
    res = a - b
    res += func0(a, b)
    return res

fn = torch.compile(func1)

fn(a, b)

[2023-04-04 10:02:49,013] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing func1
[2023-04-04 10:02:49,013] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/3110369299.py:7
[2023-04-04 10:02:49,014] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST a []
[2023-04-04 10:02:49,014] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST b [TensorVariable()]
[2023-04-04 10:02:49,015] torch._dynamo.symbolic_convert: [DEBUG] TRACE BINARY_SUBTRACT None [TensorVariable(), TensorVariable()]
[2023-04-04 10:02:49,017] torch._dynamo.symbolic_convert: [DEBUG] TRACE STORE_FAST res [TensorVariable()]
[2023-04-04 10:02:49,017] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/3110369299.py:8
[2023-04-04 10:02:49,017] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST res []
[2023-04-04 10:02:49,018] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_GLOBAL func0 [TensorVariable()]
[2023-04-04 10:02:49,018] t

tensor([[19.4847,  8.2257,  8.0096, 13.1550],
        [18.8481,  1.9462,  9.3779,  1.1318],
        [18.2816,  7.6292,  2.4458, 10.7018]], device='cuda:0')

## Case 2: nested calls

In [20]:
def func0(a, b):
    if a.sum() > 0: 
        return a + b
    return a - b

def func1(a, b):
    res = a - b
    res += func0(a, b)
    return res

def func2(a, b):
    res = a - b
    res += func1(a, b)
    return res

fn = torch.compile(func2)

fn(a, b)

[2023-04-04 10:02:49,203] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing func2
[2023-04-04 10:02:49,203] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/722712950.py:12
[2023-04-04 10:02:49,204] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST a []
[2023-04-04 10:02:49,204] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST b [TensorVariable()]
[2023-04-04 10:02:49,204] torch._dynamo.symbolic_convert: [DEBUG] TRACE BINARY_SUBTRACT None [TensorVariable(), TensorVariable()]
[2023-04-04 10:02:49,206] torch._dynamo.symbolic_convert: [DEBUG] TRACE STORE_FAST res [TensorVariable()]
[2023-04-04 10:02:49,207] torch._dynamo.symbolic_convert: [DEBUG] TRACE starts_line /tmp/ipykernel_468424/722712950.py:13
[2023-04-04 10:02:49,207] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST res []
[2023-04-04 10:02:49,208] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_GLOBAL func1 [TensorVariable()]
[2023-04-04 10:02:49,208] t

tensor([[27.4968, 11.6127, 11.2419, 18.5324],
        [26.5952,  2.8156, 13.1936,  1.6404],
        [25.7192, 10.7857,  3.5256, 15.0578]], device='cuda:0')

We can find `FAILED INLINING <code object func0` and `FAILED INLINING <code object func1` in the log, which means that inlining failed for both func0 and func1.

The innermost graph break prevents the outer functions from being inlined, and thus has a huge impact on performance.
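The way a failure in the innermost callee propagates outward can be modelled with nested exception handling. This is a toy model that only reproduces the order of the log messages; the tables `CALLEES` and `HAS_GRAPH_BREAK` and the functions `inline` and `trace_root` are hypothetical and do not correspond to Dynamo's API.

```python
class Unsupported(Exception):
    """Raised where the toy tracer cannot continue (stands in for a graph break)."""

# Hypothetical call graph mirroring func2 -> func1 -> func0.
CALLEES = {"func2": ["func1"], "func1": ["func0"], "func0": []}
HAS_GRAPH_BREAK = {"func2": False, "func1": False, "func0": True}

def inline(name, log):
    log.append(f"INLINING {name}")
    try:
        if HAS_GRAPH_BREAK[name]:
            raise Unsupported(name)
        for callee in CALLEES[name]:
            inline(callee, log)
    except Unsupported:
        # The failure is logged at this level and re-raised to the caller.
        log.append(f"FAILED INLINING {name}")
        raise
    log.append(f"DONE INLINING {name}")

def trace_root(name):
    log = []
    for callee in CALLEES[name]:
        try:
            inline(callee, log)
        except Unsupported:
            # The root frame cannot inline the call, so it breaks the graph here.
            log.append(f"graph break in {name} at call to {callee}")
    return log

log = trace_root("func2")
```

In this model a single unsupported construct in `func0` costs the inlining of every caller above it, which matches the two `FAILED INLINING` lines in the real log.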