A bug-fix manual for the PIR Python API adaptation and upgrade task #58259

Open · MarioLulab opened this issue Oct 20, 2023 · 2 comments
Labels: PFCC Paddle Framework Contributor Club (https://github.com/PaddlePaddle/community/tree/master/pfcc)
MarioLulab commented Oct 20, 2023

Dear developers 👋,

Hello everyone! While working on the "PIR Python API adaptation and upgrade task" you are likely to hit 🐛 and fix 🔧 all kinds of bugs, and the same bugs often recur when adapting other APIs. To reduce everyone's development cost, we maintain this bug-fix manual: please record each bug and its fix as a comment on this issue.

You can base your comment on the following template:

## Problem description

Briefly describe the problem you hit during migration; screenshots and the error stack trace are welcome.

## Code to reproduce the problem

Please provide a minimal example that reproduces the bug.

## Solution or approach

- xxx
MarioLulab commented Oct 24, 2023

Problem description

Under the @test_with_pir_api decorator, having the executor run base.default_main_program() raises:

Traceback (most recent call last):
  File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/pir_utils.py", line 119, in impl
    func(*args, **kwargs)
  File "/luq/docker/paddle-docker/Paddle-bak/test/legacy_test/test_mse_loss.py", line 53, in test_mse_loss
    fetch_list=[output],
  File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 1633, in run
    return_numpy=return_numpy,
  File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 1936, in _run_pir_impl
    scope,
  File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 1026, in get_pir_program_and_executor
    program, fetch_list=fetch_list, fetch_var_name=fetch_var_name
  File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 511, in _add_pir_fetch_ops
    global_block, fetch_list, fetch_var_name, fetch_op
  File "/luq/docker/paddle-docker/Paddle-bak/build/python/paddle/base/executor.py", line 426, in has_fetch_operations
    if op.name() == fetch_op:
AttributeError: 'Operator' object has no attribute 'name'

Code to reproduce the problem

  1. Create the unit test file:
import unittest

import numpy as np

import paddle
from paddle import base
from paddle.base import core
from paddle.base.executor import Executor
from paddle.pir_utils import test_with_pir_api


class TestMseLoss(unittest.TestCase):
    @test_with_pir_api
    def test_mse_loss(self):
        paddle.enable_static()
        input_val = np.random.uniform(0.1, 0.5, (2, 3)).astype("float32")
        label_val = np.random.uniform(0.1, 0.5, (2, 3)).astype("float32")

        sub = input_val - label_val
        np_result = np.mean(sub * sub)

        input_var = paddle.static.data(
            name="input", shape=[-1, 3], dtype="float32"
        )
        label_var = paddle.static.data(
            name="label", shape=[-1, 3], dtype="float32"
        )

        output = paddle.nn.functional.mse_loss(input=input_var, label=label_var)
        for use_cuda in (
            [False, True] if core.is_compiled_with_cuda() else [False]
        ):
            place = base.CUDAPlace(0) if use_cuda else base.CPUPlace()
            exe = Executor(place)
            (result,) = exe.run(
                base.default_main_program(),
                feed={"input": input_val, "label": label_val},
                fetch_list=[output],
            )

            np.testing.assert_allclose(np_result, result, rtol=1e-05)
  2. Run it; the error above is raised.

Solution or approach

  • Before building the static graph, create two paddle.static.Program() instances to serve as main_program and startup_program, and build the graph under paddle.static.program_guard:
import unittest

import numpy as np

import paddle
from paddle import base
from paddle.base import core
from paddle.base.executor import Executor
from paddle.pir_utils import test_with_pir_api


class TestMseLoss(unittest.TestCase):
    @test_with_pir_api
    def test_mse_loss(self):
        paddle.enable_static()
        input_val = np.random.uniform(0.1, 0.5, (2, 3)).astype("float32")
        label_val = np.random.uniform(0.1, 0.5, (2, 3)).astype("float32")

        sub = input_val - label_val
        np_result = np.mean(sub * sub)

        main = paddle.static.Program()
        startup = paddle.static.Program()
        with paddle.static.program_guard(main, startup):
            input_var = paddle.static.data(
                name="input", shape=[-1, 3], dtype="float32"
            )
            label_var = paddle.static.data(
                name="label", shape=[-1, 3], dtype="float32"
            )

            output = paddle.nn.functional.mse_loss(input=input_var, label=label_var)
            for use_cuda in (
                [False, True] if core.is_compiled_with_cuda() else [False]
            ):
                place = base.CUDAPlace(0) if use_cuda else base.CPUPlace()
                exe = Executor(place)
                (result,) = exe.run(
                    main,
                    feed={"input": input_val, "label": label_val},
                    fetch_list=[output],
                )

                np.testing.assert_allclose(np_result, result, rtol=1e-05)

Root cause investigation

IRGuard does not swap paddle.base.default_main_program and paddle.base.default_startup_program; see the code:

# paddle.base.default_main_program = (
#     paddle.pir.core.default_main_program
# )
# paddle.base.default_startup_program = (
#     paddle.pir.core.default_startup_program
# )

This code was commented out in PR #57956, for the following reason:

pir_guard does not switch base.default_main_program(); switching it would
make it impossible for OpTest to obtain the old static graph's proto when
PIR uses get_kernel_signature. The proper fix is to write a new
get_kernel_signature for PIR that does not depend on the old IR's
structures. For now, unit tests that use base.default_program can be
changed to static.default_program.
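
A minimal sketch of that interim workaround (a toy network; assuming, per the quote above, that the paddle.static entry point is the one IRGuard does redirect):

import numpy as np

import paddle

paddle.enable_static()

x = paddle.static.data(name="x", shape=[2, 3], dtype="float32")
y = x * 2.0

exe = paddle.static.Executor(paddle.CPUPlace())
# Fetch the default program via paddle.static rather than paddle.base,
# since IRGuard leaves paddle.base.default_main_program un-swapped.
prog = paddle.static.default_main_program()
(res,) = exe.run(prog, feed={"x": np.ones((2, 3), "float32")}, fetch_list=[y])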

MarioLulab commented

Problem description

Under the @test_with_pir_api decorator, two different networks are built in the same Program; the first exe.run executes the first network and the second run executes the second. The executor fails while running the second network:

Traceback (most recent call last):
  File "/home/aistudio/Paddle-gpu/build/python/paddle/pir_utils.py", line 119, in impl
    func(*args, **kwargs)
  File "/home/aistudio/Paddle-gpu/test/legacy_test/test_fused_feedforward_op.py", line 319, in test_static
    fetch_list=[res],
  File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 1644, in run
    return_numpy=return_numpy,
  File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 1947, in _run_pir_impl
    scope,
  File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 1037, in get_pir_program_and_executor
    program, fetch_list=fetch_list, fetch_var_name=fetch_var_name
  File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 511, in _add_pir_fetch_ops
    global_block, fetch_list, fetch_var_name, fetch_op
  File "/home/aistudio/Paddle-gpu/build/python/paddle/base/executor.py", line 430, in has_fetch_operations
    "There is a fetch op in Program which will fetch variable that is not belong to fetch_targets."
Exception: There is a fetch op in Program which will fetch variable that is not belong to fetch_targets.

Code to reproduce the problem

  1. Create the unit test file:

import unittest

import numpy as np

import paddle
import paddle.incubate.nn.functional as incubate_f
import paddle.nn.functional as F
from paddle.pir_utils import test_with_pir_api


class APITestStaticFusedFFN(unittest.TestCase):
    @test_with_pir_api
    def test_static(self):
        paddle.enable_static()
        main = paddle.static.Program()
        startup = paddle.static.Program()
        main.random_seed = 42

        dtype = "float32"
        layer_norm_dtype = "float32"
        batch_size = 1
        d_model = 8
        dim_feedforward = 8

        x_data = np.random.random(
            (batch_size, d_model, dim_feedforward)
        ).astype(dtype)
        linear1_weight_data = np.random.random(
            (d_model, dim_feedforward)
        ).astype(dtype)
        linear1_bias_data = np.zeros(dim_feedforward).astype(dtype)
        linear2_weight_data = np.random.random(
            (dim_feedforward, d_model)
        ).astype(dtype)
        linear2_bias_data = np.zeros(d_model).astype(dtype)

        ln1_scale_data = np.ones(d_model).astype(layer_norm_dtype)
        ln1_bias_data = np.zeros(d_model).astype(layer_norm_dtype)
        ln2_scale_data = np.ones(d_model).astype(layer_norm_dtype)
        ln2_bias_data = np.zeros(d_model).astype(layer_norm_dtype)

        with paddle.static.program_guard(main, startup):
            x = paddle.static.data(
                name='x',
                shape=[batch_size, d_model, dim_feedforward],
                dtype=dtype,
            )
            linear1_weight = paddle.static.data(
                name='linear1_weight',
                shape=[d_model, dim_feedforward],
                dtype=dtype,
            )
            linear1_bias = paddle.static.data(
                name='linear1_bias', shape=[dim_feedforward]
            )
            linear2_weight = paddle.static.data(
                name='linear2_weight',
                shape=[dim_feedforward, d_model],
                dtype=dtype,
            )
            linear2_bias = paddle.static.data(
                name='linear2_bias', shape=[d_model]
            )
            ln1_scale = paddle.static.data(name='ln1_scale', shape=[d_model])
            ln1_bias = paddle.static.data(name='ln1_bias', shape=[d_model])
            ln2_scale = paddle.static.data(name='ln2_scale', shape=[d_model])
            ln2_bias = paddle.static.data(name='ln2_bias', shape=[d_model])

            # <------------ build the first network ------------>
            fused_out = incubate_f.fused_feedforward(
                x,
                linear1_weight,
                linear2_weight,
                linear1_bias,
                linear2_bias,
                ln1_scale,
                ln1_bias,
                ln2_scale,
                ln2_bias,
                0.0,
                0.0,
                activation="relu",
                pre_layer_norm=False,
            )

            # <------------ build the second network ------------>
            # base ffn
            linear1_out = F.linear(x, linear1_weight, linear1_bias)
            act_out = F.relu(linear1_out)
            dropout1_out = F.dropout(x=act_out, p=0.0, training=False)
            linear2_out = F.linear(dropout1_out, linear2_weight, linear2_bias)
            dropout2_out = x + F.dropout(x=linear2_out, p=0.0, training=False)
            ln_out = F.layer_norm(
                dropout2_out,
                normalized_shape=[d_model],
                weight=ln2_scale,
                bias=ln2_bias,
            )

            exe = paddle.static.Executor(paddle.CUDAPlace(0))

            res_list = [fused_out, ln_out]
            real_res = []

            for res in res_list:  # <---- run both networks under the same program
                fetch = exe.run(
                    feed={
                        'x': x_data,
                        'linear1_weight': linear1_weight_data,
                        'linear1_bias': linear1_bias_data,
                        'linear2_weight': linear2_weight_data,
                        'linear2_bias': linear2_bias_data,
                        'ln1_scale': ln1_scale_data,
                        'ln1_bias': ln1_bias_data,
                        'ln2_scale': ln2_scale_data,
                        'ln2_bias': ln2_bias_data,
                    },
                    fetch_list=[res],
                )
                real_res.append(fetch)
            np.testing.assert_allclose(
                real_res[0], real_res[1], rtol=1e-05, atol=0.001
            )
  2. Run it; the executor fails while running the second network.

Solution or approach

  • Build and run the two networks under two separate programs, as in the skeleton below (a compact, self-contained sketch follows it):
class APITestStaticFusedFFN(unittest.TestCase):
    @test_with_pir_api
    def test_static(self):
        paddle.enable_static()

        dtype = "float32"
        layer_norm_dtype = "float32"
        batch_size = 1
        d_model = 8
        dim_feedforward = 8

        x_data = np.random.random(
            (batch_size, d_model, dim_feedforward)
        ).astype(dtype)
        linear1_weight_data = np.random.random(
            (d_model, dim_feedforward)
        ).astype(dtype)
        linear1_bias_data = np.zeros(dim_feedforward).astype(dtype)
        linear2_weight_data = np.random.random(
            (dim_feedforward, d_model)
        ).astype(dtype)
        linear2_bias_data = np.zeros(d_model).astype(dtype)

        ln1_scale_data = np.ones(d_model).astype(layer_norm_dtype)
        ln1_bias_data = np.zeros(d_model).astype(layer_norm_dtype)
        ln2_scale_data = np.ones(d_model).astype(layer_norm_dtype)
        ln2_bias_data = np.zeros(d_model).astype(layer_norm_dtype)

        main_1 = paddle.static.Program()
        startup_1 = paddle.static.Program()
        main_1.random_seed = 42
        # <------------ build the first network ------------>
        with paddle.static.program_guard(main_1, startup_1):
            # <------------ code: first network graph-building code ------------>
            ... ...

        main_2 = paddle.static.Program()
        startup_2 = paddle.static.Program()
        main_2.random_seed = 42
        # <------------ build the second network ------------>
        with paddle.static.program_guard(main_2, startup_2):
            # <------------ code: second network graph-building code ------------>
            ... ...

        # compare the two results
        ... ...
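
A compact, self-contained sketch of the same one-program-per-network pattern (the toy networks stand in for the fused and reference FFNs; all names here are illustrative):

import numpy as np

import paddle

def run_network(build_fn, feed):
    # Give each network its own main/startup program pair and pass the
    # main program explicitly to exe.run, so the two graphs never share
    # a pir.Program.
    main = paddle.static.Program()
    startup = paddle.static.Program()
    with paddle.static.program_guard(main, startup):
        out = build_fn()
    exe = paddle.static.Executor(paddle.CPUPlace())
    (res,) = exe.run(main, feed=feed, fetch_list=[out])
    return res

paddle.enable_static()
x_data = np.ones((2, 3), dtype="float32")

def net_a():
    x = paddle.static.data(name="x", shape=[2, 3], dtype="float32")
    return paddle.nn.functional.relu(x) * 2.0

def net_b():
    x = paddle.static.data(name="x", shape=[2, 3], dtype="float32")
    return x + x

res_a = run_network(net_a, {"x": x_data})
res_b = run_network(net_b, {"x": x_data})
np.testing.assert_allclose(res_a, res_b, rtol=1e-05)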

Root cause investigation

At this stage pir.Program can misbehave when several networks share one program; the exact cause is still under investigation.
