[PIR] Track operators that need fixing for full operator coverage under the new IR #58266

Closed
xingmingyyj opened this issue Oct 20, 2023 · 0 comments
Labels: PFCC (Paddle Framework Contributor Club, https://github.com/PaddlePaddle/community/tree/master/pfcc), status/close (closed)

xingmingyyj commented Oct 20, 2023

Operators currently being fixed

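| No. | Task | Failure reason | PR | Developer | Notes |
| --- | --- | --- | --- | --- | --- |
| 1 | test_dpsgd_op | Op dpsgd should have corresponding OpInfo pd.dpsgd | #57826 | @xingmingyyj | Issue 1: After adding the registration entry for dpsgd to ops.yaml, the run failed with [Hint: Expected param->numel() == sz, but received param->numel():10710 != sz:0.]. The cause was that no infermeta was configured for dpsgd. Adding an infermeta function in paddle/phi/infermeta/ternary.cc resolved it.<br>Issue 2: error: eager_api_dpsgd was not declared in this scope. If this compile error appears when adding an operator, the operator name must be added to NO_NEED_GEN_STATIC_ONLY_APIS in paddle/fluid/pir/dialect/op_generator/ops_api_gen.py. |
| 2 | test_exponential_op | Op exponential should have corresponding OpInfo pd.exponential & The difference in accuracy is too great & The difference in accuracy is too great | #58029 | @xingmingyyj | Issue 1: The output of the exponential op is random. The test logic for the new IR runs the op once under the old IR and once under the new IR, then compares the two outputs. Because two runs of exponential never produce the same output, the existing test logic does not apply. The fix is to create a separate new_ir_op_test_no_check_list for ops of this kind and skip the output check for them. |
| 3 | test_norm_all | Attribute cast error in InferMeta Context, the expected attribute type is St6vectorIlSaIlEE | #57942 | @changeyoung98 | |
| 4 | test_pixel_unshuffle | | #57521 | @phlrain | |
| 5 | test_randint_op | (PreconditionNotMet) ProtoType 17 has no corresponding translator | #58295 | @xingmingyyj | Issue 1: The dtype of a LOD_TENSOR can be RAW, but translating the RAW type is not supported yet. Following the InferMeta logic, the dtype from the attribute is assigned to Out instead.<br>Issue 2: The output of randint is also random, so it is added to new_ir_op_test_no_check_list and its output values are not checked. |
| 6 | test_real_imag_op | (PreconditionNotMet) op [pd_op.real_grad] kernel output args defs should equal op outputs | | | Issue 1: Mainly caused by the unit-test mechanism. The test fails when FLAGS_enable_new_ir_in_executor is enabled, but passes when FLAGS_PIR_OPTEST and FLAGS_PIR_OPTEST_WHITE_LIST are enabled. No action is taken for now. |
| 7 | test_repeat_interleave_op | InvalidArgumentError: repeats should be larger than zero | #58379 | @xingmingyyj | Issue 1: Under the new IR, the repeat_interleave op has to be translated into either repeat_interleave_with_tensor_index or repeat_interleave, depending on the RepeatsTensor input. Adding a RepeatInterLeaveOpTranscriber achieves this, but the corresponding grad op needs the same treatment.<br>Issue 2: The error is: the type of data we are trying to retrieve (float32) does not match the type of data (float64). The tensor declared when building the network has dtype float32, but the data supplied by the test file is float64. Under the old IR, GetExpectedKernelType can select the kernel from the dtype of the input data. The new IR does not support this yet and selects the kernel from the dtype of x. For problems of this kind, the unit test has to be modified so that the input data has the same dtype as the declared one. |
| 8 | test_seed_op | (NotFound) The kernel with key (CPU, Undefined(AnyLayout), int64) of kernel seed is not registered. Selected wrong DataType int64. Paddle support following DataTypes: int32. | #58552 | @xingmingyyj | Issue 1: The error comes from GetExpectedKernelType behaving differently under the old and new IR. Under the old IR the kernel type is INT32, whereas under the new IR GetExpectedKernelType returns the dtype of Out. Modifying GetExpectedKernelType under the new IR resolved it. |
| 9 | test_share_data_op | | #57212 | @yangguohao | |
| 10 | test_sparse_momentum_op | Op sparse_momentum should have corresponding OpInfo pd.sparse_momentum | #58536 | @xingmingyyj | Issue 1: When parsing runtime_info.kernel_param, OpYamlInfoParser puts mutable attributes into kernel_fn_attr_params. For sparse_momentum_op as defined under the new IR (which declares Scalar axis), this leaves the axis attribute missing from the AttributeMap. For legacy ops of this kind, mutable attributes are therefore all placed into kernel_fn_tensor_params for now. The fix is to add one more property to OpYamlInfoParser that indicates whether the op being translated is a legacy op. |
| 11 | test_sum_op | FatalError: Segmentation fault is detected by the operating system. | | | Issue 1: Mainly caused by the unit-test mechanism. The test fails when FLAGS_enable_new_ir_in_executor is enabled, but passes when FLAGS_PIR_OPTEST and FLAGS_PIR_OPTEST_WHITE_LIST are enabled. No action is taken for now. |
| 12 | test_uniform_random_op | FatalError: Segmentation fault is detected by the operating system. | | | Issue 1: Mainly caused by the unit-test mechanism. The test fails when FLAGS_enable_new_ir_in_executor is enabled, but passes when FLAGS_PIR_OPTEST and FLAGS_PIR_OPTEST_WHITE_LIST are enabled. No action is taken for now. |
| 13 | test_unique | PreconditionNotMetError: Tensor holds no memory. Call Tensor::mutable_data firstly. | | | Issue 1: Under the new IR, test_unique fails with PreconditionNotMetError: Tensor holds no memory. Call Tensor::mutable_data firstly. The cause is that the old-IR unique is by default translated only into the new-IR unique. Under the old IR, unique picks one of two kernels, unique or unique_raw, based on the is_sorted attribute. The new IR has no such mechanism, so the old-IR unique has to be translated into the new-IR unique op or unique_raw op according to is_sorted. A definition of unique_raw was added under the new IR for this.<br>Issue 2: After the fix, running on GPU triggers a null-pointer error inside the GPU kernel. This is a kernel-selection problem: under the old IR, GetReduceGradExpectedKernelType selects the CPU kernel even in a GPU environment, but the new IR is not adapted to GetReduceGradExpectedKernelType, so kernel selection goes wrong on GPU. Not handled yet. |
| 14 | test_uniform_random_bf16_op | Op uniform_random_batch_size_like should have corresponding OpInfo pd_op.uniform_random_batch_size_like, RuntimeError: (NotFound) Variable is not initialized.1558: [Hint: holder_ should not be null.] | #58904 | @xingmingyyj | Issue 1: The holder_ of the Variable corresponding to input is null when the PhiContext is built. On the Python side, _StandaloneExecutor calls run with an empty feed_names. Under the old IR, run executes in program_interpreter, which initializes each Variable as soon as it is constructed. pir_interpreter does not initialize Variables up front. It initializes input variables according to feed_names, so an empty feed_names leaves input uninitialized and causes the later failure. The fix is to pass feed to exe.run(). |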
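
The notes for test_exponential_op and test_randint_op describe a test that runs an op once under each IR and compares the outputs, with a no-check list for ops whose outputs are random. A minimal sketch of that idea follows. It is not Paddle's actual test harness: `check_old_vs_new`, the two runner callables, and the fake ops are hypothetical stand-ins, and the choice to still execute both runs for exempt ops is an assumption.

```python
import random

# Hypothetical stand-in for the list described in the notes: ops whose
# outputs are random and therefore cannot be compared across two runs.
NEW_IR_OP_TEST_NO_CHECK_LIST = {"exponential", "randint"}


def check_old_vs_new(op_name, run_old_ir, run_new_ir, atol=1e-6):
    """Run an op once per IR and compare outputs unless the op is exempt.

    run_old_ir and run_new_ir are hypothetical callables that each return
    a list of floats. Returns "skipped" for exempt ops and "checked"
    otherwise; raises AssertionError when the outputs differ.
    """
    old_out = run_old_ir()
    new_out = run_new_ir()  # assumption: still executed so crashes surface
    if op_name in NEW_IR_OP_TEST_NO_CHECK_LIST:
        return "skipped"
    assert len(old_out) == len(new_out), f"{op_name}: output sizes differ"
    for a, b in zip(old_out, new_out):
        assert abs(a - b) <= atol, f"{op_name}: {a} != {b}"
    return "checked"


def fake_exponential():
    # Each call draws fresh samples, like a random op without a fixed seed.
    return [random.expovariate(1.0) for _ in range(4)]


def fake_scale():
    # Deterministic op: identical output on every run.
    return [2.0 * x for x in (1.0, 2.0, 3.0)]


print(check_old_vs_new("exponential", fake_exponential, fake_exponential))
print(check_old_vs_new("scale", fake_scale, fake_scale))
```

Keeping the exemption as a set of op names means a deterministic op is still compared by default, and only ops that are explicitly listed skip the output comparison.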
xingmingyyj changed the title from "[pir] Track operators that need fixing for full operator coverage under the new IR" to "[PIR] Track operators that need fixing for full operator coverage under the new IR" on Oct 20, 2023
paddle-bot added the PFCC (Paddle Framework Contributor Club, https://github.com/PaddlePaddle/community/tree/master/pfcc) label on Oct 20, 2023
Ligoml removed the status/new-issue (new) and type/others (other issues) labels on Oct 24, 2023
paddle-bot added the status/close (closed) label on Jun 2, 2024