[Targeting Q4] test_triangular_solve_op fails on "Intel(R) Xeon(R) Silver 4314 CPU" #55707

Closed
Tom-Zheng opened this issue Jul 26, 2023 · 7 comments

@Tom-Zheng
Contributor

Tom-Zheng commented Jul 26, 2023

Describe the Bug

The CPU kernel of triangular_solve produces wrong results on the Intel(R) Xeon(R) Silver 4314 CPU, which causes test_triangular_solve_op to fail.

test_lu_op and test_qr_op are also affected because they rely on triangular_solve.
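For anyone trying to narrow this down outside the unit test, a library-independent reference can help. The following is a minimal sketch, not Paddle's kernel: `solve_upper`, `max_residual`, and `make_system` are hypothetical helper names, and the 12x12 size is taken from the shape in the error log. It solves an upper-triangular system by plain back substitution and measures the residual, so the result of any BLAS-backed solver can be compared against it.

```python
import random


def solve_upper(a, b):
    """Solve a @ x = b by back substitution.

    a is an upper-triangular matrix given as a list of rows,
    b is a vector given as a list of floats.
    """
    n = len(a)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        # Subtract the contribution of the already-solved unknowns.
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


def max_residual(a, x, b):
    """Largest entry of |a @ x - b|; close to zero for a correct solution."""
    n = len(x)
    return max(
        abs(sum(a[i][j] * x[j] for j in range(n)) - b[i])
        for i in range(len(b))
    )


def make_system(n=12, seed=0):
    """Build a well-conditioned random upper-triangular system (assumed sizes)."""
    rng = random.Random(seed)
    a = [
        [
            rng.uniform(1.0, 2.0) if j == i
            else (rng.uniform(-0.1, 0.1) if j > i else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return a, b


if __name__ == "__main__":
    a, b = make_system()
    x = solve_upper(a, b)
    print("max residual of reference solve:", max_residual(a, x, b))
```

Passing the output of the solver under test to `max_residual` with the same `a` and `b` shows whether that solver, rather than the test harness, is returning a wrong answer.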

Paddle version: release/2.5

Error info:

test_triangular_solve_op failed
 ..FFFFF.F.FFF.F......
======================================================================
FAIL: test_check_grad_normal (test_triangular_solve_op.TestTriangularSolveOp)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/paddle/paddle/build/test/legacy_test/test_triangular_solve_op.py", line 70, in test_check_grad_normal
    self.check_grad(['X', 'Y'], 'Out', check_cinn=True)
  File "/opt/paddle/paddle/build/test/legacy_test/eager_op_test.py", line 2416, in check_grad
    self.check_grad_with_place(
  File "/opt/paddle/paddle/build/test/legacy_test/eager_op_test.py", line 2617, in check_grad_with_place
    self._assert_is_close(
  File "/opt/paddle/paddle/build/test/legacy_test/eager_op_test.py", line 2376, in _assert_is_close
    self.assertLessEqual(max_diff, max_relative_error, err_msg())
AssertionError: 8.356400757533255 not less than or equal to 1e-07 : Operator triangular_solve error, Gradient Check On Place(cpu) variable X (shape: (12, 12), dtype: float64) max gradient diff 8.356401e+00 over limit 1.000000e-07, the first error element is 9, expected -3.602969e-01, but got 7.001953e-02.
......(left out)
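For readers unfamiliar with this kind of failure: the assertion compares an analytic gradient against a numerically estimated one and fails when the largest relative difference exceeds the limit (1e-07 here). The sketch below illustrates the general idea with central finite differences. It is an assumption-laden illustration, not the code in eager_op_test.py; `numeric_grad`, `max_relative_diff`, and the `floor` parameter are hypothetical names.

```python
def numeric_grad(f, x, eps=1e-6):
    """Estimate the gradient of scalar function f at x by central differences."""
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += eps
        xm[i] -= eps
        grad.append((f(xp) - f(xm)) / (2.0 * eps))
    return grad


def max_relative_diff(analytic, numeric, floor=1e-3):
    """Largest |analytic - numeric| scaled by |numeric|.

    The floor (an assumption of this sketch) avoids dividing by
    values that are close to zero.
    """
    return max(
        abs(a - n) / max(abs(n), floor)
        for a, n in zip(analytic, numeric)
    )


if __name__ == "__main__":
    # f(x) = x0^2 + 3*x1 has the exact gradient [2*x0, 3].
    f = lambda x: x[0] ** 2 + 3.0 * x[1]
    point = [2.0, 5.0]
    numeric = numeric_grad(f, point)
    print("correct gradient diff:", max_relative_diff([4.0, 3.0], numeric))
    print("wrong gradient diff:  ", max_relative_diff([4.0, 0.07], numeric))
```

A difference as large as 8.36, as in the log above, points to a wrong forward or backward result rather than to floating-point noise.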

CPU info:

# lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           1
    Stepping:            6
    CPU max MHz:         3400.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear pconfig flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   768 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    20 MiB (16 instances)
  L3:                    24 MiB (1 instance)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-7,16-23
  NUMA node1 CPU(s):     8-15,24-31
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
  Srbds:                 Not affected
  Tsx async abort:       Not affected

Additional Supplementary Information

No response

@YanhuiDua
Contributor

Hello, we have received your issue and are analyzing it.

@zhwesky2010
Contributor

zhwesky2010 commented Aug 21, 2023

@Tom-Zheng Hello, are you running the CPU build or the GPU build of Paddle? This op unit test passes in our internal runs: https://xly.bce.baidu.com/paddlepaddle/paddle/newipipe/detail/8988447/job/23602891

@Tom-Zheng
Contributor Author

> @Tom-Zheng Hello, are you running the CPU build or the GPU build of Paddle? This op unit test passes in our internal runs: https://xly.bce.baidu.com/paddlepaddle/paddle/newipipe/detail/8988447/job/23602891

Please see the description: the problem only reproduces on an "Intel(R) Xeon(R) Silver 4314 CPU".

@Tom-Zheng
Contributor Author

We are running the GPU build of Paddle, but this unit test fails on the CPU, so the CPU build should reproduce it as well.

@zhwesky2010
Contributor

zhwesky2010 commented Aug 23, 2023

@Tom-Zheng The test passes on all of the CPU models we have tried internally.

Intel Core i9-9900 CPU:
[Screenshot: infoflow 2023-08-23 11-27-13]

Intel(R) Xeon(R) CPU
[Screenshot: infoflow 2023-08-23 11-22-33]

On the CPU, triangular_solve is computed with the mklml library provided by Intel. Could that library have a computation problem on this CPU?
[Screenshot: infoflow 2023-08-23 11-51-55]

So you could test the openblas build of Paddle to see whether it has the same problem:

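If you want to install the avx, openblas Paddle package, you can download the wheel to your machine with the following command and then install it locally with python -m pip install [name].whl ([name] is the wheel package name):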

python -m pip download paddlepaddle==2.5.1 -f https://www.paddlepaddle.org.cn/whl/windows/openblas/avx/stable.html --no-index --no-deps

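Please also confirm whether the GPU version has the same problem. If both the openblas build and the GPU version pass, we can be fairly sure that the Intel mklml library is the cause.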
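The comparison suggested above could be scripted roughly as follows. This is only a sketch under stated assumptions: `paddle_triangular_solve`, `max_abs_diff`, and `suspect` are hypothetical helpers, the backend labels are made up for illustration, and Paddle is imported lazily so the decision logic can be run without it. Switching between the mklml and openblas builds still means installing a different wheel and rerunning the script.

```python
def max_abs_diff(u, v):
    """Largest element-wise absolute difference between two flat lists."""
    return max(abs(p - q) for p, q in zip(u, v))


def paddle_triangular_solve(device, a, b):
    """Solve a @ x = b with Paddle on the given device ("cpu" or "gpu").

    a is an upper-triangular M x M matrix and b an M x K matrix,
    both given as lists of rows. Requires Paddle to be installed.
    """
    import paddle  # lazy import: the rest of the sketch runs without Paddle

    paddle.set_device(device)
    x = paddle.to_tensor(a, dtype="float64")
    y = paddle.to_tensor(b, dtype="float64")
    out = paddle.linalg.triangular_solve(x, y, upper=True)
    return out.numpy().tolist()


def suspect(passed):
    """Apply the reasoning from the comment above.

    passed maps a backend label (hypothetical names) to True/False, e.g.
    {"mklml_cpu": False, "openblas_cpu": True, "gpu": True}.
    """
    if passed.get("mklml_cpu", True):
        return "not reproduced"
    if passed.get("openblas_cpu") and passed.get("gpu"):
        return "mklml"
    return "undetermined"


if __name__ == "__main__":
    # Results would be filled in by running the unit test on each build.
    print(suspect({"mklml_cpu": False, "openblas_cpu": True, "gpu": True}))
```

An "undetermined" result means more than one backend fails, in which case the BLAS library alone would not explain the failure.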

@Tom-Zheng Tom-Zheng changed the title test_triangular_solve_op fails on "Intel(R) Xeon(R) Silver 4314 CPU" [Q4] test_triangular_solve_op fails on "Intel(R) Xeon(R) Silver 4314 CPU" Aug 31, 2023
@Tom-Zheng
Contributor Author

Will come back to this issue in Q4.

@Tom-Zheng Tom-Zheng changed the title [Q4] test_triangular_solve_op fails on "Intel(R) Xeon(R) Silver 4314 CPU" [Targeting Q4] test_triangular_solve_op fails on "Intel(R) Xeon(R) Silver 4314 CPU" Aug 31, 2023
@Tom-Zheng
Contributor Author

The problem is gone after updating CBLAS from v0.3.18 to v0.3.24.
