[Sparse] optimize sparse attention op kernel #44743

zhwesky2010 · 2022-07-29T14:03:24Z

PR types

Performance optimization

PR changes

OPs

Describe

Optimize the sparse attention OP kernel: rework how the kernel functions are written, make softmax support shape=2048/4096, and improve performance slightly.
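
For reviewers who want a reference for what the softmax step computes, here is a minimal pure-Python sketch. It is not the kernel from this PR. It assumes the attention scores are stored in CSR form and that the 2048/4096 figure refers to the number of stored scores in a row; the names `csr_row_softmax`, `crows`, and `values` are hypothetical and chosen only for illustration.

```python
import math


def csr_row_softmax(crows, values):
    """Softmax over the stored entries of each CSR row.

    crows  -- row offsets, length num_rows + 1 (hypothetical name)
    values -- stored attention scores, length crows[-1] (hypothetical name)
    """
    out = [0.0] * len(values)
    for r in range(len(crows) - 1):
        start, end = crows[r], crows[r + 1]
        if start == end:
            continue  # empty row: nothing to normalize
        row = values[start:end]
        row_max = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - row_max) for v in row]
        total = sum(exps)
        for i, e in enumerate(exps):
            out[start + i] = e / total
    return out


# One row holding 4096 stored scores (assumed interpretation of "shape=4096").
crows = [0, 4096]
values = [float(i % 7) for i in range(4096)]
probs = csr_row_softmax(crows, values)
```

A GPU kernel would typically assign one warp or block per row and loop over the row in chunks, so a comparison against a reference like this one is mainly useful for checking long rows.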

paddle-bot · 2022-07-29T14:03:34Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

<chenxx_id@163.com> Co-authored-by: Chenxiao Niu <ncxinhanzhong@gmail.com> Co-authored-by: Zhou Wei <1183042833@qq.com> Co-authored-by: JYChen <zoooo0820@qq.com> Co-authored-by: YUNSHEN XIE <1084314248@qq.com> Co-authored-by: niuliling123 <51102941+niuliling123@users.noreply.github.com> Co-authored-by: zhangyikun02 <48021248+zhangyk0314@users.noreply.github.com> Co-authored-by: huzhiqiang <912790387@qq.com> Co-authored-by: jakpiase <jakpia21@gmail.com> Co-authored-by: Piotr Paturej <piotr.paturej@intel.com> Co-authored-by: zhaocaibei123 <48509226+zhaocaibei123@users.noreply.github.com> Co-authored-by: freeliuzc <lzc842650834@gmail.com> Co-authored-by: tianshuo78520a <707759223@qq.com> Co-authored-by: zmxdream <zhangminxu01@baidu.com> Co-authored-by: houj04 <35131887+houj04@users.noreply.github.com> Co-authored-by: pangyoki <pangyoki@126.com> Co-authored-by: lyq <30404405+affectionlu@users.noreply.github.com> Co-authored-by: Zhong Hui <zhonghui.net@gmail.com> Co-authored-by: fuyou765 <64373205+fuyou765@users.noreply.github.com> Co-authored-by: Chen Weihang <chenweihang@baidu.com> Co-authored-by: YuanRisheng <yuanrisheng@baidu.com> Co-authored-by: zhaoyingli <86812880+zhaoyinglia@users.noreply.github.com> Co-authored-by: ccrrong <101700995+ccrrong@users.noreply.github.com> Co-authored-by: xiaoxiaohehe001 <49090790+xiaoxiaohehe001@users.noreply.github.com> Co-authored-by: ykkk2333 <77383312+ykkk2333@users.noreply.github.com> Co-authored-by: Li Min <11663212+limin2021@users.noreply.github.com> Co-authored-by: Hui Zhang <zhtclz@foxmail.com> Co-authored-by: ming1753 <61511741+ming1753@users.noreply.github.com> Co-authored-by: cifar10 <41565156+cifar10@users.noreply.github.com> Co-authored-by: fwenguang <95677191+fwenguang@users.noreply.github.com> Co-authored-by: Aganlengzi <aganlengzi@gmail.com> Co-authored-by: yuguo <948529990@qq.com> Co-authored-by: Zhang Jun <ewalker@live.cn> Co-authored-by: Wang Bojun <105858416+wwbitejotunn@users.noreply.github.com> Co-authored-by: 
yangguohao <70266361+yangguohao@users.noreply.github.com> Co-authored-by: Ligoml <39876205+Ligoml@users.noreply.github.com> Co-authored-by: Lux et Veritas <1004239791@qq.com> Co-authored-by: zhangbo9674 <82555433+zhangbo9674@users.noreply.github.com> Co-authored-by: BiynXu <62832681+BiynXu@users.noreply.github.com> Co-authored-by: ziyoujiyi <73728031+ziyoujiyi@users.noreply.github.com> Co-authored-by: Zhen Wang <wangzhen31@baidu.com> Co-authored-by: chenjian <chenjian26@baidu.com> Co-authored-by: helen88 <z8hanghuan@126.com> Co-authored-by: Yuang Liu <liuyuang@baidu.com> Co-authored-by: qipengh <huangqipeng@cambricon.com> Co-authored-by: shangliang Xu <ghostxsl@users.noreply.github.com> Co-authored-by: Jiabin Yang <360788950@qq.com> Co-authored-by: Lin Manhui <mhlin425@whu.edu.cn> Co-authored-by: Bobholamovic <linmanhui@baidu.com> Co-authored-by: LiYuRio <63526175+LiYuRio@users.noreply.github.com> Co-authored-by: kuizhiqing <kuizhiqing@baidu.com> Co-authored-by: Charles-hit <56987902+Charles-hit@users.noreply.github.com> Co-authored-by: HongyuJia <jiahongyu@baidu.com> Co-authored-by: heliqi <1101791222@qq.com> Co-authored-by: Yulong Ao <aoyulong@baidu.com> Co-authored-by: JZ-LIANG <jianzhongliang10@gmail.com> Co-authored-by: thunder95 <290844930@qq.com> Co-authored-by: Jacek Czaja <jacek.czaja@intel.com> Co-authored-by: zhiboniu <31800336+zhiboniu@users.noreply.github.com> Co-authored-by: Ainavo <57820731+Ainavo@users.noreply.github.com> Co-authored-by: SigureMo <sigure.qaq@gmail.com> Co-authored-by: Asthestarsfalll <72954905+Asthestarsfalll@users.noreply.github.com> Co-authored-by: Wangzheee <634486483@qq.com> Co-authored-by: Thomas Young <35565423+HexToString@users.noreply.github.com> Co-authored-by: ShiningZhang <zhang_liang1991@126.com> Co-authored-by: OccupyMars2025 <31559413+OccupyMars2025@users.noreply.github.com> Co-authored-by: mrcangye <mrcangye@email.cn> Co-authored-by: Roc <30228238+sljlp@users.noreply.github.com> Co-authored-by: seemingwang 
<zsasuke@qq.com> Co-authored-by: DesmonDay <908660116@qq.com> Co-authored-by: seemingwang <seemingwang@users.noreply.github.com> Co-authored-by: Thunderbrook <a754913769@163.com> Co-authored-by: xuewujiao <105861147+xuewujiao@users.noreply.github.com> Co-authored-by: root <root@yq01-sys-hic-k8s-v100-box-a225-0693.yq01.baidu.com> Co-authored-by: Thunderbrook <52529258+Thunderbrook@users.noreply.github.com> Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0009.yq01.baidu.com> Co-authored-by: huwei02 <53012141+huwei02@users.noreply.github.com> Co-authored-by: yaoxuefeng <yaoxuefeng@baidu.com> Co-authored-by: lxsbupt <luoxsbupt@163.com> Co-authored-by: miaoli06 <106585574+miaoli06@users.noreply.github.com> Co-authored-by: root <root@yq01-inf-hic-k8s-a100-ab2-0008.yq01.baidu.com> Co-authored-by: chao9527 <33347532+chao9527@users.noreply.github.com> Co-authored-by: qingshui <qshuihu@gmail.com> Co-authored-by: yangjunchao <yangjunchao@baidu.com> Co-authored-by: yang131313 <lisy928472889@163.com> Co-authored-by: mengqingchun02 <103740521+mengqingchun02@users.noreply.github.com> Co-authored-by: 熊峻峰 <xiongjunfeng@sina.com>

zhwesky2010 force-pushed the optim_sparse_attention branch 2 times, most recently from 9620683 to e48cf86 on July 31, 2022 14:03

[Sparse] optimize sparse attention

cf2db1f

zhwesky2010 force-pushed the optim_sparse_attention branch from e48cf86 to cf2db1f on August 1, 2022 02:54

zkh2016 approved these changes Aug 1, 2022

zhwesky2010 merged commit 1149a37 into PaddlePaddle:develop Aug 1, 2022
