Rank task graph merge master (#9440)
* Use Primitive in Scalar Pow Grad (#8620): scalar math use primitive; support pow grad; dev scalar pow grad; remove useless code; use std; auto format by CI. Co-authored-by: guo-ran <360112263@qq.com>, mergify[bot], oneflow-ci-bot
* Add higher-order derivative for loss functions (#9070): add higher-order derivative for smooth_l1/nll loss and for bce/kl_div loss; fix wrong SBP signature of bce loss; optimize code and align precision with PyTorch; add some index checks; disable derivative calculation for target in bce loss; remove unnecessary header include; fix SBP setting in testcase and restore out_grads size check. Co-authored-by: oneflow-ci-bot, mergify[bot]
* Add higher-order derivative for softmax and activation (#9032): add higher-order derivative for softmax/logsoftmax and for mish/gelu activation; add comment for constexpr parameter. Co-authored-by: oneflow-ci-bot, mergify[bot]
* Add higher-order derivative for pool (#9096): fix ndim check error. Co-authored-by: mergify[bot]
* Cross Entropy: support probability targets (#9064): support prob targets for cross-entropy; fix forward and grad bugs for ndim > 2 inputs; use template HasLabelSmoothing; use pre-calculated factor in kernel; remove redundant header includes; remove op, implement at functor layer; set bind_python to false; add docs; fix missing default param in unittest and typo in docstring example; update loss.py; remove useless file. Co-authored-by: mergify[bot], oneflow-ci-bot
* Fix nvjpegDecodeParamsSetROI (#9101): fix nvjpegGetImageInfo; fix setting ROI
* Add op series adaptive_max_pool1d/2d/3d (#9023): CPU forward for 1d/2d/3d; add return_indices; refine file hierarchy; add adaptive_max_pool2d_grad; CPU and GPU op/kernels for forward and backward; add nn.AdaptiveMaxPoolNd module; add docstring; rename avg pool GPU file; refine .td file; refine test case; refine by comments of zzk and zhuping; fix clang-tidy errors
* one_embedding: change physical_block_size to 4096 (#9017). Co-authored-by: mergify[bot]
* OneEmbedding: add ONEFLOW_ONE_EMBEDDING_DISABLE_PIPELINE (#9098): one_embedding eager forward; deterministic forward gen random; merge master; grad op add attrs, then Revert "grad op add attrs" (reverts commit 33b67c75d1e5d0e6529a108f7e7a17bc458dc661)
* auto format by CI * format * refine * prefetch consume id_shuffle out and exec in advance * add new task_node * sort and add ctrl edge * rm id_shuffle_task_node * add register same output blob regst num * rm tasktype * refine * address review * rename * refine * refine * refine Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * develop eager AMP (#9088) * implement eager AMP * skip autocast for inplace and implement make autocast meta * fix * rm unused code * autocast python api * fix * fix * refine * skip autocast if any input is float32 for gray or clear list * refine * fix dead loop * add autocast unittest * refine worker seed (#9102) * refine worker seed * refine * reifne * use default_generator.seed * Dev GroupNorm (#7784) * add groupnorm infer * Add groupnorm forward * refine other forawrd situation * groupnorm backward still has bug * fix forward * support backward * add slow groupnorm param grad kernel * use blockreduce * update blocknum * add gradient func * simplify code * refine and add global test * remove annotation * not limit split dim * fix compile error * Add spatialsize pack logic and fix launch blocknum bug * add two stage reduced backward kernel * refine * simplify logic * refine pack logic * use THREAD_CACHED_MUTABLE_ATTR_MAP * fix comment * refine * refine comment * Refine more check * fix affine=False bug * fix bug * tmp use gemm reduce * use ComputeType buf * fix nvbfloat16 compute type * add amp gray list * Revert back * fix clang analysis * refine userops.td * fix userops * remove result_segment_sizes * add dispatch logic for groupnorm grad uncached block impl Co-authored-by: luyang <flowingsun007@163.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Introduce bfloat16 type (#9067) * introduce_bfloat16_type * storage * fix compile error * support bfloat16 ep operator * support create cpu bfloat tensor * refine code * minor 
fix * fix static check error * reslove comment * add more test case * fix bfloat16 numeric_limits * fix error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refine check in ibverbs (#8974) * refine check in ibverbs * format * fix typo and test * refine error message when there is no errno Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support padding_idx in OneEmbedding (#8998) * init * Add attribute val in Userops.td * simply add paddingidx logic in EncodeLookupKernel * add simple padding_idx EmbeddingGrad * when index is -1 let gather add 0 * skip atomicadd when row index equals to padding_idx * change padding_idx type to int64 * fix compile error * set padding_idx in Pass * 1n1d eval success * refine * remove print * fix compile error * revert * refine * fix compile * refine * Refine * refine * refine store options * remove embedding grad shuffle redundant padding_idx * move gather in datashuffle kernel * remove redundant code * Refine * refine * remove redundant header file * Set padding idx as optional and remove attr has_padding_idx * Add padding_idx unittest * use array equal instead of allclose * remove a test * enlarge timeout * launch oneflow kernels in code generated with MLIR (#8980) * init * registry * add KernelLaunchFunctionPass * pass ninja and relu test * mlir test script & lowering * relu py * fi * kernel launch * fix * fix op and pass interfaces * add comment * add readme docs * fix typo * kenerl launch function pass is done * use template and rename func.func * declare * pass string through mlir.llvm dialect to c interface: llvm.mlir.global internal constant @"relu-0_var"("relu-0") %0 = "llvm.mlir.addressof"() {global_name = @"relu-0_var"} : () -> !llvm.ptr<array<6 x i8>> %1 = "llvm.mlir.constant"() {value = 0 : index} : () -> i64 %2 = "llvm.getelementptr"(%0, %1, %1) {structIndices = dense<-2147483648> : tensor<2xi32>} : (!llvm.ptr<array<6 x i8>>, i64, i64) -> !llvm.ptr<i8> * use 
symbol table * use oneflow variable op * fix symboltable * fix * ninja c1 check * split into kernel-launch-function pass and kernel-launch-with-llvm pass * restore pass 1 * Gen kernel example (#9042) * add example * add todo * add basic assertion * add file check * create pass in translation * sanitizeIdentifier * enable print * fix * update test file * kernel llvm pass is ok * pass ctx ptr to func and this ptr will be an operand to call c interface function * restore llvm ptr type to llvm.ptr<i8> * Kernel lookup in launch op (#9059) * add * move function to another unit * create map * add iter * impl TensorDesc4ArgNameAndIndex * set dev tag * load lib when ONEFLOW_MLIR_FUSE_KERNEL_LAUNCH is set * sharedlibs enables and pass enables in commpute * enable c interface callee * impl todo * naming * rm * add invalid * fix invoke arg * typed * rm log * rename pass * Update user_op_kernel_registry.h * Update user_op_kernel_registry.h * Update OneFlowOps.td * Update Passes.cpp * add comp ctx * add todo * refine todo * refactor op infer * minor fix * add check * refine error * refine msg * fix typo * fix typo * remove string in llvm * impl Tensor4ArgNameAndIndex * fix ninja c1 bug * realize gpu and add cuda test * auto format by CI * fix merge * fix ninja with cpu version * auto format by CI * rename * merge def * deduplicate code * fix * refactor * fix license * cache * add back TODO() * add jit arg type check * rm comment * fix typo * fix ci * todo ci * fix code style * rm misadded * rm misadded * Update Passes.cpp * pass ninja without debug about hungry mode of knerel init * fix null parsed module problem * fix dynamic cast of state problem * fix gpu error * fix * fix * auto format by CI * fix * Update kernel_launch_op.cpp * move * fix * auto format by CI * done * fix * fix * auto format by CI * fix * fix * auto format by CI * Update kernel_launch_op.cpp * rename * auto format by CI * fix * done * Update kernel_launch_op.cpp * fix * fix * fix * fix * fix * auto format by 
CI * Update oneflow/ir/oneflow-extension/kernel_launch_op.cpp Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * fix * fix * fix * fix * fix * Update oneflow/ir/lib/OneFlow/Passes.cpp Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> * fix * fix * fix Co-authored-by: jackalcooper <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> * interpolate api align (#9118) * Fix masked select op bug (#9120) * fix masked_select bug * refine * fix ci error * align with pytorch RANK env (#9111) * align with pytorch RANK env * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add oneflow hub (#9116) * add OneflowHub feature, consistent with PyTorchHub * add oneflow hub docs * refine docs and add test * refine * refine * refine * fix comment * auto format by CI * skip unittest Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix where op data_type infer bug (#9121) * fix where op data_type infer bug * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix like op infer dtype (#9127) * elementwise.cuh remove template parameter tail (#9128) * fix_global_tensor_detach_bug (#9134) * fix_global_tensor_detach_bug * fix test case * Add deform_conv2d op (#9095) * add new op * add kernel * add deform_conv * add some test * modify test * modify format * modify test * fix the bug and add test * Add error message * modify kernel and add test * adjust the format * add global test * Update python/oneflow/test/modules/test_deform_conv2d.py * add doc and modify global test * adjust OneFlowUserOps.td * remove headfile and modify doc * modify doc * add docs at rst * modify global test * remove unnecessary code * remove unnecessary code * remove debug code * initialize fields * modify global test * modify test * modify test * 
modify test * auto format by CI Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix inplace mul 0size check bug (#9132) * fix inplace mul 0-size tensor check bug * code format * revert * Align round op to support round half to even (#9135) * align round op * add test * modify doc ,test and kernel * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * rm dict in module apply (#9137) * rm dict in module apply * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * one_embedding support broadcast table_ids (#9109) * support broadcast table_ids * address review * fix like op infer dtype * address review * address review * refine * refine error message for framework (#9104) * refine error msg for framework * more error messages * fix size_t comparison with zero * check for incomplete error messages * err msg for inconsistent placement * modify acc. 
to review * convert enum to string in error msg * fix redundant error info; clean up * refine error msg for consistency check * auto format by CI Co-authored-by: Yao Chi <later@usopp.net> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix loss scale precision (#9126) * fix loss scale cast * amp_white_identity * revert debug log * move constant like back Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * one embedding eager (#8984) * forward * one_embedding eager * fix one_embedding grad * fix * fix * fix * fix amp * fix of_tidy * ONEFLOW_ONE_EMBEDDING_FUSE_UPDATE_PUT default true * merge master * save shadow var * get all ptr from embedding_state * reuse update and put op/kernel * mv id_shuffle to cuh * refine * refine * refine * refine * refine * refine * one_embedding eager forward * deterministic forward gen random * merge master * merge master * merge master * add table_ids in grad op * test pass * refine * create lazy state in lazy mode * optional learning_rate * add attr in update * refine * refine * refine * refine * fix adam and add adagrad attr * refine * refine * refine * refine * refine * address review * refine name * address review Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * module.to aligned with pytorch (#9083) * module.to aligned with pytorch Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * fix to str Signed-off-by: daquexian <daquexian566@gmail.com> * fix kwargs device bug Signed-off-by: daquexian <daquexian566@gmail.com> Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: binbinHan <han_binbin@163.com> * eager global zero_grad update sbp from b to p (#8853) * zero_grad b to p Signed-off-by: daquexian <daquexian566@gmail.com> * zero_grad b to p Signed-off-by: daquexian <daquexian566@gmail.com> * skip in lazy 
Signed-off-by: daquexian <daquexian566@gmail.com> * implement zero_grad in c++ Signed-off-by: daquexian <daquexian566@gmail.com> * _zero_grad to _zero_grad_, skip boxing of lazy tensor Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * auto format by CI * skip test in cpu only mode Signed-off-by: daquexian <daquexian566@gmail.com> Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support inplace scatter (#9016) * refine scatter * fix * refine * refine * add atomicMul & refine * refine * Dev linalg cross (#8979) * add linalg_cross in yaml * add linalg cross * fix * refine broadcast * add global test * reformat * refine and fix * fix tidy * add nansum (#9113) * add nansum, can work on cpu, fail on cuda * implement nansum on cuda * restore modification in preprocessor_internal.h * register only for floating types * remove kernel register for int types, and it works * add whole reduce functor * add backward func * add export in __init__ and refine code * refine code * refine code, and register kernel * add sbp * just for debuging, cannot compile * just for debuging, cannot compile * use primitive to implement assign nan * refine code * add docs, remove useless op and functor * remove useless kernel * add docs, fix bug of primitive * fix typo in global test * refine code * refine code * refine code * refine code * auto format by CI * Update binary_func.h * Update binary_func.h Co-authored-by: MARD1NO <359521840@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Feat eager global tensor indexing (#9138) * test(TensorIndexing): add global basic indexing test * format code * feat(TensorIndexing): support eager global advance indexing * test(TensorIndex): add global tensor indexing error 
message test * format code * feat(TensorIndexing): support global tensor combined indexing * format code * feat(TensorIndexing): eager global combined basic with advance indexing * fix(TensorIndexing): fix global tensor write back bug * remove useless code * refine test and comment * fix(TensorIndexing): remove an unnecessary slice_update * add comment * fix with static analysis Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add lr_scale for optimizers (#9008) * add lr_scale for opt * revert import * set lr scale in pass * add test * lr_scale default value * improve readability * fix_ctc_loss_error_with_float_target_input (#9143) * fix_ctc_loss_error_with_float_target_input * minor fix Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Inplace masked fill (#9133) * add inpalce masked_fill * reformat * refine * auto format by CI * refine according by comments of hbb * export via cpp directly * export oneflow.masked_fill_ * rename arg * refine test case Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix numpy>=1.23.0 advance indexing code (#9139) * test(TensorIndexing): fix numpy>=1.23.0 * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * add_tensor_new_full_func (#9149) * add_tensor_new_full_func * auto format by CI * add global test case * fix error Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * As strided regist more dtype (#9150) * as_strided register more kernel * add test * fix commnet * fix ci error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Auto Parallel (#8891) * add auto_parallel code add auto_parallel pass * Feat ap remove hierarchy cast (#7919) * feat(AutoParallel): support remove parallel_cast ops * feat(AutoParallel): export 
enable_auto_parallel_prune_parallel_cast_ops * format code * Fix add conv grad cost (#7972) * feat(Conv): add grad computation cost * fix ConvDataGrad computation cost * update conv grad cost * refine * Auto parallel/fast collector (#7958) * Try to speed up sbp collector. However, throughput drop * Shrink the parallel candidates for the proxy node * Print out some information and then refine * Store the sbp set for each consumer * Update binary set intersection * Remove impossible parallel candidates from sbp proxy * Refine binary set * Add a Clear() in binary set * Filter out those proxy candidates containing two sbps from the same unique group * refine * Check spells * Clip useless edges * AutoParallel mainstem algorithm add mutable_op_ctrl_edge (#8033) * feat(AutoParallel): mainstem algorithm add mutable_op_ctrl_edge * use if instead std::max * fix(AutoParallel): fix pooling computation cost function bug (#8147) * [WIP] Fix auto parallel dump uniform sbp bug (#8330) * fix(AutoParallel): fix auto parallel dump uniform sbp bug * refine source op judgement * update auto_parallel config (#8356) * Refactor dump nd sbp for auto parallel (#8353) * fix(AutoParallel): fix auto parallel dump uniform sbp bug * feat(AutoParallel): add inferface for op to dump nd_sbp to op_conf * refactor(AutoParallel): refactor DumpNdSbpSignatureForOpConfFn * rename Global to Singleton * Refactor SbpEdge (#8684) * refactor(AP): refactor SbpEdge * Rename variables * Add const for some functions Co-authored-by: Yipeng Li <jamesonli1313@gmail.com> * Refactor auto parallel sbp node (#8712) * Rename * Code clean up * Code clean up * Code clean up and package up * Rename * Add const for some functions * Refactor auto parallel sbp graph (#8722) * Code clean up * Package up * Code clean up and package up in SbpNode and SbpEdge * Rename * Rename * Rename mainstem to trunk * Typo, small bugs and rename * Rename and of format * Refactor auto parallel rest (#8731) * Package up SbpCollector * Add const 
for SbpGraph * Add const for SbpNode * Add const for SbpEdge * Add const for SbpCollector * Add const, rename, and package up for BinarySet * Rename for BinarySet * Rename for SbpCollector * Rename for SbpCollector * Rename for algorithm utils * Fix a bug for an unused function AddEntries() * Rename for BinarySet * Rename for SbpConstructor * Rename for BoxingCollector * Add const for sbp utils * fix merge conflict * Remove template for sbp signature (#8787) * Remove template for sbp signature * Remove _H_ from cpp files * Remove namespace specifier oneflow:: * Remove namespace specifier oneflow:: * Of format * Move the inline functions to cpp files * Can not add inline specifier? * Update oneflow/core/auto_parallel/sbp_graph.h Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Of format Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Refactor auto parallel class object stuff (#8835) * Delete copy/move constructor/operator * Move the deconstructor of SbpEdge to the cpp file * Equal by address for Sbp data structor * Replace sbp_sig_list_ with sbp_sig_obj_list_ * Fix auto parallel copy cost infer2 (#8788) * Check the output shape for operator in auto parallel * Return infinity for different sbps while is_mutable * Update oneflow/core/auto_parallel/sbp_constructor.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Update oneflow/core/operator/operator.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * with output -> check output Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Refactor prune identity as much as possible (#8849) * Prune a line of parallel cast ops * Avoid repeated pruning * Code clean up * Remove identity op * Update oneflow/core/job_rewriter/auto_parallel.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * Fix auto parallel low throughput (#8876) * Speed up after pruning identity * Slight changes * Refactor auto parallel final check (#8887) * Of format 
* Use const auto & * Of format and rename * Re-compute cost if steals sbp signatures * Docs auto parallel doc (#8896) * doc(AutoParallel): add auto parallel document framework * docs(AutoParallel): add document * fix typo * refine document * refine documentation * Test alexnet for auto_parallel (#8917) * test(AutoParallel): test alexnet for auto_parallel * test(AutoParallel): test model add auto_parallel config * Fix get sbp bug (#8939) * Fix the bug of missing sbp for uniform op * Speed up * Add the mising sbp for optional input UserSourceOpTickInput * Remove the repeated all-B sbp signature * Add sbp for undefined UserSourceOpTickInput * Resolve confits while merging master * Recompute cost with time shape (#9009) * Address comments * fix merge conflict * Address comments * Disabled ZeRO when enabled AutoParallel (#9087) fix(AutoParallel): disabled ZeRO when enabled AutoParallel * Update oneflow/core/job_rewriter/optimizer_placement_optimization_pass.cpp * Address comments * Address comment. 
GetComputationCostFn -> GetComputationCost * Update oneflow/core/job_rewriter/auto_parallel.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> * New interface for pr#9018 * Static analysis * Fix ones like sbp bug and fix test import error in CI (#9123) fix(AutoParallel): skip 1n1d sbp agreement check * auto format by CI * test(AutoParallel): skip acc check * Address comments * rename source op set nd_sbp function and add check * fix typo * Feat full auto parallel (#9140) * Use B for inplace op and remove the check for sbp while truning the auto prallelism on * Slight change * Not using B as the constrain * Address comments * add debugg log for non-deleted cast ops * update prune parallel cast op log * rename auto_parallel_prune_parallel_cast_ops to enable_auto_parallel_ignore_user_sbp_config Co-authored-by: wyg1997 <wangyinggang@foxmail.com> Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * refine oneflow op infer dtype error message (#9155) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix to_global PyArg_ParseTupleAndKeywords (#9158) * Fix tensor local_to_global parse keywords * use PyObject Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Implement exponential_ and multinomial (#9073) * add exponential distribution cpu kernel * add exponential distribution cuda kernel and local tests * refine test * fix bug * auto format by CI * auto format by CI * implement multinomial functor and cpu kernel * auto format by CI * add multinomial cuda kernel * auto format by CI * refine * add multinomial tests * auto format by CI * add categorical distribution module and docs * refine * refine * refine doc * refine * refine * revert Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Disable IB 
when there no active IB devices (#9115) * fix lru_cache offset (#9162) fix lru_cache offset for larger than uint32 Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Rename cast to global and cast from global (#9151) * rename_cast_to_global_and_cast_from_global * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refine datatype error message part2 (#9168) * refine more ops dtype infer error message * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * support tensor.triu_ (#9159) * support tensor.triu_ * Update tensor_functions.cpp * tensor.copy_ support stride (#9142) * tensor.copy_ support stride * add test case * PersistentTable add read_only flag (#9145) * read only * fix * avg_pool_nd support half (#9170) * avg_pool_nd support half * refine * refine * fix new_ones size paramater (#9161) * fix new_ones size paramater * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * hot-fix (#9191) * hot-fix * refine * skip env var check and calculate local rank if not given (#9183) * skip env var check Signed-off-by: daquexian <daquexian566@gmail.com> * calc local rank if need * No warning for absent LOCAL_RANK Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: Yu OuYang <xuanjiuye@gmail.com> Co-authored-by: clackhan <han_binbin@163.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * set to_contiguous to amp clear list (#9171) * add tensor.nansum (#9182) * Add slight cost for different sbp in 1 device (#9172) * Add slight cost for different sbp in 1 device * Print to INFO Co-authored-by: 
mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * refine_to_contiguous_dtype_register (#9196) * refine_to_contiguous_dtype_register * add test case * pool_nd_ops register gray list * skip autocast for non-user op (#9199) * `copy_` support numpy fp16 (#9189) * copy_ support numpy fp16 Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * fix matmul 0 size input error (#9147) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat functional scalar tensor parameter (#9190) * add ScalarTensor check and unpack, bug has link error * refine scalar tensor item function * feat(functional): functional support ScalarTensor transfer to Scalar automatically * feat(functional): support ScalarTensor transfer to Scalar * change auto transfer rule * test(Functional): add functional scalar tensor param test * format code * refine GetItemInScalarTensor function * Fix broadcast fmod grad (#8865) * impl trunc divide * fix broadcast fmod grad * trunc_div grad, scalar_trunc_div, and primitive * format * gradient_func * add test * rename * compatible with older versions of torch * resolve warning * test global Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat straighten compress memory (#9094) * An initial inplementation of linear programming primal matrix * Coding for the revised simplex method * Finish coding for the phase 1 * Fix bug. Now we can get a corrent x for the initial basic feasible solution * Drive the artificial variables out in phase 1 * Bland's rule and bug fix * Adjust the mapping between the basic variables and compact columns * No columns removed while driving artificial variables out. Terminates the code if positive optimal cost found in auxiliary problem. * Implement the phase 2 of the revised simplex method. 
Remove columns of the inverse base matrix. * Update is_solved status and original problem recovery. * Rows and artificial columns activation * An initial implementation of mix integer programming * Try to assemble the original problem but fial due to the massive exclusion * Steal initial position from current setting * Compute the optimal cost from the compact relationship * Move to a neighbor status and compute the cost * Find the smallest cost and actually move to that status * Check conflit after the adjustment. Adaptively cost reduce * Generate a compact position from nothing * Straighten for memory * Update the offset * Add a demo for using the revised simplex method * Remove the linear programming part * Recompute the compact relationship after moving to a new status * Rename * Code clean up * Set the tag for the straighten algorithm * Code clean up * An attemp to explore the dependency between consumer nodes of a register * Revert "An attemp to explore the dependency between consumer nodes of a register" This reverts commit f219851fb85943d07d28b84c45e5c4bae80872a0. 
* Compute the lower bound and only execute the adjustment 2 for those cases with possible reduction in memory * Pre-compute and store the memory size for registers * Use pre-stored total register num * Limit the maximum iteration step * Use VLOG(3) instead of std::cout * Change interface * Package up memory share strategy interfaces * Address comments * Address comments * Of format * Fix bug lower bound = 0 Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add contains magic method (#9185) * refine more ops dtype infer error message * refine * add tensor.__contains__ magic method Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Build cuda 11.8 (#9204) * export unsorted segment sum (#9206) export unsorted_segment_sum python Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Optimize OneEmbedding Save Snapshot (#9112) * init * fix compile error * refine * Refine put logic * todo lrucache logic * refine dump logic * finish * add flag check * Add env var * fix * fix a silly bug * fix template args * fix comment * add template * Refine comment * remove * fix bug * fix compile error * refine initial Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add Tensor.scatter_add & refine scatter (#9201) add Tensor.scatter_add & refine scatter * optimize layernorm need padding cols perf (#9195) * optimize layernorm need padding cols perf * auto format by CI * reduce binary size Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support Inplace behavior in Type Promotion (#9200) * support inplace * refine * add const * refine Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * Fix Broadcast Matmul check (#9213) fix check * Export MultiTensor Update and FuseUpdateCast to GraphConfig (#9209) * export to graph config * refine or Co-authored-by: mergify[bot] 
<37929162+mergify[bot]@users.noreply.github.com> * fix bug of matmul dim check in `oneflow.bmm` (#9215) * fix bug of matmul dim check * refine code * Update nn_functor.cpp * Regist arange fp16 (#9202) * arange op support cuda half * add test * format * fix comment * fix comment * refine * ci test error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix graph out argstree type judge (#9211) * reproduce bug * fix custom class type deal * fix typo * support ordereddict * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix ConcatFunctor error message (#9225) * Check async errors after kernel launched (#9226) Check errors after kernel launched * Skip unnecessary passes (#9219) * Skip unnecessary passes * refine * one_embedding fix typo (#9230) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [GetAsyncError] Add op name to error message (#9228) GetAsyncError refine error message * [JobBuildAndInferCtx]Remove an inefficient check (#9229) Remove an inefficient check * Fix linalg cross 0-size input error (#9232) * Add silu to amp list (#9233) * Disable CUDA virtual arch compilation (#9236) * Support set/get_default_dtype interface (#9227) * feat(DType): support set/get_default_dtype interface * doc(*): fix set/get_default_dtype document * doc(DType): refine document * feat(oneflow.tensor): support infer dtype as get_default_dtype * test(DType): add default dtype test * refine throw error * modify doctest because it will affect default dtype for other test * fix(DType): make DefaultDType is global * use default type in TensorWithDataCtorFunctor * fix(DType): flow.Tensor support DefaultDType * refine function name Co-authored-by: jackalcooper <jackalcooper@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Enhance 
doctest error message (#9237) * test(doctest): enhance doctest error message * Update python/oneflow/test/modules/test_functional_docstr.py Co-authored-by: Yao Chi <later@usopp.net> * Update python/oneflow/test/modules/test_functional_docstr.py Co-authored-by: Yao Chi <later@usopp.net> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Yao Chi <later@usopp.net> * Feat: script to import oneflow as torch globally (#9160) * feat: global `import torch as oneflow` * use `console_scripts` to install oneflow-mock-torch to PATH * close quote * use os.makedirs to create temp torch directory * rename to `oneflow-mock-torch` * don't create temp files * use positional argument with 2 choices * add `mock torch test` in CI * uncomment env setup * default argument is enable * fix docker exec * refactor test script * check successful recover * don't run setup.py * support submodule importing & display error message * fix import * and import-from * move mock_torch to oneflow dir; update test command * fix error message * update mock test (less strict) * add more tests for torch imports * modify export path * mock_torch is a package Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * add time and mem log tools (#9164) * add time and mem log tools * refine format * auto format by CI * address review * auto format by CI * log with json format * rm useless * refine log format Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * support bool for `oneflow.nn.functional.pad` (#9234) * support bool in functor and kernel, add unittest for int and bool * refine unittest * check value for bool tensor * Feat: rand/randn support float16 kernel (#9238) * feat(Op): rand/randn support float16 kernel * add error message and refine code Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * reduce auto tick generate time (#9235) * reduce time * rm useless 
* address review, refactor structure * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * TensorIndexing support float16 (#9247) * feat(TensorIndexing): support float16 * feat(TensorIndexing): support bfloat16 * skip bfloat16 test when cuda version less than 11000 Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * Add cudnn handle pool (#9243) * add_cudnn_handle_queue * deal normalization_kernel * refine * refine * resolve comment * minor fix * refine * auto format by CI * fix static check Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Added error message for CUDA device incompatibility (#9250) * Added error message for CUDA device incompatibility * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix autograd.Function memory leak (#9249) * fix(AutogradFunction): fix memory leak * add ptr check for AutogradState data * test(AutogradFunction): ensure PyAutogradFunctionState released * test(AutogradFunction): decrease memory * register __dict__ function * refine code * fix state release test bug * refine error message * Feat speed up mem reuse (#9210) * Use HashSet instead of vector * O(n^3) -> O(n^2) * Compute offset for memory-first algorithm only * Remove explicit exclusion relationship * Revert print out information * Speed up exclusion judgement * Switch HashMap to vector * Code clean up * life time -> lifetime * mem_reused_regst: HashSet -> std::vector regst_desc_id2regst_desc -> mem_chain2regst_desc_id2reuse_regst_desc * Re-implement MemReusedAlgorithm_TimeLineAlgo and comment out useless code * Make allocate and free timeline local and HashSet -> std::vector * Eliminate a lot of Hash stuffs * Revert "Eliminate a lot of Hash stuffs" This reverts commit abfb86df57b13074cb50ca9dc080a1333cd46802. 
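The lifetime bookkeeping in the memory-reuse speedup above (#9210) rests on one rule: two registers may share memory if and only if their lifetimes do not overlap. A toy sketch of greedy offset assignment under that rule, with hypothetical register data and not the actual MemReusedAlgorithm implementation:

```python
def assign_offsets(regs):
    """regs: {name: (start, end, size)} with half-open [start, end)
    lifetimes. Registers whose lifetimes do not overlap may be placed
    at the same offset; overlapping ones must not collide."""
    placed = {}  # name -> assigned offset

    def overlaps(a, b):
        (s1, e1, _), (s2, e2, _) = regs[a], regs[b]
        return s1 < e2 and s2 < e1

    # Place the largest registers first; they are the hardest to fit.
    for name in sorted(regs, key=lambda n: -regs[n][2]):
        size = regs[name][2]
        # Address ranges already taken by lifetime-overlapping registers.
        busy = sorted((placed[o], placed[o] + regs[o][2])
                      for o in placed if overlaps(name, o))
        offset = 0
        for lo, hi in busy:
            if offset + size <= lo:  # fits in the gap before this range
                break
            offset = max(offset, hi)
        placed[name] = offset
    return placed
```

With registers a (life 0..2, size 4), b (life 2..4, size 4), and c (life 1..3, size 2), a and b share offset 0 while c lands at 4, for a footprint of 6 instead of the naive 10.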
* Important comment * Address comments * auto format by CI * Remove magic number -1 * Address comment and rename Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix bug: segfault when argmax has 0 size tensor as input (#9242) * fix_half_check_of_reduce_mean (#9014) * fix_half_check_of_reduce_mean * refine * Support float16 for initializer operators (#9253) * feat(*): support float16 for initializer operators * refine test * Add half clamp (#9241) * Register half * register fp16 in clamp kernel, add check for fp16 in functor, update unittest for more dtype * format code * add macro WITH_CUDA Co-authored-by: WangYi <buaawangyi03@gmail.com> Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [CUDA]CheckVersionCompatibility (#9257) * [CUDA]CheckVersionCompatibility * Add CUDA 10.2 Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat: monkeypatching pytorch (#9256) * update custom meta path finder * update test commands * print warning if `torch` is already imported * rename to `mock` * update tests * private attribute cannot be imported with import * * split testcase Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * support destory_rdma (#9246) * support destory_rdma * refine * auto format by CI * refine * refine Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * add bincount (#9156) * add bincount * add docs, use atomic add in cuda kernel, add unittest * add minlength param, fix bug of memset in kernel * refine code * refine code * convert to local when input is global, add global test * auto format by CI * refine code * refine docstr, reduce doc length in one line * register fp16, add tensor function and unittest * add 
docs for tensor.bincount * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * ONEFLOW_STREAM_ENABLE_H2D_STREAM (#9205) * Modify generator.manual_seed to return generator rather than None (#9262) generator.manual_seed return generator rather than None Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Dev add tensor bernoulli (#9261) * add tensor.bernoulli * add docs * Update tensor.py * Update tensor.py * Update tensor.py * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Multi tensor update (#9252) * fix multi_tensor_sgd segfault * enable learning_rate_val to replace learning_rate Tensor * support adam and adamw * support epsilon for adam and adamw Co-authored-by: songyicheng <int.rejoice@gmail.com> * fix a typo in readme (#9268) * support nested asyncs.thread (#9270) * OneEmbedding add smart decay sparse adam (#9176) * add sparse adam * smart decay sparse adam * address review * fix * mv smart_decay to one_embedding namespace * upgrade clang-tidy used in ninja of_tidy (#9263) upgrade clang-tidy in ninja of_tidy Signed-off-by: daquexian <daquexian566@gmail.com> Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat/compile time count (#9245) * add graph compile time count * refine compile log * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix random_normal (#9274) Co-authored-by: Juncheng <liujuncheng1022@gmail.com> * Flip and upsample bilinear support fp16 (#9284) * slice update cpu kernel multi_thread loop * refine * upsample bilinear and flip register fp16 cuda kernel * fix comment * revert Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix PruneAmpWhiteIdentityOpPass (#9276) * fix * fix dup del * 
ref algorithm * fix dup mut * simple impl * rm useless code * fix * fix typo * fix typo Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * support api flow.randn_like (#9283) * support api flow.randn_like * refine * remove dry run, add sanitizers to ci (#8670) * fix some data races in c++ api and SteadyVector Signed-off-by: daquexian <daquexian566@gmail.com> * skip self copy in MutShapeView::ToShape Signed-off-by: daquexian <daquexian566@gmail.com> * remove dry run, add sanitizers to ci Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * update gh action * skip lit Signed-off-by: daquexian <daquexian566@gmail.com> * suppress ubsan error in llvm Signed-off-by: daquexian <daquexian566@gmail.com> * disable ubsan for now Signed-off-by: daquexian <daquexian566@gmail.com> * fix ci path Signed-off-by: daquexian <daquexian566@gmail.com> * update test manylinux docker Signed-off-by: daquexian <daquexian566@gmail.com> * restore dry run rpc manager Signed-off-by: daquexian <daquexian566@gmail.com> * run tsan for 3 times Signed-off-by: daquexian <daquexian566@gmail.com> * do not find initializer order bug Signed-off-by: daquexian <daquexian566@gmail.com> * fix merge conflict Signed-off-by: daquexian <daquexian566@gmail.com> * skip sanitizer test in cuda misc Signed-off-by: daquexian <daquexian566@gmail.com> * sleep Signed-off-by: daquexian <daquexian566@gmail.com> * suppress by __attribute__((no_sanitize_address)) Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * revert suppression * fix heap-use-after-free found by asan * auto format by CI * bash -c Signed-off-by: daquexian <daquexian566@gmail.com> Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: tsai <jackalcooper@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add build config for RTX 40xx GPUs (#9290) * Bool support for triu (#9291) * Refix 
PruneAmpWhiteIdentityOpPass (#9294) fix Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix concat #8833 (#9275) * fix concat #8833 * support multi-none-input * test and global test * auto format by CI * format license Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * support half for masked_fill (#9292) * Fix BatchNorm performance (#9298) * slice update cpu kernel multi_thread loop (#9264) * slice update cpu kernel multi_thread loop * refine * try to fix bug * auto format by CI * delete useless header file Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * fix inplace bug in `tensor.masked_fill_` (#9295) * fix: bind tensor.masked_fill_ to inplace version, fix bug in unittest * refine unittest * fix_inplace_copy_bug (#9301) * FusedMultiHeadAttentionInference (#9287) * FusedMultiHeadAttentionInference * auto format by CI * cmake * fix graph * auto format by CI * fix cmake for mlir * rm duplicated install * fix align * support float * support causal * support causal * test global property * fix * disable clang * skip cpu test * skip all test Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: jackalcooper <jackalcooper@gmail.com> * Fix compile warnings (#9302) * Fix compile warnings * fix * Set the default value of CUDA_STATIC to OFF when CUDA version is greater than or equal to 11.8 (#9306) * Reduce pass time cost (#9281) * batch del in PrunePinnedIdentityOpPass * add log * fix and refine fuse add_n * add new line * avoid op graph create * add op graph cost cnt and fix boxing log * fix ndsbp csv str * fix multi add same add_n * auto format by CI * rm debug log * auto format by CI * to cont ref * rm useless * refine auto modifier * rm useless * hack to debug * hack to debug * hack to debug * hack to debug * hack to debug ci * hack to debug ci * fix test case env var 
* fix env var set * revert to const ref * auto format by CI * sync to make sure tensor are created Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refactor get sbp signature (#9304) * Add a GetSbpSignature with use parallel num instead of parallel description * Get sbp_sig_list for each dimension of hierarchy * Add test script and print out information * Remove parallel description in GetSbpSignature() * Fix small bug * Disable InferNdSbp for reshape op * Revert "Add test script and print out information" This reverts commit fdc7ee8558cab68aa9fa152cf1ba2a6dc2b4554e. * Add hierarchy value * Address comments * parallel num j-> hierarchy value for reshape op * Static analysis * refine * Update user_op.cpp * Update operator.cpp * auto format by CI * Revert Update operator.cpp This commit revert 64832e43196067d67f70094a8d35664a805a5891 Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix type error for entering a single tensor using concat op (#9316) * modify tensorprocessor * remove blank line * remove blank line * modify CheckHasDifferentInputDType func * Update oneflow/core/functional/tensor_processor.cpp * auto format by CI Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add more sbp signature print functions for log and debug (#9293) * debug code * ReshapeOp::GetSBP use hierarchy dim instead of parallel_num * comment debug log * revert debug code * auto format by CI * rm NdSbpSignatureListAsString * rm 1d sbp signature print functions Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Release/nightly cu118 (#9308) * update action * 116->118 * preserve 116 Co-authored-by: Juncheng <liujuncheng1022@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix different dtype in slice_update (#9331) * 
fix(SliceUpdate): fix different dtype in slice_update close #9330 * test(SliceUpdate): enhance test case Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix FlattenOp GetSbp (#9322) * fix flatten GetSbp * rm flatten op * update group stat * rm mlir test * fix * more strict check * add reshape conversion Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refactor ONEFLOW_MLIR_PREFER_NHWC to support more ops (#9335) * use bn as gn * hack gn as relu * refine * support concat * ScalarDivOp * fix * move files * refine * fix bn * try fix * fix concat * fix * DRY * refactor * refactor * fix * workaround * add baseclass * rm hack * auto format by CI * minor refine * refine * add more Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * distributions.Categorical support logits not None (#9332) * avoid extra gpu memory usage in flow.save (#9328) * boxing to cpu first in flow.save Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Use primitive to replace Ndarray::BroadcastBinary (#9311) * Use primitive to replace Ndarray::BroadcastBinary * refine * fix * negative * refine * refine * Block forward support modification (#9336) * block forward support modification * add test * fix format * auto format by CI * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add log sum exp api (#9333) * add_log_sum_exp_api * refine * add logsumexp to tensor * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat: isclose and allclose (#9280) * add allclose op in tablegen * add isclose & allclose op in functional layer * use existing 
framework to implement `isclose` * import isclose & allclose * compose isclose and other op to form allclose in python * typo * add doc & test files * add default arg * curly braces between one stmt * generate one random data, the other is perturbation * update test * comment for ndarray bin func * add ref from torch * Refactor random op with consistent data (#9299) * refactor(RanddomOp): refactor random op with consistent data * test(RandomOp): add data consistent test * fix(RandomSeed): fix parallel_num==1 * move normal functor to random_functor.cpp * test(RandomOp): refine test * add comment for random_seed getter function * remove special judgement for 1n1d * fix random_seed parallel_num==1 * fix cuda generator index bug * fix test function name bug Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * bool tensor slice_update use masked_fill when possible (#9324) * bool tensor slice_update use masked_fill when possible * refine * auto format by CI * fix comment * auto format by CI * Update oneflow/api/python/framework/tensor_functions.cpp Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> * refine * auto format by CI * except partial sum test * add todo Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> * Move tensor apis to cpython (#9303) * move tensor.is_floating_point to c++ * refine * move tensor.split to c++ * move tensor.flip to c++ * auto format by CI * Update oneflow/api/python/framework/tensor.cpp Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> * refactor flip * refine * auto format by CI * fix free(): invalid pointer Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com> * Add gelu_tanh op and kernel (#9343) * gelu_tanh * rename GeluTanh -> FastGelu * regulate constant and increase precision * instantiate and reg backward * reg grad fn * address 
review * address review * format * update test * refine_test_maxpool2d_channel_last (#9344) * refine * auto format by CI * add skip * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Refactor normal initializer (#9307) * refactor(RanddomOp): refactor random op with consistent data * test(RandomOp): add data consistent test * refactor(Initializer): refactor normal with oneflow kernel * fix(RandomSeed): fix parallel_num==1 * test(initializer): add initializer data test * format code * move normal functor to random_functor.cpp * test(RandomOp): refine test * add trunc_normal and relax mean/std precision * fix conflict * fix merge conflict Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support fp16 in constant folding (#9337) * support fp16 * format * clean * refine * auto format by CI * refine test * clean * refine * refine Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * fix exp overflow with minus max trick (#9353) * Fix occasional bug in random_op data test (#9354) fix(RandomOp): fix occasional bug in random_op data test Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Dev add gumbel softmax (#9208) * regis gumbel_softmax * add: gumel_noise, attr-hard, next: log, one-hot, grad * add(fail): exp_dist * add: gumbel, grad on cpu, next: cuda * add: cuda & test bug: Synchronize() * add: docs, test_hrad, test_grad * add: format code * fix: TmpSize * fix: review * format, try to add * add: functor * format & half of rand * remove ops & kernels * support half of argmax & dim_scatter * fix review * add gumbel softmax docs * fix review * remove gumbel_softmax_grad_functor * remove grad in yaml * fix: raise half no util error * auto format by CI * auto format by CI * fix: make * fix: static Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix the 
inconsistent behavior of slice update (#9321) * modify tensor_index.cpp * modify * support scalar tensor indexing * support scalar * modify tensor_util * modify tensor_index * add macro definition * add support type * refine getitemscalartensor * Update oneflow/core/framework/tensor_util.cpp * modify macro * modify macro and test * modify test * modify function parameter * modify tensor_index ("uint8" is regarded as "bool") Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * enable autocast for that op which has nocast arguments (#9362) * fix autocast * fix * Add NHWC format for group norm (#9368) * group * nhwc * test_case * ir * fix * refine * Enable ZeRO with auto parallel (#9288) * Enable ZeRO with auto parallel in the first setting and speed up * Remove compute_cost parameter from Initialization of copy cost * Move the addition of wait time into sbp_node * Remove transfer cost since it is merged into the GetTransferCost() * Rename mainstem to trunk * Update warning Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat unbalanced split nd sbp (#9310) * Add a GetSbpSignature with use parallel num instead of parallel description * Get sbp_sig_list for each dimension of hierarchy * Add test script and print out information * Remove parallel description in GetSbpSignature() * Fix small bug * Disable InferNdSbp for reshape op * Revert "Add test script and print out information" This reverts commit fdc7ee8558cab68aa9fa152cf1ba2a6dc2b4554e. * Use the same physical shape as eager did * Remove the difference between eager and lazy for physical shape * Update the filter * Revert "Use the same physical shape as eager did" This reverts commit f20e222327e21166d5b5325e37c3cbe9ca4f4ac6. 
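The `logsumexp` API (#9333) and the "fix exp overflow with minus max trick" change (#9353) listed earlier both rest on the same identity: log(sum(exp(x_i))) = m + log(sum(exp(x_i - m))) with m = max(x), so the exponentials never overflow. A pure-Python sketch of the trick (not the OneFlow kernel):

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) via the "minus max" trick:
    subtracting the maximum before exponentiating keeps every exp()
    argument <= 0, so exp never overflows."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))
```

A naive `math.log(sum(math.exp(x) for x in xs))` raises OverflowError already at x = 1000, while the stable version returns 1000 + log(2) for two such entries.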
* Compute range for each rank * Compute position for range * Remove the difference between eager and lazy * Allow unbalanced split for variables * Add test script and print out information * Pass 2d test cases * Resolve conflict * Can not merge some split * Reduce in and out sbp simultaneously * Speed up for 1d sbp Package up the function for replacing hierarchy * Reduced simultaneously with the same hierarchy * Deal with 1to2d and 2to1d in InOutParallelDimReduce() * Pass 1to2d and 2to1d test cases * Remove the old code * Revert "Add test script and print out information" This reverts commit 58cdfb40b6536eb74c02174d3a69409676da374f. * Add the check for split questionary back * Feat speed up cost computation (#9355) * Compilation speed up * Speed up compilation for cost between 1d sbp * fix comment typo * Address comment Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add upsample_nearest_2d to amp clear list (#9366) * fix cuda integral type closeness computation (#9346) * fix cuda integral type computation * remove include Co-authored-by: Juncheng <liujuncheng1022@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add fused linear (#9369) * Support fp16 on some cpu operators (#9374) support fp16 cpu triu * Scalar math kernels support inplace (#9372) * Scalar math kernels support inplace * type * fix * Optimize GroupNorm NHWC with FastDivmod (#9373) * GradAcc Mem V5: Part 0-4 (#8961) * default nccl use compute stream in grad acc * rm sharable mem block graph * half implement of LogicalChains * part-0 : Logical Chain * fix compile * logical chain runnable * fix bug of logical chain dp * Part 1 : AfterGradAccChain * fix bug of crash in acc chain infer * AccCtrlTick Op/Task/Actor/Pass * tmp * AccCtrlTick runnable * rename group boxing identity and model diff scale op name * strict order by acc tick * merge mem block by logical chain id group * fix user op register * fix GLOG error when no grad acc 
* Inplace repeat variable * Inplace repeat support consumed/produced ctrl regst * Part-4: merge acc op into chain for reuse memory acc input (#9071) LogicalChain can merge acc op into chain for reuse memory acc input Measured GPT memory usage is the same as part-3; for bert and t5, memory usage is mostly slightly lower than part-3 https://github.com/Oneflow-Inc/OneTeam/issues/1670#issuecomment-1240468576 * find first source/sink op in acc chain into which ctrl edges can be inserted * TryMergeAfterAccLogicalChainToFirstLogicalChain * remove debug log * rm old version repeat kernel * fix format * MergeChainByLogicalChainId/PhysicalTaskGraph * IsValidChainId * rm useless file * remove note * fix clang-tidy * more IsValidChainId * rm debug log * rm note * fix bug of cpu repeat inplace var bug * fix bug of memory reuse for 0-size regst in time line algo * fix bug of acc chain merge mem guard * reuse cast to tick op * fix bug of acc different stream hint cause sync backward compute * actor name log * fix for review * remove log * fix note * fix bug of connect to cast to tick op * refine code for review * fix for review Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix the bug of fill_tensor_ of support fp16 & autocast (#9375) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Allocate in instruction computation (#9282) * allocate memory in InstructionPolicy::Compute * remove unused methods of VirtualMachineEngine. * backup code * UnimplementedAllocator * prepare allocators for each cpu stream. 
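The "prepare allocators for each cpu stream" step above can be pictured as keeping one independent allocator per stream, so allocations issued on different streams never share mutable state. A toy sketch with a hypothetical bump allocator, not OneFlow's actual vm allocator:

```python
class BumpAllocator:
    """Toy bump allocator: hands out monotonically increasing offsets
    from its own private arena."""
    def __init__(self):
        self.top = 0

    def allocate(self, size):
        offset = self.top
        self.top += size
        return offset


class StreamAllocators:
    """One allocator per stream id, created lazily on first use, so
    each stream's allocations proceed without touching another
    stream's allocator state."""
    def __init__(self):
        self._per_stream = {}

    def allocate(self, stream_id, size):
        alloc = self._per_stream.setdefault(stream_id, BumpAllocator())
        return alloc.allocate(size)
```

Two allocations on stream 0 come back at offsets 0 and 8 of its arena, while the first allocation on stream 1 starts again at 0.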
* allocator for ccl stream * init AllocateTensorInstructionPolicy::output_dependences_ * only sync current rank in oneflow._oneflow_internal.eager.Sync * Update oneflow/core/vm/allocate_tensor_instruction_policy.cpp Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * Disable conv algorithm search in eager mode (#9…
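For reference, the semantics of the `bincount` op added earlier in this merge (#9156), including its `minlength` parameter, can be sketched in pure Python. This mirrors torch-style `bincount` (the `weights` parameter is an assumption from that API; the exact OneFlow signature may differ):

```python
def bincount(xs, weights=None, minlength=0):
    """Count occurrences of each non-negative int in xs.
    out[i] is the number of times i appears (or the sum of weights at
    those positions); the result has at least `minlength` entries."""
    size = max(max(xs) + 1 if xs else 0, minlength)
    out = [0] * size
    for i, x in enumerate(xs):
        if x < 0:
            raise ValueError("bincount requires non-negative ints")
        out[x] += 1 if weights is None else weights[i]
    return out
```

For example, `bincount([1, 1, 3])` yields `[0, 2, 0, 1]`, and `minlength` pads the tail with zeros when the largest value is small.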