Rank task graph merge master (#9440)
* Use Primitive in Scalar Pow Grad (#8620)

* scalar math use primitive

* fix

* support pow grad

* dev scalar pow grad

* remove useless code

* use std

* auto format by CI

* Refine

Co-authored-by: guo-ran <360112263@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
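
As a side note on what the Primitive-backed scalar pow backward computes — the ordinary power rule, chained with the upstream gradient. A minimal numeric sketch in plain Python (illustrative only, not the OneFlow kernel):

```python
import math

def scalar_pow_grad(x, exponent, dy):
    # d/dx x**p = p * x**(p - 1), chained with the upstream gradient dy
    return dy * exponent * math.pow(x, exponent - 1)

# check against a central finite difference
x, p, dy, eps = 3.0, 2.5, 1.0, 1e-6
numeric = dy * (math.pow(x + eps, p) - math.pow(x - eps, p)) / (2 * eps)
analytic = scalar_pow_grad(x, p, dy)
```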

* Add higher order derivative for loss function (#9070)

* add higher order derivative for smooth_l1/nll loss

* add higher order derivative for bce/kl_div loss

* fix bug and refine testcase

* fix wrong sbp signature of bce loss

* optimize code and align precision with pytorch

* add some index check

* disable calc derivative for target in bce loss

* remove unnecessary header include

* fix sbp setting in testcase, and restore out_grads size check

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
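
The smooth L1 case above has a particularly simple second derivative — constant 1/beta inside the quadratic zone and zero outside. A hedged sketch in plain Python (function names are illustrative, not OneFlow's API):

```python
def smooth_l1(d, beta=1.0):
    # quadratic near zero, linear in the tails; d is the input difference
    return 0.5 * d * d / beta if abs(d) < beta else abs(d) - 0.5 * beta

def smooth_l1_grad(d, beta=1.0):
    # first derivative w.r.t. d
    return d / beta if abs(d) < beta else (1.0 if d > 0 else -1.0)

def smooth_l1_grad_grad(d, beta=1.0):
    # second derivative: 1/beta inside the quadratic zone, 0 outside
    return 1.0 / beta if abs(d) < beta else 0.0

# check the second derivative against a finite difference of the first
d, eps = 0.3, 1e-6
num_grad2 = (smooth_l1_grad(d + eps) - smooth_l1_grad(d - eps)) / (2 * eps)
```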

* Add higher order derivative for softmax and activation (#9032)

* add higher order derivative for softmax/logsoftmax

* add higher order derivative for mish/gelu activation

* auto format by CI

* add comment for constexpr parameter

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add higher order derivative for pool (#9096)

* add higher order derivative for pool

* refine

* optimize

* fix ndim check error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Cross Entropy supports probability targets (#9064)

* support prob target for cross entropy, still has a bug for ndim > 2

* fix bug for ndim > 2 inputs, refine code

* refine code, use template HasLabelSmoothing

* fix grad bug for ndim > 2 inputs, use pre-calculated factor in kernel

* format code, remove redundant including header files

* refine op

* restore wrong modification

* remove op, implement at functor layer

* set bind_python to false, remove redundant header files

* add docs

* fix missing default param in unittest, fix typo in docstr example

* auto format by CI

* Update loss.py

* remove useless file

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
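
The probability-target (soft label) cross entropy added here computes -(sum_c p_c * log softmax(x)_c); with a one-hot p it reduces to the ordinary hard-label loss. A small self-contained sketch (plain Python, illustrative names):

```python
import math

def log_softmax(logits):
    # numerically stable log softmax via the log-sum-exp trick
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def cross_entropy_prob(logits, target_probs):
    # soft-label cross entropy: -(sum_c p_c * log softmax(x)_c)
    return -sum(p * l for p, l in zip(target_probs, log_softmax(logits)))

logits = [2.0, 0.5, -1.0]
soft_ce = cross_entropy_prob(logits, [1.0, 0.0, 0.0])  # one-hot target
hard_ce = -log_softmax(logits)[0]                      # ordinary hard-label CE
```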

* Fix nvjpegDecodeParamsSetROI (#9101)

* Fix nvjpegGetImageInfo

* fix set ROI

* add series op : adaptive_max_pool1d/2d/3d (#9023)

* startup: cpu adaptive max pool 2d finished (a draft)

* add 1d/2d/3d forward

* add return_indices

* refine file hierarchy

* add adaptive_max_pool2d_grad for test

* draft backward op for maxpool 2d

* cpu op/kernel finished

* reformat

* gpu draft kernel

* gpu forward finished

* draft gpu backward version

* refine gpu backward

* add nn.AdaptiveMaxPoolnd Module

* add docstring

* rename avg pool gpu file

* refine .td file

* refine

* refine test case

* refine

* refine by comments of zzk

* refine according to clang_tidy errors

* refine

* refine by comments of zhuping
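
For reference, adaptive max pooling derives each output cell's window from the input/output sizes (PyTorch-style: start = floor(i*in/out), end = ceil((i+1)*in/out)), and return_indices reports the argmax position. A 1-d sketch in plain Python (illustrative, not the kernel code):

```python
def adaptive_windows(in_size, out_size):
    # output cell i pools over [floor(i*in/out), ceil((i+1)*in/out))
    return [(i * in_size // out_size,
             -(-(i + 1) * in_size // out_size))  # ceil division
            for i in range(out_size)]

def adaptive_max_pool1d(x, out_size):
    wins = adaptive_windows(len(x), out_size)
    values = [max(x[s:e]) for s, e in wins]
    # return_indices-style argmax positions, in input coordinates
    indices = [s + x[s:e].index(max(x[s:e])) for s, e in wins]
    return values, indices

vals, idxs = adaptive_max_pool1d([1, 5, 2, 8, 3, 0, 7], 3)
```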

* one_embedding physical_block_size change to 4096 (#9017)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* OneEmbedding add ONEFLOW_ONE_EMBEDDING_DISABLE_PIPELINE (#9098)

* one_embedding eager forward

* deterministic forward gen random

* merge master

* merge master

* grad op add attrs

* Revert "grad op add attrs"

This reverts commit 33b67c75d1e5d0e6529a108f7e7a17bc458dc661.

* auto format by CI

* format

* refine

* prefetch consume id_shuffle out and exec in advance

* add new task_node

* sort and add ctrl edge

* rm id_shuffle_task_node

* add register same output blob regst num

* rm tasktype

* refine

* address review

* rename

* refine

* refine

* refine

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* develop eager AMP (#9088)

* implement eager AMP

* skip autocast for inplace and implement make autocast meta

* fix

* rm unused code

* autocast python api

* fix

* fix

* refine

* skip autocast if any input is float32 for gray or clear list

* refine

* fix dead loop

* add autocast unittest

* refine worker seed (#9102)

* refine worker seed

* refine

* refine

* use default_generator.seed

* Dev GroupNorm (#7784)

* add groupnorm infer

* Add groupnorm forward

* refine other forward situations

* groupnorm backward still has bug

* fix forward

* support backward

* add slow groupnorm param grad kernel

* use blockreduce

* update blocknum

* add gradient func

* simplify code

* refine and add global test

* remove annotation

* not limit split dim

* fix compile error

* Add spatialsize pack logic and fix launch blocknum bug

* add two stage reduced backward kernel

* refine

* simplify logic

* refine pack logic

* use THREAD_CACHED_MUTABLE_ATTR_MAP

* fix comment

* refine

* refine comment

* Refine more check

* fix affine=False bug

* fix bug

* tmp use gemm reduce

* use ComputeType buf

* fix nvbfloat16 compute type

* add amp gray list

* Revert back

* fix clang analysis

* refine userops.td

* fix userops

* remove result_segment_sizes

* add dispatch logic for groupnorm grad uncached block impl

Co-authored-by: luyang <flowingsun007@163.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Introduce bfloat16 type (#9067)

* introduce_bfloat16_type

* storage

* fix compile error

* support bfloat16 ep operator

* support create cpu bfloat tensor

* refine code

* minor fix

* fix static check error

* resolve comment

* add more test case

* fix bfloat16 numeric_limits

* fix error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Refine check in ibverbs (#8974)

* refine check in ibverbs

* format

* fix typo and test

* refine error message when there is no errno

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support padding_idx in OneEmbedding (#8998)

* init

* Add attribute val in Userops.td

* simply add paddingidx logic in EncodeLookupKernel

* add simple padding_idx EmbeddingGrad

* when index is -1 let gather add 0

* skip atomicadd when row index equals to padding_idx

* change padding_idx type to int64

* fix compile error

* set padding_idx in Pass

* 1n1d eval success

* refine

* remove print

* fix compile error

* revert

* refine

* fix compile

* refine

* Refine

* refine

* refine store options

* remove embedding grad shuffle redundant padding_idx

* move gather in datashuffle kernel

* remove redundant code

* Refine

* refine

* remove redundant header file

* Set padding idx as optional and remove attr has_padding_idx

* Add padding_idx unittest

* use array equal instead of allclose

* remove a test

* enlarge timeout
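
The padding_idx semantics sketched above — a row index equal to padding_idx contributes nothing to the embedding gradient, so that row is never updated — can be illustrated in plain Python (names are illustrative, not OneFlow's API):

```python
def embedding_lookup_grad(ids, dy_rows, num_rows, dim, padding_idx=None):
    # scatter-add upstream row gradients; ids equal to padding_idx are
    # skipped, so the padding row's embedding vector stays untouched
    grad = [[0.0] * dim for _ in range(num_rows)]
    for idx, dy in zip(ids, dy_rows):
        if idx == padding_idx:
            continue
        for j in range(dim):
            grad[idx][j] += dy[j]
    return grad

g = embedding_lookup_grad([0, 2, 2, 1], [[1.0]] * 4, num_rows=3, dim=1,
                          padding_idx=2)
```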

* launch oneflow kernels in code generated with MLIR (#8980)

* init

* registry

* add KernelLaunchFunctionPass

* pass ninja and relu test

* mlir test script & lowering

* relu py

* fix

* kernel launch

* fix

* fix op and pass interfaces

* add comment

* add readme docs

* fix typo

* kernel launch function pass is done

* use template and rename func.func

* declare

* pass string through mlir.llvm dialect to c interface:
llvm.mlir.global internal constant @"relu-0_var"("relu-0")
%0 = "llvm.mlir.addressof"() {global_name = @"relu-0_var"} : () -> !llvm.ptr<array<6 x i8>>
%1 = "llvm.mlir.constant"() {value = 0 : index} : () -> i64
%2 = "llvm.getelementptr"(%0, %1, %1) {structIndices = dense<-2147483648> : tensor<2xi32>} : (!llvm.ptr<array<6 x i8>>, i64, i64) -> !llvm.ptr<i8>

* use symbol table

* use oneflow variable op

* fix symboltable

* fix

* ninja c1 check

* split into kernel-launch-function pass and kernel-launch-with-llvm pass

* restore pass 1

* Gen kernel example (#9042)

* add example

* add todo

* add basic assertion

* add file check

* create pass in translation

* sanitizeIdentifier

* enable print

* fix

* update test file

* kernel llvm pass is ok

* pass ctx ptr to func and this ptr will be an operand to call c interface function

* restore llvm ptr type to llvm.ptr<i8>

* Kernel lookup in launch op (#9059)

* add

* move function to another unit

* create map

* add iter

* impl TensorDesc4ArgNameAndIndex

* set dev tag

* load lib when ONEFLOW_MLIR_FUSE_KERNEL_LAUNCH is set

* sharedlibs enables and pass enables in commpute

* enable c interface callee

* impl todo

* naming

* rm

* add invalid

* fix invoke arg

* typed

* rm log

* rename pass

* Update user_op_kernel_registry.h

* Update user_op_kernel_registry.h

* Update OneFlowOps.td

* Update Passes.cpp

* add comp ctx

* add todo

* refine todo

* refactor op infer

* minor fix

* add check

* refine error

* refine msg

* fix typo

* fix typo

* remove string in llvm

* impl Tensor4ArgNameAndIndex

* fix ninja c1 bug

* realize gpu and add cuda test

* auto format by CI

* fix merge

* fix ninja with cpu version

* auto format by CI

* rename

* merge def

* deduplicate code

* fix

* refactor

* fix license

* cache

* add back TODO()

* add jit arg type check

* rm comment

* fix typo

* fix ci

* todo ci

* fix code style

* rm misadded

* rm misadded

* Update Passes.cpp

* pass ninja without debug for hungry mode of kernel init

* fix null parsed module problem

* fix dynamic cast of state problem

* fix gpu error

* fix

* fix

* auto format by CI

* fix

* Update kernel_launch_op.cpp

* move

* fix

* auto format by CI

* done

* fix

* fix

* auto format by CI

* fix

* fix

* auto format by CI

* Update kernel_launch_op.cpp

* rename

* auto format by CI

* fix

* done

* Update kernel_launch_op.cpp

* fix

* fix

* fix

* fix

* fix

* auto format by CI

* Update oneflow/ir/oneflow-extension/kernel_launch_op.cpp

Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>

* fix

* fix

* fix

* fix

* fix

* Update oneflow/ir/lib/OneFlow/Passes.cpp

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>

* fix

* fix

* fix

Co-authored-by: jackalcooper <jackalcooper@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>

* interpolate api align (#9118)

* Fix masked select op bug (#9120)

* fix masked_select bug

* refine

* fix ci error

* align with pytorch RANK env (#9111)

* align with pytorch RANK env

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add oneflow hub (#9116)

* add OneflowHub feature, consistent with PyTorchHub

* add oneflow hub docs

* refine docs and add test

* refine

* refine

* refine

* fix comment

* auto format by CI

* skip unittest

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix where op data_type infer bug (#9121)

* fix where op data_type infer bug

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix like op infer dtype (#9127)

* elementwise.cuh remove template parameter tail (#9128)

* fix_global_tensor_detach_bug (#9134)

* fix_global_tensor_detach_bug

* fix test case

* Add deform_conv2d op (#9095)

* add new op

* add kernel

* add deform_conv

* add some test

* modify test

* modify format

* modify test

* fix the bug and add test

* Add error message

* modify kernel and add test

* adjust the format

* add global test

* Update python/oneflow/test/modules/test_deform_conv2d.py

* add doc and modify global test

* adjust OneFlowUserOps.td

* remove headfile and modify doc

* modify doc

* add docs at rst

* modify global test

* remove unnecessary code

* remove unnecessary code

* remove debug code

* initialize fields

* modify global test

* modify test

* modify test

* modify test

* auto format by CI

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Fix inplace mul 0size check bug (#9132)

* fix inplace mul 0-size tensor check bug

* code format

* revert

* Align round op to support round half to even (#9135)

* align round op

* add test

* modify doc, test and kernel

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
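
Round half to even (banker's rounding) maps ties to the nearest even integer, which is the behavior this op is being aligned to. Python's built-in round already follows this rule:

```python
# Python's built-in round rounds half to even, so ties go to the nearest
# even integer rather than always away from zero
samples = [round(v) for v in (0.5, 1.5, 2.5, -0.5, -1.5)]
```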

* rm dict in module apply (#9137)

* rm dict in module apply

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* one_embedding support broadcast table_ids (#9109)

* support broadcast table_ids

* address review

* fix like op infer dtype

* address review

* address review

* refine

* refine error message for framework (#9104)

* refine error msg for framework

* more error messages

* fix size_t comparison with zero

* check for incomplete error messages

* err msg for inconsistent placement

* modify acc. to review

* convert enum to string in error msg

* fix redundant error info; clean up

* refine error msg for consistency check

* auto format by CI

Co-authored-by: Yao Chi <later@usopp.net>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Fix loss scale precision (#9126)

* fix loss scale cast

* amp_white_identity

* revert debug log

* move constant like back

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one embedding eager (#8984)

* forward

* one_embedding eager

* fix one_embedding grad

* fix

* fix

* fix

* fix amp

* fix of_tidy

* ONEFLOW_ONE_EMBEDDING_FUSE_UPDATE_PUT default true

* merge master

* save shadow var

* get all ptr from embedding_state

* reuse update and put op/kernel

* mv id_shuffle to cuh

* refine

* refine

* refine

* refine

* refine

* refine

* one_embedding eager forward

* deterministic forward gen random

* merge master

* merge master

* merge master

* add table_ids in grad op

* test pass

* refine

* create lazy state in lazy mode

* optional learning_rate

* add attr in update

* refine

* refine

* refine

* refine

* fix adam and add adagrad attr

* refine

* refine

* refine

* refine

* refine

* address review

* refine name

* address review

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* module.to aligned with pytorch (#9083)

* module.to aligned with pytorch

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* fix to str

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix kwargs device bug

Signed-off-by: daquexian <daquexian566@gmail.com>

Signed-off-by: daquexian <daquexian566@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: binbinHan <han_binbin@163.com>

* eager global zero_grad update sbp from b to p (#8853)

* zero_grad b to p

Signed-off-by: daquexian <daquexian566@gmail.com>

* zero_grad b to p

Signed-off-by: daquexian <daquexian566@gmail.com>

* skip in lazy

Signed-off-by: daquexian <daquexian566@gmail.com>

* implement zero_grad in c++

Signed-off-by: daquexian <daquexian566@gmail.com>

* _zero_grad to _zero_grad_, skip boxing of lazy tensor

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* auto format by CI

* skip test in cpu only mode

Signed-off-by: daquexian <daquexian566@gmail.com>

Signed-off-by: daquexian <daquexian566@gmail.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support inplace scatter (#9016)

* refine scatter

* fix

* refine

* refine

* add atomicMul & refine

* refine

* Dev linalg cross (#8979)

* add linalg_cross in yaml

* add linalg cross

* fix

* refine broadcast

* add global test

* reformat

* refine and fix

* fix tidy

* add nansum (#9113)

* add nansum; works on cpu, fails on cuda

* implement nansum on cuda

* restore modification in preprocessor_internal.h

* register only for floating types

* remove kernel register for int types, and it works

* add whole reduce functor

* add backward func

* add export in __init__ and refine code

* refine code

* refine code, and register kernel

* add sbp

* just for debugging, cannot compile

* just for debugging, cannot compile

* use primitive to implement assign nan

* refine code

* add docs, remove useless op and functor

* remove useless kernel

* add docs, fix bug of primitive

* fix typo in global test

* refine code

* refine code

* refine code

* refine code

* auto format by CI

* Update binary_func.h

* Update binary_func.h

Co-authored-by: MARD1NO <359521840@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
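
nansum simply treats NaN entries as zero before reducing; a one-function sketch in plain Python (illustrative, mirroring the torch-style semantics):

```python
import math

def nansum(xs):
    # reduce with NaN entries treated as zero
    return sum(0.0 if math.isnan(v) else v for v in xs)

total = nansum([1.0, float("nan"), 2.5, float("nan")])
```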

* Feat eager global tensor indexing (#9138)

* test(TensorIndexing): add global basic indexing test

* format code

* feat(TensorIndexing): support eager global advance indexing

* test(TensorIndex): add global tensor indexing error message test

* format code

* feat(TensorIndexing): support global tensor combined indexing

* format code

* feat(TensorIndexing): eager global combined basic with advance indexing

* fix(TensorIndexing): fix global tensor write back bug

* remove useless code

* refine test and comment

* fix(TensorIndexing): remove an unnecessary slice_update

* add comment

* fix with static analysis

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add lr_scale for optimizers (#9008)

* add lr_scale for opt

* revert import

* set lr scale in pass

* add test

* lr_scale default value

* improve readability

* fix_ctc_loss_error_with_float_target_input (#9143)

* fix_ctc_loss_error_with_float_target_input

* minor fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Inplace masked fill (#9133)

* add inpalce masked_fill

* reformat

* refine

* auto format by CI

* refine according by comments of hbb

* export via cpp directly

* export oneflow.masked_fill_

* rename arg

* refine test case

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
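
The in-place masked_fill_ exported here overwrites the elements where the mask is true and leaves the rest untouched; a plain-Python sketch of the semantics (illustrative names, operating on a list instead of a tensor):

```python
def masked_fill_(values, mask, fill):
    # in-place: overwrite positions where mask is true, leave the rest alone
    for i, m in enumerate(mask):
        if m:
            values[i] = fill
    return values

xs = [1.0, 2.0, 3.0, 4.0]
masked_fill_(xs, [False, True, False, True], 0.0)
```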

* Fix numpy>=1.23.0 advance indexing code (#9139)

* test(TensorIndexing): fix numpy>=1.23.0

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* add_tensor_new_full_func (#9149)

* add_tensor_new_full_func

* auto format by CI

* add global test case

* fix error

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* As strided regist more dtype (#9150)

* as_strided register more kernel

* add test

* fix comment

* fix ci error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Auto Parallel (#8891)

* add auto_parallel code

add auto_parallel pass

* Feat ap remove hierarchy cast (#7919)

* feat(AutoParallel): support remove parallel_cast ops

* feat(AutoParallel): export enable_auto_parallel_prune_parallel_cast_ops

* format code

* Fix add conv grad cost (#7972)

* feat(Conv): add grad computation cost

* fix ConvDataGrad computation cost

* update conv grad cost

* refine

* Auto parallel/fast collector (#7958)

* Try to speed up sbp collector.
However, throughput drops

* Shrink the parallel candidates for the proxy node

* Print out some information and then refine

* Store the sbp set for each consumer

* Update binary set intersection

* Remove impossible parallel candidates from sbp proxy

* Refine binary set

* Add a Clear() in binary set

* Filter out those proxy candidates containing two
sbps from the same unique group

* refine

* Check spelling

* Clip useless edges

* AutoParallel mainstem algorithm add mutable_op_ctrl_edge (#8033)

* feat(AutoParallel): mainstem algorithm add mutable_op_ctrl_edge

* use if instead std::max

* fix(AutoParallel): fix pooling computation cost function bug (#8147)

* [WIP] Fix auto parallel dump uniform sbp bug (#8330)

* fix(AutoParallel): fix auto parallel dump uniform sbp bug

* refine source op judgement

* update auto_parallel config (#8356)

* Refactor dump nd sbp for auto parallel (#8353)

* fix(AutoParallel): fix auto parallel dump uniform sbp bug

* feat(AutoParallel): add inferface for op to dump nd_sbp to op_conf

* refactor(AutoParallel): refactor DumpNdSbpSignatureForOpConfFn

* rename Global to Singleton

* Refactor SbpEdge (#8684)

* refactor(AP): refactor SbpEdge

* Rename variables

* Add const for some functions

Co-authored-by: Yipeng Li <jamesonli1313@gmail.com>

* Refactor auto parallel sbp node (#8712)

* Rename

* Code clean up

* Code clean up

* Code clean up and package up

* Rename

* Add const for some functions

* Refactor auto parallel sbp graph (#8722)

* Code clean up

* Package up

* Code clean up and package up in SbpNode and SbpEdge

* Rename

* Rename

* Rename mainstem to trunk

* Typo, small bugs and rename

* Rename and of format

* Refactor auto parallel rest (#8731)

* Package up SbpCollector

* Add const for SbpGraph

* Add const for SbpNode

* Add const for SbpEdge

* Add const for SbpCollector

* Add const, rename, and package up for BinarySet

* Rename for BinarySet

* Rename for SbpCollector

* Rename for SbpCollector

* Rename for algorithm utils

* Fix a bug for an unused function AddEntries()

* Rename for BinarySet

* Rename for SbpConstructor

* Rename for BoxingCollector

* Add const for sbp utils

* fix merge conflict

* Remove template for sbp signature (#8787)

* Remove template for sbp signature

* Remove _H_ from cpp files

* Remove namespace specifier oneflow::

* Remove namespace specifier oneflow::

* Of format

* Move the inline functions to cpp files

* Can not add inline specifier?

* Update oneflow/core/auto_parallel/sbp_graph.h

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* Of format

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* Refactor auto parallel class object stuff (#8835)

* Delete copy/move constructor/operator

* Move the deconstructor of SbpEdge to the cpp file

* Equal by address for Sbp data structor

* Replace sbp_sig_list_ with sbp_sig_obj_list_

* Fix auto parallel copy cost infer2 (#8788)

* Check the output shape for operator in auto parallel

* Return infinity for different sbps while is_mutable

* Update oneflow/core/auto_parallel/sbp_constructor.cpp

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* Update oneflow/core/operator/operator.cpp

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* with output -> check output

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* Refactor prune identity as much as possible (#8849)

* Prune a line of parallel cast ops

* Avoid repeated pruning

* Code clean up

* Remove identity op

* Update oneflow/core/job_rewriter/auto_parallel.cpp

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* Fix auto parallel low throughput (#8876)

* Speed up after pruning identity

* Slight changes

* Refactor auto parallel final check (#8887)

* Of format

* Use const auto &

* Of format and rename

* Re-compute cost if steals sbp signatures

* Docs auto parallel doc (#8896)

* doc(AutoParallel): add auto parallel document framework

* docs(AutoParallel): add document

* fix typo

* refine document

* refine documentation

* Test alexnet for auto_parallel (#8917)

* test(AutoParallel): test alexnet for auto_parallel

* test(AutoParallel): test model add auto_parallel config

* Fix get sbp bug (#8939)

* Fix the bug of missing sbp for uniform op

* Speed up

* Add the missing sbp for optional input UserSourceOpTickInput

* Remove the repeated all-B sbp signature

* Add sbp for undefined UserSourceOpTickInput

* Resolve conflicts while merging master

* Recompute cost with time shape (#9009)

* Address comments

* fix merge conflict

* Address comments

* Disabled ZeRO when enabled AutoParallel (#9087)

fix(AutoParallel): disabled ZeRO when enabled AutoParallel

* Update oneflow/core/job_rewriter/optimizer_placement_optimization_pass.cpp

* Address comments

* Address comment.
GetComputationCostFn -> GetComputationCost

* Update oneflow/core/job_rewriter/auto_parallel.cpp

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>

* New interface for pr#9018

* Static analysis

* Fix ones like sbp bug and fix test import error in CI (#9123)

fix(AutoParallel): skip 1n1d sbp agreement check

* auto format by CI

* test(AutoParallel): skip acc check

* Address comments

* rename source op set nd_sbp function and add check

* fix typo

* Feat full auto parallel (#9140)

* Use B for inplace op and remove the check for sbp
while turning the auto parallelism on

* Slight change

* Not using B as the constraint

* Address comments

* add debug log for non-deleted cast ops

* update prune parallel cast op log

* rename auto_parallel_prune_parallel_cast_ops to enable_auto_parallel_ignore_user_sbp_config

Co-authored-by: wyg1997 <wangyinggang@foxmail.com>
Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* refine oneflow op infer dtype error message (#9155)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix to_global PyArg_ParseTupleAndKeywords (#9158)

* Fix tensor local_to_global parse keywords

* use PyObject

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Implement exponential_ and multinomial (#9073)

* add exponential distribution cpu kernel

* add exponential distribution cuda kernel and local tests

* refine test

* fix bug

* auto format by CI

* auto format by CI

* implement multinomial functor and cpu kernel

* auto format by CI

* add multinomial cuda kernel

* auto format by CI

* refine

* add multinomial tests

* auto format by CI

* add categorical distribution module and docs

* refine

* refine

* refine doc

* refine

* refine

* revert

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Disable IB when there are no active IB devices (#9115)

* fix lru_cache offset (#9162)

fix lru_cache offset for values larger than uint32

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Rename cast to global and cast from global (#9151)

* rename_cast_to_global_and_cast_from_global

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Refine datatype error message part2 (#9168)

* refine more ops dtype infer error message

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support tensor.triu_ (#9159)

* support tensor.triu_

* Update tensor_functions.cpp

* tensor.copy_ support stride (#9142)

* tensor.copy_ support stride

* add test case

* PersistentTable add read_only flag (#9145)

* read only

* fix

* avg_pool_nd support half (#9170)

* avg_pool_nd support half

* refine

* refine

* fix new_ones size parameter (#9161)

* fix new_ones size parameter

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* hot-fix (#9191)

* hot-fix

* refine

* skip env var check and calculate local rank if not given (#9183)

* skip env var check

Signed-off-by: daquexian <daquexian566@gmail.com>

* calc local rank if needed

* No warning for absent LOCAL_RANK

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Signed-off-by: daquexian <daquexian566@gmail.com>
Co-authored-by: Yu OuYang <xuanjiuye@gmail.com>
Co-authored-by: clackhan <han_binbin@163.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>

* set to_contiguous to amp clear list (#9171)

* add tensor.nansum (#9182)

* Add slight cost for different sbp in 1 device (#9172)

* Add slight cost for different sbp in 1 device

* Print to INFO

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* refine_to_contiguous_dtype_register (#9196)

* refine_to_contiguous_dtype_register

* add test case

* pool_nd_ops register gray list

* skip autocast for non-user op (#9199)

* `copy_` support numpy fp16 (#9189)

* copy_ support numpy fp16

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Signed-off-by: daquexian <daquexian566@gmail.com>
Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix matmul 0 size input error (#9147)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat functional scalar tensor parameter (#9190)

* add ScalarTensor check and unpack, bug has link error

* refine scalar tensor item function

* feat(functional): functional support ScalarTensor transfer to Scalar automatically

* feat(functional): support ScalarTensor transfer to Scalar

* change auto transfer rule

* test(Functional): add functional scalar tensor param test

* format code

* refine GetItemInScalarTensor function

* Fix broadcast fmod grad (#8865)

* impl trunc divide

* fix broadcast fmod grad

* trunc_div grad, scalar_trunc_div, and primitive

* format

* gradient_func

* add test

* rename

* compatible with older versions of torch

* resolve warning

* test global

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
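
The trunc-divide identity behind the fmod gradient fix: fmod(a, b) = a - b * trunc(a / b), and since trunc(a/b) is piecewise constant, d/da = 1 and d/db = -trunc(a/b) almost everywhere. A numeric check in plain Python (illustrative only):

```python
import math

def fmod_grads(a, b):
    # fmod(a, b) = a - b * trunc(a / b); trunc(a / b) is piecewise constant,
    # so d/da = 1 and d/db = -trunc(a / b) almost everywhere
    return 1.0, -float(math.trunc(a / b))

# check the d/db formula against a central finite difference
a, b, eps = 7.3, 2.0, 1e-6
num_db = (math.fmod(a, b + eps) - math.fmod(a, b - eps)) / (2 * eps)
_, ana_db = fmod_grads(a, b)
```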

* Feat straighten compress memory (#9094)

* An initial implementation of the linear programming primal matrix

* Coding for the revised simplex method

* Finish coding for the phase 1

* Fix bug.
Now we can get a correct x for the initial basic feasible solution

* Drive the artificial variables out in phase 1

* Bland's rule and bug fix

* Adjust the mapping between the basic variables and compact columns

* No columns removed while driving artificial variables out.
Terminate the code if a positive optimal cost is found in the auxiliary problem.

* Implement the phase 2 of the revised simplex method.
Remove columns of the inverse base matrix.

* Update is_solved status and original problem recovery.

* Rows and artificial columns activation

* An initial implementation of mix integer programming

* Try to assemble the original problem but fail due to the massive exclusion

* Steal initial position from current setting

* Compute the optimal cost from the compact relationship

* Move to a neighbor status and compute the cost

* Find the smallest cost and actually move to that status

* Check conflict after the adjustment.
Adaptive cost reduction

* Generate a compact position from nothing

* Straighten for memory

* Update the offset

* Add a demo for using the revised simplex method

* Remove the linear programming part

* Recompute the compact relationship after moving to a new status

* Rename

* Code clean up

* Set the tag for the straighten algorithm

* Code clean up

* An attempt to explore the dependency between consumer nodes of a register

* Revert "An attempt to explore the dependency between consumer nodes of a register"

This reverts commit f219851fb85943d07d28b84c45e5c4bae80872a0.

* Compute the lower bound and only execute
adjustment 2 for those cases with a possible reduction in memory

* Pre-compute and store the memory size for registers

* Use pre-stored total register num

* Limit the maximum iteration step

* Use VLOG(3) instead of std::cout

* Change interface

* Package up memory share strategy interfaces

* Address comments

* Address comments

* Of format

* Fix bug lower bound = 0

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add contains magic method (#9185)

* refine more ops dtype infer error message

* refine

* add tensor.__contains__ magic method

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
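
For illustration only, a minimal sketch of how a `__contains__` magic method makes Python's `in` operator work on a tensor-like object (the `MiniTensor` class is hypothetical, not OneFlow's actual implementation):

```python
class MiniTensor:
    def __init__(self, data):
        self.data = list(data)

    def __contains__(self, item):
        # `x in t` reduces an elementwise equality over the flattened values
        return any(v == item for v in self.data)
```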

* Build cuda 11.8 (#9204)

* export unsorted segment sum (#9206)

export unsorted_segment_sum python

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Optimize OneEmbedding Save Snapshot (#9112)

* init

* fix compile error

* refine

* Refine put logic

* todo lrucache logic

* refine dump logic

* finish

* add flag check

* Add env var

* fix

* fix a silly bug

* fix template args

* fix comment

* add template

* Refine comment

* remove

* fix bug

* fix compile error

* refine initial

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add Tensor.scatter_add & refine scatter (#9201)

add Tensor.scatter_add & refine scatter

* optimize layernorm need padding cols perf (#9195)

* optimize layernorm need padding cols perf

* auto format by CI

* reduce binary size

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support Inplace behavior in Type Promotion (#9200)

* support inplace

* refine

* add const

* refine

Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* Fix Broadcast Matmul check (#9213)

fix check

* Export MultiTensor Update and FuseUpdateCast to GraphConfig (#9209)

* export to graph config

* refine or

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of matmul dim check in `oneflow.bmm` (#9215)

* fix bug of matmul dim check

* refine code

* Update nn_functor.cpp

* Regist arange fp16 (#9202)

* arange op support cuda half

* add test

* format

* fix comment

* fix comment

* refine

* ci test error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix graph out argstree type judge (#9211)

* reproduce bug

* fix custom class type deal

* fix typo

* support ordereddict

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix ConcatFunctor error message (#9225)

* Check async errors after kernel launched (#9226)

Check errors after kernel launched

* Skip unnecessary passes (#9219)

* Skip unnecessary passes

* refine

* one_embedding fix typo (#9230)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [GetAsyncError] Add op name to error message (#9228)

GetAsyncError refine error message

* [JobBuildAndInferCtx]Remove an inefficient check (#9229)

Remove an inefficient check

* Fix linalg cross 0-size input error (#9232)

* Add silu to amp list (#9233)

* Disable CUDA virtual arch compilation (#9236)

* Support set/get_default_dtype interface (#9227)

* feat(DType): support set/get_default_dtype interface

* doc(*): fix set/get_default_dtype document

* doc(DType): refine document

* feat(oneflow.tensor): support infer dtype as get_default_dtype

* test(DType): add default dtype test

* refine throw error

* modify doctest because it will affect default dtype for other test

* fix(DType): make DefaultDType is global

* use default type in TensorWithDataCtorFunctor

* fix(DType): flow.Tensor support DefaultDType

* refine function name

Co-authored-by: jackalcooper <jackalcooper@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Enhance doctest error message (#9237)

* test(doctest): enhance doctest error message

* Update python/oneflow/test/modules/test_functional_docstr.py

Co-authored-by: Yao Chi <later@usopp.net>

* Update python/oneflow/test/modules/test_functional_docstr.py

Co-authored-by: Yao Chi <later@usopp.net>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Yao Chi <later@usopp.net>

* Feat: script to import oneflow as torch globally (#9160)

* feat: global `import torch as oneflow`

* use `console_scripts` to install oneflow-mock-torch to PATH

* close quote

* use os.makedirs to create temp torch directory

* rename to `oneflow-mock-torch`

* don't create temp files

* use positional argument with 2 choices

* add `mock torch test` in CI

* uncomment env setup

* default argument is enable

* fix docker exec

* refactor test script

* check successful recover

* don't run setup.py

* support submodule importing & display error message

* fix import * and import-from

* move mock_torch to oneflow dir; update test command

* fix error message

* update mock test (less strict)

* add more tests for torch imports

* modify export path

* mock_torch is a package

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
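
The mock-torch work above routes `import torch` to OneFlow via a custom meta path finder. A much simpler sketch of the same idea, publishing one module under another name through `sys.modules`, might look like this (`mock_module` and `fake_math` are illustrative names):

```python
import sys

def mock_module(alias: str, target_name: str):
    """Publish an existing module under a second name, so that
    `import alias` resolves to the target module instead."""
    target = __import__(target_name)
    sys.modules[alias] = target  # the import system checks sys.modules first
    return target

mock_module("fake_math", "math")
import fake_math  # now an alias for the stdlib math module
```

A real implementation, as in the PR, uses an `importlib` meta path finder so that submodule imports and `from ... import *` also resolve correctly.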

* add time and mem log tools (#9164)

* add time and mem log tools

* refine format

* auto format by CI

* address review

* auto format by CI

* log with json format

* rm useless

* refine log format

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support bool for `oneflow.nn.functional.pad` (#9234)

* support bool in functor and kernel, add unittest for int and bool

* refine unittest

* check value for bool tensor

* Feat: rand/randn support float16 kernel (#9238)

* feat(Op): rand/randn support float16 kernel

* add error message and refine code

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* reduce auto tick generation time (#9235)

* reduce time

* rm useless

* address review, refact structure

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* TensorIndexing support float16 (#9247)

* feat(TensorIndexing): support float16

* feat(TensorIndexing): support bfloat16

* skip bfloat16 test when cuda version less than 11000

Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>

* Add cudnn handle pool (#9243)

* add_cudnn_handle_queue

* deal normalization_kernel

* refine

* refine

* reslove comment

* minor fix

* refine

* auto format by CI

* fix static check

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Added error message for CUDA device incompatibility (#9250)

* Added error message for CUDA device incompatibility

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix autograd.Function memory leak (#9249)

* fix(AutogradFunction): fix memory leak

* add ptr check for AutogradState data

* test(AutogradFunction): ensure PyAutogradFunctionState released

* test(AutogradFunction): decrease memory

* register __dict__ function

* refine code

* fix state release test bug

* refine error message

* Feat speed up mem reuse (#9210)

* Use HashSet instead of vector

* O(n^3) -> O(n^2)

* Compute offset for memory-first algorithm only

* Remove explicit exclusion relationship

* Revert print out information

* Speed up exclusion judgement

* Switch HashMap to vector

* Code clean up

* life time -> lifetime

* mem_reused_regst: HashSet -> std::vector
regst_desc_id2regst_desc -> mem_chain2regst_desc_id2reuse_regst_desc

* Re-implement MemReusedAlgorithm_TimeLineAlgo
and comment out useless code

* Make allocate and free timeline local
and HashSet -> std::vector

* Eliminate a lot of Hash stuffs

* Revert "Eliminate a lot of Hash stuffs"

This reverts commit abfb86df57b13074cb50ca9dc080a1333cd46802.

* Important comment

* Address comments

* auto format by CI

* Remove magic number -1

* Address comment and rename

Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug: segfault when argmax has 0-size tensor as input (#9242)

* fix_half_check_of_reduce_mean (#9014)

* fix_half_check_of_reduce_mean

* refine

* Support float16 for initializer operators (#9253)

* feat(*): support float16 for initializer operators

* refine test

* Add half clamp (#9241)

* Register half

* register fp16 in clamp kernel, add check for fp16 in functor, update unittest for more dtype

* format code

* add macro WITH_CUDA

Co-authored-by: WangYi <buaawangyi03@gmail.com>
Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [CUDA]CheckVersionCompatibility (#9257)

* [CUDA]CheckVersionCompatibility

* Add CUDA 10.2

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat: monkeypatching pytorch (#9256)

* update custom meta path finder

* update test commands

* print warning if `torch` is already imported

* rename to `mock`

* update tests

* private attribute cannot be imported with import *

* split testcase

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* support destory_rdma (#9246)

* support destory_rdma

* refine

* auto format by CI

* refine

* refine

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* add bincount (#9156)

* add bincount

* add docs, use atomic add in cuda kernel, add unittest

* add minlength param, fix bug of memset in kernel

* refine code

* refine code

* convert to local when input is global, add global test

* auto format by CI

* refine code

* refine docstr, reduce doc length in one line

* register fp16, add tensor function and unittest

* add docs for tensor.bincount

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
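
A sketch of the semantics being added, in pure Python, with a `minlength` behavior assumed to match the commit description (this is an illustration, not the CUDA kernel):

```python
def bincount(xs, minlength=0):
    """Count occurrences of each non-negative int in xs."""
    if any(x < 0 for x in xs):
        raise ValueError("bincount requires non-negative inputs")
    size = max(max(xs) + 1 if xs else 0, minlength)
    out = [0] * size  # the kernel memsets this buffer before counting
    for x in xs:
        out[x] += 1   # the CUDA path uses an atomic add here
    return out
```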

* ONEFLOW_STREAM_ENABLE_H2D_STREAM (#9205)

* Modify generator.manual_seed to return generator rather than None (#9262)

generator.manual_seed return generator rather than None

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
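
The change above is a fluent-interface tweak. A sketch of the pattern with a hypothetical `Generator` class:

```python
import random

class Generator:
    """Minimal generator whose manual_seed returns self,
    enabling `g = Generator().manual_seed(0)` chaining."""
    def __init__(self):
        self._rng = random.Random()

    def manual_seed(self, seed):
        self._rng.seed(seed)
        return self  # returning the generator enables one-line chaining
```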

* Dev add tensor bernoulli (#9261)

* add tensor.bernoulli

* add docs

* Update tensor.py

* Update tensor.py

* Update tensor.py

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Multi tensor update (#9252)

* fix multi_tensor_sgd segfault

* enable learning_rate_val to replace learning_rate Tensor

* support adam and adamw

* support epsilon for adam and adamw

Co-authored-by: songyicheng <int.rejoice@gmail.com>

* fix a typo in readme (#9268)

* support nested asyncs.thread (#9270)

* OneEmbedding add smart decay sparse adam (#9176)

* add sparse adam

* smart decay sparse adam

* address review

* fix

* mv smart_decay to one_embedding namespace

* upgrade clang-tidy used in ninja of_tidy (#9263)

upgrade clang-tidy in ninja of_tidy

Signed-off-by: daquexian <daquexian566@gmail.com>

Signed-off-by: daquexian <daquexian566@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/compile time count (#9245)

* add graph compile time count

* refine compile log

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix random_normal (#9274)

Co-authored-by: Juncheng <liujuncheng1022@gmail.com>

* Flip and upsample bilinear support fp16 (#9284)

* slice update cpu kernel multi_thread loop

* refine

* upsample bilinear and flip register fp16 cuda kernel

* fix comment

* revert

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix PruneAmpWhiteIdentityOpPass (#9276)

* fix

* fix dup del

* ref algorithm

* fix dup mut

* simple impl

* rm useless code

* fix

* fix typo

* fix typo

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support api flow.randn_like (#9283)

* support api flow.randn_like

* refine

* remove dry run, add sanitizers to ci (#8670)

* fix some data races in c++ api and SteadyVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* skip self copy in MutShapeView::ToShape

Signed-off-by: daquexian <daquexian566@gmail.com>

* remove dry run, add sanitizers to ci

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update gh action

* skip lit

Signed-off-by: daquexian <daquexian566@gmail.com>

* suppress ubsan error in llvm

Signed-off-by: daquexian <daquexian566@gmail.com>

* disable ubsan for now

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix ci path

Signed-off-by: daquexian <daquexian566@gmail.com>

* update test manylinux docker

Signed-off-by: daquexian <daquexian566@gmail.com>

* restore dry run rpc manager

Signed-off-by: daquexian <daquexian566@gmail.com>

* run tsan for 3 times

Signed-off-by: daquexian <daquexian566@gmail.com>

* do not find initializer order bug

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix merge conflict

Signed-off-by: daquexian <daquexian566@gmail.com>

* skip sanitizer test in cuda misc

Signed-off-by: daquexian <daquexian566@gmail.com>

* sleep

Signed-off-by: daquexian <daquexian566@gmail.com>

* suppress by __attribute__((no_sanitize_address))

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* revert suppression

* fix heap-use-after-free found by asan

* auto format by CI

* bash -c

Signed-off-by: daquexian <daquexian566@gmail.com>

Signed-off-by: daquexian <daquexian566@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: tsai <jackalcooper@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add build config for RTX 40xx GPUs (#9290)

* Bool support for triu (#9291)

* Refix PruneAmpWhiteIdentityOpPass (#9294)

fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix concat #8833 (#9275)

* fix concat #8833

* support multi-none-input

* test and global test

* auto format by CI

* format license

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* support half for masked_fill (#9292)

* Fix BatchNorm performance (#9298)

* slice update cpu kernel multi_thread loop (#9264)

* slice update cpu kernel multi_thread loop

* refine

* try to fix bug

* auto format by CI

* delete useless header file

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix inplace bug in `tensor.masked_fill_` (#9295)

* fix: bind tensor.masked_fill_ to inplace version, fix bug in unittest

* refine unittest

* fix_inplace_copy_bug (#9301)

* FusedMultiHeadAttentionInference (#9287)

* FusedMultiHeadAttentionInference

* auto format by CI

* cmake

* fix graph

* auto format by CI

* fix cmake for mlir

* rm duplicated install

* fix align

* support float

* support causal

* support causal

* test global property

* fix

* disable clang

* skip cpu test

* skip all tests

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: jackalcooper <jackalcooper@gmail.com>

* Fix compile warnings (#9302)

* Fix compile warnings

* fix

* Set the default value of CUDA_STATIC to OFF when CUDA version is greater than or equal to 11.8 (#9306)

* Reduce pass time cost (#9281)

* batch del in PrunePinnedIdentityOpPass

* add log

* fix and refine fuse add_n

* add new line

* avoid op graph create

* add op graph cost cnt and fix boxing log

* fix ndsbp csv str

* fix multi add same add_n

* auto format by CI

* rm debug log

* auto format by CI

* to const ref

* rm useless

* refine auto modifier

* rm useless

* hack to debug

* hack to debug

* hack to debug

* hack to debug

* hack to debug ci

* hack to debug ci

* fix test case env var

* fix env var set

* revert to const ref

* auto format by CI

* sync to make sure tensor are created

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Refactor get sbp signature (#9304)

* Add a GetSbpSignature with use parallel num
instead of parallel description

* Get sbp_sig_list for each dimension of hierarchy

* Add test script and print out information

* Remove parallel description in GetSbpSignature()

* Fix small bug

* Disable InferNdSbp for reshape op

* Revert "Add test script and print out information"

This reverts commit fdc7ee8558cab68aa9fa152cf1ba2a6dc2b4554e.

* Add hierarchy value

* Address comments

* parallel num -> hierarchy value for reshape op

* Static analysis

* refine

* Update user_op.cpp

* Update operator.cpp

* auto format by CI

* Revert Update operator.cpp

This commit reverts 64832e43196067d67f70094a8d35664a805a5891

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Fix type error for entering a single tensor using concat op (#9316)

* modify tensorprocessor

* remove blank line

* remove blank line

* modify CheckHasDifferentInputDType func

* Update oneflow/core/functional/tensor_processor.cpp

* auto format by CI

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add more sbp signature print functions for log and debug (#9293)

* debug code

* ReshapeOp::GetSBP use hierarchy dim instead of parallel_num

* comment debug log

* revert debug code

* auto format by CI

* rm NdSbpSignatureListAsString

* rm 1d sbp signature print functions

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Release/nightly cu118 (#9308)

* update action

* 116->118

* preserve 116

Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix different dtype in slice_update (#9331)

* fix(SliceUpdate): fix different dtype in slice_update

close #9330

* test(SliceUpdate): enhance test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix FlattenOp GetSbp (#9322)

* fix flatten GetSbp

* rm flatten op

* update group stat

* rm mlir test

* fix

* stricter check

* add reshape conversion

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Refactor ONEFLOW_MLIR_PREFER_NHWC to support more ops (#9335)

* use bn as gn

* hack gn as relu

* refine

* support concat

* ScalarDivOp

* fix

* move files

* refine

* fix bn

* try fix

* fix concat

* fix

* DRY

* refactor

* refactor

* fix

* workaround

* add baseclass

* rm hack

* auto format by CI

* minor refine

* refine

* add more

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* distributions.Categorical support logits not None (#9332)

* avoid extra gpu memory usage in flow.save (#9328)

* boxing to cpu first in flow.save

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Signed-off-by: daquexian <daquexian566@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Use primitive to replace Ndarray::BroadcastBinary (#9311)

* Use primitive to replace Ndarray::BroadcastBinary

* refine

* fix

* negative

* refine

* refine

* Block forward support modification (#9336)

* block forward support modification

* add test

* fix format

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add log sum exp api (#9333)

* add_log_sum_exp_api

* refine

* add logsumexp to tensor

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
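
A sketch of the numerics behind a `logsumexp` API: the stable max-shift identity `log(sum(exp(x))) = m + log(sum(exp(x - m)))` with `m = max(x)`. Illustrative Python, not the OneFlow kernel:

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) via the max-shift identity."""
    m = max(xs)
    if math.isinf(m):
        return m  # all -inf (or contains +inf): result equals the max
    return m + math.log(sum(math.exp(x - m) for x in xs))
```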

* Feat: isclose and allclose (#9280)

* add allclose op in tablegen

* add isclose & allclose op in functional layer

* use existing framework to implement `isclose`

* import isclose & allclose

* compose isclose and other op to form allclose in python

* typo

* add doc & test files

* add default arg

* curly braces around one stmt

* generate one random data, the other is perturbation

* update test

* comment for ndarray bin func

* add ref from torch
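
A sketch of the closeness predicate such ops typically implement (scalar version; the tolerance formula follows the common `|a - b| <= atol + rtol * |b|` convention, which may not match the PR's exact semantics):

```python
def isclose(a, b, rtol=1e-05, atol=1e-08):
    """Elementwise closeness: absolute tolerance plus relative tolerance."""
    return abs(a - b) <= atol + rtol * abs(b)

def allclose(xs, ys, rtol=1e-05, atol=1e-08):
    """allclose is just a reduction of isclose over matching shapes."""
    return len(xs) == len(ys) and all(
        isclose(a, b, rtol, atol) for a, b in zip(xs, ys))
```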

* Refactor random op with consistent data (#9299)

* refactor(RandomOp): refactor random op with consistent data

* test(RandomOp): add data consistent test

* fix(RandomSeed): fix parallel_num==1

* move normal functor to random_functor.cpp

* test(RandomOp): refine test

* add comment for random_seed getter function

* remove special judgement for 1n1d

* fix random_seed parallel_num==1

* fix cuda generator index bug

* fix test function name bug

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* bool tensor slice_update use masked_fill when possible (#9324)

* bool tensor slice_update use masked_fill when possible

* refine

* auto format by CI

* fix comment

* auto format by CI

* Update oneflow/api/python/framework/tensor_functions.cpp

Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>

* refine

* auto format by CI

* except partial sum test

* add todo

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>

* Move tensor apis to cpython (#9303)

* move tensor.is_floating_point to c++

* refine

* move tensor.split to c++

* move tensor.flip to c++

* auto format by CI

* Update oneflow/api/python/framework/tensor.cpp

Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>

* refactor flip

* refine

* auto format by CI

* fix free(): invalid pointer

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>

* Add gelu_tanh op and kernel (#9343)

* gelu_tanh

* rename GeluTanh -> FastGelu

* regulate constant and increase precision

* instantiate and reg backward

* reg grad fn

* address review

* address review

* format

* update test

* refine_test_maxpool2d_channel_last (#9344)

* refine

* auto format by CI

* add skip

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Refactor normal initializer (#9307)

* refactor(RandomOp): refactor random op with consistent data

* test(RandomOp): add data consistent test

* refactor(Initializer): refactor normal with oneflow kernel

* fix(RandomSeed): fix parallel_num==1

* test(initializer): add initializer data test

* format code

* move normal functor to random_functor.cpp

* test(RandomOp): refine test

* add trunc_normal and relax mean/std precision

* fix conflict

* fix merge conflict

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support fp16 in constant folding (#9337)

* support fp16

* format

* clean

* refine

* auto format by CI

* refine test

* clean

* refine

* refine

Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix exp overflow with minus max trick (#9353)
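
The minus-max trick referenced above, sketched for a plain softmax (illustrative, not the fixed kernel):

```python
import math

def softmax(xs):
    """Stable softmax: subtracting max(xs) leaves the result unchanged
    mathematically but keeps exp() from overflowing for large inputs."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]
```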

* Fix occasional bug in random_op data test (#9354)

fix(RandomOp): fix occasional bug in random_op data test

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Dev add gumbel softmax (#9208)

* regis gumbel_softmax

* add: gumbel_noise, attr-hard, next: log, one-hot, grad

* add(fail): exp_dist

* add: gumbel, grad on cpu, next: cuda

* add: cuda & test bug: Synchronize()

* add: docs, test_hard, test_grad

* add: format code

* fix: TmpSize

* fix: review

* format, try to add

* add: functor

* format & half of rand

* remove ops & kernels

* support half of argmax & dim_scatter

* fix review

* add gumbel softmax docs

* fix review

* remove gumbel_softmax_grad_functor

* remove grad in yaml

* fix: raise half no util error

* auto format by CI

* auto format by CI

* fix: make

* fix: static

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
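
A rough sketch of the Gumbel-softmax sampling the commits describe: add Gumbel(0, 1) noise `g = -log(-log(u))` to each logit, then take a temperature-scaled softmax. Illustrative Python, not the CUDA kernel:

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Soft sample from a categorical relaxation of the logits."""
    noisy = []
    for x in logits:
        u = rng.random()
        g = -math.log(-math.log(u + 1e-20) + 1e-20)  # Gumbel(0, 1) sample
        noisy.append((x + g) / tau)  # temperature-scaled perturbed logit
    m = max(noisy)  # minus-max trick for stability
    exps = [math.exp(v - m) for v in noisy]
    s = sum(exps)
    return [e / s for e in exps]
```

Lower `tau` pushes the output toward a one-hot vector; the hard variant in the PR additionally snaps to argmax in the forward pass.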

* Fix the inconsistent behavior of slice update (#9321)

* modify tensor_index.cpp

* modify

* support scalar tensor indexing

* support scalar

* modify tensor_util

* modify tensor_index

* add macro definition

* add support type

* refine getitemscalartensor

* Update oneflow/core/framework/tensor_util.cpp

* modify macro

* modify macro and test

* modify test

* modify function parameter

* modify tensor_index ("uint8" is regarded as "bool")

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* enable autocast for that op which has nocast arguments (#9362)

* fix autocast

* fix

* Add NHWC format for group norm (#9368)

* group

* nhwc

* test_case

* ir

* fix

* refine

* Enable ZeRO with auto parallel (#9288)

* Enable ZeRO with auto parallel in the first setting
and speed up

* Remove compute_cost parameter
from Initialization of copy cost

* Move the addition of wait time into sbp_node

* Remove transfer cost since it is merged into the GetTransferCost()

* Rename mainstem to trunk

* Update warning

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat unbalanced split nd sbp (#9310)

* Add a GetSbpSignature with use parallel num
instead of parallel description

* Get sbp_sig_list for each dimension of hierarchy

* Add test script and print out information

* Remove parallel description in GetSbpSignature()

* Fix small bug

* Disable InferNdSbp for reshape op

* Revert "Add test script and print out information"

This reverts commit fdc7ee8558cab68aa9fa152cf1ba2a6dc2b4554e.

* Use the same physical shape as eager did

* Remove the difference between eager and lazy for physical shape

* Update the filter

* Revert "Use the same physical shape as eager did"

This reverts commit f20e222327e21166d5b5325e37c3cbe9ca4f4ac6.

* Compute range for each rank

* Compute position for range

* Remove the difference between eager and lazy

* Allow unbalanced split for variables

* Add test script and print out information

* Pass 2d test cases

* Resolve conflict

* Can not merge some split

* Reduce in and out sbp simultaneously

* Speed up for 1d sbp
Package up the function for replacing hierarchy

* Reduced simultaneously with the same hierarchy

* Deal with 1to2d and 2to1d in InOutParallelDimReduce()

* Pass 1to2d and 2to1d test cases

* Remove the old code

* Revert "Add test script and print out information"

This reverts commit 58cdfb40b6536eb74c02174d3a69409676da374f.

* Add the check for split questionary back

* Feat speed up cost computation (#9355)

* Compilation speed up

* Speed up compilation for cost between 1d sbp

* fix comment typo

* Address comment

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add upsample_nearest_2d to amp clear list (#9366)

* fix cuda integral type closeness computation (#9346)

* fix cuda integral type computation

* remove include

Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add fused linear (#9369)

* Support fp16 on some cpu operators (#9374)

support fp16 cpu triu

* Scalar math kernels support inplace (#9372)

* Scalar math kernels support inplace

* type

* fix

* Optimize GroupNorm NHWC with FastDivmod (#9373)

* GradAcc Mem V5: Part 0-4 (#8961)

* default nccl use compute stream in grad acc

* rm sharable mem block graph

* half implement of LogicalChains

* part-0 : Logical Chain

* fix compile

* logical chain runnable

* fix bug of logical chain dp

* Part 1 : AfterGradAccChain

* fix bug of crush in acc chain infer

* AccCtrlTick Op/Task/Actor/Pass

* tmp

* AccCtrlTick runnable

* rename group boxing identity and model diff scale op name

* strict order by acc tick

* merge mem block by logical chain id group

* fix user op register

* fix GLOG error when no grad acc

* Inplace repeat variable

* Inplace repeat support consumed/produced ctrl regst

* Part-4: merge acc op into chain to reuse acc input memory (#9071)

LogicalChain can merge acc ops into the chain to reuse acc input memory

Measured: GPT memory usage matches part-3, and for BERT and T5 most memory usage is slightly lower than part-3

https://github.com/Oneflow-Inc/OneTeam/issues/1670#issuecomment-1240468576

* find the first source/sink op in the acc chain where ctrl edges can be inserted

* TryMergeAfterAccLogicalChainToFirstLogicalChain

* remove debug log

* rm old version repeat kernel

* fix format

* MergeChainByLogicalChainId/PhysicalTaskGraph

* IsValidChainId

* rm useless file

* remove note

* fix clang-tidy

* more IsValidChainId

* rm debug log

* rm note

* fix bug of cpu repeat inplace var

* fix bug of memory reuse for 0-size regst in time line algo

* fix bug of acc chain merge mem guard

* reuse cast to tick op

* fix bug where acc different-stream hint causes sync backward compute

* actor name log

* fix for review

* remove log

* fix note

* fix bug of connect to cast to tick op

* refine code for review

* fix for review

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix the bug of fill_tensor_ of support fp16 & autocast (#9375)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Allocate in instruction computation (#9282)

* allocate memory in InstructionPolicy::Compute

* remove unused methods of VirtualMachineEngine.

* backup code

* UnimplementedAllocator

* prepare allocators for each cpu stream.

* allocator for ccl stream

* init AllocateTensorInstructionPolicy::output_dependences_

* only sync current rank in oneflow._oneflow_internal.eager.Sync

* Update oneflow/core/vm/allocate_tensor_instruction_policy.cpp

Co-authored-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>

* Disable conv algorithm search in eager mode (#9…
1 parent fa49459 commit a4e67b0
Showing 866 changed files with 46,555 additions and 8,181 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/canary.yml
@@ -55,7 +55,7 @@ jobs:
- name: Checkout Oneflow-Inc/oneflow
if: ${{ github.event.inputs.oneflow-ref == '' }}
uses: actions/checkout@v2
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build manylinux
id: build-cuda
with:
2 changes: 1 addition & 1 deletion .github/workflows/on_merge.yml
@@ -15,6 +15,6 @@ jobs:
if: github.event.pull_request.merged == true
runs-on: ubuntu-latest
steps:
- uses: Oneflow-Inc/get-oneflow/update-benchmark-history@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/update-benchmark-history@support-cu118
name: Update benchmark history
timeout-minutes: 10
7 changes: 4 additions & 3 deletions .github/workflows/release.yml
@@ -33,7 +33,7 @@ jobs:
with:
ref: ${{ github.event.pull_request.head.sha }}
repository: ${{github.event.pull_request.head.repo.full_name}}
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/build@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/build@support-cu118
name: find cache
id: find-cache
timeout-minutes: 5
@@ -45,6 +45,7 @@
release
oneflow-src: ${{ env.ONEFLOW_SRC }}
entries: |
cu118
cu116
cu112
cu102
@@ -74,7 +75,7 @@ jobs:
python3 -m pip install -U setuptools wheel --user
python3 -m pip install oss2 --user
- uses: actions/checkout@v2
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build ${{ matrix.entry }}
if: ${{ matrix.entry !='cpu' }}
with:
@@ -97,7 +98,7 @@
3.8
3.9
3.10
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build ${{ matrix.entry }}
if: ${{ matrix.entry =='cpu' }}
with:
4 changes: 2 additions & 2 deletions .github/workflows/simple.yml
@@ -245,7 +245,7 @@ jobs:
repository: Oneflow-Inc/conda-env
ref: 30a7f00eb48ee9009d85a848e720823e5054c66b
path: conda-env
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build with gcc7
if: ${{ matrix.build-type == 'gcc7'}}
with:
@@ -254,7 +254,7 @@
oneflow-build-env: conda
conda-env-file: conda-env/dev/gcc7/environment-v2.yml
conda-env-name: oneflow-dev-gcc7-v2
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build with clang10
if: ${{ matrix.build-type == 'clang10'}}
with:
103 changes: 82 additions & 21 deletions .github/workflows/test.yml
@@ -15,7 +15,7 @@ env:
FLOW_VISION_SRC: flow_vision
FLOW_VISION_COMMIT: ca8ebc663b58667cf8cd1b6ef0c861522780b7bb
LIBAI_SRC: libai
LIBAI_COMMIT: 7d31d9781e5f2d559dc0820f599e0bed798488ca
LIBAI_COMMIT: 94eb85ff0131e8dfce953a3a916de7a4f897c647
ONEFLOW_FACE_SRC: oneflow_face
ONEFLOW_FACE_COMMIT: 110a97e8d5737a1f1856281a7df556a5ac8f06de
ONEFLOW_IREE_SRC: oneflow_iree
@@ -29,7 +29,7 @@ jobs:
runs-on: ubuntu-latest
if: github.event.pull_request.draft == false && github.base_ref == 'master' && contains(github.event.pull_request.requested_reviewers.*.login, 'oneflow-ci-bot')
steps:
- uses: Oneflow-Inc/get-oneflow/priority-pr@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/priority-pr@support-cu118
name: Check priority PR closed
id: save-cache
timeout-minutes: 5
@@ -163,7 +163,7 @@ jobs:
fi
echo "is_secrets_accessible=1" >> $GITHUB_ENV
- name: Wait for GPU slot
uses: Oneflow-Inc/get-oneflow/wait-for-gpu@support-iree-ci
uses: Oneflow-Inc/get-oneflow/wait-for-gpu@support-cu118
if: env.is_secrets_accessible == '1'
timeout-minutes: 90
continue-on-error: true
@@ -187,7 +187,7 @@
with:
ref: ${{ github.event.pull_request.head.sha }}
repository: ${{github.event.pull_request.head.repo.full_name}}
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/build@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/build@support-cu118
name: find cache
id: find-cache
timeout-minutes: 5
@@ -201,6 +201,8 @@
entries: |
cu102
cpu
cpu-asan-ubsan
cpu-tsan
llvm13
build-oneflow:
@@ -234,7 +236,7 @@
with:
ref: ${{ github.event.pull_request.head.sha }}
repository: ${{github.event.pull_request.head.repo.full_name}}
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-cu118
name: Save cache if successful
id: save-cache
timeout-minutes: 5
@@ -248,7 +250,7 @@
run: |
echo "::error file=test.yml,line=204,col=10::steps.save-cache.outputs.cache-hit != matrix.cache-hit"
exit 1
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build manylinux ${{ matrix.entry }}
id: build-cpu
if: ${{ matrix.entry =='cpu' && !matrix.cache-hit }}
@@ -270,7 +272,28 @@
python-versions: |
3.7
3.8
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build manylinux ${{ matrix.entry }}
id: build-cpu-sanitizers
if: ${{ (matrix.entry == 'cpu-asan-ubsan' || matrix.entry == 'cpu-tsan') && !matrix.cache-hit }}
with:
cmake-init-cache: ${{ env.ONEFLOW_SRC }}/cmake/caches/ci/${{ matrix.entry }}.cmake
build-script: ${{ env.ONEFLOW_SRC }}/ci/manylinux/build.sh
run-lit: false
oneflow-src: ${{ env.ONEFLOW_SRC }}
oneflow-build-env: manylinux
wheelhouse-dir: ${{ env.WHEELHOUSE_DIR }}
clear-wheelhouse-dir: true
self-hosted: ${{ contains(matrix.runs-on, 'self-hosted') }}
cuda-version: none
manylinux-cache-dir: ${{ env.MANYLINUX_CACHE_DIR }}
docker-run-use-system-http-proxy: false
docker-run-use-lld: true
retry-failed-build: true
clean-ccache: ${{ contains(github.event.pull_request.labels.*.name, 'need-clean-ccache') }}
python-versions: |
3.8
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build manylinux ${{ matrix.entry }}
id: build-cuda
if: ${{ matrix.entry =='cu102' && !matrix.cache-hit }}
@@ -290,7 +313,7 @@
clean-ccache: ${{ contains(github.event.pull_request.labels.*.name, 'need-clean-ccache') }}
python-versions: |
3.7
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
- uses: Oneflow-Inc/get-oneflow@support-cu118
name: Build ${{ matrix.entry }}
if: ${{ matrix.entry == 'llvm13' && !matrix.cache-hit }}
with:
@@ -329,7 +352,7 @@
})
- name: Upload packed liboneflow
if: ${{ !fromJson(matrix.cache-hit) && matrix.entry != 'llvm13' && matrix.entry != 'cu102_xla' }}
uses: Oneflow-Inc/get-oneflow/digest/upload@support-iree-ci
uses: Oneflow-Inc/get-oneflow/digest/upload@support-cu118
timeout-minutes: 10
with:
digest: ${{ steps.save-cache.outputs.build-digest }}
@@ -340,7 +363,7 @@
dst-dir: cpack
- name: Upload whl
if: ${{ !fromJson(matrix.cache-hit) && matrix.entry != 'llvm13' && matrix.entry != 'cu102_xla' }}
uses: Oneflow-Inc/get-oneflow/digest/upload@support-iree-ci
uses: Oneflow-Inc/get-oneflow/digest/upload@support-cu118
timeout-minutes: 10
with:
digest: ${{ steps.save-cache.outputs.build-digest }}
@@ -365,7 +388,7 @@
with:
ref: ${{ github.event.pull_request.head.sha }}
repository: ${{github.event.pull_request.head.repo.full_name}}
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/test@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/test@support-cu118
name: find cache
id: find-cache
timeout-minutes: 5
@@ -396,7 +419,7 @@
with:
ref: ${{ github.event.pull_request.head.sha }}
repository: ${{github.event.pull_request.head.repo.full_name}}
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/test@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/test@support-cu118
name: find cache
id: find-cache
timeout-minutes: 5
@@ -472,7 +495,7 @@
if: ${{ contains(matrix.runs-on, 'self-hosted') }}
run: |
docker rm -f ${{ env.TEST_CONTAINER_NAME }} || true
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-cu118
name: Save cache if successful
id: save-cache
timeout-minutes: 5
@@ -488,7 +511,7 @@
exit 1
- name: Download wheel and packed liboneflow
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') }}
uses: Oneflow-Inc/get-oneflow/digest/download@support-iree-ci
uses: Oneflow-Inc/get-oneflow/digest/download@support-cu118
id: download-digest
timeout-minutes: 10
with:
@@ -498,7 +521,7 @@
ssh-tank-path: ${{ env.SSH_TANK_PATH }}
- name: Get primary node
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') }}
uses: Oneflow-Inc/get-oneflow/master-address@support-iree-ci
uses: Oneflow-Inc/get-oneflow/master-address@support-cu118
id: get-primary-node
with:
rank: ${{ matrix.rank }}
@@ -631,7 +654,7 @@
TEST_CONTAINER_NAME: "pr-${{ github.event.pull_request.number }}-run-id-${{ github.run_id }}-${{ matrix.entry }}-test"
TEST_MANYLINUX_CONTAINER_NAME: "pr-${{ github.event.pull_request.number }}-run-id-${{ github.run_id }}-${{ matrix.entry }}-test-manylinux"
TEST_WITH_TF_IMG_TAG: registry.cn-beijing.aliyuncs.com/oneflow/test-with-tf-2.3.0:2f831e9354298a11447578e869d983959feb046f
TEST_MANYLINUX_IMG_TAG: registry.cn-beijing.aliyuncs.com/oneflow/manylinux2014_x86_64_cuda10.2:4fd9cc268bbe59c6245ca3941b8264fd256a8670
TEST_MANYLINUX_IMG_TAG: registry.cn-beijing.aliyuncs.com/oneflow/manylinux2014_x86_64_cuda10.2:190c92408855fe17ae664f2de1a9d6f484b2da2b
SSH_TANK_HOST: 192.168.1.13
SSH_TANK_PATH: /tank
METRICS_DIR: metrics
@@ -689,7 +712,7 @@
if: ${{ contains(matrix.runs-on, 'self-hosted') }}
run: |
docker rm -f ${{ env.TEST_MANYLINUX_CONTAINER_NAME }} || true
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-cu118
name: Save cache if successful
id: save-cache
timeout-minutes: 5
@@ -705,14 +728,34 @@
exit 1
- name: Download wheel and packed liboneflow
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') }}
uses: Oneflow-Inc/get-oneflow/digest/download@support-iree-ci
uses: Oneflow-Inc/get-oneflow/digest/download@support-cu118
id: download-digest
timeout-minutes: 10
with:
digest: ${{ steps.save-cache.outputs.build-digest }}
entry: ${{ matrix.compute-platform }}
ssh-tank-host: ${{ env.SSH_TANK_HOST }}
ssh-tank-path: ${{ env.SSH_TANK_PATH }}
- name: Download ASAN and UBSAN wheel and packed liboneflow
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') && matrix.device == 'cpu' }}
uses: Oneflow-Inc/get-oneflow/digest/download@support-cu118
id: asan-ubsan-download-digest
timeout-minutes: 10
with:
digest: ${{ steps.save-cache.outputs.build-digest }}
entry: cpu-asan-ubsan
ssh-tank-host: ${{ env.SSH_TANK_HOST }}
ssh-tank-path: ${{ env.SSH_TANK_PATH }}
- name: Download TSAN wheel and packed liboneflow
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') && matrix.device == 'cpu' }}
uses: Oneflow-Inc/get-oneflow/digest/download@support-cu118
id: tsan-download-digest
timeout-minutes: 10
with:
digest: ${{ steps.save-cache.outputs.build-digest }}
entry: cpu-tsan
ssh-tank-host: ${{ env.SSH_TANK_HOST }}
ssh-tank-path: ${{ env.SSH_TANK_PATH }}
- name: Enable TF container
if: ${{ fromJSON(matrix.is-single-client) }}
run: |
@@ -765,6 +808,11 @@
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') && !fromJson(matrix.is-xla) }}
run: |
unzip ${{ env.ONEFLOW_CPACK_PATH }}/liboneflow-ci-linux.zip
- name: Unzip packed sanitized liboneflow
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') && !fromJson(matrix.is-xla) && matrix.device == 'cpu' }}
run: |
unzip ${{ steps.asan-ubsan-download-digest.outputs.entry-dir }}/cpack/liboneflow-ci-linux.zip -d asan-ubsan
unzip ${{ steps.tsan-download-digest.outputs.entry-dir }}/cpack/liboneflow-ci-linux.zip -d tsan
- name: Start container
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') }}
working-directory: ${{ env.ONEFLOW_SRC }}
@@ -825,6 +873,13 @@
timeout-minutes: 20
run: |
docker exec -e ONEFLOW_SERVING_DEBUG=1 ${{ env.TEST_MANYLINUX_CONTAINER_NAME }} ./liboneflow-ci-linux/bin/oneflow_cpp_api_testexe --gtest_filter=-Api.embedding*
- name: Exe test (C++ API with sanitizers)
if: ${{ !fromJson(matrix.cache-hit) && matrix.test-type == 'misc' && matrix.device == 'cpu' }}
timeout-minutes: 10
run: |
docker exec -e UBSAN_OPTIONS=suppressions=.ubsan-suppressions -e ASAN_OPTIONS=strict_string_checks=1:detect_stack_use_after_return=1 -e LSAN_OPTIONS=suppressions=.lsan-suppressions ${{ env.TEST_MANYLINUX_CONTAINER_NAME }} ./asan-ubsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe --gtest_filter=Api.graph_\*
# Run 5 times to avoid false positives caused by occasional lack of stack info
docker exec -e TSAN_OPTIONS="history_size=7 suppressions=.tsan-suppressions" ${{ env.TEST_MANYLINUX_CONTAINER_NAME }} bash -c "./tsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe || ./tsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe || ./tsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe || ./tsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe || ./tsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe"
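The chained `||` invocations above rerun the TSAN test binary up to five times, succeeding if any run is clean. The same pattern can be written as a small retry helper; a minimal sketch, where `true` stands in for the instrumented binary (`./tsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe` in the step above):

```shell
# retry N CMD...: run CMD up to N times, stopping at the first success.
retry() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
  done
  return 1
}

# `true` is a stand-in for the TSAN-instrumented test executable.
retry 5 true && echo "test passed"
```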
- name: Test container
if: ${{ !fromJson(matrix.cache-hit) && contains(matrix.runs-on, 'self-hosted') }}
run: |
@@ -950,7 +1005,7 @@
timeout-minutes: 30
if: ${{ !fromJson(matrix.cache-hit) && matrix.test-type == 'misc' && matrix.device == 'cuda' }}
run: |
docker exec -e ONEFLOW_TEST_DEVICE_NUM=4 -w $PWD/${{ env.ONEFLOW_FACE_SRC }} ${{ env.TEST_CONTAINER_NAME }} python3 -m oneflow.distributed.launch --nproc_per_node 4 -m unittest -f tests/train/test_train.py
docker exec -e ONEFLOW_TEST_DEVICE_NUM=4 -w $PWD/${{ env.ONEFLOW_FACE_SRC }} ${{ env.TEST_CONTAINER_NAME }} python3 -m oneflow.distributed.launch --nproc_per_node 4 -m pytest tests/train/test_train.py
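This step switches the oneflow_face training test from `unittest -f` to `pytest` while keeping the 4-process `oneflow.distributed.launch` wrapper. As a rough sketch of what such a launcher does, N workers are started, each with its own rank and a shared world size passed through the environment (the variable names here are illustrative, not necessarily OneFlow's exact ones):

```shell
# Emulate a 4-rank single-node launch: run one worker per rank and
# hand each its RANK / WORLD_SIZE via the environment.
NPROC=4
out=$(
  rank=0
  while [ "$rank" -lt "$NPROC" ]; do
    RANK=$rank WORLD_SIZE=$NPROC sh -c 'echo "worker $RANK of $WORLD_SIZE"'
    rank=$((rank + 1))
  done
)
echo "$out"
```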
- name: oneflow_iree test
timeout-minutes: 45
if: ${{ !fromJson(matrix.cache-hit) && matrix.test-type == 'misc' }}
@@ -978,10 +1033,16 @@
if: ${{ !fromJson(matrix.cache-hit) && matrix.test-type == 'misc' }}
run: |
docker exec -e ONEFLOW_TEST_DIR=$PWD/python/oneflow/test/tensor ${{ env.TEST_CONTAINER_NAME }} bash ci/test/generic_test_multi_client.sh
- name: Test mocking torch by script
run: |
docker exec ${{ env.TEST_CONTAINER_NAME }} bash -x ci/test/test_mock_script.sh
- name: Test mocking torch by function
run: |
docker exec ${{ env.TEST_CONTAINER_NAME }} bash -x ci/test/test_mock_function.sh
- name: Benchmark Test
timeout-minutes: 100
if: ${{ !fromJson(matrix.cache-hit) && matrix.test-type == 'benchmark' && matrix.device == 'cuda' }}
uses: Oneflow-Inc/get-oneflow/pytest-benchmark@support-iree-ci
uses: Oneflow-Inc/get-oneflow/pytest-benchmark@support-cu118
with:
collect-path: ${{ env.FLOW_VISION_SRC }}/benchmark
container-name: ${{ env.TEST_CONTAINER_NAME }}
@@ -1043,7 +1104,7 @@
ref: ${{ github.event.pull_request.head.sha }}
repository: ${{github.event.pull_request.head.repo.full_name}}
fetch-depth: 0
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-iree-ci
- uses: Oneflow-Inc/get-oneflow/cache-complete@support-cu118
name: Save cache if successful
id: save-cache
timeout-minutes: 5
1 change: 1 addition & 0 deletions .lsan-suppressions
@@ -0,0 +1 @@
leak:CommandT
9 changes: 9 additions & 0 deletions .tsan-suppressions
@@ -0,0 +1,9 @@
# These four groups of functions are designed to be thread-unsafe;
# it's the user's responsibility to use them correctly.
race:ThreadUnsafe
race:thread_unsafe
race:flying_instruction_cnt
race:total_erased_instruction_cnt
race:ToShape
# glog
race:google::
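Each `race:` entry is matched as a substring pattern against the symbolized stack frames of a ThreadSanitizer report; matching reports are silenced. A sketch of how the CI wires such a file into a run, with the final binary invocation left as a comment since it requires a TSAN-instrumented executable:

```shell
# Write a minimal suppression file and export the options a
# TSAN-instrumented binary reads at startup.
cat > /tmp/tsan-suppressions <<'EOF'
# Functions designed to be thread-unsafe; callers must synchronize.
race:ThreadUnsafe
# Ignore races reported from inside glog.
race:google::
EOF

export TSAN_OPTIONS="history_size=7 suppressions=/tmp/tsan-suppressions"
# A real run would now execute the instrumented binary, e.g.:
#   ./tsan/liboneflow-ci-linux/bin/oneflow_cpp_api_testexe
echo "$TSAN_OPTIONS"
```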
2 changes: 2 additions & 0 deletions .ubsan-suppressions
@@ -0,0 +1,2 @@
# llvm
vptr:Class.cpp