Fully Memory Log V2 with more details #8565

Merged
merged 32 commits into from Jul 12, 2022

@chengtbf chengtbf commented Jul 4, 2022

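  • Provides a more detailed memory analysis log. For each tensor in a Chain (Chunk -> MemBlock) it now records the shape, dtype, lifetime, and allocation/free order, so you can quickly check whether the tensors that dominate each memory block look abnormal.
  • The Checkpointing pass now logs which tensors were checkpointed.
  • Shortened the prefix of system tensors so that their names are not overly long in the log.

Checkpointing log example: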

BERT

 In subgraph: 0 has checkpointing tensor num = 14
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.self_attention.dense.weight-125/out_0 ,shape: (768,768) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.self_attention.query_key_value.weight-123/out_0 ,shape: (2304,768) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: model.bert-identity-179/out_0 ,shape: (8,1,512,512) ,dtype: kBool ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.mlp.dense_4h_to_h.bias-132/out_0 ,shape: (768,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: model.bert-identity-178/out_0 ,shape: (512,8,768) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.self_attention.dense.bias-126/out_0 ,shape: (768,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.input_layernorm.bias-122/out_0 ,shape: (768,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.mlp.dense_4h_to_h.weight-131/out_0 ,shape: (768,4096) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.self_attention.query_key_value.bias-124/out_0 ,shape: (2304,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.input_layernorm.weight-121/out_0 ,shape: (768,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.post_attention_layernorm.weight-127/out_0 ,shape: (768,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.mlp.dense_h_to_4h.bias-130/out_0 ,shape: (4096,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.post_attention_layernorm.bias-128/out_0 ,shape: (768,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.mlp.dense_h_to_4h.weight-129/out_0 ,shape: (4096,768) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)

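For each Checkpointing subgraph, the log prints the tensors that are cached for the backward pass. Most of them are Variables; the only exceptions are the special identity ops, which are the module's input tensors. Here we can see that bert has two inputs (data and mask).

Based on this log, you can analyze whether any of the checkpointed tensors could be reused.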
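As a rough aid for that analysis, the checkpointing lines can be parsed and ranked by raw tensor size. This is only a sketch: the regular expression is written against the sample lines above, the helper name parse_checkpointing_line is made up for illustration, and the byte count is simply shape times element size (no alignment or padding is modeled).

```python
import re
from math import prod

# Bytes per element for the dtypes that appear in the logs of this PR.
DTYPE_BYTES = {"kFloat16": 2, "kFloat": 4, "kInt64": 8, "kBool": 1, "kChar": 1}

LINE_RE = re.compile(
    r"Checkpointing tensor: (?P<name>\S+) ,shape: \((?P<shape>[^)]*)\) ,dtype: (?P<dtype>\w+)"
)


def parse_checkpointing_line(line):
    """Return (name, shape, dtype, nbytes), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    shape = tuple(int(d) for d in m.group("shape").split(",") if d.strip())
    nbytes = prod(shape) * DTYPE_BYTES[m.group("dtype")]
    return m.group("name"), shape, m.group("dtype"), nbytes


# Three lines copied from the BERT example above.
sample = [
    'Checkpointing tensor: model.bert-identity-179/out_0 ,shape: (8,1,512,512) ,dtype: kBool ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)',
    'Checkpointing tensor: model.bert-identity-178/out_0 ,shape: (512,8,768) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)',
    'Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.mlp.dense_4h_to_h.bias-132/out_0 ,shape: (768,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)',
]

parsed = [p for p in map(parse_checkpointing_line, sample) if p is not None]

# The identity entries are the module inputs; everything else here is a Variable.
inputs = [p for p in parsed if "-identity-" in p[0]]

for name, shape, dtype, nbytes in sorted(parsed, key=lambda p: -p[3]):
    print(f"{nbytes:>10d} B  {dtype:<9} {shape}  {name}")
```

Sorting by size makes the two module inputs stand out from the much smaller bias and layernorm Variables.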

Detailed memory block log analysis

BERT

Summary

 Graph name GraphBase_0 in Rank: 0, Device: 0 needs to allocate [ 2909.38 MiB ] device memory. 
   In general, Chunk id: 0  memory is [ 1513.68 MiB ] with mem_block_num = 240
        Unreused memory not eager var is  [ 349.123 MiB ] with mem_block_num = 718
        Eager Variable Tensor total memory is [ 1046.58 MiB ] with mem_block_num = 331

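The summary reports the total memory the Graph needs and its three main components:

  1. Chunk memory (each Rank / Device has exactly one Chunk). A Chunk contains multiple MemBlocks, each MemBlock contains multiple tensors, and all of these tensors are eligible for memory reuse.
  2. Unreused mem: non-Variable tensors that own their memory exclusively (they cannot share memory, e.g. the memory held by Repeat and Acc ops).
  3. Eager Variable: the user-defined weights and the Optimizer state (such as adam's m and v).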
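The three components should add up to the total reported on the first summary line. A minimal check, assuming the summary format shown above (the regular expression is an illustration written against that sample, not a stable log contract):

```python
import re

summary = """\
Graph name GraphBase_0 in Rank: 0, Device: 0 needs to allocate [ 2909.38 MiB ] device memory.
   In general, Chunk id: 0  memory is [ 1513.68 MiB ] with mem_block_num = 240
        Unreused memory not eager var is  [ 349.123 MiB ] with mem_block_num = 718
        Eager Variable Tensor total memory is [ 1046.58 MiB ] with mem_block_num = 331
"""

# Pull out every "[ <number> MiB ]" figure in order of appearance.
sizes = [float(x) for x in re.findall(r"\[\s*([\d.]+)\s*MiB\s*\]", summary)]
total, chunk, unreused, eager_var = sizes

parts = chunk + unreused + eager_var
print(f"total = {total}, sum of parts = {parts:.3f}")

# Share of each component, to see where the memory goes.
for label, value in [("chunk", chunk), ("unreused", unreused), ("eager var", eager_var)]:
    print(f"{label:>10}: {value:8.2f}  ({100 * value / total:.1f}% of total)")
```

For this BERT run the parts sum to about 2909.383, which matches the reported 2909.38 up to rounding in the log.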

Chunk

A Chunk contains multiple MemBlocks whose memory does not overlap. With export GLOG_v=2 set, each MemBlock is printed in turn, from largest to smallest.

In Device: 0 Chunk id: 0 MemBlock id: 161 has num = 840 tensor with mem size = 990.274
In Device: 0 Chunk id: 0 MemBlock id: 89 has num = 1 tensor with mem size = 65.2739
In Device: 0 Chunk id: 0 MemBlock id: 209 has num = 1 tensor with mem size = 32.6369
...
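If the GLOG_v=2 output is saved to a file, the per-MemBlock lines can be collected with a small parser. This is a sketch written against the three sample lines above. The function name and dictionary keys are illustrative, and the size is kept as the bare number because these lines do not print a unit.

```python
import re

MEMBLOCK_RE = re.compile(
    r"Chunk id: (\d+) MemBlock id: (\d+) has num = (\d+) tensor with mem size = ([\d.]+)"
)


def parse_memblock_lines(lines):
    """Collect one record per MemBlock summary line; other lines are skipped."""
    blocks = []
    for line in lines:
        m = MEMBLOCK_RE.search(line)
        if m:
            chunk_id, block_id, num, size = m.groups()
            blocks.append(
                {
                    "chunk": int(chunk_id),
                    "block": int(block_id),
                    "num_tensors": int(num),
                    "size": float(size),
                }
            )
    return blocks


lines = [
    "In Device: 0 Chunk id: 0 MemBlock id: 161 has num = 840 tensor with mem size = 990.274",
    "In Device: 0 Chunk id: 0 MemBlock id: 89 has num = 1 tensor with mem size = 65.2739",
    "In Device: 0 Chunk id: 0 MemBlock id: 209 has num = 1 tensor with mem size = 32.6369",
]

blocks = parse_memblock_lines(lines)
largest = max(blocks, key=lambda b: b["size"])
print(f"largest MemBlock: id={largest['block']}, tensors={largest['num_tensors']}, size={largest['size']}")
```

In this excerpt MemBlock 161 holds 840 tensors and dominates the Chunk, so it is the natural one to inspect at GLOG_v=3.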

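Each MemBlock corresponds to one subgraph whose tensors can reuse memory. With export GLOG_v=3 set, the detailed tensor layout inside each MemBlock is printed here. Tensors are listed one by one in the execution order of the ops they belong to, with the following fields: order, name, size, duration (the lifetime, i.e. how many op executions pass between the tensor's allocation and its release), shape, dtype, allocate order (the op's position in the execution sequence), and free order.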

In Chunk id: 0, MemBlock id: 161 Order: 0 ,duration: 9 ,size: 0.00416 MiB, name: model.bert.embeddings-identity-11/out_0, shape: (1,512) ,dtype: kInt64 ,alloc_order: 0 ,free_order: 8
In Chunk id: 0, MemBlock id: 161 Order: 1 ,duration: 235 ,size: 0.000576 MiB, name: model-identity-243/out_0, shape: (8,) ,dtype: kInt64 ,alloc_order: 1 ,free_order: 235
In Chunk id: 0, MemBlock id: 161 Order: 2 ,duration: 20 ,size: 0.032832 MiB, name: model-identity-244/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 2 ,free_order: 21
In Chunk id: 0, MemBlock id: 161 Order: 3 ,duration: 632 ,size: 0.032832 MiB, name: model.bert-identity-8/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 3 ,free_order: 634
In Chunk id: 0, MemBlock id: 161 Order: 4 ,duration: 8 ,size: 0.032832 MiB, name: model-identity-245/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 4 ,free_order: 11
In Chunk id: 0, MemBlock id: 161 Order: 5 ,duration: 15 ,size: 0.032832 MiB, name: model.bert-identity-0/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 5 ,free_order: 19
In Chunk id: 0, MemBlock id: 161 Order: 6 ,duration: 628 ,size: 0.032832 MiB, name: model.bert-identity-7/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 6 ,free_order: 633
In Chunk id: 0, MemBlock id: 161 Order: 7 ,duration: 238 ,size: 32.637 MiB, name: model-identity-242/out_0, shape: (21248,768) ,dtype: kFloat16 ,alloc_order: 7 ,free_order: 244
In Chunk id: 0, MemBlock id: 161 Order: 8 ,duration: 625 ,size: 0.032832 MiB, name: model.bert.embeddings-expand-13/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 8 ,free_order: 632
In Chunk id: 0, MemBlock id: 161 Order: 9 ,duration: 8 ,size: 0.00416 MiB, name: model.cls_head.loss_func.lm_loss-scalar_logical_greater_equal-258/out_0, shape: (8,512) ,dtype: kBool ,alloc_order: 9 ,free_order: 16
In Chunk id: 0, MemBlock id: 161 Order: 10 ,duration: 15 ,size: 6.29152 MiB, name: model.bert.embeddings.tokentype_embeddings-gather-18/out_0, shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: 10 ,free_order: 24
In Chunk id: 0, MemBlock id: 161 Order: 11 ,duration: 222 ,size: 0.016448 MiB, name: model.cls_head.loss_func-cast-264/out_0, shape: (8,512) ,dtype: kFloat ,alloc_order: 11 ,free_order: 232
In Chunk id: 0, MemBlock id: 161 Order: 12 ,duration: 1 ,size: 0.032832 MiB, name: model.bert.extended_attn_mask-expand_dims-2/out_0,shape: (8,512,1) ,dtype: kInt64 ,alloc_order: inplaced ,free_order: inplaced
In Chunk id: 0, MemBlock id: 161 Order: 13 ,duration: 1 ,size: 0.032832 MiB, name: model.bert.extended_attn_mask-expand_dims-1/out_0,shape: (8,1,512) ,dtype: kInt64 ,alloc_order: inplaced ,free_order: inplaced
In Chunk id: 0, MemBlock id: 161 Order: 14 ,duration: 14 ,size: 6.29152 MiB, name: model.bert.embeddings.vocab_embeddings-gather-10/out_0,shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: 14 ,free_order: 27
In Chunk id: 0, MemBlock id: 161 Order: 15 ,duration: 6 ,size: 6.29152 MiB, name: model.bert.embeddings.position_embeddings-gather-15/out_0,shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: 15 ,free_order: 20
In Chunk id: 0, MemBlock id: 161 Order: 16 ,duration: 6 ,size: 0.032832 MiB, name: model.cls_head.loss_func.lm_loss-cast-259/out_0,shape: (8,512) ,dtype: kInt64 ,alloc_order: 16 ,free_order: 21
In Chunk id: 0, MemBlock id: 161 Order: 17 ,duration: 1 ,size: 0.016448 MiB, name: model.cls_head.loss_func-reshape-268/out_0,shape: (4096,) ,dtype: kFloat ,alloc_order: inplaced ,free_order: inplaced
In Chunk id: 0, MemBlock id: 161 Order: 18 ,duration: 1 ,size: 0.016448 MiB, name: model.cls_head.loss_func-reduce_sum-265/tmp_buffer_0,shape: (16384,) ,dtype: kChar ,alloc_order: 18 ,free_order: 18
In Chunk id: 0, MemBlock id: 161 Order: 19 ,duration: 5 ,size: 0.000512 MiB, name: model.cls_head.loss_func-reduce_sum-265/output_tensor_0,shape: () ,dtype: kFloat ,alloc_order: 18 ,free_order: 22
In Chunk id: 0, MemBlock id: 161 Order: 20 ,duration: 8 ,size: 16.7773 MiB, name: model.bert.extended_attn_mask-broadcast_mul-3/z_0,shape: (8,512,512) ,dtype: kInt64 ,alloc_order: 19 ,free_order: 26
In Chunk id: 0, MemBlock id: 161 Order: 21 ,duration: 1 ,size: 6.29152 MiB, name: model.bert.embeddings-add_n-16/out_0,shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: inplaced ,free_order: inplaced
In Chunk id: 0, MemBlock id: 161 Order: 22 ,duration: 216 ,size: 0.032832 MiB, name: model.cls_head.loss_func.lm_loss-broadcast_mul-260/z_0,shape: (8,512) ,dtype: kInt64 ,alloc_order: 21 ,free_order: 236
In Chunk id: 0, MemBlock id: 161 Order: 23 ,duration: 207 ,size: 0.000512 MiB, name: model.cls_head.loss_func-scalar_add-266/out_0,shape: () ,dtype: kFloat ,alloc_order: 22 ,free_order: 228
In Chunk id: 0, MemBlock id: 161 Order: 24 ,duration: 1 ,size: 16.7773 MiB, name: model.bert.extended_attn_mask-expand_dims-4/out_0,shape: (8,1,512,512) ,dtype: kInt64 ,alloc_order: inplaced ,free_order: inplaced
In Chunk id: 0, MemBlock id: 161 Order: 25 ,duration: 1 ,size: 6.29152 MiB, name: model.bert.embeddings-add_n-19/out_0

Simplified version:

0 ,duration: 9 ,size: 0.00416 MiB, name: model.bert.embeddings-identity-11/out_0, shape: (1,512) ,dtype: kInt64 ,alloc_order: 0 ,free_order: 8
1 ,duration: 235 ,size: 0.000576 MiB, name: model-identity-243/out_0, shape: (8,) ,dtype: kInt64 ,alloc_order: 1 ,free_order: 235
2 ,duration: 20 ,size: 0.032832 MiB, name: model-identity-244/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 2 ,free_order: 21
3 ,duration: 632 ,size: 0.032832 MiB, name: model.bert-identity-8/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 3 ,free_order: 634
4 ,duration: 8 ,size: 0.032832 MiB, name: model-identity-245/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 4 ,free_order: 11
5 ,duration: 15 ,size: 0.032832 MiB, name: model.bert-identity-0/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 5 ,free_order: 19
6 ,duration: 628 ,size: 0.032832 MiB, name: model.bert-identity-7/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 6 ,free_order: 633
7 ,duration: 238 ,size: 32.637 MiB, name: model-identity-242/out_0, shape: (21248,768) ,dtype: kFloat16 ,alloc_order: 7 ,free_order: 244
8 ,duration: 625 ,size: 0.032832 MiB, name: model.bert.embeddings-expand-13/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 8 ,free_order: 632
9 ,duration: 8 ,size: 0.00416 MiB, name: model.cls_head.loss_func.lm_loss-scalar_logical_greater_equal-258/out_0, shape: (8,512) ,dtype: kBool ,alloc_order: 9 ,free_order: 16
10 ,duration: 15 ,size: 6.29152 MiB, name: model.bert.embeddings.tokentype_embeddings-gather-18/out_0, shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: 10 ,free_order: 24
11 ,duration: 222 ,size: 0.016448 MiB, name: model.cls_head.loss_func-cast-264/out_0, shape: (8,512) ,dtype: kFloat ,alloc_order: 11 ,free_order: 232
12 ,duration: 1 ,size: 0.032832 MiB, name: model.bert.extended_attn_mask-expand_dims-2/out_0,shape: (8,512,1) ,dtype: kInt64 ,alloc_order: inplaced ,free_order: inplaced
13 ,duration: 1 ,size: 0.032832 MiB, name: model.bert.extended_attn_mask-expand_dims-1/out_0,shape: (8,1,512) ,dtype: kInt64 ,alloc_order: inplaced ,free_order: inplaced
14 ,duration: 14 ,size: 6.29152 MiB, name: model.bert.embeddings.vocab_embeddings-gather-10/out_0,shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: 14 ,free_order: 27
15 ,duration: 6 ,size: 6.29152 MiB, name: model.bert.embeddings.position_embeddings-gather-15/out_0,shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: 15 ,free_order: 20
16 ,duration: 6 ,size: 0.032832 MiB, name: model.cls_head.loss_func.lm_loss-cast-259/out_0,shape: (8,512) ,dtype: kInt64 ,alloc_order: 16 ,free_order: 21
17 ,duration: 1 ,size: 0.016448 MiB, name: model.cls_head.loss_func-reshape-268/out_0,shape: (4096,) ,dtype: kFloat ,alloc_order: inplaced ,free_order: inplaced
18 ,duration: 1 ,size: 0.016448 MiB, name: model.cls_head.loss_func-reduce_sum-265/tmp_buffer_0,shape: (16384,) ,dtype: kChar ,alloc_order: 18 ,free_order: 18
19 ,duration: 5 ,size: 0.000512 MiB, name: model.cls_head.loss_func-reduce_sum-265/output_tensor_0,shape: () ,dtype: kFloat ,alloc_order: 18 ,free_order: 22
20 ,duration: 8 ,size: 16.7773 MiB, name: model.bert.extended_attn_mask-broadcast_mul-3/z_0,shape: (8,512,512) ,dtype: kInt64 ,alloc_order: 19 ,free_order: 26
21 ,duration: 1 ,size: 6.29152 MiB, name: model.bert.embeddings-add_n-16/out_0,shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: inplaced ,free_order: inplaced
22 ,duration: 216 ,size: 0.032832 MiB, name: model.cls_head.loss_func.lm_loss-broadcast_mul-260/z_0,shape: (8,512) ,dtype: kInt64 ,alloc_order: 21 ,free_order: 236
23 ,duration: 207 ,size: 0.000512 MiB, name: model.cls_head.loss_func-scalar_add-266/out_0,shape: () ,dtype: kFloat ,alloc_order: 22 ,free_order: 228
24 ,duration: 1 ,size: 16.7773 MiB, name: model.bert.extended_attn_mask-expand_dims-4/out_0,shape: (8,1,512,512) ,dtype: kInt64 ,alloc_order: inplaced ,free_order: inplaced

In the listing above, the entry
7 ,duration: 238 ,size: 32.637 MiB, name: model-identity-242/out_0, shape: (21248,768) ,dtype: kFloat16 ,alloc_order: 7 ,free_order: 244

is a data input that is consumed by the backward pass, which is why it stays alive for so long.
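To find such large, long-lived tensors automatically, the GLOG_v=3 lines can be parsed and ranked by size times duration. The sketch below is written against the sample lines above and the field names in the dictionary are illustrative. It also checks an observation from this excerpt: for tensors that are not in-place, duration equals free_order - alloc_order + 1. That relation is read off the sample, not taken from OneFlow documentation.

```python
import re

TENSOR_RE = re.compile(
    r"Order: (?P<order>\d+) ,duration: (?P<duration>\d+) ,size: (?P<size>[\d.]+) MiB, "
    r"name: (?P<name>[^,]+),\s*shape: \((?P<shape>[^)]*)\) ,dtype: (?P<dtype>\w+) ,"
    r"alloc_order: (?P<alloc>\w+) ,free_order: (?P<free>\w+)"
)


def parse_tensor_line(line):
    """Parse one per-tensor line; returns a dict or None if the line does not match."""
    m = TENSOR_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["duration"] = int(rec["duration"])
    rec["size"] = float(rec["size"])
    # In-place tensors are logged with "inplaced" instead of numeric orders.
    rec["inplaced"] = rec["alloc"] == "inplaced"
    return rec


# Four lines copied from MemBlock 161 above.
lines = [
    "In Chunk id: 0, MemBlock id: 161 Order: 3 ,duration: 632 ,size: 0.032832 MiB, name: model.bert-identity-8/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 3 ,free_order: 634",
    "In Chunk id: 0, MemBlock id: 161 Order: 7 ,duration: 238 ,size: 32.637 MiB, name: model-identity-242/out_0, shape: (21248,768) ,dtype: kFloat16 ,alloc_order: 7 ,free_order: 244",
    "In Chunk id: 0, MemBlock id: 161 Order: 12 ,duration: 1 ,size: 0.032832 MiB, name: model.bert.extended_attn_mask-expand_dims-2/out_0,shape: (8,512,1) ,dtype: kInt64 ,alloc_order: inplaced ,free_order: inplaced",
    "In Chunk id: 0, MemBlock id: 161 Order: 20 ,duration: 8 ,size: 16.7773 MiB, name: model.bert.extended_attn_mask-broadcast_mul-3/z_0,shape: (8,512,512) ,dtype: kInt64 ,alloc_order: 19 ,free_order: 26",
]

records = [r for r in map(parse_tensor_line, lines) if r is not None]

# Observation from the excerpt: duration == free_order - alloc_order + 1.
for r in records:
    if not r["inplaced"]:
        assert r["duration"] == int(r["free"]) - int(r["alloc"]) + 1

# A tensor that is both large and long-lived pins the most memory over time.
ranked = sorted(records, key=lambda r: r["size"] * r["duration"], reverse=True)
for r in ranked:
    print(f"{r['size'] * r['duration']:10.2f}  {r['name']}")
```

Ranking by size times duration is just one heuristic; sorting by size alone or by duration alone are equally reasonable starting points.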

Unreused mem

Every unreused mem block is GPU memory owned exclusively by a single tensor:

In Device: 0 Memblock id: 247 Unreused  size: 32.6369 MiB, name: Sys-GradAcc-VarRepeat-model.bert.embeddings.vocab_embeddings.weight-24/out_0, shape: (21248,768) ,dtype: kFloat16
In Device: 0 Memblock id: 918 Unreused  size: 32.6369 MiB, name: Sys-GradAcc-VarAcc-model.bert.embeddings.vocab_embeddings.weight-out-cast_f2h/out_0, shape: (21248,768) ,dtype: kFloat16
In Device: 0 Memblock id: 737 Unreused  size: 6.29146 MiB, name: Sys-GradAcc-VarAcc-model.bert.encoders.4.mlp.dense_4h_to_h.weight-out-cast_f2h/out_0, shape: (768,4096) ,dtype: kFloat16

...

In Device: 0 Memblock id: 807 Unreused  size: 0.001536 MiB, name: Sys-GradAcc-VarAcc-model.bert.encoders.2.input_layernorm.weight-out-cast_f2h/out_0

Eager Variable

The model weights and the Optimizer state (e.g. adam's m and v), etc.

In Device: 0 Memblock id: 993 EagerVariable  size: 65.2739 MiB, name: model.bert.embeddings.vocab_embeddings.weight-m/out, shape: (21248,768) ,dtype: kFloat
In Device: 0 Memblock id: 246 EagerVariable  size: 65.2739 MiB, name: model.bert.embeddings.vocab_embeddings.weight/out, shape: (21248,768) ,dtype: kFloat
In Device: 0 Memblock id: 1108 EagerVariable  size: 12.5829 MiB, name: model.bert.encoders.3.mlp.dense_4h_to_h.weight-m/out, shape: (768,4096) ,dtype: kFloat
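The logged sizes can be cross-checked against shape and dtype. The sketch below recomputes the byte count for three entries from the logs above. One observation, stated as such: for these entries the logged figure matches bytes divided by 10^6 rather than by 2^20, even though the log labels it MiB. The element sizes are the standard ones for these dtypes; the function name is illustrative.

```python
from math import prod

# Bytes per element for the dtypes that appear in the logs of this PR.
DTYPE_BYTES = {"kFloat16": 2, "kFloat": 4, "kInt64": 8, "kBool": 1, "kChar": 1}


def tensor_bytes(shape, dtype):
    """Raw payload size: number of elements times element size."""
    return prod(shape) * DTYPE_BYTES[dtype]


# (name, shape, dtype, size as printed in the log)
rows = [
    ("model.bert.embeddings.vocab_embeddings.weight/out", (21248, 768), "kFloat", 65.2739),
    ("model.bert.encoders.3.mlp.dense_4h_to_h.weight-m/out", (768, 4096), "kFloat", 12.5829),
    ("Sys-GradAcc-VarRepeat-model.bert.embeddings.vocab_embeddings.weight-24/out_0", (21248, 768), "kFloat16", 32.6369),
]

for name, shape, dtype, logged in rows:
    nbytes = tensor_bytes(shape, dtype)
    print(
        f"{name}\n"
        f"  bytes={nbytes}  /1e6={nbytes / 1e6:.4f}  /2**20={nbytes / 2**20:.4f}  logged={logged}"
    )
```

This also shows why the fp32 weight and its adam m state each cost twice as much as the fp16 VarRepeat copy of the same shape.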

@chengtbf chengtbf added feature graph graph mode labels Jul 4, 2022

@strint strint left a comment

LGTM

github-actions bot commented Jul 6, 2022

Static analysis with clang failed. PR label automerge has been removed

@github-actions github-actions bot removed the automerge label Jul 6, 2022
github-actions bot commented Jul 6, 2022

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.4ms (= 12935.6ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.5ms (= 14246.5ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 142.5ms / 129.4ms)

OneFlow resnet50 time: 75.8ms (= 7576.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 82.5ms (= 8248.8ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.09 (= 82.5ms / 75.8ms)

OneFlow resnet50 time: 49.1ms (= 9816.2ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 57.8ms (= 11561.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.18 (= 57.8ms / 49.1ms)

OneFlow resnet50 time: 40.4ms (= 8078.1ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.8ms (= 8953.4ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.11 (= 44.8ms / 40.4ms)

OneFlow resnet50 time: 33.9ms (= 6786.9ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 35.0ms (= 6999.2ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.03 (= 35.0ms / 33.9ms)

OneFlow swin dataloader time: 0.263s (= 52.598s / 200, num_workers=1)
PyTorch swin dataloader time: 0.155s (= 30.954s / 200, num_workers=1)
Relative speed: 0.589 (= 0.155s / 0.263s)

OneFlow swin dataloader time: 0.080s (= 15.957s / 200, num_workers=4)
PyTorch swin dataloader time: 0.040s (= 8.024s / 200, num_workers=4)
Relative speed: 0.503 (= 0.040s / 0.080s)

OneFlow swin dataloader time: 0.065s (= 13.074s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.449s / 200, num_workers=8)
Relative speed: 0.340 (= 0.022s / 0.065s)

❌ OneFlow resnet50 time: 145.8ms (= 14580.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 169.6ms (= 16961.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 169.6ms / 145.8ms)

OneFlow resnet50 time: 95.2ms (= 9517.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 114.1ms (= 11408.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 114.1ms / 95.2ms)

OneFlow resnet50 time: 70.0ms (= 13992.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 92.8ms (= 18556.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 92.8ms / 70.0ms)

OneFlow resnet50 time: 57.7ms (= 11536.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 80.8ms (= 16158.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.40 (= 80.8ms / 57.7ms)

OneFlow resnet50 time: 53.1ms (= 10629.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.2ms (= 15048.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.42 (= 75.2ms / 53.1ms)
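For reading these bot reports: the relative speed is the PyTorch time divided by the OneFlow time, so values above 1 mean OneFlow was faster. A small sketch that recomputes a few of the figures above (the tuple layout and function name are illustrative, the numbers are copied from the report):

```python
def relative_speed(pytorch_time, oneflow_time):
    """Ratio used in the report: > 1 means OneFlow is faster than PyTorch."""
    return pytorch_time / oneflow_time


# (label, PyTorch time, OneFlow time, ratio printed by the bot)
samples = [
    ("resnet50 bs=16 (ms)", 142.5, 129.4, 1.10),
    ("resnet50 bs=8 (ms)", 82.5, 75.8, 1.09),
    ("resnet50 bs=4 (ms)", 57.8, 49.1, 1.18),
    ("swin dataloader, 1 worker (s)", 0.155, 0.263, 0.589),
]

for label, pytorch_time, oneflow_time, reported in samples:
    ratio = relative_speed(pytorch_time, oneflow_time)
    print(f"{label}: computed {ratio:.3f}, reported {reported}")
```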

@chengtbf chengtbf removed the request for review from oneflow-ci-bot July 6, 2022 14:29
@chengtbf chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 7, 2022 03:03
github-actions bot commented Jul 7, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8565/

@chengtbf chengtbf requested a review from BBuf as a code owner July 7, 2022 17:52
@@ -205,8 +205,8 @@ Maybe<Tensor> GradAccTryInsertUnpackAfterInput(
         << " the input tensor of nn.Graph will be unpacked by 0th dim into multiple micro-batches "
         << " and exec them in order.\n";

-  user_op::UserOpConfWrapperBuilder unpack_builder("System-GradientAccumulation-InputUnpack-"
-                                                   + input_conf.name() + "-" + NewUniqueId());
+  user_op::UserOpConfWrapperBuilder unpack_builder("Sys-GradAcc-InputUnpack-" + input_conf.name()
Contributor Author

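This PR shortens the name prefix of some system ops. Could that affect this call:

return generator->GenerateNamedRoundRobin("CPU_COMPUTE", cpu_device_num);

Is the memory for CPU_COMPUTE being corrupted here? @leaves-zwx

I ask because this exact same error has shown up twice: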

s.ssssssss.......ss.s.ssss.s....s....sssssssss.....sssss.......ssss..... [ 78%]
F20220707 23:08:52.707211  4036 stream_index_generator.cpp:40] Check failed: it->second.size == size (48 vs. 3667) CPU_COMPUTE
*** Check failure stack trace: ***
    @     0x7f25f7a4e3fa  google::LogMessage::Fail()
    @     0x7f25f7a4e6e2  google::LogMessage::SendToLog()
    @     0x7f25f7a4df67  google::LogMessage::Flush()
    @     0x7f25f7a50ad9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f25f0083b72  oneflow::StreamIndexGenerator::GenerateNamedRoundRobin()
    @     0x7f25f00a1002  oneflow::GenerateComputeTaskStreamIndex()
    @     0x7f25f00a118f  oneflow::TaskStreamIndexGetterRegistry::Dispatch()
    @     0x7f25f00a195c  oneflow::TaskStreamIndexManager::GetTaskStreamIndex()

https://github.com/Oneflow-Inc/oneflow/runs/7265216193?check_suite_focus=true
https://github.com/Oneflow-Inc/oneflow/runs/7242834096?check_suite_focus=true

But I could not reproduce it locally:

chengcheng@oneflow-21:~/debug/graph $ python3 test_tvm_frontend_dependency_on_graph.py 
.._TvmFrontedGraph_1_input.0.0_2
m.features.0.weight
m.features.0-conv2d-0
m.features.2-max_pool_2d-3
m.features.3.weight
m.features.3-conv2d-4
m.features.5-max_pool_2d-7
m.features.6.weight
m.features.6-conv2d-8
m.features.7-relu-10
m.features.8.weight
m.features.8-conv2d-11
m.features.9-relu-13
m.features.10.weight
m.features.10-conv2d-14
.
----------------------------------------------------------------------
Ran 3 tests in 2.012s

OK

Contributor

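This should be unrelated to the prefix change here. I have a PR that rewrote these ops with functional, so the op names all became repeat-xx, and CI had no problems with it.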

Contributor Author

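But the prefix is the only thing this PR changes. The rest is just extra logging, and no matter how that is changed it should not be able to trigger a stream index error. Yet the two CI failures have exactly the same stack and location: both are in test_tvm_frontend_dependency_on_graph, and both are it->second.size == size (48 vs. 3667) CPU_COMPUTE

I don't think this is a coincidence, but I can't figure out why it happens.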
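For readers following the thread, the check message it->second.size == size suggests that the generator remembers the size passed with a name on first use and insists on the same size afterwards. The sketch below models only that reading. NamedRoundRobin is a hypothetical stand-in written for illustration, not OneFlow's StreamIndexGenerator, and which of 48 and 3667 was the recorded value is an assumption.

```python
class NamedRoundRobin:
    """Hypothetical model of a named round-robin index generator.

    The first call for a name records its size; later calls with the same
    name must pass the same size, otherwise the check fails.
    """

    def __init__(self):
        self._state = {}  # name -> {"size": int, "next": int}

    def generate(self, name, size):
        state = self._state.setdefault(name, {"size": size, "next": 0})
        if state["size"] != size:
            raise RuntimeError(
                f"Check failed: recorded size == size ({state['size']} vs. {size}) {name}"
            )
        index = state["next"]
        state["next"] = (index + 1) % size
        return index


gen = NamedRoundRobin()
print([gen.generate("CPU_COMPUTE", 48) for _ in range(3)])

# A later call with a different size for the same name trips the check,
# which is the shape of the failure reported by CI.
try:
    gen.generate("CPU_COMPUTE", 3667)
except RuntimeError as err:
    print(err)
```

Under this reading, the question becomes why cpu_device_num would differ between two calls in the same process, which is consistent with the suspicion of corrupted or uninitialized state raised above.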

@github-actions

CI failed when running job: cpu-module. PR label automerge has been removed

@github-actions

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8565/

@chengtbf chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 12, 2022 08:47
@github-actions

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.4ms (= 12936.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.8ms (= 14284.1ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 142.8ms / 129.4ms)

OneFlow resnet50 time: 75.7ms (= 7573.4ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.1ms (= 8510.3ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.12 (= 85.1ms / 75.7ms)

OneFlow resnet50 time: 48.4ms (= 9672.3ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.5ms (= 11708.8ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.21 (= 58.5ms / 48.4ms)

OneFlow resnet50 time: 37.8ms (= 7568.5ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 42.7ms (= 8537.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.13 (= 42.7ms / 37.8ms)

OneFlow resnet50 time: 32.1ms (= 6426.2ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 35.8ms (= 7153.0ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.11 (= 35.8ms / 32.1ms)

OneFlow swin dataloader time: 0.253s (= 50.561s / 200, num_workers=1)
PyTorch swin dataloader time: 0.153s (= 30.503s / 200, num_workers=1)
Relative speed: 0.603 (= 0.153s / 0.253s)

OneFlow swin dataloader time: 0.109s (= 21.744s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.292s / 200, num_workers=4)
Relative speed: 0.381 (= 0.041s / 0.109s)

OneFlow swin dataloader time: 0.042s (= 8.490s / 200, num_workers=8)
PyTorch swin dataloader time: 0.023s (= 4.545s / 200, num_workers=8)
Relative speed: 0.535 (= 0.023s / 0.042s)

❌ OneFlow resnet50 time: 144.8ms (= 14476.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 168.1ms (= 16813.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 168.1ms / 144.8ms)

OneFlow resnet50 time: 94.3ms (= 9427.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 113.0ms (= 11303.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 113.0ms / 94.3ms)

OneFlow resnet50 time: 69.1ms (= 13822.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 99.2ms (= 19832.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.43 (= 99.2ms / 69.1ms)

OneFlow resnet50 time: 55.8ms (= 11164.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.1ms (= 14821.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 74.1ms / 55.8ms)

OneFlow resnet50 time: 52.3ms (= 10455.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.7ms (= 13746.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.31 (= 68.7ms / 52.3ms)

@github-actions

CI failed when running job: cuda-module. PR label automerge has been removed

@chengtbf chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 12, 2022 14:51
@github-actions

Speed stats:
GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.4ms (= 12938.3ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 143.7ms (= 14367.1ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 143.7ms / 129.4ms)

OneFlow resnet50 time: 75.9ms (= 7587.7ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 86.0ms (= 8598.4ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 86.0ms / 75.9ms)

OneFlow resnet50 time: 49.1ms (= 9828.5ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 62.5ms (= 12496.1ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.27 (= 62.5ms / 49.1ms)

OneFlow resnet50 time: 36.8ms (= 7360.2ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.9ms (= 8976.1ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.22 (= 44.9ms / 36.8ms)

OneFlow resnet50 time: 32.7ms (= 6544.8ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 40.7ms (= 8135.8ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.24 (= 40.7ms / 32.7ms)

OneFlow swin dataloader time: 0.256s (= 51.188s / 200, num_workers=1)
PyTorch swin dataloader time: 0.153s (= 30.593s / 200, num_workers=1)
Relative speed: 0.598 (= 0.153s / 0.256s)

OneFlow swin dataloader time: 0.075s (= 15.072s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.284s / 200, num_workers=4)
Relative speed: 0.550 (= 0.041s / 0.075s)

OneFlow swin dataloader time: 0.062s (= 12.354s / 200, num_workers=8)
PyTorch swin dataloader time: 0.021s (= 4.291s / 200, num_workers=8)
Relative speed: 0.347 (= 0.021s / 0.062s)

❌ OneFlow resnet50 time: 144.9ms (= 14487.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 168.8ms (= 16877.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 168.8ms / 144.9ms)

OneFlow resnet50 time: 95.7ms (= 9565.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 112.0ms (= 11196.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 112.0ms / 95.7ms)

OneFlow resnet50 time: 68.0ms (= 13601.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 99.0ms (= 19809.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.46 (= 99.0ms / 68.0ms)

OneFlow resnet50 time: 55.6ms (= 11123.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.7ms (= 14732.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 73.7ms / 55.6ms)

OneFlow resnet50 time: 50.8ms (= 10169.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.2ms (= 13835.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.36 (= 69.2ms / 50.8ms)

@mergify mergify bot merged commit 8ffab16 into master Jul 12, 2022
@mergify mergify bot deleted the dev_cc_mem_log_v2 branch July 12, 2022 22:50
Ikkyu321 added a commit to ZJLabDubhe/oneflow-zj that referenced this pull request Aug 23, 2022
* Multi Tensor apply Optimizer (#8373)

* Add optim_cast and modify sgd

* Remove

* try to add fuseUpdatecast pass logic

* use pass

* still have bug in inplace

* ban inplace and fix sgd update

* fix regst num

* add env var

* remove cuda graph wrong use

* add support for graph

* initialize

* add functional impl

* add simple job rewrite

* delete redundant sgd update kernel

* support half

* add kernel

* use single loop kernel

* refine

* when in eval mode, we turn off multi tensor update

* refine format

* use juncheng kernel

* Refine

* group multi tensor op by some attr

* add parallel conf to key

* refine

* Add unroll logic

* fix bug

* restruct

* use pointer list

* add adam kernel

* support multi tensor adam update

* Remove cpu

* support skip if and scale by tensor

* support sgd adam unittest

* add more check

* Remove config

* Restruct tensorparams

* support fused cast in multi tensor update

* support cast in multi tensor

* fix bug in model update cast pass

* fix multi tensor sgd update with cast Pass check logic

* refine

* support multi tensor adam update with cast

* refine format

* Remove redundant template args

* merge modify for fused cast

* only allow fused cast in train mode

* only support data parallel in multi tensor update

* rewrite fuse update cast pass logic

* remove redundant if

* fix format

* add new line

* rename

* Remove print

* rename and add LOG

* Add more type and test

* still have bug in multi tensor adam

* Fix multi tensor adam update bug

* add multi tensor adam update with cast test

* simplify code

* fix format

* Add model diff datatype in optimizer key

* remove random seed

* fix comment

* fix comment

* fix to use model copy

* use for loop

* Fix comment

* use hashcombine

* fix clang analysis error

* add with cuda macro

* fix env var in unittest

* remove redundant unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix doc and ops template auto gen (#8546)

* fix doc and add op calculator

* fix bug

* fix gen_ops

* fix diag 0size tensr shape infer bug (#8557)

* fix diag 0size tensr shape infer bug

* refine

* refine

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Format tensor on cpu (#8548)

* Format tensor on cpu

* use tensor.detach

* Remove useless WITH_CUDAs (#8562)

* unique identity (#8509)

* unique identity

* fix

* add identit name

* rm debug log

* mv identity form class to graph

* auto format by CI

* fix unique iden with having multiple stage

* auto format by CI

* Update block.py

Co-authored-by: cheng cheng <472491134@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add GenericStreamContext (#8560)

* Modify some file and add test (#8556)

* Modify some file and add test

* modify the content

* modify the format and test function name

* modify the format and aligned with pytorch

* delete print

* modity the function name

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Move some op into amp gray list (#8545)

enlarge gray list

Co-authored-by: cheng cheng <472491134@qq.com>

* Refine inplace expand runtime_error (#8561)

* Refine inplace expand runtime_error

* Opt

* Refine

* Add Note

* OneEmbedding use malloc async (#8543)

* in out ptrs

* ops and test

* test pass

* prefetch tmp buffer

* embedding shuffle tmp buffer

* gradient shuffle

* tmp buffer size

* mem pool

* cuda 11.2

* add id_shuffle to setNumunique in update tests

* default not use dynamic alloc

* fix of_tidy

* add fused op

* address review

* init tmp_buffer

* mv memset

* fix

* one_embedding fused_lookup_init_cast and fused_update_put (#8564)

* add fused op

* mv memset

* fix

* address review

* rm fullcache n_missing check

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix cpu aligned_alloc size (#8569)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add flow norm (#8535)

* add flow norm

* rm import

* rm doctest.testmod

* fix pad_packed_sequence method input requires_grad==True (#8574)

* fix pad_packed_sequence method input requires_grad==True

* fix append error when batch_first=True

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix embedding manager tmp buffer (#8585)

* fix embedding manager

* format

* fix reduce_ops 0size bug (#8551)

* fix reduce_ops 0size bug

* fix comment

* auto format by CI

* fix bug

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Align Momentum Optimizer (#8549)

* fix momentum update

* align momentum

* fix bug and finish eager unittest

* Support Graph optimizer

* fix momentum bug

* refine beta

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fill GetSbp bug and consistent test bug (#8576)

fix(FillOp): fill GetSbp bug and consistent test bug

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Dev Fully fused MLP Grad[OneEmbedding] (#8462)

* support fully fused mlp grad in eager

* support lazy backward

* fix output size

* add fallback to tmp_buf logic when ones buffer is not enough

* build sbp

* overlap allreduce

* fix overlap order

* fix format

* CUDA Graphs delayed capture

* Add ifcomm create for graph

* insert weight event roughly

* fix dbias allreduce error

* simplify code

* Add 11060 limit

* Remove print

* Rename

* fix fill bug and remove comm to cache

* Rename variable and add debug code for cache

* Use kernel state and fix bug

* remove print

* fix allreduce dbias bug

* fix header file

* fix comment

* remove redundant headerfile

* fix userops build error

* refine

* init nccl comm before execute kernel

* fix comment

Co-authored-by: liujuncheng <liujuncheng1022@gmail.com>

* rename mirrored to local (#8503)

* rename mirrored to local

* rename files

* rename files

* auto format by CI

* revert change of package_mirror.py

* rename LocalObject to Dependence

* rename fn LocalObject to Dependence

* merge master

* handle clang check

* fix

* refine

* rename local_object to dependence

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Implement BroadcastElementwiseUnary primitive (#8384)

* Add code skeleton for broadcast unary primitive

* first try

* finish impl

* finish impl

* format

* fix build error

* address review

* refine

* address review comments

* use broadcast unary primitive in fill_tensor_ kernel

* handle pack tail statically

* fix

* address review

* address review

* Fix SimplifyBroadcastDims

* fix

* revert fill_kernel

Co-authored-by: Juncheng <liujuncheng1022@gmail.com>

* skip cpu autotest for graph global (#8593)

* TODO

* skip cpu autotest for graph global

* Refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add function_library.h Exception (#8241)

* add RuntimeError for checking

* add RuntimeError to CHECK_EQ

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Refactor shrink (#8573)

* caching allocator

* auto format by CI

* Update ep_device_context.h

* EpDeviceCtx with CachingAllocator

* rm RawAllocator typename

* auto format by CI

* specific allo in EpDeviceCtx

* auto format by CI

* rm outdated alloc

* simplify thread safe guard

* auto format by CI

* avoid return mutex

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Speed up SliceKernel (#8589)

* perf(SliceKernel): decrease number of CUDA kernels and speed up

* perf(SliceKernel): use old kernel when small tensor is all fullslice

* use std::copy to copy contiguous memory

* fix cpu kernel bug

* Update readme and vsn for 0.8.0 (#8600)

* update version

* remove py3.6

* modify some file and improve error message (#8592)

* modify some file and improve error message

* modify scalar_by_tensor_op.cpp

* Update scalar_by_tensor_op.cpp

* Update slice_op.cpp

* Update test_slice_op.py

* Update test_slice_op.py

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* rename consistent to global (#8505)

* rename consistent to global

* rename consistent to global

* rename files

* rename files

* refine

* auto format by CI

* refine

* fix clang check

* fix

* fix

* fix

* rm to_consistent docs

* auto format by CI

* refine

* fix

* fix

* revert changes

* auto format by CI

* revert changes

* revert changes

* rename

* rename

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* add module related container docs (#8580)

* add module related container docs

* auto format by CI

* fix comment

* refine

* refine

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix rnn util extra memory usage when requires_grad=False (#8603)

* fix rnn util extra memory usage when requires_grad=False

* add comments

* refine comments

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* use bracket format slice in tensor str (#8489)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Perf TensorInfo constructor (#8606)

* perf(Autograd): perf TensorInfo constructor

* rename consistent to global

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* print operators' Python location when printing nn_graph (#8558)

1. add a flag named print_op_loc to nn.Graph.debug() for printing operator locations.
2. add a flag named only_print_user_code_loc to nn.Graph.debug() for printing only the user's code locations.

* Add randint like (#8598)

* add randint_like op

* add docs for random

* refine

* auto format by CI

* add randint_like global test

* refine doc

* refine randint_like docs

* fix bug

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add full_like api (#8595)

* add full_like_op api

* refine

* add test

* refine

* refine docs

* refine

* add consistent_full test

* add full_like op

* fix docs comment

* change scalar sbp return value from list to tuple

* auto format by CI

* merge conflict

* revert

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix cumsum GenBackwardOpConfFn (#8604)

* fix cumsum GenBackwardOpConfFn

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* revert change (#8613)

* fix test graph optimization conf CI bug (#8617)

* restore resource config after random tests

* refine

* refine

* Release pod tensor (#8552)

* ThreadLocalGuard

* split ReleaseTensor into ReleasePodTensor and ReleaseNonPodTensor.

* rename

Co-authored-by: luyang <flowingsun007@163.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add param group for optimizer (#8611)

* add add_param_group interface for Optimizer

* add test for add_param_group

* revert

* fix comment

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix broadcast_elementwise_binary cpu (#8625)

fix broadcast_elementwise_binary_cpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* align exception msg to torch (#8627)

* align exception msg to torch

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* skip unstable global test in CI, reduce failure rate (#8635)

* fuse embedding interaction (#8586)

* fuse embedding interaction

* fix of_tidy

* refine

* fix

* address review

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix flip gen backward opconf (#8605)

* fix flip gen backward opconf

* use new opconf api

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_SNAPSHOT_LOAD_MMAP_LOCKED (#8597)

* Add ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_SNAPSHOT_LOAD_MMAP_LOCKED

* refine

* use MAP_POPULATE

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Profiling main thread (#8601)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

Co-authored-by: binbinHan <han_binbin@163.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fully Memory Log V2 with more details (#8565)

* Fully Memory Log V2 with more details

* refine log and long op name

* fix clang tidy

* fix test

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>

* Stream policy (#8590)

* ThreadLocalGuard

* refactor signature of StreamType::InitDeviceCtx

* refactor hint

* add StreamPolicy

* remove DeviceCtx args

* refine OpCallInstructionUtil::Prepare & Compute

* merge EpDeviceCtx and LazyJobDeviceCtx into StreamPolicy

* minor fix

* minor fix

* del useless code

* fix error

* fix merge error

* fix segment fault bug

* fix compile error

* del methods belonging to subclass

* resolve comment

Co-authored-by: binbinHan <han_binbin@163.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add fully support for broadcast matmul (#6937)

* fix arange bug

* fully support broadcast matmul

* add more check

* remove check

* add fully sbp

* fix full sbp

* Fix broadcast matmul grad

* remove old broadcast matmul grad

* add broadcast grad back and when B numaxes is 2, we use broadcast_gradB instead of matmul+reduce

* add lazy backward

* Add restrict when transpose_a is false we can use bmatmul_grad_b

* revert

* fix broadcast matmul backward

* fix single client dispatch matmul logic

* revert old bcast matmul grad b kernel

* fix eager functional matmul backward

* add more test case

* remove redundant code

* add more special case

* when b num axes is 2, we only save tensor a

* fix annotation

* fix conflict and format

* remove single client matmul code

* Fix eval error

* fix conflict

* fix unittest

* Add init value

* support matrix vector matmul

* add vector matrix product

* Use matmul primitive to rewrite matrix vector product forward and backward

* Add fullllllllly support for vector matrix product

* Fix sbp

* fix bug

* add unittest

* Add consistent test for broadcast matmul

* Remove redundant code

* fix userops annotation

* fix

* refine

* Fix clang static analysis

* fix clang analysis

* set check graph as false

* fix

* fix for unittest

* fix broadcast sbp bug

* try to fix unittest

* Fix consistent test

* fix multiplier to 4 for unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert "skip cpu autotest for graph global" (#8608)

* Revert "skip cpu autotest for graph global (#8593)"

This reverts commit b076be782fd8f21e50ee4915f2d1562f3a9ab4c0.

* cherry pick from master

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* OneEmbedding add tmp_buffer allocator (#8588)

* fix embedding manager

* format

* refine embedding_manager tmp_buffer allocator

* fix

* format

* refine

* refine

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* refine error msg for some user ops (#8579)

* refine error msg for some user ops

* refine error msg for some user ops

* optimize

* optimize the writing

* optimize the writing

* optimize the writing

* auto format by CI

* optimize writing

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add tril fill value (#8655)

add tril fill value

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix_non_pod_data_allocate_bug (#8657)

Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix norm (#8629)

* fix norm

* add doc

* add bool &

* update math_functor.cpp

* add note

* fix_decorate_mem_leak_bug_in_eager_boxing (#8661)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add higher order derivative for leaky_relu and negative op (#8643)

* add higher derivative for leakyrelu and negative

* fix a typo

* remove functor

* add initialize alpha

* fix incorrect dim size in global test

* fix incorrect dim size in global test

* optimize testcase

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* update oneflow intro to show the difference (#8669)

* update oneflow intro

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine oneflow intro

* Stacked error (#8671)

* ThreadLocalGuard

* StackedError

* StackedError

Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>

* Refactor tensor initializer (#8626)

* fix(*): fix xavier_initializer

* refactor(Initializer): refactor initializer

* fix function name

* auto format by CI

* refine

* fix interface in tensor.py

* fix(trunc_normal_): fix init bug and add test

* auto format by CI

* fix bug

* add oneflow.nn.init.normal_ test

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Fix nn doc (#8650)

* fix hsplit doc

* add doc for module

* fix dtype

* fix formula

* add ref

* fix row length

* Fix reduce max min bool dtype bug (#8651)

* fix reduce_max_min_bool_dtype

* fix bug

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Remove redundant exception wrapper (#8631)

* remove redundant ExceptionWrapper

* refine KeyErrorMessage

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Refactor MemoryCase to eliminate determine statements of device_type (#7727)

* ref memory_case_util

* ref BlobObject::CheckMemCase

* ref mem_case using

* address review

* address review

* namespace memcase -> memory

* fix conflict

* address review

* address static analysis

* rm check

* cpu device_id is always 0

* fix conflict

* timeout-minutes: 50

* revert change

* increase thrd limit in container

* skip 2x2 TestEinsumConsistent

* skip failed case of distributed test

* auto format by CI

* fix_non_pod_data_allocate_bug

Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: tsai <jackalcooper@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: clackhan <han_binbin@163.com>

* fix some data races in c++ api and SteadyVector (#8654)

* fix some data races in c++ api and SteadyVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* skip self copy in MutShapeView::ToShape

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Fix sin/cos higher order derivative (#8648)

* fix(GradGrad): fix sin/cos higher order derivative

* fix(GradGrad): fix calculate error

* refine autograd global test

* auto format by CI

* refine sin/cos grad_grad calculate

* fix static analysis

* merge conflict

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Ping Zhu <58718936+REYGU@users.noreply.github.com>
Co-authored-by: Zhu, Ping <pingzhuu@outlook.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* refine_eager_boxing_to_adapt_ep (#8568)

* refine_eager_boxing_to_adapt_ep

* fix typo

* refine

* refine symmetric-acyclic-nd-sbp-to-nd-sbp

* refine

* fix error

* fix static check

* add NOLINT

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix repeat bug (#8645)

* make result contiguous

* add test case

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Instruction policy (#8583)

* ThreadLocalGuard

* vm::InstructionPolicy

* fix compile error (#8623)

* fix compile error

* change MirroredObject to Dependence

* Modify DependenceVector

* rm include stream type

* fix stream type

* auto format by CI

Co-authored-by: Yu OuYang <xuanjiuye@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* handle non-contiguous input (#8665)

* handle non-contiguous input

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* rename define CONSISTENT to GLOBAL (#8652)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Refine naive interpret (#8672)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuilder::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* explicit scalar initialization

Co-authored-by: clackhan <han_binbin@163.com>

* Rebuild Docs V0.8.0 (#8392)

* rebuild for 5 module

* fix bug

* fix for doctree and content in nn and

* fix

* fix

* fix

* add some

* fix for oneflow.rst

* update oneflow oneflow.nn

* update tensor

* update tensor module

* update

* test

* update

* update

* fix for undone desc

* docs: oneflow.utils.data (#8485)

* feat(utils.data): add oneflow.utils.data

* docs(dataloader): change the docstring of DataLoader

* docs(tensor): add methods to oneflow.Tensor document

* docs(optim): change docstring of optimizer and add a note to the document

* nn.graph

* fix for graph

* fix bug

* review nn and linalg document (#8515)

* docs(nn): add contents to oneflow.nn document

* docs(linalg): refactor oneflow.linalg document

* change attributes.rst and review nn.functional.rst (#8514)

* change attributes.rst and review nn.functional.rst

* reconstruction oneflow.cuda

* fix cuda and rebuild comm demo (#8582)

* update image

* add distributed

* oneembedding & refine graph

* update for distributed one_embedding

* fix rnn.py (#8616)

* Refactor oneflow.nn.init documentation (#8622)

docs(nn.init): refactor nn.init document

* docs(nn.init): remove the comments

* docs(utils.data): remove the comments

* update and fix bug

* docs(review): refine the documents (#8646)

* docs(review): refine oneflow, nn, Tensor, nn.init, linalg, utils.data, optim modules

* docs(optim): modify the code examples

* docs(tensor): edit note

* Refactor oneflow.autograd documentation (#8594)

* docs(autograd): refactor oneflow.autograd

* docs(autograd): edit "Default gradient layouts".

* docs(autograd): reedit "Default gradient layouts"

* docs(autograd): add comment

* docs(autograd): add reference

* update

* docs(tensor): change autoclass to autosummary

* update

* update

* add oneflow.linalg.diagonal (#8653)

* docs(linalg): add oneflow.linalg.diagonal

* update environment variable

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* update environment variable

* update for ev & distributed

* update distributed

* update ev

* update distribute desc

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* update

* Revise docstring descriptions (#8656)

* docs: move pytorch reference to end

* docs: add some docstring

* docs(refs): add refs

* Update docs/source/distributed.rst

* update for distributed details and environment_variable

* docs(docstring): Modify all reference links to version 1.10 (#8663)

* fix bug

* fix bug

* fix all warning

Co-authored-by: Guoliang Cheng <1876953310@qq.com>
Co-authored-by: liu xuan <85344642+laoliu97@users.noreply.github.com>
Co-authored-by: Guoliang Cheng <lmyybh_lazy@163.com>
Co-authored-by: laoliu97 <841637247@qq.com>
Co-authored-by: Yao Chi <later@usopp.net>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* Fix zeros like and ones_like api (#8632)

* fix zeros_like and ones_like bug

* refine

* revert

* refine

* fix tensor_slice_view infer physic_shape bug

* add test

* refine

* auto format by CI

* fix bug

* refine

* auto format by CI

* fix import error

* fix bug

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix sbp print bug (#8689)

* Add a normal priority with no transfer but different sbp

* Fix the bug for printing no boxing edge

* Do not use P for weights

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* eager_local_interpreter_with_infer_cache (#8619)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuilder::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* eager_local_interpreter_with_infer_cache

* remove useless code

* resolve comments

* refactor TensorMeta::TensorMeta(const TensorMeta)

* use small vector

* add kMaxNumDims

* fix error include

* fix split Symbol LocalTensorMeta error

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* move ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE to eager.h

* add blank line

* resolve comments

* minor fix

* refine

* explicit scalar initialization

* fix static check error

* auto format by CI

* of_format

* resolve comment

* refine

* refine

* refine

Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gelu nn.Module bug and support tanh mode. (#8693)

* add gelu2 api

* refine test

* refine docs

* refine

* restructure

* delete useless header file

* format

* rm doc of tensor.gelu (#8696)

Co-authored-by: Shanshan Zhong <62104945+zhongshsh@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix bug in CrossFeatureInteraction LazyBackward (#8677)

fix bug

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix floating-point scalar tensor in arange (#8673)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn functional fold (#8667)

* add fold

* update fold.py

* add test

* fix doc

* fix comment

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* modify some file and improve the error message (#8566)

* modify some file and improve the error message

* modify the content

* modify the content

* auto format by CI

* Update roi_align_op.cpp

* Update roi_align_op.cpp

* Update reshape_user_op_util.cpp

* auto format by CI

* Update roi_align_op.cpp

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* [OneEmbedding] add id_shuffle_copy_out (#8683)

add id_shuffle_copy_out

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix add_param_group step key not match error (#8698)

* fix add_param_group step key not match error

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add env ONEFLOW_EP_CUDA_DEVICE_FLAGS and ONEFLOW_EP_CUDA_STREAM_FLAGS (#8703)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix for docsv0.8 (#8710)

* fix repeat op 0-size related bug (both in FW and AD) (#8707)

* fix repeat op 0-size related bug (both in FW and AD)

* refine

* refine static check

* refine

* fix comment

* fix comment

* refine

* fix test

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support Dropout Scale in FusedMLPGrad[OneEmbedding] (#8633)

* support alpha list

* Remove redundant modify

* remove redundant alpha set

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix bug of Tensor.type (#8697)

* fix bug of tensor.type(flow.Tensor)

* fix bug of tensor.type(flow.Tensor) about device

* Fix tensor type doc (#8699)

fix doc of tensor.type

* add test for tensor.type(flow.Tensor)

* move PyTensorMetaCls_CheckExact to header file

Co-authored-by: Shanshan Zhong <62104945+zhongshsh@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* ONEFLOW_GRAPH_PLACE_TRAINING_STATE_ON_ALL_RANKS (#8706)

* ONEFLOW_GRAPH_PLACE_TRAINING_STATE_ON_ALL_RANKS

* auto format by CI

Co-authored-by: liujuncheng <liujuncheng1022@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx (#8709)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add qat conv modules (#8368)

* add qat conv modules

* add quantization related modules to doc

* refine qatconv modules doc

* add qat conv module tests

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add unsqueeze_multiple_op (#8714)

* add unsqueeze_multiple_op

* modify the format

* Update functional_api.yaml

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* modify broadcast_like_op.cpp and add test (#8720)

* modify broadcast_like_op.cpp and add test

* modify broadcast_like_op.cpp

* Update broadcast_like_op.cpp

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* JIT LR (#8500)

* add example code

* Update cosine_annealing_lr.py

* enable self params transformer

* enable pass ast to c++ api

* enable jit backend for lr

* enable jit global register and invoke

* convert Global to Singleton for new merge

* enable pybind11 walk on python ast

* enable test all existent get_lr of oneflow in python

* enable py_ast_wrapper pass ast from python to mlir

* switch all ast to ast-wrapper in mlir scope

* define python ast partially

* partial python ast definition

* trim asdl of python ast

* mlir gen

* add symbol table

* from ast to jit done

* switch llvm::errs() to mlir::emitError and convert switch to typeSwitch

* trim duplicate namespace use

* fix LIT header

* add some docs

* enable compare with or_else, if with return seamless in branch and mutable variable

* trim code and refine struct

* register pybind11 ast node for shared_ptr

* enable cpp class in python

* go through python to mlir to llvm to jit to run

* add addf subf op

* work well on stepLR linearLR exponentialLR cosineDecayLR cosineAnnealingLR constantLR

* enable maxf minf conversion to llvm ir

* rename LR_JIT to LRJITRegister

* remove LR_JIT_Engine and switch Invoke to std::function returned by lookup

* refine struct

* enable bisect_right and give the Python register API a dump option arg

* add bisect_left and bisect_transformer specially, delete former test python script

* remove c++17 standard

* restore double hash to iterator

* publish

* publish

* publish

* use llvm classof and typeswitch rightly

* trim

* commit

* commit

* commit

* commit

* commit

* commit

* auto format by CI

* Update ir.cpp

* Update OneFlowLRJITRegistry.h

* auto format by CI

* Update AstMlirGen.h

* Update lr_jit.cpp

* auto format by CI

* Naming conventions

* auto format by CI

* auto format by CI

* deploy _ behind

Co-authored-by: leaves-zwx <kunta0932@gmail.com>
Co-authored-by: yuhao <1171760467@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: yuhao <72971170+howin98@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add logspace (#8599)

* add logspace

* add global test

* restore rand

* fix doc

* rename consistent to global

* adjust import order

* add todo

* Add hann_window (#8615)

* add hann_window

* rm useless include

* add check

* adjust import order

* add ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE (#8730)

* add ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE

* add environment to vm.h

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix as strided bool type and view bug (#8713)

* fix as_stride bug

* refine

* refine

* refine

* delete useless header file

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add functional binary cross entropy (#8708)

* add gelu2 api

* refine test

* refine docs

* refine

* restructure

* delete useless header file

* format

* rm doc of tensor.gelu

* add functional binary cross entropy

Co-authored-by: BBuf <1182563586@qq.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support map_location in flow.load (#8666)

* support map_location in flow.load

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* fix tests

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix bug when map_location is None

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Add addcdiv (#8581)

* add addcdiv

* fix tensor_functions

* fix inplace

* add test number

* rename consistent to global

* Inner most dim case for cumsum cumprod op (#8403)

* cumsum use cub scansum in some case

* prod use cub scan

* refine name

* refine

* optimize cum op

* format

* fix

* get device properties by cuda stream class

* revert useless code

* refine

* outer dim use parallel sweep algo

* refine

* fix a fraction of threads hit __syncthreads

* revert

* refine kernel define

* refine

* refine

* refine

* refine

* move comment

* fix

* fix

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Define mut output dtype and mut output is dynamic in infer ctx (#8716)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* replace const DataType& with DataType

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Dev refactor fuse instruction policy (#8624)

* ThreadLocalGuard

* vm::InstructionPolicy

* refactor fuse instruction policy

* fix compile error (#8623)

* fix compile error

* change MirroredObject to Dependence

* Modify DependenceVector

* add instruction policy util

* add instruction policy util

* remove include

* add include

* rm fuse instruction type

* Modifying variable properties

* add stream_sequential_dependence_ to instruction_policy

Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of batchnorm num_batches_tracked global error when loading state_dict (#8723)

add condition for assign num_batches_tracked

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add launch master port limit (#8563)

* add launch master port limit

* Update python/oneflow/distributed/launch.py

Co-authored-by: daquexian <daquexian566@gmail.com>

Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix docs import distance (#8691)

* fix import distance

* add functional apis

* add smooth_l1_loss docs

* refine activation.py

* add deleted api

* review

* Add APIs missing from the docs of oneflow, nn, and other modules (#8704)

* docs: add api

* docs(nn): refactor nn

* review

Co-authored-by: Guoliang Cheng <lmyybh_lazy@163.com>
Co-authored-by: ChenQiaoling <48576019+Chenqll@users.noreply.github.com>

* refactor control stream type (#8647)

* refactor control stream type

* auto format by CI

* Add method implementation

* refine

* refine

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Define mut output tensor desc (#8717)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* define_mut_output_dtype_and_mut_output_tensor_desc

* replace const DataType& with DataType

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* fix merge error

* fix warning error

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Symbolic local tensor meta (#8662)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuilder::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* eager_local_interpreter_with_infer_cache

* remove useless code

* resolve comments

* refactor TensorMeta::TensorMeta(const TensorMeta)

* use small vector

* Symbolic LocalTensorMeta

* check shape in critical_sectio

* add kMaxNumDims

* fix error include

* fix split Symbol LocalTensorMeta error

* fix split cache and symbolic local tensor meta error

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* move ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE to eager.h

* add blank line

* resolve comments

* minor fix

* refine

* explicit scalar initialization

* fix static check error

* auto format by CI

* of_format

* resolve comment

* refine

* refine

* refine

* fix error

* define MutOutputShape and MutOutputStride in InferContext

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* fix static check error

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* define_mut_output_dtype_and_mut_output_tensor_desc

* replace const DataType& with DataType

* split const and mut func in LocalTensorMeta

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* fix merge error

* fix warning error

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* split MutTensorMeta and MutLocalTensorMeta

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

* resolve comment

* refine

* fix typo

Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* fix typo

* use OpArgsVector

Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* Feat general basic communication (#8437)

* Add a slight cost for B->S and B->P in 2d sbp

* Add penalty for P in consumer

* Fix a slight bug

* Add at most 1 middle node for general basic communication

* Add the cost for general basic communication

* Add the slight penalty for eager

* Skip initialization of boxing collector if not needed

* Fix a bug

* Dev nd nccl send recv boxing (#8467)

* nd nccl_send_recv_boxing

* rm print

* support num_axes > 2

* Add distributed optional run (#8372)

* Add

* change deps

* add install

* add skip

* autoprof supports bandwidth (#8367)

* autoprof supports bandwidth

Signed-off-by: daquexian <daquexian566@gmail.com>

* print bandwidth

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* remove tmp buffer of cumprod cpu backward kernel (#8369)

* remove tmp buffer of cumprod cpu backward kernel

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Move tensor api to cpython part3 (#8342)

* add tensor_functions

* concat py methods

* add hash, restore tensor.py

* check replacement

* refine code, remove commented tensor.py

* refine code

* move some api

* add cpu and cuda api

* add triu tril norm and etc.

* remove tensor_functions.h

* move more api

* move more api, refine size

* fix typo

* format code, remove useless include

* refine code

* refine code, fix typo

* align .cuda to python

* refine code

* split some api to part3 for review

* remove positional only arguments of argmax and argmin

* remove arguments parse

* modify arguments name in matmul and floor_divide

* rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions

* refine code, format code

* add inplace /=, add comments

* remove name in macros

* remove python api

* remove redundant include

* remove cout

* format code

* refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_

* remove redundant code

* auto format by CI

* fix typo, fix wrong call

* modify idx datatype from int32 to int64 in tensor.size

* add some DIRECT_PASS_FUNC

* add cpu cuda var pow and etc.

* add masked_fill any all

* make REDUCE_FUNC macro, add reduce_* functions

* add 0dim check in ReduceSumWhole, refine yaml

* fix bug

* restore add add_ sub sub_

* add unittest for tensor.half tensor.add tensor.add_

* refine code

* refine code

* fix typo

* fix bug of tensor.std()

* refactor var std and cuda, using c++ functional api

* add beta and threshold in softplus

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn_functor Check (#7910)

* add bias_add_check

* add bias_add error test

* fix conv2d nhwc bias_add error

* add nhwc conv test

* add bias_add_error test

* Add bias add error check

* Rename

* add batch matmul error check

* add matmul check error msg

* remove annotation

* add fused mlp error msg check

* Add pixel shuffle check test

* add more test until normalization add relu functor

* refine error message

* finish all nnfunctor check msg

* handle type error

* remove useless symbol

* modify back to TypeError

* fix all comment

* Remove redundant code

* Remove pad ndim check

* fix bias add space

* fix check logic cause ci gpu not always gpu:0

Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222)

* previous version for fused_matmul_bias_add_relu_dropout

* add op infer

* fix detail

* finish forward

* support dropout rate list

* add forward test

* fix bug for output buffer

* Configurable alpha params

* try to add bit mask logic

* Add bitmask first version!

* Add row col bitmask logic

* support not align4 reludropout

* simplify relu dropout ld logic

* Add naive relu dropout grad kernel

* add simple relu dropout grad kernel

* Rename

* support relu_dropout bitmask backward

* add vectorized optimization

* fix tmp buffer

* add to amp list

* add lazy backward logic

* Refine kernel

* add indextype dispatch

* simplify functor logic

* fix cublas fused mlp aux_ld shape bug

* Add more relu dropout kernel

* add full unittest

* fix bug in skip final activation

* refine

* Remove dump func

* fix format

* Remove cmake

* remove redundant divide

* add padded version

* fix dropout

* oneflow curand

* refine

* remove redundant kernel

* add unroll logic

* add unroll and ballot sync

* refine format

* Remove fast curand

* Refine python interface

* Add if branch for memset

* fix python logic

* just for debug

* not use matmul bias add grad

* add launch 1 block limit

* fix unittest

* Refine

* fix graph backward bug

* limit to 11060

* change to use int32_t dtype for cublas aux

* Fix jc comment

* fix comment

* fix convert

* fix static_analysis

* fix at

* fix userops td

* fix userops td

* fix const ref

* fix compile error for bfloat16

* limit to 11060

* fix bug

Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gather 0-dim tensor bug (#8376)

* fix 0-dim tensor bug

* refine

* support input 0-dim tensor for gather

* refine

* refine

* refine dim_scatter_kernel check

* refine

* refine check

* fix clang_tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add api to apply external job pass (#8370)

* Add condition to find-test-cache-distributed (#8387)

* add condition to find-test-cache-distributed

* fix

* wrap dim util (#8382)

* wrap dim util

* format

* use more maybe_wrap_dim

* refine array functor

* add more

* refine math_functor

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379)

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like

* refine

* fix static check error

* fix bug about index (#8388)

* fix bug about index

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* LogicalSliceAssign support full slice sbp (#8344)

* feat(SliceOp): slice ops support 2d sbp

* fix(SliceOp): fix [B, P] 2d sbp bug

* refine error message

* fix bug in parallel_num == 1

* add comment

* add warning and format

* add NOLINT for boxing check

* feat(LogicalSliceOps): support all nd_sbp

* feat(LogicalSlice): support nd_sbp

* add error message

* fix(AutoTest): fix auto_test bug in module.parameter pass

* auto format by CI

* fix(LogicalSliceAssign): skip test when 1n1d

* fix SliceParams memset error

* remove memset

* add CHECK_JUST

* fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT

* remove memset

* fix spilit_info.axis bug

* feat(LogicalSliceOps): support grad

* add logical_slice gradient_funcs

* feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp

* auto format by CI

* test(LogicalSlice): fix logical_slice dims

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix_tensor_from_numpy_mem_leak_bug (#8391)

* fix_tensor_from_numpy_mem_leak_bug

* add note

* refine note

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393)

* make of_pyext_obj static only

* refine note

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Adjust tolerance setting in embedding_renorm unit test (#8394)

* support front end compile for job to iree (#8249)

* support frontend dev version

* polish name

* add tosa-to-elf.mlir

* tosa to elf by llvm

* conv2d partial

* an enhanced frontend runner

* support numpy as input

* enable using multiple nn graphs with different inputs

* enable multiple input

* enable cpu and cuda

* change full_name to _full_name

* support exchange cuda with cpu seamlessly

* remove pip

* lit config

* polish

* trim

* auto format by CI

* modify

* auto format by CI

* last line polish

* use unittest

* auto format by CI

* use allclose

* auto format by CI

* polish

* optimize convert oneflow to tosa

* conv2d

* conv2d enhanced && conv2d examples add

* add road map

* add add_n2Op and broadcast_addOp conversion

* add matmulOp conversion

* support converting normalization op to tosa (partially)

* update roadmap

* support i64 tensor to dense elem attr

* support 100% resnet op conversion

* add test mlir

* add test iree resnet python script

* auto format by CI

* done

* enhance iree resnet test script

* auto format by CI

* rebuild code

* auto format by CI

* rebuild test script

* update

* auto format by CI

* pub

* trim test scripts

* move

* move

* input and output add block arg judgement

* emit error in variable conversion

* error handle for ci

* modify err info

* auto format by CI

* merge

* auto format by CI

* output not block

* flow ones

* rm const

* trim maybe

* trim maybe with header file

* const auto

* solve clangd error

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/zero mix with mp (#8036)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* zero use stage 2

* add limit consumer api

* add new api

* refine zero s select

* fix index out of range

* rm zero limit on device type

* zero test with activation checkpointing

* add identity when dp sequence len is 1

* move to base with master

* fix

* fix

* fix

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* split long func

* restore test

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert embedding normal path and fix amp list (#8374)

* revert embedding normal path, fix amp list

* fix amp

* fix memset bug in gather cpu kernel

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* replace fixed_vector with small_vector and make Shape inherit from it (#8365)

* Replace fixed_vector with llvm::SmallVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* Shape inherited from llvm::SmallVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* refine cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* rename fixed_vector to small_vector

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix reviews

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update Shape constructor

Signed-off-by: daquexian <daquexian566@gmail.com>

* add 'PUBLIC' keyword to all target_link_libraries

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* set is_initialized_ default to true

Signed-off-by: daquexian <daquexian566@gmail.com>

* override some methods to set is_initialized_

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Light plan for debug (#8396)

* Light plan for debug

* fix note

* disable terminfo to fix missing terminfo symbols (#8400)

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of ZeRO MP in complex case (#8404)

* Remove redundant output_lbns in ir (#8409)

* mv case

* remove redundant info

* Dev FusedCrossInteraction[OneEmbedding] (#8335)

* add simple fused cross interaction forward

* add packed fused

* Add cross interaction grad

* simplify code

* fix bug

* support crossnet v2

* support cross interaction v2

* add lazy backward

* Rename and add test

* fix jc comment

* fix comment

* fix bug

* fix userops td elem_cnt for FUSED Group

* fix header file

* fix clang static analysis

* fix unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add exe graph physical shape check msg (#8002)

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <daquexian566@gmail.com>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: chengtbf <472491134@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support ctor np array from of tensor (#7970)

* support ctor np array from of tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename interface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advance indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 supporting for advance indexing

* auto format by CI

Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <daquexian566@gmail.com>

* rename eager.multi_client to eager

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e5fa20b97229d6b0be46a7ff814dc8cd83.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* fix clang-tidy error

* fix clang-tidy in env_imp

* refine env api

* format

* refine graph del and sync at shuttingdown

* fix typo

* add comment

* rm useless

* rm useless

Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: cheng cheng <472491134@qq.com>

* [PersistentTable] Fix num blocks (#7986)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add auto benchmark for flowvision (#7806)

* update yml

* update workflow

* add resnet50

* [PersistentTable] Async write (#7946)

* [PersistentTable] Async write

* fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* save log in separate dir by default (#7825)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug inform…
Ikkyu321 added a commit to ZJLabDubhe/oneflow-zj that referenced this pull request Aug 26, 2022
* edit tanh to a closure op (#5)

Co-authored-by: yoonlee888 <qiuyunlei@zhejianglab.com>

* Dev sin loop grad (#7)

* edit tanh to a closure op

* add grad-looped sin_cos_negative

* add test case

Co-authored-by: yoonlee888 <qiuyunlei@zhejianglab.com>
Co-authored-by: Zhenhua <1209435+hengzi@users.noreply.github.com>

* add log_grad_grad (#12)

* Add exp_grad_grad (#11)

* Revert "Dev sin loop grad (#7)" (#13)

This reverts commit c256a5a326d7e04c2ad4af802318661d18f72441.

* fix bugs (#16)

* fix ScalarSub param

* Add test case

* code format

* fix

* add higher order derivative Interface draft (#6)

* add higher order derivative Interface draft

* solve bugs of no Tensor.is_sparse attrs

* rm some  Interface comments

* fix & format

Co-authored-by: Zhenhua <1209435+hengzi@users.noreply.github.com>
Co-authored-by: Huang Zhenhua <huangzhenhua@zhejianglab.com>

* add Higher derivative vjp (#9)

* add Higher derivative vjp

* add autotest code

* add autograd.functional.vhp and modified functional

* Merge Testcase

* Rm chinese chars

Co-authored-by: Zhenhua <1209435+hengzi@users.noreply.github.com>
Co-authored-by: Huang Zhenhua <huangzhenhua@zhejianglab.com>

* merge Master into zj/develop (#21)

* Multi Tensor apply Optimizer (#8373)

* Add optim_cast and modify sgd

* Remove

* try to add fuseUpdatecast pass logic

* use pass

* still have bug in inplace

* ban inplace and fix sgd update

* fix regst num

* add env var

* remove cuda graph wrong use

* add support for graph

* initialize

* add functional impl

* add simple job rewrite

* delete redundant sgd update kernel

* support half

* add kernel

* use single loop kernel

* refine

* when in eval mode, we turn off multi tensor update

* refine format

* use juncheng kernel

* Refine

* group multi tensor op by some attr

* add parallel conf to key

* refine

* Add unroll logic

* fix bug

* restruct

* use pointer list

* add adam kernel

* support multi tensor adam update

* Remove cpu

* support skip if and scale by tensor

* support sgd adam unittest

* add more check

* Remove config

* Restruct tensorparams

* support fused cast in multi tensor update

* support cast in multi tensor

* fix bug in model update cast pass

* fix multi tensor sgd update with cast Pass check logic

* refine

* support multi tensor adam update with cast

* refine format

* Remove redundant template args

* merge modify for fused cast

* only allow fused cast in train mode

* only support data parallel in multi tensor update

* rewrite fuse update cast pass logic

* remove redundant if

* fix format

* add new line

* rename

* Remove print

* rename and add LOG

* Add more type and test

* still have bug in multi tensor adam

* Fix multi tensor adam update bug

* add multi tensor adam update with cast test

* simplify code

* fix format

* Add model diff datatype in optimizer key

* remove random seed

* fix comment

* fix comment

* fix to use model copy

* use for loop

* Fix comment

* use hashcombine

* fix clang analysis error

* add with cuda macro

* fix env var in unittest

* remove redundant unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix doc and ops template auto gen (#8546)

* fix doc and add op calculator

* fix bug

* fix gen_ops

* fix diag 0size tensor shape infer bug (#8557)

* fix diag 0size tensor shape infer bug

* refine

* refine

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Format tensor on cpu (#8548)

* Format tensor on cpu

* use tensor.detach

* Remove useless WITH_CUDAs (#8562)

* unique identity (#8509)

* unique identity

* fix

* add identity name

* rm debug log

* mv identity from class to graph

* auto format by CI

* fix unique iden with having multiple stage

* auto format by CI

* Update block.py

Co-authored-by: cheng cheng <472491134@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add GenericStreamContext (#8560)

* Modify some file and add test (#8556)

* Modify some file and add test

* modify the content

* modify the format and test function name

* modify the format and aligned with pytorch

* delete print

* modify the function name

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Move some op into amp gray list (#8545)

enlarge gray list

Co-authored-by: cheng cheng <472491134@qq.com>

* Refine inplace expand runtime_error (#8561)

* Refine inplace expand runtime_error

* Opt

* Refine

* Add Note

* OneEmbedding use malloc async (#8543)

* in out ptrs

* ops and test

* test pass

* prefetch tmp buffer

* embedding shuffle tmp buffer

* gradient shuffle

* tmp buffer size

* mem pool

* cuda 11.2

* add id_shuffle to setNumunique in update tests

* default not use dynamic alloc

* fix of_tidy

* add fused op

* address review

* init tmp_buffer

* mv memset

* fix

* one_embedding fused_lookup_init_cast and fused_update_put (#8564)

* add fused op

* mv memset

* fix

* address review

* rm fullcache n_missing check

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix cpu aligned_alloc size (#8569)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add flow norm (#8535)

* add flow norm

* rm import

* rm  doctest.testmod

* fix pad_packed_sequence method input requires_grad==True (#8574)

* fix pad_packed_sequence method input requires_grad==True

* fix append error when batch_first=True

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix embedding manager tmp buffer (#8585)

* fix embedding manager

* format

* fix reduce_ops 0size bug (#8551)

* fix reduce_ops 0size bug

* fix comment

* auto format by CI

* fix bug

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Align Momentum Optimizer (#8549)

* fix momentum update

* align momentum

* fix bug and finish eager unittest

* Support Graph optimizer

* fix momentum bug

* refine beta

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fill GetSbp bug and consistent test bug (#8576)

fix(FillOp): fill GetSbp bug and consistent test bug

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Dev Fully fused MLP Grad[OneEmbedding] (#8462)

* support fully fused mlp grad in eager

* support lazy backward

* fix output size

* add fallback to tmp_buf logic when ones buffer is not enough

* build sbp

* overlap allreduce

* fix overlap order

* fix format

* CUDA Graphs delayed capture

* Add ifcomm create for graph

* insert weight event roughly

* fix dbias allreduce error

* simplify code

* Add 11060 limit

* Remove print

* Rename

* fix fill bug and remove comm to cache

* Rename variable and add debug code for cache

* Use kernel state and fix bug

* remove print

* fix allreduce dbias bug

* fix header file

* fix comment

* remove redundant headerfile

* fix userops build error

* refine

* init nccl comm before execute kernel

* fix comment

Co-authored-by: liujuncheng <liujuncheng1022@gmail.com>

* rename mirrored to local (#8503)

* rename mirrored to local

* rename files

* rename files

* auto format by CI

* revert change of package_mirror.py

* rename LocalObject to Dependence

* rename fn LocalObject to Dependence

* merge master

* handle clang check

* fix

* refine

* rename local_object to dependence

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Implement BroadcastElementwiseUnary primitive (#8384)

* Add code skeleton for broadcast unary primitive

* first try

* finish impl

* finish impl

* format

* fix build error

* address review

* refine

* address review comments

* use broadcast unary primitive in fill_tensor_ kernel

* handle pack tail statically

* fix

* address review

* address review

* Fix SimplifyBroadcastDims

* fix

* revert fill_kernel

Co-authored-by: Juncheng <liujuncheng1022@gmail.com>

* skip cpu autotest for graph global (#8593)

* TODO

* skip cpu autotest for graph global

* Refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add function_library.h Exception (#8241)

* add RuntimeError for checking

* add RuntimeError to CHECK_EQ

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Refactor shrink (#8573)

* caching allocator

* auto format by CI

* Update ep_device_context.h

* EpDeviceCtx with CachingAllocator

* rm RawAllocator typename

* auto format by CI

* specific allo in EpDeviceCtx

* auto format by CI

* rm outdated alloc

* simplify thread safe guard

* auto format by CI

* avoid return mutex

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Speed up SliceKernel (#8589)

* perf(SliceKernel): decrease number of cuda kernels and speed up

* perf(SliceKernel): use old kernel when small tensor is all fullslice

* use std::copy to copy contiguous memory

* fix cpu kernel bug

* Update readme and vsn for 0.8.0 (#8600)

* update version

* remove py3.6

* modify some file and improve error message (#8592)

* modify some file and improve error message

* modify scalar_by_tensor_op.cpp

* Update scalar_by_tensor_op.cpp

* Update slice_op.cpp

* Update test_slice_op.py

* Update test_slice_op.py

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* rename consistent to global (#8505)

* rename consistent to global

* rename consistent to global

* rename files

* rename files

* refine

* auto format by CI

* refine

* fix clang check

* fix

* fix

* fix

* rm to_consistent docs

* auto format by CI

* refine

* fix

* fix

* revert changes

* auto format by CI

* revert changes

* revert changes

* rename

* rename

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* add module related container docs (#8580)

* add module related container docs

* auto format by CI

* fix comment

* refine

* refine

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix rnn util extra memory usage when requires_grad=False (#8603)

* fix rnn util extra memory usage when requires_grad=False

* add comments

* refine comments

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* use bracket format slice in tensor str (#8489)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Perf TensorInfo constructor (#8606)

* perf(Autograd): perf TensorInfo constructor

* rename consistent to global

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* print operators' python location when print nn_graph (#8558)

1. add a flag named print_op_loc to nn.Graph.debug() for printing operator locations.
2. add a flag named only_print_user_code_loc to nn.Graph.debug() for printing only the user's code locations.

* Add randint like (#8598)

* add randint_like op

* add docs for random

* refine

* auto format by CI

* add randint_like global test

* refine doc

* refine randint_like docs

* fix bug

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add full_like api (#8595)

* add full_like_op api

* refine

* add test

* refine

* refine docs

* refine

* add consistent_full test

* add full_like op

* fix docs comment

* change scalar sbp return value from list to tuple

* auto format by CI

* merge conflict

* revert

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix cumsum GenBackwardOpConfFn (#8604)

* fix cumsum GenBackwardOpConfFn

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* revert change (#8613)

* fix test graph optimization conf CI bug (#8617)

* restore resource config after random tests

* refine

* refine

* Release pod tensor (#8552)

* ThreadLocalGuard

* split ReleaseTensor into ReleasePodTensor and ReleaseNonPodTensor.

* rename

Co-authored-by: luyang <flowingsun007@163.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add param group for optimizer (#8611)

* add add_param_group interface for Optimizer

* add test for add_param_group

* revert

* fix comment

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix broadcast_elementwise_binary cpu (#8625)

fix broadcast_elementwise_binary_cpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* align exception msg to torch (#8627)

* align exception msg to torch

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* skip unstable global test in ci, reduce failure rate (#8635)

* fuse embedding interaction (#8586)

* fuse embedding interaction

* fix of_tidy

* refine

* fix

* address review

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix flip gen backward opconf (#8605)

* fix flip gen backward opconf

* use new opconf api

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_SNAPSHOT_LOAD_MMAP_LOCKED (#8597)

* Add ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_SNAPSHOT_LOAD_MMAP_LOCKED

* refine

* use MAP_POPULATE

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Profiling main thread (#8601)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

Co-authored-by: binbinHan <han_binbin@163.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fully Memory Log V2 with more details (#8565)

* Fully Memory Log V2 with more details

* refine log and long op name

* fix clang tidy

* fix test

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>

* Stream policy (#8590)

* ThreadLocalGuard

* refactor signature of StreamType::InitDeviceCtx

* refactor hint

* add StreamPolicy

* remove DeviceCtx args

* refine OpCallInstructionUtil::Prepare & Compute

* merge EpDeviceCtx and LazyJobDeviceCtx into StreamPolicy

* minor fix

* minor fix

* del useless code

* fix error

* fix merge error

* fix segment fault bug

* fix compile error

* del methods belong to Subclass

* resolve comment

Co-authored-by: binbinHan <han_binbin@163.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add fully support for broadcast matmul (#6937)

* fix arange bug

* fully support broadcast matmul

* add more check

* remove check

* add fully sbp

* fix full sbp

* Fix broadcast matmul grad

* remove old broadcast matmul grad

* add broadcast grad back and when B numaxes is 2, we use broadcast_gradB instead of matmul+reduce

* add lazy backward

* Add restrict when transpose_a is false we can use bmatmul_grad_b

* revert

* fix broadcast matmul backward

* fix single client dispatch matmul logic

* revert old bcast matmul grad b kernel

* fix eager functional matmul backward

* add more test case

* remove redundant code

* add more special case

* when b num axes is 2, we only save tensor a

* fix annotation

* fix conflict and format

* remove single client matmul code

* Fix eval error

* fix conflict

* fix unittest

* Add init value

* support matrix vector matmul

* add vector matrix product

* Use matmul primitive to rewrite matrix vector product forward and backward

* Add fullllllllly support for vector matrix product

* Fix sbp

* fix bug

* add unittest

* Add consistent test for broadcast matmul

* Remove redundant code

* fix userops annotation

* fix

* refine

* Fix clang static analysis

* fix clang analysis

* set check graph as false

* fix

* fix for unittest

* fix broadcast sbp bug

* try to fix unittest

* Fix consistent test

* fix multiplier to 4 for unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert "skip cpu autotest for graph global" (#8608)

* Revert "skip cpu autotest for graph global (#8593)"

This reverts commit b076be782fd8f21e50ee4915f2d1562f3a9ab4c0.

* cherry pick from master

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* OneEmbedding add tmp_buffer allocator (#8588)

* fix embedding manager

* format

* refine embedding_manager tmp_buffer allocator

* fix

* format

* refine

* refine

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* refine error msg for some user ops (#8579)

* refine error msg for some user ops

* refine error msg for some user ops

* optimize

* optimize the writing

* optimize the writing

* optimize the writing

* auto format by CI

* optimize writing

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add tril fill value (#8655)

add tril fill value

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix_non_pod_data_allocate_bug (#8657)

Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix norm (#8629)

* fix norm

* add doc

* add bool &

* update math_functor.cpp

* add note

* fix_decorate_mem_leak_bug_in_eager_boxing (#8661)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add higher order derivative for leaky_relu and negative op (#8643)

* add higher derivative for leakyrelu and negative

* fix a typo

* remove functor

* add initialize alpha

* fix incorrect dim size in global test

* fix incorrect dim size in global test

* optimize testcase

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* update oneflow intro to show the difference (#8669)

* update oneflow intro

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine oneflow intro

* Stacked error (#8671)

* ThreadLocalGuard

* StackedError

* StackedError

Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>

* Refactor tensor initializer (#8626)

* fix(*): fix xavier_initializer

* refactor(Initializer): refactor initializer

* fix function name

* auto format by CI

* refine

* fix interface in tensor.py

* fix(trunc_normal_): fix init bug and add test

* auto format by CI

* fix bug

* add oneflow.nn.init.normal_ test

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Fix nn doc (#8650)

* fix hsplit doc

* add doc for module

* fix dtype

* fix formula

* add ref

* fix row length

* Fix reduce max min bool dtype bug (#8651)

* fix reduce_max_min_bool_dtype

* fix bug

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Remove redundant exception wrapper (#8631)

* remove redundant ExceptionWrapper

* refine KeyErrorMessage

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Refactor MemoryCase to eliminate determine statements of device_type (#7727)

* ref memory_case_util

* ref BlobObject::CheckMemCase

* ref mem_case using

* address review

* address review

* namespace memcase -> memory

* fix conflict

* address review

* address static analysis

* rm check

* cpu device_id is always 0

* fix conflict

* timeout-minutes: 50

* revert change

* increase thrd limit in container

* skip 2x2 TestEinsumConsistent

* skip failed case of distributed test

* auto format by CI

* fix_non_pod_data_allocate_bug

Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: tsai <jackalcooper@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: clackhan <han_binbin@163.com>

* fix some data races in c++ api and SteadyVector (#8654)

* fix some data races in c++ api and SteadyVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* skip self copy in MutShapeView::ToShape

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Fix sin/cos higher order derivative (#8648)

* fix(GradGrad): fix sin/cos higher order derivative

* fix(GradGrad): fix calculate error

* refine autograd global test

* auto format by CI

* refine sin/cos grad_grad calculate

* fix static analysis

* merge conflict

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Ping Zhu <58718936+REYGU@users.noreply.github.com>
Co-authored-by: Zhu, Ping <pingzhuu@outlook.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* refine_eager_boxing_to_adapt_ep (#8568)

* refine_eager_boxing_to_adapt_ep

* fix typo

* refine

* refine symmetric-acyclic-nd-sbp-to-nd-sbp

* refine

* fix error

* fix static check

* add NOLINT

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix repeat bug (#8645)

* make result contiguous

* add test case

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Instruction policy (#8583)

* ThreadLocalGuard

* vm::InstructionPolicy

* fix compile error (#8623)

* fix compile error

* change MirroredObject to Dependence

* Modify DependenceVector

* rm include stream type

* fix stream type

* auto format by CI

Co-authored-by: Yu OuYang <xuanjiuye@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* handle non-contiguous input (#8665)

* handle non-contiguous input

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* rename define CONSISTENT to GLOBAL (#8652)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Refine naive interpret (#8672)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuilder::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* explicit scalar initialization

Co-authored-by: clackhan <han_binbin@163.com>

* Rebuild Docs V0.8.0 (#8392)

* rebuild for 5 module

* fix bug

* fix for doctree and content  in nn and

* fix

* fix

* fix

* add some

* fix for oneflow.rst

* update oneflow oneflow.nn

* update tensor

* update tensor module

* update

* test

* update

* update

* fix for undone desc

* docs: oneflow.utils.data (#8485)

* feat(utils.data): add oneflow.utils.data

* docs(dataloader): change the docstring of DataLoader

* docs(tensor): add methods to oneflow.Tensor document

* docs(optim): change docstring of optimizer and add a note to the document

* nn.graph

* fix for graph

* fix bug

* review nn and linalg document (#8515)

* docs(nn): add contents to oneflow.nn document

* docs(linalg): refactor oneflow.linalg document

* change attributes.rst and review nn.functional.rst (#8514)

* change attributes.rst and review nn.functional.rst

* reconstruction oneflow.cuda

* fix cuda and rebuild comm demo (#8582)

* update image

* add distributed

* oneembedding & refine graph

* update for distributed one_embedding

* fix rnn.py (#8616)

* Refactor oneflow.nn.init docs (#8622)

docs(nn.init): refactor nn.init document

* docs(nn.init): remove the comments

* docs(utils.data): remove the comments

* update and fix bug

* docs(review): refine the documents (#8646)

* docs(review): refine oneflow, nn, Tensor, nn.init, linalg, utils.data, optim modules

* docs(optim): modify the code examples

* docs(tensor): edit note

* Refactor oneflow.autograd docs (#8594)

* docs(autograd): refactor oneflow.autograd

* docs(autograd): edit "Default gradient layouts".

* docs(autograd): reedit "Default gradient layouts"

* docs(autograd): add comment

* docs(autograd): add reference

* update

* docs(tensor): change autoclass to autosummary

* update

* update

* add oneflow.linalg.diagonal (#8653)

* docs(linalg): add oneflow.linalg.diagonal

* update environment variable

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* update environment variable

* update for ev & distributed

* update distributed

* update ev

* update distribute desc

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* update

* Modify docstring descriptions (#8656)

* docs: move pytorch reference to end

* docs: add some docstring

* docs(refs): add refs

* Update docs/source/distributed.rst

* update for distributed details and environment_variable

* docs(docstring): Modify all reference links to version 1.10 (#8663)

* fix bug

* fix bug

* fix all warning

Co-authored-by: Guoliang Cheng <1876953310@qq.com>
Co-authored-by: liu xuan <85344642+laoliu97@users.noreply.github.com>
Co-authored-by: Guoliang Cheng <lmyybh_lazy@163.com>
Co-authored-by: laoliu97 <841637247@qq.com>
Co-authored-by: Yao Chi <later@usopp.net>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* Fix zeros like and ones_like api (#8632)

* fix zeros_like and ones_like bug

* refine

* revert

* refine

* fix tensor_slice_view infer physic_shape bug

* add test

* refine

* auto format by CI

* fix bug

* refine

* auto format by CI

* fix import error

* fix bug

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix sbp print bug (#8689)

* Add a normal priority with no transfer but different sbp

* Fix the bug for printing no boxing edge

* Do not use P for weights

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* eager_local_interpreter_with_infer_cache (#8619)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuilder::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* eager_local_interpreter_with_infer_cache

* remove useless code

* resolve comments

* refactor TensorMeta::TensorMeta(const TensorMeta)

* use small vector

* add kMaxNumDims

* fix error include

* fix split Symbol LocalTensorMeta error

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* move ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE to eager.h

* add blank line

* resolve comments

* minor fix

* refine

* explicit scalar initialization

* fix static check error

* auto format by CI

* of_format

* resolve comment

* refine

* refine

* refine

Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gelu nn.Module bug and support tanh mode. (#8693)

* add gelu2 api

* refine test

* refine docs

* refine

* restuct

* delete useless headfile

* format

* rm doc of tensor.gelu (#8696)

Co-authored-by: Shanshan Zhong <62104945+zhongshsh@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix bug in CrossFeatureInteraction LazyBackward (#8677)

fix bug

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix floating-point scalar tensor in arange (#8673)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn functional fold (#8667)

* add fold

* update fold.py

* add test

* fix doc

* fix comment

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* modify some file and improve the error message (#8566)

* modify some file and improve the error message

* modify the content

* modify the content

* auto format by CI

* Update roi_align_op.cpp

* Update roi_align_op.cpp

* Update reshape_user_op_util.cpp

* auto format by CI

* Update roi_align_op.cpp

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* [OneEmbedding] add id_shuffle_copy_out (#8683)

add id_shuffle_copy_out

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix add_param_group step key not match error (#8698)

* fix add_param_group step key not match error

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add env ONEFLOW_EP_CUDA_DEVICE_FLAGS and ONEFLOW_EP_CUDA_STREAM_FLAGS (#8703)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix for docsv0.8 (#8710)

* fix repeat op 0-size related bug (both in FW and AD) (#8707)

* fix repeat op 0-size related bug (both in FW and AD)

* refine

* refine static check

* refine

* fix comment

* fix comment

* refine

* fix test

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support Dropout Scale in FusedMLPGrad[OneEmbedding] (#8633)

* support alpha list

* Remove redundant modify

* remove redundant alpha set

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix bug of Tensor.type (#8697)

* fix bug of tensor.type(flow.Tensor)

* fix bug of tensor.type(flow.Tensor) about device

* Fix tensor type doc (#8699)

fix doc of tensor.type

* add test for tensor.type(flow.Tensor)

* move PyTensorMetaCls_CheckExact to header file

Co-authored-by: Shanshan Zhong <62104945+zhongshsh@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* ONEFLOW_GRAPH_PLACE_TRAINING_STATE_ON_ALL_RANKS (#8706)

* ONEFLOW_GRAPH_PLACE_TRAINING_STATE_ON_ALL_RANKS

* auto format by CI

Co-authored-by: liujuncheng <liujuncheng1022@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx (#8709)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add qat conv modules (#8368)

* add qat conv modules

* add quantization related modules to doc

* refine qatconv modules doc

* add qat conv module tests

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add unsqueeze_multiple_op (#8714)

* add unsqueeze_multiple_op

* modify the format

* Update functional_api.yaml

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* modify broadcast_like_op.cpp and add test (#8720)

* modify broadcast_like_op.cpp and add test

* modify broadcast_like_op.cpp

* Update broadcast_like_op.cpp

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* JIT LR (#8500)

* add example code

* Update cosine_annealing_lr.py

* enable self params transformer

* enable pass ast to c++ api

* enable jit backend for lr

* enable jit global register and invoke

* convert Global to Singleton for new merge

* enable pybind11 walk on python ast

* enable test all existent get_lr of oneflow in python

* enable py_ast_wrapper pass ast from python to mlir

* switch all ast to ast-wrapper in mlir scope

* define python ast partially

* partial python ast definition

* trim asdl of python ast

* mlir gen

* add symbol table

* from ast to jit done

* switch llvm::errs() to mlir::emitError and convert switch to typeSwitch

* trim duplicate namespace use

* fix LIT header

* add some docs

* enable compare with or_else, if with return seamless in branch and mutable variable

* trim code and refine struct

* register pybind11 ast node for shared_ptr

* enable cpp class in python

* go through python to mlir to llvm to jit to run

* add addf subf op

* work well on stepLR linearLR exponentialLR cosineDecayLR cosineAnnealingLR constantLR

* enable maxf minf conversion to llvm ir

* rename LR_JIT to LRJITRegister

* remove LR_JIT_Engine and switch Invoke to std::function returned by lookup

* refine struct

* enable bisect_right and python register api have dump option arg

* add bisect_left and bisect_transformer specially, delete former test python script

* remove c++17 standard

* restore double hash to iterator

* publish

* publish

* publish

* use llvm classof and typeswitch rightly

* trim

* commit

* commit

* commit

* commit

* commit

* commit

* auto format by CI

* Update ir.cpp

* Update OneFlowLRJITRegistry.h

* auto format by CI

* Update AstMlirGen.h

* Update lr_jit.cpp

* auto format by CI

* Naming conventions

* auto format by CI

* auto format by CI

* deploy _ behind

Co-authored-by: leaves-zwx <kunta0932@gmail.com>
Co-authored-by: yuhao <1171760467@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: yuhao <72971170+howin98@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add logspace (#8599)

* add logspace

* add global test

* restore rand

* fix doc

* rename consistent to global

* adjust import order

* add todo

* Add hann_window (#8615)

* add hann_window

* rm useless include

* add check

* adjust import order

* add ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE (#8730)

* add ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE

* add environment to vm.h

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix as strided bool type and view bug (#8713)

* fix as_stride bug

* refine

* refine

* refine

* delete useless head file

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add functional binary cross entropy (#8708)

* add gelu2 api

* refine test

* refine docs

* refine

* restuct

* delete useless headfile

* format

* rm doc of tensor.gelu

* add functional binary cross entropy

Co-authored-by: BBuf <1182563586@qq.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support map_location in flow.load (#8666)

* support map_location in flow.load

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* fix tests

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix bug when map_location is None

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Add addcdiv (#8581)

* add addcdiv

* fix tensor_functions

* fix inplace

* add test number

* rename consistent to global

* Inner most dim case for cumsum cumprod op (#8403)

* cumsum use cub scansum in some case

* prod use cub scan

* refine name

* refine

* optimize cum op

* format

* fix

* get device properties by cuda stream class

* revert useless code

* refine

* outer dim use parallel sweep algo

* refine

* fix a fraction of threads hit __syncthreads

* revert

* refine kernel define

* refine

* refine

* refine

* refine

* move comment

* fix

* fix

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Define mut output dtype and mut output is dynamic in infer ctx (#8716)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* replace const DataType& with DataType

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Dev refactor fuse instruction policy (#8624)

* ThreadLocalGuard

* vm::InstructionPolicy

* refactor fuse instruction policy

* fix compile error (#8623)

* fix compile error

* change MirroredObject to Dependence

* Modify DependenceVector

* add instruction policy util

* add instruction policy util

* remove include

* add include

* rm fuse instruction type

* Modifying variable properties

* add stream_sequential_dependence_ to instruction_policy

Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of batchnorm num_batches_tracked global error when loading state_dict (#8723)

add condition for assigning num_batches_tracked

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add launch master port limit (#8563)

* add launch master port limit

* Update python/oneflow/distributed/launch.py

Co-authored-by: daquexian <daquexian566@gmail.com>

Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix docs import distance (#8691)

* fix import distance

* add functional apis

* add smooth_l1_loss docs

* refine activation.py

* add deleted api

* review

* Add interfaces missing from the docs of the oneflow, nn, and other modules (#8704)

* docs: add api

* docs(nn): refactor nn

* review

Co-authored-by: Guoliang Cheng <lmyybh_lazy@163.com>
Co-authored-by: ChenQiaoling <48576019+Chenqll@users.noreply.github.com>

* refactor control stream type (#8647)

* refactor control stream type

* auto format by CI

* Add method implementation

* refine

* refine

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Define mut output tensor desc (#8717)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* define_mut_output_dtype_and_mut_output_tensor_desc

* replace const DataType& with DataType

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* fix merge error

* fix warning error

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Symbolic local tensor meta (#8662)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuilder::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* eager_local_interpreter_with_infer_cache

* remove useless code

* resolve comments

* refactor TensorMeta::TensorMeta(const TensorMeta)

* use small vector

* Symbolic LocalTensorMeta

* check shape in critical_sectio

* add kMaxNumDims

* fix error include

* fix split Symbol LocalTensorMeta error

* fix split cache and symbolic local tensor meta error

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* move ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE to eager.h

* add blank line

* resolve comments

* minor fix

* refine

* explicit scalar initialization

* fix static check error

* auto format by CI

* of_format

* resolve comment

* refine

* refine

* refine

* fix error

* define MutOutputShape and MutOutputStride in InferContext

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* fix static check error

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* define_mut_output_dtype_and_mut_output_tensor_desc

* replace const DataType& with DataType

* split const and mut func in LocalTensorMeta

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* fix merge error

* fix warning error

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* split MutTensorMeta and MutLocalTensorMeta

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

* resolve comment

* refine

* fix typo

Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* fix typo

* use OpArgsVector

Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* Feat general basic communication (#8437)

* Add a slight cost for B->S and B->P in 2d sbp

* Add penalty for P in consumer

* Fix a slight bug

* Add at most 1 middle node for general basic communication

* Add the cost for general basic communication

* Add the slight penalty for eager

* Skip initialization of boxing collector if not needed

* Fix a bug

* Dev nd nccl send recv boxing (#8467)

* nd nccl_send_recv_boxing

* rm print

* support num_axes > 2

* Add distributed optional run (#8372)

* Add

* change deps

* add install

* add skip

* autoprof supports bandwidth (#8367)

* autoprof supports bandwidth

Signed-off-by: daquexian <daquexian566@gmail.com>

* print bandwidth

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* remove tmp buffer of cumprod cpu backward kernel (#8369)

* remove tmp buffer of cumprod cpu backward kernel

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Move tensor api to cpython part3 (#8342)

* add tensor_functions

* concat py methods

* add hash, restore tensor.py

* check replacement

* refine code, remove commented tensor.py

* refine code

* move some api

* add cpu and cuda api

* add triu tril norm and etc.

* remove tensor_functions.h

* move more api

* move more api, refine size

* fix typo

* format code, remove useless include

* refine code

* refine code, fix typo

* align .cuda to python

* refine code

* split some api to part3 for review

* remove positional only arguments of argmax and argmin

* remove arguments parse

* modify arguments name in matmul and floor_divide

* rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions

* refine code, format code

* add inplace /=, add comments

* remove name in macros

* remove python api

* remove redundant include

* remove cout

* format code

* refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_

* remove redundant code

* auto format by CI

* fix typo, fix wrong call

* modify idx datatype from int32 to int64 in tensor.size

* add some DIRECT_PASS_FUNC

* add cpu cuda var pow and etc.

* add masked_fill any all

* make REDUCE_FUNC macro, add reduce_* functions

* add 0dim check in ReduceSumWhole, refine yaml

* fix bug

* restore add add_ sub sub_

* add unittest for tensor.half tensor.add tensor.add_

* refine code

* refine code

* fix typo

* fix bug of tensor.std()

* refactor var std and cuda, using c++ functional api

* add beta and threshold in softplus

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn_functor Check (#7910)

* add bias_add_check

* add bias_add error test

* fix conv2d nhwc bias_add error

* add nhwc conv test

* add bias_add_error test

* Add bias add error check

* Rename

* add batch matmul error check

* add matmul check error msg

* remove annotation

* add fused mlp error msg check

* Add pixel shuffle check test

* add more test until normalization add relu functor

* refine error message

* finish all nnfunctor check msg

* handle type error

* remove useless symbol

* modify back to TypeError

* fix all comment

* Remove redundant code

* Remove pad ndim check

* fix bias add space

* fix check logic, because the CI GPU is not always gpu:0

Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222)

* previous version for fused_matmul_bias_add_relu_dropout

* add op infer

* fix detail

* finish forward

* support dropout rate list

* add forward test

* fix bug for output buffer

* Configurable alpha params

* try to add bit mask logic

* Add bitmask first version!

* Add row col bitmask logic

* support not align4 reludropout

* simplify relu dropout ld logic

* Add naive relu dropout grad kernel

* add simple relu dropout grad kernel

* Rename

* support relu_dropout bitmask backward

* add vectorized optimization

* fix tmp buffer

* add to amp list

* add lazy backward logic

* Refine kernel

* add indextype dispatch

* simplify functor logic

* fix cublas fused mlp aux_ld shape bug

* Add more relu dropout kernel

* add full unittest

* fix bug in skip final activation

* refine

* Remove dump func

* fix format

* Remove cmake

* remove redundant divide

* add padded version

* fix dropout

* oneflow curand

* refine

* remove redundant kernel

* add unroll logic

* add unroll and ballot sync

* refine format

* Remove fast curand

* Refine python interface

* Add if branch for memset

* fix python logic

* just for debug

* not use matmul bias add grad

* add launch 1 block limit

* fix unittest

* Refine

* fix graph backward bug

* limit to 11060

* change to use int32_t dtype for cublas aux

* Fix jc comment

* fix comment

* fix convert

* fix static_analysis

* fix at

* fix userops td

* fix userops td

* fix const ref

* fix compile error for bfloat16

* limit to 11060

* fix bug

Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gather 0-dim tensor bug (#8376)

* fix 0-dim tensor bug

* refine

* support input 0-dim tensor for gather

* refine

* refine

* refine dim_scatter_kernel check

* refine

* refine check

* fix clang_tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add api to apply external job pass (#8370)

* Add condition to find-test-cache-distributed (#8387)

* add condition to find-test-cache-distributed

* fix

* wrap dim util (#8382)

* wrap dim util

* format

* use more maybe_wrap_dim

* refine array functor

* add more

* refine math_functor
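
The `maybe_wrap_dim` helper mentioned above is not shown in this log. A common meaning for such a utility is to map a possibly negative dimension index onto the range `[0, ndim)`. The sketch below illustrates that idea under this assumption; the function name, the error type, and the `ndim >= 1` restriction are hypothetical and may differ from the actual OneFlow helper.

```python
def wrap_dim_sketch(dim, ndim):
    """Map dim in [-ndim, ndim) to its non-negative equivalent.

    Hypothetical illustration; assumes ndim >= 1.
    """
    if ndim < 1:
        raise ValueError("ndim must be at least 1 in this sketch")
    if not -ndim <= dim < ndim:
        raise IndexError(
            f"dimension out of range (expected [{-ndim}, {ndim - 1}], got {dim})"
        )
    # Negative indices count from the end, as with Python sequences.
    return dim + ndim if dim < 0 else dim


print(wrap_dim_sketch(-1, 4))  # 3
print(wrap_dim_sketch(2, 4))   # 2
```

Centralizing this check in one place lets each functor accept negative `dim` arguments without repeating the range validation.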

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379)

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like

* refine

* fix static check error

* fix bug about index (#8388)

* fix bug about index

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* LogicalSliceAssign support full slice sbp (#8344)

* feat(SliceOp): slice ops support 2d sbp

* fix(SliceOp): fix [B, P] 2d sbp bug

* refine error message

* fix bug in parallel_num == 1

* add comment

* add warning and format

* add NOLINT for boxing check

* feat(LogicalSliceOps): support all nd_sbp

* feat(LogicalSlice): support nd_sbp

* add error message

* fix(AutoTest): fix auto_test bug in module.parameter pass

* auto format by CI

* fix(LogicalSliceAssign): skip test when 1n1d

* fix SliceParams memset error

* remove memset

* add CHECK_JUST

* fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT

* remove memset

* fix spilit_info.axis bug

* feat(LogicalSliceOps): support grad

* add logical_slice gradient_funcs

* feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp

* auto format by CI

* test(LogicalSlice): fix logical_slice dims

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix_tensor_from_numpy_mem_leak_bug (#8391)

* fix_tensor_from_numpy_mem_leak_bug

* add note

* refine note

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393)

* make of_pyext_obj static only

* refine note

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Adjust tolerance setting in embedding_renorm unit test (#8394)

* support front end compile for job to iree (#8249)

* support frontend dev version

* polish name

* add tosa-to-elf.mlir

* tosa to elf by llvm

* conv2d partial

* an enhanced frontend runner

* support numpy as input

* enable using nn graph multiple times with different inputs

* enable multiple input

* enable cpu and cuda

* change full_name to _full_name

* support exchange cuda with cpu seamlessly

* remove pip

* lit config

* polish

* trim

* auto format by CI

* modify

* auto format by CI

* last line polish

* use unittest

* auto format by CI

* use allclose

* auto format by CI

* polish

* optimize convert oneflow to tosa

* conv2d

* conv2d enhanced && conv2d examples add

* add road map

* add add_n2Op and broadcast_addOp conversion

* add matmulOp conversion

* support converting normalization op to tosa (partially)

* update roadmap

* support i64 tensor to dense elem attr

* support 100% resnet op conversion

* add test mlir

* add test iree resnet python script

* auto format by CI

* done

* enhance iree resnet test script

* auto format by CI

* rebuild code

* auto format by CI

* rebuild test script

* update

* auto format by CI

* pub

* trim test scripts

* move

* move

* input and output add block arg judgement

* emit error in variable conversion

* error handle for ci

* modify err info

* auto format by CI

* merge

* auto format by CI

* output not block

* flow ones

* rm const

* trim maybe

* trim maybe with header file

* const auto

* solve clangd error

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/zero mix with mp (#8036)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* zero use stage 2

* add limit consumer api

* add new api

* refine zero s select

* fix index out of range

* rm zero limit on device type

* zero test with activation checkpointing

* add identity when dp sequence len is 1

* move to base with master

* fix

* fix

* fix

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* split long func

* restore test

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert embedding normal path and fix amp list (#8374)

* revert embedding normal path, fix amp list

* fix amp

* fix memset bug in gather cpu kernel

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* replace fixed_vector with small_vector and make Shape inherit from it (#8365)

* Replace fixed_vector with llvm::SmallVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* Shape inherited from llvm::SmallVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* refine cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* rename fixed_vector to small_vector

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix reviews

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update Shape constructor

Signed-off-by: daquexian <daquexian566@gmail.com>

* add 'PUBLIC' keyword to all target_link_libraries

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* set is_initialized_ default to true

Signed-off-by: daquexian <daquexian566@gmail.com>

* override some methods to set is_initialized_

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Light plan for debug (#8396)

* Light plan for debug

* fix note

* disable terminfo to fix missing terminfo symbols (#8400)

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of ZeRO MP in complex case (#8404)

* Remove redundant output_lbns in ir (#8409)

* mv case

* remove redundant info

* Dev FusedCrossInteraction[OneEmbedding] (#8335)

* add simple fused cross interaction forward

* add packed fused

* Add cross interaction grad

* simplify code

* fix bug

* support crossnet v2

* support cross interaction v2

* add lazy backward

* Rename and add test

* fix jc comment

* fix comment

* fix bug

* fix userops td elem_cnt for FUSED Group

* fix header file

* fix clang static analysis

* fix unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add exe graph physical shape check msg (#8002)

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.
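
To illustrate what capping the stack depth means in practice, here is a minimal sketch that uses only Python's standard `traceback` module. It is not the OneFlow implementation; the function name and the output format are assumptions, and only the parameter name `max_stack_depth` comes from the commit message above.

```python
import traceback


def call_site_sketch(max_stack_depth=2):
    """Return up to max_stack_depth caller frames, outermost first.

    Hypothetical helper: extract_stack(limit=n) keeps the n innermost
    frames, so one extra frame is requested and then dropped to exclude
    this function itself.
    """
    frames = traceback.extract_stack(limit=max_stack_depth + 1)[:-1]
    return [f"{f.filename}:{f.lineno} in {f.name}" for f in frames]


def outer():
    return inner()


def inner():
    return call_site_sketch(max_stack_depth=2)


for line in outer():
    print(line)
```

With `max_stack_depth=2`, the two reported frames are `outer` and `inner`; anything further up the call chain is omitted.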

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <daquexian566@gmail.com>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: chengtbf <472491134@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support ctor np array from of tensor (#7970)

* support ctor np array from of tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename interface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advance indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 supporting for advance indexing

* auto format by CI

Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <daquexian566@gmail.com>

* rename eager.multi_client to eager

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e5fa20b97229d6b0be46a7ff814dc8cd83.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* …