Fully Memory Log V2 with more details #8565

chengtbf · 2022-07-04T14:11:11Z

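Provides a more detailed memory analysis log. For each Chain (Chunk->MemBlock) it now reports the shape, dtype, lifetime, and allocation/free order of every tensor, so you can quickly check whether the tensors that dominate each memory block look abnormal.
The Checkpointing pass now emits a log recording which tensors were checkpointed.
Refined the prefix of system tensors so that their names are not too long when viewed in the log.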

Checkpointing log example:

BERT

 In subgraph: 0 has checkpointing tensor num = 14
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.self_attention.dense.weight-125/out_0 ,shape: (768,768) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.self_attention.query_key_value.weight-123/out_0 ,shape: (2304,768) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: model.bert-identity-179/out_0 ,shape: (8,1,512,512) ,dtype: kBool ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.mlp.dense_4h_to_h.bias-132/out_0 ,shape: (768,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: model.bert-identity-178/out_0 ,shape: (512,8,768) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.self_attention.dense.bias-126/out_0 ,shape: (768,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.input_layernorm.bias-122/out_0 ,shape: (768,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.mlp.dense_4h_to_h.weight-131/out_0 ,shape: (768,4096) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.self_attention.query_key_value.bias-124/out_0 ,shape: (2304,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.input_layernorm.weight-121/out_0 ,shape: (768,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.post_attention_layernorm.weight-127/out_0 ,shape: (768,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.mlp.dense_h_to_4h.bias-130/out_0 ,shape: (4096,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.post_attention_layernorm.bias-128/out_0 ,shape: (768,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)
Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.mlp.dense_h_to_4h.weight-129/out_0 ,shape: (4096,768) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)

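For each Checkpointing subgraph, the log prints the tensors that are kept for the backward pass. Most of them are Variables; the only exceptions are the special identity tensors, which are the module's input tensors. Here we can see that bert has two inputs (data and mask).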

Based on this log, we can analyze whether any tensors could be reused under Checkpointing.
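As a rough aid for this kind of analysis, the sketch below parses lines in the format shown above and separates Variable tensors from the remaining ones (the identity inputs). It is only an illustration: the regular expression is derived from the sample lines, and treating the "Sys-GradAcc-VarRepeat-" prefix as the marker for Variables is an assumption based on this one BERT log. The helper names are hypothetical.

```python
import re

# Pattern derived from the sample "Checkpointing tensor:" lines above.
LINE_RE = re.compile(
    r"Checkpointing tensor: (?P<name>\S+) ,shape: \((?P<shape>[^)]*)\) ,dtype: (?P<dtype>\w+)"
)


def parse_checkpointing_line(line):
    """Return (name, shape, dtype), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    # "(768,)" splits into ["768", ""], so empty pieces are skipped.
    shape = tuple(int(d) for d in m.group("shape").split(",") if d.strip())
    return m.group("name"), shape, m.group("dtype")


def split_variables_and_inputs(lines):
    """Split parsed records into Variables and everything else."""
    variables, others = [], []
    for line in lines:
        rec = parse_checkpointing_line(line)
        if rec is None:
            continue
        # Assumption: Variables carry this prefix, as in the sample log.
        if rec[0].startswith("Sys-GradAcc-VarRepeat-"):
            variables.append(rec)
        else:
            others.append(rec)
    return variables, others


sample = [
    'Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.self_attention.dense.weight-125/out_0 ,shape: (768,768) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)',
    'Checkpointing tensor: model.bert-identity-179/out_0 ,shape: (8,1,512,512) ,dtype: kBool ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)',
    'Checkpointing tensor: Sys-GradAcc-VarRepeat-model.bert.encoders.6.mlp.dense_4h_to_h.bias-132/out_0 ,shape: (768,) ,dtype: kFloat16 ,placement: oneflow.placement(type="cuda", ranks=[0]) ,sbp: (B)',
]
variables, inputs = split_variables_and_inputs(sample)
print(len(variables), "variables;", "inputs:", [name for name, _, _ in inputs])
```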

Detailed memory block log analysis

BERT

Summary

 Graph name GraphBase_0 in Rank: 0, Device: 0 needs to allocate [ 2909.38 MiB ] device memory. 
   In general, Chunk id: 0  memory is [ 1513.68 MiB ] with mem_block_num = 240
        Unreused memory not eager var is  [ 349.123 MiB ] with mem_block_num = 718
        Eager Variable Tensor total memory is [ 1046.58 MiB ] with mem_block_num = 331

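The summary reports the total memory the Graph needs, together with its three main components:

Chunk memory (each Rank / Device has exactly one Chunk). A Chunk contains multiple MemBlocks, each MemBlock contains multiple tensors, and all of these tensors are eligible for memory reuse.
Unreused mem: non-Variable tensors that occupy memory exclusively (no memory reuse is possible, e.g. the memory held by Repeat and Acc).
Eager Variable: the user-defined weights and the Optimizer state (e.g. Adam's m and v).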
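As a quick sanity check, the three components in the sample summary should add up to the reported total. The sketch below extracts the bracketed sizes with a regular expression derived from the sample text; `parse_summary` is a hypothetical helper, not part of OneFlow.

```python
import re

SUMMARY = """\
 Graph name GraphBase_0 in Rank: 0, Device: 0 needs to allocate [ 2909.38 MiB ] device memory.
   In general, Chunk id: 0  memory is [ 1513.68 MiB ] with mem_block_num = 240
        Unreused memory not eager var is  [ 349.123 MiB ] with mem_block_num = 718
        Eager Variable Tensor total memory is [ 1046.58 MiB ] with mem_block_num = 331
"""


def parse_summary(text):
    """Return (total, [chunk, unreused, eager_variable]) as printed in the summary."""
    sizes = [float(s) for s in re.findall(r"\[\s*([\d.]+)\s*MiB\s*\]", text)]
    return sizes[0], sizes[1:]


total, parts = parse_summary(SUMMARY)
residual = total - sum(parts)
# The residual should be within the rounding error of the printed values.
print(f"total={total}, parts={parts}, residual={residual:.3f}")
```

In this sample the residual is only a few thousandths, which is consistent with rounding in the printed figures.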

Chunk

A Chunk contains multiple MemBlocks whose memory does not overlap. With export GLOG_v=2, the MemBlocks are printed one by one, from largest to smallest.

In Device: 0 Chunk id: 0 MemBlock id: 161 has num = 840 tensor with mem size = 990.274
In Device: 0 Chunk id: 0 MemBlock id: 89 has num = 1 tensor with mem size = 65.2739
In Device: 0 Chunk id: 0 MemBlock id: 209 has num = 1 tensor with mem size = 32.6369
...
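A small sketch for working with these per-MemBlock lines: it parses the three sample lines and picks out the blocks that hold more than one tensor, which are the ones where memory is actually shared. The regular expression is derived from the sample output and `parse_mem_blocks` is a hypothetical helper.

```python
import re

# Pattern derived from the sample MemBlock lines above.
BLOCK_RE = re.compile(
    r"MemBlock id: (\d+) has num = (\d+) tensor with mem size = ([\d.]+)"
)


def parse_mem_blocks(lines):
    """Parse MemBlock lines into dicts with id, tensor count, and size."""
    blocks = []
    for line in lines:
        m = BLOCK_RE.search(line)
        if m:
            blocks.append(
                {
                    "id": int(m.group(1)),
                    "num_tensors": int(m.group(2)),
                    "size": float(m.group(3)),
                }
            )
    return blocks


sample = [
    "In Device: 0 Chunk id: 0 MemBlock id: 161 has num = 840 tensor with mem size = 990.274",
    "In Device: 0 Chunk id: 0 MemBlock id: 89 has num = 1 tensor with mem size = 65.2739",
    "In Device: 0 Chunk id: 0 MemBlock id: 209 has num = 1 tensor with mem size = 32.6369",
]
blocks = parse_mem_blocks(sample)
# Blocks with more than one tensor are where reuse takes place.
shared = [b for b in blocks if b["num_tensors"] > 1]
print([b["id"] for b in shared])
```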

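Each MemBlock corresponds to a subgraph whose tensors can reuse memory. With export GLOG_v=3, the detailed tensor layout inside each MemBlock is printed here. Tensors are listed one by one in the execution order of the ops they belong to, and each entry contains: order, name, size, duration (the lifetime, i.e. how many op executions pass between the tensor's allocation and its release), shape, dtype, allocate order (the op's position in the execution sequence), and free order.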

In Chunk id: 0, MemBlock id: 161 Order: 0 ,duration: 9 ,size: 0.00416 MiB, name: model.bert.embeddings-identity-11/out_0, shape: (1,512) ,dtype: kInt64 ,alloc_order: 0 ,free_order: 8
In Chunk id: 0, MemBlock id: 161 Order: 1 ,duration: 235 ,size: 0.000576 MiB, name: model-identity-243/out_0, shape: (8,) ,dtype: kInt64 ,alloc_order: 1 ,free_order: 235
In Chunk id: 0, MemBlock id: 161 Order: 2 ,duration: 20 ,size: 0.032832 MiB, name: model-identity-244/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 2 ,free_order: 21
In Chunk id: 0, MemBlock id: 161 Order: 3 ,duration: 632 ,size: 0.032832 MiB, name: model.bert-identity-8/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 3 ,free_order: 634
In Chunk id: 0, MemBlock id: 161 Order: 4 ,duration: 8 ,size: 0.032832 MiB, name: model-identity-245/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 4 ,free_order: 11
In Chunk id: 0, MemBlock id: 161 Order: 5 ,duration: 15 ,size: 0.032832 MiB, name: model.bert-identity-0/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 5 ,free_order: 19
In Chunk id: 0, MemBlock id: 161 Order: 6 ,duration: 628 ,size: 0.032832 MiB, name: model.bert-identity-7/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 6 ,free_order: 633
In Chunk id: 0, MemBlock id: 161 Order: 7 ,duration: 238 ,size: 32.637 MiB, name: model-identity-242/out_0, shape: (21248,768) ,dtype: kFloat16 ,alloc_order: 7 ,free_order: 244
In Chunk id: 0, MemBlock id: 161 Order: 8 ,duration: 625 ,size: 0.032832 MiB, name: model.bert.embeddings-expand-13/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 8 ,free_order: 632
In Chunk id: 0, MemBlock id: 161 Order: 9 ,duration: 8 ,size: 0.00416 MiB, name: model.cls_head.loss_func.lm_loss-scalar_logical_greater_equal-258/out_0, shape: (8,512) ,dtype: kBool ,alloc_order: 9 ,free_order: 16
In Chunk id: 0, MemBlock id: 161 Order: 10 ,duration: 15 ,size: 6.29152 MiB, name: model.bert.embeddings.tokentype_embeddings-gather-18/out_0, shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: 10 ,free_order: 24
In Chunk id: 0, MemBlock id: 161 Order: 11 ,duration: 222 ,size: 0.016448 MiB, name: model.cls_head.loss_func-cast-264/out_0, shape: (8,512) ,dtype: kFloat ,alloc_order: 11 ,free_order: 232
In Chunk id: 0, MemBlock id: 161 Order: 12 ,duration: 1 ,size: 0.032832 MiB, name: model.bert.extended_attn_mask-expand_dims-2/out_0,shape: (8,512,1) ,dtype: kInt64 ,alloc_order: inplaced ,free_order: inplaced
In Chunk id: 0, MemBlock id: 161 Order: 13 ,duration: 1 ,size: 0.032832 MiB, name: model.bert.extended_attn_mask-expand_dims-1/out_0,shape: (8,1,512) ,dtype: kInt64 ,alloc_order: inplaced ,free_order: inplaced
In Chunk id: 0, MemBlock id: 161 Order: 14 ,duration: 14 ,size: 6.29152 MiB, name: model.bert.embeddings.vocab_embeddings-gather-10/out_0,shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: 14 ,free_order: 27
In Chunk id: 0, MemBlock id: 161 Order: 15 ,duration: 6 ,size: 6.29152 MiB, name: model.bert.embeddings.position_embeddings-gather-15/out_0,shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: 15 ,free_order: 20
In Chunk id: 0, MemBlock id: 161 Order: 16 ,duration: 6 ,size: 0.032832 MiB, name: model.cls_head.loss_func.lm_loss-cast-259/out_0,shape: (8,512) ,dtype: kInt64 ,alloc_order: 16 ,free_order: 21
In Chunk id: 0, MemBlock id: 161 Order: 17 ,duration: 1 ,size: 0.016448 MiB, name: model.cls_head.loss_func-reshape-268/out_0,shape: (4096,) ,dtype: kFloat ,alloc_order: inplaced ,free_order: inplaced
In Chunk id: 0, MemBlock id: 161 Order: 18 ,duration: 1 ,size: 0.016448 MiB, name: model.cls_head.loss_func-reduce_sum-265/tmp_buffer_0,shape: (16384,) ,dtype: kChar ,alloc_order: 18 ,free_order: 18
In Chunk id: 0, MemBlock id: 161 Order: 19 ,duration: 5 ,size: 0.000512 MiB, name: model.cls_head.loss_func-reduce_sum-265/output_tensor_0,shape: () ,dtype: kFloat ,alloc_order: 18 ,free_order: 22
In Chunk id: 0, MemBlock id: 161 Order: 20 ,duration: 8 ,size: 16.7773 MiB, name: model.bert.extended_attn_mask-broadcast_mul-3/z_0,shape: (8,512,512) ,dtype: kInt64 ,alloc_order: 19 ,free_order: 26
In Chunk id: 0, MemBlock id: 161 Order: 21 ,duration: 1 ,size: 6.29152 MiB, name: model.bert.embeddings-add_n-16/out_0,shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: inplaced ,free_order: inplaced
In Chunk id: 0, MemBlock id: 161 Order: 22 ,duration: 216 ,size: 0.032832 MiB, name: model.cls_head.loss_func.lm_loss-broadcast_mul-260/z_0,shape: (8,512) ,dtype: kInt64 ,alloc_order: 21 ,free_order: 236
In Chunk id: 0, MemBlock id: 161 Order: 23 ,duration: 207 ,size: 0.000512 MiB, name: model.cls_head.loss_func-scalar_add-266/out_0,shape: () ,dtype: kFloat ,alloc_order: 22 ,free_order: 228
In Chunk id: 0, MemBlock id: 161 Order: 24 ,duration: 1 ,size: 16.7773 MiB, name: model.bert.extended_attn_mask-expand_dims-4/out_0,shape: (8,1,512,512) ,dtype: kInt64 ,alloc_order: inplaced ,free_order: inplaced
In Chunk id: 0, MemBlock id: 161 Order: 25 ,duration: 1 ,size: 6.29152 MiB, name: model.bert.embeddings-add_n-19/out_0

Simplified version:

0 ,duration: 9 ,size: 0.00416 MiB, name: model.bert.embeddings-identity-11/out_0, shape: (1,512) ,dtype: kInt64 ,alloc_order: 0 ,free_order: 8
1 ,duration: 235 ,size: 0.000576 MiB, name: model-identity-243/out_0, shape: (8,) ,dtype: kInt64 ,alloc_order: 1 ,free_order: 235
2 ,duration: 20 ,size: 0.032832 MiB, name: model-identity-244/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 2 ,free_order: 21
3 ,duration: 632 ,size: 0.032832 MiB, name: model.bert-identity-8/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 3 ,free_order: 634
4 ,duration: 8 ,size: 0.032832 MiB, name: model-identity-245/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 4 ,free_order: 11
5 ,duration: 15 ,size: 0.032832 MiB, name: model.bert-identity-0/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 5 ,free_order: 19
6 ,duration: 628 ,size: 0.032832 MiB, name: model.bert-identity-7/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 6 ,free_order: 633
7 ,duration: 238 ,size: 32.637 MiB, name: model-identity-242/out_0, shape: (21248,768) ,dtype: kFloat16 ,alloc_order: 7 ,free_order: 244
8 ,duration: 625 ,size: 0.032832 MiB, name: model.bert.embeddings-expand-13/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 8 ,free_order: 632
9 ,duration: 8 ,size: 0.00416 MiB, name: model.cls_head.loss_func.lm_loss-scalar_logical_greater_equal-258/out_0, shape: (8,512) ,dtype: kBool ,alloc_order: 9 ,free_order: 16
10 ,duration: 15 ,size: 6.29152 MiB, name: model.bert.embeddings.tokentype_embeddings-gather-18/out_0, shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: 10 ,free_order: 24
11 ,duration: 222 ,size: 0.016448 MiB, name: model.cls_head.loss_func-cast-264/out_0, shape: (8,512) ,dtype: kFloat ,alloc_order: 11 ,free_order: 232
12 ,duration: 1 ,size: 0.032832 MiB, name: model.bert.extended_attn_mask-expand_dims-2/out_0,shape: (8,512,1) ,dtype: kInt64 ,alloc_order: inplaced ,free_order: inplaced
13 ,duration: 1 ,size: 0.032832 MiB, name: model.bert.extended_attn_mask-expand_dims-1/out_0,shape: (8,1,512) ,dtype: kInt64 ,alloc_order: inplaced ,free_order: inplaced
14 ,duration: 14 ,size: 6.29152 MiB, name: model.bert.embeddings.vocab_embeddings-gather-10/out_0,shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: 14 ,free_order: 27
15 ,duration: 6 ,size: 6.29152 MiB, name: model.bert.embeddings.position_embeddings-gather-15/out_0,shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: 15 ,free_order: 20
16 ,duration: 6 ,size: 0.032832 MiB, name: model.cls_head.loss_func.lm_loss-cast-259/out_0,shape: (8,512) ,dtype: kInt64 ,alloc_order: 16 ,free_order: 21
17 ,duration: 1 ,size: 0.016448 MiB, name: model.cls_head.loss_func-reshape-268/out_0,shape: (4096,) ,dtype: kFloat ,alloc_order: inplaced ,free_order: inplaced
18 ,duration: 1 ,size: 0.016448 MiB, name: model.cls_head.loss_func-reduce_sum-265/tmp_buffer_0,shape: (16384,) ,dtype: kChar ,alloc_order: 18 ,free_order: 18
19 ,duration: 5 ,size: 0.000512 MiB, name: model.cls_head.loss_func-reduce_sum-265/output_tensor_0,shape: () ,dtype: kFloat ,alloc_order: 18 ,free_order: 22
20 ,duration: 8 ,size: 16.7773 MiB, name: model.bert.extended_attn_mask-broadcast_mul-3/z_0,shape: (8,512,512) ,dtype: kInt64 ,alloc_order: 19 ,free_order: 26
21 ,duration: 1 ,size: 6.29152 MiB, name: model.bert.embeddings-add_n-16/out_0,shape: (8,512,768) ,dtype: kFloat16 ,alloc_order: inplaced ,free_order: inplaced
22 ,duration: 216 ,size: 0.032832 MiB, name: model.cls_head.loss_func.lm_loss-broadcast_mul-260/z_0,shape: (8,512) ,dtype: kInt64 ,alloc_order: 21 ,free_order: 236
23 ,duration: 207 ,size: 0.000512 MiB, name: model.cls_head.loss_func-scalar_add-266/out_0,shape: () ,dtype: kFloat ,alloc_order: 22 ,free_order: 228
24 ,duration: 1 ,size: 16.7773 MiB, name: model.bert.extended_attn_mask-expand_dims-4/out_0,shape: (8,1,512,512) ,dtype: kInt64 ,alloc_order: inplaced ,free_order: inplaced

For example, in the listing above, the entry
7 ,duration: 238 ,size: 32.637 MiB, name: model-identity-242/out_0, shape: (21248,768) ,dtype: kFloat16 ,alloc_order: 7 ,free_order: 244

is a data input that is consumed by the backward pass.
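To look for tensors like this one automatically, the per-tensor lines can be parsed and ranked. The sketch below is an illustration only. The pattern is derived from the sample lines, and ranking by size × duration is an assumed heuristic for "large and long-lived", not something the log itself defines. It also checks that, in the sample lines, duration equals free_order - alloc_order + 1 for tensors that are not inplaced.

```python
import re

# Pattern derived from the sample per-tensor lines (full or simplified form).
TENSOR_RE = re.compile(
    r"duration: (?P<duration>\d+) ,size: (?P<size>[\d.]+) MiB, name: (?P<name>\S+?),\s*"
    r"shape: \((?P<shape>[^)]*)\) ,dtype: (?P<dtype>\w+) ,"
    r"alloc_order: (?P<alloc>\w+) ,free_order: (?P<free>\w+)"
)


def parse_tensor_line(line):
    """Parse one tensor line; alloc/free are None for 'inplaced' tensors."""
    m = TENSOR_RE.search(line)
    if m is None:
        return None
    alloc, free = m.group("alloc"), m.group("free")
    return {
        "name": m.group("name"),
        "size": float(m.group("size")),
        "duration": int(m.group("duration")),
        "alloc": int(alloc) if alloc.isdigit() else None,
        "free": int(free) if free.isdigit() else None,
    }


sample = [
    "3 ,duration: 632 ,size: 0.032832 MiB, name: model.bert-identity-8/out_0, shape: (8,512) ,dtype: kInt64 ,alloc_order: 3 ,free_order: 634",
    "7 ,duration: 238 ,size: 32.637 MiB, name: model-identity-242/out_0, shape: (21248,768) ,dtype: kFloat16 ,alloc_order: 7 ,free_order: 244",
    "12 ,duration: 1 ,size: 0.032832 MiB, name: model.bert.extended_attn_mask-expand_dims-2/out_0,shape: (8,512,1) ,dtype: kInt64 ,alloc_order: inplaced ,free_order: inplaced",
    "20 ,duration: 8 ,size: 16.7773 MiB, name: model.bert.extended_attn_mask-broadcast_mul-3/z_0,shape: (8,512,512) ,dtype: kInt64 ,alloc_order: 19 ,free_order: 26",
]
records = [r for r in map(parse_tensor_line, sample) if r is not None]
timed = [r for r in records if r["alloc"] is not None and r["free"] is not None]

# Relation observed in the sample lines, not a documented guarantee.
consistent = all(r["duration"] == r["free"] - r["alloc"] + 1 for r in timed)

# Assumed heuristic: large tensors that live long matter most.
ranked = sorted(timed, key=lambda r: r["size"] * r["duration"], reverse=True)
print(consistent, [r["name"] for r in ranked])
```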

Unreused mem

Every unreused mem block is device memory occupied exclusively by a single tensor:

In Device: 0 Memblock id: 247 Unreused  size: 32.6369 MiB, name: Sys-GradAcc-VarRepeat-model.bert.embeddings.vocab_embeddings.weight-24/out_0, shape: (21248,768) ,dtype: kFloat16
In Device: 0 Memblock id: 918 Unreused  size: 32.6369 MiB, name: Sys-GradAcc-VarAcc-model.bert.embeddings.vocab_embeddings.weight-out-cast_f2h/out_0, shape: (21248,768) ,dtype: kFloat16
In Device: 0 Memblock id: 737 Unreused  size: 6.29146 MiB, name: Sys-GradAcc-VarAcc-model.bert.encoders.4.mlp.dense_4h_to_h.weight-out-cast_f2h/out_0, shape: (768,4096) ,dtype: kFloat16

...

In Device: 0 Memblock id: 807 Unreused  size: 0.001536 MiB, name: Sys-GradAcc-VarAcc-model.bert.encoders.2.input_layernorm.weight-out-cast_f2h/out_0

Eager Variable

The model weights and the Optimizer state (e.g. Adam's m and v), etc.

In Device: 0 Memblock id: 993 EagerVariable  size: 65.2739 MiB, name: model.bert.embeddings.vocab_embeddings.weight-m/out, shape: (21248,768) ,dtype: kFloat
In Device: 0 Memblock id: 246 EagerVariable  size: 65.2739 MiB, name: model.bert.embeddings.vocab_embeddings.weight/out, shape: (21248,768) ,dtype: kFloat
In Device: 0 Memblock id: 1108 EagerVariable  size: 12.5829 MiB, name: model.bert.encoders.3.mlp.dense_4h_to_h.weight-m/out, shape: (768,4096) ,dtype: kFloat
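The sizes in the Unreused and EagerVariable lines can be cross-checked against shape and dtype. In the sample lines, the printed size matches element count × element size divided by 10^6, so the "MiB" figures here look like decimal megabytes. That is an observation from these samples only. The sketch below does the arithmetic; the dtype table covers just the two dtypes that appear in these lines, and the helper names are hypothetical.

```python
import math
import re

# Element sizes in bytes for the dtypes that appear in the sample lines.
DTYPE_BYTES = {"kFloat16": 2, "kFloat": 4}

LINE_RE = re.compile(
    r"size: (?P<size>[\d.]+) MiB, name: (?P<name>\S+?),\s*"
    r"shape: \((?P<shape>[^)]*)\) ,dtype: (?P<dtype>\w+)"
)


def expected_size(shape, dtype):
    """Element count times element size, divided by 10**6 (observed unit)."""
    return math.prod(shape) * DTYPE_BYTES[dtype] / 1e6


def check_line(line):
    """Return (logged_size, computed_size) for one log line."""
    m = LINE_RE.search(line)
    shape = tuple(int(d) for d in m.group("shape").split(",") if d.strip())
    return float(m.group("size")), expected_size(shape, m.group("dtype"))


sample = [
    "In Device: 0 Memblock id: 247 Unreused  size: 32.6369 MiB, name: Sys-GradAcc-VarRepeat-model.bert.embeddings.vocab_embeddings.weight-24/out_0, shape: (21248,768) ,dtype: kFloat16",
    "In Device: 0 Memblock id: 737 Unreused  size: 6.29146 MiB, name: Sys-GradAcc-VarAcc-model.bert.encoders.4.mlp.dense_4h_to_h.weight-out-cast_f2h/out_0, shape: (768,4096) ,dtype: kFloat16",
    "In Device: 0 Memblock id: 246 EagerVariable  size: 65.2739 MiB, name: model.bert.embeddings.vocab_embeddings.weight/out, shape: (21248,768) ,dtype: kFloat",
    "In Device: 0 Memblock id: 1108 EagerVariable  size: 12.5829 MiB, name: model.bert.encoders.3.mlp.dense_4h_to_h.weight-m/out, shape: (768,4096) ,dtype: kFloat",
]
results = [check_line(line) for line in sample]
for logged, computed in results:
    print(f"logged={logged} computed={computed:.6f}")
```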

strint

LGTM

github-actions · 2022-07-06T10:33:24Z

Static analysis with clang failed. PR label automerge has been removed

github-actions · 2022-07-06T10:57:24Z

Speed stats:

GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.4ms (= 12935.6ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.5ms (= 14246.5ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 142.5ms / 129.4ms)

OneFlow resnet50 time: 75.8ms (= 7576.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 82.5ms (= 8248.8ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.09 (= 82.5ms / 75.8ms)

OneFlow resnet50 time: 49.1ms (= 9816.2ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 57.8ms (= 11561.2ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.18 (= 57.8ms / 49.1ms)

OneFlow resnet50 time: 40.4ms (= 8078.1ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.8ms (= 8953.4ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.11 (= 44.8ms / 40.4ms)

OneFlow resnet50 time: 33.9ms (= 6786.9ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 35.0ms (= 6999.2ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.03 (= 35.0ms / 33.9ms)

OneFlow swin dataloader time: 0.263s (= 52.598s / 200, num_workers=1)
PyTorch swin dataloader time: 0.155s (= 30.954s / 200, num_workers=1)
Relative speed: 0.589 (= 0.155s / 0.263s)

OneFlow swin dataloader time: 0.080s (= 15.957s / 200, num_workers=4)
PyTorch swin dataloader time: 0.040s (= 8.024s / 200, num_workers=4)
Relative speed: 0.503 (= 0.040s / 0.080s)

OneFlow swin dataloader time: 0.065s (= 13.074s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.449s / 200, num_workers=8)
Relative speed: 0.340 (= 0.022s / 0.065s)

❌ OneFlow resnet50 time: 145.8ms (= 14580.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 169.6ms (= 16961.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 169.6ms / 145.8ms)

OneFlow resnet50 time: 95.2ms (= 9517.7ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 114.1ms (= 11408.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 114.1ms / 95.2ms)

OneFlow resnet50 time: 70.0ms (= 13992.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 92.8ms (= 18556.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 92.8ms / 70.0ms)

OneFlow resnet50 time: 57.7ms (= 11536.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 80.8ms (= 16158.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.40 (= 80.8ms / 57.7ms)

OneFlow resnet50 time: 53.1ms (= 10629.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 75.2ms (= 15048.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.42 (= 75.2ms / 53.1ms)
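For reference, the figures in this report follow simple arithmetic: the per-iteration time is the total time divided by the iteration count, and the relative speed is the PyTorch time divided by the OneFlow time, so values above 1 mean OneFlow is faster. A minimal sketch with hypothetical helper names:

```python
def per_iteration_ms(total_ms, iterations):
    """Average time per iteration in milliseconds."""
    return total_ms / iterations


def relative_speed(pytorch_time, oneflow_time):
    """PyTorch time over OneFlow time; above 1 means OneFlow is faster."""
    return pytorch_time / oneflow_time


# Values taken from the first resnet50 entry of the report above.
oneflow_ms = per_iteration_ms(12935.6, 100)
pytorch_ms = per_iteration_ms(14246.5, 100)
print(round(oneflow_ms, 1), round(pytorch_ms, 1))
print(round(relative_speed(142.5, 129.4), 2))
```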

github-actions · 2022-07-07T04:57:07Z

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8565/

…eflow into dev_cc_mem_log_v2

chengtbf · 2022-07-11T02:02:28Z

oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp

@@ -205,8 +205,8 @@ Maybe<Tensor> GradAccTryInsertUnpackAfterInput(
        << " the input tensor of nn.Graph will be unpacked by 0th dim into multiple micro-batches "
        << " and exec them in order.\n";

-    user_op::UserOpConfWrapperBuilder unpack_builder("System-GradientAccumulation-InputUnpack-"
-                                                     + input_conf.name() + "-" + NewUniqueId());
+    user_op::UserOpConfWrapperBuilder unpack_builder("Sys-GradAcc-InputUnpack-" + input_conf.name()


This PR shortens the name prefix of some system ops. Could that affect the following code?

oneflow/oneflow/core/graph/task_stream_index_manager.cpp

Line 73 in b79be4f

return generator->GenerateNamedRoundRobin("CPU_COMPUTE", cpu_device_num);

Could the CPU COMPUTE memory be getting corrupted here? @leaves-zwx

I ask because this exact same error has shown up twice:

s.ssssssss.......ss.s.ssss.s....s....sssssssss.....sssss.......ssss..... [ 78%]
F20220707 23:08:52.707211 4036 stream_index_generator.cpp:40] Check failed: it->second.size == size (48 vs. 3667) CPU_COMPUTE
*** Check failure stack trace: ***
    @ 0x7f25f7a4e3fa google::LogMessage::Fail()
    @ 0x7f25f7a4e6e2 google::LogMessage::SendToLog()
    @ 0x7f25f7a4df67 google::LogMessage::Flush()
    @ 0x7f25f7a50ad9 google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f25f0083b72 oneflow::StreamIndexGenerator::GenerateNamedRoundRobin()
    @ 0x7f25f00a1002 oneflow::GenerateComputeTaskStreamIndex()
    @ 0x7f25f00a118f oneflow::TaskStreamIndexGetterRegistry::Dispatch()
    @ 0x7f25f00a195c oneflow::TaskStreamIndexManager::GetTaskStreamIndex()

https://github.com/Oneflow-Inc/oneflow/runs/7265216193?check_suite_focus=true
https://github.com/Oneflow-Inc/oneflow/runs/7242834096?check_suite_focus=true

However, I could not reproduce it locally:

chengcheng@oneflow-21:~/debug/graph $ python3 test_tvm_frontend_dependency_on_graph.py
.._TvmFrontedGraph_1_input.0.0_2 m.features.0.weight m.features.0-conv2d-0 m.features.2-max_pool_2d-3 m.features.3.weight m.features.3-conv2d-4 m.features.5-max_pool_2d-7 m.features.6.weight m.features.6-conv2d-8 m.features.7-relu-10 m.features.8.weight m.features.8-conv2d-11 m.features.9-relu-13 m.features.10.weight m.features.10-conv2d-14 .
----------------------------------------------------------------------
Ran 3 tests in 2.012s
OK

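This should be unrelated to the prefix change here. I have a PR that rewrote these ops with functional, so the op names all became repeat-xx, and CI had no problems there either.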

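But this PR only changes the prefix. The other change adds log output, and no matter how that is written it should not be able to cause a stream index error. On top of that, both CI failures have exactly the same stack and location: both are in test_tvm_frontend_dependency_on_graph, and both are it->second.size == size (48 vs. 3667) CPU_COMPUTE

I don't think this is a coincidence, but I can't figure out why it happens.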
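To make the failing condition concrete, here is a toy model of a named round-robin generator with the same kind of consistency check. It is a hypothetical stand-in written only from the check expression in the log (it->second.size == size) and is not OneFlow's StreamIndexGenerator. It shows the one situation in which such a check can fail: the same name is requested again with a different size.

```python
class ToyStreamIndexGenerator:
    """Hypothetical stand-in, NOT OneFlow's StreamIndexGenerator."""

    def __init__(self):
        # name -> {"size": registered size, "next": next index to hand out}
        self._named = {}

    def generate_named_round_robin(self, name, size):
        entry = self._named.setdefault(name, {"size": size, "next": 0})
        # Same shape as the failing check: a name must keep its first size.
        if entry["size"] != size:
            raise RuntimeError(
                f"Check failed: size mismatch ({entry['size']} vs. {size}) {name}"
            )
        index = entry["next"]
        entry["next"] = (index + 1) % size
        return index


gen = ToyStreamIndexGenerator()
first = gen.generate_named_round_robin("CPU_COMPUTE", 48)
second = gen.generate_named_round_robin("CPU_COMPUTE", 48)
try:
    gen.generate_named_round_robin("CPU_COMPUTE", 3667)
    error = None
except RuntimeError as exc:
    error = str(exc)
print(first, second, error)
```

Read this way, the log message says that CPU_COMPUTE was first registered with size 48 and later requested with size 3667. Why the second value differs is exactly the open question in this thread.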

github-actions · 2022-07-11T03:25:08Z

CI failed when running job: cpu-module. PR label automerge has been removed

github-actions · 2022-07-11T03:25:53Z

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8565/

github-actions · 2022-07-11T03:26:13Z

Speed stats:

github-actions · 2022-07-12T13:12:09Z

Speed stats:

GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.4ms (= 12936.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.8ms (= 14284.1ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 142.8ms / 129.4ms)

OneFlow resnet50 time: 75.7ms (= 7573.4ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.1ms (= 8510.3ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.12 (= 85.1ms / 75.7ms)

OneFlow resnet50 time: 48.4ms (= 9672.3ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.5ms (= 11708.8ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.21 (= 58.5ms / 48.4ms)

OneFlow resnet50 time: 37.8ms (= 7568.5ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 42.7ms (= 8537.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.13 (= 42.7ms / 37.8ms)

OneFlow resnet50 time: 32.1ms (= 6426.2ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 35.8ms (= 7153.0ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.11 (= 35.8ms / 32.1ms)

OneFlow swin dataloader time: 0.253s (= 50.561s / 200, num_workers=1)
PyTorch swin dataloader time: 0.153s (= 30.503s / 200, num_workers=1)
Relative speed: 0.603 (= 0.153s / 0.253s)

OneFlow swin dataloader time: 0.109s (= 21.744s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.292s / 200, num_workers=4)
Relative speed: 0.381 (= 0.041s / 0.109s)

OneFlow swin dataloader time: 0.042s (= 8.490s / 200, num_workers=8)
PyTorch swin dataloader time: 0.023s (= 4.545s / 200, num_workers=8)
Relative speed: 0.535 (= 0.023s / 0.042s)

❌ OneFlow resnet50 time: 144.8ms (= 14476.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 168.1ms (= 16813.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 168.1ms / 144.8ms)

OneFlow resnet50 time: 94.3ms (= 9427.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 113.0ms (= 11303.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 113.0ms / 94.3ms)

OneFlow resnet50 time: 69.1ms (= 13822.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 99.2ms (= 19832.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.43 (= 99.2ms / 69.1ms)

OneFlow resnet50 time: 55.8ms (= 11164.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.1ms (= 14821.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 74.1ms / 55.8ms)

OneFlow resnet50 time: 52.3ms (= 10455.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.7ms (= 13746.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.31 (= 68.7ms / 52.3ms)

github-actions · 2022-07-12T13:37:06Z

CI failed when running job: cuda-module. PR label automerge has been removed

github-actions · 2022-07-12T22:02:35Z

Speed stats:

GPU Name: NVIDIA GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.4ms (= 12938.3ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 143.7ms (= 14367.1ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 143.7ms / 129.4ms)

OneFlow resnet50 time: 75.9ms (= 7587.7ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 86.0ms (= 8598.4ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 86.0ms / 75.9ms)

OneFlow resnet50 time: 49.1ms (= 9828.5ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 62.5ms (= 12496.1ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.27 (= 62.5ms / 49.1ms)

OneFlow resnet50 time: 36.8ms (= 7360.2ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.9ms (= 8976.1ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.22 (= 44.9ms / 36.8ms)

OneFlow resnet50 time: 32.7ms (= 6544.8ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 40.7ms (= 8135.8ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.24 (= 40.7ms / 32.7ms)

OneFlow swin dataloader time: 0.256s (= 51.188s / 200, num_workers=1)
PyTorch swin dataloader time: 0.153s (= 30.593s / 200, num_workers=1)
Relative speed: 0.598 (= 0.153s / 0.256s)

OneFlow swin dataloader time: 0.075s (= 15.072s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.284s / 200, num_workers=4)
Relative speed: 0.550 (= 0.041s / 0.075s)

OneFlow swin dataloader time: 0.062s (= 12.354s / 200, num_workers=8)
PyTorch swin dataloader time: 0.021s (= 4.291s / 200, num_workers=8)
Relative speed: 0.347 (= 0.021s / 0.062s)

❌ OneFlow resnet50 time: 144.9ms (= 14487.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 168.8ms (= 16877.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 168.8ms / 144.9ms)

OneFlow resnet50 time: 95.7ms (= 9565.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 112.0ms (= 11196.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 112.0ms / 95.7ms)

OneFlow resnet50 time: 68.0ms (= 13601.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 99.0ms (= 19809.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.46 (= 99.0ms / 68.0ms)

OneFlow resnet50 time: 55.6ms (= 11123.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.7ms (= 14732.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 73.7ms / 55.6ms)

OneFlow resnet50 time: 50.8ms (= 10169.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.2ms (= 13835.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.36 (= 69.2ms / 50.8ms)

* Multi Tensor apply Optimizer (#8373) * Add optim_cast and modify sgd * Remove * try to add fuseUpdatecast pass logic * use pass * still have bug in inplace * ban inplace and fix sgd update * fix regst num * add env var * remove cuda graph wrong use * add support for graph * initialize * add functional impl * add simple job rewrite * delete redundant sgd update kernel * support half * add kernel * use single loop kernel * refine * when in eval mode, we turn off multi tensor update * refine format * use juncheng kernel * Refine * group multi tensor op by some attr * add parallel conf to key * refine * Add unroll logic * fix bug * restruct * use pointer list * add adam kernel * support multi tensor adam update * Remove cpu * support skip if and scale by tensor * support sgd adam unittest * add more check * Remove config * Restruct tensorparams * support fused cast in multi tensor update * support cast in multi tensor * fix bug in model update cast pass * fix multi tensor sgd update with cast Pass check logic * refine * support multi tensor adam update with cast * refine format * Remove redundant template args * merge modify for fused cast * only allow fused cast in train mode * only support data parallel in multi tensor update * rewrite fuse update cast pass logic * remove redundant if * fix format * add new line * rename * Remove print * rename and add LOG * Add more type and test * still have bug in multi tensor adam * Fix multi tensor adam update bug * add multi tensor adam update with cast test * simplify code * fix format * Add model diff datatype in optimizer key * remove random seed * fix comment * fix comment * fix to use model copy * use for loop * Fix comment * use hashcombine * fix clang analysis error * add with cuda macro * fix env var in unittest * remove redundant unittest Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix doc and ops template auto gen (#8546) * fix doc and add op calculator * fix bug * fix gen_ops * 
fix diag 0size tensr shape infer bug (#8557) * fix diag 0size tensr shape infer bug * refine * refine * auto format by CI * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Format tensor on cpu (#8548) * Format tensor on cpu * use tensor.detach * Remove useless WITH_CUDAs (#8562) * unique identity (#8509) * unique identity * fix * add identit name * rm debug log * mv identity form class to graph * auto format by CI * fix unique iden with having multiple stage * auto format by CI * Update block.py Co-authored-by: cheng cheng <472491134@qq.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add GenericStreamContext (#8560) * Modify some file and add test (#8556) * Modify some file and add test * modify the content * modify the format and test function name * modify the format and aligned with pytorch * delete print * modity the function name * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Move some op into amp gray list (#8545) enlarge gray list Co-authored-by: cheng cheng <472491134@qq.com> * Refine inplace expand runtime_error (#8561) * Refine inplace expand runtime_error * Opt * Refine * Add Note * OneEmbedding use malloc async (#8543) * in out ptrs * ops and test * test pass * prefetch tmp buffer * embedding shuffle tmp buffer * gradient shuffle * tmp buffer size * mem pool * cuda 11.2 * add id_shuffle to setNumunique in update tests * default not use dynamic alloc * fix of_tidy * add fused op * address review * init tmp_buffer * mv memset * fix * one_embedding fused_lookup_init_cast and fused_update_put (#8564) * add fused op * mv memset * fix * address review * rm fullcache n_missing check Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix cpu 
aligned_alloc size (#8569) Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add flow norm (#8535) * add flow norm * rm import * rm doctest.testmod * fix pad_packed_sequence method input requires_grad==True (#8574) * fix pad_packed_sequence method input requires_grad==True * fix append error when batch_first=True Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix embedding manager tmp buffer (#8585) * fix embedding manager * format * fix reduce_ops 0size bug (#8551) * fix reduce_ops 0size bug * fix commnet * auto format by CI * fix bug Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Align Momentum Optimizer (#8549) * fix moemntum update * align momentum * fix bug and finish eager unittest * Support Graph optimizer * fix momentum bug * refine beta Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fill GetSbp bug and consistent test bug (#8576) fix(FillOp): fill GetSbp bug and consistent test bug Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Dev Fully fused MLP Grad[OneEmbedding] (#8462) * support fully fused mlp grad in eager * support lazy backward * fix output size * add fallback to tmp_buf logic when ones buffer is not enough * build sbp * overlap allreduce * fix overlap order * fix format * CUDA Graphs delayed capture * Add ifcomm create for graph * insert weight event roughly * fix dbias allreduce error * simplify code * Add 11060 limit * Remove print * Rename * fix fill bug and remove comm to cache * Rename variable and add debug code for cache * Use kernel state and fix bug * remove print * fix allreduce dbias bug * fix header file * fix comment * remove redundant headerfile * fix userops build error * refine * init nccl comm before execute kernel * fix comment Co-authored-by: liujuncheng 
<liujuncheng1022@gmail.com> * rename mirrored to local (#8503) * rename mirrored to local * rename files * rename files * auto format by CI * revert change of package_mirror.py * rename LocalObject to Dependence * rename fn LocalObject to Dependence * merge master * handle clang check * fix * refine * rename local_object to dependence Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Implement BroadcastElementwiseUnary primitive (#8384) * Add code skeleton for broadcast unary primitive * first try * finish impl * finish impl * format * fix build error * address review * refine * address review comments * use broadcast unary primitive in fill_tensor_ kernel * handle pack tail statically * fix * address review * address review * Fix SimplifyBroadcastDims * fix * revert fill_kernel Co-authored-by: Juncheng <liujuncheng1022@gmail.com> * skip cpu autotest for graph global (#8593) * TODO * skip cpu autotest for graph global * Refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add function_library.h Exception (#8241) * add RuntimeError for checking * add RuntimeError to CHECK_EQ * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Refactor shrink (#8573) * caching allocator * auto format by CI * Update ep_device_context.h * EpDeviceCtx with CachingAllocator * rm RawAllocator typename * auto format by CI * specific allo in EpDeviceCtx * auto format by CI * rm outdated alloc * simplify thread safe guard * auto format by CI * avoid return mutex * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Speed up SliceKernel (#8589) * perf(SliceKernel): descrease number of cuda kernel and speed up * perf(SliceKernel): use old kernel when small tensor is all fullslice * use std::copy to copy contiguous memory * fix cpu kernel bug * Update readme and vsn for 0.8.0 (#8600) * update version * remove py3.6 * modify some file and 
improve error message (#8592) * modify some file and improve error message * modify scalar_by_tensor_op.cpp * Update scalar_by_tensor_op.cpp * Update slice_op.cpp * Update test_slice_op.py * Update test_slice_op.py * auto format by CI * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * rename consistent to global (#8505) * rename consistent to global * rename consistent to global * rename files * rename files * refine * auto format by CI * refine * fix clang check * fix * fix * fix * rm to_consistent docs * auto format by CI * refine * fix * fix * revert changes * auto format by CI * revert changes * revert changes * rename * rename Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * add module releated container docs (#8580) * add module releated container docs * auto format by CI * fix comment * refine * refine Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix rnn util extra memory usage when requires_grad=False (#8603) * fix rnn util extra memory usage when requires_grad=False * add comments * refine comments Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * use bracket format slice in tensor str (#8489) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Perf TensorInfo constructor (#8606) * perf(Autograd): perf TensorInfo constructor * rename consistent to global Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * print operators' python location when print nn_graph (#8558) 1. add a flag in nn.Graph.debug() named print_op_loc for printing operator location. 2. 
add a flag in nn.Graph.debug() named only_print_user_code_loc for only print users' code location * Add randint like (#8598) * add randnint_like op * add docs for random * refine * auto format by CI * add randint_like global test * refine doc * refine randint_like docs * fix bug Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add full_like api (#8595) * add full_like_op api * refine * add test * refine * refine docs * refine * add consistent_full test * add full_like op * fix docs commnet * change scalar sbp return value from list to tuple * auto format by CI * merge conflict * revert Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix cumsum GenBackwardOpConfFn (#8604) * fix cumsum GenBackwardOpConfFn * add test case Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * revert change (#8613) * fix test graph optimization conf CI bug (#8617) * restore resource config after random tests * refine * refine * Release pod tensor (#8552) * ThreadLocalGuard * split ReleaseTensor into ReleasePodTensor and ReleaseNonPodTensor. 
* rename Co-authored-by: luyang <flowingsun007@163.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add param group for optimizer (#8611) * add add_param_group interface for Optimize * add test for add_param_group * revert * fix comment * refine * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix broadcast_elementwise_binary cpu (#8625) fix broadcast_elementwise_binary_cpu Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * align exception msg to torch (#8627) * align exception msg to torch * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * skip unstable global test in ci, reduce failture rate (#8635) * fuse embedding interaction (#8586) * fuse embedding interaction * fix of_tidy * refine * fix * address review Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix flip gen backward opconf (#8605) * fix flip gen backward opconf * use new opconf api Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_SNAPSHOT_LOAD_MMAP_LOCKED (#8597) * Add ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_SNAPSHOT_LOAD_MMAP_LOCKED * refine * use MAP_POPULATE Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Profiling main thread (#8601) * ThreadLocalGuard * refactor EagerBlobObjectList * op_args_reserved_size * remove useless comments Co-authored-by: binbinHan <han_binbin@163.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fully Memory Log V2 with more details (#8565) * Fully Memory Log V2 with more details * refine log and long op name * fix clang tidy * fix test Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com> * Stream policy (#8590) * ThreadLocalGuard * refactor 
signature of StreamType::InitDeviceCtx * refactor hint * add StreamPolicy * remove DeviceCtx args * refine OpCallInstructionUtil::Prepare & Compute * merge EpDeviceCtx and LazyJobDeviceCtx into StreamPolicy * minor fix * minor fix * del useless code * fix error * fix merge error * fix segment fault bug * fix complie error * del methods belong to Subclass * reslove comment Co-authored-by: binbinHan <han_binbin@163.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add fully support for broadcast matmul (#6937) * fix arange bug * fully support broadcast matmul * add more check * remove check * add fully sbp * fix full sbp * Fix broadcast matmul grad * remove old broadcast matmul grad * add broadcast grad back and when B numaxes is 2, we use broadcast_gradB instead of matmul+reduce * add lazy backward * Add restrict when transpose_a is false we can use bmatmul_grad_b * revert * fix broadcast matmul backward * fix single client dispatch matmul logic * revert old bcast matmul grad b kernel * fix eager functional matmul backward * add more test case * remove redundant code * add more special case * when b num axes is 2, we only save tensor a * fix annotation * fix conflict and format * remove single client matmul code * Fix eval error * fix conflict * fix unittest * Add init value * support matrix vector matmul * add vector matrix product * Use matmul primitive to rewrite matrix vector product forward and backward * Add fullllllllly support for vector matrix product * Fix sbp * fix bug * add unittest * Add consistent test for broadcast matmul * Remove redundant code * fix userops annotation * fix * refine * Fix clang static analysis * fix clang analysis * set check graph as false * fix * fix for unittest * fix broadcast sbp bug * try to fix unittest * Fix consistent test * fix multiplier to 4 for unittest Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Revert "skip cpu autotest for graph global" (#8608) * 
Revert "skip cpu autotest for graph global (#8593)" This reverts commit b076be782fd8f21e50ee4915f2d1562f3a9ab4c0. * cherry pick from master Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * OneEmbedding add tmp_buffer allocator (#8588) * fix embedding manager * format * refine embedding_manager tmp_buffer allocator * fix * format * refine * refine * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * refine error msg for some user ops (#8579) * refine error msg for some user ops * refine error msg for some user ops * optimize * optimize the writing * optimize the writing * optimize the writing * auto format by CI * optimize writing Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add tril fill value (#8655) add tril fill value Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix_non_pod_data_allocate_bug (#8657) Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix norm (#8629) * fix norm * add doc * add bool & * update math_functor.cpp * add note * fix_decorate_mem_leak_bug_in_eager_boxing (#8661) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add higher order derivative for leaky_relu and negative op (#8643) * add higher derivative for leakyrelu and negative * fix a typo * remove functor * add initialize alpha * fix incorrect dim size in global test * fix incorrect dim size in global test * optimize testcase Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * update oneflow intro to show the difference (#8669) * update oneflow intro * refine * refine * refine * refine * refine * refine * refine * refine * refine * refine oneflow intro * Stacked error (#8671) * ThreadLocalGuard * 
StackedError * StackedError Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * Refactor tensor initializer (#8626) * fix(*): fix xavier_initializer * refactor(Initializer): refactor initializer * fix function name * auto format by CI * refine * fix interface in tensor.py * fix(trunc_normal_): fix init bug and add test * auto format by CI * fix bug * add oneflow.nn.init.normal_ test Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix nn doc (#8650) * fix hsplit doc * add doc for module * fix dtype * fix formula * add ref * fix row length * Fix reduce max min bool dtype bug (#8651) * fix reduce_max_min_bool_dtype * fix bug * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Remove redundant exception wrapper (#8631) * remove redundant ExceptionWrapper * refine KeyErrorMessage * refine * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Refactor MemoryCase to eliminate determine statements of device_type (#7727) * ref memory_case_util * ref BlobObject::CheckMemCase * ref mem_case using * address review * address review * namespace memcase -> memory * fix conflict * address review * address static analysis * rm check * cpu device_id is always 0 * fix conflict * timeout-minutes: 50 * revert change * increase thrd limit in container * skip 2x2 TestEinsumConsistent * skip failed case of distributed test * auto format by CI * fix_non_pod_data_allocate_bug Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: tsai <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: clackhan <han_binbin@163.com> * fix some data races in c++ api and SteadyVector (#8654) * fix some data races in c++ api and SteadyVector Signed-off-by: daquexian 
<daquexian566@gmail.com> * skip self copy in MutShapeView::ToShape Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix sin/cos higher order derivative (#8648) * fix(GradGrad): fix sin/cos higher order derivative * fix(GradGrad): fix calculate error * refine autograd global test * auto format by CI * refine sin/cos grad_grad calculate * fix static analysis * merge conflict Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Ping Zhu <58718936+REYGU@users.noreply.github.com> Co-authored-by: Zhu, Ping <pingzhuu@outlook.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * refine_eager_boxing_to_adapt_ep (#8568) * refine_eager_boxing_to_adapt_ep * fix typo * refine * refine symmetric-acyclic-nd-sbp-to-nd-sbp * refine * fix error * fix static check * add NOLINT Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix repeat bug (#8645) * make result contiguous * add test case * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Instruction policy (#8583) * ThreadLocalGuard * vm::InstructionPolicy * fix compile error (#8623) * fix compile error * change MirroredObject to Dependence * Modify DependenceVector * rm include stream type * fix stream type * auto format by CI Co-authored-by: Yu OuYang <xuanjiuye@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * handle non-contiguous input (#8665) * handle non-contiguous input * refine * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * rename define CONSISTENT to GLOBAL (#8652) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refine naive interpret (#8672) * ThreadLocalGuard * refactor EagerBlobObjectList * op_args_reserved_size * remove 
useless comments * rename one::EagerBlobObjectList* to vm::EagerBlobObject* * refactor signature of InstructionsBuiler::Call * PhysicalRun * refactor InstructionsBuilder::Call * remove unused StatefulOpKernel::need_check_mem_case * remove EagerLocalTensorImpl::is_shape_synced_ * refactor SoftSync * move SmallVector from common/container_util.h to framework/instructions_builder.cpp * explicit scalar initialization Co-authored-by: clackhan <han_binbin@163.com> * Rebuild Docs V0.8.0 (#8392) * rebuild for 5 module * fix bug * fix for doctree and content in nn and * fix * fix * fix * add some * fix for oneflow.rst * update oneflow oneflow.nn * update tensor * update tensor module * update * test * update * update * fix for undone desc * docs: oneflow.utils.data (#8485) * feat(utils.data): add oneflow.utils.data * docs(dataloader): change the docstring of DataLoader * docs(tensor): add methods to oneflow.Tensor document * docs(optim): change docstring of optimizer and add a note to the doucument * nn.graph * fix for graph * fix bug * review nn and linalg document (#8515) * docs(nn): add contents to oneflow.nn document * docs(linalg): refactor oneflow.linalg document * change attributes.rst and review nn.functional.rst (#8514) * change attributes.rst and review nn.functional.rst * reconstruction oneflow.cuda * fix cuda and rebuild comm demo (#8582) * update image * add distributed * oneembedding & refine graph * update for sdisributed one_embedding * fix rnn.py (#8616) * Refactor the oneflow.nn.init documentation (#8622) docs(nn.init): refactore nn.init document * docs(nn.init): remove the comments * docs(utils.data): remove the comments * update and fix bug * docs(review): refine the documents (#8646) * docs(review): refine oneflow, nn, Tensor, nn.init, linalg, utils.data, optim modules * docs(optim): modify the code examples * docs(tensor): edit note * Refactor the oneflow.autograd documentation (#8594) * docs(autograd): refactor oneflow.autograd * docs(autograd): edit "Default gradient layouts". 
* docs(autograd): reedit "Default gradient layouts" * docs(autograd): add comment * docs(autograd): add reference * update * docs(tensor): change autoclass to autosummary * update * update * add oneflow.linalg.diagonal (#8653) * docs(linalg): add oneflow.linalg.diagonal * update enviorment variable * Update docs/source/distributed.rst Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * Update docs/source/distributed.rst Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * update enviorment variable * update for ev & distributed * update distribued * update ev * update distribute desc * Update docs/source/distributed.rst Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * update * Revise docstring descriptions (#8656) * docs: move pytorch refernce to end * docs: add some docstring * docs(refs): add refs * Update docs/source/distributed.rst * updte for distributed details and environment_variable * docs(docstring): Modify all reference links to version 1.10 (#8663) * fix bug * fix bug * fix all warning Co-authored-by: Guoliang Cheng <1876953310@qq.com> Co-authored-by: liu xuan <85344642+laoliu97@users.noreply.github.com> Co-authored-by: Guoliang Cheng <lmyybh_lazy@163.com> Co-authored-by: laoliu97 <841637247@qq.com> Co-authored-by: Yao Chi <later@usopp.net> Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * Fix zeros like and ones_like api (#8632) * fix zeros_like and ones_like bug * refine * revert * refine * fix tensor_slice_view infer physic_shape bug * add test * refine * auto format by CI * fix bug * refine * auto format by CI * fix import error * fix bug Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix sbp print bug (#8689) * Add a normal priority with no transfer but different sbp * Fix the bug for printing no boxing edge * Do not use P for weights * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * 
eager_local_interpreter_with_infer_cache (#8619) * ThreadLocalGuard * refactor EagerBlobObjectList * op_args_reserved_size * remove useless comments * rename one::EagerBlobObjectList* to vm::EagerBlobObject* * refactor signature of InstructionsBuiler::Call * PhysicalRun * refactor InstructionsBuilder::Call * remove unused StatefulOpKernel::need_check_mem_case * remove EagerLocalTensorImpl::is_shape_synced_ * eager_local_interpreter_with_infer_cache * remove useless code * reslove comments * refactor TensorMeta::TensorMeta(const TensorMeta) * use small vector * add kMaxNumDims * fix error include * fix split Symbol LocalTensorMeta error * refactor SoftSync * move SmallVector from common/container_util.h to framework/instructions_builder.cpp * mone ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE to eager.h * add blank line * reslove comments * minor fix * refine * explicit scalar initialization * fix static check error * auto format by CI * of_format * reslove comment * refine * refine * refine Co-authored-by: lixinqi <lixinqi0703106@163.com> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix gelu nn.Module bug and support tanh mode. 
(#8693) * add gelu2 api * refine test * refine docs * refine * restuct * delete useless headfile * format * rm doc of tensor.gelu (#8696) Co-authored-by: Shanshan Zhong <62104945+zhongshsh@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix bug in CrossFeatureInteraction LazyBackward (#8677) fix bug Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix floating-point scalar tensor in arange (#8673) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add nn functional fold (#8667) * add fold * update fold.py * add test * fix doc * fix comment Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * modify some file and improve the error message (#8566) * modify some file and improve the error message * modify the content * modify the content * auto format by CI * Update roi_align_op.cpp * Update roi_align_op.cpp * Update reshape_user_op_util.cpp * auto format by CI * Update roi_align_op.cpp Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * [OneEmbedding] add id_shuffle_copy_out (#8683) add id_shuffle_copy_out Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix add_param_group step key not match error (#8698) * fix add_param_group step key not match error * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add env ONEFLOW_EP_CUDA_DEVICE_FLAGS and ONEFLOW_EP_CUDA_STREAM_FLAGS (#8703) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix for docsv0.8 (#8710) * fix repeat op 0-size releated bug (both in FW and AD) (#8707) * fix repeat op 0-size releated bug (both in FW and AD) * refine * refine static check * refine * fix commnet * fix comment * refine * fix test * auto format by CI * auto 
format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support Dropout Scale in FusedMLPGrad[OneEmbedding] (#8633) * support alpha list * Remove redundant modify * remove redundant alpha set * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix bug of Tensor.type (#8697) * fix bug of tensor.type(flow.Tensor) * fix bug of tensor.type(flow.Tensor) about device * Fix tensor type doc (#8699) fix doc of tensor.type * add test for tensor.type(flow.Tensor) * move PyTensorMetaCls_CheckExact to header file Co-authored-by: Shanshan Zhong <62104945+zhongshsh@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * ONEFLOW_GRAPH_PLACE_TRAINING_STATE_ON_ALL_RANKS (#8706) * ONEFLOW_GRAPH_PLACE_TRAINING_STATE_ON_ALL_RANKS * auto format by CI Co-authored-by: liujuncheng <liujuncheng1022@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * define_mut_output_shape_and_mut_output_stride_in_infer_ctx (#8709) * define_mut_output_shape_and_mut_output_stride_in_infer_ctx * fix merge master error * fix typo Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add qat conv modules (#8368) * add qat conv modules * add quantization related modules to doc * refine qatconv modules doc * add qat conv module tests * refine * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add unsqueeze_multiple_op (#8714) * add unsqueeze_multiple_op * modify the format * Update functional_api.yaml Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * modify broadcast_like_op.cpp and add test (#8720) * modify broadcast_like_op.cpp and add test * modify broadcast_like_op.cpp * Update broadcast_like_op.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> 
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * JIT LR (#8500) * add example code * Update cosine_annealing_lr.py * enable self params transformer * enable pass ast to c++ api * enable jit backend for lr * enable jit global register and invoke * convert Global to Singleton for new merge * enable pybind11 walk on python ast * enable test all existent get_lr of oneflow in python * enable py_ast_wrapper pass ast from python to mlir * switch all ast to ast-wrapper in mlir scope * define python ast partially * partial python ast definition * trim asdl of python ast * mlir gen * add symbol table * from ast to jit done * switch llvm::errs() to mlir::emitError and convert switch to typeSwitch * trim duplicate namespace use * fix LIT header * add some docs * enable compare with or_else, if with return seamless in branch and mutable variable * trim code and refine struct * register pybind11 ast node for shared_ptr * enable cpp class in python * go through python to mlir to llvm to jit to run * add addf subf op * work well on stepLR linearLR exponentialLR coseineDecayLR cosineAnnealingLR constantLR * enable maxf minf conversion to llvm ir * rename LR_JIT to LRJITRegister * remove LR_JIT_Engine and swith Invoke to std::function ret by lookup * refine struct * enable bisect_right and python resigter api have dump option arg * add bisect_left and bisect_transformer specially, delete former test python script * remove c++17 standard * restore double hash to iterator * publish * publish * publish * use llvm classof and typeswitch rightly * trim * commit * commit * commit * commit * commit * commit * auto format by CI * Update ir.cpp * Update OneFlowLRJITRegistry.h * auto format by CI * Update AstMlirGen.h * Update lr_jit.cpp * auto format by CI * Naming conventions * auto format by CI * auto format by CI * deploy _ behind Co-authored-by: leaves-zwx <kunta0932@gmail.com> Co-authored-by: yuhao <1171760467@qq.com> Co-authored-by: oneflow-ci-bot 
<ci-bot@oneflow.org> Co-authored-by: yuhao <72971170+howin98@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add logspace (#8599) * add logspace * add global test * restore rand * fix doc * rename consistent to global * adjust import order * add todo * Add hann_window (#8615) * add hann_window * rm useless include * add check * adjust import order * add ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE (#8730) * add ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE * add environment to vm.h Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix as strided bool type and view bug (#8713) * fix as_stride bug * refine * refine * refine * delete useless head file * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add functional binary cross entropy (#8708) * add gelu2 api * refine test * refine docs * refine * restuct * delete useless headfile * format * rm doc of tensor.gelu * add functional binary cross entropy Co-authored-by: BBuf <1182563586@qq.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * support map_location in flow.load (#8666) * support map_location in flow.load Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * fix tests Signed-off-by: daquexian <daquexian566@gmail.com> * fix bug when map_location is None Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Add addcdiv (#8581) * add addcdiv * fix tensor_functions * fix inplace * add test number * rename consistent to global * Inner most dim case for cumsum cumprod op (#8403) * cumsum use cub scansum in some case * prod use cub scan * refine name * refine * optimize cum op * format * fix * get device properties by cuda stream class * 
revert useless code * refine * outer dim use parallel sweep algo * refine * fix a fraction of threads hit __syncthreads * revert * refine kernel define * refine * refine * refine * refine * move comment * fix * fix * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Define mut output dtype and mut output is dynamic in infer ctx (#8716) * define_mut_output_shape_and_mut_output_stride_in_infer_ctx * fix merge master error * fix typo * define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx * replce const DataType& with DataType * replace const DataType& with DataType ret * split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex * refine * minor fix * refine * fix static check error * Update op_expr.cpp * Update op_expr.cpp * Update stateful_opkernel.cpp * refine * fix static check error * refine * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Dev refactor fuse instruction policy (#8624) * ThreadLocalGuard * vm::InstructionPolicy * refactor fuse instruction policy * fix compile error (#8623) * fix compile error * change MirroredObject to Dependence * Modify DependenceVector * add instruction policy util * add instruction policy util * remove include * add include * rm fuse instruction type * Modifying variable properties * add stream_sequential_dependence_ to instruction_policy Co-authored-by: lixinqi <lixinqi0703106@163.com> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix bug of batchnorm num_batches_tracked global error when loading state_dict (#8723) add condition for assign num_batches_tracked Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add launch master port limit (#8563) * add launch master port limit * Update python/oneflow/distributed/launch.py Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: daquexian <daquexian566@gmail.com> 
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix docs import distance (#8691) * fix import distance * add functional apis * add smooth_l1_loss docs * refine activation.py * add deleted api * review * Add APIs missing from the docs of oneflow, nn, and other modules (#8704) * docs: add api * docs(nn): refactor nn * review Co-authored-by: Guoliang Cheng <lmyybh_lazy@163.com> Co-authored-by: ChenQiaoling <48576019+Chenqll@users.noreply.github.com> * refactor control stream type (#8647) * refactor control stream type * auto format by CI * Add method implementation * refine * refien Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Define mut output tensor desc (#8717) * define_mut_output_shape_and_mut_output_stride_in_infer_ctx * fix merge master error * fix typo * define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx * define_mut_output_dtype_and_mut_output_tensor_desc * replce const DataType& with DataType * replace const DataType& with DataType ret * split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex * refine * minor fix * fix merge error * fix warning error * refine * fix static check error * Update op_expr.cpp * Update op_expr.cpp * Update stateful_opkernel.cpp * refine * fix static check error * refine * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Symbolic local tensor meta (#8662) * ThreadLocalGuard * refactor EagerBlobObjectList * op_args_reserved_size * remove useless comments * rename one::EagerBlobObjectList* to vm::EagerBlobObject* * refactor signature of InstructionsBuiler::Call * PhysicalRun * refactor InstructionsBuilder::Call * remove unused StatefulOpKernel::need_check_mem_case * remove EagerLocalTensorImpl::is_shape_synced_ * eager_local_interpreter_with_infer_cache * remove useless code * reslove comments * refactor TensorMeta::TensorMeta(const TensorMeta) * 
use small vector * Symbolic LocalTensorMeta * check shape in critical_sectio * add kMaxNumDims * fix error include * fix split Symbol LocalTensorMeta error * fix split cache and symbolic local tensor meta error * refactor SoftSync * move SmallVector from common/container_util.h to framework/instructions_builder.cpp * mone ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE to eager.h * add blank line * reslove comments * minor fix * refine * explicit scalar initialization * fix static check error * auto format by CI * of_format * reslove comment * refine * refine * refine * fix error * define MutOutputShape and MutOutputStride in InferContext * define_mut_output_shape_and_mut_output_stride_in_infer_ctx * fix merge master error * fix typo * fix static check error * define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx * define_mut_output_dtype_and_mut_output_tensor_desc * replce const DataType& with DataType * split const and mut func in LocalTensorMeta * replace const DataType& with DataType ret * split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex * refine * minor fix * fix merge error * fix warning error * refine * fix static check error * Update op_expr.cpp * Update op_expr.cpp * split MutTensorMeta and MutLocalTensorMeta * Update stateful_opkernel.cpp * refine * fix static check error * refine * refine * reslove comment * refine * fix typo Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * fxi typo * use OpArgsVector Co-authored-by: lixinqi <lixinqi0703106@163.com> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * Feat general basic communication (#8437) * Add a slight cost for B->S and B->P in 2d sbp * Add penalty for P in consumer * Fix a slight bug * Add at most 1 middle node for general basic communication * Add the cost for general basic communication 
* Add the slight penalty for eager * Skip initialization of boxing collector if not needed * Fix a bug * Dev nd nccl send recv boxing (#8467) * nd nccl_send_recv_boxing * rm print * support num_axes > 2 * Add distributed optional run (#8372) * Add * change deps * add install * add skip * autoprof supports bandwidth (#8367) * autoprof supports bandwidth Signed-off-by: daquexian <daquexian566@gmail.com> * print bandwidth Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * remove tmp buffer of cumprod cpu backward kernel (#8369) * remove tmp buffer of cumprod cpu backward kernel * refine * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Move tensor api to cpython part3 (#8342) * add tensor_functions * concat py methods * add hash, restore tensor.py * check replacement * refine code, remove commented tensor.py * refine code * move some api * add cpu and cuda api * add triu tril norm and etc. * remove tensor_functions.h * move more api * move more api, refine size * fix typo * format code, remove useless include * refine code * refine code, fix typo * align .cuda to python * refine code * split some api to part3 for review * remove positional only arguments of argmax and argmin * remove arguments parse * modify arguments name in matmul and floor_divide * rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions * refine code, format code * add inplace /=, add comments * remove name in macros * remove python api * remove redundant include * remove cout * format code * refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_ * remove redundant code * auto format by CI * fix typo, fix wrong call * modify idx datatype from int32 to int64 in tensor.size * add some DIRECT_PASS_FUNC * add cpu cuda var pow and etc. 
* add masked_fill any all * make REDUCE_FUNC macro, add reduce_* functions * add 0dim check in ReduceSumWhole, refine yaml * fix bug * restore add add_ sub sub_ * add unittest for tensor.half tensor.add tensor.add_ * refine code * refine code * fix typo * fix bug of tensor.std() * refactor var std and cuda, using c++ functional api * add beta and threshold in softplus * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add nn_functor Check (#7910) * add bias_add_check * add bias_add error test * fix conv2d nhwc bias_add error * add nhwc conv test * add bias_add_error test * Add bias add error check * Rename * add batch matmul error check * add matmul check error msg * remove annotation * add fused mlp error msg check * Add pixel shuffle check test * add more test until normalization add relu functor * refine error message * finish all nnfunctor check msg * handle type error * remove useless symbol * modify back to TypeError * fix all comment * Remove redundant code * Remove pad ndim check * fix bias add space * fix check logic cause ci gpu not always gpu:0 Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222) * previous version for fused_matmul_bias_add_relu_dropout * add op infer * fix detail * finish forward * support dropout rate list * add forward test * fix bug for output buffer * Configurable alpha params * try to add bit mask logic * Add bitmask first version! 
* Add row col bitmask logic * support not align4 reludropout * simplify relu dropout ld logic * Add naive relu dropout grad kernel * add simple relu dropout grad kernel * Rename * support relu_dropout bitmask backward * add vectorized optimization * fix tmp buffer * add to amp list * add lazy backward logic * Refine kernel * add indextype dispatch * simplify functor logic * fix cublas fused mlp aux_ld shape bug * Add more relu dropout kernel * add full unittest * fix bug in skip final activation * refine * Remove dump func * fix format * Remove cmake * remove redundant divide * add padded version * fix dropout * oneflow curand * refine * remove redundant kernel * add unroll logic * add unroll and ballot sync * refine format * Remove fast curand * Refine python interface * Add if branch for memset * fix python logic * just for debug * not use matmul bias add grad * add launch 1 block limit * fix unittest * Refine * fix graph backward bug * limit to 11060 * change to use int32_t dtype for cublas aux * Fix jc comment * fix comment * fix convert * fix static_analysis * fix at * fix userops td * fix userops td * fix const ref * fix compile error for bfloat16 * limit to 11060 * fix bug Co-authored-by: Juncheng <liujuncheng1022@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix gather 0-dim tensor bug (#8376) * fix 0-dim tensor bug * refine * support input 0-dim tensor for gather * refine * refine * refine dim_scatter_kernel check * refine * refine check * fix clang_tidy error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add api to apply external job pass (#8370) * Add condition to find-test-cache-distributed (#8387) * add condition to find-test-cache-distributed * fix * warp dim util (#8382) * warp dim util * format * use more maybe_wrap_dim * refine array functor * add more * refine math_functor * fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379) * 
fix_bug_in_broadcast_min_max_grad_and_broadcast_like * refine * fix static check error * fix bug about index (#8388) * fix bug about index * add test case Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * LogicalSliceAssign support full slice sbp (#8344) * feat(SliceOp): slice ops support 2d sbp * fix(SliceOp): fix [B, P] 2d sbp bug * refine error message * fix bug in parallel_num == 1 * add comment * add warning and format * add NOLINT for boxing check * feat(LogicalSliceOps): support all nd_sbp * feat(LogicalSlice): support nd_sbp * add error message * fix(AutoTest): fix auto_test bug in module.parameter pass * auto format by CI * fix(LogicalSliceAssign): skip test when 1n1d * fix SliceParams memset error * remove memset * add CHECK_JUST * fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT * remove memset * fix spilit_info.axis bug * feat(LogicalSliceOps): support grad * add logical_slice gradient_funcs * feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp * auto format by CI * test(LogicalSlice): fix logical_slice dims Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * fix_tensor_from_numpy_mem_leak_bug (#8391) * fix_tensor_from_numpy_mem_leak_bug * add note * refine note * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393) * make of_pyext_obj static only * refine note Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Adjust tolerance setting in embedding_renorm unit test (#8394) * support front end compile for job to iree (#8249) * support frontend dev version * polish name * add tosa-to-elf.mlir * tosa to elf by llvm * conv2d partial * an enhanced frontend runner * support numpy as input * enable multiple 
using nn graph with different input(jobname make it it cd /home/yuhao/frontend/oneflow ; /usr/bin/env /usr/bin/python3 /home/yuhao/.vscode-server/extensions/ms-python.python-2022.6.2/pythonFiles/lib/python/debugpy/launcher 40873 -- /home/yuhao/frontend/oneflow/oneflow/ir/test/Frontend/runner.py ) * enable multiple input * enable cpu and cuda * change full_name to _full_name * support exchange cuda with cpu seamlessly * remove pip * lit config * polish * trim * auto format by CI * modify * auto format by CI * last line polish * use unittest * auto format by CI * use allclose * auto format by CI * pulish * optimize convert oneflow to tosa * conv2d * conv2d enhanced && conv2d examples add * add road map * add add_n2Op and boardcast_addOp conversion * add matmulOp conversion * support converting normailzation op to tosa(partically) * update roadmap * support i64 tensor to dense elem attr * support 100% resnet op conversion * add test mlir * add test iree resnet python script * auto format by CI * done * enhance iree resnet test script * auto format by CI * rebuild code * auto format by CI * rebuild test script * update * auto format by CI * pub * trim test scripts * move * move * input and output add block arg judgement * emit error in variable conversion * error handle for ci * modify err info * auto format by CI * merge * auto format by CI * output not block * flow ones * rm const * trim maybe * trim maybe with header file * const auto * solve clangd error Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat/zero mix with mp (#8036) * add zero limit * add debug * add mix zero test * refactor zero api * zero test with mp * add 2d test * add zero nd * add nd zero * add sbp cast * test passed soft limit consumer * refine size api * zero use stage 2 * add limit consumer api * add new api * refine zero s select * fix index out of range * rm zero limit on device type * zero test with 
activation checkpointing * add indentity when dp sequence len is 1 * move to base with master * fix * fix * fix * add test * debug bad case * refine test for eager and graph boxing * test case ready * simplify * refine test * fix buff size * fix conflict * refine zero nd * refine * add full test * revert change * refine split check * fix typo * rm log * spit long func * restore test * Update optimizer_placement_optimization_pass.cpp * auto format by CI * auto format by CI * fix static check * add tips for zero api change * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Revert embedding normal path and fix amp list (#8374) * revert embedding normal path, fix amp list * fix amp * fix memset bug in gather cpu kernel Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * replace fixed_vector with small_vector and make Shape inherit from it (#8365) * Replace fixed_vector with llvm::SmallVector Signed-off-by: daquexian <daquexian566@gmail.com> * Shape inherited from llvm::SmallVector Signed-off-by: daquexian <daquexian566@gmail.com> * refine cmake Signed-off-by: daquexian <daquexian566@gmail.com> * rename fixed_vector to small_vector Signed-off-by: daquexian <daquexian566@gmail.com> * fix reviews Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * update Shape constructor Signed-off-by: daquexian <daquexian566@gmail.com> * add 'PUBLIC' keyword to all target_link_libraries Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * update cmake Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * update cmake Signed-off-by: daquexian <daquexian566@gmail.com> * update cmake Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * set is_initialized_ default to true Signed-off-by: daquexian <daquexian566@gmail.com> * override some methods to set is_initialized_ Signed-off-by: 
daquexian <daquexian566@gmail.com> * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Light plan for debug (#8396) * Light plan for debug * fix note * disable terminfo to fix missing terminfo symbols (#8400) * disable terminfo to fix missing terminfo symbols Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix bug of ZeRO MP in complex case (#8404) * Remove redundant output_lbns in ir (#8409) * mv case * remove redundant info * Dev FusedCrossInteraction[OneEmbedding] (#8335) * add simple fused cross interaction forward * add packed fused * Add cross interaction grad * simplify code * fix bug * support crossnet v2 * support cross interaction v2 * add lazy backward * Rename and add test * fix jc comment * fix comment * fix bug * fix userops td elem_cnt for FUSED Group * fix header file * fix clang static analysis * fix unittest Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add exe graph physical shape check msg (#8002) * fix index select op in graph * add exe graph physical shape check msg * improve the debug information for the python stack trace 1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace 2. refactor other debug related classes. * remove parens * update * resolve PR comments * update * update graph debug test file. 
* restore self._debug in class Graph and class ModuleBlock * Do not shorten the stack frame string if it is in debug mode * delete TODOs * disable conv3d test (#7969) Signed-off-by: daquexian <daquexian566@gmail.com> * skip layernorm random_data_warp test (#7941) * skip layernorm random_data_warp test * warp/block/uncached case only test gpu Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Lock click version (#7967) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add global avgpool unittest (#7585) * fix (#7978) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support negative dim in scatter op (#7934) * support negative dim in scatter op * refine scatter test * refine scatter test again Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702) * run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand * lock gil in vm Callback thread * more comments for VirtualMachineEngine::Callback() * the Env is never destroyed. * export Env into python * more unittests * wait shared_ptr.use_count() == 0 * export unittest.TestCase in framework/unittest.py * SwitchToShuttingDownPhase * optional is_normal_exit * VirtualMachine::CloseVMThreads * Delete env_api.h env_api.h is deleted by master * reshape_only_one_dim_infered * address pr comments * fix a ref-cnt bug in TryRunBarrierInstruction. 
* rollback flow.env.all_device_placement * no distributed running test_shutting_down.py * auto format by CI * expand lifetime of module oneflow in test_shutting_down.py * refine del depend on of * capture oneflow._oneflow_internal.eager when calling sync in __del__ * add try in flaky test Co-authored-by: Luyang <flowingsun007@163.com> Co-authored-by: chengtbf <472491134@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com> * Fix one hot scalar tensor bug (#7975) * fix reduce_sum scalar check bug * fix one_hot scalar tensor bug * fix clang tidy error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * support ctor np array from of tensor (#7970) * support ctor np array from of tensor * add test case constructing np array from tensor * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add_manual_seed_all_api (#7957) * add_manual_seed_all_api * Update conf.py * refine * add test case * auto format by CI * Update random_generator.cpp * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * one_embedding add doc string (#7902) * add doc string * add example * add * fix doc * refine * address review * mb to MB * add make_table_option * option to options * refine * add forward Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support numpy scalar parameters (#7935) * feat(functional): support numpy scalar parameters * rename inferface * feat(*): TensorIndex support numpy scalar * feat(TensorIndex): support advance indexing * add unittest and int32 support for branch feat-param_support_np_scalar (#7939) * add unittest * refactor unittest * add todo for int16 advanced indexing * add int32 supporting for advance indexing * auto format by CI Co-authored-by: 
Wang Yi <53533850+marigoold@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * fix tensor_scatter_nd_update (#7953) * fix tensor_scatter_nd_update * auto backward Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix one_embedding adam (#7974) * fix one_embedding adam * fix tidy * fix normal Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * speed test with score (#7990) Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat/graph del by ref (#7857) * remove IsMultiClient() and single client logic Signed-off-by: daquexian <daquexian566@gmail.com> * rename eager.multi_client to eager Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * add py ref * refine new session * clean code * make scope api inner use * use session with ref cnt * run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand * test pass * lock gil in vm Callback thread * more comments for VirtualMachineEngine::Callback() * merge * merge rm single client * rm initenv * merge and fix master * refactor env c api * add debug code * fix and serving test pass * test passed * rm useless * rm useless code * format * rm useless include * rm sync in py * the Env is never destroyed. 
* export Env into python * more unittests * fix and pass tests * revert virtual_machine.cpp * revert core/vm * remove outdated python class oneflow.unittest.TestCase * graph test passed * wait shared_ptr.use_count() == 0 * export unittest.TestCase in framework/unittest.py * SwitchToShuttingDownPhase * optional is_normal_exit * VirtualMachine::CloseVMThreads * Delete env_api.h env_api.h is deleted by master * address pr comments * rm is env init * Clear empty thread when graph destroy (#7633) * Revert "Clear empty thread when graph destroy (#7633)" (#7860) This reverts commit 3e8585e5fa20b97229d6b0be46a7ff814dc8cd83. * fix a ref-cnt bug in TryRunBarrierInstruction. * rm env_api * fix clang-tidy error * fix clang-tidy in env_imp * refine env api * format * refine graph del and sync at shuttingdown * fix typo * add comment * rm useless * rm useless Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: lixinqi <lixinqi0703106@163.com> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Luyang <flowingsun007@163.com> Co-authored-by: cheng cheng <472491134@qq.com> * [PersistentTable] Fix num blocks (#7986) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add auto benchmark for flowvision (#7806) * update yml * update workflow * add resnet50 * [PersistentTable] Async write (#7946) * [PersistentTable] Async write * fix Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * save log in separate dir by default (#7825) Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix index select op in graph * add exe graph physical shape check msg * improve the debug inform…

* edit tanh to a closure op (#5) Co-authored-by: yoonlee888 <qiuyunlei@zhejianglab.com> * Dev sin loop grad (#7) * edit tanh to a closure op * add grad-looped sin_cos_negative * add test case Co-authored-by: yoonlee888 <qiuyunlei@zhejianglab.com> Co-authored-by: Zhenhua <1209435+hengzi@users.noreply.github.com> * add log_grad_grad (#12) * Add exp_grad_grad (#11) * Revert "Dev sin loop grad (#7)" (#13) This reverts commit c256a5a326d7e04c2ad4af802318661d18f72441. * fix bugs (#16) * fix ScalarSub param * Add test case * code format * fix * add higher order derivative Interface draft (#6) * add higher order derivative Interface draft * solve bugs of no Tensor.is_sparse attrs * rm some Interface comments * fix & format Co-authored-by: Zhenhua <1209435+hengzi@users.noreply.github.com> Co-authored-by: Huang Zhenhua <huangzhenhua@zhejianglab.com> * add Higher derivative vjp (#9) * add Higher derivative vjp * add autotest code * add autograd.functional.vhp and motified functional * Merge Testcase * Rm chinese chars Co-authored-by: Zhenhua <1209435+hengzi@users.noreply.github.com> Co-authored-by: Huang Zhenhua <huangzhenhua@zhejianglab.com> * merge Master into zj/develop (#21) * Multi Tensor apply Optimizer (#8373) * Add optim_cast and modify sgd * Remove * try to add fuseUpdatecast pass logic * use pass * still have bug in inplace * ban inplace and fix sgd update * fix regst num * add env var * remove cuda graph wrong use * add support for graph * initialize * add functional impl * add simple job rewrite * delete redundant sgd update kernel * support half * add kernel * use single loop kernel * refine * when in eval mode, we turn off multi tensor update * refine format * use juncheng kernel * Refine * group multi tensor op by some attr * add parallel conf to key * refine * Add unroll logic * fix bug * restruct * use pointer list * add adam kernel * support multi tensor adam update * Remove cpu * support skip if and scale by tensor * support sgd adam unittest * add more 
check * Remove config * Restruct tensorparams * support fused cast in multi tensor update * support cast in multi tensor * fix bug in model update cast pass * fix multi tensor sgd update with cast Pass check logic * refine * support multi tensor adam update with cast * refine format * Remove redundant template args * merge modify for fused cast * only allow fused cast in train mode * only support data parallel in multi tensor update * rewrite fuse update cast pass logic * remove redundant if * fix format * add new line * rename * Remove print * rename and add LOG * Add more type and test * still have bug in multi tensor adam * Fix multi tensor adam update bug * add multi tensor adam update with cast test * simplify code * fix format * Add model diff datatype in optimizer key * remove random seed * fix comment * fix comment * fix to use model copy * use for loop * Fix comment * use hashcombine * fix clang analysis error * add with cuda macro * fix env var in unittest * remove redundant unittest Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix doc and ops template auto gen (#8546) * fix doc and add op calculator * fix bug * fix gen_ops * fix diag 0size tensr shape infer bug (#8557) * fix diag 0size tensr shape infer bug * refine * refine * auto format by CI * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Format tensor on cpu (#8548) * Format tensor on cpu * use tensor.detach * Remove useless WITH_CUDAs (#8562) * unique identity (#8509) * unique identity * fix * add identit name * rm debug log * mv identity form class to graph * auto format by CI * fix unique iden with having multiple stage * auto format by CI * Update block.py Co-authored-by: cheng cheng <472491134@qq.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add GenericStreamContext 
(#8560) * Modify some file and add test (#8556) * Modify some file and add test * modify the content * modify the format and test function name * modify the format and aligned with pytorch * delete print * modity the function name * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Move some op into amp gray list (#8545) enlarge gray list Co-authored-by: cheng cheng <472491134@qq.com> * Refine inplace expand runtime_error (#8561) * Refine inplace expand runtime_error * Opt * Refine * Add Note * OneEmbedding use malloc async (#8543) * in out ptrs * ops and test * test pass * prefetch tmp buffer * embedding shuffle tmp buffer * gradient shuffle * tmp buffer size * mem pool * cuda 11.2 * add id_shuffle to setNumunique in update tests * default not use dynamic alloc * fix of_tidy * add fused op * address review * init tmp_buffer * mv memset * fix * one_embedding fused_lookup_init_cast and fused_update_put (#8564) * add fused op * mv memset * fix * address review * rm fullcache n_missing check Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix cpu aligned_alloc size (#8569) Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add flow norm (#8535) * add flow norm * rm import * rm doctest.testmod * fix pad_packed_sequence method input requires_grad==True (#8574) * fix pad_packed_sequence method input requires_grad==True * fix append error when batch_first=True Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix embedding manager tmp buffer (#8585) * fix embedding manager * format * fix reduce_ops 0size bug (#8551) * fix reduce_ops 0size bug * fix commnet * auto format by CI * fix bug Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Align Momentum 
Optimizer (#8549) * fix moemntum update * align momentum * fix bug and finish eager unittest * Support Graph optimizer * fix momentum bug * refine beta Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fill GetSbp bug and consistent test bug (#8576) fix(FillOp): fill GetSbp bug and consistent test bug Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Dev Fully fused MLP Grad[OneEmbedding] (#8462) * support fully fused mlp grad in eager * support lazy backward * fix output size * add fallback to tmp_buf logic when ones buffer is not enough * build sbp * overlap allreduce * fix overlap order * fix format * CUDA Graphs delayed capture * Add ifcomm create for graph * insert weight event roughly * fix dbias allreduce error * simplify code * Add 11060 limit * Remove print * Rename * fix fill bug and remove comm to cache * Rename variable and add debug code for cache * Use kernel state and fix bug * remove print * fix allreduce dbias bug * fix header file * fix comment * remove redundant headerfile * fix userops build error * refine * init nccl comm before execute kernel * fix comment Co-authored-by: liujuncheng <liujuncheng1022@gmail.com> * rename mirrored to local (#8503) * rename mirrored to local * rename files * rename files * auto format by CI * revert change of package_mirror.py * rename LocalObject to Dependence * rename fn LocalObject to Dependence * merge master * handle clang check * fix * refine * rename local_object to dependence Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Implement BroadcastElementwiseUnary primitive (#8384) * Add code skeleton for broadcast unary primitive * first try * finish impl * finish impl * format * fix build error * address review * refine * address review comments * use broadcast unary primitive in fill_tensor_ kernel * handle pack tail statically * fix * address review * address review * Fix SimplifyBroadcastDims * fix * revert fill_kernel Co-authored-by: 
Juncheng <liujuncheng1022@gmail.com> * skip cpu autotest for graph global (#8593) * TODO * skip cpu autotest for graph global * Refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add function_library.h Exception (#8241) * add RuntimeError for checking * add RuntimeError to CHECK_EQ * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Refactor shrink (#8573) * caching allocator * auto format by CI * Update ep_device_context.h * EpDeviceCtx with CachingAllocator * rm RawAllocator typename * auto format by CI * specific allo in EpDeviceCtx * auto format by CI * rm outdated alloc * simplify thread safe guard * auto format by CI * avoid return mutex * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Speed up SliceKernel (#8589) * perf(SliceKernel): descrease number of cuda kernel and speed up * perf(SliceKernel): use old kernel when small tensor is all fullslice * use std::copy to copy contiguous memory * fix cpu kernel bug * Update readme and vsn for 0.8.0 (#8600) * update version * remove py3.6 * modify some file and improve error message (#8592) * modify some file and improve error message * modify scalar_by_tensor_op.cpp * Update scalar_by_tensor_op.cpp * Update slice_op.cpp * Update test_slice_op.py * Update test_slice_op.py * auto format by CI * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * rename consistent to global (#8505) * rename consistent to global * rename consistent to global * rename files * rename files * refine * auto format by CI * refine * fix clang check * fix * fix * fix * rm to_consistent docs * auto format by CI * refine * fix * fix * revert changes * auto format by CI * revert changes * revert changes * rename * rename Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * add module 
releated container docs (#8580) * add module releated container docs * auto format by CI * fix comment * refine * refine Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix rnn util extra memory usage when requires_grad=False (#8603) * fix rnn util extra memory usage when requires_grad=False * add comments * refine comments Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * use bracket format slice in tensor str (#8489) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Perf TensorInfo constructor (#8606) * perf(Autograd): perf TensorInfo constructor * rename consistent to global Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * print operators' python location when print nn_graph (#8558) 1. add a flag in nn.Graph.debug() named print_op_loc for printing operator location. 2. add a flag in nn.Graph.debug() named only_print_user_code_loc for only print users' code location * Add randint like (#8598) * add randnint_like op * add docs for random * refine * auto format by CI * add randint_like global test * refine doc * refine randint_like docs * fix bug Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add full_like api (#8595) * add full_like_op api * refine * add test * refine * refine docs * refine * add consistent_full test * add full_like op * fix docs commnet * change scalar sbp return value from list to tuple * auto format by CI * merge conflict * revert Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix cumsum GenBackwardOpConfFn (#8604) * fix cumsum GenBackwardOpConfFn * add test case Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * revert change (#8613) * fix test graph optimization conf CI bug 
(#8617) * restore resource config after random tests * refine * refine * Release pod tensor (#8552) * ThreadLocalGuard * split ReleaseTensor into ReleasePodTensor and ReleaseNonPodTensor. * rename Co-authored-by: luyang <flowingsun007@163.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add param group for optimizer (#8611) * add add_param_group interface for Optimize * add test for add_param_group * revert * fix comment * refine * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix broadcast_elementwise_binary cpu (#8625) fix broadcast_elementwise_binary_cpu Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * align exception msg to torch (#8627) * align exception msg to torch * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * skip unstable global test in ci, reduce failture rate (#8635) * fuse embedding interaction (#8586) * fuse embedding interaction * fix of_tidy * refine * fix * address review Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix flip gen backward opconf (#8605) * fix flip gen backward opconf * use new opconf api Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_SNAPSHOT_LOAD_MMAP_LOCKED (#8597) * Add ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_SNAPSHOT_LOAD_MMAP_LOCKED * refine * use MAP_POPULATE Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Profiling main thread (#8601) * ThreadLocalGuard * refactor EagerBlobObjectList * op_args_reserved_size * remove useless comments Co-authored-by: binbinHan <han_binbin@163.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fully Memory Log V2 with more details (#8565) * Fully Memory Log V2 with more details * refine log and long op name * fix clang tidy * 
fix test Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com> * Stream policy (#8590) * ThreadLocalGuard * refactor signature of StreamType::InitDeviceCtx * refactor hint * add StreamPolicy * remove DeviceCtx args * refine OpCallInstructionUtil::Prepare & Compute * merge EpDeviceCtx and LazyJobDeviceCtx into StreamPolicy * minor fix * minor fix * del useless code * fix error * fix merge error * fix segment fault bug * fix complie error * del methods belong to Subclass * reslove comment Co-authored-by: binbinHan <han_binbin@163.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add fully support for broadcast matmul (#6937) * fix arange bug * fully support broadcast matmul * add more check * remove check * add fully sbp * fix full sbp * Fix broadcast matmul grad * remove old broadcast matmul grad * add broadcast grad back and when B numaxes is 2, we use broadcast_gradB instead of matmul+reduce * add lazy backward * Add restrict when transpose_a is false we can use bmatmul_grad_b * revert * fix broadcast matmul backward * fix single client dispatch matmul logic * revert old bcast matmul grad b kernel * fix eager functional matmul backward * add more test case * remove redundant code * add more special case * when b num axes is 2, we only save tensor a * fix annotation * fix conflict and format * remove single client matmul code * Fix eval error * fix conflict * fix unittest * Add init value * support matrix vector matmul * add vector matrix product * Use matmul primitive to rewrite matrix vector product forward and backward * Add fullllllllly support for vector matrix product * Fix sbp * fix bug * add unittest * Add consistent test for broadcast matmul * Remove redundant code * fix userops annotation * fix * refine * Fix clang static analysis * fix clang analysis * set check graph as false * fix * fix for unittest * fix broadcast sbp bug * try to fix unittest * 
Fix consistent test * fix multiplier to 4 for unittest Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Revert "skip cpu autotest for graph global" (#8608) * Revert "skip cpu autotest for graph global (#8593)" This reverts commit b076be782fd8f21e50ee4915f2d1562f3a9ab4c0. * cherry pick from master Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * OneEmbedding add tmp_buffer allocator (#8588) * fix embedding manager * format * refine embedding_manager tmp_buffer allocator * fix * format * refine * refine * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * refine error msg for some user ops (#8579) * refine error msg for some user ops * refine error msg for some user ops * optimize * optimize the writing * optimize the writing * optimize the writing * auto format by CI * optimize writing Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add tril fill value (#8655) add tril fill value Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix_non_pod_data_allocate_bug (#8657) Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix norm (#8629) * fix norm * add doc * add bool & * update math_functor.cpp * add note * fix_decorate_mem_leak_bug_in_eager_boxing (#8661) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add higher order derivative for leaky_relu and negative op (#8643) * add higher derivative for leakyrelu and negative * fix a typo * remove functor * add initialize alpha * fix incorrect dim size in global test * fix incorrect dim size in global test * optimize testcase Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * update oneflow intro to show the difference 
(#8669) * update oneflow intro * refine * refine * refine * refine * refine * refine * refine * refine * refine * refine oneflow intro * Stacked error (#8671) * ThreadLocalGuard * StackedError * StackedError Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com> * Refactor tensor initializer (#8626) * fix(*): fix xavier_initializer * refactor(Initializer): refactor initializer * fix function name * auto format by CI * refine * fix interface in tensor.py * fix(trunc_normal_): fix init bug and add test * auto format by CI * fix bug * add oneflow.nn.init.normal_ test Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix nn doc (#8650) * fix hsplit doc * add doc for module * fix dtype * fix formula * add ref * fix row length * Fix reduce max min bool dtype bug (#8651) * fix reduce_max_min_bool_dtype * fix bug * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Remove redundant exception wrapper (#8631) * remove redundant ExceptionWrapper * refine KeyErrorMessage * refine * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Refactor MemoryCase to eliminate determine statements of device_type (#7727) * ref memory_case_util * ref BlobObject::CheckMemCase * ref mem_case using * address review * address review * namespace memcase -> memory * fix conflict * address review * address static analysis * rm check * cpu device_id is always 0 * fix conflict * timeout-minutes: 50 * revert change * increase thrd limit in container * skip 2x2 TestEinsumConsistent * skip failed case of distributed test * auto format by CI * fix_non_pod_data_allocate_bug Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: tsai <jackalcooper@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> 
Co-authored-by: clackhan <han_binbin@163.com> * fix some data races in c++ api and SteadyVector (#8654) * fix some data races in c++ api and SteadyVector Signed-off-by: daquexian <daquexian566@gmail.com> * skip self copy in MutShapeView::ToShape Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Fix sin/cos higher order derivative (#8648) * fix(GradGrad): fix sin/cos higher order derivative * fix(GradGrad): fix calculate error * refine autograd global test * auto format by CI * refine sin/cos grad_grad calculate * fix static analysis * merge conflict Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Ping Zhu <58718936+REYGU@users.noreply.github.com> Co-authored-by: Zhu, Ping <pingzhuu@outlook.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * refine_eager_boxing_to_adapt_ep (#8568) * refine_eager_boxing_to_adapt_ep * fix typo * refine * refine symmetric-acyclic-nd-sbp-to-nd-sbp * refine * fix error * fix static check * add NOLINT Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix repeat bug (#8645) * make result contiguous * add test case * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Instruction policy (#8583) * ThreadLocalGuard * vm::InstructionPolicy * fix compile error (#8623) * fix compile error * change MirroredObject to Dependence * Modify DependenceVector * rm include stream type * fix stream type * auto format by CI Co-authored-by: Yu OuYang <xuanjiuye@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * handle non-contiguous input (#8665) * handle non-contiguous input * refine * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * rename define CONSISTENT to GLOBAL (#8652) Co-authored-by: 
mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Refine naive interpret (#8672) * ThreadLocalGuard * refactor EagerBlobObjectList * op_args_reserved_size * remove useless comments * rename one::EagerBlobObjectList* to vm::EagerBlobObject* * refactor signature of InstructionsBuiler::Call * PhysicalRun * refactor InstructionsBuilder::Call * remove unused StatefulOpKernel::need_check_mem_case * remove EagerLocalTensorImpl::is_shape_synced_ * refactor SoftSync * move SmallVector from common/container_util.h to framework/instructions_builder.cpp * explicit scalar initialization Co-authored-by: clackhan <han_binbin@163.com> * Rebuild Docs V0.8.0 (#8392) * rebuild for 5 module * fix bug * fix for doctree and content in nn and * fix * fix * fix * add some * fix for oneflow.rst * update oneflow oneflow.nn * update tensor * update tensor module * update * test * update * update * fix for undone desc * docs: oneflow.utils.data (#8485) * feat(utils.data): add oneflow.utils.data * docs(dataloader): change the docstring of DataLoader * docs(tensor): add methods to oneflow.Tensor document * docs(optim): change docstring of optimizer and add a note to the doucument * nn.graph * fix for graph * fix bug * review nn and linalg document (#8515) * docs(nn): add contents to oneflow.nn document * docs(linalg): refactor oneflow.linalg document * change attributes.rst and review nn.functional.rst (#8514) * change attributes.rst and review nn.functional.rst * reconstruction oneflow.cuda * fix cuda and rebuild comm demo (#8582) * update image * add distributed * oneembedding & refine graph * update for sdisributed one_embedding * fix rnn.py (#8616) * Refactor oneflow.nn.init docs (#8622) docs(nn.init): refactore nn.init document * docs(nn.init): remove the comments * docs(utils.data): remove the comments * update and fix bug * docs(review): refine the documents (#8646) * docs(review): refine oneflow, nn, Tensor, nn.init, linalg, utils.data, optim modules * docs(optim): modify the code 
examples * docs(tensor): edit note * Refactor oneflow.autograd docs (#8594) * docs(autograd): refactor oneflow.autograd * docs(autograd): edit "Default gradient layouts". * docs(autograd): reedit "Default gradient layouts" * docs(autograd): add comment * docs(autograd): add reference * update * docs(tensor): change autoclass to autosummary * update * update * add oneflow.linalg.diagonal (#8653) * docs(linalg): add oneflow.linalg.diagonal * update enviorment variable * Update docs/source/distributed.rst Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * Update docs/source/distributed.rst Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * update enviorment variable * update for ev & distributed * update distribued * update ev * update distribute desc * Update docs/source/distributed.rst Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * update * Revise docstring descriptions (#8656) * docs: move pytorch refernce to end * docs: add some docstring * docs(refs): add refs * Update docs/source/distributed.rst * updte for distributed details and environment_variable * docs(docstring): Modify all reference links to version 1.10 (#8663) * fix bug * fix bug * fix all warning Co-authored-by: Guoliang Cheng <1876953310@qq.com> Co-authored-by: liu xuan <85344642+laoliu97@users.noreply.github.com> Co-authored-by: Guoliang Cheng <lmyybh_lazy@163.com> Co-authored-by: laoliu97 <841637247@qq.com> Co-authored-by: Yao Chi <later@usopp.net> Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * Fix zeros like and ones_like api (#8632) * fix zeros_like and ones_like bug * refine * revert * refine * fix tensor_slice_view infer physic_shape bug * add test * refine * auto format by CI * fix bug * refine * auto format by CI * fix import error * fix bug Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix sbp print bug (#8689) * Add a normal priority with no transfer but different sbp * Fix the bug 
for printing no boxing edge * Do not use P for weights * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * eager_local_interpreter_with_infer_cache (#8619) * ThreadLocalGuard * refactor EagerBlobObjectList * op_args_reserved_size * remove useless comments * rename one::EagerBlobObjectList* to vm::EagerBlobObject* * refactor signature of InstructionsBuiler::Call * PhysicalRun * refactor InstructionsBuilder::Call * remove unused StatefulOpKernel::need_check_mem_case * remove EagerLocalTensorImpl::is_shape_synced_ * eager_local_interpreter_with_infer_cache * remove useless code * reslove comments * refactor TensorMeta::TensorMeta(const TensorMeta) * use small vector * add kMaxNumDims * fix error include * fix split Symbol LocalTensorMeta error * refactor SoftSync * move SmallVector from common/container_util.h to framework/instructions_builder.cpp * mone ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE to eager.h * add blank line * reslove comments * minor fix * refine * explicit scalar initialization * fix static check error * auto format by CI * of_format * reslove comment * refine * refine * refine Co-authored-by: lixinqi <lixinqi0703106@163.com> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix gelu nn.Module bug and support tanh mode. 
(#8693) * add gelu2 api * refine test * refine docs * refine * restuct * delete useless headfile * format * rm doc of tensor.gelu (#8696) Co-authored-by: Shanshan Zhong <62104945+zhongshsh@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix bug in CrossFeatureInteraction LazyBackward (#8677) fix bug Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix floating-point scalar tensor in arange (#8673) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add nn functional fold (#8667) * add fold * update fold.py * add test * fix doc * fix comment Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * modify some file and improve the error message (#8566) * modify some file and improve the error message * modify the content * modify the content * auto format by CI * Update roi_align_op.cpp * Update roi_align_op.cpp * Update reshape_user_op_util.cpp * auto format by CI * Update roi_align_op.cpp Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * [OneEmbedding] add id_shuffle_copy_out (#8683) add id_shuffle_copy_out Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix add_param_group step key not match error (#8698) * fix add_param_group step key not match error * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add env ONEFLOW_EP_CUDA_DEVICE_FLAGS and ONEFLOW_EP_CUDA_STREAM_FLAGS (#8703) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix for docsv0.8 (#8710) * fix repeat op 0-size releated bug (both in FW and AD) (#8707) * fix repeat op 0-size releated bug (both in FW and AD) * refine * refine static check * refine * fix commnet * fix comment * refine * fix test * auto format by CI * auto 
format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support Dropout Scale in FusedMLPGrad[OneEmbedding] (#8633) * support alpha list * Remove redundant modify * remove redundant alpha set * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix bug of Tensor.type (#8697) * fix bug of tensor.type(flow.Tensor) * fix bug of tensor.type(flow.Tensor) about device * Fix tensor type doc (#8699) fix doc of tensor.type * add test for tensor.type(flow.Tensor) * move PyTensorMetaCls_CheckExact to header file Co-authored-by: Shanshan Zhong <62104945+zhongshsh@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * ONEFLOW_GRAPH_PLACE_TRAINING_STATE_ON_ALL_RANKS (#8706) * ONEFLOW_GRAPH_PLACE_TRAINING_STATE_ON_ALL_RANKS * auto format by CI Co-authored-by: liujuncheng <liujuncheng1022@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * define_mut_output_shape_and_mut_output_stride_in_infer_ctx (#8709) * define_mut_output_shape_and_mut_output_stride_in_infer_ctx * fix merge master error * fix typo Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add qat conv modules (#8368) * add qat conv modules * add quantization related modules to doc * refine qatconv modules doc * add qat conv module tests * refine * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add unsqueeze_multiple_op (#8714) * add unsqueeze_multiple_op * modify the format * Update functional_api.yaml Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * modify broadcast_like_op.cpp and add test (#8720) * modify broadcast_like_op.cpp and add test * modify broadcast_like_op.cpp * Update broadcast_like_op.cpp Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> 
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * JIT LR (#8500) * add example code * Update cosine_annealing_lr.py * enable self params transformer * enable pass ast to c++ api * enable jit backend for lr * enable jit global register and invoke * convert Global to Singleton for new merge * enable pybind11 walk on python ast * enable test all existent get_lr of oneflow in python * enable py_ast_wrapper pass ast from python to mlir * switch all ast to ast-wrapper in mlir scope * define python ast partially * partial python ast definition * trim asdl of python ast * mlir gen * add symbol table * from ast to jit done * switch llvm::errs() to mlir::emitError and convert switch to typeSwitch * trim duplicate namespace use * fix LIT header * add some docs * enable compare with or_else, if with return seamless in branch and mutable variable * trim code and refine struct * register pybind11 ast node for shared_ptr * enable cpp class in python * go through python to mlir to llvm to jit to run * add addf subf op * work well on stepLR linearLR exponentialLR coseineDecayLR cosineAnnealingLR constantLR * enable maxf minf conversion to llvm ir * rename LR_JIT to LRJITRegister * remove LR_JIT_Engine and swith Invoke to std::function ret by lookup * refine struct * enable bisect_right and python resigter api have dump option arg * add bisect_left and bisect_transformer specially, delete former test python script * remove c++17 standard * restore double hash to iterator * publish * publish * publish * use llvm classof and typeswitch rightly * trim * commit * commit * commit * commit * commit * commit * auto format by CI * Update ir.cpp * Update OneFlowLRJITRegistry.h * auto format by CI * Update AstMlirGen.h * Update lr_jit.cpp * auto format by CI * Naming conventions * auto format by CI * auto format by CI * deploy _ behind Co-authored-by: leaves-zwx <kunta0932@gmail.com> Co-authored-by: yuhao <1171760467@qq.com> Co-authored-by: oneflow-ci-bot 
<ci-bot@oneflow.org> Co-authored-by: yuhao <72971170+howin98@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add logspace (#8599) * add logspace * add global test * restore rand * fix doc * rename consistent to global * adjust import order * add todo * Add hann_window (#8615) * add hann_window * rm useless include * add check * adjust import order * add ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE (#8730) * add ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE * add environment to vm.h Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix as strided bool type and view bug (#8713) * fix as_stride bug * refine * refine * refine * delete useless head file * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add functional binary cross entropy (#8708) * add gelu2 api * refine test * refine docs * refine * restuct * delete useless headfile * format * rm doc of tensor.gelu * add functional binary cross entropy Co-authored-by: BBuf <1182563586@qq.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * support map_location in flow.load (#8666) * support map_location in flow.load Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * fix tests Signed-off-by: daquexian <daquexian566@gmail.com> * fix bug when map_location is None Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Add addcdiv (#8581) * add addcdiv * fix tensor_functions * fix inplace * add test number * rename consistent to global * Inner most dim case for cumsum cumprod op (#8403) * cumsum use cub scansum in some case * prod use cub scan * refine name * refine * optimize cum op * format * fix * get device properties by cuda stream class * 
revert useless code * refine * outer dim use parallel sweep algo * refine * fix a fraction of threads hit __syncthreads * revert * refine kernel define * refine * refine * refine * refine * move comment * fix * fix * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Define mut output dtype and mut output is dynamic in infer ctx (#8716) * define_mut_output_shape_and_mut_output_stride_in_infer_ctx * fix merge master error * fix typo * define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx * replce const DataType& with DataType * replace const DataType& with DataType ret * split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex * refine * minor fix * refine * fix static check error * Update op_expr.cpp * Update op_expr.cpp * Update stateful_opkernel.cpp * refine * fix static check error * refine * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Dev refactor fuse instruction policy (#8624) * ThreadLocalGuard * vm::InstructionPolicy * refactor fuse instruction policy * fix compile error (#8623) * fix compile error * change MirroredObject to Dependence * Modify DependenceVector * add instruction policy util * add instruction policy util * remove include * add include * rm fuse instruction type * Modifying variable properties * add stream_sequential_dependence_ to instruction_policy Co-authored-by: lixinqi <lixinqi0703106@163.com> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix bug of batchnorm num_batches_tracked global error when loading state_dict (#8723) add condition for assign num_batches_tracked Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add launch master port limit (#8563) * add launch master port limit * Update python/oneflow/distributed/launch.py Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: daquexian <daquexian566@gmail.com> 
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Fix docs import distance (#8691) * fix import distance * add functional apis * add smooth_l1_loss docs * refine activation.py * add deleted api * review * Add APIs missing from the docs of oneflow, nn, and other modules (#8704) * docs: add api * docs(nn): refactor nn * review Co-authored-by: Guoliang Cheng <lmyybh_lazy@163.com> Co-authored-by: ChenQiaoling <48576019+Chenqll@users.noreply.github.com> * refactor control stream type (#8647) * refactor control stream type * auto format by CI * Add method implementation * refine * refien Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Define mut output tensor desc (#8717) * define_mut_output_shape_and_mut_output_stride_in_infer_ctx * fix merge master error * fix typo * define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx * define_mut_output_dtype_and_mut_output_tensor_desc * replce const DataType& with DataType * replace const DataType& with DataType ret * split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex * refine * minor fix * fix merge error * fix warning error * refine * fix static check error * Update op_expr.cpp * Update op_expr.cpp * Update stateful_opkernel.cpp * refine * fix static check error * refine * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Symbolic local tensor meta (#8662) * ThreadLocalGuard * refactor EagerBlobObjectList * op_args_reserved_size * remove useless comments * rename one::EagerBlobObjectList* to vm::EagerBlobObject* * refactor signature of InstructionsBuiler::Call * PhysicalRun * refactor InstructionsBuilder::Call * remove unused StatefulOpKernel::need_check_mem_case * remove EagerLocalTensorImpl::is_shape_synced_ * eager_local_interpreter_with_infer_cache * remove useless code * reslove comments * refactor TensorMeta::TensorMeta(const TensorMeta) * 
use small vector * Symbolic LocalTensorMeta * check shape in critical_sectio * add kMaxNumDims * fix error include * fix split Symbol LocalTensorMeta error * fix split cache and symbolic local tensor meta error * refactor SoftSync * move SmallVector from common/container_util.h to framework/instructions_builder.cpp * mone ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE to eager.h * add blank line * reslove comments * minor fix * refine * explicit scalar initialization * fix static check error * auto format by CI * of_format * reslove comment * refine * refine * refine * fix error * define MutOutputShape and MutOutputStride in InferContext * define_mut_output_shape_and_mut_output_stride_in_infer_ctx * fix merge master error * fix typo * fix static check error * define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx * define_mut_output_dtype_and_mut_output_tensor_desc * replce const DataType& with DataType * split const and mut func in LocalTensorMeta * replace const DataType& with DataType ret * split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex * refine * minor fix * fix merge error * fix warning error * refine * fix static check error * Update op_expr.cpp * Update op_expr.cpp * split MutTensorMeta and MutLocalTensorMeta * Update stateful_opkernel.cpp * refine * fix static check error * refine * refine * reslove comment * refine * fix typo Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * fxi typo * use OpArgsVector Co-authored-by: lixinqi <lixinqi0703106@163.com> Co-authored-by: Li Xinqi <lixinqi2010@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> * Feat general basic communication (#8437) * Add a slight cost for B->S and B->P in 2d sbp * Add penalty for P in consumer * Fix a slight bug * Add at most 1 middle node for general basic communication * Add the cost for general basic communication 
* Add the slight penalty for eager * Skip initialization of boxing collector if not needed * Fix a bug * Dev nd nccl send recv boxing (#8467) * nd nccl_send_recv_boxing * rm print * support num_axes > 2 * Add distributed optional run (#8372) * Add * change deps * add install * add skip * autoprof supports bandwidth (#8367) * autoprof supports bandwidth Signed-off-by: daquexian <daquexian566@gmail.com> * print bandwidth Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * remove tmp buffer of cumprod cpu backward kernel (#8369) * remove tmp buffer of cumprod cpu backward kernel * refine * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Move tensor api to cpython part3 (#8342) * add tensor_functions * concat py methods * add hash, restore tensor.py * check replacement * refine code, remove commented tensor.py * refine code * move some api * add cpu and cuda api * add triu tril norm and etc. * remove tensor_functions.h * move more api * move more api, refine size * fix typo * format code, remove useless include * refine code * refine code, fix typo * align .cuda to python * refine code * split some api to part3 for review * remove positional only arguments of argmax and argmin * remove arguments parse * modify arguments name in matmul and floor_divide * rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions * refine code, format code * add inplace /=, add comments * remove name in macros * remove python api * remove redundant include * remove cout * format code * refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_ * remove redundant code * auto format by CI * fix typo, fix wrong call * modify idx datatype from int32 to int64 in tensor.size * add some DIRECT_PASS_FUNC * add cpu cuda var pow and etc. 
* add masked_fill any all * make REDUCE_FUNC macro, add reduce_* functions * add 0dim check in ReduceSumWhole, refine yaml * fix bug * restore add add_ sub sub_ * add unittest for tensor.half tensor.add tensor.add_ * refine code * refine code * fix typo * fix bug of tensor.std() * refactor var std and cuda, using c++ functional api * add beta and threshold in softplus * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add nn_functor Check (#7910) * add bias_add_check * add bias_add error test * fix conv2d nhwc bias_add error * add nhwc conv test * add bias_add_error test * Add bias add error check * Rename * add batch matmul error check * add matmul check error msg * remove annotation * add fused mlp error msg check * Add pixel shuffle check test * add more test until normalization add relu functor * refine error message * finish all nnfunctor check msg * handle type error * remove useless symbol * modify back to TypeError * fix all comment * Remove redundant code * Remove pad ndim check * fix bias add space * fix check logic cause ci gpu not always gpu:0 Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222) * previous version for fused_matmul_bias_add_relu_dropout * add op infer * fix detail * finish forward * support dropout rate list * add forward test * fix bug for output buffer * Configurable alpha params * try to add bit mask logic * Add bitmask first version! 
* Add row col bitmask logic * support not align4 reludropout * simplify relu dropout ld logic * Add naive relu dropout grad kernel * add simple relu dropout grad kernel * Rename * support relu_dropout bitmask backward * add vectorized optimization * fix tmp buffer * add to amp list * add lazy backward logic * Refine kernel * add indextype dispatch * simplify functor logic * fix cublas fused mlp aux_ld shape bug * Add more relu dropout kernel * add full unittest * fix bug in skip final activation * refine * Remove dump func * fix format * Remove cmake * remove redundant divide * add padded version * fix dropout * oneflow curand * refine * remove redundant kernel * add unroll logic * add unroll and ballot sync * refine format * Remove fast curand * Refine python interface * Add if branch for memset * fix python logic * just for debug * not use matmul bias add grad * add launch 1 block limit * fix unittest * Refine * fix graph backward bug * limit to 11060 * change to use int32_t dtype for cublas aux * Fix jc comment * fix comment * fix convert * fix static_analysis * fix at * fix userops td * fix userops td * fix const ref * fix compile error for bfloat16 * limit to 11060 * fix bug Co-authored-by: Juncheng <liujuncheng1022@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix gather 0-dim tensor bug (#8376) * fix 0-dim tensor bug * refine * support input 0-dim tensor for gather * refine * refine * refine dim_scatter_kernel check * refine * refine check * fix clang_tidy error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add api to apply external job pass (#8370) * Add condition to find-test-cache-distributed (#8387) * add condition to find-test-cache-distributed * fix * warp dim util (#8382) * warp dim util * format * use more maybe_wrap_dim * refine array functor * add more * refine math_functor * fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379) * 
fix_bug_in_broadcast_min_max_grad_and_broadcast_like * refine * fix static check error * fix bug about index (#8388) * fix bug about index * add test case Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * LogicalSliceAssign support full slice sbp (#8344) * feat(SliceOp): slice ops support 2d sbp * fix(SliceOp): fix [B, P] 2d sbp bug * refine error message * fix bug in parallel_num == 1 * add comment * add warning and format * add NOLINT for boxing check * feat(LogicalSliceOps): support all nd_sbp * feat(LogicalSlice): support nd_sbp * add error message * fix(AutoTest): fix auto_test bug in module.parameter pass * auto format by CI * fix(LogicalSliceAssign): skip test when 1n1d * fix SliceParams memset error * remove memset * add CHECK_JUST * fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT * remove memset * fix spilit_info.axis bug * feat(LogicalSliceOps): support grad * add logical_slice gradient_funcs * feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp * auto format by CI * test(LogicalSlice): fix logical_slice dims Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * fix_tensor_from_numpy_mem_leak_bug (#8391) * fix_tensor_from_numpy_mem_leak_bug * add note * refine note * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393) * make of_pyext_obj static only * refine note Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Adjust tolerance setting in embedding_renorm unit test (#8394) * support front end compile for job to iree (#8249) * support frontend dev version * polish name * add tosa-to-elf.mlir * tosa to elf by llvm * conv2d partial * an enhanced frontend runner * support numpy as input * enable multiple 
using nn graph with different input(jobname make it it cd /home/yuhao/frontend/oneflow ; /usr/bin/env /usr/bin/python3 /home/yuhao/.vscode-server/extensions/ms-python.python-2022.6.2/pythonFiles/lib/python/debugpy/launcher 40873 -- /home/yuhao/frontend/oneflow/oneflow/ir/test/Frontend/runner.py ) * enable multiple input * enable cpu and cuda * change full_name to _full_name * support exchange cuda with cpu seamlessly * remove pip * lit config * polish * trim * auto format by CI * modify * auto format by CI * last line polish * use unittest * auto format by CI * use allclose * auto format by CI * pulish * optimize convert oneflow to tosa * conv2d * conv2d enhanced && conv2d examples add * add road map * add add_n2Op and boardcast_addOp conversion * add matmulOp conversion * support converting normailzation op to tosa(partically) * update roadmap * support i64 tensor to dense elem attr * support 100% resnet op conversion * add test mlir * add test iree resnet python script * auto format by CI * done * enhance iree resnet test script * auto format by CI * rebuild code * auto format by CI * rebuild test script * update * auto format by CI * pub * trim test scripts * move * move * input and output add block arg judgement * emit error in variable conversion * error handle for ci * modify err info * auto format by CI * merge * auto format by CI * output not block * flow ones * rm const * trim maybe * trim maybe with header file * const auto * solve clangd error Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat/zero mix with mp (#8036) * add zero limit * add debug * add mix zero test * refactor zero api * zero test with mp * add 2d test * add zero nd * add nd zero * add sbp cast * test passed soft limit consumer * refine size api * zero use stage 2 * add limit consumer api * add new api * refine zero s select * fix index out of range * rm zero limit on device type * zero test with 
activation checkpointing * add indentity when dp sequence len is 1 * move to base with master * fix * fix * fix * add test * debug bad case * refine test for eager and graph boxing * test case ready * simplify * refine test * fix buff size * fix conflict * refine zero nd * refine * add full test * revert change * refine split check * fix typo * rm log * spit long func * restore test * Update optimizer_placement_optimization_pass.cpp * auto format by CI * auto format by CI * fix static check * add tips for zero api change * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Revert embedding normal path and fix amp list (#8374) * revert embedding normal path, fix amp list * fix amp * fix memset bug in gather cpu kernel Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * replace fixed_vector with small_vector and make Shape inherit from it (#8365) * Replace fixed_vector with llvm::SmallVector Signed-off-by: daquexian <daquexian566@gmail.com> * Shape inherited from llvm::SmallVector Signed-off-by: daquexian <daquexian566@gmail.com> * refine cmake Signed-off-by: daquexian <daquexian566@gmail.com> * rename fixed_vector to small_vector Signed-off-by: daquexian <daquexian566@gmail.com> * fix reviews Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * update Shape constructor Signed-off-by: daquexian <daquexian566@gmail.com> * add 'PUBLIC' keyword to all target_link_libraries Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * update cmake Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * update cmake Signed-off-by: daquexian <daquexian566@gmail.com> * update cmake Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * set is_initialized_ default to true Signed-off-by: daquexian <daquexian566@gmail.com> * override some methods to set is_initialized_ Signed-off-by: 
daquexian <daquexian566@gmail.com> * auto format by CI Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * Light plan for debug (#8396) * Light plan for debug * fix note * disable terminfo to fix missing terminfo symbols (#8400) * disable terminfo to fix missing terminfo symbols Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix bug of ZeRO MP in complex case (#8404) * Remove redundant output_lbns in ir (#8409) * mv case * remove redundant info * Dev FusedCrossInteraction[OneEmbedding] (#8335) * add simple fused cross interaction forward * add packed fused * Add cross interaction grad * simplify code * fix bug * support crossnet v2 * support cross interaction v2 * add lazy backward * Rename and add test * fix jc comment * fix comment * fix bug * fix userops td elem_cnt for FUSED Group * fix header file * fix clang static analysis * fix unittest Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add exe graph physical shape check msg (#8002) * fix index select op in graph * add exe graph physical shape check msg * improve the debug information for the python stack trace 1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace 2. refactor other debug related classes. * remove parens * update * resolve PR comments * update * update graph debug test file. 
* restore self._debug in class Graph and class ModuleBlock * Do not shorten the stack frame string if it is in debug mode * delete TODOs * disable conv3d test (#7969) Signed-off-by: daquexian <daquexian566@gmail.com> * skip layernorm random_data_warp test (#7941) * skip layernorm random_data_warp test * warp/block/uncached case only test gpu Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Lock click version (#7967) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add global avgpool unittest (#7585) * fix (#7978) Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support negative dim in scatter op (#7934) * support negative dim in scatter op * refine scatter test * refine scatter test again Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702) * run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand * lock gil in vm Callback thread * more comments for VirtualMachineEngine::Callback() * the Env is never destroyed. * export Env into python * more unittests * wait shared_ptr.use_count() == 0 * export unittest.TestCase in framework/unittest.py * SwitchToShuttingDownPhase * optional is_normal_exit * VirtualMachine::CloseVMThreads * Delete env_api.h env_api.h is deleted by master * reshape_only_one_dim_infered * address pr comments * fix a ref-cnt bug in TryRunBarrierInstruction. 
* rollback flow.env.all_device_placement * no distributed running test_shutting_down.py * auto format by CI * expand lifetime of module oneflow in test_shutting_down.py * refine del depend on of * capture oneflow._oneflow_internal.eager when calling sync in __del__ * add try in flaky test Co-authored-by: Luyang <flowingsun007@163.com> Co-authored-by: chengtbf <472491134@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com> * Fix one hot scalar tensor bug (#7975) * fix reduce_sum scalar check bug * fix one_hot scalar tensor bug * fix clang tidy error Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * support ctor np array from of tensor (#7970) * support ctor np array from of tensor * add test case constructing np array from tensor * refine Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * add_manual_seed_all_api (#7957) * add_manual_seed_all_api * Update conf.py * refine * add test case * auto format by CI * Update random_generator.cpp * auto format by CI Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * one_embedding add doc string (#7902) * add doc string * add example * add * fix doc * refine * address review * mb to MB * add make_table_option * option to options * refine * add forward Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Support numpy scalar parameters (#7935) * feat(functional): support numpy scalar parameters * rename inferface * feat(*): TensorIndex support numpy scalar * feat(TensorIndex): support advance indexing * add unittest and int32 support for branch feat-param_support_np_scalar (#7939) * add unittest * refactor unittest * add todo for int16 advanced indexing * add int32 supporting for advance indexing * auto format by CI Co-authored-by: 
Wang Yi <53533850+marigoold@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> * fix tensor_scatter_nd_update (#7953) * fix tensor_scatter_nd_update * auto backward Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * fix one_embedding adam (#7974) * fix one_embedding adam * fix tidy * fix normal Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * speed test with score (#7990) Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * Feat/graph del by ref (#7857) * remove IsMultiClient() and single client logic Signed-off-by: daquexian <daquexian566@gmail.com> * rename eager.multi_client to eager Signed-off-by: daquexian <daquexian566@gmail.com> * auto format by CI * add py ref * refine new session * clean code * make scope api inner use * use session with ref cnt * run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand * test pass * lock gil in vm Callback thread * more comments for VirtualMachineEngine::Callback() * merge * merge rm single client * rm initenv * merge and fix master * refactor env c api * add debug code * fix and serving test pass * test passed * rm useless * rm useless code * format * rm useless include * rm sync in py * the Env is never destroyed. 
* export Env into python * more unittests * fix and pass tests * revert virtual_machine.cpp * revert core/vm * remove outdated python class oneflow.unittest.TestCase * graph test passed * wait shared_ptr.use_count() == 0 * export unittest.TestCase in framework/unittest.py * SwitchToShuttingDownPhase * optional is_normal_exit * VirtualMachine::CloseVMThreads * Delete env_api.h env_api.h is deleted by master * address pr comments * rm is env init * Clear empty thread when graph destroy (#7633) * Revert "Clear empty thread when graph destroy (#7633)" (#7860) This reverts commit 3e8585e5fa20b97229d6b0be46a7ff814dc8cd83. * fix a ref-cnt bug in TryRunBarrierInstruction. * rm env_api * …

Fully Memory Log V2 with more details

f51fd2e

chengtbf added the feature and graph (graph mode) labels Jul 4, 2022

chengtbf requested review from strint and leaves-zwx July 4, 2022 14:11

strint approved these changes Jul 4, 2022

leaves-zwx approved these changes Jul 5, 2022

refine log and long op name

08cffae

chengtbf added the automerge label Jul 6, 2022

Merge branch 'master' into dev_cc_mem_log_v2

5407529

chengtbf requested a review from oneflow-ci-bot July 6, 2022 09:43

github-actions bot removed the automerge label Jul 6, 2022

fix clang tidy

ffc15a3

chengtbf added the automerge label Jul 6, 2022

chengtbf removed the request for review from oneflow-ci-bot July 6, 2022 14:29

mergify bot added 4 commits July 6, 2022 14:32

Merge branch 'master' into dev_cc_mem_log_v2

5b23dfc

Merge branch 'master' into dev_cc_mem_log_v2

062ceac

Merge branch 'master' into dev_cc_mem_log_v2

845d9e3

Merge branch 'master' into dev_cc_mem_log_v2

af36bfb

chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 7, 2022 03:03

mergify bot and others added 4 commits July 7, 2022 10:54

Merge branch 'master' into dev_cc_mem_log_v2

8e2434a

Merge branch 'master' into dev_cc_mem_log_v2

4b4c4a6

fix test

432371c

Merge branch 'dev_cc_mem_log_v2' of https://github.com/Oneflow-Inc/oneflow into dev_cc_mem_log_v2

80ab3d8

chengtbf requested a review from BBuf as a code owner July 7, 2022 17:52

chengtbf requested a review from oneflow-ci-bot July 11, 2022 01:57

chengtbf commented Jul 11, 2022

github-actions bot removed the automerge label Jul 11, 2022

Merge branch 'master' into dev_cc_mem_log_v2

9434a63

strint added the automerge label Jul 11, 2022

mergify bot added 4 commits July 11, 2022 15:37

Merge branch 'master' into dev_cc_mem_log_v2

349e0c9

Merge branch 'master' into dev_cc_mem_log_v2

da7508f

Merge branch 'master' into dev_cc_mem_log_v2

dd313a1

Merge branch 'master' into dev_cc_mem_log_v2

7c1c422

chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 12, 2022 08:47

mergify bot added 2 commits July 12, 2022 09:03

Merge branch 'master' into dev_cc_mem_log_v2

0e751ff

Merge branch 'master' into dev_cc_mem_log_v2

20f4d16

github-actions bot removed the automerge label Jul 12, 2022

chengtbf added the automerge label Jul 12, 2022

chengtbf requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 12, 2022 14:51

mergify bot added 3 commits July 12, 2022 15:43

Merge branch 'master' into dev_cc_mem_log_v2

1c4155b

Merge branch 'master' into dev_cc_mem_log_v2

5c5b693

Merge branch 'master' into dev_cc_mem_log_v2

1448e63

mergify bot merged commit 8ffab16 into master Jul 12, 2022

mergify bot deleted the dev_cc_mem_log_v2 branch July 12, 2022 22:50
