graph activation checkpointing #6192
149b9be to 9a2b3b0
* Primitive (#6183) * Add Primitive * #ifdef WITH_CUDA Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * Disable implicit boxing when parallel num eq one (#6188) * mv_boxing_folder_to_core * minor fix * disable_implicit_boxing_when_parallel_num_eq_one * Update eager_consistent_op_interpreter.cpp Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * Lazy support Scalar (#6181) Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * Fix LayerNorm check bug (#6196) * fix(Layernorm): fix check bug * fix judge whether cpu or not Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * add glu op (#6065) * add glu op * del glu_op export,align with torch * mod glu_op * mov op logic to C++ * Solve problems * solve conflict * delete gradient functor * add ndim check * add GLU test * delete blank line * format Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Co-authored-by: Zhenhua <huangzhenhua@zhejianglab.com> * Primitive based copy task node (#6195) Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * KernelState (#6198) Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * container_util: fix VectorAt, remove useless MutMapAt (#6172) * fcontainer_util: fix VectorAt, remove useless MutMapAt * fcontainer_util: format * MapAt: add default value version * format Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * Refine StreamContext (#6191) Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * Cpu symetric s to s (#6153) * mv_boxing_folder_to_core * minor fix * cpu_symetric_s_to_s * add test case * auto format by CI * minor fix * refine * Update eager_nccl_kernels.cpp * minor fix * fix bug * minor fix * Update oneflow/user/kernels/eager_nccl_kernels.cpp Co-authored-by: daquexian <daquexian566@gmail.com> 
* Update eager_nccl_kernels.cpp * Update eager_nccl_kernels.cpp * minor fix * Update eager_nccl_kernels.cpp Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: daquexian <daquexian566@gmail.com> * fix bug (#6197) Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * fix consistent tensor zeros (#6202) Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * [Feat.] nn.Graph support grad acc with input/output tensor (#6155) * nn.Graph support grad acc with input/output tensor * dirty pass grad acc * revert tensor.backward hack * fix indent * default S0 -> B * Pack op/kernel support scalar input * nn.Graph output pack support loss scalar * add test script * pass test * Lazy build output eager tensors after job complete * non scalar output test Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * Dev eliminate gcc warnings (#6199) * fix gcc warning * refine * fix comment Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * StreamContextAdapter (#6205) * StreamContextAdapter * refine Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * Autotest generate input tensor (#6206) * Add tensor yaml, support export tensor functional api. * refine * Remove packed functor signature * remove unused file * Refine * refine * add activation op import * reinit oneflow init.py * add oneflow abs and exp * add oneflow abs and exp * add acos * add arccosh * add more op * add more ops * add more op * add more ops * add log1p * add more smaples * add more ops * add more ops * add more ops * add more ops * Complete tensor functional apis. * Fix pybind call * add more ops * add ops done * Add target of_functional_tensor_obj * Disable throw visibility warnings * fix target link * fix * fix incorrect use of flow.Tensor. 
* Fix error merge * fix * fix add unittest * refine * refine * fix * fix * add tensor doc * auto format by CI * refine * Fix * Add doc for python function * refine * add tensor method docstring * fix some bug * fix docs bug * Fix * auto format by CI * Tensor->tensor * Tensor->tensor * refine Tensor->tensor * fix * fix * fix * fix conflict * fix bug * fix ci bug * fix * delete diag op * fix conflict * Fix segment * fix * merge * autotest framework generate input tensor * autotest framework generate input tensor * fix bug * fix impl bug * refine * refine * refine * fix * fix * fix comments * delete useless * fix ci error * fix ci error Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * Cleanup KernelUtil (#6212) Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> * Rename flow to oneflow in user hint (#6190) * style(*): rename flow to oneflow in user hint * fix(*): fix doctest * auto format by CI * remove ddp speed test Signed-off-by: daquexian <daquexian566@gmail.com> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: daquexian <daquexian566@gmail.com> * merg and refactor * refact code * add io identity for activation checkpointing Co-authored-by: Juncheng <liujuncheng1022@gmail.com> Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com> Co-authored-by: binbinHan <han_binbin@163.com> Co-authored-by: cheng cheng <472491134@qq.com> Co-authored-by: Yinggang Wang <wyg19970408@gmail.com> Co-authored-by: QiangX-man <87475073+QiangX-man@users.noreply.github.com> Co-authored-by: Zhenhua <huangzhenhua@zhejianglab.com> Co-authored-by: Twice <i@twice.moe> Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org> Co-authored-by: daquexian <daquexian566@gmail.com> Co-authored-by: ZZK 
<42901638+MARD1NO@users.noreply.github.com> Co-authored-by: Luyang <flowingsun007@163.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com>
…tkpei/checkpoint
with self.scope_context():
    result = self._origin.__class__.forward(self, *args)
result = self._post_forward_mapping_out_scope(result)
result = seq_to_func_return(result)
Map forward's inputs/outputs outside the block scope.
break_with_identity,
"break_activation_checkpointing_with_identity",
*args,
)
Insert identity.
else:
    if self._debug:
        print(
            f"{repr_str} is not a Tensor, {func_desc} transformation will be ignored."
The message the mapping prints in debug mode for the unsupported case.
Our tensor mapping only supports an arg that is itself a tensor, or tensors inside a list, and no other nesting, right? For example a dict, or a list of lists.
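To make the question concrete, here is a minimal sketch of the mapping behavior being described, not the PR's actual implementation. `FakeTensor` and `map_tensor_args` are hypothetical names introduced only for illustration; the sketch assumes that a bare tensor or a tensor inside a top-level list is transformed, and that everything else (including a dict or a list of lists) passes through untouched.

```python
class FakeTensor:
    """Hypothetical stand-in for a framework tensor (illustration only)."""

    def __init__(self, name):
        self.name = name


def map_tensor_args(func, func_desc, *args, debug=False):
    """Apply func to tensors and to tensors directly inside top-level lists.

    Anything else is returned unchanged; deeper nesting is not traversed.
    """

    def map_one(item, repr_str):
        if isinstance(item, FakeTensor):
            return func(item)
        if debug:
            print(
                f"{repr_str} is not a Tensor, {func_desc} transformation will be ignored."
            )
        return item

    mapped = []
    for idx, arg in enumerate(args):
        if isinstance(arg, list):
            # Only the first level of a list is inspected.
            mapped.append(
                [map_one(x, f"arg {idx} item {i}") for i, x in enumerate(arg)]
            )
        else:
            mapped.append(map_one(arg, f"arg {idx}"))
    return tuple(mapped)


def tag_identity(t):
    # Stand-in for inserting an identity op: returns a new, renamed tensor.
    return FakeTensor(t.name + "_identity")


t = FakeTensor("t")
lt = [FakeTensor("a"), FakeTensor("b")]
nested = [[FakeTensor("n")]]
out = map_tensor_args(tag_identity, "identity", t, lt, 1, "2", {"k": t}, nested)
```

Under these assumptions the inner list of `nested` and the dict value are left as they were, which is the limitation the comment above asks about.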
return list_to_func_return(self._eager_outputs_buffer[0])
return seq_to_func_return(self._eager_outputs_buffer[0])

def _rebuild_outputs(self, out2name=None):
Rebuilding of the outputs is consolidated into a single function.
op.name,
re.I,
)
is not None
Verifies that the recomputation op was inserted.
if re.search("identity_.*_grad", str(name), re.I) is not None:
    find_ctrl = True
    print(name)
test_case.assertTrue(find_ctrl)
Verifies that the identity op was inserted and that the first op of the recomputation segment has the grad as a control edge.
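The name check in the snippet above can be sketched in isolation as follows. The pattern `identity_.*_grad` and the `re.I` flag come from the test; the helper `has_name_matching` and the sample names are hypothetical and are not real OneFlow op names.

```python
import re


def has_name_matching(names, pattern):
    """Return True if any name matches the pattern, case-insensitively."""
    return any(re.search(pattern, str(name), re.I) is not None for name in names)


# Hypothetical control-edge names, made up for this sketch.
ctrl_in_names = ["m-matmul_0", "m-Identity_3_grad", "m-relu_1"]

find_ctrl = has_name_matching(ctrl_in_names, "identity_.*_grad")
no_ctrl = has_name_matching(["m-matmul_0", "m-relu_1"], "identity_.*_grad")
```

Because the search is case-insensitive, a name containing `Identity_3_grad` also counts as a match.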
@@ -39,12 +39,14 @@ class CustomGraphIOCheck(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.m = CustomModuleIOCheck()
        self.m.config.activation_checkpointing = True

    def build(self, t, lt, n):
        rt, rlt, n, ri, rs = self.m(t, lt, n, 1, "2")
        return t, lt, n
Verifies that the block's mapping supports tensor / list(tensor) and can ignore other types.
See the License for the specific language governing permissions and
limitations under the License.
"""
import re
Need to import os.
Added.
…flow into tkpei/checkpoint
    self._is_executing_forward = False
    return result

def _pre_forward_mapping_out_scope(self, *args):
    # Deal with activation checkpointing identity.
    if self.config.activation_checkpointing:
Change this to an "or": if either a stage id or checkpointing is configured, insert the identity.
OK, but the test case for that only exists in the next PR, so I'll add it there.
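For reference, the suggested "or" condition could look roughly like the sketch below. `BlockConfigSketch` and `needs_identity` are hypothetical names, and modelling an unset stage id as `None` is an assumption, since the real config class is not shown in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockConfigSketch:
    """Hypothetical stand-in for a block config (illustration only)."""

    stage_id: Optional[int] = None  # assumption: None means "not configured"
    activation_checkpointing: bool = False


def needs_identity(config):
    """Insert identity if a stage id OR activation checkpointing is configured."""
    # "is not None" so that stage 0 still counts as configured.
    return config.stage_id is not None or config.activation_checkpointing


only_stage = needs_identity(BlockConfigSketch(stage_id=0))
only_ckpt = needs_identity(BlockConfigSketch(activation_checkpointing=True))
neither = needs_identity(BlockConfigSketch())
```

Comparing against `None` rather than relying on truthiness keeps stage 0 from being treated as unset.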
Speed stats:
In nn.graph