LazyInterpret for FeedVariableOpExpr #5490

Merged · 6 commits · Jul 15, 2021
60 changes: 57 additions & 3 deletions oneflow/core/framework/op_interpreter/lazy_op_interpreter.cpp
@@ -113,9 +113,63 @@ Maybe<void> LazyInterpreter::ApplyImpl(const FeedInputOpExpr& op_expr, const Ten

Maybe<void> LazyInterpreter::ApplyImpl(const FeedVariableOpExpr& op_expr, const TensorTuple& inputs,
TensorTuple* outputs, const OpExprInterpContext& ctx) const {
- // TODO(chengcheng)
- OF_UNIMPLEMENTED() << "The type " << op_expr.op_type_name()
-                    << " has not been supported in LazyInterpreter::Apply.";
// NOTE(chengcheng): inputs[0] is the EagerTensor
CHECK_EQ_OR_RETURN(inputs.size(), 1);
CHECK_EQ_OR_RETURN(op_expr.input_size(), 1);
const std::shared_ptr<Tensor>& input_tensor = inputs.at(0);
CHECK_OR_RETURN(input_tensor->is_eager());

const auto& scope = JUST(GetCurrentScope());
int64_t scope_symbol_id = JUST(scope->symbol_id());

OperatorConf op_conf;
op_conf.set_name(op_expr.op_name()); // construct by python nn.Graph
op_conf.set_scope_symbol_id(scope_symbol_id); // TODO(chengcheng): NewScope by cur scope.
@strint (Contributor), Jul 15, 2021:

A scope for this variable was already created outside; since the op is single-tensor, it seems that scope could be reused instead of calling NewScope?

Contributor Author:

It is needed. The scope created outside belongs to the Block and carries no real ParallelDesc information. The ParallelDesc-related scope must be created on the spot inside LazyInterpret, whether the op is an Input, a Variable, or an ordinary UserOp, because that information lives on the input tensor.

Contributor:

So the main reason is that we now infer the ParallelDesc from the tensor, which makes the ParallelDesc inside the Block-level scope mostly useless, right?

Contributor Author:

Yes.
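
To make the rule in this exchange concrete, here is a minimal sketch in hypothetical Python pseudocode (none of these names are OneFlow's real interface): the Block-level scope knows only module structure, while the placement has to be read off the input tensor itself.

    def scope_for_op(block_scope, input_tensor):
        # The placement that LazyInterpret needs lives on the tensor itself.
        if input_tensor.is_consistent:
            placement = input_tensor.placement  # devices the tensor spans
        else:
            # A local tensor implies a single-device placement on this rank.
            placement = {"device_tag": input_tensor.device_type,
                         "device_ids": [input_tensor.device_id]}
        # New scope = the Block scope's attributes + tensor-derived placement.
        return {**block_scope, "parallel_desc": placement}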

op_conf.set_device_tag(GetDeviceTagOfTensor(input_tensor));
// NOTE(chengcheng):
// We construct VariableOpConf instead of FeedVariableOpConf because FeedVariableOpExpr is JUST
// for getting the input EagerTensor.
VariableOpConf* var_conf = op_conf.mutable_variable_conf();
var_conf->set_out("out");
input_tensor->shape()->ToProto(var_conf->mutable_shape());
var_conf->set_data_type(input_tensor->dtype());
// NOTE(chengcheng): VariableOpConf initializer_conf is useless because variable is inited
// by EagerTensor.
var_conf->mutable_initializer()->mutable_empty_conf();
if (input_tensor->is_consistent()) {
// TODO(chengcheng): GenerateParallelDistributionString by tensor.
}
if (!input_tensor->requires_grad()) { var_conf->set_trainable(false); }
// TODO(chengcheng, xuxiaoyu): Set L1/L2 RegularizerConf by nn.Graph Optimizer
Contributor Author:

I left a TODO here: from the EagerTensor alone there is no way to know a variable's L1 and L2 parameters. PyTorch presumably keeps these in the Optimizer? @strint Xiaoyu, please look into this later and see how nn.Graph should support configuring L1 and L2 per variable.

Contributor:

Do you mean configuring a different learning rate for different parameters? Different parameters can be bound to different optimizers, one lr per optimizer: https://pytorch.org/docs/stable/optim.html

Contributor:

There is also this form:

    optim.SGD([
        {'params': model.base.parameters()},
        {'params': model.classifier.parameters(), 'lr': 1e-3}
    ], lr=1e-2, momentum=0.9)

The optimizer splits the parameters into multiple groups, each with its own lr, plus a default lr.

Contributor Author:

No. lr is the learning rate; what I am talking about are the l1/l2 regularization parameters.

Contributor:

torch does not unify this; you can always write it by hand when defining the loss. The standard hand-written form:
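
(For illustration, a common PyTorch pattern rather than necessarily the exact snippet referenced above; `model`, `criterion`, and a batch `(x, y)` are assumed to exist.)

    def regularized_loss(model, criterion, x, y, l1_lambda=1e-5, l2_lambda=1e-4):
        loss = criterion(model(x), y)
        # Hand-written penalties summed over all parameters.
        l1 = sum(p.abs().sum() for p in model.parameters())
        l2 = sum((p * p).sum() for p in model.parameters())
        return loss + l1_lambda * l1 + l2_lambda * l2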

Contributor:

Let's see whether it is worth adding an l1_norm/l2_norm option to the optimizer's per-group parameter configuration. @wyg1997

Contributor:

Every argument the Optimizer constructor accepts should support per-group override; if l1 and l2 are in the Optimizer's parameter list, every ParamGroup should support them too.
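
For comparison, this is how PyTorch already exposes its built-in L2 term, weight_decay, per group (a sketch; a `model` with `base`/`classifier` submodules is assumed):

    import torch.optim as optim

    # Any constructor argument may be overridden inside a param group.
    optimizer = optim.SGD(
        [
            {"params": model.base.parameters(), "weight_decay": 1e-4},
            {"params": model.classifier.parameters(), "weight_decay": 0.0},
        ],
        lr=1e-2,
        momentum=0.9,
    )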


auto infer_ctx = JUST(GetCurInferCtx());
OpAttribute op_attr = *JUST(infer_ctx->AddAndInferConsistentOp(op_conf));

const std::string& op_name = op_conf.name();

// temp debug log
std::cout << "cclog: Lazy nn.Graph AddOpName: " << op_name << std::endl
<< " and the origin op_conf is :" << op_conf.DebugString();

int64_t parallel_desc_sym_id = JUST(scope->GetParallelDescSymbolId(op_conf));
const std::shared_ptr<ParallelDesc>& blob_parallel_desc_sym =
JUST(GetSymbol<cfg::ParallelConf, ParallelDesc>(parallel_desc_sym_id));

// Check outputs num and setup output tensor properties.
CHECK_EQ_OR_RETURN(outputs->size(), 1);
CHECK_EQ_OR_RETURN(op_expr.output_size(), 1);

const std::string obn = "out"; // NOTE(chengcheng): obn is NOT op_expr.indexed_obns
const auto& parallel_attr =
JUST(compatible_py::GetOpArgParallelAttribute(blob_parallel_desc_sym, op_attr, obn));
const auto& blob_attr = JUST(compatible_py::GetOpArgBlobAttribute(op_attr, obn));

CHECK_OR_RETURN(!outputs->at(0).get());
(*outputs)[0] = JUST(OpInterpUtil::BuildTensor(blob_attr, parallel_attr, /*is_lazy=*/true));
// NOTE(chengcheng): Record variable op output LazyTensor
TensorNameScope::Global()->Record(outputs->at(0), op_name + "/" + obn);
// NOTE(chengcheng): Record EagerTensor as variable tensor name
TensorNameScope::Global()->Record(input_tensor, op_name + "/" + obn);
Contributor Author:

Here I record the input EagerTensor as well, so that when a later UserOp's LazyInterpret gets an EagerTensor as input, it can still resolve the correct lbn.
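
A minimal Python analogue of the double Record above (hypothetical; the real TensorNameScope is a C++ singleton): both tensors map to the same logical blob name (lbn), so a later op can look up either one.

    tensor2lbn = {}

    def record(tensor, lbn):
        tensor2lbn[id(tensor)] = lbn

    class FakeTensor:  # stand-in for eager/lazy tensors
        pass

    eager_in, lazy_out = FakeTensor(), FakeTensor()
    record(lazy_out, "cc_Variable_0/out")  # the variable op's lazy output
    record(eager_in, "cc_Variable_0/out")  # the eager input, same lbn
    assert tensor2lbn[id(eager_in)] == "cc_Variable_0/out"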

Contributor:

Where will the memory sharing between a variable's lazy tensor and its eager tensor be implemented?

Contributor Author:

In NNGraph; LazyInterpret does not care about that. The Python-side nn.Graph passes each variable's EagerTensor and its corresponding name into NNGraph, which records the information and binds the Regsts when the Runtime starts.
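
A hypothetical sketch of that division of labor (none of these names are OneFlow's real API):

    class NNGraphStub:
        def __init__(self):
            self._var_name2tensor = {}

        def register_variables(self, names, eager_tensors):
            # Recorded at graph-build time, after LazyInterpret named the ops.
            self._var_name2tensor = dict(zip(names, eager_tensors))

        def init_runtime(self):
            # At runtime startup, each variable op's Regst is bound to the
            # eager tensor's storage -- this is where the sharing happens.
            for name, tensor in self._var_name2tensor.items():
                print(f"bind Regst of {name!r} to buffer {id(tensor):#x}")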

return Maybe<void>::Ok();
}

89 changes: 89 additions & 0 deletions oneflow/python/test/graph/test_variable_op_expr.py
@@ -0,0 +1,89 @@
"""
Copyright 2020 The OneFlow Authors. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
"""
import unittest

import numpy as np
import os

os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "12139"
os.environ["WORLD_SIZE"] = "1"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"

import oneflow
import oneflow.experimental as flow
import oneflow.python.framework.session_context as session_ctx
import oneflow._oneflow_internal
from oneflow.python.framework.multi_client_session import MultiClientSession
import oneflow.python.framework.c_api_util as c_api_util


@flow.unittest.skip_unless_1n1d()
@unittest.skipIf(
not flow.unittest.env.eager_execution_enabled(),
"default use eager mode to test this case",
)
class TestFeedVariableTensor(unittest.TestCase):
def test_feed_var_tensor(test_case):
test_case.assertTrue(oneflow.distributed.is_multi_client())
test_case.assertTrue(
oneflow.python.framework.env_util.HasAllMultiClientEnvVars()
)

x = flow.Tensor(1, 1, 10, 10)
flow.nn.init.uniform_(x, a=-1.0, b=1.0)

session = session_ctx.GetDefaultSession()
test_case.assertTrue(isinstance(session, MultiClientSession))
session.TryInit()

with oneflow._oneflow_internal.lazy_mode.gard(True):

oneflow._oneflow_internal.JobBuildAndInferCtx_Open(
"cc_test_variable_op_expr_job"
)
job_conf = (
oneflow._oneflow_internal.oneflow.core.job.job_conf.JobConfigProto()
)
job_conf.set_job_name("cc_test_variable_op_expr_job")
job_conf.mutable_predict_conf()
c_api_util.CurJobBuildAndInferCtx_SetJobConf(job_conf)

op_name = "cc_Variable_0"
var_conf = (
oneflow._oneflow_internal.oneflow.core.operator.op_conf.FeedVariableOpConf()
)
var_conf.set_in_0("EagerTensorInput")
var_conf.set_out_0("out_0")

var_op = oneflow._oneflow_internal.one.FeedVariableOpExpr(
op_name, var_conf, ["in_0"], ["out_0"]
)
attrs = oneflow._oneflow_internal.MutableCfgAttrMap()

if not x.is_determined:
x.determine()
x_tensor_in_c = x._local_or_consistent_tensor

out_tensor = var_op.apply([x_tensor_in_c], attrs)[0]
test_case.assertEqual(out_tensor.shape, (1, 1, 10, 10))
test_case.assertTrue(out_tensor.is_lazy)
test_case.assertTrue(out_tensor.is_consistent)
Contributor:

Are all outputs consistent by default now?

Contributor Author:

Lazy mode really only has the Consistent concept; Mirror is expanded into Consistent as well. Even if what you pass in is a local tensor, it is translated into a ConsistentTensor whose placement is just the current rank.
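
Illustrative pseudocode for that rule (hypothetical helper, not OneFlow's API):

    def to_lazy_consistent(local_tensor, rank):
        # A local tensor on rank r becomes a consistent tensor whose
        # placement covers only rank r; with one device, sbp is trivial.
        return {"data": local_tensor, "placement": [rank], "sbp": "broadcast"}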



if __name__ == "__main__":
unittest.main()