
Feat multi input sharing graph, save and load compiled graph #9754

Merged
merged 28 commits into master from feat_multi_in on Feb 1, 2023

Conversation

strint (Contributor) commented on Jan 15, 2023:

Support sharing the compiled graph and variables.

  • Support re-inferring shapes on a graph after complete;
  • Split the processing between the pass-processed graph and runtime init into stages, so that some intermediate results can be extracted and some stages can be skipped;
  • Support passing in the previous graph's job, inferring the job from the new inputs, and initializing the runtime;
  • Support execution with inputs of new shapes; some shared data may need extra handling;
  • input/output/variable adaptation
    • inputs and outputs are new tensors, but keep the same names;
    • variable tensors are reused (this solves sharing of variables newly created by constant folding);
    • input shapes are constructed from the new input tensors, to support inferring outputs with new shapes;
  • Verify that memory sharing of parameters and activations across multiple graphs works correctly;
    • for activation memory sharing, the execution with the larger input must run first;
    • parameters, including those produced by constant folding, can be shared; this is achieved by sharing the compiled job and the parameters;
  • [Follow-up] If source ops are built according to input shapes, extra handling needs to be added
    • re-inference of free eager tensors is a bit trickier; check whether this occurs in SD

Support saving and loading runtime state to enable offline compilation (a usage sketch follows the list below).

  • runtime_state_dict
  • load_runtime_state_dict
  • work with graph sharing
  • flow.save/load runtime_state_dict
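
The following is a minimal usage sketch assembled from the test snippets later in this PR; MyGraph, model, x0 and x1 are placeholder names, and the exact call order is an assumption:

    import oneflow as flow

    class MyGraph(flow.nn.Graph):
        def __init__(self, model):
            super().__init__()
            self.model = model

        def build(self, x):
            return self.model(x)

    # Graph sharing: the second graph reuses the first graph's compiled job and variables.
    g0 = MyGraph(model)
    g0.enable_shared()            # allow g0 to be shared
    y0 = g0(x0)                   # first compile and run
    g1 = MyGraph(model)
    g1.share_from(g0)             # share g0's optimized job and parameters
    y1 = g1(x1)                   # x1 may have a new shape

    # Offline compilation: save and reload the runtime state.
    g2 = MyGraph(model)
    g2.enable_save_runtime_state_dict()
    y2 = g2(x0)
    state = g2.runtime_state_dict()
    flow.save(state, "g2_runtime")                        # persist to disk
    g3 = MyGraph(model)
    g3.load_runtime_state_dict(flow.load("g2_runtime"))   # run without recompiling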

strint marked this pull request as ready for review on January 15, 2023 16:57
strint changed the title from "Feat multi input of a graph" to "[WIP]Feat multi input of a graph" on Jan 15, 2023
strint requested review from leaves-zwx and removed the review requests for daquexian and BBuf on January 15, 2023 16:57
GetInputCriticalSectionCallbackBufferName(new_job_name));
} else if (buffer_name.rfind(kOutputCriticalSectionCallbackBufferNamePrefix, 0) == 0) {
op_conf.mutable_critical_section_callback_tick_conf()->set_buffer_name(
GetOutputCriticalSectionCallbackBufferName(new_job_name));
strint (author):

@chengtbf After discussion, we found that the hang was caused by the tick-related ops above: they all communicate through global buffers, and buffer communication needs independently generated buffer names.
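
For illustration, a minimal sketch of how these per-job buffer names are assumed to be formed, based on the prefixes visible elsewhere in this PR (e.g. "SourceTick-"); the Python helper below is hypothetical and only mirrors the C++ GetSourceTickBufferName:

    SOURCE_TICK_PREFIX = "SourceTick-"  # matches kSourceTickBufferNamePrefix in this PR

    def get_source_tick_buffer_name(job_name: str) -> str:
        # Assumption: buffer name = fixed prefix + job name, so giving the shared
        # graph a new job name yields new, independent global buffers.
        return SOURCE_TICK_PREFIX + job_name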

buffer_mgr->Get(GetSourceTickBufferName(job_name))->Push(job_instance);
LOG(INFO) << "vm run lazy " << job_name << " push source tick "
<< " run count " << run_cnt;
strint (author):

To be removed after debugging is done.

@@ -433,6 +442,9 @@ void Actor::ActUntilFail() {
AsyncRetInplaceConsumedRegstIfNoConsumer();

AsyncSendQueuedMsg();
LOG(INFO) << "Actor " << actor_id_ << " name " << op_name << " finish to act count "
<< act_cnt_;
++act_cnt_;
strint (author):

To be removed after debugging is done.

leaves-zwx (Contributor) commented:

Traceback (most recent call last):
  File "sd2_text2img.py", line 52, in <module>
    text_to_image(prompt, 512)
  File "sd2_text2img.py", line 32, in text_to_image
    def text_to_image(prompt, image_size, num_images_per_prompt=1, prefix=""):
  File "/home/zhangwenxiao/repos/oneflow/python/oneflow/autograd/autograd_mode.py", line 154, in wrapper
    return func(*args, **kwargs)
  File "/home/zhangwenxiao/repos/diffusers/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_oneflow.py", line 650, in __call__
    vae_post_process_graph.compile(latents)
  File "/home/zhangwenxiao/repos/diffusers/src/diffusers/oneflow_graph_compile_cache.py", line 29, in compile
    self.graph_._compile_from_shared(*args, **kwargs)
  File "/home/zhangwenxiao/repos/oneflow/python/oneflow/nn/graph/graph.py", line 838, in _compile_from_shared
    self._c_nn_graph.build_with_new_input_from_shared_graph(
oneflow._oneflow_internal.exception.RuntimeError: Error: Element number in input blob must be an integer multiple of reshape_conf, but got 4718592 and 2097152

The reshape problem showed up, inside SD2.

chengtbf (Contributor) commented:

oneflow._oneflow_internal.exception.RuntimeError: Error: Element number in input blob must be an integer multiple of reshape_conf, but got 4718592 and 2097152

The reshape problem showed up, inside SD2.

One possible solution: have the new graph run build one more time. This build would not trigger the subsequent job passes, but it would obtain new, valid shape-related configuration such as the reshape conf (random, input, etc.).

leaves-zwx (Contributor) replied:

One possible solution: have the new graph run build one more time. This build would not trigger the subsequent job passes, but it would obtain new, valid shape-related configuration such as the reshape conf (random, input, etc.).

That should work; the point of re-running build is exactly to refresh the attrs and so on. But how do we associate the newly built reshape (and similar) ops with the corresponding reshape ops in the old graph (whose new attrs need to be filled in)?

chengtbf (Contributor) replied:

One possible solution: have the new graph run build one more time. This build would not trigger the subsequent job passes, but it would obtain new, valid shape-related configuration such as the reshape conf (random, input, etc.).

That should work; the point of re-running build is exactly to refresh the attrs and so on. But how do we associate the newly built reshape (and similar) ops with the corresponding reshape ops in the old graph (whose new attrs need to be filled in)?

Associate them by creation order, just like how the input conf attr shapes are associated. It is only a bit more complicated. There are several options, for example (see the sketch after this list):

  • the original graph build records the op creation order, giving an order -> op_conf -> shape mapping
  • subsequent graph builds use that order to recover the mapping, giving a new op name -> order -> op_conf -> shape -> new_op_conf mapping
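
A minimal sketch of the order-based association described above; old_job and new_job are hypothetical Job protos that both come straight from nn.Graph.build, so their net.op lists are assumed to preserve the creation order:

    # Record the creation order of the original graph's ops.
    order_to_old_op = {i: op for i, op in enumerate(old_job.net.op)}

    # A later build with new input shapes emits ops in the same order,
    # so the i-th new op corresponds to the i-th old op.
    new_name_to_old_op = {}
    for i, new_op in enumerate(new_job.net.op):
        new_name_to_old_op[new_op.name] = order_to_old_op[i]
        # Shape-related attrs of the old op (e.g. a reshape conf) can now be
        # refreshed from new_op's attrs.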

leaves-zwx (Contributor) replied:

Associate them by creation order, just like how the input conf attr shapes are associated. It is only a bit more complicated. There are several options, for example:

  • the original graph build records the op creation order, giving an order -> op_conf -> shape mapping
  • subsequent graph builds use that order to recover the mapping, giving a new op name -> order -> op_conf -> shape -> new_op_conf mapping

Is this order stable across builds and after the passes run? Some passes rewrite the graph and would affect the order, right? Or do we ignore the passes entirely and only consider the original graph?

chengtbf (Contributor) replied:

Is this order stable across builds and after the passes run? Some passes rewrite the graph and would affect the order, right? Or do we ignore the passes entirely and only consider the original graph?

Only the original graph is considered; job pass rewrites are not (reshape will not be fused). In the original build logic, ops are triggered in the execution order of the Python script, which is the same every time.

def forward(self, x):
    y = self.linear(x)
    assert len(y.shape) == 2
    return flow.reshape(y, (y.shape[1], y.shape[0]))
strint (author):

The reshape test passes.

auto attr_iter = new_op_conf->user_conf().attr().find(pair.first);
CHECK_OR_RETURN(attr_iter != new_op_conf->user_conf().attr().end())
<< " There is not attr " << pair.first << " in new op " << new_op_conf->DebugString();
*pair.second.mutable_at_shape() = attr_iter->second.at_shape();
strint (author):

Update the shape attr.

NewOp4SharedOpName) {
// job is a copy from a shared graph.
// The job name has already been updated in the Python nn.Graph.
const auto& new_job_name = job->job_conf().job_name();
strint (author):

Restructured this a bit; it should be clearer now.

destination = OrderedDict()
destination._metadata = OrderedDict()

destination["graph_name"] = self.name
strint (author) commented on Jan 31, 2023:

runtime_state_dict stores the information the graph needs to execute at runtime (a rough sketch of its layout follows this list):

  • job name
  • job id
  • input tensors and their names
  • output tensors and their names
  • variable tensors and their names
  • plan
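
A hedged sketch of what such a runtime_state_dict roughly contains. It is an OrderedDict in the PR; only the "graph_name", "exe_plan" and "states" keys appear verbatim in the snippets in this thread, so the remaining key names and the exact value layout are assumptions:

    runtime_state = {
        "graph_name": "linear_g",              # job name
        "job_id": 0,                            # job id (assumed key name)
        "inputs": {"input.0.0": in_tensor},     # input tensors and their op names (assumed layout)
        "outputs": {"output.0.0": out_tensor},  # output tensors and their op names (assumed layout)
        "states": state_tensors,                # variable tensors and their names
        "exe_plan": serialized_plan,            # the compiled execution plan
    }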

leaves-zwx (Contributor) replied:

  • input tensors and their names
  • output tensors and their names

Output tensors and names? Why do we need to save the input and output tensors at all? Is this really just saving the tensor meta?

strint (author) replied:

Yes, in fact only the meta is needed. But the machinery for saving tensors is mature, so the tensors are saved directly.

Inside the C NNGraph only the meta is actually used as well, but it is likewise stored as tensors.

# Create a c nn graph to run with lazy runtime.
self._c_nn_graph = oneflow._oneflow_internal.nn.graph.CNNGraph(
    self._name,
    state_dict["exe_plan"],
strint (author):

When loading the graph, the plan is passed in directly.

github-actions bot commented:

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 143.0ms (= 14301.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 166.5ms (= 16654.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 166.5ms / 143.0ms)

OneFlow resnet50 time: 87.9ms (= 8789.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 104.6ms (= 10458.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 104.6ms / 87.9ms)

OneFlow resnet50 time: 60.3ms (= 12069.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.5ms (= 15896.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 79.5ms / 60.3ms)

OneFlow resnet50 time: 46.2ms (= 9248.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.8ms (= 15965.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.73 (= 79.8ms / 46.2ms)

OneFlow resnet50 time: 41.8ms (= 8353.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.1ms (= 13612.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.63 (= 68.1ms / 41.8ms)

github-actions bot commented:

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9754/

"OutputCriticalSectionCallback-";
static const std::string kInputBufferNamePrefix = "Input-";
static const std::string kOutputBufferNamePrefix = "Output-";
static const std::string kSourceTickBufferNamePrefix = "SourceTick-";
strint (author):

These names are used outside this file, so they were moved outside the function.

: name_(name),
job_id_(job_id),
session_ctx_(session_ctx),
plan_(plan),
strint (author):

Support initializing NNGraph from a plan.

@@ -28,7 +29,15 @@ class JobCompleter final {
JobCompleter() = default;
~JobCompleter() = default;

Maybe<void> Complete(Job* job) const;
static Maybe<void> Complete(Job* job);
strint (author):

The Complete function was changed to static.

@@ -262,8 +262,8 @@ def is_deprecated(func_or_class):
import oneflow.framework.session_context as session_ctx
from oneflow.framework.tensor_str import set_printoptions

__oneflow_global_unique_env = env_util.GetEnv()
session_ctx.NewDefaultSession(__oneflow_global_unique_env)
_oneflow_global_unique_env = env_util.GetEnv()
strint (author):

Creating a new session depends on env, so the double underscore prefix was removed so that other modules can access env.

@@ -910,7 +1207,13 @@ def __build_graph(self, *args, **kwargs):
with graph_build_util.graph_build_context(self.config.proto, self._session):
# Deal with inputs
self.__print(0, 1, self._shallow_repr() + " start building graph inputs.")
arg_op_names, lazy_args, lazy_kwargs, self._args_repr, _ = self.__build_io(
(
self._input_op_names,
leaves-zwx (Contributor) commented on Feb 1, 2023:

Isn't this field only present when enable_save_runtime_state_dict is true?

strint (author) replied:

Right, I'll change it back; this previously did not use the share switch and just saved it unconditionally.

output_op_names,
self._eager_outputs,
self._output_op_names,
self._build_eager_outputs,
Contributor:

I don't see where this field is set.

Contributor:

Also the out2name below.

strint (author) replied:

It is set above, but not very clearly; I'll adjust it.

_, # empty kwargs return
outs_repr,
out2name,
) = self.__build_io("output", graph_build_util.build_graph_output, *outputs)
Contributor:

The outputs were converted to a tuple earlier; what is the point of unpacking them again here?

strint (author) replied:

This is how __build_io is designed. Building into a tuple assumes the input takes the form (args, kwargs), which is more general.

Contributor:

Passing outputs in directly would also fit that design, right? outputs would just be parsed as one of the args.

strint (author) replied:

Now I remember: it is there to handle an edge case: #7539

}
return std::make_shared<NNGraph>(name, job, job_id, session_ctx);
}))
.def(py::init([](const std::string& name, const std::string& serialized_plan, int64_t job_id,
const std::shared_ptr<MultiClientSessionContext>& session_ctx,
Contributor:

Is this ctx meant to restore the session that existed at save time?

strint (author) replied:

Is this ctx meant to restore the session that existed at save time?

This ctx was introduced by the earlier PR that releases graph/session/env via reference counting; this PR only adds a constructor that builds the C NNGraph from a plan.


// NOTE(chengcheng): Singleton<JobDesc> need be clear before GlobalJobDescScope construct.
if (Singleton<JobDesc>::Get() != nullptr) { Singleton<JobDesc>::Delete(); }
Contributor:

Where did this line of logic get moved to?

Maybe<void> NNGraph::BuildWithNewInputFromSharedGraph(
const std::vector<std::string>& shared_inputs_op_names,
const std::vector<std::shared_ptr<one::Tensor>>& new_input_tensors,
const std::vector<std::string>& shared_op_names, const std::string& new_serialized_job) {
Contributor:

What is the shared_op_names parameter for? Isn't all of it already in new_serialized_job?

strint (author) replied:

What is the shared_op_names parameter for? Isn't all of it already in new_serialized_job?

        shared_op_names = []
        for op_idx in range(len(self._forward_job_proto.net.op)):
            shared_op_names.append(
                self._shared_graph._forward_job_proto.net.op[op_idx].name
            )

shared_op_names comes from the original logical graph produced directly by build, whereas new_serialized_job already contains the optimized graph.
In the optimized graph there is no longer any guarantee on op order.

Contributor:

new_serialized_job already contains the optimized graph

Then how do we guarantee that shared_op_names and new_serialized_job stay consistent? new_serialized_job may no longer contain some of the ops listed in shared_op_names.

for (int64_t idx = 0; idx < shared_inputs_op_names.size(); ++idx) {
input_name2tensor.emplace(shared_inputs_op_names[idx], new_input_tensors[idx]);
}
const auto& InputTensor4Name =
Contributor:

Shouldn't this be placed inside RegisterInputOpNamesAndTensors?

strint (author) replied:

Shouldn't this be placed inside RegisterInputOpNamesAndTensors?

InputTensor4Name is only used below; RegisterInputOpNamesAndTensors above does not depend on this lookup, so it is written so that it can be released as soon as it has been used.

for (int64_t op_idx = 0; op_idx < shared_op_names.size(); ++op_idx) {
// Assume that the new graph and the shared graph from nn.Graph.build have the same op order.
const auto& op = new_build_job.mutable_net()->mutable_op()->at(op_idx);
shared_op_name2_new_op.emplace(shared_op_names[op_idx], &op);
Contributor:

Actually, couldn't new_build_job just be passed directly to CompleteSharedGraphForNewInput?

strint (author) replied:

Actually, couldn't new_build_job just be passed directly to CompleteSharedGraphForNewInput?

This map actually goes from the shared op name to the op in the new build job.

The op order is used in between to make the correspondence: shared op name -> op order -> op in the new build job.

This prepares for modifying the shared graph's op attrs later on, so passing only new_build_job is not enough.

I'll rename this and add a comment.

Contributor:

so passing only new_build_job is not enough

My understanding is that this is because new_build_job does not carry the op order?

strint (author) replied on Feb 1, 2023:

new_build_job is a temporary job produced by the new build call (it was renamed later); it only serves as a dictionary of the new graph's attrs.

So the op name information has to be passed in addition, to maintain the correspondence between the new and old ops (a sketch of this mapping follows below).
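
A minimal Python-level sketch of the correspondence described above; shared_job and new_build_job are assumed to be Job protos whose net.op lists both preserve the nn.Graph.build order, and the attr-refresh step is simplified:

    # Op names of the original (shared) graph, in build order.
    shared_op_names = [op.name for op in shared_job.net.op]

    # Assumption: the new build emits ops in the same order, so the op at
    # index i in new_build_job corresponds to shared_op_names[i].
    shared_name_to_new_op = {
        shared_op_names[i]: new_op for i, new_op in enumerate(new_build_job.net.op)
    }

    # Shape-related attrs of the shared, optimized job (e.g. a reshape op's
    # shape attr) can then be refreshed from the matching new op's attrs.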

state_dict["states"]
)
if type(self) != Graph:
# Graph init with eager module, try to share mem with eager module
Contributor:

Isn't this state dict loaded when loading the plan? Why does it still need to deal with the eager module?

strint (author) replied:

    if not load_with_eager:
        # graph initialized without an eager module
        linear_g = flow.nn.Graph()
    else:
        # graph initialized with an eager module
        class LinearGraph(flow.nn.Graph):
            def __init__(self):
                super().__init__()
                self.my_linear = linear_reshape

            def build(self, x):
                return self.my_linear(x)

        linear_g = LinearGraph()
    # load the runtime state
    linear_g.load_runtime_state_dict(state_dict_list[0])

When testing SD, it used the graph initialization that carries an eager module.

We found that if parameter sharing with the eager module is not handled, there is an extra 1.8 GB of GPU memory overhead, so handling for this case was added.

):
if self._enable_save_runtime_state_dict or self._enable_shared_from_this:
self._input_op_names = input_op_names
self._output_op_names = output_op_names
Contributor:

Is there any cost to saving these by default (without gating on _enable_save_runtime_state_dict)?

strint (author) replied:

The gating was mainly there because of the extra cost of saving tensors, e.g. keeping _inputs_tensor_tuple uses extra GPU memory.

And since that was being considered anyway, everything got gated.

) = oneflow._oneflow_internal.DumpVariableTensorMgr()
self._state_tensor_tuple = convert_to_tensor_tuple(state_tensors)
self._state_tensor_tuple = convert_to_tensor_tuple(self._state_tensors)
Contributor:

It looks like _state_tensors could be a temporary variable and need not be kept on self?

I only see runtime_state_dict using _state_op_names and _state_tensor_tuple.

strint (author) replied:

Indeed; it has been removed.


github-actions bot commented Feb 1, 2023

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 141.9ms (= 14194.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 165.0ms (= 16503.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 165.0ms / 141.9ms)

OneFlow resnet50 time: 87.6ms (= 8762.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 105.5ms (= 10552.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 105.5ms / 87.6ms)

OneFlow resnet50 time: 60.2ms (= 12033.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.5ms (= 15906.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 79.5ms / 60.2ms)

OneFlow resnet50 time: 45.7ms (= 9141.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.8ms (= 14158.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.55 (= 70.8ms / 45.7ms)

OneFlow resnet50 time: 42.8ms (= 8551.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.5ms (= 13503.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.58 (= 67.5ms / 42.8ms)


github-actions bot commented Feb 1, 2023

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9754/


github-actions bot commented Feb 1, 2023

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 141.2ms (= 14118.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 164.6ms (= 16458.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 164.6ms / 141.2ms)

OneFlow resnet50 time: 88.0ms (= 8798.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 103.4ms (= 10341.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 103.4ms / 88.0ms)

OneFlow resnet50 time: 60.3ms (= 12064.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 89.4ms (= 17885.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.48 (= 89.4ms / 60.3ms)

OneFlow resnet50 time: 44.8ms (= 8968.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.1ms (= 13828.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.54 (= 69.1ms / 44.8ms)

OneFlow resnet50 time: 39.7ms (= 7933.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.3ms (= 13466.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.70 (= 67.3ms / 39.7ms)


github-actions bot commented Feb 1, 2023

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9754/

chengtbf (Contributor) left a review comment:

Done reading; I have replied to everything.


github-actions bot commented Feb 1, 2023

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 141.2ms (= 14121.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 166.1ms (= 16612.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 166.1ms / 141.2ms)

OneFlow resnet50 time: 87.4ms (= 8741.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 114.5ms (= 11447.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.31 (= 114.5ms / 87.4ms)

OneFlow resnet50 time: 59.4ms (= 11876.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.1ms (= 15621.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.32 (= 78.1ms / 59.4ms)

OneFlow resnet50 time: 46.1ms (= 9214.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.1ms (= 14228.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.54 (= 71.1ms / 46.1ms)

OneFlow resnet50 time: 41.3ms (= 8268.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.5ms (= 13508.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.63 (= 67.5ms / 41.3ms)


github-actions bot commented Feb 1, 2023

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9754/

mergify bot merged commit aef9981 into master on Feb 1, 2023
mergify bot deleted the feat_multi_in branch on February 1, 2023 11:29
return self.my_linear(x)

linear_g = LinearGraph()
linear_g.enable_shared()
strint (author):

The first graph: allow it to be shared.

test_case.assertTrue(np.array_equal(of_lazy_out.numpy(), of_eager_out.numpy()))

linear_g1 = LinearGraph()
linear_g1.share_from(linear_g)
strint (author):

The second graph: it shares the first graph's optimized graph and parameters.

return self.my_linear(x)

linear_g = LinearGraph()
linear_g.enable_save_runtime_state_dict()
strint (author):

For offline compilation, allow the graph to save its runtime state.

return_dict["save1"] = test_case1

state_dict_list = []
state_dict0 = linear_g.runtime_state_dict()
strint (author):

For offline compilation, fetch the graph's runtime state.

This state_dict can be saved with flow.save (a small save/load sketch follows below).
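
A minimal sketch of the save/load round trip; "saved_graph" is a placeholder path, and state_dict_list/LinearGraph are reused from the test snippets around this thread:

    # Persist the runtime state to disk as the offline compilation artifact.
    flow.save(state_dict_list, "saved_graph")

    # Later, possibly in another process: load it back and restore the graph
    # without recompiling.
    loaded = flow.load("saved_graph")
    linear_g = LinearGraph()
    linear_g.load_runtime_state_dict(loaded[0])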

linear_g = LinearGraph()
if with_share is True:
    linear_g.enable_shared()
linear_g.load_runtime_state_dict(state_dict_list[0])
strint (author):

For online loading, the state_dict is fetched from disk with flow.load(); then the graph just loads the runtime state.
