fix bug in AddInputOutputOpsPass: check existence of key in HashMap(inferface_lbi2scope_sym_id) #5653

Merged
merged 14 commits into from
Aug 11, 2021


zzk0
Contributor

@zzk0 zzk0 commented Jul 29, 2021

Reproducing the bug

  1. Train a model.
  2. Save the model with python/oneflow/serving/saved_model_builder.py, adding a custom Signature.
  3. Load the model with python/oneflow/serving/inference_session.py. The call to CurJobBuildAndInferCtx_Complete fails with an error.

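Locating the problem

I single-stepped through the code until I found that the problem is in AddInputOutputOpsPass. After reading the code carefully, I found that if an Input exists but does not take part in the computation, that is, it cannot be reached by traversing backward from the outputs, it is never added to a HashMap. Later code, however, uses the signature to look up each Input's key in that HashMap.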
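The reachability argument above can be illustrated with a small, self-contained sketch. It is not OneFlow code: `ProducerMap`, `CollectReachableOps`, and `MakeToyMlpGraph` are hypothetical names, and the toy graph collapses the omitted middle ops of the saved model into a single assumed edge from `dense2-bias_add` to `Input_14`. It only shows why an input that no output depends on is never visited by a backward traversal.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical toy graph: maps each op name to the ops that produce its inputs.
using ProducerMap = std::unordered_map<std::string, std::vector<std::string>>;

// Collects every op reachable by walking backward from the given output ops.
std::unordered_set<std::string> CollectReachableOps(
    const ProducerMap& producers, const std::vector<std::string>& outputs) {
  std::unordered_set<std::string> visited;
  std::vector<std::string> stack(outputs.begin(), outputs.end());
  while (!stack.empty()) {
    std::string op = stack.back();
    stack.pop_back();
    if (!visited.insert(op).second) { continue; }  // already visited
    auto it = producers.find(op);
    if (it == producers.end()) { continue; }  // op unknown to the toy graph
    for (const std::string& producer : it->second) { stack.push_back(producer); }
  }
  return visited;
}

// Assumed shape of the graph in this report: Input_15 is defined, but
// nothing on the path to Return_17 consumes it.
ProducerMap MakeToyMlpGraph() {
  ProducerMap g;
  g["Return_17"] = std::vector<std::string>{"dense2-bias_add"};
  g["dense2-bias_add"] = std::vector<std::string>{"Input_14"};  // middle ops collapsed
  g["Input_14"] = std::vector<std::string>{};
  g["Input_15"] = std::vector<std::string>{};
  return g;
}
```

Calling `CollectReachableOps(MakeToyMlpGraph(), {"Return_17"})` yields `Return_17`, `dense2-bias_add`, and `Input_14`. `Input_15` is absent, so any map populated only during such a traversal has no entry for it.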

The error occurs at

int64_t scope_sym_id = inferface_lbi2scope_sym_id.at(input_def.lbi());
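For context, `at()` on a standard hash map throws `std::out_of_range` when the key is missing, and pybind11 translates that exception into Python's `IndexError`, which matches the `IndexError: _Map_base::at` in the traceback. The sketch below shows the general pattern of checking for the key before reading it. It is an illustration, not the diff merged in this PR: `Lbi2ScopeSymId`, `FindScopeSymIdOrNull`, and `AtThrowsForMissingKey` are hypothetical names, and the map is keyed by plain strings rather than the real lbi type.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the real map: keyed by "op_name/blob_name"
// strings instead of logical blob ids.
using Lbi2ScopeSymId = std::unordered_map<std::string, std::int64_t>;

// Returns a pointer to the stored scope symbol id, or nullptr if the key
// was never recorded. Unlike map.at(key), this never throws.
const std::int64_t* FindScopeSymIdOrNull(const Lbi2ScopeSymId& map,
                                         const std::string& key) {
  auto it = map.find(key);
  return it == map.end() ? nullptr : &it->second;
}

// Demonstrates the failure mode: reports whether at() throws for this key.
bool AtThrowsForMissingKey(const Lbi2ScopeSymId& map, const std::string& key) {
  try {
    (void)map.at(key);
    return false;
  } catch (const std::out_of_range&) {
    return true;
  }
}
```

With only `"Input_14/out"` recorded, `FindScopeSymIdOrNull(map, "Input_15/out")` returns `nullptr`, so the caller can decide how to handle an unused input instead of aborting the whole pass. What the caller should do in that case (skip the input, fall back to another scope id, or report a descriptive error) is a design decision this sketch leaves open.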

Appendix:

The error message and the saved model are below.

Traceback (most recent call last):
  File "mlp_load.py", line 20, in <module>
    sess.load_saved_model(saved_model_dir="./models", model_version=2)
  File "/home/percent1/oneflow/build_debug/python_scripts/oneflow/python/serving/inference_session.py", line 399, in load_saved_model
    self.compile(graph_def.op_list)
  File "/home/percent1/oneflow/build_debug/python_scripts/oneflow/python/serving/inference_session.py", line 316, in compile
    oneflow._oneflow_internal.CurJobBuildAndInferCtx_Complete()
IndexError: _Map_base::at
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0729 02:03:32.338528 13585 global_process_ctx.cpp:47] Check failed: 'Global<ProcessCtx>::Get()' Must be non NULL 
*** Check failure stack trace: ***
    @     0x7f35ade39ee6  google::LogMessage::Fail()
    @     0x7f35ade39e2e  google::LogMessage::SendToLog()
    @     0x7f35ade3976f  google::LogMessage::Flush()
    @     0x7f35ade3cbcc  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f35a8970a73  google::CheckNotNull<>()
    @     0x7f35a8e75a37  oneflow::GlobalProcessCtx::IsThisProcessMaster()
    @     0x7f35a81218d8  oneflow::DestroyLazyGlobalSession()
    @     0x7f35a8122799  DestroyLazyGlobalSession()
    @     0x7f35a7f27db9  _ZNO8pybind116detail15argument_loaderIJEE9call_implIvRPFvvEJENS0_9void_typeEEET_OT0_NS0_14index_sequenceIJXspT1_EEEEOT2_
    @     0x7f35a7f27c74  _ZNO8pybind116detail15argument_loaderIJEE4callIvNS0_9void_typeERPFvvEEENSt9enable_ifIXsrSt7is_voidIT_E5valueES4_E4typeEOT1_
    @     0x7f35a7f27618  _ZZN8pybind1112cpp_function10initializeIRPFvvEvJEJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENKUlRNS_6detail13function_callEE1_clESL_
    @     0x7f35a7f27689  _ZZN8pybind1112cpp_function10initializeIRPFvvEvJEJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESL_
    @     0x7f35a72fc6e7  pybind11::cpp_function::dispatcher()
    @     0x563999cd9706  PyCFunction_Call
    @     0x563999c9725f  _PyObject_MakeTpCall
    @     0x563999d1f719  _PyEval_EvalFrameDefault
    @     0x563999ce516b  _PyFunction_Vectorcall.localalias.353
    @     0x563999c5956d  _PyEval_EvalFrameDefault.cold.2793
    @     0x563999ce516b  _PyFunction_Vectorcall.localalias.353
    @     0x563999bd7ef9  _PyObject_Vectorcall.lto_priv.9
    @     0x563999c973ef  call_unbound_noarg
    @     0x563999d62b99  slot_tp_finalize
    @     0x563999c962b5  collect.constprop.446
    @     0x563999d8f2aa  _PyGC_CollectNoFail
    @     0x563999da5b55  PyImport_Cleanup
    @     0x563999da5d13  Py_FinalizeEx
    @     0x563999da8200  Py_RunMain
    @     0x563999da8389  Py_BytesMain
    @     0x7f35d34aabf7  __libc_start_main
    @     0x563999d38553  (unknown)
Aborted (core dumped)

The model, with the middle part removed. Input_15 below is the unused input.

name: "mlp"
version: 2
checkpoint_dir: "variables"
graphs {
  key: "mlp_inference"
  value {
    op_list {
      name: "Input_14"
      device_tag: "gpu"
      scope_symbol_id: 4611686018427486206
      input_conf {
        out: "out"
        blob_conf {
          shape {
            dim: 1
            dim: 1
            dim: 28
            dim: 28
          }
          data_type: kFloat
          is_dynamic: false
          parallel_distribution {
            sbp_parallel {
              split_parallel {
                axis: 0
              }
            }
          }
        }
      }
    }
    op_list {
      name: "Input_15"
      device_tag: "gpu"
      scope_symbol_id: 4611686018427486206
      input_conf {
        out: "out"
        blob_conf {
          shape {
            dim: 1
          }
          data_type: kInt32
          is_dynamic: false
          parallel_distribution {
            sbp_parallel {
              split_parallel {
                axis: 0
              }
            }
          }
        }
      }
    }
... (omitted)
    op_list {
      name: "Return_17"
      device_tag: "cpu"
      scope_symbol_id: 4611686018427502590
      return_conf {
        in: "dense2-bias_add/out_0"
        out: "out"
      }
    }
    signatures {
      key: "mlp"
      value {
        inputs {
          key: "image"
          value {
            lbi {
              op_name: "Input_14"
              blob_name: "out"
            }
            blob_conf {
              shape {
                dim: 1
                dim: 1
                dim: 28
                dim: 28
              }
              data_type: kFloat
              is_dynamic: false
              parallel_distribution {
                sbp_parallel {
                  split_parallel {
                    axis: 0
                  }
                }
              }
            }
          }
        }
        inputs {
          key: "label"
          value {
            lbi {
              op_name: "Input_15"
              blob_name: "out"
            }
            blob_conf {
              shape {
                dim: 1
              }
              data_type: kInt32
              is_dynamic: false
              parallel_distribution {
                sbp_parallel {
                  split_parallel {
                    axis: 0
                  }
                }
              }
            }
          }
        }
        outputs {
          key: "output"
          value {
            lbi {
              op_name: "dense2-bias_add"
              blob_name: "out_0"
            }
          }
        }
      }
    }
    default_signature_name: "mlp"
  }
}
default_graph_name: "mlp_inference"

@zzk0 zzk0 closed this Aug 3, 2021
zzk0 added a commit to zzk0/oneflow that referenced this pull request Aug 3, 2021
…ce of key in HashMap(inferface_lbi2scope_sym_id)
@zzk0 zzk0 reopened this Aug 6, 2021
@zzk0 zzk0 requested a review from leaves-zwx August 6, 2021 09:14
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 10, 2021 03:59
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 10, 2021 06:36
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 10, 2021 08:35
@oneflow-ci-bot oneflow-ci-bot self-requested a review August 10, 2021 10:29
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 10, 2021 13:24
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 10, 2021 16:29
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 10, 2021 17:50
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 10, 2021 19:15
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 10, 2021 23:43
@oneflow-ci-bot oneflow-ci-bot self-requested a review August 11, 2021 01:12
@oneflow-ci-bot oneflow-ci-bot merged commit 24e4ea2 into Oneflow-Inc:master Aug 11, 2021
4 participants