'MobileNetV2' object has no attribute 'to_global' #7560

tattain404 · 2022-02-22T01:38:10Z

Summary

I ran the code from the tutorial https://docs.oneflow.org/master/parallelism/05_ddp.html
(section "Data-parallel training by setting SBP"), but it fails with

'MobileNetV2' object has no attribute 'to_global'

I also tried defining a NeuralNetwork class using class NeuralNetwork(nn.Module):
and model = NeuralNetwork().to(DEVICE),
then calling model.to_global to distribute the model across the GPUs, but it shows the same error:
'nn.Module' object has no attribute 'to_global'

Code to reproduce bug

Same code https://docs.oneflow.org/master/parallelism/05_ddp.html#sbp

System Information

What is your OneFlow installation (pip, source, dockerhub):
python3 -m pip install -f https://release.oneflow.info oneflow==0.6.0+cu101
OS: Linux version 3.10.0-862.el7.x86_64 (builder@kbuilder.dev.centos.org)
OneFlow version (run python3 -m oneflow --doctor): 0.6.0
Python version: 3.9.7
CUDA driver version: CUDA Version: 10.1
GPU models: 8 x Tesla V100-SXM2 32GB
Other info:


doombeaker · 2022-02-22T02:47:41Z

Hi, thanks for opening this issue.

The documentation URL you posted is for the master version of OneFlow, which should be installed with: python3 -m pip install -f https://staging.oneflow.info/branch/master/cu102 --pre oneflow (this command can be found on docs.oneflow.org).

It seems you installed OneFlow 0.6.0 but referred to the documentation for the master version. Upgrading OneFlow to the master version may solve this problem. Alternatively, you can downgrade your code by replacing to_global with the old API, to_consistent.

Thanks again for your feedback.
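
For a script that has to run on both releases, one option is to call whichever conversion method the installed build exposes. The following is a minimal sketch and not part of OneFlow: `to_distributed` is a hypothetical helper name, and it assumes both methods accept the same `placement` and `sbp` keyword arguments, as the tutorial and the traceback in this thread suggest.

```python
def to_distributed(module, placement, sbp):
    """Convert `module` with to_global if present, else with to_consistent.

    Hypothetical helper; assumes both methods take placement= and sbp=.
    """
    for name in ("to_global", "to_consistent"):
        method = getattr(module, name, None)
        if callable(method):
            return method(placement=placement, sbp=sbp)
    raise AttributeError(
        f"{type(module).__name__!r} has neither to_global nor to_consistent"
    )
```

In a training script this would replace the direct call, for example `model = to_distributed(model, PLACEMENT, BROADCAST)`, where `PLACEMENT` and `BROADCAST` are the objects the tutorial already defines.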

tattain404 · 2022-02-22T08:07:07Z

Thanks for the solutions, but I still can't run the code properly. Here are the updates.

First I tried to update OneFlow to 0.7.0 using
python3 -m pip install -f https://staging.oneflow.info/branch/master/cu101 --pre oneflow
but when I run the code, I get the same error: 'MobileNetV2' object has no attribute 'to_global'

Then I tried the second solution, replacing to_global with the old API to_consistent. The model does have a to_consistent method, but a different error appears:

root@7384127d2ab9:~/OneFlow_test# python3 test_orig.py
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
Using cuda device
Files already downloaded and verified
Traceback (most recent call last):
  File "/root/OneFlow_test/test_orig.py", line 42, in <module>
    model = model.to_consistent(placement=PLACEMENT, sbp=B)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/nn/module.py", line 541, in to_consistent
    return self._apply(convert)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/nn/module.py", line 473, in _apply
    module._apply(fn, applied_dict)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/nn/module.py", line 473, in _apply
    module._apply(fn, applied_dict)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/nn/module.py", line 473, in _apply
    module._apply(fn, applied_dict)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/nn/module.py", line 488, in _apply
    param_applied = fn(param)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/nn/module.py", line 539, in convert
    return t.to_consistent(placement=placement, sbp=sbp)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/framework/tensor.py", line 794, in _to_consistent
    return flow.to_consistent(self, placement, sbp, grad_sbp)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/nn/modules/consistent_cast.py", line 81, in to_consistent_op
    return flow._C.to_consistent(input, placement, sbp, grad_sbp)
oneflow._oneflow_internal.exception.CheckFailedException:
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/impl/consistent_cast.cpp", line 359, in operator()
    LocalToConsistent(x, parallel_desc, sbp_parallels, local_to_consistent_op_)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/impl/consistent_cast.cpp", line 285, in LocalToConsistent
    GetLogicalShapeAndDataType(shape.get(), &dtype, x->shape(), parallel_desc, nd_sbp)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/impl/consistent_cast.cpp", line 200, in GetLogicalShapeAndDataType
    BroadcastShapeAndDtype(*physical_shape, *dtype, parallel_desc)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/impl/consistent_cast.cpp", line 135, in BroadcastShapeAndDtype
    GetBroadcastGroup(parallel_desc, rank_group_parallel_desc)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/placement_sbp_util.cpp", line 559, in CalcBroadcastGroup
    Check failed: (src_parallel_desc->parallel_num()) <= (dst_parallel_desc->parallel_num()) (2 vs 1)

Then I changed PLACEMENT = flow.placement("cuda", [0,1]) to PLACEMENT = flow.placement("cuda", [0]), and the error went away. But I have a GPU cluster with 8 GPUs. Could you please tell me why the error occurs and how I can use all 8 GPUs to train the model?

doombeaker · 2022-02-22T08:28:14Z

> First I tried to update OneFlow to 0.7.0 using
> python3 -m pip install -f https://staging.oneflow.info/branch/master/cu101 --pre oneflow
> but when I run the code, I get the same error: 'MobileNetV2' object has no attribute 'to_global'

Could you please confirm that OneFlow 0.7.0 is actually the version being used, by running:

python -m oneflow --doctor

If it does not report 0.7.0, run pip uninstall oneflow and then reinstall OneFlow 0.7.0.
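
The same check can be made from inside Python, which also confirms that the interpreter running the script sees the expected package. This is a minimal sketch using only the standard library: `release_tuple` and `installed_release` are illustrative names, and the 0.7 threshold for `to_global` is taken from this thread rather than from a changelog.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string):
    """Return the leading numeric components, e.g. '0.6.0+cu101' -> (0, 6, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        raise ValueError(f"unrecognized version: {version_string!r}")
    return tuple(int(part) for part in match.group().split("."))


def installed_release(package="oneflow"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_release()
    if found is None:
        print("oneflow is not installed for this interpreter")
    else:
        # Assumption from this thread: to_global needs 0.7 or later.
        new_api = release_tuple(found) >= (0, 7)
        print(found, "->", "to_global expected" if new_api else "use to_consistent")
```

Running this with the same `python3` that launches the training script helps rule out the case where pip installed the package into a different environment.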

> Then I changed PLACEMENT = flow.placement("cuda", [0,1]) to PLACEMENT = flow.placement("cuda", [0]), and the error went away. But I have a GPU cluster with 8 GPUs. Could you please tell me why the error occurs and how I can use all 8 GPUs to train the model?

Sorry about the API change. If you stay on OneFlow 0.6.0, write the placement like this:

PLACEMENT = flow.placement("cuda",{0:[0,1]})

To use all 8 GPUs, write:

PLACEMENT = flow.placement("cuda",{0:[0,1,2,3,4,5,6,7]})

The documentation compatible with 0.6.0 can be found at:

https://docs.oneflow.org/v0.5.0/parallelism/02_sbp.html#placement
and
https://docs.oneflow.org/en/v0.5.0/parallelism/02_sbp.html#placement (English version)

This PR https://github.com/Oneflow-Inc/oneflow-documentation/pull/411/files updates the documentation for the API changes (to_global and placement). In that PR you can see clearly how usage of the new APIs differs from the old ones.

I hope this information helps. Thanks again for your feedback.
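
To make the difference between the two placement forms concrete, the sketch below turns a flat list of device indices into the `{node: [devices]}` dictionary used in the 0.6.0 examples above. `node_device_map` is a hypothetical helper, and it assumes ranks are numbered node by node with the same number of GPUs on every node. It only builds the argument and does not call OneFlow.

```python
def node_device_map(ranks, gpus_per_node):
    """Group global ranks into {node_index: [local_device_ids]}."""
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    mapping = {}
    for rank in sorted(set(ranks)):
        # Assumed layout: rank = node * gpus_per_node + local device id.
        node, device = divmod(rank, gpus_per_node)
        mapping.setdefault(node, []).append(device)
    return mapping


# Single machine with 8 GPUs, as in this issue.
devices = node_device_map(range(8), gpus_per_node=8)
print(devices)  # {0: [0, 1, 2, 3, 4, 5, 6, 7]}
# With OneFlow 0.6.0 this would be passed as flow.placement("cuda", devices).
```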

strint added community events from community help wanted good first issue labels Feb 22, 2022
