'MobileNetV2' object has no attribute 'to_global' #7560

Open
tattain404 opened this issue Feb 22, 2022 · 3 comments

@tattain404

Summary

I ran the code from the tutorial https://docs.oneflow.org/master/parallelism/05_ddp.html
(the section "Data-parallel training by setting SBP"), but it fails with:

'MobileNetV2' object has no attribute 'to_global'

I also tried defining my own NeuralNetwork class with class NeuralNetwork(nn.Module):
and model = NeuralNetwork().to(DEVICE),
then calling model.to_global to distribute the model across the GPUs, but it shows the same kind of error:
'nn.Module' object has no attribute 'to_global'

Code to reproduce bug

Same code https://docs.oneflow.org/master/parallelism/05_ddp.html#sbp

System Information

  • What is your OneFlow installation (pip, source, dockerhub): pip (python3 -m pip install -f https://release.oneflow.info oneflow==0.6.0+cu101)
  • OS: Linux version 3.10.0-862.el7.x86_64 (builder@kbuilder.dev.centos.org)
  • OneFlow version (run python3 -m oneflow --doctor): 0.6.0
  • Python version: 3.9.7
  • CUDA driver version: CUDA Version: 10.1
  • GPU models: 8 x Tesla V100-SXM2 32GB
  • Other info:
@doombeaker
Contributor

Hi, thanks for opening this issue.

The documentation URL you posted is for the master version of OneFlow, which should be installed with: python3 -m pip install -f https://staging.oneflow.info/branch/master/cu102 --pre oneflow (this command can be found on docs.oneflow.org)

It seems you installed OneFlow 0.6.0 but referred to the documentation for the master version. Upgrading OneFlow to the master version may solve this problem. Alternatively, you can adapt your code to 0.6.0 by replacing to_global with the old API, to_consistent.

Thanks again for your feedback.
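
For code that has to run on both an older and a newer OneFlow build, the rename described above can be hidden behind a small wrapper. The following is only a sketch: `to_global_compat` is a hypothetical helper name, not part of OneFlow, and it assumes that the two methods accept the same keyword arguments (placement and sbp), as the thread suggests.

```python
def to_global_compat(obj, **kwargs):
    """Call obj.to_global if it exists, otherwise fall back to obj.to_consistent.

    Hypothetical helper (not a OneFlow API). It only assumes that one of the
    two method names mentioned in this thread is available on the object.
    """
    for name in ("to_global", "to_consistent"):
        method = getattr(obj, name, None)
        if callable(method):
            return method(**kwargs)
    raise AttributeError(
        f"{type(obj).__name__!r} object has neither 'to_global' nor 'to_consistent'"
    )
```

With this in place, the failing line from the tutorial could be written as model = to_global_compat(model, placement=PLACEMENT, sbp=B), where PLACEMENT and B are the names used in the tutorial script.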

@tattain404
Author

Thanks for the solutions, but I still can't run the code properly. Here are the updates.

First, I tried to update OneFlow to 0.7.0 using
python3 -m pip install -f https://staging.oneflow.info/branch/master/cu101 --pre oneflow
but when I run the code, I get the same error: 'MobileNetV2' object has no attribute 'to_global'

Then I tried the second solution, replacing to_global with the old API to_consistent. The model does have a to_consistent method, but another error appears:

root@7384127d2ab9:~/OneFlow_test# python3 test_orig.py
loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
Using cuda device
Files already downloaded and verified
Traceback (most recent call last):
  File "/root/OneFlow_test/test_orig.py", line 42, in <module>
    model = model.to_consistent(placement=PLACEMENT, sbp=B)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/nn/module.py", line 541, in to_consistent
    return self._apply(convert)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/nn/module.py", line 473, in _apply
    module._apply(fn, applied_dict)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/nn/module.py", line 473, in _apply
    module._apply(fn, applied_dict)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/nn/module.py", line 473, in _apply
    module._apply(fn, applied_dict)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/nn/module.py", line 488, in _apply
    param_applied = fn(param)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/nn/module.py", line 539, in convert
    return t.to_consistent(placement=placement, sbp=sbp)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/framework/tensor.py", line 794, in _to_consistent
    return flow.to_consistent(self, placement, sbp, grad_sbp)
  File "/root/anaconda3/lib/python3.9/site-packages/oneflow/nn/modules/consistent_cast.py", line 81, in to_consistent_op
    return flow._C.to_consistent(input, placement, sbp, grad_sbp)
oneflow._oneflow_internal.exception.CheckFailedException:
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/impl/consistent_cast.cpp", line 359, in operator()
    LocalToConsistent(x, parallel_desc, sbp_parallels, local_to_consistent_op_)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/impl/consistent_cast.cpp", line 285, in LocalToConsistent
    GetLogicalShapeAndDataType(shape.get(), &dtype, x->shape(), parallel_desc, nd_sbp)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/impl/consistent_cast.cpp", line 200, in GetLogicalShapeAndDataType
    BroadcastShapeAndDtype(*physical_shape, *dtype, parallel_desc)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/impl/consistent_cast.cpp", line 135, in BroadcastShapeAndDtype
    GetBroadcastGroup(parallel_desc, rank_group_parallel_desc)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/placement_sbp_util.cpp", line 559, in CalcBroadcastGroup
    Check failed: (src_parallel_desc->parallel_num()) <= (dst_parallel_desc->parallel_num()) (2 vs 1)

Then I changed PLACEMENT = flow.placement("cuda", [0,1]) to PLACEMENT = flow.placement("cuda", [0]), and the error went away. But I have a GPU cluster with 8 GPUs. Could you please tell me why the error occurs and how I can use all 8 GPUs to train the model?

@doombeaker
Contributor

First, I tried to update OneFlow to 0.7.0 using
python3 -m pip install -f https://staging.oneflow.info/branch/master/cu101 --pre oneflow
but when I run the code, I get the same error: 'MobileNetV2' object has no attribute 'to_global'

Could you please make sure that OneFlow 0.7.0 is actually being used, by running:

python -m oneflow --doctor

If it does not show 0.7.0, run pip uninstall oneflow and then reinstall OneFlow 0.7.0.
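
The same check can also be done from inside the training script, which helps when several Python environments are installed side by side. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and it assumes the package was installed under the pip distribution name oneflow, as in the commands above.

```python
from importlib import metadata


def installed_version(package):
    """Return the version string pip recorded for `package`, or None if absent.

    Hypothetical helper. It reads the metadata visible to the interpreter that
    runs this script, which is the one that matters for the import.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
```

Calling print(installed_version("oneflow")) at the top of the script shows which build that interpreter sees. If it still reports 0.6.0 after the upgrade, the old package was probably not replaced.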

Then I changed PLACEMENT = flow.placement("cuda", [0,1]) to PLACEMENT = flow.placement("cuda", [0]), and the error went away. But I have a GPU cluster with 8 GPUs. Could you please tell me why the error occurs and how I can use all 8 GPUs to train the model?

Sorry about the API change. If you keep using OneFlow 0.6.0, you can specify the placement like this:

PLACEMENT = flow.placement("cuda",{0:[0,1]})

If you want to use all 8 GPUs, use:

PLACEMENT = flow.placement("cuda",{0:[0,1,2,3,4,5,6,7]})
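
The two argument shapes seen in this thread (the dict form for 0.6.0 and the plain list form from the master tutorial) can be generated from a GPU count instead of being typed out. The sketch below is an illustration only: `placement_devices` is a hypothetical helper, it assumes a single machine, and it builds only the second argument, leaving the flow.placement call itself unchanged.

```python
def placement_devices(num_gpus, node=0, legacy=True):
    """Build the device argument for flow.placement on a single machine.

    Hypothetical helper, based only on the two forms shown in this thread:
      legacy=True  -> {node: [0, ..., num_gpus - 1]}  (dict form, OneFlow 0.6.0)
      legacy=False -> [0, ..., num_gpus - 1]          (list form, master tutorial)
    The `node` argument is ignored for the list form.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    ids = list(range(num_gpus))
    return {node: ids} if legacy else ids
```

On 0.6.0 this would be used as PLACEMENT = flow.placement("cuda", placement_devices(8)), which produces the same dict as the line above.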

Documentation compatible with 0.6.0 can be found at:

https://docs.oneflow.org/v0.5.0/parallelism/02_sbp.html#placement
and
https://docs.oneflow.org/en/v0.5.0/parallelism/02_sbp.html#placement (English version)

The PR https://github.com/Oneflow-Inc/oneflow-documentation/pull/411/files updates the documentation for the API changes (to_global and placement). It shows clearly how usage of the new APIs differs from the old ones.

I hope this information helps. Thanks again for your feedback.
