
PyTorch Version: torch_shm_manager error when running with multiprocessing #402

Closed
alex-razor opened this issue Aug 14, 2019 · 25 comments

@alex-razor

alex-razor commented Aug 14, 2019

Running the code doesn't work. I get the following error:

(venv) juggernaut@xmen9:/hdd/AlphaPose$ python demo.py --indir examples/demo/
Loading YOLO model..
torch_shm_manager: error while loading shared libraries: libcudart.so.10.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 314, in reduce_storage
    metadata = storage._share_filename_()
RuntimeError: error executing torch_shm_manager at "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/bin/torch_shm_manager" at /pytorch/torch/lib/libshm/core.cpp:99
torch_shm_manager: error while loading shared libraries: libcudart.so.10.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 314, in reduce_storage
    metadata = storage._share_filename_()
RuntimeError: error executing torch_shm_manager at "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/bin/torch_shm_manager" at /pytorch/torch/lib/libshm/core.cpp:99
torch_shm_manager: error while loading shared libraries: libcudart.so.10.0: cannot open shared object file: No such file or directory
torch_shm_manager: error while loading shared libraries: libcudart.so.10.0: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "demo.py", line 50, in <module>
    det_loader = DetectionLoader(data_loader, batchSize=args.detbatch).start()
  File "/hdd/AlphaPose/dataloader.py", line 309, in start
    p.start()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/context.py", line 291, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.6/multiprocessing/popen_forkserver.py", line 47, in _launch
    reduction.dump(process_obj, buf)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 314, in reduce_storage
    metadata = storage._share_filename_()
RuntimeError: error executing torch_shm_manager at "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/bin/torch_shm_manager" at /pytorch/torch/lib/libshm/core.cpp:99

However, when I add the --sp flag, it works fine.

Python 3.6
CUDA 9.0
CUDNN 7
torch 1.2.0    
torchfile 0.1.0    
torchvision 0.4.0 
@Fang-Haoshu
Member

Fang-Haoshu commented Aug 15, 2019

Hi, can you try modifying line 26 of 'demo.py' as below?
torch.multiprocessing.set_start_method('spawn', force=True)
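
For reference, a minimal sketch of how that change might sit near the top of demo.py (the surrounding code and the guard are assumptions, not the actual file):

import torch.multiprocessing

if __name__ == '__main__':
    # Force the 'spawn' start method before any worker processes are created;
    # force=True overrides a start method that was set earlier.
    torch.multiprocessing.set_start_method('spawn', force=True)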

@alex-razor
Author

Hi, can you try modifying line 26 of 'demo.py' as below?
torch.multiprocessing.set_start_method('spawn', force=True)

Thank you for your reply. However, it didn't help; I get the same error.

@Fang-Haoshu
Member

Oh, that's weird. We have only tested with PyTorch 1.1 so far. Can you check whether PyTorch 1.1 works for you?

@alex-razor
Author

That did work for me. Thanks!

@David-on-Code

RuntimeError: error executing torch_shm_manager at "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/bin/torch_shm_manager" at /pytorch/torch/lib/libshm/core.cpp:99

How can I solve it?

@schmmd

schmmd commented Oct 18, 2019

I'm also hitting this, but on torch==1.3.0.

@maochen

maochen commented Oct 24, 2019

Same on torch==1.3.0.
OS: macOS 10.14.6

@waiting-gy

RuntimeError: error executing torch_shm_manager at "/hdd/kps_pipeline/venv/lib/python3.6/site-packages/torch/bin/torch_shm_manager" at /pytorch/torch/lib/libshm/core.cpp:99

How can I solve it?

Do you know how to solve it? Thank you!

@Abhipray

Abhipray commented Dec 3, 2019

I was seeing this error with 1.3.0. Upgrading to 1.3.1 fixed it for me.

@asheeshcric

@Abhipray I have torch==1.3.1 installed, but it isn't working for me. I get the same error. Has anyone found a solution to this problem?

@Ehsan-Yaghoubi

Ehsan-Yaghoubi commented Dec 9, 2019

I had the same problem. When I used the following versions, AlphaPose worked and generated a JSON file for the images.

  • I created a virtual environment with Python 3.6. If you don't know how to do it, have a look at https://gist.github.com/frfahim/73c0fad6350332cef7a653bcd762f08d

  • I installed the latest version of PyTorch from https://pytorch.org/ and selected CUDA 9.2 (CUDA 10.0 did not work).
    I used: pip3 install torch==1.3.1+cu92 torchvision==0.4.2+cu92 -f https://download.pytorch.org/whl/torch_stable.html

  • I installed Cuda 9.2 from https://developer.nvidia.com/cuda-92-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1604&target_type=runfilelocal

Then follow the AlphaPose instructions, which say to download the models and:

  • git clone -b pytorch https://github.com/MVIG-SJTU/AlphaPose.git
    - pip3 install -r requirements.txt (remove torch, torchvision, and ntpath from this file, then run the command)
  • python3 demo.py --indir examples/demo --outdir examples/res

SUMMARY:

Ubuntu 16.04
Python 3.6
CUDA 9.2
CUDNN 7
torch==1.3.1+cu92
torchvision==0.4.2+cu92
GPU: NVIDIA RTX 2080 Ti
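
As a quick sanity check after installation, one can verify that the installed wheel matches the summary above (the expected values in the comments are assumptions based on that summary):

import torch

print(torch.__version__)          # expected: 1.3.1+cu92
print(torch.version.cuda)         # expected: 9.2
print(torch.cuda.is_available())  # expected: True once the CUDA 9.2 runtime is found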

@phamdat09

Hello!
@Ehsan-Yaghoubi, how many FPS did you get? Thanks

@Ehsan-Yaghoubi

Hello!
@Ehsan-Yaghoubi, how many FPS did you get? Thanks

Hi, I only used it to produce the pose information for my own dataset. I didn't check the metrics as I didn't need them.

@phamdat09

Hi @Ehsan-Yaghoubi, thanks for your reply!

@cslxiao

cslxiao commented Feb 9, 2020

It still happens with PyTorch 1.4

@cdyangbo

Set num_workers=0
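
For context, a minimal sketch of that workaround with a generic DataLoader (the dataset and batch size are placeholders):

from torch.utils.data import DataLoader

# num_workers=0 keeps data loading in the main process, so no tensors are
# shared between worker processes and torch_shm_manager is never launched.
loader = DataLoader(my_dataset, batch_size=8, num_workers=0)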

@cdyangbo

torch.multiprocessing.set_start_method('spawn', force=True) works well with num_workers > 0 on macOS.

@nlml

nlml commented Apr 20, 2020

I was just able to fix this by commenting out a line I had added to fix an issue on a different system:

Old: torch.multiprocessing.set_sharing_strategy('file_system')

New: # torch.multiprocessing.set_sharing_strategy('file_system')

I think the problem in my case might be caused by my system having CUDA 10.2 while PyTorch is installed as the CUDA 10.1 build. But commenting out the line above at the start of my script fixed the problem, at least in my case.
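
For context, a sketch of how to inspect and set the sharing strategy (the 'file_descriptor' default applies to Linux; availability differs by platform):

import torch.multiprocessing as mp

# Lists the strategies this platform supports; on Linux this is typically
# {'file_descriptor', 'file_system'}.
print(mp.get_all_sharing_strategies())

# 'file_system' hands shared tensors to the torch_shm_manager helper binary;
# staying on the Linux default 'file_descriptor' avoids launching it.
mp.set_sharing_strategy('file_descriptor')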

@Amir22010

@nlml That works for me, thanks! I have PyTorch 1.4 with CUDA 10.2.

@qhdqhd

qhdqhd commented Apr 26, 2021

Adding --sp works fine for me.

@Zrrr1997

Zrrr1997 commented Jun 4, 2021

Hitting the same error:

(alphapose) zrrr@zrrr-GL552VW:~/Projects/AlphaPose$ python scripts/demo_inference.py --cfg configs/coco/resnet/256x192_res50_lr1e-3_1x.yaml --checkpoint pretrained_models/fast_res50_256x192.pth --indir examples/demo/

Traceback (most recent call last):
  File "scripts/demo_inference.py", line 175, in <module>
    det_loader = DetectionLoader(input_source, get_detector(args), cfg, args, batchSize=args.detbatch, mode=mode, queueSize=args.qsize)
  File "/home/zrrr/Projects/AlphaPose/detector/apis.py", line 12, in get_detector
    from detector.yolo_api import YOLODetector
  File "/home/zrrr/Projects/AlphaPose/detector/yolo_api.py", line 27, in <module>
    from detector.nms import nms_wrapper
  File "/home/zrrr/Projects/AlphaPose/detector/nms/__init__.py", line 1, in <module>
    from .nms_wrapper import nms, soft_nms
  File "/home/zrrr/Projects/AlphaPose/detector/nms/nms_wrapper.py", line 4, in <module>
    from . import nms_cpu, nms_cuda
ImportError: libcudart.so.10.0: cannot open shared object file: No such file or directory
Python 3.6.13
Cuda Toolkit 9.0
cudnn 7.6.5
torch 1.1.0
torchvision 0.3.0

How can I fix this?

@maochen

maochen commented Jun 4, 2021

Hitting the same error:

ImportError: libcudart.so.10.0: cannot open shared object file: No such file or directory
torch 1.1.0

How can I fix this?

Could you try any version of torch >= 1.3.1 and see if the issue is still there?

@qhdqhd

qhdqhd commented Jun 5, 2021

Adding --sp works for me.

@angerhang

I was just able to fix this by commenting out a line I had added to fix an issue on a different system:

Old: torch.multiprocessing.set_sharing_strategy('file_system')

New: # torch.multiprocessing.set_sharing_strategy('file_system')

I think the problem in my case might be caused by my system having CUDA 10.2 while PyTorch is installed as the CUDA 10.1 build. But commenting out the line above at the start of my script fixed the problem, at least in my case.

I had to do the same to get the code working on Linux. Any idea why this happens?

@tianhangpan

Hi, can you try modifying line 26 of 'demo.py' as below? torch.multiprocessing.set_start_method('spawn', force=True)

Thanks, that worked for me on Linux!
