
FileNotFoundError: [Errno 2] No such file or directory: './cube/' #how to train the custom dataset? #97

Closed
avinashsen707 opened this issue Jan 31, 2020 · 14 comments


@avinashsen707

binpick@ncrai-Precision-7820-Tower:~/catkin_ws/src/dope/scripts$ python3 train.py --data ./cube/ --outf cube_1214  --gpuids 0 1 --epochs 120 --loginterval 1 --batchsize 32
start: 00:11:52.635052
load data
Traceback (most recent call last):
  File "train.py", line 1255, in <module>
    transforms.Scale(opt.imagesize//8),
  File "train.py", line 416, in __init__
    self.imgs = load_data(root)
  File "train.py", line 411, in load_data
    for name in os.listdir(str(path)):
FileNotFoundError: [Errno 2] No such file or directory: './cube/'


@TontonTremblay
Collaborator

Can you give it the absolute path, without the "."? Something like /home/jtremblay/data/cube/
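A quick way to fail early with a clearer message is to resolve the --data argument to an absolute path and check that it exists before the loader runs. A minimal sketch (the helper name is mine, not from train.py):

```python
import os
import sys

def resolve_data_root(path):
    """Expand and absolutize a dataset path, failing early with a clear message.

    A relative path like ./cube/ is resolved against the directory the
    script was launched from, which is a common source of this error.
    """
    root = os.path.abspath(os.path.expanduser(path))
    if not os.path.isdir(root):
        sys.exit("Dataset directory not found: %s (cwd is %s)" % (root, os.getcwd()))
    return root
```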

@avinashsen707
Author

avinashsen707 commented Feb 1, 2020

Can you give it the absolute path, without the "."? Something like /home/jtremblay/data/cube/

/media/binpick/DATA/binpick_extended_user/Training$ python3 train.py --data ./fat/single/011_banana_16k/kitchen_0/ --outf banana --gpuids 0 1 --epochs 120 --loginterval 1 --batchsize 32
start: 11:08:10.511437
load data
training data: 7 batches
load models
Training network pretrained on imagenet.
Traceback (most recent call last):
  File "train.py", line 1306, in <module>
    net = torch.nn.DataParallel(net,device_ids=opt.gpuids).cuda()
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 102, in __init__
    _check_balance(self.device_ids)
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 17, in _check_balance
    dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 17, in <listcomp>
    dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/cuda/__init__.py", line 292, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id
 python3
Python 3.5.2 (default, Oct  8 2019, 13:06:37) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> torch.cuda.device_count()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'torch' is not defined
>>> import torch
>>> torch.cuda.device_count()
1
>>> exit()
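The "Invalid device id" above comes from passing --gpuids 0 1 while torch.cuda.device_count() is 1. A hedged sketch of a guard (the helper name is mine, not from train.py; in practice device_count would come from torch.cuda.device_count()):

```python
def valid_gpu_ids(requested, device_count):
    """Keep only the requested GPU ids that actually exist on this machine.

    torch.nn.DataParallel raises "Invalid device id" when device_ids
    contains an id >= torch.cuda.device_count().
    """
    kept = [i for i in requested if 0 <= i < device_count]
    if not kept:
        raise ValueError("none of the requested GPU ids %r exist" % (requested,))
    return kept
```

With one GPU, --gpuids 0 1 would be clamped to [0].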

python3 train.py --data ./fat/single/011_banana_16k/kitchen_0/ --outf banana --gpuids 0 1 2 3 --epochs 120 --loginterval 1 --batchsize 32
Traceback (most recent call last):
  File "train.py", line 61, in <module>
    import torchvision.transforms as transforms
  File "/home/binpick/.local/lib/python3.5/site-packages/torchvision/__init__.py", line 2, in <module>
    from torchvision import datasets
  File "/home/binpick/.local/lib/python3.5/site-packages/torchvision/datasets/__init__.py", line 9, in <module>
    from .fakedata import FakeData
  File "/home/binpick/.local/lib/python3.5/site-packages/torchvision/datasets/fakedata.py", line 3, in <module>
    from .. import transforms
  File "/home/binpick/.local/lib/python3.5/site-packages/torchvision/transforms/__init__.py", line 1, in <module>
    from .transforms import *
  File "/home/binpick/.local/lib/python3.5/site-packages/torchvision/transforms/transforms.py", line 16, in <module>
    from . import functional as F
  File "/home/binpick/.local/lib/python3.5/site-packages/torchvision/transforms/functional.py", line 5, in <module>
    from PIL import Image, ImageOps, ImageEnhance, PILLOW_VERSION
ImportError: cannot import name 'PILLOW_VERSION'
python3 train.py --data ./fat/single/011_banana_16k/kitchen_0/ --outf banana --gpuids 0 --epochs 120 --loginterval 1 --batchsize 32
start: 12:46:42.727720
load data
training data: 7 batches
load models
Training network pretrained on imagenet.
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 1392, in <module>
    _runnetwork(epoch,trainingdata)
  File "train.py", line 1334, in _runnetwork
    output_belief, output_affinities = net(data)
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 112, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "train.py", line 153, in forward
    out1 = self.vgg(x)
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

Thank you sir. I looked at the path and corrected it, but it then ended up in the errors above!
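A side note on the PILLOW_VERSION ImportError in the middle log: that constant was removed in Pillow 7.0, and torchvision releases before 0.5 still import it, so the usual fix is to pin Pillow (pip install "pillow<7") or upgrade torchvision. A small sketch of the version check:

```python
def pillow_has_version_constant(pillow_version):
    """Return True if this Pillow version still exposes PILLOW_VERSION.

    The constant was removed in Pillow 7.0.0; torchvision < 0.5 imports it
    and fails with "ImportError: cannot import name 'PILLOW_VERSION'".
    """
    major = int(pillow_version.split(".")[0])
    return major < 7
```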

@TontonTremblay
Collaborator

The batchsize is too large. Try --batchsize 1.

@avinashsen707
Author

The batchsize is too large. Try --batchsize 1.

Thank you sir, it worked.
I am training now with the FAT dataset for each object. But my confusion is that there are almost 20 backgrounds for each object, so I would need to train on each folder to get weights. How can I select the final weight to place in the DOPE code, since I am getting a lot of weights (one per epoch) in each folder?

Could you also suggest the optimum number of epochs and log interval I need to give to get good results? I am using an NVIDIA Quadro P4000 GPU with 8 GB in a Dell Precision 7820 workstation.

Thank you in advance for your patient replies to beginners like me.

@TontonTremblay
Collaborator

With 8 GB you should be able to run a batch size of 8; the idea is to increase it until you have filled your GPU memory. You can check with nvidia-smi while it is running to see how much you are using.

You can give /fat/ to the training script with --objectofinterest banana; it will only use the banana information.

I hope this helps.

@avinashsen707
Author

avinashsen707 commented Feb 7, 2020

As per your suggestion, I ran it like this:

binpick@ncrai-Precision-7820-Tower:/media/binpick/DATA/binpick_userfiles/Training$ python3 train.py --data ./pvc_tee_800/ --outf tee_1214  --gpuids 0 --epochs 10 --loginterval 1 --batchsize 8
start: 09:55:13.370283
load data
training data: 100 batches
load models
Training network pretrained on imagenet.
Traceback (most recent call last):
  File "train.py", line 1392, in <module>
    _runnetwork(epoch,trainingdata)
  File "train.py", line 1330, in _runnetwork
    for batch_idx, targets in enumerate(loader):
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 286, in __next__
    return self._process_next_batch(batch)
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
IndexError: Traceback (most recent call last):
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/binpick/.local/lib/python3.5/site-packages/torch/utils/data/dataloader.py", line 57, in <listcomp>
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "train.py", line 491, in __getitem__
    cuboid = np.array(data['exported_objects'][0]['cuboid_dimensions'])
IndexError: list index out of range

I am attaching some of my dataset samples.
[attached: sample image 000467, depth image 000468, and a screenshot]

@avinashsen707
Author

Problem solved; the JSON files were missing the data the script needed to load!
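For anyone hitting the same IndexError: it means the 'exported_objects' list was missing or empty in an annotation settings file, so data['exported_objects'][0] fails. A hedged pre-flight check (the helper name is mine; it assumes the NDDS-style _object_settings.json filename):

```python
import json
import os

def find_bad_object_settings(root):
    """Return _object_settings.json files whose 'exported_objects' list
    is missing or empty. train.py indexes data['exported_objects'][0],
    so such files raise IndexError during training.
    """
    bad = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            if name == "_object_settings.json":
                path = os.path.join(dirpath, name)
                with open(path) as f:
                    data = json.load(f)
                if not data.get("exported_objects"):
                    bad.append(path)
    return bad
```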

@tuhinmallick

Traceback (most recent call last):
  File "train.py", line 1312, in <module>
    net = torch.nn.DataParallel(net,device_ids=opt.gpuids).cuda()
  File "C:\Users\intraflyQuadro\anaconda3\lib\site-packages\torch\nn\parallel\data_parallel.py", line 142, in __init__
    _check_balance(self.device_ids)
  File "C:\Users\intraflyQuadro\anaconda3\lib\site-packages\torch\nn\parallel\data_parallel.py", line 23, in _check_balance
    dev_props = _get_devices_properties(device_ids)
  File "C:\Users\intraflyQuadro\anaconda3\lib\site-packages\torch\_utils.py", line 459, in _get_devices_properties
    return [_get_device_attr(lambda m: m.get_device_properties(i)) for i in device_ids]
  File "C:\Users\intraflyQuadro\anaconda3\lib\site-packages\torch\_utils.py", line 459, in <listcomp>
    return [_get_device_attr(lambda m: m.get_device_properties(i)) for i in device_ids]
  File "C:\Users\intraflyQuadro\anaconda3\lib\site-packages\torch\_utils.py", line 442, in _get_device_attr
    return get_member(torch.cuda)
  File "C:\Users\intraflyQuadro\anaconda3\lib\site-packages\torch\_utils.py", line 459, in <lambda>
    return [_get_device_attr(lambda m: m.get_device_properties(i)) for i in device_ids]
  File "C:\Users\intraflyQuadro\anaconda3\lib\site-packages\torch\cuda\__init__.py", line 309, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id

How did you get rid of this error ?

@TontonTremblay
Collaborator

how many GPUs do you have? What is the value of opt.gpuids?

@tuhinmallick

I have only one GPU (a Quadro RTX 6000), and I fixed the device id error by specifying the gpuid as only 0. Thanks for the help.
But now I am getting the following error:

Training network pretrained on imagenet.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\intraflyQuadro\anaconda3\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\intraflyQuadro\anaconda3\lib\multiprocessing\spawn.py", line 125, in _main
    prepare(preparation_data)
  File "C:\Users\intraflyQuadro\anaconda3\lib\multiprocessing\spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\intraflyQuadro\anaconda3\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "C:\Users\intraflyQuadro\anaconda3\lib\runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\intraflyQuadro\anaconda3\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\intraflyQuadro\anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\intraflyQuadro\Desktop\Deep_Object_Pose\scripts\train.py", line 1398, in <module>
    _runnetwork(epoch,trainingdata)
  File "C:\Users\intraflyQuadro\Desktop\Deep_Object_Pose\scripts\train.py", line 1336, in _runnetwork
    for batch_idx, targets in enumerate(loader):
  File "C:\Users\intraflyQuadro\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 355, in __iter__
    return self._get_iterator()
  File "C:\Users\intraflyQuadro\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 301, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\intraflyQuadro\anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 914, in __init__
    w.start()
  File "C:\Users\intraflyQuadro\anaconda3\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\intraflyQuadro\anaconda3\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\intraflyQuadro\anaconda3\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\intraflyQuadro\anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 45, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\intraflyQuadro\anaconda3\lib\multiprocessing\spawn.py", line 154, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\intraflyQuadro\anaconda3\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.
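This is the standard Windows multiprocessing pitfall: with the spawn start method, worker processes re-import the main module, so any script that starts workers (a DataLoader with num_workers > 0 does) must keep its top-level work behind the guard the error message describes. A minimal sketch of the idiom (the train function here is a stand-in for train.py's actual training loop):

```python
import multiprocessing

def train():
    # Stand-in for the real work in train.py (building the DataLoader,
    # calling _runnetwork, etc.).
    return "training started"

if __name__ == "__main__":
    # Without this guard, each spawned worker re-executes the module body
    # on Windows and tries to start workers of its own, raising the
    # RuntimeError shown above.
    multiprocessing.freeze_support()  # only matters for frozen executables
    print(train())
```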

Can you please help me out?

@TontonTremblay
Collaborator

I have never seen this error, and I have never tested a Quadro with this code. Can you try changing the number of workers on the dataloader to 1?

@tuhinmallick

tuhinmallick commented Jul 30, 2021

The problem did get sorted out by setting the number of workers on the dataloader to 0.

But a new error cropped up:

[screenshot of the error attached]

I am training on 150k images. Is it because of the large dataset? I didn't get the error when training on 10k images.

[second screenshot attached]

@TontonTremblay
Collaborator

TontonTremblay commented Jul 30, 2021 via email

@tuhinmallick

No, I am using NDDS
