
training #42

Closed

ghost opened this issue Jun 12, 2019 · 53 comments

@ghost

ghost commented Jun 12, 2019

How can I run train.py without going through the command line? At the moment I can run test.py directly, but running train.py directly gives:

File "/home/public/anaconda3/lib/python3.6/os.py", line 669, in __getitem__
raise KeyError(key) from None
KeyError: 'RANK'

How can I set, in code, the values in:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch \
--nproc_per_node=8 \
--master_port=2333 \
../../tools/train.py --cfg config.yaml

so that train.py can be run directly in PyCharm?
Thank you for your reply.

@serycjon

It seems you have to set some environment variables, see: https://pytorch.org/docs/0.3.0/distributed.html#environment-variable-initialization
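A minimal single-process sketch of that environment-variable initialization (my assumption, following the PyTorch docs linked above; the values are just examples):

# Set, before train.py reads them (e.g. at the very top of the script, or
# in the IDE's run configuration), the variables that torch.distributed.launch
# would normally provide to each worker process.
import os

os.environ['RANK'] = '0'             # rank of this (only) process
os.environ['WORLD_SIZE'] = '1'       # total number of processes
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '2333'   # any free port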

@IlchaeJung

I have the same problem as you.
I am also having difficulty with @serycjon's proposal.
@chenbolinstudent, is this proposal working for you?

However, I did manage to run the evaluation, and it works very well.

@ghost
Author

ghost commented Jun 14, 2019

@serycjon
Can you give more details?

@serycjon

I don't have the training data ready, so haven't really tested it, but running the script with:
WORLD_SIZE=1 RANK=0 PYTHONPATH=./:$PYTHONPATH python tools/train.py --cfg experiments/siamrpn_r50_l234_dwxcorr/config.yaml got me past that rank error.

@ghost
Author

ghost commented Jun 14, 2019

If I run:

WORLD_SIZE=1 RANK=0 PYTHONPATH=./:$PYTHONPATH python tools/train.py --cfg experiments/siamrpn_r50_l234_dwxcorr/config.yaml

then:

Traceback (most recent call last):
File "train.py", line 24, in
from pysot.utils.lr_scheduler import build_lr_scheduler
ModuleNotFoundError: No module named 'pysot'

@Hevans123

@chenbolinstudent, could you post a screenshot of the code changes that solved the RANK problem? Isn't PYTHONPATH=./:$PYTHONPATH python tools/train.py a command-line statement? How do you make that change in code?

@Hevans123

@serycjon

@serycjon

@chenbolinstudent Well, you have to install pysot correctly first (see install.md).
@Hevans123 It is a valid bash command. You will have to check how to set environment variables on your system if it does not work.
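If setting PYTHONPATH is awkward (for example inside an IDE), a hypothetical workaround that mirrors what PYTHONPATH=./ does is to put the repository root on sys.path above the pysot imports in tools/train.py:

# Add the repository root (the directory containing the pysot package)
# to sys.path, mirroring PYTHONPATH=./ run from the repo root.
import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), '..'))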

@ZZXin

ZZXin commented Jun 15, 2019

I found a problem: the training time is very long, and if training is interrupted midway you have to start again from the beginning. Is there a way to resume training from where it was interrupted, as in SiamFC's training process?

@lb1100
Contributor

lb1100 commented Jun 19, 2019

I found a problem: the training time is very long, and if training is interrupted midway you have to start again from the beginning. Is there a way to resume training from where it was interrupted, as in SiamFC's training process?

Try to use TRAIN.RESUME to resume your training.
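For reference, a sketch of what that looks like, assuming pysot's yacs-style config and a hypothetical checkpoint path (use whatever file your run saved under the snapshot directory):

# In config.yaml (checkpoint name is hypothetical):
#   TRAIN:
#     RESUME: './snapshot/checkpoint_e10.pth'
# or equivalently in Python, assuming pysot's yacs-based cfg:
from pysot.core.config import cfg

cfg.merge_from_file('config.yaml')
cfg.TRAIN.RESUME = './snapshot/checkpoint_e10.pth'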

@ZZXin

ZZXin commented Jun 22, 2019

I found a problem: the training time is very long, and if training is interrupted midway you have to start again from the beginning. Is there a way to resume training from where it was interrupted, as in SiamFC's training process?

Try to use TRAIN.RESUME to resume your training.

@lb1100
I did what you said, but I ran into this problem:

[2019-06-22 09:28:08,844-rk0-model_load.py# 42] remove prefix 'module.'
[2019-06-22 09:28:08,844-rk0-model_load.py# 33] used keys:75
[2019-06-22 09:28:08,845-rk0-model_load.py# 33] used keys:2
Traceback (most recent call last):
File "../../tools/train.py", line 319, in
main()
File "../../tools/train.py", line 307, in main
restore_from(model, optimizer, cfg.TRAIN.RESUME)
File "/home/db/Subject/pysot/pysot/utils/model_load.py", line 89, in restore_from
optimizer.load_state_dict(ckpt['optimizer'])
File "/home/db/anaconda3/envs/pysot/lib/python3.7/site-packages/torch/optim/optimizer.py", line 107, in load_state_dict
raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
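That ValueError comes from PyTorch itself: the optimizer being restored was built with a parameter group of a different size than the one saved in the checkpoint. In pysot it is worth checking whether the same parameters are trainable on resume as when the checkpoint was saved (e.g. frozen vs. unfrozen backbone layers). A self-contained illustration of the error (not pysot code):

# Minimal reproduction: the saved optimizer's parameter group holds two
# tensors, while the freshly built optimizer's group holds only one.
import torch

model = torch.nn.Linear(4, 2)
opt_saved = torch.optim.SGD([model.weight, model.bias], lr=0.1)
opt_fresh = torch.optim.SGD([model.weight], lr=0.1)

# Raises: ValueError: loaded state dict contains a parameter group
# that doesn't match the size of optimizer's group
opt_fresh.load_state_dict(opt_saved.state_dict())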

@ZZXin

ZZXin commented Jun 22, 2019

How can I run train.py without going through the command line? At the moment I can run test.py directly, but running train.py directly gives:

File "/home/public/anaconda3/lib/python3.6/os.py", line 669, in __getitem__
raise KeyError(key) from None
KeyError: 'RANK'

How can I set, in code, the values in:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch \
--nproc_per_node=8 \
--master_port=2333 \
../../tools/train.py --cfg config.yaml

so that train.py can be run directly in PyCharm?
Thank you for your reply.

I can debug the training code in PyCharm; you can refer to this blog: https://oldpan.me/archives/pytorch-to-use-multiple-gpus. But if you follow its configuration, you will get an error: no module named train.py. Just change 'train.py' to 'train' and it works.

@ghost
Author

ghost commented Jun 23, 2019

@ZZXin
Can you give more details?
Thanks.

@ghost
Author

ghost commented Jun 23, 2019

But this blog does not say how to set nproc_per_node and master_port.

@ZZXin

ZZXin commented Jun 23, 2019

But this blog does not say how to set nproc_per_node and master_port.

I train this code on a single GPU, so I think it doesn't matter; you can set --nproc_per_node=1 (--master_port can be any free port, e.g. 2333).

@ghost
Author

ghost commented Jun 23, 2019

In which .py file do I set this?

@ghost
Author

ghost commented Jun 23, 2019

In which .py file do I set
--nproc_per_node
--master_port ?

@ZZXin

ZZXin commented Jun 23, 2019

In which .py file do I set
--nproc_per_node
--master_port ?

train.py

@ghost
Author

ghost commented Jun 23, 2019

--nproc_per_node
--master_port
do not exist in train.py

@ZZXin

ZZXin commented Jun 23, 2019

@ZZXin
Can you offer a Baidu Wangpan link for youtubebb?

I don't have it; I am waiting for pysot's download link.

@ghost
Author

ghost commented Jun 23, 2019

How do you debug the training code in PyCharm?
https://oldpan.me/archives/pytorch-to-use-multiple-gpus does not say how to set it.
Can you give details?

@ZZXin

ZZXin commented Jun 23, 2019

How do you debug the training code in PyCharm?
https://oldpan.me/archives/pytorch-to-use-multiple-gpus does not say how to set it.
Can you give details?

Refer to
https://image.oldpan.me/TIM%E6%88%AA%E5%9B%BE20190306182352.jpg
Take a good look at this picture.

@xi-mao

xi-mao commented Jun 25, 2019

How do you debug the training code in PyCharm?
https://oldpan.me/archives/pytorch-to-use-multiple-gpus does not say how to set it.
Can you give details?

Refer to
https://image.oldpan.me/TIM%E6%88%AA%E5%9B%BE20190306182352.jpg
Take a good look at this picture.

May I ask where the configuration shown in the picture is? I can't find that window.

@ghost
Author

ghost commented Jun 25, 2019

Script:
torch.distributed.launch

Script parameters:
--nproc_per_node=1 --master_port=2333 /home/ty/PycharmProjects/pysot-master/tools/train.py

Still KeyError: 'RANK'

@yangkai12

Has the issue been solved? I have the same issue.

@boh95

boh95 commented Jul 5, 2019

I found a problem: the training time is very long, and if training is interrupted midway you have to start again from the beginning. Is there a way to resume training from where it was interrupted, as in SiamFC's training process?

Try to use TRAIN.RESUME to resume your training.

@lb1100
I did what you said, but I ran into this problem:

[2019-06-22 09:28:08,844-rk0-model_load.py# 42] remove prefix 'module.'
[2019-06-22 09:28:08,844-rk0-model_load.py# 33] used keys:75
[2019-06-22 09:28:08,845-rk0-model_load.py# 33] used keys:2
Traceback (most recent call last):
File "../../tools/train.py", line 319, in <module>
main()
File "../../tools/train.py", line 307, in main
restore_from(model, optimizer, cfg.TRAIN.RESUME)
File "/home/db/Subject/pysot/pysot/utils/model_load.py", line 89, in restore_from
optimizer.load_state_dict(ckpt['optimizer'])
File "/home/db/anaconda3/envs/pysot/lib/python3.7/site-packages/torch/optim/optimizer.py", line 107, in load_state_dict
raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

Have you solved this problem?

@xiaotian3

xiaotian3 commented Jul 8, 2019

In PyCharm: Run -> Edit Configurations -> Configuration -> Module name: torch.distributed.launch
PS: the default is Script path; you have to click the dropdown to switch it to Module name.
Refer to
https://image.oldpan.me/TIM%E6%88%AA%E5%9B%BE20190306182352.jpg

@ghost
Author

ghost commented Jul 8, 2019

@ZZXin
train.py does not have "--nproc_per_node" or "--master_port".

@xiaotian3

@chenbolinstudent Module name! Not Script path. Give it a try.

@ghost
Author

ghost commented Jul 8, 2019

@xiaotian3
What do you mean? Please explain in detail.

@xiaotian3

xiaotian3 commented Jul 8, 2019

Script:
torch.distributed.launch

Script parameters:
--nproc_per_node=1 --master_port=2333 /home/ty/PycharmProjects/pysot-master/tools/train.py

Still KeyError: 'RANK'

You used Script path, but it should use Module name.
Refer to
https://image.oldpan.me/TIM%E6%88%AA%E5%9B%BE20190306182352.jpg
@chenbolinstudent

@ZZXin

ZZXin commented Jul 8, 2019

@ZZXin
train.py does not have "--nproc_per_node" or "--master_port".

I think --nproc_per_node and --master_port are related to torch's distributed training; they are options of torch.distributed.launch, not of train.py.

@ghost
Author

ghost commented Jul 16, 2019

[screenshot 1]

[screenshot 2]

I set it as shown and did not change the train.py script, but I still get KeyError: 'RANK'.
@xiaotian3

@Programmerwyl

I have a question: in PyCharm, should I run train or distributed?
Second, it is OK when I run distributed, but if I debug with distributed, I get 'no module named ./tools/train.py'.
I have modified the path of train.py, but I still get an error.
@ZZXin I just changed 'train.py' to 'train', and it does not work.

@xiaotian3

[screenshot 1]

[screenshot 2]

I set it as shown and did not change the train.py script, but I still get KeyError: 'RANK'.
@xiaotian3

[screenshot]

@Programmerwyl

@xiaotian3
1. Do you configure it on Ubuntu or Windows?
2. Can you debug with PyCharm?

@xiaotian3

@xiaotian3
1. Do you configure it on Ubuntu or Windows?
2. Can you debug with PyCharm?

Ubuntu.
I can debug with PyCharm.
[screenshot]

@Programmerwyl

@xiaotian3
When I debug, I always get the error:
No module named /media/wyl/01937159-8963-47f6-b239-efe7b18f2e8b/wyl/data/project/pysot/pysot-master/tools/train.py

[screenshot: 2019-07-26 16-52-22]

@Programmerwyl

@xiaotian3
Please give me some guidance; I have been at this all afternoon.
Thank you very much.

@xiaotian3

@xiaotian3
Please give me some guidance; I have been at this all afternoon.
Thank you very much.

Use just train; for the path that goes in front of it, you can refer to my picture.

@Programmerwyl

@xiaotian3 Could you open the Parameters you configured in PyCharm and show them to me?
After I removed the .py, it will not even run:

/media/wyl/01937159-8963-47f6-b239-efe7b18f2e8b/wyl/data/software/conda/envs/pysot/bin/python3.7 -m torch.distributed.launch --nproc_per_node 1 --master_port=2333 /media/wyl/01937159-8963-47f6-b239-efe7b18f2e8b/wyl/data/project/pysot/pysot-master/tools/train --cfg /media/wyl/01937159-8963-47f6-b239-efe7b18f2e8b/wyl/data/project/pysot/pysot-master/experiments/siamrpn_r50_l234_dwxcorr_8gpu/config.yaml
/media/wyl/01937159-8963-47f6-b239-efe7b18f2e8b/wyl/data/software/conda/envs/pysot/bin/python3.7: can't open file '/media/wyl/01937159-8963-47f6-b239-efe7b18f2e8b/wyl/data/project/pysot/pysot-master/tools/train': [Errno 2] No such file or directory

Process finished with exit code 0
[screenshot: 2019-07-28 11-19-55]

@Programmerwyl

@xiaotian3 Looking forward to your reply, thank you!

@xiaotian3

xiaotian3 commented Jul 28, 2019

@xiaotian3 Looking forward to your reply, thank you!

Keep only train; remove the whole path in front of it.
[screenshot]

@Programmerwyl

Thank you very much for your reply. Besides configuring train, is there anything else that needs to be configured?
After configuring, do I run train directly rather than train.py?
I configured it as follows and it still does not work:
can't open file 'train': [Errno 2] No such file or directory
I am a bit lost.
[screenshot: 2019-07-28 11-34-21]

@xiaotian3

Thank you very much for your reply. Besides configuring train, is there anything else that needs to be configured?
After configuring, do I run train directly rather than train.py?
I configured it as follows and it still does not work:
can't open file 'train': [Errno 2] No such file or directory
I am a bit lost.
[screenshot: 2019-07-28 11-34-21]

If it doesn't work, add me on QQ: 809078075

@Programmerwyl

OK.

@ghost
Author

ghost commented Jul 28, 2019

@Programmerwyl
I cannot find the Module name option.
Are you using PyCharm Professional?

@Programmerwyl

@chenbolinstudent Mine is the Professional edition.
I don't know whether the Community edition has Module name or not.
In my version, clicking the first dropdown under Configuration lets you choose Module name.

@Programmerwyl

Summary:
If you want to train it with PyCharm, click Run > Edit Configurations.
You need to configure the Module name and the Parameters.
Note: the train script should be given with its full path, as shown in the picture.
[screenshot]
[screenshot]

If you want to debug it with PyCharm, click Run > Edit Configurations.
You need to configure the Module name and the Parameters.
Note: the difference is that you use only train (no path), and you need to configure the Working directory.
[screenshot]

[screenshot]
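For anyone wondering why this works: PyCharm's Module name field runs the target with python -m, so the configuration above corresponds to the usual launch command. A rough equivalence sketch (the paths are placeholders, not real ones):

# What "Module name: torch.distributed.launch" plus the Parameters field
# amounts to, expressed with runpy; it behaves like
# "python -m torch.distributed.launch <parameters>".
import runpy
import sys

sys.argv = ['torch.distributed.launch',
            '--nproc_per_node=1', '--master_port=2333',
            '/full/path/to/pysot/tools/train.py',   # placeholder path
            '--cfg', '/full/path/to/config.yaml']   # placeholder path
runpy.run_module('torch.distributed.launch', run_name='__main__', alter_sys=True)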

@shiyongde

Thanks @Programmerwyl, I made it work. When debugging, use train, not train.py.

@daxiedong

Thank you very much, I got it working too.

@ZhiyuanChen
Collaborator

Since there have been no further questions on this issue, I am closing it now.
If you have a new question, please reopen it or create a new issue.

@byq817

byq817 commented May 12, 2020

Summary:
If you want to train it with PyCharm, click Run > Edit Configurations.
You need to configure the Module name and the Parameters.
Note: the train script should be given with its full path, as shown in the picture.
[screenshot]
[screenshot]

If you want to debug it with PyCharm, click Run > Edit Configurations.
You need to configure the Module name and the Parameters.
Note: the difference is that you use only train (no path), and you need to configure the Working directory.
[screenshot]

[screenshot]

May I ask how your distribution was generated?
