
training #42

Closed

ghost opened this issue Jun 12, 2019 · 53 comments

@ghost

ghost commented Jun 12, 2019

How can I run train.py without going through the command line? At the moment I can run test.py directly, but running train.py directly gives:

File "/home/public/anaconda3/lib/python3.6/os.py", line 669, in __getitem__
raise KeyError(key) from None
KeyError: 'RANK'

How can I set, in code, the values in:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch \
--nproc_per_node=8 \
--master_port=2333 \
../../tools/train.py --cfg config.yaml

so that train.py can be run directly in PyCharm?
Thank you for your reply.

@serycjon

It seems you have to set some environment variables, see: https://pytorch.org/docs/0.3.0/distributed.html#environment-variable-initialization
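A minimal single-process sketch of that environment-variable initialization (my assumption, following the PyTorch docs linked above; the values are just examples):

# Set, before train.py reads them (e.g. at the very top of the script, or
# in the IDE's run configuration), the variables that torch.distributed.launch
# would normally provide to each worker process.
import os

os.environ['RANK'] = '0'             # rank of this (only) process
os.environ['WORLD_SIZE'] = '1'       # total number of processes
os.environ['MASTER_ADDR'] = '127.0.0.1'
os.environ['MASTER_PORT'] = '2333'   # any free port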

@IlchaeJung

I have the same problem as you.
I am also having difficulty with @serycjon's proposal.
@chenbolinstudent, is this proposal working for you?

However, I did manage to run the evaluation, and it works very well.

@ghost
Author

ghost commented Jun 14, 2019

@serycjon
Can you give more details?

@serycjon

I don't have the training data ready, so haven't really tested it, but running the script with:
WORLD_SIZE=1 RANK=0 PYTHONPATH=./:$PYTHONPATH python tools/train.py --cfg experiments/siamrpn_r50_l234_dwxcorr/config.yaml got me past that rank error.

@ghost
Author

ghost commented Jun 14, 2019

If I run:

WORLD_SIZE=1 RANK=0 PYTHONPATH=./:$PYTHONPATH python tools/train.py --cfg experiments/siamrpn_r50_l234_dwxcorr/config.yaml

then:

Traceback (most recent call last):
File "train.py", line 24, in
from pysot.utils.lr_scheduler import build_lr_scheduler
ModuleNotFoundError: No module named 'pysot'

@Hevans123

@chenbolinstudent, could you post a screenshot of the code changes that solved the RANK problem? Isn't PYTHONPATH=./:$PYTHONPATH python tools/train.py a command-line statement? How do you make that change in code?

@Hevans123

@serycjon

@serycjon

@chenbolinstudent Well, you have to install pysot correctly first (see install.md).
@Hevans123 It is a valid bash command. You will have to check how to set environment variables on your system if it does not work.
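If setting PYTHONPATH is awkward (for example inside an IDE), a hypothetical workaround that mirrors what PYTHONPATH=./ does is to put the repository root on sys.path above the pysot imports in tools/train.py:

# Add the repository root (the directory containing the pysot package)
# to sys.path, mirroring PYTHONPATH=./ run from the repo root.
import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), '..'))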

@ZZXin

ZZXin commented Jun 15, 2019

I found a problem: the training time is very long, and if training is interrupted midway you have to start again from the beginning. Is there a way to resume training from where it was interrupted, as in SiamFC's training process?

@lb1100
Contributor

lb1100 commented Jun 19, 2019

I found a problem: the training time is very long, and if training is interrupted midway you have to start again from the beginning. Is there a way to resume training from where it was interrupted, as in SiamFC's training process?

Try to use TRAIN.RESUME to resume your training.
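For reference, a sketch of what that looks like, assuming pysot's yacs-style config and a hypothetical checkpoint path (use whatever file your run saved under the snapshot directory):

# In config.yaml (checkpoint name is hypothetical):
#   TRAIN:
#     RESUME: './snapshot/checkpoint_e10.pth'
# or equivalently in Python, assuming pysot's yacs-based cfg:
from pysot.core.config import cfg

cfg.merge_from_file('config.yaml')
cfg.TRAIN.RESUME = './snapshot/checkpoint_e10.pth'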

@ZZXin

ZZXin commented Jun 22, 2019

I found a problem: the training time is very long, and if training is interrupted midway you have to start again from the beginning. Is there a way to resume training from where it was interrupted, as in SiamFC's training process?

Try to use TRAIN.RESUME to resume your training.

@lb1100
I did what you said, but I ran into this problem:

[2019-06-22 09:28:08,844-rk0-model_load.py# 42] remove prefix 'module.'
[2019-06-22 09:28:08,844-rk0-model_load.py# 33] used keys:75
[2019-06-22 09:28:08,845-rk0-model_load.py# 33] used keys:2
Traceback (most recent call last):
File "../../tools/train.py", line 319, in
main()
File "../../tools/train.py", line 307, in main
restore_from(model, optimizer, cfg.TRAIN.RESUME)
File "/home/db/Subject/pysot/pysot/utils/model_load.py", line 89, in restore_from
optimizer.load_state_dict(ckpt['optimizer'])
File "/home/db/anaconda3/envs/pysot/lib/python3.7/site-packages/torch/optim/optimizer.py", line 107, in load_state_dict
raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
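That ValueError comes from PyTorch itself: the optimizer being restored was built with a parameter group of a different size than the one saved in the checkpoint. In pysot it is worth checking whether the same parameters are trainable on resume as when the checkpoint was saved (e.g. frozen vs. unfrozen backbone layers). A self-contained illustration of the error (not pysot code):

# Minimal reproduction: the saved optimizer's parameter group holds two
# tensors, while the freshly built optimizer's group holds only one.
import torch

model = torch.nn.Linear(4, 2)
opt_saved = torch.optim.SGD([model.weight, model.bias], lr=0.1)
opt_fresh = torch.optim.SGD([model.weight], lr=0.1)

# Raises: ValueError: loaded state dict contains a parameter group
# that doesn't match the size of optimizer's group
opt_fresh.load_state_dict(opt_saved.state_dict())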

@ZZXin

ZZXin commented Jun 22, 2019

How can I run train.py without going through the command line? At the moment I can run test.py directly, but running train.py directly gives:

File "/home/public/anaconda3/lib/python3.6/os.py", line 669, in __getitem__
raise KeyError(key) from None
KeyError: 'RANK'

How can I set, in code, the values in:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch \
--nproc_per_node=8 \
--master_port=2333 \
../../tools/train.py --cfg config.yaml

so that train.py can be run directly in PyCharm?
Thank you for your reply.

I can debug the training code in PyCharm; you can refer to this blog: https://oldpan.me/archives/pytorch-to-use-multiple-gpus. But if you follow its configuration, you will get an error: no module named train.py. Just change 'train.py' to 'train' and it works.

@ghost
Author

ghost commented Jun 23, 2019

@ZZXin
Can you give more details?
Thanks.

@ghost
Author

ghost commented Jun 23, 2019

But this blog does not say how to set nproc_per_node and master_port.

@ZZXin

ZZXin commented Jun 23, 2019

But this blog does not say how to set nproc_per_node and master_port.

I train this code on a single GPU, so I think it doesn't matter; you can set --nproc_per_node=1 (--master_port can be any free port, e.g. 2333).

@ghost
Author

ghost commented Jun 23, 2019

In which .py file do I set this?

@ghost
Author

ghost commented Jun 23, 2019

In which .py file do I set
--nproc_per_node
--master_port ?

@ZZXin

ZZXin commented Jun 23, 2019

In which .py file do I set
--nproc_per_node
--master_port ?

train.py

@ghost
Author

ghost commented Jun 23, 2019

--nproc_per_node
--master_port
do not exist in train.py

@ZZXin

ZZXin commented Jun 23, 2019

@ZZXin
Can you offer a Baidu Wangpan link for youtubebb?

I don't have it; I am waiting for pysot's download link.

@ghost
Author

ghost commented Jun 23, 2019

How do you debug the training code in PyCharm?
https://oldpan.me/archives/pytorch-to-use-multiple-gpus does not say how to set it.
Can you give details?

@ZZXin

ZZXin commented Jun 23, 2019

How do you debug the training code in PyCharm?
https://oldpan.me/archives/pytorch-to-use-multiple-gpus does not say how to set it.
Can you give details?

Refer to
https://image.oldpan.me/TIM%E6%88%AA%E5%9B%BE20190306182352.jpg
Take a good look at this picture.

@xi-mao

xi-mao commented Jun 25, 2019

How do you debug the training code in PyCharm?
https://oldpan.me/archives/pytorch-to-use-multiple-gpus does not say how to set it.
Can you give details?

Refer to
https://image.oldpan.me/TIM%E6%88%AA%E5%9B%BE20190306182352.jpg
Take a good look at this picture.

May I ask where the configuration shown in the picture is? I can't find that window.

@ghost
Author

ghost commented Jun 25, 2019

Script:
torch.distributed.launch

Script parameters:
--nproc_per_node=1 --master_port=2333 /home/ty/PycharmProjects/pysot-master/tools/train.py

Still KeyError: 'RANK'

@yangkai12

Has the issue been solved? I have the same issue.

@boh95

boh95 commented Jul 5, 2019

I found a problem: the training time is very long, and if training is interrupted midway you have to start again from the beginning. Is there a way to resume training from where it was interrupted, as in SiamFC's training process?

Try to use TRAIN.RESUME to resume your training.

@lb1100
I did what you said, but I ran into this problem:

[2019-06-22 09:28:08,844-rk0-model_load.py# 42] remove prefix 'module.'
[2019-06-22 09:28:08,844-rk0-model_load.py# 33] used keys:75
[2019-06-22 09:28:08,845-rk0-model_load.py# 33] used keys:2
Traceback (most recent call last):
File "../../tools/train.py", line 319, in <module>
main()
File "../../tools/train.py", line 307, in main
restore_from(model, optimizer, cfg.TRAIN.RESUME)
File "/home/db/Subject/pysot/pysot/utils/model_load.py", line 89, in restore_from
optimizer.load_state_dict(ckpt['optimizer'])
File "/home/db/anaconda3/envs/pysot/lib/python3.7/site-packages/torch/optim/optimizer.py", line 107, in load_state_dict
raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

Have you solved this problem?

@xiaotian3

xiaotian3 commented Jul 8, 2019

In PyCharm: Run -> Edit Configurations -> Configuration -> Module name: torch.distributed.launch
PS: the default is Script path; you have to click the dropdown to switch it to Module name.
Refer to
https://image.oldpan.me/TIM%E6%88%AA%E5%9B%BE20190306182352.jpg

@ghost
Author

ghost commented Jul 8, 2019

@ZZXin
train.py does not have "--nproc_per_node" or "--master_port".

@xiaotian3

@chenbolinstudent Module name! Not Script path. Give it a try.

@ghost
Author

ghost commented Jul 8, 2019

@xiaotian3
What do you mean? Please explain in detail.

@xiaotian3

xiaotian3 commented Jul 8, 2019

Script:
torch.distributed.launch

Script parameters:
--nproc_per_node=1 --master_port=2333 /home/ty/PycharmProjects/pysot-master/tools/train.py

Still KeyError: 'RANK'

You used Script path, but it should use Module name.
Refer to
https://image.oldpan.me/TIM%E6%88%AA%E5%9B%BE20190306182352.jpg
@chenbolinstudent

@ZZXin

ZZXin commented Jul 8, 2019

@ZZXin
train.py does not have "--nproc_per_node" or "--master_port".

I think --nproc_per_node and --master_port are related to torch's distributed training; they are options of torch.distributed.launch, not of train.py.

@ghost
Author

ghost commented Jul 16, 2019

[screenshot 1]

[screenshot 2]

I set it as shown and did not change the train.py script, but I still get KeyError: 'RANK'.
@xiaotian3

@Programmerwyl

I have a question: in PyCharm, should I run train or distributed?
Second, it is OK when I run distributed, but if I debug with distributed, I get 'no module named ./tools/train.py'.
I have modified the path of train.py, but I still get an error.
@ZZXin I just changed 'train.py' to 'train', and it does not work.

@xiaotian3

[screenshot 1]

[screenshot 2]

I set it as shown and did not change the train.py script, but I still get KeyError: 'RANK'.
@xiaotian3

[screenshot]

@Programmerwyl

@xiaotian3
1. Do you configure it on Ubuntu or Windows?
2. Can you debug with PyCharm?

@xiaotian3

@xiaotian3
1. Do you configure it on Ubuntu or Windows?
2. Can you debug with PyCharm?

Ubuntu.
I can debug with PyCharm.
[screenshot]

@Programmerwyl

@xiaotian3
When I debug, I always get the error:
No module named /media/wyl/01937159-8963-47f6-b239-efe7b18f2e8b/wyl/data/project/pysot/pysot-master/tools/train.py

[screenshot: 2019-07-26 16-52-22]

@Programmerwyl

@xiaotian3
Please give me some guidance; I have been at this all afternoon.
Thank you very much.

@xiaotian3

@xiaotian3
Please give me some guidance; I have been at this all afternoon.
Thank you very much.

Use just train; for the path that goes in front of it, you can refer to my picture.

@Programmerwyl

@xiaotian3 Could you open the Parameters you configured in PyCharm and show them to me?
After I removed the .py, it will not even run:

/media/wyl/01937159-8963-47f6-b239-efe7b18f2e8b/wyl/data/software/conda/envs/pysot/bin/python3.7 -m torch.distributed.launch --nproc_per_node 1 --master_port=2333 /media/wyl/01937159-8963-47f6-b239-efe7b18f2e8b/wyl/data/project/pysot/pysot-master/tools/train --cfg /media/wyl/01937159-8963-47f6-b239-efe7b18f2e8b/wyl/data/project/pysot/pysot-master/experiments/siamrpn_r50_l234_dwxcorr_8gpu/config.yaml
/media/wyl/01937159-8963-47f6-b239-efe7b18f2e8b/wyl/data/software/conda/envs/pysot/bin/python3.7: can't open file '/media/wyl/01937159-8963-47f6-b239-efe7b18f2e8b/wyl/data/project/pysot/pysot-master/tools/train': [Errno 2] No such file or directory

Process finished with exit code 0
[screenshot: 2019-07-28 11-19-55]

@Programmerwyl

@xiaotian3 Looking forward to your reply, thank you!

@xiaotian3

xiaotian3 commented Jul 28, 2019

@xiaotian3 Looking forward to your reply, thank you!

Keep only train; remove the whole path in front of it.
[screenshot]

@Programmerwyl

Thank you very much for your reply. Besides configuring train, is there anything else that needs to be configured?
After configuring, do I run train directly rather than train.py?
I configured it as follows and it still does not work:
can't open file 'train': [Errno 2] No such file or directory
I am a bit lost.
[screenshot: 2019-07-28 11-34-21]

@xiaotian3

Thank you very much for your reply. Besides configuring train, is there anything else that needs to be configured?
After configuring, do I run train directly rather than train.py?
I configured it as follows and it still does not work:
can't open file 'train': [Errno 2] No such file or directory
I am a bit lost.
[screenshot: 2019-07-28 11-34-21]

If it doesn't work, add me on QQ: 809078075

@Programmerwyl

OK.

@ghost
Author

ghost commented Jul 28, 2019

@Programmerwyl
I cannot find the Module name option.
Are you using PyCharm Professional?

@Programmerwyl

@chenbolinstudent Mine is the Professional edition.
I don't know whether the Community edition has Module name or not.
In my version, clicking the first dropdown under Configuration lets you choose Module name.

@Programmerwyl

Summary:
If you want to train it with PyCharm, click Run > Edit Configurations.
You need to configure the Module name and the Parameters.
Note: the train script should be given with its full path, as shown in the picture.
[screenshot]
[screenshot]

If you want to debug it with PyCharm, click Run > Edit Configurations.
You need to configure the Module name and the Parameters.
Note: the difference is that you use only train (no path), and you need to configure the Working directory.
[screenshot]

[screenshot]
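For anyone wondering why this works: PyCharm's Module name field runs the target with python -m, so the configuration above corresponds to the usual launch command. A rough equivalence sketch (the paths are placeholders, not real ones):

# What "Module name: torch.distributed.launch" plus the Parameters field
# amounts to, expressed with runpy; it behaves like
# "python -m torch.distributed.launch <parameters>".
import runpy
import sys

sys.argv = ['torch.distributed.launch',
            '--nproc_per_node=1', '--master_port=2333',
            '/full/path/to/pysot/tools/train.py',   # placeholder path
            '--cfg', '/full/path/to/config.yaml']   # placeholder path
runpy.run_module('torch.distributed.launch', run_name='__main__', alter_sys=True)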

@shiyongde

Thanks @Programmerwyl, I made it work. When debugging, use train, not train.py.

@daxiedong

Thank you very much, I got it working too.

@ZhiyuanChen
Collaborator

Since there have been no further questions on this issue, I am closing it now.
If you have a new question, please reopen it or create a new issue.

@byq817

byq817 commented May 12, 2020

Summary:
If you want to train it with PyCharm, click Run > Edit Configurations.
You need to configure the Module name and the Parameters.
Note: the train script should be given with its full path, as shown in the picture.
[screenshot]
[screenshot]

If you want to debug it with PyCharm, click Run > Edit Configurations.
You need to configure the Module name and the Parameters.
Note: the difference is that you use only train (no path), and you need to configure the Working directory.
[screenshot]

[screenshot]

May I ask how your distribution was generated?
