It takes long time to train #5

Open
juntawu opened this issue Apr 19, 2021 · 17 comments

Comments

@juntawu

juntawu commented Apr 19, 2021

Hello. Thanks for your work.
I trained the HRNetV2-W18-C+OCR ITER-M model with the command
python3 train.py models/iter_mask/hrnet18_cocolvis_itermask_3p.py --gpus=0,1 --workers=6 --exp-name=first-try
on the COCO_LVIS dataset, with 2 GPUs (Tesla-V100-SXM2-32GB).
However, it took me more than 70 hours to train 200 epochs. Is this normal?

@haoyuying

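@juntawu How long did your training take, and what batch_size did you use? I ran into the same problem: I trained on a single GPU following the training procedure given in the code, and it feels like training takes too long.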

@liyuxuan89

I trained hrnet18s on one 1080ti for 200 epochs. It took approximately 20 minutes per epoch. The result is lower than reported. I wonder if this is normal.
[Screenshot: 企业微信截图_20210623172604]

@qinliuliuqin

qinliuliuqin commented Oct 25, 2021

It's normal. #3 (comment)
I need 3 days to train 220 epochs. This is why the authors only trained 55 epochs for their experiments.
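
As a rough sanity check of the figures quoted in this thread, the sketch below converts a measured run into minutes per epoch and projects it to the 55-epoch schedule mentioned above. It assumes training time scales linearly with the number of epochs and ignores validation and checkpointing overhead; the function names are illustrative and not part of the repository.

```python
def hours_per_epoch(total_hours: float, epochs: int) -> float:
    """Average wall-clock hours spent on one epoch."""
    if epochs <= 0:
        raise ValueError("epochs must be positive")
    return total_hours / epochs


def projected_hours(total_hours: float, epochs: int, target_epochs: int) -> float:
    """Scale a measured run linearly to a different epoch count (an assumption)."""
    return hours_per_epoch(total_hours, epochs) * target_epochs


# Figures reported in this thread: (total hours, epochs trained).
runs = {
    "2x V100, about 70 h for 200 epochs": (70.0, 200),
    "3 days for 220 epochs": (72.0, 220),
}

for label, (hours, epochs) in runs.items():
    minutes = hours_per_epoch(hours, epochs) * 60
    short_run = projected_hours(hours, epochs, 55)
    print(f"{label}: {minutes:.1f} min/epoch, about {short_run:.1f} h for 55 epochs")
```

With the 70-hour figure this works out to 21 minutes per epoch, so a 55-epoch run on the same hardware would take a little over 19 hours under the linear-scaling assumption.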

@qinliuliuqin
The batch size is set to 32 by default. To save time, you only need to train 55 epochs on COCO_LVIS, as the authors did in their experiments.

@ty199931
Save checkpoint to experiments\iter_mask\sbd_hrnet18\000_first-try\checkpoints\last_checkpoint.pth
Save checkpoint to experiments\iter_mask\sbd_hrnet18\000_first-try\checkpoints\000.pth
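Is it normal that, when training the first epoch, the run stays stuck on this screen after the epoch finishes? Is it supposed to stall here for a long time? I am afraid to touch anything.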

@qinliuliuqin
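Validation runs after each training epoch finishes. You could read through the code; it is very well written. There is a pause during validation, but it does not last long, and a progress bar is shown.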

@ty199931
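I read the code and stepped through it with breakpoints to find the problem. Sometimes execution does not even enter the for loop. If none of you have this problem, could it be my computer? Or is something wrong with my dataset?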

@hyalvin

hyalvin commented Jan 4, 2022

Do any of you see the training loss start over at every epoch? It feels as if each epoch is independent.

@hyalvin

hyalvin commented Jan 4, 2022

Hello, may I ask how you got these results? My validation process only reports the validation loss.

@xiangyunfan
How was this problem eventually solved? I hit the same issue when training on my own dataset: after the first epoch finishes, it hangs at validation.

@yangshunDragon
How did you eventually solve this problem?

@yangshunDragon
How did you eventually solve this problem?

@yangshunDragon
How did you eventually solve this problem?

@ty199931

ty199931 commented May 7, 2022 via email

@yangshunDragon

yangshunDragon commented May 9, 2022

Quoted email reply: It seems to be caused by the images. Removing the ones without labels fixes it.

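Does "without labels" mean that an original image images/sth.jpg exists but the corresponding mask masks/sth.png does not?

In my case the training dataset is built from slices of 3D medical images. Every original image images/sth.jpg has a corresponding masks/sth.png, but a certain proportion of the masks are entirely black (the mask contains no target).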
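
The thread does not confirm which of the two cases above causes the hang, so the following is only a sketch that screens a custom dataset for both: images whose mask file is missing, and images whose mask is entirely black. The images/*.jpg and masks/*.png layout follows the description above; the function names are hypothetical and not part of the repository. Reading the masks assumes Pillow is installed.

```python
from pathlib import Path
from typing import Callable, List, Tuple


def mask_has_object(mask_path: Path) -> bool:
    """True when the mask contains at least one non-zero pixel (requires Pillow)."""
    from PIL import Image  # imported lazily so split_by_label works without Pillow

    with Image.open(mask_path) as mask:
        # For a single-band image getextrema() returns (min, max).
        _, brightest = mask.convert("L").getextrema()
    return brightest > 0


def split_by_label(
    image_dir: Path,
    mask_dir: Path,
    has_object: Callable[[Path], bool] = mask_has_object,
) -> Tuple[List[Path], List[Path]]:
    """Return (usable, rejected) image paths.

    An image is rejected when <mask_dir>/<stem>.png is missing or has no object.
    """
    usable: List[Path] = []
    rejected: List[Path] = []
    for image_path in sorted(Path(image_dir).glob("*.jpg")):
        mask_path = Path(mask_dir) / f"{image_path.stem}.png"
        if mask_path.exists() and has_object(mask_path):
            usable.append(image_path)
        else:
            rejected.append(image_path)
    return usable, rejected
```

Calling split_by_label(Path("images"), Path("masks")) returns the two lists, and the rejected list can be reviewed before anything is removed from the training set. Whether dropping empty-mask slices is appropriate for a given medical dataset is a separate question that the thread does not settle.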

@chuyhu

chuyhu commented Jul 11, 2022

@yangshunDragon Hello, I would also like to use this model for medical image segmentation. May I ask whether you have solved this problem?

@ty199931

ty199931 commented Oct 11, 2022 via email
