Unable to reproduce `seg_hrnet_w18_small_v1` #51

alvinwan · 2019-09-03T23:54:37Z

Thanks for 27488d4, the configuration file is very helpful. That said, training on 4 GPUs as prescribed, I'm unable to reproduce the Cityscapes validation accuracy of 70.3% listed at https://github.com/HRNet/HRNet-Semantic-Segmentation#small-models (I attained 65.21%).

Is https://github.com/HRNet/HRNet-Semantic-Segmentation/blob/master/experiments/cityscapes/seg_hrnet_w18_small_v1_512x1024_sgd_lr1e-2_wd5e-4_bs_12_epoch484.yaml verbatim the file used to produce 70.3% or does it need further hyperparameter tuning? (I'm on the pytorch-v1.1 branch.)

In case it's helpful (although I doubt it's informative), here are the per-class IoUs for the retrained w18-v1 model:

Loss: 0.179, MeanIU:  0.6509, Best_mIoU:  0.6521
[0.97245895 0.79921705 0.8969752  0.43651182 0.47062117 0.56336364
 0.57983322 0.68906234 0.91533262 0.60986547 0.93415257 0.74804671
 0.46804914 0.91671634 0.4241423  0.58802203 0.24108752 0.41514963
 0.69802723]
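As a sanity check on the log above, the reported MeanIU should equal the unweighted mean of the 19 per-class IoUs. The following is a minimal sketch in plain Python. The class names are an assumption (the standard Cityscapes 19-class train-ID order); the log itself prints only the values, and the helper names are hypothetical.

```python
# Per-class IoUs copied verbatim from the log above.
ious = [
    0.97245895, 0.79921705, 0.8969752, 0.43651182, 0.47062117, 0.56336364,
    0.57983322, 0.68906234, 0.91533262, 0.60986547, 0.93415257, 0.74804671,
    0.46804914, 0.91671634, 0.4241423, 0.58802203, 0.24108752, 0.41514963,
    0.69802723,
]

# ASSUMPTION: standard Cityscapes train-ID ordering; the log does not name classes.
CLASSES = [
    "road", "sidewalk", "building", "wall", "fence", "pole",
    "traffic light", "traffic sign", "vegetation", "terrain", "sky", "person",
    "rider", "car", "truck", "bus", "train", "motorcycle", "bicycle",
]


def mean_iou(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


def worst_classes(values, names, k=3):
    """Return the k (name, IoU) pairs with the lowest IoU, lowest first."""
    return sorted(zip(names, values), key=lambda pair: pair[1])[:k]


if __name__ == "__main__":
    print(f"mean IoU: {mean_iou(ious):.4f}")
    for name, iou in worst_classes(ious, CLASSES):
        print(f"{name}: {iou:.4f}")
```

Under the assumed ordering, the lowest-scoring entries are the rarer vehicle classes, which is where a gap to the reported number would most plausibly show up.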


sunke123 · 2019-09-06T02:16:38Z

Hi, sorry for the late reply.
This is my training log for seg_hrnet_w18_small_v1: https://1drv.ms/u/s!Aus8VCZ_C_33gSQ7irYs1DZy68yv?e=6fiuJN

The model 'hrnet_w18_for_mb' is the same as hrnet_w18_small_v1.
Please check it out.
I believe I used the same settings as you.
Also, if you are using pytorch-v1.2, try running this code on pytorch-v1.1. Friends of mine report that they get worse performance with pytorch-v1.2.

alvinwan · 2019-09-06T05:29:45Z

No problem, thanks for replying! I noticed that the config in your training log includes CLASS_BALANCE: True, whereas the current YAML does not set this variable (by default, lib/config/default.py sets this variable to false). I will try retraining with the class balance variable set to true, and if that works, I'll make a PR with the change.
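For anyone comparing a config dumped in a training log against a YAML file, a small diff helper can surface keys like this one. This is only a sketch: `flatten`, `diff_configs`, and the two example dictionaries are hypothetical and illustrative, not taken from the repository. A key absent from the YAML still falls back to whatever the defaults file sets, so a reported difference is a starting point for checking, not proof of a behavioral difference.

```python
UNSET = "<unset>"  # marker for keys present on one side only


def flatten(cfg, prefix=""):
    """Flatten a nested dict into {'A.B.C': value} form."""
    flat = {}
    for key, value in cfg.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def diff_configs(left, right):
    """Return {key: (left_value, right_value)} for every key that differs."""
    flat_left, flat_right = flatten(left), flatten(right)
    result = {}
    for key in sorted(flat_left.keys() | flat_right.keys()):
        a = flat_left.get(key, UNSET)
        b = flat_right.get(key, UNSET)
        if a != b:
            result[key] = (a, b)
    return result


if __name__ == "__main__":
    # Illustrative values only, not the real configs.
    from_log = {"LOSS": {"CLASS_BALANCE": True, "USE_OHEM": False}}
    from_yaml = {"LOSS": {"USE_OHEM": False}}
    for key, (a, b) in diff_configs(from_log, from_yaml).items():
        print(f"{key}: log={a} yaml={b}")
```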

alvinwan · 2019-09-09T20:48:33Z

For posterity, I was unable to reproduce seg_hrnet_w18_small_v1's reported accuracy of 0.7026. Training with pytorch 1.1 + CLASS_BALANCE: True improved on my initial accuracy above by about 1.5 percentage points. However, I obtained only 0.6688 and 0.6674 (~3.4 points short).

Edit: CLASS_BALANCE doesn't change anything, as the default is already true.

HRNet-Semantic-Segmentation/lib/config/default.py

Line 44 in 06142dc

_C.LOSS.CLASS_BALANCE = True

Looks like the improvement came from downgrading pytorch 1.2 to pytorch 1.1.
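Since the framework version appears to matter here, a training script could check the release series up front. A minimal sketch, assuming a hypothetical helper `matches_release`; the only real API referenced is `torch.__version__`, and it appears in a comment so the snippet runs without PyTorch installed.

```python
def matches_release(installed: str, expected: str) -> bool:
    """True if `installed` (e.g. '1.1.0' or '1.1.0+cu100') is in series `expected` (e.g. '1.1')."""
    core = installed.split("+", 1)[0]  # drop a local build tag, if any
    return core.split(".")[:2] == expected.split(".")[:2]


if __name__ == "__main__":
    # In a real script one might write:
    #   import torch
    #   assert matches_release(torch.__version__, "1.1")
    for version in ("1.1.0", "1.2.0"):
        print(version, matches_release(version, "1.1"))
```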

@sunke123 would you happen to have the mean/std of your runs?

sunke123 · 2019-09-10T06:35:54Z

I only ran it once.
I think the difference of 3.4% is too large.
Could you share your training log with me?
I will double-check it.

alvinwan · 2019-09-10T07:23:27Z

@sunke123 Thanks for offering to take a look. I think I just figured it out, oops: Looking at your logs again, I noticed the first epoch results in 30% mIOU (whereas my first epoch results in 10% mIOU), and your logs contain an extra line:

~~Is that a mobilenet-v3 checkpoint pretrained on imagenet? If so, that likely explains the discrepancy (if so, oops, sorry). Just in case that's not the issue, here are both training logs:~~

0.6674 (log)
0.6688 (log part 1, part 2)

I realize I didn't initialize from the pretrained ImageNet weights, which the README says to do. Sorry for the bother -- I'll try again and update here.
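A missing checkpoint is easy to overlook, so one option is to fail loudly before training starts. The sketch below assumes the checkpoint path comes from a config entry (a name such as `MODEL.PRETRAINED` is an assumption here); `require_pretrained` is a hypothetical helper, not part of the repository.

```python
import os


def require_pretrained(path: str) -> str:
    """Return `path` if it points to an existing file; otherwise raise."""
    if not path:
        raise ValueError(
            "no pretrained checkpoint configured; "
            "training would start from random initialization"
        )
    if not os.path.isfile(path):
        raise FileNotFoundError(f"pretrained checkpoint not found: {path}")
    return path


if __name__ == "__main__":
    try:
        require_pretrained("")  # simulates an empty config entry
    except ValueError as err:
        print(f"refusing to train: {err}")
```

A first-epoch mIoU far below a reference log, like the 10% versus 30% noted above, is another cheap signal that initialization differs.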

alvinwan · 2019-09-12T01:52:26Z

@sunke123 sorry for bothering, that fixed it! haha. I appreciate your help. These are probably the first reproducible results I've ever seen. The code is clean, the results are reproducible... couldn't ask for more.

sunke123 · 2019-09-12T06:27:03Z

Congrats! hh~
Thanks for your attention.
If you have any questions, please feel free to contact us.

YijianLiu mentioned this issue Sep 6, 2019

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generated/../THCReduceAll.cuh:317 terminate called after throwing an instance of 'at::Error' #50

Closed

alvinwan changed the title ~~Unable to reproduce w18-v1~~ Unable to reproduce seg_hrnet_w18_small_v1 Sep 9, 2019

alvinwan closed this as completed Sep 12, 2019
