
Something strange when training #47

Open
liwenssss opened this issue Aug 30, 2019 · 6 comments

Comments

@liwenssss

Hi, I modified train.sh as follows:

python train.py --name resnet_radvani_32000_20190415 --model resnet --netD conv-up --batch_size 4 --max_dataset_size 32000 --niter 20 --niter_decay 50 --save_result_freq 250 --save_epoch_freq 2 --ndown 6 --data_root /home/liwensh/data

and ran it for about 16 hours (44 epochs). The end of log.txt shows:

epoch 43 iter 6499:  l1: 11.288669 tv: 1.522304 total: 11.288669
epoch 43 iter 6749:  l1: 11.599895 tv: 0.667862 total: 11.599895
epoch 43 iter 6999:  l1: 11.125267 tv: 1.277602 total: 11.125267
epoch 43 iter 7249:  l1: 11.893361 tv: 1.366742 total: 11.893361
epoch 43 iter 7499:  l1: 11.343329 tv: 1.228081 total: 11.343329
epoch 43 iter 7749:  l1: 11.397069 tv: 1.426213 total: 11.397069
epoch 43 iter 7999:  l1: 11.519998 tv: 0.664876 total: 11.519998
epoch 44 iter 249:  l1: 11.183926 tv: 1.258252 total: 11.183926
epoch 44 iter 499:  l1: 11.555054 tv: 1.201256 total: 11.555054
epoch 44 iter 749:  l1: 12.041154 tv: 1.312884 total: 12.041154
epoch 44 iter 999:  l1: 11.605458 tv: 0.706056 total: 11.605458
epoch 44 iter 1249:  l1: 11.589639 tv: 1.093558 total: 11.589639
epoch 44 iter 1499:  l1: 11.533211 tv: 1.338729 total: 11.533211
epoch 44 iter 1749:  l1: 11.822362 tv: 1.297630 total: 11.822362
epoch 44 iter 1999:  l1: 12.410873 tv: 1.159959 total: 12.410873
epoch 44 iter 2249:  l1: 11.855060 tv: 1.531642 total: 11.855060

The total loss has not changed much since the 5th epoch, and the intermediate output at epoch 44 is strange:
[image: intermediate output at epoch 44]
I wonder if this is because the batch size is too small, since I don't have enough GPU memory, or whether I have set some other option incorrectly.

@Lotayou
Owner

Lotayou commented Aug 30, 2019

The model has definitely crashed... It seems you've activated tv_loss and set the weight too high, so the output tends to be over-smoothed and collapses to the local minima [1,0,0] and [0,1,1], leading to the blue and yellow pattern... Do all the predicted UV maps look like this?

I don't know what else you changed in resnet_model.py, but tv_loss is one of the things not to be messed with... Try switching off tv_loss and see if the training stabilizes; let me know if the crash still happens. Good luck.
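
(For reference, a minimal sketch of what such a total-variation penalty typically looks like; `pred`, `tv_loss`, and `lambda_tv` are illustrative names, not necessarily the ones used in resnet_model.py. Setting the weight to 0 corresponds to switching the term off.)

```python
# Minimal sketch of a total-variation penalty on a predicted UV map of shape
# (N, 3, H, W). Names are illustrative, not the repository's exact code.
import torch

def tv_loss(pred: torch.Tensor) -> torch.Tensor:
    # Mean absolute difference between neighbouring pixels along H and W;
    # a large weight on this term over-smooths the output.
    dh = (pred[:, :, 1:, :] - pred[:, :, :-1, :]).abs().mean()
    dw = (pred[:, :, :, 1:] - pred[:, :, :, :-1]).abs().mean()
    return dh + dw

lambda_tv = 0.0  # 0 disables the term, as suggested above
# total_loss = l1_loss + lambda_tv * tv_loss(pred)
```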

@liwenssss
Author

I just removed the dilation when generating the UV maps and set align_corners=True for the upsampling. Now that I've trained again it looks better. I will post the results later.
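
(A minimal sketch of the align_corners choice in bilinear upsampling; the tensor and call here are illustrative, not the exact layer definitions in resnet_model.py.)

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)
# align_corners=True aligns the corner pixels of the input and output grids;
# whichever setting is used should stay consistent with how the UV maps
# were generated and how they are resampled afterwards.
y = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=True)
```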

@liwenssss
Author

When training reached epoch 22, iter 2749, the L1 loss was 1.736614 and the predicted UV map was:
[image: 022_02749]
But by epoch 22, iter 2999, the L1 loss was 8.662540 and the predicted UV map was:
[image: 022_02999]
and it kept getting worse after that.

@Lotayou
Owner

Lotayou commented Sep 1, 2019

Interesting, it looks like the training was successful. Random crashes happen frequently in my experiments; it could be a bad item in the provided toy dataset, but I'm not sure either. Anyway, I just load the last successful checkpoint and resume training whenever a crash happens, not a big deal.
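
(A minimal sketch of that resume workflow, assuming checkpoints are saved as plain state_dicts; the path and the stand-in module are illustrative, not the repository's exact checkpointing code.)

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the UV generator

# Save periodically during training...
torch.save(model.state_dict(), 'latest_net.pth')

# ...and after a crash, reload the last successful checkpoint and keep training.
model.load_state_dict(torch.load('latest_net.pth', map_location='cpu'))
```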

@liwenssss
Author

Yes, I tried reloading the latest better-performing checkpoint and continuing training. But I found the resampled result still shows... hmm, the same old problem, so I wonder if my trained network is doing something wrong. Using the provided pre-trained model I get the following predicted UV map:
[image: 200_00000]
but the resampled result is... interesting:
[image: resampled result]
The same resample function can generate a nearly normal result from the generated UV map. I wonder if there is something I can do based on the resampled result, like fitting the SMPL model as described in the paper.
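
(For context, a minimal sketch of how per-vertex positions could be read back from a predicted UV position map with grid_sample; `uv_map` and `vt` are assumptions here, not the repository's actual resample function.)

```python
import torch
import torch.nn.functional as F

def resample_vertices(uv_map: torch.Tensor, vt: torch.Tensor) -> torch.Tensor:
    # uv_map: (1, 3, H, W) predicted position map, vt: (V, 2) per-vertex UV
    # coordinates in [0, 1]. Map them to the [-1, 1] range grid_sample expects.
    grid = (vt * 2.0 - 1.0).view(1, 1, -1, 2)                 # (1, 1, V, 2)
    verts = F.grid_sample(uv_map, grid, mode='bilinear',
                          align_corners=True)                 # (1, 3, 1, V)
    return verts.view(3, -1).t()                              # (V, 3)
```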

@onepiece666

Later on, an error gets reported at about every epoch.
