
reg loss became Nan when it came to 2.6k iters #47

Closed

mxmxlwlw opened this issue May 15, 2017 · 17 comments

@mxmxlwlw

Hi,
It seems that the reg loss of the training process becomes NaN when it reaches about 2.6k iterations.
Also, how can I use the network to test my own images?
Best wishes!

@kevinkit

Hello @mxmxlwlw ,

There are a lot of open issues at the moment about training going to NaN or stopping after some iterations. Did you take a look at them and find that this is a completely new issue?

See #42 and #24.

Also, did you already get a snapshot of the trained weights, or does the computation stop before that point?

@amirbar
Collaborator

amirbar commented May 17, 2017

Hi,

I think I'm getting the same behaviour. I have an overflow in bbox_transform.py; right after the overflow, the reg loss jumps until it becomes NaN. I came up with a fix which seems to work. Can you please take a look and tell me whether you see the same behaviour?

If yes, I will propose a PR.

iter 267: image-id:0123208, time:0.817(sec), regular_loss: 0.214897, total-loss 1.0351(0.0118, 0.3499, 0.001303, 0.0411, 0.6309), instances: 1, batch:(20|104, 2|66, 2|2)
[ 1 640 853 3]
iter 268: image-id:0477321, time:0.727(sec), regular_loss: 0.215178, total-loss 135.3920(0.5606, 81.0835, 0.000000, 53.7479, 0.0000), instances: 2, batch:(20|96, 0|64, 0|0)
[ 1 640 853 3]
/home/amir/Deployment/FastMaskRCNN-fork/train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in exp
pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/amir/Deployment/FastMaskRCNN-fork/train/../libs/boxes/bbox_transform.py:61: RuntimeWarning: overflow encountered in multiply
pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/amir/Deployment/FastMaskRCNN-fork/train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/amir/Deployment/FastMaskRCNN-fork/train/../libs/boxes/bbox_transform.py:62: RuntimeWarning: overflow encountered in multiply
pred_h = np.exp(dh) * heights[:, np.newaxis]
iter 269: image-id:0215226, time:0.715(sec), regular_loss: 0.216160, total-loss 183.3571(5.7902, 177.5669, 0.000000, 0.0000, 0.0000), instances: 2, batch:(2|34, 0|64, 0|0)
[ 1 640 1137 3]
iter 270: image-id:0477310, time:2.707(sec), regular_loss: 0.224796, total-loss 1331875328.0000(38611.8828, 1331836672.0000, 0.000000, 0.0000, 0.0000), instances: 2, batch:(2|34, 0|64, 0|0)
[ 1 640 963 3]
iter 271: image-id:0057707, time:0.770(sec), regular_loss: 486502989824.000000, total-loss nan(0.0088, 0.4111, nan, nan, 0.6301), instances: 1, batch:(15|84, 1|65, 1|1)
[[ 225.38453674 596.21038818 397.67379761 897.72412109 76. ]]
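
For context, the overflow comes from np.exp() on unbounded regression targets; a common guard (used in py-faster-rcnn) is to clamp dw and dh before exponentiating. Below is a minimal sketch of that kind of fix, assuming the py-faster-rcnn clip constant rather than this repo's exact change:

```python
import numpy as np

# Clip constant borrowed from py-faster-rcnn (an assumption here, not
# necessarily the value used by this repo's eventual fix).
BBOX_XFORM_CLIP = np.log(1000.0 / 16.0)

def decode_wh(dw, dh, widths, heights):
    """Decode log-scale box deltas into widths/heights without exp overflow."""
    dw = np.minimum(dw, BBOX_XFORM_CLIP)  # clamp before np.exp
    dh = np.minimum(dh, BBOX_XFORM_CLIP)
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]
    return pred_w, pred_h
```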

@sheldon606

I came across the same problem too.

@amirbar
Collaborator

amirbar commented May 21, 2017

@CharlesShang can you please review/comment?

@mxmxlwlw
Author

@amirbar Hi, problem solved! But how can I use the network to test my own images and get the rects and masks?

@blitu12345

@mxmxlwlw Please share your solution: how did you modify your code to stop the reg loss from becoming NaN? Thanks!

@mxmxlwlw
Author

mxmxlwlw commented Jun 4, 2017

@blitu12345 Hi, they already changed the code on GitHub; just download it and it will normally be OK. If you still hit the problem occasionally, just lower the learning rate. That may work.
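
If you take the lower-learning-rate route, the change is just passing a smaller value to the optimizer (optionally with gradient clipping as an extra safeguard). A minimal TensorFlow 1.x sketch with assumed names and values, not this repo's actual training setup:

```python
import tensorflow as tf

# Toy stand-in for the training loss; the repo builds its own total loss.
w = tf.Variable(1.0)
total_loss = tf.square(w)

learning_rate = 1e-4  # e.g. 10x smaller than before (value is an assumption)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)

# Clip gradients by norm before applying them, to blunt sudden loss spikes.
grads_and_vars = optimizer.compute_gradients(total_loss)
clipped = [(tf.clip_by_norm(g, 10.0), v) for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)
```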

@mxmxlwlw
Author

mxmxlwlw commented Jun 4, 2017

@blitu12345 And there may still be some bugs in the training code.

@blitu12345

blitu12345 commented Jun 4, 2017 via email

@mxmxlwlw
Author

mxmxlwlw commented Jun 5, 2017

@blitu12345 Yeah, their commit message was "Change computation for numerical stability". However, there may still be some bugs... And I'm really looking forward to them providing some sample code for testing the network. Just one image would be fine.

@blitu12345

blitu12345 commented Jun 5, 2017

@mxmxlwlw Have you trained your model? I'm just at 120k iterations and it's already been more than 24 hours; it seems like it's going to take a long time to train. How long did your model take to train? Does the source code store and save the trained model at successive intervals? Thanks!!

@amirbar
Collaborator

amirbar commented Jun 5, 2017

@mxmxlwlw I wrote some short code for bounding box visualization that I can PR.
There are still bugs. I'm currently testing only the RPN component, and it seems to work with a few code fixes and a hyperparameter search. I will try to open a PR today.

The repository still seems far from reproducing the original work.
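
For anyone who wants to eyeball detections before that PR lands, here is a minimal sketch of this kind of bounding-box visualization, assuming detections come as [x1, y1, x2, y2, score] rows (not the repo's or amirbar's actual code):

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def draw_boxes(image, boxes, score_thresh=0.5):
    """Draw [x1, y1, x2, y2, score] boxes on an HxWx3 image (assumed format)."""
    fig, ax = plt.subplots(1)
    ax.imshow(image)
    for x1, y1, x2, y2, score in boxes:
        if score < score_thresh:
            continue  # skip low-confidence detections
        ax.add_patch(patches.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                       linewidth=1, edgecolor='r', facecolor='none'))
        ax.text(x1, y1, '{:.2f}'.format(score), color='r')
    plt.show()
```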

@mxmxlwlw
Author

mxmxlwlw commented Jun 6, 2017

@blitu12345 I just used the original code for training. And yes, it took a long time to train.

@mxmxlwlw
Author

mxmxlwlw commented Jun 6, 2017

@amirbar Wow, thank you for sharing! You've helped me a lot.

@amirbar
Collaborator

amirbar commented Jun 6, 2017

@mxmxlwlw According to the experiments I performed, training will not get you anywhere unless you merge #50, which at least gets the RPN component working.

Anyway, since this issue is resolved, can you please close it? There are too many issues to track as it is :)

@mxmxlwlw
Author

mxmxlwlw commented Jun 6, 2017

Ok.

mxmxlwlw closed this as completed Jun 6, 2017
@meetps

meetps commented Feb 19, 2018

Check my comment here
