Why do I get a nan loss_bbox when I train and evaluate? #22
Comments
Did you modify the code? For the first training iteration, it should be something like
I have not modified the code! Could I modify the code?
No. That won't be necessary. Directly running the training script should be fine. Could you please provide a full training log (by uploading it to BaiduYun / GoogleDrive / Dropbox) so that I can analyze it further? Also, could you please evaluate our trained model by following the instructions in the README, to see if it works properly?
Yes, I can evaluate with your trained model, and there is no error.
That's quite weird. Could you please
experiments/scripts/train.sh 0 --set EXP_DIR resnet50 RNG_SEED 1
On my machine, this gives the same loss at iteration 0, as follows
Sorry, when I first ran the training script without any modification, there was one error!
I then googled it and solved it by adding import google.protobuf.text_format in /lib/fast_rcnn/train.py. Now I have done steps 1 and 2 as you said, and I still get the nan loss.
@Cysu First, thank you very much for your excellent work. This is not an issue, but I have a question: have you tried YOLO9000 for pedestrian detection? YOLO v2 is faster and more precise than Faster R-CNN for object detection. In your current work, does the detection accuracy influence the person_search mAP?
Thank you very much for the suggestion. I really appreciate recent advances in object detection, e.g., YOLO v2, FPN, etc., and would like to give them a try if I have some time in the future. But currently I may not have enough spare time for it, and YOLO v2 seems to be implemented only in darknet, which is not that popular compared with caffe / tf / pytorch. By the way, do you still suffer from the nan loss? If not, how did you solve it?
There is now a TensorFlow version of YOLO: https://github.com/thtrieu/darkflow
Thank you very much for the link. I will check it out. The nan problem is quite weird. Sorry, but currently I have no idea why it happens.
@andongchen @Cysu When training, I got "id_accuracy = -nan". Is that normal?
@duanLH, id_accuracy = -nan is possible, because there are cases where the proposals do not contain any ground truth person, especially at the beginning of training.
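To illustrate the point above, here is a minimal sketch of how an accuracy averaged over labeled proposals can come out as nan. The function name and the ignore label of -1 are hypothetical choices for illustration, not the repository's actual accuracy layer. When no proposal carries a valid identity label, the count of valid samples is zero, and a floating-point 0/0 evaluates to nan.

```python
import math


def id_accuracy(predictions, labels, ignore_label=-1):
    """Fraction of correct predictions among proposals with a valid label.

    Hypothetical helper: ignore_label marks proposals that are not
    matched to any labeled ground truth person.
    """
    valid = [(p, l) for p, l in zip(predictions, labels) if l != ignore_label]
    correct = sum(1 for p, l in valid if p == l)
    count = len(valid)
    # In C++ floating point, 0.0 / 0.0 is nan. Python raises
    # ZeroDivisionError instead, so the nan is returned explicitly here.
    return correct / count if count else float("nan")


# No proposal has a valid label, so the accuracy is undefined (nan).
print(math.isnan(id_accuracy([3, 5], [-1, -1])))
```

A nan of this kind only reflects an empty average, so it is a different situation from a nan in a loss term, which propagates into the gradients.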
When I train the model, the loss_bbox is nan after iteration 20:
I0411 17:16:45.481537 19413 solver.cpp:240] Iteration 0, loss = 89.1661
I0411 17:16:45.481566 19413 solver.cpp:255] Train net output #0: det_accuracy = 0.25
I0411 17:16:45.481573 19413 solver.cpp:255] Train net output #1: det_loss = 0.693147 (* 1 = 0.693147 loss)
I0411 17:16:45.481577 19413 solver.cpp:255] Train net output #2: id_accuracy = 0
I0411 17:16:45.481581 19413 solver.cpp:255] Train net output #3: id_loss = 87.3365 (* 1 = 87.3365 loss)
I0411 17:16:45.481586 19413 solver.cpp:255] Train net output #4: loss_bbox = 0.646189 (* 1 = 0.646189 loss)
I0411 17:16:45.481590 19413 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.0912708 (* 1 = 0.0912708 loss)
I0411 17:16:45.481595 19413 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.693147 (* 1 = 0.693147 loss)
I0411 17:16:45.481600 19413 solver.cpp:640] Iteration 0, lr = 0.001
I0411 17:17:08.716578 19413 solver.cpp:240] Iteration 20, loss = nan
I0411 17:17:08.716603 19413 solver.cpp:255] Train net output #0: det_accuracy = 0.898438
I0411 17:17:08.716612 19413 solver.cpp:255] Train net output #1: det_loss = 0.628226 (* 1 = 0.628226 loss)
I0411 17:17:08.716616 19413 solver.cpp:255] Train net output #2: id_accuracy = 0
I0411 17:17:08.716621 19413 solver.cpp:255] Train net output #3: id_loss = 87.3365 (* 1 = 87.3365 loss)
I0411 17:17:08.716625 19413 solver.cpp:255] Train net output #4: loss_bbox = nan (* 1 = nan loss)
I0411 17:17:08.716629 19413 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.266092 (* 1 = 0.266092 loss)
I0411 17:17:08.716634 19413 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.684319 (* 1 = 0.684319 loss)
I0411 17:17:08.716639 19413 solver.cpp:640] Iteration 20, lr = 0.001
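For a longer log, a small script can locate the first iteration at which any output becomes nan. This is only a sketch that assumes the log line layout shown above; the function name first_nan_outputs is hypothetical, and the script only finds where the nan first appears, not why.

```python
import re

# Patterns matching the solver log lines shown above.
ITER_RE = re.compile(r"Iteration (\d+), loss = (\S+)")
OUTPUT_RE = re.compile(r"Train net output #\d+: (\w+) = (\S+)")


def first_nan_outputs(log_lines):
    """Return (iteration, names of nan outputs) for the first iteration
    that reports a nan output, or None if no output is nan."""
    current_iter = None
    nan_names = []
    for line in log_lines:
        header = ITER_RE.search(line)
        if header:
            if nan_names:
                # The previous iteration already contained a nan output.
                return current_iter, nan_names
            current_iter = int(header.group(1))
            continue
        output = OUTPUT_RE.search(line)
        if output and "nan" in output.group(2).lower():
            nan_names.append(output.group(1))
    return (current_iter, nan_names) if nan_names else None


sample = [
    "I0411 17:16:45.481537 19413 solver.cpp:240] Iteration 0, loss = 89.1661",
    "I0411 17:16:45.481586 19413 solver.cpp:255] Train net output #4: loss_bbox = 0.646189 (* 1 = 0.646189 loss)",
    "I0411 17:16:45.481600 19413 solver.cpp:640] Iteration 0, lr = 0.001",
    "I0411 17:17:08.716578 19413 solver.cpp:240] Iteration 20, loss = nan",
    "I0411 17:17:08.716625 19413 solver.cpp:255] Train net output #4: loss_bbox = nan (* 1 = nan loss)",
]
print(first_nan_outputs(sample))  # (20, ['loss_bbox'])
```

Note that this also flags "-nan" values such as id_accuracy = -nan, which may be harmless as discussed earlier in the thread.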
When I run experiments/scripts/eval_test.sh resnet50 50000 resnet50, there are errors.
/lib/datasets/psdb.py", line 150
for gt, det in zip(gt_roidb, gallery_det):
det=[[ nan nan nan nan 0.1167134]
[ nan nan nan nan 0.1167134]
[ nan nan nan nan 0.1167134]
...,
[ nan nan nan nan 0.1167134]
[ nan nan nan nan 0.1167134]
[ nan nan nan nan 0.1167134]]
([], [])
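Before running the evaluation, the detections can be checked for nan box coordinates. The sketch below assumes each detection row has the layout [x1, y1, x2, y2, score], which is inferred from the printout above; the helper name count_nan_boxes is hypothetical. It works on plain lists and should also work on rows of a NumPy array.

```python
import math


def count_nan_boxes(det):
    """Count detection rows whose first four values (assumed to be
    the box coordinates) contain at least one nan."""
    return sum(
        1 for row in det
        if any(math.isnan(float(v)) for v in row[:4])
    )


nan = float("nan")
det = [
    [nan, nan, nan, nan, 0.1167134],   # box regressed to nan
    [10.0, 20.0, 50.0, 120.0, 0.93],   # valid box
]
print(count_nan_boxes(det))  # 1
```

If every row is counted, as in the printout above, the saved model weights most likely already contain nan values from training, so the training run is the place to look rather than the evaluation script.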