Why do I get a nan loss_bbox when I train and eval? #22

andongchen · 2017-04-11T09:24:22Z

When I train the model, loss_bbox becomes nan after iteration 20:
I0411 17:16:45.481537 19413 solver.cpp:240] Iteration 0, loss = 89.1661
I0411 17:16:45.481566 19413 solver.cpp:255] Train net output #0: det_accuracy = 0.25
I0411 17:16:45.481573 19413 solver.cpp:255] Train net output #1: det_loss = 0.693147 (* 1 = 0.693147 loss)
I0411 17:16:45.481577 19413 solver.cpp:255] Train net output #2: id_accuracy = 0
I0411 17:16:45.481581 19413 solver.cpp:255] Train net output #3: id_loss = 87.3365 (* 1 = 87.3365 loss)
I0411 17:16:45.481586 19413 solver.cpp:255] Train net output #4: loss_bbox = 0.646189 (* 1 = 0.646189 loss)
I0411 17:16:45.481590 19413 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.0912708 (* 1 = 0.0912708 loss)
I0411 17:16:45.481595 19413 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.693147 (* 1 = 0.693147 loss)
I0411 17:16:45.481600 19413 solver.cpp:640] Iteration 0, lr = 0.001

I0411 17:17:08.716578 19413 solver.cpp:240] Iteration 20, loss = nan
I0411 17:17:08.716603 19413 solver.cpp:255] Train net output #0: det_accuracy = 0.898438
I0411 17:17:08.716612 19413 solver.cpp:255] Train net output #1: det_loss = 0.628226 (* 1 = 0.628226 loss)
I0411 17:17:08.716616 19413 solver.cpp:255] Train net output #2: id_accuracy = 0
I0411 17:17:08.716621 19413 solver.cpp:255] Train net output #3: id_loss = 87.3365 (* 1 = 87.3365 loss)
I0411 17:17:08.716625 19413 solver.cpp:255] Train net output #4: loss_bbox = nan (* 1 = nan loss)
I0411 17:17:08.716629 19413 solver.cpp:255] Train net output #5: rpn_bbox_loss = 0.266092 (* 1 = 0.266092 loss)
I0411 17:17:08.716634 19413 solver.cpp:255] Train net output #6: rpn_cls_loss = 0.684319 (* 1 = 0.684319 loss)
I0411 17:17:08.716639 19413 solver.cpp:640] Iteration 20, lr = 0.001

When I run experiments/scripts/eval_test.sh resnet50 50000 resnet50, there are errors.
In /lib/datasets/psdb.py, line 150:
for gt, det in zip(gt_roidb, gallery_det):

det=[[ nan nan nan nan 0.1167134]
[ nan nan nan nan 0.1167134]
[ nan nan nan nan 0.1167134]
...,
[ nan nan nan nan 0.1167134]
[ nan nan nan nan 0.1167134]
[ nan nan nan nan 0.1167134]]
([], [])
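
A minimal sketch of how such rows could be screened before evaluation. It assumes each detection row is laid out as [x1, y1, x2, y2, score], which is only inferred from the printout above. The helper name `split_valid_detections` and the second sample row are hypothetical and are not part of psdb.py.

```python
import math


def split_valid_detections(dets):
    """Separate rows with finite box coordinates from rows containing nan/inf.

    Assumes each row is [x1, y1, x2, y2, score] (an assumption, see above).
    """
    valid, invalid = [], []
    for row in dets:
        box = row[:4]
        if all(math.isfinite(v) for v in box):
            valid.append(row)
        else:
            invalid.append(row)
    return valid, invalid


nan = float("nan")
dets = [
    [nan, nan, nan, nan, 0.1167134],   # row as printed in the report
    [10.0, 20.0, 50.0, 120.0, 0.9],    # made-up finite row for contrast
]
valid, invalid = split_valid_detections(dets)
print(len(valid), len(invalid))
```

If every row ends up in `invalid`, as in the printout above, the trained weights themselves most likely contain nan, so the problem should be traced back to training rather than patched in the evaluation code.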

Cysu · 2017-04-11T09:38:26Z

Did you modify the code? For the first training iteration, it should be something like

I1113 15:51:24.800622 32170 solver.cpp:240] Iteration 0, loss = 6.22973
I1113 15:51:24.800657 32170 solver.cpp:255]     Train net output #0: det_accuracy = 0.078125
I1113 15:51:24.800668 32170 solver.cpp:255]     Train net output #1: det_loss = 0.706399 (* 1 = 0.706399 loss)
I1113 15:51:24.800671 32170 solver.cpp:255]     Train net output #2: id_accuracy = 0
I1113 15:51:24.800676 32170 solver.cpp:255]     Train net output #3: id_loss = 9.26615 (* 1 = 9.26615 loss)
I1113 15:51:24.800681 32170 solver.cpp:255]     Train net output #4: loss_bbox = 1.04062e-05 (* 1 = 1.04062e-05 loss)
I1113 15:51:24.800685 32170 solver.cpp:255]     Train net output #5: rpn_bbox_loss = 0.188907 (* 1 = 0.188907 loss)
I1113 15:51:24.800689 32170 solver.cpp:255]     Train net output #6: rpn_cls_loss = 0.693245 (* 1 = 0.693245 loss)
I1113 15:51:24.800700 32170 solver.cpp:640] Iteration 0, lr = 0.001

andongchen · 2017-04-11T12:33:40Z

I have not modified the code! Should I modify the code?

Cysu · 2017-04-11T13:27:58Z

No. That won't be necessary. Directly running the training script should be fine. Could you please provide a full training log (by uploading to BaiduYun / GoogleDrive / Dropbox) so that I can analyze it further?

Also could you please evaluate our trained model by following the instructions in the README, to see if it works properly?

andongchen · 2017-04-11T14:09:06Z

Yes, I can evaluate with your trained model, and there is no error.
The training log is here: https://drive.google.com/file/d/0Bz7UoqmY26NkeWphcnZYckNKUU0/view?usp=sharing

Cysu · 2017-04-12T02:05:09Z

That's quite weird. Could you please

1. Remove this line of randomness
2. Run the training script with a specified random seed

experiments/scripts/train.sh 0 --set EXP_DIR resnet50 RNG_SEED 1

On my machine, this always gives the following loss for iteration 0:

I0412 10:00:41.251739 29112 solver.cpp:240] Iteration 0, loss = 11.4016
I0412 10:00:41.251796 29112 solver.cpp:255]     Train net output #0: det_accuracy = 0.804688
I0412 10:00:41.251809 29112 solver.cpp:255]     Train net output #1: det_loss = 0.681872 (* 1 = 0.681872 loss)
I0412 10:00:41.251818 29112 solver.cpp:255]     Train net output #2: id_accuracy = 0
I0412 10:00:41.251827 29112 solver.cpp:255]     Train net output #3: id_loss = 9.40343 (* 1 = 9.40343 loss)
I0412 10:00:41.251835 29112 solver.cpp:255]     Train net output #4: loss_bbox = 0.522466 (* 1 = 0.522466 loss)
I0412 10:00:41.251844 29112 solver.cpp:255]     Train net output #5: rpn_bbox_loss = 0.123584 (* 1 = 0.123584 loss)
I0412 10:00:41.251876 29112 solver.cpp:255]     Train net output #6: rpn_cls_loss = 0.693231 (* 1 = 0.693231 loss)
I0412 10:00:41.251895 29112 solver.cpp:640] Iteration 0, lr = 0.001

andongchen · 2017-04-12T03:59:54Z

Sorry, when I first ran the training script without any modification, there was one error!
experiments/scripts/train.sh 0 --set EXP_DIR resnet50

Normalizing targets
done
Traceback (most recent call last):
  File "tools/train_net.py", line 130, in <module>
    max_iters=args.max_iters)
  File "/home/cy/PycharmProjects/person_search-master/tools/../lib/fast_rcnn/train.py", line 121, in train_net
    pretrained_model=pretrained_model)
  File "/home/cy/PycharmProjects/person_search-master/tools/../lib/fast_rcnn/train.py", line 50, in __init__
    pb2.text_format.Merge(f.read(), self.solver_param)
AttributeError: 'module' object has no attribute 'text_format'

I googled it and solved it by adding import google.protobuf.text_format in /lib/fast_rcnn/train.py!
After that I got the nan loss error!
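
A minimal sketch of why the explicit import helps: `text_format` is a submodule of `google.protobuf`, and importing only the parent package may not load it, which would produce the AttributeError above. The message type `FileDescriptorProto` is just a stand-in that ships with protobuf. The real code merges into a Caffe solver parameter, which is not reproduced here. The helper name `parse_text_proto` is hypothetical.

```python
try:
    # Import the submodule explicitly instead of relying on the parent package.
    from google.protobuf import text_format
    from google.protobuf import descriptor_pb2
except ImportError:
    # protobuf is not installed; the demo at the bottom is skipped.
    text_format = None


def parse_text_proto(text):
    """Parse protobuf text format into a message via text_format.Merge."""
    # Stand-in message type (assumption); train.py uses a solver parameter.
    message = descriptor_pb2.FileDescriptorProto()
    text_format.Merge(text, message)
    return message


if text_format is not None:
    parsed = parse_text_proto('name: "demo.proto"')
    print(parsed.name)
```

This change only fixes the import error. It is unlikely to be related to the nan loss that appears afterwards.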

Now I have done steps 1 and 2 as you said, and I still get the nan loss:

I0412 11:58:24.734537 15281 solver.cpp:240] Iteration 0, loss = 45.384
I0412 11:58:24.734563 15281 solver.cpp:255]     Train net output #0: det_accuracy = 0.03125
I0412 11:58:24.734571 15281 solver.cpp:255]     Train net output #1: det_loss = 0.693147 (* 1 = 0.693147 loss)
I0412 11:58:24.734575 15281 solver.cpp:255]     Train net output #2: id_accuracy = -nan
I0412 11:58:24.734578 15281 solver.cpp:255]     Train net output #3: id_loss = 0 (* 1 = 0 loss)
I0412 11:58:24.734582 15281 solver.cpp:255]     Train net output #4: loss_bbox = 0.0592934 (* 1 = 0.0592934 loss)
I0412 11:58:24.734586 15281 solver.cpp:255]     Train net output #5: rpn_bbox_loss = 0.00123454 (* 1 = 0.00123454 loss)
I0412 11:58:24.734591 15281 solver.cpp:255]     Train net output #6: rpn_cls_loss = 0.693147 (* 1 = 0.693147 loss)
I0412 11:58:24.734596 15281 solver.cpp:640] Iteration 0, lr = 0.001
I0412 11:58:48.375877 15281 solver.cpp:240] Iteration 20, loss = nan
I0412 11:58:48.376101 15281 solver.cpp:255]     Train net output #0: det_accuracy = 0.929688
I0412 11:58:48.376142 15281 solver.cpp:255]     Train net output #1: det_loss = 0.620077 (* 1 = 0.620077 loss)
I0412 11:58:48.376157 15281 solver.cpp:255]     Train net output #2: id_accuracy = -nan
I0412 11:58:48.376165 15281 solver.cpp:255]     Train net output #3: id_loss = 0 (* 1 = 0 loss)
I0412 11:58:48.376170 15281 solver.cpp:255]     Train net output #4: loss_bbox = nan (* 1 = nan loss)
I0412 11:58:48.376176 15281 solver.cpp:255]     Train net output #5: rpn_bbox_loss = 0.185238 (* 1 = 0.185238 loss)
I0412 11:58:48.376183 15281 solver.cpp:255]     Train net output #6: rpn_cls_loss = 0.680902 (* 1 = 0.680902 loss)
I0412 11:58:48.376190 15281 solver.cpp:640] Iteration 20, lr = 0.001
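
To narrow down where the nan first appears, a log like the one above can be scanned automatically. This is a rough sketch that assumes the Caffe solver line format shown in this thread. The function name `find_nan_outputs` is hypothetical, and the sample text is a few lines copied from the log above.

```python
import math
import re

# Patterns follow the solver.cpp lines quoted in this thread.
ITER_RE = re.compile(r"Iteration (\d+), loss = (\S+)")
OUTPUT_RE = re.compile(r"Train net output #\d+: (\w+) = (\S+)")


def find_nan_outputs(log_text):
    """Map iteration number -> names of train net outputs whose value is nan."""
    nan_by_iter = {}
    current_iter = None
    for line in log_text.splitlines():
        match = ITER_RE.search(line)
        if match:
            current_iter = int(match.group(1))
            continue
        match = OUTPUT_RE.search(line)
        if match and current_iter is not None:
            name, value = match.groups()
            # float() accepts "nan" as well as "-nan".
            if math.isnan(float(value)):
                nan_by_iter.setdefault(current_iter, []).append(name)
    return nan_by_iter


sample = """\
I0412 11:58:24.734537 15281 solver.cpp:240] Iteration 0, loss = 45.384
I0412 11:58:24.734582 15281 solver.cpp:255]     Train net output #4: loss_bbox = 0.0592934 (* 1 = 0.0592934 loss)
I0412 11:58:48.375877 15281 solver.cpp:240] Iteration 20, loss = nan
I0412 11:58:48.376170 15281 solver.cpp:255]     Train net output #4: loss_bbox = nan (* 1 = nan loss)
"""
print(find_nan_outputs(sample))  # {20: ['loss_bbox']}
```

Since the solver here prints only every 20 iterations, the actual divergence may happen anywhere between two reported iterations. Printing more often would be needed to pin it down.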

andongchen · 2017-04-19T07:49:18Z

@Cysu First, thank you very much for your excellent work. This is not an issue, but I have a question: have you tried YOLO9000 for pedestrian detection? YOLO v2 is faster and more accurate than Faster R-CNN for object detection. In your current work, does the detection accuracy influence person_search's mAP?

Cysu · 2017-04-19T08:13:59Z

Thank you very much for the suggestion. I really appreciate recent advances in object detection, e.g., YOLO v2, FPN, etc., and would like to give it a try if I have some time in the future. But currently I may not have enough spare time for it, and YOLO v2 seems to be implemented only in darknet, which is not that popular, compared with caffe / tf / pytorch.

By the way, do you still suffer from nan loss? If not, how did you solve it?

andongchen · 2017-04-19T08:34:42Z

Now there is a TensorFlow version of YOLO: https://github.com/thtrieu/darkflow
I still suffer from the nan loss. I think it's a machine environment error, but I am not sure.

Cysu · 2017-04-19T09:27:50Z

Thank you very much for the link. I will check it out.

The nan problem is quite weird. Sorry, but currently I have no idea why it happens.

duanLH · 2017-07-27T06:19:24Z

@andongchen @Cysu When training, I got "id_accuracy = -nan". Is that normal?

Cysu · 2017-07-27T08:30:03Z

@duanLH, id_accuracy = -nan is possible, because there are cases where the proposals do not contain any ground truth person, especially at the beginning of training.
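
A small sketch of how an accuracy over zero counted samples turns into nan. It assumes the accuracy is computed only over proposals with a valid identity label and divides by their count. Whether the actual layer works exactly this way is an assumption, and both the function name and `ignore_label=-1` are hypothetical.

```python
import math


def accuracy_ignoring_unlabeled(predictions, labels, ignore_label=-1):
    """Fraction of correct predictions among samples with a valid label."""
    correct = 0
    counted = 0
    for pred, label in zip(predictions, labels):
        if label == ignore_label:
            continue  # proposal has no ground truth identity
        counted += 1
        correct += int(pred == label)
    if counted == 0:
        # Floating point 0/0 in C++ yields nan; Python would raise instead,
        # so the result is returned explicitly here.
        return float("nan")
    return correct / counted


# No proposal carries an identity label, so nothing is counted.
print(math.isnan(accuracy_ignoring_unlabeled([3, 5], [-1, -1])))  # True
```

Under this assumption, a nan accuracy only means "no labeled samples in this batch" and does not by itself indicate a broken loss.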

Cysu closed this as completed Sep 14, 2017
