
NaN: Dear author, thanks for your great work. I am trying to run your code, but it always reports a NaN error; the error traceback is below. Can you have a look? Thanks in advance! #15

Open
MrCrightH opened this issue Oct 17, 2022 · 7 comments

@MrCrightH

proposals = self.predict_proposals(

File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rpn.py", line 523, in predict_proposals
return find_top_rpn_proposals(
File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/proposal_utils.py", line 103, in find_top_rpn_proposals
raise FloatingPointError(
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.
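
For context, this error is raised by a sanity check inside detectron2's find_top_rpn_proposals, which validates every proposal box and objectness score before NMS. Below is a minimal sketch that mimics (not quotes) what that check does; the exact detectron2 code varies by version:

```python
import torch

def check_finite(boxes: torch.Tensor, scores: torch.Tensor) -> None:
    """Mimic the detectron2 check that fires here: every box
    coordinate and every objectness score must be finite."""
    valid = torch.isfinite(boxes).all(dim=1) & torch.isfinite(scores)
    if not valid.all():
        raise FloatingPointError(
            "Predicted boxes or scores contain Inf/NaN. Training has diverged."
        )

# A single NaN score is enough to trip the check:
boxes = torch.rand(4, 4)                              # (N, 4) proposal boxes
scores = torch.tensor([0.9, float("nan"), 0.7, 0.5])  # objectness scores
check_finite(boxes, scores)                           # raises FloatingPointError
```

So the error means some upstream computation (RPN box deltas or logits) produced a non-finite value; the check itself is just the messenger.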

@GuangxingHan
Owner

Thanks for the feedback. Can you clarify which script you are using and which step you are on?

@MrCrightH
Author

meta_training_coco_resnet101_stage_2.yaml. When I run this step, it only completes a few iterations before reporting the error that predicted boxes or scores contain Inf/NaN.

@GuangxingHan
Owner

Are you trying to reproduce our experiments on the COCO dataset? This is weird. Can you show me the full training log so I can better understand what happened during your training?

@MrCrightH
Author

Yes! I have finished the meta_training_coco stage_1 of the meta_training_coco_multi..., but when I run meta_training_coco_resnet101_stage_2, it shows the following:

[10/17 15:21:08] d2.data.datasets.coco INFO: Loading datasets/coco/new_annotations/final_split_non_voc_instances_train2014.json takes 3.08 seconds.
[10/17 15:21:09] d2.data.datasets.coco INFO: Loaded 117264 images in COCO format from datasets/coco/new_annotations/final_split_non_voc_instances_train2014.json
[10/17 15:21:11] d2.data.build INFO: Removed 54575 images with no usable annotations. 62689 images left.
[10/17 15:21:17] d2.data.build INFO: Removed 0 images with no usable annotations. 94195 images left.
[10/17 15:21:21] d2.data.build INFO: Distribution of instances among all 80 categories:
| category | #instances | category | #instances | category | #instances |
|:-------------:|:-------------|:------------:|:-------------|:-------------:|:-------------|
| person | 0 | bicycle | 0 | car | 0 |
| motorcycle | 0 | airplane | 0 | bus | 0 |
| train | 0 | truck | 4479 | boat | 0 |
| traffic light | 1522 | fire hydrant | 1172 | stop sign | 1119 |
| parking meter | 514 | bench | 3997 | bird | 0 |
| cat | 0 | dog | 0 | horse | 0 |
| sheep | 0 | cow | 0 | elephant | 3983 |
| bear | 1252 | zebra | 4064 | giraffe | 4530 |
| backpack | 2628 | umbrella | 3678 | handbag | 3022 |
| tie | 2625 | suitcase | 3054 | frisbee | 1352 |
| skis | 2567 | snowboard | 1282 | sports ball | 846 |
| kite | 1346 | baseball bat | 1015 | baseball gl.. | 759 |
| skateboard | 2906 | surfboard | 3389 | tennis racket | 1713 |
| bottle | 0 | wine glass | 2068 | cup | 6238 |
| fork | 2498 | knife | 2765 | spoon | 2040 |
| bowl | 5265 | banana | 3755 | apple | 2271 |
| sandwich | 2962 | orange | 2850 | broccoli | 4319 |
| carrot | 2959 | hot dog | 1794 | pizza | 3848 |
| donut | 4166 | cake | 3448 | chair | 0 |
| couch | 0 | potted plant | 0 | bed | 2985 |
| dining table | 0 | toilet | 3477 | tv | 0 |
| laptop | 2307 | mouse | 927 | remote | 1492 |
| keyboard | 1362 | cell phone | 2625 | microwave | 654 |
| oven | 1332 | toaster | 71 | sink | 2869 |
| refrigerator | 1283 | book | 3851 | clock | 2682 |
| vase | 2874 | scissors | 953 | teddy bear | 3334 |
| hair drier | 116 | toothbrush | 776 | | |
| total | 148030 | | | | |
[10/17 15:21:21] d2.data.common INFO: Serializing 94195 elements to byte tensors and concatenating them all ...
[10/17 15:21:22] d2.data.common INFO: Serialized dataset takes 50.18 MiB
[10/17 15:21:22] meta_faster_rcnn.data.build INFO: Using training sampler TrainingSampler
[10/17 15:21:22] fvcore.common.checkpoint INFO: [Checkpointer] Loading from ./output/fsod/meta_training_coco_resnet101_stage_1/model_final.pth ...
[10/17 15:21:23] fvcore.common.checkpoint WARNING: Some model parameters or buffers are not found in the checkpoint:
proposal_generator.rpn_head.anchor_deltas_cat.{bias, weight}
proposal_generator.rpn_head.anchor_deltas_diff.{bias, weight}
proposal_generator.rpn_head.cat_fc.0.{bias, weight}
proposal_generator.rpn_head.diff_fc.0.{bias, weight}
proposal_generator.rpn_head.objectness_logits_cat.{bias, weight}
proposal_generator.rpn_head.objectness_logits_diff.{bias, weight}
roi_heads.box_predictor.bbox_pred_cor.{bias, weight}
roi_heads.box_predictor.bbox_pred_fc.{bias, weight}
roi_heads.box_predictor.bbox_pred_gd.{bias, weight}
roi_heads.box_predictor.cls_score_gd.{bias, weight}
roi_heads.box_predictor.conv_1_gd.weight
roi_heads.box_predictor.conv_2_gd.weight
roi_heads.box_predictor.conv_3_gd.weight
roi_heads.box_predictor.norm.{bias, weight}
[10/17 15:21:23] d2.engine.train_loop INFO: Starting training from iteration 0
[10/17 15:22:04] d2.utils.events INFO: eta: 10:42:19 iter: 19 total_loss: 2.142 loss_cls: 1.69 loss_box_reg: 0.3216 loss_rpn_cls: 0.02834 loss_rpn_loc: 0.01064 time: 1.9260 data_time: 0.1603 lr: 0.0001342 max_mem: 19052M
[10/17 15:22:43] d2.utils.events INFO: eta: 10:42:48 iter: 39 total_loss: 1.361 loss_cls: 1.111 loss_box_reg: 0.261 loss_rpn_cls: 0.02525 loss_rpn_loc: 0.009109 time: 1.9303 data_time: 0.0777 lr: 0.0001702 max_mem: 19055M
[10/17 15:23:22] d2.utils.events INFO: eta: 10:44:28 iter: 59 total_loss: 1.251 loss_cls: 0.8241 loss_box_reg: 0.3086 loss_rpn_cls: 0.02666 loss_rpn_loc: 0.01125 time: 1.9439 data_time: 0.0875 lr: 0.0002062 max_mem: 19055M
[10/17 15:24:02] d2.utils.events INFO: eta: 10:46:26 iter: 79 total_loss: 1.308 loss_cls: 0.8696 loss_box_reg: 0.2948 loss_rpn_cls: 0.03027 loss_rpn_loc: 0.01091 time: 1.9531 data_time: 0.0828 lr: 0.0002422 max_mem: 19056M
[10/17 15:24:42] d2.utils.events INFO: eta: 10:50:33 iter: 99 total_loss: 0.9825 loss_cls: 0.605 loss_box_reg: 0.3051 loss_rpn_cls: 0.02744 loss_rpn_loc: 0.009897 time: 1.9619 data_time: 0.0792 lr: 0.0002782 max_mem: 19056M
[10/17 15:25:22] d2.utils.events INFO: eta: 10:51:56 iter: 119 total_loss: 1.068 loss_cls: 0.637 loss_box_reg: 0.3279 loss_rpn_cls: 0.02772 loss_rpn_loc: 0.008074 time: 1.9676 data_time: 0.0862 lr: 0.0003142 max_mem: 19056M
[10/17 15:26:01] d2.utils.events INFO: eta: 10:52:43 iter: 139 total_loss: 0.9612 loss_cls: 0.5965 loss_box_reg: 0.3174 loss_rpn_cls: 0.02713 loss_rpn_loc: 0.009668 time: 1.9712 data_time: 0.0804 lr: 0.0003502 max_mem: 19056M
[10/17 15:26:41] d2.utils.events INFO: eta: 10:52:03 iter: 159 total_loss: 0.9253 loss_cls: 0.5459 loss_box_reg: 0.33 loss_rpn_cls: 0.02211 loss_rpn_loc: 0.009601 time: 1.9721 data_time: 0.0765 lr: 0.0003862 max_mem: 19056M
[10/17 15:27:21] d2.utils.events INFO: eta: 10:52:04 iter: 179 total_loss: 0.8336 loss_cls: 0.4791 loss_box_reg: 0.3217 loss_rpn_cls: 0.02815 loss_rpn_loc: 0.01103 time: 1.9759 data_time: 0.0812 lr: 0.0004222 max_mem: 19056M
[10/17 15:28:01] d2.utils.events INFO: eta: 10:51:21 iter: 199 total_loss: 0.9207 loss_cls: 0.5247 loss_box_reg: 0.3523 loss_rpn_cls: 0.02278 loss_rpn_loc: 0.008193 time: 1.9766 data_time: 0.0805 lr: 0.0004582 max_mem: 19056M
[10/17 15:28:41] d2.utils.events INFO: eta: 10:51:00 iter: 219 total_loss: 0.8944 loss_cls: 0.501 loss_box_reg: 0.274 loss_rpn_cls: 0.02453 loss_rpn_loc: 0.01002 time: 1.9773 data_time: 0.0839 lr: 0.0004942 max_mem: 19058M
[10/17 15:29:20] d2.utils.events INFO: eta: 10:50:59 iter: 239 total_loss: 0.953 loss_cls: 0.646 loss_box_reg: 0.3021 loss_rpn_cls: 0.02511 loss_rpn_loc: 0.00783 time: 1.9780 data_time: 0.0844 lr: 0.0005302 max_mem: 19058M
[10/17 15:30:00] d2.utils.events INFO: eta: 10:51:08 iter: 259 total_loss: 0.9736 loss_cls: 0.6806 loss_box_reg: 0.2666 loss_rpn_cls: 0.02316 loss_rpn_loc: 0.00909 time: 1.9785 data_time: 0.0843 lr: 0.0005662 max_mem: 19058M
[10/17 15:30:40] d2.utils.events INFO: eta: 10:50:52 iter: 279 total_loss: 1.277 loss_cls: 1.035 loss_box_reg: 0.2421 loss_rpn_cls: 0.04444 loss_rpn_loc: 0.007862 time: 1.9794 data_time: 0.0851 lr: 0.0006022 max_mem: 19058M
[10/17 15:31:20] d2.utils.events INFO: eta: 10:50:21 iter: 299 total_loss: 1.427 loss_cls: 1.128 loss_box_reg: 0.2823 loss_rpn_cls: 0.02622 loss_rpn_loc: 0.008311 time: 1.9800 data_time: 0.0825 lr: 0.0006382 max_mem: 19058M
[10/17 15:31:59] d2.utils.events INFO: eta: 10:49:41 iter: 319 total_loss: 2.897 loss_cls: 2.585 loss_box_reg: 0.2703 loss_rpn_cls: 0.03668 loss_rpn_loc: 0.008904 time: 1.9800 data_time: 0.0794 lr: 0.0006742 max_mem: 19059M
[10/17 15:32:21] d2.engine.train_loop ERROR: Exception during training:
Traceback (most recent call last):
File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
self._trainer.run_step()
File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 273, in run_step
loss_dict = self.model(data)
File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rcnn.py", line 206, in forward
pos_proposals, pos_anchors, pos_pred_objectness_logits, pos_gt_labels, pos_pred_anchor_deltas, pos_gt_boxes = self.proposal_generator(query_images, pos_features, pos_support_features_pool, query_gt_instances) # attention rpn
File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rpn.py", line 490, in forward
proposals = self.predict_proposals(
File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/1Tm2/CH/Meta-Faster-R-CNN/meta_faster_rcnn/modeling/fsod/fsod_rpn.py", line 523, in predict_proposals
return find_top_rpn_proposals(
File "/home/sjk/anaconda3/envs/chpy/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/proposal_utils.py", line 103, in find_top_rpn_proposals
raise FloatingPointError(
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.
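
Note how loss_cls in the log climbs from roughly 0.5 around iter 219 to 2.585 at iter 319 just before the crash: a classic divergence pattern rather than a single bad batch. One hedged way to localize the first non-finite value is PyTorch's anomaly mode plus an explicit finiteness check on the loss dict; the helper below is illustrative, not part of this repo:

```python
import torch

# Makes the backward pass raise at the op that first produced NaN/Inf,
# at the cost of noticeably slower training. Enable only while debugging.
torch.autograd.set_detect_anomaly(True)

def assert_losses_finite(loss_dict: dict) -> None:
    """Hypothetical helper: fail fast and name the offending loss term."""
    for name, value in loss_dict.items():
        if not torch.isfinite(value).all():
            raise FloatingPointError(f"Loss '{name}' is not finite: {value}")
```

Calling assert_losses_finite(loss_dict) right after loss_dict = self.model(data) in run_step would pinpoint which loss term blows up first.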

@GuangxingHan
Owner

May I know the model hyperparameter configs, e.g., the batch size? Did you change the default values?

@MrCrightH
Author

Due to my limited GPU memory, I had to change the batch size to 4 and the learning rate to 0.0005. Can you please tell me how to adjust it?

############################################
meta_training_coco_resnet101_stage_1.yaml:
_BASE_: "Base-FSOD-C4.yaml"
MODEL:
  WEIGHTS: "/home/1Tm2/CH/Meta-Faster-R-CNN/R-101.pkl"
  MASK_ON: False
  RESNETS:
    DEPTH: 101
  BACKBONE:
    FREEZE_AT: 2
  ROI_HEADS:
    SCORE_THRESH_TEST: 0.0
  RPN:
    PRE_NMS_TOPK_TEST: 12000
    POST_NMS_TOPK_TEST: 100
  FEWX_BASELINE: True
  WITH_ALIGNMENT: False
OUTPUT_DIR: './output/fsod/meta_training_coco_resnet101_stage_1'
DATASETS:
  TRAIN: ("coco_2014_train_nonvoc",)
  TEST: ("coco_2014_val",)
  TEST_SHOTS: (1,2,3,5,10,30)
INPUT:
  FS:
    SUPPORT_WAY: 2
    SUPPORT_SHOT: 30
  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 600
  MAX_SIZE_TEST: 1000
SOLVER:
  IMS_PER_BATCH: 4 #8
  BASE_LR: 0.0005 #0.001
  STEPS: (30000, 40000)
  MAX_ITER: 40001
  WARMUP_ITERS: 1000
  WARMUP_FACTOR: 0.1
  CHECKPOINT_PERIOD: 10000
  HEAD_LR_FACTOR: 2.0
#TEST:
#  EVAL_PERIOD: 40000

##########################################
meta_training_coco_resnet101_stage_2.yaml:
_BASE_: "Base-FSOD-C4.yaml"
MODEL:
  WEIGHTS: "./output/fsod/meta_training_coco_resnet101_stage_1/model_final.pth"
  MASK_ON: False
  RESNETS:
    DEPTH: 101
  BACKBONE:
    FREEZE_AT: 2
  ROI_HEADS:
    SCORE_THRESH_TEST: 0.0
  RPN:
    PRE_NMS_TOPK_TEST: 12000
    POST_NMS_TOPK_TEST: 100
  FEWX_BASELINE: False
  WITH_ALIGNMENT: False
OUTPUT_DIR: './output/fsod/meta_training_coco_resnet101_stage_2'
DATASETS:
  TRAIN: ("coco_2014_train_nonvoc",)
  TEST: ("coco_2014_val",)
  TEST_SHOTS: (1,2,3,5,10,30)
INPUT:
  FS:
    SUPPORT_WAY: 2
    SUPPORT_SHOT: 30
  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 600
  MAX_SIZE_TEST: 1000
SOLVER:
  IMS_PER_BATCH: 4 #8
  BASE_LR: 0.001
  STEPS: (15000, 20000)
  MAX_ITER: 20001
  WARMUP_ITERS: 500
  WARMUP_FACTOR: 0.1
  CHECKPOINT_PERIOD: 20001
  HEAD_LR_FACTOR: 2.0
TEST:
  EVAL_PERIOD: 10000

@GuangxingHan
Owner

Unfortunately, our model works best with batch_size >= 8 in the second step. Using a smaller batch_size leads to unstable training. You can try decreasing BASE_LR, increasing WARMUP_ITERS, reducing SUPPORT_SHOT, or other ways to compensate for the small batch_size, but the detection accuracy may not be guaranteed.
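
For anyone landing here with the same constraint: a common rule of thumb (the linear scaling rule, not something specific to this repo) ties the learning rate to the batch size, which matches the advice above. A minimal sketch with illustrative numbers only:

```python
# Linear scaling rule: lr_new = lr_ref * (batch_new / batch_ref).
# Reference values are the stage-2 defaults visible in the config above
# (IMS_PER_BATCH 8, BASE_LR 0.001); the result is a starting point,
# not a validated setting.
ref_batch, ref_lr = 8, 0.001
new_batch = 4                      # what fits in limited GPU memory
scaled_lr = ref_lr * new_batch / ref_batch
print(scaled_lr)                   # 0.0005
```

Combined with a longer warmup (raising WARMUP_ITERS) and a smaller INPUT.FS.SUPPORT_SHOT, this is one way to trade accuracy for stability at batch size 4, per the maintainer's suggestion.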
