Reproduction of oadp_ov_coco.py #15

Open
Lukas-Ma1 opened this issue Nov 29, 2023 · 1 comment

Lukas-Ma1 commented Nov 29, 2023

Thank you for the outstanding work. I ran into some problems while trying to reproduce the COCO training. First, I evaluated your released checkpoint and got the same result, 31.3 mAP, which confirms that the dataset and Python environment are set up correctly.

Then I trained ViLD first with torchrun --nproc_per_node=2 -m oadp.dp.train vild_ov_coco configs/dp/vild_ov_coco.py, and afterwards trained OADP on COCO with torchrun --nproc_per_node=2 -m oadp.dp.train oadp_ov_coco configs/dp/oadp_ov_coco.py. However, the checkpoint produced by my own training does not reach the expected result. Here is my full result:

{'COCO_17_bbox_mAP_': '0.1495',
'COCO_17_bbox_mAP_50': '0.2830',
'COCO_17_bbox_mAP_75': '0.1398',
'COCO_17_bbox_mAP_copypaste': '0.1495 0.2830 0.1398 0.1060 0.1788 0.1816',
'COCO_17_bbox_mAP_l': '0.1816',
'COCO_17_bbox_mAP_m': '0.1788',
'COCO_17_bbox_mAP_s': '0.1060',
'COCO_48_17_bbox_mAP_': '0.2673',
'COCO_48_17_bbox_mAP_50': '0.4436',
'COCO_48_17_bbox_mAP_75': '0.2798',
'COCO_48_17_bbox_mAP_copypaste': '0.2673 0.4436 0.2798 0.1750 0.2916 0.3488',
'COCO_48_17_bbox_mAP_l': '0.3488',
'COCO_48_17_bbox_mAP_m': '0.2916',
'COCO_48_17_bbox_mAP_s': '0.1750',
'COCO_48_bbox_mAP_': '0.3090',
'COCO_48_bbox_mAP_50': '0.5005',
'COCO_48_bbox_mAP_75': '0.3293',
'COCO_48_bbox_mAP_copypaste': '0.3090 0.5005 0.3293 0.1994 0.3316 0.4080',
'COCO_48_bbox_mAP_l': '0.4080',
'COCO_48_bbox_mAP_m': '0.3316',
'COCO_48_bbox_mAP_s': '0.1994'}

By the way, I noticed some abnormal values in the training log: the COCO_17_bbox mAP is -1! Here is an excerpt of the output around iteration 26000/40000:

2023-11-29 19:26:42,471 - mmdet - INFO - Iter(val) [2500] COCO_48_17_bbox_mAP_: 0.1982, COCO_48_17_bbox_mAP_50: 0.3539, COCO_48_17_bbox_mAP_75: 0.1999, COCO_48_17_bbox_mAP_s: 0.1101, COCO_48_17_bbox_mAP_m: 0.2075, COCO_48_17_bbox_mAP_l: 0.2655, COCO_48_17_bbox_mAP_copypaste: 0.1982 0.3539 0.1999 0.1101 0.2075 0.2655, COCO_48_bbox_mAP_: 0.1982, COCO_48_bbox_mAP_50: 0.3539, COCO_48_bbox_mAP_75: 0.1999, COCO_48_bbox_mAP_s: 0.1101, COCO_48_bbox_mAP_m: 0.2075, COCO_48_bbox_mAP_l: 0.2655, COCO_48_bbox_mAP_copypaste: 0.1982 0.3539 0.1999 0.1101 0.2075 0.2655, COCO_17_bbox_mAP_: -1.0000, COCO_17_bbox_mAP_50: -1.0000, COCO_17_bbox_mAP_75: -1.0000, COCO_17_bbox_mAP_s: -1.0000, COCO_17_bbox_mAP_m: -1.0000, COCO_17_bbox_mAP_l: -1.0000, COCO_17_bbox_mAP_copypaste: -1.0000 -1.0000 -1.0000 -1.0000 -1.0000 -1.0000

Also, when I add --override to the command, e.g. torchrun --nproc_per_node=2 -m oadp.dp.train vild_ov_coco configs/dp/vild_ov_coco.py --override .validator.dataloader.dataset.ann_file::data/coco/annotations/instances_val2017.48.json, the checkpoint no longer produces usable results:
[Screenshot taken 2023-11-30 09:44:37]
Why does this happen?

It seems that some part of my experiment is wrong. How can I fix it, and could you tell me how to use the training commands correctly? Much appreciated!

@LutingWang (Owner)

The training scripts you used are correct:

torchrun --nproc_per_node=2 -m oadp.dp.train vild_ov_coco configs/dp/vild_ov_coco.py
torchrun --nproc_per_node=2 -m oadp.dp.train oadp_ov_coco configs/dp/oadp_ov_coco.py

However, I noticed that you are using 2 GPUs for training, while the original checkpoint was trained with 8 GPUs. With the same per-GPU batch size, 2 GPUs process only a quarter of the data compared to the original setup. To address this, there are a few potential solutions (a configuration sketch follows the list):

  1. Increase the batch size for each GPU. Ideally, use a batch size of 8 per GPU.
  2. Use more GPUs for training.
  3. Increase the learning rate.
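
For reference, here is a minimal sketch of options 1 and 3, assuming the MMDetection 2.x config schema; the field names below (data.samples_per_gpu, optimizer.lr) and the derived batch sizes are assumptions that should be verified against configs/dp/vild_ov_coco.py:

    # Hypothetical local config, e.g. configs/dp/vild_ov_coco_2gpu.py.
    # Field names assume the standard MMDetection 2.x config schema;
    # verify them against configs/dp/vild_ov_coco.py before use.
    _base_ = ['vild_ov_coco.py']

    # With 2 GPUs, a per-GPU batch size of 8 keeps the effective batch size
    # at 16 (assuming the original 8-GPU setup used 2 images per GPU).
    data = dict(samples_per_gpu=8)

    # If GPU memory cannot fit a larger batch, keep the original batch size
    # and scale the learning rate linearly instead, e.g.
    # optimizer = dict(lr=<base lr> * <your effective batch size> / 16)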

Regarding the second problem: mAP = -1 typically indicates that there are no ground-truth objects for that split. For example, COCO_17_bbox_mAP_: -1.0000 means that there are no novel-category objects in the annotation file used for validation. Please verify whether this is the case and, if so, regenerate the annotation files. If the problem persists, please provide more details so that I can investigate further. A quick check is sketched below.
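
As a sanity check, here is a minimal sketch using pycocotools, assuming the annotation file follows the standard COCO JSON format; the path below is only an example and should be replaced with the ann_file your validator actually uses:

    from pycocotools.coco import COCO

    # Example path only; point this at the validation annotation file
    # configured in .validator.dataloader.dataset.ann_file.
    ann_file = 'data/coco/annotations/instances_val2017.json'
    coco = COCO(ann_file)
    for cat_id, cat in coco.cats.items():
        num_anns = len(coco.getAnnIds(catIds=[cat_id]))
        print(f"{cat['name']:>20s}: {num_anns} annotations")
    # If every novel (17-split) category reports 0 annotations, the -1 mAP
    # is expected and the annotation files should be regenerated.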

Lastly, the option --override .validator.dataloader.dataset.ann_file::data/coco/annotations/instances_val2017.48.json is intended to be used together with TRAIN_WITH_VAL_DATASET. When TRAIN_WITH_VAL_DATASET is set to True, the training dataset is replaced with the validation dataset; since the validation dataset contains novel-category objects, this can cause errors during training, and the recommended override switches the annotation file to the 48-category version to avoid them. In your case, TRAIN_WITH_VAL_DATASET is not set, so adding the override is likely to cause unexpected behavior. The error in your screenshot indicates that the model produced no predictions, which is probably not caused by the override itself; still, since the option was added incorrectly, I suggest ignoring this error for now and focusing on the previous two issues.
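
For completeness, and assuming TRAIN_WITH_VAL_DATASET is read from the environment by the training entry point (an assumption based on its name; please check the repository documentation), the intended pairing would look roughly like:

TRAIN_WITH_VAL_DATASET=True torchrun --nproc_per_node=2 -m oadp.dp.train vild_ov_coco configs/dp/vild_ov_coco.py --override .validator.dataloader.dataset.ann_file::data/coco/annotations/instances_val2017.48.json

Without TRAIN_WITH_VAL_DATASET, simply drop the --override option.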
