subprocess.CalledProcessError during training #16

Closed
Logicino opened this issue Nov 27, 2021 · 9 comments

Comments

@Logicino

Hello, I created a new virtual environment following the project's instructions,
using CUDA 10.1 + PyTorch 1.5.0.
(The mmcv-full installation command in install.md needs updating to: pip install mmcv-full==1.2.7 -f https://download.openmmlab.com/mmcv/dist/cu101/torch1.5.0/index.html
Also, the configs directory given in README.md is wrong.)
I have only one GPU, so for training I changed the command to
./tools/dist_train.sh configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py 1
One of the errors is:
subprocess.CalledProcessError: Command '['/home/a/anaconda3/envs/open-mmlab/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
I searched for solutions to this kind of error; one suggestion is to add find_unused_parameters=True to DistributedDataParallel:
model = torch.nn.parallel.DistributedDataParallel(model,device_ids=[args.local_rank],output_device=args.local_rank, find_unused_parameters=True)
In which file should find_unused_parameters be set for this project? Thanks.
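
For reference, upstream MMDetection 2.x does not normally require editing the DistributedDataParallel call: `mmdet/apis/train.py` reads a top-level `find_unused_parameters` key from the config and forwards it to the DDP wrapper. Assuming this fork keeps that behaviour (not verified here), a one-line addition to the config file would be a sketch of the answer:

```python
# Top-level entry in the config file, e.g. configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py.
# Assumption: the fork's mmdet/apis/train.py reads this key via
# cfg.get('find_unused_parameters', False), as upstream MMDetection 2.x does.
find_unused_parameters = True
```

This flag only affects how DDP handles parameters that receive no gradient; it would not change an error raised while the model is still being built.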

My full error output is as follows:
(open-mmlab) a@a-System-Product-Name:~/LD$ ./tools/dist_train.sh configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py 1
2021-11-27 22:44:39,804 - mmdet - INFO - Environment info:

sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 2060 SUPER
CUDA_HOME: /usr/local/cuda-10.2
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.5.1
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.3
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.6.0a0+35d732a
OpenCV: 4.5.4
MMCV: 1.2.7
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMDetection: 2.10.0+9856a78

2021-11-27 22:44:39,967 - mmdet - INFO - Distributed training: True
2021-11-27 22:44:40,128 - mmdet - INFO - Config:
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]
data = dict(
samples_per_gpu=3,
workers_per_gpu=2,
train=dict(
type='CocoDataset',
ann_file='data/coco/annotations/instances_train2017.json',
img_prefix='data/coco/images/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]),
val=dict(
type='CocoDataset',
ann_file='data/coco/annotations/instances_val2017.json',
img_prefix='data/coco/images/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]),
test=dict(
type='CocoDataset',
ann_file='data/coco/annotations/instances_val2017.json',
img_prefix='data/coco/images/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]))
evaluation = dict(interval=1, metric='bbox')
optimizer = dict(type='SGD', lr=0.00375, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=500,
warmup_ratio=0.001,
step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
teacher_ckpt = 'https://download.openmmlab.com/mmdetection/v2.0/gfl/gfl_r101_fpn_mstrain_2x_coco/gfl_r101_fpn_mstrain_2x_coco_20200629_200126-dd12f847.pth'
model = dict(
type='KnowledgeDistillationSingleStageDetector',
pretrained='torchvision://resnet18',
teacher_config='configs/gfl/gfl_r101_fpn_mstrain_2x_coco.py',
teacher_ckpt=
'https://download.openmmlab.com/mmdetection/v2.0/gfl/gfl_r101_fpn_mstrain_2x_coco/gfl_r101_fpn_mstrain_2x_coco_20200629_200126-dd12f847.pth',
output_feature=True,
backbone=dict(
type='ResNet',
depth=18,
num_stages=4,
out_indices=(0, 1, 2, 3),
frozen_stages=1,
norm_cfg=dict(type='BN', requires_grad=True),
norm_eval=True,
style='pytorch'),
neck=dict(
type='FPN',
in_channels=[64, 128, 256, 512],
out_channels=256,
start_level=1,
add_extra_convs='on_output',
num_outs=5),
bbox_head=dict(
type='LDHead',
num_classes=80,
in_channels=256,
stacked_convs=4,
feat_channels=256,
anchor_generator=dict(
type='AnchorGenerator',
ratios=[1.0],
octave_base_scale=8,
scales_per_octave=1,
strides=[8, 16, 32, 64, 128]),
loss_cls=dict(
type='QualityFocalLoss',
use_sigmoid=True,
beta=2.0,
loss_weight=1.0),
loss_dfl=dict(type='DistributionFocalLoss', loss_weight=0.25),
loss_ld=dict(
type='KnowledgeDistillationKLDivLoss', loss_weight=0.25, T=10),
reg_max=16,
loss_bbox=dict(type='GIoULoss', loss_weight=2.0)),
train_cfg=dict(
assigner=dict(type='ATSSAssigner', topk=9),
allowed_border=-1,
pos_weight=-1,
debug=False),
test_cfg=dict(
nms_pre=1000,
min_bbox_size=0,
score_thr=0.05,
nms=dict(type='nms', iou_threshold=0.6),
max_per_img=100))
work_dir = './work_dirs/ld_r18_gflv1_r101_fpn_coco_1x'
gpu_ids = range(0, 1)

Traceback (most recent call last):
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/utils/registry.py", line 179, in build_from_cfg
return obj_cls(**args)
TypeError: __init__() missing 1 required positional argument: 'loss_im'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/utils/registry.py", line 179, in build_from_cfg
return obj_cls(**args)
File "/home/a/LD/mmdet/models/detectors/kd_one_stage.py", line 35, in __init__
pretrained)
File "/home/a/LD/mmdet/models/detectors/single_stage.py", line 30, in __init__
self.bbox_head = build_head(bbox_head)
File "/home/a/LD/mmdet/models/builder.py", line 59, in build_head
return build(cfg, HEADS)
File "/home/a/LD/mmdet/models/builder.py", line 34, in build
return build_from_cfg(cfg, registry, default_args)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/utils/registry.py", line 182, in build_from_cfg
raise type(e)(f'{obj_cls.__name__}: {e}')
TypeError: LDHead: __init__() missing 1 required positional argument: 'loss_im'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./tools/train.py", line 187, in <module>
main()
File "./tools/train.py", line 161, in main
test_cfg=cfg.get('test_cfg'))
File "/home/a/LD/mmdet/models/builder.py", line 77, in build_detector
return build(cfg, DETECTORS, dict(train_cfg=train_cfg, test_cfg=test_cfg))
File "/home/a/LD/mmdet/models/builder.py", line 34, in build
return build_from_cfg(cfg, registry, default_args)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/utils/registry.py", line 182, in build_from_cfg
raise type(e)(f'{obj_cls.__name__}: {e}')
TypeError: KnowledgeDistillationSingleStageDetector: LDHead: __init__() missing 1 required positional argument: 'loss_im'
Traceback (most recent call last):
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/a/anaconda3/envs/open-mmlab/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
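
The closing subprocess.CalledProcessError only says that the child training process exited with a non-zero status; the actual cause is the exception printed before it (here, the TypeError about 'loss_im'). The sketch below reproduces that relationship with the standard library alone. It is an analogy to what the launcher does, not its actual code, and `run_child` and `diagnose` are hypothetical helper names.

```python
import subprocess
import sys


def run_child(code: str) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child process, raising on non-zero exit."""
    return subprocess.run(
        [sys.executable, "-c", code],
        check=True,           # raise CalledProcessError on non-zero exit status
        capture_output=True,  # keep the child's stderr so it can be inspected
        text=True,
    )


def diagnose(code: str) -> str:
    """Return the last stderr line of a failing child, i.e. its real exception."""
    try:
        run_child(code)
    except subprocess.CalledProcessError as err:
        # err itself only carries the exit status; the cause is in the child's stderr.
        return err.stderr.strip().splitlines()[-1]
    return "child exited cleanly"


if __name__ == "__main__":
    # Hypothetical child that fails with a TypeError, like the build step above.
    print(diagnose("raise TypeError('missing argument')"))
```

In other words, when this error shows up, the traceback above it is the part worth reading.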

@Logicino
Author

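Also, I built the environment from inside the LD folder, so the generated mmdet files are under the LD directory. The LD and mmdetection directories are separate, i.e.
/home/a/LD/mmdet
/home/a/mmdetection
I'm not sure whether this matters?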
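
One way to check which copy of a package Python would actually import is to ask the import system for its location. The tracebacks in this thread show frames under /home/a/LD/mmdet, which suggests the LD copy is the one in use. A minimal sketch, where `module_location` is a hypothetical helper name:

```python
import importlib.util


def module_location(name: str):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # "mmdet" is the package discussed here; prints None if it is not installed.
    print(module_location("mmdet"))
```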

@HikariTJU
Owner

In the config, add these three entries below loss_ld:

loss_ld_vlr=dict(
    type='KnowledgeDistillationKLDivLoss', loss_weight=0.25, T=10),
loss_kd=dict(
    type='KnowledgeDistillationKLDivLoss', loss_weight=10, T=2),
loss_im=dict(type='IMLoss', loss_weight=2.0),
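
The original TypeError came from the head's constructor not receiving `loss_im`. As a rough illustration of how such gaps could be spotted before launching training, the sketch below compares a config dict against a class's required `__init__` parameters using `inspect.signature`. `DemoHead` and `missing_required_args` are hypothetical names made up for this example; the real LDHead signature is not reproduced here.

```python
import inspect


def missing_required_args(cls, cfg: dict) -> list:
    """List required __init__ parameters of cls that cfg does not supply."""
    params = inspect.signature(cls.__init__).parameters
    named_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return [
        name
        for name, param in params.items()
        if name != "self"
        and param.kind in named_kinds
        and param.default is inspect.Parameter.empty  # no default => required
        and name not in cfg
    ]


class DemoHead:
    """Hypothetical stand-in for a head class; not the real LDHead signature."""

    def __init__(self, num_classes, loss_ld, loss_im, loss_kd=None, reg_max=16):
        pass


if __name__ == "__main__":
    cfg = dict(type="DemoHead", num_classes=80, loss_ld=dict(), reg_max=16)
    print(missing_required_args(DemoHead, cfg))  # -> ['loss_im']
```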

@Logicino
Author

Hello, I made the change you suggested. The modified part now looks roughly like this:

        loss_ld=dict(
            type='KnowledgeDistillationKLDivLoss', loss_weight=0.25, T=10),
        loss_ld_vlr=dict(
            type='KnowledgeDistillationKLDivLoss', loss_weight=0.25, T=10),
        loss_kd=dict(
            type='KnowledgeDistillationKLDivLoss', loss_weight=10, T=2),
        loss_im=dict(type='IMLoss', loss_weight=2.0),
        reg_max=16,
        loss_bbox=dict(type='GIoULoss', loss_weight=2.0)),

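I don't think I misunderstood this part. Thank you very much for your answer.
After the change, the corresponding .pth files download successfully.
However, while loading annotations into memory, I got the error FileNotFoundError: [Errno 2] No such file or directory: 'data/coco/annotations/instances_train2017.json'.
I placed my prepared COCO-format dataset under /home/a/mmdetection/data/coco/annotations.
image
I also checked the symlinks, which look like this:
image

The final error is still:
subprocess.CalledProcessError: Command '['/home/a/anaconda3/envs/open-mmlab/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
Does this mean the original problem is still unresolved?
P.S. I have run into this "cannot find data/coco/annotations" problem before with some other networks as well, but despite searching I never found the cause.

Here is the full run and error output: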
(open-mmlab) a@a-System-Product-Name:~/LD$ ./tools/dist_train.sh configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py 1
2021-11-27 23:02:30,450 - mmdet - INFO - Environment info:

sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 2060 SUPER
CUDA_HOME: /usr/local/cuda-10.2
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.5.1
PyTorch compiling details: PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 10.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_37,code=compute_37
  • CuDNN 7.6.3
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_INTERNAL_THREADPOOL_IMPL -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

TorchVision: 0.6.0a0+35d732a
OpenCV: 4.5.4
MMCV: 1.2.7
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMDetection: 2.10.0+9856a78

2021-11-27 23:02:30,619 - mmdet - INFO - Distributed training: True
2021-11-27 23:02:30,787 - mmdet - INFO - Config:
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]
data = dict(
samples_per_gpu=3,
workers_per_gpu=2,
train=dict(
type='CocoDataset',
ann_file='data/coco/annotations/instances_train2017.json',
img_prefix='data/coco/images/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]),
val=dict(
type='CocoDataset',
ann_file='data/coco/annotations/instances_val2017.json',
img_prefix='data/coco/images/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]),
test=dict(
type='CocoDataset',
ann_file='data/coco/annotations/instances_val2017.json',
img_prefix='data/coco/images/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]))
evaluation = dict(interval=1, metric='bbox')
optimizer = dict(type='SGD', lr=0.00375, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=500,
warmup_ratio=0.001,
step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
teacher_ckpt = 'https://download.openmmlab.com/mmdetection/v2.0/gfl/gfl_r101_fpn_mstrain_2x_coco/gfl_r101_fpn_mstrain_2x_coco_20200629_200126-dd12f847.pth'
model = dict(
type='KnowledgeDistillationSingleStageDetector',
pretrained='torchvision://resnet18',
teacher_config='configs/gfl/gfl_r101_fpn_mstrain_2x_coco.py',
teacher_ckpt=
'https://download.openmmlab.com/mmdetection/v2.0/gfl/gfl_r101_fpn_mstrain_2x_coco/gfl_r101_fpn_mstrain_2x_coco_20200629_200126-dd12f847.pth',
output_feature=True,
backbone=dict(
type='ResNet',
depth=18,
num_stages=4,
out_indices=(0, 1, 2, 3),
frozen_stages=1,
norm_cfg=dict(type='BN', requires_grad=True),
norm_eval=True,
style='pytorch'),
neck=dict(
type='FPN',
in_channels=[64, 128, 256, 512],
out_channels=256,
start_level=1,
add_extra_convs='on_output',
num_outs=5),
bbox_head=dict(
type='LDHead',
num_classes=80,
in_channels=256,
stacked_convs=4,
feat_channels=256,
anchor_generator=dict(
type='AnchorGenerator',
ratios=[1.0],
octave_base_scale=8,
scales_per_octave=1,
strides=[8, 16, 32, 64, 128]),
loss_cls=dict(
type='QualityFocalLoss',
use_sigmoid=True,
beta=2.0,
loss_weight=1.0),
loss_dfl=dict(type='DistributionFocalLoss', loss_weight=0.25),
loss_ld=dict(
type='KnowledgeDistillationKLDivLoss', loss_weight=0.25, T=10),
loss_ld_vlr=dict(
type='KnowledgeDistillationKLDivLoss', loss_weight=0.25, T=10),
loss_kd=dict(
type='KnowledgeDistillationKLDivLoss', loss_weight=10, T=2),
loss_im=dict(type='IMLoss', loss_weight=2.0),
reg_max=16,
loss_bbox=dict(type='GIoULoss', loss_weight=2.0)),
train_cfg=dict(
assigner=dict(type='ATSSAssigner', topk=9),
allowed_border=-1,
pos_weight=-1,
debug=False),
test_cfg=dict(
nms_pre=1000,
min_bbox_size=0,
score_thr=0.05,
nms=dict(type='nms', iou_threshold=0.6),
max_per_img=100))
work_dir = './work_dirs/ld_r18_gflv1_r101_fpn_coco_1x'
gpu_ids = range(0, 1)

2021-11-27 23:02:30,920 - mmdet - INFO - load model from: torchvision://resnet18
2021-11-27 23:02:30,920 - mmdet - INFO - Use load_from_torchvision loader
Downloading: "https://download.pytorch.org/models/resnet18-5c106cde.pth" to /home/a/.cache/torch/checkpoints/resnet18-5c106cde.pth
100.0%
2021-11-27 23:02:35,276 - mmdet - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

2021-11-27 23:02:35,600 - mmdet - INFO - load model from: torchvision://resnet101
2021-11-27 23:02:35,600 - mmdet - INFO - Use load_from_torchvision loader
Downloading: "https://download.pytorch.org/models/resnet101-5d3b4d8f.pth" to /home/a/.cache/torch/checkpoints/resnet101-5d3b4d8f.pth
100.0%
2021-11-27 23:02:50,067 - mmdet - WARNING - The model and loaded state dict do not match exactly

unexpected key in source state_dict: fc.weight, fc.bias

Use load_from_http loader
Downloading: "https://download.openmmlab.com/mmdetection/v2.0/gfl/gfl_r101_fpn_mstrain_2x_coco/gfl_r101_fpn_mstrain_2x_coco_20200629_200126-dd12f847.pth" to /home/a/.cache/torch/checkpoints/gfl_r101_fpn_mstrain_2x_coco_20200629_200126-dd12f847.pth
100.0%
loading annotations into memory...
Traceback (most recent call last):
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/utils/registry.py", line 179, in build_from_cfg
return obj_cls(**args)
File "/home/a/LD/mmdet/datasets/custom.py", line 87, in __init__
self.data_infos = self.load_annotations(self.ann_file)
File "/home/a/LD/mmdet/datasets/coco.py", line 57, in load_annotations
self.coco = COCO(ann_file)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/pycocotools/coco.py", line 84, in __init__
with open(annotation_file, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/coco/annotations/instances_train2017.json'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "./tools/train.py", line 187, in <module>
main()
File "./tools/train.py", line 163, in main
datasets = [build_dataset(cfg.data.train)]
File "/home/a/LD/mmdet/datasets/builder.py", line 71, in build_dataset
dataset = build_from_cfg(cfg, DATASETS, default_args)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/utils/registry.py", line 182, in build_from_cfg
raise type(e)(f'{obj_cls.__name__}: {e}')
FileNotFoundError: CocoDataset: [Errno 2] No such file or directory: 'data/coco/annotations/instances_train2017.json'
Traceback (most recent call last):
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/a/anaconda3/envs/open-mmlab/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py', '--launcher', 'pytorch']' returned non-zero exit status 1.

@HikariTJU
Owner

The data must be placed under the LD folder.
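
The reason is presumably that `ann_file` in the config is a relative path ('data/coco/annotations/instances_train2017.json'), and a relative path is resolved against the current working directory, which is ~/LD in the log above, not ~/mmdetection. A minimal sketch of that behaviour, using a temporary directory as a stand-in for the LD checkout; `resolve_from_cwd` is a hypothetical helper name:

```python
import os
import tempfile
from pathlib import Path


def resolve_from_cwd(relative_path: str) -> Path:
    """Absolute path that open(relative_path) would actually try to read."""
    return Path.cwd() / relative_path


if __name__ == "__main__":
    ann = "data/coco/annotations/instances_train2017.json"
    with tempfile.TemporaryDirectory() as repo:  # stands in for the LD checkout
        previous = os.getcwd()
        os.chdir(repo)
        try:
            # Nothing exists under <cwd>/data yet, so the lookup fails.
            print(resolve_from_cwd(ann).exists())  # False
            target = Path(repo) / ann
            target.parent.mkdir(parents=True)
            target.write_text("{}")
            # Once the file lives under the working directory, it is found.
            print(resolve_from_cwd(ann).exists())  # True
        finally:
            os.chdir(previous)
```

Using an absolute `data_root` in the config would be another way to avoid depending on the working directory.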

@Logicino
Author

Thanks! That solved the data problem!
But I still get an error:

subprocess.CalledProcessError: Command '['/home/a/anaconda3/envs/open-mmlab/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py', '--launcher', 'pytorch']' returned non-zero exit status 1.

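It appears right when training starts (after loading annotations has already finished).
Could there still be a problem with distributed training?

The detailed error is as follows: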
Use load_from_http loader
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
2021-11-27 23:23:26,089 - mmdet - INFO - Start running, host: a@a-System-Product-Name, work_dir: /home/a/LD/work_dirs/ld_r18_gflv1_r101_fpn_coco_1x
2021-11-27 23:23:26,089 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2021-11-27 23:23:30,444 - mmdet - INFO - Saving checkpoint at 1 epochs
[ ] 0/100, elapsed: 0s, ETA:
Traceback (most recent call last):
File "./tools/train.py", line 187, in <module>
main()
File "./tools/train.py", line 183, in main
meta=meta)
File "/home/a/LD/mmdet/apis/train.py", line 170, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
self.call_hook('after_train_epoch')
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 308, in call_hook
getattr(hook, fn_name)(self)
File "/home/a/LD/mmdet/core/evaluation/eval_hooks.py", line 276, in after_train_epoch
gpu_collect=self.gpu_collect)
File "/home/a/LD/mmdet/apis/test.py", line 97, in multi_gpu_test
result = model(return_loss=False, rescale=True, **data)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 458, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
return old_func(*args, **kwargs)
File "/home/a/LD/mmdet/models/detectors/base.py", line 183, in forward
return self.forward_test(img, img_metas, **kwargs)
File "/home/a/LD/mmdet/models/detectors/base.py", line 160, in forward_test
return self.simple_test(imgs[0], img_metas[0], **kwargs)
File "/home/a/LD/mmdet/models/detectors/single_stage.py", line 120, in simple_test
*outs, img_metas, rescale=rescale)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 164, in new_func
return old_func(*args, **kwargs)
File "/home/a/LD/mmdet/models/dense_heads/anchor_head.py", line 583, in get_bboxes
scale_factors, cfg, rescale)
File "/home/a/LD/mmdet/models/dense_heads/gfl_head.py", line 560, in _get_bboxes
cfg.max_per_img)
File "/home/a/LD/mmdet/core/post_processing/bbox_nms.py", line 187, in multiclass_nms
return dets, labels[keep]
IndexError: index 159 is out of bounds for dimension 0 with size 100
Traceback (most recent call last):
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/a/anaconda3/envs/open-mmlab/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
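
The IndexError in this traceback says that `keep` contains an index (159) that is not valid for `labels`, which has only 100 entries. The sketch below is a debugging aid only, not a fix for the underlying cause: it uses plain Python lists as stand-ins for the tensors and reports which indices fall outside the valid range. `out_of_range` is a hypothetical helper name, and the sample `keep` values are made up.

```python
def out_of_range(keep, size):
    """Return the indices in keep that cannot index a sequence of length size."""
    return [i for i in keep if not -size <= i < size]


if __name__ == "__main__":
    labels = list(range(100))  # stand-in for a labels tensor of size 100
    keep = [3, 42, 159]        # made-up NMS output containing an invalid index
    print(out_of_range(keep, len(labels)))  # -> [159]
```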

@HikariTJU
Owner

HikariTJU commented Nov 27, 2021

2021-11-27 23:23:26,089 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2021-11-27 23:23:30,444 - mmdet - INFO - Saving checkpoint at 1 epochs

Why does your run save a checkpoint right away, before any training has happened? I tried the same command and it trains fine:
image

@Augusta-A

@Logicino Hi, I ran into a similar problem in multiclass_nms. How did you solve it?

@Logicino
Author

2021-11-27 23:23:26,089 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
2021-11-27 23:23:30,444 - mmdet - INFO - Saving checkpoint at 1 epochs

Why does yours save a checkpoint right away, before any training has happened? I tried the same command and it trains fine: [image]

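I would like to confirm whether the installed CUDA version only needs to be higher than... My guess is that the cause may be that CUDA 10.2 is installed on my machine.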

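Hello, I think I misunderstood the usage instructions earlier. I later noticed that the configs directory in mmdetection contains an LD folder, which should be this network, so I trained it directly with mmdetection. Thank you to the author!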

@Logicino
Author

The data directory needs to be placed under the LD folder.

Thanks! That solved the data problem. But I still get an error:

subprocess.CalledProcessError: Command '['/home/a/anaconda3/envs/open-mmlab/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py', '--launcher', 'pytorch']' returned non-zero exit status 1.

It also happens right when training starts (loading annotations has already finished). Could this still be a distributed-training problem?

@Logicino Hi, I ran into a similar problem in multiclass_nms. How did you solve it?

I think I was using it incorrectly. After retraining it works, and I have not run into the multiclass_nms problem since.
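
For anyone who still hits the IndexError quoted in this thread: the message says that `keep` contained index 159 while `labels` had only 100 entries, so the two no longer refer to the same set of boxes. The sketch below shows that failure mode with plain Python lists instead of tensors. `select_labels` is a hypothetical helper for illustration, not part of mmdet, and it does not explain why `keep` and `labels` disagree in a given installation.

```python
def select_labels(labels, keep):
    """Return labels[i] for each i in keep, failing early with a clear message.

    Hypothetical helper for illustration only; not part of mmdet.
    """
    bad = [i for i in keep if not 0 <= i < len(labels)]
    if bad:
        raise IndexError(
            f"keep contains {bad} but labels has only {len(labels)} entries"
        )
    return [labels[i] for i in keep]


labels = list(range(100))  # 100 entries, matching "size 100" in the error

# Indices that stay inside labels work as expected.
ok = select_labels(labels, [0, 5, 99])

try:
    select_labels(labels, [0, 159])  # 159 is the index from the traceback
    failed = False
except IndexError as err:
    failed = True
    message = str(err)
    print(message)
```

A check like this, placed just before the failing line, only confirms which index is out of range. It is a diagnostic aid, not a fix.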
