New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
训练时遇到subprocess.CalledProcessError #16
Comments
另外,我在编译环境的时候是在LD文件夹目录下编译的,所生成的mmdet文件是在LD的目录下,LD和mmdetection的目录是分开的,即 |
config 里loss_ld下面加这三行
|
您好,我按照您的修改了,修改后的部分大概是这个样子的:
我觉得这块我应该没有理解错误,非常感谢您的解答~ 并且最后报错依然为: 这是整体的运行和报错情况:
|
data要放在LD文件夹下面 |
感谢!成功地解决了data的问题!
并且看到是在train开始的时候(已经完成了loading annotations) 具体报错如下: |
@Logicino 你好,我在multiclass_nms遇到和你类似的问题,请问你是怎么解决的? |
我觉得我应该是使用错了,重新训练可以用了,暂时没有用到multiclass_nms的问题来着 |
您好,我按照项目中的要求创建了新的虚拟环境
使用了cuda10.1+pytorch1.5.0
(这里install.md中的mmcv-full安装方式需要更新
pip install mmcv-full==1.2.7 -f https://download.openmmlab.com/mmcv/dist/cu101/torch1.5.0/index.html
)README.md中的configs目录有问题
我只有一个GPU,训练的时候,我修改成了
./tools/dist_train.sh configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py 1
有一处报错为:
subprocess.CalledProcessError: Command '['/home/a/anaconda3/envs/open-mmlab/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
我搜索了此种报错的解决方案,有一种是在DistributedDataParallel中添加find_unused_parameters=True
model = torch.nn.parallel.DistributedDataParallel(model,device_ids=[args.local_rank],output_device=args.local_rank, find_unused_parameters=True)
我想知道这个项目的find_unused_parameters应该在哪个文件中设置呢?感谢。
我的整体报错如下:
(open-mmlab) a@a-System-Product-Name:~/LD$ ./tools/dist_train.sh configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py 1
2021-11-27 22:44:39,804 - mmdet - INFO - Environment info:
sys.platform: linux
Python: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
CUDA available: True
GPU 0: NVIDIA GeForce RTX 2060 SUPER
CUDA_HOME: /usr/local/cuda-10.2
NVCC: Cuda compilation tools, release 10.2, V10.2.89
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.5.1
PyTorch compiling details: PyTorch built with:
TorchVision: 0.6.0a0+35d732a
OpenCV: 4.5.4
MMCV: 1.2.7
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMDetection: 2.10.0+9856a78
2021-11-27 22:44:39,967 - mmdet - INFO - Distributed training: True
2021-11-27 22:44:40,128 - mmdet - INFO - Config:
dataset_type = 'CocoDataset'
data_root = 'data/coco/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]
data = dict(
samples_per_gpu=3,
workers_per_gpu=2,
train=dict(
type='CocoDataset',
ann_file='data/coco/annotations/instances_train2017.json',
img_prefix='data/coco/images/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
dict(type='RandomFlip', flip_ratio=0.5),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]),
val=dict(
type='CocoDataset',
ann_file='data/coco/annotations/instances_val2017.json',
img_prefix='data/coco/images/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]),
test=dict(
type='CocoDataset',
ann_file='data/coco/annotations/instances_val2017.json',
img_prefix='data/coco/images/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='Resize', keep_ratio=True),
dict(type='RandomFlip'),
dict(
type='Normalize',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True),
dict(type='Pad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img'])
])
]))
evaluation = dict(interval=1, metric='bbox')
optimizer = dict(type='SGD', lr=0.00375, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=500,
warmup_ratio=0.001,
step=[8, 11])
runner = dict(type='EpochBasedRunner', max_epochs=12)
checkpoint_config = dict(interval=1)
log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
custom_hooks = [dict(type='NumClassCheckHook')]
dist_params = dict(backend='nccl')
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
teacher_ckpt = 'https://download.openmmlab.com/mmdetection/v2.0/gfl/gfl_r101_fpn_mstrain_2x_coco/gfl_r101_fpn_mstrain_2x_coco_20200629_200126-dd12f847.pth'
model = dict(
type='KnowledgeDistillationSingleStageDetector',
pretrained='torchvision://resnet18',
teacher_config='configs/gfl/gfl_r101_fpn_mstrain_2x_coco.py',
teacher_ckpt=
'https://download.openmmlab.com/mmdetection/v2.0/gfl/gfl_r101_fpn_mstrain_2x_coco/gfl_r101_fpn_mstrain_2x_coco_20200629_200126-dd12f847.pth',
output_feature=True,
backbone=dict(
type='ResNet',
depth=18,
num_stages=4,
out_indices=(0, 1, 2, 3),
frozen_stages=1,
norm_cfg=dict(type='BN', requires_grad=True),
norm_eval=True,
style='pytorch'),
neck=dict(
type='FPN',
in_channels=[64, 128, 256, 512],
out_channels=256,
start_level=1,
add_extra_convs='on_output',
num_outs=5),
bbox_head=dict(
type='LDHead',
num_classes=80,
in_channels=256,
stacked_convs=4,
feat_channels=256,
anchor_generator=dict(
type='AnchorGenerator',
ratios=[1.0],
octave_base_scale=8,
scales_per_octave=1,
strides=[8, 16, 32, 64, 128]),
loss_cls=dict(
type='QualityFocalLoss',
use_sigmoid=True,
beta=2.0,
loss_weight=1.0),
loss_dfl=dict(type='DistributionFocalLoss', loss_weight=0.25),
loss_ld=dict(
type='KnowledgeDistillationKLDivLoss', loss_weight=0.25, T=10),
reg_max=16,
loss_bbox=dict(type='GIoULoss', loss_weight=2.0)),
train_cfg=dict(
assigner=dict(type='ATSSAssigner', topk=9),
allowed_border=-1,
pos_weight=-1,
debug=False),
test_cfg=dict(
nms_pre=1000,
min_bbox_size=0,
score_thr=0.05,
nms=dict(type='nms', iou_threshold=0.6),
max_per_img=100))
work_dir = './work_dirs/ld_r18_gflv1_r101_fpn_coco_1x'
gpu_ids = range(0, 1)
Traceback (most recent call last):
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/utils/registry.py", line 179, in build_from_cfg
return obj_cls(**args)
TypeError: init() missing 1 required positional argument: 'loss_im'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/utils/registry.py", line 179, in build_from_cfg
return obj_cls(**args)
File "/home/a/LD/mmdet/models/detectors/kd_one_stage.py", line 35, in init
pretrained)
File "/home/a/LD/mmdet/models/detectors/single_stage.py", line 30, in init
self.bbox_head = build_head(bbox_head)
File "/home/a/LD/mmdet/models/builder.py", line 59, in build_head
return build(cfg, HEADS)
File "/home/a/LD/mmdet/models/builder.py", line 34, in build
return build_from_cfg(cfg, registry, default_args)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/utils/registry.py", line 182, in build_from_cfg
raise type(e)(f'{obj_cls.name}: {e}')
TypeError: LDHead: init() missing 1 required positional argument: 'loss_im'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "./tools/train.py", line 187, in
main()
File "./tools/train.py", line 161, in main
test_cfg=cfg.get('test_cfg'))
File "/home/a/LD/mmdet/models/builder.py", line 77, in build_detector
return build(cfg, DETECTORS, dict(train_cfg=train_cfg, test_cfg=test_cfg))
File "/home/a/LD/mmdet/models/builder.py", line 34, in build
return build_from_cfg(cfg, registry, default_args)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/utils/registry.py", line 182, in build_from_cfg
raise type(e)(f'{obj_cls.name}: {e}')
TypeError: KnowledgeDistillationSingleStageDetector: LDHead: init() missing 1 required positional argument: 'loss_im'
Traceback (most recent call last):
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in
main()
File "/home/a/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/a/anaconda3/envs/open-mmlab/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/ld/ld_r18_gflv1_r101_fpn_coco_1x.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
The text was updated successfully, but these errors were encountered: