Training error: has no im_shape field #1849

Closed
iceriver97 opened this issue Dec 9, 2020 · 29 comments

@iceriver97

I started from mask_rcnn_r50_2x.yml and modified it. Then I ran the following command:

!python tools/train.py -c configs/myconfig/mask_rcnn_r50_2x.yml --eval -o use_gpu=true --use_vdl=True --vdl_log_dir=vdl_dir/scalar

It fails with:

Traceback (most recent call last):
  File "tools/train.py", line 377, in <module>
    main()
  File "tools/train.py", line 146, in main
    fetches = model.eval(feed_vars)
  File "/home/aistudio/work/PaddleDetection/ppdet/modeling/architectures/mask_rcnn.py", line 338, in eval
    return self.build(feed_vars, 'test')
  File "/home/aistudio/work/PaddleDetection/ppdet/modeling/architectures/mask_rcnn.py", line 81, in build
    self._input_check(required_fields, feed_vars)
  File "/home/aistudio/work/PaddleDetection/ppdet/modeling/architectures/mask_rcnn.py", line 271, in _input_check
    "{} has no {} field".format(feed_vars, var)
AssertionError: OrderedDict([('image', name: "image"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: FP32
      dims: -1
      dims: 3
      dims: -1
      dims: -1
    }
    lod_level: 0
  }
}
persistable: false
need_check_feed: true
), ('im_info', name: "im_info"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: FP32
      dims: -1
      dims: 3
    }
    lod_level: 0
  }
}
persistable: false
need_check_feed: true
), ('im_id', name: "im_id"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: INT64
      dims: -1
      dims: 1
    }
    lod_level: 0
  }
}
persistable: false
need_check_feed: true
), ('gt_bbox', name: "gt_bbox"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: FP32
      dims: -1
      dims: 4
    }
    lod_level: 1
  }
}
persistable: false
need_check_feed: true
), ('gt_class', name: "gt_class"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: INT32
      dims: -1
      dims: 1
    }
    lod_level: 1
  }
}
persistable: false
need_check_feed: true
), ('is_crowd', name: "is_crowd"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: INT32
      dims: -1
      dims: 1
    }
    lod_level: 1
  }
}
persistable: false
need_check_feed: true
), ('gt_mask', name: "gt_mask"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: FP32
      dims: -1
      dims: 2
    }
    lod_level: 3
  }
}
persistable: false
need_check_feed: true
)]) has no im_shape field

I looked at line 81 of mask_rcnn.py. It seems that because the mode is not 'train', self._input_check(required_fields, feed_vars) checks for im_shape. Is there something wrong with my training command?
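
For readers following along, the check that raises this error can be sketched roughly as below. This is a minimal reconstruction from the traceback only (the assertion message at line 271 of mask_rcnn.py); the function body and the list of required fields are assumptions, not the actual PaddleDetection source.

```python
from collections import OrderedDict


def input_check(required_fields, feed_vars):
    """Assert that every required field is present in feed_vars.

    Hypothetical reconstruction: only the message format is taken
    from the traceback; the field names are printed as a list here
    to keep the message short.
    """
    for var in required_fields:
        assert var in feed_vars, \
            "{} has no {} field".format(list(feed_vars), var)


# Fields produced by a reader config that lacks im_shape (as in the error).
feed_vars = OrderedDict.fromkeys(
    ["image", "im_info", "im_id", "gt_bbox", "gt_class", "is_crowd", "gt_mask"])

try:
    # Assumed eval-time requirement; im_shape is the field named in the error.
    input_check(["im_info", "im_shape"], feed_vars)
    error_message = None
except AssertionError as exc:
    error_message = str(exc)

print(error_message)
```

The point of the sketch is that the failure comes from the field list of the reader config, not from the training command itself.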

@iceriver97
Author

The error does not occur when the eval option is disabled. What is causing it?

@iceriver97
Author

After training for a while, the loss becomes nan:
iter: 1580, lr: 0.010000, 'loss_cls': 'nan', 'loss_bbox': 'nan', 'loss_rpn_cls': '0.397665', 'loss_rpn_bbox': '0.007382', 'loss_mask': '0.352444', 'loss': 'nan', eta: 1:14:28, batch_cost: 0.30992 sec, ips: 3.22669 images/sec

@willthefrog
Collaborator

Are you using the release version?
What did you modify?
For single-GPU training you will need to adjust the learning rate.

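@iceriver97
Author

Are you using the release version?
What did you modify?
For single-GPU training you will need to adjust the learning rate.

Thanks for the reply!
Version: v0.5.
The only thing I changed in the config is the dataset path. Could you check it for me?
It is a single GPU, but how should I change the learning rate?
myconfig.zip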

@willthefrog
Collaborator

It looks like you upgraded the code but not the config. See here.

Change the learning rate to 1/8.

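@iceriver97
Author

It looks like you upgraded the code but not the config. See here.

Change the learning rate to 1/8.

Oh, thanks, that was indeed the problem.
I'm just getting started and was following the end-to-end tutorial. After updating the config I can enable the eval option.
But now another error is raised during eval in the middle of training: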

2020-12-09 14:27:30,149-WARNING: Your reader has raised an exception!
Exception in thread Thread-9:
Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 1145, in __thread_main__
    six.reraise(*sys.exc_info())
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 1125, in __thread_main__
    for tensors in self._tensor_reader():
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 1195, in __tensor_reader_impl__
    for slots in paddle_reader():
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/data_feeder.py", line 507, in __reader_creator__
    yield self.feed(item)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/data_feeder.py", line 348, in feed
    ret_dict[each_name] = each_converter.done()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/data_feeder.py", line 157, in done
    arr = np.array(self.data, dtype=self.dtype)
ValueError: could not broadcast input array from shape (3,641,1333) into shape (3)

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "tools/train.py", line 377, in <module>
    main()
  File "tools/train.py", line 294, in main
    resolution=resolution)
  File "/home/aistudio/work/PaddleDetection/ppdet/utils/eval_utils.py", line 148, in eval_run
    return_numpy=False)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
    six.reraise(*sys.exc_info())
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
    return_merged=return_merged)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1167, in _run_impl
    return_merged=return_merged)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py", line 879, in _run_parallel
    tensors = exe.run(fetch_var_names, return_merged)._move_to_list()
paddle.fluid.core_avx.EnforceNotMet: 

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int)
2   paddle::operators::reader::BlockingQueue<std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> > >::Receive(std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> >*)
3   paddle::operators::reader::PyReader::ReadNext(std::vector<paddle::framework::LoDTensor, std::allocator<paddle::framework::LoDTensor> >*)
4   std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<unsigned long>, std::__future_base::_Result_base::_Deleter>, unsigned long> >::_M_invoke(std::_Any_data const&)
5   std::__future_base::_State_base::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>&, bool&)
6   ThreadPool::ThreadPool(unsigned long)::{lambda()#1}::operator()() const

------------------------------------------
Python Call Stacks (More useful to users):
------------------------------------------
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2610, in append_op
    attrs=kwargs.get("attrs", None))
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 1080, in _init_non_iterable
    attrs={'drop_last': self._drop_last})
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 978, in __init__
    self._init_non_iterable()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/reader.py", line 620, in from_generator
    iterable, return_list, drop_last)
  File "/home/aistudio/work/PaddleDetection/ppdet/modeling/architectures/mask_rcnn.py", line 329, in build_inputs
    iterable=iterable) if use_dataloader else None
  File "tools/train.py", line 145, in main
    feed_vars, eval_loader = model.build_inputs(**inputs_def)
  File "tools/train.py", line 377, in <module>
    main()

----------------------
Error Message Summary:
----------------------
Error: Blocking queue is killed because the data reader raises an exception
  [Hint: Expected killed_ != true, but received killed_:1 == true:1.] at (/paddle/paddle/fluid/operators/reader/blocking_queue.h:141)
  [operator < read > error]
terminate called without an active exception

@iceriver97
Author

Also, during training the other loss curves are fairly stable, but loss_bbox and loss_rpn_bbox oscillate heavily. Is that normal?
[Figures: loss_rpn_bbox and loss_bbox curves]

@iceriver97
Author

There are also some warnings before training starts:
2020-12-09 14:17:04,278-WARNING: Found an invalid bbox in annotations: im_id: 22, x1: 616.0, y1: 211.0, x2: 594.0, y2: 408.0.
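
The warning above reports a box whose x2 (594.0) is smaller than its x1 (616.0). A minimal sketch of such a validity check, assuming boxes are given as corner coordinates (x1, y1, x2, y2); the function name is hypothetical and this is not PaddleDetection's own implementation:

```python
def is_valid_bbox(x1, y1, x2, y2):
    """Return True when the box has positive width and height."""
    return x2 > x1 and y2 > y1


# The box from the warning: x2 < x1, so it is rejected.
warned_box = (616.0, 211.0, 594.0, 408.0)
print(is_valid_bbox(*warned_box))  # False

# The same box with x1 and x2 swapped is well-formed.
print(is_valid_bbox(594.0, 211.0, 616.0, 408.0))  # True
```

Running a check like this over the annotation file before training makes it easy to count how many boxes are affected.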

@willthefrog
Collaborator

This still looks like a configuration problem. Do you still get the error with the default config?

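@iceriver97
Author

This still looks like a configuration problem. Do you still get the error with the default config?

Thanks, it was the batch size. Can't I increase batch_size for eval?
What batch_size is typical for detection tasks? As large as GPU memory allows?
Do lr and batch_size need to be adjusted together?
Do the invalid bbox and the oscillating box loss curves matter?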
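
The ValueError in the earlier traceback (shape (3,641,1333) into shape (3)) is the kind of error that appears when images of different sizes are stacked into one array. A minimal sketch of that condition, assuming eval images keep their aspect ratio after resizing and are not padded to a common size; the helper name and the second shape are illustrative, not taken from the thread:

```python
def can_stack(shapes):
    """Images can be stacked into one batch array only if all shapes match."""
    return len(set(shapes)) <= 1


# Two resized eval images with different aspect ratios
# (the second shape is made up for illustration).
eval_batch = [(3, 641, 1333), (3, 800, 1201)]
print(can_stack(eval_batch))      # False: a batch of 2 cannot be stacked
print(can_stack(eval_batch[:1]))  # True: a batch of 1 always works
```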

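@iceriver97
Author

This still looks like a configuration problem. Do you still get the error with the default config?

My eval results look like this. That doesn't seem right, does it?
[Image: eval results screenshot]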

@willthefrog
Collaborator

The eval batch size should be 1, I think.
LR and batch size should be scaled proportionally.
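
Scaling the learning rate in proportion to the total batch size can be sketched as follows. The base values are assumptions drawn from this thread (lr 0.01 in the training log, intended for 8 GPUs); the per-GPU batch size of 1 and the function name are illustrative.

```python
def scale_lr(base_lr, base_total_batch, new_total_batch):
    """Linear scaling: keep lr / total_batch_size constant."""
    return base_lr * new_total_batch / base_total_batch


# Base schedule: lr 0.01 for 8 GPUs x 1 image per GPU (assumed).
# Single GPU with 1 image per step -> 1/8 of the base learning rate.
single_gpu_lr = scale_lr(0.01, base_total_batch=8 * 1, new_total_batch=1 * 1)
print(single_gpu_lr)  # 0.00125
```

If the per-GPU batch size is raised as well, the same function applies with the new total batch size.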

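@iceriver97
Author

iceriver97 commented Dec 9, 2020

The eval batch size should be 1, I think.
LR and batch size should be scaled proportionally.

I set it to 1. The eval results are shown in the image above. They don't look normal, do they?
Is it because I trained for too few epochs? Do the invalid bbox and the oscillating box loss curves matter?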

@willthefrog
Collaborator

The invalid bbox has no effect.

@willthefrog
Collaborator

I'm not sure what dataset this is. bbox learning can be unstable early in training. If you are really unsure, you can add more warmup steps.

@iceriver97
Author

I'm not sure what dataset this is. bbox learning can be unstable early in training. If you are really unsure, you can add more warmup steps.

The dataset is small: 10 * 65 images.

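@willthefrog
Collaborator

Note that the LR corresponds to 8 GPUs, so it also needs to be adjusted to the number of GPUs.
Also, your dataset is rather small, so I'm not sure what the baseline results would look like.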

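@iceriver97
Author

iceriver97 commented Dec 9, 2020

Note that the LR corresponds to 8 GPUs, so it also needs to be adjusted to the number of GPUs.
Also, your dataset is rather small, so I'm not sure what the baseline results would look like.

I changed the LR to 0.00125. Which models are recommended when the dataset is small?
Also, the dataset I downloaded contains negative images with no annotations. Can these images not be used?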

@willthefrog
Collaborator

willthefrog commented Dec 9, 2020

The RCNN family is fine. Other models do not necessarily need less data.
You could try setting drop_empty to false.

@iceriver97
Author

You could try setting drop_empty to false.

OK, thanks. I'll try a different dataset instead. drop_empty is already false.

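@iceriver97
Author

The RCNN family is fine. Other models do not necessarily need less data.
You could try setting drop_empty to false.

I switched to another dataset and organized it in VOC format, but when training starts it keeps checking against the COCO configuration. What is going on? My config file is as follows:
[Images: config file screenshots]
Is there anything wrong with this config?
The error is: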

Traceback (most recent call last):
  File "tools/train.py", line 377, in <module>
    main()
  File "tools/train.py", line 118, in main
    train_fetches = model.train(feed_vars)
  File "/home/aistudio/work/PaddleDetection/ppdet/modeling/architectures/mask_rcnn.py", line 333, in train
    return self.build(feed_vars, 'train')
  File "/home/aistudio/work/PaddleDetection/ppdet/modeling/architectures/mask_rcnn.py", line 81, in build
    self._input_check(required_fields, feed_vars)
  File "/home/aistudio/work/PaddleDetection/ppdet/modeling/architectures/mask_rcnn.py", line 271, in _input_check
    "{} has no {} field".format(feed_vars, var)
AssertionError: OrderedDict([('image', name: "image"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: FP32
      dims: -1
      dims: 3
      dims: -1
      dims: -1
    }
    lod_level: 0
  }
}
persistable: false
need_check_feed: true
), ('im_info', name: "im_info"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: FP32
      dims: -1
      dims: 3
    }
    lod_level: 0
  }
}
persistable: false
need_check_feed: true
), ('im_id', name: "im_id"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: INT64
      dims: -1
      dims: 1
    }
    lod_level: 0
  }
}
persistable: false
need_check_feed: true
), ('gt_bbox', name: "gt_bbox"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: FP32
      dims: -1
      dims: 4
    }
    lod_level: 1
  }
}
persistable: false
need_check_feed: true
), ('gt_class', name: "gt_class"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: INT32
      dims: -1
      dims: 1
    }
    lod_level: 1
  }
}
persistable: false
need_check_feed: true
), ('is_difficult', name: "is_difficult"
type {
  type: LOD_TENSOR
  lod_tensor {
    tensor {
      data_type: INT32
      dims: -1
      dims: 1
    }
    lod_level: 1
  }
}
persistable: false
need_check_feed: true
)]) has no gt_mask field

@willthefrog
Collaborator

The Mask R-CNN architecture requires a gt_mask input.

@iceriver97
Author

iceriver97 commented Dec 10, 2020

The Mask R-CNN architecture requires a gt_mask input.

Should I add gt_mask to fields?

Or can I only use COCO format?

@iceriver97
Author

The Mask R-CNN architecture requires a gt_mask input.

Is there something wrong with this annotation? I get an error when converting it.
[Image: annotation screenshot]

Command I ran:

```
python tools/x2coco.py \
    --dataset_type voc \
    --voc_anno_dir dataset/my_dataset/annotations/ \
    --voc_anno_list dataset/my_dataset/train.txt \
    --voc_label_list dataset/my_dataset/label_list.txt \
    --voc_out_name voc_train.json
```

Error:

Start converting !

0%| | 0/10802 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "tools/x2coco.py", line 445, in <module>
    main()
  File "tools/x2coco.py", line 347, in main
    output_file=args.voc_out_name)
  File "tools/x2coco.py", line 265, in voc_xmls_to_cocojson
    ann_tree = ET.parse(a_path)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/xml/etree/ElementTree.py", line 1197, in parse
    tree.parse(source, parser)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/xml/etree/ElementTree.py", line 587, in parse
    source = open(source, "rb")
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/my_dataset/annotations/./images/00001.jpg.xml'

@iceriver97
Author

When I try converting based on the original train.txt, I also get an error:
[Image: error screenshot]

@willthefrog
Collaborator

The Mask R-CNN architecture requires a gt_mask input.

Should I add gt_mask to fields?

Or can I only use COCO format?

Currently only COCO-format masks are supported.

@willthefrog
Collaborator

The path looks wrong.
dataset/my_dataset/annotations/./images/00001.jpg.xml should probably be dataset/my_dataset/images/00001.jpg.xml?
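
The failing path in the traceback is what a plain join of the annotation directory and a list entry would produce. A small sketch, assuming the converter joins --voc_anno_dir with each entry derived from the list file and that the entry resolves to something like ./images/00001.jpg.xml (both are inferences from the error message, not confirmed from tools/x2coco.py):

```python
import posixpath

anno_dir = "dataset/my_dataset/annotations/"
list_entry = "./images/00001.jpg.xml"  # assumed entry, inferred from the error

joined = posixpath.join(anno_dir, list_entry)
print(joined)
# dataset/my_dataset/annotations/./images/00001.jpg.xml

# Normalizing removes the "./" but the file is still looked up under
# annotations/, so the list entries or the directory need to change.
normalized = posixpath.normpath(joined)
print(normalized)
# dataset/my_dataset/annotations/images/00001.jpg.xml
```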

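@iceriver97
Author

The path looks wrong.
dataset/my_dataset/annotations/./images/00001.jpg.xml should probably be dataset/my_dataset/images/00001.jpg.xml?

The error above happened when I converted my reorganized VOC-format data to COCO format.
When I tried converting the original VOC data to COCO format, I got this error:
[Image: error screenshot]
In the end I found another script online, and training has now started.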

@willthefrog
Collaborator

OK.
