print error - ICDAR2017_shared_task_workflows.ipynb #16

thiagopx · 2019-10-27T20:18:33Z

Hi guys,

I suggest changing print wf.list_steps() to print(wf.list_steps()) in the notebook ICDAR2017_shared_task_workflows.ipynb.

Also, I was not able to run cwltool ochre/cwl/ICDAR2017_shared_task_workflows. This is what I got:
ochre/cwl/vudnc-preprocess-pack.cwl: error: argument --archive is required

jvdzwaan · 2019-10-29T18:59:18Z

Thanks! The signature of wf.list_steps() changed, so, yes, you should do print(wf.list_steps()).
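For illustration, here is a minimal sketch of the corrected notebook cell. StubWorkflow is a hypothetical stand-in for the notebook's wf object, and it assumes list_steps() returns the listing as a string rather than printing it; the step names are made up.

```python
class StubWorkflow:
    """Hypothetical stand-in for the notebook's `wf` object."""

    def list_steps(self):
        # Assumption: the method returns the step listing as a string
        # instead of printing it itself. The step names are invented.
        return "Steps\n  step-one\n  step-two"


wf = StubWorkflow()

# Python 3: print is a function, so the call needs parentheses.
# The Python 2 statement form `print wf.list_steps()` is a SyntaxError here.
print(wf.list_steps())
```

With the real workflow object, only the last line should need to go in the notebook.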

Please note that the workflow is about preprocessing the vudnc data; it has nothing to do with the ICDAR 2017 shared task. Also, I do not recommend using the vudnc data, because it is very noisy. But if you do want to preprocess it anyway, you should run

cwltool ochre/cwl/vudnc-preprocess-pack.cwl --archive path/to/vudnc/archive
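To show why the first attempt failed, here is a rough sketch of a required command-line option, using Python's argparse as a stand-in. This is not cwltool's own code; build_parser and try_parse are hypothetical helpers, and the only assumption taken from the thread is that the workflow has a required input named archive.

```python
import argparse


def build_parser():
    # Stand-in parser with a single required option, mirroring the
    # workflow's `archive` input (assumption: it is the only required one).
    parser = argparse.ArgumentParser(prog="vudnc-preprocess-pack.cwl")
    parser.add_argument("--archive", required=True,
                        help="path to the vudnc archive")
    return parser


def try_parse(argv):
    # argparse reports a missing required option on stderr and raises
    # SystemExit; catch it so the sketch keeps running.
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        return None


missing = try_parse([])                                  # no --archive given
ok = try_parse(["--archive", "path/to/vudnc/archive"])   # as in the command above
```

The first call fails the same way as the run without --archive; the second succeeds and exposes the path as ok.archive.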

thiagopx · 2019-10-29T19:07:22Z

> Thanks! The signature of wf.list_steps() changed, so, yes, you should do print(wf.list_steps()).
>
> Please note that the workflow is about preprocessing the vudnc data, this has nothing to do with the icdar 2017 shared task. Also, I do not recommend using the vudnc data, because it is very noisy. But if you do want to preprocess it anyway, you should do
>
> cwltool ochre/cwl/vudnc-preprocess-pack.cwl --archive path/to/vudnc/archive

You are correct. I meant that I was not able to run vudnc-preprocess-pack.cwl.

For good results in English, do you recommend using the English monograph partition of ICDAR? I trained on the monograph and periodical partitions separately, but the validation accuracy and loss were not good (nor were the results of the tests I ran).

I would like to help with some additional documentation to improve reproducibility, but I need a roadmap of how to get significant results (mainly for English documents).

jvdzwaan · 2019-11-02T14:05:23Z

Unfortunately, ochre is not (yet) fit for training good OCR post-correction models. I plan to work on it in the future, but only as a hobby project. So no promises there!

Generally speaking, the OCR post-correction datasets are small. That's why I'm making a list of them, so they can be used for generalization. I don't think that training on the English monograph data will give you a model that will work on other data, because OCR errors tend to depend on time period, font, the OCR software that was used, etc.
