print error - ICDAR2017_shared_task_workflows.ipynb #16

Open
thiagopx opened this issue Oct 27, 2019 · 3 comments

@thiagopx

Hi guys,

I suggest changing print wf.list_steps() to print(wf.list_steps()) in the notebook ICDAR2017_shared_task_workflows.ipynb.

Also, I was not able to run cwltool ochre/cwl/ICDAR2017_shared_task_workflows. This is what I got:
ochre/cwl/vudnc-preprocess-pack.cwl: error: argument --archive is required

@jvdzwaan
Collaborator

Thanks! The signature of wf.list_steps() changed, so, yes, you should do print(wf.list_steps()).
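
For reference, a minimal sketch of the function-call form of print. The StandInWorkflow class below is a hypothetical stand-in for the notebook's wf object, and its return value is an assumption made for illustration; it is not the actual API.

```python
class StandInWorkflow:
    """Hypothetical stand-in for the notebook's `wf` object (not the real API)."""

    def list_steps(self):
        # Assumption: the method returns a printable description of the steps.
        return "step-a\nstep-b"


wf = StandInWorkflow()

# Statement form, a SyntaxError under Python 3:
#     print wf.list_steps()
# Function-call form, valid under Python 3:
print(wf.list_steps())
```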

Please note that the workflow is about preprocessing the vudnc data; it has nothing to do with the ICDAR 2017 shared task. Also, I do not recommend using the vudnc data, because it is very noisy. But if you do want to preprocess it anyway, you should run:

cwltool ochre/cwl/vudnc-preprocess-pack.cwl --archive path/to/vudnc/archive
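
If you would rather launch this from a notebook or script, something along these lines might work. It is only a sketch: the helper names build_preprocess_command and run_preprocess are made up for illustration, and it assumes cwltool is on your PATH and that you run it from the repository root.

```python
import subprocess


def build_preprocess_command(archive_path):
    """Return the argv list for the vudnc preprocessing workflow (hypothetical helper)."""
    return [
        "cwltool",
        "ochre/cwl/vudnc-preprocess-pack.cwl",
        "--archive",  # the argument the error message reported as required
        archive_path,
    ]


def run_preprocess(archive_path):
    """Run the workflow; assumes cwltool is installed and on PATH (hypothetical helper)."""
    # check=True raises CalledProcessError if cwltool exits with a non-zero status.
    return subprocess.run(build_preprocess_command(archive_path), check=True)


cmd = build_preprocess_command("path/to/vudnc/archive")
print(" ".join(cmd))
```

Calling run_preprocess("path/to/vudnc/archive") would then execute the same command as above.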

@thiagopx
Author

You are correct. I meant that I was not able to run vudnc-preprocess-pack.cwl.

For good results in English, do you recommend using the English monograph partition of ICDAR? I trained on the monograph and periodical partitions separately, but the validation accuracy and loss were not good, and neither were the results of the tests I ran.

I would like to help with some additional documentation to improve reproducibility, but I need a roadmap for getting significant results (mainly for English documents).

@jvdzwaan
Collaborator

jvdzwaan commented Nov 2, 2019

Unfortunately, ochre is not (yet) fit for training good OCR post-correction models. I plan to work on it in the future, but only as a hobby project. So no promises there!

Generally speaking, the OCR post-correction datasets are small. That's why I'm making a list of them, so they can be used for generalization. I don't think that training on the English monograph data will give you a model that will work on other data, because OCR errors tend to depend on time period, font, the OCR software that was used, etc.
