Page level images #7

Shreeshrii · 2018-05-04T08:20:43Z

The script works for line level images.

I have a number of scanned page images with ground truth files.

Does the OCR-D project have any tools to segment them into line images with the corresponding ground truth text?

wrznr · 2018-05-04T08:51:53Z

Unfortunately, not yet. We are working on something in this direction to align the full texts from the German Text Archive with the corresponding images. Hopefully, I can get back to you soon with a tool.
Additionally, @jbaiter does some things along these lines...

Shreeshrii · 2018-05-04T10:46:17Z

Thanks. It will be a useful tool.

I am trying to use some ocropus tools to split the page into line images. I will either OCR the line images to create text to be corrected for ground truth, or type it fully.

jbaiter · 2018-05-04T11:09:22Z

@Shreeshrii, you could try this approach:

1. Split the page image into line images with ocropus/kraken.
2. Run the most suitable OCR model on the line images.
3. For each line in the resulting OCR, find the ground truth line with the lowest edit distance (e.g. Levenshtein).
4. Every matching line with an edit distance below a certain threshold should have a fairly high chance of being a correct match.

One problem with this approach is that segmentation errors (e.g. a line gets cut in two, a few words at the beginning/end are missing, etc) lead to false positives.
This also assumes that your ground-truth is split into lines. If not, you will have to modify step 3 to slide each OCR line over the ground truth and determine the best match that way, with some added heuristics to not match partial words, etc.
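The matching step can be sketched as follows. This is only an illustration for ground truth that is already split into lines, not code from any of the tools mentioned. The function names and the `max_ratio` threshold (edit distance divided by the length of the ground truth line) are assumptions chosen for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def match_lines(ocr_lines, gt_lines, max_ratio=0.3):
    """Pair each OCR line with its closest ground truth line.

    Returns a list of (ocr_index, gt_index); gt_index is None when even
    the best candidate is further away than max_ratio (hypothetical
    threshold, tune it on your own data).
    """
    matches = []
    for i, ocr in enumerate(ocr_lines):
        best_j, best_dist = None, None
        for j, gt in enumerate(gt_lines):
            dist = levenshtein(ocr, gt)
            if best_dist is None or dist < best_dist:
                best_j, best_dist = j, dist
        if best_j is not None:
            ratio = best_dist / max(len(gt_lines[best_j]), 1)
            if ratio > max_ratio:
                best_j = None  # too different, treat as unmatched
        matches.append((i, best_j))
    return matches


ocr = ["The quick brovvn fox", "jumps ovcr the lazy dog", "zzzz"]
gt = ["jumps over the lazy dog", "The quick brown fox"]
print(match_lines(ocr, gt))
```

Lines that come back with `None` are the ones worth checking by hand, since they are likely segmentation errors or lines missing from the ground truth.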

Shreeshrii · 2018-05-04T12:39:09Z

@jbaiter

I want to use it for Devanagari script. I had looked at ocropus quite some time ago. I am not sure if ocropus/kraken supports Devanagari.

Do you know if it has support for complex scripts?

zuphilip · 2018-05-11T19:17:18Z

@Shreeshrii There are some papers with text recognition results with Ocropus on Devanagari script. However, I am not aware of any shared model you could reuse. You can find some models for Ocropus here https://github.com/tmbdev/ocropy/wiki/Models

However, instead of steps 1 and 2, you can also use tesseract to create hOCR output and then use hocr-extract-images to create the line images and texts.

Moreover, if you have the ground truth in hocr format you can use hocr-eval for the evaluation with your recognition format. Or, do you have the ground truth only as a text with the geometric information?

Shreeshrii · 2018-05-12T03:41:54Z

@zuphilip I have also read about Devanagari training for ocropus, but the models are not available (I had looked a couple of years ago or so).

Thank you for the link to specific HOCR tools. I will give them a try.

The ground truth I have are plain text files matching the scanned images without any positional info. I was able to use them to eval OCR accuracy by comparing to recognized output.

Shreeshrii · 2018-05-12T03:44:56Z

https://github.com/Shreeshrii/imagessan/tree/master/groundtruthimages

Sanskrit language samples in Devanagari script.

zuphilip · 2018-05-12T07:05:19Z

Ping @adnanulhasan who may still have some sources from the Ocropus training with Devanagari script texts.

Shreeshrii · 2018-05-28T08:59:59Z

you can also use tesseract for creating a hocr output and then use hocr-extract-images to create the line images and texts.

@zuphilip Thank you. I was able to use it for Devanagari script files as well. These are the commands which worked for me (it took a little experimenting to get them right):

:~/hocr-tools$ PYTHONIOENCODING=UTF-8 ./hocr-extract-images -b ./shree/ -p ./shree/san.pothi-%03d.png  ./shree/Mudgala-Test-01.hocr
:~/hocr-tools$ PYTHONIOENCODING=UTF-8 ./hocr-extract-images -b ./shree/ -p ./shree/san.pothi-%03d.tif  ./shree/Mudgala-Test-01.hocr

Shreeshrii · 2018-05-28T09:03:01Z

The other option which I had used was


    # perform binarization
    ./ocropus-nlbin tests/devatest?.png -o devatest -n -g

    # perform page layout analysis
    ./ocropus-gpageseg 'devatest/????.bin.png' -n

And then running tesseract to get text and correcting it.

Shreeshrii · 2018-09-09T13:07:25Z

In case it is helpful to others looking for a solution, I am posting below a bash script I use for:

1. taking a scanned page image,
2. running tesseract with the hocr option on it,
3. running hocr-tools to split it into lines.

The ground truth needs to be updated manually. If there is an existing page-level ground truth file, copy it line by line into the line-level ground truth files.

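#!/bin/bash
# Folder with the page images; line images and text files are written next to them
SOURCE="./myfiles/"
# Tesseract language model used for the first-pass OCR
lang=san
set -- "$SOURCE"*.png
for img_file; do
    echo -e  "\r\n File: $img_file"
    # Run tesseract on the page image and write hOCR output to <image name>.hocr
    OMP_THREAD_LIMIT=1 tesseract --tessdata-dir ../tessdata_fast   "${img_file}" "${img_file%.*}"  --psm 6  --oem 1  -l "$lang" -c page_separator='' hocr
    # hocr-tools runs inside a Python virtualenv in this setup
    source venv/bin/activate
    # Split the page into line images (*.exp0.tif) with one OCR text file per line
    PYTHONIOENCODING=UTF-8 ./hocr-extract-images -b ./myfiles/ -p "${img_file%.*}"-%03d.exp0.tif  "${img_file%.*}".hocr 
    deactivate
done
# Rename the OCR text files to *.gt.txt so they can be corrected as ground truth
rename s/exp0.txt/exp0.gt.txt/ ./myfiles/*exp0.txt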

echo "Image files converted to tif. Correct the ground truth files and then run ocr-d train to create box and lstmf files"

SultanOrazbayev · 2018-11-23T02:26:03Z

Occasionally, the line images are a bit wider than the text and so they catch the letters from the preceding or the subsequent lines. Is this a problem for the training (i.e. should such images be fixed to ensure that they do not contain top/bottom of the neighbouring lines)?

wrznr · 2018-11-23T09:33:27Z

I think this is a problem. It would be great if you could provide a corresponding example, maybe in a specific GitHub issue. Many thanks in advance!

Shreeshrii · 2019-02-11T03:11:42Z

Please see tesseract-ocr/tesseract#2231 for the WordStr format box files.

wrznr · 2019-08-29T11:22:38Z

@bertsky: Concerning the comment by @SultanOrazbayev, clipping may help here, right? Is it possible to get polygonal line shapes from tesseract?

bertsky · 2019-08-29T11:51:53Z

It is possible to get polygon-based segmentation from Tesseract: with BlockPolygon from the page iterator delivered by AnalyseLayout. There is a bug somewhere though: sometimes, paths self-intersect, which even Tesseract itself does not cope with very well (as can be seen by the mask images produced internally, available with GetImage when also passing the raw image again). Maybe by postprocessing one can circumvent this issue – using shapely.geometry functions to self-disjoin paths, or similar.

But even without polygon-masked line images you could try clipping to get rid of the intrusions from neighbours, yes. Or alternatively, do resegmentation (i.e. increase coherence via another line segmentation). Both methods are already available as OCR-D processors, as is Tesseract region segmentation (optionally with polygons).

But you want line segmentation with polygons here, right? I am afraid Tesseract's API does not offer that – only for the "block" level!

Should I give details (what/where/how) on using clipping and resegmentation?

kabilankiruba · 2019-09-16T13:15:49Z

Hi,
I am using OCR-D to prepare training data, and I am trying to extract text from a PDF set in a dot matrix font. I created some samples as dot matrix tif images with gt.txt files and then used tesseract on my PDF, but it extracts only some letters and sometimes reads 0 as 8. Please suggest a solution to fix this issue.

wrznr · 2019-10-01T11:06:52Z

@kabilankiruba This is clearly not related to this thread. Please consider contacting the Tesseract user group.

Shreeshrii · 2020-01-12T14:12:43Z

Is there any tool which will display the line images and gt.txt side by side for easy correction after generating the files from hOCR output (as suggested here)?

I do not want to run a web server to do this.

Can it be done via javascript/html: show an image and its gt.txt, save the corrected gt.txt, and have an arrow/option to display the next image and gt.txt?

Basically, I would like to run this on my Windows 10 desktop.

wrznr · 2020-01-13T07:08:45Z

@kba @cneud @stweil Can you recommend a tool for this purpose? Wasn't there such a thing in OCRopy?

Shreeshrii · 2020-01-13T10:51:53Z

https://github.com/OpenArabic/OCR_GS_Data/blob/master/_doublecheck_viewer.py creates HTML5 based webpage for Reviewing OCR Training/Testing Data.

kba · 2020-01-13T11:26:45Z

Can you recommend a tool for this purpose?

ocropus-gtedit
ketos transcribe (in kraken)
https://github.com/qurator-spk/neath (server based)
https://github.com/UB-Mannheim/ocr-gt-tools (server-based ocropus-gtedit)

Can it be done via javascript/html - show an image and its gt.txt - save corrected gt.txt and have an arrow/option to display next image and gt.txt.

Both kraken's and ocropy's transcription tools do that. The hocrjs viewer has an option to make items contenteditable but no way to save them.

Shreeshrii · 2020-01-13T14:10:09Z

Thank you. I think the following workflow will do the trick.

./ocropus-nlbin bookpages/*.png -o book

 ./ocropus-gpageseg 'book/????.bin.png'

 ./ocropus-gtedit html -f 20 -H 48   ./book/*/*.png

writing correction.html

Transfer and browse correction.html on Windows. Add the ground truth text for each line image. Save HTML as complete webpage. Transfer file back to Linux.

./ocropus-gtedit extract -p bookgt correction.html

fjp · 2020-01-31T14:18:28Z

@Shreeshrii could you please clarify how you match the extracted ground truth txt files from ocropy/ocropus with the line level images obtained with your script? After using ocropus-nlbin, the original filename is "lost" (ocropus uses numerical increasing values).

Using tesstrain, I assume that you don't train tesseract on the line level images and gt obtained with ocropus? These images are slightly different compared to the line images obtained with your script (which uses tesseract directly) because of preprocessing with ocropus-nlbin. But please correct me if I am wrong.

I am confused about what the current workflow is to correct the extracted ground truth:

1. Use just your script and edit the gt files manually
2. Use only ocropy
3. Combine your script with the tools from ocropus?
   1. Change the filenames to the same numerically increasing values that ocropus uses
   2. Run the script
   3. Work with ocropus commands on the line images obtained from your script
   4. Use the line images obtained from the tesseract script with the edited gt from ocropus

Shreeshrii · 2020-01-31T15:52:08Z

@fjp These are two different approaches.
I have used both separately, only on an experimental basis, mostly for testing.

M3ssman · 2020-04-02T07:07:00Z

Hello,
are there still any plans to integrate some kind of tool into tesstrain?

I was facing similar requirements for generating training data in a Windows environment, which ended up as a small script that extracts both coordinates and text data from an existing ALTO file and writes training data pairs.
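To illustrate the idea, here is a minimal sketch of reading line coordinates and text from ALTO. It is not the script mentioned above. The function names are made up for this example, and it assumes the coordinates are in pixels (check MeasurementUnit in your files). It joins the String elements with single spaces and ignores hyphenation (HYP) elements.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip the XML namespace so ALTO v2, v3 and v4 files are handled alike."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(alto_xml: str):
    """Return [((hpos, vpos, width, height), text), ...] for every TextLine."""
    root = ET.fromstring(alto_xml)
    lines = []
    for el in root.iter():
        if local(el.tag) != "TextLine":
            continue
        words = [s.get("CONTENT", "") for s in el.iter() if local(s.tag) == "String"]
        box = tuple(int(float(el.get(k, 0))) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        lines.append((box, " ".join(words)))
    return lines


def crop_box(box):
    """Convert (hpos, vpos, width, height) to (left, upper, right, lower),
    the box layout that Pillow's Image.crop() takes."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine HPOS="100" VPOS="200" WIDTH="800" HEIGHT="40">
      <String CONTENT="Erste" HPOS="100" VPOS="200" WIDTH="90" HEIGHT="40"/>
      <SP/>
      <String CONTENT="Zeile" HPOS="200" VPOS="200" WIDTH="90" HEIGHT="40"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

for box, text in alto_lines(sample):
    # A real script would crop the page image with crop_box(box) and write
    # the crop plus a matching .gt.txt file containing `text`.
    print(crop_box(box), text)
```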

wrznr · 2020-04-02T09:02:45Z

@M3ssman This would be a great contribution. Especially, since it opens up a way to use Aletheia-created GT with tesstrain.

M3ssman · 2020-04-02T19:57:02Z

@wrznr I must confess: there are some caveats.
It adds another dependency, python-opencv. Pillow kept complaining about images >80 MB.
Further, on Windows 10, one additionally needs to install the C++ 14.0 build tools; the required version varies with the Python version used by numpy, which in turn is used by opencv.

rraina97 · 2020-06-01T05:49:24Z

In case it is helpful to others looking for a solution, posting below a bash script I use for -

1. taking a scanned page image,

2. running tesseract with hocr option on it,

3. running hocr tools to split it into lines.

The ground truth needs to be updated manually, if there is an existing page level ground truth file, copy line by line into the lines ground truth.

#!/bin/bash
SOURCE="./myfiles/"
lang=san
set -- "$SOURCE"*.png
for img_file; do
    echo -e  "\r\n File: $img_file"
    OMP_THREAD_LIMIT=1 tesseract --tessdata-dir ../tessdata_fast   "${img_file}" "${img_file%.*}"  --psm 6  --oem 1  -l $lang -c page_separator='' hocr
    source venv/bin/activate
    PYTHONIOENCODING=UTF-8 ./hocr-extract-images -b ./myfiles/ -p "${img_file%.*}"-%03d.exp0.tif  "${img_file%.*}".hocr 
    deactivate
done
rename s/exp0.txt/exp0-gt.txt/ ./myfiles/*exp0.txt

echo "Image files converted to tif. Correct the ground truth files and then run ocr-d train to create box and lstmf files"

Could you please explain what each line does? I want to run it on my system but am confused about what to change. @Shreeshrii

Shreeshrii · 2020-06-01T08:32:27Z

I want to run it on my system but am confused on what to change

Assuming that you have tesseract and hocr-tools installed, put your image (png) files in the ./myfiles/ folder. Change lang=san in the bash script to whichever language you need, e.g. lang=eng. Save and run the bash script.

The script does the following:

for each image file
    runs tesseract on the image file to produce hOCR output
    runs hocr-extract-images to split the image into line images, each with the OCRed text for that line
done
renames the generated text files from *.txt to *.gt.txt

The correct rename command is the following (. instead of - in the filename):
rename s/exp0.txt/exp0.gt.txt/ ./myfiles/*exp0.txt

After this the *.gt.txt files need to be manually corrected to match the line images.
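To give a rough picture of what the line-splitting step works from, here is a small sketch that reads line bounding boxes and text out of hOCR. It is not the hocr-tools implementation. The class and function names are invented for the example, and it only looks at spans with class ocr_line, so lines that tesseract marks with other classes (e.g. ocr_caption) are ignored.

```python
import re
from html.parser import HTMLParser


class HocrLines(HTMLParser):
    """Collect (bbox, text) for every <span class='ocr_line'> in an hOCR file."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished lines: ((x0, y0, x1, y1), text)
        self._depth = 0    # span nesting depth inside the current line, 0 = outside
        self._bbox = None
        self._words = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._depth:
            self._depth += 1          # a word span nested inside the line
        elif "ocr_line" in (attrs.get("class") or "").split():
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            self._bbox = tuple(map(int, m.groups())) if m else None
            self._depth = 1
            self._words = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:      # the line span itself was closed
                self.lines.append((self._bbox, " ".join(self._words)))

    def handle_data(self, data):
        if self._depth and data.strip():
            self._words.append(data.strip())


def hocr_lines(hocr_text: str):
    parser = HocrLines()
    parser.feed(hocr_text)
    return parser.lines


sample = """<div class='ocr_page' title='bbox 0 0 600 800'>
<span class='ocr_line' id='line_1_1' title="bbox 36 92 580 122; baseline 0 -5">
<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 122; x_wconf 93'>Hello</span>
<span class='ocrx_word' id='word_1_2' title='bbox 110 92 180 122; x_wconf 90'>world</span>
</span>
<span class='ocr_line' id='line_1_2' title="bbox 36 130 400 160">
<span class='ocrx_word' id='word_1_3' title='bbox 36 130 120 160'>second</span>
<span class='ocrx_word' id='word_1_4' title='bbox 130 130 200 160'>line</span>
</span>
</div>"""

for bbox, text in hocr_lines(sample):
    print(bbox, text)
```

Each bounding box is the region cropped from the page to produce one line image, and the text is what ends up in the file that you then correct by hand.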

rraina97 · 2020-06-02T05:45:21Z

Thank you. This has solved some issues, but a problem still persists. I'm attaching a screenshot. Please look into the matter. @Shreeshrii

Shreeshrii · 2020-06-02T06:09:35Z

Do you have tesseract and hocr-tools installed correctly?

It is not finding the hocr config file. Is your TESSDATA_PREFIX directory set up correctly?

Are hocr-tools working fine?

Change the paths based on your setup.

rraina97 · 2020-06-02T06:41:07Z

I installed hocr-tools using "sudo pip3 install hocr-tools". As for tesseract, I cloned the tesstrain repo and used make leptonica tesseract, since I had to train tesseract manually on data.
I guess my tessdata_prefix is fine. It is ./usr/share/tessdata.
I tried several steps but am not able to run it correctly. @Shreeshrii

Shreeshrii · 2020-06-02T06:46:22Z

Take one image file. Run tesseract on it, see if you get text output. Try again with pdf at end of command and see if you get pdf output. Then try with hocr.

Similarly test the hocr-tools. Check that you can run the hocr-extract-images command.
If you have installed it, then you may not need ./ before command.

Once you can do this for one file, use the appropriate commands in a for loop for all files.

Shreeshrii · 2020-06-02T06:48:27Z

./usr/share/tessdata

Check the files and folders in that directory. Do you have a newer set of files under /usr/share/tessdata/4.00?

rraina97 · 2020-06-02T13:19:59Z

Both tesseract and hocr-tools are working.
So, I manually ran tesseract with the hocr option on an image file "img.png" and it gave me the output "img.hocr".
Now, to extract line data from this, I ran hocr-extract-images but got an error, which I have attached below. Please help. @Shreeshrii

kba · 2020-06-02T13:27:16Z

@rraina97 Please open an issue at https://github.com/tmbdev/hocr-tools for help on invoking hocr-extract-images to keep this issue uncluttered.

It looks like img.hocr is not in the current directory. Make sure you are in the right location, i.e. ls img.hocr is successful.

prasad01dalavi · 2020-08-31T06:42:13Z

In my case, I made some minor changes to @Shreeshrii's script to make it run. I put the page image file in myfiles and ran the script with bash generate_training_data.sh

Created a virtualenv (training_env)
sudo apt-get install hocr-tools
sudo apt-get install rename

#!/bin/bash
SOURCE="./myfiles/"
lang=eng
set -- "$SOURCE"*.jpg
for img_file; do
    echo -e  "\r\n File: $img_file"
    OMP_THREAD_LIMIT=1 tesseract "${img_file}" "${img_file%.*}"  --psm 6  --oem 1  -l $lang -c page_separator='' hocr
    source training_env/bin/activate
    PYTHONIOENCODING=UTF-8 hocr-extract-images -b ./myfiles/ -p "${img_file%.*}"-%03d.exp0.tif  "${img_file%.*}".hocr 
    deactivate
done
rename s/exp0.txt/exp0-gt.txt/ ./myfiles/*exp0.txt
echo "Image files converted to tif. Correct the ground truth files and then run ocr-d train to create box and lstmf files"

Special thanks to Shreeshrii!

sahrawat · 2020-11-26T17:43:36Z

Nitpick:

rename s/exp0.txt/exp0-gt.txt/ ./myfiles/*exp0.txt

should be

rename s/exp0.txt/exp0.gt.txt/ ./myfiles/*exp0.txt
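A side note on the expression itself: in a Perl-style regex an unescaped "." matches any character, so s/exp0.txt/.../ is slightly looser than it looks, although that is harmless for these filenames. For systems without the Perl rename tool (e.g. Windows), the same renaming can be sketched in Python. The function name and the default suffixes below are assumptions for this example.

```python
from pathlib import Path
import tempfile


def rename_to_gt(folder, suffix=".exp0.txt", gt_suffix=".exp0.gt.txt"):
    """Rename <name>.exp0.txt to <name>.exp0.gt.txt inside folder.

    Returns the new file names. Existing targets are left alone so that
    ground truth which was already corrected is never overwritten.
    """
    renamed = []
    for path in sorted(Path(folder).glob("*" + suffix)):
        target = path.with_name(path.name[: -len(suffix)] + gt_suffix)
        if target.exists():
            continue
        path.rename(target)
        renamed.append(target.name)
    return renamed


# Small demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("page-001.exp0.txt", "page-002.exp0.txt", "page-001.exp0.tif"):
        (Path(tmp) / name).write_text("dummy", encoding="utf-8")
    print(rename_to_gt(tmp))
    print(sorted(p.name for p in Path(tmp).iterdir()))
```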

kba · 2020-11-26T17:49:24Z

Note that @M3ssman has proposed a set of python scripts to generate line image/text pairs from PAGE and ALTO in #205.

bertzi87 · 2022-03-13T11:31:28Z

For a simpler and more efficient alternative to the bash script that @Shreeshrii posted above (2018-09-09), I recommend GNU parallel. The script then becomes two lines. First generate the hocr files:
parallel --bar -j 4 'OMP_THREAD_LIMIT=1 tesseract {} {/.} --psm 4 --oem 1 -l eng hocr' ::: *.png
Then extract the tif/txt pairs:
parallel --bar -j 4 'hocr-extract-images {} -p {/.}-%03d.tif' ::: *.hocr

This is even faster (around 10% for me) if you recompile tesseract without OpenMP (./configure --disable-openmp).

whisere · 2022-04-28T05:07:00Z

If we only have page images and page ground truth text, can we use them to train tesseract instead of line images and line ground truth? I imagine page images/text are closer to the tesseract input/output format?

wrznr · 2022-04-28T06:04:52Z

@whisere The question is whether your page images/texts are aligned on line-level. I.e. for each text line the coordinates of the corresponding part of the page image have to be annotated. If not, training Tesseract with your data is not possible.

whisere · 2022-04-28T06:16:22Z

Thanks. That's not good: there is no text line information in the page texts at all, only multiple blocks with <p>.

whisere · 2022-04-29T01:46:47Z

How about block images and block ground truth text?

wrznr · 2022-04-29T06:52:16Z

You would have to align them manually or semi-automatically on the line level (i.e. you could try to OCR the images to get the line segmentation and then heuristically match the text to the lines). Tesseract text recognition has to be trained on the level of lines. No other way (cf. e.g. https://ieeexplore.ieee.org/abstract/document/6628705).
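One way to picture the heuristic matching is a word-level sliding window over the unsplit ground truth text, which also avoids matching partial words. The sketch below is only an illustration under assumptions: the function name and the `slack` parameter are made up, and difflib's similarity ratio stands in for a proper edit distance.

```python
from difflib import SequenceMatcher


def best_window(ocr_line: str, gt_words, slack: int = 2):
    """Find the run of ground truth words most similar to one OCR line.

    Tries every start word and window lengths within `slack` words of the
    OCR line's own word count. Returns (similarity, start, end) so that
    gt_words[start:end] is the best candidate.
    """
    n = len(ocr_line.split())
    best = (0.0, 0, 0)
    for start in range(len(gt_words)):
        for length in range(max(1, n - slack), n + slack + 1):
            end = start + length
            if end > len(gt_words):
                break
            candidate = " ".join(gt_words[start:end])
            score = SequenceMatcher(None, ocr_line, candidate, autojunk=False).ratio()
            if score > best[0]:
                best = (score, start, end)
    return best


gt_words = "the quick brown fox jumps over the lazy dog and runs away".split()
score, start, end = best_window("jumps ovcr the lazy", gt_words)
print(round(score, 2), " ".join(gt_words[start:end]))
```

A low best score suggests a segmentation error in the OCR line, so such lines are better aligned by hand.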

whisere · 2022-04-29T07:05:46Z

Many thanks for the information!

ssandrews · 2023-12-01T21:40:42Z

This is a helpful script. Thank you. However, it ends with "run ocr-d train to create box and lstmf files". Can someone tell me how to do this? Thanks.

kba mentioned this issue Aug 23, 2018

Adding box file for whole page results a very high Error rate #21

Closed

wrznr added the enhancement New feature or request label Sep 20, 2018

wrznr self-assigned this Sep 20, 2018

kba added this to Backlog in coordinate Oct 29, 2018

Shreeshrii mentioned this issue Jan 17, 2019

Fine Tuning Leads to Segmentation Issue tesseract-ocr/tesseract#2132

Open

stweil removed this from Backlog in coordinate Sep 2, 2019

L1800Turbo mentioned this issue Dec 9, 2019

Training parts lists #131

Closed

lnutimura mentioned this issue Mar 7, 2020

LSTM: Training - Image not trainable tesseract-ocr/tesseract#590

Open

srdg added a commit to srdg/unarchivingbengali that referenced this issue Dec 12, 2020

Create line_hocr.sh

3a46a68

Courtesy of [@Shreeshrii](tesseract-ocr/tesstrain#7 (comment))

Marco-Parente mentioned this issue Jan 31, 2022

[Question] How to generate tif line images from tif pages / How to train with no specified language? #302

Closed

jbarth-ubhd mentioned this issue Feb 16, 2023

Ground truth: spaces before and after text? #335

Open
