add langtests for Devanagari and Sanskrit
Shreeshrii committed Sep 4, 2018
1 parent e4b9cff commit be443a5
Showing 18 changed files with 159 additions and 81 deletions.
86 changes: 21 additions & 65 deletions langtests/README.md
@@ -1,85 +1,41 @@
# How to run Language tests.

The scripts in this directory make it possible to test Accuracy of Tesseract
for different languages.

# Language tests.
The scripts in this directory make it possible to test Accuracy of Tesseract for different languages.
## Setup
### Step 1: If not already installed, download the modified ISRI toolkit,
make and install the tools in /usr/local/bin.

```
git clone https://github.com/Shreeshrii/ocr-evaluation-tools.git
cd ocr-evaluation-tools
sudo make install
```

### Step 2: If not already installed, build tesseract.

### Step 2: If not already built, build tesseract.
Use binaries from the tesseract/src/api and tesseract/src/training directory.
### Step 3
Download images and corresponding ground truth text for the language to be tested.
Each testset should contain only one image type (e.g. tif, png or jpg).
Each ground truth text file should have the same base filename as its image, with a txt extension.
As needed, modify the filenames and create the `pages` file for each testset.
Instructions for testing Fraktur and Sanskrit languages are given below as an example.
## Testing for Fraktur - frk and script/Fraktur

### Step 3: download the images and groundtruth

### Download the images and groundtruth, modify to required format.
```
mkdir -p ~/lang-downloads
cd ~/lang-downloads
wget -O frk-jbarth-ubhd.zip http://digi.ub.uni-heidelberg.de/diglitData/v/abbyy11r8-vs-tesseract4.zip
wget -O frk-stweil-gt.zip https://digi.bib.uni-mannheim.de/~stweil/fraktur-gt.zip
bash -x frk_setup.sh
```

### Step 4: extract the files.
It doesn't really matter where in your filesystem you put them,
but they must go under a common root, for example, ~/lang-files

### Run tests for Fraktur - frk and script/Fraktur
```
mkdir -p ~/lang-files
cd ~/lang-files
unzip ~/lang-downloads/frk-jbarth-ubhd.zip -d frk
unzip ~/lang-downloads/frk-stweil-gt.zip -d frk
mkdir -p ./frk-ligatures
cp ./frk/abbyy-vs-tesseract/*.tif ./frk-ligatures/
cp ./frk/gt/*.txt ./frk-ligatures/
cd ./frk-ligatures/
ls -1 *.tif >pages
sed -i -e 's/.tif//g' pages
cat pages
bash -x frk_test.sh
```

## Testing for Sanskrit - san and script/Devanagari
### Download the images and groundtruth, modify to required format.
```
mkdir -p ~/lang-stopwords
cd ~/lang-stopwords
wget -O frk.stopwords.txt https://raw.githubusercontent.com/stopwords-iso/stopwords-de/master/stopwords-de.txt
bash -x deva_setup.sh
```
Edit ~/lang-files/stopwords/frk.stopwords.txt as
wordacc uses a space delimited stopwords file, not line delimited.

### Run tests
```
sed -i -e 's/\n/ /g' frk.stopwords.txt
cat frk.stopwords.txt
```
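
Note on the `sed` call in the block above: sed reads its input one line at a time with the newline already stripped, so `s/\n/ /g` never matches and the file stays line delimited. A minimal sketch of the intended conversion using `tr` instead; `join_stopwords` and the demo file names are made up for illustration and are not part of langtests.

```shell
#!/bin/bash
# Sketch: join a line-delimited stopwords file into one space-delimited line.
# join_stopwords is a hypothetical helper, not a langtests script.
join_stopwords() {
  local src=$1 dst=$2
  # tr works on the raw byte stream, so it can replace newlines directly.
  # The echo restores a final newline; sed then drops the trailing space.
  { tr '\n' ' ' < "$src"; echo; } | sed -e 's/ *$//' > "$dst"
}

# Demo on a throwaway file.
tmp=$(mktemp -d)
printf 'aber\nalle\nals\n' > "$tmp/lines.txt"
join_stopwords "$tmp/lines.txt" "$tmp/spaces.txt"
cat "$tmp/spaces.txt"
```

Writing to a separate output file avoids truncating the input while `tr` is still reading it.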

### Step 5: run langtests/runlangtests.sh with the root ISRI data dir, testname, tessdata-dir, language code:

```
cd ~/tesseract
langtests/runlangtests.sh ~/lang-files 4_fast_Fraktur ../tessdata_fast/script Fraktur
langtests/runlangtests.sh ~/lang-files 4_fast_frk ../tessdata_fast frk
langtests/runlangtests.sh ~/lang-files 4_best_int_frk ../tessdata frk
langtests/runlangtests.sh ~/lang-files 4_best_frk ../tessdata_best frk
langtests/runlangtests.sh ~/lang-files 4_shreetest_frk-Fraktur /home/ubuntu/tessdata_frk/frk-finetune-impact frk
langtests/runlangtests.sh ~/lang-files 4_shreetest_frk-frk /home/ubuntu/tessdata_frk/frk-finetune-frk frk
```
and go to the gym, have lunch etc. It takes a while to run.

### Step 6: There should be a RELEASE.summary file
*langtests/reports/4-beta_fast.summary* that contains the final summarized accuracy

bash -x deva_test.sh
```

#### Notes from Nick White regarding wordacc
### Notes from Nick White regarding wordacc

If you just want to remove all lines which have 100% recognition,
you can add an 'awk' command like this:
2 changes: 1 addition & 1 deletion langtests/counttestset.sh
@@ -39,7 +39,7 @@ do
else
srcdir="$imdir"
fi
echo "$srcdir/$page.tif"
echo "$srcdir/$page"
# Count character errors.
ocrevalutf8 accuracy "$srcdir/$page.txt" "$resdir/$page.txt" > "$resdir/$page.acc"
accfiles="$accfiles $resdir/$page.acc"
18 changes: 18 additions & 0 deletions langtests/deva_setup.sh
@@ -0,0 +1,18 @@
#!/bin/bash
#
mkdir -p ~/lang-files
rm -rf ~/lang-files/san-*
for testset in vedic fontsamples oldstyle shreelipi alphabetsamples
do
cd ~/lang-files
mkdir -p ./san-$testset
cp ~/lang-deva-downloads/imagessan/$testset/*.* ./san-$testset/
cd ./san-$testset/
rename s/-gt.txt/.txt/ *.txt
ls -1 *.png >pages
sed -i -e 's/.png//g' pages
done

mkdir -p ~/lang-stopwords
cd ~/lang-stopwords
cp ~/lang-deva-downloads/imagessan/stopwords.txt ./san.stopwords.txt
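
Note on the `pages` step in the script above: in `sed -e 's/.png//g'` the dot is a regex wildcard and the match is not anchored, so a base name that itself contains `png` could be altered. A sketch of the same step with shell parameter expansion, which strips only the trailing extension; `make_pages` is a hypothetical helper and the demo file names are invented.

```shell
#!/bin/bash
# Sketch: write a `pages` file listing image base names without extension.
# make_pages is a hypothetical helper, not a langtests script.
make_pages() {
  local dir=$1 ext=$2 f
  for f in "$dir"/*."$ext"; do
    [ -e "$f" ] || continue        # skip the literal pattern when nothing matches
    f=${f##*/}                     # drop the directory part
    printf '%s\n' "${f%."$ext"}"   # drop only the trailing ".ext"
  done > "$dir/pages"
}

# Demo in a throwaway directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.png" "$demo_dir/b.v2.png" "$demo_dir/a.txt"
make_pages "$demo_dir" png
cat "$demo_dir/pages"
```

Passing the extension as an argument mirrors the `imgext` parameter this commit adds to `runlangtests.sh`.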
18 changes: 18 additions & 0 deletions langtests/deva_test.sh
@@ -0,0 +1,18 @@
#!/bin/bash
# run langtests/runlangtests.sh with the root data dir, testname, tessdata-dir, language code and image extension

cd ~/tesseract

langtests/runlangtests.sh ~/lang-files 4_fast_Devanagari ../tessdata_fast/script Devanagari png
langtests/runlangtests.sh ~/lang-files 4_best_int_Devanagari ../tessdata/script Devanagari png
langtests/runlangtests.sh ~/lang-files 4_best_Devanagari ../tessdata_best/script Devanagari png
langtests/runlangtests.sh ~/lang-files 4_fast_san ../tessdata_fast san png
langtests/runlangtests.sh ~/lang-files 4_best_int_san ../tessdata san png
langtests/runlangtests.sh ~/lang-files 4_best_san ../tessdata_best san png

langtests/runlangtests.sh ~/lang-files 4_plus40k_san ../tesstutorial-deva san png

#/home/ubuntu/tesstutorial-deva/san.traineddata at n iterations

### It takes a while to run.

8 changes: 4 additions & 4 deletions langtests/frk_test.sh
@@ -3,11 +3,11 @@
# run langtests/runlangtests.sh with the root ISRI data dir, testname, tessdata-dir, language code:

cd ~/tesseract
langtests/runlangtests.sh ~/lang-files 4_fast_Fraktur ../tessdata_fast/script Fraktur
langtests/runlangtests.sh ~/lang-files 4_fast_Fraktur ../tessdata_fast/script Fraktur tif

langtests/runlangtests.sh ~/lang-files 4_fast_frk ../tessdata_fast frk
langtests/runlangtests.sh ~/lang-files 4_best_int_frk ../tessdata frk
langtests/runlangtests.sh ~/lang-files 4_best_frk ../tessdata_best frk
langtests/runlangtests.sh ~/lang-files 4_fast_frk ../tessdata_fast frk tif
langtests/runlangtests.sh ~/lang-files 4_best_int_frk ../tessdata frk tif
langtests/runlangtests.sh ~/lang-files 4_best_frk ../tessdata_best frk tif

### It takes a while to run.

8 changes: 8 additions & 0 deletions langtests/reports/4_best_Devanagari.summary
@@ -0,0 +1,8 @@
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_Devanagari san-alphabetsamples 2013 56.17% 1323 12.27% 1323 12.27 606.28s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_Devanagari san-fontsamples 388 94.82% 87 86.38% 87 86.38 570.17s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_Devanagari san-oldstyle 2796 59.93% 523 39.61% 523 39.61 447.73s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_Devanagari san-shreelipi 830 94.01% 311 81.40% 311 81.40 1137.51s
8 changes: 8 additions & 0 deletions langtests/reports/4_best_int_Devanagari.summary
@@ -0,0 +1,8 @@
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_int_Devanagari san-alphabetsamples 2010 56.24% 1321 12.40% 1321 12.40 556.26s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_int_Devanagari san-fontsamples 396 94.72% 89 86.07% 89 86.07 524.07s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_int_Devanagari san-oldstyle 2812 59.70% 523 39.61% 523 39.61 416.57s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_int_Devanagari san-shreelipi 829 94.01% 314 81.22% 314 81.22 1087.02s
2 changes: 1 addition & 1 deletion langtests/reports/4_best_int_frk.summary
@@ -1,2 +1,2 @@
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_int_frk frk-ligatures 244 92.78% 109 79.63% 80 73.15 89.80s
4_best_int_frk frk-ligatures 244 92.78% 109 79.63% 80 73.15 367.73s
8 changes: 8 additions & 0 deletions langtests/reports/4_best_int_san.summary
@@ -0,0 +1,8 @@
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_int_san san-alphabetsamples 2342 49.01% 1353 10.28% 1353 10.28 281.60s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_int_san san-fontsamples 474 93.68% 126 80.28% 126 80.28 281.05s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_int_san san-oldstyle 3121 55.27% 602 30.48% 602 30.48 206.20s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_int_san san-shreelipi 1163 91.60% 417 75.06% 417 75.06 606.80s
8 changes: 8 additions & 0 deletions langtests/reports/4_best_san.summary
@@ -0,0 +1,8 @@
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_san san-alphabetsamples 2335 49.16% 1348 10.61% 1348 10.61 300.24s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_san san-fontsamples 473 93.69% 126 80.28% 126 80.28 267.05s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_san san-oldstyle 3121 55.27% 598 30.95% 598 30.95 205.28s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_best_san san-shreelipi 1168 91.56% 414 75.24% 414 75.24 610.52s
8 changes: 8 additions & 0 deletions langtests/reports/4_fast_Devanagari.summary
@@ -0,0 +1,8 @@
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_fast_Devanagari san-alphabetsamples 2017 56.09% 1317 12.67% 1317 12.67 400.38s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_fast_Devanagari san-fontsamples 433 94.22% 108 83.10% 108 83.10 287.48s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_fast_Devanagari san-oldstyle 2883 58.68% 543 37.30% 543 37.30 289.85s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_fast_Devanagari san-shreelipi 750 94.58% 279 83.31% 279 83.31 813.19s
8 changes: 8 additions & 0 deletions langtests/reports/4_fast_san.summary
@@ -0,0 +1,8 @@
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_fast_san san-alphabetsamples 2342 49.01% 1353 10.28% 1353 10.28 276.73s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_fast_san san-fontsamples 474 93.68% 126 80.28% 126 80.28 278.34s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_fast_san san-oldstyle 3121 55.27% 602 30.48% 602 30.48 222.35s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_fast_san san-shreelipi 1163 91.60% 417 75.06% 417 75.06 626.40s
8 changes: 8 additions & 0 deletions langtests/reports/4_plus10k_san.summary
@@ -0,0 +1,8 @@
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus10k_san san-alphabetsamples 1725 62.44% 1112 26.26% 1112 26.26 160.48s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus10k_san san-fontsamples 349 95.34% 73 88.58% 73 88.58 138.09s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus10k_san san-oldstyle 2818 59.62% 548 36.72% 548 36.72 120.83s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus10k_san san-shreelipi 746 94.61% 279 83.31% 279 83.31 292.70s
8 changes: 8 additions & 0 deletions langtests/reports/4_plus20k_san.summary
@@ -0,0 +1,8 @@
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus20k_san san-alphabetsamples 1441 68.63% 841 44.23% 841 44.23 156.57s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus20k_san san-fontsamples 356 95.25% 75 88.26% 75 88.26 135.13s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus20k_san san-oldstyle 2862 58.99% 555 35.91% 555 35.91 118.21s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus20k_san san-shreelipi 726 94.76% 267 84.03% 267 84.03 295.68s
8 changes: 8 additions & 0 deletions langtests/reports/4_plus30k_san.summary
@@ -0,0 +1,8 @@
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus30k_san san-alphabetsamples 1656 63.95% 937 37.86% 937 37.86 615.62s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus30k_san san-fontsamples 429 94.28% 89 86.07% 89 86.07 617.42s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus30k_san san-oldstyle 2885 58.66% 561 35.22% 561 35.22 432.58s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus30k_san san-shreelipi 447 96.77% 123 92.64% 123 92.64 1081.29s
8 changes: 8 additions & 0 deletions langtests/reports/4_plus40k_san.summary
@@ -0,0 +1,8 @@
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus40k_san san-alphabetsamples 1380 69.95% 775 48.61% 775 48.61 1198.16s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus40k_san san-fontsamples 401 94.65% 79 87.64% 79 87.64 1275.08s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus40k_san san-oldstyle 2860 59.01% 534 38.34% 534 38.34 977.65s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus40k_san san-shreelipi 441 96.81% 113 93.24% 113 93.24 2301.53s
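
Note on reading the reports above: each `.summary` file repeats the header row before every test set. A small sketch for viewing several reports as one table, keeping only the first header; `summary_rows` is a hypothetical helper, and the nine-column layout is assumed from the files in this commit.

```shell
#!/bin/bash
# Sketch: print one header plus all data rows from *.summary text on stdin.
# summary_rows is a hypothetical helper, not a langtests script.
summary_rows() {
  awk '
    $1 == "RELEASE" { if (!seen++) print; next }   # keep the first header only
    NF >= 9         { print }                      # data rows have nine fields
  '
}

# Demo with two rows copied from 4_plus40k_san.summary.
summary_rows <<'EOF'
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus40k_san san-fontsamples 401 94.65% 79 87.64% 79 87.64 1275.08s
RELEASE TestSet CharErrors Accuracy WordErrors Accuracy NonStopWErrors Accuracy TimeTaken
4_plus40k_san san-shreelipi 441 96.81% 113 93.24% 113 93.24 2301.53s
EOF
```

In a checkout this could be fed with something like `cat langtests/reports/4_*_san.summary | summary_rows`.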
17 changes: 11 additions & 6 deletions langtests/runlangtests.sh
@@ -1,12 +1,11 @@
#!/bin/bash
##############################################################################
# File: runalltests_spa.sh
# Description: Script to run a set of UNLV test sets for Spanish.
# File: runlangtests.sh
# Description: Script to run a set of accuracy test sets for any language.
# based on runalltests.sh by Ray Smith
# Author: Shree Devi Kumar
# Created: June 09, 2018
#
# (C) Copyright 2007, Google Inc.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
@@ -17,14 +16,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
##############################################################################
if [ $# -ne 4 ]
if [ $# -ne 5 ]
then
echo "Usage:$0 unlv-data-dir version-id tessdata-dir langcode"
echo "Usage:$0 unlv-data-dir version-id tessdata-dir langcode imgext"
exit 1
fi

tessdata=$3
lang=$4
imgext=$5

#timesum computes the total cpu time
timesum() {
@@ -51,6 +51,11 @@ if [ "$lang" = "frk" ] || [ "$lang" = "Fraktur" ]
then
testsets="frk-ligatures"
fi
if [ "$lang" = "san" ] || [ "$lang" = "Devanagari" ]
then
testsets="san-fontsamples san-oldstyle san-shreelipi san-alphabetsamples"
### testsets="san-fontsamples"
fi

totalerrs=0
totalwerrs=0
@@ -63,7 +68,7 @@ do
if [ -r "$imdir/$set/pages" ]
then
# Run tesseract on all the pages.
$bindir/runtestset.sh "$imdir/$set/pages" "$tessdata" $lang
$bindir/runtestset.sh "$imdir/$set/pages" "$tessdata" "$lang" "$imgext"
# Count the errors on all the pages.
$bindir/counttestset.sh "$imdir/$set/pages" $lang
# Get the new character word and nonstop word errors and accuracy.
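
Note on the language-to-testset mapping extended in the hunk above: the chain of `if` blocks could also be expressed as a single `case` statement, which gives one place to report an unknown code. This is only a sketch; `testsets_for` is a hypothetical name, and it covers just the two mappings visible in this diff, not any that sit in the collapsed part of `runlangtests.sh`.

```shell
#!/bin/bash
# Sketch: map a language or script code to its test sets with a case statement.
# testsets_for is a hypothetical helper; the set names come from the diff.
testsets_for() {
  case "$1" in
    frk|Fraktur)    echo "frk-ligatures" ;;
    san|Devanagari) echo "san-fontsamples san-oldstyle san-shreelipi san-alphabetsamples" ;;
    *)              echo "no test sets defined for '$1'" >&2; return 1 ;;
  esac
}

testsets_for Devanagari
```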
9 changes: 5 additions & 4 deletions langtests/runtestset.sh
@@ -15,16 +15,17 @@
# See the License for the specific language governing permissions and
# limitations under the License.

if [ $# -ne 3 ]
if [ $# -ne 4 ]
then
echo "Usage:$0 pagesfile tessdata-dir langcode "
echo "Usage:$0 pagesfile tessdata-dir langcode imgext"
exit 1
fi

tess="time -f %U -o times.txt ./src/api/tesseract"

tessdata=$2
langcode=$3
imgext=$4
pages=$1
imdir=${pages%/pages}
setname=${imdir##*/}
@@ -45,8 +46,8 @@ do
else
srcdir="$imdir"
fi
echo "$srcdir/$page.tif"
$tess "$srcdir/$page.tif" "$resdir/$page" --tessdata-dir $tessdata --oem 1 -l $langcode --psm 6 $config 2>&1 |grep -v "OCR Engine" |grep -v "Page 1"
echo "$srcdir/$page"
$tess "$srcdir/$page.$imgext" "$resdir/$page" --tessdata-dir $tessdata --oem 1 -l $langcode --psm 6 $config 2>&1 |grep -v "OCR Engine" |grep -v "Page 1"
if [ -r times.txt ]
then
read t <times.txt

2 comments on commit be443a5

@zdenop (Contributor) commented on be443a5 Nov 8, 2018

@Shreeshrii (Collaborator, Author)

These are similar to the unlvtests, which are for English.

They don't belong under unittest.

If you are moving unlvtests under the test repo, then you can move these too.
