This program recognizes text in images. The recognizer is based on Tesseract OCR and the JamSpell spelling corrector.
To install, you must have Docker on your computer.
pip install -r requirements.txt
docker-compose pull
In the config file, the parameter "language" defines which language is recognized in the images.
If the desired language is not English or Russian, then one should also modify the Dockerfile and rebuild the service.
Languages are mostly identified by the three-letter language codes that Tesseract accepts.
The list of supported languages is not easy to find on the Tesseract site, so a related question on askubuntu.com is another useful reference.
Default language is English.
The JamSpell text corrector needs a trained language model. Some models can be downloaded, and a model can also be trained on a text collection. Once the model is ready, put the absolute path to it (e.g. /home/user/en.bin) in the config file.
The program processes all images in the database.
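The two config settings described above can be sanity-checked before starting the workers. The following is only a sketch: the config is assumed to be available as a Python dict, the key name `model_path` is hypothetical (only "language" is named in this README), and `check_config` is not part of the project. It relies on the fact that Tesseract joins several languages with "+", for example "eng+rus".

```python
import os
import re


def check_config(config):
    """Return a list of problems found in a config dict (empty if none).

    Assumed keys: "language" (from this README) and "model_path"
    (a hypothetical name for the JamSpell model location).
    """
    problems = []

    # English is the default; "eng" is the Tesseract code for it.
    language = config.get("language", "eng")
    # Several languages may be combined with "+", e.g. "eng+rus".
    for code in language.split("+"):
        # Most codes are three letters; a few carry a suffix (chi_sim).
        if not re.fullmatch(r"[a-z]{3}(_[a-z]+)?", code):
            problems.append(f"suspicious language code: {code!r}")

    model_path = config.get("model_path", "")
    if not os.path.isabs(model_path):
        problems.append(f"model path is not absolute: {model_path!r}")
    elif not os.path.isfile(model_path):
        problems.append(f"model file not found: {model_path!r}")

    return problems
```

Calling `check_config({"language": "eng+rus", "model_path": "/home/user/en.bin"})` returns an empty list only when the model file actually exists at that path.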
Start docker containers with workers, queue and db:
docker-compose up
Put your images in the data directory. To add the images to the database, run:
python text_extractor.py load
Images can be re-added to the database; they will be treated as new and unhandled.
To process all unhandled images:
python text_extractor.py process
To clear database:
python text_extractor.py clear
You can change the number of workers for the OCR and the corrector with the commands:
docker-compose scale pytesseractocr=3
And
docker-compose scale jamspell=3
3 is an example.
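In newer Docker Compose releases the standalone `scale` command is deprecated in favor of `up --scale`. As a hedged sketch, the helper below builds such a command line for both services at once; `scale_command` is a hypothetical helper, not part of this project, and it only assembles the argument list without running anything.

```python
def scale_command(worker_counts):
    """Build a docker-compose argument list that scales several services.

    worker_counts maps a service name from docker-compose.yml
    (e.g. "pytesseractocr", "jamspell") to the desired number of workers.
    """
    args = ["docker-compose", "up", "-d"]
    for service, count in worker_counts.items():
        if count < 1:
            raise ValueError(f"{service}: worker count must be at least 1")
        args += ["--scale", f"{service}={count}"]
    return args
```

The resulting list can be passed to `subprocess.run(..., check=True)` from the directory that contains docker-compose.yml.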
Time to process a dataset of 411 images:
OCR workers | corrector workers | time, s |
---|---|---|
1 | 1 | 65.3 |
2 | 2 | 36.5 |
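To read the table in terms of scaling, the sketch below derives throughput, speedup and parallel efficiency from the two measured rows. The figures are taken from the table; the function name `summarize` is only illustrative.

```python
IMAGES = 411
# Workers per service -> total processing time in seconds (from the table).
RUNS = {1: 65.3, 2: 36.5}


def summarize(runs, images):
    """Return (workers, images_per_second, speedup, efficiency) per run.

    Speedup is measured against the smallest worker count in `runs`;
    efficiency is speedup divided by the increase in worker count.
    """
    base_workers = min(runs)
    base_time = runs[base_workers]
    rows = []
    for workers, seconds in sorted(runs.items()):
        speedup = base_time / seconds
        efficiency = speedup / (workers / base_workers)
        rows.append((workers, images / seconds, speedup, efficiency))
    return rows


for workers, rate, speedup, efficiency in summarize(RUNS, IMAGES):
    print(f"{workers} workers: {rate:.1f} images/s, "
          f"speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

Doubling both worker counts gives a speedup of about 1.79, which is roughly 89% parallel efficiency.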
Dataset description. Download.
The special one-day Apple shopping Ya
This Friday.
Ss
Meaningful Moments
With the sorrow of living so great, the sorrow of punishment had to be pit- less. We lived for the day and died for it. When there was reason and desire to punish we wrote our lesson with gun or whip immediately in the sullen flesh of the sufferer, and the case was beyond appeal. The desert did not afford the refined slow penalties of courts and gaols.
Ofcourse our rewards and pleasures were as suddenly sweeping as our troubles; but, to me in particular, they bulked less large. Bedouin ways were hard even for those brought up to them, and for strangers terrible: a death in life. When the march or labour ended I had no energy to record sensation, nor while it lasted any leisure to see the spiritual loveliness which sometimes came upon us by the way. In my notes, the cruel rather than the beautiful found place. We no doubt enjoyed more the rare moments of peace and forgetfulness; but I remember more the agony, the terrors, and the mistakes. Our life is not summed up in what Have written (there are things not to be repeated in cold blood for very shame); but what I have written was in and of our life. Pray God that men reading the story will not, for love of the glamour of strangeness, go out to prostitute themselves and their talents in serving another race.
Ly
(I.E. Lawrence, Seven Pillars of Wisdom)
To rebuild a service defined in docker-compose.yml (e.g. so that changes to the source code take effect), use this command:
docker-compose build --no-cache <service_name>