This program combines several open-source tools to extract raw text from the following formats:
- djvu
- docx
Extraction takes two steps. If the document has a text layer, it is extracted directly. If it does not, OCR is applied using the Tesseract engine. The extracted text is written to a text file.
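The text-layer-first, OCR-as-fallback flow can be sketched with command-line tools. This is not this program's own code (the program links against the libraries directly); it is a minimal illustration for djvu input, assuming the `djvutxt` and `ddjvu` tools from DjVuLibre and the `tesseract` command-line program are installed. The function name `extract_djvu_text` and the "non-whitespace output means a usable text layer" test are assumptions made for the example.

```shell
# Hypothetical helper: extract_djvu_text INPUT.djvu OUTPUT.txt [LANG]
extract_djvu_text() {
    input=$1
    output=$2
    lang=${3:-eng}

    # Step 1: dump the embedded text layer, if there is one.
    djvutxt "$input" > "$output" 2>/dev/null

    # Assumption: any non-whitespace character means the layer is usable.
    if [ -n "$(tr -d '[:space:]' < "$output")" ]; then
        return 0
    fi

    # Step 2: no text layer, so render the pages to TIFF and run OCR.
    tmpdir=$(mktemp -d) || return 1
    if ddjvu -format=tiff "$input" "$tmpdir/pages.tiff"; then
        # "stdout" as the output base makes tesseract print the text.
        tesseract "$tmpdir/pages.tiff" stdout -l "$lang" > "$output"
        status=$?
    else
        status=1
    fi
    rm -rf "$tmpdir"
    return "$status"
}
```

For example, `extract_djvu_text book.djvu book.txt eng` would write the text of a (hypothetical) book.djvu to book.txt, using OCR only when the text layer is missing or empty.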
Install the required libraries:
sudo apt-get install libzip-dev libpoppler-cpp-dev libdjvulibre-dev libtesseract-dev libleptonica-dev libtiff-dev
After that, just type make in the root directory. If the build stops with the fatal error "No such file or directory", make sure the library headers are installed in /usr/include/, or change the include path in the Makefile.
You also need the language data for Tesseract. You can find it in the tessdata directory of the repo, or download it from the official Tesseract wiki on GitHub. After downloading, put it in
/usr/share/tesseract-ocr/tessdata/
or set the TESSDATA_PREFIX environment variable to a directory of your choice:
export TESSDATA_PREFIX=/path/to/downloaded/data
Options:
- -l -> specifies the language
- -c tessedit_write_images=true -> saves the preprocessed image
To do:
- Add support for doc files
- Add spellchecking and post-OCR processing
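To see which language codes are available for the -l option, the tessdata location described above can be checked from the shell. The sketch below is only an illustration: the helper names `resolve_tessdata_dir` and `list_tessdata_langs` are hypothetical, and it assumes that TESSDATA_PREFIX points directly at the directory holding the *.traineddata files (some older Tesseract versions expect the parent directory instead).

```shell
# Hypothetical helper: print the directory where language data is expected.
resolve_tessdata_dir() {
    # Assumption: TESSDATA_PREFIX, when set, is the folder that directly
    # contains the *.traineddata files; otherwise use the default path.
    printf '%s\n' "${TESSDATA_PREFIX:-/usr/share/tesseract-ocr/tessdata}"
}

# Hypothetical helper: list the language codes found in that directory.
list_tessdata_langs() {
    dir=$(resolve_tessdata_dir)
    for f in "$dir"/*.traineddata; do
        [ -e "$f" ] || continue    # glob matched nothing: no data installed
        basename "$f" .traineddata
    done
}
```

Running `list_tessdata_langs` prints one code per line; an empty result suggests the data has not been downloaded or the variable points to the wrong place.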