TextExtractor

Overview

This program combines several open-source tools to extract raw text from the following formats:

djvu
docx
pdf

Extraction works in two stages: if the document has a text layer, that layer is extracted directly. If it does not, OCR is applied using the Tesseract engine. The extracted text is written to a text file.
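The fallback logic described above can be sketched as follows. This is only an illustration, not the project's actual code: the names `extract_text`, `TextLayerReader` and `OcrRunner` are hypothetical, and treating a whitespace-only text layer as missing is an assumption made for this example.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical callback types; the real extractors in this repo may look different.
// A reader returns std::nullopt when the document has no text layer.
using TextLayerReader =
    std::function<std::optional<std::string>(const std::string& path)>;
using OcrRunner = std::function<std::string(const std::string& path)>;

// Returns the embedded text layer when one exists and is not blank,
// otherwise falls back to OCR.
std::string extract_text(const std::string& path,
                         const TextLayerReader& read_text_layer,
                         const OcrRunner& run_ocr) {
    std::optional<std::string> layer = read_text_layer(path);
    // Assumption: a layer containing only whitespace counts as "no text layer".
    if (layer && layer->find_first_not_of(" \t\r\n") != std::string::npos) {
        return *layer;  // stage 1: a usable text layer exists
    }
    return run_ocr(path);  // stage 2: no text layer, so run OCR
}
```

Passing the two stages in as callbacks keeps the decision separate from any particular format, so the same function could in principle serve the djvu and pdf cases alike.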

Required libraries and installation:

Install the required libraries:

sudo apt-get install libzip-dev libpoppler-cpp-dev libdjvulibre-dev libtesseract-dev libleptonica-dev libtiff-dev

After that, just type make in the root directory. If the build stops with the fatal error "No such file or directory", make sure the library headers are installed in /usr/include/, or change the include path in the Makefile.

You also need the language data for Tesseract. You can find it in the tessdata directory of the repo, or download it from the official wiki on GitHub. Once downloaded, put it in

/usr/share/tesseract-ocr/tessdata/

or set an environment variable pointing to an arbitrary directory:

export TESSDATA_PREFIX=/path/to/downloaded/data
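A program that needs to know where the data lives could follow the same two options. The sketch below is a minimal illustration, assuming the lookup order "environment variable first, default directory second"; `resolve_tessdata_dir` and `kDefaultTessdataDir` are hypothetical names, and Tesseract's own interpretation of TESSDATA_PREFIX may differ between versions.

```cpp
#include <cstdlib>
#include <string>

// Default location named in this README.
const std::string kDefaultTessdataDir = "/usr/share/tesseract-ocr/tessdata/";

// Hypothetical helper: returns TESSDATA_PREFIX when it is set and non-empty,
// otherwise the default directory above.
std::string resolve_tessdata_dir() {
    const char* prefix = std::getenv("TESSDATA_PREFIX");
    if (prefix != nullptr && prefix[0] != '\0') {
        return prefix;
    }
    return kDefaultTessdataDir;
}
```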

Helpful tips for working with Tesseract-cli

-l -> specifies the recognition language
-c tessedit_write_images=true -> saves the preprocessed image to a file.
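To show how these options fit into a full command line, here is a small sketch that assembles the arguments for an invocation of the form `tesseract <image> <output_base> -l <lang>`. The function name `build_tesseract_args` and the file names in the usage note are made up for illustration; nothing here is taken from the project's sources.

```cpp
#include <string>
#include <vector>

// Builds the argument list for:
//   tesseract <image> <output_base> -l <lang> [-c tessedit_write_images=true]
std::vector<std::string> build_tesseract_args(const std::string& image,
                                              const std::string& output_base,
                                              const std::string& lang,
                                              bool write_images) {
    std::vector<std::string> args = {"tesseract", image, output_base, "-l", lang};
    if (write_images) {
        // Ask Tesseract to also save the preprocessed image.
        args.push_back("-c");
        args.push_back("tessedit_write_images=true");
    }
    return args;
}
```

For example, `build_tesseract_args("page.tif", "page", "eng", true)` corresponds to the command `tesseract page.tif page -l eng -c tessedit_write_images=true`.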

Further work

Add support for doc files
Add spellchecking and post-OCR processing


Astromis/TextExtractor
