Skip to content
This repository was archived by the owner on Nov 2, 2025. It is now read-only.

Astromis/TextExtractor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

41 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

TextExtractor

Overview

This program unites some pieces of open sourced tools to provide the way for extracting of raw text from

  • djvu
  • docx
  • pdf

An extraction consists of two steps: if text layer exists then extract it. If doesn't, apply the OCR by the Tesseract engine. Extracted text is being written in text file.

Requirement libraries and installation:

List of libraries:

sudo apt-get install libzip-dev libpoppler-cpp-dev libdjvulibre-dev libtesseract-dev libleptonica-dev libtiff-dev

After that, jsut type make in the root directory. If there is fatal error "No such files or directory", make sure the libraries heades has installed in /usr/include/ or you can change that in Makefile.

You, also, need the data for the Tesseract. You can find it in tessdata directory of the repo or you can download it from official wiki on github here. Having done donwloading, you can put it in

/usr/share/tesseract-ocr/tessdata/

or you can set the environmental variable on arbitrary directory:

export TESSDATA_PREFIX=/path/to/downloaded/data

Helpful tips for working with Tesseract-cli

  • -l -> specify the language
  • -c tessedit_write_images=true -> creates a result of preprocessed image.

Further work

  • Add supporting doc files
  • Add spellcheking and post OCR processing

About

Programm that incorporates methods of extracting text data from different formats.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published