This repository has been archived by the owner on May 7, 2024. It is now read-only.

KonnexionsGmbH/dcr


DCR - Document Content Recognition - README


Based on the paper "Unfolding the Structure of a Document using Deep Learning" (Rahman and Finin, 2019), this software project attempts to automatically recognise the structure of any pdf document and thus make it more searchable.

DCR enables batch processing of documents with the DCR-CORE library. Details of the DCR-CORE library can be found at https://konnexionsgmbh.github.io/dcr-core/. The documents to be processed are expected in a defined file directory. The processing result is made available either in a JSON file or in a PostgreSQL database.

Please see the Documentation for more detailed information.

1. Features

1.1 General

  • Support for documents in different languages - English, French, German and Italian as standard.

1.2 Preprocessor

  • Identifying scanned image pdf documents using PyMuPDF.
  • Converting scanned image pdf documents to a series of jpeg or png files using pdf2image and Poppler.
  • Converting bmp, gif, jp2, jpeg, png, pnm, tif, tiff or webp type documents to pdf format using Tesseract OCR.
  • Converting csv, docx, epub, html, odt, rst or rtf type documents to pdf format using Pandoc and TeX Live.
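
The preprocessing rules above amount to a dispatch on file type. The following is a minimal sketch of that idea, not DCR's own code: the step names (`extract_text`, `pdf2image`, `tesseract`, `pandoc`, `reject`) and the `has_text_layer` flag are hypothetical, and the flag is assumed to come from a separate check such as the PyMuPDF one mentioned above.

```python
from pathlib import Path

# Suffix groups copied from the feature list above. The step names returned
# below are hypothetical labels for this sketch, not DCR identifiers.
TESSERACT_SUFFIXES = {"bmp", "gif", "jp2", "jpeg", "png", "pnm", "tif", "tiff", "webp"}
PANDOC_SUFFIXES = {"csv", "docx", "epub", "html", "odt", "rst", "rtf"}


def choose_step(file_name: str, has_text_layer: bool = True) -> str:
    """Return the preprocessing step a file would need under the rules above."""
    suffix = Path(file_name).suffix.lower().lstrip(".")
    if suffix == "pdf":
        # Assumption: a pdf without an extractable text layer is a scanned image.
        return "extract_text" if has_text_layer else "pdf2image"
    if suffix in TESSERACT_SUFFIXES:
        return "tesseract"
    if suffix in PANDOC_SUFFIXES:
        return "pandoc"
    # Anything else is outside the supported document types.
    return "reject"
```

Keeping the suffix groups in plain sets makes the supported formats easy to compare against the list above; a caller would loop over the inbox directory and pass each file name to `choose_step`.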

1.3 Natural Language Processing (NLP)

  • Extracting text and metadata from pdf documents using PDFlib TET.
  • Categorisation of the lines in the document, e.g. as body, footer or header lines.
  • Determination of the token structure sentence by sentence with the help of spaCy.
  • Optional storage of the analysis result in either a PostgreSQL database or a JSON flat file.
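
To illustrate what line categorisation and JSON output could look like, here is a deliberately simple sketch. It is not the method DCR or DCR-CORE uses: the repetition heuristic, the function name `categorise_lines`, the `min_share` parameter and the output keys `text` and `type` are all assumptions made for this example.

```python
from __future__ import annotations

import json
from collections import Counter


def categorise_lines(pages: list[list[str]], min_share: float = 0.6) -> list[list[dict]]:
    """Label each line of each page as header, footer or body.

    Assumption for this sketch: a first (last) line whose exact text repeats
    on at least `min_share` of all pages is a header (footer).
    """
    first = Counter(page[0] for page in pages if page)
    last = Counter(page[-1] for page in pages if page)
    threshold = min_share * len(pages)
    result = []
    for page in pages:
        labelled = []
        for idx, text in enumerate(page):
            if idx == 0 and first[text] >= threshold:
                kind = "header"
            elif idx == len(page) - 1 and last[text] >= threshold:
                kind = "footer"
            else:
                kind = "body"
            labelled.append({"text": text, "type": kind})
        result.append(labelled)
    return result


# Hypothetical two-page document, already split into lines.
pages = [
    ["ACME Report", "Introduction text.", "Confidential"],
    ["ACME Report", "More details.", "Confidential"],
]
labelled = categorise_lines(pages)
as_json = json.dumps(labelled, indent=2)  # ready to be written to a flat file
```

Exact-match repetition misses footers that contain a changing page number, so a real implementation would need a looser comparison.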

2. Directory and File Structure of this Repository

2.1 Directories

| Directory         | Content                                                                    |
|-------------------|----------------------------------------------------------------------------|
| .github/workflows | GitHub Action workflows                                                    |
| data              | Inbox directories and database setup data                                  |
| docs              | DCR documentation files                                                    |
| resources         | DBeaver configuration, Gammadyne utility and various external documentation |
| scripts           | Ubuntu and Windows scripts for running the application                     |
| src               | Python scripts and PDFlib TET files                                        |
| tests             | Scripts and data for pytest                                                |

2.2 Files

| File                | Functionality                                                                             |
|---------------------|-------------------------------------------------------------------------------------------|
| .gitignore          | Configuration of files and folders to be ignored.                                         |
| .pylintrc           | Configuration file for pylint.                                                            |
| LICENSE             | Text of the licence terms.                                                                |
| logging_cfg.yaml    | Configuration of the Logger functionality.                                                |
| Makefile            | Definition of tasks to be executed with the make command.                                 |
| mkdocs.yml          | Configuration file for MkDocs.                                                            |
| Pipfile             | Definition of the Python package requirements.                                            |
| Pipfile.lock        | Definition of the specific versions of the Python packages.                               |
| pyproject.toml      | Configuration file for bandit, black, isort, mypy, pydoc-markdown, pydocstyle, and pytest. |
| README.md           | This file.                                                                                |
| run_dcr_dev         | Running the DCR functionality for development purposes.                                   |
| run_dcr_prod        | Running the DCR functionality in production.                                              |
| setup.cfg           | Configuration file for coverage, DCR, flake8, and radon.                                  |
| setup.cfg.reference | Original setup configuration file.                                                        |

3. Support

If you need help with DCR, do not hesitate to get in contact with us!

  • For questions and high-level discussions, use Discussions on GitHub.
  • To report a bug or make a feature request, open an Issue on GitHub.

Please note that we can only provide support for problems and questions regarding core features of DCR. Questions or bug reports about features of third-party themes, plugins, extensions or similar should be directed to their respective projects. However, such questions are not banned from Discussions.

Make sure to stick around to answer some questions as well!

4. Links

5. Contributing to DCR

The DCR project welcomes, and depends on, contributions from developers and users in the open source community. Please see the Contributing Guide for information on how you can help.

6. Code of Conduct

Everyone who interacts with the DCR project's codebase, issue trackers, and discussion forums is expected to follow the Code of Conduct.

7. License

Konnexions Public License (KX-PL)