Turn your documents into data!

Parsr is a minimal-footprint toolchain for cleaning, parsing and extracting data from documents (images, PDFs). It generates readily available, organized and usable data for data scientists and developers.

It provides clean, structured and label-enriched information for ready-to-use applications such as data entry, document analysis automation and archival.

Currently, Parsr can perform:

Document Hierarchy Regeneration - Words, Lines and Paragraphs
Headings Detection
Table Detection and Reconstruction
Lists Detection
Text Order Detection
Named Entity Recognition (Dates, Percentages, etc.)
Key-Value Pair Detection (for the extraction of specific form-based entries)
Page Number Detection
Header-Footer Detection
Link Detection
Whitespace Removal

Parsr takes as input an image (.JPG, .PNG, .TIFF, ...) or a PDF and generates the following output formats:

JSON
Markdown
Text
CSV (for tables), or Pandas DataFrames (see here)
PDF
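
As a rough illustration of consuming the CSV output for tables, the sketch below parses CSV text with Python's standard `csv` module. The sample text, the `read_table` helper and the semicolon delimiter are assumptions made for this example, not actual Parsr output; adjust the delimiter to match the files you get.

```python
import csv
import io


def read_table(csv_text: str, delimiter: str = ";") -> list[dict[str, str]]:
    """Parse CSV text into one dict per row, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Hypothetical table content, invented for this example.
sample = "item;quantity\npaper;3\nink;1\n"
rows = read_table(sample)
print(rows)
```

All values come back as strings, so numeric columns need an explicit conversion before use.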

Getting Started

Installation

-- The advanced installation guide is available here --

The quickest way to install and run the Parsr API is through the docker image:

docker pull axarev/parsr

If you also wish to install the GUI for sending documents and visualising results:

docker pull axarev/parsr-ui-localhost

Note: Parsr can also be installed bare-metal (not via Docker containers), the procedure for which is documented in the installation guide.

Usage

-- The advanced usage guide is available here --

To run the API, issue:

docker run -p 3001:3001 axarev/parsr

which will launch it on http://localhost:3001.
Consult the documentation on the usage of the API.
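
Before sending documents, it can help to confirm that the container is actually accepting connections on the published port. The following is a minimal sketch that only checks TCP reachability of `localhost:3001`; it does not call any Parsr endpoint, and the helper name `is_listening` is made up for this example.

```python
import socket


def is_listening(host: str = "localhost", port: int = 3001, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    if is_listening():
        print("Something is listening on localhost:3001")
    else:
        print("Nothing is listening on localhost:3001; is the container running?")
```

A successful check only shows that some process is bound to the port, not that it is Parsr, so treat it as a quick sanity check rather than a health probe.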

To use the Jupyter Notebook and the Python interface to the Parsr API, follow here.
To use the GUI tool (the API needs to already be running), issue:
```
docker run -t -p 8080:80 axarev/parsr-ui-localhost:latest
```
Then, access it through http://localhost:8080.

Refer to the Configuration documentation to interpret the configurable options in the GUI viewer.

API-based usage and command-line usage are documented in the advanced usage guide.

Documentation

All documentation files can be found here.

Contribute

Please refer to the contribution guidelines.

Third Party Licenses

Parsr depends on the following third-party libraries, listed with their licenses:

QPDF: Apache http://qpdf.sourceforge.net
GraphicsMagick: MIT http://www.graphicsmagick.org/index.html
ImageMagick: Apache 2.0 https://imagemagick.org/script/license.php
Pdfminer.six: MIT https://github.com/pdfminer/pdfminer.six/blob/master/LICENSE
PDF.js: Apache 2.0 https://github.com/mozilla/pdf.js
Tesseract: Apache 2.0 https://github.com/tesseract-ocr/tesseract
Camelot: MIT https://github.com/camelot-dev/camelot
MuPDF (Optional dependency): AGPL https://mupdf.com/license.html
Pandoc (Optional dependency): GPL https://github.com/jgm/pandoc

License

Copyright 2019 AXA Group Operations S.A.
Licensed under the Apache 2.0 license (see the LICENSE file).
