A simple demo project for the PDF extractor created within the CODE project by Know-Center. It is a command-line application that takes a PDF file, uploads it to the CODE service, and returns a JSON response containing:
- main text in proper reading order, honoring multi-column PDFs
- hierarchical table of contents
- figures & tables
To build a runnable JAR invoke Maven in the root directory of the project as follows:
mvn clean assembly:assembly
This will generate a JAR file called pdf-extractor-jar-with-dependencies.jar in the target/ folder.
Once the JAR with dependencies is built, you can invoke it from the root directory as follows:
java -jar target/pdf-extractor-jar-with-dependencies.jar <pdf-file>
Where pdf-file is the path to a locally stored PDF, e.g. nestin.pdf, the PDF we ship with this project.
The demo will then upload the specified PDF to the CODE enrichment service. The service will return the result as JSON, which the demo will write to a file called <pdf-file>.json, e.g. nestin.pdf.json, in the current working directory.
Additionally, the demo will output some of the extracted data, such as the main text, the table of contents, and the figures & tables found in the PDF.
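To illustrate the output naming described above, here is a minimal sketch of how the JSON file name could be derived from the PDF path. It is not taken from the demo's source: the class OutputName and the method jsonOutputFor are hypothetical, and the sketch assumes that only the file name of the PDF is kept, so the result always lands in the current working directory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class OutputName {

    // Hypothetical helper: derives the JSON output path for a given PDF path.
    // Assumption: directory components are dropped, so the returned relative
    // path resolves against the current working directory.
    static Path jsonOutputFor(Path pdf) {
        Path fileName = pdf.getFileName();
        if (fileName == null) {
            throw new IllegalArgumentException("Not a file path: " + pdf);
        }
        return Paths.get(fileName.toString() + ".json");
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: java OutputName <pdf-file>");
            System.exit(1);
        }
        // Print the name of the JSON file that would be written for this PDF.
        System.out.println(jsonOutputFor(Paths.get(args[0])));
    }
}
```

Under this assumption, both nestin.pdf and a path such as docs/nestin.pdf map to nestin.pdf.json; if the demo keeps the directory part instead, the helper would need to append ".json" to the full path.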
The contents of this project are licensed under AGPL 3; see the LICENSE file.