GitHub - elacin/PDFExtract: my take at a PDF text extraction utility

Repository contents:

analysis
datasource-pdfbox
datasource-poppler
datasource
logicaltree
model
parent
pdfextract-cli
renderer
xmlout-simple
xmlout-tei-p5
xmlout
.gitignore
LICENSE
README

#######################################
# How to build PDFExtract from source
#######################################
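
# Optional check before starting: the steps below call git, svn, patch and mvn.
# This is a minimal sketch and not part of the original instructions;
# missing_tools is a hypothetical helper name.

```shell
# print the name of every given tool that is not found on PATH
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools git svn patch mvn)
if [ -n "$missing" ]; then
    echo "missing tools:" $missing
else
    echo "all build tools found"
fi
```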

# 1. create a folder for the projects and cd to it
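# For example (a sketch; the folder name pdfextract-build is only an assumption,
# any empty folder will do):

```shell
# hypothetical folder name, not required by the project
WORKDIR="pdfextract-build"
mkdir -p "$WORKDIR"
cd "$WORKDIR"
```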


#
# 2. install TEI P5 model
#
git clone https://github.com/elacin/TEI-P5-Java-model.git
cd TEI-P5-Java-model/
# check out the commit for version 0.3, which PDFExtract currently uses
git checkout 29d668e
mvn install
cd ..

#
# 3. install patched PDFBox
#
# the patches live in the PDFExtract repository, so clone it first
git clone https://github.com/elacin/PDFExtract.git
svn checkout http://svn.apache.org/repos/asf/pdfbox/trunk/ pdfbox
# apply the patches (tested against pdfbox svn r1157684; if they fail against
# current trunk, adding -r 1157684 to the checkout above may help)
cd pdfbox
patch -p0 < ../PDFExtract/parent/patch/pdfbox_poms.patch
patch -p0 < ../PDFExtract/parent/patch/pdfbox-font-bounding-boxes.patch
patch -p0 < ../PDFExtract/parent/patch/pdfbox-drawer-visibility.patch
mvn install
cd ..


#
# 4. install PDFExtract
#
cd PDFExtract/parent
# tests are skipped because they still need some cleanup
mvn -DskipTests=true assembly:assembly


# the binary distribution will end up as PDFExtract/pdfextract-cli/target/pdfextract-cli-${VERSION}-bin.tar.bz2
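
# To locate the archive without knowing ${VERSION}, something like the sketch
# below can be run from the folder created in step 1. find_dist is a
# hypothetical helper name, and the glob merely stands in for ${VERSION}.

```shell
# print the path of every built distribution archive below the given checkout
find_dist() {
    for archive in "$1"/pdfextract-cli/target/pdfextract-cli-*-bin.tar.bz2; do
        [ -f "$archive" ] && printf '%s\n' "$archive"
    done
    return 0
}

find_dist PDFExtract
```

# a reported archive can then be unpacked with: tar -xjf <archive>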

Project page: elacin.github.com/PDFExtract/
