No description, website, or topics provided.
Clone or download
Latest commit 05cb0e0 Jan 20, 2019
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
lib/Text more failsafe Jan 20, 2019
share/lib version 0.3.0 Jan 5, 2019
t fixed locale issues Jan 6, 2019
CHANGES more failsafe Jan 20, 2019
LICENSE.txt added license Jan 29, 2013
Makefile.PL server request timeout parameter added and fallback for timeout issues Jan 19, 2019
README autofill in README Jan 30, 2013
pdf2xml everything in a library Jan 4, 2019

README

pdf2xml - convert PDF files to XML
----------------------------------

This script heavily relies on Apache Tika and pdftotext for the
extraction of text and the conversion to XML. It tries to combine
information from both tools and different conversion modes:

 - pdftotext in standard mode puts some hyphenated words together
 - pdftotext in 'raw' mode is better in keeping characters together
 - Apache Tika can add XML markup to mark paragraphs and lists


pdf2xml uses several heuristics to use information from both tools. It
collects the vocabulary from different conversions and tries to merge
tokens if they would form a valid word (which is part of the
vocabulary). The longest possibe string is prefered. Hyphens at the
end of a line may mark a hyphenated word. pdf2xml tries to put the
token before the hyphen together with the first token of the next line
and accepts the new string if it is part of the vocabulary.


Look at the example PDF-file in the test directory (t/). If you look
at the difference between raw conversion (without cleanup heuristics),
you will find a lot of changes. For example:


  raw:    <p>PRESENTATION ET R A P P E L DES PRINCIPAUX RESULTATS 9</p>
  clean:  <p>PRESENTATION ET RAPPEL DES PRINCIPAUX RESULTATS 9</p>

  raw: <p>2. Les c r i t è r e s de choix : la c o n s o m m a t i o n
          de c o m b u s - t ib les et l e u r moda l i t é 
          d ' u t i l i s a t i on d 'une p a r t , 
          la concen t r a t ion d ' a u t r e p a r t 16</p>

  clean: <p>2. Les critères de choix : la consommation 
            de combustibles et leur modalité 
            d'utilisation d'une part, 
            la concentration d'autre part 16</p>

Note: The heuristics may not always produce correct results!