PDF2SVG conversion #17

petermr · 2017-01-30T10:46:05Z

There are several "PDF2SVG" converters running on different platforms (Java, Python, C/C++). Although the output format is always SVG, there are many ways that it could be structured. We have used two which "run on all platforms":

PDF2SVG (AMI) Java https://bitbucket.org/petermr/pdf2svg/wiki/Home . This was based on PDFBox 1.8 (https://pdfbox.apache.org/) which has a very thorough toolchain for extracting PDF.
This is the default, which will be used for this project. It runs from the command line but is not yet packaged as an uber-jar.
We plan to move to PDFBox 2.0.4 but not during the CM-UCL project.
PDF2SVG (http://www.cityinthesky.co.uk/opensource/pdf2svg/). This wraps some existing libraries. It is (somewhat) easier to install than AMI-PDF2SVG and has more compact output. However, it has not been tested for producing SVG2XML input and will not be used for production.

PDF2SVG only needs to be run once (and has been). The tables have been extracted by hand from both corpora.

petermr · 2017-01-30T10:53:47Z

PDF2SVG conversion
There are many issues with converting PDF to SVG, and some of the current converters may not address all of them. Issues include:

Are font-family names captured?
Are font-styles captured?
Are font-weights captured?
Are the coordinates normalized, or does the user need to apply transformations?
Are character widths captured?
Do characters have individual coordinates?
Is the painter model captured?
Is the order of drawing important?
Have fonts been converted to paths?
Have words been created?
How are legacy fonts treated?
Are characters converted to Unicode?
How is vertical or oblique text managed?
Are there SVG constructs or elements that SVG2XML does not support?
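
Several of the questions above can be checked mechanically against a converter's output. The following is only a rough sketch, not part of AMI-PDF2SVG or SVG2XML: the function name `audit_svg` and the keys of the report are made up for illustration, and the heuristics are assumptions (for example, "no `<text>` elements but some `<path>` elements" is taken as a hint that fonts were converted to paths). It only looks at attributes on the text elements themselves, so properties inherited from an enclosing `<g>` are not detected.

```python
import re
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"


def _numbers(value):
    """Split an SVG coordinate list such as '10 16,22' into tokens."""
    return [tok for tok in re.split(r"[\s,]+", value.strip()) if tok]


def _has_property(elem, name):
    """True if `name` is set as an attribute or inside the style attribute."""
    if elem.get(name) is not None:
        return True
    style = elem.get("style", "")
    declared = [d.split(":", 1)[0].strip() for d in style.split(";") if ":" in d]
    return name in declared


def audit_svg(svg_text):
    """Return a dict of yes/no answers for some of the questions above.

    Hypothetical helper: the keys and heuristics are illustrative only.
    """
    root = ET.fromstring(svg_text)
    texts = [e for e in root.iter()
             if e.tag in (SVG_NS + "text", SVG_NS + "tspan")]
    paths = list(root.iter(SVG_NS + "path"))
    return {
        "font_family": any(_has_property(t, "font-family") for t in texts),
        "font_style": any(_has_property(t, "font-style") for t in texts),
        "font_weight": any(_has_property(t, "font-weight") for t in texts),
        # Any transform attribute means coordinates are not yet normalized.
        "transforms_present": any(e.get("transform") is not None
                                  for e in root.iter()),
        # An x attribute with several values positions characters one by one.
        "per_character_coordinates": any(len(_numbers(t.get("x", ""))) > 1
                                         for t in texts),
        # Assumption: paths without any text suggests glyphs became outlines.
        "text_may_be_paths": not texts and bool(paths),
    }
```

A report could then be produced for each converted page with something like `audit_svg(open("page1.svg", encoding="utf-8").read())`, where `page1.svg` is a placeholder file name. Questions such as Unicode conversion, word creation and drawing order need knowledge of the source PDF and are not covered by this sketch.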
