PDF2SVG conversion #17

Open
petermr opened this issue Jan 30, 2017 · 1 comment
petermr commented Jan 30, 2017

There are several "PDF2SVG" converters built on different platforms (Java, Python, C/C++). Although the output format is always SVG, there are many ways it can be structured. We have used two which "run on all platforms":

  1. PDF2SVG (AMI), Java: https://bitbucket.org/petermr/pdf2svg/wiki/Home. This is based on PDFBox 1.8 (https://pdfbox.apache.org/), which has a very thorough toolchain for extracting PDF content.
    This is the default and will be used for this project. It runs from the command line but is not yet packaged as an uber-jar.
    We plan to move to PDFBox 2.0.4, but not during the CM-UCL project.

  2. PDF2SVG (http://www.cityinthesky.co.uk/opensource/pdf2svg/), which wraps some existing libraries. This is (somewhat) easier to install than AMI-PDF2SVG and has a more compact output. However, it has not been tested for producing SVG2XML input and will not be used for production.

PDF2SVG only needs to be run once (and has been). The tables have been extracted by hand from both corpora.

petermr commented Jan 30, 2017

PDF2SVG conversion
There are many issues in converting PDF to SVG, and some of the current converters may not address all of them. Issues include:

  1. Are font-family names captured?
  2. Are font-styles captured?
  3. Are font-weights captured?
  4. Are the coordinates normalized, or does the user need to apply transformations?
  5. Are character widths captured?
  6. Do characters have individual coordinates?
  7. Is the painter model captured?
  8. Is the order of drawing important?
  9. Have fonts been converted to paths?
  10. Have words been created?
  11. How are legacy fonts treated?
  12. Are characters converted to Unicode?
  13. How is vertical or oblique text managed?
  14. Are there SVG constructs or elements that SVG2XML does not support?
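
Several of the questions above (font attributes, transformations, multiple coordinates per text element, fonts converted to paths) can be checked roughly by inspecting a converter's output. The following is a minimal sketch, not part of either converter: `survey_svg` and its result keys are hypothetical names, and it assumes font properties appear either as attributes or in an inline `style` on `text`/`tspan` elements. Properties inherited from a parent `g` element are not detected.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def style_properties(elem):
    """Parse an inline style="a:b;c:d" attribute into a dict."""
    props = {}
    for part in elem.get("style", "").split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            props[key.strip()] = value.strip()
    return props


def survey_svg(svg_text):
    """Return rough yes/no answers for a few checklist items (hypothetical helper)."""
    root = ET.fromstring(svg_text)
    elems = list(root.iter())
    texts = [e for e in elems if local_name(e.tag) in ("text", "tspan")]
    paths = [e for e in elems if local_name(e.tag) == "path"]

    def has_property(name):
        # True if any text element carries the property directly or in its style.
        return any(
            e.get(name) is not None or name in style_properties(e) for e in texts
        )

    return {
        "text_elements": len(texts),
        "path_elements": len(paths),
        "font_family": has_property("font-family"),
        "font_style": has_property("font-style"),
        "font_weight": has_property("font-weight"),
        # Any transform attribute means coordinates may not be normalized.
        "transforms_present": any(e.get("transform") is not None for e in elems),
        # Several x values on one element suggest per-character coordinates.
        "multi_value_x": any(
            len(e.get("x", "").replace(",", " ").split()) > 1 for e in texts
        ),
        # No text elements but some paths: glyphs may have been drawn as paths.
        "possibly_text_as_paths": not texts and bool(paths),
    }
```

These are heuristics only. For example, a converter that emits one `text` element per character also gives individual coordinates but would report `multi_value_x` as false, and a page containing only graphics would be flagged as `possibly_text_as_paths`.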
