Links in generated PDF #1

Closed
waawal opened this issue Apr 16, 2012 · 12 comments

Comments

@waawal

waawal commented Apr 16, 2012

Hi!

I find the cleanliness of the generated PDFs really astounding! Well done!

Is there a way to generate PDFs with links, either referring to other places in the document or external (http://...)?
If not, is that something that's on the roadmap?

@SimonSapin
Member

Hi,

We would very much like to have that feature ourselves, but implementing it is not obvious at best.

PDF documents generated by WeasyPrint currently come straight from cairo. Hyperlinks are on the cairo roadmap, but unfortunately it looks like it has been there for a few years already, without much progress.

An alternative could be to post-process the PDFs to add the hyperlinks. I guess this would involve patching PyPDF heavily (currently it looks like it does not know much about the content of pages) and some familiarity with the PDF spec. (I spend plenty of time with the CSS specs already...)

So, currently there are two ideas on how to make this happen, but neither of them is easy. Any suggestion or patch is welcome ;)

@waawal
Author

waawal commented Apr 26, 2012

Hi Simon,
thanks for your response.

I've looked into the post-parsing idea but without success. PyPDF has an extractText method, but I haven't been able to get it to work on cairo-generated PDFs.

ReportLab supports hyperlinks with the <link href=""></link> tag, but since I've not found a sane way to parse the cairo PDFs yet, it's really of no use.

I think you are right that much effort would be required to add this functionality to WeasyPrint. So far I've only looked at pure Python implementations; maybe there are other solutions that could be hooked in through subprocesses.

I will let you know if I stumble upon something that can be used for this in the interim period until cairo adds this functionality.

References:

@SimonSapin
Member

Thanks for sharing your research.

Extracting text from a PDF would not get us very far: it skips all of the layout/formatting. If you want the unformatted text, getting it from WeasyPrint’s boxes without going through PDF is probably easier anyway.

Switching entirely from cairo+Pango to ReportLab would be possible while keeping most of the code. But it is still a huge change with many implications. I have not looked at the pros and cons.
Alternatively, ReportLab could post-process cairo’s PDF. But according to this only the paid version can "Reuse existing PDFs".

Your second link hints that links in the PDF format are clickable rectangles. It would be "easy" for WeasyPrint to get a list of (rectangle, URL) pairs. After that we "just" need to find a way to add them in the PDFs. Ideally this should really happen in cairo. So if anyone feels like reading the PDF spec and writing C code ... :)
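
To make the "(rectangle, URL) pairs" idea concrete, here is a minimal sketch of what one such pair could look like once written as a PDF link annotation. The dictionary keys come from the PDF spec; the function names `escape_pdf_string` and `link_annotation` and the sample rectangle are made up for illustration and are not WeasyPrint or cairo API.

```python
def escape_pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    return (text.replace('\\', '\\\\')
                .replace('(', '\\(')
                .replace(')', '\\)'))


def link_annotation(rectangle, url):
    """Return the source of a PDF link annotation dictionary.

    rectangle is (x1, y1, x2, y2) in PDF user space: points, with the
    origin at the bottom-left corner of the page.
    """
    x1, y1, x2, y2 = rectangle
    return (
        '<< /Type /Annot /Subtype /Link '
        '/Rect [{:.2f} {:.2f} {:.2f} {:.2f}] '
        '/Border [0 0 0] '  # no visible border around the clickable area
        '/A << /Type /Action /S /URI /URI ({}) >> >>'
    ).format(x1, y1, x2, y2, escape_pdf_string(url))


# Hypothetical layout output: one clickable rectangle and its target.
links = [((72, 700, 200, 714), 'http://weasyprint.org/')]
annotations = [link_annotation(rect, url) for rect, url in links]
print(annotations[0])
```

Each annotation would still have to be stored as an object in the file and listed in the /Annots array of its page, which is the part that needs a way to write into the PDF.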

@liZe
Member

liZe commented May 14, 2012

You can try the "links" branch, which adds internal and external links. It's an experimental feature (broken in some cases, such as links hidden behind boxes or CSS transforms), but it seems to work quite well now. Comments are welcome!

@waawal
Author

waawal commented May 18, 2012

Wow! This is great news!
I will for sure try it out asap, just need to get an environment I can install WeasyPrint in.

🍰

@SimonSapin
Member

I just released WeasyPrint 0.9 (yet to be announced on the mailing-list) with support for PDF hyperlinks and bookmarks. Check out the demo: http://weasyprint.org/samples/CSS21-intro.pdf

In the end we bit the bullet and read the PDF spec. 0.9 parses the PDF files produced by cairo and uses the incremental update mechanism to add metadata. The parser makes a lot of assumptions based on cairo’s output. It is not suitable for reading any PDF in the wild. This is all in the weasyprint/pdf.py file.

In the end it was not even that hard, it just took time to get familiar with the PDF spec. (And some courage/procrastination to dive in.)
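
For readers curious about the incremental update mechanism mentioned above, here is a rough sketch of its shape: the original bytes stay untouched, and new objects, a new cross-reference section and a trailer pointing back to the previous one are appended at the end. This is not the code from weasyprint/pdf.py; the function name and its parameters are assumptions made for this example, and the caller is expected to already know the catalog reference and the new /Size.

```python
import re


def incremental_update(pdf, new_objects, root_ref, size):
    """Return pdf (bytes) with an incremental update appended.

    new_objects maps object numbers to object bodies (bytes).
    root_ref is the catalog reference, e.g. b'1 0 R'.
    size is the new /Size: the highest object number plus one.
    """
    # The offset of the previous cross-reference section follows the
    # last "startxref" keyword of the existing file.
    matches = re.findall(rb'startxref\s+(\d+)', pdf)
    if not matches:
        raise ValueError('no startxref found')
    previous_xref = int(matches[-1])

    output = bytearray(pdf)
    if not output.endswith(b'\n'):
        output += b'\n'

    # Append the new (or replacement) objects and remember their offsets.
    offsets = {}
    for number, body in sorted(new_objects.items()):
        offsets[number] = len(output)
        output += b'%d 0 obj\n' % number + body + b'\nendobj\n'

    # New cross-reference section: one subsection per object, and each
    # entry is exactly 20 bytes long.
    xref_offset = len(output)
    output += b'xref\n'
    for number, offset in sorted(offsets.items()):
        output += b'%d 1\n' % number
        output += b'%010d %05d n \n' % (offset, 0)

    # /Prev chains this trailer to the previous cross-reference section.
    output += b'trailer\n<< /Size %d /Root %s /Prev %d >>\n' % (
        size, root_ref, previous_xref)
    output += b'startxref\n%d\n%%%%EOF\n' % xref_offset
    return bytes(output)
```

To attach a link, the appended objects would typically include the annotation itself plus a rewritten copy of the page object whose /Annots array references it; a reader that honours the last trailer then sees the updated page.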

@waawal
Author

waawal commented Jul 2, 2012

@SimonSapin and @liZe

Sorry for lagging behind on this. I just installed version 0.10 on a clean Ubuntu 12.04 and tested it by running weasyprint on http://www.w3.org/TR/CSS21/intro.html and on some Sphinx-generated HTML output.

👍 It looks and works beautifully!

Thanks for all your hard work and for keeping this project open source!

@sublee

sublee commented Jun 8, 2019

I'm using WeasyPrint 47. My result doesn't include clickable links. Is the hyperlink feature still available?

$ pip install weasyprint==47
$ weasyprint https://www.w3.org/TR/CSS21/selector.html selector.pdf

Demo: selector.pdf

@liZe
Member

liZe commented Jun 8, 2019

Yes, it is, but you need at least Cairo 1.15.4. If you can't install a recent version of Cairo, you can use WeasyPrint 0.42.3 and get links (but miss the features added since).

@sublee

sublee commented Jun 8, 2019

@liZe Thanks for the very fast response.

I installed pycairo-1.18.1 but WeasyPrint still didn't generate clickable hyperlinks. Anyways, the workaround by WeasyPrint 0.42.3 you mentioned works well. I chose the version because I need only simple features.

Thank you so much :)

@liZe
Member

liZe commented Jun 8, 2019

> I installed pycairo-1.18.1 but WeasyPrint still didn't generate clickable hyperlinks.

You need a recent version of Cairo, not PyCairo (WeasyPrint doesn't use PyCairo). But if 0.42.3 works for you, no problem 😉.
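
A small sketch of how the version requirement could be checked, assuming you already have the Cairo library version as a string (for example from your package manager). The helper names are invented for this example; the integer encoding is the one cairo itself uses for its version number.

```python
MINIMUM_FOR_LINKS = (1, 15, 4)  # minimum Cairo version quoted in this thread


def parse_version(version_string):
    """Turn a dotted version such as '1.14.6' into (1, 14, 6)."""
    return tuple(int(part) for part in version_string.split('.'))


def decode_cairo_version(encoded):
    """Decode cairo's integer version: major * 10000 + minor * 100 + micro."""
    return (encoded // 10000, encoded // 100 % 100, encoded % 100)


def supports_links(version_string):
    """Tell whether this Cairo version is recent enough for links."""
    return parse_version(version_string) >= MINIMUM_FOR_LINKS


# Assumption: with cairocffi installed, cairocffi.cairo_version_string()
# reports the version of the Cairo library that is actually loaded.
# That value is what matters here, not the PyCairo package version.
print(supports_links('1.14.6'))  # False
print(supports_links('1.16.0'))  # True
print(decode_cairo_version(11504))  # (1, 15, 4)
```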

@sublee

sublee commented Jun 9, 2019

@liZe Oh, the Cairo requirement was confusing to me, but now it's clear. libcairo2-1.14.6 is installed on my system. Installing libcairo2-1.15.4+ is probably more complicated than installing WeasyPrint-0.42.3, so I decided to keep using WeasyPrint-0.42.3 as you recommended. Thank you for clarifying!
