Weasyprint 0.42 seems to hang / freeze on processing large tables #691

Closed
RafaelLinux opened this Issue Sep 18, 2018 · 29 comments

RafaelLinux commented Sep 18, 2018

Trying WeasyPrint with our larger bulletins, we found some (with the largest tables) that WeasyPrint seems unable to process. These are some of them:
http://www.dip-badajoz.es/bop/boletin_completo.php?FechaSolicitada=2006-08-21
http://www.dip-badajoz.es/bop/boletin_completo.php?FechaSolicitada=2012-04-09
http://www.dip-badajoz.es/bop/boletin_completo.php?FechaSolicitada=2014-09-05

It is not usual, but we have some HTML pages of up to 9MiB. Is there any workaround or parameter that would let WeasyPrint process these pages?

Thank you

RafaelLinux commented Sep 18, 2018

I looked into this issue a little more. WeasyPrint was launched to process the http://www.dip-badajoz.es/bop/boletin_completo.php?FechaSolicitada=2012-04-09 page on a Linux server with no CPU load and with its 8GB RAM nearly unused. Even so, the OS ended up killing WeasyPrint because the system was "out of memory". We then upgraded the system to 16GB RAM + 4GB SWAP and tried again with this configuration. This time WeasyPrint ate all 16GB of RAM and took 2GB of SWAP. We are still waiting for it to finish, but it has already taken more than 12 minutes, so I'll report here whether it finished and how long it took.

UPDATE: The process (slowly) ate even the swap partition, so it was killed by the kernel :(

@RafaelLinux RafaelLinux changed the title from Weasyprint seems to hang / freeze on processing large page to Weasyprint 0.42 seems to hang / freeze on processing large page Sep 19, 2018

@RafaelLinux RafaelLinux changed the title from Weasyprint 0.42 seems to hang / freeze on processing large page to Weasyprint 0.42 seems to hang / freeze on processing large tables Sep 19, 2018

@liZe liZe added the performance label Sep 19, 2018

Member

liZe commented Sep 19, 2018

Rendering large tables is known to be slow, and when tables have 23,540 lines (!), it's awfully slow.

There are 2 main topics that could help:

  • Optimize the auto layout. It's currently naive with a lot of loops through columns and lines, we could easily avoid many loops. (Easy)
  • Don't redo one full table layout per page. As long as the available width is the same, there's no need to redo the whole layout, and we could just cache it. (Harder, but really useful)
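
The second idea in the list above (reuse the table layout as long as the available width stays the same) can be sketched as a small cache. This is only an illustration under assumptions: `compute_column_widths` and `TableLayoutCache` are hypothetical names, not WeasyPrint functions, and the even split stands in for the real auto layout.

```python
from typing import Callable, Dict, List, Tuple


def compute_column_widths(column_count: int, available_width: float) -> List[float]:
    """Hypothetical stand-in for an expensive auto table layout.

    It simply splits the available width evenly between the columns.
    """
    return [available_width / column_count] * column_count


class TableLayoutCache:
    """Remember one layout result per (table id, available width) pair."""

    def __init__(self, layout: Callable[[int, float], List[float]]) -> None:
        self._layout = layout
        self._cache: Dict[Tuple[int, float], List[float]] = {}
        self.misses = 0  # number of times the expensive layout really ran

    def column_widths(self, table_id: int, column_count: int,
                      available_width: float) -> List[float]:
        key = (table_id, available_width)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._layout(column_count, available_width)
        return self._cache[key]


cache = TableLayoutCache(compute_column_widths)
# The same table is laid out on 3 pages that share the same width...
for _page in range(3):
    widths = cache.column_widths(table_id=1, column_count=4, available_width=500.0)
# ...and once on a page with a different width.
narrow = cache.column_widths(table_id=1, column_count=4, available_width=400.0)
print(cache.misses)
```

Keying the cache on the available width means a page with different margins still gets its own layout, while pages of identical width share one result.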

I'm currently rendering "2006-08-21", it seems to require less memory than your example does, but it's still really bad.

RafaelLinux commented Sep 20, 2018

I understand the complexity, taking into account that there is even an accessibility identifier for each cell, but it's a priority. Even "mPDF" hung while processing these tables. Although we now split our tables in most cases, we can't guarantee that we won't need to publish similar tables in the future.

For our part, we tried to provide all the necessary hardware resources so as not to bother you with this issue :) but the hardware upgrade wasn't enough to solve the problem :( 16GB RAM (+4GB SWAP) is too much memory to waste.

I hope you find a way to solve this problem.

Regards

@liZe liZe added this to the 43 milestone Sep 20, 2018

Member

liZe commented Sep 20, 2018

> I understand the complexity, taking into account that there is even an accessibility identifier for each cell, but it's a priority. Even "mPDF" hung while processing these tables. Although we now split our tables in most cases, we can't guarantee that we won't need to publish similar tables in the future.

Yes, 2012-04-09 is a 12MB HTML file generating a 1515-page document, it's pretty huge! WebKit takes more than 1 minute and more than 1GB RAM to render it.

> For our part, we tried to provide all the necessary hardware resources so as not to bother you with this issue :) but the hardware upgrade wasn't enough to solve the problem :( 16GB RAM (+4GB SWAP) is too much memory to waste.

You don't bother me at all, it's really useful to get real-life problems.

> I hope you find a way to solve this problem.

I've tried many things, and here's what I've discovered so far:

  • Removing 1 line of CSS (.contenido_anuncio table, […] { width: 100% }) makes the rendering go much, much faster: 2012-04-09 takes less than 13 minutes on my computer, whereas it took more than 30 minutes (and probably more than 1 hour) with this line.
  • Memory consumption is huge: 12.7GB RAM, and probably more with older versions of Python. 1/3 is used by applying CSS, 1/3 by the formatting structure, 1/3 by the layout. We can probably reduce the CSS and formatting structure parts. This problem is not caused by long tables, it's caused by tables creating a lot of boxes.

I'll spend some time trying to find why this 100% takes so much time, and to save a little bit of memory if possible.
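
To try the effect of dropping that rule without editing the site's own CSS, one option is to pass an extra stylesheet through WeasyPrint's Python API. The following is a minimal sketch, not a tested recipe: the selector covers only the visible part of the rule quoted above (the elided selectors are not reproduced), `render_with_override` is a hypothetical helper name, and `!important` is used on the assumption that stylesheets passed to `write_pdf` are treated as user stylesheets, which otherwise rank below the page's own rules.

```python
# Assumption: only the visible part of the quoted selector is overridden.
OVERRIDE_CSS = ".contenido_anuncio table { width: auto !important }"


def render_with_override(url: str, target: str, css_text: str = OVERRIDE_CSS) -> None:
    """Render `url` to the PDF file `target`, adding `css_text` as an extra stylesheet."""
    # Imported lazily so that this module loads even without WeasyPrint installed.
    from weasyprint import CSS, HTML

    HTML(url=url).write_pdf(target, stylesheets=[CSS(string=css_text)])


# Example call (fetches the page and may need a lot of time and memory):
# render_with_override(
#     "http://www.dip-badajoz.es/bop/boletin_completo.php?FechaSolicitada=2012-04-09",
#     "test.pdf",
# )
```

Since `auto` is the initial value of `width`, this should behave like removing the rule for the matched tables, unless another rule sets a width on them.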

RafaelLinux commented Sep 20, 2018

Thank you @liZe.

Our CSS files have evolved over the years, making use of new CSS improvements and features, so I can try changing the "width: 100%" property and see how it affects the appearance of our older and newer pages in general. What we can't do is change the HTML code (so, unfortunately, the published tables must remain as they are).

Honestly, we are not worried about how long it takes to generate a PDF, as long as it doesn't take more than 1 hour. Memory consumption is more critical for us, although we can go up to 18-20GB RAM if necessary, as long as we know the maximum RAM needed for the worst case (basically, the links included above have the largest tables).

RafaelLinux commented Oct 11, 2018

Any news on this ....??? :(

Member

liZe commented Oct 21, 2018

> Any news on this ....??? :(

Unfortunately no, but it's in the version 43 milestone, so I won't forget.

RafaelLinux commented Oct 22, 2018

Thank you!!!

Member

liZe commented Oct 29, 2018

Here's the news!

The "problem" is in distribute_excess_width. This function dispatches extra width among columns when the table is larger than its cells (and it's the case for you, as the cells don't have enough content to fill the width: 100% table). The algorithm needs the max_content_width of each cell, but this takes between 20 and 25 seconds per page on my computer for the first pages (when the whole table has to be calculated).

Fortunately, these values have already been calculated and are cached in the table. I'm writing a patch that should avoid re-calculating these values for each page.
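
The idea of the patch, measuring each cell once and reusing the cached value on every page, can be sketched as follows. This is a toy model under stated assumptions, not WeasyPrint's code: `Cell`, `Table` and `distribute_excess` are hypothetical names, the "measurement" is one unit per character, and the proportional split is a simplification chosen for the example.

```python
from typing import Dict, List


class Cell:
    """Hypothetical table cell; `text` stands in for real content."""

    def __init__(self, text: str) -> None:
        self.text = text


class Table:
    def __init__(self, rows: List[List[Cell]]) -> None:
        self.rows = rows
        self.measure_calls = 0  # counts the expensive measurements
        self._max_content_cache: Dict[int, float] = {}

    def _measure(self, cell: Cell) -> float:
        # Stand-in for an expensive text layout: 1 unit per character.
        self.measure_calls += 1
        return float(len(cell.text))

    def column_max_content_width(self, column: int) -> float:
        """Widest cell of a column, measured once and then cached."""
        if column not in self._max_content_cache:
            self._max_content_cache[column] = max(
                self._measure(row[column]) for row in self.rows)
        return self._max_content_cache[column]


def distribute_excess(table: Table, table_width: float) -> List[float]:
    """Share extra width between columns, proportionally to max-content width."""
    column_count = len(table.rows[0])
    widths = [table.column_max_content_width(i) for i in range(column_count)]
    total = sum(widths)
    if total == 0:
        return [table_width / column_count] * column_count
    excess = max(table_width - total, 0.0)
    return [width + excess * width / total for width in widths]


table = Table([[Cell("ab"), Cell("abcdef")], [Cell("abcd"), Cell("a")]])
# Simulate the same table being laid out on 5 pages of identical width.
pages = [distribute_excess(table, 100.0) for _page in range(5)]
print(pages[0], table.measure_calls)
```

With the cache in place, the number of measurements depends on the number of cells and not on the number of pages.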

@liZe liZe closed this in 3d2c75f Oct 29, 2018

Member

liZe commented Oct 29, 2018

Rendering the whole document now takes less than 13 minutes on my computer, almost the same as without width: 100%.

RafaelLinux commented Oct 29, 2018

Time spent is not a problem in my case, because it will be processed on a server, and I bet the process will not take longer. I'll try it tomorrow (if I'm able to update WeasyPrint .....)
Thank you for the great news!!

RafaelLinux commented Nov 6, 2018

I upgraded WeasyPrint with
pip3 install --upgrade git+https://github.com/Kozea/WeasyPrint
and tried with the worst HTML page in our case, http://www.dip-badajoz.es/bop/boletin_completo.php?FechaSolicitada=2012-04-09. Resource consumption seems to be lower, but I must say that after 20 minutes the process still hadn't stopped, and it ate all the RAM and maybe the SWAP. I couldn't tell whether it finished, because my session expired (it was launched from PHP), so I need to do the same from the console and I'll report the result here.

Member

liZe commented Nov 7, 2018

> Resource consumption seems to be lower, but I must say that after 20 minutes the process still hadn't stopped, and it ate all the RAM and maybe the SWAP.

I'm sorry to hear that… I've tried again on my laptop and it worked in 11:34 minutes with 12.7 GB of RAM with Python 3.7 (should be the same with Python 3.6, but may be 30% more with Python 3.5 and older).

$ time weasyprint -v "http://www.dip-badajoz.es/bop/boletin_completo.php?FechaSolicitada=2012-04-09" test.pdf
INFO: Step 1 - Fetching and parsing HTML - http://www.dip-badajoz.es/bop/boletin_completo.php?FechaSolicitada=2012-04-09
[…]
INFO: Step 5 - Creating layout - Page 1515
INFO: Step 6 - Drawing
INFO: Step 7 - Adding PDF metadata
684.49user 4.45system 11:34.46elapsed 99%CPU (0avgtext+0avgdata 12673944maxresident)k
25944inputs+0outputs (21major+3274385minor)pagefaults 0swaps

RafaelLinux commented Nov 7, 2018

With the same command, I obtained this:

...
INFO: Step 5 - Creating layout - Page 1340
INFO: Step 5 - Creating layout - Page 1341
Terminated (killed)

real    62m21,942s
user    14m33,668s
sys     2m28,080s

The process was killed by the system, not by the user, because, as I mentioned before, it literally eats all the RAM and, slowly, the SWAP. Could it be related to the cairo version (cairo < 1.15.4)? I'll try updating cairo and test again.

RafaelLinux commented Nov 7, 2018

I checked the "Cairo" libraries. This is everything cairo-related that I have installed, and they are the latest stable versions ....

||/ Name                  Version         Architecture    Description
+++-=====================-===============-===============-================================================
ii  libcairo2:amd64       1.14.8-1        amd64           Cairo 2D vector graphics library
ii  libpangocairo-1.0-0:a 1.40.5-1        amd64           Layout and rendering of internationalized text
un  python3-gi-cairo      <none>          <none>          (no description available)

Should I force the system to install an unstable version?

Member

liZe commented Nov 12, 2018

I really think that the problem is not in external libraries. Your generation is slow because your RAM is full and the swap is slow. It then gets killed because even the swap is full.

You can try to generate the document and monitor your server with top to know the memory size of the process and the memory left on your server.
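
As a complement to watching `top`, the peak memory of a finished run can also be read back from the operating system. This is a minimal sketch assuming Linux (where `ru_maxrss` is reported in KiB); `peak_memory_gib` is a hypothetical helper name, and unlike `top` it only reports after the command has ended.

```python
import resource
import subprocess
import sys
from typing import List


def peak_memory_gib(command: List[str]) -> float:
    """Run `command` and return the peak resident memory of child processes.

    The value is the largest peak among all children waited for so far by
    this Python process. Assumes Linux, where `ru_maxrss` is in KiB.
    """
    # check=False: a command killed by the system is still accounted for.
    subprocess.run(command, check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return usage.ru_maxrss / (1024 * 1024)


# Harmless demonstration command; replace it with the real weasyprint call.
peak = peak_memory_gib([sys.executable, "-c", "data = list(range(100000))"])
print(f"peak resident memory: {peak:.3f} GiB")
```

For the real case, the command list could be `["weasyprint", url, "test.pdf"]`, as in the `time weasyprint` call shown earlier in the thread.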

RafaelLinux commented Nov 13, 2018

As I wrote in this comment, I had already monitored what was happening with htop. This server has all its RAM (16GiB) available for WeasyPrint (only 400MiB occupied), and WeasyPrint slowly takes it. At "Step 3 - Applying CSS" WeasyPrint holds nearly 4GiB and then quickly takes the whole RAM (VIRT 15.1G RES 15.0G SHR 3500 CPU 84% MEM% 98.2). By "Step 5 - Creating layout - Page 77", the SWAP partition (4GiB total) is at half of its capacity. It continues creating pages up to page 487, very slowly taking more SWAP. On that page, for several minutes, SWAP stays at half of the total available and CPU is barely 6%. Then it runs again up to page 595, with the same situation for several minutes. At page 818, occupied SWAP is 2.25GiB and CPU is 10%. At this point, WeasyPrint has taken over 11 minutes. The same happens on reaching page 860, but WeasyPrint is using about 43% CPU, and SWAP stays at half of the 4GiB. After that page, WeasyPrint is again processing pages quickly, taking up to 3GiB of SWAP when it reaches page 977. Finally, on reaching page 1161, after taking all RAM and SWAP, WeasyPrint is killed by the system.

Member

liZe commented Nov 13, 2018

@RafaelLinux Which version of Python do you use?

RafaelLinux commented Nov 13, 2018

Python 2.7.13 (default, Sep 26 2018, 18:42:22)
[GCC 6.3.0 20170516] on linux2

Member

liZe commented Nov 13, 2018

> Python 2.7.13

Hmmm… WeasyPrint 43 (with all the optimizations) only works with Python 3. Are you using an older version of WeasyPrint?

RafaelLinux commented Nov 14, 2018

This is what I get when asking WeasyPrint:

# weasyprint --version
/usr/local/lib/python3.5/dist-packages/weasyprint/document.py:34: UserWarning: There are known rendering problems and missing features with cairo < 1.15.4. WeasyPrint may work with older versions, but please read the note about the needed cairo version on the "Install" page of the documentation before reporting bugs. http://weasyprint.readthedocs.io/en/latest/install.html
  'There are known rendering problems and missing features with '
WeasyPrint version 43rc1

You can see that it always warns me about the "Cairo" version ... I don't know if that's the problem.
Please, any suggestions?

Member

liZe commented Nov 14, 2018

> You can see that it always warns me about the "Cairo" version ... I don't know if that's the problem.

No, it's not. You may get rendering problems, but it's not related to memory use.

> # weasyprint --version
> /usr/local/lib/python3.5/[…]

Then you're using Python 3.5 😄. If you can use 3.6 or 3.7 instead, you'll improve your memory use a lot (because of the new dict implementation).
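
The dict change can be observed directly: Python 3.6 introduced a compact dict representation that, according to its release notes, uses between 20% and 25% less memory than in Python 3.5. A small sketch, with made-up keys standing in for the many small mappings a layout engine creates, that can be run under each interpreter to compare:

```python
import sys

# A small dict resembling a per-box mapping of properties (made-up keys).
style = {"width": "100%", "color": "black", "margin": 0, "padding": 0}

version = "{}.{}".format(*sys.version_info[:2])
size = sys.getsizeof(style)
print("Python", version, "- dict of", len(style), "items:", size, "bytes")
```

`sys.getsizeof` only counts the dict itself, not the keys and values it refers to, so it isolates the part affected by the new implementation.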

> WeasyPrint version 43rc1

You can upgrade to the final 43 version with a simple pip install --upgrade weasyprint.

With Python 3.6+ and WeasyPrint 43, there's no reason not to reach my 12 minutes and 12.7 GB RAM on your server.

RafaelLinux commented Nov 14, 2018

I don't understand why, if I type "python", I get a console that says Python 2.7.13, yet WeasyPrint says version 3.5 ... I'm going crazy. Is WeasyPrint using a local version? How do I then update the Python version that WeasyPrint is using?

Member

liZe commented Nov 14, 2018

> Is WeasyPrint using a local version?

Yes, it's using a version installed in /usr/local/, which is generally where software that is not packaged by the distribution gets installed.

> How do I then update the Python version that WeasyPrint is using?

I don't know how you installed Python 3.5, so I can't help you about how to upgrade 😞.
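
One way to see which interpreter a command-line tool runs under is to read the first line (the shebang) of its launcher script. The sketch below assumes that `weasyprint` on the PATH is a Python launcher script, as pip normally installs it; `script_interpreter` is a hypothetical helper name. On a system where `python` means Python 2, such a script can still point at a separate `python3` interpreter, which would explain the two different version numbers.

```python
import shutil
from typing import Optional


def script_interpreter(command: str) -> Optional[str]:
    """Return the interpreter named in the shebang of a script found on PATH."""
    path = shutil.which(command)
    if path is None:
        return None  # command not found
    with open(path, "rb") as script:
        first_line = script.readline().decode("utf-8", "replace").strip()
    # A shebang line starts with "#!" followed by the interpreter path.
    return first_line[2:].strip() if first_line.startswith("#!") else None


print(script_interpreter("weasyprint"))
```

The function returns `None` when the command is missing or is not a script with a shebang, for example a compiled binary.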

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Nov 14, 2018

py-weasyprint: Update to 43.
Version 43
----------

Released on 2018-11-09.

Bug fixes:

* `#726 <https://github.com/Kozea/WeasyPrint/issues/726>`_:
  Make empty strings clear previous values of named strings
* `#729 <https://github.com/Kozea/WeasyPrint/issues/729>`_:
  Include tools in packaging

This version also includes the changes from unstable rc1 and rc2 versions
listed below.

Version 43rc2
-------------

Released on 2018-11-02.

**This version is experimental, don't use it in production. If you find bugs,
please report them!**

Bug fixes:

* `#706 <https://github.com/Kozea/WeasyPrint/issues/706>`_:
  Fix text-indent at the beginning of a page
* `#687 <https://github.com/Kozea/WeasyPrint/issues/687>`_:
  Allow query strings in file:// URIs
* `#720 <https://github.com/Kozea/WeasyPrint/issues/720>`_:
  Optimize minimum size calculation of long inline elements
* `#717 <https://github.com/Kozea/WeasyPrint/issues/717>`_:
  Display <details> tags as blocks
* `#691 <https://github.com/Kozea/WeasyPrint/issues/691>`_:
  Don't recalculate max content widths when distributing extra space for tables
* `#722 <https://github.com/Kozea/WeasyPrint/issues/722>`_:
  Fix bookmarks and strings set on images
* `#723 <https://github.com/Kozea/WeasyPrint/issues/723>`_:
  Warn users when string() is not used in page margin


Version 43rc1
-------------

Released on 2018-10-15.

**This version is experimental, don't use it in production. If you find bugs,
please report them!**

Dependencies:

* Python 3.4+ is now needed, Python 2.x is not supported anymore
* Cairo 1.15.4+ is now needed, but 1.10+ should work with missing features
  (such as links, outlines and metadata)
* Pdfrw is not needed anymore

New features:

* `Beautiful website <https://weasyprint.org>`_
* `#579 <https://github.com/Kozea/WeasyPrint/issues/579>`_:
  Initial support of flexbox
* `#592 <https://github.com/Kozea/WeasyPrint/pull/592>`_:
  Support @font-face on Windows
* `#306 <https://github.com/Kozea/WeasyPrint/issues/306>`_:
  Add a timeout parameter to the URL fetcher functions
* `#594 <https://github.com/Kozea/WeasyPrint/pull/594>`_:
  Split tests using modern pytest features
* `#599 <https://github.com/Kozea/WeasyPrint/pull/599>`_:
  Make tests pass on Windows
* `#604 <https://github.com/Kozea/WeasyPrint/pull/604>`_:
  Handle target counters and target texts
* `#631 <https://github.com/Kozea/WeasyPrint/pull/631>`_:
  Enable counter-increment and counter-reset in page context
* `#622 <https://github.com/Kozea/WeasyPrint/issues/622>`_:
  Allow pathlib.Path objects for HTML, CSS and Attachment classes
* `#674 <https://github.com/Kozea/WeasyPrint/issues/674>`_:
  Add extensive installation instructions for Windows

Bug fixes:

* `#558 <https://github.com/Kozea/WeasyPrint/issues/558>`_:
  Fix attachments
* `#565 <https://github.com/Kozea/WeasyPrint/issues/565>`_,
  `#596 <https://github.com/Kozea/WeasyPrint/issues/596>`_,
  `#539 <https://github.com/Kozea/WeasyPrint/issues/539>`_:
  Fix many PDF rendering, printing and compatibility problems
* `#614 <https://github.com/Kozea/WeasyPrint/issues/614>`_:
  Avoid crashes and endless loops caused by a Pango bug
* `#662 <https://github.com/Kozea/WeasyPrint/pull/662>`_:
  Fix warnings and errors when generating documentation
* `#666 <https://github.com/Kozea/WeasyPrint/issues/666>`_,
  `#685 <https://github.com/Kozea/WeasyPrint/issues/685>`_:
  Fix many table layout rendering problems
* `#680 <https://github.com/Kozea/WeasyPrint/pull/680>`_:
  Don't crash when there's no font available
* `#662 <https://github.com/Kozea/WeasyPrint/pull/662>`_:
  Fix support of some align values in tables

RafaelLinux commented Nov 15, 2018

Well, I need some help at this point, because I have no Python knowledge ... I tried again to update Python with apt-get install python, and the system answered that python is already at the newest version (2.7.13-2). However, as you noticed, WeasyPrint is using Python 3.5 and I want to update it to 3.6 or 3.7, but where is it located, and how do I update the Python version that WeasyPrint is using? I'm sorry, but I'm lost at this point ...

# pip3 install --upgrade weasyprint
Collecting weasyprint
  Downloading https://files.pythonhosted.org/packages/39/70/160b94a31be9151cbdf582206e3d8392c8ec11b1f430c4759e4c5a095f3f/WeasyPrint-43-py3-none-any.whl (353kB)
    100% |████████████████████████████████| 358kB 2.5MB/s
Requirement already up-to-date: cairocffi>=0.9.0 in /usr/local/lib/python3.5/dist-packages (from weasyprint)
Requirement already up-to-date: html5lib>=0.999999999 in /usr/local/lib/python3.5/dist-packages (from weasyprint)
Requirement already up-to-date: CairoSVG>=1.0.20 in /usr/local/lib/python3.5/dist-packages (from weasyprint)
Requirement already up-to-date: cssselect2>=0.1 in /usr/local/lib/python3.5/dist-packages (from weasyprint)
Requirement already up-to-date: Pyphen>=0.8 in /usr/local/lib/python3.5/dist-packages (from weasyprint)
Requirement already up-to-date: cffi>=0.6 in /usr/local/lib/python3.5/dist-packages (from weasyprint)
Requirement already up-to-date: tinycss2>=0.5 in /usr/local/lib/python3.5/dist-packages (from weasyprint)
Requirement already up-to-date: webencodings in /usr/local/lib/python3.5/dist-packages (from html5lib>=0.999999999->weasyprint)
Requirement already up-to-date: six>=1.9 in /usr/local/lib/python3.5/dist-packages (from html5lib>=0.999999999->weasyprint)
Requirement already up-to-date: pillow in /usr/local/lib/python3.5/dist-packages (from CairoSVG>=1.0.20->weasyprint)
Requirement already up-to-date: defusedxml in /usr/local/lib/python3.5/dist-packages (from CairoSVG>=1.0.20->weasyprint)
Requirement already up-to-date: pycparser in /usr/local/lib/python3.5/dist-packages (from cffi>=0.6->weasyprint)
Installing collected packages: weasyprint
  Found existing installation: WeasyPrint file-.weasyprint-VERSION
    Uninstalling WeasyPrint-file-.weasyprint-VERSION:
      Successfully uninstalled WeasyPrint-file-.weasyprint-VERSION
Successfully installed weasyprint-43

Thank you in advance

Member

liZe commented Nov 15, 2018

> Well, I need some help at this point, because I have no Python knowledge

No problem, I'm here to help! What's your Linux distribution and its version? (You can get this information with lsb_release -a or cat /etc/*-release for example.)

RafaelLinux commented Nov 15, 2018

Sorry, I should have written that info before.

# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 9.5 (stretch)
Release:        9.5
Codename:       stretch

Member

liZe commented Nov 15, 2018

Then your version of Python 3 comes from the official Debian package. There's probably no easy way to get Python 3.6 or 3.7 cleanly installed…

According to #70, Python 3.6 helped to save about 25% of memory consumption. Using this approximate value, rendering this page would eat about 17GB with Python 3.5 on your server, and your 16GB RAM + 4GB Swap may not be enough.
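
The estimate above can be reproduced with a line of arithmetic. This is a rough sketch that treats the 25% figure as an exact saving relative to Python 3.5, which it is not:

```python
measured_gb = 12.7   # peak memory reported earlier in the thread with Python 3.7
saving = 0.25        # approximate saving attributed to Python 3.6 in #70

# If 3.6+ uses 75% of what 3.5 uses, divide by 0.75 to go back.
estimated_py35_gb = measured_gb / (1 - saving)
ram_gb, swap_gb = 16, 4

print(f"estimated with Python 3.5: {estimated_py35_gb:.1f} GB")
print(f"fits in RAM alone: {estimated_py35_gb <= ram_gb}")
print(f"fits in RAM + swap: {estimated_py35_gb <= ram_gb + swap_gb}")
```

The estimate lands above the 16GB of RAM, so the run would depend on swap, which matches the slowdown described earlier in the thread.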

I have to take the time to profile memory consumption, like it was done on #70.

RafaelLinux commented Nov 16, 2018

Thank you liZe. We just upgraded to Debian 9.6, but Python 3 remains as it was (2.7 according to apt-get install). Anyway, we could even start working with WeasyPrint in production, being aware that we must upgrade our server as soon as possible, because those cases are very uncommon and can be handled manually. But we can't go to production until bug #36 has been solved, along with the related (IMHO) collateral problem of losing tables (see the last part of my last post in thread #727, which was not replied to :( ).

So, for this issue, as we can't try your patch and it's solved, I won't ask more about it until we can give your fix a try!!!! Thank you again!!
