Freeze on processing text with some punctuation marks #923

Closed
aiharisov opened this issue Aug 16, 2019 · 7 comments

Comments

@aiharisov

commented Aug 16, 2019

Trying to convert this table (attached as test.txt).
The process starts to consume a lot of processor time (near 25%) and RAM. RAM consumption increases over time, which looks like a memory leak.
WeasyPrint==48

@Tontyna

Contributor

commented Aug 16, 2019

No freeze here. 3 pages generated in a flash.

@liZe

Member

commented Aug 17, 2019

> The process starts to consume a lot of processor time (near 25%) and RAM. RAM consumption increases over time, which looks like a memory leak.

If it never finishes, it's probably an infinite loop.

That's strange, I've received a personal mail from someone with quite the same problem, but I can't reproduce either. It includes tables too.

@aiharisov Could you please give your OS and the versions of Cairo and Pango you're using? It would be also useful to check that the number of pages is growing using the -v option when launching WeasyPrint.
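
A small watchdog can help tell "slow" apart from "never finishes". The following is only a sketch: `finishes_within` is a hypothetical helper, not part of WeasyPrint, and it assumes the conversion is started as a separate command (for example the `weasyprint` command-line tool with the `-v` option mentioned above) so that it can be killed after a timeout.

```python
import subprocess
import sys


def finishes_within(command, seconds):
    """Return True if `command` exits within `seconds`, False otherwise.

    `command` is an argument list as accepted by subprocess.run().
    On timeout the child process is killed, so a frozen conversion
    does not keep eating CPU and RAM.
    """
    try:
        subprocess.run(
            command,
            timeout=seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return False
    return True


# Stand-in commands so the sketch runs without WeasyPrint installed.
quick = [sys.executable, '-c', 'pass']
frozen = [sys.executable, '-c', 'while True: pass']
print(finishes_within(quick, 30))
print(finishes_within(frozen, 1))
```

With WeasyPrint installed and on the PATH, the command could be something like `['weasyprint', '-v', 'test1.txt', 'test1.pdf']`; the file names here are only placeholders.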

@aiharisov

Author

commented Aug 19, 2019

A new example is attached: test1.txt

For testing I've been using this script:

import os

from weasyprint import HTML

with open(os.path.join('D:', 'test.txt'), 'r', encoding='utf-8') as f:
    text = f.read()
pdf_file = HTML(string=text).write_pdf()
with open(os.path.join('D:', 'test.txt.pdf'), 'wb') as file:
    file.write(pdf_file)

System: Windows 10 Pro x64
GTK runtime installer used: gtk3-runtime-3.24.10-2019-08-05-ts-win64.exe
Installed packages (pip):
Package Version
allure-pytest 2.6.2
allure-python-commons 2.6.2
apipkg 1.5
atomicwrites 1.3.0
attrs 19.1.0
beautifulsoup4 4.7.1
cached-property 1.5.1
cairocffi 1.0.2
CairoSVG 2.4.0
certifi 2019.6.16
cffi 1.12.3
chardet 3.0.4
colorama 0.4.1
coverage 4.5.4
cssselect2 0.2.1
Cython 0.29.10
defusedxml 0.6.0
Django 2.1.5
django-cors-headers 2.4.0
djangorestframework 3.9.2
dnspython 1.15.0
execnet 1.7.0
gevent 1.4.0
greenlet 0.4.15
gunicorn 19.9.0
html5lib 1.0.1
idna 2.8
importlib-metadata 0.19
Jinja2 2.10
jsonschema 3.0.1
MarkupSafe 1.1.1
more-itertools 7.2.0
numpy 1.16.0
pandas 0.23.4
pep8 1.7.1
Pillow 5.2.0
pip 19.2.2
pluggy 0.12.0
psycopg2-binary 2.7.6.1
py 1.8.0
pycparser 2.19
pydevd 1.5.1
PyPDF2 1.26.0
Pyphen 0.9.5
pyrsistent 0.15.4
pytest 3.10.1
pytest-cache 1.0
pytest-cov 2.6.0
pytest-django 3.4.4
pytest-pep8 1.0.6
python-dateutil 2.8.0
pytz 2018.9
requests 2.21.0
setuptools 40.6.2
six 1.12.0
soupsieve 1.9.3
tinycss2 1.0.2
urllib3 1.24.3
WeasyPrint 48
webencodings 0.5.1
XlsxWriter 1.1.8
zipp 0.5.2

@Tontyna

Contributor

commented Aug 19, 2019

With test1.txt I can reproduce the freeze. Will dive into it tonight...

@aiharisov To create a PDF you don't need to open and write the files yourself. Let WeasyPrint do that for you:

import os
from weasyprint import HTML

input_filename = os.path.join('D:', 'test.txt')
output_filename = os.path.join('D:', 'test.txt.pdf')

HTML(input_filename).write_pdf(output_filename)
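
A side note on the paths used in both scripts: on Windows, `os.path.join('D:', 'test.txt')` gives a path relative to the current directory on drive D:, not a file in the root of the drive. This is unrelated to the freeze, but it can make the script read a different file than expected. The sketch below uses `ntpath` (the Windows flavour of `os.path`, importable on any platform) to show the difference.

```python
import ntpath  # Windows implementation of os.path, usable on any OS

# Joining a bare drive letter gives a drive-relative path.
drive_relative = ntpath.join('D:', 'test.txt')

# Adding the root separator gives an absolute path.
absolute = ntpath.join('D:\\', 'test.txt')

print(drive_relative)                # D:test.txt
print(absolute)                      # D:\test.txt
print(ntpath.isabs(drive_relative))  # False
print(ntpath.isabs(absolute))        # True
```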
@Tontyna

Contributor

commented Aug 20, 2019

It's not the table but the strange cell content:

<p>
Adverse drug experience 
[see <u>
<a href="">21 CFR Part 314.80 (a)</a>
<idrac> (IDRAC 8891)</idrac></u> ]: 
</p>

If this funny paragraph is rendered with a certain width, the line iterator iter_line_boxes(), called in blocks.block_container_layout, never finishes yielding.

The surrounding table merely produces the particular width that triggers the infinite loop.

Minimal example to freeze WeasyPrint:

<p style="width:130px">
<span>
<span>xxxxxx YYY yyyyyy yyy</span>
<span>ZZZZZZ zzzzz
</span></span>)x
</p>

Of course, the special width value depends on the font size.

The three spans are required. The block must also end with </span></span>)x: no space between the two </span> tags, followed by a closing bracket and another letter.
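
Because the triggering width depends on the font, 130px may not freeze every setup. A possible way to hunt for the critical value is to generate the same paragraph for a range of widths. The helpers below are hypothetical names made up for this sketch; only `HTML(string=...).write_pdf()` comes from the script earlier in this thread. Since a triggering width may never return, each render is best run in a separate process with a timeout.

```python
def minimal_example(width_px):
    """Return the minimal freezing paragraph with the given width in px."""
    return (
        '<p style="width:%dpx">\n'
        '<span>\n'
        '<span>xxxxxx YYY yyyyyy yyy</span>\n'
        '<span>ZZZZZZ zzzzz\n'
        '</span></span>)x\n'
        '</p>\n' % width_px
    )


def candidate_documents(widths):
    """Map each width to the HTML document to try."""
    return {width: minimal_example(width) for width in widths}


def render(html):
    """Render with WeasyPrint (imported lazily, may hang on a bad width)."""
    from weasyprint import HTML
    return HTML(string=html).write_pdf()


# Build candidates around the reported width without rendering them.
documents = candidate_documents(range(125, 136))
print(sorted(documents))
```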

@liZe as always: I'm able to encircle the root of evil in the layout source but cannot fix it. It's probably another split_first_line jumping back and forth, unable to decide where to cut, cf. #660 (comment)

@liZe

Member

commented Aug 21, 2019

> @liZe as always: I'm able to encircle the root of evil in the layout source but cannot fix it. It's probably another split_first_line jumping back and forth, unable to decide where to cut, cf. #660 (comment)

Reading test_breaking_linebox_regression_* makes me sad, but it's way better than reading split_first_line. I know I'll have to rewrite text.py when I add RTL and bidi support.

@Tontyna

Contributor

commented Aug 22, 2019

More dragons lurk inside innocent nested spans followed by bracket-and-letter. Slight modifications of the minimal example raise other errors.

IndexError: list index out of range in skip_first_whitespace():

<p  style="width:130px">
<span>
<span>xxxxxx YYY yyyyyy yyy</span>
ZZZZZZ zzzzz
</span> )x 
</p>

That's the prime number bug #783, resurrected. Not joking. The skip stack is (0, (0, (13, None)))
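
For readers unfamiliar with the notation: as far as I understand it, a skip stack is a nested tuple of the form (index, sub_stack), where each level says which child to resume in and the innermost index is the position inside the text, with None ending the chain. The helper below is a hypothetical illustration (it is not a WeasyPrint function) that flattens such a stack into a plain list of indices.

```python
def skip_stack_path(skip_stack):
    """Flatten a nested (index, sub_stack) tuple into a list of indices."""
    path = []
    while skip_stack is not None:
        index, skip_stack = skip_stack
        path.append(index)
    return path


# The skip stack quoted above: first child, then its first child,
# then position 13 at the innermost level.
print(skip_stack_path((0, (0, (13, None)))))  # [0, 0, 13]
```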

AssertionError: assert next_skip_stack is None in skip_first_whitespace():

<p style="width:130px">
<span>
xxxxxx YYY yyyyyy yyy
<span>ZZZZZZ zzzzz
</span></span>)x: 
</p>

That's the prime number bug again, in disguise.

The prime number bug was resolved by introducing same_broken_child in split_inline_box.

This bug requires something similar, named is_this_child_a_box_we_already_failed_to_break, but I don't know how to implement that.

The fact is: if I set broken_child = True in split_inline_box(), at the line

broken_child = same_broken_child(

then neither the infinite loop nor the IndexError nor the AssertionError occurs when the critical width hits the nested spans followed by bracket and letter.

@liZe liZe added the crash label Aug 27, 2019
@liZe liZe changed the title from "Freeze on processing tables" to "Freeze on processing text with some punctuation marks" Sep 5, 2019
grewn0uille added a commit that referenced this issue Sep 12, 2019
The old code assumed that both skip stacks were absolute, but for the second
one previous children have already been skipped. We now check that we're in
the first child at each level, meaning that we're still breaking the same
child.

Related to #923.
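
The idea in this commit message can be illustrated with a small sketch. This is NOT the code from the commit: `still_breaking_same_child` is a hypothetical function written only from the description above, assuming skip stacks are nested (index, sub_stack) tuples, that the first stack is absolute, and that the second is relative to children that have already been skipped. Under these assumptions, the relative stack points at the same child only if it has the same depth and stays in the first child (index 0) at every level above the innermost one.

```python
def still_breaking_same_child(original, relative):
    """Guess whether a relative skip stack targets the same broken child.

    `original` is treated as absolute, `relative` as relative to already
    skipped children. Only the depth of `original` is used; in `relative`
    every index above the innermost level must be 0.
    """
    while original is not None and relative is not None:
        index, relative_rest = relative
        if relative_rest is not None and index != 0:
            # We moved on to a later child at this level.
            return False
        original = original[1]
        relative = relative_rest
    # Both stacks must end at the same depth.
    return original is None and relative is None


original = (0, (0, (13, None)))
print(still_breaking_same_child(original, (0, (0, (5, None)))))  # True
print(still_breaking_same_child(original, (0, (1, (5, None)))))  # False
```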
@liZe liZe closed this in f1b1d14 Sep 13, 2019
@liZe liZe added this to the 50 milestone Sep 13, 2019
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Oct 9, 2019
Version 50
----------

Released on 2019-09-19.

New features:

* `#209 <https://github.com/Kozea/WeasyPrint/issues/209>`_:
  Make ``break-*`` properties work inside tables
* `#661 <https://github.com/Kozea/WeasyPrint/issues/661>`_:
  Make blocks with ``overflow: auto`` grow to include floating children

Bug fixes:

* `#945 <https://github.com/Kozea/WeasyPrint/issues/945>`_:
  Don't break pages between a list item and its marker
* `#727 <https://github.com/Kozea/WeasyPrint/issues/727>`_:
  Avoid tables lost between pages
* `#831 <https://github.com/Kozea/WeasyPrint/issues/831>`_:
  Ignore auto margins on flex containers
* `#923 <https://github.com/Kozea/WeasyPrint/issues/923>`_:
  Fix a couple of crashes when splitting a line twice
* `#896 <https://github.com/Kozea/WeasyPrint/issues/896>`_:
  Fix skip stack order when using a reverse flex direction

Contributors:

- grewn0uille
- Guillaume Ayoub

Version 49
----------

Released on 2019-09-11.

Performance:

* Speed and memory use have been largely improved.

New features:

* `#700 <https://github.com/Kozea/WeasyPrint/issues/700>`_:
  Handle ``::marker`` pseudo-selector
* `135dc06c <https://github.com/Kozea/WeasyPrint/commit/135dc06c>`_:
  Handle ``recto`` and ``verso`` parameters for page breaks
* `#907 <https://github.com/Kozea/WeasyPrint/pull/907>`_:
  Provide a clean way to build layout contexts

Bug fixes:

* `#937 <https://github.com/Kozea/WeasyPrint/issues/937>`_:
  Fix rendering of tables with empty lines and rowspans
* `#897 <https://github.com/Kozea/WeasyPrint/issues/897>`_:
  Don't crash when small columns are wrapped in absolute blocks
* `#913 <https://github.com/Kozea/WeasyPrint/issues/913>`_:
  Fix a test about gradient colors
* `#924 <https://github.com/Kozea/WeasyPrint/pull/924>`_:
  Fix title for document with attachments
* `#917 <https://github.com/Kozea/WeasyPrint/issues/917>`_:
  Fix tests with Pango 1.44
* `#919 <https://github.com/Kozea/WeasyPrint/issues/919>`_:
  Fix padding and margin management for column flex boxes
* `#901 <https://github.com/Kozea/WeasyPrint/issues/901>`_:
  Fix width of replaced boxes with no intrinsic width
* `#906 <https://github.com/Kozea/WeasyPrint/issues/906>`_:
  Don't respect table cell width when content doesn't fit
* `#927 <https://github.com/Kozea/WeasyPrint/pull/927>`_:
  Don't use deprecated ``logger.warn`` anymore
* `a8662794 <https://github.com/Kozea/WeasyPrint/commit/a8662794>`_:
  Fix margin collapsing between caption and table wrapper
* `87d9e84f <https://github.com/Kozea/WeasyPrint/commit/87d9e84f>`_:
  Avoid infinite loops when rendering columns
* `789b80e6 <https://github.com/Kozea/WeasyPrint/commit/789b80e6>`_:
  Only use in flow children to set columns height
* `615e298a <https://github.com/Kozea/WeasyPrint/commit/615e298a>`_:
  Don't include floating elements each time we try to render a column
* `48d8632e <https://github.com/Kozea/WeasyPrint/commit/48d8632e>`_:
  Avoid not in flow children to compute column height
* `e7c452ce <https://github.com/Kozea/WeasyPrint/commit/e7c452ce>`_:
  Fix collapsing margins for columns
* `fb0887cf <https://github.com/Kozea/WeasyPrint/commit/fb0887cf>`_:
  Fix crash when using currentColor in gradients
* `f66df067 <https://github.com/Kozea/WeasyPrint/commit/f66df067>`_:
  Don't crash when using ex units in word-spacing or letter-spacing
* `c790ff20 <https://github.com/Kozea/WeasyPrint/commit/c790ff20>`_:
  Don't crash when properties needing base URL use var functions
* `d63eac31 <https://github.com/Kozea/WeasyPrint/commit/d63eac31>`_:
  Don't crash with object-fit: non images with no intrinsic size

Documentation:

* `#900 <https://github.com/Kozea/WeasyPrint/issues/900>`_:
  Add documentation about semantic versioning
* `#692 <https://github.com/Kozea/WeasyPrint/issues/692>`_:
  Add a snippet about PDF magnification
* `#899 <https://github.com/Kozea/WeasyPrint/pull/899>`_:
  Add .NET wrapper link
* `#893 <https://github.com/Kozea/WeasyPrint/pull/893>`_:
  Fixed wrong nested list comprehension example
* `#902 <https://github.com/Kozea/WeasyPrint/pull/902>`_:
  Add ``state`` to the ``make_bookmark_tree`` documentation
* `#921 <https://github.com/Kozea/WeasyPrint/pull/921>`_:
  Fix typos in the documentation
* `#328 <https://github.com/Kozea/WeasyPrint/issues/328>`_:
  Add CSS sample for forms

Contributors:

- grewn0uille
- Guillaume Ayoub
- Raphael Gaschignard
- Stani
- Szmen
- Thomas Dexter
- Tontyna

Version 48
----------

Released on 2019-07-08.

Dependencies:

* CairoSVG 2.4.0+ is now needed

New features:

* `#891 <https://github.com/Kozea/WeasyPrint/pull/891>`_:
  Handle ``text-overflow``
* `#878 <https://github.com/Kozea/WeasyPrint/pull/878>`_:
  Handle ``column-span``
* `#855 <https://github.com/Kozea/WeasyPrint/pull/855>`_:
  Handle all the ``text-decoration`` features
* `#238 <https://github.com/Kozea/WeasyPrint/issues/238>`_:
  Don't repeat background images when it's not needed
* `#875 <https://github.com/Kozea/WeasyPrint/issues/875>`_:
  Handle ``object-fit`` and ``object-position``
* `#870 <https://github.com/Kozea/WeasyPrint/issues/870>`_:
  Handle ``bookmark-state``

Bug fixes:

* `#686 <https://github.com/Kozea/WeasyPrint/issues/686>`_:
  Fix column balance when children are not inline
* `#885 <https://github.com/Kozea/WeasyPrint/issues/885>`_:
  Actually use the content box to resolve flex items percentages
* `#867 <https://github.com/Kozea/WeasyPrint/issues/867>`_:
  Fix rendering of KaTeX output, including (1) set row baseline of tables when
  no cells are baseline-aligned, (2) set baseline for inline tables, (3) don't
  align lines larger than their parents, (4) force CairoSVG to respect image
  size defined by CSS.
* `#873 <https://github.com/Kozea/WeasyPrint/issues/873>`_:
  Set a minimum height for empty list elements with outside marker
* `#811 <https://github.com/Kozea/WeasyPrint/issues/811>`_:
  Don't use translations to align flex items
* `#851 <https://github.com/Kozea/WeasyPrint/issues/851>`_,
  `#860 <https://github.com/Kozea/WeasyPrint/issues/860>`_:
  Don't cut pages when content overflows a very little bit
* `#862 <https://github.com/Kozea/WeasyPrint/issues/862>`_:
  Don't crash when using UTC dates in metadata

Documentation:

* `#854 <https://github.com/Kozea/WeasyPrint/issues/854>`_:
  Add a "Tips & Tricks" section

Contributors:

- Gabriel Corona
- Guillaume Ayoub
- Manuel Barkhau
- Nathan de Maestri
- grewn0uille
- theopeek

Version 47
----------

Released on 2019-04-12.

New features:

* `#843 <https://github.com/Kozea/WeasyPrint/pull/843>`_:
  Handle CSS variables
* `#846 <https://github.com/Kozea/WeasyPrint/pull/846>`_:
  Handle ``:nth()`` page selector
* `#847 <https://github.com/Kozea/WeasyPrint/pull/847>`_:
  Allow users to use a custom SSL context for HTTP requests

Bug fixes:

* `#797 <https://github.com/Kozea/WeasyPrint/issues/797>`_:
  Fix underlined justified text
* `#836 <https://github.com/Kozea/WeasyPrint/issues/836>`_:
  Fix crash when flex items are replaced boxes
* `#835 <https://github.com/Kozea/WeasyPrint/issues/835>`_:
  Fix ``margin-break: auto``

Version 46
----------

Released on 2019-03-20.

New features:

* `#771 <https://github.com/Kozea/WeasyPrint/issues/771>`_:
  Handle ``box-decoration-break``
* `#115 <https://github.com/Kozea/WeasyPrint/issues/115>`_:
  Handle ``margin-break``
* `#821 <https://github.com/Kozea/WeasyPrint/issues/821>`_:
  Continuous integration includes tests on Windows

Bug fixes:

* `#765 <https://github.com/Kozea/WeasyPrint/issues/765>`_,
  `#754 <https://github.com/Kozea/WeasyPrint/issues/754>`_,
  `#800 <https://github.com/Kozea/WeasyPrint/issues/800>`_:
  Fix many crashes related to the flex layout
* `#783 <https://github.com/Kozea/WeasyPrint/issues/783>`_:
  Fix a couple of crashes with strange texts
* `#827 <https://github.com/Kozea/WeasyPrint/pull/827>`_:
  Named strings and counters are case-sensitive
* `#823 <https://github.com/Kozea/WeasyPrint/pull/823>`_:
  Shrink min/max-height/width according to box-sizing
* `#728 <https://github.com/Kozea/WeasyPrint/issues/728>`_,
  `#171 <https://github.com/Kozea/WeasyPrint/issues/171>`_:
  Don't crash when fixed boxes are nested
* `#610 <https://github.com/Kozea/WeasyPrint/issues/610>`_,
  `#828 <https://github.com/Kozea/WeasyPrint/issues/828>`_:
  Don't crash when preformatted text lines end with a space
* `#808 <https://github.com/Kozea/WeasyPrint/issues/808>`_,
  `#387 <https://github.com/Kozea/WeasyPrint/issues/387>`_:
  Fix position of some images
* `#813 <https://github.com/Kozea/WeasyPrint/issues/813>`_:
  Don't crash when long preformatted text lines end with ``\n``

Documentation:

* `#815 <https://github.com/Kozea/WeasyPrint/pull/815>`_:
  Add documentation about custom ``url_fetcher``