skip_first_whitespace IndexError #783

knixeur · 2019-02-01T11:15:14Z

Hi there, thanks for creating WeasyPrint, it does a really good job!
I've been using it with CKEditor, which generates really ugly HTML (especially when someone pastes a Word document into it).
It worked fine until recently, when I hit a bug: the combination of style and characters in this HTML seems to break it. I stripped it down as far as I could (the original HTML was huge, with lots of attributes).

It seems to be a corner case in skip_first_whitespace.
I've created a minimal test (style + HTML) that fails. If I change some letters or remove the style it works, so it seems to be an edge combination.
I'm not sure how to fix it properly. I modified the code to keep rendering (and it seems fine so far) by catching the IndexError exception.

This is a test case to reproduce the bug:
knixeur@64b66f9

@assert_no_logs
def test_page_and_linebox_breaking_out_of_range():
    page_content = (
    '''
    <style>
    @page :right { margin-left: 4cm; }
    @page :left { margin-right: 4cm; }
    @page { size: legal; }
    </style>
    <p><span><span>'''
    + ('&nbsp;' * 42) + ''' </span>*</span><span>* ********* ******* *** ****'''
    + ''' ** ** ******* ***aaaaaaaa aaaaaaaa *** ** aallaala a *** aaalaaaaaa'''
    + '''aaaa aa Juplloia Naalaaai lAo. *** IAIAIAIAIA, aaaaaaa *** Ao. 66/99'''
    + ''') ** ** ******* *** <b><i>l</i></b></span><b><i><span>aa aaaaaaaaaa '''
    + '''aaaaaaaaaa <u>aa aaaaaaaaaaaaaaaaaa</u> aaaaaaa aaaaaa aa aaaaaaa aa'''
    + '''aaaaa aa lollo</span></i></b></p>
    ''')
    # It works with render_pages/FakeHTML, I guess it's because of the lighter
    # stylesheet
    # render_pages(page_content)
    # FakeHTML(string=page_content).render()
    from ... import HTML
    HTML(string=page_content).render()

The "fix"
knixeur@340a8c7

------------------------ weasyprint/layout/inlines.py -------------------------
index 240330b0..8a287b01 100644
@@ -204,5 +204,8 @@ def skip_first_whitespace(box, skip_stack):
         if index == 0 and not box.children:
             return None
-        result = skip_first_whitespace(box.children[index], next_skip_stack)
+        try:
+            result = skip_first_whitespace(box.children[index], next_skip_stack)
+        except IndexError:
+            return None
         if result == 'continue':
             index += 1

Stack trace of the test when run against master

weasyprint/layout/inlines.py:56: in iter_line_boxes
    device_size, absolute_boxes, fixed_boxes, first_letter_style)
weasyprint/layout/inlines.py:73: in get_next_linebox
    skip_stack = skip_first_whitespace(linebox, skip_stack)
weasyprint/layout/inlines.py:206: in skip_first_whitespace
    result = skip_first_whitespace(box.children[index], next_skip_stack)
weasyprint/layout/inlines.py:206: in skip_first_whitespace
    result = skip_first_whitespace(box.children[index], next_skip_stack)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

box = <InlineBox b>, skip_stack = (113, None)

    def skip_first_whitespace(box, skip_stack):
        """Return the ``skip_stack`` to start just after the remove spaces
        at the beginning of the line.
    
        See http://www.w3.org/TR/CSS21/text.html#white-space-model
        """
        if skip_stack is None:
            index = 0
            next_skip_stack = None
        else:
            index, next_skip_stack = skip_stack
    
        if isinstance(box, boxes.TextBox):
            assert next_skip_stack is None
            white_space = box.style['white_space']
            length = len(box.text)
            if index == length:
                # Starting a the end of the TextBox, no text to see: Continue
                return 'continue'
            if white_space in ('normal', 'nowrap', 'pre-line'):
                while index < length and box.text[index] == ' ':
                    index += 1
            return (index, None) if index else None
    
        if isinstance(box, (boxes.LineBox, boxes.InlineBox)):
            if index == 0 and not box.children:
                return None
>           result = skip_first_whitespace(box.children[index], next_skip_stack)
E           IndexError: list index out of range

weasyprint/layout/inlines.py:206: IndexError

Let me know if I can help in any way. I tried to follow the code to find the real cause but couldn't, and I have to move on to other things.

Edit: fixed formatting
Edit2: inlined test and "fix"

Tontyna · 2019-02-01T13:43:26Z

It looks like, in skip_first_whitespace, the skip_stack and the box's children are out of sync.

When the exception happens, the given skip_stack is (113, None), but the given InlineBox has only 1 child, and consequently in line 206

        result = skip_first_whitespace(box.children[index], next_skip_stack)

the recursive call with box.children[113] fails with IndexError.

The InlineBox concerned is the <b> wrapping the l. I don't know who could have put an index of 113 into the skip_stack, or where...
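The mismatch can be shown with a toy model. This is only a sketch: the TextBox and InlineBox classes below are minimal stand-ins, not WeasyPrint's real box classes, and the function is a cut-down version of the one shown in the traceback.

```python
class TextBox:
    """Hypothetical stand-in for a text box: just holds a string."""
    def __init__(self, text):
        self.text = text


class InlineBox:
    """Hypothetical stand-in for an inline box: just holds children."""
    def __init__(self, *children):
        self.children = list(children)


def skip_first_whitespace(box, skip_stack):
    """Simplified version of the function from the traceback."""
    index, next_skip_stack = skip_stack if skip_stack else (0, None)
    if isinstance(box, TextBox):
        # For a TextBox the index is a character offset.
        while index < len(box.text) and box.text[index] == ' ':
            index += 1
        return (index, None) if index else None
    # For an InlineBox the index is a child index.
    return skip_first_whitespace(box.children[index], next_skip_stack)


bold = InlineBox(TextBox('l'))  # one child only, like the <b> box
try:
    skip_first_whitespace(bold, (113, None))
    error = None
except IndexError as exc:
    error = exc
```

The same pair (113, None) would be harmless if it were applied to a TextBox with more than 113 characters, which hints that the index was meant for a different kind of box.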

BTW: 113 is a prime number 😬

knixeur · 2019-02-01T14:01:35Z

Lol, IIRC the real HTML crashed with (107, None) 😆

Edit: Confirmed, it's the prime number bug!

(Pdb) skip_stack
(107, None)

Tontyna · 2019-02-01T14:07:57Z

Yep, let's call it the prime number bug

liZe · 2019-03-01T15:09:20Z

Another problem, probably related:

<p>*<span>****************************************** *** **** ** ** ******* *********** ************************************************************************* <b>l</b></span><b>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</b></p>

Tontyna · 2019-03-14T23:08:42Z

Narrowing down the issue reveals that the prime number isn't meant to be the index of a child box. It is the index of the character (a space) in a TextBox's text where the overlong text should be, or has been, broken.

After splitting all the text snippets from the loooong text into separate LineBoxes, the resume_at (aka skip_stack) returned by split_inline_box() manages to point at the box after the TextBox, which in the example given by @knixeur happens to be the bold <InlineBox b>.
But somehow, sadly, it leaves the last entry of the skip_stack still pointing to the character where the last text split happened. This last entry, containing (<textbreakpos>, None), should of course be replaced with None. I don't (yet) know why it isn't.

Subsequently, skip_first_whitespace() tries to access the nonexistent child number <textbreakpos> of the InlineBox and raises IndexError.
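To make the "stale last entry" concrete, here is a small sketch of a consistency check for a skip stack. It assumes skip stacks are nested (index, rest) tuples as in the traceback. The box classes and check_skip_stack are hypothetical helpers for illustration, not part of WeasyPrint.

```python
class TextBox:
    """Hypothetical stand-in for a text box."""
    def __init__(self, text):
        self.text = text


class InlineBox:
    """Hypothetical stand-in for an inline or line box."""
    def __init__(self, *children):
        self.children = list(children)


def check_skip_stack(box, skip_stack):
    """Return True if every index in skip_stack fits the box it applies to."""
    while skip_stack is not None:
        index, skip_stack = skip_stack
        if isinstance(box, TextBox):
            # A text offset must be the last level of the stack.
            return skip_stack is None and 0 <= index <= len(box.text)
        if not 0 <= index < len(box.children):
            return False
        box = box.children[index]
    return True


# Roughly <p><span>long text <b>l</b></span>x</p>
span = InlineBox(TextBox('*' * 80 + ' *** '), InlineBox(TextBox('l')))
line = InlineBox(span, TextBox('x'))

good = (0, (1, None))         # resume at the <b> box
stale = (0, (1, (73, None)))  # text offset 73 left over, applied to <b>
```

With these definitions, check_skip_stack(line, good) holds while check_skip_stack(line, stale) does not, because the leftover 73 is read as a child index of the one-child <b> box.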

Painstaking preparation is required to trigger this erroneous skip_stack: a particular nesting of InlineBoxes and TextBoxes, combined with the right (wrong?) text content, page width and font size. What seems to be crucial is a TextBox whose text extends close to the right margin, plus a following InlineBox (<b>), both enclosed in an InlineBox (<span>) that is immediately followed by non-space text (x).

This fine-tuned snippet

<p><span>
********************************* *******
********* *************** *** 
********* *************** *** 
********************************** ** ******* *** <b>l</b></span>x</p>

crashes with prime number 73 😄

@liZe Your snippet is a different issue. It just doesn't break where it ought to. The skip_stack is fine throughout, with no leftovers from TextBoxes.

Tontyna · 2019-03-17T00:56:02Z

Found the bug but cannot fix it. Happens in the most ugly most dirty most abominable part of the inline layout code in split_inline_box:

WeasyPrint/weasyprint/layout/inlines.py

Lines 854 to 859 in fc089f5

    
           # We have to check whether the child we're breaking
           # is the one broken by the initial skip stack.
           broken_child = bool(
               initial_skip_stack and
               initial_skip_stack[0] == child_index and
               initial_skip_stack[1])

BTW: The broken_child-detection was introduced to fix #580.

Under prime number bug conditions broken_child evaluates to true, and the subsequent reconstruction of the skip_stack ("adding skip stacks is a bit complicated") produces the invalid stack that finally raises the IndexError.

If broken_child is forcibly set to False and the reconstruction of the stack is skipped, everything is fine. Well, not everything, of course: #580 is broken again.

So obviously the algorithm to detect broken_child is incomplete/wrong/lacking something. But it's beyond my skills to tweak it.

This minimal snippet crashes with prime number 1:

<p><span>
**************************************************************
***********
<b>fits</b></span>xxx</p>

And yes, the three whitespaces (here: line breaks) are vital to the IndexError. Equally vital: no whitespace in front of the xxx.

Tontyna · 2019-03-17T11:36:13Z

Recipe to reproduce the IndexError:

<p><span title="span required to trigger the bug">
*breakable*text* *followed*by*withespace*
<b>+fit+</b></span>CrossingTheMarginWithoutBreakRaisesIndexError</p>

Though @liZe's snippet doesn't crash with IndexError, it's related -- of course! He was right!
It reads like this:

<p><span title="span required to trigger the bug">
*non*breaking*text**followed*by*withespace*
<b>+fit+</b></span>CrossingTheMarginWithoutBreakExtendsTheLineUntil next whitespace goes on the next line</p>

The crash is prevented because

WeasyPrint/weasyprint/layout/inlines.py

Line 820 in fc089f5

can_break_inside(child)):

returns False for the *non*breaking*text**followed*by*withespace*, while for the *breakable*text* *followed*by*withespace* we get True and enter "The dirty solution".

Tontyna · 2019-03-17T12:12:37Z

BTW: Taking "The dirty solution" path (i.e. calling split_inline_level one more time), but avoiding the recreation of the skip_stack, seems to fix both the IndexError and the CrossingTheMarginShouldWrap bug.
At least it does if the CrossingTheMarginWithoutBreak string doesn't cross the margin of the next line, too.

Related to #783.

liZe · 2019-03-18T12:38:55Z

Happens in the most ugly most dirty most abominable part of the inline layout code in split_inline_box

At least it's documented now, it saves me hours each time I have to understand it again 😉.

So obviously the algorithm to detect broken_child is incomplete/wrong/lacking something.

Exactly. We had to check that the split child is the same, not only at level 1 but for all nested children, until we find the same text box. It's fixed now.
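As an illustration only, the idea of comparing every nested level rather than just the first one can be sketched like this. This is not the code from 8724bc3. The function name is hypothetical, and it assumes both stacks are expressed relative to the same parent box as nested (index, rest) tuples.

```python
def points_to_same_text_box(stack_a, stack_b):
    """True if both stacks follow the same child indices down to a text offset."""
    while stack_a is not None and stack_b is not None:
        (index_a, rest_a), (index_b, rest_b) = stack_a, stack_b
        if rest_a is None and rest_b is None:
            # Last level on both sides: these are offsets inside the text,
            # which may differ without changing the box that is broken.
            return True
        if index_a != index_b:
            return False
        stack_a, stack_b = rest_a, rest_b
    # One stack ended before the other: they stop at different depths.
    return False


initial = (0, (2, (40, None)))
same_box = (0, (2, (73, None)))     # same path, different text offset
other_box = (0, (3, (5, None)))     # same first level, different nested child
shallower = (0, (2, None))          # stops at the inline box itself
```

A check limited to the first level (initial[0] == other_box[0]) would treat other_box as the same broken child, whereas the full walk tells them apart.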

Though @liZe's snippet doesn't crash with IndexError, it's related -- of course! He was right!

8724bc3 fixes this bug too, but … there's another bug in can_break_inside that is unable to detect a line break in aaa bbb. I've added a failing test in e7fd37b.

knixeur · 2019-03-18T13:01:43Z

Thanks @Tontyna and @liZe !

Tontyna · 2019-03-18T19:55:14Z

bug in can_break_inside that is unable to detect a line break in aaa bbb

That's because can_break_text() from text.py returns False for 'aaa ', i.e. a string that ends with a space.

That, in turn, is because can_break_text() checks the is_line_break attribute of PangoLogAttr. And indeed, a trailing whitespace isn't a line break. In fact, whitespace is never a line break.
If I understand it correctly, pango.pango_get_log_attrs() sets this attribute on characters that could become the first character of a new line, that is, the character that follows a possible line-breaking character (whitespace, punctuation, ...).
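A crude model of that behaviour, for illustration only: the functions below do not call Pango and only treat a plain space as a break opportunity, whereas the real rules are far richer. Both function names are hypothetical.

```python
def line_break_flags(text):
    """One flag per character: True if a new line could start at it.

    Toy rule: a character may start a line if it is not a space and
    the character before it is a space.
    """
    return [
        i > 0 and text[i - 1] == ' ' and text[i] != ' '
        for i in range(len(text))
    ]


def toy_can_break_text(text):
    """True if any character of the text could start a new line."""
    return any(line_break_flags(text))
```

In this model toy_can_break_text('aaa ') is False because nothing follows the space, while toy_can_break_text('aaa bbb') is True because the first b is flagged. That mirrors why a check run separately on 'aaa ' misses the break that only becomes visible once the following text is taken into account.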

It's a pity that we must take box.style['lang'] and box.style['white_space'] into account. Otherwise we could simply collect the text from the child's TextBoxes, including all the intermediate whitespaces, and pass that as a single string to can_break_text...

liZe · 2019-03-18T20:21:03Z

Yes, I've learned a lot of things fixing #301, and Unicode is really fascinating. I'm glad to have Pango 😄.

We already have (almost) the whole logic in WeasyPrint: split_inline_level is able to find correctly if we can split a line, and it takes care of line breaks with nested tags. We should use it instead of relying on can_break_inside (that's small and fast but definitely buggy).

Version 46
----------

Released on 2019-03-20.

Bug fixes:

* `#783 <https://github.com/Kozea/WeasyPrint/issues/783>`_: Fix a couple of crashes with strange texts

liZe added the crash Problems preventing documents from being rendered label Feb 1, 2019

liZe closed this as completed in 8724bc3 Mar 18, 2019

liZe added a commit that referenced this issue Mar 18, 2019

Add broken test

e7fd37b

Related to #783.

liZe added this to the 46 milestone Mar 18, 2019

Tontyna mentioned this issue Aug 22, 2019

Freeze on processing text with some punctuation marks #923

Closed

stefanw mentioned this issue Sep 27, 2019

IndexError: tuple index out of range with weasyprint 50 #953

Closed

em1le mentioned this issue Dec 18, 2019

got exception : <class 'IndexError'> : list index out of range #1009

Closed

izquierdo mentioned this issue Dec 20, 2019

IndexError in skip_first_whitespace #1011

Closed
