Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skip_first_whitespace IndexError #783

Closed
knixeur opened this Issue Feb 1, 2019 · 12 comments

Comments

3 participants
@knixeur
Copy link

commented Feb 1, 2019

Hi there, thanks for creating WeasyPrint, it does a really good job!
I've been using it with CKEditor which generates really ugly HTML (specially when someone pastes a Word document into it).
It worked fine until recently where I hit a bug, it seems the combination of style + characters in this HTML breaks it, I stripped it down all I could (original HTML was huge with lot of attributes).

Seems to be a corner case in skip_first_whitespace.
I've created a minimal test (style + html) which fails, if I change some letters/remove style it works, it seems to be an edge combination.
I'm not sure how to properly fix it, I modified the code to keep rendering (and seems fine so far) by catching the IndexError exception.

This a a test case to reproduce the bug
https://github.com/knixeur/WeasyPrint/commit/64b66f909ce9273388420e6c49c524250f57beb6

@assert_no_logs
def test_page_and_linebox_breaking_out_of_range():
    page_content = (
    '''
    <style>
    @page :right { margin-left: 4cm; }
    @page :left { margin-right: 4cm; }
    @page { size: legal; }
    </style>
    <p><span><span>'''
    + ('&nbsp;' * 42) + ''' </span>*</span><span>* ********* ******* *** ****'''
    + ''' ** ** ******* ***aaaaaaaa aaaaaaaa *** ** aallaala a *** aaalaaaaaa'''
    + '''aaaa aa Juplloia Naalaaai lAo. *** IAIAIAIAIA, aaaaaaa *** Ao. 66/99'''
    + ''') ** ** ******* *** <b><i>l</i></b></span><b><i><span>aa aaaaaaaaaa '''
    + '''aaaaaaaaaa <u>aa aaaaaaaaaaaaaaaaaa</u> aaaaaaa aaaaaa aa aaaaaaa aa'''
    + '''aaaaa aa lollo</span></i></b></p>
    ''')
    # It works with render_pages/FakeHTML, I guess it's because of the lighter
    # stylesheet
    # render_pages(page_content)
    # FakeHTML(string=page_content).render()
    from ... import HTML
    HTML(string=page_content).render()

The "fix"
https://github.com/knixeur/WeasyPrint/commit/340a8c7a4977792a13773233482003f941514f97

------------------------ weasyprint/layout/inlines.py -------------------------
index 240330b0..8a287b01 100644
@@ -204,5 +204,8 @@ def skip_first_whitespace(box, skip_stack):
         if index == 0 and not box.children:
             return None
-        result = skip_first_whitespace(box.children[index], next_skip_stack)
+        try:
+            result = skip_first_whitespace(box.children[index], next_skip_stack)
+        except IndexError:
+            return None
         if result == 'continue':
             index += 1

Stack trace of the test when ran against master

weasyprint/layout/inlines.py:56: in iter_line_boxes
    device_size, absolute_boxes, fixed_boxes, first_letter_style)
weasyprint/layout/inlines.py:73: in get_next_linebox
    skip_stack = skip_first_whitespace(linebox, skip_stack)
weasyprint/layout/inlines.py:206: in skip_first_whitespace
    result = skip_first_whitespace(box.children[index], next_skip_stack)
weasyprint/layout/inlines.py:206: in skip_first_whitespace
    result = skip_first_whitespace(box.children[index], next_skip_stack)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

box = <InlineBox b>, skip_stack = (113, None)

    def skip_first_whitespace(box, skip_stack):
        """Return the ``skip_stack`` to start just after the remove spaces
        at the beginning of the line.
    
        See http://www.w3.org/TR/CSS21/text.html#white-space-model
        """
        if skip_stack is None:
            index = 0
            next_skip_stack = None
        else:
            index, next_skip_stack = skip_stack
    
        if isinstance(box, boxes.TextBox):
            assert next_skip_stack is None
            white_space = box.style['white_space']
            length = len(box.text)
            if index == length:
                # Starting a the end of the TextBox, no text to see: Continue
                return 'continue'
            if white_space in ('normal', 'nowrap', 'pre-line'):
                while index < length and box.text[index] == ' ':
                    index += 1
            return (index, None) if index else None
    
        if isinstance(box, (boxes.LineBox, boxes.InlineBox)):
            if index == 0 and not box.children:
                return None
>           result = skip_first_whitespace(box.children[index], next_skip_stack)
E           IndexError: list index out of range

weasyprint/layout/inlines.py:206: IndexError

Let me know if I can help you in any way, I tried to follow the code to find the real cause but couldn't and have to keep going on other stuff.

Edit: fixed formatting
Edit2: inlined test and "fix"

@liZe liZe added the crash label Feb 1, 2019

@Tontyna

This comment has been minimized.

Copy link
Contributor

commented Feb 1, 2019

Looks like in skip_first_whitespace the skip_stack and the boxe's children are discoordinated.

When exception happens, the given skip_stack ist (113, None), but the given InlineBox has only 1 child, and consequently in line 206

        result = skip_first_whitespace(box.children[index], next_skip_stack)

the recursive call with box.children[113] fails with IndexError.

The concerned InlineBox is the <b><i>l</i></b>. Don't know who and where 113 children could be established in the skip_stack...

BTW: 113 is a prime number 😬

@knixeur

This comment has been minimized.

Copy link
Author

commented Feb 1, 2019

Lol, IIRC the real HTML crashed with (107, None) 😆

Edit: Confirmed, it's the prime number bug!

(Pdb) skip_stack
(107, None)
@Tontyna

This comment has been minimized.

Copy link
Contributor

commented Feb 1, 2019

Yep, let's call it the prime number bug

@liZe

This comment has been minimized.

Copy link
Member

commented Mar 1, 2019

Another problem, probably related:

<p>*<span>****************************************** *** **** ** ** ******* *********** ************************************************************************* <b>l</b></span><b>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</b></p>
@Tontyna

This comment has been minimized.

Copy link
Contributor

commented Mar 14, 2019

Encircling the issue reveals: The prime number isn't meant to be the index of a child box, but the index of the letter (space) in a TextBox's text where the overlong text should be/has been broken.

After splitting all the text snippets from the loooong text into separate LineBoxes, the resume_at aka skip_stack returned by split_inline_box() manages to point at the box after the TextBox, wich in the example given by @knixeur happens to be the bold <InlineBox b>.
But somehow/sadly it leaves the last entry of the skip_stack still pointing to the letter where the last text splitting happened. This last entry, containing (<textbreakpos>, None), should, of course be replaced with None. Dont (yet) know why it isn't.

Subsequently skip_first_whitespace() tries to accesses the inexistent child № textbreakpos of the InlineBox and raises IndexError.

Painstaking preparation is required to trigger this erroneous skip_stack -- particular nesting of InlineBoxes and TextBoxes in combination with the right (wrong?) text content, page width and font size. The TextBox, its text extending close to the right margin, and a following InlineBox (<b>) being enclosed in a InlineBox (<span>), immediately followed by non-space (x) seems to be crucial.

This fine-tuned snippet

<p><span>
********************************* *******
********* *************** *** 
********* *************** *** 
********************************** ** ******* *** <b>l</b></span>x</p>

crashes with prime number 73 😄

@liZe Your snippet is another issue. It just doesn't break where it ought to. The skip_stack is ok, throughout. No leftovers from TextBoxes.

@Tontyna

This comment has been minimized.

Copy link
Contributor

commented Mar 17, 2019

Found the bug but cannot fix it. Happens in the most ugly most dirty most abominable part of the inline layout code in split_inline_box:

# We have to check whether the child we're breaking
# is the one broken by the initial skip stack.
broken_child = bool(
initial_skip_stack and
initial_skip_stack[0] == child_index and
initial_skip_stack[1])

BTW: The broken_child-detection was introduced to fix #580.

Under prime number bug conditions broken_child evaluates to true, the subsequent reconstruction of the skip_stack ("adding skip stacks is a bit complicated") produces the invalid stack that finally raises the IndexError.

By forcibly setting broken_child=False and skipping the reconstruction of the stack everything is fine. Nah, not everything, of course, #580 is broken again.

So obviously the algorithm to detect broken_child is incomplete/wrong/lacking something. But it's beyond my skills to tweak it.

This minimal snippet crashes with prime number 1:

<p><span>
**************************************************************
***********
<b>fits</b></span>xxx</p>

And yes, the three whitespaces (here: linebreaks) are vital to the IndexError. Ditto vital: no whitespace in front of the xxx.

@Tontyna

This comment has been minimized.

Copy link
Contributor

commented Mar 17, 2019

Recipe to reproduce the IndexError:

<p><span title="span required to trigger the bug">
*breakable*text* *followed*by*withespace*
<b>+fit+</b></span>CrossingTheMarginWithoutBreakRaisesIndexError</p>

Though @liZe's snippet doesnt crash with IndexError, it's related -- of course! he was right!
It reads like this:

<p><span title="span required to trigger the bug">
*non*breaking*text**followed*by*withespace*
<b>+fit+</b></span>CrossingTheMarginWithoutBreakExtendsTheLineUntil next whitespace goes on the next line</p>

The crash is prevented because

can_break_inside(child)):

returns False for the *non*breaking*text**followed*by*withespace*, while for the *breakable*text* *followed*by*withespace* we get True and enter "The dirty solution".

@Tontyna

This comment has been minimized.

Copy link
Contributor

commented Mar 17, 2019

BTW: Taking "The dirty solution" path (i.e. calling split_inline_level one more time), but avoiding the recreation of the skip_stack, seems to fix both, the IndexError and the CrossingTheMarginShouldWrap bug.
At least if the CrossingTheMarginWithoutBreak string doesnt cross the margin of the next line, too.

@liZe liZe closed this in 8724bc3 Mar 18, 2019

liZe added a commit that referenced this issue Mar 18, 2019

Add broken test
Related to #783.
@liZe

This comment has been minimized.

Copy link
Member

commented Mar 18, 2019

Happens in the most ugly most dirty most abominable part of the inline layout code in split_inline_box

At least it's documented now, it saves me hours each time I have to understand it again 😉.

So obviously the algorithm to detect broken_child is incomplete/wrong/lacking something.

Exactly. We had to check that the split children is the same, not only at level 1 but for all nested children until we find the same text box. It's fixed now.

Though @liZe's snippet doesnt crash with IndexError, it's related -- of course! he was right!

8724bc3 fixes this bug too, but … there's another bug in can_break_inside that is unable to detect a line break in <p>aaa <b>bbb</b></p>. I've added a failing test in e7fd37b.

@liZe liZe added this to the 46 milestone Mar 18, 2019

@knixeur

This comment has been minimized.

Copy link
Author

commented Mar 18, 2019

Thanks @Tontyna and @liZe !

@Tontyna

This comment has been minimized.

Copy link
Contributor

commented Mar 18, 2019

bug in can_break_inside that is unable to detect a line break in <p>aaa <b>bbb</b></p>

That's because can_break_text() from text.py returns False for aaa , i.e. a string that ends with a space.

That's because can_break_text() checks for the PangoLogAttr's is_line_break attribute. And indeed, a trailing whitespace isn't a line break. In fact: whitespace is never a line break.
When I get it right, pango.pango_get_log_attrs() sets this attribute on letters that could become the first letter of a new line, that is, the letter that follows a possible line-breaking (whitespace or punctuation or...) letter.

It's a pity that we must take box.style['lang'] and box.style['white_space'] into account. Otherwise we could simply collect the text from the child's TextBoxes, including all the intermediate whitespaces, and pass that as a single string to can_break_text...

@liZe

This comment has been minimized.

Copy link
Member

commented Mar 18, 2019

Yes, I've learned a lot of things fixing #301, and Unicode is really fascinating. I'm glad to have Pango 😄.

We already have (almost) the whole logic in WeasyPrint: split_inline_level is able to find correctly if we can split a line, and it takes care of line breaks with nested tags. We should use it instead of relying on can_break_inside (that's small and fast but definitely buggy).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.