Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSS Line Breaks/New Line doesn't work for non-latin chars #828

Closed
AjawadMahmoud opened this Issue Mar 18, 2019 · 19 comments

Comments

4 participants
@AjawadMahmoud
Copy link

AjawadMahmoud commented Mar 18, 2019

Hello.

The \A new line escape char in CSS raises AssesrtionError if the first char of the second line is non-latin, or in my case an Arabic letter. The following is what I get:

2019-03-18 15:33:39,460  [ERROR]  Error occured: Traceback (most recent call last):
  File "D:\my\project\pdf.py", line 126, in generate
    pdf = html.write_pdf(stylesheets=stylesheets)
  File "C:\Python37\lib\site-packages\weasyprint\__init__.py", line 198, in write_pdf
    font_config=font_config).write_pdf(
  File "C:\Python37\lib\site-packages\weasyprint\__init__.py", line 159, in render
    font_config)
  File "C:\Python37\lib\site-packages\weasyprint\document.py", line 361, in _render
    [Page(p, enable_hinting) for p in page_boxes],
  File "C:\Python37\lib\site-packages\weasyprint\document.py", line 361, in <listcomp>
    [Page(p, enable_hinting) for p in page_boxes],
  File "C:\Python37\lib\site-packages\weasyprint\layout\__init__.py", line 184, in layout_document
    make_margin_boxes(context, page, state))
  File "C:\Python37\lib\site-packages\weasyprint\layout\pages.py", line 435, in make_margin_boxes
    yield margin_box_content_layout(context, page, box)
  File "C:\Python37\lib\site-packages\weasyprint\layout\pages.py", line 444, in margin_box_content_layout
    absolute_boxes=[], fixed_boxes=[])
  File "C:\Python37\lib\site-packages\weasyprint\layout\blocks.py", line 363, in block_container_layout
    for line, resume_at in lines_iterator:
  File "C:\Python37\lib\site-packages\weasyprint\layout\inlines.py", line 56, in iter_line_boxes
    device_size, absolute_boxes, fixed_boxes, first_letter_style)
  File "C:\Python37\lib\site-packages\weasyprint\layout\inlines.py", line 105, in get_next_linebox
    waiting_floats, line_children=[])
  File "C:\Python37\lib\site-packages\weasyprint\layout\inlines.py", line 745, in split_inline_box
    line_placeholders, child_waiting_floats, line_children))
  File "C:\Python37\lib\site-packages\weasyprint\layout\inlines.py", line 583, in split_inline_level
    context, box, max_x - position_x, skip)
  File "C:\Python37\lib\site-packages\weasyprint\layout\inlines.py", line 1020, in split_text_box
    'Expected nothing or a preserved line break' % (between,))
AssertionError: Got '\n التاريخ: الأحد 3 مارس 2019' between two lines. Expected nothing or a preserved line break

I'm using:

WeasyPrint==45
cairocffi==1.0.2
CairoSVG==2.3.0

The part of CSS string causing the issue:

@top-right {{
				padding: 0.5cm;
				font-family: 'Dubai-Medium', sans-serif;
				content: " {doc_ref}: {ref} \A {doc_date}: {weekday} {day} {month} {year} \A {doc_addressee}: {addressee} ";
				display: block;
				white-space: pre;
				vertical-align: top;
			}}
@liZe

This comment has been minimized.

Copy link
Member

liZe commented Mar 18, 2019

Hello!

To be honest, right-to-left scripts are poorly supported. As the library we're using to render text (called Pango) handles them perfectly, WeasyPrint may sometimes give the impression that everything just works. But it actually doesn't at all (see #106) 😢.

I can leave this issue open and see if anyone finds a way to fix this exact use case. But until #106 is closed, you can expect a lot of bugs with Arabic text.

@liZe liZe added the crash label Mar 18, 2019

@AjawadMahmoud

This comment has been minimized.

Copy link
Author

AjawadMahmoud commented Mar 20, 2019

Hello LiZe.

I have to admit, your work is beyond amazing. The fact that I can generate pdf docs containing arabic texts is more than enough for me. The RTL caveats are being handled in special cases in my code. So, thanks a lot for your efforts.

Okay after a lot of attempts in a lot of previous cases I found that it's actauly more common and that 51dcbe3 had resolved this problem after I manually ported this commit to my current install.
Do you think you can get a release soon with this commit as I can't use master branch in the production, I can only use the releases (some CI/CD limitations). I would be really really grateful if you would create an out of schedule release (today?) with that commit with it in order to be able to push some major updates to the project.

@liZe

This comment has been minimized.

Copy link
Member

liZe commented Mar 20, 2019

I have to admit, your work is beyond amazing. The fact that I can generate pdf docs containing arabic texts is more than enough for me. The RTL caveats are being handled in special cases in my code. So, thanks a lot for your efforts.

😄

Do you think you can get a release soon with this commit as I can't use master branch in the production, I can only use the releases (some CI/CD limitations)

Maybe you already know that, but you can install the master branch with pip using pip install git+https://github.com/Kozea/WeasyPrint.git@master.

I would be really really grateful if you would create an out of schedule release (today?) with that commit with it in order to be able to push some major updates to the project.

A version was already scheduled this week, it can be today if I can find the time. I'd just like to fix the problem in this comment before.

@AjawadMahmoud

This comment has been minimized.

Copy link
Author

AjawadMahmoud commented Mar 20, 2019

Maybe you already know that, but you can install the master branch with pip using pip install git+https://github.com/Kozea/WeasyPrint.git@master.
I'm aware of this use-case, but unfortunately I have genuine reason not to use it.

A version was already scheduled this week, it can be today if I can find the time. I'd just like to fix the problem in this comment before.
I'm wishing you best of luck.

@liZe liZe added this to the 46 milestone Mar 20, 2019

@liZe

This comment has been minimized.

Copy link
Member

liZe commented Mar 20, 2019

Version 46 has been released!

@AjawadMahmoud

This comment has been minimized.

Copy link
Author

AjawadMahmoud commented Mar 21, 2019

I bursted into productivity cycle after the release and missed replying here.
Thanks a lot. Keep up the good work.

@AjawadMahmoud

This comment has been minimized.

Copy link
Author

AjawadMahmoud commented Mar 22, 2019

Hello and wish you a great weekend ahead!

I'm reopening this issue due to false positive condition. The commit didn't actually fix the whole problem. I was able to reproduce exactly the same error with other templates. I'm going to spend some more time testing this area of your lib. If I achieve anything I would prepare a PR.

@AjawadMahmoud AjawadMahmoud reopened this Mar 22, 2019

@liZe

This comment has been minimized.

Copy link
Member

liZe commented Mar 22, 2019

I was able to reproduce exactly the same error with other templates.

Don't hesitate to post your HTML+CSS samples, I'll see if I can help.

I'm going to spend some more time testing this area of your lib. If I achieve anything I would prepare a PR.

Thank you, good luck!

@AjawadMahmoud

This comment has been minimized.

Copy link
Author

AjawadMahmoud commented Mar 22, 2019

Could you give a shot with some basic lipsom Arabic text like:

import weasyprint
weasyprint.HTML(string=''''لوريم ايبسوم دولار سيت أميت ,كونسيكتيتور أدايبا يسكينج أليايت,سيت دو أيوسمود تيمبور
أنكايديديونتيوت لابوري ات دولار ماجنا أليكيوا . يوت انيم أد مينيم فينايم,كيواس نوستريد
أكسير سيتاشن يللأمكو لابورأس نيسي يت أليكيوب أكس أيا كوممودو كونسيكيوات . ديواس
أيوتي أريري دولار إن ريبريهينديرأيت فوليوبتاتي فيلايت أيسسي كايلليوم دولار أيو فيجايت
نيولا باراياتيور. أيكسسيبتيور ساينت أوككايكات كيوبايداتات نون بروايدينت ,سيونت ان كيولبا
كيو أوفيسيا ديسيريونتموليت انيم أيدي ايست لابوريوم''').write_pdf()
@liZe

This comment has been minimized.

Copy link
Member

liZe commented Mar 22, 2019

Could you give a shot with some basic lipsom Arabic text like:

I can reproduce your bug, thank you.

We were very lucky to have your original bug fixed, but as I said earlier: right-to-left scripts are poorly supported. The code is full of assumptions that only work with Latin or Cyrilic scripts, and even some left-to-right languages should fail.

@AjawadMahmoud

This comment has been minimized.

Copy link
Author

AjawadMahmoud commented Mar 22, 2019

@liZe liZe removed this from the 46 milestone Mar 22, 2019

@AjawadMahmoud

This comment has been minimized.

Copy link
Author

AjawadMahmoud commented Mar 23, 2019

LiZe, what's the real purpose of this:

assert between in line_breaks #inlines.py line1022

after commenting this line (and the following two) in an attempt to see where the code is breaking the error message was gone. Could you please elaborate more on where in your code is the line breaks being generated? I assume it's get_next_linebox but since I'm new to your codebase I'm unable to find my way through it.

UPDATE:
Another observation is after commenting lines 1022, 1023, and 1024 in inlines.py what I get is a text that is overflowing over the page width. I'm starting to assume it has something to do with iterating over bytes rather that chars. Your help here would allow me to fix this issue.

@liZe

This comment has been minimized.

Copy link
Member

liZe commented Mar 25, 2019

what's the real purpose of this:

This assert is here to check that there's no text between the end of a line and the beginning of the next line. If this assert is wrong, it basically means that some text is missing in the layout. So it's generally a good idea to keep it…

I'm starting to assume it has something to do with iterating over bytes rather that chars. Your help here would allow me to fix this issue.

The problem is probably in text.split_first_line. The whole line splitting algorithm has been created with left-to-right scripts in mind, and even if Pango is really good at dealing with right-to-left scripts, our code doesn't use it "the right way".

If you want to debug it, your lipsum sample is perfect. You can comment the assert, see in the PDF where the lines should be split, and check the values returned by split_first_line. resume_at is basically the place in the document where the split occurs (i.e. the number of characters that have already been displayed). It is used as a string index, and it may be wrong with RTL scripts.

@AjawadMahmoud

This comment has been minimized.

Copy link
Author

AjawadMahmoud commented Apr 2, 2019

Hello.
@liZe, there's some funny business going on here.
I realised an old app that i've written in Python as well sometime before is still working and it has a condition where an overlapping line in Arabic is there. However, unline version 45 and 46, no errors nor any part of the lines are getting hidden. The app was written with version 43 of weasyprint.

I tried having a deep look at the diff between master and the tagged version and found nothing to tell me about any kind of related issue to the previously discussed attrs; split_first_line and resume_at. Maybe since you are the author of most of the lib you can find the issue way faster than me.

I'm still digging into the issue though in order to find which commit changed the behaviour.

@liZe

This comment has been minimized.

Copy link
Member

liZe commented Apr 2, 2019

Hello,

Thanks a lot for your hints.

I've bisected using your script and the bad commit is 0a3db7d. There's probably a problem with the index we get from pango_layout_iter_get_index.

@lnprieto

This comment has been minimized.

Copy link

lnprieto commented Apr 10, 2019

Hello

We have the same problem. With older version of weasyprint our rtl documents were generated (not perfect but we have pdfs). Now we get an error :
line_children)) File "/usr/local/lib/python3.7/dist-packages/weasyprint/layout/inlines.py", line 584, in split_inline_level context, box, max_x - position_x, skip) File "/usr/local/lib/python3.7/dist-packages/weasyprint/layout/inlines.py", line 1024, in split_text_box 'Expected nothing or a preserved line break' % (between,)) AssertionError: Got ' فرنسي- أردني بإدار' between two lines. Expected nothing or a preserved line break ' on stderr, on output. CMD : /usr/local/bin/weasyprint --encoding utf8 --format pdf
As for AjawadMahmoud the problem was skipped by commenting lines (1022 to 1024 in inlines.py) but it is not a long-term solution. And we don't evaluate the side effects yet.

We have a lot of arabic documents and it is a real problem if we can't get pdf versions.

Do you plan to fix this bug soon ? just to think how we can do
Thank you for your answer
Have a good day

@liZe

This comment has been minimized.

Copy link
Member

liZe commented Apr 12, 2019

It's now fixed, but the fix is probably bad for performance. I'll release version 47 and try to improve performance later.

@liZe liZe added this to the 47 milestone Apr 12, 2019

@StephanMeijer

This comment has been minimized.

Copy link

StephanMeijer commented Apr 15, 2019

I want to thank you all for fixing it!

@AjawadMahmoud

This comment has been minimized.

Copy link
Author

AjawadMahmoud commented Apr 17, 2019

Thanks a lot, @liZe. I really wanted to give you more hand in here to help you fix it but my schedule is really full.
Thanks for @lnprieto as well for adding to conversation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.