New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

slow performance on large JSON files #720

Closed
dmalan opened this Issue Oct 29, 2018 · 15 comments

Comments

Projects
None yet
2 participants
@dmalan

dmalan commented Oct 29, 2018

Not sure if this is a regression from #553, but we noticed that large JSON files can take quite a few minutes to render (in an Ubuntu 18.04 Docker container using Docker 18.06.1-ce-mac73 on macOS 10.14 on a 3.1 GHz Intel Core i7). For instance, given https://gist.github.com/dmalan/284f73968484b1fe0eef6bb532235a7c, which contains 7844 lines totaling 274,359 B, it takes 19+ minutes to render the PDF:

# JSON to PDF
$ time weasyprint package-lock.json package-lock.json.pdf
real	19m1.293s
user	18m51.310s
sys	0m9.070s

Rendering to HTML first with pygmentize reduces the time, but it still takes 4+ minutes to render the PDF:

# JSON to HTML
$ time pygmentize -f html -O full package-lock.json  > package-lock.json.html
real	0m0.708s
user	0m0.490s
sys	0m0.010s

# HTML to PDF
$ time weasyprint package-lock.json.html package-lock.json.html.pdf
real	4m40.722s
user	4m39.300s
sys	0m1.080s

Is there perhaps another performance bottleneck that wasn't addressed by 46da8b5 or 6087982?

Thank you!

@liZe liZe added the performance label Oct 29, 2018

@liZe liZe closed this in 07d3da6 Oct 29, 2018

@liZe

This comment has been minimized.

Member

liZe commented Oct 29, 2018

Hello!

Thanks a lot for your corner cases, they're very useful! The rendering time is now lower than 2s on my computer 😉.

@liZe

This comment has been minimized.

Member

liZe commented Oct 29, 2018

Wow, that's even a 5−10% speed improvement for Alice. 🎉

@liZe liZe added this to the 43 milestone Oct 29, 2018

@dmalan

This comment has been minimized.

dmalan commented Oct 29, 2018

Wow, thanks so much for the quick follow-up! Any chance the 2s you saw was based on other changes too? I just installed 07d3da6, but I'm afraid I'm still seeing the below, wherein JSON-to-PDF rendering is definitely faster (19m to 51s), but HTML-to-PDF rendering is actually a bit slower (4m40s to 5m31s)?

$ sudo pip3 install git+https://github.com/Kozea/WeasyPrint.git@07d3da653391bd678c7a71424f64e3c38edf6ae6

$ sed -n 938,947p /opt/pyenv/versions/3.7.0/lib/python3.7/site-packages/weasyprint/text.py
        if max_width == 0:
            # Trying to find minimum size, let's naively split on spaces and
            # keep one word + one letter
            space_index = text.find(' ')
            if space_index == -1:
                expected_length = len(text)
            else:
                expected_length = space_index + 2  # index + space + one letter
        else:
            expected_length = int(max_width / style['font_size'] * 2.5)

# JSON to PDF
$ time weasyprint package-lock.json package-lock.json.pdf
real	0m51.952s
user	0m51.350s
sys	0m0.420s

# JSON to HTML
$ time pygmentize -f html -O full package-lock.json  > package-lock.json.html
real	0m0.658s
user	0m0.460s
sys	0m0.020s

# HTML to PDF
$ time weasyprint package-lock.json.html package-lock.json.html.pdf
real	5m31.433s
user	5m30.520s
sys	0m0.580s

Thank you!

@liZe

This comment has been minimized.

Member

liZe commented Oct 29, 2018

Any chance the 2s you saw was based on other changes too?

Yes, unfortunately it was based on a truncated JSON file 😒. It's 66s for the whole file on my computer.

but HTML-to-PDF rendering is actually a bit slower (4m40s to 5m31s)?

It shouldn't be slower. I'll be back.

@liZe

This comment has been minimized.

Member

liZe commented Oct 29, 2018

It shouldn't be slower. I'll be back.

I get almost the same rendering time with (301s) or without (306s) optimization.

@dmalan

This comment has been minimized.

dmalan commented Oct 29, 2018

Interesting! Just re-ran the tests on another machine and got the below, whereby there's definitely a significant improvement for JSON-to-PDF, though not for HTML-to-PDF (as you note). Remarkably, the former somehow took 26m this time on version 0.42.3! Even ran that test two times to be sure.

Do you have a sense of where other bottlenecks might be? Happy to take a look ourselves if we can be helpful. Thank you!

JSON to PDF

$ weasyprint --version
WeasyPrint version 0.42.3

$ time weasyprint package-lock.json package-lock.json.pdf
real	26m43.515s
user	26m43.820s
sys	0m0.410s


$ sudo pip3 install git+https://github.com/Kozea/WeasyPrint.git@07d3da653391bd678c7a71424f64e3c38edf6ae6


$ weasyprint --version
WeasyPrint version 43rc1

$ time weasyprint package-lock.json package-lock.json.pdf
real	1m16.751s
user	1m16.050s
sys	0m0.550s

HTML to PDF

$ weasyprint --version
WeasyPrint version 0.42.3

$ time pygmentize -f html -O full package-lock.json  > package-lock.json.html

$ time weasyprint package-lock.json.html package-lock.json.html.pdf
real	5m33.894s
user	5m31.800s
sys	0m1.970s


$ weasyprint --version
WeasyPrint version 43rc1

$ time weasyprint package-lock.json.html package-lock.json.html.pdf
real	5m52.727s
user	5m51.830s
sys	0m0.850s
@liZe

This comment has been minimized.

Member

liZe commented Oct 30, 2018

Thanks a lot for your tests.

My tests with the HTML version were before and after my optimization commit, but your tests are between 0.42.3 and master, so there's probably something wrong between 0.42.3 and 43rc1. I'll check that!

@liZe

This comment has been minimized.

Member

liZe commented Oct 30, 2018

I've launched the rendering with 0.42.3, 43rc1 (without optimization) and master (with optimization), using the CPU performance governor set to performance. Here are the results:

  • 0.42.3: 270 seconds
  • 43rc1: 300 seconds
  • master: 285 seconds

You're right, 0.42.3 is faster here. The slowdown has been introduced between 0.42.3 and 43rc1. The optimization for JSON seems to (slightly) improve the situation with HTML, as seen in my previous post.

But… Trying to have a faster test, I've truncated the HTML file, and the results are strange…

With a truncated version of the HTML file (first 1000 lines instead of about 8000):

  • 0.42.3: 4,5 seconds
  • 43rc1: 3,8 seconds
  • master: 3,8 seconds

So I can't bisect the short version. Bisecting a 5-minute-long test is quite painful, and I'm afraid to find that there's only one culprit commit making both the short version faster and the long version slower… Do you have a better idea?

@dmalan

This comment has been minimized.

dmalan commented Oct 30, 2018

Hm, so I just ran

cd /opt/pyenv/versions/3.7.0/bin
python -m cProfile weasyprint package-lock.json package-lock.json.pdf

which, after 78s, yielded a summary that I've reformatted as a CSV:

https://docs.google.com/spreadsheets/d/173l5QJo5scn9MRFEalFLLAy8AmY0oJDRTVP84JrmndQ/edit?usp=sharing

If you sort Z-to-A by tottime, it looks like most of the time is spent in iter_lines (with 56905 invocations thereof), which in turn seems to call on Pango:

https://github.com/Kozea/WeasyPrint/blob/master/weasyprint/text.py#L665-L672

Any chance that yields any insights? Or to be expected?

@liZe

This comment has been minimized.

Member

liZe commented Oct 30, 2018

https://docs.google.com/spreadsheets/d/173l5QJo5scn9MRFEalFLLAy8AmY0oJDRTVP84JrmndQ/edit?usp=sharing

I can't reach the document.

Any chance that yields any insights? Or to be expected?

iter_lines currently has to be called at least twice for each tag, and the file has 24,750 tags, so we can assume that it's OK to have 56,905 calls.

Sorry, you're talking about the JSON file.

@dmalan

This comment has been minimized.

@liZe

This comment has been minimized.

Member

liZe commented Oct 30, 2018

Any chance that yields any insights? Or to be expected?

It's quite common to have a high number of calls of iter_lines. It's a bit high here but nothing really bad.

The problem is that it's sometimes called with the whole text, and that's really, really long. My commit should exactly fix this case, but obviously it doesn't in every case. I have to check why.

@liZe

This comment has been minimized.

Member

liZe commented Oct 31, 2018

I have to check why.

My fix assumes that you can always break lines after spaces. But sometimes you can't, for example in ' ". Unicode rules are really, hmmm, interesting.

We have a function finding the correct place where to break lines, I'll try to use it instead of my naive optimization based on spaces.

@liZe

This comment has been minimized.

Member

liZe commented Oct 31, 2018

What we're trying to do here is to avoid the layout of the whole text when trying to layout the first line.

When lines have a "normal" available length, we keep a random number of characters set to (available width / font-size × 2.5). We try to render these characters, and if a second line has been found (ie. the first line is full), we keep the first line and everything is OK.

When lines have an available length of 0 to find the minimum width of the line, (available width / font-size × 2.5) equals 0, the first line is empty and we use the fallback rendering the whole text.

To avoid this, we have to find another random values when available length is 0. My first try was to find the first space and break after this space. It always works for "normal" text, but not in your case as you have JSON with "strange" characters strings where we can't break lines after some spaces.

So, here's what I've tried with your JSON file:

  • no optimization: more than 15 minutes
  • finding spaces: 60 seconds (optimization works for most cases, but we have to render the whole text as a fallback when we cannot break after spaces)
  • finding real break points: 40 seconds (optimization always work, but finding real break points is slower than finding spaces)
  • truncate the text at a random value of 100 characters: 6 seconds

For Alice:

  • no optimization: 7 seconds
  • spaces: 6.6 seconds
  • break points: 6.6 seconds
  • truncate: 6.7 seconds

For the Odyssey (same as Alice but longer):

  • no optimization: 20.6 seconds
  • spaces: 16.4 seconds
  • break points: 16.8 seconds
  • truncate: 17.2 seconds

Well…

It's hard to choose the right solution. I think that the solution that will be faster in most cases (for "normal" text) is the current solution, finding spaces. Finding real break points may be longer for long blocks of texts (it's the case for your JSON file), but it's THE clean solution. Truncating may seem appealing, but… 100 is really a random value. If I use 50 it doesn't work for your file (you have sha keys with 95 characters).

Using real break points is the way to go, don't you think?

@liZe

This comment has been minimized.

Member

liZe commented Oct 31, 2018

(And for the future: we must find a Pango way to get the first line without rendering the whole text.)

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Nov 14, 2018

py-weasyprint: Update to 43.
Version 43
----------

Released on 2018-11-09.

Bug fixes:

* `#726 <https://github.com/Kozea/WeasyPrint/issues/726>`_:
  Make empty strings clear previous values of named strings
* `#729 <https://github.com/Kozea/WeasyPrint/issues/729>`_:
  Include tools in packaging

This version also includes the changes from unstable rc1 and rc2 versions
listed below.

Version 43rc2
-------------

Released on 2018-11-02.

**This version is experimental, don't use it in production. If you find bugs,
please report them!**

Bug fixes:

* `#706 <https://github.com/Kozea/WeasyPrint/issues/706>`_:
  Fix text-indent at the beginning of a page
* `#687 <https://github.com/Kozea/WeasyPrint/issues/687>`_:
  Allow query strings in file:// URIs
* `#720 <https://github.com/Kozea/WeasyPrint/issues/720>`_:
  Optimize minimum size calculation of long inline elements
* `#717 <https://github.com/Kozea/WeasyPrint/issues/717>`_:
  Display <details> tags as blocks
* `#691 <https://github.com/Kozea/WeasyPrint/issues/691>`_:
  Don't recalculate max content widths when distributing extra space for tables
* `#722 <https://github.com/Kozea/WeasyPrint/issues/722>`_:
  Fix bookmarks and strings set on images
* `#723 <https://github.com/Kozea/WeasyPrint/issues/723>`_:
  Warn users when string() is not used in page margin


Version 43rc1
-------------

Released on 2018-10-15.

**This version is experimental, don't use it in production. If you find bugs,
please report them!**

Dependencies:

* Python 3.4+ is now needed, Python 2.x is not supported anymore
* Cairo 1.15.4+ is now needed, but 1.10+ should work with missing features
  (such as links, outlines and metadata)
* Pdfrw is not needed anymore

New features:

* `Beautiful website <https://weasyprint.org>`_
* `#579 <https://github.com/Kozea/WeasyPrint/issues/579>`_:
  Initial support of flexbox
* `#592 <https://github.com/Kozea/WeasyPrint/pull/592>`_:
  Support @font-face on Windows
* `#306 <https://github.com/Kozea/WeasyPrint/issues/306>`_:
  Add a timeout parameter to the URL fetcher functions
* `#594 <https://github.com/Kozea/WeasyPrint/pull/594>`_:
  Split tests using modern pytest features
* `#599 <https://github.com/Kozea/WeasyPrint/pull/599>`_:
  Make tests pass on Windows
* `#604 <https://github.com/Kozea/WeasyPrint/pull/604>`_:
  Handle target counters and target texts
* `#631 <https://github.com/Kozea/WeasyPrint/pull/631>`_:
  Enable counter-increment and counter-reset in page context
* `#622 <https://github.com/Kozea/WeasyPrint/issues/622>`_:
  Allow pathlib.Path objects for HTML, CSS and Attachment classes
* `#674 <https://github.com/Kozea/WeasyPrint/issues/674>`_:
  Add extensive installation instructions for Windows

Bug fixes:

* `#558 <https://github.com/Kozea/WeasyPrint/issues/558>`_:
  Fix attachments
* `#565 <https://github.com/Kozea/WeasyPrint/issues/565>`_,
  `#596 <https://github.com/Kozea/WeasyPrint/issues/596>`_,
  `#539 <https://github.com/Kozea/WeasyPrint/issues/539>`_:
  Fix many PDF rendering, printing and compatibility problems
* `#614 <https://github.com/Kozea/WeasyPrint/issues/614>`_:
  Avoid crashes and endless loops caused by a Pango bug
* `#662 <https://github.com/Kozea/WeasyPrint/pull/662>`_:
  Fix warnings and errors when generating documentation
* `#666 <https://github.com/Kozea/WeasyPrint/issues/666>`_,
  `#685 <https://github.com/Kozea/WeasyPrint/issues/685>`_:
  Fix many table layout rendering problems
* `#680 <https://github.com/Kozea/WeasyPrint/pull/680>`_:
  Don't crash when there's no font available
* `#662 <https://github.com/Kozea/WeasyPrint/pull/662>`_:
  Fix support of some align values in tables
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment