
Improve compilation times for projects using PyO3 #1604

Merged: @davidhewitt merged 6 commits into PyO3:main from 1tgr:shrink on May 16, 2021

Conversation

@1tgr (Contributor) commented May 13, 2021

These changes reduce the number of lines of LLVM IR generated for the word-count example from 29,299 to 24,980.

The 10 biggest functions, as reported by cargo llvm-lines --release:

Before
======
  Lines         Copies       Function name
  -----         ------       -------------
  29299 (100%)  1294 (100%)  (TOTAL)
   1205 (4.1%)     4 (0.3%)  pyo3::callback::handle_panic
    546 (1.9%)     7 (0.5%)  std::thread::local::LocalKey<T>::try_with
    530 (1.8%)     8 (0.6%)  core::result::Result<T,E>::map_err
    528 (1.8%)     8 (0.6%)  std::panicking::try
    467 (1.6%)     5 (0.4%)  core::iter::traits::iterator::Iterator::fold
    439 (1.5%)    10 (0.8%)  core::option::Option<T>::map
    421 (1.4%)     9 (0.7%)  core::ptr::swap_nonoverlapping_one
    409 (1.4%)     1 (0.1%)  rayon::iter::plumbing::bridge_unindexed_producer_consumer
    394 (1.3%)    20 (1.5%)  core::ptr::read
    353 (1.2%)    11 (0.9%)  core::option::Option<T>::ok_or

After
=====
  Lines         Copies       Function name
  -----         ------       -------------
  24980 (100%)  1116 (100%)  (TOTAL)
    546 (2.2%)     7 (0.6%)  std::thread::local::LocalKey<T>::try_with
    528 (2.1%)     8 (0.7%)  std::panicking::try
    467 (1.9%)     5 (0.4%)  core::iter::traits::iterator::Iterator::fold
    432 (1.7%)     6 (0.5%)  core::result::Result<T,E>::map_err
    409 (1.6%)     1 (0.1%)  rayon::iter::plumbing::bridge_unindexed_producer_consumer
    397 (1.6%)     4 (0.4%)  pyo3::callback::handle_panic
    376 (1.5%)     8 (0.7%)  core::ptr::swap_nonoverlapping_one
    371 (1.5%)    19 (1.7%)  core::ptr::read
    368 (1.5%)     8 (0.7%)  core::option::Option<T>::map
    345 (1.4%)     1 (0.1%)  word_count::word_count

Here I am using the number of lines of LLVM IR as a proxy for overall compilation time. I have a large real-world project whose release build compiles 22% faster with these changes.

@birkenfeld (Member) commented May 13, 2021

Sounds like nice savings! Not having reviewed the code, can you say anything about whether, and where, runtime performance could be affected?

@1tgr (Contributor, Author) commented May 13, 2021

I'll take a look, but initial thoughts are:

  1. For the argument extraction in 32dc93e, the result is the same, just expressed in fewer lines of macro-generated code (see the sketch after this list), so I don't expect any change at runtime.
  2. The rest applies to initialization code. I have an example of a large PyO3 module, so I'll find out whether there is any impact on load time.
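
A hypothetical sketch of the shape change in point 1 (purely illustrative; this is not the macro's actual output, and `extract_argument` is an invented helper name): rather than expanding the full conversion and error-handling sequence inline for every parameter, the macro emits one short call per parameter to a shared generic helper, so most of the code is monomorphised per parameter type instead of duplicated at each call site.

```rust
use std::str::FromStr;

// Shared helper: monomorphised once per parameter type T and reused by
// every generated wrapper.
fn extract_argument<T: FromStr>(raw: &str, name: &str) -> Result<T, String> {
    raw.parse()
        .map_err(|_| format!("argument '{name}': cannot convert '{raw}'"))
}

// What a generated wrapper reduces to: one short call per parameter,
// instead of the whole sequence above expanded inline each time.
fn wrapper(args: &[&str]) -> Result<(), String> {
    let a: i64 = extract_argument(args[0], "a")?;
    let b: f64 = extract_argument(args[1], "b")?;
    println!("a = {a}, b = {b}");
    Ok(())
}

fn main() {
    wrapper(&["42", "3.5"]).unwrap();
}
```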

@1tgr (Contributor, Author) commented May 13, 2021

For point 2: I just spotted bench_pyclass.rs, which looks like a good benchmark of the PyClass initialization logic.

@1tgr (Contributor, Author) commented May 13, 2021

I think these are the relevant benchmarks; they look fine to me:

test bench_call_0        (before) bench:      43,298 ns/iter (+/- 1,758)
test bench_call_0         (after) bench:      43,358 ns/iter (+/- 2,231)

test first_time_init (before) bench:       3,637 ns/iter (+/- 2,484)
test first_time_init  (after) bench:       3,704 ns/iter (+/- 3,058)

first_time_init comes out a little slower, but well within this benchmark's (quite large) variance.

And the rest of the results:

test bench_call_method_0 (before) bench:     135,235 ns/iter (+/- 19,143)
test bench_call_method_0  (after) bench:     146,706 ns/iter (+/- 21,929)

test dict_get_item    (before) bench:   2,070,662 ns/iter (+/- 209,741)
test dict_get_item     (after) bench:   2,168,498 ns/iter (+/- 288,392)

test drop_many_objects (before) bench:       2,611 ns/iter (+/- 136)
test drop_many_objects  (after) bench:       2,666 ns/iter (+/- 418)

test extract_btreemap (before) bench:  11,299,160 ns/iter (+/- 1,320,198)
test extract_btreemap  (after) bench:  11,032,676 ns/iter (+/- 3,320,406)

test extract_btreeset (before) bench:   9,175,273 ns/iter (+/- 910,780)
test extract_btreeset  (after) bench:   9,290,653 ns/iter (+/- 1,572,654)

test extract_hashmap  (before) bench:   5,608,385 ns/iter (+/- 847,949)
test extract_hashmap   (after) bench:   5,949,059 ns/iter (+/- 1,696,773)

test extract_hashset  (before) bench:   5,715,297 ns/iter (+/- 1,266,259)
test extract_hashset   (after) bench:   5,685,043 ns/iter (+/- 891,854)

test iter_dict        (before) bench:   2,323,729 ns/iter (+/- 323,264)
test iter_dict         (after) bench:   2,342,320 ns/iter (+/- 420,593)

test iter_list     (before) bench:   1,646,704 ns/iter (+/- 122,113)
test iter_list      (after) bench:   1,716,904 ns/iter (+/- 206,571)

test iter_set         (before) bench:   1,631,793 ns/iter (+/- 505,714)
test iter_set          (after) bench:   1,550,450 ns/iter (+/- 324,212)

test iter_tuple     (before) bench:     916,047 ns/iter (+/- 165,699)
test iter_tuple      (after) bench:     886,762 ns/iter (+/- 131,282)

test list_get_item (before) bench:     761,099 ns/iter (+/- 199,763)
test list_get_item  (after) bench:     758,764 ns/iter (+/- 51,724)

test tuple_get_item (before) bench:     495,202 ns/iter (+/- 55,964)
test tuple_get_item  (after) bench:     473,639 ns/iter (+/- 90,762)

@davidhewitt (Member) left a comment

Thanks! If tests pass, I'm happy with this. To summarise what I see, it changes some of the #[pyclass] instantiation code to use dynamic instead of static dispatch. That will marginally affect first-time initialization, but the change you measured is within tolerance, and this is a cost paid once per whole-program execution anyway.
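
A minimal sketch of the static-vs-dynamic dispatch trade-off described above (hypothetical types; this is not PyO3's actual initialization code):

```rust
use std::any::Any;

// Static dispatch: init_static is monomorphised once per T, so every
// #[pyclass]-like type adds its own copy of this code to the build.
trait ClassInfo {
    const NAME: &'static str;
    fn create() -> Box<dyn Any>;
}

fn init_static<T: ClassInfo>() -> Box<dyn Any> {
    println!("initialising {}", T::NAME);
    T::create()
}

// Dynamic dispatch: the per-class details become data plus a function
// pointer, and a single shared routine serves every class.
struct ClassSpec {
    name: &'static str,
    create: fn() -> Box<dyn Any>,
}

fn init_dynamic(spec: &ClassSpec) -> Box<dyn Any> {
    println!("initialising {}", spec.name);
    (spec.create)()
}

struct MyClass;

impl ClassInfo for MyClass {
    const NAME: &'static str = "MyClass";
    fn create() -> Box<dyn Any> {
        Box::new(MyClass)
    }
}

fn main() {
    let _a = init_static::<MyClass>(); // duplicated per class
    let spec = ClassSpec { name: MyClass::NAME, create: MyClass::create };
    let _b = init_dynamic(&spec); // shared code, paid once at startup
}
```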

The changes to handle_panic may slightly affect the measurements I just made in #1607; however, I think they're a reasonable refactoring and I don't expect them to be a major performance issue.

Review threads, all resolved: CHANGELOG.md, src/callback.rs, src/panic.rs (outdated).
@1tgr (Contributor, Author) commented May 15, 2021

Thanks for the review @davidhewitt. Indeed I don't see a conflict with the approach in #1607, although a compiler optimisation that can improve #1607 (inlining panic_result_into_callback_output into the tail of handle_panic) may make compilation times worse.

The benefit of having panic_result_into_callback_output is that it is monomorphised only over R (the method return type), and there are far fewer distinct Rs than Fs (the method bodies themselves). The call to panic_result_into_callback_output is an unconditional branch (good), but it's not a tail call, since the GILPool must be dropped afterwards.
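
A minimal, self-contained sketch of that split (the function names follow PyO3's, but the bodies are simplified stand-ins: catch_unwind and eprintln! substitute for PyO3's real panic handling and exception raising):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Generic over both F and R: monomorphised once per wrapped method body.
// Keeping this shim thin means each extra copy costs few LLVM lines.
fn handle_panic<F, R>(body: F) -> R
where
    F: FnOnce() -> Result<R, String>,
    R: Default,
{
    let result = catch_unwind(AssertUnwindSafe(body))
        .unwrap_or_else(|_| Err("caught panic".to_string()));
    // Unconditional (non-tail) call into the helper below; in PyO3 the
    // GILPool still has to be dropped after this call returns.
    panic_result_into_callback_output(result)
}

// Generic over R only: monomorphised once per distinct return type,
// however many method bodies share that return type.
fn panic_result_into_callback_output<R: Default>(result: Result<R, String>) -> R {
    match result {
        Ok(value) => value,
        Err(msg) => {
            eprintln!("error: {msg}"); // stand-in for raising a Python exception
            R::default() // stand-in for an error sentinel (e.g. a null pointer)
        }
    }
}

fn main() {
    // Two distinct closure types F, one return type R = i32: the bulky
    // error-handling code is generated once, not twice.
    let ok: i32 = handle_panic(|| Ok(1));
    let err: i32 = handle_panic(|| Err("boom".into()));
    println!("{ok} {err}");
}
```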

@davidhewitt (Member) commented

👍 agreed.

FWIW I'm not sure we're ready yet to guarantee that future changes to PyO3 won't worsen compile times; however, I'm sure I'll be tempted to use cargo llvm-lines from time to time to keep an eye on this myself! 😄

@davidhewitt (Member) commented

Thanks again for the contribution!

@davidhewitt merged commit c4b19c7 into PyO3:main on May 16, 2021
@1tgr deleted the shrink branch on May 28, 2021