
Improve compilation times for projects using PyO3 #1604

Merged: @davidhewitt merged 6 commits into PyO3:main from 1tgr:shrink on May 16, 2021

Conversation

@1tgr (Contributor) commented May 13, 2021

These changes reduce the number of lines of LLVM IR generated for the word-count example from 29,299 to 24,980.

The 10 biggest functions, as reported by cargo llvm-lines --release:

Before
======
  Lines         Copies       Function name
  -----         ------       -------------
  29299 (100%)  1294 (100%)  (TOTAL)
   1205 (4.1%)     4 (0.3%)  pyo3::callback::handle_panic
    546 (1.9%)     7 (0.5%)  std::thread::local::LocalKey<T>::try_with
    530 (1.8%)     8 (0.6%)  core::result::Result<T,E>::map_err
    528 (1.8%)     8 (0.6%)  std::panicking::try
    467 (1.6%)     5 (0.4%)  core::iter::traits::iterator::Iterator::fold
    439 (1.5%)    10 (0.8%)  core::option::Option<T>::map
    421 (1.4%)     9 (0.7%)  core::ptr::swap_nonoverlapping_one
    409 (1.4%)     1 (0.1%)  rayon::iter::plumbing::bridge_unindexed_producer_consumer
    394 (1.3%)    20 (1.5%)  core::ptr::read
    353 (1.2%)    11 (0.9%)  core::option::Option<T>::ok_or

After
=====
  Lines         Copies       Function name
  -----         ------       -------------
  24980 (100%)  1116 (100%)  (TOTAL)
    546 (2.2%)     7 (0.6%)  std::thread::local::LocalKey<T>::try_with
    528 (2.1%)     8 (0.7%)  std::panicking::try
    467 (1.9%)     5 (0.4%)  core::iter::traits::iterator::Iterator::fold
    432 (1.7%)     6 (0.5%)  core::result::Result<T,E>::map_err
    409 (1.6%)     1 (0.1%)  rayon::iter::plumbing::bridge_unindexed_producer_consumer
    397 (1.6%)     4 (0.4%)  pyo3::callback::handle_panic
    376 (1.5%)     8 (0.7%)  core::ptr::swap_nonoverlapping_one
    371 (1.5%)    19 (1.7%)  core::ptr::read
    368 (1.5%)     8 (0.7%)  core::option::Option<T>::map
    345 (1.4%)     1 (0.1%)  word_count::word_count

Here I am using the number of lines of LLVM IR as a proxy for overall compilation time. I have a large real-world project whose release build compiles 22% faster with these changes.

@birkenfeld (Member) commented May 13, 2021

Sounds like nice savings! Not having reviewed the code, can you say anything about whether, and where, runtime performance could be affected?

@1tgr (Contributor, Author) commented May 13, 2021

I'll take a look, but initial thoughts are:

  1. For the argument extraction in 32dc93e, the result is the same, just expressed in fewer lines of macro-generated code (see the sketch after this list), so I don't expect any change at runtime.
  2. The rest applies to initialization code. I have an example of a large PyO3 module, so I'll find out whether there is any impact on load time.
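
A hypothetical sketch of the shape change in point 1 (purely illustrative; this is not the macro's actual output, and `extract_argument` is an invented helper name): rather than expanding the full conversion and error-handling sequence inline for every parameter, the macro emits one short call per parameter to a shared generic helper, so most of the code is monomorphised per parameter type instead of duplicated at each call site.

```rust
use std::str::FromStr;

// Shared helper: monomorphised once per parameter type T and reused by
// every generated wrapper.
fn extract_argument<T: FromStr>(raw: &str, name: &str) -> Result<T, String> {
    raw.parse()
        .map_err(|_| format!("argument '{name}': cannot convert '{raw}'"))
}

// What a generated wrapper reduces to: one short call per parameter,
// instead of the whole sequence above expanded inline each time.
fn wrapper(args: &[&str]) -> Result<(), String> {
    let a: i64 = extract_argument(args[0], "a")?;
    let b: f64 = extract_argument(args[1], "b")?;
    println!("a = {a}, b = {b}");
    Ok(())
}

fn main() {
    wrapper(&["42", "3.5"]).unwrap();
}
```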

@1tgr (Contributor, Author) commented May 13, 2021

For point 2: I just spotted bench_pyclass.rs, which looks like a good benchmark of the PyClass initialization logic.

@1tgr (Contributor, Author) commented May 13, 2021

I think these are the relevant benchmarks; they look fine to me:

test bench_call_0        (before) bench:      43,298 ns/iter (+/- 1,758)
test bench_call_0         (after) bench:      43,358 ns/iter (+/- 2,231)

test first_time_init (before) bench:       3,637 ns/iter (+/- 2,484)
test first_time_init  (after) bench:       3,704 ns/iter (+/- 3,058)

first_time_init comes out a little slower, but well within this benchmark's (quite large) variance.

And the rest of the results:

test bench_call_method_0 (before) bench:     135,235 ns/iter (+/- 19,143)
test bench_call_method_0  (after) bench:     146,706 ns/iter (+/- 21,929)

test dict_get_item    (before) bench:   2,070,662 ns/iter (+/- 209,741)
test dict_get_item     (after) bench:   2,168,498 ns/iter (+/- 288,392)

test drop_many_objects (before) bench:       2,611 ns/iter (+/- 136)
test drop_many_objects  (after) bench:       2,666 ns/iter (+/- 418)

test extract_btreemap (before) bench:  11,299,160 ns/iter (+/- 1,320,198)
test extract_btreemap  (after) bench:  11,032,676 ns/iter (+/- 3,320,406)

test extract_btreeset (before) bench:   9,175,273 ns/iter (+/- 910,780)
test extract_btreeset  (after) bench:   9,290,653 ns/iter (+/- 1,572,654)

test extract_hashmap  (before) bench:   5,608,385 ns/iter (+/- 847,949)
test extract_hashmap   (after) bench:   5,949,059 ns/iter (+/- 1,696,773)

test extract_hashset  (before) bench:   5,715,297 ns/iter (+/- 1,266,259)
test extract_hashset   (after) bench:   5,685,043 ns/iter (+/- 891,854)

test iter_dict        (before) bench:   2,323,729 ns/iter (+/- 323,264)
test iter_dict         (after) bench:   2,342,320 ns/iter (+/- 420,593)

test iter_list     (before) bench:   1,646,704 ns/iter (+/- 122,113)
test iter_list      (after) bench:   1,716,904 ns/iter (+/- 206,571)

test iter_set         (before) bench:   1,631,793 ns/iter (+/- 505,714)
test iter_set          (after) bench:   1,550,450 ns/iter (+/- 324,212)

test iter_tuple     (before) bench:     916,047 ns/iter (+/- 165,699)
test iter_tuple      (after) bench:     886,762 ns/iter (+/- 131,282)

test list_get_item (before) bench:     761,099 ns/iter (+/- 199,763)
test list_get_item  (after) bench:     758,764 ns/iter (+/- 51,724)

test tuple_get_item (before) bench:     495,202 ns/iter (+/- 55,964)
test tuple_get_item  (after) bench:     473,639 ns/iter (+/- 90,762)

@davidhewitt (Member) left a comment

Thanks! If tests pass, I'm happy with this. To summarise what I see, it changes some of the #[pyclass] instantiation code to use dynamic instead of static dispatch. That will marginally affect first-time initialization, but the change you measured is within tolerance, and this is a cost paid once per whole-program execution anyway.
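
A minimal sketch of the static-vs-dynamic dispatch trade-off described above (hypothetical types; this is not PyO3's actual initialization code):

```rust
use std::any::Any;

// Static dispatch: init_static is monomorphised once per T, so every
// #[pyclass]-like type adds its own copy of this code to the build.
trait ClassInfo {
    const NAME: &'static str;
    fn create() -> Box<dyn Any>;
}

fn init_static<T: ClassInfo>() -> Box<dyn Any> {
    println!("initialising {}", T::NAME);
    T::create()
}

// Dynamic dispatch: the per-class details become data plus a function
// pointer, and a single shared routine serves every class.
struct ClassSpec {
    name: &'static str,
    create: fn() -> Box<dyn Any>,
}

fn init_dynamic(spec: &ClassSpec) -> Box<dyn Any> {
    println!("initialising {}", spec.name);
    (spec.create)()
}

struct MyClass;

impl ClassInfo for MyClass {
    const NAME: &'static str = "MyClass";
    fn create() -> Box<dyn Any> {
        Box::new(MyClass)
    }
}

fn main() {
    let _a = init_static::<MyClass>(); // duplicated per class
    let spec = ClassSpec { name: MyClass::NAME, create: MyClass::create };
    let _b = init_dynamic(&spec); // shared code, paid once at startup
}
```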

The changes to handle_panic may slightly affect the measurements I just made in #1607; however, I think they're a reasonable refactoring and I don't expect them to be a major performance issue.

Review threads, all resolved: CHANGELOG.md, src/callback.rs, src/panic.rs (outdated).
@1tgr (Contributor, Author) commented May 15, 2021

Thanks for the review @davidhewitt. Indeed I don't see a conflict with the approach in #1607, although a compiler optimisation that can improve #1607 (inlining panic_result_into_callback_output into the tail of handle_panic) may make compilation times worse.

The benefit of having panic_result_into_callback_output is that it is monomorphised only over R (the method return type), and there are far fewer distinct Rs than Fs (the method bodies themselves). The call to panic_result_into_callback_output is an unconditional branch (good), but it's not a tail call, since the GILPool must be dropped afterwards.
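
A minimal, self-contained sketch of that split (the function names follow PyO3's, but the bodies are simplified stand-ins: catch_unwind and eprintln! substitute for PyO3's real panic handling and exception raising):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Generic over both F and R: monomorphised once per wrapped method body.
// Keeping this shim thin means each extra copy costs few LLVM lines.
fn handle_panic<F, R>(body: F) -> R
where
    F: FnOnce() -> Result<R, String>,
    R: Default,
{
    let result = catch_unwind(AssertUnwindSafe(body))
        .unwrap_or_else(|_| Err("caught panic".to_string()));
    // Unconditional (non-tail) call into the helper below; in PyO3 the
    // GILPool still has to be dropped after this call returns.
    panic_result_into_callback_output(result)
}

// Generic over R only: monomorphised once per distinct return type,
// however many method bodies share that return type.
fn panic_result_into_callback_output<R: Default>(result: Result<R, String>) -> R {
    match result {
        Ok(value) => value,
        Err(msg) => {
            eprintln!("error: {msg}"); // stand-in for raising a Python exception
            R::default() // stand-in for an error sentinel (e.g. a null pointer)
        }
    }
}

fn main() {
    // Two distinct closure types F, one return type R = i32: the bulky
    // error-handling code is generated once, not twice.
    let ok: i32 = handle_panic(|| Ok(1));
    let err: i32 = handle_panic(|| Err("boom".into()));
    println!("{ok} {err}");
}
```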

@davidhewitt (Member) commented

👍 agreed.

FWIW I'm not sure we're ready yet to guarantee that future changes to PyO3 won't worsen compile times; however, I'm sure I'll be tempted to use cargo llvm-lines from time to time to keep an eye on this myself! 😄

@davidhewitt (Member) commented

Thanks again for the contribution!

@davidhewitt merged commit c4b19c7 into PyO3:main on May 16, 2021
@1tgr deleted the shrink branch on May 28, 2021