Micro-optimize the __morestack fast path #3565

Closed
brson opened this issue Sep 23, 2012 · 5 comments
Labels
A-runtime Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@brson
Contributor

brson commented Sep 23, 2012

This is very performance critical code used for growing the stack, and it currently wastes a lot of instructions on the non-allocating fast path. There are a number of distinct optimizations we can identify.

Here's what happens after calling into __morestack, on the fast path:

  • Set up the frame pointer
  • Push all possible argument registers of the calling function in case the call to upcall_new_stack clobbers them
  • Shuffle the argument registers from the __morestack custom calling convention registers to the C calling convention registers used by upcall_new_stack
  • Call upcall_new_stack, through the indirection of the dynamic linker
  • Call get_sp_limit, an entire assembly function consisting of movq %fs:112, %rax
  • Compare the sp_limit to 0 and don't branch to the rust_get_current_task slow path. This branch always makes the same decision during a __morestack call.
  • Do some math to find the task pointer from the stack limit
  • Check the stack canary to make sure we haven't run off the end of the stack
  • Assert that the task pointer is not null
  • Get the minimum stack size
  • Do some simple math and pointer indirections to determine if task->stk->next is a big enough stack segment to use
  • Assert some invariants
  • memcpy the arguments from the old stack to the new stack
  • Align the new stack frame
  • Call reuse_valgrind_stack to give valgrind hints
  • Call record_stack_limit to execute another single instruction
  • Return the stack pointer to __morestack
  • Pop all the saved argument registers
  • Finally, call the original function
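The segment-reuse decision in the middle of that list (checking whether task->stk->next is big enough) can be sketched roughly as follows. This is a minimal illustration, not the runtime's actual code: the struct layouts, the field names and the try_reuse_next helper are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified segment header; the real runtime's layout differs. */
struct stk_seg {
    struct stk_seg *prev;
    struct stk_seg *next;
    size_t size;              /* usable bytes in this segment */
};

/* Hypothetical task structure holding only the current segment. */
struct task {
    struct stk_seg *stk;
};

/* Fast path: switch to the cached next segment if it can hold the request.
 * Returns the segment on success, or NULL to signal that the caller must
 * take the allocating slow path. */
static struct stk_seg *try_reuse_next(struct task *t, size_t requested,
                                      size_t min_size)
{
    size_t needed = requested > min_size ? requested : min_size;
    struct stk_seg *next = t->stk->next;

    if (next != NULL && next->size >= needed) {
        t->stk = next;
        return next;
    }
    return NULL;
}
```

Everything else in the list (register saves, canary check, valgrind hints) is overhead around this one comparison, which is why the fast path looks so expensive relative to the work it does.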

And returning from the segment:

  • Call upcall_del_stack through the dynamic linker
  • Call get_sp_limit, an entire function consisting of movq %fs:112, %rax
  • Compare the sp_limit to 0, etc.
  • Check the stack canary to make sure we haven't run off the end of the stack
  • Assert that the task pointer is not null
  • Update the current stack pointer in the task
  • Call record_stack_limit

Potential optimizations:

  • Don't save the frame pointer - This could be tricky to make work with DWARF unwinding, due to the odd frame shapes around __morestack. Will be easier after rolling our own unwinder (Invoke instructions kick us off the FastISel path #3551).
  • Inline get_sp_limit, record_stack_limit (Inline get_sp_limit, set_sp_limit, get_sp runtime functions #2521)
  • Statically link upcall_new_stack and upcall_del_stack, hitting new dynamically linked upcalls for the slow path
  • Create a new version of rust_get_current_task that doesn't have a fallback path for the case when the task pointer can't be retrieved from the stack segment. Use it from upcall_new_stack/del_stack.
  • Consider saving the task pointer between upcall_new_stack/del_stack to avoid calculating it again
  • Do fewer pointer indirections and calculations to verify the suitability of the stack segment, possibly storing more information directly in the stack segment header, never accessing the task pointer directly. (See also Remove unnecessary logic in new_stack_fast #3566).
  • Put all asserts under the compile-time debug flag, including the canary check
  • Put the valgrind hinting under a debug flag too. I believe it does have a runtime penalty.
  • Ensure that upcall_new_stack doesn't use xmm registers and remove the xmm saves and restores in __morestack (Stop saving floating point registers in __morestack #2043)
  • Inline upcall_del_stack into __morestack
  • Write the entire fast path in assembly
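As an illustration of the "asserts under a compile-time debug flag" item, one possible shape is a check macro that compiles to nothing unless a debug symbol is defined. The macro name RT_DEBUG, the canary value, and the segment layout below are assumptions made for the sketch, not names from the runtime.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical canary value and segment header, for illustration only. */
#define CANARY_VALUE ((uintptr_t)0xABCDABCDu)

struct stk_seg {
    uintptr_t canary;   /* written when the segment is created */
    uintptr_t limit;    /* lowest usable stack address */
};

/* RT_DEBUG is an assumed build flag. Without it the checks vanish,
 * so the fast path pays nothing for them. */
#ifdef RT_DEBUG
#define RT_DEBUG_CHECK(cond) do { if (!(cond)) abort(); } while (0)
#else
#define RT_DEBUG_CHECK(cond) ((void)0)
#endif

/* Read the stack limit, verifying the segment only in debug builds. */
static uintptr_t segment_limit(const struct stk_seg *seg)
{
    RT_DEBUG_CHECK(seg != NULL);
    RT_DEBUG_CHECK(seg->canary == CANARY_VALUE);
    return seg->limit;
}
```

The trade-off is that a release build no longer notices a clobbered canary, so this only makes sense if overflow is caught some other way.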
@msullivan
Contributor

When all does __morestack get called?

There has also been a bunch of discussion about possibly ditching segmented stacks?

@thestinger
Contributor

It's added to every single function, and LLVM does accounting of stack space and growth for us through our __morestack implementation. There are other growth/safety strategies we could use, like using guard pages + checks on allocations larger than the guard pages, but I think doing that would require patching LLVM.
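A rough sketch of the per-function check being described, written as plain C rather than the prologue LLVM actually emits. The function name and the exact comparison are assumptions; it also assumes a downward-growing stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decide whether a function needing `frame_size` bytes must call
 * __morestack before running. `limit` stands in for the per-thread
 * stack limit that the real prologue reads from thread-local storage. */
static int needs_more_stack(uintptr_t sp, size_t frame_size, uintptr_t limit)
{
    /* Guard against wrap-around when the frame is larger than sp itself. */
    if (frame_size > sp) {
        return 1;
    }
    return (sp - frame_size) < limit;
}
```

In the common case the comparison fails and the function body runs directly; only when it succeeds does control reach __morestack.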

@pnkfelix
Member

visiting for triage, email from 2013-09-09

Right now split-stacks are turned off since they are not supported in the newrt. But I imagine most/all of the suggestions above could be applicable in the next implementation, unless we switch to an entirely new strategy (like using guard pages as suggested by thestinger).

@alexcrichton
Member

In today's meeting we have decided to jettison segmented stacks.

@alexcrichton
Member

We only use __morestack for detecting stack overflow, and that doesn't need to get micro-optimized.
