Labels: A-runtime (Area: std's runtime and "pre-main" init for handling backtraces, unwinds, stack overflows), I-slow (Issue: Problems and improvements with respect to performance of generated code)
This is very performance-critical code used for growing the stack, and it currently wastes a lot of instructions on the non-allocating fast path. There are a number of distinct optimizations we can identify.
Here's what happens on the fast path after calling into __morestack:
Set up the frame pointer
Push all possible argument registers of the calling function in case the call to upcall_new_stack clobbers them
Shuffle the argument registers from the __morestack custom calling convention registers to the C calling convention registers used by upcall_new_stack
Call upcall_new_stack, through the indirection of the dynamic linker
Call get_sp_limit, a separate assembly function whose entire body is movq %fs:112, %rax
Compare sp_limit to 0 and skip the branch to the rust_get_current_task slow path. This branch always goes the same way during a __morestack call.
Do some math to find the task pointer from the stack limit
Check the stack canary to make sure we haven't run off the end of the stack
Assert that the task pointer is not null
Get the minimum stack size
Do some simple math and pointer indirections to determine if task->stk->next is a big enough stack segment to use
Assert some invariants
memcpy the arguments from the old stack to the new stack
Align the new stack frame
Call reuse_valgrind_stack to give valgrind hints
Call record_stack_limit, another whole function that executes a single instruction
Return the stack pointer to __morestack
Pop all the saved argument registers
Finally, call the original function
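To make the segment-reuse core of the fast path concrete (checking whether task->stk->next is big enough, copying the arguments, aligning the frame, and recording the new limit), here is a minimal C sketch. It is not the runtime's code: struct seg, new_stack_fast, align_down and the global sp_limit are hypothetical stand-ins, the 16-byte alignment is an assumption, and the canary check, asserts and valgrind hints are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified segment header (not the runtime's real layout). */
struct seg {
    struct seg *prev;     /* older segment we came from */
    struct seg *next;     /* cached, previously used segment, or NULL */
    size_t size;          /* usable bytes in data */
    unsigned char *data;  /* lowest address of the segment's memory */
};

/* Stand-in for the thread-local limit that get_sp_limit reads and
   record_stack_limit writes; any red zone is ignored here. */
static uintptr_t sp_limit;

static uintptr_t align_down(uintptr_t p, uintptr_t align) {
    assert((align & (align - 1)) == 0);  /* power of two only */
    return p & ~(align - 1);
}

/* Hypothetical fast path: reuse (*stk)->next if it can hold the frame
   plus the copied arguments. Returns the new stack pointer, or NULL when
   the caller must take the (not modeled) allocating slow path. */
static unsigned char *new_stack_fast(struct seg **stk, size_t frame_size,
                                     const void *args, size_t args_size) {
    struct seg *next = (*stk)->next;
    /* Worst case: arguments + frame + up to 15 bytes lost to alignment. */
    if (next == NULL || next->size < frame_size + args_size + 15)
        return NULL;

    /* Stacks grow down: put the arguments at the top of the new
       segment, with the new stack pointer aligned to 16 bytes. */
    uintptr_t top = (uintptr_t)(next->data + next->size);
    unsigned char *sp = (unsigned char *)align_down(top - args_size, 16);
    memcpy(sp, args, args_size);

    *stk = next;                        /* task now runs on `next` */
    sp_limit = (uintptr_t)next->data;   /* record_stack_limit */
    return sp;
}
```

If new_stack_fast returns NULL, the caller would fall back to allocating a fresh segment. Most of the other steps in the list above are bookkeeping wrapped around this comparatively small core.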
And returning from the segment:
Call upcall_del_stack through the dynamic linker
Call get_sp_limit, again a whole function for the single instruction movq %fs:112, %rax
Compare sp_limit to 0, etc., as on the way in
Check the stack canary to make sure we haven't run off the end of the stack
Assert that the task pointer is not null
Update the current stack pointer in the task
Call record_stack_limit
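The return path can be sketched the same way, under the same assumptions: a hypothetical struct seg with a prev link, and a global sp_limit standing in for the thread-local value behind get_sp_limit/record_stack_limit. Leaving the segment being exited cached in prev->next is an assumption about how a later call would find a reusable segment without allocating.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified segment header (not the runtime's real layout). */
struct seg {
    struct seg *prev;     /* older segment we return to */
    struct seg *next;     /* cached segment for reuse, or NULL */
    size_t size;          /* usable bytes in data */
    unsigned char *data;  /* lowest address of the segment's memory */
};

/* Stand-in for the thread-local stack limit; red zone ignored. */
static uintptr_t sp_limit;

/* Hypothetical del_stack fast path: step back to the previous segment. */
static void del_stack_fast(struct seg **stk) {
    struct seg *old = *stk;
    assert(old != NULL && old->prev != NULL);  /* never pop the first segment */
    struct seg *prev = old->prev;
    prev->next = old;                  /* keep the old segment cached */
    *stk = prev;                       /* update the task's current segment */
    sp_limit = (uintptr_t)prev->data;  /* record_stack_limit */
}
```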
Potential optimizations:
Don't save the frame pointer. This could be tricky to make work with DWARF unwinding, due to the odd frame shapes around __morestack. It will be easier after we roll our own unwinder (see #3551, "Invoke instructions kick us off the FastISel path").
Statically link upcall_new_stack and upcall_del_stack, hitting new dynamically linked upcalls for the slow path
Create a new version of rust_get_current_task that doesn't have a fallback path for the case when the task pointer can't be retrieved from the stack segment. Use it from upcall_new_stack/del_stack.
Consider saving the task pointer between upcall_new_stack/del_stack to avoid calculating it again
Do fewer pointer indirections and calculations to verify the suitability of the stack segment, possibly by storing more information directly in the stack segment header and never accessing the task pointer directly (see also #3566, "Remove unnecessary logic in new_stack_fast").
Put all asserts under the compile-time debug flag, including the canary check
Put the valgrind hinting under a debug flag too. I believe it does have a runtime penalty.
Inline get_sp_limit and record_stack_limit (#2521, "Inline get_sp_limit, set_sp_limit, get_sp runtime functions")
Make sure upcall_new_stack doesn't use xmm registers, and remove the xmm saves and restores in __morestack (#2043, "Stop saving floating point registers in __morestack")
Move upcall_del_stack into __morestack
The stack-limit check is added to every single function, and LLVM does the accounting of stack space and growth for us through our __morestack implementation. There are other growth/safety strategies we could use, like guard pages plus checks on allocations larger than the guard pages, but I think doing that would require patching LLVM.
Right now split stacks are turned off, since they are not supported in the newrt. But I imagine most or all of the suggestions above would apply to the next implementation too, unless we switch to an entirely new strategy (like using guard pages, as suggested by thestinger).
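For the guard-page alternative mentioned above, the "checks on allocations larger than the guard pages" part amounts to stack probing: a frame larger than the guard region must touch its memory in order, at most one guard size apart, so that an overflow faults in the guard page instead of jumping past it. The C sketch below only computes the probe offsets below the incoming stack pointer; probe_offsets is a hypothetical helper, not an existing LLVM or runtime API, and actually touching the memory is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes to `out` (at most `max` entries) the byte
   offsets below the incoming stack pointer that a prologue would touch,
   in order, before using a frame of `frame` bytes. Returns the count.
   Frames no larger than the guard region need no probes: every access
   inside them lands either in mapped stack or in the guard page. */
static size_t probe_offsets(size_t frame, size_t guard,
                            size_t *out, size_t max) {
    size_t n = 0;
    assert(guard > 0);
    if (frame <= guard)
        return 0;
    /* One touch per guard-sized step, so no step can skip the guard. */
    for (size_t off = guard; off < frame && n < max; off += guard)
        out[n++] = off;
    /* Finally touch the lowest byte of the frame itself. */
    if (n < max)
        out[n++] = frame;
    return n;
}
```

With a 4096-byte guard page, a 10000-byte frame would be probed at offsets 4096, 8192 and 10000 below the stack pointer, while any frame of 4096 bytes or less needs no probe at all.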