refactor: use memcpy for by-val aggregate type input parameters #1196

Merged
merged 30 commits into from
Apr 26, 2024

@mhasel mhasel commented Apr 10, 2024

Aggregate VAR_INPUT arguments to function calls are now passed as pointers and memcpy'd into a local variable inside the callee, instead of being passed by value and written with a store.
To achieve this, a fair amount of logic moves from the expression_generator to the pou_generator: the caller now only bitcasts an aggregate argument to its pointer type (if necessary), and the function itself takes care of the memset/memcpy.
In some cases this significantly reduces allocations and generated IR, especially when passing member variables of FUNCTION_BLOCK/PROGRAM structs or when forwarding a by-ref argument to a by-val parameter.
Previously the caller had to allocate a local temporary and copy the value into it before passing it to the callee; now it can pass the pointer directly.
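
As a rough analogy in plain Rust (not the compiler's actual codegen), the calling convention described above could be sketched as follows. The constant `LEN`, the body of `bar`, and the `member` variable are hypothetical and chosen only for illustration; a reference stands in for the pointer, and `copy_from_slice` stands in for the memcpy.

```rust
/// Stand-in for a large aggregate such as STRING[65536]
/// (65536 characters plus a null terminator). Hypothetical constant.
const LEN: usize = 65_537;

/// Callee in the style described above: it receives a pointer (here a
/// reference) and copies the pointee into its own local variable.
fn bar(val: &[u8; LEN]) -> i32 {
    let mut local = [0u8; LEN]; // local copy preserves by-value semantics
    local.copy_from_slice(val); // one bulk copy, analogous to a memcpy
    local[0] = b'X'; // mutating the local must not affect the caller
    i32::from(local[0])
}

fn main() {
    let mut member = [0u8; LEN];
    member[0] = b'a';
    // The caller passes the address of the existing variable;
    // it does not allocate and fill a temporary first.
    let result = bar(&member);
    assert_eq!(result, i32::from(b'X'));
    assert_eq!(member[0], b'a'); // the caller's data is untouched
    println!("result = {result}, caller still sees '{}'", member[0] as char);
}
```

The point of the sketch is only the division of labour: the caller hands over an address, and the callee is responsible for making its private copy.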

Using the same example as given in issue #1074

FUNCTION bar : DINT
    VAR_INPUT
        val : STRING[65536];
    END_VAR
END_FUNCTION

the llc-14 --time-passes benchmark improves significantly:

master/store:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 69.0989 seconds (69.0998 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  64.0869 ( 93.0%)   0.0700 ( 43.7%)  64.1569 ( 92.8%)  64.1579 ( 92.8%)  X86 DAG->DAG Instruction Selection
   4.6626 (  6.8%)   0.0000 (  0.0%)   4.6626 (  6.7%)   4.6626 (  6.7%)  Machine Instruction Scheduler
   0.0767 (  0.1%)   0.0900 ( 56.2%)   0.1667 (  0.2%)   0.1667 (  0.2%)  X86 Assembly Printer

...

===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 61.4012 seconds (61.4021 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  60.5939 ( 98.7%)   0.0300 ( 75.0%)  60.6238 ( 98.7%)  60.6248 ( 98.7%)  DAG Combining 1
   0.3744 (  0.6%)   0.0000 (  0.0%)   0.3744 (  0.6%)   0.3744 (  0.6%)  Instruction Selection
   0.1517 (  0.2%)   0.0000 (  0.0%)   0.1517 (  0.2%)   0.1517 (  0.2%)  DAG Combining 2
   0.1485 (  0.2%)   0.0000 (  0.0%)   0.1485 (  0.2%)   0.1485 (  0.2%)  Instruction Scheduling
   0.0481 (  0.1%)   0.0000 (  0.0%)   0.0481 (  0.1%)   0.0481 (  0.1%)  DAG Legalization
 
...

memcpy:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0016 seconds (0.0017 wall clock)

   ---User Time---   --User+System--   ---Wall Time---  --- Name ---
   0.0004 ( 23.3%)   0.0004 ( 23.3%)   0.0004 ( 23.3%)  X86 DAG->DAG Instruction Selection
   0.0004 ( 23.2%)   0.0004 ( 23.2%)   0.0004 ( 23.1%)  Expand Atomic instructions
   0.0002 ( 10.6%)   0.0002 ( 10.6%)   0.0002 ( 10.6%)  X86 Assembly Printer

...

===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0002 seconds (0.0002 wall clock)

   ---User Time---   --User+System--   ---Wall Time---  --- Name ---
   0.0001 ( 46.7%)   0.0001 ( 46.7%)   0.0001 ( 46.9%)  Instruction Selection
   0.0000 ( 19.5%)   0.0000 ( 19.5%)   0.0000 ( 19.8%)  DAG Combining 1
   0.0000 ( 13.8%)   0.0000 ( 13.8%)   0.0000 ( 13.5%)  Instruction Scheduling
   0.0000 ( 10.5%)   0.0000 ( 10.5%)   0.0000 ( 10.4%)  Instruction Creation
   0.0000 (  3.8%)   0.0000 (  3.8%)   0.0000 (  3.5%)  DAG Combining 2
   0.0000 (  3.3%)   0.0000 (  3.3%)   0.0000 (  3.2%)  DAG Legalization

...

Total pass execution time and instruction selection and scheduling time improve by factors of roughly 40,000 and 300,000, respectively.
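
For reference, the two factors follow directly from the "Total Execution Time" lines in the reports above. A small sketch of the arithmetic (the helper name `speedup` is made up for this example):

```rust
/// Ratio between a baseline duration and an improved duration, both in seconds.
fn speedup(before_s: f64, after_s: f64) -> f64 {
    before_s / after_s
}

fn main() {
    // Totals copied from the master/store and memcpy timing reports.
    let passes = speedup(69.0989, 0.0016);
    let isel = speedup(61.4012, 0.0002);
    println!("pass execution: ~{passes:.0}x");
    println!("instruction selection and scheduling: ~{isel:.0}x");
}
```

Since the memcpy timings are reported with only one or two significant digits, the resulting factors are best read as orders of magnitude.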

Resolves #1074

volsa and others added 4 commits April 2, 2024 10:28
- Bumps the Windows Rust Version to 1.77 for test runs
- Applies 1.77 clippy and rustfmt suggestions
@mhasel mhasel changed the title fix: use memcpy for by-val aggregate type parameters fix: use memcpy for by-val aggregate type input parameters Apr 17, 2024
@mhasel mhasel marked this pull request as ready for review April 19, 2024 13:48
@mhasel mhasel requested review from ghaith and riederm April 19, 2024 13:48
@mhasel mhasel changed the title fix: use memcpy for by-val aggregate type input parameters refactor: use memcpy for by-val aggregate type input parameters Apr 19, 2024
volsa previously approved these changes Apr 22, 2024

@volsa volsa left a comment

Maybe @ghaith or @riederm can double-check but looks good from my end.

p.s. loving these LLVM IR reductions 🤤

tests/correctness/strings.rs (outdated, resolved)
let bitcast = self.llvm.builder.build_bitcast(ptr, ty, "bitcast").into_pointer_value();
let (size, alignment) = if let DataTypeInformation::String { size, encoding } = type_info
{
// since passed string args might be larger than the local acceptor, we need to first memset the local variable to 0

@volsa volsa Apr 22, 2024

Can you elaborate on this? I still don't understand why the memset is needed here 😅 I initially thought the memset was required to avoid garbage values in the alloca call, but that's not the case here?

@mhasel mhasel Apr 22, 2024

We don't copy the entire length of the locally allocated string, since the passed string might be larger than the acceptor:

FUNCTION foo
VAR_INPUT
	str: STRING[3];
END_VAR
END_FUNCTION

foo('longer than 3');

And to ensure the last grapheme is in fact a null-terminator, we memset the entire string to 0. An alternative would be to GEP into the last element and set it to 0 ourselves, but I'm not sure if that is worth it, since memsetting to 0 should just be an XOR with the local variable.
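
The memset-then-memcpy idea can be illustrated with plain byte arrays. This is a minimal sketch, not the generated IR: `copy_into_acceptor` is a hypothetical helper, characters are assumed to be one byte wide, and copying at most `CAP - 1` bytes is an assumption made here so that the final byte stays zero.

```rust
/// Copies `src` into a zero-initialised acceptor of `CAP` bytes, where
/// CAP is the declared string length plus one byte for the terminator.
fn copy_into_acceptor<const CAP: usize>(src: &[u8]) -> [u8; CAP] {
    // Step 1: "memset" the whole local variable to 0.
    let mut local = [0u8; CAP];
    // Step 2: "memcpy" at most CAP - 1 bytes so the last byte stays 0.
    let n = src.len().min(CAP.saturating_sub(1));
    local[..n].copy_from_slice(&src[..n]);
    local
}

fn main() {
    // STRING[3] needs 4 bytes: three characters plus the terminator.
    let local: [u8; 4] = copy_into_acceptor(b"longer than 3");
    assert_eq!(&local, b"lon\0");
    // A shorter argument leaves the remaining bytes zeroed.
    let short: [u8; 4] = copy_into_acceptor(b"a");
    assert_eq!(&short, b"a\0\0\0");
}
```

The alternative mentioned above, writing a single zero into the last element, would correspond to `local[CAP - 1] = 0` after the copy instead of zeroing the whole buffer up front.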

volsa commented Apr 22, 2024

As a side note, is this a good candidate to expand our performance tests to detect potential regressions? That is create a test case with many big aggregate types all passed by value and track their runtime behaviour in our dashboard?

mhasel commented Apr 22, 2024

> As a side note, is this a good candidate to expand our performance tests to detect potential regressions? That is create a test case with many big aggregate types all passed by value and track their runtime behaviour in our dashboard?

Sounds good. This would also allow us to better test future front-end optimizations (e.g. more accurate byte-alignment for memset/memcpy calls, ...)

@mhasel mhasel merged commit 7807d9d into master Apr 26, 2024
15 checks passed
@mhasel mhasel deleted the by-ref-aggregate-types branch April 26, 2024 10:59
Development

Successfully merging this pull request may close these issues.

Replace store instructions with memcpy for aggregate types