
[Enhancement] Optimize a subtle inline performance problem #23300

Merged

merged 2 commits into StarRocks:main from performance on May 12, 2023

Conversation

liuyehcf
Contributor

@liuyehcf liuyehcf commented May 11, 2023

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Reproduce

We can reproduce it with the following steps:

  1. Table creation statement:
    CREATE TABLE `datachangeset` (
    `Id` varchar(2000) NOT NULL COMMENT "",
    `PublishStatus` varchar(2000) NULL COMMENT ""
    ) ENGINE=OLAP
    PRIMARY KEY(`Id`)
    COMMENT "OLAP"
    DISTRIBUTED BY HASH(`Id`) BUCKETS 5
    PROPERTIES (
    "replication_num" = "3",
    "in_memory" = "false",
    "storage_format" = "DEFAULT",
    "enable_persistent_index" = "false",
    "replicated_storage" = "true",
    "compression" = "LZ4"
    );
  2. Populate the table with:
    • Id increasing densely from "1" to "5000000"
    • PublishStatus set to the constant value New
  3. Test SQL:
    select PublishStatus, COUNT(*) from datachangeset group by 1;
    
    +---------------+----------+
    | PublishStatus | count(*) |
    +---------------+----------+
    | New           | 5000000  |
    +---------------+----------+

Analysis

It's a subtle performance problem introduced by #13968, a refactoring PR intended only to extract and reuse common code, with no logic changes (at least for the case above).

The main reason is that the function's instruction footprint is too large, while this case executes only a certain subset of those instructions. Unfortunately, some of the executed instructions are placed together with instructions that will never be executed in this case (see the code below; this case only goes through branch 2). When instructions are loaded into the instruction cache, they are loaded in cache-line-sized chunks, which may lead to substantial cache misses.

template <typename HashMap, bool is_nullable>
struct AggHashMapWithOneStringKeyWithNullable {
    // Nullable
    template <typename Func, bool allocate_and_compute_state, bool compute_not_founds>
    void compute_agg_states_nullable(size_t chunk_size, const Columns& key_columns, MemPool* pool, Func&& allocate_func,
                                     Buffer<AggDataPtr>* agg_states, std::vector<uint8_t>* not_founds) {
        // ... omit details

        if (key_columns[0]->only_null()) {
            // -----branch 1 ----
            // ... omit details
        } else {
            if (!nullable_column->has_null()) {
                // -----branch 2 ----
                // .. omit details
                return;
            }

            // -----branch 3 ----
            // .. omit details
        }
    }
};

To illustrate this more clearly, I present a diagram below (it is just an assumption of mine, which is not proved):

  • Assume a function has two branches, branchA and branchB, each containing some instructions.
  • We organize these instructions into multiple blocks, where each block equals the size of a cache line.
  • block4 contains instructions belonging to both branchA and branchB.
  • When the processor is about to execute instructionA1, it loads block4 from memory into the cache hierarchy. In practice, current processors load many blocks at a time, so block4, block5, block6, and block7 may all be loaded into the cache. The total code size of starrocks_be is huge, far greater than the total cache size (at least the L1 cache), so these loads may evict hot instructions that will be re-executed before long, and re-executing those hot instructions causes further cache misses. All of this may keep the processor busy replacing cached code, and the performance degradation follows.
                                                                 
               Each block represents a bunch of
               instructions with the size of a cache line

                      +-------------------+
                      | instructionA...   |     block1
                      +-------------------+
                      +-------------------+
 branchA              | instructionA...   |     block2
                      +-------------------+
                      +-------------------+
                      | instructionA...   |     block3
                      +-------------------+
                      +-------------------+
                      | instructionA1     |
                      | instructionA2     |     block4
                      | instructionB...   |
                      +-------------------+
                      +-------------------+
                      | instructionB...   |     block5
 branchB              +-------------------+
                      +-------------------+
                      | instructionB...   |     block6
                      +-------------------+
                      +-------------------+
                      | instructionB...   |     block7
                      +-------------------+

Solution

For those functions that process a whole chunk's data, we can mark them with __attribute__((noinline)) to prevent the compiler from inlining them.

This keeps each function's instructions closely related, without pulling in large runs of instructions that are never executed.

Experiment

Since this SQL runs very fast, we use mysqlslap to test it:

|                 | -c 1 -n 1000 | -c 10 -n 1000 | -c 100 -n 1000 |
| --------------- | ------------ | ------------- | -------------- |
| main-Before     | 24.992s      | 5.325s        | 4.436s         |
| main-After      | 22.027s      | 4.720s        | 3.859s         |
| 2.5.4           | 22.839s      | 4.794s        | 4.219s         |
| 3.0.0           | 25.903s      | 5.912s        | 4.898s         |
| 3.0.0 + This PR | 22.355s      | 5.294s        | 4.290s         |

Notice that 3.0.0 + This PR still shows a degradation in the 10-concurrency scenario; this is fixed by #22744, which has already been cherry-picked to branch-3.0.0, so we can ignore it here.

Perf top

  • main-Before
    34.47%  starrocks_be        [.] phmap::priv::raw_hash_set<phmap::priv::FlatHashMapPolicy<starrocks::Slice, unsigned char*>, starrocks::SliceHashWithSeed<(starrocks::PhmapSeed)0>, starrocks::SliceEqual, std::allocator<std::pair<starrocks::Slice const, unsigned char*> >
    16.52%  starrocks_be        [.] starrocks::AggHashMapWithOneStringKeyWithNullable<phmap::flat_hash_map<starrocks::Slice, unsigned char*, starrocks::SliceHashWithSeed<(starrocks::PhmapSeed)0>, starrocks::SliceEqual, std::allocator<std::pair<starrocks::Slice const, unsi
    10.31%  starrocks_be        [.] starrocks::append_fixed_length<unsigned int, 8ul>
    6.86%  starrocks_be        [.] starrocks::BinaryDictPageDecoder<(starrocks::LogicalType)17>::next_batch
    4.42%  starrocks_be        [.] starrocks::AggregateFunctionBatchHelper<starrocks::AggregateCountFunctionState<false>, starrocks::CountAggregateFunction<false> >::update_batch
    3.23%  starrocks_be        [.] std::vector<starrocks::Slice, starrocks::raw::RawAllocator<starrocks::Slice, 0ul, std::allocator<starrocks::Slice> > >::_M_default_append
    2.22%  starrocks_be        [.] starrocks::pipeline::PipelineDriverPoller::run_internal
    1.41%  libc-2.17.so        [.] __memcpy_ssse3_back
    1.40%  [vdso]              [.] __vdso_clock_gettime
    
  • main-After
    44.09%  starrocks_be        [.] starrocks::AggHashMapWithOneStringKeyWithNullable<phmap::flat_hash_map<starrocks::Slice, unsigned char*, starrocks::SliceHashWithSeed<(starrocks::PhmapSeed)0>, starrocks::SliceEqual, std::allocator<std::pair<starrocks::Slice const, unsi
    11.98%  starrocks_be        [.] starrocks::append_fixed_length<unsigned int, 8ul>
    7.81%  starrocks_be        [.] starrocks::BinaryDictPageDecoder<(starrocks::LogicalType)17>::next_batch
    5.07%  starrocks_be        [.] starrocks::AggregateFunctionBatchHelper<starrocks::AggregateCountFunctionState<false>, starrocks::CountAggregateFunction<false> >::update_batch
    3.15%  starrocks_be        [.] std::vector<starrocks::Slice, starrocks::raw::RawAllocator<starrocks::Slice, 0ul, std::allocator<starrocks::Slice> > >::_M_default_append
    2.21%  starrocks_be        [.] starrocks::pipeline::PipelineDriverPoller::run_internal
    1.68%  libc-2.17.so        [.] __memcpy_ssse3_back
    1.59%  [vdso]              [.] __vdso_clock_gettime
    0.64%  libc-2.17.so        [.] __memset_sse2
    0.60%  libc-2.17.so        [.] __memcmp_sse4_1
    0.60%  starrocks_be        [.] jemalloc
    0.55%  libpthread-2.17.so  [.] pthread_mutex_unlock
    0.53%  starrocks_be        [.] my_malloc
    

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr will affect users' behaviors
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto backported to target branch
    • 3.0
    • 2.5
    • 2.4
    • 2.3

@github-actions github-actions bot added the 3.0 label May 11, 2023
@liuyehcf liuyehcf changed the title [Enhancement] Optimize non-nullable group by case [Enhancement] Optimize a subtle inline performance problem May 11, 2023
@liuyehcf liuyehcf changed the title [Enhancement] Optimize a subtle inline performance problem [WIP][Enhancement] Optimize a subtle inline performance problem May 11, 2023
@liuyehcf liuyehcf force-pushed the performance branch 2 times, most recently from 717451c to 59c13f0 on May 12, 2023 03:02
@liuyehcf liuyehcf changed the title [WIP][Enhancement] Optimize a subtle inline performance problem [Enhancement] Optimize a subtle inline performance problem May 12, 2023
@liuyehcf liuyehcf enabled auto-merge (squash) May 12, 2023 05:16
kangkaisen
kangkaisen previously approved these changes May 12, 2023
@stdpain
Contributor

stdpain commented May 12, 2023

add an ALWAYS_NOINLINE in be/src/common/compiler_util.h

@liuyehcf liuyehcf dismissed stale reviews from silverbullet233 and kangkaisen via b1259d5 May 12, 2023 08:08
Signed-off-by: liuyehcf <1559500551@qq.com>
Signed-off-by: liuyehcf <1559500551@qq.com>
@liuyehcf liuyehcf merged commit 9e03202 into StarRocks:main May 12, 2023
20 checks passed
@wanpengfei-git
Collaborator

@Mergifyio backport branch-3.0

@github-actions github-actions bot removed the 3.0 label May 12, 2023
@mergify
Contributor

mergify bot commented May 12, 2023

backport branch-3.0

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request May 12, 2023
Signed-off-by: liuyehcf <1559500551@qq.com>
(cherry picked from commit 9e03202)
wanpengfei-git pushed a commit that referenced this pull request May 13, 2023
Signed-off-by: liuyehcf <1559500551@qq.com>
(cherry picked from commit 9e03202)
Moonm3n pushed a commit to Moonm3n/starrocks that referenced this pull request May 23, 2023
…#23300)

Signed-off-by: liuyehcf <1559500551@qq.com>
Signed-off-by: Moonm3n <saxonzhan@gmail.com>
numbernumberone pushed a commit to numbernumberone/starrocks that referenced this pull request May 31, 2023
numbernumberone pushed a commit to numbernumberone/starrocks that referenced this pull request May 31, 2023
abc982627271 pushed a commit to abc982627271/starrocks that referenced this pull request Jun 5, 2023