
[Enhancement] Optimize a subtle inline performance problem #23300

Merged

merged 2 commits into StarRocks:main from performance on May 12, 2023

Conversation

liuyehcf
Contributor

@liuyehcf liuyehcf commented May 11, 2023

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Reproduce

We can reproduce it with the following steps:

  1. Table creation statement:
    CREATE TABLE `datachangeset` (
    `Id` varchar(2000) NOT NULL COMMENT "",
    `PublishStatus` varchar(2000) NULL COMMENT ""
    ) ENGINE=OLAP
    PRIMARY KEY(`Id`)
    COMMENT "OLAP"
    DISTRIBUTED BY HASH(`Id`) BUCKETS 5
    PROPERTIES (
    "replication_num" = "3",
    "in_memory" = "false",
    "storage_format" = "DEFAULT",
    "enable_persistent_index" = "false",
    "replicated_storage" = "true",
    "compression" = "LZ4"
    );
  2. Populate the table with:
    • Id increasing densely from "1" to "5000000"
    • PublishStatus set to the constant value New
  3. Test SQL:
    select PublishStatus, COUNT(*) from datachangeset group by 1;
    
    +---------------+----------+
    | PublishStatus | count(*) |
    +---------------+----------+
    | New           | 5000000  |
    +---------------+----------+

Analysis

It's a subtle performance problem introduced by #13968, a refactoring PR intended only to extract and reuse common code, with no logic changes (at least for the case above).

The main reason is that the function's instruction footprint is too large, while this case executes only a certain subset of those instructions. Unfortunately, some of the executed instructions are placed together with instructions that will never be executed in this case (see the code below; this case only goes through branch 2). When instructions are loaded into the instruction cache, they are loaded in cache-line-sized chunks, which may lead to substantial cache misses.

template <typename HashMap, bool is_nullable>
struct AggHashMapWithOneStringKeyWithNullable {
    // Nullable
    template <typename Func, bool allocate_and_compute_state, bool compute_not_founds>
    void compute_agg_states_nullable(size_t chunk_size, const Columns& key_columns, MemPool* pool, Func&& allocate_func,
                                     Buffer<AggDataPtr>* agg_states, std::vector<uint8_t>* not_founds) {
        // ... omit details

        if (key_columns[0]->only_null()) {
            // -----branch 1 ----
            // ... omit details
        } else {
            if (!nullable_column->has_null()) {
                // -----branch 2 ----
                // .. omit details
                return;
            }

            // -----branch 3 ----
            // .. omit details
        }
    }
};

To illustrate this more clearly, I present a diagram below (it is just an assumption of mine, which is not proved):

  • Assume a function has two branches, branchA and branchB, each containing some instructions.
  • We organize these instructions into multiple blocks, where each block equals the size of a cache line.
  • block4 contains instructions belonging to both branchA and branchB.
  • When the processor is about to execute instructionA1, it loads block4 from memory into the cache hierarchy. In practice, current processors load many blocks at a time, so block4, block5, block6, and block7 may all be loaded into the cache. The total code size of starrocks_be is huge, far greater than the total cache size (at least the L1 cache), so these loads may evict hot instructions that will be re-executed before long, and re-executing those hot instructions causes further cache misses. All of this may keep the processor busy replacing cached code, and the performance degradation follows.
                                                                 
               Each block represents a bunch of
               instructions with the size of a cache line

                      +-------------------+
                      | instructionA...   |     block1
                      +-------------------+
                      +-------------------+
 branchA              | instructionA...   |     block2
                      +-------------------+
                      +-------------------+
                      | instructionA...   |     block3
                      +-------------------+
                      +-------------------+
                      | instructionA1     |
                      | instructionA2     |     block4
                      | instructionB...   |
                      +-------------------+
                      +-------------------+
                      | instructionB...   |     block5
 branchB              +-------------------+
                      +-------------------+
                      | instructionB...   |     block6
                      +-------------------+
                      +-------------------+
                      | instructionB...   |     block7
                      +-------------------+

Solution

For those functions that process a whole chunk's data, we can mark them with __attribute__((noinline)) to prevent the compiler from inlining them.

This keeps each function's instructions closely related, without pulling in large runs of instructions that are never executed.

Experiment

Since this SQL runs very fast, we use mysqlslap to test it:

|                 | -c 1 -n 1000 | -c 10 -n 1000 | -c 100 -n 1000 |
| --------------- | ------------ | ------------- | -------------- |
| main-Before     | 24.992s      | 5.325s        | 4.436s         |
| main-After      | 22.027s      | 4.720s        | 3.859s         |
| 2.5.4           | 22.839s      | 4.794s        | 4.219s         |
| 3.0.0           | 25.903s      | 5.912s        | 4.898s         |
| 3.0.0 + This PR | 22.355s      | 5.294s        | 4.290s         |

Notice that 3.0.0 + This PR still shows a degradation in the 10-concurrency scenario; this is fixed by #22744, which has already been cherry-picked to branch-3.0.0, so we can ignore it here.

Perf top

  • main-Before
    34.47%  starrocks_be        [.] phmap::priv::raw_hash_set<phmap::priv::FlatHashMapPolicy<starrocks::Slice, unsigned char*>, starrocks::SliceHashWithSeed<(starrocks::PhmapSeed)0>, starrocks::SliceEqual, std::allocator<std::pair<starrocks::Slice const, unsigned char*> >
    16.52%  starrocks_be        [.] starrocks::AggHashMapWithOneStringKeyWithNullable<phmap::flat_hash_map<starrocks::Slice, unsigned char*, starrocks::SliceHashWithSeed<(starrocks::PhmapSeed)0>, starrocks::SliceEqual, std::allocator<std::pair<starrocks::Slice const, unsi
    10.31%  starrocks_be        [.] starrocks::append_fixed_length<unsigned int, 8ul>
    6.86%  starrocks_be        [.] starrocks::BinaryDictPageDecoder<(starrocks::LogicalType)17>::next_batch
    4.42%  starrocks_be        [.] starrocks::AggregateFunctionBatchHelper<starrocks::AggregateCountFunctionState<false>, starrocks::CountAggregateFunction<false> >::update_batch
    3.23%  starrocks_be        [.] std::vector<starrocks::Slice, starrocks::raw::RawAllocator<starrocks::Slice, 0ul, std::allocator<starrocks::Slice> > >::_M_default_append
    2.22%  starrocks_be        [.] starrocks::pipeline::PipelineDriverPoller::run_internal
    1.41%  libc-2.17.so        [.] __memcpy_ssse3_back
    1.40%  [vdso]              [.] __vdso_clock_gettime
    
  • main-After
    44.09%  starrocks_be        [.] starrocks::AggHashMapWithOneStringKeyWithNullable<phmap::flat_hash_map<starrocks::Slice, unsigned char*, starrocks::SliceHashWithSeed<(starrocks::PhmapSeed)0>, starrocks::SliceEqual, std::allocator<std::pair<starrocks::Slice const, unsi
    11.98%  starrocks_be        [.] starrocks::append_fixed_length<unsigned int, 8ul>
    7.81%  starrocks_be        [.] starrocks::BinaryDictPageDecoder<(starrocks::LogicalType)17>::next_batch
    5.07%  starrocks_be        [.] starrocks::AggregateFunctionBatchHelper<starrocks::AggregateCountFunctionState<false>, starrocks::CountAggregateFunction<false> >::update_batch
    3.15%  starrocks_be        [.] std::vector<starrocks::Slice, starrocks::raw::RawAllocator<starrocks::Slice, 0ul, std::allocator<starrocks::Slice> > >::_M_default_append
    2.21%  starrocks_be        [.] starrocks::pipeline::PipelineDriverPoller::run_internal
    1.68%  libc-2.17.so        [.] __memcpy_ssse3_back
    1.59%  [vdso]              [.] __vdso_clock_gettime
    0.64%  libc-2.17.so        [.] __memset_sse2
    0.60%  libc-2.17.so        [.] __memcmp_sse4_1
    0.60%  starrocks_be        [.] jemalloc
    0.55%  libpthread-2.17.so  [.] pthread_mutex_unlock
    0.53%  starrocks_be        [.] my_malloc
    

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr will affect users' behaviors
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto backported to target branch
    • 3.0
    • 2.5
    • 2.4
    • 2.3

@github-actions github-actions bot added the 3.0 label May 11, 2023
@liuyehcf liuyehcf changed the title [Enhancement] Optimize non-nullable group by case [Enhancement] Optimize a subtle inline performance problem May 11, 2023
@liuyehcf liuyehcf changed the title [Enhancement] Optimize a subtle inline performance problem [WIP][Enhancement] Optimize a subtle inline performance problem May 11, 2023
@liuyehcf liuyehcf force-pushed the performance branch 2 times, most recently from 717451c to 59c13f0 on May 12, 2023 03:02
@liuyehcf liuyehcf changed the title [WIP][Enhancement] Optimize a subtle inline performance problem [Enhancement] Optimize a subtle inline performance problem May 12, 2023
@liuyehcf liuyehcf enabled auto-merge (squash) May 12, 2023 05:16
kangkaisen
kangkaisen previously approved these changes May 12, 2023
@stdpain
Contributor

stdpain commented May 12, 2023

add an ALWAYS_NOINLINE in be/src/common/compiler_util.h

@liuyehcf liuyehcf dismissed stale reviews from silverbullet233 and kangkaisen via b1259d5 May 12, 2023 08:08
Signed-off-by: liuyehcf <1559500551@qq.com>
Signed-off-by: liuyehcf <1559500551@qq.com>
@liuyehcf liuyehcf merged commit 9e03202 into StarRocks:main May 12, 2023
20 checks passed
@wanpengfei-git
Collaborator

@Mergifyio backport branch-3.0

@github-actions github-actions bot removed the 3.0 label May 12, 2023
@mergify
Contributor

mergify bot commented May 12, 2023

backport branch-3.0

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request May 12, 2023
Signed-off-by: liuyehcf <1559500551@qq.com>
(cherry picked from commit 9e03202)
wanpengfei-git pushed a commit that referenced this pull request May 13, 2023
Signed-off-by: liuyehcf <1559500551@qq.com>
(cherry picked from commit 9e03202)
Moonm3n pushed a commit to Moonm3n/starrocks that referenced this pull request May 23, 2023
…#23300)

Signed-off-by: liuyehcf <1559500551@qq.com>
Signed-off-by: Moonm3n <saxonzhan@gmail.com>
numbernumberone pushed a commit to numbernumberone/starrocks that referenced this pull request May 31, 2023
numbernumberone pushed a commit to numbernumberone/starrocks that referenced this pull request May 31, 2023
abc982627271 pushed a commit to abc982627271/starrocks that referenced this pull request Jun 5, 2023