[Enhancement] optimize harmonic mean evaluation in hll::estimate_cardinality #16351

Merged: 1 commit, Jan 9, 2023

@satanson satanson commented Jan 9, 2023

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Which issues does this PR fix:

Fixes #

Problem Summary (Required):

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This PR will affect users' behaviors
  • This PR needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto backported to target branch
    • 2.5
    • 2.4
    • 2.3
    • 2.2

Harmonic mean evaluation in hll::estimate_cardinality is quite slow. The original code is shown below as choice 1, followed by three alternatives that were tried to speed it up.

  1. Original method:
std::pair<float, int> calc_harmonic_mean1(int8_t* data, size_t n) {
    float harmonic_mean = 0;
    int num_zeros = 0;

    for (int i = 0; i < n; ++i) {
        harmonic_mean += powf(2.0f, -data[i]);

        if (data[i] == 0) {
            ++num_zeros;
        }
    }
    harmonic_mean = 1.0f / harmonic_mean;
    return std::make_pair(harmonic_mean, num_zeros);
}
  2. SIMD method
    Note: exp256_ps comes from https://github.com/reyoung/avx_mathfun
std::pair<float, int> calc_harmonic_mean2(int8_t* data, size_t n) {
    float harmonic_mean = 0;
    int num_zeros = 0;
    // Declared outside the AVX2 block so the scalar tail loop also compiles without AVX2.
    auto* p = data;
    const auto end = data + n;
#if defined(__AVX2__)
    constexpr auto BLOCK_SIZE = sizeof(__m256i);
    const auto end0 = data + (n & ~(BLOCK_SIZE - 1));
    const auto ln2 = _mm256_set1_ps(0.69314718055995f);
    const auto zerof32 = _mm256_setzero_ps();
    const auto zeroi8 = _mm256_setzero_si256();
    auto sumf32 = _mm256_setzero_ps();
    for (; p < end0; p += BLOCK_SIZE) {
        // Unaligned load: data is not guaranteed to be 32-byte aligned.
        auto d = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p));
        num_zeros += _mm_popcnt_u32(_mm256_movemask_epi8(_mm256_cmpeq_epi8(d, zeroi8)));

        // Process the 32 bytes as four groups of eight floats: 2^(-x) = exp(-x * ln2).
        auto pp = p;
        for (int i = 0; i < 4; ++i) {
            auto x = _mm256_set_ps(pp[0], pp[1], pp[2], pp[3], pp[4], pp[5], pp[6], pp[7]);
            sumf32 = _mm256_add_ps(exp256_ps((_mm256_mul_ps(_mm256_sub_ps(zerof32, x), ln2))), sumf32);
            pp += 8;
        }
    }
    // Horizontal sum of the eight float lanes.
    for (int i = 0; i < sizeof(sumf32) / sizeof(float); ++i) {
        harmonic_mean += (reinterpret_cast<float*>(&sumf32))[i];
    }
#endif
    // Scalar tail (or the whole input when AVX2 is unavailable).
    for (; p < end; ++p) {
        harmonic_mean += powf(2.0f, -p[0]);
        if (p[0] == 0) {
            ++num_zeros;
        }
    }

    harmonic_mean = 1.0f / harmonic_mean;
    return std::make_pair(harmonic_mean, num_zeros);
}
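The SIMD variant leans on the identity 2^(-x) = exp(-x * ln 2), since exp256_ps only offers a vectorized exponential. A minimal scalar sketch of that identity is given here; `pow2_neg_via_exp` and `max_relative_error` are hypothetical helper names used for illustration and are not part of the PR.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: computes 2^(-x) as exp(-x * ln 2), the scalar
// analogue of what each AVX2 lane computes through exp256_ps.
inline float pow2_neg_via_exp(int x) {
    assert(x >= 0);
    constexpr float kLn2 = 0.69314718055995f;
    return std::exp(-static_cast<float>(x) * kLn2);
}

// Largest relative difference from powf(2, -x) for x in [0, max_x].
inline float max_relative_error(int max_x) {
    float worst = 0.0f;
    for (int x = 0; x <= max_x; ++x) {
        const float exact = std::pow(2.0f, -static_cast<float>(x));
        const float approx = pow2_neg_via_exp(x);
        worst = std::fmax(worst, std::fabs(approx - exact) / exact);
    }
    return worst;
}
```

Because the exponential route goes through a rounded ln 2 and a rounded product, its result is only approximately equal to the exact power of two, which is one reason the shift and table variants can be both faster and more exact.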
  3. Use 1.0f / (1L << x) instead of powf(2, -x), because x ranges over 0..40:
std::pair<float, int> calc_harmonic_mean3(int8_t* data, size_t n) {
    float harmonic_mean = 0;
    int num_zeros = 0;

    for (int i = 0; i < n; ++i) {
        harmonic_mean += 1.0f / static_cast<float>((1L << data[i]));
        if (data[i] == 0) {
            ++num_zeros;
        }
    }
    harmonic_mean = 1.0f / harmonic_mean;
    return std::make_pair(harmonic_mean, num_zeros);
}
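Because every register value is a small non-negative integer, 2^(-x) is the reciprocal of an exact power of two, so a shift can produce it without a libm call. The sketch below checks that the shift form matches `std::ldexp(1.0f, -x)`, which scales by a power of two exactly. The names `pow2_neg_via_shift` and `shift_matches_ldexp` are hypothetical, and the sketch assumes x stays below 64 so the shift is well defined.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: 2^(-x) via an integer shift, valid for 0 <= x < 64.
// An unsigned 64-bit operand avoids shifting into the sign bit at x == 63.
inline float pow2_neg_via_shift(int x) {
    assert(x >= 0 && x < 64);
    return 1.0f / static_cast<float>(std::uint64_t{1} << x);
}

// Returns true when the shift form equals the exact power of two
// for every x in [0, max_x].
inline bool shift_matches_ldexp(int max_x) {
    for (int x = 0; x <= max_x; ++x) {
        if (pow2_neg_via_shift(x) != std::ldexp(1.0f, -x)) {
            return false;
        }
    }
    return true;
}
```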
  4. Similar to 3, but use a lookup table of precomputed 1.0f / (1L << x) values, for x in 0..64:
static float tables[65] = {
        1.0f / static_cast<float>(1L << 0),  1.0f / static_cast<float>(1L << 1),  1.0f / static_cast<float>(1L << 2),
        1.0f / static_cast<float>(1L << 3),  1.0f / static_cast<float>(1L << 4),  1.0f / static_cast<float>(1L << 5),
        1.0f / static_cast<float>(1L << 6),  1.0f / static_cast<float>(1L << 7),  1.0f / static_cast<float>(1L << 8),
        1.0f / static_cast<float>(1L << 9),  1.0f / static_cast<float>(1L << 10), 1.0f / static_cast<float>(1L << 11),
        1.0f / static_cast<float>(1L << 12), 1.0f / static_cast<float>(1L << 13), 1.0f / static_cast<float>(1L << 14),
        1.0f / static_cast<float>(1L << 15), 1.0f / static_cast<float>(1L << 16), 1.0f / static_cast<float>(1L << 17),
        1.0f / static_cast<float>(1L << 18), 1.0f / static_cast<float>(1L << 19), 1.0f / static_cast<float>(1L << 20),
        1.0f / static_cast<float>(1L << 21), 1.0f / static_cast<float>(1L << 22), 1.0f / static_cast<float>(1L << 23),
        1.0f / static_cast<float>(1L << 24), 1.0f / static_cast<float>(1L << 25), 1.0f / static_cast<float>(1L << 26),
        1.0f / static_cast<float>(1L << 27), 1.0f / static_cast<float>(1L << 28), 1.0f / static_cast<float>(1L << 29),
        1.0f / static_cast<float>(1L << 30), 1.0f / static_cast<float>(1L << 31), 1.0f / static_cast<float>(1L << 32),
        1.0f / static_cast<float>(1L << 33), 1.0f / static_cast<float>(1L << 34), 1.0f / static_cast<float>(1L << 35),
        1.0f / static_cast<float>(1L << 36), 1.0f / static_cast<float>(1L << 37), 1.0f / static_cast<float>(1L << 38),
        1.0f / static_cast<float>(1L << 39), 1.0f / static_cast<float>(1L << 40), 1.0f / static_cast<float>(1L << 41),
        1.0f / static_cast<float>(1L << 42), 1.0f / static_cast<float>(1L << 43), 1.0f / static_cast<float>(1L << 44),
        1.0f / static_cast<float>(1L << 45), 1.0f / static_cast<float>(1L << 46), 1.0f / static_cast<float>(1L << 47),
        1.0f / static_cast<float>(1L << 48), 1.0f / static_cast<float>(1L << 49), 1.0f / static_cast<float>(1L << 50),
        1.0f / static_cast<float>(1L << 51), 1.0f / static_cast<float>(1L << 52), 1.0f / static_cast<float>(1L << 53),
        1.0f / static_cast<float>(1L << 54), 1.0f / static_cast<float>(1L << 55), 1.0f / static_cast<float>(1L << 56),
        1.0f / static_cast<float>(1L << 57), 1.0f / static_cast<float>(1L << 58), 1.0f / static_cast<float>(1L << 59),
        1.0f / static_cast<float>(1L << 60), 1.0f / static_cast<float>(1L << 61), 1.0f / static_cast<float>(1L << 62),
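        // 1L << 63 shifts into the sign bit and 1L << 64 shifts by the full width of a
        // 64-bit long (undefined behavior), so the last two entries use hex-float literals (C++17).
        0x1p-63f, 0x1p-64f,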
};

std::pair<float, int> calc_harmonic_mean4(int8_t* data, size_t n) {
    float harmonic_mean = 0;
    int num_zeros = 0;

    for (int i = 0; i < n; ++i) {
        harmonic_mean += tables[data[i]];
        if (data[i] == 0) {
            ++num_zeros;
        }
    }
    harmonic_mean = 1.0f / harmonic_mean;
    return std::make_pair(harmonic_mean, num_zeros);
}
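As an alternative to writing out all 65 entries by hand, the table could be generated at compile time. This is a minimal sketch under that assumption, not the code merged in this PR; `InversePow2Table`, `make_inverse_pow2_table`, `kInversePow2`, and `harmonic_mean_with_table` are hypothetical names. Repeated halving is exact in binary floating point and sidesteps any shift by 64 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Plain aggregate so the table can be filled inside a constexpr function.
template <std::size_t N>
struct InversePow2Table {
    float values[N];
};

// Builds {2^0, 2^-1, ..., 2^-(N-1)} at compile time by repeated halving.
template <std::size_t N>
constexpr InversePow2Table<N> make_inverse_pow2_table() {
    InversePow2Table<N> table{};
    float value = 1.0f;
    for (std::size_t i = 0; i < N; ++i) {
        table.values[i] = value;
        value *= 0.5f;
    }
    return table;
}

constexpr auto kInversePow2 = make_inverse_pow2_table<65>();

// Same contract as calc_harmonic_mean4: returns (1 / sum of 2^-x, zero count).
inline std::pair<float, int> harmonic_mean_with_table(const std::int8_t* data, std::size_t n) {
    float sum = 0.0f;
    int num_zeros = 0;
    for (std::size_t i = 0; i < n; ++i) {
        assert(data[i] >= 0 && data[i] <= 64);  // index must stay inside the table
        sum += kInversePow2.values[data[i]];
        num_zeros += (data[i] == 0);
    }
    return {1.0f / sum, num_zeros};
}
```

For example, the registers {0, 1, 2, 0} sum to 1 + 0.5 + 0.25 + 1 = 2.75, so the function returns 1 / 2.75 together with a zero count of 2.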

Micro-benchmarks were run on these functions. They show that choice 4 is the fastest: it outperforms the original implementation by 7.74x.

[Image: micro-benchmark results]

NOTICE: in BM_calc_harmonic_mean{n}_{m}, n identifies the choice and m means the function was applied to a std::vector<int8_t> of length 2^m.
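A comparison of this kind can be roughly reproduced without a benchmark framework. The following is a minimal sketch built on std::chrono, not the harness used for the numbers above; `HarmonicFn`, `harmonic_powf`, `make_registers`, and `time_ns_per_call` are hypothetical names, and the input pattern (values cycling through 0..40) is an assumption.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using HarmonicFn = std::pair<float, int> (*)(const std::int8_t*, std::size_t);

// Reference implementation mirroring choice 1 (powf per register).
inline std::pair<float, int> harmonic_powf(const std::int8_t* data, std::size_t n) {
    float sum = 0.0f;
    int num_zeros = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += std::pow(2.0f, -static_cast<float>(data[i]));
        num_zeros += (data[i] == 0);
    }
    return {1.0f / sum, num_zeros};
}

// Deterministic input of length 2^m with values cycling through 0..40.
inline std::vector<std::int8_t> make_registers(int m) {
    std::vector<std::int8_t> regs(std::size_t{1} << m);
    for (std::size_t i = 0; i < regs.size(); ++i) {
        regs[i] = static_cast<std::int8_t>(i % 41);
    }
    return regs;
}

// Average wall-clock nanoseconds per call over `iterations` runs.
inline double time_ns_per_call(HarmonicFn fn, const std::vector<std::int8_t>& regs, int iterations) {
    assert(iterations > 0);
    volatile float sink = 0.0f;  // keeps the compiler from discarding the calls
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        sink = sink + fn(regs.data(), regs.size()).first;
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
}
```

Passing each candidate function to `time_ns_per_call` with the same `make_registers(m)` input gives per-call timings that can be compared across choices; a dedicated benchmark library will generally give steadier numbers than this loop.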

@satanson satanson force-pushed the optimize_harmoic_mean_evaluation branch from 2fd5a59 to c8dde24 Compare January 9, 2023 01:27
@satanson satanson force-pushed the optimize_harmoic_mean_evaluation branch from c8dde24 to ad13ebc Compare January 9, 2023 02:22
@silverbullet233
Contributor

LGTM

@github-actions

github-actions bot commented Jan 9, 2023

clang-tidy review says "All clean, LGTM! 👍"

@sonarcloud

sonarcloud bot commented Jan 9, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: no coverage information
Duplication: 0.0%

@wanpengfei-git wanpengfei-git added the Approved Ready to merge label Jan 9, 2023
@wanpengfei-git
Collaborator

run starrocks_admit_test

@satanson satanson enabled auto-merge (squash) January 9, 2023 15:58
@mofeiatwork
Contributor

run starrocks_admit_test

@satanson satanson merged commit 16f3429 into StarRocks:main Jan 9, 2023
@github-actions github-actions bot removed Approved Ready to merge be-build labels Jan 9, 2023