arrayCumSum function #1908

Merged
merged 6 commits into from Feb 15, 2018

Conversation

Contributor

javisantana commented Feb 15, 2018

This PR adds cumulative sum function.

When working with certain kinds of array data, it is useful to store them as deltas. Delta-encoded arrays compress better in CH than the original data (in some cases by several orders of magnitude, for near-constant arrays), which saves disk space (and likely execution time, even with the cumsum overhead).

arrayCumSum allows decoding those delta encoded arrays so you can work with them as regular arrays.
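
The round trip described above can be sketched outside ClickHouse in a few lines of standard C++. This is only an illustration of the idea, not the code in this PR: the names `deltaEncode`, `cumSum` and `checkRoundTrip` are hypothetical, and 64-bit signed integers are an assumed element type.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Delta-encode: keep the first value, then store successive differences.
std::vector<std::int64_t> deltaEncode(const std::vector<std::int64_t> & values)
{
    std::vector<std::int64_t> deltas(values.size());
    std::adjacent_difference(values.begin(), values.end(), deltas.begin());
    return deltas;
}

// Decode with a cumulative (prefix) sum, the operation arrayCumSum
// performs on a single array.
std::vector<std::int64_t> cumSum(const std::vector<std::int64_t> & deltas)
{
    std::vector<std::int64_t> values(deltas.size());
    std::partial_sum(deltas.begin(), deltas.end(), values.begin());
    return values;
}

// Small self-check: decoding the deltas restores the original values.
void checkRoundTrip()
{
    const std::vector<std::int64_t> original{100, 102, 103, 104};
    assert(deltaEncode(original) == (std::vector<std::int64_t>{100, 2, 1, 1}));
    assert(cumSum(deltaEncode(original)) == original);
}
```

A near-constant array such as `[5, 5, 5, 5]` becomes `[5, 0, 0, 0]` after encoding, which is the kind of input that tends to compress well.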

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

@alexey-milovidov alexey-milovidov merged commit 2c4052f into ClickHouse:master Feb 15, 2018
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed

Member

alexey-milovidov commented Feb 15, 2018

Great!

Member

alexey-milovidov commented Feb 15, 2018

I've tried this variant of inner loop:

        size_t pos = 0;
        for (size_t i = 0; i < offsets.size(); ++i)
        {
            Result sum{};

            for (; pos < offsets[i]; ++pos)
            {
                sum += data[pos];
                res_values[pos] = sum;
            }
        }

It is more obvious. But there is no difference in performance (or it is slightly worse). So, I will leave your variant.
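
For readers who want to try that loop outside the server, here is a standalone sketch. It assumes the layout the snippet implies: all arrays are stored back to back in one flat `data` vector, and `offsets[i]` is the exclusive end position of array `i`. The function name `cumSumByOffsets` and the fixed 64-bit element type are assumptions for illustration, not the actual ClickHouse column types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Cumulative sum over many arrays stored back to back in `data`.
// offsets[i] is the exclusive end position of array i (assumed layout).
std::vector<std::int64_t> cumSumByOffsets(
    const std::vector<std::int64_t> & data,
    const std::vector<std::size_t> & offsets)
{
    // The last offset must cover the whole flat buffer.
    assert(offsets.empty() ? data.empty() : offsets.back() == data.size());

    std::vector<std::int64_t> res_values(data.size());
    std::size_t pos = 0;
    for (std::size_t i = 0; i < offsets.size(); ++i)
    {
        std::int64_t sum = 0; // the running sum restarts for each array

        for (; pos < offsets[i]; ++pos)
        {
            sum += data[pos];
            res_values[pos] = sum;
        }
    }
    return res_values;
}
```

With `data = {1, 2, 3, 10, 20}` and `offsets = {3, 5}` (two arrays, `[1, 2, 3]` and `[10, 20]`), the result is `{1, 3, 6, 10, 30}`. An empty array simply contributes a repeated offset and no output values.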

Member

alexey-milovidov commented Feb 15, 2018

Looking forward to an arrayCumReduce function :)

Contributor Author

javisantana commented Feb 16, 2018

What do you mean by arrayCumReduce? Something like cumsum but more generic?

:) select arrayCumReduce((x, y) -> y - x, [1, 2, 3, 4])
[1, 1, 1, 1]

I was thinking about the reverse method:

:) select arrayDiff([100, 102, 103, 104])
[100, 2, 1, 1]

Another thing I think would be useful is fillna (to fill NaN values), but I don't know whether spending time on such functions is worthwhile, since they could be built from a mix of general functions and ARRAY JOIN.

Thanks for testing that variant; it is indeed simpler to read.

Member

alexey-milovidov commented Feb 17, 2018

Examples (alongside the arrayReduce function, which already exists in ClickHouse):

arrayReduce('avg', [1, 2, 3]) = 2
arrayCumReduce('avg', [1, 2, 3]) = [1, 1.5, 2]

arrayReduce('uniq', [10, 11, 11, 12]) = 3
arrayCumReduce('uniq', [10, 11, 11, 12]) = [1, 2, 2, 3]

arrayReduce('corr', [1, 2, 3], [1, 0, 1]) = 0
arrayCumReduce('corr', [1, 2, 3], [1, 0, 1]) = [nan, -1, 0]

Implementation looks very straightforward. Please give it a try!
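
To make the proposed semantics concrete, here is a rough sketch of the 'avg' and 'uniq' examples above in plain C++: keep one aggregation state, feed it one element at a time, and emit the current result after each element. The names `cumAvg`, `cumUniq` and `checkCumReduceExamples` are hypothetical, and `cumUniq` uses an exact `std::set` as a simplification; none of this is the ClickHouse aggregate function interface.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Running average: the state is the sum so far; emit sum / count each step.
std::vector<double> cumAvg(const std::vector<double> & values)
{
    std::vector<double> result;
    result.reserve(values.size());
    double sum = 0;
    for (std::size_t i = 0; i < values.size(); ++i)
    {
        sum += values[i];
        result.push_back(sum / static_cast<double>(i + 1));
    }
    return result;
}

// Running distinct count: the state is the set of values seen so far.
std::vector<std::size_t> cumUniq(const std::vector<int> & values)
{
    std::vector<std::size_t> result;
    result.reserve(values.size());
    std::set<int> seen;
    for (int value : values)
    {
        seen.insert(value);
        result.push_back(seen.size());
    }
    return result;
}

// Reproduces the two examples from the comment above.
void checkCumReduceExamples()
{
    assert(cumAvg({1, 2, 3}) == (std::vector<double>{1, 1.5, 2}));
    assert(cumUniq({10, 11, 11, 12}) == (std::vector<std::size_t>{1, 2, 2, 3}));
}
```

The last element of each result equals what the non-cumulative reduce would return for the whole array, which is a handy sanity check for any implementation.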

Contributor Author

javisantana commented Feb 18, 2018

Ok, I get it, makes sense.

I was thinking about (sorry for the off-topic) working on some functions to make geospatial filtering and analysis easier. I have worked with CH in the past and it worked really well (I wrote a blog post about it which, BTW, is on your front page, thanks!).

You recently added support for pointInPolygon, which is one of the pillars of geospatial analysis. I think it would be interesting to add:

  • quadkey/Hilbert curve support to allow indexing. It is fairly easy to implement and works well with the CH indexing mechanism.
  • kNN search. Nearby things are generally correlated, so this is useful for fast analysis.
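
As a rough idea of what the quadkey part could involve, here is a sketch that turns tile coordinates into a quadkey string by interleaving the bits of x and y, one base-4 digit per zoom level. It assumes the quadkey scheme used by the Bing Maps tile system; the function name `tileToQuadKey` is hypothetical, and no such function exists in ClickHouse as of this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Build a quadkey for tile (tile_x, tile_y) at the given zoom level.
// Each digit combines one bit of x (worth 1) and one bit of y (worth 2),
// starting from the most significant bit for that level.
std::string tileToQuadKey(std::uint32_t tile_x, std::uint32_t tile_y, int level)
{
    assert(level >= 1 && level <= 31);

    std::string quadkey;
    quadkey.reserve(static_cast<std::size_t>(level));
    for (int i = level; i > 0; --i)
    {
        char digit = '0';
        const std::uint32_t mask = std::uint32_t{1} << (i - 1);
        if (tile_x & mask)
            digit += 1;
        if (tile_y & mask)
            digit += 2;
        quadkey.push_back(digit);
    }
    return quadkey;
}
```

For example, tile (3, 5) at level 3 gives "213". Tiles that share a quadkey prefix lie inside the same parent tile, which is why sorting by quadkey tends to keep nearby data close together in a sorted index.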

With that, CH would be the best geospatial analytical database out there.

@ethlo ethlo mentioned this pull request Jun 1, 2018