Request for feedback on implementation of ASOF join #4774
4ertus2 merged 10 commits into ClickHouse:master
|
@KochetovNicolai briefly reviewed and said that the idea is all right. I will look further in a moment. |
@alexey-milovidov Thanks very much! If you have any further feedback or see issues I might have overlooked that need to be addressed before there is any possibility of merging, please let me know before I put in more work on the current issues. I looked over the failed tests and, other than the glibc compatibility and performance checks (the failing ones do not seem to be related to joining tables?), they seem straightforward to fix from here. Furthermore, I had a quick look through the codebase for a binary-tree-like structure that supports the Arena allocator, but wasn't able to find any. Is it correct that ClickHouse doesn't use any, e.g. for lookups into indexed columns? |
e9bb5e6 to 948c0e2
…cutor
insert the time series into a struct ready for joining
working version that inserts the data into the hash table using the existing dispatching machinery for various types
working asof left join in clickhouse
add a test for the asof join
do some asof join cleanup
revisit the logic in case the values match between left and right side
948c0e2 to 84f40dd
|
I've updated the pull request to match the recent refactoring of Join.cpp, and my local tests pass again. There is still some code smell around having to pass in the timestamp, but avoiding this without much change to the rest of the codebase might require writing a specialized HashMap that includes a binary search tree as the last-level lookup. It would have to be aware of the asof criterion and would not fit nicely into the existing MapsTemplate in Join.h. |
… whole file automatically :(
|
@4ertus2 Thanks very much for your time in leaving some very useful feedback! I'll reply to the individual points inline and try to fix them up as well as possible tonight. |
|
@4ertus2 The current approach introduces quite a lot of if constexprs into the Join class. What is your opinion on replacing these conditionals with a different templated implementation of the core HashTable (similar-ish to the 2-stage hash table, but with a normal HT at the base followed by a BST for the final layer) that performs the asof lookup internally and implements the same API as HashTable? It would add quite a bit more code to the implementation, and the KeyGetterForTypeImpls would somehow need to be aware of it, since the Key type now needs to index into both the HashTable and the BST, but it would clean up the Join impl quite a bit. What are your thoughts on this? |
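To make the proposal above concrete, here is a minimal sketch of a hash table whose final level is an ordered tree. It is not ClickHouse code: `AsofIndex`, `RowNumber`, the string key, and the 32-bit timestamp are hypothetical names chosen for illustration, `std::unordered_map` and `std::map` stand in for the real HashTable and a BST, and matching on "timestamp less than or equal to the probe" is an assumption about the asof criterion.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical two-level index: a hash table on the equality key,
// with an ordered tree (std::map) on the asof column as the last level.
class AsofIndex
{
public:
    using RowNumber = std::size_t;

    void insert(const std::string & key, std::uint32_t timestamp, RowNumber row)
    {
        // A later insert with the same (key, timestamp) overwrites the earlier one.
        buckets[key][timestamp] = row;
    }

    // Find the row for `key` with the greatest timestamp <= t (assumed criterion).
    std::optional<RowNumber> find(const std::string & key, std::uint32_t t) const
    {
        auto bucket = buckets.find(key);
        if (bucket == buckets.end())
            return std::nullopt;

        const auto & tree = bucket->second;
        auto it = tree.upper_bound(t); // first entry with timestamp > t
        if (it == tree.begin())
            return std::nullopt;       // every entry is later than t
        --it;
        return it->second;
    }

private:
    std::unordered_map<std::string, std::map<std::uint32_t, RowNumber>> buckets;
};
```

With this shape the caller only sees `insert` and `find`, so the asof logic lives inside the container rather than in conditionals at the call site.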
My opinion is, it would be cool to merge current version and make a new PR to discuss improvements. I really want to test your ASOF JOIN on my data these holidays %) |
|
I think your CI might have some issues - it's reporting the error below for commit e7a10b8 (https://clickhouse-builds.s3.yandex.net/4774/e7a10b8a3e82b7abef2b62db70ae219b16007fa7/build_log_403388044_1553802981.txt), but that call isn't actually present in Join.cpp for that commit hash (nor is the reported line where the call would be, with the right number of arguments). |
Try merging master into your branch. I've recently changed some NonJoinedBlockInputStream logic. The CI probably picks up the build cache from the master branch. P.S. We have discussed it. It's expected behaviour that the branch does not build if it cannot be merged into master without errors. |
|
@4ertus2 Thanks very much, have merged master and am fixing the issues locally before re-pushing. |
We've discussed it with @alexey-milovidov. The right way is to insert data into PODArray, sort it and then use std::lower_bound and std::upper_bound to search. |
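The suggestion above (append, sort once, then binary-search) can be sketched roughly as follows. This is an illustration only, not the merged implementation: `std::vector` stands in for PODArray, `TimedValue` and `AsofSeries` are hypothetical names, and treating an equal timestamp as a match is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <optional>
#include <vector>

// Hypothetical row type: one (timestamp, value) pair from the right-hand table.
struct TimedValue
{
    std::uint32_t timestamp;
    double value;
};

// Hypothetical per-key container; std::vector stands in for PODArray.
class AsofSeries
{
public:
    void insert(std::uint32_t timestamp, double value)
    {
        rows.push_back({timestamp, value});
        sorted = false;
    }

    // Sort once after all inserts so that lookups can use binary search.
    void finalize()
    {
        std::stable_sort(rows.begin(), rows.end(),
            [](const TimedValue & a, const TimedValue & b) { return a.timestamp < b.timestamp; });
        sorted = true;
    }

    // Return the latest row with timestamp <= t, if there is one (assumed criterion).
    std::optional<TimedValue> findAsof(std::uint32_t t) const
    {
        assert(sorted);
        // upper_bound yields the first row with timestamp > t; the row before it is the match.
        auto it = std::upper_bound(rows.begin(), rows.end(), t,
            [](std::uint32_t probe, const TimedValue & row) { return probe < row.timestamp; });
        if (it == rows.begin())
            return std::nullopt;
        return *std::prev(it);
    }

private:
    std::vector<TimedValue> rows;
    bool sorted = true;
};
```

Compared with a tree, the sorted array keeps the rows contiguous and needs no per-node allocation, at the cost of a one-time sort before the first lookup.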
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
For changelog.
Category
Short description
I added an asof join strictness alongside the existing All and Any. This allows you to run queries that join each row to the most recent known value. The API idea is based on the kdb+ asof operator (https://code.kx.com/q/ref/joins/#aj-aj0-ajf-ajf0-asof-join): the last column specified is used as the asof column, which would represent the timestamp. Note that this is a proof-of-concept implementation, and I am looking for feedback on whether it could be mergeable at some point or is fundamentally flawed. Therefore, there are still some limitations on the functionality (see below).
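As a rough reference for the intended semantics, the naive sketch below joins each left row to the right row with the same key and the most recent timestamp not later than the left row's. It is an assumption-laden illustration rather than the code in this PR: `Trade`, `Quote`, `Joined`, and `asofLeftJoin` are hypothetical names, the scan is quadratic on purpose, equal timestamps are assumed to match, and an unmatched left row is shown as an empty `std::optional`.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical tables: `symbol` is the equality key, `time` is the asof column.
struct Trade { std::string symbol; std::uint32_t time; };
struct Quote { std::string symbol; std::uint32_t time; double price; };
struct Joined { Trade trade; std::optional<double> price; };

// Naive reference: for each trade, pick the latest quote with the same symbol
// and quote.time <= trade.time. Every trade is kept, as in a left join.
std::vector<Joined> asofLeftJoin(const std::vector<Trade> & trades, const std::vector<Quote> & quotes)
{
    std::vector<Joined> result;
    result.reserve(trades.size());
    for (const auto & trade : trades)
    {
        const Quote * best = nullptr;
        for (const auto & quote : quotes)
        {
            if (quote.symbol != trade.symbol || quote.time > trade.time)
                continue; // wrong key, or quote is from the future
            if (!best || quote.time >= best->time)
                best = &quote;
        }
        Joined row{trade, std::nullopt};
        if (best)
            row.price = best->price;
        result.push_back(row);
    }
    return result;
}
```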
Detailed description
The asof strictness would allow you to run queries like these:
Current limitations on the implementation:
Stuff that should be improved before merging: