-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert Hive UDFs to GenericUDF #54
Comments
I don't remember the exact discussion but I thought the vote against GenericUDF initially was due to some sort of performance penalty (perhaps Java reflection being in play?) so from a purely optimal performance case of any unknown problem, the GenericUDF pattern was slower? (I don't have any notes or anything but that seemed like something we had discussed.) Having said that, even without the simplified expressions aiding a query author by simplifying language or implementation... I would guess the "Optimizing for constants" argument you presented sounds like a perfect reason to implement. Accelerating and caching a constructed geometry would likely outperform any penalty due to reflection, especially as the record count (number of times the accelerated geometry is re-used) increases. |
I believe it was due to time constraints and ease of development, but you could be right. I would think Either way, limited testing shows a ~40% decrease in cumulative CPU time with I'll have it in a branch later today/this weekend. |
Using a larger, more complex polygon as the first argument to generic |
More performance numbers using the source in the genericudf branch. Running the sample query on a dataset with 14 million points takes 13 minutes before optimization and just under a minute after optimization. SELECT counties.name, count(*) cnt FROM counties
JOIN gold.faa
WHERE ST_Contains(counties.boundaryshape, ST_Point(faa.longitude, faa.latitude))
GROUP BY counties.name
ORDER BY cnt desc; |
UDFs that extend
GenericUDF
have more control over how the objects passed to them are handled. Here are a few of the benefits of switching to generic - I'll useST_Contains
as an example.Simplified expressions
ST_Contains
takes two arguments and checks to see if the first geometry contains the geometry argument.ST_Contains(ST_GeomFromText('POLYGON (( ... ))'), ST_Point(longitude, latitude))
Using
GenericUDF
we can support different data types and simplify the previous expression toST_Contains('POLYGON (( ... ))', ST_Point(longitude, latitude))
Optimizing for constants
In the last expression,
'POLYGON (( ... ))'
is a literal constant. We can use this property to optimize at least two different aspects of the operator.OGCGeometry
once per instance of theGenericUDF
classST_Contains
is constant, we can accelerate the correspondingOGCGeometry
.Eliminate unneeded (de)serialization between UDFs
This isn't necessarily related to GenericUDF conversion, but can happen at the same time.
Example:
ST_Contains('POLYGON (...)', ST_Point(lon,lat))
In this case,
ST_Point
serializes theOGCPoint
to bytes and thenST_Contains
deserializes those bytes back to anOGCPoint
. This is expected behavior when values are passed across nodes, but in the case of this example, each call toST_Contains
almost certainly happens on the same node as the corresponding call toST_Point
. This is not a huge performance hit for points, but for larger polygons it will be.The text was updated successfully, but these errors were encountered: