Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert Hive UDFs to GenericUDF #54

Closed
climbage opened this issue Jun 13, 2014 · 4 comments
Closed

Convert Hive UDFs to GenericUDF #54

climbage opened this issue Jun 13, 2014 · 4 comments

Comments

@climbage
Copy link
Member

UDFs that extend GenericUDF have more control over how the objects passed to them are handled. Here are a few of the benefits of switching to generic - I'll use ST_Contains as an example.

Simplified expressions

ST_Contains takes two arguments and checks to see if the first geometry contains the geometry argument.

ST_Contains(ST_GeomFromText('POLYGON (( ... ))'), ST_Point(longitude, latitude))

Using GenericUDF we can support different data types and simplify the previous expression to

ST_Contains('POLYGON (( ... ))', ST_Point(longitude, latitude))

Optimizing for constants

In the last expression, 'POLYGON (( ... ))' is a literal constant. We can use this property to optimize at least two different aspects of the operator.

  1. only create the OGCGeometry once per instance of the GenericUDF class
  2. if the first argument to ST_Contains is constant, we can accelerate the corresponding OGCGeometry.

Eliminate unneeded (de)serialization between UDFs

This isn't necessarily related to GenericUDF conversion, but can happen at the same time.

Example:

ST_Contains('POLYGON (...)', ST_Point(lon,lat))

In this case, ST_Point serializes the OGCPoint to bytes and then ST_Contains deserializes those bytes back to an OGCPoint. This is expected behavior when values are passed across nodes, but in the case of this example, each call to ST_Contains almost certainly happens on the same node as the corresponding call to ST_Point. This is not a huge performance hit for points, but for larger polygons it will be.

@ddkaiser
Copy link
Contributor

I don't remember the exact discussion but I thought the vote against GenericUDF initially was due to some sort of performance penalty (perhaps Java reflection being in play?) so from a purely optimal performance case of any unknown problem, the GenericUDF pattern was slower? (I don't have any notes or anything but that seemed like something we had discussed.)

Having said that, even without the simplified expressions aiding a query author by simplifying language or implementation... I would guess the "Optimizing for constants" argument you presented sounds like a perfect reason to implement. Accelerating and caching a constructed geometry would likely outperform any penalty due to reflection, especially as the record count (number of times the accelerated geometry is re-used) increases.

@climbage
Copy link
Member Author

I believe it was due to time constraints and ease of development, but you could be right. I would think UDF suffers more from reflection than GenericUDF.

Either way, limited testing shows a ~40% decrease in cumulative CPU time with ST_Contains extends GenericUDF on 14 million points.

I'll have it in a branch later today/this weekend.

@climbage
Copy link
Member Author

Using a larger, more complex polygon as the first argument to generic ST_Contains yields a ~300% decrease in cumulative CPU time. 130 seconds vs 410 seconds.

@climbage
Copy link
Member Author

More performance numbers using the source in the genericudf branch.

Running the sample query on a dataset with 14 million points takes 13 minutes before optimization and just under a minute after optimization.

SELECT counties.name, count(*) cnt FROM counties
JOIN gold.faa
WHERE ST_Contains(counties.boundaryshape, ST_Point(faa.longitude, faa.latitude))
GROUP BY counties.name
ORDER BY cnt desc;

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants