Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Geometry_Engine: GeometryHash as a single value #3251

Closed
pawelbaran opened this issue Jan 17, 2024 · 2 comments · Fixed by #3257
Closed

Geometry_Engine: GeometryHash as a single value #3251

pawelbaran opened this issue Jan 17, 2024 · 2 comments · Fixed by #3257
Assignees
Labels
type:feature New capability or enhancement type:question Ask for further details or start conversation

Comments

@pawelbaran
Copy link
Member

pawelbaran commented Jan 17, 2024

Important: highly recommended to read the text with this script open to visualise the thought process 😉

During and following the recent chats with @al-fisher and @alelom, we were investigating the possibility of using GeometryHash to efficiently hash geometry. The method itself works great, but it does return an array of doubles, which is not very practical if meant to be used as a db or dictionary key. Therefore, we started thinking of aggregating the output array into a single value.

First guess of @alelom was to simply sum the values in the array, but that yields wrong results for cases where the sum of coordinates of points defining the hashed geometry is the same. That is caused by each geometry being hashed by degrading it to its defining points, which then get converted into a double array:

private static double[] ToDoubleArray(this Point p, double typeTranslationFactor)
{
return new double[]
{
p.X + typeTranslationFactor,
p.Y + typeTranslationFactor,
p.Z + typeTranslationFactor
};
}

..where translationFactor is (usually) a constant depending on geometry type. So Point (1,2,3) will yield the same aggregate hash as (3,2,1), and Line { (1,2,3), (3,2,1) } will yield the same aggregate has as { (0,0,6), (6,0,0) } etc.

First solution for the above that I tried was to generate the aggregate by multiplying each subsequent 3 values from the output array by a constant set of 3 different prime numbers. This, however, works only against points with swapped/different coordinates (e.g. (1,2,3) is considered different than (3,2,1)), but fails against higher order geometries with shifted defining points (e.g. Polyline (A,B,C) has same geometry hash aggregate as (B,A,C)).

Second solution that I tried and that seems to be reliable enough is to generate the aggregate by multiplying each subsequent 3 values from the output array by a changing set of 3 different prime numbers, i.e. we have a pool of n prime numbers, from which per each subsequent trio of doubles we pick another set of 3, e.g. for 1st trio we apply prime numbers at indexes 0,1,2 for the next trio 1,2,3 etc. Please see the attached script to visualise.

Here come the questions 😃

  • do you see any immediate gap in my thought process that would prove the second solution incorrect?
  • if not, shall we add a new method in Geometry_Engine, e.g. AggregateGeometryHash or shall I keep it in my test bed for now?

I am looking forward to actioning this!

@alelom
Copy link
Member

alelom commented Jan 17, 2024

Thanks @pawelbaran !

First off, I definitely agree, we need an aggregate geometry hash. I think the current methods are useful but feel more like the "back-end" of what the GeometryHash should be. Further, I think that the general definition of hash is of a single string or number representing the signature of an object, not of an array of elements, because that's of difficult use.
For this reason, I raised #3253 , suggesting we rename the current methods to ToHashArray(), or ToHashDoubleArray(), or similar. I think we should implement an "aggregate" function as you suggest and have that as our GeometryHash() method(s).

Secondly, I also agree with the implementation you've made, it makes sense and it's robust enough.
I think we can simplify it a bit by multiplying for 1,2,3, because I think that is the simplest implementation that keeps the same robustness, while minimizing the rate of increase of the operands. I would like to avoid working with extremely large numbers if that can be avoided, because it's wasteful (leads to overflows or wraps from zero), and I don't think that prime numbers add robustness. I tried your script replacing the list of prime numbers with 1,2,3 and the result is the same.
Additionally, we also need to add an unchecked statement around these methods, as it is good practice when writing hash methods.

@pawelbaran
Copy link
Member Author

Thanks @alelom, 100% agreed, all makes sense 👍

One word of caution though: using only 3 multipliers in alternating order leads to duplicate hashes for geometries that are represented by 3 points (e.g. arcs). I ran into exactly such edge case while building the script (where I started from 3 prime numbers instead of 10). So I would consider starting from at least 5 different multipliers as circles are represented by 4 points - to avoid duplicates in case of the most primitive geometries. Easiest way of testing against this edge case is creating a polyline with just a few control points in various order.

Finally, one more thought that I had in mind: I would be a bit afraid of using double as return type as it by nature bears flaws related to the way in which it is stored in memory. Integers or strings seem to have a more 'stable' nature - maybe I am overreacting, but rather safe than sorry 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:feature New capability or enhancement type:question Ask for further details or start conversation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants