Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Querying a vector of NaN occasionally results in invalid indices #125

Open
nlw0 opened this issue Aug 24, 2021 · 5 comments
Open

Querying a vector of NaN occasionally results in invalid indices #125

nlw0 opened this issue Aug 24, 2021 · 5 comments
Labels
Milestone

Comments

@nlw0
Copy link

nlw0 commented Aug 24, 2021

I was able to reproduce the issue with the following snippet:

tree = KDTree(randn(35,408))
nn(tree, rand(35))
nn(tree, repeat([NaN], 35))

In some situations the function returns (1, Inf), which is just fine. Sometimes, though the returned index will be an invalid number, eg:

julia> tree = KDTree(randn(35,40))
KDTree{SArray{Tuple{35},Float64,1,35},Euclidean,Float64}
  Number of points: 40
  Dimensions: 35
  Metric: Euclidean(0.0)
  Reordered: true

julia> nn(tree, rand(35))
(30, 5.039524506008606)

julia> nn(tree, repeat([NaN], 35))
(139777101468400, Inf)

Ideally the returned indices should always be valid, in my opinion.

@axsk
Copy link

axsk commented Sep 21, 2021

I also experience the problem of non-existent large returned indices. Will have to investigate further.
Maybe #78 might be related?

@KristofferC
Copy link
Owner

Clearly, something is going bananas somewhere. I think this just has to be honestly debugged with print statements and whatnot to find out where things go bad.

@nlw0
Copy link
Author

nlw0 commented Sep 26, 2021

I could finally take a look at the code. Apologies I don't have a concrete answer or a working patch yet, but here's theory to what may be happening:

LOC 1

@inbounds idx[j] = tree.indices[idx[j]]

At this point in the code we have had initialized the indices array with -1, and the distances with Inf. The code is probably assuming that by now the indices have all become valid. With a distance of NaN, though, we still have a -1 indices there, and then we can assign tree.indices[-1] to idx[j], what is allowed by the @inbounds . This is how we get crazy values in the output.

LOC 2

if new_min < best_dists[1]

I believe the ultimate reason this ends up happening is because any tests with a distance of NaN will return false, including NaN < Inf. Notice it's the same for Inf in this line. Here is potentially where this may be happening.

I think the solution for that may actually involve some decisions about how the whole thing can behave. If the metric function was returning Inf when we get NaN, this might help, but I'm not sure this would guarantee the proper initialization of the indices. We might actually have to initialize them with eg 1 as well, and then in the end we would get (1, Inf) for such NaN points. Or we could even do something nifty and use "nothing" as the index, since this should be the neat way to implement an optional class for integers, that in the end is kind of what NaN is. In any case, we probably want to do something to prevent invalid array accesses based on these NaN values (and Infs?), which unfortunately are all valid Float64. Actually testing the inputs to detect NaNs would be the other path...

@axsk
Copy link

axsk commented Sep 28, 2021

Handling of nothing/null/Nan seems to be an never ending story. Initialization with 1 has the problem that in the worst case when not checking the distance, one might just reference into 1, obtaining a wrong result. I prefer the initialization with nothing, but this can lead to exceptions where if one doesn't expect nothing, though that probably is what should happen (and its more informative then some random Int). However I cannot tell how this comes down to performance...

A dirty compromise (still and Int, but throwing out of bound exceptions) might be returning 0.

Is there a Julian way to deal with NaN/nothings?

@nlw0
Copy link
Author

nlw0 commented Sep 30, 2021

I think initializing with 1 should be just fine, any match might be considered "good" for a NaN or Inf distances, as long as we have that distance value along with the result to judge what happened. It's a good thing if we guarantee always valid indices. The other approach is a neat handling of NaN as an optional class, either returning nothing or a guaranteed invalid index such as 0, which I'm not really a big fan of. In my opinion, returning 1 to an Inf distance, and mapping NaN distances to Inf should be fine for this code.

I don't believe there really is a Julian way to do this, because part of it is about application domain decisions. Although using nothing is indeed more or less the Julian way to implement an optional class, similar to modern C++ std::option, or the Scala option etc, or the hacky way we use "None" "null" and pointer to zero in JavaScrip, Python C etc... So returning a nothing index would probably be a neat way to do it, but I personally think ensuring valid indices would be better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants