-
Notifications
You must be signed in to change notification settings - Fork 771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add explicit NaN handling to proj_trans and gie #3603
Conversation
src/4D_api.cpp
Outdated
@@ -189,6 +189,11 @@ double proj_roundtrip(PJ *P, PJ_DIRECTION direction, int n, PJ_COORD *coord) { | |||
/* finally, we take the last half-step */ | |||
t = proj_trans(P, opposite_direction(direction), t); | |||
|
|||
/* if we start with NaN, we expect NaN as output */ | |||
if (isnan(org.v[0]) && isnan(t.v[0])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I check all 4 coordinates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I check all 4 coordinates?
I think that would make sense. Also, should it be changed to or
instead of and
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the or/and: I don't think so. This is checking that the original input is NaN and the final output is NaN. If the final output was NaN, but we didn't give it something that should return NaN then that should be a failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could do: (isnan(org.v[0]) || isnan(org.v[1]) || ...) && (isnan(t.v[0]) || isnan(t.v[1]) || ...)
? Basically if any of the inputs are NaN and any of the outputs are NaN...no maybe it should be that any NaNs in the input produce all NaNs in the output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If everyone agrees on other parts of this PR, I think this is the only other part I might want to change. That is, checking all axes for NaN. This also adds to execution time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
proj_roundtrip() makes only sense as part of a test utility like gie, not "production" use cases. Execution time is not a concern
src/apps/gie.cpp
Outdated
@@ -753,7 +753,7 @@ static PJ_COORD parse_coord(const char *args) { | |||
// test failures when testing points at edge of grids. | |||
// For example 1501000.0 becomes 1501000.000000000233 | |||
double d = proj_strtod(prev, &endp); | |||
if (*endp != '\0' && !isspace(*endp)) { | |||
if (!isnan(d) && *endp != '\0' && !isspace(*endp)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was the simplest solution here to just say "skip this if statement if we have a NaN" but I'm not sure it is the right decision.
src/apps/proj_strtod.cpp
Outdated
/* NaN */ | ||
if (strncmp(p, "NaN", 3) == 0) { | ||
if (endptr) | ||
*endptr = const_cast<char *>(p + 3); | ||
return NAN; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how strict/flexible this code needs to be. It specifically checks for only NaN (not "nan", not "NAN", etc). It also doesn't do anything if the character after NaN is a whitespace. For a valid NaN it should be, but right now someone could technically enter the next coordinate number right after a NaN (ex. NaN0.0m
would be two values I think).
Also, I didn't notice until after I started making this PR that |
Co-authored-by: Even Rouault <even.rouault@spatialys.com>
Co-authored-by: Even Rouault <even.rouault@spatialys.com>
Co-authored-by: Even Rouault <even.rouault@spatialys.com>
Co-authored-by: Even Rouault <even.rouault@spatialys.com>
src/apps/proj_strtod.cpp
Outdated
@@ -123,6 +125,13 @@ double proj_strtod(const char *str, char **endptr) { | |||
return 0; | |||
} | |||
|
|||
/* NaN */ | |||
if (strncmp(p, "NaN", 3) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this comparison also include nan
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically, if we're going for feature parity with standard strtod, then it should handle nan
, NAN
, NaN
, and maybe every combination of upper and lowercase. I mentioned in a higher up comment:
I'm not sure how strict/flexible this code needs to be. It specifically checks for only NaN (not "nan", not "NAN", etc). It also doesn't do anything if the character after NaN is a whitespace. For a valid NaN it should be, but right now someone could technically enter the next coordinate number right after a NaN (ex. NaN0.0m would be two values I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you may use the ci_equal() function for case insensitive comparison, provided you add at top of file
#include "proj/internal/internal.hpp"
using namespace NS_PROJ::internal;
Note that GDAL's strtod() function also catches other variations of NaN that used to be/are output by Microsoft C runtime library. Cf https://github.com/OSGeo/gdal/blob/master/port/cpl_strtod.cpp#L227 and below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, actually ci_equal() isn't appropriate. That should be ci_starts_with()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How may those comparisons affect performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#include "proj/internal/internal.hpp"
using namespace NS_PROJ::internal;
results in:
In file included from /home/davidh/repos/git/PROJ/src/apps/proj_strtod.cpp:89:
/home/davidh/repos/git/PROJ/include/proj/internal/internal.hpp:30:2: error: #error This file should only be included from a PROJ cpp file
30 | #error This file should only be included from a PROJ cpp file
| ^~~~~
gmake[2]: *** [src/apps/CMakeFiles/cct.dir/build.make:90: src/apps/CMakeFiles/cct.dir/proj_strtod.cpp.o] Error 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add #define FROM_PROJ_CPP before including proj/internal/internal.hpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#include "proj_strtod.h"
#define FROM_PROJ_CPP
#include "proj/internal/internal.hpp"
using namespace NS_PROJ::internal;
#include <ctype.h>
#include <errno.h>
It's been a long time since I've done real C, and I really don't know C++ and especially not namespaces...
if (ci_starts_with(p, "NaN")) {
/usr/bin/ld: CMakeFiles/cct.dir/proj_strtod.cpp.o: in function `proj_strtod(char const*, char**) [clone .part.0]':
proj_strtod.cpp:(.text+0x6b): undefined reference to `osgeo::proj::internal::ci_starts_with(char const*, char const*)'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
proj_strtod.cpp:(.text+0x6b): undefined reference to `osgeo::proj::internal::ci_starts_with(char const*, char const*)'
ah, I missed proj_strtod.cpp was an application file, and not a file in the library. ci_starts_with() needs to be exported then.
You need to apply the following patch:
diff --git a/include/proj/internal/internal.hpp b/include/proj/internal/internal.hpp
index dda11cbc..d7aa0d49 100644
--- a/include/proj/internal/internal.hpp
+++ b/include/proj/internal/internal.hpp
@@ -128,7 +128,7 @@ inline bool starts_with(const std::string &str, const char *prefix) noexcept {
bool ci_less(const std::string &a, const std::string &b) noexcept;
-bool ci_starts_with(const char *str, const char *prefix) noexcept;
+PROJ_DLL bool ci_starts_with(const char *str, const char *prefix) noexcept;
bool ci_starts_with(const std::string &str, const std::string &prefix) noexcept;
diff --git a/scripts/reference_exported_symbols.txt b/scripts/reference_exported_symbols.txt
index 6ff75bef..fc512037 100644
--- a/scripts/reference_exported_symbols.txt
+++ b/scripts/reference_exported_symbols.txt
@@ -335,6 +335,7 @@ osgeo::proj::HorizontalShiftGridSet::reopen(pj_ctx*)
osgeo::proj::internal::ci_equal(std::string const&, char const*)
osgeo::proj::internal::ci_equal(std::string const&, std::string const&)
osgeo::proj::internal::ci_find(std::string const&, char const*)
+osgeo::proj::internal::ci_starts_with(char const*, char const*)
osgeo::proj::internal::c_locale_stod(std::string const&)
osgeo::proj::internal::replaceAll(std::string const&, std::string const&, std::string const&)
osgeo::proj::internal::split(std::string const&, char)
What are the exact implications of this change for me as a user of PROJ? It's clear we've had a somewhat undefined approach to NaNs and sorting it out is probably a good idea. I'm also worried this will break existing use cases that have come to rely on specific behaviour. |
@kbevers To me it is about consistency. From my experience, most projections "handle" NaNs just by the basic nature of any math performed on a NaN results in a NaN. So NaN in means NaN out. The LAEA projection (as described in #3596) is the first I've found where it returned 0 with NaN input and even that may have been a PROJ 6+ change. This had the odd effect of causing my resampling algorithm to put image pixels at the projections reference point (0m, 0m) because there were some invalid (NaN) values in the satellite instrument's coordinates. So I think anyone using NaN as a sentinel value for invalid/bad values in their data will be pleased. Anyone who was benefiting from the NaN -> 0 or NaN -> ? of certain projections probably (big assumption) didn't know they were using it, especially given that most projections do return NaNs. I could also see people who aren't used to working with NaNs ( On the developer side, this should mean any projection algorithm development shouldn't have to worry about NaN handling like doing weird inverse/negated |
@jjimenezshaw brings up a good point that I didn't address in my above comment. This is adding extra checks in places where people who aren't working with NaN will still be affected by them. I'm not sure what modern C/C++ compilers can do as far as NaN checks and string comparison optimizations. |
proj_strtod() is not used in computational code paths, just in pipeline initialization, so I'd expect any change in it to have totally unmeasurable timing effects |
Would there be any benefit to removing all NaN checks from the lower level projection code? |
Or possibly they have put in work-arounds that'll break. I'm asking to get a sense of the scale of this change. If it ends up only affecting a handful of projections I'm happy to call it a bug fix and include this in the next release. If this ends up affecting most projections it's a big change in behaviour and I don't think that's justifiable in anything other than a major version release. |
@kbevers Understood. I think this will be difficult to have a real answer to those questions. There is such inconsistent behavior here that I'm nervous to find out what workarounds people may have had. I would hope any generic application (one that handles a lot of projections) is using the detection of NaNs in the inputs to mask the results because otherwise I'm not sure how NaNs could be worked with. I did a little test where I did from pyproj import Transformer, CRS
import numpy as np
for proj in projections:
try:
t = Transformer.from_crs('EPSG:4326', CRS.from_proj4(f"+proj={proj}"))
print(f"{proj}: ", t.transform(np.nan, np.nan))
except:
print(f"Bad projection: {proj}") So basically: for each projection define a PROJ.4 projection as Proj transform results
What you should notice: many of the projections are |
@snowman2 I don't think so, but I'm also not sure how much work people have put in to handling them. Like I said, I think in a lot of cases NaN handling is just happening due to the math. When the result is not NaN, my guess is this is from if (lon > some_valid_threshold) {
// do math
} else {
// fail
x = 0
y = 0
}
// more math |
In case you run it again, I tweaked the script so you can do everything natively in python without from math import nan
from pyproj.list import get_proj_operations_map
from pyproj import Transformer
for proj in get_proj_operations_map():
try:
t = Transformer.from_crs('EPSG:4326', f"+proj={proj}")
print(f"{proj}: ", t.transform(nan, nan))
except Exception:
print(f"Bad projection: {proj}") |
@kbevers I doubt the proposed change in this PR would have large breaking potential. I would assume that anyone having relied in the past on PROJ's undefined NaN handling got the (nan, nan) result as output as @djhoese's experiment shows and thus will see no difference, and if they ran into the few cases where PROJ currently returns garbage, the user would have had to catch the NaN input before submitting coordinates to proj_trans() and do whatever they felt appropriate, so any change in PROJ isn't going to affect them. So overall I'd see this as a fix-as-usual situation.
@djhoese Could you run bin/bench_proj_trans on a few transformations before & after your change (on a -DCMAKE_BUIKD_TYPE=Release build to get realistic numbers) ? |
Ok so I have my main
So both of these are based off of a conda environment (for the dependencies) which also has PROJ installed. And I'm not really sure how to use the benchmark tool, but here's what I got:
I'll see if I can verify that this is the different code sources being used. |
ok, well, this mostly demonstrates that the effect of this PR is neglectable on performance. In theory your PR should be slightly slower than master as it adds a few instructions, but with modern CPUs and dynamic CPU frequency scaling, accurate measurement is incredible hard. Actually trying myself, and using hyperfine to get hopefully more accurate measurements, I find that this PR would be ~ 2% slower (but this might be within measurement noise and not being significant) on that particular use case: master:
your PR:
So, I'd say we are OK from a performance point of view |
Not sure how you feel about helper functions, but I made two inline functions for the "any NaNs" and "all NaNs" checks in 4D_api.cpp. |
Thanks. That gave a good overview of the current state of affairs. I agree with @rouault that this isn't in breaking changes territory. There's a chance some people will have put in place code to sanitize potentially weird PROJ output but most likely that will not be tripped up by this change. I would like to have it mentioned in the docs for |
Sounds good. Would that go in |
yes, that one |
Last release was 9.1.1, if I put this under a |
9.2.0 |
I've updated this. Could someone start the tests for me? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @djhoese 👍
This is an attempt at resolving some of the NaN issues I described in #3596. It has been a long time since I've done any real C/C++ development so feel free to complain about everything. I'll describe a little of what I did in the next paragraph(s) and also with inline comments after I create this.
First, I added a test to
more_builtins.gie
. This (as predicted by @rouault) meant adding some level of support of NaNs togie.cpp
. To fully support roundtrip tests this also meant updatingproj_roundtrip
. So nowgie
andproj_roundtrip
understand a NaN output and expected NaN to be a distance of 0.Next, to get
gie
to actually parse NaNs from the text file I needed to updateproj_strtod
. I went the simplest and maybe least flexible route. I do astrncmp
for the exact characters "NaN" which as far as the documentation tells me is also supported bystrtod
(the C++ builtin), butstrtod
also supportsNAN
andnan
I think. See the C99/C11 docs here: https://cplusplus.com/reference/cstdlib/strtod/Lastly, I updated
proj_trans
to check if any of the 4D input coordinates for NaN and if so skip the actual transformation and just return NaNs for all 4 coordinates.docs/source/*.rst
for new API