Approximate tmerc (Snyder): speed optimizations #2039

Merged
merged 1 commit into from Mar 29, 2020

@rouault rouault commented Mar 9, 2020

fwd: 7% faster on Core-i7@2.6GHz (with FMA triggered), 22% faster on GCE Xeon@2GHz (with FMA)
inv: 31% faster on Core-i7@2.6GHz (with FMA triggered), 60% faster on GCE Xeon@2GHz (with FMA)

The optimizations consist of several things:

  • optionally use the FMA (fused multiply-add) instruction set with gcc >= 6.
    Binaries are built with the standard instruction set (SSE/SSE2)
    plus one variant using FMA, and the appropriate version is selected automatically
    at runtime. This gives a modest but measurable speedup, which is more
    noticeable on lower-clocked CPUs.
  • inline mlfn and inv_mlfn
  • for inv_mlfn, avoid recomputing sin()/cos() at each iteration
    by observing that the argument changes only slightly between iterations,
    and using an approximation of sin()/cos() instead. The differences due to that
    approximation are well below the 1e-11 tolerance threshold.
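
To illustrate the sin()/cos() point, here is a minimal Python sketch of one way such an update can work. It is not the code from this PR: the function names, the sign convention (the angle is decreased by a correction t, as in a Newton step) and the sample step sizes are assumptions made for the example. It applies the exact angle-difference identities but substitutes short Taylor polynomials for sin(t) and cos(t), which is only reasonable when |t| is small.

```python
import math


def rotate_sin_cos(s, c, t):
    """Return approximations of (sin(phi - t), cos(phi - t)).

    s and c are sin(phi) and cos(phi) from the previous iteration.
    sin(t) and cos(t) are replaced by low-order Taylor polynomials,
    so this is only accurate for small |t| (assumed here: |t| <= 1e-3).
    """
    t2 = t * t
    cos_t = 1.0 - 0.5 * t2        # cos(t) ~ 1 - t^2/2, error about t^4/24
    sin_t = t * (1.0 - t2 / 6.0)  # sin(t) ~ t - t^3/6, error about t^5/120
    return s * cos_t - c * sin_t, c * cos_t + s * sin_t


def max_error(phi, t):
    """Largest absolute deviation from math.sin/math.cos for one update."""
    s, c = rotate_sin_cos(math.sin(phi), math.cos(phi), t)
    return max(abs(s - math.sin(phi - t)), abs(c - math.cos(phi - t)))
```

With a step of 1e-3 rad the dominant error term is about t^4/24, roughly 4e-14, which is consistent with the claim that the approximation stays well under a 1e-11 tolerance.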

Differences in results are negligible (only found in areas where the approximations
of the Snyder formulas are already no longer valid).

Before:
$ echo 8e5 9e6 | src/proj -d 12 +proj=utm +zone=31 -I +approx | src/proj -d 12 +proj=utm +zone=31 +approx
799997.896522093331 8999999.520601103082
$ echo 8e5 5e6 | src/proj -d 12 +proj=utm +zone=31 -I +approx | src/proj -d 12 +proj=utm +zone=31 +approx
800000.000007762224 4999999.999971268699
$ echo 18e5 9e6 | src/proj -d 12 +proj=utm +zone=31 -I +approx | src/proj -d 12 +proj=utm +zone=31 +approx
1079182.990696100984 8661150.574729491025
$ echo 18e5 5e6 | src/proj -d 12 +proj=utm +zone=31 -I +approx | src/proj -d 12 +proj=utm +zone=31 +approx
1799997.510861013783 4999999.567328464240

After:
$ echo 8e5 9e6 | src/proj -d 12 +proj=utm +zone=31 -I +approx | src/proj -d 12 +proj=utm +zone=31 +approx
799997.896522093331 8999999.520601103082
$ echo 8e5 5e6 | src/proj -d 12 +proj=utm +zone=31 -I +approx | src/proj -d 12 +proj=utm +zone=31 +approx
800000.000007762224 4999999.999971268699
$ echo 18e5 9e6 | src/proj -d 12 +proj=utm +zone=31 -I +approx | src/proj -d 12 +proj=utm +zone=31 +approx
1079182.990696124267 8661150.574729502201
$ echo 18e5 5e6 | src/proj -d 12 +proj=utm +zone=31 -I +approx | src/proj -d 12 +proj=utm +zone=31 +approx
1799997.510861013783 4999999.567328464240
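
To quantify the one round trip whose output changed (input 18e5 9e6), the short Python sketch below subtracts the Before and After values copied from the transcripts above. The variable and function names are only for this illustration. Decimal is used so that the 12 printed decimals are compared exactly rather than through binary floating point.

```python
from decimal import Decimal

# Easting/northing printed for "echo 18e5 9e6", copied from the transcripts.
before = (Decimal("1079182.990696100984"), Decimal("8661150.574729491025"))
after = (Decimal("1079182.990696124267"), Decimal("8661150.574729502201"))


def deltas(a, b):
    """Absolute per-coordinate difference between two (easting, northing) pairs."""
    return tuple(abs(y - x) for x, y in zip(a, b))


d_easting, d_northing = deltas(before, after)
```

UTM coordinates are in metres, so both differences are on the order of 1e-8 m, while the round trip itself already misses its 18e5 9e6 input by hundreds of kilometres at that point.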

rouault added a commit to rouault/PROJ that referenced this pull request Mar 10, 2020
This PR is on top of OSGeo#2039 for convenience, but is conceptually independent
from it.

When gdalwarp is used to reproject an image to a tmerc projection, the inverse
tmerc projection is actually used, and it is called on batches of coordinates
belonging to the same line of the output image, that is, lines of constant northing.
Looking at the formulas, many intermediate computations, including the hyperbolic
functions, depend only on the northing value. So cache them to potentially speed
up subsequent transformations.
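
The caching idea can be sketched in Python as follows. This is a toy model under stated assumptions, not the PROJ implementation: northing_terms is a hypothetical stand-in for the costly northing-only intermediates (it is not the tmerc math), and RowCache, toy_inverse and process_image are names invented for the example. The cache keeps only the most recent northing, which is enough when points arrive row by row.

```python
import math


def northing_terms(y):
    """Stand-in for costly intermediate values that depend only on y."""
    return math.sin(y), math.cos(y)


class RowCache:
    """Remember the northing-only terms of the most recent call."""

    def __init__(self):
        self.last_y = None
        self.terms = None
        self.misses = 0  # how many times the costly part was recomputed

    def get(self, y):
        if self.last_y is None or y != self.last_y:
            self.terms = northing_terms(y)
            self.last_y = y
            self.misses += 1
        return self.terms


def toy_inverse(cache, x, y):
    """Toy per-point function: reuse cached y-terms, then mix in x cheaply."""
    sin_y, cos_y = cache.get(y)
    return x * cos_y, x * sin_y


def process_image(width, height):
    """Visit points row by row and report how often the cache missed."""
    cache = RowCache()
    for row in range(height):
        y = 1e-3 * row
        for col in range(width):
            toy_inverse(cache, 1e-3 * col, y)
    return cache.misses
```

For a row-by-row scan the costly part runs once per row instead of once per point. For points in random order it runs every time, plus one extra comparison per call.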

I couldn't measure a significant extra cost for doing that caching in
situations where it is not needed ('random' points). It is likely below 1%
(which is more than compensated by the improvements of OSGeo#2039).

By modifying my bench program to simulate processing a 2000x2000 image
10 times (so 2000 coordinates of the same northing being transformed
consecutively), I get a 40% speed-up over the results of PR OSGeo#2039.

Some timings with gdalwarp on a 4000x4000 Byte single-channel image in EPSG:32631:

$ time OSR_USE_APPROX_TMERC=YES gdalwarp test.tif out.tif -t_srs EPSG:32631 -et 0 -overwrite

With this PR :   3.15s (30% improvement over PR OSGeo#2039, 41% improvement over master)
With PR OSGeo#2039:   4.54s (15% improvement over master)
Before PR OSGeo#2039: 5.39s
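
For reference, the percentages quoted above follow from the three wall-clock times as (old - new) / old. The Python sketch below recomputes them; the function and constant names are made up for the illustration, and the quoted figures appear to be truncated rather than rounded.

```python
def improvement_pct(old_s, new_s):
    """Relative reduction in run time, in percent of the old time."""
    return 100.0 * (old_s - new_s) / old_s


# Timings in seconds, copied from the gdalwarp runs above.
MASTER, PR_2039, THIS_PR = 5.39, 4.54, 3.15

vs_pr_2039 = improvement_pct(PR_2039, THIS_PR)    # about 30.6
vs_master = improvement_pct(MASTER, THIS_PR)      # about 41.6
pr_2039_vs_master = improvement_pct(MASTER, PR_2039)  # about 15.8
```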
@rouault rouault merged commit b84c9d0 into OSGeo:master Mar 29, 2020
@rouault rouault added this to the 7.1.0 milestone Mar 29, 2020