C++/CUDA port of BELLHOP
/BELLHOP3D
underwater acoustics simulator.
Copyright (C) 2021-2023 The Regents of the University of California
Marine Physical Lab at Scripps Oceanography, c/o Jules Jaffe, jjaffe@ucsd.edu
Based on BELLHOP / BELLHOP3D, which is Copyright (C) 1983-2022 Michael B. Porter
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
This is a single codebase which can be built as multithreaded C++ code for your CPU, or as CUDA code for your NVIDIA GPU. You can use the CPU version (bellhopcxx) even if you don't have an NVIDIA GPU.
bellhopcxx
is compatible with all platforms (Linux, Windows, Mac), and
bellhopcuda
is compatible with all platforms which support CUDA (Linux and
Windows).
- While the performance gains from multithreading are highly dependent on the
environment file, in most cases the performance scales roughly with the number
of logical CPU cores. On a 12-core / 24-thread test machine,
bellhopcxx
is typically about 10 to 30 times faster thanBELLHOP
. - For simulations with many (tens of thousands or more) rays and a reasonably
good desktop GPU,
bellhopcuda
is typically even faster, sometimes surpassing 100x gain overBELLHOP
. - Our team has made fixes to
BELLHOP
to improve its numerical properties and robustness (see below). These fixes were essential to being able to reproduce its results in another language (or two). These fixes are of course also incorporated intobellhopcxx
/bellhopcuda
. - As well as the traditional executable version, this repo also builds
bellhopcxx
as a library, which can be incorporated into other programs. The library version allows the simulation parameters to be modified and the simulation rerun, and it is not bottlenecked by slow file I/O for getting the results. The library version also allows running multiple independent simulations with different parameters at the same time (there are no global variables). - By default, our version supports certain combinations of parameters which are
not supported by the original version, such as 2D Gaussian ray-centered
influence and 3D PCHIP. A compile-time option can be set to artificially limit
the feature set to the same as
BELLHOP
/BELLHOP3D
for testing.
Yes. bellhopcxx
and bellhopcuda
read the same .env
and other input files
and produce output files in the same formats as BELLHOP
/ BELLHOP3D
. It is
a drop-in replacement.
The normal bellhopcxx.exe
and bellhopcuda.exe
executables select their
dimensionality with command-line flags (e.g. --2D
, --3D
, --Nx2D
). The
build system also produces executables which support only one dimensionality,
bellhopcxx2d.exe
, bellhopcxx3d.exe
, and so on. You can use these if you
don't want to change the MATLAB wrapper code to add the flags.
Once you have your desired executable, simply rename it to bellhop.exe
or
bellhop3d.exe
and replace the BELLHOP
/ BELLHOP3D
executable in your
MATLAB setup with it. Multithreading is enabled by default, so you should get a
significant speedup.
All features of all three dimensionalities (2D, 3D, Nx2D) are working and
reasonably well-tested. The numerical stability and robustness of all of
these versions is significantly improved from the original BELLHOP
/
BELLHOP3D
(see more discussion below), so in any application where the
original version was sufficient, our versions should be as well. With that said,
the numerical stability and reproducibility is better in 2D than in 3D or Nx2D,
and slightly better in 3D than Nx2D.
Download pre-compiled binaries for Windows 10 64-bit. We try to keep these up-to-date with most changes to the repo's main branch, though this is not guaranteed. However, the pre-compiled versions:
- do not include all the executable and library versions possible to build
- are only built for one GPU version (SM86, RTX 30x0)
- are Windows versions only
If you need any other versions, please git clone
and build from source.
- Make sure you got the Git submodules (the
glm
folder should not be empty). - If you want
bellhopcuda
, install the latest CUDA toolkit. Otherwise if you do not wantbellhopcuda
, set the environment variableBHC_NO_CUDA
(to something like "1") or turn off the CMake optionBHC_ENABLE_CUDA
. - Build the project with CMake in the usual way for your platform. If you are not familiar with CMake, there are numerous tutorials online; the process is not project-specific, besides the options mentioned in this readme.
This project uses libcudacxx for creating a shared codebase for CPU and GPU. This open-source library is generally very good, but has occasional compatibility issues. Some errors you might see might mention some of the following items:
cuda::std::__4::hypot
_LInf
,_LNan
,_LSnan
__ATOMIC_RELEASE
,ATOMIC_INT_LOCK_FREE
__type_traits/disjunction.h
,error C2210: '_First': pack expansions cannot be used as arguments...
We have submitted an issue to NVIDIA
about some of these issues, and they are in the process of being fixed. We also
put workarounds in place in the bellhopcxx
/ bellhopcuda
codebase. However,
if you run into any of these problems, always first upgrade to the latest CUDA
version. If that does not solve the problem, let us know and also try the
following:
- Upgrade from Visual Studio 2017, to 2019 or 2022.
- Download the libcudacxx source manually and replace
include/cuda
andinclude/nv
in your CUDA installation with the appropriate directories from the Git repo.
- Copy the
[bellhopcuda repo]/include/bhc
directory to aninclude
directory in your parent project. Or, if a Git submodule, add the[bellhopcuda repo]/include
directory to your include paths. - Set your parent project's C++ version to C++17 or later.
- If you are on Windows and using the
bellhopcxxlib.dll
shared (dynamic) library version, defineBHC_DLL_IMPORT
before including the header. This cannot be done automatically because it must not be defined if you're using the static library version, and we can't know in advance which version you will use. #include <bhc/bhc.hpp>
. Follow the instructions in that file for API use.- Link to:
- Windows shared (dynamic) library:
bellhopcxxlib.dll
and import librarybellhopcxxlib.lib
- Windows static library:
bellhopcxxstatic.lib
- Linux shared library:
libbellhopcxxlib.so
- Linux static library:
libbellhopcxxstatic.a
- Windows shared (dynamic) library:
- You can set up the library without an initial environment file. All parameters are set to reasonable basic defaults. You can modify these up one at a time from your host program.
- The
extsetup
functions allow you to allocate arrays with the run parameters, for example to change the number of sources in Z to 100. After allocating the array with your chosen size, you can fill in the data directly. - The
writeenv
function allows you to save the current state of the run parameters to an environment file and other required input files (e.g. SSP, bathymetry). This does not have to be relative to the same FileRoot provided at setup time. - The
readout
function allows you to read results from a past run (ray file, TL / shade file, or arrivals) into memory, so your host program can display or manipulate these results.
If the bug is regarding results being wrong (or different from
BELLHOP
/BELLHOP3D
), please check and report to us whether the behavior is
the same in our version of BELLHOP
/BELLHOP3D
with improved numerical
stability and robustness. Also,
please provide an environment file which traces only the ray whose results are
incorrect (if applicable); if this is impossible, due to the bug disappearing
when a different set of rays are traced, let us know of this.
Submit a bug report on the GitHub issues page.
This section was last updated 2/2023.
The summary is:
- Speedups vary widely depending on many parameters of the run.
- Generally, the speedup in multithreaded mode roughly scales with the number of logical CPU cores, with a scaling factor roughly ranging from 0.5 to 1.2 (e.g. a 6-core, 12-thread CPU will often get a 6x-15x speedup).
- If you have a reasonably modern (but consumer-grade) GPU and the run uses a large number of rays (in the 10,000s or more), the speedup of the CUDA version will often be a few times above the CPU speedup (e.g. 20x-100x compared to the Fortran version). Conversely, if there are a small number of rays, the CUDA performance will be worse than the CPU performance.
Some factors affecting the performance are discussed below.
In both CPU multithreaded mode and CUDA mode, the processing is parallelized
over the rays. The number of rays needed to fully utilize each chip is a few
times the number of cores--but on the CPU this will be anything over a hundred
or so rays, whereas on the GPU you will need tens of thousands of rays as there
are many thousands of cores. However, experimentally, the speedup does not stop
once this number is reached--running hundreds of thousands or millions of rays
continues to bring additional, though diminishing, performance gain over
BELLHOP
/ BELLHOP3D
. Of course, whether more rays is useful or not is very
application-dependent.
bellhopcxx
/ bellhopcuda
can be built and used as a library, so input data
(e.g. SSP) does not have to be read from files, and output data does not have to
be written to disk. In some runs, the file I/O is trivial, whereas in others,
it takes 10x-100x the time compared to actually computing the results. Normally,
we count the file I/O time in our version in order to compare most fairly to
the original version, but if file I/O can be avoided in a particular
application using the library version, this may lead to a substantial gain in
effective overall performance in some cases.
A run with receivers concentrated in one area far away from the source will be faster than one where the receivers are spread uniformly over a large area where the rays traverse. In the latter, every step of every ray will influence a number of receivers, whereas in the former, whereas in the former most rays will not influence any receivers. This is especially true for eigenrays and arrivals runs: these run types will work best when the user is interested in finding a handful of rays which reach an area of interest out of a large number of rays initially traced.
Transmission loss runs typically bring the largest speedups over BELLHOP
/
BELLHOP3D
. Arrivals typically have lower speedups, depending on the receivers'
layout as discussed above.
For ray runs, the full rays must be written out to disk, serially one after another. For this reason, our CUDA version does not even compute the rays on the GPU at all: it is unlikely the user will want to run tens of thousands of rays to saturate the GPU, and if they did, writing them is likely to take more time than computing them. Nevertheless, the performance gains with multithreaded mode are still often noticeable.
For eigenrays runs, when a ray influences a receiver, a data structure similar
to an arrival is written to memory, but the full ray is not saved to memory (or
worse for the parallelism, to disk). This allows a large number of rays to be
searched for the few which hit target receivers, on either CPU multithreaded or
CUDA. Then, at the end, just those rays which hit receivers are re-traced in CPU
multithreaded mode (even if the CUDA version is running), and the resulting rays
are written. This can be much more efficient than BELLHOP
/ BELLHOP3D
when
there are few receivers, but it will be less efficient when there are receivers
everywhere and consequently extremely large numbers of eigenrays.
The speedup for 2D is typically slightly (maybe 10% on average) higher than for 3D. The speedup for Nx2D is similar to that for 3D.
By default, bellhopcxx
/ bellhopcuda
uses double precision (64-bit floats)
for almost all arithmetic. (It uses the same precision as the Fortran version
for each individual operation, which is almost universally double precision,
except for most writing of output data which is in single precision.) All the
performance and accuracy results presented here are for double precision mode.
However, our version can also be built to use single precision (32-bit floats) for all arithmetic. This generally works--the ray or TL plots generally look the same by eye--but as you might expect, this substantially increases the chances of numerical stability problems, and the details of the results generally do not match the double precision version. For example, in a particular run, 99% of the rays might follow nearly the same trajectory as with full precision, but 1% of the rays might diverge and go elsewhere.
Consumer NVIDIA GPUs only contain vestigial double-precision floating-point support, at 1/64th the throughput compared to single-precision. Only the server- class GPUs (in the tens of thousands of dollars) include full double-precision support. The fact that the CUDA version still can bring substantial speedups over the CPU is explained by the fact that most of the actual instructions being executed are integer or memory instructions, with only a portion being floating-point math. Nevertheless, the floating-point performance may be the bottleneck in some runs. For some applications, the single-precision version may be useful for obtaining fast, approximate initial results on a consumer GPU.
The physics model in the original BELLHOP
/ BELLHOP3D
has a number of
properties which make it sensitive to numerical details. For example, moving a
source by 1 cm in a basic test case causes up to about 3 dB differences
throughout the field. As a more extreme example, we created a modified version
of the original BELLHOP
code which adds random perturbations less than 100
microns to the position (not even direction!) of each ray after each step.
(Steps are typically on the order of 1 km, so this is a relative error on the
order of 10^-7.) This was chosen because on the one hand, the real-world
uncertainty in the ocean is at a far larger scale, and on the other hand, these
perturbations wreak havoc on BELLHOP
's edge case handling (see below). These
tiny perturbations result in up to 40 dB differences in the field, which are
visible when comparing the transmission loss plots by eye.
Of course, BELLHOP
has been providing useful results to researchers for four
decades now, despite these limitations of its model. We mention this for the
sake of context for the discussion below. Usually in software development, any
mismatch between the program's output and the reference output means the test
has failed and the program has a bug somewhere. In this project, such a mismatch
could be caused by something as subtle as the Fortran compiler emitting a fused
multiply-add instruction where the C++ compiler emitted separate multiply and
add instructions. (In fact, in one case, we confirmed this was the source of a
discrepancy from examining the disassembly, and explicitly added the FMA to the
C++ version.)
But not only is it not realistic to try to reproduce the exact set of floating- point operations in a different language and compiler to get exactly matching results--we want the set of operations to be different in some cases. For example, early on in the project, we found a combination of bugs and edge case conditions--documented in the readme of our modified Fortran version--which made it impossible to parallelize the computation if we wanted to reproduce the original results (with the bugs). Instead of abandoning the parallelization--which was one of the main goals of the project--we decided to fix (many of) the bugs and improve the numerical stability.
Since then, we have been comparing the results of bellhopcxx
/ bellhopcuda
to our modified Fortran version, not to the original BELLHOP
/ BELLHOP3D
.
When we find a discrepancy, we find the failing ray, print out its state as it
travels, find the step where it diverges, and isolate the part of the code
responsible. Of course, if it is a bug in our new code, we fix that. But when it
is a numerical stability / robustness issue, we implement a change in both
versions which attempts to make that part of the code more reproducible, and
hopefully then the results start matching again.
We have made a variety of these changes, documented
here, but the most significant set
of changes was to the handling of boundaries. Boundaries are locations in the
ocean, such as a change in SSP or a change in the bathymetry slope, where the
ray may have to recompute its direction to curve. As such, the ray steps forward
until it hits the next boundary in the direction it is traveling, and then
adjusts its direction. Thus, most steps place a ray directly on a boundary--in
other words, almost every step of every ray hits an edge case. Due to
floating-point imprecision, this may actually be slightly in front of or behind
the boundary. This affects the subsequent trajectory in a way which is usually
very small, but can be amplified to large changes later. The original BELLHOP
/ BELLHOP3D
made no attempt to handle stepping to boundaries consistently; our
version handles most boundaries in a fully consistent manner and the remaining
couple in a manner which is perhaps about 99.99% consistent. (Of course, that
means the remaining ~0.01% of rays diverge slightly, potentially causing
results to not match.)
So all of the results discussed here are as compared to our modified Fortran
version, not the original BELLHOP
/ BELLHOP3D
. And when we report results that do not match, these are mostly
either cases where the best methods we were able to come up with still
occasionally diverge, or cases we have not studied in detail. Because of how
many cases do match, it's unlikely that these are just the result of bugs in our
version, though it is possible that there are bugs on codepaths which have been
rarely or never tested such as some of the more obscure attenuation options.
One final note: We generally compare values, whether ray positions or complex field values, with a combination of absolute and relative error thresholds. There is also logic for things such as reordering and getting rid of duplicate steps in rays, and combining arrivals. However, the criteria are always much more strict than what would be noticed from simply looking at MATLAB plots of the outputs.
A set of scripts are provided which generate test environment files (along with
SSP files, bathymetry, etc. as appropriate) for every combination of run type,
influence type, SSP, and dimensionality (about 1500 tests). These are primarily
for code coverage, rather than testing the correctness of the physics, and are
short to be able to run relatively quickly. The scripts also test to make sure
that any combinations of these settings which are not supported by BELLHOP
/
BELLHOP3D
are rejected by our version, assuming the BHC_LIMIT_FEATURES
build
option was enabled.
The current version of bellhopcxx
/ bellhopcuda
matches on all the coverage
tests.
Dr. Porter provided a set of test cases with the Acoustics Toolbox. These have
been sorted by dimensionality and run type. The specific runs comprising this
data are in the text files included in the root of this repo, e.g.
ray_3d_pass.txt
, tl_long.txt
, etc. This table is just counts of the entries
in those files.
The "Both fail" column is for tests which fail to run in BELLHOP
/ BELLHOP3D
at all, either because they may be for one of the other Acoustics Toolbox
simulators or because the environment file syntax has changed since they were
written. Our version produces the same behavior as the original version on all
of these test cases (though often with more descriptive error messages), so this
"Both fail" case is actually a "pass" for our version.
Match | Do not match | Both fail | |
---|---|---|---|
2D ray | 12 | 0 | 8 |
2D TL | 48 | [1] | 0 |
2D eigen | 1 | 0 | 0 |
2D arrivals | 6 | 1 | 0 |
3D ray | 12 | 0 | 2 |
3D TL | 42 | 8 | 12 |
3D eigen | 0 | 0 | [2] |
3D arrivals | 0 | 0 | 0 |
Nx2D ray | 4 | 0 | 0 |
Nx2D TL | 13 | 5 | 6 |
Nx2D eigen | 0 | 0 | 0 |
Nx2D arrivals | 0 | 0 | 0 |
[1]: There are two runs which consistently match in single-threaded mode, but
sometimes do and sometimes don't match in multithreaded or CUDA mode.
[2]: There is only one environment file; it runs out of memory after over 3
hours, so this has not been investigated further.
Note that some of the TL tests take several hours (one about 12 hours), and we do not guarantee we will re-run those tests after every change, so some of the numbers here may be slightly out of date.
Unattributed comments in all translated code are copied directly from the
original BELLHOP
/ BELLHOP3D
and/or Acoustics Toolbox code, mostly by Dr.
Michael B. Porter. Unattributed comments in new code are by the Marine Physical
Lab team, mostly Louis Pisha. It should usually be easy to distinguish the
comment author from the style.