Improve timestep performance further: numba or not? #57

Open
3 tasks
xaviernogueira opened this issue Nov 29, 2023 · 8 comments · Fixed by #60
@xaviernogueira (Contributor) commented Nov 29, 2023

@sjordan29 found TSM to be too slow: it accounted for nearly 80% of run time, versus 20% for ClearWater-riverine, when running over a grid of flow fields from HEC-RAS-2D.

Ideas to test:

  • Add a way to skip writing intermediate arrays to xarray if we are not keeping them. Writing to xarray is a slow point.
  • Pre-initializing an array of np.nan values and writing into it may be faster than overwriting existing values (the current behavior)? It's worth checking, especially if we can pre-initialize many time steps.
  • @aufdenkampe had the idea of writing to slices of a pre-initialized xarray with a time dimension. The way I understand it, this could be implemented first via an optional argument to Model.__init__, with the timestep to write onto passed as an argument into the run-timestep function (a rough sketch follows this list). This may also increase performance.
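A minimal sketch of that idea, assuming a hypothetical preallocate_timesteps argument and increment_timestep(t_index, ...) signature (not the existing clearwater_modules API):

```python
import numpy as np
import xarray as xr


class Model:
    """Sketch only; the real clearwater_modules Model class differs."""

    def __init__(self, initial_state: xr.Dataset, preallocate_timesteps: int | None = None):
        # Hypothetical optional argument: build the full time dimension up front,
        # filled with NaN, so later timesteps only write into an existing slice.
        state = initial_state.expand_dims(time=[0])
        if preallocate_timesteps:
            empty = xr.full_like(
                initial_state.expand_dims(time=np.arange(1, preallocate_timesteps + 1)),
                np.nan,
            )
            state = xr.concat([state, empty], dim="time")
        self.dataset = state

    def increment_timestep(self, t_index: int, new_values: xr.Dataset) -> None:
        # Write each variable into its pre-allocated time slice instead of
        # copying the last slice and concatenating a new one.
        for name, da in new_values.data_vars.items():
            self.dataset[name][dict(time=t_index)] = da.values
```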
@xaviernogueira xaviernogueira added enhancement New feature or request refactor-core For core functionality, not model specifc labels Nov 29, 2023
@xaviernogueira xaviernogueira self-assigned this Nov 29, 2023
@aufdenkampe (Member)

@sjordan29 did some performance profiling for a test case of ClearWater-modules TSM coupled with ClearWater-riverine transport. See https://github.com/EcohydrologyTeam/ClearWater-riverine/blob/43-integrate-tsm/examples/dev_sandbox/43_Realistic_TSM_Scenario.ipynb.

She shared her cProfile results (profile.prof), which I explored in SnakeViz. It did a great job of identifying which functions in our code were the bottlenecks. From what I see, the top 3 were:

  • clearwater_modules\base.py:332(increment_timestep)
  • clearwater_modules\utils.py:53(iter_computations)
  • clearwater_riverine\transport.py:159(update)

Below that, the time is spent primarily in xarray merge, copy, and init (create) functions such as:

  • xarray\core\merge.py:196(merge_collected)
  • xarray\core\variable.py:230(as_compatible_data)
  • xarray\core\variable.py:337(init)
  • xarray\core\variable.py:914(_copy)
  • many more

Some of these take so much time simply because they are called so many times.

So, the key progress will be made by reducing our use of xarray functions that call these merge/copy/init functions.

@xaviernogueira, the profiler that Sarah shared with us yesterday was https://github.com/pyutils/line_profiler, which can tell you what is happening line by line within a function. So it complements the cProfile/SnakeViz perspectives that you have used. Please do some line profiling of these functions (a minimal usage sketch follows the list):

  • clearwater_modules\base.py:332(increment_timestep)
  • clearwater_modules\utils.py:53(iter_computations)
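A minimal line_profiler usage sketch, assuming the hot functions can be registered programmatically; the attribute paths, the run_many_timesteps driver, and the model object are placeholders, not the actual clearwater_modules API:

```python
from line_profiler import LineProfiler

import clearwater_modules.base as base    # module paths from the cProfile output above
import clearwater_modules.utils as utils


def run_many_timesteps(model, n_timesteps: int = 100):
    # Placeholder driver: step the model however the test notebook currently does.
    for _ in range(n_timesteps):
        model.increment_timestep()


lp = LineProfiler()
lp.add_function(base.Model.increment_timestep)  # placeholder attribute path
lp.add_function(utils.iter_computations)

profiled = lp(run_many_timesteps)
profiled(model)      # `model` assumed to be set up as in the test notebook
lp.print_stats()     # per-line timings for the registered functions
```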

I also want to go over the approach of taking slices of a pre-existing array for reading/writing values. This is a combination of your bottom two checkboxes above, which should be combined into a single solution. Let's discuss this.

@xaviernogueira (Contributor, issue author) commented Dec 14, 2023

Numba vs no-numba

Numba

I noticed that (as one might expect given how JIT works) running one timestep is much slower per timestep than running many. This is because numba's Just-In-Time compilation only runs the first time through a loop (i.e., the first full pass of the TSM calculations).

# speed of just 1 timestep
Increment timestep speed (average of 1): 6.521975040435791
Run time: 6.548974990844727

# speed of 100 timesteps
Increment timestep speed (average of 100): 0.09739033937454224 (67x faster than running 1)
Run time: 9.782024145126343

# speed of 1000 timesteps
Increment timestep speed (average of 1000): 0.030201313734054564 (215x faster than running 1, 3x faster than running 100)
Run time: 30.230324506759644

# speed of 10,000 timesteps
Increment timestep speed (average of 10000): 0.026183737969398498
Run time: 261.8664631843567

Notice how the majority of the run time is the first timestep / JIT compilation! The more timesteps are run, the smaller the share of total run time the initial compilation becomes.
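As a generic illustration of that compile-on-first-call behavior (a toy function, not the TSM code):

```python
import time
import numpy as np
from numba import njit


@njit
def heat_flux(water_temp_c: np.ndarray, air_temp_c: np.ndarray) -> np.ndarray:
    # Simple elementwise math, a stand-in for a TSM process function.
    return 0.5 * (air_temp_c - water_temp_c)


t = np.full(10_000, 20.0)
a = np.full(10_000, 25.0)

start = time.perf_counter()
heat_flux(t, a)                                   # first call: includes JIT compilation
print("first call:", time.perf_counter() - start)

start = time.perf_counter()
heat_flux(t, a)                                   # later calls: compiled code only
print("second call:", time.perf_counter() - start)
```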

No-Numba

Perhaps because of the number of process functions that get JIT compiled in the first loop, the compilation stage slows things down significantly!
Here are the same tests without numba (the test_no_numba branch).

# speed of just 1 timestep
Increment timestep speed (average of 1): 0.6430933475494385
Run time: 0.6760678291320801

# speed of 100 timesteps
Increment timestep speed (average of 100): 0.027670950889587403 (23x faster than running 1)
Run time: 2.797003746032715

# speed of 1000 timesteps
Increment timestep speed (average of 1000): 0.022910058975219725 (28x faster than running 1, 1.2x faster than running 100)
Run time: 22.939059734344482

# speed of 10,000 timesteps
Increment timestep speed (average of 10000): 0.02401784040927887
Run time: 240.20740365982056

Note that the time per timestep flattens out at around 0.024 seconds. Therefore runs with many timesteps complete in roughly linear time, approximately first_timestep_time + (total_timesteps - 1) * 0.024 seconds on my machine.

Discussion

  • So the question is: do we use numba-wrapped functions or not? The ideal solution involves getting a bit fancy and wrapping functions dynamically in response to a user setting (i.e., use_numba: bool); a possible pattern is sketched after this list.
  • In the immediate future, however, which direction do you think makes more sense, @aufdenkampe? It depends on how many timesteps are typical to run.
  • For TSM at least, it is unclear if JIT ever becomes worth it. The time per timestep never dropped below the non-JIT version. This is likely because the functions we have JIT compiled are rather simple math on vectors, without explicit loops (which is where JIT shines). Therefore the cost of dispatching to the cached compiled code and running it does not yield much of a time saving over plain numpy.
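One possible pattern for that dynamic wrapping, assuming a hypothetical environment-variable setting (the setting name and example function are illustrative, not existing code):

```python
import os
from numba import njit

# Hypothetical user setting; could equally be a keyword argument to Model.__init__.
USE_NUMBA = os.environ.get("CLEARWATER_USE_NUMBA", "0") == "1"


def maybe_jit(func):
    """Wrap a process function with numba's @njit only if the user opted in."""
    return njit(func) if USE_NUMBA else func


@maybe_jit
def mixing_ratio(vapor_pressure_mb: float, pressure_mb: float) -> float:
    # Stand-in for one of the simple vectorizable process functions.
    return 0.622 * vapor_pressure_mb / (pressure_mb - vapor_pressure_mb)
```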

@PeterKlaver

Typical models have 10,000s to 100,000s of time steps.

@xaviernogueira (Contributor, issue author)

@PeterKlaver got it, I just added a 10k-timestep test.

@PeterKlaver

It's not unusual to do water quality simulations over a full recreation season, say, to evaluate compliance with recreational use criteria. That's seven months - if the model time step is 30 seconds, the total N is ~ 600,000. If the time step is 10 seconds, N ~ 1,800,000. You can do the math. But if you see linear scaling up to around 10K it is probably okay to extrapolate from there.
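For reference, the arithmetic behind those counts (assuming roughly 30-day months):

```latex
7 \times 30 \times 86{,}400\ \mathrm{s} \approx 1.81 \times 10^{7}\ \mathrm{s},
\qquad
N_{30\,\mathrm{s}} \approx \frac{1.81 \times 10^{7}}{30} \approx 6.0 \times 10^{5},
\qquad
N_{10\,\mathrm{s}} \approx 1.8 \times 10^{6}
```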

@xaviernogueira (Contributor, issue author) commented Dec 15, 2023

Strategy of writing to a pre-initialized set of xarray time slices

  • This was proposed by @aufdenkampe earlier as a potential avenue for performance increases.
  • While I have not had time to implement a full version into the existing code for a true apples-to-apples comparison (more on that later), I did make simplified xarray workflows in the compute_engine_testing.ipynb notebook to gauge whether the approach is worth developing.
  • The TL;DR: computation becomes 4-5x faster; however, implementation may involve some tradeoffs, especially when the user hasn't decided in advance how many timesteps will be executed.

Performance Comparison

  • I used the tutorial air_temperature Dataset from xarray, which is a (lat: 25, lon: 53, time: 50) grid.
  • The test adds 50 extra time steps in which a mock water_temp_c (renamed air temperature) is set to 10 more than its value at the previous timestep.

Test 1:

I implemented the current algorithm that is in clearwater_modules: copy the last timestep (making a Dataset with a length-1 time dimension), update its values, then concatenate it onto the end of our existing Dataset (sketched below).
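Roughly, assuming the 6-hourly spacing of the tutorial dataset (a sketch of the notebook experiment, not the actual clearwater_modules code):

```python
import numpy as np
import xarray as xr

ds = xr.tutorial.load_dataset("air_temperature").isel(time=slice(0, 50))
ds = ds.rename({"air": "water_temp_c"})

# Test 1: copy the last time slice, bump its values, then concatenate it on.
for _ in range(50):
    last = ds.isel(time=[-1]).copy(deep=True)
    last["water_temp_c"] = last["water_temp_c"] + 10
    last = last.assign_coords(time=last.time + np.timedelta64(6, "h"))
    ds = xr.concat([ds, last], dim="time")
```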

Time: 364ms

Test 2:

Here we make another (lat: 25, lon: 53, time: 50) dataset whose time coordinate runs from one step after the last time of the existing dataset out to 50 steps beyond it. All values are initialized to np.nan. We then concat it to our existing dataset and iterate over the new indices, writing into each time slice.

Creating the new Dataset and concatenating it takes about 9 ms (a one-time fixed cost, since it runs only once). The per-timestep write loop is the part being timed; a rough sketch of the pattern follows.
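A sketch of that pre-allocate-then-write experiment (variable names are illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.tutorial.load_dataset("air_temperature").isel(time=slice(0, 50))
ds = ds.rename({"air": "water_temp_c"})

# One-time cost: pre-allocate 50 NaN time slices after the existing data.
step = (ds.time[-1] - ds.time[-2]).values
new_times = ds.time[-1].values + step * np.arange(1, 51)
empty = xr.full_like(ds, np.nan).assign_coords(time=new_times)
ds = xr.concat([ds, empty], dim="time")

# Test 2: write into each pre-allocated slice instead of concatenating.
for i in range(50, 100):
    prev = ds["water_temp_c"].isel(time=i - 1).values
    ds["water_temp_c"][dict(time=i)] = prev + 10
```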

Considerations:

  • An array the size of your output needs to be initialized upon startup, which can potentially hog memory for a very long model run, or make workflows where the data is saved to a file at regular intervals more difficult to implement.
  • Not a massive issue, but if the user doesn't know a priori how many timesteps they will run, the code will have to make a "guess" and add that many time slices filled with np.nan. This can make the dataset look odd in between timesteps and can make debugging more difficult. Additionally, if one runs more timesteps than the guess, there needs to be logic to add another "buffer" of empty time slices to write to.
  • One option around the above issue is to simply require the user to set the number of model timesteps up front, but is this desired?
  • That said, 4-5x faster is hard to argue against. Implementation shouldn't be too bad.

If the desired behavior is well constrained, I am happy to implement it, potentially past my office time in one shot. This can be discussed in our meeting this Friday (12/15/2023).

@aufdenkampe aufdenkampe changed the title Improve timestep performance further Improve timestep performance further: numba or not? Feb 26, 2024
@aufdenkampe (Member) commented Feb 26, 2024

From @xaviernogueira's numba vs no-numba tests described in #57 (comment), he found that:

  • Removing numba (his test_no_numba branch) was always faster than using numba!

This surprising result is probably due to the built-in performance of the "vectorized" calculations enabled by numpy, xarray, and dask-array.

The difference is negligible after about 10,000 time steps, so this doesn't deserve a huge effort to refactor away from numba.

That said, this does provide evidence that we should slowly start moving away from numba, given that it also substantially increases code complexity versus vectorized array calculations.

NOTE: There is still a case for using numba when a "vectorized" approach isn't available, but fortunately those are likely to be rare due to the extensive development of numpy and xarray.
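A generic illustration of that note (not TSM code): numba pays off for explicit, order-dependent Python loops that can't be expressed as array operations, while simple elementwise math is already fast as a vectorized numpy expression.

```python
import numpy as np
from numba import njit


@njit
def cumulative_decay(conc: np.ndarray, k: float) -> np.ndarray:
    # Each value depends on the previous one, so this loop is hard to vectorize;
    # this is the kind of function where numba still earns its keep.
    out = np.empty_like(conc)
    out[0] = conc[0]
    for i in range(1, conc.size):
        out[i] = out[i - 1] * (1.0 - k) + conc[i]
    return out


def heat_exchange(water_temp_c: np.ndarray, air_temp_c: np.ndarray) -> np.ndarray:
    # Purely elementwise math: numpy already runs this in compiled C, so wrapping
    # it in numba mostly adds dispatch overhead for little gain.
    return 0.5 * (air_temp_c - water_temp_c)
```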
